# Introduction
This notebook vectorizes the text data.

As mentioned in [simpletransformers](https://simpletransformers.ai/docs/text-rep-model/):
> "The RepresentationModel class is used for generating (contextual) word or sentence embeddings from a list of text sentences, You can then feed these vectors to any model or downstream task."

After vectorization, the resulting vectors are used for data augmentation and other task phases in other notebooks.

In [None]:
!pip install simpletransformers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from simpletransformers.language_representation import RepresentationModel
from google.colab import drive
drive.mount('/content/drive')

In [None]:
train_data = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/data/stweet/train.csv")
test_data = pd.read_csv('/content/drive/MyDrive/TUBITAK_TASK1/data/stweet/test.csv')

In [None]:
print(train_data.shape)
print(test_data.shape)

(1046343, 2)
(478955, 2)


In [None]:
# Both the train and test sets must be vectorized by BERT
train = train_data["text"]
test = test_data["text"]

In [None]:
print(train.shape)
print(test.shape)

(1046343,)
(478955,)


_____
I did not vectorize every sentence because of limited space and time. Vectorizing a million sentences takes a long time and needs 10 GB+ of disk space, so we reduce the size of the dataset before vectorizing. In this notebook, 150k sentences was the largest size that fit in Colab's RAM. You can try larger sizes if you have the hardware.

We vectorize the train and test sets separately, and "bert-base-uncased" is used in most of our cases.

I use the name encoded_text for the vectorized data in the rest of the study; think of it simply as the text turned into a numerical vector of length 768 by BERT.
_____
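
The storage figures above can be sanity-checked with a little arithmetic. The sketch below is only a rough estimate: it assumes the vectors are held as float32 in memory and that each value takes about 10 characters once written as CSV text. Both numbers are assumptions for illustration, not measurements from this notebook.

```python
# Rough size estimate for BERT-base sentence vectors (768 values each).
# Assumptions (not measured in this notebook): float32 in memory and
# about 10 characters per value, separator included, in a CSV file.
HIDDEN_SIZE = 768
BYTES_PER_FLOAT32 = 4
ASSUMED_CSV_CHARS_PER_VALUE = 10  # hypothetical average


def estimate_gb(n_sentences):
    """Return (in-memory GB, approximate CSV GB) for n_sentences vectors."""
    n_values = n_sentences * HIDDEN_SIZE
    ram_gb = n_values * BYTES_PER_FLOAT32 / 1e9
    csv_gb = n_values * ASSUMED_CSV_CHARS_PER_VALUE / 1e9
    return ram_gb, csv_gb


for n in (150_000, 1_046_343):
    ram_gb, csv_gb = estimate_gb(n)
    print(f"{n:>9} sentences: ~{ram_gb:.2f} GB as float32, ~{csv_gb:.1f} GB as CSV")
```

Under these assumptions the full train set lands in the multi-gigabyte range on disk, which is in line with the 10 GB+ figure mentioned above; the real size depends on how many digits pandas writes per value.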

In [None]:
# Use this block if new BERT vectors are needed.
# Without use_cuda=True (a GPU runtime) vectorizing is nearly impossible.

# TRAIN SET

sentences = train[:150000]
model = RepresentationModel(model_type="bert", model_name='bert-base-uncased', use_cuda=True)
word_vectors_train = model.encode_sentences(sentences, combine_strategy="mean")
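
If more than 150k sentences are needed, one possible workaround is to encode the text in chunks and write each chunk to disk before moving on, so that only one chunk of vectors is held at a time. The helper below is a hypothetical sketch, not part of simpletransformers: `iter_encoded_chunks` and `fake_encode` are names invented here, and `fake_encode` only stands in for the real encoder so the example runs without a GPU.

```python
import numpy as np


def iter_encoded_chunks(texts, encode_fn, chunk_size=10_000):
    """Yield (start_index, vectors) for consecutive chunks of `texts`.

    `encode_fn` is any callable mapping a list of strings to a 2-D array
    with one row per string.
    """
    texts = list(texts)
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        yield start, np.asarray(encode_fn(chunk))


def fake_encode(sentences):
    # Stand-in encoder for the demo: one 4-dimensional vector per sentence.
    return np.ones((len(sentences), 4), dtype=np.float32)


for start, vectors in iter_encoded_chunks(["some text"] * 25, fake_encode, chunk_size=10):
    print(start, vectors.shape)

# With the real model, encode_fn could be (assumed usage):
#   lambda s: model.encode_sentences(s, combine_strategy="mean")
# and each chunk could be saved with np.save before the next one is encoded.
```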

In [None]:
word_vectors_train.shape

In [None]:
# Save the vectors to Drive, because encoding takes hours

df_train = pd.DataFrame(word_vectors_train)
df_train.to_csv('/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/stweet/encoded_stweet150000_train.csv')

In [None]:
# Use this block if new BERT vectors are needed.
# TEST SET

sentences = test[:20000]
model = RepresentationModel(model_type="bert", model_name='bert-base-uncased', use_cuda=True)
word_vectors_test = model.encode_sentences(sentences, combine_strategy="mean")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTextRepresentation: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTextRepresentation from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTextRepresentation from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Save the vectors to Drive, because encoding takes hours

df_test = pd.DataFrame(word_vectors_test)
df_test.to_csv('/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/stweet/encoded_stweet20000_test.csv')
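
Because `to_csv` is called with its defaults, the saved files also contain the DataFrame index as their first column. A later notebook that reloads the vectors probably wants to drop that column again. The sketch below shows one way to do this on a small random stand-in array; `load_encoded` and the file name are hypothetical and only illustrate the round trip.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_encoded(path):
    """Read vectors written by DataFrame.to_csv with its default index column."""
    return pd.read_csv(path, index_col=0).to_numpy()


# Round-trip demo on a small random stand-in for the real BERT vectors.
demo = np.random.rand(5, 768).astype(np.float32)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "encoded_demo.csv")  # hypothetical file name
    pd.DataFrame(demo).to_csv(demo_path)
    restored = load_encoded(demo_path)

print(restored.shape)
```

Without `index_col=0` the reloaded array would have 769 columns, the first one being the row number rather than an embedding dimension.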

In [None]:
print("X_train shape:", word_vectors_train.shape)
print("X_test shape:", word_vectors_test.shape)

X_train shape: (150000, 768)
X_test shape: (20000, 768)
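
The second dimension of 768 is the hidden size of bert-base-uncased. Conceptually, `combine_strategy="mean"` averages the per-token vectors of a sentence into a single vector of that size. The NumPy sketch below illustrates only this idea on random data; it does not reproduce the library's internals (for example, how padding tokens are treated), and the array sizes are made up for the example.

```python
import numpy as np

# Hypothetical token-level output: 2 sentences, 6 tokens each, hidden size 768.
rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(2, 6, 768))

# Averaging over the token axis leaves one 768-dimensional vector per sentence.
sentence_vectors = token_vectors.mean(axis=1)

print(sentence_vectors.shape)  # (2, 768)
```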
