# Section 0: Import section

In [1]:
# Import section

# Helper functions and classes
from helper_func_and_classes import create_dataset_list, create_submission_file, create_vocab
from helper_func_and_classes import split_dataset
from helper_func_and_classes import word_vec_to_word_embeddings
from helper_func_and_classes import word_vec_to_aggregated_word_embeddings
from helper_func_and_classes import word_embeddings_extraction
from helper_func_and_classes import TwitterDataset
from helper_func_and_classes import get_count_of_longest_sentence

# word embeddings
from torchtext.vocab import GloVe

# scikit-learn
from sklearn.preprocessing import scale
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


# plotting 
import matplotlib.pyplot as plt

# random state
RANDOM_SEED = 123 # used in helper functions

# Section 1: Data preprocessing section
This section extracts the data from the three different .txt files. The helper functions then process the tweets to create our vocabulary and three Python lists for positive, negative, and submission tweets. The tweets are pre-tokenized, so we only split on whitespace.

In [2]:
# creating the text vocabulary from the whole data set, including positive, negative and test data.
text_vocab_full = create_vocab("./twitter-datasets/train_pos_full.txt",
                          "./twitter-datasets/train_neg_full.txt",
                          "./twitter-datasets/test_data.txt")

text_vocab_lite = create_vocab("./twitter-datasets/train_pos.txt",
                          "./twitter-datasets/train_neg.txt",
                          "./twitter-datasets/test_data.txt")

# creating a plain Python list of the tweets that will be used for submission, one tweet per index
submission_data = create_dataset_list("./twitter-datasets/test_data.txt")

print("Length of text_vocab_full: ", len(text_vocab_full))
print("Length of text_vocab_lite: ", len(text_vocab_lite))
print("Length of submission_data: ",len(submission_data))

100%|████████████████████████████████████████████████████| 1250000/1250000 [00:01<00:00, 800910.95it/s]
100%|████████████████████████████████████████████████████| 1250000/1250000 [00:01<00:00, 695402.22it/s]
100%|████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 666450.15it/s]
100%|██████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 880505.17it/s]
100%|██████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 817314.75it/s]
100%|████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 868655.69it/s]


Length of text_vocab_full:  604014
Length of text_vocab_lite:  127802
Length of submission_data:  10000
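
The whitespace splitting and vocabulary construction described above can be sketched as follows. This is a minimal illustration, not the actual `create_vocab` helper, whose implementation is not shown in this notebook. The names `tokenize` and `build_vocab` are hypothetical, and returning a token-to-count dictionary is an assumption (the notebook only shows that the vocabulary supports `.keys()` and `len()`).

```python
from collections import Counter
from typing import Dict, Iterable, List


def tokenize(tweet: str) -> List[str]:
    """Split a pre-tokenized tweet on whitespace."""
    return tweet.split()


def build_vocab(tweets: Iterable[str]) -> Dict[str, int]:
    """Map each distinct token to its frequency (hypothetical stand-in for create_vocab)."""
    counts: Counter = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return dict(counts)


sample_tweets = ["<user> i love this song", "i miss you <user> !"]
vocab = build_vocab(sample_tweets)
print("Length of vocab:", len(vocab))
```

Because the tweets are already tokenized, `str.split()` with no argument is enough: it also collapses repeated spaces and drops the trailing newline.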


Here we create the `TwitterDataset` objects, which contain all the pre-processed data. We do not apply a length cutoff: tweets can have any length, because their word vectors are aggregated over the same dimensions.

Furthermore, we shuffle the dataset and split it into training and test sets. The length of the longest sentence is found by calling `get_count_of_longest_sentence` on the dataset.

In [3]:
# Creating the dataset in the TwitterDataset class which is found in the helper functions
dataset_full = TwitterDataset(text_vocab_full,
                              create_dataset_list("./twitter-datasets/train_pos_full.txt"),
                              create_dataset_list("./twitter-datasets/train_neg_full.txt"),
                              submission_data,
                              1000)

dataset_lite = TwitterDataset(text_vocab_lite, 
                              create_dataset_list("./twitter-datasets/train_pos.txt"),
                              create_dataset_list("./twitter-datasets/train_neg.txt"),
                              submission_data,
                              1000)

# create training dataset and test dataset - using a split of 90% / 10%
train_dataset_full, test_dataset_full = split_dataset(dataset_full, 0.9)
train_dataset_lite, test_dataset_lite = split_dataset(dataset_lite, 0.9)

# calculating the longest sentence
max_len_full = get_count_of_longest_sentence(dataset_full)
max_len_lite = get_count_of_longest_sentence(dataset_lite)

100%|████████████████████████████████████████████████████| 1250000/1250000 [00:08<00:00, 152301.73it/s]
100%|████████████████████████████████████████████████████| 1250000/1250000 [00:10<00:00, 118618.90it/s]
100%|██████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 225059.53it/s]
100%|██████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 184720.32it/s]


Number of elements in train_data is:  2250000
Number of elements in test_data is:  250000
Number of elements in train_data is:  180000
Number of elements in test_data is:  20000
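
A reproducible shuffle-and-split, together with the longest-sentence count, might look like the sketch below. The functions `shuffle_and_split` and `longest_sentence` are hypothetical stand-ins; the real `split_dataset` and `get_count_of_longest_sentence` helpers operate on `TwitterDataset` objects and are not shown here. Using the seed for the shuffle is an assumption based on the "used in helper functions" comment next to `RANDOM_SEED`.

```python
import random
from typing import List, Sequence, Tuple

RANDOM_SEED = 123  # same seed value as the notebook


def shuffle_and_split(items: Sequence, train_fraction: float,
                      seed: int = RANDOM_SEED) -> Tuple[List, List]:
    """Shuffle a copy of items reproducibly, then cut it at train_fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def longest_sentence(tweets: Sequence[str]) -> int:
    """Return the token count of the longest whitespace-split tweet."""
    return max(len(tweet.split()) for tweet in tweets)


tweets = [f"tweet number {i}" for i in range(19)]
tweets.append("a much longer tweet with seven tokens")

train, test = shuffle_and_split(tweets, 0.9)
print("train:", len(train), "test:", len(test))
print("longest sentence:", longest_sentence(tweets))
```

Seeding a local `random.Random` instance instead of the global generator keeps the split stable across runs without affecting other code that draws random numbers.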


# Section 2: Logistic regression
#### Pretrained word embeddings (dim=200) 
Here we use pre-trained Global Vectors (GloVe) word embeddings with a dimension of 200 per word. The pre-trained embeddings are loaded through the torchtext `GloVe` class, documented at https://pytorch.org/text/stable/_modules/torchtext/vocab/vectors.html#GloVe. The embeddings were trained on Twitter data, which we expect to help accuracy since our corpus also consists of tweets.

In [4]:
glove_vec = GloVe(name='twitter.27B', dim=200)

In [5]:
word_embeddings_lite = glove_vec.get_vecs_by_tokens(list(text_vocab_lite.keys()), lower_case_backup=True)
word_embeddings_full = glove_vec.get_vecs_by_tokens(list(text_vocab_full.keys()), lower_case_backup=True)
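
To illustrate why no length cutoff is needed, the sketch below aggregates the word vectors of a tweet into one fixed-size vector. It assumes the aggregation is a mean over tokens; the actual `word_vec_to_aggregated_word_embeddings` helper is not shown in this notebook and may aggregate differently. The tiny lookup table and the name `aggregate_tweet` are hypothetical.

```python
import numpy as np

# Hypothetical lookup table: one row per vocabulary entry.
# The notebook uses 200 dimensions; 4 keeps the example readable.
vocab_index = {"i": 0, "love": 1, "this": 2}
embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
])


def aggregate_tweet(tokens, vocab_index, embeddings):
    """Average the vectors of known tokens; return zeros if none are known."""
    rows = [vocab_index[t] for t in tokens if t in vocab_index]
    if not rows:
        return np.zeros(embeddings.shape[1])
    return embeddings[rows].mean(axis=0)


short_vec = aggregate_tweet("i love".split(), vocab_index, embeddings)
long_vec = aggregate_tweet("i love this this this".split(), vocab_index, embeddings)
print(short_vec.shape, long_vec.shape)
```

Both tweets map to a vector of the same dimension regardless of their token count, so the resulting feature matrix has one row per tweet and one column per embedding dimension.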

In [6]:
# full version of dataset used for training the model
matrix_train_full, labels_train_full = word_vec_to_aggregated_word_embeddings(train_dataset_full, word_embeddings_full, 200)
scaled_matrix_train_full = scale(matrix_train_full)

matrix_test_full, labels_test_full = word_vec_to_aggregated_word_embeddings(test_dataset_full, word_embeddings_full, 200)
scaled_matrix_test_full = scale(matrix_test_full)

# small version of dataset used for param optimizing
matrix_train_lite, labels_train_lite = word_vec_to_aggregated_word_embeddings(train_dataset_lite, word_embeddings_lite, 200)
scaled_matrix_train_lite = scale(matrix_train_lite)

matrix_test_lite, labels_test_lite = word_vec_to_aggregated_word_embeddings(test_dataset_lite, word_embeddings_lite, 200)
scaled_matrix_test_lite = scale(matrix_test_lite)

2250000it [01:54, 19624.10it/s]
250000it [00:15, 15919.97it/s]
180000it [00:10, 17919.57it/s]
20000it [00:00, 20660.25it/s]
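
The cell above calls `scale` on each matrix separately, so the training and test features are standardized with their own means and variances. A common alternative, sketched here on synthetic data rather than the notebook's matrices, is to fit a `StandardScaler` on the training matrix and reuse its statistics for the test matrix. The array names below are placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the aggregated embedding matrices.
rng = np.random.default_rng(123)
train_matrix = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
test_matrix = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Learn the mean and variance from the training data only.
scaler = StandardScaler().fit(train_matrix)

scaled_train = scaler.transform(train_matrix)
scaled_test = scaler.transform(test_matrix)  # reuses the training statistics

print(scaled_train.mean(axis=0).round(6))
print(scaled_test.mean(axis=0).round(3))
```

With this approach the test features are close to, but not exactly, zero mean, which reflects how the model would see genuinely unseen data. The same fitted scaler could also be applied to the submission matrix.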


## Section 2.1: Choosing best parameters

In [7]:

grid_params = {
    'solver': ['newton-cg', 'sag', 'saga'],
    'penalty': ['l2'],
    'C': [1e5, 1e4, 1e3, 1e2, 1e1, 1e-1, 1e-2, 1e-3]
}

In [8]:
%%time

lr_clf = LogisticRegression(random_state=RANDOM_SEED)
param_search = GridSearchCV(
    estimator=lr_clf,
    param_grid=grid_params,
    n_jobs=-1,
    cv=3,
    verbose=1,
    return_train_score=True
)

param_search.fit(matrix_train_lite, labels_train_lite)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
CPU times: user 25.2 s, sys: 20.8 s, total: 46 s
Wall time: 2min 55s


GridSearchCV(cv=3, estimator=LogisticRegression(random_state=123), n_jobs=-1,
             param_grid={'C': [100000.0, 10000.0, 1000.0, 100.0, 10.0, 0.1,
                               0.01, 0.001],
                         'penalty': ['l2'],
                         'solver': ['newton-cg', 'sag', 'saga']},
             return_train_score=True, verbose=1)

In [9]:
param_search.best_params_

{'C': 10.0, 'penalty': 'l2', 'solver': 'sag'}

**Best parameters from the parameter grid search:**  
`{'C': 1000.0, 'penalty': 'l2', 'solver': 'sag'}`
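
Beyond `best_params_`, the fitted search object exposes `cv_results_`, which makes it possible to see how close the candidates are to each other. The sketch below runs a much smaller grid on synthetic data from `make_classification`; the data, the grid, and the `max_iter` value are illustrative assumptions and do not reproduce the search above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the example runs in well under a second.
X, y = make_classification(n_samples=300, n_features=10, random_state=123)

search = GridSearchCV(
    estimator=LogisticRegression(random_state=123, max_iter=1000),
    param_grid={"C": [0.01, 1.0, 100.0]},
    cv=3,
)
search.fit(X, y)

# cv_results_ holds one entry per candidate; rank 1 has the best mean CV score.
results = search.cv_results_
for params, score, rank in zip(results["params"],
                               results["mean_test_score"],
                               results["rank_test_score"]):
    print(rank, params, round(score, 4))
```

When several values of `C` score almost identically, the reported best parameter can change between runs or data subsets, so inspecting the full ranking is often more informative than the single winner.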

## Section 2.2: Training the Logistic regression
Since we have figured out the best parameters in the last step, we will use them here. We also set `verbose=1` so we can track the progress of the model.

We also set `dual=False` since we have n_samples > n_features, as explained in the documentation found at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

In [10]:
lr_clf_best_params = LogisticRegression(
    random_state=RANDOM_SEED,
    solver='sag',
    C=1000,
    penalty='l2',
    n_jobs=-1,
    fit_intercept=True,
    dual=False,
    verbose=1
    )
lr_clf_best_params.fit(scaled_matrix_train_full, labels_train_full)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 21 epochs took 85 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  1.4min finished


LogisticRegression(C=1000, n_jobs=-1, random_state=123, solver='sag', verbose=1)

In [11]:
label_prediction = lr_clf_best_params.predict(scaled_matrix_test_full)
print(classification_report(labels_test_full, label_prediction, digits=4))

              precision    recall  f1-score   support

         0.0     0.7939    0.7594    0.7763    124929
         1.0     0.7697    0.8031    0.7860    125071

    accuracy                         0.7813    250000
   macro avg     0.7818    0.7813    0.7812    250000
weighted avg     0.7818    0.7813    0.7812    250000
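
As a sanity check on the report, the F1 score is the harmonic mean of precision and recall, and the macro average is the unweighted mean over the two classes. The short sketch below recomputes these from the rounded precision and recall values printed above, so the results agree with the report only up to rounding. The function name `f1_from` is ours.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the classification report above.
f1_class_0 = f1_from(0.7939, 0.7594)
f1_class_1 = f1_from(0.7697, 0.8031)
macro_f1 = (f1_class_0 + f1_class_1) / 2

print(f"class 0.0: {f1_class_0:.4f}")
print(f"class 1.0: {f1_class_1:.4f}")
print(f"macro avg: {macro_f1:.4f}")
```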



# Section 3: Creating submission

In [None]:
submission_matrix, id_submission_matrix = word_vec_to_aggregated_word_embeddings(
    dataset_full.submission_dataset,
    word_embeddings_full,
    200)
scaled_submission_matrix = scale(submission_matrix)

# predict on the scaled matrix, since the model was trained on scaled features
label_prediction_submission = lr_clf_best_params.predict(scaled_submission_matrix)
create_submission_file(label_prediction_submission)