# Section 0: Import section

In [9]:
# Import section

# Helper functions and classes
from helper_func_and_classes import create_dataset_list, create_submission_file, create_vocab
from helper_func_and_classes import split_dataset
from helper_func_and_classes import word_vec_to_word_embeddings
from helper_func_and_classes import word_vec_to_aggregated_word_embeddings
from helper_func_and_classes import word_embeddings_extraction
from helper_func_and_classes import TwitterDataset
from helper_func_and_classes import get_count_of_longest_sentence

# word embeddings
from torchtext.vocab import GloVe

# scikit-learn
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# plotting 
import matplotlib.pyplot as plt

# random state
RANDOM_SEED = 123 # used in helper functions

# Section 1: Data preprocessing section
This section extracts the data from the three different .txt files. The helper functions then process the tweets to create our vocabulary and three Python lists for positive, negative, and submission tweets. The tweets are pre-tokenized, so we only split on whitespace.

In [2]:
# creating the text vocabulary from the whole data set, including positive, negative and test data.
text_vocab_full = create_vocab("./twitter-datasets/train_pos_full.txt",
                          "./twitter-datasets/train_neg_full.txt",
                          "./twitter-datasets/test_data.txt")

text_vocab_lite = create_vocab("./twitter-datasets/train_pos.txt",
                          "./twitter-datasets/train_neg.txt",
                          "./twitter-datasets/test_data.txt")

# creating a standard python library list of the tweets that will be used for submission, 1 tweet per index
submission_data = create_dataset_list("./twitter-datasets/test_data.txt")

print("Length of text_vocab_full: ", len(text_vocab_full))
print("Length of text_vocab_lite: ", len(text_vocab_lite))
print("Length of submission_data: ",len(submission_data))

100%|█████████████████████████████████████| 1250000/1250000 [00:01<00:00, 825122.96it/s]
100%|█████████████████████████████████████| 1250000/1250000 [00:01<00:00, 716216.56it/s]
100%|█████████████████████████████████████████| 10000/10000 [00:00<00:00, 672314.94it/s]
100%|███████████████████████████████████████| 100000/100000 [00:00<00:00, 949057.80it/s]
100%|███████████████████████████████████████| 100000/100000 [00:00<00:00, 893423.99it/s]
100%|█████████████████████████████████████████| 10000/10000 [00:00<00:00, 934393.16it/s]


Length of text_vocab_full:  604014
Length of text_vocab_lite:  127802
Length of submission_data:  10000
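The helper module is not shown in this notebook, so the following is only a minimal sketch of how a whitespace-based vocabulary could be built. The name `build_vocab` and the use of in-memory lists instead of file paths are assumptions for illustration; the real `create_vocab` may differ.

```python
from collections import Counter


def build_vocab(*token_streams):
    """Count whitespace-separated tokens across several iterables of lines.

    Hypothetical stand-in for the notebook's create_vocab helper.
    """
    counts = Counter()
    for lines in token_streams:
        for line in lines:
            # Tweets are already tokenized, so a plain split() is enough.
            counts.update(line.split())
    return counts


# Toy stand-ins for the positive and negative tweet files.
pos = ["i love this song", "love love love"]
neg = ["i hate mondays"]

vocab = build_vocab(pos, neg)
print(len(vocab), vocab["love"])
```

With real files, each argument could be an open file handle, since iterating over a handle also yields one line at a time.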


Here we create the `TwitterDataset` object, which holds all the preprocessed data. Tweets longer than 37 tokens are cut off at that length; the threshold was chosen in the exploratory data analysis.

We then shuffle the dataset and split it into training and test sets. The length of the longest tweet is obtained by calling `get_count_of_longest_sentence` on the dataset.

In [3]:
# Creating the dataset in the TwitterDataset class which is found in the helper functions
dataset_full = TwitterDataset(text_vocab_full,
                              create_dataset_list("./twitter-datasets/train_pos_full.txt"),
                              create_dataset_list("./twitter-datasets/train_neg_full.txt"),
                              submission_data,
                              37)

dataset_lite = TwitterDataset(text_vocab_lite, 
                              create_dataset_list("./twitter-datasets/train_pos.txt"),
                              create_dataset_list("./twitter-datasets/train_neg.txt"),
                              submission_data,
                              37)

# create training dataset and test dataset - using a split of 90% / 10%
train_dataset_full, test_dataset_full = split_dataset(dataset_full, 0.9)
train_dataset_lite, test_dataset_lite = split_dataset(dataset_lite, 0.9)

# calculating the longest sentence
max_len_full = get_count_of_longest_sentence(dataset_full)
max_len_lite = get_count_of_longest_sentence(dataset_lite)

100%|█████████████████████████████████████| 1250000/1250000 [00:07<00:00, 171408.23it/s]
100%|█████████████████████████████████████| 1250000/1250000 [00:09<00:00, 138731.53it/s]
100%|███████████████████████████████████████| 100000/100000 [00:00<00:00, 237080.85it/s]
100%|███████████████████████████████████████| 100000/100000 [00:00<00:00, 206120.16it/s]


Number of elements in train_data is:  2246870
Number of elements in test_data is:  249653
Number of elements in train_data is:  179784
Number of elements in test_data is:  19976
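The cutoff and the shuffled split can be illustrated on toy data. This is a sketch under assumptions: `truncate` and `shuffled_split` are hypothetical names, and the real `TwitterDataset` and `split_dataset` helpers may work differently internally.

```python
import random


def truncate(tokens, max_len=37):
    """Keep at most max_len tokens of a tweet."""
    return tokens[:max_len]


def shuffled_split(items, train_fraction=0.9, seed=123):
    """Shuffle a copy of items with a fixed seed, then split it in two."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 50 toy tweets with lengths 1..50 tokens.
tweets = [["tok"] * n for n in range(1, 51)]
clipped = [truncate(t) for t in tweets]

train, test = shuffled_split(clipped)
longest = max(len(t) for t in clipped)
print(len(train), len(test), longest)
```

Seeding a local `random.Random` instance rather than the global generator keeps the split reproducible without affecting other code that uses `random`.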


<br><br>
# Section 2: RandomForestClassifier
#### Pretrained word embeddings (dim=200) 
Here we use pre-trained GloVe (Global Vectors) word embeddings with a dimension of 200 per word. The pre-trained word embeddings are downloaded through torchtext's `GloVe` class, see https://pytorch.org/text/stable/_modules/torchtext/vocab/vectors.html#GloVe. The embeddings we use were trained on Twitter data, which we expect to help accuracy since our corpus also consists of tweets.

In [4]:
glove_vec = GloVe(name='twitter.27B', dim=200)

In [5]:
word_embeddings_lite = glove_vec.get_vecs_by_tokens(list(text_vocab_lite.keys()), lower_case_backup=True)
word_embeddings_full = glove_vec.get_vecs_by_tokens(list(text_vocab_full.keys()), lower_case_backup=True)
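To make the lookup behaviour concrete, here is a small pure-Python sketch that imitates what `get_vecs_by_tokens` does with `lower_case_backup=True`, as I understand it: each token is looked up in its original case first, then in lower case, and tokens that are still missing get a zero vector (torchtext's default for unknown tokens). The function `lookup` and the toy table are illustrative assumptions, not torchtext code.

```python
def lookup(tokens, table, dim, lower_case_backup=True):
    """Return one vector per token, imitating a pretrained-embedding lookup."""
    rows = []
    for tok in tokens:
        if tok in table:
            rows.append(table[tok])
        elif lower_case_backup and tok.lower() in table:
            # Fall back to the lower-cased token if the original is missing.
            rows.append(table[tok.lower()])
        else:
            # Out-of-vocabulary tokens map to a zero vector.
            rows.append([0.0] * dim)
    return rows


# Toy 2-dimensional "pretrained" table.
table = {"happy": [0.1, 0.2], "sad": [-0.3, 0.4]}
vectors = lookup(["Happy", "sad", "zzzunknown"], table, dim=2)
print(vectors)
```

One consequence is that vocabulary entries absent from GloVe contribute all-zero rows to `word_embeddings_lite` and `word_embeddings_full`.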

We will create the word embeddings for our train data and test data here.

In [6]:
dimension = 200
vec_size_lite = max_len_lite
vec_size_full = max_len_full

train_matrix_lite, train_labels_lite = word_vec_to_aggregated_word_embeddings(
    train_dataset_lite,
    word_embeddings_lite,
    dimension)

test_matrix_lite, test_labels_lite = word_vec_to_aggregated_word_embeddings(
    test_dataset_lite,
    word_embeddings_lite,
    dimension)

train_matrix_full, train_labels_full = word_vec_to_aggregated_word_embeddings(
    train_dataset_full,
    word_embeddings_full,
    dimension)

test_matrix_full, test_labels_full = word_vec_to_aggregated_word_embeddings(
    test_dataset_full,
    word_embeddings_full,
    dimension)

179784it [00:08, 20649.89it/s]
19976it [00:00, 21709.10it/s]
2246870it [01:46, 21020.64it/s]
249653it [00:11, 21194.14it/s]
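The notebook does not show how `word_vec_to_aggregated_word_embeddings` combines the word vectors of a tweet. One common choice is to average them, which yields one fixed-length feature vector per tweet. The sketch below assumes mean aggregation and tweets stored as lists of vocabulary indices; both are assumptions, and `aggregate_tweet` is a hypothetical name.

```python
import numpy as np


def aggregate_tweet(token_ids, embeddings):
    """Average the embedding rows of one tweet; empty tweets map to zeros."""
    if len(token_ids) == 0:
        return np.zeros(embeddings.shape[1], dtype=embeddings.dtype)
    return embeddings[token_ids].mean(axis=0)


# Toy embedding table: 3 words, 2 dimensions each.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32)

# Each tweet is a list of indices into the embedding table.
tweets = [[0, 1], [2], []]
matrix = np.stack([aggregate_tweet(t, embeddings) for t in tweets])
print(matrix.shape)
```

Aggregating this way gives every tweet the same number of features regardless of its length, which is what a `RandomForestClassifier` needs as input.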


## Section 2.1: Choosing best parameters
Here we tune the hyperparameters of the `RandomForestClassifier`.

The `RandomForestClassifier` has a large number of hyperparameters, so we use a randomized search rather than an exhaustive one. After defining the grid of candidate values, `RandomizedSearchCV` samples a random set of combinations from `grid_params`. To ensure reproducibility we replace the default cross-validation splitter with `StratifiedKFold`, which lets us set `random_state` to our `RANDOM_SEED` constant.

In [7]:
# Create the random grid
grid_params = {
    # for classifiers 'auto' behaves like 'sqrt'; 'auto' was removed in scikit-learn 1.3
    'max_features': ['auto', 'sqrt'],
    'n_estimators': [5, 10, 20, 30, 40],
    'min_samples_leaf': [1, 2, 3, 4],
    'min_samples_split': [2, 5, 10, 15, 20],
    'max_depth': [10, 50, 100, 150, 200, 250, 300],
    'bootstrap': [False, True]
}
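To see why an exhaustive search would be expensive, the size of the grid can be computed directly. This small sketch only counts combinations; it copies the grid from the cell above and uses the same `n_iter` as the search that follows.

```python
from math import prod

grid_params = {
    'max_features': ['auto', 'sqrt'],
    'n_estimators': [5, 10, 20, 30, 40],
    'min_samples_leaf': [1, 2, 3, 4],
    'min_samples_split': [2, 5, 10, 15, 20],
    'max_depth': [10, 50, 100, 150, 200, 250, 300],
    'bootstrap': [False, True],
}

# Number of distinct combinations an exhaustive grid search would try.
total = prod(len(values) for values in grid_params.values())

n_iter = 150   # candidates sampled by the randomized search
n_splits = 3   # cross-validation folds

print(total, n_iter * n_splits, total * n_splits)
```

With 3 folds, the randomized search needs 450 fits, whereas covering the whole grid would need 8400.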


In [10]:
%%time
cv = StratifiedKFold(n_splits=3, random_state=RANDOM_SEED, shuffle=True)
rf_clf = RandomForestClassifier(random_state=RANDOM_SEED)

rf_random_clf = RandomizedSearchCV(
    param_distributions = grid_params,
    estimator = rf_clf,
    n_iter = 150,
    cv = cv,
    verbose=1,
    random_state=RANDOM_SEED,
    n_jobs = -1)

rf_random_clf.fit(train_matrix_lite, train_labels_lite)

Fitting 3 folds for each of 150 candidates, totalling 450 fits
CPU times: user 3min 38s, sys: 1min 11s, total: 4min 50s
Wall time: 53min 56s


RandomizedSearchCV(cv=StratifiedKFold(n_splits=3, random_state=123, shuffle=True),
                   estimator=RandomForestClassifier(random_state=123),
                   n_iter=150, n_jobs=-1,
                   param_distributions={'bootstrap': [False, True],
                                        'max_depth': [10, 50, 100, 150, 200,
                                                      250, 300],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 3, 4],
                                        'min_samples_split': [2, 5, 10, 15, 20],
                                        'n_estimators': [5, 10, 20, 30, 40]},
                   random_state=123, verbose=1)

After this we can collect the best parameters by using `best_params_`. We will use these values to train the model on the full dataset.

In [11]:
rf_random_clf.best_params_

{'n_estimators': 40,
 'min_samples_split': 15,
 'min_samples_leaf': 3,
 'max_features': 'sqrt',
 'max_depth': 100,
 'bootstrap': False}
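Instead of copying the values by hand, the dictionary returned by `best_params_` can be unpacked directly into a new estimator. The sketch below shows the pattern on a small synthetic dataset from `make_classification`, which is a stand-in for the real feature matrix; the parameter values are the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

RANDOM_SEED = 123

# Values as reported by rf_random_clf.best_params_ above.
best_params = {
    "n_estimators": 40,
    "min_samples_split": 15,
    "min_samples_leaf": 3,
    "max_features": "sqrt",
    "max_depth": 100,
    "bootstrap": False,
}

# Small synthetic stand-in for the real training data.
X, y = make_classification(n_samples=200, n_features=20, random_state=RANDOM_SEED)

# Unpack the tuned values and add the settings that were not part of the search.
clf = RandomForestClassifier(**best_params, random_state=RANDOM_SEED)
clf.fit(X, y)
print(clf.get_params()["n_estimators"])
```

In the notebook itself this would read `RandomForestClassifier(**rf_random_clf.best_params_, ...)`, which avoids transcription errors if the search is rerun.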


## Section 2.2: Training the RandomForestClassifier

In [12]:
rf_clf_best_params = RandomForestClassifier(
    n_estimators=40, 
    min_samples_split=15, 
    min_samples_leaf=3,
    max_features='sqrt',
    max_depth=100,
    bootstrap=False,
    verbose=2,
    n_jobs=-1,
    random_state=RANDOM_SEED
)

rf_clf_best_params.fit(train_matrix_full, train_labels_full)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


building tree 1 of 40
building tree 2 of 40
building tree 3 of 40
building tree 4 of 40
building tree 5 of 40
building tree 6 of 40
building tree 7 of 40
building tree 8 of 40
building tree 9 of 40
building tree 10 of 40
building tree 11 of 40
building tree 12 of 40

building tree 13 of 40
building tree 14 of 40
building tree 15 of 40
building tree 16 of 40
building tree 17 of 40
building tree 18 of 40
building tree 19 of 40
building tree 20 of 40
building tree 21 of 40
building tree 22 of 40
building tree 23 of 40
building tree 24 of 40
building tree 25 of 40
building tree 26 of 40
building tree 27 of 40
building tree 28 of 40
building tree 29 of 40


[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  6.2min


building tree 30 of 40
building tree 31 of 40
building tree 32 of 40
building tree 33 of 40
building tree 34 of 40
building tree 35 of 40
building tree 36 of 40
building tree 37 of 40
building tree 38 of 40
building tree 39 of 40
building tree 40 of 40


[Parallel(n_jobs=-1)]: Done  38 out of  40 | elapsed: 11.0min remaining:   34.9s
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed: 11.2min finished


RandomForestClassifier(bootstrap=False, max_depth=100, max_features='sqrt',
                       min_samples_leaf=3, min_samples_split=15,
                       n_estimators=40, n_jobs=-1, random_state=123, verbose=2)

Then we predict labels for the held-out test data and use `classification_report` to see how well the model performs.

In [13]:
label_prediction = rf_clf_best_params.predict(test_matrix_full)
print(classification_report(test_labels_full, label_prediction, digits=4))
print(label_prediction)

[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  17 tasks      | elapsed:    0.9s
[Parallel(n_jobs=12)]: Done  38 out of  40 | elapsed:    1.4s remaining:    0.1s
[Parallel(n_jobs=12)]: Done  40 out of  40 | elapsed:    1.5s finished


              precision    recall  f1-score   support

         0.0     0.8300    0.7478    0.7867    124773
         1.0     0.7707    0.8469    0.8070    124880

    accuracy                         0.7974    249653
   macro avg     0.8003    0.7974    0.7969    249653
weighted avg     0.8003    0.7974    0.7969    249653

[0. 0. 0. ... 1. 1. 1.]
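The columns of the report are related to each other: F1 is the harmonic mean of precision and recall, and accuracy equals the support-weighted average of the per-class recalls. As a sanity check, the sketch below recomputes these quantities from the rounded values printed above, so the results agree with the report only up to rounding.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded per-class values copied from the classification report above.
report = {
    0: {"precision": 0.8300, "recall": 0.7478, "support": 124773},
    1: {"precision": 0.7707, "recall": 0.8469, "support": 124880},
}

f1_scores = {label: f1(r["precision"], r["recall"]) for label, r in report.items()}

# Accuracy is the fraction of correctly classified samples over both classes.
total = sum(r["support"] for r in report.values())
accuracy = sum(r["recall"] * r["support"] for r in report.values()) / total

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

print(f1_scores, accuracy, macro_f1)
```

The lower recall for class 0 than for class 1 indicates that the model misclassifies negative tweets as positive more often than the reverse.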


# Section 3: Creating submission

In [14]:
submission_matrix, id_submission_matrix = word_vec_to_aggregated_word_embeddings(
    dataset_full.submission_dataset,
    word_embeddings_full,
    dimension)

10000it [00:01, 8067.42it/s]


In [15]:
submission_numpy_array = rf_clf_best_params.predict(submission_matrix)
submission_numpy_array

[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  17 tasks      | elapsed:    0.0s
[Parallel(n_jobs=12)]: Done  38 out of  40 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=12)]: Done  40 out of  40 | elapsed:    0.1s finished


array([0., 1., 0., ..., 0., 1., 0.], dtype=float32)

In [None]:
create_submission_file(submission_numpy_array)