# Doc2Vec Modeling

The code in this notebook is based on [this Medium post](https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4).

In [2]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns; sns.set()
%matplotlib inline
# NLP
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction import text 
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# modeling
from sklearn import metrics, model_selection, svm, utils
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
# metrics
# NOTE: plot_confusion_matrix was removed in scikit-learn 1.2; drop it from this import on newer versions
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, plot_confusion_matrix, roc_curve, auc, classification_report
import pickle

## Importing `clean_df`

In [3]:
clean_df = pd.read_pickle(r'C:\Users\Ricky\Desktop\4.2 FINAL SEMESTER\PROJECT II  Computer systems Project\rOOT\Preprocessing\clean_df.pkl')

In [5]:
clean_df.head(5)

Unnamed: 0,total_votes,hate_speech_votes,other_votes,label,tweet,clean_tweets
0,3,1,3,0,kindly say bickering to kikuyus and kalenjins....,kindly say bickering to kikuyus and kalenjins ...
1,3,1,3,0,kindly remind them that we do not have thoroug...,kindly remind them that we do not have thoroug...
2,3,1,3,0,kindly look at moses' statement. where has he ...,kindly look at moses statement where has he sa...
3,3,1,3,0,kindly like this page>>>wtf fun facts maasai a...,kindly like this pagewtf fun facts maasai are ...
4,3,1,3,0,kindly kikuyus humble yourselves in 2022 and t...,kindly kikuyus humble yourselves in and take t...


## Train-Test Split

In [6]:
doc_train, doc_test = train_test_split(clean_df, test_size=0.3, random_state=42)

## Preparing the Data

In [7]:
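def tokenize_text(text):
    """Return the lowercased word tokens of `text`, skipping one-character tokens."""
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            # drop one-character tokens (stray letters and punctuation)
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens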

In [8]:
tagged_train = doc_train.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['clean_tweets']), tags=[r.label]), axis=1)
tagged_test = doc_test.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['clean_tweets']), tags=[r.label]), axis=1)

In [9]:
tagged_train.values[30]

TaggedDocument(words=['fuck', 'you', 'zuck', 'you', 'canâ\x80\x99t', 'have', 'ai', 'constantly', 'police', 'â\x80\x9c', 'hatespeech', 'â\x80', 'this', 'things', 'change', 'within', 'minutes', 'and', 'different', 'words', 'mean', 'different', 'things', 'in', 'different', 'pas', 'of', 'the', 'world', 'or', 'even', 'this', 'country'], tags=[0])

## Training DBOW Model

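DBOW (distributed bag of words) is the Doc2Vec variant analogous to the Skip-gram model in Word2Vec. Gensim makes training a Doc2Vec model straightforward, but note that it selects DBOW with `dm=0`; the cell below passes `dm=1`, which trains the distributed-memory (PV-DM) variant, so the results in this notebook come from PV-DM despite the `dbow_model` name.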

In [10]:
# train a doc2vec model, using only training data
# NOTE: in gensim, dm=1 selects distributed memory (PV-DM); dm=0 selects DBOW
dbow_model = Doc2Vec(vector_size=100,
                     alpha=0.025,
                     min_count=5,
                     dm=1, epochs=100)

In [11]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

# building vocabulary 
dbow_model.build_vocab([x for x in tqdm(tagged_train.values)])


100%|██████████| 35122/35122 [00:00<00:00, 377857.67it/s]


In [12]:
%%time
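# the recorded run of this cell took about 5.5 minutes (see the wall time below)
# NOTE: alpha starts at 0.025 and drops by 0.002 per pass, so it is negative from the 14th pass onward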
for epoch in range(30):
    dbow_model.train(utils.shuffle([x for x in tqdm(tagged_train.values)]), total_examples=len(tagged_train.values), epochs=1)
    dbow_model.alpha -= 0.002
    dbow_model.min_alpha = dbow_model.alpha

100%|██████████| 35122/35122 [00:00<00:00, 408589.21it/s]

Wall time: 5min 28s
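The manual decay in the training loop above can be traced with plain arithmetic. The sketch below needs no gensim; it only replays the `alpha -= 0.002` bookkeeping, assuming gensim uses `model.alpha` as the starting rate of each `train` call. The linear schedule at the end is one possible alternative with a hypothetical floor value, not the notebook's method.

```python
# Replay the learning-rate bookkeeping of the training loop without gensim.
alpha = 0.025   # starting learning rate passed to Doc2Vec
step = 0.002    # amount subtracted after every pass
passes = 30     # number of iterations of the for-loop

schedule = []
for epoch in range(passes):
    schedule.append(alpha)  # rate in effect at the start of this pass
    alpha -= step

# Index of the first pass that starts with a negative learning rate.
first_negative = next(i for i, a in enumerate(schedule) if a < 0)
print(f"first pass with a negative rate: index {first_negative}")
print(f"rate left after the final pass: {alpha:.3f}")

# Possible alternative (an assumption): decay linearly to a small positive floor.
floor = 0.0001  # hypothetical minimum rate
linear = [0.025 - (0.025 - floor) * i / (passes - 1) for i in range(passes)]
print(f"linear schedule ends at: {linear[-1]:.4f}")
```

If the negative rates are unintended, they may explain part of the weak results further down, since updates made with a negative rate move the vectors away from the training objective.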


### Building the final vector feature for the classifier

In [None]:

# Superseded by the next cell, which drops the `steps=20` argument from infer_vector
# (gensim 4 renamed this argument to `epochs`).
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=20)) for doc in sents])
    return targets, regressors

In [36]:
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words)) for doc in sents])
    return targets, regressors

## Baseline Models

In [37]:
# infer a Doc2Vec feature vector for every training and test document
y_train, X_train = vec_for_learning(dbow_model, tagged_train)
y_test, X_test = vec_for_learning(dbow_model, tagged_test)

## Logistic Regression

The Logistic Regression baseline had the highest unweighted F1 score, 0.387805, with the TF-IDF vectorization method.

In [38]:
logreg = LogisticRegression(n_jobs=1, C=1e5)

In [39]:
%%time
logreg.fit(X_train, y_train)

Wall time: 4.18 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(C=100000.0, n_jobs=1)

In [40]:
logreg_y_preds = logreg.predict(X_test)

In [41]:
logreg_precision = precision_score(y_test, logreg_y_preds)
logreg_recall = recall_score(y_test, logreg_y_preds)
logreg_f1_score = f1_score(y_test, logreg_y_preds)
logreg_f1_weighted = f1_score(y_test, logreg_y_preds, average='weighted')

In [42]:
print('Precision: {:.4}'.format(logreg_precision))
print('Recall: {:.4}'.format(logreg_recall))
print('F1 Score: {:.4}'.format(logreg_f1_score))
print('Weighted F1 Score: {:.4}'.format(logreg_f1_weighted))


Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Weighted F1 Score: 0.906


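With Doc2Vec features, the Logistic Regression model's unweighted F1 score drops to 0.0, well below the 0.387805 it reached with TF-IDF. Passing `average='weighted'` raises the figure to 0.906, but only because class 0 dominates the test set.

Precision and recall for class 1 are both 0.0, so the model does not correctly identify a single class 1 tweet.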

According to the scikit-learn documentation, a weighted F1 score calculates metrics for each label, and finds their average weighted by support (the number of true instances for each label). **This alters ‘macro’ to account for label imbalance;** it can result in an F-score that is not between precision and recall.
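To make the gap between the two scores concrete, here is a from-scratch sketch of the support-weighted average described above. It uses plain Python and toy labels (90 negatives, 10 positives, a classifier that predicts only class 0); the helper names `f1_for_class` and `weighted_f1` are hypothetical and the data is not from this notebook.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support."""
    n = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, c) * y_true.count(c) / n
        for c in sorted(set(y_true))
    )


# Toy, imbalanced example: the classifier never predicts class 1.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

binary_f1 = f1_for_class(y_true, y_pred, positive=1)
w_f1 = weighted_f1(y_true, y_pred)
print(f"F1 for class 1: {binary_f1:.4f}")
print(f"Weighted F1:    {w_f1:.4f}")
```

The weighted score stays high because the majority class contributes 90% of the weight, even though the minority class is never detected.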



In [43]:
# creating dictionary with all metrics
metric_dict = {}
metric_dict['Baseline Logisitic Regression'] = {'precision': logreg_precision, 'recall': logreg_recall, 'f1_score': logreg_f1_score, 'weighted_f1': logreg_f1_weighted}

## Support Vector Machine (SVM)
The baseline SVM had the highest weighted F1 score, 0.938102, with the TF-IDF vectorization method.


In [45]:
# NOTE: with kernel='linear', the degree and gamma arguments are ignored
SVM_baseline = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto', class_weight='balanced')

In [46]:
%%time
# the recorded run of this cell took almost 9 minutes (see the wall time below)
# fit the training dataset on the classifier
SVM_baseline.fit(X_train, y_train)

Wall time: 8min 53s


SVC(class_weight='balanced', gamma='auto', kernel='linear')
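The `class_weight='balanced'` setting used for the SVM above reweights each class by `n_samples / (n_classes * count)`, the formula given in the scikit-learn documentation. The sketch below reproduces that arithmetic in plain Python on a hypothetical label distribution (900 negatives, 100 positives), not the notebook's training set; `balanced_class_weights` is an illustrative helper name.

```python
def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical imbalanced labels: the minority class gets the larger weight.
demo_labels = [0] * 900 + [1] * 100
weights = balanced_class_weights(demo_labels)
print(weights)
```

Because errors on the minority class cost more under this weighting, a balanced model tends to predict class 1 more often, which usually raises recall at the expense of precision.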

In [47]:
# predict the labels on the test set
SVM_y_preds = SVM_baseline.predict(X_test)

In [48]:
SVM_precision = precision_score(y_test, SVM_y_preds)
SVM_recall = recall_score(y_test, SVM_y_preds)
SVM_f1_score = f1_score(y_test, SVM_y_preds)
SVM_f1_weighted = f1_score(y_test, SVM_y_preds, average='weighted')

In [49]:
# printing evaluation metrics up to 4th decimal place
print('Testing Metrics for SVM Baseline with Lemmatization Features')
print('Precision: {:.4}'.format(SVM_precision))
print('Recall: {:.4}'.format(SVM_recall))
print('F1 Score: {:.4}'.format(SVM_f1_score))
print('Weighted F1 Score: {:.4}'.format(SVM_f1_weighted))

Testing Metrics for SVM Baseline with Lemmatization Features
Precision: 0.1031
Recall: 0.4748
F1 Score: 0.1694
Weighted F1 Score: 0.7798


In [50]:
metric_dict['Baseline SVM'] = {'precision': SVM_precision, 'recall': SVM_recall, 'f1_score': SVM_f1_score, 'weighted_f1': SVM_f1_weighted}

## Evaluation Metrics

In [51]:
pd.DataFrame.from_dict(metric_dict, orient='index')

Unnamed: 0,precision,recall,f1_score,weighted_f1
Baseline Logisitic Regression,0.0,0.0,0.0,0.906035
Baseline SVM,0.103079,0.47479,0.169384,0.779814


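In this table, the SVM model does better on unweighted F1 (0.169 vs. 0.0), while the Logistic Regression model does better on weighted F1 (0.906 vs. 0.780). The weighted score is dominated by the majority class, so it says little about how well either model handles class 1.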



In [52]:
target_names = ['class 0', 'class 1']
# logistic regression baseline
print(classification_report(y_test, logreg_y_preds, target_names=target_names))
# SVM baseline
print(classification_report(y_test, SVM_y_preds, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.94      1.00      0.97     14101
     class 1       0.00      0.00      0.00       952

    accuracy                           0.94     15053
   macro avg       0.47      0.50      0.48     15053
weighted avg       0.88      0.94      0.91     15053

              precision    recall  f1-score   support

     class 0       0.95      0.72      0.82     14101
     class 1       0.10      0.47      0.17       952

    accuracy                           0.71     15053
   macro avg       0.53      0.60      0.50     15053
weighted avg       0.90      0.71      0.78     15053



However, it's important to note that the Doc2Vec features perform worse here than the TF-IDF scores quoted above (unweighted F1 of 0.387805 for Logistic Regression, weighted F1 of 0.938102 for SVM). We can try grid searching to improve this.
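A minimal sketch of such a grid search, using scikit-learn's `GridSearchCV`, is shown below. It runs on synthetic, imbalanced data from `make_classification` as a stand-in for the inferred document vectors, and the parameter grid, `class_weight='balanced'`, and `max_iter=1000` are illustrative assumptions rather than tuned choices from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the Doc2Vec feature vectors (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.9, 0.1], random_state=42
)

# Hypothetical search space; the values are illustrative only.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid,
    scoring="f1",  # unweighted F1 on class 1, the metric that collapsed above
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring on unweighted F1 rather than accuracy or weighted F1 keeps the search from rewarding models that ignore the minority class.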
