# Classifying tweets with Active Learning: Denmark


## 1. Label tweets  

**`Instructions`**

> Start by labelling around 10-15 tweets in the following chunk 

> Press the `retrain` button 

> Continue until the `progress bar` has improved **or** the `accuracy` measure is satisfactory 

> Save the tweets and their labels in a dictionary *(which will be used to train the supervised classifiers)*

In [None]:
# import modules 
from superintendent import ClassLabeller
import pandas as pd
from IPython.display import display, Markdown
import random
import string
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import ast
from tabulate import tabulate
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
### load data 

# function to turn the tokenized/lemmatized list into a readable format
def string_list(text):
    
    # we transform the string representation of the list into an actual list
    text = ast.literal_eval(text)
    
    # return the transformed text
    return text

# import data
df = pd.read_csv('da_preprocess.csv')

# apply function: YOU NEED TO SPECIFY ALL RELEVANT COLUMNS HERE
df['token'] = df['token'].apply(string_list)
df['lemma'] = df['lemma'].apply(string_list)
df['token_no_mention'] = df['token_no_mention'].apply(string_list)
df['lemma_no_mention'] = df['lemma_no_mention'].apply(string_list)

df['lemma_no_mention'] = df['lemma_no_mention'].astype(str)
# print the dataframe
print(df.shape)
df.head(3)
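
The `string_list` helper above relies on `ast.literal_eval` to turn the text stored in the CSV back into a Python list. A minimal sketch of that step, using a made-up cell value (the tokens are illustrative and not taken from the dataset):

```python
import ast

# A list column written to CSV comes back as one string per cell.
# The value below is a made-up example, not a row from da_preprocess.csv.
raw_cell = "['vaccine', 'is', 'safe']"

# literal_eval only parses Python literals; it never executes code.
tokens = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(tokens).__name__)
print(tokens)
```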

**For Log.reg params:** 

https://machinelearningmastery.com/multinomial-logistic-regression-with-python/

In [None]:
### initiate a model using multinomial logistic regression 
ALmodel = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer()),
    ('logistic_regression', LogisticRegression(multi_class="multinomial", solver="lbfgs"))])  # optionally raise max_iter (e.g. 5000) if lbfgs does not converge


### define display function to show the preprocessed text for manual labelling
def display_text(df):
    display(Markdown(df["preprocess"]))
    
    
### define preprocessor to only train the model on the lemmas (no mention)
def preprocessor(x,y):
    return x['lemma_no_mention'], y



In [None]:
### create widget for labelling 
widget = ClassLabeller(
    features=df, # specify column 
    model=ALmodel,
    display_func=display_text, # use the display function to show the text
    model_preprocess=preprocessor,
    options=['vaxx','anti-vaxx','neutral','trash'], # specify the label options (r=remove)
    acquisition_function='certainty') # specify sampling strategy 

widget
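
The widget above is created with `acquisition_function='certainty'`. The general idea behind a certainty-based strategy is to show the annotator the tweets the current model is least sure about. The sketch below illustrates that idea in plain Python; the function name and the probabilities are hypothetical, and superintendent's own implementation may differ in detail.

```python
def least_confident_first(probabilities):
    """Return row indices ordered from least to most confident prediction."""
    # Confidence of a prediction = highest class probability in that row.
    confidence = [max(row) for row in probabilities]
    return sorted(range(len(probabilities)), key=lambda i: confidence[i])


# Hypothetical predict_proba output for four unlabelled tweets and
# four classes; the numbers are invented for illustration.
probs = [
    [0.90, 0.05, 0.03, 0.02],  # confident
    [0.30, 0.28, 0.22, 0.20],  # very unsure
    [0.55, 0.25, 0.10, 0.10],
    [0.40, 0.35, 0.15, 0.10],
]

order = least_confident_first(probs)
print(order)  # the tweets to label first come first
```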

In [None]:
### save the indexes and labels to dictionary 
labels = widget.queue.labels

### map the dict to a dataframe 
labelframe = pd.DataFrame.from_dict(labels, orient='index')
labelframe.columns = ['label']

labelframe

**Important**
 1. Label a tweet in the widget **and** make sure you can recognize its text later
 2. Run the chunk above to make sure the label is saved with its index 
 3. Run the chunk below and check that the tweet has the same index in the original dataframe. **Remember to specify the tweet index in line 5** of that chunk

In [None]:
### check that the tweets are shuffled at each retraining of the model BUT KEEP THEIR INDEX 
print(labelframe.tail(10))
      
# label a tweet in the widget (and remember it) > find index in dictionary > find tweet in df using index > same tweet? 
df.iloc[1379]['preprocess'] # change to the index number and column we want to inspect 


**Create a dataframe with the manually labelled tweets** (We will later split this and use as both training and testing data)

In [None]:
print(df.columns)

In [None]:
### merge the dataframes 
labeldf = df.merge(labelframe, left_index=True, right_index=True)
labeldf


### drop the columns we don't need 
labeldf = labeldf.drop(['date', 'preprocess','preprocess_no_mention',
       'token', 'lemma', 'token_no_mention'], axis = 1)


### drop the rows that aren't danish (marked 'r' during active learning)
# Get indexes where label column has value 'r'
#indexes= labeldf[(labeldf['label'] == 'r')].index
# Delete these row indexes from dataFrame
#labeldf.drop(indexes, inplace=True)

# inspect df
labeldf.tail(15)
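
The merge above pairs each label with its tweet purely through the row index. A small sketch on invented data shows the behaviour to expect, assuming the label dictionary maps row index to label as it does in the notebook (the frame and its values are hypothetical):

```python
import pandas as pd

# Toy stand-ins for `df` and `widget.queue.labels`; values are invented.
tweets = pd.DataFrame({"lemma_no_mention": ["a b", "c d", "e f", "g h"]})
toy_labels = {2: "vaxx", 0: "neutral"}  # row index -> label, in labelling order

toy_labelframe = pd.DataFrame.from_dict(toy_labels, orient="index")
toy_labelframe.columns = ["label"]

# Inner join on the index: only labelled rows survive, and each label is
# paired with the tweet that has the same index in the original frame.
labelled = tweets.merge(toy_labelframe, left_index=True, right_index=True)
print(labelled)
```

Because the join is on the index, the order in which tweets were labelled does not matter, but the index of `df` must not be reset between labelling and merging.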

**Save the labelled data as a csv-file**

In [None]:
# save the df 
labeldf.to_csv(r'C:\Users\Frederikke\OneDrive\MSc. Social Data Science\Exam\da_labelsNew.csv') # change to own directory 

## 2. Hyperparameter tuning and model training 

In this part, we will use the labelled tweets *(from step 1)* to train several supervised classifiers `(SVM, Multinomial Logistic Regression, Bagging, Boosting, Multinomial Naïve Bayes)`


Moreover, we will use `grid search` to tune the hyperparameters of the classifiers. This will be performed as an integrated part of the training, using cross-validation.

> Which hyperparameters to tune, and which values to consider, depends on the respective classifier and the dataset. Hence, the **search space of the `grid-search` will differ** between classifiers!
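
To make the notion of a search space concrete, the sketch below expands a small grid into its candidate settings, which is conceptually what a grid search enumerates. The grid itself is hypothetical and much smaller than the ones used in this notebook; the keys follow the scikit-learn pipeline convention `<step name>__<parameter>`.

```python
from itertools import product

# Hypothetical, deliberately small search space.
param_grid = {
    "classifier__C": [0.5, 1, 2],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
}

names = list(param_grid)
candidates = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

print(len(candidates))  # 3 * 2 = 6 candidate settings
for candidate in candidates:
    print(candidate)
```

The number of candidates is the product of the list lengths, so each extra hyperparameter multiplies the cost of the search.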

### 2.1 : Load classifiers, define search spaces, create pipelines

In [None]:
# import modules 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

###  preprocessing and classifiers 
#from nltk.corpus import stopwords
#from nltk import word_tokenize, corpus
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.svm import (SVC, LinearSVC)
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import MultinomialNB 
# If you have trouble importing the following two functions: 
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV

# evaluation 
import sklearn.metrics as metrics
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split

import warnings
warnings.simplefilter("ignore")


In [None]:
# load data, make train/test split, prepare for pipeline 
data = pd.read_csv('da_labels.csv')
data


**Visualize class proportions:** Use `stratified split` if imbalanced classes

In [None]:
# visualize the class proportions 
fig, ax = plt.subplots()
fig.suptitle("Label", fontsize=12)
data["label"].reset_index().groupby("label").count().sort_values(by= 
       "index").plot(kind="bar", legend=False, 
        ax=ax).grid(axis='x') 
plt.show()


print(data.groupby("label").count())

In [None]:
# do a stratified train/test split (on label) to ensure representative class proportions in the train/test sets
X = data['lemma_no_mention'] # Collection of documents (tweets)
y = data['label'] # labels  (4 classes)

# split into train and test sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=data['label'],random_state=24) 
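
The effect of `stratify` can be checked on toy data. The sketch below uses invented, imbalanced labels (60 / 30 / 10) and the same `test_size` and `random_state` as above; the variable names are hypothetical and chosen so they do not overwrite the notebook's own `X` and `y`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Invented, imbalanced toy labels: 60 / 30 / 10.
toy_y = ["neutral"] * 60 + ["vaxx"] * 30 + ["anti-vaxx"] * 10
toy_X = [f"tweet {i}" for i in range(len(toy_y))]

toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=24
)

# Both splits keep the 60 / 30 / 10 ratio of the full toy dataset.
print(Counter(toy_y_tr))
print(Counter(toy_y_te))
```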


#### 2.1.1 : Multinomial Logistic Regression

In [None]:
### build a pipeline
lg_pipe = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(random_state=1, multi_class='auto', penalty='l2')),])



### define parameters to be tested using k-fold CV
lg_params = {'classifier__C': [0.5,1,2,3,4,5], # try [0.1, 1, 10, 100, 1000] 0.01,0.1,
             'vectorizer__max_df': (0.6,0.7,0.9,0.99),
             'vectorizer__min_df': [0.01,0.001,0.0001],
             'vectorizer__ngram_range' : [(1,1),(1,2),(2,2), (1,3), (2,3)],
             'classifier__solver' : ['newton-cg', 'lbfgs', 'liblinear'], # 'sag', 'saga' not for danish
             'vectorizer__use_idf': [True, False],
             'classifier__class_weight': [None, 'balanced'],
            }

### perform grid search using the pipeline and parameter grid
lg_gs = HalvingGridSearchCV(lg_pipe,lg_params,cv=5,n_jobs=-1, verbose=1,scoring='f1_micro', factor=2) # 5-fold CV; n_jobs=-1: computation will be dispatched on all the CPUs

### train best estimator on training data 
lg_gs = lg_gs.fit(X_train, y_train)
#print('Best score on training data:',lg_gs.score(X_train, y_train))
#print('Best score on testing data:',lg_gs.score(X_test, y_test))
#print('Best score',lg_gs.best_score_)
print('Best parameters',lg_gs.best_params_)


## test classifier 
y_pred = lg_gs.best_estimator_.predict(X_test)
predicted_prob = lg_gs.best_estimator_.predict_proba(X_test)


############################################## Evaluation ####################################################

classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values


### Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, y_pred)
auc = metrics.roc_auc_score(y_test, predicted_prob, 
                            multi_class='ovr') # or ovo 
print("Accuracy:",  round(accuracy,2))
print("Auc:", round(auc,2))
print("Logistic Regression Details:")
print(metrics.classification_report(y_test, y_pred))
    
    
    
### Plot confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues, 
            cbar=False)
ax.set(xlabel="Predicted label", ylabel="True label", xticklabels=classes, 
       yticklabels=classes, title="Confusion matrix: Logistic Regression")
plt.yticks(rotation=0)
fig, ax = plt.subplots(nrows=1, ncols=2)

### Plot ROC-curve (to illustrate trade-off between sensitivity (or TPR) and specificity (1 – FPR) )
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i],  
                           predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3, 
              label='{0} (area={1:0.2f})'.format(classes[i], 
                              metrics.auc(fpr, tpr))
               )
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05], 
          xlabel='False Positive Rate', 
          ylabel="True Positive Rate (Recall)", 
          title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)

    
### Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(
                 y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3, 
               label='{0} (area={1:0.2f})'.format(classes[i], 
                                  metrics.auc(recall, precision))
              )
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall', 
          ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()


### Confusion matrix vol 2.0
labels = classes 
print(classification_report(y_test, y_pred, labels=labels)) # classification report from sklearn; `labels` must be passed by keyword
cnf_matrix = confusion_matrix(y_test, y_pred, labels=labels)
plt.imshow(cnf_matrix, cmap=plt.cm.Blues) #plot confusion matrix grid
threshold = cnf_matrix.max() / 2 #threshold to define text color
for i in range(cnf_matrix.shape[0]): #print text in grid
    for j in range(cnf_matrix.shape[1]): 
        plt.text(j, i, cnf_matrix[i,j], color="w" if cnf_matrix[i,j] > threshold else 'black')
tick_marks = np.arange(len(labels)) #define labeling spacing based on number of classes
plt.xticks(tick_marks, labels, rotation=45)
plt.yticks(tick_marks, labels)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.colorbar()
plt.tight_layout()
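
The searches in this notebook use `HalvingGridSearchCV` with `factor=2`: each round keeps roughly the best half of the candidates and gives the survivors twice as many training samples. The sketch below shows an idealised version of that schedule. The function and its numbers are hypothetical; scikit-learn chooses the starting resources and the number of rounds with its own rules, so the log of a real run will differ.

```python
from math import ceil


def halving_schedule(n_candidates, min_resources, max_resources, factor=2):
    """Idealised successive halving: (candidates, samples) for each round."""
    schedule = []
    resources = min_resources
    while n_candidates >= 1 and resources <= max_resources:
        schedule.append((n_candidates, resources))
        if n_candidates == 1:
            break
        n_candidates = ceil(n_candidates / factor)  # keep the best 1/factor
        resources *= factor                         # give survivors more data
    return schedule


# Invented example: 16 candidates, starting with 50 of 800 samples.
for n, r in halving_schedule(16, 50, 800):
    print(f"{n:2d} candidates evaluated on {r} samples")
```

Most candidates are therefore only ever scored on a small subset of the training data, which is what makes large grids affordable.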

#### 2.1.2 :  Multinomial Naïve Bayes 

In [None]:
### build a pipeline
mnb_pipe = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB()),])

# see parameters for model 
mnb_pipe.get_params().keys()


### define parameters to be tested using k-fold CV
mnb_params = {'vectorizer__max_features': [1000,1500,2000,2500], #5000, 7000
                  'vectorizer__ngram_range': [(1, 1), (1, 2),(1, 3)],
                  'vectorizer__max_df': (0.6, 0.8, 0.9, 0.99),
                 # 'vectorizer__min_df': [3,4],
                  'vectorizer__stop_words': ['english', None],
                  'vectorizer__smooth_idf': [True, False],
                  'vectorizer__use_idf': [True, False],
                  'classifier__fit_prior': [True, False],
                  'classifier__alpha': [0.6, 0.7, 0.8],} # (1e-2, 1e-3) 


mnb_gs = HalvingGridSearchCV(mnb_pipe,mnb_params,cv=5,n_jobs=-1,scoring='f1_micro',verbose=1,factor=2)#5-fold, computation will be dispatched on all the CPUs


### train best estimator on training data 
mnb_gs = mnb_gs.fit(X_train, y_train)
print('Best score on training data:',mnb_gs.score(X_train, y_train))
print('Best score on testing data:',mnb_gs.score(X_test, y_test))
print('Best score',mnb_gs.best_score_)
print('Best parameters',mnb_gs.best_params_)


## test classifier 
y_pred = mnb_gs.predict(X_test)
predicted_prob = mnb_gs.predict_proba(X_test)

####################################################### Evaluation 

classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values
    
## Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, y_pred)
auc = metrics.roc_auc_score(y_test, predicted_prob, 
                            multi_class="ovr")
print("Accuracy:",  round(accuracy,2))
print("Auc:", round(auc,2))
print("Multinomial Naïve Bayes Details:")
print(metrics.classification_report(y_test, y_pred))
    
## Plot confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues, 
            cbar=False)
ax.set(xlabel="Pred", ylabel="True", xticklabels=classes, 
       yticklabels=classes, title="Confusion matrix: Multinomial Naïve Bayes")
plt.yticks(rotation=0)

fig, ax = plt.subplots(nrows=1, ncols=2)
## Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i],  
                           predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3, 
              label='{0} (area={1:0.2f})'.format(classes[i], 
                              metrics.auc(fpr, tpr))
               )
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05], 
          xlabel='False Positive Rate', 
          ylabel="True Positive Rate (Recall)", 
          title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)
    
## Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(
                 y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3, 
               label='{0} (area={1:0.2f})'.format(classes[i], 
                                  metrics.auc(recall, precision))
              )
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall', 
          ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()

#### Other confusion matrix

labels = classes 
print(classification_report(y_test, y_pred, labels=labels)) # classification report from sklearn; `labels` must be passed by keyword
cnf_matrix = confusion_matrix(y_test, y_pred, labels=labels)
plt.imshow(cnf_matrix, cmap=plt.cm.Blues) #plot confusion matrix grid
threshold = cnf_matrix.max() / 2 #threshold to define text color
for i in range(cnf_matrix.shape[0]): #print text in grid
    for j in range(cnf_matrix.shape[1]): 
        plt.text(j, i, cnf_matrix[i,j], color="w" if cnf_matrix[i,j] > threshold else 'black')
tick_marks = np.arange(len(labels)) #define labeling spacing based on number of classes
plt.xticks(tick_marks, labels, rotation=45)
plt.yticks(tick_marks, labels)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.colorbar()
plt.tight_layout()
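
All grid searches here are scored with `scoring='f1_micro'`. For single-label multiclass data, where every tweet gets exactly one label, micro-averaged F1 pools true positives, false positives and false negatives over all classes and ends up equal to plain accuracy. The hand-rolled sketch below checks this on invented labels; it is an illustration, not scikit-learn's implementation.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label multiclass data."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            tp += (t == c and p == c)
            fp += (t != c and p == c)
            fn += (t == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels: 4 of 6 predictions are correct.
toy_true = ["vaxx", "vaxx", "neutral", "trash", "anti-vaxx", "neutral"]
toy_pred = ["vaxx", "neutral", "neutral", "trash", "vaxx", "neutral"]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(round(micro_f1(toy_true, toy_pred), 4), round(toy_accuracy, 4))
```

If the minority classes matter more than overall accuracy, `f1_macro` is a scoring option that weights every class equally.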


#### 2.1.3 :  Support vector machine 

In [None]:
### build a pipeline
svm_pipe = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', SVC(random_state=2,probability=True)),]) # probability must be = True, otherwise no 'predict_proba' 

# see parameters for model 
svm_pipe.get_params().keys()


### define parameters to be tested using k-fold CV ###CHANGE###

# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html 
svm_params = {'classifier__gamma': [0.6,0.5,0.4,0.3,0.1], # ['scale', 'auto'] 0.01,0.001 
             'vectorizer__max_df': [0.5, 0.6, 0.7],
            # 'vectorizer__min_df': [4,5],
            # 'classifier__class_weight' : ['balanced'],
             'classifier__degree':[1,2,3,4,5,6],
            # 'classifier__probability': [True, False],     # probability must be = True, otherwise no 'predict_proba' 
             'vectorizer__stop_words': ['english', None],
             'vectorizer__use_idf': [True, False], 
             'classifier__kernel': ['poly', 'rbf', 'sigmoid', 'linear'],
             'classifier__C': [4,6,8,10]} # 0.1, 1, 2, 100, 1000


    

svm_gs = HalvingGridSearchCV(svm_pipe,svm_params,cv=5,n_jobs=-1, verbose=1, scoring='f1_micro', factor=2)#5-fold, computation will be dispatched on all the CPUs


### train best estimator on training data 
svm_gs = svm_gs.fit(X_train, y_train)
print('Best score on training data:',svm_gs.score(X_train, y_train))
print('Best score on testing data:',svm_gs.score(X_test, y_test))
print('Best score',svm_gs.best_score_)
print('Best parameters',svm_gs.best_params_)

## test classifier 
y_pred = svm_gs.predict(X_test)
predicted_prob = svm_gs.predict_proba(X_test)

####################################################### Evaluation 

classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values
    
## Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, y_pred)
auc = metrics.roc_auc_score(y_test, predicted_prob, 
                            multi_class="ovr")
print("Accuracy:",  round(accuracy,2))
print("Auc:", round(auc,2))
print("SVM Details:")
print(metrics.classification_report(y_test, y_pred))
    
## Plot confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues, 
            cbar=False)
ax.set(xlabel="Pred", ylabel="True", xticklabels=classes, 
       yticklabels=classes, title="Confusion matrix: SVM")
plt.yticks(rotation=0)

fig, ax = plt.subplots(nrows=1, ncols=2)
## Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i],  
                           predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3, 
              label='{0} (area={1:0.2f})'.format(classes[i], 
                              metrics.auc(fpr, tpr))
               )
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05], 
          xlabel='False Positive Rate', 
          ylabel="True Positive Rate (Recall)", 
          title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)
    
## Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(
                 y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3, 
               label='{0} (area={1:0.2f})'.format(classes[i], 
                                  metrics.auc(recall, precision))
              )
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall', 
          ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()

#### Other confusion matrix
labels = classes 
print(classification_report(y_test, y_pred, labels=labels)) # classification report from sklearn; `labels` must be passed by keyword
cnf_matrix = confusion_matrix(y_test, y_pred, labels=labels)
plt.imshow(cnf_matrix, cmap=plt.cm.Blues) #plot confusion matrix grid
threshold = cnf_matrix.max() / 2 #threshold to define text color
for i in range(cnf_matrix.shape[0]): #print text in grid
    for j in range(cnf_matrix.shape[1]): 
        plt.text(j, i, cnf_matrix[i,j], color="w" if cnf_matrix[i,j] > threshold else 'black')
tick_marks = np.arange(len(labels)) #define labeling spacing based on number of classes
plt.xticks(tick_marks, labels, rotation=45)
plt.yticks(tick_marks, labels)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.colorbar()
plt.tight_layout()
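
The ROC and precision-recall plots above are drawn one class at a time: column `i` of the one-hot matrix is treated as a binary "class `i` versus the rest" target and compared with column `i` of `predict_proba`. This only works if both use the same class order. The sketch below, on invented labels, shows that `np.unique` and `pd.get_dummies` both return the classes sorted; scikit-learn classifiers store `classes_` in the same sorted order.

```python
import numpy as np
import pandas as pd

# Invented test labels, deliberately not in alphabetical order.
toy_y = pd.Series(["vaxx", "neutral", "anti-vaxx", "trash", "neutral"])

toy_classes = np.unique(toy_y)                   # sorted class names
one_hot = pd.get_dummies(toy_y, drop_first=False)

print(list(toy_classes))
print(list(one_hot.columns))

# One column = one binary "this class vs. rest" target.
is_neutral = one_hot["neutral"].to_numpy().astype(int)
print(is_neutral)
```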

#### 2.1.4 :  Bagging *(Bootstrap aggregation)*

In [None]:
### build a pipeline
bag_pipe = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', BaggingClassifier(random_state=14, verbose=True)),])
# see parameters for model 
bag_pipe.get_params().keys()



### define parameters to be tested using k-fold CV
bag_params = {'classifier__n_estimators': [1400,1500,1600],
              'classifier__max_features' : [150,200,220], #250 is too much
             'vectorizer__max_df': (0.8,0.9, 0.99),
            # 'vectorizer__min_df': [4,5,6,7],
             'vectorizer__use_idf': [True, False],}


### perform gridsearch using clf and params
bag_gs = HalvingGridSearchCV(bag_pipe,bag_params,cv=5,n_jobs=-1, scoring='f1_micro',verbose=0, factor=2) #5-fold, n_jobs =-1 :computation will be dispatched on all the CPUs
                             
                             
### train best estimator on training data 
bag_gs = bag_gs.fit(X_train, y_train)
print('Best score on testing data:',bag_gs.score(X_test, y_test))
print('Best score on training data:',bag_gs.score(X_train, y_train))
print('Best score',bag_gs.best_score_)
print('Best parameters',bag_gs.best_params_)


## test classifier 
y_pred = bag_gs.best_estimator_.predict(X_test)
predicted_prob = bag_gs.best_estimator_.predict_proba(X_test)



############################################## Evaluation ####################################################

classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values


### Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, y_pred)
auc = metrics.roc_auc_score(y_test, predicted_prob, 
                            multi_class="ovr")
print("Accuracy:",  round(accuracy,2))
print("Auc:", round(auc,2))
print("Bagging Classifier Details:")
print(metrics.classification_report(y_test, y_pred))
    
    
    
### Plot confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues, 
            cbar=False)
ax.set(xlabel="Predicted label", ylabel="True label", xticklabels=classes, 
       yticklabels=classes, title="Confusion matrix: Bagging Classifier")
plt.yticks(rotation=0)
fig, ax = plt.subplots(nrows=1, ncols=2)

### Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i],  
                           predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3, 
              label='{0} (area={1:0.2f})'.format(classes[i], 
                              metrics.auc(fpr, tpr))
               )
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05], 
          xlabel='False Positive Rate', 
          ylabel="True Positive Rate (Recall)", 
          title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)

    
### Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(
                 y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3, 
               label='{0} (area={1:0.2f})'.format(classes[i], 
                                  metrics.auc(recall, precision))
              )
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall', 
          ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()


### Confusion matrix vol 2.0
labels = classes
print(classification_report(y_test, y_pred, labels=labels)) # classification report from sklearn; `labels` must be passed by keyword
cnf_matrix = confusion_matrix(y_test, y_pred, labels=labels)
plt.imshow(cnf_matrix, cmap=plt.cm.Blues) #plot confusion matrix grid
threshold = cnf_matrix.max() / 2 #threshold to define text color
for i in range(cnf_matrix.shape[0]): #print text in grid
    for j in range(cnf_matrix.shape[1]): 
        plt.text(j, i, cnf_matrix[i,j], color="w" if cnf_matrix[i,j] > threshold else 'black')
tick_marks = np.arange(len(labels)) #define labeling spacing based on number of classes
plt.xticks(tick_marks, labels, rotation=45)
plt.yticks(tick_marks, labels)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.colorbar()
plt.tight_layout()
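
The confusion matrices plotted above follow the scikit-learn layout: rows are true labels and columns are predicted labels. A hand-rolled sketch on invented labels makes that layout explicit (the function name is hypothetical and this is not scikit-learn's implementation):

```python
def confusion_counts(y_true, y_pred, label_order):
    """Count matrix with true labels as rows and predictions as columns."""
    position = {label: i for i, label in enumerate(label_order)}
    matrix = [[0] * len(label_order) for _ in label_order]
    for t, p in zip(y_true, y_pred):
        matrix[position[t]][position[p]] += 1
    return matrix


toy_order = ["anti-vaxx", "neutral", "vaxx"]
toy_true = ["vaxx", "vaxx", "neutral", "anti-vaxx", "neutral"]
toy_pred = ["vaxx", "neutral", "neutral", "vaxx", "neutral"]

toy_matrix = confusion_counts(toy_true, toy_pred, toy_order)
for label, row in zip(toy_order, toy_matrix):
    print(f"{label:>10}: {row}")
```

The diagonal holds the correct predictions; reading along a row shows where the tweets of one true class ended up.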

#### 2.1.5 :  Gradient boosting 

In [None]:
### build a pipeline
gb = Pipeline([
    ('vectorizer', TfidfVectorizer()), #max_df=0.7, min_df=4,use_idf=False)
    ('classifier', GradientBoostingClassifier(random_state=14)),])


### define parameters to be tested using k-fold CV
gb_params = {'classifier__learning_rate': [0.4,0.5,0.6,0.7],
          #   'classifier__loss': ['deviance', 'exponential'], 
             'classifier__n_estimators' : [175,200,250],
              'classifier__max_depth' : [10,12,15],
              'classifier__subsample' : [0.6,0.7,0.8,0.9],
              'vectorizer__use_idf': [True, False],}
 
### perform gridsearch using clf and params
gb_gs = HalvingGridSearchCV(gb,gb_params,cv=5, scoring='f1_micro',verbose=1,n_jobs=-1,factor=2)#5-fold, computation will be dispatched on all the CPUs


### train best estimator on training data 
gb_gs = gb_gs.fit(X_train, y_train)
print('Best score on testing data:',gb_gs.score(X_test, y_test))
print('Best score on training data:',gb_gs.score(X_train, y_train))
print('Best score',gb_gs.best_score_)
print('Best parameters',gb_gs.best_params_)


## test classifier 
y_pred = gb_gs.best_estimator_.predict(X_test)
predicted_prob = gb_gs.best_estimator_.predict_proba(X_test)



############################################## Evaluation ####################################################

classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values


### Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, y_pred)
auc = metrics.roc_auc_score(y_test, predicted_prob, 
                            multi_class="ovr")
print("Accuracy:",  round(accuracy,2))
print("Auc:", round(auc,2))
print("Gradient Boosting Classifier Details:")
print(metrics.classification_report(y_test, y_pred))
    
    
    
### Plot confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues, 
            cbar=False)
ax.set(xlabel="Predicted label", ylabel="True label", xticklabels=classes, 
       yticklabels=classes, title="Confusion matrix: Gradient Boosting Classifier")
plt.yticks(rotation=0)
fig, ax = plt.subplots(nrows=1, ncols=2)

### Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i],  
                           predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3, 
              label='{0} (area={1:0.2f})'.format(classes[i], 
                              metrics.auc(fpr, tpr))
               )
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05], 
          xlabel='False Positive Rate', 
          ylabel="True Positive Rate (Recall)", 
          title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)

    
### Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(
                 y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3, 
               label='{0} (area={1:0.2f})'.format(classes[i], 
                                  metrics.auc(recall, precision))
              )
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall', 
          ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()


### Confusion matrix vol 2.0
labels = classes
print(classification_report(y_test, y_pred, labels=labels)) # classification report from sklearn; `labels` must be passed by keyword
cnf_matrix = confusion_matrix(y_test, y_pred, labels=labels)
plt.imshow(cnf_matrix, cmap=plt.cm.Blues) #plot confusion matrix grid
threshold = cnf_matrix.max() / 2 #threshold to define text color
for i in range(cnf_matrix.shape[0]): #print text in grid
    for j in range(cnf_matrix.shape[1]): 
        plt.text(j, i, cnf_matrix[i,j], color="w" if cnf_matrix[i,j] > threshold else 'black')
tick_marks = np.arange(len(labels)) #define labeling spacing based on number of classes
plt.xticks(tick_marks, labels, rotation=45)
plt.yticks(tick_marks, labels)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.colorbar()
plt.tight_layout()

## 3. Selecting best model > refit on all labels >  predict on unlabelled tweets 

In [None]:
''' 
As we don't want to predict labels for tweets the classifier has been trained on, 
we subset the dataset to only contain unlabelled data and thereby avoid any spill-over from training. 
At the same time, this allows us to retrain the classifier on the full labelled dataset (X_train, y_train, X_test, y_test), 
thereby making use of the scarce labelled data in an attempt to improve the predictions on the unlabelled data. 

Subsequently, the classifier predictions will be evaluated manually by drawing a stratified random sample
from the label predictions, manually assessing the labels, and creating a confusion matrix. 

'''

### create dataset for testing and manual evaluation of the classifier predictions 
# merge the full and the labelled dataframes 
evaldf = pd.merge(df, data, how="outer") #keep all data on index
evaldf.fillna('unknown', inplace=True) # 'unknown' for the tweets without labels 

### drop the columns we don't need 
evaldf = evaldf.drop(['date', 'preprocess','preprocess_no_mention',
       'token', 'lemma', 'token_no_mention', 'Unnamed: 0'], axis = 1)

### drop the rows that already have labels (we don't want spill-over from fitting the classifier) 
# Get indexes of tweets to remove 
rows= evaldf[(evaldf['label'] != 'unknown')].index # tweets that are not 'unknown'
# Delete these row indexes from dataFrame
evaldf.drop(rows, inplace=True)
evaldf=evaldf.reset_index()

### drop the 'label' column, as these tweets have no labels 
evaldf = evaldf.drop(['label'], axis = 1)

# create validation set (to be used later for manual evaluation)
XVal = evaldf['lemma_no_mention'] # 
len(XVal) # should be 2404 for danish (for Pl and GER: should be len(DF) minus len(Data))
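
The cell above isolates the unlabelled tweets with an outer merge followed by a filter. The sketch below replays the same steps on invented data so the intermediate result is easy to inspect; the frames and the `tweet_id` column are hypothetical stand-ins for `df` and `data`.

```python
import pandas as pd

# Invented stand-ins for the full corpus and the labelled subset.
full = pd.DataFrame({"tweet_id": [1, 2, 3, 4],
                     "lemma_no_mention": ["a", "b", "c", "d"]})
labelled_part = pd.DataFrame({"tweet_id": [2, 4],
                              "lemma_no_mention": ["b", "d"],
                              "label": ["vaxx", "neutral"]})

# With no `on=` argument, merge joins on every column the frames share.
merged = pd.merge(full, labelled_part, how="outer")
merged["label"] = merged["label"].fillna("unknown")

# Keep only the rows that never received a label.
unlabelled = merged[merged["label"] == "unknown"].drop(columns="label")
unlabelled = unlabelled.reset_index(drop=True)
print(unlabelled)
```

Because the join uses all shared columns, a tweet only counts as labelled when every shared column matches exactly, which is why comparing `len(XVal)` against the expected number is a useful check.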

In [None]:
### Select best classifier with optimal hyperparameters 
clf_pipe= Pipeline([
    ('vectorizer', TfidfVectorizer(use_idf=False, max_df=0.6)), 
    ('classifier', SVC(C=8, degree=4,gamma=0.4, kernel='sigmoid',probability=True, random_state=10)),]) 


# fit the classifier on the full labelled dataset(X,y) 
clf = clf_pipe.fit(X, y)


# predict labels for the unlabelled dataset (XVal, yVal)
yVal = clf.predict(XVal)
predict_prob = clf.predict_proba(XVal)

# save predicted labels with the tweets in the df
evaldf['prediction']=pd.Series(yVal)
evaldf

In [None]:
#save the df to use for network
evaldf.to_csv(r'C:\Users\Frederikke\OneDrive\MSc. Social Data Science\Exam\da_labels_pred.csv') # change to own directory 

## 4. Draw random sample to evaluate classifier


#### 4.1 Inspecting class proportions

In [None]:
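evaldf['prediction'].value_counts().plot(kind='bar', title='Predicted label') # class proportions among the predictions
plt.show()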

#### Draw random sample

In [None]:
noStrat = evaldf.groupby('prediction', group_keys=False).apply(lambda x: x.sample(50).sample(frac=1)).reset_index(drop=True)
print(noStrat.groupby("prediction").count())
noStrat.loc[noStrat['prediction'] == 'anti-vaxx']
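
The cell above draws 50 tweets from each predicted class with `groupby(...).apply(...)`. An alternative sketch of the same idea uses pandas' `DataFrameGroupBy.sample`, shown here on invented predictions with a smaller, hypothetical sample size:

```python
import pandas as pd

# Invented predictions: 6 'neutral', 4 'vaxx', 3 'anti-vaxx'.
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(13)],
    "prediction": ["neutral"] * 6 + ["vaxx"] * 4 + ["anti-vaxx"] * 3,
})

n_per_class = 3  # the notebook draws 50; 3 keeps the toy example small

sample = (
    toy.groupby("prediction")
       .sample(n=n_per_class, random_state=0)  # same number from every class
       .sample(frac=1, random_state=0)         # shuffle so classes are mixed
       .reset_index(drop=True)
)
print(sample["prediction"].value_counts())
```

Drawing the same number from every predicted class oversamples rare classes. That is convenient for estimating per-class precision, but the pooled accuracy of such a sample is not representative of the whole corpus unless it is reweighted by the class proportions.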

In [None]:
### initiate manual tweet labelling widget 

# define display function to show the preprocessed text for manual labelling
def display_text(noStrat):
    display(Markdown(noStrat["text"]))
    
# label widget 
labellerNS = ClassLabeller(
    features=noStrat, # change this to the sub-sample dataframe and correct column 
    display_func=display_text,
    options=['vaxx', 'anti-vaxx', 'neutral','trash'],
)

labellerNS

In [None]:
### save labels to list and df 
# save index and labels in a dictionary 
manlabsNS = labellerNS.queue.labels

# map the dict to a dataframe 
targetdfNS = pd.DataFrame.from_dict(manlabsNS, orient='index')
targetdfNS.columns = ['target']

### check that the tweets are shuffled in the widget BUT KEEP THEIR INDEX 
print(targetdfNS.tail(10))
      
# label a tweet in the widget (and remember it) > find index in dictionary > find tweet in df using index > same tweet? 
noStrat.iloc[5]['text'] #change to the index number we want to inspect + to the sub-sample dataframe 

In [None]:
### merge the dataframes  
targetdfNS = noStrat.merge(targetdfNS, left_index=True, right_index=True) 
targetdfNS


In [None]:
### identify correct and wrong labels 

#define conditions
conditions = [targetdfNS['target'] == targetdfNS['prediction'], # the predicted label is equal to the real label
              targetdfNS['target'] != targetdfNS['prediction']] # the predicted label is NOT equal to the real label
#define choices
choices = ['correct', 'wrong'] 

#create new column in DataFrame that displays results of comparisons
targetdfNS['evaluation'] = np.select(conditions, choices, default='Tie')
targetdfNS

In [None]:
### list of targets 
targetsNS = targetdfNS['target']

### list of predictions 
predictionsNS = targetdfNS['prediction']


### Plot confusion matrix
classes = np.unique(targetsNS)
labels = classes 
print(classification_report(targetsNS, predictionsNS, labels=labels)) # classification report from sklearn; `labels` must be passed by keyword
cnf_matrix = confusion_matrix(targetsNS, predictionsNS, labels=labels)
plt.imshow(cnf_matrix, cmap=plt.cm.Blues) #plot confusion matrix grid
threshold = cnf_matrix.max() / 2 #threshold to define text color
for i in range(cnf_matrix.shape[0]): #print text in grid
    for j in range(cnf_matrix.shape[1]): 
        plt.text(j, i, cnf_matrix[i,j], color="w" if cnf_matrix[i,j] > threshold else 'black')
tick_marks = np.arange(len(labels)) #define labeling spacing based on number of classes
plt.xticks(tick_marks, labels, rotation=45)
plt.yticks(tick_marks, labels)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.colorbar()
plt.tight_layout()

In [None]:
# save the df 
targetdfNS.to_csv(r'C:\Users\Frederikke\OneDrive\MSc. Social Data Science\Exam\da_labels_done.csv') # change to own directory 