# PROPAGANDA CLASSIFICATION MODEL USING SENTENCE EMBEDDINGS

In [1]:
pip install -U sentence-transformers


In this notebook I tune a classification model that takes sentence embeddings as features.
The steps are as follows:
* Transform each sentence into an embedding using a pre-trained BERT model from the sentence-transformers library (built on Hugging Face transformers)
    * The embeddings bring context-dependent meaning into the model.
    * Because propaganda and non-propaganda sentences share so much grammatical structure and vocabulary, this seems like one of the most promising inputs.
* Split the embeddings and their labels into train and test sets
* Tune different classification models. My research suggested that Logistic Regression and SGD Classifier tend to do well with text embeddings. This notebook includes only the final model with the best hyperparameters.


Evaluation Metrics:
I optimized for propaganda-class recall while keeping propaganda-class precision above 0.50. Since propaganda is the minority class (about 30% of the dataset), I wanted to prioritize a model that identifies as many of the actual propaganda instances as possible.
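
As a minimal sketch of this selection rule, the snippet below picks, among several candidate models, the one with the highest propaganda-class recall whose precision meets the floor. The `Candidate` type, the `pick_best` helper, and the scores are hypothetical and only illustrate the criterion; they are not results from this notebook.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    precision: float  # propaganda-class precision, between 0 and 1
    recall: float     # propaganda-class recall, between 0 and 1


def pick_best(candidates: Sequence[Candidate],
              min_precision: float = 0.50) -> Optional[Candidate]:
    """Return the highest-recall candidate that meets the precision floor."""
    eligible = [c for c in candidates if c.precision >= min_precision]
    if not eligible:
        return None  # no candidate satisfies the constraint
    return max(eligible, key=lambda c: c.recall)


# Hypothetical scores, for illustration only
candidates = [
    Candidate("model_a", precision=0.62, recall=0.41),
    Candidate("model_b", precision=0.51, recall=0.69),
    Candidate("model_c", precision=0.44, recall=0.83),
]

best = pick_best(candidates)
print(best.name)
```

Here `model_c` has the highest recall but is excluded by the precision floor, so the rule falls back to the best of the remaining candidates.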

The best model ended up being a tuned Logistic Regression. It reached a propaganda-class F1 score of 0.586 (close to the researchers' value of 0.60). However, it achieved a substantially higher propaganda-class recall.



## IMPORTS

In [5]:
import numpy as np
import pandas as pd
import en_core_web_sm
from wordcloud import WordCloud, STOPWORDS
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
import re

In [6]:
df = pd.read_csv('meta_features.csv')

## Loading in Pre-Trained BERT Sentence Embedding Model 

In [7]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [26]:
import pickle

# Save the pre-trained sentence-embedding model to disk
filename = 'BERT_embeds_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)



In [8]:
X_text = df['text']
y_text = df['propaganda']

## Transforming text into sentence-embeddings
* Usually we would do this only after the train-test split. However, because the model is pre-trained, we are only TRANSFORMING our data, not fitting anything to it. Data leakage occurs when we FIT on all of the data, so it is not a concern here.
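
The distinction can be illustrated with a small, self-contained sketch. The `encode` function is a toy stand-in for a frozen, pre-trained encoder (it is not the BERT model), and `fit_center` is a hypothetical fitted transform that learns column means. The sentences are made up for the example.

```python
from statistics import mean
from typing import Callable, List


def encode(sentence: str) -> List[float]:
    """Toy frozen encoder: the output depends only on the input sentence."""
    return [float(len(sentence)), float(sentence.count(" ") + 1)]


def fit_center(rows: List[List[float]]) -> Callable[[List[float]], List[float]]:
    """Fitted transform: learns column means from the rows it is given."""
    means = [mean(col) for col in zip(*rows)]
    return lambda row: [x - m for x, m in zip(row, means)]


train = ["we must act now", "the report was published"]
test = ["they are lying to you"]

# Stateless transform: the test sentence gets the same vector whether it is
# encoded alone or together with the training sentences.
alone = encode(test[0])
together = [encode(s) for s in train + test][-1]
stateless_same = alone == together

# Fitted transform: means learned on train+test differ from means learned on
# train only, so test rows would influence the training features.
center_train_only = fit_center([encode(s) for s in train])
center_all = fit_center([encode(s) for s in train + test])
x = encode(train[0])
fitted_differs = center_train_only(x) != center_all(x)

print(stateless_same, fitted_differs)
```

Any step that learns parameters from the data (scaling, feature selection, the classifier itself) still belongs after the split.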

In [9]:
X_embeddings = model.encode(X_text)

In [10]:
import pandas as pd

In [11]:
X_embeds = pd.DataFrame(X_embeddings)

In [12]:
y_labels = [1 if label == 'propaganda' else 0 for label in y_text]

### Saving embeddings for future use since transformation takes time and computing power

In [24]:
pd.DataFrame(X_embeddings).to_csv('sentence_embeddings_all.csv')
pd.DataFrame(y_labels).to_csv('y_labels_all.csv')

In [33]:
test = pd.read_csv('sentence_embeddings_all.csv').drop('Unnamed: 0', axis=1)

In [37]:
test.shape

(15172, 768)

In [36]:
test_y = pd.read_csv('y_labels_all.csv').drop('Unnamed: 0', axis=1)
test_y.shape

(15172, 1)

In [13]:
embeds_train, embeds_test, y_train, y_test = train_test_split(
    X_embeddings, y_labels, test_size=0.33, random_state=42)

# Best Logistic Regression

In [15]:
import sklearn
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

In [16]:
from sklearn.metrics import roc_auc_score

In [17]:
# Commented-out grid search (l1 = lasso, l2 = ridge):
# grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}
# logreg_cv = GridSearchCV(logreg, grid, cv=10, scoring='f1_weighted')

# Final model with the best hyperparameters
logreg2 = LogisticRegression(C=5, penalty='l2', solver='newton-cg',
                             class_weight='balanced', max_iter=1000)
logreg_embeddings_2 = logreg2.fit(embeds_train, y_train)
logreg_embeds_preds_2 = logreg_embeddings_2.predict(embeds_test)

# Print the confusion matrix
print(sklearn.metrics.confusion_matrix(y_test, logreg_embeds_preds_2))

# Print the precision and recall, among other metrics
print(sklearn.metrics.classification_report(y_test, logreg_embeds_preds_2, digits=3))

# ROC AUC computed from hard 0/1 predictions, which equals balanced accuracy
roc_auc_score(y_test, logreg_embeds_preds_2)

[[2509 1001]
 [ 461 1036]]
              precision    recall  f1-score   support

           0      0.845     0.715     0.774      3510
           1      0.509     0.692     0.586      1497

    accuracy                          0.708      5007
   macro avg      0.677     0.703     0.680      5007
weighted avg      0.744     0.708     0.718      5007



0.7034327915089438
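
The propaganda-class figures in the report can be re-derived by hand from the confusion matrix, which is a useful sanity check. The sketch below uses only the four counts printed above. It also shows why the final number matches balanced accuracy: `roc_auc_score` was given hard 0/1 predictions rather than probabilities, so it reduces to the mean of the two per-class recalls.

```python
# Counts from the confusion matrix above: rows = true class, columns = predicted
tn, fp = 2509, 1001   # true class 0 (non-propaganda)
fn, tp = 461, 1036    # true class 1 (propaganda)

precision = tp / (tp + fp)        # share of predicted propaganda that is correct
recall = tp / (tp + fn)           # share of actual propaganda that is found
f1 = 2 * precision * recall / (precision + recall)

specificity = tn / (tn + fp)      # recall of the non-propaganda class
balanced_accuracy = (recall + specificity) / 2

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"balanced accuracy={balanced_accuracy:.4f}")
```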

In [27]:
filename2 = 'final_prop_model.sav'
with open(filename2, 'wb') as f:
    pickle.dump(logreg_embeddings_2, f)

In [18]:
import pickle 
  
# Save the trained model as a pickle string. 
saved_model_embeds = pickle.dumps(logreg_embeddings_2) 
  
# Load the pickled model 
log_from_pickle = pickle.loads(saved_model_embeds) 
  
# Use the loaded pickled model to make predictions 
log_from_pickle.predict(embeds_test) 

array([1, 1, 0, ..., 0, 1, 0])
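
To reuse the model in a later session, the file written to disk (`final_prop_model.sav`) can be loaded back in the same way. The sketch below shows the disk round trip with a plain dictionary as a stand-in for the fitted classifier and a temporary directory in place of the working directory. Both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted classifier; any picklable object behaves the same way
stand_in_model = {"C": 5, "penalty": "l2"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_prop_model.sav")

    # Write the object to disk
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back, as a later session would
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

Only unpickle files from a trusted source, and load scikit-learn models with the same library version that saved them where possible.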