# PROPAGANDA CLASSIFICATION MODEL OF META FEATURES

In this notebook I optimize the classification model taking in engineered meta-features. 
The order goes as follows:
* Train-Test split
* Fitting scaler to training data, transforming training and testing data with fit scaler
* Optimize different classification models. Optimized a Logistic Regression, Random Forest, Gradient Boosted Decision Tree. I include the optimized model for all three alogrithms. 


Evaluation Metrics:
Optimizing for Propaganda-class recall while maintaining a Propaganda-class precicion score above 50. Since Propaganda-class is a minority class (composoing about 30% of the dataset), I wanted to prioritize a model that can identify as many propaganda instances out of the total amount of propaganda instances as possible.

The best model ended up being a tuned Random Forest. However, since I created a stacked model that needed to output probabilities, I will use the best version of the Logistic Regression for the stacked model.



## Imports

In [4]:
import numpy as np
import pandas as pd
import en_core_web_sm
from wordcloud import WordCloud, STOPWORDS
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
import re
import sklearn

## Loading in Meta-Features

In [5]:
df = pd.read_csv('meta_features.csv')

## Visualizing DataFrame

In [6]:
df.head()

Unnamed: 0,article_id,propaganda,propaganda_type,text,prop_txt_snippet,sent_#,sentiment_score,abs_sent_score,punct_count,word_count,%adj,%verb,%adv,%noun,avg_word_length,strong_subjectives_count
0,701225819,non-propaganda,,South Florida Muslim Leader Sofian Zakkout’s D...,,1,0.0,0.0,0,9,0.0,0.0,0.0,0.0,5.444444,0
1,701225819,propaganda,"Name_Calling,Labeling","David Duke, the white supremacist icon and for...",Grand Wizard of the Ku Klux Klan,2,0.5423,0.5423,4,26,0.020548,0.006849,0.013699,0.006849,4.423077,2
2,701225819,propaganda,Loaded_Language,"However, one individual who represents the Mus...",enamored,3,0.3612,0.3612,4,27,0.017241,0.017241,0.005747,0.022989,5.0,0
3,701225819,non-propaganda,,"Last month, once again, Zakkout chose to showc...",,4,0.0,0.0,5,22,0.021127,0.021127,0.014085,0.035211,5.045455,0
4,701225819,non-propaganda,,The postings can be rivaled only by Zakkout’s ...,,5,0.0,0.0,1,11,0.014493,0.043478,0.014493,0.028986,4.636364,0


## Dropping Non-Meta and Deterministic Columns

In [7]:
meta_df = df.drop(['propaganda_type','text','prop_txt_snippet','sent_#','article_id'], axis = 1)

## Previewing Final DataFrame and Missing Values Before Diving In

In [8]:
meta_df.head()

Unnamed: 0,propaganda,sentiment_score,abs_sent_score,punct_count,word_count,%adj,%verb,%adv,%noun,avg_word_length,strong_subjectives_count
0,non-propaganda,0.0,0.0,0,9,0.0,0.0,0.0,0.0,5.444444,0
1,propaganda,0.5423,0.5423,4,26,0.020548,0.006849,0.013699,0.006849,4.423077,2
2,propaganda,0.3612,0.3612,4,27,0.017241,0.017241,0.005747,0.022989,5.0,0
3,non-propaganda,0.0,0.0,5,22,0.021127,0.021127,0.014085,0.035211,5.045455,0
4,non-propaganda,0.0,0.0,1,11,0.014493,0.043478,0.014493,0.028986,4.636364,0


In [9]:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15172 entries, 0 to 15171
Data columns (total 11 columns):
propaganda                  15172 non-null object
sentiment_score             15172 non-null float64
abs_sent_score              15172 non-null float64
punct_count                 15172 non-null int64
word_count                  15172 non-null int64
%adj                        15172 non-null float64
%verb                       15172 non-null float64
%adv                        15172 non-null float64
%noun                       15172 non-null float64
avg_word_length             15172 non-null float64
strong_subjectives_count    15172 non-null int64
dtypes: float64(7), int64(3), object(1)
memory usage: 1.3+ MB


## Train-Test Split

In [10]:
y = meta_df['propaganda']
X = meta_df.drop('propaganda', axis=1)

In [11]:
y = [1 if label == 'propaganda' else 0 for label in y]

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Feature Scaling

In [13]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


In [14]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Dummy Classifier

In [15]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(X_train_scaled, y_train)

dummy_preds = dummy_clf.predict(X_test_scaled)

In [16]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [17]:
# Print the confusion matrix
print(sklearn.metrics.confusion_matrix(y_test, dummy_preds))

# Print the precision and recall, among other metrics
print(sklearn.metrics.classification_report(y_test, dummy_preds, digits=3))

[[3510    0]
 [1497    0]]
              precision    recall  f1-score   support

           0      0.701     1.000     0.824      3510
           1      0.000     0.000     0.000      1497

    accuracy                          0.701      5007
   macro avg      0.351     0.500     0.412      5007
weighted avg      0.491     0.701     0.578      5007



  _warn_prf(average, modifier, msg_start, len(result))


In [18]:
roc_auc_score(y_test, dummy_preds)

NameError: name 'roc_auc_score' is not defined

## Logistic Regression

In [19]:
from sklearn import linear_model
from sklearn import ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

In [20]:
# Create logistic regression
logistic = linear_model.LogisticRegression()

In [21]:
hyperparam_grid_logistic = {'penalty' : ['l1', 'l2'],
    'C' : np.logspace(-4, 4, 20),
    'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'class_weight': 'balanced'}


In [22]:
# # Create regularization penalty space
# penalty = ['l1', 'l2']

# # Create regularization hyperparameter distribution using uniform distribution
# C = uniform(loc=0, scale=4)

# # Create hyperparameter options
# hyperparameters = dict(C=C, penalty=penalty)

In [23]:
# Create randomized search 5-fold cross validation and 100 iterations
clf_log = RandomizedSearchCV(logistic, hyperparam_grid_logistic, random_state=1, n_iter=200, cv=5, 
                         verbose=True, n_jobs=-1, scoring = 'roc_auc')

In [24]:
# Fit randomized search
best_model_log = clf_log.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   10.9s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   22.6s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:   41.8s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:   53.9s finished


In [25]:
# View best hyperparameters
print('Best Penalty:', best_model_log.best_estimator_.get_params()['penalty'])
print('Best C:', best_model_log.best_estimator_.get_params()['C'])
print('Best solver:', best_model_log.best_estimator_.get_params()['solver'])

Best Penalty: l2
Best C: 0.012742749857031334
Best solver: sag


In [26]:
# Predict target vector
log_preds = best_model_log.predict(X_test_scaled)

In [27]:
# Print the confusion matrix
print(sklearn.metrics.confusion_matrix(y_test, log_preds))

# Print the precision and recall, among other metrics
print(sklearn.metrics.classification_report(y_test, log_preds, digits=3))

[[3347  163]
 [1267  230]]
              precision    recall  f1-score   support

           0      0.725     0.954     0.824      3510
           1      0.585     0.154     0.243      1497

    accuracy                          0.714      5007
   macro avg      0.655     0.554     0.534      5007
weighted avg      0.683     0.714     0.650      5007



In [28]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, log_preds)

0.553600934061856

## Random Forest

In [182]:
randomforest = ensemble.RandomForestClassifier()

In [183]:
hyperparam_grid_rf=    {'n_estimators' : list(range(10,101,10)),
    'max_features' : list(range(6,32,5)),
    'criterion':['gini','entropy'],
    'class_weight':['balanced']}


In [184]:
clf_rf = RandomizedSearchCV(randomforest, hyperparam_grid_rf, random_state=1, n_iter=200, cv=5, 
                         verbose=True, n_jobs=-1, scoring = 'f1')

In [185]:
# Fit randomized search
best_model_rf = clf_rf.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 120 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   26.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   40.1s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:  1.7min finished


In [186]:
# View best hyperparameters
print('Best Penalty:', best_model_rf.best_estimator_.get_params()['n_estimators'])
print('Best C:', best_model_rf.best_estimator_.get_params()['max_features'])
print('Best solver:', best_model_rf.best_estimator_.get_params()['criterion'])

Best Penalty: 30
Best C: 6
Best solver: entropy


In [187]:
# Predict target vector
rf_preds = best_model_rf.predict(X_test_scaled)

# Print the confusion matrix
print(sklearn.metrics.confusion_matrix(y_test, rf_preds))

# Print the precision and recall, among other metrics
print(sklearn.metrics.classification_report(y_test, rf_preds, digits=3))

print(roc_auc_score(y_test, rf_preds))

[[3196  314]
 [1153  344]]
              precision    recall  f1-score   support

           0      0.735     0.911     0.813      3510
           1      0.523     0.230     0.319      1497

    accuracy                          0.707      5007
   macro avg      0.629     0.570     0.566      5007
weighted avg      0.671     0.707     0.666      5007

0.5701671148564936


## Gradient Boosted Decision Trees

In [188]:
from sklearn.ensemble import GradientBoostingClassifier


In [189]:
clf_gboost = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1)

In [190]:
gboost_model = clf_gboost.fit(X_train_scaled, y_train)

In [191]:
# Predict target vector
gboost_preds = gboost_model.predict(X_test_scaled)

# Print the confusion matrix
print(sklearn.metrics.confusion_matrix(y_test, gboost_preds))

# Print the precision and recall, among other metrics
print(sklearn.metrics.classification_report(y_test, gboost_preds, digits=3))

print(roc_auc_score(y_test, gboost_preds))

[[3295  215]
 [1176  321]]
              precision    recall  f1-score   support

           0      0.737     0.939     0.826      3510
           1      0.599     0.214     0.316      1497

    accuracy                          0.722      5007
   macro avg      0.668     0.577     0.571      5007
weighted avg      0.696     0.722     0.673      5007

0.5765876482309348
