# PROPAGANDA CLASSIFICATION OF TFIDF TRANFORMATION OF TEXT 

In this notebook I optimize both the TFIDF vectorizor and the classification model taking in TFIDF vectorized text as features. The order goes as follows:
* Train-Test split text data
* Create special tokenizer for corpus to be called-on in TFIDF tranformer
* Fit and transform training data; transform testing data
* Evaluate performance of different TFIDF hyperparameters and stick to best transformation (for the sake of readability, this stage has been ommitted from this notebook)
* Optimize different classification models using Logistic Regression, Gradient Boosted Decicion Trees, Random Forest, and KNN


Evaluation Metrics:
Optimizing for Propaganda-class recall while maintaining a Propaganda-class precicion score above 50. Since Propaganda-class is a minority class (composoing about 30% of the dataset), I wanted to prioritize a model that can identify as many propaganda instances out of the total amount of propaganda instances as possible.

The best model ended up being a tuned Logisitc Regression.



## Imports

In [70]:
import numpy as np
import pandas as pd
import en_core_web_sm
from wordcloud import WordCloud, STOPWORDS
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
import re
import sklearn
from nltk import word_tokenize
# import en_core_web_sm
# from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer


# Create our list of stopwords
nlp = spacy.load('en_core_web_sm')
stop_words = spacy.lang.en.stop_words.STOP_WORDS

# Load English tokenizer, tagger, parser, NER and word vectors
parser = English()


In [68]:
stop_words

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [73]:
STOP_WORDS = stop_words.union({'th','st'})
STOP_WORDS

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

## Loading in data

In [3]:
df = pd.read_csv('meta_features.csv')

## Visualizing DataFrame

In [4]:
df.head()

Unnamed: 0,article_id,propaganda,propaganda_type,text,prop_txt_snippet,sent_#,sentiment_score,abs_sent_score,punct_count,word_count,%adj,%verb,%adv,%noun,avg_word_length,strong_subjectives_count
0,701225819,non-propaganda,,South Florida Muslim Leader Sofian Zakkout’s D...,,1,0.0,0.0,0,9,0.0,0.0,0.0,0.0,5.444444,0
1,701225819,propaganda,"Name_Calling,Labeling","David Duke, the white supremacist icon and for...",Grand Wizard of the Ku Klux Klan,2,0.5423,0.5423,4,26,0.020548,0.006849,0.013699,0.006849,4.423077,2
2,701225819,propaganda,Loaded_Language,"However, one individual who represents the Mus...",enamored,3,0.3612,0.3612,4,27,0.017241,0.017241,0.005747,0.022989,5.0,0
3,701225819,non-propaganda,,"Last month, once again, Zakkout chose to showc...",,4,0.0,0.0,5,22,0.021127,0.021127,0.014085,0.035211,5.045455,0
4,701225819,non-propaganda,,The postings can be rivaled only by Zakkout’s ...,,5,0.0,0.0,1,11,0.014493,0.043478,0.014493,0.028986,4.636364,0


## Dropping Non-Meta and Deterministic Columns

In [5]:
text_df = df[['text','propaganda']]

## Previewing Final DataFrame and Missing Values Before Diving In

In [6]:
text_df.head()

Unnamed: 0,text,propaganda
0,South Florida Muslim Leader Sofian Zakkout’s D...,non-propaganda
1,"David Duke, the white supremacist icon and for...",propaganda
2,"However, one individual who represents the Mus...",propaganda
3,"Last month, once again, Zakkout chose to showc...",non-propaganda
4,The postings can be rivaled only by Zakkout’s ...,non-propaganda


In [7]:
text_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15172 entries, 0 to 15171
Data columns (total 2 columns):
text          15172 non-null object
propaganda    15172 non-null object
dtypes: object(2)
memory usage: 237.1+ KB


## Train-Test Split

In [8]:
y = text_df['propaganda']
X = text_df['text']

In [9]:
y = [1 if label == 'propaganda' else 0 for label in y]

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Tokenizer

In [11]:
#IMPORTS
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import string

punctuation = string.punctuation
punctuation = punctuation+"..."+"--"+"“"+"”"+"``"+"''"+"’"+"–"+"—"+"‘"
lemmatizer = WordNetLemmatizer()

In [12]:
contr_dict={"I’m": "I am",
            "won’t": "will not",
            "’s" : "", 
            "’ll":"will",
            "’ve ":"have ",
            "n’t":" not",
            "’re": "are",
            "’d": "would",
            "y’all": "all of you",
            "I'm": "I am",
            "won't": "will not",
            "'s" : "", 
            "'ll":"will",
            "'ve ":"have ",
            "n't":"not",
            "'re": "are",
            "'d": "would",
            "y'all": "all of you"}
contr_dict.keys()


dict_keys(['I’m', 'won’t', '’s', '’ll', '’ve ', 'n’t', '’re', '’d', 'y’all', "I'm", "won't", "'s", "'ll", "'ve ", "n't", "'re", "'d", "y'all"])

In [13]:
def replace_contractions(sentence, contr_dict=contr_dict):
    for contr in contr_dict.keys():
        if contr in sentence:
            sentence = sentence.replace(contr,contr_dict[contr])
    return sentence

In [52]:
import re 
  
def remove_numbers(tokens): 
    pattern = '[0-9]'
    tokens_updated = [re.sub(pattern, '', token) for token in tokens] 
    return tokens_updated

In [53]:
remove_numbers(['my','15th','birthday','2011'])

['my', 'th', 'birthday', '']

In [14]:
# function to convert nltk tag to wordnet tag
# this is important because having the POS tag improves lemmatization
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

In [110]:
#tokenization and lemmatization function
def tokenize_sentence(sentence):
    #remove contractions
    sentence = replace_contractions(sentence, contr_dict=contr_dict)
    
    #tokenize the sentence
    mytokens = nltk.word_tokenize(sentence)

    #remove numbers
    mytokens = remove_numbers(mytokens)
    
    #remove tokens left over that are only space char
    mytokens = [token for token in mytokens if len(token)>0]
    
    #tag tokens with part of speech
    nltk_tagged = nltk.pos_tag(mytokens)

    # remove punctuation
    nltk_tagged = [ word for word in nltk_tagged if word[0] not in punctuation ]
    
    #
    nltk_tagged = [word for word in nltk_tagged if word[0].isalpha()]
    
    # strip all tokens and make lowercase 
    nltk_tagged = [ (word[0].lower().strip(),word[1]) for word in nltk_tagged ]
    
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    
    lemmatized_tokens = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_tokens.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_tokens.append(lemmatizer.lemmatize(word, tag))
            
        lemmatized_tokens = [word for word in lemmatized_tokens if word not in STOP_WORDS]
    return lemmatized_tokens


In [111]:
#checking to see lemmatization and number removal
tokenize_sentence('my favorite birthday was my 21st birthday that ran on 5th of december 2011 :-')

['favorite', 'birthday', 'birthday', 'run', 'december']

## TFIDF Transformation

In [112]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

### Created different tfidf vectorizers to test best one for model. They all performed similarly with maxdf = .7 slightly outperforming the rest. For readability, I am only keeping the optimized tfidf tranformation.

In [113]:
tfidf7 = TfidfVectorizer(tokenizer = tokenize_sentence, 
                               min_df=5, max_df=0.7)

In [114]:
X_train_tfidf7 = tfidf7.fit_transform(X_train)
X_test_tfidf7 = tfidf7.transform(X_test)

In [115]:
tfidf_7_df = pd.DataFrame(X_train_tfidf7.toarray())
tfidf_7_df.columns = list(tfidf7.get_feature_names())

## Dummy Classifier

In [173]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(X_train_tfidf7, y_train)

dummy_preds_tfidf7 = dummy_clf.predict(X_test_tfidf7)

In [174]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [175]:
# Print the confusion matrix
print(sklearn.metrics.confusion_matrix(y_test, dummy_preds_tfidf7))

# Print the precision and recall, among other metrics
print(sklearn.metrics.classification_report(y_test, dummy_preds_tfidf7, digits=3))

[[3510    0]
 [1497    0]]
              precision    recall  f1-score   support

           0      0.701     1.000     0.824      3510
           1      0.000     0.000     0.000      1497

    accuracy                          0.701      5007
   macro avg      0.351     0.500     0.412      5007
weighted avg      0.491     0.701     0.578      5007



  _warn_prf(average, modifier, msg_start, len(result))


In [176]:
from sklearn.metrics import roc_auc_score

In [177]:
roc_auc_score(y_test, dummy_preds_tfidf7)

0.5

## Optimizing Logistic Regression

In [132]:
from sklearn import linear_model
from sklearn import ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

In [133]:
# Create logistic regression
logistic = linear_model.LogisticRegression()

In [134]:
hyperparam_grid_logistic = {'penalty' : ['l1', 'l2'],
    'C' : np.logspace(-4, 4, 20),
    'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'class_weight': 'balanced'}


In [136]:
# Create randomized search 5-fold cross validation and 100 iterations
clf_log = RandomizedSearchCV(logistic, hyperparam_grid_logistic, random_state=1, n_iter=200, cv=5, 
                         verbose=True, n_jobs=-1, scoring = 'f1')

In [137]:
# Fit randomized search
best_model_log_7 = clf_log.fit(X_train_tfidf7, y_train)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  4.2min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  7.1min finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


### Best HyperParameters for Best Logistic Regression with TFIDF features

In [139]:
# View best hyperparameters
print('Best Penalty:', best_model_log_7.best_estimator_.get_params()['penalty'])
print('Best C:', best_model_log_7.best_estimator_.get_params()['C'])
print('Best solver:', best_model_log_7.best_estimator_.get_params()['solver'])

Best Penalty: l2
Best C: 29.763514416313132
Best solver: lbfgs


### Results for Best Logistic Regression with TFIDF features

In [140]:
# Predict target vector
log_preds7 = best_model_log_7.predict(X_test_tfidf7)

# Print the confusion matrix
print(sklearn.metrics.confusion_matrix(y_test, log_preds7))

# Print the precision and recall, among other metrics
print(sklearn.metrics.classification_report(y_test, log_preds7, digits=3))

roc_auc_score(y_test, log_preds7)

[[2945  565]
 [ 884  613]]
              precision    recall  f1-score   support

           0      0.769     0.839     0.803      3510
           1      0.520     0.409     0.458      1497

    accuracy                          0.711      5007
   macro avg      0.645     0.624     0.630      5007
weighted avg      0.695     0.711     0.700      5007



0.6242584884869454

## Optimizing Random Forest

In [153]:
randomforest = ensemble.RandomForestClassifier()

In [154]:
hyperparam_grid_rf=    {'n_estimators' : list(range(10,101,10)),
    'max_features' : list(range(6,32,5)),
    'criterion':['gini','entropy'],
    'class_weight':['balanced']}


In [155]:
clf_rf = RandomizedSearchCV(randomforest, hyperparam_grid_rf, random_state=1, n_iter=200, cv=5, 
                         verbose=True, n_jobs=-1, scoring = 'f1')

In [157]:
# Fit randomized search
best_model_rf = clf_rf.fit(X_train_tfidf7, y_train)

Fitting 5 folds for each of 120 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  7.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 17.9min
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed: 25.7min finished


### Best HyperParameters for Random Forest with TFIDF Features

In [158]:
# View best hyperparameters
print('Best Penalty:', best_model_rf.best_estimator_.get_params()['n_estimators'])
print('Best C:', best_model_rf.best_estimator_.get_params()['max_features'])
print('Best solver:', best_model_rf.best_estimator_.get_params()['criterion'])

Best Penalty: 60
Best C: 26
Best solver: entropy


### Best Results for Random Forest with TFIDF Features

In [159]:
# Predict target vector
rf_preds = best_model_rf.predict(X_test_tfidf7)

# Print the confusion matrix
print(sklearn.metrics.confusion_matrix(y_test, rf_preds))

# Print the precision and recall, among other metrics
print(sklearn.metrics.classification_report(y_test, rf_preds, digits=3))

print(roc_auc_score(y_test, rf_preds))

[[3253  257]
 [1104  393]]
              precision    recall  f1-score   support

           0      0.747     0.927     0.827      3510
           1      0.605     0.263     0.366      1497

    accuracy                          0.728      5007
   macro avg      0.676     0.595     0.597      5007
weighted avg      0.704     0.728     0.689      5007

0.5946528384404136


## Gradient Boosted Decision Trees

In [160]:
from sklearn.ensemble import GradientBoostingClassifier


In [161]:
clf_gboost = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1)

In [162]:
gboost_model = clf_gboost.fit(X_train_tfidf7, y_train)

### Results for Gradient Boosted Decision Trees

In [163]:
# Predict target vector
gboost_preds = gboost_model.predict(X_test_tfidf7)

# Print the confusion matrix
print(sklearn.metrics.confusion_matrix(y_test, gboost_preds))

# Print the precision and recall, among other metrics
print(sklearn.metrics.classification_report(y_test, gboost_preds, digits=3))

print(roc_auc_score(y_test, gboost_preds))

[[3190  320]
 [1115  382]]
              precision    recall  f1-score   support

           0      0.741     0.909     0.816      3510
           1      0.544     0.255     0.347      1497

    accuracy                          0.713      5007
   macro avg      0.643     0.582     0.582      5007
weighted avg      0.682     0.713     0.676      5007

0.5820044647699958


In [None]:
# hyperparam_grid_gb=    {'n_estimators' : list(range(10,101,10)),
#     'max_features' : list(range(6,32,5)),
#     'criterion':['gini','entropy'],
#     'class_weight':'balanced'}

## KNN

In [181]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

hyperparam_grid_knn={'n_neighbors' : [3,5,11,15,21,25,31],
    'weights':['uniform','distance'],
    'metric':['euclidean','minkowski','manhattan']}

clf_knn = RandomizedSearchCV(knn, hyperparam_grid_knn, random_state=1, n_iter=200, cv=5, 
                         verbose=True, n_jobs=-1, scoring = 'f1')
# Fit randomized search
best_model_knn = clf_knn.fit(X_train_tfidf7, y_train)


Fitting 5 folds for each of 42 candidates, totalling 210 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   20.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 210 out of 210 | elapsed:  2.2min finished


### Results for Best KNN for TFIDF Features

In [182]:
# Predict target vector
knn_preds = best_model_knn.predict(X_test_tfidf7)

# Print the confusion matrix
print(sklearn.metrics.confusion_matrix(y_test, knn_preds))

# Print the precision and recall, among other metrics
print(sklearn.metrics.classification_report(y_test, knn_preds, digits=3))

print(roc_auc_score(y_test, knn_preds))

[[3308  202]
 [1318  179]]
              precision    recall  f1-score   support

           0      0.715     0.942     0.813      3510
           1      0.470     0.120     0.191      1497

    accuracy                          0.696      5007
   macro avg      0.592     0.531     0.502      5007
weighted avg      0.642     0.696     0.627      5007

0.5310113103700279
