
# ✔️ Problem Formulation:

The objective is to determine whether a Reddit post is fake. NLP techniques are used to tackle this problem. The main challenges are choosing the best preprocessing for the posts and the type of vectorization used to weight the importance of each word/feature. The impact could be far fewer fake posts, since the detected ones can be banned.

The target is a model with at least 70% accuracy.

#Strategy used:
1. Try different classifiers.
2. Try the best one with word and character vectorizers, using RandomizedSearchCV to find better hyperparameter values.
3. Try different preprocessing techniques with RandomizedSearchCV, using the best model and best vectorizer.
4. Try XGBoost with RandomizedSearchCV.


#Importing Libraries and investigating data

In [None]:
import re
import pickle
import sklearn
import pandas as pd
import numpy as np
import nltk 

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import BernoulliNB
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
! pip install contractions
import contractions
! pip install unidecode
import unidecode

# some display settings for pandas and NumPy

pd.options.display.max_columns = 100
pd.options.display.max_rows = 300
pd.options.display.max_colwidth = 100
np.set_printoptions(threshold=2000)



In [None]:
# Reading the training data
df_main=pd.read_csv('xy_train.csv')

#Make a copy of the original training dataframe for usage
df=df_main.copy()

# Reading testing data
df_test=pd.read_csv('x_test.csv')

In [None]:
#Printing few lines of the data
df.head(5)

Unnamed: 0,id,text,label
0,265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0.0
1,284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0.0
2,207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0.0
3,551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0.0
4,8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0.0


#Data cleaning and preprocessing


In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))


def clean_text(text,stem,for_embedding=False):
    """ steps:
        - remove any html tags (<br /> often found)
        - keep only ASCII + European chars and whitespace, no digits
        - remove single-letter tokens
        - collapse all whitespace (tabs etc.) to a single space
        if not for embedding (but e.g. tf-idf):
        - remove punctuation
        - drop stop words, then stem (stem=True) or lemmatize (stem=False); stem=None keeps the tokens as they are
    """
    # compile regular expression and return pattern object
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)
    RE_TAGS = re.compile(r"<[^>]+>")
    RE_ASCII = re.compile(r"[^A-Za-zÀ-ž ]", re.IGNORECASE)
    RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž]\b", re.IGNORECASE)
    if for_embedding:

        # Keep punctuation
        RE_ASCII = re.compile(r"[^A-Za-zÀ-ž,.!? ]", re.IGNORECASE)
        RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž,.!?]\b", re.IGNORECASE)

    text = re.sub(RE_TAGS, " ", text)
    text = re.sub(RE_ASCII, " ", text)
    text = re.sub(RE_SINGLECHAR, " ", text)
    text = re.sub(RE_WSPACE, " ", text)

    word_tokens = word_tokenize(text)


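    # Note: calling word.lower() on its own returns a new string and discards it,
    # so the tokens keep their original case here. Lowercasing is left to the
    # Snowball stemmer (when stem=True) and to TfidfVectorizer (lowercase=True by
    # default); the stop-word matching below is therefore case-sensitive.
    # Tried but disabled: accent removal with unidecode.unidecode(word) and
    # contraction expansion (don't -> do not) with contractions.fix(word).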


    if for_embedding:
        # no stemming, lowering and punctuation / stop words removal
        words_filtered = word_tokens
        
    else:
        if stem is True:
            # stemming
            words_filtered = [stemmer.stem(word) for word in word_tokens if word not in stop_words]
        elif stem is False:
            # lemmatization
            words_filtered = [lemmatizer.lemmatize(word) for word in word_tokens if word not in stop_words]
        else:
            # stem=None: keep the tokens unchanged
            words_filtered = word_tokens

    text_clean = " ".join(words_filtered)
    return text_clean

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:

#Testing clean_text function

print(clean_text("Python is my favorite   programming language!!", stem=None)) # No stemming or lemmatization
print(clean_text("Python is my favorite   programming language!!", stem=True)) # Applied stemming
print(clean_text("Python is my favorite   programming language!!", stem=False)) # Applied lemmatization

Python is my favorite programming language
python favorit program languag
Python favorite programming language
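
The first and third outputs keep the capital "P" in "Python". In `clean_text`, the bare `word.lower()` call returns a new string that is never stored, so only the stemmer (which lowercases internally) yields lowercase tokens. A minimal sketch of that behavior, using a made-up token list and stop-word set rather than the notebook's data:

```python
# Hypothetical tokens and stop words, for illustration only.
tokens = ["Python", "is", "The", "best"]
stop_words_demo = {"is", "the"}

# Calling .lower() without using the result changes nothing,
# because Python strings are immutable.
for word in tokens:
    word.lower()
unchanged = list(tokens)

# Rebuilding the list keeps the lowercased strings.
lowered = [word.lower() for word in tokens]

# Stop-word matching is case-sensitive, so "The" survives a lowercase stop list.
kept_without_lowering = [w for w in unchanged if w not in stop_words_demo]
kept_with_lowering = [w for w in lowered if w not in stop_words_demo]

print(unchanged)              # ['Python', 'is', 'The', 'best']
print(kept_without_lowering)  # ['Python', 'The', 'best']
print(kept_with_lowering)     # ['python', 'best']
```

Since `TfidfVectorizer` lowercases by default, the casing mostly matters for which stop words get removed before vectorization.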


We'll try all three preprocessing variants and compare how they perform.


Let's start with no stemming or lemmatization.

In [None]:
%%time
#Clean Comments 
df["text_clean"] = df.loc[df_main["text"].str.len() > 10, "text"]
df["text_clean"] = df["text_clean"].map(
    lambda x: clean_text(x,stem=None, for_embedding=False) if isinstance(x, str) else x
)


CPU times: user 15.9 s, sys: 89.5 ms, total: 16 s
Wall time: 21.4 s


In [None]:
df

Unnamed: 0,id,text,label,text_clean
0,265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0.0,group of friends began to volunteer at homeless shelter after their neighbors protested Seeing a...
1,284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0.0,British Prime Minister Theresa May on Nerve Attack on Former Russian Spy The government has conc...
2,207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0.0,In Goodyear released kit that allows PS to be brought to heel https youtube com watch ALXulk cg ...
3,551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0.0,Happy Birthday Bob Barker The Price Is Right Host on How He Like to Be Remembered As the man who...
4,8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0.0,Obama to Nation Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Johns...
...,...,...,...,...
51016,551985,"British general Jeffrey Amherst Giving smallpox blankets to area natives, colorized (1736)",0.0,British general Jeffrey Amherst Giving smallpox blankets to area natives colorized
51017,294811,"If The Zombie Apocalypse Happens, This Is Where Scientists Say You Should Go | The Rockies",0.0,If The Zombie Apocalypse Happens This Is Where Scientists Say You Should Go The Rockies
51018,462648,"""The I.W.W. is Coming! Join The One Big Union"", Industrial Workers of the World, USA, 1907",0.0,The is Coming Join The One Big Union Industrial Workers of the World USA
51019,335340,Why the Viral United Airlines Video Kept Getting Deleted From Reddit | It depicts violence,0.0,Why the Viral United Airlines Video Kept Getting Deleted From Reddit It depicts violence


In [None]:
# Comparing visually text and its clean version 
df.head(5)

Unnamed: 0,id,text,label,text_clean
0,265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0.0,group of friends began to volunteer at homeless shelter after their neighbors protested Seeing a...
1,284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0.0,British Prime Minister Theresa May on Nerve Attack on Former Russian Spy The government has conc...
2,207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0.0,In Goodyear released kit that allows PS to be brought to heel https youtube com watch ALXulk cg ...
3,551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0.0,Happy Birthday Bob Barker The Price Is Right Host on How He Like to Be Remembered As the man who...
4,8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0.0,Obama to Nation Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Johns...


In [None]:
# Checking presence of NaNs
df.isna().sum()

id            0
text          0
label         1
text_clean    0
dtype: int64

Only `label` has a missing value (a single row); it is removed later, when we keep only the records labeled 0 or 1.

In [None]:
# Creating a copy of the cleaned text
df_clean=df.copy()

In [None]:
from bokeh.models import NumeralTickFormatter
# Word frequency of the most common words
# (the slice starts at 1, so the single most frequent word is skipped)
word_freq = pd.Series(" ".join(df_clean["text_clean"]).split()).value_counts()
word_freq[1:40]

of       24233
to       24147
in       18918
and      16857
for       9642
The       8986
on        8969
it        8100
is        7743
with      6480
from      5989
that      5370
this      4966
at        4912
his       4816
my        4517
by        4464
was       4422
you       4113
This      4071
after     3397
as        3365
an        3332
To        3202
has       3173
he        3042
are       2905
be        2662
have      2526
out       2463
It        2328
they      2284
like      2266
but       2236
their     2233
her       2217
Trump     2207
up        2152
In        2130
dtype: int64
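
The list above counts "The", "This", "To", "It", and "In" separately from their lowercase forms because the counting is case-sensitive. A small sketch with `collections.Counter` on a made-up corpus (not the Reddit data) shows the difference between case-sensitive and case-folded counts:

```python
from collections import Counter

# Made-up mini corpus, for illustration only.
texts = ["The cat sat on the mat", "the dog and The cat"]

# Same join/split idea as in the cell above: one long string, split on whitespace.
words = " ".join(texts).split()

case_sensitive = Counter(words)
case_folded = Counter(w.lower() for w in words)

print(case_sensitive["The"], case_sensitive["the"])  # 2 2
print(case_folded["the"])                            # 4
```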

In [None]:
# list the 10 least frequent words and name the count column "freq"
word_freq[-10:].reset_index(name='freq')

Unnamed: 0,index,freq
0,Cheapjack,1
1,Nationalisation,1
2,Strasbourg,1
3,zips,1
4,Frankel,1
5,secretariat,1
6,NRDC,1
7,Monsoon,1
8,vigilant,1
9,gpu,1


In [None]:
#Checking distribution of target variable values

df_clean['label'].value_counts(normalize=True)

0.0    0.552117
1.0    0.443983
2.0    0.003900
Name: label, dtype: float64

**Oops!** This is a binary classification problem, so we expect only 2 labels, but we are clearly getting 3!

The undefined label is the rarest one: it covers only about 0.39% of the records, so we'll drop the records that carry it.

Also worth noticing: the two remaining classes are roughly balanced.

In [None]:
#Keep only the records that have 0 or 1 in the label column (this also drops the row with a missing label)
df_clean=df_clean[(df_clean['label']==0) | (df_clean['label']==1)]

In [None]:
#Checking for label column values

df_clean['label'].value_counts(normalize=True)

0.0    0.554279
1.0    0.445721
Name: label, dtype: float64

 **Voilà!**  It seems we got rid of the unknown class! 🙂

 Now, it's time for the next step.

Let's check out the most frequent and least frequent words.

In [None]:
from bokeh.models import NumeralTickFormatter
# Word frequency of the most common words
# join >> concatenate all texts with a space in between, split >> split the result into words
word_freq = pd.Series(" ".join(df_clean["text_clean"]).split()).value_counts()
word_freq[1:20]

of      24197
to      24102
in      18891
and     16820
for      9621
on       8954
The      8893
it       8083
is       7718
with     6473
from     5985
that     5354
this     4954
at       4904
his      4808
my       4507
by       4458
was      4412
you      4092
dtype: int64

In [None]:
# list most uncommon words
word_freq[-10:].reset_index(name="freq")

Unnamed: 0,index,freq
0,enlightening,1
1,Ty,1
2,Pennington,1
3,morphs,1
4,unravelling,1
5,Kilcullen,1
6,Transcaucasian,1
7,Byelorussian,1
8,Ruthie,1
9,gpu,1


#Assigning  dependent and independent variables


In [None]:
X=df_clean['text_clean']
y=df_clean['label']

In [None]:
X

0        group of friends began to volunteer at homeless shelter after their neighbors protested Seeing a...
1        British Prime Minister Theresa May on Nerve Attack on Former Russian Spy The government has conc...
2        In Goodyear released kit that allows PS to be brought to heel https youtube com watch ALXulk cg ...
3        Happy Birthday Bob Barker The Price Is Right Host on How He Like to Be Remembered As the man who...
4        Obama to Nation Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Johns...
                                                        ...                                                 
51015               Discussion What monitor type would you recommend for this config is bottlenecking my gpu
51016                     British general Jeffrey Amherst Giving smallpox blankets to area natives colorized
51017                If The Zombie Apocalypse Happens This Is Where Scientists Say You Should Go The Rockies
51018              

#Feature Creation using TF-IDF

In [None]:
# Sample data - 20% of data to validation set
from sklearn.model_selection import PredefinedSplit

# Further split the original training set to a train and a validation set
xtr, xts, ytr, yts = train_test_split(
    X, y, train_size = 0.8, stratify =y, random_state = 2022)

# Mark each row of X: -1 = always in the training set, 0 = in the validation fold
split_index = [-1 if x in xtr.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
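
In `PredefinedSplit`, an entry of -1 keeps that sample out of every validation fold (it is always used for training), while 0 places it in validation fold 0. The search therefore scores each candidate on one fixed train/validation split instead of k-fold cross-validation. A toy sketch with five made-up samples (`pds_demo` is a hypothetical name):

```python
from sklearn.model_selection import PredefinedSplit

# Toy fold assignment for 5 samples: -1 = always train, 0 = validation fold 0.
test_fold = [-1, -1, 0, -1, 0]
pds_demo = PredefinedSplit(test_fold=test_fold)

# split() yields one (train indices, validation indices) pair per fold.
splits = list(pds_demo.split())
train_idx, val_idx = splits[0]

print(pds_demo.get_n_splits())  # 1
print(train_idx, val_idx)       # [0 1 3] [2 4]
```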


In [None]:

# lowercase : bool, default=True
#     Convert all characters to lowercase before tokenizing.
# smooth_idf : bool, default=True
#     Smooth idf weights by adding one to document frequencies, as if an
#     extra document was seen containing every term in the collection
#     exactly once. Prevents zero divisions.
vectorizer = TfidfVectorizer(
    analyzer="word", max_df=0.3, min_df=10, ngram_range=(1, 2), norm="l2"
)
# Note: fitting on all of df_clean lets the validation rows influence the
# vocabulary and idf weights; fit on xtr only for a stricter evaluation.
vectorizer.fit(df_clean["text_clean"])

TfidfVectorizer(max_df=0.3, min_df=10, ngram_range=(1, 2))
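
As a rough sketch of what the vectorizer computes with its defaults (`smooth_idf=True`, `norm="l2"`, raw term counts as tf): scikit-learn documents the smoothed weight as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t. The tiny corpus and the helper names below are made up for illustration; the sketch uses unigrams only and applies no `min_df`/`max_df` filtering.

```python
import math
from collections import Counter

# Made-up, already tokenized corpus (illustration only).
docs = [["fake", "news", "post"], ["real", "news"], ["fake", "post", "post"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def smooth_idf(term):
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    # Raw count times idf, then scale the row to unit Euclidean length.
    counts = Counter(doc)
    raw = {term: count * smooth_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

row = tfidf_row(docs[2])
print(row)
```

Rarer terms get a larger idf (here "real" outweighs "news"), and within a row a repeated term such as "post" ends up with a larger weight than a term that appears once.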

In [None]:
# Vector representation of vocabulary
word_vector = pd.Series(vectorizer.vocabulary_).sample(5, random_state=5)
print(f"Unique word (ngram) vector extract:\n\n {word_vector}")

Unique word (ngram) vector extract:

 it before        8026
we do           18010
wiped out       18460
furniture        5802
building and     2175
dtype: int64


In [None]:
# transform each sentence to numeric vector with tf-idf value as elements
xtr_vec = vectorizer.transform(xtr)
xts_vec = vectorizer.transform(xts)
xtr_vec.get_shape()

(40656, 18970)

In [None]:
# Compare original comment text with its numeric vector representation
print(f"Original sentence:\n{xtr[3:4].values}\n")
# Feature Matrix
features = pd.DataFrame(
    xtr_vec[3:4].toarray(), columns=vectorizer.get_feature_names_out()
)
nonempty_feat = features.loc[:, (features != 0).any(axis=0)]
print(f"Vector representation of sentence:\n {nonempty_feat}")

Original sentence:
['These rocks at my local beach They have jagged square pattern that looks so cool guessing it from erosion from the sea']

Vector representation of sentence:
          at     at my     beach      cool      from  from the      have  \
0  0.107928  0.209136  0.211901  0.217699  0.203701   0.15044  0.126764   

        it   it from     local    looks  looks so        my  my local  \
0  0.09506  0.249808  0.181366  0.16067  0.297264  0.106793  0.223643   

    pattern     rocks       sea        so    square      that  that looks  \
0  0.247035  0.263348  0.218432  0.138628  0.238498  0.103642    0.250787   

    the sea     these     they  they have  
0  0.263348  0.161443  0.12348    0.23173  




#We'll try different classifiers.

In [None]:
# models to test
classifiers = [
    #HistGradientBoostingClassifier(),
    LogisticRegression (random_state=22),
    LinearSVC(random_state=1),
    RandomForestClassifier(random_state=1),
    XGBClassifier(random_state=1),
    BernoulliNB(),
    MLPClassifier(
        random_state=1,
        solver="adam",
        hidden_layer_sizes=(12,12,12),
        activation="relu",
        early_stopping=True,
        n_iter_no_change=1,
    ),
]
# derive the display names from the class of each estimator
names = [type(clf).__name__ for clf in classifiers]
print(f"Classifiers to test: {names}")

Classifiers to test: ['LogisticRegression', 'LinearSVC', 'RandomForestClassifier', 'XGBClassifier', 'BernoulliNB', 'MLPClassifier']


In [None]:
%%time
# fit every classifier and store its report on the validation split
results = {}
for name, clf in zip(names, classifiers):
    print(f"Training classifier: {name}")
    clf.fit(xtr_vec, ytr)
    prediction = clf.predict(xts_vec)
    report = sklearn.metrics.classification_report(yts, prediction)
    results[name] = report

    

Training classifier: LogisticRegression


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Training classifier: LinearSVC
Training classifier: RandomForestClassifier
Training classifier: XGBClassifier
Training classifier: BernoulliNB
Training classifier: MLPClassifier
CPU times: user 1min 39s, sys: 8.06 s, total: 1min 47s
Wall time: 1min 40s


In [None]:
# Prediction results
for k, v in results.items():
    print(f"Results for {k}:")
    print(f"{v}\n")


Results for LogisticRegression:
              precision    recall  f1-score   support

         0.0       0.83      0.83      0.83      5634
         1.0       0.79      0.78      0.79      4531

    accuracy                           0.81     10165
   macro avg       0.81      0.81      0.81     10165
weighted avg       0.81      0.81      0.81     10165


Results for LinearSVC:
              precision    recall  f1-score   support

         0.0       0.82      0.81      0.82      5634
         1.0       0.77      0.78      0.78      4531

    accuracy                           0.80     10165
   macro avg       0.80      0.80      0.80     10165
weighted avg       0.80      0.80      0.80     10165


Results for RandomForestClassifier:
              precision    recall  f1-score   support

         0.0       0.80      0.82      0.81      5634
         1.0       0.77      0.74      0.75      4531

    accuracy                           0.78     10165
   macro avg       0.78      0.78  

In [None]:
%%time
# feature creation and modelling in a single pipeline
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("LR", LogisticRegression())])

# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1, 4), (1, 5)],
    # np.arange(0.3, 0.8) has step 1, so only max_df=0.3 is ever sampled
    "tfidf__max_df": np.arange(0.3, 0.8),
    "tfidf__min_df": np.arange(5, 100),
    # 'sag' does not support the 'l1' penalty, so those combinations fail to fit
    "LR__penalty": ["l1", "l2"],
    "LR__C": [0.1, 0.5],
    "LR__solver": ["liblinear", "sag", "saga"],
    "LR__max_iter": [100, 200],
}
# the search is slow, so only 30 random combinations are tried
pipe_clf = RandomizedSearchCV(
    pipe, params, n_jobs=-1, scoring="roc_auc", n_iter=30, cv=pds)
pipe_clf.fit(X, y)
with open("./clf_pipe.pck", "wb") as f:
    pickle.dump(pipe_clf, f)

3 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 449, in _check_solver

CPU times: user 9.57 s, sys: 700 ms, total: 10.3 s
Wall time: 2min 58s
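
The traceback ends in `_check_solver`, which is consistent with sampled solver/penalty pairs that `LogisticRegression` rejects: the `sag` solver supports only the L2 penalty, while `liblinear` and `saga` accept both L1 and L2. One way to avoid wasting iterations is to give `RandomizedSearchCV` a list of dictionaries, so that only valid pairs are sampled. A sketch of such a search space in plain Python (the names `SUPPORTED`, `is_valid`, and `valid_params` are illustrative, and only the estimator keys are shown):

```python
from itertools import product

# Penalties each solver accepts, restricted to the ones used in this search.
SUPPORTED = {
    "liblinear": {"l1", "l2"},
    "sag": {"l2"},
    "saga": {"l1", "l2"},
}

def is_valid(solver, penalty):
    return penalty in SUPPORTED[solver]

# The grid in the cell above samples solver and penalty independently.
all_pairs = list(product(SUPPORTED, ["l1", "l2"]))
invalid = [pair for pair in all_pairs if not is_valid(*pair)]
print(invalid)  # [('sag', 'l1')]

# A list of dicts keeps the incompatible pair out of the sampled space.
valid_params = [
    {"LR__solver": ["liblinear", "saga"], "LR__penalty": ["l1", "l2"]},
    {"LR__solver": ["sag"], "LR__penalty": ["l2"]},
]
```

The remaining `tfidf__...` and `LR__...` keys would have to be repeated in each dictionary of the list.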


In [None]:
best_params = pipe_clf.best_params_
print(best_params)

{'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 9, 'tfidf__max_df': 0.3, 'LR__solver': 'saga', 'LR__penalty': 'l2', 'LR__max_iter': 200, 'LR__C': 0.5}
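
The reported `tfidf__max_df` of 0.3 is the only value the search could have picked: `np.arange(0.3, 0.8)` uses the default step of 1, so the array holds a single element. A quick check, followed by one possible wider range (the `np.linspace` call is an assumption about what was intended, not the notebook's setting):

```python
import numpy as np

single = np.arange(0.3, 0.8)  # default step is 1
print(single)                 # [0.3]

# One possible alternative: six evenly spaced values from 0.3 to 0.8.
wider = np.linspace(0.3, 0.8, num=6)
print(wider)
```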


In [None]:
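# refit the pipeline with the best parameters
# note: this fits on all of X, which includes the validation rows in xts,
# so the report below is optimistic; fit on (xtr, ytr) for an unbiased check
pipe.set_params(**best_params).fit(X, y)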
pipe_pred = pipe.predict(xts)
report = sklearn.metrics.classification_report(yts, pipe_pred)
print(report)

              precision    recall  f1-score   support

         0.0       0.86      0.87      0.87      5634
         1.0       0.84      0.83      0.83      4531

    accuracy                           0.85     10165
   macro avg       0.85      0.85      0.85     10165
weighted avg       0.85      0.85      0.85     10165
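
Because the pipeline above is refit on all of `X` and then scored on `xts`, which is a subset of `X`, this report measures performance on rows the model has already seen. The toy sketch below uses a deliberately extreme memorizing classifier on made-up data (`MemorizingClassifier` and `accuracy` are hypothetical helpers) to show how such a score can overstate quality compared with fitting on the training part only:

```python
class MemorizingClassifier:
    """Predicts the stored label for seen texts, else a default label."""

    def fit(self, texts, labels):
        self.memory_ = dict(zip(texts, labels))
        return self

    def predict(self, texts, default=0):
        return [self.memory_.get(t, default) for t in texts]

def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

# Made-up data: the last two rows act as the validation split.
texts = ["a", "b", "c", "d", "e", "f"]
labels = [0, 1, 0, 1, 1, 1]
train_x, train_y = texts[:4], labels[:4]
val_x, val_y = texts[4:], labels[4:]

leaky = MemorizingClassifier().fit(texts, labels)      # saw the validation rows
honest = MemorizingClassifier().fit(train_x, train_y)  # did not

leaky_acc = accuracy(val_y, leaky.predict(val_x))
honest_acc = accuracy(val_y, honest.predict(val_x))
print(leaky_acc, honest_acc)  # 1.0 0.0
```

A regularized logistic regression memorizes far less than this toy model, so the real gap is smaller, but the direction of the bias is the same.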



In [None]:
submission = pd.DataFrame()

submission['id'] = df_test.index

# probability of the positive class (label 1); df_test['text'] is passed in uncleaned
submission['label'] = pipe.predict_proba(df_test['text'])[:, 1]

submission.to_csv('submission_LogReg_trial.csv', index=False)

Logistic Regression with no stemming or lemmatization got a test score of 0.85176!

The score is not bad, though it is odd that it comes out higher than the validation score.
Let's try a 'char' vectorizer instead of the 'word' vectorizer.
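
To make the difference between the two analyzers concrete, here is a simplified sketch of the features each one produces. The helper names are made up, and scikit-learn's own analyzers additionally handle lowercasing, tokenization rules, and whitespace normalization, so this is only an approximation:

```python
def word_ngrams(text, n_min, n_max):
    # Contiguous runs of n whitespace-separated tokens.
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def char_ngrams(text, n_min, n_max):
    # Contiguous runs of n characters, spaces included.
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]

sample = "fake news"
print(word_ngrams(sample, 1, 2))  # ['fake', 'news', 'fake news']
print(char_ngrams(sample, 3, 3))
```

Character n-grams can cross word boundaries (for example "e n") and capture sub-word patterns, at the cost of a much larger feature space.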

In [None]:
%%time
# feature creation and modelling in a single pipeline
pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer='char')), ("LR", LogisticRegression())])

# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 4), (1, 5)],
    "tfidf__max_df": np.arange(0.3, 0.8),
    "tfidf__min_df": np.arange(5, 100),
    "LR__penalty": ["l1", "l2"],
    "LR__C": [0.1, 0.5],
    "LR__solver": ["liblinear", "sag", "saga"],
    "LR__max_iter": [100, 200],
}
# the search is slow, so only 30 random combinations are tried
pipe_clf = RandomizedSearchCV(
    pipe, params, n_jobs=-1, scoring="roc_auc", n_iter=30, cv=pds)
pipe_clf.fit(X, y)
with open("./clf_pipe.pck", "wb") as f:
    pickle.dump(pipe_clf, f)

4 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 449, in _check_solver

CPU times: user 26.1 s, sys: 1.66 s, total: 27.8 s
Wall time: 12min 59s


In [None]:
best_params = pipe_clf.best_params_
print(best_params)

{'tfidf__ngram_range': (1, 4), 'tfidf__min_df': 17, 'tfidf__max_df': 0.3, 'LR__solver': 'sag', 'LR__penalty': 'l2', 'LR__max_iter': 100, 'LR__C': 0.5}


In [None]:
# run pipe with optimized parameters
pipe.set_params(**best_params).fit(X, y)
pipe_pred = pipe.predict(xts)
report = sklearn.metrics.classification_report(yts, pipe_pred)
print(report)

              precision    recall  f1-score   support

         0.0       0.84      0.85      0.84      5634
         1.0       0.81      0.80      0.80      4531

    accuracy                           0.83     10165
   macro avg       0.82      0.82      0.82     10165
weighted avg       0.83      0.83      0.83     10165



In [None]:
submission= pd.DataFrame()

submission['id'] = df_test.index

submission['label']=pipe.predict_proba(df_test['text']) [:,1]

submission.to_csv('submission_LogReg_trial_char.csv',index=False)

Using the character vectorizer gave a test score of 0.83334.

#Observation 1: The word vectorizer was better by almost 2 percentage points.

#Let's try stemming preprocessing with Logistic Regression!
Expectation: higher accuracy!

In [None]:
%%time
#Cleaning using stemming 
df["text_clean"] = df.loc[df_main["text"].str.len() > 10, "text"]
df["text_clean"] = df["text_clean"].map(
    lambda x: clean_text(x,stem=True, for_embedding=False) if isinstance(x, str) else x
)


CPU times: user 21.5 s, sys: 39.9 ms, total: 21.5 s
Wall time: 21.5 s


In [None]:
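#Assigning values to dependent and independent variables once more.
# Caution: df_clean is a copy made before this re-cleaning, so its text_clean
# column still holds the earlier text. To use the stemmed text, refresh it first, e.g.
# df_clean = df_clean.assign(text_clean=df["text_clean"])
X=df_clean['text_clean']
y=df_clean['label']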
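
One thing to watch here: `df_clean` was created with `df.copy()` before this re-cleaning, and a copy does not follow later changes to `df`. A minimal sketch with made-up frames (`df_demo` and `df_clean_demo` are hypothetical names) shows the effect and one way to refresh the column:

```python
import pandas as pd

df_demo = pd.DataFrame({"text_clean": ["Running fast", "Cats sleeping"],
                        "label": [0.0, 1.0]})
df_clean_demo = df_demo.copy()  # independent snapshot

# Re-clean the original frame (a stand-in for the stemming step).
df_demo["text_clean"] = ["run fast", "cat sleep"]

# The snapshot still holds the old text.
before = df_clean_demo["text_clean"].tolist()
print(before)

# Refresh the snapshot; assign aligns the new column on the index.
df_clean_demo = df_clean_demo.assign(text_clean=df_demo["text_clean"])
after = df_clean_demo["text_clean"].tolist()
print(after)
```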

In [None]:
# Sample data - 20% of data to validation set
from sklearn.model_selection import PredefinedSplit

# Further split the original training set to a train and a validation set
xtr, xts, ytr, yts = train_test_split(
    X, y, train_size = 0.8, stratify =y, random_state = 2022)

# Mark each row of X: -1 = always in the training set, 0 = in the validation fold
split_index = [-1 if x in xtr.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

In [None]:
%%time
# feature creation and modelling in a single pipeline
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("LR", LogisticRegression())])

# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1, 4), (1, 5)],
    "tfidf__max_df": np.arange(0.3, 0.8),
    "tfidf__min_df": np.arange(5, 100),
    "LR__penalty": ["l1", "l2"],
    "LR__C": [0.1, 0.5],
    "LR__solver": ["liblinear", "sag", "saga"],
    "LR__max_iter": [100, 200],
}
# the search is slow, so only 30 random combinations are tried
pipe_clf = RandomizedSearchCV(
    pipe, params, n_jobs=-1, scoring="roc_auc", n_iter=30, cv=pds)
pipe_clf.fit(X, y)
with open("./clf_pipe.pck", "wb") as f:
    pickle.dump(pipe_clf, f)

2 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 449, in _check_solver

CPU times: user 6.79 s, sys: 798 ms, total: 7.58 s
Wall time: 2min 47s


In [None]:
best_params = pipe_clf.best_params_
print(best_params)

{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 35, 'tfidf__max_df': 0.3, 'LR__solver': 'liblinear', 'LR__penalty': 'l2', 'LR__max_iter': 100, 'LR__C': 0.5}


In [None]:
# run pipe with optimized parameters
pipe.set_params(**best_params).fit(X, y)
pipe_pred = pipe.predict(xts)
report = sklearn.metrics.classification_report(yts, pipe_pred)
print(report)

              precision    recall  f1-score   support

         0.0       0.84      0.85      0.85      5634
         1.0       0.81      0.80      0.81      4531

    accuracy                           0.83     10165
   macro avg       0.83      0.83      0.83     10165
weighted avg       0.83      0.83      0.83     10165



In [None]:
submission= pd.DataFrame()

submission['id'] = df_test.index

submission['label']=pipe.predict_proba(df_test['text']) [:,1]

submission.to_csv('submission_LogReg_trial_stemming.csv',index=False)

#Observation 2
Stemming gave a slightly higher test score: 0.85609, versus 0.85176 without it.

#Let's try lemmatization with Logistic Regression

In [None]:
%%time
#Cleaning using lemmatization
df["text_clean"] = df.loc[df_main["text"].str.len() > 10, "text"]
df["text_clean"] = df["text_clean"].map(
    lambda x: clean_text(x,stem=False, for_embedding=False) if isinstance(x, str) else x
)


CPU times: user 13.5 s, sys: 58.2 ms, total: 13.5 s
Wall time: 13.5 s


In [None]:
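#Assigning values to dependent and independent variables once more.
# Caution: df_clean is a copy made before this re-cleaning, so its text_clean
# column still holds the earlier text. To use the lemmatized text, refresh it first, e.g.
# df_clean = df_clean.assign(text_clean=df["text_clean"])
X=df_clean['text_clean']
y=df_clean['label']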

In [None]:
# Sample data - 20% of data to validation set
from sklearn.model_selection import PredefinedSplit

# Further split the original training set to a train and a validation set
xtr, xts, ytr, yts = train_test_split(
    X, y, train_size = 0.8, stratify =y, random_state = 2022)

# Mark each row of X: -1 = always in the training set, 0 = in the validation fold
split_index = [-1 if x in xtr.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

In [None]:
%%time
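# Feature extraction and model combined in a single pipeline
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("LR", LogisticRegression())])

# Parameter space sampled by the randomized search
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1, 4), (1, 5)],
    # Note: np.arange(0.3, 0.8) uses the default step of 1, so it yields only [0.3]
    "tfidf__max_df": np.arange(0.3, 0.8),
    "tfidf__min_df": np.arange(5, 100),
    # Note: 'sag' supports only the 'l2' penalty, so any sampled
    # ('l1', 'sag') combination fails to fit and is scored as nan
    "LR__penalty": ["l1", "l2"],
    "LR__C": [0.1, 0.5],
    "LR__solver": ["liblinear", "sag", "saga"],
    "LR__max_iter": [100, 200],
}
# The search is quite slow, so only 30 candidates are sampled
pipe_clf = RandomizedSearchCV(
    pipe, params, n_jobs=-1, scoring="roc_auc", n_iter=30, cv=pds)
pipe_clf.fit(X, y)
# Save the fitted search object; the context manager closes the file
with open("./clf_pipe.pck", "wb") as f:
    pickle.dump(pipe_clf, f)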

5 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 449, in _check_solver

CPU times: user 15.9 s, sys: 1.73 s, total: 17.7 s
Wall time: 3min 15s
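
The truncated traceback above ends in `_check_solver`, which is consistent with the search sampling an `l1` penalty together with the `sag` solver (`sag` supports only `l2`). One possible way to avoid wasting fits is to pass a list of parameter dicts so that each penalty is paired only with solvers that accept it. The sketch below is an assumption about how the space could be restructured, not the configuration that produced the results in this notebook; the reduced value lists and the names `shared` and `param_space` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# np.arange with the default step of 1 yields a single value here
print(np.arange(0.3, 0.8))  # prints [0.3]

# linspace gives an explicit number of points, avoiding float-step surprises
max_df_values = [float(v) for v in np.round(np.linspace(0.3, 0.7, 5), 1)]

# Settings shared by both penalty groups (shortened for the example)
shared = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    "tfidf__max_df": max_df_values,
    "LR__C": [0.1, 0.5],
}

# One dict per penalty, each listing only compatible solvers
param_space = [
    {**shared, "LR__penalty": ["l1"], "LR__solver": ["liblinear", "saga"]},
    {**shared, "LR__penalty": ["l2"], "LR__solver": ["liblinear", "sag", "saga"]},
]

# ParameterSampler draws candidates the same way RandomizedSearchCV does
samples = list(ParameterSampler(param_space, n_iter=20, random_state=0))
invalid = [s for s in samples
           if s["LR__penalty"] == "l1" and s["LR__solver"] == "sag"]
print(len(samples), "candidates sampled,", len(invalid), "invalid")
```

The same list of dicts can be passed to `RandomizedSearchCV` in place of a single dict.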


In [None]:
best_params = pipe_clf.best_params_
print(best_params)

{'tfidf__ngram_range': (1, 5), 'tfidf__min_df': 11, 'tfidf__max_df': 0.3, 'LR__solver': 'liblinear', 'LR__penalty': 'l2', 'LR__max_iter': 200, 'LR__C': 0.5}


In [None]:
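# Refit the pipeline with the best parameters found by the search.
# Note: fitting on the full X includes the validation rows (xts), so the
# report below is measured on data the model has already seen.
pipe.set_params(**best_params).fit(X, y)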
pipe_pred = pipe.predict(xts)
report = sklearn.metrics.classification_report(yts, pipe_pred)
print(report)

              precision    recall  f1-score   support

         0.0       0.86      0.87      0.86      5634
         1.0       0.83      0.83      0.83      4531

    accuracy                           0.85     10165
   macro avg       0.85      0.85      0.85     10165
weighted avg       0.85      0.85      0.85     10165



In [None]:
submission= pd.DataFrame()

submission['id'] = df_test.index

submission['label']=pipe.predict_proba(df_test['text']) [:,1]

submission.to_csv('submission_LogReg_trial_lemm.csv',index=False)

#Observation 3
Logistic Regression with lemmatization score: 0.85265!

#One more trial with XGBClassifier!
Expectation: XGBoost is not expected to beat Logistic Regression, but let's find out. We'll run RandomSearch over several hyperparameters, including the vectorization settings. This will take a lot of time, but it is worth trying!

In [None]:
%%time
# Cleaning without stemming (stem=False), i.e. the lemmatization variant
df["text_clean"] = df.loc[df_main["text"].str.len() > 10, "text"]
df["text_clean"] = df["text_clean"].map(
    lambda x: clean_text(x,stem=False, for_embedding=False) if isinstance(x, str) else x
)


CPU times: user 13.5 s, sys: 53.9 ms, total: 13.5 s
Wall time: 13.5 s


In [None]:
#Assigning values to dependent and independent variables once more.
X=df_clean['text_clean']
y=df_clean['label']

In [None]:
# Sample data - 20% of data to validation set
from sklearn.model_selection import PredefinedSplit

# Further split the original training set to a train and a validation set
xtr, xts, ytr, yts = train_test_split(
    X, y, train_size = 0.8, stratify =y, random_state = 2022)

# Create a list where train data indices are -1 and validation data indices are 0
# xtr is the new training set; every row of X that is not in xtr goes to the validation fold
split_index = [-1 if x in xtr.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

In [None]:
%%time
# Feature extraction and model combined in a single pipeline
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("xgb", XGBClassifier())])

# Parameter space to test (the recorded run below took about 1h 6min wall time)
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1, 4), (1, 5)],
    "tfidf__analyzer": ["word", "char"],
    "tfidf__max_df": [0.5],
    "tfidf__min_df": [10],
    "xgb__n_estimators": [200, 300, 400],
    "xgb__max_depth": [30, 40, 50],
}
# Only 3 candidates are sampled because each fit is expensive
pipe_xgb_clf = RandomizedSearchCV(pipe, params, n_jobs=-1, scoring="f1_macro", n_iter=3, cv=pds)
pipe_xgb_clf.fit(X, y)
# Save the fitted search object; the context manager closes the file
with open("./pipe_xgb_clf.pck", "wb") as f:
    pickle.dump(pipe_xgb_clf, f)

CPU times: user 9min 23s, sys: 3.67 s, total: 9min 27s
Wall time: 1h 6min 5s


In [None]:
best_params = pipe_xgb_clf.best_params_
print(best_params)

{'xgb__n_estimators': 200, 'xgb__max_depth': 50, 'tfidf__ngram_range': (1, 4), 'tfidf__min_df': 10, 'tfidf__max_df': 0.5, 'tfidf__analyzer': 'word'}


In [None]:
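# Refit the pipeline with the best parameters found by the search.
# Note: fitting on the full X includes the validation rows (xts), so the
# report below is measured on data the model has already seen.
pipe.set_params(**best_params).fit(X,y)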
pipe_pred = pipe.predict(xts)
report = sklearn.metrics.classification_report(yts, pipe_pred)
print(report)

              precision    recall  f1-score   support

         0.0       0.99      0.99      0.99      5634
         1.0       0.99      0.99      0.99      4531

    accuracy                           0.99     10165
   macro avg       0.99      0.99      0.99     10165
weighted avg       0.99      0.99      0.99     10165



In [None]:
submission= pd.DataFrame()

submission['id'] = df_test.index

submission['label']=pipe.predict_proba(df_test['text']) [:,1]

submission.to_csv('submission_xgb_trial.csv',index=False)

#Observation 4
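XGBClassifier got a testing score of 0.83706!

The model is definitely overfitting: it scores 0.99 on the validation rows, which were part of the final fit, but only 0.83706 on the test set.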
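
The gap between a score on rows the model was fitted on and a score on truly held-out rows can be reproduced on toy data. The sketch below uses synthetic data from `make_classification` and an unpruned `DecisionTreeClassifier` as a stand-in for a high-capacity model; neither is the Reddit data nor XGBoost, and all variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% randomly flipped labels
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=2022)

# Leaky evaluation: fit on everything, then score on rows already seen
leaky_model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
leaky_acc = leaky_model.score(X_val, y_val)

# Held-out evaluation: fit on the training part only
held_out_model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
held_out_acc = held_out_model.score(X_val, y_val)

print(f"scored on seen rows:     {leaky_acc:.2f}")
print(f"scored on held-out rows: {held_out_acc:.2f}")
```

Applied to this notebook, the equivalent change would be to call `fit(xtr, ytr)` before predicting on `xts`, and to refit on all of `X` only when producing the submission file.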


# One trial with MLP Classifier
We'll apply lemmatization and RandomSearch


In [None]:
%%time
# Cleaning without stemming (stem=False), i.e. the lemmatization variant
df["text_clean"] = df.loc[df_main["text"].str.len() > 10, "text"]
df["text_clean"] = df["text_clean"].map(
    lambda x: clean_text(x,stem=False, for_embedding=False) if isinstance(x, str) else x
)


CPU times: user 13.5 s, sys: 44.1 ms, total: 13.5 s
Wall time: 13.5 s


In [None]:
#Assigning values to dependent and independent variables once more.
X=df_clean['text_clean']
y=df_clean['label']

In [None]:
# Sample data - 20% of data to validation set
from sklearn.model_selection import PredefinedSplit

# Further split the original training set to a train and a validation set
xtr, xts, ytr, yts = train_test_split(
    X, y, train_size = 0.8, stratify =y, random_state = 2022)

# Create a list where train data indices are -1 and validation data indices are 0
# xtr is the new training set; every row of X that is not in xtr goes to the validation fold
split_index = [-1 if x in xtr.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

In [None]:
%%time
# Feature extraction and model combined in a single pipeline
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("mlp", MLPClassifier())])

# Parameter space to test (the recorded run below took about 6min wall time)
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    "tfidf__analyzer": ["word", "char"],
    "tfidf__max_df": [0.5],
    "tfidf__min_df": [10],
    "mlp__hidden_layer_sizes": [128],
    "mlp__solver": ["adam", "sgd"],
    "mlp__batch_size": [128],
    "mlp__early_stopping": [True],
}
# Only 3 candidates are sampled to keep the runtime manageable
pipe_mlp_clf = RandomizedSearchCV(pipe, params, n_jobs=-1, scoring="f1_macro", n_iter=3, cv=pds)
pipe_mlp_clf.fit(X, y)
# Save the fitted search object; the context manager closes the file
with open("./pipe_mlp_clf.pck", "wb") as f:
    pickle.dump(pipe_mlp_clf, f)

CPU times: user 1min 14s, sys: 57 s, total: 2min 11s
Wall time: 5min 59s


In [None]:
best_params = pipe_mlp_clf.best_params_
print(best_params)

{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 10, 'tfidf__max_df': 0.5, 'tfidf__analyzer': 'char', 'mlp__solver': 'adam', 'mlp__hidden_layer_sizes': 128, 'mlp__early_stopping': True, 'mlp__batch_size': 128}


In [None]:
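# Refit the pipeline with the best parameters found by the search.
# Note: fitting on the full X includes the validation rows (xts), so the
# report below is measured on data the model has already seen.
pipe.set_params(**best_params).fit(X, y)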
pipe_pred = pipe.predict(xts)
report = sklearn.metrics.classification_report(yts, pipe_pred)
print(report)

              precision    recall  f1-score   support

         0.0       0.82      0.84      0.83      5634
         1.0       0.79      0.78      0.78      4531

    accuracy                           0.81     10165
   macro avg       0.81      0.81      0.81     10165
weighted avg       0.81      0.81      0.81     10165



In [None]:
submission= pd.DataFrame()

submission['id'] = df_test.index

submission['label']=pipe.predict_proba(df_test['text']) [:,1]

submission.to_csv('submission_mlp_trial.csv',index=False)

#Observation 5:

MLP got a testing score of 0.74576.
The model seems to have both high variance and high bias.

#Final conclusion:
The best model was Logistic Regression with stemming preprocessing and a word-level vectorizer.

#✔️ Answer the questions below (briefly):

🌈 What is the difference between Character n-gram and Word n-gram? Which one tends to suffer more from the OOV issue?
> Answer: An n-gram is a contiguous sequence of n items from a given sample of text or speech.
A word n-gram is a sequence of n consecutive words taken from a sentence.
A character n-gram is a sequence of n consecutive characters taken from a sentence.

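> Word n-grams tend to suffer more from the out-of-vocabulary (OOV) issue. A word that never appeared in training (a typo, a new name, a rare inflection) has no feature at all, whereas most of its character n-grams have usually been seen inside other words. The trade-off is that individual character n-grams do not always correspond to a meaningful word.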
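
A small check of how each analyzer handles words it has never seen. This is a sketch on a made-up two-sentence corpus; the sentences and variable names are illustrative and not taken from the Reddit data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training corpus and an unseen document whose two tokens
# ("fakest", "newz") never occur as whole words in training
train_docs = ["fake news spreads fast", "real news is verified"]
unseen_doc = ["fakest newz"]

word_vec = CountVectorizer(analyzer="word", ngram_range=(1, 1)).fit(train_docs)
char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 3)).fit(train_docs)

# Total count of known features found in the unseen document
word_hits = int(word_vec.transform(unseen_doc).sum())
char_hits = int(char_vec.transform(unseen_doc).sum())

print("word-level features matched:", word_hits)
print("char-level features matched:", char_hits)
```

The word-level vectorizer maps the unseen document to an all-zero row, while the character-level one still matches fragments such as `fak` and `new` that it learned from the training words.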

🌈 What is the difference between stop word removal and stemming? Are these techniques language-dependent?

> Stop word removal drops words that occur with high frequency in nearly all documents and carry little meaning (such as "the" or "is"), whereas stemming strips inflectional endings so that different forms of a word map to the same stem. Yes, both are language dependent: every language has its own stop words and its own inflection rules, and even where a few rules overlap, others can be completely different.

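🌈 Are tokenization techniques language dependent? Why?

> Yes, they are language specific: tokenization splits text into words or sentences, and each language has its own rules for where those boundaries fall. For example, English can largely be split on whitespace and punctuation, while Chinese is written without spaces between words.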

🌈 What is the difference between count vectorizer and tf-idf vectorizer? Would it be feasible to use all possible n-grams? If not, how should you select them?

> Count Vectorizer converts a set of strings into a term-count representation: each document becomes a vector of how many times each term occurs in it.

> TF-IDF means Term Frequency - Inverse Document Frequency. It weights a term's frequency within a document by how rare the term is across the corpus, which gives a numerical measure of how important the word is to that document.
 TF-IDF is often more useful than Count Vectorizer because it considers not only how frequent a word is but also how informative it is. We can then remove the words that are less important for the analysis, which makes model building less complex by reducing the input dimensions.
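
The difference can be seen on a tiny corpus. This is a minimal sketch with three made-up sentences; scikit-learn's default smoothed IDF is used, and the sentences and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus: "the" appears in every document, "fake" in only one
docs = [
    "the post is fake",
    "the post is real",
    "the moderators banned the post",
]

count_vec = CountVectorizer().fit(docs)
tfidf_vec = TfidfVectorizer().fit(docs)

# Raw counts: "the" occurs twice in the third document
counts = count_vec.transform(docs).toarray()
the_count = counts[2, count_vec.vocabulary_["the"]]

# IDF weights: a term found in every document gets the minimum weight,
# a term found in a single document gets a larger one
idf_the = tfidf_vec.idf_[tfidf_vec.vocabulary_["the"]]
idf_fake = tfidf_vec.idf_[tfidf_vec.vocabulary_["fake"]]

print("count of 'the' in doc 3:", the_count)
print(f"idf('the') = {idf_the:.3f}, idf('fake') = {idf_fake:.3f}")
```

With raw counts, "the" looks like the strongest feature of the third document; TF-IDF scales it down because it does not help distinguish one document from another.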


> No, it is not feasible to use all possible n-grams because this needs a lot of computational power and time. I think the best way to select them is to treat the n-gram range as a hyperparameter and tune it with a technique like GridSearch (which may itself not be feasible) or RandomSearch, and to prune rare or overly common n-grams with min_df and max_df, as in the searches above.
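
How quickly the feature space grows with the n-gram range, and how `min_df` trims it, can be measured directly. This is a sketch on four made-up sentences; the corpus and the names `sizes` and `pruned` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus
docs = [
    "this post is fake news",
    "this post is real news",
    "fake news spreads fast",
    "real news is verified",
]

# Vocabulary size for word n-grams up to length 1, 2 and 3
sizes = {}
for max_n in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(1, max_n)).fit(docs)
    sizes[max_n] = len(vec.vocabulary_)
    print(f"ngram_range=(1, {max_n}): {sizes[max_n]} features")

# Keep only n-grams that occur in at least 2 documents
pruned = CountVectorizer(ngram_range=(1, 3), min_df=2).fit(docs)
print(f"ngram_range=(1, 3), min_df=2: {len(pruned.vocabulary_)} features")
```

On a real corpus the growth is far steeper, which is why the searches in this notebook combine a bounded `ngram_range` with `min_df` and `max_df` thresholds.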

References: 


https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html

https://www.linkedin.com/pulse/count-vectorizers-vs-tfidf-natural-language-processing-sheel-saket