<a href="https://colab.research.google.com/github/Mostafa3zazi/CISC-873-DM-Data-Mining/blob/main/CISC_873_DM_F22_A3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#CISC-873-DM-F22-a3: Fake Reddit Prediction

#Download data from kaggle

In [None]:
from google.colab import drive
drive.mount('/content/drive')
!mkdir ~/.kaggle
!cp '/content/drive/MyDrive/Colab Notebooks/kaggle.json' ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Mounted at /content/drive


In [None]:
!kaggle competitions download -c cisc-873-dm-f22-a3

Downloading cisc-873-dm-f22-a3.zip to /content
  0% 0.00/5.62M [00:00<?, ?B/s]
100% 5.62M/5.62M [00:00<00:00, 105MB/s]


In [None]:
!unzip cisc-873-dm-f22-a3.zip

Archive:  cisc-873-dm-f22-a3.zip
  inflating: sample_submission.csv   
  inflating: x_test.csv              
  inflating: xy_train.csv            


#inspecting training data

In [None]:
#import libraries for data exploration and processing
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [None]:
pd.options.display.max_columns = 100
pd.options.display.max_rows = 300
pd.options.display.max_colwidth = 100
np.set_printoptions(threshold=2000)

In [None]:
#read train and test files
df_train = pd.read_csv('xy_train.csv',na_values=[""])
df_test = pd.read_csv('x_test.csv',na_values=[""])

In [None]:
df_train

Unnamed: 0,id,text,label
0,265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0
1,284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0
2,207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0
3,551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0
4,8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0
...,...,...,...
59995,70046,"Finish Sniper Simo H盲yh盲 during the invasion of Finland by the USSR (1939, colorized)",0
59996,189377,"Nigerian Prince Scam took $110K from Kansas man; 10 years later, he's getting it back",1
59997,93486,Is It Safe To Smoke Marijuana During Pregnancy? You鈥檇 Be Surprised Of The Answer | no,0
59998,140950,Julius Caesar upon realizing that everyone in the room has a knife except him (44 bc),0


In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      60000 non-null  int64 
 1   text    60000 non-null  object
 2   label   60000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.4+ MB


In [None]:
df_train.label.value_counts()

0    32172
1    27596
2      232
Name: label, dtype: int64

In [None]:
#drop rows with label 2 (should be zero or one only)
df_train = df_train[df_train.label != 2]

In [None]:
df_train.label.value_counts()

0    32172
1    27596
Name: label, dtype: int64

#Problem Formulation
The objective is to predict whether a given Reddit post is fake news by looking at its title. This matters because false information on the Internet has caused many social problems.

The input is: raw text (post titles containing words in various forms).

The output is: a probability (float between 0 and 1) that the Reddit post is fake (0 = not fake, 1 = fake).

This is a binary classification task in which we predict a probability, with ROC AUC as the evaluation metric. The main challenge is that the data is raw text containing words in various forms. We first need to check that the data is clean (no null values, no duplicates, and a label that takes only the values 0 and 1).
We then need to decide how to handle the text: which preprocessing techniques will transform it into numbers, which model to use, and how to tune its hyperparameters.
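
Because the metric is ROC AUC, only the ranking of the predicted probabilities matters, not their absolute values. A minimal pure-Python sketch of that idea follows; the helper `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the notebook, which uses `sklearn.metrics.roc_auc_score`.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# hypothetical toy data: 1 = fake, 0 = not fake
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_prob)
# halving every score keeps the ranking, so the AUC does not change
auc_rescaled = roc_auc(y_true, [p / 2 for p in y_prob])
print(auc, auc_rescaled)
```

This pairwise version is quadratic in the number of rows, so it is only meant to show what the metric measures.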

The following text preprocessing steps will be used:
* remove any HTML tags (`<br />` is often found)
* keep only ASCII letters, European characters, and whitespace (no digits)
* remove single-letter words
* collapse all whitespace (tabs etc.) into a single space
* convert everything to lowercase
* remove stopwords and punctuation, then stem

We will try different stemmers and a tunable pipeline that includes the vectorizer, covering both character-level and word-level vectorizers. Hyperparameters will be tuned with a search method (grid or random) against a validation set.
We will try different models and tune them to achieve the best AUC score.

The ideal solution would be finding the best strategy for preprocessing the text and the optimal hyperparameters for a suitable model. The impact is a powerful model that checks whether a post is fake, which helps address this social media problem.

#Text preprocessing
We will preprocess the data in 3 different ways:
1. SnowballStemmer
2. Lancaster Stemmer
3. no stemmer


In [None]:
import nltk
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.base import TransformerMixin,BaseEstimator
from sklearn.metrics import classification_report, roc_auc_score

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

snowball_stemmer = SnowballStemmer("english")
lancaster_stemmer = LancasterStemmer()
stop_words = set(stopwords.words("english"))


def clean_text(text, stemmer = None):
    """ steps:
        - remove any HTML tags (<br /> is often found)
        - keep only ASCII letters, European characters and whitespace, no digits
        - remove single-letter words
        - collapse all whitespace (tabs etc.) into a single space
        if not for embedding (but e.g. TF-IDF):
        - lowercase everything
        - remove stopwords and punctuation, then stem
    """
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)
    RE_TAGS = re.compile(r"<[^>]+>")
    RE_ASCII = re.compile(r"[^A-Za-zÀ-ž ]", re.IGNORECASE)
    RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž]\b", re.IGNORECASE)

    text = re.sub(RE_TAGS, " ", text)
    text = re.sub(RE_ASCII, " ", text)
    text = re.sub(RE_SINGLECHAR, " ", text)
    text = re.sub(RE_WSPACE, " ", text)

    word_tokens = word_tokenize(text)
    words_tokens_lower = [word.lower() for word in word_tokens]

    if stemmer is None:
        # no stemming
        words_filtered = [
            word for word in words_tokens_lower if word not in stop_words
        ]
    else:
        words_filtered = [
            stemmer.stem(word) for word in words_tokens_lower if word not in stop_words
        ]

    text_clean = " ".join(words_filtered)
    return text_clean

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
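
To make the regex stages of `clean_text` easier to follow, here is a small standalone sketch that applies the same four patterns in the same order and keeps each intermediate string. It leaves out the NLTK tokenizer, stopword removal, and stemming; the helper `regex_stages` and the sample string are assumptions made for illustration.

```python
import re

# the same patterns as in clean_text, compiled once at module level
RE_TAGS = re.compile(r"<[^>]+>")
RE_NON_LETTERS = re.compile(r"[^A-Za-zÀ-ž ]")
RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž]\b")
RE_WSPACE = re.compile(r"\s+")


def regex_stages(text):
    """Return the text after each regex substitution, in order."""
    stages = []
    for pattern in (RE_TAGS, RE_NON_LETTERS, RE_SINGLECHAR, RE_WSPACE):
        text = pattern.sub(" ", text)
        stages.append(text)
    return stages


sample = "<b>Hello</b>   World 42 a\ttest"  # hypothetical input
stages = regex_stages(sample)
for stage in stages:
    print(repr(stage))

# lowercase and re-join on single spaces, standing in for the tokenizer step
cleaned = " ".join(stages[-1].lower().split())
print(cleaned)
```

Compiling the patterns once outside the function, as done here, avoids repeating that work for every row when the function is mapped over a large column.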


In [None]:
#using snowball_stemmer
df_train["text_clean_snowball"] = df_train["text"].map(
    lambda x: clean_text(x, stemmer = snowball_stemmer) if isinstance(x, str) else x
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
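
The message above is the tail of pandas' SettingWithCopyWarning. It appears here because `df_train` was produced earlier by boolean filtering, so pandas cannot tell whether the new column is meant for the filtered frame or for the original. A minimal sketch of one common way to avoid it, using a hypothetical toy frame rather than the competition data:

```python
import pandas as pd

# hypothetical toy frame standing in for xy_train.csv
raw = pd.DataFrame(
    {"text": ["Fake News", "Real Story", "Odd Row"], "label": [0, 1, 2]}
)

# .copy() makes the filtered frame an independent object, so adding a
# column to it is unambiguous and leaves `raw` untouched
filtered = raw[raw.label != 2].copy()
filtered["text_clean"] = filtered["text"].str.lower()

print(filtered)
```

In this notebook the columns are still added as intended, so the warning does not change the results; the explicit copy only removes the ambiguity.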
  


In [None]:
#using lancaster_stemmer
df_train["text_clean_lancaster"] = df_train["text"].map(
    lambda x: clean_text(x, stemmer = lancaster_stemmer) if isinstance(x, str) else x
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [None]:
# no stemmer used
df_train["text_clean"] = df_train["text"].map(
    lambda x: clean_text(x) if isinstance(x, str) else x
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [None]:
df_train

Unnamed: 0,id,text,label,text_clean_snowball,text_clean_lancaster,text_clean
0,265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0,group friend began volunt homeless shelter neighbor protest see anoth person also need natur lik...,group friend beg volunt homeless shelt neighb protest see anoth person also nee nat lik want hel...,group friends began volunteer homeless shelter neighbors protested seeing another person also ne...
1,284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0,british prime minist theresa may nerv attack former russian spi govern conclud high like russia ...,brit prim min theres may nerv attack form russ spy govern conclud high lik russ respons act anor...,british prime minister theresa may nerve attack former russian spy government concluded highly l...
2,207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0,goodyear releas kit allow ps brought heel https youtub com watch alxulk cg zwillc fish midatlant...,goodyear releas kit allow ps brought heel https youtub com watch alxulk cg zwillc fish midatl ye...,goodyear released kit allows ps brought heel https youtube com watch alxulk cg zwillc fishing mi...
3,551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0,happi birthday bob barker price right host like rememb man said ave pet spay neuter fuckincorpor...,happy birthday bob bark pric right host lik rememb man said av pet spay neut fuckincorporateshil...,happy birthday bob barker price right host like remembered man said ave pets spayed neutered fuc...
4,8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0,obama nation innoc cop unarm young black men die magic johnson jimbobshawobodob olymp athlet sho...,obam nat innoc cop unarm young black men dying mag johnson jimbobshawobodob olymp athlet shoot r...,obama nation innocent cops unarmed young black men dying magic johnson jimbobshawobodob olympic ...
...,...,...,...,...,...,...
59995,70046,"Finish Sniper Simo H盲yh盲 during the invasion of Finland by the USSR (1939, colorized)",0,finish sniper simo yh invas finland ussr color,fin snip simo yh invas finland ussr col,finish sniper simo yh invasion finland ussr colorized
59996,189377,"Nigerian Prince Scam took $110K from Kansas man; 10 years later, he's getting it back",1,nigerian princ scam took kansa man year later get back,nig print scam took kansa man year lat get back,nigerian prince scam took kansas man years later getting back
59997,93486,Is It Safe To Smoke Marijuana During Pregnancy? You鈥檇 Be Surprised Of The Answer | no,0,safe smoke marijuana pregnanc surpris answer,saf smok marijuan pregn surpr answ,safe smoke marijuana pregnancy surprised answer
59998,140950,Julius Caesar upon realizing that everyone in the room has a knife except him (44 bc),0,julius caesar upon realiz everyon room knife except bc,juli caes upon real everyon room knif exceiv bc,julius caesar upon realizing everyone room knife except bc


In [None]:
#drop rows with empty text
df_train = df_train[(df_train.text_clean_snowball != "") &
                    (df_train.text_clean_lancaster != "") &
                    (df_train.text_clean != "")]

In [None]:
df_train.isna().sum()

id                      0
text                    0
label                   0
text_clean_snowball     0
text_clean_lancaster    0
text_clean              0
dtype: int64

In [None]:
df_train.shape

(59758, 6)

In [None]:
data_clean = df_train.copy()

#Descriptive analysis

In [None]:
# Word frequency of the most common words
word_freq = pd.Series(" ".join(data_clean["text_clean"]).split()).value_counts()
word_freq[1:40]

new          2998
like         2899
man          2694
trump        2558
colorized    2430
people       2268
first        2247
old          2201
year         2126
years        1999
found        1956
poster       1765
war          1664
time         1625
world        1538
get          1507
us           1506
life         1482
psbattle     1468
day          1433
two          1364
says         1328
made         1314
back         1302
post         1300
looks        1285
circa        1249
american     1227
woman        1202
school       1197
president    1166
make         1152
got          1132
house        1125
true         1125
photo        1112
would        1108
see          1086
police       1085
dtype: int64

In [None]:
# list most uncommon words
word_freq[-10:]

wfaa           1
unprintable    1
faur           1
tae            1
snubbed        1
bookmarked     1
puffing        1
ransgenders    1
wiimotes       1
wahre          1
dtype: int64

Using the most frequent words, we can identify additional candidates for our stop word list in the pre-processing step.

We also observe many words that are hardly ever used. Often these are misspellings or very rare terms. Such sparse features are of little use to the model, because there are too few observations to learn any association from them. We'll come back to this in the modeling phase, relying on the pipeline's ability to deal with such issues (for example the vectorizer's `min_df` threshold).
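
One way rare and overly common words are handled later is through the vectorizer's document-frequency thresholds. The sketch below imitates that filtering in plain Python on a hypothetical toy corpus. The helper `filter_vocabulary` is an assumption for illustration; it follows the convention that an integer `min_df` is an absolute document count and a float `max_df` is a proportion of documents.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # count each term at most once per document
        doc_freq.update(set(doc.split()))
    return {
        term
        for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    }


toy_docs = [
    "man found old photo",
    "man says new school",
    "old man puffing",
    "new photo man",
]

# drop terms seen in fewer than 2 documents or in more than 75% of them
vocab = filter_vocabulary(toy_docs, min_df=2, max_df=0.75)
print(sorted(vocab))
```

Here one-off words such as "puffing" fall below `min_df`, while "man", which occurs in every document, exceeds `max_df`.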

In [None]:
data_clean["label"].value_counts(normalize=True)

0    0.538221
1    0.461779
Name: label, dtype: float64

The two classes are nearly balanced (about 54% vs. 46%).



#Trials
We will split the data into a train set and a test set.
When tuning hyperparameters, the train set will be further split into train and validation sets.

In [None]:
train, test = train_test_split(data_clean, random_state=1, test_size=0.1, shuffle=True)

print(train.shape[0])
print(test.shape[0])

53782
5976


##Trial 1
For the first trial we will use the cleaned text without stemming and train 3 models to establish a baseline AUC score for comparison in later trials.


A word-level TfidfVectorizer will be used for now.
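
For intuition about what the word-level vectorizer produces, here is a simplified pure-Python TF-IDF sketch. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2-normalised rows, which to my knowledge matches scikit-learn's defaults, but it only handles unigrams and splits on whitespace. The helper `tfidf` and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Word-level TF-IDF with smoothed IDF and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return idf, rows


toy_docs = ["man found old photo", "man says new school", "old man puffing"]
idf, rows = tfidf(toy_docs)

for term in ("man", "old", "puffing"):
    print(term, round(idf[term], 4))
```

Terms that appear in every document get the smallest weight, and rarer terms get larger ones, which is why frequent filler words contribute little to the classifier.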

In [None]:
#use text_clean for training data
X_train = train["text_clean"]
Y_train = train["label"]
X_test = test["text_clean"]
Y_test = test["label"]

In [None]:
X_train

30751                               stay safe firearms weapons attack modern british psa event terror attack
8267     ottoman troops locate sink privateers ship hired arab merchants drive prices sabotage somewhere ...
29177                   man deep fries pc starvation bangkok mall witness states suspect locked storage days
8555     president john kennedy funeral casket conveyed white house cathedral st mathew apostle washingto...
19571                      hotel stayed captain crunch crunchberries wallpaper addition cows pigs corn beans
                                                        ...                                                 
50264                   university florida eliminates computer science department increases athletic budgets
32677                                   record breaking quadruple amputee wheelchair returned stolen thieves
5235     man bought old log cabin made everyone jealous demolished built back ground added storage huntin...
12268           bra

In [None]:
Y_train

30751    0
8267     0
29177    0
8555     0
19571    1
        ..
50264    1
32677    1
5235     0
12268    1
33170    1
Name: label, Length: 53782, dtype: int64

In [None]:
#use 3 classifiers with near-default settings
classifiers = [
    LogisticRegression(solver="sag", random_state=1),
    XGBClassifier(random_state=1),
    MLPClassifier(
        random_state=1,
        solver="adam",
        hidden_layer_sizes=(12, 12, 12),
        activation="relu",
        early_stopping=True,
        n_iter_no_change=1,
    ),
]
names = ['lg','xgb','mlp']

In [None]:
results_noStemmer = {}
for name, clf in zip(names, classifiers):
    print(f"Training classifier: {name}")
    pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))), ("clf", clf)])
    pipe.fit(X_train.values, Y_train.values)
    prediction = pipe.predict_proba(X_test)[:,1]
    report = roc_auc_score(Y_test, prediction)
    results_noStemmer[name] = report

Training classifier: lg
Training classifier: xgb
Training classifier: mlp


In [None]:
# Prediction results
for k, v in results_noStemmer.items():
    print(f"Results for {k}:")
    print(f"{v}\n")

Results for lg:
0.8763229909877142

Results for xgb:
0.7445431923334269

Results for mlp:
0.8779802499698023



Our goal was only to find a good baseline for comparison, but logistic regression and the MLP already achieved good AUC scores, so I submitted to Kaggle. Unfortunately, the Kaggle score was 0.82300.

##Trial 2
We will test which stemmer achieves the higher AUC using the same 3 models (XGBoost, logistic regression, and MLPClassifier), hoping to improve on the baseline.

In [None]:
# 1- using snowball stemmer
X_train = train["text_clean_snowball"]
Y_train = train["label"]
X_test = test["text_clean_snowball"]
Y_test = test["label"]

In [None]:
X_train

30751                                 stay safe firearm weapon attack modern british psa event terror attack
8267     ottoman troop locat sink privat ship hire arab merchant drive price sabotag somewher coast hatay...
29177                                 man deep fri pc starvat bangkok mall wit state suspect lock storag day
8555         presid john kennedi funer casket convey white hous cathedr st mathew apostl washington colouris
19571                                  hotel stay captain crunch crunchberri wallpap addit cow pig corn bean
                                                        ...                                                 
50264                                      univers florida elimin comput scienc depart increas athlet budget
32677                                            record break quadrupl ampute wheelchair return stolen thiev
5235        man bought old log cabin made everyon jealous demolish built back ground ad storag hunt gear etc
12268              

In [None]:
classifiers = [
    LogisticRegression(solver="sag", random_state=1),
    XGBClassifier(random_state=1),
    MLPClassifier(
        random_state=1,
        solver="adam",
        hidden_layer_sizes=(12, 12, 12),
        activation="relu",
        early_stopping=True,
        n_iter_no_change=1,
    ),
]
names = ['lg','xgb','mlp']

In [None]:
results_snowball = {}
for name, clf in zip(names, classifiers):
    print(f"Training classifier: {name}")
    pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))), ("clf", clf)])
    pipe.fit(X_train, Y_train)
    prediction = pipe.predict_proba(X_test)[:,1]
    report = roc_auc_score(Y_test, prediction)
    results_snowball[name] = report

Training classifier: lg
Training classifier: xgb
Training classifier: mlp


In [None]:
# Prediction results
for k, v in results_snowball.items():
    print(f"Results for {k}:")
    print(f"{v}\n")

Results for lg:
0.8717893646793131

Results for xgb:
0.7654445272338783

Results for mlp:
0.8785080891610535



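Compared with the previous trial, XGBoost (0.7654 vs. 0.7445) and the MLP (0.8785 vs. 0.8780) score slightly higher, while logistic regression scores slightly lower (0.8718 vs. 0.8763).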

In [None]:
# 2- using lancaster stemmer
X_train = train["text_clean_lancaster"]
Y_train = train["label"]
X_test = test["text_clean_lancaster"]
Y_test = test["label"]

In [None]:
results_lancaster = {}
for name, clf in zip(names, classifiers):
    print(f"Training classifier: {name}")
    pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))), ("clf", clf)])
    pipe.fit(X_train, Y_train)
    prediction = pipe.predict_proba(X_test)[:,1]
    report = roc_auc_score(Y_test, prediction)
    results_lancaster[name] = report

Training classifier: lg
Training classifier: xgb
Training classifier: mlp


In [None]:
# Prediction results
for k, v in results_lancaster.items():
    print(f"Results for {k}:")
    print(f"{v}\n")

Results for lg:
0.8641763559710408

Results for xgb:
0.7585228442287395

Results for mlp:
0.8637313062136863



For all 3 models the Lancaster stemmer scored lower than the Snowball stemmer, so the upcoming trials will use the Snowball stemmer.

##Trial 3
Test whether a character-level or a word-level vectorizer is better for our task.

We already covered the word-level case with the Snowball stemmer in the previous trial.

Now let's use TfidfVectorizer at the character level.
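
To see what the two analyzers feed the model, here is a rough pure-Python sketch of word n-grams versus character n-grams with the same range (1, 2). The helpers `word_ngrams` and `char_ngrams` are illustrative assumptions; the real vectorizer also applies its own tokenisation and whitespace normalisation.

```python
def word_ngrams(text, n_min, n_max):
    """All contiguous word sequences of length n_min..n_max."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def char_ngrams(text, n_min, n_max):
    """All contiguous character sequences of length n_min..n_max (spaces included)."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


title = "stay safe"  # hypothetical cleaned title
words = word_ngrams(title, 1, 2)
chars = char_ngrams(title, 1, 2)

print(words)
print(chars)
```

With a range of (1, 2), character features are only single letters and letter pairs, so each one carries far less meaning than a word or a word pair. Character n-grams usually need longer ranges to become informative.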

In [None]:
#from now on we will train with snowball stemmer
X_train = train["text_clean_snowball"]
Y_train = train["label"]
X_test = test["text_clean_snowball"]
Y_test = test["label"]

In [None]:
X_train

30751                                 stay safe firearm weapon attack modern british psa event terror attack
8267     ottoman troop locat sink privat ship hire arab merchant drive price sabotag somewher coast hatay...
29177                                 man deep fri pc starvat bangkok mall wit state suspect lock storag day
8555         presid john kennedi funer casket convey white hous cathedr st mathew apostl washington colouris
19571                                  hotel stay captain crunch crunchberri wallpap addit cow pig corn bean
                                                        ...                                                 
50264                                      univers florida elimin comput scienc depart increas athlet budget
32677                                            record break quadrupl ampute wheelchair return stolen thiev
5235        man bought old log cabin made everyon jealous demolish built back ground ad storag hunt gear etc
12268              

In [None]:
results_snowball_char_vectorizer = {}
for name, clf in zip(names, classifiers):
    print(f"Training classifier: {name}")
    pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer = 'char',ngram_range=(1, 2))), ("clf", clf)])
    pipe.fit(X_train, Y_train)
    prediction = pipe.predict_proba(X_test)[:,1]
    report = roc_auc_score(Y_test, prediction)
    results_snowball_char_vectorizer[name] = report

Training classifier: lg
Training classifier: xgb
Training classifier: mlp


In [None]:
# Prediction results
for k, v in results_snowball_char_vectorizer.items():
    print(f"Results for {k}:")
    print(f"{v}\n")

Results for lg:
0.7285160288034844

Results for xgb:
0.732994575256274

Results for mlp:
0.7278767857974953



AUC drops steeply with character-level features (logistic regression and the MLP fall from about 0.87 to about 0.73), so we will stick with the word-level vectorizer.

##Trial 4
Now for the tuning part.
Let's start with logistic regression.

We will tune the hyperparameters with random search, scored on a validation set.

In [None]:
# split the training set into train and validation sets
# the validation set is the last 10% of the training set
# the PredefinedSplit object is passed to the random search as the cv parameter
from sklearn.model_selection import PredefinedSplit
val_fold = np.full((X_train.shape[0], ),-1, dtype=int)
val_fold[-int(X_train.shape[0]*0.1):] = 0
ps = PredefinedSplit(val_fold)
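
The `val_fold` array follows the `PredefinedSplit` convention: rows marked -1 are never held out, and rows marked with a fold number are held out in that fold. The sketch below reproduces that logic in plain Python on a hypothetical 20-row dataset; `predefined_split_indices` is an illustrative helper, not a scikit-learn function.

```python
def predefined_split_indices(test_fold):
    """Yield (train_idx, val_idx) pairs following the PredefinedSplit convention."""
    folds = sorted({f for f in test_fold if f != -1})
    for fold in folds:
        val_idx = [i for i, f in enumerate(test_fold) if f == fold]
        train_idx = [i for i, f in enumerate(test_fold) if f != fold]
        yield train_idx, val_idx


n_rows = 20  # hypothetical dataset size
val_fold = [-1] * n_rows
n_val = int(n_rows * 0.1)
# note: this slice would select every row if n_val were 0 (fewer than 10 rows)
val_fold[-n_val:] = [0] * n_val

splits = list(predefined_split_indices(val_fold))
train_idx, val_idx = splits[0]
print(len(splits), len(train_idx), val_idx)
```

Because only one fold number is used, the search evaluates each candidate on a single train/validation split, which is why the search log reports "Fitting 1 folds".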

In [None]:
clf =  LogisticRegression(solver="sag", random_state=1)
pipe_lg = Pipeline([("tfidf", TfidfVectorizer()), ("lg", clf)])

In [None]:
params = {
    "tfidf__ngram_range": [(1,2),(1, 3)],
    "tfidf__max_df": np.arange(0.3,0.8,0.1),
    "tfidf__min_df": np.arange(1,20),
    'lg__C': [0.1, 1, 10, 100, 10000],
    'lg__solver' :['lbfgs', 'liblinear', 'sag', 'saga']
}
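
To get a feel for how much of this search space a random search covers, the sketch below counts the combinations in the grid above and draws one random candidate. It samples each parameter independently, which is a simplification of what `RandomizedSearchCV` does; `sample_candidate` and `search_space` are illustrative names.

```python
import math
import random

# the same grid as above, written with plain Python lists
search_space = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    "tfidf__max_df": [0.3, 0.4, 0.5, 0.6, 0.7],
    "tfidf__min_df": list(range(1, 20)),
    "lg__C": [0.1, 1, 10, 100, 10000],
    "lg__solver": ["lbfgs", "liblinear", "sag", "saga"],
}

n_combinations = math.prod(len(values) for values in search_space.values())
n_iter = 200
coverage = n_iter / n_combinations


def sample_candidate(space, rng):
    """Pick one value per parameter, uniformly and independently."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(1)
candidate = sample_candidate(search_space, rng)
print(n_combinations, f"{coverage:.1%}", candidate)
```

With 200 iterations the search tries roughly one in nineteen of the possible combinations, so the best result found is a good candidate rather than a guaranteed optimum.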

In [None]:
pipe_clf = RandomizedSearchCV(pipe_lg, params, n_jobs=-1, n_iter = 200
                              ,scoring="roc_auc",cv = ps , verbose = 3,refit=True)
pipe_clf.fit(X_train, Y_train)

Fitting 1 folds for each of 200 candidates, totalling 200 fits




RandomizedSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0])),
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('lg',
                                              LogisticRegression(random_state=1,
                                                                 solver='sag'))]),
                   n_iter=200, n_jobs=-1,
                   param_distributions={'lg__C': [0.1, 1, 10, 100, 10000],
                                        'lg__solver': ['lbfgs', 'liblinear',
                                                       'sag', 'saga'],
                                        'tfidf__max_df': array([0.3, 0.4, 0.5, 0.6, 0.7]),
                                        'tfidf__min_df': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19]),
                                        'tfidf__ngram_range': [(1, 2), (1, 3)]},
                   scoring='roc_auc', verbos

In [None]:
results = pd.DataFrame(pipe_clf.cv_results_)
results.sort_values('rank_test_score').head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_tfidf__ngram_range,param_tfidf__min_df,param_tfidf__max_df,param_lg__solver,param_lg__C,params,split0_test_score,mean_test_score,std_test_score,rank_test_score
31,9.80247,0.0,0.254438,0.0,"(1, 2)",1,0.3,sag,10,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 1, 'tfidf__max_df': 0.3, 'lg__solver': 'sag', 'l...",0.878517,0.878517,0.0,1
166,7.234909,0.0,0.269924,0.0,"(1, 2)",1,0.5,liblinear,10,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 1, 'tfidf__max_df': 0.5, 'lg__solver': 'liblinea...",0.878516,0.878516,0.0,2
127,31.516126,0.0,0.417683,0.0,"(1, 3)",1,0.5,lbfgs,10000,"{'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 1, 'tfidf__max_df': 0.5, 'lg__solver': 'lbfgs', ...",0.877176,0.877176,0.0,3
121,15.541275,0.0,0.282263,0.0,"(1, 2)",1,0.4,lbfgs,10000,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 1, 'tfidf__max_df': 0.4, 'lg__solver': 'lbfgs', ...",0.876948,0.876948,0.0,4
90,12.270171,0.0,0.394601,0.0,"(1, 3)",1,0.7,liblinear,10,"{'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 1, 'tfidf__max_df': 0.7000000000000002, 'lg__sol...",0.876078,0.876078,0.0,5
103,17.752099,0.0,0.362624,0.0,"(1, 3)",1,0.4,saga,10,"{'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 1, 'tfidf__max_df': 0.4, 'lg__solver': 'saga', '...",0.876073,0.876073,0.0,6
176,3.749802,0.0,0.224356,0.0,"(1, 2)",2,0.3,liblinear,1,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 2, 'tfidf__max_df': 0.3, 'lg__solver': 'liblinea...",0.871764,0.871764,0.0,7
189,4.299301,0.0,0.227615,0.0,"(1, 2)",2,0.7,sag,1,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 2, 'tfidf__max_df': 0.7000000000000002, 'lg__sol...",0.871759,0.871759,0.0,8
36,5.016878,0.0,0.209786,0.0,"(1, 2)",3,0.3,lbfgs,1,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 3, 'tfidf__max_df': 0.3, 'lg__solver': 'lbfgs', ...",0.871704,0.871704,0.0,9
74,6.378326,0.0,0.307111,0.0,"(1, 3)",2,0.5,saga,1,"{'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 2, 'tfidf__max_df': 0.5, 'lg__solver': 'saga', '...",0.871491,0.871491,0.0,10


In [None]:
prediction = pipe_clf.predict_proba(X_test)[:,1]
roc_auc_score(Y_test, prediction)

0.8796216749071559

Logistic regression achieved a slightly higher AUC score after tuning: validation AUC = 0.8785 and test AUC = 0.8796.

I used this model for submission and got a higher Kaggle score of 0.83679.

From the results we can see that tfidf__ngram_range = (1, 2) achieved a higher score than (1, 3), so we will use (1, 2) for the upcoming trials.

param_tfidf__min_df is between 1 and 3 in the top 10 results (and 1 in the top six), so we will reduce the param_tfidf__min_df range.

##Trial 5
XGBoost can achieve higher scores when fine-tuned, so XGBClassifier will be used in this trial.

I considered using the MLP for this trial, but it takes longer to train and has many parameters to tune.

We will tune the hyperparameters with random search, scored on a validation set.

In [None]:
clf =  XGBClassifier()
pipe_gbc = Pipeline([("tfidf", TfidfVectorizer()), ("GBC", clf)])

In [None]:
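# adjust tfidf parameters according to the previous trial
# NOTE: 'loss', 'criterion' and 'min_samples_leaf' are scikit-learn
# GradientBoostingClassifier names, not XGBoost ones, so they probably have no effect here;
# learning_rate lists 0.1 twice, and 2 is unusually large for boosting
params = {
    "tfidf__ngram_range": [(1,2)],
    "tfidf__max_df": np.arange(0.2,0.8,0.1),
    "tfidf__min_df": np.arange(1,10),
    'GBC__n_estimators' : np.arange(70,800),
    'GBC__learning_rate' : [0.1,0.1,2],
    'GBC__loss': ['deviance','exponential'],
    'GBC__criterion' : ['friedman_mse','squared_error'],
    'GBC__min_samples_leaf': np.arange(2,10)
}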
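
The names `loss`, `criterion`, and `min_samples_leaf` in the dictionary above come from scikit-learn's `GradientBoostingClassifier` rather than from XGBoost. As a hedged alternative, the sketch below lists a search space built from parameter names that XGBoost's scikit-learn wrapper does document. The value ranges are illustrative assumptions, not tuned or tested settings.

```python
# hypothetical search space for the "GBC" pipeline step (an XGBClassifier)
xgb_search_space = {
    "GBC__n_estimators": list(range(100, 801, 100)),
    "GBC__learning_rate": [0.01, 0.05, 0.1, 0.2],
    "GBC__max_depth": [3, 4, 6, 8],
    "GBC__min_child_weight": [1, 3, 5],      # roughly plays the role of min_samples_leaf
    "GBC__subsample": [0.6, 0.8, 1.0],       # row sampling per tree
    "GBC__colsample_bytree": [0.6, 0.8, 1.0],  # feature sampling per tree
}

# strip the pipeline step prefix to see the estimator-level names
estimator_params = sorted(name.split("__", 1)[1] for name in xgb_search_space)
print(estimator_params)
```

A dictionary like this could be passed as the parameter distributions in place of the one above, combined with the same `tfidf__` entries.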

In [None]:
pipe_clf = RandomizedSearchCV(pipe_gbc, params, n_jobs=-1, n_iter = 50
                              ,scoring="roc_auc",cv = ps , verbose = 3,refit=True)
pipe_clf.fit(X_train, Y_train)

Fitting 1 folds for each of 50 candidates, totalling 50 fits


RandomizedSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0])),
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('GBC', XGBClassifier())]),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'GBC__criterion': ['friedman_mse',
                                                           'squared_error'],
                                        'GBC__learning_rate': [0.1, 0.1, 2],
                                        'GBC__loss': ['deviance',
                                                      'exponential'],
                                        'GBC__min_samples_leaf': array([2, 3,...
       746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758,
       759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 769, 770, 771,
       772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784,
       785, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797

In [None]:
results = pd.DataFrame(pipe_clf.cv_results_)
results.sort_values('rank_test_score').head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_tfidf__ngram_range,param_tfidf__min_df,param_tfidf__max_df,param_GBC__n_estimators,param_GBC__min_samples_leaf,param_GBC__loss,param_GBC__learning_rate,param_GBC__criterion,params,split0_test_score,mean_test_score,std_test_score,rank_test_score
42,167.851213,0.0,0.421544,0.0,"(1, 2)",3,0.5,768,5,deviance,0.1,friedman_mse,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 3, 'tfidf__max_df': 0.5000000000000001, 'GBC__n_...",0.837778,0.837778,0.0,1
44,118.366962,0.0,0.406255,0.0,"(1, 2)",7,0.2,730,3,deviance,0.1,friedman_mse,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 7, 'tfidf__max_df': 0.2, 'GBC__n_estimators': 73...",0.836376,0.836376,0.0,2
28,124.356395,0.0,0.387956,0.0,"(1, 2)",5,0.4,701,2,deviance,0.1,friedman_mse,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 5, 'tfidf__max_df': 0.4000000000000001, 'GBC__n_...",0.833726,0.833726,0.0,3
29,100.214436,0.0,0.368555,0.0,"(1, 2)",8,0.5,630,9,exponential,0.1,friedman_mse,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 8, 'tfidf__max_df': 0.5000000000000001, 'GBC__n_...",0.833548,0.833548,0.0,4
35,95.81554,0.0,0.362448,0.0,"(1, 2)",8,0.5,610,5,exponential,0.1,friedman_mse,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 8, 'tfidf__max_df': 0.5000000000000001, 'GBC__n_...",0.832579,0.832579,0.0,5
20,118.872488,0.0,0.401544,0.0,"(1, 2)",5,0.3,658,9,deviance,0.1,squared_error,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 5, 'tfidf__max_df': 0.30000000000000004, 'GBC__n...",0.832298,0.832298,0.0,6
14,117.830956,0.0,0.37307,0.0,"(1, 2)",4,0.6,615,6,deviance,0.1,squared_error,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 4, 'tfidf__max_df': 0.6000000000000001, 'GBC__n_...",0.831446,0.831446,0.0,7
34,101.616915,0.0,0.371746,0.0,"(1, 2)",5,0.5,575,7,exponential,0.1,squared_error,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 5, 'tfidf__max_df': 0.5000000000000001, 'GBC__n_...",0.83037,0.83037,0.0,8
38,89.915568,0.0,0.357581,0.0,"(1, 2)",7,0.6,549,7,deviance,0.1,friedman_mse,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 7, 'tfidf__max_df': 0.6000000000000001, 'GBC__n_...",0.830244,0.830244,0.0,9
12,83.272531,0.0,0.341183,0.0,"(1, 2)",6,0.5,487,4,deviance,0.1,friedman_mse,"{'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 6, 'tfidf__max_df': 0.5000000000000001, 'GBC__n_...",0.82673,0.82673,0.0,10


In [None]:
prediction = pipe_clf.predict_proba(X_test)[:,1]
roc_auc_score(Y_test, prediction)

0.8407223468587743

The gradient boosting classifier achieved a slightly higher AUC score, with validation AUC = 0.837778 and test AUC = 0.84072.

I used this model for the submission, but unfortunately it scored lower on Kaggle (score = 0.79383).

#Testing
This part is only used for submission.

In [None]:
df_test

Unnamed: 0,id,text,text_clean
0,0,stargazer,stargaz
1,1,yeah,yeah
2,2,PD: Phoenix car thief gets instructions from YouTube video,pd phoenix car thief get instruct youtub video
3,3,"As Trump Accuses Iran, He Has One Problem: His Own Credibility",trump accus iran one problem credibl
4,4,"""Believers"" - Hezbollah 2011",believ hezbollah
...,...,...,...
59146,59146,Bicycle taxi drivers of New Delhi,bicycl taxi driver new delhi
59147,59147,Trump blows up GOP's formula for winning House races,trump blow gop formula win hous race
59148,59148,"Napoleon returns from his exile on the island of Elba. (March 1815), Colourised",napoleon return exil island elba march colouris
59149,59149,Deep down he always wanted to be a ballet dancer,deep alway want ballet dancer


In [None]:
df_test["text_clean"] = df_test["text"].map(
    lambda x: clean_text(x, stemmer = snowball_stemmer) if isinstance(x, str) else x
)

In [None]:
# pipe_lg.predict_proba(df_test["text_clean"])

In [None]:
submission = pd.DataFrame()

submission['id'] = df_test['id']

submission['label'] = pipe_clf.predict_proba(df_test["text_clean"])[:,1]

submission.to_csv('sample_submission_walkthrough.csv', index=False)

In [None]:
!kaggle competitions submit -c cisc-873-dm-f22-a3 -f sample_submission_walkthrough.csv -m ""

100% 930k/930k [00:00<00:00, 4.15MB/s]
Successfully submitted to CISC-873-DM-F22-a3

#Questions

1. What is the difference between Character n-gram and Word n-gram? Which one tends to suffer more from the OOV issue?
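* Character n-gram: represents each unique sequence of n characters as a feature.
* Word n-gram: represents each unique sequence of n words as a feature.
* Word n-grams suffer more from the OOV issue: a word never seen in training has no matching feature, whereas it usually still shares some character n-grams with the training vocabulary.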
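
The OOV point can be illustrated with a small sketch in plain Python. The helper names `word_ngrams` and `char_ngrams` and the toy sentences are assumptions made for this example; they are not part of the notebook's pipeline.

```python
def word_ngrams(text, n):
    """Return the word n-grams of `text`, using simple whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of `text`."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Toy "training" vocabulary built from a single sentence (hypothetical data).
train_text = "trump blows up formula"
word_vocab = set(word_ngrams(train_text, 1))
char_vocab = set(char_ngrams(train_text, 3))

# "blowing" never appears in the training text, so it is out of vocabulary
# at the word level.
unseen = "blowing"
word_hits = [g for g in word_ngrams(unseen, 1) if g in word_vocab]
char_hits = [g for g in char_ngrams(unseen, 3) if g in char_vocab]

print(word_hits)  # no word unigram matches
print(char_hits)  # some character trigrams still match
```

The unseen word contributes nothing to a word-level feature vector, while a character-level vector still picks up the trigrams it shares with "blows".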

2. What is the difference between stop word removal and stemming? Are these techniques language-dependent?
* Stop word removal: a stop word is a commonly used word (such as “the”, “a”, “an”, “in”). Stop word removal drops these words because in most cases they carry little information, cost processing time and memory, and add little to the model.
* Stemming: the process of reducing a word to its stem by stripping affixes (suffixes and prefixes), so that related words such as "running" and "runs" fall under a common stem. The stem is not necessarily a dictionary word (a lemma), which is what lemmatization produces. Unlike stop word removal, stemming transforms the word rather than removing it.
* Yes, these techniques are language-dependent: both the stop word list and the affix rules differ from one language to another.
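
A minimal sketch of the difference follows. The stop word list and the suffix rules are deliberately tiny and hypothetical; the toy stemmer is not the Snowball stemmer used in the notebook's `clean_text`.

```python
# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is"}

# Hypothetical suffix rules for a toy stemmer; this is NOT the Snowball algorithm.
SUFFIXES = ("ing", "ed", "s")


def remove_stop_words(tokens):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = "the drivers wanted to dance in the streets".split()
kept = remove_stop_words(tokens)       # whole words are removed
stemmed = [toy_stem(t) for t in kept]  # remaining words are shortened, not removed

print(kept)
print(stemmed)
```

Both `STOP_WORDS` and `SUFFIXES` are English-specific, which is why each language needs its own list and rules.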

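3. Are tokenization techniques language dependent? Why?
* Yes. Splitting rules depend on the writing system: whitespace tokenization works for English, but languages such as Chinese and Japanese do not separate words with spaces, and dictionary-based tokenizers need a dictionary for each language. Common tokenization techniques include: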
  1. White Space Tokenization
  2. **Dictionary Based Tokenization**: tokens are matched against entries that already exist in a dictionary; if a token is not found, special rules are used to tokenize it. It is more advanced than whitespace tokenization.
  3. Regular Expression Tokenizer
  4. Penn TreeBank Tokenization: a rule-based tokenizer that follows the conventions of the Penn Treebank, a corpus annotated with syntactic and semantic information.

  ref: https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4
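
As a rough illustration, the sketch below compares whitespace and regular expression tokenization using only the standard library. The regular expression and the Chinese example sentence are assumptions chosen for this example, not something used elsewhere in the notebook.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached to words."""
    return text.split()


def regex_tokenize(text):
    """Return words (with optional internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


english = "As Trump Accuses Iran, He Has One Problem: His Own Credibility"
ws_tokens = whitespace_tokenize(english)
rx_tokens = regex_tokenize(english)
print(ws_tokens)  # "Iran," and "Problem:" keep their punctuation
print(rx_tokens)  # punctuation becomes separate tokens

# Chinese text has no spaces between words, so whitespace tokenization
# returns the whole sentence as a single token.
chinese = "我喜欢自然语言处理"
zh_tokens = whitespace_tokenize(chinese)
print(zh_tokens)
```

The same whitespace rule that is adequate for English headlines fails entirely on the Chinese sentence, which is the sense in which tokenization is language-dependent.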

4. What is the difference between count vectorizer and tf-idf vectorizer? Would it be feasible to use all possible n-grams? If not, how should you select them?
* CountVectorizer only counts the number of times each word appears in a document.
* TfidfVectorizer weights those counts by how rare the word is across all documents, so words that appear in most documents are penalized.
* No, using all possible n-grams would not be feasible: the number of distinct n-grams grows rapidly with n, so the feature matrix would exceed memory and storage limits.
* Select them by trial and error (for example with cross-validation) and according to the problem. For example, when predicting doctor ratings from feedback, the difference between "good" and "not good" has a large effect, so bigrams are worth considering; for auto-completion systems, 3-grams or higher may be appropriate.
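
The difference between raw counts and tf-idf weights can be sketched in plain Python. This uses the textbook form idf = log(N / df); scikit-learn's `TfidfVectorizer` applies a smoothed variant plus normalization by default, so its numbers will differ. The three feedback sentences are hypothetical.

```python
import math
from collections import Counter

# Hypothetical feedback corpus.
docs = [
    "the doctor was good",
    "the doctor was not good",
    "the nurse was kind",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each word appears in.
df = Counter(word for tokens in tokenized for word in set(tokens))


def count_vector(tokens):
    """Raw term counts, as a count vectorizer would produce."""
    return dict(Counter(tokens))


def tfidf_vector(tokens):
    """Term count times textbook idf = log(N / df)."""
    counts = Counter(tokens)
    return {w: c * math.log(n_docs / df[w]) for w, c in counts.items()}


counts = count_vector(tokenized[1])
weights = tfidf_vector(tokenized[1])
print(counts)   # every word in the second document has count 1
print(weights)  # "the" and "was" drop to 0, "not" gets the largest weight

# Word bigrams keep "not good" together as a single feature.
tokens = tokenized[1]
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)
```

Words that occur in every document receive zero weight under this formula, while the rarer "not" stands out, and the bigram list shows why "not good" is only visible once bigrams are included.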