# Assignment 3

## Problem Formulation: ✍
* We need to know the predicted probability of whether the new record (news title) is fake or not fake. 

### Input:
* Text data for titles for news, some of them are fake and some are not. About 60000 records are labeled (train dataset) with label 1 which means fake news title, and label 0 which means not fake news title. About 59151 records are not labeled (test dataset).

### Output:
* We need to predict the probability of fake value for the given dataset (test dataset). So, this would help to know if this news title is fake (has high probability to be labeled by class 1) or not (has low probability to be labeled by class 1).


## What data mining function is required? 🕵🏽
* In this assignment we need "classification and prediction" data mining function.

## What could be the challenges? 😕
* Challenges are: how to make data organization and deal with the missing values and news with label 2 in the train dataset. How to make good preprocessing for the text data. How to use vectorizer and make tuning for its hyperparameters and make tuning for each model used and get the best hyperparameters for the model and the vectorizer. In addition to how to develop a successful solution for the problem and make good predictions to avoid the fake titles.

## What is the impact? 🤓
* The impact is to know more about preprocessing for the data, deal with text data, deal with vectorizer, and deal with the hyperparameters to see how they can affect the model and the results (predicted probability of whether is the label fake or not), and how to deal with the pipelines with types of searching for the hyperparameters. So, by making a good algorithm this would give the right predictions then this would help to avoid the fake news titles, so the one or the newspaper that post the news won't lose people's trust.

## What is an ideal solution? 🦸
* The ideal solution is to make very good learning for the model that will give perfect performance parameters, and this happens after making good preprocessing steps on the text data. So, the model can predict whether the new record (news title) is fake or not.

## What is the experimental protocol used and how was it carried out? 🤔
* I used the cross-validation method in some trials, and the holdout method in some trials. So in the cross-validation the algorithm splits the input data into training, and validation datasets, and in the holdout method I am working by splitting the given train dataset into two parts (train and test) and by selecting the test size I needed. I used the new parts for the train and test to tune the hyperparameters and get the know the best hyperparameters for the models I created by calculating the AUCROC score on that test part which I took from the given training set, so I have the right labels and by getting the predicted labels I could calculate the AUCROC score.

## What preprocessing steps are used? 
* I used two preprocessing techniques:
    * First one: get any number of white spaces with single white space - remove html tags - remove any letter not written in English language - remove any single character that has space before it and space after it - make word tokenization and split the sentences into tokens - convert any capital letter with its small letter to make all of the letters small - make stemming for all words if it is not stop word.
    * Second one: make all letters in lowercase - Numbers removing - remove punctuation - remove white spaces - make tokenization for the words - stop words removal - make stemming for the words- make lemmatization for the words.
    
* As stemming means the process of removing or stems the last few characters of a word, often leading to incorrect meanings and spelling. But lemmatization means the process of consider the context and converts the word to its meaningful base form.

# Import required libraries

In [1]:
import re
import pickle
import sklearn
import pandas as pd
import numpy as np
import nltk 

from pathlib import Path
# some seeting for pandas

pd.options.display.max_columns = 100
pd.options.display.max_rows = 300
pd.options.display.max_colwidth = 100
np.set_printoptions(threshold=2000)

from skopt import BayesSearchCV

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import PredefinedSplit

# Data preparation
## First read the train csv file and test csv file

In [2]:
# read the train file and make the id in the csv file be the index of the dataframe
data_train = pd.read_csv('xy_train.csv', index_col='id')
# take a copy from the dataframe to start preprocessing technique 1
tr_data = data_train.copy()
# view information about the train dataset
tr_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60000 entries, 265723 to 34509
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    60000 non-null  object
 1   label   60000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.4+ MB


In [3]:
# display some of the training dataset
tr_data

Unnamed: 0_level_0,text,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1
265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0
284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0
207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0
551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0
8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0
...,...,...
70046,"Finish Sniper Simo H盲yh盲 during the invasion of Finland by the USSR (1939, colorized)",0
189377,"Nigerian Prince Scam took $110K from Kansas man; 10 years later, he's getting it back",1
93486,Is It Safe To Smoke Marijuana During Pregnancy? You鈥檇 Be Surprised Of The Answer | no,0
140950,Julius Caesar upon realizing that everyone in the room has a knife except him (44 bc),0


In [4]:
# read the test file and make the id in the csv file be the index of the dataframe
data_test = pd.read_csv('x_test.csv', index_col='id')
# take a copy from the dataframe to start preprocessing technique 1
ts_data = data_test.copy()
# view information about the train dataset
ts_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59151 entries, 0 to 59150
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    59151 non-null  object
dtypes: object(1)
memory usage: 924.2+ KB


# Trials 🏃‍♀️ 🤗
### 1. Implement two different ways in preprocessing and apply on both Logistic Regression classifier without tuning any hyperparameters.
### 2. A tunable pipeline including the vectorizer with word based analyzer using Logistic Regression classifier.
### 3. Pipeline with character-level vectorizer and Logistic Regression classifier.
### 4. Pipeline with random search with validation set and KNN classifier.
### 5. Tuned pipeline with XGBoost classifier


# Trial 1:
## Preprocessing technique number 1 with word-level-vectorizer and Logisitc regression classifier.

In [5]:
# nltk.download('punkt')
# nltk.download('stopwords')

# define our stemmer and get the stopwords from the nltk library by English language as our dataset in English language
stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

# define the first preprocessing function
def preprocessing_1(text):
    """ steps:
        - remove any html tags (< /br> often found)
        - Keep only ASCII + European Chars and whitespace, no digits
        - remove single letter chars
        - convert all whitespaces (tabs etc.) to single wspace
        - remove stopwords, punctuation and stemm
    """
    # get any number of white spaces with single white space
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)
    # get html tages
    RE_TAGS = re.compile(r"<[^>]+>")
    # get any letter not written in English way
    RE_ASCII = re.compile(r"[^A-Za-zÀ-ž ]", re.IGNORECASE)
    # get any single character that has space before it and space after it
    RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž]\b", re.IGNORECASE)
    
    # romove anything from the above with space and the text after what will be removed
    text = re.sub(RE_TAGS, " ", text)
    text = re.sub(RE_ASCII, " ", text)
    text = re.sub(RE_SINGLECHAR, " ", text)
    text = re.sub(RE_WSPACE, " ", text)
    
    # make word tokenization and split the sentences into tokens
    word_tokens = word_tokenize(text)
    # convert any capital letter with its small letter to make all of the letters small
    words_tokens_lower = [word.lower() for word in word_tokens]

    # stemming for all words if it is not stop word
    words_filtered = [
        stemmer.stem(word) for word in words_tokens_lower if word not in stop_words
    ]

    # make joining for all words in any sentence to make them one sentence again
    text_clean = " ".join(words_filtered)
    return text_clean

### Make cleaning for the training and test dataset

In [6]:
%%time
# Clean text in the training dataset and insert the clean text in new column
tr_data["text_clean"] = tr_data.loc[tr_data["text"].str.len() > 0, "text"]
tr_data["text_clean"] = tr_data["text_clean"].map(
    lambda x: preprocessing_1(x) if isinstance(x, str) else x
)

Wall time: 20.5 s


In [7]:
%%time
# Clean text in the test dataset and insert the clean text in new column
ts_data["text_clean"] = ts_data.loc[ts_data["text"].str.len() > 0, "text"]
ts_data["text_clean"] = ts_data["text_clean"].map(
    lambda x: preprocessing_1(x) if isinstance(x, str) else x
)

Wall time: 10.4 s


In [8]:
# get the text clean column to be used in the test phase
ts_data_clean = ts_data['text_clean']

In [9]:
# As there is label = 2 in the training dataset so I will change them with nan values them drop them
tr_data.loc[tr_data["label"] >1] = np.NaN


# Drop when any of x missing
tr_data = tr_data[(tr_data["text_clean"] != "") & (tr_data["text_clean"] != "null")]

tr_data = tr_data.dropna(
    axis="index", subset=["label", "text", "text_clean"]
).reset_index(drop=True)

### Descriptive analysis

Even though we deal with texts, we should still use some descriptive analysis to get a better understanding of the data. So, I will get the words that happened the most and the words that happened the least. 

In [10]:
from bokeh.models import NumeralTickFormatter
# Word Frequency of most common words by split any sentence in the dataset and get its count in the whole documnet
word_freq = pd.Series(" ".join(tr_data["text_clean"]).split()).value_counts()
word_freq[1:40]

one         3285
like        3128
new         2998
look        2847
color       2737
man         2729
get         2602
trump       2578
say         2347
peopl       2316
use         2307
first       2248
make        2227
old         2226
time        2027
poster      2000
found       1999
day         1935
war         1858
post        1648
world       1570
work        1531
show        1513
us          1506
american    1504
take        1491
life        1482
psbattl     1470
help        1442
go          1420
state       1409
back        1369
two         1364
school      1345
see         1329
photo       1324
made        1314
right       1311
save        1308
dtype: int64

In [11]:
# list most uncommon words 
word_freq[-10:].reset_index(name="freq")

Unnamed: 0,index,freq
0,scoliosi,1
1,nicklaus,1
2,briancush,1
3,preprint,1
4,friggin,1
5,haleakala,1
6,daedec,1
7,investitur,1
8,toth,1
9,vaynerchuck,1


In [12]:
# Distribution of ratings to check whether there is unbalance in the weights or not
tr_data["label"].value_counts(normalize=True)

0.0    0.538221
1.0    0.461779
Name: label, dtype: float64

### This means there is no data unbalance as both classes have close weight to each other.

## Modeling
### Start with splitting the label and features

In [13]:
# split the training data to labels and features to be used in the supervised learning :)
# get the label column as it is our label column to be used in the supervising learning
y = tr_data['label']
# get the text clean column as it is the features that I will use 
X = tr_data['text_clean']
# print the shapes of both labels and features
print('original shape', X.shape, y.shape)

original shape (59758,) (59758,)


## Make vectorization for the features with TF-IDF (Feature creation)

In [14]:
# I will start with word based analyzer without any use from other hyperparameters and keep them with the default
vectorizer = TfidfVectorizer(analyzer="word")
# make fitting for the analyzer on the text clean column in the training dataset
vectorizer.fit(tr_data["text_clean"])

# transform each sentence to numeric vector with tf-idf value as elements for the training and test datasets
X_train_vec = vectorizer.transform(X)
X_test_vec = vectorizer.transform(ts_data_clean)
# display the shapes to check them
print(X_test_vec.get_shape())
X_train_vec.get_shape()


(59151, 40370)


(59758, 40370)

In [15]:
# Logistic regression classifier
lr1 = LogisticRegression()

# make fitting for the model on the training features after make vectorization for it
lr1.fit(X_train_vec, y)

LogisticRegression()

### Saving results in csv file

In [16]:
# getting the results and saving in csv file 
submission = pd.DataFrame()
# get the id from the test dataframe
submission['id'] = ts_data.index

# get the predict probability of the label value '1' by making predict on the test features 
submission['label'] = lr1.predict_proba(X_test_vec)[:,1]

submission.to_csv('trial 1 pre 1_version2.csv', index=False)

### Thoughts and observations for trial 1 for preprocessing technique 1: 🤓
This gave score on Kaggle: 0.83466

In this try, I need to deal with preprocessing more than dealing with the model. To see more how to make preprocessing and the effect of it. My expectations were that this preprocessing cover many things to be made on the text columns in the training and test datasets. Here in this try in the preprocessing, I removed whitespaces, removed any single character, removed any letter not written in English language, and removed any html tags. Then make tokenization for the words and make stemming for each word if this word is not stop word, and make all letters small. I used Logistic regression model with the default parameters for it as it is one of the best classifiers to be used in binary classification. I think this preprocessing technique got a very good results as it gave on the public leaderboard 0.83466, and I will ensure whether it is good or not after trying the second preprocessing technique.


### Plan for trial 1 for preprocessing technique 2: 🤔
I will change the preprocessing technique by make all letters small, remove punctuation, remove any numbers, remove extra white spaces, make tokenization for the sentences, and remove any stop word by using sklearn library. Then make stemming for the words, and make lemmatization for the words. As stemming means the process of removing or stems the last few characters of a word, often leading to incorrect meanings and spelling. But lemmatization means the process of consider the context and converts the word to its meaningful base form. I will keep using the Logistic regression classifier with its default parameters, so the change only in the preprocessing so I can compare which preprocessing technique will be more efficient and gives better results to complete my trials with it.

## Preprocessing technique number 2 with word-level-vectorizer and Logisitc regression classifier.

In [17]:
import string
# get the stopwords from the sklearn library by English language
from sklearn.feature_extraction._stop_words import ENGLISH_STOP_WORDS
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# define our stemmer and lemmatizer by English language as our dataset in English language
stemmer= PorterStemmer()
# import nltk
# nltk.download('wordnet')
lemmatizer=WordNetLemmatizer()

# define the second preprocessing function
def preprocessing_2 (text):
    # lowercase 
    input_str = text.lower()
    
    # Numbers removing
    num_removed = re.sub(r'\d+', '', input_str)
    
    # punctuation removing
    pun_removed = num_removed.translate(str.maketrans('', '', string.punctuation))
    
    # remove white spaces
    without_space = pun_removed.strip()
    
    # tokenization
    word_tokens = word_tokenize(without_space)
    
    # Stop words removal
    Without_stop = [i for i in word_tokens if not i in ENGLISH_STOP_WORDS]
    
    
    # stemming
    new_word= []
    for word in Without_stop:
        new_word.append(stemmer.stem(word))
    
    # Lemmatization
    Lemm_new_word = []
    for word in new_word:
        Lemm_new_word.append(lemmatizer.lemmatize(word))
        
   # make joining for all words in any sentence to make them one sentence again     
    text_clean_pre2 = " ".join(Lemm_new_word)  
    
    return text_clean_pre2


In [18]:
# take copy from the training dataset to be used in the second preprocessing technique
tr_data_newpre = data_train.copy()

### Make cleaning for the training and test dataset

In [19]:
%%time
# Clean text in the training dataset and insert the clean text in new column using the second preprocessing technique
tr_data_newpre["text_clean"] = tr_data_newpre.loc[tr_data_newpre["text"].str.len() > 0, "text"]
tr_data_newpre["text_clean"] = tr_data_newpre["text_clean"].map(
    lambda x: preprocessing_2(x) if isinstance(x, str) else x
)

Wall time: 28.8 s


In [20]:
# take copy from the test dataset to be used in the second preprocessing technique
ts_data_newpre = data_test.copy()

In [21]:
%%time
# Clean text in the test dataset and insert the clean text in new column using the second preprocessing tehnique
ts_data_newpre["text_clean"] = ts_data_newpre.loc[ts_data_newpre["text"].str.len() > 0, "text"]
ts_data_newpre["text_clean"] = ts_data["text_clean"].map(
    lambda x: preprocessing_2(x) if isinstance(x, str) else x
)

Wall time: 11.7 s


In [22]:
# get the text clean column to be used in the test phase
ts_data_newpre = ts_data_newpre['text_clean']

In [23]:
# As there is label = 2 in the training dataset so I will change them with nan values them drop them
tr_data_newpre.loc[tr_data_newpre["label"] >1] = np.NaN


# Drop when any of x missing
tr_data_newpre = tr_data_newpre[(tr_data_newpre["text_clean"] != "") & (tr_data_newpre["text_clean"] != "null")]

tr_data_newpre = tr_data_newpre.dropna(
    axis="index", subset=["label", "text", "text_clean"]
).reset_index(drop=True)

In [24]:
from bokeh.models import NumeralTickFormatter
# Word Frequency of most common words
word_freq = pd.Series(" ".join(tr_data_newpre["text_clean"]).split()).value_counts()
word_freq[1:40]

like        3074
new         2977
look        2825
color       2708
man         2592
just        2560
trump       2351
say         2288
use         2279
peopl       2255
make        2178
time        1950
poster      1936
day         1822
woman       1805
war         1779
work        1498
help        1436
world       1431
life        1388
american    1381
old         1369
state       1357
photo       1310
school      1309
save        1289
circa       1225
hous        1223
know        1222
right       1209
want        1207
presid      1199
psbattl     1164
pictur      1160
child       1145
way         1132
got         1131
true        1122
get         1091
dtype: int64

In [25]:
# list most uncommon words
word_freq[-10:].reset_index(name="freq")

Unnamed: 0,index,freq
0,tchiiko,1
1,dejuan,1
2,sunnier,1
3,king禄,1
4,corp鈥檚,1
5,eyeand,1
6,beckinsal,1
7,joal,1
8,mohn,1
9,attacked鈥攁nd,1


In [26]:
# Distribution of ratings to make sure that there is no unbalance in the dataset
tr_data_newpre["label"].value_counts(normalize=True)

0.0    0.538281
1.0    0.461719
Name: label, dtype: float64

In [27]:
# split the training data to labels and features to be used in the supervised learning :)
# get the match column as it is our label column
y2 = tr_data_newpre['label']
# drop the match column and take the rest of the dataframe to be our features
X2 = tr_data_newpre['text_clean']
# print the shapes of both labels and features
print('original shape', X2.shape, y2.shape)

original shape (59768,) (59768,)


## Make vectorization for the features with TF-IDF (Feature creation)

In [28]:
# use word analyzer to make the vectorization for the features of the training and features of the test
vectorizer2 = TfidfVectorizer(analyzer="word")
vectorizer2.fit(tr_data_newpre["text_clean"])

# transform each sentence to numeric vector with tf-idf value as elements
X_train_vec2 = vectorizer2.transform(X2)
X_test_vec2 = vectorizer2.transform(ts_data_newpre)
X_train_vec2.get_shape()

(59768, 50433)

In [29]:
# Logistic regression classifier
lr2 = LogisticRegression()
# make fitting for the model on the training features after make vectorization for it
lr2.fit(X_train_vec2, y2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

### Saving results in csv file

In [30]:
# getting the results and saving in csv file 
submission = pd.DataFrame()
# get the id from the test dataframe
submission['id'] = ts_data.index

# get the predict probability of the label value '1' by making predict on the test features 
submission['label'] = lr2.predict_proba(X_test_vec2)[:,1]

submission.to_csv('trial 1 pre 2_version2.csv', index=False)

### Thoughts and observations for trial 1 for preprocessing technique 1: 🤓
This gave score on Kaggle: 0.82216

In this try, I used another technique in the preprocessing by make all letters small, remove punctuation, remove any numbers, remove extra white spaces, make tokenization for the sentences, and remove any stop word. Then, make stemming for the words, and make lemmatization for the words. My expectations were to got a better results on Kaggle as I made some changes in the preprocessing, but this didn't happen. So, this means lemmatization don't make affect on the data we have this time.


### Plan for trial 2: 🤔

I will use the first technique in preprocessing, as it gave me a better results than the second one. In this trial, I will make pipeline to tune the hyperparameters for the model and the vectorizer. I will use word analyzer in the vectorizer. I will use Logistic regression model. I will tune the number of grams, max_df, and min_df in the vectorizer. I will tune C value, penalty, and solver in the classifier. As number of grams means how the vectorizer will deal with the words, whether it will make combinations from them by making 2 words together with all single words or by making 3 words together with all single words. max_df means ignore terms that appear in more than 'the number result from the tuning' of the documents. min_df  means ignore terms that appear in less than 'the number result from the tuning' of the documents. For the hyperparameters for the classifier, the solver means the algorithm that will be uses in the optimization (if the dataset is small ‘liblinear’ is a good choice, for multiclass problems ‘newton-cg’, ‘lbfgs’ handle multinomial loss, and the default for the model is ‘lbfgs’), the regularization means the regularization method that will be used on the weights and the parameters of the model that is used to prevent the overfitting on the training dataset, and C to control with the penality strength of the regularization as smaller values specify stronger regularization. The tuning for the hyperparameters is done by grid search method, that makes all possible combinations from the hyperparameters given in the params grid, and use them to fit the training features with their labels. Then, it gives the best hyperparameters combination that gives the best result, and makes splitting for the data in the best way.

# Trial 2:
## A tunable pipeline including the vectorizer with word based analyzer using Logistic Regression classifier.

In [31]:
%%time
# feature creation and modelling in a single function using pipeline
pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer="word")), ("lr", LogisticRegression())])

# define parameter space to test
params = {
    # parameters for the vectorizer
    "tfidf__ngram_range": [(1, 2),(1, 3)],
    "tfidf__max_df": np.arange(0.3, 0.8),
    "tfidf__min_df": np.arange(5, 30),
    # parameters for the classifier
    # lr__solver points to lr->solver for the classifier
    'lr__solver' : ['newton-cg', 'lbfgs', 'liblinear'],
     # lr__penalty points to lr->penalty of the regularization to prevent the overfitting in the model  
    'lr__penalty': ['l2'],
      # lr__C points to lr->C values 
    'lr__C' : [1000, 100, 10, 1.0, 0.1, 0.01]
}
# n_jobs = -1 to work on all virtual processors
# cv = 3 means 3 k folds and 3 times for cross validation by changing each time the validation part
pipe_lr_clf = GridSearchCV(pipe, params, n_jobs=-1, scoring="roc_auc",verbose=1, cv = 3)
# make fitting for the model to try all combinations created for the hyperparameters
pipe_lr_clf.fit(X, y)
pickle.dump(pipe_lr_clf, open("./pipe_lr_clf.pck", "wb"))

Fitting 3 folds for each of 900 candidates, totalling 2700 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Wall time: 36min 21s


In [32]:
# display the best combination of the hyperparameters and the best score on the training dataset using cross-validation method
best_params = pipe_lr_clf.best_params_
print(best_params)
print('best score {}'.format(pipe_lr_clf.best_score_))

{'lr__C': 1.0, 'lr__penalty': 'l2', 'lr__solver': 'lbfgs', 'tfidf__max_df': 0.3, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 2)}
best score 0.8621365521110835


In [33]:
# run the pipe with optimized parameters
pipe.set_params(**best_params).fit(X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(steps=[('tfidf',
                 TfidfVectorizer(max_df=0.3, min_df=5, ngram_range=(1, 2))),
                ('lr', LogisticRegression())])

In [34]:
# getting the results and saving in csv file 
submission = pd.DataFrame()
# get the id from the test dataframe
submission['id'] = ts_data.index

# get the predict probability of the label value '1' by making predict on the test features 
submission['label'] = pipe.predict_proba(ts_data_clean)[:,1]

submission.to_csv('trial 2 lr_version2.csv', index=False)

### Thoughts and observations for trial 2: 🧐
This gave score on Kaggle: 0.83735

This model made 2700 iterations in the fitting as there are 900 different combinations from the hyperparameters and the cross-validation = 3 so the k-folds in the training and test phases will change 3 times and tried on the 900 combinations, so I got 2700 fitting iterations.

This model used hyperparameters as following: 'lr__C': 1.0, 'lr__penalty': 'l2', 'lr__solver': 'lbfgs', 'tfidf__max_df': 0.3, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 2). For the vectorizer, it used max_df = 0.3 which means ignored words that appeared in more than 30% of the documents. min_df = 5 which means ignored words that appeared in less than 5 of the documents. For the number of grams, it used (1, 2) so it made combinations from each 2 words beside using each word alone.
Besides, the classifier will use solver = 'lbfgs', which means that the model is using the default solver as the algorithm to solve the optimization and penalty for the weights and the parameters = L2, which means the weights and the parameters of the model will be smaller. The C value is 1, which makes the regularization stronger. The grid search on the cross-validation phase on the training dataset gave AUC score = 0.86213, and on Kaggle gave 0.83735; so this means that this combination from the hyperparameters with logistic regression model didn't make overfitting on the training dataset as on the unseen dataset (test dataset) this model gave a very good score. My expectations were to get good results from this model as the Logistic regression is one of the best classifier in binary classification, and this is true as I got 0.83735 AUCROC score on Kaggle which meets my expectations.


### Plan for trial 3: 🤨

I will continue the work with the first preprocessing technique as it gave better results on Kaggle than the second technique.

I will keep using the same classifier 'Logistic regression' with making tuning for the same hyperparameters that I talked about before, but I will change the analyzer in the vectorizer to character based. With keep making tuning for the rest of the hyperparameters for the vectorizer. With using the same method in searching for the hyperparameters which is grid search that I talked about in planning for trial 2.



## Trial 3:
## Pipeline with character-level vectorizer and Logistic Regression classifier

In [35]:
%%time
# feature creation and modelling in a single function using pipeline
pipe_char = Pipeline([("tfidf", TfidfVectorizer(analyzer="char")), ("lr", LogisticRegression())])

# define parameter space to test 
params = {
    # parameters for the vectorizer
    "tfidf__ngram_range": [(1, 2),(1, 3)],
    "tfidf__max_df": np.arange(0.3, 0.8),
    "tfidf__min_df": np.arange(5, 30),
    # parameters for the classifier
    # lr__solver points to lr->solver for the classifier
    'lr__solver' : ['newton-cg', 'lbfgs', 'liblinear'],
     # lr__penalty points to lr->penalty of the regularization to prevent the overfitting in the model  
    'lr__penalty': ['l2'],
      # lr__C points to lr->C values 
    'lr__C' : [1000, 100, 10, 1.0, 0.1, 0.01]
}
# n_jobs = -1 to work on all virtual processors
# cv = 3 means 3 k folds and 3 times for cross validation with changing the validation part each time
pipe_lr_clf_char = GridSearchCV(pipe_char, params, n_jobs=-1, scoring="roc_auc",verbose=1, cv = 3)
# make fitting for the model to try all combinations created for the hyperparameters
pipe_lr_clf_char.fit(X, y)
pickle.dump(pipe_lr_clf_char, open("./pipe_lr_clf.pck", "wb"))

Fitting 3 folds for each of 900 candidates, totalling 2700 fits
Wall time: 1h 39min 31s


In [36]:
# display the best combination of the hyperparameters and the best score on the training dataset using cross-validation method
best_params = pipe_lr_clf_char.best_params_
print(best_params)
print('best score {}'.format(pipe_lr_clf_char.best_score_))

{'lr__C': 10, 'lr__penalty': 'l2', 'lr__solver': 'liblinear', 'tfidf__max_df': 0.3, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 3)}
best score 0.8195522299654213


In [37]:
# run pipe with optimized parameters
pipe_char.set_params(**best_params).fit(X, y)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(analyzer='char', max_df=0.3, min_df=5,
                                 ngram_range=(1, 3))),
                ('lr', LogisticRegression(C=10, solver='liblinear'))])

In [38]:
# getting the results and saving in csv file 
submission = pd.DataFrame()
# get the id from the test dataframe
submission['id'] = ts_data.index

# get the predict probability of the label value '1' by making predict on the test features 
submission['label'] = pipe_char.predict_proba(ts_data_clean)[:,1]

submission.to_csv('trial 3 lr char_version2.csv', index=False)

### Thoughts and observations for trial 3: ✌
This gave score on Kaggle: 0.77437

This model made 2700 iterations in the fitting as there are 900 different combinations from the hyperparameters and the cross-validation = 3 so the k-folds in the training and test phases will change 3 times and tried on the 900 combinations, so I got 2700 fitting iterations.

This model used hyperparameters as following:'lr__C': 10, 'lr__penalty': 'l2', 'lr__solver': 'liblinear', 'tfidf__max_df': 0.3, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 3). So, the model used the solver = 'liblinear', which means that the model is using that solver as the algorithm to solve the optimization and penalty for the weights and the parameters = L2, which means the weights and the parameters of the model will be smaller. The C value is 10, which makes the regularization stronger, but it is a little bit big number, so the regularization isn't too strong on the weights and parameters for the model. For the vectorizer, it used max_df = 0.3 which means ignore terms that appear in more than 0.3 of the documents. min_df = 5 which means ignore terms that appear in less than 5 of the documents. For the number of grams, it used (1, 3), so it made combinations from each 3 words beside using each word alone. 
My expectations were to got better results as I used character based in the vectorizer, and it is better than the word based. But I got score = 0.81955 from fitting on the training dataset, and I got score = 0.77437 from Kaggle which isn't my expectation from this trial.


### Plan for trial 4: 🤕

I will continue the work with the first preprocessing technique as it gave better results on Kaggle than the second technique.

I will change the classifier in the next trial and I will use K nearest neighbor classifier with continue using holdout way by splitting the training dataset into training and validation datasets. Adjusting the hyperparameters: weights, number of neighbors, and metric using the random search method, which takes a random combination from the combinations of the hyperparameters at each iteration, and each time it keeps the score that it gets from this iteration and after finishing all iterations which I choose its number, it gives me the combination that made the best performance through all iterations. And it calculates the score by ROCAUC score in each iteration. About the hyperparameters, the weights which means weight function used in prediction, metric which means the distance metric to use for the tree, and number of neighbors which means the number neighbors that used in the calculations.

## Trial 4:
## Pipeline with random search with validation set and KNN classifier

In [39]:
# Split the training features and labels into training and validation dataset with stratify the label so each part will contain
# suitable number of each label 
X_train2, X_val, y_train2, y_val = train_test_split(
    X, y, train_size = 0.8, stratify = y, random_state = 42)

In [40]:
%%time
# Create a list where train data indices are -1 and validation data indices are 0
# X_train2 (new training set), X the main training set
split_index = [-1 if x in X_train2.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

# define parameter space to test 
param_random = {
    # parameters for the vectorizer
    "tfidf__max_df": np.arange(0.3, 0.8),
    "tfidf__min_df": np.arange(5, 30),
    # parameters for the classifier
    # KNN__metric points to KNN->metric for the classifier
    'KNN__metric' : ['euclidean', 'manhattan', 'minkowski'],
    # KNN__weights points to KNN->weights of the regularization to prevent the overfitting in the model  
    'KNN__weights': ['uniform', 'distance'],
      # KNN__n_neighbors points to KNNr->n_neighbors
    'KNN__n_neighbors' : range(1, 31, 2)
}

# feature creation and modelling in a single function using pipeline
pipe_word_KNN = Pipeline([("tfidf", TfidfVectorizer(analyzer="word")), ("KNN", KNeighborsClassifier())])

# use random search method to search for the hyperparameters with using the predefined split in the cv to use the validation set
random_search_val_KNN = RandomizedSearchCV(
    pipe_word_KNN, param_random, cv=pds, verbose=1, n_jobs=2, 
    # number of trials 
    n_iter=40,
    scoring='roc_auc')


# will use our predefined split internally to determine 
# which sample belongs to the validation set
# make fitting on the training dataset and their labels to get the best combination of the hyperparameters
random_search_val_KNN.fit(X, y)

# print the best score and the combination of the hyperparameters that got that score
print('best score {}'.format(random_search_val_KNN.best_score_))
print('best score {}'.format(random_search_val_KNN.best_params_))

Fitting 1 folds for each of 40 candidates, totalling 40 fits
best score 0.8320932616454605
best score {'tfidf__min_df': 7, 'tfidf__max_df': 0.3, 'KNN__weights': 'distance', 'KNN__n_neighbors': 29, 'KNN__metric': 'minkowski'}
Wall time: 6min 18s


In [41]:
# display the best combination of the hyperparameters and the best score on the training dataset using cross-validation method
best_params = random_search_val_KNN.best_params_
print(best_params)
print('best score {}'.format(random_search_val_KNN.best_score_))

{'tfidf__min_df': 7, 'tfidf__max_df': 0.3, 'KNN__weights': 'distance', 'KNN__n_neighbors': 29, 'KNN__metric': 'minkowski'}
best score 0.8320932616454605


In [42]:
# run pipe with optimized parameters
pipe_word_KNN.set_params(**best_params).fit(X, y)

Pipeline(steps=[('tfidf', TfidfVectorizer(max_df=0.3, min_df=7)),
                ('KNN',
                 KNeighborsClassifier(n_neighbors=29, weights='distance'))])

In [43]:
# getting the results and saving in csv file 
submission = pd.DataFrame()
# get the id from the test dataframe
submission['id'] = ts_data.index

# get the predict probability of the label value '1' by making predict on the test features 
submission['label'] = pipe_word_KNN.predict_proba(ts_data_clean)[:,1]

submission.to_csv('trial 4 knn bayes_version2.csv', index=False)

### Thoughts and observations for trial 4: 👀
This gave score on Kaggle: 0.77379

This model made 40 iterations according to the number I put. So, it will pick 40 random combinations from the combinations that were created from the grid I made in this trial.

This model used hyperparameters as following:'tfidf__min_df': 7, 'tfidf__max_df': 0.3, 'KNN__weights': 'distance', 'KNN__n_neighbors': 29, 'KNN__metric': 'minkowski', this means that in the vectorizer, it used min_df = 7, so it ignored any word appeared less than 7 times in the documents. max_df = 0.3, so it ignored any word appeared in 30% of the documents. It used the default n-grams which equal (1, 1), so it used words and didn't make any combinations from the words. For the classifier, it used number of neighbors that are used in calculation of the distance = 29. It used the weights = 'distance', so in this case, closer neighbors of a query point had a greater influence than neighbors which are further away. It used metric = 'minkowski' which means it calculated the distance in a normed vector space, which means in a space where distances can be represented as a vector that has a length and the lengths cannot be negative according to the following equation: 
![alt text](https://www.kdnuggets.com/wp-content/uploads/popular-knn-metrics-1.png "minkowski equation").

My expectations were to have a good result, as from the calculated score on the training phase, I got score = 0.83209. On Kaggle I got 0.77379 although I used high number of iterations to get the best combination of the hyperparameters, so this model didn't work well on the unseen dataset (test data).


### Plan for trial 5: 😪
I will use the XGBoost classifier in the next trial with cross-validation as the experimental protocol. For the hyperparameters that I will use with this classifier: the number of the estimators, which refers to the number of gradient boosted trees that will be used in the model, I will pass different numbers for it from small number to big number to see what the best number of estimators is, the max depth, which refers to the maximum tree depth for the base learners, I will use a little bit large numbers as we have a very large dataset, and not too big, so this won't make overfitting in the training phase. For the vectorizer I will search for the best max_df and min_df.

But this time I will search for the best combination of hyperparameters using Bayesian search method which takes a one hyperparameter value from the hyperparameters and makes calculations to get the second hyperparameter that will make a good performance, and takes the third one by some calculations until it makes a combination from the hyperparameters. It keeps doing this in searching for the hyperparameters until it finishes the number of iterations that I give to the model. And it calculates the score by ROCAUC score in each iteration. Using holdout method to calculate performance using training, and validation datasets by the cross-validation method.


## Trial 5:
## Tuned pipeline with XGBoost classifier

In [44]:
%%time
# feature creation with word based analyzer and modelling in a single function using pipeline
pipe_xgb = Pipeline([("tfidf", TfidfVectorizer(analyzer="word")), ("xgb", XGBClassifier())])

# define parameter space to test
params_xgb = {
    # parameters for the vectorizer
    "tfidf__max_df": np.arange(0.3, 0.8),
    "tfidf__min_df": np.arange(5, 30),
    # parameters for the classifier
    # xgb__n_estimators points to xgb->number of estimators
    'xgb__n_estimators': [80, 100, 200, 300, 400],  
     # xgb__max_depth points to xgb->max_depth
    'xgb__max_depth':[50, 100, 200, 300]  
}

# use bayes search to find the best combination of the hyperparameters with cross-validation = 2 which means 2 k folds and 2 
# times to change the validation set each time
pipe_xgb_clf = BayesSearchCV(pipe_xgb, params_xgb, n_jobs=-1, cv=2, scoring="roc_auc", verbose = 1,
     # number of trials 
        n_iter=60)
# make fitting with each combination made to find the best one that gives the best results
pipe_xgb_clf.fit(X, y)

Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fi



Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fi



Fitting 2 folds for each of 1 candidates, totalling 2 fits




Fitting 2 folds for each of 1 candidates, totalling 2 fits
Fitting 2 folds for each of 1 candidates, totalling 2 fits




Wall time: 2h 39min 53s


BayesSearchCV(cv=2,
              estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                        ('xgb',
                                         XGBClassifier(base_score=None,
                                                       booster=None,
                                                       colsample_bylevel=None,
                                                       colsample_bynode=None,
                                                       colsample_bytree=None,
                                                       enable_categorical=False,
                                                       gamma=None, gpu_id=None,
                                                       importance_type=None,
                                                       interaction_constraints=None,
                                                       learning_rate=None,
                                                       max_delta_step=None,
            

In [45]:
# display the best combination of the hyperparameters and the best score on the training dataset using cross-validation method
best_params = pipe_xgb_clf.best_params_
print(best_params)
print('best score {}'.format(pipe_xgb_clf.best_score_))

OrderedDict([('tfidf__max_df', 0.3), ('tfidf__min_df', 10), ('xgb__max_depth', 50), ('xgb__n_estimators', 200)])
best score 0.8171171162030598


In [46]:
# run pipe with optimized parameters
pipe_xgb.set_params(**best_params).fit(X, y)
pipe_pred = pipe_xgb.predict_proba(ts_data_clean)
pipe_pred





array([[0.5455636 , 0.45443645],
       [0.5455636 , 0.45443645],
       [0.29311693, 0.7068831 ],
       ...,
       [0.99691194, 0.00308808],
       [0.89561254, 0.10438748],
       [0.08564711, 0.9143529 ]], dtype=float32)

In [47]:
# getting the results and saving in csv file 
submission = pd.DataFrame()
# get the id from the test dataframe
submission['id'] = ts_data.index

# get the predict probability of the label value '1' by making predict on the test features 
submission['label'] = pipe_xgb.predict_proba(ts_data_clean)[:,1]

submission.to_csv('trial 5 xgb_version2.csv', index=False)

### Thoughts and observations for trial 5: 🤔
This gave score on Kaggle: 0.79731



This model made 60 iterations according to the number I set. So, it will pick 60 combinations by getting the combination of the hyperparameters by the way I mentioned in the planning for this trial.



This model used hyperparameters as following: ('tfidf__max_df', 0.3), ('tfidf__min_df', 10), ('xgb__max_depth', 50), ('xgb__n_estimators', 200). First, for the vectorizer, it used the default n-grams which equal (1, 1), so it used words and didn't make any combinations of the words. max_df = 0.3, meaning it ignored any word that appeared in the documents more than 30%. min_df = 10 means it ignored any word appeared less than 10 times in the documnets.

Besides, the classifier will use a number of estimators = 200, which means that the model is compiled with 200 gradient boosted trees and max depth for each tree = 50, which is the smallest number, so I thought this wouldn't be overfit on the training dataset. The bayesian search on the cross-validation phase on the training dataset gave AUC score = 0.81711 and on Kaggle gave 0.79731, so I think there is no overfitting on the training dataset as on the unseen dataset (test dataset). This model gave a good score which is close to the one I got from trying on the training dataset regarding this search, which didn't try all combinations; it tried 60 combination only in the 60 iterations, so I think this result and those hyperparameters are very good and suitable for this  dataset.

# Overall results for all trials: 🥳


| Trial | Score on the experimental protocol applied | Score on Kaggle |
| :--- | :----: | ---: |
| Try with the first technique in preprocessing | No experimental protocol | 0.83466 |
| Try with the second technique in preprocessing | No experimental protocol | 0.82216 |
| A tunable pipeline including the vectorizer with word based analyzer using Logistic Regression classifier. | 0.86213 | 0.83735 |
| Pipeline with character-level vectorizer and Logistic Regression classifier | 0.81955 | 0.77437 |
| Pipeline with random search with validation set and KNN classifier | 0.83209 | 0.77379 |
| Tuned pipeline with XGBoost classifier | 0.81711 | 0.79731 |



**From the above table the best trial for me was:**

**A tunable pipeline including the vectorizer with word based analyzer using Logistic Regression classifier with using those hyperparameters 'lr__C': 1.0, 'lr__penalty': 'l2', 'lr__solver': 'lbfgs', 'tfidf__max_df': 0.3, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 2).**
**Besides, the TF-IDF vectorization with word analyzer gave me better results than the character analyzer, so my preprocessing steps and implementation for the model didn't work well with the character analyzer.**

-------------------------------------------------------------------------------------------------------------------------------

# Questions


## What is the difference between Character n-gram and Word n-gram? Which one tends to suffer more from the OOV issue?

* Character n-gram splits the sentence according to n number of characters, like if we have sentences: 'I play volleyball' and n = 3, so it will split into the following: I p - pla - lay - vol - oll - .... and so on.

* Word n-gram splits the sentence according to n number of words, like if we have sentences: 'I play volleyball' and n = 2, so it will split into the following: I play - play volleyball.

* The one that tends to suffer more from the Out Of Vocabulary (OOV) is: Word n-gram.

## What is the difference between stop word removal and stemming? Are these techniques language-dependent?

* Stop word removal means removing unimportant words that may happen too much in the data without importance in the model, so this reduces the dataset size and thus reduces the training time due to the lower number of tokens involved in the training. Like removing am - is - are - .... and so on in English.

* Stemming means removing morphological affixes from words, leaving only the word stem. Like in word different, it become differ.

* Yes, they are language-dependent, as both change from one language to another. For example, 'is' is a stop word in English, and 'ist; is a stop word in German.

## Is tokenization techniques language dependent? Why?
* Yes, tokenization is language dependent.

* As each language has its own shape and letters, so in some languages there are white spaces between words and some don't have white spaces. And some languages combine words together with shapes related to this language, so tokenization may not split the combination of two words and take it as one token instead of splitting them and taking them as two tokens. For example, languages like Vietnamese and Chinese, if we used English tokenization with them, it wouldn't work well. 

## What is the difference between count vectorizer and tf-idf vectorizer? Would it be feasible to use all possible n-grams? If not, how should you select them?

* Count vectorizer means counting how many times the word appears in the document. So, here it considers the count of each unique word in the document.

* TF-IDF vectorizer means getting the weight of word counts by measuring how often the word appears in the documents. So, here it considers the overall document weightage of a word.

* It wouldn't be feasible to use all possible n-grams.

* As to select the number of grams, so it depends on the dataset and the application. In some applications, it is important to deal with some combinations of words together as they are together, have a big effect. For example: if we talk about a dataset for rating a company, and the sentence 'company good organized' is important for us, so we need to make the number of grams = 3 to capture this combination of words, so here we use trigrams, and this would have a big effect in increasing the accuracy of the model.

# References
* https://rb.gy/rvvfmq
* https://rb.gy/pv963r
* https://rb.gy/ylzjdd
* https://rb.gy/imeowk
* https://rb.gy/9xua2n