# ✔️ Problem Formulation:
**The problem: ❎**

Build an ML model that predicts whether a specific Reddit post is fake news or not by looking at its title.

**What is the input? ⏩**

A dataset (including a text column) that is useful for the prediction.

**What is the output? ⏪**

A probability (a float between 0 and 1) indicating whether a specific Reddit post is real news or fake.

**What data mining function is required? 🤔**

*   The data mining functions required to build the model are classification and prediction.
*   The data mining functions required for text preprocessing are tokenization and vectorization of each text.

**What could be the challenges? ⛏**


*   The dataset is large: it contains 60,000 records.

*   Each record in the text column contains a lot of punctuation, non-English letters, misspellings, and grammatical errors. As a result, we need to select a suitable text cleaning technique by testing each one and keeping the one that produces the best results.

*   Building a pipeline that deals with the categorical data and builds the ML model.



**What is the impact? 😀**

False information on the Internet has caused many social problems as social networks have grown and taken on a role in fields such as politics. This model is meant to help address such problems.

**What is an ideal solution?**✊

The first trial performed best: a Random Forest model whose best hyperparameter combination is found with random search, using `TfidfVectorizer()` with its default word-level analyzer.
Its score on Kaggle is `0.85350`, and its reported `roc_auc` is `0.99`.

# Experimental protocol 💻
- Import the modules needed to work with the dataset

- Load the dataset from the CSV files

- Data exploration

- Preprocessing (check whether the data is clean, and clean it if it is not)
*   Preprocessing the text column with two methods.
*   Descriptive analysis

- Split the data into training and validation sets (once for the lemmatized data and once for the stemmed data)

- A tunable pipeline


*   TfidfVectorizer (I covered both the character-level and the word-level vectorizer).

*   Building models (seven trials) (I am using Random Forest, Logistic Regression, XGBoost, and MLP)

- Creating search spaces (using random search: six trials with a validation set and the last one with cross-validation).

- Training each model with a number of hyperparameters

(training on the data cleaned by the lemmatizer and on the data cleaned by the stemmer).

- Predicting the test data.

- Creating the submission file and checking the score of each model on Kaggle.

# Import the modules for working with the dataset ▶

**Loading all relevant modules and setting some options:**

In [None]:
import re
import pickle
import sklearn
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra
import matplotlib.pyplot as plt #visualizations 
import seaborn as sns

import holoviews as hv
# For text-related tasks we will use nltk; scikit-learn will handle the tasks related to machine learning.
import nltk 
from bokeh.io import output_notebook
output_notebook()

from pathlib import Path

# some settings for pandas and numpy display

pd.options.display.max_columns = 100
pd.options.display.max_rows = 300
pd.options.display.max_colwidth = 100
np.set_printoptions(threshold=2000)

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import PredefinedSplit
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load dataset from CSV file ⏬

In [None]:
# read the CSV files into DataFrames
train= pd.read_csv('/content/xy_train.csv', sep=",", na_values=[""]) #train dataset
test= pd.read_csv('/content/x_test.csv', sep=",", na_values=[""]) #test dataset

# Data Exploration 🔍


In [None]:
#display the train set
train.head()

Unnamed: 0,id,text,label
0,265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0
1,284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0
2,207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0
3,551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0
4,8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0


In [None]:
#display the first three rows of the test set
test.head(3)

Unnamed: 0,id,text
0,0,stargazer
1,1,yeah
2,2,PD: Phoenix car thief gets instructions from YouTube video


In [None]:
# display the shapes of the train and test sets
print(train.shape)
print(test.shape)

(60000, 3)
(59151, 2)


The texts are mostly written in English, use punctuation, and don't include emojis. However, as with any real-life text data, there will be slang, grammatical mistakes, misspellings, etc.

# Preprocessing 🧹

Having consistent and clean data is fundamental for good modeling results. No matter how sophisticated your model is, the basic principle holds: garbage in, garbage out. In NLP, cleaning and preprocessing can differ depending on which model you intend to use. We will use frequency-based representation methods for our text, so we want a fairly thorough manipulation of the input data:

In [None]:
#check the null values (train set)
train.isnull().sum()


id       0
text     0
label    0
dtype: int64

In [None]:
#check the null values (test set)
test.isnull().sum()


id      0
text    0
dtype: int64

# Text Cleaning ☺

For text-related tasks, we will use nltk. The scikit-learn library will handle the tasks related to machine learning.


In [None]:
import nltk
nltk.download('wordnet')

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

The `clean_text` function takes a string, applies the regex-based manipulations described in the code, and returns the list of word tokens.

In [None]:
stop_words = set(stopwords.words("english"))
def clean_text(text, for_embedding=False):
    """ steps:
        - remove any html tags (< /br> often found)
        - apply the RE_SINGLECHAR pattern (see the note below)
        - convert all whitespaces (tabs etc.) to single wspace
        - tokenize the text into words
        Lowercasing, stop word removal and stemming / lemmatization
        happen afterwards in stemmer_clean and lemma_clean.
    """
    # match one or more whitespace characters
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)
    # match anything between < and > (html tags)
    RE_TAGS = re.compile(r"<.*?>")
    # note: because of the ^ anchor this pattern can only match at the very
    # start of the text, so in practice it leaves punctuation and single
    # characters in the text
    RE_SINGLECHAR = re.compile(r"\b^[^A-Za-zÀ-ž0-9]+\b", re.IGNORECASE)
    if for_embedding:
        # Keep punctuation
        # match a single letter or one of , . ! ? between word boundaries
        RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž,.!?]\b", re.IGNORECASE)

    # replace html tags with a space
    text = re.sub(RE_TAGS, " ", text)
    # replace RE_SINGLECHAR matches with a space
    text = re.sub(RE_SINGLECHAR, " ", text)
    # collapse runs of whitespace into a single space
    text = re.sub(RE_WSPACE, " ", text)

    # split the text into word tokens
    word_tokens = word_tokenize(text)

    return word_tokens
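
To see what the regular expressions in `clean_text` do on their own, here is a small standalone sketch that applies the same three patterns to a made-up string. The sample text and the helper name `apply_patterns` are assumptions for illustration, and tokenization with `word_tokenize` is left out so that no nltk data is needed:

```python
import re

# The same three patterns used in clean_text (for_embedding=False)
RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)
RE_TAGS = re.compile(r"<.*?>")
RE_SINGLECHAR = re.compile(r"\b^[^A-Za-zÀ-ž0-9]+\b", re.IGNORECASE)

def apply_patterns(text):
    """Apply the three substitutions in the same order as clean_text."""
    text = re.sub(RE_TAGS, " ", text)        # html tags -> space
    text = re.sub(RE_SINGLECHAR, " ", text)  # can only match at the start of the text
    text = re.sub(RE_WSPACE, " ", text)      # collapse whitespace runs
    return text

# hypothetical sample, not taken from the dataset
sample = "Breaking:<br/>  new\tstudy   found!"
cleaned = apply_patterns(sample)
print(cleaned)
```

In this sketch the tag and the extra whitespace disappear, while the punctuation stays in the text.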

# Using two methods for text cleaning 🤔



1.   **Stemming** strips the last few characters from a word to reduce it to a stem, which often produces forms with an incorrect meaning or spelling.

    *   `stemmer_clean` stems each text when the `for_embedding` parameter is false.


2.   **Lemmatization** considers the context and converts the word to its meaningful base form, which is called the lemma.

    *  `lemma_clean` lemmatizes each text when the `for_embedding` parameter is false.
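
The difference between the two methods can be sketched on a few words. The word list below is made up for illustration, and the lemmatizer call is wrapped in a `try` because it needs the WordNet data (`nltk.download('wordnet')`), which may not be available everywhere:

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

words = ["people", "running", "studies"]  # illustrative words only

# Stemming strips suffixes by rule, so the result need not be a real word
stems = {w: stemmer.stem(w) for w in words}
print(stems)

# Lemmatization looks the word up in WordNet (treating it as a noun by default)
try:
    lemmas = {w: lemmatizer.lemmatize(w) for w in words}
    print(lemmas)
except LookupError:
    print("WordNet data not found; run nltk.download('wordnet') first")
```

This is consistent with the word frequencies in this notebook, where the stemmed data contains `peopl` while the lemmatized data contains `people`.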





The `stemmer_clean` function

In [None]:
def stemmer_clean(text, for_embedding=False):
  '''steps:
        if not for embedding (but e.g. tf-idf):
        - all lowercase
        - remove stopwords and stem each word'''
  stemmer = SnowballStemmer("english")
  word_tokens = clean_text(text, for_embedding)

  if for_embedding:
    # no stemming, lowering and punctuation / stop words removal
    words_filtered = word_tokens
  else:
    words_tokens_lower = [word.lower() for word in word_tokens]

    words_filtered = [stemmer.stem(word) for word in words_tokens_lower if word not in stop_words ]

  # join the tokens in both branches so the function always returns a string
  text_clean = " ".join(words_filtered)
  return text_clean


The `lemma_clean` function

In [None]:
def lemma_clean(text, for_embedding=False):
  ''' steps:
        if not for embedding (but e.g. tf-idf):
        - all lowercase
        - remove stopwords and lemmatize each word
    '''
  lemmatizer = WordNetLemmatizer()
  word_tokens = clean_text(text, for_embedding)

  if for_embedding:
    # no lemmatization, lowering and punctuation / stop words removal
    words_filtered = word_tokens
  else:
    words_tokens_lower = [word.lower() for word in word_tokens]

    words_filtered = [lemmatizer.lemmatize(word) for word in words_tokens_lower if word not in stop_words]

  # join the tokens in both branches so the function always returns a string
  text_clean = " ".join(words_filtered)
  return text_clean

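We can often improve model performance further by keeping only the more relevant data points.
Here, only texts longer than 20 characters are copied into the new `text_clean` column (shorter texts are left as missing values there):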

In [None]:
train["text_clean"] = train.loc[train["text"].str.len() > 20, "text"]
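
As a small illustration of what this assignment does, the sketch below runs the same expression on a two-row toy frame (the rows are made up; only the column names follow the notebook). Rows that fail the length filter are not dropped; they get `NaN` in `text_clean`:

```python
import pandas as pd

# toy frame with one short and one long text (illustrative data)
toy = pd.DataFrame(
    {"text": ["yeah", "PD: Phoenix car thief gets instructions from YouTube video"]}
)

# same expression as above: only rows longer than 20 characters are selected,
# and the assignment aligns on the index, leaving the other rows as NaN
toy["text_clean"] = toy.loc[toy["text"].str.len() > 20, "text"]

print(toy["text_clean"].isna().tolist())
```

This is why the cleaning calls in this notebook guard each value with `isinstance(x, str)` before cleaning it.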

# Preprocessing the text column with two methods
 



1.   Lemmatizing
2.   Stemming



In [None]:
# Clean text (train set)
# clean the text by lemmatizing each word
# (lemmatizing reduces a word to its base form, taking the meaning of the word into consideration).
train_lemma=train['text_clean'].map(lambda x: lemma_clean(x, for_embedding=False) if isinstance(x, str) else x)
# clean the text by stemming each word
# (stemming strips word endings without taking the meaning of the word into consideration).
train_stem=train['text_clean'].map(lambda x: stemmer_clean(x, for_embedding=False) if isinstance(x, str) else x)

In [None]:
# Clean text (test set)
# clean the text by lemmatizing each word
# (lemmatizing reduces a word to its base form, taking the meaning of the word into consideration).
test_lemma=test['text'].map(lambda x: lemma_clean(x, for_embedding=False) if isinstance(x, str) else x)
# clean the text by stemming each word
# (stemming strips word endings without taking the meaning of the word into consideration).
test_stem=test['text'].map(lambda x: stemmer_clean(x, for_embedding=False) if isinstance(x, str) else x)

The cleaned texts are much more concise, because their original sentence structure and their words have been altered severely. Although the meaning can still be grasped, humans will probably have a harder time understanding these sentences.

### Descriptive analysis ✌

Even though we deal with texts, we should still use some descriptive analysis to get a better understanding of the data:

**Using the most frequent words, we can identify additional candidates for our stop word list**

In [None]:
from bokeh.models import NumeralTickFormatter
# Word frequency of the most common tokens in train_lemma (the slice [1:40] skips the single most frequent token)
word_freq_lemma = pd.Series(" ".join(train_lemma).split()).value_counts()
word_freq_lemma[1:40]

.            34472
)            12466
(            12456
:             9770
's            9618
?             8626
``            7381
''            7231
|             6476
!             4930
'             3632
-             3534
n't           3292
[             3270
]             3260
one           3201
year          3183
new           2989
like          2911
&             2647
man           2584
colorized     2404
trump         2382
people        2266
look          2215
first         2203
say           2148
get           2069
found         1958
poster        1948
time          1943
woman         1813
day           1810
war           1791
make          1706
$             1689
life          1681
...           1625
2             1599
dtype: int64
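
The counts above are dominated by punctuation tokens. One possible way to act on that, sketched here under the assumption that such tokens should be treated like stop words, is to drop every token made up only of punctuation characters. The helper name `is_punctuation` and the token list are hypothetical:

```python
import string
from collections import Counter

def is_punctuation(token):
    """True if every character of the token is an ASCII punctuation mark."""
    return all(ch in string.punctuation for ch in token)

# illustrative tokens, similar to the ones in the frequency list
tokens = ["trump", ".", "(", "``", "new", "...", "'s", "new"]

kept = [t for t in tokens if not is_punctuation(t)]
freq = Counter(kept)
print(freq.most_common(2))
```

Tokens such as `'s` or `n't` contain letters, so this rule keeps them; they would have to be added to the stop word list explicitly.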

In [None]:
# list the least common tokens
word_freq_lemma[-10:].reset_index(name="freq")

Unnamed: 0,index,freq
0,//external-preview.redd.it/book1b119x1ij9al3zh6tenrxz3llmfcs6l2l7vmmoa.jpg,1
1,dayi3a,1
2,'el,1
3,1569779293.0,1
4,jormono,1
5,219x6l,1
6,halper,1
7,ginsberg,1
8,polemic,1
9,110k,1


In [None]:
# Word frequency of the most common tokens in train_stem (the slice [1:40] skips the single most frequent token)
word_freq_stem = pd.Series(" ".join(train_stem).split()).value_counts()
word_freq_stem[1:40]

.         34472
)         12466
(         12456
:          9770
's         9618
?          8626
``         7381
''         7231
|          6476
!          4930
'          3632
-          3534
n't        3292
[          3270
]          3260
one        3205
year       3187
like       3099
new        2994
look       2844
color      2702
&          2647
get        2608
man        2604
trump      2388
say        2353
use        2287
peopl      2272
first      2209
make       2200
found      1998
time       1961
poster     1950
day        1812
war        1792
$          1689
...        1625
2          1598
show       1504
dtype: int64

In [None]:
# list the least common tokens
word_freq_stem[-10:].reset_index(name="freq")

Unnamed: 0,index,freq
0,al-sunnah,1
1,workship,1
2,cbssport,1
3,majin,1
4,buu,1
5,774,1
6,r/ot,1
7,r/sequelmem,1
8,doubter,1
9,110k,1


# Split data into training and validation sets 😲

**lemmatizer**

In [None]:
#split train_lemma 

X1=train_lemma
Y=train['label']

# Split train_lemma into a training set and a validation set because the search method will use them
x_train_lemma, x_val_lemma, y_train_lemma, y_val_lemma = train_test_split(X1, Y, test_size=0.25)

# Create a list where training data indices are -1 and validation data indices are 0
# (a row is training data if its index appears in x_train_lemma)
split_index_lemmatized = [-1 if x in x_train_lemma.index else 0 for x in train_lemma.index]


**Stemmer**


In [None]:
#split train_stem 

X2=train_stem

# Split train_stem into a training set and a validation set because the search method will use them
x_train_stem, x_val_stem, y_train_stem, y_val_stem = train_test_split(X2, Y, test_size=0.25)

# Create a list where training data indices are -1 and validation data indices are 0
# (a row is training data if its index appears in x_train_stem)
split_index_stemmed = [-1 if x in x_train_stem.index else 0 for x in train_stem.index]
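
The `-1` / `0` lists built above are the input format of scikit-learn's `PredefinedSplit`. A minimal sketch of how such a list is interpreted, using a made-up five-element list instead of the real index:

```python
from sklearn.model_selection import PredefinedSplit

# -1 = always in the training part, 0 = belongs to validation fold 0
test_fold = [-1, 0, -1, -1, 0]
ps = PredefinedSplit(test_fold)

print(ps.get_n_splits())  # number of train/validation splits
splits = list(ps.split())
for train_idx, val_idx in splits:
    print("train:", train_idx, "validation:", val_idx)
```

Because only one fold label (`0`) is used, a search with this splitter evaluates each candidate on exactly one fixed validation set.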


# A Tunable Pipeline ➿


# First trial 


1.   Feature creation with TfidfVectorizer 
(Because classification models cannot deal with text data directly, we need to convert our text column to a numeric representation.)
(For many applications, `TF-IDF` (term frequency, inverse document frequency) is a good choice. In our case, the `TF` part summarizes how often a word appears in a text in relation to all words in that text, while the `IDF` part down-weights words that appear in many texts.)

    *   TfidfVectorizer with its default word-level analyzer.



2.   Building our model (Random Forest with random search (validation))
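
As a rough illustration of the feature creation step, the sketch below fits a word-level and a character-level `TfidfVectorizer` on two made-up titles (the documents and variable names are assumptions, not data from this notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["fake news spreads fast", "real news"]  # made-up titles

# word-level (the default analyzer): one feature per word
word_vec = TfidfVectorizer()
X_word = word_vec.fit_transform(docs)
vocab = word_vec.vocabulary_
print(sorted(vocab))

# "news" occurs in both documents, so its weight in the first document
# is lower than that of "fake", which occurs in only one
w_news = X_word[0, vocab["news"]]
w_fake = X_word[0, vocab["fake"]]
print(w_news < w_fake)

# character-level: one feature per character n-gram (here 1 to 2 characters)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X_char = char_vec.fit_transform(docs)
print(X_word.shape, X_char.shape)
```

The character-level matrix usually has many more columns than the word-level one, which is one reason the `min_df` and `max_df` limits matter in the trials.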



**this trial on train_lemma**

In [None]:
# convert the text in the input column to numerical values using TfidfVectorizer
# and train a Random Forest classifier on them.

RF_pipline = Pipeline(
    steps=[
        ('TF-IDF', TfidfVectorizer()),
        ('RandomForest', RandomForestClassifier())]
)

# Use the list to create PredefinedSplit
predefinedspilt1 = PredefinedSplit(split_index_lemmatized)

# define parameter space to test

params={
    # points to TfidfVectorizer->ngram_range 
    'TF-IDF__ngram_range':[(1,2),(1,3)],
    # points to TfidfVectorizer->max_df
    # (note: np.arange(0.3, 0.8) uses the default step of 1, so it yields only [0.3])
    'TF-IDF__max_df': np.arange(0.3, 0.8),
    # points to TfidfVectorizer->min_df 
    'TF-IDF__min_df': np.arange(5, 100),
    # points to RandomForestClassifier->n_estimators 
    'RandomForest__n_estimators': [10,25,30,50,100,200],
    # points to RandomForestClassifier->max_depth 
    'RandomForest__max_depth':[2,3,5,10,20],
    # points to RandomForestClassifier->min_samples_leaf 
    'RandomForest__min_samples_leaf': [5,10,20,50,100,200]
}

# the search is quite slow, so we only run 3 iterations for now

# Using random search with a validation set
# (random search may be good enough and even more generalizable)
pipe_clf_RF = RandomizedSearchCV(
RF_pipline, params, cv=predefinedspilt1, n_jobs=-1, scoring="roc_auc", n_iter=3)

# here we still pass X1, but the random search will use our predefined split internally to determine which samples belong to the validation set

#Fit the model on train_lemma
pipe_clf_RF.fit(X1, Y)
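
One detail of the search space above is worth checking: `np.arange(start, stop)` uses a step of 1 unless told otherwise, so `np.arange(0.3, 0.8)` contains a single value. The sketch below shows this and one possible alternative; the six-point `np.linspace` grid is an assumption, not the setting used in this notebook:

```python
import numpy as np

# default step is 1, so only the start value fits into [0.3, 0.8)
single = np.arange(0.3, 0.8)
print(single)

# an evenly spaced grid of candidate values (assumed alternative)
max_df_grid = np.linspace(0.3, 0.8, 6)
print(max_df_grid)
```

With the single-value range, the random search always uses the same `max_df`, so only the other parameters are actually varied.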




In [None]:
# Best hyperparameter combination found by the random search (TfidfVectorizer() and Random Forest classifier)

best_params = pipe_clf_RF.best_params_
print(best_params)

Take the best hyperparameter combination found by the random search, for both `TfidfVectorizer()` and the Random Forest classifier, and refit the pipeline with it on the full training data:
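
The keys in `best_params_` follow scikit-learn's `<step name>__<parameter>` convention, which is what lets them be passed straight to `set_params`. A minimal sketch with a hypothetical parameter dictionary (the values are made up, not search results):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    steps=[
        ("TF-IDF", TfidfVectorizer()),
        ("RandomForest", RandomForestClassifier()),
    ]
)

# hypothetical parameters; each key is "<step name>__<parameter name>"
example_params = {
    "TF-IDF__ngram_range": (1, 2),
    "TF-IDF__min_df": 5,
    "RandomForest__n_estimators": 50,
}

pipe.set_params(**example_params)
print(pipe.named_steps["TF-IDF"].ngram_range)
print(pipe.named_steps["RandomForest"].n_estimators)
```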



In [None]:
#Fit the model with best param

RF_pipline.set_params(**best_params).fit(X1, Y)


In [None]:
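# accuracy on the training data with the best parameters
# (Pipeline.score uses the classifier's default metric, mean accuracy, not roc_auc)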
RF_pipline.set_params(**best_params).score(X1,Y)
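
The `.score(X1, Y)` call above goes through `Pipeline.score`, which delegates to the final estimator; for scikit-learn classifiers that is mean accuracy rather than ROC AUC. The sketch below illustrates the difference on synthetic data, evaluated on a held-out part (the data, the step names, and the split are assumptions, not part of this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic numeric data, only to illustrate the two metrics
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

demo_pipe = Pipeline([("scale", StandardScaler()), ("lg", LogisticRegression())])
demo_pipe.fit(X_tr, y_tr)

# Pipeline.score -> classifier's score -> mean accuracy
acc = demo_pipe.score(X_val, y_val)
same_acc = accuracy_score(y_val, demo_pipe.predict(X_val))

# ROC AUC is computed from the predicted probabilities instead
auc = roc_auc_score(y_val, demo_pipe.predict_proba(X_val)[:, 1])
print(acc, same_acc, auc)
```

Scoring on data the model was not fitted on, as in this sketch, also gives a more realistic estimate than scoring on the training data.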

In [None]:
#create submission file
# (note: predictions are made on the raw test['text'], not on the cleaned test_lemma)
submission = pd.DataFrame()
submission['id'] = test['id']
submission['label'] = RF_pipline.predict_proba(test['text'])[:,1]
submission.to_csv('Randomforest.csv', index=False)

Random Forest with random search gets the best score on Kaggle (`0.85350`).
We may be able to improve it by changing the hyperparameters.

# Second trial
*   TfidfVectorizer (character-level)
*   building **(XGBoost Classifier with Random Search (validation))**







**this trial on train_lemma**

In [None]:
# convert the text in the input column to numerical values using TfidfVectorizer
# and train an XGBoost classifier on them.
XG_pipline = Pipeline(
    steps=[
        ('TF-IDF', TfidfVectorizer(analyzer="char", max_df=0.2, min_df=10, ngram_range=(1, 3), norm="l2")),
        ('xgboost', XGBClassifier(random_state=42,n_jobs=-1,eval_metric='rmse'))]
)

# Use the list to create PredefinedSplit
predefinedspilt2 = PredefinedSplit(split_index_lemmatized)

# define parameter space to test

params_XG={
    # points to TfidfVectorizer->ngram_range 
    'TF-IDF__ngram_range': [(1, 2), (1, 3), (1,4), (1,5)],
    # points to TfidfVectorizer->analyzer 
    'TF-IDF__analyzer':['char'],
    # points to TfidfVectorizer->min_df 
    'TF-IDF__min_df':np.arange(5, 100),
    # points to TfidfVectorizer->max_df
    # (note: np.arange(0.2, 1.0) uses the default step of 1, so it yields only [0.2])
    'TF-IDF__max_df':np.arange(0.2, 1.0),
    # points to xgboost->n_estimators
    'xgboost__n_estimators': [20, 30, 40], 
    # points to xgboost->max_depth (not used by the 'gblinear' booster)
    'xgboost__max_depth':[10, 20, 30],
    # points to xgboost->booster
    'xgboost__booster':['gbtree','gblinear', 'dart'],
    # points to xgboost->learning_rate
    'xgboost__learning_rate':[1.0, 0.1,0.01,0.0001, 1.5],  
}

# the search is quite slow, so we only run 3 iterations for now
# Using random search with a validation set
# (random search may be good enough and even more generalizable)
pipe_clf_XG = RandomizedSearchCV(
XG_pipline, params_XG, cv=predefinedspilt2, n_jobs=-1, scoring="roc_auc", n_iter=3)

# here we still pass X1, but the random search will use our predefined split internally to determine which samples belong to the validation set
#Fit the model on train_lemma
pipe_clf_XG.fit(X1, Y)




Parameters: { "max_depth" } are not used.



In [None]:
# Best hyperparameter combination found by the random search (TfidfVectorizer() and XGBoost classifier)
best_params2 = pipe_clf_XG.best_params_
print(best_params2)

{'xgboost__n_estimators': 30, 'xgboost__max_depth': 10, 'xgboost__learning_rate': 0.01, 'xgboost__booster': 'gblinear', 'TF-IDF__ngram_range': (1, 3), 'TF-IDF__min_df': 49, 'TF-IDF__max_df': 0.2, 'TF-IDF__analyzer': 'char'}


Take the best hyperparameter combination found by the random search, for both `TfidfVectorizer()` and the XGBoost classifier, and refit the pipeline with it on the full training data:



In [None]:
#Fit the model with best param
XG_pipline.set_params(**best_params2).fit(X1, Y)


Parameters: { "max_depth" } are not used.



In [None]:
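# accuracy on the training data with the best parameters
# (Pipeline.score uses the classifier's default metric, mean accuracy, not roc_auc)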
XG_pipline.set_params(**best_params2).score(X1,Y)

Parameters: { "max_depth" } are not used.



0.8409833333333333

In [None]:
#create submission file
# (note: predictions are made on the raw test['text'], not on the cleaned test_lemma)
submission = pd.DataFrame()
submission['id'] = test['id']
submission['label'] = XG_pipline.predict_proba(test['text'])[:,1]
submission.to_csv('XG.csv', index=False)

XGBoost classifier with random search and a character-level TfidfVectorizer gets a score of `0.7554` on Kaggle.
We may be able to improve it by changing the hyperparameters.

# Third trial

building **(logistic regression Classifier with Random Search (validation))**


**this trial on train_lemma**

In [None]:
# convert the text in the input column to numerical values using TfidfVectorizer
# and train a logistic regression classifier on them.
log_Pipeline = Pipeline(
    steps=[
        ('TF-IDF', TfidfVectorizer()),
        ('lg', LogisticRegression(max_iter=10000,random_state=42,n_jobs=-1))]
)

# Use the list to create PredefinedSplit
predefinedspilt3 = PredefinedSplit(split_index_lemmatized)

# define parameter space to test

params_lg={
     # points to TfidfVectorizer->ngram_range 
    'TF-IDF__ngram_range':[(1, 2), (1, 3), (1,4), (1,5)],
    # points to TfidfVectorizer->max_df
    # (note: np.arange(0.2, 1.0) uses the default step of 1, so it yields only [0.2])
    'TF-IDF__max_df': np.arange(0.2, 1.0),
    # points to TfidfVectorizer->min_df 
    'TF-IDF__min_df': np.arange(5, 100),
    # points to logistic regression->class_weight
    'lg__class_weight':['balanced',None],
    # points to logistic regression->C
    'lg__C': [1.0,0.1,0.001,0.0001,0.005,1.5,2.0,3.5],
    # points to logistic regression->fit_intercept
    'lg__fit_intercept':[False, True],
}

# the search is quite slow, so we only run 3 iterations for now
# Using random search with a validation set
# (random search may be good enough and even more generalizable)
pipe_clf_lg = RandomizedSearchCV(
log_Pipeline, params_lg, n_jobs=-1,cv=predefinedspilt3, scoring="roc_auc", n_iter=3)

# here we still pass X1, but the random search will use our predefined split internally to determine which samples belong to the validation set
#Fit the model on train_lemma
pipe_clf_lg.fit(X1, Y)



In [None]:
# Best hyperparameter combination found by the random search (TfidfVectorizer() and logistic regression classifier)
best_params3 = pipe_clf_lg.best_params_
print(best_params3)

{'lg__fit_intercept': False, 'lg__class_weight': 'balanced', 'lg__C': 3.5, 'TF-IDF__ngram_range': (1, 2), 'TF-IDF__min_df': 15, 'TF-IDF__max_df': 0.2}


Take the best hyperparameter combination found by the random search, for both `TfidfVectorizer()` and the logistic regression classifier, and refit the pipeline with it on the full training data:



In [None]:
#Fit the model with best param
log_Pipeline.set_params(**best_params3).fit(X1, Y)


In [None]:
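# accuracy on the training data with the best parameters
# (Pipeline.score uses the classifier's default metric, mean accuracy, not roc_auc)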
log_Pipeline.set_params(**best_params3).score(X1,Y)

0.8576833333333334

In [None]:
#create submission file
# (note: predictions are made on the raw test['text'], not on the cleaned test_lemma)
submission = pd.DataFrame()
submission['id'] = test['id']
submission['label'] = log_Pipeline.predict_proba(test['text'])[:,1]
submission.to_csv('log.csv', index=False)

Logistic regression with random search gets a score of `0.83602` on Kaggle.
We may be able to improve it by changing the hyperparameters.




# **Fourth trial**

building **(logistic regression Classifier with Random Search (validation))**

TfidfVectorizer (character-level)

**this trial on train_stem**

In [None]:
# convert the text in the input column to numerical values using TfidfVectorizer
# and train a logistic regression classifier on them.
log_Pipeline2 = Pipeline(
    steps=[
        ('TF-IDF', TfidfVectorizer(analyzer="char", max_df=0.2, min_df=10, ngram_range=(1, 3), norm="l2")),
        ('lg', LogisticRegression(max_iter=10000,random_state=42,n_jobs=-1))]
)

# Use the list to create PredefinedSplit
predefinedspilt4 = PredefinedSplit(split_index_stemmed)

# define parameter space to test

params_lg={
     # points to TfidfVectorizer->ngram_range 
    'TF-IDF__ngram_range':[(1, 2), (1, 3), (1,4), (1,5)],
    # points to TfidfVectorizer->max_df
    # (note: np.arange(0.2, 1.0) uses the default step of 1, so it yields only [0.2])
    'TF-IDF__max_df': np.arange(0.2, 1.0),
    # points to TfidfVectorizer->min_df 
    'TF-IDF__min_df': np.arange(5, 100),
    # points to logistic regression->class_weight
    'lg__class_weight':['balanced',None],
    # points to logistic regression->C
    'lg__C': [1.0,0.1,0.001,0.0001,0.005,1.5,2.0,3.5],
    # points to logistic regression->fit_intercept
    'lg__fit_intercept':[False, True],
}

# the search is quite slow, so we only run 3 iterations for now
# Using random search with a validation set
# (random search may be good enough and even more generalizable)
pipe_clf_lg = RandomizedSearchCV(
log_Pipeline2, params_lg, n_jobs=-1,cv=predefinedspilt4, scoring="roc_auc", n_iter=3)

# here we still pass X2, but the random search will use our predefined split internally to determine which samples belong to the validation set
#Fit the model on train_stem
pipe_clf_lg.fit(X2, Y)



In [None]:
# Best hyperparameter combination found by the random search (TfidfVectorizer() and logistic regression classifier)
best_params4 = pipe_clf_lg.best_params_
print(best_params4)

{'lg__fit_intercept': True, 'lg__class_weight': 'balanced', 'lg__C': 3.5, 'TF-IDF__ngram_range': (1, 2), 'TF-IDF__min_df': 43, 'TF-IDF__max_df': 0.2, 'TF-IDF__analyzer': 'char'}


Take the best hyperparameter combination found by the random search, for both `TfidfVectorizer()` and the logistic regression classifier, and refit the pipeline with it on the full training data:



In [None]:
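# Fit the character-level pipeline of this trial (log_Pipeline2) with the best parameters
log_Pipeline2.set_params(**best_params4).fit(X2, Y)


In [None]:
# accuracy on the training data with the best parameters
# (Pipeline.score uses the classifier's default metric, mean accuracy, not roc_auc)
log_Pipeline2.set_params(**best_params4).score(X2,Y)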

0.80435

In [None]:
#create submission file
# (note: predictions are made on the raw test['text'], not on the cleaned test_stem)
submission = pd.DataFrame()
submission['id'] = test['id']
submission['label'] = log_Pipeline2.predict_proba(test['text'])[:,1]
submission.to_csv('log2.csv', index=False)

This model on train_stem with a character-level TfidfVectorizer gets a score of `0.74112` on Kaggle.

So, I noticed that lemmatization scored higher than stemming in these trials (note that the two trials also differ in the vectorizer level).
(Lemmatization is preferred for context analysis, whereas stemming is recommended when the context is not important.)


# Fifth trial


*  building **(MLP Classifier with Random Search (validation))**

*  TfidfVectorizer (character-level)




**this trial on train_lemma**

In [None]:
# convert the text in the input column to numerical values using TfidfVectorizer
# and train an MLP classifier on them.
MLP_Pipeline = Pipeline(
    steps=[
        ('TF-IDF', TfidfVectorizer()),
        ('MLP', MLPClassifier(random_state=1,solver="adam",hidden_layer_sizes=(12, 12, 12),activation="relu",early_stopping=True,n_iter_no_change=1))
        ]
)

# Use the list to create PredefinedSplit
predefinedspilt5 = PredefinedSplit(split_index_lemmatized)

# define parameter space to test

params0={
    # points to TfidfVectorizer->ngram_range 
    'TF-IDF__ngram_range': [(1, 2), (1, 3), (1,4), (1,5)],
    # points to TfidfVectorizer->analyzer 
    'TF-IDF__analyzer':['char'],
    # points to TfidfVectorizer->min_df 
    'TF-IDF__min_df':np.arange(5, 100),
    # points to TfidfVectorizer->max_df
    # (note: np.arange(0.2, 1.0) uses the default step of 1, so it yields only [0.2])
    'TF-IDF__max_df':np.arange(0.2, 1.0),
}

# the search is quite slow, so we only run 3 iterations for now
# Using random search with a validation set
# (random search may be good enough and even more generalizable)
pipe_clf_MLP = RandomizedSearchCV(
MLP_Pipeline, params0, n_jobs=-1,cv=predefinedspilt5, scoring="roc_auc", n_iter=3)

# here we still pass X1, but the random search will use our predefined split internally to determine which samples belong to the validation set
#Fit the model on train_lemma
pipe_clf_MLP.fit(X1, Y)



In [None]:
# Best hyperparameter combination found by the random search (TfidfVectorizer() in front of the MLP classifier)
best_params5 = pipe_clf_MLP.best_params_
print(best_params5)

{'TF-IDF__ngram_range': (1, 4), 'TF-IDF__min_df': 76, 'TF-IDF__max_df': 0.2, 'TF-IDF__analyzer': 'char'}


Take the best hyperparameter combination found by the random search for `TfidfVectorizer()` and refit the MLP pipeline with it on the full training data:



In [None]:
#Fit the model with best param
MLP_Pipeline.set_params(**best_params5).fit(X1, Y)


In [None]:
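# accuracy on the training data with the best parameters
# (Pipeline.score uses the classifier's default metric, mean accuracy, not roc_auc)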
MLP_Pipeline.set_params(**best_params5).score(X1,Y)

0.89655

In [None]:
#create submission file
# (note: predictions are made on the raw test['text'], not on the cleaned test_lemma)
submission = pd.DataFrame()
submission['id'] = test['id']
submission['label'] = MLP_Pipeline.predict_proba(test['text'])[:,1]
submission.to_csv('MLP.csv', index=False)

The MLP model on train_lemma gets a score of `0.8164` on Kaggle. We may be able to improve it by changing the hyperparameters.




# Sixth trial


*   building **(XGBoost Classifier with Random Search (validation))**

*   TfidfVectorizer --> word-level vectorizer.








**this trial on train_lemma**



In [None]:
# convert the text in the input column to numerical values using TfidfVectorizer
# and train an XGBoost classifier on them.
XG_pipline2 = Pipeline(
    steps=[
        ('TF-IDF', TfidfVectorizer(analyzer="word", max_df=0.4, min_df=10, ngram_range=(1, 2))),
        ('xgboost', XGBClassifier(eval_metric='rmse',max_depth=5,n_estimators=200,use_label_encoder=False))]
)

# Use the list to create PredefinedSplit
predefinedspilt6 = PredefinedSplit(split_index_lemmatized)

# define parameter space to test

params_XG2={
    # points to TfidfVectorizer->ngram_range 
    'TF-IDF__ngram_range':[(1,2),(1,3)],
    # points to TfidfVectorizer->max_df
    # (note: np.arange(0.3, 0.8) uses the default step of 1, so it yields only [0.3])
    'TF-IDF__max_df': np.arange(0.3, 0.8),
    # points to TfidfVectorizer->min_df 
    'TF-IDF__min_df': np.arange(5, 100),
    # points to xgboost->n_estimators
    'xgboost__n_estimators': [20, 30, 40], 
    # points to xgboost->max_depth (not used by the 'gblinear' booster)
    'xgboost__max_depth':[10, 20, 30],
    # points to xgboost->booster
    'xgboost__booster':['gbtree','gblinear', 'dart'],
    # points to xgboost->learning_rate
    'xgboost__learning_rate':[1.0, 0.1,0.01,0.0001, 1.5],  
}

# the search is quite slow, so we only run 3 iterations for now
# Using random search with a validation set
# (random search may be good enough and even more generalizable)
pipe_clf_XG2 = RandomizedSearchCV(
XG_pipline2, params_XG2, cv=predefinedspilt6, n_jobs=-1, scoring="roc_auc", n_iter=3)

# here we still pass X1, but the random search will use our predefined split internally to determine which samples belong to the validation set
#Fit the model on train_lemma
pipe_clf_XG2.fit(X1, Y)




Parameters: { "max_depth" } are not used.



In [None]:
# Best hyperparameter combination found by the random search (TfidfVectorizer() and XGBoost classifier)
best_params6 = pipe_clf_XG2.best_params_
print(best_params6)

{'xgboost__n_estimators': 20, 'xgboost__max_depth': 20, 'xgboost__learning_rate': 1.0, 'xgboost__booster': 'gblinear', 'TF-IDF__ngram_range': (1, 3), 'TF-IDF__min_df': 65, 'TF-IDF__max_df': 0.3}


Take the best hyperparameter combination found by the random search, for both `TfidfVectorizer()` and the XGBoost classifier, and refit the pipeline with it on the full training data:



In [None]:
# Refit the pipeline with the best parameters
XG_pipline2.set_params(**best_params6).fit(X1, Y)


Parameters: { "max_depth" } are not used.



In [None]:
# Accuracy on the training data (Pipeline.score returns mean accuracy for a classifier, not ROC AUC)
XG_pipline2.set_params(**best_params6).score(X1,Y)

Parameters: { "max_depth" } are not used.



0.8091333333333334
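
`Pipeline.score` on a classifier reports mean accuracy, so the value above is not a ROC AUC. The following is a minimal sketch of computing both metrics explicitly. The toy texts, labels, and the `LogisticRegression` stand-in are assumptions for illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data (not the Reddit dataset)
texts = [
    "aliens built the pyramids last week",
    "government confirms new tax law",
    "miracle pill cures every disease",
    "city council approves budget",
    "celebrity secretly a lizard",
    "scientists publish climate report",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = fake, 0 = real

# Stand-in pipeline; any classifier with predict_proba works the same way
pipe = Pipeline(steps=[
    ("TF-IDF", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipe.fit(texts, labels)

# score() on a classifier pipeline is mean accuracy
acc_from_score = pipe.score(texts, labels)
acc_explicit = accuracy_score(labels, pipe.predict(texts))

# ROC AUC needs the predicted probability of the positive class
auc = roc_auc_score(labels, pipe.predict_proba(texts)[:, 1])

print("accuracy via score():", acc_from_score)
print("accuracy via accuracy_score:", acc_explicit)
print("ROC AUC:", auc)
```

Both numbers here are measured on the training data, so they are optimistic; the validation score stored in `best_score_` of the fitted search object is the fairer comparison with the Kaggle score.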

In [None]:
#create submission file
submission = pd.DataFrame()
submission['id'] = test['id']
submission['label'] = XG_pipline2.predict_proba(test['text'])[:,1]
submission.to_csv('XG2.csv', index=False)

XGBoost Classifier with Random Search and a word-level TfidfVectorizer scored 0.8164 on Kaggle.
We can try to improve it by changing the hyperparameters.

# VII trial 




*    TfidfVectorizer with its default `word-level vectorizer.`
*    Building a Random Forest with Random Search (cross-validation)






Let's try **cross-validation**. It is usually the preferred method because it gives the model the opportunity to train and be evaluated on multiple train-test splits.
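
To make the difference from the earlier validation-set trials concrete, here is a small sketch comparing a single predefined split with two-fold cross-validation. The ten-sample array and the `test_fold` vector are made up for illustration; `StratifiedKFold` is used because it is what scikit-learn applies when `cv=2` is given for a classifier.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit, StratifiedKFold

# Hypothetical data: 10 samples with alternating labels
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.array([0, 1] * 5)

# PredefinedSplit: -1 = always in training, 0 = member of the single validation fold
test_fold = np.array([-1] * 8 + [0] * 2)
predefined = PredefinedSplit(test_fold)

# Two-fold stratified cross-validation
two_fold = StratifiedKFold(n_splits=2)

n_predefined_splits = predefined.get_n_splits()
n_cv_splits = two_fold.get_n_splits()

# Collect which samples are ever used for validation under each scheme
predefined_val = sorted(i for _, val in predefined.split() for i in val)
cv_val = sorted(i for _, val in two_fold.split(X_demo, y_demo) for i in val)

print("predefined split:", n_predefined_splits, "split, validation samples:", predefined_val)
print("2-fold CV:", n_cv_splits, "splits, validation samples:", cv_val)
```

With the predefined split only the last two samples are ever scored, while cross-validation scores every sample once, at the cost of one model fit per fold.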

In [None]:
# Convert the text in the input column to numerical values using TfidfVectorizer,
# then train a Random Forest classifier on them.

RF_pipline2 = Pipeline(
    steps=[
        ('TF-IDF', TfidfVectorizer()),
        ('RandomForest', RandomForestClassifier())]
)

# define parameter space to test

params2={
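    # points to TfidfVectorizer -> ngram_range
    'TF-IDF__ngram_range': [(1, 2), (1, 3)],
    # points to TfidfVectorizer -> max_df
    # (np.arange(0.3, 0.8) uses a default step of 1 and yields only 0.3, so the values are listed explicitly)
    'TF-IDF__max_df': [0.3, 0.4, 0.5, 0.6, 0.7],
    # points to TfidfVectorizer -> min_df
    'TF-IDF__min_df': np.arange(5, 100),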
}

# The search is quite slow, so we only sample 10 combinations for now (n_iter=10)

# Random search with cross-validation (may be good enough and even more generalizable)
# cv=2 means two-fold cross-validation
# n_jobs is the number of jobs run in parallel
# (on Colab we only have two CPU cores, so we set it to 2)
pipe_clf_RF2 = RandomizedSearchCV(
    RF_pipline2, params2,cv=2, verbose=1, n_jobs=2, 
    # number of random trials
    n_iter=10,
    scoring='roc_auc')

# We pass the full X1; the random search splits it into the two cross-validation folds internally

# Fit the search on train_lemma
pipe_clf_RF2.fit(X1, Y)


Fitting 2 folds for each of 10 candidates, totalling 20 fits




In [None]:
# Best hyperparameter combination for TfidfVectorizer() with the Random Forest classifier

best_params7 = pipe_clf_RF2.best_params_
print(best_params7)

{'TF-IDF__ngram_range': (1, 3), 'TF-IDF__min_df': 79, 'TF-IDF__max_df': 0.3}


Take the best hyperparameter combination found for `TfidfVectorizer()` and refit the pipeline with it. The search above tuned only the vectorizer, so a follow-up search over the Random Forest classifier's own hyperparameters could improve the classification results further.



In [None]:
# Refit the pipeline with the best parameters

RF_pipline2.set_params(**best_params7).fit(X1, Y)


In [None]:
# Accuracy on the training data (Pipeline.score returns mean accuracy for a classifier, not ROC AUC)
RF_pipline2.set_params(**best_params7).score(X1,Y)

0.9993833333333333

In [None]:
#create submission file
submission = pd.DataFrame()
submission['id'] = test['id']
submission['label'] = RF_pipline2.predict_proba(test['text'])[:,1]
submission.to_csv('Randomforest2.csv', index=False)

Random Forest with Random Search (cross-validation) scored 0.81898 on Kaggle, well below the training accuracy of 0.9994, which suggests the model overfits the training data.
We can try to improve it by changing the hyperparameters.

# Conclusion ⭐

*   I noticed that the `word-level` TfidfVectorizer scores better than the `character-level` TfidfVectorizer.
*  I noticed that lemmatization gives higher accuracy than stemming.
(Lemmatization is preferred for context analysis, whereas stemming is recommended when the context is not important.)

*   I used Random Search because it is faster than grid search: it samples a fixed number of combinations instead of trying all of them, which avoids unnecessary computation.

*   The best score on Kaggle, **`0.85350`**, came from the **Random Forest model with Random Search (validation set).**
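
The cost difference between random search and grid search can be illustrated by counting candidates. This is a rough sketch under assumptions: the parameter space below mirrors the shape of the TF-IDF search spaces used in the trials (with `max_df` listed explicitly), and the fit counts ignore the final refit.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Assumed search space, shaped like the TF-IDF spaces in the trials
param_space = {
    "TF-IDF__ngram_range": [(1, 2), (1, 3)],
    "TF-IDF__max_df": [0.3, 0.4, 0.5, 0.6, 0.7],
    "TF-IDF__min_df": list(range(5, 100)),
}

# Grid search would try every combination
n_grid_candidates = len(ParameterGrid(param_space))

# Random search only samples n_iter combinations
sampled = list(ParameterSampler(param_space, n_iter=10, random_state=0))
n_random_candidates = len(sampled)

cv_folds = 2
print("grid search fits:  ", n_grid_candidates * cv_folds)
print("random search fits:", n_random_candidates * cv_folds)
```

The trade-off is that random search may miss the best combination, so the reported scores are best read as lower bounds on what this search space can reach.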





## ✔️ Answer the questions

**🌈 What is the difference between Character n-gram and Word n-gram? Which one tends to suffer more from the OOV issue?**



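*  A character n-gram is a contiguous sequence of n characters.
*  A word n-gram is a contiguous sequence of n words from a given sample of text or speech.
*   Word n-grams suffer more from the OOV (out-of-vocabulary) issue: an unseen or misspelled word matches nothing in the vocabulary, while it still shares many character n-grams with known words.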
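
A small sketch of the OOV effect, using `CountVectorizer` on made-up sentences (the texts and the misspellings are assumptions for illustration). The word-level vocabulary has no entry for the misspelled words, while the character-level one still finds overlapping n-grams.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training titles and an unseen, misspelled title
train_titles = ["fake news spreads fast", "real news is verified"]
unseen_title = ["verifed newss"]  # both words are misspelled on purpose

# Word-level unigrams
word_vec = CountVectorizer(analyzer="word").fit(train_titles)
word_hits = int(word_vec.transform(unseen_title).sum())

# Character-level 2- and 3-grams inside word boundaries
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(train_titles)
char_hits = int(char_vec.transform(unseen_title).sum())

print("word-level features matched:", word_hits)
print("char-level features matched:", char_hits)
```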


**🌈 What is the difference between stop word removal and stemming? Are these techniques language-dependent?**



*   Stemming reduces a word to its root form by stripping affixes (mostly suffixes, e.g. "running" → "run"), while stop word removal drops words that appear very often and carry little meaning (e.g. "the", "is").

*  Stop word removal and stemming are both commonly used in indexing. Stop words are high-frequency words that have little semantic weight and are thus unlikely to help the retrieval process.

*   Both are language-dependent: the stop word list for English differs from the one for German, and the stemming rules follow the grammar of each language.



**🌈 Is tokenization techniques language dependent? Why?**

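Yes, to a large extent. Tokenization separates a piece of text into smaller units called tokens. Splitting on delimiters such as whitespace and punctuation works reasonably well for English, but languages such as Chinese or Japanese do not separate words with spaces, and other languages attach clitics or build long compounds, so the tokenization rules have to be adapted to the language.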


**🌈 What is the difference between count vectorizer and tf-idf vectorizer? Would it be feasible to use all possible n-grams? If not, how should you select them?**


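*   CountVectorizer simply counts the number of times a word appears in a document (using a bag-of-words approach), while TfidfVectorizer takes into account not only how many times a word appears in a document but also how important that word is to the whole corpus: words that occur in many documents are down-weighted.
*  It wouldn't be feasible: the number of distinct n-grams grows very quickly with n and with the vocabulary size, which makes the feature matrix huge and sparse and slows down training.
*  We can select them by limiting `ngram_range` and pruning rare or overly common n-grams with `min_df`, `max_df`, or `max_features`, and by tuning these settings with a search technique such as grid search or random search.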
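
The points above can be sketched on a toy corpus (the three documents are made up for illustration). The sketch shows that TF-IDF gives a lower inverse document frequency to a word found in every document, that the vocabulary grows as longer n-grams are added, and that `min_df` prunes it again.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus
docs = [
    "breaking news about the election",
    "news about the weather",
    "the election results are fake",
]

# TF-IDF: "the" occurs in every document, "weather" in only one
tfidf = TfidfVectorizer().fit(docs)
idf_the = tfidf.idf_[tfidf.vocabulary_["the"]]
idf_weather = tfidf.idf_[tfidf.vocabulary_["weather"]]

# Vocabulary size for unigrams, up to bigrams, up to trigrams
vocab_sizes = {
    n: len(CountVectorizer(ngram_range=(1, n)).fit(docs).vocabulary_)
    for n in (1, 2, 3)
}

# Keep only n-grams that occur in at least two documents
pruned_size = len(
    CountVectorizer(ngram_range=(1, 3), min_df=2).fit(docs).vocabulary_
)

print("idf('the') =", idf_the, " idf('weather') =", idf_weather)
print("vocabulary sizes by max n:", vocab_sizes)
print("size with min_df=2:", pruned_size)
```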



