https://www.kaggle.com/competitions/cisc-873-dm-f22-a3

# **Problem Formulation**

**Problem:** The internet contains a lot of false information, so we need a model that classifies a piece of text as fake news or not.

**Inputs:** Two columns (text and label)

**Output:** Is the information fake or not

**Function required:** Classification & Prediction

**Challenges:** \
1. Remove stopwords, HTML tags, single letters, and repeated spaces.
2. Use TF-IDF.
3. Determine a suitable classifier.
4. Use cross-validation.
5. Select optimal hyperparameters for each algorithm.
6. Find the best accuracy.

**What is the impact?**
* If the model predicts the type of news correctly, readers do not have to wait for a manual check: the model tells them right away whether the news is fake, which saves time.

**What is the ideal solution?**
* The **Neural Network** model with **Random** search and a word-level vectorizer gave the best result.
* Accuracy **0.82669** (public), **0.82914** (private) on Kaggle


# **Trials**

## **Common Commands** in all models

**What is the experimental protocol used and how was it carried out?** \
1. Read training and testing data
2. Data preprocessing using a pipeline
3. Splitting data
4. Validation set
5. Pipeline
6. Tuning hyperparameters
7. Build the model
* I used the validation set.

**What preprocessing steps are used?**

1. Remove HTML tags.
2. Remove stopwords.
3. Remove single letters.
4. Remove multiple spaces.
5. Convert all letters to lower case.
6. Join all words in text_clean and separate them by a space.
7. Keep only texts whose length is greater than 25 characters.
8. Convert a collection of raw documents to a matrix of TF-IDF features.
9. Normalization.

##### Import libraries

In [None]:
pip install scikit-optimize # install scikit-optimize to be able to use bayesian search.



In [None]:
# import the libraries that I will use in my code
import re
import pickle
import sklearn
import pandas as pd
import numpy as np
import holoviews as hv # HoloViews is an open-source Python library designed to make data analysis and visualization seamless and simple. 
import nltk # NLTK is a standard Python package with prebuilt functions and utilities for quick and easy use.
from bokeh.io import output_notebook # It is used to create interactive visualisations for modern web browsers and to build graphics.
output_notebook()

from pathlib import Path

# some settings for pandas and hvplot

pd.options.display.max_columns = 100 # Determine the maximum number of columns that I want to appear when displaying the dataframe
pd.options.display.max_rows = 300 # Determine the maximum number of rows that I want to appear when displaying the dataframe
pd.options.display.max_colwidth = 100 # Maximum width of columns
np.set_printoptions(threshold=2000)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from xgboost import XGBClassifier
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [None]:
#connect to my drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##### Read Training and Testing data

In [None]:
# Read all our training data by using read_csv, which takes the path of the file with the extension that I want to read.
data_tr = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt3/xy_train.csv')
# Based on position, this function returns the first 5 rows of the dataset. It's used to quickly see if our dataset contains the proper kind of data.
data_tr.head(5)

Unnamed: 0,id,text,label
0,265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0
1,284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0
2,207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0
3,551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0
4,8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0


In [None]:
# Read all our testing data by using read_csv, which takes the path of the file with the extension that I want to read.
data_ts = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt3/x_test.csv')
# Based on position, this function returns the first 5 rows of the dataset. It's used to quickly see if our dataset contains the proper kind of data.
data_ts.head(5)

Unnamed: 0,id,text
0,0,stargazer
1,1,yeah
2,2,PD: Phoenix car thief gets instructions from YouTube video
3,3,"As Trump Accuses Iran, He Has One Problem: His Own Credibility"
4,4,"""Believers"" - Hezbollah 2011"


In [None]:
# Display the column's name in training and testing data
print(data_tr.columns)
print(data_ts.columns)

Index(['id', 'text', 'label'], dtype='object')
Index(['id', 'text'], dtype='object')


##### Preprocessing

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

stemmer = SnowballStemmer("english")   # Stemmer that reduces each word to its stem (e.g. "programming" -> "program")
stop_words = set(stopwords.words("english")) # Set of English stop words to filter out

def clean_text(text):
    """ steps:
        - remove any html tags (< /br> often found)
        - keep only ASCII + European chars and whitespace, no digits
        - remove single-letter words
        - convert all whitespace (tabs etc.) to a single space
        if not for embedding (but e.g. tf-idf):
        - all lowercase
        - remove stopwords and punctuation, then stem
    """
    # IGNORECASE is a flag that makes the regular expression match case-insensitively
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE) # Matches any run of whitespace
    RE_TAGS = re.compile(r"<[^>]+>") # Matches HTML tags
    RE_ASCII = re.compile(r"[^A-Za-zÀ-ž ]", re.IGNORECASE) # Matches any character that is not an ASCII or accented Latin letter or a space
    RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž]\b", re.IGNORECASE) # Matches any single-letter word

    text = re.sub(RE_TAGS, " ", text) # Replace each tag with a single space.
    text = re.sub(RE_ASCII, " ", text) # Replace each non-letter character with a single space.
    text = re.sub(RE_SINGLECHAR, " ", text) # Replace each single-letter word with a single space.
    text = re.sub(RE_WSPACE, " ", text)  # Collapse each run of whitespace into a single space.

    word_tokens = word_tokenize(text) # split the sentence into words
    words_tokens_lower = [word.lower() for word in word_tokens] # Convert all letters to small letters

    # words_filtered keeps only the words that are not stop words
    # and applies the stemmer to reduce each remaining word to its stem.
    words_filtered = [
        stemmer.stem(word) for word in words_tokens_lower if word not in stop_words
    ]

    # Join all words in text_clean and separate them by space.
    text_clean = " ".join(words_filtered)
    return text_clean

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# call clean_text function that take string as a parameter to test the function
clean_text("Python is a basic programming language .")

'python basic program languag'

In [None]:
# Keep only training texts longer than 25 characters (test texts only need to be non-empty).
# Rows that fail the condition get NaN in clean_text, because the assignment aligns on the index.
data_tr['clean_text']=data_tr.loc[data_tr['text'].str.len()>25,"text"]
data_ts['clean_text']=data_ts.loc[data_ts['text'].str.len()>0,"text"]

# map applies a function to every element of the column.
# The lambda passes x to clean_text when x is a string and returns x unchanged otherwise (for example NaN).
data_tr['clean_text']=data_tr['clean_text'].map(
    lambda x: clean_text(x) if isinstance(x, str) else x   
)
data_ts['clean_text']=data_ts['clean_text'].map(
    lambda x: clean_text(x) if isinstance(x, str) else x   
)
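
The length filter above relies on pandas index alignment. The following is a minimal sketch of that behaviour on a toy frame; the rows, the `toy` name and the `clean_text_filled` column are hypothetical and are not part of the competition data or the original notebook.

```python
import pandas as pd

# Toy frame standing in for data_tr (hypothetical rows, not from the dataset)
toy = pd.DataFrame(
    {"text": ["short title",
              "a headline that is clearly longer than twenty-five characters"]}
)

# .loc returns only the matching rows; assigning that Series back aligns on
# the index, so rows that fail the condition receive NaN.
toy["clean_text"] = toy.loc[toy["text"].str.len() > 25, "text"]

# The isinstance guard leaves NaN untouched instead of calling a string
# method on a float.
toy["clean_text"] = toy["clean_text"].map(
    lambda x: x.lower() if isinstance(x, str) else x
)

# One possible way to make the gap explicit before vectorizing
# (an assumption, not something the notebook does): replace NaN with "".
toy["clean_text_filled"] = toy["clean_text"].fillna("")

print(toy)
```

If short texts exist in the training data, a step like `fillna("")` or dropping those rows may be needed before joining or vectorizing the column, since string operations do not accept NaN.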

In [None]:
data_ts

Unnamed: 0,id,text,clean_text
0,0,stargazer,stargaz
1,1,yeah,yeah
2,2,PD: Phoenix car thief gets instructions from YouTube video,pd phoenix car thief get instruct youtub video
3,3,"As Trump Accuses Iran, He Has One Problem: His Own Credibility",trump accus iran one problem credibl
4,4,"""Believers"" - Hezbollah 2011",believ hezbollah
...,...,...,...
59146,59146,Bicycle taxi drivers of New Delhi,bicycl taxi driver new delhi
59147,59147,Trump blows up GOP's formula for winning House races,trump blow gop formula win hous race
59148,59148,"Napoleon returns from his exile on the island of Elba. (March 1815), Colourised",napoleon return exil island elba march colouris
59149,59149,Deep down he always wanted to be a ballet dancer,deep alway want ballet dancer


In [None]:
# Examine the label column for unique values and the number of times they appear.
data_tr['label'].value_counts()

0    32172
1    27596
2      232
Name: label, dtype: int64

In [None]:
# drop rows that have label = 2
index_2 = data_tr['label'] == 2
data_tr.drop(data_tr.index[index_2], inplace=True)

In [None]:
# Examine the label column for unique values and the number of times they appear.
data_tr['label'].value_counts()

0    32172
1    27596
Name: label, dtype: int64

In [None]:
# copy data_tr into data_tr_clean
data_tr_clean=data_tr.copy()

In [None]:
data_tr_clean.head(5)

Unnamed: 0,id,text,label,clean_text
0,265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0,group friend began volunt homeless shelter neighbor protest see anoth person also need natur lik...
1,284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0,british prime minist theresa may nerv attack former russian spi govern conclud high like russia ...
2,207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0,goodyear releas kit allow ps brought heel https youtub com watch alxulk cg zwillc fish midatlant...
3,551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0,happi birthday bob barker price right host like rememb man said ave pet spay neuter fuckincorpor...
4,8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0,obama nation innoc cop unarm young black men die magic johnson jimbobshawobodob olymp athlet sho...


In [None]:
# Word frequency of the most common words
# Join all cleaned texts, split them on whitespace, and count each word.
word_freq_tr = pd.Series(" ".join(data_tr_clean["clean_text"]).split()).value_counts()
word_freq_tr[1:40] # display the words ranked 2 to 40 by frequency (the slice skips the most frequent word)

one         3285
like        3128
new         2998
look        2847
color       2737
man         2729
get         2602
trump       2578
say         2347
peopl       2316
use         2307
first       2248
make        2227
old         2226
time        2027
poster      2000
found       1999
day         1935
war         1858
post        1648
world       1570
work        1531
show        1513
us          1506
american    1504
take        1491
life        1482
psbattl     1470
help        1442
go          1420
state       1409
back        1369
two         1364
school      1345
see         1329
photo       1324
made        1314
right       1311
save        1308
dtype: int64

In [None]:
# display the 10 least common words
# reset_index turns the words into a column and names the count column "freq"
word_freq_tr[-10:].reset_index(name="freq")

Unnamed: 0,index,freq
0,angriff,1
1,delusion,1
2,wane,1
3,undament,1
4,miku,1
5,hatsun,1
6,nfler,1
7,hicock,1
8,mccall,1
9,wahr,1


In [None]:
# Distribution of labels
data_tr_clean["label"].value_counts(normalize=True) # count proportions of label

0    0.538281
1    0.461719
Name: label, dtype: float64

In [None]:
"""
Compute unique word vector with frequencies
exclude very uncommon (<10 obsv.) and common (>30%) words
use single words and pairs of two words (ngram)
"""
# TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.
# min_df ignores terms that have a document frequency strictly lower than the given threshold
# max_df ignores terms that have a document frequency strictly higher than the given threshold
# ngram_range=(a, b) -> a is the minimum and b is the maximum size of n-grams; (1, 2) keeps single words and word pairs
vectorizer = TfidfVectorizer(
    max_df=0.3, min_df=10, ngram_range=(1, 2)
)
vectorizer.fit(data_tr_clean["clean_text"])

TfidfVectorizer(max_df=0.3, min_df=10, ngram_range=(1, 2))
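
To make the effect of `min_df`, `max_df` and `ngram_range` concrete, here is a small pure-Python sketch of the pruning rule described in the comments above. It is an illustration only, not scikit-learn's implementation; the helper names `ngrams` and `prune_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Yield all word n-grams with n_min <= n <= n_max, joined by spaces."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def prune_vocabulary(docs, min_df=2, max_df=0.5):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # A set counts each term at most once per document
        doc_freq.update(set(ngrams(doc.split())))
    return sorted(
        term for term, count in doc_freq.items()
        if count >= min_df and count <= max_df * n_docs
    )

# Hypothetical cleaned documents in the style of the clean_text column
toy_docs = [
    "trump blow gop formula",
    "trump accus iran",
    "bicycl taxi driver",
    "taxi driver new delhi",
    "new poster",
    "new photo",
]

kept = prune_vocabulary(toy_docs, min_df=2, max_df=0.4)
print(kept)
```

In this toy corpus "new" occurs in 3 of 6 documents, which is above the 40% ceiling, so it is dropped, while terms that occur only once fall below `min_df`.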

In [None]:
# Show a sample of the vocabulary: each term (n-gram) mapped to its column index in the TF-IDF matrix
word_vector = pd.Series(vectorizer.vocabulary_).sample(7, random_state=1) # sample(7) picks how many vocabulary entries to display
print(f"Unique word (ngram) vector extract:\n\n {word_vector}")

Unique word (ngram) vector extract:

 eli       2666
go far    3595
bamboo     650
wisdom    9837
pocket    6705
elf       2665
hockey    4027
dtype: int64


In [None]:
data_tr_clean

Unnamed: 0,id,text,label,clean_text
0,265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0,group friend began volunt homeless shelter neighbor protest see anoth person also need natur lik...
1,284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0,british prime minist theresa may nerv attack former russian spi govern conclud high like russia ...
2,207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0,goodyear releas kit allow ps brought heel https youtub com watch alxulk cg zwillc fish midatlant...
3,551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0,happi birthday bob barker price right host like rememb man said ave pet spay neuter fuckincorpor...
4,8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0,obama nation innoc cop unarm young black men die magic johnson jimbobshawobodob olymp athlet sho...
...,...,...,...,...
59995,70046,"Finish Sniper Simo H盲yh盲 during the invasion of Finland by the USSR (1939, colorized)",0,finish sniper simo yh invas finland ussr color
59996,189377,"Nigerian Prince Scam took $110K from Kansas man; 10 years later, he's getting it back",1,nigerian princ scam took kansa man year later get back
59997,93486,Is It Safe To Smoke Marijuana During Pregnancy? You鈥檇 Be Surprised Of The Answer | no,0,safe smoke marijuana pregnanc surpris answer
59998,140950,Julius Caesar upon realizing that everyone in the room has a knife except him (44 bc),0,julius caesar upon realiz everyon room knife except bc


#####  Splitting data

I split the data into X and y.

In [None]:
# split the training data into X (features) and y (labels)
X=data_tr_clean['clean_text'] # X contains only clean_text column
y=data_tr_clean['label']   # y contains only label column
X_test=data_ts['clean_text'] # X_test contains only clean_text column

In [None]:
# transform each sentence to numeric vector with tf-idf value as elements
X_train_vec = vectorizer.transform(X)
X_test_vec = vectorizer.transform(X_test)
X_train_vec.get_shape()

(59768, 10061)

In [None]:
# Compare the cleaned text with its numeric vector representation
print(f"Original sentence:\n{X[3:4].values}\n")
# features is a DataFrame built from the dense TF-IDF row, with one column per vocabulary term
# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide
features = pd.DataFrame(
    X_train_vec[3:4].toarray(), columns=vectorizer.get_feature_names_out()
)
nonempty_feat = features.loc[:, (features != 0).any(axis=0)]
print(f"Vector representation of sentence:\n {nonempty_feat}")

Original sentence:
['happi birthday bob barker price right host like rememb man said ave pet spay neuter fuckincorporateshil irish rover jolli rove tar imgur com true dh bqw https extern preview redd qseokdslwivxzdehgxxbceuxh vhemi muoxw yaqq jpg width crop smart auto webp eda ba bf ed fakealbumcov irish rover jolli rove tar kna fire made becam green place redd true cd anx https preview redd nw nexjtfaa jpg width crop smart auto webp ee da ce da mildlyinterest fire made becam green place rhym jarat true cf qh http imgur com crvur jpg wvf psbattl artwork jarat dagnummmong stop redd true qbhc https preview redd lsvmqhtjcz jpg width crop smart auto webp bfb dd ccafa cad fakealbumcov stop top today ss without jupit probabl asteroid impact artist use one vermont senat top surrog year stormtroop finish anim imgur com true cal https extern preview redd vwnj gq tvah fdc ahmc qyzjj sw fqlrwqauk jpg width crop smart auto webp fba de ac subredditsimul without jupit probabl asteroid impact artist 



##### Validation set

In [None]:
from sklearn.model_selection import PredefinedSplit
# Further split the original training set to a train and a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size = 0.8, stratify = y, random_state = 2022)

# Create a list where training rows are marked -1 (never used for validation) and validation rows are marked 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
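
As a quick illustration of what `PredefinedSplit` does with such a list, the sketch below uses a hypothetical six-row `toy_fold` array (not the real `split_index`) and shows that the splitter yields exactly one train/validation pair.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical fold labels for 6 rows: -1 = always train, 0 = validation fold
toy_fold = np.array([-1, -1, 0, -1, 0, -1])
toy_pds = PredefinedSplit(test_fold=toy_fold)

# split() yields one (train indices, validation indices) pair per fold id >= 0
splits = list(toy_pds.split())
train_idx, val_idx = splits[0]

print("number of splits:", toy_pds.get_n_splits())
print("train rows:", train_idx)
print("validation rows:", val_idx)
```

Because there is only one fold id besides -1, a search object given `cv=pds` evaluates every candidate on a single hold-out split, which matches the "Fitting 1 folds" lines in the search logs.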


## **Trials ( Word-level Vectorizer )**

### **Neural Network**

#### **Trial 0**

* I will use the **MLPClassifier** with:
 * solver = adam
 * hidden layers = 3 layers of 12 units each
 * ReLU as activation function
 * early stopping to avoid overfitting
 * n_iter_no_change = 1
* I will use **Random** search in tuning.
* **word**-level vectorizer

**My thoughts and observations :** The accuracy would be between 0.75 and 0.80

##### PipeLine Tuning

In [None]:
# Build the pipeline.
# A Pipeline chains the vectorizer and the classifier so the whole sequence can be tuned and validated together.
# Parameters of each step are addressed as "<step name>__<parameter name>".
# The steps argument lists every transformation (here TF-IDF vectorization with L2 normalization) followed by the final estimator.
# Fitting the pipeline once applies the same transformations to training and test data without repeating code.

pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer="word", norm="l2")), ("my_classifier",  MLPClassifier(
        random_state=1,
        solver="adam",
        hidden_layer_sizes=(12, 12, 12),
        activation="relu",
        early_stopping=True,
        n_iter_no_change=1,
    ))])

# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    # tfidf__ngram_range points to TfidfVectorizer -> ngram_range
    # ngram_range=(a, b) -> a is the minimum and b is the maximum size of n-grams, so (1, 2) uses single words and word pairs
    "tfidf__max_df": np.arange(0.3, 0.8),
    # max_df ignores terms that have a document frequency strictly higher than the given threshold
    # note: np.arange uses a step of 1 by default, so this array contains only 0.3
    "tfidf__min_df": np.arange(5, 100),
    # min_df ignores terms that have a document frequency strictly lower than the given threshold
}

pipe_clf = RandomizedSearchCV(
    pipe, params, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# We still pass the full X; the predefined split makes the randomized search train on X_train and score on the validation rows
# fit the pipeline
pipe_clf.fit(X, y)

print('best score {}'.format(pipe_clf.best_score_))
print('best score {}'.format(pipe_clf.best_params_))

Fitting 1 folds for each of 10 candidates, totalling 10 fits
best score 0.8619907522969671
best score {'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 8, 'tfidf__max_df': 0.3}


In [None]:
# run pipe with optimized parameters
best_params = pipe_clf.best_params_
pipe.set_params(**best_params).fit(X, y)
pipe_pred = pipe.predict(X_test)

Best parameters:
* min_df = 33
* max_df = 0.3
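
The search space passes `np.arange(0.3, 0.8)` for `tfidf__max_df`. The short check below shows what that array contains and, assuming the intent was to try several thresholds between 0.3 and 0.8 (an assumption, the notebook does not say so), one possible alternative. The names `single` and `grid` are hypothetical.

```python
import numpy as np

# With the default step of 1, only the start value fits in [0.3, 0.8)
single = np.arange(0.3, 0.8)

# Possible alternative: six evenly spaced thresholds including both ends
grid = np.linspace(0.3, 0.8, num=6)

print("np.arange(0.3, 0.8)      ->", single)
print("np.linspace(0.3, 0.8, 6) ->", grid)
```

With the original array the search can only ever select `max_df = 0.3`, which is consistent with the best parameters reported in every trial.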

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = data_ts['id']

submission['label'] = pipe_clf.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt3/sample_submission_walkthrough.csv', index=False)

##### Result

ROC AUC on the **validation** split = 0.8619 \
Accuracy in **kaggle** =0.82320
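
The search above is run with `scoring='roc_auc'`, so the score it prints is an area under the ROC curve rather than a classification accuracy. The sketch below, in plain Python with hypothetical labels and probabilities, shows how the two measures can differ on the same predictions; `roc_auc` and `accuracy` are illustrative helpers, not scikit-learn functions.

```python
def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def accuracy(y_true, scores, threshold=0.5):
    """Share of correct labels after thresholding the probabilities."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

# Hypothetical labels and predicted probabilities of class 1
y_toy = [0, 0, 1, 1, 1]
p_toy = [0.2, 0.6, 0.55, 0.7, 0.9]

auc_value = roc_auc(y_toy, p_toy)
acc_value = accuracy(y_toy, p_toy)
print(f"ROC AUC  = {auc_value:.4f}")
print(f"accuracy = {acc_value:.4f}")
```

ROC AUC depends only on how the probabilities rank the examples, while accuracy depends on the chosen threshold, so the two numbers are not directly comparable.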

#### **Trial 1**

According to the previous trial, I will use the same classifier but with **Bayes** search instead of Random search.\
**My thoughts and observations :** The accuracy would be between 0.82320 and 0.8250

##### PipeLine Tuning

In [None]:
# Build the pipeline.
# A Pipeline chains the vectorizer and the classifier so the whole sequence can be tuned and validated together.
# Parameters of each step are addressed as "<step name>__<parameter name>".
# The steps argument lists every transformation (here TF-IDF vectorization with L2 normalization) followed by the final estimator.
# Fitting the pipeline once applies the same transformations to training and test data without repeating code.

pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer="word", norm="l2")), ("my_classifier",  MLPClassifier(
        random_state=1,
        solver="adam",
        hidden_layer_sizes=(12, 12, 12),
        activation="relu",
        early_stopping=True,
        n_iter_no_change=1,
    ))])

# define parameter space to test
params = {
    "tfidf__max_df": np.arange(0.3, 0.8),
    # max_df ignores terms that have a document frequency strictly higher than the given threshold
    # note: np.arange uses a step of 1 by default, so this array contains only 0.3
    "tfidf__min_df": np.arange(5, 100),
    # min_df ignores terms that have a document frequency strictly lower than the given threshold
}

pipe_clf = BayesSearchCV(
    pipe, params, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# We still pass the full X; the predefined split makes the Bayes search train on X_train and score on the validation rows
# fit the pipeline
pipe_clf.fit(X, y)

print('best score {}'.format(pipe_clf.best_score_))
print('best score {}'.format(pipe_clf.best_params_))

Fitting 1 folds for each of 1 candidates, totalling 1 fits

In [None]:
# run pipe with optimized parameters
best_params = pipe_clf.best_params_
pipe.set_params(**best_params).fit(X, y)
pipe_pred = pipe.predict(X_test)

Best parameters:
* min_df = 8
* max_df = 0.3

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = data_ts['id']

submission['label'] = pipe_clf.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt3/sample_submission_walkthrough.csv', index=False)

##### Result

ROC AUC on the **validation** split = 0.8606 \
Accuracy in **kaggle** =0.81479 \
The previous trial is better than this trial.

#### **Trial 2**

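* According to the previous trial, I will **change the hyperparameters**:
 * solver = lbfgs
 * hidden layers = 3 layers of 12 units each
 * ReLU as activation function
 * early stopping to avoid overfitting (scikit-learn applies it only with the sgd and adam solvers, so it has no effect with lbfgs)
 * alpha = 0.0001
 * learning_rate = adaptive (used by scikit-learn only with the sgd solver)
* I will use **Random** search in tuning.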

**My thoughts and observations :** The accuracy would be between 0.82320 and 0.8250

##### PipeLine Tuning

In [None]:
# Build the pipeline.
# A Pipeline chains the vectorizer and the classifier so the whole sequence can be tuned and validated together.
# Parameters of each step are addressed as "<step name>__<parameter name>".
# The steps argument lists every transformation (here TF-IDF vectorization with L2 normalization) followed by the final estimator.
# Fitting the pipeline once applies the same transformations to training and test data without repeating code.

pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer="word", norm="l2")), ("my_classifier",  MLPClassifier(
        random_state=1,
        solver='lbfgs',
        hidden_layer_sizes=(12, 12, 12),
        activation="relu",
        early_stopping=True,
        alpha=0.0001,
        learning_rate= 'adaptive',
    ))])

# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    # tfidf__ngram_range points to TfidfVectorizer -> ngram_range
    # ngram_range=(a, b) -> a is the minimum and b is the maximum size of n-grams, so (1, 2) uses single words and word pairs
    "tfidf__max_df": np.arange(0.3, 0.8),
    # max_df ignores terms that have a document frequency strictly higher than the given threshold
    # note: np.arange uses a step of 1 by default, so this array contains only 0.3
    "tfidf__min_df": np.arange(5, 100),
    # min_df ignores terms that have a document frequency strictly lower than the given threshold
}

pipe_clf = RandomizedSearchCV(
    pipe, params, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# We still pass the full X; the predefined split makes the randomized search train on X_train and score on the validation rows
# fit the pipeline
pipe_clf.fit(X, y)

print('best score {}'.format(pipe_clf.best_score_))
print('best score {}'.format(pipe_clf.best_params_))

Fitting 1 folds for each of 10 candidates, totalling 10 fits
best score 0.8504907747524162
best score {'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 17, 'tfidf__max_df': 0.3}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


In [None]:
# run pipe with optimized parameters
best_params = pipe_clf.best_params_
pipe.set_params(**best_params).fit(X, y)
pipe_pred = pipe.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Best parameters:
* min_df = 17
* max_df = 0.3

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = data_ts['id']

submission['label'] = pipe_clf.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt3/sample_submission_walkthrough.csv', index=False)

##### Result

ROC AUC on the **validation** split = 0.85049 \
Accuracy in **kaggle** =0.82669\
This trial is better than all previous trials.

### **XGBoost Classifier**

#### **Trial 0**

* I will use the **XGBClassifier** with:
 * min_child_weight = [20,40,80]
 * max_depth = [50,60,70]
 * gamma = [0.5, 1, 1.5, 2, 5]
 * colsample_bytree = [0.6, 0.8, 1.0]
* I will use **Random** search in tuning.
* **word**-level vectorizer

**My thoughts and observations :** The accuracy would be between 0.82 and 0.83

##### PipeLine Tuning

In [None]:
# Build the pipeline.
# A Pipeline chains the vectorizer and the classifier so the whole sequence can be tuned and validated together.
# Parameters of each step are addressed as "<step name>__<parameter name>".
# The steps argument lists every transformation (here TF-IDF vectorization with L2 normalization) followed by the final estimator.
# Fitting the pipeline once applies the same transformations to training and test data without repeating code.

pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer="word", norm="l2")), ("my_classifier",  XGBClassifier())])

# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    # tfidf__ngram_range points to TfidfVectorizer -> ngram_range
    # ngram_range=(a, b) -> a is the minimum and b is the maximum size of n-grams, so (1, 2) uses single words and word pairs
    "tfidf__max_df": np.arange(0.3, 0.8),
    # max_df ignores terms that have a document frequency strictly higher than the given threshold
    # note: np.arange uses a step of 1 by default, so this array contains only 0.3
    "tfidf__min_df": np.arange(5, 100),
    # min_df ignores terms that have a document frequency strictly lower than the given threshold
    'my_classifier__min_child_weight': [20,40,80],
    # min_child_weight is the minimum sum of instance weights required in a child node
    'my_classifier__max_depth':[50,60,70],
    # max_depth is the maximum depth of a tree
    'my_classifier__gamma':[0.5, 1, 1.5, 2, 5],
    # gamma is the minimum loss reduction required to split a node further
    'my_classifier__colsample_bytree':[0.6, 0.8, 1.0]
    # colsample_bytree is the fraction of columns sampled for each tree
}

pipe_clf = RandomizedSearchCV(
    pipe, params, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# We still pass the full X; the predefined split makes the randomized search train on X_train and score on the validation rows
# fit the pipeline
pipe_clf.fit(X, y)

print('best score {}'.format(pipe_clf.best_score_))
print('best score {}'.format(pipe_clf.best_params_))

Fitting 1 folds for each of 10 candidates, totalling 10 fits
best score 0.8311641791801241
best score {'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 9, 'tfidf__max_df': 0.3, 'my_classifier__min_child_weight': 20, 'my_classifier__max_depth': 70, 'my_classifier__gamma': 0.5, 'my_classifier__colsample_bytree': 0.6}


In [None]:
# run pipe with optimized parameters
best_params = pipe_clf.best_params_
pipe.set_params(**best_params).fit(X, y)
pipe_pred = pipe.predict(X_test)

Best parameters:
* min_df = 9
* max_df = 0.3
* min_child_weight = 20
* max_depth = 70 (When increasing max_depth, the accuracy gets better)
* gamma = 0.5
* colsample_bytree = 0.6

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = data_ts['id']

submission['label'] = pipe_clf.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt3/sample_submission_walkthrough.csv', index=False)

##### Result

ROC AUC on the **validation** split = 0.83116\
Accuracy in **kaggle** = 0.80296

### **Logistic Regression**

#### **Trial 0**

* I will use the **LogisticRegression** with:
 * C = 100
 * max_iter = 100
 * tol = [1e-4,1e-5,1e-3]
* I will use **Grid** search in tuning.
* **word**-level vectorizer

**My thoughts and observations :** The accuracy would be between 0.80 and 0.82

##### PipeLine Tuning

In [None]:
# Build the pipeline.
# A Pipeline chains the vectorizer and the classifier so the whole sequence can be tuned and validated together.
# Parameters of each step are addressed as "<step name>__<parameter name>".
# The steps argument lists every transformation (here TF-IDF vectorization with L2 normalization) followed by the final estimator.
# Fitting the pipeline once applies the same transformations to training and test data without repeating code.

pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer="word", norm="l2")), ("my_classifier",  LogisticRegression())])

# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    # tfidf__ngram_range points to TfidfVectorizer -> ngram_range
    # ngram_range controls which word sequences become features, so words that frequently appear together can be captured as one term
    # ngram_range=(a,b)-> a is the minimum and b is the maximum size of ngrams
    "tfidf__max_df": np.arange(0.3, 0.8),
    # max_df ignores terms that have a document frequency strictly higher than the given threshold
    # note: np.arange(0.3, 0.8) uses the default step of 1, so the only value tried is 0.3
    "tfidf__min_df": np.arange(5, 100),
    # min_df ignores terms that have a document frequency strictly lower than the given threshold
    'my_classifier__C': [100],
    # C is the inverse of the regularization strength (a smaller C means a stronger penalty)
    # my_classifier__C points to my_classifier->C
    'my_classifier__max_iter':[100], 
    # max_iter is the maximum number of solver iterations
    # my_classifier__max_iter points to my_classifier-> max_iter
    'my_classifier__tol':[1e-4,1e-5,1e-3]
    # tol is the tolerance for the stopping criterion
    # my_classifier__tol points to my_classifier-> tol

}

pipe_clf = GridSearchCV(
    pipe, params, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# X is still passed in full; through cv=pds the Grid search scores each candidate on the validation set
# fit the pipeline
pipe_clf.fit(X, y)

print('best score {}'.format(pipe_clf.best_score_))
print('best params {}'.format(pipe_clf.best_params_))

Fitting 1 folds for each of 570 candidates, totalling 570 fits
best score 0.852913232003647
best params {'my_classifier__C': 100, 'my_classifier__max_iter': 100, 'my_classifier__tol': 0.0001, 'tfidf__max_df': 0.3, 'tfidf__min_df': 27, 'tfidf__ngram_range': (1, 2)}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [None]:
# run pipe with optimized parameters
best_params = pipe_clf.best_params_
pipe.set_params(**best_params).fit(X, y)
pipe_pred = pipe.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Best parameters:
* ngram_range = (1, 2)
* min_df = 27
* max_df = 0.3
* C = 100
* max_iter = 100
* tol = 0.0001

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = data_ts['id']

submission['label'] = pipe_clf.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt3/sample_submission_walkthrough.csv', index=False)

##### Result

Score in **Cross-Validation** (ROC AUC, single validation fold) = 0.85291 \
Score in **kaggle** = 0.80995
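
The count of 570 candidates in the log can be checked by hand: 2 n-gram ranges × 1 `max_df` value × 95 `min_df` values × 1 `C` × 1 `max_iter` × 3 `tol` values. There is only one `max_df` value because `np.arange(0.3, 0.8)` uses the default step of 1. The sketch below reproduces this count with scikit-learn's `ParameterGrid`; the `max_df_wider` line is only an assumption about what a wider range could look like, not what the trial ran.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as Logistic Regression Trial 0 (word-level vectorizer)
params_as_written = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    "tfidf__max_df": np.arange(0.3, 0.8),   # default step is 1, so only 0.3 is generated
    "tfidf__min_df": np.arange(5, 100),     # 95 integer values: 5, 6, ..., 99
    "my_classifier__C": [100],
    "my_classifier__max_iter": [100],
    "my_classifier__tol": [1e-4, 1e-5, 1e-3],
}

n_max_df = len(params_as_written["tfidf__max_df"])
n_candidates = len(ParameterGrid(params_as_written))  # 2 * 1 * 95 * 1 * 1 * 3
print(n_max_df, n_candidates)

# Assumption: if several max_df values were intended, linspace makes the count explicit.
max_df_wider = np.linspace(0.3, 0.7, 5)
print(max_df_wider)
```

This also explains why every trial reports `max_df = 0.3` among its best parameters.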

#### **Trial 1**

Following the previous trial, I will use the same classifier, but:
*  I will use **Bayes** search instead of Grid search
* Change the hyperparameter values

**My thoughts and observations :** The accuracy would be between 0.80995 and 0.81955

##### PipeLine Tuning

In [None]:
# Build the pipeline
# A pipeline chains several processing steps so they can be cross-validated together while their parameters are tuned.
# Parameters of each step are set using the step name and the parameter name separated by "__".
# It takes a list of steps as a parameter; here these are the TF-IDF vectorization (with l2 normalization) and the classifier.
# It saves time by applying the same preprocessing to both train and test data without repeating the process.

pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer="word", norm="l2")), ("my_classifier",  LogisticRegression())])

# define parameter space to test
params = {
    "tfidf__max_df": np.arange(0.3, 0.8),
    # max_df ignores terms that have a document frequency strictly higher than the given threshold
    # note: np.arange(0.3, 0.8) uses the default step of 1, so the only value tried is 0.3
    "tfidf__min_df": np.arange(5, 100),
    # min_df ignores terms that have a document frequency strictly lower than the given threshold
    'my_classifier__C': [100,200,300],
    # C is the inverse of the regularization strength (a smaller C means a stronger penalty)
    # my_classifier__C points to my_classifier->C
    'my_classifier__max_iter':[100 ,200, 300], 
    # max_iter is the maximum number of solver iterations
    # my_classifier__max_iter points to my_classifier-> max_iter
    'my_classifier__tol':[1e-4,1e-5,1e-3]
    # tol is the tolerance for the stopping criterion
    # my_classifier__tol points to my_classifier-> tol

}

pipe_clf = BayesSearchCV(
    pipe, params, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# X is still passed in full; through cv=pds the Bayes search scores each candidate on the validation set
# fit the pipeline
pipe_clf.fit(X, y)

print('best score {}'.format(pipe_clf.best_score_))
print('best params {}'.format(pipe_clf.best_params_))

Fitting 1 folds for each of 1 candidates, totalling 1 fits

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [None]:
# run pipe with optimized parameters
best_params = pipe_clf.best_params_
pipe.set_params(**best_params).fit(X, y)
pipe_pred = pipe.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Best parameters:
* min_df = 27
* max_df = 0.3
* C = 300
* max_iter = 200
* tol = 1e-05

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = data_ts['id']

submission['label'] = pipe_clf.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt3/sample_submission_walkthrough.csv', index=False)

##### Result

Score in **Cross-Validation** (ROC AUC, single validation fold) = 0.8498 \
Score in **kaggle** = 0.82423 \
This trial is better than the previous trial on kaggle, and it is better than I expected.
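
Both Logistic Regression trials print the lbfgs warning "TOTAL NO. of ITERATIONS REACHED LIMIT", which means the solver stopped at `max_iter` before converging. The warning text itself suggests raising `max_iter`. Below is a minimal sketch of that option on a toy corpus; the texts, labels, and the value `max_iter=1000` are assumptions for illustration, not settings used in the trials.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical); 1 = fake, 0 = real
texts = [
    "shocking secret cure doctors hate",
    "you will not believe this miracle trick",
    "aliens built the city hall overnight",
    "council approves new budget for schools",
    "local team wins the regional final",
    "weather service issues storm warning",
]
labels = [1, 1, 1, 0, 0, 0]

pipe_sketch = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", norm="l2")),
    # Assumption: a larger iteration budget than the 100 to 300 tried in the trials
    ("my_classifier", LogisticRegression(C=100, max_iter=1000)),
])

# Turn the convergence warning into an error so a non-converged fit cannot go unnoticed.
with warnings.catch_warnings():
    warnings.simplefilter("error", ConvergenceWarning)
    pipe_sketch.fit(texts, labels)

n_iter = pipe_sketch.named_steps["my_classifier"].n_iter_[0]
print(n_iter)
```

On the real data the required number of iterations may be much higher, so `max_iter` would be worth including in the search space with larger values.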

### **Random Forest**

#### **Trial 0**

* I will use the **RandomForestClassifier** with:
 * n_estimators = [170,200,250]
 * max_depth = [70,80,90]
 * max_features = [10,20,30]
* I will use **Random** search in tuning.
* **word**-level vectorizer

**My thoughts and observations :** The accuracy would be between 0.81 and 0.82

##### PipeLine Tuning

In [None]:
# Build the pipeline
# A pipeline chains several processing steps so they can be cross-validated together while their parameters are tuned.
# Parameters of each step are set using the step name and the parameter name separated by "__".
# It takes a list of steps as a parameter; here these are the TF-IDF vectorization (with l2 normalization) and the classifier.
# It saves time by applying the same preprocessing to both train and test data without repeating the process.

pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer="word", norm="l2")), ("my_classifier",  RandomForestClassifier())])

# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    # tfidf__ngram_range points to TfidfVectorizer -> ngram_range
    # ngram_range controls which word sequences become features, so words that frequently appear together can be captured as one term
    # ngram_range=(a,b)-> a is the minimum and b is the maximum size of ngrams
    "tfidf__max_df": np.arange(0.3, 0.8),
    # max_df ignores terms that have a document frequency strictly higher than the given threshold
    # note: np.arange(0.3, 0.8) uses the default step of 1, so the only value tried is 0.3
    "tfidf__min_df": np.arange(5, 100),
    # min_df ignores terms that have a document frequency strictly lower than the given threshold
    'my_classifier__n_estimators': [170,200,250],
    # n_estimators is the number of trees in the forest
    # my_classifier__n_estimators points to my_classifier->n_estimators
    'my_classifier__max_depth':[70,80,90],   
    # max_depth is the maximum depth of each tree
    # my_classifier__max_depth points to my_classifier-> max_depth
    'my_classifier__max_features':[10,20,30]
    # max_features is the number of features considered when looking for the best split
    # my_classifier__max_features points to my_classifier-> max_features

}

pipe_clf = RandomizedSearchCV(
    pipe, params, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# X is still passed in full; through cv=pds the Randomized search scores each candidate on the validation set
# fit the pipeline
pipe_clf.fit(X, y)

print('best score {}'.format(pipe_clf.best_score_))
print('best params {}'.format(pipe_clf.best_params_))

Fitting 1 folds for each of 10 candidates, totalling 10 fits
best score 0.8462690658378282
best params {'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 8, 'tfidf__max_df': 0.3, 'my_classifier__n_estimators': 170, 'my_classifier__max_features': 10, 'my_classifier__max_depth': 80}


In [None]:
# run pipe with optimized parameters
best_params = pipe_clf.best_params_
pipe.set_params(**best_params).fit(X, y)
pipe_pred = pipe.predict(X_test)

Best parameters:
* ngram_range = (1, 3)
* min_df = 8
* max_df = 0.3
* n_estimators = 170
* max_features = 10
* max_depth = 80

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = data_ts['id']

submission['label'] = pipe_clf.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt3/sample_submission_walkthrough.csv', index=False)

##### Result

Score in **Cross-Validation** (ROC AUC, single validation fold) = 0.84626 \
Score in **kaggle** = 0.81024

## **Trials ( Character-level Vectorizer )**

### **Neural Network**

#### **Trial 0**

* In this trial I will use a Neural Network with:
 * solver = lbfgs
 * size of hidden layers = 12 units in each of the 3 layers
 * ReLU as activation function
 * early stopping to avoid overfitting
 * alpha = 0.0001
 * learning_rate = adaptive
* I will use **Random** search in tuning.

**My thoughts and observations :** The accuracy would be between 0.73 and 0.75

##### PipeLine Tuning

In [None]:
# Build the pipeline
# A pipeline chains several processing steps so they can be cross-validated together while their parameters are tuned.
# Parameters of each step are set using the step name and the parameter name separated by "__".
# It takes a list of steps as a parameter; here these are the TF-IDF vectorization (with l2 normalization) and the classifier.
# It saves time by applying the same preprocessing to both train and test data without repeating the process.

pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer="char", norm="l2")), ("my_classifier",  MLPClassifier(
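        random_state=1,
        solver='lbfgs',
        hidden_layer_sizes=(12, 12, 12),
        activation="relu",
        # note: scikit-learn applies early_stopping only when solver is 'sgd' or 'adam'
        early_stopping=True,
        alpha=0.0001,
        # note: learning_rate is only used when solver='sgd'
        learning_rate='adaptive',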
    ))])

# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    # tfidf__ngram_range points to TfidfVectorizer -> ngram_range
    # ngram_range controls which character sequences become features (analyzer="char")
    # ngram_range=(a,b)-> a is the minimum and b is the maximum size of ngrams
    "tfidf__max_df": np.arange(0.3, 0.8),
    # max_df ignores terms that have a document frequency strictly higher than the given threshold
    # note: np.arange(0.3, 0.8) uses the default step of 1, so the only value tried is 0.3
    "tfidf__min_df": np.arange(5, 100),
    # min_df ignores terms that have a document frequency strictly lower than the given threshold
}

pipe_clf = RandomizedSearchCV(
    pipe, params, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# X is still passed in full; through cv=pds the Randomized search scores each candidate on the validation set
# fit the pipeline
pipe_clf.fit(X, y)

print('best score {}'.format(pipe_clf.best_score_))
print('best params {}'.format(pipe_clf.best_params_))

Fitting 1 folds for each of 10 candidates, totalling 10 fits
best score 0.8242078752316113
best params {'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 6, 'tfidf__max_df': 0.3}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


In [None]:
# run pipe with optimized parameters
best_params = pipe_clf.best_params_
pipe.set_params(**best_params).fit(X, y)
pipe_pred = pipe.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Best parameters:
* ngram_range = (1,3)
* min_df = 6
* max_df = 0.3

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = data_ts['id']

submission['label'] = pipe_clf.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt3/sample_submission_walkthrough.csv', index=False)

##### Result

Score in **Cross-Validation** (ROC AUC, single validation fold) = 0.8242 \
Score in **kaggle** = 0.76849 \
It is better than I expected.
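
One caveat about this configuration: in scikit-learn's `MLPClassifier`, `early_stopping` takes effect only with the `'sgd'` or `'adam'` solvers, and `learning_rate` is used only with `'sgd'`, so with `solver='lbfgs'` both settings are ignored. The sketch below shows a variant where early stopping is active, assuming a switch to `'adam'`; the synthetic data from `make_classification` and the values of `validation_fraction`, `n_iter_no_change`, and `max_iter` are illustrative assumptions, not part of the trial.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the TF-IDF features (assumption for illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=1)

clf = MLPClassifier(
    random_state=1,
    solver="adam",               # early_stopping is honoured by 'adam' and 'sgd'
    hidden_layer_sizes=(12, 12, 12),
    activation="relu",
    early_stopping=True,         # holds out part of the training data for validation
    validation_fraction=0.1,     # share of training rows used for that validation
    n_iter_no_change=10,         # stop after 10 epochs without validation improvement
    alpha=0.0001,
    max_iter=300,
)
clf.fit(X_toy, y_toy)

# One validation score is recorded per epoch when early stopping is active.
print(clf.n_iter_, len(clf.validation_scores_))
```

Whether this changes the score on the competition data is untested here; it only makes the stated intent of the trial (early stopping against overfitting) actually apply.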

### **Logistic Regression**

#### **Trial 0**

* I will use the **LogisticRegression** with:
 * C = [100,200,300]
 * max_iter = [100,200,300]
 * tol = [1e-4,1e-5,1e-3]
* I will use **Bayes** search in tuning.
* **Char**-level vectorizer

**My thoughts and observations :** The accuracy would be between 0.70 and 0.75

##### PipeLine Tuning

In [None]:
# Build the pipeline
# A pipeline chains several processing steps so they can be cross-validated together while their parameters are tuned.
# Parameters of each step are set using the step name and the parameter name separated by "__".
# It takes a list of steps as a parameter; here these are the TF-IDF vectorization (with l2 normalization) and the classifier.
# It saves time by applying the same preprocessing to both train and test data without repeating the process.

pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer="char", norm="l2")), ("my_classifier",  LogisticRegression())])

# define parameter space to test
params = {
    "tfidf__max_df": np.arange(0.3, 0.8),
    # max_df ignores terms that have a document frequency strictly higher than the given threshold
    # note: np.arange(0.3, 0.8) uses the default step of 1, so the only value tried is 0.3
    "tfidf__min_df": np.arange(5, 100),
    # min_df ignores terms that have a document frequency strictly lower than the given threshold
    'my_classifier__C': [100,200,300],
    # C is the inverse of the regularization strength (a smaller C means a stronger penalty)
    # my_classifier__C points to my_classifier->C
    'my_classifier__max_iter':[100 ,200, 300], 
    # max_iter is the maximum number of solver iterations
    # my_classifier__max_iter points to my_classifier-> max_iter
    'my_classifier__tol':[1e-4,1e-5,1e-3]
    # tol is the tolerance for the stopping criterion
    # my_classifier__tol points to my_classifier-> tol

}

pipe_clf = BayesSearchCV(
    pipe, params, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# X is still passed in full; through cv=pds the Bayes search scores each candidate on the validation set
# fit the pipeline
pipe_clf.fit(X, y)

print('best score {}'.format(pipe_clf.best_score_))
print('best params {}'.format(pipe_clf.best_params_))

Fitting 1 folds for each of 1 candidates, totalling 1 fits

In [None]:
# run pipe with optimized parameters
best_params = pipe_clf.best_params_
pipe.set_params(**best_params).fit(X, y)
pipe_pred = pipe.predict(X_test)

Best parameters:
* min_df = 28
* max_df = 0.3
* C = 100
* max_iter = 100
* tol = 1e-05

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = data_ts['id']

submission['label'] = pipe_clf.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt3/sample_submission_walkthrough.csv', index=False)

##### Result

Score in **Cross-Validation** (ROC AUC, single validation fold) = 0.5376 \
Score in **kaggle** = 0.46892 \
The previous trial is better than this trial, and this result is worse than I expected.

# **Questions**

**What is the difference between Character n-gram and Word n-gram? Which one tends to suffer more from the OOV issue?** \
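- A character n-gram divides the text into sequences of n consecutive characters.
- A word n-gram divides the text into sequences of n consecutive words.
- Word n-grams suffer more from the OOV issue, because a word that was not seen in training has no feature at the word level, while its character n-grams usually overlap with ones seen in training.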
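
A small sketch of the OOV difference, using `TfidfVectorizer` as in the trials. The two training sentences and the unseen word "faker" are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["fake news spreads fast", "real news matters"]
unseen_doc = ["faker"]  # hypothetical word that is absent from the training vocabulary

word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2)).fit(train_docs)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3)).fit(train_docs)

# Number of non-zero features the unseen word activates in each representation
word_nonzero = word_vec.transform(unseen_doc).nnz
char_nonzero = char_vec.transform(unseen_doc).nnz
print(word_nonzero, char_nonzero)
```

The word-level vector is all zeros, so the classifier receives no signal, while the character-level vector still shares n-grams such as "fa" and "ake" with the training text.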

**What is the difference between stop word removal and stemming? Are these techniques language-dependent?** \
- **Stop word removal** drops the most common words of a natural language, which usually add little to the document's meaning.
- **Stemming** reduces a word to its root form by slicing off the end or the beginning of the word, using a list of common prefixes and suffixes that could appear in that word.
- **Yes**, both techniques are language-dependent: stop word lists and affix rules differ from one language to another.
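
The two steps can be sketched as follows. The stop word list is scikit-learn's built-in English list; `naive_stem` is a hypothetical toy suffix stripper written only for this illustration and is not a real stemming algorithm such as Porter's.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def remove_stop_words(tokens):
    """Drop tokens that appear in scikit-learn's English stop word list."""
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]


def naive_stem(token):
    """Toy stemmer (illustration only): strip one common English suffix."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of the word after stripping
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


tokens = "the reporters were checking the claims".split()
content_tokens = remove_stop_words(tokens)
stems = [naive_stem(t) for t in content_tokens]
print(content_tokens)
print(stems)
```

Both pieces are specific to English: another language would need its own stop word list and its own affix rules.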

**Is tokenization techniques language dependent? Why?** \
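- Yes, because languages mark word boundaries differently: English separates words with spaces, while languages such as Chinese or Japanese do not, so a tokenizer needs language-specific rules. In addition, the same written word can exist in different languages with a different sound and meaning.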

**What is the difference between count vectorizer and tf-idf vectorizer? Would it be feasible to use all possible n-grams? If not, how should you select them?** \
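-  **CountVectorizer** uses only the frequency of words (the number of times each word appears) in the document, while **TF-IDF** also weights that frequency by how rare the word is across all documents. Words that appear almost everywhere are therefore down-weighted, so less important words can be removed. Reducing the input dimension in this way reduces model complexity.
- No, it is not feasible to use all possible n-grams. They are selected by trial and error and by hyperparameter search methods (for example over `ngram_range`, `min_df`, and `max_df`).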
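
A minimal sketch of both points on a made-up three-document corpus: raw counts treat "fake" and "news" the same in the first document, TF-IDF down-weights "news" because it occurs in every document, and `min_df` is one way to prune rare n-grams.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["fake news", "real news", "breaking news"]  # toy corpus for illustration

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer(norm="l2")
weights = tfidf_vec.fit_transform(docs).toarray()

# Raw counts of "fake" and "news" in the first document
count_fake = counts[0, count_vec.vocabulary_["fake"]]
count_news = counts[0, count_vec.vocabulary_["news"]]

# TF-IDF weights of the same two terms in the first document
tfidf_fake = weights[0, tfidf_vec.vocabulary_["fake"]]
tfidf_news = weights[0, tfidf_vec.vocabulary_["news"]]
print(count_fake, count_news, tfidf_fake, tfidf_news)

# Selecting n-grams: keep only unigrams and bigrams found in at least 2 documents
pruned_vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit(docs)
kept_terms = sorted(pruned_vec.vocabulary_)
print(kept_terms)
```

Here every bigram occurs in a single document, so `min_df=2` leaves only "news"; on real data the thresholds would be tuned, as in the trials above.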


