# **Questions**

**Q1) What is the difference between Character n-gram and Word n-gram? Which one tends to suffer more from the OOV issue?**

* **What is the difference between Character n-gram and Word n-gram?**
  > * **Character n-grams**
     * Character n-grams are obtained by treating the document as a sequence of characters and taking every run of n consecutive characters.
     * Character n-grams have proven to be effective features for authorship attribution.
     * Character n-grams include whitespace and punctuation.
     * Example: a character 4-gram model yields tokens such as [_It_], [It_i], [t_is], [_is_], [is_a], [s_a_], [_a_s], where "_" marks a space.
  * **Word n-grams**
     * Word n-grams are obtained by treating the document as a sequence of words and taking every run of n consecutive words.
     * Example: a word 1-gram (unigram) model yields tokens such as [hello], [like], [eat].



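* **Which one tends to suffer more from the OOV issue?**
> Word n-grams tend to suffer more from the OOV issue. An Out-Of-Vocabulary (OOV) word is a linguistic unit or token that does not appear in the training vocabulary. A new, rare or misspelled word is unseen as a whole, whereas most of its character n-grams have usually already been seen inside other words.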
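The two kinds of n-gram, and the OOV contrast between them, can be sketched in a few lines of plain Python. The helper names, the tiny training corpus and the unseen word `presidents` are made up for illustration. The text `" It is a s"` is an assumed source for the 4-gram example in the answer above.

```python
def char_ngrams(text, n):
    """Every run of n consecutive characters, spaces and punctuation included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Every run of n consecutive whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# The 4-gram example from the answer above, with spaces shown as "_"
char_4grams = [g.replace(" ", "_") for g in char_ngrams(" It is a s", 4)]
word_1grams = word_ngrams("hello like eat", 1)
word_2grams = word_ngrams("hello like eat", 2)

# OOV comparison on a tiny, made-up training corpus
train_docs = ["the president spoke", "reporters asked questions"]
new_word = "presidents"  # hypothetical word that never occurs in train_docs

word_vocab = {w for doc in train_docs for w in doc.split()}
char_vocab = {g for doc in train_docs for g in char_ngrams(doc, 3)}

word_is_oov = new_word not in word_vocab
new_grams = set(char_ngrams(new_word, 3))
char_coverage = len(new_grams & char_vocab) / len(new_grams)

print(char_4grams)
print(word_1grams, word_2grams)
print("word-level OOV:", word_is_oov)
print("share of character 3-grams already seen:", char_coverage)
```

At the word level the new token is entirely unknown, while at the character level only a small part of it is unseen.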

**Q2) What is the difference between stop word removal and stemming? Are these techniques language-dependent?**

* **What is the difference between stop word removal and stemming?**
> * **Stop words:**
    * Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is” and “are”.
    * Stop word removal is commonly used in Text Mining and Natural Language Processing (NLP) to eliminate words that are so common that they carry very little useful information. The whole token is dropped.
  * **Stemming:**
    * Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. The token is kept but reduced to a common stem. For example, stemming the English word "studies" gives "studi".



* **Are these techniques language-dependent?**
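> Yes, both stop word removal and stemming are language-dependent: every language has its own list of stop words and its own inflection rules, so each needs its own stemmer.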
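The difference between the two techniques can be illustrated with a toy example. The stop word set only contains the four examples listed above, and `toy_stem` is a hypothetical suffix stripper with four hand-written rules. It is not the Snowball stemmer used later in this notebook, only a sketch of the idea.

```python
# Only the examples from the answer above, not a full stop word list
STOP_WORDS = {"a", "the", "is", "are"}

# Toy rules (suffix, replacement), checked in order; real stemmers have many more
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]


def remove_stop_words(tokens):
    """Drop whole tokens that are stop words."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def toy_stem(word):
    """Keep the token but cut a known suffix off its end."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


tokens = "the studies are interesting".split()
kept = remove_stop_words(tokens)      # fewer tokens
stems = [toy_stem(t) for t in kept]   # same number of tokens, shorter forms

print(kept)
print(stems)
```

Both the word list and the suffix rules are specific to English, which is why the answer above calls these techniques language-dependent.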

**Q3) Are tokenization techniques language dependent? Why?**
> Yes, tokenization is heavily dependent on language. Each language has its own linguistic rules and exceptions. Languages such as English mark token boundaries with whitespace and punctuation, but other languages such as Chinese require a more complex segmenter to extract tokens from a stream of text that does not contain any whitespace.
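A minimal sketch of the point above: splitting on whitespace gives sensible tokens for an English sentence, but returns a Chinese sentence as a single unsegmented token. The two sample sentences are chosen for illustration only.

```python
english = "I like green tea"
chinese = "我喜欢绿茶"  # the same sentence, written without any spaces

english_tokens = english.split()
chinese_tokens = chinese.split()

print(english_tokens)  # one token per word
print(chinese_tokens)  # the whole sentence comes back as one token
```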

**Q4) What is the difference between count vectorizer and tf-idf vectorizer? Would it be feasible to use all possible n-grams? If not, how should you select them?**
> * **What is the difference between count vectorizer and tf-idf vectorizer?**
    * Building the count representation takes less time than building the tf-idf representation, because tf-idf also needs the document frequency of every term.
    * CountVectorizer: Counts how often each word occurs in each document of the corpus. With the max_features hyperparameter it keeps only the most frequent terms. The result is biased towards common words, so the model might lose out on some important but less frequent features. The values are integer counts (they become boolean only with binary=True). For example, SEO practitioners used to take advantage of raw counts by repeating keywords.
    * TfidfVectorizer: TF-IDF is a statistical measure that addresses this issue with CountVectorizer. It consists of 2 parts, TF (Term Frequency) multiplied by IDF (Inverse Document Frequency). The main intuition is that words that appear frequently in one document and rarely in the other documents provide extra insight into that document and can help the model learn from this additional information. In short, words that are common across the corpus are penalized. The values are weighted frequencies stored as floating point numbers.
* **Would it be feasible to use all possible n-grams? If not, how should you select them?**
>  * No. The number of distinct n-grams grows very quickly with n, so the feature space becomes huge and sparse, and this makes it very difficult to assign likelihoods that capture the target of our analysis.
  * The choice depends on the model, so I would try different values of n and keep the one that validates best. For example, with a chunk size of n=2 the results include “The reporters,” “the President,” “the United,” and “the room.” While not perfect, this model successfully identifies three of the relevant entities as candidates in a lightweight fashion.
  * On the other hand, a model based on the small n-gram window of 2 would fail to capture some of the nuance of the original text. For instance, if the sentence is from a text that references multiple heads of state, “the President” could be somewhat ambiguous. To capture the entirety of the phrase “the President of the United States,” we would need a window of n=6.
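The count versus tf-idf contrast described in this answer can be worked through by hand. This sketch uses the textbook formula, raw count times log(N / document frequency), on three made-up sentences. scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalises each row, so its numbers differ, but the ranking effect is the same.

```python
import math
from collections import Counter

# Made-up corpus for illustration
docs = [
    "the president spoke to the reporters",
    "the reporters left the room",
    "the president left",
]
tokenized = [d.split() for d in docs]

# What a count vectorizer stores: one integer count per (document, term)
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term occurs
n_docs = len(docs)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(term, doc_index):
    """Textbook variant: raw count times log(N / document frequency)."""
    return counts[doc_index][term] * math.log(n_docs / doc_freq[term])


for term in ("the", "president", "spoke"):
    print(term, counts[0][term], round(tf_idf(term, 0), 3))
```

"the" has the highest count in the first document but a tf-idf weight of zero, because it occurs in every document. "spoke" occurs once but only in that document, so it receives the largest weight.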

# **Problem Formulation**

## **Define the problem**

* Because of the rise of social networks and their involvement in other spheres such as politics, false information on the Internet has produced a slew of social issues.
* We are going to predict whether a specific Reddit post is fake news or not by looking only at its title.

## **What is the input?**

The input is the `text` feature (the post title). It contains words in many different forms.

## **What is the output?**

Whether a specific Reddit post is fake news or not.
In the dataset, the `label` column is the output.

## **What data mining function is required?**

In this case, it is binary classification, which separates data points into two classes (fake or not / 0 or 1): whether a specific Reddit post is fake news or not.

## **What could be the challenges?**

* The data contains words in many different forms.
* The dataset has outlier values (a small share of rows carry the label 2).
* The prediction has to be made from the title of the post alone.

## **What is the impact?**

Once the system is built and given a post title, it can decide whether that specific Reddit post is fake news or not.

## **What is an ideal solution?**

According to my successive attempts, a Random Forest Classifier tuned with Bayesian Search and Cross Validation is the best approach because it gave me the highest Kaggle score.

     In Kaggle 
        * Public score: 0.87476
        * Private score: 0.87632



Bayesian Search uses the results of previous trials to pick the next set of hyperparameters that is likely to improve the model performance.

# **Implementation**

## **Steps**

### **What preprocessing steps are used?**

* Remove outliers.
* Cleaning the text.
* Compute the frequency of the words.

### **What is the experimental protocol used and how was it carried out?**

* Read the data using the pandas function "read_csv".
* Clean the text: remove HTML tags, digits, single-letter tokens, stop words, punctuation and other noise, collapse all whitespace to a single space, and apply stemming.
* Split the training dataset into training data and validation data with the holdout method, using "train_test_split".
* Use cross-validation to evaluate the model more reliably during tuning.
* Determine the best hyperparameter values for a given model using GridSearch, RandomSearch and BayesianSearch.
* Fit XGBoost, Random Forest and Logistic Regression models.
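The difference between the holdout split and cross-validation in the protocol above can be sketched with plain index lists. The two helper functions are hypothetical and deliberately simplified. They do not shuffle or stratify, whereas the notebook's `train_test_split` call does both.

```python
def holdout_split(n_samples, train_size=0.8):
    """One fixed split: the first part trains, the rest validates."""
    cut = int(n_samples * train_size)
    indices = list(range(n_samples))
    return indices[:cut], indices[cut:]


def k_fold_splits(n_samples, k):
    """k splits: every sample is used for validation exactly once."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, val))
        start += size
    return splits


train_idx, val_idx = holdout_split(10)
folds = k_fold_splits(10, 3)

print("holdout validation rows:", val_idx)
for train, val in folds:
    print("fold validation rows:", val)
```

With the holdout method the score depends on one particular validation set. With k folds the score is averaged over k validation sets, at the cost of k times as many model fits.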


## **Important Libraries**
I will install a package and import several libraries.

In [None]:
!pip install scikit-optimize

Collecting scikit-optimize
  Downloading scikit_optimize-0.9.0-py2.py3-none-any.whl (100 kB)
Collecting pyaml>=16.9
  Downloading pyaml-21.10.1-py2.py3-none-any.whl (24 kB)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-21.10.1 scikit-optimize-0.9.0


In [None]:
import re
import pickle
import sklearn
import pandas as pd
import numpy as np
# import holoviews as hv
import nltk 
from bokeh.io import output_notebook
output_notebook()

from pathlib import Path

# some settings for pandas and numpy display

pd.options.display.max_columns = 100
pd.options.display.max_rows = 300
pd.options.display.max_colwidth = 100
np.set_printoptions(threshold=2000)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from bokeh.models import NumeralTickFormatter

from skopt import BayesSearchCV

# SelectKBest selects features according to the k highest scores; mutual_info_classif scores features by mutual information.
from sklearn.feature_selection import SelectKBest, mutual_info_classif


# Provides train/test indices to split data into train/test sets
from sklearn.model_selection import PredefinedSplit

from sklearn.naive_bayes import GaussianNB, MultinomialNB

# import warnings so that warning messages can be suppressed
import warnings
warnings.filterwarnings("ignore")

## **Read Data**
I will connect to Google Drive and load the train and test files from there.

I'll use the pandas `read_csv` function to read the data. It can read any delimited text file, and the delimiter can be changed with the `sep` option.

I'm going to read the training and testing datasets.

In [None]:
# Connect to my drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# reading the training dataset 
df_train = pd.read_csv('/content/drive/MyDrive/Data Mining/Compition 3/xy_train.csv', index_col='id') 
# df_train = pd.read_csv('xy_train.csv', index_col='id') 
# reading the testing dataset 
df_test = pd.read_csv('/content/drive/MyDrive/Data Mining/Compition 3/x_test.csv', index_col='id') 
# df_test = pd.read_csv('x_test.csv', index_col='id')

In [None]:
# show the training data
df_train

id,text,label
265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0
284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0
207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0
551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0
8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0
...,...,...
70046,"Finish Sniper Simo H盲yh盲 during the invasion of Finland by the USSR (1939, colorized)",0
189377,"Nigerian Prince Scam took $110K from Kansas man; 10 years later, he's getting it back",1
93486,Is It Safe To Smoke Marijuana During Pregnancy? You鈥檇 Be Surprised Of The Answer | no,0
140950,Julius Caesar upon realizing that everyone in the room has a knife except him (44 bc),0


In [None]:
# show the testing data
df_test

id,text
0,stargazer
1,yeah
2,PD: Phoenix car thief gets instructions from YouTube video
3,"As Trump Accuses Iran, He Has One Problem: His Own Credibility"
4,"""Believers"" - Hezbollah 2011"
...,...
59146,Bicycle taxi drivers of New Delhi
59147,Trump blows up GOP's formula for winning House races
59148,"Napoleon returns from his exile on the island of Elba. (March 1815), Colourised"
59149,Deep down he always wanted to be a ballet dancer


In [None]:
# check whether the training data contains null values
df_train.isna().sum()

text     0
label    0
dtype: int64

In [None]:
# check whether the testing data contains null values
df_test.isna().sum()

text    0
dtype: int64

## **Cleaning and preprocessing**

Create a function that removes HTML tags, digits and single-letter tokens, collapses all whitespace to a single space, lowercases the words, removes stop words and punctuation, and applies stemming. The noise in the training data is removed as well.

In [None]:
# required package for tokenization.
nltk.download('punkt') 
nltk.download('stopwords')

# for stemming algorithm
stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))


# Function to clean the text
def clean_text(text, for_embedding=False):
    """ steps:
        - remove any html tags (< /br> often found)
        - keep only ASCII + European letters and whitespace, no digits
        - remove single-letter tokens
        - convert all whitespace (tabs etc.) to a single space
        if not for embedding (but e.g. tf-idf):
        - all lowercase
        - remove stopwords and punctuation, then stem
        - return the clean text
    """
    re_wspace = re.compile(r"\s+", re.IGNORECASE)
    re_tags = re.compile(r"<[^>]+>")
    re_ASII = re.compile(r"[^A-Za-zÀ-ž ]", re.IGNORECASE)
    re_single_char = re.compile(r"\b[A-Za-zÀ-ž]\b", re.IGNORECASE)
    if for_embedding:
        # Keep punctuation
        re_ASII = re.compile(r"[^A-Za-zÀ-ž,.!? ]", re.IGNORECASE)
        re_single_char = re.compile(r"\b[A-Za-zÀ-ž,.!?]\b", re.IGNORECASE)

    text = re.sub(re_tags, " ", text)
    text = re.sub(re_ASII, " ", text)
    text = re.sub(re_single_char, " ", text)
    text = re.sub(re_wspace, " ", text)

    word_tokens = word_tokenize(text)
    words_tokens_lower = [word.lower() for word in word_tokens]

    if for_embedding:
        # no stemming, lowering and punctuation / stop words removal
        words_filtered = word_tokens
    else:
        words_filtered = [
            stemmer.stem(word) for word in words_tokens_lower if word not in stop_words
        ]

    text_clean = " ".join(words_filtered)
    return text_clean

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
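The regular-expression stage of `clean_text` can be tried on its own, without NLTK. The sketch below reuses the same four patterns, compiled once at module level, on a sample string modelled on one of the training titles. The `<br/>` tag is added for illustration, and `regex_clean` is a hypothetical helper rather than part of the notebook.

```python
import re

# Same patterns as in clean_text (non-embedding branch), compiled once
RE_TAGS = re.compile(r"<[^>]+>")
RE_NON_LETTERS = re.compile(r"[^A-Za-zÀ-ž ]")
RE_SINGLE_CHAR = re.compile(r"\b[A-Za-zÀ-ž]\b")
RE_WHITESPACE = re.compile(r"\s+")


def regex_clean(text):
    """Apply the four substitutions in the same order as clean_text."""
    for pattern in (RE_TAGS, RE_NON_LETTERS, RE_SINGLE_CHAR, RE_WHITESPACE):
        text = pattern.sub(" ", text)
    return text.strip()


cleaned = regex_clean("Sniper Simo H盲yh盲 <br/> (1939, colorized)")
print(cleaned)
```

The mis-encoded characters fall outside the allowed letter range, so they are replaced by spaces. That leaves a lone "H", which the single-letter rule then removes. Only the fragment "yh" of the surname survives.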


In [None]:
%%time
# Clean the training texts
df_train["text_clean"] = df_train.loc[df_train["text"].str.len() > 0, "text"] # keep only the training texts whose length is greater than 0
# apply clean_text to the text_clean feature of the training data
df_train["text_clean"] = df_train["text_clean"].map(
    lambda x: clean_text(x, for_embedding=False) if isinstance(x, str) else x  # clean only values that are strings
) 

CPU times: user 46.5 s, sys: 188 ms, total: 46.7 s
Wall time: 1min 1s


In [None]:
# show the training data
df_train

id,text,label,text_clean
265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0,group friend began volunt homeless shelter neighbor protest see anoth person also need natur lik...
284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0,british prime minist theresa may nerv attack former russian spi govern conclud high like russia ...
207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0,goodyear releas kit allow ps brought heel https youtub com watch alxulk cg zwillc fish midatlant...
551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0,happi birthday bob barker price right host like rememb man said ave pet spay neuter fuckincorpor...
8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0,obama nation innoc cop unarm young black men die magic johnson jimbobshawobodob olymp athlet sho...
...,...,...,...
70046,"Finish Sniper Simo H盲yh盲 during the invasion of Finland by the USSR (1939, colorized)",0,finish sniper simo yh invas finland ussr color
189377,"Nigerian Prince Scam took $110K from Kansas man; 10 years later, he's getting it back",1,nigerian princ scam took kansa man year later get back
93486,Is It Safe To Smoke Marijuana During Pregnancy? You鈥檇 Be Surprised Of The Answer | no,0,safe smoke marijuana pregnanc surpris answer
140950,Julius Caesar upon realizing that everyone in the room has a knife except him (44 bc),0,julius caesar upon realiz everyon room knife except bc


In [None]:
%%time
# Clean the testing texts
df_test["text_clean"] = df_test.loc[df_test["text"].str.len() > 0, "text"] # keep only the testing texts whose length is greater than 0
# apply clean_text to the text_clean feature of the testing data
df_test["text_clean"] = df_test["text_clean"].map(
    lambda x: clean_text(x, for_embedding=False) if isinstance(x, str) else x  # clean only values that are strings
)

CPU times: user 14 s, sys: 78.8 ms, total: 14.1 s
Wall time: 14.1 s


In [None]:
# show the testing data
df_test

id,text,text_clean
0,stargazer,stargaz
1,yeah,yeah
2,PD: Phoenix car thief gets instructions from YouTube video,pd phoenix car thief get instruct youtub video
3,"As Trump Accuses Iran, He Has One Problem: His Own Credibility",trump accus iran one problem credibl
4,"""Believers"" - Hezbollah 2011",believ hezbollah
...,...,...
59146,Bicycle taxi drivers of New Delhi,bicycl taxi driver new delhi
59147,Trump blows up GOP's formula for winning House races,trump blow gop formula win hous race
59148,"Napoleon returns from his exile on the island of Elba. (March 1815), Colourised",napoleon return exil island elba march colouris
59149,Deep down he always wanted to be a ballet dancer,deep alway want ballet dancer


In [None]:
# Distribution of the labels
df_train["label"].value_counts(normalize=True)

0    0.536200
1    0.459933
2    0.003867
Name: label, dtype: float64

In [None]:
# drop the rows whose label is 2 (outliers: the task is binary, 0 or 1)
df_train = df_train[df_train["label"] != 2]

In [None]:
# show the training data
df_train

id,text,label,text_clean
265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0,group friend began volunt homeless shelter neighbor protest see anoth person also need natur lik...
284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0,british prime minist theresa may nerv attack former russian spi govern conclud high like russia ...
207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0,goodyear releas kit allow ps brought heel https youtub com watch alxulk cg zwillc fish midatlant...
551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0,happi birthday bob barker price right host like rememb man said ave pet spay neuter fuckincorpor...
8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0,obama nation innoc cop unarm young black men die magic johnson jimbobshawobodob olymp athlet sho...
...,...,...,...
70046,"Finish Sniper Simo H盲yh盲 during the invasion of Finland by the USSR (1939, colorized)",0,finish sniper simo yh invas finland ussr color
189377,"Nigerian Prince Scam took $110K from Kansas man; 10 years later, he's getting it back",1,nigerian princ scam took kansa man year later get back
93486,Is It Safe To Smoke Marijuana During Pregnancy? You鈥檇 Be Surprised Of The Answer | no,0,safe smoke marijuana pregnanc surpris answer
140950,Julius Caesar upon realizing that everyone in the room has a knife except him (44 bc),0,julius caesar upon realiz everyon room knife except bc


In [None]:
# Distribution of the labels after removing label 2
df_train["label"].value_counts(normalize=True)

0    0.538281
1    0.461719
Name: label, dtype: float64

In [None]:
# use the text_clean feature as X and the label as Y from the training data
X = df_train["text_clean"]
Y = df_train["label"]

## **Functions**

I'll define helper functions because I'll be using them a lot and don't want to repeat the code: creating a pipeline with a given classifier and fitting it, predicting the testing set, and saving the predicted probabilities to a CSV file.

In [None]:
# Function that builds a pipeline from the vectorizer and my_classifier
# combine the vectorizer with the model as a full tunable pipeline
# each step is given a name so that its hyperparameters can be set later

def create_fit_pipeline(my_classifier):
  full_pipeline = Pipeline(
      steps=[
          ("vectorizer", TfidfVectorizer(norm="l2")), 
          ('my_classifier', my_classifier)
      ]
  )
  # The pipeline object can be used like any sk-learn model and training it 
  full_pipeline = full_pipeline.fit(X, Y)
  return full_pipeline

In [None]:
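# Function to predict the testing set with a fitted pipeline
def predict_pipeline(full_pipeline):
  # predict on the cleaned test text; passing df_test itself would make the
  # vectorizer iterate over the DataFrame's column names instead of its rows
  y_pred = full_pipeline.predict(df_test["text_clean"])
  # show the unique predicted values and their counts
  return pd.DataFrame(y_pred).value_counts()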

In [None]:
# Function to predict probabilities for the testing data and save them to a csv file
def predict_save_csv(search_model, classifier_name):
  submission = pd.DataFrame()
  submission['id'] = df_test.index
  submission['label'] = search_model.predict_proba(df_test.text_clean)[:,1]
  file_name = 'Compition_3_' + classifier_name + '.csv'
  submission.to_csv(file_name, index=False)

In [None]:
# Further split the original training set to a train and a validation set
X_train2, X_val, y_train2, y_val = train_test_split(
    X, Y, train_size = 0.8, stratify = Y, random_state = 42)

# Create a list where train data indices are -1 and validation data indices are 0
# (rows of X that belong to X_train2, the new training set, get -1)
split_index = [-1 if x in X_train2.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
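To see what such a `PredefinedSplit` hands to a search object, here is a minimal sketch on five hypothetical rows. The variable names `test_fold`, `pds_demo` and `demo_splits` are made up for the illustration. Rows marked -1 always stay in the training part, and rows marked 0 form the single validation fold.

```python
from sklearn.model_selection import PredefinedSplit

# -1 = always in training, 0 = member of validation fold number 0
test_fold = [-1, -1, 0, -1, 0]
pds_demo = PredefinedSplit(test_fold=test_fold)

# Materialise the (train indices, validation indices) pairs
demo_splits = [(train.tolist(), val.tolist()) for train, val in pds_demo.split()]

print(pds_demo.get_n_splits())
print(demo_splits)
```

Passing such an object as `cv` makes a search evaluate every candidate on one fixed validation set instead of on k folds.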

### **Tuning Methods**

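Grid Search and Random Search are available in the scikit-learn module model_selection, and Bayesian Search (`BayesSearchCV`) comes from the scikit-optimize package and follows the same interface. Each one is used by creating an object.


**The main parameters of the Tuning Methods *(Grid Search, Random Search, Bayesian Search)* are:**
* **estimator:** *(object)* a scikit-learn model.
* **param_grid:** *(dict or list of dictionaries)* The hyperparameter names and the values to try. This enables searching over any sequence of parameter settings. `RandomizedSearchCV` calls this argument `param_distributions` and `BayesSearchCV` calls it `search_spaces`.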
* **scoring:** *(str, callable, list, tuple or dict)* Strategy to evaluate the performance of the cross-validated model on the test set.
* **n_jobs:** *(int)* Number of jobs to run in parallel. 
  * `None` means 1.
  * `-1` means using all processors.
* **refit:** *(bool, str, or callable)* Refit an estimator using the best found parameters on the whole dataset.
* **cv:** *(int, cross-validation generator or an iterable)* determines the cross-validation splitting strategy. Possible inputs for cv are:

  * None, to use the default 5-fold cross validation.
  * integer, to specify the number of folds in a (Stratified)KFold.
  * CV splitter.
  * An iterable yielding (train, test) splits as arrays of indices.
* **verbose:** *(int)* Controls the verbosity (how many messages are shown)
  * `>1`: the computation time for each fold and parameter candidate is displayed.
  * `>2` : the score is also displayed.
  * `>3` : the fold and candidate parameter indexes are also displayed together with the starting time of the computation.
* **error_score:** *(‘raise’ or numeric)* Value to assign to the score if an error occurs in estimator fitting.

#### **Grid Search**

Grid search is the process of performing hyperparameter tuning in order to determine the optimal values for a given model. The performance of a model significantly depends on the value of hyperparameters. 

Grid Search evaluates every combination of the specified hyperparameter values, calculates the performance for each combination and selects the best one. This makes it time-consuming and expensive, because the number of combinations grows multiplicatively with the number of hyperparameters involved.
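The cost can be made concrete by counting. The dictionary below records how many candidate values each hyperparameter has in the XGBoost search space defined later in this notebook (`param_XGB`). The 3-fold setting is the one used in the Bayesian Search trial, and applying an exhaustive grid to this space is a hypothetical scenario.

```python
from math import prod

# Number of candidate values per hyperparameter in param_XGB
values_per_param = {
    "vectorizer__analyzer": 1,
    "vectorizer__max_df": 6,
    "vectorizer__min_df": 6,
    "my_classifier__learning_rate": 9,
    "my_classifier__n_estimators": 7,
    "my_classifier__nthread": 9,
    "my_classifier__subsample": 6,
    "my_classifier__colsample_bytree": 6,
}

# A full grid tries every combination, so the sizes multiply
n_combinations = prod(values_per_param.values())
n_fits = n_combinations * 3  # one fit per fold with 3-fold cross-validation

print(n_combinations, "combinations,", n_fits, "model fits")
```

An exhaustive grid over this space would need hundreds of thousands of candidates, which is why the trials in this notebook limit Random Search and Bayesian Search to a handful of iterations.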

I will create a function that builds a Grid Search object, fits it and prints the best score.

In [None]:
# Function to create and fit a Grid Search over the pipeline

def create_fit_grid_search(full_pipeline, param_grid, cv):
  # cv is the number of folds for K-fold cross-validation, or a predefined validation split
  # n_jobs is the number of jobs run concurrently (Colab only has two cpu cores, so we set it to 2)

  grid_search = GridSearchCV(
      full_pipeline, param_grid, cv=cv, verbose=1, n_jobs=2, 
      scoring='roc_auc') # create the GridSearchCV object

  grid_search.fit(X, Y) # train the grid search

  print('best score {}'.format(grid_search.best_score_)) # print the best score of the model
  print('best params {}'.format(grid_search.best_params_)) # print the best hyperparameters of the model
  return grid_search

#### **Random Search**

Random search is a stochastic approach: instead of trying every combination, it samples a fixed number of points (`n_iter`) from the problem's feasible region, according to a predetermined probability distribution or sequence of probability distributions.


I will create a function that builds a Random Search object, fits it and prints the best score.
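The sampling idea can be sketched with the standard library alone. The small search space and the `sample_candidates` helper are hypothetical. Note one simplification: this sketch draws with replacement, whereas `RandomizedSearchCV` samples without replacement when every parameter is given as a list.

```python
import random

# Hypothetical search space for illustration
search_space = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_estimators": [600, 1000, 1500],
    "subsample": [0.6, 0.8, 0.9],
}


def sample_candidates(space, n_iter, seed=42):
    """Draw n_iter random configurations, one value per hyperparameter."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


candidates = sample_candidates(search_space, n_iter=5)
for candidate in candidates:
    print(candidate)
```

Only 5 of the 36 possible combinations are evaluated, so the cost is fixed by `n_iter` rather than by the size of the search space.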

In [None]:
# Function to create and fit a Random Search over the pipeline

def create_fit_random_search(full_pipeline, param_random, cv):
  # cv is the number of folds for K-fold cross-validation, or a predefined validation split
  # n_jobs is the number of jobs run concurrently
  # (Colab only has two cpu cores, so we set it to 2)
  random_search = RandomizedSearchCV(
      full_pipeline, param_random, cv=cv, verbose=1, n_jobs=2, 
      # number of random trials
      n_iter=5,
      scoring='roc_auc')

  random_search.fit(X, Y)

  print('best score {}'.format(random_search.best_score_)) # print the best score of the model
  print('best params {}'.format(random_search.best_params_)) # print the best hyperparameters of the model
  return random_search

#### **Bayesian Search**

Bayesian Search keeps track of previous evaluation results and uses them to build a probabilistic model that maps hyperparameters to the probability of a score on the objective function.

This model is called a **surrogate** for the objective function. The surrogate is much cheaper to optimize than the objective function itself, so Bayesian methods choose the next set of hyperparameters to evaluate on the actual objective function by selecting the ones that perform best on the surrogate.

In other words, the method uses what it has already learned to pick the next set of hyperparameters that is likely to improve the model performance. We repeat this process iteratively until we converge to an optimum or the trial budget is used up.



I will create a function that builds a Bayesian Search object, fits it and prints the best score.

In [None]:
# Function to create and fit a Bayesian Search over the pipeline

def create_fit_bayesian_search(full_pipeline, param_bayesian, cv):
  # cv is the number of folds for K-fold cross-validation, or a predefined validation split
  # n_jobs is the number of jobs run concurrently
  # (Colab only has two cpu cores, so we set it to 2)
  # no `scoring` is passed here, so the pipeline's default score (accuracy) is used,
  # unlike the grid and random searches above, which use roc_auc
  Bayes_search = BayesSearchCV(
      full_pipeline, param_bayesian, cv=cv, verbose=1, n_jobs=2, 
      # number of Bayes trials
      n_iter=5)

  Bayes_search.fit(X, Y)

  print('best score {}'.format(Bayes_search.best_score_)) # print the best score of the model
  print('best score {}'.format(Bayes_search.best_params_)) # print the best hyperparameters of the model
  return Bayes_search

## **Different trials on model tuning**

### **1- XGBoost**

XGBoost is an implementation of gradient boosting, an algorithm that goes by lots of different names such as multiple additive regression trees, stochastic gradient boosting or gradient boosting machines.

Gradient Boosting Decision Trees (GBDT) is a decision tree ensemble learning algorithm similar to random forest. **Ensemble learning algorithms** combine multiple machine learning models to obtain a better model. **Random forest** builds full decision trees in parallel from random bootstrap samples of the data set, whereas gradient boosting builds its trees one after another, each one correcting the errors of the trees before it.

XGBoost is a widely used machine learning library for regression, classification, and ranking tasks.

* It includes parallel tree boosting.
* It supports regularization.
* It is designed to handle missing data with its built-in features.
* The user can run a cross-validation after each iteration. 
* It works well on small to medium datasets.
* It is designed to be highly efficient, flexible and portable.
* It has a distributed weighted quantile sketch algorithm to effectively handle weighted data.

The XGBoost classifier has a lot of hyperparameters. I'll use some of them to help improve the model and the score.

**The hyperparameters are:**
* **learning_rate:** Shrinks each tree's contribution by the learning rate. There is a trade-off between learning_rate and n_estimators: a smaller learning rate generally needs more trees.
* **n_estimators:** The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.
* **subsample:** The fraction of samples that will be used to fit each base learner. Stochastic Gradient Boosting occurs when the value is less than 1.0. The parameter n_estimators interacts with subsample. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
* **colsample_bytree:** Subsample ratio of columns when constructing each tree.
* **nthread:** Number of parallel threads XGBoost uses. It mainly affects training speed rather than model quality, and newer XGBoost versions name it `n_jobs`.
* **objective:** Specify the learning task and the corresponding learning objective or a custom objective function to be used.
* **silent:** Whether to suppress the messages printed during training. Newer XGBoost versions replace it with `verbosity`.
* **random_state:** Random number seed. Pass an int for reproducible output across multiple function calls.
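The trade-off between learning_rate and n_estimators can be illustrated with an idealised toy model, which is not XGBoost itself. Assume every boosting stage fits the current residual perfectly and its prediction is then shrunk by the learning rate. Under that assumption the remaining error after m stages is (1 - learning_rate)^m of the initial error. The function name and the 1% tolerance are made up for the illustration.

```python
def stages_needed(learning_rate, tolerance=0.01):
    """Count stages until the remaining error falls to `tolerance` of its start.

    Idealised assumption: each stage removes a `learning_rate` share of
    whatever error is still left.
    """
    residual, stages = 1.0, 0
    while residual > tolerance:
        residual *= 1 - learning_rate
        stages += 1
    return stages


for lr in (0.3, 0.1, 0.05):
    print("learning_rate", lr, "->", stages_needed(lr), "stages")
```

Lowering the learning rate from 0.3 to 0.1 more than triples the number of stages needed in this toy setting, which is why small learning rates are usually paired with large n_estimators values.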


In [None]:
# create and fit the pipeline with an XGBoost classifier
full_pipeline_XGB = create_fit_pipeline(XGBClassifier(objective='binary:logistic', silent=True, random_state= 42))
# predict the testing set with the pipeline
predict_pipeline(full_pipeline_XGB)

0    2
dtype: int64

#### **1- Bayesian Search with Cross Validation**

**Using Bayesian Search and XGBoost Classifier with Cross Validation**

**Expectations:**

I'll use Cross Validation with Bayesian Search and XGBoost. Because Bayesian Search is designed to find the extrema of objective functions that are expensive to evaluate, I expect it to give me the best score.

I'm going to specify some hyperparameters for the preprocessor, select features, and XGBoost classifier.

**Observations:**

The best hyperparameters found for this model were:
* **objective:** binary:logistic
* **silent:** True
* **random_state:** 42
* **analyzer:** word
* **max_df:** 0.3
* **min_df:** 30

* **learning_rate:** 0.1
* **n_estimators:** 1500
* **subsample:** 0.8
* **colsample_bytree:** 1.0
* **nthread:** 9
* **CV:** 3

\


Scores:

     In colab ==> Score: 0.754952

     In Kaggle 
        * Public score: 0.80636
        * Private score: 0.80525




\

**plan:**

I will use Grid Search and Random Forest Classifier with Validation Set.


In [None]:
# hyperparameters for the XGBoost Classifier
# here we specify the search space 
# `__` denotes an attribute of the preceding name
# (e.g. my_classifier__n_estimators means the `n_estimators` param for `my_classifier`)
param_XGB = {
    'vectorizer__analyzer': ["word"], 
    'vectorizer__max_df': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    # 'vectorizer__ngram_range' : [(1, 2)],
    'vectorizer__min_df': [5, 10, 15, 20, 25, 30],
    

    'my_classifier__learning_rate' : [0.005, 0.001, 0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3],
    'my_classifier__n_estimators' : [600,1000, 1100, 1500, 2000, 3000, 4000],
    'my_classifier__nthread' : [1, 2, 3, 4, 5, 6, 7, 8, 9],
#     'my_classifier__min_child_weight': [1, 5, 10],
#     'my_classifier__gamma': [0.4, 0.5, 0.6, 1, 1.5, 2, 2.5, 3, 5],
    'my_classifier__subsample': [0.05, 0.2, 0.3, 0.6, 0.8, 0.9],
    'my_classifier__colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
#     'my_classifier__max_depth': np.arange(3, 20),
#     'my_classifier__random_state' : [0, 1, 42, 15]
}

In [None]:
# create_fit_bayesian_search returns the fitted Bayesian search for XGBoost, using (X) and (Y) with 3-fold cross validation
bayesian_search_XGB = create_fit_bayesian_search(full_pipeline_XGB, param_XGB, 3)
print("Best: %f using %s" % (bayesian_search_XGB.best_score_, bayesian_search_XGB.best_params_))

Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
best score 0.7549524903552775
best score OrderedDict([('my_classifier__colsample_bytree', 1.0), ('my_classifier__learning_rate', 0.1), ('my_classifier__n_estimators', 1500), ('my_classifier__nthread', 9), ('my_classifier__subsample', 0.8), ('vectorizer__analyzer', 'word'), ('vectorizer__max_df', 0.3), ('vectorizer__min_df', 30)])
Best: 0.754952 using OrderedDict([('my_classifier__colsample_bytree', 1.0), ('my_classifier__learning_rate', 0.1), ('my_classifier__n_estimators', 1500), ('my_classifier__nthread', 9), ('my_classifier__subsample', 0.8), ('vectorizer__analyzer', 'word'), ('vectorizer__max_df', 0.3), ('vectorizer__min_df', 30)])


In [None]:
# using the predict_save_csv function and it will predict the testing data and save it in the csv file
predict_save_csv(bayesian_search_XGB, 'XGB_Bayesian_Cross')

### **3* Random Forest**
Random forests are a type of ensemble method. An ensemble method is a process in which numerous models are fitted and the results are combined for stronger predictions. While this provides great predictions, inference and explainability are often limited. Random forests are composed of a number of decision trees where the included predictors are chosen at random. The name comes from randomly building trees to make a forest.
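
As a minimal sketch of "the results are combined", the snippet below takes a majority vote over the labels predicted by several trees for one sample. `majority_vote` and the example predictions are made up for illustration; scikit-learn's forest combines trees by averaging their class probabilities rather than counting hard votes, so treat this as the idea only.

```python
from collections import Counter


def majority_vote(predictions_per_tree):
    """Combine per-tree class predictions for one sample into a single label."""
    label, _count = Counter(predictions_per_tree).most_common(1)[0]
    return label


# Hypothetical labels predicted by five trees for the same sample.
tree_predictions = [1, 0, 1, 1, 0]
print(majority_vote(tree_predictions))
```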

We can create a random forest just like we created a decision tree, except now we are also specifying parameters that indicate how many trees should be in the forest, how we should subset the data items (the rows), and how we should subset the fields (the columns).

In the following function definition, `n_estimators` defines the number of trees we want, `max_samples` defines how many rows to sample for training each tree, and `max_features` defines how many columns to sample at each split point (where 0.5 means “take half the total number of columns”). We can also specify when to stop splitting the tree nodes, effectively limiting the depth of the tree, by including the same `min_samples_leaf` parameter we used in the preceding section. Finally, we pass `n_jobs=-1` to tell sklearn to use all our CPUs to build the trees in parallel.
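
The row and column subsampling described above can be sketched with the standard library alone. `draw_tree_sample` is a hypothetical helper, and the sizes are invented; it only illustrates what `max_samples` and a fractional `max_features` control, not how sklearn implements them internally.

```python
import random


def draw_tree_sample(n_rows, n_cols, max_samples, max_features, rng):
    """Pick the rows one tree trains on and the columns one split may consider."""
    # Rows are drawn with replacement (a bootstrap sample of size max_samples).
    rows = [rng.randrange(n_rows) for _ in range(max_samples)]
    # A fractional max_features keeps that share of the columns.
    # In a real forest this column draw is repeated at every split point.
    n_keep = max(1, int(max_features * n_cols))
    cols = sorted(rng.sample(range(n_cols), n_keep))
    return rows, cols


rng = random.Random(42)
rows, cols = draw_tree_sample(
    n_rows=1000, n_cols=20, max_samples=200, max_features=0.5, rng=rng
)
print(len(rows), len(cols))
```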

In [None]:
# create the pipeline with a RandomForestClassifier as the final estimator
full_pipeline_Randomforst = create_fit_pipeline(RandomForestClassifier())

#### **1- Grid Search with Validation Set**

**using Grid Search and Random Forest Classifier with Validation Set**

**Expectations:**

I will use Grid Search and Random Forest with a Validation Set. I expect it to give me the highest score, because Grid Search tries every combination of the listed values to find the best one and fits the estimator (model) on the training set.

I'm going to specify some hyperparameters for the preprocessor, select features, and Random Forest classifier.
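
A rough sketch of what "trying every combination" means: the snippet enumerates the Cartesian product of the values in the Random Forest search space from the code cell below. The values are copied by hand from `param_RandomForest`; note that `np.arange(0.3, 0.8)` uses a step of 1 and therefore contributes the single value 0.3. `enumerate_grid` is a hypothetical helper, not the notebook's `create_fit_grid_search`.

```python
from itertools import product

# Values copied from param_RandomForest (np.arange(0.3, 0.8) yields only 0.3).
search_space = {
    "vectorizer__analyzer": ["word"],
    "vectorizer__max_df": [0.3],
    "vectorizer__min_df": list(range(5, 30, 5)),
    "vectorizer__ngram_range": [(1, 2)],
    "my_classifier__n_estimators": list(range(500, 1000, 100)),
    "my_classifier__criterion": ["gini", "entropy"],
}


def enumerate_grid(space):
    """Yield every combination of the listed values as a dict."""
    names = list(space)
    for combo in product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


candidates = list(enumerate_grid(search_space))
print(len(candidates))
print(candidates[0])
```

The count of 50 agrees with the "50 candidates" line in the logged grid search output.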

\

**observations:**

The best hyperparameters for this model were:

* **analyzer:** word
* **max_df:** 0.3
* **min_df:** 5
* **ngram_range:** (1, 2)


* **n_estimators:** 500
* **criterion:** gini
* **max_features:** 0.8

\


Scores:

    

     In Kaggle 
        * Public score: 0.86520
        * Private score: 0.86613




\

**plan:**

I will use Random Search and Random Forest Classifier with Validation Set.


In [None]:
# hyperparameter for RandomForestClassifier
# here we specify the search space 
# `__` denotes an attribute of the preceding name
# (e.g. my_classifier__n_estimators means the `n_estimators` param for `my_classifier`)
param_RandomForest = {
    'vectorizer__analyzer': ["word"], 
    'vectorizer__max_df': np.arange(0.3, 0.8),  # note: the default step is 1, so this yields only [0.3]
    'vectorizer__min_df': range(5, 30, 5),
    'vectorizer__ngram_range': [(1, 2)], 



    'my_classifier__n_estimators': range(500, 1000, 100),
    'my_classifier__criterion' :['gini', 'entropy'],
#     'my_classifier__max_features' : ['auto', 'sqrt', 'log2'],
     # my_classifier__n_estimators points to my_classifier->n_estimators 
    # 'my_classifier__max_depth': [100, 200, 400, 600, 2000]       
}

In [None]:
# create_fit_grid_search returns the fitted grid search for RandomForestClassifier, using (X) and (Y)
grid_search_RandomForest = create_fit_grid_search(full_pipeline_Randomforst, param_RandomForest, pds)

Fitting 1 folds for each of 50 candidates, totalling 50 fits
best score nan
best score {'my_classifier__criterion': 'gini', 'my_classifier__n_estimators': 500, 'vectorizer__analyzer': 'word', 'vectorizer__max_df': 0.3, 'vectorizer__min_df': 5, 'vectorizer__ngram_range': (1, 2)}


In [None]:
# using the predict_save_csv function and it will predict the testing data and save it in the csv file
predict_save_csv(grid_search_RandomForest, 'RF_Grid_Validation')

#### **2- Random Search with Validation Set**
**using Random Search and Random Forest Classifier with Validation Set**

**Expectations:**

Random Search and Random Forest with Validation Set will be used. Because Random search works best for lower dimensional data and fits the estimator (model) on your training set, I expect it to give me the greatest score.

I'm going to specify some hyperparameters for the preprocessor, select features, and Random Forest classifier.
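
For contrast with an exhaustive grid, here is a small sketch of random search: it draws a fixed number of candidates instead of enumerating all of them. `sample_candidates` is a hypothetical helper and the space is a shortened copy of the Random Forest one. It draws each value independently, so duplicates are possible; scikit-learn's own sampler avoids repeats when every parameter is given as a list.

```python
import random

search_space = {
    "vectorizer__min_df": [5, 10, 15, 20, 25],
    "my_classifier__n_estimators": [500, 600, 700, 800, 900],
    "my_classifier__criterion": ["gini", "entropy"],
}


def sample_candidates(space, n_iter, seed):
    """Draw n_iter random candidates, one value per hyperparameter."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


sampled = sample_candidates(search_space, n_iter=5, seed=0)
for candidate in sampled:
    print(candidate)
```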

\

**observations:**

The best hyperparameters for this model were:

* **analyzer:** word
* **max_df:** 0.3
* **min_df:** 25
* **ngram_range:** (1, 2)


* **n_estimators:** 600
* **criterion:** entropy
* **max_features:** log2

\
Scores:

     In colab ==> Score: 0.86488

     In Kaggle 
        * Public score:  0.86781
        * Private score: 0.86734


\

**plan:**

I will use Bayesian Search and Random Forest Classifier with Cross Validation.

In [None]:
# create_fit_random_search returns the fitted random search for Random Forest, using (X) and (Y)
random_search_RF = create_fit_random_search(full_pipeline_Randomforst, param_RandomForest, pds)

Fitting 1 folds for each of 5 candidates, totalling 5 fits
best score 0.8648842389918672
best score {'vectorizer__ngram_range': (1, 2), 'vectorizer__min_df': 25, 'vectorizer__max_df': 0.3, 'vectorizer__analyzer': 'word', 'my_classifier__n_estimators': 600, 'my_classifier__max_features': 'log2', 'my_classifier__criterion': 'entropy'}


In [None]:
# using the predict_save_csv function and it will predict the testing data and save it in the csv file
predict_save_csv(random_search_RF, 'RF_random_Validation')

#### **3- Bayesian Search with Cross Validation**

**Expectations:**

I'll use Bayesian Search with Cross Validation and Random Forest. Bayesian Search is designed to find the extrema of objective functions that are expensive to evaluate, and it fits the estimator (model) on the training set, so I expect it to give me the highest score.

I'm going to specify some hyperparameters for the preprocessor, select features, and Random Forest classifier.

\

**observations:**

The best hyperparameters for this model were:

* **analyzer:** word
* **max_df:** 0.8
* **min_df:** 10


* **n_estimators:** 600
* **criterion:** gini
* **max_features:** log2
* **CV:** 20

\


Scores:

     In colab ==> Score: 0.78195

     In Kaggle 
        * Public score: 0.87476
        * Private score: 0.87632




\

**plan:**

I will use Random Search and Logistic Regression Classifier with Cross Validation.


In [None]:
# hyperparameter for RandomForestClassifier
# here we specify the search space 
# `__` denotes an attribute of the preceding name
# (e.g. my_classifier__n_estimators means the `n_estimators` param for `my_classifier`)
param_RandomForest = {
    'vectorizer__analyzer': ["word"], 
    'vectorizer__max_df': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    'vectorizer__min_df': [5, 10, 15, 20, 25, 30],
    # 'vectorizer__ngram_range': [(1, 2)], 



    'my_classifier__n_estimators': [500, 600, 700, 800, 900, 1000],
    'my_classifier__criterion' :['gini', 'entropy'],
    'my_classifier__max_features' : ['auto', 'sqrt', 'log2'],  # note: 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
     # my_classifier__n_estimators points to my_classifier->n_estimators 
    # 'my_classifier__max_depth': [100, 200, 400, 600, 2000]       
}

In [None]:
# create_fit_bayesian_search returns the fitted Bayesian search for Random Forest, using (x_train) and (y_train) with 20-fold cross validation
bayesian_search_RF = create_fit_bayesian_search(full_pipeline_Randomforst, param_RandomForest, 20)

Fitting 20 folds for each of 1 candidates, totalling 20 fits
Fitting 20 folds for each of 1 candidates, totalling 20 fits
Fitting 20 folds for each of 1 candidates, totalling 20 fits
Fitting 20 folds for each of 1 candidates, totalling 20 fits
Fitting 20 folds for each of 1 candidates, totalling 20 fits
best score 0.781956480992555
best score OrderedDict([('my_classifier__criterion', 'gini'), ('my_classifier__max_features', 'log2'), ('my_classifier__n_estimators', 600), ('vectorizer__analyzer', 'word'), ('vectorizer__max_df', 0.8), ('vectorizer__min_df', 10)])


In [None]:
# using the predict_save_csv function and it will predict the testing data and save it in the csv file
predict_save_csv(bayesian_search_RF, 'RF_bayesian_Cross')

### **4* Logistic Regression**

Logistic regression is used to handle classification problems.

It is used in statistical software to understand the relationship between the dependent variable and one or more independent variables by estimating probabilities using a logistic regression equation.  

It is often used for predictive analytics and modeling, and extends to applications in machine learning. Logistic regression is easy to implement and interpret, and efficient to train.
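
The "logistic regression equation" mentioned above passes a weighted sum of the features through the logistic (sigmoid) function. The sketch below shows that calculation for the binary case; the weights, bias and feature values are hypothetical numbers chosen for illustration, not fitted parameters from this notebook.

```python
import math


def predict_probability(features, weights, bias):
    """Return P(y = 1 | features) under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for a two-feature model.
p = predict_probability(features=[2.0, 1.0], weights=[0.8, -0.4], bias=-1.2)
print(round(p, 3))
```

A sample is assigned to the positive class when this probability exceeds a threshold, commonly 0.5.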

\

**There are three main types of logistic regression:**
 * **Binary logistic regression** deals with two possible values, essentially yes or no. 
 * **Multinomial logistic regression** deals with three or more classes that have no natural order.
 * **Ordinal logistic regression** deals with three or more classes in a predetermined order. 
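
For the multinomial case, a common formulation replaces the single sigmoid with a softmax over one score per class. The snippet is a minimal sketch of that step only; the scores are made-up numbers, and it does not cover the ordinal variant.

```python
import math


def softmax(scores):
    """Convert one score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```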

The Logistic Regression classifier has many hyperparameters. I'll tune some of them to improve the model and its score.

**The hyperparameters are:**
* **penalty:** Specifies the norm used in the penalization. The newton-cg, sag and lbfgs solvers support only l2 penalties (or none).
   * `'none'`: no penalty is added;
   * `'l2'`: adds an L2 penalty term; this is the default choice;
   * `'l1'`: adds an L1 penalty term;
   * `'elasticnet'`: both L1 and L2 penalty terms are added.

* **C:** Inverse of regularization strength; smaller values specify stronger regularization.
* **solver:** *(‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’)* Algorithm to use in the optimization problem. Default is ‘lbfgs’.
  * For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones;
  * For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss;
  * ‘liblinear’ is limited to one-versus-rest schemes.
* **random_state:** Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. 
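
Because not every solver accepts every penalty, a search space that mixes them freely will contain combinations that cannot be fitted. The sketch below filters a penalty and solver grid against the compatibility rules listed in the scikit-learn documentation; `SUPPORTED_PENALTIES` and `valid_combinations` are hypothetical names written for this illustration.

```python
from itertools import product

# Penalties each solver accepts, following the scikit-learn documentation.
SUPPORTED_PENALTIES = {
    "newton-cg": {"l2", "none"},
    "lbfgs": {"l2", "none"},
    "sag": {"l2", "none"},
    "saga": {"l1", "l2", "elasticnet", "none"},
    "liblinear": {"l1", "l2"},
}


def valid_combinations(penalties, solvers):
    """Return the (penalty, solver) pairs that the solver can actually fit."""
    return [
        (penalty, solver)
        for penalty, solver in product(penalties, solvers)
        if penalty in SUPPORTED_PENALTIES[solver]
    ]


combos = valid_combinations(
    ["l1", "l2", "elasticnet"], ["newton-cg", "sag", "saga", "lbfgs"]
)
print(len(combos), "of 12 combinations are valid")
for penalty, solver in combos:
    print(penalty, solver)
```

Note that `'elasticnet'` additionally requires an `l1_ratio` value, which this sketch does not model.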

In [None]:
# create the pipeline with a Logistic Regression classifier as the final estimator
full_pipeline_Log = create_fit_pipeline(LogisticRegression(random_state = 42))

#### **1* Random Search**

##### **1- Random Search With Cross Validation**

**using Random Search and Logistic Regression Classifier with Cross Validation**

**Expectations:**

Random Search and Logistic Regression with Cross Validation will be used. Because Random search works best for lower dimensional data and fits the estimator (model) on your training set, I expect it to give me the greatest score.

I'm going to specify some hyperparameters for the preprocessor, select features, and Logistic Regression classifier.
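
The search spaces in this notebook name each hyperparameter as `step__param`, where the part before the double underscore is the pipeline step. As a small illustration of that convention, the hypothetical helper below groups such keys by step; the example values are taken from the best parameters reported for this run.

```python
def group_by_step(params):
    """Group 'step__param' keys by pipeline step name."""
    grouped = {}
    for key, value in params.items():
        step, _, name = key.partition("__")
        grouped.setdefault(step, {})[name] = value
    return grouped


best_params = {
    "vectorizer__min_df": 10,
    "vectorizer__max_df": 0.4,
    "my_classifier__solver": "sag",
    "my_classifier__C": 1,
}
grouped = group_by_step(best_params)
print(grouped)
```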

\

**observations:**

The best hyperparameters for this model were:


* **analyzer:** word
* **max_df:** 0.4
* **min_df:** 10



* **penalty:** l2
* **C:** 1
* **solver:** sag


\
Scores:

     In colab ==> Score: 0.86863

     In Kaggle 
        * Public score: 0.82753
        * Private score: 0.82932


\

**plan:**

I will use Random Search and Logistic Regression Classifier with Validation Set.

In [None]:
# hyperparameter for Logistic Regression Classifier
# here we specify the search space 
# `__` denotes an attribute of the preceding name
# (e.g. my_classifier__C means the `C` param for `my_classifier`)
param_Log = {
    'vectorizer__analyzer': ["word"], 
    'vectorizer__max_df': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    'vectorizer__min_df': [5, 10, 15, 20, 25, 30, 35, 40, 50],
    
    'my_classifier__penalty' : ['l1', 'l2', 'elasticnet'],  # note: of the solvers below, only 'saga' supports 'l1' and 'elasticnet'
    'my_classifier__C' : [0.001,0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 1],
    'my_classifier__solver' : ['newton-cg', 'sag', 'saga', 'lbfgs']
}

In [None]:
# create_fit_random_search returns the fitted random search for Logistic Regression, using (x_train) and (y_train) with 20-fold cross validation
random_search_Log = create_fit_random_search(full_pipeline_Log, param_Log, 20)

Fitting 20 folds for each of 5 candidates, totalling 100 fits
best score 0.8686367046931786
best score {'vectorizer__min_df': 10, 'vectorizer__max_df': 0.4, 'vectorizer__analyzer': 'word', 'my_classifier__solver': 'sag', 'my_classifier__penalty': 'l2', 'my_classifier__C': 1}


In [None]:
# using the predict_save_csv function and it will predict the testing data and save it in the csv file
predict_save_csv(random_search_Log, 'Log_Random_Cross')

##### **2- Random Search With Validation Set**

**using Random Search and Logistic Regression Classifier with Validation Set**

**Expectations:**

Random Search and Logistic Regression with Validation set will be used. Because Random search works best for lower dimensional data and fits the estimator (model) on your training set, I expect it to give me the greatest score.

I'm going to specify some hyperparameters for the preprocessor, select features, and Logistic Regression classifier.
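
The `pds` object passed to the search functions is defined elsewhere in the notebook. Assuming it behaves like scikit-learn's `PredefinedSplit`, a validation set is described by a list of fold labels in which `-1` marks rows that always stay in training and `0` marks rows of the single validation fold. The helpers below are hypothetical and only mimic that idea with plain lists.

```python
def make_test_fold(n_train, n_valid):
    """Build a fold-label list: -1 = always train, 0 = validation fold."""
    return [-1] * n_train + [0] * n_valid


def split_indices(test_fold):
    """Yield (train_idx, valid_idx) for each distinct non-negative fold label."""
    for fold in sorted(set(f for f in test_fold if f >= 0)):
        train_idx = [i for i, f in enumerate(test_fold) if f != fold]
        valid_idx = [i for i, f in enumerate(test_fold) if f == fold]
        yield train_idx, valid_idx


test_fold = make_test_fold(n_train=8, n_valid=2)
splits = list(split_indices(test_fold))
print(len(splits))
print(splits[0])
```

A single split like this is consistent with the "Fitting 1 folds" lines in the logged output of the validation-set runs.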

\

**observations:**

The best hyperparameters for this model were:


* **analyzer:** char
* **max_df:** 0.8
* **min_df:** 50



* **penalty:** l2
* **C:** 0.1
* **solver:** saga


\
Scores:

     In colab ==> Score: 0.5558

     In Kaggle 
        * Public score: 0.56054
        * Private score: 0.56428


\

**plan:**

I will use Bayesian Search and Logistic Regression Classifier with Cross Validation.

In [None]:
# hyperparameter for Logistic Regression Classifier
# here we specify the search space 
# `__` denotes an attribute of the preceding name
# (e.g. my_classifier__C means the `C` param for `my_classifier`)
param_Log = {
    'vectorizer__analyzer': ["char"], 
    'vectorizer__max_df': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    'vectorizer__min_df': [5, 10, 15, 20, 25, 30, 35, 40, 50],
    
    'my_classifier__penalty' : ['l2', 'none'],
    'my_classifier__C' : [0.001,0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 1, 2, 2.5],
    'my_classifier__solver' : ['newton-cg', 'sag', 'saga', 'lbfgs']
}

In [None]:
# create_fit_random_search returns the fitted random search for Logistic Regression, using (x_train) and (y_train) with the validation split (pds)
random_search_Log = create_fit_random_search(full_pipeline_Log, param_Log, pds)

Fitting 1 folds for each of 5 candidates, totalling 5 fits
best score 0.555813279350152
best score {'vectorizer__min_df': 50, 'vectorizer__max_df': 0.8, 'vectorizer__analyzer': 'char', 'my_classifier__solver': 'saga', 'my_classifier__penalty': 'l2', 'my_classifier__C': 0.1}


In [None]:
# using the predict_save_csv function and it will predict the testing data and save it in the csv file
predict_save_csv(random_search_Log, 'Log_Random_Validation')

#### **2* Bayesian Search**

##### **1- Bayesian Search With Cross Validation**

**using Bayesian Search and Logistic Regression Classifier with Cross Validation**

**Expectations:**

Bayesian Search and Logistic Regression with Cross Validation will be used. Because Bayesian Search discovers the extrema of objective functions that are expensive to evaluate and fits the estimator (model) on your training set, I expect it to give me the greatest score.

I'm going to specify some hyperparameters for the preprocessor, select features, and Logistic Regression classifier.
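
As a reminder of what cross validation does with the data, the sketch below splits sample indices into k contiguous folds and uses each fold once for validation. `k_fold_indices` is a hypothetical helper with no shuffling or stratification; when `cv` is an integer and the estimator is a classifier, scikit-learn uses a stratified variant instead.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(n_samples=100, n_splits=20))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

With 20 folds, every candidate is fitted 20 times, which is why the cross-validated runs take much longer than the validation-set runs.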

\

**observations:**

The best hyperparameters for this model were:


* **analyzer:** word
* **max_df:** 0.4
* **min_df:** 30


* **penalty:** none
* **C:** 0.01
* **solver:** sag


\
Scores:

     In colab ==> Score: 0.77178

     In Kaggle 
        * Public score: 0.82408
        * Private score: 0.82425




In [None]:
# hyperparameter for Logistic Regression Classifier
# here we specify the search space 
# `__` denotes an attribute of the preceding name
# (e.g. my_classifier__C means the `C` param for `my_classifier`)
param_Log = {
    'vectorizer__analyzer': ["word"], 
    'vectorizer__max_df': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    'vectorizer__min_df': [5, 10, 15, 20, 25, 30, 35, 40, 50],
    
    'my_classifier__penalty' : ['l2', 'none'],
    'my_classifier__C' : [0.001,0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 1],
    'my_classifier__solver' : ['newton-cg', 'sag', 'saga', 'lbfgs']
}

In [None]:
# create_fit_bayesian_search returns the fitted Bayesian search for Logistic Regression, using (x_train) and (y_train) with 20-fold cross validation
bayesian_search_Log = create_fit_bayesian_search(full_pipeline_Log, param_Log, 20)

Fitting 20 folds for each of 1 candidates, totalling 20 fits
Fitting 20 folds for each of 1 candidates, totalling 20 fits
Fitting 20 folds for each of 1 candidates, totalling 20 fits
Fitting 20 folds for each of 1 candidates, totalling 20 fits
Fitting 20 folds for each of 1 candidates, totalling 20 fits
best score 0.7717839239191628
best score OrderedDict([('my_classifier__C', 0.01), ('my_classifier__penalty', 'none'), ('my_classifier__solver', 'sag'), ('vectorizer__analyzer', 'word'), ('vectorizer__max_df', 0.4), ('vectorizer__min_df', 30)])


In [None]:
# using the predict_save_csv function and it will predict the testing data and save it in the csv file
predict_save_csv(bayesian_search_Log, 'Log_Bayesian_Cross')

##### **2- Bayesian Search With Validation Set**

**using Bayesian Search and Logistic Regression Classifier with Validation Set**

**Expectations:**

Bayesian Search and Logistic Regression with Validation Set will be used. Because Bayesian Search discovers the extrema of objective functions that are expensive to evaluate and fits the estimator (model) on your training set, I expect it to give me the greatest score.

I'm going to specify some hyperparameters for the preprocessor, select features, and Logistic Regression classifier.

\

**observations:**

The best hyperparameters for this model were:


* **analyzer:** char
* **max_df:** 0.7
* **min_df:** 40


* **penalty:** none
* **C:** 0.04
* **solver:** sag


\
Scores:

     In colab ==> Score: 0.54893

     In Kaggle 
        * Public score: 0.55407
        * Private score: 0.55859


In [None]:
# hyperparameter for Logistic Regression Classifier
# here we specify the search space 
# `__` denotes an attribute of the preceding name
# (e.g. my_classifier__C means the `C` param for `my_classifier`)
param_Log = {
    'vectorizer__analyzer': ["char"], 
    'vectorizer__max_df': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    'vectorizer__min_df': [5, 10, 15, 20, 25, 30, 35, 40, 50],
    
    'my_classifier__penalty' : ['l2', 'none'],
    'my_classifier__C' : [0.001,0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 1],
    'my_classifier__solver' : ['newton-cg', 'sag', 'saga', 'lbfgs']
}

In [None]:
# create_fit_bayesian_search returns the fitted Bayesian search for Logistic Regression, using (x_train) and (y_train) with the validation split (pds)
bayesian_search_Log = create_fit_bayesian_search(full_pipeline_Log, param_Log, pds)

Fitting 1 folds for each of 1 candidates, totalling 1 fits
Fitting 1 folds for each of 1 candidates, totalling 1 fits
Fitting 1 folds for each of 1 candidates, totalling 1 fits
Fitting 1 folds for each of 1 candidates, totalling 1 fits
Fitting 1 folds for each of 1 candidates, totalling 1 fits
best score 0.5489375941107579
best score OrderedDict([('my_classifier__C', 0.04), ('my_classifier__penalty', 'none'), ('my_classifier__solver', 'sag'), ('vectorizer__analyzer', 'char'), ('vectorizer__max_df', 0.7), ('vectorizer__min_df', 40)])


In [None]:
# using the predict_save_csv function and it will predict the testing data and save it in the csv file
predict_save_csv(bayesian_search_Log, 'Log_Bayesian_Validation')