# Assignment 2

## Problem Formulation: ✍
* We need to predict the probability that a new record (person) will match with another record (person) in the speed dating event.

### Input:
* General features of the people who attended the speed dating program, plus the information collected from 4 surveys: one before the speed dating started, two during the program, and one after it finished. Some records have the match value (train dataset) and some do not (test dataset). The datasets contain many missing values, which need to be handled during data preparation.

### Output:
* The predicted probability of the match value for each record in the test dataset. This tells us whether the person is likely to match with another in the speed dating (high probability) or not (low probability).


## What data mining function is required? 🕵🏽
* This assignment requires the "classification and prediction" data mining function, since the target (match) is a binary label and we predict its probability.

## What could be the challenges? 😕
* The challenges are: organizing the data, handling the missing values, and handling the imbalance between the two classes; tuning the hyperparameters of each model to find the best combination; and, overall, developing a solution that performs well on this problem.

## What is the impact? 🤓
* On the learning side, the assignment gives practice with data preprocessing, with building pipelines, and with tuning hyperparameters to see how they affect the model and its results (the predicted match probabilities). On the practical side, a good model gives reliable predictions that would let the organizers of the speed dating program estimate, before the program starts, whether each person is likely to match with another. The organizers could then bring together people who are predicted to match, which would save time for both the program and the participants.

## What is an ideal solution? 🦸
* The ideal solution is a well-trained model that scores as high as possible on the evaluation metric (AUROC) on unseen data, which in turn depends on good preprocessing of the data. Such a model can predict the match value for each person, indicating whether they will find a suitable person to match with.

## What is the experimental protocol used and how was it carried out? 🤔
* I used cross-validation in some trials and the holdout method in others. In cross-validation, the algorithm repeatedly splits the input data into training and validation parts. In the holdout method, I split the given train dataset once into two parts (train and test) with a test size of my choice. In both cases I used the held-out part to tune the hyperparameters and find the best combination for each model: since that part comes from the given training set, its true labels are known, so I could compute the AUROC score from the model's predictions.
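
To make the two protocols concrete, here is a minimal sketch on synthetic data. The dataset, the plain `LogisticRegression` model, and the names `X_demo`, `y_demo`, `holdout_auc`, and `cv_auc` are illustrative assumptions, not the data or models used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced binary data standing in for the speed dating features (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.85, 0.15], random_state=0
)
model = LogisticRegression(max_iter=1000)

# Holdout: one stratified split, scored once on the held-out part.
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
model.fit(X_tr, y_tr)
holdout_auc = roc_auc_score(y_ho, model.predict_proba(X_ho)[:, 1])

# Cross-validation: five splits, each fold serves once as the validation part.
cv_auc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")

print(f"holdout AUROC: {holdout_auc:.3f}")
print(f"5-fold AUROC:  {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```

The holdout estimate depends on a single split, while cross-validation averages over several, which is why the two protocols can report noticeably different scores for the same model.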

# Import required libraries 📝

In [1]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV, PredefinedSplit, RandomizedSearchCV, train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from xgboost.sklearn import XGBClassifier

import warnings 
warnings.filterwarnings('ignore')

# Read training, and test datasets 

In [2]:
# read the training dataset and make the id column in the csv file be the index of the dataframe
tr_data = pd.read_csv('train.csv',index_col='id')
tr_data

id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
2583,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,
6830,1,14,1,3,10,2,,8,8,63.0,...,6.0,8.0,8.0,7.0,8.0,,,,,
4840,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,
5508,1,38,2,9,20,18,13.0,6,7,200.0,...,8.0,9.0,8.0,8.0,6.0,,,,,
4828,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3390,0,1,2,9,20,2,2.0,18,1,214.0,...,12.0,12.0,12.0,9.0,12.0,,,,,
4130,1,24,2,9,20,19,15.0,5,6,199.0,...,,,,,,,,,,
1178,0,13,2,11,21,5,5.0,3,18,290.0,...,,,,,,,,,,
5016,1,10,2,7,16,6,14.0,9,10,151.0,...,,,,,,,,,,


In [3]:
# view information about the training dataframe
tr_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5909 entries, 2583 to 8149
Columns: 191 entries, gender to amb5_3
dtypes: float64(173), int64(10), object(8)
memory usage: 8.7+ MB


## Read test file

In [4]:
# read the test dataset and make the id column in the csv file be the index of the dataframe
ts_data = pd.read_csv('test.csv',index_col='id')
ts_data

id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
934,0,5,2,2,16,3,,13,13,52.0,...,5.0,7.0,8.0,6.0,8.0,,,,,
6539,0,33,2,14,18,6,6.0,4,8,368.0,...,6.0,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0
6757,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,
2275,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,
1052,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7982,0,23,2,15,19,18,18.0,14,11,407.0,...,,,,,,,,,,
7299,0,5,1,13,9,4,4.0,4,8,339.0,...,,,,,,,,,,
1818,1,26,2,2,19,3,,15,3,23.0,...,,,,,,,,,,
937,0,19,2,9,20,11,11.0,9,2,215.0,...,9.0,7.0,12.0,12.0,9.0,,,,,


In [5]:
# split the training data to labels and features to be used in the supervised learning :)
# get the match column as it is our label column
y = tr_data['match']
# drop the match column and take the rest of the dataframe to be our features
X = tr_data.drop('match', axis=1) 
# print the shapes of both labels and features
print('original shape', X.shape, y.shape)

original shape (5909, 190) (5909,)


In [6]:
# concatenate the training features and the test features so the data preparation steps run once on both;
# the row order (and therefore the id order) is preserved, so the two parts can be split again
# by position after the data preparation is finished
frames = [X, ts_data]
result = pd.concat(frames)
result.head()

id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
2583,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,
6830,1,14,1,3,10,2,,8,8,63.0,...,6.0,8.0,8.0,7.0,8.0,,,,,
4840,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,
5508,1,38,2,9,20,18,13.0,6,7,200.0,...,8.0,9.0,8.0,8.0,6.0,,,,,
4828,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,


In [7]:
# view information about both training and test features after concatenating them
result.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8378 entries, 2583 to 6691
Columns: 190 entries, gender to amb5_3
dtypes: float64(173), int64(9), object(8)
memory usage: 12.2+ MB


## Data preparation 🔭

In [8]:
# function that computes the percentage and the number of NaN values in each column of the dataframe
def missing_v(df):
    n_missing = df.isnull().sum()
    missing = pd.DataFrame({
        'missing_values(%)': n_missing / len(df) * 100,
        'missing_values(numbers)': n_missing,
    })
    return missing.sort_values(by='missing_values(%)', ascending=False)

In [9]:
# get the percentage and number of the nan values in the features dataframe 
result_mis = missing_v(result)
result_mis

Unnamed: 0,missing_values(%),missing_values(numbers)
num_in_3,92.026737,7710
numdat_3,82.143710,6882
expnum,78.515159,6578
sinc7_2,76.665075,6423
amb7_2,76.665075,6423
...,...,...
idg,0.000000,0
order,0.000000,0
partner,0.000000,0
samerace,0.000000,0


In [10]:
# collect the names of the columns in which more than 70% of the values are missing
# (a boolean mask on the index avoids position-based lookups on a label-indexed Series)
empty_col = result_mis.index[result_mis['missing_values(%)'] > 70].tolist()
empty_col

['num_in_3',
 'numdat_3',
 'expnum',
 'sinc7_2',
 'amb7_2',
 'shar7_2',
 'intel7_2',
 'attr7_2',
 'fun7_2',
 'attr5_3',
 'shar7_3',
 'fun5_3',
 'intel5_3',
 'sinc5_3',
 'attr7_3',
 'sinc7_3',
 'intel7_3',
 'fun7_3',
 'amb7_3',
 'amb5_3',
 'shar2_3']

In [11]:
# list the columns with dtype object; they need to be converted to the categorical dtype before they can be encoded
result.select_dtypes(include='object')

id,field,undergra,mn_sat,tuition,from,zipcode,income,career
2583,Ed.D. in higher education policy at TC,University of Michigan-Ann Arbor,1290.00,21645.00,"Palo Alto, CA",,,University President
6830,Engineering,,,,"Boston, MA",2021,,Engineer or iBanker or consultant
4840,Urban Planning,"Rizvi College of Architecture, Bombay University",,,"Bombay, India",,,Real Estate Consulting
5508,International Affairs,,,,"Washington, DC",10471,45300.00,public service
4828,Business,Harvard College,1400.00,26019.00,Midwest USA,66208,46138.00,undecided
...,...,...,...,...,...,...,...,...
7982,Neuroscience and Education,Columbia,1430.00,26908.00,Hong Kong,0,,Academic
7299,School Psychology,Bucknell University,1290.00,25335.00,"Erie, PA",,,school psychologist
1818,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
937,Mathematics,,,,Vestal,13850,42640.00,college professor


In [12]:
# work on a copy so the original dataframe can be restored after a mistake without reading the files again
df2 = result.copy()

# convert the remaining object columns to the categorical dtype
# (field and career are not converted: they are dropped below)
for col in ['undergra', 'mn_sat', 'tuition', 'from', 'zipcode', 'income']:
    df2[col] = result[col].astype("category")

# drop field and career: they are redundant, since field_cd and career_c describe the same content numerically
# also drop the columns that contain more than 70% missing values (the names listed in empty_col)
# axis=1 means that columns, not rows, are dropped
df2 = df2.drop(['field','career', 'num_in_3', 'numdat_3', 'expnum', 'sinc7_2', 'amb7_2', 'shar7_2', 'intel7_2', 'attr7_2',
                 'fun7_2', 'attr5_3', 'shar7_3', 'fun5_3', 'intel5_3', 'sinc5_3', 'attr7_3', 'sinc7_3', 'intel7_3', 'fun7_3',
                 'amb7_3', 'amb5_3', 'shar2_3'], axis=1)

In [13]:
# view information about the dataframe after dropping columns and converting the object columns to categorical
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8378 entries, 2583 to 6691
Columns: 167 entries, gender to amb3_3
dtypes: category(6), float64(152), int64(9)
memory usage: 10.5 MB


In [14]:
# split the dataframe by row position after the data preparation to get back the training and test features
n_train = len(tr_data)  # the first n_train rows of df2 are the training rows (5909 here)
X = df2.iloc[:n_train, :]
df2_ts = df2.iloc[n_train:, :]
# print the shape of each part to make sure that the shapes are right
print("Shape of new dataframes - {} , {}".format(X.shape, df2_ts.shape))
# print the top 5 rows in the test dataframe to make sure that it starts with the right ID after splitting the result dataframe
df2_ts.head()

Shape of new dataframes - (5909, 167) , (2469, 167)


id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,attr2_3,sinc2_3,intel2_3,fun2_3,amb2_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3
934,0,5,2,2,16,3,,13,13,52.0,...,,,,,,5.0,7.0,8.0,6.0,8.0
6539,0,33,2,14,18,6,6.0,4,8,368.0,...,30.0,10.0,0.0,30.0,0.0,6.0,8.0,7.0,7.0,8.0
6757,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,
2275,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,
1052,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,


In [15]:
# extract the names of the numeric and the categorical features
# numeric features: columns of the training dataframe with dtype float64 or int64
features_numeric = list(X.select_dtypes(include=['float64', 'int64']))

# categorical features: columns of the training dataframe with dtype category
features_categorical = list(X.select_dtypes(include=['category']))

# print names of the features numeric and categorical
print('numeric features:', features_numeric)
print('categorical features:', features_categorical)

numeric features: ['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'pid', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sinc1_s', 'in

## Start in creating the pipeline 👷


**First, I fix the random seed to 0 so that the same random numbers are generated each time I run the code.**

**The pipeline needs a preprocessor and a classifier, so I will describe the preprocessor first. The preprocessor has to handle two kinds of columns in the datasets: numerical and categorical. I therefore create two pipelines, one for each data type.**

**First, the numerical pipeline 🔢. It uses 'SimpleImputer' to fill in the missing values; its strategy parameter is set by the hyperparameter search in each trial. It then uses 'StandardScaler' to standardize each numerical feature by removing the mean and scaling to unit variance.**
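
As a small illustration of what this numerical pipeline does, the sketch below runs the same two steps on a toy column. The array `col` and the name `numeric_demo` are made up for the example, and the strategy is fixed to 'mean' here only for demonstration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One toy numeric column with a missing entry (illustrative values).
col = np.array([[1.0], [3.0], [np.nan], [5.0]])

numeric_demo = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # NaN -> mean of 1, 3, 5, which is 3
    ("scaler", StandardScaler()),                 # subtract the mean, divide by the std
])
scaled = numeric_demo.fit_transform(col)

# After scaling, the column has mean 0 and unit variance;
# the imputed entry sits exactly at the mean, so it becomes 0.
print(scaled.ravel())
```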


**Second, the categorical pipeline 🔠. It uses 'SimpleImputer' with strategy 'constant', which fills the missing values with the string “missing_value” because the data is categorical (for numerical data it would fill them with 0). Then 'OneHotEncoder' converts the categorical columns into numerical indicator columns. With handle_unknown='ignore', a category that was not seen during fitting is encoded as all zeros instead of raising an error. This matters because one dataset may contain categorical values that the other does not, and without this option those unseen values would cause errors when transforming.**
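
The behaviour described above can be checked on a toy column. This is only a sketch: the arrays `train_cat` and `test_cat` and the name `categorical_demo` are invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column; the fitted data has a missing entry,
# and the second array contains a value never seen during fitting.
train_cat = np.array([["Boston"], [np.nan], ["Vestal"]], dtype=object)
test_cat = np.array([["Boston"], ["Hong Kong"]], dtype=object)

categorical_demo = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant")),     # NaN -> "missing_value"
    ("onehot", OneHotEncoder(handle_unknown="ignore")),  # unseen category -> all zeros
])
categorical_demo.fit(train_cat)

encoded_test = categorical_demo.transform(test_cat).toarray()
print(categorical_demo.named_steps["onehot"].categories_)
print(encoded_test)
```

The encoder learns three categories from the fitted data (including the imputed "missing_value"), so the transformed output always has three columns, and the unseen "Hong Kong" row contains only zeros.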

**The preprocessor combines both pipelines, one for the numerical features and one for the categorical features, and transforms each column with the pipeline it belongs to.**

**The full pipeline consists of the preprocessor described above and a classifier. The first classifier I use is 'XGBClassifier'. XGBoost is designed to be efficient, flexible, and portable, and it provides parallel tree boosting (also known as GBDT, GBM), which solves many data science problems quickly and accurately.**


**In the other pipeline, I use a Logistic Regression classifier, a common choice for binary classification. I set it to compensate for the imbalance between the two classes by weighting each class.**
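
For intuition about the class weighting, scikit-learn's "balanced" mode computes each class weight as `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to toy labels; `y_toy` and its 8-to-2 split are illustrative assumptions, not the class ratio of this dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 8 non-matches and 2 matches (illustrative numbers only).
y_toy = np.array([0] * 8 + [1] * 2)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)
# class 0: 10 / (2 * 8) = 0.625, class 1: 10 / (2 * 2) = 2.5
print(dict(zip([0, 1], weights)))
```

The rarer class gets the larger weight, so errors on it count more in the loss.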


In [16]:
# Create the preprocessor which will be used in the pipeline
# seed NumPy's global random generator so that random draws are the same on each run of the code
np.random.seed(0)

# define a pipeline for numeric feature preprocessing
# each step gets a name so that we can set its hyperparameters later
transformer_numeric = Pipeline(
    steps=[
        ('imputer', SimpleImputer()),
        ('scaler', StandardScaler())]
)

# define a pipeline for categorical feature preprocessing
# each step gets a name so that we can set its hyperparameters later
transformer_categorical = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)

# define the preprocessor
# each transformer gets a name so that we can set its hyperparameters later
# the numeric and the categorical feature names are passed to their respective transformer (pipeline)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', transformer_numeric, features_numeric),
        ('cat', transformer_categorical, features_categorical)
    ]
)

In [17]:
# create pipeline for XGBClassifier
# combine the preprocessor with the model as a full tunable pipeline
# we gave them a name so we can set their hyperparameters
full_pipline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('my_classifier', 
           XGBClassifier(),
        )
    ]
)
full_pipline

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['gender', 'idg', 'condtn',
                                                   'wave', 'round', 'position',
                                                   'positin1', 'order',
                                                   'partner', 'pid', 'int_corr',
                                                   'samerace', 'age_o',
                                                   'race_o', 'pf_o_att',
                                                   'pf_o_sin', 'pf_o_int',
                                                   

In [18]:
# create another pipeline that changes only the classifier and reuses the same preprocessor
# class_weight="balanced" makes LogisticRegression weight each class to compensate for the class imbalance
# combine the preprocessor with the model as a full tunable pipeline
# each step gets a name so that we can set its hyperparameters later
full_pipline2 = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('my_classifier', 
           LogisticRegression(class_weight="balanced"),
        )
    ]
)
full_pipline2 

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['gender', 'idg', 'condtn',
                                                   'wave', 'round', 'position',
                                                   'positin1', 'order',
                                                   'partner', 'pid', 'int_corr',
                                                   'samerace', 'age_o',
                                                   'race_o', 'pf_o_att',
                                                   'pf_o_sin', 'pf_o_int',
                                                   

# Trials 🏃‍♀️ 🤗
* I ran 6 trials, as follows:
    1. Grid Search with Cross-validation with xgboost classifier
    2. Grid Search with Cross-validation with LogisticRegression classifier
    3. Grid Search with validation set with xgboost classifier
    4. Grid Search with validation set with LogisticRegression classifier
    5. Random Search with xgboost classifier
    6. Bayesian Search with xgboost classifier

##  1. Grid Search with Cross-validation with xgboost classifier
* Here I use cross-validation as the experimental protocol. The training set is divided into folds; in each iteration a different fold is held out for validation and the model is trained on the rest. With cv=5, the training dataset is split into 5 parts, and in each iteration the model trains on 4 of them and is scored on the remaining one using the metric roc_auc.
* I use a grid to define the candidate values of the hyperparameters for each model. For the imputer in the numerical pipeline I pass 'mean' and 'most_frequent', so the missing values in each column are filled with either the mean or the most frequent value of that column.
* For the number of estimators, which is the number of gradient boosted trees used in the model, I pass a range of values from small to large to see which number works best.
* For the max depth, which is the maximum tree depth of the base learners, I use relatively small values to limit overfitting in the training phase.


* Grid search builds every combination of the given hyperparameter values, fits the pipeline with each combination on the training dataset and its labels, and returns the combination with the best score, i.e. the one that separates the two classes best.


* In the test phase I use the combination that gave the best result in the training phase to predict the probability of each class, and I save the probability of the value '1' (a match) to evaluate the results on Kaggle.
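
The size of the search can be worked out before running it. The sketch below counts the combinations with scikit-learn's `ParameterGrid`, using the same values as the grid in the next cell; the names `grid`, `n_candidates`, and `n_fits` are introduced only for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid searched in trial 1.
grid = {
    "preprocessor__num__imputer__strategy": ["mean", "most_frequent"],
    "my_classifier__n_estimators": [80, 100, 200, 300, 400, 500, 600],
    "my_classifier__max_depth": [5, 7, 10, 15],
}

n_candidates = len(ParameterGrid(grid))  # 2 * 7 * 4 combinations
n_folds = 5
n_fits = n_candidates * n_folds          # one fit per combination per fold

print(n_candidates, n_fits)
```

Because the number of fits is the product of all list lengths and the number of folds, adding one more hyperparameter with a few values multiplies the runtime accordingly.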

In [19]:
# Grid Search with Cross-validation with XGBoost classifier

# specify the search space for this classifier
# `__` denotes an attribute of the preceding name, so my_classifier__max_depth
# is the max_depth attribute of the classifier step in the pipeline, and so on
param_grid = {
    # preprocessor__num__imputer__strategy points to preprocessor->num (a Pipeline)-> imputer -> strategy
    'preprocessor__num__imputer__strategy': ['mean', 'most_frequent'],
    # my_classifier__n_estimators points to my_classifier->n_estimators 
    'my_classifier__n_estimators': [80, 100, 200, 300, 400, 500, 600],  
     # my_classifier__max_depth points to my_classifier->max_depth
    'my_classifier__max_depth':[5, 7, 10, 15]       
}

# cv=5 means five-fold cross-validation
# n_jobs is the number of jobs run in parallel
# (on Colab we only have two CPU cores, so we set it to 2)
grid_search = GridSearchCV(
    full_pipline, param_grid, cv=5, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# fit on the training features and labels to find the best combination of the hyperparameters
grid_search.fit(X, y)

# print the best score and the combination of the hyperparameters that achieved it
print('best score {}'.format(grid_search.best_score_))
print('best params {}'.format(grid_search.best_params_))

Fitting 5 folds for each of 56 candidates, totalling 280 fits
best score 0.8810673915129037
best params {'my_classifier__max_depth': 5, 'my_classifier__n_estimators': 80, 'preprocessor__num__imputer__strategy': 'mean'}


In [20]:
# getting the results and saving in csv file 
submission = pd.DataFrame()
# get the id from the test dataframe
submission['id'] = ts_data.index

# predict the probability of the match value '1' with the best combination of the hyperparameters found in the
# training phase
submission['match'] = grid_search.predict_proba(df2_ts)[:,1]

submission.to_csv('GS_CV.csv', index=False)

### Thoughts and observations for trial 1: 🤓
This gave score on Kaggle: 0.87355

The search ran 280 fits: there are 56 combinations of the hyperparameters and cv=5, so each of the 56 combinations is trained and validated on 5 different train/validation splits.

The best hyperparameters were: 'max_depth': 5, 'n_estimators': 80, 'imputer__strategy': 'mean'. This means that the imputer in the numerical pipeline fills the missing values of each numerical feature (column) with that feature's mean, in both the training dataframe and the test dataframe. The classifier uses 80 gradient boosted trees with a maximum depth of 5 per tree, which is a small value, so I expected it to limit overfitting on the training dataset. The grid search gave a cross-validated AUC score of 0.88107 on the training dataset, and Kaggle gave 0.87355. The two scores are close, which suggests that this combination of hyperparameters did not overfit the training dataset: the model also scored well on the unseen test dataset.


### Plan for trial 2: 🤔
In the next trial I will change the classifier to Logistic Regression and keep using cross-validation, but this time with 2 folds: the training dataset is divided into two parts, and each part serves once as the validation set while the model trains on the other. I will tune the hyperparameters solver, penalty, and C with the grid search method described before. The solver is the algorithm used for the optimization (for small datasets ‘liblinear’ is a good choice; for multiclass problems ‘newton-cg’ and ‘lbfgs’ handle the multinomial loss; the default is ‘lbfgs’). The penalty is the regularization method applied to the weights of the model to prevent overfitting on the training dataset. C controls the strength of that penalty: smaller values specify stronger regularization.
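
To see the effect of C, the sketch below fits the same logistic regression with a large and a small C and compares the size of the learned weights. The synthetic data and the names `X_demo`, `y_demo`, and `norms` are assumptions made for the example; the default L2 penalty and 'lbfgs' solver are used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (illustrative only).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

norms = {}
for C in [100, 0.01]:
    # Default penalty is L2; smaller C means a stronger penalty on the weights.
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=1000)
    clf.fit(X_demo, y_demo)
    norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C}: ||w|| = {norms[C]:.3f}")
```

The weight vector fitted with C=0.01 is shorter than the one fitted with C=100, which is the shrinkage that stronger regularization is expected to produce.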

##  2. Grid Search with Cross-validation with LogisticRegression classifier

In [21]:
# Grid Search with Cross-validation with LogisticRegression classifier

# specify the search space
# `__` denotes an attribute of the preceding name, so my_classifier__C
# is the C attribute of the classifier step in the pipeline, and so on
param_grid2 = {
    # preprocessor__num__imputer__strategy points to preprocessor->num (a Pipeline)-> imputer -> strategy
    'preprocessor__num__imputer__strategy': ['mean', 'most_frequent'],
    # my_classifier__solver points to my_classifier->solver 
    'my_classifier__solver' : ['newton-cg', 'lbfgs', 'liblinear'],
     # my_classifier__penalty points to my_classifier->penalty, the regularization that prevents overfitting in the model
    'my_classifier__penalty': ['l2'],
      # my_classifier__C points to my_classifier->C values 
    'my_classifier__C' : [1000, 100, 10, 1.0, 0.1, 0.01]
}

# cv=2 means two-fold cross-validation
# n_jobs is the number of jobs run in parallel
# (on Colab we only have two CPU cores, so we set it to 2)
grid_search_LR = GridSearchCV(
    full_pipline2, param_grid2, cv=2, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# fit on the training features and labels to find the best combination of the hyperparameters
grid_search_LR.fit(X, y)

# print the best score and the combination of the hyperparameters that achieved it
print('best score {}'.format(grid_search_LR.best_score_))
print('best params {}'.format(grid_search_LR.best_params_))

Fitting 2 folds for each of 36 candidates, totalling 72 fits
best score 0.8579257126421844
best params {'my_classifier__C': 0.01, 'my_classifier__penalty': 'l2', 'my_classifier__solver': 'lbfgs', 'preprocessor__num__imputer__strategy': 'mean'}


In [22]:
# getting the results and saving in csv file 
submission = pd.DataFrame()
# get the id from the test dataframe
submission['id'] = ts_data.index

# predict the probability of the match value '1' with the best combination of the hyperparameters found in the
# training phase
submission['match'] = grid_search_LR.predict_proba(df2_ts)[:,1]

submission.to_csv('GS_CV_LR.csv', index=False)

### Thoughts and observations  for trial 2: 🧐
This gave score on Kaggle: 0.86746

The search ran 72 fits: there are 36 combinations of the hyperparameters and cv=2, so each of the 36 combinations is trained and validated on 2 different train/validation splits.

The best hyperparameters were: 'C': 0.01, 'penalty': 'l2', 'solver': 'lbfgs', 'imputer__strategy': 'mean'. This means that the imputer in the numerical pipeline fills the missing values of each numerical feature (column) with that feature's mean, in both the training dataframe and the test dataframe. The classifier uses solver = 'lbfgs', the default optimization algorithm, and an L2 penalty, which keeps the weights of the model small. The C value is 0.01, which makes the regularization stronger. The grid search gave a cross-validated AUC score of 0.85793 on the training dataset, and Kaggle gave 0.86746. This suggests that this combination of hyperparameters with the logistic regression model did not overfit the training dataset, since the model also scored well on the unseen test dataset. The score is lower than that of the XGBoost classifier in trial 1 (0.87355), so XGBoost performed better on this dataset.


### Plan for trial 3: 🤨
I will return to the XGBoost classifier in the next trial, with the same hyperparameter grid I used for it the first time. This time I will use the holdout method, splitting the training dataset into training and validation sets. I will keep using grid search to find the combination of hyperparameters that gives the best AUC on the validation set. I won't change anything in the grid, so I can see how the cross-validation and holdout methods differ.

## 3. Grid Search with validation set with xgboost classifier

In [23]:
# Split the training features and labels into training and validation sets,
# stratified on the label so both parts keep the same class ratio
X_train2, X_val, y_train2, y_val = train_test_split(
    X, y, train_size = 0.9, stratify = y, random_state = 42)

In [24]:
# Create a list where training rows are marked -1 and validation rows are marked 0
# (rows of X that are in X_train2 form the new training set; the rest are validation)
split_index = [-1 if x in X_train2.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

grid_search_val = GridSearchCV(
    full_pipline, param_grid, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# We still pass the full X and y here; the grid search uses the
# predefined split internally to decide which samples belong to
# the validation set. Fitting finds the best combination of hyperparameters.
grid_search_val.fit(X, y)

# print the best score and the combination of the hyperparameters that got that score
print('best score {}'.format(grid_search_val.best_score_))
print('best score {}'.format(grid_search_val.best_params_))

Fitting 1 folds for each of 56 candidates, totalling 56 fits
best score 0.911102898907777
best score {'my_classifier__max_depth': 5, 'my_classifier__n_estimators': 100, 'preprocessor__num__imputer__strategy': 'most_frequent'}


In [25]:
# getting the results and saving in csv file 
submission = pd.DataFrame()
# get the id from the test dataframe
submission['id'] = ts_data.index

# predicted probability of class 1 (match) from the best hyperparameter
# combination found during the search
submission['match'] = grid_search_val.predict_proba(df2_ts)[:,1]

submission.to_csv('GS_Val.csv', index=False)

### Thoughts and observations for trial 3: ✌
This gave score on Kaggle: 0.87976

The grid search ran 56 fits: there are 56 hyperparameter combinations, and each one is fitted once on the training part of the dataset and scored once on the validation part.
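The reason there is only one fold per candidate is that `PredefinedSplit` turns the -1/0 list into a single train/validation split. A minimal sketch with a made-up six-row marker list (an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 = row is always used for training, 0 = row belongs to validation fold 0
test_fold = np.array([-1, -1, 0, -1, 0, -1])
pds_demo = PredefinedSplit(test_fold=test_fold)

n_splits = pds_demo.get_n_splits()
splits = [(train_idx.tolist(), val_idx.tolist())
          for train_idx, val_idx in pds_demo.split()]

print(n_splits)  # 1
print(splits)    # [([0, 1, 3, 5], [2, 4])]
```

With one split, the number of fits equals the number of candidates, which is why 56 candidates give 56 fits here.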

The best combination was 'max_depth': 5, 'n_estimators': 100, 'imputer__strategy': 'most_frequent'. The imputer in the numerical pipeline therefore fills the missing values in each numerical feature (column) with that feature's most frequent value, which is learned from the training data and then applied to both the training and the test dataframe. The classifier uses 100 estimators, which means the model is built from 100 gradient boosted trees, each with a max depth of 5. That is a small depth, so I did not expect strong overfitting on the training dataset. The holdout AUC was 0.91110 and the Kaggle score was 0.87976, so the holdout score was optimistic. The gap may be slight overfitting, or it may come from choosing the hyperparameters on a single validation split that holds only 10% of the data. The Kaggle score is still very good and better than the first trial with the same classifier, so changing the experimental protocol, which led to a different imputer strategy and number of estimators, gave better predictions on the test dataset.
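As a small illustration of the 'most_frequent' strategy, the sketch below fits a `SimpleImputer` on a toy training column and applies it to a toy test column (both arrays are made up for the example). The fill value comes from the training column only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy columns, not the assignment data.
train_col = np.array([[1.0], [1.0], [2.0], [np.nan]])
test_col = np.array([[np.nan], [5.0]])

imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(train_col)                    # learns the most frequent training value (1.0)
filled_test = imputer.transform(test_col) # the test NaN is filled with that value

print(imputer.statistics_)  # [1.]
print(filled_test)          # [[1.] [5.]]
```

Inside a pipeline this happens automatically: the imputer is fitted on whatever data the pipeline is fitted on and reused unchanged at prediction time.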

### Plan for trial 4: 🤕
I will change the classifier in the next trial and return to Logistic Regression, this time with the holdout method instead of cross-validation. I will use the same hyperparameter grid (solver, penalty, and C value) and the same grid search method to find the best combination on the validation set. I described the role of each hyperparameter earlier, when I used this classifier with cross-validation.


## 4. Grid Search with validation set with LogisticRegression classifier

In [26]:
# Create a list where training rows are marked -1 and validation rows are marked 0
# (rows of X that are in X_train2 form the new training set; the rest are validation)
split_index = [-1 if x in X_train2.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

grid_search_val_LR = GridSearchCV(
    full_pipline2, param_grid2, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

# We still pass the full X and y here; the grid search uses the
# predefined split internally to decide which samples belong to
# the validation set. Fitting finds the best combination of hyperparameters.
grid_search_val_LR.fit(X, y)

# print the best score and the combination of the hyperparameters that got that score
print('best score {}'.format(grid_search_val_LR.best_score_))
print('best score {}'.format(grid_search_val_LR.best_params_))

Fitting 1 folds for each of 36 candidates, totalling 36 fits
best score 0.9012687854151268
best score {'my_classifier__C': 0.01, 'my_classifier__penalty': 'l2', 'my_classifier__solver': 'liblinear', 'preprocessor__num__imputer__strategy': 'most_frequent'}


In [27]:
# getting the results and saving in csv file 
submission = pd.DataFrame()

# get the id from the test dataframe
submission['id'] = ts_data.index

# predicted probability of class 1 (match) from the best hyperparameter
# combination found during the search
submission['match'] = grid_search_val_LR.predict_proba(df2_ts)[:,1]

submission.to_csv('GS_Val_LR.csv', index=False)


### Thoughts and observations for trial 4: 😟
This gave score on Kaggle: 0.86704 


The grid search ran 36 fits: there are 36 hyperparameter combinations, and with the holdout method each one is fitted once on the training part and scored once on the validation dataset. The search then returns the combination with the best validation score.

The best combination was 'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear', 'imputer__strategy': 'most_frequent'. The imputer in the numerical pipeline therefore fills the missing values in each numerical feature (column) with that feature's most frequent value, which is learned from the training data and then applied to both the training and the test dataframe. The classifier uses solver = 'liblinear' as the optimization algorithm, with an L2 penalty, which shrinks the model weights toward zero. C is the inverse of the regularization strength, so C = 0.01 means strong regularization. The holdout AUC was 0.90126 and the Kaggle score was 0.86704, so the holdout score was optimistic and there may be slight overfitting to the validation split. The Kaggle score is still very good, but it is lower than the XGBoost scores, so on this dataset XGBoost remains the stronger classifier. It is also marginally lower than the logistic regression score from cross-validation in trial 2 (0.86746), so the holdout protocol did not help this classifier.


### Plan for trial 5: 😢🙆‍♀️
I will return to the XGBoost classifier in the next trial, but this time I will search for the best combination of hyperparameters with random search. At each iteration, random search draws one random combination from the grid, fits the model, and records its score. After the number of iterations I choose, it returns the combination with the best score among those it tried. Each iteration is scored by ROC AUC, using the holdout method with the training and validation datasets. I will use the same hyperparameter grid for XGBoost that I used in the first and third trials.


## 5. Random Search with xgboost classifier

In [28]:
# random search with the XGBoost pipeline
# evaluates 20 randomly sampled combinations
random_search = RandomizedSearchCV(
    full_pipline, param_grid, cv=pds, verbose=1, n_jobs=2, 
    # number of random trials
    n_iter=20,
    scoring='roc_auc')

# make fitting on the training dataset and their labels to get the best combination of the hyperparameters
random_search.fit(X, y)

# print the best score and the combination of the hyperparameters that got that score
print('best score {}'.format(random_search.best_score_))
print('best score {}'.format(random_search.best_params_))

Fitting 1 folds for each of 20 candidates, totalling 20 fits
best score 0.9066683091073334
best score {'preprocessor__num__imputer__strategy': 'most_frequent', 'my_classifier__n_estimators': 600, 'my_classifier__max_depth': 15}


In [29]:
# getting the results and saving in csv file 
submission = pd.DataFrame()
# get the id from the test dataframe
submission['id'] = ts_data.index
# predicted probability of class 1 (match) from the best hyperparameter
# combination found during the search
submission['match'] = random_search.predict_proba(df2_ts)[:,1]

submission.to_csv('Random_Search_XGB.csv', index=False)

### Thoughts and observations for trial 5: 👀
This gave score on Kaggle: 0.87600

The random search ran 20 fits, matching the number of iterations I set: it picked 20 random combinations from the grid I used for the XGBoost classifier and scored each one once on the validation dataset.

The best combination was 'max_depth': 15, 'n_estimators': 600, 'imputer__strategy': 'most_frequent'. The imputer in the numerical pipeline therefore fills the missing values in each numerical feature (column) with that feature's most frequent value, which is learned from the training data and then applied to both the training and the test dataframe. The classifier uses 600 estimators, which means the model is built from 600 gradient boosted trees, each with a max depth of 15. That depth is large compared with the previous trials with the same classifier, so I suspected the model would overfit the training dataset. The holdout AUC was 0.90666 and the Kaggle score was 0.87600. The Kaggle score is still very good, but it is lower than I expected from the holdout score, which together with the large max depth suggests some overfitting. This result is lower than the grid search with a validation set in trial 3 (0.87976), although slightly higher than the grid search with cross-validation in trial 1 (0.87355).


### Plan for trial 6: 😪
I will use the XGBoost classifier again in the next trial, but this time I will search for the best combination of hyperparameters with Bayesian search. Bayesian search uses the scores of the combinations it has already tried to build a model of how the score depends on the hyperparameters, and uses that model to choose the next combination that looks most promising. It repeats this until it reaches the number of iterations I set. I plan to score each iteration by ROC AUC, using the holdout method with the training and validation datasets. I will tune the number of estimators and the max depth of each tree.


## 6. Bayesian Search with xgboost classifier

In [30]:
# create xgboost model with bayesian search

XGB_pipline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('my_classifier', 
           XGBClassifier()),
    ]
)


# define ranges for bayes search
bayes_search = BayesSearchCV(
    XGB_pipline,
    {
     # my_classifier__n_estimators points to my_classifier->n_estimators
    'my_classifier__n_estimators': [50, 100, 200, 250, 300, 400, 500, 600],  
     # my_classifier__max_depth points to my_classifier->max_depth
    'my_classifier__max_depth':[5, 7, 10, 15, 17]  
    },
    # number of trials 
    n_iter=40,
    random_state=0,
    verbose=1,
    # we still use the predefined train/validation split
    cv=pds,
)

# make fitting on the training dataset and their labels to get the best combination of the hyperparameters
bayes_search.fit(X, y)

# print the best score and the combination of the hyperparameters that got that score
print('best score {}'.format(bayes_search.best_score_))
print('best score {}'.format(bayes_search.best_params_))

Fitting 1 folds for each of 1 candidates, totalling 1 fits
...

In [31]:
# getting the results and saving in csv file 
submission = pd.DataFrame()
# get the id from the test dataframe
submission['id'] = ts_data.index

# predicted probability of class 1 (match) from the best hyperparameter
# combination found during the search
submission['match'] = bayes_search.predict_proba(df2_ts)[:,1]

submission.to_csv('BS_XGB.csv', index=False)


### Thoughts and observations for trial 6: 🤔
This gave score on Kaggle: 0.87928

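The Bayesian search ran 40 iterations, matching the number I set. Unlike random search, the combinations are not picked purely at random: each new combination is chosen using the scores of the ones already tried. The search space defined above has 8 × 5 = 40 combinations, the same as the number of iterations, although the search is not guaranteed to visit each one exactly once.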

The best combination was 'max_depth': 5, 'n_estimators': 200. I didn't pass a value for the numerical imputer, so it uses its configured strategy, which is 'mean' if the imputer was created with the default. In that case the imputer fills the missing values in each numerical feature (column) with that feature's mean, learned from the training data and applied to both the training and the test dataframe. The classifier uses 200 estimators, which means the model is built from 200 gradient boosted trees, each with a max depth of 5. That is a small depth, so I did not expect overfitting on the training dataset. The search score on the holdout split was 0.88155 and the Kaggle score was 0.87928, so the two are close and I see no sign of overfitting. One caveat: the `BayesSearchCV` call above does not set `scoring`, so as written the 0.88155 is most likely the classifier's default score (accuracy) rather than ROC AUC, and it is not directly comparable with the AUC values of the other trials. Even so, the Kaggle score is very good for a search limited to 40 iterations, so I think these hyperparameters suit this dataset.
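The `BayesSearchCV` call in this trial does not pass `scoring`. In scikit-learn's search classes, leaving `scoring` unset means the estimator's own `score` method is used, which is accuracy for classifiers. I assume `BayesSearchCV` behaves the same way since it follows the same interface. The sketch below shows the effect with `GridSearchCV` as a stand-in, on synthetic data (all names and data here are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-in data; the last 100 rows act as the validation fold.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
test_fold = np.full(len(y_demo), -1)
test_fold[300:] = 0
split = PredefinedSplit(test_fold)

grid = {'C': [0.01]}
no_scoring = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=split)
no_scoring.fit(X_demo, y_demo)
auc_scoring = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=split,
                           scoring='roc_auc')
auc_scoring.fit(X_demo, y_demo)

# Recompute both metrics by hand on the same train/validation split.
model = LogisticRegression(C=0.01, max_iter=1000).fit(X_demo[:300], y_demo[:300])
manual_acc = accuracy_score(y_demo[300:], model.predict(X_demo[300:]))
manual_auc = roc_auc_score(y_demo[300:], model.predict_proba(X_demo[300:])[:, 1])

print(no_scoring.best_score_, manual_acc)   # accuracy when scoring is not set
print(auc_scoring.best_score_, manual_auc)  # ROC AUC when scoring='roc_auc'
```

Passing `scoring='roc_auc'` to the search, as the earlier trials do, would make the reported best score comparable with the other rows of the results table.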



# Overall results for all trials: 🥳


| Trial | Score on the experimental protocol applied | Score on Kaggle |
| :--- | :----: | ---: |
| Grid Search with Cross-validation with xgboost classifier | 0.88106 | 0.87355 |
| Grid Search with Cross-validation with LogisticRegression classifier | 0.85792 | 0.86746 |
| Grid Search with validation set with xgboost classifier | 0.91110 | 0.87976 |
| Grid Search with validation set with LogisticRegression classifier | 0.90126 | 0.86704 |
| Random Search with xgboost classifier | 0.90666 | 0.87600 |
| Bayesian Search with xgboost classifier | 0.88155 | 0.87928 |


**From the above table the best trial for me was:**

**Grid Search with validation set with xgboost classifier, with these hyperparameters ('max_depth': 5, 'n_estimators': 100, 'imputer__strategy': 'most_frequent'), as it has the highest Kaggle score (0.87976).**

-------------------------------------------------------------------------------------------------------------------------------

# Questions


## Why is a simple linear regression model (without any activation function) not good for classification tasks, compared to Perceptron/Logistic regression?

* Linear regression gives continuous predicted values that can range from negative infinity to positive infinity, so they cannot be read as probabilities. Binary classification needs a predicted value between 0 and 1, which is what logistic regression gives by passing the linear output through a sigmoid.
* Linear regression fits a straight line to the labels by minimizing squared error, so points far from the boundary, such as outliers or a heavily imbalanced class, pull the line and shift the decision threshold, which can misclassify part of the data. Logistic regression fits a sigmoid curve between the classes, so points that are already classified confidently have little influence, which makes it better suited to classification.
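The range argument in the first bullet can be seen in a minimal sketch on a toy one-dimensional problem (the data is made up for illustration): linear regression predicts values below 0 and above 1, while logistic regression stays inside the [0, 1] interval.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy 1-D data: label is 1 when x is positive.
x = np.linspace(-5, 5, 50).reshape(-1, 1)
y = (x.ravel() > 0).astype(int)

lin = LinearRegression().fit(x, y)
log = LogisticRegression().fit(x, y)

# Predict at points outside the training range as well.
x_new = np.array([[-10.0], [0.5], [10.0]])
lin_pred = lin.predict(x_new)               # unbounded straight-line output
log_proba = log.predict_proba(x_new)[:, 1]  # sigmoid output, always in [0, 1]

print(lin_pred)
print(log_proba)
```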

## What's a decision tree and how is it different from a logistic regression model?
* A decision tree is a supervised machine learning algorithm used for classification and regression. It has a tree structure: each internal node tests a feature of the dataset, each branch is a decision rule (an outcome of that test), and each leaf node holds the predicted outcome.

* There are several differences between a decision tree and logistic regression. Logistic regression learns one coefficient per predictor and a single linear decision boundary, while a decision tree learns a set of classification rules that can capture non-linear relationships and are easier to explain to audiences that are not numerically minded. Which of the two is more accurate depends on the data: logistic regression tends to do well when the classes are close to linearly separable, and a tree does better when they are not. Finally, logistic regression is a parametric model, while a decision tree is a non-parametric model.
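A minimal sketch of the difference, on a toy problem where the positive class sits in the middle of the range (the data is made up for illustration). A shallow tree can express the two thresholds as rules, while logistic regression has a single coefficient and cannot separate the classes with one threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy 1-D data: label is 1 only for 3 <= x <= 6, so no single threshold separates it.
x = np.arange(10).reshape(-1, 1)
y = ((x.ravel() >= 3) & (x.ravel() <= 6)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x, y)
logreg = LogisticRegression().fit(x, y)

tree_acc = tree.score(x, y)
logreg_acc = logreg.score(x, y)

print(export_text(tree, feature_names=['x']))  # the learned rules
print(logreg.coef_, logreg.intercept_)         # one coefficient plus an intercept
print(tree_acc, logreg_acc)
```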

## What's the difference between grid search and random search?
* Grid search builds every combination of the hyperparameter values given in the grid, without repeating any combination, and tries all of them when fitting on the training dataset and its labels. It then returns the combination that gives the best score.
* Random search draws one random combination from the hyperparameter grid at each iteration and records its score. After finishing all iterations, it returns the combination with the best performance among those it tried.
* The difference is that grid search tries all possible combinations in the grid and returns the best of them, while random search tries only as many combinations as the number of iterations it is given. So the output of random search is not necessarily the best of all combinations, only the best of the combinations it tried.
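The difference can be sketched with scikit-learn's `ParameterGrid` and `ParameterSampler`, which generate the candidates for the two search styles. The grid below is a small placeholder (an assumption for illustration, not the grid used in the trials).

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical grid with 4 * 3 = 12 combinations.
example_grid = {
    'my_classifier__n_estimators': [50, 100, 200, 400],
    'my_classifier__max_depth': [3, 5, 7],
}

# Grid search: every combination is a candidate.
all_combos = list(ParameterGrid(example_grid))

# Random search: only n_iter randomly drawn combinations are candidates.
sampled = list(ParameterSampler(example_grid, n_iter=5, random_state=0))

print(len(all_combos), len(sampled))
for combo in sampled:
    print(combo)
```

If the best combination is not among the sampled ones, random search cannot return it, which is the trade-off for the smaller number of fits.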

## What's the difference between bayesian search and random search?

* Random search draws one random combination from the hyperparameter grid at each iteration and records its score. After finishing all iterations, it returns the combination with the best performance among those it tried.

* Bayesian search uses the scores of the combinations it has already tried to build a probabilistic model of how the score depends on the hyperparameters. It uses that model to choose the next combination that looks most promising, evaluates it, updates the model, and repeats until it reaches the number of iterations it is given.

* So the main difference is that Bayesian search learns from its previous results when choosing the next combination, while random search chooses every combination independently at random. Both return the best combination among those they tried, not necessarily the best of all possible combinations in the search space.
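The loop described above can be sketched in a few lines. This is only an illustration of the idea under stated assumptions: it uses a Gaussian process surrogate with an upper-confidence-bound rule, which is one common form of Bayesian optimization and not necessarily what `BayesSearchCV` does internally, and `toy_score` is a made-up stand-in for fitting the pipeline and returning the validation score. The candidate values are the ones from the search space in trial 6.

```python
import itertools
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def toy_score(n_estimators, max_depth):
    # Hypothetical stand-in for "fit the pipeline and return the validation score".
    return -((n_estimators - 200) / 300) ** 2 - ((max_depth - 7) / 6) ** 2

candidates = np.array(list(itertools.product(
    [50, 100, 200, 250, 300, 400, 500, 600], [5, 7, 10, 15, 17])), dtype=float)
# Scale both hyperparameters to [0, 1] so the surrogate treats them alike.
span = candidates.max(axis=0) - candidates.min(axis=0)
scaled = (candidates - candidates.min(axis=0)) / span

# Start from a few random combinations, as random search would.
rng = np.random.default_rng(0)
tried = [int(i) for i in rng.choice(len(candidates), size=3, replace=False)]
scores = [toy_score(*candidates[i]) for i in tried]

for _ in range(7):
    # Fit the surrogate on everything evaluated so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    gp.fit(scaled[tried], scores)
    mean, std = gp.predict(scaled, return_std=True)
    # Upper confidence bound: prefer high predicted score or high uncertainty.
    ucb = mean + 1.96 * std
    ucb[tried] = -np.inf  # do not re-evaluate a combination
    nxt = int(np.argmax(ucb))
    tried.append(nxt)
    scores.append(toy_score(*candidates[nxt]))

best = candidates[tried[int(np.argmax(scores))]]
print(len(tried), best, max(scores))
```

Random search would replace the body of the loop with a single random draw; the surrogate model is what lets Bayesian search spend its iterations near combinations that already look good.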

# References:
* Lectures and Labs
* https://rb.gy/ky4y7f
* https://rb.gy/yly3db
* https://rb.gy/vlaymv
* https://rb.gy/vzipjm
* https://rb.gy/rxzvlk
