https://www.kaggle.com/competitions/cisc-873-dm-f22-a2

# **Problem Formulation**

**Problem:** Given information about a date between two people, we want to predict the probability that the date leads to a successful match.

**Inputs:** 191 features

**Output:** Whether the match between two people is successful or not

**Function required:** Classification & Prediction

**Challenges:** \
1- NaN values.\
2- Pipeline\
3- Determining a suitable classifier\
4- Identifying unimportant columns\
5- Handling imbalanced data\
6- Selecting optimal hyperparameters for each algorithm.\
7- Finding the best score

**What is the impact?**
* If the model predicts the outcome correctly, two people can learn before the event whether they are likely to match, which saves them the time they would otherwise spend waiting to find out.

**What is the ideal solution?**
* The **XGBClassifier** model tuned with **Bayesian** search is the best solution.
* Score **0.88222** (public), **0.88646** (private) on Kaggle


# **Trials**

## **Common Commands** in all models

**What is the experimental protocol used and how was it carried out?** \
1- Read the training and testing data \
2- Preprocess the data using a Pipeline \
3- Split the data\
4- Tune the hyperparameters\
5- Build the model
* I used cross-validation.

**What preprocessing steps are used?**

1- Drop unimportant features \
2- Handle NaN values and imbalanced data\
3- Normalization\
4- Choose Grid, Random, or Bayesian Search.\
5- OneHotEncoder to convert categorical data into numerical.

In [None]:
pip install scikit-optimize # install scikit-optimize to be able to use bayesian search.

Collecting scikit-optimize
  Downloading scikit_optimize-0.9.0-py2.py3-none-any.whl (100 kB)
Collecting pyaml>=16.9
  Downloading pyaml-21.10.1-py2.py3-none-any.whl (24 kB)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-21.10.1 scikit-optimize-0.9.0


In [None]:
# libraries used by all models
import pandas as pd
import numpy as np
from google.colab import drive
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.model_selection import train_test_split, GridSearchCV,RandomizedSearchCV
from xgboost.sklearn import XGBClassifier

In [None]:
#connect to my drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Grid Search**

### **Trial 0**

In this trial, I want to use Random Forest with the grid search technique and hyperparameters (n_estimators and max_depth with different values).

* I will use 191 features.
* I will not drop any columns in this trial.
* I will handle the imbalanced data.

**My thoughts and observations :** The accuracy would be between 0.70 and 0.75.
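
A minimal sketch of one common way to handle imbalanced labels, assuming class reweighting is acceptable. This is an illustration only and not necessarily the approach used in the trials; the toy labels are made up.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 non-matches and 2 matches (imbalanced).
y_toy = np.array([0] * 8 + [1] * 2)

# 'balanced' weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_toy
)
class_weights = dict(zip([0, 1], weights))
print(class_weights)  # the minority class receives the larger weight
```

The same weighting can be requested directly with `RandomForestClassifier(class_weight='balanced')`; `XGBClassifier` exposes `scale_pos_weight` for a similar purpose.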

##### Read Training and Testing data

In [None]:
# Read the training data with read_csv, which takes the path of the CSV file.
df = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/train.csv')
# head(5) returns the first 5 rows, a quick check that the dataset loaded as expected.
df.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828


In [None]:
# Read the testing data with read_csv, which takes the path of the CSV file.
df_test = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/test.csv')
# head(5) returns the first 5 rows, a quick check that the dataset loaded as expected.
df_test.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052


In [None]:
# Display the column names of the training and testing data
print(df.columns)
print(df_test.columns)

Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=192)
Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=191)


#### Preprocessing

I will check the number of NaN values. Next, I will check the data types. If there is object data, I will convert it into categorical data so I can use it.

##### Check NaN values

* Count the missing values in each column using:
 * isnull(): returns a boolean mask marking which values are NA.
 * sum(): returns the number of NaN values in each column.
 * sort_values(): sorts the counts, here in descending order.
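
As a small illustration of this chain, the sketch below applies it to a made-up three-column frame (the column names and values are hypothetical, not taken from the competition data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a different number of NaN values per column.
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [1.0, 2.0, np.nan],
    'c': [1.0, 2.0, 3.0],
})

# isnull() -> boolean mask, sum() -> NaN count per column,
# sort_values(ascending=False) -> columns with the most NaN values first.
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

Columns at the top of the result are the ones where imputation has the most work to do.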

###### Training data

In [None]:
df.isnull().sum().sort_values(ascending=False)

num_in_3    5449
numdat_3    4849
expnum      4627
amb7_2      4519
sinc7_2     4519
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 192, dtype: int64

###### Testing data

In [None]:
# Count the missing values in each column
df_test.isnull().sum().sort_values(ascending=False)

num_in_3    2261
numdat_3    2033
expnum      1951
amb7_2      1904
sinc7_2     1904
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 191, dtype: int64

##### Checking data types (convert object data to categorical data)

In this section, I will check data types and then any object will be converted to categorical data.
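
The conversion can be sketched as a small helper that is applied to each frame separately, so the training and testing data are each converted from their own object columns. The helper name `objects_to_category` and the toy frame are hypothetical and only serve as an illustration.

```python
import pandas as pd

def objects_to_category(frame):
    """Return a copy of frame with every object column cast to category."""
    object_cols = frame.select_dtypes(include=['object']).columns
    return frame.astype({col: 'category' for col in object_cols})

# Hypothetical toy frame; dtype=object is set explicitly for the text column.
toy = pd.DataFrame({
    'field': pd.Series(['Law', 'MBA', 'Law'], dtype=object),
    'age': [27.0, 24.0, 30.0],
})
toy_cat = objects_to_category(toy)
print(toy_cat.dtypes)
```

Numeric columns are left untouched, and the original frame is not modified because `astype` returns a new frame.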

In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 192 entries, gender to id
dtypes: float64(173), int64(11), object(8)
memory usage: 8.7+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 191 entries, gender to id
dtypes: float64(173), int64(10), object(8)
memory usage: 3.6+ MB


So there are 8 object columns in both the training and testing data.

###### Training data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns that match the given dtypes.
# include lists the dtypes that I want to select.
df.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Ed.D. in higher education policy at TC,University of Michigan-Ann Arbor,1290.00,21645.00,"Palo Alto, CA",,,University President
1,Engineering,,,,"Boston, MA",2021,,Engineer or iBanker or consultant
2,Urban Planning,"Rizvi College of Architecture, Bombay University",,,"Bombay, India",,,Real Estate Consulting
3,International Affairs,,,,"Washington, DC",10471,45300.00,public service
4,Business,Harvard College,1400.00,26019.00,Midwest USA,66208,46138.00,undecided
...,...,...,...,...,...,...,...,...
5904,Clinical Psychology,,,,New York,11803,65708.00,Psychologist
5905,MBA,,,,Colombia,,,Consulting
5906,MA Science Education,University of Washington,1155.00,13258.00,Seattle,98115,37881.00,Teacher
5907,Biochemistry,,,,Canada,,,pharmaceuticals and biotechnology


###### Testing data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns that match the given dtypes.
# include lists the dtypes that I want to select.
df_test.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Psychology,,,,Hong Kong,0,,psychologist
1,education,wellesley college,1341.00,25504.00,"atlanta, ga",30071,36223.00,education
2,MBA,,,,San Francisco,10021,55080.00,Consulting
3,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
4,Business,,,,"Atlanta, GA",27870,21590.00,Marketing and Media
...,...,...,...,...,...,...,...,...
2464,Neuroscience and Education,Columbia,1430.00,26908.00,Hong Kong,0,,Academic
2465,School Psychology,Bucknell University,1290.00,25335.00,"Erie, PA",,,school psychologist
2466,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
2467,Mathematics,,,,Vestal,13850,42640.00,college professor


###### Convert training and testing object data to categorical data

In [None]:
# Work on copies so the original training and testing data frames stay unchanged.
df_tr = df.copy()       # copy of the training data (indices and values)
df_ts = df_test.copy()  # copy of the testing data (indices and values)

# obj_tr holds the object-typed (string) columns of the training set.
obj_tr = df.select_dtypes(include=['object'])

# Convert every object column of the training set to the category dtype.
for i in obj_tr:
    df_tr[i] = df_tr[i].astype("category")

# obj_ts holds the object-typed (string) columns of the testing set.
obj_ts = df_test.select_dtypes(include=['object'])

# Convert every object column of the testing set to the category dtype.
for i in obj_ts:
    df_ts[i] = df_ts[i].astype("category")


In [None]:
# Look at the training data:
df_tr

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5904,0,1,2,9,20,2,2.0,18,1,214.0,...,12.0,12.0,9.0,12.0,,,,,,3390
5905,1,24,2,9,20,19,15.0,5,6,199.0,...,,,,,,,,,,4130
5906,0,13,2,11,21,5,5.0,3,18,290.0,...,,,,,,,,,,1178
5907,1,10,2,7,16,6,14.0,9,10,151.0,...,,,,,,,,,,5016


In [None]:
# Look at the testing data:
df_ts

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,2,15,19,18,18.0,14,11,407.0,...,,,,,,,,,,7982
2465,0,5,1,13,9,4,4.0,4,8,339.0,...,,,,,,,,,,7299
2466,1,26,2,2,19,3,,15,3,23.0,...,,,,,,,,,,1818
2467,0,19,2,9,20,11,11.0,9,2,215.0,...,7.0,12.0,12.0,9.0,,,,,,937


In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df_tr.info()
df_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 192 entries, gender to id
dtypes: category(8), float64(173), int64(11)
memory usage: 8.5 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 191 entries, gender to id
dtypes: category(8), float64(173), int64(10)
memory usage: 3.6 MB


So now there is no object data left in the datasets.

#### Model

##### Splitting

I split the data into X and y.

###### Training data

In [None]:
# Split the training data into X_train and y_train
y_train=df_tr['match'] # y_train contains only the match column
X_train=df_tr.drop(columns=['match']) # X_train contains every column except the match column.
# shape returns a tuple with the dimensions of the DataFrame.
print(y_train.shape) 
print(X_train.shape)

(5909,)
(5909, 191)


###### Testing data

In [None]:
X_test=df_ts # X_test contains all testing columns, including id (X_train also keeps id in this trial).
print(X_test.shape)

(2469, 191)


##### Pipeline Tuning

In [None]:
# Separate numerical and categorical features in the training data

# put the numeric feature names in the features_numeric list
features_numeric=list(X_train.select_dtypes(include=['float64','int64']))

# put the categorical feature names in the features_cat list
features_cat=list(X_train.select_dtypes(include=['category']))
# print each list to see which columns it contains.
print('numeric features:', features_numeric)
print('categorical features:', features_cat)

numeric features: ['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'pid', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sin

In [None]:
# Build the pipeline
# A Pipeline chains several processing steps so they can be cross-validated together while their parameters are tuned.
# The parameters of each step are addressed as <step name>__<parameter name>.
# It takes steps, a list of (name, transformer) pairs that holds all the preprocessing I need.
# It saves time by applying the same preprocessing to both train and test data without repeating the code.


# Pipeline for the numerical features
numeric=Pipeline(
    steps=[
        ('imputer', SimpleImputer()), # SimpleImputer fills missing values; the default strategy='mean' replaces NaN with the column mean
        ('scaler', StandardScaler())  # StandardScaler standardizes each feature to zero mean and unit variance
    ]
)
# Pipeline for the categorical features
categorical=Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant')), # strategy='constant' replaces NaN with a fixed placeholder value
        ('onehot', OneHotEncoder(handle_unknown='ignore')) # OneHotEncoder encodes categories as binary columns; categories unseen during fit are ignored
    ]
)
# ColumnTransformer applies separate transformers to the numerical and categorical columns.
# It selects and prepares the columns of the dataset before a model is fitted to the transformed data.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric, features_numeric), # Numerical data
        ('cat', categorical, features_cat)  # Categorical data
    ]
)
# Combine the preprocessing with a suitable classifier.
full_pipline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('my_classifier',
           RandomForestClassifier(), # I used RandomForestClassifier as the classifier.
        )
    ]
)
full_pipline


np.random.seed(0) # seed NumPy's global random generator so the results are reproducible



In [None]:
# Fit the pipeline on the training data and predict on the testing data.
full_pipline = full_pipline.fit(X_train, y_train)
full_pipline.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
# Grid search hyperparameters
# param_grid is a dictionary that contains all the parameter values I want to try.
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean'],
    # preprocessor__num__imputer__strategy points to preprocessor -> num (a Pipeline) -> imputer -> strategy
    # strategy='mean' fills NaN values with the column mean
    'my_classifier__n_estimators': [20, 30, 40],
    # my_classifier__n_estimators points to my_classifier -> n_estimators
    # n_estimators is the total number of trees in the forest.
    'my_classifier__max_depth':[10, 20, 30]
    # my_classifier__max_depth points to my_classifier -> max_depth
    # max_depth is the maximum depth of each tree.
}


# cv is the number of cross-validation folds used for each parameter combination (cv=2 means two-fold)
# scoring is the evaluation metric used to rank the results (here ROC AUC)
# n_jobs is the number of jobs to run in parallel (here 2)
grid_search = GridSearchCV(
    full_pipline, param_grid, cv=2, verbose=1, n_jobs=2,
    scoring='roc_auc')
# Fit the grid search on the training data
grid_search.fit(X_train, y_train)
# best_score_ is the mean cross-validated score of the best parameter combination.
# best_params_ is the parameter setting that gave the best results on the held-out folds.
print('best score {}'.format(grid_search.best_score_))
print('best params {}'.format(grid_search.best_params_))

Fitting 2 folds for each of 9 candidates, totalling 18 fits
best score 0.8280888377566975
best params {'my_classifier__max_depth': 10, 'my_classifier__n_estimators': 40, 'preprocessor__num__imputer__strategy': 'mean'}


Best parameters:
*  max_depth = 10
*  n_estimators = 40
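
The double-underscore names used in the grid (for example `my_classifier__max_depth`) address a parameter inside a named pipeline step. The sketch below shows the same mechanism on synthetic stand-in data; the data, the two-step pipeline, and the grid values are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the competition data).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('my_classifier', RandomForestClassifier(random_state=0)),
])

# '<step name>__<parameter>' reaches into the named pipeline step.
toy_grid = {
    'imputer__strategy': ['mean', 'median'],
    'my_classifier__max_depth': [2, 4],
}
toy_search = GridSearchCV(pipe, toy_grid, cv=2, scoring='roc_auc')
toy_search.fit(X_toy, y_toy)
print(toy_search.best_params_)
print(toy_search.best_score_)
```

Two values for each of two parameters give four candidates, each evaluated on two folds.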

In [None]:
# Write the predicted probabilities to a CSV file for submission.
submission = pd.DataFrame()

submission['id'] = df_ts['id']

submission['match'] = grid_search.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/sample_submission_walkthrough.csv', index=False)

##### Result

ROC AUC in **cross-validation** = 0.8281 \
Score on **Kaggle** = 0.82754
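
Because the grid search is run with `scoring='roc_auc'`, the cross-validation figure is an area under the ROC curve, which measures how well the predicted probabilities rank matches above non-matches. The sketch below, on made-up imbalanced labels, illustrates how this differs from classification accuracy.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 8 non-matches, 2 matches.
y_true = [0] * 8 + [1] * 2

# A model that gives everyone the same low probability and predicts "no match".
constant_proba = [0.1] * 10
constant_labels = [0] * 10

acc = accuracy_score(y_true, constant_labels)  # fraction of correct labels
auc = roc_auc_score(y_true, constant_proba)    # ranking quality of the scores
print(acc, auc)
```

A constant predictor looks good on accuracy for imbalanced data but carries no ranking information, which is why a ranking metric is the more informative choice here.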

### **Trial 1**

In this trial, I want to use XGBClassifier with the grid search technique and hyperparameters (nfold and max_depth with different values) to see whether the result improves.

* I will use 185 features (the 187 training columns minus `match` and `id`).
* I will drop some columns.
* I will handle the imbalanced data.

**My thoughts and observations :** The accuracy would be between 0.82 and 0.85.

##### Read Training and Testing data

In [None]:
# Read the training data with read_csv, which takes the path of the CSV file.
df = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/train.csv')
# head(5) returns the first 5 rows, a quick check that the dataset loaded as expected.
df.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828


In [None]:
# Read the testing data with read_csv, which takes the path of the CSV file.
df_test = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/test.csv')
# head(5) returns the first 5 rows, a quick check that the dataset loaded as expected.
df_test.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052


In [None]:
# Display the column names of the training and testing data
print(df.columns)
print(df_test.columns)

Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=192)
Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=191)


#### Preprocessing

I will first drop some features and then check the number of NaN values. Next, I will check the data types. If there is object data, I will convert it into categorical data so I can use it.

##### Drop some features

In [None]:
# Drop the features considered unimportant from both the training and testing data
# drop() removes the named columns; inplace=True modifies the data frame directly
df.drop(columns=['zipcode','round','position','pid','condtn'],inplace=True)
df_test.drop(columns=['zipcode','round','position','pid','condtn'],inplace=True)

##### Check NaN values

* Count the missing values in each column using:
 * isnull(): returns a boolean mask marking which values are NA.
 * sum(): returns the number of NaN values in each column.
 * sort_values(): sorts the counts, here in descending order.

###### Training data

In [None]:
# Count the missing values in each column
df.isnull().sum().sort_values(ascending=False)

num_in_3    5449
numdat_3    4849
expnum      4627
amb7_2      4519
sinc7_2     4519
            ... 
match          0
partner        0
order          0
wave           0
id             0
Length: 187, dtype: int64

###### Testing data

In [None]:
# Count the missing values in each column
df_test.isnull().sum().sort_values(ascending=False)

num_in_3    2261
numdat_3    2033
expnum      1951
amb7_2      1904
sinc7_2     1904
            ... 
samerace       0
partner        0
order          0
wave           0
id             0
Length: 186, dtype: int64

##### Checking data types (convert object data to categorical data)

In this section, I will check data types and then any object will be converted to categorical data.

In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 187 entries, gender to id
dtypes: float64(172), int64(8), object(7)
memory usage: 8.4+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 186 entries, gender to id
dtypes: float64(172), int64(7), object(7)
memory usage: 3.5+ MB


So there are 7 object columns in both the training and testing data.

###### Training data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns that match the given dtypes.
# include lists the dtypes that I want to select.
df.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,income,career
0,Ed.D. in higher education policy at TC,University of Michigan-Ann Arbor,1290.00,21645.00,"Palo Alto, CA",,University President
1,Engineering,,,,"Boston, MA",,Engineer or iBanker or consultant
2,Urban Planning,"Rizvi College of Architecture, Bombay University",,,"Bombay, India",,Real Estate Consulting
3,International Affairs,,,,"Washington, DC",45300.00,public service
4,Business,Harvard College,1400.00,26019.00,Midwest USA,46138.00,undecided
...,...,...,...,...,...,...,...
5904,Clinical Psychology,,,,New York,65708.00,Psychologist
5905,MBA,,,,Colombia,,Consulting
5906,MA Science Education,University of Washington,1155.00,13258.00,Seattle,37881.00,Teacher
5907,Biochemistry,,,,Canada,,pharmaceuticals and biotechnology


###### Testing data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns that match the given dtypes.
# include lists the dtypes that I want to select.
df_test.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,income,career
0,Psychology,,,,Hong Kong,,psychologist
1,education,wellesley college,1341.00,25504.00,"atlanta, ga",36223.00,education
2,MBA,,,,San Francisco,55080.00,Consulting
3,Law,,,,Brooklyn,26482.00,Intellectual Property Attorney
4,Business,,,,"Atlanta, GA",21590.00,Marketing and Media
...,...,...,...,...,...,...,...
2464,Neuroscience and Education,Columbia,1430.00,26908.00,Hong Kong,,Academic
2465,School Psychology,Bucknell University,1290.00,25335.00,"Erie, PA",,school psychologist
2466,Law,,,,Brooklyn,26482.00,Intellectual Property Attorney
2467,Mathematics,,,,Vestal,42640.00,college professor


###### Convert training and testing object data to categorical data

In [None]:
# Work on copies so the original training and testing data frames stay unchanged.
df_tr = df.copy()       # copy of the training data (indices and values)
df_ts = df_test.copy()  # copy of the testing data (indices and values)

# obj_tr holds the object-typed (string) columns of the training set.
obj_tr = df.select_dtypes(include=['object'])

# Convert every object column of the training set to the category dtype.
for i in obj_tr:
    df_tr[i] = df_tr[i].astype("category")

# obj_ts holds the object-typed (string) columns of the testing set.
obj_ts = df_test.select_dtypes(include=['object'])

# Convert every object column of the testing set to the category dtype.
for i in obj_ts:
    df_ts[i] = df_ts[i].astype("category")


In [None]:
# Look at the training data:
df_tr

Unnamed: 0,gender,idg,wave,positin1,order,partner,match,int_corr,samerace,age_o,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,14,2.0,14,12,0,-0.03,0,27.0,...,,,,,,,,,,2583
1,1,14,3,,8,8,0,0.21,0,24.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,13,8.0,10,10,0,0.43,0,34.0,...,,,,,,,,,,4840
3,1,38,9,13.0,6,7,0,0.72,1,25.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,14,6.0,20,17,0,0.33,0,27.0,...,,,,,,,,,,4828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5904,0,1,9,2.0,18,1,0,-0.22,1,23.0,...,12.0,12.0,9.0,12.0,,,,,,3390
5905,1,24,9,15.0,5,6,0,0.08,0,30.0,...,,,,,,,,,,4130
5906,0,13,11,5.0,3,18,0,0.35,0,34.0,...,,,,,,,,,,1178
5907,1,10,7,14.0,9,10,1,0.45,0,28.0,...,,,,,,,,,,5016


In [None]:
# Look at the testing data:
df_ts

Unnamed: 0,gender,idg,wave,positin1,order,partner,int_corr,samerace,age_o,race_o,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,,13,13,-0.13,0,21.0,2.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,14,6.0,4,8,0.12,0,24.0,6.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,9,16.0,15,19,0.11,0,27.0,3.0,...,,,,,,,,,,6757
3,1,26,2,,8,10,0.11,1,23.0,2.0,...,,,,,,,,,,2275
4,0,29,7,7.0,10,5,0.45,0,27.0,4.0,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,15,18.0,14,11,0.74,0,24.0,2.0,...,,,,,,,,,,7982
2465,0,5,13,4.0,4,8,,0,,,...,,,,,,,,,,7299
2466,1,26,2,,15,3,-0.13,0,21.0,4.0,...,,,,,,,,,,1818
2467,0,19,9,11.0,9,2,0.43,0,26.0,4.0,...,7.0,12.0,12.0,9.0,,,,,,937


In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df_tr.info()
df_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 187 entries, gender to id
dtypes: category(7), float64(172), int64(8)
memory usage: 8.2 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 186 entries, gender to id
dtypes: category(7), float64(172), int64(7)
memory usage: 3.5 MB


So now there is no object data left in the datasets.

#### Model

##### Splitting

I split the data into X and y.

###### Training data

In [None]:
# Split the training data into X_train and y_train
y_train=df_tr['match'] # y_train contains only the match column
X_train=df_tr.drop(columns=['match','id']) # X_train contains every column except the match and id columns.
print(y_train.shape)
print(X_train.shape)

(5909,)
(5909, 185)


###### Testing data

In [None]:
X_test=df_ts # X_test keeps all testing columns, including id; the ColumnTransformer below only selects the columns listed from X_train.
print(X_test.shape)

(2469, 186)


##### Pipeline Tuning

In [None]:
# Separate numerical and categorical features in the training data

# put the numeric feature names in the features_numeric list
features_numeric=list(X_train.select_dtypes(include=['float64','int64']))

# put the categorical feature names in the features_cat list
features_cat=list(X_train.select_dtypes(include=['category']))
# print each list to see which columns it contains.
print('numeric features:', features_numeric)
print('categorical features:', features_cat)

numeric features: ['gender', 'idg', 'wave', 'positin1', 'order', 'partner', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sinc1_s', 'intel1_s', 'fun1_s', 'amb1_s',

In [None]:
# Build the pipeline
# A Pipeline chains several processing steps so they can be cross-validated together while their parameters are tuned.
# The parameters of each step are addressed as <step name>__<parameter name>.
# It takes steps, a list of (name, transformer) pairs that holds all the preprocessing I need.
# It saves time by applying the same preprocessing to both train and test data without repeating the code.


# Pipeline for the numerical features
numeric=Pipeline(
    steps=[
        ('imputer', SimpleImputer()), # SimpleImputer fills missing values; the default strategy='mean' replaces NaN with the column mean
        ('scaler', StandardScaler())  # StandardScaler standardizes each feature to zero mean and unit variance
    ]
)
# Pipeline for the categorical features
categorical=Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant')), # strategy='constant' replaces NaN with a fixed placeholder value
        ('onehot', OneHotEncoder(handle_unknown='ignore')) # OneHotEncoder encodes categories as binary columns; categories unseen during fit are ignored
    ]
)
# ColumnTransformer applies separate transformers to the numerical and categorical columns.
# It selects and prepares the columns of the dataset before a model is fitted to the transformed data.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric, features_numeric), # Numerical data
        ('cat', categorical, features_cat)  # Categorical data
    ]
)
# Combine the preprocessing with a suitable classifier.
full_pipline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('my_classifier',
           XGBClassifier(), # I used XGBClassifier as the classifier.
        )
    ]
)
full_pipline


np.random.seed(0) # seed NumPy's global random generator so the results are reproducible



In [None]:
# Fit the pipeline object and predict on the test data.
full_pipline = full_pipline.fit(X_train, y_train)
full_pipline.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])
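
Before tuning, it helps to see how the "__" naming reaches a parameter inside nested steps. The following is a minimal sketch, assuming a single hypothetical column `age` and using `LogisticRegression` only as a stand-in classifier (it is not the model used in this notebook). The step names match the pipeline above.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same step names as the notebook; LogisticRegression is only a stand-in classifier.
numeric = Pipeline(steps=[("imputer", SimpleImputer()), ("scaler", StandardScaler())])
preprocessor = ColumnTransformer(transformers=[("num", numeric, ["age"])])
demo_pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("my_classifier", LogisticRegression())]
)

# Every tunable parameter is reachable through a "step__substep__parameter" key.
param_names = demo_pipeline.get_params().keys()
print("preprocessor__num__imputer__strategy" in param_names)

# set_params accepts the same keys that are written in param_grid.
demo_pipeline.set_params(preprocessor__num__imputer__strategy="median")
print(demo_pipeline.get_params()["preprocessor__num__imputer__strategy"])
```

A key that does not appear in `get_params()` cannot be tuned through the search, so checking the keys first is a quick way to catch a misspelled parameter name.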

In [None]:
# Grid search hyperparameters
# param_grid is a dictionary that contains all the parameter values I want to try.
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean'],
    # preprocessor__num__imputer__strategy points to preprocessor -> num (a Pipeline) -> imputer -> strategy
    # strategy='mean' fills NaN values with the column mean
    'my_classifier__nfold': [30, 40, 50],
    # my_classifier__nfold points to my_classifier -> nfold
    # Note: nfold is an argument of xgboost.cv, not a documented XGBClassifier hyperparameter, so it probably has no effect on the fitted model.
    'my_classifier__max_depth':[20, 30, 40]
    # my_classifier__max_depth points to my_classifier -> max_depth
    # max_depth is the maximum depth of each tree.
}

# cv is the number of cross-validation folds used for each combination of parameters.
# scoring is the evaluation metric used to rank the results.
# n_jobs is the number of jobs to run in parallel.
# cv=2 means two-fold cross-validation
# n_jobs = 2
grid_search = GridSearchCV(
    full_pipline, param_grid, cv=2, verbose=1, n_jobs=2, 
    scoring='roc_auc')
# Run the grid search on the training data
grid_search.fit(X_train, y_train)
# best_score_ is the mean cross-validated score of the best estimator.
# best_params_ is the parameter setting that gave the best results on the held-out folds.
print('best score {}'.format(grid_search.best_score_))
print('best score {}'.format(grid_search.best_params_))

Fitting 2 folds for each of 9 candidates, totalling 18 fits
best score 0.8690726991774422
best score {'my_classifier__max_depth': 30, 'my_classifier__nfold': 30, 'preprocessor__num__imputer__strategy': 'mean'}


Best parameters:
*  max_depth = 30
*  nfold = 30
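
The log line "Fitting 2 folds for each of 9 candidates, totalling 18 fits" follows directly from the grid: every combination of listed values is one candidate, and each candidate is fitted once per fold. The sketch below reproduces that arithmetic with the standard library only. It does not run any model.

```python
from itertools import product

# The same grid as in the cell above.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean"],
    "my_classifier__nfold": [30, 40, 50],
    "my_classifier__max_depth": [20, 30, 40],
}
cv_folds = 2

# One candidate per combination of values: 1 * 3 * 3.
candidates = list(product(*param_grid.values()))
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 9 18
```

Adding one more value to any list multiplies the number of candidates, which is why grid search gets expensive quickly.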

In [None]:
# Use this cell to write the predictions to the submission CSV file.
submission = pd.DataFrame()

submission['id'] = df_ts['id']

submission['match'] = grid_search.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/sample_submission_walkthrough.csv', index=False)

##### Result

ROC AUC in **Cross-Validation** = 0.8691 (the grid search uses `scoring='roc_auc'`) \
Score in **kaggle** = 0.87710
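
Because the grid search ranks candidates with `scoring='roc_auc'`, the cross-validation number is a ROC AUC and not a plain accuracy. The two can differ a lot on imbalanced labels. Here is a small sketch on made-up labels (not the competition data) with a constant "no match" prediction:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels: 9 non-matches and 1 match (made-up data).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# A model that gives every pair the same low match probability.
constant_proba = np.full(len(y_true), 0.1)
constant_labels = (constant_proba >= 0.5).astype(int)

accuracy = accuracy_score(y_true, constant_labels)
auc = roc_auc_score(y_true, constant_proba)
print(accuracy, auc)  # 0.9 0.5
```

The constant model looks good by accuracy but gets the chance-level AUC of 0.5, so the two numbers should not be compared directly.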

### **Trial 2**

In this trial, I drop every column with more than 80% NaN values and use 4-fold cross-validation instead of 2-fold to see whether the result improves.

*   I will use 188 features (the shape of X_train after dropping the columns).
*   I will drop some columns.
*   I will address the imbalanced data.


**My thoughts and observations:** I expect the accuracy to be between 0.82 and 0.85.
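
The 80% rule can be sketched on a small made-up frame before applying it to the real data. The column names below are hypothetical; `isnull().mean()` gives the fraction of NaN values per column, which avoids hard-coding the number of rows.

```python
import numpy as np
import pandas as pd

# Made-up frame: "mostly_missing" is 90% NaN, "some_missing" is 20% NaN.
toy_train = pd.DataFrame({
    "mostly_missing": [1.0] + [np.nan] * 9,
    "some_missing": [np.nan, np.nan] + [3.0] * 8,
    "complete": range(10),
})

nan_ratio = toy_train.isnull().mean()          # fraction of NaN per column
to_drop = nan_ratio[nan_ratio > 0.80].index    # columns above the 80% threshold
toy_train = toy_train.drop(columns=to_drop)
print(list(to_drop), list(toy_train.columns))
```

The same `to_drop` index could then be passed to the test frame so both sets keep identical columns.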

##### Read Training and Testing data

In [None]:
# Read the training data with read_csv, which takes the path of the CSV file to read.
df = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/train.csv')
# head(5) returns the first 5 rows; it is a quick way to check that the dataset contains the expected kind of data.
df.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828


In [None]:
# Read the testing data with read_csv, which takes the path of the CSV file to read.
df_test = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/test.csv')
# head(5) returns the first 5 rows; it is a quick way to check that the dataset contains the expected kind of data.
df_test.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052


In [None]:
# Display the column names of the training and testing data
print(df.columns)
print(df_test.columns)

Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=192)
Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=191)


#### Preprocessing

I will check the number of NaN values. Next, I will check the data types. If there is object data, I will convert it into categorical data so I can use it.

##### Drop some features

In [None]:
# Drop unimportant training and testing features
# Any column with more than 80% NaN values in the training data is removed from both sets.
for col_name in df.columns:
  nan_ratio=df[col_name].isnull().sum()/len(df) # len(df) is the number of training rows (5909)
  if nan_ratio > 0.80:
    df.drop(columns=[col_name],inplace=True)
    df_test.drop(columns=[col_name],inplace=True)
df_test

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,2,15,19,18,18.0,14,11,407.0,...,,,,,,,,,,7982
2465,0,5,1,13,9,4,4.0,4,8,339.0,...,,,,,,,,,,7299
2466,1,26,2,2,19,3,,15,3,23.0,...,,,,,,,,,,1818
2467,0,19,2,9,20,11,11.0,9,2,215.0,...,7.0,12.0,12.0,9.0,,,,,,937


##### Check NaN values

* Check the number of missing values by using
 * isnull(): returns a boolean value for each cell indicating whether it is NA.
 * sum(): returns the number of NaN values in each column.
 * sort_values(): sorts these counts in descending order (ascending=False).
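
The three calls in this chain can be followed step by step on a tiny made-up frame (the column names `a`, `b`, `c` are only for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [np.nan, 2.0, 3.0],
    "c": [1, 2, 3],
})

missing_mask = toy.isnull()                # boolean frame, True where a value is NaN
missing_per_column = missing_mask.sum()    # True counts as 1, so this counts NaN per column
ranked = missing_per_column.sort_values(ascending=False)
print(ranked)
```

Here `a` has two missing values, `b` has one and `c` has none, so they come out in that order.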

###### Training data

In [None]:
# Count the missing values in each column
df.isnull().sum().sort_values(ascending=False)

expnum      4627
amb7_2      4519
sinc7_2     4519
shar7_2     4505
fun7_2      4498
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 190, dtype: int64

###### Testing data

In [None]:
# Count the missing values in each column
df_test.isnull().sum().sort_values(ascending=False)

expnum      1951
amb7_2      1904
sinc7_2     1904
shar7_2     1899
fun7_2      1896
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 189, dtype: int64

##### Checking data types (convert object data to categorical data)

In this section, I will check data types and then any object will be converted to categorical data.

In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 190 entries, gender to id
dtypes: float64(171), int64(11), object(8)
memory usage: 8.6+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 189 entries, gender to id
dtypes: float64(171), int64(10), object(8)
memory usage: 3.6+ MB


So there are 8 object columns in both the training and testing data.

###### Training data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns that match the given dtypes.
# include lists the dtypes that I want to select.
df.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Ed.D. in higher education policy at TC,University of Michigan-Ann Arbor,1290.00,21645.00,"Palo Alto, CA",,,University President
1,Engineering,,,,"Boston, MA",2021,,Engineer or iBanker or consultant
2,Urban Planning,"Rizvi College of Architecture, Bombay University",,,"Bombay, India",,,Real Estate Consulting
3,International Affairs,,,,"Washington, DC",10471,45300.00,public service
4,Business,Harvard College,1400.00,26019.00,Midwest USA,66208,46138.00,undecided
...,...,...,...,...,...,...,...,...
5904,Clinical Psychology,,,,New York,11803,65708.00,Psychologist
5905,MBA,,,,Colombia,,,Consulting
5906,MA Science Education,University of Washington,1155.00,13258.00,Seattle,98115,37881.00,Teacher
5907,Biochemistry,,,,Canada,,,pharmaceuticals and biotechnology


###### Testing data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns that match the given dtypes.
# include lists the dtypes that I want to select.
df_test.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Psychology,,,,Hong Kong,0,,psychologist
1,education,wellesley college,1341.00,25504.00,"atlanta, ga",30071,36223.00,education
2,MBA,,,,San Francisco,10021,55080.00,Consulting
3,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
4,Business,,,,"Atlanta, GA",27870,21590.00,Marketing and Media
...,...,...,...,...,...,...,...,...
2464,Neuroscience and Education,Columbia,1430.00,26908.00,Hong Kong,0,,Academic
2465,School Psychology,Bucknell University,1290.00,25335.00,"Erie, PA",,,school psychologist
2466,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
2467,Mathematics,,,,Vestal,13850,42640.00,college professor


###### Convert training and testing object data to categorical data

In [None]:
# Make a copy of the training and testing data frames before any assignment
df_tr=df.copy() # copy of the training data (indices and values)
df_ts=df_test.copy() # copy of the testing data (indices and values)

# obj_tr holds the object columns of the training set,
# selected with select_dtypes.
obj_tr=df.select_dtypes(include=['object'])

# convert every object column to the category dtype
for i in obj_tr:
   df_tr[i]=df_tr[i].astype("category")

# obj_ts holds the object columns of the testing set,
# selected with select_dtypes.
obj_ts=df_test.select_dtypes(include=['object'])

# convert every object column to the category dtype
for i in obj_ts:
   df_ts[i]=df_ts[i].astype("category")


In [None]:
# Inspect the training data:
df_tr

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5904,0,1,2,9,20,2,2.0,18,1,214.0,...,12.0,12.0,9.0,12.0,,,,,,3390
5905,1,24,2,9,20,19,15.0,5,6,199.0,...,,,,,,,,,,4130
5906,0,13,2,11,21,5,5.0,3,18,290.0,...,,,,,,,,,,1178
5907,1,10,2,7,16,6,14.0,9,10,151.0,...,,,,,,,,,,5016


In [None]:
# Inspect the testing data:
df_ts

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,2,15,19,18,18.0,14,11,407.0,...,,,,,,,,,,7982
2465,0,5,1,13,9,4,4.0,4,8,339.0,...,,,,,,,,,,7299
2466,1,26,2,2,19,3,,15,3,23.0,...,,,,,,,,,,1818
2467,0,19,2,9,20,11,11.0,9,2,215.0,...,7.0,12.0,12.0,9.0,,,,,,937


In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df_tr.info()
df_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 190 entries, gender to id
dtypes: category(8), float64(171), int64(11)
memory usage: 8.4 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 189 entries, gender to id
dtypes: category(8), float64(171), int64(10)
memory usage: 3.5 MB


So now there is no object data in the datasets.
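
The conversion can also be written as one small helper that detects the object columns of whichever frame it receives. This is only a sketch on made-up frames; `objects_to_category` is a hypothetical name, not a pandas function.

```python
import pandas as pd

def objects_to_category(frame):
    """Return a copy of frame with every object column cast to the category dtype."""
    object_cols = frame.select_dtypes(include=["object"]).columns
    return frame.astype({col: "category" for col in object_cols})

# Made-up frames; dtype=object is set explicitly so the example does not depend
# on how the installed pandas version infers string columns.
toy_train = pd.DataFrame({"field": pd.Series(["Law", "MBA"], dtype=object), "age": [25, 30]})
toy_test = pd.DataFrame({"field": pd.Series(["MBA", "Physics"], dtype=object), "age": [27, 22]})

toy_train_cat = objects_to_category(toy_train)
toy_test_cat = objects_to_category(toy_test)
print(toy_train_cat.dtypes)
print(toy_test_cat.dtypes)
```

Because each frame is inspected on its own, the training and testing sets do not need to share a column list for the conversion to work.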

#### Model

##### Splitting

I split the data into X and y.

###### Training data

In [None]:
# Split the training data into X_train and y_train
y_train=df_tr['match'] # y_train contains only the match column
X_train=df_tr.drop(columns=['match','id']) # X_train contains all columns except match and id
# shape returns a tuple with the dimensions of the DataFrame.
print(y_train.shape) 
print(X_train.shape)

(5909,)
(5909, 188)


###### Testing data

In [None]:
X_test=df_ts # X_test keeps every test column, including id; the ColumnTransformer only selects the listed feature columns, so id is not used by the model
print(X_test.shape)

(2469, 189)
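
The test frame has one more column than X_train because it still contains `id`. That does not break the pipeline, since a `ColumnTransformer` only passes on the columns it is given and drops the rest by default (`remainder='drop'`). A minimal sketch on a made-up frame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up test frame with one feature and an id column.
toy_test = pd.DataFrame({"age": [20.0, 30.0, 40.0], "id": [934, 6539, 6757]})

# Only "age" is listed; remainder defaults to "drop", so "id" is left out.
preprocessor = ColumnTransformer(transformers=[("num", StandardScaler(), ["age"])])
transformed = preprocessor.fit_transform(toy_test)
print(transformed.shape)  # (3, 1)
```

The `id` column therefore stays available for the submission file without being used as a feature.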


##### Pipeline Tuning

In [None]:
# Separate numerical and categorical features in the training data

# put the numeric features in the features_numeric list
features_numeric=list(X_train.select_dtypes(include=['float64','int64']))

# put the categorical features in the features_cat list
features_cat=list(X_train.select_dtypes(include=['category']))
# print each list to see which columns it contains.
print('numeric features:', features_numeric)
print('categorical features:', features_cat)

numeric features: ['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'pid', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sin

In [None]:
# Build the pipeline
# A Pipeline chains several processing steps so they can be cross-validated together while their parameters are tuned.
# The parameters of each step are set through the step name and the parameter name, separated by "__".
# It takes steps as a parameter, which lists all the preprocessing that I need.
# It saves time by applying the same preprocessing to both train and test data without repeating the process.


# Create a pipeline for the numerical features and set its hyperparameters
numeric=Pipeline(
    steps=[
           ('imputer', SimpleImputer()), # SimpleImputer handles missing values; the default strategy='mean' fills NaN values with the column mean
           ('scaler', StandardScaler())  # StandardScaler standardizes each numerical feature
    ]
)
categorical=Pipeline(
    steps=[
           ('imputer',SimpleImputer(strategy='constant')), # strategy='constant' fills NaN values with a constant value
           ('onehot',OneHotEncoder(handle_unknown='ignore')) # OneHotEncoder encodes the categorical data
    ]
)
# ColumnTransformer builds and applies separate transformers for the numerical and categorical columns.
# It selects and prepares the columns of the dataset before a model is fitted to the transformed data.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric, features_numeric),# Numerical data
        ('cat', categorical, features_cat) # Categorical data
    ]
)
# Combine the preprocessing with a suitable classifier.
full_pipline = Pipeline(  
    steps=[
        ('preprocessor', preprocessor), 
        ('my_classifier', 
           XGBClassifier(), # I used XGBClassifier as a classifier.
        )
    ]
)
full_pipline


np.random.seed(0) # used to make the random numbers predictable



In [None]:
# Fit the pipeline object and predict on the test data.
full_pipline = full_pipline.fit(X_train, y_train)
full_pipline.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])
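
The prediction on the test data works even though columns such as `field` or `career` can contain values that never appear in the training data. That is the job of `handle_unknown='ignore'` in the `OneHotEncoder`. A short sketch with made-up category values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up values: "Physics" appears only in the test column.
train_fields = np.array([["Law"], ["MBA"], ["Law"]], dtype=object)
test_fields = np.array([["MBA"], ["Physics"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_fields)

# The encoder returns a sparse matrix by default; toarray() makes it readable.
encoded = encoder.transform(test_fields).toarray()
print(encoder.categories_)
print(encoded)
```

The unseen value is encoded as a row of zeros instead of raising an error, so the model treats it as "none of the known categories".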

In [None]:
# Grid search hyperparameters
# param_grid is a dictionary that contains all the parameter values I want to try.
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean'],
    # preprocessor__num__imputer__strategy points to preprocessor -> num (a Pipeline) -> imputer -> strategy
    # strategy='mean' fills NaN values with the column mean
    'my_classifier__nfold': [30, 40, 50],
    # my_classifier__nfold points to my_classifier -> nfold
    # Note: nfold is an argument of xgboost.cv, not a documented XGBClassifier hyperparameter, so it probably has no effect on the fitted model.
    'my_classifier__max_depth':[20, 30, 40]
    # my_classifier__max_depth points to my_classifier -> max_depth
    # max_depth is the maximum depth of each tree.
}
# cv is the number of cross-validation folds used for each combination of parameters.
# scoring is the evaluation metric used to rank the results.
# n_jobs is the number of jobs to run in parallel.
# cv=4 means four-fold cross-validation
# n_jobs = 2
grid_search = GridSearchCV(
    full_pipline, param_grid, cv=4, verbose=1, n_jobs=2, 
    scoring='roc_auc')
# Run the grid search on the training data
grid_search.fit(X_train, y_train)
# best_score_ is the mean cross-validated score of the best estimator.
# best_params_ is the parameter setting that gave the best results on the held-out folds.
print('best score {}'.format(grid_search.best_score_))
print('best score {}'.format(grid_search.best_params_))

Fitting 4 folds for each of 9 candidates, totalling 36 fits
best score 0.8774803897889172
best score {'my_classifier__max_depth': 30, 'my_classifier__nfold': 30, 'preprocessor__num__imputer__strategy': 'mean'}


Best parameters:
*  max_depth = 30
*  nfold = 30

In [None]:
# Use this cell to write the predictions to the submission CSV file.
submission = pd.DataFrame()

submission['id'] = df_ts['id']

submission['match'] = grid_search.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/sample_submission_walkthrough.csv', index=False)

##### Result

ROC AUC in **Cross-Validation** = 0.87748 (the grid search uses `scoring='roc_auc'`) \
Score in **kaggle** = 0.87568

## **Random Search**

### **Trial 0**

In this trial, I want to test the Random Forest with the hyperparameters n_estimators and max_depth to see whether the results improve.



* In this trial, I will use 190 features.
* I will not drop any columns in this trial.
* I will address the imbalanced data.
* I will use random search for tuning.

**My thoughts and observations:** I expect the accuracy to be between 0.75 and 0.80.
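
Unlike grid search, random search evaluates only a fixed number of randomly drawn combinations. The sketch below uses scikit-learn's `ParameterSampler` to show what such a draw looks like. The value lists and `n_iter=4` are illustrative assumptions, not the settings of this trial.

```python
from sklearn.model_selection import ParameterSampler

# Illustrative search space; the max_depth values are hypothetical.
param_random = {
    "my_classifier__n_estimators": [20, 30, 40],
    "my_classifier__max_depth": [5, 10, 20],
}

# Draw 4 of the 9 possible combinations; random_state makes the draw repeatable.
sampled = list(ParameterSampler(param_random, n_iter=4, random_state=0))
for params in sampled:
    print(params)
```

With only lists in the search space, the combinations are drawn without replacement, so the cost is controlled by `n_iter` rather than by the size of the full grid.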

##### Read Training and Testing data

In [None]:
# Read the training data with read_csv, which takes the path of the CSV file to read.
df = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/train.csv')
# head(5) returns the first 5 rows; it is a quick way to check that the dataset contains the expected kind of data.
df.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828


In [None]:
# Read the testing data with read_csv, which takes the path of the CSV file to read.
df_test = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/test.csv')
# head(5) returns the first 5 rows; it is a quick way to check that the dataset contains the expected kind of data.
df_test.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052


In [None]:
# Display the column names of the training and testing data
print(df.columns)
print(df_test.columns)

Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=192)
Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=191)


#### Preprocessing

I will check the number of NaN values. Next, I will check the data types. If there is object data, I will convert it into categorical data so I can use it.

##### Check NaN values

* Check the number of missing values by using
 * isnull(): returns a boolean value for each cell indicating whether it is NA.
 * sum(): returns the number of NaN values in each column.
 * sort_values(): sorts these counts in descending order (ascending=False).

###### Training data

In [None]:
# Count the missing values in each column
df.isnull().sum().sort_values(ascending=False)

num_in_3    5449
numdat_3    4849
expnum      4627
amb7_2      4519
sinc7_2     4519
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 192, dtype: int64

###### Testing data

In [None]:
# Count the missing values in each column
df_test.isnull().sum().sort_values(ascending=False)

num_in_3    2261
numdat_3    2033
expnum      1951
amb7_2      1904
sinc7_2     1904
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 191, dtype: int64

##### Checking data types (convert object data to categorical data)

In this section, I will check data types and then any object will be converted to categorical data.

In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 192 entries, gender to id
dtypes: float64(173), int64(11), object(8)
memory usage: 8.7+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 191 entries, gender to id
dtypes: float64(173), int64(10), object(8)
memory usage: 3.6+ MB


So there are 8 object columns in both the training and testing data.

###### Training data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns that match the given dtypes.
# include lists the dtypes that I want to select.
df.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Ed.D. in higher education policy at TC,University of Michigan-Ann Arbor,1290.00,21645.00,"Palo Alto, CA",,,University President
1,Engineering,,,,"Boston, MA",2021,,Engineer or iBanker or consultant
2,Urban Planning,"Rizvi College of Architecture, Bombay University",,,"Bombay, India",,,Real Estate Consulting
3,International Affairs,,,,"Washington, DC",10471,45300.00,public service
4,Business,Harvard College,1400.00,26019.00,Midwest USA,66208,46138.00,undecided
...,...,...,...,...,...,...,...,...
5904,Clinical Psychology,,,,New York,11803,65708.00,Psychologist
5905,MBA,,,,Colombia,,,Consulting
5906,MA Science Education,University of Washington,1155.00,13258.00,Seattle,98115,37881.00,Teacher
5907,Biochemistry,,,,Canada,,,pharmaceuticals and biotechnology


###### Testing data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns that match the given dtypes.
# include lists the dtypes that I want to select.
df_test.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Psychology,,,,Hong Kong,0,,psychologist
1,education,wellesley college,1341.00,25504.00,"atlanta, ga",30071,36223.00,education
2,MBA,,,,San Francisco,10021,55080.00,Consulting
3,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
4,Business,,,,"Atlanta, GA",27870,21590.00,Marketing and Media
...,...,...,...,...,...,...,...,...
2464,Neuroscience and Education,Columbia,1430.00,26908.00,Hong Kong,0,,Academic
2465,School Psychology,Bucknell University,1290.00,25335.00,"Erie, PA",,,school psychologist
2466,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
2467,Mathematics,,,,Vestal,13850,42640.00,college professor


###### Convert training and testing object data to categorical data

In [None]:
# Make a copy of the training and testing data frames before any assignment
df_tr=df.copy() # copy of the training data (indices and values)
df_ts=df_test.copy() # copy of the testing data (indices and values)

# obj_tr holds the object columns of the training set,
# selected with select_dtypes.
obj_tr=df.select_dtypes(include=['object'])

# convert every object column to the category dtype
for i in obj_tr:
   df_tr[i]=df_tr[i].astype("category")

# obj_ts holds the object columns of the testing set,
# selected with select_dtypes.
obj_ts=df_test.select_dtypes(include=['object'])

# convert every object column to the category dtype
for i in obj_ts:
   df_ts[i]=df_ts[i].astype("category")


In [None]:
# Inspect the training data:
df_tr

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5904,0,1,2,9,20,2,2.0,18,1,214.0,...,12.0,12.0,9.0,12.0,,,,,,3390
5905,1,24,2,9,20,19,15.0,5,6,199.0,...,,,,,,,,,,4130
5906,0,13,2,11,21,5,5.0,3,18,290.0,...,,,,,,,,,,1178
5907,1,10,2,7,16,6,14.0,9,10,151.0,...,,,,,,,,,,5016


In [None]:
# Inspect the testing data:
df_ts

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,2,15,19,18,18.0,14,11,407.0,...,,,,,,,,,,7982
2465,0,5,1,13,9,4,4.0,4,8,339.0,...,,,,,,,,,,7299
2466,1,26,2,2,19,3,,15,3,23.0,...,,,,,,,,,,1818
2467,0,19,2,9,20,11,11.0,9,2,215.0,...,7.0,12.0,12.0,9.0,,,,,,937


In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df_tr.info()
df_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 192 entries, gender to id
dtypes: category(8), float64(173), int64(11)
memory usage: 8.5 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 191 entries, gender to id
dtypes: category(8), float64(173), int64(10)
memory usage: 3.6 MB


So now there is no object data in the datasets.

#### Model

##### Splitting

I split the data into X and y.

###### Training data

In [None]:
# Split the training data into X_train and y_train
y_train=df_tr['match'] # y_train contains only the match column
X_train=df_tr.drop(columns=['match','id']) # X_train contains all columns except match and id
# shape returns a tuple with the dimensions of the DataFrame.
print(y_train.shape) 
print(X_train.shape)

(5909,)
(5909, 190)


###### Testing data

In [None]:
X_test=df_ts # X_test keeps every test column, including id; the ColumnTransformer only selects the listed feature columns, so id is not used by the model
print(X_test.shape)

(2469, 191)


##### Pipeline Tuning

In [None]:
# Separate numerical and categorical features in the training data

# put the numeric features in the features_numeric list
features_numeric=list(X_train.select_dtypes(include=['float64','int64']))

# put the categorical features in the features_cat list
features_cat=list(X_train.select_dtypes(include=['category']))
# print each list to see which columns it contains.
print('numeric features:', features_numeric)
print('categorical features:', features_cat)

numeric features: ['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'pid', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sin

In [None]:
from pandas.core.arrays import numeric
# Built pipline
# The pipeline's goal is to combine numerous processes that can be cross-validated while modifying various parameters.
# It does this by allowing set parameters for each step using their names and parameter names separated by a "__"
# It takes steps as a prameter that contain all the preprocessing that I need.
# It saves time by applying any preprocessing to both train and test data without repeating the process.


# Create a pipeline for the numerical features
numeric=Pipeline(
    steps=[
           ('imputer', SimpleImputer()), # SimpleImputer handles missing values; the default strategy='mean' fills each NaN with the column mean
           ('scaler', StandardScaler())  # StandardScaler standardizes each feature to zero mean and unit variance
    ]
)
categorical=Pipeline(
    steps=[
           ('imputer',SimpleImputer(strategy='constant')), # strategy='constant' fills NaN values with a constant placeholder value
           ('onehot',OneHotEncoder(handle_unknown='ignore')) # OneHotEncoder encodes categorical data; categories not seen during fit are encoded as all zeros
    ]
)
    ]
)
# ColumnTransformer applies the numerical and categorical pipelines to their respective columns.
# It selects and prepares the columns of the dataset before the model is fitted to the transformed data.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric, features_numeric), # Numerical data
        ('cat', categorical, features_cat) # Categorical data
    ]
)
# Combine the preprocessing with a suitable classifier.
full_pipline = Pipeline(  
    steps=[
        ('preprocessor', preprocessor), 
        ('my_classifier', 
           RandomForestClassifier(), # I used RandomForestClassifier as a classifier.
        )
    ]
)
full_pipline


np.random.seed(0)  # used to make the random numbers predictable



In [None]:
# Fit the pipeline on the training data, then predict on the test data.
full_pipline = full_pipline.fit(X_train, y_train)
full_pipline.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])
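
The double-underscore names used for tuning (for example `preprocessor__num__imputer__strategy`) can be checked directly with `get_params` and `set_params`. The following is a minimal sketch on a miniature pipeline, not the notebook's full pipeline; the column name `x1` is a made-up placeholder.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Miniature version of the nested structure: Pipeline -> ColumnTransformer -> Pipeline.
# "x1" is a hypothetical column name used only for illustration.
numeric = Pipeline(steps=[("imputer", SimpleImputer()), ("scaler", StandardScaler())])
preprocessor = ColumnTransformer(transformers=[("num", numeric, ["x1"])])
pipe = Pipeline(steps=[("preprocessor", preprocessor)])

# Each "__" walks one level down: preprocessor -> num -> imputer -> strategy.
key = "preprocessor__num__imputer__strategy"
default_strategy = pipe.get_params()[key]
pipe.set_params(**{key: "median"})
updated_strategy = pipe.get_params()[key]
print(default_strategy, "->", updated_strategy)
```

A search object does the same thing internally: for every candidate it calls `set_params` with the keys from the parameter dictionary before fitting.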

In [None]:
# Random search hyperparameters
# param_random is a dictionary that contains all the parameters I want to try.
param_random = {
    'preprocessor__num__imputer__strategy': ['mean'],
    # preprocessor__num__imputer__strategy points to preprocessor->num (a Pipeline)-> imputer -> strategy
    # used to determine strategy value = mean to fill NaN values
    'my_classifier__n_estimators': [20, 30, 40],  
     # my_classifier__n_estimators points to my_classifier->n_estimators 
     # n_estimators is the total number of trees in the forest.
    'my_classifier__max_depth':[10, 20, 30]    
    # my_classifier__max_depth points to my_classifier->max_depth   
    # max_depth is the maximum depth of each tree in the forest.
}

# cv is the number of cross-validation folds used for each combination of parameters
# scoring is the evaluation metric used to rank the results
# n_jobs is the number of jobs to run in parallel
# cv=4 means four-fold cross-validation
# n_jobs = 2
random_search = RandomizedSearchCV(
    full_pipline, param_random, cv=4, verbose=1, n_jobs=2, 
    scoring='roc_auc')
# Fit the random search on the training data
random_search.fit(X_train, y_train)
# best_score_ is the mean cross-validated score (here ROC AUC) of the best parameter setting.
# best_params_ is the parameter setting that gave the best mean score on the held-out folds.
print('best score {}'.format(random_search.best_score_))
print('best params {}'.format(random_search.best_params_))



Fitting 4 folds for each of 9 candidates, totalling 36 fits
best score 0.8415182723682186
best params {'preprocessor__num__imputer__strategy': 'mean', 'my_classifier__n_estimators': 40, 'my_classifier__max_depth': 10}


Best parameters:
*  max_depth = 10
*  n_estimators = 40
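
The log line "Fitting 4 folds for each of 9 candidates, totalling 36 fits" follows from the size of the parameter dictionary. A small sketch of that arithmetic, using only the standard library:

```python
from itertools import product

# Same value lists as param_random in the cell above.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean"],
    "my_classifier__n_estimators": [20, 30, 40],
    "my_classifier__max_depth": [10, 20, 30],
}
cv_folds = 4

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)          # 1 * 3 * 3
total_fits = n_candidates * cv_folds    # one fit per candidate per fold
print(n_candidates, total_fits)
```

Because the dictionary holds only 9 combinations, fewer than the default `n_iter=10` of `RandomizedSearchCV`, the random search here ends up trying every combination, which is what a grid search would do.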

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = df_ts['id']

submission['match'] = random_search.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/sample_submission_walkthrough.csv', index=False)

##### Result

Score (ROC AUC) in **Cross-Validation** = 0.8415 \
Score in **kaggle** = 0.83595\
The score did not improve.
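
Since the search uses `scoring='roc_auc'`, the cross-validation number is a ROC AUC rather than a plain accuracy. The sketch below, on made-up labels and scores, shows why the two can differ a lot on imbalanced data: predicting "no match" for everyone gives a high accuracy but an AUC of only 0.5.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up imbalanced example: 8 non-matches, 2 matches.
y_true = [0] * 8 + [1] * 2

all_zero = [0] * 10
acc_all_zero = accuracy(y_true, all_zero)
auc_all_zero = roc_auc(y_true, all_zero)

# Hypothetical predicted probabilities that rank most matches above non-matches.
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.7, 0.5]
auc_scores = roc_auc(y_true, scores)
print(acc_all_zero, auc_all_zero, auc_scores)
```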

### **Trial 1**

In this trial, I drop some columns, use 4 cross-validation folds instead of 2, and switch to XGBClassifier to see whether the result improves.

* I will use 182 features (184 columns remain after the drop, including `match` and `id`).
* I will drop some columns.
* I will handle the imbalanced data.

**My thoughts and observations:** I expect the score to be between 0.83 and 0.87.

##### Read Training and Testing data

In [None]:
# Read the training data with read_csv, which takes the path of the CSV file to load.
df = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/train.csv')
# head(5) returns the first 5 rows; it is a quick way to check that the dataset contains the expected kind of data.
df.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828


In [None]:
# Read the testing data with read_csv, which takes the path of the CSV file to load.
df_test = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/test.csv')
# head(5) returns the first 5 rows; it is a quick way to check that the dataset contains the expected kind of data.
df_test.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052


In [None]:
# Display the column names of the training and testing data
print(df.columns)
print(df_test.columns)

Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=192)
Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=191)


#### Preprocessing

I will check the number of NaN values. Next, I will check the data types. If there is object data, I will convert it into categorical data so I can use it.

##### Drop some features

In [None]:
# Drop unimportant features from the training and testing data
# drop() removes the columns named in `columns`; inplace=True modifies the DataFrame directly instead of returning a copy
df.drop(columns=['zipcode','round','positin1','pid','condtn','field','tuition','career'],inplace=True)
df_test.drop(columns=['zipcode','round','positin1','pid','condtn','field','tuition','career'],inplace=True)

##### Check NaN values

* Check how many values are missing by using
 * isnull() returns a boolean mask indicating whether each value is NA.
 * sum() returns the number of NaN values in each column.
 * sort_values() sorts the counts in descending order.

###### Training data

In [None]:
# Count the missing values in each column
df.isnull().sum().sort_values(ascending=False)

num_in_3    5449
numdat_3    4849
expnum      4627
sinc7_2     4519
amb7_2      4519
            ... 
partner        0
order          0
position       0
wave           0
id             0
Length: 184, dtype: int64

###### Testing data

In [None]:
# Count the missing values in each column
df_test.isnull().sum().sort_values(ascending=False)

num_in_3    2261
numdat_3    2033
expnum      1951
amb7_2      1904
sinc7_2     1904
            ... 
partner        0
order          0
position       0
wave           0
id             0
Length: 183, dtype: int64
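
Raw counts are easier to judge as a share of the rows; for example, `num_in_3` is missing in 5449 of the 5909 training rows, about 92%. A minimal sketch of that view on a made-up frame (the column names are borrowed from the dataset, the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "expnum": [np.nan, np.nan, np.nan, 4.0],
    "age": [21.0, np.nan, 25.0, 30.0],
    "wave": [1, 2, 3, 4],
})

# The mean of the boolean mask is the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Columns where more than half of the rows are missing (the 0.5 threshold is an arbitrary choice).
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(missing_share)
print(mostly_missing)
```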

##### Checking data types (convert object data to categorical data)

In this section, I check the data types and convert any object columns to categorical data.

In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 184 entries, gender to id
dtypes: float64(171), int64(9), object(4)
memory usage: 8.3+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 183 entries, gender to id
dtypes: float64(171), int64(8), object(4)
memory usage: 3.4+ MB


So there are 4 object columns in both the training and the testing data.

###### Training data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns whose dtypes match.
# include contains the type of data that I want to select.
df.select_dtypes(include=['object'])

Unnamed: 0,undergra,mn_sat,from,income
0,University of Michigan-Ann Arbor,1290.00,"Palo Alto, CA",
1,,,"Boston, MA",
2,"Rizvi College of Architecture, Bombay University",,"Bombay, India",
3,,,"Washington, DC",45300.00
4,Harvard College,1400.00,Midwest USA,46138.00
...,...,...,...,...
5904,,,New York,65708.00
5905,,,Colombia,
5906,University of Washington,1155.00,Seattle,37881.00
5907,,,Canada,


###### Testing data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns whose dtypes match.
# include contains the type of data that I want to select.
df_test.select_dtypes(include=['object'])

Unnamed: 0,undergra,mn_sat,from,income
0,,,Hong Kong,
1,wellesley college,1341.00,"atlanta, ga",36223.00
2,,,San Francisco,55080.00
3,,,Brooklyn,26482.00
4,,,"Atlanta, GA",21590.00
...,...,...,...,...
2464,Columbia,1430.00,Hong Kong,
2465,Bucknell University,1290.00,"Erie, PA",
2466,,,Brooklyn,26482.00
2467,,,Vestal,42640.00


###### Convert training and testing object data to categorical data

In [None]:
# Make a copy of the training and testing data frames before changing any column
df_tr=df.copy() # Copy of the training DataFrame (indices and data).
df_ts=df_test.copy() # Copy of the testing DataFrame (indices and data).

# obj_tr contains all object columns in the training set;
# select_dtypes selects the object columns from the training data.
obj_tr=df.select_dtypes(include=['object'])  

# Convert every object column to the category dtype
for i in obj_tr:
   df_tr[i]=df_tr[i].astype("category")

# obj_ts contains all object columns in the testing set;
# select_dtypes selects the object columns from the testing data.
obj_ts=df_test.select_dtypes(include=['object']) 

# Convert every object column to the category dtype
for i in obj_ts:
   df_ts[i]=df_ts[i].astype("category")
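
One thing to keep in mind with this conversion: calling `astype("category")` separately on each frame gives each column its own set of categories. The pipeline's `OneHotEncoder(handle_unknown='ignore')` copes with that, but if identical categories are wanted in both frames, a shared `CategoricalDtype` is one option. This is only a sketch on made-up values, not a step the notebook performs.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Made-up miniature of the 'from' column in train and test.
train = pd.DataFrame({"from": ["Boston, MA", "Seattle", None]})
test = pd.DataFrame({"from": ["Seattle", "Hong Kong"]})

# Separate conversion: each frame only knows the values it has seen.
separate_train = train["from"].astype("category")
separate_test = test["from"].astype("category")
same_categories = list(separate_train.cat.categories) == list(separate_test.cat.categories)

# Shared dtype built from the union of the observed (non-missing) values.
all_values = sorted(set(train["from"].dropna()) | set(test["from"].dropna()))
shared = CategoricalDtype(categories=all_values)
train_cat = train["from"].astype(shared)
test_cat = test["from"].astype(shared)
print(same_categories, list(train_cat.cat.categories))
```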


In [None]:
# Look at the training data:
df_tr

Unnamed: 0,gender,idg,wave,position,order,partner,match,int_corr,samerace,age_o,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,14,2,14,12,0,-0.03,0,27.0,...,,,,,,,,,,2583
1,1,14,3,2,8,8,0,0.21,0,24.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,13,8,10,10,0,0.43,0,34.0,...,,,,,,,,,,4840
3,1,38,9,18,6,7,0,0.72,1,25.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,14,6,20,17,0,0.33,0,27.0,...,,,,,,,,,,4828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5904,0,1,9,2,18,1,0,-0.22,1,23.0,...,12.0,12.0,9.0,12.0,,,,,,3390
5905,1,24,9,19,5,6,0,0.08,0,30.0,...,,,,,,,,,,4130
5906,0,13,11,5,3,18,0,0.35,0,34.0,...,,,,,,,,,,1178
5907,1,10,7,6,9,10,1,0.45,0,28.0,...,,,,,,,,,,5016


In [None]:
# Look at the testing data:
df_ts

Unnamed: 0,gender,idg,wave,position,order,partner,int_corr,samerace,age_o,race_o,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,3,13,13,-0.13,0,21.0,2.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,14,6,4,8,0.12,0,24.0,6.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,9,10,15,19,0.11,0,27.0,3.0,...,,,,,,,,,,6757
3,1,26,2,15,8,10,0.11,1,23.0,2.0,...,,,,,,,,,,2275
4,0,29,7,7,10,5,0.45,0,27.0,4.0,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,15,18,14,11,0.74,0,24.0,2.0,...,,,,,,,,,,7982
2465,0,5,13,4,4,8,,0,,,...,,,,,,,,,,7299
2466,1,26,2,3,15,3,-0.13,0,21.0,4.0,...,,,,,,,,,,1818
2467,0,19,9,11,9,2,0.43,0,26.0,4.0,...,7.0,12.0,12.0,9.0,,,,,,937


In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df_tr.info()
df_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 184 entries, gender to id
dtypes: category(4), float64(171), int64(9)
memory usage: 8.2 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 183 entries, gender to id
dtypes: category(4), float64(171), int64(8)
memory usage: 3.4 MB


So now there are no object columns left in the datasets.

#### Model

##### Splitting

I split the data into X and y.

###### Training data

In [None]:
# Split the training data into X_train and y_train
y_train=df_tr['match'] # y_train contains only the match column
X_train=df_tr.drop(columns=['match','id']) # X_train is every column except match and id.
# shape returns a tuple with the dimensions of the DataFrame.
print(y_train.shape) 
print(X_train.shape)

(5909,)
(5909, 182)


###### Testing data

In [None]:
X_test=df_ts # X_test is the full test frame, so it still includes the id column; the ColumnTransformer later selects only the listed feature columns.
print(X_test.shape)

(2469, 183)


##### PipeLine Tuning

In [None]:
# Separate the numerical and categorical features in the training data

# Put the names of the numeric columns in the features_numeric list
features_numeric=list(X_train.select_dtypes(include=['float64','int64']))

# Put the names of the categorical columns in the features_cat list
features_cat=list(X_train.select_dtypes(include=['category']))
# Print each list to see which columns it contains.
print('numeric features:', features_numeric)
print('categorical features:', features_cat)

numeric features: ['gender', 'idg', 'wave', 'position', 'order', 'partner', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sinc1_s', 'intel1_s', 'fun1_s', 'amb1_s',

In [None]:
# Build the pipeline
# A pipeline chains several processing steps so they can be cross-validated together while their parameters are tuned.
# Parameters of each step are addressed as <step name>__<parameter name>, joined by a double underscore "__".
# It takes `steps` as a parameter: a list of (name, estimator) pairs holding all the preprocessing I need.
# It applies the same preprocessing to both the training and the testing data without repeating the code.


# Create a pipeline for the numerical features
numeric=Pipeline(
    steps=[
           ('imputer', SimpleImputer()), # SimpleImputer handles missing values; the default strategy='mean' fills each NaN with the column mean
           ('scaler', StandardScaler())  # StandardScaler standardizes each feature to zero mean and unit variance
    ]
)
categorical=Pipeline(
    steps=[
           ('imputer',SimpleImputer(strategy='constant')), # strategy='constant' fills NaN values with a constant placeholder value
           ('onehot',OneHotEncoder(handle_unknown='ignore')) # OneHotEncoder encodes categorical data; categories not seen during fit are encoded as all zeros
    ]
)
    ]
)
# ColumnTransformer applies the numerical and categorical pipelines to their respective columns.
# It selects and prepares the columns of the dataset before the model is fitted to the transformed data.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric, features_numeric), # Numerical data
        ('cat', categorical, features_cat) # Categorical data
    ]
)
# Combine the preprocessing with a suitable classifier.
full_pipline = Pipeline(  
    steps=[
        ('preprocessor', preprocessor), 
        ('my_classifier', 
           XGBClassifier(), # I used XGBClassifier as a classifier.
        )
    ]
)
full_pipline


np.random.seed(0)  # used to make the random numbers predictable



In [None]:
# Fit the pipeline on the training data, then predict on the test data.
full_pipline = full_pipline.fit(X_train, y_train)
full_pipline.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])
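
For intuition, the numeric branch of the pipeline (mean imputation followed by standardization) can be written out by hand. The sketch below uses only the standard library and made-up ages; it mirrors the idea that the statistics are learned on the training column and then reused on the test column.

```python
import math
from statistics import fmean, pstdev

def fit_stats(train_column):
    """Learn the fill value, mean and (population) standard deviation from training data."""
    observed = [v for v in train_column if not math.isnan(v)]
    fill = fmean(observed)
    filled = [fill if math.isnan(v) else v for v in train_column]
    return fill, fmean(filled), pstdev(filled)

def transform(column, stats):
    """Fill NaN with the training mean, then standardize with the training statistics."""
    fill, mu, sigma = stats
    return [((fill if math.isnan(v) else v) - mu) / sigma for v in column]

# Made-up values for illustration.
train_ages = [21.0, float("nan"), 25.0, 29.0]
test_ages = [float("nan"), 33.0]

stats = fit_stats(train_ages)
train_scaled = transform(train_ages, stats)
test_scaled = transform(test_ages, stats)
print(train_scaled, test_scaled)
```

A missing test value is replaced by the training mean and therefore maps to 0 after scaling.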

In [None]:
# Random search hyperparameters
# param_random is a dictionary that contains all the parameters I want to try.
param_random = {
    'preprocessor__num__imputer__strategy': ['mean'],
    # preprocessor__num__imputer__strategy points to preprocessor->num (a Pipeline)-> imputer -> strategy
    # used to determine strategy value = mean to fill NaN values
    'my_classifier__nfold': [30, 40, 50],  
    # my_classifier__nfold points to my_classifier->nfold
    # Note: nfold is an argument of xgboost.cv rather than a model hyperparameter of XGBClassifier,
    # so varying it here most likely has no effect on the fitted model.
    'my_classifier__max_depth':[20, 30, 40]   
    # my_classifier__max_depth points to my_classifier->max_depth   
    # max_depth is the maximum depth of each tree.
}

# cv is the number of cross-validation folds used for each combination of parameters
# scoring is the evaluation metric used to rank the results
# n_jobs is the number of jobs to run in parallel
# cv=4 means four-fold cross-validation
# n_jobs = 2
random_search = RandomizedSearchCV(
    full_pipline, param_random, cv=4, verbose=1, n_jobs=2, 
    scoring='roc_auc')
# Fit the random search on the training data
random_search.fit(X_train, y_train)
# best_score_ is the mean cross-validated score (here ROC AUC) of the best parameter setting.
# best_params_ is the parameter setting that gave the best mean score on the held-out folds.
print('best score {}'.format(random_search.best_score_))
print('best params {}'.format(random_search.best_params_))



Fitting 4 folds for each of 9 candidates, totalling 36 fits
best score 0.8781787478283851
best params {'preprocessor__num__imputer__strategy': 'mean', 'my_classifier__nfold': 30, 'my_classifier__max_depth': 20}


Best parameters:
*  max_depth = 20
*  nfold = 30
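
With `cv=4`, each candidate is trained four times, each time holding out roughly a quarter of the 5909 training rows for scoring. The sketch below computes the fold sizes for plain k-fold splitting; it is a simplification, since for classifiers scikit-learn uses stratified folds when `cv` is an integer, so the exact sizes may differ slightly.

```python
def kfold_sizes(n_samples, n_splits):
    """Validation-fold sizes for plain k-fold: the first (n_samples % n_splits) folds get one extra row."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

n_rows = 5909  # number of training rows reported by X_train.shape
val_sizes = kfold_sizes(n_rows, 4)
train_sizes = [n_rows - s for s in val_sizes]
print(val_sizes)
print(train_sizes)
```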

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = df_ts['id']

submission['match'] = random_search.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/sample_submission_walkthrough.csv', index=False)

##### Result

Score (ROC AUC) in **Cross-Validation** = 0.8782 \
Score in **kaggle** = 0.87085 \
It is better than the previous trial.

## **Bayesian Search**

### **Trial 0**

In this trial, I test the Random Forest with the hyperparameters n_estimators and max_depth, tuned with Bayesian search, to see whether the results improve.

* I will use 190 features.
* I will not drop any columns.
* I will handle the imbalanced data.

**My thoughts and observations:** I expect the score to be between 0.80 and 0.85.

##### Read Training and Testing data

In [None]:
# Read the training data with read_csv, which takes the path of the CSV file to load.
df = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/train.csv')
# head(5) returns the first 5 rows; it is a quick way to check that the dataset contains the expected kind of data.
df.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828


In [None]:
# Read the testing data with read_csv, which takes the path of the CSV file to load.
df_test = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/test.csv')
# head(5) returns the first 5 rows; it is a quick way to check that the dataset contains the expected kind of data.
df_test.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052


In [None]:
# Display the column names of the training and testing data
print(df.columns)
print(df_test.columns)

Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=192)
Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=191)


#### Preprocessing

I will check the number of NaN values. Next, I will check the data types. If there is object data, I will convert it into categorical data so I can use it.

##### Check NaN values

* Check how many values are missing by using
 * isnull() returns a boolean mask indicating whether each value is NA.
 * sum() returns the number of NaN values in each column.
 * sort_values() sorts the counts in descending order.

###### Training data

In [None]:
# Count the missing values in each column
df.isnull().sum().sort_values(ascending=False)

num_in_3    5449
numdat_3    4849
expnum      4627
amb7_2      4519
sinc7_2     4519
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 192, dtype: int64

###### Testing data

In [None]:
# Count the missing values in each column
df_test.isnull().sum().sort_values(ascending=False)

num_in_3    2261
numdat_3    2033
expnum      1951
amb7_2      1904
sinc7_2     1904
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 191, dtype: int64

##### Checking data types (convert object data to categorical data)

In this section, I check the data types and convert any object columns to categorical data.

In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 192 entries, gender to id
dtypes: float64(173), int64(11), object(8)
memory usage: 8.7+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 191 entries, gender to id
dtypes: float64(173), int64(10), object(8)
memory usage: 3.6+ MB


So there are 8 object columns in both the training and the testing data.

###### Training data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns whose dtypes match.
# include contains the type of data that I want to select.
df.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Ed.D. in higher education policy at TC,University of Michigan-Ann Arbor,1290.00,21645.00,"Palo Alto, CA",,,University President
1,Engineering,,,,"Boston, MA",2021,,Engineer or iBanker or consultant
2,Urban Planning,"Rizvi College of Architecture, Bombay University",,,"Bombay, India",,,Real Estate Consulting
3,International Affairs,,,,"Washington, DC",10471,45300.00,public service
4,Business,Harvard College,1400.00,26019.00,Midwest USA,66208,46138.00,undecided
...,...,...,...,...,...,...,...,...
5904,Clinical Psychology,,,,New York,11803,65708.00,Psychologist
5905,MBA,,,,Colombia,,,Consulting
5906,MA Science Education,University of Washington,1155.00,13258.00,Seattle,98115,37881.00,Teacher
5907,Biochemistry,,,,Canada,,,pharmaceuticals and biotechnology


###### Testing data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns whose dtypes match.
# include contains the type of data that I want to select.
df_test.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Psychology,,,,Hong Kong,0,,psychologist
1,education,wellesley college,1341.00,25504.00,"atlanta, ga",30071,36223.00,education
2,MBA,,,,San Francisco,10021,55080.00,Consulting
3,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
4,Business,,,,"Atlanta, GA",27870,21590.00,Marketing and Media
...,...,...,...,...,...,...,...,...
2464,Neuroscience and Education,Columbia,1430.00,26908.00,Hong Kong,0,,Academic
2465,School Psychology,Bucknell University,1290.00,25335.00,"Erie, PA",,,school psychologist
2466,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
2467,Mathematics,,,,Vestal,13850,42640.00,college professor


###### Convert training and testing object data to categorical data

In [None]:
# Make a copy of the training and testing data frames before changing any column
df_tr=df.copy() # Copy of the training DataFrame (indices and data).
df_ts=df_test.copy() # Copy of the testing DataFrame (indices and data).

# obj_tr contains all object columns in the training set;
# select_dtypes selects the object columns from the training data.
obj_tr=df.select_dtypes(include=['object'])  

# Convert every object column to the category dtype
for i in obj_tr:
   df_tr[i]=df_tr[i].astype("category")

# obj_ts contains all object columns in the testing set;
# select_dtypes selects the object columns from the testing data.
obj_ts=df_test.select_dtypes(include=['object']) 

# Convert every object column to the category dtype
for i in obj_ts:
   df_ts[i]=df_ts[i].astype("category")


In [None]:
# Look at the training data:
df_tr

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5904,0,1,2,9,20,2,2.0,18,1,214.0,...,12.0,12.0,9.0,12.0,,,,,,3390
5905,1,24,2,9,20,19,15.0,5,6,199.0,...,,,,,,,,,,4130
5906,0,13,2,11,21,5,5.0,3,18,290.0,...,,,,,,,,,,1178
5907,1,10,2,7,16,6,14.0,9,10,151.0,...,,,,,,,,,,5016


In [None]:
# Look at the testing data:
df_ts

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,2,15,19,18,18.0,14,11,407.0,...,,,,,,,,,,7982
2465,0,5,1,13,9,4,4.0,4,8,339.0,...,,,,,,,,,,7299
2466,1,26,2,2,19,3,,15,3,23.0,...,,,,,,,,,,1818
2467,0,19,2,9,20,11,11.0,9,2,215.0,...,7.0,12.0,12.0,9.0,,,,,,937


In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df_tr.info()
df_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 192 entries, gender to id
dtypes: category(8), float64(173), int64(11)
memory usage: 8.5 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 191 entries, gender to id
dtypes: category(8), float64(173), int64(10)
memory usage: 3.6 MB


So now there are no object columns left in the datasets.

#### Model

##### Splitting

I split the data into X and y.

###### Training data

In [None]:
# Split the training data into X_train and y_train
y_train=df_tr['match'] # y_train contains only the match column
X_train=df_tr.drop(columns=['match','id']) # X_train is every column except match and id.
# shape returns a tuple with the dimensions of the DataFrame.
print(y_train.shape) 
print(X_train.shape)

(5909,)
(5909, 190)


###### Testing data

In [None]:
X_test=df_ts # X_test is the full test frame, so it still includes the id column; the ColumnTransformer later selects only the listed feature columns.
print(X_test.shape)

(2469, 191)


##### PipeLine Tuning

In [None]:
# Separate the numerical and categorical features in the training data

# Put the names of the numeric columns in the features_numeric list
features_numeric=list(X_train.select_dtypes(include=['float64','int64']))

# Put the names of the categorical columns in the features_cat list
features_cat=list(X_train.select_dtypes(include=['category']))
# Print each list to see which columns it contains.
print('numeric features:', features_numeric)
print('categorical features:', features_cat)

numeric features: ['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'pid', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sin

In [None]:
# Build the pipeline
# A pipeline chains several processing steps so they can be cross-validated together while their parameters are tuned.
# Parameters of each step are addressed as <step name>__<parameter name>, joined by a double underscore "__".
# It takes `steps` as a parameter: a list of (name, estimator) pairs holding all the preprocessing I need.
# It applies the same preprocessing to both the training and the testing data without repeating the code.


# Create a pipeline for the numerical features
numeric=Pipeline(
    steps=[
           ('imputer', SimpleImputer()), # SimpleImputer handles missing values; the default strategy='mean' fills each NaN with the column mean
           ('scaler', StandardScaler())  # StandardScaler standardizes each feature to zero mean and unit variance
    ]
)
categorical=Pipeline(
    steps=[
           ('imputer',SimpleImputer(strategy='constant')), # strategy='constant' fills NaN values with a constant placeholder value
           ('onehot',OneHotEncoder(handle_unknown='ignore')) # OneHotEncoder encodes categorical data; categories not seen during fit are encoded as all zeros
    ]
)
    ]
)
# ColumnTransformer applies the numerical and categorical pipelines to their respective columns.
# It selects and prepares the columns of the dataset before the model is fitted to the transformed data.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric, features_numeric), # Numerical data
        ('cat', categorical, features_cat) # Categorical data
    ]
)
# Combine the preprocessing with a suitable classifier.
full_pipline = Pipeline(  
    steps=[
        ('preprocessor', preprocessor), 
        ('my_classifier', 
           RandomForestClassifier(), # I used RandomForestClassifier as a classifier.
        )
    ]
)
full_pipline


np.random.seed(0)  # used to make the random numbers predictable



In [None]:
# Fit the pipeline on the training data and predict on the test data.
full_pipline = full_pipline.fit(X_train, y_train)
full_pipline.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])
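
The `__` naming that the search parameters rely on can be tried on a small stand-in pipeline. This is only a sketch under assumptions: the column names `age` and `income`, the name `toy_pipeline`, and the classifier settings are made up for illustration and are not taken from the notebook's data.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the numeric branch of the pipeline (hypothetical columns).
numeric = Pipeline(steps=[("imputer", SimpleImputer()), ("scaler", StandardScaler())])
preprocessor = ColumnTransformer(transformers=[("num", numeric, ["age", "income"])])
toy_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("my_classifier", RandomForestClassifier(n_estimators=5, random_state=0)),
    ]
)

# Each "__" walks one level down: step -> transformer -> step -> parameter.
toy_pipeline.set_params(
    preprocessor__num__imputer__strategy="median",
    my_classifier__max_depth=3,
)

# Read the values back through the same nested names.
params = toy_pipeline.get_params()
imputer_strategy = params["preprocessor__num__imputer__strategy"]
classifier_depth = params["my_classifier__max_depth"]
print(imputer_strategy, classifier_depth)
```

The same nested keys are what a search object uses when it sets candidate values on the pipeline.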

In [None]:
# Bayesian search hyperparameters
from skopt import BayesSearchCV  # harmless if it was already imported earlier
# param_bayes is a dictionary that contains all the parameter values I want to try.
param_bayes = {
    'preprocessor__num__imputer__strategy': ['mean'],
    # preprocessor__num__imputer__strategy points to preprocessor -> num (a Pipeline) -> imputer -> strategy
    # the only value is 'mean', so NaN values are filled with the column mean
    'my_classifier__n_estimators': [20, 30, 40],
    # my_classifier__n_estimators points to my_classifier -> n_estimators
    # n_estimators is the total number of trees in the forest.
    'my_classifier__max_depth':[10, 20, 30]
    # my_classifier__max_depth points to my_classifier -> max_depth
    # max_depth is the maximum depth of each tree.
}

# cv is the number of cross-validation folds used to evaluate each parameter setting; cv=3 means three-fold cross-validation.
# scoring is the evaluation metric used to rank the results (here the area under the ROC curve).
# n_jobs is the number of jobs to run in parallel (here 2).
BayesS = BayesSearchCV(
    full_pipline, param_bayes, cv=3, verbose=1, n_jobs=2,
    scoring='roc_auc')
# Run the Bayesian search on the training data
BayesS.fit(X_train, y_train)
# best_score_ is the mean cross-validated score (ROC AUC here) of the best parameter setting.
# best_params_ is the parameter setting that gave the best mean score on the validation folds.
print('best score {}'.format(BayesS.best_score_))
print('best params {}'.format(BayesS.best_params_))

Fitting 3 folds for each of 1 candidates, totalling 3 fits
best score 0.8406002214975845
best params OrderedDict([('my_classifier__max_depth', 10), ('my_classifier__n_estimators', 40), ('preprocessor__num__imputer__strategy', 'mean')])


Best parameters:
*  max_depth = 10
*  n_estimators = 40
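
To put these best values in context, it helps to count how many distinct settings the search space holds. The sketch below assumes each list in `param_bayes` is treated as a set of discrete choices, which is consistent with the best values reported above; it uses only the standard library.

```python
from itertools import product

# Same keys and candidate values as the param_bayes dictionary of this trial.
param_space = {
    "preprocessor__num__imputer__strategy": ["mean"],
    "my_classifier__n_estimators": [20, 30, 40],
    "my_classifier__max_depth": [10, 20, 30],
}

# Every distinct combination of one value per parameter.
combinations = list(product(*param_space.values()))
n_combinations = len(combinations)
print(n_combinations)  # 9
```

With only 9 distinct settings, an exhaustive grid search would cover the same space; a Bayesian search pays off more when the ranges are wide or continuous.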

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = df_ts['id']

submission['match'] = BayesS.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/sample_submission_walkthrough.csv', index=False)

##### Result

ROC AUC in **Cross-Validation** = 0.8406 \
Score in **kaggle** = 0.84530 \
The performance did not improve from the previous trial.
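
The search was run with `scoring='roc_auc'`, so the cross-validation figure is an area under the ROC curve rather than a plain accuracy. As a rough illustration, ROC AUC can be read as the fraction of (positive, negative) pairs that the model ranks in the right order. The function name `roc_auc` and the toy labels and scores are made up for this sketch.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75
```

Because the metric only depends on the ranking, it is not distorted by the majority class in the way accuracy can be when most pairs do not match.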

### **Trial 1**

In this trial, I decided to drop some columns, change the number of CV folds to 4 instead of 3, and use XGBClassifier to see whether the result improves.

* I will use 184 columns (including `match` and `id`).
* I will drop some columns.
* I will address the imbalanced data.
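
The notebook does not show which technique is meant for the imbalance. One common option with XGBoost is the `scale_pos_weight` parameter, usually set to the ratio of negative to positive examples. The sketch below is an assumption about how that could look, not the method used in this trial; the helper name and the toy labels are hypothetical.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive labels, a common starting value for scale_pos_weight."""
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    if positives == 0:
        raise ValueError("no positive examples, the ratio is undefined")
    return negatives / positives


# Hypothetical labels: five non-matches for every match.
toy_labels = [0, 0, 0, 0, 0, 1]
weight = suggested_scale_pos_weight(toy_labels)
print(weight)  # 5.0
```

The value could then be passed as `XGBClassifier(scale_pos_weight=weight)`, computed from `y_train` instead of the toy list.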

**My thoughts and observations:** I expect the score to be between 0.84 and 0.88.

##### Read Training and Testing data

In [None]:
# Read all our training data by using read_csv, which takes the path of the file with the extension that I want to read.
df = pd.read_csv ('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/train.csv')
# Based on position, this function returns the first 5 rows of the dataset. It's used to quickly see if our dataset contains the proper kind of data.
df.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828


In [None]:
# Read all our testing data by using read_csv, which takes the path of the file with the extension that I want to read.
df_test = pd.read_csv ('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/test.csv')
# Based on position, this function returns the first 5 rows of the dataset. It's used to quickly see if our dataset contains the proper kind of data.
df_test.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052


In [None]:
# Display the column names of the training and testing data
print(df.columns)
print(df_test.columns)

Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=192)
Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=191)
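
The training data has 192 columns and the testing data 191, so one column exists only in the training set (presumably the `match` target, which is split off later). A quick way to confirm which columns differ is sketched below on two made-up frames; `toy_train` and `toy_test` are stand-ins for `df` and `df_test`.

```python
import pandas as pd

# Hypothetical miniature versions of the two data frames.
toy_train = pd.DataFrame({"gender": [0, 1], "match": [0, 1], "id": [10, 11]})
toy_test = pd.DataFrame({"gender": [1, 0], "id": [12, 13]})

# Index.difference returns the labels present in one index but not in the other.
only_in_train = toy_train.columns.difference(toy_test.columns).tolist()
only_in_test = toy_test.columns.difference(toy_train.columns).tolist()
print(only_in_train, only_in_test)
```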


#### Preprocessing

I will check the number of NaN values. Next, I will check the data types. If there is object data, I will convert it into categorical data so I can use it.

##### Drop some features

In [None]:
# Drop the features that I consider unimportant from both the training and testing data.
# drop() removes the columns named in `columns`; inplace=True modifies the data frame itself instead of returning a copy.
df.drop(columns=['zipcode','round','positin1','pid','condtn','field','tuition','career'],inplace=True)
df_test.drop(columns=['zipcode','round','positin1','pid','condtn','field','tuition','career'],inplace=True)
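
Since the same columns have to be removed from both data frames, keeping the names in one list avoids the two calls drifting apart. This is a small sketch on made-up frames; the shortened list `columns_to_drop` and the toy values are only for illustration.

```python
import pandas as pd

# One shared list, so train and test always lose the same columns.
columns_to_drop = ["zipcode", "pid"]

toy_train = pd.DataFrame(
    {"gender": [0], "pid": [372.0], "zipcode": ["10471"], "match": [0]}
)
toy_test = pd.DataFrame({"gender": [1], "pid": [52.0], "zipcode": ["0"]})

# drop() without inplace returns a new frame, which is reassigned here.
toy_train = toy_train.drop(columns=columns_to_drop)
toy_test = toy_test.drop(columns=columns_to_drop)
print(list(toy_train.columns), list(toy_test.columns))
```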

##### Check NaN values

* Check how many missing values exist by using:
 * isnull(): returns a boolean for each value indicating whether it is NA.
 * sum(): returns the number of NaN values in each column.
 * sort_values(): sorts the counts in descending order (ascending=False).

###### Training data

In [None]:
# Count the missing values in each column
df.isnull().sum().sort_values(ascending=False)

num_in_3    5449
numdat_3    4849
expnum      4627
sinc7_2     4519
amb7_2      4519
            ... 
partner        0
order          0
position       0
wave           0
id             0
Length: 184, dtype: int64

###### Testing data

In [None]:
# Count the missing values in each column
df_test.isnull().sum().sort_values(ascending=False)

num_in_3    2261
numdat_3    2033
expnum      1951
amb7_2      1904
sinc7_2     1904
            ... 
partner        0
order          0
position       0
wave           0
id             0
Length: 183, dtype: int64
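
Raw counts are easier to judge as a share of the rows: in the training data, for example, `num_in_3` is missing in 5449 of 5909 rows, roughly 92%. The sketch below shows one way to get that share directly; the toy frame only borrows a few column names and its values are invented.

```python
import numpy as np
import pandas as pd

# Invented values, only the column names are borrowed from the dataset.
toy = pd.DataFrame(
    {
        "num_in_3": [np.nan, np.nan, np.nan, 1.0],
        "expnum": [np.nan, 2.0, np.nan, 4.0],
        "wave": [14, 3, 13, 9],
    }
)

# The mean of the boolean mask is the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns that are almost entirely missing carry little information after mean imputation, so this view can help decide which ones to drop.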

##### Checking data types (convert object data to categorical data)

In this section, I will check the data types and then convert any object columns to categorical data.

In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 184 entries, gender to id
dtypes: float64(171), int64(9), object(4)
memory usage: 8.3+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 183 entries, gender to id
dtypes: float64(171), int64(8), object(4)
memory usage: 3.4+ MB


So there are 4 object columns in both the training and testing data.

###### Training data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns whose dtype matches.
# include lists the dtypes that I want to select.
df.select_dtypes(include=['object'])

Unnamed: 0,undergra,mn_sat,from,income
0,University of Michigan-Ann Arbor,1290.00,"Palo Alto, CA",
1,,,"Boston, MA",
2,"Rizvi College of Architecture, Bombay University",,"Bombay, India",
3,,,"Washington, DC",45300.00
4,Harvard College,1400.00,Midwest USA,46138.00
...,...,...,...,...
5904,,,New York,65708.00
5905,,,Colombia,
5906,University of Washington,1155.00,Seattle,37881.00
5907,,,Canada,


###### Testing data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns whose dtype matches.
# include lists the dtypes that I want to select.
df_test.select_dtypes(include=['object'])

Unnamed: 0,undergra,mn_sat,from,income
0,,,Hong Kong,
1,wellesley college,1341.00,"atlanta, ga",36223.00
2,,,San Francisco,55080.00
3,,,Brooklyn,26482.00
4,,,"Atlanta, GA",21590.00
...,...,...,...,...
2464,Columbia,1430.00,Hong Kong,
2465,Bucknell University,1290.00,"Erie, PA",
2466,,,Brooklyn,26482.00
2467,,,Vestal,42640.00


###### Convert training and testing object data to categorical data

In [None]:
# Make a copy of the training and testing data frames before doing any assignment
df_tr=df.copy() # copy of the training data frame's indices and data
df_ts=df_test.copy() # copy of the testing data frame's indices and data

# obj_tr contains all object columns of the training set;
# select_dtypes selects the columns with the object dtype.
obj_tr=df.select_dtypes(include=['object'])

# Convert every object column to the category dtype
for i in obj_tr:
   df_tr[i]=df_tr[i].astype("category")

# obj_ts contains all object columns of the testing set;
# select_dtypes selects the columns with the object dtype.
obj_ts=df_test.select_dtypes(include=['object'])

# Convert every object column to the category dtype
for i in obj_ts:
   df_ts[i]=df_ts[i].astype("category")


In [None]:
# Look at the training values:
df_tr

Unnamed: 0,gender,idg,wave,position,order,partner,match,int_corr,samerace,age_o,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,14,2,14,12,0,-0.03,0,27.0,...,,,,,,,,,,2583
1,1,14,3,2,8,8,0,0.21,0,24.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,13,8,10,10,0,0.43,0,34.0,...,,,,,,,,,,4840
3,1,38,9,18,6,7,0,0.72,1,25.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,14,6,20,17,0,0.33,0,27.0,...,,,,,,,,,,4828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5904,0,1,9,2,18,1,0,-0.22,1,23.0,...,12.0,12.0,9.0,12.0,,,,,,3390
5905,1,24,9,19,5,6,0,0.08,0,30.0,...,,,,,,,,,,4130
5906,0,13,11,5,3,18,0,0.35,0,34.0,...,,,,,,,,,,1178
5907,1,10,7,6,9,10,1,0.45,0,28.0,...,,,,,,,,,,5016


In [None]:
# Look at the testing values:
df_ts

Unnamed: 0,gender,idg,wave,position,order,partner,int_corr,samerace,age_o,race_o,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,3,13,13,-0.13,0,21.0,2.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,14,6,4,8,0.12,0,24.0,6.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,9,10,15,19,0.11,0,27.0,3.0,...,,,,,,,,,,6757
3,1,26,2,15,8,10,0.11,1,23.0,2.0,...,,,,,,,,,,2275
4,0,29,7,7,10,5,0.45,0,27.0,4.0,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,15,18,14,11,0.74,0,24.0,2.0,...,,,,,,,,,,7982
2465,0,5,13,4,4,8,,0,,,...,,,,,,,,,,7299
2466,1,26,2,3,15,3,-0.13,0,21.0,4.0,...,,,,,,,,,,1818
2467,0,19,9,11,9,2,0.43,0,26.0,4.0,...,7.0,12.0,12.0,9.0,,,,,,937


In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df_tr.info()
df_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 184 entries, gender to id
dtypes: category(4), float64(171), int64(9)
memory usage: 8.2 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 183 entries, gender to id
dtypes: category(4), float64(171), int64(8)
memory usage: 3.4 MB


So now there is no object data in the datasets.
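
The category dtype only marks these columns; the numeric encoding happens later in the pipeline's `OneHotEncoder(handle_unknown='ignore')`. Because the test set can contain values that never occur in training (the `from` column shows different spellings and places), the sketch below illustrates what that option does. The two toy frames are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented values: "Hong Kong" appears only in the test frame.
toy_train = pd.DataFrame({"from": ["Boston, MA", "Seattle", "Boston, MA"]})
toy_test = pd.DataFrame({"from": ["Seattle", "Hong Kong"]})

# Categories are learned from the training frame only.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(toy_train)

# A category unseen during fit becomes a row of zeros instead of raising an error.
encoded_test = encoder.transform(toy_test).toarray()
print(encoded_test)
```

A row of zeros means the model gets no information from that column for the affected rows, which is worth keeping in mind for free-text columns with many rare values.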

#### Model

##### Splitting

I split the data into X and y.

###### Training data

In [None]:
# Split the training data into X_train and y_train
y_train=df_tr['match'] # y_train contains only the match column
X_train=df_tr.drop(columns=['match','id']) # X_train is all columns except match and id
# shape returns a tuple with the dimensions of the DataFrame.
print(y_train.shape)
print(X_train.shape)

(5909,)
(5909, 182)


###### Testing data

In [None]:
X_test=df_ts # X_test still contains the id column; the ColumnTransformer below only selects the training feature columns, so id is not passed to the model.
print(X_test.shape)

(2469, 183)
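
`X_test` has 183 columns here while `X_train` has 182, because `id` is still present. That is harmless as long as the preprocessing selects columns by name: a `ColumnTransformer` drops every column it was not told about when `remainder` is left at its default of `'drop'`. A minimal sketch with an invented frame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Invented frame with one feature and an id column that is not a feature.
toy_test = pd.DataFrame({"age": [21.0, 24.0, 27.0], "id": [934, 6539, 6757]})

# Only "age" is listed, so "id" is dropped by the default remainder='drop'.
preprocessor = ColumnTransformer(transformers=[("num", StandardScaler(), ["age"])])
transformed = preprocessor.fit_transform(toy_test)
print(transformed.shape)  # (3, 1)
```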


##### Pipeline Tuning

In [None]:
# Separate the numerical and categorical features in the training data

# put the numeric features in the features_numeric list
features_numeric=list(X_train.select_dtypes(include=['float64','int64']))

# put the categorical features in the features_cat list
features_cat=list(X_train.select_dtypes(include=['category']))
# print each list to see which columns it contains.
print('numeric features:', features_numeric)
print('categorical features:', features_cat)

numeric features: ['gender', 'idg', 'wave', 'position', 'order', 'partner', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sinc1_s', 'intel1_s', 'fun1_s', 'amb1_s',

In [None]:
# Imports for the pipeline pieces used below (harmless if they were already imported earlier).
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# Build the pipeline
# A Pipeline chains several processing steps so that they can be cross-validated together while their parameters are tuned.
# The parameters of each step are addressed by the step name and the parameter name separated by "__".
# It takes `steps` as a parameter, which contains all the preprocessing that I need.
# It saves time by applying the same preprocessing to both train and test data without repeating the process.


# Create a pipeline for the numerical features
numeric=Pipeline(
    steps=[
           ('imputer', SimpleImputer()), # SimpleImputer handles missing values; the default strategy='mean' fills NaN values with the column mean
           ('scaler', StandardScaler())  # StandardScaler standardizes each feature to zero mean and unit variance
    ]
)
# Create a pipeline for the categorical features
categorical=Pipeline(
    steps=[
           ('imputer',SimpleImputer(strategy='constant')), # strategy='constant' fills NaN values with a constant fill value
           ('onehot',OneHotEncoder(handle_unknown='ignore')) # OneHotEncoder encodes the categorical data; categories unseen during fit are ignored
    ]
)
# ColumnTransformer applies separate transformers to the numerical and categorical columns.
# It selects and prepares the columns of the dataset before the model is fitted to the transformed data.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric, features_numeric), # Numerical data
        ('cat', categorical, features_cat)  # Categorical data
    ]
)
# Combine the preprocessing with a suitable classifier.
full_pipline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('my_classifier',
           XGBClassifier(), # I used XGBClassifier as the classifier.
        )
    ]
)
full_pipline


np.random.seed(0)  # seed NumPy's random number generator so that the results are reproducible



In [None]:
# Fit the pipeline on the training data and predict on the test data.
full_pipline = full_pipline.fit(X_train, y_train)
full_pipline.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])
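
To make the numeric branch of the pipeline concrete, the sketch below runs the same two steps (mean imputation, then standardization) on a single invented column. The array `toy_column` is made up; the steps mirror the `numeric` pipeline defined above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One invented column with a missing value in the middle.
toy_column = np.array([[1.0], [np.nan], [3.0]])

numeric = Pipeline(steps=[("imputer", SimpleImputer()), ("scaler", StandardScaler())])

# Step 1 fills NaN with the column mean (2.0); step 2 rescales to zero mean, unit variance.
scaled = numeric.fit_transform(toy_column)
print(scaled.ravel())
```

The imputed value lands exactly on the column mean, so after scaling it becomes 0: mean imputation makes a missing entry look like an average one.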

In [None]:
# Bayesian search hyperparameters
from skopt import BayesSearchCV  # harmless if it was already imported earlier
# param_bayes is a dictionary that contains all the parameter values I want to try.
param_bayes = {
    'preprocessor__num__imputer__strategy': ['mean'],
    # preprocessor__num__imputer__strategy points to preprocessor -> num (a Pipeline) -> imputer -> strategy
    # the only value is 'mean', so NaN values are filled with the column mean
    'my_classifier__nfold': [30, 40, 50],
    # my_classifier__nfold points to my_classifier -> nfold
    # Note: nfold is an argument of xgboost.cv rather than a documented XGBClassifier hyperparameter, so it probably has no effect on the fitted model.
    'my_classifier__max_depth':[20, 30, 40]
    # my_classifier__max_depth points to my_classifier -> max_depth
    # max_depth is the maximum depth of each tree.
}

# cv is the number of cross-validation folds used to evaluate each parameter setting; cv=4 means four-fold cross-validation.
# scoring is the evaluation metric used to rank the results (here the area under the ROC curve).
# n_jobs is the number of jobs to run in parallel (here 2).
bayesS = BayesSearchCV(
    full_pipline, param_bayes, cv=4, verbose=1, n_jobs=2,
    scoring='roc_auc')
# Run the Bayesian search on the training data
bayesS.fit(X_train, y_train)
# best_score_ is the mean cross-validated score (ROC AUC here) of the best parameter setting.
# best_params_ is the parameter setting that gave the best mean score on the validation folds.
print('best score {}'.format(bayesS.best_score_))
print('best params {}'.format(bayesS.best_params_))

Fitting 4 folds for each of 1 candidates, totalling 4 fits
best score 0.8781787478283851
best params OrderedDict([('my_classifier__max_depth', 20), ('my_classifier__nfold', 50), ('preprocessor__num__imputer__strategy', 'mean')])


Best parameters:
*  max_depth = 20
*  nfold = 50
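
Since `nfold` is normally an argument of `xgboost.cv`, it is worth checking whether each key in a search space names a real parameter of the pipeline before running a long search. The sketch below shows one way to do that with `get_params()`. It is an illustration under assumptions: `RandomForestClassifier` stands in for the classifier so that the example runs with scikit-learn alone, and the helper `unknown_search_keys` is hypothetical. `XGBClassifier` accepts extra keyword arguments, so it may not raise an error for an unknown name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def unknown_search_keys(estimator, search_space):
    """Return the search-space keys that are not parameters of the estimator."""
    valid = estimator.get_params(deep=True)
    return sorted(key for key in search_space if key not in valid)


# Stand-in pipeline; the step name matches the one used in the notebook.
toy_pipeline = Pipeline(
    steps=[("scaler", StandardScaler()), ("my_classifier", RandomForestClassifier())]
)
toy_space = {
    "my_classifier__max_depth": [20, 30, 40],
    "my_classifier__nfold": [30, 40, 50],
}

unknown = unknown_search_keys(toy_pipeline, toy_space)
print(unknown)
```

If a key turns up in this list, the search is probably spending iterations on a value that does not change the model.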

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = df_ts['id']

submission['match'] = bayesS.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/sample_submission_walkthrough.csv', index=False)
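
The submission uses `predict_proba(X_test)[:,1]`. `predict_proba` returns one column per class in the order given by `classes_`, so index 1 is the probability of class 1 (a match) when the labels are 0 and 1. The sketch below checks this on invented data, with `LogisticRegression` as a stand-in classifier; neither is part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented one-feature data: larger values belong to class 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X_toy, y_toy)
proba = model.predict_proba(X_toy)

# Look up the column of the positive class instead of hard-coding it.
positive_column = list(model.classes_).index(1)
match_probability = proba[:, positive_column]
print(positive_column, match_probability)
```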

##### Result

ROC AUC in **Cross-Validation** = 0.8782 \
Score in **kaggle** = 0.87085 \
It is better than the previous trial.
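
The cross-validation score of this trial is the mean over 4 folds (`cv=4`). For a classifier, scikit-learn documents that an integer `cv` uses stratified folds, which keeps the share of matches roughly equal in every fold. The sketch below shows that on invented labels; whether `BayesSearchCV` follows exactly the same rule is an assumption based on it building on scikit-learn's search classes.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Invented imbalanced labels: 16 non-matches and 4 matches.
y_toy = np.array([0] * 16 + [1] * 4)
X_toy = np.zeros((len(y_toy), 1))

splitter = StratifiedKFold(n_splits=4)
fold_summary = []
for train_idx, valid_idx in splitter.split(X_toy, y_toy):
    # (training size, validation size, positives in the validation fold)
    fold_summary.append((len(train_idx), len(valid_idx), int(y_toy[valid_idx].sum())))

print(fold_summary)
```

Each validation fold holds a quarter of the rows and a quarter of the positives, so every fold can produce a meaningful ROC AUC.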

### **Trial 2**

In this trial, based on the previous one, I decided to drop only one column and change the hyperparameters (add a learning rate and change the max_depth and n_estimators values) to see whether the result improves.

I will use 191 columns (all of them except `pid`, including `match` and `id`).\
I will drop one column.\
I will address the imbalanced data.

**My thoughts and observations:** I expect the score to be between 0.87 and 0.88.

##### Read Training and Testing data

In [None]:
# Read all our training data by using read_csv, which takes the path of the file with the extension that I want to read.
df = pd.read_csv ('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/train.csv')
# Based on position, this function returns the first 5 rows of the dataset. It's used to quickly see if our dataset contains the proper kind of data.
df.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828


In [None]:
# Read all our testing data by using read_csv, which takes the path of the file with the extension that I want to read.
df_test = pd.read_csv ('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/test.csv')
# Based on position, this function returns the first 5 rows of the dataset. It's used to quickly see if our dataset contains the proper kind of data.
df_test.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052


In [None]:
# Display the column names of the training and testing data
print(df.columns)
print(df_test.columns)

Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=192)
Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=191)


#### Preprocessing

I will check the number of NaN values. Next, I will check the data types. If there is object data, I will convert it into categorical data so I can use it.

##### Drop some features

In [None]:
# Drop the feature that I consider unimportant from both the training and testing data.
# drop() removes the columns named in `columns`; inplace=True modifies the data frame itself instead of returning a copy.
df.drop(columns=['pid'],inplace=True)
df_test.drop(columns=['pid'],inplace=True)

##### Check NaN values

* Check how many missing values exist by using:
 * isnull(): returns a boolean for each value indicating whether it is NA.
 * sum(): returns the number of NaN values in each column.
 * sort_values(): sorts the counts in descending order (ascending=False).

###### Training data

In [None]:
# Count the missing values in each column
df.isnull().sum().sort_values(ascending=False)

num_in_3    5449
numdat_3    4849
expnum      4627
amb7_2      4519
sinc7_2     4519
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 191, dtype: int64

###### Testing data

In [None]:
# Count the missing values in each column
df_test.isnull().sum().sort_values(ascending=False)

num_in_3    2261
numdat_3    2033
expnum      1951
amb7_2      1904
sinc7_2     1904
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 190, dtype: int64

##### Checking data types (convert object data to categorical data)

In this section, I will check the data types and then convert any object columns to categorical data.

In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 191 entries, gender to id
dtypes: float64(172), int64(11), object(8)
memory usage: 8.6+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 190 entries, gender to id
dtypes: float64(172), int64(10), object(8)
memory usage: 3.6+ MB


So there are 8 object columns in both the training and testing data.

###### Training data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns whose dtype matches.
# include lists the dtypes that I want to select.
df.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Ed.D. in higher education policy at TC,University of Michigan-Ann Arbor,1290.00,21645.00,"Palo Alto, CA",,,University President
1,Engineering,,,,"Boston, MA",2021,,Engineer or iBanker or consultant
2,Urban Planning,"Rizvi College of Architecture, Bombay University",,,"Bombay, India",,,Real Estate Consulting
3,International Affairs,,,,"Washington, DC",10471,45300.00,public service
4,Business,Harvard College,1400.00,26019.00,Midwest USA,66208,46138.00,undecided
...,...,...,...,...,...,...,...,...
5904,Clinical Psychology,,,,New York,11803,65708.00,Psychologist
5905,MBA,,,,Colombia,,,Consulting
5906,MA Science Education,University of Washington,1155.00,13258.00,Seattle,98115,37881.00,Teacher
5907,Biochemistry,,,,Canada,,,pharmaceuticals and biotechnology


###### Testing data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns whose dtype matches.
# include lists the dtypes that I want to select.
df_test.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Psychology,,,,Hong Kong,0,,psychologist
1,education,wellesley college,1341.00,25504.00,"atlanta, ga",30071,36223.00,education
2,MBA,,,,San Francisco,10021,55080.00,Consulting
3,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
4,Business,,,,"Atlanta, GA",27870,21590.00,Marketing and Media
...,...,...,...,...,...,...,...,...
2464,Neuroscience and Education,Columbia,1430.00,26908.00,Hong Kong,0,,Academic
2465,School Psychology,Bucknell University,1290.00,25335.00,"Erie, PA",,,school psychologist
2466,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
2467,Mathematics,,,,Vestal,13850,42640.00,college professor


###### Convert training and testing object data to categorical data

In [None]:
# Make a copy of the training and testing data frames before doing any assignment
df_tr=df.copy() # copy of the training data frame's indices and data
df_ts=df_test.copy() # copy of the testing data frame's indices and data

# obj_tr contains all object columns of the training set;
# select_dtypes selects the columns with the object dtype.
obj_tr=df.select_dtypes(include=['object'])

# Convert every object column to the category dtype
for i in obj_tr:
   df_tr[i]=df_tr[i].astype("category")

# obj_ts contains all object columns of the testing set;
# select_dtypes selects the columns with the object dtype.
obj_ts=df_test.select_dtypes(include=['object'])

# Convert every object column to the category dtype
for i in obj_ts:
   df_ts[i]=df_ts[i].astype("category")


In [None]:
# Look at the training values:
df_tr

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,match,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,0,...,,,,,,,,,,4828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5904,0,1,2,9,20,2,2.0,18,1,0,...,12.0,12.0,9.0,12.0,,,,,,3390
5905,1,24,2,9,20,19,15.0,5,6,0,...,,,,,,,,,,4130
5906,0,13,2,11,21,5,5.0,3,18,0,...,,,,,,,,,,1178
5907,1,10,2,7,16,6,14.0,9,10,1,...,,,,,,,,,,5016


In [None]:
# Look at the testing values:
df_ts

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,int_corr,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,-0.13,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,0.12,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,0.11,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,0.11,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,0.45,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,2,15,19,18,18.0,14,11,0.74,...,,,,,,,,,,7982
2465,0,5,1,13,9,4,4.0,4,8,,...,,,,,,,,,,7299
2466,1,26,2,2,19,3,,15,3,-0.13,...,,,,,,,,,,1818
2467,0,19,2,9,20,11,11.0,9,2,0.43,...,7.0,12.0,12.0,9.0,,,,,,937


In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df_tr.info()
df_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 191 entries, gender to id
dtypes: category(8), float64(172), int64(11)
memory usage: 8.4 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 190 entries, gender to id
dtypes: category(8), float64(172), int64(10)
memory usage: 3.5 MB


Now there is no object-dtype data left in either dataset.
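
The conversion from object to category can be sketched on a toy frame. This is only an illustration: the column values below are made up and are not rows from the competition data.

```python
import pandas as pd

# Hypothetical toy frame: two object (string) columns and one numeric column.
toy = pd.DataFrame({
    "field": pd.Series(["Law", "MBA", None], dtype="object"),
    "career": pd.Series(["Lawyer", None, "Consulting"], dtype="object"),
    "age": [25, 31, 28],
})

toy_cat = toy.copy()  # work on a copy so the original frame stays untouched
# Convert every object column to the category dtype.
for col in toy.select_dtypes(include=["object"]).columns:
    toy_cat[col] = toy_cat[col].astype("category")

print(toy_cat.dtypes)
```

Missing entries stay missing after the conversion, so the imputation step is still needed afterwards.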

#### Model

##### Splitting

I split the data into X and y.

###### Training data

In [None]:
# Split the training data into X_train and y_train
y_train=df_tr['match'] # y_train contains only the match column
X_train=df_tr.drop(columns=['match','id'],axis=1) # X_train contains all columns except the match and id columns.
# shape returns a tuple representing the dimensionality of the DataFrame.
print(y_train.shape) 
print(X_train.shape)

(5909,)
(5909, 189)


###### Testing data

In [None]:
X_test=df_ts # X_test keeps every test column, including id; the ColumnTransformer below selects the feature columns by name, so id is not used.
print(X_test.shape)

(2469, 190)


##### PipeLine Tuning

In [None]:
# Separate the numerical and categorical features in the training data

# Put the numeric feature names in the features_numeric list
features_numeric=list(X_train.select_dtypes(include=['float64','int64']))

# Put the categorical feature names in the features_cat list
features_cat=list(X_train.select_dtypes(include=['category']))
# Print each list to see which columns it contains.
print('numeric features:', features_numeric)
print('categorical features:', features_cat)

numeric features: ['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sinc1_s', 

In [None]:
# Build the pipeline
# A pipeline chains several processing steps so that they can be cross-validated together while their parameters are tuned.
# Parameters of each step are set through the step name and the parameter name joined by "__".
# It takes a steps parameter that lists all the preprocessing that I need.
# It saves time by applying the same preprocessing to both train and test data without repeating the code.


# Create a pipeline for the numerical features and choose its hyperparameters
numeric=Pipeline(
    steps=[
           ('imputer', SimpleImputer()), # SimpleImputer handles missing values; the default strategy='mean' fills NaN values with the column mean
           ('scaler', StandardScaler())  # StandardScaler standardizes each feature to zero mean and unit variance
    ]
)
# Create a pipeline for the categorical features
categorical=Pipeline(
    steps=[
           ('imputer',SimpleImputer(strategy='constant')), # strategy='constant' fills NaN values with a constant placeholder
            ('onehot',OneHotEncoder(handle_unknown='ignore'))# OneHotEncoder encodes categorical data as 0/1 columns; categories unseen during fit are ignored
    ]
)
# ColumnTransformer applies separate transformers to the numerical and categorical columns.
# It selects and prepares the columns of the dataset before a model is fitted to the transformed data.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric, features_numeric),# Numerical data
        ('cat', categorical, features_cat) # Categorical data
    ]
)
# Combine the preprocessing with a suitable classifier.
full_pipline = Pipeline(  
    steps=[
        ('preprocessor', preprocessor), 
        ('my_classifier', 
           XGBClassifier(), # I used XGBClassifier as the classifier.
        )
    ]
)
full_pipline


np.random.seed(0)  # Fix NumPy's random seed so that the random numbers are reproducible
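
The `__` naming convention used by the pipeline can be tried out on a small example. The sketch below is an assumption-laden illustration: the toy data is invented, and `LogisticRegression` stands in for `XGBClassifier` so that only scikit-learn is needed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, not the competition data.
X_toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 27.0, 35.0, 24.0],
    "field": pd.Series(["Law", "MBA", np.nan, "Law", "MBA", "Law"], dtype="object"),
})
y_toy = np.array([0, 1, 0, 1, 1, 0])

num_pipe = Pipeline(steps=[("imputer", SimpleImputer()), ("scaler", StandardScaler())])
cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer(transformers=[("num", num_pipe, ["age"]), ("cat", cat_pipe, ["field"])])
# LogisticRegression is a stand-in for XGBClassifier (assumption for this sketch).
toy_pipe = Pipeline(steps=[("preprocessor", pre), ("my_classifier", LogisticRegression())])

# Step names joined by "__" address a nested parameter:
# preprocessor -> num -> imputer -> strategy
toy_pipe.set_params(preprocessor__num__imputer__strategy="median")
toy_pipe.fit(X_toy, y_toy)

strategy = toy_pipe.get_params()["preprocessor__num__imputer__strategy"]
n_features = toy_pipe.named_steps["preprocessor"].transform(X_toy).shape[1]
print(strategy, n_features)
```

The same dotted-by-`__` keys are what a hyperparameter search expects as dictionary keys.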



In [None]:
# Fit the pipeline on the training data and predict on the test data.
full_pipline = full_pipline.fit(X_train, y_train)
full_pipline.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
# Bayesian search hyperparameters
# param_bayes is a dictionary that contains all the parameter values I want to try.
param_bayes = {
    'preprocessor__num__imputer__strategy': ['mean'],
    # preprocessor__num__imputer__strategy points to preprocessor -> num (a Pipeline) -> imputer -> strategy
    # strategy='mean' fills NaN values with the column mean
    'my_classifier__n_estimators': [130, 150, 170],  
     # my_classifier__n_estimators points to my_classifier -> n_estimators 
     # n_estimators is the number of boosted trees (boosting rounds).
    'my_classifier__max_depth':[40, 50, 80],   
    # my_classifier__max_depth points to my_classifier -> max_depth   
    # max_depth is the maximum depth of each tree.
    'my_classifier__learning_rate':[0.1,0.01,0.001] 
    # learning_rate shrinks the contribution of each new tree to the model.
}

# cv is the number of cross-validation folds used for each combination of parameters; cv=4 means four-fold cross-validation
# scoring is the evaluation metric used to rank the results (here ROC AUC)
# n_jobs is the number of jobs to run in parallel (here 2)
bayesS = BayesSearchCV(
    full_pipline, param_bayes, cv=4, verbose=1, n_jobs=2, 
    scoring='roc_auc')
# Run the Bayesian search on the training data
bayesS.fit(X_train, y_train)
# best_score_ is the mean cross-validated score of the best estimator.
# best_params_ is the parameter setting that gave the best results on the held-out folds.
print('best score {}'.format(bayesS.best_score_))
print('best score {}'.format(bayesS.best_params_))


Fitting 4 folds for each of 1 candidates, totalling 4 fits
...
Fitting 4 folds for each of 1 candidates, totalling 4 fits
best score 0.8787789642351773
best score OrderedDict([('my_classifier__learning_rate', 0.1), ('my_classifier__max_depth', 50), ('my_classifier__n_estimators', 170), ('preprocessor__num__imputer__strategy', 'mean')])
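
To make the roles of `cv`, `scoring`, `best_score_` and `best_params_` concrete, here is a minimal sketch on synthetic data. It uses scikit-learn's `GridSearchCV` as a stand-in for `BayesSearchCV` (it tries every candidate instead of choosing them adaptively, but exposes the same attributes), and `LogisticRegression` as a stand-in for `XGBClassifier`; both substitutions and the data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced toy data (not the competition data).
X_syn, y_syn = make_classification(
    n_samples=200, n_features=8, weights=[0.8, 0.2], random_state=0
)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("my_classifier", LogisticRegression())])
param_grid = {"my_classifier__C": [0.1, 1.0, 10.0]}

# cv=4 -> four-fold cross-validation; every candidate is ranked by ROC AUC.
search = GridSearchCV(pipe, param_grid, cv=4, scoring="roc_auc")
search.fit(X_syn, y_syn)

print("best score", search.best_score_)    # mean ROC AUC over the 4 validation folds
print("best params", search.best_params_)  # the candidate with the highest mean score
```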


Best parameters:
* learning_rate = 0.1
* max_depth = 50
* n_estimators = 170

In [None]:
# Use this cell to write the predicted match probabilities to a CSV submission file.
submission = pd.DataFrame()

submission['id'] = df_ts['id']

submission['match'] = bayesS.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/sample_submission_walkthrough.csv', index=False)

##### Result

Score in **Cross-Validation** (ROC AUC, because the search uses scoring='roc_auc') = 0.8787 \
Score on **Kaggle** = 0.88029 


This is better than the previous trial: the score improved when I increased n_estimators and max_depth.
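
Because the search ranks candidates with `scoring='roc_auc'`, the cross-validation number reported here is a ROC AUC rather than a classification accuracy. On imbalanced labels the two can differ a lot, as the following small sketch with made-up labels suggests.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 9 non-matches and 1 match.
y_true = [0] * 9 + [1]
# A useless model that gives every pair the same match probability.
proba = [0.1] * 10
pred = [int(p >= 0.5) for p in proba]  # hard labels at a 0.5 threshold: all 0

acc = accuracy_score(y_true, pred)   # fraction of correct hard labels
auc = roc_auc_score(y_true, proba)   # ranking quality of the probabilities
print(acc, auc)
```

The constant model looks strong by accuracy but is no better than chance by ROC AUC, which is why the metric name matters when comparing trials.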

### **Trial 3**

In the previous trial, I noticed that larger values of max_depth and n_estimators increased the score, so I tried even larger values to improve model performance.

**My thoughts and observations:** I expect the score to be between 0.88029 and 0.8890.

##### Read Training and Testing data

In [None]:
# Read the training data with read_csv, which takes the path of the file (including its extension).
df = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/train.csv')
# head(5) returns the first 5 rows of the dataset. It is a quick way to check that the dataset contains the expected kind of data.
df.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828


In [None]:
# Read the testing data with read_csv, which takes the path of the file (including its extension).
df_test = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/test.csv')
# head(5) returns the first 5 rows of the dataset. It is a quick way to check that the dataset contains the expected kind of data.
df_test.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052


In [None]:
# Display the column names of the training and testing data
print(df.columns)
print(df_test.columns)

Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=192)
Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=191)


#### Preprocessing

I will check the number of NaN values. Next, I will check the data types. If there is object data, I will convert it into categorical data so I can use it.

##### Drop some features

In [None]:
# Drop unimportant features from the training and testing data
# drop() removes the columns named in columns; inplace=True modifies the data frame in place
df.drop(columns=['pid'],inplace=True)
df_test.drop(columns=['pid'],inplace=True)

##### Check NaN values

* Check the number of missing values by using
 * isnull() returns a boolean mask indicating whether each value is missing (NA).
 * sum() returns the number of NaN values in each column.
 * sort_values() sorts those counts in descending order.
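
The three calls listed above chain together as in this small sketch; the toy frame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 2, 1 and 0 missing values per column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, 3.0],
})

# isnull() -> boolean mask, sum() -> NaN count per column,
# sort_values(ascending=False) -> columns with the most NaN values first.
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)
```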

###### Training data

In [None]:
# Check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

num_in_3    5449
numdat_3    4849
expnum      4627
amb7_2      4519
sinc7_2     4519
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 191, dtype: int64

###### Testing data

In [None]:
# Check the number of missing values in each column
df_test.isnull().sum().sort_values(ascending=False)

num_in_3    2261
numdat_3    2033
expnum      1951
amb7_2      1904
sinc7_2     1904
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 190, dtype: int64

##### Checking data types(convert object data to categorical data)

In this section, I check the data types and then convert every object column to categorical data.

In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 191 entries, gender to id
dtypes: float64(172), int64(11), object(8)
memory usage: 8.6+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 190 entries, gender to id
dtypes: float64(172), int64(10), object(8)
memory usage: 3.6+ MB


So there are 8 object columns in both the training and the testing data.

###### Training data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns whose dtypes match.
# include lists the dtypes that I want to select.
df.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Ed.D. in higher education policy at TC,University of Michigan-Ann Arbor,1290.00,21645.00,"Palo Alto, CA",,,University President
1,Engineering,,,,"Boston, MA",2021,,Engineer or iBanker or consultant
2,Urban Planning,"Rizvi College of Architecture, Bombay University",,,"Bombay, India",,,Real Estate Consulting
3,International Affairs,,,,"Washington, DC",10471,45300.00,public service
4,Business,Harvard College,1400.00,26019.00,Midwest USA,66208,46138.00,undecided
...,...,...,...,...,...,...,...,...
5904,Clinical Psychology,,,,New York,11803,65708.00,Psychologist
5905,MBA,,,,Colombia,,,Consulting
5906,MA Science Education,University of Washington,1155.00,13258.00,Seattle,98115,37881.00,Teacher
5907,Biochemistry,,,,Canada,,,pharmaceuticals and biotechnology


###### Testing data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns whose dtypes match.
# include lists the dtypes that I want to select.
df_test.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Psychology,,,,Hong Kong,0,,psychologist
1,education,wellesley college,1341.00,25504.00,"atlanta, ga",30071,36223.00,education
2,MBA,,,,San Francisco,10021,55080.00,Consulting
3,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
4,Business,,,,"Atlanta, GA",27870,21590.00,Marketing and Media
...,...,...,...,...,...,...,...,...
2464,Neuroscience and Education,Columbia,1430.00,26908.00,Hong Kong,0,,Academic
2465,School Psychology,Bucknell University,1290.00,25335.00,"Erie, PA",,,school psychologist
2466,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
2467,Mathematics,,,,Vestal,13850,42640.00,college professor


###### Convert training and testing object data to categorical data

In [None]:
# Copy the training and testing data frames before modifying them.
df_tr=df.copy() # Copy of the training data frame's indices and data.
df_ts=df_test.copy() # Copy of the testing data frame's indices and data.

# obj_tr holds the object-dtype columns of the training set,
# selected with select_dtypes.
obj_tr=df.select_dtypes(include=['object'])  

# Convert every object column to the category dtype.
for i in obj_tr:
   df_tr[i]=df_tr[i].astype("category")

# obj_ts holds the object-dtype columns of the testing set,
# selected with select_dtypes.
obj_ts=df_test.select_dtypes(include=['object']) 

# Convert every object column to the category dtype.
for i in obj_ts:
   df_ts[i]=df_ts[i].astype("category")


In [None]:
# Look at the training values:
df_tr

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,match,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,0,...,,,,,,,,,,4828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5904,0,1,2,9,20,2,2.0,18,1,0,...,12.0,12.0,9.0,12.0,,,,,,3390
5905,1,24,2,9,20,19,15.0,5,6,0,...,,,,,,,,,,4130
5906,0,13,2,11,21,5,5.0,3,18,0,...,,,,,,,,,,1178
5907,1,10,2,7,16,6,14.0,9,10,1,...,,,,,,,,,,5016


In [None]:
# Look at the testing values:
df_ts

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,int_corr,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,-0.13,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,0.12,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,0.11,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,0.11,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,0.45,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,2,15,19,18,18.0,14,11,0.74,...,,,,,,,,,,7982
2465,0,5,1,13,9,4,4.0,4,8,,...,,,,,,,,,,7299
2466,1,26,2,2,19,3,,15,3,-0.13,...,,,,,,,,,,1818
2467,0,19,2,9,20,11,11.0,9,2,0.43,...,7.0,12.0,12.0,9.0,,,,,,937


In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df_tr.info()
df_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 191 entries, gender to id
dtypes: category(8), float64(172), int64(11)
memory usage: 8.4 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 190 entries, gender to id
dtypes: category(8), float64(172), int64(10)
memory usage: 3.5 MB


Now there is no object-dtype data left in either dataset.

#### Model

##### Splitting

I split the data into X and y.

###### Training data

In [None]:
# Split the training data into X_train and y_train
y_train=df_tr['match'] # y_train contains only the match column
X_train=df_tr.drop(columns=['match','id'],axis=1) # X_train contains all columns except the match and id columns.
# shape returns a tuple representing the dimensionality of the DataFrame.
print(y_train.shape) 
print(X_train.shape)

(5909,)
(5909, 189)


###### Testing data

In [None]:
X_test=df_ts # X_test keeps every test column, including id; the ColumnTransformer below selects the feature columns by name, so id is not used.
print(X_test.shape)

(2469, 190)
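
X_test has 190 columns because it still contains `id`, while X_train has 189. This does not break prediction as long as the preprocessing selects columns by name: a `ColumnTransformer` with the default `remainder='drop'` discards every column it was not told about. A minimal sketch on an invented frame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with an id column that is not a feature.
toy_test = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1.0, 2.0, 3.0],
    "id": [101, 102, 103],
})

# Only the listed columns are transformed; remainder='drop' is the default,
# so every other column (here: id) is discarded.
ct = ColumnTransformer(transformers=[("num", StandardScaler(), ["age", "income"])])
out = ct.fit_transform(toy_test)

print(out.shape)  # two feature columns remain; id is gone
```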


##### PipeLine Tuning

In [None]:
# Separate the numerical and categorical features in the training data

# Put the numeric feature names in the features_numeric list
features_numeric=list(X_train.select_dtypes(include=['float64','int64']))

# Put the categorical feature names in the features_cat list
features_cat=list(X_train.select_dtypes(include=['category']))
# Print each list to see which columns it contains.
print('numeric features:', features_numeric)
print('categorical features:', features_cat)

numeric features: ['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sinc1_s', 

In [None]:
# Build the pipeline
# A pipeline chains several processing steps so that they can be cross-validated together while their parameters are tuned.
# Parameters of each step are set through the step name and the parameter name joined by "__".
# It takes a steps parameter that lists all the preprocessing that I need.
# It saves time by applying the same preprocessing to both train and test data without repeating the code.


# Create a pipeline for the numerical features and choose its hyperparameters
numeric=Pipeline(
    steps=[
           ('imputer', SimpleImputer()), # SimpleImputer handles missing values; the default strategy='mean' fills NaN values with the column mean
           ('scaler', StandardScaler())  # StandardScaler standardizes each feature to zero mean and unit variance
    ]
)
# Create a pipeline for the categorical features
categorical=Pipeline(
    steps=[
           ('imputer',SimpleImputer(strategy='constant')), # strategy='constant' fills NaN values with a constant placeholder
            ('onehot',OneHotEncoder(handle_unknown='ignore'))# OneHotEncoder encodes categorical data as 0/1 columns; categories unseen during fit are ignored
    ]
)
# ColumnTransformer applies separate transformers to the numerical and categorical columns.
# It selects and prepares the columns of the dataset before a model is fitted to the transformed data.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric, features_numeric),# Numerical data
        ('cat', categorical, features_cat) # Categorical data
    ]
)
# Combine the preprocessing with a suitable classifier.
full_pipline = Pipeline(  
    steps=[
        ('preprocessor', preprocessor), 
        ('my_classifier', 
           XGBClassifier(), # I used XGBClassifier as the classifier.
        )
    ]
)
full_pipline


np.random.seed(0)  # Fix NumPy's random seed so that the random numbers are reproducible



In [None]:
# Fit the pipeline on the training data and predict on the test data.
full_pipline = full_pipline.fit(X_train, y_train)
full_pipline.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
# Bayesian search hyperparameters
# param_bayes is a dictionary that contains all the parameter values I want to try.
param_bayes = {
    'preprocessor__num__imputer__strategy': ['mean'],
    # preprocessor__num__imputer__strategy points to preprocessor -> num (a Pipeline) -> imputer -> strategy
    # strategy='mean' fills NaN values with the column mean
    'my_classifier__n_estimators': [400, 450, 550],  
     # my_classifier__n_estimators points to my_classifier -> n_estimators 
     # n_estimators is the number of boosted trees (boosting rounds).
    'my_classifier__max_depth':[60, 70,80],    
    # my_classifier__max_depth points to my_classifier -> max_depth   
    # max_depth is the maximum depth of each tree.
    'my_classifier__learning_rate':[0.1] 
    # learning_rate shrinks the contribution of each new tree to the model.
}

# cv is the number of cross-validation folds used for each combination of parameters; cv=4 means four-fold cross-validation
# scoring is the evaluation metric used to rank the results (here ROC AUC)
# n_jobs is the number of jobs to run in parallel (here 2)
bayesS = BayesSearchCV(
    full_pipline, param_bayes, cv=4, verbose=1, n_jobs=2, 
    scoring='roc_auc')
# Run the Bayesian search on the training data
bayesS.fit(X_train, y_train)
# best_score_ is the mean cross-validated score of the best estimator.
# best_params_ is the parameter setting that gave the best results on the held-out folds.
print('best score {}'.format(bayesS.best_score_))
print('best score {}'.format(bayesS.best_params_))


Fitting 4 folds for each of 1 candidates, totalling 4 fits
...
Fitting 4 folds for each of 1 candidates, totalling 4 fits
best score 0.8787963978512694
best score OrderedDict([('my_classifier__learning_rate', 0.1), ('my_classifier__max_depth', 80), ('my_classifier__n_estimators', 400), ('preprocessor__num__imputer__strategy', 'mean')])


Best parameters:
* learning_rate = 0.1
* max_depth = 80
* n_estimators = 400

In [None]:
# Use this cell to write the predicted match probabilities to a CSV submission file.
submission = pd.DataFrame()

submission['id'] = df_ts['id']

submission['match'] = bayesS.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/sample_submission_walkthrough.csv', index=False)

##### Result

Score in **Cross-Validation** (ROC AUC, because the search uses scoring='roc_auc') = 0.87879 \
Score on **Kaggle Public** = 0.88222 \
Score on **Kaggle Private** = 0.88646

This is better than all previous trials.

Notes: 
* I ran more trials than the ones shown here, but these are the ones that gave the highest scores.
* I tried feature selection with SelectKBest, but it gave a lower score.
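
For reference, univariate feature selection with `SelectKBest` works roughly as in the sketch below. The notebook does not record which score function or `k` was tried, so `f_classif`, `k=1` and the synthetic data are assumptions chosen only to show the mechanics.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_sel = np.array([0, 1] * 20)
# Hypothetical features: column 0 tracks the label, columns 1-3 are pure noise.
X_sel = rng.normal(size=(40, 4))
X_sel[:, 0] += 3.0 * y_sel

# Keep the k features with the highest ANOVA F-score against the label.
selector = SelectKBest(score_func=f_classif, k=1)
X_reduced = selector.fit_transform(X_sel, y_sel)
kept = selector.get_support(indices=True)
print(kept, X_reduced.shape)
```

Each feature is scored on its own, so features that only help in combination with others can be discarded, which is one possible reason for a lower score with a tree-based model.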

### **Trial 4 (RandomOverSampler)**

Building on the previous trial, I used RandomOverSampler to address the class imbalance and improve model performance.

**My thoughts and observations:** I expect the score to be between 0.88222 and 0.8850.
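
The idea behind random oversampling is to repeat randomly chosen minority-class rows until the classes are the same size. The sketch below is a hand-rolled pandas version of that idea on an invented frame, not a call to imbalanced-learn's `RandomOverSampler`.

```python
import pandas as pd

# Hypothetical imbalanced training frame: 8 non-matches and 2 matches.
train = pd.DataFrame({"feature": range(10), "match": [0] * 8 + [1] * 2})

majority_size = int(train["match"].value_counts().max())

parts = []
for label, group in train.groupby("match"):
    parts.append(group)  # keep all original rows of this class
    extra = majority_size - len(group)
    if extra > 0:
        # Add randomly repeated rows only where a class is short.
        parts.append(group.sample(n=extra, replace=True, random_state=0))
balanced = pd.concat(parts, ignore_index=True)

print(balanced["match"].value_counts().to_dict())
```

When this is combined with cross-validation, the resampling should be applied to the training folds only; duplicating rows before the split lets copies of the same row land in both the training and validation folds and inflates the score.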

##### Read Training and Testing data

In [None]:
# Read the training data with read_csv, which takes the path of the file (including its extension).
df = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/train.csv')
# head(5) returns the first 5 rows of the dataset. It is a quick way to check that the dataset contains the expected kind of data.
df.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828


In [None]:
# Read the testing data with read_csv, which takes the path of the CSV file to load.
df_test = pd.read_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/test.csv')
# head(5) returns the first 5 rows, a quick way to check that the data loaded as expected.
df_test.head(5)

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052


In [None]:
# Display the column names of the training and testing data
print(df.columns)
print(df_test.columns)

Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=192)
Index(['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1',
       'order', 'partner', 'pid',
       ...
       'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3', 'id'],
      dtype='object', length=191)


#### Preprocessing

I will check the number of NaN values. Next, I will check the data types. If there is object data, I will convert it into categorical data so I can use it.

##### Drop some features

In [None]:
# Drop unimportant features from the training and testing data
# drop() removes the columns named in `columns`; inplace=True modifies the DataFrame itself instead of returning a copy
df.drop(columns=['pid'],inplace=True)
df_test.drop(columns=['pid'],inplace=True)

##### Check NaN values

* Count the missing values by using
 * isnull() returns a boolean value for each cell indicating whether the value is missing (NA).
 * sum() returns the number of NaN values in each column.
 * sort_values(ascending=False) sorts these counts in descending order.

###### Training data

In [None]:
# Count the missing values in each column, largest first
df.isnull().sum().sort_values(ascending=False)

num_in_3    5449
numdat_3    4849
expnum      4627
amb7_2      4519
sinc7_2     4519
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 191, dtype: int64

###### Testing data

In [None]:
# Count the missing values in each column, largest first
df_test.isnull().sum().sort_values(ascending=False)

num_in_3    2261
numdat_3    2033
expnum      1951
amb7_2      1904
sinc7_2     1904
            ... 
position       0
round          0
wave           0
condtn         0
id             0
Length: 190, dtype: int64

##### Checking data types (convert object data to categorical data)

In this section, I will check data types and then any object will be converted to categorical data.

In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 191 entries, gender to id
dtypes: float64(172), int64(11), object(8)
memory usage: 8.6+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 190 entries, gender to id
dtypes: float64(172), int64(10), object(8)
memory usage: 3.6+ MB


So there are 8 object-typed columns in both the training and testing data.

###### Training data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns that match the given dtypes.
# include lists the dtypes that I want to select.
df.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Ed.D. in higher education policy at TC,University of Michigan-Ann Arbor,1290.00,21645.00,"Palo Alto, CA",,,University President
1,Engineering,,,,"Boston, MA",2021,,Engineer or iBanker or consultant
2,Urban Planning,"Rizvi College of Architecture, Bombay University",,,"Bombay, India",,,Real Estate Consulting
3,International Affairs,,,,"Washington, DC",10471,45300.00,public service
4,Business,Harvard College,1400.00,26019.00,Midwest USA,66208,46138.00,undecided
...,...,...,...,...,...,...,...,...
5904,Clinical Psychology,,,,New York,11803,65708.00,Psychologist
5905,MBA,,,,Colombia,,,Consulting
5906,MA Science Education,University of Washington,1155.00,13258.00,Seattle,98115,37881.00,Teacher
5907,Biochemistry,,,,Canada,,,pharmaceuticals and biotechnology


###### Testing data

In [None]:
# select_dtypes returns the subset of the DataFrame's columns that match the given dtypes.
# include lists the dtypes that I want to select.
df_test.select_dtypes(include=['object'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
0,Psychology,,,,Hong Kong,0,,psychologist
1,education,wellesley college,1341.00,25504.00,"atlanta, ga",30071,36223.00,education
2,MBA,,,,San Francisco,10021,55080.00,Consulting
3,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
4,Business,,,,"Atlanta, GA",27870,21590.00,Marketing and Media
...,...,...,...,...,...,...,...,...
2464,Neuroscience and Education,Columbia,1430.00,26908.00,Hong Kong,0,,Academic
2465,School Psychology,Bucknell University,1290.00,25335.00,"Erie, PA",,,school psychologist
2466,Law,,,,Brooklyn,11204,26482.00,Intellectual Property Attorney
2467,Mathematics,,,,Vestal,13850,42640.00,college professor


###### Convert training and testing object data to categorical data

In [None]:
# Make a copy of the training and testing data frames before modifying them
df_tr=df.copy() # copy of the training DataFrame's indices and data
df_ts=df_test.copy() # copy of the testing DataFrame's indices and data

# obj_tr contains all object-typed columns of the training set,
# selected with select_dtypes.
obj_tr=df.select_dtypes(include=['object'])  

# Convert every object column of the training set to the category dtype
for i in obj_tr:
   df_tr[i]=df_tr[i].astype("category")

# obj_ts contains all object-typed columns of the testing set,
# selected with select_dtypes (from df_test, not df).
obj_ts=df_test.select_dtypes(include=['object']) 

# Convert every object column of the testing set to the category dtype
for i in obj_ts:
   df_ts[i]=df_ts[i].astype("category")


In [None]:
# Look at the training values:
df_tr

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,match,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,0,...,,,,,,,,,,4828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5904,0,1,2,9,20,2,2.0,18,1,0,...,12.0,12.0,9.0,12.0,,,,,,3390
5905,1,24,2,9,20,19,15.0,5,6,0,...,,,,,,,,,,4130
5906,0,13,2,11,21,5,5.0,3,18,0,...,,,,,,,,,,1178
5907,1,10,2,7,16,6,14.0,9,10,1,...,,,,,,,,,,5016


In [None]:
# Look at the testing values:
df_ts

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,int_corr,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,-0.13,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,0.12,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,0.11,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,0.11,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,0.45,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,2,15,19,18,18.0,14,11,0.74,...,,,,,,,,,,7982
2465,0,5,1,13,9,4,4.0,4,8,,...,,,,,,,,,,7299
2466,1,26,2,2,19,3,,15,3,-0.13,...,,,,,,,,,,1818
2467,0,19,2,9,20,11,11.0,9,2,0.43,...,7.0,12.0,12.0,9.0,,,,,,937


In [None]:
# Display data set info for checking types:
# info() prints data frame information, such as the index dtype and columns, non-null values, and memory usage.
df_tr.info()
df_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 191 entries, gender to id
dtypes: category(8), float64(172), int64(11)
memory usage: 8.4 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 190 entries, gender to id
dtypes: category(8), float64(172), int64(10)
memory usage: 3.5 MB


So now there is no object data in the datasets.

#### Model

##### Splitting

I split the data into X and y.

###### Training data

In [None]:
# Split the training data into X_train and y_train
y_train=df_tr['match'] # y_train contains only the match column
X_train=df_tr.drop(columns=['match','id']) # X_train contains all columns except match and id
# shape returns a tuple representing the dimensionality of the DataFrame.
print(y_train.shape) 
print(X_train.shape)

(5909,)
(5909, 189)


In [None]:
# Counter is a subclass of dict, with elements serving as keys and their counts serving as values.
from collections import Counter
print(sorted(Counter(y_train).items()))

[(0, 4921), (1, 988)]


In [None]:
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
# RandomOverSampler addresses the class imbalance by resampling the minority class until the number of 1's equals the number of 0's.
ros = RandomOverSampler(random_state=0)
X_train, y_train = ros.fit_resample(X_train, y_train)
print(sorted(Counter(y_train).items()))

[(0, 4921), (1, 4921)]
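
As a rough illustration of what random oversampling does, the sketch below duplicates randomly chosen minority-class rows until both classes have the same count. It is not imblearn's actual implementation; the function name `random_oversample` and the toy data are hypothetical and only meant to show the idea.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are the same size.

    Hypothetical helper for illustration, not the imblearn implementation.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Sample with replacement to fill the gap up to the majority size.
        for _ in range(target - n):
            j = rng.choice(idx)
            out_rows.append(rows[j])
            out_labels.append(cls)
    return out_rows, out_labels

# Toy data (assumed) with the same kind of imbalance as the match column: mostly 0s.
rows = [[i] for i in range(12)]
labels = [0] * 10 + [1] * 2
new_rows, new_labels = random_oversample(rows, labels)
print(sorted(Counter(new_labels).items()))
```

Note that no new information is created: every added row is an exact copy of an existing minority row.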


###### Testing data

In [None]:
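X_test=df_ts # X_test keeps every column, including id; the ColumnTransformer below selects only the training feature columns by name, so id is not used as a feature.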
print(X_test.shape)

(2469, 190)


##### Pipeline Tuning

In [None]:
# Separate numerical and categorical features in the training data

# Put the numeric features in the features_numeric list
features_numeric=list(X_train.select_dtypes(include=['float64','int64']))

# Put the categorical features in the features_cat list
features_cat=list(X_train.select_dtypes(include=['category']))
# Print each list to see the column names it contains.
print('numeric features:', features_numeric)
print('categorical features:', features_cat)

numeric features: ['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sinc1_s', 

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# Build the pipeline.
# A pipeline chains several processing steps so they can be cross-validated together while tuning their parameters.
# Parameters of each step are set through the step name and the parameter name joined by "__".
# It takes `steps` as a parameter, which lists all the preprocessing that I need.
# It saves time by applying the same preprocessing to both train and test data without repeating the process.


# Create a pipeline for the numerical features
numeric=Pipeline(
    steps=[
           ('imputer', SimpleImputer()), # SimpleImputer handles missing values; the default strategy='mean' fills NaN values with the column mean
           ('scaler', StandardScaler())  # StandardScaler standardizes each feature to zero mean and unit variance
    ]
)
# Create a pipeline for the categorical features
categorical=Pipeline(
    steps=[
           ('imputer',SimpleImputer(strategy='constant')), # strategy='constant' fills NaN values with a constant placeholder value
            ('onehot',OneHotEncoder(handle_unknown='ignore'))# OneHotEncoder encodes categorical data; categories unseen during fit are ignored
    ]
)
# ColumnTransformer applies separate transformers to the numerical and categorical columns.
# It selects and prepares the columns of the dataset before a model is fitted to the transformed data.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric, features_numeric),# Numerical data
        ('cat', categorical, features_cat) # Categorical data
    ]
)
# Combine the preprocessing with a suitable classifier.
full_pipline = Pipeline(  
    steps=[
        ('preprocessor', preprocessor), 
        ('my_classifier', 
           XGBClassifier(), # I used XGBClassifier as the classifier.
        )
    ]
)
full_pipline


np.random.seed(0)  # used to make the random numbers predictable



In [None]:
# Fit the pipeline on the training data and predict on the test data.
full_pipline = full_pipline.fit(X_train, y_train)
full_pipline.predict(X_test)

array([0, 1, 1, ..., 0, 0, 0])

In [None]:
# Bayesian search hyperparameters
# param_bayes is a dictionary that contains all the parameter values I want to try.
param_bayes = {
    'preprocessor__num__imputer__strategy': ['mean'],
    # preprocessor__num__imputer__strategy points to preprocessor -> num (a Pipeline) -> imputer -> strategy
    # strategy='mean' fills NaN values with the column mean
    'my_classifier__n_estimators': [170],  
    # my_classifier__n_estimators points to my_classifier -> n_estimators 
    # n_estimators is the number of boosted trees.
    'my_classifier__max_depth':[50],    
    # my_classifier__max_depth points to my_classifier -> max_depth   
    # max_depth is the maximum depth of each tree.
    'my_classifier__learning_rate':[0.1] 
    # learning_rate controls how much each new tree contributes to the model. 
}

# cv is the number of cross-validation folds used for each combination of parameters (cv=4 means four-fold cross-validation)
# scoring is the evaluation metric used when ranking the results
# n_jobs is the number of jobs to run in parallel (n_jobs=2)
bayesS = BayesSearchCV(
    full_pipline, param_bayes, cv=4, verbose=1, n_jobs=2, 
    scoring='roc_auc')
# Fit the model with Bayesian search
bayesS.fit(X_train, y_train)
# best_score_ is the mean cross-validated score of the best estimator.
# best_params_ is the parameter setting that gave the best results on the held-out data.
print('best score {}'.format(bayesS.best_score_))
print('best score {}'.format(bayesS.best_params_))


Fitting 4 folds for each of 1 candidates, totalling 4 fits
best score 0.9979023266144571
best score OrderedDict([('my_classifier__learning_rate', 0.1), ('my_classifier__max_depth', 50), ('my_classifier__n_estimators', 170), ('preprocessor__num__imputer__strategy', 'mean')])


Best parameters:
* learning_rate = 0.1
* max_depth = 50
* n_estimators = 170

In [None]:
# Use this cell to write the predicted probabilities to the CSV submission file.
submission = pd.DataFrame()

submission['id'] = df_ts['id']

submission['match'] = bayesS.predict_proba(X_test)[:,1]

submission.to_csv('/content/drive/MyDrive/Queens_Practical/Data_Mining/compt2/sample_submission_walkthrough.csv', index=False)

##### Result

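Accuracy in **Cross-Validation** = 0.9979 \
Accuracy in **Kaggle Public** = 0.88244 \
Accuracy in **Kaggle Private** = 0.88039

On the public leaderboard this trial is slightly better than the previous trial, but on the private leaderboard the previous trial is better. The cross-validation score (computed with `scoring='roc_auc'`) is much higher than the Kaggle scores, most likely because the oversampling was applied before the cross-validation split, so duplicated minority rows can appear in both the training and the validation folds.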
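
One possible explanation for the gap between the cross-validation score and the Kaggle scores in this trial is that the data was oversampled before the folds were created, so copies of the same row can end up on both sides of a split. The sketch below illustrates this with toy row ids. `kfold_indices` is a hypothetical helper, and the contiguous unshuffled folds are a simplifying assumption rather than what BayesSearchCV does internally.

```python
import random

def kfold_indices(n, k):
    """Split positions 0..n-1 into k contiguous validation folds (no shuffling)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

rng = random.Random(0)
# Toy data (assumed): row ids 0..7 are the majority class, ids 8..9 the minority class.
majority = list(range(8))
minority = [8, 9]
# Oversample BEFORE splitting: append copies of minority ids until the classes are balanced.
oversampled = majority + minority + [rng.choice(minority) for _ in range(6)]

leaky_folds = 0
for val_pos in kfold_indices(len(oversampled), 4):
    val_ids = {oversampled[p] for p in val_pos}
    train_ids = {oversampled[p] for p in range(len(oversampled)) if p not in val_pos}
    # A fold is leaky when a validation row id also appears in its training part.
    if val_ids & train_ids:
        leaky_folds += 1
print("folds with leaked rows:", leaky_folds, "of 4")
```

A common way to avoid this is to resample only the training part of each fold, for example by placing the sampler inside an `imblearn.pipeline.Pipeline` so that it is applied during fitting only.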

# **Questions**

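**Why is a simple linear regression model (without any activation function) not good for a classification task?** \
Linear regression predicts an unbounded continuous value, while classification needs a discrete label (or a probability between 0 and 1). The least-squares fit is also sensitive to extreme points: when new data points far from the boundary are added, the fitted line moves and the threshold value shifts with it.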
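
The threshold shift can be seen in a small worked example. This is a minimal sketch with made-up one-dimensional data; `fit_line` and `threshold` are hypothetical helpers, and using the point where the fitted line crosses 0.5 as the class boundary is an assumption made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def threshold(a, b):
    """x value where the fitted line crosses 0.5 (assumed class boundary)."""
    return (0.5 - b) / a

# Made-up data: x <= 3 is class 0, x >= 4 is class 1.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
t_before = threshold(*fit_line(xs, ys))

# Add one clearly positive point far away from the boundary.
t_after = threshold(*fit_line(xs + [30], ys + [1]))
print(t_before, t_after)
```

After the extra point is added, the boundary moves past x = 4, so a point that was classified correctly before now falls on the wrong side, even though the new point agrees with the original pattern.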

**Compared to Perceptron/Logistic regression?** \
*   Perceptron
 * Activation function is a threshold (step) function
 * The output is a binary label
* Logistic regression
 * Activation function is the sigmoid (logistic) function, sigmoid(x) = 1 / (1 + e^(-x))
 * The output is a real number between 0 and 1, interpreted as a probability
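
A small sketch of the two activation functions being compared: a step function, as used by the perceptron, and the sigmoid, as used by logistic regression. The function names and the sample inputs are chosen only for illustration.

```python
import math

def step(z):
    """Perceptron activation: hard threshold that returns a class label."""
    return 1 if z >= 0 else 0

def sigmoid(z):
    """Logistic regression activation: squashes z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Compare both activations on a few weighted sums z = w.x + b.
for z in (-2.0, 0.0, 2.0):
    print(z, step(z), round(sigmoid(z), 3))
```

The step function only says which side of the boundary a point is on, while the sigmoid also expresses how far from the boundary it is as a value between 0 and 1.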
 
**What's a decision tree and how is it different from a logistic regression model?**
 *  Decision trees are a supervised machine learning technique where the data is repeatedly split according to the value of a feature.
 * Decision Trees 
  * Divide the space into smaller and smaller sections
  * More interpretable
  * Can lead to over-fitting.
  * Can be trained on a small training set 
 
 * Logistic Regression 
  * Fits a single linear boundary that divides the space into two.
  * Less interpretable
  * Less prone to over-fitting, although it can still over-fit when there are many features.
  * Usually needs a larger training set
 * Both decision trees and logistic regression can handle continuous and categorical data.
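
The difference between splitting the space into sections and fitting one linear boundary can be illustrated on XOR-style data. This is a minimal sketch: the tree is written by hand rather than learned, and the coarse weight grid is an assumption used only to show that none of the tried lines fits all four points.

```python
from itertools import product

# XOR-style toy data: the class is 1 when exactly one of the two features is 1.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def tree_predict(x):
    """Hand-written depth-2 tree: split on x[0] first, then on x[1]."""
    if x[0] < 0.5:
        return 1 if x[1] >= 0.5 else 0
    return 0 if x[1] >= 0.5 else 1

def line_predict(x, w1, w2, b):
    """Single linear boundary: class 1 on the positive side of the line."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0

tree_ok = all(tree_predict(x) == y for x, y in points)

# Try a coarse grid of weights and biases for the linear boundary.
grid = [v / 2 for v in range(-6, 7)]
line_ok = any(
    all(line_predict(x, w1, w2, b) == y for x, y in points)
    for w1, w2, b in product(grid, repeat=3)
)
print(tree_ok, line_ok)
```

The two nested splits classify all four points, while no single line does, because XOR is not linearly separable.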

**What's the difference between grid search and random search?**
*  Grid Search
 * The hyperparameters' domain is divided into a discrete grid. Then, using cross-validation, every possible combination of values in this grid is tried and the performance measures are calculated. The ideal combination of hyperparameter values is the point on the grid that maximises the average cross-validation score.
 
* Random Search
 * It's similar to grid search, except that instead of testing all of the points in the grid, it only tries a randomly selected subset of them. A smaller subset means faster but less accurate optimization. A larger subset means more accurate optimization, but the search comes closer to a grid search.
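
The difference in the number of candidates can be sketched with the standard library. The search space below is hypothetical (loosely modelled on the XGBoost parameters used in the trials), and sampling grid points without replacement is a simplification of what a library such as scikit-learn does.

```python
import random
from itertools import product

# Hypothetical search space, assumed for illustration only.
space = {
    "n_estimators": [100, 170, 400],
    "max_depth": [10, 50, 80],
    "learning_rate": [0.01, 0.1],
}

# Grid search: every combination of values is a candidate.
names = list(space)
grid_candidates = [dict(zip(names, values)) for values in product(*space.values())]

# Random search: only a fixed number of randomly chosen combinations are tried.
rng = random.Random(0)
random_candidates = rng.sample(grid_candidates, k=5)

print(len(grid_candidates), len(random_candidates))
```

Each candidate would then be scored with cross-validation, so the grid needs 18 evaluations here while the random search needs only 5.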

**What's the difference between Bayesian search and random search?**
* The main difference is that Bayesian search uses the results of previous rounds: in each round, the tuning algorithm chooses the next set of parameters based on the scores observed so far, whereas random search picks every set independently at random. As a result, Bayesian search is likely to arrive at a good parameter set in fewer evaluations than the grid and random techniques.