<a href="https://colab.research.google.com/github/DeeeTeeee/IncomePredictionZindziProject/blob/main/Starter_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IMPORTANT

## Install latest version of packages to be used in the code

The latest version of libraries need to be installed as per competition rules and kindly adhere to that and install the updated version of libraries in the code.

## Please set random seed so that reproducible answers are attained

Wherever randomness is expected, do select the random seed so that the results are reproducible. Reproducibility of results is a **very important** component of model development without which reliable models are not attained.

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
# #load data;
# from google.colab import files
# files.upload()

In [8]:
#!pip install -- catboost

In [9]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
from termcolor import colored
import plotly.express as px
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score,accuracy_score,classification_report
import warnings
warnings.filterwarnings('always')

## Loading test and train datasets

We will load the train and test datasets and do some basic level of EDA to understand the pattern of features in the data

* <b> Train data: </b> This is the data which we will be using to train the model. Since we are solving a classification problem, we will have a column in train dataset corresponding to the target labels.
* <b> Test data: </b> This is the data on which the predictions will be made based on the model trained on train dataset.



In [10]:
################# Reading train and test datasets

path = "/content/drive/MyDrive/CapStoneProject"
train_data = pd.read_csv(os.path.join(path, 'Train.csv'))
test_data = pd.read_csv(os.path.join(path, 'Test.csv'))
ss = pd.read_csv(os.path.join(path, 'SampleSubmission.csv'))
vd = pd.read_csv(os.path.join(path, 'VariableDefinitions.csv'))


target_column_name = ['income_above_limit']

########## The target column to be used for training
target_column      = train_data[target_column_name]

########## Since ID is a unique identifier, it must be dropped
Cols2drop          = ['ID', 'education_institute', 'is_hispanic',
       'employment_commitment',
       'unemployment_reason', 'employment_stat',
       'is_labor_union', 'industry_code',
       'industry_code_main', 'occupation_code',
       'total_employed', 'household_stat', 'household_summary',
       'under_18_family', 'veterans_admin_questionnaire', 'vet_benefit',
       'tax_status', 'stocks_status',
       'mig_year', 'country_of_birth_father',
       'country_of_birth_mother', 'migration_code_change_in_msa',
       'migration_prev_sunbelt', 'migration_code_move_within_reg',
       'migration_code_change_in_reg', 'residence_1_year_ago',
       'old_residence_reg', 'old_residence_state', 'importance_of_record']


######### Feature set corresponding to train and test data
train_df           = train_data.drop(Cols2drop,axis=1)
test_df            = test_data.drop(Cols2drop,axis=1)

print(colored(f'The shape of train data is    {train_df.shape}     ','green',attrs=['bold']))

print(colored(f'The shape of target column is {target_column.shape}','green',attrs=['bold']))

print(colored(f'The shape of test data is     {test_df.shape}      ','blue',attrs=['bold']))

print(colored(f'The data types of train data are    {train_df.dtypes}      ','blue',attrs=['bold']))

print(colored(f'The data types of test data are     {test_df.dtypes}      ','blue',attrs=['bold']))
print('------------------------------------------------------------------------------')
print(colored('The train data looks like below :- \n','green'))
display(train_df.head(5))

print('------------------------------------------------------------------------------')
print(colored('The test data looks like below :- \n','blue'))
display(test_df.head(5))


  and should_run_async(code)


The shape of train data is    (209499, 14)     
The shape of target column is (209499, 1)
The shape of test data is     (89786, 13)      
The data types of train data are    age                       int64
gender                   object
education                object
class                    object
marital_status           object
race                     object
wage_per_hour             int64
working_week_per_year     int64
occupation_code_main     object
gains                     int64
losses                    int64
citizenship              object
country_of_birth_own     object
income_above_limit       object
dtype: object      
The data types of test data are     age                       int64
gender                   object
education                object
class                    object
marital_status           object
race                     object
wage_per_hour             int64
working_week_per_year     int64
occupation_code_main     object
gains                     int64
lo



Unnamed: 0,age,gender,education,class,marital_status,race,wage_per_hour,working_week_per_year,occupation_code_main,gains,losses,citizenship,country_of_birth_own,income_above_limit
0,79,Female,High school graduate,,Widowed,White,0,52,,0,0,Native,US,Below limit
1,65,Female,High school graduate,,Widowed,White,0,0,,0,0,Native,US,Below limit
2,21,Male,12th grade no diploma,Federal government,Never married,Black,500,15,Adm support including clerical,0,0,Native,US,Below limit
3,2,Female,Children,,Never married,Asian or Pacific Islander,0,0,,0,0,Native,US,Below limit
4,70,Male,High school graduate,,Married-civilian spouse present,White,0,0,,0,0,Native,US,Below limit


------------------------------------------------------------------------------
The test data looks like below :- 



Unnamed: 0,age,gender,education,class,marital_status,race,wage_per_hour,working_week_per_year,occupation_code_main,gains,losses,citizenship,country_of_birth_own
0,54,Male,High school graduate,Private,Married-civilian spouse present,White,600,46,Transportation and material moving,0,0,Native,US
1,53,Male,5th or 6th grade,Private,Married-civilian spouse present,White,0,52,Machine operators assmblrs & inspctrs,0,0,Foreign born- Not a citizen of U S,El-Salvador
2,42,Male,Bachelors degree(BA AB BS),Private,Married-civilian spouse present,White,0,44,Professional specialty,15024,0,Native,US
3,16,Female,9th grade,,Never married,White,0,8,,0,0,Native,US
4,16,Male,9th grade,,Never married,White,0,0,,0,0,Native,US


# Data Understanding


## Exploring the Data
The data contains 209499 rows and 14 columns (including the target/ output column (`income_above_limit`) ). The data type is a mix of categorical and numeric data. We notice that there are no null values.

In [11]:
train_df.isnull().sum()

  and should_run_async(code)


age                           0
gender                        0
education                     0
class                    105245
marital_status                0
race                          0
wage_per_hour                 0
working_week_per_year         0
occupation_code_main     105694
gains                         0
losses                        0
citizenship                   0
country_of_birth_own          0
income_above_limit            0
dtype: int64

In [24]:

train_df['country_of_birth_own'].value_counts()

  and should_run_async(code)


US                               185666
 Mexico                            6082
 ?                                 3667
 Puerto-Rico                       1458
 Philippines                        902
 Cuba                               889
 Germany                            858
 El-Salvador                        744
 Dominican-Republic                 741
 Canada                             741
 China                              508
 South Korea                        488
 England                            473
 Italy                              459
 India                              457
 Columbia                           455
 Poland                             385
 Vietnam                            384
 Guatemala                          373
 Jamaica                            352
 Japan                              350
 Ecuador                            277
 Peru                               277
 Nicaragua                          241
 Haiti                              225


we noticed `?` in the `country_of_birth_own` column.

Let us deal with the ‘?’ now. We shall replace it with the ‘MODE’

In [25]:
# Replace "?" with "US" in the 'country_of_birth_own' column
train_df['country_of_birth_own'] = train_df['country_of_birth_own'].replace('?', 'US')

  and should_run_async(code)


In [26]:
train_df['country_of_birth_own'].value_counts()

  and should_run_async(code)


US                               185666
 Mexico                            6082
 ?                                 3667
 Puerto-Rico                       1458
 Philippines                        902
 Cuba                               889
 Germany                            858
 El-Salvador                        744
 Dominican-Republic                 741
 Canada                             741
 China                              508
 South Korea                        488
 England                            473
 Italy                              459
 India                              457
 Columbia                           455
 Poland                             385
 Vietnam                            384
 Guatemala                          373
 Jamaica                            352
 Japan                              350
 Ecuador                            277
 Peru                               277
 Nicaragua                          241
 Haiti                              225


### Feature Engineering
Let's work on the `education` and the `marital_status` columns

In [31]:
train_df['education'].value_counts()

  and should_run_async(code)


 High school graduate                      50627
 Children                                  49685
 Some college but no degree                29320
 Bachelors degree(BA AB BS)                20979
 7th and 8th grade                          8438
 10th grade                                 7905
 11th grade                                 7260
 Masters degree(MA MS MEng MEd MSW MBA)     6861
 9th grade                                  6540
 Associates degree-occup /vocational        5650
 Associates degree-academic program         4494
 5th or 6th grade                           3542
 12th grade no diploma                      2282
 1st 2nd 3rd or 4th grade                   1917
 Prof school degree (MD DDS DVM LLB JD)     1852
 Doctorate degree(PhD EdD)                  1318
 Less than 1st grade                         829
Name: education, dtype: int64

In [32]:
train_df['marital_status'].value_counts()

  and should_run_async(code)


 Never married                      90723
 Married-civilian spouse present    88407
 Divorced                           13456
 Widowed                            11029
 Separated                           3596
 Married-spouse absent               1568
 Married-A F spouse present           720
Name: marital_status, dtype: int64

In [39]:
#education category:
train_df['education'] = train_df['education'].replace(['Children', '1st 2nd 3rd or 4th grade','5th or 6th grade', '7th and 8th grade','9th grade', '10th grade','11th grade', ' 12th grade no diploma'], 'left' )
train_df['education'] = train_df['education'].replace('High school graduate', 'High School')
train_df['education'] = train_df['education'].replace(['Associates degree-academic program','Associates degree-occup /vocational'],'Associate Degree')
train_df['education'] = train_df['education'].replace('Bachelors degree(BA AB BS)', 'Undergrad')
train_df['education'] = train_df['education'].replace(['Masters degree(MA MS MEng MEd MSW MBA), Prof school degree (MD DDS DVM LLB JD)'], 'Grad')
train_df['education'] = train_df['education'].replace('Masters degree(MA MS MEng MEd MSW MBA)', 'Grad')
train_df['education'] = train_df['education'].replace(' Doctorate degree(PhD EdD)', 'Doctorate')
#test_df
test_df['education'] = test_df['education'].replace(['Children', '1st 2nd 3rd or 4th grade','5th or 6th grade', '7th and 8th grade','9th grade', '10th grade','11th grade', ' 12th grade no diploma'], 'left' )
test_df['education'] = test_df['education'].replace('High school graduate', 'High School')
test_df['education'] = test_df['education'].replace(['Associates degree-academic program','Associates degree-occup /vocational'],'Associate Degree')
test_df['education'] = test_df['education'].replace('Bachelors degree(BA AB BS)', 'Undergrad')
test_df['education'] = test_df['education'].replace(['Masters degree(MA MS MEng MEd MSW MBA), Prof school degree (MD DDS DVM LLB JD)'], 'Grad')
test_df['education'] = test_df['education'].replace('Masters degree(MA MS MEng MEd MSW MBA)', 'Grad')
test_df['education'] = test_df['education'].replace(' Doctorate degree(PhD EdD)', 'Doctorate')

  and should_run_async(code)


In [40]:
train_df['education'].unique()

  and should_run_async(code)


array([' High school graduate', 'left', ' Children',
       ' Bachelors degree(BA AB BS)', ' 7th and 8th grade', ' 11th grade',
       ' 9th grade', ' Masters degree(MA MS MEng MEd MSW MBA)',
       ' 10th grade', ' Associates degree-academic program',
       ' 1st 2nd 3rd or 4th grade', ' Some college but no degree',
       ' Less than 1st grade', ' Associates degree-occup /vocational',
       ' Prof school degree (MD DDS DVM LLB JD)', ' 5th or 6th grade',
       'Doctorate'], dtype=object)

In [26]:
train_data.columns


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



Index(['ID', 'age', 'gender', 'education', 'class', 'education_institute',
       'marital_status', 'race', 'is_hispanic', 'employment_commitment',
       'unemployment_reason', 'employment_stat', 'wage_per_hour',
       'is_labor_union', 'working_week_per_year', 'industry_code',
       'industry_code_main', 'occupation_code', 'occupation_code_main',
       'total_employed', 'household_stat', 'household_summary',
       'under_18_family', 'veterans_admin_questionnaire', 'vet_benefit',
       'tax_status', 'gains', 'losses', 'stocks_status', 'citizenship',
       'mig_year', 'country_of_birth_own', 'country_of_birth_father',
       'country_of_birth_mother', 'migration_code_change_in_msa',
       'migration_prev_sunbelt', 'migration_code_move_within_reg',
       'migration_code_change_in_reg', 'residence_1_year_ago',
       'old_residence_reg', 'old_residence_state', 'importance_of_record',
       'income_above_limit'],
      dtype='object')

In [28]:
train_data['occupation_code_main'].value_counts()


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



 Adm support including clerical           15351
 Professional specialty                   14544
 Executive admin and managerial           13107
 Other service                            12856
 Sales                                    12487
 Precision production craft & repair      11207
 Machine operators assmblrs & inspctrs     6650
 Handlers equip cleaners etc               4340
 Transportation and material moving        4244
 Farming forestry and fishing              3273
 Technicians and related support           3136
 Protective services                       1700
 Private household services                 878
 Armed Forces                                32
Name: occupation_code_main, dtype: int64

In [8]:
train_df.education.value_counts()

  and should_run_async(code)


 High school graduate                      50627
 Children                                  49685
 Some college but no degree                29320
 Bachelors degree(BA AB BS)                20979
 7th and 8th grade                          8438
 10th grade                                 7905
 11th grade                                 7260
 Masters degree(MA MS MEng MEd MSW MBA)     6861
 9th grade                                  6540
 Associates degree-occup /vocational        5650
 Associates degree-academic program         4494
 5th or 6th grade                           3542
 12th grade no diploma                      2282
 1st 2nd 3rd or 4th grade                   1917
 Prof school degree (MD DDS DVM LLB JD)     1852
 Doctorate degree(PhD EdD)                  1318
 Less than 1st grade                         829
Name: education, dtype: int64

In [9]:
vd

  and should_run_async(code)


Unnamed: 0,Column,Description
0,age,Age Of Individual
1,gender,Gender
2,education,Education
3,class,Class Of Worker
4,education_institute,Enrolled Educational Institution in last week
5,marital_status,Marital_Status
6,race,Race
7,is_hispanic,Hispanic Origin
8,employment_commitment,Full Or Part Time Employment Stat
9,unemployment_reason,Reason For Unemployment


In [10]:
train_data.education.value_counts()

  and should_run_async(code)


 High school graduate                      50627
 Children                                  49685
 Some college but no degree                29320
 Bachelors degree(BA AB BS)                20979
 7th and 8th grade                          8438
 10th grade                                 7905
 11th grade                                 7260
 Masters degree(MA MS MEng MEd MSW MBA)     6861
 9th grade                                  6540
 Associates degree-occup /vocational        5650
 Associates degree-academic program         4494
 5th or 6th grade                           3542
 12th grade no diploma                      2282
 1st 2nd 3rd or 4th grade                   1917
 Prof school degree (MD DDS DVM LLB JD)     1852
 Doctorate degree(PhD EdD)                  1318
 Less than 1st grade                         829
Name: education, dtype: int64

In [11]:
########### Encoding the target column

target_column['income_above_limit'] = target_column['income_above_limit'].map({'Above limit':1,'Below limit':0})
target_column['income_above_limit'].value_counts()

  and should_run_async(code)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  target_column['income_above_limit'] = target_column['income_above_limit'].map({'Above limit':1,'Below limit':0})


0    196501
1     12998
Name: income_above_limit, dtype: int64

<b>Class imbalance </b> <br>


We will be seeing the class imbalance using value_counts() method of pandas dataframe and use histogram to plot the imbalances
<hr>

In [12]:
print('The class Imbalance in the data is given below')
display(train_data['income_above_limit'].value_counts())
print('---------------------------------------------------------------\n')
print('The class imbalance in terms of percentage is given below ')
display(train_data['income_above_limit'].value_counts(normalize=True))
print('----------------------------------------------------------------\n')
pct_df = pd.DataFrame(train_data['income_above_limit'].value_counts(normalize=True)).reset_index().rename({'index':'Target_values','income_above_limit':'Percentage'},axis=1)
fig = px.bar(pct_df,x='Target_values',y='Percentage', height=400,width = 400,title='class imbalance')
fig.show()

The class Imbalance in the data is given below


  and should_run_async(code)


Below limit    196501
Above limit     12998
Name: income_above_limit, dtype: int64

---------------------------------------------------------------

The class imbalance in terms of percentage is given below 


Below limit    0.937957
Above limit    0.062043
Name: income_above_limit, dtype: float64



----------------------------------------------------------------



[1;30;43mStreaming output truncated to the last 5000 lines.[0m

APICoreClientInfoImportHook.find_spec() not found; falling back to find_module()


_PyDriveImportHook.find_spec() not found; falling back to find_module()


_OpenCVImportHook.find_spec() not found; falling back to find_module()


_BokehImportHook.find_spec() not found; falling back to find_module()


_AltairImportHook.find_spec() not found; falling back to find_module()


APICoreClientInfoImportHook.find_spec() not found; falling back to find_module()


_PyDriveImportHook.find_spec() not found; falling back to find_module()


_OpenCVImportHook.find_spec() not found; falling back to find_module()


_BokehImportHook.find_spec() not found; falling back to find_module()


_AltairImportHook.find_spec() not found; falling back to find_module()


APICoreClientInfoImportHook.find_spec() not found; falling back to find_module()


_PyDriveImportHook.find_spec() not found; falling back to find_module()


_OpenCVImportHook.find_spec

Clearly we have a highly imbalanced dataset available with us and hence we need to perform steps to mitigate the imbalance accordingly. The following methods could be used:-
1. Downsample the majority class (Here majority class is 'Below limit')
2. Upsample the minority class (Here, minority class is 'Above limit')
3. Use class weights while performing model development <br>
Reference : https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html



<b> NaN value analysis </b>


In [13]:
def nan_value_plot(df):
    nan_dict  = {}
    for cols in df.columns:
        nan_dict[cols] = df[cols].isna().sum()/df.shape[0]
    nan_pct_df = pd.DataFrame.from_dict(nan_dict,orient='index').reset_index().rename({'index':'Columns',0:'NaN_pct'},axis=1)
    fig = px.bar(nan_pct_df,x='Columns',y='NaN_pct', height=400,width = 400,title='NaN value percentage in each column')
    fig.update_layout(
                        xaxis = dict(
                        tickfont = dict(size=5)))
    fig.show()


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [14]:
print(colored('We see the distribution of NaN values in train data as below','green',attrs=['bold']))
nan_value_plot(train_df)

print('-------------------------------------------------------------------------------------------------')
print('\n')
print(colored('We see the distribution of NaN values in test data as below','blue',attrs=['bold']))
nan_value_plot(test_df)


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



We see the distribution of NaN values in train data as below



APICoreClientInfoImportHook.find_spec() not found; falling back to find_module()


_PyDriveImportHook.find_spec() not found; falling back to find_module()


_OpenCVImportHook.find_spec() not found; falling back to find_module()


_BokehImportHook.find_spec() not found; falling back to find_module()


_AltairImportHook.find_spec() not found; falling back to find_module()


APICoreClientInfoImportHook.find_spec() not found; falling back to find_module()


_PyDriveImportHook.find_spec() not found; falling back to find_module()


_OpenCVImportHook.find_spec() not found; falling back to find_module()


_BokehImportHook.find_spec() not found; falling back to find_module()


_AltairImportHook.find_spec() not found; falling back to find_module()


APICoreClientInfoImportHook.find_spec() not found; falling back to find_module()


_PyDriveImportHook.find_spec() not found; falling back to find_module()


_OpenCVImportHook.find_spec() not found; falling back to find_module()


_BokehImportHook.fi

-------------------------------------------------------------------------------------------------


We see the distribution of NaN values in test data as below


<b> Comments:- </b>
* There are columns with extremely high proportion of NaN values, we may drop them.
* There are columns with NaN values that can be handled easily using imputations with mean, median (in case of numerical) or mode(in case of categorical)
* Use Models like LightGBM, CatBoost or XGBoost that handles the NaN values implicitly while model training.
* Observe that the proportion of NaN value distribution is same in train and test and select NaN value handling techniques accordingly.
* Be creative 🧠 (but also be logical 😉) !!



I will personally drop all the columns where the proportion of NaN values is above 80% and proceed with columns/features that are left.

In [15]:
nan_cols_drop  = []
for cols in test_df.columns:
    if test_df[cols].isna().sum()/test_df.shape[0] >0.8:
        nan_cols_drop.append(cols)


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [16]:
print(colored(f'We will drop the following columns from both train and test data: ','yellow',attrs=['bold']))
print(nan_cols_drop)

We will drop the following columns from both train and test data: 
['education_institute', 'unemployment_reason', 'is_labor_union', 'veterans_admin_questionnaire', 'old_residence_reg', 'old_residence_state']



`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [17]:
print('The shape of train and test data before dropping columns with high proportion of NaN values is - ')
print(colored(f'The shape of train data is    {train_df.shape}     ','green',attrs=['bold']))

print(colored(f'The shape of target column is {target_column.shape}','green',attrs=['bold']))

print(colored(f'The shape of test data is     {test_df.shape}      ','blue',attrs=['bold']))

train_df = train_df.drop(nan_cols_drop,axis=1)
test_df  = test_df.drop(nan_cols_drop,axis=1)

print('---------------------------------------------------------------------------------------------------')
print('The shape of train and test data after dropping columns with high proportion of NaN values is - ')
print(colored(f'The shape of train data is    {train_df.shape}     ','green',attrs=['bold']))

print(colored(f'The shape of target column is {target_column.shape}','green',attrs=['bold']))

print(colored(f'The shape of test data is     {test_df.shape}      ','blue',attrs=['bold']))


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



The shape of train and test data before dropping columns with high proportion of NaN values is - 
The shape of train data is    (209499, 41)     
The shape of target column is (209499, 1)
The shape of test data is     (89786, 41)      
---------------------------------------------------------------------------------------------------
The shape of train and test data after dropping columns with high proportion of NaN values is - 
The shape of train data is    (209499, 35)     
The shape of target column is (209499, 1)
The shape of test data is     (89786, 35)      


### Simple Baseline Validation strategy

We will now do an 80-20 split of train data provided. As discussed previously, the participants are free to use the validation strategy of their own choice.

Points to consider while selecting a validation strategy:
* Make sure the model is not overfitting on train data.
* CV score and leaderboard scores are in sync.
* Stable validation strategy when using K Folds etc.

In [18]:
train, X_test, train_y, y_test = train_test_split(train_df, target_column, test_size=0.2, random_state=42,stratify=target_column)


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



### Model development 🤖 💻 🤖

We will be straight away using a CatBoost model for training because it handles categorical features well, can implicitly handle NaN values, and can give a quick baseline (with minimal preprocessing) which can be used as a benchmark to be improved upon.

<br>

In the below steps, we will convert all the categorical columns to string datatype and capture the indices where string datatype is present which will then be used as an input for the CatBoost Classification model.

In [19]:
cat_cols_index = np.where(train_df.dtypes=='object')[0]
for i in range(len(train_df.columns)):
    if i in cat_cols_index:
        train[train_df.columns[i]]   = train[train_df.columns[i]].astype(str)
        X_test[train_df.columns[i]]  = X_test[train_df.columns[i]].astype(str)
        test_df[train_df.columns[i]] = test_df[train_df.columns[i]].astype(str)


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [20]:
model           = CatBoostClassifier(random_state=42,n_estimators =50 )
_               = model.fit(train,train_y,cat_features= cat_cols_index)



`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



Learning rate set to 0.5
0:	learn: 0.2249132	total: 660ms	remaining: 32.3s
1:	learn: 0.1503419	total: 1.11s	remaining: 26.7s
2:	learn: 0.1340319	total: 1.69s	remaining: 26.4s
3:	learn: 0.1286229	total: 2.03s	remaining: 23.3s
4:	learn: 0.1246902	total: 2.27s	remaining: 20.4s
5:	learn: 0.1229398	total: 2.51s	remaining: 18.4s
6:	learn: 0.1217443	total: 2.75s	remaining: 16.9s
7:	learn: 0.1207407	total: 2.99s	remaining: 15.7s
8:	learn: 0.1198554	total: 3.23s	remaining: 14.7s
9:	learn: 0.1186997	total: 3.49s	remaining: 14s
10:	learn: 0.1178143	total: 3.74s	remaining: 13.3s
11:	learn: 0.1171361	total: 3.99s	remaining: 12.6s
12:	learn: 0.1158921	total: 4.24s	remaining: 12.1s
13:	learn: 0.1155654	total: 4.49s	remaining: 11.6s
14:	learn: 0.1151041	total: 4.75s	remaining: 11.1s
15:	learn: 0.1146088	total: 4.99s	remaining: 10.6s
16:	learn: 0.1143813	total: 5.23s	remaining: 10.1s
17:	learn: 0.1137526	total: 5.47s	remaining: 9.72s
18:	learn: 0.1134374	total: 5.73s	remaining: 9.35s
19:	learn: 0.11309

Parameter tuning tips for CatBoost:

👓 Do focus on parameters like n_estimators, max_depth, reg_lambda, reg_alpha, scale_pos_weight, learning_rate and explore other parameters from the link : https://catboost.ai/en/docs/references/training-parameters/


In [21]:
acc_valid = accuracy_score(model.predict(X_test),y_test)

print(colored(f'The accuracy attained on the validation set is {acc_valid}','green',attrs=['bold']))




`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



The accuracy attained on the validation set is 0.9576372315035799


We got a good enough accuracy but is our model really performing that good ?? 🤔

👓 Consider the class imbalance of the data given with respect to the metric assigned. We can get 94% accuracy just by classifying everything as 'Below limit' but that will mean that we must get an accuracy above 94% to ensure the models are learning properly. 👓

🔭 Let's investigate the classification report for both train and validation data and see how good the baseline is.

In [22]:
print('\n')
print('The classification report only on the validation data is below-')
print(colored(classification_report(y_test, model.predict(X_test)),'blue',attrs=['bold']))

print('The classification report only on the train data is below-')
print(colored(classification_report(train_y, model.predict(train)),'green',attrs=['bold']))



The classification report only on the validation data is below-



`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



              precision    recall  f1-score   support

           0       0.97      0.99      0.98     39300
           1       0.76      0.47      0.58      2600

    accuracy                           0.96     41900
   macro avg       0.86      0.73      0.78     41900
weighted avg       0.95      0.96      0.95     41900

The classification report only on the train data is below-
              precision    recall  f1-score   support

           0       0.97      0.99      0.98    157201
           1       0.80      0.49      0.61     10398

    accuracy                           0.96    167599
   macro avg       0.88      0.74      0.79    167599
weighted avg       0.96      0.96      0.96    167599



The performance of our minority class in terms of precision and recall is too low. Hence our F1 score is also very low.



### A little hack

Let's do a small hack though 🤓 🤓 🤓

We can use probability based thresholds and see how performance improves. We will select a lower threshold for class label 1.
The default threshold is 0.5 which means that if the probability of 1 is above 0.5, then the predicted class is 1 else it is 0.

<br>

We will lower the threshold to 0.4 and say that if the probability of class being 1 is above 0.4, then we can classify it as 1 and if it is less than 0.4, then it will be 0.

In [23]:
thresh     = 0.4
train_pred = np.where(model.predict_proba(train)[:,1]>thresh,1,0)
test_pred  = np.where(model.predict_proba(X_test)[:,1]>thresh,1,0)

print('\n')
print('The classification report only on the validation data is below-')
print(colored(classification_report(y_test,test_pred),'blue',attrs=['bold']))

print('The classification report only on the train data is below-')
print(colored(classification_report(train_y, train_pred),'green',attrs=['bold']))


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.





The classification report only on the validation data is below-
              precision    recall  f1-score   support

           0       0.97      0.98      0.98     39300
           1       0.69      0.55      0.61      2600

    accuracy                           0.96     41900
   macro avg       0.83      0.77      0.80     41900
weighted avg       0.95      0.96      0.95     41900

The classification report only on the train data is below-
              precision    recall  f1-score   support

           0       0.97      0.99      0.98    157201
           1       0.73      0.57      0.64     10398

    accuracy                           0.96    167599
   macro avg       0.85      0.78      0.81    167599
weighted avg       0.96      0.96      0.96    167599



We do see some improvement in the performance because the f1 score on our validation data moved from 0.58 to 0.61.
For more information about how the threshold is selected, please follow [ROC Curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) of sklearn and in general how ROC curve works 📚 📚

### Submission Time 🎉

We will now predict on the test data given and see what score we get on leaderboard.

We will now download the file "Sample_submission_1.csv" and submit it.

In [24]:
# subdf                       = pd.read_csv('/content/SampleSubmission.csv')
# subdf['income_above_limit'] = model.predict(test_df)
# subdf.to_csv('Sample_submission_1.csv',index=False)
# subdf['income_above_limit'].value_counts(normalize=True)


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [25]:
#ss                      = pd.read_csv('/content/SampleSubmission.csv')
ss ['income_above_limit'] = model.predict(test_df)
ss .to_csv('Sample_submission_1.csv',index=False)
ss ['income_above_limit'].value_counts(normalize=True)


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.


APICoreClientInfoImportHook.find_spec() not found; falling back to find_module()


_PyDriveImportHook.find_spec() not found; falling back to find_module()


_OpenCVImportHook.find_spec() not found; falling back to find_module()


_BokehImportHook.find_spec() not found; falling back to find_module()


_AltairImportHook.find_spec() not found; falling back to find_module()



0    0.962533
1    0.037467
Name: income_above_limit, dtype: float64

How to get better scores:
1. Feature engineering is the key. Refer to the variable dictionary and create meaningful features which can boost the score
2. Try out different models and categorical data preprocessing (read about categorical encoding) because a lot of features are categorical.
3. Feature selection with feature importance
4. Keep a check on classification report to observe overfitting and underfitting and select appropriate hyper-parameters to tune.
5. Suitable probability threshold selection as shown above.
6. Be creative while selecting validation split
For example:- Use Stratified K folds, grouped K folds, repeated stratified k folds, train test split with stratification etc
7. Ensemble multiple models to get a stable prediction.
8. Be creative and may the best model win 🏆 🏆 🏆


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.

