<a href="https://www.kaggle.com/code/priyankapalshetkar/kagglex-competition?scriptVersionId=117912035" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Steps to Consider:
* Add more algorithms to the mix
* Improve on the best algorithms (Random Forest, K-neighbors) by tuning their hyperparameters (n_estimators, n_neighbors)
* Change the encoder
* Drop features which are not as important
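
One possible way to approach the last step is to rank features by a tree ensemble's impurity-based importances and drop the weakest ones. The sketch below is only an illustration: it uses synthetic data from `make_classification` rather than the competition data, and the feature names (`f0` to `f9`) and the 5% cut-off are assumptions, not values taken from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the competition data (assumption: 10 features, 3 informative).
X_arr, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort them to see which features matter least.
importances = pd.Series(rf.feature_importances_, index=X_demo.columns).sort_values()

# Hypothetical cut-off: drop every feature contributing less than 5% importance.
low_importance = importances[importances < 0.05].index.tolist()
X_reduced = X_demo.drop(columns=low_importance)

print(importances)
print("Dropped:", low_importance)
```

Impurity-based importances can favor high-cardinality features, so it may be worth re-checking the validation score after dropping columns.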

# Load required libraries

In [48]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import fbeta_score, accuracy_score,  make_scorer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from collections import defaultdict
from category_encoders.leave_one_out import LeaveOneOutEncoder
from sklearn.utils.class_weight import compute_class_weight
import matplotlib.pyplot as plt
import plotly.express as px
# from lazypredict.Supervised import LazyClassifier
import lightgbm as lgb
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
import catboost as cb

# Seed everything for reproducibility

In [49]:
SEED = 42

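def seed_everything(seed=42):
    """Seed Python's and NumPy's random number generators."""
    import os
    import random

    random.seed(seed)
    # PYTHONHASHSEED only affects Python processes started after this point
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)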

seed_everything(SEED)

# Load dataset 

In [50]:
train_df = pd.read_csv('../input/kagglex-bipoc-2022-2023-ml-foundation/Train.csv')
test_df = pd.read_csv('../input/kagglex-bipoc-2022-2023-ml-foundation/Test.csv')
sample_sub = pd.read_csv('../input/kagglex-bipoc-2022-2023-ml-foundation/Sample_submission.csv')

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157509 entries, 0 to 157508
Data columns (total 42 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   ID        157509 non-null  object
 1   AAGE      157509 non-null  int64 
 2   ACLSWKR   157509 non-null  object
 3   ADTIND    157509 non-null  int64 
 4   ADTOCC    157509 non-null  int64 
 5   AHGA      157509 non-null  object
 6   AHRSPAY   157509 non-null  int64 
 7   AHSCOL    157509 non-null  object
 8   AMARITL   157509 non-null  object
 9   AMJIND    157509 non-null  object
 10  AMJOCC    157509 non-null  object
 11  ARACE     157509 non-null  object
 12  AREORGN   157509 non-null  object
 13  ASEX      157509 non-null  object
 14  AUNMEM    157509 non-null  object
 15  AUNTYPE   157509 non-null  object
 16  AWKSTAT   157509 non-null  object
 17  CAPGAIN   157509 non-null  int64 
 18  CAPLOSS   157509 non-null  int64 
 19  DIVVAL    157509 non-null  int64 
 20  FILESTAT  157509 non-null 

**My Initial Analysis:**

* Which variables would affect my income?
AAGE (age), AHGA (education), AHRSPAY (wage per hour), WKSWORK (weeks worked in year), SEOTR (own business or self-employed), CAPGAIN (capital gains), CAPLOSS (capital losses), AWKSTAT (full- or part-time employment status), AUNTYPE (reason for unemployment), AMJIND (major industry code), AMJOCC (major occupation code), ACLSWKR (class of worker), PRCITSHP (citizenship)

* Out of these, which are the three most important ones?
AHRSPAY (wage per hour), WKSWORK (weeks worked in year), AWKSTAT (full- or part-time employment status)

# Missing value analysis

In [51]:
missing_values_info = train_df.isnull().sum() / len(train_df)

In [52]:
missing_values_info_df = pd.DataFrame()
missing_values_info_df['features'] = missing_values_info.index
missing_values_info_df['missing_values'] = missing_values_info.values

In [7]:
# missing_values holds a fraction between 0 and 1, not a percentage
px.bar(x='missing_values', y='features', data_frame=missing_values_info_df, title='Fraction of missing values per feature', color='features')

* Which features have null values?
GRINST (state of previous residence), MIGMTR1 (migration code-change in MSA), MIGMTR3 (migration code-change in reg), MIGMTR4 (migration code-move within reg), MIGSUN (migration prev res in sunbelt), PEFNTVTY (country of birth father), PEMNTVTY (country of birth mother), PENATVTY (country of birth self)

Since most of these fields are categorical, there is no reliable way to guess the missing values. So we fill the NAs with a new category, 'unknown'.

# Missing Values Imputation
* imputing with new category 'unknown'

In [53]:
selected_features = missing_values_info_df[missing_values_info_df['missing_values']>0]['features'].values

In [9]:
selected_features

array(['GRINST', 'MIGMTR1', 'MIGMTR3', 'MIGMTR4', 'MIGSUN', 'PEFNTVTY',
       'PEMNTVTY', 'PENATVTY'], dtype=object)

## Simple Imputer for filling in missing values?

In [54]:
for col in selected_features:
    train_df[col] = train_df[col].fillna('unknown')
    test_df[col] = test_df[col].fillna('unknown')
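
As an alternative to the `fillna` loop above, scikit-learn's `SimpleImputer` with `strategy="constant"` produces the same kind of result and can be fitted once and reused on new data. This is a minimal sketch on toy frames: the column names are borrowed from this notebook, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for train_df / test_df (values are invented).
toy_train = pd.DataFrame(
    {"GRINST": ["Utah", np.nan, "Ohio"], "MIGSUN": [np.nan, "Yes", "No"]}, dtype=object
)
toy_test = pd.DataFrame(
    {"GRINST": [np.nan, "Utah"], "MIGSUN": ["No", np.nan]}, dtype=object
)

cols = ["GRINST", "MIGSUN"]

# strategy="constant" replaces every missing entry with fill_value.
imputer = SimpleImputer(strategy="constant", fill_value="unknown")

toy_train[cols] = imputer.fit_transform(toy_train[cols])
toy_test[cols] = imputer.transform(toy_test[cols])

print(toy_train)
print(toy_test)
```

For a single constant fill value both approaches behave the same; the imputer mainly pays off when it is placed inside a scikit-learn `Pipeline`.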

In [11]:
train_df.info()
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157509 entries, 0 to 157508
Data columns (total 42 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   ID        157509 non-null  object
 1   AAGE      157509 non-null  int64 
 2   ACLSWKR   157509 non-null  object
 3   ADTIND    157509 non-null  int64 
 4   ADTOCC    157509 non-null  int64 
 5   AHGA      157509 non-null  object
 6   AHRSPAY   157509 non-null  int64 
 7   AHSCOL    157509 non-null  object
 8   AMARITL   157509 non-null  object
 9   AMJIND    157509 non-null  object
 10  AMJOCC    157509 non-null  object
 11  ARACE     157509 non-null  object
 12  AREORGN   157509 non-null  object
 13  ASEX      157509 non-null  object
 14  AUNMEM    157509 non-null  object
 15  AUNTYPE   157509 non-null  object
 16  AWKSTAT   157509 non-null  object
 17  CAPGAIN   157509 non-null  int64 
 18  CAPLOSS   157509 non-null  int64 
 19  DIVVAL    157509 non-null  int64 
 20  FILESTAT  157509 non-null 

# Analysing the output variable

In [12]:
train_df['TARGET'].describe()

count    157509.000000
mean          0.082408
std           0.274986
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: TARGET, dtype: float64

In [13]:
income_greater_equal_to_50k = train_df['TARGET'].sum()
income_less_than_50k = train_df.shape[0] - income_greater_equal_to_50k
print(income_greater_equal_to_50k, income_less_than_50k)
print("% of people with salary greater than or equal to 50k", (100*income_greater_equal_to_50k/train_df.shape[0]).round(2), "%")

12980 144529
% of people with salary greater than or equal to 50k 8.24 %


**Only 8.24% of people have total income greater than or equal to 50k**
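
With an imbalance like this, one option worth trying is class weighting. The notebook imports `compute_class_weight` but does not use it, so the sketch below shows how it could be applied. The toy target (92 negatives, 8 positives) is an assumption chosen to roughly mirror the ratio above, not the real data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target with roughly the same imbalance as TARGET (assumed counts).
y_toy = np.array([0] * 92 + [1] * 8)

classes = np.unique(y_toy)

# "balanced" assigns n_samples / (n_classes * count_of_class) to each class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

print(class_weight)
```

A dictionary like this can be passed as the `class_weight` argument of estimators that support it, such as `RandomForestClassifier` or `LogisticRegression`.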

# LabelEncoder
* Converting categorical data to numerical form. LabelEncoder encodes labels with values between 0 and n_classes-1. scikit-learn intends it for target labels; here it is applied column by column to the features.

* Using select_dtypes('object') to retrieve the string columns. select_dtypes returns a subset of the DataFrame’s columns based on the column dtypes.

* [Reference](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) 

In [62]:
for col in train_df.select_dtypes('object'):
    if col != 'ID':
        le = LabelEncoder()
        train_df[col] = le.fit_transform(train_df[col])
        test_df[col] = le.transform(test_df[col])
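
One caveat with this loop: `le.transform` raises an error if the test set contains a category that never appeared in training. A possible alternative, sketched here on toy data, is `OrdinalEncoder` with `handle_unknown="use_encoded_value"`. The category values and the `-1` placeholder are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames; "Asian" never appears in the training column.
toy_train = pd.DataFrame({"ARACE": ["White", "Black", "Other"]})
toy_test = pd.DataFrame({"ARACE": ["Black", "Asian"]})

# Unseen categories are mapped to unknown_value instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_codes = encoder.fit_transform(toy_train[["ARACE"]])
test_codes = encoder.transform(toy_test[["ARACE"]])

print(encoder.categories_)
print(train_codes.ravel())
print(test_codes.ravel())
```

`OrdinalEncoder` also accepts several columns at once, so the per-column loop would not be needed.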

In [63]:
train_df.info()
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157509 entries, 0 to 157508
Data columns (total 42 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   ID        157509 non-null  object
 1   AAGE      157509 non-null  int64 
 2   ACLSWKR   157509 non-null  int64 
 3   ADTIND    157509 non-null  int64 
 4   ADTOCC    157509 non-null  int64 
 5   AHGA      157509 non-null  int64 
 6   AHRSPAY   157509 non-null  int64 
 7   AHSCOL    157509 non-null  int64 
 8   AMARITL   157509 non-null  int64 
 9   AMJIND    157509 non-null  int64 
 10  AMJOCC    157509 non-null  int64 
 11  ARACE     157509 non-null  int64 
 12  AREORGN   157509 non-null  int64 
 13  ASEX      157509 non-null  int64 
 14  AUNMEM    157509 non-null  int64 
 15  AUNTYPE   157509 non-null  int64 
 16  AWKSTAT   157509 non-null  int64 
 17  CAPGAIN   157509 non-null  int64 
 18  CAPLOSS   157509 non-null  int64 
 19  DIVVAL    157509 non-null  int64 
 20  FILESTAT  157509 non-null 

# Prepare Train and Validation dataset

* Keeping 80% of the data for training and 20% for validation.

* Using a stratified split, which keeps the distribution of the target variable the same in the training and validation sets.

In [68]:
X = train_df.drop(columns=['ID','TARGET'])
print(X.shape)
y = train_df.TARGET
print(y.shape)

(157509, 40)
(157509,)


In [87]:
X_train, X_valid, y_train, y_valid = train_test_split(X,y,
                                                        test_size=0.2, 
                                                        stratify=y, 
                                                        random_state=SEED)

In [83]:
X_train.shape, y_train.shape

((118131, 40), (118131,))

### Target distribution in training

In [20]:
y_train.value_counts()/len(y_train)

0    0.917592
1    0.082408
Name: TARGET, dtype: float64

In [21]:
X_valid.shape, y_valid.shape

((31502, 40), (31502,))

### Target distribution in validation

In [71]:
y_valid.value_counts()/len(y_valid)

0    0.917593
1    0.082407
Name: TARGET, dtype: float64

# Helper Function

In [72]:
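def train_model(classifier, input_x, input_y):
    """Fit the classifier and return the fitted estimator."""
    clf = classifier.fit(input_x, input_y)
    return clf

def evaluate_model(classifier, validation_x, validation_y, eval_metrics=fbeta_score):
    """Return the predictions and their F-beta score (beta=0.5) on the validation set."""
    ypred = classifier.predict(validation_x)
    # beta=0.5 weights precision more heavily than recall
    return ypred, eval_metrics(validation_y, ypred, beta=0.5)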
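
The metric used in `evaluate_model` is the F-beta score with beta = 0.5, defined as (1 + beta²) · P · R / (beta² · P + R) for precision P and recall R. The small worked example below, on made-up labels, checks the formula by hand against `fbeta_score`.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Made-up labels: 4 actual positives, 3 predicted positives, 2 of them correct.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

precision = precision_score(y_true, y_pred)  # 2 true positives / 3 predicted positives
recall = recall_score(y_true, y_pred)        # 2 true positives / 4 actual positives

beta = 0.5
manual = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
library = fbeta_score(y_true, y_pred, beta=beta)

print(manual, library)
```

Because beta is below 1, the score sits closer to the precision (about 0.667) than to the recall (0.5).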


# Comparing different ML algorithms

In [99]:
classifiers = {
#     'logistic_regression' : LogisticRegression(solver='liblinear', random_state=SEED),
#     'decision_tree' : DecisionTreeClassifier(random_state=SEED),
#     'random_forest': RandomForestClassifier(random_state=SEED),
#     # 'linear_svm': svm.SVC(kernel='linear',random_state=SEED),
#     'naive_bayes': GaussianNB(),
#     'k_neighbors': KNeighborsClassifier(),
#     'LGBM': lgb.LGBMClassifier(objective="binary", random_state=SEED, n_estimators=250),
#     'Gradient_Boosting_Classifier': GradientBoostingClassifier(random_state=SEED),
#     'XGBoost' : XGBClassifier(random_state=SEED),
    'CatBoost' : cb.CatBoostClassifier(random_state=SEED)
    }

In [100]:
trained_models = {}
for classifier_name, classifier in classifiers.items():
    print("Started for: ", classifier)
    model = train_model(classifier, X_train, y_train)
    ypred, validation_score = evaluate_model(model, X_valid, y_valid)
    print("Done for: ", classifier)
    # Despite the key name, this stores the F-beta score with beta=0.5, not F1
    trained_models[classifier_name] = {'model': model, 'f1_score': validation_score}

Started for:  <catboost.core.CatBoostClassifier object at 0x7feead744250>
Learning rate set to 0.081246
0:	learn: 0.5711160	total: 33.3ms	remaining: 33.3s
1:	learn: 0.4879350	total: 63.2ms	remaining: 31.5s
2:	learn: 0.4164204	total: 100ms	remaining: 33.4s
3:	learn: 0.3756840	total: 126ms	remaining: 31.5s
4:	learn: 0.3385133	total: 160ms	remaining: 31.9s
5:	learn: 0.3020995	total: 195ms	remaining: 32.3s
6:	learn: 0.2738372	total: 225ms	remaining: 31.9s
7:	learn: 0.2562095	total: 258ms	remaining: 32s
8:	learn: 0.2426776	total: 286ms	remaining: 31.5s
9:	learn: 0.2299725	total: 319ms	remaining: 31.5s
10:	learn: 0.2229403	total: 346ms	remaining: 31.1s
11:	learn: 0.2135409	total: 376ms	remaining: 31s
12:	learn: 0.2061069	total: 410ms	remaining: 31.1s
13:	learn: 0.2020128	total: 436ms	remaining: 30.7s
14:	learn: 0.1975049	total: 463ms	remaining: 30.4s
15:	learn: 0.1938144	total: 494ms	remaining: 30.4s
16:	learn: 0.1904087	total: 525ms	remaining: 30.4s
17:	learn: 0.1868059	total: 563ms	remaini

## Validation Results

In [97]:
validation_results = defaultdict(list)
for k,v in trained_models.items():
    validation_results['classifier_name'].append(k)
    validation_results['f1_score'].append(v['f1_score'])
validation_results = pd.DataFrame(validation_results)

In [98]:
validation_results

| | classifier_name | f1_score |
|---|---|---|
| 0 | CatBoost | 0.683159 |


In [30]:
px.bar(x='f1_score', y='classifier_name', data_frame=validation_results, color="classifier_name", title='Algorithm Performance Comparison')

In [42]:
fbeta_scorer = make_scorer(fbeta_score, beta=0.5)
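
A scorer like `fbeta_scorer` can be passed to `GridSearchCV` (imported at the top) so that hyperparameters are tuned against the same F-beta metric used for validation. The following is a sketch under assumptions: it runs on synthetic data, and the parameter grid is a hypothetical example rather than a grid used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced stand-in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

fbeta_scorer = make_scorer(fbeta_score, beta=0.5)

# Hypothetical grid; real ranges would depend on the data and the time budget.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring=fbeta_scorer,
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` accepts the same `scoring` argument and may be cheaper when the grid grows.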

# Inference

* Picking the best model for inference. This version of the notebook uses CatBoost, the only classifier trained above.

In [31]:
# selected_columns.remove("TARGET")
# test_df = test_df[selected_columns]
# test_df.info()
XTest = test_df.drop(columns=['ID'])

In [33]:
ytest_pred = trained_models["CatBoost"]["model"].predict(XTest)

In [34]:
output = sample_sub.copy()
output['TARGET'] = ytest_pred
output.head()

| | ID | TARGET |
|---|---|---|
| 0 | ai1kagv30p8v | 0 |
| 1 | 9s9e3x6a8f7u | 0 |
| 2 | qlvd7mszxd2z | 0 |
| 3 | uwhbqcnx5a5z | 0 |
| 4 | 27c5sqbrzdwf | 1 |


In [35]:
output['TARGET'].value_counts()/len(output)

0    0.948522
1    0.051478
Name: TARGET, dtype: float64

In [36]:
output.to_csv('./solution.csv', index=False)