<a href="https://colab.research.google.com/github/KhuyenLE-maths/Project_Income_Classification/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Content
I. Data preparation

II. Data cleaning
- Check the duplicated rows
- Missing values

III. Exploratory data analysis

IV. Data preprocessing

V. PCA

VI. Classification

VII. Compare the classification results of all algorithms
 
 ---------------------------------------------------------------


In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
import os
os.chdir('/content/drive/MyDrive/Competitions/Income_classification/')
os.listdir()

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from termcolor import colored

## I. Data preparation

In [None]:
data_path = 'Dataset/'

In [None]:
os.listdir(data_path)

In [None]:
df_train = pd.read_csv(data_path + 'train.csv.zip')
df_test = pd.read_csv(data_path + 'test.csv')

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
print('Train shape:', df_train.shape)
print('Test shape', df_test.shape)

## II. Data cleaning

### 1. Check the duplicated rows

In [None]:
df_train[df_train.duplicated()]

In [None]:
df_test[df_test.duplicated()]

**There are no duplicated rows in either the training or the test set.**

### 2. Missing values

In [None]:
df_train.isna().sum().sum()

In [None]:
df_test.isna().sum().sum()

Neither the training set nor the test set contains missing (NaN) values, so our dataset is completely clean.
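
`isna()` only counts values that pandas already treats as missing. As a hedged sanity check, the sketch below also counts placeholder strings per column. The token list (`'?'` and the empty string), the helper name `missing_report` and the toy frame are assumptions made for illustration, not something observed in this dataset.

```python
import numpy as np
import pandas as pd


def missing_report(df, placeholders=("?", "")):
    """Count NaN values and (assumed) placeholder strings per column."""
    n_nan = df.isna().sum()
    # Compare whitespace-stripped string values against the placeholder tokens
    as_text = df.astype(str).apply(lambda s: s.str.strip())
    n_placeholder = as_text.isin(list(placeholders)).sum()
    return pd.DataFrame({"n_nan": n_nan, "n_placeholder": n_placeholder})


# Toy frame, not the real data
toy = pd.DataFrame({"job": ["Sales", " ?", "Tech"], "age": [39.0, np.nan, 50.0]})
report = missing_report(toy)
print(report)
```

On the notebook's data the same helper could be called as `missing_report(df_train)` and `missing_report(df_test)`.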

## III. Exploratory data analysis

### 1. Types of dataset

In [None]:
plt.figure(figsize = (5,5))
df_train.dtypes.value_counts().plot.pie()
plt.title('Distribution of data types')
plt.show()

### 2. Target column

In [None]:
plt.figure(figsize = (6,5))
df_train['target_income'].value_counts().plot.bar()
plt.title('Income distribution')
plt.show()

### 3. Visualize other columns

In [None]:
Cols = list(df_train.columns)
Cols.remove('target_income')
Cols.remove('ID')

In [None]:
Cols_visu = ['work_type', 'education', 'total_education_yrs', 'marital_state', 'job', 'status', 'ethnicity', 'sex', 'hrs_per_week']

for col in Cols_visu:
    
    if col == 'hrs_per_week':
        plt.figure(figsize = (18, 5))
        
    df_train[col].value_counts().plot.bar()
    plt.title(col)
    plt.show()

### Visualize the distribution of target in each category

**Income by work type**

In [None]:
group_work_type = df_train.groupby(['work_type', 'target_income'])
group_work_type_df = group_work_type.size().unstack()
group_work_type_df.plot(kind = 'barh', figsize = (10,6))
plt.show()

**Income by education**

In [None]:
group_edu = df_train.groupby(['education', 'target_income'])
group_edu_df = group_edu.size().unstack()
group_edu_df.plot(kind = 'barh', figsize = (10,6))
plt.show()

***This figure shows that the higher your education, the better your income tends to be. The probability of earning a high income with only a 9th-grade education is very small.***
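
Raw counts make it hard to compare categories of very different sizes. A minimal sketch of how the share of each income class per education level could be computed with `pd.crosstab(..., normalize='index')` follows. The toy frame is invented for illustration; only the column names `education` and `target_income` come from this notebook.

```python
import pandas as pd

# Toy data, not the real training set
toy = pd.DataFrame({
    "education": ["9th", "9th", "9th", "9th",
                  "Masters", "Masters", "Bachelors", "Bachelors"],
    "target_income": [0, 0, 0, 1, 1, 1, 0, 1],
})

# Each row sums to 1: the share of income class 0 and 1 within one education level
rates = pd.crosstab(toy["education"], toy["target_income"], normalize="index")

# Share of the high-income class (1), from lowest to highest
high_income_share = rates[1].sort_values()
print(high_income_share)
```

On the real data the call would be `pd.crosstab(df_train['education'], df_train['target_income'], normalize='index')`.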

**Income by marital state**

In [None]:
group_mari = df_train.groupby(['marital_state', 'target_income'])
group_mari_df = group_mari.size().unstack()
group_mari_df.plot(kind = 'barh', figsize = (8,6))
plt.show()

***Wow, what a surprise! The proportion of group 1 (income > 50k) in the "Married-civ-spouse" group is much larger than in the other groups. So, if you want a better income, you should get married first :v***

**Income by status**

In [None]:
group_status = df_train.groupby(['status', 'target_income'])
group_status_df = group_status.size().unstack()
group_status_df.plot(kind = 'barh', figsize = (8,6))
plt.show()

**Income by job**

In [None]:
group_jobs = df_train.groupby(['job', 'target_income'])
group_jobs_df = group_jobs.size().unstack()
group_jobs_df.plot(kind = 'barh', figsize = (10,6))
plt.show()

**Income by ethnicity**

In [None]:
group_eth = df_train.groupby(['ethnicity', 'target_income'])
group_eth_df = group_eth.size().unstack()
group_eth_df.plot(kind = 'bar', figsize = (8,6))
plt.show()

**Income by sex**

In [None]:
group_s = df_train.groupby(['sex', 'target_income'])
group_s_df = group_s.size().unstack()
group_s_df.plot(kind = 'bar', figsize = (8,6))
plt.show()

**Correlation between variables**

In [None]:
plt.figure(figsize = (8,8))
sns.heatmap(df_train.corr(), square = True, annot = False)
plt.show()

**From this figure, we can see that income is highly correlated with some variables, such as "age" and "total_education_yrs". The correlations between income and "final_weight", "job", "ethnicity" and "nationality" are nearly zero.**
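
A heatmap without annotations is hard to read precisely. As a sketch, the correlations with the target can also be listed and ranked by absolute value. The toy data below is synthetic and only mimics the column names; `numeric_only=True` (available in recent pandas versions) skips non-numeric columns. Note that a Pearson correlation computed on integer-encoded nominal columns depends on the arbitrary order of the codes, so such values should be read with care.

```python
import numpy as np
import pandas as pd

# Synthetic toy data, not the real training set
rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 10, n)
toy = pd.DataFrame({
    "age": age,
    "final_weight": rng.normal(0, 1, n),              # unrelated to the target
    "target_income": (age + rng.normal(0, 5, n) > 45).astype(int),
    "sex": rng.choice(["Male", "Female"], n),         # non-numeric, skipped below
})

# Correlation of every numeric column with the target, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)["target_income"]
    .drop("target_income")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```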

## IV. Data preprocessing

###  Encoding some categorical columns

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
Cols_encode = ['work_type', 'education', 'marital_state', 'job', 'status', 'ethnicity', 'sex', 'nationality']

for col in Cols_encode:
    le = LabelEncoder()
    df_train[col] = le.fit_transform(df_train[col])
    df_test[col] = le.transform(df_test[col])
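
`LabelEncoder.transform` raises an error when the test set contains a category that was not seen during fitting. A hedged alternative, not the method used in this notebook, is scikit-learn's `OrdinalEncoder`, which can map unseen categories to a reserved value. The toy frames and the choice of `-1` as the unknown code are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames, not the real data; "Farming" appears only in the test frame
train = pd.DataFrame({"job": ["Sales", "Tech", "Sales"],
                      "sex": ["Male", "Female", "Female"]})
test = pd.DataFrame({"job": ["Tech", "Farming"],
                     "sex": ["Male", "Male"]})

cols = ["job", "sex"]
# Unseen categories are encoded as -1 (assumed convention) instead of raising
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_enc = pd.DataFrame(enc.fit_transform(train[cols]), columns=cols, index=train.index)
test_enc = pd.DataFrame(enc.transform(test[cols]), columns=cols, index=test.index)
print(test_enc)
```

Categories are sorted before encoding, so here "Sales" becomes 0 and "Tech" becomes 1, while the unseen "Farming" becomes -1.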

In [None]:
df_train.head()

In [None]:
X_train = df_train.drop(['ID', 'target_income'], axis = 1).values
y_train = df_train['target_income']

X_test = df_test.drop('ID', axis = 1).values

In [None]:
print('Train shape', X_train.shape)
print('Test shape', X_test.shape)

## V. PCA

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from pca_plot_tools import display_explained_var_ratio

**Data normalization**

In [None]:
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

In [None]:
pca = PCA().fit(X_train_std)

var_ratio = pca.explained_variance_ratio_*100
display_explained_var_ratio(var_ratio)
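
`display_explained_var_ratio` comes from a local module that is not shown here. As a rough sketch of what can be read from the explained variance ratios, the code below computes the cumulative ratio and the number of components needed to reach a threshold. The 95% threshold and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 independent features plus 3 noisy copies of them
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
X_toy = np.hstack([base, base + 0.05 * rng.normal(size=(300, 3))])

X_std = StandardScaler().fit_transform(X_toy)
pca = PCA().fit(X_std)

# Cumulative explained variance in percent
cum_ratio = np.cumsum(pca.explained_variance_ratio_) * 100

threshold = 95.0  # assumed target, in percent
# First position where the cumulative ratio reaches the threshold
n_components = int(np.searchsorted(cum_ratio, threshold) + 1)
print(np.round(cum_ratio, 2), n_components)
```

With the notebook's variables, `pca` fitted on `X_train_std` can be used in the same way.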

## VI. Classification

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score 

import timeit

In [None]:
#from classif_tools import save_classif

rst_path = os.getcwd() + '/' + 'Classif_rst/'
# Create the results folder if it does not exist yet (np.save fails otherwise)
os.makedirs(rst_path, exist_ok = True)

### Data preparation for classification

In [None]:
X0 = X_train[np.where(y_train == 0)]
X1 = X_train[np.where(y_train == 1)]

y0 = np.zeros(X0.shape[0])
y1 = np.ones(X1.shape[0])

**Train/val splitting**

In [None]:
t_size = 0.25
rand_state = 12

X0_train, X0_val, y0_train, y0_val = train_test_split(X0, y0, test_size = t_size, random_state = rand_state)
X1_train, X1_val, y1_train, y1_val = train_test_split(X1, y1, test_size = t_size, random_state = rand_state)

In [None]:
X0_train.shape

In [None]:
X1_train.shape

In [None]:
X_train = np.concatenate((X0_train, X1_train))
X_val = np.concatenate((X0_val, X1_val))

y_train = np.concatenate((y0_train, y1_train))
y_val = np.concatenate((y0_val, y1_val))
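
Splitting each class separately and concatenating the parts keeps the class ratio identical in the training and validation sets. `train_test_split` offers a `stratify` argument that achieves the same goal in one call. The sketch below uses synthetic data; the selected rows will not match the manual split above, and the output is shuffled rather than ordered by class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with an 80% / 20% class ratio
rng = np.random.default_rng(12)
X_toy = rng.normal(size=(200, 4))
y_toy = np.array([0] * 160 + [1] * 40)

# stratify=y_toy preserves the class ratio in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=12, stratify=y_toy
)
print(y_tr.mean(), y_va.mean())
```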

**Data balancing for training set**

In [None]:
over_sampler = SMOTE(random_state = rand_state)
X_train, y_train = over_sampler.fit_resample(X_train, y_train)

**Data normalization**

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

In [None]:
val_info = {}
val_info['y_val'] = y_val

np.save(rst_path + 'val_info.npy', val_info)

### 1. Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
param_grid_LR = {'C': [1, 2, 3]}
model_LR = LogisticRegression

In [None]:
grid_search_LR = GridSearchCV(estimator = model_LR(),
                             param_grid = param_grid_LR, 
                             cv = 5, 
                             verbose = 1)

In [None]:
t0 = timeit.default_timer()
grid_search_LR.fit(X_train_scaled, y_train)
y_pred_LR = grid_search_LR.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0

rst_LR = {}
rst_LR['y_pred_LR'] = y_pred_LR
rst_LR['best_score'] = grid_search_LR.best_score_
rst_LR['best_params'] = grid_search_LR.best_params_
rst_LR['time'] = t 

np.save(rst_path + 'rst_LR.npy', rst_LR)
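
The same fit / predict / time / save pattern is repeated for every model in this section. A possible way to factor it out is sketched below. The helper name `run_grid_search` and its arguments are hypothetical, and the synthetic data only stands in for `X_train_scaled` and `X_val_scaled`.

```python
import os
import tempfile
import timeit

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def run_grid_search(name, estimator, param_grid, X_tr, y_tr, X_va, out_dir, cv=5):
    """Fit a grid search, predict on the validation set and save the results."""
    search = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=cv)
    t0 = timeit.default_timer()
    search.fit(X_tr, y_tr)
    y_pred = search.predict(X_va)
    elapsed = timeit.default_timer() - t0

    rst = {
        "y_pred_" + name: y_pred,          # same key layout as the cells in this notebook
        "best_score": search.best_score_,
        "best_params": search.best_params_,
        "time": elapsed,
    }
    np.save(os.path.join(out_dir, "rst_" + name + ".npy"), rst)
    return rst


# Synthetic data and a temporary folder, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
out_dir = tempfile.mkdtemp()
rst_LR = run_grid_search("LR", LogisticRegression(), {"C": [1, 2, 3]},
                         X[:150], y[:150], X[150:], out_dir)

# The saved dictionary can be read back the same way as in section VII
loaded = np.load(os.path.join(out_dir, "rst_LR.npy"), allow_pickle=True).item()
print(loaded["best_params"])
```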

### 2. Support vector machine

In [None]:
from sklearn.svm import SVC
param_grid_SVM = [{'C': [5, 10, 20, 40], 'kernel': ['rbf']}]
model_SVM = SVC

In [None]:
grid_search_SVM = GridSearchCV(estimator = model_SVM(),
                           param_grid = param_grid_SVM,
                           cv = 5,
                           verbose = 1
                          )

In [None]:
t0 = timeit.default_timer()
grid_search_SVM.fit(X_train_scaled, y_train)
y_pred_SVM = grid_search_SVM.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0

rst_SVM = {}
rst_SVM['y_pred_SVM'] = y_pred_SVM
rst_SVM['best_score'] = grid_search_SVM.best_score_
rst_SVM['best_params'] = grid_search_SVM.best_params_
rst_SVM['time'] = t
np.save(rst_path + 'rst_SVM.npy', rst_SVM)

### 3. K-nearest neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier
param_grid_knn = {'n_neighbors' : [6, 8, 10, 12]}
model_knn = KNeighborsClassifier

In [None]:
grid_search_knn = GridSearchCV(estimator = model_knn(),
                              param_grid = param_grid_knn, 
                              cv = 5, 
                              verbose = 1)

In [None]:
t0 = timeit.default_timer()
grid_search_knn.fit(X_train_scaled, y_train)
y_pred_knn = grid_search_knn.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0

rst_knn = {}
rst_knn['y_pred_knn'] = y_pred_knn
rst_knn['best_score'] = grid_search_knn.best_score_
rst_knn['best_params'] = grid_search_knn.best_params_
rst_knn['time'] = t 

np.save(rst_path + 'rst_knn.npy', rst_knn)

### 4. Linear Discriminant Analysis

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
param_grid_LDA = {}
model_LDA = LinearDiscriminantAnalysis

In [None]:
grid_search_LDA = GridSearchCV(estimator = model_LDA(), 
                               param_grid = param_grid_LDA, 
                               cv = 5, 
                               verbose = 1)

In [None]:
t0 = timeit.default_timer()
grid_search_LDA.fit(X_train_scaled, y_train)
y_pred_LDA = grid_search_LDA.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0

rst_LDA = {}
rst_LDA['y_pred_LDA'] = y_pred_LDA
rst_LDA['best_score'] = grid_search_LDA.best_score_
rst_LDA['best_params'] = grid_search_LDA.best_params_
rst_LDA['time'] = t

np.save(rst_path + 'rst_LDA.npy', rst_LDA)

### 5. Decision tree classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
param_grid_DT = {'criterion': ['gini', 'entropy']}
model_DT = DecisionTreeClassifier

In [None]:
grid_search_DT = GridSearchCV(estimator = model_DT(), 
                              param_grid = param_grid_DT,
                              cv = 5, 
                              verbose = 1)

In [None]:
t0 = timeit.default_timer()
grid_search_DT.fit(X_train_scaled, y_train)
y_pred_DT = grid_search_DT.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0

rst_DT = {}
rst_DT['y_pred_DT'] = y_pred_DT
rst_DT['best_score'] = grid_search_DT.best_score_
rst_DT['best_params'] = grid_search_DT.best_params_
rst_DT['time'] = t 

np.save(rst_path + 'rst_DT.npy', rst_DT)

### 6. Gaussian Naive Bayes Classifier

In [None]:
from sklearn.naive_bayes import GaussianNB
param_grid_NB = {}
model_NB = GaussianNB

In [None]:
grid_search_NB = GridSearchCV(estimator = model_NB(),
                              param_grid = param_grid_NB, 
                              cv = 5, 
                              verbose = 1)

In [None]:
t0 = timeit.default_timer()
grid_search_NB.fit(X_train_scaled, y_train)
y_pred_NB = grid_search_NB.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0 

rst_NB = {}
rst_NB['y_pred_NB'] = y_pred_NB
rst_NB['best_score'] = grid_search_NB.best_score_
rst_NB['best_params'] = grid_search_NB.best_params_
rst_NB['time'] = t 

np.save(rst_path + 'rst_NB.npy', rst_NB)

### 7. Stochastic Gradient Descent Classifier

In [None]:
from sklearn.linear_model import SGDClassifier
param_grid_SGD = {
    "penalty" : ['l2', 'l1', 'elasticnet']
}
model_SGD = SGDClassifier

In [None]:
grid_search_SGD = GridSearchCV(estimator = model_SGD(),
                               param_grid = param_grid_SGD,
                               cv = 5, 
                               verbose = 2)

In [None]:
t0 = timeit.default_timer()
grid_search_SGD.fit(X_train_scaled, y_train)
y_pred_SGD = grid_search_SGD.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0 

rst_SGD = {}
rst_SGD['y_pred_SGD'] = y_pred_SGD
rst_SGD['best_score'] = grid_search_SGD.best_score_
rst_SGD['best_params'] = grid_search_SGD.best_params_
rst_SGD['time'] = t 

np.save(rst_path + 'rst_SGD.npy', rst_SGD)

### 8. Random forest classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
param_grid_RF = {
    "criterion": ['gini', 'entropy']
}
model_RF = RandomForestClassifier

In [None]:
grid_search_RF = GridSearchCV(estimator = model_RF(),
                              param_grid = param_grid_RF,
                              cv = 5, 
                              verbose = 2)

In [None]:
t0 = timeit.default_timer()
grid_search_RF.fit(X_train_scaled, y_train)
y_pred_RF = grid_search_RF.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0 

rst_RF = {}
rst_RF['y_pred_RF'] = y_pred_RF
rst_RF['best_score'] = grid_search_RF.best_score_
rst_RF['best_params'] = grid_search_RF.best_params_
rst_RF['time'] = t 

np.save(rst_path + 'rst_RF.npy', rst_RF)

### 9. Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
param_grid_GBC = {}
model_GBC = GradientBoostingClassifier

In [None]:
grid_search_GBC = GridSearchCV(estimator = model_GBC(),
                               param_grid = param_grid_GBC, 
                               cv = 5, 
                               verbose = 1)

In [None]:
t0 = timeit.default_timer()
grid_search_GBC.fit(X_train_scaled, y_train)
y_pred_GBC = grid_search_GBC.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0 

rst_GBC = {}
rst_GBC['y_pred_GBC'] = y_pred_GBC
rst_GBC['best_score'] = grid_search_GBC.best_score_
rst_GBC['best_params'] = grid_search_GBC.best_params_
rst_GBC['time'] = t 
np.save(rst_path + 'rst_GBC.npy', rst_GBC)

### 10. AdaBoost 

In [None]:
from sklearn.ensemble import AdaBoostClassifier
param_grid_Ada = {}
model_Ada = AdaBoostClassifier

In [None]:
grid_search_Ada = GridSearchCV(estimator = model_Ada(),
                               param_grid = param_grid_Ada,
                               cv = 5, 
                               verbose = 1)

In [None]:
t0 = timeit.default_timer()
grid_search_Ada.fit(X_train_scaled, y_train)
y_pred_Ada = grid_search_Ada.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0 

rst_Ada = {}
rst_Ada['y_pred_Ada'] = y_pred_Ada
rst_Ada['best_score'] = grid_search_Ada.best_score_
rst_Ada['best_params'] = grid_search_Ada.best_params_
rst_Ada['time'] = t 

np.save(rst_path + 'rst_Ada.npy', rst_Ada)

### 11. XGBoost (Extreme Gradient Boosting)

In [None]:
from xgboost import XGBClassifier
param_grid_XGB = {}
model_XGB = XGBClassifier

In [None]:
grid_search_XGB = GridSearchCV(estimator = model_XGB(),
                               param_grid = param_grid_XGB,
                               cv = 5, 
                               verbose = 1)

In [None]:
t0 = timeit.default_timer()
grid_search_XGB.fit(X_train_scaled, y_train)
y_pred_XGB = grid_search_XGB.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0 

rst_XGB = {}
rst_XGB['y_pred_XGB'] = y_pred_XGB
rst_XGB['best_score'] = grid_search_XGB.best_score_
rst_XGB['best_params'] = grid_search_XGB.best_params_
rst_XGB['time'] = t 

np.save(rst_path + 'rst_XGB.npy', rst_XGB)

### 12. Histogram-based gradient boosting

In [None]:
# Only needed for scikit-learn < 1.0; newer versions export the estimator directly
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
model_HGB = HistGradientBoostingClassifier
param_grid_HGB = {}

In [None]:
grid_search_HGB = GridSearchCV(estimator = model_HGB(),
                               param_grid = param_grid_HGB,
                               cv = 5, 
                               verbose = 1)

In [None]:
t0 = timeit.default_timer()
grid_search_HGB.fit(X_train_scaled, y_train)
y_pred_HGB = grid_search_HGB.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0 

rst_HGB = {}
rst_HGB['y_pred_HGB'] = y_pred_HGB
rst_HGB['best_score'] = grid_search_HGB.best_score_
rst_HGB['best_params'] = grid_search_HGB.best_params_
rst_HGB['time'] = t 

np.save(rst_path + 'rst_HGB.npy', rst_HGB)

### 13. LightGBM

In [None]:
from lightgbm import LGBMClassifier
model_LGBM = LGBMClassifier
param_grid_LGBM = {}

In [None]:
grid_search_LGBM = GridSearchCV(estimator = model_LGBM(),
                                param_grid = param_grid_LGBM,
                                cv = 5,
                                verbose = 1)

In [None]:
t0 = timeit.default_timer()
grid_search_LGBM.fit(X_train_scaled, y_train)
y_pred_LGBM = grid_search_LGBM.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0 

rst_LGBM = {}
rst_LGBM['y_pred_LGBM'] = y_pred_LGBM
rst_LGBM['best_score'] = grid_search_LGBM.best_score_
rst_LGBM['best_params'] = grid_search_LGBM.best_params_
rst_LGBM['time'] = t 

np.save(rst_path + 'rst_LGBM.npy', rst_LGBM)

### 14. CatBoost Classifier 

In [None]:
!pip install catboost

In [None]:
from catboost import CatBoostClassifier
param_grid_Cat = {}
model_Cat = CatBoostClassifier

In [None]:
grid_search_Cat = GridSearchCV(estimator = model_Cat(),
                               param_grid = param_grid_Cat,
                               cv = 5,
                               verbose = 1)

In [None]:
t0 = timeit.default_timer()
grid_search_Cat.fit(X_train_scaled, y_train)
y_pred_Cat = grid_search_Cat.predict(X_val_scaled)
t1 = timeit.default_timer()
t = t1 - t0 

rst_Cat = {}
rst_Cat['y_pred_Cat'] = y_pred_Cat
rst_Cat['best_score'] = grid_search_Cat.best_score_
rst_Cat['best_params'] = grid_search_Cat.best_params_
rst_Cat['time'] = t 

np.save(rst_path + 'rst_Cat.npy', rst_Cat)

## VII. Compare the classification results of all algorithms

In [None]:
Model_names = ['LR', 'SVM', 'knn', 'LDA', 'DT', 'NB', 'SGD', 'RF', 'GBC', 'Ada', 'XGB', 'HGB', 'LGBM', 'Cat']
Acc_scores = []
B_acc_scores = []
f1_scores = []

for name in Model_names:
  rst = np.load(rst_path + 'rst_' + name + '.npy', allow_pickle = True).item()
  n = 'y_pred_' + name
  y_val_pred = rst[n]
  acc = np.round(accuracy_score(y_val, y_val_pred), 3)
  b_acc = np.round(balanced_accuracy_score(y_val, y_val_pred), 3)
  f1 = np.round(f1_score(y_val, y_val_pred),3)

  Acc_scores.append(acc)
  B_acc_scores.append(b_acc)
  f1_scores.append(f1)

In [None]:
rst_df = {}
rst_df['Models'] = Model_names
rst_df['Accuracy score'] = Acc_scores
rst_df['Balanced accuracy'] = B_acc_scores
rst_df['f1 score'] = f1_scores

In [None]:
rst_df = pd.DataFrame(rst_df)

In [None]:
rst_df.sort_values(by = 'Accuracy score', ascending= False)

In [None]:
from sklearn.metrics import confusion_matrix
# Rows of M are the true classes, columns are the predicted classes
M = confusion_matrix(y_val, y_pred_Cat)
M

In [None]:
fig, ax = plt.subplots()
ax.matshow(M, cmap = plt.cm.Blues)

# Annotate each cell with its count
for i in range(2):
    for j in range(2):
        c = M[j,i]
        ax.text(i, j, str(c), va='center', ha='center')
plt.show()

From this result, we see that CatBoost and LightGBM are the two best algorithms in this case, with accuracies of 86.3% and 85.9%, respectively. 
The balanced accuracy is smaller than the accuracy score in almost all cases (the exceptions are Logistic Regression, Stochastic Gradient Descent and Linear Discriminant Analysis). This means that observations tend to be assigned to the majority class. We can see this more clearly from the confusion matrix obtained by the AdaBoost algorithm: the misclassified proportion of group 1 is 28.51%, which is larger than that of group 0 (9.02%).
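
The per-class misclassification rates quoted here can be derived from a confusion matrix by normalizing each row, and the balanced accuracy is the mean of the per-class recalls. A small worked sketch on invented labels (not this notebook's predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Toy labels: 8 samples of class 0 and 4 samples of class 1
y_true = np.array([0] * 8 + [1] * 4)
y_pred = np.array([0] * 7 + [1] + [1] * 2 + [0] * 2)

M = confusion_matrix(y_true, y_pred)            # rows: true class, columns: predicted class
recall_per_class = M.diagonal() / M.sum(axis=1)
error_per_class = 1 - recall_per_class          # misclassified share of each true class

acc = accuracy_score(y_true, y_pred)
b_acc = balanced_accuracy_score(y_true, y_pred)  # equals recall_per_class.mean()
print(error_per_class, acc, b_acc)
```

In this toy case the minority class has the larger error rate (0.5 versus 0.125), so the balanced accuracy (0.6875) falls below the plain accuracy (0.75), which is the pattern described above.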