Business goal
This is an app that predicts whether someone makes more or less than 50K per year from a set of features. It can be used when that information is unavailable or confidential, for example during a loan application at a financial institution or a car financing application, to give a better financial picture of the applicant. The goal of this notebook is a classification model with a precision above 80%.
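
For reference, precision is the share of predicted positives (here, predictions of "more than 50K") that are actually positive: TP / (TP + FP). The following is a minimal sketch of how the 80% target could be checked. The label arrays and the `TARGET_PRECISION` name are made up for illustration and are not results from this notebook.

```python
from sklearn.metrics import precision_score

# Hypothetical labels for illustration only: 1 = more than 50K, 0 = 50K or less
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

TARGET_PRECISION = 0.80  # goal stated in the business goal (assumed name)

# precision = TP / (TP + FP), computed over the predicted positives
precision = precision_score(y_true, y_pred)
meets_goal = precision > TARGET_PRECISION
print(f"precision = {precision:.2f}, meets goal: {meets_goal}")
```

Precision is a natural metric for this use case if wrongly labelling a low-income applicant as high-income is the costlier mistake, since it penalises false positives.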

Table of contents
0 Import the libraries
1 Get the data
1.1 Import csv file
1.2 Split the data into training and test sets, creating a copy of the datasets
2 Explore the data
2.1 Quick glance at the data
2.2 Pandas Profiling
2.3 Univariate analysis
2.3.1 Age
2.3.2 Workclass
2.3.3 Final weight
2.3.4 Education
2.3.5 Education-number
2.3.6 Marital-Status
2.3.7 Occupation
2.3.8 Relationship
2.3.9 Race
2.3.10 Gender
2.3.11 Capital gain
2.3.12 Capital loss
2.3.13 Hours per week
2.3.14 Native country
2.3.15 Income > 50K (Target variable)
2.4 Bi-variate analysis
2.4.1 Scatter plots
2.4.2 Age vs hours per week (Numerical vs Numerical feature)
2.4.3 Age vs educational number (Numerical vs Numerical feature)
2.4.4 Educational number vs hours per week (Numerical vs Numerical feature)
2.4.5 Educational number vs age (Numerical vs Numerical feature)
2.4.6 Chi2 test for all the categorical features (Categorical vs Categorical feature)
2.4.7 ANOVA test of age vs the rest of categorical features (Numerical vs Categorical feature)
2.4.8 Heatmap
3 Prepare the data
3.1 Transform to be done on each feature
3.2 Identify extra data that would be useful
3.3 Remove outliers
3.4 Fix the missing values
3.5 Fix skewness
3.6 Oversampling with SMOTE
3.7 Data preprocessing
4 Short-list promising models
4.1 Functions to evaluate the models and all the metrics
4.2 Quick models comparison
4.3 Drop least predictive features
4.4 Shortlist the top five models
5 Fine-Tune the top five models
5.1 Random forest
5.2 Neural network
5.3 KNN
5.4 Gradient boosting
5.5 Bagging
5.6 Winner
6 Test the performance of the model on the test set


0. Import the libraries

In [None]:
import numpy as np
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
from IPython.display import HTML, display
from pandas_profiling import ProfileReport
from pathlib import Path
from scipy.stats import probplot, chi2_contingency, chi2
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score, cross_val_predict
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, OrdinalEncoder
from sklearn.inspection import permutation_importance
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, roc_curve, roc_auc_score
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier, AdaBoostClassifier, VotingClassifier, ExtraTreesClassifier, StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from imblearn.over_sampling import SMOTE
import joblib
import os
import pickle
import re
import streamlit as st
from scipy.stats import norm
%matplotlib inline


1. Get the data
1.1 Import csv file

In [None]:
train_data = pd.read_csv('../Credit-Card-Transactions-Fraud-Detection-Cat/fraudTrain.csv')
test_data = pd.read_csv('../Credit-Card-Transactions-Fraud-Detection-Cat/fraudTest.csv')

In [None]:
full_data = pd.concat([train_data, test_data], axis=0)

In [None]:
full_data = full_data.sample(frac=1).reset_index(drop=True)

In [None]:
full_data.shape

1.2 Split the data into training and test sets, creating a copy of the datasets

In [None]:
# split the data into train and test
def data_split(df, test_size):
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=42)
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)
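
The split above is purely random. Because the notebook later oversamples with SMOTE, the target is presumably imbalanced, and in that case a stratified split keeps the class proportions equal in both sets. The sketch below is one possible variant, not the method used in this notebook. The function name `stratified_data_split`, the `target_col` parameter, and the toy `income` column are assumptions introduced for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_data_split(df, target_col, test_size, random_state=42):
    """Split df so that both parts keep the class proportions of target_col."""
    train_df, test_df = train_test_split(
        df,
        test_size=test_size,
        random_state=random_state,
        stratify=df[target_col],  # preserve the class ratio in both splits
    )
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)

# Toy frame with 20% positives (hypothetical data, not the real dataset)
toy = pd.DataFrame({"age": range(100), "income": [1] * 20 + [0] * 80})
toy_train, toy_test = stratified_data_split(toy, "income", 0.2)
print(toy_train["income"].mean(), toy_test["income"].mean())
```

Both printed proportions should match the overall positive rate of the toy frame, which a plain random split does not guarantee.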

In [None]:
train_data, test_data = data_split(full_data, 0.2)

In [None]:
train_data.shape

In [None]:
test_data.shape

In [None]:
# work on copies so the original splits stay untouched
train_duplicate = train_data.copy()
test_duplicate = test_data.copy()

2. Explore the data
2.1 Quick glance at the data

In [None]:
train_duplicate.head()

In [None]:
train_duplicate.info()

In [None]:
train_duplicate.describe()

In [None]:
msno.matrix(train_duplicate)
plt.show()

In [None]:
msno.bar(train_duplicate)
plt.show()

2.2 Pandas Profiling

In [None]:
profile_report = ProfileReport(train_duplicate, explorative=True, dark_mode=True)

In [None]:
profile_report_folder = Path('pandas_profile_folder/Credit_Card_Transctions_Fraud_profile.html')

# generate the HTML report only if it has not been written already
if not profile_report_folder.exists():
    profile_report_folder.parent.mkdir(parents=True, exist_ok=True)
    profile_report.to_file(profile_report_folder)

2.3 Univariate analysis

In [None]:
# Return the count and the relative frequency (%) of each distinct value in a column
def value_cnt_norm_cal(df, feature):
    ftr_value_cnt = df[feature].value_counts()
    ftr_value_cnt_norm = df[feature].value_counts(normalize=True) * 100
    ftr_value_cnt_concat = pd.concat([ftr_value_cnt, ftr_value_cnt_norm], axis=1)
    ftr_value_cnt_concat.columns = ['Count', 'Frequency (%)']
    return ftr_value_cnt_concat
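
As a quick illustration of what the helper returns, here is a sketch on a toy frame. The helper is repeated so the example runs on its own, and the `workclass` values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Same helper as above, repeated so this example is self-contained
def value_cnt_norm_cal(df, feature):
    ftr_value_cnt = df[feature].value_counts()
    ftr_value_cnt_norm = df[feature].value_counts(normalize=True) * 100
    ftr_value_cnt_concat = pd.concat([ftr_value_cnt, ftr_value_cnt_norm], axis=1)
    ftr_value_cnt_concat.columns = ['Count', 'Frequency (%)']
    return ftr_value_cnt_concat

# Hypothetical categorical column with four observations
toy_df = pd.DataFrame({'workclass': ['Private', 'Private', 'Private', 'State-gov']})
summary = value_cnt_norm_cal(toy_df, 'workclass')
print(summary)
```

The result is indexed by the distinct values of the column, with one row per value, so it can be reused for any categorical feature in the univariate analysis.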

2.3.1 Age

In [None]:
train_duplicate['age'].head()

In [None]:
train_duplicate['age'].describe()

In [None]:
train_duplicate['age'].dtype

In [None]:
train_duplicate['age'].isnull().sum()

In [None]:
age_value_cnt_norm = train_duplicate['age'].value_counts(normalize=True) * 100

In [None]:
age_value_cnt = train_duplicate['age'].value_counts()