# Machine Learning Pipeline - Model Training

The following notebooks implement each step of the machine learning pipeline below.

Machine Learning Pipeline:


1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. ***Model Training***
5. Obtaining Predictions/Scoring

This notebook focuses on Model Training.


> Dataset source: the adult census dataset from [Kaggle](https://www.kaggle.com/datasets/overload10/adult-census-dataset?resource=download), used as per the project requirement. See below for more details:

===================================================================================================================

## Predicting Adult Census Income

> The aim of this project is to build a machine learning model that predicts the adult census income class, i.e., whether a sample falls under >50K or <=50K, based on explanatory variables describing each sample.

### Why is this important?

> Predicting the adult census income class would benefit financial institutions and could improve their profitability. It would also help consumer-based services target the right consumers.

### What is the objective of the machine learning model?

1. To perform in-depth exploratory data analysis of the datasets.
2. To engineer new predictive features from the available graphs.
3. To develop a supervised model to classify census income into >50K and <=50K.
4. To recommend a threshold that will perform better in terms of F1 score.
5. To create an API endpoint for the trained model and deploy it.

#### Reproducibility: Setting the seed

To ensure reproducibility between runs of the same notebook, and between the research and production environments, it is important to set the seed for every step that involves some element of randomness.
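As a minimal sketch of what setting the seed can look like, the snippet below seeds Python's and NumPy's global random number generators and checks that two runs give the same draws. The names `SEED` and `seed_everything` are hypothetical and not part of the original notebook; scikit-learn estimators are seeded separately through their `random_state` argument.

```python
import random

import numpy as np

SEED = 0  # assumed value, chosen to match random_state=0 used for the model


def seed_everything(seed: int) -> None:
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


# Draw three numbers, reset the seed, and draw again
seed_everything(SEED)
first_draw = np.random.rand(3)

seed_everything(SEED)
second_draw = np.random.rand(3)

# The same seed reproduces the same sequence of draws
print(np.array_equal(first_draw, second_draw))
```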

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# to save the model
import joblib

# to build the model
from sklearn.linear_model import LogisticRegression

# to evaluate the model (classification metrics, since the target is a class)
from sklearn.metrics import classification_report, confusion_matrix

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [2]:
# load the train and test set with the engineered variables

# we built and saved these datasets in a previous notebook.

X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')

X_train.head()

Unnamed: 0,age,workclass,final_weight,education,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,country
0,0.482644,0.166667,0.352022,0.357143,0.666667,0.692308,0.4,0.666667,0.0,0.0,0.0,0.395685,0.666667
1,0.528238,1.0,0.279148,0.857143,1.0,1.0,0.8,0.666667,1.0,0.0,0.0,0.497826,0.666667
2,0.704502,0.5,0.440719,0.928571,0.5,0.923077,0.6,0.666667,1.0,0.0,0.0,0.548951,0.666667
3,0.813899,0.5,0.272839,0.785714,1.0,0.692308,0.8,0.666667,1.0,0.0,0.0,0.395685,0.666667
4,0.49823,0.166667,0.282357,0.5,0.0,0.384615,0.6,0.333333,0.0,0.0,0.0,0.518272,0.666667


In [3]:
# load the target
y_train = pd.read_csv('ytrain.csv')
y_test = pd.read_csv('ytest.csv')

y_test.head()

Unnamed: 0,salary
0,0
1,0
2,0
3,0
4,1


In [4]:
# load the pre-selected features
# ==============================

# we selected the features in the previous notebook (step 3)


features = pd.read_csv('selected_features.csv')
features = features['0'].to_list() 

# display final feature set
features

['final_weight',
 'education',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital_gain',
 'capital_loss',
 'hours_per_week',
 'country']

In [5]:
# reduce the train and test set to the selected features

X_train = X_train[features]
X_test = X_test[features]

### Logistic Regression

In [6]:
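# set up the model
# remember to set the random_state / seed

# pass the target as a 1-D Series: fitting on the single-column DataFrame
# triggers scikit-learn's DataConversionWarning (column_or_1d)
clf = LogisticRegression(random_state=0).fit(X_train, y_train['salary'])
y_pred = clf.predict(X_test)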
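The imports include `joblib` for saving the model. A minimal sketch of that step follows, assuming a small synthetic dataset in place of `X_train` and `y_train`. The file name `logistic_regression.joblib`, the temporary directory, and the `X_demo`/`y_demo` arrays are hypothetical and only serve the illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data; the notebook would use X_train and y_train here
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

model = LogisticRegression(random_state=0).fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory for this demo
model_path = os.path.join(tempfile.mkdtemp(), "logistic_regression.joblib")
joblib.dump(model, model_path)

# Reload the model and confirm it predicts exactly like the original
restored = joblib.load(model_path)
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Comparing predictions after reloading is a cheap check that the saved file matches the model that was evaluated.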


In [7]:
# import the metrics module
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

array([[2252,  188],
       [ 355,  462]], dtype=int64)
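Precision, recall, F1 and accuracy can be derived by hand from the confusion matrix above. The sketch below does the arithmetic in plain Python, assuming scikit-learn's layout (rows are true classes, columns are predicted classes) and class 1 meaning >50K. The helper name `precision_recall_f1` is hypothetical.

```python
# Confusion matrix from the cell above: rows = true class, columns = predicted class
cnf = [[2252, 188],
       [355, 462]]

tn, fp = cnf[0]  # true class 0 (<=50K): correctly predicted 0, wrongly predicted 1
fn, tp = cnf[1]  # true class 1 (>50K): wrongly predicted 0, correctly predicted 1


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the class treated as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 1 (>50K) as the positive class
over_50k = precision_recall_f1(tp, fp, fn)

# Class 0 (<=50K) as the positive class: the roles of the cells swap
under_50k = precision_recall_f1(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print([round(v, 2) for v in under_50k])  # [0.86, 0.92, 0.89]
print([round(v, 2) for v in over_50k])   # [0.71, 0.57, 0.63]
print(round(accuracy, 2))                # 0.83
```

The recall of 0.57 for the >50K class means that 355 of the 817 high-income samples are missed at the default decision threshold.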

In [9]:
from sklearn.metrics import classification_report
target_names = ['<=50', '>50']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

        <=50       0.86      0.92      0.89      2440
         >50       0.71      0.57      0.63       817

    accuracy                           0.83      3257
   macro avg       0.79      0.74      0.76      3257
weighted avg       0.83      0.83      0.83      3257
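The report above reflects the default 0.5 decision threshold of `clf.predict`, while objective 4 asks for a threshold that performs better on F1. One possible approach is to scan candidate thresholds over predicted probabilities and keep the one with the highest F1. The sketch below uses NumPy only; the toy arrays `y_val` and `proba_val` and both helper names are hypothetical. In the notebook the probabilities would come from `clf.predict_proba(...)[:, 1]`, ideally on a validation split rather than the test set (an assumption about the workflow, not something the notebook does).

```python
import numpy as np


def f1_at_threshold(y_true, proba, threshold):
    """F1 of the positive class when predicting 1 for proba >= threshold."""
    y_pred = (proba >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, proba, thresholds):
    """Return (threshold, F1) for the candidate with the highest F1."""
    scores = [f1_at_threshold(y_true, proba, t) for t in thresholds]
    best = int(np.argmax(scores))
    return float(thresholds[best]), scores[best]


# Hypothetical validation labels and predicted probabilities of class 1
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
proba_val = np.array([0.04, 0.12, 0.21, 0.31, 0.37, 0.41, 0.47, 0.62, 0.66, 0.91])

# Candidate thresholds 0.05, 0.10, ..., 0.95
thresholds = np.arange(5, 100, 5) / 100

best_t, best_f1 = best_f1_threshold(y_val, proba_val, thresholds)
default_f1 = f1_at_threshold(y_val, proba_val, 0.5)
print(best_t, round(best_f1, 2), round(default_f1, 2))
```

On this toy data the scan selects 0.35 (F1 of 0.80) over the default 0.5 (F1 of about 0.57). On the census data the selected value will differ and should be confirmed on data that was not used to choose it.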

