# Machine Learning Pipeline - Feature Selection

In the following notebooks, each step of the Machine Learning pipeline below is implemented.

Machine Learning Pipeline:


1. Data Analysis
2. Feature Engineering
3. ***Feature Selection***
4. Model Training
5. Obtaining Predictions/Scoring

This notebook focuses on Feature Selection.


> Dataset source: the adult census dataset from [Kaggle](https://www.kaggle.com/datasets/overload10/adult-census-dataset?resource=download), as per the project requirement. See below for more details:

===================================================================================================================

## Predicting Adult Census Income

> The aim of this project is to build a machine learning model that predicts the adult census income class, i.e., whether a sample falls under >50K or <=50K, based on explanatory variables describing different aspects of each sample.

### Why is this important?

> Predicting the adult census income class would benefit financial institutions, for whom it could improve profitability, and would help consumer-based services target the right consumers.

### What is the objective of the machine learning model?

1. To perform in-depth exploratory data analysis of the datasets.
2. To engineer new predictive features from the available data.
3. To develop a supervised model to classify census income into >50K and <=50K.
4. To recommend a threshold that will perform better in terms of F1 score.
5. To create an API endpoint for the trained model and deploy it.

#### Reproducibility: Setting the seed

To ensure reproducibility between runs of the same notebook, and between the research and production environments, it is extremely important to set the seed for every step that involves randomness.
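The cells shown in this notebook do not draw random numbers themselves, so the following is only a minimal sketch of what seeding a random step could look like with NumPy. The names `SEED` and `sample_rows` and the seed value are assumptions made for illustration; they do not come from the project.

```python
import numpy as np

SEED = 0  # hypothetical value; any fixed integer works


def sample_rows(n_rows, n_samples, seed=SEED):
    """Draw row indices without replacement from a seeded generator."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_rows, size=n_samples, replace=False)


# Two calls with the same seed return the same indices
first = sample_rows(100, 5)
second = sample_rows(100, 5)
print(np.array_equal(first, second))  # True
```

Passing the seed explicitly, rather than relying on global state, makes it easier to reuse the same value in both the research and production code.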

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# to build the models
import statsmodels.api as sm

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [2]:
# load the train and test set with the engineered variables

X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')

X_train.head()

Unnamed: 0,age,workclass,final_weight,education,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,country
0,0.482644,0.166667,0.352022,0.357143,0.666667,0.692308,0.4,0.666667,0.0,0.0,0.0,0.395685,0.666667
1,0.528238,1.0,0.279148,0.857143,1.0,1.0,0.8,0.666667,1.0,0.0,0.0,0.497826,0.666667
2,0.704502,0.5,0.440719,0.928571,0.5,0.923077,0.6,0.666667,1.0,0.0,0.0,0.548951,0.666667
3,0.813899,0.5,0.272839,0.785714,1.0,0.692308,0.8,0.666667,1.0,0.0,0.0,0.395685,0.666667
4,0.49823,0.166667,0.282357,0.5,0.0,0.384615,0.6,0.333333,0.0,0.0,0.0,0.518272,0.666667


In [3]:
# load the target
y_train = pd.read_csv('ytrain.csv')
y_test = pd.read_csv('ytest.csv')

y_train.head()

Unnamed: 0,salary
0,0
1,1
2,1
3,1
4,0


In [4]:
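# fit a logistic regression on the training set
# (sm.Logit does not add an intercept; none is included here)
logit_model = sm.Logit(y_train, X_train)
result = logit_model.fit()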


Optimization terminated successfully.
         Current function value: 0.404223
         Iterations 7


In [5]:
# print the model summary, then flag the coefficients significant at the 5% level
print(result.summary2())
print(result.pvalues < 0.05)


                         Results: Logit
Model:              Logit            Method:           MLE       
Dependent Variable: salary           Pseudo R-squared: 0.266     
Date:               2024-05-12 18:32 AIC:              23716.6921
No. Observations:   29304            BIC:              23824.4033
Df Model:           12               Log-Likelihood:   -11845.   
Df Residuals:       29291            LL-Null:          -16139.   
Converged:          1.0000           LLR p-value:      0.0000    
No. Iterations:     7.0000           Scale:            1.0000    
-----------------------------------------------------------------
                  Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
-----------------------------------------------------------------
age              -0.1325   0.0929  -1.4254 0.1540 -0.3146  0.0497
workclass         0.0992   0.0678   1.4629 0.1435 -0.0337  0.2322
final_weight     -3.1115   0.1335 -23.3050 0.0000 -3.3731 -2.8498
education         2.2814   0.1006  2

In [6]:
# keep only the features whose p-value is below 0.05
tmp = result.pvalues < 0.05
tmp = tmp[tmp]

In [7]:
sel_elements = tmp.index
print(sel_elements)

Index(['final_weight', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
       'hours_per_week', 'country'],
      dtype='object')
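Cells 6 and 7 keep the features whose logit coefficient has a p-value below 0.05. As a minimal sketch, the same filter could be wrapped in a small helper. The function name `select_significant` and its `alpha` parameter are hypothetical; the three p-values are the ones visible in the summary output above.

```python
import pandas as pd


def select_significant(pvalues, alpha=0.05):
    """Return the names of features whose p-value is below alpha."""
    mask = pvalues < alpha
    return mask[mask].index.tolist()


# p-values copied from the visible rows of the summary table
pvalues = pd.Series({'age': 0.1540, 'workclass': 0.1435, 'final_weight': 0.0000})
print(select_significant(pvalues))  # ['final_weight']
```

With the fitted model, the call would be `select_significant(result.pvalues)`. Exposing `alpha` makes the significance level explicit instead of leaving 0.05 hard-coded in the cell.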


In [8]:
pd.Series(sel_elements).to_csv('selected_features.csv', index=False)
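A later notebook would presumably read this file back and restrict the feature matrix to the selected columns. The sketch below shows one way to do that, assuming the file was written with `index=False` as above, so that it contains a single column with a header row. The helper `load_selected_features` and the toy data are hypothetical.

```python
import os
import tempfile

import pandas as pd


def load_selected_features(path):
    """Read the one-column CSV written by Series.to_csv(index=False)."""
    return pd.read_csv(path).iloc[:, 0].tolist()


# Toy stand-ins for the real files (hypothetical data)
features = pd.Index(['education', 'sex'])
X = pd.DataFrame({'age': [0.1, 0.2], 'education': [0.3, 0.4], 'sex': [0.0, 1.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'selected_features.csv')
    pd.Series(features).to_csv(path, index=False)
    selected = load_selected_features(path)

# Keep only the selected columns
X_selected = X[selected]
print(list(X_selected.columns))  # ['education', 'sex']
```

Selecting by position with `iloc[:, 0]` avoids depending on the header that pandas writes for an unnamed Series.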

This concludes the Feature Selection section.