# Introduction to Modeling

In this notebook, the focus shifts to the modeling phase of the analysis, where various machine learning algorithms will be employed to predict income levels based on the preprocessed dataset. The primary goal is to evaluate the effectiveness of different models in accurately classifying individuals into income categories (<=50K or >50K).

## Objectives

1. **Model Selection**:
   - Three distinct algorithms will be explored:
     - Logistic Regression
     - Random Forest
     - LightGBM
   - Each model has its own strengths and weaknesses, so comparing them gives a fuller picture of what works on this dataset.

2. **Model Evaluation**:
   - The models will be assessed using accuracy, balanced accuracy, precision, recall, F1-score, and ROC AUC. Because the classes are imbalanced (about 75% of the test set earns <=50K), the class-aware metrics matter as much as raw accuracy when identifying the best-performing model.

3. **Hyperparameter Tuning**:
   - A cross-validated hyperparameter search will be run for each model, and the best configuration found will be used for the final evaluation.

4. **Comparison of Results**:
   - A comparative analysis of the models' performances will be conducted to draw insights on which algorithm is most effective for this specific task.

This modeling phase aims to leverage the cleaned and processed data to develop predictive insights into income levels, ultimately contributing to the overarching goal of understanding income determinants in the dataset.


In [1]:
import pandas as pd


## Data loading and first look

In [2]:
path = '../census_income/data/preprocessed_data.csv'
df = pd.read_csv(path)
df.head()

| Unnamed: 0 | age | fnlwgt | education_num | hours_per_week | workclass_ Local-gov | workclass_ Private | workclass_ Self-emp-inc | workclass_ Self-emp-not-inc | workclass_ State-gov | workclass_ Without-pay | ... | native_country_ Scotland | native_country_ South | native_country_ Taiwan | native_country_ Thailand | native_country_ Trinadad&Tobago | native_country_ United-States | native_country_ Vietnam | native_country_ Yugoslavia | income_ >50K | capital_balance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.452055 | 0.047277 | 0.8 | 0.122449 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.0 |
| 1 | 0.287671 | 0.137244 | 0.533333 | 0.397959 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.0 |
| 2 | 0.493151 | 0.150212 | 0.4 | 0.397959 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.0 |
| 3 | 0.150685 | 0.220703 | 0.8 | 0.397959 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 4 | 0.273973 | 0.184109 | 0.866667 | 0.397959 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.0 |


## Features Selection

In [3]:
path = '../census_income/data/feature_scores.csv'
feature_scores = pd.read_csv(path)
feature_scores.head(25)

| Unnamed: 0 | Feature | Mutual_Information | Correlation | Variance | Importance | Average_Score |
|---|---|---|---|---|---|---|
| 0 | marital_status_ Married-civ-spouse | 0.913009 | 1.0 | 1.0 | 0.508697 | 0.855426 |
| 1 | marital_status_ Never-married | 0.547597 | 0.717975 | 0.877849 | 0.177436 | 0.580214 |
| 2 | age | 0.552804 | 0.542426 | 0.129969 | 0.906054 | 0.532813 |
| 3 | education_num | 0.525607 | 0.752344 | 0.116003 | 0.646112 | 0.510017 |
| 4 | capital_balance | 1.0 | 0.043856 | 0.058312 | 0.704654 | 0.451706 |
| 5 | gender_ Male | 0.243741 | 0.485545 | 0.880532 | 0.094909 | 0.426182 |
| 6 | relationship_ Not-in-family | 0.174926 | 0.432728 | 0.765541 | 0.09369 | 0.366721 |
| 7 | hours_per_week | 0.323654 | 0.514261 | 0.059922 | 0.500037 | 0.349469 |
| 8 | relationship_ Own-child | 0.300288 | 0.506873 | 0.50682 | 0.065095 | 0.344769 |
| 9 | fnlwgt | 0.255079 | 0.018234 | 0.020599 | 1.0 | 0.323478 |


## Saving selected features dataset into a csv file

In [4]:
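# Keep the six highest-ranked features (rows are ordered by Average_Score, descending).
# Selecting the column by name avoids depending on its position in the CSV.
selected_features = feature_scores['Feature'].head(6).tolist()
# Add the target column so the saved dataset can be used directly for training.
selected_features.append('income_ >50K')

df[selected_features].to_csv('../census_income/data/selected_features.csv', index=False)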
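
The `Average_Score` column that drives this selection appears to be the plain mean of the four individual scores (`Mutual_Information`, `Correlation`, `Variance`, `Importance`). That is an inference from the displayed values, not something the notebook states, so the sketch below only checks it against three rows copied from the table.

```python
from statistics import mean

# (Mutual_Information, Correlation, Variance, Importance), reported Average_Score
# Values copied from the feature_scores table; the averaging rule is an assumption.
rows = {
    "marital_status_ Married-civ-spouse": ((0.913009, 1.0, 1.0, 0.508697), 0.855426),
    "age": ((0.552804, 0.542426, 0.129969, 0.906054), 0.532813),
    "fnlwgt": ((0.255079, 0.018234, 0.020599, 1.0), 0.323478),
}

# Difference between the recomputed mean and the reported score, per feature.
differences = {}
for feature, (scores, reported) in rows.items():
    computed = mean(scores)
    differences[feature] = abs(computed - reported)
    print(f"{feature}: computed={computed:.6f} reported={reported:.6f}")
```

If the rule holds, a feature with one very high score (such as `fnlwgt`, whose `Importance` is 1.0) can still rank low when its other three scores are small.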

## Training, testing, and evaluation

In [5]:
"""
################################ Evaluating: logistic_regression ################################ 

Best parameters: {'model__class_weight': 'balanced', 'model__max_iter': 100000, 'model__solver': 'newton-cg', 'pca__n_components': None}
Best cross-validation score: 0.8633477497559051
Train Accuracy: 0.7614626752557787
Train Balanced-Accuracy: 0.7878958510431031
Test Accuracy: 0.7633992706376396
Test Balanced-Accuracy: 0.7859581378252579
Confusion Matrix:
      0     1
0  5036  1760
1   381  1872

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.74      0.82      6796
           1       0.52      0.83      0.64      2253

    accuracy                           0.76      9049
   macro avg       0.72      0.79      0.73      9049
weighted avg       0.83      0.76      0.78      9049

AUC: 0.864290095711767

 ################################ Evaluating: random_forest ################################ 

Best parameters: {'model__bootstrap': True, 'model__class_weight': None, 'model__criterion': 'gini', 'model__max_depth': 10, 'model__max_features': None, 'model__min_samples_leaf': 5, 'model__min_samples_split': 5, 'model__n_estimators': 180, 'model__n_jobs': -1, 'pca__n_components': None}
Best cross-validation score: 0.9032435294050781
Train Accuracy: 0.8624952633573323
Train Balanced-Accuracy: 0.7828858205655482
Test Accuracy: 0.8495966405127638
Test Balanced-Accuracy: 0.7667943624706003
Confusion Matrix:
      0     1
0  6332   464
1   897  1356

Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.93      0.90      6796
           1       0.75      0.60      0.67      2253

    accuracy                           0.85      9049
   macro avg       0.81      0.77      0.78      9049
weighted avg       0.84      0.85      0.84      9049

AUC: 0.9032980223608729

 ################################ Evaluating: lightgbm ################################ 

Best parameters: {'model__class_weight': None, 'model__learning_rate': 0.07, 'model__max_depth': 15, 'model__n_estimators': 75, 'model__num_leaves': 25, 'model__verbosity': -1, 'pca__n_components': None}
Best cross-validation score: 0.905880528667366
Train Accuracy: 0.8593217127699886
Train Balanced-Accuracy: 0.7725031527315342
Test Accuracy: 0.8497071499613217
Test Balanced-Accuracy: 0.7603403754120789
Confusion Matrix:
      0     1
0  6377   419
1   941  1312

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.94      0.90      6796
           1       0.76      0.58      0.66      2253

    accuracy                           0.85      9049
   macro avg       0.81      0.76      0.78      9049
weighted avg       0.84      0.85      0.84      9049

AUC: 0.9048911829548046
-------------------------------- 
Best model identified: --------------------------------
Parameters: {'model__class_weight': None, 'model__learning_rate': 0.07, 'model__max_depth': 15, 'model__n_estimators': 75, 'model__num_leaves': 25, 'model__verbosity': -1, 'pca__n_components': None}
Score: 0.905880528667366
----------------------------------------------------------------
Exporting model
Type of clf: <class 'sklearn.pipeline.Pipeline'>

Model successfully exported.

"""

# Census Income Prediction: Model Evaluation

In this project, a predictive model was developed to classify individuals by income level (<=50K or >50K). A **pipeline** was used to chain an optional PCA step (`pca`) with the classifier (`model`); for every model the search selected `pca__n_components: None`, which keeps all components. Three machine learning algorithms were evaluated: **Logistic Regression**, **Random Forest**, and **LightGBM**. The results for each model, measured on a held-out test set of 9,049 individuals, are detailed below.
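
A pipeline-plus-search setup of this kind might look roughly like the following. This is a minimal sketch, not the notebook's actual training code, which is not shown: the synthetic data from `make_classification`, the small grid, `cv=3`, and `scoring="roc_auc"` are all assumptions (the logged cross-validation scores sit close to the test AUCs, which hints at AUC scoring but does not confirm it). Only the step names `pca` and `model` come from the logged parameter keys.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for the census data (roughly 75% / 25%).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step names "pca" and "model" match the prefixes in the logged parameters.
pipeline = Pipeline([
    ("pca", PCA()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Grid keys follow scikit-learn's "<step>__<parameter>" convention.
param_grid = {
    "pca__n_components": [None, 3],
    "model__class_weight": [None, "balanced"],
}

# Assumed scoring and fold count; the original settings are not shown.
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validation score:", search.best_score_)
# Pipeline.score delegates to the final estimator, so this is plain accuracy.
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))
```

Because the search wraps the whole pipeline, the PCA step is refit inside each cross-validation fold, so nothing from the validation fold leaks into the transformation.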

---

### 1. Logistic Regression
- **Best Parameters:**  
  `{'model__class_weight': 'balanced', 'model__max_iter': 100000, 'model__solver': 'newton-cg', 'pca__n_components': None}`  
- **Performance Metrics:**  
  - Cross-Validation Score: 0.8633  
  - Train Accuracy: 76.15%  
  - Test Accuracy: 76.34%  
  - Balanced Test Accuracy: 78.60%  
  - AUC: 0.8643  

- **Insights:**  
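  - High precision for class 0 (low income) at 93%, but low precision for class 1 (high income) at 52%: of the 3,632 individuals predicted as high income, 1,760 are false positives.  
  - Strong recall for class 1 (83%, with only 381 false negatives), consistent with the selected `class_weight='balanced'`. This makes the model effective where missing a high-income individual is the costlier error.  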
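
The percentages quoted for Logistic Regression can be re-derived from its test confusion matrix in the log. The snippet below is a small arithmetic check, assuming the scikit-learn convention that rows are true classes and columns are predicted classes (the row sums match the reported supports of 6796 and 2253).

```python
# Logistic Regression test confusion matrix, copied from the log.
# Rows = true class, columns = predicted class (assumed convention).
tn, fp = 5036, 1760
fn, tp = 381, 1872

precision_1 = tp / (tp + fp)   # share of predicted high-income that is correct
recall_1 = tp / (tp + fn)      # share of true high-income that is found
recall_0 = tn / (tn + fp)      # share of true low-income that is found
accuracy = (tn + tp) / (tn + fp + fn + tp)
# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_0 + recall_1) / 2

print(f"precision (class 1): {precision_1:.2f}")
print(f"recall (class 1):    {recall_1:.2f}")
print(f"accuracy:            {accuracy:.4f}")
print(f"balanced accuracy:   {balanced_accuracy:.4f}")
```

Balanced accuracy weights both classes equally, which is why it exceeds plain accuracy for this model: the high class 1 recall counts as much as the lower class 0 recall, despite class 1 being the minority.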

---

### 2. Random Forest
- **Best Parameters:**  
  `{'model__bootstrap': True, 'model__class_weight': None, 'model__criterion': 'gini', 'model__max_depth': 10, 'model__max_features': None, 'model__min_samples_leaf': 5, 'model__min_samples_split': 5, 'model__n_estimators': 180, 'model__n_jobs': -1, 'pca__n_components': None}`  
- **Performance Metrics:**  
  - Cross-Validation Score: 0.9032  
  - Train Accuracy: 86.25%  
  - Test Accuracy: 84.96%  
  - Balanced Test Accuracy: 76.68%  
  - AUC: 0.9033  

- **Insights:**  
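  - Higher test accuracy (84.96% vs 76.34%) and AUC (0.9033 vs 0.8643) than Logistic Regression, but slightly lower test balanced accuracy (76.68% vs 78.60%).  
  - Higher precision for class 1 (high income) at 75%, although recall dropped to 60%.  
  - Favors the majority class: class 0 recall is 93%, while 897 of the 2,253 high-income individuals are missed.  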

---

### 3. LightGBM
- **Best Parameters:**  
  `{'model__class_weight': None, 'model__learning_rate': 0.07, 'model__max_depth': 15, 'model__n_estimators': 75, 'model__num_leaves': 25, 'model__verbosity': -1, 'pca__n_components': None}`  
- **Performance Metrics:**  
  - Cross-Validation Score: 0.9059  
  - Train Accuracy: 85.93%  
  - Test Accuracy: 84.97%  
  - Balanced Test Accuracy: 76.03%  
  - AUC: 0.9049  

- **Insights:**  
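  - The best-performing model based on cross-validation score (0.9059) and AUC (0.9049), although the margin over Random Forest is small (0.9032 and 0.9033 respectively).  
  - Precision for class 1 (high income) reached 76%, with recall at 58%, the lowest class 1 recall of the three models.  
  - Matches Random Forest's test accuracy (84.97% vs 84.96%) with fewer trees (75 vs 180 estimators).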

---

### Conclusion
- **Best Model Identified:** **LightGBM**  
  - Parameters: `{'model__class_weight': None, 'model__learning_rate': 0.07, 'model__max_depth': 15, 'model__n_estimators': 75, 'model__num_leaves': 25, 'model__verbosity': -1, 'pca__n_components': None}`  
  - Score: 0.9059  

The **LightGBM** model was chosen as the final model because it achieved the highest cross-validation score and test AUC. Its high-income predictions are precise (76%) but miss many cases (58% recall), so Logistic Regression (83% recall) remains the better option where false negatives matter most. The trained pipeline has been exported and is ready for deployment in income classification tasks.
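
The trade-off behind this choice can be re-derived from the three test confusion matrices in the log. The sketch below is only a recomputation of reported numbers, again assuming rows are true classes and columns are predicted classes; the helper `summarize` is a name introduced here for illustration.

```python
# Test confusion matrices copied from the log: (tn, fp, fn, tp).
confusion = {
    "logistic_regression": (5036, 1760, 381, 1872),
    "random_forest": (6332, 464, 897, 1356),
    "lightgbm": (6377, 419, 941, 1312),
}


def summarize(tn, fp, fn, tp):
    """Return accuracy, balanced accuracy, and class 1 recall."""
    total = tn + fp + fn + tp
    recall_0 = tn / (tn + fp)
    recall_1 = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "balanced_accuracy": (recall_0 + recall_1) / 2,
        "recall_1": recall_1,
    }


summary = {name: summarize(*cm) for name, cm in confusion.items()}

# Share of the majority class: the accuracy of always predicting class 0.
majority_baseline = 6796 / 9049

for name, metrics in summary.items():
    print(name, {key: round(value, 4) for key, value in metrics.items()})
print("majority baseline accuracy:", round(majority_baseline, 4))

# The ranking depends on which metric is used.
best_by_accuracy = max(summary, key=lambda n: summary[n]["accuracy"])
best_by_balanced = max(summary, key=lambda n: summary[n]["balanced_accuracy"])
print("best by accuracy:", best_by_accuracy)
print("best by balanced accuracy:", best_by_balanced)
```

On these numbers LightGBM leads on accuracy, while Logistic Regression leads on balanced accuracy and class 1 recall. Always predicting class 0 would already reach about 75% accuracy, which puts the 76% accuracy of Logistic Regression in context.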


# Conclusions

The **Census Income Prediction Project** classifies income levels from demographic and socio-economic data. The developed model can provide useful insights for organizations and policymakers by highlighting the factors most strongly associated with income level, such as marital status, age, and education.

The final **LightGBM model** achieved the highest test AUC (0.9049) and a test accuracy of 84.97%, with a test balanced accuracy of 76.03%, making it a reasonable candidate for practical applications. Potential uses for this model include:

- **Targeted Policy Design:** Assisting governments in identifying underserved populations and tailoring social welfare programs.  
- **Market Segmentation:** Helping businesses customize products or services based on income demographics.  
- **Resource Allocation:** Enabling efficient distribution of resources and investment in regions or groups with specific income characteristics.

By leveraging this predictive model, decisions can be data-driven, fostering more equitable and effective outcomes in various domains.
