<a href="https://colab.research.google.com/github/24057080-kiit/capstone-projects-using-machine-learning-techniques/blob/main/2023_07_11_SourabhKumar_CapstoneProject17.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Capstone Project 17: Census Income Analysis



---

## Instructions

### Goal of the Project:

From class 67 to class 79, you learned the following concepts:

 - Feature Encoding.
 - Recursive Feature Elimination (RFE).
 - Logistic Regression classification using `sklearn` module.

In this project, you will apply what you have learned in class 67 - 79 to achieve the following goals.

|||
|-|-|
|**Main Goal**|Create a Logistic Regression classification model with the ideal number of features selected using RFE.|



---

### Context

According to the government, census income is the income received by an individual regularly before payments for personal income taxes, medicare deductions, and so on. This information is asked annually from the people to record in the census. It helps to identify the eligible families for various funds and programs rolled out by communities and the government.



---

#### Getting Started

Follow the steps described below to solve the project:

1. Click on the link provided below to open the Colab file for this project.
   
   https://colab.research.google.com/drive/1pZqPvDkOf0QA48v1GYzd_Vlp_7lPciiQ

2. Create the duplicate copy of the Colab file. Here are the steps to create the duplicate copy:

    - Click on the **File** menu. A new drop-down list will appear.

      <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/0_file_menu.png' width=500>

    - Click on the **Save a copy in Drive** option. A duplicate copy will get created. It will open up in the new tab on your web browser.

      <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/1_create_colab_duplicate_copy.png' width=500>

     - After creating the duplicate copy of the notebook, please rename it in the **YYYY-MM-DD_StudentName_CapstoneProject17** format.

3. Now, write your code in the prescribed code cells.

---

### Problem Statement

The dataset is extracted from the 1994 Census Bureau database. It contains anonymous individual records with features such as work experience, age, gender, country, and so on. The records are divided into two labels, people with a salary of **more than 50K or less than or equal to 50K**, so that the eligibility of individuals for government programs can be determined.

As a data scientist, your job is to build a model that predicts whether a particular individual has an annual income of **<=50k** or **>50k**.

**Things To Do:**

1. Importing and Analysing the Dataset

2. Data Cleaning

3. Feature Engineering

4. Train-Test Split

5. Data Standardisation

6. Logistic Regression - Model Training

7. Model Prediction and Evaluation

8. Features Selection Using RFE

9. Model Training and Prediction Using Ideal Features


----

### Dataset Description

The dataset includes 32561 instances with 14 features and 1 target column, described below:

|Field|Description|
|---:|:---|
|age|age of the person, Integer|
|workclass| employment information about the individual, Categorical|
|fnlwgt| unknown weights, Integer|
|education| highest level of education obtained, Categorical|
|education-years|number of years of education, Integer|
|marital-status| marital status of the person, Categorical|
|occupation|job title, Categorical|
|relationship| the individual's relation within the family, such as wife, husband, and so on, Categorical|
|race|Categorical|
|sex| gender, Male or Female|
|capital-gain| gain from sources other than salary/wages, Integer|
|capital-loss| loss from sources other than salary/wages, Integer|
|hours-per-week| hours worked per week, Integer|
|native-country| name of the native country, Categorical|
|income-group| annual income, Categorical,  **<=50k** or **>50k** |


**Notes:**
1. The dataset has no header row for the column name. (Can add column names manually)
2. There are invalid values in the dataset marked as **"?"**.
3. As the information about **fnlwgt** is non-existent it can be removed before model training.
4. Take note of the **whitespaces (" ")**  throughout the dataset.
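
The notes above can also be handled at load time. The following is a minimal sketch, not the approach this project asks for (the activities below use `replace()` and `dropna()` instead). It assumes a made-up two-row `sample` string in place of the real file, and the names `sample`, `cols`, and `toy_df` are purely illustrative. `skipinitialspace=True` strips the whitespace after each delimiter, and `na_values='?'` converts the invalid marker into `NaN`.

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the raw file layout (no header, ", " separators).
sample = "39, State-gov, 77516, <=50K\n54, ?, 140359, >50K\n"

# Column names are supplied manually because the file has no header row.
cols = ['age', 'workclass', 'fnlwgt', 'income-group']

toy_df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=cols,
    skipinitialspace=True,  # drop the space that follows each comma
    na_values='?',          # treat "?" as a missing value
)

# 'fnlwgt' is not used for modelling, so it can be dropped straight away.
toy_df = toy_df.drop(columns='fnlwgt')
print(toy_df)
```

If the data is loaded this way, the category labels no longer carry a leading space, so later mappings would need keys such as `'Male'` rather than `' Male'`.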



**Dataset Credits:** https://archive.ics.uci.edu/ml/datasets/adult

**Dataset Creator:**
```
Ronny Kohavi and Barry Becker
Data Mining and Visualization
Silicon Graphics.
e-mail: ronnyk '@' live.com for questions.
```

---


#### Activity 1:  Importing and Analysing the Dataset

In this activity, we have to load the dataset and analyse it.


**Perform the following tasks:**
- Load the dataset into a DataFrame.
- Rename the columns with the given list.
- Verify the number of rows and columns.
- Print the information of the DataFrame.


**1.** Start with importing all the required modules:



In [None]:
# Import modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression


**2.** Create a Pandas DataFrame for the **Adult Income** dataset using the below link with `header=None`.
> **Dataset Link:**
https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/adult.csv

**3.** Print the first five rows of the dataset:

In [None]:
# Load the Adult Income dataset into DataFrame.
df = pd.read_csv('https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/adult.csv' ,header = None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**4.** Rename the columns by applying the `rename()` function using the following column list:

>```python
column_name =['age', 'workclass', 'fnlwgt', 'education', 'education-years', 'marital-status', 'occupation', 'relationship', 'race','sex','capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income-group']
```



In [None]:
# Rename the column names in the DataFrame using the list given above.
# Create the list
# Rename the columns using 'rename()'
df.rename(columns={0 : 'age',
                   1 : 'workclass',
                   2 : 'fnlwgt',
                   3 : 'education',
                   4 : 'education-years',
                   5 : 'marital-status',
                   6 : 'occupation',
                   7 : 'relationship',
                   8 : 'race',
                   9 : 'sex',
                   10 : 'capital-gain',
                   11 : 'capital-loss',
                   12 : 'hours-per-week',
                   13 : 'native-country',
                   14 : 'income-group'}, inplace=True)
# Print the first five rows of the DataFrame
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-years,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-group
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K



**Hint:**

Syntax for `rename()` function:

`DataFrame.rename(columns={old_column_name:new_column_name})`
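
When the new names are already available as a list, the `{old: new}` mapping does not have to be typed by hand. A small sketch, assuming a toy two-column frame (`toy_df` and the shortened `column_name` list are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame with default integer column labels, as produced by header=None.
toy_df = pd.DataFrame([[39, ' State-gov'], [50, ' Private']])

# Shortened version of the column list, for illustration.
column_name = ['age', 'workclass']

# enumerate() pairs each position with its new name: {0: 'age', 1: 'workclass'}
mapping = dict(enumerate(column_name))
toy_df = toy_df.rename(columns=mapping)
print(toy_df.columns.tolist())
```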



**5.** Verify the number of rows and columns in the DataFrame:

In [None]:
# Print the number of rows and columns of the DataFrame
df.shape

(32561, 15)

**6.** Get the information of the DataFrame:

In [None]:
# Get the information of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1   workclass        32561 non-null  object
 2   fnlwgt           32561 non-null  int64 
 3   education        32561 non-null  object
 4   education-years  32561 non-null  int64 
 5   marital-status   32561 non-null  object
 6   occupation       32561 non-null  object
 7   relationship     32561 non-null  object
 8   race             32561 non-null  object
 9   sex              32561 non-null  object
 10  capital-gain     32561 non-null  int64 
 11  capital-loss     32561 non-null  int64 
 12  hours-per-week   32561 non-null  int64 
 13  native-country   32561 non-null  object
 14  income-group     32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


**Q:** Which is the target column?

**A:** income-group

**7.** Print the labels in the target column and their distribution as well:

In [None]:
# Check the distribution of the labels in the target column.
df['income-group'].value_counts()

 <=50K    24720
 >50K      7841
Name: income-group, dtype: int64

**Q:** Which target label has more records?

**A:** The `<=50K` label has more records (24720, compared with 7841 for `>50K`).

**After performing this activity, you must obtain the DataFrame with renamed columns and the target column identified.**

---


#### Activity 2: Data Cleaning


In this activity, we need to clean the DataFrame step by step.

**Perform the following tasks:**
- Check for the null or missing values in the DataFrame.
- Observe the categories in column `native-country`, `workclass`, and `occupation`.
- Replace the invalid `" ?"` values in the columns with `np.nan` using `replace()` function.
- Drop the rows having `nan` values using the `dropna()` function.



**1.** Verify the missing values in the DataFrame:

In [None]:
# Check for null values in the DataFrame.
df.isnull().sum()

age                0
workclass          0
fnlwgt             0
education          0
education-years    0
marital-status     0
occupation         0
relationship       0
race               0
sex                0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income-group       0
dtype: int64

**Q:** Are there any missing/null values that can be observed in the DataFrame?

**A:** No



**2.**  Observe the unique categories in columns `native-country`, `workclass`, and `occupation` to find the invalid values:

In [None]:
# Print the distribution of the columns mentioned to find the invalid values.

# Print the categories in column 'native-country'
print('Unique categories in the column native-country:\n', df['native-country'].unique())
print()
# Print the categories in column 'workclass'
print('\nUnique categories in the column workclass:\n', df['workclass'].unique())
print()
# Print the categories in column 'occupation'
print('\nUnique categories in the column occupation', df['occupation'].unique())
print()

Unique categories in the column native-country:
 [' United-States' ' Cuba' ' Jamaica' ' India' ' ?' ' Mexico' ' South'
 ' Puerto-Rico' ' Honduras' ' England' ' Canada' ' Germany' ' Iran'
 ' Philippines' ' Italy' ' Poland' ' Columbia' ' Cambodia' ' Thailand'
 ' Ecuador' ' Laos' ' Taiwan' ' Haiti' ' Portugal' ' Dominican-Republic'
 ' El-Salvador' ' France' ' Guatemala' ' China' ' Japan' ' Yugoslavia'
 ' Peru' ' Outlying-US(Guam-USVI-etc)' ' Scotland' ' Trinadad&Tobago'
 ' Greece' ' Nicaragua' ' Vietnam' ' Hong' ' Ireland' ' Hungary'
 ' Holand-Netherlands']


Unique categories in the column workclass:
 [' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']


Unique categories in the column occupation [' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
 ' Farming-fishing' ' Machine-op-inspct' ' Tech-support' ' ?'
 ' Protective-serv' '

**Q:** Is there any invalid value or category in any of the three columns?

**A:** Yes, all three columns contain the invalid category `' ?'`.

---

**3.** Replace the invalid values with `np.nan` and verify the number of null values in the DataFrame again.

In [None]:
# Replace the invalid values ' ?' with 'np.nan'.
df['native-country'] = df['native-country'].replace(' ?', np.nan)
df['workclass'] = df['workclass'].replace(' ?', np.nan)
df['occupation'] = df['occupation'].replace(' ?', np.nan)
# Check for null values in the DataFrame again.
df.isnull().sum()

age                   0
workclass          1836
fnlwgt                0
education             0
education-years       0
marital-status        0
occupation         1843
relationship          0
race                  0
sex                   0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      583
income-group          0
dtype: int64

**Q:** Are there any missing/null values that can be observed in the DataFrame?

**A:** Yes, there are missing values in `workclass`, `occupation`, and `native-country`.

---



**4.** Delete the rows having invalid values and drop the column `fnlwgt`. Print the number of rows of the DataFrame after dropping invalid values:

In [None]:
# Delete the rows with invalid values and the column not required

# Delete the rows with the 'dropna()' function
df.dropna(inplace = True)
# Delete the column with the 'drop()' function
df.drop(columns = 'fnlwgt', inplace = True)
# Print the number of rows and columns in the DataFrame.
df.shape

(30162, 14)

**After this activity, the DataFrame should neither have any null or invalid values nor the `fnlwgt` column.**
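
The cleaning steps of this activity can also be chained into a single expression. A minimal sketch on a made-up three-row frame (`toy_df` and `clean_df` are illustrative names, not part of the project):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data containing the ' ?' marker used in the real dataset.
toy_df = pd.DataFrame({
    'workclass': [' Private', ' ?', ' State-gov'],
    'occupation': [' Sales', ' ?', ' ?'],
    'fnlwgt': [1, 2, 3],
})

clean_df = (
    toy_df.replace(' ?', np.nan)   # mark invalid entries as missing
          .dropna()                # drop every row with a missing value
          .drop(columns='fnlwgt')  # remove the column not used for modelling
)
print(clean_df.shape)
```

Chaining returns a new DataFrame and leaves `toy_df` unchanged, which avoids the side effects of `inplace = True`.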

----


#### Activity 3: Feature Engineering

The dataset contains certain features that are categorical. To convert these features into numerical ones, use the `map()` and `get_dummies()` functions.


**Perform the following tasks for feature engineering:**

- Create a list of numerical columns.

- Map the values of the column `sex` to:
  - **`Male: 0`**
  - **`Female: 1`**

- Map the values of the column `income-group` to:
  - **` <=50K: 0`**
  - **` >50K: 1`**

- Create a list of categorical columns.

- Perform **one-hot encoding** to obtain numeric values for the rest of the categorical columns.

---

**1.** Separate the numeric columns first. To do so, create a list of numeric column names using the `select_dtypes()` function:


In [None]:
# Create a list of numeric columns names using 'select_dtypes()'.
numeric_df = df.select_dtypes(include='int64')
print(numeric_df.head())
numeric_columns = list(df.select_dtypes(include = ['int64', 'float64']).columns)
numeric_columns

   age  education-years  capital-gain  capital-loss  hours-per-week
0   39               13          2174             0              40
1   50               13             0             0              13
2   38                9             0             0              40
3   53                7             0             0              40
4   28               13             0             0              40


['age', 'education-years', 'capital-gain', 'capital-loss', 'hours-per-week']

**2.** Map the labels of the column `sex` to convert it into a numerical attribute using the `map()` function:
  - **`Male`** to **`0`**
  - **`Female`** to **`1`**

In [None]:
# Map the 'sex' column and verify the distribution of labels.

# Print the distribution before mapping
print(f"Before mapping: \n{df['sex'].value_counts()}")
# Map the values of the column to convert the categorical values to integer
df['sex'] = df['sex'].map({' Male':0,' Female':1})
# Print the distribution after mapping
print(f"\nAfter mapping: \n{df['sex'].value_counts()}")

Before mapping: 
 Male      20380
 Female     9782
Name: sex, dtype: int64

After mapping: 
0    20380
1     9782
Name: sex, dtype: int64


**3.** Map the labels of the column `income-group` to convert it into a numerical attribute from categorical one using `map()` function:
  - **` <=50K`** to **`0`**
  - **` >50K`** to **`1`**
  

In [None]:
# Map the 'income-group' column and verify the distribution of labels.
df['income-group'].unique()
# Print the distribution before mapping
print(f"Before mapping \n{df['income-group'].value_counts()}")
# Map the values of the column to convert the categorical values to integer
df['income-group'] = df['income-group'].map({' <=50K' : 0, ' >50K' : 1})
# Print the distribution after mapping
print(f'\nAfter mapping \n{df["income-group"].value_counts()}')

Before mapping 
 <=50K    22654
 >50K      7508
Name: income-group, dtype: int64

After mapping 
0    22654
1     7508
Name: income-group, dtype: int64


**4.** Create a list of categorical columns names using `select_dtypes()` function:

In [None]:
# Create the list of categorical columns names using 'select_dtypes()'.
categorical_col = list(df.select_dtypes(include=['object']).columns)
print(categorical_col)

['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'native-country']


**5.** Perform **one-hot encoding** on the columns of the DataFrame in the list above and save it in a **dummy DataFrame**. Also use parameter `drop_first= True` in the `get_dummies()` function.

***Recall:***
*This process of obtaining numeric values from non-numeric categorical values is called **one-hot encoding**. In this process a column is added for each of the categories in a particular feature and value in the columns will be binary `0` and `1` based on the original value in the feature and the category column. The `get_dummies()` function can be used to apply **one-hot encoding** to the non-numeric categorical feature columns*.
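
To see what `drop_first=True` changes, here is a small sketch on a made-up single-column frame (`toy_df`, `full`, and `reduced` are illustrative names):

```python
import pandas as pd

# Hypothetical toy column with three categories (leading spaces as in the real data).
toy_df = pd.DataFrame({'race': [' White', ' Black', ' White', ' Other']})

# One indicator column per category.
full = pd.get_dummies(toy_df, dtype=int)

# The first category (in sorted order) is dropped; it is implied when all others are 0.
reduced = pd.get_dummies(toy_df, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one level per feature removes a column that is fully determined by the others, which helps avoid perfectly collinear inputs in a linear model.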

In [None]:
# Create a 'income_dummies_df' DataFrame using the 'get_dummies()' function on the non-numeric categorical columns
income_dummies_df = pd.get_dummies(df[categorical_col], drop_first=True, dtype=int)
income_dummies_df

Unnamed: 0,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 11th,education_ 12th,education_ 1st-4th,education_ 5th-6th,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32557,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32558,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32559,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


 **6.** Drop the non-numeric categorical columns from the original income DataFrame using the `drop()` function:

In [None]:
# Drop the categorical columns from the Income DataFrame `income_df`
income_df = df.drop(columns = categorical_col)
income_df

Unnamed: 0,age,education-years,sex,capital-gain,capital-loss,hours-per-week,income-group
0,39,13,0,2174,0,40,0
1,50,13,0,0,0,13,0
2,38,9,0,0,0,40,0
3,53,7,0,0,0,40,0
4,28,13,1,0,0,40,0
...,...,...,...,...,...,...,...
32556,27,12,1,0,0,38,0
32557,40,9,0,0,0,40,1
32558,58,9,1,0,0,40,0
32559,22,9,0,0,0,20,0


**7.** Concatenate the income DataFrame and the dummy DataFrame to create the final DataFrame for the model.

Display the final DataFrame:

In [None]:
# Concat the income DataFrame and dummy DataFrame using 'concat()' function
new_df = pd.concat([income_df, income_dummies_df], axis = 1)
new_df

Unnamed: 0,age,education-years,sex,capital-gain,capital-loss,hours-per-week,income-group,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,39,13,0,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,13,0,0,0,13,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,9,0,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
3,53,7,0,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
4,28,13,1,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,12,1,0,0,38,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
32557,40,9,0,0,0,40,1,0,1,0,...,0,0,0,0,0,0,0,1,0,0
32558,58,9,1,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
32559,22,9,0,0,0,20,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0


**8.** Get the information of the DataFrame to verify the final columns and their data types:

In [None]:
# Get the information of the DataFrame
new_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 96 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   age                                         30162 non-null  int64
 1   education-years                             30162 non-null  int64
 2   sex                                         30162 non-null  int64
 3   capital-gain                                30162 non-null  int64
 4   capital-loss                                30162 non-null  int64
 5   hours-per-week                              30162 non-null  int64
 6   income-group                                30162 non-null  int64
 7   workclass_ Local-gov                        30162 non-null  int64
 8   workclass_ Private                          30162 non-null  int64
 9   workclass_ Self-emp-inc                     30162 non-null  int64
 10  workclass_ Self-emp-not-inc       

**Q:** How many columns are present in the final DataFrame?

**A:** There are 96 columns present in the final DataFrame

**Q:** What is the data type of the columns in the final DataFrame?

**A:** int64 is the data type of the columns in the final DataFrame.


**After this activity, the DataFrame should not have any non-numeric columns.**

---

#### Activity 4: Train-Test Split

We need to predict the value of the `income-group` variable, using other variables. Thus, `income-group` is the target or dependent variable and other columns except `income-group` are the features or the independent variables.

**1.** Split the dataset into the training set and test set such that the training set contains 70% of the instances and the remaining instances will become the test set.

**2.** Set `random_state = 42`:

In [None]:
# Split the training and testing data

# Import the module
from sklearn.model_selection import train_test_split

X = new_df.drop(columns='income-group')
y = new_df['income-group']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70, random_state = 42)

**After this activity, the feature and target data should be distributed in training and testing data**

---

#### Activity 5: Data Standardisation

Scaling the features helps the solver converge and so helps avoid the `ConvergenceWarning` message. To do this, scale the data using one of the normalisation methods, for instance, standard normalisation.

**1.** Create a function `standard_scalar()` to normalise the numeric columns of `X_train` and `X_test` data-frames using the standard normalisation method:


In [None]:
# Normalise the train and test data-frames using the standard normalisation method.
# Define the 'standard_scalar()' function for calculating Z-scores
def standard_scalar(series):
    return (series - series.mean()) / series.std()
# Create the DataFrames norm_X_train and norm_X_test as copies, so that X_train and X_test are left unchanged
norm_X_train = X_train.copy()
norm_X_test = X_test.copy()
# Apply the 'standard_scalar()' on X_train on numeric columns using apply() function and get the descriptive statistics of the normalised X_train
norm_X_train[numeric_columns] = X_train[numeric_columns].apply(standard_scalar, axis = 0)
print('\n Train set \n', norm_X_train.describe())
# Apply the 'standard_scalar()' on X_test on numeric columns using apply() function and get the descriptive statistics of the normalised X_test
norm_X_test[numeric_columns] = X_test[numeric_columns].apply(standard_scalar, axis = 0)
print('\n Test set \n', norm_X_test.describe())


 Train set 
                 age  education-years           sex  capital-gain  \
count  2.111300e+04     2.111300e+04  21113.000000  2.111300e+04   
mean   3.701970e-18    -2.628399e-16      0.325439 -1.076937e-17   
std    1.000000e+00     1.000000e+00      0.468550  1.000000e+00   
min   -1.620390e+00    -3.573064e+00      0.000000 -1.491608e-01   
25%   -7.835970e-01    -4.402200e-01      0.000000 -1.491608e-01   
50%   -9.894813e-02    -4.861450e-02      0.000000 -1.491608e-01   
75%    6.617728e-01     1.126202e+00      1.000000 -1.491608e-01   
max    3.932873e+00     2.301018e+00      1.000000  1.347470e+01   

       capital-loss  hours-per-week  workclass_ Local-gov  workclass_ Private  \
count  2.111300e+04    2.111300e+04          21113.000000        21113.000000   
mean   4.324575e-17    4.711599e-17              0.069625            0.738550   
std    1.000000e+00    1.000000e+00              0.254521            0.439435   
min   -2.184644e-01   -3.328547e+00              

**After this activity, training and testing feature data should be normalised using Data Standardisation.**
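
The `standard_scalar()` function above scales the test set with the test set's own mean and standard deviation. A common alternative, shown here only as a hedged sketch and not as the method this project prescribes, is to learn the statistics from the training set and reuse them on the test set with scikit-learn's `StandardScaler`. The toy frames and the `num_cols` list are made up for illustration. Note that `StandardScaler` divides by the population standard deviation, whereas `Series.std()` uses the sample standard deviation, so the values differ slightly from the Z-scores computed above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy train/test frames with one numeric and one binary column.
toy_train = pd.DataFrame({'age': [20.0, 30.0, 40.0], 'sex': [0, 1, 0]})
toy_test = pd.DataFrame({'age': [30.0, 50.0], 'sex': [1, 1]})
num_cols = ['age']  # only the numeric columns are scaled

scaler = StandardScaler()
toy_train_scaled = toy_train.copy()
toy_test_scaled = toy_test.copy()

# Learn mean and standard deviation from the training data only...
toy_train_scaled[num_cols] = scaler.fit_transform(toy_train[num_cols])
# ...and apply the same transformation to the test data.
toy_test_scaled[num_cols] = scaler.transform(toy_test[num_cols])

print(toy_test_scaled)
```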

---

#### Activity 6: Logistic Regression - Model Training

Implement Logistic Regression Classification using `sklearn` module to estimate the values of $\beta$ coefficients in the following way:

1. Build the model by importing the `LogisticRegression` class and creating an object of this class.
2. Call the `fit()` function on the Logistic Regression object and print the score using the `score()` function.


In [None]:
# Deploy the 'LogisticRegression' model using the 'fit()' function.
lg_df = LogisticRegression()
lg_df.fit(norm_X_train, y_train)
lg_df.score(norm_X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8479136077298347

**After this activity, a multi-variate logistic regression should be trained with all features.**
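
If the solver still reports that the iteration limit was reached, one option suggested by the warning itself is to raise `max_iter`. A minimal sketch on synthetic data (the `make_classification` dataset and the value `1000` are assumptions for illustration, not values taken from this project):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic dataset standing in for the census features.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)

# Allow more iterations than the default of 100.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used.
print(clf.n_iter_)
print(clf.score(X_toy, y_toy))
```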

---

#### Activity 7: Model Prediction and Evaluation

**1.** Predict the values for both training and test sets by calling the `predict()` function on the Logistic Regression object:

In [None]:
# Make predictions on the train and test datasets by using the 'predict()' function.
y_train_pred = lg_df.predict(norm_X_train)
y_test_pred = lg_df.predict(norm_X_test)

**2.** Display the confusion matrix:

In [None]:
# Display the results of confusion_matrix
print('Train set\n', confusion_matrix(y_train, y_train_pred))
print('\nTest set\n', confusion_matrix(y_test, y_test_pred))

Train set
 [[14753  1134]
 [ 2077  3149]]

Test set
 [[6311  456]
 [ 896 1386]]


**Q:** What is the positive outcome out of both labels?

**A:** The positive outcome is label `1`, that is, an annual income of `>50K`.

**Q:** Write the count of True Positives and True Negatives?

**A:** True Positives: Train set - 3149, Test set - 1386
       True Negatives: Train set - 14753, Test set - 6311
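
In scikit-learn, `confusion_matrix()` places the true labels in rows and the predicted labels in columns, so for labels `0` and `1` the layout is `[[TN, FP], [FN, TP]]`. A small sketch with made-up labels (`y_true` and `y_pred` here are illustrative, not the project's predictions):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1]

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # → 2 1 1 3
```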


**3.** Print the classification report values to evaluate the accuracy of your model:

In [None]:
# Display the results of classification_report
print('\n Train set \n', classification_report(y_train, y_train_pred))
print('\n Test set \n', classification_report(y_test, y_test_pred))


 Train set 
               precision    recall  f1-score   support

           0       0.88      0.93      0.90     15887
           1       0.74      0.60      0.66      5226

    accuracy                           0.85     21113
   macro avg       0.81      0.77      0.78     21113
weighted avg       0.84      0.85      0.84     21113


 Test set 
               precision    recall  f1-score   support

           0       0.88      0.93      0.90      6767
           1       0.75      0.61      0.67      2282

    accuracy                           0.85      9049
   macro avg       0.81      0.77      0.79      9049
weighted avg       0.84      0.85      0.84      9049



**Q:** What are the f1-scores of both labels?

**A:**
In the train set, the f1-scores are `0: 0.90` and `1: 0.66`.

In the test set, the f1-scores are `0: 0.90` and `1: 0.67`.

**After this activity, a multi-variate logistic regression model is used to predict and evaluate using all the features.**

----

#### Activity 8: Features Selection Using RFE

Select the relevant features from all the features that contribute the most to classifying individuals in income-groups using RFE.

**Steps:**

**1.** Create an empty dictionary and store it in a variable.

**2.** Create a `for` loop that iterates through all the columns in the normalised training data-frame.
Inside the loop:


   - Create an object of the Logistic Regression class and store it in a variable.
   
   - Create an object of RFE class and store it in a variable. Inside the RFE class constructor, pass the object of logistic regression and the number of features to be selected by RFE as inputs.
   
   - Train the model using the `fit()` function of the `RFE` class to train a logistic regression model on the train set with `i` number of features where `i` goes from `1` to the number of columns in the training dataset.
   
   - Create a list to store the important features using the `support_` attribute.
   
   - Create a new data-frame containing only the features selected by RFE and store it in a variable.
   
   - Create another Logistic Regression object, store it in a variable and build a logistic regression model using the new training DataFrame created using the rfe features data-frame and the target series.
   
   - Predict the target values for the normalised test set (containing the feature(s) selected by RFE) by calling the `predict()` function on the recent model object.
   
   - Calculate the f1-scores using the `f1_score()` function of the `sklearn.metrics` module, which returns a NumPy array containing the f1-scores for both classes. Store the array in a variable called `f1_scores_array`.
   
    The syntax for the `f1_score()` function is given as:

      >**Syntax:** `f1_score(y_true, y_pred, average = None)`

        Where,
      
        **a.** `y_true`:  the actual labels

        **b.** `y_pred`: the predicted labels
  
        **c.** `average = None`: parameter returns the scores for each class.

  - Add the number of selected features and the corresponding features & f1-scores as key-value pairs in the dictionary.
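
The steps above can be sketched end to end on a small synthetic dataset. This is only an illustration under stated assumptions: the data comes from `make_classification`, `max_iter=1000` is an arbitrary choice, and names such as `X_toy`, `results`, and `k` are hypothetical. The actual project loop uses the normalised census DataFrames.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with 6 features, 3 of them informative.
X_toy, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          train_size=0.7, random_state=42)

results = {}
for k in range(1, X_tr.shape[1] + 1):
    # Let RFE keep the k features it ranks highest.
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k)
    rfe.fit(X_tr, y_tr)
    # Retrain a fresh model on the selected features only.
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, rfe.support_], y_tr)
    y_pred = model.predict(X_te[:, rfe.support_])
    # average=None returns one f1-score per class.
    results[k] = {'features': rfe.support_.nonzero()[0].tolist(),
                  'f1_score': f1_score(y_te, y_pred, average=None)}

print(results[2])
```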


**Note:**   
As the number of features is very high, this code is computationally heavy. It will take some time to fit the models on the training data and then make predictions on the test data. A GPU runtime can be enabled as described below, although scikit-learn's `LogisticRegression` and `RFE` run on the CPU, so the speed-up may be limited.

To turn on the **GPU** in Google Colab, follow the steps below:
1. Click on the **Edit** menu option on the top-left.
2. Click on the **Notebook settings** option from the menu. A pop-up will appear.
3. Click on the drop-down for selecting **Hardware accelerator**.
4. Select **GPU** from the drop-down options.
5. Click on **Save**.
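Because the loop described above fits a fresh `RFE` object for every feature count, one way to reduce the run time is to fit RFE once down to a single feature and read the nested subsets from its `ranking_` attribute. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the normalised census data, so the variable names and sizes are assumptions rather than the project's own values.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the normalised census features (illustration only)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# One RFE run down to a single feature records the full elimination order
rfe_once = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1, step=1)
rfe_once.fit(X, y)

# ranking_ == 1 marks the last surviving feature, 2 the one dropped just before it, ...
# so "ranking_ <= i" is a boolean mask for the i features kept the longest
subsets = {i: rfe_once.ranking_ <= i for i in range(1, X.shape[1] + 1)}

# Sanity check against a separate RFE fit for one subset size
rfe_three = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe_three.fit(X, y)
same_subset = bool((subsets[3] == rfe_three.support_).all())
print(same_subset)
```

With `step=1` the elimination order does not depend on where RFE stops, which is why a single fit can stand in for the repeated fits; the second logistic regression model still has to be trained and scored once per subset.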

In [None]:
# Create a dictionary containing the different combination of features selected by RFE and their corresponding f1-scores.
import warnings
warnings.filterwarnings('ignore')
# Import the libraries
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score
# Create the empty dictionary.
dict_rfe = {}
# Loop over every possible number of features (1 to the total number of columns).
for i in range(1, len(norm_X_train.columns) + 1):
  print(i)  # progress indicator
  # Create the Logistic Regression Model
  lg_clf2 = LogisticRegression()
  # Create the RFE model with 'i' number of features
  rfe = RFE(lg_clf2, n_features_to_select=i)
  # Train the rfe model on the normalised training data using 'fit()'
  rfe.fit(norm_X_train, y_train)
  # Create a list of important features chosen by RFE.
  rfe_features = list(norm_X_train.columns[rfe.support_])
  # Create the normalised training DataFrame with rfe features
  rfe_X_train = norm_X_train[rfe_features]
  # Create the logistic regression
  lg_clf3 = LogisticRegression()
  # Train the model on the normalised training DataFrame with rfe features using 'fit()'
  lg_clf3.fit(rfe_X_train, y_train)
  # Predict 'y' values only for the test set as generally, they are predicted quite accurately for the train set.
  y_test_predict = lg_clf3.predict(norm_X_test[rfe_features])
  # Calculate the f1-score
  f1_scores_array = f1_score(y_test, y_test_predict, average = None)
  # Add the name of features and f1-scores in the dictionary
  dict_rfe[i] = {'features' : list(rfe_features),
                 'f1_score' : f1_scores_array}

1
2
3
...
74


**6.** Print the dictionary with features and f1-scores.

In [None]:
# Print the dictionary
dict_rfe

{1: {'features': ['marital-status_ Married-civ-spouse'],
  'f1_score': array([0.85571573, 0.        ])},
 2: {'features': ['capital-gain', 'marital-status_ Married-civ-spouse'],
  'f1_score': array([0.86985043, 0.30483532])},
 3: {'features': ['capital-gain',
   'education_ Preschool',
   'marital-status_ Married-civ-spouse'],
  'f1_score': array([0.86985043, 0.30483532])},
 4: {'features': ['capital-gain',
   'education_ Preschool',
   'marital-status_ Married-AF-spouse',
   'marital-status_ Married-civ-spouse'],
  'f1_score': array([0.86985043, 0.30483532])},
 5: {'features': ['capital-gain',
   'education_ Preschool',
   'marital-status_ Married-AF-spouse',
   'marital-status_ Married-civ-spouse',
   'occupation_ Priv-house-serv'],
  'f1_score': array([0.86992457, 0.30494217])},
 6: {'features': ['capital-gain',
   'education_ Preschool',
   'marital-status_ Married-AF-spouse',
   'marital-status_ Married-civ-spouse',
   'occupation_ Priv-house-serv',
   'relationship_ Own-child'],


**7.** Convert the dictionary into a DataFrame by using the `from_dict()` function of the DataFrame class.

**Note:** Set `pd.options.display.max_colwidth` to `200`.

In [None]:
# Convert the dictionary to the DataFrame
pd.options.display.max_colwidth = 200
rfe_df = pd.DataFrame.from_dict(dict_rfe)
rfe_df

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,86,87,88,89,90,91,92,93,94,95
features,[marital-status_ Married-civ-spouse],"[capital-gain, marital-status_ Married-civ-spouse]","[capital-gain, education_ Preschool, marital-status_ Married-civ-spouse]","[capital-gain, education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse]","[capital-gain, education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Priv-house-serv]","[capital-gain, education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Priv-house-serv, relationship_ Own-child]","[capital-gain, education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Farming-fishing, occupation_ Priv-house-serv, relationship_ Own-child]","[capital-gain, education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Farming-fishing, occupation_ Other-service, occupation_ Priv-house-serv, rel...","[capital-gain, education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Farming-fishing, occupation_ Handlers-cleaners, occupation_ Other-service, o...","[capital-gain, education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Farming-fishing, occupation_ Handlers-cleaners, occupation_ Other-service, o...",...,"[age, education-years, sex, capital-gain, capital-loss, hours-per-week, workclass_ Local-gov, workclass_ Private, workclass_ Self-emp-inc, workclass_ Self-emp-not-inc, workclass_ State-gov, workcl...","[age, education-years, sex, capital-gain, capital-loss, hours-per-week, workclass_ Local-gov, workclass_ Private, workclass_ Self-emp-inc, workclass_ Self-emp-not-inc, workclass_ State-gov, workcl...","[age, education-years, sex, capital-gain, capital-loss, hours-per-week, workclass_ Local-gov, workclass_ Private, workclass_ Self-emp-inc, workclass_ Self-emp-not-inc, workclass_ State-gov, workcl...","[age, education-years, sex, capital-gain, capital-loss, hours-per-week, workclass_ Local-gov, workclass_ Private, workclass_ Self-emp-inc, workclass_ Self-emp-not-inc, workclass_ State-gov, workcl...","[age, education-years, sex, capital-gain, capital-loss, hours-per-week, workclass_ Local-gov, workclass_ Private, workclass_ Self-emp-inc, workclass_ Self-emp-not-inc, workclass_ State-gov, workcl...","[age, education-years, sex, capital-gain, capital-loss, hours-per-week, workclass_ Local-gov, workclass_ Private, workclass_ Self-emp-inc, workclass_ Self-emp-not-inc, workclass_ State-gov, workcl...","[age, education-years, sex, capital-gain, capital-loss, hours-per-week, workclass_ Local-gov, workclass_ Private, workclass_ Self-emp-inc, workclass_ Self-emp-not-inc, workclass_ State-gov, workcl...","[age, education-years, sex, capital-gain, capital-loss, hours-per-week, workclass_ Local-gov, workclass_ Private, workclass_ Self-emp-inc, workclass_ Self-emp-not-inc, workclass_ State-gov, workcl...","[age, education-years, sex, capital-gain, capital-loss, hours-per-week, workclass_ Local-gov, workclass_ Private, workclass_ Self-emp-inc, workclass_ Self-emp-not-inc, workclass_ State-gov, workcl...","[age, education-years, sex, capital-gain, capital-loss, hours-per-week, workclass_ Local-gov, workclass_ Private, workclass_ Self-emp-inc, workclass_ Self-emp-not-inc, workclass_ State-gov, workcl..."
f1_score,"[0.8557157309054122, 0.0]","[0.8698504329572291, 0.3048353188507358]","[0.8698504329572291, 0.3048353188507358]","[0.8698504329572291, 0.3048353188507358]","[0.8699245654312889, 0.30494216614090436]","[0.8707411540733933, 0.31273996509598606]","[0.8714416896235079, 0.3127629733520336]","[0.8717545239968529, 0.31271960646521435]","[0.8724937753898572, 0.31382228490832154]","[0.8725676472515232, 0.3139329805996473]",...,"[0.9032627361190613, 0.6720038816108684]","[0.9031057678545871, 0.6716779825412221]","[0.9032488907971947, 0.6721629485935985]","[0.9032488907971947, 0.6721629485935985]","[0.9032627361190613, 0.6720038816108684]","[0.9032488907971947, 0.6721629485935985]","[0.9031842576028623, 0.6718408925539656]","[0.9032488907971947, 0.6721629485935985]","[0.9031842576028623, 0.6718408925539656]","[0.9032488907971947, 0.6721629485935985]"
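A wide table like this is hard to scan, so one possible alternative is to build a tall summary with one row per feature count and one column per label. The sketch below uses a three-entry excerpt copied from the printed `dict_rfe` output (feature counts 1, 2 and 5); the names `dict_rfe_excerpt`, `f1_summary` and `best_n` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Three entries copied from the printed dict_rfe output (counts 1, 2 and 5)
dict_rfe_excerpt = {
    1: {'features': ['marital-status_ Married-civ-spouse'],
        'f1_score': np.array([0.85571573, 0.0])},
    2: {'features': ['capital-gain', 'marital-status_ Married-civ-spouse'],
        'f1_score': np.array([0.86985043, 0.30483532])},
    5: {'features': ['capital-gain', 'education_ Preschool',
                     'marital-status_ Married-AF-spouse',
                     'marital-status_ Married-civ-spouse',
                     'occupation_ Priv-house-serv'],
        'f1_score': np.array([0.86992457, 0.30494217])},
}

# One row per feature count, with the two f1-scores split into separate columns
rows = [
    {'n_features': k,
     'f1_label_0': v['f1_score'][0],
     'f1_label_1': v['f1_score'][1]}
    for k, v in dict_rfe_excerpt.items()
]
f1_summary = pd.DataFrame(rows).set_index('n_features')

# Change in the label-1 f1-score relative to the previous row
f1_summary['gain_label_1'] = f1_summary['f1_label_1'].diff()

# Feature count with the highest label-1 f1-score within this excerpt
best_n = f1_summary['f1_label_1'].idxmax()
print(f1_summary)
print(best_n)
```

Applied to the full dictionary, the same layout makes it easier to see where the gain per added feature levels off before answering the question below.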


**Q:** How many features are required for the best f1-scores and why?

**A:** We need 4 features for the best f1-scores. Beyond this point, the number of features increases but the f1-scores increase only marginally.

**After this activity, RFE is used to find the ideal features for logistic regression.**

---

#### Activity 9: Model Training and Prediction Using Ideal Features

**1.** Create the logistic regression model again using RFE with the ideal number of features and predict the target variable:


In [None]:
# Logistic Regression with the ideal number of features and predict the target.

# Create the Logistic Regression Model
lg_clf_3 = LogisticRegression()

# Create the RFE model. 'n_features_to_select' is not set here, so RFE keeps half of the features by default.
rfe = RFE(lg_clf_3)
# Train the rfe model on the normalised training data
rfe.fit(norm_X_train, y_train)
# Create a list of important features chosen by RFE.
rfe_features = norm_X_train.columns[rfe.support_]
print(rfe_features)
# Create the normalised training DataFrame with rfe features
final_X_train = norm_X_train[rfe_features]
# Create the logistic regression model again
lg_clf_4 = LogisticRegression()
# Train the model on the normalised training DataFrame with the rfe features and the training target
lg_clf_4.fit(final_X_train, y_train)
# Predict the target using the normalised test DataFrame with rfe features
y_test_predict = lg_clf_4.predict(norm_X_test[rfe_features])
# Calculate the final f1-score and print it
final_f1_scores_array = f1_score(y_test, y_test_predict, average = None)
print(final_f1_scores_array)

Index(['age', 'education-years', 'sex', 'capital-gain', 'hours-per-week',
       'workclass_ Self-emp-not-inc', 'workclass_ Without-pay',
       'education_ 1st-4th', 'education_ 5th-6th', 'education_ Preschool',
       'marital-status_ Married-AF-spouse',
       'marital-status_ Married-civ-spouse', 'marital-status_ Never-married',
       'occupation_ Exec-managerial', 'occupation_ Farming-fishing',
       'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct',
       'occupation_ Other-service', 'occupation_ Priv-house-serv',
       'occupation_ Prof-specialty', 'occupation_ Protective-serv',
       'occupation_ Tech-support', 'relationship_ Other-relative',
       'relationship_ Own-child', 'relationship_ Wife',
       'race_ Asian-Pac-Islander', 'race_ Black', 'race_ White',
       'native-country_ Canada', 'native-country_ Columbia',
       'native-country_ Cuba', 'native-country_ Dominican-Republic',
       'native-country_ Ecuador', 'native-country_ England',
       'n

**Hint:** Create the model using the same steps mentioned in the **Features Selection Activity.**  
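When `RFE` is created without `n_features_to_select`, it keeps half of the input features, which may differ from the count chosen in the feature selection activity. The sketch below contrasts the default with an explicit count on a synthetic dataset; the data, the count of `4` and the variable names are assumptions for illustration, not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the normalised census data (illustration only)
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Default: n_features_to_select=None keeps half of the 10 features
rfe_default = RFE(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# Explicit: keep exactly the chosen number of features (4 is an example value)
rfe_four = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=4).fit(X_train, y_train)

n_default = rfe_default.n_features_
n_four = rfe_four.n_features_
print(n_default, n_four)

# RFE keeps the estimator fitted on the selected columns (estimator_),
# so predict() can be called on the RFE object itself
f1_four = f1_score(y_test, rfe_four.predict(X_test), average=None)
print(f1_four)
```

Passing the chosen count explicitly keeps the final model consistent with the answer given in the feature selection activity.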

**Q:** What is the final f1-score?

**A:** Label 0: 0.89952221

Label 1: 0.65423313


---

**Write your interpretation of the results here.**

- Interpretation 1:
- Interpretation 2:
- Interpretation 3:

**After this activity, a Logistic Regression model should be ready with the ideal number of features to accurately predict the income group of an individual, that is, whether the annual income is `less than or equal to 50K (label 0)` or `more than 50K (label 1)`, based on the selected features.**

---

### Submitting the Project

Follow the steps described below to submit the project.

1. After finishing the project, click on the **Share** button on the top right corner of the notebook. A new dialog box will appear.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/2_share_button.png' width=500>

2. In the dialog box, click on the **Copy link** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/3_copy_link.png' width=500>


3. The link to the duplicate copy of the notebook (named **YYYY-MM-DD_StudentName_CapstoneProject17**) will be copied.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/4_copy_link_confirmation.png' width=500>

4. Go to your dashboard and click on the **My Projects** option.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/5_student_dashboard.png' width=800>

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/6_my_projects.png' width=800>

5. Click on the **View Project** button for the project you want to submit.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/7_view_project.png' width=800>

6. Click on the **Submit Project Here** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/8_submit_project.png' width=800>

7. Paste the link to the project file named as **YYYY-MM-DD_StudentName_CapstoneProject17** in the URL box and then click on the **Submit** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/9_enter_project_url.png' width=800>


---