# Capstone Project 18: Census Income Analysis



---

## Instructions

### Goal of the Project:

From class 67 to class 79, you learned the following concepts: 

 - Feature Encoding.
 - Recursive Feature Elimination (RFE).
 - Logistic Regression classification using `sklearn` module.

In this project, you will apply what you have learned in classes 67 to 79 to achieve the following goals.

|||
|-|-|
|**Main Goal**|Create a Logistic Regression classification model with the ideal number of features selected using RFE.|



---

### Context

According to the government, census income is the income received by an individual regularly before payments for personal income taxes, medicare deductions, and so on. This information is collected annually from people and recorded in the census. It helps to identify the families eligible for various funds and programs rolled out by communities and the government. 

 

---

#### Getting Started

Follow the steps described below to solve the project:

1. Click on the link provided below to open the Colab file for this project.
   
   https://colab.research.google.com/drive/1SJYyeTu9sl43k_kiozYNCndVely1e7Lr

2. Create a duplicate copy of the Colab file. Here are the steps to create the duplicate copy:

    - Click on the **File** menu. A new drop-down list will appear.

      <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/0_file_menu.png' width=500>

    - Click on the **Save a copy in Drive** option. A duplicate copy will get created. It will open up in the new tab on your web browser.

      <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/1_create_colab_duplicate_copy.png' width=500>

    - After creating the duplicate copy of the notebook, please rename it in the **YYYY-MM-DD_StudentName_CapstoneProject18** format. 

3. Now, write your code in the prescribed code cells.

---

### Problem Statement

The dataset is extracted from the 1994 Census Bureau database. It contains anonymous records of individuals with features like work experience, age, gender, country, and so on. The records are divided into two labels, people having a salary of **more than 50K** and people having a salary of **less than or equal to 50K**, so that the eligibility of individuals for government programs can be determined.

As a data scientist, your job is to build a prediction model to predict whether a particular individual has an annual income of **<=50K** or **>50K**.

**Things To Do:**

1. Importing and Analysing the Dataset

2. Data Cleaning

3. Feature Engineering

4. Train-Test Split

5. Data Standardisation

6. Logistic Regression - Model Training

7. Model Prediction and Evaluation

8. Features Selection Using RFE

9. Model Training and Prediction Using Ideal Features


----

### Dataset Description

The dataset includes 32561 instances with 14 features and 1 target column, which are described below:

|Field|Description|
|---:|:---|
|age|age of the person, Integer|
|workclass| employment information about the individual, Categorical|
|fnlwgt| unknown weights, Integer|
|education| highest level of education obtained, Categorical|
|education-years|number of years of education, Integer|
|marital-status| marital status of the person, Categorical|
|occupation|job title, Categorical|
|relationship| the individual's relation in the family, like wife, husband, and so on, Categorical|
|race|Categorical|
|sex| gender, Male or Female|
|capital-gain| gain from sources other than salary/wages, Integer|
|capital-loss| loss from sources other than salary/wages, Integer|
|hours-per-week| hours worked per week, Integer|
|native-country| name of the native country, Categorical|
|income-group| annual income, Categorical,  **<=50K** or **>50K** |


**Notes:**
1. The dataset has no header row for the column names. The column names can be added manually.
2. There are invalid values in the dataset marked as **"?"**.
3. As the information about **fnlwgt** is non-existent, it can be removed before model training.
4. Take note of the **whitespaces (" ")** throughout the dataset. Most text values carry a leading space, for example `" ?"` and `" Male"`.



**Dataset Credits:** https://archive.ics.uci.edu/ml/datasets/adult 


**Dataset Creator:** 
```
Dua, D., & Graff, C. (2017). UCI Machine Learning Repository.
```


---


#### Activity 1:  Importing and Analysing the Dataset

In this activity, we have to load the dataset and analyse it.


**Perform the following tasks:**
- Load the dataset into a DataFrame.
- Rename the columns with the given list.
- Verify the number of rows and columns.
- Print the information of the DataFrame.


**1.** Start with importing all the required modules:



In [None]:
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


**2.** Create a Pandas DataFrame for the **Adult Income** dataset using the below link with `header=None`. 
> **Dataset Link:** 
https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/adult.csv

**3.** Print the first five rows of the dataset: 

In [None]:
# Load the Adult Income dataset into DataFrame.
adult_df = pd.read_csv("https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/adult.csv", header = None)
adult_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**Hint:** In the `read_csv()` function, the `header=None` parameter tells Pandas that the file has no header row. The first row of the file is then treated as data rather than as column names, and the columns are labelled automatically with integers from `0` to `n - 1`, where `n` is the number of columns.
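To illustrate the hint, here is a minimal sketch that reads a small in-memory sample (two shortened rows taken from the output above, not the full dataset) with `header=None`. It also shows the `names` parameter of `read_csv()` as an alternative way of assigning column names while loading. The variable names `sample_csv`, `sample_df` and `named_df` are made up for this example.

```python
import io
import pandas as pd

# Two shortened sample rows in the same layout as the dataset (no header row).
sample_csv = io.StringIO(
    "39, State-gov, 77516, Bachelors\n"
    "50, Self-emp-not-inc, 83311, Bachelors\n"
)

# header=None: the first line is treated as data and the columns are numbered from 0.
sample_df = pd.read_csv(sample_csv, header=None)
print(sample_df.columns.tolist())

# Alternative: supply the column names directly while loading.
sample_csv.seek(0)
named_df = pd.read_csv(sample_csv, header=None,
                       names=['age', 'workclass', 'fnlwgt', 'education'])
print(named_df.columns.tolist())

# The text values keep their leading space unless it is stripped explicitly.
print(repr(named_df['workclass'][0]))
```

Note that the text values keep the leading space found in the file, which matters later when matching values such as `" ?"`.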

----


**4.** Rename the columns by applying the `rename()` function using the following column list:

```python
column_name = ['age', 'workclass', 'fnlwgt', 'education', 'education-years', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income-group']
```



In [None]:
# Rename the column names in the DataFrame using the list given above. 

# Create the list
column_name =['age', 'workclass', 'fnlwgt', 'education', 'education-years', 'marital-status', 'occupation', 'relationship', 'race','sex','capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income-group']
# Rename the columns using 'rename()'
col_dict = {}
for i in range(0, 15):
  col_dict[i] = column_name[i]
adult_df.rename(columns=col_dict, inplace=True)
# Print the first five rows of the DataFrame
adult_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-years,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-group
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K



**Hint:**

Syntax for `rename()` function:

`DataFrame.rename(columns={old_column_name:new_column_name})`
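As a small sketch of the syntax above, the dictionary of old and new names can also be built with `dict(enumerate(...))` instead of an explicit loop. The three-column DataFrame `demo_df` and the list `demo_names` below are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical three-column frame with default integer column labels.
demo_df = pd.DataFrame([[39, ' State-gov', 77516]])
demo_names = ['age', 'workclass', 'fnlwgt']

# Map each old label (0, 1, 2) to its new name.
col_map = dict(enumerate(demo_names))

# rename() returns a new DataFrame unless inplace=True is passed.
renamed_df = demo_df.rename(columns=col_map)
print(renamed_df.columns.tolist())
```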



**5.** Verify the number of rows and columns in the DataFrame:

In [None]:
# Print the number of rows and columns of the DataFrame
adult_df.shape

(32561, 15)

**6.** Get the information of the DataFrame:

In [None]:
# Get the information of the DataFrame
adult_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1   workclass        32561 non-null  object
 2   fnlwgt           32561 non-null  int64 
 3   education        32561 non-null  object
 4   education-years  32561 non-null  int64 
 5   marital-status   32561 non-null  object
 6   occupation       32561 non-null  object
 7   relationship     32561 non-null  object
 8   race             32561 non-null  object
 9   sex              32561 non-null  object
 10  capital-gain     32561 non-null  int64 
 11  capital-loss     32561 non-null  int64 
 12  hours-per-week   32561 non-null  int64 
 13  native-country   32561 non-null  object
 14  income-group     32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


**Q:** Which is the target column?

**A:** `income-group` is the target column.

**7.** Print the labels in the target column and their distribution as well:

In [None]:
# Check the distribution of the labels in the target column.
adult_df['income-group'].value_counts()


 <=50K    24720
 >50K      7841
Name: income-group, dtype: int64

**Q:** Which target label has more records?

**A:** Label `<=50K` has more records.

**After performing this activity, you must obtain the DataFrame with renamed columns and the target column identified.**
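To see how imbalanced the two labels are, the counts can also be expressed as proportions. The sketch below rebuilds a label column from the counts printed above (it does not load the dataset) and uses the `normalize=True` parameter of `value_counts()`.

```python
import pandas as pd

# Rebuild a label column with the counts shown in the output above.
labels = pd.Series([' <=50K'] * 24720 + [' >50K'] * 7841, name='income-group')

# normalize=True returns proportions instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))
```

Roughly three out of four records belong to the `<=50K` label, which is worth keeping in mind when reading accuracy scores later.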

---


#### Activity 2: Data Cleaning


In this activity, we need to clean the DataFrame step by step. 

**Perform the following tasks:**
- Check for the null or missing values in the DataFrame.
- Observe the categories in column `native-country`, `workclass`, and `occupation`.
- Replace the invalid `" ?"` values in the columns with `np.nan` using `replace()` function.
- Drop the rows having `nan` values using the `dropna()` function.



**1.** Verify the missing values in the DataFrame:

In [None]:
# Check for null values in the DataFrame.
adult_df.isnull().sum()

age                0
workclass          0
fnlwgt             0
education          0
education-years    0
marital-status     0
occupation         0
relationship       0
race               0
sex                0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income-group       0
dtype: int64

**Q:** Are there any missing/null values that can be observed in the DataFrame?

**A:** No.



**2.**  Observe the unique categories in columns `native-country`, `workclass`, and `occupation` to find the invalid values:

In [None]:
# Print the distribution of the columns mentioned to find the invalid values.
# Print the categories in column 'native-country'
print( "For Native country",'\n',adult_df['native-country'].value_counts(),'\n')
# Print the categories in column 'workclass'
print("For work class",'\n',adult_df['workclass'].value_counts(),'\n')
# Print the categories in column 'occupation'
print("For occupation",'\n',adult_df['occupation'].value_counts())


For Native country 
  United-States                 29170
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru            

**Q:** Is there any invalid value or category in any of the three columns?

**A:** Yes, the invalid value `" ?"` is present in these columns.

---

**3.** Replace the invalid values with `np.nan` and verify the number of null values in the DataFrame again.

In [None]:
# Replace the invalid values ' ?' with 'np.nan' using the 'replace()' function.
invalid_cols = ['native-country', 'workclass', 'occupation']
adult_df[invalid_cols] = adult_df[invalid_cols].replace(' ?', np.nan)
# Check for null values in the DataFrame again.
adult_df.isnull().sum()


age                   0
workclass          1836
fnlwgt                0
education             0
education-years       0
marital-status        0
occupation         1843
relationship          0
race                  0
sex                   0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      583
income-group          0
dtype: int64

**Q:** Are there any missing/null values that can be observed in the DataFrame?

**A:** Yes

---



**4.** Delete the rows having invalid values and drop the column `fnlwgt`. Print the number of rows of the DataFrame after dropping invalid values:

In [None]:
# Delete the rows with invalid values and the column not required 

# Delete the rows with the 'dropna()' function
adult_df.dropna(inplace=True)
# Delete the column with the 'drop()' function
adult_df.drop(columns='fnlwgt', inplace=True)

# Print the number of rows and columns in the DataFrame.
adult_df.shape

(30162, 14)

**After this activity, the DataFrame should neither have any null or invalid values nor the `fnlwgt` column.**
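The cleaning steps in this activity can be sketched on a tiny, made-up DataFrame. This variant also strips the surrounding whitespace first, which is an optional extra step and not part of the tasks above. If the whitespace is stripped this way, later lookups must use `'?'` or `'Male'` rather than `' ?'` or `' Male'`.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame mimicking the raw data: leading spaces and ' ?' markers.
raw_df = pd.DataFrame({
    'workclass': [' Private', ' ?', ' State-gov'],
    'occupation': [' Sales', ' ?', ' Adm-clerical'],
    'age': [25, 40, 39],
})

# Strip the surrounding whitespace from the text columns.
text_cols = ['workclass', 'occupation']
clean_df = raw_df.copy()
clean_df[text_cols] = clean_df[text_cols].apply(lambda col: col.str.strip())

# Replace the '?' marker with NaN, then drop the affected rows.
clean_df = clean_df.replace('?', np.nan).dropna()
print(clean_df)
```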

----


#### Activity 3: Feature Engineering

The dataset contains certain features that are categorical. To convert these features into numerical ones, use the `map()` and `get_dummies()` functions.


**Perform the following tasks for feature engineering:**

- Create a list of numerical columns.

- Map the values of the column `sex` to: 
  - **`Male: 0`**
  - **`Female: 1`**

- Map the values of the column `income-group` to: 
  - **` <=50K: 0`**
  - **` >50K: 1`**

- Create a list of categorical columns.

- Perform **one-hot encoding** to obtain numeric values for the rest of the categorical columns.

--- 

**1.** Separate the numeric columns first. To do so, create a list of the numeric column names using the `select_dtypes()` function:


In [None]:
# Create a list of numeric columns names using 'select_dtypes()'.
num_lst = list(adult_df.select_dtypes(include=['int64', 'float64']).columns)
num_lst

['age', 'education-years', 'capital-gain', 'capital-loss', 'hours-per-week']

**2.** Map the labels of the column `sex` to convert it into a numerical attribute using the `map()` function:
  - **`Male`** to **`0`**
  - **`Female`** to **`1`**

In [None]:
# Map the 'sex' column and verify the distribution of labels.

# Print the distribution before mapping
print(adult_df['sex'].value_counts())

# Map the values of the column to convert the categorical values to integer
adult_df['sex'] =  adult_df['sex'].map({" Male": 0, " Female":1 })
# Print the distribution after mapping
print(adult_df['sex'].value_counts())


 Male      20380
 Female     9782
Name: sex, dtype: int64
0    20380
1     9782
Name: sex, dtype: int64
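One pitfall with `map()` is that any value missing from the dictionary becomes `NaN`. Because the labels in this dataset start with a space, keys written without that space match nothing. The short sketch below uses a made-up three-value Series to show the difference.

```python
import pandas as pd

sex_sample = pd.Series([' Male', ' Female', ' Male'])

# Keys without the leading space match nothing, so every value becomes NaN.
wrong = sex_sample.map({'Male': 0, 'Female': 1})
print(wrong.isnull().sum())

# Keys that include the leading space map as intended.
right = sex_sample.map({' Male': 0, ' Female': 1})
print(right.tolist())
```

Printing `value_counts()` before and after mapping, as done in the cell above, is a quick way to catch this.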


**3.** Map the labels of the column `income-group` to convert it into a numerical attribute from categorical one using `map()` function:
  - **` <=50K`** to **`0`**
  - **` >50K`** to **`1`**
  

In [None]:
# Map the 'income-group' column and verify the distribution of labels.
# Print the "distribution" before mapping
print(adult_df['income-group'].value_counts())
# Map the values of the column to convert the categorical values to integer
adult_df['income-group'] = adult_df['income-group'].map({' <=50K' : 0, ' >50K' : 1})
# Print the distribution after mapping
print(adult_df['income-group'].value_counts())
adult_df.head()

 <=50K    22654
 >50K      7508
Name: income-group, dtype: int64
0    22654
1     7508
Name: income-group, dtype: int64


Unnamed: 0,age,workclass,education,education-years,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-group
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,0,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,1,0,0,40,Cuba,0


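**4.** Create a list of the column names that are now numeric using the `select_dtypes()` function. After the mapping above, this list also includes `sex` and `income-group`. The remaining `object` columns are the categorical columns that still need to be encoded:

In [None]:
# Create the list of numeric column names using 'select_dtypes()'.
# Note: despite its name, 'cat_list' holds the numeric columns; the categorical columns are all the others.
cat_list = adult_df.select_dtypes(include=['float64', 'int64']).columns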
cat_list

Index(['age', 'education-years', 'sex', 'capital-gain', 'capital-loss',
       'hours-per-week', 'income-group'],
      dtype='object')

**5.** Perform **one-hot encoding** on the non-numeric (`object`) categorical columns of the DataFrame and save the result in a **dummy DataFrame**. Also use the parameter `drop_first=True` in the `get_dummies()` function.

***Recall:***
*This process of obtaining numeric values from non-numeric categorical values is called **one-hot encoding**. In this process, a new column is added for each category of a feature. Each new column holds a binary value: `1` if the row belongs to that category and `0` otherwise. The `get_dummies()` function can be used to apply **one-hot encoding** to the non-numeric categorical feature columns*.

In [None]:
# Create a 'income_dummies_df' DataFrame using the 'get_dummies()' function on the non-numeric categorical columns
income_dummies_df = pd.get_dummies(data = adult_df.select_dtypes(exclude=['float64', 'int64']), drop_first = True)
income_dummies_df

Unnamed: 0,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 11th,education_ 12th,education_ 1st-4th,education_ 5th-6th,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32557,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32558,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32559,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
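The effect of `drop_first=True` is easier to see on a small example. The sketch below encodes a made-up `race` column with three categories, once with and once without the parameter.

```python
import pandas as pd

# Hypothetical column with three categories (values keep the leading space).
demo = pd.DataFrame({'race': [' White', ' Black', ' White', ' Other']})

# Without drop_first: one column per category.
full = pd.get_dummies(demo)

# With drop_first=True: the first category in sorted order is dropped.
reduced = pd.get_dummies(demo, drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

The dropped category is still recoverable: a row with `0` in every remaining dummy column belongs to it. This is why the dummy DataFrame above has one column fewer per feature than the number of categories.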


**6.** Drop the non-numeric categorical columns from the original `adult_df` DataFrame using the `drop()` function:

In [None]:
# Drop the non-numeric categorical columns from the DataFrame 'adult_df'
adult_df = adult_df.drop(columns=adult_df.select_dtypes(exclude=['float64', 'int64']).columns)
adult_df

Unnamed: 0,age,education-years,sex,capital-gain,capital-loss,hours-per-week,income-group
0,39,13,0,2174,0,40,0
1,50,13,0,0,0,13,0
2,38,9,0,0,0,40,0
3,53,7,0,0,0,40,0
4,28,13,1,0,0,40,0
...,...,...,...,...,...,...,...
32556,27,12,1,0,0,38,0
32557,40,9,0,0,0,40,1
32558,58,9,1,0,0,40,0
32559,22,9,0,0,0,20,0


**7.** Concatenate the income DataFrame and the dummy DataFrame to create the final DataFrame for the model. 

Display the final DataFrame:

In [None]:
# Concat the income DataFrame and dummy DataFrame using 'concat()' function
income_df = pd.concat([ income_dummies_df,adult_df], axis = 1)
income_df

Unnamed: 0,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 11th,education_ 12th,education_ 1st-4th,education_ 5th-6th,...,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,age,education-years,sex,capital-gain,capital-loss,hours-per-week,income-group
0,0,0,0,0,1,0,0,0,0,0,...,1,0,0,39,13,0,2174,0,40,0
1,0,0,0,1,0,0,0,0,0,0,...,1,0,0,50,13,0,0,0,13,0
2,0,1,0,0,0,0,0,0,0,0,...,1,0,0,38,9,0,0,0,40,0
3,0,1,0,0,0,0,1,0,0,0,...,1,0,0,53,7,0,0,0,40,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,28,13,1,0,0,40,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0,1,0,0,0,0,0,0,0,0,...,1,0,0,27,12,1,0,0,38,0
32557,0,1,0,0,0,0,0,0,0,0,...,1,0,0,40,9,0,0,0,40,1
32558,0,1,0,0,0,0,0,0,0,0,...,1,0,0,58,9,1,0,0,40,0
32559,0,1,0,0,0,0,0,0,0,0,...,1,0,0,22,9,0,0,0,20,0


**8.** Get the information of the DataFrame to verify the final columns and their data types:

In [None]:
# Get the information of the DataFrame
income_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 96 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   workclass_ Local-gov                        30162 non-null  uint8
 1   workclass_ Private                          30162 non-null  uint8
 2   workclass_ Self-emp-inc                     30162 non-null  uint8
 3   workclass_ Self-emp-not-inc                 30162 non-null  uint8
 4   workclass_ State-gov                        30162 non-null  uint8
 5   workclass_ Without-pay                      30162 non-null  uint8
 6   education_ 11th                             30162 non-null  uint8
 7   education_ 12th                             30162 non-null  uint8
 8   education_ 1st-4th                          30162 non-null  uint8
 9   education_ 5th-6th                          30162 non-null  uint8
 10  education_ 7th-8th                

**Q:** How many columns are present in the final DataFrame? 

**A:** 96 columns

**Q:** What is the data type of the columns in the final DataFrame? 

**A:** `int64` and `uint8`


**After this activity, the DataFrame should not have any non-numeric columns.**

---

#### Activity 4: Train-Test Split
 
We need to predict the value of the `income-group` variable, using other variables. Thus, `income-group` is the target or dependent variable and other columns except `income-group` are the features or the independent variables.
 
**1.** Split the dataset into the training set and test set such that the training set contains 70% of the instances and the remaining instances will become the test set.

**2.** Set `random_state = 42`:

In [None]:
# Split the training and testing data

# Import the module
from sklearn.model_selection import train_test_split
X = income_df.iloc[:, :-1]
y = income_df['income-group']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)
X_train

Unnamed: 0,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 11th,education_ 12th,education_ 1st-4th,education_ 5th-6th,...,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,age,education-years,sex,capital-gain,capital-loss,hours-per-week
29253,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,35,9,0,0,0,40
14267,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,23,9,0,0,0,40
26021,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,39,13,1,0,0,60
24278,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,33,9,1,0,0,32
4225,0,0,1,0,0,0,0,0,0,0,...,0,1,0,0,27,9,0,0,0,38
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32171,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,40,9,0,0,0,35
5875,0,0,0,1,0,0,0,0,0,0,...,0,1,0,0,41,10,0,3103,0,40
935,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,37,9,0,0,0,99
17056,0,0,0,1,0,0,1,0,0,0,...,0,1,0,0,56,7,1,0,0,40


**After this activity, the feature and target data should be split into training and test sets.**
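Because the two income labels are imbalanced, it can help to keep their proportions the same in both sets. This is not required by the task above. The sketch below is an optional variation on made-up data that uses the `stratify` parameter of `train_test_split()`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 75 rows of class 0 and 25 rows of class 1.
X_demo = pd.DataFrame({'feature': np.arange(100)})
y_demo = pd.Series([0] * 75 + [1] * 25)

# stratify=y_demo keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```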

---

#### Activity 5: Data Standardisation

To avoid the `ConvergenceWarning` message, scale the data using one of the normalisation methods, for instance, standard normalisation.

**1.** Create a function `standard_scalar()` to normalise the numeric columns of `X_train` and `X_test` data-frames using the standard normalisation method:


In [None]:
# Normalise the train and test data-frames using the standard normalisation method.

# Define the 'standard_scalar()' function for calculating Z-scores
def standard_scalar(series):
  z = (series - series.mean()) / series.std() 
  return z

# Create the DataFrames 'nom_x_train' and 'nom_x_test' as copies so that 'X_train' and 'X_test' stay unchanged
nom_x_train = X_train.copy()
nom_x_test = X_test.copy()

# Apply the 'standard_scalar()' on X_train on numeric columns using apply() function and get the descriptive statistics of the normalised X_train
nom_x_train[num_lst] = X_train[num_lst].apply(standard_scalar, axis = 0)
print(nom_x_train.describe())
# Apply the 'standard_scalar()' on X_test on numeric columns using apply() function and get the descriptive statistics of the normalised X_test
nom_x_test[num_lst] = X_test[num_lst].apply(standard_scalar, axis = 0)
print(nom_x_test.describe())

       workclass_ Local-gov  workclass_ Private  workclass_ Self-emp-inc  \
count          21113.000000        21113.000000              21113.00000   
mean               0.069625            0.738550                  0.03486   
std                0.254521            0.439435                  0.18343   
min                0.000000            0.000000                  0.00000   
25%                0.000000            0.000000                  0.00000   
50%                0.000000            1.000000                  0.00000   
75%                0.000000            1.000000                  0.00000   
max                1.000000            1.000000                  1.00000   

       workclass_ Self-emp-not-inc  workclass_ State-gov  \
count                 21113.000000          21113.000000   
mean                      0.083361              0.042817   
std                       0.276434              0.202450   
min                       0.000000              0.000000   
25%            

**After this activity, training and testing feature data should be normalised using Data Standardisation.**
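A common alternative to a hand-written function, shown here only as a sketch on made-up numbers, is the `StandardScaler` class from `sklearn.preprocessing`. It learns the mean and standard deviation from the training data and reuses them for the test data. It divides by the population standard deviation, whereas the Pandas `std()` function uses the sample standard deviation by default, so the results differ slightly from `standard_scalar()`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric training and test columns.
train_demo = pd.DataFrame({'age': [20.0, 30.0, 40.0, 50.0]})
test_demo = pd.DataFrame({'age': [35.0, 60.0]})

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only.
train_scaled = scaler.fit_transform(train_demo)
# Reuse the training statistics to scale the test data.
test_scaled = scaler.transform(test_demo)

print(scaler.mean_)
print(test_scaled.ravel())
```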

---

#### Activity 6: Logistic Regression - Model Training 

Implement Logistic Regression Classification using `sklearn` module in the following way:

1. Deploy the model by importing the `LogisticRegression` class and create an object of this class.
2. Call the `fit()` function on the Logistic Regression object and print the score using the `score()` function.


In [None]:
# Deploy the 'LogisticRegression' model using the 'fit()' function.
from sklearn.linear_model import LogisticRegression as LR
lr_model = LR()
lr_model.fit(nom_x_train, y_train)
lr_model.score(nom_x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8479136077298347

**After this activity, a multi-variate logistic regression should be trained with all features.**
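If the iteration-limit message still appears after scaling, one option suggested by the message itself is to raise `max_iter`, which defaults to `100` in `LogisticRegression`. The sketch below uses synthetic data from `make_classification()` rather than the census data, and the value `1000` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the census dataset.
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=42)

# A larger max_iter gives the solver more iterations to converge.
lr_demo = LogisticRegression(max_iter=1000)
lr_demo.fit(X_syn, y_syn)

# n_iter_ reports how many iterations the solver actually used.
print(lr_demo.n_iter_)
print(lr_demo.score(X_syn, y_syn))
```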

---

#### Activity 7: Model Prediction and Evaluation

**1.** Predict the values for the test set by calling the `predict()` function on the Logistic Regression object: 

In [None]:
# Make predictions on the test dataset by using the 'predict()' function.
y_test_pred = lr_model.predict(nom_x_test)

**2.** Display the confusion matrix:

In [None]:
# Display the results of confusion_matrix
from sklearn.metrics  import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_test_pred))

[[6311  456]
 [ 896 1386]]


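**Q:** What is the positive outcome out of both labels?

**A:** Label `1` (income `>50K`) is the positive outcome.

**Q:** Write the count of True Positives and True Negatives?

**A:** True Positives = 1386 and True Negatives = 6311.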


**3.** Print the classification report values to evaluate the accuracy of your model:

In [None]:
# Display the results of classification_report
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.88      0.93      0.90      6767
           1       0.75      0.61      0.67      2282

    accuracy                           0.85      9049
   macro avg       0.81      0.77      0.79      9049
weighted avg       0.84      0.85      0.84      9049



**Q:** Write the f1-score of both labels?

**A:** The f1-score is 0.90 for label `0` and 0.67 for label `1`.

**After this activity, a multi-variate logistic regression model is used to predict and evaluate using all the features.**
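The values in the classification report can be traced back to the confusion matrix. The sketch below copies the matrix printed above into a NumPy array and recomputes precision, recall, f1-score for label `1`, and the overall accuracy by hand. Rounded to two decimals, the results agree with the report.

```python
import numpy as np

# Confusion matrix printed above: rows are actual labels, columns are predicted labels.
cm = np.array([[6311, 456],
               [896, 1386]])

# For binary labels 0 and 1, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # share of predicted 1s that are correct
recall_1 = tp / (tp + fn)      # share of actual 1s that are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```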

----

#### Activity 8: Features Selection Using RFE

Select the relevant features from all the features that contribute the most to classifying individuals in income-groups using RFE.

**Steps:**

**1.** Create an empty dictionary and store it in a variable.

**2.** Create a `for` loop that iterates through all the columns in the normalised training data-frame. 
Inside the loop: 


   - Create an object of the Logistic Regression class and store it in a variable.
   
   - Create an object of RFE class and store it in a variable. Inside the RFE class constructor, pass the object of logistic regression and the number of features to be selected by RFE as inputs. 
   
   - Train the model using the `fit()` function of the `RFE` class to train a logistic regression model on the train set with `i` number of features where `i` goes from `1` to the number of columns in the training dataset. 
   
   - Create a list to store the important features using the `support_` attribute.
   
   - Create a new data-frame having the features selected by RFE and store it in a variable.
   
   - Create another Logistic Regression object, store it in a variable and build a logistic regression model using the new training DataFrame created using the rfe features data-frame and the target series.
   
   - Predict the target values for the normalised test set (containing the feature(s) selected by RFE) by calling the `predict()` function on the recent model object.
   
   - Calculate f1-scores using the `f1_score()` function of the `sklearn.metrics` module which, with `average = None`, returns a NumPy array containing f1-scores for both the classes. Store the array in a variable called `f1_scores_array`. 
   
    The syntax for the `f1_score()` function is given as:

      >**Syntax:** `f1_score(y_true, y_pred, average = None)`

        Where,
      
        **a.** `y_true`:  the actual labels

        **b.** `y_pred`: the predicted labels 
  
        **c.** `average = None`: parameter returns the scores for each class.

  - Add the number of selected features and the corresponding features & f1-scores as key-value pairs in the dictionary.
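The steps above can be sketched for a single value of `i` as follows. This is a minimal illustration on synthetic data from `make_classification()`, not the census data. The choice of 8 features, 3 selected features and `max_iter=1000` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 8 features; not the census dataset.
X_syn, y_syn = make_classification(n_samples=400, n_features=8,
                                   n_informative=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3,
                                          random_state=42)

# Let RFE keep the 3 highest-ranked features.
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(X_tr, y_tr)
print(rfe_demo.support_)   # boolean mask of the selected features

# Retrain on the selected features only and score each class separately.
lr_sel = LogisticRegression(max_iter=1000)
lr_sel.fit(X_tr[:, rfe_demo.support_], y_tr)
y_te_pred = lr_sel.predict(X_te[:, rfe_demo.support_])
f1_scores_array = f1_score(y_te, y_te_pred, average=None)
print(f1_scores_array)
```

In the project itself, the same steps run inside the `for` loop, with the mask applied to the DataFrame columns instead of a NumPy array.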


**Note:**   
As the number of features is very high, the code will be computationally heavy. It will take some time to learn the feature variables through the training data and then make predictions on the test data. The steps for switching the Colab runtime to a GPU are given below. Note, however, that scikit-learn estimators run on the CPU, so the speed-up may be limited.

To turn on the **GPU** in Google Colab, follow the steps below:
1. Click on the **Edit** menu option on the top-left.
2. Click on the **Notebook settings** option from the menu. A pop-up will appear.
3. Click on the drop-down for selecting **Hardware accelerator**.
4. Select **GPU** from the drop-down options.
5. Click on **Save**.

In [None]:
# Create a dictionary containing the different combination of features selected by RFE and their corresponding f1-scores.
import warnings
warnings.filterwarnings('ignore')
# Import the libraries
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score

# Create the empty dictionary.
rfe_dict ={}
# Create a 'for' loop.
for i in range(1, 11):
  # Create the Logistic Regression Model
  lr = LogisticRegression()

  # Create the RFE model with 'i' number of features
  rfe = RFE(lr, n_features_to_select=i)

  # Train the rfe model on the normalised training data using 'fit()'
  rfe.fit(nom_x_train, y_train)

  # Create a list of important features chosen by RFE.
  imp_feat = list(nom_x_train.columns[rfe.support_])
  print(imp_feat)
  # Create the normalised training DataFrame with rfe features
  rfe_x_train = nom_x_train[imp_feat]
  
  # Create the logistic regression 
  lr1 = LogisticRegression()
  
  # Train the model normalised training DataFrame with rfe features using 'fit()'
  lr1.fit(rfe_x_train, y_train)

  # Predict 'y' values only for the test set as generally, they are predicted quite accurately for the train set.
  y_test_pred = lr1.predict(nom_x_test[imp_feat])

  # Calculate the f1-score
  f1 = f1_score(y_test, y_test_pred, average=None)

  # Add the name of features and f1-scores in the dictionary
  rfe_dict[i] = {'Features':list(imp_feat), 'f1 score':f1}


['marital-status_ Married-civ-spouse']
['marital-status_ Married-civ-spouse', 'capital-gain']
['education_ Preschool', 'marital-status_ Married-civ-spouse', 'capital-gain']
['education_ Preschool', 'marital-status_ Married-AF-spouse', 'marital-status_ Married-civ-spouse', 'capital-gain']
['education_ Preschool', 'marital-status_ Married-AF-spouse', 'marital-status_ Married-civ-spouse', 'occupation_ Priv-house-serv', 'capital-gain']
['education_ Preschool', 'marital-status_ Married-AF-spouse', 'marital-status_ Married-civ-spouse', 'occupation_ Priv-house-serv', 'relationship_ Own-child', 'capital-gain']
['education_ Preschool', 'marital-status_ Married-AF-spouse', 'marital-status_ Married-civ-spouse', 'occupation_ Farming-fishing', 'occupation_ Priv-house-serv', 'relationship_ Own-child', 'capital-gain']
['education_ Preschool', 'marital-status_ Married-AF-spouse', 'marital-status_ Married-civ-spouse', 'occupation_ Farming-fishing', 'occupation_ Other-service', 'occupation_ Priv-house-s

**6.** Print the dictionary with features and f1-scores.

In [None]:
# Print the dictionary
rfe_dict

{1: {'Features': ['marital-status_ Married-civ-spouse'],
  'f1 score': array([0.85571573, 0.        ])},
 2: {'Features': ['marital-status_ Married-civ-spouse', 'capital-gain'],
  'f1 score': array([0.86985043, 0.30483532])},
 3: {'Features': ['education_ Preschool',
   'marital-status_ Married-civ-spouse',
   'capital-gain'],
  'f1 score': array([0.86985043, 0.30483532])},
 4: {'Features': ['education_ Preschool',
   'marital-status_ Married-AF-spouse',
   'marital-status_ Married-civ-spouse',
   'capital-gain'],
  'f1 score': array([0.86985043, 0.30483532])},
 5: {'Features': ['education_ Preschool',
   'marital-status_ Married-AF-spouse',
   'marital-status_ Married-civ-spouse',
   'occupation_ Priv-house-serv',
   'capital-gain'],
  'f1 score': array([0.86992457, 0.30494217])},
 6: {'Features': ['education_ Preschool',
   'marital-status_ Married-AF-spouse',
   'marital-status_ Married-civ-spouse',
   'occupation_ Priv-house-serv',
   'relationship_ Own-child',
   'capital-gain'],


**7.** Convert the dictionary into a DataFrame by using the `from_dict()` function of the `DataFrame` class.

**Note:** Set `pd.options.display.max_colwidth` to `200`.

In [None]:
# Convert the dictionary to the DataFrame
pd.options.display.max_colwidth = 200
rfe_df = pd.DataFrame.from_dict(rfe_dict, orient = 'index')
rfe_df

| |Features|f1 score|
|-|-|-|
|1|[marital-status_ Married-civ-spouse]|[0.8557157309054122, 0.0]|
|2|[marital-status_ Married-civ-spouse, capital-gain]|[0.8698504329572291, 0.3048353188507358]|
|3|[education_ Preschool, marital-status_ Married-civ-spouse, capital-gain]|[0.8698504329572291, 0.3048353188507358]|
|4|[education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, capital-gain]|[0.8698504329572291, 0.3048353188507358]|
|5|[education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Priv-house-serv, capital-gain]|[0.8699245654312889, 0.30494216614090436]|
|6|[education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Priv-house-serv, relationship_ Own-child, capital-gain]|[0.8707411540733933, 0.31273996509598606]|
|7|[education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Farming-fishing, occupation_ Priv-house-serv, relationship_ Own-child, capital-gain]|[0.8714416896235079, 0.3127629733520336]|
|8|[education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Farming-fishing, occupation_ Other-service, occupation_ Priv-house-serv, relationship_ Own...|[0.8717545239968529, 0.31271960646521435]|
|9|[education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Farming-fishing, occupation_ Handlers-cleaners, occupation_ Other-service, occupation_ Pri...|[0.8724937753898572, 0.31382228490832154]|
|10|[education_ Preschool, marital-status_ Married-AF-spouse, marital-status_ Married-civ-spouse, occupation_ Farming-fishing, occupation_ Handlers-cleaners, occupation_ Other-service, occupation_ Pri...|[0.8725676472515232, 0.3139329805996473]|


**Q:** How many features are required for the best f1-scores and why?

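**A:** 10. Among the feature counts tested (1 to 10), the f1-scores for both classes are highest with 10 features: approximately 0.8726 for label 0 and 0.3139 for label 1.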
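
The same answer can be found programmatically instead of by reading the table. The sketch below is one possible approach, not a required step: it copies the f1-scores shown in the output (rounded to eight decimals) into a hypothetical dictionary named `f1_by_n_features` and picks the feature count with the highest f1-score for label 1. In the notebook itself, the same idea would read the values from `rfe_dict[n]['f1 score']`.

```python
# f1-scores copied from the output above, as [label 0, label 1] per feature count.
# 'f1_by_n_features' and 'best_n' are illustrative names, not part of the project.
f1_by_n_features = {
    1: [0.85571573, 0.0],
    2: [0.86985043, 0.30483532],
    3: [0.86985043, 0.30483532],
    4: [0.86985043, 0.30483532],
    5: [0.86992457, 0.30494217],
    6: [0.87074115, 0.31273997],
    7: [0.87144169, 0.31276297],
    8: [0.87175452, 0.31271961],
    9: [0.87249378, 0.31382228],
    10: [0.87256765, 0.31393298],
}

# Pick the number of features whose f1-score for label 1 (index 1) is highest.
best_n = max(f1_by_n_features, key=lambda n: f1_by_n_features[n][1])
print(best_n, f1_by_n_features[best_n])
```

Ranking by the label 1 score is a choice made here because that class is the harder one to predict; ranking by the label 0 score gives the same result for these values.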

**After this activity, RFE should have been used to find the ideal features for logistic regression.**

---

#### Activity 9: Model Training and Prediction Using Ideal Features

**1.** Create the logistic regression model again using RFE with the ideal number of features and predict the target variable:


In [None]:
# Logistic Regression with the ideal number of features and predict the target.

# Create the Logistic Regression Model
log_reg = LogisticRegression()
# Create the RFE model with ideal number of features
rfe_model = RFE(log_reg, n_features_to_select=10)
# Train the rfe model on the normalised training data
rfe_model.fit(nom_x_train, y_train)
# Create a list of important features chosen by this RFE model (not the 'rfe' object left over from the loop).
rfe_features = list(nom_x_train.columns[rfe_model.support_])
# Create the normalised training DataFrame with rfe features
X_train_rfe = nom_x_train[rfe_features]
# Create the Logistic Regression Model again
lr_model = LogisticRegression()
# Train the model with the normalised training features DataFrame with best rfe features and target training DataFrame
lr_model.fit(X_train_rfe, y_train)
# Predict the target using the normalised test DataFrame with the same rfe features
y_test_pred_rfe = lr_model.predict(nom_x_test[rfe_features])
# Calculate the final f1-score and print it
score = f1_score(y_test, y_test_pred_rfe, average=None)
print(score)


[0.87256765 0.31393298]


**Hint:** Create the model using the same steps mentioned in the **Features Selection Activity.**  

**Q:** What is the final f1-score?

**A:** [0.87256765 0.31393298]
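
When interpreting these scores, it can help to recall how a per-class f1-score is built from true positives (TP), false positives (FP) and false negatives (FN): `f1 = 2 * TP / (2 * TP + FP + FN)`. The sketch below uses a hypothetical helper `f1_for_label()` and hand-made toy labels (not the census data) to show why a class gets an f1-score of `0.0` when none of its records is predicted correctly, as seen for label 1 with a single feature.

```python
def f1_for_label(y_true, y_pred, label):
    """Compute the f1-score of one class from its TP, FP and FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denominator = 2 * tp + fp + fn
    # Return 0.0 when the class never appears in either list.
    return 2 * tp / denominator if denominator else 0.0

# Toy example: a model that predicts label 0 for every record.
toy_true = [0, 0, 0, 1]
toy_pred = [0, 0, 0, 0]

f1_label_0 = f1_for_label(toy_true, toy_pred, 0)  # TP=3, FP=1, FN=0 -> 6/7
f1_label_1 = f1_for_label(toy_true, toy_pred, 1)  # TP=0, FP=0, FN=1 -> 0.0
print(f1_label_0, f1_label_1)
```

A high score for label 0 together with a low score for label 1 therefore suggests that the model identifies the majority class well but misses many records of the other class.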

---

**Write your interpretation of the results here.**

- Interpretation 1: 
- Interpretation 2: 
- Interpretation 3: 

**After this activity, a Logistic Regression model should be ready with the ideal number of features to accurately predict the income group of an individual, that is, to predict whether the individual has an annual income `less than or equal to 50K (label 0)` or `more than 50K (label 1)` based on the features selected.**

---

### Submitting the Project

Follow the steps described below to submit the project.

1. After finishing the project, click on the **Share** button on the top right corner of the notebook. A new dialog box will appear.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/2_share_button.png' width=500>

2. In the dialog box, click on the **Copy link** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/3_copy_link.png' width=500>


3. The link of the duplicate copy (named **YYYY-MM-DD_StudentName_CapstoneProject18**) of the notebook will get copied.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/4_copy_link_confirmation.png' width=500>

4. Go to your dashboard and click on the **My Projects** option.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/5_student_dashboard.png' width=800>

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/6_my_projects.png' width=800>

5. Click on the **View Project** button for the project you want to submit.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/7_view_project.png' width=800>

6. Click on the **Submit Project Here** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/8_submit_project.png' width=800>

7. Paste the link to the project file named as **YYYY-MM-DD_StudentName_CapstoneProject18** in the URL box and then click on the **Submit** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/9_enter_project_url.png' width=800>


---