
# **Adult Income Prediction**

**Data Set Information:**

This dataset was obtained from the UCI Machine Learning Repository. The goal is to classify adults into two groups by annual income: group 1 earns USD 50K or less (`<=50K`) and group 2 earns more than USD 50K (`>50K`). The data come from the 1994 US Census.


**Attribute Information:**

age: Age of the individual. Continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: Final sampling weight, i.e. the number of people the census estimates the record represents. Continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: Number of years spent in education. Continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: Continuous.

capital-loss: Continuous.

hours-per-week: Continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

income (target, sometimes called salary): >50K, <=50K.

# Exploratory Data Analysis and Visualization

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px
import cufflinks as cf

%matplotlib inline

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate, cross_val_score
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import log_loss, recall_score, accuracy_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, make_scorer
from sklearn.metrics import roc_auc_score, auc, roc_curve, average_precision_score, precision_recall_curve
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# plot_confusion_matrix, plot_roc_curve and plot_precision_recall_curve were removed in
# scikit-learn 1.2; the Display classes below replace them.
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay
from scipy.stats import skew
from yellowbrick.classifier import ClassPredictionError

import warnings
warnings.filterwarnings('ignore')
plt.rcParams["figure.figsize"] = (14,8)
pd.set_option('display.max_columns', 500) 
pd.set_option('display.max_rows', 500)

In [2]:
df = pd.read_csv("adult.csv")
df

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,310152,Some-college,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K
32557,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32558,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32559,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [14]:
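# Executed after the cleaning cells below: the counts reflect the 30,139 rows
# that remain once duplicates and rows with missing values are dropped
df.describe().T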

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,30139.0,38.44172,13.131426,17.0,28.0,37.0,47.0,90.0
fnlwgt,30139.0,189795.02598,105658.624341,13769.0,117627.5,178417.0,237604.5,1484705.0
education.num,30139.0,10.122532,2.548738,1.0,9.0,10.0,13.0,16.0
capital.gain,30139.0,1092.841202,7409.110596,0.0,0.0,0.0,0.0,99999.0
capital.loss,30139.0,88.439928,404.445239,0.0,0.0,0.0,0.0,4356.0
hours.per.week,30139.0,40.934703,11.978753,1.0,40.0,40.0,45.0,99.0


In [4]:
df.duplicated().value_counts()

False    32537
True        24
dtype: int64

In [5]:
df = df.drop_duplicates()
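
The duplicate check and removal above can be tried on a small frame first. The sketch below uses a made-up `toy` DataFrame (not rows from `adult.csv`) and adds an optional `reset_index`, which the notebook itself does not do: the later tables keep the original row labels.

```python
import pandas as pd

# Toy frame standing in for adult.csv; the values are illustrative only
toy = pd.DataFrame({
    "age": [25, 25, 40, 25],
    "workclass": ["Private", "Private", "State-gov", "Private"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K"],
})

# duplicated() flags every row that repeats an earlier row exactly
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row;
# reset_index is optional and only renumbers the surviving rows
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, deduped.shape)
```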

In [6]:
df.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
income            0
dtype: int64

In [7]:
# Replace the '?' placeholder with NaN so pandas counts it as missing
df = df.replace('?', np.nan)

In [8]:
df.isna().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week       0
native.country     582
income               0
dtype: int64
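
An alternative to replacing `'?'` after loading is to declare it as a missing-value marker at read time. The sketch below assumes the placeholder appears exactly as `?` (no surrounding spaces) and uses a three-row in-memory sample rather than the real file.

```python
import io
import pandas as pd

# Small in-memory sample with the same '?' placeholder; illustrative rows only
csv_text = """age,workclass,occupation,native.country,income
90,?,?,United-States,<=50K
82,Private,Exec-managerial,United-States,<=50K
54,Private,Machine-op-inspct,?,>50K
"""

# na_values tells read_csv to treat '?' as NaN while parsing
sample = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Per-column count of missing entries
missing = sample.isna().sum()
print(missing)
```

With the real file this would read `pd.read_csv("adult.csv", na_values="?")`, which makes the first `isna().sum()` call report the missing entries directly.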

In [10]:
df.shape

(32537, 15)

In [11]:
df.dropna(axis=0,inplace=True)

In [12]:
df.shape

(30139, 15)

In [34]:
# regex=False treats "." as a literal dot rather than the regex wildcard
df.columns = df.columns.str.replace(".", "_", regex=False)
df

Unnamed: 0,age,workclass,fnlwgt,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,education_summary
1,82,Private,132870,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K,medium_level_grade
3,54,Private,140359,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K,low_level_grade
4,41,Private,264663,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K,medium_level_grade
5,34,Private,216864,Divorced,Other-service,Unmarried,White,Female,0,3770,45,United-States,<=50K,medium_level_grade
6,38,Private,150601,Separated,Adm-clerical,Unmarried,White,Male,0,3770,40,United-States,<=50K,low_level_grade
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,310152,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K,medium_level_grade
32557,27,Private,257302,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K,medium_level_grade
32558,40,Private,154374,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K,medium_level_grade
32559,58,Private,151910,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K,medium_level_grade


## Decreasing The Number of Categories In Features

### EDUCATION

In [18]:
def mapping_education(x):
    if x in ["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th"]:
        return "low_level_grade"
    elif x in ["HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm"]:
        return "medium_level_grade"
    elif x in ["Bachelors", "Masters", "Prof-school", "Doctorate"]:
        return "high_level_grade"

In [19]:
df.education.apply(mapping_education).value_counts()


medium_level_grade    18818
high_level_grade       7585
low_level_grade        3736
Name: education, dtype: int64

In [20]:
df["education_summary"] = df.education.apply(mapping_education)
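
The same grouping as `mapping_education` can also be written as a lookup table, which makes unmapped categories easy to spot. This is a sketch of an alternative, not the notebook's method; the names `EDUCATION_GROUPS` and `LEVEL_TO_GROUP` and the sample series are made up for illustration.

```python
import pandas as pd

# Same three groups as mapping_education, expressed as data
EDUCATION_GROUPS = {
    "low_level_grade": ["Preschool", "1st-4th", "5th-6th", "7th-8th",
                        "9th", "10th", "11th", "12th"],
    "medium_level_grade": ["HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm"],
    "high_level_grade": ["Bachelors", "Masters", "Prof-school", "Doctorate"],
}

# Invert to a flat {education level: group} lookup
LEVEL_TO_GROUP = {
    level: group
    for group, levels in EDUCATION_GROUPS.items()
    for level in levels
}

# Illustrative values; "Unknown-level" is deliberately not in the lookup
education = pd.Series(["HS-grad", "7th-8th", "Masters", "Unknown-level"])
summary = education.map(LEVEL_TO_GROUP)

# map() leaves NaN for anything missing from the lookup, so list those values
unmapped = education[summary.isna()].unique().tolist()
print(summary.tolist(), unmapped)
```

Checking `unmapped` before dropping the original column guards against a category being silently turned into a missing value.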

In [27]:
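# errors='ignore' lets this run whether the columns still use dotted names
# or have already been renamed with underscores
df = df.drop(columns=['education', 'education.num', 'education_num'], errors='ignore')
df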

Unnamed: 0,age,workclass,fnlwgt,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income,education_summary
1,82,Private,132870,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K,medium_level_grade
3,54,Private,140359,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K,low_level_grade
4,41,Private,264663,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K,medium_level_grade
5,34,Private,216864,Divorced,Other-service,Unmarried,White,Female,0,3770,45,United-States,<=50K,medium_level_grade
6,38,Private,150601,Separated,Adm-clerical,Unmarried,White,Male,0,3770,40,United-States,<=50K,low_level_grade
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,310152,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K,medium_level_grade
32557,27,Private,257302,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K,medium_level_grade
32558,40,Private,154374,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K,medium_level_grade
32559,58,Private,151910,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K,medium_level_grade


### MARITAL STATUS

In [38]:
df['marital_status'].value_counts()

Married-civ-spouse       14059
Never-married             9711
Divorced                  4212
Separated                  939
Widowed                    827
Married-spouse-absent      370
Married-AF-spouse           21
Name: marital_status, dtype: int64

In [35]:
def mapping_marital_status(x):
    if x in ["Never-married", "Divorced", "Separated", "Widowed"]:
        return "unmarried"
    elif x in ["Married-civ-spouse", "Married-AF-spouse", "Married-spouse-absent"]:
        return "married"

In [39]:
df.marital_status.apply(mapping_marital_status).value_counts()

unmarried    15689
married      14450
Name: marital_status, dtype: int64

In [40]:
df["marital_status_summary"] = df.marital_status.apply(mapping_marital_status)

In [42]:
df = df.drop(['marital_status', 'relationship'], axis=1)
df

Unnamed: 0,age,workclass,fnlwgt,occupation,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,education_summary,marital_status_summary
1,82,Private,132870,Exec-managerial,White,Female,0,4356,18,United-States,<=50K,medium_level_grade,unmarried
3,54,Private,140359,Machine-op-inspct,White,Female,0,3900,40,United-States,<=50K,low_level_grade,unmarried
4,41,Private,264663,Prof-specialty,White,Female,0,3900,40,United-States,<=50K,medium_level_grade,unmarried
5,34,Private,216864,Other-service,White,Female,0,3770,45,United-States,<=50K,medium_level_grade,unmarried
6,38,Private,150601,Adm-clerical,White,Male,0,3770,40,United-States,<=50K,low_level_grade,unmarried
...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,310152,Protective-serv,White,Male,0,0,40,United-States,<=50K,medium_level_grade,unmarried
32557,27,Private,257302,Tech-support,White,Female,0,0,38,United-States,<=50K,medium_level_grade,married
32558,40,Private,154374,Machine-op-inspct,White,Male,0,0,40,United-States,>50K,medium_level_grade,married
32559,58,Private,151910,Adm-clerical,White,Female,0,0,40,United-States,<=50K,medium_level_grade,unmarried


### WORKCLASS

In [43]:
df['workclass'].value_counts()

Private             22264
Self-emp-not-inc     2498
Local-gov            2067
State-gov            1279
Self-emp-inc         1074
Federal-gov           943
Without-pay            14
Name: workclass, dtype: int64

In [52]:
def mapping_workclass(x):
    if x in ["Local-gov", "State-gov", "Federal-gov"]:
        return "Government"
    elif x in ["Self-emp-not-inc", "Self-emp-inc"]:
        return "Self Employment"
    elif x in ["Private"]:
        return "Private"
    elif x in ["Without-pay"]:
        return "Without Pay"
    

In [53]:
df.workclass.apply(mapping_workclass).value_counts()

Private            22264
Government          4289
Self Employment     3572
Without Pay           14
Name: workclass, dtype: int64

In [54]:
df["workclass_summary"] = df.workclass.apply(mapping_workclass)

In [59]:
df = df.drop(['workclass'], axis=1)
df

Unnamed: 0,age,fnlwgt,occupation,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,education_summary,marital_status_summary,workclass_summary
1,82,132870,Exec-managerial,White,Female,0,4356,18,United-States,<=50K,medium_level_grade,unmarried,Private
3,54,140359,Machine-op-inspct,White,Female,0,3900,40,United-States,<=50K,low_level_grade,unmarried,Private
4,41,264663,Prof-specialty,White,Female,0,3900,40,United-States,<=50K,medium_level_grade,unmarried,Private
5,34,216864,Other-service,White,Female,0,3770,45,United-States,<=50K,medium_level_grade,unmarried,Private
6,38,150601,Adm-clerical,White,Male,0,3770,40,United-States,<=50K,low_level_grade,unmarried,Private
...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,310152,Protective-serv,White,Male,0,0,40,United-States,<=50K,medium_level_grade,unmarried,Private
32557,27,257302,Tech-support,White,Female,0,0,38,United-States,<=50K,medium_level_grade,married,Private
32558,40,154374,Machine-op-inspct,White,Male,0,0,40,United-States,>50K,medium_level_grade,married,Private
32559,58,151910,Adm-clerical,White,Female,0,0,40,United-States,<=50K,medium_level_grade,unmarried,Private


In [60]:
df = df.drop(['native_country', 'capital_gain', 'capital_loss'], axis=1)
df


Unnamed: 0,age,fnlwgt,occupation,race,sex,hours_per_week,income,education_summary,marital_status_summary,workclass_summary
1,82,132870,Exec-managerial,White,Female,18,<=50K,medium_level_grade,unmarried,Private
3,54,140359,Machine-op-inspct,White,Female,40,<=50K,low_level_grade,unmarried,Private
4,41,264663,Prof-specialty,White,Female,40,<=50K,medium_level_grade,unmarried,Private
5,34,216864,Other-service,White,Female,45,<=50K,medium_level_grade,unmarried,Private
6,38,150601,Adm-clerical,White,Male,40,<=50K,low_level_grade,unmarried,Private
...,...,...,...,...,...,...,...,...,...,...
32556,22,310152,Protective-serv,White,Male,40,<=50K,medium_level_grade,unmarried,Private
32557,27,257302,Tech-support,White,Female,38,<=50K,medium_level_grade,married,Private
32558,40,154374,Machine-op-inspct,White,Male,40,>50K,medium_level_grade,married,Private
32559,58,151910,Adm-clerical,White,Female,40,<=50K,medium_level_grade,unmarried,Private


In `native_country`, close to 91% of observations belong to a single category (United-States).

In `capital_gain` and `capital_loss`, about 92% and 95% of the values, respectively, are zero, so these columns are expected to contribute little to modelling.

Hence, we drop these three variables (`native_country`, `capital_gain`, `capital_loss`) from the feature matrix in the cell above.
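
The shares quoted above can be computed with `value_counts(normalize=True)`. The sketch below uses a hypothetical helper, `top_value_share`, on a ten-row toy frame; the percentages it prints describe the toy data, not the census data.

```python
import pandas as pd

def top_value_share(series):
    """Return (most frequent value, its share of the non-null rows)."""
    shares = series.value_counts(normalize=True)  # sorted, largest share first
    return shares.index[0], float(shares.iloc[0])

# Toy frame with one dominant value per column; illustrative only
toy = pd.DataFrame({
    "native_country": ["United-States"] * 9 + ["Mexico"],
    "capital_gain": [0] * 8 + [5000, 7000],
})

for col in toy.columns:
    value, share = top_value_share(toy[col])
    print(f"{col}: {value!r} covers {share:.0%} of rows")
```

Running the helper on the real columns before dropping them would document where the roughly 91%, 92% and 95% figures come from.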

In [62]:
df = df.drop(['fnlwgt', 'race', 'sex', 'marital_status_summary' ], axis=1)
df

Unnamed: 0,age,occupation,hours_per_week,income,education_summary,workclass_summary
1,82,Exec-managerial,18,<=50K,medium_level_grade,Private
3,54,Machine-op-inspct,40,<=50K,low_level_grade,Private
4,41,Prof-specialty,40,<=50K,medium_level_grade,Private
5,34,Other-service,45,<=50K,medium_level_grade,Private
6,38,Adm-clerical,40,<=50K,low_level_grade,Private
...,...,...,...,...,...,...
32556,22,Protective-serv,40,<=50K,medium_level_grade,Private
32557,27,Tech-support,38,<=50K,medium_level_grade,Private
32558,40,Machine-op-inspct,40,>50K,medium_level_grade,Private
32559,58,Adm-clerical,40,<=50K,medium_level_grade,Private


In [63]:
df['occupation'].value_counts()

Prof-specialty       4034
Craft-repair         4025
Exec-managerial      3991
Adm-clerical         3719
Sales                3584
Other-service        3209
Machine-op-inspct    1964
Transport-moving     1572
Handlers-cleaners    1349
Farming-fishing       987
Tech-support          911
Protective-serv       644
Priv-house-serv       141
Armed-Forces            9
Name: occupation, dtype: int64

## ADJUSTING THE TARGET 

In [64]:
# Encode the target as 0/1; assigning the result avoids chained inplace replace,
# which may not modify df in recent pandas versions
df['income'] = df['income'].map({'<=50K': 0, '>50K': 1})
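
When encoding the target it can help to check that every label was recognised. A minimal sketch, assuming the only valid labels are `<=50K` and `>50K`; the name `INCOME_CODES` and the sample values (including one with stray whitespace) are made up for illustration.

```python
import pandas as pd

INCOME_CODES = {"<=50K": 0, ">50K": 1}

# Illustrative labels; the third one carries stray whitespace
income = pd.Series(["<=50K", ">50K", " >50K ", "<=50K"])

# Strip whitespace, then look each label up in the code table
encoded = income.str.strip().map(INCOME_CODES)

# map() yields NaN for labels it does not know, so fail loudly if any appear
if encoded.isna().any():
    unknown = income[encoded.isna()].unique().tolist()
    raise ValueError(f"Unrecognised income labels: {unknown}")

encoded = encoded.astype("int64")
print(encoded.tolist())
```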

In [66]:
df

Unnamed: 0,age,occupation,hours_per_week,income,education_summary,workclass_summary
1,82,Exec-managerial,18,0,medium_level_grade,Private
3,54,Machine-op-inspct,40,0,low_level_grade,Private
4,41,Prof-specialty,40,0,medium_level_grade,Private
5,34,Other-service,45,0,medium_level_grade,Private
6,38,Adm-clerical,40,0,low_level_grade,Private
...,...,...,...,...,...,...
32556,22,Protective-serv,40,0,medium_level_grade,Private
32557,27,Tech-support,38,0,medium_level_grade,Private
32558,40,Machine-op-inspct,40,1,medium_level_grade,Private
32559,58,Adm-clerical,40,0,medium_level_grade,Private


In [69]:
new_column_order = ['age', 'occupation', 'hours_per_week', 'education_summary', 'workclass_summary', 'income']
df = df.reindex(columns=new_column_order)
df

Unnamed: 0,age,occupation,hours_per_week,education_summary,workclass_summary,income
1,82,Exec-managerial,18,medium_level_grade,Private,
3,54,Machine-op-inspct,40,low_level_grade,Private,
4,41,Prof-specialty,40,medium_level_grade,Private,
5,34,Other-service,45,medium_level_grade,Private,
6,38,Adm-clerical,40,low_level_grade,Private,
...,...,...,...,...,...,...
32556,22,Protective-serv,40,medium_level_grade,Private,
32557,27,Tech-support,38,medium_level_grade,Private,
32558,40,Machine-op-inspct,40,medium_level_grade,Private,
32559,58,Adm-clerical,40,medium_level_grade,Private,


# Logistic Regression

# K-Nearest Neighbors (KNN) Classification

# Decision Tree Classification