# Adult Census Income Prediction

## Life Cycle of a Machine Learning Project

* Understanding the Problem Statement
* Data Collection
* Data Checks to Perform
* Exploratory Data Analysis
* Data Pre-Processing
* Model Training
* Choosing the Best Model

# 1) Problem Statement

* The goal is to predict whether a person earns more than 50K a year.
This is a binary classification problem: each person is assigned to either the `>50K` group or the `<=50K` group.
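
As a minimal sketch of what "binary classification" means for the target column, the two income groups can be mapped to 0 and 1. The toy labels below are made up for illustration, and the `str.strip()` call is an assumption that guards against the leading space seen in other categorical columns of this file.

```python
import pandas as pd

# Toy labels standing in for the `salary` column (illustrative values only).
salary = pd.Series([" <=50K", " >50K", " <=50K", " >50K", " <=50K"], name="salary")

# Strip surrounding whitespace, then map the positive class (>50K) to 1 and the rest to 0.
target = (salary.str.strip() == ">50K").astype(int)

print(target.tolist())
print(target.mean())  # share of the positive class in this toy sample
```

The mean of the encoded target is the share of the `>50K` class, which is a quick way to see how imbalanced the two groups are.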

# 2) Data Collection

* Data Source - https://www.kaggle.com/datasets/overload10/adult-census-dataset
* The data consists of 15 columns and 32,561 rows

## 2.1 Import Data and Required Packages

### Importing Pandas, NumPy, Matplotlib, Seaborn and the Warnings Library

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

## Import the CSV Data as Pandas DataFrame

In [6]:
df = pd.read_csv(r"C:\Users\HP\Desktop\projects\Adult_census_Income_Prediction\notebook\data\adult.csv")

### Show Top 5 Records

In [8]:
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


### Shape of the dataset

In [9]:
df.shape

(32561, 15)

## 2.2 Dataset information

1. age: Age of the individual in years.

2. workclass: Employment category of the individual, such as "State-gov," "Private," or "Self-emp-not-inc."

3. fnlwgt: Final weight assigned to the individual in the survey or census. Such weights are often used to adjust for oversampling or undersampling.

4. education: Highest level of education completed, such as "Bachelors," "HS-grad," or "11th."

5. education-num: Numeric code for the education level, ranging from 1 to 16. It is an ordinal code rather than an exact count of years of schooling; in the sample rows, "11th" is 7, "HS-grad" is 9, and "Bachelors" is 13.

6. marital-status: Marital status of the individual, such as "Never-married," "Married-civ-spouse," or "Divorced."

7. occupation: Occupation or job role of the individual, such as "Adm-clerical," "Exec-managerial," or "Handlers-cleaners."

8. relationship: The individual's relationship status, with values such as "Not-in-family," "Husband," or "Wife."

9. race: Race or ethnicity of the individual, such as "White," "Black," or other racial categories.

10. sex: Gender of the individual, recorded as "Male" or "Female."

11. capital-gain: Capital gains. This column likely represents the capital gains reported by the individual.

12. capital-loss: Capital losses. This column likely represents the capital losses reported by the individual.

13. hours-per-week: Number of hours the individual works in a typical week.

14. country: Country of origin or citizenship of the individual, such as "United-States" or "Cuba."

15. salary: Income level (the target variable for the binary classification problem). It has two categories: ">50K" (income over $50,000 per year) and "<=50K" (income less than or equal to $50,000 per year).
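
Since `education` and `education-num` describe the same attribute, it can be worth checking whether each label maps to exactly one code. The sketch below uses a hypothetical `toy` frame built from values in the sample rows; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy frame with values taken from the sample rows shown earlier.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "11th", "Bachelors", "HS-grad"],
    "education-num": [13, 9, 7, 13, 9],
})

# Count distinct codes per label and distinct labels per code.
codes_per_label = toy.groupby("education")["education-num"].nunique()
labels_per_code = toy.groupby("education-num")["education"].nunique()

# The mapping is one-to-one only if both counts are always 1.
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())

# Label -> code lookup, ordered by code.
mapping = toy.drop_duplicates().set_index("education")["education-num"].sort_values()

print(is_one_to_one)
print(mapping)
```

If the mapping turns out to be one-to-one on the full data, one of the two columns is redundant for modelling.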

# 3) Data Checks to Perform

* Check missing values
* Check duplicates
* Check data types
* Check the number of unique values in each column
* Check statistics of the data set
* Check the categories present in each categorical column

## 3.1 Check Missing Values

In [47]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
country           0
salary            0
dtype: int64

### No NaN values are reported at this stage (section 3.7 shows that missing entries are encoded as ' ?')

## 3.2 Check Duplicates

In [13]:
df.duplicated().sum()

24

#### There are 24 duplicate rows in the data set

In [51]:
df = df.drop_duplicates()
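
Before dropping duplicates it can help to look at the affected rows. The following is a small sketch on a hypothetical `toy` frame; the `reset_index(drop=True)` step is an optional addition, not something the notebook does, and it avoids the gaps that `drop_duplicates()` leaves in the index.

```python
import pandas as pd

# Toy frame in which row 2 repeats row 0.
toy = pd.DataFrame({
    "age": [39, 50, 39, 28],
    "workclass": ["Private", "Private", "Private", "State-gov"],
})

# Rows that are identical to an earlier row (what duplicated().sum() counts).
n_dupes = int(toy.duplicated().sum())

# Every member of each duplicate group, including the first occurrence.
all_copies = toy[toy.duplicated(keep=False)]

# Drop the repeats and rebuild a gap-free 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(all_copies)
print(deduped)
```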


## 3.3 Check data types

In [52]:
# Check Null and Dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32537 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32537 non-null  int64 
 1   workclass       32537 non-null  object
 2   fnlwgt          32537 non-null  int64 
 3   education       32537 non-null  object
 4   education-num   32537 non-null  int64 
 5   marital-status  32537 non-null  object
 6   occupation      32537 non-null  object
 7   relationship    32537 non-null  object
 8   race            32537 non-null  object
 9   sex             32537 non-null  object
 10  capital-gain    32537 non-null  int64 
 11  capital-loss    32537 non-null  int64 
 12  hours-per-week  32537 non-null  int64 
 13  country         32537 non-null  object
 14  salary          32537 non-null  object
dtypes: int64(6), object(9)
memory usage: 4.0+ MB


## 3.4 Checking the number of unique values of each column

In [53]:
df.nunique()

age                  73
workclass             9
fnlwgt            21648
education            16
education-num        16
marital-status        7
occupation           15
relationship          6
race                  5
sex                   2
capital-gain        119
capital-loss         92
hours-per-week       94
country              42
salary                2
dtype: int64

## 3.5 Check statistics of data set

In [54]:
df.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32537.0 | 32537.0 | 32537.0 | 32537.0 | 32537.0 | 32537.0 |
| mean | 38.585549 | 189780.8 | 10.081815 | 1078.443741 | 87.368227 | 40.440329 |
| std | 13.637984 | 105556.5 | 2.571633 | 7387.957424 | 403.101833 | 12.346889 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 236993.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


#### Insight

1. Age Distribution:

   * Ages range from 17 to 90 years.
   * The mean age is about 38.6 years and the median is 37.
   * 75% of individuals are aged 48 or younger (75th percentile).

2. Final Weight (fnlwgt):

   * The "fnlwgt" values range from 12,285 to 1,484,705.
   * The standard deviation (about 105,556) is more than half the mean (about 189,781), indicating a wide spread of values.

3. Education Level (education-num):

   * The "education-num" column is a numeric code for the education level, ranging from 1 to 16.
   * The 25th percentile is 9, the median is 10, and the 75th percentile is 12, so at least half of the individuals have a code between 9 and 12.

4. Capital Gain and Loss:

   * The 25th, 50th, and 75th percentiles of both "capital-gain" and "capital-loss" are 0, so at least 75% of individuals report no capital gain and at least 75% report no capital loss.
   * A few individuals have large values: capital gains reach 99,999 and capital losses reach 4,356.
   * The standard deviations (about 7,388 and 403) are several times larger than the means (about 1,078 and 87), which reflects this heavy right skew.

5. Hours Worked per Week:

   * The mean is about 40.4 hours per week, and both the 25th percentile and the median are 40.
   * 75% of individuals work 45 hours per week or fewer.
   * Values range from 1 to 99 hours.
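
The percentile-based reading of the capital columns can be checked directly. The sketch below uses a made-up `capital_gain` series (mostly zeros with two large values) rather than the real column; on the actual data the same calls would be applied to `df['capital-gain']`.

```python
import pandas as pd

# Illustrative series: mostly zeros plus two large values, mimicking the shape of capital-gain.
capital_gain = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 2174, 99999], name="capital-gain")

# Share of rows that are exactly zero.
zero_share = (capital_gain == 0).mean()

# Quartiles: all zero whenever at least 75% of the values are zero.
quartiles = capital_gain.quantile([0.25, 0.5, 0.75])

# Positive skewness indicates a long right tail.
skewness = capital_gain.skew()

print(zero_share)
print(quartiles)
print(skewness > 0)
```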

## 3.7 Exploring Data

In [56]:
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [57]:
# Columns with object dtype are treated as categorical; all others as numerical
categorical_features = [col for col in df.columns if df[col].dtype == 'O']
numerical_features = [col for col in df.columns if df[col].dtype != 'O']

In [69]:
categorical_features

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'country',
 'salary']

In [70]:
numerical_features

['age',
 'fnlwgt',
 'education-num',
 'capital-gain',
 'capital-loss',
 'hours-per-week']

In [42]:
print(f"we have {len(categorical_features)} categorical_features")
print(f"we have {len(numerical_features)} numerical_features")

we have 9 categorical_features
we have 6 numerical_features


In [58]:
for feature in categorical_features:
    print(f"Categories in {feature} variable:     ", end=" ")
    print(df[feature].unique())
    print("\n\n")

Categories in workclass variable:      [' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']



Categories in education variable:      [' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']



Categories in marital-status variable:      [' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed']



Categories in occupation variable:      [' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
 ' Farming-fishing' ' Machine-op-inspct' ' Tech-support' ' ?'
 ' Protective-serv' ' Armed-Forces' ' Priv-house-serv']



Categories in relationship variable:      [' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
 ' Other-relative']



...

#### Missing values are encoded as ' ?' instead of NaN, so the missing-value check has to be repeated for ' ?'
Note that the placeholder is ' ?' (with a leading space), not '?'.
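
Because the placeholder differs from '?' only by a leading space, a comparison that strips whitespace first is less error-prone. The sketch below works on a hypothetical `toy` frame and picks the text columns with `pd.api.types.is_numeric_dtype`, which is a choice made for this example rather than something taken from the notebook.

```python
import pandas as pd

# Toy frame with the same ' ?' placeholder (leading space) as the real data.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" State-gov", " ?", " Private"],
    "occupation": [" ?", " Sales", " ?"],
})

# Select the non-numeric (text) columns.
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Strip surrounding whitespace so that ' ?' and '?' are treated alike.
stripped = toy[text_cols].apply(lambda s: s.str.strip())

# Count the placeholder per column.
placeholder_counts = (stripped == "?").sum()

print(placeholder_counts)
```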

In [67]:
(df==' ?').sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
country            582
salary               0
dtype: int64

In [71]:
# For convenience, replace the ' ?' placeholder with NaN
df = df.replace(' ?', np.nan)
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
country            582
salary               0
dtype: int64

### New Insight
* workclass column has 1,836 missing values.
* occupation column has 1,843 missing values.
* country column has 582 missing values.
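
The placeholder could also be handled when the file is read, which is an alternative to the replacement step above rather than the notebook's method. The sketch below uses a small inline CSV in place of `adult.csv` and relies on the `skipinitialspace` and `na_values` parameters of `pd.read_csv`.

```python
import io

import pandas as pd

# Inline stand-in for adult.csv: values are padded with a leading space, '?' marks missing data.
csv_text = (
    "age,workclass,salary\n"
    "39, State-gov, <=50K\n"
    "54, ?, >50K\n"
)

# skipinitialspace drops the space after each delimiter; na_values turns '?' into NaN.
sample = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, na_values="?")

print(sample)
print(sample.isna().sum())
```

Reading the file this way also removes the leading spaces from every category label, so later comparisons can use 'Private' instead of ' Private'.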

In [85]:
for col in ['workclass','occupation','country']:
    print(f" Missing percentage of {col} is {(df[col].isnull().sum()/df.shape[0])*100} %")

 Missing percentage of workclass is 5.6428066508897565 %
 Missing percentage of occupation is 5.664320619602299 %
 Missing percentage of country is 1.7887328272428313 %
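
The same percentages can be obtained for all columns at once, since the mean of a boolean mask is the fraction of `True` values. This is a sketch on a hypothetical `toy` frame; the last line redoes the workclass arithmetic from the counts reported above (1,836 missing out of 32,537 rows).

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value out of four in `workclass`.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "country": ["United-States", "Cuba", "United-States", "United-States"],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction.
missing_pct = (toy.isna().mean() * 100).round(2)

# Keep only columns that actually have missing values, largest first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)

# Arithmetic check of the workclass figure reported in the notebook output.
workclass_pct = 1836 / 32537 * 100
print(round(workclass_pct, 2))
```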


In [103]:
# Find the most frequent value (mode) of each column with missing values

print(df['workclass'].mode()[0])
print(df['occupation'].mode())
print(df['country'].mode())

 Private
0     Prof-specialty
Name: occupation, dtype: object
0     United-States
Name: country, dtype: object


In [107]:
# Fill the missing values in each column with that column's mode

df['workclass'] = df['workclass'].fillna(df['workclass'].mode()[0])
df['occupation'] = df['occupation'].fillna(df['occupation'].mode()[0])
df['country'] = df['country'].fillna(df['country'].mode()[0])

In [108]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
country           0
salary            0
dtype: int64

##### Finally, there are no missing values
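
Mode imputation, as used above, folds the missing rows into the most frequent category. The sketch below contrasts it with one possible alternative, an explicit "Unknown" category, on a hypothetical `toy` frame; the "Unknown" label is an assumption made for illustration and does not appear in the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"workclass": ["Private", np.nan, "Private", "State-gov", np.nan]})

# Option A: mode imputation, as in the notebook.
mode_value = toy["workclass"].mode()[0]
filled_mode = toy["workclass"].fillna(mode_value)

# Option B (illustrative alternative): keep missingness visible as its own category.
filled_unknown = toy["workclass"].fillna("Unknown")

print(filled_mode.value_counts())
print(filled_unknown.value_counts())
```

Option A inflates the count of the most frequent category, while Option B lets a model treat "missing" as a signal in its own right. Which one works better depends on the downstream model.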

To be continued. Stay tuned.

In [114]:
df.to_csv("C:/Users/HP/Desktop/projects/Adult_census_Income_Prediction/notebook/data/cleaned_Adult_dataset.csv",index=False)
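
The absolute Windows path above only works on the author's machine. As a sketch of a more portable approach, a path can be built with `pathlib`. The example writes a hypothetical `toy` frame to a temporary directory and reads it back; the `data/cleaned_sample.csv` layout is an assumption made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"age": [39, 50], "salary": ["<=50K", ">50K"]})

with tempfile.TemporaryDirectory() as tmp:
    # Build the output path from parts instead of hard-coding separators.
    out_path = Path(tmp) / "data" / "cleaned_sample.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False keeps the row index out of the file, so no extra column appears on reload.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded)
```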