**Problem Statement:**
In this project, you first preprocess the data, then build an understanding of its features through exploratory analysis and visualizations. Once you know the attributes well enough, you perform a classification task: predicting whether an individual makes more than 50K a year, using different machine learning algorithms.

In [24]:
#Importing all the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [25]:
#Importing the csv file
census=pd.read_csv('census-income.csv',skipinitialspace=True)
census.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [26]:
census.shape

(32561, 15)

In [27]:
census.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

1. Data Preprocessing:
a) Replace all the missing values with NA.
b) Remove all the rows that contain NA values.

In [28]:
census.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
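The null check reports zero missing values, most likely because the file marks missing entries with the string '?' (it shows up among the workclass levels later in the notebook) rather than with a true NaN. A minimal sketch of steps (a) and (b), run on a small made-up frame; the `toy` data and the names `toy_na` and `toy_clean` are illustrative assumptions, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the census data.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "occupation": ["Adm-clerical", "Exec-managerial", "?", "Sales"],
})

# a) Replace the '?' placeholder with a real missing value.
toy_na = toy.replace("?", np.nan)

# b) Drop every row that contains at least one missing value.
toy_clean = toy_na.dropna()

print(toy_na.isnull().sum())
print(toy_clean.shape)
```

Applying the same two calls to `census` would change the row count and every result derived from it, so the outputs recorded in this notebook correspond to the data without this step.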

2. Data Manipulation:  
a) Extract the “education” column and store it in “census_ed”.

In [29]:
census_ed=census[['education']]
census_ed.head()

Unnamed: 0,education
0,Bachelors
1,Bachelors
2,HS-grad
3,11th
4,Bachelors


In [30]:
census['workclass'].unique()

array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
      dtype=object)

b) Extract all the columns from “age” to “relationship” and store them in “census_seq”.

In [31]:
census_seq=census.iloc[:,0:8]
census_seq.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife


c) Extract columns “5”, “8”, and “11” and store them in “census_col”.

In [32]:
census_col=census.iloc[:,[5,8,11]]
census_col.head()

Unnamed: 0,marital-status,race,capital-loss
0,Never-married,White,0
1,Married-civ-spouse,White,0
2,Divorced,White,0
3,Married-civ-spouse,Black,0
4,Married-civ-spouse,Black,0
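`iloc` positions are zero-based, so the numbers 5, 8 and 11 select the sixth, ninth and twelfth columns. If the task counts columns from 1, a different set is meant. The sketch below compares the two readings on an empty frame that reuses the column names from the `census.columns` output; the names `empty`, `zero_based` and `one_based` are illustrative:

```python
import pandas as pd

# Column names copied from the census.columns output.
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income']
empty = pd.DataFrame(columns=columns)

# Zero-based reading: positions are passed to iloc unchanged.
zero_based = list(empty.iloc[:, [5, 8, 11]].columns)

# One-based reading: subtract 1 from each position first.
one_based = list(empty.iloc[:, [p - 1 for p in (5, 8, 11)]].columns)

print(zero_based)
print(one_based)
```

Which reading is intended depends on the assignment; the notebook follows the zero-based one.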


d) Extract all the male employees who work in state-gov and store it in “male_gov”.  

In [33]:
male_gov=census[(census['workclass']=='State-gov') & (census['sex']=='Male')]
male_gov.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
11,30,State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
34,22,State-gov,311512,Some-college,10,Married-civ-spouse,Other-service,Husband,Black,Male,0,0,15,United-States,<=50K
48,41,State-gov,101603,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
123,29,State-gov,267989,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,>50K


In [34]:
male_gov.shape

(809, 15)

e) Extract all the 39 year olds who either have a bachelor's degree or who are native of the United States and store the result in “census_us”. 

In [35]:
census_us=census[(census['age']==39) & ((census['native-country']=='United-States') | (census['education']=='Bachelors'))]
census_us.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
28,39,Private,367260,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Male,0,0,80,United-States,<=50K
129,39,Private,365739,Some-college,10,Divorced,Craft-repair,Not-in-family,White,Male,0,0,40,United-States,<=50K
166,39,Federal-gov,235485,Assoc-acdm,12,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,42,United-States,<=50K
320,39,Self-emp-not-inc,174308,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,<=50K


In [36]:
census_us.shape

(759, 15)

f) Extract 200 random rows from the “census” data frame and store it in “census_200”.

In [37]:
census_200=census.sample(200)
census_200

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
19113,25,Private,206600,12th,8,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,El-Salvador,<=50K
2703,18,Private,132652,11th,7,Never-married,Other-service,Own-child,White,Male,0,0,20,United-States,<=50K
31464,32,Private,309513,Assoc-acdm,12,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
27389,45,State-gov,264052,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
14775,42,State-gov,304302,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,Amer-Indian-Eskimo,Male,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28379,46,Private,261059,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,1977,50,United-States,>50K
3422,40,Private,168936,Assoc-voc,11,Divorced,Other-service,Not-in-family,White,Female,0,0,32,United-States,<=50K
5567,30,Private,303867,HS-grad,9,Separated,Transport-moving,Not-in-family,White,Male,0,0,44,United-States,<=50K
18467,44,Private,322391,11th,7,Separated,Other-service,Unmarried,Black,Female,0,0,30,United-States,<=50K
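Because `sample` draws at random, rerunning the cell returns a different set of 200 rows each time. If a repeatable draw is wanted, passing `random_state` is one option. A minimal sketch on a made-up frame (`toy` and the seed value 1 are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame with 1000 rows.
toy = pd.DataFrame({"value": range(1000)})

# The same seed yields the same rows in the same order.
first = toy.sample(200, random_state=1)
second = toy.sample(200, random_state=1)

print(first.equals(second))
```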


g) Get the count of different levels of the “workclass” column.

In [38]:
workclass_count=census[['workclass']].value_counts()
workclass_count

workclass       
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
dtype: int64

h) Calculate the mean of the “capital-gain” column grouped by “workclass”.

In [39]:
capitalgain_mean=census.groupby(['workclass'])['capital-gain'].mean()
capitalgain_mean

workclass
?                    606.795752
Federal-gov          833.232292
Local-gov            880.202580
Never-worked           0.000000
Private              889.217792
Self-emp-inc        4875.693548
Self-emp-not-inc    1886.061787
State-gov            701.699538
Without-pay          487.857143
Name: capital-gain, dtype: float64

i) Create a separate dataframe with the details of males and females from the census data that has income more than 50,000. 

In [40]:
males=census[(census['sex']=='Male') & (census['income']=='>50K')]
males.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K
10,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K
11,30,State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
14,40,Private,121772,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,Male,0,0,40,?,>50K


In [41]:
males.shape

(6662, 15)

In [42]:
females=census[(census['sex']=='Female') & (census['income']==' >50K')]
females.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income


In [43]:
females.shape

(0, 15)
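The empty result for females is probably an artifact of the filter string rather than a property of the data: the comparison uses `' >50K'` with a leading space, while the file was read with `skipinitialspace=True`, which strips that space from every value. A small sketch of the effect on a made-up three-row CSV (the `raw` text and variable names are illustrative assumptions):

```python
import io
import pandas as pd

# Hypothetical CSV with a space after each comma.
raw = io.StringIO(
    "sex, income\n"
    "Female, >50K\n"
    "Male, >50K\n"
    "Female, <=50K\n"
)
toy = pd.read_csv(raw, skipinitialspace=True)

# Filter string with a leading space: matches nothing after stripping.
with_space = toy[(toy["sex"] == "Female") & (toy["income"] == " >50K")]

# Filter string without the space: matches the stripped values.
without_space = toy[(toy["sex"] == "Female") & (toy["income"] == ">50K")]

print(with_space.shape, without_space.shape)
```

Using `'>50K'` in the females cell, as the males cell does, should therefore return a non-empty frame. If “sex” takes only the two values seen here, the counts elsewhere in the notebook (7841 rows with `>50K`, of which 6662 are male) suggest about 1179 rows.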

j) Calculate the percentage of people from the United States who are private employees and earn less than 50,000 annually. 

In [45]:
Peoplecount_50k=census[(census['native-country']=='United-States') & (census['workclass']=='Private') & (census['income']=='<=50K')]

In [46]:
total=len(census)

In [47]:
percentage=(len(Peoplecount_50k)/total)*100
percentage

47.891649519363654

k) Calculate the percentage of married people in the census data.

In [48]:
census['marital-status'].value_counts()

Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: marital-status, dtype: int64

In [49]:
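#count the three 'Married-*' levels instead of hard-coding 14976+418+23
married=census['marital-status'].isin(['Married-civ-spouse','Married-spouse-absent','Married-AF-spouse']).sum()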
married_percent=(married/total)*100
married_percent

47.34805442093302



l) Calculate the percentage of high school graduates earning more than 50,000 annually.

In [50]:
highschool_50k=census[(census['education']=='HS-grad') & (census['income']=='>50K')]
HS_percent=(len(highschool_50k)/total)*100
HS_percent

5.144190903227788
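The figure above divides by the total number of rows, so it reads as "share of all people who are high school graduates earning more than 50K". If the question is instead "what share of high school graduates earn more than 50K", the denominator would be the number of graduates. A sketch of both conventions on made-up data (`toy`, `share_of_all` and `share_of_hs` are illustrative names):

```python
import pandas as pd

# Hypothetical five-row sample.
toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "HS-grad", "Bachelors", "Masters"],
    "income": [">50K", "<=50K", "<=50K", ">50K", ">50K"],
})

is_hs = toy["education"] == "HS-grad"
is_high = toy["income"] == ">50K"

# Denominator = all rows (the convention used in the cell above).
share_of_all = (is_hs & is_high).mean() * 100

# Denominator = high school graduates only (conditional percentage).
share_of_hs = is_high[is_hs].mean() * 100

print(share_of_all, share_of_hs)
```

The same choice of denominator applies to task (j).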

3. Linear Regression:                                                                                                             
a) Build a simple linear regression model as follows:                           

  ●	Divide the dataset into training and test sets in 70:30 ratio.                
  ●	Build a linear model on the training set where the dependent variable is “hours-per-week” and the independent variable is “education-num”.
  ●	Predict the values on the test set and find the error in prediction.
  ●	Find the root-mean-square error (RMSE).

In [51]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [52]:
lr=LinearRegression()

In [54]:
#independent variable is 'education-num' (kept as a one-column DataFrame, since sklearn expects 2-D features)
x=census[['education-num']]
#dependent variable is 'hours-per-week'
y=census['hours-per-week']

In [55]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.30,random_state=1)
lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)

In [56]:
error=y_test-y_pred

In [57]:
error

9646     30.044869
709     -13.159243
7385      7.432533
16671     0.371349
21932     1.840757
           ...    
29663    -1.832763
29310     0.371349
29661    -0.363355
19491    -1.098059
2861      5.514277
Name: hours-per-week, Length: 9769, dtype: float64

In [58]:
print('mean_squared_error :',mean_squared_error(y_test,y_pred))

print('root-mean-square error :',np.sqrt(mean_squared_error(y_test,y_pred)))

mean_squared_error : 147.15261838664162
root-mean-square error : 12.130647896408568
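RMSE is the square root of the mean of the squared residuals, so it is expressed in the same unit as the target (hours per week here). A minimal sketch that computes it by hand and checks it against `mean_squared_error`; the four values are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted hours per week.
y_true = np.array([40.0, 13.0, 40.0, 50.0])
y_hat = np.array([38.0, 15.0, 41.0, 46.0])

# Residuals, squared, averaged, then square-rooted.
residuals = y_true - y_hat
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# Same quantity via scikit-learn's MSE.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```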


4. Logistic Regression:                                                         
 a) Build a simple logistic regression model as follows:                        
●	Divide the dataset into training and test sets in 65:35 ratio.                
●	Build a logistic regression model where the dependent variable is “X”(yearly income) and the independent variable is “occupation”.                           
●	Predict the values on the test set.                                           
●	Build a confusion matrix and find the accuracy.                               

In [59]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [60]:
lor=LogisticRegression()

In [62]:
census[['occupation']].value_counts()

occupation       
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
dtype: int64

In [64]:
#replace the '?' placeholder with the most frequent level, 'Prof-specialty'
x=census['occupation'].replace('?','Prof-specialty')
x=pd.DataFrame(x)

In [65]:
census['income'].value_counts()

<=50K    24720
>50K      7841
Name: income, dtype: int64

In [66]:
y=census['income'].replace('<=50K',0).replace('>50K',1)
y.value_counts()

0    24720
1     7841
Name: income, dtype: int64

In [68]:
le=LabelEncoder()

In [69]:
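#pass the single column as a 1-D Series; passing the whole DataFrame triggers sklearn's column_or_1d conversion warning
x=le.fit_transform(x['occupation'])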


In [70]:
x=pd.DataFrame(x)

In [71]:
x.head()

Unnamed: 0,0
0,0
1,3
2,5
3,5
4,9


In [72]:
x.value_counts()

9     5983
2     4099
3     4066
0     3770
11    3650
7     3295
6     2002
13    1597
5     1370
4      994
12     928
10     649
8      149
1        9
dtype: int64

In [73]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.35,random_state=1)
lor=LogisticRegression()
lor.fit(x_train,y_train)
y_pred=lor.predict(x_test)

print('confusion_matrix :')
#confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test,y_pred))
print('accuracy_score :',accuracy_score(y_test,y_pred))

confusion_matrix :
[[8800    0]
 [2597    0]]
accuracy_score : 0.7721330174607353
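The model predicts class 0 for every test row. One possible reason is that `LabelEncoder` assigns integer codes in alphabetical order, so a logistic regression on that single column treats occupation as if it were an ordered quantity. One-hot encoding is a common alternative for nominal features. The sketch below uses made-up data and `pd.get_dummies`; it is not the encoding the notebook uses, and `toy`, `x_onehot` and `model` are illustrative names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical sample: only 'Exec-managerial' rows have income 1.
toy = pd.DataFrame({
    "occupation": ["Sales", "Exec-managerial", "Other-service", "Exec-managerial",
                   "Other-service", "Sales", "Exec-managerial", "Other-service"],
    "income": [0, 1, 0, 1, 0, 0, 1, 0],
})

# One indicator column per occupation, so no ordering is implied.
x_onehot = pd.get_dummies(toy["occupation"], dtype=float)

model = LogisticRegression()
model.fit(x_onehot, toy["income"])
pred = model.predict(x_onehot)

print(x_onehot.columns.tolist())
print(list(pred))
```

Whether this improves accuracy on the census data is not shown here; it only removes the artificial ordering between occupations.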


b) Build a multiple logistic regression model as follows:
●	Divide the dataset into training and test sets in 80:20 ratio.                
●	Build a logistic regression model where the dependent variable is “X”(yearly income) and independent variables are “age”, “workclass”, and “education”.      
●	Predict the values on the test set.                                           
●	Build a confusion matrix and find the accuracy.

In [75]:
cen=census[['age','workclass','education']]

In [76]:
cen.head()

Unnamed: 0,age,workclass,education
0,39,State-gov,Bachelors
1,50,Self-emp-not-inc,Bachelors
2,38,Private,HS-grad
3,53,Private,11th
4,28,Private,Bachelors


In [78]:
cen['workclass'].value_counts()

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

In [81]:
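#label-encode every column; note that this also re-codes the numeric 'age' column (for example, 39 becomes 22)
cen=cen.apply(le.fit_transform)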

In [83]:
cen.head()

Unnamed: 0,age,workclass,education
0,22,6,9
1,33,5,9
2,21,3,11
3,36,3,1
4,11,3,9


In [84]:
x=cen

In [85]:
y=census['income'].replace('<=50K',0).replace('>50K',1)
y.value_counts()

0    24720
1     7841
Name: income, dtype: int64

In [86]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.20,random_state=1)
lor=LogisticRegression()
lor.fit(x_train,y_train)
y_pred=lor.predict(x_test)

print('confusion_matrix :')
#confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test,y_pred))
print('accuracy_score :',accuracy_score(y_test,y_pred))

confusion_matrix :
[[4899  127]
 [1456   31]]
accuracy_score : 0.756947643175188
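An accuracy near 0.76 needs a reference point, because the classes are imbalanced: a model that always predicts the majority class is already right about three times in four. The sketch below estimates that baseline with scikit-learn's `DummyClassifier`, using the class counts from the `income` value counts (24720 and 7841). It runs on the full label vector rather than the 80:20 split, so treat the number as approximate; `x_dummy` and `baseline` are illustrative names:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Class counts taken from the income value_counts() output.
y_all = np.array([0] * 24720 + [1] * 7841)

# The dummy model ignores features, so a column of zeros is enough.
x_dummy = np.zeros((len(y_all), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(x_dummy, y_all)
baseline_acc = accuracy_score(y_all, baseline.predict(x_dummy))

print(round(baseline_acc, 4))
```

On the test split itself, the recorded confusion matrix implies that 5026 of the 6513 rows belong to class 0 (about 77%), so the three-feature model does not beat the majority-class baseline.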


5. Decision Tree:                                                               
a) Build a decision tree model as follows:                                      
●	Divide the dataset into training and test sets in 70:30 ratio.                
●	Build a decision tree model where the dependent variable is “X”(Yearly Income) and the rest of the variables as independent variables.                 
●	Predict the values on the test set.                                           
●	Build a confusion matrix and calculate the accuracy.                          

In [87]:
from sklearn.tree import DecisionTreeClassifier

In [88]:
census.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [89]:
census.education= census.education.replace(['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th','10th', '11th', '12th'], 'school')
census.education = census.education.replace('HS-grad', 'high school')
census.education = census.education.replace(['Assoc-voc', 'Assoc-acdm', 'Prof-school', 'Some-college'], 'higher')
census.education = census.education.replace('Bachelors', 'undergrad')
census.education = census.education.replace('Masters', 'grad')
census.education = census.education.replace('Doctorate', 'doc')

In [91]:
census['marital-status']= census['marital-status'].replace(['Married-civ-spouse', 'Married-AF-spouse'], 'married')
census['marital-status']= census['marital-status'].replace(['Never-married'], 'not-married')
census['marital-status']= census['marital-status'].replace(['Divorced', 'Separated','Widowed','Married-spouse-absent'], 'other')

In [92]:
#impute the '?' placeholders with the most frequent level of each column
census['workclass']=census['workclass'].replace('?','Private')
census['occupation']=census['occupation'].replace('?','Prof-specialty')
census['native-country']=census['native-country'].replace('?','United-States')

In [93]:
census.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,undergrad,13,not-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,undergrad,13,married,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,high school,9,other,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,school,7,married,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,undergrad,13,married,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [94]:
backup=census.copy()

In [96]:
census=census.apply(le.fit_transform)
census.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,22,6,2671,5,12,1,0,1,4,1,25,0,39,38,0
1,33,5,2926,5,12,0,3,0,4,1,0,0,12,38,0
2,21,3,14086,2,8,2,5,1,4,1,0,0,39,38,0
3,36,3,15336,4,6,0,5,0,2,1,0,0,39,38,0
4,11,3,19355,5,12,0,9,5,2,0,0,0,39,4,0


In [97]:
x=census.iloc[:,:-1]
y=census.iloc[:,-1]

In [98]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.30,random_state=1)
dt=DecisionTreeClassifier()
dt.fit(x_train,y_train)
y_pred=dt.predict(x_test)

print('confusion_matrix :')
#confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test,y_pred))
print('accuracy_score :',accuracy_score(y_test,y_pred))

confusion_matrix :
[[6560  990]
 [ 838 1381]]
accuracy_score : 0.8128774695465247
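Accuracy summarizes both classes in one number. With imbalanced labels it can help to look at precision and recall for the positive class (income above 50K) as well. A minimal sketch that derives both from a confusion matrix on made-up labels and checks them against scikit-learn's scorers; `y_true` and `y_hat` are illustrative:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical actual and predicted labels (1 = income above 50K).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# With (y_true, y_pred) ordering, ravel() yields tn, fp, fn, tp for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found

print(precision, recall)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```

The `classification_report` function imported earlier in the notebook prints the same quantities per class.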


6. Random Forest:                                                               
 a) Build a random forest model as follows:                                     
●	Divide the dataset into training and test sets in 80:20 ratio.                
●	Build a random forest model where the dependent variable is “X”(Yearly Income) and the rest of the variables as independent variables and number of trees as 300.                                                                   
●	Predict values on the test set                                                
●	Build a confusion matrix and calculate the accuracy

In [99]:
from sklearn.ensemble import RandomForestClassifier

In [100]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.20,random_state=1)
rf=RandomForestClassifier(n_estimators=300)
rf.fit(x_train,y_train)
y_pred=rf.predict(x_test)

print('confusion_matrix :')
#confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test,y_pred))
print('accuracy_score :',accuracy_score(y_test,y_pred))

confusion_matrix :
[[4654  372]
 [ 524  963]]
accuracy_score : 0.8624289881774911
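The forest is built without a `random_state`, so the bootstrap samples and feature subsets differ between runs and the accuracy will vary slightly if the cell is rerun. Fixing the seed is one way to make the result repeatable. A small sketch on synthetic data from `make_classification`; the data, the seed values and the reduced tree count of 20 are assumptions chosen to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (not the census data).
x_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Two forests with the same seed are built identically.
rf_a = RandomForestClassifier(n_estimators=20, random_state=1).fit(x_toy, y_toy)
rf_b = RandomForestClassifier(n_estimators=20, random_state=1).fit(x_toy, y_toy)

same = bool((rf_a.predict_proba(x_toy) == rf_b.predict_proba(x_toy)).all())
print(same)
```

The same argument is accepted by `DecisionTreeClassifier`, whose tie-breaking between equally good splits is also randomized.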
