# CENSUS
A census is the procedure of systematically calculating, acquiring and recording information about the members of a given population. This term is used mostly in connection with national population and housing censuses; other common censuses include the census of agriculture, and other censuses such as the traditional culture, business, supplies, and traffic censuses. The United Nations defines the essential features of population and housing censuses as "individual enumeration, universality within a defined territory, simultaneity and defined periodicity", and recommends that population censuses be taken at least every ten years. United Nations recommendations also cover census topics to be collected, official definitions, classifications and other useful information to co-ordinate international practices.

### DATA DESCRIPTION
Problem Statement:
This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.

Description of fnlwgt (final weight) :
The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the civilian non-institutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are:

A single cell estimate of the population 16+ for each state.

Controls for Hispanic Origin by age and sex.

Controls by Race, age and sex.

We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating "weighted tallies" of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
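The "raking" described above is a form of iterative proportional fitting: weights are rescaled against one set of control totals after another until all margins agree. A minimal sketch follows, assuming a made-up 2x2 table of weighted sample counts and made-up control totals; none of these numbers come from the CPS, and the real procedure uses three sets of controls rather than two.

```python
import numpy as np

def rake(weights, row_targets, col_targets, n_iter=6):
    """Alternately rescale rows and columns so their sums approach the control totals."""
    w = weights.astype(float).copy()
    for _ in range(n_iter):
        w *= (row_targets / w.sum(axis=1))[:, None]   # match the row controls
        w *= (col_targets / w.sum(axis=0))[None, :]   # match the column controls
    return w

# Hypothetical weighted sample counts: rows = two age groups, columns = two sexes.
sample = np.array([[40.0, 60.0],
                   [30.0, 70.0]])
row_controls = np.array([120.0, 80.0])   # assumed population totals per age group
col_controls = np.array([90.0, 110.0])   # assumed population totals per sex

raked = rake(sample, row_controls, col_controls)
print(raked.sum(axis=1), raked.sum(axis=0))
```

After the final pass the column sums match their controls exactly and the row sums match up to a small numerical error, which is why several passes are made.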

In [None]:
#LOADING DATA SET

In [None]:
import pandas as pd
import numpy as np

path ='https://raw.githubusercontent.com/dsrscientist/dataset1/master/census_income.csv'
df= pd.read_csv(path)

In [None]:
df

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
Details About The Columns
 1. Age

This column shows the individual's age in years.

 2. Workclass

This column shows the individual's type of employer (for example private, self-employed or government).

 3. Fnlwgt

This column shows the individual's final weight.

 4. Education

This column shows the individual's education level.

 5. Education_num

This column shows the individual's education level as a number.

 6. Marital_status

This column shows the individual's marital status, that is, whether married or not.

 7. Occupation

This column shows the individual's occupation.

 8. Relationship

This column shows the individual's relationship within the household.

 9. Sex

This column shows the individual's sex.

 10. Race

This column shows the individual's race.

 11. Capital_gain

This column shows the individual's capital gains.

 12. Capital_loss

This column shows the individual's capital losses.

 13. Hours_per_week

This column shows the individual's working hours per week.

 14. Native_country

This column shows the country the individual comes from.

 15. Income

This column shows whether the individual's income is more or less than 50K.

In [None]:
#Data Exploration


In [None]:
cat_cols=df.select_dtypes([object])

for col in cat_cols.columns:
    print(col)
    print(df[col].value_counts())
    print('*****************************************************')


OUTCOME:
Native_country, Occupation, Workclass

 * These columns have unknown values represented by ?

Education

 * 9th, 10th, 11th and 12th all represent high-school-level schooling but are listed separately

 * An Elementary class could be created for 1st-4th, 5th-6th and 7th-8th

Marital_status

 * Married-civ-spouse, Married-spouse-absent and Married-AF-spouse fall under the category Married

 * Divorced and Separated fall under the category Separated

Workclass

 * Self-emp-not-inc and Self-emp-inc fall under the category self-employed

 * Local-gov, State-gov and Federal-gov fall under the category government employees
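The Marital_status and Workclass groupings listed above are not applied by the cells that follow. A minimal sketch of how they could be applied, assuming the raw labels keep the leading space seen in this CSV; the group names (`Married`, `Separated`, `Self-employed`, `Government`) and the `demo` frame are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical mappings that follow the grouping described above.
marital_groups = {
    ' Married-civ-spouse': 'Married',
    ' Married-spouse-absent': 'Married',
    ' Married-AF-spouse': 'Married',
    ' Divorced': 'Separated',
    ' Separated': 'Separated',
}
workclass_groups = {
    ' Self-emp-not-inc': 'Self-employed',
    ' Self-emp-inc': 'Self-employed',
    ' Local-gov': 'Government',
    ' State-gov': 'Government',
    ' Federal-gov': 'Government',
}

# Tiny stand-in frame so the sketch runs on its own.
demo = pd.DataFrame({
    'Marital_status': [' Married-civ-spouse', ' Divorced', ' Never-married'],
    'Workclass': [' Self-emp-inc', ' State-gov', ' Private'],
})

# Labels that are not in a mapping are left unchanged.
demo['Marital_status'] = demo['Marital_status'].replace(marital_groups)
demo['Workclass'] = demo['Workclass'].replace(workclass_groups)
print(demo)
```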

Removing the rows with unknown values ( ? )

In [None]:
df = df.drop(df[df['Native_country'] == ' ?'].index)
df = df.drop(df[df['Occupation'] == ' ?'].index)
df = df.drop(df[df['Workclass'] == ' ?'].index)
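An alternative to dropping the `' ?'` rows column by column is to mark `?` as missing while reading the file and then call `dropna`. The sketch below assumes a small in-memory CSV as a stand-in for the real file; `skipinitialspace` and `na_values` are standard `pd.read_csv` options.

```python
import io
import pandas as pd

# Stand-in CSV text; the notebook reads the real file from `path`.
csv_text = (
    "Age,Workclass,Occupation,Native_country\n"
    "39, State-gov, Adm-clerical, United-States\n"
    "54, ?, ?, United-States\n"
    "28, Private, Sales, ?\n"
)

demo = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,   # strip the space that follows each comma
    na_values='?',           # treat '?' as a missing value
)
cleaned = demo.dropna(subset=['Workclass', 'Occupation', 'Native_country'])
print(len(demo), len(cleaned))
```

With this approach the labels no longer carry a leading space, so later cells that compare against values such as `' 7th-8th'` would need to be adjusted accordingly.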

In [None]:
df['Native_country'].value_counts()

In [None]:
df['Occupation'].value_counts()
 

In [None]:
df['Workclass'].value_counts()

In [None]:
df['Hours_per_week'].value_counts()

In [None]:
df['Hours_per_week'] = pd.cut(df['Hours_per_week'], 
                                   bins = [0, 30, 40, 100], 
                                   labels = ['Lesser Hours', 'Normal Hours', 'Extra Hours'])
df['Hours_per_week'].value_counts()

In [None]:
df['Capital Diff'] = df['Capital_gain'] - df['Capital_loss']
df.drop(['Capital_gain'], axis = 1, inplace = True)
df.drop(['Capital_loss'], axis = 1, inplace = True)
df['Capital Diff'] = pd.cut(df['Capital Diff'], bins = [-5000, 5000, 100000], labels = ['Minor', 'Major'])
df['Age'].value_counts()

In [None]:
df['Age'] = pd.cut(df['Age'], bins = [0, 25, 50, 100], labels = ['Young', 'Adult', 'Old'])
df['Age'].value_counts()


In [None]:
df['Education'].value_counts()

In [None]:
education_classes = df['Education'].unique()
for edu_class in education_classes:
    print("For {}, the Education Number is {}"
          .format(edu_class, df[df['Education'] == edu_class]['Education_num'].unique()))


From the above we can see that each Education label maps to exactly one Education_num value, so the two columns carry the same information and one of them can be dropped. I'll also combine all levels from Preschool to 12th into one class, since they represent people with no college/university level education.
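The one-to-one relationship can also be checked programmatically instead of by reading the printed output. A minimal sketch, assuming a small stand-in frame (`demo`); in the notebook the same check would run on `df` before `Education_num` is dropped.

```python
import pandas as pd

# Stand-in rows with the same two columns as the census data.
demo = pd.DataFrame({
    'Education': [' Bachelors', ' HS-grad', ' Bachelors', ' Masters'],
    'Education_num': [13, 9, 13, 14],
})

# Number of distinct codes per label, and distinct labels per code.
codes_per_label = demo.groupby('Education')['Education_num'].nunique()
labels_per_code = demo.groupby('Education_num')['Education'].nunique()

# The mapping is one-to-one only if both counts are 1 everywhere.
one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(one_to_one)
```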

In [None]:
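df.drop(['Education_num'], axis = 1, inplace = True)
# Assign the result back: replace(..., inplace=True) on a selected column may not update df in newer pandas
df['Education'] = df['Education'].replace([' 7th-8th', ' 5th-6th',' 1st-4th', ' Preschool',' 11th', ' 9th', ' 10th', ' 12th'],
                                          ' School')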

df['Education'].value_counts()

In [None]:
df['Native_country'].value_counts()

We can see that the majority of adults are from the United States. Thus, we can reduce the column to two values: United States or Other.

In [None]:
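# Keep ' United-States' and relabel every other country as 'Other'.
# Selecting by label avoids relying on United-States being the first unique value.
other_countries = [c for c in df['Native_country'].unique() if c != ' United-States']
df['Native_country'] = df['Native_country'].replace(other_countries, 'Other')
df['Native_country'].value_counts()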

In [None]:
df['Race'].value_counts()

We can see that the dataset mostly contains records for the White race, while all other races are far fewer in number. I'll combine all other races into one class, Other.

In [None]:
df['Race'].unique()
# Assign the result back instead of using inplace=True on a selected column
df['Race'] = df['Race'].replace([' Black', ' Asian-Pac-Islander', ' Amer-Indian-Eskimo', ' Other'],' Other')
df['Race'].value_counts()

In [None]:
df

# Checking for Columns Containing Null, Blank or Empty Values

In [None]:
df.isnull().sum()

In [None]:
df.dtypes

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull())
plt.title("Null Values")
plt.show()

In [None]:
#Checking and Transforming the Data types of the Columns To Same DataTypes for Better Analysis
df.info()

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
le =LabelEncoder()

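# The binned columns (Age, Hours_per_week, Capital Diff) are encoded too: the later numeric steps need every column to be numeric
list1=['Age','Workclass','Education','Marital_status','Occupation','Relationship','Race','Sex',
       'Hours_per_week','Capital Diff','Native_country','Income']
for val in list1:
  df[val]=le.fit_transform(df[val].astype(str))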

In [None]:
df.head()

# Data Analysis

In [None]:
plt.figure(figsize =(12,6));
sns.countplot(x = 'Income', data = df);
plt.xlabel("Income",fontsize = 12);
plt.ylabel("Frequency",fontsize = 12);

In [None]:

print(df['Workclass'].value_counts())  
plt.figure(figsize=(10,10))
sns.countplot(x=df['Workclass'])
plt.show()

In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Age and Income")
sns.barplot(x = "Age", y = "Income", data = df)
plt.show()

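Because `sns.barplot` plots the mean of the encoded Income for each group, the bars show the share of people earning more than 50K. We can see that the 50-100 years age group comes out higher than the younger ones.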

In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Education and Income")
sns.barplot(x = "Education", y = "Income", data = df)
plt.show()


In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Marital_status and Income")
sns.barplot(x = "Marital_status", y = "Income", data = df)
plt.show()

In [None]:
plt.figure(figsize = (20,10))
plt.title("Comparison between Occupation and Income")
sns.barplot(x = "Occupation", y = "Income", data = df)
plt.show()

In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Relationship and Income")
sns.barplot(x = "Relationship", y = "Income", data = df)
plt.show()


In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Race and Income")
sns.barplot(x = "Race", y = "Income", data = df)
plt.show()

In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Sex and Income")
sns.barplot(x = "Sex", y = "Income", data = df)
plt.show()

In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Native_country and Income")
sns.barplot(x = "Native_country", y = "Income", data = df)
plt.show()

 * Our dataset has about 25000 people earning <=50K, i.e. 75%; the remaining 25% earn more than 50K.
 * The Private workclass has more people than any other workclass.
 * The 50-100 years age group comes out higher than the younger ones.
 * Education is fairly evenly distributed, with Doctorate the highest among all.
 * The ratio for Married-AF-spouse is much higher than for any other marital status.
 * In the Occupation column, Exec-managerial is the highest, followed by Prof-specialty.
 * In Relationship, Wife is the highest.
 * In terms of race, White has a higher census income ratio than all other races combined.
 * In terms of sex, Male is higher than Female.
 * With respect to native country, the ratio for United States is much higher than for the others.


In [None]:
df.plot.scatter(x='Income',y='Fnlwgt')
df.plot.scatter(x='Income',y='Hours_per_week')

In [None]:
df.hist(figsize=(15,30),edgecolor='red',layout=(9,3),bins=15,legend=True)
plt.show()

In [None]:
sns.pairplot(df)

# Correlation between features and target 'INCOME' (EDA)

In [None]:
df.corr()

In [None]:
# Correlation with the target column Income

df.corr()['Income'].sort_values()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(15,7))
sns.heatmap(df.corr(), annot=True, linewidths=0.5,linecolor="black", fmt='.2f')


In [None]:
## Dropping the irrelevant columns..

df.drop(columns=["Fnlwgt"], axis=1, inplace=True)

In [None]:
##Descriptive Statistics
df.describe()

In [None]:
plt.figure(figsize=(15,7))
sns.heatmap(round(df.describe()[1:].transpose(),2), annot=True, linewidths=0.5,linecolor="black", fmt='f')


In [None]:
df.info()

In [None]:
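##Checking Skewness
# Move the target column to the last position so that iloc[:, :-1] selects only the features
my_column1 = df.pop('Income')
df.insert(len(df.columns),'Income', my_column1)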


df.head()

In [None]:
df.iloc[:,:-1].skew()

In [None]:
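from sklearn.preprocessing import power_transform
feature_cols = df.columns[:-1]
x_new=power_transform(df[feature_cols],method='yeo-johnson')

# Assign the transformed array directly: wrapping it in a new DataFrame gives it a fresh 0..n-1 index,
# which no longer matches df after rows were dropped and would introduce NaN values
df[feature_cols]=x_new
df[feature_cols].skew()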

In [None]:
# Outliers Checking
import warnings
warnings.filterwarnings('ignore')
df.plot(kind='box',subplots=True, layout=(3,5), figsize=[20,8])


### IQR Proximity Rule
Z-Score Technique

In [None]:
from scipy.stats import zscore
import numpy as np
z=np.abs(zscore(df))
z.shape

In [None]:
threshold=3
print(np.where(z>threshold))

In [None]:
len(np.where(z>3)[0])

We can see that there are no outliers present
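The heading above also names the IQR proximity rule, although only the z-score check is run. For comparison, a minimal sketch of the IQR rule, assuming the usual multiplier of 1.5 and a made-up series of values; the function name `iqr_outlier_mask` is hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values with one obvious extreme point.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_outlier_mask(values)
print(values[mask].tolist())
```

Unlike the z-score, this rule does not depend on the mean and standard deviation, so a single extreme value does not widen its own acceptance range.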



In [None]:
# Feature Engineering (VIF)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
df.corr()

In [None]:
plt.figure(figsize=(25,22))
sns.heatmap(df.corr(),linewidths=.1,vmin=-1, vmax=1, fmt='.2g', annot = True, linecolor="black",annot_kws={'size':15},cmap="YlGnBu")
plt.yticks(rotation=0)

In [None]:
df = df.dropna()

In [None]:
#SPLITTING THE DATA SET
x=df.drop('Income',axis=1)
y=df['Income']

In [None]:
x

In [None]:
y

In [None]:
def vif_calc():
  vif=pd.DataFrame()
  vif["VIF Factor"]=[variance_inflation_factor(x.values,i) for i in range(x.shape[1])]
  vif["features"]=x.columns
  print(vif)

In [None]:
vif_calc()
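The variance inflation factor of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes it by hand with NumPy on synthetic data to show what the number means; it adds an intercept to each regression, which is an assumption of this sketch, so its values need not match `variance_inflation_factor` when the input has no constant column.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept term
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Synthetic features: `c` is nearly a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X = np.column_stack([a, b, c])

print([round(vif(X, j), 2) for j in range(X.shape[1])])
```

The two nearly identical columns get a large VIF while the independent one stays close to 1, which is the pattern to look for in the table printed by `vif_calc()`.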

In [None]:
##Scaling the Data
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x=pd.DataFrame(sc.fit_transform(x), columns=x.columns)
x

# MODELLING FOR INCOME
Building a classification model, as the target column has only two outcomes


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

print(df['Income'].value_counts())  
plt.figure(figsize=(5,5))
sns.countplot(x=df['Income'])
plt.show()

In [None]:
#OverSampling
from imblearn.over_sampling import SMOTE
sm = SMOTE()
x, y = sm.fit_resample(x,y)
y.value_counts()
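Oversampling before the train/test split lets resampled points derived from future test rows end up in the training data, which can inflate the reported accuracy. A minimal sketch of the alternative order, splitting first and balancing only the training part, follows. It uses plain random oversampling via `sklearn.utils.resample` on synthetic data instead of SMOTE, purely to keep the example self-contained; with imbalanced-learn the same idea would be `sm.fit_resample(x_train, y_train)` after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 150 + [1] * 50)    # imbalanced stand-in labels

# Split first, so the test set keeps the original class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=0
)

# Randomly duplicate minority rows in the training part only.
minority = X_tr[y_tr == 1]
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra = resample(minority, replace=True, n_samples=n_extra, random_state=0)

X_bal = np.vstack([X_tr, extra])
y_bal = np.concatenate([y_tr, np.ones(n_extra, dtype=int)])
print(np.bincount(y_bal), np.bincount(y_te))
```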

In [None]:
#Modelling to Get the best random state

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import GradientBoostingClassifier, BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

maxAccu=0
maxRS=0

for i in range(1,200):
    x_train,x_test, y_train, y_test=train_test_split(x,y,test_size=.30, random_state=i)
    rfc=RandomForestClassifier()
    rfc.fit(x_train,y_train)
    pred=rfc.predict(x_test)
    acc=accuracy_score(y_test,pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRS=i
print("Best accuracy is ",maxAccu*100," on Random_state ",maxRS)


In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.30,random_state=maxRS)


# Logistic Regression
# Checking Accuracy for Logistic Regression
log = LogisticRegression()
log.fit(x_train,y_train)

#Prediction
predlog = log.predict(x_test)

print(accuracy_score(y_test, predlog)*100)
print(confusion_matrix(y_test, predlog))
print(classification_report(y_test,predlog))

In [None]:
# Plotting Confusion_Matrix
cm = confusion_matrix(y_test,predlog)

x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]

f , ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for LogisticRegression')
plt.show()

In [None]:
##Random Forest Classifier
# Checking accuracy for Random Forest Classifier
rf = RandomForestClassifier()
rf.fit(x_train,y_train)

# Prediction
predrf = rf.predict(x_test)

print(accuracy_score(y_test, predrf)*100)
print(confusion_matrix(y_test, predrf))
print(classification_report(y_test,predrf))


In [None]:
# Lets plot confusion matrix for RandomForestClassifier
cm = confusion_matrix(y_test,predrf)

x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]

f , ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm, annot = True,linewidths=0.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for RandomForestClassifier')
plt.show()

In [None]:
#Decision Tree Classifier
# Checking Accuracy for Decision Tree Classifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)

#Prediction
preddtc = dtc.predict(x_test)

print(accuracy_score(y_test, preddtc)*100)
print(confusion_matrix(y_test, preddtc))
print(classification_report(y_test,preddtc))

In [None]:
# Lets plot confusion matrix for Decision Tree Classifier
cm = confusion_matrix(y_test,preddtc)

x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]

f , ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for Decision Tree Classifier')
plt.show()

In [None]:
#Support Vector Machine Classifier
# Checking accuracy for Support Vector Machine Classifier
svc = SVC()
svc.fit(x_train,y_train)

# Prediction
predsvc = svc.predict(x_test)

print(accuracy_score(y_test, predsvc)*100)
print(confusion_matrix(y_test, predsvc))
print(classification_report(y_test,predsvc))

In [None]:
# Lets plot confusion matrix for Support Vector Machine Classifier
cm = confusion_matrix(y_test,predsvc)

x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]

f , ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)

plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for Support Vector Machine Classifier')
plt.show()

In [None]:
#Gradient Boosting Classifier
# Checking accuracy for Gradient Boosting Classifier
GB = GradientBoostingClassifier()
GB.fit(x_train,y_train)

# Prediction
predGB = GB.predict(x_test)

print(accuracy_score(y_test, predGB)*100)
print(confusion_matrix(y_test, predGB))
print(classification_report(y_test,predGB))

In [None]:
# Lets plot confusion matrix for Gradient Boosting Classifier
cm = confusion_matrix(y_test,predGB)

x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]

f , ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)

plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for Gradient Boosting Classifier')
plt.show()

In [None]:
##Cross Validation Score
#cv score for Logistic Regression
print(cross_val_score(log,x,y,cv=5).mean()*100)

# cv score for Decision Tree Classifier
print(cross_val_score(dtc,x,y,cv=5).mean()*100)

# cv score for Random Forest Classifier
print(cross_val_score(rf,x,y,cv=5).mean()*100)

# cv score for Support Vector  Classifier
print(cross_val_score(svc,x,y,cv=5).mean()*100)

# cv score for Gradient Boosting Classifier
print(cross_val_score(GB,x,y,cv=5).mean()*100)

# It is clear from the above that the Random Forest Classifier works best with respect to the cross-validation score as well, with the smallest difference between its accuracy and its cross-validation score.

So we move forward with the Random Forest Classifier model.
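One way to read "minimum" here is the gap between a model's test accuracy and its mean cross-validation score; that reading is an assumption. Under it, the comparison can be tabulated in a loop rather than read off separate print statements. The sketch uses synthetic data from `make_classification` and only two of the models, with hypothetical variable names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; the notebook would use its own x and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(n_estimators=50, random_state=0),
}

summary = {}
for name, model in models.items():
    # Hold-out accuracy on the test split.
    test_acc = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    # Mean accuracy over 5 folds of the full data (cross_val_score refits clones).
    cv_acc = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    summary[name] = {'test': test_acc, 'cv': cv_acc, 'gap': abs(test_acc - cv_acc)}

for name, row in summary.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```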

## Hyperparameter Tuning for the model with the best score

In [None]:
#Random Forest Classifier

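# 'auto' was removed from max_features in recent scikit-learn ('sqrt' is what it meant for classifiers),
# and n_estimators must be at least 1, so the invalid 0 is replaced by the default of 100
parameters = {'criterion':['gini'],
             'max_features':['sqrt'],
             'n_estimators':[100,200],
             'max_depth':[2,3,4,5,6,8]}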
GCV=GridSearchCV(RandomForestClassifier(),parameters,cv=5)
GCV.fit(x_train,y_train)

In [None]:
GCV.best_params_

In [None]:
Incomee =RandomForestClassifier (criterion='gini', max_depth=8, max_features='sqrt', n_estimators=200)
Incomee.fit(x_train, y_train)
pred = Incomee.predict(x_test)
acc=accuracy_score(y_test,pred)
print(acc*100)

In [None]:
##Plotting ROC and compare AUC for the final model
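# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it.
# The tuned model (Incomee) is plotted, since this is the final model.
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(Incomee,x_test,y_test)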
plt.title("ROC AUC Plot")
plt.show()

Conclusion:
The accuracy score of the final Income model is 84%.

In [None]:
##Saving the model
import joblib
joblib.dump(Incomee,"Census_Income.pkl")
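A saved model can be read back with `joblib.load` and used for prediction. The sketch below shows the round trip with a small stand-in model and a temporary file; the file name `demo_model.pkl` and the synthetic data are assumptions, and the notebook itself would load `Census_Income.pkl`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model; the notebook saves `Incomee` instead.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, 'demo_model.pkl')   # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should give the same predictions as the original.
same = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same)
```

New data passed to the restored model needs the same preprocessing (encoding, power transform and scaling) that was applied before training.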