## Prediction task: determine whether a person makes over 50K a year

#### Life Cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Exploratory Data Analysis
- Data Cleaning
- Data Pre-Processing
- Model Training
- Choosing the Best Model

### Data Set Information:

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: `((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))`
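As a rough illustration of what those four conditions do, the sketch below applies them to a small made-up table with pandas. The column names are the raw census variables quoted above; the values and the `raw`/`clean` names are assumptions for the example only.

```python
import pandas as pd

# Toy records using the raw census variable names (values are made up).
raw = pd.DataFrame({
    "AAGE":    [15, 30, 45, 52],
    "AGI":     [500, 50, 32000, 41000],
    "AFNLWGT": [1200, 3400, 5600, 1],
    "HRSWK":   [10, 40, 38, 40],
})

# The four extraction conditions, combined with element-wise AND.
mask = (
    (raw["AAGE"] > 16)
    & (raw["AGI"] > 100)
    & (raw["AFNLWGT"] > 1)
    & (raw["HRSWK"] > 0)
)
clean = raw[mask]
print(clean)
```

Only the third toy row satisfies all four conditions; each of the others fails exactly one of them.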

### Attribute Information:

Target classes: `>50K`, `<=50K`.

Listing of attributes:

- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. 

### Data Collection and EDA 


In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
import os
os.listdir()

['.ipynb_checkpoints',
 '1.DATA CLEANING.ipynb',
 '2.Mongo DB store and retrive data.ipynb',
 '3.Model Building.ipynb',
 'Data']

In [3]:
column_names=['age','workclass',"fnlwgt","education",'education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','50K, <=50K.']

In [4]:
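# A raw string keeps the backslash literal. In a normal string '\a' is the
# bell character, which made the original path fail with
# OSError: [Errno 22] Invalid argument: 'Data\x07dult.txt'
df=pd.read_csv(r'Data\adult.txt',names=column_names)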
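A possible alternative for the loading step, sketched on two in-memory sample rows rather than the real file: `skipinitialspace=True` drops the space after each comma and `na_values="?"` turns the placeholder into NaN at read time. The sample rows, the `sample_df` name, and calling the label column `class` directly are assumptions for this illustration.

```python
import io
import pandas as pd

column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                'marital-status', 'occupation', 'relationship', 'race', 'sex',
                'capital-gain', 'capital-loss', 'hours-per-week',
                'native-country', 'class']

# Two sample rows in the file's layout (comma + space separated, "?" for unknown).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)

# Strip the space after each delimiter and read "?" as a missing value.
sample_df = pd.read_csv(sample, names=column_names,
                        skipinitialspace=True, na_values="?")
print(sample_df[['workclass', 'occupation', 'native-country']])
```

With these options the later stripping and `"?"` replacement passes would become optional, at the cost of hiding what the raw file looks like.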

In [None]:
df.rename(columns={"50K, <=50K.":"class"},inplace=True)
df.head()

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.to_csv("Data_Train.csv")

In [None]:
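# Replace placeholder values with NaN. If the raw values still carry a leading
# space (" ?"), this pass may match nothing until the columns are stripped.
placeholders=["?"," "]
for i in df.columns:
    df[i] =np.where(df[i].isin(placeholders),np.nan,df[i])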

In [None]:
df.isna().sum()

In [None]:
df['age'] = np.where(df['age'] == "?",np.nan, df["age"])

In [None]:
df.age.isna().sum()

In [None]:
df['workclass'] = np.where(df['workclass'] == "?",np.nan, df["workclass"])

In [None]:
df.workclass.isna().sum()


In [None]:

df.isna().sum()

In [None]:
for i in df.columns:
    print(i,'>> ' ,df[i].nunique())

In [None]:
df.isna().sum()

In [None]:
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
for i in categorical_features:
    print(i,">>>  ",df[i].unique(),"\n\n")

#### The values contain leading spaces, so we strip them

In [None]:
## Strip the categorical feature columns
for i in categorical_features:
    df[i]=df[i].str.strip()
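Why the stripping matters can be seen on a tiny example: while the placeholder is stored as `" ?"`, a comparison with `"?"` finds nothing. The sketch below uses a made-up `toy` frame; `hits_before` and `hits_after` are illustrative names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"workclass": [" Private", " ?", " State-gov"]})

# Before stripping, the placeholder is " ?", so comparing with "?" finds nothing.
hits_before = int((toy["workclass"] == "?").sum())

# After stripping, the same comparison finds the placeholder.
toy["workclass"] = toy["workclass"].str.strip()
hits_after = int((toy["workclass"] == "?").sum())

# Convert the placeholder into a real missing value.
toy["workclass"] = toy["workclass"].replace("?", np.nan)
print(hits_before, hits_after, int(toy["workclass"].isna().sum()))
```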

In [None]:
for i in categorical_features:
    print(i,">>>  ",df[i].unique(),"\n")

In [None]:
# With the whitespace stripped, "?" now matches and is replaced by NaN
for i in df.columns:
    df[i] =np.where(df[i]=="?",np.nan,df[i])

In [None]:
#df.replace('?', np.nan, inplace = True)

In [None]:
df.isna().sum()

In [None]:
# Check for duplicate entries in the dataset
print(df.duplicated().sum())
# We found 24 duplicated entries


In [None]:
# Dropping the duplicate entries
df.drop_duplicates(inplace=True)

In [None]:
df_copy=df.copy()

In [None]:
df.info()

In [None]:
null_df = pd.DataFrame({'Null Values' : df.isna().sum().sort_values(ascending=False), 'Percentage Null Values' : (df.isna().sum().sort_values(ascending=False)) / (df.shape[0]) * (100)})
null_df

In [None]:
# Only a small share of the rows has missing values, so we drop those rows

In [None]:
df.dropna(inplace=True)

In [None]:
df.shape

In [None]:
df_copy.shape

#### Compare the variance of the data before and after dropping rows

In [None]:
df_copy.isna().sum()>0

In [None]:
features_with_null_values=['workclass','occupation','native-country']

In [None]:
plt.figure(figsize=(15, 15))
plt.suptitle('Univariate Analysis of Numerical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)

# One KDE panel per numeric feature on a 5x2 grid
for i in range(0, len(numeric_features)):
    plt.subplot(5, 2, i+1)
    # fill=True replaces the deprecated shade=True (seaborn >= 0.11)
    sns.kdeplot(x=df[numeric_features[i]],fill=True, color='r')
    plt.xlabel(numeric_features[i])
plt.tight_layout()

In [None]:
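plt.figure(figsize=(12,7))
# Restrict to the numeric columns; the categorical columns are still strings here
cor=df[numeric_features].corr()
sns.heatmap(cor,cmap="BuPu",annot=True)
# Rotate the tick labels after the heatmap has drawn them
plt.xticks(rotation=45)
plt.yticks(rotation=45)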

In [None]:
df.columns

In [None]:
numeric_features

In [None]:
sns.set_style("whitegrid");

sns.pairplot(df,hue="class");
plt.show()

In [None]:
# The target classes are imbalanced
df["class"].value_counts()
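To put a number on the imbalance, the class shares and a majority-to-minority ratio can be computed alongside the raw counts. This is a minimal sketch on a made-up label column; the `labels`, `shares`, and `imbalance_ratio` names are illustrative, and the real figures come from `df["class"]`.

```python
import pandas as pd

# Toy label column (6 vs 2 rows); not the real class counts.
labels = pd.Series(["<=50K"] * 6 + [">50K"] * 2, name="class")

counts = labels.value_counts()                     # rows per class
shares = labels.value_counts(normalize=True)       # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()      # majority size / minority size

print(shares)
print("imbalance ratio:", imbalance_ratio)
```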

In [None]:
for features in numeric_features:
    print(features,">> ",df[features].var())

In [None]:
for features in numeric_features:
    print(features,">> ",df_copy[features].var())

In [None]:
for features in numeric_features:
    print(features,">> ",df_copy[features].skew())
    

In [None]:
for features in numeric_features:
    print(features,">> ",df_copy[features].kurtosis())
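The separate printouts above can be hard to compare by eye. One option, sketched here on made-up data, is to put the variance before and after `dropna` side by side in a single table. The toy frame and the `relative_change` column are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: two numeric columns and one categorical column with missing values.
before = pd.DataFrame({
    "age": [25, 38, 52, 41, 30],
    "hours-per-week": [40, 50, 40, 20, 60],
    "workclass": ["Private", np.nan, "State-gov", "Private", np.nan],
})
after = before.dropna()          # rows with any missing value removed
num_cols = ["age", "hours-per-week"]

# Variance of each numeric column before and after dropping rows.
comparison = pd.DataFrame({
    "var_before": before[num_cols].var(),
    "var_after": after[num_cols].var(),
})
comparison["relative_change"] = comparison["var_after"] / comparison["var_before"] - 1
print(comparison)
```

The same pattern works for `skew()` and `kurtosis()` by swapping the aggregation call.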

In [None]:
for feature in categorical_features:
    print(feature,">> \n",df[feature].value_counts(),"\n")

In [None]:
# Count plot for one categorical column, bars ordered by frequency
def count_plot(x):
    plt.figure(figsize=(15,8))
    plt.suptitle('Univariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
    # Order by the plotted column x (the original used the global loop variable i)
    sns.countplot(x=df[x],palette="Set2",order=df[x].value_counts().index)
    plt.xlabel(x)
    plt.xticks(rotation=45)

In [None]:
for i in categorical_features:
    count_plot(i)

In [None]:
plt.figure(figsize=(15,8))
plt.suptitle('Univariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
sns.countplot(x=df['education'],palette="Set2",order=df["education"].value_counts().index)
plt.xlabel("education")
plt.xticks(rotation=45)

In [None]:
plt.figure(figsize=(15,8))
plt.suptitle('Univariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
sns.countplot(x=df['education-num'],palette="Set2",order=df["education-num"].value_counts().index)
plt.xlabel("education-num")
plt.xticks(rotation=45)

In [None]:
df.education.unique()

In [None]:
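# Count rows per (education-num, education) pair. Pairing value_counts() with
# unique() would misalign the two columns, because their orderings differ.
edu_df=(df.groupby(["education-num","education"]).size()
          .reset_index(name="Education_number"))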

In [None]:
edu_df.rename({"Education_number":"Count"},axis=1,inplace=True)
edu_df
# education-num is a numeric encoding of education, so the two columns carry the same information and we drop the education column
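The claim that the two columns carry the same information can be checked before dropping one of them: every `education-num` should map to exactly one `education` label, and vice versa. A minimal sketch on a made-up frame; `toy`, `labels_per_code`, `codes_per_label`, and `one_to_one` are illustrative names.

```python
import pandas as pd

# Toy frame with the two columns under comparison.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad"],
    "education-num": [13, 9, 13, 14, 9],
})

# Distinct labels per numeric code, and distinct codes per label.
labels_per_code = toy.groupby("education-num")["education"].nunique()
codes_per_label = toy.groupby("education")["education-num"].nunique()

# The mapping is one-to-one only if both directions give exactly one match.
one_to_one = bool((labels_per_code == 1).all() and (codes_per_label == 1).all())
print(one_to_one)
```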

In [None]:
df.drop("education",axis=1,inplace=True)
df

In [None]:
categorical_features.remove("education")

In [None]:
categorical_features

In [None]:
# LabelEncoder maps each category to an integer code
from sklearn.preprocessing import LabelEncoder

In [None]:
# Fit one LabelEncoder per categorical column and store each fitted encoder with joblib
from joblib import dump, load
for i in categorical_features:
    label=LabelEncoder()  # fresh encoder per column
    df[i]=label.fit_transform(df[i])
    # Name of the file, e.g. 'workclass.joblib'
    joblib_file = '{}.joblib'.format(i)
    dump(label, joblib_file)
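The stored encoders are useful later because they reproduce exactly the same category-to-integer mapping. A hedged sketch of the round trip on a made-up column, written to a temporary directory so no notebook files are touched; `toy`, `restored`, and the temporary path are assumptions for the example.

```python
import os
import tempfile

import pandas as pd
from joblib import dump, load
from sklearn.preprocessing import LabelEncoder

toy = pd.Series(["Private", "State-gov", "Private", "Local-gov"], name="workclass")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "workclass.joblib")

    # Fit the encoder and store it on disk.
    encoder = LabelEncoder().fit(toy)
    dump(encoder, path)

    # Later (for example at prediction time): reload the same mapping.
    restored = load(path)

codes = restored.transform(toy)               # strings to integer codes
decoded = restored.inverse_transform(codes)   # integer codes back to strings
print(list(restored.classes_), list(codes), list(decoded))
```

`LabelEncoder` sorts the classes, so the codes depend only on the set of categories seen during fitting, not on their order in the data.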
    

In [None]:
import os
os.listdir()

In [None]:
plt.figure(figsize=(15,12))
sns.heatmap(df.corr(),cmap="Greens",annot=True)

In [None]:
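def create_comparison_plot(df,column):
    # Distribution (left) and box plot (right) of one numeric column
    plt.figure(figsize=(16,8))
    plt.subplot(2,2,1)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(df[column],kde=True,stat="density")

    plt.subplot(2,2,2)
    sns.boxplot(x=df[column],color="g")

    plt.show()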

In [None]:
numeric_features

In [None]:
create_comparison_plot(df,'age')

In [None]:
create_comparison_plot(df,'fnlwgt')

In [None]:
create_comparison_plot(df,'education-num')

In [None]:
create_comparison_plot(df,'hours-per-week')

In [None]:
create_comparison_plot(df,'capital-gain')

In [None]:
create_comparison_plot(df,'capital-loss')

In [None]:
categorical_features

In [None]:
# Histograms of capital-gain and capital-loss (pandas plotting), one per panel
fig, axs = plt.subplots(2, 1, figsize=(15, 7))

df['capital-gain'].plot.hist(color='red',ax=axs[0],alpha=0.5,label='Size')
df['capital-loss'].plot.hist(color='green',ax=axs[1],alpha=0.5,label='Size')


In [None]:
df.to_csv("Processed_data.csv",index=False)

In [None]:
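# df.to_json already returns a JSON string, so it is written directly;
# passing it through json.dump would encode it a second time as one quoted string
data=df.to_json(orient="records",indent=6)
# the json file where the output must be stored
with open("processed1.json", "w") as out_file:
    out_file.write(data)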
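A quick sanity check for the exported file is that parsing it yields a list of row records rather than one long string. The sketch below shows the round trip on a made-up frame in a temporary directory; `toy`, `records`, and the file name `processed_toy.json` are illustrative assumptions.

```python
import json
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"age": [39, 50], "class": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_toy.json")
    toy.to_json(path, orient="records")

    # Parsing the file should give a list of dicts, one per row.
    with open(path) as f:
        records = json.load(f)

restored = pd.DataFrame(records)
print(type(records).__name__, len(records))
print(restored)
```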


In [None]:
df.columns