# Adult Income: Exploratory Analysis and Prediction

This notebook walks you through the steps of a machine learning project life cycle, from business understanding to presenting the final result to the business.  

## 1- Business Understanding 
## 2- Data Acquisition  
          Automatic Data Acquisition  
          Convert data into a Pandas Data Frame
          
## 3- Data Munging  
          Treating missing values
          Working with outliers
          
## 4- Exploratory Data Analysis 
          Univariate Analysis      
          Bivariate analysis           
          
## 5- Feature Engineering 
          Derived Features
          Categorical Feature encoding
          
## 6- Preparation, Models and Evaluation    
          Preparation
          Models and Evaluation



## 1- Business Understanding  
Our data describes the annual income of individuals together with various factors (education level, occupation, gender, age, etc.). 
Given a new individual, our goal is to predict whether that person makes more than 50K a year. 

## 2- Data Acquisition  
We are going to download our dataset in **text** format from the **[UCI Machine Learning](https://archive.ics.uci.edu/ml/datasets/adult)** website. Here are the libraries that we will use to acquire the dataset.  

In [None]:
import requests
import os

In [None]:
# Download each file from the UCI website into the local data directory
def aquire_data(path_to_data, data_urls):
    # Create the target directory if it does not exist yet
    os.makedirs(path_to_data, exist_ok=True)

    for url in data_urls:
        response = requests.get(url)
        response.raise_for_status()  # stop early if a download failed
        filename = os.path.join(path_to_data, os.path.basename(url))
        with open(filename, 'wb') as file:
            file.write(response.content)

In [None]:
data_urls = ["https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
             "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names",
             "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"]

aquire_data('data', data_urls)

In [None]:
# Check the success of accessing the data
print('Output n° {}\n'.format(1))
! find data

We can see that all our files have been downloaded from the UCI website. Here we have:  
* **adult.names**: the description of the dataset, including the column names   
* **adult.data**: all the observations in the training data  
* **adult.test**: all the observations in the test data  


In [None]:
column_names = ["Age", "Workclass", "fnlwgt", "Education", "Education-Num", 
                "Martial Status", "Occupation", "Relationship", "Race", "Sex", 
                "Capital-Gain", "Capital-Loss", "Hours-per-week", "Country", "Income"] 


### Convert Data into a Pandas Data Frame  

In [None]:
import pandas as pd
import numpy as np

Here we are going to load the training and the test datasets. 
The column names come from the **column_names** variable defined above. We use the regular expression **' \*, \*'** as the separator, so that the whitespace around each comma is trimmed. Missing values are marked with **?** in the raw files, so **na_values='?'** makes pandas read them as missing. For the test file, **skiprows=1** skips the first line, which is not an observation. Finally, we specify **engine='python'** because regular expression separators are only supported by the Python parser; this avoids the fallback warning.  

In [None]:
train = pd.read_csv('data/adult.data', names=column_names, sep=' *, *', na_values='?', 
                   engine='python')
test = pd.read_csv('data/adult.test', names=column_names, sep=' *, *', skiprows=1, 
                   engine='python', na_values='?')
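
As a minimal sketch of what these options do, the snippet below parses two made-up rows with the same separator and missing-value settings. The rows and the shortened column list are illustrative assumptions, not values taken from the dataset.

```python
import io

import pandas as pd

# Two made-up rows in the same "value, value, ..." layout as adult.data
raw = "39, State-gov, 13, <=50K\n50, ?, 9, >50K\n"

sample = pd.read_csv(
    io.StringIO(raw),
    names=["Age", "Workclass", "Education-Num", "Income"],
    sep=r" *, *",      # a comma with optional spaces on both sides
    na_values="?",     # read '?' as a missing value
    engine="python",   # regex separators need the Python parser
)

print(sample)
print("Missing Workclass values:", sample["Workclass"].isnull().sum())
```

The second row ends up with a missing **Workclass**, and the strings carry no leading spaces.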

In [None]:
test.Income.unique() 

In [None]:
train.Income.unique()

We need to transform the **Income** column of the test data in order to remove the trailing **"."** from its values.  

In [None]:
test.Income = np.where(test.Income == '<=50K.', '<=50K', '>50K')
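
Note that the `np.where` call above maps every value other than `'<=50K.'` to `'>50K'`. An alternative sketch, assuming the only difference between the two label sets is the trailing dot, strips that character directly. The labels below are made up for illustration.

```python
import pandas as pd

# Made-up labels in the style of the test file
labels = pd.Series(['<=50K.', '>50K.', '<=50K.'])

# Remove the trailing dot and leave everything else untouched
cleaned = labels.str.rstrip('.')
print(cleaned.unique())
```

With this variant, an unexpected label would stay visible instead of being silently turned into `'>50K'`.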

In [None]:
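# Concatenate train and test. We will split the data again before the training phase.
# ignore_index=True gives every row a unique index label, so that later
# label-based operations (e.g. drop) cannot hit rows from both files at once.
df = pd.concat((train, test), axis=0, ignore_index=True)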

In [None]:
df.Income.unique()

In [None]:
print('Output n° {}\n'.format(2))

'''
First 5 observations
'''
df.head()

In [None]:
print('Output n° {}\n'.format(3))

'''
Last 5 observations
'''
df.tail()

In [None]:
print('Output n° {}\n'.format(4))

print('Our data contains {} observations and {} columns.'.format(df.shape[0],
                                                                df.shape[1]))

## 3- Data Munging 
In this step, we will perform two main tasks.  
* **Dealing with missing values**    
Missing data is very common during data collection and can occur for many reasons (confidentiality, errors, etc.). It is important to understand where values are missing, so that we can fill them with an appropriate technique before applying any machine learning algorithm.    


* **Dealing with outliers**     
Outliers are values that lie far away from the values typically observed in the data. They can distort our final model and can even lead us to wrong conclusions during the analysis step.  

#### A- Treating missing values   
We will use pandas **isnull()** function to look at all the missing values for each column.  

In [None]:
print('Output n° {}\n'.format(5))
print(df.isnull().sum())

To the left, we have the name of the features and the number of missing values to the right. We can see that:   
* **Workclass** has 1836 missing values   
* **Occupation** has 1843 missing values  
* **Country** has 583 missing values   

To deal with the missing data, we could remove all the records (rows/observations) that contain missing values. This is not a good choice in our case, because we would lose a lot of data. Instead, we will use the following technique:  
* Replace missing values in categorical columns with the mode (most frequent category) of that column.   
* Replace missing values in numerical columns with the median of that column. We could use the mean instead of the median, but the mean is very sensitive to outliers (extreme values).     

To identify the type of each column, we can use the pandas **dtypes** attribute.   
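
The two replacement rules above can be sketched on a small made-up frame. The column values are illustrative, and the use of `pd.api.types.is_numeric_dtype` to pick the rule is an assumption of this sketch rather than the approach used later in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up data with one categorical and one numerical column
toy = pd.DataFrame({
    "Workclass": ["Private", "Private", np.nan, "State-gov"],
    "Age": [25.0, np.nan, 40.0, 31.0],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        fill_value = toy[col].median()   # robust to extreme values
    else:
        fill_value = toy[col].mode()[0]  # most frequent category
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```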



In [None]:
print('Output n° {}\n'.format(6))
print(df.dtypes)

To the left, we have the column names, and their corresponding types to the right. We can see that the columns with missing values (discussed previously) are all categorical (object).    
Then, we can look at the distinct (unique) values in each of these columns with the pandas **unique()** function.  

In [None]:
# Workclass  
print('Output n° {}\n'.format(7))
print('Number of unique values: {}'.format(len(df['Workclass'].unique())))
print(df['Workclass'].unique())

Workclass has 9 unique values including **nan** (missing value)

In [None]:
# Occupation  
print('Output n° {}\n'.format(8))
print('Number of unique values: {}'.format(len(df['Occupation'].unique())))
print(df['Occupation'].unique())

The Occupation column has 15 unique values, including **nan** 

In [None]:
# Country  
print('Output n° {}\n'.format(9))
print('Number of unique values: {}'.format(len(df['Country'].unique())))
print(df['Country'].unique())


The Country column has 42 unique values, including **nan** 

We know all the columns with missing values, and their type. We also have an idea of the unique values of each of those columns, now, we can perform the missing values replacement process.   

To do so, we will create a helper function that performs this task for each column, using the **mode()** function from Python's built-in **statistics** module.

In [None]:
import statistics as stat

In [None]:
def fill_categorical_missing(data, column):
    # Compute the mode on the non-missing values only, then fill the gaps in place
    data.loc[data[column].isnull(), column] = stat.mode(data[column].dropna())

In [None]:
cols_to_fill = ['Workclass', 'Occupation', 'Country']

for col in cols_to_fill:
    fill_categorical_missing(df, col)

print('Output n° {}\n'.format(10))

# Check whether any missing values remain
print(df.isnull().sum())

We can see that all the values to the right are equal to zero, which means that we have no missing values in our dataset.    

#### B- Dealing with outliers  
To identify outliers in our dataset, we will apply the **seaborn** **boxplot** function to each numerical column, and show the result with **matplotlib**'s **show()** function.    
With the help of **Output n°6 (i.e. print(df.dtypes))**, we can see all our numerical columns. A better way to look at them is the pandas **describe()** function, which gives more statistical information about all the numerical columns.  

In this part, we use a copy of our dataset for the outlier analysis, then create a helper function that is finally applied to the original data for outlier removal.

In [None]:
df_cp = df.copy()

In [None]:
df_cp.head()

In [None]:
df_cp.describe()

We have 6 numerical columns (Age to Hours-per-week). To the left, we have several statistics:  
* **count**: the total number of observations for each column   
* **mean**: the mean value of each column   
* **std**: the standard deviation    
* **25%**, **50%** and **75%**: the quartiles 

With the quartiles, min and max, the data can be split into 4 buckets:  
* Bucket 1: below 25%; e.g. for the **Age** column, 25% of people are under **28 years old**.
* Bucket 2: between 25% and 50%; 25% of them (50%-25%) are between **28 and 37 years old**.  
* Bucket 3: between 50% and 75%; 25% of them are between **37 and 48 years old**.  
* Bucket 4: above 75%; 25% of them are over **48 years old**.  

**Then all the values more than 1.5 x IQR below the 25th percentile or above the 75th percentile are considered outliers.**  
IQR = Interquartile Range = 75th percentile - 25th percentile.   

This image gives a better understanding of a boxplot.   
![](https://www.researchgate.net/publication/318986284/figure/fig1/AS:525404105646080@1502277508250/Boxplot-with-outliers-The-upper-and-lower-fences-represent-values-more-and-less-than.png)
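
The 1.5 x IQR rule can be sketched in a few lines. The helper name `iqr_fences` and the sample ages are assumptions made for this illustration.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the lower and upper fences used by a standard boxplot."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages; 90 lies far above the rest
ages = np.array([17, 28, 37, 48, 90])
lower, upper = iqr_fences(ages)
outliers = ages[(ages < lower) | (ages > upper)]

print('Fences: [{}, {}]'.format(lower, upper))
print('Outliers: {}'.format(outliers))
```

Here the quartiles are 28 and 48, so the IQR is 20, the fences are -2 and 78, and only 90 falls outside them.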

Next, we will create a helper function that removes the outliers from our dataset. But before that, let's have a look at the boxplots.   

In [None]:
import seaborn as sns 
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Age 
sns.boxplot(y='Age', data=df_cp)
plt.show()

Let's compute the 0th to 100th percentiles to find a suitable percentile value for removing outliers.

In [None]:
def ten_to_ten_percentiles(data, column):
    # Sort once, then read off the value at every 10th percentile
    var = np.sort(data[column].values, axis=None)
    for i in range(0, 100, 10):
        print('{} percentile value is {}'.format(i, var[int(len(var) * (float(i)/100))]))
    print('100 percentile value is {}'.format(var[-1]))

In [None]:
ten_to_ten_percentiles(df_cp, 'Age')

The boxplot of Age shows no extreme value, and the percentile values confirm this observation. 

In [None]:
# Print the column value at each percentile 90, 91, ..., 99 and 100
def percentiles_from_90(data, column):
    # Sort once, then read off the value at each percentile
    var = np.sort(data[column].values, axis=None)
    for i in range(90, 100):
        print('{} percentile value is {}'.format(i, var[int(len(var) * (float(i)/100))]))
    print('100 percentile value is {}'.format(var[-1]))

Going deeper into the percentile values gives us more information. The next function prints the value at each 0.1 step from the 99th to the 100th percentile. 

In [None]:
# Print the column value at each percentile 99.0, 99.1, ..., 99.9 and 100
def percentiles_from_99(data, column):
    # Sort once, then read off the value at each 0.1 step above the 99th percentile
    var = np.sort(data[column].values, axis=None)
    for i in np.arange(0.0, 1.0, 0.1):
        # {:.1f} avoids floating-point noise such as 99.30000000000001 in the label
        print("{:.1f} percentile value is {}".format(99 + i, var[int(len(var) * (float(99 + i) / 100))]))
    print("100 percentile value is {}".format(var[-1]))
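
The helpers above read percentiles by indexing into a sorted array. As a hedged alternative, NumPy's `np.percentile` returns several percentiles in one call. It interpolates linearly between neighbouring values by default, so its results can differ slightly from the index-based ones. The input below is a made-up array.

```python
import numpy as np

# Made-up data: the integers 1..100
values = np.arange(1, 101)

# Percentiles of interest, all computed in a single call
qs = [0, 50, 90, 99, 100]
results = np.percentile(values, qs)

for q, v in zip(qs, results):
    print('{} percentile value is {}'.format(q, v))
```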

In [None]:
# Education-Num
sns.boxplot(y='Education-Num', data=df_cp)
plt.show()

In [None]:
ten_to_ten_percentiles(df_cp, 'Education-Num')

There are no anomalies in Education-Num. 

In [None]:
# Capital-Gain
sns.boxplot(y='Capital-Gain', data=df_cp)
plt.show()

In [None]:
ten_to_ten_percentiles(df_cp, 'Capital-Gain')

In [None]:
percentiles_from_90(df_cp, 'Capital-Gain')

In [None]:
percentiles_from_99(df_cp, 'Capital-Gain')

In [None]:
# Removing the outliers based on 99.5th percentile of Capital-Gain
df_cp = df_cp[df_cp['Capital-Gain']<=34095]

In [None]:
# Capital-Gain
sns.boxplot(y='Capital-Gain', data=df_cp)
plt.show()

In [None]:
# Capital-Loss
sns.boxplot(y='Capital-Loss', data=df_cp)
plt.show()

In [None]:
ten_to_ten_percentiles(df_cp, 'Capital-Loss')

In [None]:
percentiles_from_90(df_cp, 'Capital-Loss')

In [None]:
percentiles_from_99(df_cp, 'Capital-Loss')

Unlike Capital-Gain, Capital-Loss shows no particular extreme value. 

In [None]:
# Hours-per-week
sns.boxplot(y='Hours-per-week', data=df_cp)
plt.show()

In [None]:
ten_to_ten_percentiles(df_cp, 'Hours-per-week')

There is no special extreme value here. 

Now, we are going to create a helper function that removes the outliers, based on our previous univariate analysis.  

In [None]:
def remove_outliers(data):
    a = data.shape[0]
    print("Number of salary records = {}".format(a))

    # Keep only the rows at or below the Capital-Gain cutoff found above.
    # The copy makes the result independent of the original dataframe.
    data = data[data['Capital-Gain'] <= 34095].copy()
    b = data.shape[0]

    print('Number of outliers from the Capital-Gain column = {}'.format(a - b))
    print('Total outliers removed = {}'.format(a - b))
    print('-----'*10)
    return data

In [None]:
print('Removing all the outliers from the data')
print('-----'*10)
df_no_outliers = remove_outliers(df)

proportion_remaining_data = float(len(df_no_outliers)) / len(df)
print('Proportion of observations that remain after removing outliers = {}'.format(proportion_remaining_data))
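
The cutoff above is hard-coded. A possible variant, shown here only as a sketch, derives it from a quantile of the column. The helper name `remove_above_quantile` and the toy data are hypothetical, and `Series.quantile` interpolates between values, so its cutoff need not equal the index-based percentile printed earlier.

```python
import pandas as pd

def remove_above_quantile(data, column, q=0.995):
    """Drop the rows whose value in `column` exceeds the q-th quantile."""
    cutoff = data[column].quantile(q)
    return data[data[column] <= cutoff].copy(), cutoff

# Made-up data: 199 zeros and one extreme value
toy = pd.DataFrame({'Capital-Gain': [0] * 199 + [99999]})
kept, cutoff = remove_above_quantile(toy, 'Capital-Gain')

print('Cutoff = {}'.format(cutoff))
print('Rows kept = {} of {}'.format(len(kept), len(toy)))
```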

After removing the outliers from our data, 99.49% of the dataset remains. 

## 4- Exploratory Data Analysis   

First things first! 
Let's take a look at the number of people who make more than 50K and those who don't.

In [None]:
df_no_outliers.Income.unique()

In [None]:
palette = {"<=50K":"r", ">50K":"g"}
sns.countplot(x="Income", data=df_no_outliers, hue="Income", palette=palette)

We can see that 24720 adults make at most 50K dollars and only 7841 make more than 50K dollars. So, only 24% of adults make more than 50K dollars.
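
Counts and shares like these can be read directly from `value_counts`. The sketch below uses a made-up label column, not the real data.

```python
import pandas as pd

# Made-up labels: three low-income records and one high-income record
income = pd.Series(['<=50K', '<=50K', '<=50K', '>50K'])

counts = income.value_counts()                # absolute counts per class
shares = income.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```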

#### A- Numerical Data   
For this part, we will look at measures of centrality (mean, median) and measures of dispersion (range, percentiles, variance, standard deviation).  
Most of this information can be obtained with the pandas **describe()** function.  

In [None]:
df_no_outliers.describe()

From this result, we can see that our features are on different scales; this information will be useful for the feature engineering step. For simple visualization purposes, we can plot the probability density of each of those features. 

##### A.1- Univariate Analysis 

In [None]:
# Age  
df_no_outliers.Age.plot(kind='kde', title='Density plot for Age', color='c')

Here, the Age feature has a positively skewed (right-skewed) distribution. 
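
Skewness can also be checked numerically with the pandas `Series.skew()` method: a positive value indicates a longer right tail. The ages below are made up for illustration.

```python
import pandas as pd

# Made-up ages with one large value stretching the right tail
ages = pd.Series([20, 22, 25, 30, 90])

skewness = ages.skew()
print('Skewness = {:.2f}'.format(skewness))
print('Right-skewed' if skewness > 0 else 'Not right-skewed')
```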

In [None]:
# Capital-Gain  
df_no_outliers['Capital-Gain'].plot(kind='kde', title='Density plot for Capital-Gain', color='c')

In [None]:
# Capital-Loss  
df_no_outliers['Capital-Loss'].plot(kind='kde', title='Density plot for Capital-Loss', color='c')

In [None]:
# Hours-per-week  
df_no_outliers['Hours-per-week'].plot(kind='kde', title='Density plot for Hours-per-week', color='c')

We will need to deal with the skewed distributions of our numerical features in the feature engineering part. 

##### A.2- Bivariate analysis  
We will try to determine the correlation between some numerical data.

In [None]:
# Capital-Gain and Education-Num 
# use scatter plot for bi-variate distribution
df_no_outliers.plot.scatter(x='Education-Num', y='Capital-Gain', color='c', title='scatter plot : Education-Num vs Capital-Gain');

We have a positive relationship between the number of years of education and the capital gain: the more educated people are, the more capital gain they are likely to have. 

In [None]:
# Hours-per-week and Education-Num 
# use scatter plot for bi-variate distribution
df_no_outliers.plot.scatter(x='Education-Num', y='Hours-per-week', color='c', title='scatter plot : Education-Num vs Hours-per-week');

There is no interesting pattern. 

In [None]:
# Capital-Gain and Hours-per-week
# use scatter plot for bi-variate distribution
df_no_outliers.plot.scatter(x='Hours-per-week', y='Capital-Gain', color='c', title='scatter plot : Hours-per-week vs Capital-Gain');

We cannot identify any interesting pattern from this visualization. 

In [None]:
# Capital-Gain and Capital-Loss
# use scatter plot for bi-variate distribution
df_no_outliers.plot.scatter(x='Capital-Gain', y='Capital-Loss', color='c', title='scatter plot : Capital-Loss vs Capital-Gain');

The people who report capital losses are mostly those without any capital gain. One possible explanation is that, without any capital gain, you would need to borrow with interest, and then keep **"surviving"**. 

In [None]:
numerical_cols = ['int64']  
plt.figure(figsize=(10, 10))
sns.heatmap( 
            df_no_outliers.select_dtypes(include=numerical_cols).corr(),
            cmap=plt.cm.RdBu, 
            vmax=1.0,
            linewidths=0.1,
            linecolor='white', 
            square=True,
            annot=True
)

From the correlation matrix, we can see that the level of relationship is very low between the numerical features.  


#### B- Categorical Data

There are many explorations we can do in order to have a better understanding of the data.   
Here are some possibilities we could have:  
* B.1- Income VS Occupation for countries in each continent
* B.2- Income VS Workclass for countries in each continent
* B.3- Income VS Marital Status for countries in each continent
* B.4- Mean Capital Gain VS Marital Status for each continent


In [None]:
df_no_outliers.head()

We have many countries from different continents. For better visualization, it might be interesting to create a new column **Continent** in order to easily group information per continent and the corresponding countries. 

In [None]:
df_no_outliers['Country'].unique()

There is a country name called **South**, which is definitely an error. It could be read as part of a **continent** name, and we could then associate it with the corresponding continent. But here is the problem: both **South-America** and **South-Asia** are possible values. To avoid introducing more errors into our data, it is better to remove the corresponding observations, provided that this does not lose too much data.  

In [None]:
south_df = df_no_outliers[df_no_outliers['Country']=='South']
a = south_df.shape[0]
b = df_no_outliers.shape[0]

print('{} rows correspond to South, which represents {}% of the data'.format(a, (1.0*a/b)*100))

We can remove all the rows with **Country == South**, because they represent only 0.244% of the dataset. 

In [None]:
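# Keep every row whose Country is not 'South'.
# Filtering with a mask avoids dropping by index label and modifying a slice in place.
df_no_outliers = df_no_outliers[df_no_outliers['Country'] != 'South'].copy()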

We are going to perform the following preprocessing:  
* Outlying-US(Guam-USVI-etc) ==> Outlying-US   
* Trinadad&Tobago ==> Trinadad-Tobago  
* Hong ==> Hong-Kong

In [None]:
# Changing the corresponding values.
df_no_outliers.loc[df_no_outliers['Country']=='Outlying-US(Guam-USVI-etc)', 'Country'] = 'Outlying-US'
df_no_outliers.loc[df_no_outliers['Country']=='Trinadad&Tobago', 'Country'] = 'Trinadad-Tobago'
df_no_outliers.loc[df_no_outliers['Country']=='Hong', 'Country'] = 'Hong-Kong'

In [None]:
# Check if the process worked
df_no_outliers['Country'].unique()

We can clearly see that the changes have been made. 

In [None]:
asia = ['India', 'Iran', 'Philippines', 'Cambodia', 'Thailand', 'Laos', 'Taiwan', 
       'China', 'Japan', 'Vietnam', 'Hong-Kong']  

america = ['United-States', 'Cuba', 'Jamaica', 'Mexico', 'Puerto-Rico', 'Honduras', 
           'Canada', 'Columbia', 'Ecuador', 'Haiti', 'Dominican-Republic', 
           'El-Salvador', 'Guatemala', 'Peru', 'Outlying-US', 'Trinadad-Tobago', 
           'Nicaragua']  

europe = ['England', 'Germany', 'Italy', 'Poland', 'Portugal', 'France', 'Yugoslavia', 
          'Scotland', 'Greece', 'Ireland', 'Hungary', 'Holand-Netherlands'] 

In [None]:
# Now, create a dictionary to map each country to its corresponding continent. 
continents = {country: 'Asia' for country in asia}
continents.update({country: 'America' for country in america})
continents.update({country: 'Europe' for country in europe})

In [None]:
# Then use Pandas map function to map continents to countries  
df_no_outliers['Continent'] = df_no_outliers['Country'].map(continents)

Here, we have the continents corresponding to all the existing countries in our dataset.

In [None]:
df_no_outliers['Continent'].unique()
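
`Series.map` returns a missing value for any country that is absent from the dictionary, so it is worth checking for unmapped names. The sketch below shows the idea on made-up values; `'Atlantis'` is a deliberately fictitious country.

```python
import pandas as pd

# Made-up country column and a deliberately incomplete mapping
countries = pd.Series(['India', 'France', 'Atlantis'])
continent_of = {'India': 'Asia', 'France': 'Europe'}

mapped = countries.map(continent_of)

# Countries that did not receive a continent
unmapped = countries[mapped.isnull()].unique()
print('Unmapped countries: {}'.format(list(unmapped)))
```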

## B.1- Income VS Occupation for countries in each continent  
I created a helper function in order to draw the plot for every country of a continent in one shot. 

In [None]:
def Occupation_VS_Income(continent):
    choice = df_no_outliers[df_no_outliers['Continent']==continent] 
    countries = list(choice['Country'].unique())

    for country in countries:
        # Rows of the current country only
        subset = choice[choice['Country']==country]
        pd.crosstab(subset.Occupation, subset.Income).plot(
            kind='bar', title='Income VS Occupation in {}'.format(country))
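
To see the table behind each bar chart, here is `pd.crosstab` applied to a small made-up frame. The occupations and labels are illustrative only.

```python
import pandas as pd

# Made-up records for a single country
toy = pd.DataFrame({
    'Occupation': ['Sales', 'Sales', 'Tech-support', 'Sales'],
    'Income': ['<=50K', '>50K', '<=50K', '<=50K'],
})

# One row per occupation, one column per income class, cells hold counts
table = pd.crosstab(toy['Occupation'], toy['Income'])
print(table)
```

Calling `.plot(kind='bar')` on such a table draws one group of bars per occupation.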

### B.1.1- For Asia

In [None]:
Occupation_VS_Income('Asia')

### B.1.2- For America

In [None]:
Occupation_VS_Income('America')

### B.1.3- For Europe

In [None]:
Occupation_VS_Income('Europe')

## B.2- Income VS Workclass for countries in each continent  

In [None]:
def Workclass_VS_Income(continent):
    choice = df_no_outliers[df_no_outliers['Continent']==continent] 
    countries = list(choice['Country'].unique())

    for country in countries:
        # Rows of the current country only
        subset = choice[choice['Country']==country]
        pd.crosstab(subset.Workclass, subset.Income).plot(
            kind='bar', title='Income VS Workclass in {}'.format(country))

### B.2.1- For Asia

In [None]:
Workclass_VS_Income('Asia')

### B.2.2- For America

In [None]:
Workclass_VS_Income('America')

### B.2.3- For Europe

In [None]:
Workclass_VS_Income('Europe')

## B.3- Income VS Marital Status for countries in each continent  

In [None]:
def MaritalStatus_VS_Income(continent):
    choice = df_no_outliers[df_no_outliers['Continent']==continent] 
    countries = list(choice['Country'].unique())

    for country in countries:
        # Rows of the current country only
        subset = choice[choice['Country']==country]
        pd.crosstab(subset['Martial Status'], subset.Income).plot(
            kind='bar', title='Income VS Marital Status in {}'.format(country))

### B.3.1- For Asia

In [None]:
MaritalStatus_VS_Income('Asia')

## B.4- Mean Capital Gain VS Marital Status for each continent

To accomplish this task, I will create a new dataframe containing the result of grouping by Continent, Country and Marital Status, together with the **mean value of Capital Gain**.

In [None]:
# reset_index(): converts the aggregation result into a pandas dataframe.
agg_df = df_no_outliers.groupby(['Continent','Country', 'Martial Status'])['Capital-Gain'].mean().reset_index()

In [None]:
# Rename the aggregated column to make clear that it holds a mean
agg_df = agg_df.rename(columns={'Capital-Gain': 'Mean_Capital_Gain'})

In [None]:
agg_df.head()

In [None]:
import seaborn as sns

In [None]:
def Mean_TotCapital_VS_Marital_Status(continent):
    choice = agg_df[agg_df['Continent']==continent] 
    countries = list(choice['Country'].unique())

    for country in countries:
        df_c = choice[choice['Country']==country]
        ax = sns.catplot(x='Martial Status', y='Mean_Capital_Gain', 
                         kind='bar', data=df_c)

        ax.fig.suptitle('Country: {}'.format(country))
        ax.fig.autofmt_xdate()

### B.4.1- For Asia

In [None]:
Mean_TotCapital_VS_Marital_Status('Asia')

### B.4.2- For America

In [None]:
Mean_TotCapital_VS_Marital_Status('America')

### B.4.3- For Europe

In [None]:
Mean_TotCapital_VS_Marital_Status('Europe')

## 5- Feature Engineering   
This is one of the most crucial aspects of a Data Science project. It is the process of transforming the raw data into more representative 
features in order to create better predictive models. 

#### A- Derived Features   
Sometimes, it is important to perform some transformations on the features/columns in order to reduce the number of original data columns. 
Let's start looking at our columns.

##### A.1- Education and Education-Num  

In [None]:
edu = df_no_outliers.Education.unique()
eduNum = df_no_outliers['Education-Num'].unique()
print('Education: \nTotal category:{}\nValues: {}\n'.format(len(edu),list(edu)))
print('Education Num: \nTotal Education-Num:{}\nValues: {}'.format(len(eduNum),
                                                                  list(eduNum)))

We can see that **Education-Num** seems to be the numerical representation of **Education**, and both have the same number of distinct values (16). If so, we need only one of the two columns.  
Let's verify this hypothesis by visualizing the two columns and checking the correspondence between **Education-Num** and **Education**.  

In [None]:
ax = sns.catplot(x='Education', y='Education-Num', kind='bar', data=df_no_outliers)
ax.fig.suptitle('Numerical Representation of Educations')
ax.fig.autofmt_xdate()

From the previous plot, we can see that: 
* Bachelors <==> 13  
* HS-grad <==> 9
* 7th-8th <==> 4   
* 9th <==> 5    
* Preschool <==> 1 
* etc.  

Based on this information, we need only one column to represent the **level of education**. In our case,   
we will keep **Education-Num**, which is the numerical representation, and remove the **Education** column.  
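
The one-to-one correspondence can also be checked without a plot. The sketch below counts, on a made-up frame, how many distinct codes each label has and vice versa; the rows are illustrative only.

```python
import pandas as pd

# Made-up rows with the two education columns
toy = pd.DataFrame({
    'Education': ['Bachelors', 'HS-grad', 'Bachelors', '9th'],
    'Education-Num': [13, 9, 13, 5],
})

# Each label should map to exactly one code, and each code to exactly one label
codes_per_label = toy.groupby('Education')['Education-Num'].nunique()
labels_per_code = toy.groupby('Education-Num')['Education'].nunique()

is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print('One-to-one mapping: {}'.format(is_one_to_one))
```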

In [None]:
# Finally remove the Education column  
df_no_outliers.drop('Education', axis=1, inplace=True)

##### A.2- Capital-Loss and Capital-Gain  
From those two features, we can create a new column called **Capital-State** that will be the difference between Capital-Gain and Capital-Loss.  
Then we will remove those two features.  

In [None]:
df_no_outliers['Capital-State'] = df_no_outliers['Capital-Gain'] - df_no_outliers['Capital-Loss']

In [None]:
# Then remove Capital-Gain and Capital-Loss. 
df_no_outliers.drop(['Capital-Gain', 'Capital-Loss'], axis=1, inplace=True)

In [None]:
'''
Let's not forget to drop the 'Continent' column we added for 
visualization purpose. 
'''
df_no_outliers.drop('Continent', axis=1, inplace=True)

In [None]:
df_no_outliers.head(3)

##### A.3- Age State (Adult or Child)   
A person aged 18 or older is an adult. Otherwise he/she is a child.  

In [None]:
# AgeState based on Age
df_no_outliers['AgeState'] = np.where(df_no_outliers['Age'] >= 18, 'Adult', 'Child')

In [None]:
# AgeState Counts  
df_no_outliers['AgeState'].value_counts()

In [None]:
sns.countplot(x='AgeState', data=df_no_outliers)

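The **fnlwgt** column is a sampling weight assigned by the census (the number of people each record is estimated to represent), not a characteristic of the individual, so we will not use it as a feature. 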

In [None]:
df_no_outliers.drop('fnlwgt', axis=1, inplace=True)

In [None]:
df_no_outliers.head()

In [None]:
# Information about our data
df_no_outliers.info()

#### B- Categorical Feature encoding    
Most machine learning models only work with numerical features, so we need to encode all our categorical features. Those are the columns shown with the **object** type in the output of the previous **info()** command.    
We are going to apply **One-Hot Encoding** to all the categorical features by using the pandas **get_dummies()** function.  
We leave out the **Income** column, because it is the column we are trying to predict.  
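
As a small illustration of what one-hot encoding produces, here is `pd.get_dummies` on a made-up frame. Depending on the pandas version, the indicator columns hold booleans or 0/1 integers.

```python
import pandas as pd

# Made-up frame with one categorical and one numerical column
toy = pd.DataFrame({'Sex': ['Male', 'Female', 'Male'], 'Age': [39, 50, 38]})

# 'Sex' is replaced by one indicator column per category; 'Age' is left untouched
encoded = pd.get_dummies(toy, columns=['Sex'])
print(encoded)
```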

In [None]:
# Columns: Workclass, Martial Status, Occupation, Relationship, Race, Sex, Country, AgeState
df_no_outliers = pd.get_dummies(df_no_outliers, columns=['Workclass', 'Martial Status', 'Occupation', 
                                 'Relationship', 'Race', 'Sex', 'Country', 'AgeState'])

In [None]:
df_no_outliers['Income'].unique()

In [None]:
'''
1: For those who make more than 50K 
0: For those who don't
'''
df_no_outliers['Income'] = np.where(df_no_outliers['Income'] =='>50K', 1, 0)

In [None]:
# Reorder columns : In order to have 'Income' as last feature.
columns = [column for column in df_no_outliers.columns if column != 'Income']
columns = columns + ['Income'] 
df = df_no_outliers[columns]

In [None]:
# Information about our data
df.info()

## 6- Preparation, Models and Evaluation    
#### 6.1- Data Preparation   
We need to split our dataset for training and testing data.  
80% of the data will be used for training and 20% for testing.

In [None]:
# to_numpy() replaces as_matrix(), which was removed from pandas
y = df['Income'].to_numpy()
X = df.drop('Income', axis=1).to_numpy().astype('float')

In [None]:
print('X shape: {} | y shape: {}'.format(X.shape, y.shape))

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
print('X train shape: {} | y shape: {}'.format(X_train.shape, y_train.shape))
print('X test shape: {} | y shape: {}'.format(X_test.shape, y_test.shape))

#### 6.2- Models & Evaluation   
Before building any machine learning model, it is important to build a baseline model first, in order to judge the performance of the upcoming models.  

##### Baseline Model

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
dummy_clf = DummyClassifier(strategy='most_frequent', random_state=0)

In [None]:
# Train the model 
dummy_clf.fit(X_train, y_train)

In [None]:
print('Score of baseline model : {0:.2f}'.format(dummy_clf.score(X_test, y_test)))

##### Logistic Regression 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [None]:
# The liblinear solver supports both the l1 and l2 penalties searched below
lr_clf = LogisticRegression(random_state=0, solver='liblinear')
parameters = {'C':[1.0, 10.0, 50.0, 100.0, 1000.0], 'penalty' : ['l1','l2']}
lr_clf = GridSearchCV(lr_clf, param_grid=parameters, cv=3)

In [None]:
lr_clf.fit(X_train, y_train)

In [None]:
lr_clf.best_params_

In [None]:
print('Best score : {0:.2f}'.format(lr_clf.best_score_))

In [1]:
print('Score for logistic regression - on test : {0:.2f}'.format(lr_clf.score(X_test, y_test)))
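
Since only about a quarter of the records belong to the positive class, accuracy alone can hide poor performance on that class. The sketch below, run on synthetic data from `make_classification` rather than on the Adult dataset, compares a most-frequent baseline with a logistic regression using a confusion matrix and the F1 score. All names ending in `_demo` are assumptions of this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (roughly 76% negatives, 24% positives)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.76, 0.24], random_state=0)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

baseline_demo = DummyClassifier(strategy='most_frequent').fit(Xtr_demo, ytr_demo)
model_demo = LogisticRegression(max_iter=1000).fit(Xtr_demo, ytr_demo)

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(yte_demo, model_demo.predict(Xte_demo))
baseline_f1 = f1_score(yte_demo, baseline_demo.predict(Xte_demo), zero_division=0)
model_f1 = f1_score(yte_demo, model_demo.predict(Xte_demo))

print(cm_demo)
print('F1 baseline: {:.2f} | F1 logistic regression: {:.2f}'.format(baseline_f1, model_f1))
```

The baseline never predicts the positive class, so its F1 score is 0 even though its accuracy is close to the share of the majority class.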
