Dataset available at: https://www.kaggle.com/wenruliu/adult-income-dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
import plotly as py
import cufflinks as cf
from plotly.offline import iplot
py.offline.init_notebook_mode(connected=True)
cf.go_offline()

In [3]:
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.graph_objects as go  # needed for the go.Box traces used in the age plots

In [4]:
pd.set_option('display.max_columns', None)

In [5]:
df = pd.read_csv('adult.csv')

## Overview of the dataset

**Looking at first five rows**

In [5]:
df.head()

| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


<font color = blue>**Some columns (for example workclass and occupation in the last row above) contain ?, which marks a missing value; these entries should be imputed**</font>

**Checking number of rows and columns**

In [6]:
df.shape

(48842, 15)

**Checking datatype of each column**

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


**Showing the total number of missing values in the dataset**

In [9]:
df.isna().sum().sum()

0
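
The count above is zero even though `?` entries are visible in the first rows, because `isna()` only flags real `NaN` values and `?` is an ordinary string. A minimal sketch of how the placeholders could be counted and converted, using a small made-up frame (`sample` and its values are illustrative assumptions, not rows taken from `adult.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few rows of the dataset (values are illustrative)
sample = pd.DataFrame({
    'age': [25, 38, 18, 44],
    'workclass': ['Private', '?', '?', 'Local-gov'],
    'occupation': ['Sales', 'Farming-fishing', '?', 'Tech-support'],
})

# '?' is a plain string, so isna() does not treat it as missing
nan_count = int(sample.isna().sum().sum())

# Count the '?' placeholders in each column
placeholder_counts = (sample == '?').sum()

# Convert the placeholders to NaN so pandas recognises them as missing
cleaned = sample.replace('?', np.nan)
missing_after = cleaned.isna().sum()

print('NaN values before conversion:', nan_count)
print(placeholder_counts)
print(missing_after)
```

The same effect can probably be had at load time with `pd.read_csv('adult.csv', na_values='?')`, assuming `?` is the only placeholder used in the file.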

**Separating numerical and string columns**

In [7]:
num = list(df.select_dtypes(include = np.number).columns)
cat = list(df.select_dtypes(include = 'object').columns)

**Statistics of the numerical columns**

In [33]:
colorscale = [[0, '#C71585'],[.5, '#FF1493'],[1, '#FF69B4']]
fig = ff.create_table(round(df[num].describe().reset_index(), 2), font_colors = ['white'], colorscale = colorscale)

for i in range(len(fig.layout.annotations)):
    fig.layout.annotations[i].font.size = 15
    
fig.update_layout(
    title_text = 'Descriptive Statistics on the numerical columns',
    margin = {'t':50},
    template= "plotly_dark"
)
    
fig.show()

**Statistics of the categorical columns**

In [8]:
colorscale = [[0, '#C71585'],[.5, '#FF1493'],[1, '#FF69B4']]
fig = ff.create_table(df[cat].describe().reset_index(), font_colors = ['white'], colorscale = colorscale)

for i in range(len(fig.layout.annotations)):
    fig.layout.annotations[i].font.size = 10
    
fig.update_layout(
    title_text = 'Statistics of the categorical columns',
    margin = {'t':50},
    template= "plotly_dark"
)
    
fig.show()

## Univariate analysis of Numerical Columns

### Age

In [27]:
fig = ff.create_distplot([df['age']],group_labels =['age_density'], colors = ['#00FFFF'])

fig.add_trace(go.Box(x = df['age'],
                    xaxis='x2', yaxis='y2', 
                     name = 'age_distribution'
                    ))

fig.update_layout(
    title_text = 'Checking density and distribution of the age column',
    margin = {'t':50, 'b':100},
    template= "plotly_dark"
)


fig.show()

<font color = blue>**Most of the adults in the dataset are between 20 and 45 years old**</font>

<font color = blue>**People commonly retire around the age of 60, so the plot above shows very few records for individuals aged 60 and over**</font>

In [35]:
print('Total number of adults in the dataset:', df.shape[0])
print('Number of adults older than 65:', df[df['age'] > 65].shape[0])

Total number of adults in the dataset: 48842
Number of adults older than 65: 1803


<font color = green>**Conclusion: records with age > 65 make up only about 3.7% of the data (1803 of 48842), so removing them should not affect our predictions and analysis much**</font>

In [36]:
df = df[df['age'] <= 65] # Removing the records having age of more than 65
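
Before dropping rows it can help to check how much of the data a cutoff removes. A small sketch of that check follows; the helper `drop_above_age` and the `toy` frame are hypothetical names introduced only for illustration, while the two counts in the last lines are the ones printed earlier in this notebook.

```python
import pandas as pd

def drop_above_age(frame, max_age=65):
    """Return the rows with age <= max_age and the fraction of rows dropped."""
    kept = frame[frame['age'] <= max_age].copy()
    dropped_fraction = 1 - len(kept) / len(frame)
    return kept, dropped_fraction

# Toy data: three of the ten ages are above 65
toy = pd.DataFrame({'age': [22, 35, 47, 66, 70, 65, 58, 19, 81, 40]})
kept, dropped_fraction = drop_above_age(toy)
print(len(kept), f'{dropped_fraction:.1%}')

# Same ratio using the counts reported above for the full dataset
share_removed = 1803 / 48842
print(f'{share_removed:.1%}')
```

Taking a `.copy()` of the filtered frame is a design choice here: it avoids pandas chained-assignment warnings if new columns are added to the result later.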

In [37]:
fig = ff.create_distplot([df['age']],group_labels =['age_density'], colors = ['#00FFFF'])

fig.add_trace(go.Box(x = df['age'],
                    xaxis='x2', yaxis='y2', 
                     name = 'age_distribution'
                    ))

fig.update_layout(
    title_text = 'Checking density and distribution of the age column after outlier treatment',
    margin = {'t':50, 'b':100},
    template= "plotly_dark"
)


fig.show()

<font color = blue>**The age distribution is still slightly skewed, but this will be treated later with a Box-Cox transformation**</font>
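
As a rough illustration of what that later step might look like, the sketch below measures skewness before and after a Box-Cox transformation. It assumes SciPy is installed and uses synthetic right-skewed values (`ages` is generated, not read from the dataset); the notebook's actual transformation step may differ.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, strictly positive, right-skewed values standing in for an age column
rng = np.random.default_rng(0)
ages = pd.Series(17 + rng.gamma(shape=2.0, scale=10.0, size=5000), name='age')

# Sample skewness before the transformation
skew_before = ages.skew()

# Box-Cox requires strictly positive input; with no lambda given,
# scipy estimates it by maximum likelihood and returns it with the data
transformed, fitted_lambda = stats.boxcox(ages.to_numpy())
skew_after = pd.Series(transformed).skew()

print('skewness before:', round(skew_before, 3))
print('skewness after: ', round(skew_after, 3))
print('fitted lambda:  ', round(fitted_lambda, 3))
```

Because Box-Cox only accepts positive values, columns containing zeros (such as capital-gain) would need a shift or a different transform.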