In [1]:
import pandas as pd
import numpy as np

# Task 1

>Check for missing values in the dataset and handle them \[...\].

>Check the dataset for noisy data, inconsistencies, and duplicate entries \[...\].

### Note

Let us initialize the DataFrame with the first 5,000 rows.

In [2]:
# Load the dataset and keep the first 5,000 rows as the working DataFrame
ds = pd.read_csv('adult.csv')
sp_ds = ds[:5000]
df = sp_ds.copy()  # explicit copy, so later edits to df do not touch ds
df.head()  # preview the first five rows

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


### Note

As we can see, there are missing values in the form of question marks ("?").

Let us [count](https://stackoverflow.com/questions/20076195/) these missing values.

In [3]:
# Stack the listed columns into one long Series of strings and count the "?" entries
column_names = ["age", "workclass", "fnlwgt", "education", "educational-num", "marital-status", "occupation", "relationship", "race", "gender", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]
df[column_names].astype('str').stack().value_counts()['?']

707

### Note

As seen above, there are 707 missing values (cells containing "?") in total.
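
The same count can also be obtained without stacking the columns. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not taken from `adult.csv`); it marks every cell equal to "?" and sums the matches.

```python
import pandas as pd

# Hypothetical toy frame standing in for a few rows of the dataset
toy = pd.DataFrame({
    "age": [25, 38, 18],
    "workclass": ["Private", "?", "?"],
    "occupation": ["Machine-op-inspct", "?", "Farming-fishing"],
})

# Boolean frame: True wherever a cell is exactly the string "?"
is_placeholder = toy.isin(["?"])

# Summing twice collapses columns first, then the per-column totals
total = int(is_placeholder.sum().sum())
print(total)  # 3
```

On the real data, `df.isin(['?']).sum().sum()` should reproduce the 707 found above, assuming the placeholder is always written exactly as "?".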

Let us also count the actual null (NaN) values, if any:

In [4]:
# Check for null values
df.isnull().sum().sum()

0

### Note

So there are no null values: pandas reads the "?" placeholders as ordinary strings, which `isnull()` does not flag.
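
If it is preferable to have pandas treat "?" as missing from the start, `read_csv` accepts an `na_values` argument. The sketch below uses a hypothetical two-row excerpt held in memory (`csv_text` is made up for illustration) rather than the real file, and compares both ways of reading it.

```python
import io

import pandas as pd

# Hypothetical excerpt in the same layout as the dataset (not the real file)
csv_text = "age,workclass,occupation\n25,Private,Machine-op-inspct\n18,?,?\n"

# Default parsing keeps "?" as an ordinary string
raw = pd.read_csv(io.StringIO(csv_text))

# na_values="?" tells the parser to turn "?" into NaN
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(raw.isnull().sum().sum())     # 0
print(parsed.isnull().sum().sum())  # 2
```

This notebook keeps the "?" strings as they are, so the cells that follow still search for "?" directly.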

As for the 707 missing values found earlier, we need to decide how to impute them. First, let us find the columns that contain missing values.

In [5]:
# Names of the columns that contain at least one "?"
df.columns[df.isin(['?']).any()]

Index(['workclass', 'occupation', 'native-country'], dtype='object')

### Note

The 707 missing values are spread across three variables, all of which are nominal.

For now, we will impute them using the most frequent category. The 707 missing values make up about 4.7% of the 5,000 × 3 = 15,000 values in these three columns, so this simple imputation method should not distort the data much.
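
An overall share can hide a single column in which the placeholder is much more common, so a per-column check may be worthwhile before imputing. The following is a minimal sketch on a hypothetical toy frame (the values are made up and do not reflect the real per-column counts, which are not computed in this notebook).

```python
import pandas as pd

# Hypothetical toy frame with the three affected columns
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "Private"],
    "occupation": ["?", "?", "Sales", "Sales"],
    "native-country": ["United-States"] * 4,
})

# The mean of a boolean column is the fraction of True values,
# i.e. the share of "?" cells in that column
share_per_column = toy.isin(["?"]).mean()
print(share_per_column)
```

On the real data, the same expression applied to `df[['workclass', 'occupation', 'native-country']]` would give the share of "?" in each column.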

Let us first find out the most frequent category in each variable.

In [6]:
# Most frequent workclass category and its count
df.workclass.value_counts().head(1)

workclass
Private    3420
Name: count, dtype: int64

In [7]:
df.occupation.value_counts().head(1)

occupation
Prof-specialty    644
Name: count, dtype: int64

In [8]:
df['native-country'].value_counts().head(1)

native-country
United-States    4514
Name: count, dtype: int64
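
### Note

The three lookups above could also be done in one pass, together with the replacement itself. The following is a minimal sketch on a hypothetical toy frame, not necessarily the approach this notebook takes; `cols`, `masked`, `most_frequent` and `imputed` are illustrative names. It hides the "?" cells first, so that "?" can never be picked as the most frequent category, even in a column where it ties with or outnumbers a real category.

```python
import pandas as pd

# Hypothetical toy frame with the three affected columns
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "Local-gov"],
    "occupation": ["Sales", "?", "Sales", "?"],
    "native-country": ["United-States", "United-States", "?", "Mexico"],
})
cols = ["workclass", "occupation", "native-country"]

# Replace every "?" cell with a missing value
masked = toy[cols].mask(toy[cols].eq("?"))

# mode() ignores missing values; row 0 holds one most frequent value per column
most_frequent = masked.mode().iloc[0]

# Fill each column's gaps with that column's most frequent category
imputed = toy.copy()
imputed[cols] = masked.fillna(most_frequent)
print(imputed)
```

In the toy `occupation` column, "?" and "Sales" both occur twice, which is why the placeholder is masked before the mode is taken.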

### Note

Now we know that the most frequent categories are the Private sector, the "Prof-specialty" occupation, and 