# DSBDL Assignment 01 - Data Wrangling 1

Perform the following operations using Python on any open-source dataset (e.g., data.csv):

1. Import all the required Python Libraries. 
2. Locate an open-source dataset on the web (e.g., https://www.kaggle.com). Provide a clear description of the data and its source (i.e., the URL of the website). 
3. Load the dataset into a pandas data frame. 
4. Data Preprocessing: check for missing values in the data using the pandas `isnull()` function, and use the `describe()` function to get some initial statistics. Provide variable descriptions, types of variables, etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.

In addition to the code and outputs, explain every operation that you perform in the above steps and 
explain everything that you do to import/read/scrape the dataset.



## 1. Import all Python libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 2. Locate an open-source dataset

Source: https://archive.ics.uci.edu/dataset/2/adult

> Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.


- Dataset Characteristics: `Multivariate`
- Subject Area: `Social Science`
- Associated Tasks: `Classification`
- Feature Type: `Categorical, Integer`
- Instances: `48842`
- Number of features: `14`

## 3. Load dataset in Pandas DataFrame

In [None]:
cols = [
    "age" , "workclass" , "fnlwgt" , 
    "education" , "education-num" , "marital-status" , 
    "occupation" , "relationship" , "race" , "sex" , 
    "capital-gain" , "capital-loss" , "hours-per-week" , 
    "native-country" , "income"
]
ds = pd.read_csv( "dataset/adult.data" , header=None , skipinitialspace=True ) 
ds.columns = cols

# Convert text columns from the default `object` dtype to the pandas `string` dtype
ds = ds.astype( { 
    "workclass": 'string' , "education": 'string' , "marital-status": 'string' , "occupation": 'string' , 
    "relationship": 'string' ,
    "race": 'string' , "sex": 'string' , "native-country": 'string' , "income": 'string'
} )

ds.head()
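
As an alternative to assigning `ds.columns` and calling `astype` after loading, the column names and dtypes can be passed to `pd.read_csv` directly. The following is a minimal sketch under stated assumptions: it reads one illustrative row from an in-memory string (a stand-in written in the layout of `adult.data`, not the real file), and the names `string_cols`, `sample` and `sample_ds` are hypothetical.

```python
import io
import pandas as pd

cols = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week",
    "native-country", "income",
]
# Text columns that should use the pandas `string` dtype
string_cols = [
    "workclass", "education", "marital-status", "occupation",
    "relationship", "race", "sex", "native-country", "income",
]

# One illustrative row in the comma-plus-space layout of adult.data
# (assumption: a stand-in so the sketch runs without the real file)
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
)

sample_ds = pd.read_csv(
    sample,
    header=None,                              # the data has no header row
    names=cols,                               # assign column names while reading
    skipinitialspace=True,                    # drop the space after each comma
    dtype={c: "string" for c in string_cols}, # set text dtypes while reading
)
print(sample_ds.dtypes)
```

Setting names and dtypes at read time keeps the loading step in one call, which can make it harder to forget a conversion later.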

## 4. Data Preprocessing

### 4.1. Check dimensions of data frame

In [None]:
# Dimensions of dataframe
ds.shape

### 4.2. Check data types of all features

In [None]:
# Data-types of all columns
ds.dtypes
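
When a column turns out to have the wrong type, it can be converted explicitly. The sketch below uses a small made-up frame (`raw` is a hypothetical name, not part of the dataset) to show two common conversions: `pd.to_numeric` for numbers stored as text, and the `category` dtype, which plays the role of a factor.

```python
import pandas as pd

# Made-up frame in which `age` was read as text
raw = pd.DataFrame({
    "age": ["39", "50", "unknown"],
    "sex": ["Male", "Female", "Male"],
})

# Convert text to numbers; values that cannot be parsed become NaN
raw["age"] = pd.to_numeric(raw["age"], errors="coerce")

# Store a low-cardinality text column as a categorical (factor-like) dtype
raw["sex"] = raw["sex"].astype("category")

print(raw.dtypes)
```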

### 4.3. Use `.describe` to get statistics for numeric features

In [None]:
# Initial description of numerical columns in the dataset
ds.describe()

### 4.4. Check number of missing values for all features

In [None]:
# Missing values are encoded as "?" in this dataset, so count the "?" entries in each column
ds.isin( [ "?" ] ).sum() 
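
Because the placeholder `"?"` is an ordinary string, `isnull()` does not report it as missing. One way to use `isnull()` as the assignment asks is to replace the placeholder with `NaN` first. A minimal sketch on a made-up frame (`demo` and `missing_counts` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up frame; "?" plays the role of the missing-value placeholder
demo = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "?", "?"],
    "age": [25, 38, 52],
})

# Replace the placeholder with NaN, then count missing values per column
missing_counts = demo.replace("?", np.nan).isnull().sum()
print(missing_counts)
```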

### 4.5. Provide variable descriptions

1. `age`: Age of the person
2. `workclass`: Working class of the person
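3. `fnlwgt`: Final sampling weight, an estimate of the number of people in the population that the record represents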
4. `education`: Education level of the person
5. `education-num`: Quantitative feature for `education`
6. `marital-status`: Marital status of the person
7. `occupation`: Occupation of the person
8. `relationship`: Role of the person within the household (e.g. husband, wife, own child, not in family)
9. `race`: Race of the person
10. `sex`: Sex/gender of the person
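11. `capital-gain`: Capital gains recorded for the person
12. `capital-loss`: Capital losses recorded for the person
13. `hours-per-week`: Number of hours the person works per week
14. `native-country`: Country of origin of the person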

In [None]:
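def plot_value_counts( 
    name: str , 
    ds: pd.DataFrame 
):
    # Draw a bar chart showing how often each distinct value of column `name` occurs
    value_counts = ds[ name ].value_counts()
    entities = value_counts.index.tolist()
    freqs = value_counts.values.tolist()
    plt.bar( entities , freqs )
    plt.title( name )
    # Rotate the tick labels after plotting so the rotation applies to the final labels
    plt.xticks( rotation='vertical' )
    plt.show()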

## 5. Data Formatting and Normalization


In [None]:
# Plot value counts for all discrete features
plot_value_counts( "workclass" , ds )
plot_value_counts( "education" , ds )
plot_value_counts( "marital-status" , ds )
plot_value_counts( "occupation" , ds )
plot_value_counts( "relationship" , ds )
plot_value_counts( "race" , ds )
plot_value_counts( "sex" , ds )
plot_value_counts( "hours-per-week" , ds )
plot_value_counts( "income" , ds )
plot_value_counts( "native-country" , ds )

## 5.1. Handle missing values


### 5.1.1. Change missing values of `workclass` with mode of the feature

In [None]:
# Most frequent `workclass` value, ignoring the "?" placeholder itself
mode_feature = ds.loc[ ds.workclass != "?" , "workclass" ].mode()[ 0 ]
# Replace every "?" with that value
ds.loc[ ds.workclass == "?" , "workclass" ] = mode_feature

### 5.1.2. Remove records with unknown `occupation` 

In [None]:
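# Keep only rows with a known occupation; `.copy()` avoids chained-assignment warnings in later cells
ds = ds.loc[ ds.occupation != "?" ].copy()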

### 5.1.3. Change missing values of `native-country` with mode of the feature

In [None]:
# Most frequent `native-country` value, ignoring the "?" placeholder itself
mode_feature = ds.loc[ ds[ "native-country" ] != "?" , "native-country" ].mode()[ 0 ]
# Replace every "?" with that value
ds.loc[ ds[ "native-country" ] == "?" , "native-country" ] = mode_feature
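
The mode imputation in 5.1.1 and 5.1.3 follows the same pattern, so it could be factored into a helper. The sketch below is one possible version, not the notebook's own code: `fill_placeholder_with_mode` is a hypothetical name and the frame is made up. It deliberately excludes the placeholder when computing the mode, since `"?"` could otherwise be picked as the most frequent value.

```python
import pandas as pd

def fill_placeholder_with_mode(frame, column, placeholder="?"):
    """Return a copy of `frame` with `placeholder` in `column` replaced
    by the most frequent non-placeholder value."""
    known = frame.loc[frame[column] != placeholder, column]
    mode_value = known.mode().iloc[0]
    out = frame.copy()
    out.loc[out[column] == placeholder, column] = mode_value
    return out

# Made-up column in which "?" is the most frequent entry
demo = pd.DataFrame({
    "native-country": ["?", "?", "?", "United-States", "United-States", "Mexico"],
})
filled = fill_placeholder_with_mode(demo, "native-country")
print(filled["native-country"].tolist())
```

Returning a copy leaves the input frame untouched, which makes the helper safe to call on slices.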

## 5.2. Data Normalization

In [None]:
def min_max_normalize(
    name: str
):
    ds[ name ] = (ds[ name ] - ds[ name ].min()) / ( ds[ name ].max() - ds[ name ].min() )

In [None]:
min_max_normalize( "age" ) 
min_max_normalize( "fnlwgt" ) 
min_max_normalize( "capital-gain" ) 
min_max_normalize( "capital-loss" ) 
min_max_normalize( "hours-per-week" ) 
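
Min-max normalization rescales each value x to (x - min) / (max - min), so the smallest value maps to 0 and the largest to 1. The sketch below works through this on a small made-up series. `min_max_scale` is a hypothetical standalone variant that also guards against a constant column, where max - min would be zero.

```python
import pandas as pd

def min_max_scale(values: pd.Series) -> pd.Series:
    """Rescale `values` linearly to the range [0, 1]."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Constant column: avoid dividing by zero and return all zeros
        return pd.Series(0.0, index=values.index)
    return (values - lo) / (hi - lo)

hours = pd.Series([20, 40, 60])
scaled = min_max_scale(hours)
print(scaled.tolist())  # [0.0, 0.5, 1.0]
```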

## 6. Categorical variables to quantitative variables

### 6.1. Encoding `sex` into a binary feature (not good though)

In [None]:
# Map each category to an integer: "Male" -> 0, "Female" -> 1
ds[ "sex" ] = ds[ "sex" ].map( { "Male": 0 , "Female": 1 } ).astype( "int64" )

### 6.2. Encoding `income` into a binary feature

In [None]:
# Map each income bracket to an integer: ">50K" -> 0, "<=50K" -> 1
ds[ "income" ] = ds[ "income" ].map( { ">50K": 0 , "<=50K": 1 } ).astype( "int64" )
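
Features with more than two categories, such as `workclass`, cannot be encoded with a single 0/1 column. Two common options are integer category codes and one-hot encoding. The sketch below shows both on a made-up column (`demo`, `codes` and `one_hot` are hypothetical names). Which option is appropriate depends on the model that will consume the data.

```python
import pandas as pd

# Made-up column with three distinct categories
demo = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc"],
})

# Option 1: integer codes, assigned in sorted order of the category labels
codes = demo["workclass"].astype("category").cat.codes

# Option 2: one-hot encoding, one 0/1 column per category
one_hot = pd.get_dummies(demo["workclass"], prefix="workclass", dtype=int)

print(codes.tolist())
print(one_hot)
```

Integer codes imply an ordering between categories, whereas one-hot columns do not, at the cost of a wider table.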

### 6.3. 