# Data preprocessing in pandas

pandas.**DataFrame** reference:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

### 1. Import necessary packages

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 2. Download and import the Adult dataset
Follow the link and download the Adult dataset:  
https://archive.ics.uci.edu/ml/datasets/Adult

Next, load it from the file into a pandas.**DataFrame**. Docs:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

*Hint* 1: look carefully at the raw data, especially at the delimiters. Inspect pandas.**read_csv** options and find out how to handle this.  
*Hint* 2: try setting column names when importing the dataset. Take a look at the dataset description:  
https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names

In [None]:
DATA_DIR = 'data'
FILE_NAME = 'adult.data.csv'

file_path = os.path.join(DATA_DIR, FILE_NAME)
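
The loading step can be sketched as follows, assuming the raw file has no header row and separates values with a comma followed by a space. The column names are based on the dataset description, but their exact spelling (including `'income'` for the target) is an assumption, and the two rows are made-up stand-ins so the snippet runs without the downloaded file.

```python
import io

import pandas as pd

# Column names based on the dataset description; the spelling is an assumption.
COLUMNS = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-number',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
    'income',
]

# Two made-up rows in the raw file's layout (values separated by ", ").
raw = io.StringIO(
    "39, Private, 100000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Female, 0, 0, 40, United-States, <=50K\n"
    "52, ?, 120000, HS-grad, 9, Married-civ-spouse, ?, "
    "Husband, White, Male, 0, 0, 45, United-States, >50K\n"
)

# header=None: there is no header row, so names must be supplied.
# skipinitialspace=True drops the blank after each comma, so values
# are read as 'Male' rather than ' Male'.
df = pd.read_csv(raw, header=None, names=COLUMNS, skipinitialspace=True)
print(df)
```

With the real file, pass `file_path` instead of `raw`.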

### 3. Get DataFrame dimensions
The dimensions of a DataFrame are its number of records (rows) and its number of attributes (columns). See docs:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html
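
A minimal sketch on a made-up three-row frame (the name `toy` is hypothetical); the same attribute works on the loaded dataset.

```python
import pandas as pd

toy = pd.DataFrame({'age': [39, 52, 28], 'sex': ['Male', 'Male', 'Female']})

# shape is an attribute, not a method: (number of rows, number of columns).
n_records, n_attributes = toy.shape
print(f'{n_records} records, {n_attributes} attributes')
```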

### 4. Print out first and last 5 rows of the DataFrame
See docs:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html
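
A short sketch on a made-up single-column frame; both methods return 5 rows unless another `n` is passed.

```python
import pandas as pd

toy = pd.DataFrame({'age': range(20, 32)})  # 12 rows, ages 20..31

first_rows = toy.head()  # first 5 rows by default
last_rows = toy.tail()   # last 5 rows by default
print(first_rows)
print(last_rows)
```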

### 5. Inspect data types of DataFrame columns
Docs:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
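
A minimal sketch on made-up data. Text columns are typically reported as `object`, although newer pandas versions may show a dedicated string dtype instead.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [39, 52],
    'hours-per-week': [40.0, 45.5],
    'sex': ['Male', 'Female'],
})

# dtypes is a Series that maps each column name to its data type.
column_types = toy.dtypes
print(column_types)
```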

### 6. Get numbers of unique values in each column
Docs:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html

Now, print out all unique values for attribute `'marital-status'`. Docs:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html
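
Both calls can be sketched on a made-up frame: `nunique` summarizes every column at once, while `unique` lists the values of a single column.

```python
import pandas as pd

toy = pd.DataFrame({
    'marital-status': ['Never-married', 'Divorced', 'Never-married', 'Widowed'],
    'sex': ['Male', 'Female', 'Female', 'Male'],
})

# Series mapping each column to its number of distinct values.
unique_counts = toy.nunique()

# Distinct values of one column, in order of first appearance.
marital_values = toy['marital-status'].unique()

print(unique_counts)
print(marital_values)
```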

### 7. Group attributes by data types

In [None]:
numeric_fields = ['age']
binary_fields = []
ordinal_fields = []
nominal_fields = []
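
One possible way to get a first draft of these lists is sketched below on made-up data. It only proposes candidates: telling ordinal fields from nominal ones still requires reading the dataset description. The variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [39, 52, 28],
    'sex': ['Male', 'Female', 'Male'],
    'race': ['White', 'Black', 'Other'],
})

# Columns stored as numbers are candidates for the numeric group.
numeric_candidates = toy.select_dtypes(include='number').columns.tolist()

# Among the remaining columns, those with exactly two distinct values
# are candidates for the binary group.
text_columns = toy.select_dtypes(exclude='number').columns.tolist()
binary_candidates = [c for c in text_columns if toy[c].nunique() == 2]

print(numeric_candidates, binary_candidates)
```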

### 8. Count null values in each column
Docs:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html
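
A minimal sketch on made-up data. Note that `isnull` only detects real nulls (`NaN`, `None`); a placeholder string standing in for a missing value is not counted.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'age': [39, np.nan, 28],
    'sex': ['Male', 'Female', None],
})

# isnull() gives a boolean frame; summing it counts the True cells per column.
null_counts = toy.isnull().sum()
print(null_counts)
```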

### 9. Implement functions for counting absolute and relative frequencies of values

In [None]:
def make_abs_freq_dict(df, column):
    return {}


def make_rel_freq_dict(df, column):
    return {}
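
One possible implementation of the two functions, sketched with `Series.value_counts`. This is not the only valid approach, and by default `value_counts` leaves null values out of the result.

```python
import pandas as pd


def make_abs_freq_dict(df, column):
    """Map each distinct value of df[column] to its number of occurrences."""
    return df[column].value_counts().to_dict()


def make_rel_freq_dict(df, column):
    """Map each distinct value of df[column] to its share of the rows."""
    return df[column].value_counts(normalize=True).to_dict()


# Made-up data to try the functions on.
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male', 'Male']})
abs_freq = make_abs_freq_dict(toy, 'sex')
rel_freq = make_rel_freq_dict(toy, 'sex')
print(abs_freq)
print(rel_freq)
```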

### 10. Print out absolute value frequencies for binary fields

### 11. Print out absolute value frequencies for nominal fields

### 12. Find out which columns have missing values
*Hint*: see the dataset description once again:  
https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
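
A sketch of the search, assuming missing entries are marked with the string `'?'` and that the data was loaded without stray spaces around values. The frame below is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [39, 52, 28],
    'workclass': ['Private', '?', 'Private'],
    'occupation': ['?', '?', 'Sales'],
})

# Count the placeholder in every column; numeric columns never match it.
placeholder_counts = (toy == '?').sum()

# Keep only the columns where the placeholder occurs at least once.
columns_with_missing = placeholder_counts[placeholder_counts > 0].index.tolist()
print(columns_with_missing)
```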

### 13. Fill missing values
Remember that missing values in nominal fields are filled with the most frequent value (the mode).

See docs:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
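
A minimal sketch for one nominal column, assuming missing entries are marked with `'?'`. The mode is computed on the known values only, so the placeholder itself can never be chosen as the fill value.

```python
import pandas as pd

toy = pd.DataFrame({
    'workclass': ['Private', '?', 'Private', 'State-gov', '?', 'Private'],
})

# Most frequent value among the known entries.
known = toy.loc[toy['workclass'] != '?', 'workclass']
most_common = known.mode()[0]

# Replace the placeholder with that value.
toy['workclass'] = toy['workclass'].replace('?', most_common)
print(toy['workclass'].tolist())
```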

### 14. Plot value frequencies for all continuous and categorical fields
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html  
https://stackoverflow.com/questions/31029560/plotting-categorical-data-with-pandas-and-matplotlib
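
A sketch with one continuous and one categorical column of made-up data: a histogram for the former, a bar chart of value counts for the latter. The `Agg` backend is selected only so the snippet also runs outside a notebook; inside a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    'age': [39, 52, 28, 45, 33, 61],
    'sex': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male'],
})

fig, (ax_num, ax_cat) = plt.subplots(1, 2, figsize=(8, 3))

# Continuous field: histogram of the raw values.
toy['age'].plot.hist(bins=5, ax=ax_num, title='age')

# Categorical field: count each value first, then draw one bar per value.
sex_counts = toy['sex'].value_counts()
sex_counts.plot.bar(ax=ax_cat, title='sex')

fig.tight_layout()
```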

### 15. Merge rare values of nominal fields
Let's call a value ***rare*** if its relative frequency is **under 1%**. Merge the rare values of each field into a single value called `'OTHER'`. Note that merging is pointless if a field has only one rare value, and it brings little benefit if there are only a few.
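
The merging rule can be sketched for one column as follows. The data is made up (1000 rows with placeholder country labels), and `RARE_THRESHOLD` is a hypothetical name for the 1% limit.

```python
import pandas as pd

RARE_THRESHOLD = 0.01  # relative frequency under 1%

# 1000 made-up rows: two common values and two rare ones.
toy = pd.DataFrame({
    'native-country': ['A'] * 900 + ['B'] * 92 + ['C'] * 5 + ['D'] * 3,
})

rel_freq = toy['native-country'].value_counts(normalize=True)
rare_values = rel_freq[rel_freq < RARE_THRESHOLD].index

# Merging a single rare value would only rename it, so require at least two.
if len(rare_values) > 1:
    is_rare = toy['native-country'].isin(rare_values)
    toy.loc[is_rare, 'native-country'] = 'OTHER'

merged_counts = toy['native-country'].value_counts().to_dict()
print(merged_counts)
```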

### 16. Inspect dependencies in fields
See docs:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html  
http://seaborn.pydata.org/generated/seaborn.heatmap.html

Be sure to drop duplicate columns, that is, columns that carry the same information as another column.

Docs:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html  

*Hint*: look closely at the fields `'education'` and `'education-number'`.
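
A sketch on made-up data of both steps: a correlation matrix for the numeric columns, and a check of whether one column merely re-encodes another. Dropping the text column rather than the numeric one is a choice made for this example, not a rule.

```python
import pandas as pd

toy = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Bachelors', 'Masters', 'HS-grad'],
    'education-number': [13, 9, 13, 14, 9],
    'age': [39, 52, 28, 45, 33],
})

# Pairwise Pearson correlation of the numeric columns.
corr_matrix = toy[['education-number', 'age']].corr()
print(corr_matrix)

# If every education label maps to exactly one number, the two columns
# carry the same information and one of them can be dropped.
numbers_per_label = toy.groupby('education')['education-number'].nunique()
if (numbers_per_label == 1).all():
    toy = toy.drop(columns='education')

print(toy.columns.tolist())
```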

### 17. Scale continuous and ordinal fields
Remember that standardization is used for continuous values, and normalization for ordinal ones.  

The following docs may be useful:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html
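
Both transformations can be sketched on made-up data as follows. The sketch assumes standardization means the z-score and normalization means min-max scaling; the population standard deviation (`ddof=0`) is used here, whereas pandas defaults to the sample one (`ddof=1`).

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [20.0, 30.0, 40.0, 50.0],      # continuous
    'education-number': [9, 13, 14, 16],  # ordinal
})

# Standardization (z-score): subtract the mean, divide by the std.
age = toy['age']
toy['age'] = (age - age.mean()) / age.std(ddof=0)

# Min-max normalization: the smallest value maps to 0, the largest to 1.
edu = toy['education-number']
toy['education-number'] = (edu - edu.min()) / (edu.max() - edu.min())

print(toy)
```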

### 18. Validate scaling
*Hint*: normalized values must lie in the range \[0, 1\], while standardized values must have a mean of 0 and a standard deviation of 1.
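
A sketch of such a check on a small frame that is already scaled by hand. Floating-point results are rarely exactly 0 or 1, so the mean and standard deviation are compared with a tolerance; `ddof=0` assumes the population standard deviation was used when scaling.

```python
import numpy as np
import pandas as pd

scaled = pd.DataFrame({
    'age': [-1.0, 1.0, -1.0, 1.0],             # standardized by hand
    'education-number': [0.0, 0.5, 1.0, 0.25],  # normalized by hand
})

# Normalized column: every value must lie in [0, 1].
normalized_ok = scaled['education-number'].between(0, 1).all()

# Standardized column: mean close to 0 and std close to 1.
standardized_ok = (
    np.isclose(scaled['age'].mean(), 0.0, atol=1e-9)
    and np.isclose(scaled['age'].std(ddof=0), 1.0)
)

print(normalized_ok, standardized_ok)
```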

### 19. Make dummy variables for binary and nominal fields
See docs:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
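
A minimal sketch on made-up data. Each listed column is replaced by one indicator column per distinct value, named `<column>_<value>`.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [39, 52, 28],
    'sex': ['Male', 'Female', 'Male'],
    'race': ['White', 'Black', 'White'],
})

# Columns not listed in `columns` (here 'age') are passed through unchanged.
encoded = pd.get_dummies(toy, columns=['sex', 'race'])
print(encoded)
```

For binary fields, `drop_first=True` keeps a single indicator column instead of two complementary ones.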

### 20. Binarize target variable
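
A minimal sketch, assuming the target column was named `'income'` at import and holds the labels `'>50K'` and `'<=50K'` without trailing characters. The data is made up.

```python
import pandas as pd

toy = pd.DataFrame({'income': ['<=50K', '>50K', '<=50K']})

# The comparison yields booleans; astype(int) turns them into 0 and 1.
toy['income'] = (toy['income'] == '>50K').astype(int)
print(toy['income'].tolist())
```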