[Home](../../README.md)

### Data Preprocessing

This is a demonstration of data preprocessing using [Pandas](https://pandas.pydata.org/), a library for data analysis and manipulation. In previous steps you have already done some preprocessing to understand the data.

This Jupyter Notebook is less about sequential steps and more about the different processes you can apply to your data before data wrangling. For this demonstration we will use a bigger, more complex dataset.

#### Load the required dependencies

In [46]:
# Import libraries
import pandas as pd

#### Store the data as a local variable

A data frame is a Pandas object that holds your tabular data as labeled rows and columns. The read_csv call loads the complete file into memory, so the data is ready for preprocessing.

In [47]:
data_frame = pd.read_csv("7.example_data.csv")

#### head() & tail() - Data Snapshot

It is important to get a high-level look at your dataset to understand what you are working with. Printing the complete data is impractical for large datasets, where the rows can number in the thousands or even millions.

You can use the head() and tail() methods to inspect the first and last 5 rows of your dataset.

In [None]:
print(data_frame.head())
print(data_frame.tail())
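
Both methods also accept an optional row count. The following is a minimal sketch on a small made-up frame; the name `toy` and its columns are illustrative assumptions, not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy = pd.DataFrame({"order_id": range(1, 11), "units": range(10, 110, 10)})

first_three = toy.head(3)  # first 3 rows instead of the default 5
last_two = toy.tail(2)     # last 2 rows
print(first_three)
print(last_two)
print(toy.shape)           # (rows, columns) as a quick size check
```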

#### info() - Data Summary
 
The info() method prints a summary of the data frame: the data type and non-null count of each column, the total number of rows, and the memory usage.

In [None]:
data_frame.info()

#### describe() - Statistics For Numerical Columns
 
The describe() method reports summary statistics for the numerical columns: count, mean, standard deviation, minimum, quartiles, and maximum.

In [None]:
data_frame.describe()
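
By default, describe() skips non-numerical columns when numerical ones are present. As a rough sketch, passing include="all" adds text columns to the summary. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one text column, one numerical column
toy = pd.DataFrame({
    "region": ["Asia", "Europe", "Asia", "Europe"],
    "units": [10, 20, 30, 40],
})

numeric_summary = toy.describe()            # numerical columns only
full_summary = toy.describe(include="all")  # adds unique/top/freq for text columns

print(numeric_summary)
print(full_summary)
```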

#### isnull()

Null values can cause runtime errors and unexpected results during data analysis. It is important to identify them and deal with them appropriately beforehand. Chaining isnull() with sum() counts the null values in each column.

In [None]:
print(data_frame.isnull().sum())
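
Raw counts can be hard to judge on a large dataset, so it may help to look at the share of missing values as well. The sketch below uses a made-up frame called `toy`; its columns and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "item": ["Fruit", None, "Cereal", "Snacks"],
    "units": [10.0, 20.0, np.nan, np.nan],
})

null_counts = toy.isnull().sum()                 # nulls per column
null_share = toy.isnull().mean() * 100           # percentage of nulls per column
rows_with_nulls = toy[toy.isnull().any(axis=1)]  # rows with at least one null

print(null_counts)
print(null_share)
print(rows_with_nulls)
```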

#### dropna() & fillna()

If you have null data, there are many ways to deal with it. These are the two most common approaches:
1. Remove any row with a null value.
2. Replace missing values with another value. For numerical columns the mean is a common choice because it leaves the column mean unchanged while maintaining the original size of the data.

In [None]:
# Remove rows where 'Item Type' is null
data_frame = data_frame.dropna(subset=['Item Type'])
print(data_frame.isnull().sum())

In [None]:
# Replace Null values with the mean value for the column
data_frame['Units Sold'] = data_frame['Units Sold'].fillna(data_frame['Units Sold'].mean())
print(data_frame.isnull().sum())
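
To see the trade-off between the two approaches side by side, here is a small sketch on made-up data. The frame `toy` and its values are illustrative assumptions. Dropping shrinks the data, while filling keeps every row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one null per column
toy = pd.DataFrame({
    "item": ["Fruit", None, "Cereal", "Snacks"],
    "units": [10.0, 20.0, np.nan, 30.0],
})

# Approach 1: drop rows where a chosen column is null (4 rows become 3)
dropped = toy.dropna(subset=["item"])

# Approach 2: fill numerical nulls with the column mean
# (mean() skips NaN, so this is the mean of 10, 20 and 30)
filled = toy.copy()
filled["units"] = filled["units"].fillna(filled["units"].mean())

print(len(toy), len(dropped), len(filled))
print(filled["units"].tolist())
```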

#### Filter Your Data

Filtering is like applying a WHERE clause in a database query. It is widely used and helps when you need to work on a specific subset of your data. For our use case, let us filter the data to only include rows where the region is "Sub-Saharan Africa". No dedicated method call is needed for this; conditional (boolean) indexing is enough.

In [None]:
data_frame = data_frame[data_frame['Region'] == 'Sub-Saharan Africa']
data_frame.head()
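
Conditions can also be combined. The sketch below, on a made-up frame named `toy`, joins two conditions with `&` (each wrapped in parentheses) and shows the same filter written with the query() method that Pandas offers as an alternative.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "region": ["Asia", "Europe", "Asia", "Europe"],
    "units": [10, 200, 300, 40],
})

# Boolean indexing: parentheses are required around each condition
asia_large = toy[(toy["region"] == "Asia") & (toy["units"] > 100)]

# The same filter expressed as a query string
same_via_query = toy.query("region == 'Asia' and units > 100")

print(asia_large)
```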

#### apply()

We can run a function on a column to transform its values. As a simple example, let’s convert the sales channel to lowercase. The apply method calls the function (here a lambda) on each value of the column and returns a new column, which we assign back in place of the original.

In [None]:
data_frame['Sales Channel'] = data_frame['Sales Channel'].apply(lambda x: x.lower())
print(data_frame['Sales Channel'].head())
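
One caveat: a lambda such as `x.lower()` raises an error if the column still contains missing values, because they are not strings. The sketch below shows two possible ways around that, using a made-up series named `channels`: guarding inside the lambda, or using the `.str` accessor, which skips missing values on its own.

```python
import pandas as pd

# Hypothetical column with mixed case and one missing value
channels = pd.Series(["Online", "OFFLINE", None])

# Option 1: apply with a guard so non-string values pass through untouched
lowered_apply = channels.apply(lambda x: x.lower() if isinstance(x, str) else x)

# Option 2: vectorized string accessor, which leaves missing values missing
lowered_str = channels.str.lower()

print(lowered_apply)
print(lowered_str)
```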

#### quantile()

Outliers can skew your analysis of numerical columns, so it is often worth removing them. We can use the 25th and 75th percentiles (the first and third quartiles, Q1 and Q3) of a numerical column to get the interquartile range (IQR = Q3 - Q1). This lets us estimate an acceptable range, and we can then filter out any values outside it. By the common 1.5 × IQR rule, outliers are values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

In [None]:
# Get the interquartile range of the 'Units Sold' column
print(data_frame['Units Sold'].describe())
Q1 = data_frame['Units Sold'].quantile(0.25)
Q3 = data_frame['Units Sold'].quantile(0.75)
IQR = Q3 - Q1

In [None]:
# Keep only 'Units Sold' values within the acceptable range
data_frame = data_frame[(data_frame['Units Sold'] >= Q1 - 1.5 * IQR) & (data_frame['Units Sold'] <= Q3 + 1.5 * IQR)]
print(data_frame['Units Sold'].describe())
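
If the same rule needs to be applied to several columns, it could be wrapped in a small helper. The function name `iqr_bounds`, its parameter `k`, and the sample values below are hypothetical and only illustrate the 1.5 × IQR rule described above.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) limits of the k * IQR rule for a numerical series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values with one obvious outlier
units = pd.Series([10, 12, 11, 13, 12, 11, 500])

lower, upper = iqr_bounds(units)
kept = units[units.between(lower, upper)]  # between() is inclusive on both ends

print(lower, upper)
print(kept.tolist())
```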

#### Save the preprocessed data to CSV

In [48]:
data_frame.to_csv('7.preprocessed_data.csv')
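
By default, to_csv also writes the row index as an extra, unnamed first column. After filtering, that index is no longer contiguous, so you may prefer to leave it out with index=False. The sketch below uses a made-up frame and writes to a string instead of a file, so it can be run without touching the disk.

```python
import io
import pandas as pd

# Hypothetical toy data with a non-contiguous index, as left behind by filtering
toy = pd.DataFrame({"region": ["Asia", "Europe"], "units": [10, 20]}, index=[5, 9])

with_index = toy.to_csv()                # no path given: returns the CSV as a string
without_index = toy.to_csv(index=False)  # omit the index column

# Reading the index-free version back gives only the original columns
reloaded = pd.read_csv(io.StringIO(without_index))

print(with_index)
print(without_index)
```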