[Home](../../README.md)

### Data Preprocessing

This is a demonstration of data preprocessing using [Pandas](https://pandas.pydata.org/), a library for data analysis and manipulation.

This Jupyter Notebook is less about steps and more about the different processes you can apply to your data before data wrangling. For this demonstration we will use a relatively complex real dataset that compares health measures with the speed of progression of adult-onset type 2 diabetes.

Creating New Features:

- Deriving new variables from existing ones (e.g., calculating the age from a birthdate).
- Combining features (e.g., creating interaction terms).

Feature Selection:

- Identifying the most relevant features for the model (e.g., using techniques like recursive feature elimination).

Transforming Features:

- Applying mathematical transformations (e.g., logarithmic transformations).
- Binning continuous variables into categorical bins.

Domain-Specific Features:

- Incorporating knowledge from the specific domain to create features that capture important characteristics of the data.
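As a rough illustration of the feature-creation and feature-transformation ideas listed above, the following sketch derives an age from a birthdate, builds an interaction term, applies a logarithmic transformation, and bins a continuous variable. The `patients` frame, its column names, the reference date, and the BMI bin edges are all hypothetical and are not taken from the diabetes dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; these columns are illustrative only.
patients = pd.DataFrame({
    "birthdate": pd.to_datetime(["1980-05-17", "1995-11-02", "1962-01-30"]),
    "bmi": [22.5, 31.0, 27.4],
    "glucose": [85.0, 140.0, 110.0],
})

# Derive a new variable: age in whole years on an assumed reference date.
reference_date = pd.Timestamp("2024-01-01")
birth = patients["birthdate"]
# True when this year's birthday falls after the reference date.
birthday_pending = (birth.dt.month * 100 + birth.dt.day) > (
    reference_date.month * 100 + reference_date.day
)
patients["age"] = reference_date.year - birth.dt.year - birthday_pending.astype(int)

# Combine features: a simple interaction term.
patients["bmi_x_glucose"] = patients["bmi"] * patients["glucose"]

# Transform a feature: log(1 + x) compresses large values.
patients["log_glucose"] = np.log1p(patients["glucose"])

# Bin a continuous variable into categories (bin edges are an assumption).
patients["bmi_category"] = pd.cut(
    patients["bmi"],
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["underweight", "normal", "overweight", "obese"],
)

print(patients)
```

Feature selection techniques such as recursive feature elimination need a model and a target variable, so they are left out of this sketch.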

#### Load the required dependencies

In [1]:
# Import frameworks
import pandas as pd

#### Store the data as a local variable

A data frame is a Pandas object that holds your tabular data in labelled rows and columns. Reading the CSV loads the complete data into memory, so it is ready for preprocessing.

In [2]:
data_frame = pd.read_csv("2.1.1.Diabeties_Sample_Data.csv")
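After loading, it is usually worth confirming the shape, the column types, and the memory footprint before changing anything. The sketch below does this on a small inline CSV, because the real file is not reproduced here; the `csv_text` content and its column names are hypothetical stand-ins.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file.
csv_text = """AGE,SEX,BMI
59,Female,32.1
48,Male,21.6
72,Female,30.5
"""

# read_csv accepts a file path or any file-like object.
sample_frame = pd.read_csv(io.StringIO(csv_text))

print(sample_frame.shape)   # (rows, columns)
print(sample_frame.dtypes)  # inferred type of each column

# deep=True also counts the memory used by the string values themselves.
memory_bytes = sample_frame.memory_usage(deep=True).sum()
print(f"{memory_bytes} bytes in memory")
```

The same three checks can be run on `data_frame` directly once the real file has been loaded.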

#### Replace data

We can run a lambda function on a column to modify its values. For a simple example, let’s convert the 'SEX' column to lowercase. To run a function over a complete column, we can use the apply method, which calls the function on each value in the column and returns the results as a new column.

In [14]:
data_frame['SEX'] = data_frame['SEX'].apply(lambda x: x.lower())
print(data_frame['SEX'].head())

0    female
1      male
2    female
3      male
4      male
Name: SEX, dtype: object
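For plain string operations, Pandas also offers vectorized string methods through the `.str` accessor, which pass missing values through instead of raising an error as `x.lower()` would on a non-string value. The sketch below shows this as an alternative, together with `Series.replace` for swapping values; the sample series and the short codes `"f"` and `"m"` are hypothetical.

```python
import pandas as pd

# Hypothetical column with mixed case and one missing value.
sex = pd.Series(["Female", "MALE", None, "male"], name="SEX")

# Vectorized lowercase; the missing entry stays missing.
lowered = sex.str.lower()

# Replace whole values using a mapping (the codes are illustrative).
coded = lowered.replace({"female": "f", "male": "m"})

print(lowered)
print(coded)
```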


#### Filter Your Data

Filtering is like applying a WHERE clause in a database query. It is widely used and helps when you need to work on a specific subset of your data. For our use case, let us filter the data to include only rows where 'SEX' is 'male' (the values were converted to lowercase in the previous step). No dedicated method is needed; conditional (boolean) indexing does the job.

In [None]:
# Keep only the rows where the condition is True
data_frame = data_frame[data_frame['SEX'] == 'male']
data_frame.head()
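Conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses, and `DataFrame.query` expresses the same filter as a string. The following is a minimal sketch on a hypothetical `people` frame; the column values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows.
people = pd.DataFrame({
    "SEX": ["female", "male", "male", "female"],
    "AGE": [59, 48, 72, 24],
    "BMI": [32.1, 21.6, 30.5, 25.3],
})

# Single condition: a boolean mask selects the matching rows.
males = people[people["SEX"] == "male"]

# Two conditions: parentheses are required around each comparison.
older_males = people[(people["SEX"] == "male") & (people["AGE"] > 50)]

# The same filter written with query().
same_rows = people.query("SEX == 'male' and AGE > 50")

print(older_males)
```

Note that filtering keeps the original row labels, so the index of the result is no longer contiguous.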

#### Remove outliers with quantile()

Outliers can skew your analysis of numerical columns, so it is often worth removing them. We can use the 25th and 75th percentiles of a numerical column (the first and third quartiles, Q1 and Q3) to compute the interquartile range (IQR = Q3 - Q1). This lets us estimate an acceptable range, and we can then filter out any values outside it. By the common 1.5 × IQR rule, outliers are values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

In [None]:
# Get the interquartile range of the 'Units Sold' column
print(data_frame['Units Sold'].describe())
Q1 = data_frame['Units Sold'].quantile(0.25)
Q3 = data_frame['Units Sold'].quantile(0.75)
IQR = Q3 - Q1

In [None]:
# Keep only the rows whose 'Units Sold' value is within the acceptable range
data_frame = data_frame[(data_frame['Units Sold'] >= Q1 - 1.5 * IQR) & (data_frame['Units Sold'] <= Q3 + 1.5 * IQR)]
print(data_frame['Units Sold'].describe())
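The same rule can be wrapped in a small helper so it is easy to reuse on any numerical column. This is only a sketch: the function name `iqr_bounds`, its `k` parameter, and the sample `readings` are hypothetical and not part of Pandas or of the dataset.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) limits of the k * IQR rule for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical measurements with one obvious outlier.
readings = pd.Series([10, 12, 11, 13, 12, 95], name="reading")

lower, upper = iqr_bounds(readings)

# between() is inclusive on both ends by default.
inliers = readings[readings.between(lower, upper)]

print(lower, upper)
print(inliers.tolist())
```

Here Q1 is 11.25 and Q3 is 12.75, so the acceptable range is 9.0 to 15.0 and only the value 95 is dropped.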

#### Save the preprocessed data to CSV

In [None]:
# index=False keeps the row index out of the saved file
data_frame.to_csv('2.1.1.preprocessed_data.csv', index=False)
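The `index` parameter of `to_csv` controls whether the row labels are written as an extra, unnamed first column. After filtering, those labels are non-contiguous and usually not worth keeping. The sketch below compares both settings on a hypothetical two-row frame, writing to a string instead of a file.

```python
import io

import pandas as pd

# Hypothetical frame whose index has a gap, as it would after filtering.
frame = pd.DataFrame(
    {"AGE": [59, 72], "SEX": ["female", "male"]},
    index=[0, 2],
)

# With no path argument, to_csv returns the CSV text as a string.
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty column name
print(without_index.splitlines()[0])  # header contains only the data columns

# Reading the index-free text back gives a fresh 0..n-1 index.
reloaded = pd.read_csv(io.StringIO(without_index))
print(reloaded)
```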