# Data Preprocessing
Preprocessing refers to the transformations applied to data before it is fed to an algorithm. It converts raw data into a clean data set. Data gathered from different sources usually arrives in a raw format that is not suitable for analysis.

## Need for Data Preprocessing
A Machine Learning model produces better results when its input data is in a suitable format. Some models require data in a specific form. For example, many Random Forest implementations do not accept null values, so nulls in the raw data set must be handled before the algorithm can run.
The data set should also be formatted so that more than one Machine Learning or Deep Learning algorithm can be run on it and the best performer chosen.

## Example Dataset
We'll continue using the fictional dataset customers.csv with the following columns:

- CustomerID: Unique identifier for each customer.
- Age: Age of the customer.
- Income: Annual income of the customer.
- Gender: Gender of the customer (Male/Female).
- SpendingScore: A score assigned by the retail store based on customer behavior and spending patterns (scale of 1-100).
- City: The city where the customer resides.
- JoinedDate: The date when the customer joined the store's loyalty program.
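
To follow the steps without the actual file, a small stand-in for customers.csv can be built in memory. This is only a sketch: the column names come from the list above, while the name `sample_df` and every value in it are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the customers.csv columns; all values are made up.
sample_df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Age": [25, 41, np.nan, 33, 58],
    "Income": [42000.0, 68000.0, 55000.0, np.nan, 1000000.0],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "SpendingScore": [62, 35, 80, 47, 15],
    "City": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
    "JoinedDate": ["2021-03-14", "2019-11-02", "2022-06-30", "2020-01-19", "2018-08-07"],
})

print(sample_df.dtypes)
print(sample_df.shape)
```

The missing Age and Income values and the one very large income are deliberate, so that the missing-data and outlier steps have something to act on.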

### Steps in Data Preprocessing
Step 1: Import the necessary libraries




In [None]:
# importing libraries
import pandas as pd
import scipy
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Load the dataset

Explanation: Load data into a pandas DataFrame.

In [None]:
df = pd.read_csv('customers.csv')

Step 3: Understanding the Dataset

Explanation: Explore the dataset's structure and summary statistics.


In [None]:
print(df.head())
print(df.info())

Count the null values in each column using `df.isnull().sum()`:

In [None]:
df.isnull().sum()
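
Raw counts are easier to judge as a share of the rows. A minimal sketch, assuming a small hypothetical frame named `demo` in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; the values are made up.
demo = pd.DataFrame({
    "Age": [25, np.nan, 33, 58],
    "Income": [42000.0, 68000.0, np.nan, np.nan],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
missing_pct = demo.isnull().mean() * 100
print(missing_pct)
```

A column with a very high share of missing values may be a candidate for removal rather than filling, though where to draw that line depends on the project.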

Step 4: Statistical Analysis

Statistical analysis starts with `df.describe()`, which gives a descriptive overview of the numeric columns in the dataset.

In [None]:
print(df.describe())

Step 5: Handling Missing Data

Explanation: Handle missing values by filling or removing them.

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Income'] = df['Income'].fillna(df['Income'].mean())
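
Besides filling numeric columns, the step description also mentions removing rows, and categorical columns need a different fill value. The following is a sketch on a hypothetical frame named `demo`, not part of the original pipeline:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the values are made up.
demo = pd.DataFrame({
    "Age": [25, np.nan, 33, 58],
    "Gender": ["Male", "Female", None, "Female"],
    "SpendingScore": [62, 35, np.nan, 15],
})

# Option 1: drop rows where a column that must be present is missing.
dropped = demo.dropna(subset=["SpendingScore"])

# Option 2: fill a categorical column with its most frequent value (the mode).
filled = demo.copy()
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode()[0])

print(len(demo), len(dropped))
print(filled["Gender"].tolist())
```

Dropping is simplest when only a few rows are affected; filling keeps the rows but introduces values that were never observed.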

Step 6: Handling Outliers

### What Is an Outlier?

An Outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Identifying outliers is important in statistics and data analysis because they can have a significant impact on the results of statistical analyses. The analysis for outlier detection is referred to as outlier mining.

Outliers can skew the mean (average) and affect measures of central tendency, as well as influence the results of tests of statistical significance.

### How Are Outliers Caused?

Outliers can be caused by a variety of factors, and they often result from genuine variability in the data or from errors in data collection, measurement, or recording. Some common causes of outliers are:

- Measurement errors: Errors in data collection or measurement processes can lead to outliers.
- Sampling errors: In some cases, outliers can arise due to issues with the sampling process.
- Natural variability: Inherent variability in certain phenomena can also lead to outliers. Some systems may exhibit extreme values due to the nature of the process being studied.
- Data entry errors: Human errors during data entry can introduce outliers.
- Experimental errors: In experimental settings, anomalies may occur due to uncontrolled factors, equipment malfunctions, or unexpected events.
- Sampling from multiple populations: Data is inadvertently combined from multiple populations with different characteristics.
- Intentional outliers: Outliers are introduced intentionally to test the robustness of statistical methods.

### Outlier Detection and Removal

A pandas DataFrame is used here because real-world projects need to detect outliers that surface during the data analysis step. The same approach also works on lists and Series objects.

For example, if most customers have an annual income between $30,000 and $100,000, a customer with an income of $1,000,000 might be considered an outlier.

Explanation: Identify and remove outliers to prevent them from skewing the analysis.

### Detecting Outliers Using the Interquartile Range (IQR) Method

The IQR method is a common technique for detecting outliers:

- IQR: The range between the first quartile (25th percentile) and the third quartile (75th percentile).

- Outliers: Data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

In [None]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detecting outliers
outliers = df[(df['Income'] < lower_bound) | (df['Income'] > upper_bound)]
print("Detected outliers:")
print(outliers)
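
The same calculation can be wrapped in a helper so it applies to any numeric column. This is a sketch: the function name `iqr_bounds`, its parameter `k`, and the sample incomes are all hypothetical.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical incomes in thousands of dollars; the last one is extreme.
incomes = pd.Series([30, 40, 50, 60, 70, 1000])
lower, upper = iqr_bounds(incomes)
flagged = incomes[(incomes < lower) | (incomes > upper)]

print(lower, upper)        # 5.0 105.0
print(flagged.tolist())    # [1000]
```

Here Q1 is 42.5 and Q3 is 67.5, so the IQR is 25 and the bounds are 42.5 - 37.5 = 5 and 67.5 + 37.5 = 105.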

### Visualizing Outliers Using Boxplot
Visualizing outliers can help in understanding their distribution. A boxplot is a great way to visualize the presence of outliers.

In [None]:
# Boxplot to visualize outliers
plt.figure(figsize=(8, 6))
df.boxplot(column=['Income'])
plt.title('Boxplot of Income')
plt.show()

The boxplot displays the distribution of the Income variable. The points outside the whiskers of the boxplot are considered outliers.

### Removing Outliers
Once the outliers are detected, they can be removed to prevent them from skewing the analysis.

In [None]:
# Removing outliers
df_cleaned = df[(df['Income'] >= lower_bound) & (df['Income'] <= upper_bound)]
print("Data after removing outliers:")
print(df_cleaned.head())

The cleaned dataset df_cleaned now contains only those data points that lie within the calculated bounds, effectively removing the detected outliers.
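
Removing rows is not the only way to limit the influence of extreme values. When the affected rows carry other useful information, the values can instead be capped at the bounds. The sketch below shows that alternative on hypothetical incomes; it is not the approach used in the rest of this walkthrough.

```python
import pandas as pd

# Hypothetical incomes in thousands of dollars.
incomes = pd.Series([30, 40, 50, 60, 70, 1000], name="Income")

q1, q3 = incomes.quantile(0.25), incomes.quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() replaces values outside [lower_bound, upper_bound] with the nearest bound.
capped = incomes.clip(lower=lower_bound, upper=upper_bound)

print(capped.tolist())
```

All six rows are kept, and the extreme income is pulled down to the upper bound.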

### Verifying Outlier Removal
After removing outliers, it’s a good practice to verify that they have been successfully removed.

In [None]:
# Check if any outliers are still present
outliers_after_removal = df_cleaned[(df_cleaned['Income'] < lower_bound) | (df_cleaned['Income'] > upper_bound)]
print("Outliers present after removal:")
print(outliers_after_removal)

An empty result confirms that no values outside the original bounds remain in the dataset.

### Visualizing the Cleaned Data
You can visualize the cleaned data using a boxplot to confirm that the outliers have been removed.

In [None]:
# Boxplot after removing outliers
plt.figure(figsize=(8, 6))
df_cleaned.boxplot(column=['Income'])
plt.title('Boxplot of Income After Outlier Removal')
plt.show()

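The boxplot should no longer show the extreme points seen earlier. Because the quartiles and whiskers are recomputed from the cleaned data, a few points may still fall just outside the new whiskers; this does not mean the removal failed.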

### Encoding Categorical Data
Explanation: Convert categorical variables into numerical format.

In [None]:
df = pd.get_dummies(df, columns=['City'])
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
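
To see what the two encoders produce, here is a sketch on a three-row hypothetical frame named `demo` (the city names are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "City": ["Pune", "Delhi", "Pune"],
})

# One-hot encoding: one indicator column per distinct city.
encoded = pd.get_dummies(demo, columns=["City"])

# Label encoding: classes are sorted, so Female -> 0 and Male -> 1.
le = LabelEncoder()
encoded["Gender"] = le.fit_transform(encoded["Gender"])

print(encoded.columns.tolist())    # ['Gender', 'City_Delhi', 'City_Pune']
print(list(le.classes_))
print(encoded["Gender"].tolist())  # [1, 0, 0]
```

One-hot encoding avoids implying an order between cities, while a single integer column is enough for a two-valued feature such as Gender.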

### Feature Scaling
a. Standardization
- Explanation: Scale features to have a mean of 0 and standard deviation of 1.
- When to Use: Use when features have different units or distributions that need to be normalized.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
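
Standardization computes z = (x - mean) / std for each column. The sketch below checks a manual calculation against `StandardScaler` on hypothetical ages; note that the scaler uses the population standard deviation, which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages as a single-column 2D array.
ages = np.array([[20.0], [30.0], [40.0], [50.0]])

# Manual z-score: subtract the mean, divide by the population standard deviation.
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```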

b. Normalization
- Explanation: Scale features to a range (usually 0 to 1).
- When to Use: Use when features need to be scaled to a specific range, often in algorithms sensitive to scales like Neural Networks.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['SpendingScore']] = scaler.fit_transform(df[['SpendingScore']])
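
Min-max normalization computes (x - min) / (max - min) for each column. A sketch on hypothetical scores, comparing the manual formula with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical spending scores as a single-column 2D array.
scores = np.array([[10.0], [25.0], [55.0], [100.0]])

# Manual min-max scaling to the range [0, 1].
col_min, col_max = scores.min(axis=0), scores.max(axis=0)
manual = (scores - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(scores)

print(scaled.ravel())
```

The smallest score maps to 0 and the largest to 1, so a single extreme value compresses all the others toward one end of the range.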

### Correlation Analysis
- Explanation: Check the correlation between features to understand their relationships.
- When to Use: Use to identify multicollinearity or redundant features.

In [None]:
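# Restrict to numeric columns; text columns such as JoinedDate would raise an error
correlation_matrix = df.corr(numeric_only=True)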
print(correlation_matrix)

The correlation matrix shows the strength and direction of relationships between features. For example, if Income and SpendingScore have a high positive correlation, it might indicate that wealthier customers spend more.
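
With many features, scanning the full matrix by eye is tedious. One possible way to list only the strongly related pairs is sketched below on synthetic data; the 0.8 threshold and the generated values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features; Income is built to track Age closely.
rng = np.random.default_rng(0)
age = rng.integers(18, 70, size=200)
demo = pd.DataFrame({
    "Age": age,
    "Income": age * 1000 + rng.normal(0, 2000, size=200),
    "SpendingScore": rng.integers(1, 101, size=200),
})

corr = demo.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.8  # illustrative cut-off, not a universal rule
high = pairs[pairs > threshold]
print(high)
```

Pairs that appear in `high` are candidates for dropping one of the two features or combining them.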

### Feature Engineering
Explanation: Create or transform features to reveal patterns.

In [None]:
# Note: these bins assume Age is in raw years, so run this before Age is standardized
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 35, 60, 100], labels=['Youth', 'Adult', 'Senior', 'Elderly'])
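
Date columns such as JoinedDate are another common source of derived features. The sketch below is an assumption-laden example: the feature names `JoinYear` and `TenureDays` and the reference date are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({"JoinedDate": ["2021-03-14", "2019-11-02", "2022-06-30"]})

# Parse the strings into datetimes so date arithmetic works.
demo["JoinedDate"] = pd.to_datetime(demo["JoinedDate"])

# Hypothetical reference date; in practice this might be the data extraction date.
reference_date = pd.Timestamp("2023-01-01")

demo["JoinYear"] = demo["JoinedDate"].dt.year
demo["TenureDays"] = (reference_date - demo["JoinedDate"]).dt.days

print(demo)
```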


### Dimensionality Reduction
Explanation: Reduce the number of features while retaining most of the variance.

In [None]:
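from sklearn.decomposition import PCA
# PCA needs numeric input, so keep only the numeric columns
numeric_df = df.select_dtypes(include='number')
pca = PCA(n_components=2)
df_pca = pca.fit_transform(numeric_df)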
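
How much variance the retained components keep can be read from `explained_variance_ratio_`. A sketch on synthetic numeric data (all values generated, not taken from customers.csv):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data: 100 rows, 4 features, two of them strongly related.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] * 2 + rng.normal(scale=0.1, size=100),
    base[:, 1],
    rng.normal(size=100),
])

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)
print(pca.explained_variance_ratio_)
```

If the ratios sum to a small fraction, two components are probably too few for that data.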


### Splitting the Dataset
Explanation: Split the data into training and testing sets for model evaluation.

In [None]:
from sklearn.model_selection import train_test_split
X = df.drop('SpendingScore', axis=1)
y = df['SpendingScore']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
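
A common refinement, not shown in the steps above, is to fit scalers on the training rows only and reuse the learned statistics on the test rows, so that no information from the test set leaks into preprocessing. A sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(100, 2))  # hypothetical Age/Income-like features
y = rng.integers(1, 101, size=100)               # hypothetical SpendingScore values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.shape, X_test_scaled.shape)
```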


### Dealing with Imbalanced Data
Explanation: Balance classes in classification problems using techniques like SMOTE.

In [None]:
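from imblearn.over_sampling import SMOTE
# SMOTE needs numeric features and a discrete class label; y_train here stands for such a label,
# not the continuous SpendingScore used in the split above. Resample the training set only.
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)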
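
For intuition about what balancing does, the sketch below uses plain random oversampling with pandas. This is a simpler technique than SMOTE: it duplicates existing minority rows instead of synthesizing new ones. The `Churned` label and all values are hypothetical.

```python
import pandas as pd

# Hypothetical training data with a binary label: 8 rows of class 0, 2 of class 1.
train = pd.DataFrame({
    "Income": [40, 42, 45, 50, 52, 55, 58, 60, 95, 99],
    "Churned": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority_size = train["Churned"].value_counts().max()

# Resample each smaller class with replacement up to the majority class size.
parts = []
for label, group in train.groupby("Churned"):
    if len(group) < majority_size:
        group = group.sample(n=majority_size, replace=True, random_state=42)
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["Churned"].value_counts())
```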


### Text Data Preprocessing (for NLP)
Explanation: Preprocess text data by tokenizing, removing stopwords, and stemming.

In [None]:
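import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# One-time downloads of the tokenizer and stopword data (newer NLTK releases may ask for 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
# 'CustomerFeedback' is a free-text column that is not part of customers.csv as listed above
df['tokens'] = df['CustomerFeedback'].apply(word_tokenize)
stop_words = set(stopwords.words('english'))
# Compare in lowercase because the NLTK stopword list is lowercase
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word.lower() not in stop_words])
ps = PorterStemmer()
df['tokens'] = df['tokens'].apply(lambda x: [ps.stem(word) for word in x])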


### Saving and Loading Preprocessed Data

In [None]:
df.to_csv('preprocessed_customers.csv', index=False)
df = pd.read_csv('preprocessed_customers.csv')
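
CSV files store everything as text, so column types such as dates are not preserved across a save and reload. The sketch below illustrates this with an in-memory buffer and a hypothetical two-row frame; `parse_dates` asks pandas to restore the date column on load.

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "CustomerID": [1, 2],
    "JoinedDate": pd.to_datetime(["2021-03-14", "2019-11-02"]),
})

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # JoinedDate comes back as text

buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["JoinedDate"])  # JoinedDate restored as datetime

print(plain["JoinedDate"].dtype, parsed["JoinedDate"].dtype)
```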