# Data cleaning and preprocessing
## Data Normalization and Scaling
In data preprocessing, one essential step is data normalization and scaling. These techniques help us to standardize the range of independent variables or features of data. We'll delve into the importance of data normalization and scaling, common techniques, and their implementation in Python.

## Understanding the Importance of Data Normalization and Scaling

Machine learning algorithms perform better when input numerical variables fall within a similar scale. Without normalization or scaling, features with higher values may dominate the model's outcome. This could lead to misleading results and a model that fails to capture the influence of other features.

Normalization and scaling bring different features to the same scale, allowing a fair comparison and ensuring that no particular feature dominates others. Moreover, these techniques can also accelerate the training process. For instance, gradient descent converges faster when features are on similar scales.

### Techniques for Data Normalization

Data normalization is a method to change the values of numeric columns in a dataset to a common scale. Here are a few normalization techniques:

**Min-Max Scaling**

Min-max scaling is one of the simplest methods to normalize data. It scales and translates each feature individually such that it is in the range of 0 to 1.



In [None]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Create a simple dataset
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
# Create a scaler, fit and transform the data
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
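
To make the transformation concrete, the following sketch applies the min-max formula, (x - min) / (max - min), by hand and compares the result with `MinMaxScaler`. It reuses the same toy column as the cell above; the variable names `manual` and `scaled` are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same toy column as above: the integers 1 through 10
data = np.arange(1, 11, dtype=float).reshape(-1, 1)

# Min-max formula applied by hand: (x - min) / (max - min)
manual = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# The scikit-learn scaler should give the same values
scaled = MinMaxScaler().fit_transform(data)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The smallest value maps to 0 and the largest to 1, so a single extreme value can compress everything else into a narrow band.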

**Z-score Normalization (Standardization)**

This technique standardizes each feature so that it has a mean of 0 and a standard deviation
of 1. Each value is transformed as z = (x - mean) / standard deviation, which shifts and rescales the feature without changing the shape of its distribution.

In [None]:
from sklearn.preprocessing import StandardScaler
# Create a scaler, fit and transform the data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
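
As a quick check of the formula, this sketch computes the z-scores by hand and compares them with `StandardScaler`. One detail worth noting is that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default. The toy data is the same column of integers used earlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.arange(1, 11, dtype=float).reshape(-1, 1)

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is what StandardScaler uses
manual = (data - data.mean(axis=0)) / data.std(axis=0)
scaled = StandardScaler().fit_transform(data)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```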

**Feature Scaling Techniques**

Feature scaling is an umbrella term for techniques that change the range of a feature. In
addition to the aforementioned normalization techniques, the following methods are also used
for feature scaling:

* **Robust Scaling**

Robust scaling centers each feature on its median and divides by the interquartile range (IQR),
rather than using the minimum and maximum as min-max scaling does, which makes it robust to outliers.



In [None]:
## Robust Scaling
from sklearn.preprocessing import RobustScaler
# Create a scaler, fit and transform the data
scaler = RobustScaler()
robust_scaled_data = scaler.fit_transform(data)
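
The effect of an outlier is easier to see on a small example. The sketch below uses a hypothetical column with one extreme value, computes (x - median) / IQR by hand, and compares how min-max scaling and robust scaling treat the nine ordinary values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Nine ordinary values plus one extreme outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

# Robust scaling by hand: (x - median) / (Q3 - Q1)
median = np.median(data, axis=0)
q1, q3 = np.percentile(data, [25, 75], axis=0)
manual = (data - median) / (q3 - q1)

robust = RobustScaler().fit_transform(data)
min_max = MinMaxScaler().fit_transform(data)

print(np.allclose(manual, robust))
# With min-max scaling the nine ordinary values are squeezed close to 0
print(min_max[:9].ravel().round(4))
# With robust scaling they keep a usable spread around 0
print(robust[:9].ravel().round(4))
```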

### Implementing Data Normalization and Scaling with Python

Let's take a closer look at how to implement normalization and scaling with a Python example:

The script below creates a simple dataframe with three columns. We then initialize three different
types of scalers - `MinMaxScaler`, `StandardScaler`, and `RobustScaler`. We use these scalers to fit
and transform our dataframe, creating three new dataframes each scaled by a different
method. Finally, we print the original data and the transformed data to see the differences.

Data normalization and scaling are powerful techniques that can help to prepare your data for machine learning algorithms. These techniques ensure that all features contribute equally to the final decision of the model, regardless of their original scale.



In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Let's create a simple dataframe
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [100, 200, 400, 800, 1000],
'C': [200, 400, 600, 800, 1000]
})

# Initialize a min-max scaler
min_max_scaler = MinMaxScaler()

# Scale the dataframe
df_min_max = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)

# Initialize a standard scaler
std_scaler = StandardScaler()

# Scale the dataframe
df_std = pd.DataFrame(std_scaler.fit_transform(df), columns=df.columns)

# Initialize a robust scaler
robust_scaler = RobustScaler()

# Scale the dataframe
df_robust = pd.DataFrame(robust_scaler.fit_transform(df), columns=df.columns)

print("Original Data")
print(df)
print("\nMin-Max Scaled Data")
print(df_min_max)
print("\nStandard Scaled Data")
print(df_std)
print("\nRobust Scaled Data")
print(df_robust)

## Feature Selection and Extraction

Feature selection and extraction are pivotal steps in the data preprocessing pipeline for machine learning and data science projects. These techniques can make the difference between a model that performs exceptionally well and one that falls flat. In this chapter, we will cover the basics of feature selection and extraction, discuss some common techniques, and
implement these techniques in Python.

### Introduction to Feature Selection and Extraction

Feature selection and extraction techniques are used to reduce the dimensionality of the data,
thus enhancing computational efficiency and potentially improving the model's performance.

* **Feature Selection**
Feature selection is the process of selecting a subset of relevant features (variables or predictors) for use in model construction. This is important for the following reasons:

 i. **Simplicity:** Fewer features make the model simpler and easier to interpret.<br>
 ii. **Speed:** Less data means algorithms train faster.  <br>
 iii. **Prevention of overfitting:** Less redundant data means less opportunity to make decisions based on noise.<br>


 * **Feature Extraction**

Feature extraction, on the other hand, is the process of transforming or mapping the original high-dimensional data into a lower-dimensional space. Unlike feature selection, where we keep the original features, feature extraction creates new ones that represent most of the "useful" information in the original data. The benefits are:

**Dimensionality reduction:** Similar to feature selection, fewer features speed up training.<br>

**Better performance:** Sometimes, the model can learn better in the transformed space.

### Techniques for Feature Selection
Feature selection methods are typically categorized into three classes: `filter methods`, `wrapper methods`, and `embedded methods`.

* **Filter Methods**

Filter methods select features based on their scores in statistical tests for their correlation with
the outcome variable. Examples include the chi-squared test, information gain, and correlation coefficient scores. These methods are fast and straightforward but they ignore the potential combined effect of individual features.

 * **Wrapper Methods**

Wrapper methods consider the selection of a set of features as a search problem, where
different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy. Examples of wrapper methods are recursive feature elimination and forward
selection. These methods often yield the best performance but can be very expensive computationally.

* **Embedded Methods**

Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. The most common type of embedded feature selection methods are regularization methods.

Regularization methods, also called penalization methods, introduce additional constraints into the optimization of a predictive algorithm (such as a
regression algorithm) that bias the model toward lower complexity (fewer coefficients).

In [None]:
## Filter method
# Import libraries
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Feature selection
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)


### Wrapper Methods

from sklearn.feature_selection import RFE
from sklearn.svm import SVR
estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=2, step=1)
selector = selector.fit(X, y)


###Embedded Methods

from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression

# Build a regression dataset
X, y = make_regression(noise=4, random_state=0)

# LassoCV: Lasso linear model with iterative fitting along a regularization path
lasso = LassoCV().fit(X, y)
importance = np.abs(lasso.coef_)
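
`fit_transform` returns only the reduced array, so it does not say which columns survived. A minimal sketch of how the filter-method result could be inspected, assuming the Iris data used above, is to fit the selector first and then read its `scores_` attribute and `get_support()` mask:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Fit the selector separately so it can be inspected afterwards
selector = SelectKBest(chi2, k=2).fit(X, y)

# get_support() returns a boolean mask over the original columns
mask = selector.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]

print(dict(zip(iris.feature_names, selector.scores_.round(2))))
print(selected)
```

The same `get_support()` call is available on `RFE`, so the wrapper method can be inspected in the same way.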


### Feature Extraction Methods

Feature extraction methods reduce the dimensionality in the feature space by creating new
features from the existing ones (and sometimes discarding the original features). Here are two
widely-used techniques for feature extraction:

* **Principal Component Analysis (PCA)**

PCA is a technique used to emphasize variation and bring out strong patterns in a dataset. It's
often used to make data easy to explore and visualize.

* **t-Distributed Stochastic Neighbor Embedding (t-SNE)**

t-SNE is a machine learning algorithm for visualization developed by Laurens van der Maaten
and Geoffrey Hinton. It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions

Next, we look at code for these two extraction methods, and then combine the ideas discussed above into a single Python example that starts by importing the necessary libraries:

In [None]:
## Principal Component Analysis (PCA)

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

## t-Distributed Stochastic Neighbor Embedding (t-SNE)
from sklearn.manifold import TSNE
X_tsne = TSNE(n_components=2).fit_transform(X)
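
When using PCA it is often useful to check how much of the variance the retained components keep. The sketch below assumes the Iris data and standardizes it first, since PCA is sensitive to feature scale; `explained_variance_ratio_` then reports the share of variance per component.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# PCA is sensitive to feature scale, so standardize first
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# Fraction of the total variance captured by each retained component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
print(X_pca.shape)
```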

In [None]:
## import the necessary libraries
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.svm import SVR
from sklearn.linear_model import LassoCV
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

## Next, we load the Iris dataset:
iris = load_iris()
X, y = iris.data, iris.target

## Now we apply the filter method using chi-squared test:
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print("X shape after chi-squared feature selection: ", X_new.shape)

# Let's use Recursive Feature Elimination (RFE) as a wrapper method:
estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=2, step=1)
X_new = selector.fit_transform(X, y)
print("X shape after RFE: ", X_new.shape)

# Now let's try LassoCV as an embedded method:
lasso = LassoCV().fit(X, y)
importance = np.abs(lasso.coef_)
# Keep the two features with the largest absolute coefficients
idx_features = (-importance).argsort()[:2]
X_new = X[:, idx_features]
print("X shape after LassoCV: ", X_new.shape)

## Finally, we apply PCA and t-SNE for feature extraction:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("X shape after PCA: ", X_pca.shape)
X_tsne = TSNE(n_components=2).fit_transform(X)
print("X shape after t-SNE: ", X_tsne.shape)



### Wrap-up
The importance of feature selection and extraction cannot be overstated. These techniques
enable you to reduce the dimensionality of your data, which can both speed up the learning
process and potentially increase your model's performance. Understanding these techniques is
a vital part of the data preprocessing pipeline.

## Encoding Categorical Variables

Categorical variables are a common type of non-numeric data variable that are critical in many
data science and machine learning applications. Encoding categorical data is an important step
in the data preprocessing stage. In this chapter, we'll examine what categorical variables are,
their challenges, different encoding techniques, and how to handle high cardinality and rare
categories.


### Understanding Categorical Variables and Their Challenges
Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level. While the latter two variables may also be continuous, they are often categorized in practice.

Categorical variables pose a challenge when building machine learning models because these models, in essence, are algebraic. As a result, they require numerical inputs. This necessitates the transformation of categorical variables into a suitable numeric format, a process known as `categorical encoding`.

However, not all encodings are suitable for every problem. The choice of encoding often depends on the specifics of the data and the model to be used. Furthermore, some encoding techniques can significantly increase the dimensionality of the dataset, leading to longer training times and a higher chance of overfitting.

### Techniques for Categorical Variable Encoding

There are numerous techniques to encode categorical variables, each with its strengths and weaknesses. Here, we'll introduce two commonly used techniques: `one-hot encoding` and `label encoding`.

* **One-hot encoding**

One-hot encoding converts a categorical variable into a form that machine learning algorithms can use.
Each distinct category value becomes a new binary column that holds 1 when the row belongs to that category and 0 otherwise, so every value is represented as a binary vector.

* **Label encoding**

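Label encoding is a popular encoding technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering. Because the integers imply an order that the categories may not actually have, this encoding should be used with care for nominal features.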

In [None]:
## One-hot encoding
import pandas as pd

# Assuming `df` is your DataFrame and `category` is the categorical feature
df_one_hot = pd.get_dummies(df, columns=['category'], prefix='category')

## Label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
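
The cell above assumes an existing DataFrame. For a version that runs on its own, here is a small sketch with hypothetical toy data (the `color` column is made up for illustration) that shows what each encoding produces:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the column name 'color' is made up for illustration
toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# One-hot encoding: one indicator column per distinct value
one_hot = pd.get_dummies(toy, columns=['color'], prefix='color')
print(one_hot.columns.tolist())

# Label encoding: classes are sorted, then numbered from 0
le = LabelEncoder()
codes = le.fit_transform(toy['color'])
print(list(le.classes_))
print(codes.tolist())

# The fitted encoder can map the integers back to the original labels
print(le.inverse_transform(codes).tolist())
```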

### Dealing with High Cardinality and Rare Categories

High cardinality means that a category feature has a lot of unique values, which can be problematic for certain encoding methods. For example, a one-hot encoding of a high cardinality feature can greatly expand the memory footprint of your dataset.

One way to handle high cardinality is to group less common values into an **'other'** category. This can also help with the problem of rare categories, which may be present in your training data but unlikely to appear in future data.



In [None]:
## Dealing with High Cardinality and Rare Categories
threshold = 10  # example cutoff (an assumption): categories seen fewer than 10 times are grouped
counts = df['category'].value_counts()
other = counts[counts < threshold].index
df['category'] = df['category'].replace(other, 'Other')
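
A self-contained sketch of the same idea, using hypothetical city names and an arbitrary cutoff of 2, shows how the number of distinct categories shrinks after grouping. It uses `Series.where` instead of `replace`, which is one of several equivalent ways to do this.

```python
import pandas as pd

# Hypothetical toy data
cities = pd.Series(['Paris'] * 5 + ['London'] * 4 + ['Oslo', 'Lima', 'Riga'])

threshold = 2  # categories seen fewer than 2 times count as rare (an arbitrary choice)
counts = cities.value_counts()
rare = counts[counts < threshold].index

# where() keeps values where the condition is True and substitutes 'Other' elsewhere
grouped = cities.where(~cities.isin(rare), 'Other')

print(grouped.value_counts().to_dict())
```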

### Python Code Examples for Categorical Variable Encoding

Here's a full example of encoding a categorical feature in a dataset:

In the code block below,
* We start with a simple DataFrame containing name, sex, and city columns.

* We then perform `one-hot encoding` on the sex column using the get_dummies function from pandas, and label encoding on the city column using `LabelEncoder` from `scikit-learn`. The result is a DataFrame where sex and city are converted into numeric formats suitable for a machine learning model.

* Next, we address the issue of high cardinality and rare categories in the name column. We count the occurrence of each name using the `value_counts` function, and consider names that appear less than 2 times as "rare". We then replace these rare names with the label `'Other'`.


In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Let's create a simple DataFrame
data = {'name': ['John', 'Lisa', 'Peter', 'Carla', 'Eva', 'John'],
'sex': ['male', 'female', 'male', 'female', 'female', 'male'],
'city': ['London', 'London', 'Paris', 'Berlin', 'Paris', 'Berlin']}

df = pd.DataFrame(data)

# Display the original DataFrame before any encoding
print("Original DataFrame:")
print(df)

# One-hot encode the 'sex' column
df_one_hot = pd.get_dummies(df, columns=['sex'], prefix='sex')

# Label encode the 'city' column and store it alongside the one-hot columns
le = LabelEncoder()
df_one_hot['city_encoded'] = le.fit_transform(df_one_hot['city'])

print("\nDataFrame after one-hot encoding 'sex' and label encoding 'city':")
print(df_one_hot)

# Handle high cardinality and rare categories in 'name' column
counts = df['name'].value_counts()

# here we consider names appearing less than 2 times as "rare"
other = counts[counts < 2].index
df['name'] = df['name'].replace(other, 'Other')
print("\nDataFrame after handling high cardinality and rare categories in 'name' column:")
print(df)


### Wrap-up
Categorical encoding is a critical step in data preprocessing. Choosing the right encoding
technique for your data and model can significantly impact the model's performance.


## Handling Imbalanced Data
Imbalanced datasets are a common problem in machine learning, where the number of observations in one class is significantly lower than the others. In this chapter, we will discuss what imbalanced data is, its impact on machine learning models, and various techniques for handling imbalanced data.

### Understanding Imbalanced Data and Its Impact on Machine Learning

Imbalanced data, as the name suggests, refers to a situation in classification problems where the classes are not represented equally. For example, in a binary classification problem, we may have 100 samples, with 90 samples belonging to class 'A' (the majority class) and only 10 samples belonging to class 'B' (the minority class). This is a classic scenario of an imbalanced dataset.

The main problem with imbalanced datasets is that most machine learning algorithms work best when the number of samples in each class is about equal. This is because most algorithms are designed to maximize accuracy and reduce error. Thus, they tend to focus on
the majority class and ignore the minority class. They might only predict the majority class, and hence have a high accuracy rate, but this isn't useful because the minority class, which is usually the point of interest, is completely ignored.
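
The 90/10 example can be reproduced in a few lines. This sketch builds the labels directly and scores a hypothetical "model" that always predicts the majority class, so no real classifier is involved:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 90 samples of class 0 (majority) and 10 of class 1 (minority)
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=1)

print(accuracy)         # share of all samples predicted correctly
print(minority_recall)  # share of minority samples that were found
```

Accuracy looks high while recall on the minority class is zero, which is why metrics such as recall, precision, and F1 are usually preferred for imbalanced problems.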

### Techniques for Handling Imbalanced Classes

There are several strategies to handle imbalanced datasets. These strategies can broadly be divided into three categories: `resampling techniques`, `cost-sensitive learning`, and `ensemble methods`.

* **Resampling Techniques**

Resampling is the most straightforward way to handle imbalanced data, which involves removing samples from the majority class (undersampling) and/or adding more examples from the minority class (oversampling).

* **Cost-Sensitive Learning**

Cost-sensitive learning is a method that integrates the different misclassification costs (for false positives and false negatives) into the learning algorithm. In other words, it assigns a higher cost to misclassifying the minority class.


* **Ensemble Methods**
Ensemble methods, such as random forests or boosting algorithms, can also be used to deal with imbalanced datasets. These methods work by creating multiple models and then combining them to produce the final prediction.


In [None]:
#Resampling Techniques
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Assuming `X` is your feature set and `y` is the target variable
ros = RandomOverSampler()
X_resampled, y_resampled = ros.fit_resample(X, y)
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X, y)

##Cost-Sensitive Learning
from sklearn.svm import SVC
# Create a SVC model with 'balanced' class weight
clf = SVC(class_weight='balanced')
clf.fit(X, y)


##Ensemble Methods
from sklearn.ensemble import RandomForestClassifier
# Create a random forest classifier
clf = RandomForestClassifier()
clf.fit(X, y)
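
To see what `class_weight='balanced'` means in practice, the sketch below computes the weights for a hypothetical 90/10 label vector, both by hand and with scikit-learn's `compute_class_weight` helper. The balanced weight for a class is n_samples / (n_classes * number of samples in that class).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 majority samples and 10 minority samples
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# 'balanced' weight for a class: n_samples / (n_classes * count of that class)
manual = len(y) / (len(classes) * np.bincount(y))
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

The minority class receives a weight nine times larger than the majority class, so each of its errors costs more during training.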

### Code Example

In this example, we use a popular oversampling technique called SMOTE (Synthetic Minority
Over-sampling Technique). SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space, and drawing a new sample at a point along that line.

Specifically, a random example from the minority class is first chosen. Then, k of the nearest neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen, and a synthetic example is created at a randomly selected point between the two examples in feature space.

This approach is effective because new synthetic examples from the minority class are created that are plausible, that is, are relatively close in feature space to existing examples from the minority class.

In the final section of the code, we create a Random Forest classifier, fit it to the resampled data, make predictions on the test set, and print a classification report to observe the results.

In [None]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

# Let's assume that we have a binary classification problem
#with imbalanced classes

data = pd.read_csv('data.csv') # Replace with your data file
X = data.drop('target', axis=1) # Replace 'target' with your target variable
y = data['target'] # Replace 'target' with your target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)

# Check the distribution of target variable
print(y_train.value_counts())

# Apply SMOTE to generate synthetic samples
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Check the distribution of target variable after applying SMOTE
print(y_train_res.value_counts())

# Create a random forest classifier and fit it to the resampled data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_res, y_train_res)

# Predict on the test data and generate a classification report
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
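
The interpolation step that SMOTE performs, as described before the cell above, can be sketched in plain NumPy. This is an illustration of a single synthetic sample under simplifying assumptions (hypothetical 2-D points, `k = 3`), not the `imblearn` implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.0], [2.5, 2.2]])
k = 3  # number of nearest neighbors to consider (5 is a typical choice)

# 1. Pick a random minority example
i = rng.integers(len(minority))
x = minority[i]

# 2. Find its k nearest minority neighbors (index 0 of the sort is the point itself)
dists = np.linalg.norm(minority - x, axis=1)
neighbor_idx = np.argsort(dists)[1:k + 1]

# 3. Pick one neighbor at random and interpolate at a random point on the segment
neighbor = minority[rng.choice(neighbor_idx)]
lam = rng.random()
synthetic = x + lam * (neighbor - x)

print(x, neighbor, synthetic)
```

Because `lam` lies between 0 and 1, the synthetic point always falls on the line segment between the chosen example and its neighbor.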

### Wrap-up
In this chapter, we provided an overview of the challenges and strategies related to handling
imbalanced data. While we covered the most commonly used methods, it's worth noting that the optimal technique will depend on the specifics of the dataset and the problem at hand. Therefore, a good understanding of these methods is crucial for effectively handling
imbalanced datasets and ultimately building robust and reliable machine learning models.

## Data Integration and Transformation Techniques

In the world of data science, working with clean, well-structured data is the exception, not the rule. Often, data is scattered across multiple sources, each with its own structure and format.

Even when the data is all in one place, it might not be in a format that's optimal for the analysis or model you're planning to run. This chapter discusses data integration and transformation techniques that can help make the data more suitable for analysis.

### Data Integration Approaches
Data integration involves combining data from different sources and providing users with a unified view of these data. This process becomes significant in a variety of situations, both commercial (when two similar companies need to merge their databases) and scientific (combining research findings from different bioinformatics repositories, for example).

* **Merging**

Merging is the process of combining two or more data sets based on common columns between them.

* **Joining**

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. In Pandas, we can join dataframes using the join function.

* **Concatenating**

Concatenation is a process of appending datasets, i.e., it adds dataframes along a particular
axis, either row-wise or column-wise.

In [None]:
##Merging
# Assuming `df1` and `df2` are your dataframes
merged_df = pd.merge(df1, df2, on='common_column')

##Joining
# Assuming `df1` and `df2` are your dataframes
joined_df = df1.join(df2, lsuffix='_df1', rsuffix='_df2')


##Concatenating

# Assuming `df1` and `df2` are your dataframes
concat_df = pd.concat([df1, df2])
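
The three operations are easier to compare on concrete data. The following sketch uses two hypothetical toy tables (`customers` and `orders` are made-up names) and prints the shape of each result:

```python
import pandas as pd

# Hypothetical toy tables
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cy']})
orders = pd.DataFrame({'customer_id': [1, 1, 3, 4], 'amount': [10, 20, 30, 40]})

# Merge on a shared column; the default is an inner join
inner = pd.merge(customers, orders, on='customer_id')

# how='left' keeps every customer, filling missing order data with NaN
left = pd.merge(customers, orders, on='customer_id', how='left')

# join() aligns on the index rather than on a column
totals = orders.groupby('customer_id')['amount'].sum().rename('total')
joined = customers.set_index('customer_id').join(totals)

# concat() stacks rows; ignore_index=True renumbers the result
stacked = pd.concat([orders, orders], ignore_index=True)

print(inner.shape, left.shape, joined.shape, stacked.shape)
```

The inner merge drops customer 2 (no orders) and order 4 (no matching customer), while the left merge keeps customer 2 with a missing amount.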



### Data Transformation Techniques

Data transformation is the process of converting data from one format or structure into another format or structure.


* **Binning**

Binning is a data transformation technique used to group a set of continuous values into bins
or buckets. This can be particularly useful for managing noise or outliers.

* **Log Transformation**

Log transformation replaces each value x of a variable with log(x). The choice of the logarithm base is usually left to the analyst and depends on the purposes of the statistical modeling. Note that the logarithm is only defined for positive values.

* **Power Transformation**

A power transformation is a statistical technique to make data more closely match a normal distribution. scikit-learn's `PowerTransformer` uses the Yeo-Johnson method by default.

In [None]:
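## Binning
import numpy as np
import pandas as pd

# Assuming `df` is your dataframe and `age` is the column to bin
bins = [0, 18, 35, 60, np.inf]
names = ['<18', '18-35', '35-60', '60+']
# right=False makes each bin include its left edge, so age 18 falls in '18-35'
df['age_range'] = pd.cut(df['age'], bins, labels=names, right=False)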

##Log Transformation
# Assuming `df` is your dataframe and `price` is the column to transform
df['log_price'] = np.log(df['price'])


##Power Transformation

from sklearn.preprocessing import PowerTransformer
# Assuming `X` is your feature set
pt = PowerTransformer()
X_transformed = pt.fit_transform(X)



### Handling Skewed Distributions and Nonlinear Relationships

In statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. In other words, skewness tells you the amount and direction of the skew (the departure from symmetry). The skewness value can be positive, negative, zero, or undefined.

To handle skewed data, we often use transformations like logarithm, square root, or cube root
transformations which can normalize the data.

In [None]:
# Log transformation to handle right skewness
# Assuming `df` is your dataframe and `income` is the skewed feature

df['log_income'] = np.log(df['income'] + 1) # We add 1 to handle zero incomes

# Confirming the change in skewness
print("Old skewness: ", df['income'].skew())
print("New skewness: ", df['log_income'].skew())
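
Since the cell above assumes an existing `income` column, here is a runnable variant on synthetic data. The incomes are drawn from a log-normal distribution purely for illustration (they are not real data), and `np.log1p(x)` is used as an equivalent of `np.log(x + 1)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "income" values (log-normal); not real data
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=5000))

# log1p(x) computes log(1 + x), so zeros are handled safely
log_income = np.log1p(income)

print("Skewness before:", income.skew())
print("Skewness after: ", log_income.skew())
```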

Nonlinear relationships between variables can be addressed in several ways. One of the most common approaches is polynomial features, where features are raised to a power to capture nonlinear patterns.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
# Assuming `X` is your feature set
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

### Wrap-up
We have provided a broad overview of data integration and transformation techniques that are essential in data preprocessing. Understanding these techniques is crucial, as real-world data often requires extensive cleaning, preprocessing, and transformation to reveal
the underlying patterns and insights.

## Case Study: Predicting House Prices

In this section, we will consolidate the various techniques discussed in the previous sections through a practical case study. By the end of this exercise, we should have a firm grasp of how to apply data cleaning and preprocessing techniques in a real-world context.

For our case study, we will work with the [Ames Housing](https://github.com/OluwaMarg/Introduction-to-Python-for-Data-Analytics/blob/main/WK5_DATA%20CLEANING%202/AmesHousing.csv) dataset, a richly detailed and relatively large dataset with 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. Our task will be to predict the final price of each home.

The first step, as always, is to **load our data and the necessary Python libraries**.

In [None]:
from google.colab import files
files.upload()

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load the data
df = pd.read_csv('AmesHousing.csv')
df.head(5)

We'll then **split our data into training and test sets**. It's important to **fit preprocessing steps** on the training set only and then apply them to the other sets. Fitting on all of the data causes data leakage, which can lead to overly optimistic performance estimates.

In [6]:
# Split the data into training and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Further split the training data into training and validation sets
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)

Now, let's take a look at the first few rows of our training data.

In [None]:
train_df.head()

Given the number of features in this dataset, we can expect a variety of data cleaning and
preprocessing tasks. We will need to **handle missing data, outliers, categorical variables**, and
possibly more.

Let's start with **missing data**. The following cell shows the count of missing values in each column.

In [None]:
# Checking for missing data
missing_values = train_df.isnull().sum()
missing_values = missing_values[missing_values > 0]
print(missing_values.sort_values(ascending=False))

For simplicity, let's impute missing values with the median for numerical features, and the most frequent value for categorical features.

In [None]:
# Create our imputers
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
# Get lists of numeric and categorical column names
num_cols = train_df.select_dtypes(include=np.number).columns.tolist()
cat_cols = train_df.select_dtypes(include='object').columns.tolist()
# Impute missing values
train_df[num_cols] = num_imputer.fit_transform(train_df[num_cols])
train_df[cat_cols] = cat_imputer.fit_transform(train_df[cat_cols])

Next, let's handle **outliers**. For simplicity, we'll use the IQR method.

In [None]:
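Q1 = train_df[num_cols].quantile(0.25)
Q3 = train_df[num_cols].quantile(0.75)
IQR = Q3 - Q1
# Removing outliers: drop rows where any numeric column falls outside the IQR fences.
# Only the numeric columns are compared, since Q1, Q3 and IQR exist only for them.
# Note: applying this to every numeric column at once can discard many rows.
outlier_rows = ((train_df[num_cols] < (Q1 - 1.5 * IQR)) |
                (train_df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
train_df = train_df[~outlier_rows]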

For **encoding categorical variables**, we'll use one-hot encoding.

In [None]:
# Create a one-hot encoder
# (`sparse_output` replaced the older `sparse` argument in scikit-learn 1.2)
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Apply the encoder to the categorical columns
train_df_encoded = pd.DataFrame(encoder.fit_transform(train_df[cat_cols]))

# Add back the index and column names
train_df_encoded.index = train_df.index
train_df_encoded.columns = encoder.get_feature_names_out(cat_cols)

# Drop the original categorical columns and replace with the encoded ones
train_df = train_df.drop(cat_cols, axis=1)
train_df = pd.concat([train_df, train_df_encoded], axis=1)

With categorical variables handled, we can now move on to scaling the data. For this, we'll use the **StandardScaler** from `sklearn`.

In [None]:
#Create a standard scaler
scaler = StandardScaler()

# Scale the numeric columns
train_df[num_cols] = scaler.fit_transform(train_df[num_cols])

Finally, a word on **imbalanced data**. This is a regression task, so we don't
need to worry about imbalanced classes. In a classification task, however, we might use techniques such as `resampling`, `cost-sensitive learning`, or `ensemble methods` to handle imbalanced classes.


Now that we've preprocessed our training data, we can **apply the same transformations to the validation and test sets**. Note that we're using `transform`, not `fit_transform`, to ensure that the transformations fitted on the training data are reused unchanged.

In [None]:
# Impute missing values
val_df[num_cols] = num_imputer.transform(val_df[num_cols])
val_df[cat_cols] = cat_imputer.transform(val_df[cat_cols])

# Remove outliers (note: this is a simplified example)
val_outliers = ((val_df[num_cols] < (Q1 - 1.5 * IQR)) |
                (val_df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
val_df = val_df[~val_outliers]

# One-hot encode categorical columns
val_df_encoded = pd.DataFrame(encoder.transform(val_df[cat_cols]))
val_df_encoded.index = val_df.index
val_df_encoded.columns = encoder.get_feature_names_out(cat_cols)
val_df = val_df.drop(cat_cols, axis=1)
val_df = pd.concat([val_df, val_df_encoded], axis=1)

# Scale numeric columns
val_df[num_cols] = scaler.transform(val_df[num_cols])
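
The manual steps above can also be bundled with the `Pipeline` and `ColumnTransformer` classes imported at the start of the case study. The sketch below is one possible arrangement, shown on hypothetical toy data (the `area` and `style` columns are made up) rather than on the Ames table, and it leaves out the outlier step:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the housing table
train = pd.DataFrame({'area': [100.0, 150.0, np.nan, 200.0],
                      'style': ['A', 'B', 'A', np.nan]})
new = pd.DataFrame({'area': [120.0, np.nan], 'style': ['B', 'C']})

# Numeric columns: median imputation followed by standardization
numeric = Pipeline([('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler())])

# Categorical columns: most-frequent imputation followed by one-hot encoding
categorical = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                        ('encode', OneHotEncoder(handle_unknown='ignore'))])

preprocess = ColumnTransformer([('num', numeric, ['area']),
                                ('cat', categorical, ['style'])])

# Fit on the training data only, then reuse the fitted steps on new data
train_out = preprocess.fit_transform(train)
new_out = preprocess.transform(new)

print(train_out.shape, new_out.shape)
```

Keeping every step inside one fitted object makes it harder to accidentally call `fit_transform` on validation or test data, and the unseen category `'C'` is encoded as all zeros because of `handle_unknown='ignore'`.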

