# Data Cleaning and Preprocessing

***What is data cleaning?*** 
Data cleaning involves identifying and then correcting or removing errors and inconsistencies in your dataset. These problems come in many forms, such as missing values, duplicated records, or outliers.
If left unchecked, these problems can lead to inaccurate models and unreliable predictions. 

***Data Preprocessing***: While data cleaning is about fixing issues with the data, preprocessing is about transforming it into a format that is suitable for analysis and model training. 
This might involve normalizing, standardizing, encoding categorical variables or splitting the data into training and testing sets. 
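
As a minimal sketch of the last step mentioned above, the following splits a small made-up DataFrame into training and testing sets using only pandas. The column names `feature` and `target`, the 80/20 ratio, and the random seed are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

# Randomly pick 80% of the rows for training; the fixed seed makes the split repeatable
train = df.sample(frac=0.8, random_state=42)

# The rows that were not sampled form the test set
test = df.drop(train.index)

print(len(train), len(test))
```

scikit-learn's `train_test_split` is a common alternative for the same task.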

***Why is this important?*** 
Data cleaning and preprocessing allow the model to learn the underlying patterns more effectively, leading to better predictions and ultimately better performance.

## Managing missing values
Missing values are a common issue in datasets and can arise for various reasons, such as data entry errors or the unavailability of certain information. If not addressed, missing values can bias results or reduce the accuracy of the model.
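
Before choosing a strategy, it helps to measure how much data is actually missing. The sketch below uses a small made-up DataFrame (the column names `age` and `city` are illustrative, not taken from any real dataset) and reports the count and share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "city": ["Paris", "Lyon", None, "Nice", "Lille"],
})

# Number of missing values in each column
missing_counts = df.isna().sum()

# Share of missing values in each column (between 0.0 and 1.0)
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```

A column with only a few missing entries is a candidate for row removal, while a column with a large share of gaps may be better imputed or dropped entirely.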

### Strategies for handling missing values
1. Removing missing data
If only a small number of rows or columns have missing values, you might consider removing them from the dataset. This approach is useful when the missing data is minimal and its removal won't significantly impact the dataset.
**Code Example** 
```python
# Drop rows with missing values
df_cleaned = df.dropna()

# Drop columns with missing values
df_cleaned = df.dropna(axis=1)
```
2. Imputing missing data
Imputation involves filling in missing values with a substitute value, such as the mean, median, or mode of the column. This is useful when missing data is more prevalent but you don't want to lose information by removing rows or columns.

**Code Example**: 
```python
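# Assign the result back to the column: calling fillna(inplace=True) on a
# selected column may not update df in recent pandas versions

# Fill missing values with the mean of the column
df['Column_name'] = df['Column_name'].fillna(df['Column_name'].mean())

# Fill missing values with the median
df['Column_name'] = df['Column_name'].fillna(df['Column_name'].median())

# Fill missing values with the mode (the most frequent value)
df['Column_name'] = df['Column_name'].fillna(df['Column_name'].mode()[0])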
```
3. Forward or backward fill
Forward fill propagates the last valid observation forward, while backward fill propagates the next valid observation backward. This is particularly useful in time series data, where trend or sequence is important.

**Code Example**:

```python
# Forward fill: carry the last valid value forward
df = df.ffill()

# Backward fill: carry the next valid value backward
df = df.bfill()
```
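
The three strategies above can be compared side by side on a toy series. This is only a sketch: the daily readings and the name `reading` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with two gaps
s = pd.Series(
    [10.0, np.nan, 14.0, np.nan, 20.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="reading",
)

dropped = s.dropna()              # 1. remove missing entries
mean_filled = s.fillna(s.mean())  # 2. impute with the mean of the observed values
forward = s.ffill()               # 3a. carry the last valid value forward
backward = s.bfill()              # 3b. carry the next valid value backward

# Show the strategies next to the original values
print(pd.DataFrame({
    "original": s,
    "mean": mean_filled,
    "ffill": forward,
    "bfill": backward,
}))
```

Note that forward fill leaves a gap at the very start of a series unfilled, because there is no earlier value to carry forward, and backward fill does the same for a gap at the very end.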

## Managing outliers
1. Identify outliers
The first step is identifying outliers, which can be done using statistical methods such as the z-score or the interquartile range (IQR).
**Code Example**: 
```python

# Using the z-score
import numpy as np
from scipy import stats

# Handle missing values first: NaNs in the column propagate into the z-scores
z_scores = np.abs(stats.zscore(df['Column_name']))
outliers = df[z_scores > 3]

# Using the IQR
Q1 = df['Column_name'].quantile(0.25)
Q3 = df['Column_name'].quantile(0.75)
Iqr = Q3 - Q1
outliers = df[(df['Column_name'] < (Q1 - 1.5 * Iqr)) | (df['Column_name'] > (Q3 + 1.5 * Iqr))]
```
2. Handle outliers
- Remove outliers: Outliers can be removed from the dataset if they are believed to be errors or not representative of the population.

**Code Example** 
```python
# Remove outliers identified by z-score
df_cleaned = df[z_scores <= 3]

# Remove outliers identified by IQR
df_cleaned = df[~((df['Column_name'] < (Q1 - 1.5 * Iqr)) | (df['Column_name'] > (Q3 + 1.5 * Iqr)))]
```

- Cap or transform outliers: Instead of removing outliers, you might cap them at a certain threshold or transform them using logarithmic or other functions to reduce their impact.

**Code Example**: 
```python
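# Cap outliers at a threshold
# upper_threshold is a value you choose, for example Q3 + 1.5 * Iqr
df['Column_name'] = np.where(df['Column_name'] > upper_threshold, upper_threshold, df['Column_name'])

# Logarithmic transform to reduce the impact of large values
# Adding 1 keeps zero values defined; the column must not contain values <= -1
df['Column_name'] = np.log(df['Column_name'] + 1)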
```
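
To make the steps above concrete, here is a small worked sketch on made-up measurements. The data and the variable names (`values`, `lower`, `upper`) are assumptions for illustration. It flags outliers with the IQR rule and then shows both handling options, removal and capping.

```python
import pandas as pd

# Hypothetical measurements; 95 is an obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 95], name="value")

# IQR fences
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask: True where a value lies outside the fences
is_outlier = (values < lower) | (values > upper)

removed = values[~is_outlier]                   # option 1: drop the outliers
capped = values.clip(lower=lower, upper=upper)  # option 2: cap at the fences

# For comparison: z-scores computed with the population standard deviation
z = (values - values.mean()) / values.std(ddof=0)

print(lower, upper)
print(is_outlier.tolist())
print(z.abs().max())
```

In this tiny sample the z-score rule with a threshold of 3 does not flag 95, because the outlier itself inflates the standard deviation. The IQR rule is less affected by this, which is one reason to prefer it on small datasets.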


## Normalization

***Normalization (or scaling)*** is the process of adjusting the values of numeric columns in a dataset to a common scale, typically between 0 and 1. This is especially important for machine learning algorithms that are sensitive to the magnitude of features, such as gradient-descent-based algorithms.

### Methods for normalization 
1. Min-Max scaling: scales all numeric values in a column to the range 0 to 1.
*Example* : 
```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['Scaled_Column'] = scaler.fit_transform(df[['Column_name']])
```
2. Z-score standardization: scales the data so that it has **mean = 0, std = 1**. This is useful when you want to compare features with different units or scales.

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['Standardized_Column'] = scaler.fit_transform(df[['Column_name']])
```
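
To see what the two scalers compute, the same formulas can be written out by hand. This is a sketch on made-up values, not a replacement for the scikit-learn classes above; the name `feature` is illustrative.

```python
import pandas as pd

# Hypothetical feature values
x = pd.Series([2.0, 4.0, 6.0, 8.0], name="feature")

# Min-max scaling: (x - min) / (max - min), so the result lies in [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: (x - mean) / std
# ddof=0 selects the population standard deviation
z_score = (x - x.mean()) / x.std(ddof=0)

print(min_max.tolist())
print(z_score.tolist())
```

scikit-learn's `StandardScaler` also divides by the population standard deviation, whereas pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), so hand-computed and library results can differ slightly unless `ddof=0` is passed.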

## Data Transformation
Data transformation involves converting data from one format or structure to another. This is often necessary to meet the assumptions of statistical models or to improve the performance of machine learning algorithms.

### Common data transformations 
1. Logarithmic transformation: This is used to stabilize variance, make skewed data look more like a normal distribution, and reduce the impact of outliers, *e.g.* `df['log_column'] = np.log(df['Column_name'] + 1)`

2. Box-Cox transformation : This is used to stabilize variance and make the data more normally distributed. 
*Example* : 
```python
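from scipy import stats
# Box-Cox requires strictly positive input; adding 1 handles zeros but not negative values
# The second return value (discarded here as _) is the fitted lambda parameter
df['boxcox_column'], _ = stats.boxcox(df['Column_name'] + 1)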
```

3. Binning (or *discretization*): This involves converting continuous variables into discrete categories.
*Example*
```python
import pandas as pd
# Create bins for a continuous variable (labels must be passed as a list)
df['binned_column'] = pd.cut(df['Column_name'], bins=[0, 10, 20, 30], labels=['Low', 'Medium', 'High'])
```

4. Encoding categorical variables: This transforms categorical data into a numerical format, which is necessary for many machine learning algorithms.
*One-hot encoding*: `df_encoded = pd.get_dummies(df, columns=['category'])`
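
The binning and encoding steps above can be tried on a small made-up DataFrame. The column names `score` and `color` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only
df = pd.DataFrame({
    "score": [5, 10, 15, 25],
    "color": ["red", "blue", "red", "green"],
})

# Bins are right-inclusive by default: (0, 10], (10, 20], (20, 30]
df["score_band"] = pd.cut(
    df["score"], bins=[0, 10, 20, 30], labels=["Low", "Medium", "High"]
)

# One indicator column per distinct value of "color"; the original column is dropped
df_encoded = pd.get_dummies(df, columns=["color"])

print(df["score_band"].tolist())
print(df_encoded.columns.tolist())
```

Because the bins exclude their left edge by default, a value of exactly 0 would fall outside every bin and become missing; passing `include_lowest=True` to `pd.cut` includes it in the first bin.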

