# Chapter 7 - Cleaning Messy Data

## Importance of Data Cleaning
- Data analysts and scientists spend a significant amount of time cleaning and pre-processing messy datasets.
- It is a critical skill for any data professional and aspiring data scientist.
- Data cleaning ensures high-quality data, which is essential for robust and error-free analysis.
- High-quality data (accurate, complete, and consistent) often matters more than model complexity: a simple model on clean data can outperform a complex algorithm on messy data.

## What is Data Cleaning?
- The process of identifying, updating, and removing corrupt or incorrect data.
- Activities involved:
    - Handling missing values
    - Removing outliers
    - Feature encoding
    - Scaling
    - Transformation
    - Splitting

## Objectives of Data Cleaning
- Prepare data for analysis by cleaning, manipulating, and wrangling.
- Terms like data preparation, manipulation, wrangling, and munging refer to the same process.
- The goal is to clean up data to extract valuable insights.

## Tools Used
- **pandas**: For data manipulation and cleaning.
- **scikit-learn**: For advanced data transformation techniques.

## Topics Covered in This Chapter
1. **Exploring Data**: Understanding the structure and content of the dataset.
2. **Filtering Data**: Removing noise and irrelevant information.
3. **Handling Missing Values**: Techniques to deal with incomplete data.
4. **Handling Outliers**: Identifying and addressing extreme values.
5. **Feature Encoding Techniques**: Converting categorical data into numerical formats.
6. **Feature Scaling**: Normalizing data for consistent analysis.
7. **Feature Transformation**: Applying mathematical transformations to features.
8. **Feature Splitting**: Splitting a single feature into multiple features.

## Key Takeaways
- Data cleaning is a foundational step in any data analysis or machine learning workflow.
- Mastering data cleaning techniques ensures better insights and model performance.
- Focus on understanding and applying tools like pandas and scikit-learn for efficient data preparation.

In [3]:
# import pandas
import pandas as pd
# Read the data using csv
data=pd.read_csv('employee.csv')

In [4]:
# See initial 5 records
data.head()

Unnamed: 0,name,age,income,gender,department,grade,performance_score
0,Allen Smith,45.0,,,Operations,G3,723
1,S Kumar,,16000.0,F,Finance,G0,520
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674
3,Ying Chin,45.0,65000.0,F,Sales,G3,556
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711


In [5]:
# See last 5 records
data.tail()

Unnamed: 0,name,age,income,gender,department,grade,performance_score
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711
5,Satyam Sharma,,62000.0,,Sales,G3,649
6,James Authur,54.0,,F,Operations,G3,53
7,Josh Wills,54.0,52000.0,F,Finance,G3,901
8,Leo Duck,23.0,98000.0,M,Sales,G4,709


In [6]:
# Print list of columns in the data
print(data.columns)

Index(['name', 'age', 'income', 'gender', 'department', 'grade',
       'performance_score'],
      dtype='object')


In [7]:
# Print the shape of a DataFrame
print(data.shape)

(9, 7)


In [8]:
# Check the information of DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               9 non-null      object 
 1   age                7 non-null      float64
 2   income             7 non-null      float64
 3   gender             7 non-null      object 
 4   department         9 non-null      object 
 5   grade              9 non-null      object 
 6   performance_score  9 non-null      int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 636.0+ bytes


Now, let's take a look at the descriptive statistics of the data by using the describe() function. By default, this function summarizes only the numerical columns. In our example, for the age, income, and performance_score columns it reports the count, mean, standard deviation, minimum, maximum, and the first, second, and third quartiles:

In [9]:
# Check the descriptive statistics
data.describe()

Unnamed: 0,age,income,performance_score
count,7.0,7.0,9.0
mean,40.428571,52857.142857,610.666667
std,12.204605,26028.372797,235.671912
min,23.0,16000.0,53.0
25%,31.0,38500.0,556.0
50%,45.0,52000.0,674.0
75%,49.5,63500.0,711.0
max,54.0,98000.0,901.0
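
The counts reported by info() and describe() hint at missing values. A minimal sketch of counting them directly is shown below; it assumes the columns are rebuilt inline from the records printed above so that it runs without employee.csv.

```python
import numpy as np
import pandas as pd

# Columns copied from the records printed above
# (rebuilt inline so the sketch runs without employee.csv)
data = pd.DataFrame({
    "age": [45, np.nan, 32, 45, 30, np.nan, 54, 54, 23],
    "income": [np.nan, 16000, 35000, 65000, 42000, 62000, np.nan, 52000, 98000],
    "gender": [np.nan, "F", "M", "F", "F", np.nan, "F", "F", "M"],
    "performance_score": [723, 520, 674, 556, 711, 649, 53, 901, 709],
})

# Number of missing values in each column
missing_counts = data.isna().sum()
print(missing_counts)

# Share of missing values in each column, as a percentage
missing_pct = data.isna().mean() * 100
print(missing_pct.round(1))
```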


### Filtering Data to Weed Out the Noise

### Importance of Data Filtering
- The digitalization of companies and government agencies has led to an increase in data size.
- This growth has also introduced inconsistencies, errors, and missing values in datasets.
- Data filtering is essential for addressing these issues and optimizing data for:
    - Management
    - Reporting
    - Predictions

### Benefits of Data Filtering
- Enhances **accuracy**, **relevance**, **completeness**, **consistency**, and **quality** of data.
- Processes dirty, messy, or coarse datasets to make them usable.
- Plays a critical role in maintaining a competitive edge for businesses.

### Role of Data Scientists
- Mastering data filtering is a crucial skill for data scientists.
- Different datasets require different filtering techniques.
- A systematic approach to data filtering is necessary for effective data management.

### Types of Data Filtering
- Data can be filtered in two ways:
    1. **Column-wise Filtering**: Focuses on specific columns.
    2. **Row-wise Filtering**: Focuses on specific rows.

### Transition from Data Exploration to Filtering
- In the previous section, we learned about data exploration.
- This section focuses on the systematic process of data filtering.

### Column-wise filtration
In this subsection, we will learn how to filter data column-wise. We can filter columns using the filter() method or slicing with []. The filter() method selects columns when they are passed as a list of column names.

In [13]:
# Filter columns
data.filter(['name', 'department'])

Unnamed: 0,name,department
0,Allen Smith,Operations
1,S Kumar,Finance
2,Jack Morgan,Finance
3,Ying Chin,Sales
4,Dheeraj Patel,Operations
5,Satyam Sharma,Sales
6,James Authur,Operations
7,Josh Wills,Finance
8,Leo Duck,Sales


Similarly, we can also filter columns using slicing. In slicing, a single column name can be passed directly, but multiple column names must be passed as a list. Selecting a single column this way returns a pandas Series. If we want the output as a DataFrame, then we need to put the single column name in a list. Take a look at the following example:

In [14]:
# Filter column "name"
data['name']

0      Allen Smith
1          S Kumar
2      Jack Morgan
3        Ying Chin
4    Dheeraj Patel
5    Satyam Sharma
6     James Authur
7       Josh Wills
8         Leo Duck
Name: name, dtype: object

In [15]:
# Filter column "name" and keep the result as a DataFrame
data[['name']]

Unnamed: 0,name
0,Allen Smith
1,S Kumar
2,Jack Morgan
3,Ying Chin
4,Dheeraj Patel
5,Satyam Sharma
6,James Authur
7,Josh Wills
8,Leo Duck


In [16]:
# Filter two columns: name and department
data[['name','department']]

Unnamed: 0,name,department
0,Allen Smith,Operations
1,S Kumar,Finance
2,Jack Morgan,Finance
3,Ying Chin,Sales
4,Dheeraj Patel,Operations
5,Satyam Sharma,Sales
6,James Authur,Operations
7,Josh Wills,Finance
8,Leo Duck,Sales


### Row-wise filtration
Now, let's filter data row-wise. We can filter rows using indices, slices, and conditions. To filter by index, pass a list of index labels; to filter by slicing, pass the slice range. Take a look at the following example:

In [17]:
# Select rows for the specific index
data.filter([0,1,2],axis=0)

Unnamed: 0,name,age,income,gender,department,grade,performance_score
0,Allen Smith,45.0,,,Operations,G3,723
1,S Kumar,,16000.0,F,Finance,G0,520
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674


In [18]:
# Filter data using slicing
data[2:5]

Unnamed: 0,name,age,income,gender,department,grade,performance_score
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674
3,Ying Chin,45.0,65000.0,F,Sales,G3,556
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711
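
The same row selections can also be written with the .loc and .iloc indexers, which make it explicit whether rows are chosen by label or by position. The sketch below is an alternative to the calls above, not the method used in this chapter; the DataFrame is rebuilt inline from the printed records.

```python
import pandas as pd

# Two columns copied from the records printed above
data = pd.DataFrame({
    "name": ["Allen Smith", "S Kumar", "Jack Morgan", "Ying Chin", "Dheeraj Patel",
             "Satyam Sharma", "James Authur", "Josh Wills", "Leo Duck"],
    "department": ["Operations", "Finance", "Finance", "Sales", "Operations",
                   "Sales", "Operations", "Finance", "Sales"],
})

# Label-based: rows whose index labels are 0, 1, and 2
by_label = data.loc[[0, 1, 2]]

# Position-based: rows at positions 2, 3, and 4 (the end of the slice is excluded)
by_position = data.iloc[2:5]

print(by_label)
print(by_position)
```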


In condition-based filtration, we pass a condition inside square brackets, [ ], optionally wrapping it in parentheses, ( ). For a single value, we use the == (double equal to) condition, while for multiple values, we use the isin() function and pass the list of values. Let's take a look at the following example:

In [19]:
# Filter data for specific value
data[data.department=='Sales']

Unnamed: 0,name,age,income,gender,department,grade,performance_score
3,Ying Chin,45.0,65000.0,F,Sales,G3,556
5,Satyam Sharma,,62000.0,,Sales,G3,649
8,Leo Duck,23.0,98000.0,M,Sales,G4,709


In the preceding code, we filtered the rows of the Sales department using == (double equal to) as a condition. Now, let's filter on multiple values using the isin() function:

In [20]:
# Select data for multiple values
data[data.department.isin(['Sales','Finance'])]

Unnamed: 0,name,age,income,gender,department,grade,performance_score
1,S Kumar,,16000.0,F,Finance,G0,520
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674
3,Ying Chin,45.0,65000.0,F,Sales,G3,556
5,Satyam Sharma,,62000.0,,Sales,G3,649
7,Josh Wills,54.0,52000.0,F,Finance,G3,901
8,Leo Duck,23.0,98000.0,M,Sales,G4,709


In the preceding example, we filtered the Sales and Finance departments using the isin() function. Now, let's look at the >= and <= conditions for continuous variables. We can have single or multiple conditions. Let's take a look at the following example:

In [21]:
# Filter employees with a performance score of at least 700
data[(data.performance_score >=700)]

Unnamed: 0,name,age,income,gender,department,grade,performance_score
0,Allen Smith,45.0,,,Operations,G3,723
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711
7,Josh Wills,54.0,52000.0,F,Finance,G3,901
8,Leo Duck,23.0,98000.0,M,Sales,G4,709
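
Multiple conditions can be combined with & (and) or | (or). The sketch below illustrates both; the thresholds are chosen only for illustration, and the DataFrame is rebuilt inline from the printed records.

```python
import pandas as pd

# Columns copied from the records printed above
data = pd.DataFrame({
    "name": ["Allen Smith", "S Kumar", "Jack Morgan", "Ying Chin", "Dheeraj Patel",
             "Satyam Sharma", "James Authur", "Josh Wills", "Leo Duck"],
    "department": ["Operations", "Finance", "Finance", "Sales", "Operations",
                   "Sales", "Operations", "Finance", "Sales"],
    "performance_score": [723, 520, 674, 556, 711, 649, 53, 901, 709],
})

# Both conditions must hold; each condition needs its own parentheses
mid_scores = data[(data.performance_score >= 500) & (data.performance_score < 700)]

# At least one condition must hold
sales_or_top = data[(data.department == "Sales") | (data.performance_score >= 900)]

print(mid_scores)
print(sales_or_top)
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.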


### Handling Missing Values

#### What are Missing Values?
- Missing values are data points that are absent from the dataset.
- They can occur due to:
    - Human error
    - Privacy concerns
    - Incomplete responses in surveys

#### Why are Missing Values Important?
- Missing values are a common issue in data science and are often the first challenge in data preprocessing.
- They can significantly affect the performance of machine learning models.

#### Methods to Handle Missing Values
1. **Drop Missing Value Records**:
     - Remove rows or columns with missing values.
     - Useful when the amount of missing data is small and does not impact the dataset significantly.

2. **Fill Missing Values Manually**:
     - Manually input the missing values based on domain knowledge.

3. **Impute Missing Values Using Measures of Central Tendency**:
     - **Mean**: Used for imputing numeric features.
     - **Median**: Used for imputing ordinal features.
     - **Mode**: Used for imputing categorical features (most frequently occurring value).

4. **Impute Using Machine Learning Models**:
     - Predict missing values using models like:
         - Regression
         - Decision Trees
         - K-Nearest Neighbors (KNN)
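
As a rough sketch of model-based imputation, scikit-learn's KNNImputer fills each missing value with the average of that feature over the nearest rows. The choice of n_neighbors=2 is an assumption made for illustration, and the numeric columns are rebuilt inline from the employee records.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Numeric columns copied from the employee records
numeric = pd.DataFrame({
    "age": [45, np.nan, 32, 45, 30, np.nan, 54, 54, 23],
    "income": [np.nan, 16000, 35000, 65000, 42000, 62000, np.nan, 52000, 98000],
    "performance_score": [723, 520, 674, 556, 711, 649, 53, 901, 709],
})

# n_neighbors=2 is an illustrative choice, not a recommendation
imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
imputed = pd.DataFrame(
    imputer.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,
)
print(imputed)
```

Distances here are computed on the raw values, so the large-magnitude income column dominates the neighbor search; in practice the features would usually be scaled first.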

#### When Missing Values May Not Matter
- In some cases, missing values do not impact the dataset or model performance.
- Examples:
    - Unique identifiers like driving license numbers or social security numbers.
    - These features are not used as predictors in machine learning models.

#### Next Steps
- In the following sections, we will explore how to handle missing values in more detail.
- We will start by learning how to drop missing values.

### Dropping missing values
In pandas, missing values can be dropped using the dropna() method. Its how parameter takes one of two values: any or all. any (the default) drops every row that contains at least one NaN or missing value, while all drops only the rows in which every value is NaN or missing:

In [23]:
# Drop missing value rows using dropna() function
# Read the data
data=pd.read_csv('employee.csv')
data=data.dropna()
data

Unnamed: 0,name,age,income,gender,department,grade,performance_score
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674
3,Ying Chin,45.0,65000.0,F,Sales,G3,556
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711
7,Josh Wills,54.0,52000.0,F,Finance,G3,901
8,Leo Duck,23.0,98000.0,M,Sales,G4,709
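
The difference between the two how settings, and the subset parameter for restricting the check to certain columns, can be sketched as follows. The DataFrame is rebuilt inline from the employee records.

```python
import numpy as np
import pandas as pd

# Columns copied from the employee records
data = pd.DataFrame({
    "age": [45, np.nan, 32, 45, 30, np.nan, 54, 54, 23],
    "income": [np.nan, 16000, 35000, 65000, 42000, 62000, np.nan, 52000, 98000],
    "gender": [np.nan, "F", "M", "F", "F", np.nan, "F", "F", "M"],
    "department": ["Operations", "Finance", "Finance", "Sales", "Operations",
                   "Sales", "Operations", "Finance", "Sales"],
})

# Drop a row if at least one value is missing (the default behavior)
dropped_any = data.dropna(how="any")

# Drop a row only if every value in it is missing
dropped_all = data.dropna(how="all")

# Drop a row only if the age column is missing
age_required = data.dropna(subset=["age"])

print(len(dropped_any), len(dropped_all), len(age_required))
```

No row in this dataset is entirely empty, so how="all" keeps all nine records.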


### Filling in Missing Values

#### Using the `fillna()` Function
- In pandas, missing values can be filled using the `fillna()` function.
- The `fillna()` function replaces missing values with a specified value.

#### Methods to Fill Missing Values
1. **Mean**:
    - Replace missing values with the mean of the column.
    - Suitable for numeric data.

2. **Median**:
    - Replace missing values with the median of the column.
    - Useful for handling skewed data.

3. **Mode**:
    - Replace missing values with the mode (most frequently occurring value) of the column.
    - Commonly used for categorical data.

#### Example Usage

In [24]:
# Read the data
data=pd.read_csv('employee.csv')
# Fill all the missing values in the age column with mean of the age column
data['age']=data.age.fillna(data.age.mean())
data

Unnamed: 0,name,age,income,gender,department,grade,performance_score
0,Allen Smith,45.0,,,Operations,G3,723
1,S Kumar,40.428571,16000.0,F,Finance,G0,520
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674
3,Ying Chin,45.0,65000.0,F,Sales,G3,556
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711
5,Satyam Sharma,40.428571,62000.0,,Sales,G3,649
6,James Authur,54.0,,F,Operations,G3,53
7,Josh Wills,54.0,52000.0,F,Finance,G3,901
8,Leo Duck,23.0,98000.0,M,Sales,G4,709


In [26]:
# Fill all the missing values in the income column with the median of the income column
data['income'] = data.income.fillna(data.income.median())
data

Unnamed: 0,name,age,income,gender,department,grade,performance_score
0,Allen Smith,45.0,52000.0,,Operations,G3,723
1,S Kumar,40.428571,16000.0,F,Finance,G0,520
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674
3,Ying Chin,45.0,65000.0,F,Sales,G3,556
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711
5,Satyam Sharma,40.428571,62000.0,,Sales,G3,649
6,James Authur,54.0,52000.0,F,Operations,G3,53
7,Josh Wills,54.0,52000.0,F,Finance,G3,901
8,Leo Duck,23.0,98000.0,M,Sales,G4,709
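
The mode-based fill for a categorical column follows the same pattern. A minimal sketch for the gender column, rebuilt inline from the employee records:

```python
import numpy as np
import pandas as pd

# Gender column copied from the employee records
gender = pd.Series([np.nan, "F", "M", "F", "F", np.nan, "F", "F", "M"], name="gender")

# mode() returns a Series because ties are possible; take the first value
most_common = gender.mode()[0]

# Replace missing values with the most frequent category
filled = gender.fillna(most_common)
print(filled.value_counts())
```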


### Handling Outliers

#### What are Outliers?
- Outliers are data points that are significantly distant from most other data points.
- They are entities that differ from the majority of the data.
- Outliers can cause issues in predictive modeling, such as:
    - Long model training times
    - Poor accuracy
    - Increased error variance
    - Decreased normality
    - Reduced power of statistical tests

#### Types of Outliers
1. **Univariate Outliers**:
     - Found in single-variable distributions.
2. **Multivariate Outliers**:
     - Found in n-dimensional spaces.

#### Methods to Detect and Handle Outliers
1. **Box Plot**:
     - Visualizes data points through quartiles.
     - Groups data between the first and third quartile into a rectangular box.
     - Displays outliers as individual points using the interquartile range (IQR).

2. **Scatter Plot**:
     - Displays data points (or two variables) on a two-dimensional chart.
     - One variable is placed on the x-axis, and the other on the y-axis.

3. **Z-Score**:
     - A parametric approach to detecting outliers.
     - Assumes a normal distribution of data.
     - Outliers lie in the tail of the normal curve distribution and are far from the mean.

4. **Interquartile Range (IQR)**:
     - A robust statistical measure of data dispersion.
     - Calculated as the difference between the third and first quartile.
     - Visualized in a box plot.
     - Also known as the midspread, middle 50%, or H-spread.

5. **Percentile**:
     - A statistical measure that divides data into 100 groups of equal size.
     - Indicates the percentage of the population below a specific value.
     - Example: The 95th percentile means 95% of the population falls below this value.

In [28]:
# Dropping the outliers using Standard Deviation
upper_limit = data['performance_score'].mean() + 3 * data['performance_score'].std()
lower_limit = data['performance_score'].mean() - 3 * data['performance_score'].std()

# Filter the data to remove outliers
data = data[(data['performance_score'] < upper_limit) & (data['performance_score'] > lower_limit)]
data

Unnamed: 0,name,age,income,gender,department,grade,performance_score
0,Allen Smith,45.0,52000.0,,Operations,G3,723
1,S Kumar,40.428571,16000.0,F,Finance,G0,520
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674
3,Ying Chin,45.0,65000.0,F,Sales,G3,556
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711
5,Satyam Sharma,40.428571,62000.0,,Sales,G3,649
6,James Authur,54.0,52000.0,F,Operations,G3,53
7,Josh Wills,54.0,52000.0,F,Finance,G3,901
8,Leo Duck,23.0,98000.0,M,Sales,G4,709
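
With the three-standard-deviation limits above, all nine records are retained, including the performance score of 53. The IQR rule described earlier is stricter on this data. The sketch below assumes the conventional 1.5 × IQR fences, a multiplier that is not specified in this chapter.

```python
import pandas as pd

# Performance scores copied from the employee records
data = pd.DataFrame({
    "name": ["Allen Smith", "S Kumar", "Jack Morgan", "Ying Chin", "Dheeraj Patel",
             "Satyam Sharma", "James Authur", "Josh Wills", "Leo Duck"],
    "performance_score": [723, 520, 674, 556, 711, 649, 53, 901, 709],
})

# First and third quartiles and the interquartile range
q1 = data["performance_score"].quantile(0.25)
q3 = data["performance_score"].quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (the multiplier is a common convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Split the records into outliers and retained rows
in_range = data["performance_score"].between(lower_fence, upper_fence)
outliers = data[~in_range]
cleaned = data[in_range]

print(lower_fence, upper_fence)
print(outliers)
```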


### Feature Encoding Techniques

#### Importance of Feature Encoding
- Machine learning models are mathematical models that require numeric values for computation.
- Categorical features cannot be directly used in such models.
- Feature encoding is the process of converting categorical features into numerical ones.
- The choice of encoding technique can significantly impact the performance of machine learning models.

#### Categorical Values
- Categorical values are discrete; a feature with N categories is typically encoded with the integers 0 to N-1.

---

### One-Hot Encoding

#### What is One-Hot Encoding?
- One-hot encoding transforms a categorical column into labels and splits it into multiple binary columns.
- Each category is represented as a binary value (1 or 0).

#### Example
- Consider a categorical variable `color` with three categories: red, green, and blue.
- One-hot encoding converts this variable into three binary columns, as shown below:

| Color  | Red | Green | Blue |
|--------|-----|-------|------|
| Red    |  1  |   0   |  0   |
| Green  |  0  |   1   |  0   |
| Blue   |  0  |   0   |  1   |

#### Benefits of One-Hot Encoding
- Ensures that categorical data is represented numerically without introducing ordinal relationships.
- Widely used for nominal categorical variables.

#### Considerations
- One-hot encoding can lead to a high-dimensional dataset if the number of categories is large.
- A large number of columns exposes the model to the "curse of dimensionality" and may require dimensionality reduction techniques.

In [29]:
# Read the data
data=pd.read_csv('employee.csv')
# Dummy encoding
encoded_data = pd.get_dummies(data['gender'])
# Join the encoded_data with the original dataframe
data = data.join(encoded_data)

# Check the top-5 records of the dataframe
data.head()

Unnamed: 0,name,age,income,gender,department,grade,performance_score,F,M
0,Allen Smith,45.0,,,Operations,G3,723,False,False
1,S Kumar,,16000.0,F,Finance,G0,520,True,False
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674,False,True
3,Ying Chin,45.0,65000.0,F,Sales,G3,556,True,False
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711,True,False
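
As a variation on the cell above, get_dummies() can encode a column in place and label the new columns with a prefix. The sketch below applies it to the department column; the prefix name dept is a hypothetical choice, and the DataFrame is rebuilt inline from the employee records.

```python
import pandas as pd

# Columns copied from the employee records
data = pd.DataFrame({
    "name": ["Allen Smith", "S Kumar", "Jack Morgan", "Ying Chin", "Dheeraj Patel",
             "Satyam Sharma", "James Authur", "Josh Wills", "Leo Duck"],
    "department": ["Operations", "Finance", "Finance", "Sales", "Operations",
                   "Sales", "Operations", "Finance", "Sales"],
})

# Replace department with one 0/1 column per category ("dept" is an illustrative prefix)
encoded = pd.get_dummies(data, columns=["department"], prefix="dept", dtype=int)

# drop_first=True omits the first category, which is implied when all others are 0
encoded_compact = pd.get_dummies(
    data, columns=["department"], prefix="dept", dtype=int, drop_first=True
)

print(encoded.head())
print(encoded_compact.columns.tolist())
```

Dropping one category per feature is one small way to limit the growth in columns noted under Considerations.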


In [34]:
# Import one hot encoder
from sklearn.preprocessing import OneHotEncoder
# Initialize the one-hot encoder object
onehotencoder = OneHotEncoder()
# Fill all the missing values in the gender column (categorical column) with the mode
data['gender'] = data['gender'].fillna(data['gender'].mode()[0])
# Fit and transform the gender column
onehotencoder.fit_transform(data[['gender']]).toarray()

array([[1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])

### Ordinal Encoder

#### What is Ordinal Encoding?
- Ordinal encoding is similar to label encoding but includes an **order** in the encoding.
- The output encoding starts from `0` and ends at one less than the size of the categories.

#### Example
- Consider employee grades such as `G0`, `G1`, `G2`, `G3`, and `G4`.
- These grades are encoded with ordinal integer values:
    - `G0` → `0`
    - `G1` → `1`
    - `G2` → `2`
    - `G3` → `3`
    - `G4` → `4`

#### Defining the Order
- The order of the values can be defined as a list and passed to the `categories` parameter of the encoder.
- The ordinal encoder maps each category to its integer position in that list.

#### Benefits
- The encoded values are **ordinal in nature**, meaning they have a meaningful order.
- This encoding helps machine learning algorithms take advantage of the ordinal relationship between categories.

#### Example Usage
- Below is an example of using the `OrdinalEncoder` to encode employee grades.

In [37]:
# Import pandas and OrdinalEncoder
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# Load the data
data=pd.read_csv('employee.csv')
# Initialize OrdinalEncoder with order
order_encoder=OrdinalEncoder(categories=[['G0','G1','G2','G3','G4']])
# fit and transform the grade
data['grade_encoded'] = order_encoder.fit_transform(data[['grade']])
# Check top-5 records of the dataframe
data.head()

Unnamed: 0,name,age,income,gender,department,grade,performance_score,grade_encoded
0,Allen Smith,45.0,,,Operations,G3,723,3.0
1,S Kumar,,16000.0,F,Finance,G0,520,0.0
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674,2.0
3,Ying Chin,45.0,65000.0,F,Sales,G3,556,3.0
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711,2.0
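
The same ordered mapping can be sketched in pandas alone with an ordered Categorical. This is an alternative to OrdinalEncoder, shown for comparison; the column name grade_code is hypothetical.

```python
import pandas as pd

# Grade column copied from the employee records
data = pd.DataFrame({
    "grade": ["G3", "G0", "G2", "G3", "G2", "G3", "G3", "G3", "G4"],
})

# Declare the category order explicitly, lowest grade first
grade_order = ["G0", "G1", "G2", "G3", "G4"]
ordered = pd.Categorical(data["grade"], categories=grade_order, ordered=True)

# codes holds each value's position in grade_order
# (a value outside the list would get the code -1)
data["grade_code"] = ordered.codes
print(data)
```

Unlike OrdinalEncoder, which returns floats, the codes are integers.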


### Feature Scaling

#### What is Feature Scaling?
- In real-world datasets, features often have different ranges, magnitudes, and units.
    - Example: Age may range from 0-200, while salary may range from thousands to millions.
- Comparing features with vastly different scales can be challenging.
- High-magnitude features tend to have a greater influence on machine learning models than lower-magnitude features.

#### Why is Feature Scaling Important?
- Feature scaling ensures that all features are brought to the same level of magnitude.
- It prevents high-magnitude features from dominating the model's learning process.
- It is particularly important for algorithms that rely on distance measures, such as:
    - **K-Nearest Neighbors (KNN)**
    - **K-Means Clustering**

#### When is Feature Scaling Necessary?
- Feature scaling is not required for all machine learning algorithms.
- It is crucial for algorithms that:
    - Use Euclidean distance as a metric.
    - Are sensitive to the scale of input features.

#### Benefits of Feature Scaling
- Improves the performance and convergence speed of machine learning models.
- Ensures fair comparison between features with different scales.

#### Key Takeaway
- Feature scaling or normalization is a critical preprocessing step for certain machine learning algorithms to ensure balanced and effective learning.

### Methods for Feature Scaling

#### Standard Scaling or Z-Score Normalization
- This method computes the scaled values of a feature by using the **mean** and **standard deviation** of that feature.
- It is best suited for **normally distributed data**.

#### Formula:
Let:
- \( \mu \) = Mean of the feature column
- \( \sigma \) = Standard deviation of the feature column

The formula for Z-Score normalization is:
\[
Z = \frac{X - \mu}{\sigma}
\]

Where:
- \( Z \) is the scaled value
- \( X \) is the original value of the feature

![image.png](attachment:image.png)

In [39]:
# Import StandardScaler(or z-score normalization)
from sklearn.preprocessing import StandardScaler
# Initialize the StandardScaler
scaler = StandardScaler()
# To scale data
scaler.fit(data['performance_score'].values.reshape(-1, 1))
data['performance_std_scaler'] = scaler.transform(data['performance_score'].values.reshape(-1, 1))
data.head()

Unnamed: 0,name,age,income,gender,department,grade,performance_score,grade_encoded,performance_std_scaler
0,Allen Smith,45.0,,,Operations,G3,723,3.0,0.505565
1,S Kumar,,16000.0,F,Finance,G0,520,0.0,-0.408053
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674,2.0,0.285037
3,Ying Chin,45.0,65000.0,F,Sales,G3,556,3.0,-0.246032
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711,2.0,0.451558
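
To connect the formula with the output, the sketch below computes the z-scores by hand and compares them with StandardScaler. One detail worth noting: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas std() defaults to the sample standard deviation (ddof=1).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Performance scores copied from the employee records
scores = pd.DataFrame({
    "performance_score": [723, 520, 674, 556, 711, 649, 53, 901, 709],
})

# Z = (X - mu) / sigma, using the population standard deviation
mu = scores["performance_score"].mean()
sigma = scores["performance_score"].std(ddof=0)
manual_z = (scores["performance_score"] - mu) / sigma

# Same computation through scikit-learn; ravel() flattens the (9, 1) result
sklearn_z = StandardScaler().fit_transform(scores[["performance_score"]]).ravel()

print(manual_z.round(6).tolist())
print(np.allclose(manual_z, sklearn_z))
```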


### Min-Max Scaling

#### What is Min-Max Scaling?
- Min-Max Scaling is a feature scaling technique that linearly transforms the original data into a specified range (e.g., [0, 1]).
- It preserves the relationships between the scaled data and the original data.

#### When to Use Min-Max Scaling?
- Min-Max Scaling works well when:
    - The data is not normally distributed.
    - The standard deviation of the data is very small.
- It is sensitive to outliers: a single extreme value sets the minimum or maximum and compresses the remaining values into a narrow part of the range.

#### Formula for Min-Max Scaling:
Let:
- \( X_{\text{min}} \): Minimum value of the feature column.
- \( X_{\text{max}} \): Maximum value of the feature column.
- \( X_{\text{new\_min}} \): New minimum value (e.g., 0).
- \( X_{\text{new\_max}} \): New maximum value (e.g., 1).

The formula for Min-Max Scaling is:
\[
X_{\text{scaled}} = X_{\text{new\_min}} + \frac{(X - X_{\text{min}})}{(X_{\text{max}} - X_{\text{min}})} \times (X_{\text{new\_max}} - X_{\text{new\_min}})
\]

Where:
- \( X \): Original value of the feature.
- \( X_{\text{scaled}} \): Scaled value of the feature.

#### Key Takeaways:
- Min-Max Scaling ensures that all features are scaled to the same range, making them comparable.
- It is particularly useful for algorithms that are sensitive to the scale of input features, such as K-Nearest Neighbors (KNN) and Neural Networks.

![image.png](attachment:image.png)

In [41]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
# Initialise the MinMaxScaler
scaler = MinMaxScaler()
# To scale data
scaler.fit(data['performance_score'].values.reshape(-1,1))
data['performance_minmax_scaler']=scaler.transform(data['performance_score'].values.reshape(-1,1))
data.head()

Unnamed: 0,name,age,income,gender,department,grade,performance_score,grade_encoded,performance_std_scaler,performance_minmax_scaler
0,Allen Smith,45.0,,,Operations,G3,723,3.0,0.505565,0.790094
1,S Kumar,,16000.0,F,Finance,G0,520,0.0,-0.408053,0.550708
2,Jack Morgan,32.0,35000.0,M,Finance,G2,674,2.0,0.285037,0.732311
3,Ying Chin,45.0,65000.0,F,Sales,G3,556,3.0,-0.246032,0.59316
4,Dheeraj Patel,30.0,42000.0,F,Operations,G2,711,2.0,0.451558,0.775943
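
The formula with an arbitrary target range can be checked against MinMaxScaler through its feature_range parameter. In the sketch below, min_max_scale is a hypothetical helper written for illustration, and the range [-1, 1] is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Performance scores copied from the employee records
scores = pd.DataFrame({
    "performance_score": [723, 520, 674, 556, 711, 649, 53, 901, 709],
})

def min_max_scale(x, new_min=0.0, new_max=1.0):
    """Hypothetical helper: linearly map x from [x.min(), x.max()] to [new_min, new_max]."""
    return new_min + (x - x.min()) / (x.max() - x.min()) * (new_max - new_min)

# Manual scaling to the range [-1, 1]
manual = min_max_scale(scores["performance_score"], new_min=-1.0, new_max=1.0)

# The same scaling through scikit-learn
scaler = MinMaxScaler(feature_range=(-1, 1))
sklearn_scaled = scaler.fit_transform(scores[["performance_score"]]).ravel()

print(manual.round(6).tolist())
print(np.allclose(manual, sklearn_scaled))
```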


### Robust Scaling

#### What is Robust Scaling?
- Robust Scaling is a feature scaling technique similar to Min-Max Scaling.
- Instead of using the minimum and maximum values, it uses the **interquartile range (IQR)**.
- This makes it **robust to outliers**, as it focuses on the central range of the data.

#### Formula:
Let:
- \( Q_1 \): First quartile (25th percentile) of the feature column.
- \( Q_3 \): Third quartile (75th percentile) of the feature column.
- \( \text{IQR} = Q_3 - Q_1 \): Interquartile range.

The formula for Robust Scaling is:
\[
X_{\text{scaled}} = \frac{X - Q_2}{Q_3 - Q_1}
\]

Where:
- \( X \): Original value of the feature.
- \( Q_2 \): Median of the feature column.
- \( X_{\text{scaled}} \): Scaled value of the feature.

#### Benefits of Robust Scaling:
- Handles outliers effectively by focusing on the interquartile range.
- Ensures that the scaled data is less influenced by extreme values.

#### When to Use Robust Scaling?
- When the dataset contains significant outliers.
- When the data distribution is skewed or non-normal.

#### Key Takeaway:
- Robust Scaling is a powerful technique for scaling features in datasets with outliers, because the median and IQR it relies on are far less affected by extreme values than the mean, minimum, or maximum.

![image.png](attachment:image.png)
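
A minimal sketch of robust scaling on the performance scores follows, comparing the formula computed by hand with scikit-learn's RobustScaler at its default settings (median centering and the 25th to 75th percentile range). The data is rebuilt inline from the employee records.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Performance scores copied from the employee records
scores = pd.DataFrame({
    "performance_score": [723, 520, 674, 556, 711, 649, 53, 901, 709],
})

# X_scaled = (X - Q2) / (Q3 - Q1)
q1 = scores["performance_score"].quantile(0.25)
q2 = scores["performance_score"].median()
q3 = scores["performance_score"].quantile(0.75)
manual = (scores["performance_score"] - q2) / (q3 - q1)

# The same scaling through scikit-learn with default settings
sklearn_scaled = RobustScaler().fit_transform(scores[["performance_score"]]).ravel()

print(manual.round(6).tolist())
print(np.allclose(manual, sklearn_scaled))
```

The low score of 53 still maps to a large negative value, but it does not change how the remaining scores are spread, because the median and quartiles barely move when one extreme value is present.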