# Data Preparation in ML

Data preparation is a crucial step in Machine Learning (ML), and understanding the different types of data is fundamental for selecting appropriate preprocessing methods and algorithms. Data types refer to the nature of the data you're working with and influence how you treat it during preparation and modeling.

## - Key Data Types: 


Numeric (Quantitative) Data  
- **Continuous:** Can take any real number value within a range. Examples include weight, height, and temperature.  
- **Discrete:** Represents countable quantities and only takes integer values. Examples include the number of products sold or the number of customers in a store.  

Categorical (Qualitative) Data  
- **Nominal:** Unordered categories or labels, such as gender (male, female), colors (red, blue, green), or product types.  
- **Ordinal:** Ordered categories with a meaningful rank or order, such as education levels (high school, bachelor’s, master’s) or ratings (poor, fair, good, excellent).  

Text Data  
- Text data includes any type of data that consists of words, sentences, or documents, such as product reviews, social media posts, or emails.  

Time Series Data  
- This data type represents observations collected at specific time intervals, such as stock prices, weather data, or sales over time. Time series data is ordered and often exhibits patterns like trends or seasonality.  

Boolean (Binary) Data  
- Binary data consists of two possible values, often represented as 0 and 1 or True and False. Examples include yes/no questions or outcomes such as spam/not spam.  

Image Data  
- Image data consists of pixel values, typically represented as arrays or matrices. Each pixel value is either grayscale (0 to 255) or RGB (three values, one for each channel: Red, Green, Blue). 
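To connect these data types to practice, here is a minimal sketch of how they might appear in a pandas DataFrame. The toy table and its column names are invented for illustration and are not part of the dataset used later. It relies on `select_dtypes` to separate numeric, boolean, and ordered-categorical columns.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
toy = pd.DataFrame({
    "height_cm": [170.2, 165.5, 180.1],    # continuous numeric
    "items_bought": [3, 1, 7],             # discrete numeric
    "color": ["red", "blue", "green"],     # nominal categorical (plain strings)
    "rating": pd.Categorical(
        ["poor", "good", "fair"],
        categories=["poor", "fair", "good", "excellent"],
        ordered=True,                       # ordinal categorical
    ),
    "is_spam": [True, False, False],       # boolean
})

# Group columns by the kind of data they hold
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
bool_cols = toy.select_dtypes(include="bool").columns.tolist()
ordinal_cols = toy.select_dtypes(include="category").columns.tolist()

print(toy.dtypes)
print("Numeric:", numeric_cols)
print("Boolean:", bool_cols)
print("Ordinal:", ordinal_cols)
```

Knowing which columns fall into which group is what later determines the preprocessing step: imputation and scaling for numeric columns, mode filling or encoding for categorical ones.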

# Feature Preparation

## Step 1: Checking for Missing Values

In [None]:
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'Age': [25, np.nan, 35, 45, np.nan, 50],
    'Salary': [50000, 60000, np.nan, 80000, 90000, np.nan],
    'City': ['New York', 'Los Angeles', 'New York', np.nan, 
'San Francisco', 'Chicago']
}

df = pd.DataFrame(data)

In [None]:
print(df.isna().sum())

This tells us:

- 2 missing values in Age
- 2 missing values in Salary
- 1 missing value in City
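Raw counts are easier to judge relative to the dataset size. As a small sketch, `isna().mean()` gives the fraction of missing values per column, and `isna().any(axis=1)` flags rows with at least one gap. The sample data is rebuilt here so the block runs on its own, and the name `df_demo` is only illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as above, rebuilt so this block is self-contained
df_demo = pd.DataFrame({
    "Age": [25, np.nan, 35, 45, np.nan, 50],
    "Salary": [50000, 60000, np.nan, 80000, 90000, np.nan],
    "City": ["New York", "Los Angeles", "New York", np.nan,
             "San Francisco", "Chicago"],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_fraction = df_demo.isna().mean()
print(missing_fraction.round(2))

# Number of rows that contain at least one missing value
rows_with_any_nan = int(df_demo.isna().any(axis=1).sum())
print("Rows with at least one NaN:", rows_with_any_nan)
```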

## Step 2: Understanding the Impact on Modeling

When using this dataset for machine learning, missing values can:

- Cause errors in models (most ML algorithms do not handle NaNs).
- Bias results if missing values are not random.
- Reduce training data if rows are removed.

In [None]:
from sklearn.linear_model import LinearRegression

# Selecting only numerical features for a simple regression model
X = df[['Age']]
y = df['Salary']

# Trying to fit a model with missing values (this will fail)
model = LinearRegression()
model.fit(X, y)  # Will raise an error due to NaN values


🔴 ValueError: Input X contains NaN. LinearRegression does not accept missing values encoded as NaN natively. 

## Step 3: Handling Missing Values

**We need to decide on a strategy before modeling.**

### - Dropping Missing Values (Not Always Recommended)

In [None]:
df_dropped = df.dropna()
print(df_dropped)

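> This removes every row that contains at least one NaN. In this sample only 1 of the 6 rows survives, which shows how quickly dropping can shrink a dataset.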
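Dropping does not have to be all-or-nothing. The sketch below, which assumes the same sample data (rebuilt as `df_demo` so it runs on its own), uses the `subset` and `thresh` parameters of `DataFrame.dropna` to drop rows more selectively.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Age": [25, np.nan, 35, 45, np.nan, 50],
    "Salary": [50000, 60000, np.nan, 80000, 90000, np.nan],
    "City": ["New York", "Los Angeles", "New York", np.nan,
             "San Francisco", "Chicago"],
})

# Drop a row if ANY column is missing (the default behaviour)
all_complete = df_demo.dropna()

# Drop a row only if one of the listed columns is missing
numeric_complete = df_demo.dropna(subset=["Age", "Salary"])

# Keep rows that have at least 2 non-missing values
at_least_two = df_demo.dropna(thresh=2)

print(len(all_complete), len(numeric_complete), len(at_least_two))
```

Restricting the check to the columns a model actually uses keeps more rows than a blanket `dropna()`, although on a dataset this small imputation is usually the more practical route.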

###  - Filling Numerical Data with Mean/Median & Categorical Data with Mode

In [None]:

# Display the dataset
print("Original Data:\n", df)

df['Age'] = df['Age'].fillna(df['Age'].mean())  # Mean imputation
df['Salary'] = df['Salary'].fillna(df['Salary'].median())  # Median imputation

# Fill missing categorical values with the mode (most frequent value)
df['City'] = df['City'].fillna(df['City'].mode()[0])

# Display the modified dataset
print("\nData after filling missing values:\n", df)

> **mean/median**
- Mean: Suitable for roughly symmetric data without extreme outliers (for example, approximately normal data).
- Median: Better for skewed data (e.g., salaries often have outliers).

> **mode**
- Mode: Uses the most frequent category to fill missing values.
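The difference between mean and median is easiest to see with an outlier. The following sketch uses a hypothetical salary column (values invented for illustration) with one extreme entry and one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: one extreme value and one missing entry
salaries = pd.Series([40000, 45000, 50000, 55000, 1000000, np.nan])

mean_fill = salaries.mean()      # NaN is skipped; pulled upward by the outlier
median_fill = salaries.median()  # NaN is skipped; unaffected by the outlier

print("Mean fill value:  ", mean_fill)
print("Median fill value:", median_fill)

# Fill the gap with the median, the more robust choice here
salaries_filled = salaries.fillna(median_fill)
print(salaries_filled)
```

Here the mean would fill the gap with a value far above four of the five observed salaries, while the median stays in the middle of the typical range.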

When Should You Still Consider Mean/Median?

🔹 **Small datasets** → With very few rows, KNNImputer (introduced in the next subsection) has few neighbors to draw on, so its estimates can be unreliable.  
🔹 **Outliers** → If the data has extreme values, median imputation is the more stable choice; the mean is pulled toward the outliers, and they can distort KNN distances as well.  
🔹 **Computational efficiency** → KNNImputer is more computationally expensive than mean/median.  

For larger structured datasets with related features, KNN imputation often gives more realistic estimates than a single fill value. 🚀  


### - Using Machine Learning for Imputation **(Numerical Data)** + Mode

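Instead of filling every gap with a single mean/median value, we can estimate each missing value from the rows that are most similar on the other features.

Example: estimate missing Salary from Age (and missing Age from Salary):

🚀 A common approach (using KNNImputer)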

In [43]:
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'Age': [25, np.nan, 35, 45, np.nan, 50],
    'Salary': [50000, 60000, np.nan, 80000, 90000, np.nan],
    'City': ['New York', 'Los Angeles', 'New York', np.nan, 'San Francisco', 'Chicago']
}

df = pd.DataFrame(data)

# Display the dataset before processing
print("Original Data:\n", df)

# Step 1: Apply KNN Imputer for numerical columns
imputer = KNNImputer(n_neighbors=2)
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])

# Step 2: Fill missing categorical values with mode
df['City'] = df['City'].fillna(df['City'].mode()[0])

# Display the modified dataset
print("\nData after filling missing values:\n", df)


Original Data:
     Age   Salary           City
0  25.0  50000.0       New York
1   NaN  60000.0    Los Angeles
2  35.0      NaN       New York
3  45.0  80000.0            NaN
4   NaN  90000.0  San Francisco
5  50.0      NaN        Chicago

Data after filling missing values:
     Age   Salary           City
0  25.0  50000.0       New York
1  35.0  60000.0    Los Angeles
2  35.0  65000.0       New York
3  45.0  80000.0       New York
4  35.0  90000.0  San Francisco
5  50.0  65000.0        Chicago


> ***Why is KNNImputer Better Than Mean/Median Imputation?***

✔ **More tailored estimates** → Instead of replacing every missing value with one fixed number (mean/median), KNNImputer estimates each value from the most similar rows.  
✔ **Handles relationships between features** → If Age and Salary are related, KNNImputer tends to preserve that relationship.  
✔ **Reduces distortion** → Mean/median imputation places all filled values at a single point, which shrinks the variance and weakens correlations between features; KNN imputation usually distorts the distribution less.  
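To make the neighbor logic concrete, this sketch reproduces the imputed Age of row 1 by hand. It assumes KNNImputer's default settings (the `nan_euclidean` distance and uniform weights) and treats only rows that have an observed Age and a defined distance as possible donors. The variable names are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Age and Salary columns from the example above
X = np.array([
    [25, 50000],
    [np.nan, 60000],
    [35, np.nan],
    [45, 80000],
    [np.nan, 90000],
    [50, np.nan],
], dtype=float)

target = 1  # row whose Age is missing

# Distances from the target row, computed only on features both rows have;
# the result is NaN when two rows share no observed feature.
dist = nan_euclidean_distances(X[[target]], X)[0]

# Donors: other rows with an observed Age and a defined distance
donors = [
    i for i in range(len(X))
    if i != target and not np.isnan(X[i, 0]) and not np.isnan(dist[i])
]
nearest = sorted(donors, key=lambda i: dist[i])[:2]
manual_age = X[nearest, 0].mean()

# Same value as produced by KNNImputer itself
sk_age = KNNImputer(n_neighbors=2).fit_transform(X)[target, 0]

print("Nearest donor rows:", nearest)
print("Manual estimate:", manual_age, "| KNNImputer:", sk_age)
```

Rows 2 and 5 cannot help here because they share no observed feature with row 1, so the estimate is the average Age of rows 0 and 3.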


### - Scale Numerical Features Using StandardScaler  



After handling missing values, numerical features should be standardized to ensure consistent scaling. **StandardScaler** transforms data so that it has a **mean of 0** and a **standard deviation of 1**, improving model performance for algorithms sensitive to feature magnitude.  

**Steps to Apply StandardScaler**  

1. **Identify numerical features** → Select columns that require scaling.  
2. **Apply StandardScaler** → Fit and transform the numerical columns. 

In [45]:
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'Age': [25, 37, 35, 45, 42, 50],
    'Salary': [50000, 60000, 100000, 80000, 90000, 70000],
    'City': ['New York', 'Los Angeles', 'New York', 'New York', 'San Francisco', 'Chicago']
}

df = pd.DataFrame(data)

from sklearn.preprocessing import StandardScaler

# Select numeric columns
numeric_cols = ['Age', 'Salary']

# Standardize the features
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print("\nData after scaling:\n", df)


Data after scaling:
         Age   Salary           City
0 -1.754575 -1.46385       New York
1 -0.250654 -0.87831    Los Angeles
2 -0.501307  1.46385       New York
3  0.751961  0.29277       New York
4  0.375980  0.87831  San Francisco
5  1.378595 -0.29277        Chicago


**Applied StandardScaler for Feature Scaling**

- `from sklearn.preprocessing import StandardScaler` → Imports the scaler.  
- `scaler = StandardScaler()` → Initializes the StandardScaler.  
- `df[numeric_cols] = scaler.fit_transform(df[numeric_cols])` → Standardizes the numerical columns.  

This transforms **Age** and **Salary** so that they have a **mean of 0** and a **standard deviation of 1**, ensuring consistency for machine learning models. 🚀  


### - Explanation of StandardScaler Transformation  

When we apply **StandardScaler**, it transforms numerical features like **Age** and **Salary** so that they have:  

✔ **Mean of 0** → The average value of the feature becomes **0** after scaling.  
✔ **Standard deviation of 1** → The spread of the feature is rescaled so that its standard deviation (and therefore its variance) becomes **1**.  

#### **Formula for Standardization**  
Each value \( x \) in the column is transformed using the formula:  

\[
x' = \frac{x - \mu}{\sigma}
\]

where:  
- \( x' \) = Transformed (standardized) value  
- \( x \) = Original value  
- \( \mu \) = Mean of the column  
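- \( \sigma \) = Standard deviation of the column (StandardScaler uses the population standard deviation, i.e. it divides by \( n \), not \( n - 1 \))  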

This ensures that **all features are on the same scale**, preventing any one feature (e.g., Salary with large values) from dominating the model.  
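As a quick check of the formula, the sketch below standardizes the Age column by hand with NumPy and compares the result with `StandardScaler`. It assumes the population standard deviation (`ddof=0`, NumPy's default), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([25, 37, 35, 45, 42, 50], dtype=float)

mu = ages.mean()             # column mean
sigma = ages.std()           # population standard deviation (ddof=0)
manual = (ages - mu) / sigma # x' = (x - mu) / sigma

# StandardScaler expects a 2D array, hence the reshape
scaled = StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()

print("Mean:", mu)
print("Manual:", manual.round(6))
print("Matches StandardScaler:", np.allclose(manual, scaled))
```

Using pandas' `Series.std()` instead would give slightly different numbers, because pandas defaults to the sample standard deviation (`ddof=1`).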

---

#### **Example: Before and After Scaling**  

##### **Original Data:**  
| Age  | Salary  |  
|------|--------|  
| 25   | 50000  |  
| 37   | 60000  |  
| 35   | 100000 |  
| 45   | 80000  |  
| 42   | 90000  |  
| 50   | 70000  |  

##### **After StandardScaler Transformation:**  
| Age (scaled) | Salary (scaled) |  
|--------------|-----------------|  
| -1.75        | -1.46           |  
| -0.25        | -0.88           |  
| -0.50        | 1.46            |  
| 0.75         | 0.29            |  
| 0.38         | 0.88            |  
| 1.38         | -0.29           |  

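- **All values are now centered around 0.**  
- **Each column now has unit variance, so Age and Salary are on comparable scales.**  
- **Scale-sensitive models (e.g., KNN, SVMs, and models trained with gradient descent) typically benefit from consistent scaling; tree-based models are largely unaffected.**  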


### - Text Preprocessing: Lowercasing and Removing Punctuation  

In text-based machine learning tasks, preprocessing is essential to clean and standardize textual data. This step ensures consistency and removes unnecessary elements such as punctuation and special characters, making the data more suitable for analysis.



In [47]:
# Sample sentence
text = "Hello! How's it going? I'm excited."

# Task 1: Convert to lowercase
text_lower = text.lower()
print(text_lower)

# Task 2: Remove punctuation and special characters
# (anything that is not a word character or whitespace)
import re

text_cleaned = re.sub(r'[^\w\s]', '', text_lower)
print(text_cleaned)

hello! how's it going? i'm excited.
hello hows it going im excited


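Alternatively, expand contractions first (using the third-party `contractions` package) so that "how's" becomes "how is" instead of "hows":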

In [49]:
import contractions
import re

# Sample text
text = "Hello! How's it going? I'm excited."

# Expand contractions
text_expanded = contractions.fix(text)
print("Expanded Text:", text_expanded)

# Convert to lowercase
text_lower = text_expanded.lower()

# Remove punctuation
text_cleaned = re.sub(r'[^\w\s]', '', text_lower)

print("Final Cleaned Text:", text_cleaned)


Expanded Text: Hello! How is it going? I am excited.
Final Cleaned Text: hello how is it going i am excited
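When the same cleaning has to run over many documents, it helps to wrap the steps in a function. The sketch below is one possible version using only the standard library. The function name `clean_text`, the sample reviews, and the extra whitespace-collapsing step are assumptions added for illustration.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, remove punctuation, and collapse repeated whitespace."""
    lowered = text.lower()
    # Remove anything that is not a word character or whitespace
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    # Collapse runs of whitespace left behind and trim the ends
    return re.sub(r"\s+", " ", no_punct).strip()

# Hypothetical sample documents
reviews = [
    "Great product!!!",
    "Hello! How's it going? I'm excited.",
    "  Not   worth $20...  ",
]

cleaned = [clean_text(r) for r in reviews]
for original, result in zip(reviews, cleaned):
    print(repr(original), "->", repr(result))
```

Note that `\w` also matches digits and underscores, so numbers such as `20` are kept; whether that is desirable depends on the task.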
