# Lab 1: **Data Preprocessing**
Data preprocessing is an essential step in the machine learning and data analysis pipeline. It involves cleaning, transforming, and organizing raw data into a format suitable for analysis and modeling. This step ensures that machine learning algorithms work effectively with the given data.

---

## **Key Steps in Data Preprocessing**

### **1. Data Cleaning**
Data cleaning involves fixing or removing incorrect, incomplete, or irrelevant data. Common tasks include:
- **Handling Missing Values**: Replace or remove missing values.
- **Removing Duplicates**: Eliminate duplicate rows.
- **Outlier Detection and Removal**: Identify and remove unusual data points.
- **Correcting Inconsistent Data**: Fix typos, formatting issues, or inconsistent capitalization.
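
The cleaning tasks above can be sketched with pandas. This is a minimal illustration, not a fixed recipe: the `raw` DataFrame and its `city` and `salary` columns are hypothetical, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import pandas as pd

# Hypothetical toy data: inconsistent capitalization, one duplicate row,
# and one implausibly large salary.
raw = pd.DataFrame({
    "city": ["Paris", "paris ", "London", "LONDON", "Paris", "Paris"],
    "salary": [52000, 51000, 48000, 50000, 900000, 52000],
})

# Correct inconsistent text: trim whitespace and normalize capitalization.
raw["city"] = raw["city"].str.strip().str.title()

# Remove exact duplicate rows (the first occurrence is kept).
deduped = raw.drop_duplicates()

# Flag outliers with the 1.5 * IQR rule and keep only the in-range rows.
q1, q3 = deduped["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = deduped["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = deduped[in_range]

print(cleaned)
```

Whether an out-of-range value should be dropped, capped, or kept depends on the data; the filter here only shows the mechanics.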

---

### **2. Handling Missing Values**
Missing data can negatively impact the model's performance. To handle missing values:
- **For Numerical Data**:
  - Replace missing values with the **mean**, **median**, or **mode**.
  - Use a constant value, such as 0.
- **For Categorical Data**:
  - Replace missing values with the most frequent value (**mode**).
  - Use a placeholder category like `"Unknown"`.
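
As a small sketch of these options, the snippet below fills a numeric column with its median and a categorical column with either the mode or a placeholder. The `people` DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value per column.
people = pd.DataFrame({
    "age": [22.0, None, 30.0, 95.0],
    "dept": ["HR", None, "IT", "IT"],
})

# The median is less sensitive to extreme values (such as 95) than the mean.
age_filled = people["age"].fillna(people["age"].median())

# Option A for categories: the most frequent value (mode).
dept_mode = people["dept"].fillna(people["dept"].mode()[0])

# Option B: an explicit placeholder that keeps "missing" visible as a category.
dept_unknown = people["dept"].fillna("Unknown")

print(age_filled.tolist())
print(dept_mode.tolist())
print(dept_unknown.tolist())
```

Here the mean of the observed ages is 49 while the median is 30, which shows why the median is often preferred when a column contains extreme values.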

---

### **3. Encoding Categorical Data**
Most machine learning algorithms require numerical input. Convert categorical variables into numerical values:
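- **Label Encoding**:
  Assigns a unique integer to each category (e.g., Male = 0, Female = 1). The integers imply an order, so this works best for binary or ordinal categories.
- **One-Hot Encoding**:
  Creates one binary column per category (e.g., "City_A", "City_B"), with a 1 in the column that matches the row's category.
- **Binary Encoding**:
  Assigns each category an integer and writes it in binary, one column per bit, so `k` categories need about `log2(k)` columns instead of `k`.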
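
The three encodings can be compared side by side on one column. The sketch below uses a hypothetical `City` series; the binary-encoding part is a hand-rolled illustration of the idea and does not use a dedicated library.

```python
import pandas as pd

# Hypothetical categorical column with three distinct values.
cities = pd.Series(["Paris", "London", "Tokyo", "London"], name="City")

# Label encoding: one integer per category (codes follow sorted category order).
label_codes = cities.astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(cities, prefix="City", dtype=int)

# Binary encoding: write each label code in base 2, one column per bit.
n_bits = max(1, int(label_codes.max()).bit_length())
binary = pd.DataFrame({
    f"City_bit{b}": (label_codes.to_numpy() >> b) & 1 for b in range(n_bits)
})

print(label_codes.tolist())
print(one_hot)
print(binary)
```

With three categories, one-hot encoding produces three columns while the binary form needs only two; the gap widens as the number of categories grows.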

---

### **4. Feature Scaling**
Scaling ensures that all numerical features are on a similar scale, preventing dominance by features with larger ranges. Common methods:
- **Min-Max Scaling**:
  Rescales data to a range of `[0, 1]`.
  \[
  X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
  \]
- **Standardization**:
  Rescales data to have a mean of 0 and a standard deviation of 1.
  \[
  X' = \frac{X - \mu}{\sigma}
  \]
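
Both formulas translate directly into pandas arithmetic. The sketch below applies them to a hypothetical `ages` series; it assumes the population standard deviation for σ, which is a choice rather than a requirement.

```python
import pandas as pd

# Hypothetical numeric feature.
ages = pd.Series([20.0, 30.0, 40.0, 50.0, 60.0])

# Min-max scaling: (X - X_min) / (X_max - X_min)
min_max = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: (X - mu) / sigma, using the population standard deviation
# (ddof=0). pandas defaults to the sample standard deviation (ddof=1).
standardized = (ages - ages.mean()) / ages.std(ddof=0)

print(min_max.tolist())
print(standardized.round(3).tolist())
```

After min-max scaling the smallest value maps to 0 and the largest to 1; after standardization the column has mean 0 and standard deviation 1.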

---

### **5. Feature Selection/Engineering**
Feature engineering involves:
- **Removing irrelevant or redundant features**: Eliminate columns that do not contribute to the prediction.
- **Creating new features**: Combine existing features to create more meaningful ones.
- **Extracting features**: Use domain knowledge to identify the most relevant features.
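
A short sketch of these three operations follows. The `orders` DataFrame and every column name in it are hypothetical, chosen only to make each step concrete.

```python
import pandas as pd

# Hypothetical order data.
orders = pd.DataFrame({
    "order_id": [101, 102, 103],  # identifier, carries no predictive signal
    "total_price": [40.0, 90.0, 15.0],
    "n_items": [4, 3, 1],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-08"]),
})

# Creating a new feature by combining two existing ones.
orders["price_per_item"] = orders["total_price"] / orders["n_items"]

# Extracting a feature: day of the week (Monday = 0 ... Sunday = 6).
orders["weekday"] = orders["order_date"].dt.dayofweek

# Removing columns that are irrelevant or now redundant.
features = orders.drop(columns=["order_id", "order_date"])

print(features)
```

Which columns count as irrelevant is a judgment that depends on the prediction task; the identifier column is dropped here only because it is unique per row.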

---

### **6. Splitting the Dataset**
Divide the dataset into:
- **Training Set**: Used to train the model (typically 70-80% of the data).
- **Testing Set**: Used to evaluate the model’s performance (20-30%).

Evaluating on data the model never saw during training gives an honest estimate of how well it generalizes to unseen data.
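
One compact way to perform such a split with pandas alone is sketched below. The `data` DataFrame is a hypothetical placeholder, and the 80/20 ratio and the seed are arbitrary choices.

```python
import pandas as pd

# Hypothetical dataset with 10 rows.
data = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# Draw 80% of the rows at random for training; a fixed seed makes it repeatable.
train = data.sample(frac=0.8, random_state=0)

# Every row not drawn for training goes to the test set.
test = data.drop(train.index)

print(len(train), len(test))
```

Because the test rows are defined as "everything not in the training set", the two sets cannot overlap.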

---

## **Why Is Data Preprocessing Important?**

1. **Improves Model Accuracy**:
   Clean and preprocessed data lead to better predictions and model performance.
2. **Handles Real-World Imperfections**:
   Raw data often contains missing, messy, or inconsistent values. Preprocessing ensures that the data is reliable.
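3. **Prevents Algorithm Bias**:
   Scaling keeps features with large numeric ranges from dominating distance-based or gradient-based algorithms, and a suitable encoding avoids implying an order among categories that have none.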
4. **Reduces Noise and Redundancy**:
   Eliminating irrelevant data helps focus on meaningful patterns.

---



### **Step 1: Load the Dataset**


In [28]:

# Importing necessary libraries
import pandas as pd

# Sample dataset with issues: missing values, duplicates, and inconsistencies
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Alice'],  # Duplicate
    'Age': [25, 30, None, 35, 28, 25],  # Missing value
    'Gender': ['Female', 'Male', None, 'Male', 'Female', 'Female'],  # Missing value
    'Salary': [50000, 54000, 58000, None, 62000, 50000],  # Missing value, duplicate
    'Purchased': ['Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']  # Target variable
})
df

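| | Name | Age | Gender | Salary | Purchased |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | Female | 50000.0 | Yes |
| 1 | Bob | 30.0 | Male | 54000.0 | No |
| 2 | Charlie | NaN | None | 58000.0 | Yes |
| 3 | David | 35.0 | Male | NaN | No |
| 4 | Eve | 28.0 | Female | 62000.0 | Yes |
| 5 | Alice | 25.0 | Female | 50000.0 | Yes |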


### **Step 2: Handle Missing Values**



In [29]:
# Handling missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill missing numerical values with the mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())  # Fill missing numerical values with the mean
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])  # Fill missing categorical values with the mode

# Display the dataset after handling missing values
df

| | Name | Age | Gender | Salary | Purchased |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | Female | 50000.0 | Yes |
| 1 | Bob | 30.0 | Male | 54000.0 | No |
| 2 | Charlie | 28.6 | Female | 58000.0 | Yes |
| 3 | David | 35.0 | Male | 54800.0 | No |
| 4 | Eve | 28.0 | Female | 62000.0 | Yes |
| 5 | Alice | 25.0 | Female | 50000.0 | Yes |


In [30]:
# Remove duplicate rows
df = df.drop_duplicates()

# Display dataset after removing duplicates
df


| | Name | Age | Gender | Salary | Purchased |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | Female | 50000.0 | Yes |
| 1 | Bob | 30.0 | Male | 54000.0 | No |
| 2 | Charlie | 28.6 | Female | 58000.0 | Yes |
| 3 | David | 35.0 | Male | 54800.0 | No |
| 4 | Eve | 28.0 | Female | 62000.0 | Yes |



### **Step 3: Encode Categorical Data**


In [31]:
# Encoding categorical data (Manual Encoding) using .loc to avoid SettingWithCopyWarning
df.loc[:, 'Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})  # Gender: Male = 0, Female = 1
df.loc[:, 'Purchased'] = df['Purchased'].map({'No': 0, 'Yes': 1})  # Purchased: No = 0, Yes = 1

# Display dataset after encoding
df


| | Name | Age | Gender | Salary | Purchased |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | 1 | 50000.0 | 1 |
| 1 | Bob | 30.0 | 0 | 54000.0 | 0 |
| 2 | Charlie | 28.6 | 1 | 58000.0 | 1 |
| 3 | David | 35.0 | 0 | 54800.0 | 0 |
| 4 | Eve | 28.0 | 1 | 62000.0 | 1 |


### **Step 4: Scale Features (Min-Max Scaling)**


In [32]:
# Scaling numerical features (Min-Max Scaling) using .loc 
df.loc[:, 'Age'] = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())
df.loc[:, 'Salary'] = (df['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())

# Display dataset after scaling features
df


| | Name | Age | Gender | Salary | Purchased |
|---|---|---|---|---|---|
| 0 | Alice | 0.0 | 1 | 0.0 | 1 |
| 1 | Bob | 0.5 | 0 | 0.333333 | 0 |
| 2 | Charlie | 0.36 | 1 | 0.666667 | 1 |
| 3 | David | 1.0 | 0 | 0.4 | 0 |
| 4 | Eve | 0.3 | 1 | 1.0 | 1 |


### **Step 5: Split the Dataset (Train-Test Split)**


In [33]:
# Shuffle the dataset
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Calculate split index (80% for training, 20% for testing)
split_index = int(0.8 * len(df))

# Split the dataset into training and testing sets
train_data = df_shuffled[:split_index]
test_data = df_shuffled[split_index:]

# Separate features and labels for training and testing sets.
# The explicit dtype yields plain numeric arrays even if the encoded columns
# are still stored with object dtype after the .loc assignments above.
X_train = train_data[['Age', 'Gender', 'Salary']].to_numpy(dtype=float)
y_train = train_data['Purchased'].to_numpy(dtype=int)
X_test = test_data[['Age', 'Gender', 'Salary']].to_numpy(dtype=float)
y_test = test_data['Purchased'].to_numpy(dtype=int)

# Display the split data
print("Training Features:\n", X_train)
print("Testing Features:\n", X_test)
print("Training Labels:\n", y_train)
print("Testing Labels:\n", y_test)


Training Features:
 [[0.5        0.         0.33333333]
 [0.3        1.         1.        ]
 [0.36       1.         0.66666667]
 [0.         1.         0.        ]]
Testing Features:
 [[1.  0.  0.4]]
Training Labels:
 [0 1 1 1]
Testing Labels:
 [0]
