**Lab: Data Preprocessing in Python**

**Objective:**
Learn how to clean, preprocess, and prepare data for machine learning models using Python. We'll cover handling missing data, encoding categorical variables, feature scaling, and splitting data for training and testing.

**Dataset:**
We will use a small sample dataset that contains both numerical and categorical columns. Try changing the values in the data to experiment with the code.

**Step 1: Import Required Libraries**


In [1]:
# Import libraries for data handling, splitting, and preprocessing
import numpy as np  # For numerical operations, e.g., handling missing data
import pandas as pd  # For working with data in table format (DataFrames)
from sklearn.model_selection import train_test_split  # To split data into training and testing sets
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder  # For encoding and scaling data
from sklearn.impute import SimpleImputer  # To fill in missing data

**Step 2: Load the Dataset**

In [2]:
# Create a sample dataset with missing values and both numerical and categorical data
data = {
    'Age': [25, np.nan, 28, 35, 42],  # The 'Age' column has a missing value (NaN)
    'Salary': [50000, 60000, np.nan, 80000, 120000],  # The 'Salary' column also has a missing value
    'Country': ['USA', 'France', 'Germany', np.nan, 'USA'],  # The 'Country' column has a missing value
    'Purchased': ['Yes', 'No', 'No', 'Yes', 'Yes']  # 'Purchased' is a categorical column (binary values: Yes/No)
}

# Load the data into a Pandas DataFrame (a table-like structure)
df = pd.DataFrame(data)

# Show the original dataset
print("Original Dataset:\n", df)

Original Dataset:
     Age    Salary  Country Purchased
0  25.0   50000.0      USA       Yes
1   NaN   60000.0   France        No
2  28.0       NaN  Germany        No
3  35.0   80000.0      NaN       Yes
4  42.0  120000.0      USA       Yes
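
Before filling anything in, it can help to count how many values are actually missing in each column. The following is a minimal sketch that rebuilds the same sample data under a hypothetical name, `df_check`, so it can run on its own; `missing_counts` is likewise just an illustrative name.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from Step 2 (df_check is a hypothetical name for this sketch)
df_check = pd.DataFrame({
    'Age': [25, np.nan, 28, 35, 42],
    'Salary': [50000, 60000, np.nan, 80000, 120000],
    'Country': ['USA', 'France', 'Germany', np.nan, 'USA'],
    'Purchased': ['Yes', 'No', 'No', 'Yes', 'Yes'],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_counts = df_check.isna().sum()
print("Missing values per column:\n", missing_counts)
```

Each of Age, Salary, and Country has one missing value here, which is why Step 3 treats all three.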


**Step 3: Handling Missing Data**

In [3]:
# Step 3.1: Handle missing numerical data (Age and Salary columns)

# Specify which columns are numerical
numerical_features = ['Age', 'Salary']

# Create an imputer to fill missing values in numerical columns with the mean value of that column
imputer_num = SimpleImputer(strategy='mean')

# Apply the imputer to the numerical columns
df[numerical_features] = imputer_num.fit_transform(df[numerical_features])

# Step 3.2: Handle missing categorical data (Country column)

# Specify which column is categorical
categorical_features = ['Country']

# Create an imputer to fill missing values in the categorical column with the most frequent value
imputer_cat = SimpleImputer(strategy='most_frequent')

# Apply the imputer to the categorical column
df[categorical_features] = imputer_cat.fit_transform(df[categorical_features])

# Show the dataset after missing values are handled
print("Dataset after handling missing data:\n", df)

Dataset after handling missing data:
     Age    Salary  Country Purchased
0  25.0   50000.0      USA       Yes
1  32.5   60000.0   France        No
2  28.0   77500.0  Germany        No
3  35.0   80000.0      USA       Yes
4  42.0  120000.0      USA       Yes
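
To see where the filled-in values 32.5 and 77500.0 come from, one option is to inspect what the imputer learned. This is a small sketch, assuming the same Age and Salary values as above; `raw_num`, `imputer_demo`, `learned_means`, and `pandas_means` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numerical columns from the sample data, before imputation
raw_num = pd.DataFrame({
    'Age': [25, np.nan, 28, 35, 42],
    'Salary': [50000, 60000, np.nan, 80000, 120000],
})

# Fit only; statistics_ then holds the fill value learned for each column
imputer_demo = SimpleImputer(strategy='mean')
imputer_demo.fit(raw_num)
learned_means = imputer_demo.statistics_
print("Fill values learned by the imputer:", learned_means)

# pandas skips NaN by default, so mean() should give the same numbers
pandas_means = raw_num.mean().to_numpy()
print("Column means from pandas:", pandas_means)
```

The mean is computed over the non-missing values only, so Age uses four values (25, 28, 35, 42) and Salary uses four values (50000, 60000, 80000, 120000).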


**Step 4: Encoding Categorical Data**

In [4]:
# Step 4.1: Convert the 'Purchased' column (Yes/No) into numerical values using Label Encoding

# Create a label encoder. It assigns integer codes in sorted label order, so 'No' becomes 0 and 'Yes' becomes 1
label_encoder = LabelEncoder()

# Apply the encoder to the 'Purchased' column
df['Purchased'] = label_encoder.fit_transform(df['Purchased'])

# Step 4.2: Convert the 'Country' column into multiple columns (one for each country) using One-Hot Encoding
# This helps the machine learning model understand the different categories without ranking them

# Use get_dummies to create a new column for each unique value in 'Country'
# Set drop_first=True to drop the first category (France, alphabetically), since it is redundant:
# a row where both Country_Germany and Country_USA are False must be France
df = pd.get_dummies(df, columns=['Country'], drop_first=True)

# Show the dataset after encoding categorical data
print("Dataset after encoding categorical data:\n", df)

Dataset after encoding categorical data:
     Age    Salary  Purchased  Country_Germany  Country_USA
0  25.0   50000.0          1            False         True
1  32.5   60000.0          0            False        False
2  28.0   77500.0          0             True        False
3  35.0   80000.0          1            False         True
4  42.0  120000.0          1            False         True
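
The two encodings can be checked on their own. The sketch below, which uses hypothetical names (`encoder_demo`, `mapping`, `dummies`), shows which integer each label received and which country became the dropped baseline. It assumes the Country values after imputation from Step 3.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Label encoding: classes_ is sorted, and the position of a label is its integer code
purchased = ['Yes', 'No', 'No', 'Yes', 'Yes']
encoder_demo = LabelEncoder()
codes = encoder_demo.fit_transform(purchased)
mapping = {str(label): code for code, label in enumerate(encoder_demo.classes_)}
print("Label mapping:", mapping)

# inverse_transform recovers the original labels from the codes
decoded = list(encoder_demo.inverse_transform(codes))

# One-hot encoding: with drop_first=True the first category (France) gets no column
countries = pd.DataFrame({'Country': ['USA', 'France', 'Germany', 'USA', 'USA']})
dummies = pd.get_dummies(countries, columns=['Country'], drop_first=True)
print("Dummy columns:", list(dummies.columns))
```

Keeping the fitted encoder around is useful later, because `inverse_transform` turns model predictions (0 or 1) back into 'No' or 'Yes'.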


**Step 5: Feature Scaling**

In [5]:
# Many machine learning models (for example, distance-based or gradient-based ones) work better when numerical features are on the same scale
# Here, Age ranges from 25 to 42, but Salary ranges from 50,000 to 120,000, so without scaling Salary would dominate

# Create a StandardScaler to scale the numerical features (Age and Salary) to have a mean of 0 and standard deviation of 1
scaler = StandardScaler()

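# Apply the scaler to the numerical columns
# (fitted on the full dataset here for simplicity; in a real project, fit it on the training set only)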
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

# Show the dataset after feature scaling
print("Dataset after feature scaling:\n", df)

Dataset after feature scaling:
         Age    Salary  Purchased  Country_Germany  Country_USA
0 -1.275038 -1.146829          1            False         True
1  0.000000 -0.729800          0            False        False
2 -0.765023  0.000000          0             True        False
3  0.425013  0.104257          1            False         True
4  1.615048  1.772373          1            False         True
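
Standardization replaces each value x with (x - mean) / std, computed per column. As a rough check, the sketch below reproduces the scaled Age column by hand; `ages`, `manual`, and `scaled` are hypothetical names, and the input is the Age column after imputation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age column after imputation (the missing value was filled with 32.5)
ages = np.array([25.0, 32.5, 28.0, 35.0, 42.0])

# Manual z-score; np.std divides by n by default (population standard deviation)
manual = (ages - ages.mean()) / ages.std()

# StandardScaler expects a 2D array, so reshape to one column and flatten the result
scaled = StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()

print("Manual:        ", manual)
print("StandardScaler:", scaled)
```

The two results agree because StandardScaler also uses the population standard deviation. Note that pandas' `Series.std()` divides by n - 1 by default, so it would give slightly different numbers.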


**Step 6: Splitting the Dataset into Training and Test Sets**

In [6]:
# To evaluate the performance of a machine learning model, we need to split the data into:
# - A training set (to train the model)
# - A test set (to test how well the model performs on unseen data)

# Step 6.1: Separate the features (X) from the target label (y)

# The features (X) are all columns except 'Purchased'
X = df.drop('Purchased', axis=1)

# The target (y) is the 'Purchased' column, which tells us whether someone purchased or not
y = df['Purchased']

# Step 6.2: Split the data into training and testing sets
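# We will use 80% of the data for training and 20% for testing (with 5 rows, that is 4 for training and 1 for testing)

# Use train_test_split to randomly split the data; random_state=42 makes the split reproducible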
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Show the resulting training and test sets
print("Training Set Features (X_train):\n", X_train)
print("Test Set Features (X_test):\n", X_test)

Training Set Features (X_train):
         Age    Salary  Country_Germany  Country_USA
4  1.615048  1.772373            False         True
2 -0.765023  0.000000             True        False
0 -1.275038 -1.146829            False         True
3  0.425013  0.104257            False         True
Test Set Features (X_test):
    Age  Salary  Country_Germany  Country_USA
1  0.0 -0.7298            False        False
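
In this lab the imputer and scaler were fitted on the full dataset before splitting, which keeps the steps easy to follow. On real data, a common alternative is to split first and fit the preprocessing on the training set only, so that no information from the test set leaks into training. The following is a minimal sketch of that ordering, limited to the numerical columns; `raw`, `target`, `train_ready`, and `test_ready` are hypothetical names, and `target` is the already-encoded Purchased column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Raw numerical features (still containing NaN) and the encoded target
raw = pd.DataFrame({
    'Age': [25, np.nan, 28, 35, 42],
    'Salary': [50000, 60000, np.nan, 80000, 120000],
})
target = pd.Series([1, 0, 0, 1, 1], name='Purchased')

# Split BEFORE any statistics are learned from the data
raw_train, raw_test, t_train, t_test = train_test_split(
    raw, target, test_size=0.2, random_state=42
)

imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()

# fit_transform on the training set learns the means and standard deviations
train_ready = scaler.fit_transform(imputer.fit_transform(raw_train))

# transform on the test set reuses the training statistics without refitting
test_ready = scaler.transform(imputer.transform(raw_test))

print("Prepared training features:\n", train_ready)
print("Prepared test features:\n", test_ready)
```

The numbers differ from the outputs above because the means and standard deviations now come from the four training rows only.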


**Summary:**
* **Handling Missing Data:** We filled missing values in numerical columns with the mean and in categorical columns with the most frequent value.
* **Encoding Categorical Data:** We converted categories into numbers that machine learning models can work with: label encoding for the Yes/No target and one-hot encoding for Country.
* **Feature Scaling:** We standardized numerical features to ensure they are on the same scale.
* **Train-Test Split:** We divided the data into training and test sets to evaluate our model’s performance.