**Lab: Data Preprocessing in Python**

Objective:
By the end of this lab, students will be able to:

* Load a dataset into Python.
* Handle missing values.
* Encode categorical data.
* Scale features for machine learning models.

Materials:
* A computer with Python installed.
* Access to Jupyter Notebook or any Python IDE (e.g., PyCharm, VSCode).
* A sample dataset (e.g., Titanic dataset available from Kaggle or UCI Machine Learning Repository).

Dataset:
For this lab, we'll use the Titanic dataset. It contains passenger information such as age, sex, and fare, and the goal is to predict survival.

**Setting Up the Environment**

1. Install Required Libraries: Make sure you have the necessary Python libraries installed. Open your terminal or command prompt and run `pip install numpy pandas scikit-learn`.

2. Import Libraries: Open your Jupyter Notebook or Python IDE and start a new Python script. Import the following libraries:

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from io import StringIO
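
As an optional check before continuing, the short sketch below prints the installed library versions. It assumes that the `sparse_output` argument used later in this lab needs scikit-learn 1.2 or newer; the `versions` and `supports_sparse_output` names are only illustrative.

```python
import numpy as np
import pandas as pd
import sklearn

# Collect the installed version strings so they are easy to report.
versions = {
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")

# Assumption: OneHotEncoder(sparse_output=...) is available from scikit-learn 1.2 onward.
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
supports_sparse_output = (major, minor) >= (1, 2)
print("sparse_output supported:", supports_sparse_output)
```

If the last line prints `False`, upgrading scikit-learn is likely the simplest fix.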

**Loading the Dataset**

1. Load the Dataset: We'll use a sample Titanic dataset provided below. Load it using Pandas:

In [2]:
data = """
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,0,3,'Braund, Mr. Owen Harris',male,22,1,0,'A/5 21171',7.25,S
2,1,1,'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',female,38,1,0,'PC 17599',71.2833,C
3,1,3,'Heikkinen, Miss. Laina',female,26,0,0,'STON/O2. 3101282',7.925,S
4,1,1,'Futrelle, Mrs. Jacques Heath (Lily May Peel)',female,35,1,0,'113803',53.1,S
5,0,3,'Allen, Mr. William Henry',male,35,0,0,'373450',8.05,S
6,0,3,'Morley, Mr. John',male,40,0,0,'A/5 21171',8.05,S
7,0,1,'Davis, Mr. John',male,27,1,0,'A/5 21171',8.05,S
8,1,2,'Wilkes, Mrs. James (Ellen Needs)',female,30,1,0,'A/5 21171',8.05,S
9,1,3,'Bonnell, Miss. Elizabeth',female,22,1,0,'A/5 21171',8.05,S
10,0,2,'McCarthy, Mr. Timothy',male,23,0,0,'A/5 21171',8.05,S
"""

# The sample rows wrap Name and Ticket in single quotes, so set quotechar to match.
df = pd.read_csv(StringIO(data), quotechar="'")

2. Explore the Dataset: Check the first few rows of the dataset to understand its structure:

In [3]:
print(df.head())
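
The names in the sample data contain commas and are wrapped in single quotes, while pandas treats only double quotes as the quote character by default. With the default, the comma inside each name is read as a field separator, so the columns can end up shifted. The sketch below uses a two-row excerpt (the `sample` and `quoted` names are illustrative) to show how `quotechar` keeps each name in one field:

```python
from io import StringIO

import pandas as pd

# Two-row excerpt of the lab's sample data; names are wrapped in single quotes.
sample = """PassengerId,Survived,Name,Sex,Age
1,0,'Braund, Mr. Owen Harris',male,22
3,1,'Heikkinen, Miss. Laina',female,26
"""

# Treat single quotes as the field delimiter so the comma inside a name is kept.
quoted = pd.read_csv(StringIO(sample), quotechar="'")

print(quoted.shape)             # rows and columns actually parsed
print(quoted["Name"].tolist())  # each name stays in a single field
print(quoted.dtypes)            # column types inferred by pandas
```

Comparing the parsed shape against the number of header columns is a quick way to catch this kind of misalignment.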

**Handling Missing Values**

1. Identify Missing Values: Find out which columns have missing values and their count:

In [4]:
print(df.isnull().sum())

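2. Impute Missing Values: For simplicity, fill missing values in the numeric Age and Fare columns with each column's mean, and fill missing values in Embarked with its most frequent value. The ten-row sample above has no gaps, so these calls leave it unchanged, but the same code applies to a dataset that does have missing entries: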

In [5]:
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

3. Verify Changes: Confirm that there are no missing values left:

In [6]:
print(df.isnull().sum())
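
Because the ten-row sample has no gaps, the effect of imputation is easier to see on data that does. The sketch below applies the same mean and mode strategy to a small hypothetical frame (`toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.2833, np.nan, 53.1],
    "Embarked": ["S", "C", None, "S"],
})

missing_before = toy.isnull().sum()

# Numeric columns: replace gaps with the column mean (NaN is ignored by mean()).
for column in ["Age", "Fare"]:
    toy[column] = toy[column].fillna(toy[column].mean())

# Categorical column: replace gaps with the most frequent value.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

missing_after = toy.isnull().sum()

print(missing_before)
print(missing_after)
print(toy)
```

The mean is sensitive to outliers, so the median may be a reasonable alternative for skewed columns such as Fare.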

**Encoding Categorical Data**

1. Label Encoding: Convert the Sex column to integer codes. LabelEncoder assigns codes in sorted order of the labels, so female becomes 0 and male becomes 1:

In [7]:
label_encoder = LabelEncoder()
df['Sex'] = label_encoder.fit_transform(df['Sex'])

2. One-Hot Encoding: Convert the Embarked column to one-hot encoded variables:

In [8]:
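# sparse_output=False returns a dense array (requires scikit-learn 1.2 or newer).
onehot_encoder = OneHotEncoder(sparse_output=False)
embarked_encoded = onehot_encoder.fit_transform(df[['Embarked']])
# Reuse df's index so the new columns align row by row when concatenated.
embarked_df = pd.DataFrame(embarked_encoded, index=df.index, columns=onehot_encoder.get_feature_names_out(['Embarked']))
df = pd.concat([df, embarked_df], axis=1)
df = df.drop('Embarked', axis=1)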

3. Verify Encoding: Check the updated dataframe:

In [9]:
print(df.head())
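
To make the encodings easier to inspect, the sketch below recovers the label-to-code mapping from a fitted `LabelEncoder` and shows `pandas.get_dummies` as one possible alternative to `OneHotEncoder`. The `toy` frame and the `sex_mapping` name are illustrative, not part of the lab data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with the two categorical columns used in this lab.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

encoder = LabelEncoder()
toy["Sex"] = encoder.fit_transform(toy["Sex"])

# classes_ holds the labels in sorted order; a label's position is its integer code.
sex_mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(sex_mapping)

# get_dummies replaces Embarked with one 0/1 indicator column per category.
toy = pd.get_dummies(toy, columns=["Embarked"], prefix="Embarked", dtype=int)
print(toy)
```

`get_dummies` is convenient for exploration, while a fitted `OneHotEncoder` can be reused to encode new data with the same columns.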

**Feature Scaling**

1. Select Features for Scaling: Identify which features to scale (e.g., Age, Fare):

In [10]:
features_to_scale = ['Age', 'Fare']

2. Apply Scaling: Standardize these features so that each has zero mean and unit variance:

In [11]:
scaler = StandardScaler()
df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

3. Verify Scaling: Check the scaled features:

In [12]:
print(df[features_to_scale].head())
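
One way to confirm that standardization worked is to compare the scaler's output with the formula z = (x - mean) / std computed by hand. The sketch below does this for the Age and Fare values of the first five sample passengers. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), while pandas' `std()` defaults to `ddof=1`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Age and Fare of the first five passengers in the lab's sample data.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Manual standardization with the population standard deviation (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
print(np.allclose(scaled, manual.to_numpy()))
```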

**Summary**

1. Save the Preprocessed Data: Save your preprocessed dataset to a new CSV file:

In [13]:
df.to_csv('titanic_preprocessed.csv', index=False)
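
As a final sanity check, the saved file can be read back and compared with the frame in memory. The sketch below uses a small hypothetical frame and a temporary directory so that it does not overwrite your lab output:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the preprocessed frame.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": [1, 0, 0],
    "Age": [-0.5, 1.2, -0.7],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "titanic_preprocessed.csv")
    # index=False keeps the row index out of the file, as in the lab step above.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(toy.columns)
same_shape = reloaded.shape == toy.shape
print(same_columns, same_shape)
```

If `index=False` were omitted, the reloaded frame would gain an extra unnamed column holding the old row index.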

**Lab Questions**

Answer the following questions after completing this lab:

1) What challenges did you encounter while preprocessing the data?
2) How do you think each preprocessing step (handling missing values, encoding, scaling) affects the performance of a machine learning model?
3) Can you think of additional preprocessing steps that might be necessary for different types of datasets?