# Data Preprocessing Lab
by Wilmer Garzón, last updated: 26-June-2025

In this lab, you will work on preprocessing the Titanic dataset. Follow the instructions and complete the tasks.

## 1. Load the Data

Load the Titanic dataset into a pandas DataFrame.

In [38]:
import pandas as pd

# Load the dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
data = pd.read_csv(url)

data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## 2. Data Cleaning

### a. Handle Missing Values
- Identify the missing values in the dataset.
- Fill the NaN values: use the column mean for float columns and the column mode for all other columns.
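
The two fill strategies listed above can be sketched on a small frame first. This is a minimal illustration, not the lab's solution: the `toy` DataFrame and its values are made up for the example, and only the `Age` and `Embarked` column names are borrowed from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})

# Numeric column: fill the gap with the column mean (mean of 20 and 40 is 30).
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Non-numeric column: fill the gap with the most frequent value.
# mode() returns a Series because ties are possible, so take the first entry.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```

Assigning the result of `fillna` back to the column keeps the update explicit and does not rely on in-place modification of an intermediate object.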

In [39]:
# Check for missing values

print("Check for null values:")

print(data.isnull().any())
print(data.isnull().sum())


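# Fill each column that has missing values: mean for float columns, mode otherwise.
# Assigning the result back avoids the chained data[col].fillna(..., inplace=True)
# pattern, which pandas warns will stop working in pandas 3.0.
for col in data.columns:
    if data[col].isnull().any():
        if data[col].dtype == 'float64':
            data[col] = data[col].fillna(data[col].mean())
        else:
            data[col] = data[col].fillna(data[col].mode()[0])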



print("\n\n\nCheck for null values after filling NaN values:")

print(data.isnull().any())
print(data.isnull().sum())


Check for null values:
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64



Check for null values after filling NaN values:
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64


### b. Remove Duplicates

Check for and remove any duplicate rows.

In [45]:
# Check for duplicates
print("Check for duplicates:")
print(data.duplicated().any())


# Remove duplicates
data.drop_duplicates(inplace=True)

print("\n\n\nCheck for duplicates after removing:")
print(data.duplicated().any())

Check for duplicates:
False



Check for duplicates after removing:
False
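
Because the Titanic data contains no duplicate rows, the cell above has nothing to remove. A minimal sketch on a made-up frame shows what `duplicated()` and `drop_duplicates()` do when a repeated row is present. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the second one exactly.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Name": ["A", "B", "B", "C"],
})

# duplicated() marks every repeat of an earlier row; the first copy is not marked.
dup_mask = toy.duplicated()
n_dups = int(dup_mask.sum())

# drop_duplicates() keeps the first copy of each row; reset the index afterwards
# so the row labels are consecutive again.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"duplicate rows found: {n_dups}")
print(deduped)
```

Counting the duplicates with `sum()` rather than only checking `any()` also reports how many rows would be dropped.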



## 3. Data Transformation

### a. Encode Categorical Variables

Convert categorical variables into numerical values using one-hot encoding.

In [46]:
# One-hot encode the categorical columns
data = pd.get_dummies(data, columns=["Sex", "Embarked", "Pclass"])

data.head()


|   | PassengerId | Survived | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.25 | B96 B98 | False | True | False | False | True | False | False | True |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | True | False | True | False | False | True | False | False |
| 2 | 3 | 1 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | B96 B98 | True | False | False | False | True | False | False | True |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | True | False | False | False | True | True | False | False |
| 4 | 5 | 0 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.05 | B96 B98 | False | True | False | False | True | False | False | True |
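
Full one-hot encoding produces one column per category, so `Sex_female` and `Sex_male` carry the same information. As an optional variation (not what the lab cell does), `pd.get_dummies` accepts `drop_first=True` to drop one column per encoded variable and `dtype=int` to emit 0/1 instead of booleans. The sketch below uses a made-up `toy` frame.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 2],
})

# Full one-hot encoding, as in the lab cell: one column per category.
full = pd.get_dummies(toy, columns=["Sex", "Pclass"])

# Reduced encoding: the first category of each variable is dropped,
# and the remaining indicator columns are stored as integers.
reduced = pd.get_dummies(toy, columns=["Sex", "Pclass"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Whether to drop a column depends on the model: linear models without regularization are sensitive to the redundancy, while tree-based models generally are not.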


### b. Normalize Numerical Variables

Standardize the numerical variables Age and Fare so that each has a mean of 0 and a standard deviation of 1.

In [47]:
# Normalize numerical variables
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[['Age', 'Fare']] = scaler.fit_transform(data[['Age', 'Fare']])

data.head()

|   | PassengerId | Survived | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | Braund, Mr. Owen Harris | -0.592481 | 1 | 0 | A/5 21171 | -0.502445 | B96 B98 | False | True | False | False | True | False | False | True |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0.638789 | 1 | 0 | PC 17599 | 0.786845 | C85 | True | False | True | False | False | True | False | False |
| 2 | 3 | 1 | Heikkinen, Miss. Laina | -0.284663 | 0 | 0 | STON/O2. 3101282 | -0.488854 | B96 B98 | True | False | False | False | True | False | False | True |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0.407926 | 1 | 0 | 113803 | 0.42073 | C123 | True | False | False | False | True | True | False | False |
| 4 | 5 | 0 | Allen, Mr. William Henry | 0.407926 | 0 | 0 | 373450 | -0.486337 | B96 B98 | False | True | False | False | True | False | False | True |
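
The transformation applied to each column is the z-score, z = (x - mean) / std, where `StandardScaler` uses the population standard deviation (dividing by n, not n - 1). A standard-library sketch of that arithmetic follows, using only the five Age values shown by the first `data.head()` in section 1. The resulting numbers differ from the lab output because the lab computes the mean and standard deviation over the whole column.

```python
from statistics import fmean, pstdev

# The five raw Age values from the first data.head() output.
ages = [22.0, 38.0, 26.0, 35.0, 35.0]

mu = fmean(ages)      # sample mean of these five values
sigma = pstdev(ages)  # population standard deviation (divides by n)

# z-score: subtract the mean, then divide by the standard deviation.
z = [(a - mu) / sigma for a in ages]

print(f"mean={mu:.2f}, std={sigma:.4f}")
print([round(v, 4) for v in z])
```

After the transformation the values have mean 0 and population standard deviation 1, up to floating-point error.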


## 4. Feature Selection

Select relevant features for the model.

In [51]:
# Select features
features = ['Age', 'Fare', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Pclass_1', 'Pclass_2', 'Pclass_3']
X = data[features]
y = data['Survived']

X.head()

|   | Age | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.592481 | -0.502445 | False | True | False | False | True | False | False | True |
| 1 | 0.638789 | 0.786845 | True | False | True | False | False | True | False | False |
| 2 | -0.284663 | -0.488854 | True | False | False | False | True | False | False | True |
| 3 | 0.407926 | 0.42073 | True | False | False | False | True | True | False | False |
| 4 | 0.407926 | -0.486337 | False | True | False | False | True | False | False | True |


## 5. Split the Data

Split the dataset into training (80%) and testing (20%) sets, with a fixed random seed so that the split is reproducible.

In [52]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
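
In this lab the scaler is fitted on the full dataset before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched below under stated assumptions, is to split first and fit the scaler on the training rows only. The random `X` and alternating `y` are made-up stand-ins for the Titanic features and labels, and `stratify=y` is an optional extra that keeps the class ratio similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 20 rows, 2 numeric features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=30.0, scale=10.0, size=(20, 2))
y = np.array([0, 1] * 10)

# Split first, so the test rows play no part in fitting the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training rows only, then apply the same transform to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the scaled training set has mean 0 and standard deviation 1 per column, while the scaled test set will be close to, but not exactly at, those values.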