# Session 1 Exercise Notebook: Data Observation, Cleaning, and Preprocessing
In this notebook, you will apply the data observation, cleaning, and preprocessing techniques you've learned using the Titanic dataset. Complete each task step by step. Hints are provided where necessary.


## Task 1: Load the Dataset and Basic Exploration
- Load the Titanic dataset into a DataFrame.
- Display the first 5 rows using `.head()`.
- Use `.info()` to understand the structure of the dataset.
- Use `.describe()` to get a statistical summary of the numerical columns.

### Hint:
You can load the dataset using Pandas' `pd.read_csv()` function.
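
As a minimal sketch of the four exploration calls listed above, the example below reads a handful of rows from an in-memory string instead of a file. The rows are copied from the dataset preview; the names `csv_text` and `sample` are illustrative only, and the real notebook reads the full CSV from disk.

```python
import io
import pandas as pd

# A few rows held in memory so the example runs without the CSV file.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
"""

# pd.read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())      # first rows (up to 5)
sample.info()             # column names, non-null counts, dtypes
print(sample.describe())  # summary statistics for the numeric columns
```

Empty fields in the CSV (here, the missing cabins) are read as `NaN`, which is why `.info()` reports fewer non-null values for those columns.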


In [41]:
# Load the Titanic dataset and perform basic exploration
# (Code here)
import pandas as pd
import numpy as np

In [42]:
data = pd.read_csv(r"C:\Users\Lachu\Downloads\Titanic-Dataset - Titanic-Dataset.csv")


In [43]:
data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [11]:
data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [13]:
data.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Task 2: Data Cleaning with Regular Expressions
- Use regular expressions to clean up the 'Name' column.
- Extract titles (e.g., 'Mr.', 'Mrs.', 'Miss') from the names and store them in a new column called `Title`.
- Clean the 'Name' column by removing unnecessary characters or text.

### Hint:
You can use the `re` module to apply regular expressions, and `re.findall()` to extract patterns.
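
To illustrate the `re.findall()` approach from the hint, here is a small sketch on three names taken from the dataset. The pattern is an assumption, not the notebook's solution: it treats the title as the word between the comma and the next period, so it also picks up titles that a fixed list would miss.

```python
import re

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
]

# Assumed pattern: the title is the run of letters between ", " and the next "."
title_pattern = re.compile(r",\s*([A-Za-z]+)\.")

titles = []
for name in names:
    matches = title_pattern.findall(name)  # list of captured groups
    titles.append(matches[0] if matches else None)

print(titles)  # ['Mr', 'Miss', 'Rev']
```

The same pattern can be passed to `Series.str.extract()` to apply it to a whole column at once.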


In [44]:
# Clean the 'Name' column and extract titles using regular expressions
# (Code here)
import re
# Extract titles and store in a new column called 'Title'
data['Title'] = data['Name'].str.extract(r'\b(Mr|Mrs|Miss|Ms|Dr|Prof)\b', expand=False)

# Remove the extracted titles (and a trailing '.' or ',') from the 'Name' column.
# Titles outside this list, such as 'Rev', stay in the name and get no 'Title' value.
data['Name'] = data['Name'].str.replace(r'\b(Mr|Mrs|Miss|Ms|Dr|Prof)\b\.?,?', '', regex=True)
# Remove leading and trailing whitespace (spaces inside the name are not affected)
data['Name'] = data['Name'].str.strip()

# Display the result
data


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,"Braund, Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,Mr
1,2,1,1,"Cumings, John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
2,3,1,3,"Heikkinen, Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss
3,4,1,1,"Futrelle, Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Mrs
4,5,0,3,"Allen, William Henry",male,35.0,0,0,373450,8.0500,,S,Mr
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,
887,888,1,1,"Graham, Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Miss
888,889,0,3,"Johnston, Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,Miss
889,890,1,1,"Behr, Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Mr


## Task 3: Handling Missing Values
- Check for missing values in the dataset using `.isnull().sum()`.
- Fill the missing values in the 'Age' column using the median of the 'Age' column.
- Drop rows where 'Embarked' is missing.

### Hint:
Use `.fillna()` to fill missing values and `.dropna()` to remove rows with missing data.
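
The following sketch shows the two calls from the hint on a toy DataFrame with made-up values (the frame `toy` is hypothetical, not part of the Titanic data). Note that `.fillna()` and `.dropna()` return new objects, so the result has to be assigned back.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column before cleaning.
missing_before = toy.isnull().sum()

# Fill missing ages with the median of the observed ages, assigning the result back.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Drop rows where 'Embarked' is missing; .copy() gives an independent DataFrame.
toy = toy.dropna(subset=["Embarked"]).copy()

print(missing_before)
print(toy)
```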


In [None]:
# Handle missing values in the dataset
# (Code here)



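# Fill missing 'Age' values with the column median and assign the result back
data['Age'] = data['Age'].fillna(data['Age'].median())
# Count the remaining missing values per column
data.isnull().sum()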

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
Title           56
dtype: int64

In [50]:
data = data.dropna(subset=['Embarked'])

## Task 4: Filtering Data
- Filter the dataset to show only passengers who:
  1. Are male
  2. Are over 30 years old
  3. Paid a fare greater than $50
- Display the filtered DataFrame.

### Hint:
Use the Pandas `.loc[]` method to filter the DataFrame based on multiple conditions.
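
As a sketch of filtering on several conditions, the example below builds one boolean mask per condition on a toy DataFrame (made-up values) and combines them with `&`. When the comparisons are written inline instead, each one needs its own parentheses, because `&` binds more tightly than `==` and `>`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [45.0, 38.0, 22.0, 54.0],
    "Fare": [83.5, 71.3, 7.25, 26.0],
})

# One boolean Series per condition keeps the filter readable.
is_male = toy["Sex"] == "male"
over_30 = toy["Age"] > 30
high_fare = toy["Fare"] > 50

# A row is kept only if all three masks are True for it.
selected = toy.loc[is_male & over_30 & high_fare]
print(selected)
```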


In [56]:
# Filter the data based on specific conditions
# (Code here)
filtered_df = data.loc[(data['Sex'] == 'male') & (data['Age'] > 30) & (data['Fare'] > 50)]
# Display the filtered DataFrame
filtered_df


## Task 5: Advanced Data Cleaning Scenario
- Assume the 'Cabin' column has some inconsistencies (e.g., extra spaces, missing values, or incorrect formatting).
- Clean the 'Cabin' column by removing extra spaces, and if the value is missing, fill it with 'Unknown'.

### Hint:
You can use `.str.strip()` to remove spaces and `.fillna()` to handle missing values.
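
A minimal sketch of the two steps from the hint, using a toy Series with made-up cabin values. Calling `.str.strip()` with no argument removes leading and trailing whitespace, and missing entries stay missing until `.fillna()` replaces them.

```python
import numpy as np
import pandas as pd

cabins = pd.Series(["  C85", "C123  ", np.nan, " B42 "], name="Cabin")

# Strip surrounding whitespace first, then replace missing values.
cleaned = cabins.str.strip().fillna("Unknown")

print(cleaned.tolist())  # ['C85', 'C123', 'Unknown', 'B42']
```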


In [60]:
# Clean the 'Cabin' column
# (Code here)
# Call strip() with no argument to remove surrounding whitespace; strip('') removes nothing
data['Cabin'] = data['Cabin'].str.strip()
data['Cabin'] = data['Cabin'].fillna('Unknown')
data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Cabin'] = data['Cabin'].str.strip('')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Cabin'] = data['Cabin'].fillna('Unknown')


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,"Braund, Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,Unknown,S,Mr
1,2,1,1,"Cumings, John Bradley (Florence Briggs Thayer)",female,38.000000,1,0,PC 17599,71.2833,C85,C,Mrs
2,3,1,3,"Heikkinen, Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,Unknown,S,Miss
3,4,1,1,"Futrelle, Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,53.1000,C123,S,Mrs
4,5,0,3,"Allen, William Henry",male,35.000000,0,0,373450,8.0500,Unknown,S,Mr
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.000000,0,0,211536,13.0000,Unknown,S,
887,888,1,1,"Graham, Margaret Edith",female,19.000000,0,0,112053,30.0000,B42,S,Miss
888,889,0,3,"Johnston, Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.4500,Unknown,S,Miss
889,890,1,1,"Behr, Karl Howell",male,26.000000,0,0,111369,30.0000,C148,C,Mr
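
The "A value is trying to be set on a copy of a slice" message above is pandas' SettingWithCopyWarning. It most likely appears here because `data` was produced by an earlier `dropna()` call, so pandas cannot tell whether the assignment is meant to affect the original frame. One common way to avoid it, sketched below on a toy DataFrame with made-up values, is to take an explicit `.copy()` when creating the subset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Cabin": [" C85 ", np.nan, "B42"],
    "Embarked": ["C", "S", None],
})

# .copy() makes the subset an independent DataFrame, so later column
# assignments are unambiguous and do not touch `toy`.
subset = toy.dropna(subset=["Embarked"]).copy()
subset["Cabin"] = subset["Cabin"].str.strip().fillna("Unknown")

print(subset)
```

The original `toy` frame keeps its untouched values, which is the behavior the warning is asking you to make explicit.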


In [61]:
print(data.iloc[390])

PassengerId                    392
Survived                         1
Pclass                           3
Name           Jansson,  Carl Olof
Sex                           male
Age                           21.0
SibSp                            0
Parch                            0
Ticket                      350034
Fare                        7.7958
Cabin                      Unknown
Embarked                         S
Title                           Mr
Name: 391, dtype: object


## Task 6: Handling Multiple Missing Columns
- Handle missing data in the 'Age' and 'Cabin' columns differently:
  - For 'Age', fill the missing values with the median.
  - For 'Cabin', fill missing values with 'Unknown'.
- Display the updated DataFrame to confirm your changes.


In [None]:
# Handle multiple missing columns
# (Code here)
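
One possible approach to this task, shown as a sketch on a toy DataFrame with made-up values rather than as the reference solution: `DataFrame.fillna()` accepts a dictionary that maps each column name to its own fill value, so both columns can be handled in a single call. The name `fill_values` is illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": [None, "C85", None],
})

# Each column gets its own fill value: the median for 'Age', a label for 'Cabin'.
fill_values = {"Age": toy["Age"].median(), "Cabin": "Unknown"}
toy = toy.fillna(value=fill_values)

print(toy)
print(toy.isnull().sum())  # both columns should now report 0 missing values
```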