# Session 1 Exercise Notebook: Data Observation, Cleaning, and Preprocessing
In this notebook, you will apply the data observation, cleaning, and preprocessing techniques you've learned to the Titanic dataset. Complete the tasks in order; hints are provided where necessary.


## Task 1: Load the Dataset and Basic Exploration
- Load the Titanic dataset into a DataFrame.
- Display the first 5 rows using `.head()`.
- Use `.info()` to understand the structure of the dataset.
- Use `.describe()` to get a statistical summary of the numerical columns.

### Hint:
You can load the dataset using Pandas' `pd.read_csv()` function.
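
As a minimal sketch of the hint, the block below runs the same four steps on a three-row stand-in for the CSV file, so it works without the file on disk. The rows are copied from the output shown in this notebook; the names `csv_text` and `sample` are illustrative only. When pointing at a real file on Windows, a raw string (prefix `r`) or forward slashes keeps backslashes in the path from being read as escape sequences.

```python
import io
import pandas as pd

# Three rows copied from this notebook's output; a stand-in for the real file.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())       # first rows (at most 5)
sample.info()              # prints dtypes and non-null counts itself; returns None
print(sample.describe())   # summary statistics for the numeric columns
```

Empty fields such as the missing cabins are read as `NaN`, which is what the later missing-value tasks operate on.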


In [8]:
import pandas as pd

# Load the Titanic dataset into a DataFrame
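# A raw string (r'...') keeps the backslashes in the Windows path from being read as escape sequences
df = pd.read_csv(r'D:\Komal\REDI School\Titanic-Dataset.csv')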

# Display the first 5 rows
print(df.head())

# Understand the structure of the dataset
df.info()  # info() prints its report directly and returns None, so no print() is needed

# Get a statistical summary of the numerical columns
print(df.describe())



   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

## Task 2: Data Cleaning with Regular Expressions
- Use regular expressions to clean up the 'Name' column.
- Extract titles (e.g., 'Mr.', 'Mrs.', 'Miss') from the names and store them in a new column called `Title`.
- Clean the 'Name' column by removing unnecessary characters or text.

### Hint:
You can use the `re` module to apply regular expressions, and `re.findall()` to extract patterns.


In [9]:

import re


# Function to extract titles from names
def extract_title(name):
    title_search = re.findall(r'(\bMr\b|\bMrs\b|\bMiss\b|\bMaster\b|\bDr\b|\bRev\b)', name)
    if title_search:
        return title_search[0]
    return None

# Apply the function to create a new 'Title' column
df['Title'] = df['Name'].apply(extract_title)

# Function to clean names
def clean_name(name):
    # Remove titles
    name = re.sub(r'(\bMr\b|\bMrs\b|\bMiss\b|\bMaster\b|\bDr\b|\bRev\b)', '', name)
    # Remove unnecessary characters (e.g., '.', ',', '()')
    name = re.sub(r'[^\w\s]', '', name)
    return name.strip()

# Apply the function to clean the 'Name' column
df['Name'] = df['Name'].apply(clean_name)

# Display the first few rows of the modified DataFrame
print(df[['Title', 'Name']].head())


  Title                                          Name
0    Mr                           Braund  Owen Harris
1   Mrs  Cumings  John Bradley Florence Briggs Thayer
2  Miss                              Heikkinen  Laina
3   Mrs         Futrelle  Jacques Heath Lily May Peel
4    Mr                          Allen  William Henry
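
The fixed list of six titles in `extract_title` leaves some passengers without a match, which is why the missing-value check in Task 3 reports 11 rows with no `Title`. One possible alternative, sketched below, captures whatever sits between the comma and the first period. It assumes every name follows the "Surname, Title. Given names" layout seen in the rows above; the variable names are illustrative.

```python
import pandas as pd

# Names as they appear in the raw dataset, before any cleaning.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the text between the comma and the first period, e.g. "Mr" or "Miss".
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

Because the pattern does not enumerate titles, it also picks up rarer ones without changes to the code, at the cost of trusting the name layout.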


## Task 3: Handling Missing Values
- Check for missing values in the dataset using `.isnull().sum()`.
- Fill the missing values in the 'Age' column using the median of the 'Age' column.
- Drop rows where 'Embarked' is missing.

### Hint:
Use `.fillna()` to fill missing values and `.dropna()` to remove rows with missing data.
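
A small sketch of both calls on a made-up four-row frame (the `toy` DataFrame and its values are illustrative, not taken from the dataset). Assigning the result back avoids calling `inplace=True` on a selected column, a pattern that recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Toy data with one missing age and one missing port of embarkation.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Fill missing ages with the median of the known ages (26.0 here) and assign back.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Drop only the rows where 'Embarked' is missing.
toy = toy.dropna(subset=["Embarked"])

print(toy)
print(toy.isnull().sum())
```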


In [10]:


# Check for missing values in the dataset
missing_values = df.isnull().sum()
print("Missing values before filling and dropping:")
print(missing_values)

# Fill the missing values in the 'Age' column using the median of the 'Age' column
# Assign the result back; inplace=True on a selected column is deprecated
df['Age'] = df['Age'].fillna(df['Age'].median())

# Drop rows where 'Embarked' is missing
df.dropna(subset=['Embarked'], inplace=True)

# Verify the changes
missing_values_after = df.isnull().sum()
print("\nMissing values after filling and dropping:")
print(missing_values_after)

# Display the first few rows to confirm changes
print("\nFirst few rows of the modified DataFrame:")
print(df.head())


Missing values before filling and dropping:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Title           11
dtype: int64

Missing values after filling and dropping:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
Title           11
dtype: int64

First few rows of the modified DataFrame:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                           Name     Sex   Age  SibSp  Parch  \
0                           Braund  Owen Harris    ma



## Task 4: Filtering Data
- Filter the dataset to show only passengers who:
  1. Are male
  2. Are over 30 years old
  3. Paid a fare greater than $50
- Display the filtered DataFrame.

### Hint:
Use the pandas `.loc[]` indexer with a boolean mask. Combine the conditions with `&` and wrap each one in parentheses.
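
A minimal sketch of the three-condition filter on made-up rows (the `toy` frame and its values are illustrative). Each comparison needs its own parentheses because `&` binds more tightly than `==` and `>`.

```python
import pandas as pd

# Toy rows: only the first one satisfies all three conditions.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [54.0, 38.0, 22.0, 35.0],
    "Fare": [60.0, 71.28, 7.25, 8.05],
})

# Build the boolean mask first, then select the matching rows.
mask = (toy["Sex"] == "male") & (toy["Age"] > 30) & (toy["Fare"] > 50)
filtered = toy.loc[mask]
print(filtered)
```

Naming the mask separately makes it easy to inspect with `mask.sum()` before filtering.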


In [12]:


# Filter the dataset using the .loc[] method
filtered_df = df.loc[(df['Sex'] == 'male') & 
                     (df['Age'] > 30) & 
                     (df['Fare'] > 50)]

# Display the filtered DataFrame
print(filtered_df)


     PassengerId  Survived  Pclass                                  Name  \
6              7         0       1                   McCarthy  Timothy J   
35            36         0       1            Holverson  Alexander Oskar   
54            55         0       1            Ostby  Engelhart Cornelius   
62            63         0       1               Harris  Henry Birkhardt   
74            75         1       3                             Bing  Lee   
92            93         0       1               Chaffee  Herbert Fuller   
110          111         0       1            Porter  Walter Chamberlain   
124          125         0       1               White  Percival Wayland   
137          138         0       1               Futrelle  Jacques Heath   
155          156         0       1               Williams  Charles Duane   
224          225         1       1              Hoyt  Frederick Maxfield   
245          246         0       1               Minahan  William Edward   
248         

## Task 5: Advanced Data Cleaning Scenario
- Assume the 'Cabin' column has some inconsistencies (e.g., extra spaces, missing values, or incorrect formatting).
- Clean the 'Cabin' column by removing extra spaces, and if the value is missing, fill it with 'Unknown'.

### Hint:
You can use `.str.strip()` to remove spaces and `.fillna()` to handle missing values.
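
The sketch below chains the two calls from the hint on a made-up Series (values are illustrative). It also collapses runs of internal whitespace with `.str.replace()`, an extra step that the task does not require but that fits multi-cabin entries such as 'C23 C25 C27'.

```python
import numpy as np
import pandas as pd

# Toy cabin values with stray spaces and one missing entry.
cabin = pd.Series(["  C85", "C123 ", np.nan, "C23  C25   C27"])

cleaned = (
    cabin.str.strip()                           # trim leading/trailing spaces; NaN stays NaN
         .str.replace(r"\s+", " ", regex=True)  # collapse internal whitespace runs
         .fillna("Unknown")                     # label missing cabins
)
print(cleaned.tolist())  # ['C85', 'C123', 'Unknown', 'C23 C25 C27']
```

Filling after the string operations keeps the placeholder from being touched by them.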


In [13]:


# Clean the 'Cabin' column
# Remove extra spaces
df['Cabin'] = df['Cabin'].str.strip()

# Fill missing values with 'Unknown'
df['Cabin'] = df['Cabin'].fillna('Unknown')

# Verify the changes
print(df['Cabin'].unique())


['Unknown' 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78'
 'D33' 'B30' 'C52' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110'
 'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49'
 'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77'
 'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106'
 'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91'
 'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34'
 'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79'
 'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68'
 'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58'
 'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90'
 'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6'
 'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50'
 'B42' 'C148']




## Task 6: Handling Multiple Missing Columns
- Handle missing data in the 'Age' and 'Cabin' columns differently:
  - For 'Age', fill the missing values with the median.
  - For 'Cabin', fill missing values with 'Unknown'.
- Display the updated DataFrame to confirm your changes.
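
Both columns can also be handled in a single call by passing `.fillna()` a dictionary that maps each column to its own fill value. This is a sketch on made-up data; the `toy` frame is illustrative.

```python
import numpy as np
import pandas as pd

# Toy data: one missing age, two missing cabins.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
})

# One fill value per column: the median for 'Age' (24.0 here), a label for 'Cabin'.
toy = toy.fillna({"Age": toy["Age"].median(), "Cabin": "Unknown"})
print(toy)
```

Columns not named in the dictionary are left untouched.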


In [14]:


# Handle missing data in the 'Age' column by filling with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Handle missing data in the 'Cabin' column by filling with 'Unknown'
df['Cabin'] = df['Cabin'].fillna('Unknown')

# Display the updated DataFrame to confirm the changes
print(df.head())


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                           Name     Sex   Age  SibSp  Parch  \
0                           Braund  Owen Harris    male  22.0      1      0   
1  Cumings  John Bradley Florence Briggs Thayer  female  38.0      1      0   
2                              Heikkinen  Laina  female  26.0      0      0   
3         Futrelle  Jacques Heath Lily May Peel  female  35.0      1      0   
4                          Allen  William Henry    male  35.0      0      0   

             Ticket     Fare    Cabin Embarked Title  
0         A/5 21171   7.2500  Unknown        S    Mr  
1          PC 17599  71.2833      C85        C   Mrs  
2  STON/O2. 3101282   7.9250  Unknown        S  Miss  
3            113803  53.1000     C123        S   Mrs  
4            373450   8.0500  Unk

