Load a Dataset, Identify Missing Values, Handle Them, and Remove Outliers

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load inbuilt dataset (Titanic)
df = sns.load_dataset("titanic")

# Identify missing values
print("Missing Values:\n", df.isnull().sum())

# Handle missing values:
# - Fill numerical columns with the median
# - Fill categorical columns with the mode
for col in df.select_dtypes(include=['number']).columns:
    # Assign back instead of chained inplace=True, which acts on a copy
    df[col] = df[col].fillna(df[col].median())

# Include 'category' so that categorical-dtype columns such as deck are filled too
for col in df.select_dtypes(include=['object', 'category']).columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# Detect and remove outliers using IQR method
num = df.select_dtypes(include=['number'])
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1

# Drop every row with a numeric value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
is_outlier = (num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))
df_cleaned = df[~is_outlier.any(axis=1)]

print("\nShape before outlier removal:", df.shape)
print("Shape after outlier removal:", df_cleaned.shape)


Missing Values:
 survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

Shape before outlier removal: (891, 15)
Shape after outlier removal: (577, 15)
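
The IQR filter drops a row if any numeric column flags it, so one column can account for most of the removed rows. The sketch below counts flagged values per column to show where the losses come from. The helper name `iqr_outlier_counts` and the toy data are hypothetical and are not part of the notebook above.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = frame.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    outside = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    return outside.sum()


# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "flag": [0, 0, 0, 0, 0, 0, 0, 1],             # mostly one value, so IQR is 0
    "value": [10, 11, 12, 13, 14, 15, 16, 100],   # one extreme value
})

counts = iqr_outlier_counts(toy)
print(counts)
```

When a column is dominated by a single value, its IQR is 0 and every other value is flagged. For count-like or code-like columns it may be reasonable to exclude them from the filter, although that is a judgment call the notebook does not make.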


Note: calling fillna(..., inplace=True) on a column selected with df[col] operates on an intermediate object that always behaves as a copy. pandas emits a FutureWarning for this pattern, and under the pandas 3.0 behavior the call will never modify df.

To fill the original DataFrame, either assign the result back with df[col] = df[col].fillna(value), or call df.fillna({col: value}, inplace=True).
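
The two non-chained forms of fillna can be sketched on a small frame as follows. The toy DataFrame and its column names are hypothetical and are used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one text column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 40.0],
    "port": ["S", "C", None, "S"],
})

# Option 1: assign the filled column back to the frame.
toy["age"] = toy["age"].fillna(toy["age"].median())

# Option 2: pass a {column: value} mapping to DataFrame.fillna.
toy = toy.fillna({"port": toy["port"].mode()[0]})

print(toy)
```

Both options act on the frame itself rather than on a temporary column object, so they do not rely on chained assignment.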
