**Task 4 - CodingAlpha Internship Programme**

**Topic: Data Preprocessing**

**Introduction to Preprocessing Task**

Data preprocessing is a crucial step in the data science workflow. It prepares raw data for analysis by handling missing values, treating outliers, and converting the data into a format suitable for modeling. The goal is to improve data quality so that machine learning models trained on the data are robust and reliable.


**Why the Titanic Dataset?**

The Titanic dataset is a popular dataset in data science and machine learning due to its simplicity and the variety of preprocessing challenges it presents. It includes:  

1. Categorical Features: Such as 'sex', 'embarked', and 'class', which need to be converted into numerical format.
2. Numerical Features: Such as 'age' and 'fare', which may contain missing values and outliers.
3. Missing Values: Some records have missing values that need to be handled appropriately.
4. Outliers: The dataset contains outliers that can skew analysis and model performance.
5. Target Variable: The 'survived' column makes it ideal for classification problems.

These characteristics make the Titanic dataset a good choice for demonstrating data preprocessing techniques.

**Data Preprocessing Steps**

1. Loading the Dataset:

The dataset is loaded into a pandas DataFrame to begin the preprocessing tasks.

2. Handling Missing Values:

Numerical Features: Missing values in the 'age' column are replaced with the column mean.
Categorical Features: Missing values in the 'embarked' column are replaced with the most frequent value.

3. Detecting and Handling Outliers:

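Outliers are identified using the Interquartile Range (IQR) method: a value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR in any numerical column marks its row as an outlier, and such rows are removed from the dataset.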

4. Normalizing/Scaling Features:

The 'age' and 'fare' columns are standardized using StandardScaler so that each has a mean of 0 and a standard deviation of 1.

5. Converting Categorical Variables:

Categorical variables are converted into dummy/indicator variables using one-hot encoding to prepare them for machine learning models.

6. Splitting the Dataset:

The dataset is split into training and testing sets to evaluate the model's performance on unseen data.


**Implementing the code**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import seaborn as sns

In [19]:
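# Load the Titanic dataset bundled with seaborn
df = sns.load_dataset('titanic')
print("Original Dataset:")
print(df.head())
print("Dataset Info:")
df.info()  # info() prints its own summary and returns None, so no print() wrapper is needed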

Original Dataset:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----

In [20]:
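# Handling missing values
# Numerical column: replace missing ages with the column mean
imputer = SimpleImputer(strategy='mean')
df['age'] = imputer.fit_transform(df[['age']]).ravel()

# Categorical column: replace missing ports with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
df['embarked'] = imputer.fit_transform(df[['embarked']]).ravel()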

print("Dataset after handling missing values:")
print(df.head())
print("Missing values after imputation:")
print(df.isnull().sum())

Dataset after handling missing values:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  
Missing values after imputation:
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked      
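To make the two imputation strategies concrete, here is a minimal pandas-only sketch on a hypothetical four-row frame (not the Titanic data). It uses `fillna` rather than `SimpleImputer`, so it illustrates the idea and is not the exact code used in the cell above.

```python
import pandas as pd

# Hypothetical toy frame, not the Titanic data.
toy = pd.DataFrame({
    "age": [20.0, None, 40.0, 30.0],
    "embarked": ["S", "C", None, "S"],
})

# Mean imputation for the numeric column: (20 + 40 + 30) / 3 = 30.0
age_mean = toy["age"].mean()
toy["age"] = toy["age"].fillna(age_mean)

# Most-frequent imputation for the categorical column: "S" appears twice
embarked_mode = toy["embarked"].mode()[0]
toy["embarked"] = toy["embarked"].fillna(embarked_mode)

print(toy)
```

Mean imputation keeps the column average unchanged but shrinks its variance, which is worth keeping in mind when many values are missing.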

In [21]:
# Select numerical columns for outlier detection
# (this also picks up discrete columns such as 'survived', 'pclass', 'sibsp' and 'parch')
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Detect and handle outliers using IQR for numerical columns only
Q1 = df[numerical_cols].quantile(0.25)
Q3 = df[numerical_cols].quantile(0.75)
IQR = Q3 - Q1

In [22]:
# Keep only rows whose numerical values all lie within the 1.5 * IQR fences
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
is_outlier = (df[numerical_cols] < lower_bound) | (df[numerical_cols] > upper_bound)
outlier_condition = ~is_outlier.any(axis=1)
df_filtered = df[outlier_condition]

print("Dataset after removing outliers:")
print(df_filtered.head())
print(f"Number of rows after removing outliers: {df_filtered.shape[0]}")

Dataset after removing outliers:
   survived  pclass     sex        age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.000000      1      0   7.2500        S  Third   
2         1       3  female  26.000000      0      0   7.9250        S  Third   
3         1       1  female  35.000000      1      0  53.1000        S  First   
4         0       3    male  35.000000      0      0   8.0500        S  Third   
5         0       3    male  29.699118      0      0   8.4583        Q  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  
5    man        True  NaN   Queenstown    no   True  
Number of rows after removing outliers: 577
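The filter drops roughly a third of the rows, so it helps to see which columns do the flagging. The sketch below uses a hypothetical ten-row frame (not the Titanic data) to show one likely cause: when a discrete column is dominated by a single value, its IQR is 0 and every other value falls outside the fences. The column names are borrowed only for illustration.

```python
import pandas as pd

# Hypothetical toy frame, not the Titanic data.
toy = pd.DataFrame({
    "fare":  [500.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    "parch": [0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Boolean frame: True where a value falls outside the 1.5 * IQR fences
flags = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

print(iqr)          # per-column IQR; 'parch' has an IQR of 0
print(flags.sum())  # per-column count of flagged values

# Rows that survive the same any(axis=1) filter used in the notebook
kept = toy[~flags.any(axis=1)]
print(len(kept))
```

Running `flags.sum()` on the real frame would show how many rows each column removes; restricting the filter to continuous columns such as 'age' and 'fare' is one option if discrete columns turn out to dominate.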


In [23]:
# Scaling features
scaler = StandardScaler()
# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
df_filtered = df_filtered.copy()
df_filtered[['age', 'fare']] = scaler.fit_transform(df_filtered[['age', 'fare']])

print("Dataset after normalization/scaling:")
print(df_filtered[['age', 'fare']].head())

Dataset after normalization/scaling:
        age      fare
0 -0.909802 -0.609448
2 -0.439745 -0.555858
3  0.617882  3.030715
4  0.617882 -0.545934
5 -0.005046 -0.513517
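Standardization is the z-score transform, z = (x - mean) / std. The sketch below reproduces it with NumPy on a handful of hypothetical values (not the Titanic data), using the population standard deviation (`ddof=0`), which is the convention StandardScaler follows.

```python
import numpy as np

# Hypothetical toy values, not the Titanic data.
fare = np.array([7.25, 7.925, 53.1, 8.05, 8.4583])

# z = (x - mean) / std, with the population standard deviation (ddof=0)
z = (fare - fare.mean()) / fare.std(ddof=0)

print(z.round(3))
# After scaling, the mean is ~0 and the population std is ~1
print(round(float(z.mean()), 6), round(float(z.std(ddof=0)), 6))
```

Note that pandas' `Series.std()` defaults to `ddof=1`, so checking the scaled column with it will give a value slightly above 1.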


In [24]:
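# Convert categorical variables to dummy variables
# Note: 'alive' is a text copy of the target 'survived' and stays in X here;
# drop it before fitting a real model to avoid target leakage.
X = df_filtered.drop('survived', axis=1)
y = df_filtered['survived']
X = pd.get_dummies(X, drop_first=True)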

print("Dataset after converting categorical variables to dummy variables:")
print(X.head())

Dataset after converting categorical variables to dummy variables:
   pclass       age  sibsp  parch      fare  adult_male  alone  sex_male  \
0       3 -0.909802      1      0 -0.609448        True  False         1   
2       3 -0.439745      0      0 -0.555858       False   True         0   
3       1  0.617882      1      0  3.030715       False  False         0   
4       3  0.617882      0      0 -0.545934        True   True         1   
5       3 -0.005046      0      0 -0.513517        True   True         1   

   embarked_Q  embarked_S  ...  who_woman  deck_B  deck_C  deck_D  deck_E  \
0           0           1  ...          0       0       0       0       0   
2           0           1  ...          1       0       0       0       0   
3           0           1  ...          1       0       1       0       0   
4           0           1  ...          0       0       0       0       0   
5           1           0  ...          0       0       0       0       0   

   deck_F  de
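The effect of `drop_first=True` is easier to see on a small frame. This sketch uses a hypothetical three-row frame (not the Titanic data) and compares the encoded columns with and without the option.

```python
import pandas as pd

# Hypothetical toy frame, not the Titanic data.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
    "fare": [7.25, 71.28, 8.46],
})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each feature dropped

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a feature with k categories yields k - 1 columns; the dropped category ('female', 'C') is implied when all remaining indicators are 0. This avoids perfectly collinear columns, which matters mainly for linear models.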

In [25]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting datasets
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

Training set shape: (461, 23)
Test set shape: (116, 23)
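The 461/116 split follows from simple arithmetic on the 577 remaining rows. The sketch below assumes scikit-learn's rounding rule for a fractional `test_size` (test count rounded up, remainder to training); treat that rule as an assumption to check against the installed version.

```python
import math

n_samples = 577   # rows left after outlier removal, per the earlier output
test_size = 0.2

# Assumed rule: the test count is rounded up and the
# remaining rows go to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 461 116
```

Because 'survived' is a binary target, passing `stratify=y` to `train_test_split` is a common way to keep the class ratio similar in both sets; the notebook does not do this, so the two sets may differ slightly in survival rate.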
