<a href="https://colab.research.google.com/github/Kulpreet-prog/NIELIT-FSK-PRIME-April21/blob/main/task_2__data_preprocessing_ai_bootcamp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 2: Data Preprocessing for Machine Learning – AI Bootcamp

Download the Titanic dataset here: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

#### About this file

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

## Section 1: Data Loading & Exploration

### **Task 1**: Load and Inspect a Dataset

*Instruction*: Load the `titanic.csv` dataset and display the first 5 rows. Show basic info and describe statistics of the dataset.

In [3]:
from google.colab import files
uploaded = files.upload()

import pandas as pd

df = pd.read_csv('titanic.csv')
print("File successfully loaded!")
df.head()





Saving titanic.csv to titanic.csv
File successfully loaded!


|   | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.05 |
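
The instruction also asks for basic info and summary statistics, which the cell above does not print. A minimal sketch of those two calls follows. The frame `sample` is a hypothetical stand-in built from the first three preview rows (not the full dataset), so the numbers it prints are illustrative only.

```python
import io

import pandas as pd

# Hypothetical stand-in for the loaded dataset (first three preview rows, fewer columns)
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# info() reports column names, non-null counts, and dtypes;
# writing to a buffer lets us keep the text as a string as well as print it
buffer = io.StringIO()
sample.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# describe() summarizes the numeric columns (count, mean, std, min, quartiles, max)
summary = sample.describe()
print(summary)
```

On the real data the same calls would be `df.info()` and `df.describe()`.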


## Section 2: Handling Missing Values

### **Task 2**: Identify and Handle Missing Data

*Instruction*:



*   Display the number of missing values per column.
*   Fill missing `Age` values with the median.
*   Drop the second row in the dataset.



In [4]:
import pandas as pd

# Step 1: Load the dataset
df = pd.read_csv('titanic.csv')

# Step 2: Display number of missing values per column
print("Missing values per column:\n")
print(df.isnull().sum())

# Step 3: Fill missing 'Age' values with the median age
median_age = df['Age'].median()
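# Assign the result back: fillna(..., inplace=True) on a selected column
# raises a FutureWarning and will not modify df in pandas 3.0
df['Age'] = df['Age'].fillna(median_age)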

# Step 4: Drop the second row (index 1 since indexing starts from 0)
df.drop(index=1, inplace=True)

# Optional: Display a few rows to confirm changes
print("\nPreview after handling missing data:\n")
print(df.head())




Missing values per column:

Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64

Preview after handling missing data:

   Survived  Pclass                                         Name     Sex  \
0         0       3                       Mr. Owen Harris Braund    male   
2         1       3                        Miss. Laina Heikkinen  female   
3         1       1  Mrs. Jacques Heath (Lily May Peel) Futrelle  female   
4         0       3                      Mr. William Henry Allen    male   
5         0       3                              Mr. James Moran    male   

    Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0  22.0                        1                        0   7.2500  
2  26.0                        0                        0   7.9250  
3  35.0                 

Note: calling `fillna(..., inplace=True)` on a selected column, as in `df['Age'].fillna(median_age, inplace=True)`, makes pandas emit a warning that reads in part:

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
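
The missing-value counts printed above are all zero, so the median fill has no visible effect on this file. The sketch below repeats the same three steps on a hypothetical toy frame (`toy`, with made-up names and one missing age) so the effect of each step can be seen. It uses the assign-back form of `fillna`, which is one of the alternatives the pandas warning suggests.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing age (not taken from the dataset)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 26.0, 35.0],
})

# Count missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# median() skips NaN, so it is computed from the three observed ages
median_age = toy["Age"].median()

# Assign the filled column back instead of using inplace=True
toy["Age"] = toy["Age"].fillna(median_age)

# Drop the second row by position, whatever its index label is
toy = toy.drop(index=toy.index[1])
print(toy)
```

Dropping by `toy.index[1]` rather than a hard-coded label is a design choice for the sketch: it still removes the second row if the index does not start at 0.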


## Section 3: Encoding Categorical Features

### **Task 3**: Convert Categorical to Numeric

*Instruction*: Convert `Sex` and `Pclass` columns to numeric using:


*   Label Encoding for `Sex`
*   One-Hot Encoding for `Pclass`



In [5]:

# Save the cleaned DataFrame to a new CSV file
df.to_csv('cleaned_titanic.csv', index=False)

print("File saved as cleaned_titanic.csv")


File saved as cleaned_titanic.csv


In [6]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the cleaned Titanic file
df = pd.read_csv('cleaned_titanic.csv')

# Create LabelEncoder object
le = LabelEncoder()

# Apply Label Encoding on 'Sex' column
df['Sex'] = le.fit_transform(df['Sex'])

# Apply One-Hot Encoding on 'Pclass' column (only once)
df = pd.get_dummies(df, columns=['Pclass'], prefix='Pclass')
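
The cell above prints nothing, so here is a small sketch of what the two encodings produce. It runs on a hypothetical four-row frame (`toy`), not on the dataset. `LabelEncoder` assigns integer codes to the sorted class labels, and `pd.get_dummies` replaces `Pclass` with one indicator column per class.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame (values chosen for illustration)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 2, 3],
})

# Label encoding: classes are sorted, so 'female' -> 0 and 'male' -> 1
le = LabelEncoder()
toy["Sex"] = le.fit_transform(toy["Sex"])
sex_mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(sex_mapping)

# One-hot encoding: Pclass becomes Pclass_1, Pclass_2, Pclass_3
toy = pd.get_dummies(toy, columns=["Pclass"], prefix="Pclass")
print(toy)
```

Keeping `sex_mapping` around makes it possible to read the encoded column back later.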



## Section 4: Feature Scaling

### **Task 4**: Scale Numerical Features

*Instruction*: Use `StandardScaler` to scale the `Age` and `Fare` columns.

In [7]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Step 1: Load the cleaned dataset (the encodings from Task 3 were not saved to this file)
df = pd.read_csv('cleaned_titanic.csv')

# Step 2: Create the scaler
scaler = StandardScaler()

# Step 3: Apply scaler to 'Age' and 'Fare'
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

# Step 4: Check the result
print(df[['Age', 'Fare']].head())




        Age      Fare
0 -0.528495 -0.502593
1 -0.245189 -0.489029
2  0.392250  0.418741
3  0.392250 -0.486517
4 -0.174362 -0.478313
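
`StandardScaler` standardizes each column as z = (x - mean) / std, using the population standard deviation (`ddof=0`). The sketch below checks this against a manual NumPy computation. The `ages` array is a hypothetical example, not the dataset column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column input (one feature, four samples)
ages = np.array([[22.0], [26.0], [35.0], [35.0]])

# Scaling with scikit-learn
scaled = StandardScaler().fit_transform(ages)

# Same computation by hand; np.std uses ddof=0 by default
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and standard deviation 1, which is why the values printed above are centered around zero.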


## Section 5: Preprocessing Pipeline

### **Task 5**: Build Preprocessing Pipeline

*Instruction*: Using `ColumnTransformer` and `Pipeline` from `sklearn`, build a pipeline that:



*   Imputes missing values
*   Scales numeric data
*   Encodes categorical data



In [8]:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Step 1: Load the dataset
df = pd.read_csv('titanic.csv')

# Step 2: Split data into X (features) and y (target)
X = df.drop('Survived', axis=1)  # Features, drop 'Survived' as it’s the target
y = df['Survived']  # Target variable

# Step 3: Define numerical and categorical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()  # Select numeric columns
# Select object (string) columns; this includes 'Name', so every distinct name gets its own one-hot column
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

# Step 4: Create preprocessing steps for each type of data
# For numerical columns:
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Impute missing values with the median
    ('scaler', StandardScaler())  # Scale numerical data
])

# For categorical columns:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent value
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # One-Hot Encoding
])

# Step 5: Combine both transformations into a single ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Step 6: Wrap the preprocessor in a pipeline (a model step could be appended later)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
])

# Step 7: Fit and transform the data (for preprocessing steps)
X_processed = pipeline.fit_transform(X)

# Step 8: Verify output (first 5 rows)
print("Processed features:\n", X_processed[:5])



Processed features:
 <Compressed Sparse Row sparse matrix of dtype 'float64'
	with 35 stored elements and shape (5, 894)>
  Coords	Values
  (0, 0)	0.8305236329179975
  (0, 1)	-0.529366007257325
  (0, 2)	0.42990394821142364
  (0, 3)	-0.4749807967420064
  (0, 4)	-0.5035863459797053
  (0, 606)	1.0
  (0, 893)	1.0
  (1, 0)	-1.561276569673768
  (1, 1)	0.6042645431881828
  (1, 2)	0.42990394821142364
  (1, 3)	-0.4749807967420064
  (1, 4)	0.7834124485273979
  (1, 827)	1.0
  (1, 892)	1.0
  (2, 0)	0.8305236329179975
  (2, 1)	-0.24595836964594797
  (2, 2)	-0.47585567664257344
  (2, 3)	-0.4749807967420064
  (2, 4)	-0.4900195895218577
  (2, 176)	1.0
  (2, 892)	1.0
  (3, 0)	-1.561276569673768
  (3, 1)	0.3917088149796501
  (3, 2)	0.42990394821142364
  (3, 3)	-0.4749807967420064
  (3, 4)	0.4179481482311301
  (3, 818)	1.0
  (3, 892)	1.0
  (4, 0)	0.8305236329179975
  (4, 1)	0.3917088149796501
  (4, 2)	-0.47585567664257344
  (4, 3)	-0.4749807967420064
  (4, 4)	-0.48750722721484885
  (4, 737)	1.0
  (4, 893
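
The 894 columns in the output above include one one-hot column per distinct `Name`, because `Name` is an object column and is picked up as categorical. One possible variation, sketched here on a hypothetical toy frame with made-up values, is to pass explicit column lists so that identifier-like columns are left out. `ColumnTransformer` drops unlisted columns by default (`remainder='drop'`). The names `toy`, `numeric_cols`, and `X_toy` are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with missing values in a numeric and a categorical column
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", np.nan, "female"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Explicit column lists; 'Name' is not listed, so it is dropped
numeric_cols = ["Age", "Fare"]
categorical_cols = ["Sex"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_cols),
    ("cat", categorical_transformer, categorical_cols),
])

X_toy = preprocessor.fit_transform(toy)

# The result may be sparse or dense depending on its density; convert if needed
if hasattr(X_toy, "toarray"):
    X_toy = X_toy.toarray()

print(X_toy.shape)  # 2 scaled numeric columns + 2 one-hot columns for Sex
print(X_toy)
```

Converting to a dense array is only practical for small outputs like this one. For wide one-hot encodings, the sparse form shown in the cell output above is more memory-efficient.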

## Section 6: Feature Engineering

### **Task 6**: Create a New Feature

*Instruction*: Create a new feature `FamilySize` = `Siblings/Spouses Aboard` + `Parents/Children Aboard` + 1.

In [10]:

# Step 1: Create the 'FamilySize' feature
df['FamilySize'] = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard'] + 1

print(df[['Siblings/Spouses Aboard', 'Parents/Children Aboard', 'FamilySize']].head())
# Save the DataFrame with the new feature
# (df was last loaded from the raw titanic.csv in Task 5, so the Task 2-4 changes are not in this file)
df.to_csv('titanic_preprocessed.csv', index=False)

print("✅ File saved as 'titanic_preprocessed.csv'")





   Siblings/Spouses Aboard  Parents/Children Aboard  FamilySize
0                        1                        0           2
1                        1                        0           2
2                        0                        0           1
3                        1                        0           2
4                        0                        0           1
✅ File saved as 'titanic_preprocessed.csv'
