Step 1 : Data ingestion

In [2]:
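import pandas as pd
# Load the raw Titanic CSV (the path is specific to the author's machine)
df = pd.read_csv("C:/Users/User/Downloads/Titanic-Dataset.csv")
# Preview the first five rows
print("Raw Data Snapshot:\n", df.head(), "\n")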


Raw Data Snapshot:
    PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
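
The snapshot already shows NaN in Cabin, so counting missing values per column is a natural check before cleaning. A minimal sketch, assuming a small stand-in frame named `sample` (a hypothetical name) built from the first four snapshot rows; in the notebook, `df` would take its place:

```python
import numpy as np
import pandas as pd

# Stand-in for df: first four rows of the snapshot, three columns only.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

# Count missing values per column and express them as a share of all rows.
missing = sample.isna().sum()
missing_share = sample.isna().mean()
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

A column with a large share of missing values is a candidate for dropping; one with only a few gaps is a candidate for imputation.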

Step 2 : Data cleaning

In [5]:
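# Remove exact duplicate rows
df.drop_duplicates(inplace=True)
print("After Dropping Duplicates:\n", df.head(), "\n")
# Fill missing ages with the median and missing ports with the most frequent value
df.loc[:, 'Age'] = df['Age'].fillna(df['Age'].median())
df.loc[:, 'Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Drop Cabin; errors='ignore' keeps the cell safe to re-run once the column is gone
df.drop(columns=['Cabin'], inplace=True, errors='ignore')
print("After Imputing Age and Embarked, and Dropping Cabin:\n", df.head(), "\n")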


After Dropping Duplicates:
    PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Embarked  
0      0         A/5 21171   7.2500        S  
1      0          PC 17599  71.2833        C  
2      0  STON/O2. 3101282   7.9250        S  
3      0            113803  53.1000        S  
4      0            373450   8.0500        S   

After Imputing Age and Embarked, and Dropping Cabin:
[remaining output truncated]
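
The imputation in this step can be illustrated on a toy frame. This is only a sketch: the values in `toy` are hypothetical and do not come from the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the Titanic file.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 80.0],
    "Embarked": ["S", "C", None, "S", "Q"],
})

age_median = toy["Age"].median()            # median() skips NaN by default
embarked_mode = toy["Embarked"].mode()[0]   # mode() returns a Series; take the first entry

toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)
print(toy)
```

The median is less sensitive than the mean to a single extreme age such as 80, which is one reason it is a common choice for this column.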

Step 3 : Feature engineering

In [8]:
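from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Encode the categorical columns as integers (le is refit for each column)
df['Sex'] = le.fit_transform(df['Sex'])
df['Embarked'] = le.fit_transform(df['Embarked'])
print("After Label Encoding 'Sex' and 'Embarked':\n", df[['Sex', 'Embarked']].head(), "\n")
# Family size = siblings/spouses + parents/children + the passenger
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
print("After Creating FamilySize:\n", df[['SibSp', 'Parch', 'FamilySize']].head(), "\n")
# Extract the title (the word before the first period) from Name.
# Assumption: this line is missing from the original cell; the regex is a common choice.
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
# Merge alternative spellings and ranks into broader groups
df['Title'] = df['Title'].replace(['Mlle', 'Ms'], 'Miss')
df['Title'] = df['Title'].replace(['Mme', 'Lady', 'Countess', 'Dona'], 'Mrs')
df['Title'] = df['Title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer')
df['Title'] = le.fit_transform(df['Title'])
print("After Extracting and Encoding Title:\n", df[['Name', 'Title']].head(), "\n")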


After Label Encoding 'Sex' and 'Embarked':
    Sex  Embarked
0    1         2
1    0         0
2    0         2
3    0         2
4    1         2 

After Creating FamilySize:
    SibSp  Parch  FamilySize
0      1      0           2
1      1      0           2
2      0      0           1
3      1      0           2
4      0      0           1 

After Extracting and Encoding Title:
                                                 Name  Title
0                            Braund, Mr. Owen Harris      4
1  Cumings, Mrs. John Bradley (Florence Briggs Th...      5
2                             Heikkinen, Miss. Laina      3
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)      5
4                           Allen, Mr. William Henry      4 
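
The Title column is derived from Name. One common way to do that, shown here as a sketch rather than the notebook's confirmed method, is a regular expression that captures the word before the first period. The example also uses a dedicated encoder (`title_encoder`, a hypothetical name) so the label-to-code mapping stays available after fitting.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Four names copied from the snapshot above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
])

# Assumed pattern: the title is the word immediately before the first period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())

# One encoder per column keeps classes_ (and inverse_transform) usable later.
title_encoder = LabelEncoder()
codes = title_encoder.fit_transform(titles)
mapping = {label: code for code, label in enumerate(title_encoder.classes_)}
print(mapping)
```

LabelEncoder assigns codes in sorted label order, so the integers carry no ranking; for models that read them as ordered values, one-hot encoding may be a better fit.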



Step 4 : Transformation

In [10]:
from sklearn.model_selection import train_test_split
# Select the model inputs and the target column
features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'FamilySize', 'Title']
X = df[features]
y = df['Survived']
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training Feature Sample:\n", X_train.head(), "\n")
print("Training Label Sample:\n", y_train.head(), "\n")


Training Feature Sample:
      Pclass  Sex   Age     Fare  Embarked  FamilySize  Title
331       1    1  45.5  28.5000         2           1      4
733       2    1  23.0  13.0000         2           1      4
382       3    1  32.0   7.9250         2           1      4
704       3    1  26.0   7.8542         2           2      4
813       3    0   6.0  31.2750         2           7      3 

Training Label Sample:
 331    0
733    0
382    0
704    0
813    0
Name: Survived, dtype: int64 
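
The split above is purely random. If the share of survivors should match between the training and test sets, `train_test_split` also accepts a `stratify` argument. A minimal sketch on hypothetical toy data (`toy_X` and `toy_y` are illustrative names, not part of the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 5 of them labelled 1.
toy_X = pd.DataFrame({"feature": range(20)})
toy_y = pd.Series([1] * 5 + [0] * 15, name="Survived")

# stratify=toy_y keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

In the notebook this would amount to passing `stratify=y` in the existing call; whether it is needed depends on how imbalanced Survived is.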



Step 5 : Export cleaned data

In [12]:
# Write each split to CSV; index=False omits the original row labels
X_train.to_csv("X_train.csv", index=False)
X_test.to_csv("X_test.csv", index=False)
y_train.to_csv("y_train.csv", index=False)
y_test.to_csv("y_test.csv", index=False)
print("Exported X_train.csv, X_test.csv, y_train.csv, y_test.csv")


Exported X_train.csv, X_test.csv, y_train.csv, y_test.csv
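
Because the files are written with `index=False`, features and labels can only be matched by row position after reloading. A quick round-trip check can confirm that nothing changed on export. This is a sketch with hypothetical stand-ins (`X_part`, `y_part`) written to a temporary directory, not the notebook's actual files:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for X_train and y_train, with non-default row labels.
X_part = pd.DataFrame({"Pclass": [1, 3], "Fare": [71.2833, 7.25]}, index=[331, 733])
y_part = pd.Series([1, 0], name="Survived", index=[331, 733])

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X_part.csv"
    y_path = Path(tmp) / "y_part.csv"
    X_part.to_csv(x_path, index=False)
    y_part.to_csv(y_path, index=False)

    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)["Survived"]

# The original row labels are gone, so compare against position-indexed copies.
pd.testing.assert_frame_equal(X_back, X_part.reset_index(drop=True))
pd.testing.assert_series_equal(y_back, y_part.reset_index(drop=True))
print("Round trip OK:", X_back.shape, y_back.shape)
```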
