## Titanic Project 

#### Objective: Perform data cleaning and preprocessing on the Titanic dataset. Handle missing values, encode categorical variables, and scale numerical features.

<!-- Steps:

Load the Titanic dataset using seaborn.
Explore the dataset to identify missing values and data types.
Handle missing values appropriately.
Encode categorical variables using suitable encoding techniques.
Scale numerical features.
Prepare the dataset for machine learning models. -->

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer

In [7]:
# Load the Titanic dataset
df = sns.load_dataset('titanic')

In [8]:
df.head()

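|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |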


In [9]:
# Explore Missing Values 
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
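
The objective also asks for data types, which `isnull().sum()` alone does not show. A minimal sketch of a combined summary follows. The `toy` frame and the `summary` name are hypothetical stand-ins so the snippet runs on its own; in the notebook the same expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic DataFrame
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": pd.Series([np.nan, "C", np.nan, np.nan], dtype=object),
    "fare": [7.25, 71.2833, 7.925, 53.1],
})

# One row per column: its dtype, missing count, and missing percentage
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
})

# Put the columns with the most missing values first
summary = summary.sort_values("missing", ascending=False)
print(summary)
```

Seeing the missing share next to the count makes the later decisions easier to justify: a column that is mostly empty (such as `deck` above) is a candidate for dropping, while a column with a small gap (such as `embarked`) can be imputed.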

In [11]:
# Handle missing values
# Impute 'age' with the median age
impute_age = SimpleImputer(strategy='median')
df['age'] = impute_age.fit_transform(df[['age']]).ravel()

In [16]:
# Impute 'embarked' with its most frequent value (the mode)
imputer_embarked = SimpleImputer(strategy='most_frequent')

# fit_transform returns a 2-D array of shape (n_rows, 1);
# .ravel() flattens it to 1-D so it can be assigned to a single column
df['embarked'] = imputer_embarked.fit_transform(df[['embarked']]).ravel()
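
The need for `.ravel()` is easier to see on a small example. The sketch below uses a hypothetical `toy` frame (not the real dataset) to show that `SimpleImputer.fit_transform` returns a two-dimensional array even for a single column, and that flattening it gives a plain column of values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one numeric and one categorical gap
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})

# Median imputation: the result has shape (n_rows, 1), not (n_rows,)
age_imputer = SimpleImputer(strategy="median")
age_2d = age_imputer.fit_transform(toy[["age"]])
print(age_2d.shape)
toy["age"] = age_2d.ravel()

# Mode imputation for the categorical column, flattened the same way
embarked_imputer = SimpleImputer(strategy="most_frequent")
toy["embarked"] = embarked_imputer.fit_transform(toy[["embarked"]]).ravel()

print(toy)
```

For a quick one-off, `toy["age"].fillna(toy["age"].median())` gives the same values. The imputer objects are mainly useful when the same fitted statistics must be reused on new data.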


In [17]:
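# Drop 'deck' (mostly missing) and columns that duplicate information held elsewhere
# ('alive' repeats the target 'survived', 'embark_town' repeats 'embarked', and so on)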
df.drop(['deck', 'embark_town', 'alive', 'who', 'adult_male', 'alone'], axis=1, inplace=True)
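
Before dropping a column as redundant, it can be worth checking that it really is a relabeling of another column. The helper below is a hypothetical sketch (`is_relabeling` is not a pandas function), shown on a toy frame whose values mimic the Titanic columns.

```python
import pandas as pd

def is_relabeling(frame, col_a, col_b):
    """Return True if values of col_a and col_b map one-to-one onto each other."""
    # Keep each distinct (col_a, col_b) combination once
    pairs = frame[[col_a, col_b]].dropna().drop_duplicates()
    # One-to-one means no value repeats on either side of the pairs
    return pairs[col_a].is_unique and pairs[col_b].is_unique

# Hypothetical toy rows, not the real dataset
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "alive": ["no", "yes", "yes", "no"],
    "embarked": ["S", "C", "S", "Q"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", "Queenstown"],
    "pclass": [3, 1, 3, 2],
})

print(is_relabeling(toy, "survived", "alive"))        # same information, different labels
print(is_relabeling(toy, "embarked", "embark_town"))  # same information, different labels
print(is_relabeling(toy, "survived", "pclass"))       # genuinely different columns
```

A column that relabels the target, as `alive` does for `survived`, would leak the answer into the features if it were kept.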

In [18]:

# Encoding Categorical Variables

# One-Hot Encoding for 'sex' and 'class'
df = pd.get_dummies(df, columns=['sex', 'class'], drop_first=True)

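# Encode 'embarked' with label encoding.
# Note: LabelEncoder assigns integer codes in sorted order (C=0, Q=1, S=2),
# which implies an ordering that this nominal feature does not have.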
label_encoder = LabelEncoder()
df['embarked'] = label_encoder.fit_transform(df['embarked'])

print("\nDataFrame after handling missing values and encoding:")
print(df.head())

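# Feature scaling: standardize numerical columns to zero mean and unit variance.
# Note: the scaler is fit on the full dataset here, before any train/test split.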
scaler = StandardScaler()
numerical_features = ['age', 'fare', 'sibsp', 'parch']
df[numerical_features] = scaler.fit_transform(df[numerical_features])

print("\nDataFrame after feature scaling:")
print(df.head())


DataFrame after handling missing values and encoding:
   survived  pclass   age  sibsp  parch     fare  embarked  sex_male  \
0         0       3  22.0      1      0   7.2500         2      True   
1         1       1  38.0      1      0  71.2833         0     False   
2         1       3  26.0      0      0   7.9250         2     False   
3         1       1  35.0      1      0  53.1000         2     False   
4         0       3  35.0      0      0   8.0500         2      True   

   class_Second  class_Third  
0         False         True  
1         False        False  
2         False         True  
3         False        False  
4         False         True  
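
The cell above label-encodes `embarked` and fits the scaler on every row. As an alternative sketch, not the notebook's method, the encoding and scaling steps can be combined in a `ColumnTransformer` that is fit on a training split only. This uses the `OneHotEncoder` that the notebook imports but does not use. The `toy` frame, the 75/25 split, and the `preprocess` name are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows with missing values already handled
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 35.0, 28.0, 54.0, 2.0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075],
    "sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
    "embarked": ["S", "C", "S", "S", "S", "Q", "S", "S"],
    "survived": [0, 1, 1, 1, 0, 0, 0, 0],
})

X = toy.drop(columns="survived")
y = toy["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale numeric columns and one-hot encode nominal ones in a single step.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age", "fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "embarked"]),
    ],
    sparse_threshold=0,
)

# Learn means, variances, and categories from the training rows only,
# then reuse them unchanged on the held-out rows
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting on the training split keeps statistics from the held-out rows out of the scaler. One-hot encoding also avoids giving `embarked` an artificial numeric order.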

DataFrame after feature scaling:
   survived  pclass       age     sibsp     parch      fare  embarked  \
0         0       3 -0.565736  0.432793 -0.473674 -0.502445         2   
1         1       1  0.663861  0.432793 -0.473674  0.786845         0   
2         1       3 -0.258337 -0.474545 -0.473674 -0.488854         2   
