
Step1: Data preprocessing

In [2]:
import pandas as pd
data = pd.read_csv('titanic.csv')
print(data.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


Count the missing values in each column

In [4]:
print(data.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
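Raw counts are easier to compare as a share of all rows: in the output above, Cabin is missing in 687 of 891 rows (about 77%) and Age in 177 (about 20%). A minimal sketch of that calculation follows, using a made-up toy frame (`toy` and its values are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() yields booleans; their mean is the fraction of missing entries.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real frame the same expression would be `data.isnull().mean()`.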


Get summary statistics for the numeric columns

In [5]:
print(data.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


Step2: Handling missing values

In [7]:
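# Fill missing ages with the median and missing ports with the most common value.
# Assign the result back: fillna(..., inplace=True) on a selected column is
# chained assignment and may not update `data` in recent pandas versions.
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
# Drop rows with missing target values
data.dropna(subset=['Survived'], inplace=True)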
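The median/mode imputation can be checked on a small example. The sketch below uses an invented toy frame (`toy` is an assumption, not part of the notebook) and assigns the filled column back, since recent pandas versions warn when `fillna(..., inplace=True)` is called on a selected column:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age and one missing port; values are invented.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Median of the observed ages and most frequent port fill the gaps.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Total number of missing cells left after imputation.
remaining = int(toy.isnull().sum().sum())
print(toy)
print("Missing cells remaining:", remaining)
```

Note that Cabin, which has 687 missing values in the real data, is not imputed in this step and stays in the frame as is.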

Step3: Scaling numerical data

In [8]:
from sklearn.preprocessing import StandardScaler

# Define the numerical columns to scale
numerical_cols = ['Age', 'Fare']

# Initialize the scaler
scaler = StandardScaler()

# Scale the numerical features to zero mean and unit variance
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
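`StandardScaler` replaces each value x with z = (x - mean) / std, where the mean and standard deviation are computed per column and the standard deviation is the population one (`ddof=0`). A minimal sketch that checks this against a manual calculation, using a few invented ages (`ages` is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single column of invented ages, shaped (n_samples, 1) as sklearn expects.
ages = np.array([[22.0], [38.0], [26.0], [35.0]])

scaled = StandardScaler().fit_transform(ages)

# Manual z-score; np.std defaults to ddof=0, matching StandardScaler.
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, manual))
```

This is why the scaled Age and Fare values in later outputs are centered on zero and include negative numbers.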

Step4: Encoding categorical variables

In [9]:
# One-hot encode the categorical columns; drop_first=True drops one level per column
data = pd.get_dummies(data, columns=['Sex', 'Embarked'], drop_first=True)
# Verify that the categorical columns have been successfully encoded
print(data.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name       Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris -0.565736      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  0.663861      1      0   
2                             Heikkinen, Miss. Laina -0.258337      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  0.433312      1      0   
4                           Allen, Mr. William Henry  0.433312      0      0   

             Ticket      Fare Cabin  Sex_male  Embarked_Q  Embarked_S  
0         A/5 21171 -0.502445   NaN      True       False        True  
1          PC 17599  0.786845   C85     False       False       False  
2  STON/O2. 3101282 -0.488854   NaN     False       False        True  
3            113803  0.420730  C123     False       False        True
4            373450 -0.486337   NaN      True       False        True
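With `drop_first=True`, the first level of each column in sorted order is dropped and becomes the implicit baseline: `female` for Sex and `C` for Embarked, which is why only `Sex_male`, `Embarked_Q` and `Embarked_S` appear. A small sketch on an invented toy frame (`toy` is an assumption for illustration):

```python
import pandas as pd

# Toy frame covering every level of both categorical columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Levels are sorted, then the first one per column is dropped.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(list(encoded.columns))
print(encoded)
```

The second row (female, embarked at C) is encoded as all-false, since both of its levels are the dropped baselines.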

In [13]:
# Display all columns
pd.set_option("display.max_columns", None)
data.head(20)

,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_male,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",-0.565736,1,0,A/5 21171,-0.502445,,True,False,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0.663861,1,0,PC 17599,0.786845,C85,False,False,False
2,3,1,3,"Heikkinen, Miss. Laina",-0.258337,0,0,STON/O2. 3101282,-0.488854,,False,False,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0.433312,1,0,113803,0.42073,C123,False,False,True
4,5,0,3,"Allen, Mr. William Henry",0.433312,0,0,373450,-0.486337,,True,False,True
5,6,0,3,"Moran, Mr. James",-0.104637,0,0,330877,-0.478116,,True,True,False
6,7,0,1,"McCarthy, Mr. Timothy J",1.893459,0,0,17463,0.395814,E46,True,False,True
7,8,0,3,"Palsson, Master. Gosta Leonard",-2.102733,3,1,349909,-0.224083,,True,False,True
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",-0.181487,0,2,347742,-0.424256,,False,False,True
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",-1.180535,1,0,237736,-0.042956,,False,False,False


Step5: Splitting data into training and testing sets

Separate the target variable ('Survived') from the feature variables

In [15]:
X = data.drop('Survived', axis=1)
y = data['Survived']

Split the dataset into training and testing sets using an 80/20 split

In [16]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")

Training set size: (712, 12)
Testing set size: (179, 12)
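Two things are worth noting about this split. The 12 feature columns still include text fields such as Name, Ticket and Cabin, which most scikit-learn estimators cannot consume directly. Also, the scaler in Step3 was fit on the full dataset before splitting, so test-set statistics influence the scaled training data. The sketch below shows one possible alternative ordering, not the notebook's method: split first, then fit the scaler on the training rows only. The toy data, the choice of feature columns, and the use of `stratify` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic frame; values are random, not real.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=100),
    "Fare": rng.uniform(0, 500, size=100),
    "Survived": rng.integers(0, 2, size=100),
})

# Keep only numeric features (an assumption about which columns to model).
X = toy[["Age", "Fare"]]
y = toy["Survived"]

# Split first; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the training columns have exactly zero mean, while the test columns are only approximately centered, which reflects what a model would see on unseen data.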
