# Summary
This project is part of the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic) competition on Kaggle. The goal is to predict which passengers survived the Titanic shipwreck using machine learning models based on features like age, sex, and ticket class.
# 📊 Data

The training set contains 891 rows and 12 columns. The test set contains 418 rows and 11 columns, because it omits the target column Survived. The table below summarizes the variables:

| Column Name | Description |
|-------------|-------------|
| PassengerId | Unique identifier for each passenger. |
| Survived    | Target variable (0 = Did not survive, 1 = Survived). |
| Pclass      | Ticket class (1st, 2nd, or 3rd). |
| Name        | Full name of the passenger. |
| Sex         | Sex of the passenger (male or female). |
| Age         | Age of the passenger in years. |
| SibSp       | Number of siblings/spouses aboard. |
| Parch       | Number of parents/children aboard. |
| Ticket      | Ticket number. |
| Fare        | Passenger fare. |
| Cabin       | Cabin number (often missing). |
| Embarked    | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). |

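Seven of these columns (Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked) are used as model features, since they carry the demographic and socioeconomic information most relevant to survival prediction. Name, Ticket, Cabin, and PassengerId are dropped during cleaning.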

# Process Phase

During this phase, the data is inspected, cleaned, and prepared for modeling. In this notebook that means checking the structure and data types, looking for duplicate rows, dropping columns that do not help prediction, filling missing values, and encoding categorical columns as numbers.

## Data Inspection

The initial inspection helps understand the structure of the dataset, check data types, and identify missing or inconsistent values. This is done using basic functions to get an overview before applying transformations.


In [1]:
# Libraries & read datasets

import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

test = pd.read_csv('/kaggle/input/titanic/test.csv')
train = pd.read_csv('/kaggle/input/titanic/train.csv')

# First few rows

print("test dataset")
print(test.head())

print("\ntrain dataset")
print(train.head())

test dataset
   PassengerId  Pclass                                          Name     Sex  \
0          892       3                              Kelly, Mr. James    male   
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female   
2          894       2                     Myles, Mr. Thomas Francis    male   
3          895       3                              Wirz, Mr. Albert    male   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   

    Age  SibSp  Parch   Ticket     Fare Cabin Embarked  
0  34.5      0      0   330911   7.8292   NaN        Q  
1  47.0      1      0   363272   7.0000   NaN        S  
2  62.0      0      0   240276   9.6875   NaN        Q  
3  27.0      0      0   315154   8.6625   NaN        S  
4  22.0      1      1  3101298  12.2875   NaN        S  

train dataset
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3    

We can see that both datasets have the same columns, differing only in the Survived column, as indicated in the documentation.

In [2]:
#structure

test.info()
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass  

Several columns contain missing values. In the test set, Age has 332 non-null values out of 418, Fare has 417, and Cabin has only 91, so Cabin is mostly empty. Some variables are also of object type and will need to be converted to numeric values later in the process.
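
The `info()` output reports non-null counts, so the number of missing values has to be worked out by subtraction. A more direct view is a per-column count and percentage of missing values. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not the Titanic data); the same two expressions could be applied to `train` or `test`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# Count and percentage of missing values per column, highest first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```

Sorting by the count puts the columns that need the most attention at the top, which makes a mostly empty column such as Cabin easy to spot.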

In [3]:
# Unique values

print(test.nunique())
print(train.nunique())

PassengerId    418
Pclass           3
Name           418
Sex              2
Age             79
SibSp            7
Parch            8
Ticket         363
Fare           169
Cabin           76
Embarked         3
dtype: int64
PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64


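The categorical columns contain the number of distinct values described in the documentation: Survived and Sex have 2 each, and Pclass and Embarked have 3 each. PassengerId and Name are unique for every row in both datasets.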

In [4]:
# Duplicated values

print(train.duplicated().sum())
print(test.duplicated().sum())

0
0


There are no duplicated rows in either dataset.

In [5]:
# statistics

print(test.describe())
print(train.describe())

       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min      

The summary statistics give a quick sense of each numeric column. In the test set, for example, Age ranges from 0.17 to 76 years and Fare ranges from 0 to about 512.
## Data cleaning process
The following section outlines the process for cleaning the dataset to ensure it is ready for analysis.

In [6]:
#  We keep the test set’s PassengerId for submission purposes.

testId = test['PassengerId']

# We drop columns that are not useful for the analysis.

test = test.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis = 1)
train = train.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis = 1)
test.head()

# Fill missing values.

train['Age']=train['Age'].fillna(train['Age'].mode()[0])
test['Age']=test['Age'].fillna(test['Age'].mode()[0])
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())

# We convert the columns with object data types to numeric.

encoder = OrdinalEncoder()
train[['Sex', 'Embarked']] = encoder.fit_transform(train[['Sex', 'Embarked']])
test[['Sex', 'Embarked']] = encoder.transform(test[['Sex', 'Embarked']])

# Now we can examine the correlation between the different variables.

print(train.corr()['Survived'].sort_values(ascending=False))

Survived    1.000000
Fare        0.257307
Parch       0.081629
SibSp      -0.035322
Age        -0.052872
Embarked   -0.167675
Pclass     -0.338481
Sex        -0.543351
Name: Survived, dtype: float64


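Sex (-0.54), Pclass (-0.34), and Fare (0.26) show the strongest linear correlation with Survived, and Embarked (-0.17) is weaker. Age, SibSp, and Parch are only weakly correlated, with absolute values below 0.1. Because the encoder assigns female = 0 and male = 1, the negative sign for Sex means that men survived less often than women.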
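
The sign of the correlation for an encoded column only makes sense once the integer codes are known. With its default settings, `OrdinalEncoder` sorts the categories of each column and numbers them from 0, and the fitted `categories_` attribute exposes that order. The sketch below shows how to read the mapping back; the frame `toy` is a made-up stand-in, and on the real data the same dictionary comprehension could be run on the notebook's fitted `encoder`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with the same two categorical columns (values are illustrative).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})
columns = ["Sex", "Embarked"]

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy[columns])

# categories_ holds one sorted array per column; the position of a
# category in that array is the integer code assigned to it.
mapping = {
    col: {cat: code for code, cat in enumerate(cats)}
    for col, cats in zip(columns, encoder.categories_)
}

print(mapping)
```

Note that ordinal codes impose an order on Embarked (C < Q < S) that has no real meaning. Tree-based models such as a random forest are fairly tolerant of this, but one-hot encoding would be the usual alternative for linear models.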
## Conclusions from data cleaning
The conclusions from the data cleaning process are summarized below.
* We removed the columns Name, PassengerId, and Ticket because the type of information they contained was not useful for prediction. The Cabin column was also removed due to having too many missing values. (Note: The PassengerId from the test set was kept separately for submission purposes.)
* Missing values in key columns were filled to ensure completeness: Age and Embarked were filled using their mode values, while Fare in the test set was filled with the mean.
* Categorical columns (Sex and Embarked) were encoded into numeric values using ordinal encoding to prepare the data for modeling.
* Finally, we analyzed the correlation between variables to better understand their relationship with the target variable Survived. Some features showed low linear correlation, but they were kept in the dataset as they may still contribute through non-linear interactions in the model.
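
The cleaning cell fills the test set's Age and Fare with statistics computed from the test set itself. A common alternative, shown here only as a sketch and not as what the notebook does, is to learn the fill values from the training set and reuse them on the test set, in the same way the encoder is fitted on `train` and applied to `test`. The example uses `SimpleImputer`, which the notebook imports but does not use; the frames `train_toy` and `test_toy` are made-up assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for train and test (values are illustrative only).
train_toy = pd.DataFrame({
    "Age": [24.0, 24.0, 30.0, np.nan],
    "Fare": [10.0, 20.0, 30.0, 40.0],
})
test_toy = pd.DataFrame({
    "Age": [np.nan, 40.0],
    "Fare": [np.nan, 12.0],
})

# Mode for Age and mean for Fare, matching the strategies used above.
age_imputer = SimpleImputer(strategy="most_frequent")
fare_imputer = SimpleImputer(strategy="mean")

# Fit on the training data only, then apply the learned values to both sets.
train_toy[["Age"]] = age_imputer.fit_transform(train_toy[["Age"]])
test_toy[["Age"]] = age_imputer.transform(test_toy[["Age"]])

fare_imputer.fit(train_toy[["Fare"]])
test_toy[["Fare"]] = fare_imputer.transform(test_toy[["Fare"]])

print(test_toy)
```

Fitting on the training set keeps the preprocessing identical for both datasets, so the model sees test rows filled with the same values it was trained with.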
# Predictions
In this section, we train a random forest classifier on the cleaned training data, generate survival predictions for the test set, and save the results in the format required for submission to the competition.

In [7]:
# Define the variables

X = train.drop(['Survived'], axis = 1)
Y = train['Survived']

# Train the model

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, Y)

In [8]:
# Make predictions

predictions = model.predict(test)

# Prepare the predictions for submission

final = pd.DataFrame({
    'PassengerId': testId,
    'Survived': predictions
})
final.to_csv('submission.csv', index=False)
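
The notebook fits the model on the full training set and submits without estimating accuracy locally. One way to get an estimate before submitting is k-fold cross-validation. The sketch below is an assumption about how that could look, not part of the original workflow: it uses `cross_val_score` with the same model settings on synthetic data (`X_toy` and `y_toy` are made up so the block runs on its own). With the real data, `X` and `Y` from the cell above would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned training features (illustrative only).
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Age": rng.uniform(1, 80, size=n),
})

# Synthetic target tied to Sex, with about 20% of labels flipped as noise.
flip = rng.random(n) < 0.2
y_toy = ((X_toy["Sex"] == 0) ^ flip).astype(int)

# Same hyperparameters as the notebook's model.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)

# 5-fold cross-validation: each fold is held out once for scoring.
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The mean gives a rough idea of the accuracy to expect on unseen data, and the spread across folds shows how sensitive the score is to the particular split.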