# Titanic Dataset Analysis and Gradient Boosting Regressor Model Report

## 1. Introduction

This report presents an analysis of the Titanic dataset and the implementation of a Gradient Boosting regression model. The objective is to explore the dataset, preprocess it, build a predictive model, and evaluate its performance.

## 2. Libraries Used

The following libraries were used in the project:

In [23]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

## 3. Dataset Overview

The dataset used in this project is the Titanic dataset. It contains multiple attributes related to passengers and their survival chances. The goal is to predict whether a passenger survived based on their details.

### 3.1 Number of Columns

The dataset consists of the following 12 columns:

- **PassengerId**: Unique identifier for passengers  
- **Survived**: Target variable (1 = Survived, 0 = Did not survive)  
- **Pclass**: Passenger class (1st, 2nd, or 3rd class)  
- **Name**: Name of the passenger  
- **Sex**: Gender of the passenger  
- **Age**: Age of the passenger  
- **SibSp**: Number of siblings/spouses aboard  
- **Parch**: Number of parents/children aboard  
- **Ticket**: Ticket number  
- **Fare**: Fare paid for the ticket  
- **Cabin**: Cabin number, if available  
- **Embarked**: Port of embarkation (C, Q, or S)  



### 3.2 Relationship Between Columns

- **Pclass, Fare, and Cabin**: Higher-class passengers may have had better survival chances.  
- **Age and SibSp/Parch**: Younger passengers and those traveling with family might have had a better chance of survival.  
- **Sex**: Women had a higher survival rate than men.  
- **Embarked**: The port where the passenger boarded might influence survival.  
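
These relationships can be checked with a group-wise survival rate. The sketch below is only an illustration: it rebuilds three columns of the first five rows printed by `data.head()` in Section 4 into a small frame called `sample` (a hypothetical name), so the resulting rates say nothing about the full dataset.

```python
import pandas as pd

# First five rows of the Titanic data, as printed by data.head() in this report.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
rate_by_class = sample.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

On the full frame, the same idea would be `data.groupby("Sex")["Survived"].mean()`, run before the categorical columns are encoded.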


## 4. Basic Analysis

The dataset was loaded and examined using the following functions (only the head() output is reproduced below):

In [24]:
data = pd.read_csv(r"C:\Users\Shaik Sakhlaih\Downloads\titanic.csv")
print(data.head())
print(data.tail())
print(data.info())
print(data.describe())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
  

### 4.1 Checking for Null Values

In [25]:
print(data.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


### Missing Values

- **Age**: 177 missing values  
- **Cabin**: 687 missing values  
- **Embarked**: 2 missing values  
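
Raw counts are easier to judge as a share of the rows. A minimal sketch, using a small hypothetical frame `toy` whose values mirror the `head()` output above rather than the full data:

```python
import numpy as np
import pandas as pd

# Hypothetical five-row frame; values mirror the head() output in this report.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# isnull() gives booleans; their mean is the fraction of missing entries.
missing_share = toy.isnull().mean()
print(missing_share)
```

On the real data, `data.isnull().mean()` would show what fraction of `Cabin` is absent, which is worth knowing before deciding how to fill it.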


## 5. Data Preprocessing

To handle the missing values in Age, Cabin, and Embarked, we replaced them with each column's mode (its most frequent value) using SimpleImputer:

In [26]:
imputer = SimpleImputer(strategy='most_frequent')
data[['Age', 'Cabin', 'Embarked']] = imputer.fit_transform(data[['Age', 'Cabin', 'Embarked']])
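
The mode is a natural fill for a categorical column such as `Embarked`, but for a numeric column like `Age` the median is a common alternative. The sketch below is not the method used in this report. It shows that alternative on a hypothetical frame `toy`, and the `Cabin_known` indicator column is likewise an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in all three columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", np.nan, "S", "S"],
})

# Numeric column: fill with the median of the observed ages.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill with the most frequent value (the mode).
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Mostly-missing column: keep only a flag for whether a cabin was recorded.
toy["Cabin_known"] = toy["Cabin"].notna().astype(int)
toy = toy.drop(columns="Cabin")

print(toy)
```

Replacing a mostly empty column with its mode makes nearly every row share one value, which is why the sketch keeps a presence flag instead.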

Since the dataset contains categorical variables, we used Label Encoding to convert them into numeric values:

In [27]:
le = LabelEncoder()
# Encode every text column; Sex and Embarked must also be numeric for the model to fit.
for col in ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']:
    data[col] = le.fit_transform(data[col])
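
`LabelEncoder` assigns integer codes in sorted order of the distinct values, and `inverse_transform` maps the codes back. A small sketch on made-up port codes in the style of the `Embarked` column:

```python
from sklearn.preprocessing import LabelEncoder

ports = ["S", "C", "S", "Q"]  # example values, not taken from the dataset

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

print(list(encoder.classes_))                   # distinct values in sorted order
print(codes.tolist())                           # integer code of each input value
print(list(encoder.inverse_transform(codes)))   # back to the original strings
```

The codes imply an ordering (C < Q < S) that has no real meaning. scikit-learn documents `LabelEncoder` as a tool for target labels and offers `OrdinalEncoder` for feature columns, so treat its use on features here as a shortcut.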

## 6. Model Building

The Survived column is the target variable (y), while the remaining columns are the features (x).

In [29]:
x = data.drop(['Survived'], axis=1)
y = data['Survived']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
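
With `test_size=0.2`, one fifth of the rows are held out for testing, and a fixed `random_state` makes the split repeatable. A sketch on ten hypothetical rows (`x_demo` and `y_demo` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x_demo = np.arange(20).reshape(10, 2)   # 10 hypothetical rows, 2 features
y_demo = np.array([0, 1] * 5)

a_train, a_test, b_train, b_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

# Same arguments and the same random_state reproduce the split exactly.
a_train2, a_test2, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

print(a_train.shape, a_test.shape)
```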

### 6.1 Applying Gradient Boosting Regressor

In [None]:
gbs = GradientBoostingRegressor(loss='squared_error', learning_rate=0.1, n_estimators=100, subsample=1.0)
model = gbs.fit(x_train, y_train)
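
With `loss='squared_error'`, gradient boosting can be read as repeated residual fitting: start from the mean of the target, fit a small tree to the current residuals, and add a `learning_rate` fraction of its prediction. The sketch below illustrates that loop with one-feature decision stumps in plain Python. It is a teaching simplification, not scikit-learn's implementation, and `fit_stump`, `boost`, and `mse` are hypothetical helper names.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    # Try a split after every distinct value except the largest.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, n_estimators=100, learning_rate=0.1):
    """Gradient boosting for squared error with stumps as weak learners."""
    base = sum(ys) / len(ys)            # initial prediction: the target mean
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(ys, preds):
    """Mean of the squared differences between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]     # toy single feature
ys = [0, 0, 0, 1, 1, 1]     # toy 0/1 target

model = boost(xs, ys)
mse_base = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_model = mse(ys, [model(x) for x in xs])
print(mse_base, mse_model)
```

Each round removes a `learning_rate` share of the remaining residual, so a smaller learning rate generally needs more estimators to reach the same training error.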

## 7. Model Evaluation

### 7.1 Predictions

In [None]:
y_pred = model.predict(x_test)
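
Because the model is a regressor, `y_pred` holds continuous scores rather than 0/1 labels. One common way to read such scores as class predictions is to threshold them at 0.5. That step is an assumption here and was not taken in this report. A sketch with made-up scores:

```python
scores = [0.12, 0.81, 0.47, 0.66, -0.05]   # hypothetical regressor outputs
actual = [0, 1, 1, 1, 0]                   # hypothetical true labels

# Scores at or above 0.5 are read as "survived".
labels = [1 if s >= 0.5 else 0 for s in scores]

# Accuracy is the share of labels that match the true values.
accuracy = sum(l == a for l, a in zip(labels, actual)) / len(actual)

print(labels)
print(accuracy)
```

Note that a squared-error regressor can return values below 0 or above 1, as the last made-up score shows.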

### 7.2 Mean Squared Error (MSE)

In [None]:
mse = mean_squared_error(y_test, y_pred)
print("The mean squared error:", mse)

### Output:

The mean squared error: 0.2052
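
The MSE is the average of the squared differences between actual and predicted values. A hand computation on made-up numbers shows the formula and gives a reference point for a 0/1 target (the helper `mse_by_hand` is illustrative):

```python
def mse_by_hand(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [0, 1, 1, 0]
predicted = [0.2, 0.7, 0.4, 0.1]   # hypothetical model outputs

example_mse = mse_by_hand(actual, predicted)
print(example_mse)

# Always predicting 0.5 on a 0/1 target gives an MSE of exactly 0.25.
constant_mse = mse_by_hand(actual, [0.5] * len(actual))
print(constant_mse)
```

Against that 0.25 reference, an MSE of 0.2052 is a modest improvement rather than a small error.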

### 7.3 R2 Score

In [None]:
r2 = r2_score(y_test, y_pred)
print(f"R2 score = {r2}")

### Output:

R2 score = 0.11585
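
R² compares the model's squared error with that of always predicting the mean of the target, so it equals 1 minus the MSE divided by the variance of the test target. The sketch below computes R² by hand on made-up numbers, then uses the figures in this report (MSE = 0.2052 and an R² of about 0.116, which is 1.1585 divided by 10) to back out the implied variance. The helper `r2_by_hand` is illustrative.

```python
def r2_by_hand(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [0, 1, 1, 0]
predicted = [0.2, 0.7, 0.4, 0.1]   # hypothetical model outputs
example_r2 = r2_by_hand(actual, predicted)
print(example_r2)

# Figures from this report: R^2 = 1 - MSE / variance, so variance = MSE / (1 - R^2).
report_mse, report_r2 = 0.2052, 0.11585
implied_variance = report_mse / (1 - report_r2)
print(implied_variance)
```

An R² near 0 means the model does little better than predicting the mean, and a value of 1 would mean a perfect fit.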

## 8. Conclusion

The dataset contained missing values, which were replaced using the mode.  

Categorical columns were encoded using **Label Encoding**.  

The **Gradient Boosting Regressor** model was implemented.  

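- **Mean Squared Error (MSE)**: 0.2052. For a 0/1 target this is only modestly below the 0.25 obtained by always predicting 0.5, so it does not indicate a low prediction error.  
- **R² Score**: about 0.116 (1.1585 when multiplied by 10), meaning the model explains only about 12% of the variance in the test target, which is a poor fit.  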

The model’s performance can be improved by **hyperparameter tuning** or experimenting with other regression models such as **Random Forest Regressor** or **XGBoost**.  

This report provides a comprehensive analysis and evaluation of the **Titanic dataset** using **Gradient Boosting Regressor**.  
