
# Titanic Dataset Analysis

## Introduction

The Titanic dataset is a classic dataset often used for data analysis and machine learning. It contains information about the passengers who were on board the Titanic when it sank on April 15, 1912, after hitting an iceberg.

The dataset includes various features such as:
- Passenger ID
- Survival status (Survived)
- Class (Pclass)
- Name
- Sex
- Age
- Number of siblings/spouses aboard (SibSp)
- Number of parents/children aboard (Parch)
- Ticket number
- Fare
- Cabin
- Port of Embarkation (Embarked)

## Objective

The main objective of this analysis is to explore the data, understand the key factors that might have influenced the survival of the passengers, and build predictive models to determine the likelihood of survival based on the available features.

## Source

The dataset is available on [Kaggle](https://www.kaggle.com/c/titanic) and is widely used for educational purposes and competitions.

## Outline

1. Data Loading and Exploration
2. Data Cleaning and Preprocessing
3. Exploratory Data Analysis (EDA)
4. Feature Engineering
5. Model Building and Evaluation

In [35]:
#import necessary libraries

In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Loading and Exploration

In [37]:
#load the dataset

In [38]:
df = pd.read_csv("Titanic-Dataset.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [39]:
#shape of the data

In [40]:
df.shape

(891, 12)

## Data Cleaning and Preprocessing

In [41]:
# drop identifier-like columns that are not used as features
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

In [42]:
df

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.2500,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.9250,S
3,1,1,female,35.0,1,0,53.1000,S
4,0,3,male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S
887,1,1,female,19.0,0,0,30.0000,S
888,0,3,female,,1,2,23.4500,S
889,1,1,male,26.0,0,0,30.0000,C


In [43]:
df.describe()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [44]:
#datatype check

In [45]:
df.dtypes

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object

In [46]:
#check for missing values

In [47]:
df.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64
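
Raw counts are easier to judge as a share of the rows: 177 missing ages out of 891 is roughly 19.9% of the column. A minimal sketch of that calculation, run on a small made-up frame (`toy` and its values are illustrative stand-ins, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# isnull() marks missing cells; the column mean of those booleans
# is the fraction of missing entries per column.
missing_share = toy.isnull().mean()
print(missing_share)
```

On the real frame, the same expression would be `df.isnull().mean()`.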

In [49]:
#replace missing values

In [51]:
df["Age"] = df["Age"].replace(np.nan, df["Age"].median(axis=0))
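
The cell above fills missing ages with the column median. As a sketch of what that does, here is the same idea on a made-up series (`ages` is hypothetical), together with `fillna`, which is the more common way to write it and gives the same result:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 26.0, 35.0])  # made-up values

# median() skips NaN by default, so it is computed from 22, 26 and 35.
median_age = ages.median()

# fillna() replaces only the missing entries.
filled = ages.fillna(median_age)

# The replace() form used in the notebook produces the same series.
same = filled.equals(ages.replace(np.nan, median_age))
print(median_age, filled.tolist(), same)
```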

## Feature Engineering

In [53]:
#select features

In [68]:
target = df['Survived']

In [55]:
features = df[["Pclass", "Age", "Fare", "Sex"]]

In [61]:
features.head()

Unnamed: 0,Pclass,Age,Fare,Sex
0,3,22.0,7.25,male
1,1,38.0,71.2833,female
2,3,26.0,7.925,female
3,1,35.0,53.1,female
4,3,35.0,8.05,male


In [62]:
features.dtypes

Pclass      int64
Age       float64
Fare      float64
Sex        object
dtype: object

In [65]:
from sklearn.preprocessing import LabelEncoder
sex_encoder = LabelEncoder()

In [66]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
features = features.assign(Sex=sex_encoder.fit_transform(features["Sex"]))


In [67]:
features

Unnamed: 0,Pclass,Age,Fare,Sex
0,3,22.0,7.2500,1
1,1,38.0,71.2833,0
2,3,26.0,7.9250,0
3,1,35.0,53.1000,0
4,3,35.0,8.0500,1
...,...,...,...,...
886,2,27.0,13.0000,1
887,1,19.0,30.0000,0
888,3,28.0,23.4500,0
889,1,26.0,30.0000,1
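
`LabelEncoder` assigns integer codes in sorted order of the labels, which is why `female` shows up as 0 and `male` as 1 in the table above. A small sketch that makes the mapping explicit (the `sex` series is a made-up sample, and `mapping` is a helper name introduced here for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sex = pd.Series(["male", "female", "female", "male"])  # made-up sample

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ holds the labels in sorted order; a label's position is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels.
print(encoder.inverse_transform([0, 1]))
```

Keeping the fitted encoder around matters if new data needs to be encoded the same way later.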


In [90]:
target.dtype

dtype('int64')

In [91]:
features.dtypes

Pclass      int64
Age       float64
Fare      float64
Sex         int32
dtype: object

## Model Building and Evaluation

In [102]:
X = features

In [103]:
y = target

In [104]:
target.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [105]:
features.head()

Unnamed: 0,Pclass,Age,Fare,Sex
0,3,22.0,7.25,1
1,1,38.0,71.2833,0
2,3,26.0,7.925,0
3,1,35.0,53.1,0
4,3,35.0,8.05,1


In [106]:
#use decision tree algorithm

In [108]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

In [109]:
from sklearn.model_selection import train_test_split

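In [110]:
# train_size=0.2 fits the model on only 20% of the rows (178 of 891);
# test_size=0.2 is the more common choice if an 80/20 split was intended
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2)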
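
With `train_size=0.2`, only about a fifth of the rows are used for fitting, and the split changes on every run because no seed is set. A minimal sketch of a more conventional 80/20 split on made-up data; the seed value 42 and the use of `stratify` are assumptions for illustration, not choices made in the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, balanced binary labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,    # hold out 20% of the rows for evaluation
    random_state=42,  # assumed seed; any fixed integer makes the split repeatable
    stratify=y_toy,   # keep the class ratio the same in both parts
)
print(X_tr.shape, X_te.shape)
```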

In [111]:
model.fit(X_train, y_train)

In [115]:
# accuracy on the training rows; X_test and y_test are not used here
model.score(X_train, y_train)

0.9943820224719101
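
The score above is computed on the same rows the tree was fitted on, and an unpruned decision tree can nearly memorize its training data, so this number says little about how the model handles unseen passengers. Accuracy on the held-out rows is the more informative figure. A sketch of the comparison on synthetic data (the data, seeds and variable names are assumptions for illustration, not the notebook's results):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Training accuracy is typically near perfect for an unpruned tree.
train_acc = clf.score(X_tr, y_tr)
# Test accuracy is measured on rows the tree has never seen.
test_acc = clf.score(X_te, y_te)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

In the notebook itself, the equivalent call would be `model.score(X_test, y_test)`.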