## EDA (Exploratory Data Analysis) on the Titanic Dataset

### Dataset link:
- Titanic Dataset
- https://www.kaggle.com/datasets/yasserh/titanic-dataset

In [1]:
# Import libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns


In [3]:
# Load the dataset

df = pd.read_csv("Titanic-Dataset.csv")

df.head(10)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [9]:
# Basic info about rows, columns, and data types

# Check the shape to see how many rows and columns there are
df.shape

(891, 12)

- This dataset has 891 rows and 12 columns.

In [None]:
# Show the columns and their data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
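
The `Non-Null Count` column above implies missing values: `Age` has 891 - 714 = 177 missing entries, `Cabin` has 891 - 204 = 687, and `Embarked` has 891 - 889 = 2. A minimal sketch of how those counts could be computed directly is shown below. The `sample` frame is a small hand-typed stand-in built from a few rows of the `df.head(10)` output (an assumption made so the snippet runs on its own); in the notebook the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`, typed from rows 0, 1, 4 and 5 of the head() output.
sample = pd.DataFrame(
    {
        "PassengerId": [1, 2, 5, 6],
        "Age": [22.0, 38.0, 35.0, np.nan],
        "Cabin": [np.nan, "C85", np.nan, np.nan],
        "Embarked": ["S", "C", "S", "Q"],
    }
)

# Count missing values per column and express them as a percentage of rows.
missing = sample.isna().sum()
missing_pct = (missing / len(sample) * 100).round(1)

# Keep only the columns that actually have missing values.
summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(summary[summary["missing"] > 0])
```

On the full dataset, replacing `sample` with `df` should reproduce the gaps visible in the `df.info()` output.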


#### Explanation:

- For this dataset
    - Total number of features/columns: 12

    - 1. Numerical columns

        - PassengerId
            - Data format: 1, 2, 3, ... and so on

        - Survived
            - Meaning: Survival indicator
            - Data format: 0 and 1

        - Pclass
            - Meaning: Passenger’s travel class
            - Data format: 1, 2, and 3

        - Age
            - Meaning: Age of the passenger (only 714 of 891 values are non-null)
            - Data format: 22.0, 38.0, 26.0, ... and so on

        - SibSp
            - Meaning: Number of siblings and spouses travelling with the passenger
            - Data format: 0, 1, 2, ... and so on

        - Parch
            - Meaning: Number of parents and children travelling with the passenger
            - Data format: 0, 1, 2, ... and so on

        - Fare
            - Meaning: Ticket fare paid
            - Data format: 7.25, 71.28, 255.33, ... and so on
    
    - 2. Categorical columns

        - Name
            - Data format: Futrelle, Mrs. Jacques Heath (Lily May Peel)

        - Sex
            - Meaning: Gender of the passenger
            - Data format: male, female

        - Ticket
            - Meaning: Ticket number
            - Data format: PC 17599, A/5 21171, and so on

        - Cabin
            - Meaning: Cabin number
            - Data format: C85, C123, and so on

        - Embarked
            - Meaning: Boarding port
            - Data format: C, Q, S
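
The numerical/categorical split listed above can also be derived from the dtypes rather than by reading the `df.info()` output. The following is a minimal sketch, assuming pandas' `select_dtypes`; the `sample` frame and the names `numeric_cols` and `categorical_cols` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for `df`, typed from the first three rows of the head() output.
sample = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3],
        "Survived": [0, 1, 1],
        "Pclass": [3, 1, 3],
        "Sex": ["male", "female", "female"],
        "Age": [22.0, 38.0, 26.0],
        "Fare": [7.25, 71.2833, 7.925],
        "Embarked": ["S", "C", "S"],
    }
)

# Integer and float columns count as "number"; everything else is treated as categorical here.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```

This split is purely dtype-based, so columns such as `Pclass` and `Survived` still land in the numerical group even though they behave like categories.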


- Note:
    - Some features are stored as numeric types but actually behave like categorical variables:

    - 1. Pclass

        - Values: 1, 2, 3

        - Meaning: passenger’s ticket class (1st, 2nd, 3rd)

        - Acts as: Ordinal categorical feature

        - Should NOT be treated as a continuous variable.

    - 2. SibSp

        - Values: 0, 1, 2, 3, 4, 5, 8

        - Meaning: number of siblings/spouses aboard

        - Even though numeric, it behaves like a count-based category.

        - Often used to create categorical groups (alone / small family / big family)

    - 3. Parch

        - Values: 0, 1, 2, 3, 4, 5, 6

        - Similar to SibSp: behaves like discrete family-size categories

    
    - 4. Survived (Target)

        - Values: 0, 1

        - It is technically numeric, but conceptually binary categorical.
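
One way to act on the note above is to convert these columns explicitly. The sketch below is only an illustration: the `sample` frame is typed from a few rows of the `df.head(10)` output, and the `FamilySize` and `FamilyGroup` columns and their thresholds (1 = alone, 2 to 4 = small family, 5 or more = big family) are assumptions chosen for the example, not values taken from the dataset documentation.

```python
import pandas as pd

# Hypothetical stand-in for `df`, typed from rows 0, 1, 2, 7 and 9 of the head() output.
sample = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0, 1],
        "Pclass": [3, 1, 3, 3, 2],
        "SibSp": [1, 1, 0, 3, 1],
        "Parch": [0, 0, 0, 1, 0],
    }
)

# Pclass as an ordered categorical; the order follows the class number (1 < 2 < 3).
sample["Pclass"] = pd.Categorical(sample["Pclass"], categories=[1, 2, 3], ordered=True)

# Survived as an unordered binary category.
sample["Survived"] = sample["Survived"].astype("category")

# Family size = siblings/spouses + parents/children + the passenger themselves.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# Assumed bins: (0, 1] alone, (1, 4] small family, (4, inf) big family.
sample["FamilyGroup"] = pd.cut(
    sample["FamilySize"],
    bins=[0, 1, 4, float("inf")],
    labels=["alone", "small family", "big family"],
)

print(sample.dtypes)
print(sample[["SibSp", "Parch", "FamilySize", "FamilyGroup"]])
```

Keeping the raw `SibSp` and `Parch` columns alongside the grouped column makes it easy to revisit the thresholds later.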

