## Topic: From Data Loading to Data Preprocessing Steps

### 1. Data loading and Exploration of Numerical Features

- 1. Data loading and basic statistical operations

### 2. Univariate Analysis and Bivariate analysis

- 2.1: Exploration of Numerical Features

- 2.2: Exploration of Categorical Features

- 2.3: Relationship between Features and Target

- 2.4: Correlation Matrix and Heatmap

- 2.5: Categorical Features vs Target Feature

### 3. Data Preprocessing Steps

- 1. Handle missing values in each feature

- 2. Encode the categorical features

- 3. Normalize and scale the numerical features


### 1. Data loading and Exploration of Numerical Features

In [1]:
# Import the necessary libraries

import numpy as np

import pandas as pd 

import matplotlib.pyplot as plt

import seaborn as sns

plt.style.use("default")

sns.set(font_scale = 1.1)



#### 1. Data loading and basic statistical operations

In [2]:
# Data loading 

df = pd.read_csv("Titanic-Dataset.csv")

df.head(6)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


### 2. Univariate Analysis and Bivariate Analysis

In [None]:
# Basic statistical operations

# 1. Check the shape of the data (univariate analysis)
print("Shape (rows, columns):", df.shape)

# Count the unique values in each column to help decide which columns are numerical and which are categorical
df.nunique()


Shape (rows, columns): (891, 12)


PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

In [12]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

- Conclusion:
    - Total rows => 891
    - Total columns => 12

- Columns containing null values:
    - Most null values: Cabin (687)
    - Moderate number of null values: Age (177)
    - Fewest null values: Embarked (2)

In [None]:
# Calculate the percentage of missing values in the Cabin column

null_value_cabin = df['Cabin'].isnull().sum()/len(df['Cabin']) * 100

print(f"Cabin column null value percentage: {null_value_cabin:.2f}%")

# Rule of thumb used here: if a column has 70% or more missing values, drop the column.
# Cabin exceeds that threshold, so it can be dropped.


Cabin column null value percentage: 77.10%
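The 70% rule applied to Cabin can be generalized to every column at once. The following is a minimal sketch, not part of the original notebook: `drop_sparse_columns` is a hypothetical helper, and the toy frame contains made-up values rather than the Titanic data.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, threshold: float = 0.70) -> pd.DataFrame:
    """Return a copy of `frame` without columns whose missing share is >= threshold."""
    # isnull().mean() gives the per-column fraction of missing values, in [0, 1].
    missing_fraction = frame.isnull().mean()
    to_drop = missing_fraction[missing_fraction >= threshold].index
    return frame.drop(columns=to_drop)


# Tiny illustrative frame (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],        # 25% missing, kept
    "Cabin": [np.nan, "C85", np.nan, np.nan],  # 75% missing, dropped
    "Fare": [7.25, 71.28, 7.93, 53.10],        # 0% missing, kept
})

cleaned = drop_sparse_columns(toy)
print(cleaned.columns.tolist())
```

On the real data, `drop_sparse_columns(df)` would be expected to remove only Cabin, since Age (177 of 891 missing) and Embarked (2 of 891 missing) are far below the threshold.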


In [18]:
# drop the cabin column
df = df.drop(columns=['Cabin'])

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [19]:
# To see each column data type 

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


- Explanation of the output above:
   - Almost every feature has no null values; the exceptions are Age and Embarked.

   - Features that contain null values:
        - Age (177 missing), Embarked (2 missing)

### 2.1 Distribution analysis of numerical columns

In [None]:
# 2 & 3: Central tendency and dispersion of the data (univariate analysis)

# Quick descriptive statistics for numerical columns.


df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PassengerId,891.0,446.0,257.353842,1.0,223.5,446.0,668.5,891.0
Survived,891.0,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
Pclass,891.0,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
Age,714.0,29.699118,14.526497,0.42,20.125,28.0,38.0,80.0
SibSp,891.0,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
Parch,891.0,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
Fare,891.0,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292


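#### Conclusions for numerical features after the first univariate analysis step:

- 1. Distribution analysis of numerical columns

    - Using descriptive statistics:
        - i. Shape (a heuristic reading, comparing mean and median):
            - Right/positively skewed features (mean > median):
                - [SibSp, Parch, Fare, Age (slightly), Survived (binary, so skew is of limited meaning)]

            - Left/negatively skewed features (mean < median):
                - [Pclass]
            - Approximately symmetric features (mean = median):
                - [PassengerId]

        - ii. Central tendency of features
            - mean: e.g. Age 29.70, Fare 32.20
            - median: e.g. Age 28.0, Fare 14.45

        - iii. Dispersion (spread) of features, judged by standard deviation
             - Large spread: [PassengerId, Age, Fare]

             - Smaller spread: [Survived, Pclass, SibSp, Parch]


- The Cabin column can be dropped because about 77% of its values are missing, which is above the 70% threshold.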
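Skew direction can also be checked numerically instead of being read off the `describe()` table. The sketch below is an illustration under assumptions: `skew_summary` and its `tol` cutoff of 0.5 are hypothetical choices, and the toy columns are made-up values chosen to show each case.

```python
import pandas as pd


def skew_summary(frame: pd.DataFrame, tol: float = 0.5) -> pd.DataFrame:
    """Tabulate mean, median and sample skewness, and label the skew direction."""
    numeric = frame.select_dtypes(include="number")
    summary = pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "skew": numeric.skew(),  # sample skewness; > 0 means a longer right tail
    })
    # `tol` is an arbitrary cutoff (assumption) below which a column is called symmetric.
    summary["direction"] = "symmetric"
    summary.loc[summary["skew"] > tol, "direction"] = "right"
    summary.loc[summary["skew"] < -tol, "direction"] = "left"
    return summary


# Made-up columns: one with a long right tail, one with a long left tail, one flat.
toy = pd.DataFrame({
    "right_tail": [1, 1, 1, 2, 10],
    "left_tail": [1, 9, 10, 10, 10],
    "flat": [1, 2, 3, 4, 5],
})

summary = skew_summary(toy)
print(summary)
```

Calling `skew_summary(df)` on the loaded data would give a per-column check of the shape conclusions listed above.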

### Step_01_EDA: Label the features

In [10]:
# Label the features

numerical_cols = ['Age', 'Fare']

categorical_cols = ['Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']

target_col = ['Survived']

print("Numerical Features: ", numerical_cols)

print("Categorical Features: ", categorical_cols)

print("Target Features: ", target_col)


Numerical Features:  ['Age', 'Fare']
Categorical Features:  ['Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']
Target Features:  ['Survived']


### Label or select the columns

- How to identify the features, target, and task type:

- 1. Numerical columns:
    - PassengerId, Age, Fare

    - less important: PassengerId (an identifier, not a measurement)

- 2. Categorical columns:
    - Survived, Sex, Pclass, Name, SibSp, Parch, Ticket, Cabin, Embarked

    - less important: Name, Ticket, Cabin (Cabin was already dropped above)

- 3. Target column:
    - Survived

- Central tendency of the data

    - Survived column: the mean is 0.383838, so only about 38% of passengers survived.

    - Age column: the mean is 29.699118, so the average passenger age is about 29.7 years (computed over the 714 non-missing values).

### 2.2: Visualizations of numerical columns

#### 2.3: Relationship between Features and Target

#### 2.4: Correlation Matrix and Heatmap
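As a possible starting point for this step, the sketch below computes a Pearson correlation matrix with pandas alone. The frame holds a few values taken from the first six rows shown earlier, so the numbers are illustrative and say nothing reliable about the full dataset.

```python
import pandas as pd

# Values from the first six rows displayed by df.head(6); far too few for real conclusions.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()

# Correlation of each feature with the target, sorted from most negative to most positive.
target_corr = corr["Survived"].drop("Survived").sort_values()

print(corr.round(2))
print(target_corr.round(2))

# To draw the heatmap, the matrix could be passed to seaborn, e.g.
# sns.heatmap(corr, annot=True), using the `sns` import from the first cell.
```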

#### 2.5: Categorical Features vs Target Feature
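One common way to compare a categorical feature with a binary target is the mean of the target per category, which here is the survival rate. The following is a minimal sketch: `survival_rate_by` is a hypothetical helper, and the frame uses only the Sex and Survived values of the first six rows shown earlier.

```python
import pandas as pd


def survival_rate_by(frame: pd.DataFrame, feature: str, target: str = "Survived") -> pd.DataFrame:
    """Count passengers and compute the mean of `target` for each level of `feature`."""
    grouped = frame.groupby(feature)[target].agg(passengers="count", survival_rate="mean")
    return grouped.sort_values("survival_rate", ascending=False)


# Sex and Survived from the first six displayed rows (illustrative, not representative).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

rates = survival_rate_by(toy, "Sex")
print(rates)
```

The same call could be repeated for each entry in `categorical_cols`, for example `survival_rate_by(df, "Pclass")`.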

### 3. Data Preprocessing Steps

#### 3.1: Handle Missing Values in Each Feature
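After dropping Cabin, the remaining gaps are in Age (177 values) and Embarked (2 values). One simple option, shown here as an assumption rather than the notebook's chosen method, is to fill numerical gaps with the median and categorical gaps with the most frequent value. `impute_missing` is a hypothetical helper, and one Embarked entry in the toy frame is blanked purely for illustration.

```python
import numpy as np
import pandas as pd


def impute_missing(frame: pd.DataFrame, numeric_cols, categorical_cols) -> pd.DataFrame:
    """Fill numeric gaps with the column median and categorical gaps with the mode."""
    out = frame.copy()
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical_cols:
        # mode() ignores NaN; iloc[0] picks the first mode if there are ties.
        # This would fail on a column with no observed values at all.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Ages follow the first six displayed rows; the missing Embarked value is invented.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Embarked": ["S", "C", "S", "S", None, "Q"],
})

filled = impute_missing(toy, numeric_cols=["Age"], categorical_cols=["Embarked"])
print(filled)
```

The median is used instead of the mean because it is less affected by extreme values. If the data is later split for modeling, the fill values would normally be computed on the training portion only.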

#### 3.2: Encoding the Categorical Features.
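A possible approach for nominal columns such as Sex and Embarked is one-hot encoding with `pd.get_dummies`. The sketch below is illustrative only; the choice of `drop_first=True` is an assumption, and the values come from the first six rows shown earlier.

```python
import pandas as pd

# Sex and Embarked from the first six displayed rows.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
})

# drop_first=True removes one level per feature (the first in sorted order),
# which avoids a redundant column; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True, dtype=int)

print(encoded)
```

Here `Sex_male` alone carries the Sex information (0 means female), and a row with both `Embarked_Q` and `Embarked_S` equal to 0 corresponds to Embarked = C.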

#### 3.3: Normalizing and Scaling the Numerical Features
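Age and Fare sit on very different ranges (Fare reaches 512.33 while Age tops out at 80), so some form of rescaling is usually applied before modeling. The sketch below shows two common options written with plain pandas; the helper names `standardize` and `min_max_scale` are hypothetical, and the values come from the first five displayed rows.

```python
import pandas as pd


def standardize(series: pd.Series) -> pd.Series:
    """Z-score scaling: subtract the mean, divide by the population std (ddof=0)."""
    return (series - series.mean()) / series.std(ddof=0)


def min_max_scale(series: pd.Series) -> pd.Series:
    """Min-max normalization: map the smallest value to 0 and the largest to 1."""
    return (series - series.min()) / (series.max() - series.min())


# Age and Fare from the first five displayed rows (no missing values among them).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

standardized = toy.apply(standardize)   # each column: mean 0, std 1
normalized = toy.apply(min_max_scale)   # each column: range [0, 1]

print(standardized.round(3))
print(normalized.round(3))
```

Both helpers assume the column has no missing values left and is not constant (a constant column would cause a division by zero). In a modeling workflow, the mean, standard deviation, minimum and maximum would normally be computed on the training split and reused for the test split.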