# Exploratory Data Analysis
Author: Siraj Shabbir  
Date: 26/09/2023  
Email: sirajshabbir321@gmail.com

**What is EDA?**  
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# loading dataset
df = sns.load_dataset('titanic')
df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


#### Getting Information about the data

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


#### Checking for null values

In [4]:
df.isnull().sum() / len(df) * 100

survived        0.000000
pclass          0.000000
sex             0.000000
age            19.865320
sibsp           0.000000
parch           0.000000
fare            0.000000
embarked        0.224467
class           0.000000
who             0.000000
adult_male      0.000000
deck           77.216611
embark_town     0.224467
alive           0.000000
alone           0.000000
dtype: float64
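
The percentages above suggest different treatments for different columns: `deck` is mostly empty, while `age` and `embarked` have only a few gaps. Here is a minimal sketch of how such a report could be sorted and flagged. It uses a small toy frame instead of the Titanic data, and both the helper name `missing_report` and the 50% cut-off are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame, drop_threshold: float = 50.0) -> pd.DataFrame:
    """Return the percentage of missing values per column, highest first.

    drop_threshold is a hypothetical cut-off (in percent) above which a
    column is flagged as a candidate for dropping rather than imputing.
    """
    pct = frame.isnull().mean() * 100  # same result as isnull().sum() / len(frame) * 100
    report = pct.to_frame(name="missing_pct")
    report["drop_candidate"] = report["missing_pct"] > drop_threshold
    return report.sort_values("missing_pct", ascending=False)


# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

report = missing_report(toy)
print(report)
```

Calling `missing_report(df)` on the real dataset should flag only `deck`, given the percentages printed above.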

#### Dealing with null values

In [5]:
df2 = df.copy()

##### Imputing null values in `age`

In [6]:
df2['age'] = df2['age'].fillna(df2['age'].mean())

##### Imputing null values in `embarked` and `embark_town`

In [60]:
df2['embark_town'] = df2['embark_town'].fillna(df2['embark_town'].mode()[0])
df2['embarked'] = df2['embarked'].fillna(df2['embarked'].mode()[0])

##### Dropping the `deck` column

In [58]:
df2.drop('deck', axis=1, inplace=True)
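
After imputing and dropping, it is worth confirming that no nulls are left. The sketch below repeats the same three steps (mean for a numeric column, mode for a categorical one, dropping a sparse column) on a toy frame and then counts the remaining nulls. The frame `toy` and its values are assumptions made for illustration. Note that mean imputation is sensitive to skewed data; the median is a common alternative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 36.0],
    "embarked": ["S", "C", None, "S"],
    "deck": [None, "C", None, None],
})

# Work on a copy so the original frame stays untouched
cleaned = toy.copy()

# Numeric column: fill gaps with the column mean
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())

# Categorical column: fill gaps with the most frequent value
cleaned["embarked"] = cleaned["embarked"].fillna(cleaned["embarked"].mode()[0])

# Mostly empty column: drop it instead of imputing
cleaned = cleaned.drop(columns="deck")

# Total number of nulls left across the whole frame
remaining = int(cleaned.isnull().sum().sum())
print(remaining)
```

The same check on the notebook's frame would be `df2.isnull().sum().sum()`.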

# Steps of Data Wrangling (EDA)
1. Import libraries
2. Import dataset
3. Explore Data
    - Information about data `info()`
    - Data Types `dtypes`
    - Missing values
    - Make sense of data
4. Understanding the Variables
5. Relation between variables `heatmap()`, `pairplot()`, `corr()`
6. Brainstorming
    - Normalization  
        (Normalization is one of the most frequently used data preparation techniques. It rescales the values of numeric columns in the dataset to a common scale.)
    - Removing Outliers
7. Tidy, clean Data
8. Ready for Statistical Analysis
9. Ready for Prediction
10. Ready for Machine Learning
11. Ready for Deep Learning
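
Steps 5 and 6 above can be sketched on a small toy frame. The list does not say which normalization or outlier rule to use, so this example assumes min-max scaling and the 1.5 × IQR rule, two common choices. The helper names `min_max_normalize` and `remove_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd


def min_max_normalize(series: pd.Series) -> pd.Series:
    """Rescale a numeric series to the [0, 1] range (min-max scaling)."""
    span = series.max() - series.min()
    if span == 0:
        # A constant column has no spread; map every value to 0.0
        return series * 0.0
    return (series - series.min()) / span


def remove_outliers_iqr(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value in `column` lies within k * IQR of the quartiles."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    keep = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]


# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 512.33],
})

# Step 5: pairwise correlation between the numeric columns
corr = toy.corr()

# Step 6: remove outliers first, then normalize, so that one extreme
# value does not compress all the other values towards zero
no_outliers = remove_outliers_iqr(toy, "fare")
no_outliers = no_outliers.assign(fare_scaled=min_max_normalize(no_outliers["fare"]))

print(corr)
print(no_outliers)
```

On the notebook's frame, `df2.corr(numeric_only=True)` would restrict the correlation to numeric columns (recent pandas versions require this when non-numeric columns are present), and the result can be passed to `sns.heatmap()`.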