# Seaborn for data visualizations
Author and instructor: ***Dr. Junaid Qazi, PhD***

Let's explore one of the most famous benchmark datasets, ["Titanic: Machine Learning from Disaster"](https://www.kaggle.com/c/titanic/data), from Kaggle.

Follow this link to [Kaggle](https://www.kaggle.com/c/titanic) for a detailed description of the Titanic dataset.<br>

**Data Dictionary**
* PassengerId
* Survived -- 0 = No, 1 = Yes
* Pclass -- Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
* Name -- Passenger name 
* Sex -- male / female 	
* Age -- age in years	
* SibSp -- no. of siblings / spouses aboard the Titanic	
* Parch -- no. of parents / children aboard the Titanic	
* Ticket -- Ticket number	
* Fare -- Passenger fare	
* Cabin -- Cabin number	
* Embarked -- Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton

First things first, let's import some libraries. By this stage, I am sure these libraries are not new to you!

```Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid') # just optional!
%matplotlib inline

# Setting the inline figure format to retina to see better quality images.
# (IPython.display.set_matplotlib_formats is deprecated in recent IPython versions)
%config InlineBackend.figure_format = 'retina'

# Lines below are just to ignore warnings
import warnings
warnings.filterwarnings('ignore')
```

Let's read `titanic.csv` into a DataFrame named `train`.

```Python
train = pd.read_csv('titanic.csv')
```

```Python
train.head()
```

## Exploratory Data Analysis - EDA
Let's overview the dataset using `info()` first!

```Python
train.info()
```

**Missing Data**<br>
So, we have 891 entries in our train dataset, with the `Name` column along with other information about each traveler such as passenger class (`Pclass`), `Fare`, `Ticket`, `Cabin` etc. <br>
Notice that the `Age` column has 714 non-null values, whereas `Cabin` has only 204. `Embarked` has 889 non-null values. So there is some data missing!<br> Let's do some calculation to find out the % of missing data in each column!<br>
Remember, we have the `isnull()` function for this situation!

```Python
# Percentage of missing values per column, rounded to one decimal place
pct_missing = round(train.isnull().sum() / train.isnull().count() * 100, 1)
pct_missing.sort_values(ascending=False).head()
```

We have the numbers now!<br>
* `Cabin` column is missing 77.1% of its data
* `Age` column is missing 19.9% of its data
* `Embarked` column is missing 0.2% of its data
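
As a cross-check on the calculation above, here is a minimal sketch of an equivalent shortcut: because `isnull()` returns booleans, the column-wise mean is already the fraction of missing values. The small `toy` frame below is made-up data used only for illustration, not the Titanic file.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (NOT the Titanic data) with a few missing entries
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# True counts as 1 and False as 0, so the mean is the share of missing values
pct_missing_toy = (toy.isnull().mean() * 100).round(1)
print(pct_missing_toy.sort_values(ascending=False))
```

The same one-liner should work on `train` as well, assuming it has been loaded as shown earlier.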

<font style="font-size:12px;color:green;">*Recall and refresh your skills in dealing with missing data; we are going to use those skills at a later stage.*</font>

`isnull()` returns `True` for all the places where the data is missing. Our dataset is large, so we had better think about a graphical view, using seaborn's `heatmap` function to visualize the missing data! <br>
Let's try!!

```Python
# heatmap using seaborn, you can set the figure size if you want!
plt.figure(figsize = (12,8))
sns.heatmap(data = train.isnull())
```

The above plot might be ok, but our heatmap can be improved. The `yticklabels` are overlapping and the color bar is not useful in this case.<br>
We can set `yticklabels` and `cbar` to `False` and also use `cmap = 'viridis'` for a cleaner map (you can use a colormap of your choice)!<br>
Let's try again!

```Python
plt.figure(figsize = (12,8))
sns.heatmap(data=train.isnull(), yticklabels=False, cbar=False, cmap='viridis' )
```

The map looks much better now.<br>
Notice that the yellow cells are `True`, which represents the missing data!<br><br>
**Well, we want to know more about the dataset.**<br>
We can use `countplot()` to see how many people survived and how many died!

```Python
sns.countplot(x='Survived', data=train) # try different palettes, such as 'coolwarm' or any other!
```

It's sad that not many passengers survived! <br><br>
**Let's dig a little deeper: pass `hue = 'Sex'` to see the female and male split among the passengers who survived and those who died.**

```Python
sns.countplot(x='Survived', hue='Sex', data=train, palette='coolwarm')
```

The plot suggests that not many males survived, whereas most of the females survived. <br><br>
**We can ask another question here!**<br>
We know there were three passenger classes on the Titanic. Which class survived the most?<br>
`nunique()` or `unique()` on `Pclass`, and `hue = 'Pclass'`, can be useful!

```Python
# Let's check the no of classes again to re-confirm!
train['Pclass'].unique()
```
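
To make the difference between the helpers mentioned above concrete, here is a small sketch on made-up values (not the real `Pclass` column): `unique()` lists the distinct values, `nunique()` counts them, and `value_counts()` also reports how often each one occurs.

```python
import pandas as pd

# Made-up class labels, for illustration only
toy_pclass = pd.Series([3, 1, 3, 2, 3, 1], name="Pclass")

distinct_values = toy_pclass.unique()      # distinct values, in order of first appearance
n_distinct = toy_pclass.nunique()          # number of distinct values
counts = toy_pclass.value_counts()         # occurrences of each value

print(distinct_values, n_distinct)
print(counts)
```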

Just a comment for the next `countplot`: you can use any palette, e.g. `palette='coolwarm'` or `palette='rainbow'`; it's your choice. I am just keeping things simple by using the default one!

```Python
sns.countplot(x='Survived', hue='Pclass', data=train)
```
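
A count plot shows raw counts, so it can help to put numbers next to it. The sketch below uses `pd.crosstab` with `normalize='index'` to turn counts into per-class proportions. The `toy` frame is invented for illustration, so its numbers are not the Titanic figures.

```python
import pandas as pd

# Made-up data (NOT the Titanic file): class and survival flag per passenger
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Each row sums to 1: share of died (0) and survived (1) within a class
rate_by_class = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rate_by_class.round(2))
```

Replacing `toy` with `train` should give the corresponding proportions for the real data, assuming the column names match the data dictionary.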

Excellent!<br>
We got an even better understanding of our data. Now, we know that more than half of the `class-1` passengers survived, whereas most of the `class-3` passengers died.<br><br>
**Let's explore more and see how survival varied with the `Port of Embarkation`.**

```Python
# hue is useful here!
sns.countplot(x='Embarked', data=train, hue='Survived') #hue='Pclass')
```
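
Since the count plot above compares counts rather than rates, here is a minimal sketch of how the survival rate per port could be computed with `groupby()`. The `toy` data is made up to show that a port can have the most survivors and still a lower survival rate; it is not the real dataset.

```python
import pandas as pd

# Made-up data (NOT the Titanic file): 8 passengers from S, 3 from C, 1 from Q
toy = pd.DataFrame({
    "Embarked": ["S"] * 8 + ["C"] * 3 + ["Q"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 0,  1, 1, 0,  0],
})

# Survived is 0/1, so sum = number of survivors and mean = survival rate
summary = toy.groupby("Embarked")["Survived"].agg(
    survivors="sum", passengers="count", rate="mean"
)
print(summary)
```

In this toy example, port S has the most survivors (3) but a lower rate than port C, which is why counts and rates should be read together.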

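So, it looks like most of the survivors embarked at Southampton. Keep in mind, though, that a count plot shows counts, not rates: most of the deaths are from Southampton too, simply because far more passengers boarded there, so this plot alone does not tell us that Southampton passengers had a better chance of survival!<br><br>
This suggests <b>another question</b>: we may want to explore the class of the passengers and their port of embarkation.<br>
Let's pass `hue = 'Pclass'` now! This is again a `countplot`!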

```Python
sns.countplot(x='Embarked', data=train, hue='Pclass')
```

Did you see? The more you explore, the better your understanding of the data and the better the insights!<br>
Now, we see, ***Southampton was actually the busiest port for each class!*** <br><br>

***Furthermore, we can see how many passengers traveled with siblings/spouses and parent/children. We can plot a histogram to know how the age was distributed among the travelers.***<br>
<font style="font-size:16px;color:green;"><br>
&#9758; I encourage you to ask yourself questions and try to apply your EDA skills to learn more about the data. Use different types of plots along with your skills in interactive plotting - recall the skills you acquired in the data visualization section! <b>Practice is the key.</b></font><br><br>

```Python
# Parch -- no. of parents / children aboard the Titanic
sns.countplot(x='Parch', data=train, hue='Survived')
```

* What do you learn from the plot above?
* Is there any trend in survival with the group size?
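
One possible way to approach the group-size question, sketched on made-up data: combine `SibSp` and `Parch` into a single column and look at the survival rate per group size. The column name `FamilySize` is a hypothetical helper introduced here for illustration; it is not part of the dataset.

```python
import pandas as pd

# Made-up data (NOT the Titanic file)
toy = pd.DataFrame({
    "SibSp":    [0, 1, 0, 3, 1, 0, 4, 0],
    "Parch":    [0, 0, 0, 2, 1, 0, 2, 0],
    "Survived": [0, 1, 0, 0, 1, 1, 0, 0],
})

# 'FamilySize' is a hypothetical helper column: relatives traveling with the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# Mean of a 0/1 column = survival rate within each group size
rate_by_size = toy.groupby("FamilySize")["Survived"].mean()
print(rate_by_size)
```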

Moving forward, I just want one more plot to see the age distribution of the passengers on the Titanic!

```Python
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(train['Age'].dropna(), kde=False, color='green', bins=30)
#train['Age'].hist(bins=30,color='green',alpha=0.5) # using pandas data visualizations
```

### Data Cleaning
So, we know from EDA that some data is missing in our dataset, let's deal with that first.<br>
**`Age` column is missing ~ 19.9% of its data.**<br>
A convenient way to fix the `'Age'` column is to fill the missing data with the `mean` (average) age of all passengers. **We can do even better** in this case: because we know that there are three passenger classes, **it's better to fill each missing age with the average age of that passenger's own class.** <br>
Let's use a `boxplot()` to explore whether there is any relationship between class and passenger age.

```Python
plt.figure(figsize=(10, 6)) # setting the figure size, it's subjective
sns.boxplot(x='Pclass', y='Age', data=train, palette='rainbow')
```

Yes, `Pclass` and `Age` are related, and this makes sense: ***on average, the higher the class, the older the passengers traveling in it!*** <br>
So filling the missing `Age` values with respect to the passenger class is a better way to fill in the missing data in the `Age` column!<br>
Before writing a function for this purpose, we may want to know the average age of the passengers in each class, **`groupby()` is useful here!**<br><br>
Let's find the average age of passengers in each class first; we only need the `Pclass` and `Age` columns for this purpose!

```Python
train[['Pclass','Age']].groupby('Pclass').mean() #describe() # try describe with groupby!
```

Now that we have the average age for each class, let's write a custom function to fill in the missing values in the `Age` column. Super easy, we can use an `if-else` conditional statement in the function!

```Python
# Defining a function 'impute_age'
def impute_age(age_pclass):  # age_pclass is one row holding ['Age', 'Pclass'], in that order

    # First value of the row is 'Age' (positional access via .iloc)
    Age = age_pclass.iloc[0]

    # Second value of the row is 'Pclass'
    Pclass = age_pclass.iloc[1]

    # If Age is missing, return the (rounded) average age of the passenger's class
    if pd.isnull(Age):

        if Pclass == 1:
            return 38

        elif Pclass == 2:
            return 30

        else:
            return 25

    else:
        return Age
```

Let's apply the above function to our data now. We can use the `apply()` method and pass `axis = 1` so that the function is applied to each row. (recall from pandas section)

```Python
# grab 'Age' and 'Pclass' (in this order) and apply impute_age, our custom function, row by row
train['Age'] = train[['Age','Pclass']].apply(impute_age, axis=1)
# You may want to revise 'impute_age' function and the statement above! 
```
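
As an alternative to hard-coding the class averages, here is a hedged sketch of the same idea using `groupby().transform('mean')`, which computes each class's mean age from the data itself. This is a swapped-in technique, not the function used above, and the `toy` frame and `Age_filled` column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data (NOT the Titanic file)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age":    [40.0, np.nan, 36.0, 30.0, np.nan, 20.0, 30.0, np.nan],
})

# For every row, the mean age of that row's class (NaNs are skipped by mean)
class_mean_age = toy.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing ages; known ages are left untouched
toy["Age_filled"] = toy["Age"].fillna(class_mean_age)
print(toy)
```

The trade-off is that the filled values follow the data automatically, whereas the hand-written function makes the chosen numbers explicit.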

```Python
plt.figure(figsize = (12,8)) # just fig size
# Let's try to re-plot the heatmap now!
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
```

So, we got this done, ***no more yellow color in the `Age`*** column. This means we have filled all the missing values in the `Age` column using the `impute_age` function.

Now, there is another column, **`Cabin`, with ~ 77.1% of its data missing.**<br>
This is quite a lot of missing information, so for the moment we can simply drop this column!

```Python
# dropping 'Cabin' column, axis =1 for column and inplace = True for permanent change!
train.drop('Cabin', axis=1, inplace=True)
```
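
The decision to drop a column can also be expressed as a rule. The sketch below drops every column whose share of missing values exceeds a cutoff. The 50% cutoff (`max_missing`) is an arbitrary assumption for illustration, and the `toy` frame is made-up data.

```python
import numpy as np
import pandas as pd

# Made-up data (NOT the Titanic file)
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0, 54.0, 2.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare":  [7.25, 71.28, 8.05, 51.86, 21.08],
})

max_missing = 0.5  # hypothetical cutoff: drop columns that are more than 50% missing

# Share of missing values per column, then the columns above the cutoff
missing_share = toy.isnull().mean()
cols_to_drop = missing_share[missing_share > max_missing].index.tolist()

trimmed = toy.drop(columns=cols_to_drop)
print(cols_to_drop, list(trimmed.columns))
```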

Let's see what the `heatmap` looks like now!

```Python
plt.figure(figsize = (12,8))
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
```

So, we don't have the `Cabin` column in our data now, ***the only yellow left is in the `Embarked` column, which is missing only 0.2% of its values.***<br>

***Let's drop any remaining missing values in the dataset now; this will essentially drop the rows with missing `Embarked` data. We will re-plot the `heatmap` after this operation.***

```Python
plt.figure(figsize = (12,8))
train.dropna(inplace=True)
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
```
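
Besides looking at the heatmap, the result of `dropna()` can be checked numerically. This is a minimal sketch on made-up data that counts how many rows were dropped and confirms that no missing values remain.

```python
import numpy as np
import pandas as pd

# Made-up data (NOT the Titanic file) with one missing 'Embarked' value
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Embarked": ["S", np.nan, "C", "S"],
})

rows_before = len(toy)
cleaned = toy.dropna()  # drops every row that has at least one missing value

rows_dropped = rows_before - len(cleaned)
remaining_missing = int(cleaned.isnull().sum().sum())
print(rows_dropped, remaining_missing)
```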