# Pandas - visualize pandas data frames

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
titanic3 = pd.read_csv('titanic3.csv')

titanic3

## Data visualization

Data visualization is an important step in data analysis because it can reveal relationships among the variables in our data.

We can use the DataFrame `titanic3` and create a frequency plot (histogram) of the fare paid by the passengers.

Note that pandas has built-in plotting functionality, built on top of matplotlib.

In [None]:
titanic3.hist(column = 'Fare')

plt.show()

Alternatively, we can use matplotlib directly, which gives explicit control over the number of bins and the axis labels.

In [None]:
fig, ax = plt.subplots()

ax.hist(titanic3['Fare'], bins = 30)
ax.set_ylabel('Frequency')
ax.set_xlabel('Fare')

plt.show()
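
The two approaches can also be combined: pandas plotting methods accept an `ax` argument, so pandas can draw onto a matplotlib Axes that we create and style ourselves. The following is a minimal sketch that uses a synthetic stand-in for the `Fare` column (the values are randomly generated, not taken from `titanic3.csv`).

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'Fare' column (assumption: not the real data)
rng = np.random.default_rng(0)
fares = pd.DataFrame({'Fare': rng.exponential(scale=30, size=200)})

fig, ax = plt.subplots()

# pandas draws the histogram onto the Axes we created
fares['Fare'].plot.hist(bins=30, ax=ax)

# the usual matplotlib methods still work on the same Axes
ax.set_xlabel('Fare')
ax.set_ylabel('Frequency')

# one rectangle is drawn per bin
n_bars = len(ax.patches)
```

With the real data, `fares` would be replaced by `titanic3`, and `plt.show()` can be called at the end as in the cells above.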

Let's use plots to visually inspect the relationship between the probability of survival and other variables:

- Distribution of age by sex and survival:

In [None]:
# copy DataFrame
titanic4 = titanic3.copy()

# drop rows with NaN in any column
titanic4.dropna(inplace = True)
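
Called without arguments, `dropna` removes every row that has a missing value in any column, which can discard many rows if an unrelated column is sparse. The sketch below contrasts this with the `subset` argument on a toy frame; the `Cabin` column and all values are hypothetical and only serve to illustrate the difference.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two different columns
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 54.0],
    'Cabin': ['C85', 'E46', None, None],
    'Survived': [1, 1, 0, 0],
})

# drops every row that has a NaN in ANY column
dropped_any = toy.dropna()

# drops only the rows where 'Age' is missing
dropped_age = toy.dropna(subset=['Age'])

print(len(toy), len(dropped_any), len(dropped_age))
```

Whether restricting the drop to `Age` is appropriate depends on which columns of `titanic3` actually contain missing values.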

In [None]:
# filter rows with female passengers
women = titanic4[titanic4['Sex'] == 'female']

# filter rows with male passengers
men = titanic4[titanic4['Sex'] == 'male']

In [None]:
women

In [None]:
men

In [None]:
fig, ax = plt.subplots()

# women that did not survive
ax.hist(women[women['Survived'] == 0]['Age'], bins = 30, 
        alpha = 0.5, 
        label = 'Not survived')

# women that did survive
ax.hist(women[women['Survived'] == 1]['Age'], bins = 30, 
        alpha = 0.5, 
        label = 'Survived')

ax.legend()
ax.set_xlabel('Age')
ax.set_ylabel('Count')
ax.set_yticks(np.linspace(0,20,21))

plt.show()

In [None]:
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 4))

# plot
ax[0].hist(women[women['Survived'] == 1]['Age'], bins = 30, 
           alpha = 0.5, 
           label = 'Survived')

ax[0].hist(women[women['Survived'] == 0]['Age'], bins = 30, 
           alpha = 0.5, 
           label = 'Not survived')

ax[1].hist(men[men['Survived'] == 1]['Age'], bins = 30, 
           alpha = 0.5, 
           label = 'Survived')

ax[1].hist(men[men['Survived'] == 0]['Age'], bins = 30, 
           alpha = 0.5, 
           label = 'Not survived')

# add labels and titles
ax[0].set_title('Female')
ax[1].set_title('Male')
ax[0].set_xlabel('Age')
ax[1].set_xlabel('Age')

# add legends
ax[0].legend()
ax[1].legend()

plt.show()

    - Two key findings:
        1. More females than males survived.
        2. Among males aged 18-40, more did not survive than survived.
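
Findings read off a histogram can be checked numerically. A minimal sketch, assuming columns named `Sex` and `Survived` as in the cells above and using a small toy frame (hypothetical values, not the real passenger list):

```python
import pandas as pd

# Toy data (hypothetical), same column names as titanic3
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0],
})

# number of passengers in each Sex/Survived combination
counts = pd.crosstab(toy['Sex'], toy['Survived'])

# share of survivors within each sex
rates = toy.groupby('Sex')['Survived'].mean()

print(counts)
print(rates)
```

Applying the same two calls to `titanic4` would give the counts behind the histograms above.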

- Probability of survival by `Fam` (number of family members on board):

In [None]:
y = titanic3.groupby(['Fam'])['Survived'].mean()
y

Get the index for plotting purposes:

In [None]:
x = y.index
x
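
Extracting the index is only needed when calling matplotlib directly. As an alternative sketch, a pandas Series can plot itself and uses its index as the x-values automatically; the toy data below is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical), same column names as titanic3
toy = pd.DataFrame({
    'Fam': [0, 0, 1, 1, 2, 2, 2],
    'Survived': [0, 1, 1, 1, 0, 0, 1],
})

# mean of a 0/1 column = share of survivors per group
y = toy.groupby('Fam')['Survived'].mean()

fig, ax = plt.subplots(figsize=(7, 4))

# the Series index (Fam) becomes the x-axis without extracting it
y.plot(ax=ax, marker='o')

# the plotted line carries the index values as x-data
line = ax.lines[0]
```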

In [None]:
fig, ax = plt.subplots(figsize = (7, 4))

# plot
ax.plot(x, y, marker = 'o')

# labels
ax.set_ylabel('Survival')
ax.set_xlabel('Family members')
ax.set_title('Survival by # of family members on board')

# hide the right and top spines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

plt.show()

    - Two key findings:
        1. No one with 7 or 10 relatives on board survived.
        2. The probability of survival is highest for passengers with 3 relatives on board.

In [None]:
titanic3[titanic3['Fam']==10] # only 1 family with 10 members
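
A survival rate of 0 or 1 is less informative when it rests on very few passengers. One way to see this at a glance is to compute the group size alongside the mean. A minimal sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy data (hypothetical), same column names as titanic3
toy = pd.DataFrame({
    'Fam': [0, 0, 0, 0, 1, 1, 1, 10],
    'Survived': [0, 1, 0, 1, 1, 1, 0, 0],
})

# survival rate and number of passengers per family size
summary = toy.groupby('Fam')['Survived'].agg(['mean', 'count'])

print(summary)
```

In this toy example the group with `Fam == 10` has a survival rate of 0, but it contains only a single passenger.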

- Probability of survival by Pclass:

In [None]:
pclass = titanic3.groupby(['Pclass'])['Survived'].mean()
pclass

In [None]:
fig, ax = plt.subplots()

# plot
ax.bar(['1st class', '2nd class', '3rd class'], pclass, color = 'red', alpha=0.5)
# add label (the bar height is the mean of Survived, i.e. a probability)
ax.set_ylabel('Survival probability')

plt.show()

    - The key finding is that the probability of survival is related to Pclass.
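
The bar chart above hard-codes the three class labels, which silently mislabels the bars if the grouping ever returns a different set or order of classes. A possible variant derives the labels from the grouped index instead. This is a sketch on toy data; the values are hypothetical and do not reflect the real survival rates.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical), same column names as titanic3
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

pclass = toy.groupby('Pclass')['Survived'].mean()

# build the labels from the index so they always match the bars
labels = [f'Class {c}' for c in pclass.index]

fig, ax = plt.subplots()
bars = ax.bar(labels, pclass.values, color='red', alpha=0.5)
ax.set_ylabel('Survival probability')
ax.set_ylim(0, 1)  # probabilities live on a 0-1 scale

heights = [bar.get_height() for bar in bars]
```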