**Multivariate Exploratory Data Analysis (EDA)** lets us understand the relationships between variables and identify patterns and trends in the data. In this notebook, we perform a multivariate EDA on the Titanic dataset, using the `seaborn` library to create visualizations of how the variables in the dataset relate to one another.

In [None]:
#import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np 

plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['figure.dpi'] = 150

In [None]:
#load data set
df = pd.read_csv("./datasets/titanic.csv")
df.head()

In [None]:
df.shape

Let's look at how many values are missing in each column of the dataset.

In [None]:
total = df.isna().sum().sort_values(ascending=False)
total
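Raw counts are easier to judge next to the share of rows they represent. The sketch below shows both side by side on a small made-up frame; `toy` and its values are hypothetical stand-ins, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# Count of missing values per column, largest first
missing_count = toy.isna().sum().sort_values(ascending=False)

# isna().mean() is the fraction of missing rows; scale it to percent
missing_pct = (toy.isna().mean() * 100).round(1)

# Combine both views; concat aligns the two series on the column names
missing_summary = pd.concat(
    [missing_count, missing_pct], axis=1, keys=["missing", "percent"]
)
print(missing_summary)
```

The same two lines should work on `df` in place of `toy`.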

In [None]:
# Count of survived by gender
counts = df.groupby(df.Gender)[['Survived']].value_counts()
print(counts)

# % of survived by gender
percs = df.groupby(df.Gender)[['Survived']].value_counts(normalize=True)
print(percs)

summary = pd.concat([counts, percs], axis=1, keys=['count', '%'])
print(summary)
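The counts and percentages above can also be read from a two-way table. The following is a minimal sketch using `pd.crosstab` on made-up data; the frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    "Gender": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})

# Raw counts: one row per gender, one column per outcome
count_table = pd.crosstab(toy["Gender"], toy["Survived"])

# normalize="index" divides each row by its total, giving per-gender shares
share_table = pd.crosstab(toy["Gender"], toy["Survived"], normalize="index")

print(count_table)
print(share_table.round(2))
```

Normalizing by row answers "what share of each gender survived", which is the comparison that matters when the two groups differ in size.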

More female than male passengers survived, so gender may be a useful attribute when analyzing a passenger's chance of survival. Let's visualize the survival counts for male and female passengers.

In [None]:
# Exploratory Data Analysis using seaborn

# Map the values of the "Survived" column to "not_survived" and "survived"
if set(df['Survived']).issubset({0, 1}):
    df['Survived'] = df['Survived'].map({0:"not_survived", 1:"survived"})
    print(df['Survived'])

# Create a figure with two subplots
fig, ax = plt.subplots(1, 2, figsize = (7, 6))
fig.tight_layout(pad=3.0)

# Plot the number of passengers by gender in the first subplot
df["Gender"].value_counts().plot.bar(color = "skyblue", ax = ax[0])
ax[0].set_title("Number Of Passengers By Gender")
ax[0].set_ylabel("Population")

# Plot the count of survived and non-survived passengers by gender 
# in the second subplot
sns.countplot(data=df, x="Gender", hue="Survived", ax = ax[1])
ax[1].set_title("Gender: Survived vs Dead")


Let's visualize the number of survivors in each passenger class (`Pclass`).

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (7, 6))
fig.tight_layout(pad=3.0)

# Plot the number of passengers by Pclass in the first subplot
df["Pclass"].value_counts().plot.bar(color = "skyblue", ax = ax[0])
ax[0].set_title("Number Of Passengers By Pclass")
ax[0].set_ylabel("Population")

# Plot the count of survived and dead passengers by Pclass in the second subplot
sns.countplot(data=df, x="Pclass", hue="Survived", ax = ax[1])
ax[1].set_title("Pclass: Survived vs Dead")
plt.show()

Pclass 3 had the largest number of passengers, and most of them did not survive. Most Pclass 1 passengers survived.
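Bar heights show counts, but the per-class survival rate makes the comparison explicit. Below is a minimal sketch on made-up data that uses the same `"survived"` / `"not_survived"` labels as the mapping applied earlier; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data using the notebook's string labels
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": ["survived", "survived", "not_survived",
                 "survived", "not_survived",
                 "not_survived", "not_survived", "not_survived", "survived"],
})

# Boolean flag: True where the passenger survived
survived_flag = toy["Survived"].eq("survived")

# The mean of a boolean series is the share of True values in each group
rate_by_class = survived_flag.groupby(toy["Pclass"]).mean()
print(rate_by_class.round(2))
```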

In [None]:
# Fill missing values in the "Embarked" column with "S"
df["Embarked"] = df["Embarked"].fillna("S")
# Display the updated dataframe
df.head()
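The fill value `"S"` is hardcoded in the cell above. One way to check or derive such a value is to take the most frequent category with `mode()`. The sketch below does this on a made-up column; `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() skips NaN by default and returns a Series; take its first entry
most_common = toy["Embarked"].mode().iloc[0]

# Replace the missing entries with the most frequent category
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(most_common, toy["Embarked"].isna().sum())
```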

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (7, 6))
fig.tight_layout(pad=3.0)

# Plot the number of passengers by Embarked in the first subplot
df["Embarked"].value_counts().plot.bar(color = "skyblue", ax = ax[0])
ax[0].set_title("Number Of Passengers By Embarked")
ax[0].set_ylabel("Number")

# Plot the count of survived and dead passengers by Embarked in the second subplot
sns.countplot(data=df, x="Embarked", hue="Survived", ax = ax[1])
ax[1].set_title("Embarked: Survived vs Dead")


In [None]:
# Plot the distribution of the "Age" column
# The `dropna()` function is used to remove any missing values (`NaN`) 
# from the 'Age' column before creating the plot. 
# The `kde` parameter is set to `True` to display the Kernel Density Estimate plot.
ax = sns.histplot(df['Age'].dropna(), kde=True, color="skyblue")
ax.lines[0].set_color('brown')
plt.title("Distribution of Age")

Now let's run our first multivariate analysis. The goal is to examine the Titanic variables Survived, Pclass, Fare and Age together; as a warm-up, we first build a pair plot on the simpler Iris dataset.

In [None]:
# Load the Iris dataset
iris = sns.load_dataset('iris')

# Create a scatter plot matrix
sns.pairplot(iris, hue='species', diag_kind='hist')

We use Seaborn's `pairplot` function to create a scatter plot matrix.
We specify `hue='species'` to color the data points by iris species.
We set `diag_kind='hist'` to display histograms on the diagonal.
The resulting scatter plot matrix shows how the four measurements (sepal length, sepal width, petal length, and petal width) relate to each other for each species. We can observe patterns such as a strong linear relationship between petal length and petal width.

In [None]:
sns.set(style="ticks", color_codes=True)
# Pair plot (matrix scatterplot) of few columns
sns.pairplot(df, height=2, vars = [ 'Fare','Age','Pclass'], hue="Survived")

**Correlation Matrix**
Let's start with a simple example using the iris dataset. We will create a correlation matrix using the `corr` method of the DataFrame. We will then use the `heatmap` function of the `seaborn` library to create a heatmap of the correlation matrix. The heatmap will allow us to visualize the correlation between different variables in the dataset.

In [None]:
# Load the Iris dataset
iris = sns.load_dataset('iris')

# The `corr()` computes pairwise correlation of numeric columns in the DataFrame
correlation_matrix = iris.corr(method='pearson', numeric_only=True)
round(correlation_matrix, 2)

In [None]:
sns.heatmap(correlation_matrix, cmap='YlOrBr', annot=True, fmt=".2f", square=True)
plt.title("Correlation Heatmap")

`YlOrBr` is a sequential colormap, so darker (brown) cells mark higher correlation values and lighter (yellow) cells mark lower values, including negative ones.
We can observe a strong positive correlation between petal length and petal width.
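To make the numbers in the matrix less of a black box, the sketch below computes the Pearson coefficient by hand and compares it with `DataFrame.corr`. The helper `pearson_r` and the frame `toy` are hypothetical names introduced for illustration, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a nearly linear relationship
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
})

def pearson_r(a, b):
    """Sum of products of centered values over the product of their norms."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

manual_r = pearson_r(toy["x"].to_numpy(), toy["y"].to_numpy())
pandas_r = toy.corr(method="pearson").loc["x", "y"]
print(round(manual_r, 4), round(pandas_r, 4))
```

The two values should agree up to floating-point error, since both follow the same definition.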

Let's compute the correlation matrix for the Titanic dataset. First, we map the `Embarked` values to integers so that `Embarked` can be included in the correlation analysis.

In [None]:
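# Map the values of the "Embarked" column to the integer codes 1, 2, and 3.
# Missing values were already filled with "S" earlier, so no code is needed for NaN.
# The codes are arbitrary labels, so read correlations involving Embarked with care.
if set(df['Embarked']).issubset({'S', 'C', 'Q'}):
    df['Embarked'] = df['Embarked'].map({"S": 1, "C": 2, "Q": 3})
    #print(df['Embarked'])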

# The `corr()` computes pairwise correlation of numeric columns in the DataFrame
correlation_matrix = df.drop('PassengerId', axis=1).corr(method='pearson', numeric_only=True)
correlation_matrix
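If the earlier cell that maps `Survived` to the strings `"survived"` and `"not_survived"` has been run, that column is no longer numeric, and `numeric_only=True` leaves it out of the matrix. The sketch below, on made-up data, shows the effect and one possible way to bring the label back as a 0/1 column; `toy` and `Survived_num` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data using the notebook's string labels
toy = pd.DataFrame({
    "Survived": ["survived", "not_survived", "survived", "not_survived", "survived"],
    "Pclass": [1, 3, 1, 3, 2],
    "Fare": [80.0, 7.5, 60.0, 8.0, 20.0],
})

# String columns are dropped when numeric_only=True
without_target = toy.corr(method="pearson", numeric_only=True)

# Add a 0/1 copy of the label so it takes part in the correlation
toy["Survived_num"] = toy["Survived"].map({"not_survived": 0, "survived": 1})
with_target = toy.corr(method="pearson", numeric_only=True)

print(list(without_target.columns))
print(with_target["Survived_num"].round(2))
```

Keeping the string labels for plotting and a separate numeric copy for correlation avoids having to undo the mapping.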

**Heatmap**

In [None]:
# Plots a correlation heatmap for the given dataframe
plt.figure(figsize=(14,12))

# The `heatmap()` function is used to plot rectangular data as a color-encoded matrix.
# The `cmap` parameter is used to set the color map to "YlOrBr" (Yellow-Orange-Brown) 
# The `annot` parameter is set to `True` to write the data value in each cell.
# The `fmt` parameter is set to ".2f" to format the data value with two decimal places.
# The `vmax` parameter is set to 0.6 to set the maximum value for the color bar.
# The `square` parameter is set to `True` to make the plot square.

sns.heatmap(correlation_matrix, cmap='YlOrBr', annot=True, fmt=".2f",
            vmax=0.6, square=True)
plt.title("Correlation Heatmap")