<div align="center"><img src="../images/LKYCIC_Header.jpg"></div>

**Table of contents**<a id='toc0_'></a>    
- [3-02: Data Visualization](#toc1_)    
  - [Basics of Matplotlib](#toc1_1_)    
    - [Key Elements: Figure and Axes](#toc1_1_1_)    
  - [Seaborn and Matplotlib](#toc1_2_)    
  - [Exploratory Data Analysis (EDA)](#toc1_3_)    
    - [Data Summarisation](#toc1_3_1_)    
    - [Quick Summary Statistics](#toc1_3_2_)    
    - [Univariate Analysis: Categorical](#toc1_3_3_)    
    - [Ordinal VS Nominal](#toc1_3_4_)    
  - [Another Dataset: Numerical Attributes](#toc1_4_)    
    - [Multivariate Analysis](#toc1_4_1_)    
    - [Univariate Analysis: Numerical](#toc1_4_2_)    
    - [Numerical VS Numerical](#toc1_4_3_)    
  - [Next Section](#toc1_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[3-02: Data Visualization](#toc0_)

For datasets with multiple variables, **Exploratory Data Analysis (EDA)** is an essential method for understanding **data distributions and intercorrelations**. 

We will create high-quality visualisations using Python libraries like **Matplotlib** and **Seaborn**, enabling clear communication of insights.

## <a id='toc1_1_'></a>[Basics of Matplotlib](#toc0_)

Matplotlib is one of the most popular visualisation packages in Python.

It can be thought of as the **Python equivalent of ggplot in R**.

Many advanced visualisation packages are built on top of it.

### <a id='toc1_1_1_'></a>[Key Elements: Figure and Axes](#toc0_)

In [None]:
#%pip install matplotlib

In [None]:
#importing matplotlib to plot the graphs
import matplotlib.pyplot as plt
#to avoid pop-ups & show graphs inline with the code
%matplotlib inline
#pandas is required to read the input dataset
import pandas as pd

`Figure`: The overall container for all plot elements. It can contain multiple Axes.

`Axes`: It is a component of the Figure that defines a subplot.

It manages all the details within the subplot. We can customize the x-axis and y-axis limits, labels, and the type of graph.
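
As a minimal sketch of how a `Figure` and its `Axes` relate, the example below creates one Figure with a single Axes and sets the limits, labels, and graph type mentioned above. The data values are made up for illustration.

```python
import matplotlib.pyplot as plt

# Made-up data, used only to have something to draw
x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

# One Figure containing one Axes
fig, ax = plt.subplots()

ax.plot(x, y)            # the type of graph: a line plot
ax.set_xlim(0, 5)        # x-axis limits
ax.set_ylim(0, 20)       # y-axis limits
ax.set_xlabel('x')       # axis labels
ax.set_ylabel('y = x^2')
ax.set_title('A single Axes inside a Figure')

# The Figure keeps track of the Axes it contains
print(len(fig.axes))

plt.show()
```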

In [None]:
#subplot with 1 row & 2 cols
fig, ax = plt.subplots(1,2)

In [None]:
#subplot with 1 row & 2 cols
fig, (ax1, ax2) = plt.subplots(1,2)

ax1.set_title('Plot 1')

ax2.set_title('Plot 2')

You can also customise the size of the figure with `figsize`, given as (width, height) in inches:

In [None]:
#subplot with 2 rows & 2 cols
fig, ax = plt.subplots(2,2, figsize=(10, 5))

`figsize` sets the size of the whole figure, so individual subplots (like `ax1` and `ax2`) cannot be sized separately with it.

However, you can adjust their relative sizes with the `gridspec_kw` parameter of `plt.subplots`, which controls the proportions of the subplot grid.

In [None]:
# Customising the relative size of plots
fig, (ax1, ax2) = plt.subplots(1, 2, gridspec_kw={'width_ratios': [2, 1]})

ax1.set_title('Plot 1')

ax2.set_title('Plot 2')

plt.show()
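
One way to check the effect of `width_ratios` is to compare the widths of the two Axes. This is a small sketch: `get_position()` returns the bounding box of an Axes as a fraction of the figure, and the variable names `w1` and `w2` are only illustrative.

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, gridspec_kw={'width_ratios': [2, 1]})

# Width of each Axes as a fraction of the figure width
w1 = ax1.get_position().width
w2 = ax2.get_position().width

print(f"ax1 is {w1 / w2:.1f} times as wide as ax2")

plt.show()
```

The `height_ratios` key works the same way for the rows of the grid.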

## <a id='toc1_2_'></a>[Seaborn and Matplotlib](#toc0_)


Seaborn is a visualisation library that is **built on top of Matplotlib**.

Its key feature is that it simplifies the creation of publication-quality figures with **minimal code**.

| [Seaborn Gallery](https://seaborn.pydata.org/examples/index.html)                   | [Matplotlib Gallery](https://matplotlib.org/stable/gallery/)                                                                |
|-----------------------------------------------------------------------------|--------------------------------------------------------------------------------|
| ![Seaborn Gallery](../images/seaborn_gallery.jpg) | ![](../images/matplotlib_gallery.jpg) |

| **Dimension**   | **Seaborn**                                                                 | **Matplotlib**                                                                 |
|------------------|-----------------------------------------------------------------------------|--------------------------------------------------------------------------------|
| **Ease of Use**  | Minimal code. Default settings are well-optimised. | Steeper learning curve for beginners. |
| **Customisation**| Limited customisation, relying on Matplotlib's underlying support. | Extremely customisable, with nearly all visual elements adjustable. |
| **Aesthetics**   | Modern default themes, appealing styles, and diverse color palettes. | Basic default styles but offers complete control to achieve aesthetic goals with extra configuration. |  
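
To make the comparison concrete, here is a hedged sketch that draws the same count-based bar chart with each library. The `toy` DataFrame and its `mode` column are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data for illustration only
toy = pd.DataFrame({'mode': ['bike', 'bus', 'bike', 'walk', 'bike', 'bus']})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Matplotlib: count the categories first, then draw and label the bars
counts = toy['mode'].value_counts()
ax1.bar(counts.index, counts.values)
ax1.set_title('Matplotlib')
ax1.set_xlabel('mode')
ax1.set_ylabel('count')

# Seaborn: countplot does the counting and labels the axes itself
sns.countplot(x='mode', data=toy, ax=ax2)
ax2.set_title('Seaborn')

plt.tight_layout()
plt.show()
```

Passing `ax=ax2` places the Seaborn plot on an existing Matplotlib Axes, so all the usual Matplotlib customisation remains available afterwards.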

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## <a id='toc1_3_'></a>[Exploratory Data Analysis (EDA)](#toc0_)

Exploratory Data Analysis (EDA) is an approach to analysing datasets to summarise their main characteristics, often using visual methods. 

**A quick way to understand the data.**

<div align="center">
    <img src="../images/eda_cheatsheet.png">
    <br><b>EDA cheat sheet</b>
    <br>Source: <u>https://www.visual-design.net/post/semi-automated-exploratory-data-analysis-process-in-python</u>
</div>

### <a id='toc1_3_1_'></a>[Data Summarisation](#toc0_)

Data summarisation means calculating summary statistics such as the mean, median, standard deviation, and range.
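
As a quick illustration of these statistics, the sketch below computes each of them with pandas on a made-up series of numbers (not the survey data loaded next).

```python
import pandas as pd

# Made-up values for illustration only
values = pd.Series([2, 4, 4, 5, 7, 9, 11])

mean = values.mean()
median = values.median()
std = values.std()                          # sample standard deviation (ddof=1)
value_range = values.max() - values.min()   # range = max - min

print(f"mean={mean:.2f}, median={median}, std={std:.2f}, range={value_range}")
```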

In [1]:
import pandas as pd

In [2]:
df_cate = pd.read_csv('../data/raw/part_i/assembly-bike-survey-data.csv', index_col="RespondentID")

df_cate.head(2)

Unnamed: 0_level_0,Q1-Location,Q2-Age,Q3-Gender,Q4-BikeOwner,Q5-StartedCycling,Q6-WhenStarted,Q8-SuperhighwayUsed,Q9-SuperhighwayFrequency,Q10-Width,Q10-Surface,...,Q19-Frequency,Q20-Reason,Q20-ReasonOther,Q21-Duration,Q22-SupportCentre,Q23-Problem,Q24-SupportExperience,Q25-SupportComments,Q26-Improvements,Q27-Reason
RespondentID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1203127707,Kent,30-39,Male,Yes,"No, I already cycled",1 year,"Superhighway 7, Merton to the City via A24-A3",Only tried it once,fair,bad,...,Once a fortnight,A combination of the above,"Tube, walking, bus and train.",Less than 30 mins,No,,,,There should be the provision to be able to hi...,
1202170092,Camden,40-49,Male,No,"Yes, because of the Cycle Hire Scheme",,,,,,...,Several times a week,A combination of the above,,Less than 30 mins,No,,,,The docking point opposite the Black Cats fact...,


### <a id='toc1_3_2_'></a>[Quick Summary Statistics](#toc0_)

Generate summary statistics for the dataset using `.describe()`:

In [6]:
df_cate.describe()

Unnamed: 0,Q1-Location,Q2-Age,Q3-Gender,Q4-BikeOwner,Q5-StartedCycling,Q6-WhenStarted,Q8-SuperhighwayUsed,Q9-SuperhighwayFrequency,Q10-Width,Q10-Surface,...,Q19-Frequency,Q20-Reason,Q20-ReasonOther,Q21-Duration,Q22-SupportCentre,Q23-Problem,Q24-SupportExperience,Q25-SupportComments,Q26-Improvements,Q27-Reason
count,956,1294,1291,1286,1284,1019,701,709,704,703,...,750,726,404,745,751,375,404,314,413,437
unique,238,5,2,2,4,3,3,5,9,7,...,5,7,396,4,2,373,5,312,409,403
top,Lambeth,30-39,Male,Yes,"No, I already cycled",Longer,"Superhighway 7, Merton to the City via A24-A3",Several times a week,good,good,...,Several times a week,A combination of the above,tube and walking,Less than 30 mins,No,#NAME?,Good,See above,#NAME?,I have my own bike
freq,100,549,987,1080,984,911,425,239,242,271,...,348,417,3,716,376,2,93,2,4,15


This is the default layout, with one column per survey question. To make the summary easier to read, you can transpose the output table with `.T`:

In [5]:
df_cate.describe().T

Unnamed: 0,count,unique,top,freq
Q1-Location,956,238,Lambeth,100
Q2-Age,1294,5,30-39,549
Q3-Gender,1291,2,Male,987
Q4-BikeOwner,1286,2,Yes,1080
Q5-StartedCycling,1284,4,"No, I already cycled",984
Q6-WhenStarted,1019,3,Longer,911
Q8-SuperhighwayUsed,701,3,"Superhighway 7, Merton to the City via A24-A3",425
Q9-SuperhighwayFrequency,709,5,Several times a week,239
Q10-Width,704,9,good,242
Q10-Surface,703,7,good,271


In [None]:
df_cate.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1297 entries, 1203127707 to 1155430443
Data columns (total 37 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Q1-Location                  956 non-null    object
 1   Q2-Age                       1294 non-null   object
 2   Q3-Gender                    1291 non-null   object
 3   Q4-BikeOwner                 1286 non-null   object
 4   Q5-StartedCycling            1284 non-null   object
 5   Q6-WhenStarted               1019 non-null   object
 6   Q8-SuperhighwayUsed          701 non-null    object
 7   Q9-SuperhighwayFrequency     709 non-null    object
 8   Q10-Width                    704 non-null    object
 9   Q10-Surface                  703 non-null    object
 10  Q10-Signs                    692 non-null    object
 11  Q10-Parking                  634 non-null    object
 12  Q11-SuperhighwayComments     495 non-null    object
 13  Q12-SuperhighwayRespect

The `.columns` attribute returns the **names of all the columns** in the table:

In [10]:
df_cate.columns

Index(['Q1-Location', 'Q2-Age', 'Q3-Gender', 'Q4-BikeOwner',
       'Q5-StartedCycling', 'Q6-WhenStarted', 'Q8-SuperhighwayUsed',
       'Q9-SuperhighwayFrequency', 'Q10-Width', 'Q10-Surface', 'Q10-Signs',
       'Q10-Parking', 'Q11-SuperhighwayComments', 'Q12-SuperhighwayRespect',
       'Q13-SuperhighwaySafety', 'Q14-SuperhighwaySuggestions',
       'Q15-SuperhighwayReason', 'Q16-HireRegistration',
       'Q17-HireRegistration', 'Q17-FindingStation', 'Q17-BikeAvailability',
       'Q17-Unlocking', 'Q17-Bike', 'Q17-Returning', 'Q17-Payment',
       'Q17-Value', 'Q18-HireComments', 'Q19-Frequency', 'Q20-Reason',
       'Q20-ReasonOther', 'Q21-Duration', 'Q22-SupportCentre', 'Q23-Problem',
       'Q24-SupportExperience', 'Q25-SupportComments', 'Q26-Improvements',
       'Q27-Reason'],
      dtype='object')
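
Because the column names share a question prefix, they can be filtered as strings. The following is a minimal sketch on an empty made-up frame whose column names mimic the survey's naming pattern; `q17_cols` is an illustrative name.

```python
import pandas as pd

# Made-up frame whose column names mimic the survey's naming pattern
toy = pd.DataFrame(columns=['Q16-HireRegistration', 'Q17-Bike', 'Q17-Payment', 'Q18-HireComments'])

# Build a boolean mask over the column names, then keep the matching ones
q17_cols = toy.columns[toy.columns.str.startswith('Q17-')]
print(list(q17_cols))
```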

### <a id='toc1_3_3_'></a>[Univariate Analysis: Categorical](#toc0_)

In [None]:
# Frequency Table
print("Frequency Table:")
print(df_cate['Q5-StartedCycling'].value_counts())

# Visualisations
plt.figure(figsize=(12, 6))

# Bar Plot
plt.subplot(1, 2, 1) # 1 row, 2 cols, 1st subplot
sns.countplot(x='Q5-StartedCycling', data=df_cate, palette='viridis', hue='Q5-StartedCycling')
plt.title('Bar Plot of Categories')
plt.xlabel('Q5-StartedCycling')
plt.ylabel('Count')

# Pie Chart
plt.subplot(1, 2, 2) # 1 row, 2 cols, 2nd subplot
# nunique() ignores missing values, so the number of colours matches the number of slices
df_cate['Q5-StartedCycling'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90, colors=sns.color_palette('viridis', df_cate['Q5-StartedCycling'].nunique()))
plt.title('Pie Chart of Categories')
plt.ylabel('')

plt.tight_layout()
plt.show()

### <a id='toc1_3_4_'></a>[Ordinal VS Nominal](#toc0_)

1. Frequency Table

   - The `groupby` method counts the respondents for each combination of `Q19-Frequency` (ordinal) and `Q3-Gender` (nominal). The `unstack()` call pivots `Q3-Gender` into columns, which makes the table easier to read.

2. Bar Plot

   - **`sns.countplot()`**:

     - `x='Q19-Frequency'`: Groups by the ordinal variable.

     - `hue='Q3-Gender'`: Shows counts of the nominal variable within each ordinal category.
     
   - Use `palette='Set2'` to colour-code the categories for clarity.
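
The frequency-table step can be sketched on a small made-up table. The columns `level` (ordinal) and `group` (nominal) are hypothetical and stand in for the real survey columns.

```python
import pandas as pd

# Made-up data; `level` (ordinal) and `group` (nominal) are hypothetical columns
toy = pd.DataFrame({
    'level': ['Low', 'High', 'Medium', 'Low', 'High', 'Low'],
    'group': ['A', 'B', 'A', 'B', 'B', 'A'],
})

# Declare the order explicitly; otherwise the levels would sort alphabetically
toy['level'] = pd.Categorical(toy['level'], categories=['Low', 'Medium', 'High'], ordered=True)

# Count rows per (level, group) pair and pivot `group` into columns
freq = toy.groupby(['level', 'group'], observed=False).size().unstack()
print(freq)

# pd.crosstab builds a similar table in one call
print(pd.crosstab(toy['level'], toy['group']))
```

Declaring the ordinal variable as an ordered `Categorical` is what keeps the rows of the table, and the bars of the plot, in their natural order.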

In [None]:
df_cate.head(1)

In [None]:
# Convert 'Q19-Frequency' to an ordinal type
df_cate['Q19-Frequency'] = pd.Categorical(df_cate['Q19-Frequency'], categories=["Only tried it once", "Occasionally", "Once a fortnight", "Once a week", "Several times a week"], ordered=True)

# Frequency Table
print("Frequency Table:")
# observed=False keeps every declared frequency category in the table
print(df_cate.groupby(['Q19-Frequency', 'Q3-Gender'], observed=False).size().unstack())

# Bar Plot
plt.figure(figsize=(10, 6))
sns.countplot(x='Q19-Frequency', hue='Q3-Gender', data=df_cate, palette='Set2')
plt.title('Distribution of Q19-Frequency by Q3-Gender')
plt.xlabel('Q19-Frequency')
plt.ylabel('Count')
plt.legend(title='Q3-Gender')
plt.show()

## <a id='toc1_4_'></a>[Another Dataset: Numerical Attributes](#toc0_)

Let's load another dataset.

It contains socioeconomic statistics aggregated to planning areas in Singapore.

In [None]:
df_num = pd.read_csv('../data/raw/part_iii/socioeco_2015_sg.csv', index_col="zone")

print(df_num.head())

# The metadata file describes the columns of the dataset
meta_df_num = pd.read_csv('../data/raw/meta_information/socioeco_2015_sg_meta.csv')

print(meta_df_num)

df_num.describe()

### <a id='toc1_4_1_'></a>[Multivariate Analysis](#toc0_)

Correlation matrix analysis: compute the pairwise correlation between the numerical attributes and display it as a heatmap.

In [None]:
# Compute the correlation matrix
# numeric_only=True skips any non-numeric columns (assumes pandas 1.5 or later)
corr_matrix = df_num.corr(numeric_only=True)

# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)

# Add a title
plt.title("Correlation Matrix Heatmap")
plt.show()
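
To show what the numbers in such a heatmap mean, here is a hedged sketch on made-up data: column `b` is built from `a`, so the two should correlate strongly, while `c` is independent noise. All names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up data: `b` is built from `a`, `c` is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.5, size=200),
    'c': rng.normal(size=200),
})

# DataFrame.corr() computes the Pearson correlation by default
corr = toy.corr()
print(corr.round(2))

# The same coefficient for one pair, computed with NumPy
r_ab = np.corrcoef(toy['a'], toy['b'])[0, 1]
print(round(r_ab, 2))
```

The matrix is symmetric with ones on the diagonal, because every variable correlates perfectly with itself.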

### <a id='toc1_4_2_'></a>[Univariate Analysis: Numerical](#toc0_)

In [None]:
# Visualisations
plt.figure(figsize=(12, 6))

# Histogram
plt.subplot(1, 3, 1)
sns.histplot(df_num['SOMPAPT2'], kde=False, bins=30, color='skyblue')
plt.title('Histogram')

# Boxplot
plt.subplot(1, 3, 2)
sns.boxplot(y=df_num['SOMPAPT2'], color='lightgreen')
plt.title('Boxplot')

# KDE Plot
plt.subplot(1, 3, 3)
sns.kdeplot(df_num['SOMPAPT2'], fill=True, color='orange')
plt.title('KDE Plot')

plt.tight_layout()
plt.show()
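
The boxplot summarises a distribution through its quartiles. As a rough sketch on made-up values, the quantities it draws can be computed directly. The 1.5 × IQR rule used here matches Seaborn's default whisker setting, and the variable names are illustrative.

```python
import pandas as pd

# Made-up values for illustration only
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1 = values.quantile(0.25)       # lower edge of the box
median = values.quantile(0.50)   # line inside the box
q3 = values.quantile(0.75)       # upper edge of the box
iqr = q3 - q1                    # interquartile range (height of the box)

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```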

### <a id='toc1_4_3_'></a>[Numerical VS Numerical](#toc0_)

In [None]:
# Scatter Plot (axes-level function: draws on the current Matplotlib figure)
plt.figure(figsize=(6, 6))
sns.scatterplot(x='MHI10K2', y='DivSep2', data=df_num, color='blue', alpha=0.7)
plt.title('Scatter Plot: MHI10K2 vs DivSep2')

# Joint Plot (figure-level function: sns.jointplot creates its own figure)
sns.jointplot(x='MHI10K2', y='DivSep2', data=df_num, kind='reg', height=6, ratio=4, marginal_kws={'bins': 30})
plt.suptitle('Joint Plot: MHI10K2 vs DivSep2', y=1.02)

plt.show()
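
The joint plot with `kind='reg'` overlays a fitted straight line on the scatter. As a hedged sketch of what such a line represents, the example below fits a least-squares line with NumPy on made-up data. It is not the routine Seaborn uses internally, and `x` and `y` are illustrative.

```python
import numpy as np

# Made-up data: y depends linearly on x, plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 100, 200)
y = 0.5 * x + rng.normal(0, 10, 200)

# Least-squares straight line: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

The fitted slope should land close to the 0.5 used to generate the data, and `r` summarises how tightly the points follow the line.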


## <a id='toc1_5_'></a>[Next Section](#toc0_)

Go to [3-03-01: Static Mapping](./3-03-01_staticmapping.ipynb)