Source: https://www.kaggle.com/learn/data-visualization

# Table of Contents
1. [Visualization Basics](#chapter_1)
2. [Line Charts](#chapter_2)
3. [Bar Charts and Heatmaps](#chapter_3)
4. [Scatterplots](#chapter_4)
5. [Distributions](#chapter_5)

<a class="anchor" id="chapter_1"></a>

## 1. Visualization Basics


In [None]:
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

In [None]:
# Path of the file to read
fifa_filepath = "../data/Dataset_Fifa.csv"

# Read the file into a variable fifa_data
# Two new parameters: index_col="Date" --> use the "Date" column as the row index
# and parse_dates=True --> interpret each row label as a date (not as a string or integer)
fifa_data = pd.read_csv(fifa_filepath, index_col="Date", parse_dates=True)
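To see what `index_col` and `parse_dates` do without needing the CSV file, here is a minimal sketch on a tiny in-memory table. The column names `TeamA` and `TeamB` and all values are made up for illustration and are not taken from the FIFA dataset.

```python
import io
import pandas as pd

# Tiny in-memory CSV with made-up values, standing in for a real file
csv_text = """Date,TeamA,TeamB
1993-08-08,5.0,8.0
1993-09-23,12.0,1.0
1993-10-22,9.0,1.0
"""

# index_col="Date" makes the column named "Date" the row index;
# parse_dates=True asks pandas to parse that index as dates
demo = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

# The index should now be a DatetimeIndex rather than plain strings
print(type(demo.index).__name__)

# Rows can then be looked up by timestamp
value = demo.loc[pd.Timestamp("1993-09-23"), "TeamB"]
print(value)
```

A date-typed index is what lets seaborn place the points on a proper time axis instead of treating each label as an unrelated category.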


In [None]:
# Print the first 5 rows of the data - FIFA rankings
fifa_data.head()

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(16,6))

# Line chart showing how FIFA rankings evolved over time 
sns.lineplot(data=fifa_data)

<a class="anchor" id="chapter_2"></a>

## 2. Line Charts

The dataset for this tutorial tracks global daily streams on the music streaming service Spotify. We focus on five popular songs from 2017 and 2018:

* "Shape of You", by Ed Sheeran
* "Despacito", by Luis Fonsi
* "Something Just Like This", by The Chainsmokers and Coldplay
* "HUMBLE.", by Kendrick Lamar
* "Unforgettable", by French Montana


In [None]:
# Path of the file to read
spotify_filepath = "../data/Dataset_Spotify.csv"

# Read the file into a variable spotify_data
spotify_data = pd.read_csv(spotify_filepath, index_col="Date", parse_dates=True)

In [None]:
# Print the first 5 rows of the data (only data for "Shape of You", because it was released earlier than the others)
spotify_data.head()

In [None]:
# Print the last five rows of the data
spotify_data.tail()

In [None]:
# Line chart showing daily global streams of each song 
sns.lineplot(data=spotify_data)

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,6))

# Add title
plt.title("Daily Global Streams of Popular Songs in 2017-2018")

# Line chart showing daily global streams of each song 
sns.lineplot(data=spotify_data)

In [None]:
# List the column names, so we can pick a subset of the data to plot
list(spotify_data.columns)

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,6))

# Add title
plt.title("Daily Global Streams of Popular Songs in 2017-2018")

# Line chart showing daily global streams of 'Shape of You'
sns.lineplot(data=spotify_data['Shape of You'], label="Shape of You")

# Line chart showing daily global streams of 'Despacito'
sns.lineplot(data=spotify_data['Despacito'], label="Despacito")

# Add label for horizontal axis
plt.xlabel("Date")

<a class="anchor" id="chapter_3"></a>

## 3. Bar Charts and Heatmaps

In this tutorial, we'll work with a dataset from the US Department of Transportation that tracks flight delays.

Opening this CSV file in Excel shows a row for each month (where 1 = January, 2 = February, etc.) and a column for each airline code.

Each entry shows the average arrival delay (in minutes) for a different airline and month (all in year 2015). Negative entries denote flights that (on average) tended to arrive early. For instance, the average American Airlines flight (airline code: AA) in January arrived roughly 7 minutes late, and the average Alaska Airlines flight (airline code: AS) in April arrived roughly 0.3 minutes early.
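Once a table like this is loaded with the month as the index, individual entries can be read with `.loc[month, airline_code]`. The sketch below uses a two-row stand-in table. The January AA and April AS values are rounded to match the description above, and the other two values are made up.

```python
import pandas as pd

# Stand-in table: NOT the real dataset. 7.0 and -0.3 follow the rounded
# figures quoted in the text; the remaining two values are placeholders.
demo = pd.DataFrame(
    {"AA": [7.0, -0.5], "AS": [-1.0, -0.3]},
    index=pd.Index([1, 4], name="Month"),
)

aa_january = demo.loc[1, "AA"]  # row label 1 = January
as_april = demo.loc[4, "AS"]    # row label 4 = April

# Positive values mean late on average, negative values mean early
for name, delay in [("AA in January", aa_january), ("AS in April", as_april)]:
    status = "late" if delay > 0 else "early"
    print(f"{name}: {abs(delay)} minutes {status} on average")
```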

In [None]:
# Path of the file to read
flight_filepath = "../data/Dataset_Flight_Delays.csv"

# Read the file into a variable flight_data
flight_data = pd.read_csv(flight_filepath, index_col="Month")

In [None]:
# Print the data
flight_data

In [None]:
# Say we'd like to create a bar chart showing the average arrival delay for Spirit Airlines (airline code: NK) flights, by month.

# Set the width and height of the figure
plt.figure(figsize=(10,6))

# Add title
plt.title("Average Arrival Delay for Spirit Airlines Flights, by Month")

# Bar chart showing average arrival delay for Spirit Airlines flights by month
#sns.barplot(x=flight_data.index, y=flight_data['NK'])
sns.barplot(x=flight_data.index, y=flight_data['NK'], color="red")

# Add label for vertical axis
plt.ylabel("Arrival delay (in minutes)")

In [None]:
help(sns.barplot)

In [None]:
# Heatmaps!
# We create a heatmap to quickly visualize patterns in flight_data. Each cell is color-coded according to its corresponding value.

# Set the width and height of the figure
plt.figure(figsize=(14,7))

# Add title
plt.title("Average Arrival Delay for Each Airline, by Month")

# Heatmap showing average arrival delay for each airline by month
sns.heatmap(data=flight_data, annot=True, cmap="rocket_r")

# Add label for horizontal axis
plt.xlabel("Airline")

In [None]:
# What patterns can you detect in the heatmap?
# For instance, the months toward the end of the year (especially months 9-11) appear relatively light for all airlines.
# Because the reversed colormap (rocket_r) maps smaller values to lighter colors, this suggests that airlines are better (on average) at keeping schedule during these months!
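A pattern read off a heatmap can be checked numerically by averaging each month's delays across all airlines. This is a minimal sketch on made-up numbers for three hypothetical airline codes (`X1`, `X2`, `X3`); with the real table, the same two lines would be applied to `flight_data`.

```python
import pandas as pd

# Made-up delays for three hypothetical airlines (not the real flight data)
demo = pd.DataFrame(
    {
        "X1": [6.0, 9.0, -1.0, 0.5],
        "X2": [4.0, 7.5, -2.0, 1.5],
        "X3": [8.0, 5.5, 0.0, -1.0],
    },
    index=pd.Index([1, 2, 9, 10], name="Month"),
)

# Average across airlines (columns) for each month (row)
monthly_mean = demo.mean(axis=1)

# Months sorted from smallest to largest average delay
print(monthly_mean.sort_values())

# Month with the smallest average delay
best_month = monthly_mean.idxmin()
print(best_month)
```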

<a class="anchor" id="chapter_4"></a>

## 4. Scatterplots

Scatterplots are great to show the relationship between 2 or 3 variables!

We'll work with a (synthetic) dataset of insurance charges, to see if we can understand why some customers pay more than others.

In [None]:
# Path of the file to read
insurance_filepath = "../data/Dataset_Insurance.csv"

# Read the file into a variable insurance_data
insurance_data = pd.read_csv(insurance_filepath)

In [None]:
# more info here: https://www.kaggle.com/datasets/mirichoi0218/insurance

insurance_data.head()

In [None]:
# simple scatterplot
plt.figure(figsize=(14,6))
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'])

The scatterplot above suggests that body mass index (BMI) and insurance charges are positively correlated: customers with higher BMI tend to pay more in insurance costs. (This pattern makes sense, since high BMI is typically associated with higher risk of chronic disease.)

To double-check the strength of this relationship, you might like to add a regression line, or the line that best fits the data. We do this by changing the command to sns.regplot.

In [None]:
# draw a regression line
plt.figure(figsize=(14,6))

sns.regplot(x=insurance_data['bmi'], y=insurance_data['charges'], line_kws={"color": "red"})
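The regression line can be complemented with two numbers: the Pearson correlation and the slope of a least-squares line. The sketch below computes both on synthetic data generated for this example (the slope of 400 and the noise level are arbitrary assumptions, not properties of the insurance dataset).

```python
import numpy as np
import pandas as pd

# Synthetic data: charges rise with BMI, plus random noise
rng = np.random.default_rng(0)
bmi = rng.uniform(18, 45, size=200)
charges = 400 * bmi + rng.normal(0, 3000, size=200)
demo = pd.DataFrame({"bmi": bmi, "charges": charges})

# Pearson correlation: sign gives direction, magnitude gives strength
r = demo["bmi"].corr(demo["charges"])

# Least-squares straight line: charges ~ slope * bmi + intercept
slope, intercept = np.polyfit(
    demo["bmi"].to_numpy(), demo["charges"].to_numpy(), deg=1
)

print(f"correlation: {r:.2f}")
print(f"slope: {slope:.1f} per BMI unit")
```

A correlation near 0 would mean the fitted line explains little of the spread, even if the line itself slopes upward.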

We can use scatter plots to display the relationships between (not two, but...) three variables! One way of doing this is by color-coding the points.

For instance, to understand how smoking affects the relationship between BMI and insurance costs, we can color-code the points by 'smoker', and plot the other two columns ('bmi', 'charges') on the axes.

In [None]:
plt.figure(figsize=(14,6))
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'], hue=insurance_data['smoker'])



This scatter plot shows that while nonsmokers tend to pay slightly more with increasing BMI, smokers pay MUCH more.

To further emphasize this fact, we can use the sns.lmplot command to add two regression lines, corresponding to smokers and nonsmokers. (You'll notice that the regression line for smokers has a much steeper slope, relative to the line for nonsmokers!)


In [None]:
# sns.lmplot is a figure-level function that creates its own figure, so plt.figure() is not needed here
# Leaving the line color unset lets each group (smoker / nonsmoker) keep its own color
sns.lmplot(x="bmi", y="charges", hue="smoker", data=insurance_data)
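The difference in steepness between the two lines can also be expressed as two fitted slopes, one per group. This sketch fits a separate least-squares line for each `smoker` value on synthetic data; the true slopes of 1500 and 100 are arbitrary assumptions chosen to make the contrast visible.

```python
import numpy as np
import pandas as pd

# Synthetic data with a steeper BMI effect for the "yes" group
rng = np.random.default_rng(1)
n = 150
bmi = rng.uniform(18, 45, size=2 * n)
smoker = np.array(["yes"] * n + ["no"] * n)
true_slope = np.where(smoker == "yes", 1500.0, 100.0)
charges = true_slope * bmi + rng.normal(0, 2000, size=2 * n)
demo = pd.DataFrame({"bmi": bmi, "smoker": smoker, "charges": charges})

# Fit one straight line per group and keep only the slope
slopes = {}
for group, rows in demo.groupby("smoker"):
    slope, _ = np.polyfit(
        rows["bmi"].to_numpy(), rows["charges"].to_numpy(), deg=1
    )
    slopes[group] = slope

for group, slope in slopes.items():
    print(f"smoker={group}: slope {slope:.0f} per BMI unit")
```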

In [None]:
# Categorical scatter plot!!
sns.swarmplot(x=insurance_data['smoker'],
              y=insurance_data['charges'])

Among other things, this plot shows us that
* on average, non-smokers are charged less than smokers, and
* the customers who pay the most are smokers; whereas the customers who pay the least are non-smokers.
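Both observations can be backed by a small summary table with one row per group. This is a minimal sketch on six made-up customers, not the insurance dataset.

```python
import pandas as pd

# Six made-up customers for illustration
demo = pd.DataFrame(
    {
        "smoker": ["yes", "yes", "yes", "no", "no", "no"],
        "charges": [17000.0, 34000.0, 47000.0, 1200.0, 6500.0, 21000.0],
    }
)

# Minimum, mean and maximum charge within each group
summary = demo.groupby("smoker")["charges"].agg(["min", "mean", "max"])
print(summary)

# Which group contains the highest and the lowest charge overall
highest_group = summary["max"].idxmax()
lowest_group = summary["min"].idxmin()
```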


<a class="anchor" id="chapter_5"></a>

## 5. Distributions
We'll work with a dataset of 150 different flowers, or 50 each from three different species of iris (Iris setosa, Iris versicolor, and Iris virginica).

In [None]:
# more info:
# https://en.wikipedia.org/wiki/Iris_flower_data_set

# Path of the file to read
iris_filepath = "../data/Dataset_Iris.csv"

# Read the file into a variable iris_data
iris_data = pd.read_csv(iris_filepath, index_col="Id")

# Print the first 5 rows of the data
iris_data.head()

In [None]:
iris_data.info()

In [None]:
# Histogram 
# What's interesting here? Could petal length already be a distinguishing feature?
sns.histplot(iris_data['Petal Length (cm)'])

In [None]:
# KDE plot ("kernel density estimate")
sns.kdeplot(data=iris_data['Petal Length (cm)'], fill=True)
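To make "kernel density estimate" less abstract, the sketch below builds one by hand: a small Gaussian bump is centered on every sample and the bumps are averaged. The sample values and the bandwidth of 0.4 are assumptions made for this example; seaborn chooses its own bandwidth by default, so its curve will generally look different.

```python
import numpy as np

# Made-up measurements with two clusters
samples = np.array([1.4, 1.5, 1.3, 4.5, 4.7, 5.1, 5.9, 6.0])
bandwidth = 0.4  # width of each bump (arbitrary choice for this sketch)

# Points at which the density is evaluated
grid = np.linspace(samples.min() - 3, samples.max() + 3, 500)

# One Gaussian bump per sample: rows are grid points, columns are samples
z = (grid[:, None] - samples[None, :]) / bandwidth
kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of all bumps
density = kernels.mean(axis=1)

# A density should integrate to about 1 (simple Riemann sum)
dx = grid[1] - grid[0]
area = density.sum() * dx
print(f"area under curve: {area:.3f}")
```

A smaller bandwidth gives a spikier curve that follows individual samples, while a larger one smooths the clusters together.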

In [None]:
# histogram for different species by petal length
sns.histplot(data=iris_data, x='Petal Length (cm)', hue='Species')


In [None]:
# Add title
plt.title("Distribution of Petal Lengths, by Species")

sns.kdeplot(data=iris_data, x = 'Petal Length (cm)', hue = 'Species', fill=True)
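Whether two groups overlap in a plot like this can be checked by comparing their value ranges. The sketch below uses two hypothetical species labels (`A`, `B`) and made-up lengths, so it shows the technique only and says nothing about the actual iris measurements.

```python
import pandas as pd

# Made-up petal lengths for two hypothetical species
demo = pd.DataFrame(
    {
        "Species": ["A", "A", "A", "B", "B", "B"],
        "Petal Length (cm)": [1.3, 1.5, 1.7, 4.0, 4.6, 5.0],
    }
)

# Smallest and largest value within each species
ranges = demo.groupby("Species")["Petal Length (cm)"].agg(["min", "max"])
print(ranges)

# Two groups do not overlap if one's maximum lies below the other's minimum
separated = ranges.loc["A", "max"] < ranges.loc["B", "min"]
print(separated)
```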

In [None]:
help(sns.kdeplot)