# An Introduction to Interactive Statistical Visualization
- Made by Houpu Li for CRP class Introduction for Urban Data Science

## <font color='F2700A'>Part I: Vega-Altair</font>  
Vega-Altair is an interactive statistical visualization library for Python, built on [Vega](https://vega.github.io/vega/) and [Vega-Lite](https://vega.github.io/vega-lite/). It offers a powerful and concise grammar for quickly building a wide range of statistical visualizations. In Altair, datasets are most commonly provided as a pandas DataFrame. The Vega-Altair website is a good place to learn more.  
<iframe src="https://altair-viz.github.io/" width=100% height=700></iframe>

Install the `altair` library using the following command:

In [1]:
# pip install "altair[all]"

In [2]:
# import altair libraries
import altair as alt

# import fundamental libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import os

# import dataset api
from vega_datasets import data

In [3]:
# This code sets the display option to show all columns in a pandas DataFrame.
pd.set_option('display.max_columns', None)

### Basic Statistical Visualization by Vega-Altair

- <font size=2>01.Simple Bar Chart: `mark_bar()`</font>  
- <font size=2>02.Simple Scatter Plot: `mark_point()/mark_circle()`</font>  
- <font size=2>03.Simple Stacked Area Chart: `mark_area()`</font>
- <font size=2>04.Simple Heatmap: `mark_rect()`</font>  

><font color = 'F2700A'>Simple Bar Chart and the Histogram:</font>  
    - <font size=2>*Simple Histogram:*<br>`alt.Chart(df,title='xxxx').mark_bar().encode(x='cat_column',y='count()',color='cat_column').interactive()`</font>  
    - <font size=2>*Simple Bar Chart:*<br>`alt.Chart(df,title='xxxx').mark_bar().encode(x='conti_column',y='cat_column',color='cat_column').interactive()`</font>  

***Case Study 01: Analysis and Visualization of the Car Dataset***  
In this case study, we explore the cars dataset, focusing on the relationships among car attributes such as the number of cylinders, miles per gallon (MPG), origin, and horsepower. The dataset comes from the `cars` data in the `vega_datasets` package, which describes various cars from 1970 to 1982. We use Altair, a declarative statistical visualization library for Python, to analyze and visualize these relationships.

In [4]:
# load a sample dataset as a pandas DataFrame
from vega_datasets import data
cars = data.cars()
cars.head()

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA


<font size=2>First, let's group the data by the `Origin` column and then compute descriptive statistics for selected columns within each group.</font>

In [5]:
cars.groupby(['Origin'])[['Miles_per_Gallon','Cylinders','Displacement','Horsepower','Weight_in_lbs','Acceleration']].describe()

Unnamed: 0_level_0,Miles_per_Gallon,Miles_per_Gallon,Miles_per_Gallon,Miles_per_Gallon,Miles_per_Gallon,Miles_per_Gallon,Miles_per_Gallon,Miles_per_Gallon,Cylinders,Cylinders,Cylinders,Cylinders,Cylinders,Cylinders,Cylinders,Cylinders,Displacement,Displacement,Displacement,Displacement,Displacement,Displacement,Displacement,Displacement,Horsepower,Horsepower,Horsepower,Horsepower,Horsepower,Horsepower,Horsepower,Horsepower,Weight_in_lbs,Weight_in_lbs,Weight_in_lbs,Weight_in_lbs,Weight_in_lbs,Weight_in_lbs,Weight_in_lbs,Weight_in_lbs,Acceleration,Acceleration,Acceleration,Acceleration,Acceleration,Acceleration,Acceleration,Acceleration
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Origin,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2,Unnamed: 42_level_2,Unnamed: 43_level_2,Unnamed: 44_level_2,Unnamed: 45_level_2,Unnamed: 46_level_2,Unnamed: 47_level_2,Unnamed: 48_level_2
Europe,70.0,27.891429,6.72393,16.2,24.0,26.5,30.65,44.3,73.0,4.150685,0.490783,4.0,4.0,4.0,4.0,6.0,73.0,109.465753,22.371908,68.0,96.0,105.0,121.0,183.0,71.0,81.0,20.813457,46.0,69.5,77.0,90.5,133.0,73.0,2431.493151,490.883617,1825.0,2065.0,2246.0,2800.0,3820.0,73.0,16.821918,3.010917,12.2,14.5,15.7,19.0,24.8
Japan,79.0,30.450633,6.090048,18.0,25.7,31.6,34.05,46.6,79.0,4.101266,0.590414,3.0,4.0,4.0,4.0,6.0,79.0,102.708861,23.140126,70.0,86.0,97.0,119.0,168.0,79.0,79.835443,17.819199,52.0,67.0,75.0,95.0,132.0,79.0,2221.227848,320.497248,1613.0,1985.0,2155.0,2412.5,2930.0,79.0,16.172152,1.954937,11.4,14.6,16.4,17.55,21.0
USA,249.0,20.083534,6.402892,9.0,15.0,18.5,24.0,39.0,254.0,6.283465,1.662883,4.0,4.0,6.0,8.0,8.0,254.0,247.935039,98.647798,85.0,151.0,250.0,318.0,455.0,250.0,119.9,39.989482,52.0,88.0,106.0,150.0,230.0,254.0,3372.700787,791.695866,1800.0,2721.25,3380.5,4054.75,5140.0,254.0,14.94252,2.804542,8.0,13.0,15.0,16.7,22.2


<font size=2>Simple Histogram  
***The Visualization Between Number of Cylinders and Origin***  
First, we created a histogram chart to show the distribution of the number of cylinders across cars from different origins. This chart clearly presents the count of cars with each cylinder number, differentiated by origin using various colors. This way, we can visually compare how cars from different origins (e.g., USA, Europe, and Japan) vary in terms of cylinder numbers. This not only helps us understand the demand for car performance in different markets but also reflects the technological level and market strategy of the automobile industry in each origin.</font>

In [6]:
alt.Chart(cars,title="The Count of Cylinders by Origin").mark_bar().encode(x='Cylinders',y='count()',color='Origin').interactive()

<font size=2>Simple Bar Chart  
***The Visualization Between Miles Per Gallon (MPG) and Origin***  
Next, we create a bar chart to show the relationship between average miles per gallon (MPG) and the cars' origin. Plotting the average MPG for each origin on the X-axis lets us compare the fuel efficiency of cars from different origins. The chart shows that European and Japanese cars generally have higher fuel efficiency than American cars, which may reflect differing emphasis on fuel economy and technological investment across regions.

To allow for more flexibility in how data are visualized, `Altair` has a built-in syntax for aggregation of data. For example, we can calculate the average miles per gallon within origin groups and illustrate the difference using a bar chart.</font>

In [7]:
alt.Chart(cars,title="Average Miles per Gallon by Origin").mark_bar().encode(x='average(Miles_per_Gallon)',y='Origin',color='Origin').interactive()

><font color = 'F2700A'>02.Simple Scatter Plot:</font>  
    - <font size=2>`alt.Chart(df,title='xxxx').mark_point()/mark_circle().encode(x='conti_column',y='conti_column',color='cat_column').interactive()`</font>  

<font size=2>***The Visualization Between Horsepower and Miles Per Gallon by Origin***  
Lastly, we explored the relationship between horsepower and miles per gallon (MPG) through a scatter plot, differentiated by origin with color coding. This scatter plot offers a perspective to observe whether cars with higher horsepower tend to have lower fuel efficiency and whether this trend is consistent across cars from all origins. Additionally, the color coding helps us identify the performance differences in these two attributes among cars from different origins.</font>

In [8]:
# make the Scatter plot by mark_point
chart1 = alt.Chart(cars,title="Scatter Plot of the Horsepower vs. Miles Per Gallon by Origin").mark_point().encode(
             x='Horsepower',
             y='Miles_per_Gallon',
             color='Origin',
         ).interactive()

chart1

In [9]:
# make the Scatter plot by mark_circle
chart1 = alt.Chart(cars,title="Scatter Plot of the Horsepower vs. Miles Per Gallon by Origin").mark_circle().encode(
             x='Horsepower',
             y='Miles_per_Gallon',
             color='Origin',
         ).interactive()

chart1

><font color = 'F2700A'>Simple Stacked Area Chart:</font>  
    - <font size=2>`alt.Chart(df,title='xxxx').mark_area().encode(x='time_column',y='conti_column',color='cat_column').interactive()`</font>  

***Case Study 02: Analysis and Visualization of the net electricity generation within Iowa State***  
In this case study, we examine net electricity generation by source in Iowa, focusing on how the contribution of each energy source has changed over time. The dataset is loaded with the `iowa_electricity` function in the `vega_datasets` package and records net electricity generation from fossil fuels, nuclear energy, and renewables in Iowa from 2001 to 2017. The visualization shows which sources are growing or shrinking and how each source's share of the overall power supply has shifted.

In [10]:
# load a sample dataset as a pandas DataFrame
ele = data.iowa_electricity()
ele.head()

Unnamed: 0,year,source,net_generation
0,2001-01-01,Fossil Fuels,35361
1,2002-01-01,Fossil Fuels,35991
2,2003-01-01,Fossil Fuels,36234
3,2004-01-01,Fossil Fuels,36205
4,2005-01-01,Fossil Fuels,36883


In [11]:
chart2 = alt.Chart(ele,title='Electricity Net Generation by Source').mark_area().encode(
             x="year",
             y="net_generation",
             color="source"
         ).interactive()

chart2

><font color = 'F2700A'>Simple Heatmap:</font>  
    - <font size=2>`alt.Chart(df,title='xxxx').mark_rect().encode(x='cat_column',y='cat_column',color='conti_column',tooltip=[]).interactive()`</font>  

***Case Study 03: Analysis and Visualization of the temperature variation within Seattle***  
In this case study, we explore the daily maximum temperatures in Seattle, Washington, using a dataset loaded with the `seattle_weather` function in the `vega_datasets` package. The dataset provides detailed weather information over several years, including temperature readings, precipitation levels, and weather types. Our focus is on visualizing how the maximum daily temperature varies throughout the year, highlighting seasonal patterns and extreme temperature events.

<font size=2>Here, we use the `alt.X`, `alt.Y`, and `alt.Color` classes to customize the X and Y axes and the color scheme. Because the `date` column is in datetime format, we can use `date(date)` to extract the day of the month, `month(date)` to extract the month, and `monthdate(date)` to get the month and day together. `temp_max` records the highest temperature of each day, and `max(temp_max)` takes the highest of those values among all rows that share the same month and day, that is, across the different years in the dataset.</font>
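
The grouping behind the heatmap can be sketched in plain Python. The readings below are hypothetical, and the loop only imitates what `month(date)`, `date(date)`, and `max(temp_max)` do together. It is not how Altair implements them.

```python
from datetime import date

# Hypothetical readings: (date, temp_max) pairs spanning two years.
readings = [
    (date(2012, 1, 1), 12.8),
    (date(2013, 1, 1), 5.0),
    (date(2012, 7, 4), 21.1),
    (date(2013, 7, 4), 27.2),
]

# month(date) and date(date) keep only the month and the day of the month,
# so readings from different years land in the same heatmap cell.
hottest = {}
for day, temp_max in readings:
    cell = (day.month, day.day)
    # max(temp_max): keep the highest value seen for this cell.
    hottest[cell] = max(temp_max, hottest.get(cell, float("-inf")))

print(hottest)
```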

In [12]:
# load a sample dataset as a pandas DataFrame
temp = data.seattle_weather()
temp.head()

Unnamed: 0,date,precipitation,temp_max,temp_min,wind,weather
0,2012-01-01,0.0,12.8,5.0,4.7,drizzle
1,2012-01-02,10.9,10.6,2.8,4.5,rain
2,2012-01-03,0.8,11.7,7.2,2.3,rain
3,2012-01-04,20.3,12.2,5.6,4.7,rain
4,2012-01-05,1.3,8.9,2.8,6.1,rain


In [13]:
chart3 = alt.Chart(temp, title="Daily Max Temperatures (C) in Seattle, WA").mark_rect().encode(
             alt.X("date(date):O").title("Day").axis(format="%e", labelAngle=0), # X-axis represents the day of the month extracted from the 'date' field, and the label angle is set to 0 degrees.
             alt.Y("month(date):O").title("Month"),  # Y-axis represents the month extracted from the 'date' field.
             alt.Color("max(temp_max)").title(None), # Color encoding is mapped to the maximum value of daily maximum temperatures ('temp_max' field)
             tooltip=[
                 alt.Tooltip("monthdate(date)", title="Date"),
                 alt.Tooltip("max(temp_max)", title="Max Temp"), # Tooltips to show when hovering over a cell, displaying the date and maximum temperature for that day.
             ],
         ).configure_view(
             step=20,       # The size of each cell is set to 20 pixels.
             strokeWidth=0  # The stroke width of the cell borders is set to 0 pixels.
         ).configure_axis(
             domain=False   # The domain line is removed from the axis.
         ).interactive()

chart3

<font color = 'F2700A'>Define the type of data  
    - T: Temporal (time data representing moments or durations in time, such as a `datetime` column)  
    - O: Ordinal (categories with a specific order, such as school grade levels)  
    - Q: Quantitative (numerical data that supports mathematical operations, such as income)  
    - N: Nominal (categories or groups with no order or mathematical meaning, such as "red", "green", "white")
</font>

><font color = 'F2700A'>Publish your Visualization to html</font>  

In [14]:
# chart1.save('chart1.html')

### Exploring Some Other Advanced Interactive Visualization Examples by Vega-Altair

#### Bar Charts & Histogram

- <font size=2>01.Bar Chart Highlighting Values beyond a Threshold</font>  
- <font size=2>02.Bar Chart with Negative Values</font>  
- <font size=2>03.Faceted Stacked Bar Chart</font>  
- <font size=2>04.Layered Histogram</font>  

01.Bar Chart Highlighting Values beyond a Threshold  
<font size=2>Imagine we are analyzing historical wheat production over several years using the `data.wheat` dataset. Our objective is to visualize the yearly wheat production and highlight the years in which production exceeded a threshold, set here at 60 thousand tons. This visualization helps stakeholders such as farmers, investors, and policy makers understand production trends and make informed decisions.</font>

In [15]:
# Load the wheat production dataset.
wheat_pro = data.wheat()

# Generate the base bar chart for wheat production across years.
bars = alt.Chart(wheat_pro).mark_bar().encode(
    x="year:O",   # The x-axis represents years, treated as ordinal data.
    y="wheat:Q",  # The y-axis represents wheat production quantity, treated as quantitative data.
)

# Generate highlighted bars for years where wheat production exceeds the threshold.
highlight = alt.Chart(wheat_pro).mark_bar(color="#e45755").encode(
    x='year:O',
    y='baseline:Q',  # Start of the highlighted segment (baseline), set at 60 units.
    y2='wheat:Q'
).transform_filter(
    alt.datum.wheat > 60  # Keep only the years where production is above 60 units.
).transform_calculate("baseline", "60") # Calculate the baseline (starting point) for highlighting.

# Generate a horizontal benchmark line to mark the 60 threshold.
threshold = pd.DataFrame([{"threshold": 60}]) 
rule = alt.Chart(threshold).mark_rule().encode(
    y='threshold:Q'    # Position the rule at the 60-unit mark on the y-axis.
)

# Combine all three components and set the width of the chart.
chart1 = (bars + highlight + rule).properties(width=600,title='Annual Wheat Production Trends and Threshold Exceedances').interactive()  # properties define the figure size

chart1

02.Bar Chart with Negative Values  
<font size=2>We examine fluctuations in nonfarm employment in the United States over a specified period using the `data.us_employment` dataset. Nonfarm employment is a key economic indicator that reflects the number of jobs in the economy, excluding farm workers, private household employees, and employees of nonprofit organizations. This study uses visual analysis to observe monthly changes in nonfarm employment and identify periods of economic growth and contraction.</font>

In [16]:
# Load the us employment dataset.
emp = data.us_employment()

chart1 = alt.Chart(emp,title='Analyzing Trends in US Nonfarm Employment Changes').mark_bar().encode(
             x="month:T",
             y="nonfarm_change:Q",
             color=alt.condition(
                 alt.datum.nonfarm_change > 0, # define the benchmark
                 alt.value("steelblue"),  # The positive color
                 alt.value("orange")  # The negative color
             )
         ).properties(width=600).interactive()

chart1

03.Faceted Stacked Bar Chart  
<font size=2>Let's build a case study on barley varieties across various sites and years. Our dataset, `data.barley()`, contains barley yields for different varieties harvested at multiple sites over several years.</font>

In [17]:
# Load the barley dataset.
barley = data.barley()

chart1 = alt.Chart(barley,title='Barley Yield by Variety Across Sites Over Years').mark_bar().encode(
            column="year:O", # Divides the data into columns based on the 'year' field.
            x="yield",  # Sets the 'yield' as the x-axis, showing the amount of yield.
            y="variety", # Sets the 'variety' as the y-axis, displaying different barley varieties.
            color="site", # Colors the bars based on the 'site' field to differentiate between the sites.
        ).properties(width=220).interactive()

chart1

04.Layered Histogram  
<font size=2>This example shows how to use opacity to make a layered histogram in Altair.</font>

In [18]:
# Ensures reproducibility of the results
np.random.seed(42)

# Generating Data
test = pd.DataFrame({
    'Trial A': np.random.normal(0, 0.8, 1000), # Data from Trial A including 1000 values, mean = 0, std dev = 0.8
    'Trial B': np.random.normal(-2, 1, 1000),  # Data from Trial B including 1000 values, mean = -2, std dev = 1
    'Trial C': np.random.normal(3, 2, 1000)    # Data from Trial C including 1000 values, mean = 3, std dev = 2
})


chart1 = alt.Chart(test,title='Layered Histogram with Opacity').transform_fold(
         ['Trial A', 'Trial B', 'Trial C'], # Specifies the columns to fold into a long format for separately plotting each trial.
         as_=['Experiment', 'Measurement']  # Names of the new folded columns, representing the experiment(Trial A,B,C) and measurement(values).
     ).mark_bar(
         opacity=0.3, # Sets the opacity of the bars for better visualization
         binSpacing=0 # Removes spacing between bins for a continuous appearance
     ).encode(
         alt.X('Measurement:Q').bin(maxbins=100), # Quantitative X-axis with binning
         alt.Y('count()').stack(None), # Counts the entries per bin; stack(None) overlays the bars instead of stacking them
         alt.Color('Experiment:N') # Colors bars based on the experiment name
     ).properties(width=600).interactive()
     
chart1
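
For readers more familiar with pandas, `transform_fold` is roughly comparable to `DataFrame.melt`. The sketch below reshapes a tiny hypothetical wide table into the long `Experiment` / `Measurement` layout used by the histogram. It is an analogy only, since Altair performs the fold inside the chart specification.

```python
import pandas as pd

# Hypothetical wide table: one column per trial.
wide = pd.DataFrame({
    "Trial A": [0.1, -0.4],
    "Trial B": [-2.2, -1.7],
    "Trial C": [3.5, 2.9],
})

# Rough pandas counterpart of transform_fold: stack the trial columns
# into a key column ('Experiment') and a value column ('Measurement').
long = wide.melt(var_name="Experiment", value_name="Measurement")
print(long)
```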

#### Line Charts  
<font size=2>Suppose we are data analysts at an investment firm, responsible for analyzing and reporting on the performance of stocks in the company's portfolio. The portfolio includes several stocks from the `data.stocks` dataset, and we need to monitor their price movements to support investment decisions. In this context, the line chart is an invaluable tool.</font>

In [19]:
# Load the stocks dataset.
stocks = data.stocks()

base = alt.Chart(stocks,title='Stock Price Trends Over Time')
line = base.mark_line(point=False).encode(
       x='date:T',  # X-axis as date, T indicates a temporal type
       y='price:Q', # Y-axis as price, Q indicates a quantitative type (numerical)
       color='symbol:N' # Different stocks represented by different colors, N indicates a nominal type (categorical data)
)
rule = base.mark_rule(strokeDash=[5, 5]).encode(   # strokeDash sets the line style; [5,5] means 5 pixels on, 5 pixels off
       y='average(price)',
       color='symbol',
       size=alt.value(1)
)

# Layer the two components and set the width of the chart.
chart1 = (line + rule).properties(width=600).interactive() 

chart1

#### Area Charts

- <font size=2>01.Faceted Area Chart</font>  
- <font size=2>02.Streamgraph</font>  

01.Faceted Area Chart  
<font size=2>Based on the same `data.stocks` dataset, this time we rely on the multiple area subcharts, one for each company. We also show filtering out one of the companies, and sorting the companies in a custom order.</font>

In [20]:
# Load the stocks dataset.
stocks = data.stocks()

chart1 = alt.Chart(stocks,title='Tech Giant Stock Price Trend Excluding Google').transform_filter(alt.datum.symbol != "GOOG").mark_area().encode(   # Exclude Google (GOOG) data from the chart
             x="date:T",   
             y="price:Q",
             color="symbol:N",
             row=alt.Row("symbol:N").sort(["MSFT", "AAPL", "IBM", "AMZN"]), # Sort companies and create a row for each company
         ).properties(height=80, width=600).interactive()

chart1

02.Streamgraph  
<font size=2>In this case study, we examine changes in unemployment rates across various industries over time, using the dataset at `data.unemployment_across_industries.url`. The aim is to identify trends and fluctuations in unemployment and to offer insight into how economic cycles affect industries differently. </font>

In [21]:
# Load the unemployment dataset.
unemp = data.unemployment_across_industries.url

chart1 = alt.Chart(unemp,title='Unemployment Trends Across Industries').mark_area().encode(
         alt.X('date:T').axis(domain=False, tickSize=0),
         alt.Y('sum(count):Q').stack('center').axis(None),
         alt.Color('series:N').scale(scheme='category20b')
      ).properties(width=600).interactive()

chart1
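
The idea behind `stack('center')` can be sketched in plain Python: instead of stacking every series upward from zero, the baseline is shifted so that each stack is centered. The counts below are hypothetical, and centering on zero is a simplifying assumption; the exact baseline Vega-Lite computes may differ.

```python
# Hypothetical counts for three series at two dates.
counts = {
    "2000-01": {"Agriculture": 2, "Construction": 6, "Finance": 4},
    "2000-02": {"Agriculture": 4, "Construction": 8, "Finance": 8},
}

def center_stack(series_counts):
    """Return {series: (lower, upper)} bands stacked symmetrically around zero."""
    total = sum(series_counts.values())
    lower = -total / 2  # shift the baseline so the stack is centered on zero
    bands = {}
    for name, value in series_counts.items():
        bands[name] = (lower, lower + value)
        lower += value
    return bands

bands_by_date = {d: center_stack(c) for d, c in counts.items()}
print(bands_by_date)
```

Because the baseline moves with the total, the vertical position of a band carries no meaning on its own, which is why the chart hides the Y-axis with `.axis(None)`.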

#### Scatter Charts

- <font size=2>01.Multifeature Scatter Plot</font>  
- <font size=2>02.Scatter Matrix</font>  
- <font size=2>03.Scatter Plot with Faceted Marginal Histograms</font>  
- <font size=2>04.Scatter Plot with Minimap</font>  

01.Multifeature Scatter Plot  
<font size=2>For a case study on the multifeature scatter plot, we use the well-known Iris data from the `data.iris` dataset, which includes iris flower species and their petal and sepal measurements. Multifeature scatter plots suit this kind of data because they can display several attributes in a single two-dimensional chart by mapping them to position, color, and size.</font>

In [22]:
# Load the iris dataset
iris = data.iris()

chart1 = alt.Chart(iris,title='Iris Flower Characteristics Comparison').mark_circle().encode(
         alt.X('sepalLength').scale(zero=False),  # X-axis represents sepal length, disable the zero baseline for better analysis
         alt.Y('sepalWidth').scale(zero=False, padding=1), # Y-axis represents sepal width, disable the zero baseline and add a little padding to improve visualization
         color='species',
         size='petalWidth'
     ).interactive()

chart1

02.Scatter Matrix  
<font size=2>A scatter matrix (also known as a pair plot) is useful for exploring correlations in multidimensional data, showing how each variable relates to every other variable. In this case study, we analyze the cars dataset, `data.cars()`, focusing on three quantitative variables: Horsepower, Acceleration, and Miles_per_Gallon. The scatter matrix helps us understand the relationships between these car attributes and how they vary by the car's origin.</font>

In [23]:
# Load the cars dataset
cars = data.cars()

chart1 = alt.Chart(cars).mark_circle().encode(
         alt.X(alt.repeat("column"), type='quantitative'), # Use the column name supplied by the 'repeat' method; type='quantitative' marks the x-axis data as numeric
         alt.Y(alt.repeat("row"), type='quantitative'), 
         color='Origin:N'
     ).properties(     # Set the width and height for each individual plot in the matrix
         width=150,
         height=150
     ).repeat(
         row=['Horsepower', 'Acceleration', 'Miles_per_Gallon'],
         column=['Miles_per_Gallon', 'Acceleration', 'Horsepower']
     ).interactive()
    
chart1

03.Scatter Plot with Faceted Marginal Histograms  
<font size=2>In this case study, we explore how to analyze and display the distribution and relationships within the `data.iris` dataset using a combination of scatter plots and faceted marginal histograms. This visualization technique allows us to see the relationship between two variables (via the scatter plot) while also viewing the distribution of each variable individually (through the marginal histograms) on the same chart.  
In Altair, `(top_hist & (points | right_hist))` is a syntax for combining multiple charts into a composite visualization using two types of combination operators: `&` and `|`.
- The `&` operator is used for vertical concatenation of charts. In this context, `top_hist & (...)` means that the `top_hist` chart will be placed above the combined charts.
- The `|` operator is used for horizontal concatenation of charts. Here, `points | right_hist` indicates that the points chart (a scatter plot) and the right_hist chart (a histogram on the right) will be placed side by side.</font>

In [24]:
# Load the iris dataset
iris = data.iris()

# Set scales for the x and y axes
xscale = alt.Scale(domain=(4.0, 8.0))
yscale = alt.Scale(domain=(1.9, 4.55))

# Create the base chart, which will be reused for creating different chart components.
base = alt.Chart(iris)

# Construct the scatter plot component
points = base.mark_circle().encode(
    alt.X("sepalLength").scale(xscale),
    alt.Y("sepalWidth").scale(yscale),
    color="species",
)
# Build the top histogram component
top_hist = (
    base.mark_bar(opacity=0.3, binSpacing=0) # Use a bar mark with specified opacity and no bin spacing.
    .encode(
        alt.X("sepalLength:Q").bin(maxbins=20, extent=xscale.domain).stack(None).title(""), # Bin sepal length into 20 bins without stacking and no axis title
        alt.Y("count()").stack(None).title(""), # Count the number of each bin without stacking and no axis title
        alt.Color("species:N"),
    )
    .properties(height=60)
)
# Build the right histogram component
right_hist = (
    base.mark_bar(opacity=0.3, binSpacing=0)
    .encode(
        alt.X("count()").stack(None).title(""),
        alt.Y("sepalWidth:Q").bin(maxbins=20, extent=yscale.domain).stack(None).title(""),
        alt.Color("species:N"),
    )
    .properties(width=60)
)

chart1 = (top_hist & (points | right_hist)).properties(title='Iris Flower Characteristics Comparison').interactive()
chart1

04.Scatter Plot with Minimap  
<font size=2>This example shows how to create a miniature version of a plot such that making a selection in the miniature adjusts the axis limits of another, more detailed view. For this case study, assume we are analyzing weather patterns in Seattle, focusing on maximum temperatures and how they change over time.  
The `zoom` selection is attached to the minimap and detail charts in two different ways, because the purpose and interaction mechanism of the two charts differ. Let's look at each method:  

Minimap Chart  
- In the `minimap` chart, `zoom` is applied directly to the chart through the `.add_params(zoom)` method. This lets us select an interval by clicking and dragging on the `minimap`, which directly sets the view range of the `detail` chart.  
- The `zoom` selection also controls the color encoding on the minimap through `alt.condition`. Points inside the selected interval are colored by the `weather` field, and points outside it are drawn in light gray, as specified by `alt.value("lightgray")`. While nothing is selected, every point counts as selected and keeps its weather color.  

Detail Chart  
- The `detail` chart uses the `zoom` selection differently. Interaction is enabled by configuring the scale of the `x` and `y` axes, specifically by setting the `domain` property to be controlled by the `zoom` parameter. The detail chart's display range therefore follows the selection made on the minimap, without allowing direct interval selection by dragging on the detail chart itself. In this design the minimap acts as a control panel for choosing the data range we want to explore in detail.
</font>

In [25]:
# Load the seattle weather dataset
weather = data.seattle_weather()

# Create an interactive zoom feature that enables zooming and panning on the scatter plot
zoom = alt.selection_interval(encodings=["x", "y"]) # ['x', 'y'] indicates that the zoom feature will be applied to both the x and y axes.

# Define the minimap chart
minimap = (
    alt.Chart(weather).mark_point()
    .add_params(zoom)
    .encode(
        x="date:T",
        y="temp_max:Q",
        color=alt.condition(zoom, "weather", alt.value("lightgray")),  # Color points inside the selection by weather; points outside it are light gray
    )
    .properties(
        width=200,
        height=200,
        title="Minimap -- click and drag to zoom in the detail view",
    )
)

# Define the detailed view chart
detail = (
    alt.Chart(weather)
    .mark_point()
    .encode(
        alt.X("date:T").scale(domain={"param": zoom.name, "encoding": "x"}),
        alt.Y("temp_max:Q").scale(domain={"param": zoom.name, "encoding": "y"}),
        color="weather",
    )
    .properties(width=600, height=400, title="Seattle weather -- detail view")
)

# Define a line chart that overlays a rolling mean of maximum temperatures
line = alt.Chart(weather).mark_line(
    color='red',   # Color the line red
    size=3         # Set the thickness of the line
).transform_window(
    frame=[-15, 15],                  # setting a range within 15 days before and after each point for the rolling mean
    rolling_mean='mean(temp_max)'     # Calculate the rolling mean of maximum temperatures
).encode(
    x='date:T',
    y='rolling_mean:Q'
)

chart1 = (detail + line | minimap + line).interactive()
chart1
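
The rolling mean computed by `transform_window` with `frame=[-15, 15]` can be imitated in plain Python. The sketch below uses a much smaller frame and hypothetical temperatures, and it assumes the rows are already in date order with one row per day.

```python
def centered_rolling_mean(values, before, after):
    """Mean over a window reaching `before` rows back and `after` rows ahead.

    Near the edges the window simply shrinks to the rows that exist.
    """
    means = []
    for i in range(len(values)):
        window = values[max(0, i - before): i + after + 1]
        means.append(sum(window) / len(window))
    return means

# Hypothetical daily maximum temperatures.
temps = [10.0, 12.0, 14.0, 16.0, 18.0]
smoothed = centered_rolling_mean(temps, before=1, after=1)
print(smoothed)
```

A wider frame gives a smoother line but reacts more slowly to short warm or cold spells.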

#### Boxplot Charts
<font size=2>In this case study, we use a boxplot to explore the distribution of the U.S. population across different age groups, using the `data.population` dataset.</font>

In [26]:
# Load the population dataset
pop = data.population.url

# Create a boxplot of the population data; extent='min-max' extends the whiskers to the minimum and maximum values
chart1 = alt.Chart(pop,title='').mark_boxplot(extent='min-max').encode(
         x='age:O',
         y='people:Q'
         ).properties(width=600).interactive()

chart1
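
The numbers a min-max boxplot draws for one group can be sketched with the standard library. The values below are hypothetical, and the quartile interpolation used here (`method="inclusive"`) is an assumption that may not match Vega-Lite's calculation exactly.

```python
import statistics

# Hypothetical population counts for one age group across several years.
people = [120, 150, 180, 210, 400]

# statistics.quantiles with n=4 returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(people, n=4, method="inclusive")

summary = {
    "whisker_low": min(people),   # extent='min-max': whisker reaches the minimum
    "q1": q1,
    "median": median,
    "q3": q3,
    "whisker_high": max(people),  # ...and the maximum, so no outlier points
}
print(summary)
```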

#### Ridgeline Charts  
<font size=2>Let's build a case study around ridgeline charts by visualizing the distribution of maximum daily temperatures in Seattle across months, based on the `data.seattle_weather` dataset. Ridgeline charts are effective for comparing distributions between multiple groups (in this case, months) and are particularly useful for highlighting the variations and patterns across these groups.  
The chart relies on several data transformation steps before visualization, and they can be hard to follow. I have therefore prepared a series of steps in pandas that mimic these operations. You can run the code snippets step by step and inspect the result of each one. Pay particular attention to the binning step, which groups continuous values into discrete intervals.  
- 01.Extract the month from the date  
- 02.Calculate the mean maximum temperature for each month  
- 03.Create bins for the `temp_max` column (this step is very important)  
- 04.Aggregate the data based on the month, mean_temp, and bin_min, bin_max  

In [27]:
# # Load the seattle weather dataset and read it
# weather = data.seattle_weather.url
# df_weather = pd.read_csv(weather)

# # 01.Extract the month from the date
# df_weather['date'] = pd.to_datetime(df_weather['date'])   
# df_weather['Month'] = df_weather['date'].dt.month

# # 02.Calculate the mean maximum temperature for each month
# df_monthly_mean = df_weather.groupby('Month')['temp_max'].mean().reset_index(name='mean_temp')

# # 03.Create bins for the temp_max columns
# bin_edges = np.arange(-5, 41, 5)   # start below 0 so that sub-zero temperatures also fall into a valid bin
# df_weather['bin_index'] = np.digitize(df_weather['temp_max'], bin_edges)
# df_weather['bin_min'] = bin_edges[df_weather['bin_index'] - 1]
# df_weather['bin_max'] = df_weather['bin_min'] + 5

# # 04.Aggregate the data based on the month, mean_temp, and bin_min, bin_max
# df_agg = df_weather.groupby(['Month', 'bin_min', 'bin_max'])['date'].count().reset_index(name='value')
# df_agg = pd.merge(df_agg, df_monthly_mean, on='Month')

# df_agg
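
Steps 02 to 04 can also be followed without any library at all. The sketch below is a plain-Python illustration on a handful of made-up `(month, temp_max)` pairs. The bin width of 5 and the helper `bin_min` are assumptions for this example, and floor division is used so that temperatures below zero land in their own bin.

```python
import math
from collections import Counter, defaultdict

# Made-up sample standing in for the weather table: (month, temp_max)
records = [(1, -1.6), (1, 3.2), (1, 7.9), (7, 24.4), (7, 26.1), (7, 31.0)]
BIN_WIDTH = 5

# 02. Mean maximum temperature per month
temps_by_month = defaultdict(list)
for month, temp in records:
    temps_by_month[month].append(temp)
mean_temp = {m: sum(t) / len(t) for m, t in temps_by_month.items()}

# 03. Lower edge of the bin each temperature falls into
def bin_min(temp, width=BIN_WIDTH):
    return math.floor(temp / width) * width

# 04. Count the days per (month, bin)
counts = Counter((month, bin_min(temp)) for month, temp in records)

for (month, lower), value in sorted(counts.items()):
    print(month, lower, lower + BIN_WIDTH, value, round(mean_temp[month], 2))
```

Each printed row corresponds to one point of a ridge: the month selects the ridge, the bin gives the x position, and the count gives the height.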

In [28]:
# Load the seattle weather dataset
weather = data.seattle_weather.url

# Define the height of each individual plot in the ridgeline
step = 40
# Define the overlap between successive plots
overlap = 0

# Create the ridgeline chart
chart1 = alt.Chart(weather, height=step
        # Data transformation steps from 01-05
        ).transform_timeunit(
            Month='month(date)'      # 01.Extract the month from the date
        ).transform_joinaggregate(
            mean_temp='mean(temp_max)', groupby=['Month']    # 02.Calculate the mean maximum temperature for each month
        ).transform_bin(
            ['bin_max', 'bin_min'], 'temp_max'   # 03.create bins for the temp_max columns
        ).transform_aggregate(
            value='count()', groupby=['Month', 'mean_temp', 'bin_min', 'bin_max']    # 04.Aggregate the data based on the month, mean_temp, and bin_min, bin_max
        ).transform_impute(
            impute='value', groupby=['Month', 'mean_temp'], key='bin_min', value=0   # 05.Fill in temperature bins that have no observations in a given month with 0, so that every ridge covers the same bins
        # Data Visualization steps
        # 01. setting the mark to area
        ).mark_area(
            interpolate='monotone',  # Set the interpolation to monotone,which produces smooth curves.
            fillOpacity=0.8,
            stroke='lightgray',
            strokeWidth=0.5
        # 02. encoding the x-axis, y-axis, fill color, and facet
        ).encode(
            alt.X('bin_min:Q')
                .bin('binned')  # Tell Altair that the data is already binned (by transform_bin above), so it is not binned again
                .title('Maximum Daily Temperature (C)'),
            alt.Y('value:Q')
                .axis(None) # Remove the y-axis
                .scale(range=[step, -step * overlap]),   # Set the range of the y-axis
            alt.Fill('mean_temp:Q')
                .legend(None)
                .scale(domain=[30, 5], scheme='redyellowblue')  # Set the color scale for the fill color
        ).facet(
            row=alt.Row('Month:T')
                .title(None)
                .header(labelAngle=0, labelAlign='left', format='%B')
        ).properties(
            title='Seattle Weather',
            bounds='flush' # Remove the padding around the plot
        ).configure_facet(
            spacing=0  # Remove the spacing between facets
        ).configure_view(
            stroke=None  # Remove the border around the plot
        ).configure_title(
            anchor='end' # Align the title to the right
        ).interactive()
        
chart1
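
The imputation step (05) is the least intuitive of the transformations above, so here is a small plain-Python sketch of the idea. It is not Altair's implementation, and the counts are made up: after aggregation, a month only has rows for the bins that actually occurred, and imputing adds the missing `(month, bin)` combinations with a value of 0.

```python
# Made-up aggregated counts: (month, bin_min) -> number of days
counts = {
    ("January", 0): 4,
    ("January", 10): 2,
    ("July", 5): 1,
    ("July", 10): 3,
}

months = sorted({month for month, _ in counts})
bins = sorted({lower for _, lower in counts})

# Every month gets every bin; combinations that never occurred are filled with 0
imputed = {
    (month, lower): counts.get((month, lower), 0)
    for month in months
    for lower in bins
}

for key, value in imputed.items():
    print(key, value)
```

Without the zeros, the area mark would interpolate straight across an empty bin instead of dropping to the baseline.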

#### Interactive Map  
<font size=2>This example shows how to create a choropleth map of income in the US by state, faceted by income bracket. The map color-codes each state by a percentage value (the `pct` field), and tooltips display the state name and its corresponding percentage.</font>

In [29]:
# Load the states geographical dataset
states = alt.topo_feature(data.us_10m.url, 'states')
print(states)
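# Preview the same TopoJSON file with GeoPandas. The ids printed below are 5-digit, county-style codes,
# which suggests that read_file returns a different layer than the 'states' feature selected for Altair above.
gdf = gpd.read_file('https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/us-10m.json')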
gdf.head()

UrlData({
  format: TopoDataFormat({
    feature: 'states',
    type: 'topojson'
  }),
  url: 'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/us-10m.json'
})


Unnamed: 0,id,geometry
0,22051,MULTIPOLYGON EMPTY
1,53000,"POLYGON ((-122.65544 48.41032, -122.65544 48.4..."
2,53073,"MULTIPOLYGON (((-120.85361 49.00011, -120.7674..."
3,30105,"POLYGON ((-106.11238 48.99904, -106.15187 48.8..."
4,30029,"POLYGON ((-114.06985 48.99904, -114.05908 48.8..."


In [30]:
# Load the income dataset
income = data.income.url
df = pd.read_json(income)
df.head()

Unnamed: 0,name,region,id,pct,total,group
0,Alabama,south,1,0.102,1837292,<10000
1,Alabama,south,1,0.072,1837292,10000 to 14999
2,Alabama,south,1,0.13,1837292,15000 to 24999
3,Alabama,south,1,0.115,1837292,25000 to 34999
4,Alabama,south,1,0.143,1837292,35000 to 49999


In [31]:
# Load the states geographical dataset
states = alt.topo_feature(data.us_10m.url, 'states')
# Load the income dataset
income = data.income.url

# Create the choropleth map
chart1 = alt.Chart(income,title='Income Distribution Across U.S. States').mark_geoshape().encode(
             shape='geo:G', # Encode geographical shapes
             color='pct:Q', # Encode income percentage as color
             tooltip=['name:N', 'pct:Q'], # Display state name and income percentage in the tooltip
             facet=alt.Facet('group:N', columns=2), # Facet the map based on the 'group' field
         ).transform_lookup(
             lookup='id', # Look up each income row by its 'id' field, which holds the state FIPS code
             from_=alt.LookupData(data=states, key='id'),   # Match it against the 'id' of the 'states' feature in the TopoJSON file, which is also the state FIPS code
             as_='geo' # Store the matched data in the 'geo' field
         ).properties(
             width=300,
             height=175,
         ).project(
             type='albersUsa'
         ).interactive()
         
chart1
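
Conceptually, `transform_lookup` in the cell above is a key-based join. The sketch below shows that idea in plain Python. It is only an illustration: the helper `lookup`, the placeholder geometries, and all rows except Alabama's `pct` value are made up. A row whose key has no match simply receives `None`.

```python
# Made-up rows standing in for the income data and the state shapes
income_rows = [
    {"id": 1, "name": "Alabama", "pct": 0.102},
    {"id": 2, "name": "Hypothetical State", "pct": 0.05},
    {"id": 99, "name": "No Matching Shape", "pct": 0.2},
]
state_shapes = [
    {"id": 1, "geometry": "<shape for id 1>"},
    {"id": 2, "geometry": "<shape for id 2>"},
]

def lookup(rows, lookup_field, from_rows, key, as_):
    """Attach the matching record from `from_rows` to every row under the name `as_`."""
    index = {record[key]: record for record in from_rows}
    return [{**row, as_: index.get(row[lookup_field])} for row in rows]

joined = lookup(income_rows, "id", state_shapes, "id", "geo")
for row in joined:
    print(row["name"], row["geo"])
```

This is why both datasets need an `id` field with the same meaning: the join only works when the keys on both sides refer to the same thing.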

#### World Map Projections  
<font size=2>In the map above, we set the projection type to `albersUsa`. Altair provides many other projections, which you can explore with the following code.</font>

In [32]:
# Load the world geographical dataset
project = alt.topo_feature(data.world_110m.url, 'countries')

# set the input dropdown
input_dropdown = alt.binding_select(options=[
    "albers",
    "albersUsa",
    "azimuthalEqualArea",
    "azimuthalEquidistant",
    "conicEqualArea",
    "conicEquidistant",
    "equalEarth",
    "equirectangular",
    "gnomonic",
    "mercator",
    "naturalEarth1",
    "orthographic",
    "stereographic",
    "transverseMercator"
], name='Projection ')
# create a parameter that is bound to the dropdown and holds the initial projection
param_projection = alt.param(value="albersUsa", bind=input_dropdown)

# create the chart
alt.Chart(project, width=500, height=300).mark_geoshape(
    fill='lightgray',
    stroke='gray'
).project(
    type=alt.expr(param_projection.name)
).add_params(param_projection)

#### Enhanced and Customizable Chart Visualizations  
- <font size=2>Case Study 01: Interactive Exploration of Movie Ratings, Genres, and Budget Over Time</font>  
- <font size=2>Case Study 02: Interactive Analysis of Seattle's Weather Patterns (2012-2015)</font>  
- <font size=2>Case Study 03: US Population Pyramid Over Time</font>  

***Case Study 01: Interactive Exploration of Movie Ratings, Genres, and Budget Over Time***  
<font size=2>This Altair code creates an interactive chart of movie data, loaded through `alt.UrlData`, using attributes such as `IMDB rating`, `worldwide gross`, `production budget`, and `genres`. It includes advanced interactive features such as a slider, a dropdown menu, radio buttons, and a checkbox, which filter and restyle the visualization according to the user's input. </font>

In [33]:
# Load the movies dataset, parsing Release_Date as a date, and preview it
movies = alt.UrlData(
    data.movies.url,
    format=alt.DataFormat(parse={"Release_Date":"date"})
)
print(movies)

df = pd.read_json('https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json')
df.head(2)

UrlData({
  format: DataFormat({
    parse: {'Release_Date': 'date'}
  }),
  url: 'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json'
})


Unnamed: 0,Title,US_Gross,Worldwide_Gross,US_DVD_Sales,Production_Budget,Release_Date,MPAA_Rating,Running_Time_min,Distributor,Source,Major_Genre,Creative_Type,Director,Rotten_Tomatoes_Rating,IMDB_Rating,IMDB_Votes
0,The Land Girls,146083.0,146083.0,,8000000.0,Jun 12 1998,R,,Gramercy,,,,,,6.1,1071.0
1,"First Love, Last Rites",10876.0,10876.0,,300000.0,Aug 07 1998,R,,Strand,,Drama,,,,6.9,207.0


In [34]:
# Load the movies dataset, parsing Release_Date as a date
movies = alt.UrlData(
    data.movies.url,
    format=alt.DataFormat(parse={"Release_Date":"date"})
)

# Define the ratings
ratings = ['G', 'NC-17', 'PG', 'PG-13', 'R']   # G (General Audiences), NC-17 (Adults Only), PG (Parental Guidance), PG-13 (Parents Strongly Cautioned), R (Restricted)
# Define the genres 
genres = [
    'Action', 'Adventure', 'Black Comedy', 'Comedy',
    'Concert/Performance', 'Documentary', 'Drama', 'Horror', 'Musical',
    'Romantic Comedy', 'Thriller/Suspense', 'Western'
]

# Step1: create the base chart
# Base chart configuration: set the data source, chart dimensions, and point mark characteristics
base = alt.Chart(movies, width=200, height=200).mark_point(filled=True).transform_calculate(
    # Calculate and add fields for rounded IMDB rating, whether it's a big budget film, and the release year
    Rounded_IMDB_Rating = "floor(datum.IMDB_Rating)",
    Big_Budget_Film =  "datum.Production_Budget > 100000000 ? 'Yes' : 'No'",
    Release_Year = "year(datum.Release_Date)",
).transform_filter(
    alt.datum.IMDB_Rating > 0 # Filter out movies with no IMDB rating
).transform_filter(
    alt.FieldOneOfPredicate(field='MPAA_Rating', oneOf=ratings) # Filter to only include movies with specified MPAA ratings
).encode(
    x=alt.X('Worldwide_Gross:Q').scale(domain=(100000,10**9), clamp=True),
    y='IMDB_Rating:Q',
    tooltip="Title:N"
)

# Step 2: create the advanced chart

# 01.Configuration for the slider to filter movies by release year
year_slider = alt.binding_range(min=1969, max=2018, step=1, name="Release Year")
slider_selection = alt.selection_point(bind=year_slider, fields=['Release_Year'])
# Add the year slider filter to the base chart
filter_year = base.add_params(
    slider_selection
).transform_filter(
    slider_selection
).properties(title="Slider Filtering")


# 02.Configuration for the dropdown to filter movies by genre
genre_dropdown = alt.binding_select(options=genres, name="Genre")
genre_select = alt.selection_point(fields=['Major_Genre'], bind=genre_dropdown)
# Add the genre dropdown filter to the base chart
filter_genres = base.add_params(
    genre_select
).transform_filter(
    genre_select
).properties(title="Dropdown Filtering")


# 03.Configuration for the radio buttons to change color based on movie rating
rating_radio = alt.binding_radio(options=ratings, name="Rating")
rating_select = alt.selection_point(fields=['MPAA_Rating'], bind=rating_radio)
# Define the condition for changing color based on the rating selection
rating_color_condition = alt.condition(
    rating_select,
    alt.Color('MPAA_Rating:N').legend(None),
    alt.value('lightgray')
)
# Add the radio button feature to highlight movies based on their MPAA rating
highlight_ratings = base.add_params(
    rating_select
).encode(
    color=rating_color_condition
).properties(title="Radio Button Highlighting")

# 04.Configuration for the checkbox to change the size of points based on whether the movie is a big budget film
input_checkbox = alt.binding_checkbox(name="Big Budget Films ")
checkbox_selection = alt.param(bind=input_checkbox)
# Define the condition for changing point size based on big budget selection
size_checkbox_condition = alt.condition(
    checkbox_selection,
    alt.Size('Big_Budget_Film:N').scale(range=[25, 150]), # When the box is checked, size the points by the Big_Budget_Film field
    alt.SizeValue(25) # When the box is unchecked, draw every point at the same size
)
# Add the checkbox feature to format points based on budget size
budget_sizing = base.add_params(
    checkbox_selection
).encode(
    size=size_checkbox_condition
).properties(title="Checkbox Formatting")


# Combine all the components to create the final interactive chart
chart1 = ((filter_year | budget_sizing) & (highlight_ratings | filter_genres)).properties(title="Interactive Exploration of Movie Ratings, Genres, and Budget Over Time").interactive()
chart1
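
The fields created by `transform_calculate` and the effect of the year slider can be checked by hand on a few rows. The following is a plain-Python sketch of the same logic, not the code Altair runs: the first two rows are taken from the preview above, the third row is hypothetical, and the date format `"%b %d %Y"` is an assumption based on how the dates appear in that preview.

```python
import math
from datetime import datetime

movies = [
    {"Title": "The Land Girls", "IMDB_Rating": 6.1,
     "Production_Budget": 8000000.0, "Release_Date": "Jun 12 1998"},
    {"Title": "First Love, Last Rites", "IMDB_Rating": 6.9,
     "Production_Budget": 300000.0, "Release_Date": "Aug 07 1998"},
    # Hypothetical row, added only to show the 'Yes' branch
    {"Title": "Hypothetical Blockbuster", "IMDB_Rating": 7.4,
     "Production_Budget": 150000000.0, "Release_Date": "May 01 2005"},
]

def add_calculated_fields(row):
    """Mirror the three expressions passed to transform_calculate."""
    derived = dict(row)
    derived["Rounded_IMDB_Rating"] = math.floor(row["IMDB_Rating"])
    derived["Big_Budget_Film"] = "Yes" if row["Production_Budget"] > 100_000_000 else "No"
    derived["Release_Year"] = datetime.strptime(row["Release_Date"], "%b %d %Y").year
    return derived

derived_rows = [add_calculated_fields(row) for row in movies]

# The slider keeps only the movies whose Release_Year equals the selected year
selected_year = 1998
visible_titles = [r["Title"] for r in derived_rows if r["Release_Year"] == selected_year]
print(visible_titles)
```

Because the slider selection is defined on `Release_Year`, it matches on equality with that calculated field, which is why the field has to be created before the filter is applied.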

***Case Study 02: Interactive Analysis of Seattle's Weather Patterns (2012-2015)***  
<font size=2>This case study demonstrates how Altair's declarative syntax can be used to create interactive and customizable visualizations. By combining an interval selection with a point selection, we can explore the dataset from different angles. The color encoding and the ability to filter each panel by the selection made in the other improve the user experience and provide deeper insight into Seattle's weather patterns.</font>

In [35]:
# Load the seattle_weather dataset and read it
weather = data.seattle_weather()
weather.head()

Unnamed: 0,date,precipitation,temp_max,temp_min,wind,weather
0,2012-01-01,0.0,12.8,5.0,4.7,drizzle
1,2012-01-02,10.9,10.6,2.8,4.5,rain
2,2012-01-03,0.8,11.7,7.2,2.3,rain
3,2012-01-04,20.3,12.2,5.6,4.7,rain
4,2012-01-05,1.3,8.9,2.8,6.1,rain


In [36]:
# Load the seattle_weather dataset
weather = data.seattle_weather()

# Define a color scheme for different weather types
color = alt.Color('weather:N').scale(
    domain=['sun', 'fog', 'drizzle', 'rain', 'snow'],
    range=['#e7ba52', '#a7a7a7', '#aec7e8', '#1f77b4', '#9467bd']
)

# Define interactive selections:
brush = alt.selection_interval(encodings=['x'])  # - a brush that is active on the top panel
click = alt.selection_point(encodings=['color']) # - a multi-click that is active on the bottom panel


# Top panel: Scatter plot of maximum daily temperature vs. date
points = alt.Chart().mark_point().encode(
    alt.X('monthdate(date):T').title('Date'),  # X-axis shows date
    alt.Y('temp_max:Q')
        .title('Maximum Daily Temperature (C)')
        .scale(domain=[-5, 40]),  # Y-axis shows max temperature
    alt.Size('precipitation:Q').scale(range=[5, 200]),  # Point size based on precipitation
    color=alt.condition(brush, color, alt.value('lightgray')),  # Color based on brush selection
).properties(
    width=550,
    height=300
).add_params(
    brush   # Enable brush selection on this chart
).transform_filter(
    click   # Filter data based on click selection from the bottom panel
)

# Bottom panel: Bar chart of weather type counts
bars = alt.Chart().mark_bar().encode(
    x='count()',
    y='weather:N',
    color=alt.condition(click, color, alt.value('lightgray')),
).properties(
    width=550,
).add_params(
    click
).transform_filter(
    brush
)

# Combined chart with both panels
chart1 = alt.vconcat(points,bars,
         data=weather,
         title="Seattle Weather: 2012-2015"
     ).interactive()

chart1
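
The two panels above filter each other: the brush on the scatter plot limits what the bar chart counts, and a click on a bar limits which points are shown. The sketch below reproduces that cross-filtering logic in plain Python on the five rows from the preview. The function names are hypothetical, and an empty click selection is assumed to let every row through.

```python
from collections import Counter
from datetime import date

rows = [
    {"date": date(2012, 1, 1), "temp_max": 12.8, "weather": "drizzle"},
    {"date": date(2012, 1, 2), "temp_max": 10.6, "weather": "rain"},
    {"date": date(2012, 1, 3), "temp_max": 11.7, "weather": "rain"},
    {"date": date(2012, 1, 4), "temp_max": 12.2, "weather": "rain"},
    {"date": date(2012, 1, 5), "temp_max": 8.9, "weather": "rain"},
]

def bars_for_brush(rows, start, end):
    """Bottom panel: count weather types among the rows inside the brushed dates."""
    return Counter(r["weather"] for r in rows if start <= r["date"] <= end)

def points_for_click(rows, clicked):
    """Top panel: keep only rows whose weather type was clicked (all rows if nothing is clicked)."""
    if not clicked:
        return rows
    return [r for r in rows if r["weather"] in clicked]

bar_counts = bars_for_brush(rows, date(2012, 1, 1), date(2012, 1, 3))
visible_points = points_for_click(rows, {"drizzle"})
print(bar_counts, len(visible_points))
```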

***Case Study 03: US Population Pyramid Over Time***  
<font size=2>A population pyramid shows the distribution of age groups within a population. This example binds a slider widget to the year, so that we can see how the age distribution changes over time.</font>

In [37]:
# Load the population dataset and read it
pop = data.population.url
df = pd.read_json(pop)
df.head()

Unnamed: 0,year,age,sex,people
0,1850,0,1,1483789
1,1850,0,2,1450376
2,1850,5,1,1411067
3,1850,5,2,1359668
4,1850,10,1,1260099


In [38]:
# Load the population dataset
pop = data.population.url

# Create a slider that lets us interactively choose the year of the data.
slider = alt.binding_range(min=1850, max=2000, step=10)  # This slider spans from 1850 to 2000, moving in steps of 10 years.
select_year = alt.selection_point( name='year',fields=['year'],
                                   bind=slider, value={'year': 2000})  # The initial value is set to 2000.

# Incorporating the interactive slider for selecting the year.
base = alt.Chart(pop).add_params(
    select_year
).transform_filter(
    select_year   # Filters the data to only include the selected year.
).transform_calculate(
    gender=alt.expr.if_(alt.datum.sex == 1, 'Male', 'Female')  # Calculates gender based on the 'sex' column.
).properties(
    width=250
)

# Define a color scale to differentiate between male and female populations.
color_scale = alt.Scale(domain=['Male', 'Female'],
                        range=['#1f77b4', '#e377c2'])

# Create the left part of the pyramid for the female population.
left = base.transform_filter(
    alt.datum.gender == 'Female'
).encode(
    alt.Y('age:O').axis(None),
    alt.X('sum(people):Q')
        .title('population')
        .sort('descending'),  # Sorts the population data in descending order from left to right.
    alt.Color('gender:N')
        .scale(color_scale)
        .legend(None)
).mark_bar().properties(title='Female')

# Create the middle part of the pyramid, which displays the age labels.
middle = base.encode(
    alt.Y('age:O').axis(None),
    alt.Text('age:Q'),
).mark_text().properties(width=20)

# Create the right part of the pyramid for the male population.
right = base.transform_filter(
    alt.datum.gender == 'Male'
).encode(
    alt.Y('age:O').axis(None),
    alt.X('sum(people):Q')
        .title('population')
        .sort('ascending'),  # Sorts the population data in ascending order from left to right.
    alt.Color('gender:N')
        .scale(color_scale)
        .legend(None)
).mark_bar().properties(title='Male')

chart1 = alt.concat(left, middle, right, spacing=5).properties(title='US Population Pyramid Over Time').interactive()
chart1
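
Each side of the pyramid is the result of three operations: keep the selected year, keep one gender, and sum `people` per age group. The sketch below walks through these operations in plain Python on the five rows from the preview. The helper `pyramid_side` is hypothetical, and the coding of `sex` (1 for male, otherwise female) follows the `transform_calculate` expression in the cell above.

```python
from collections import defaultdict

rows = [
    {"year": 1850, "age": 0, "sex": 1, "people": 1483789},
    {"year": 1850, "age": 0, "sex": 2, "people": 1450376},
    {"year": 1850, "age": 5, "sex": 1, "people": 1411067},
    {"year": 1850, "age": 5, "sex": 2, "people": 1359668},
    {"year": 1850, "age": 10, "sex": 1, "people": 1260099},
]

def pyramid_side(rows, year, gender):
    """Sum people per age group for one year and one gender."""
    totals = defaultdict(int)
    for r in rows:
        row_gender = "Male" if r["sex"] == 1 else "Female"
        if r["year"] == year and row_gender == gender:
            totals[r["age"]] += r["people"]
    return dict(totals)

female_bars = pyramid_side(rows, 1850, "Female")   # left side of the pyramid
male_bars = pyramid_side(rows, 1850, "Male")       # right side of the pyramid
print(female_bars)
print(male_bars)
```

Moving the slider corresponds to calling the function with a different `year`. The left chart then only differs from the right one in the direction of its x-axis.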

### More Details
<iframe src="https://altair-viz.github.io/gallery/index.html" width=100% height=700></iframe>

## <font color='F2700A'>Part II: Plotly</font>  
Plotly's Python graphing library makes interactive, publication-quality graphs. In my opinion, Plotly offers more powerful interactivity and greater flexibility than Altair, <font color='F2700A'>especially for maps, subplots, 3D graphics, and animated graphics</font>, but it may take more time to learn and configure.  
<iframe src="https://plotly.com/python/" width=100% height=700></iframe>  

In the following sections, we rely on the [datasets](https://github.com/plotly/datasets) published by Plotly. First, install the `plotly` library using the following commands.

In [39]:
# pip install plotly==5.20.0
# pip install dash

In [40]:
# import plotly libraries
import plotly.express as px
from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# import fundamental libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import os

### Basic Statistical Visualization by Plotly

- <font size=2>01.Bar Chart:</font>  
- <font size=2>02.Scatter Plot</font>  
- <font size=2>03.Line Charts</font>  
- <font size=2>04.Pie Charts</font>  
- <font size=2>05.Bubble Charts</font>  

><font color = 'F2700A'>01.Bar Chart</font>  

In [41]:
# Load the gapminder dataset from plotly.express's built-in datasets
# Query European countries in the year 2007 with a population greater than 2 million
eurpop = px.data.gapminder().query("continent == 'Europe' and year == 2007 and pop > 2.e6")

# Create a bar chart using the filtered dataset
fig = px.bar(eurpop, x='country', y='pop', 
             text_auto='.1s',   # automatically label each bar with its population, formatted with an SI prefix (such as M for millions) and one significant digit
             title="Population Overview of European Countries in 2007 Exceeding 2 Million")

# # The textfont, textposition and textangle trace attributes can be used to control these
# fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)

fig.show()

In [42]:
# Load the 'medals_long' dataset from plotly.express's built-in datasets. 
# This dataset contains information about medals won by different nations in a sporting event.
medals = px.data.medals_long()

fig = px.bar(medals, x="medal", y="count", color="nation", text="nation",
             title="Medal Distribution by Nation in International Sporting Events")
fig.show()

In [43]:
# Load the 'tips' dataset from plotly.express's built-in datasets.
# This dataset contains information about tips given in a restaurant, including total bill and diner's gender.
tips = px.data.tips()

fig = px.histogram(tips, x="total_bill", y="tip", color="sex",
                   marginal="box", # or violin, rug
                   hover_data=['smoker','day','time'],
                   title='Tip Distribution by Total Bill Amount')
fig.show()
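
When `y` is passed to `px.histogram`, as in the cell above, each bar shows the sum of `tip` for the bills that fall into its bin rather than a count of rows. The sketch below shows that aggregation in plain Python. It is a simplified illustration: the five `(total_bill, tip)` pairs are example values, and the fixed bin width of 10 is an assumption, since Plotly chooses its bin width automatically.

```python
import math
from collections import defaultdict

# Example (total_bill, tip) pairs used only for illustration
bills_and_tips = [(16.99, 1.01), (10.34, 1.66), (21.01, 3.50), (23.68, 3.31), (24.59, 3.61)]
BIN_WIDTH = 10   # assumed bin width

tip_sum_per_bin = defaultdict(float)
for total_bill, tip in bills_and_tips:
    bin_start = math.floor(total_bill / BIN_WIDTH) * BIN_WIDTH
    tip_sum_per_bin[bin_start] += tip   # sum the tips instead of counting the rows

for bin_start in sorted(tip_sum_per_bin):
    print(f"{bin_start}-{bin_start + BIN_WIDTH}: {tip_sum_per_bin[bin_start]:.2f}")
```

This matters for reading the chart: a tall bar can mean many small tips or a few large ones, and the marginal box plot helps to tell the two apart.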

><font color = 'F2700A'>02.Scatter Plot</font>  

In [44]:
# Load the 'iris' dataset from plotly.express's built-in datasets.
# This dataset contains measurements of different parts of the iris flower, including sepal length, sepal width, petal length, petal width, and the species of the iris.
iris = px.data.iris()

fig = px.scatter(iris, x="sepal_width", y="sepal_length", color="species",
                 size='petal_length', 
                 hover_data=['petal_width'],
                 trendline="ols",   # fit an ordinary least squares trendline for each species (requires the statsmodels package)
                 title='Iris Flower Characteristics Comparison')
fig.show()
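
The `trendline="ols"` option in the cell above fits a straight line by ordinary least squares. As a reminder of what that means, here is a minimal plain-Python sketch of the closed-form solution for a single predictor. It is not how Plotly computes the line internally, and the helper `ols_line` and the sample numbers are made up.

```python
def ols_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1
widths = [1.0, 2.0, 3.0, 4.0]
lengths = [3.0, 5.0, 7.0, 9.0]
slope, intercept = ols_line(widths, lengths)
print(slope, intercept)
```

In the scatter plot, one such line is fitted per color group, so each species gets its own slope and intercept.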

><font color = 'F2700A'>03.Line Chart in Dash</font>  
[Dash](https://plotly.com/dash/) is an open-source Python framework, developed by Plotly, for building interactive web applications tailored to data visualization. It lets us create web-based dashboards and analytical apps around Plotly figures using pure Python code.  
<font color = 'F2700A'>One important thing to remember: save each Dash app in its own Python file; otherwise, the results of different apps will overwrite each other.</font>

In [45]:
# pip install dash
# from dash import Dash, dcc, html, Input, Output

In [46]:
# Initialize the Dash app. This is the starting point for creating any Dash application.
app = Dash(__name__)

# Define the layout of the app. The layout dictates what the application will look like.
app.layout = html.Div([
    html.H4('Life expectancy progression of countries per continent'), # The header that will be displayed on the page
    # The main content of the app, which includes the graph and the checklist
    dcc.Graph(id="graph"), # Define the figure id, and its contents will be defined by the 'callback' function.
    dcc.Checklist(
        id="checklist", # Define the checklist id, which will be used to update the graph.
        options=["Asia", "Europe", "Africa","Americas","Oceania"], # Define the options that will be displayed in the checklist.
        value=["Americas", "Oceania"], # Define the default selected values in the checklist.
        inline=True # True means the checklist items displayed in a horizontal line, False means vertical line.
    ),
])

# Define a callback function, which is triggered whenever the value of the checklist changes.
@app.callback(
    Output("graph", "figure"), # 'graph' is a component id, and 'figure' is the property of that component to update (here, the line chart)
    Input("checklist", "value")) # 'checklist' is a component id, and 'value' is its property holding the selected items (by default "Americas" and "Oceania")

# Define the function that will be executed when the checklist value changes.
def update_line_chart(continents):
    # Load the Gapminder dataset.
    df = px.data.gapminder() 
    # Filter the dataset based on the selected continents.
    mask = df.continent.isin(continents)
    # Create a line chart showing the life expectancy progression of countries per continent.
    fig = px.line(df[mask], 
        x="year", y="lifeExp", color='country')
    return fig

# Start the Dash server, enabling the app to be viewed in a web browser.
app.run(debug=True) # 'debug=True' enables live reloading; recent Dash releases use app.run in place of the older app.run_server

><font color = 'F2700A'>04.Pie Chart</font>  

In [47]:
# Load the 'gapminder' dataset from Plotly Express's built-in datasets.
# Query the countries of the Americas in the year 2007.
amer_2007 = px.data.gapminder().query("year == 2007").query("continent == 'Americas'")

# Create a pie chart using the filtered dataset. 
fig = px.pie(amer_2007, values='pop', names='country',
             title='Population of American continent',
             hover_data=['lifeExp'], labels={'lifeExp':'life expectancy'})
fig.update_traces(textposition='inside', 
                  textinfo='percent+label') # percent presents the percentage of the total population, label presents the country name
fig.show()

><font color = 'F2700A'>05.Bubble Chart</font>  

In [48]:
# Load the 'gapminder' dataset from Plotly Express's built-in datasets.
lifexp = px.data.gapminder()

fig = px.scatter(lifexp.query("year==2007"), x="gdpPercap", y="lifeExp",
	             size="pop", size_max=60, color="continent",
                 hover_name="country", 
                 log_x=True,
                 title='Life Expectancy vs GDP per Capita in 2007')
fig.show()

### Maps Visualization by Plotly  

><font color = 'F2700A'>01.Map Visualization in Dash</font>  

In [49]:
# import mapbox token
# To avoid sharing my Mapbox token, I store it in a separate JSON file and read it here.
import json
with open('C:/Users/Admin/Desktop/4680_5680_intro_uds/special topic_interactive map/mapbox_api_key_Houpu.json', 'r') as file:
    token_data = json.load(file)   # named token_data so that it does not overwrite `data` imported from vega_datasets
token = token_data['key']

# Initialize the Dash app. This is the starting point for creating any Dash application.
app = Dash(__name__)

# Define the layout of the app. The layout dictates what the application will look like.
app.layout = html.Div([
    html.H4('Political candidate voting pool analysis'), # The header that will be displayed on the page
    html.P("Select a candidate:"), # A paragraph of text prompting the user to select a candidate.
    # The main content of the app, which includes the graph and the radioItems
    dcc.Graph(id="graph"), # Define the figure id, and its contents will be defined by the 'callback' function.
    # Radio items that let the user select a candidate. The `value` represents the default selection.
    dcc.RadioItems(
        id='candidate', 
        options=["Joly", "Coderre", "Bergeron"],
        value="Coderre",
        inline=True
    ),
])

# Define a callback function, which is triggered whenever the selected radio item changes.
@app.callback(
    Output("graph", "figure"), 
    Input("candidate", "value"))

# Define the function that will be executed when the RadioItems value changes.
def display_choropleth(candidate):
    # Load the election dataset.
    df = px.data.election() 
    # Load the geographical data (geojson) dataset.
    geojson = px.data.election_geojson()

    fig = px.choropleth_mapbox(
        df, geojson=geojson, color=candidate,
        # The df and geojson are joined on the 'district' column.
        locations="district", # identify the location column 'district' in the df
        featureidkey="properties.district", # identify the location column 'district' in the geojson
        # The initial center coordinates and zoom level of the map.
        center={"lat": 45.5517, "lon": -73.7073}, zoom=9,
        # fix the range of the color scale so that the colors stay comparable when switching candidates.
        range_color=[0, 6500])
    # Update the layout to adjust the map's appearance and set the Mapbox access token.
    fig.update_layout(
        margin={"r":0,"t":0,"l":0,"b":0}, # Set the margin to 0 to remove any padding around the map.
        mapbox_accesstoken=token)
    return fig

# Start the Dash server, enabling the app to be viewed in a web browser.
app.run(debug=True) # recent Dash releases use app.run in place of the older app.run_server
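
The choropleth in the cell above depends on matching each row of the table to a GeoJSON feature through `locations` and `featureidkey`. The sketch below illustrates that matching in plain Python. It is not Plotly's implementation: the helper `resolve_key`, the district names, and the vote counts are all made up, and only the dotted-path idea behind `"properties.district"` is taken from the code above.

```python
def resolve_key(feature, dotted_key):
    """Follow a dotted path such as 'properties.district' inside a GeoJSON feature."""
    value = feature
    for part in dotted_key.split("."):
        value = value[part]
    return value

# Made-up GeoJSON with two features and no real geometry
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"district": "District A"}, "geometry": None},
        {"type": "Feature", "properties": {"district": "District B"}, "geometry": None},
    ],
}

# Made-up table rows; 'district' plays the role of the `locations` column
votes = [
    {"district": "District A", "Coderre": 120},
    {"district": "District C", "Coderre": 300},
]

features_by_district = {
    resolve_key(feature, "properties.district"): feature
    for feature in geojson["features"]
}
matched = {row["district"]: row["district"] in features_by_district for row in votes}
print(matched)
```

A row without a matching feature, like the made-up "District C" here, has no shape to color, so it is worth checking that the values in both places are spelled identically.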

### Subplots Visualization by Plotly

><font color = 'F2700A'>01.Map Subplots</font>  

In [50]:
import plotly.graph_objects as go

# Load the dataset containing information about Walmart store openings from a CSV file hosted online
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/1962_2006_walmart_store_openings.csv')

# Initialize an empty list to hold the plot data
data = []

# Define the layout of the entire figure
## including the title, size, and other appearance settings
layout = dict(
    title = 'New Walmart Stores per year 1962-2006<br>\
Source: <a href="http://www.econ.umn.edu/~holmes/data/WalMart/index.html">\
University of Minnesota</a>',
    # showlegend = False,
    autosize = False,
    width = 1000,
    height = 900,
    hovermode = 'closest', # Set the hover mode to display the data of the closest point
    legend = dict(
        x=0.7,
        y=-0.1,
        bgcolor="rgba(255, 255, 255, 0)",
        font = dict( size=11 ),
    )
)

# Extract unique years from the dataset
years = df['YEAR'].unique()
# Iterate through each year to create scattergeo plots for store locations and text markers for the years
for i in range(len(years)):
    geo_key = 'geo'+str(i+1) if i != 0 else 'geo'  # The code is responsible for generating a unique identifier for each geographic subplot within the figure.
    lons = list(df[ df['YEAR'] == years[i] ]['LON'])
    lats = list(df[ df['YEAR'] == years[i] ]['LAT'])
    # Add scatter points for Walmart store locations
    data.append(
        dict(
            type = 'scattergeo',
            showlegend=False,
            lon = lons,
            lat = lats,
            geo = geo_key,
            name = int(years[i]),
            marker = dict(
                color = "rgb(0, 0, 255)",
                opacity = 0.5
            ),
            text = df[ df['YEAR'] == years[i] ]['STREETADDR'].astype(str), # Use the addresses of the stores opened in this year as hover text, so that they line up with lons and lats
            hoverinfo = 'text'  # Set the hover information to display the store address
        )
    )
    # Year markers
    data.append(
        dict(
            type = 'scattergeo',
            showlegend = False,
            lon = [-78],
            lat = [47],
            geo = geo_key,
            text = [years[i]],
            mode = 'text',
        )
    )
    layout[geo_key] = dict(
        scope = 'usa',
        showland = True,
        landcolor = 'rgb(229, 229, 229)',
        showcountries = False,
        domain = dict( x = [], y = [] ),
        subunitcolor = "rgb(255, 255, 255)",
    )

# Define a function to create sparkline layouts for showing trends in store openings
def draw_sparkline( domain, lataxis, lonaxis ):
    ''' Returns a sparkline layout object for geo coordinates  '''
    return dict(
        showland = False,
        showframe = False,
        showcountries = False,
        showcoastlines = False,
        domain = domain,
        lataxis = lataxis,
        lonaxis = lonaxis,
        bgcolor = 'rgba(255,200,200,0.0)'
    )

# Stores per year sparkline
layout['geo44'] = draw_sparkline({'x':[0.6,0.8], 'y':[0,0.15]}, \
                                 {'range':[-5.0, 30.0]}, {'range':[0.0, 40.0]} )
data.append(
    dict(
        type = 'scattergeo',
        mode = 'lines',
        lat = list(df.groupby(by=['YEAR']).count()['storenum']/1e1),
        lon = list(range(len(df.groupby(by=['YEAR']).count()['storenum']/1e1))),
        line = dict( color = "rgb(0, 0, 255)" ),
        name = "New stores per year<br>Peak of 178 stores per year in 1990",
        geo = 'geo44',
    )
)

# Cumulative sum sparkline
layout['geo45'] = draw_sparkline({'x':[0.8,1], 'y':[0,0.15]}, \
                                 {'range':[-5.0, 50.0]}, {'range':[0.0, 50.0]} )
data.append(
    dict(
        type = 'scattergeo',
        mode = 'lines',
        lat = list(df.groupby(by=['YEAR']).count().cumsum()['storenum']/1e2),
        lon = list(range(len(df.groupby(by=['YEAR']).count()['storenum']/1e1))),
        line = dict( color = "rgb(214, 39, 40)" ),
        name ="Cumulative sum<br>3176 stores total in 2006",
        geo = 'geo45',
    )
)

# Arrange the geographical plots in a grid layout
z = 0
COLS = 5
ROWS = 9
for y in reversed(range(ROWS)):
    for x in range(COLS):
        geo_key = 'geo'+str(z+1) if z != 0 else 'geo'
        layout[geo_key]['domain']['x'] = [float(x)/float(COLS), float(x+1)/float(COLS)]
        layout[geo_key]['domain']['y'] = [float(y)/float(ROWS), float(y+1)/float(ROWS)]
        z=z+1
        if z > 42:
            break

fig = go.Figure(data=data, layout=layout)
fig.update_layout(width=800)
fig.show()
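
The grid-placement loop above can be looked at on its own, without Plotly, to make the key naming and the domain arithmetic easier to inspect. The following is only a minimal sketch: the helper name `grid_domains` and its parameters are hypothetical (they are not part of Plotly or of the cell above), and it assumes 43 yearly maps, as implied by the `z > 42` check. It reproduces the rule that the first subplot is called `geo`, later ones `geo2`, `geo3`, ..., and that cells are filled row by row starting from the top of the figure.

```python
def grid_domains(n_plots, cols, rows):
    """Return {geo_key: {'x': [x0, x1], 'y': [y0, y1]}} for a grid of small multiples.

    Hypothetical helper for illustration only. Cells are filled left to right,
    starting from the top row (largest y), as in the loop above.
    """
    domains = {}
    z = 0
    for y in reversed(range(rows)):      # top row first
        for x in range(cols):
            if z >= n_plots:             # stop once every subplot has a cell
                return domains
            geo_key = 'geo' if z == 0 else 'geo' + str(z + 1)
            domains[geo_key] = {
                'x': [x / cols, (x + 1) / cols],
                'y': [y / rows, (y + 1) / rows],
            }
            z += 1
    return domains


domains = grid_domains(n_plots=43, cols=5, rows=9)
print(len(domains))        # number of cells assigned
print(domains['geo'])      # first map: top-left cell
print(domains['geo43'])    # last map: bottom row, third column
```

Returning early from the helper ends both loops at once, whereas the `break` in the cell above leaves only the inner loop. That is harmless there because the limit is reached in the final row.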

### 3D Graphics Visualization by Plotly

- <font size=2>01.3D Scatter Plots</font>  
- <font size=2>02.Topographical 3D Surface Plot</font>  
- <font size=2>03.3D Surface Subplots</font>  

><font color = 'F2700A'>01.3D Scatter Plots</font>  

In [51]:
# Load the 'iris' dataset from plotly.express's built-in datasets.
# This dataset contains measurements of different parts of the iris flower, including sepal length, sepal width, petal length, petal width, and the species of the iris.
iris = px.data.iris()

# Create a 3D scatter plot
fig = px.scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_width',
                    color='petal_length', size='petal_length', size_max=30,
                    symbol='species', opacity=0.7)


fig.update_layout(
    margin=dict(l=0, r=0, b=0, t=0), # tight layout
    legend=dict(
        x=0.1,  # Adjust the horizontal position of the legend, where 0 is left and 1 is right
        y=1,    # Adjust the vertical position of the legend, where 0 is bottom and 1 is top
        font=dict(
            size=10,  # Adjust the font size in the legend
        ),
        orientation="h",  # Set the legend items to a horizontal orientation
    )
)
fig.show()

><font color = 'F2700A'>02.Topographical 3D Surface Plot</font>  

In [52]:
import plotly.graph_objects as go

# Read elevation data from a CSV file hosted on GitHub
# The data represents the elevation of Mt Bruno and is organized in a grid 24*24, where each cell value indicates the elevation at that point.
z_data = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/api_docs/mt_bruno_elevation.csv')

fig = go.Figure(data=[go.Surface(z=z_data.values)]) # The 'z' argument of go.Surface is set to the values of the elevation data

fig.update_layout(title='Mt Bruno Elevation', autosize=False,
                  width=500, height=500,
                  margin=dict(l=65, r=50, b=65, t=90))

fig.show()

><font color = 'F2700A'>03.3D Surface Subplots</font>  

In [53]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Load Mt Bruno elevation data
z_data = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/api_docs/mt_bruno_elevation.csv')

# Initialize figure with 4 3D subplots
fig = make_subplots(
    rows=2, cols=2,
    # Each cell must be declared as type 'surface' so it can hold a 3D surface trace
    specs=[[{'type': 'surface'}, {'type': 'surface'}],
           [{'type': 'surface'}, {'type': 'surface'}]])

# Adding Mt Bruno elevation data to each subplot with different colorscales
fig.add_trace(
    go.Surface(z=z_data.values, colorscale='Viridis', showscale=False),
    row=1, col=1)

fig.add_trace(
    go.Surface(z=z_data.values, colorscale='RdBu', showscale=False),
    row=1, col=2)

fig.add_trace(
    go.Surface(z=z_data.values, colorscale='YlOrRd', showscale=False),
    row=2, col=1)

fig.add_trace(
    go.Surface(z=z_data.values, colorscale='YlGnBu', showscale=False),
    row=2, col=2)

# Update the layout
fig.update_layout(
    title_text='Mt Bruno Elevation in 3D Subplots with Different Colorscales',
    height=800,
    width=800
)

# Show figure
fig.show()

### Animated figures with Plotly

- <font size=2>Case study 01: Dynamics of Global Economy and Health Conditions</font>  
- <font size=2>Case study 02: Evolving Patterns of World Population Distribution</font>  

><font color = 'F2700A'>Case study 01: Dynamics of Global Economy and Health Conditions</font>  

In [54]:
# Load the Gapminder dataset (population, life expectancy, and GDP per capita by country and year)
df = px.data.gapminder()

fig = px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
                 size="pop", color="continent", hover_name="country",
                 log_x=True, size_max=55, range_x=[100,100000], range_y=[25,90])
fig.update_layout(title='GDP per Capita vs Life Expectancy by Continent from 1952 to 2007')

fig.show()

><font color = 'F2700A'>Case study 02: Evolving Patterns of World Population Distribution</font>  

In [55]:
# Load population and life expectancy data
df = px.data.gapminder()

fig = px.bar(df, x="continent", y="pop", color="continent",
  animation_frame="year", animation_group="country", range_y=[0,4000000000])
fig.update_layout(title='Population by Continent from 1952 to 2007')

fig.show()

## <font color='F2700A'>Part III: Bokeh</font>  
Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics, and affords high-performance interactivity over large or streaming datasets.  
***The Differences among Altair, Plotly, and Bokeh***  
- In my own learning experience, `Altair`'s main selling point is its concise, declarative description of a visualization, which makes it better suited to statistical analysis than to interactive web applications.  
- `Bokeh` and `Plotly` both focus on creating interactive visualizations, and I think `Plotly`'s syntax and usage can be more accessible to students who are not deeply familiar with programming.
<iframe src="https://bokeh.org/" width=100% height=700></iframe>  

Because this is an introductory class, I want to keep things light and manageable, so I won't go through Bokeh examples. For those interested in exploring further on their own, the link below points to Bokeh's GitHub repository as a starting point for self-learning.  
<font size=5>[Click here: self-learning resources for Bokeh](https://github.com/bokeh/bokeh)</font> 