# C8M4 Notebook 1: Basic Plotly Charts

(Run files in local environment)

## Plotly Libraries

**plotly.graph_objects:** 
This is a low-level interface to figures, traces, and layout. The Plotly graph objects module provides an automatically generated hierarchy of classes (figures, traces, and layout) called graph objects. These graph objects represent figures with the top-level class `plotly.graph_objects.Figure`.

**plotly.express:** 
Plotly Express is a high-level wrapper for Plotly. It is the recommended starting point for creating the most common figures provided by Plotly, using a simpler syntax. It uses graph objects internally.

Now let us use these libraries to plot some charts. We will start with `plotly.graph_objects` to plot scatter and line plots.

> Note: You can hover the mouse over a chart whenever you want to view the values behind a data point.



In [None]:
#import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

## Scatter Plot
A scatter plot shows the relationship between 2 variables on the x and y-axis. The data points here appear scattered when plotted on a two-dimensional plane. Using scatter plots, we can create exciting visualizations to express various relationships, such as:

* Height vs weight of persons
* Engine size vs automobile price
* Exercise time vs Body Fat


In [None]:
# ex1: Income vs Age of people in a scatter plot
age_array = np.random.randint(25, 55, 60)
# income_array needs the same number of values as age_array (60)
income_array = np.random.randint(300000, 700000, 60)

#1st, create empty figure using go.Figure()
fig = go.Figure()
fig

In [None]:
# Next we create a scatter plot by calling add_trace and passing it a go.Scatter() object
# In go.Scatter we define the x-axis data and y-axis data, and set the mode to markers with blue as the marker color
fig.add_trace(go.Scatter(x = age_array, y = income_array, mode = "markers", marker = dict(color = 'blue')))

In [None]:
fig.update_layout(title = "Economic Survey", xaxis_title = "Age", yaxis_title = "Income")
fig.show()

---
## Line Plot
A line plot shows information that changes continuously with time. Here the data points are connected by straight lines. Line plots are also plotted on a two dimensional plane like scatter plots. Using line plots, we can create exciting visualizations to illustrate:

  * Annual revenue growth
  * Stock Market analysis over time
  * Product Sales over time


In [None]:
#ex2: illustrate the sales of bicycles from Jan to Dec last year using a line chart
bicycle_sales_array = [50, 100, 40, 150, 160, 70, 60, 45, 85, 100, 90, 75]
months_array = ["Jan","Feb","Mar","April","May","Jun","Jul","Aug", "Sep", "Oct", "Nov", "Dec" ]

fig = go.Figure()
# For mode = "lines", the line color is set through the `line` argument
fig.add_trace(go.Scatter(x = months_array, y = bicycle_sales_array,
                         mode = "lines", line = dict(color ="green")))

fig.update_layout(title='Bicycle Sales', xaxis_title='Months', yaxis_title='Number of Bicycles Sold')

fig.show()

---
## Bar Plot
A bar plot represents categorical data in rectangular bars. Each category is defined on one axis, and the value counts for this category are represented on the other axis. Bar charts are generally used to compare values. We can use bar plots to visualize:

 * Pizza delivery time in peak and non peak hours
 * Population comparison by gender
 * Number of views by movie name

**In Plotly Express, we set the axis values and the title within the same function call: `px.<graphtype>(x=<x-axis value source>, y=<y-axis value source>, title=<appropriate title as a string>)`. In the code below we use `px.bar(x=grade_array, y=score_array, title='Pass Percentage of Classes')`.**


In [None]:
#ex3: illustrate the average pass percentage of classes from grade 6 to grade 10

score_array=[80,90,56,88,95]
grade_array=['Grade 6','Grade 7','Grade 8','Grade 9','Grade 10']

fig = px.bar(x = grade_array, y = score_array, title = "Pass Percentage of Classes")
fig.show()

---
## Histogram

A histogram represents continuous data in the form of bars. In a bar graph each bar stands for a discrete value, whereas in a histogram each bar represents a range of values. Histograms show frequency distributions. We can use histograms to visualize:

 * Distribution of students' marks
 * Frequency of waiting times of customers in a bank



In [None]:
#ex4: illustrate the distribution of heights of 200 people using a histogram
heights_array = np.random.normal(160, 11, 200)

fig = px.histogram(x = heights_array, title = "Distribution of Heights")
fig.show()

---
## Bubble Plot
A bubble plot is used to show the relationship between 3 or more variables. It is an extension of a scatter plot. Bubble plots are ideal for visualizing:

  * Global Economic position of Industries
  * Impact of viruses on Diseases


In [None]:
#ex4: illustrate crime statistics of US cities with a bubble chart

# create a dictionary with City, Numberofcrimes and Year as keys
crime_details = {
    'City' : ['Chicago', 'Chicago', 'Austin', 'Austin','Seattle','Seattle'],
    'Numberofcrimes' : [1000, 1200, 400, 700,350,1500],
    'Year' : ['2007', '2008', '2007', '2008','2007','2008'],
}

df = pd.DataFrame(crime_details)
df

In [None]:
bubble_data = df.groupby("City")["Numberofcrimes"].sum().reset_index()
bubble_data

In [None]:
fig = px.scatter(bubble_data, x = "City", y = "Numberofcrimes",
                 size = "Numberofcrimes", hover_name = "City",
                 title = "Crime Statistics", size_max = 60)
fig.show()

---
## Pie Chart
A pie chart is a circular chart mainly used to represent the proportion of each part of the data with respect to the whole. Each slice represents a proportion, and together the proportions add up to the whole. We can use pie charts to visualize:

 * Sales turnover percentage with respect to different products
 * Monthly expenditure of a family


In [None]:
##ex5: Monthly expenditure of a family
exp_percent= [20, 50, 10,8,12]   # sum to 100
house_holdcategories = ['Grocery', 'Rent', 'School Fees','Transport','Savings']

# Use the px.pie function to create the chart.
# The `values` parameter sets the value of each sector; `exp_percent` is passed to it.
# The `names` parameter sets the label of each sector; `house_holdcategories` is passed to it.

fig = px.pie(values = exp_percent, names = house_holdcategories,
             title = "Household Expenditure")
fig.show()

---
## Sunburst Chart
Sunburst charts represent hierarchical data in the form of concentric circles. The innermost circle is the root node, which defines the parent, and the outer rings move down the hierarchy from the centre. They are also called radial charts. We can use them to plot:

* Worldwide mobile sales, where we can drill down as follows:
    * innermost circle represents total sales
    * first outer circle represents continent-wise sales
    * second outer circle represents country-wise sales within each continent
       
* Disease outbreak hierarchy

* Real Estate Industrial chain

In [None]:
#ex6:

# Create a dictionary in which `character` lists a set of people, `parent` lists the parent
# of each of these characters, and `value` lists the value associated with each sector.

data = dict(
    character=["Eve", "Cain", "Seth", "Enos", "Noam", "Abel", "Awan", "Enoch", "Azura"],
    parent=["", "Eve", "Eve", "Seth", "Seth", "Eve", "Eve", "Awan", "Eve" ],
    value=[10, 14, 12, 10, 2, 6, 6, 4, 4])

fig = px.sunburst(
    data,
    names = "character",
    parents = "parent",
    values = "value",
    title = "Family Chart"
)

fig.show()

---
---
# Practice Exercises: Apply Plotly Skills to an Airline Dataset

The Reporting Carrier On-Time Performance Dataset contains information on approximately 200 million domestic US flights reported to the United States Bureau of Transportation Statistics. The dataset contains basic information about each flight (such as date, time, departure airport, arrival airport) and, if applicable, the amount of time the flight was delayed and information about the reason for the delay. This dataset can be used to predict the likelihood of a flight arriving on time.

In [None]:
df = pd.read_csv("datasets/airline.csv", encoding = "ISO-8859-1",
                            dtype={'Div1Airport': str, 'Div1TailNum': str, 
                                   'Div2Airport': str, 'Div2TailNum': str})
df.head()

In [None]:
df.shape

In [None]:
data = df.sample(n = 500, random_state = 42)
data.shape

It would be interesting to capture details visually, such as:

* Departure time changes with respect to airport distance

* Average flight delay time over the months

* Number of flights in each destination state

* Number of flights per reporting airline

* Distribution of arrival delay

* Proportion of distance group by month (month indicated by numbers)

* Hierarchical view in the order of month and destination state, holding the number of flights as the value


## Scatter Plot
Let us use a scatter plot to represent how departure time changes with respect to airport distance.

This plot should contain the following:

* Title as **Distance vs Departure Time**.
* x-axis label should be **Distance**
* y-axis label should be **DepTime**
* **Distance** column data from the flight delay dataset should be plotted on the x-axis
* **DepTime** column data from the flight delay dataset should be plotted on the y-axis
* Scatter plot markers should be red


In [None]:
distance = data.Distance
dep_time = data.DepTime

fig = go.Figure()
# Markers are red, as required by the exercise
fig.add_trace(go.Scatter(x = distance, y  = dep_time, mode = "markers", marker = dict(color="red")))
fig.update_layout(title='Distance vs Departure Time', xaxis_title='Distance', yaxis_title='DepTime')

fig.show()

**Inferences**

It can be inferred that flights over shorter distances depart around the clock. Flights over longer distances, however, depart at fewer times through the day.

---
## Line Plot
Let us now use a line plot to extract the average monthly arrival delay time and see how it changes over the year.

This plot should contain the following:

* Title as **Month vs Average Flight Delay Time**.
* x-axis label should be **Month**
* y-axis label should be **ArrDelay**
* A new dataframe **line_data** should be created, which consists of 2 columns from the dataset: **Month** and the average **ArrDelay** (arrival delay time) per month
* **Month** column data from the line_data dataframe should be plotted on the x-axis
* **ArrDelay** column data from the line_data dataframe should be plotted on the y-axis
* The plotted line should be green


In [None]:
# Group the data by Month and compute average over arrival delay time.
line_data = data.groupby('Month')['ArrDelay'].mean().reset_index()
line_data
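
To see what this group-and-average step does, here is a small sketch on a toy DataFrame (the values are made up). Note that `mean()` skips missing delays by default.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flight data: two flights in each of two months.
toy = pd.DataFrame({
    "Month": [1, 1, 2, 2],
    "ArrDelay": [10.0, 20.0, np.nan, 4.0],
})

# One row per month; the NaN in month 2 is ignored when averaging.
toy_line_data = toy.groupby("Month")["ArrDelay"].mean().reset_index()
print(toy_line_data)
```

`reset_index()` turns the `Month` group keys back into a regular column, so the result has the two columns that the plot needs.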

In [None]:
month = line_data.Month
arr_delay = line_data.ArrDelay

fig = go.Figure()
fig.add_trace(go.Scatter(x = month, y = arr_delay, mode = "lines", line = dict(color="green")))
fig.update_layout(title = "Month vs Average Flight Delay Time",
                  xaxis_title = "Month",
                  yaxis_title = "ArrDelay"
                 )

fig.show()

**Inferences**

It is found that the average monthly delay time is highest in the month of June.

---
## Bar Chart

Let us use a bar chart to extract the number of flights from a specific airline that go to a destination.

This plot should contain the following:

* Title as **Total number of flights to the destination state split by reporting airline**.
* x-axis label should be **DestState**
* y-axis label should be **Flights**
* Create a new dataframe called **bar_data** which contains 2 columns, **DestState** and **Flights**. Here **Flights** indicates the total number of flights in each combination.



In [None]:
bar_data = data.groupby("DestState")["Flights"].sum().reset_index()
dest_state = bar_data.DestState
flights = bar_data.Flights

In [None]:
fig = px.bar(x = dest_state, y = flights, title = "Total number of flights to the destination state split by reporting airline")
fig.show()


---
## Histogram


Let us represent the distribution of arrival delay using a histogram.

This plot should contain the following:

* Title as **Distribution of Arrival Delay**.
* x-axis label should be **ArrDelay**
* y-axis will show the count of arrival delay


In [None]:
data['ArrDelay'] = data['ArrDelay'].fillna(0)
# Passing the dataframe and the column name labels the x-axis "ArrDelay"
fig = px.histogram(data, x = "ArrDelay", title = "Distribution of Arrival Delay")
fig.show()

**Inferences**

It is found that there is only max of 5 flights with an arrival delay of 50-54 minutes and around 17 flights with an arrival delay of 20-25 minutes

---
## Bubble Chart
Let us use a bubble plot to represent the number of flights per reporting airline.

This plot should contain the following:

* Title as **Reporting Airline vs Number of Flights**.
* x-axis label should be **Reporting_Airline**
* y-axis label should be **Flights**
* Size of the bubble should be **Flights**, indicating the number of flights
* Name of the hover tooltip should be set to `Reporting_Airline` using the `hover_name` parameter.


In [None]:
bub_data = data.groupby('Reporting_Airline')['Flights'].sum().reset_index()
bub_data

In [None]:
fig = px.scatter(bub_data, x = "Reporting_Airline", y = "Flights",
                 size = "Flights", hover_name  = "Reporting_Airline",
                 title = "Reporting Airline vs Number of Flights",
                 size_max = 60)
fig.show()

**Inferences**

It is found that the reporting airline **WN** has the highest number of flights which is around 86

---
## Pie Chart

Let us represent the proportion of Flights by Distance Group (Flights indicated by numbers)

This plot should contain the following

* Title as **Flight Proportion by Distance Group**.
* values should be **Flights**
* names should be **DistanceGroup**


In [None]:
fig = px.pie(values = data.Flights, names = data.DistanceGroup,
             title = "Flight Proportion by Distance Group")
fig.show()

**Inferences**
It is found that Distance group 2 has the highest flight proportion.

---
## SunBurst Chart
Let us represent the hierarchical view in the order of month and destination state, holding the number of flights as the value.

This plot should contain the following:

* Define the hierarchy of sectors from root to leaves in the `path` parameter. Here, we go from the `Month` feature to the `DestStateName` feature.
* Set sector values in the `values` parameter. Here, we can pass in the `Flights` feature.
* Title as **Flight Distribution Hierarchy**
* Show the figure.


In [None]:
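# Keep only the columns needed for the hierarchy and the sector values
sunburst_data = data[["Month", "DestStateName", "Flights"]]

In [None]:
# `path` lists the hierarchy from root (Month) to leaves (DestStateName)
fig = px.sunburst(
    sunburst_data,
    path = ["Month", "DestStateName"],
    values =  "Flights",
    title  = "Flight Distribution Hierarchy"
)
fig.show()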

**Inferences**

Here the **Month** numbers in the innermost concentric circle form the root, and for each month we can check the **number of flights** to the different **destination states** under it.


---
Thank You!!!