# How to Use Plotly

## Introduction

This tutorial introduces a Python package called Plotly.

Plotly is a data visualization package for Python. It supports many kinds of graphs, ranging from statistical charts to financial and geographical ones. Unlike many other Python visualization packages, which are built on top of matplotlib, Plotly is built on top of a JavaScript library (Plotly.js). Graphs built with Plotly are interactive, which lets users explore the data by hovering, zooming, and panning.

Data visualization has always been an important part of data science. Training and testing machine learning models matters, but we also need to be able to find the patterns and results in those models that carry meaning. Convenient visualization tools make it easier to see what the dataset and the statistical model are telling us, and because the graphs are easy to manipulate, we can freely adjust them as we explore. Machine learning makes up a large part of data science, but predicted values are hard to interpret without effective visualization.

In this tutorial, we will use Plotly to create different types of graphs and learn how to visualize machine learning models more quickly and easily. We will apply Plotly to a real use case to see how convenient it can be when working with real data.

## Tutorial Content

In this tutorial, we will show some basic data visualization techniques that can be done with Plotly.

- Getting Started : Installing the Plotly library
- Basic Graphs : Which basic graphs are commonly used and how to create them
- Applying Plotly : Fitting a statistical model to data and visualizing the outcome using Plotly


## Getting Started : Installing the Plotly library

Before using Plotly, we must install it. Installation requires an internet connection and one `pip3` command per package, which downloads everything we need. First install the `nbformat` package and then the `plotly` package. In a notebook cell, run the following commands (in a terminal, drop the leading `!`):


`!pip3 install nbformat==4.2.0`

`!pip3 install plotly==5.6.0`
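One way to confirm that the installation worked is to ask Python which versions it can see. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this illustration and is not part of Plotly.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the two packages installed in this section
for name in ("plotly", "nbformat"):
    print(name, installed_version(name) or "not installed")
```

If either line prints "not installed", rerun the corresponding `pip3` command in the same environment that runs the notebook.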


Now that Plotly is installed, let's import the modules we will need as we go on. If the following cell runs without errors, the installation succeeded.

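In [25]:
import plotly.graph_objects as go  # low-level figure and trace objects
import plotly.express as px        # high-level plotting functions
import plotly.data as pd           # built-in sample datasets (this alias is not pandas)

# Load the sample election dataset as a DataFrame
election_data = pd.election()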

For the following explanation of the basic graphs, we use the election dataset, which includes the columns `district`, `Coderre`, `Bergeron`, `Joly`, `total`, `winner`, `result`, and `district_id`. Some of these columns are quantitative and some are categorical. Because different graphs suit different types of variables, we chose a dataset with a good mix of both so that we can use it to show different types of graphs.

## Basic Graphs

Now that we have imported the modules required to use Plotly and chosen our dataset, let's dive into creating some basic graphs. We will look at bar charts, histograms, scatter plots, and heat maps, which are among the most common and essential graph types. Not all of them are well suited to visualizing machine learning models, but they still give us a better idea of how the data is set up: how it is distributed and how its features are related. Data science is not only about machine learning; conveying the data and the results is just as important.

### Bar Chart

A bar chart is a type of graph that shows an amount for each category of a categorical variable. Given categories in our data, we can use bar charts to show how often each category occurs or to compare numerical values across categories. The height of each bar encodes that count or value.

For this tutorial we will look at the basic bar chart and also at a stacked bar chart, which conveys more information.

First, we look at the basic bar chart. Using the election data that we imported, we want to see which candidate won the most districts. Because each row corresponds to a distinct district, counting the rows per winner gives the number of districts each candidate won.


In [36]:
plot = px.bar(election_data, x='winner',title="Count of Districts that the Candidates have Won")
plot.show()


As shown above, we can count the categories simply by naming the column that contains them. Comparing the heights, we can see that Coderre won the most districts of the three candidates, followed by Bergeron. We did not specify a y value here: when only a categorical column is given for x, Plotly counts the rows in each category automatically. We could also specify y to relate the categories to numeric values in another column.

What if we want to incorporate additional features into this bar plot? We can add another categorical feature to show new information on top of what we already have. This gives us a stacked bar plot.

In [111]:
plot = px.bar(election_data,x="winner",color="result",title="Comparison of the Number of Districts Won and How They Won It")
plot.show()

Here each bar is subdivided by additional information: we set the `color` parameter to the `result` column. We can now see both the number of districts won by each candidate and how those districts were won. The heights of the colored segments show, for each candidate, how many districts were won by a majority and how many by a plurality.

### Histograms

Another plot that we will introduce is the histogram. Histograms look similar to bar charts, but instead of counting the categories of a categorical variable, they represent the distribution of numerical data. When using histograms, we can choose the number of bins. Bins are equal-width intervals that divide the range of the data, one per bar. Each data point falls into the bin whose interval contains it, so some bins hold more points and some hold fewer. Taken together, the bins show the distribution of the numerical data.



In [28]:

figure = px.histogram(election_data,x="total",y="Bergeron",nbins=15)
figure.show()

The plot above shows the votes Bergeron received against the total number of votes in each district. The x axis is divided into bins of total votes, and because a y column is given, each bar shows the sum of Bergeron's votes over the districts that fall in that bin. Keep in mind that different numbers of bins can give drastically different histograms, so it is important to choose an appropriate number to get a good idea of the distribution. If the bins are too wide, each bar contains too many data points and the shape of the dataset's distribution is lost; if they are too narrow, the histogram becomes noisy. The following is the same histogram with `nbins=30`, which allows twice as many, narrower bins.

In [29]:
figure = px.histogram(election_data,x="total",y="Bergeron",nbins=30)
figure.show()

### Scatter Plot

Scatter plots are widely used because they place each data point at the coordinates given by two numerical variables. They can reveal trends and relationships between those variables. For example, we can show how the number of votes that Bergeron received relates to the total number of votes in each district.

In [113]:
scatterplot = px.scatter(election_data,x="total",y="Bergeron")
scatterplot.show()

Compared to histograms and bar plots, a scatter plot shows the relationship between numerical variables, and we can overlay more on the same graph, such as a regression fit, to make it easier for readers to understand. We will explore scatter plots further in the next section, where we visualize a machine learning regression model.

### Density Heat Maps

Density heat maps are also known as 2D histograms. We use the same x and y as above, but the plane is divided into rectangular bins and each bin is colored by an aggregate of the points inside it: the count by default, or a function such as the sum or average of a third variable. Heat maps may not be the most useful plots for machine learning models, but when data points are heavily clustered, they help us see where the points are concentrated. In Plotly, heat maps are created with the `density_heatmap` function.

In [31]:
fig = px.density_heatmap(election_data,x="total",y="Coderre")
fig.show()

Because heat maps are a type of histogram, we can change the number of bins to see where the data points are mostly located. As with histograms, choosing the right bin size is important for an accurate representation of the data.

In [32]:
fig = px.density_heatmap(election_data,x="total",y="Coderre",nbinsx=10,nbinsy=10,color_continuous_scale="Viridis")
fig.show()

By reading the colors against the color scale on the right, we can see where the data points are concentrated and what values they take. Density heat maps convey more information than one-dimensional histograms.

## Visualizing a Machine Learning Model Using Plotly

In this section, we will apply Plotly to a trained machine learning model. We will use the scikit-learn package to build and train the model, and then visualize the model and its predictions to understand it better.

For this example we will use a linear regression model and visualize it with a scatter plot. Using the `tips` dataset, we will use the total bill and the size of the party as features to predict the tip amount.

In [97]:
import numpy as np
import plotly.data as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression


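In [121]:
# Load the tips dataset and shuffle its rows
tip_data = pd.tips()
tip_data = tip_data.sample(frac=1)

# Features: total bill and party size, plus a constant column
# (redundant here, since LinearRegression fits an intercept by default)
xTrain = tip_data.loc[:, ['total_bill', 'size']].copy()
xTrain["Tmp"] = 1
tip_data["color"] = "red"

xTrain = xTrain.to_numpy()
# 1-D target vector of tip amounts
yTrain = tip_data['tip'].to_numpy()

# Fit the linear regression and predict on the training data
model = LinearRegression()
model.fit(xTrain, yTrain)
pred = model.predict(xTrain)

# Store the predictions alongside the original columns
tip_data["pred"] = pred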


Now that we have fit the model to the tip data, we want to visualize the fit: the predicted values, which lie on the fitted plane, shown in three dimensions.

In [122]:
plot = px.scatter_3d(tip_data, x='total_bill', y='size', z="pred")
plot.show()

After fitting the model, we can see its predicted values. Because we used linear regression, the predictions lie on a plane, and this linear structure is visible in the plot. The 3D scatter plot shows the predicted tip at a given total bill and party size. Going further, we could draw the fitted plane as a surface to make the graph more intuitive.

## References

1. [Plotly Documentation](https://plotly.com/python/)
2. [Plotly Datasets](https://plotly.com/python-api-reference/generated/plotly.data.html)
3. [Plotly Package](https://github.com/plotly/plotly.py)