# <center> Introduction </center>

![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Plotly-logo-01-square.png/1200px-Plotly-logo-01-square.png)

## Charts are your friend

A major part of every data workflow involves doing Exploratory Data Analysis. 

Being able to extract information from data and present it in a meaningful and effective way is perhaps the most important Data Science skill. Charts are your friends here. Notebooks play a very important role on Kaggle, and by extension plotting plays a major role as well.

Python has many charting libraries. Some of them, like Matplotlib and Seaborn, produce static graphs, while others, like Plotly, Bokeh and Altair, produce interactive or dynamic graphs.

## Dynamic > Static

Let me be honest: static graphs look dull to me. Giving others the ability to click around the plot and look closely at each data point lets them extract the maximum information from the graph. You are no longer bound by the constraints of a static plot.

In the world of dynamic plotting in Python, Plotly has made a name for itself. It is perhaps the most widely used interactive graphing library on Kaggle. Plus, with the inclusion of Plotly Express, it has become extremely simple to create stunning graphs.

### So, without further ado, let's start learning how to use Plotly Express to create graphs!

### Graphs I am planning to include in the tutorial

- [x] Scatter Plots
- [x] Line Charts
- [x] Bar Charts
- [x] Pie Charts
- [x] Histogram
- [x] Box Plot
- [x] Violin Plot

If there's anything else you want to see, you can let me know. 

# An overview of the Datasets

In [None]:
# Importing the necessary libraries

# Importing pandas to use dataframes
import pandas as pd

# Importing plotly express which will be used for creating the visualizations
import plotly.express as px
from plotly.offline import init_notebook_mode
# Doing this to make sure the graphs are visible in the kaggle kernels and not just a blank white screen
init_notebook_mode()

In [None]:
# Reading all the datasets and creating dataframes

heart = pd.read_csv('../input/heart-disease-uci/heart.csv')
covid = pd.read_csv('../input/novel-corona-virus-2019-dataset/covid_19_data.csv')
co2 = pd.read_csv('../input/co2-ghg-emissionsdata/co2_emission.csv')
houses = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

In [None]:
heart.head()

In [None]:
covid.head()

In [None]:
co2.head()

In [None]:
houses.head()

# Scatter Plots

We use px.scatter to create Scatter Plots in Plotly Express.

In [None]:
fig = px.scatter(heart, # the dataframe that has the data points we want to plot
                 x = 'trestbps', # the name of the column in the dataframe whose values will be plotted on the x-axis
                 y = 'chol', # the name of the column in the dataframe whose values will be plotted on the y-axis
                 color = 'target',# the name of the column that will be used to assign colour to the marks on the Scatter Plot
                 title='Cholesterol vs Blood Pressure', # Title of the plot
                 template = 'ggplot2' # ggplot2 is one of the in-built templates in plotly, used for some theming of the graphs
                )
fig.show()

As you can see, you can hover over individual points to see their x and y coordinates.

In this case, the color of each point is governed by the target column, and all points are plotted in the same graph. We can also create a separate subplot for each category. Let us try this with the 'sex' column.

Faceting can be understood as splitting the plot into multiple subplots based on the values of a particular column, arranged either as rows (facet_row) or as columns (facet_col). Here we use facet_col and thus end up with 2 subplots, one for each of the two sexes in the dataset.

In [None]:
fig = px.scatter(heart, x = 'trestbps', y = 'chol', title='Cholesterol vs Blood Pressure', 
                 facet_col = 'sex', # the name of the column in the dataframe whose values are used for creating subplots
                 color = 'target', template = 'ggplot2')
fig.show()

Thus, we ended up with 2 graphs, one for each of the two sexes. We can also see how the data points are scattered for each target, identified by colour.

# Line Charts

px.line gives us the capability to draw line charts using Plotly Express. 

In [None]:
fig = px.line(co2[co2['Entity'] == 'India'],# the dataframe
              x = 'Year', y = r'Annual CO₂ emissions (tonnes )', 
              title = 'Annual CO₂ Emissions of India over the years' # title of the graph
             )
fig.show()

# Bar Charts

Similar to other charts, we use px.bar to create Bar Plots in Plotly. 

In [None]:
#calculating the sum of all Confirmed Cases in each Country/Region
covidCases = pd.DataFrame(covid.groupby('Country/Region')['Confirmed'].sum()).reset_index()
#sorting the dataset in descending order on the basis of number of Confirmed cases,i.e countries with the most Confirmed covid cases will be at the top
covidCases = covidCases.sort_values(by = 'Confirmed', ascending = False)

fig = px.bar(covidCases.iloc[:20], #plotting only the top 20 Countries
             x = 'Country/Region', y = 'Confirmed', title = 'Top 20 Countries based on number of Confirmed Covid Cases')
fig.show()

Sometimes we want to plot the bar chart horizontally so that the values are easier to compare. Let us do that now.

In [None]:
fig = px.bar(covidCases.iloc[:20],  
             x = 'Confirmed', y = 'Country/Region', #notice how we've swapped the x and y values here in comparison to the Vertical Bar Chart
             orientation = 'h', #orientation - 'h' signifies horizontal orientation, thus the bar chart converts into a row chart
             #however, even if we omit the orientation parameter, we would still get the same bar chart as swapping what goes on the x-axis and what on the y-axis was sufficient in our case
             title = 'Top 20 Countries based on number of Confirmed Covid Cases')
fig.show()

In [None]:
# counting the number of samples for each slope within each target
# groupby(...).size() is used because the column name produced by value_counts() differs between pandas versions
count = heart.groupby(['target', 'slope']).size().reset_index(name = 'Counts') #the Counts column holds the number of samples
count = count.rename(columns = {'slope': 'Slope'})

fig = px.bar(count, x = 'Slope', y = 'Counts',# Plotly Express automatically labels the axes based on the column names, thus creating a column named Counts helps in getting meaningful axis labels
             facet_col = 'target',# faceting using the target values, which are 0 and 1
             title = 'Slope Distribution Across Target')
fig.show()
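To see what the counting step produces before it reaches px.bar, here is a small pandas-only sketch on a made-up frame (the `toy` values are invented for illustration):

```python
import pandas as pd

# Made-up target/slope pairs
toy = pd.DataFrame({
    'target': [0, 0, 0, 1, 1, 1, 1],
    'slope':  [1, 1, 2, 2, 2, 2, 0],
})

# One row per (target, slope) combination with the number of samples in it
counts = toy.groupby(['target', 'slope']).size().reset_index(name='Counts')
print(counts)
```

The result is a tidy table with one row per bar, which is the shape Plotly Express works best with.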

# Pie Charts

> **A pie chart is a circular statistical chart, which is divided into sectors to illustrate numerical proportion.**

In [None]:
fig = px.pie(heart, # the dataframe from which values would be taken for plotting
             names = 'slope',# the values from this column are used as labels for the sectors of the pie chart. Since we do not set the values parameter, each sector's size is the number of observations with that label
             color_discrete_sequence=px.colors.sequential.Inferno, #plotly has a lot of in-built colour scales, The Inferno color scale is used to assign the colors to each of the sectors here
             title = 'Demonstrating Pie Charts')
fig.show()

In [None]:
# We can verify that the sector percentages match the share of observations for each Slope in our dataset
heart['slope'].value_counts(normalize = True)

**Pro Tip**: Avoid Pie Charts as much as possible. Try using Bar Charts instead, as the human eye is not very good at comparing angles.

We can also convert this pie chart into a donut chart by using the hole parameter. Again, the problem of humans not being good with angles remains.

In [None]:
fig = px.pie(heart, names = 'slope',
             color_discrete_sequence=px.colors.sequential.RdBu,#using another color scale, because why not?
             hole=0.3, #sets the fraction of the radius that is cut out of the pie. Using 0.3 is a good standard according to me
             title = 'Demonstrating Donut Chart')
fig.show()

Some advanced customizations can also be done, such as pulling a sector out of the chart and displaying values instead of percentages.

In [None]:
fig = px.pie(heart, names = 'slope',title = 'Advanced Customizations of Pie Chart')
# each chart in a plotly figure is called a trace. The plotly figure.update_traces allows us to have much finer control over the charts. 
fig.update_traces(textinfo = 'value',# we can now display actual values and not percentages in the pie chart
                  insidetextorientation = 'tangential',# the text would be tangentially oriented inside the chart
                  pull = [0.2,0,0] #pull the first sector, here the sector belonging to Slope 0
                 )
fig.show()

# Histograms

In [None]:
fig = px.histogram(houses,x = 'SalePrice',#the distribution of this column is plotted along the x-axis
                   title = 'Distribution of House Sale Price')
fig.show()

Now what if we want to see the distribution on the log scale? 

Simple. Just add one more parameter and we are all set.

In [None]:
fig = px.histogram(houses,x = 'SalePrice', title = 'Transforming the x-axis into log scale', 
                   log_x=True # the x-axis values are transformed into log scale, this can be seen in the range of values in the x-axis
                  )
fig.show()

On the log scale, the distribution looks more or less normal.

Another common thing to do is to change the number of bins to see if we get a better understanding of the data. Also, let's plot separate histograms for houses sold in different years.

It may sound like a lot, but it is actually fairly simple. Just tag along.

In [None]:
fig = px.histogram(houses,x = 'SalePrice', title = 'Distribution of House Price Across Years',
                   nbins=200, #this sets the number of bins to 200
                   color='YrSold' # each year would have a histogram plotted with different colours
                  )
fig.show()

Now we have a histogram with 200 bins with the distribution in each year of house sold represented in a different colour.

# Box Plots

In [None]:
fig = px.box(heart, x = 'slope',#the name of the column whose values are going to be used to place marks on the x-axis. Here, since we have slope as a categorical variable with 3 distinct values, we will end up with a box plot for each slope
             y='chol',# in this case, this is the column whose values will be used to draw the box plot
             title = 'Distribution of Cholesterol across various slope', color = 'target')
fig.show()

Just like bar charts, we can also change the orientation of box plots.

In [None]:
fig = px.box(heart, 
             x = 'chol',y='slope', #notice how we have swapped the x and y column variables in this case
             title = 'Distribution of Cholesterol across various slope',
             orientation = 'h',#orientation = 'h' means we want a horizontal box plot
             color = 'target')
fig.show()

Sometimes, box plots can be misleading because they hide how many observations sit behind each box. For example, suppose one Slope has only a few observations with a median cholesterol of 150, while another Slope has a median of 145. We might conclude that cholesterol is generally lower for the second Slope. But if the second Slope has far more samples, its cholesterol values may vary widely, while the median of the first Slope rests on only a handful of points, so the comparison is not that reliable.

So let us also plot the underlying data along with the box plot using the points parameter.

In [None]:
fig = px.box(heart, x = 'slope',y='chol',title = 'Distribution of Cholesterol across various slope with Underlying data', 
             color = 'target', 
             points = 'all' # we plot all the underlying data points alongside the boxes
            )
fig.show()
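Alongside plotting the points, a quick numeric check of the sample size behind each box can help. A small pandas-only sketch on made-up values that mirror the 150 vs 145 example:

```python
import pandas as pd

# Made-up cholesterol values: slope 0 has only 2 samples, slope 1 has 5
toy = pd.DataFrame({
    'slope': [0, 0, 1, 1, 1, 1, 1],
    'chol':  [150, 150, 120, 140, 145, 150, 170],
})

# Number of observations and median cholesterol for each slope
summary = toy.groupby('slope')['chol'].agg(['count', 'median'])
print(summary)
```

Here slope 0 has the higher median (150 vs 145), but it rests on just two observations, which is exactly the situation where a box plot alone can mislead.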

# Violin Plots

In [None]:
fig = px.violin(heart, x = 'slope',#the name of the column whose values are going to be used to place marks on the x-axis. Here, since we have slope as a categorical variable with 3 distinct values, we will end up with a violin plot for each slope
                y='chol',# in this case, this is the column whose values will be used to draw the violin plot
                title = 'Distribution of Cholesterol across various slope', color = 'target')
fig.show()

The above code is exactly the same as what we wrote for the box plot, except that we call px.violin instead of px.box to draw violin plots. That's how easy it is.

In [None]:
fig = px.violin(
    heart, x = 'slope',y='chol',title = 'Distribution of Cholesterol across various slope with Underlying data', 
    color = 'target', 
    points = 'all' #we plot all the underlying data points alongside the violins
               )
fig.show()

Again, the code is exactly the same as before, this time with the underlying data added as well.

# Bonus

If you also want to learn about how to use Plotly to create **Animated Race Charts**, I have another kernel demonstrating the same. 

Kernel : [[Plotly]Animated Race Charts & IPL 2020 Analysis](https://www.kaggle.com/foolofatook/plotly-animated-race-charts-ipl-2020-analysis/notebook)

### I hope this tutorial was beneficial for you. If you liked it, you can upvote, and if you want to add something more, let me know in the comment section.