# Session 8 Screencast Notebook

## 8.1 Introduction

Hi and welcome to our session on visualisations in Python.

 As you can imagine, how we present our data is often as important as what we do with it in the first place, because a visualisation is likely going to be what really conveys information to the end user. If we choose the wrong kind of visualisation we could give a misleading message, or even completely fail to communicate it altogether.
 
 In this session we will consider which visualisation is best applied in different circumstances, and create visualisations using the libraries Matplotlib, Seaborn and Plotly. After we get an understanding of how these libraries work, we will display some geographic data on a map, choosing an appropriate method and projection.


## 8.3 Choosing a visualisation

The way in which we display our data is incredibly important to convey information, and the kind of visualisation we choose depends primarily on two factors - what kind of data we have, and what information we are trying to convey.

 For example, if we want to demonstrate the differences in height between children at 4 years old, and 5 years old, we could plot each individual point on a graph like this. [points instead of a histogram]. We can kind of see the difference between the two groups, but it's really difficult to tell. Even though they are separated on the bottom axis, the two groups still aren't very distinct. So let's make them different colours [coloured points instead of a histogram]. This is better, but still not very easy to see. That's because this kind of scatter plot is inappropriate to compare between two categories, when there is only a single measurement for each individual. What works much better here, is the use of a histogram [histogram]. Now we can see the average difference between these two groups, and this is great, because the samples in both of these groups are normally distributed.
 
   So histograms are great for working with height, and like we said, the visualisation needs to be appropriate for the data. But what if we consider a new dataset, that seems like it's quite similar to the last one? Let's imagine we are applying for a new job in two different companies. Everything about both companies seems to be the same, and we've been offered both positions. The only difference is that there appear to be different payscales, but for some reason, we won't find out how much we will be paid until after we accept an offer... We have, however, got a sneak peek at the overall pay records in the companies, so we can make our decision based on what we might expect to get. So, like before, let's see a histogram of the pay. [histogram of pay].
   
   Wow! We're clearly going to choose company A because it has the highest average pay! But, because this plot only shows us the average, we might miss something important! If we go back to plotting all the points, we can see something funny here... company A has the highest average, but most employees get paid much less than the CEO! As we're not going to be CEO of this company, we can assume that we will fall in the lesser paid group, meaning that we should have chosen Company B! But, like we said before, plotting single points isn't really appropriate when working with single measurements of single variables. A great option for this is boxplots [boxplots]. Here, we get an insight into the distributions of our data - the line in the centre of these boxes is the median, rather than the mean that we saw in the previous plot. But we also now have these boxes too. The box above the median line represents the 25% of the datapoints, otherwise known as a quartile, that are above the median. This line above that box is known as a whisker, and represents the very highest 25%. The same applies to the bottom two quartiles, with 25% of the data falling in this part of the box, and the very lowest paid 25% being represented by this whisker. So you can see now that company A is very skewed, with almost all the people being paid below the mean. Company B, however, has a much more even distribution, meaning that they're a much better bet!
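
To make the mean versus median point concrete, here is a minimal sketch using NumPy. The pay figures and the two companies are made up purely for illustration; they are not the data behind the plots in the video.

```python
import numpy as np

# Made-up annual pay (in thousands of pounds) for two hypothetical companies.
# Company A has one very highly paid CEO; company B is more even.
pay_a = np.array([18, 19, 20, 20, 21, 22, 22, 23, 24, 500])
pay_b = np.array([38, 40, 42, 44, 45, 46, 48, 50, 52, 55])

for name, pay in [("A", pay_a), ("B", pay_b)]:
    # The 25th, 50th and 75th percentiles are the edges and centre line of a boxplot's box
    q1, median, q3 = np.percentile(pay, [25, 50, 75])
    print(f"Company {name}: mean={pay.mean():.1f}, median={median:.1f}, "
          f"box from {q1:.1f} to {q3:.1f}")

mean_a, median_a = pay_a.mean(), np.median(pay_a)
mean_b, median_b = pay_b.mean(), np.median(pay_b)
```

With numbers like these, company A wins on the mean but loses on the median, which is exactly the trap described above.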
   
   This is all fine if we're comparing a single measurement between different groups. But, what if we want to compare data that has structure, like time-series data? For example, if instead of comparing the heights of children at 2 different ages, we want to see how their heights change over time. In that case, we can use a scatter plot, [scatter with all children] where we plot single points on a graph, with the x axis representing an ordered variable, like age. Now we can see that the children generally get taller as they get older.
   
   If we were to look at the height of just one of those children [scatter with one child], each of the datapoints has a direct relation to the next, and we can therefore draw a line between them, [line with one child] to create a line graph. But be careful when using line graphs; when we join datapoints, we have to make sure that it makes sense; drawing a line between unrelated pieces of information would just be confusing. If we would like to, we can stack line graphs on top of each other, to show the growth of multiple children. [line with two children][then][line with three children] Just beware of adding too many, and making it harder to interpret. [line with all children]
   
   
   
   These are just some of the most fundamental visualisations, and we could easily create a whole module to try to cover the vast array of different methods available to us. But a good starting point is knowing when to use histograms, boxplots, scatter plots, and line graphs.
   
   
   Just remember these key points - ensure that the visualisation you use answers a specific question, ensure that the plot is appropriate for the kind of data you have, and always ensure that it is easily readable.
   
   

## 8.4.1 Matplotlib - line, scatter, and bar plots.

Welcome to our first class on plotting in python. We will start with the library matplotlib. 

Matplotlib is an excellent library for visualisation in Python, with great documentation. Here, we will give an introduction to getting started in Matplotlib, but its capabilities are much, much more. After this video, you will understand the foundations of how to work with it, and you should be able to do almost anything that you see in the documentation with only a little more reading.

To start, we begin with importing the necessary libraries

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

You will notice the bottom line says %matplotlib inline - this is just to make sure the plots are drawn inside the Jupyter notebook itself. It usually works without it, but it's best to have it.

First we start with some data:

In [None]:
data = [1,2,4,8,16]

And getting our first plot is as simple as calling the plot function in pyplot, and passing the data as an argument. Then, we need to tell python to show us our plot, by asking pyplot to show() us.

In [None]:
plt.plot(data)
plt.show()

As simple as that, we have our first plot, which just happens to default to a line plot. 

You can also see that the values in our list are plotted against the y axis, and because we didn't provide values for the X axis, it assumed that they just correspond to the indexes of the values. That is, the value at index 0 was 1, so it is plotted at [0,1], the value at index 1 was 2, so it is plotted at [1,2], the value at index 2 was 4, so it is plotted at [2,4], and so on.
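
If you would like to check this for yourself, here is a small sketch. It relies on the fact that plt.plot returns the line objects it draws, and that each line can report the x values it used through get_xdata. The variable names are just for illustration.

```python
import matplotlib.pyplot as plt

data = [1, 2, 4, 8, 16]

# plot() returns a list of line objects; there is only one here
(line,) = plt.plot(data)

# The x values Matplotlib filled in for us are just the list indexes
x_used = list(line.get_xdata())
print(x_used)

# Passing the indexes explicitly draws the same line again
(line_explicit,) = plt.plot(range(len(data)), data)
plt.show()
```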


We can make this a little more interesting with a slightly more interesting dataset.

First, we generate a linearly spaced vector for our X axis. Don't worry about how this works, we're just using it to create a list of evenly spaced numbers, so we can plot our bottom axis.
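
For the curious, here is a tiny sketch of what np.linspace gives back, using a smaller range than the one in the video so the output is easy to read. The names small and steps are only for illustration.

```python
import numpy as np

# Five evenly spaced values from 0 to 1, with both end points included
small = np.linspace(0, 1, 5)
print(small)

# The gap between neighbouring values is always the same
steps = np.diff(small)
print(steps)
```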

In [None]:
x = np.linspace(-4,4,20)

Next, we generate values to plot on our Y axis, which is simply the squares of the values on the x axis.

In [None]:
y = x*x

And now we create our plot, but this time passing values for both x and y axis.

In [None]:
plt.plot(x,y)
plt.show()

Quite often, we don't just want to show a single trend, we want to compare a few. This is actually very easy to do. Let's start by creating another set of values that contains the values of X cubed.

In [None]:
y2 = x*x*x

Like before, we plot x and y, but this time, we call the plot again, passing x and our new y directly afterwards, then ask pyplot to show us the results.

In [None]:
plt.plot(x, y)
plt.plot(x, y2)
plt.show()

And as you can see, we get x squared as our blue line, and x cubed as our orange line. 

In many cases, line plots are not appropriate. Taking the data we have generated here, joining the points is fine. We took 20 evenly spaced values from -4 to 4, squared them, and plotted the results. Two squared is 4, and 3 squared is 9, and if we fit a curved line between them, we can indeed work out what 2.5 squared is, simply by looking at that line. But in some cases, the points are actually independent and this would be inappropriate. In fact, it would be better to use a scatter graph in this case.
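
Reading a value off the line is really just interpolation. As a rough sketch, np.interp does the same thing numerically, using the straight segments that join neighbouring points, so the answer is close to, but not exactly, the true square.

```python
import numpy as np

x = np.linspace(-4, 4, 20)
y = x * x

# Read a value off the straight segments that join the plotted points
estimate = np.interp(2.5, x, y)
exact = 2.5 ** 2
print(estimate, exact)
```

The estimate only makes sense because the points really do lie on one continuous curve, which is the whole argument for when a line plot is appropriate.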

To do this in Matplotlib, it's as simple as adding an additional argument to each plot.

In [None]:
plt.plot(x, y, 'o')
plt.plot(x, y2, 'o')
plt.show()

By adding a string with the value 'o', we now see that each point is a circle.

 We can use other styles for the points too, and we can even mix them. Here we see the first plot will be an X and the second will be a star.

In [None]:
plt.plot(x, y, 'x')
plt.plot(x, y2, '*')
plt.show()

Another way to create scatter plots, is by using the actual scatter plot included in pyplot, which provides us with even more options.

 Here we generate 50 random points for both X and Y axis, and a variable, sz. Sz is another list of 50 random numbers, and will be represented by the size of each point.

In [None]:
x = np.random.random(50)
y = np.random.random(50)
sz = np.random.randint(2,200,50)
plt.scatter(x, y, s=sz)
plt.show()

The difference in our code, is that, instead of plt.plot, we have plt.scatter, and we pass the additional variable as s, or size. And now we can see our new plot, that is better defined as a bubble plot, rather than a scatter plot. This is because our scatter plot only allowed us to see the relationships between 2 variables, using the x and y axis. Our bubble plot allows us visualise a third variable, by using size.

Another important graph type is the bar chart. Possibly one of the most commonly used when presenting data to the public, it also has its limitations, like we discussed in a previous screen.

For example, let's imagine that we have a company that is exporting a product to a number of countries. If we have a list of the countries, with another list containing the sales for each, assuming that the indexes align, we can plot this using a bar chart.

 So, just like before, we pass our lists as X and Y, and we call barh from pyplot, which draws horizontal bars.

In [None]:
region = ['USA', 'Europe', 'SE Asia', 'China', 'S America', 'Australia']
sales = [15,8,12,18,14,5]
plt.barh(region,sales)
plt.ylabel('Region')
plt.xlabel('Sales £m')
plt.title('Sales by Region')
plt.show()

 Each one of these data points will be standalone from the others, and there is no data between them, so our bar chart has gaps between each bar, meaning that it is not a histogram, which we will cover nearer the end of this video.

In cases where we have sequential data, we prefer to use vertical bars. To do this, we simply change barh to bar, as we can see in this example. Here we iterate through each month of the year, in order, and record what our sales were. We then plot it as a vertical bar chart.

In [None]:
import calendar
# Make array of the first three letters of each month from the calendar module
months = [calendar.month_name[i][:3] for i in range(1,13)]
sales = [5, 7, 12, 13, 16, 23, 25, 20, 18, 15, 12, 14]
plt.bar(months,sales)
plt.show()

If we want to compare our sales to those of a competitor, we might want to plot more than one set of bars on our graph. Just like when we wanted to compare two sequences in our scatter plots, we just call plt.bar again.

In [None]:
import calendar
# Make array of the first three letters of each month from the calendar module
months = [calendar.month_name[i][:3] for i in range(1,13)]
sales_our_company = [5, 7, 12, 13, 16, 23, 25, 20, 18, 15, 12, 14]
sales_competitor_company = [2, 5, 11, 24, 5, 18, 19, 16, 10, 7, 8, 1]
plt.bar(months,sales_our_company)
plt.bar(months,sales_competitor_company)
plt.show()

We can see that something is wrong here. Our bars have completely overlapped, and we can't see the value on the blue bars when they are smaller than the orange ones. What we can simply do, is stack them. This is as simple as adding our first set of sales as a 'bottom' to our second plot, like this.

In [None]:
import calendar
# Make array of the first three letters of each month from the calendar module
months = [calendar.month_name[i][:3] for i in range(1,13)]
sales_our_company = [5, 7, 12, 13, 16, 23, 25, 20, 18, 15, 12, 14]
sales_competitor_company = [2, 5, 11, 24, 5, 18, 19, 16, 10, 7, 8, 1]
plt.bar(months,sales_our_company)
plt.bar(months,sales_competitor_company, bottom=sales_our_company)
plt.show()
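
Stacking is not the only option. Another common approach, sketched here as an alternative rather than something from the video, is to draw the two sets of bars side by side. This assumes we place the bars at numeric positions ourselves and then put the month names back with plt.xticks; the names positions and width are just for illustration.

```python
import calendar
import numpy as np
import matplotlib.pyplot as plt

months = [calendar.month_name[i][:3] for i in range(1, 13)]
sales_our_company = [5, 7, 12, 13, 16, 23, 25, 20, 18, 15, 12, 14]
sales_competitor_company = [2, 5, 11, 24, 5, 18, 19, 16, 10, 7, 8, 1]

positions = np.arange(len(months))  # one numeric slot per month
width = 0.4                         # each bar takes a little under half a slot

# Shift each series half a bar width to either side of the slot centre
ours = plt.bar(positions - width / 2, sales_our_company, width, label="Our company")
theirs = plt.bar(positions + width / 2, sales_competitor_company, width, label="Competitor")

plt.xticks(positions, months)       # put the month names back on the axis
plt.legend()
plt.show()
```

Side-by-side bars make it easier to compare the two companies month by month, while stacked bars make it easier to see the combined total.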

If our X axis was actually sampled from a continuous variable, we would use a histogram, rather than just a bar plot. For example, if we sampled the mean age of every module in the University, we would have a range of values from 16 to 95, all of which would be rational numbers, that is, an integer divided by another integer. So, instead of nice cleanly separated integers, we could have values to as many decimal places as we care to calculate. So, what do we do? We can group them together within certain ranges, which we call bins, and plot the number of values in each of those bins as a histogram.

Here we generate a thousand random numbers, and then we tell pyplot to create a histogram, that puts our datapoints in 20 equally spaced buckets.

In [None]:
# 1000 random numbers with mean 0, variance 1 and a normal distribution
data = np.random.randn(1000)
plt.hist(data, 20)
plt.show()

Because these bars are touching, it signifies that the divisions between the ticks on our X axis are not clear cut, but are instead sampled from a continuous variable.
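
To see what the binning step is doing behind the scenes, here is a minimal sketch using np.histogram, which performs the same kind of grouping but hands back the numbers instead of a picture. The random data is seeded only so that the sketch is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)     # seeded so the sketch is repeatable
data = rng.standard_normal(1000)   # 1000 values from a standard normal distribution

# Group the values into 20 equally wide bins
counts, edges = np.histogram(data, bins=20)

print(counts)   # how many values fell into each bin
print(edges)    # the boundaries of the bins, one more than the number of bins
```

Each bar in the histogram is simply one of these counts, drawn between two neighbouring edges.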

## 8.4.2 Matplotlib - labelling

Let's bring back a visualisation we had from our last video, where we plotted the square of all the values between -4 and 4.

In [None]:
x = np.linspace(-4,4,20)
y = x*x

plt.plot(x,y)
plt.show()

This is pretty good, but we are violating one of the key rules in visualisation - it's not readable! Without labels on the X and Y axis, we have no idea what the data represents. Fixing this is pretty simple, so there's never an excuse to present an unlabelled graph. Just after we plot our graph, we set the xlabel like this, and the ylabel like this. And of course, no graph is complete without a title, so we just add it in the same way.

In [None]:
plt.plot(x, y)

plt.xlabel("X")
plt.ylabel("X Squared")
plt.title('X Squared from -4 to 4')

plt.show()


And there we are, a line showing the values of all the numbers between -4, and 4, squared.

Another wonderful thing about plots in matplotlib is that we can also customise the labels. You will notice that the way in which we labelled these axes was to call a function called xlabel or ylabel in pyplot, and pass it a string containing what we wanted it to say. We can also pass additional arguments to change things like font size, like we have with our title here:

In [None]:
plt.plot(x, y)

plt.xlabel("X")
plt.ylabel("X Squared")
plt.title('X Squared from -4 to 4', fontsize=16)

plt.show()

We now have a much more pleasing looking title, and matplotlib offers a huge array of other options too.

As before, let's bring back our additional line to represent X cubed.

In [None]:
y2 = x*x*x

plt.plot(x, y)
plt.plot(x, y2)
plt.show()

And as you can see, we get x squared as our blue line, and x cubed as our orange line. But how would we have known which line was which, if we didn't already know what to expect? As we said before, no plot is complete without clear labelling, so let's add that in too.

In [None]:
plt.plot(x, y, label="Squared")
plt.plot(x, y2, label="Cubed")
plt.legend()
plt.xlabel("X")
plt.ylabel("f(X)")
plt.title('X Squared vs X Cubed from -4 to 4', fontsize=16)
plt.show()

The only changes we had to make were to add labels to each of our datasets, and then ask pyplot to show us the legend, which is this little box right here.

Something you might be wondering is, if I can add as many lines as I want, simply by just adding another plot, how do I stop lots of old graphs building up and making my graph look overcrowded? Thankfully, in our notebook the graph is wiped clean each time we call plt.show(). This means that, if we had written plt.show() between these two function calls, the two lines would have been drawn on separate graphs, rather than together.

The number of different options for customising the plots is immense, and we don't need to worry about them all, but let's just quickly cover a couple more.

Using the same graphs from before, we can change the colours of the lines by adding a character here. If we use 'r' it will make the line red, and if we use 'b' it will make the line blue.

In [None]:
plt.plot(x, y, 'r', label="Squared")
plt.plot(x, y2, 'b', label="Cubed")
plt.legend()
plt.xlabel("X")
plt.ylabel("f(X)")
plt.title('X Squared vs X Cubed from -4 to 4', fontsize=16)
plt.show()

We saw before how we could change our line plot into a scatter plot by passing an additional argument in exactly the same way. Matplotlib is pretty smart, and can work out what you want it to do, from what you pass to it.
Just remember, the first two arguments in the plot are the values for the x and y axis, and the next is the format. Each format parameter is optional, and is ignored if not passed, and we can also combine them like this.

In [None]:
plt.plot(x, y, 'rx', label="Squared")
plt.plot(x, y2, '*b', label="Cubed")
plt.legend()
plt.xlabel("X")
plt.ylabel("f(X)")
plt.title('X Squared vs X Cubed from -4 to 4', fontsize=16)
plt.show()

You can now see that we made x squared a scatter plot, represented by the red X's and we represented x cubed by stars, and coloured it blue. In the first case, we passed the colour, followed by the marker style, and in the second case, we passed the marker style, followed by the colour. That's just how smart matplotlib is. And I should add, there are lots more options, if you would like to read the documentation.
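
If the short format strings feel a bit cryptic, the same settings can be spelled out as keyword arguments. This is a small sketch of the equivalence for the 'rx' case, using the color, marker and linestyle arguments of plt.plot; the variable names are just for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 20)
y = x * x

# Short form: colour 'r' and marker 'x' packed into one format string
(short,) = plt.plot(x, y, 'rx')

# Long form: the same settings spelled out as keyword arguments
(spelled_out,) = plt.plot(x, y, color='r', marker='x', linestyle='None')
plt.show()
```

The long form takes more typing, but it can be easier to read when you come back to the code later.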


If we think back to our previous video, we will remember that we created a scatter plot using plt.scatter as well, and we made it into a bubble plot, by changing the size of our circles to represent a third variable. Just to refresh our memories, here it is.

In [None]:
x = np.random.random(50)
y = np.random.random(50)
sz = np.random.randint(2,200,50)
plt.scatter(x, y, s=sz)
plt.show()

So now we can represent a third dimension, beyond just X and Y. But with colour, we can go even further, and represent a fourth! To do this, we will create random lists for x, y, sz, and now another variable z, as well. The scatter function has another argument called c, for colour, which we can then pass z to, like this.

In [None]:
x = np.random.random(50)
y = np.random.random(50)
sz = np.random.randint(2,200,50)
z = np.random.random(50)

plt.scatter(x, y, s=sz, c=z)
plt.show()
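
Colours on their own don't tell the reader which values they stand for. One way to handle this, sketched here as an optional extra, is to add a colour bar with plt.colorbar. The random data is seeded only so the sketch is repeatable, and the label "z" is just a placeholder.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)   # seeded stand-in for the random data above
x = rng.random(50)
y = rng.random(50)
sz = rng.integers(2, 200, 50)
z = rng.random(50)

points = plt.scatter(x, y, s=sz, c=z)
cbar = plt.colorbar(points)      # a key showing which colour means which value of z
cbar.set_label("z")
plt.show()
```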

## 8.4.3 Matplotlib - Figures

Hi, and welcome back again.

In this video, we will look at how we can prepare our graphs as figures. This can be a great way to neatly align multiple different graphs to be presented together.

 When we make our plots, the figure is the window on which they are drawn. Up until now, when we made our plots, we allowed them to fill the entire figure, like this.

In [None]:
categories = ['A', 'B', 'C']
values = [41, 32, 53]
plt.bar(categories, values)
plt.show()

But now, we're going to place a number of plots side by side. We do this by first telling matplotlib what size of figure we would like. In this case, we create a figure of size (9, 3), that is, 9 inches wide and 3 inches tall.
This then allows us to create subplots in different regions of the figure. So, we write plt.subplot(131).

This signifies that the next plot we make will be in position 131, where the first digit specifies the number of rows, telling matplotlib how many graphs we want to fit vertically, the second digit specifies the number of columns, that is, how many plots we would like to fit horizontally, and the last specifies the plot number, which ranges from 1 to the number of rows multiplied by the number of columns. To be more precise, we are telling matplotlib that the next plot should be tall enough that only one could fit, wide enough so that we can fit 3, and be the first plot in the figure.

In [None]:
plt.figure(figsize=(9, 3))

plt.subplot(131)

plt.bar(categories, values)
plt.show()

Now that we have called plt.show, it will have cleared our figure, so we need to add this graph again, and then move onto our next one.

Just like before, we specify the dimensions of our plot, and at what position we want it placed. Again, a 1 by 3, and this time at position 2, and with a scatter plot instead.

In [None]:
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.bar(categories, values)

plt.subplot(132)
plt.scatter(categories, values)

plt.show()

Let's finish off with one last plot, this time a line plot, at figure number 3, and adding a suptitle, which is a title for the overall figure, and a title for each individually.

In [None]:
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.title("Bar Plot")
plt.bar(categories, values)

plt.subplot(132)
plt.title("Scatter Plot")
plt.scatter(categories, values)

plt.subplot(133)
plt.title("Line Plot")
plt.plot(categories, values)

plt.suptitle('Categorical Plotting')

plt.show()

And remember, creating figures like this is easy. If we wanted 4 plots, with 2 in each row, we simply change the subplot calls to say that we want plots of a size that fits 2 vertically and 2 horizontally, like this. The third number simply tells matplotlib the order of the plots.

In [None]:
plt.figure(figsize=(4, 4))

plt.subplot(221)
plt.title("Bar Plot")
plt.bar(categories, values)

plt.subplot(222)
plt.title("Scatter Plot")
plt.scatter(categories, values)

plt.subplot(223)
plt.title("Line Plot")
plt.plot(categories, values)

plt.subplot(224)
plt.title("Line Plot")
plt.plot(categories, values)


plt.show()

In [None]:
plt.figure(figsize=(4, 2))

plt.subplot(121)
plt.title("Bar Plot")
plt.bar(categories, values)

plt.subplot(122)
plt.title("Scatter Plot")
plt.scatter(categories, values)

plt.show()
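
As a side note, matplotlib also offers plt.subplots (with an s on the end), which creates the figure and the whole grid in one go. This is a sketch of an alternative style, not what we used in the video; the fourth plot and its title are made up just to fill the grid.

```python
import matplotlib.pyplot as plt

categories = ['A', 'B', 'C']
values = [41, 32, 53]

# One call creates the figure and a 2-by-2 grid of axes
fig, axes = plt.subplots(2, 2, figsize=(6, 6))

# axes is indexed by [row, column], so no three-digit codes are needed
axes[0, 0].bar(categories, values)
axes[0, 0].set_title("Bar Plot")
axes[0, 1].scatter(categories, values)
axes[0, 1].set_title("Scatter Plot")
axes[1, 0].plot(categories, values)
axes[1, 0].set_title("Line Plot")
axes[1, 1].barh(categories, values)
axes[1, 1].set_title("Horizontal Bar Plot")

fig.tight_layout()   # stop titles and tick labels from overlapping
plt.show()
```

Note that here the titles are set with set_title on each individual plot, rather than with plt.title.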

## 8.5.1 Seaborn - basic plots

  We have seen how to create plots using matplotlib, but if we are working with pandas dataframes, Seaborn is a much better option, as it was built on top of matplotlib to specifically complement pandas. Not only this, but it has additional visualisation abilities, and even contains some statistical modelling capabilities to visualise the underlying relationships or distributions in our data.
  
  So let's start by importing Seaborn and pandas

In [None]:
import seaborn as sns
import pandas as pd

A nice thing about seaborn is that it includes some datasets for us to experiment with. Let's load the dataset "tips" from seaborn, and display the head of that dataframe, like you did in a previous session.

In [None]:
tips = sns.load_dataset("tips")
display(tips.head())


And now we can see that the dataset consists of the total bill, size of tip left, sex of the tipper, their smoker status, day of the week, which meal time it was, and the number of people at the table.

From this, it's easy to generate some descriptive statistics using .describe.

In [None]:
tips.describe()

Now that we know what kind of data is in our dataframe, we can start to ask questions of it. For example, what is the relation between the size of the bill, and the tip that is left? To do this, we can create a relational plot using seaborn by using relplot. Note that the arguments we pass as our x and y values are not data, but the name of the variables within the 'tips' dataframe, which we pass as the 'data' argument.

In [None]:
sns.relplot(x='total_bill', y='tip', data=tips)

And we can now see the relation between total_bill and size of tip left. It looks like we do have a relationship here, where the larger the bill, the larger the tip, but notice that there are some large bills that leave quite small tips.

 As in matplotlib, we can look at additional variables and what effects they might be having. For example, we might suspect that day of the week will have an effect, so let's use it to colour the datapoints. We do that by using the hue argument, and giving it the name of the variable for day, which in this case, is simply the word 'day'.

In [None]:
sns.relplot(x='total_bill', y='tip', data=tips, hue='day')


The day of the week tells us a little, but it's pretty noisy. What we can see, is that Saturday (in green) appears to be the most common day in our records, and that it accounts for the largest bills, and largest tips. We might also say that Sunday appears to result in slightly higher tips, as you can see that it seems to sit on top of most of the other colours, but it's very difficult to say due to the noisiness of the data.

 Even though this visualisation is getting awfully crowded, we can still visualise another one of the variables by using the size of the points. So, what better variable to choose, than the size of the table. If we set the argument size to equal the variable name 'size' in our dataframe, we can do this:

In [None]:
sns.relplot(x='total_bill', y='tip', data=tips, hue='day', size='size')

And now the size of the points reflects the size of the tables, but the range of sizes is a little hard to distinguish. What we can do, is force them into a range by using the 'sizes' argument, like we have here.

In [None]:
sns.relplot(x='total_bill', y='tip', data=tips, hue='day', size='size', sizes=(10, 200))

This is actually a little more informative - as we can see the lowest tips all seem to be very small table sizes, the largest tables seem to leave reasonably large tips, but if we want to target the largest tips, they appear to consist of the medium sized tables, which is quite an odd finding. And perhaps another odd thing, is that the largest tables seem to be blue and red, which are Thursdays and Sundays... perhaps they have a special offer on Thursdays? Whatever the cause of this, the important thing to remember is that, when properly plotted, patterns are easy to spot.

## 8.5.2 Seaborn - Distributions

Up until now, we've been focussed on relational plots - but sometimes we want to understand the distributions between categories. An example of this was back in our introduction video when we discussed boxplots. 

Taking our Tips dataset from before, let's start with a categorical plot. You'll notice that we are going to focus on the categorical variable, day, and plot this on our X axis. We are then going to look at the distribution of the total_bill on each of these days. An additional argument that we haven't seen before is jitter. Seaborn is so useful that it sets jitter to True by default and tries to help us out, but to show you what it does, I'm going to set it to False.

In [None]:
sns.catplot(x="day", y="total_bill", jitter=False, data=tips)

As you can see, we have 4 different days in our dataset, and a wide range of different bill sizes. Where things are not very clear, is in the denser parts of our data. If we look in this region of the Sunday data, it's so dense that it overlaps and we have no idea just how many bills are hidden in there. This is where jitter comes in - it moves overlapping points slightly to the left or right on the x axis, like this:

In [None]:
sns.catplot(x="day", y="total_bill", jitter=True, data=tips)

And now we can see that there were actually a lot of datapoints hidden underneath others. This helped, but Seaborn can do much more - let's start by changing the 'kind' of plot to a "swarm", like this

In [None]:
sns.catplot(x="day", y="total_bill", kind="swarm", data=tips)

Now we can see that the points are not overlapping at all, and that there were actually quite a few bill totals around £10 on Sundays, something that was very difficult to see, even when using jitter.

 Let's try something a little more advanced - we talked about boxplots earlier, so let's make some. To create a boxplot, it's as simple as setting the kind of plot, to 'box'.

In [None]:
sns.catplot(x="day", y="total_bill", kind="box", data=tips)

Now we can see lots more - the median total_bill each day is represented by this line in the middle, with the next quartile of the datapoints falling in this part of the box, and the highest quartile being represented by this bar, known as the whisker. In addition, we seem to have some points up here, signifying outliers. Statisticians and data scientists will be looking at these carefully, to try to decide how best to handle them.
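
How does the plot decide which points count as outliers? By default, the matplotlib boxplot that seaborn builds on uses a rule based on the interquartile range: anything more than 1.5 times the height of the box beyond its edges is drawn as a separate point. Here is a minimal sketch of that rule, using made-up bill totals rather than the real tips data.

```python
import numpy as np

# Made-up bill totals with one unusually large value
bills = np.array([8, 10, 12, 13, 15, 16, 18, 20, 22, 60])

q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1                   # the height of the box

# Default rule: whiskers reach at most 1.5 * IQR beyond the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Anything outside those limits is drawn as an individual outlier point
outliers = bills[(bills < lower_limit) | (bills > upper_limit)]
print(q1, median, q3, outliers)
```

This also means that, when there are outliers, the whisker stops at the furthest point that is still inside the limit, rather than at the very largest value.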

 This told us more about the distribution of the total bills each day of the week, but we can still use visualisations to extract even more information. If we use a violin plot, by setting "kind" to "violin" like this, we can get an estimation of the kernel densities.

In [None]:
sns.catplot(x="day", y="total_bill", kind="violin", data=tips)

This is somewhat like boxplots, but instead, we do some smoothing and estimate just how dense the data is at any given size of the total_bill. This draws our attention to how consistent the bills on Thursdays were... maybe there was some kind of special offer going on after all?
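
To give a feel for what the smoothing is doing, here is a rough sketch of a kernel density estimate written by hand with NumPy. The bill totals are made up, and the bandwidth of 2.0 is an assumption chosen for illustration; seaborn picks its own smoothing width automatically.

```python
import numpy as np

# Made-up bill totals, standing in for one day's data
bills = np.array([9.0, 10.5, 11.0, 12.0, 12.5, 13.0, 18.0, 19.5, 25.0])
bandwidth = 2.0                       # assumed smoothing width

grid = np.linspace(0, 35, 351)        # bill sizes at which to estimate the density

# Place a Gaussian bump on every data point and average them
diffs = (grid[:, None] - bills[None, :]) / bandwidth
bumps = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2 * np.pi))
density = bumps.mean(axis=1)

step = grid[1] - grid[0]
area = density.sum() * step           # total area under the curve
peak = grid[density.argmax()]         # bill size where the data is densest
print(area, peak)
```

The width of the violin at any bill size is this density value, mirrored on both sides.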

 Now that we're using more advanced plots, we can also visualise more variables. For instance, if we wish to know what sex tends to pay when the bill is higher or lower, we can pass "sex" to the "hue" argument.

In [None]:
sns.catplot(x="day", y="total_bill", kind="violin", hue="sex", data=tips)

This is quite a nice way to compare the two distributions, on each day, but 8 plots makes it look a little crowded. Let's fix it by splitting each violin by sex instead. To do this, just add the argument "split" to the function call.

In [None]:
sns.catplot(x="day", y="total_bill", kind="violin", hue="sex", split=True, data=tips)

And here we can compare the distributions very easily, on each day. This plot tells us a lot, but a standout message is on Fridays, where we see that, if the bill is being paid by a female, it is likely to be in the region of 15 to 20 pounds. One thing to remember when using these plots is that it is the density of the data points within each subgroup - that is, they are normalised regardless of the sample sizes between groups. While a violin plot may be a lot larger at this point, the male and female entries in the dataset are not equal, and therefore we cannot say that more females are paying at this bill size, only that, of the females that paid, most of them were in this region.

 If we wanted to look at the distributions more closely, we can use distplots. If we use distplot from seaborn, passing the tip data, we get a histogram showing the quantity of tips at each value. Here we can see that most tips were £2, closely followed by £3. What we also have is this 'rug' argument, which draws these little lines at the bottom, displaying where each individual datapoint lies. And finally, here we see that we have another argument, kde, which we set to True. This draws a kernel density estimation line over the top of our histogram, as we see here. Much like the violin plots previously, this is a method to create a smooth line over the roughness of the data below, so that we can examine it more easily.

In [None]:
sns.distplot(tips['tip'], kde=True, rug=True)

As I mentioned earlier, seaborn can even perform some statistical modelling to visualise the underlying relationships in our data. We looked at the relationship between the total bill size and the tip left earlier, and we thought we could see a relationship. Using seaborn, we can put this to the test. Let's try plotting a linear regression line by using lmplot(), and passing the variable names as our X and Y values.

In [None]:
sns.lmplot(x="total_bill", y="tip", data=tips)

It even provides us with a 95% confidence interval by displaying this shaded region. This only scratches the surface of what seaborn can do for us, so feel free to read the documentation, and really put it to the test!
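For a sense of what fitting a straight line involves, here is a minimal ordinary least squares sketch in plain Python. The helper `fit_line` and the data points are made up for illustration, and this is not how seaborn is implemented internally; it only shows the kind of slope and intercept calculation behind a simple regression line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (total_bill, tip) pairs, invented for illustration.
bills = [10.0, 20.0, 30.0, 40.0]
tip_values = [2.0, 3.0, 5.0, 6.0]

slope, intercept = fit_line(bills, tip_values)
print(f"tip = {slope:.2f} * total_bill + {intercept:.2f}")
```

A positive slope here would correspond to the upward trend we thought we could see between bill size and tip.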

## 8.6.1 Plotly - First Plots

Welcome back to another video on visualising data in Python! In this lesson, we are going to look at another visualisation library called Plotly. 

Plotly is a very interesting tool, and probably not like any you've come across before. You can use it online or offline - if you have an account on plot.ly you can host your plots on their servers, and simply access them by URL. This can be a great way of sharing your visualisations with other people, or even embedding them on a website. What we will be doing, however, is simply working offline, meaning that the plots are stored on our local computer.
 
  To get started, let's look at the Tips dataset from before. Just as a quick refresher, this is what the head of the dataset looks like. Total-bill, tip, sex, smoker, day, time, and table size.

In [None]:
tips.head()

If we jump right in, we can just import plotly express, and simply call the scatter function and pass the data, with the variable names we want to plot. This creates a figure object, that we then show like this.

In [None]:
import plotly.express as px
fig = px.scatter(tips, x="total_bill", y="tip")
fig.show()

And here we have our first plot, but you might be wondering why we would use this rather than matplotlib. Well, that's because we can do this in plotly.

- By hovering over the points, we can inspect their exact values, rather than trying to visually inspect where we think they might lie on the x and y axes.

And we also get this menu bar, with lots of different options.

By clicking the camera, we can save the plot as an image, something that requires extra lines of code in matplotlib.

This magnifying glass allows us to zoom in on any selected region like this.

And if we want to zoom in further, we can.

When we want to return to the full plot, we can simply double click.

Panning allows us to move the graph around.

Box select allows us to highlight a region of datapoints,

while lasso allows us to select a much more detailed region.

Zoom in, zoom out, and autoscale are what you might expect,

while toggle spike lines gives us the ability to see exactly where on each axis a point lies when we hover over it.


As you can see, plotly has great potential in its interactivity.

## 8.6.2 Plotly - Interactive Plots

We saw in our last video that Plotly can create incredible interactive plots. But the scatter plot we looked at was comparatively simple - in previous exercises, we used scatter plots to represent two variables, but when we wanted to represent a third, we had to change the shape or colour of our markers. Plotly, however, can actually plot 3D scatterplots. This is as simple as using the scatter_3d function, and passing another variable to represent the Z axis.

In [None]:
fig = px.scatter_3d(tips, x="total_bill", y="tip", z="size", color="sex")
fig.show()

And as you can see, we have a 3D scatter plot to show us the "total_bill", size of "tip", and size of the table.

*rotate and explore the plot*

We should exercise caution here, as this type of interactive graph should only be used when the end-user has the freedom to interact with it, and change the viewing angles. If we were to rotate it like this, we can see that a pattern we have previously observed is obscured, and therefore, a stationary angle can be misleading. So, remember, no printing 3D graphs.




One of the greatest advantages of Plotly is that we can import and export figures and their data in JSON format. This allows us to import data straight into graphs that we have developed with minimal effort, and even export the cool things we have done in plotly to other visualisation tools. For example, we can define a JSON-style object, here a Python dictionary, like this:

In [None]:
fig = {
    "data": [{"type": "bar",
              "x": ['A', 'B', 'C'],
              "y": [1, 2, 3]}],
    "layout": {"title": {"text": "A Bar Chart"}}
}



Where we set the data like so, and define the layout like this.

 And we can just pass it as an argument to plotly.

In [None]:
import plotly.io as pio
pio.show(fig)
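As a small sketch of the export and import idea, the same kind of dictionary can be turned into JSON text and back again using only Python's standard `json` module. The variable names `fig_spec`, `fig_json` and `restored` are just illustrative choices.

```python
import json

# The same kind of figure description as above, as a plain Python dictionary.
fig_spec = {
    "data": [{"type": "bar", "x": ["A", "B", "C"], "y": [1, 2, 3]}],
    "layout": {"title": {"text": "A Bar Chart"}},
}

# Export: serialise the dictionary to a JSON string that could be saved or shared.
fig_json = json.dumps(fig_spec, indent=2)
print(fig_json)

# Import: parse the JSON text back into a dictionary.
restored = json.loads(fig_json)
```

The restored dictionary has the same structure as the original, so it could be passed to `pio.show` in the same way as `fig` above.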

## 8.7 Plotting on maps

Ok, let's get started on plotting on maps in python. 

 As always, let's start by importing the libraries we need.

In [None]:
import plotly.graph_objects as go
import pandas as pd

And now let's get a CSV dataset from GitHub.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_us_ag_exports.csv')

And as always, let's have a peek at the head of our dataset, just so that we know what we're working with.

In [None]:
df.head()

So we can see that we have states, total exports, and how much each product contributes to the total exports.

Well, now we have our libraries, and our data, all that's left is to start generating the plots!

Let's start with a Choropleth map - a map that has predefined areas, which we will shade according to the value for each of those regions.

So, let's look at the code. We take our graph objects module (go) and call its Figure function. We pass the argument data with the value returned by the Choropleth function, also within the graph objects module. When we call Choropleth, we pass it the locations, which are the state codes in our dataset, we pass the 'total exports' column as z, which will be used to colour each state, and then we set locationmode to "USA-states", which tells plotly how to match the codes in locations from above. We then set the colorscale to different tones of red, and we label the colorbar appropriately, because we should ALWAYS label our visualisations.

In [None]:
fig = go.Figure(data=go.Choropleth(
    locations=df['code'], # Spatial coordinates
    z = df['total exports'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "Millions USD",
))

So we now have a figure object, and we get Python to show it to us like this:

In [None]:
fig.show()

This zoom level is clearly inappropriate, but we can easily fix it by updating the layout like this. Adding a title is always a good thing to do, and setting the geo_scope to the USA will exclude the rest of the world, as we don't have data for it anyway.

In [None]:
fig.update_layout(
    title_text = '2011 US Agriculture Exports by State',
    geo_scope='usa', # limit map scope to USA
)

As you can see, California on the West coast exports dramatically more food than any other state in the US. Of course, we might have been able to see this by simply looking at the raw numbers, but this very effectively communicates the disparity between it and its neighbouring states, in particular Nevada here. Remember that, as this is plotly, we can hover over each state to have a look at the details being represented.

If you would like to change the colour gradients, it's always worth having a look at the documentation - we just used reds, but there are lots of other colour maps to choose from, with many using more than one colour, like blue red here:

In [None]:
fig = go.Figure(data=go.Choropleth(
    locations=df['code'], # Spatial coordinates
    z = df['total exports'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'bluered',
    colorbar_title = "Millions USD",
    
))

fig.show()
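To make the idea of a colour scale a little more concrete, here is a rough sketch of how a value could be mapped onto a two-colour gradient running from pure blue to pure red. This is not plotly's internal code: the function `interpolate_colour`, the value range and the example values are all assumptions for illustration.

```python
def interpolate_colour(value, vmin, vmax, low_rgb, high_rgb):
    """Linearly blend low_rgb and high_rgb by where value sits in [vmin, vmax]."""
    fraction = (value - vmin) / (vmax - vmin)
    fraction = min(1.0, max(0.0, fraction))  # clamp values outside the range
    channels = [
        round(lo + fraction * (hi - lo)) for lo, hi in zip(low_rgb, high_rgb)
    ]
    return "rgb({}, {}, {})".format(*channels)


BLUE = (0, 0, 255)
RED = (255, 0, 0)

# Hypothetical export values in millions of USD, invented for illustration.
for z in (0, 5000, 10000):
    print(z, interpolate_colour(z, 0, 10000, BLUE, RED))
```

The lowest value comes out fully blue, the highest fully red, and anything in between is a blend of the two.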

Like before, we would benefit from updating the layout. But retyping the update_layout each time is tedious and unnecessary, especially when we can simply define a layout from the start. Here we set the title to "USA", set scope to USA, and then we choose the projection. Projections are interesting, because when we are trying to visualise a 3D sphere like the earth, we project it as a rectangle, and this means that there will always be distortion of sizes. For example, if we zoom in so that we can see both Ireland and the archipelago called Svalbard, you might think that Svalbard is larger than the UK, but in fact, it's only 61 thousand square kilometres, meaning it's smaller than the island of Ireland, here.
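As a rough illustration of this distortion, the sketch below assumes a plain equirectangular map, where every line of latitude is drawn as wide as the equator. The map shown in the video may use a different projection, and the latitudes are approximate values chosen for illustration.

```python
import math


def east_west_stretch(latitude_deg):
    """Horizontal stretch factor at a latitude on an equirectangular map.

    On such a map every parallel is drawn as long as the equator, while on
    the globe its length shrinks with cos(latitude), so features are widened
    by 1 / cos(latitude).
    """
    return 1.0 / math.cos(math.radians(latitude_deg))


# Approximate central latitudes; rough values assumed for illustration.
places = {"Ireland": 53.0, "Svalbard": 78.0}

for name, lat in places.items():
    print(f"{name}: drawn about {east_west_stretch(lat):.1f} times too wide")
```

The further north we go, the larger the stretch, which is why an archipelago as far north as Svalbard can look much bigger than it really is.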

If we look back at our layout code, the final 2 arguments involve setting showlakes to True, and filling the lakes with the colour white. As this is the RGB colour scheme, the colour is represented by 3 bytes. The first byte is how much red, the second byte is how much green, and the third byte is how much blue we have. As a byte can only represent 256 values, 255 is the highest value we can have. Hence, all red, all green, and all blue gives us the colour white, at least in a computer. All colours can be represented this way, just by turning each of these values up and down.

In [None]:
mymap = go.Layout(
    title = go.layout.Title(
        text = 'USA'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = True,
        lakecolor = 'rgb(255, 255, 255)'),
)
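To see the three-byte idea in action, here is a small sketch that packs red, green and blue values into the hex notation often used for colours on the web. The helper name `rgb_to_hex` is just an illustrative choice, not part of plotly.

```python
def rgb_to_hex(red, green, blue):
    """Pack three 0-255 channel values into a '#rrggbb' hex colour string."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError("each channel must fit in one byte (0 to 255)")
    # Each channel becomes two hexadecimal digits, i.e. one byte.
    return "#{:02x}{:02x}{:02x}".format(red, green, blue)


print(rgb_to_hex(255, 255, 255))  # white: every channel at its maximum
print(rgb_to_hex(0, 0, 0))        # black: every channel at zero
print(rgb_to_hex(255, 0, 0))      # pure red
```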

Now that we have our layout, it would be really useful to have our data stored in a separate variable too.

This time, let's just look at corn exports, by changing the value of argument z, to the list of corn exports in our dataframe.

In [None]:
sub_data = [go.Choropleth(
#    colorscale = scl,
    autocolorscale = True,
    locations = df['code'],
    z = df['corn'].astype(float), # Data to be color-coded
    locationmode = 'USA-states',
    marker = go.choropleth.Marker(
        line = go.choropleth.marker.Line(
            color = 'rgb(255,255,255)',
            width = 2
        )),
    colorbar = go.choropleth.ColorBar(
        title = "Example")
)]

And now that we have them both as variables, it's incredibly easy to plot them whenever needed.

In [None]:
fig = go.Figure(data = sub_data, layout = mymap)
fig.show()

And now we can see that this part of the corn export economy is clearly dominated by this region, and in particular, the state of Iowa.

Now that we have our layout in an easy to change object, let's experiment with the projections that we touched on earlier.

First, let's set the scope and title to world, so that we can see the effects of the projections on a larger scale.

In [None]:
mymap.title.text = 'World'
mymap.geo.scope = 'world'

And now let's set the projection to orthographic

In [None]:
mymap.geo.projection = go.layout.geo.Projection(type = 'orthographic')

And now if we make this plot, we can see that, instead of a flat projection, we get an interactive globe.

In [None]:
fig = go.Figure(data = sub_data, layout = mymap)
fig.show()

While this is the most accurate representation, it's uncommon for us to have the opportunity to interact with a 3D map, so we usually need to choose a flat projection. Here are just a few:

Natural Earth

In [None]:
mymap.geo.projection = go.layout.geo.Projection(type = 'natural earth')
fig = go.Figure(data = sub_data, layout = mymap)
fig.show()

Conic equidistant

In [None]:
mymap.geo.projection = go.layout.geo.Projection(type = 'conic equidistant')
fig = go.Figure(data = sub_data, layout = mymap)
fig.show()

and sinusoidal

In [None]:
mymap.geo.projection = go.layout.geo.Projection(type = 'sinusoidal')
fig = go.Figure(data = sub_data, layout = mymap)
fig.show()

As with the other tools we have looked at on this course, we have only scratched the surface of what we can achieve with Python. For further inspiration, explore the documentation and you'll be amazed what you can do with very little code.