# Lesson 3: Graphs


**Python learning objectives**

1. Develop a further understanding of how to use functions.
2. Reiterate how to make a list. 
3. Revise previously taught concepts. 

**What you will be able to do with these skills**

1. Learn how to draw the following types of graphs and charts:
    1. Pie 
    2. Line
    3. Bar
    4. Histogram
    5. Scatter
2. Learn about how we can adjust attributes to change the properties of our graph. 
3. Learn about *association* in scatter plots. 

Plotting charts and graphs in `pandas` is relatively straightforward. To plot them we first use `.plot`, followed by a second function that specifies the type of graph we want. For example, to plot a pie chart we use `.plot.pie()`. Below is a table of the graph types covered in this lesson. If you wish to learn more, read the `pandas` tutorial [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/04_plotting.html#min-tut-04-plotting) or the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html).

|Type|Function|
|-|-|
|Pie Chart|`.plot.pie()`|
|Line Graph|`.plot.line()`|
|Bar Chart|`.plot.bar()`|
|Histogram|`.plot.hist()`|
|Scatter|`.plot.scatter()`|
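
As a minimal sketch of the pattern in the table, the example below calls one of these functions on a small DataFrame. The DataFrame `temps` and its columns `Day` and `Temperature` are made up for illustration. It also shows that the same graph can be requested through the `kind` argument of `.plot()`.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
temps = pd.DataFrame({
    "Day": [1, 2, 3, 4],
    "Temperature": [12.0, 14.5, 13.0, 15.5],
})

# Accessor style, as used throughout this lesson
ax_accessor = temps.plot.line(x="Day", y="Temperature")

# Equivalent call using the kind argument of .plot()
ax_kind = temps.plot(kind="line", x="Day", y="Temperature")
```

Both calls draw the same line. The accessor style is used in the rest of this lesson because the graph type is visible in the function name.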


**Pie Charts**

First things first, as always, we need to import the `pandas` library. 

In [None]:
import pandas as pd

Let's start this section with the car shop data we used before. The code below produces a pivot table and saves it to the *DataFrame* labelled `CarShopPiv`.

In [None]:
CarShop = pd.read_csv(
    "https://raw.githubusercontent.com/ThomasJewson/datasets/master/CarShop.csv"
)

CarShopPiv = CarShop.pivot_table(
    index="Product",
    aggfunc="sum"  # the string name avoids a warning in newer pandas versions
)
CarShopPiv
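
To see what this pivot table does, here is a minimal sketch on a small table. The DataFrame `sales` and its values are made up for illustration and are not the real `CarShop` data. Rows that share the same `Product` are grouped together and their numeric columns are added up.

```python
import pandas as pd

# Hypothetical sales records, made up for illustration only
sales = pd.DataFrame({
    "Product": ["Tyre", "Tyre", "Wiper", "Wiper", "Battery"],
    "Quantity": [4, 2, 1, 3, 1],
    "Total Price": [200.0, 100.0, 15.0, 45.0, 80.0],
})

# One row per product, with the numeric columns summed
sales_piv = sales.pivot_table(
    index="Product",
    aggfunc="sum"
)
print(sales_piv)
```

In this sketch the two `Tyre` rows collapse into one row with a `Quantity` of 6 and a `Total Price` of 300.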

A useful way to visualise this data might be with a pie chart of the `Total Price` column.

To plot a pie chart we need to use the following [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.pie.html) `.plot.pie()`. 

We need to give this function an argument that tells it which column to plot. As the column we want to draw a pie chart from is `Total Price`, we pass the argument `y="Total Price"`.

In [None]:
CarShopPiv.plot.pie(
    y="Total Price"
)
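
Each slice of a pie chart represents that row's share of the column total. The sketch below works those shares out by hand, assuming a small made-up table called `sales_piv` in place of the real `CarShopPiv`.

```python
import pandas as pd

# Hypothetical pivot table, made up for illustration only
sales_piv = pd.DataFrame(
    {"Total Price": [80.0, 300.0, 20.0]},
    index=["Battery", "Tyre", "Wiper"],
)

# Share of the total for each product: this is what the slice sizes show
shares = sales_piv["Total Price"] / sales_piv["Total Price"].sum()
print(shares)
```

Here `Tyre` accounts for 300 out of 400, so its slice would cover three quarters of the pie.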

**Exercise 1:** *Plot a pie chart of the `Quantity` column from the `CarShopPiv` DataFrame*

In [None]:
#Answer
CarShopPiv.plot.pie(y="Quantity")

**Line plots**

The data below is the percentage Gross Domestic Product (GDP) growth per year of the EU and Germany from 2007 to 2018 [1]. 

In [None]:
GDP = pd.read_csv("https://raw.githubusercontent.com/ThomasJewson/datasets/master/GDP.csv")
GDP

To plot a line graph we use `.plot` again, but this time we place `.line()` at the end to specify a line plot. Therefore, our [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.line.html) is `.plot.line()`.

A line graph has two dimensions, the x-axis and the y-axis. Let's plot the `Year` column on the x-axis and the `EU` column on the y-axis. We need to pass keyword arguments to the function that say which columns to use for each axis. The y-axis argument, as we have seen before, is `y="EU"`. Likewise, the x-axis argument is `x="Year"`.

Recall that because we only use keyword arguments in this function, their order does not matter.

In [None]:
GDP.plot.line(
    x="Year",
    y="EU"
)
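
The sketch below illustrates that the order of keyword arguments makes no difference. The DataFrame `growth` and its column `Region A` are made up for illustration.

```python
import pandas as pd

# Hypothetical growth figures, made up for illustration only
growth = pd.DataFrame({
    "Year": [2015, 2016, 2017, 2018],
    "Region A": [2.1, 1.9, 2.6, 2.0],
})

# The same keyword arguments, given in two different orders
ax_xy = growth.plot.line(x="Year", y="Region A")
ax_yx = growth.plot.line(y="Region A", x="Year")

# Collect the (x, y) points drawn on each graph and compare them
points_xy = ax_xy.get_lines()[0].get_xydata().tolist()
points_yx = ax_yx.get_lines()[0].get_xydata().tolist()
print(points_xy == points_yx)
```

Both graphs contain exactly the same points, because each argument is matched by its name rather than by its position.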

It is possible to change the attributes of the graph by introducing more keyword arguments into the function. For example, we can use the following arguments.

|Argument|Description|Variable type to equate to|Example|
|-|-|-|-|
|`color`|Changes the colour of the graph|String / list of strings|`color = "orange"`|
|`figsize`|Changes the size of the graph output|List of two numbers, the width and height|`figsize = [5,6]`|
|`fontsize`|Changes the font size of the axis tick labels|Integer|`fontsize = 10`|
|`title`|Adds a title to the graph|String|`title = "This is my graph"`|
|`grid`|Adds a grid to the graph|Boolean / True or False|`grid = True`|
|`xlim`|Limits on x-axis plot|List of two numbers, the lower and the upper limits|`xlim = [45,1231]`|
|`ylim`|Limits on y-axis plot|List of two numbers, the lower and the upper limits|`ylim = [0.023,1.505]`|
|`rot`|The angle to rotate labels on the x-axis|Integer|`rot = 25`|

Note, the American spelling of colour is used.

In [None]:
GDP.plot.line(
    x="Year",
    y="EU",
    
    color="green", #Note, pandas uses the American spelling of colour
    figsize=[6,6],
    fontsize=10,
    title="Change in EU's GDP (%) vs. Time",
    grid=False,
    xlim=[2007,2018],
    ylim=[-4.5,3.2],
    rot=10
)
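
The plotting functions return an object that describes the graph, which gives us a way to check that the arguments were applied. The sketch below uses a made-up DataFrame called `growth` and reads the title, axis limits and figure size back from that object.

```python
import pandas as pd

# Hypothetical growth figures, made up for illustration only
growth = pd.DataFrame({
    "Year": [2015, 2016, 2017, 2018],
    "Region A": [2.1, 1.9, 2.6, 2.0],
})

ax = growth.plot.line(
    x="Year",
    y="Region A",
    title="Hypothetical growth (%) vs. time",
    figsize=[6, 4],
    xlim=[2015, 2018],
    ylim=[0, 3],
    grid=True
)

# Read the settings back from the returned graph object
print(ax.get_title())
print(ax.get_xlim())
print(ax.get_ylim())
print(ax.figure.get_size_inches())
```

Changing one argument at a time and re-running the cell is a quick way to learn what each of them controls.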

All of these arguments can be adjusted and changed. For example, the `color` argument can be `green`, `red`, `blue`, `orange`, `purple`, etc.

Have a play around with the arguments and adjust them so you understand what each of them does. 

In [None]:
GDP.plot.line(
    x="Year",
    y="EU",
    
    color="violet",
    figsize=(10,6),
    fontsize=15,
    title="Change in EU's GDP (%) vs. Time",
    grid=True,
    xlim=(2010,2018),
    ylim=(-1,3)
)

**Exercise 2:** *Plot a line graph with the x-axis being `Year` and the y-axis being `Germany` from the `GDP` DataFrame. Make the colour of the line `black` and give the graph the title `Change in Germanys GDP (%) vs. Time`.*

In [None]:
#Answer
GDP.plot.line(
    x="Year",
    y="Germany",
    color="black",
    title="Change in Germanys GDP (%) vs. Time"
)

We can also plot different columns of data on the same graph. For example, below we have plotted the `EU` and `Germany` columns on the y-axis by putting the two columns in a list `y=["EU","Germany"]`.

In [None]:
GDP.plot.line(
    x="Year",
    y=["EU","Germany"]
)

Again, we can adjust the colour and other attributes of the graph. Notice how we have put the colours in a list as well.

With `y=["EU","Germany"]` and `color=["blue","orange"]`, EU will be blue and Germany will be orange. This is because `EU` and `blue` are the first entries in their respective lists.

In [None]:
GDP.plot.line(
    x="Year",
    y=["EU","Germany"],
    
    title="Change in GDP (%) over time",
    color=["blue","orange"],
    grid=True,
    figsize=(10,6),
    fontsize=13
)
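
The position-by-position matching of columns and colours can be sketched with Python's built-in `zip` function. The DataFrame `growth` and the column names `Region A` and `Region B` are made up for illustration.

```python
import pandas as pd

# Hypothetical growth figures, made up for illustration only
growth = pd.DataFrame({
    "Year": [2015, 2016, 2017, 2018],
    "Region A": [2.1, 1.9, 2.6, 2.0],
    "Region B": [1.5, 2.2, 2.5, 1.5],
})

columns = ["Region A", "Region B"]
colours = ["blue", "orange"]

# Entries are paired by position: first with first, second with second
pairing = dict(zip(columns, colours))
print(pairing)  # {'Region A': 'blue', 'Region B': 'orange'}

# The same two lists are then passed to the plotting function
ax = growth.plot.line(x="Year", y=columns, color=colours)
```

Swapping the order of the entries in `colours` would swap the colours of the two lines.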

**Exercise 3:** *Plot a line graph with the x-axis being `Year` and the y-axis being `Germany` and `EU` from the `GDP` DataFrame. Give the `EU` line the colour `green` and give the `Germany` line the colour `yellow`. Finally, give the plot grid lines.*

In [None]:
#Answer
GDP.plot.line(
    x="Year",
    y=["EU","Germany"],
    color=["green","yellow"],
    grid=True
)

**Bar Charts**

To produce a bar chart we need to use the following [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.bar.html) `.plot.bar()`. Bar charts accept the same keyword arguments that we used for line plots.

In [None]:
GDP.plot.bar(
    x="Year",
    y=["EU","Germany"],
    
    color=["orange","purple"],
    title="Change in GDP (%) over time",
    grid=True,
    figsize=(10,6),
    fontsize=14
)

**Exercise 4:** *Plot a bar chart with the x-axis being `Year` and the y-axis being `Germany`, giving the `Germany` bars the colour `red`. Make the figure size `(15,8)` and the font size `17`.*

In [None]:
#Answer
GDP.plot.bar(
    x="Year",
    y="Germany",
    color="red",
    figsize=(15,8),
    fontsize=17
)

**Histograms**

A histogram shows how often values fall into each interval of a range. Generally, these intervals (called bins) have a fixed width.

Below is some data about failed social media start-ups.

In [None]:
StartUp = pd.read_csv("https://raw.githubusercontent.com/ThomasJewson/datasets/master/startup.csv")
StartUp

Let's calculate the number of years each of these start-ups was active for and insert a new column called `Operating_Years` into the *DataFrame*. The `.insert()` function is from the advanced section in lesson 1, so do not worry if you haven't seen it before.

In [None]:
StartUp.insert(
    3,
    "Operating_Years",
    StartUp["Failure_Year"] - StartUp["Creation_Year"]
)
StartUp
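
As a minimal sketch of how `.insert()` is used here, the example below repeats the same step on a small table. The DataFrame `firms` and its values are made up for illustration. The three arguments are the position of the new column, its name and its values.

```python
import pandas as pd

# Hypothetical companies, made up for illustration only
firms = pd.DataFrame({
    "Name": ["Alpha", "Beta", "Gamma"],
    "Creation_Year": [2004, 2008, 2010],
    "Failure_Year": [2010, 2011, 2019],
})

# Position 3 places the new column after the three existing ones
firms.insert(
    3,
    "Operating_Years",
    firms["Failure_Year"] - firms["Creation_Year"]
)
print(firms)
```

Subtracting one column from another works row by row, so each company gets its own number of operating years.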

The [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.hist.html) to plot a histogram, `.plot.hist()`, is similar to what we have seen in the previous examples. We need to specify which column's values are counted, in this case the `Operating_Years` column. Therefore, counterintuitively, we pass `y="Operating_Years"`. This is counterintuitive because the x-axis shows `Operating_Years`, whereas the y-axis shows the frequency of each range of `Operating_Years`.

Finally, we need to define the number of bins in the histogram. The number of bins is the number of intervals the x-axis is split into. For example, below we have `bins=7`, which means that the x-axis, which plots `Operating_Years`, is divided into 7 equal parts. As `Operating_Years` ranges from 2 years to 16 years, the data range is $16-2=14$. Therefore, the width of each interval is $14 / 7 = 2$ years.

In [None]:
StartUp.plot.hist(
    y="Operating_Years",
    bins=7
)

To illustrate this, look at the first bin in the graph above, which spans from `Operating_Years = 2` to `Operating_Years = 4`. It has a frequency of 2, because two companies operated for between 2 and 4 years.
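
The bin arithmetic can be checked by hand with NumPy's `np.histogram` function, which splits a range into equal-width bins and counts the values in each. This is a sketch under two assumptions: the values in `operating_years` are made up rather than taken from the `StartUp` data, and the plot is assumed to use the same equal-width binning.

```python
import numpy as np

# Hypothetical operating years, made up for illustration only
operating_years = [2, 3, 5, 6, 7, 9, 10, 12, 16]

# Split the range 2 to 16 into 7 equal-width bins and count the values in each
counts, edges = np.histogram(operating_years, bins=7)

print(edges)   # bin boundaries: 2, 4, 6, 8, 10, 12, 14, 16
print(counts)  # number of values in each bin: 2, 1, 2, 1, 1, 1, 1
```

Each bin includes its lower boundary but not its upper one, except for the last bin, which includes both. In this sketch the value 6 is therefore counted in the bin from 6 to 8, not in the bin from 4 to 6.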

The keyword arguments that control the attributes of line plots also work in the histogram function `.plot.hist()`. Furthermore, with the same syntax it is possible to plot multiple histograms on the same plot.

**Exercise 5:** *Plot a histogram of the `Operating_Years` column of the `StartUp` DataFrame with `11` bins*

In [None]:
#Answer
StartUp.plot.hist(
    y="Operating_Years",
    bins=11
)

**Scatter plots**

A scatter plot displays the values of two variables in a data set by placing a mark at the point whose x-coordinate is the first value and whose y-coordinate is the second.

To draw a scatter plot we need to use the `.plot.scatter()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html). We need to pass two arguments to the function, the x-axis and the y-axis. 

The data below is from 1951 and it is a cross-sectional analysis of 24 British bus companies. [2] 

The `.head()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) shows the first 5 rows of a *DataFrame*. This is a good way to quickly see what the *DataFrame* looks like without having to output the whole thing - which might be many rows long.

In [None]:
bus = pd.read_csv("https://raw.githubusercontent.com/ThomasJewson/datasets/master/BritishBusCompanies1951/BritishBusCompanies1951-LinReg-1951.csv")
bus.head() # This shows the first 5 rows of the bus DataFrame
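
As a small sketch of `.head()`, the example below uses a made-up DataFrame called `readings` with 8 rows. Passing a number to `.head()` changes how many rows are returned.

```python
import pandas as pd

# Hypothetical DataFrame with 8 rows, made up for illustration only
readings = pd.DataFrame({"Reading": [3, 1, 4, 1, 5, 9, 2, 6]})

first_rows = readings.head()   # first 5 rows by default
first_two = readings.head(2)   # first 2 rows

print(len(first_rows), len(first_two))  # 5 2
```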

We want to plot the `Expenses per car mile (pence)` against the `Car miles per year (1000s)` on a scatter diagram to see whether there is a trend.

In [None]:
bus.plot.scatter(
    y="Expenses per car mile (pence)",
    x="Car miles per year (1000s)"
)

From this plot you can see that, in general, the points slope downwards. Formally, we say that the plot shows an *association*. This association tells us that as the bus companies do more *car miles per year*, the *expenses per car mile* decrease.

This downward association is known as a *negative association*. The opposite is a *positive association*, as you can see in the scatter plot below.

In [None]:
bus.plot.scatter(
    y="Receipts per car mile (pence)",
    x="Expenses per car mile (pence)"
)

This positive association tells us that, generally, as the *expenses per car mile* increase, the *receipts per car mile* increase too.
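
One common way to put a number on the direction of an association is the correlation coefficient, which pandas calculates with `.corr()`. This goes slightly beyond what the lesson covers, and the DataFrame `trips` with its values is made up for illustration. A negative result indicates a negative association and a positive result indicates a positive one.

```python
import pandas as pd

# Hypothetical figures, made up for illustration only
trips = pd.DataFrame({
    "miles": [100, 200, 300, 400, 500],
    "cost_per_mile": [20.0, 18.5, 17.0, 16.5, 15.0],
    "revenue_per_mile": [22.0, 20.0, 19.5, 18.0, 17.5],
})

# Cost per mile falls as miles rise: negative association
r_negative = trips["miles"].corr(trips["cost_per_mile"])

# Revenue per mile rises with cost per mile: positive association
r_positive = trips["cost_per_mile"].corr(trips["revenue_per_mile"])

print(r_negative, r_positive)
```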

As with the other graphs we can also input extra attributes into the function to customise our graphs. 

In [None]:
bus.plot.scatter(
    y="Receipts per car mile (pence)",
    x="Expenses per car mile (pence)",
    
    color="black",
    title="Scatter plot showing positive association between expenses and receipts",
    grid=True,
    figsize=(10,6),
    fontsize=14
)

With scatter plots we can add another dimension to our data by adjusting the colour / shade of each mark according to the value of a third column.

For example, below we have used `Percent of fleet on fuel oil` as the shade of the points by including the argument `c="Percent of fleet on fuel oil"`.

We cannot, however, use the `color` attribute and this functionality at the same time. 

In [None]:
bus.plot.scatter(
    y="Receipts per car mile (pence)",
    x="Expenses per car mile (pence)",
    c="Percent of fleet on fuel oil",
    
    title="Scatter plot showing positive association between expenses and receipts",
    grid=True,
    figsize=(10,6),
    fontsize=14
)
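
The sketch below shows how the `c` argument links each point to a value from a third column. The DataFrame `fleet` and its columns are made up for illustration. The values used for shading are read back from the returned graph object to confirm that they come from the named column.

```python
import pandas as pd

# Hypothetical figures, made up for illustration only
fleet = pd.DataFrame({
    "cost": [15.0, 16.5, 17.0, 18.5, 20.0],
    "revenue": [17.5, 18.0, 19.5, 20.0, 22.0],
    "share": [10, 35, 50, 80, 95],
})

# Shade each point by its value in the share column
ax = fleet.plot.scatter(
    x="cost",
    y="revenue",
    c="share"
)

# The values that drive the shading, one per plotted point
shades = [float(v) for v in ax.collections[0].get_array()]
print(shades)
```

The colour bar drawn next to the graph shows which shade corresponds to which value.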

**Exercise 6:** *Plot a scatter diagram with the x-axis being `Percent of Double Deckers in fleet` and the y-axis being `Percent of fleet on fuel oil`*

In [None]:
#Answer
bus.plot.scatter(
    x="Percent of Double Deckers in fleet",
    y="Percent of fleet on fuel oil"
)

**Conclusions**

*You should now be able to do the following:*
1. Produce the following graphs with the following functions:
    1. Pie charts with `.plot.pie()`
    2. Line plots with `.plot.line()`
    3. Bar charts with `.plot.bar()`
    4. Histogram plots with `.plot.hist()`
    5. Scatter plots with `.plot.scatter()`
2. Adjust the arguments of the previous functions to adjust the following:
    1. Colour
    2. Figure size
    3. Titles and font size
    4. Grid lines

**Sources:**

[1] https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?locations=EU

[2] J. Johnston (1956), "Scale, Costs, and Profitability in Road
Passenger Transport," The Journal of Industrial Economics Vol 4, pp207-223.