# Bar charts, pie charts and frequency density charts

As this is the first Jupyter notebook, we provide more explanation of what is being done, and why, than in later notebooks.  In a Jupyter notebook you have two types of cells: text (like this cell) and (Python) code.  The latter have 'In [ ]:' next to them and a 'play' button which you click to 'run' the cell, which returns some output (or at least does something).  You can run each cell individually or all of them at once (via the 'Kernel' menu).  While a cell is running, an asterisk appears in the [ ]; when it has finished, a number appears.  If no error message appears, it has worked.

You can edit the code in a cell and/or add new cells.  In so doing you can experiment and learn how to analyse data using Python.  At the end of each notebook are a few suggested exercises.

The first thing we do in most notebooks is to import any libraries that we need.  Libraries contain additional commands and functions beyond the basic Python language.  In the next cell we import 'pandas', a library of routines for reading and manipulating data, and then 'matplotlib', which contains routines for drawing graphs.

The 'as pd' or 'as plt' part of the command means we can use the abbreviation pd for pandas and plt for matplotlib.pyplot.  The second import statement does not import the whole matplotlib library (which you can do) but just the pyplot component of it, the only part we need for this notebook.  Run the cell now to execute those commands.  Nothing seems to happen, but don't worry.  All they did was import some material and, since we did not get an error message, all is well.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
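If you would like some visible confirmation that the imports worked, one option (an addition, not part of the original notebook) is to print the version numbers of the libraries.  The sketch below assumes both libraries are installed; the numbers you see will depend on your own installation.

```python
import pandas as pd
import matplotlib                  # The whole library, imported here only to read its version
import matplotlib.pyplot as plt    # The plotting component used in this notebook

print('pandas version:', pd.__version__)
print('matplotlib version:', matplotlib.__version__)
```

This also illustrates the difference between importing a whole library (matplotlib) and importing just one component of it (matplotlib.pyplot).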

Next we read a data file from disk.  In this case it is the education data used in chapter 1.  We use a pandas (pd, remember?) function to read the file (which is in csv format).  In Python, the data are held in a 'dataframe', to which we here give the name 'df'.  Any name will do, e.g. my_data, or educ; it is up to you.  However, it is better to use something generic such as 'df', which allows you to re-use code easily.

The 'df.head' command shows the first few lines of data (here four) as a confirmation that we have the right data.  Run the cell now.

In [2]:
df = pd.read_csv('education.csv')
df.head(4) 

,status,higher_ed,a_levels,other,none,total
0,In work,9713.0,5479.0,10173.0,1965.0,27330.0
1,Unemployed,394.0,432.0,1166.0,382.0,2374.0
2,Inactive,1256.0,1440.0,3277.0,2112.0,8084.0
3,Total,11362.0,7352.0,14615.0,4458.0,37788.0


These data are the same as in Table 1.1 of the book.  
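A few further checks can help confirm that a file has been read as expected.  The sketch below is an addition, not part of the original notebook.  It rebuilds the four rows shown above from typed-in text (the names csv_text and df_check are stand-ins, used so that the cell runs without the education.csv file) and then asks for the shape, the column names and the data types.

```python
import io
import pandas as pd

# The rows shown in the output above, typed in as text (a stand-in for education.csv)
csv_text = """status,higher_ed,a_levels,other,none,total
In work,9713.0,5479.0,10173.0,1965.0,27330.0
Unemployed,394.0,432.0,1166.0,382.0,2374.0
Inactive,1256.0,1440.0,3277.0,2112.0,8084.0
Total,11362.0,7352.0,14615.0,4458.0,37788.0
"""
df_check = pd.read_csv(io.StringIO(csv_text))   # Read the text as if it were a csv file

print(df_check.shape)              # Number of rows and columns: (4, 6)
print(df_check.columns.tolist())   # The column names, in order
print(df_check.dtypes)             # The type of data held in each column
```

With your own dataframe you would apply the same three commands to df directly.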

## A simple bar chart

The first thing we do is to recreate Figure 1.1 of the book, a bar chart of the data in the 'In work' row.  The commands are below, with comments (preceded by '#') as to their meaning.

In [None]:
labels = ['higher_ed', 'a_levels', 'other', 'none']       # These will be the x-axis labels
values = [9713, 5479, 10173, 1965]                        # These are the heights of the bars to graph
plt.bar(labels, values)                                   # This is the command that constructs a bar chart
plt.xlabel('Level of qualification')                      # The next three commands label the axes and provide a title
plt.ylabel('Number')                                      # They are optional, the chart would still appear in their
plt.title('Qualifications of those in work')              # absence but it is better to have these.
plt.show()                                                # Finally, display the graph

There are many more options to the plt commands, to improve the graph or change it in various ways.  In these notebooks we'll keep these to a minimum.  You can experiment yourselves with things like changing the colours, fonts, etc.  One simple thing you might try is to change the category labels so that they read more like English.  To do this you need to change the first line of the code.  Give it a try, then run the cell again.
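As an illustration of the kind of options available, here is a sketch (an addition, not part of the original notebook) that uses a few of them: a figure size, narrower bars with a black outline, and rotated x-axis labels.  The particular values chosen (6 by 4 inches, a width of 0.6, 45 degrees) are arbitrary and only for demonstration.

```python
import matplotlib.pyplot as plt

labels = ['higher_ed', 'a_levels', 'other', 'none']
values = [9713, 5479, 10173, 1965]

plt.figure(figsize=(6, 4))                    # Set the size of the figure, in inches
bars = plt.bar(labels, values, width=0.6,     # Narrower bars than the default width of 0.8
               edgecolor='black')             # Draw a black outline around each bar
plt.xticks(rotation=45)                       # Rotate the x-axis labels by 45 degrees
plt.xlabel('Level of qualification')
plt.ylabel('Number')
plt.title('Qualifications of those in work')
plt.tight_layout()                            # Adjust the spacing so labels are not cut off
plt.show()
```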

## A multiple bar chart

Next we will do something more complex, a multiple bar chart (it is actually Figure 1.2 in the book).  This requires more coding as you can see in the cell below.  But remember, you only need to figure this out once, then you can reuse the code for different data, so in the longer term it saves you time. 

The chart is similar to the one above but also includes the unemployed and inactive categories, all in one chart.  Run the cell to see the result, then look through the code to see how it is constructed.

In [None]:
import numpy as np                                       # Some tools needed from this library                      

labels = ['higher_ed', 'a_levels', 'other', 'none']      # Labels for the x-axis
in_work = [9713, 5479, 10173, 1965]                      # Data copied manually from the dataframe.  Tedious!
unemployed = [394.0, 432.0, 1166.0, 382.0]               # See below for a better way.
inactive = [1256.0, 1440.0, 3277.0, 2112.0]

# Alternative way of referencing the data, directly from the dataframe. Uncomment the three lines below and 
# run the cell again. This will overwrite the three commands above (unless you comment those out.)
#in_work = df.iloc[0,1:5] 
#unemployed = df.iloc[1,1:5]
#inactive = df.iloc[2,1:5]

# How the alternative method works: df.iloc[0, 1:5] refers to cells in the dataframe, in row 0, columns 1 to 5. 
# But there's a trap...  Python counts from 0, not 1, so 1 refers to the second column of the dataframe ('higher_ed')
# Second trap...  The 1:5 instruction means go from 1 up to (but not including) 5.  This is why column 5 ('total')
# does not appear in the graph.  It's confusing at first to count this way and takes some getting used to.

x = np.arange(len(labels))                                # Positions of the four groups on the x-axis: 0, 1, 2, 3
width = 0.25                                              # the width of the bars

# This is where the plots get drawn, one command per category
fig, ax = plt.subplots()                                              
rects1 = ax.bar(x - width, in_work, width, label='In work')            # Draw the 'in work' bars, etc
rects2 = ax.bar(x, unemployed, width, label='Unemployed')
rects3 = ax.bar(x + width, inactive, width, label='Inactive')
 
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xlabel('Level of qualification')
ax.set_ylabel('Numbers')
ax.set_title('Education and employment status')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

# Finally, display the chart
plt.show()
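The two 'traps' described in the comments of the cell above, and the role of x and width, are easier to see by printing the values involved.  The sketch below is an addition to the notebook.  The dataframe df_demo is a stand-in typed in from the output of df.head(4), so that the cell runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-in for the dataframe, typed in from the output of df.head(4)
df_demo = pd.DataFrame({
    'status':    ['In work', 'Unemployed', 'Inactive', 'Total'],
    'higher_ed': [9713.0, 394.0, 1256.0, 11362.0],
    'a_levels':  [5479.0, 432.0, 1440.0, 7352.0],
    'other':     [10173.0, 1166.0, 3277.0, 14615.0],
    'none':      [1965.0, 382.0, 2112.0, 4458.0],
    'total':     [27330.0, 2374.0, 8084.0, 37788.0],
})

row = df_demo.iloc[0, 1:5]     # Row 0, columns 1, 2, 3 and 4 (column 5 is excluded)
print(row.index.tolist())      # Column names selected: 'higher_ed' up to 'none', not 'total'
print(row.tolist())            # The four 'In work' values

x = np.arange(4)               # One position per qualification: 0, 1, 2, 3
width = 0.25
print(x - width)               # Centres of the 'In work' bars, shifted left by one bar width
print(x)                       # Centres of the 'Unemployed' bars
print(x + width)               # Centres of the 'Inactive' bars, shifted right by one bar width
```

Shifting each set of bars by one bar width is what places the three bars of each group side by side rather than on top of one another.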

## Pie chart

Now we draw a pie chart of the data, showing the qualifications of those in work, as in Figure 1.5 of the book.  It is slightly different as it contains the percentage in each qualification category rather than the absolute number.  This is arguably more useful.  We leave it as an exercise if you want to put the absolute numbers in instead.

In [None]:
# Pie chart, where the slices will be ordered and plotted clockwise:
labels = 'Higher education', 'A levels', 'Other', 'None'           # Labels for the slices
in_work = df.iloc[0, 1:5]                                          # Grab the data to graph

plt.pie(in_work, labels=labels, autopct='%1.1f%%',                 # Creates the chart
        counterclock = False, startangle=90)
plt.title('Qualifications of those in work')                       # Add title

plt.show()                                                         # Display the chart
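The percentages printed on the slices are each category's share of the total.  As a check, the sketch below (an addition to the notebook) works out the same shares by hand from the 'In work' figures, formatted to one decimal place as in the autopct setting above.  The names total and shares are introduced here purely for illustration.

```python
labels = ['Higher education', 'A levels', 'Other', 'None']
in_work = [9713, 5479, 10173, 1965]              # The 'In work' row of the dataframe

total = sum(in_work)                             # 27330, the 'total' entry for this row
shares = [100 * value / total for value in in_work]   # Percentage share of each category

for label, share in zip(labels, shares):
    print(f'{label}: {share:.1f}%')              # One decimal place, followed by a % sign
```

These figures should agree with the labels on the pie chart.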

## Suggested exercises

Now that you have completed the first Jupyter notebook, it is time to practise.  Play around with the code and see what happens, try to create a new chart, etc.  Here are some suggestions:

1. Try changing the 'labels = ...' statement to give labels which are more grammatical or easy to read, for the simple bar chart.
2. Change the colour of the simple bar chart.  Change the 'plt.bar()' command so it reads plt.bar(labels, values, color = 'red').
3. Draw a similar pie chart to the one above, but for the unemployed category.  Change the df.iloc[...] expression so it uses row 1 rather than row 0.  You'll need to change the title too.