<h1><center>COMP1008: Exercise 2<br/>Data Visualisation and Processing</center></h1>

Data processing is a key step in machine learning. Analysing and visualising the data helps us choose an appropriate machine learning model for the task, which in turn can lead to higher accuracy. Errors and outliers in the data can also become evident while processing the data.

This tutorial provides some examples and hands-on tasks on
- <a href="#partone">Part 1 Pandas</a> for data manipulation
- <a href="#parttwo">Part 2 matplotlib</a> for data visualisation
- <a href="#partthree">Part 3 Data Processing</a> for machine learning

<div class="alert alert-success">
    <h3>Mini-Challenge 1: Graphing Combinatorial Explosion</h3>
</div>

After completing the guided tutorial on data visualisation and pre-processing, you are challenged to visualise the data in `data-combinatorial-challenge.xlsx` to show combinatorial explosion, using a suitable graph/chart and data processing techniques of your choice.

## Part 1. Visualisation using `pandas` 🐼 <span id="partone"></span>
In addition to data manipulation, the `pandas` package also provides a lot of plotting functionality.

In [None]:
# import the pandas package
import pandas as pd

# read a csv file from a url
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/Emissions%20Data.csv')

df.head()

### Plot

In [None]:
# 'plot' method: plots numerical data samples along their index, i.e. 'Emission' and 'Year' in two series
df.plot()

This plot is not particularly useful. To understand the data we need an appropriate visualisation, e.g. one that reveals any trend in $CO_2$ emissions over the years.

### Scatter Plot

In [None]:
# 'scatter' method: plots numerical data samples of x-axis vs. y-axis.
df.plot.scatter(x="Year", y="Emission")

This scatter plot is not informative either. Instead, we can group the data in `df['Emission']` by `Year` and then compute a statistic (such as the mean) for each group.

In [None]:
# Plot the average emissions of each year group. This plot is more informative.
df_avg = df.groupby("Year")["Emission"].mean()
df_avg.plot()
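To see what `groupby` produces before it is plotted, here is a minimal sketch on a small made-up table. The values are invented for illustration and are not taken from the emissions dataset. It uses `agg` to compute several statistics per year in one step.

```python
import pandas as pd

# Hypothetical toy data: two readings per year (values are invented)
toy = pd.DataFrame({
    "Year": [2008, 2008, 2009, 2009],
    "Emission": [10.0, 20.0, 30.0, 50.0],
})

# Group the 'Emission' column by 'Year', then compute several statistics at once
stats = toy.groupby("Year")["Emission"].agg(["mean", "min", "max"])
print(stats)
```

Each row of `stats` corresponds to one year group, so calling `stats.plot()` would draw one line per statistic.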



<div class="alert alert-info">
    <h3>Task 1</h3>
</div>

Plot the total emissions grouped by continent. Which continent contributed the most to emissions?

In [None]:
## your code here
# df_total = ...


### Bar

A line plot is usually used to visualise continuous data. `Continent` is categorical, so a bar chart is more appropriate.

In [None]:
df_total

In [None]:
# specify the 'kind' of the plot
df_total.plot(kind='bar')

### Histogram

<a href="https://en.wikipedia.org/wiki/Histogram">Histograms</a> show the distribution of the data. Each bar shows how many values fall into a "bin" (i.e. a range of values on the x-axis). See <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html">pandas histogram</a> for more information.

In [None]:
df['Emission'].hist()

In [None]:
# create a DataFrame df2008 by filtering the rows with a condition on the 'Year' column
df2008 = df[df['Year']==2008]
df2008['Emission'].hist()
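To make the idea of a "bin" concrete, the sketch below counts made-up values into three equal-width bins with NumPy's `np.histogram`. The numbers are invented and the choice of three bins is only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values (invented for illustration)
values = pd.Series([1, 2, 2, 3, 3, 3, 8, 9, 10, 10])

# Split the range [1, 10] into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"bin from {left} to {right}: {count} values")
```

The heights of the bars drawn by `values.hist(bins=3)` would correspond to these counts.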

### Visualise more information with combined plots

To combine scatter plots on the same set of axes, we keep the axes reference `ax` returned by the first plot, then reuse it for the other plots by including `ax=ax` in the arguments of `scatter(...)`.

In [None]:
# comparing the emissions in 2008 and 2011 in the same scatter plot, s defines size of each point
ax = df2008.plot.scatter(x='Emission', y='Continent', color='red', s=df2008['Emission']**3/100, label='2008')
df2011 = df[df['Year']==2011]
df2011.plot.scatter(x='Emission', y='Continent', color='blue', s=df2011['Emission']**3/100, label='2011', ax=ax)
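The axes-reuse pattern can be sketched in isolation as follows. The two small DataFrames and their column names `x` and `y` are hypothetical and only serve to show that the second call draws onto the axes created by the first.

```python
import pandas as pd

# Hypothetical toy data for two groups (values are invented)
group_a = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})
group_b = pd.DataFrame({"x": [1, 2, 3], "y": [3, 5, 7]})

# The first call creates the axes and returns a reference to them
ax = group_a.plot.scatter(x="x", y="y", color="red", label="A")

# Passing ax=ax draws the second plot on the same axes instead of a new figure
ax_b = group_b.plot.scatter(x="x", y="y", color="blue", label="B", ax=ax)
```

Without `ax=ax`, the second call would open a separate figure and the two groups could not be compared directly.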

## Part 2. Visualisation using `matplotlib` <span id="parttwo"></span>

The pandas package builds upon matplotlib to make visualisation easier. We can usually use the pandas plots; when a feature is not available there, we can access the more advanced features of the `matplotlib` library directly.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

Let's now use the functions of `matplotlib` to create the same scatter plot as before but fix the issues with labels and axis spacing.

In [None]:
# create the scatter plot using matplotlib 'plt' directly, not by df.plot.scatter() in Pandas
plt.scatter(df2008['Emission'], df2008['Continent'], s=df2008['Emission']**3/100)
plt.scatter(df2011['Emission'], df2011['Continent'], s=df2011['Emission']**3/100)
plt.title('$CO_2$ Emissions data of Continent in 2008 and 2011')

# functionality available through matplotlib plt, not Pandas
# plt.xlim([0,80])
plt.xlabel('Emissions (ktns)')
plt.ylabel('Continent')
plt.legend(['2008','2011'])
# explicitly setting the size of the plot
plt.gcf().set_size_inches(8, 6)

We can also create separate sub-plots using `matplotlib`, normally with the same y-axis range; otherwise the visualisation can be misleading!

In [None]:
# tell matplotlib how many subplots to draw
plt.subplot(1,2,1)
# start of first subplot
plt.scatter(df2008['Emission'], df2008['Continent'], s=df2008['Emission']**3/100)
# only add the y-axis label on the left-most subplot
# (the y-axis shows the continents; the emissions are on the x-axis)
plt.ylabel('Continent')
plt.xlabel('2008')

# start of second subplot
plt.subplot(1,2,2)
plt.scatter(df2011['Emission'], df2011['Continent'], color='red', s=df2011['Emission']**3/100)
plt.xlabel('2011')

# explicitly setting the size of the plot
fig = plt.gcf()
fig.set_size_inches(16, 4)
fig.suptitle('Emissions in 2008 and 2011')
fig.supxlabel('Year')
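One way to guarantee that subplots use the same y-axis range is to create them with `plt.subplots(..., sharey=True)`. The sketch below uses invented numbers; the dictionary `years` and its values are assumptions made for illustration only.

```python
import matplotlib.pyplot as plt

# Hypothetical toy values for two years (invented for illustration)
years = {"2008": [1, 4, 9], "2011": [10, 20, 60]}

# sharey=True puts every subplot in the row on the same y-axis range
fig, axes = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
for ax, (label, values) in zip(axes, years.items()):
    ax.scatter(range(len(values)), values)
    ax.set_title(label)

# only the left-most subplot needs a y-axis label
axes[0].set_ylabel("value")
```

With a shared axis, the much smaller 2008 values are visibly smaller in the figure, rather than being stretched to fill their own panel.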

## Part 3. Data Pre-processing <span id="partthree"></span>

Missing or erroneous data in real-world datasets can cause issues for some machine learning algorithms. We can either 1. delete (drop) the affected samples, or 2. populate the missing data.

In [None]:
import numpy as np # We need to handle some numerical data below

df_pp = pd.read_excel('data-combinatorial-tutorial.xlsx', index_col=0)
df_pp

Notice the missing data in the $N^2$ column, indicated as `NaN`. The `isnull()` method tells us which data points are missing, flagging them as `True`.

In [None]:
df_pp.isnull()
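For larger tables, scanning a grid of `True`/`False` values by eye is impractical. A minimal sketch, on a made-up table whose column name `N` is an assumption, shows how to summarise the result of `isnull()` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table (values are invented; the column 'N' is an assumption)
toy = pd.DataFrame({"N": [1, 2, 3, 4], "N^2": [1.0, 4.0, np.nan, np.nan]})

# True counts as 1 when summed, so this gives the number of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Select only the rows that contain at least one missing value
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```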

### Dropping Missing Data

In [None]:
# Option 1: drop every row (sample) that contains at least one missing value
dropped_na_data = df_pp.dropna()
dropped_na_data

In [None]:
# Option 2: remove the associated column(s)
c_dropped_na_data = df_pp.dropna(axis='columns')
c_dropped_na_data

This option allows us to retain all data samples, but we lose the entire $N^2$ column. 
For more information on `dropna` see this <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html">documentation</a>.
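The `how`, `subset` and `axis` parameters of `dropna` control how aggressively data is removed. The sketch below compares them on a made-up table; the column names `A`, `B` and `C` and all values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with gaps in columns 'A' and 'B' (values are invented)
toy = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [1.0, np.nan, np.nan, 16.0],
    "C": [2.0, 4.0, 8.0, 16.0],
})

rows_any = toy.dropna()                 # drop rows with at least one NaN
rows_all = toy.dropna(how="all")        # drop rows only if every value is NaN
rows_subset = toy.dropna(subset=["A"])  # only consider gaps in column 'A'
cols_any = toy.dropna(axis="columns")   # drop columns with at least one NaN

print(len(rows_any), len(rows_all), len(rows_subset), list(cols_any.columns))
```

The stricter the rule, the more data survives, so the choice depends on which columns the later analysis actually needs.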

### Replacing Missing Data
Data are expensive. We'd rather replace the missing data with some appropriate "fake data", with carefully chosen values. Observe the following methods. Do these methods make sense in this context?

#### Zero Values

In [None]:
# replace missing data simply with zero
zeroes_df = df_pp.replace(to_replace = np.nan, value = 0)
zeroes_df

In [None]:
zeroes_df.plot()

#### Average Values

In [None]:
# populate missing data with average values for each data series
df_pp = pd.read_excel('data-combinatorial-tutorial.xlsx', index_col=0) # column 0 read as the index
df_pp.isnull()

We can use the `fillna()` method to replace all the missing data in the $N^2$ column with the median data point.

In [None]:
# calculate the median average of the $N^2$ column
med = df_pp['N^2'].median()

# fill all 'na' data points with `med`, and assign this back
df_pp['N^2'] = df_pp['N^2'].fillna(med)
# observe the last three rows now contain values equal to the median of the first 12 samples.
df_pp['N^2']

In [None]:
df_pp.plot()

#### Forward Filling Values

In [None]:
# forward filling fills the missing data with the last known value
df_pp = pd.read_excel('data-combinatorial-tutorial.xlsx', index_col=0)
df_pp

In [None]:
df_pp = df_pp.ffill() # forward fill; fillna(method='ffill') is deprecated in recent pandas versions
df_pp

In [None]:
df_pp.plot()

Note that there are multiple methods for replacing missing values using `fillna()` as seen <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html">here</a>.
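As a rough comparison of some alternatives, the sketch below applies forward filling, backward filling and linear interpolation to the same made-up series. The values are invented, and which method is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two consecutive gaps (values are invented)
s = pd.Series([1.0, 4.0, np.nan, np.nan, 25.0])

forward = s.ffill()       # repeat the last known value
backward = s.bfill()      # use the next known value
linear = s.interpolate()  # draw a straight line between the known neighbours

print(pd.DataFrame({"ffill": forward, "bfill": backward, "interpolate": linear}))
```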

#### Randomised Values <span id="replace-ran"></span>
Try running this example several times. 
Note that the missing data has been replaced with a randomly selected value from the existing data.

In [None]:
import random
df_pp = pd.read_excel('data-combinatorial-tutorial.xlsx', index_col=0)
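# pick the replacement from the observed (non-missing) values only, so that NaN itself can never be chosen
df_pp['N^2'] = df_pp['N^2'].fillna(random.choice(df_pp['N^2'].dropna().tolist()))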
df_pp

In [None]:
df_pp.plot()
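The cell above fills every gap with the same single random value. A possible variant, sketched below on an invented series, draws a separate random value for each gap using NumPy's random generator. The fixed `seed` is an assumption made only so that the result is repeatable.

```python
import numpy as np
import pandas as pd

# Seeded generator so that repeated runs give the same result (assumption for illustration)
rng = np.random.default_rng(seed=0)

# Hypothetical series with two gaps (values are invented)
s = pd.Series([1.0, 4.0, 9.0, np.nan, np.nan])

observed = s.dropna().to_numpy()  # candidate replacement values
missing = s.isnull()              # boolean mask marking the gaps

filled = s.copy()
# draw one value per gap from the observed values
filled[missing] = rng.choice(observed, size=missing.sum())
print(filled)
```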

Note: Data is valuable. An improper method of filling in the missing data can change the properties of the data, e.g. the mean and median of the distribution. The effects of applying different methods to real-world data may not be so obvious!
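A quick way to check this effect is to compare summary statistics before and after filling. The following sketch does so for an invented series; the numbers are made up and the set of methods is only a sample.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps at the end (values are invented)
s = pd.Series([1.0, 4.0, 9.0, 16.0, np.nan, np.nan])

# Each candidate is the same series with a different filling method applied
candidates = {
    "original (NaN ignored)": s,
    "zero fill": s.fillna(0),
    "median fill": s.fillna(s.median()),
    "forward fill": s.ffill(),
}

# One row per method, with the mean and median of the resulting series
summary = pd.DataFrame({
    name: {"mean": filled.mean(), "median": filled.median()}
    for name, filled in candidates.items()
}).T
print(summary)
```

In this toy case zero filling pulls both statistics down, forward filling pushes them up, and median filling preserves the median but still shifts the mean.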

## Mini-Challenge 1: Graphing Combinatorial Explosion <span id="part3"></span>

<div class="alert alert-success">
    <h3>Mini-Challenge (Submit to Moodle!)</h3>
</div>

A challenge dataset `data-combinatorial-challenge.xlsx` is provided. With the knowledge you have gained from this exercise, create a plot which you deem to be appropriate for visualising the data in the dataset.

You should import the data using pandas and pre-process the data to remove or correct (as appropriate) any missing data. By default the y-axis on a plot has a linear scaling. This might not be appropriate for this case; explore setting a logarithmic scale!
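As a starting point for exploring axis scaling, here is a minimal sketch on invented data (the column names `linear` and `exponential` are hypothetical and unrelated to the challenge dataset). In pandas, `logy=True` switches the y-axis to a logarithmic scale; with plain matplotlib, `plt.yscale('log')` has the same effect on the current axes.

```python
import pandas as pd

# Hypothetical toy data: one slowly and one rapidly growing series (values are invented)
toy = pd.DataFrame(
    {"linear": [1, 2, 3, 4, 5], "exponential": [10, 100, 1000, 10000, 100000]},
    index=[1, 2, 3, 4, 5],
)

# logy=True draws the y-axis on a logarithmic scale
ax = toy.plot(logy=True)
ax.set_xlabel("step")
ax.set_ylabel("value (log scale)")
```

On a linear axis the slowly growing series would be flattened against the bottom of the plot; on a logarithmic axis both series remain readable.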

Once you are happy with your plot, submit an image of it in Moodle (to save the image, right-click on the plot and choose the save option). After uploading the image and pressing submit, you should see the following screen; add your comment and press "Save comment". Your comment should cover the pros and cons of your chosen method for handling the missing data and of your chosen plot, in the following format:

```
Plot type: [ Pie chart ]
Missing data completion method: [ Zero fill ]
Pros: [ Retain all data series ]
Cons: [ Data is unrealistic for large values of 'N' ]
```

![na](img/submission.png)

***The class will receive anonymised feedback in a lecture based on everyone's responses.***

In [None]:
ddf = pd.read_excel('data-combinatorial-challenge.xlsx')
ddf.head()

In [None]:
# Enter your data pre-processing and plotting code here!


<div class="alert alert-success">
    <h2>🍰 End</h2> 
</div>