# Neoscholar Machine Learning Tutorials
### Session 01. Introduction to Numpy, Pandas and Matplotlib

### Contents
1. Numpy
2. Pandas
3. Matplotlib
4. EDA (Exploratory Data Analysis)

### Aim
At the end of this session, you will be able to:
- Understand the basics of numpy.
- Understand the basics of pandas.
- Understand the basics of matplotlib.
- Perform a simple EDA using the libraries above.

## 3. Matplotlib
Matplotlib is a Python data visualisation library. Its plotting system is similar to that of MATLAB.

### 3.1 Basics of Matplotlib

In [None]:
# run this cell if you haven't installed matplotlib
!pip install matplotlib

- `%matplotlib inline` is only available for Jupyter Notebook and Jupyter QtConsole. With this backend, plots are displayed inline, directly below the code cell that produces them.
- `%matplotlib tk` is also an IPython/Jupyter magic. It uses the Tk GUI backend, so plots open in a separate interactive window instead of appearing below the cell.
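
The `%matplotlib` magics above only exist inside IPython/Jupyter. In a plain Python script, the backend can be chosen with `matplotlib.use` instead. A minimal sketch, assuming the non-interactive `Agg` backend (which renders to image files and never opens a window) is good enough for your use case:

```python
import matplotlib

# Select the backend before any figure is drawn.
# "Agg" renders to image files only; it never opens a window.
matplotlib.use("Agg")

import matplotlib.pyplot as plt

# Ask Matplotlib which backend is currently active
backend_name = matplotlib.get_backend()
print(backend_name)
```

Inside a notebook you would normally keep using `%matplotlib inline` and not call `matplotlib.use` at all.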

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# %matplotlib tk

Plot a **y=sin(x)** graph.

In [None]:
# declare x and y 
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)
print(x)
print(y)
assert len(x)==len(y)

In [None]:
# plot a graph of sin(x)
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('y')

plt.title('y = sin(x)')
# To save the figure, call savefig before show(); after show() the figure may already be closed
# plt.savefig('./image/sinGraph.png')
plt.show()
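
If you want to keep the graph as an image file, the order of the calls matters: `plt.savefig` should come before `plt.show`. The sketch below illustrates this under a few assumptions that are not part of the tutorial: it uses the headless `Agg` backend so that it runs without a display, and it writes to a temporary directory (a hypothetical location) instead of `./image/`.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)

plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('y = sin(x)')

# Save first: once show() has finished, the figure may already be closed,
# and a later savefig() would write an empty canvas.
out_path = os.path.join(tempfile.mkdtemp(), 'sinGraph.png')  # hypothetical output location
plt.savefig(out_path)
plt.show()

saved_ok = os.path.getsize(out_path) > 0  # the file exists and is not empty
plt.close('all')
```

In the notebook itself you can drop the `matplotlib.use("Agg")` line and keep `%matplotlib inline`.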

Plot multiple functions: **y = x**, **y = x^2**, and **y = x^3** in one graph.

In [None]:
# TODO: plot multiple graphs in one graph
x = np.arange(10)
x_linear = x
x_square = x**2
x_cubic = x**3

plt.plot(x, x_linear)
plt.plot(x, x_square)
plt.plot(x, x_cubic)
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.grid()
plt.legend(['y = x', 'y = x^2', 'y = x^3'])
plt.title('y = x | y = x^2 | y = x^3')
# TODO: print out the graph under this cell
plt.show()
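
Passing a list to `plt.legend` works, but the labels are then matched to the lines purely by order. An alternative sketch is to give each line its own `label` and let `plt.legend()` collect them. The `Agg` backend line is an assumption made only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)

# Attach each label to its own line, so reordering the plot calls cannot mislabel a curve
plt.plot(x, x, label='y = x')
plt.plot(x, x**2, label='y = x^2')
plt.plot(x, x**3, label='y = x^3')
plt.grid()
plt.legend()  # picks up the labels given above

# Inspect which labels the legend will show
legend_labels = plt.gca().get_legend_handles_labels()[1]
print(legend_labels)
plt.close('all')
```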

### 3.2 Various Types of Plots
Matplotlib supports various types of plots, such as bar charts, histograms, scatter plots, area plots and pie charts. Let's use IMDB-Movie-Data again to get a better understanding of the data. Visualising data is a crucial part of EDA, which you'll get hands-on experience with soon!

In [None]:
movie = pd.read_csv("./data/IMDB-Movie-Data.csv")
movie.columns

Let's check whether there is a positive correlation between `Rating` and `Revenue (Millions)` using the `scatter` function. The parameter `s` sets the marker size (its area, in points squared) and `alpha` sets the opacity, from 0 (fully transparent) to 1 (fully opaque).

#### 3.2.1 Scatter plot

In [None]:
movie.plot.scatter(x = 'Rating', y = 'Revenue (Millions)', s = 10, alpha = 1)

They appear to have a weak positive relationship.
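
To get a feel for `s` and `alpha`, it can help to draw the same points twice with different settings. The sketch below uses made-up random points (not the IMDB data) and the headless `Agg` backend; both are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up points purely for illustration (not the IMDB data)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(size=200)

fig, (ax_opaque, ax_faded) = plt.subplots(1, 2, figsize=(8, 3))

# s is the marker area in points^2, so s=40 has 4x the area (2x the diameter) of s=10
opaque = ax_opaque.scatter(x, y, s=10, alpha=1)
# A lower alpha makes overlapping points easier to see in dense regions
faded = ax_faded.scatter(x, y, s=40, alpha=0.3)

ax_opaque.set_title('s=10, alpha=1')
ax_faded.set_title('s=40, alpha=0.3')
```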

Now, let's use different colors and markers to distinguish different groups of data and their centroids.

In [None]:
# Read the dataset and print the first 20 samples.
scatters = pd.read_csv("./data/scatters_data.csv", index_col=0)
scatters['class'] = scatters['class'].astype(int)
print(scatters.head(20))

# Split the dataset according to class index and store the parts in a dictionary
clas = [0, 1, 2]
class_dict = {str(clsNum): scatters[scatters['class'] == clsNum] for clsNum in clas}
# Keep one named DataFrame per class for the cells below
class0, class1, class2 = (class_dict[str(clsNum)] for clsNum in clas)

First, calculate the centroid of each class. The centroid ($c_x$, $c_y$) of a set of equally weighted points in 2D space is simply given by:

\begin{align}
c_x = \frac{\sum_{i=1}^n x_i}{n}, \qquad
c_y = \frac{\sum_{i=1}^n y_i}{n}
\end{align}

In [None]:
# TODO: Calculate Cx, Cy for each class
cx=[np.mean(class0['x']),np.mean(class1['x']),np.mean(class2['x'])] # x coordinates for centroids of class 0, 1 and 2
cy=[np.mean(class0['y']),np.mean(class1['y']),np.mean(class2['y'])] # y coordinates for centroids of class 0, 1 and 2
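
The per-class means can also be computed in one step with `groupby`, which avoids listing every class by hand. This is only a sketch on a made-up toy table (the values are not from `scatters_data.csv`); the column names `x`, `y` and `class` follow the ones used above.

```python
import pandas as pd

# Made-up toy points, only for illustration (not the tutorial's dataset)
toy = pd.DataFrame({
    "x": [0.0, 2.0, 4.0, 6.0],
    "y": [1.0, 3.0, 10.0, 20.0],
    "class": [0, 0, 1, 1],
})

# The mean of x and y within each class is the centroid of that class
centroids = toy.groupby("class")[["x", "y"]].mean()
print(centroids)

# Same layout as the cx / cy lists used in this section
cx = centroids["x"].tolist()
cy = centroids["y"].tolist()
```

With this approach the code stays the same if a fourth class is added to the data.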

Let's draw a scatter plot of the data points, assigning a different color and marker to each group. Then plot their centroids with the corresponding color but the same marker **x**.

In [None]:
# Set potential colors and markers
color=['r','g','b']
marker=['o','s','v']
# TODO: plot scatters for data points and their centroids.
for i in clas:
    #plot the groups of data
    plt.scatter(class_dict[str(i)]['x'],class_dict[str(i)]['y'],color=color[i],marker=marker[i],s=100)
    #plot centroid of each group you calculate above
    plt.scatter(cx[i],cy[i],color=color[i],marker='x',s=200)

#### 3.2.2 Bar plot
Let's look into the Most Used Programming, Scripting, and Markup Languages in 2018.  
Use `plt.bar` and `plt.xticks` to plot a 'language against Percentage' graph.

`plt.xticks` has two main parameters: `ticks` accepts an array of positions at which to place the ticks, and `labels` accepts an array or a list giving the label shown at each tick.

In [None]:
# https://insights.stackoverflow.com/survey/2018#most-popular-technologies
language = ["JS", "HTML", "SQL", "Java", "Shell", "Python", "C#"] # Name of each scale
percentage = [69.8, 68.5, 57.0, 45.3, 39.8, 38.8, 34.4]

# Generating the x positions as the scales on x axis.
x_positions = np.array(range(len(percentage)))

# TODO: Create a bar plot
plt.bar(x_positions, percentage)
plt.xticks(x_positions, language)
plt.xlabel('language')
plt.ylabel('Percentage (%)')
plt.title("Most Used Programming, Scripting, and Markup Languages in 2018")
plt.show()

Next, let's learn how to adjust the width of the bars.
`plt.bar()` has a parameter called `width` that sets the width of each bar. Because the bar positions are fixed, a smaller width also leaves a wider gap between neighbouring bars.

In [None]:
# Set the bar width(any float between 0 and 1)
bar_width=0.5
# TODO: Create a bar plot with different bar width
plt.bar(x_positions, percentage,width=bar_width)
plt.xticks(x_positions, language)
plt.xlabel('language')
plt.ylabel('Percentage (%)')
plt.title("Most Used Programming, Scripting, and Markup Languages in 2018")
plt.show()

# TODO: what if the bar_width is set bigger than 1?
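
One way to explore the question above is to read the bar edges back from the plot. The sketch below assumes bars placed one unit apart (as with `x_positions`) and uses arbitrary heights; the `Agg` backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

x_positions = np.arange(3)  # bars are centred on 0, 1, 2
heights = [3, 2, 1]         # arbitrary illustrative heights

fig, ax = plt.subplots()
bars = ax.bar(x_positions, heights, width=1.5, alpha=0.5)

# By default each bar spans [centre - width/2, centre + width/2]
first_right_edge = bars[0].get_x() + bars[0].get_width()
second_left_edge = bars[1].get_x()

# If the first bar ends to the right of where the second one starts, they overlap
bars_overlap = bool(first_right_edge > second_left_edge)
print(bars_overlap)
```

With a spacing of 1 between positions, any `width` above 1 makes neighbouring bars overlap; `alpha=0.5` is used here so the overlap stays visible.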


Bar charts with multiple groups are particularly useful. Let's compare the Most Used Programming, Scripting, and Markup Languages in 2020 with those in 2018.
On this occasion, we are going to add labels, grids and each bar's value. Note that the methods of the `AxesSubplot` object may partially differ from the `pyplot` functions used directly in the previous cases.

`set_xticks()` sets the tick positions on the x axis.

`set_xticklabels()` maps the elements of `language` to the tick positions.

`yaxis.grid()` draws grid lines at the y-axis ticks (horizontal lines).

`legend()` places a legend, where the parameter `loc='best'` lets Matplotlib choose the legend's location automatically.

`annotate()` adds a text annotation to the chart. The parameter `text` is the annotation content, `xy` is the point being annotated and `xytext` is the position of the annotation text (an offset from `xy` when `textcoords='offset points'` is used).

In [None]:
# language = ["JS", "HTML", "SQL", "Java", "Shell", "Python", "C#"]

# The first group (2018) of language popularity
percentage_2018 = percentage
# The second group (2020) of language popularity
percentage_2020 = [67.8, 63.5, 54.4, 41.1, 36.6, 41.7, 31.0]

fig,ax=plt.subplots()
# Two bars of width 0.4 fill 0.8 of each unit-wide slot, leaving a 0.2 gap between groups.
barwidth=0.4
# Plot the first group of data
ax.bar(x_positions, percentage_2018, width=barwidth, label='2018')
# Plot the second group of data
ax.bar(x_positions+barwidth, percentage_2020, width=barwidth, label='2020')

# TODO: set proper ticks, labels, grids, and bars' value
ax.set_xticks(ticks= x_positions+0.5*barwidth) # Set the ticks in the center of the two bars
ax.set_xticklabels(language)
ax.set_yticks(ticks=range(0,100,10)) # Set the ticks on y axis by 10% spacing
ax.yaxis.grid(True)

# add the value of each bar
for xy in zip(x_positions, percentage_2018):
    ax.annotate(text="%s" % xy[1], xy=xy, xytext=(-15, 10), textcoords='offset points')
for xy in zip(x_positions+barwidth, percentage_2020):
    ax.annotate(text="%s" % xy[1], xy=xy, xytext=(-10, 10), textcoords='offset points')
ax.legend(loc='best')
ax.set_xlabel('language')
ax.set_ylabel('Percentage (%)')
fig.suptitle("Comparison of Most Used Programming Languages in 2018 and 2020")
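
As an alternative to the `annotate()` loops, recent Matplotlib versions provide `Axes.bar_label`, which writes each bar's value next to it. The sketch below reuses the survey numbers from this section; it assumes Matplotlib 3.4 or newer (where `bar_label` was introduced) and the headless `Agg` backend. The `figsize` and `padding` values are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

language = ["JS", "HTML", "SQL", "Java", "Shell", "Python", "C#"]
percentage_2018 = [69.8, 68.5, 57.0, 45.3, 39.8, 38.8, 34.4]
percentage_2020 = [67.8, 63.5, 54.4, 41.1, 36.6, 41.7, 31.0]
x_positions = np.arange(len(language))
barwidth = 0.4

fig, ax = plt.subplots(figsize=(10, 4))
bars_2018 = ax.bar(x_positions, percentage_2018, width=barwidth, label='2018')
bars_2020 = ax.bar(x_positions + barwidth, percentage_2020, width=barwidth, label='2020')

# bar_label writes each bar's height just above it (assumes Matplotlib >= 3.4)
labels_2018 = ax.bar_label(bars_2018, padding=2)
labels_2020 = ax.bar_label(bars_2020, padding=2)

ax.set_xticks(x_positions + 0.5 * barwidth)  # centre each tick between the two bars
ax.set_xticklabels(language)
ax.yaxis.grid(True)
ax.legend(loc='best')
```

`annotate()` remains the more flexible tool when labels need custom positions or arrows.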

It is often useful to use horizontal bars, particularly when there are many of them. Converting a vertical chart to a horizontal one only requires replacing the `bar` method with its horizontal counterpart `barh` (whose bar thickness is set by `height` instead of `width`) and swapping the axis labels, ticks, grids, etc.

In [None]:
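# TODO: Rewrite the counterpart horizontal bar chart by referring to the previous case.
y_positions = np.array(range(len(percentage)))
fig, ax = plt.subplots()
# Very similar to the previous example, with the roles of x and y swapped
# Plot the first group of data (barh uses `height` for the bar thickness)
ax.barh(y_positions, percentage_2018, height=barwidth, label='2018')
# Plot the second group of data
ax.barh(y_positions + barwidth, percentage_2020, height=barwidth, label='2020')
ax.set_yticks(ticks=y_positions + 0.5 * barwidth)  # Set the ticks in the center of the two bars
ax.set_yticklabels(language)
ax.set_xticks(range(0, 100, 10))  # Set the ticks on x axis by 10% spacing
ax.xaxis.grid(True)
# Add the value of each bar just to the right of its end
for xy in zip(percentage_2018, y_positions):
    ax.annotate(text="%s" % xy[0], xy=xy, xytext=(5, -4), textcoords='offset points')
for xy in zip(percentage_2020, y_positions + barwidth):
    ax.annotate(text="%s" % xy[0], xy=xy, xytext=(5, -4), textcoords='offset points')
ax.legend(loc='best')
ax.set_ylabel('language')
ax.set_xlabel('Percentage (%)')
fig.suptitle("Comparison of Most Used Programming Languages in 2018 and 2020")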

### 3.3 (Advanced) Matplotlib Exercise

In [None]:
# TODO: Draw 4 graphs in total in one figure
# Hint: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html
# You can see great example code from the url above

fig, axes = plt.subplots(2, 2)  # this is where you need to use subplots

# TODO: Scatter Graph (Upper left)
x = np.random.randn(50)
y = np.random.randn(50)
colors = np.random.randint(0, 100, 50)
sizes = 500 * np.pi * np.random.rand(50) ** 2
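# Use the colors and sizes generated above; alpha keeps overlapping markers visible
axes[0, 0].scatter(x, y, c=colors, s=sizes, alpha=0.5)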


# TODO: Bar Graph (Upper right)
x = np.arange(10)
axes[0, 1].bar(x, x ** 2)

# TODO: Multi-Bar Graph (Lower left) -> Understand how it works!!
x = np.random.rand(3)
y = np.random.rand(3)
z = np.random.rand(3)
data = [x, y, z]

x_ax = np.arange(3)
for i in x_ax:
    axes[1, 0].bar(x_ax, data[i], bottom=np.sum(data[:i], axis=0))
axes[1, 0].set_xticks(x_ax)
axes[1, 0].set_xticklabels(['A', 'B', 'C'])

# TODO: Histogram Graph (Lower right)
data = np.random.randn(1000)
axes[1,1].hist(data, bins=40)

# TODO: Either show the image or save it to png file
plt.show()
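
The multi-bar graph in the lower left is a stacked bar chart: each layer is drawn with `bottom` set to the sum of all layers beneath it. The sketch below makes that explicit with `np.cumsum` on made-up layer heights (an assumption, chosen so the numbers are easy to check by hand); the `Agg` backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up layer heights for three categories A, B, C (one row per layer)
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 9.0]])

# bottoms[i] is the sum of all layers below layer i:
# the running total up to and including layer i, minus layer i itself
bottoms = np.cumsum(data, axis=0) - data

fig, ax = plt.subplots()
x_ax = np.arange(data.shape[1])
for layer, bottom in zip(data, bottoms):
    ax.bar(x_ax, layer, bottom=bottom)  # each layer starts where the previous ones end
ax.set_xticks(x_ax)
ax.set_xticklabels(['A', 'B', 'C'])
print(bottoms)
```

This gives the same bottoms as `np.sum(data[:i], axis=0)` in the exercise, computed for all layers at once.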

### What to do next?
The websites below are helpful for further study of matplotlib:
- [DataCamp Matplotlib Tutorial: Python Plotting](https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python)
- [Matplotlib official website](https://matplotlib.org/#)
- [Python Plotting With Matplotlib (Guide)](https://realpython.com/python-matplotlib-guide/)
- [Different plotting using pandas and matplotlib](https://www.geeksforgeeks.org/different-plotting-using-pandas-and-matplotlib/)
- [Matplotlib tutorial for beginner](https://github.com/rougier/matplotlib-tutorial)