# Week 1 - Exploratory data analysis

## 4. Matplotlib

As mentioned in the introduction, the goal of exploratory data analysis (EDA) is to **analyze**, **summarize** and **visualize** data. The libraries we have just learned (NumPy and Pandas) are not dedicated visualization tools, which brings us to another common library: Matplotlib.

### 4.1 Basics

First, install the Matplotlib library by running the cell below:

In [None]:
!pip install matplotlib

In the following cell, we import the library, while the ```%matplotlib inline``` command ensures that the produced graphs are displayed below each cell.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

To demonstrate the basics of Matplotlib's plotting functionality, let's say we want to plot the **cosine** function between $0$ and $2\pi$.

In [None]:
import numpy as np

x = np.arange(0, 2 * np.pi, 0.01)
y = np.cos(x)

Plotting and formatting the graph of the cosine function is relatively straightforward:

In [None]:
#Plotting cosine function
plt.plot(x, y)

#Naming x and y axis
plt.xlabel('x')
plt.ylabel('y')

#Naming the whole graph
plt.title('y = cos(x)')

#Including the grid (for better analysis)
plt.grid()

#Displaying the graph
plt.show()

Plotting multiple graphs is quite similar. To demonstrate this, let's say we also want to look at the **sine** function over the same x interval.

In [None]:
y_2 = np.sin(x)
#Plotting multiple graphs
plt.plot(x, y)
plt.plot(x, y_2)

plt.xlabel('x')
plt.ylabel('y')
plt.grid()

plt.legend(['y = cos(x)', 'y = sin(x)'])
plt.title('y = cos(x) and y = sin(x)')

plt.show()
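
As an alternative sketch, each curve can carry its own `label`, so the legend stays correct even if the order of the `plt.plot()` calls changes. The dashed line style and the variable names `line_cos` and `line_sin` are illustrative choices, not part of the original example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 2 * np.pi, 0.01)

# Attaching a label to each curve; plt.legend() picks the labels up automatically
line_cos, = plt.plot(x, np.cos(x), label='y = cos(x)')
line_sin, = plt.plot(x, np.sin(x), linestyle='--', label='y = sin(x)')

plt.xlabel('x')
plt.ylabel('y')
plt.grid()
plt.legend()
plt.title('y = cos(x) and y = sin(x)')

plt.show()
```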

### 4.2 Types of plots

After seeing the basic matplotlib visualization capabilities, we will continue to analyze different kinds of plots:
- **Subplots**
- **Bar graph**
- **Box plot**
- **Scatter plot**
- **Heatmaps**

### Subplots

In the previous example, we plotted the sine and cosine functions on the same figure. Frequently, however, such a visualization can be confusing (especially when the plotted values differ by orders of magnitude), so we use **subplots** to analyze and **compare** two or more figures side by side.

Let's say we want to analyze the previous cosine function and a modified sine function, $y = x\sin(2x)$. The following cell creates one subplot for each function.

In [None]:
#Creating functions

y_1 = np.cos(x)
y_2 = x * np.sin(2 * x)

# .subplots(num_rows, num_cols, figsize = (fig_width, fig_height))
fig, ax = plt.subplots(1, 2, figsize = (40, 10))

ax[0].plot(x, y_1)
ax[0].set_xlabel('x', fontsize = 25)
ax[0].set_ylabel('y', fontsize = 25)
ax[0].set_title('y = cos(x)', fontsize = 25)

ax[1].plot(x, y_2)
ax[1].set_xlabel('x', fontsize = 25)
ax[1].set_ylabel('y', fontsize = 25)
ax[1].set_title('y = xsin(2x)', fontsize = 25)
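
For grids with more than one row, `plt.subplots()` returns a 2D NumPy array of axes. The following is a minimal sketch, assuming a 2 x 2 layout; the four functions, the `figsize` value and the `tight_layout()` call are choices made for this example only.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 2 * np.pi, 0.01)

# Illustrative functions, keyed by the title of each subplot
functions = {
    'y = cos(x)': np.cos(x),
    'y = sin(x)': np.sin(x),
    'y = xsin(2x)': x * np.sin(2 * x),
    'y = cos(x)^2': np.cos(x) ** 2,
}

# With 2 rows and 2 columns, ax is a 2D array of shape (2, 2)
fig, ax = plt.subplots(2, 2, figsize=(12, 8))

# ax.flat iterates over the axes row by row
for axis, (title, y) in zip(ax.flat, functions.items()):
    axis.plot(x, y)
    axis.set_title(title)
    axis.grid()

# Adjusting the spacing so that titles and tick labels do not overlap
fig.tight_layout()
plt.show()
```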

### Bar graph

Let's say we want to analyze the worldwide desktop OS market share in 2021 (https://gs.statcounter.com/os-market-share/desktop/worldwide). A bar chart is a natural choice for comparing categories like these, so let's start by visualizing the 2021 data:

In [None]:
prog_lang = ['Windows', 'OS X', 'Unknown', 'Chrome OS', 'Linux']
percentage = [75.39, 15.93, 3.75, 2.57, 2.35]

#computing bar graph
plt.bar(prog_lang, percentage)

#setting x label
plt.xlabel('Desktop OS')

#setting y label
plt.ylabel('Market percentage (%)')

#setting title
plt.title('Desktop OS market distribution')

plt.show()

Before moving on, we might want to make a few adjustments to the current graph:
- Reduce the bar width
- Rotate the x-axis labels of the bars slightly

The first adjustment can be made by setting the `width` parameter of the `plt.bar()` function to a value smaller than the default of 0.8.

To rotate the bar labels, we use the `plt.xticks()` function and set its `rotation` parameter to the desired angle.

The updated code:

In [None]:
#decreasing the bar width
bar_width = 0.7

plt.bar(prog_lang, percentage, width = bar_width)

plt.xlabel('Desktop OS')
plt.ylabel('Market percentage (%)')
plt.title('Desktop OS market distribution')

#rotating the x axis bar titles at 45 degrees
plt.xticks(rotation = 45)

plt.show()
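
When the exact percentages matter, the value can be written above each bar. The following is a minimal sketch using `plt.text()`. The name `os_names`, the vertical offset of 1 and the upper `ylim` bound are assumptions made for this example.

```python
import matplotlib.pyplot as plt

os_names = ['Windows', 'OS X', 'Unknown', 'Chrome OS', 'Linux']
percentage = [75.39, 15.93, 3.75, 2.57, 2.35]

# plt.bar returns a container with one rectangle per bar
bars = plt.bar(os_names, percentage, width=0.7)

# Writing the value above each bar, centered horizontally
for bar, value in zip(bars, percentage):
    x_center = bar.get_x() + bar.get_width() / 2
    plt.text(x_center, bar.get_height() + 1, f'{value:.2f}',
             ha='center', va='bottom')

plt.xlabel('Desktop OS')
plt.ylabel('Market percentage (%)')
plt.title('Desktop OS market distribution')
plt.xticks(rotation=45)

# Leaving room for the label above the tallest bar
plt.ylim(0, 85)

plt.show()
```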

### Box plot

The main goal of a box plot is to summarize the **distribution of numerical data** divided into groups: for each group it shows the median, the quartiles and the spread of the values. This makes it useful for **detecting outliers** in each category.

For demonstration purposes, we will create a dummy dataset containing 4 normally distributed categories.

In [None]:
#setting random seed
np.random.seed(10)

data_1 = np.random.normal(80, 30, 200)
data_2 = np.random.normal(90, 20, 200)
data_3 = np.random.normal(10, 100, 200)
data_4 = np.random.normal(25, 75, 200)

data = [data_1, data_2, data_3, data_4]

Creating a box plot is quite straightforward:

In [None]:
plt.boxplot(data)
plt.show()
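
To connect the picture to numbers, the sketch below computes the quartiles of each group and counts the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, which is the rule Matplotlib uses by default (`whis = 1.5`) to draw points as outliers. The helper name `count_outliers` is made up for this example, and the data is regenerated with the same seed so that the block runs on its own.

```python
import numpy as np

np.random.seed(10)
data_1 = np.random.normal(80, 30, 200)
data_2 = np.random.normal(90, 20, 200)
data_3 = np.random.normal(10, 100, 200)
data_4 = np.random.normal(25, 75, 200)
data = [data_1, data_2, data_3, data_4]


def count_outliers(values, whis=1.5):
    """Count points beyond whis * IQR from the first and third quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(np.sum((values < lower) | (values > upper)))


# Printing the summary statistics that the box plot visualizes
for i, values in enumerate(data, start=1):
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    print(f'data_{i}: Q1 = {q1:.1f}, median = {median:.1f}, '
          f'Q3 = {q3:.1f}, outliers = {count_outliers(values)}')
```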

### Scatter plot

Prior to building any ML model, it is important to identify correlations between the variables, as failing to do so might make model training inefficient. Scatter plots are useful for showing the relationship (correlation, outliers) between two variables.

Let's say we want to look at the relationship between the previously created `data_1` and `data_2`. The scatter plot for this dataset can be created in the following way:

In [None]:
plt.scatter(data_1, data_2)
plt.xlabel('data_1')
plt.ylabel('data_2')
plt.show()
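
The visual impression from a scatter plot can be checked numerically with the Pearson correlation coefficient. The following is a minimal sketch using `np.corrcoef()`; the data is regenerated with the same seed so that the block is self-contained.

```python
import numpy as np

np.random.seed(10)
data_1 = np.random.normal(80, 30, 200)
data_2 = np.random.normal(90, 20, 200)

# np.corrcoef returns a 2 x 2 matrix; the off-diagonal entry is the
# correlation between data_1 and data_2
corr_matrix = np.corrcoef(data_1, data_2)
r = corr_matrix[0, 1]

print(f'Pearson correlation: {r:.3f}')
```

A value close to 0 indicates no linear relationship, while values near 1 or -1 indicate a strong one.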

Since the two variables were drawn independently at random, the scatter plot does not reveal any meaningful relationship. One thing we can still explore is customization: as with the previous plot types, the `plt.scatter()` function has built-in parameters for that.

In [None]:
#defining the colors variable
colors = range(len(data_1))

#c here is responsible for the color change, while alpha controls the opacity
plt.scatter(data_1, data_2, c = colors, alpha = 0.5)
plt.xlabel('data_1')
plt.ylabel('data_2')
plt.show()
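
Because `c` is a sequence of numbers here, Matplotlib maps each value to a color through a colormap. The sketch below makes that mapping visible with a colorbar. The explicit `cmap` choice, the colorbar label and the variable name `points` are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(10)
data_1 = np.random.normal(80, 30, 200)
data_2 = np.random.normal(90, 20, 200)

# Coloring each point by its position in the array
colors = np.arange(len(data_1))

points = plt.scatter(data_1, data_2, c=colors, cmap='viridis', alpha=0.5)

# The colorbar shows which color corresponds to which index
plt.colorbar(points, label='sample index')

plt.xlabel('data_1')
plt.ylabel('data_2')
plt.show()
```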

### Heatmaps

So far, we have learned different visualization methods for comparing multiple independent variables against one dependent variable (for example, the desktop OS market distribution). However, when we want to compare multiple independent variables against multiple dependent variables, plotting numerous bar plots gets messy quickly. In such cases, we use **heatmaps**.

For this example, let's compare different countries by their fruit harvest (tons / year):

In [None]:
#Preparing the data
fruits = ['mango', 'watermelon', 'pineapple', 'strawberry', 'cherry', 'orange']

countries = ['India', 'Australia', 'USA', 'Canada', 'Brazil', 'Germany', 'Spain']

harvest = np.array([[0.8, 2.4, 2.5, 3.9, 0.0, 4.0, 0.0],
                    [2.4, 0.0, 4.0, 1.0, 2.7, 0.0, 0.0],
                    [1.1, 2.4, 0.8, 4.3, 1.9, 4.4, 0.0],
                    [0.6, 0.0, 0.3, 0.0, 3.1, 0.0, 0.0],
                    [0.7, 1.7, 0.6, 2.6, 2.2, 6.2, 0.0],
                    [0.1, 2.0, 0.0, 1.4, 0.0, 1.9, 6.3]])

In [None]:
#Creating subplot
fig, ax = plt.subplots()
img = ax.imshow(harvest)

#Arranging the labels
ax.set_xticks(np.arange(len(countries)))
ax.set_yticks(np.arange(len(fruits)))

#Naming those labels
ax.set_xticklabels(countries)
ax.set_yticklabels(fruits)

#Setting the heatmap title
ax.set_title('Growth of Fruits in Countries (tons / year)')
plt.show()

The heatmap shows that Spain has the largest orange harvest, while Germany is the leader in cherry harvest. However, from the current heatmap we cannot see the exact numerical differences, which is important because the coloring is not distinct in all places.

In the following cell, we annotate each square as well as make a few adjustments to the layout of the graph.

In [None]:
#Creating subplot
fig, ax = plt.subplots()
img = ax.imshow(harvest)

#Arranging the labels
ax.set_xticks(np.arange(len(countries)))
ax.set_yticks(np.arange(len(fruits)))

#Naming those labels and rotating at 30 degrees
ax.set_xticklabels(countries, rotation = 30)
ax.set_yticklabels(fruits)

#Annotating squares in the heatmap
for i in range(len(fruits)):
    for j in range(len(countries)):
        # annotating numeric value at the center of each square
        ax.text(j, i, harvest[i, j], ha = 'center', va = 'center',
                color = 'w')

#Setting the heatmap title
ax.set_title('Growth of Fruits in Countries (tons / year)')
plt.show()
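
Two common refinements are a colorbar, which shows the value scale, and annotation colors that adapt to the brightness of each cell. The sketch below assumes the default colormap (dark for low values, bright for high values) and switches from white to black text at half of the maximum value; that threshold and the colorbar label are arbitrary choices for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

fruits = ['mango', 'watermelon', 'pineapple', 'strawberry', 'cherry', 'orange']
countries = ['India', 'Australia', 'USA', 'Canada', 'Brazil', 'Germany', 'Spain']

harvest = np.array([[0.8, 2.4, 2.5, 3.9, 0.0, 4.0, 0.0],
                    [2.4, 0.0, 4.0, 1.0, 2.7, 0.0, 0.0],
                    [1.1, 2.4, 0.8, 4.3, 1.9, 4.4, 0.0],
                    [0.6, 0.0, 0.3, 0.0, 3.1, 0.0, 0.0],
                    [0.7, 1.7, 0.6, 2.6, 2.2, 6.2, 0.0],
                    [0.1, 2.0, 0.0, 1.4, 0.0, 1.9, 6.3]])

fig, ax = plt.subplots()
img = ax.imshow(harvest)

# Adding a colorbar that shows the value scale
fig.colorbar(img, ax=ax, label='harvest (tons / year)')

ax.set_xticks(np.arange(len(countries)))
ax.set_yticks(np.arange(len(fruits)))
ax.set_xticklabels(countries, rotation=30)
ax.set_yticklabels(fruits)

# White text on dark (low) cells, black text on bright (high) cells
threshold = harvest.max() / 2
for i in range(len(fruits)):
    for j in range(len(countries)):
        text_color = 'w' if harvest[i, j] < threshold else 'k'
        ax.text(j, i, harvest[i, j], ha='center', va='center',
                color=text_color)

ax.set_title('Growth of Fruits in Countries (tons / year)')
plt.show()
```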