<h1 align="center">Python for DATA SCIENCE</h1><Br/>
<img src="https://goo.gl/ZKX5FF" style="width:15%; float:centre"><Br/>
<h2 align="center">Dr Mazen Gabriel Alhrishy</h2>
<h5 align="center"><i>MAZEN.ALHRISHY@GMAIL.COM</i></h5><Br/>

<table width=25%>
    <tr>
        <td>
            <a href="https://goo.gl/BTtR3C"><img src="https://goo.gl/rMsKok"></a>
        </td>
        <td>
            <a href="https://goo.gl/XaRDbH"><img src="https://goo.gl/KyMZcj"></a>
        </td>
        <td>
            <a href="https://goo.gl/9uCqS6"><img src="https://goo.gl/a8gcDK"></a>
        </td>
        <td>
            <a href="https://goo.gl/bnt2EL"><img src="https://goo.gl/1rT18x"></a>
        </td>
        <td>
            <a href="https://goo.gl/VmfU3S"><img src="https://goo.gl/WFFkxn"></a>
        </td>
    </tr>
</table>

***
# 7- Matplotlib for Data Representation

> ## [I- Introduction](#I)
> ## [II- Parts of a figure](#II)
> ## [III- Matplotlib hierarchy](#III)
> ## [IV- Plotting functions](#IV)
> ## [V- Customizing matplotlib](#V)
> ## [VI- Saving plots](#VI)

> ### [- Exercises](#exercises)
> ### [- Solutions](#solutions)

***

## I- Introduction <a id='I'></a>

> ### [1- History](#I-1)
> ### [2- Installation](#I-2)
> ### [3- Motivation](#I-3)

### 1- History <a id='I-1'></a>

* In 2003, **John D. Hunter** was developing an EEG analysis application in MATLAB. As the application grew in complexity, he decided to start over in Python. However, he was having difficulty finding a 2D plotting package for Python

<img src="https://goo.gl/rk5Vhn" style="width:30%; border-radius:50%; float:left; padding:10px 30px 10px 30px;"/>
<Br/>
<Br/>
"Finding no package that suited me just right, I did what any self-respecting Python programmer would do: rolled up my sleeves and dived in"
<Br/>
— John D. Hunter (1968-2012), the original author of Matplotlib
<Br/>    
<Br/>
<Br/>
* The result was a Python extension to emulate the **MAT**LAB graphics commands, **mat**plotlib, with the philosophy that you should be able to create simple plots with just a few commands, or just one!


* Matplotlib is written primarily in pure Python, but it also makes heavy use of NumPy and other extension code to provide good performance even for large arrays


* __[Matplotlib website](https://matplotlib.org/index.html)__

### 2- Installation <a id='I-2'></a>

* Matplotlib has a large number of dependencies; however, the Anaconda Python distribution already ships with Matplotlib

* If you've created a basic virtual environment, you can get Matplotlib using conda:

In [None]:
! conda install matplotlib -y

* To verify the package was installed

In [None]:
! conda list

* To enable interactive figures in a live Jupyter notebook session (you don't need this in your Python script!)

In [None]:
import matplotlib
matplotlib.use('nbagg')

In [None]:
import numpy as np

* To import into a Python script

In [None]:
import matplotlib.pyplot as plt

### 3- Motivation <a id='I-3'></a>

* Exploring data and reporting insight


* The __[GapMinder](https://www.gapminder.org/downloads/updated-gapminder-world-poster-2015/)__ image below shows the Life Expectancy and Income of 182 nations in the year 2015. Each bubble is a country, and the size of the bubble is its population. Each region also has a different color

<img src="https://goo.gl/Xk6bLS" width="90%" height="90%"/>

* At the end you will do something similar in matplotlib!

## II- Parts of a figure <a id="II"></a>

> ### [1- Figure](#II-1)
> ### [2- Axes](#II-2)
> ### [3- Axis](#II-3)
> ### [4- Artist](#II-4)

The following image shows the anatomy of a matplotlib figure. All objects are encircled and labelled
(source: matplotlib.org)

<img src="https://goo.gl/mLHZcD" width="65%"/>

### 1- Figure <a id='II-1'></a>

* This refers to the whole figure that everything is drawn on (i.e. top level container for all plot elements)


* The easiest way to create a new Figure object is with **figure()**

In [None]:
fig = plt.figure()

print(type(fig))

* The Figure object will not show until you ask for it using the **show()** method

In [None]:
fig.show()

* You can control the size of the Figure by passing the **figsize=(width, height)** argument (given in inches) to **figure()**

In [None]:
fig = plt.figure(figsize=(2, 2))
fig.show()
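
* As a minimal sketch of how the size can be read back and changed after creation: the `dpi=100` value and the variable names below are assumptions chosen for illustration; **get_size_inches()** and **set_size_inches()** are methods of the Figure object

```python
import matplotlib.pyplot as plt

# create a 4 x 3 inch Figure; dpi=100 is an illustrative value, not a requirement
fig = plt.figure(figsize=(4, 3), dpi=100)

# read the size back (returned as a numpy array of [width, height] in inches)
width_in, height_in = fig.get_size_inches()
print(width_in, height_in)

# the size can also be changed after the Figure has been created
fig.set_size_inches(6, 2)
print(fig.get_size_inches())

plt.close(fig)  # release the Figure once we are done with it
```

The size in pixels of a saved or displayed Figure follows from the size in inches multiplied by the Figure's dpi.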

> The Figure class docs can be found __[here](https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure)__

### 2- Axes <a id='II-2'></a>

* This refers to the region of the figure with the data space (what you think of as a 'plot' inside the inner box)


* A figure can contain many Axes (at least one to be useful). A given Axes object can only belong to one Figure object


* In fact, the Figure we've created does not yet have an Axes object. To add one, we can call the **add_axes([left, bottom, width, height])** method, which also returns the Axes object itself

In [None]:
fig = plt.figure(figsize=(5, 5))

ax = fig.add_axes([0.125,0.11,0.775,0.77])
print(type(ax))

fig.show()

* However, **add_axes([left, bottom, width, height])** is rarely used, because the arguments have to be given as fractions of the figure width and height to place the Axes at a predefined position (who wants to do that!)


* Instead, the **add_subplot(nrows, ncols, index)** method is used, which automatically places an Axes on a grid specified by **nrows X ncols** at the given **index**

In [None]:
fig = plt.figure(figsize=(5, 5))

ax1 = fig.add_subplot(2, 2, 1)  # place axes at index=1 on a grid of 2X2 and return axes
ax2 = fig.add_subplot(2, 2, 2)  # place axes at index=2
ax3 = fig.add_subplot(2, 2, 3)  # place axes at index=3
ax4 = fig.add_subplot(2, 2, 4)  # place axes at index=4

fig.show()

* **subplots(nrows, ncols)** is a shorthand way to create a Figure and add one or more Axes on a grid at the same time. It returns a tuple of (Figure object, numpy array of Axes objects). By default the array is squeezed, so a single row or column gives a 1-D array, and a 1 X 1 grid gives a single Axes object

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
ax0, ax1 = axes
fig.show()
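
* The shape of the value returned by **subplots()** depends on the grid. The sketch below illustrates this; the variable names are illustrative, and `squeeze=False` is an optional argument of **plt.subplots()** that always returns a 2-D array

```python
import matplotlib.pyplot as plt
import numpy as np

# a 2 x 2 grid gives a 2-D array of Axes, indexed as [row, column]
fig_grid, axes_grid = plt.subplots(2, 2, figsize=(5, 5))
print(axes_grid.shape)
top_right = axes_grid[0, 1]

# a single row gives a 1-D array, which is why it can be unpacked directly
fig_row, axes_row = plt.subplots(1, 2)
print(axes_row.shape)

# a 1 x 1 grid gives a single Axes object rather than an array
fig_one, ax_one = plt.subplots()
print(isinstance(ax_one, np.ndarray))

# squeeze=False always returns a 2-D array, whatever the grid
fig_sq, axes_sq = plt.subplots(1, 2, squeeze=False)
print(axes_sq.shape)

plt.close('all')  # close the Figures created by this sketch
```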

* Each Axes contains two (or three in the case of 3D) Axis objects (be aware of the difference between Axes and Axis)


* Each Axes has:
    - A title, that can be set via **set_title()**
    - An x-label, that can be set via **set_xlabel()**
    - A y-label, that can be set via **set_ylabel()**
    
    
* **set(title='', xlabel='', ylabel='')** is a shorthand way to set the title and labels

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 4))  # axes is a 1-D numpy array of 2 Axes

ax0, ax1 = axes  # unpack array

ax0.set_title('Axes 1 title')
ax0.set_ylabel('Y-Axis 1 label')
ax0.set_xlabel('X-Axis 1 label')

# using shorthand set
ax1.set(title='Axes 2 title', 
        ylabel='Y-Axis 2 label', 
        xlabel='X-Axis 2 label')

fig.show()

> The Axes class docs can be found __[here](https://matplotlib.org/api/axes_api.html)__

### 3- Axis <a id='II-3'></a>

* This refers to the number-line-like object. Axis objects take care of: 
    - Setting the data limits 
    - Generating the ticks (the marks on the Axis), and setting ticks location
    - Generating the ticklabels (strings labeling the ticks), and formatting ticklabels
    
    
* However, the Axes object also has methods to control: 
    - data limits, via **set_xlim()** and **set_ylim()**
    - ticks, via **set_xticks()** and **set_yticks()**
    - tick labels, via **set_xticklabels()** and **set_yticklabels()**
   

Therefore, unless we need finer control over an individual Axis, we will only use Axes methods to control the Axis objects


* **set(xlim=[], xticks=[], xticklabels=[], ylim=[], yticks=[], yticklabels=[])** is a shorthand way to set the data limits, and ticks and tick-labels

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(4, 4), tight_layout=True)  # see what happens without tight_layout=True

ax.set_title('Axes title')
ax.set_ylabel('Y-Axis label')
ax.set_xlabel('X-Axis label')

ax.set_xlim([1, 5])
ax.set_xticks([1, 3, 5])
ax.set_xticklabels(['1K', '3K', '5K'])

# using shorthand set
ax.set(ylim=[1, 100], 
       yticks=[20, 40, 60, 80], 
       yticklabels=['20Y', '40Y', '60Y', '80Y'])

fig.show()
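
* For the cases where finer control is needed, each Axes exposes its Axis objects as **ax.xaxis** and **ax.yaxis**. The sketch below is one possible illustration, not the only way to do it: it uses `MultipleLocator` and `FuncFormatter` from `matplotlib.ticker`, and the 'K' suffix and variable names are made up for the example

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter, MultipleLocator

fig, ax = plt.subplots(figsize=(4, 4))
ax.set_xlim(0, 5)

# place a major tick at every multiple of 1 on the x Axis
ax.xaxis.set_major_locator(MultipleLocator(1))

# format every tick value with a 'K' suffix (illustrative formatting rule)
ax.xaxis.set_major_formatter(FuncFormatter(lambda value, pos: f'{value:g}K'))

# tick labels are only generated when the Figure is drawn
fig.canvas.draw()
labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(labels)

plt.close(fig)
```

Unlike **set_xticklabels()**, a formatter keeps the labels in sync with the ticks when the data limits change.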

> The Axis class docs can be found __[here](https://matplotlib.org/api/axis_api.html)__

### 4- Artist <a id='II-4'></a>

* Everything you can see on the figure is an artist (even the Figure, Axes, and Axis objects themselves!)

* Most Artists belong to an Axes; such an Artist cannot be shared by multiple Axes, or moved from one to another

* For example, the plotted Line2D object below is an Artist

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(4, 4), tight_layout=True)

x = np.linspace(0, 2, 100)
y = range(len(x))

line = ax.plot(x, y, label='linear')  # plot y versus x as a line

ax.set_title('Linear plot')
ax.set_xlabel('x')
ax.set_ylabel('y')

fig.legend()
fig.show()
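
* A minimal sketch to check these statements, assuming the base class `matplotlib.artist.Artist` and the `Line2D` class from `matplotlib.lines`; the variable names are illustrative

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.artist import Artist
from matplotlib.lines import Line2D

fig, ax = plt.subplots()
x = np.linspace(0, 2, 100)

# plot() returns a list of Line2D objects (one per plotted line)
lines = ax.plot(x, x, label='linear')
line = lines[0]

# the line is a Line2D, which is itself an Artist
print(isinstance(line, Line2D), isinstance(line, Artist))

# the Figure, Axes, and Axis objects are Artists too
print(isinstance(fig, Artist), isinstance(ax, Artist), isinstance(ax.xaxis, Artist))

# the line belongs to exactly the Axes it was plotted on
print(line.axes is ax)

plt.close(fig)
```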

## III- Matplotlib hierarchy <a id='III'></a>

> ### [1- The pyplot module](#III-1)
> ### [2- General concept](#III-2)
> ### [3- Coding style](#III-3)

### 1- The pyplot module <a id='III-1'></a>

* Why did we use **pyplot** (imported as plt) functions in some calls (e.g. **plt.subplots()**), while we used object methods in other calls (e.g. **ax.set_title()**)?


* In fact, we could have only used **pyplot** functions to create the same previous plot!

In [None]:
plt.close('all')  # first we should close any existing Figures (why?)

In [None]:
x = np.linspace(0, 2, 100)
y = range(len(x))

plt.plot(x, y, label='linear')

plt.title('Linear plot')
plt.xlabel('x')
plt.ylabel('y')

plt.legend()
plt.show()

* What has just happened!? We didn't create a Figure object, nor did we add an Axes to it! We also set all the Axes elements (title, x-label, and y-label) using **pyplot** functions directly, not the Axes methods

* It's all about hierarchy!

> The pyplot module docs can be found __[here](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html)__

### 2- General concept <a id='III-2'></a>


> The purpose of a plotting package is to assist you in visualizing your data as easily as possible, with all the necessary control – that is, by using relatively high-level commands most of the time, and still have the ability to use the low-level commands when needed. Therefore, everything in matplotlib is organized in a hierarchy

#### High-level
* At the top level, **pyplot** provides an environment where simple functions are used to add elements to the **current Axes** in the **current Figure** (known as the “state-machine environment”). The state-machine **implicitly and automatically creates Figures and Axes** to achieve the desired plot

>**plt.plot()** automatically creates new Figure and Axes objects (if they don't already exist), and adds the plot element to the **current Axes**. Any subsequent calls using pyplot will also affect the **current Axes**\*

\*That's why we had to close any existing Figures before making any calls using pyplot

#### Low-level
* At the next level, **pyplot** is used only for a few functions such as Figure creation, and the user **explicitly creates and keeps track of the Figure and Axes objects** (known as "object-oriented approach"). These Axes objects are then used for most plotting actions

>**plt.subplots()** creates the Figure and the Axes. Any subsequent calls using the created Axes will only affect **that Axes**


#### How are they related?

* A high-level function is the equivalent of calling the low-level function on the **current Axes**
* For example:

> **plt.title()** is the pyplot equivalent of calling **set_title()** on the **current Axes** <br>
**plt.xlim()** is the pyplot equivalent of calling **set_xlim()** on the **current Axes** <br>
**plt.xticks()** is the pyplot equivalent of calling **set_xticks()** on the **current Axes** <br>
**plt.xlabel()** is the pyplot equivalent of calling **set_xlabel()** on the **current Axes** <br>
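
* The equivalence can be checked with a short sketch that uses **plt.gca()** and **plt.gcf()** ("get current Axes" and "get current Figure") to get hold of the objects that pyplot created implicitly. The data values and variable names are illustrative

```python
import matplotlib.pyplot as plt

plt.close('all')  # start from a clean state so pyplot creates new objects

# high-level calls: pyplot implicitly creates a Figure and an Axes
plt.plot([0, 1, 2], [0, 1, 4])
plt.title('Set via pyplot')
plt.xlim(0, 2)

# get hold of the objects that pyplot has been working on
ax = plt.gca()
fig = plt.gcf()

# the pyplot calls above changed this Axes
print(ax.get_title())
print(ax.get_xlim())

# a low-level call on the same Axes has the same kind of effect
ax.set_xlabel('x')
print(ax.get_xlabel())

# the current Axes lives in the current Figure
print(ax.figure is fig)

plt.close(fig)
```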

### 3- Coding style <a id='III-3'></a>

* How do you choose which coding style to use?

    - The "object-oriented" approach is more explicit and offers fine control, but is more verbose (i.e. extra typing)

    - The "state-machine" approach is less verbose, but is less explicit and does not offer the same level of control


> It is up to you to choose either approach as long as your style is consistent. However, it is suggested (and common) to use pyplot to create the Figures and Axes, and then the "object-oriented" approach for subsequent plotting

***
## IV- Plotting functions <a id="IV"></a>

> ### [1- Types of inputs](#IV-1)
> ### [2- Examples](#IV-2)

### 1- Types of inputs <a id="IV-1"></a>

* All plotting functions expect **numpy.array** as input


* Classes that are ‘array-like’ may or may not work as intended. Therefore, it is best to convert these to **numpy.array** objects prior to plotting
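
* A minimal sketch of such a conversion, assuming the data arrives as a plain list of rows. The `readings` list and its values are hypothetical, made up for this example

```python
import matplotlib.pyplot as plt
import numpy as np

# hypothetical array-like input: a list of [time, value] rows
readings = [[0, 1.0], [1, 2.5], [2, 4.0]]

# convert to a numpy array before plotting
data = np.asarray(readings, dtype=float)

# slice out the columns as 1-D arrays
time = data[:, 0]
value = data[:, 1]

fig, ax = plt.subplots()
lines = ax.plot(time, value)
ax.set(xlabel='time', ylabel='value')

plt.close(fig)
```

**np.asarray()** does not copy the data if the input is already a numpy array of the requested type, so it is cheap to call defensively.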

### 2- Examples <a id="IV-2"></a>

* There are a large number of plotting functions available in matplotlib which you can view samples of __[here](https://matplotlib.org/tutorials/introductory/sample_plots.html#sphx-glr-tutorials-introductory-sample-plots-py)__


* Some of the most common functions are listed below

#### Line plot function __[docs](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html)__

In [None]:
import numpy as np

# example data
x = np.arange(0.0, 2.0, 0.01)
y = 1 + np.sin(2 * np.pi * x)

fig, ax = plt.subplots()  # same as plt.subplots(1, 1)
line = ax.plot(x, y)

ax.set(xlabel='time (s)', 
       ylabel='voltage (mV)', 
       title='About as simple as it gets, folks')

ax.grid()  # show a grid
plt.show()

#### Scatter plot function (__[docs](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html)__) 

In [None]:
# example data
x = np.arange(0.0, 2.0, 0.01)
y = 1 + np.sin(2 * np.pi * x)

fig, ax = plt.subplots()  # same as plt.subplots(1, 1)
ax.scatter(x, y)

ax.set(xlabel='time (s)', 
       ylabel='voltage (mV)', 
       title='About as simple as it gets, folks')

ax.grid()  # show a grid
plt.show()

#### Histogram function (__[docs](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html)__)

In [None]:
import numpy as np

np.random.seed(19680801)

# example data
mu = 100  # mean of distribution
sigma = 15  # standard deviation of distribution
x = mu + sigma * np.random.randn(437)

fig, ax = plt.subplots(tight_layout=True)

# the histogram of the data
ax.hist(x, bins=50, density=True)  # density=True normalises the histogram to a probability density

ax.set_xlabel('Smarts')
ax.set_ylabel('Probability density')
ax.set_title(r'Histogram of IQ: $\mu=100$, $\sigma=15$')

plt.show()

#### Bar charts function (__[docs](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.bar.html)__)

In [None]:
import numpy as np

# example data
means_men = np.array([20, 35, 30, 35, 27])
std_men = np.array([2, 3, 4, 1, 2])

n_groups = len(means_men)
index = np.arange(n_groups)

fig, ax = plt.subplots(tight_layout=True)

ax.bar(x=index, 
       height=means_men,
       width=0.35,
       alpha=0.4, 
       color='r',
       yerr=std_men, 
       label='Men')

ax.set(title='Scores by group', xlabel='Group', ylabel='Scores')
ax.set(xticks=index, xticklabels=['A', 'B', 'C', 'D', 'E'])

ax.legend()
plt.show()

#### Pie chart function (__[docs](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pie.html)__)

In [None]:
# example data
labels = ['Frogs', 'Hogs', 'Dogs', 'Logs']
sizes = np.array([15, 30, 45, 10])
explode = [0, 0.1, 0, 0]  # only "explode" the 2nd slice (i.e. 'Hogs')

fig, ax = plt.subplots()
ax.pie(x=sizes, 
       explode=explode, 
       labels=labels, 
       autopct='%1.1f%%',
       shadow=True, 
       startangle=90)

ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle

plt.show()

#### - Exercises <a id='exercises'></a>
> Modified from dataCamp.com

The pickle file ./Data_Samples/GapMinder_2007.p includes four lists for 142 countries in this order:
    - gdp_cap: the GDP per capita for each country expressed in US Dollars 
    - life_exp: the life expectancy for each country expressed in years
    - pop: the population for each country expressed in millions of people
    - col: the colour representation for each country 

1. Read the pickle file into these 4 lists, then convert them into 4 numpy arrays

2. To see how life expectancy in different countries is distributed, plot a histogram of the values in life_exp using 20 bins

3. Plot a scatter chart, with gdp_cap on the x-axis, and life_exp on the y-axis. Set the: 
    - title to 'World Development in 2007'
    - xlabel to 'GDP per Capita [in USD]'
    - ylabel to 'Life Expectancy [in years]'
    - xticks values to [1000,10000,100000]
    - xticks labels to ['1k','10k','100k']

4. Is there a correlation between GDP and life expectancy? For this you need to display the GDP on a logarithmic scale

5. Make the size of the scatter dots correspond to the population by setting the size argument to population

6. Double the values in the population array to double the scatter dots size

7. Add colours to the scatter dots by setting the colour argument to the colour array

8. Change the opacity of the scatter dots by setting the alpha argument to 0.5

#### - Solutions <a id='solutions'></a>

In [None]:
import matplotlib
matplotlib.use('nbagg')

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pickle
import os

# read the pickle file into 4 lists
filename = os.path.join('Data_Samples', 'GapMinder_2007.p')

with open(filename, "rb") as f:
    [gdp_cap, life_exp, pop, col] = pickle.load(f)

# convert lists into numpy arrays
np_gdp_cap = np.array(gdp_cap)
np_life_exp = np.array(life_exp)
np_pop = np.array(pop)
np_col = np.array(col)

In [None]:
# 2- life_exp histogram
fig, ax = plt.subplots()
ax.hist(life_exp, 20)
fig.show()

In [None]:
# 3- scatter chart
fig, ax = plt.subplots()
ax.scatter(x=gdp_cap, y=life_exp)

ax.set_xlabel('GDP per Capita [in USD]')
ax.set_ylabel('Life Expectancy [in years]')
ax.set_title('World Development in 2007')
# set tick locations and labels together (works across Matplotlib versions)
ax.set(xticks=[1000, 10000, 100000], xticklabels=['1k', '10k', '100k'])

ax.grid(True)
fig.show()

In [None]:
# 4- display the GDP on a logarithmic scale
fig, ax = plt.subplots()
ax.scatter(x=gdp_cap, y=life_exp)

ax.set_xscale('log')
ax.set_xlabel('GDP per Capita [in USD]')
ax.set_ylabel('Life Expectancy [in years]')
ax.set_title('World Development in 2007')
ax.set(xticks=[1000, 10000, 100000], xticklabels=['1k', '10k', '100k'])

ax.grid(True)
fig.show()

In [None]:
# 5- make the size of the scatter dots correspond to the population
fig, ax = plt.subplots()
ax.scatter(x=gdp_cap, y=life_exp, s=np.array(pop))

ax.set_xscale('log')
ax.set_xlabel('GDP per Capita [in USD]')
ax.set_ylabel('Life Expectancy [in years]')
ax.set_title('World Development in 2007')
ax.set(xticks=[1000, 10000, 100000], xticklabels=['1k', '10k', '100k'])

ax.grid(True)
fig.show()

In [None]:
# 6- double the values in the population array to double the scatter dots size
fig, ax = plt.subplots()
ax.scatter(x=gdp_cap, y=life_exp, s=np.array(pop) * 2)

ax.set_xscale('log')
ax.set_xlabel('GDP per Capita [in USD]')
ax.set_ylabel('Life Expectancy [in years]')
ax.set_title('World Development in 2007')
ax.set(xticks=[1000, 10000, 100000], xticklabels=['1k', '10k', '100k'])

ax.grid(True)
fig.show()

In [None]:
# 7- add colours to the scatter dots
fig, ax = plt.subplots()
ax.scatter(x=gdp_cap, y=life_exp, s=np.array(pop) * 2, c=col)

ax.set_xscale('log')
ax.set_xlabel('GDP per Capita [in USD]')
ax.set_ylabel('Life Expectancy [in years]')
ax.set_title('World Development in 2007')
ax.set(xticks=[1000, 10000, 100000], xticklabels=['1k', '10k', '100k'])

ax.grid(True)
fig.show()

In [None]:
# 8- change the opacity of the scatter dots
fig, ax = plt.subplots()
ax.scatter(x=gdp_cap, y=life_exp, s=np.array(pop) * 2, c=col, alpha=0.5)

ax.set_xscale('log')
ax.set_xlabel('GDP per Capita [in USD]')
ax.set_ylabel('Life Expectancy [in years]')
ax.set_title('World Development in 2007')
ax.set(xticks=[1000, 10000, 100000], xticklabels=['1k', '10k', '100k'])

ax.grid(True)
fig.show()