# Lecture 8: Plotting and diagram visualization

*Matplotlib* is the most popular 2D plotting library in Python. Using matplotlib, you can create pretty much any type of plot.

*Pandas* has **tight integration** with *matplotlib*.

## How to install matplotlib?

If you have Anaconda installed, then matplotlib was already installed together with it.

If you have a standalone Python3 and Jupyter Notebook installation, open a command prompt / terminal and type in:
```
pip3 install matplotlib
```

## How to use matplotlib?

We will use the *pyplot module* inside the matplotlib package for plotting. You can import this module as usual. It is usually aliased as `plt`:
```python
import matplotlib.pyplot as plt
```

---

## The dataset

Let's use the *Fortune 500* company list for the year 2017, which should already be familiar from the *mandatory assignment*. The *Fortune 500* is an annual list compiled and published by *Fortune magazine* that ranks 500 of the largest United States corporations by total revenue for their respective fiscal years. For each company the following information is given:
 1. rank,
 2. company name,
 3. industry sector,
 4. revenue (in million dollars),
 5. profit (in million dollars),
 6. number of employees.

The dataset is given in the `fortune500_2017.csv` file. The used delimiter is the semicolon (`;`) character.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Special Jupyter Notebook command, so the plots by matplotlib will be displayed inside the Jupyter Notebook
%matplotlib inline

fortune500 = pd.read_csv('fortune500_2017.csv', delimiter = ';')
fortune500.columns = ['rank', 'company', 'sector', 'revenue', 'profit', 'employees']
display(fortune500)

Let's take just the top 50 companies, so the visualizations in the following tasks are easier to read:

In [None]:
fortune50 = fortune500.head(50)
display(fortune50)

---

## Plotting

Plots can be generated with the `plot()` function of a Pandas *DataFrame* (table) or *Series* (column). The most important parameter of the function is the `kind` parameter, which defines the type of plot to be generated. Supported kinds are (non-exhaustive list):
* `line`
* `bar` (vertical bar)
* `barh` (horizontal bar)
* `scatter`
* `hist` (histogram)
* `box` (boxplot)
* `pie`

After a plot is generated, it can be displayed by the `show()` function of the `matplotlib.pyplot` module.

### Vertical bar plot

Display a bar plot of the revenue of the top 50 Fortune companies.

In [None]:
fortune50.plot(kind='bar', x='company', y='revenue', figsize = [15, 3])
plt.show() # matplotlib.pyplot was imported as plt

The size of the diagram can be configured with the `figsize` parameter. The size is given in inches (1 inch = 2.54 centimeters).  
The default size is `[6.4, 4.8]`.
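
Because `figsize` expects inches, a small conversion helper can be handy when the target size is known in centimeters. The sketch below is only an illustration: `cm_to_inches` is a hypothetical helper name, not part of matplotlib or Pandas.

```python
CM_PER_INCH = 2.54

def cm_to_inches(width_cm, height_cm):
    """Convert a (width, height) size given in centimeters to inches."""
    return [width_cm / CM_PER_INCH, height_cm / CM_PER_INCH]

# A diagram of 38.1 cm x 7.62 cm is approximately figsize = [15, 3]
size = cm_to_inches(38.1, 7.62)
print(size)
```

The returned list can be passed directly as the `figsize` argument of `plot()`.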

The bar diagram can be created directly on the selected *Series* (column of data). In this case the Series will be placed along axis Y, while the horizontal axis X will become the index of the *DataFrame*.

In [None]:
fortune50['revenue'].plot(kind='bar', figsize = [15, 3])
plt.show()

The index can be changed with the `set_index` function of the *DataFrame* (see Lecture 6 for more details), which returns a **new** *DataFrame*:

In [None]:
fortune50_indexed = fortune50.set_index('company')
display(fortune50_indexed)

Creating the bar plot from the `fortune50_indexed` *DataFrame* will display the company names as labels correctly.

In [None]:
fortune50_indexed['revenue'].plot(kind='bar', figsize = [15, 3])
plt.show()

#### Visual tuning

The color of the bars can be defined with the `color` parameter.
The width of the bars is set by the `width` parameter, 1.0 meaning *100%*.

In [None]:
fortune50_indexed['revenue'].plot(kind='bar', figsize = [15, 3], color = 'red', width = 1.0)
plt.show()

Multiple colors can be passed in a list.

In [None]:
fortune50_indexed['revenue'].plot(kind='bar', figsize = [15, 3], color = ['red', 'green', 'yellow'], width = 1.0)
plt.show()

#### Outlook on passing arguments by name in Python

In Python, arguments can be passed either by their *position* or by their *name*.

E.g. for the `read_csv()` function call below the file name (`fortune500_2017.csv`) is passed by its position (first argument), while the delimiter (`;`) is passed by its name.
```python
fortune500 = pd.read_csv('fortune500_2017.csv', delimiter = ';')
```

Let's define the `quadratic(a, b, c)` function which solves the [quadratic equation](https://en.wikipedia.org/wiki/Quadratic_equation):
$$ax^2+bx+c=0$$
This can be done by applying the well-known [quadratic formula](https://en.wikipedia.org/wiki/Quadratic_formula):
$$x=\frac{-b\pm\sqrt[]{b^2-4ac}}{2a}$$

In [None]:
import math

def quadratic(a, b, c):
    x1 = -b / (2*a)
    x2 = math.sqrt(b**2 - 4*a*c) / (2*a)
    return (x1 + x2), (x1 - x2)

# Passing parameters by position
print(quadratic(-2, 10, 12)) # -2 * x^2 + 10 * x + 12 = 0 is true for x = -1 and x = 6

# Passing parameters by name
print(quadratic(a = -2, b = 10, c = 12))

# The two approaches can also be mixed, but the positional arguments must be first!
print(quadratic(-2, c = 12, b = 10))
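
Named arguments are most useful for functions with many optional parameters, such as `plot()`. The following sketch assumes a hypothetical function `describe_plot` (it is not part of Pandas or matplotlib and draws nothing) whose parameters have default values, to show how naming lets us set only the ones we care about.

```python
def describe_plot(data_name, kind='line', figsize=(6.4, 4.8), color=None):
    """Hypothetical helper: returns a text summary instead of drawing anything."""
    return f"{kind} plot of {data_name}, size={figsize}, color={color}"

# Only the first argument is required; the rest fall back to their defaults.
print(describe_plot('revenue'))

# Named arguments can be given in any order; the others keep their defaults.
print(describe_plot('revenue', color='red', kind='bar'))

# A positional argument after a named one is a syntax error,
# so it is shown here only as a comment:
# describe_plot(kind='bar', 'revenue')
```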

### Horizontal bar plot

Display a horizontal bar plot of the profit of the top 50 Fortune companies.

In [None]:
fortune50.plot(kind='barh', x='company', y='profit', figsize = [10, 10])
plt.show()

Note that for the horizontal bar plot, *axis X* is the vertical axis and *axis Y* is the horizontal one. It is defined this way so that only the `kind` parameter of the `plot()` function has to be changed when switching to a different type of diagram.

Before visualizing the data, sort it by the `profit` column instead of the default order by revenue.

In [None]:
fortune50.sort_values(by = 'profit').plot(kind='barh', x='company', y='profit', figsize = [10, 10])
plt.show()

### Scatter plot

Display a scatter plot to examine the correlation between the revenue and the profit columns of the top 50 Fortune companies.

**Question:** What correlation can be expected between these 2 attributes of companies?

In [None]:
fortune50.plot(kind='scatter', x='revenue', y='profit', title='Revenue vs. Profit of Top 50 Fortune Companies')
plt.show()

A title can be given to be displayed above the generated diagram with the `title` parameter.

Extend the scatter plot for all top 500 Fortune companies.

In [None]:
fortune500.plot(kind='scatter', x='revenue', y='profit', title='Revenue vs. Profit of Fortune 500 Companies')
plt.show()

As we can observe, there is a moderate correlation between revenue and profit, which matches our expectation.

The limits of the X and Y axes can be configured with the `xlim` and `ylim` parameters, so the *outliers* can be excluded from the visualization. Both a minimum and a maximum boundary can be given, as a tuple.

In [None]:
fortune500.plot(kind='scatter', x='revenue', y='profit', title='Revenue vs. Profit', xlim=(0, 200000), ylim=(-10000, 30000))
plt.show()

#### Short outlook on correlation (optional)

The correlation matrix between *Series* of a *Pandas DataFrame* can be generated with the `corr()` function:

In [None]:
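# Only numeric columns can be correlated; numeric_only requires Pandas 1.5 or newer
display(fortune50.corr(numeric_only=True))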

Or just for 2 selected Series:

In [None]:
print(fortune50['revenue'].corr(fortune50['profit']))

Every correlation has two qualities: *strength* and *direction*. The direction of a correlation is either positive or negative. When two variables have a positive correlation, it means the variables move in the same direction. This means that as one variable increases, so does the other one. In a negative correlation, the variables move in inverse, or opposite, directions. In other words, as one variable increases, the other variable decreases.

We determine the strength of a relationship between two correlated variables by looking at the numbers. A correlation of 0 means that no relationship exists between the two variables, whereas a correlation of 1 indicates a perfect positive relationship. It is uncommon to find a perfect positive relationship in the real world.

The further away from 1 that a positive correlation lies, the weaker the correlation. Similarly, the further a negative correlation lies from -1, the weaker the correlation.
The following guidelines are useful when determining the strength of a positive correlation:
* 1: perfect positive correlation
* .70 to .99: very strong positive relationship
* .40 to .69: strong positive relationship
* .30 to .39: moderate positive relationship
* .20 to .29: weak positive relationship
* .01 to .19: no or negligible relationship
* 0: no relationship exists
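
By default, the `corr()` function of Pandas computes the Pearson correlation coefficient. As an illustration, the sketch below calculates it by hand and labels the result according to the guideline list above. The function names `pearson` and `strength` and the sample numbers are made up for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How strongly the two lists vary together
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How strongly each list varies on its own
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

def strength(r):
    """Label a non-negative correlation according to the guidelines above."""
    if math.isclose(r, 1.0):
        return 'perfect'
    if r >= 0.70:
        return 'very strong'
    if r >= 0.40:
        return 'strong'
    if r >= 0.30:
        return 'moderate'
    if r >= 0.20:
        return 'weak'
    return 'no or negligible'

xs = [1, 2, 3, 4, 5]   # made-up numbers, not from the dataset
ys = [2, 1, 4, 3, 5]
r = pearson(xs, ys)
print(r, strength(r))
```

For a negative correlation the same labels can be applied to the absolute value of the coefficient.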



### Histogram

A histogram shows the distribution of numerical data: the range of values is split into intervals, and the height of each bar is the number of values that fall into that interval. It differs from a bar graph in the sense that a bar graph relates two variables, but a histogram relates only one.

Display a histogram on the profit of the top 50 Fortune companies.

In [None]:
fortune50['profit'].plot(kind='hist')
plt.show()

The number of columns (called *bins* or *buckets*) in the histogram can be configured with the `bins` parameter.

In [None]:
fortune50['profit'].plot(kind='hist', bins=20)
plt.show()
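
To see what the `bins` parameter does, the counting behind a histogram can be sketched in plain Python. This is a simplified illustration with equal-width bins, not the actual implementation used by matplotlib; `histogram_counts` is a hypothetical helper, and it assumes that the values are not all equal.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # The maximum value belongs to the last bin, not to a bin of its own
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts

profits = [-5, 0, 1, 2, 2, 3, 4, 10, 15]  # made-up numbers, not from the dataset
print(histogram_counts(profits, bins=4))   # finer resolution
print(histogram_counts(profits, bins=2))   # coarser resolution
```

More bins give a finer picture of the distribution, but with too many bins most of them contain only a few values or none at all.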

Extend the histogram to cover all 500 Fortune companies. Apply a logarithmic scale with the `logx` / `logy` parameter.

In [None]:
fortune500['profit'].plot(kind='hist', bins=50, logy=True)
plt.show()

### Boxplot

In descriptive statistics, a *boxplot* is a method for graphically depicting groups of numerical data through their quartiles. 

Display a boxplot of the profit of the top 50 Fortune companies.

In [None]:
fortune50['profit'].plot(kind='box')
plt.show()

The graphical elements of a boxplot are explained in the following figure:

![Explaining a boxplot](08_boxplot.gif "Source: https://www.learndatasci.com/")
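
The numbers behind a boxplot can also be computed by hand. The sketch below uses the `statistics` module of the standard library and the 1.5 × IQR rule, which is the default of matplotlib for deciding which points are drawn as outliers. The helper name `box_stats` and the sample numbers are made up for this example.

```python
import statistics

def box_stats(values):
    """Return the quartiles, the outlier fences and the outliers of the values."""
    # The 'inclusive' method interpolates linearly between the data points
    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1                    # interquartile range = height of the box
    lower_fence = q1 - 1.5 * iqr     # values below this are outliers
    upper_fence = q3 + 1.5 * iqr     # values above this are outliers
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, median, q3, lower_fence, upper_fence, outliers

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up numbers with one extreme value
print(box_stats(values))
```

Note that the whiskers of the plot are drawn only to the most extreme data points that still lie within the fences, not to the fences themselves.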

### Pie chart

Display a pie chart on the revenue share of the top 50 Fortune companies. Since we are creating this plot on the `revenue` *Series*, we use the `fortune50_indexed` DataFrame, which was indexed with the company names, so the labels will contain them instead of numerical indices.

In [None]:
fortune50_indexed['revenue'].plot(kind='pie', figsize=[10,10], label="", title="Revenue share of the Fortune 50 companies")
plt.show()

Percentages for each slice can be displayed with the `autopct` parameter:

In [None]:
fortune50_indexed['revenue'].plot(kind='pie', figsize=[10,10], autopct='%.1f%%', label="", title="Revenue share of the Fortune 50 companies")
plt.show()
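
The value of `autopct` is an old-style Python format string that matplotlib applies to the percentage of each slice. The short sketch below, with made-up numbers, shows what `'%.1f%%'` produces: one decimal digit followed by a literal percent sign (written as `%%`).

```python
revenues = [50, 30, 20]  # made-up numbers, not from the dataset
total = sum(revenues)

# The share of each slice in percent, formatted the same way as autopct does
labels = ['%.1f%%' % (100 * revenue / total) for revenue in revenues]
print(labels)
```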

### Saving a plot to file

Instead of using the `show()` function of the `matplotlib.pyplot` module, the `savefig()` function can be used to export and save a created plot to an external file.

In [None]:
fortune50.plot(kind='bar', x='company', y='revenue', figsize = [15, 3])
plt.savefig('08_company_revenue.png')

Hint: look for the created file right next to this Jupyter Notebook file on your computer.
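
The `savefig()` function accepts further parameters. The sketch below shows two commonly used ones on a small made-up table: `dpi` sets the resolution, and `bbox_inches='tight'` trims the image so that long axis labels are not cut off. The temporary directory and the file name are arbitrary choices for this example.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers, not from the dataset
companies = pd.DataFrame({
    'company': ['Long Company Name A', 'Long Company Name B', 'Long Company Name C'],
    'revenue': [300, 200, 100],
})

companies.plot(kind='bar', x='company', y='revenue', figsize=[6, 3])

# Save into the temporary directory of the operating system
target = os.path.join(tempfile.gettempdir(), 'example_revenue.png')
plt.savefig(target, dpi=150, bbox_inches='tight')
plt.close()  # release the figure after saving

print(os.path.exists(target))
```

The file format is chosen from the extension, so a name ending in `.pdf` or `.svg` produces a vector graphic instead of a PNG image.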

---

## Time Series Analysis

Read the `fortune500_1955-2005.csv` dataset, which contains the Fortune 500 company data between the years 1955 and 2005.  
Altogether the dataset contains 51 years of data with 500 rows for each year, that is, 25,500 rows.

Each row stores the following data:
 1. year,
 2. rank,
 3. company,
 4. revenue (in million dollars),
 5. profit (in million dollars).

*The delimiter is a simple comma (`,`) for this file.*

In [None]:
fortune_history = pd.read_csv('fortune500_1955-2005.csv')
fortune_history.columns = ['year', 'rank', 'company', 'revenue', 'profit']
display(fortune_history)

---

### Line plot

Line diagrams work best with a series of data, assuming continuous change between the known discrete values.  
Let's visualize the revenue and profit change of the company *Exxon Mobil* between the years 1955 and 2005.

First filter the rows for the company *Exxon Mobil* and set the year as the index column.

In [None]:
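# Filter the rows, then set the index on the filtered result in one step,
# which avoids modifying a slice of fortune_history in place
exxon = fortune_history[fortune_history['company'] == 'Exxon Mobil'].set_index('year', drop=False)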
display(exxon)

Now a line plot on the revenue change of Exxon Mobil between 1955 and 2005 can be displayed.

In [None]:
exxon['revenue'].plot(kind='line')
plt.show()

# same:
#exxon.plot(kind='line', x='year', y='revenue')
#plt.show()

### Multiple column diagrams

Let's use multiple columns in the previous line plot, and add the profit to the diagram as a second line.

Several plots can be generated by calling the `plot()` method on multiple Pandas *Series*. Calling the `show()` function of the `matplotlib.pyplot` module afterwards will display them in a single diagram.

In [None]:
exxon['revenue'].plot(kind='line')
exxon['profit'].plot(kind='line')
plt.show()

Add a legend to the diagram:

In [None]:
exxon['revenue'].plot(kind='line', legend=True)
exxon['profit'].plot(kind='line', legend=True)
plt.show()

The same can be done by calling the `plot()` method of a *Pandas DataFrame*. Be aware though, that in this case each plot will be displayed in an individual diagram:

In [None]:
exxon.plot(kind='line', x='year', y='revenue', legend=True)
exxon.plot(kind='line', x='year', y='profit', legend=True)
plt.show()

This can be fixed by explicitly telling Pandas to draw both plots on the same matplotlib *Axes object*:

In [None]:
ca = plt.gca() # gca = get current Axes (the drawing area of the current figure)
exxon.plot(kind='line', x='year', y='revenue', ax=ca, legend=True) # draw on the ca Axes object
exxon.plot(kind='line', x='year', y='profit', ax=ca, legend=True) # draw on the same ca Axes object
plt.show()
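
An alternative to `plt.gca()` is to create the figure and its Axes object explicitly with `plt.subplots()` and pass the Axes object to each `plot()` call. The sketch below uses a small made-up table instead of the Exxon data, so it can be run on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers, not from the dataset
history = pd.DataFrame({
    'year': [2001, 2002, 2003],
    'revenue': [100, 120, 150],
    'profit': [10, 8, 15],
})

fig, ax = plt.subplots(figsize=[8, 3])   # one figure with a single Axes object
history.plot(kind='line', x='year', y='revenue', ax=ax, legend=True)
history.plot(kind='line', x='year', y='profit', ax=ax, legend=True)

# Both lines were drawn on the same Axes object
print(len(ax.get_lines()))
```

Calling `plt.show()` afterwards displays the diagram as usual. Creating the Axes object explicitly also makes it possible to set the figure size in one place.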

Use a different, secondary scale for the profit.

In [None]:
exxon['revenue'].plot(kind='line', legend=True)
exxon['profit'].plot(kind='line', secondary_y=True, legend=True)
plt.show()

---

### Data grouping

*Pandas* supports grouping data by one or more columns, which can then also be used for visualization.

Select the top 10 Fortune companies in year 1955.  
Note that the original `fortune_history` DataFrame was ordered by year (ascending), then by revenue (descending), so no reordering is required.

In [None]:
selected_companies = fortune_history.head(10)['company']
display(selected_companies)

Select the rows of the original DataFrame for these selected companies. Set the year column as an index for this filtered DataFrame.

In [None]:
selected_history = fortune_history[fortune_history['company'].isin(selected_companies)]
# set_index returns a new DataFrame, so the filtered slice is not modified in place
selected_history = selected_history.set_index('year', drop=False)
display(selected_history)

The `selected_history` *DataFrame* now contains all historical data for the top 10 Fortune companies in year 1955.

Visualize the revenue change of the top 10 Fortune companies in 1955 for the next 51 years in a line diagram.  
To achieve this, group the `selected_history` *DataFrame* by the `company` *Series*, then select the `revenue` *Series* and create a line plot.

In [None]:
selected_history.groupby('company')['revenue'].plot(
    kind='line', logy=True, 
    figsize=[15, 6], legend=True,
    title='Revenue history of Top 10 Fortune companies 1955-2005')
plt.show()

---

### Aggregate functions

Aggregate functions (`min`, `max`, `mean`, `median`, `sum`, etc.) transform a group of values to a single value. By calling an aggregate function on a grouped *DataFrame*, the aggregated value of each group is calculated.

Let's calculate the best rank (*minimum value*) each company ever had in the Fortune 500 list between 1955 and 2005.

In [None]:
fortune_history.groupby('company').min()

Sort the result by `rank` and display only the `rank` column:

In [None]:
best_rank = fortune_history.groupby('company').min().sort_values(by='rank')['rank']
display(best_rank)
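
Note that an aggregate function called on a grouped *DataFrame* is computed for every column independently, so the minimum year and the minimum rank of a company do not necessarily come from the same row. When only one column is of interest, it can be selected before aggregating. The sketch below demonstrates this on a small made-up table, together with the `agg()` function, which calculates several aggregates at once.

```python
import pandas as pd

# Made-up numbers, not from the dataset
ranks = pd.DataFrame({
    'year':    [1955, 1956, 1957, 1955, 1956, 1957],
    'company': ['A', 'A', 'A', 'B', 'B', 'B'],
    'rank':    [3, 1, 2, 5, 4, 6],
})

# Select the column first, so only the rank is aggregated
best = ranks.groupby('company')['rank'].min().sort_values()
print(best)

# Several aggregates of the same column at once
summary = ranks.groupby('company')['rank'].agg(['min', 'max', 'mean'])
print(summary)
```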

---

## Summary exercises on plotting

### Exercise 1

**Task:** Use the Fortune 500 dataset for the year 2017 stored in the `fortune500` variable. That dataset contains the *sector* of each company. Compute for each sector how many companies belonged to it in 2017. Visualize the results in a pie chart.

*Hint:* use grouping.

In [None]:
companies_by_sector = fortune500.groupby('sector').count()['rank']
display(companies_by_sector)

In [None]:
companies_by_sector.plot(kind='pie', figsize=[10,10], autopct='%.1f%%', label="", 
                         title="Sector distribution among the Fortune 500 companies in 2017")
plt.show()

### Exercise 2

**Task:** Calculate the accumulated revenue of the Fortune 500 companies for each year between 1955 and 2005.  
Create a bar diagram visualizing how the aggregated revenue changed over the years.

In [None]:
# Sum only the numeric columns; text columns such as the company name are skipped
aggregated_by_year = fortune_history.groupby('year').sum(numeric_only=True)
display(aggregated_by_year)

In [None]:
aggregated_by_year.plot(kind='bar', y='revenue', figsize=[15, 4], width=0.8, color='orange')
plt.show()