<br><p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold">
Data Exploration with Graphs: Matplotlib</p><br><br>

### Matplotlib
Matplotlib is probably the most widely used visualization library in Python. The documentation is here: <br>
https://matplotlib.org/stable/contents.html<br>
The example gallery provides templates that you can adapt directly:<br>
https://matplotlib.org/stable/gallery/index.html<br>
Other popular visualization packages include:<br>
1. Seaborn: built on top of Matplotlib. The key difference is that Seaborn graphs are more aesthetically pleasing and modern.<br>
2. Geoplotlib: to visualize geographic data and make maps.<br>
3. Plotnine: a Python implementation of ggplot2 based on the grammar of graphics. It makes it easy to add multiple layers to a graph.<br>
4. Bokeh: to create interactive, web-ready plots<br>
and many more.

In this Notebook, we demonstrate Matplotlib graphs in detail using the following data set.
<p>Data Source: https://www.kaggle.com/worldbank/world-development-indicators</p>
The World Development Indicators dataset obtained from the World Bank contains over a thousand annual indicators of economic development from hundreds of countries around the world. 

## Using graphs to explore data
### Some basic rules
### 1. Individual variables: examine their distributions - how data objects are distributed across the values of the variable
     1. Interval: histogram or boxplot for distribution
     2. Categorical: bar or pie chart to show the counts of data objects in each category
### 2. Relationships between variables:
     1. Two continuous variables: scatterplot
     2. One continuous variable and one categorical variable: boxplots or histograms of the continuous variable
        across categories of the categorical variable
     3. Two categorical variables: pivot_table; stacked bar charts (bar chart of one categorical variable across
        the categories of the other variable)
     4. Change over time: line chart
### 3. More than two variables: high dimensional graphs
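
As a rough illustration of rules 1 and 2 above, the sketch below draws one plot per variable type. The `toy` DataFrame and its column names are made up for this example and are not part of the World Bank data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: two continuous variables and one categorical variable.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'height': rng.normal(170, 10, 100),
    'group': rng.choice(['a', 'b', 'c'], 100),
})
toy['weight'] = 0.5 * toy['height'] + rng.normal(0, 5, 100)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# One continuous variable: histogram of its distribution.
axes[0].hist(toy['height'])
axes[0].set_title('Histogram')

# One categorical variable: bar chart of the counts per category.
counts = toy['group'].value_counts()
axes[1].bar(list(counts.index), counts.values)
axes[1].set_title('Bar chart')

# Two continuous variables: scatterplot.
axes[2].scatter(toy['height'], toy['weight'])
axes[2].set_title('Scatterplot')

fig.tight_layout()
```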

In [None]:
# This code appears in every demonstration Notebook.
# By default, when you run a cell, only the output of its last expression is shown.
# This code makes all outputs of a cell show.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Step 1: Understanding the Dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

In [None]:
# Import data. My Notebook file is in the same folder as the data file. So, I can directly use the file name.
# Please adapt the code to your file directory.
world=pd.read_csv('Indicators.csv')

This is a really large dataset, at least in terms of the number of rows.

In [None]:
world.head()
world.shape
world.info()

In [None]:
world.isnull().sum()

### Now let's examine the data. Each row records the value of a development indicator by country/region and year.

### How many different indicator names?

In [None]:
inds=world['IndicatorName'].unique()
len(inds)

### Is the number of indicator codes the same?

In [None]:
indsc=world['IndicatorCode'].unique()
len(indsc)

It suggests that IndicatorName and IndicatorCode are paired with each other.
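
Equal counts are consistent with a one-to-one pairing but do not prove it. One possible check, sketched here on a made-up `toy` table that only borrows the column names of the real data, is to count the distinct codes per name and the distinct names per code:

```python
import pandas as pd

# Toy stand-in for the World Bank table (made-up names and codes).
toy = pd.DataFrame({
    'IndicatorName': ['Indicator A', 'Indicator A', 'Indicator B', 'Indicator B'],
    'IndicatorCode': ['IND.A', 'IND.A', 'IND.B', 'IND.B'],
    'Year': [2000, 2001, 2000, 2001],
})

# Distinct codes seen for each name, and distinct names seen for each code.
codes_per_name = toy.groupby('IndicatorName')['IndicatorCode'].nunique()
names_per_code = toy.groupby('IndicatorCode')['IndicatorName'].nunique()

# The pairing is one-to-one only if neither count ever exceeds 1.
one_to_one = bool(codes_per_name.max() == 1 and names_per_code.max() == 1)
print(one_to_one)
```

The same two `groupby` calls should work on `world` directly, assuming its columns are named as in the cells above.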

### Is the number of unique indicators the same every year?

In [None]:
# check for the number of unique indicators in one specific year
# We can use filter to select a specific year
ind2000=world[world['Year']==2000]['IndicatorName'].unique()
len(ind2000)

### Is the number of unique indicators the same for every country?

In [None]:
# check for the number of unique indicators for a specific country
indAfg=world[world['CountryName']=='Afghanistan']['IndicatorName'].unique()
len(indAfg)

In [None]:
# check for the number of unique indicators in one specific year for USA
indUSA2000 = world[(world['Year']==2000) & (world['CountryName']=='United States')]['IndicatorName'].unique()
len(indUSA2000)

### Conclusion: the available indicators vary by country and by year.
We could make reasonable inferences in the following ways:
1. Fix one dimension and observe the changes in the other dimension:<br>
    Compare the same indicator over different years for a country<br>
    Compare the same indicator across countries for one year<br>
2. Vary both dimensions:<br>
    Compare the same indicator over different years and across countries
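
Rather than checking one year or one country at a time, the counts can be computed for every year and every country at once. The sketch below uses a made-up `toy` table with the same column names as the real data. It also shows one way to lay out a single indicator with years as rows and countries as columns, which is a convenient shape for the comparisons listed above.

```python
import pandas as pd

# Made-up rows; only the column names mirror the real data.
toy = pd.DataFrame({
    'CountryName': ['A', 'A', 'A', 'B', 'B'],
    'IndicatorName': ['x', 'y', 'x', 'x', 'x'],
    'Year': [2000, 2000, 2001, 2000, 2001],
    'Value': [1.0, 2.0, 1.5, 3.0, 3.5],
})

# Number of distinct indicators reported in each year and for each country.
per_year = toy.groupby('Year')['IndicatorName'].nunique()
per_country = toy.groupby('CountryName')['IndicatorName'].nunique()

# One indicator, reshaped so each row is a year and each column a country.
wide = toy[toy['IndicatorName'] == 'x'].pivot(
    index='Year', columns='CountryName', values='Value')

print(per_year)
print(per_country)
print(wide)
```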

### How many unique country names are there?

In [None]:
countries=world['CountryName'].unique()
len(countries)

### Is there the same number of country codes?

In [None]:
countcodes=world['CountryCode'].unique()
len(countcodes)

### Conclusion: Country code and name are paired correctly.

### How many years of data do we have?

In [None]:
# How many years of data do we have ?
years=world['Year'].unique()
len(years)

### What's the range of years?

In [None]:
print(min(years), 'to', max(years))

### Conclusion: The data contains over a thousand development indicators for 247 countries over the years 1960 - 2015. The number of indicators may vary from year to year and across countries.

<p style="font-family: Arial; font-size:2.5em;color:blue; font-style:bold">
Step 2: Exploration with basic plotting</p><br>

### 1. Simple line graph
Line graphs are good for describing trends over time. Pick a country and an indicator to explore changes over time: CO2 emissions per capita and the USA.

In [None]:
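# First, we need to identify CO2-related indicators and find the one we want.
import re
# Pass the flag by keyword: the second positional argument of str.contains is `case`, not `flags`.
world.loc[world['IndicatorName'].str.contains('CO2', flags=re.IGNORECASE), 'IndicatorName'].unique()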

In [None]:
# Then we use the indicator 'CO2 emissions (metric tons per capita)' for our examination
# We filter the data to get the indicator value for the US over the years.
hist_indicator=r'CO2 emissions \(metric' # filter 1 - indicator (raw string: str.contains treats the pattern as a regex)
hist_country='USA' # filter 2 - country code

mask1=world['IndicatorName'].str.contains(hist_indicator)
mask2=world['CountryCode'].str.contains(hist_country)

USAco2=world[mask1 & mask2]
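
The parenthesis in the indicator pattern is escaped because `str.contains` treats its pattern as a regular expression by default. A minimal sketch of the two equivalent options, on a made-up `names` Series:

```python
import pandas as pd

# Made-up indicator names for illustration.
names = pd.Series([
    'CO2 emissions (metric tons per capita)',
    'CO2 emissions (kt)',
    'Population, total',
])

# Option 1: regex pattern in a raw string, with the parenthesis escaped.
mask_regex = names.str.contains(r'CO2 emissions \(metric')

# Option 2: plain substring match, so no escaping is needed.
mask_literal = names.str.contains('CO2 emissions (metric', regex=False)

print(mask_regex.tolist())
print(mask_literal.tolist())
```

Both masks select only the first entry. The literal form may be easier to read when the pattern contains no regex features.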

In [None]:
USAco2.head()

### To observe changes over the years, a line graph is the best choice.

Now let's understand a 2D graph. It has an x axis, a y axis, and a geometry object (lines, points, or bars)
that maps x, y values to coordinates.
It may also have a title and labels for the axes and the geometry objects.
To draw a graph:
1. Understand what your x and y axes mean
2. Choose a proper type of graph
3. Prepare the x and y values
4. Get the basic graph
5. Adjust the aesthetic components of the graph

In [None]:
# Plot CO2 emission in the US over the years
# 1. X - years; Y - CO2 emission values
# 2. Line graph
# 3. Prepare X and Y
# X: the years;
years=USAco2['Year']
# Y: the values 
co2=USAco2['Value']

In [None]:
# plt provides different styles for plots. For instance, 'ggplot' is one of the built-in styles.
# plt.style.available lists the style names your Matplotlib version supports.
# The seaborn-like styles were renamed in Matplotlib 3.6, e.g. 'seaborn' became 'seaborn-v0_8'.
plt.style.use('ggplot')
#plt.style.use('seaborn-v0_8')

In [None]:
# Now let's create the graph using plot()
# plot() by default creates line plot.
plt.plot(years, co2)

In [None]:
# You can also get a scatterplot by passing a format string that sets the marker shape.
# 'b'    # blue markers with default shape
# 'or'   # red circles
# '-g'   # green solid line
# '--'   # dashed line with default color
# '^k:'  # black triangle_up markers connected by a dotted line
# The full list of format can be found here 
# https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot
plt.plot(years, co2, '^k:')

In [None]:
# plt also has a scatter function to create scatterplots.
plt.scatter(years, co2)

In [None]:
# Now make the graph look better.
# Label the line
plt.plot(years, co2, label = 'CO2')
plt.legend() # show the legend

# Label the axes
plt.xlabel('Year')
plt.ylabel(USAco2['IndicatorName'].iloc[0])

# Label the figure
plt.title('CO2 Emissions in USA')

# Set axis limits. The list of numbers maps to [xmin, xmax, ymin, ymax] by default
plt.axis([1959,2011,15,25])

In [None]:
# If coding in a script, you may use plt.show() to display plots.
# It is suggested that plt.show() should be used only once per Python session, usually the very end of the script.
# We do not need this in the Notebook
# plt.show()

### Observations from the line graph:
1. CO2 emissions jumped during the 60s and peaked in the early 70s.
2. They started to drop in the late 70s.
3. They then remained fairly stable, with small peaks.
4. They dropped significantly from 2008 to 2009.

In [None]:
# We can compare other countries with the US by adding other lines
# I will leave this to you to figure out

In [None]:
# A barplot also works 
plt.bar(years,co2)

# Add details

### 2. The distributions of a continuous variable: Histograms
What does the distribution look like for CO2 emissions?

In [None]:
co2.describe()

In [None]:
# A histogram is a graph for a single variable. So we only need one set of values: the variable we explore
plt.hist(co2,facecolor='green')
# By default the number of bins is 10. 

In [None]:
# You may specify the bin edges with a list, e.g. [14, 16, 18, 20, 22, 24],
# or with a range, e.g. range(14, 25, 2), which yields the same edges.
plt.hist(co2, bins = range(14, 24), facecolor='green')

In [None]:
# We can refine the look of the graph by adding components.
plt.hist(co2, bins = range(14, 24), facecolor='green')
plt.xlabel(USAco2['IndicatorName'].iloc[0])
plt.ylabel('# of years')
plt.title('CO2 Histogram')
#plt.yticks(np.arange(0, 20, 2))

### How do the USA's numbers relate to those of other countries?

In [None]:
# This is a comparison across countries
# One way is to draw a line for each country. Since we have more than 240 countries (regions), it would be a mess.
# So the solution is to select a limited number of countries for the comparison.

# Another choice is to select all countries for one year and see where the US falls in the distribution
# That means, we need to describe the distribution of co2 emissions. A histogram is the choice.

# Select CO2 emissions for all countries in 2011
hist_indicator=r'CO2 emissions \(metric'
hist_year=2011

mask1 = world['IndicatorName'].str.contains(hist_indicator)
mask2=world['Year'].isin([hist_year])

# apply our mask
co2_2011 = world[mask1 & mask2]
co2_2011.head()

For how many countries do we have CO2 per capita emissions data in 2011?

In [None]:
print(len(co2_2011))

In [None]:
# Let's plot a histogram of the emissions per capita by country
plt.hist(co2_2011['Value'],10,facecolor='green')

plt.xlabel(co2_2011['IndicatorName'].iloc[0])
plt.ylabel('# of Countries')
plt.title('CO2 histogram over countries')

# To highlight the US, we can make an annotation on the graph
plt.annotate('USA',xy=(18,5),xycoords='data',
            xytext=(18,30),textcoords='data',
            arrowprops=dict(facecolor='black', shrink=0.05)
            )

All the plots above use the MATLAB-style interface. The object-oriented interface is available for more complicated situations, and for when you want more control over your figure.<br>
The figure (an instance of the class plt.Figure) can be thought of as a single container that contains all the objects representing axes, graphics, text, and labels. <br>
The axes (an instance of the class plt.Axes) is a bounding box with ticks and labels, which will eventually contain the plot elements that make up our visualization. We'll commonly use the variable name fig to refer to a figure instance, and ax to refer to an axes instance or group of axes instances.<br>
-----From the book "Python Data Science Handbook".
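
One situation where the object-oriented interface tends to pay off is a figure with several axes, because each axes object is addressed explicitly. The sketch below uses made-up random values and is only meant to show the pattern.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values standing in for any numeric column.
rng = np.random.default_rng(0)
values = rng.normal(loc=5, scale=2, size=200)

# One figure holding two axes side by side.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# Each axes object is configured independently of the other.
axes[0].hist(values, bins=10)
axes[0].set(title='Histogram', xlabel='value', ylabel='count')

axes[1].boxplot(values)
axes[1].set(title='Boxplot', ylabel='value')

fig.tight_layout()
```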

In [None]:
# Object-oriented interface
fig = plt.figure()
ax = plt.axes()

ax.hist(co2_2011['Value'],facecolor='green')

ax.set_xlabel(co2_2011['IndicatorName'].iloc[0])
ax.set_ylabel('# of Countries')
ax.set_title('CO2 histogram over countries')

ax.annotate('USA',xy=(18,5),xycoords='data',
            xytext=(18,30),textcoords='data',
            arrowprops=dict(facecolor='black', shrink=0.05)
           )
ax.grid(True)

In [None]:
# Save the figure
fig.savefig('co2hist.jpeg')
fig.canvas.get_supported_filetypes()

In [None]:
# fig and ax can be created together using subplots()
fig,ax=plt.subplots()

# ax.set() function
ax.hist(co2_2011['Value'],facecolor='green')

ax.set(xlabel=(co2_2011['IndicatorName'].iloc[0]),
       ylabel='# of Countries',
       title='CO2 histogram over countries')

ax.grid(True)

## Conclusion: 
At roughly 18 metric tons of CO2 per capita, the US is quite high among all countries.

### 3. Explore the relationship between GDP and CO2 Emissions in USA

In [None]:
# Select GDP per capita for the United States
hist_indicator=r'GDP per capita \(constant 2005'
hist_country='USA'

mask1=world['IndicatorName'].str.contains(hist_indicator)
mask2=world['CountryCode'].str.contains(hist_country)

USAgdp=world[mask1 & mask2]

In [None]:
USAgdp.head(2)

In [None]:
USAco2.head(2)

In [None]:
# Check the trend of GDP over the years.
gdp=USAgdp['Value'].values
years=USAgdp['Year'].values
plt.plot(years, gdp, 'o')

# Label the axes


#label the figure


# to make it more honest, start the y axis at 0
plt.axis([1959,2011,0,50000])

So although we've seen a decline in CO2 emissions per capita, it does not seem to translate into a decline in GDP per capita.

### A scatterplot explores the relationship between two numerical variables.
First, we'll need to make sure we're looking at the same time frames.

In [None]:
#Do the two indicators match?
print("GDP Min Year:",USAgdp['Year'].min(), "Max", USAgdp['Year'].max())
print("Co2 Min Year:",USAco2['Year'].min(), "Max", USAco2['Year'].max())

We have 3 extra years of GDP data, so let's trim those off so that the two series passed to the scatterplot have equal length (scatter() requires x and y to be the same size).

In [None]:
USAgdp_trunc=USAgdp[USAgdp['Year']<2012]
len(USAgdp_trunc)
len(USAco2)

In [None]:
fig,ax=plt.subplots()

X=USAgdp_trunc['Value']
Y=USAco2['Value']
ax.scatter(X,Y)
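
Trimming by a year cutoff pairs the values correctly only if both series cover exactly the same years in the same order. A possible alternative is to join the two tables on `Year`. The sketch below shows this on two small made-up frames, `gdp` and `co2`, which are hypothetical stand-ins for `USAgdp` and `USAco2`.

```python
import pandas as pd

# Made-up values; only the column names mirror the real data.
gdp = pd.DataFrame({'Year': [2009, 2010, 2011, 2012, 2013],
                    'Value': [45.0, 46.0, 46.5, 47.0, 47.5]})
co2 = pd.DataFrame({'Year': [2009, 2010, 2011],
                    'Value': [17.2, 17.5, 17.0]})

# An inner join keeps only the years present in both tables,
# so each row pairs the two values for the same year.
paired = gdp.merge(co2, on='Year', how='inner', suffixes=('_gdp', '_co2'))
print(paired)
```

The columns `Value_gdp` and `Value_co2` of `paired` can then be passed to `ax.scatter` as x and y.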

This doesn't look like a strong relationship.  We can test this by looking at correlation.

In [None]:
# Correlation between two variables.
# np.corrcoef returns a 2x2 matrix; the off-diagonal entries are the correlation coefficient.
np.corrcoef(USAgdp_trunc['Value'],USAco2['Value'])

A correlation of 0.07 is pretty weak.
However, it seems that when GDP is in the lower range, say 15000 - 25000, there is a strong correlation.
This exploratory finding suggests that you may want to use step functions, splines, or other models more advanced than a simple linear regression when you use GDP to predict CO2 emissions.
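
To see how a strong relationship in one range can coexist with a weak overall correlation, the sketch below computes the coefficient on all points and on a lower sub-range. The `x` and `y` values are made up (a rise followed by a fall) and are not the World Bank figures.

```python
import numpy as np

# Made-up values: y rises with x up to 25000, then falls (a tent shape).
x = np.linspace(15000, 45000, 31)
y = np.where(x <= 25000,
             16 + (x - 15000) / 10000 * 6,
             22 - (x - 25000) / 20000 * 6)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r_all = np.corrcoef(x, y)[0, 1]

# Restrict both arrays to the lower range with a boolean mask.
low = x <= 25000
r_low = np.corrcoef(x[low], y[low])[0, 1]

print(f"all points: r = {r_all:.2f}")
print(f"x <= 25000: r = {r_low:.2f}")
```

The same masking idea could be applied to the GDP and CO2 series once they are aligned by year.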

You could continue to explore this to see if other countries have a closer relationship between CO2 emissions and GDP.  Perhaps it is stronger for developing countries?

In [None]:
# Exercise: find the relationship between GDP and CO2 emissions for a developing country


In [None]:
# Scatterplot with more dimensions; size and color can represent variables
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T

plt.scatter(features[0], features[1], alpha=0.2,
            s=100*features[3], c=iris.target, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])