# Pandas and matplotlib

In this session we'll do a brief overview of two libraries useful for scientific computing:

- `pandas` is a library that provides data structures for storing and processing tabular data, especially time series. It has some similarities to the statistical language `R`.
- `matplotlib` is a low-level Python library for plotting.

Both libraries are useful in their basic form, but both have many intricacies and can be error-prone.

## Pandas


In [None]:
import pandas as pd

Pandas has a useful function `pd.read_csv` for loading CSV files. Tabular data is stored in a DataFrame object. The DataFrame will be pretty-printed by the Jupyter notebook:

In [None]:
data = pd.read_csv("populations.csv", sep='\t', index_col='year')
data.head()
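
To see what `sep` and `index_col` do without the data file at hand, here is a minimal sketch that parses a small tab-separated string instead. The column names and numbers are invented for illustration and are not taken from the course data.

```python
import io
import pandas as pd

# Hypothetical tab-separated content; the values are made up for illustration.
text = "year\thare\tlynx\n1900\t30\t4\n1901\t47\t6\n1902\t70\t10\n"

# sep='\t' splits fields on tabs; index_col='year' turns that column into the row labels.
toy = pd.read_csv(io.StringIO(text), sep='\t', index_col='year')
print(toy)
print(toy.index.tolist())
```

The `year` column no longer appears among the regular columns, because it has become the index.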

A DataFrame has labels for columns (similar to a numpy structured array) but it also can have labels for rows. 
The set of labels for rows is called an index. This is especially useful for time series. The index can be accessed via the `.index` attribute:

In [None]:
print(data.index)

The data in a DataFrame can be accessed by column labels or by row labels:

In [None]:
# Print the lynx column
print(data['lynx'])

In order to access data by row label, use the `.loc` attribute:

In [None]:
# print the 1919 row
print(data.loc[1919])
# print the range of data between 1900 and 1905
print(data.loc[1900:1905])

In order to access data by row position, use the `.iloc` attribute:

In [None]:
# print the penultimate row
print(data.iloc[-2])
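
One difference worth keeping in mind: slicing with `.loc` includes both endpoints, whereas slicing with `.iloc` follows the usual Python convention and excludes the stop position. A small sketch on an invented DataFrame (the values are not from the course data):

```python
import pandas as pd

# Made-up values, indexed by year like the data above.
toy = pd.DataFrame({"lynx": [4, 6, 10, 35]}, index=[1900, 1901, 1902, 1903])
toy.index.name = "year"

by_label = toy.loc[1900:1902]    # label slice: rows 1900, 1901 and 1902
by_position = toy.iloc[0:2]      # position slice: rows 0 and 1 only
print(len(by_label), len(by_position))
```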

A DataFrame can be converted to a numpy array:

In [None]:
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
print(data.to_numpy())

Many standard methods can be applied to rows and columns of a DataFrame:

In [None]:
# Mean per column
print(data.mean(axis=0))
# Sum per row
print(data.sum(axis=1))
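
The `axis` argument names the axis that is collapsed: `axis=0` aggregates down the rows and gives one value per column, while `axis=1` aggregates across the columns and gives one value per row. A sketch on invented numbers:

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({"hare": [30, 50], "lynx": [4, 6]}, index=[1900, 1901])

col_means = toy.mean(axis=0)   # one value per column, indexed by column label
row_sums = toy.sum(axis=1)     # one value per row, indexed by row label
print(col_means)
print(row_sums)
```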

### Exercise 7.0

Print all years in which the lynx population is larger than the hare population and also larger than the carrot population.
You can create a boolean index suitable for use with the `.loc` attribute by applying a comparison operator such as `>` to columns. You can combine boolean indices using `&` (AND) and `|` (OR).
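
As a hint, here is the general pattern on an unrelated, made-up DataFrame (the column names `a`, `b` and `c` are placeholders). Each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>`.

```python
import pandas as pd

# Placeholder data, unrelated to the population file.
toy = pd.DataFrame({"a": [5, 1, 9], "b": [3, 2, 4], "c": [4, 0, 10]},
                   index=["x", "y", "z"])

# Each comparison yields a boolean Series aligned with the row index.
mask = (toy["a"] > toy["b"]) & (toy["a"] > toy["c"])

# .loc keeps only the rows where the mask is True.
selected = toy.loc[mask]
print(selected.index.tolist())
```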

In [None]:
# -------------------


## Matplotlib

The IPython command `%pylab inline` imports plotting functions from the `pylab` module and also all the functions from `numpy`. It's often better to avoid these imports and use prefixed functions instead, by using this command: `%pylab inline --no-import-all`. See [https://ipython.org/ipython-doc/dev/interactive/magics.html#magic-pylab](https://ipython.org/ipython-doc/dev/interactive/magics.html#magic-pylab) for more information.

In [None]:
%pylab inline --no-import-all

Now the modules `numpy` (also available as `np`) and `pyplot` (also available as `plt`) can be used in the notebook.
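
The `%pylab` magic only works inside IPython and Jupyter. In a plain Python script, roughly the same names can be obtained with explicit imports. This is a sketch of the equivalent, not an exact reproduction of everything the magic sets up.

```python
import numpy
import numpy as np
import matplotlib.pyplot as plt

# Both names refer to the same module object.
print(numpy is np)

x = np.linspace(0, 1, 5)
print(x)
```

In a notebook, `%matplotlib inline` followed by these imports gives inline figures without relying on `%pylab`.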

## Basic plots

### Line plot

In [None]:
x = numpy.linspace(-10, 10, 100)
y = numpy.sin(x)
plt.plot(x, y)

#### Customizing plots

In [None]:
# Change style of line
# Add line label
plt.plot(x, y, linewidth=3, linestyle='dashed', color='red', label='sin(x)')
# Add another line
plt.plot(x, numpy.cos(x), linewidth=3, linestyle='dotted', color='green', label='cos(x)')
# Change axis ranges
plt.xlim(-5, 5)
plt.ylim(-1.5, 1.5)
# Add x-axis label
plt.xlabel("x")
# Add title
plt.title("Sine and cosine functions")
# Add legend, which will use the labels added to lines
plt.legend(loc='upper right')
# Save plot
plt.savefig('sin-cos.png')
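
The same figure can also be built with matplotlib's object-oriented interface, where the figure and axes are explicit objects rather than implicit global state. This is an alternative sketch, not a replacement for the cell above.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 100)

# plt.subplots() returns the figure and a single axes object.
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), linewidth=3, linestyle='dashed', color='red', label='sin(x)')
ax.plot(x, np.cos(x), linewidth=3, linestyle='dotted', color='green', label='cos(x)')

# The plt.xlim / plt.xlabel / plt.title calls become set_* methods on the axes.
ax.set_xlim(-5, 5)
ax.set_ylim(-1.5, 1.5)
ax.set_xlabel("x")
ax.set_title("Sine and cosine functions")
ax.legend(loc='upper right')
```

The explicit form tends to pay off once a figure contains several axes, since every call states which axes it modifies.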

#### Exercise 7.1

Load the data from [populations.csv](populations.csv) into a pandas DataFrame. Create a line plot with the year on the x-axis and the population of each species on the y-axis. Add axis labels, a legend and a title. The line for each species should have a different color.


### Scatter plot

In [None]:
# Noisy sine function
x = numpy.linspace(-7, 7, 100)
e = numpy.random.normal(0, 0.3, 100)
y = numpy.sin(x) + e
plt.scatter(x, y)

#### Customizing points

In [None]:
# Assign random area sizes to points
size = numpy.random.uniform(1, 100, 100)
# Plot y-values below 0 in a different color
neg = y < 0.0
nonneg = y >= 0.0
# The sizes are masked as well, so that each call gets one size per point;
# alpha adds transparency to better see overlapping points
plt.scatter(x[nonneg], y[nonneg], s=size[nonneg], alpha=0.5, c='red', label='non-negative')
plt.scatter(x[neg], y[neg], s=size[neg], alpha=0.5, c='blue', label='negative')
plt.legend(loc='best')
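
An alternative to two separate `scatter` calls is to pass one color per point. The sketch below builds the color list with `numpy.where`. The trade-off is that a single call produces a single legend entry, so the two-call version above is the simpler choice when a legend is needed.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same kind of noisy sine data as above, with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
x = np.linspace(-7, 7, 100)
y = np.sin(x) + rng.normal(0, 0.3, 100)
size = rng.uniform(1, 100, 100)

# One color name per point: blue where y is negative, red otherwise.
colors = np.where(y < 0.0, 'blue', 'red').tolist()

fig, ax = plt.subplots()
points = ax.scatter(x, y, s=size, c=colors, alpha=0.5)
```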

#### Exercise 7.2

Load the iris dataset into a pandas DataFrame (you can use the function `pd.read_table` with the keyword argument `sep=r"\s+"` to read tabular data formatted as a whitespace-delimited text file; the older `delim_whitespace=True` keyword is deprecated in recent pandas versions).
Create a scatter plot of the first feature (sepal length) vs. the second feature (sepal width). Give the points for each species a different color. Add a title, axis labels, and a legend.

In [None]:
iris = pd.read_table("iris.txt", sep=r"\s+", names=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species"])

#### Exercise 7.3

Given an $m\times n$ matrix $X$ with $m$ data points and $n$ features, we can project the points to $d$-dimensional space and store the result in the matrix $X_d$ using Principal Component Analysis:
```python
from sklearn.decomposition import PCA
# d is the target number of dimensions
pca = PCA(n_components=d)
X_d = pca.fit_transform(X)
```
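
To make the shapes concrete, the following NumPy-only sketch computes the same kind of projection by centering the data and taking a singular value decomposition. It is meant to illustrate what PCA computes, not to reproduce scikit-learn's implementation (the signs of the components may differ, for instance), and the random matrix is a stand-in for real data.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 150, 4, 2
X = rng.normal(size=(m, n))          # placeholder data: m points, n features

X_centered = X - X.mean(axis=0)      # PCA operates on mean-centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal axes, ordered by decreasing singular value.
X_d = X_centered @ Vt[:d].T          # coordinates along the first d axes

print(X_d.shape)
```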

Project the 4-dimensional iris dataset to 2 dimensions using PCA and plot the result, with principal component 1 on the x-axis and principal component 2 on the y-axis. Label the axes and use different colors for each species.

#### Multiple plots

The command `plt.subplot(r, c, i)` allows us to display the $i$th plot in a figure consisting of $r$ rows and $c$ columns.



In [None]:
plt.figure(figsize=(8,6)) # Change width and height (inches)

plt.subplot(3,1,1)
plt.plot(x, numpy.sin(x), color='red')
plt.ylabel("sin(x)")

plt.subplot(3,1,2)
plt.plot(x, numpy.cos(x),color='blue')
plt.ylabel("cos(x)")

plt.subplot(3,1,3)
plt.plot(x, numpy.sin(x)+numpy.cos(x), color='purple')
plt.ylabel("sin(x)+cos(x)")
plt.xlabel("x")

plt.savefig('multiple.png')
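
`plt.subplots` (note the plural) is a related function that creates the whole grid at once and returns the figure together with an array of axes. A sketch of the same three-panel layout, with `sharex=True` so that the panels share one x-axis:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-7, 7, 100)
curves = [("sin(x)", np.sin(x), 'red'),
          ("cos(x)", np.cos(x), 'blue'),
          ("sin(x)+cos(x)", np.sin(x) + np.cos(x), 'purple')]

# A 3x1 grid; axes is a one-dimensional array holding the three axes objects.
fig, axes = plt.subplots(3, 1, figsize=(8, 6), sharex=True)
for ax, (label, y, color) in zip(axes, curves):
    ax.plot(x, y, color=color)
    ax.set_ylabel(label)

# Only the bottom panel needs an x-axis label when the axis is shared.
axes[-1].set_xlabel("x")
```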


#### Exercise 7.4

Load the data from [winequality-red.csv](winequality-red.csv) into a pandas DataFrame.
Create a figure with multiple subplots. Each subplot should be a scatterplot of one of the features in the data against the quality rating. Each subplot should also contain a legend with the name of the feature, and the correlation coefficient between it and the quality. 

- As an extra, add the linear regression line to each subplot.
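
As a starting point, the correlation coefficient and a least-squares line can both be computed with NumPy. The sketch uses synthetic data; the names `feature` and `quality` are placeholders, not columns read from the wine file.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.uniform(0, 10, 200)
quality = 0.5 * feature + rng.normal(0, 1, 200)     # synthetic, roughly linear relation

# Pearson correlation coefficient: off-diagonal entry of the 2x2 correlation matrix.
r = np.corrcoef(feature, quality)[0, 1]

# Degree-1 polynomial fit gives the slope and intercept of the regression line.
slope, intercept = np.polyfit(feature, quality, 1)

print(f"r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The fitted line can then be drawn with `plt.plot(xs, slope * xs + intercept)` for any array `xs` spanning the feature range, and `r` can be formatted into the `label` passed to `plt.scatter`.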