# Modules, Vectors and Plotting (Oh my!)

Now it gets a bit advanced.

## Modules

Modules are to Python what appliances are to a kitchen; they make a whole lotta stuff easier to do.  
Modules are pre-baked collections of functions, classes and data that fulfill some specific task, e.g. numerical applications, symbolic math, plotting or optimization.

To load a module, simply import it:

In [None]:
import sympy # Sympy is used for symbolic math

x, y = sympy.symbols('x y')
z = sympy.sin(x*y*sympy.sin(x))
print(z.diff(x))

To avoid having to write long names, you can import the module as an alias:

In [None]:
import sympy as sp

x, y = sp.symbols('x y')
z = sp.sinh(x*y)
print(z.diff(x)) # Still works!

If you need just a few functions, you can import only those functions:

In [None]:
from sympy import symbols, sinh

x, y = symbols('x y')
z = sinh(x*y)
print(z.diff(x)) # Still works!
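
The three import styles above all end up pointing at the same objects. Here is a minimal sketch of that, using the standard-library `math` module instead of sympy (an assumption made only so the example runs without extra packages):

```python
import math                 # plain import: names live under "math."
import math as m            # alias: same module, shorter name
from math import sqrt       # pull a single function into this namespace

# All three spellings refer to the very same function object
same_function = (math.sqrt is m.sqrt) and (m.sqrt is sqrt)
print(same_function)
print(sqrt(16.0))
```

Which style to use is mostly a matter of taste; aliases keep it obvious where a function comes from.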

Modules can have submodules. For example, the scipy module has a submodule that has to do with statistical functions:

In [None]:
import scipy.stats as st # Note the alias!

# Beta Prime distribution with shape params 2,2
bp = st.betaprime(2, 2)
print(bp.rvs(10)) # Generate some random numbers

The remainder of this session goes through the most commonly used modules for scientific work. Below are the modules I typically load when working.

In [None]:
#import numpy as np # Numerics (vectors, matrices, ...)
#import matplotlib.pyplot as plt # Plotting functions
#import seaborn as sns # Pretty plots!
#import pandas as pd # Structured data, .csv files, time series

Additional modules that may be worth knowing about:
 - Scipy (Huge collection of miscellaneous scientific functions, statistics, regression etc.)
 - Basemap (Nice plotting for maps)
 - Pygrib (Geographical data handling)
 - BeautifulSoup (Parsing .html and .xml files easily)
 - PyTables (Used for large datasets that can't be kept in memory)
 - Pyomo (Optimization modeling library)
 - Scikit-learn (Machine learning library - We will be using this later in the semester) 
 - NetworkX (Network/Graph data library)

## Vectors and linear algebra with Numpy

Numpy implements vector, matrix and tensor structures in Python. All calculations are implemented in high-performance C and FORTRAN libraries, which means calculations are a lot faster than writing the loops out in pure Python.
This is typical for Python libraries; even though you're writing in Python, the calculations in the background may take place in another language in order to make them faster. This section is not mandatory to do in class, and you can skip to the visualization part if you are short on time.
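
To get a feel for the speed difference, here is a rough sketch that computes the same sum of squares with a pure-Python loop and with Numpy. The array size `n` is an arbitrary choice, and the timings will vary from machine to machine, so treat the printed numbers as indicative only:

```python
import time
import numpy as np

n = 200_000                      # arbitrary problem size for illustration
values = list(range(n))
arr = np.arange(n, dtype=np.int64)

t0 = time.perf_counter()
loop_total = 0
for v in values:                 # pure-Python loop, one element at a time
    loop_total += v * v
t1 = time.perf_counter()
numpy_total = int(np.dot(arr, arr))  # same sum, computed in compiled code
t2 = time.perf_counter()

print('Same result:', loop_total == numpy_total)
print('Python loop: %.4f s' % (t1 - t0))
print('Numpy dot:   %.4f s' % (t2 - t1))
```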

In [None]:
import numpy as np

a = [1,2,3,4,5,6,7,8,9]
c = np.array(a)
print(c*3)
print(c**2)
print(c*c) # Equivalent to c.*c in MATLAB
print(np.dot(c,c)) # For 1-D vectors, orientation doesn't matter in dot products. Since Python 3.5, the @ operator
                   # can be used for matrix multiplication instead of np.dot
print(c.dot(c)) # Some functions are available on the vectors as well
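
The `@` operator (available since Python 3.5) does the same job as `np.dot` for vectors and matrices. A small sketch, reusing the vector from above plus an illustrative matrix `A` and vector `b`:

```python
import numpy as np

c = np.arange(1, 10)                   # 1, 2, ..., 9
A = np.array([[1, 2, 3], [4, 5, 6]])   # illustrative 2x3 matrix
b = np.array([7, 8, 9])

inner = c @ c      # inner product of two 1-D vectors
matvec = A @ b     # matrix-vector product
print(inner)       # 285
print(matvec)      # [ 50 122]
```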

Matrices are 2-D arrays, which can be built from lists of lists.

In [None]:
A = np.array([[1,2,3],[4,5,6]])
b = np.array([7,8,9])
print(A)
print(A.dot(b))
print(A.T) # Transposition
print(A.T.dot(b)) #Gives error due to orientation
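
When a product fails like this, checking the `.shape` of each operand usually shows why. The sketch below catches the error instead of crashing, and then multiplies `A.T` by a vector of matching length (`d` is a made-up vector for illustration):

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([7, 8, 9])
d = np.array([1, 2])  # hypothetical length-2 vector that fits A.T

print(A.shape, A.T.shape, b.shape)  # (2, 3) (3, 2) (3,)

try:
    A.T.dot(b)        # (3, 2) times (3,): inner dimensions differ
    failed = False
except ValueError as err:
    failed = True
    print('Shape mismatch:', err)

result = A.T.dot(d)   # (3, 2) times (2,) works
print(result)         # [ 9 12 15]
```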

Numpy has most of the "standard" library of linear algebra functions. (More functions are available in the SciPy module)

__EX:__ Generate a random 4x4 matrix A (Hint: np.random. )

Find the eigenvalues and eigenvectors of that matrix (Hint: np.linalg. )

Generate a random 4x4x4 array B.

Use Einstein summation (np.einsum?) to calculate the sum
$$\sum_{ij} A_{ij} B_{ijk} = c_k$$

Solve the system $ A \vec{x} = c $ (Hint: np.linalg.solve)
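
If the `np.einsum` notation is new to you, here is a warm-up on a different, smaller problem (this is not the solution to the exercise). Indices that appear in the inputs but not after the `->` are summed over; the matrix `M` and vector `v` are made up for illustration:

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])
v = np.array([10, 20])

# Matrix-vector product: sum over the repeated index j
matvec = np.einsum('ij,j->i', M, v)
# Trace: sum over the diagonal, no free index left
trace = np.einsum('ii->', M)

print(matvec)  # [ 50 110]
print(trace)   # 5
```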

In [None]:
# Write here

## Plotting data using Matplotlib

Matplotlib handles everything to do with plots.

In [None]:
import matplotlib.pyplot as plt
# Tell matplotlib that we want to have plots appear in the notebook
%matplotlib inline 

In [None]:
x = np.linspace(-10,10,201) #Generate evenly spaced points from -10 to 10
plt.plot(x, np.sin(x)) # A plot!
plt.plot(x, np.cos(x)) # Another line on the same plot!
# (In MATLAB parlance, hold is on by default)

In [None]:
# Close down your figures when done
# to free up RAM - run this box every once in a while
# although it shouldn't be necessary if you're running on your own laptop
plt.close('all')

Matplotlib supports subplots, legends and labels à la MATLAB. Plot parameters can be passed via keyword:

In [None]:
ax1 = plt.subplot(211)
plt.plot(x, np.cos(np.arctan(x/3)*x), label='Wobbly state')
plt.legend(frameon=True, fancybox=True)
plt.ylabel('Swing state')
ax2 = plt.subplot(212)
plt.plot(x, np.sin(x),
         label='Hi2', color='red',
         linestyle='--', linewidth=4)
plt.xlabel('Time delta [s]')
plt.ylabel('Swing state, pendulum 2')

2-D plots are available as well:

In [None]:
x = np.linspace(-5, 5, 101)
y = x.copy()
X, Y = np.meshgrid(x, y) # 2d arrays
plt.figure(figsize=(8, 6)) #Make a figure that will be 8 by 6 inches
# Contour plots allow for custom color maps, ranges, etc.
plt.contourf(X, Y, np.sin(X)*np.sin(Y), 21, # use 21 levels
             vmin=-0.7, vmax=0.7, extend='both',
             cmap=plt.cm.RdBu_r)
cb = plt.colorbar()
cb.set_label('Anomaly [cm]')
plt.draw() #Update the colorbar label

Once you're happy with a plot, you can save it to disk:

In [None]:
plt.savefig('myfig.pdf') # Move this to the cell above; save commands must be in the same cell as the plotting commands.
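
A sketch of keeping the plotting and saving together in one cell is shown below. The `Agg` backend line and the temporary output folder are assumptions made so the example also runs as a plain script; in the notebook you can drop the backend line and save straight to a file name of your choice:

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; only needed outside the notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 201)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label='sin(x)')
ax.legend()

# Save from the figure object, in the same cell that created it
out_path = os.path.join(tempfile.mkdtemp(), 'myfig.png')
fig.savefig(out_path, dpi=150)
plt.close(fig)

saved_ok = os.path.exists(out_path) and os.path.getsize(out_path) > 0
print(saved_ok)
```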

The standard matplotlib colors and layout are pretty basic. The seaborn package has some prettier standard settings, and some extra functions that are useful for data visualization. In older seaborn versions just importing the package was enough to add the new settings; in current versions you switch them on explicitly, for example with sns.set_style:

In [None]:
import seaborn as sns
sns.set_style('ticks') # I like the "ticks" style for plotting

x = np.linspace(-10,10,201)
plt.plot(x, np.sin(x)) # Now with more pretty!
plt.plot(x, np.cos(x))

__EX:__ Making plots that are both clear and concise is essential for communicating what you want with the data.

The plot below shows some temperature data with an increase over time, but the author clearly hasn't thought about how best to show this. Clean up the figure in some of the following ways:

- Look at the function sns.xkcd_palette, and the associated dictionary sns.xkcd_rgb, and use colors from this dictionary for the tasks below.
- Switch from using scatter to using plot with no line, give the points a nicer color, and use the alpha keyword to make the points slightly transparent.
- Use sns.regplot to plot a second-order regression line for the data. Remember to select nice colors! (Hint: Look at sns.regplot? for the parameters)
- Add labels to the axes
- Add labels to each of the plot commands, and add a legend to the plot
- Use plt.grid to add a grid. Set the grid to be transparent by using the alpha parameter.
- Use sns.despine to get rid of the upper and right-hand axes

In [None]:
x = np.arange(2*1980, 2*2014)/2.0
y = 20 + (x - 1950)**0.1*(1 + 0.1*np.random.random(x.shape)) \
+ 0.1*np.random.normal(size=x.shape)  # size= gives one noise value per point

In [None]:
plt.scatter(x, y)

## Structured data with Pandas

Pandas is both (i) a collection of data structures for structured data and (ii) a bunch of statistical functions.

At the center of the Pandas ecosystem is the DataFrame (similar to the data frames of the R language), which is a matrix with indices for rows and columns.
For instance, the code below builds DataFrames containing mean monthly temperatures for 3 US cities. (Source: dmi.dk)

In [None]:
import pandas as pd

daytemps = pd.DataFrame(
    index=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],
    data={  # Note the use of a dictionary here to name the columns
        'Baltimore':[6,8,14,19,25,29,31,31,27,21,15,8],
        'Los_Angeles':[19,19,19,20,21,22,24,25,25,24,21,19],
        'Anchorage':[-6,-3,1,6,12,16,18,17,13,5,-3,-5]}
    )
nighttemps = pd.DataFrame(
    index=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],
    data={  # Note the use of a dictionary here to name the columns
        'Baltimore':[-3,-2,3,8,14,19,22,21,17,10,5,0],
        'Los_Angeles':[9,10,10,12,14,15,17,18,17,15,12,9],
        'Anchorage':[-13,-11,-8,-2,4,8,11,10,5,-2,-9,-12]}
    )
print(daytemps)

Columns and rows are indexable as below:

In [None]:
print('Baltimore day temps:')
print(daytemps['Baltimore']) # Dictionary-style indexing of columns
print('Baltimore day temps:')
print(daytemps.Baltimore) # Columns are available this way if properly named
print('June day temps:')
print(daytemps.loc['Jun']) # Row-indexing by label uses the .loc indexer (.ix was removed in pandas 1.0)
print('Summer night temps:')
print(nighttemps.loc['Jun':'Aug']) # Slicing is also available
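
In current pandas, rows are selected by label with `.loc` and by integer position with `.iloc` (the older `.ix` indexer was removed in pandas 1.0). A small sketch of the difference, on a cut-down table with made-up column selection:

```python
import pandas as pd

temps = pd.DataFrame(
    index=['Jun', 'Jul', 'Aug', 'Sep'],
    data={'Baltimore': [29, 31, 31, 27],
          'Anchorage': [16, 18, 17, 13]})

by_label = temps.loc['Jun':'Aug']       # label slice: 'Aug' is included
by_position = temps.iloc[0:3]           # position slice: row 3 is excluded
single = temps.loc['Jul', 'Anchorage']  # one row label, one column label

print(by_label)
print(single)
```

Note that label slices include the end point, while position slices follow the usual Python convention and do not.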

Pandas hooks into matplotlib to plot data. Plots can be done directly from the dataframe. 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
sns.set_style('ticks')

daytemps.plot()

Note that Pandas will by default plot to a new figure when asked to plot multiple things (in MATLAB parlance, "hold" is off), so we have to do a bit more to get it to plot the day and night temps on the same axes.
Keywords for line attributes and labels are also supported.

In [None]:
daytemps.Baltimore.plot(color='black', label='Day, Baltimore')
ax = plt.gca() # Get the current axes
daytemps.Los_Angeles.plot(ax=ax, color='red', label='Day, Los Angeles')
nighttemps.Baltimore.plot(ax=ax, linestyle='--', color='black', label='Night, Baltimore')
nighttemps.Los_Angeles.plot(ax=ax, linestyle='--', color='red', label='Night, Los Angeles')
plt.ylabel('Temperature [°C]')
plt.legend(ncol=2, loc='lower center')

In most respects, DataFrames behave just like numpy arrays, and numpy functions work on them:

In [None]:
print('Mean day and night temperature differences:')
print((daytemps - nighttemps).mean())

print('RMS of day temperatures:')
print(np.sqrt((daytemps**2).mean(axis=1)))
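
The `axis` keyword decides the direction of the reduction: the default (`axis=0`) gives one value per column, while `axis=1` gives one value per row. A minimal sketch with a tiny made-up table (columns `A` and `B` are placeholders):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=['Jan', 'Feb'],
                  data={'A': [3.0, 0.0], 'B': [4.0, 0.0]})

rms_per_column = np.sqrt((df**2).mean())        # one value per column
rms_per_row = np.sqrt((df**2).mean(axis=1))     # one value per row

# Numpy functions return pandas objects with the labels intact
print(rms_per_column)
print(rms_per_row)
```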



__EX:__ Plot the day temperatures as a horizontal boxplot. (Hint: look at the possible options for plots in _daytemps.plot?_ )

Plot a scatterplot of day temperature vs. night temperature in each city. (Hint: A for-loop over the column names might be helpful here...)

In [None]:
#Type type type...

-----------

Pandas is useful for loading up and analyzing data. Below, we'll load up some data on various cars, contained in a .csv file.

In [None]:
cardata = pd.read_csv('16tstcar.csv')
print(cardata.iloc[0])
cardata.plot(kind='scatter', x='Equivalent Test Weight (lbs.)', y='RND_ADJ_FE')

cardata.plot(kind='scatter', x='Displacement (L)', y='RND_ADJ_FE')

Our data includes data points for both highway and town cycles, which could screw up our analysis. We should group the data points depending on which cycle they are from. The resulting object from the grouping (carbycycle) acts like a container of dataframes:

In [None]:
carbycycle = cardata.groupby('Test Category')
print(carbycycle.RND_ADJ_FE.mean()) # Show the mean MPG for each category
carhwy = carbycycle.get_group('HWY') # Only use highway data
print(carhwy.Model)
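
If you don't have the .csv file at hand, the same grouping steps can be tried on a tiny made-up table. The models, numbers and the 'FTP' category label below are invented for illustration and are not taken from the real data set:

```python
import pandas as pd

cars = pd.DataFrame({
    'Model': ['A', 'A', 'B', 'B'],                    # hypothetical models
    'Test Category': ['HWY', 'FTP', 'HWY', 'FTP'],    # 'FTP' is a made-up label
    'RND_ADJ_FE': [40.0, 30.0, 30.0, 20.0]})

by_cycle = cars.groupby('Test Category')
mean_fe = by_cycle['RND_ADJ_FE'].mean()   # one mean per category
hwy = by_cycle.get_group('HWY')           # an ordinary DataFrame

# The grouped object can be looped over as (name, dataframe) pairs
for name, group in by_cycle:
    print(name, len(group))
print(mean_fe)
```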

This is the end of this exercise. The rest is optional!

## Making your own modules

To make your own modules, simply place a .py file with the functions or data you want to import in the working directory.
I've placed a file called "myhelpers.py" in this folder. You can import it as you would any module:

In [None]:
import myhelpers as mh
import numpy as np

a = np.array([-3, 4, 5, 4, -12, 100, -22, 0])

print(mh.pos(a))
print(mh.neg(a))

Below is the code in myhelpers.py

In [None]:
from numpy import where


def pos(x):
    """
        Returns (x)_+, i.e.
        pos(x) = x if x > 0, else 0

        Input:
            x: numeric or array_like.

        Output:
            y: arraylike copy of x with negative entries set to 0
    """
    return where(x > 0, x, 0)


def neg(x):
    """
        Returns (x)_-, i.e.
        neg(x) = x if x < 0, else 0

        Input:
            x: numeric or array_like.

        Output:
            y: arraylike copy of x with positive entries set to 0
    """
    return where(x < 0, x, 0)
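
As a quick check of what these two helpers do, the sketch below repeats their one-line bodies (so it runs without the myhelpers.py file) and shows that the positive and negative parts add back up to the original array:

```python
import numpy as np


def pos(x):
    """Copy of x with negative entries set to 0."""
    return np.where(x > 0, x, 0)


def neg(x):
    """Copy of x with positive entries set to 0."""
    return np.where(x < 0, x, 0)


a = np.array([-3, 4, 5, 4, -12, 100, -22, 0])
print(pos(a))
print(neg(a))
print(np.array_equal(pos(a) + neg(a), a))  # True: the two parts sum to a
```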
