In [16]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 
%precision 2

# Import NumPy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.plotly as py
import plotly.graph_objs as go
import cufflinks as cf


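# pd.options.display.mpl_style was removed from recent pandas releases;
# setting a matplotlib style directly is the suggested replacement
plt.style.use('ggplot')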

cf.set_config_file(theme='ggplot', offline=False, world_readable=True, sharing=True)

## Recall from lab last week 09/12/2014

Previously discussed: 

* Reading in a CSV file into a pandas DataFrame
* Using histograms, scatterplots and boxplots as exploratory data analysis
* Summary statistics
* Functions to access a pandas DataFrame
* Defining your own functions and using loops

## Today, we will discuss the following:
* Brief introduction to Numpy, Scipy
    * Vectorizing functions
* More pandas and matplotlib
* Working in the command line
* Overview of git and Github

<a href=https://raw.githubusercontent.com/cs109/2014/master/labs/Lab3_Notes.ipynb download=Lab3_Notes.ipynb> Download this notebook from Github </a>

## Numpy

NumPy and SciPy are modules in Python for scientific computing.  [NumPy](http://www.numpy.org) lets you do fast, vectorized operations on arrays.  Why use this module?  

* It gives you the performance of low-level code (e.g. C or Fortran) with the convenience of writing in an interpreted scripting language, all without leaving Python. 
* It gives you a fast, memory-efficient multidimensional array called `ndarray`, which lets you perform vectorized operations and supports mathematical functions such as linear algebra and random number generation.
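To make "vectorized" concrete, here is a minimal sketch comparing an explicit Python loop with the equivalent NumPy expression. The variable names are made up for illustration.

```python
import numpy as np

values = list(range(10))

# Pure-Python approach: visit every element explicitly
squares_loop = [v ** 2 for v in values]

# NumPy approach: one expression applied to the whole array at once
arr = np.array(values)
squares_vec = arr ** 2

print(squares_loop)
print(squares_vec)
```

Both give the same numbers; the NumPy version pushes the loop down into compiled code, which is where the speed comes from on large arrays.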

To create a fast, multidimensional `ndarray` object, call `np.array()` on a Python `list` or `tuple`, or read the data in from a file. 

In [8]:
x = np.array([1,2,3,4])
y = np.array([[1,2], [3,4]])
x

array([1, 2, 3, 4])

In [9]:
y

array([[1, 2],
       [3, 4]])




#### Properties of NumPy arrays
There is a set of properties of the `ndarray` object, such as its dimensions, its size, etc.  

Property | Description
--- | ----
`y.shape` (or `np.shape(y)`) | Shape of the array (length along each dimension)
`y.size` (or `np.size(y)`) | Number of elements in the array 
`y.ndim` | Number of dimensions 


In [11]:
x.shape

(4,)

In [15]:
print(y.ndim)
y.shape

2


(2, 2)

#### Other ways to generate NumPy arrays

Function | Description
--- | ---
`np.arange(start,stop,step)` | Create a range from start up to, but not including, stop in increments of step
`np.linspace(start,stop,num)` | Create num evenly spaced values between start and stop (both ends included)
`np.logspace(start,stop,num,base)` | Create num values evenly spaced on a log scale, from `base**start` to `base**stop`
`np.eye(n)` | Generate an n x n identity matrix
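The cells below only exercise `np.arange` and `np.linspace`, so here is a small sketch of the other two functions in the table, plus a side-by-side of how `arange` and `linspace` treat the stop value. The numbers are arbitrary.

```python
import numpy as np

# 4 points evenly spaced on a log10 scale, from 10**0 to 10**3
powers_of_ten = np.logspace(0, 3, 4, base=10)

# 3 x 3 identity matrix
identity = np.eye(3)

# arange excludes the stop value; linspace includes it by default
a = np.arange(0, 1, 0.25)   # 0, 0.25, 0.5, 0.75
b = np.linspace(0, 1, 5)    # 0, 0.25, 0.5, 0.75, 1

print(powers_of_ten)
print(identity)
print(a)
print(b)
```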

In [8]:
np.arange(0, 21, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

In [26]:
# Try it: Create a numpy array from 0 to 20 in steps of size 2
# (stop is exclusive, so this stops at 18; use a stop of 21 to include 20)
np.arange(0,20,2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [33]:
# Try it: Create a numpy array from -10 to 10 in steps of 0.5 (INCLUDING the number 10)
np.arange(-10, 10.5, 0.5)

array([-10. ,  -9.5,  -9. ,  -8.5,  -8. ,  -7.5,  -7. ,  -6.5,  -6. ,
        -5.5,  -5. ,  -4.5,  -4. ,  -3.5,  -3. ,  -2.5,  -2. ,  -1.5,
        -1. ,  -0.5,   0. ,   0.5,   1. ,   1.5,   2. ,   2.5,   3. ,
         3.5,   4. ,   4.5,   5. ,   5.5,   6. ,   6.5,   7. ,   7.5,
         8. ,   8.5,   9. ,   9.5,  10. ])

In [31]:
# Try it: Create a numpy array from 100 to 1000 of length 10
np.linspace(100,1000,10)

array([  100.,   200.,   300.,   400.,   500.,   600.,   700.,   800.,
         900.,  1000.])

In addition, the `numpy.random` module can be used to create arrays of random numbers. 

In [32]:
from numpy import random

Function | Description
--- | ---
`np.random.randint(a, b, N)` | Generate N random integers from a (inclusive) to b (exclusive)
`np.random.rand(n, m)` | Generate uniform random numbers in [0, 1) of dim n x m
`np.random.randn(n, m)` | Generate standard normal random numbers of dim n x m
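A quick way to convince yourself of the bounds in this table is to draw a sample and inspect it. The sketch below fixes a seed (an arbitrary choice) so the draws are repeatable.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the draws are reproducible

rolls = np.random.randint(1, 7, 1000)  # integers 1..6; the upper bound 7 is excluded
u = np.random.rand(2, 3)               # uniform on [0, 1), shape (2, 3)
g = np.random.randn(1000)              # standard normal draws

print(rolls.min(), rolls.max())
print(u.shape)
print(g.mean(), g.std())
```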


In [50]:
np.random.randint(1, 100, 50)

d = np.random.rand(5,4)

print(d)
print()
print(d.reshape(2,10))
print()
d.flatten()

[[ 0.48  0.98  0.8   0.93]
 [ 0.36  0.47  0.83  0.66]
 [ 0.5   0.94  0.33  0.57]
 [ 0.82  0.72  0.93  0.86]
 [ 0.21  0.92  0.4   0.2 ]]

[[ 0.48  0.98  0.8   0.93  0.36  0.47  0.83  0.66  0.5   0.94]
 [ 0.33  0.57  0.82  0.72  0.93  0.86  0.21  0.92  0.4   0.2 ]]



array([ 0.48,  0.98,  0.8 ,  0.93,  0.36,  0.47,  0.83,  0.66,  0.5 ,
        0.94,  0.33,  0.57,  0.82,  0.72,  0.93,  0.86,  0.21,  0.92,
        0.4 ,  0.2 ])

In [14]:
# Try it: Create a numpy array filled with random samples 
# from a normal distribution of size 4 x 4

#### Reshaping, resizing and stacking NumPy arrays

To reshape an array, use `reshape()`:

In [15]:
z = np.random.rand(4,4)
z 

array([[ 0.34961451,  0.75618943,  0.85774252,  0.29423465],
       [ 0.72196235,  0.02541357,  0.7708488 ,  0.07240782],
       [ 0.54376752,  0.41193452,  0.40132359,  0.63399867],
       [ 0.12622657,  0.34662246,  0.27813886,  0.95162428]])

In [16]:
z.shape

(4, 4)

In [17]:
z.reshape((8,2)) # dim is now 8 x 2

array([[ 0.34961451,  0.75618943],
       [ 0.85774252,  0.29423465],
       [ 0.72196235,  0.02541357],
       [ 0.7708488 ,  0.07240782],
       [ 0.54376752,  0.41193452],
       [ 0.40132359,  0.63399867],
       [ 0.12622657,  0.34662246],
       [ 0.27813886,  0.95162428]])

To flatten an array (convert a higher dimensional array into a vector), use `flatten()`:

In [18]:
z.flatten()

array([ 0.34961451,  0.75618943,  0.85774252,  0.29423465,  0.72196235,
        0.02541357,  0.7708488 ,  0.07240782,  0.54376752,  0.41193452,
        0.40132359,  0.63399867,  0.12622657,  0.34662246,  0.27813886,
        0.95162428])
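Two details are worth knowing when reshaping, sketched below on a small integer array rather than the random `z` above: passing `-1` lets NumPy infer one dimension, and `flatten()` always returns a copy whereas `reshape()` returns a view when it can.

```python
import numpy as np

z = np.arange(16).reshape(4, 4)

r = z.reshape(-1, 2)  # -1 asks NumPy to infer the row count (8 here)
f = z.flatten()       # always a new copy of the data

shares_r = np.shares_memory(z, r)  # the reshaped view shares z's buffer
shares_f = np.shares_memory(z, f)  # the flattened copy does not

print(r.shape, shares_r)
print(f.shape, shares_f)
```

In practice this means that writing into `r` also changes `z`, while writing into `f` leaves `z` untouched.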

## Operating on NumPy arrays

#### Assigning values
To assign a value to a specific element in an `ndarray`, use the assignment operator. 

In [19]:
y = np.array([[1,2], [3,4]])
y.shape

(2, 2)

In [20]:
y[0,0] = 10
y 

array([[10,  2],
       [ 3,  4]])

#### Indexing and slicing arrays
To extract elements of a NumPy array, use the bracket operator and the slice (i.e. colon) operator.  To slice specific elements in the array, use `dat[lower:upper:step]`. To extract the diagonal (and subdiagonal) elements, use `np.diag()`. 

In [51]:
 # random samples from a uniform distribution between 0 and 1
dat = np.random.rand(4,4)
dat

array([[ 0.24,  0.17,  0.07,  0.91],
       [ 0.6 ,  0.03,  0.84,  0.32],
       [ 0.1 ,  0.78,  0.12,  0.19],
       [ 0.08,  0.88,  0.57,  0.01]])

In [52]:
dat[0, :] # row 1

array([ 0.24,  0.17,  0.07,  0.91])

In [53]:
dat[:, 0] # column 1

array([ 0.24,  0.6 ,  0.1 ,  0.08])

In [54]:
dat[0:3:2, 0] # first and third elements in column 1

array([ 0.24,  0.1 ])

In [55]:
np.diag(dat) # diagonal

array([ 0.24,  0.03,  0.12,  0.01])
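The off-diagonals are reached through the `k` argument of `np.diag()`. The sketch below uses a small integer array (not the random `dat` above) so the results are easy to check by eye.

```python
import numpy as np

arr = np.arange(16).reshape(4, 4)

main_diag = np.diag(arr)        # [0, 5, 10, 15]
super_diag = np.diag(arr, k=1)  # one above the main diagonal: [1, 6, 11]
sub_diag = np.diag(arr, k=-1)   # one below the main diagonal: [4, 9, 14]

# Slicing works per axis: every second row, columns 1 and 2
block = arr[::2, 1:3]

print(main_diag, super_diag, sub_diag)
print(block)
```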

In [26]:
np.arange(32).reshape((8, 4)) # returns an 8 x 4 array

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [65]:
dat = np.arange(10)

np.where(dat<5, 'less', 'high')

array(['less', 'less', 'less', 'less', 'less', 'high', 'high', 'high',
       'high', 'high'], 
      dtype='<U4')

#### Element-wise transformations on arrays
There are many vectorized wrappers around functions that take in one or more scalars and produce one or more scalars (e.g. `np.exp()`, `np.sqrt()`). These element-wise array functions are also known as NumPy `ufuncs`. 

Function | Description 
--- | --- 
`np.abs(x)` | absolute value of each element
`np.sqrt(x)` | square root of each element
`np.square(x)` | square of each element
`np.exp(x)` | exponential of each element
`np.maximum(x, y)` | element-wise maximum from two arrays x and y
`np.minimum(x,y)` | element-wise minimum
`np.sign(x)` | compute the sign of each element: 1 (pos), 0 (zero), -1 (neg)
`np.subtract(x, y)` | subtract elements in y from elements in x
`np.power(x, y)` | raise elements in first array x to powers in second array y
`np.where(cond, x, y)` | element-wise if-else: take from x where cond is true, otherwise from y
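Here is a short sketch applying a few of the functions from the table to two small arrays. The values are chosen only so that each result can be verified by hand.

```python
import numpy as np

x = np.array([-2.0, 0.0, 3.0])
y = np.array([1.0, 1.0, 1.0])

print(np.abs(x))              # [2. 0. 3.]
print(np.maximum(x, y))       # [1. 1. 3.]
print(np.sign(x))             # [-1.  0.  1.]
print(np.power(x, 2))         # [4. 0. 9.]
print(np.where(x > 0, x, 0))  # [0. 0. 3.]
```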



## Vectorizing functions

It is important to state again that you should avoid looping through elements in vectors if at all possible.  One way to get around that when writing functions is to use what are called **vectorized functions**.  Say you wrote a function `f` which accepts some input `x` and checks whether `x` is greater than or equal to 0.  


In [28]:
def f(x):
    if x >=0:
        return True
    else:
        return False

print(f(3))

True


If we give the function an array instead of just one value (e.g. 3), then Python will raise a `ValueError`, because the truth value of an array with more than one element is ambiguous inside the `if` statement.  The way to get around this is to **vectorize** the function.  

In [29]:
f_vec = np.vectorize(f)
z = np.arange(-5, 6)
z 

array([-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5])

In [30]:
f_vec(z)

array([False, False, False, False, False,  True,  True,  True,  True,
        True,  True], dtype=bool)
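If you want to see the failure for yourself, the sketch below calls the scalar-only `f` on an array inside a `try` block and then repeats the call through `np.vectorize`. The `raised` flag is just a helper name for this illustration.

```python
import numpy as np

def f(x):
    # Scalar-only version: `if` needs a single True/False value
    if x >= 0:
        return True
    else:
        return False

z = np.arange(-5, 6)

try:
    f(z)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

f_vec = np.vectorize(f)
result = f_vec(z)
print(result)
```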

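Instead of vectorizing the function, you can also write the function so that it accepts vectors from the beginning. This is usually the faster option, since `np.vectorize` is provided mainly for convenience and is essentially a loop under the hood. 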

In [31]:
def f(x):
    return (x >=0)

print(f(3))

True
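The same function also works on a whole array, because comparison operators apply element-wise. A small check, assuming the same input range as in the `np.vectorize` cell:

```python
import numpy as np

def f(x):
    # Comparison operators apply element-wise to arrays, so no `if` is needed
    return x >= 0

z = np.arange(-5, 6)
mask = f(z)

# Same answer as wrapping the function with np.vectorize
same = np.array_equal(mask, np.vectorize(f)(z))

print(mask)
print(same)
```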


# Scipy

Now that you know a little bit about [NumPy](numpy.html), let's turn to SciPy, a collection of mathematical and scientific modules built on top of NumPy.  For example, SciPy can handle multidimensional arrays, integration, linear algebra, statistics and optimization.  

In [32]:
# Import SciPy
import scipy

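The top-level `scipy` namespace has historically re-exported many NumPy functions, but it is clearer to import NumPy for array work and to import the SciPy submodules you need explicitly. The main SciPy module is made up of many [submodules containing specialized topics](http://docs.scipy.org/doc/scipy/reference/). 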

Favorite SciPy submodules | What does it contain? 
--- | --- 
`scipy.stats` | [statistics](http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html): random variables, probability density functions, cumulative distribution functions, survival functions
`scipy.integrate` | [integration](http://docs.scipy.org/doc/scipy/reference/tutorial/integrate.html): single, double, triple integration, trapezoidal rule, Simpson's rule, differential equation solvers
`scipy.signal` | [signal processing](http://docs.scipy.org/doc/scipy/reference/signal.html): wavelets, spectral densities, filters, B-splines
`scipy.optimize` | [optimization](http://docs.scipy.org/doc/scipy/reference/optimize.html): find roots, curve fitting, least squares, etc 
`scipy.special` | [special functions](http://docs.scipy.org/doc/scipy/reference/tutorial/special.html): very specialized functions in mathematical physics, e.g. Bessel, gamma
`scipy.linalg` | [linear algebra](http://docs.scipy.org/doc/scipy/reference/linalg.html): inverse of a matrix, determinant, Kronecker product, eigenvalue decomposition, SVD, functions for matrices (beyond those in `numpy.linalg`)

If you want to import a SciPy submodule (e.g. the statistics submodule `scipy.stats`), use 

In [66]:
from scipy import stats
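To give a flavour of what the submodules in the table do, here is a minimal sketch that touches three of them. The integrand, matrix and root-finding problem are arbitrary examples picked because the exact answers are known.

```python
import numpy as np
from scipy import integrate, linalg, optimize

# integrate: area under x**2 on [0, 1]; the exact answer is 1/3
area, abs_err = integrate.quad(lambda x: x ** 2, 0, 1)

# linalg: determinant of a 2 x 2 matrix; 1*4 - 2*3 = -2
det = linalg.det(np.array([[1.0, 2.0], [3.0, 4.0]]))

# optimize: root of x**2 - 2 on the interval [0, 2], i.e. sqrt(2)
root = optimize.brentq(lambda x: x ** 2 - 2, 0, 2)

print(area, det, root)
```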

#### scipy.stats 
Let's dive a bit deeper into `scipy.stats`. The real utility of this submodule is to access probability distribution functions (pdfs) and standard statistical tests (e.g. $t$-test).  

#### Probability distribution functions
There is a large collection of [continuous and discrete pdfs](http://docs.scipy.org/doc/scipy/reference/stats.html) in the `scipy.stats` submodule.  The syntax to simulate random variables from a specific pdf is the name of the distribution followed by `.rvs`. To generate $n=1000$ $N(0,1)$ random variables, 

In [17]:
from scipy.stats import norm
x = norm.rvs(loc = 0, scale = 1, size = 1000)

df = pd.DataFrame(x, columns=['vals'])

df.iplot(kind='hist', bins=20, title='Histogram of 1000 normal random variables')
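Besides `.rvs`, each distribution object exposes methods such as `.pdf`, `.cdf` and `.ppf`, and the submodule provides the standard tests mentioned above. The sketch below evaluates a few of these for the standard normal and runs a one-sample $t$-test on simulated data. The seed passed as `random_state` is an arbitrary choice for reproducibility.

```python
import numpy as np
from scipy import stats
from scipy.stats import norm

density_at_0 = norm.pdf(0)   # 1 / sqrt(2 * pi)
prob_below_0 = norm.cdf(0)   # 0.5 by symmetry
upper_975 = norm.ppf(0.975)  # roughly 1.96

# One-sample t-test of H0: mean == 0 on simulated N(0, 1) draws
sample = norm.rvs(loc=0, scale=1, size=1000, random_state=0)
t_stat, p_value = stats.ttest_1samp(sample, popmean=0)

print(density_at_0, prob_below_0, upper_975)
print(t_stat, p_value)
```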

# More Pandas and Matplotlib

## Motor Trend Car Road Tests Data

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). This dataset is available on Github in the [2014_data repository](https://github.com/cs109/2014_data) and is called `mtcars.csv`. 

## Reading in the mtcars data (CSV file) from the web

This is a `.csv` file, so we will use the function `read_csv()` that will read in a CSV file into a pandas DataFrame. 

In [3]:
url = 'https://raw.githubusercontent.com/cs109/2014_data/master/mtcars.csv'
mtcars = pd.read_csv(url, sep = ',', index_col=0)
mtcars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
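If you are offline, you can still see what `index_col=0` does by feeding `read_csv()` an in-memory string. In the sketch below, the five rows are copied from the output above and stand in for the remote file; the empty first header field is an assumption about how the real file is laid out.

```python
import io
import pandas as pd

# Stand-in for the remote file: the five rows displayed above
csv_text = """,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"""

# index_col=0 turns the first column (the car names) into the row index
cars = pd.read_csv(io.StringIO(csv_text), sep=",", index_col=0)

print(cars.shape)
print(cars.index.tolist())
```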


In [96]:
# DataFrame with 32 observations on 11 variables
mtcars.shape 

(32, 11)

In [37]:
# return the column names
mtcars.columns

Index([u'mpg', u'cyl', u'disp', u'hp', u'drat', u'wt', u'qsec', u'vs', u'am', u'gear', u'carb'], dtype='object')

Here is a table containing a description of all the column names. 

Column name | Description 
--- | --- 
mpg | Miles/(US) gallon
cyl | Number of cylinders
disp | Displacement (cu.in.)
hp | Gross horsepower
drat | Rear axle ratio
wt | Weight (lb/1000)
qsec | 1/4 mile time (seconds)
vs | Engine shape (0 = V-shaped, 1 = straight)
am | Transmission (0 = automatic, 1 = manual)
gear | Number of forward gears
carb | Number of carburetors


In [39]:
mtcars[25:] # rows 25 to end of data frame

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4
Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6
Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8
Volvo 142E,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2


In [40]:
# return index
mtcars.index

Index([u'Mazda RX4', u'Mazda RX4 Wag', u'Datsun 710', u'Hornet 4 Drive', u'Hornet Sportabout', u'Valiant', u'Duster 360', u'Merc 240D', u'Merc 230', u'Merc 280', u'Merc 280C', u'Merc 450SE', u'Merc 450SL', u'Merc 450SLC', u'Cadillac Fleetwood', u'Lincoln Continental', u'Chrysler Imperial', u'Fiat 128', u'Honda Civic', u'Toyota Corolla', u'Toyota Corona', u'Dodge Challenger', u'AMC Javelin', u'Camaro Z28', u'Pontiac Firebird', u'Fiat X1-9', u'Porsche 914-2', u'Lotus Europa', u'Ford Pantera L', u'Ferrari Dino', u'Maserati Bora', u'Volvo 142E'], dtype='object')

In [41]:
mtcars.loc['Maserati Bora'] # access a row by its index label (.ix was removed in newer pandas)

mpg      15.00
cyl       8.00
disp    301.00
hp      335.00
drat      3.54
wt        3.57
qsec     14.60
vs        0.00
am        1.00
gear      5.00
carb      8.00
Name: Maserati Bora, dtype: float64
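Row access comes in two flavours: by label with `.loc` and by integer position with `.iloc`. Here is a minimal sketch on a four-row frame built from values shown above; the name `cars` is used only for this illustration.

```python
import pandas as pd

cars = pd.DataFrame(
    {"mpg": [15.8, 19.7, 15.0, 21.4], "hp": [264, 175, 335, 109]},
    index=["Ford Pantera L", "Ferrari Dino", "Maserati Bora", "Volvo 142E"],
)

by_label = cars.loc["Maserati Bora"]         # label-based lookup
by_position = cars.iloc[2]                   # position-based lookup (third row)
one_value = cars.loc["Maserati Bora", "hp"]  # row label and column label together

print(by_label)
print(one_value)
```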

In [42]:
# What other methods are available when working with pandas DataFrames?
# type 'mtcars.' and then press <TAB>
# mtcars.<TAB>

# try it here

## Exploratory Data Analysis (EDA)

Even though they may look like continuous variables, `cyl`, `vs`, `am`, `gear` and `carb` are integer or categorical variables. First, let's look at some summary statistics of the mtcars data set. 

In [43]:
mtcars.describe()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,20.090625,6.1875,230.721875,146.6875,3.596563,3.21725,17.84875,0.4375,0.40625,3.6875,2.8125
std,6.026948,1.785922,123.938694,68.562868,0.534679,0.978457,1.786943,0.504016,0.498991,0.737804,1.6152
min,10.4,4.0,71.1,52.0,2.76,1.513,14.5,0.0,0.0,3.0,1.0
25%,15.425,4.0,120.825,96.5,3.08,2.58125,16.8925,0.0,0.0,3.0,2.0
50%,19.2,6.0,196.3,123.0,3.695,3.325,17.71,0.0,0.0,4.0,2.0
75%,22.8,8.0,326.0,180.0,3.92,3.61,18.9,1.0,1.0,4.0,4.0
max,33.9,8.0,472.0,335.0,4.93,5.424,22.9,1.0,1.0,5.0,8.0
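For a variable like `cyl` that only takes a few distinct values, the mean and quartiles above are less informative than a count per level. The sketch below uses only the `cyl` values of the 12 cars displayed earlier in this notebook (the `head()` rows and the rows from position 25 on), so the counts are for that subset, not the full data set.

```python
import pandas as pd

# cyl for the 12 cars shown earlier (5 head rows, then 7 tail rows)
cyl = pd.Series([6, 6, 4, 6, 8, 4, 4, 4, 8, 6, 8, 4], name="cyl")

counts = cyl.value_counts().sort_index()  # number of cars per cylinder count
as_category = cyl.astype("category")      # explicit categorical dtype

print(counts)
print(as_category.cat.categories.tolist())
```

On the full DataFrame the same idea would be `mtcars.cyl.value_counts()`.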


#### Using conditional statements

To check whether any or all elements of an array meet a certain criterion, use `any()` and `all()`. 

In [204]:
(mtcars.mpg >= 20).any()

True
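The comparison itself produces a boolean Series, so `all()` and `sum()` work the same way as `any()`. Here is a small sketch using the `mpg` values of the first five cars shown earlier:

```python
import pandas as pd

mpg = pd.Series([21.0, 21.0, 22.8, 21.4, 18.7], name="mpg")  # first five rows

any_ge_20 = (mpg >= 20).any()  # is at least one car at 20 mpg or more?
all_ge_20 = (mpg >= 20).all()  # are all of them?
n_ge_20 = (mpg >= 20).sum()    # how many? (True counts as 1)

print(any_ge_20, all_ge_20, n_ge_20)
```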

Let's look at the distribution of `mpg` using a histogram.

In [18]:
mtcars.mpg.iplot(kind='hist', bins=10, title='Distribution of MPG',
                xTitle='Miles Per Gallon', yTitle='Count')
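The cufflinks call above needs a plotly account when `offline=False`. If that is not available, a plain matplotlib histogram is a reasonable stand-in. The sketch below is an alternative, not the notebook's own method, and it uses only the 12 `mpg` values displayed earlier rather than the full column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

# mpg for the 12 cars shown earlier (head rows, then tail rows)
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(mpg, bins=10)
ax.set_xlabel("Miles per gallon")
ax.set_ylabel("Count")
ax.set_title("Distribution of MPG")
```

With the full DataFrame loaded, `mtcars.mpg.hist(bins=10)` gives the same kind of plot through pandas.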

### Relationship between cyl and mpg

In [19]:
# Relationship between cyl and mpg
mtcars[['cyl','mpg']].iplot(kind='scatter', x='cyl', y='mpg', mode='markers', 
                            colors='darkred', xTitle='Cylinder', yTitle='MPG', 
                            title='Relationship between cylinders and MPG')


### Relationship between horsepower and mpg


In [20]:
mtcars[['hp', 'mpg']].iplot(kind='scatter', mode='markers', x='hp', y='mpg', 
                            xTitle='Horsepower', yTitle='MPG', colors='green',
                            title='Relationship between horsepower and MPG')

### Generate scatter matrix (pairwise relationships)

In [21]:
df = mtcars[['mpg', 'hp', 'cyl']]
df.scatter_matrix()

# Use KDE for the diagonals later

### Spread and Ratio Charts

In [22]:
mtcars[['mpg', 'hp']].iplot(kind='ratio')

In [23]:
mtcars[['cyl', 'mpg']].iplot(kind='spread')

In [21]:
mtcars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2


### Box Plots

In [24]:
mtcars[['drat','wt', 'gear', 'carb', 'am']].iplot(kind='box')

### Bubble Plot

In [25]:
mtcars.iplot(kind='bubble', x='mpg', y='hp', size='cyl', xTitle='MPG', yTitle='Horsepower')

### Subplots

In [26]:
df = mtcars[['drat', 'gear', 'carb']]
df.iplot(subplots=True, shape=(3,1), shared_xaxes=True, fill=True, vertical_spacing=.05)

# Working on the command line

Now we will discuss working on the command line. For this section and the next section on git and GitHub we will use slides from the [Data Science Specialization](https://github.com/DataScienceSpecialization/courses/tree/master/01_DataScientistToolbox) course on Coursera.  These slides are available from:

* [Command line interface](https://github.com/DataScienceSpecialization/courses/tree/master/01_DataScientistToolbox/02_03_commandLineInterface) 


# Introduction to git and GitHub

Next we introduce git and GitHub. For this section we will also use slides from the [Data Science Specialization](https://github.com/DataScienceSpecialization/courses/tree/master/01_DataScientistToolbox) course on Coursera.  These slides are available from:

* [Introduction to git](https://github.com/DataScienceSpecialization/courses/tree/master/01_DataScientistToolbox/02_04_01_introToGit) 
* [Github](https://github.com/DataScienceSpecialization/courses/tree/master/01_DataScientistToolbox/02_05_github)
* [Create a new repo](https://github.com/DataScienceSpecialization/courses/tree/master/01_DataScientistToolbox/02_06_01_createNewRepo)
* [Fork a repository](https://github.com/DataScienceSpecialization/courses/tree/master/01_DataScientistToolbox/02_06_02_forkRepo)
* [Basic git commands](https://github.com/DataScienceSpecialization/courses/tree/master/01_DataScientistToolbox/02_07_01_basicGitCommands)
* [git workflow](https://github.com/DataScienceSpecialization/courses/tree/master/01_DataScientistToolbox/02_07_02_gitWorkflow)

Other useful resources for learning git and github: 
* [Interactive tutorial to learn git (only takes under 15 mins to complete!)](https://try.github.io/levels/1/challenges/1)
* [Github guides](https://guides.github.com)
* [git - the simple guide](http://rogerdudler.github.io/git-guide/)
* [Github Youtube videos](https://www.youtube.com/user/GitHubGuides)

# Your turn

* If you don't have a github account yet, [register for a github account](https://github.com/join)
* Use `git clone` to clone the [CS109 2014 course repository](https://github.com/cs109/2014) on Github
* Use `git clone` to clone the [CS109 2014 data repository](https://github.com/cs109/2014_data) on Github

