# Table of Contents
* Explanation of jupyter notebooks (1 min)
* Explanation of dataset and what we want to eventually do (2 min)
* Numpy arrays
    * Broadcasting
    * Boolean indexing
    * Quiz
    * Random number generation
    * Helpful documentation
    * Quiz?
* Pandas dataframe (hopefully no more than half as long as numpy)
    * Column indexing, row indexing
    * Making new columns, etc
    * Importing the dataset
* Working with the dataset
    * Quick explanation of matplotlib 
    * EDA
    * Quiz: Split into train, test sets
* Scikit-learn (very brief)
    * .fit, .predict
    * metrics for evaluation (e.g. r^2, rmse)

## Jupyter Notebooks

Jupyter notebooks are documents containing computer code (e.g. python) and rich text elements (paragraphs, $\LaTeX$, figures, links, etc).

Notebook documents are both human-readable documents containing the analysis description and the results (figures, tables, etc.) and executable documents that can be run to perform data analysis.

Notebooks are a common part of a data scientist's workflow in industry.

# TODO little explanation about how to add cells, how memory persists, and some keyboard shortcuts

## NumPy

NumPy is a Python package for working with large, often multidimensional arrays. It also provides many functions for mathematical operations on scalars and arrays.

In [2]:
import numpy as np

In [3]:
print(np.array(range(10)))
print(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))

[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
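As a small sketch of the "multidimensional" and "scalars or arrays" points above, the cell below builds a 2-D array and applies the same function to a scalar and to an array. The variable names (`grid`, `roots`, `col_sums`) are made up for illustration.

```python
import numpy as np

# A 2-D array built from a nested list: 2 rows, 3 columns.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.shape)  # (rows, columns)
print(grid.ndim)   # number of dimensions

# The same function accepts a scalar or a whole array.
print(np.sqrt(16))
roots = np.sqrt(np.array([1, 4, 9]))
print(roots)

# Reductions can run over the whole array or along one axis.
total = grid.sum()            # sum of every element
col_sums = grid.sum(axis=0)   # one sum per column
print(total, col_sums)
```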


### Vectorization

A common task is applying a transformation to data. If you store your data as a list and want to add `1` to every element, you have to iterate over the list, like so.

I'm using the `time` library to demonstrate performance.

In [5]:
import time

n = 10**6

start = time.time()
foo = list(range(n))
for i in range(len(foo)):
    foo[i] = foo[i] + 1
end = time.time()
print("For loop:", end - start)

For loop: 0.2199418544769287


In [6]:
# slightly better: list comprehensions
start = time.time()
foo = [elem + 1 for elem in foo]
end = time.time()
print(end - start)

0.12396693229675293


In [7]:
# TODO: leave this blank, fill in during presentation?
foo = np.array(range(n))
start = time.time()
foo = foo + 1  # this is applied element-wise to the array foo
end = time.time()
print(end - start) 

0.003998994827270508


Note the syntax: `foo = foo + 1`. The `+ 1` gets *broadcast* to every element of the array `foo`. 

Not only is this much faster, your code is also more elegant: instead of writing a `for` loop every time you need to do something with or to your data, you can write code that looks a lot more like a page of notes from 21-241.
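Broadcasting is not limited to scalars. The sketch below, using made-up arrays (`matrix`, `row`, `col`), shows a scalar, a row, and a column each being stretched to match a 2-D array. It illustrates the idea and does not cover every broadcasting rule.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])     # shape (2, 3)
row = np.array([10, 20, 30])       # shape (3,)
col = np.array([[100], [200]])     # shape (2, 1)

# Scalar broadcast: 1 is added to every element.
plus_one = matrix + 1

# Row broadcast: `row` is added to each row of `matrix`.
shifted = matrix + row

# Column broadcast: each entry of `col` is added across one row of `matrix`.
offset = matrix + col

print(plus_one)
print(shifted)
print(offset)
```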

Below is another example of what I'm talking about:

In [9]:
n = 10**6

a = list(range(n))
b = list(range(n))

start = time.time()
c = list(range(n))
for i in range(len(c)):
    c[i] = a[i] + b[i]
end = time.time()
print("regular:", end - start)

start = time.time()
c = [a[i] + b[i] for i in range(len(a))]
end = time.time()
print("list comprehension:", end - start)

a = np.array(a)
b = np.array(b)
start = time.time()
c = a + b
end = time.time()
print("NumPy:", end - start)

regular: 0.22394275665283203
list comprehension: 0.19593238830566406
NumPy: 0.027993440628051758
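A single `time.time()` measurement can be noisy. As an alternative sketch, the standard-library `timeit` module runs a function many times and lets you keep the best result. The helper names (`add_one_list`, `add_one_numpy`) and the repeat counts are arbitrary choices for illustration, and the exact numbers will vary by machine.

```python
import timeit
import numpy as np

n = 10**5  # smaller than above so the repeated runs stay quick

a_list = list(range(n))
a_arr = np.arange(n)

def add_one_list():
    # Pure-Python version: one addition per loop iteration.
    return [x + 1 for x in a_list]

def add_one_numpy():
    # Vectorized version: one call that adds 1 to every element.
    return a_arr + 1

# Call each function 20 times per trial, run 5 trials, keep the fastest trial.
list_time = min(timeit.repeat(add_one_list, number=20, repeat=5))
numpy_time = min(timeit.repeat(add_one_numpy, number=20, repeat=5))

print("list comprehension:", list_time)
print("NumPy:", numpy_time)
```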


### Boolean Indexing

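Subsetting data based on a condition is straightforward in NumPy: a comparison such as `foo < 10` produces an array of booleans, and indexing with that array keeps only the elements where it is `True`.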

In [11]:
# elements of foo that are < 10
[x for x in foo if x < 10]

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [14]:
inds = foo < 10
print(inds)
print(foo[inds])

[ True  True  True ..., False False False]
[1 2 3 4 5 6 7 8 9]
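Boolean masks can also be combined. The following sketch uses a small made-up array (`values`) to show the `&`, `|`, and `~` operators, plus assignment through a mask.

```python
import numpy as np

values = np.arange(1, 21)  # 1, 2, ..., 20

# Combine masks with & (and), | (or), and ~ (not). The parentheses are
# required because these operators bind more tightly than comparisons.
between = values[(values > 5) & (values <= 10)]
extremes = values[(values < 3) | (values > 18)]
not_small = values[~(values < 18)]

# A mask can also select the elements to overwrite.
capped = values.copy()
capped[capped > 15] = 15

print(between)
print(extremes)
print(not_small)
print(capped)
```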


In [15]:
# Quiz: Get a list of all the odd elements of foo

### Random Number Generation

NumPy comes with many useful functions, including ones for random number generation in the `np.random` module.

The [documentation](https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html) is quite helpful.

In [26]:
np.random.choice(["A", "B", "C", "D"], size=6, replace=True)

array(['D', 'C', 'D', 'B', 'D', 'B'], 
      dtype='<U1')
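Random draws change on every run, which makes results hard to reproduce. One common approach, sketched here with an arbitrary seed of `0`, is to fix the seed before drawing. The variable names are made up for illustration.

```python
import numpy as np

np.random.seed(0)            # fix the seed so the draws repeat
first = np.random.randn(3)   # 3 draws from a standard normal
np.random.seed(0)
second = np.random.randn(3)  # same seed, so the same 3 draws

# Integers from 1 up to (but not including) 7, like rolling a die.
dice = np.random.randint(1, 7, size=10)

# replace=False samples without repetition.
no_repeat = np.random.choice(["A", "B", "C", "D"], size=3, replace=False)

print(first, second)
print(dice)
print(no_repeat)
```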

## Pandas

Pandas gives us the `DataFrame` data structure, which is essentially a spreadsheet. All columns must have the same length, but each column can have its own type.

A dataframe can be indexed by row or by column. You select rows by integer position with `.iloc`, which uses the same slicing syntax as a Python list, and you select a column by its string name.

Example below.

In [28]:
import pandas as pd

df = pd.DataFrame({
    "A": list(range(10)),
    "B": np.random.randn(10),
    "C": 7,
    "D": ["this is a string" for i in range(10)],
})

df

Unnamed: 0,A,B,C,D
0,0,-0.855098,7,this is a string
1,1,1.825408,7,this is a string
2,2,-0.813736,7,this is a string
3,3,0.268079,7,this is a string
4,4,1.159072,7,this is a string
5,5,1.305783,7,this is a string
6,6,-0.943045,7,this is a string
7,7,1.601139,7,this is a string
8,8,-0.181445,7,this is a string
9,9,0.356991,7,this is a string


In [29]:
df.head()

Unnamed: 0,A,B,C,D
0,0,-0.855098,7,this is a string
1,1,1.825408,7,this is a string
2,2,-0.813736,7,this is a string
3,3,0.268079,7,this is a string
4,4,1.159072,7,this is a string


In [32]:
df.iloc[3:7]  # iloc stands for "integer location"; use it to select rows by position

Unnamed: 0,A,B,C,D
3,3,0.268079,7,this is a string
4,4,1.159072,7,this is a string
5,5,1.305783,7,this is a string
6,6,-0.943045,7,this is a string
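To make the row and column indexing described earlier concrete, here is a minimal sketch on a small made-up dataframe (`demo`). It shows column selection by name, row selection by position, and the same boolean indexing used with NumPy arrays.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "A": list(range(10)),
    "B": np.arange(10) * 0.5,
})

col_a = demo["A"]            # one column, returned as a Series
two_cols = demo[["A", "B"]]  # a list of names returns a DataFrame
row_3 = demo.iloc[3]         # one row by integer position, as a Series
rows = demo.iloc[3:7]        # slice of rows, end-exclusive like a list

# Boolean indexing keeps the rows where the condition is True.
big_b = demo[demo["B"] > 3.0]

print(rows)
print(big_b)
```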


In [36]:
df["new col"] = np.random.randn(len(df))
df.head()

Unnamed: 0,A,B,C,D,new col
0,0,-0.855098,7,this is a string,-1.635934
1,1,1.825408,7,this is a string,-0.369859
2,2,-0.813736,7,this is a string,0.697543
3,3,0.268079,7,this is a string,-0.555294
4,4,1.159072,7,this is a string,1.059496
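New columns are often computed from existing ones. A short sketch, with made-up column names, of element-wise arithmetic and comparisons between columns:

```python
import pandas as pd

demo = pd.DataFrame({"A": [0, 1, 2, 3], "C": 7})

# Arithmetic on columns is element-wise, as with NumPy arrays.
demo["A_plus_C"] = demo["A"] + demo["C"]
demo["A_squared"] = demo["A"] ** 2

# A comparison produces a boolean column.
demo["A_is_large"] = demo["A"] >= 2

# drop returns a new DataFrame and leaves `demo` unchanged.
trimmed = demo.drop(columns=["C"])

print(demo)
print(trimmed.columns.tolist())
```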


In [33]:
# Kaggle Competition: Predicting housing prices from data about the house
data = pd.read_csv("train.csv") 
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
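After loading a file, it helps to check its size, column types, and missing values. The sketch below does not read `train.csv`; it reads a tiny made-up CSV from a string so it runs on its own. The column names are borrowed from the table above, and the rows are invented for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for a file on disk; blank fields become NaN.
csv_text = """Id,LotFrontage,LotArea,Alley,SalePrice
1,50.0,8000,,100000
2,,9000,,150000
3,70.0,10000,Pave,200000
"""

houses = pd.read_csv(io.StringIO(csv_text))

print(houses.shape)   # (rows, columns)
print(houses.dtypes)  # inferred type of each column

missing = houses.isna().sum()  # count of missing values per column
print(missing)
```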


In [39]:
# Features we'll look at:
# OverallQual: Rates the overall material and finish of the house
# OverallCond: Rates the overall condition of the house
# GarageArea: Size of garage in square feet
# YrSold: Year Sold (YYYY)
# LotFrontage: Linear feet of street connected to property
# LotArea: Lot size in square feet
# YearBuilt: Original construction date

data_subset = data[["SalePrice", "OverallQual", "OverallCond", "GarageArea",
                    "YrSold", "LotArea", "LotFrontage", "YearBuilt"]]
data_subset.head()

Unnamed: 0,SalePrice,OverallQual,OverallCond,GarageArea,YrSold,LotArea,LotFrontage,YearBuilt
0,208500,7,5,548,2008,8450,65.0,2003
1,181500,6,8,460,2007,9600,80.0,1976
2,223500,7,5,608,2008,11250,68.0,2001
3,140000,7,5,642,2006,9550,60.0,1915
4,250000,8,5,836,2008,14260,84.0,2000
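A possible first step in exploring a subset like this is to summarize each column and see how each feature relates to `SalePrice`. The sketch below uses a tiny made-up dataframe (`toy`), not the real data, so the numbers it prints say nothing about the actual dataset.

```python
import pandas as pd

# Invented values; only the column names come from the dataset.
toy = pd.DataFrame({
    "SalePrice":   [100000, 150000, 200000, 250000],
    "OverallQual": [4, 5, 7, 8],
    "LotFrontage": [60.0, None, 70.0, 80.0],
})

# Count, mean, std, min, quartiles, and max for each numeric column.
summary = toy.describe()

# Pearson correlation of every column with the target column.
corr_with_price = toy.corr()["SalePrice"]

print(summary)
print(corr_with_price)
```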
