## Week 1: Introduction

Instructor: Cornelia Ilin <br>
Email: cilin@ischool.berkeley.edu <br>

This notebook introduces you to running an IPython (Jupyter) notebook and to the basics of numpy, matplotlib, and sklearn, which you'll use extensively in this course.

Read through the commands, try making changes, and make sure you understand how the plots below are generated.

In your projects, focus on making your code as organized and readable as possible. Use plenty of comments, as in the code below.

You should also familiarize yourself with the keyboard shortcuts for moving between cells and running cells. Ctrl-ENTER runs a cell, while Shift-ENTER runs a cell and advances focus to the next cell.

### Documentation

[1] Generate evenly spaced numbers: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html

[2] Python's lambda syntax: <br>
http://www.python-course.eu/lambda.php

[3] Return a sample from the standard Normal distribution: <br>
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randn.html#numpy.random.randn

[4] Sklearn documentation for linear regression: <br>
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

[5] Sklearn documentation for the PolynomialFeatures preprocessor: <br>
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

### Step 1: Define classes

### Step 2: Define functions

### Step 3: Import packages

In [None]:
# standard 
import numpy as np

# plots
import matplotlib.pyplot as plt

# prediction
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# This tells matplotlib to draw plots inside the notebook instead of opening a new window for each plot
%matplotlib inline

### Step 4: Set working directories

### Step 5: Define global variables

In [None]:
# set the random seed (this ensures the results are the same each time)
np.random.seed(100)

# set len(X)
len_X = 20

### Step 6: Read data

This time we will generate our own data (X, y): X is an evenly spaced grid, and y is a known function of X plus noise drawn from a random number generator.
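
As a small aside on why the seed matters when data comes from a random number generator, the sketch below re-seeds numpy's global generator and checks that the same draws come back. The variable names (`first_draw`, `second_draw`, `scaled`) are illustrative only, and the factor 0.2 simply mirrors the noise scale used later in this notebook.

```python
import numpy as np

# Seeding the global generator makes the sequence of draws repeatable.
np.random.seed(100)
first_draw = np.random.randn(3)

# Re-seeding with the same value restarts the same sequence.
np.random.seed(100)
second_draw = np.random.randn(3)

print(np.array_equal(first_draw, second_draw))  # True

# Multiplying standard Normal draws by a constant rescales their spread:
# the sample standard deviation should come out close to 0.2.
scaled = np.random.randn(100000) * 0.2
print(scaled.std())
```

Without the call to `np.random.seed`, each run of the notebook would produce different noise and therefore slightly different fitted models.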

Generate X:

In [None]:
# generate evenly spaced X values in [0, 1]. Set len(X) = 20
X = np.linspace(0, 1, len_X)

In [None]:
print("Question 1: What is the data type of X?")
print(type(X))
print(X)
print(X.shape)  # 1D ndarray with len_X elements

Generate y:

In [None]:
# create a "true" function (a piece of a cosine curve) that we will try to approximate with a model
true_function = lambda x: np.cos(1.5 * np.pi * x)

# try this function out. Notice that it works on a scalar or on an array (and on a pandas Series as well)
print(true_function(0))
print(true_function(0.5))
print(true_function(np.array([0, 0.5])))

In [None]:
# generate true y values
y = true_function(X)

# print the values of y to the nearest hundredth
print (['%.2f' %i for i in y])

In [None]:
# add random noise to y
# the randn function samples random numbers from the standard Normal distribution
# multiplying adjusts the standard deviation of the distribution
noise = np.random.randn(len_X) * 0.2
y += noise

# print the noise-added values of y for comparison.
print (['%.2f' %i for i in y])

Next, we want to predict y, using the feature vector X. 

In this course, our outputs (y) will always be 1-dimensional. Our inputs (X) will usually have more than 1 dimension. Today, for simplicity, we have just a single feature. 

Since the machine learning classes in sklearn expect input feature vectors, we need to turn each input x in X into a feature vector [x].
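
To make the shape change concrete, here is a minimal sketch on a 5-element array (the names `x_1d` and `x_2d` are illustrative and not part of the notebook). It shows that indexing with `np.newaxis` turns a 1D array of scalars into a 2D column in which each row is a feature vector of length 1.

```python
import numpy as np

x_1d = np.linspace(0, 1, 5)   # shape (5,): five scalar inputs
x_2d = x_1d[:, np.newaxis]    # shape (5, 1): five feature vectors of length 1

print(x_1d.shape)  # (5,)
print(x_2d.shape)  # (5, 1)

# reshape(-1, 1) is an equivalent way to build the same column
print(np.array_equal(x_2d, x_1d.reshape(-1, 1)))  # True
```

sklearn estimators expect inputs of shape (n_samples, n_features), which is why the single-feature case needs the extra axis.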

### Step 7: Clean data

In [None]:
# transform X from a 1D array of shape (len_X,) into a 2D array of shape (len_X, 1)
X = X[:, np.newaxis]
print(X)


### Step 8: Analysis (model fit)

###### Linear model

In [None]:
# model fit
lm = LinearRegression(fit_intercept = True)
lm.fit(X, y)
lm_yhat = lm.predict(X)
print(lm.intercept_)
print(lm.coef_)
print ('Estimated function: y = %.2f + %.2fx' %(lm.intercept_, lm.coef_[0]))
print('Predicted values:')
print(lm_yhat)

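Approximating a cosine function with a linear model doesn't work well. By adding polynomial transformations of our feature(s), we can fit more complex functions. This is often called polynomial regression: the fitted curve is nonlinear in x, but the model is still linear in its coefficients, so we can keep using LinearRegression.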
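
As a rough sketch of what the polynomial transformation does, the example below expands two inputs into their powers and then chains the expansion and the regression with sklearn's `Pipeline` (imported above). The data (`X_small`, `X_demo`, `y_demo`) and the step names `"poly"` and `"linear"` are made up for illustration; the quadratic is noise-free so that the fit should recover its coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Each input x becomes the feature vector [x, x^2, x^3]
X_small = np.array([[2.0], [3.0]])
expanded = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X_small)
print(expanded)  # rows: [2, 4, 8] and [3, 9, 27]

# Noise-free quadratic data, chosen here only for illustration:
# y = 1 + 2x + 3x^2
X_demo = np.linspace(0, 1, 10)[:, np.newaxis]
y_demo = 1 + 2 * X_demo[:, 0] + 3 * X_demo[:, 0] ** 2

# The pipeline applies the polynomial expansion, then fits a linear model
poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("linear", LinearRegression(fit_intercept=True)),
])
poly_model.fit(X_demo, y_demo)

# The fitted values should be close to intercept 1 and coefficients [2, 3]
linear_step = poly_model.named_steps["linear"]
print(linear_step.intercept_)
print(linear_step.coef_)
```

Using a pipeline is one possible design choice: it keeps the transformation and the model together, so `poly_model.predict` can be called directly on raw inputs.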

###### Nonlinear model (poly degree==4)

In [None]:
# create polynomial transformations
poly = PolynomialFeatures(degree=4, include_bias=False)
X4 = poly.fit_transform(X)
print(X4)

In [None]:
# model fit
lm4 = LinearRegression(fit_intercept=True)
lm4.fit(X4, y)
lm4_yhat = lm4.predict(X4)
print(lm4.intercept_)
print(lm4.coef_)
print ('Estimated function: y = %.2f + %.2fx + %.2fx^2 + %.2fx^3 + %.2fx^4' %(lm4.intercept_, lm4.coef_[0], lm4.coef_[1], lm4.coef_[2], lm4.coef_[3]))
print('Predicted values:')
print(lm4_yhat)

###### Nonlinear model (poly degree==15)

In [None]:
# create polynomial transformations
poly = PolynomialFeatures(degree=15, include_bias=False)
X15 = poly.fit_transform(X)
print(X15[0:3])

In [None]:
# model fit
lm15 = LinearRegression(fit_intercept=True)
lm15.fit(X15, y)
lm15_yhat = lm15.predict(X15)
print(lm15.intercept_)
print(lm15.coef_)
print('Predicted values:')
print(lm15_yhat)

### Step 9: Plots

In [None]:
degrees = [1, 4, 15]

# Initialize a new plot and set plot size
plt.figure(figsize=(14, 4)) 

for i in range(len(degrees)):
    # create subplots that are all on the same row
    ax = plt.subplot(1, len(degrees), i+1)
    
    # create the polynomial feature vector (or matrix)
    poly = PolynomialFeatures(degree = degrees[i], include_bias = False)
    temp_X = poly.fit_transform(X)
    
    # model fit
    lm = LinearRegression()
    lm.fit(temp_X, y)
    lm_yhat = lm.predict(temp_X)

    # plot the true function
    plt.plot(X, true_function(X), label="True function");

    # plot the true function with noise added
    plt.scatter(X, y, label="Function with noise");

    # show the fitted function for the current polynomial degree
    plt.plot(X, lm_yhat, label="Fitted model");

    # Add labels, title, legend to the plot
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((-.05, 1.05))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree %d" %degrees[i])

### Conclusions

The machine learning lesson here is that we want the simplest model that still fits the data well. The degree 1 model is very small (only 2 parameters) but clearly doesn't fit the observed data well. The degree 15 model (16 parameters, estimated from only 20 examples) fits the observed data extremely well but is unlikely to generalize to new data. This is a case of "over-fitting", which often happens when we try to estimate too many parameters from just a few examples. The degree 4 model (5 parameters) appears to strike a good balance between small model size and good generalization.
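
One way to check the generalization claim is to compare the error on the training points with the error on points the model has not seen. The sketch below is an illustration under assumptions rather than part of the original analysis: the held-out set (`X_test`, `y_test`), its size of 200, and the use of mean squared error are all choices made here for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_function(x):
    """Same piece of a cosine curve used in the notebook."""
    return np.cos(1.5 * np.pi * x)


rng = np.random.RandomState(100)

# 20 evenly spaced training points with Normal noise (std 0.2)
X_train = np.linspace(0, 1, 20)[:, np.newaxis]
y_train = true_function(X_train[:, 0]) + rng.randn(20) * 0.2

# Hypothetical held-out points drawn from the same process
X_test = rng.uniform(0, 1, 200)[:, np.newaxis]
y_test = true_function(X_test[:, 0]) + rng.randn(200) * 0.2

train_mse, test_mse = {}, {}
for degree in [1, 4, 15]:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("linear", LinearRegression()),
    ])
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print("Degree %2d: train MSE = %.4f, held-out MSE = %.4f"
          % (degree, train_mse[degree], test_mse[degree]))
```

If the degree 15 model is over-fitting, its training error should be the smallest of the three while its held-out error is noticeably larger than its own training error. The exact numbers depend on the seed and on the held-out sample.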