<h1>Introduction to Linear Regression</h1>

## NumPy

In [1]:
import numpy as np

NumPy is a Python library that adds support for large, multi-dimensional arrays and matrices. It also includes many high-level mathematical functions that operate on these arrays.

<h4> NumPy Arrays </h4>

NumPy arrays are commonly used to store numerical data and to represent vectors and matrices. They are faster and more compact than Python lists.
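
As a small illustration of the difference, the sketch below doubles the same values first as a plain list and then as an array. The names `prices` and `prices_arr` are made up for this example.

```python
import numpy as np

prices = [10.0, 20.0, 30.0]        # plain Python list (hypothetical data)
prices_arr = np.array(prices)      # the same values as a NumPy array

# With a list, element-wise math needs an explicit loop or comprehension
doubled_list = [p * 2 for p in prices]

# With an array, arithmetic is applied to every element at once
doubled_arr = prices_arr * 2

print(doubled_list)
print(doubled_arr)
print(prices_arr.mean(), prices_arr.dtype)
```

Every element of an array shares one data type (`dtype`), which is part of what makes arrays compact.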

In [3]:
# Creating a basic array
L1 = np.array([1,2,3,4,5])

# Creating an array of all zeros
L2 = np.zeros(10)

# Creating a 2-dimensional array
L3 = np.array([[1,2,3], [4,5,6], [7,8,9]])

# Creating an array using a range
L4 = np.arange(0,10,2)

print("L1: ", L1)
print("L2: ", L2)
print("L3: \n", L3)
print("L4: ", L4)

L1:  [1 2 3 4 5]
L2:  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
L3: 
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
L4:  [0 2 4 6 8]


You can quickly find the shape of a NumPy array: a tuple of integers giving the number of elements stored along each dimension.

In [4]:
# Finding shape of numpy arrays
print("L1 shape: ", np.shape(L1))
print("L3 shape: ", np.shape(L3))

L1 shape:  (5,)
L3 shape:  (3, 3)


You can reshape an array to a given number of rows and columns. The product of the new dimensions <i>must</i> equal the number of entries in the array.

In [5]:
# Reshaping an array
L2_reshaped = L2.reshape(2,5)
L3_reshaped = L3.reshape(9,1)

print("L2 reshaped: \n", L2_reshaped)
print("L3 reshaped: \n", L3_reshaped)

L2 reshaped: 
 [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
L3 reshaped: 
 [[1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]
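
To see what happens when the dimensions do not match, here is a minimal sketch using a made-up 12-element array named `values`. It also shows `-1`, which asks NumPy to infer one dimension from the others.

```python
import numpy as np

values = np.arange(12)              # 12 entries: 0, 1, ..., 11

# 3 * 4 == 12, so this reshape works
grid = values.reshape(3, 4)

# -1 tells NumPy to infer that dimension from the number of entries
inferred = values.reshape(-1, 6)    # becomes shape (2, 6)

# 5 * 5 == 25, which is not 12, so NumPy raises a ValueError
try:
    values.reshape(5, 5)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)

print(grid.shape, inferred.shape)
```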


You can also quickly index or slice an array.

In [6]:
# Getting the third element of L1:
print("Third element of L1: ", L1[2])

# Getting the first three elements of L1:
print("First three elements of L1: ", L1[0:3])

# Getting the last three elements of L4:
print("Last three elements of L4: ", L4[-3:])

# Splitting L2 into two arrays (L2_A and L2_B):
L2_A = L2[0:5]
L2_B = L2[5:]
print("L2_A: ", L2_A)
print("L2_B: ", L2_B)

Third element of L1:  3
First three elements of L1:  [1 2 3]
Last three elements of L4:  [4 6 8]
L2_A:  [0. 0. 0. 0. 0.]
L2_B:  [0. 0. 0. 0. 0.]
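
The same ideas extend to 2-dimensional arrays, where you give one index or slice per dimension (rows first, then columns). The sketch below uses a small matrix `M` created just for this example.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

first_row = M[0]         # the whole first row
second_col = M[:, 1]     # every row, second column
element = M[1, 2]        # second row, third column
top_left = M[:2, :2]     # first two rows and first two columns

print(first_row)
print(second_col)
print(element)
print(top_left)
```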


<h4>NumPy Review Exercises</h4>

1. Create a NumPy array with all even numbers from 0 to 198, inclusive (100 values)
2. Reshape the array into a 2D array with 10 rows and 10 columns
3. Verify the shape of the resulting array
4. Return only the last 2 rows of this 2D array

In [7]:
# Responses to NumPy Review Exercises go here

## Pandas

In [8]:
import pandas as pd

pandas is a Python library that provides fast and flexible data structures for data manipulation and analysis. One of its most useful tools is the DataFrame: a two-dimensional, labeled, size-mutable table whose columns can hold different data types.

In [15]:
# Creating a DataFrame from a dictionary
# (named city_data rather than dict, so the built-in dict type is not shadowed)

city_data = {"Cities": ["Half Moon Bay", "San Francisco", "Boston", "Denver"],
             "States": ["California", "California", "Massachusetts", "Colorado"],
             "Populations": [12834, 874961, 684379, 705576],
             "Elevations": [75, 52, 141, 5280]}

df = pd.DataFrame(city_data)
df.head()

| | Cities | States | Populations | Elevations |
|---|---|---|---|---|
| 0 | Half Moon Bay | California | 12834 | 75 |
| 1 | San Francisco | California | 874961 | 52 |
| 2 | Boston | Massachusetts | 684379 | 141 |
| 3 | Denver | Colorado | 705576 | 5280 |


In [16]:
# Accessing a column

df['Cities']

0    Half Moon Bay
1    San Francisco
2           Boston
3           Denver
Name: Cities, dtype: object

In [17]:
# General descriptive statistics of the DataFrame

df.describe(include = 'all')

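| | Cities | States | Populations | Elevations |
|---|---|---|---|---|
| count | 4 | 4 | 4.0 | 4.0 |
| unique | 4 | 3 | NaN | NaN |
| top | San Francisco | California | NaN | NaN |
| freq | 1 | 2 | NaN | NaN |
| mean | NaN | NaN | 569437.5 | 1387.0 |
| std | NaN | NaN | 380743.704202 | 2595.607443 |
| min | NaN | NaN | 12834.0 | 52.0 |
| 25% | NaN | NaN | 516492.75 | 69.25 |
| 50% | NaN | NaN | 694977.5 | 108.0 |
| 75% | NaN | NaN | 747922.25 | 1425.75 |
| max | NaN | NaN | 874961.0 | 5280.0 |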
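
A few other common DataFrame operations are sketched below: checking column types, filtering rows, selecting several columns, and adding a computed column. The `trails` data is hypothetical and exists only for this illustration.

```python
import pandas as pd

# Hypothetical example data with columns of different types
trails = pd.DataFrame({
    "Trail": ["Ridge", "Canyon", "Lakeside"],
    "Miles": [4.5, 9.0, 2.0],
    "Dogs_Allowed": [True, False, True],
})

# Each column has its own dtype
print(trails.dtypes)

# Keep only the rows where a condition holds (boolean filtering)
long_trails = trails[trails["Miles"] > 4]

# Select several columns by passing a list of names
subset = trails[["Trail", "Miles"]]

# Add a new column computed from an existing one
trails["Kilometers"] = trails["Miles"] * 1.609344

print(long_trails)
print(trails)
```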


<h4>Pandas Exercises</h4>

1. Create a DataFrame from the dict <i>cars</i>
2. Find descriptive statistics

In [19]:
# Solutions to Pandas Exercises go here

cars = {"Make": ["Toyota", "Ford", "Honda", "Toyota", "Subaru"],
        "Model": ["Prius", "Fusion", "Civic", "Camry", "Forester"],
        "Year": [2011, 2015, 2001, 2003, 2018],
        "Miles": [50000, 72000, 160230, 20080, 30100]}

## scikit-learn

In [55]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import sklearn.linear_model as linear_model
from sklearn import metrics

scikit-learn is a Python library for machine learning and statistical modeling. It includes many classification, regression, and clustering algorithms.
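
Most scikit-learn models follow the same pattern: create the model, call `fit` on the data, then call `predict`. Here is a minimal sketch on made-up data that follows y = 2x + 1 exactly, so the fitted slope and intercept are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y = 2x + 1 with no noise
X = np.array([[0.0], [1.0], [2.0], [3.0]])   # 2D: (n_samples, n_features)
y = np.array([1.0, 3.0, 5.0, 7.0])           # 1D: (n_samples,)

model = LinearRegression()    # create the model
model.fit(X, y)               # learn the slope and intercept from the data

print("slope:", model.coef_)
print("intercept:", model.intercept_)

# Predict for a new input, x = 4
prediction = model.predict(np.array([[4.0]]))
print("prediction:", prediction)
```

Note that the features must be 2-dimensional even when there is only one feature, which is why `X` has shape `(4, 1)`.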

## California Housing Prices Regression

https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset

In [29]:
# fetch dataset

In [47]:
# view the entire DataFrame

In [48]:
# view the feature data

In [49]:
# view the target data

In [50]:
# split the data into training and testing datasets

In [51]:
# create model

In [52]:
# fit model to our dataset

In [53]:
# print the coefficient values of our model
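
One possible way to fill in the cells above is sketched here. The helper names `fit_linear_model` and `run_california_example` are hypothetical, the 80/20 split and `random_state=0` are arbitrary choices, and `as_frame=True` assumes scikit-learn 0.23 or newer. Calling `run_california_example()` downloads the dataset on first use.

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def fit_linear_model(features, target, test_size=0.2, random_state=0):
    """Split the data, then fit LinearRegression on the training portion."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=test_size, random_state=random_state
    )
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model, X_train, X_test, y_train, y_test


def run_california_example():
    """Fetch the California housing data and print the fitted coefficients."""
    housing = fetch_california_housing(as_frame=True)
    features = housing.data      # DataFrame of feature columns
    target = housing.target      # Series holding the target values
    model, X_train, X_test, y_train, y_test = fit_linear_model(features, target)
    # Label each coefficient with the feature it belongs to
    print(pd.Series(model.coef_, index=features.columns))
    print("Intercept:", model.intercept_)
    return model, X_test, y_test
```

Keeping the split-and-fit step in its own function means the same code can be reused on any feature table and target.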

<h4> Model Analysis </h4>

In [54]:
# find predicted values on our test dataset

In [57]:
# find mean absolute error, MSE, RMSE
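
As a sketch of how these three error measures relate, the example below computes them on a tiny pair of made-up arrays, `y_true` and `y_pred`, rather than on the housing predictions. RMSE is taken here as the square root of MSE.

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Mean absolute error: average of |true - predicted|
mae = metrics.mean_absolute_error(y_true, y_pred)

# Mean squared error: average of (true - predicted)^2
mse = metrics.mean_squared_error(y_true, y_pred)

# Root mean squared error: back in the same units as the target
rmse = np.sqrt(mse)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

The errors here are 1, 0, -2 and -1, so MAE is 1.0 and MSE is 1.5. Squaring penalizes the one large miss more heavily than MAE does.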

In [59]:
# find our linear regression score (R2 value) for training and testing data
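
To make the R2 value less of a black box, this sketch computes it by hand on made-up arrays and compares the result with `metrics.r2_score`. The `score` method of `LinearRegression` reports the same quantity for a fitted model.

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Residual sum of squares: how far predictions are from the true values
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: how far the true values are from their own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = metrics.r2_score(y_true, y_pred)

print("manual R2: ", r2_manual)
print("sklearn R2:", r2_sklearn)
```

An R2 of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that.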

<h2> Overfitting & Underfitting</h2>

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html

In [10]:
from sklearn.datasets import make_regression

In [61]:
# make regression problem to demonstrate overfitting

In [62]:
# Split the data into training and testing datasets

In [63]:
# Create model

In [65]:
# Fit the model to our training data

<h4> Model Analysis </h4>

In [66]:
# find our linear regression score (R2 value) for training and testing data

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score
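
One way the cells above might be filled in is sketched below. The specific settings (60 samples, 100 features, 10 informative features, noise of 10, a 75/25 split) are assumptions chosen so that the model has more coefficients than training rows, which makes overfitting easy to see.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Few samples, many features: an easy setup for overfitting
X, y = make_regression(
    n_samples=60, n_features=100, n_informative=10, noise=10.0, random_state=0
)

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

# R2 on data the model has seen versus data it has not
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print("Training R2:", train_r2)
print("Testing R2: ", test_r2)
```

With 45 training rows and 100 features, the model can fit the training data almost perfectly, so the training score is close to 1 while the testing score is lower. A large gap between the two scores is the usual sign of overfitting.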