
# Multiple Linear Regression

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Instead of predicting a target from a single feature (variable), we will use multiple features to predict it!

## Example Code

In [None]:
np.random.seed(1234)
sen = np.random.uniform(18, 65, 100)
income = np.random.normal((sen/10), 0.5)
sen = sen.reshape(-1,1)

In [None]:
fig = plt.figure(figsize=(7,5))
fig.suptitle('seniority vs. income', fontsize=16)

# Scatter plot with labels
plt.scatter(sen, income)
plt.xlabel("seniority", fontsize=14)
plt.ylabel("monthly income", fontsize=14)

# Quick line through the data: the true mean relationship (income = seniority / 10), not a fitted line
plt.plot(sen, sen/10, c = "black")

plt.show()

Our line of best fit takes the form
$$ \text{target} = \text{slope} \times \text{feature} + \text{intercept} $$
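
As a minimal sketch of how the slope and intercept in this form could be estimated, the cell below refits the synthetic seniority/income data with `np.polyfit`. Using `np.polyfit` is an assumption made for illustration (the notebook itself only draws the known line `sen/10`), and `sen` is kept one-dimensional here rather than reshaped.

```python
import numpy as np

# Recreate the synthetic data from the cells above (same seed, same order of draws)
np.random.seed(1234)
sen = np.random.uniform(18, 65, 100)
income = np.random.normal(sen / 10, 0.5)

# Least-squares fit of a degree-1 polynomial: income ≈ slope * sen + intercept.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(sen, income, 1)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Because the data were generated around `sen / 10`, the estimated slope should land close to 0.1 and the intercept close to 0, up to noise.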

But what if we want multiple ($n$) features? Each feature $x_i$ gets its own weight $w_i$, and $w_0$ plays the role of the intercept:
$$ \hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n $$

What shape does this make? With two features the predictions lie on a plane; with more features, on a hyperplane.

In [None]:
# 2D plane example
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib versions

xs = np.linspace(0, 10, 100)
zs = np.linspace(0, 10, 100)

X, Z = np.meshgrid(xs, zs)
Y = Z - X + 10

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z)
plt.show()
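
To connect the plane back to the regression equation, here is a rough sketch that fits $w_0, w_1, w_2$ by ordinary least squares with `np.linalg.lstsq`. The data are synthetic and hypothetical: they are generated from $y = 10 - x_1 + x_2$ plus noise, chosen only to mirror the plane plotted above, and the names `x1`, `x2`, `A`, and `w` are illustrative.

```python
import numpy as np

# Hypothetical synthetic data: two features and a noisy target on a known plane
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 10 - x1 + x2 + rng.normal(0, 0.5, n)

# Design matrix with a leading column of ones so that w[0] acts as the intercept
A = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: find w minimizing ||A @ w - y||^2
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predictions follow y_hat = w_0 + w_1 * x_1 + w_2 * x_2
y_hat = A @ w
print(w)
```

The recovered weights should sit near `[10, -1, 1]`, the coefficients used to generate the data.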

# Categorical Variables

So far we have really only looked at how to analyze continuous variables. But not all data is like that.

In [None]:
# Example data
import pandas as pd
data = pd.read_csv("auto-mpg.csv")
data.head()

## Realizing When to Use Categorical Variables

### First look at the data (part of exploration)

In [None]:
data.info()

In [None]:
data.describe()

### Can use plots to help us

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12,3))

for xcol, ax in zip([ 'cylinders', 'model year', 'origin'], axes):
    data.plot(kind='scatter', x=xcol, y='mpg', ax=ax, alpha=0.4, color='b')

In [None]:
sns.pairplot(data)

In [None]:
fig = plt.figure(figsize = (12,12))
ax = fig.gca()
data.hist(ax = ax);

# Same thing but not as pretty
#data.hist();

## Transforming Categorical Variables

### Label encoding

In [None]:
print(data['cylinders'].dtype)

In [None]:
data['cylinders'] = data['cylinders'].astype('category')
print(data['cylinders'].dtype)

In [None]:
# Unique values of the (now categorical) cylinders column
data.cylinders.unique()

In [None]:
data.cylinders.cat.codes

In [None]:
# Alternative way: scikit-learn's LabelEncoder maps each distinct value to an integer
from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()

encoded_series = lb_make.fit_transform(data.origin)
encoded_series
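
To make the code/category relationship explicit, the sketch below uses a small hypothetical column (not taken from `auto-mpg.csv`) and shows which integer code pandas assigns to each category, plus how to map the codes back.

```python
import pandas as pd

# Hypothetical toy column standing in for something like 'cylinders'
cyl = pd.Series([4, 8, 6, 4, 8], dtype="category")

# Integer code for each row; codes index into the sorted categories
codes = cyl.cat.codes
print(codes.tolist())  # [0, 2, 1, 0, 2]

# Lookup table from code to original value
mapping = dict(enumerate(cyl.cat.categories.tolist()))
print(mapping)  # {0: 4, 1: 6, 2: 8}

# Codes can be mapped back to the original values
decoded = codes.map(mapping)
```

Keeping the `mapping` dictionary around is one simple way to interpret encoded values later on.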

### Dummy Variables

In [None]:
data.origin.unique()

In [None]:
dummy_origin1 = pd.get_dummies(data.origin)
dummy_origin1

In [None]:
# Alternative way: scikit-learn's LabelBinarizer builds the same 0/1 columns
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
temp = lb.fit_transform(data.origin)
# fit_transform returns a NumPy array, so wrap it in a DataFrame with the class labels as column names
dummy_origin2 = pd.DataFrame(temp, columns=lb.classes_)
dummy_origin2
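
As a small illustration of what dummy encoding produces, the sketch below runs `pd.get_dummies` on a hypothetical toy column (the values are made up, not read from `auto-mpg.csv`). The `prefix` and `dtype=int` arguments are choices made here for readable column names and 0/1 output.

```python
import pandas as pd

# Hypothetical toy column standing in for 'origin'
origin = pd.Series([1, 3, 2, 1], name="origin")

# One 0/1 column per distinct value; prefix gives readable column names
dummies = pd.get_dummies(origin, prefix="origin", dtype=int)
print(dummies)

# Each row has exactly one 1, marking the category that row belongs to
row_sums = dummies.sum(axis=1)
```

Each original value becomes its own column (`origin_1`, `origin_2`, `origin_3`), so no artificial ordering is implied between the categories.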