## Chapter 12: Predictive Analytics

Data mining is the process of analyzing data to discover patterns and relationships. If you are working with sales data, for example, it makes sense that you would analyze data to determine facts such as:

    •	Which customer purchased the most products?
    •	Which customer purchased the least?
    •	What were the average sales per customer order?
    •	What was the average number of days between orders?
    •	Which customers have not yet ordered the new product?
    •	And so on.

Using data in this way to describe past events is descriptive analytics. As you might imagine, being able to analyze data to determine such historical facts is very important and very valuable to businesses. To perform descriptive analytics, you can use statistical tools to generate metrics, you can use visualization tools to chart data, you can use clustering to group data, and more.

This notebook, however, is about using data to predict future events, which is known as predictive analytics. Companies today use predictive analytics for a wide range of applications:

    •	Estimate the revenue opportunity associated with an upcoming product sale.
    •	Predict the length of time machines will run without failing.
    •	Determine the loan amount to offer a customer.
    •	Identify which customers are likely to become long-term customers.
    •	Estimate the price for which a seaside home in Seattle will sell.
    •	And more.


Using supervised learning (training with labelled datasets to learn links between inputs and outputs), you can predict in which class (category) an object should reside given predictor variables. Classification works with discrete values, meaning data drawn from a finite or countable set, such as the type of a car, the breed of a dog, hair color, or the number of students (you can’t have a fractional student). In contrast, continuous values can take any value within a range, such as a company’s projected revenue, the average basketball player’s height, and the range of temperatures in Phoenix.

This notebook focuses on predicting continuous values using a technique called regression, the goal of which is to produce an equation which you can then use to predict results. 

Some of the scripts presented in this notebook use several Python libraries that have been pre-installed for you. If you had been required to install these libraries on your own, you would issue the following commands:

```python
! pip install --user pandas
! pip install --user numpy
! pip install --user scikit-learn
! pip install --user matplotlib
```

# Understanding Linear Regression

Linear regression is a statistical technique that produces a linear equation that best models a set of data. When analysts use linear regression to create an equation that predicts a value from a single predictor variable, such as predicting sales from a customer’s age, the process is called simple regression. In contrast, when more than one predictor variable is used, such as age and gender to predict salary, the process is called multiple regression.

The following Python script, SimpleLR.py, uses the LinearRegression class from scikit-learn's linear_model module to perform a simple (one predictor variable) linear regression:

In [None]:
######################################
# Chapter 12 (Python) / Deliverable 1
######################################

import numpy as np
from sklearn.linear_model import LinearRegression

# One predictor column (X must be 2-D) and the values to predict (y)
X = np.array([[0],[1],[2],[3]])
y = np.array([2,3,4,5])

# Fit the line y = coefficient * x + intercept
model = LinearRegression()
clf = model.fit(X, y)
print('Coefficient: ', clf.coef_)
print('Y intercept: ', clf.intercept_)
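
Once the model is fitted, the coefficient and intercept define the equation y = coefficient * x + intercept. The following sketch, which is not one of the chapter's scripts, applies that equation by hand to a new input and compares the result with the model's own predict method. The input value 10 is an arbitrary choice made for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as SimpleLR.py
X = np.array([[0], [1], [2], [3]])
y = np.array([2, 3, 4, 5])

model = LinearRegression().fit(X, y)

# Apply the fitted equation by hand to a value the model has not seen
x_new = 10  # arbitrary example input
manual = model.coef_[0] * x_new + model.intercept_

# predict() expects a 2-D array: one row per sample, one column per predictor
predicted = model.predict(np.array([[x_new]]))[0]

print('Manual: ', manual, 'predict(): ', predicted)
```

Because the toy data lies exactly on the line y = x + 2, both values should come out at 12, up to floating-point rounding.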

The following Python script, PlotLR.py, uses the LinearRegression class to determine the line that best represents the data, and then plots the data and the line:

In [None]:
######################################
# Chapter 12 (Python) / Deliverable 2
######################################

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.array([[0],[1],[2],[3],[4],[5],[6],[7]])
y = np.array([1,3,9,10,27,84,105,169])

# Plot the raw data points
plt.scatter(X, y)

# Fit the line, then compute the predicted y value for each x
model = LinearRegression()
clf = model.fit(X, y)
predictions = clf.predict(X)

# Draw the fitted line over the scatter plot
plt.plot(X, predictions)

plt.show()

# Looking at a Real-World Example of Simple Linear Regression

The University of California, Irvine (UCI) Machine Learning Repository provides the Auto-MPG dataset, which contains miles-per-gallon (MPG), horsepower, weight, number of cylinders, and so on, for many different car models. The following Python script, WeightMPG.py, uses the dataset to create an equation that models the relationship between automobile weight and miles per gallon:

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv('auto-mpg.csv')

# Predictor (2-D array) and target
X = data[['weight']].values
y = data['mpg']

# Fit with an intercept; without one the line is forced through the
# origin, so predicted MPG could only rise with weight
model = LinearRegression()
clf = model.fit(X, y)
print('Coefficient: ', clf.coef_)
print('Y intercept: ', clf.intercept_)

predictions = model.predict(X)
for index in range(len(predictions)):
  print('Actual: ', y[index], 'Predicted: ', predictions[index], 'Weight: ', X[index,0])

# Multiple Linear Regression

When you must predict values based on two or more predictor variables, you can perform multiple linear regression. 

The following Python script, MultipleLR.py, uses multiple regression to determine the coefficients:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0, 6, 11],[2, 7, 12],[3, 8, 13],[4, 9, 14],[5, 10, 15]])  
y = np.array([46,52,58,64,70])

model = LinearRegression()
clf = model.fit(X, y)
print ('Coefficient: ', clf.coef_)
print('Y intercept: ', clf.intercept_)

If you multiply each coefficient by its corresponding predictor value, sum the products, and add the y-intercept, you will find that the results closely approximate the values of y.
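
The check described above can be sketched as follows. This is an illustration rather than one of the chapter's scripts; the variable name by_hand is chosen here, and the data is the same as in MultipleLR.py:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data as MultipleLR.py
X = np.array([[0, 6, 11], [2, 7, 12], [3, 8, 13], [4, 9, 14], [5, 10, 15]])
y = np.array([46, 52, 58, 64, 70])

model = LinearRegression().fit(X, y)

# For every row: sum(coefficient * predictor value) + intercept
by_hand = X @ model.coef_ + model.intercept_

for actual, estimate in zip(y, by_hand):
    print('Actual: ', actual, 'By hand: ', estimate)
```

The matrix product performs the multiply-and-sum step for all rows at once, which is the same calculation that model.predict(X) carries out.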

# Looking at a Real-World Example of Multiple Linear Regression

Earlier in this notebook, you examined the use of the Auto-MPG dataset to predict miles per gallon based on the weight of a car. The following Python script, AutoMPGMR.py, extends the previous script to use multiple predictor variables:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = pd.read_csv('auto-mpg.csv')

X = data[['weight', 'horsepower', 'cylinders', 'acceleration', 'displacement', 'model year', 'origin']].values
y = data['mpg']

# Fit with an intercept so the model is not forced through the origin
model = LinearRegression()
clf = model.fit(X, y)
print('Coefficient: ', clf.coef_)
print('Y intercept: ', clf.intercept_)

y2 = model.predict(X)
for index in range(len(y2)):
  print('Actual: ', y[index], 'Predicted: ', y2[index], 'Weight: ', X[index,0])

Seattle (King County, Washington) is home to many successful high-tech companies and to waterfront property, both of which are reflected in its housing prices. A popular dataset for evaluating regression models is the House Sales in King County, USA dataset, which contains 19 house attributes along with the price for which each house sold.

The following Python script, SeattleHousing.py, uses several of the attributes from the King County dataset to generate an equation with which you can predict a house price:

In [None]:
######################################
# Chapter 12 (Python) / Deliverable 3
######################################

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = pd.read_csv('Seattle.csv')

# specify attributes(X) to be used for prediction of price(y)
X = data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot','floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']].values
y = data['price']

# Fit with an intercept so the model is not forced through the origin
model = LinearRegression()
clf = model.fit(X, y)
print('Coefficient: ', clf.coef_)
print('Y intercept: ', clf.intercept_)

predictions = model.predict(X)
for index in range(len(predictions)):
  print('Actual: ', y[index], 'Predicted: ', predictions[index])

# Decision Tree Regression

A decision tree is a graph-based data structure that a program can use to follow a series of decision paths to arrive at a destination. You may know that decision trees are used in data classification, where the decision points of the tree are based upon the different dataset attributes. However, you can also use decision trees to perform decision-tree regression in order to predict continuous data.

The following Python script, SimpleDecisionTreeRegression.py, applies decision-tree regression to the dataset used at the beginning of this notebook:

In [None]:
import numpy as np
from sklearn import tree
from sklearn.tree import export_graphviz

X = np.array([[0],[1],[2],[3]])  
y = np.array([2,3,4,5])

model = tree.DecisionTreeRegressor()
model.fit(X, y)
predictions = model.predict(X)

print(model.feature_importances_)

for index in range(len(predictions)):
  print('Actual: ', y[index], 'Predicted: ', predictions[index])
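
Unlike linear regression, a fitted decision tree does not yield an equation; it stores a set of threshold tests and a predicted value at each leaf. The following sketch, which is not one of the chapter's scripts, prints those decision points using scikit-learn's export_text function. The feature label 'x' and the random_state value are choices made for this illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Same toy data as SimpleDecisionTreeRegression.py
X = np.array([[0], [1], [2], [3]])
y = np.array([2, 3, 4, 5])

model = DecisionTreeRegressor(random_state=0).fit(X, y)

# Print each decision point (a threshold on x) and the value stored in each leaf
rules = export_text(model, feature_names=['x'])  # 'x' is a label chosen for this sketch
print(rules)
```

To predict a value, the tree starts at the top, follows the branch whose test the input satisfies, and returns the value at the leaf it reaches.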

The following Python script, SeattleDT.py, uses decision-tree regression to build a model you can use to predict housing values in Seattle:

In [None]:
######################################
# Chapter 12 (Python) / Deliverable 4
######################################

import pandas as pd
import numpy as np
from sklearn import tree

data = pd.read_csv('Seattle.csv')

X = data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot','floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']].values
y = data['price']

model = tree.DecisionTreeRegressor()
model.fit(X, y)
predictions = model.predict(X)

print(model.feature_importances_)

for index in range(len(predictions)):
  print('Actual: ', y[index], '   Predicted: ', predictions[index])

# Random Forest Regression

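Sometimes a single decision tree is not optimal for the dataset. When a decision tree becomes very deep (many levels of nodes), it will often overfit the training data and have high variance, meaning its predictions change substantially with small changes in the training data.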
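
One way to see this effect is to hold back some rows and compare how a tree scores on data it trained on versus data it has never seen. The following is a minimal sketch under stated assumptions: the data is synthetic (a noisy sine curve invented for this example), and the depth limit of 3, the test size, and the random seeds are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch): a sine curve plus random noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Hold back a quarter of the rows, which the trees never see during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# One tree with no depth limit, one limited to three levels
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

# score() returns R squared; 1.0 is a perfect fit
for name, m in [('deep', deep), ('shallow', shallow)]:
    print(name, 'train:', m.score(X_train, y_train), 'test:', m.score(X_test, y_test))
```

The unrestricted tree typically reproduces its training rows almost perfectly yet scores noticeably lower on the held-back rows, which is the signature of overfitting.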

A random-forest algorithm creates many decision trees, each trained on a random sample of the data, and then averages their predictions to produce a result. The following Python script, AutoMPGRF.py, uses random-forest regression to predict MPG from multiple predictor variables and prints the relative importance of each predictor:

In [None]:
######################################
# Chapter 12 (Python) / Deliverable 5
######################################

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor

data = pd.read_csv('auto-mpg.csv')

X = data[['weight', 'horsepower', 'cylinders', 'acceleration', 'displacement','model year', 'origin']].values
y = data['mpg']

model = RandomForestRegressor(n_estimators=100)
model.fit(X, y)
predictions = model.predict(X)

print(model.feature_importances_)

for index in range(len(predictions)):
  print('Actual: ', y[index], 'Predicted: ', predictions[index], '\t\tWeight: ', X[index,0])
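
For regression, a random forest's prediction is the average of the predictions made by its individual trees. The following sketch checks this on synthetic data invented for the example (it does not use the Auto-MPG dataset); the variable names per_tree and averaged are chosen here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for this sketch, not the Auto-MPG dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# estimators_ holds the individual fitted decision trees
per_tree = np.array([t.predict(X) for t in model.estimators_])

# Averaging the trees' predictions reproduces the forest's prediction
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, model.predict(X)))
```

Note that feature_importances_, which the script above prints, ranks how much each predictor contributed to the trees' splits; the values are not coefficients of an equation.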

The following Python script, RandomForestSeattle.py, uses random-forest regression to build a model you can use to predict house values:

In [None]:
######################################
# Chapter 12 (Python) / Deliverable 6
######################################

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor

data = pd.read_csv('Seattle.csv')

X = data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot','floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']].values
y = data['price']

model = RandomForestRegressor(n_estimators=100)
model.fit(X, y)
predictions = model.predict(X)

print(model.feature_importances_)

for index in range(len(predictions)):
  print('Actual: ', y[index], '  Predicted: ', predictions[index])

# K-Nearest Neighbors Regression

The K-nearest-neighbors (KNN) classification algorithm is based on the premise “If it walks like a duck and quacks like a duck, it’s a duck.” To use KNN, you provide a value K that specifies how many of the closest (most similar) data points, the neighbors, the algorithm should consult when it evaluates a value. For classification, the algorithm assigns the value to the group most common among those K neighbors. However, you can also use K-nearest neighbors for regression: to predict a value, the algorithm finds the K nearest neighbors and then calculates the average of their values.
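
The averaging step can be sketched on a tiny dataset. The data, the query value 2.4, and K = 2 below are invented for this illustration; the sketch looks up the neighbors with the regressor's kneighbors method and compares their mean with the prediction:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data invented for this sketch
X = np.array([[1], [2], [3], [10], [11]])
y = np.array([10.0, 20.0, 30.0, 100.0, 110.0])

model = KNeighborsRegressor(n_neighbors=2).fit(X, y)

query = np.array([[2.4]])

# kneighbors() returns the distances to, and row indices of, the K closest training rows
distances, indices = model.kneighbors(query)
neighbor_targets = y[indices[0]]

# With the default uniform weights, the prediction is the plain mean of those targets
print('Neighbor targets: ', neighbor_targets)
print('Mean: ', neighbor_targets.mean(), 'predict(): ', model.predict(query)[0])
```

The two rows closest to 2.4 are those with x = 2 and x = 3, so the prediction is the mean of 20 and 30, which is 25.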


The following Python script, KNNRegression.py, uses the Auto-MPG dataset to predict MPG based on a number of car attributes:

In [None]:
######################################
# Chapter 12 (Python) / Deliverable 7
######################################

import pandas as pd
import numpy as np
from sklearn import neighbors

data = pd.read_csv('auto-mpg.csv')

X = data[['weight', 'horsepower', 'cylinders', 'acceleration', 'displacement','model year', 'origin']].values
y = data['mpg']

model = neighbors.KNeighborsRegressor(n_neighbors = 5)

model.fit(X, y)
predictions = model.predict(X) 

for index in range(len(predictions)):
  print('Actual: ', y[index], 'Predicted: ', predictions[index], '\t\tWeight: ', X[index,0])

# Polynomial Regression

As you have learned, a straight line is often not a good fit for the underlying data. In such cases, you may find that a curved line better matches the data. The following Python script, SimplePoly.py, fits a second-degree polynomial to a small sample dataset and plots the resulting curve:

In [None]:
######################################
# Chapter 12 (Python) / Deliverable 8
######################################

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0],[1],[2],[3],[4],[5],[6],[7]])
y = np.array([2,3,9,12,15,18,19,20])

plt.scatter(X,y)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

model = LinearRegression()
model.fit(X_poly, y)
predictions = model.predict(X_poly)

plt.plot(X, predictions)

plt.show()
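
Polynomial regression, as performed in the script above, is still linear regression; the difference is that PolynomialFeatures first expands each input into its powers. The following sketch shows what that expansion produces for two sample inputs chosen for illustration:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two sample inputs chosen for this sketch
X = np.array([[2], [3]])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Each row becomes [1, x, x squared]
print(X_poly)
```

The input 2 becomes the row 1, 2, 4 and the input 3 becomes the row 1, 3, 9. LinearRegression then finds one coefficient per column, so the fitted curve has the form y = b0 + b1 * x + b2 * x squared.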

Finally, the following Python script, SeattlePoly.py, uses polynomial regression to predict Seattle housing prices:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

data = pd.read_csv('Seattle.csv')

X = data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot','floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']].values
y = data['price']

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

model = LinearRegression()
model.fit(X_poly, y)
predictions = model.predict(X_poly)

for index in range(len(predictions)):
  print('Actual: ', y[index], 'Predicted: ', predictions[index])