# 3.1 Linear Regression Exercise


Welcome to the Regression case study!
In this exercise we use the structure of the [Data Science Method (DSM)](http://bit.ly/2T6Hpp5) to keep the processing in your data science project thorough and efficient. The DSM is a framework developed to guide aspiring data scientists through the steps of a data science project.

**The Data Science Method**  

1.   Problem Identification 

2.   Data Wrangling 
 
3.   Exploratory Data Analysis 
 
4.   Pre-processing and Training Data Development
  
5.   Modeling 

6.   Documentation
 

# 1. Problem Identification

## Avocado prices

BACKGROUND: You are a data scientist working for a small grocery store. The store wants to know whether the demand for avocados is changing and whether the price of avocados can be predicted.

GOAL: You are tasked with building a predictive model using machine learning to predict the price of avocados given the historic dataset. 

DATA: The Hass Avocado Board provided these data. The data represent weekly 2018 retail scan data for national retail volume (units) and price. Retail scan data come directly from retailers’ cash registers and are based on actual retail sales of Hass avocados. Starting in 2013, the data reflect an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per-unit (per-avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLUs) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.


Download link: https://www.kaggle.com/neuromusic/avocado-prices/download

# 2. Data Wrangling

In [1]:
# Import relevant libraries and packages.
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns # For all our visualization needs.
import statsmodels.api as sm # For modelling with linear regression.
from statsmodels.graphics.api import abline_plot # For visually evaluating predictions.
from sklearn.metrics import mean_squared_error, r2_score # For model evaluation.
from sklearn.model_selection import train_test_split # For splitting the data.
from sklearn import linear_model, preprocessing # For modelling with linear regression.

#### Load the data

In [3]:
url = 'avocado.csv'
df = pd.read_csv(url, index_col=0)
df.head()

| | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2015-12-13 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 2015-11-29 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |


Review data types and null values

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18249 entries, 0 to 11
Data columns (total 13 columns):
Date            18249 non-null object
AveragePrice    18249 non-null float64
Total Volume    18249 non-null float64
4046            18249 non-null float64
4225            18249 non-null float64
4770            18249 non-null float64
Total Bags      18249 non-null float64
Small Bags      18249 non-null float64
Large Bags      18249 non-null float64
XLarge Bags     18249 non-null float64
type            18249 non-null object
year            18249 non-null int64
region          18249 non-null object
dtypes: float64(9), int64(1), object(3)
memory usage: 1.9+ MB


Review the count of unique values by column

In [5]:
print(df.nunique())

Date              169
AveragePrice      259
Total Volume    18237
4046            17702
4225            18103
4770            12071
Total Bags      18097
Small Bags      17321
Large Bags      15082
XLarge Bags      5588
type                2
year                4
region             54
dtype: int64


Review the percent of unique values by column

In [6]:
print(df.nunique()/df.shape[0])

Date            0.009261
AveragePrice    0.014193
Total Volume    0.999342
4046            0.970026
4225            0.992000
4770            0.661461
Total Bags      0.991671
Small Bags      0.949148
Large Bags      0.826456
XLarge Bags     0.306209
type            0.000110
year            0.000219
region          0.002959
dtype: float64


Review the range of values per column

In [7]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| AveragePrice | 18249.0 | 1.405978 | 0.4026766 | 0.44 | 1.1 | 1.37 | 1.66 | 3.25 |
| Total Volume | 18249.0 | 850644.013009 | 3453545.0 | 84.56 | 10838.58 | 107376.76 | 432962.29 | 62505646.52 |
| 4046 | 18249.0 | 293008.424531 | 1264989.0 | 0.0 | 854.07 | 8645.3 | 111020.2 | 22743616.17 |
| 4225 | 18249.0 | 295154.568356 | 1204120.0 | 0.0 | 3008.78 | 29061.02 | 150206.86 | 20470572.61 |
| 4770 | 18249.0 | 22839.735993 | 107464.1 | 0.0 | 0.0 | 184.99 | 6243.42 | 2546439.11 |
| Total Bags | 18249.0 | 239639.20206 | 986242.4 | 0.0 | 5088.64 | 39743.83 | 110783.37 | 19373134.37 |
| Small Bags | 18249.0 | 182194.686696 | 746178.5 | 0.0 | 2849.42 | 26362.82 | 83337.67 | 13384586.8 |
| Large Bags | 18249.0 | 54338.088145 | 243966.0 | 0.0 | 127.47 | 2647.71 | 22029.25 | 5719096.61 |
| XLarge Bags | 18249.0 | 3106.426507 | 17692.89 | 0.0 | 0.0 | 0.0 | 132.5 | 551693.65 |
| year | 18249.0 | 2016.147899 | 0.9399385 | 2015.0 | 2015.0 | 2016.0 | 2017.0 | 2018.0 |


In [8]:
df.AveragePrice.value_counts()

1.15    202
1.18    199
1.08    194
1.26    193
1.13    192
       ... 
3.05      1
3.03      1
2.91      1
0.48      1
2.96      1
Name: AveragePrice, Length: 259, dtype: int64

## Data Cleaning
Check for duplicated rows

In [9]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region


# 3. Exploratory Data Analysis


## Build data profile tables and plots

**<font color='teal'> Print out the summary stats table transposed to fit on the screen using the `describe()` function.</font>**

**<font color='teal'> Histograms are an excellent way to review the range and density of values for each numeric feature in your data set and to build data profiles. Plot histograms for all numeric features and set the number of bins to 25.</font>**

Look for similarities in the features that may indicate that they are duplicates or highly correlated features. Make a note of your findings and any other interesting insights you find about these numeric features.
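The two prompts above could be approached along these lines. This is only a sketch: `demo` is a small synthetic stand-in (an assumption, not the avocado data) so the cell runs on its own. In the notebook you would apply the same calls to `df`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the avocado DataFrame; use the real `df` in the notebook.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AveragePrice": rng.normal(1.4, 0.4, 500).round(2),
    "Total Volume": rng.lognormal(11, 2, 500),
    "Total Bags": rng.lognormal(10, 2, 500),
    "type": rng.choice(["conventional", "organic"], 500),
})

# Summary statistics, transposed so that each feature becomes a row.
summary = demo.describe().T
print(summary)

# One histogram per numeric column, 25 bins each.
numeric_cols = demo.select_dtypes(include="number").columns
axes = demo[numeric_cols].hist(bins=25, figsize=(12, 4), layout=(1, 3))
plt.tight_layout()
```

Features whose histograms have nearly the same shape and scale are good candidates to check for duplication or high correlation later on.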

**<font color='teal'> Okay, now you should be getting a sense for what the data look like. Let's create a barplot for the categorical features `region` and `type` where the heights of the bars are the counts of each level in that variable. </font>**

**<font color='teal'>Type Levels Plot</font>**

**<font color='teal'>Region Levels Plot</font>**
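One possible way to draw the two level plots is to count the levels with `value_counts()` and plot the counts as bars. The sketch below uses a synthetic `demo` frame with made-up regions (an assumption for illustration); in the notebook, replace `demo` with `df`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data with the same two categorical columns as the avocado set.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "type": rng.choice(["conventional", "organic"], 300),
    "region": rng.choice(["Albany", "Boston", "Denver", "Seattle"], 300),
})

fig, (ax_type, ax_region) = plt.subplots(1, 2, figsize=(12, 4))

# Bar height = number of rows at each level of `type`.
type_counts = demo["type"].value_counts()
type_counts.plot(kind="bar", ax=ax_type, title="Type levels")
ax_type.set_ylabel("count")

# Bar height = number of rows at each level of `region`.
region_counts = demo["region"].value_counts()
region_counts.plot(kind="bar", ax=ax_region, title="Region levels")
ax_region.set_ylabel("count")

plt.tight_layout()
```

With 54 regions in the real data, a wider figure or horizontal bars (`kind="barh"`) may be easier to read.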

## Anomalies & Outliers - Review boxplots

**<font color='teal'> Print boxplot for every column</font>**

You need to create boxplots and histograms to evaluate the data for potential outliers or data anomalies. Generally, outliers are defined as observations that differ significantly from the other values in the dataset or feature.

Reviewing the distribution of values by column will help you interpret this. Outliers are extreme values that lie far from the bulk of the observations, for example several standard deviations from the mean. They can mislead the training process in building machine learning models. Outliers may be real anomalies in the observations or artificial errors.

One method for outlier analysis is extreme value analysis using a boxplot. The usual whisker rule (1.5 times the interquartile range beyond the quartiles) works best when the data are roughly normally distributed, so expect heavily skewed features to show many flagged points. The figure below describes the components of a boxplot. Notice the outlier is the point outside the upper whisker end.

![](AnnotatedBoxplot.png)  
<font color='teal'>
    
**Follow these steps:**  

**1. Create boxplots - earlier step** 

**2. Apply outlier removal using the Interquartile range or replacement**  

**3. Review how many observations were removed**</font>
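Steps 2 and 3 might look like the sketch below, which applies the common 1.5 × IQR rule to a single column. The data are synthetic with a few injected extreme values (an assumption for illustration); whether removal is appropriate for the real avocado data is a judgment call left to you.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in column with five obvious outliers injected.
rng = np.random.default_rng(2)
demo = pd.DataFrame({"AveragePrice": rng.normal(1.4, 0.4, 1000)})
demo.loc[:4, "AveragePrice"] = 9.0  # label-based slice: rows 0 through 4

# Interquartile range and the conventional 1.5 * IQR fences.
q1 = demo["AveragePrice"].quantile(0.25)
q3 = demo["AveragePrice"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows inside the fences, then report how many were dropped.
mask = demo["AveragePrice"].between(lower, upper)
trimmed = demo[mask]
n_removed = len(demo) - len(trimmed)
print(f"Removed {n_removed} of {len(demo)} rows")
```

Replacement instead of removal could be sketched with `demo["AveragePrice"].clip(lower, upper)`, which caps extreme values at the fences rather than dropping rows.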

After reviewing these response variable distributions, there don't appear to be any data issues to mitigate. Next, we investigate the relationships and interactions between the features and the response.

## Explore data relationships

<font color='teal'>**Create pairplots (also known as scatterplot matrices)**</font>

## Identification and creation of features

<font color='teal'>**Create a Pearson correlation heatmap**</font>

When reviewing the Pearson correlation coefficient heatmap, you can see substantial differences in how strongly each feature correlates with the response variable(s), as well as in how strongly the features correlate with each other. The heatmap helps identify features that suffer from multicollinearity.

<font color='teal'>**Use the correlation matrix displayed in the heatmap to select and remove collinear features. Remember to exclude the response variable(s) from the matrix to ensure they are retained in our final model development data set. Then select for removal those features whose absolute correlation with another feature exceeds 0.95.**</font>
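One common way to carry out this selection is to look only at the upper triangle of the absolute correlation matrix, so that each pair of features is considered once. The sketch below is an illustration on synthetic data (an assumption); `to_drop` and `reduced` are hypothetical names.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data; "Total Bags" is nearly a rescaled copy of "Total Volume".
rng = np.random.default_rng(4)
total = rng.lognormal(11, 1, 300)
demo = pd.DataFrame({
    "AveragePrice": rng.normal(1.4, 0.4, 300),
    "Total Volume": total,
    "Total Bags": total * 0.3 + rng.normal(0, 50, 300),
    "4770": rng.lognormal(6, 1, 300),
})

# Exclude the response so it can never be selected for removal.
features = demo.drop(columns=["AveragePrice"])
corr = features.corr()  # Pearson correlation is the pandas default

# Heatmap of the signed correlations.
ax = sns.heatmap(corr, annot=True, cmap="viridis", vmin=-1, vmax=1)

# Upper triangle of |corr| (diagonal excluded): one entry per feature pair.
abs_corr = corr.abs()
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))

# Flag the later feature of every pair whose |correlation| exceeds 0.95.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = demo.drop(columns=to_drop)
print("Dropping:", to_drop)
```

This keeps the first feature of each highly correlated pair and drops the second; which one to keep is a modelling choice you may want to make by hand.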

# 4. Preprocessing and Feature Engineering

* Create dummy or indicator features for categorical variables
* Standardize the magnitude of numeric features
* Split into testing and training datasets
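The three steps above can be sketched as follows. The data are a synthetic stand-in (an assumption), and `StandardScaler` is one of several reasonable choices for standardizing magnitudes. The `test_size` and `random_state` values mirror those used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the cleaned avocado DataFrame.
rng = np.random.default_rng(5)
demo = pd.DataFrame({
    "AveragePrice": rng.normal(1.4, 0.4, 400),
    "Total Volume": rng.lognormal(11, 1, 400),
    "type": rng.choice(["conventional", "organic"], 400),
})

# Dummy/indicator features; drop_first avoids a redundant column per category.
X = pd.get_dummies(demo.drop(columns=["AveragePrice"]),
                   columns=["type"], drop_first=True, dtype=float)
y = demo["AveragePrice"]

# Split before scaling so the scaler never sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123)
X_train, X_test = X_train.copy(), X_test.copy()

# Standardize numeric features: fit on the training set, apply to both sets.
num_cols = ["Total Volume"]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

Here the split happens before standardization, which differs from the order of the bullet list; fitting the scaler on the training rows only avoids leaking information from the test set.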

# 5. Modeling

#### Making a Linear Regression model: our first model
Scikit-learn provides a [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) class in its linear_model module. We'll be using that to make our regression model. 

In [0]:
# Create the model: make a variable called rModel, and assign it linear_model.LinearRegression().
# Note: the normalize parameter used in older versions of this exercise was removed in scikit-learn 1.2;
# handle different feature scales during preprocessing instead (e.g. with preprocessing.StandardScaler).
rModel = linear_model.LinearRegression()

In [0]:
# We now want to train the model on our training data.
# Call the .fit() method of rModel, and plug in X_train, y_train as parameters, in that order.
rModel.fit(X_train, y_train)

In [0]:
# Evaluate the model by printing the result of calling .score() on rModel, with parameters X_train, y_train. 
print(rModel.score(X_train, y_train))

The above score is called the R-squared coefficient, or the "coefficient of determination". It measures how much of the variation in the data around the mean our model explains: 1 would mean a perfect model that explains 100% of the variation. At the moment, our model explains only about 23% of the variation from the mean. There's more work to do!
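To make the definition concrete, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below checks this against `.score()` on synthetic data (an assumption; the numbers are unrelated to the avocado results).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one informative feature, one pure-noise feature.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = 1.4 + 0.3 * X[:, 0] + rng.normal(0, 0.4, 200)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, y))   # the two values should agree
```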

In [0]:
# Use the model to make predictions about our test data
# Make a variable called y_pred, and assign it the result of calling the predict() method on rModel. Plug X_test into that method.
y_pred = rModel.predict(X_test)

In [0]:
# Let's plot the predictions against the actual result
plt.scatter(y_test,y_pred)

The above scatterplot represents how well the predictions match the actual results. 

Along the x-axis, we have the actual average avocado price, and along the y-axis we have the predicted value for the weekly price.

Let's build a similar model using the statsmodels package, to see if we get a better result that way.

####  Making a Linear Regression model: our second model: Ordinary Least Squares (OLS)

In [0]:
# Create the test and train sets. Here, we do things slightly differently.  
# We make the explanatory variable X as before.
X = df.drop(columns=['AveragePrice'])

# But here, reassign X the value of adding a constant to it. This is required for Ordinary Least Squares Regression.
# Further explanation of this can be found here: 
# https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html
X = sm.add_constant(X)

In [0]:
# The rest of the preparation is as before.
y = df[['AveragePrice']]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 123)

In [0]:
# Create the model
rModel2 = sm.OLS(y_train, X_train)
# Fit the model
rModel2_results = rModel2.fit()

In [0]:
# Evaluate the model
rModel2_results.summary()

One of the great things about Statsmodels (sm) is that you get so much information from the summary() method. 

There are lots of values here, whose meanings you can explore at your leisure, but here's one of the most important: the R-squared score is the same as what it was with the previous model. This makes perfect sense, right? It's the same value as the score from sklearn, because they've both used the same algorithm on the same data.

Here's a useful link you can check out if you have the time: https://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/

In [0]:
# Let's use our new model to make predictions of the dependent variable y 
y_pred = rModel2_results.predict(X_test)

In [0]:
# Plot the predictions

# Build a scatterplot
plt.scatter(y_test, y_pred)

# Add a line for perfect correlation
# (0 to 3.5 covers the observed AveragePrice range of 0.44 to 3.25)
plt.plot([0, 3.5], [0, 3.5], color='red')

# Label it nicely
plt.title("Model 2 predictions vs. actual")
plt.xlabel("Actual")
plt.ylabel("Predicted")

The red line shows a theoretically perfect correlation between our actual and predicted values - the line that would exist if every prediction was completely correct. 

Compare the shape of the point cloud with the red line: the closer the points sit to the line, the closer the predictions are to the actual values. Because this model's R-squared matches that of the first model, don't expect a visibly tighter fit here.

We can check another metric as well: the RMSE (Root Mean Squared Error). This is a measure of the accuracy of a regression model. Very simply put, it's formed by squaring the differences between predictions and actual values, averaging those squares, and taking the square root of the result.

In [0]:
# Define a function to check the RMSE
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

In [0]:
# Get predictions from the fitted OLS model (rModel2_results)
y_pred = rModel2_results.predict(X_test)

# Put the predictions & actual values into a dataframe
matches = pd.DataFrame(y_test)
matches.rename(columns = {'AveragePrice':'actual'}, inplace=True)
matches["predicted"] = y_pred

rmse(matches["actual"], matches["predicted"])

The RMSE tells us roughly how far off our predictions were, on average, in the same units as the response (here, price per avocado). An RMSE of 0 would mean we were making perfect predictions. 
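A small worked example may help with interpreting the number. The four prices below are invented for illustration; the sketch computes the RMSE by hand, cross-checks it against scikit-learn's `mean_squared_error`, and compares it with the plain mean absolute error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices.
actual = np.array([1.00, 1.20, 1.50, 2.00])
predicted = np.array([1.10, 1.10, 1.50, 1.60])

errors = predicted - actual

# RMSE: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Cross-check using scikit-learn's mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))

# Mean absolute error, for comparison.
mae = np.mean(np.abs(errors))

print(rmse_manual, rmse_sklearn, mae)
```

The RMSE comes out larger than the mean absolute error here because squaring gives the single large miss more weight, so the RMSE is best read as a typical error size that penalizes big misses.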

# 6. Documentation 

* Review the Results
* Present and share your findings - storytelling
* Finalize Code
* Finalize Documentation