# Linear Regression

Predict petal length for the `setosa` species from the iris dataset.




## Import Required Libraries

**Note:** By convention (PEP 8), you can usually tell a class from a function by its capitalization.

- A **class** will be capitalized
- A **function** will be lowercase
- A **method**, or a function belonging to a class, will also be lowercase. You can call a method through an instance of a class (instance method) or through the class itself (static or class method)

References: 
- [Understanding what a class is](https://www.hackerearth.com/practice/python/object-oriented-programming/classes-and-objects-i/tutorial/)
- [Differences between functions and methods](https://www.tutorialspoint.com/difference-between-method-and-function-in-python)
- [Different types of methods](https://www.bogotobogo.com/python/python_differences_between_static_method_and_class_method_instance_method.php)
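
The naming convention above can be seen in a small sketch. The names `circle_area`, `Circle`, and `from_diameter` are hypothetical and exist only for illustration; they are not part of any library used in this notebook.

```python
import math


def circle_area(radius):
    """A plain function: lowercase name, called directly."""
    return math.pi * radius ** 2


class Circle:
    """A class: capitalized name, called to create instances."""

    def __init__(self, radius):
        self.radius = radius

    def area(self):
        """Instance method: needs an instance (self) to run."""
        return circle_area(self.radius)

    @staticmethod
    def from_diameter(diameter):
        """Static method: called on the class itself, no instance needed."""
        return Circle(diameter / 2)


unit_circle = Circle(1.0)               # calling the class creates an instance
big_circle = Circle.from_diameter(4.0)  # static method called through the class

print(unit_circle.area())  # instance method called through an instance
print(big_circle.radius)
```

The same pattern shows up in the imports below: `LinearRegression` is a class, while `train_test_split` and `r2_score` are functions.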





In [0]:
# Collection libraries 
import numpy as np
import pandas as pd

# Visual libraries 
import matplotlib.pyplot as plt
import seaborn as sns

# Helper for splitting training and testing data
from sklearn.model_selection import train_test_split

# Models/Estimators
from sklearn.linear_model import LinearRegression

# Model evaluation functions
from sklearn.metrics import r2_score, mean_squared_error

#### Notes about imports with this notebook:
We will re-import some of these libraries in later cells, right where their modules are used. This is to get you used to importing them and to understanding their classes and functions. Refer to each library's documentation to learn about its classes, methods, and functions.

## Load Data

<hr>

##### Mount Drive - **Google Colab Only Step**

When using Google Colab, we need to mount Google Drive before we can access the files stored on it. Run the Python cell below, click the link it generates, and paste the authorization code into the cell.



In [0]:
from google.colab import drive
drive.mount('/content/drive')

Change Directory To Access The Dependent Files - **Google Colab Only Step**

In [0]:
directory = "student"
if (directory == "student"):
  %cd drive/Colab\ Notebooks/machine-learning/
else:
  %cd drive/Shared\ drives/Rubrik/Data\ Science\ Track/machine-learning

<hr> 

### Import Iris Dataset
Read in the iris dataset using the following path and store it in a variable called `iris_df`.

#### Import the iris dataset
- Use pandas' `read_csv()` function

#### Pandas' `read_csv()` parameters:
- `filepath_or_buffer` (string): path of csv to import

```python 
filepath_or_buffer = './data/iris.csv'
```
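
As the parameter name suggests, `filepath_or_buffer` accepts a file-like object as well as a path. The sketch below uses that to try `read_csv()` without the real file; the three-row `csv_text` sample and the name `sample_df` are made up for illustration and only mimic the column layout of `./data/iris.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for ./data/iris.csv
csv_text = """SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# filepath_or_buffer accepts a path string or any file-like object
sample_df = pd.read_csv(filepath_or_buffer=io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # numeric columns are parsed as floats, Species as text
```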

In [0]:
iris_df = pd.read_csv(filepath_or_buffer = './data/iris.csv')

### Show Head Of Dataset

In [0]:
iris_df.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


### Show Tail Of Dataset

In [0]:
iris_df.tail()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


<hr> 

## Prepare the Dataset For The Model 

The iris dataset has 50 observations (rows) of each species.
- 50 Iris-setosa samples
- 50 Iris-versicolor samples
- 50 Iris-virginica samples

**Before we start, we need to create a new DataFrame that holds only the Iris-setosa species. We do this because we want to fit a linear regression line that predicts petal length for the Iris-setosa species only.**

<br>

**Tasks:**
- Create a new DataFrame to contain only `Iris-setosa` species.
- Separate our target variable from our features.

### Print Shape Of Original DataFrame
We will do this to confirm our manipulations later


In [0]:
iris_df.shape




### Create a new DataFrame that contains only the Iris-Setosa species called `setosa_species_df`

In [0]:
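# .copy() makes the filtered rows an independent DataFrame, so modifying it
# later (e.g. an in-place drop) does not trigger a SettingWithCopyWarning
setosa_species_df = iris_df[iris_df['Species'] == "Iris-setosa"].copy()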
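
The filter above works in two steps: the comparison produces a boolean Series, and indexing with that Series keeps only the rows marked `True`. A minimal sketch on a made-up four-row DataFrame (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature dataset for illustration
toy_df = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7, 1.3, 6.0],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"],
})

# Step 1: the comparison yields a boolean Series, one value per row
is_setosa = toy_df["Species"] == "Iris-setosa"
print(is_setosa.tolist())

# Step 2: indexing with the mask keeps only the rows marked True;
# .copy() makes the result independent of toy_df
setosa_only = toy_df[is_setosa].copy()
print(setosa_only.shape)
```

Note that the filtered rows keep their original index labels, which is why the setosa rows of the real dataset are still labelled 0 to 49.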

#### Show shape of `setosa_species_df`

It should have 50 rows and 5 columns.

In [0]:
setosa_species_df.shape

### Create a Series called `target_setosa_species_series` using `setosa_species_df` which will contain only the `PetalLengthCm` column data



In [0]:
target_setosa_species_series = setosa_species_df['PetalLengthCm']

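### Remove the `PetalLengthCm` and `Species` features from `setosa_species_df`
We will do this so that we can separate the features from the target. `Species` is dropped as well because it now has the same value in every row, so it carries no information for the model.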

Hint: Use the DataFrame's drop method

#### DataFrame's `drop()` parameters:
- `labels` (string or list of strings): index or column labels to drop
- `axis`  ({0 or ‘index’, 1 or ‘columns’}): default 0; whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’)
- `inplace` (bool): default False; If True, do operation inplace and return None.
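
The effect of these three parameters can be checked on a small made-up DataFrame. In this sketch `toy_df` and its columns `a`, `b`, `c` are hypothetical:

```python
import pandas as pd

toy_df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Default (inplace=False): returns a new DataFrame, toy_df is unchanged
without_b = toy_df.drop(labels="b", axis=1)
print(list(without_b.columns))
print(list(toy_df.columns))

# axis=0 drops rows by index label instead of columns
without_first_row = toy_df.drop(labels=0, axis=0)
print(list(without_first_row.index))

# inplace=True modifies toy_df itself and returns None
result = toy_df.drop(labels=["a", "c"], axis=1, inplace=True)
print(result, list(toy_df.columns))
```

Because an in-place drop returns `None`, assigning its result back to the variable would replace the DataFrame with `None`; use either `inplace=True` or reassignment, not both.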


In [0]:
setosa_species_df.drop(labels=["PetalLengthCm", "Species"], axis=1, inplace=True)

#### Show Head of `setosa_species_df` to make sure we dropped the feature/column correctly:

There shouldn't be a `PetalLengthCm` or `Species` column anymore

In [0]:
setosa_species_df.head()

<br>

## We just created a new DataFrame containing only the `Iris-setosa` species, then separated our target variable from our features.

### Print the head of `setosa_species_df`

In [0]:
setosa_species_df.head()

### Print the head of `target_setosa_species_series`

In [0]:
target_setosa_species_series.head()

<hr>

### Split X (`setosa_species_df`) and y (`target_setosa_species_series`) into training and test sets

We will split the dataset into two sets:

- X_train and y_train: Will be passed into the model to learn the patterns in the data

- X_test and y_test: Will be used to test the validity of the model's predictions.

<br> 
Requirements: 
- Print shape of original DataFrame before manipulating the DataFrame
- pass in `0.20` as the argument for the `test_size` parameter

<br> 
[train_test_split documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) 



In [0]:
X_train, X_test, y_train, y_test = train_test_split(setosa_species_df, target_setosa_species_series, test_size = 0.20)
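
The split above is random, so the rows that land in each set change from run to run. As a sketch of how the split behaves, the example below uses made-up arrays (`demo_X`, `demo_y`) and passes `random_state`, an optional `train_test_split` parameter that the cell above does not set, to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples with 3 features each
demo_X = np.arange(150).reshape(50, 3)
demo_y = np.arange(50)

# random_state fixes the shuffle so the split is the same on every run
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=42
)

# With 50 samples and test_size=0.20, 10 rows go to the test set
print(demo_X_train.shape, demo_X_test.shape)
print(demo_y_train.shape, demo_y_test.shape)
```

Features and targets are shuffled together, so each row of `demo_X_train` still lines up with its entry in `demo_y_train`.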

### Create Model
We can use Scikit-Learn's ``LinearRegression`` estimator to fit this data and construct the best-fit line:

Steps: 

1. Make sure the model class / blueprint is imported
2. Create an instance of this model class / blueprint by invoking the constructor with the desired hyper-parameters 

Requirements: 
- No hyperparameters will be set for this model

[Linear Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [0]:
model = LinearRegression()

### Fit Model to Training Data
Fitting is the training/learning step of the model: the estimator learns its coefficients from the training data so that it can later predict the target value of unseen samples.

[Fit Linear Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit)

In [0]:
model.fit(X_train, y_train)
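
To see what `fit()` learns, the sketch below trains a separate model on made-up, noise-free data generated from a known line, then reads back the learned parameters. The names `demo_X`, `demo_y`, and `demo_model` and the generating equation are assumptions for illustration; `coef_` and `intercept_` are attributes that `LinearRegression` sets during fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 2*x1 + 3*x2 + 1
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(20, 2))
demo_y = 2 * demo_X[:, 0] + 3 * demo_X[:, 1] + 1

demo_model = LinearRegression()
demo_model.fit(demo_X, demo_y)

# Attributes ending in an underscore are set by fit()
print(demo_model.coef_)       # one learned weight per feature
print(demo_model.intercept_)  # learned constant term
```

On this noise-free data the learned weights should match the generating equation up to floating-point error. On real data such as the iris measurements they are only the best least-squares estimate.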

### Make Predictions Against Test Set
Now you can use the fitted model to predict values for observations it has not seen during training.

[Predict Linear Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict)

In [0]:
predictions = model.predict(X_test)

### Evaluate Model

We can now evaluate how the fitted model performed using different metric formulas. Note that there are many different ways of evaluating the models, some metrics analyze different aspects of how the model performed. 

#### Use The Following Metric Scores To Measure Model Performance:

- Mean Squared Error [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error)
  - Best possible score is 0, higher values are worse.


- R2 Score [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score)
  - Best possible score is 1.0, lower values are worse.

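Note: R2 is unitless, which makes it easier to interpret than mean squared error. It measures the proportion of the variance in the target that the model explains, while mean squared error is expressed in squared units of the target (here, cm²).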
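
Both metrics can be computed by hand, which makes their definitions concrete: mean squared error is the average of the squared residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below checks the hand computation against scikit-learn on made-up values (`actual` and `predicted` are hypothetical, not results from this notebook).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values
actual = np.array([1.4, 1.5, 1.3, 1.6, 1.7])
predicted = np.array([1.45, 1.4, 1.35, 1.55, 1.6])

residuals = actual - predicted

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(actual, predicted))
print(r2_manual, r2_score(actual, predicted))
```

A model that always predicts the mean of `actual` would get an R2 of 0, and a model that does worse than that gets a negative score.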

#### Resources: 
- [How to evaluate a linear regression model](https://www.ritchieng.com/machine-learning-evaluate-linear-regression-model/)



In [0]:
print("r2 score:", r2_score(y_test, predictions))  # Best possible score is 1.0, lower values are worse.

print("mean squared error:", mean_squared_error(y_test, predictions))  # Values closer to zero are better

#### Compare the `y_test` (actual values) to the `predictions` (values created by the learned model)

Print both to see the difference in values

In [0]:
display(y_test)
display(predictions)
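
Two separate printouts can be hard to compare by eye. One option, sketched here with made-up stand-ins for `y_test` and `predictions`, is to place both in a single DataFrame and add a residual column. The values, the index labels, and the name `comparison` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for y_test and predictions
actual = pd.Series([1.4, 1.5, 1.3], index=[7, 21, 40], name="actual")
predicted = [1.45, 1.42, 1.38]  # model.predict returns an array without the index

# Reuse the index of the actual values so each row refers to one observation
comparison = pd.DataFrame(
    {"actual": actual, "predicted": predicted}, index=actual.index
)
comparison["residual"] = comparison["actual"] - comparison["predicted"]

print(comparison)
```

A negative residual means the model predicted a value higher than the actual one.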