# Linear Regression - Iris Dataset: Predicting Petal width based on Petal length

The Iris Dataset is 'perhaps the best known database to be found in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.'

<br>

Before we get started we need to import the following libraries:

1. Pandas - Library providing high-performance, easy-to-use data structures and data analysis tools
2. NumPy - Fundamental package for scientific computing with Python
3. scikit-learn (sklearn) - Simple and efficient tools for data mining and data analysis
4. Plotly Express - High-level plotting interface, used here for the scatter plot and as the source of the Iris data

In [None]:
# Import all the necessary libraries

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score, max_error, r2_score
import plotly.express as px

## Load Dataset

TODO:

- View the dataset 
- View the Summary Statistics
- Visualizing the data in a scatter graph

In [None]:

# Load the Iris dataset that ships with Plotly Express into a DataFrame
df = px.data.iris()

# Now view the number of rows and columns in the DataFrame using .shape
df.shape

In [None]:
# View the first 10 rows in the dataset
df.head(10)

In [None]:
# View the Summary Statistics for the numeric columns in the dataset
df.describe()

## Visualizing Data
Scatter plots show how two variables vary together. The strength and direction of the linear relationship between two variables is measured by their correlation.
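
To make the idea of correlation concrete, here is a minimal sketch that computes the Pearson coefficient by hand with NumPy. The arrays `x`, `y_up` and `y_down` and the helper `pearson` are made up for illustration and are not part of the Iris data.

```python
import numpy as np

# Made-up measurements for illustration only (not taken from the Iris data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0       # rises as x rises
y_down = -3.0 * x + 10.0   # falls as x rises


def pearson(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator


r_up = pearson(x, y_up)       # close to +1: perfect positive linear relationship
r_down = pearson(x, y_down)   # close to -1: perfect negative linear relationship
print(r_up, r_down)
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.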

<br>

Create a correlation matrix between all the variables in the dataset using Pearson correlation coefficients

Create a scatter plot using Plotly with the `petal_width` and `petal_length` data

In [None]:
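# Create the Pearson correlation matrix (Pearson is the default method)
# Only numeric columns are used: recent pandas versions raise an error
# if the text column `species` is passed to .corr()
df.select_dtypes(include="number").corr()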

In [None]:
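# Create the scatter plot with the input (petal_length) on the x-axis
# and the value we want to predict (petal_width) on the y-axis
px.scatter(df, x="petal_length", y="petal_width")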

## Manipulating the Data

A pandas DataFrame has the indexer *iloc*, which selects rows and columns by integer position.

The syntax is `data.iloc[<row_selection>, <column_selection>]`

Example: `data.iloc[[0, 2], [1, 3]]` retrieves the rows at positions 0 and 2 and the columns at positions 1 and 3.
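
As a small sketch of position-based selection, the snippet below applies *iloc* to a hypothetical four-column DataFrame. The frame `data` and its column names are invented for the example.

```python
import pandas as pd

# Small hypothetical DataFrame, used only to illustrate iloc
data = pd.DataFrame({
    "a": [10, 11, 12],
    "b": [20, 21, 22],
    "c": [30, 31, 32],
    "d": [40, 41, 42],
})

# Rows at positions 0 and 2, columns at positions 1 ("b") and 3 ("d")
subset = data.iloc[[0, 2], [1, 3]]

# Every row of the column at position 2 ("c")
third_column = data.iloc[:, 2]

print(subset)
print(third_column.values)
```

Note that both selections go inside a single pair of square brackets, separated by a comma.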

In [None]:
# Now we use iloc to separate the X and Y values of our dataset

X = df.iloc[:,2].values # Retrieve the third column from the dataset, i.e. petal_length
Y = df.iloc[:,3].values # Retrieve the fourth column from the dataset, i.e. petal_width

# Reshape the arrays so that they have the right structure the library expects
# When fitting the model, we need a 2D array so that each row of data
# from the original data source is in its own array

X = X.reshape(-1, 1)
Y = Y.reshape(-1, 1)

# To better understand what reshape is doing, try printing X and Y values 
# before and after applying the reshape function
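
To see the effect of `reshape(-1, 1)` without the full dataset, here is a minimal sketch on a made-up three-element array. The names `values` and `column` are illustrative only.

```python
import numpy as np

# Made-up 1D array standing in for a single column of measurements
values = np.array([1.4, 1.3, 1.5])

# -1 lets NumPy infer the number of rows; 1 asks for exactly one column
column = values.reshape(-1, 1)

print(values.shape)   # (3,)
print(column.shape)   # (3, 1)
print(column)
```

The data is unchanged; only its shape goes from a flat list of 3 numbers to 3 rows of 1 number each, which is the layout scikit-learn expects for its input features.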



## Split the Data
We need to split the dataset into training data and test (validation) data. The training set is much larger than the test set because the model generally fits better when it has more data to learn from. The test set only needs to be a smaller share of the data (30% in the cell below). Because the model never sees the test set during training, we can use it to check how well the model predicts on unseen data.
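
The split can be tried out on a tiny made-up dataset first. This sketch assumes ten invented samples (`X_demo`, `y_demo`) and only shows how `train_test_split` divides them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with one feature each (illustration only)
X_demo = np.arange(10).reshape(-1, 1)
y_demo = 2 * X_demo + 1

# test_size=0.3 holds back 30% of the rows for testing;
# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

print(len(X_tr), len(X_te))
```

Every sample ends up in exactly one of the two sets, and the rows of `X` and `y` stay paired.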

In [None]:
# We can use the SKLearn train_test_split method to do the split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)


## Define and train the model
We now need to create the model that we will use and train it with our data.

We imported the `LinearRegression` model at the top, so we can go ahead and use it. Further information about this model and its parameters can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

TODO:
- Fit the model
- Predict the petal width values for the test data
- Find out how accurate our model was
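
Before working on the Iris data, the fit and predict steps can be sketched on made-up points that lie exactly on the line y = 2x + 1. The names `X_demo`, `y_demo` and `demo_model` are hypothetical and separate from the notebook's own variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on the line y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2.0 * X_demo + 1.0

# Fit: the model learns a slope (weight) and an intercept (bias)
demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

slope = demo_model.coef_.ravel()[0]
intercept = np.ravel(demo_model.intercept_)[0]

# Predict: apply the learned line to an unseen input
prediction = demo_model.predict(np.array([[4.0]]))

print(slope, intercept, prediction)
```

Because the made-up data has no noise, the learned slope and intercept should match the line used to generate it; real data such as the Iris measurements will not fit this cleanly.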

In [None]:
# Create our model using linear regression
model = LinearRegression(fit_intercept=True)

# To train the model we use the .fit method on the model; it takes both our x and y training data as its parameters
# Go ahead and train the model


## Test the model

In [None]:
# Now we want to predict the petal width values for the test petal length data
# We can use the .predict method on the model, which takes our x_test data - give it a try
predictions = ...  # Replace the ... with your call to model.predict

# Print out the predictions


In [None]:
# Comparing with the actual petal width values from the test dataset by printing the y_test data


## Metrics

When predicting a continuous value, like the price of a house or a stock, we can't use a classification accuracy metric, so we use error-based metrics instead. A common one is the mean squared error (MSE): the average squared difference between the actual and predicted values. The lower the MSE, the closer the predictions are to the actual values.
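
As a worked example of the MSE, the sketch below computes it by hand with NumPy. The `y_true` and `y_pred` values are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted values (illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

# Squared differences: 0.25, 0.0, 0.25, 1.0
squared_errors = (y_true - y_pred) ** 2

# MSE is their average: 1.5 / 4 = 0.375
mse = squared_errors.mean()
print(mse)
```

Squaring means that one large miss (the last value here) contributes far more to the MSE than several small ones.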

<br>

### Score

Returns the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where 

*   u = residual sum of squares ((y_true - y_pred) ** 2).sum() 
*   v = the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). 

The best possible score is 1.0 (perfect prediction), and the score can be negative because a model can be arbitrarily worse than the constant baseline.
A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
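
The definition of R^2 above can be checked by hand. This is a minimal sketch using made-up `y_true` and `y_pred` arrays, following the u and v formulas as written.

```python
import numpy as np

# Made-up actual and predicted values (illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

# u: residual sum of squares, v: total sum of squares
u = ((y_true - y_pred) ** 2).sum()          # 1.5
v = ((y_true - y_true.mean()) ** 2).sum()   # 5.0
r2 = 1 - u / v                              # 1 - 0.3 = 0.7

# A constant model that always predicts the mean of y_true scores 0.0
constant_pred = np.full_like(y_true, y_true.mean())
u_constant = ((y_true - constant_pred) ** 2).sum()
r2_constant = 1 - u_constant / v

print(r2, r2_constant)
```

A model scores below 0 whenever its residual sum of squares u is larger than v, that is, when it does worse than simply predicting the mean.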

<br>

TODO:
- Get the score of the model 

In [None]:
# To get the score we can use the .score method on the model and we pass in both x and y test as parameters - try it out



### max_error

The max_error function computes the maximum residual error, a metric that captures the worst-case error between the predicted value and the true value. In a perfectly fitted single-output regression model, max_error would be 0 on the training set. This is highly unlikely in the real world, so in practice the metric shows the largest error the model makes on the data it is evaluated on.
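
To show what this metric measures, here is a minimal NumPy sketch that computes the maximum residual by hand. The `y_true` and `y_pred` values are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted values (illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

# Absolute residuals: 0.5, 0.0, 0.5, 1.0
abs_residuals = np.abs(y_true - y_pred)

# The maximum residual error is the single largest miss
worst_case = abs_residuals.max()
print(worst_case)
```

Unlike an average-based metric, this value is determined entirely by the one prediction that is furthest from its true value.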

In [None]:
max_error(y_test, predictions)

## Exercises 



Give some examples of:

* Unrelated variables (no correlation).
* Variables that increase together (positive correlation).
* Variables where one decreases as the other increases (negative correlation).

Create your own testing data with very high/low values and see how good your model's predictions are on edge cases.



In [None]:
### Add code here to solve the exercise


## Multiple regression

It is rare that a dependent variable is explained by only one variable. When several variables matter, an analyst uses multiple regression, which attempts to explain a dependent variable using more than one independent variable. Multiple regressions can be linear or nonlinear.

Multiple linear regression uses a linear equation with one weight per input:


    f(x, y, z) = (w1 * x) + (w2 * y) + (w3 * z) 

    x, y, z = input data 

    w = weights 


Example:  

    Sales = (w1 * Radio) + (w2 * TV) + (w3 * News) 
    

Whereas simple regression uses the simple straight line equation

            y = mx + c


            y => prediction
            x => input data
            m => weight
            c => Bias

Example:

            Sales = (Weight x Radio) + Bias
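
The multiple-input equation can be sketched with the same `LinearRegression` class by passing one column per input. Everything in this example is made up: the spend figures are random numbers, and the weights and the bias of 3.0 are assumptions chosen to generate noise-free data (the multiple-input formula above does not include a bias term).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical advertising spend: columns are Radio, TV, News
spend = rng.uniform(0, 10, size=(20, 3))

# Made-up "true" weights and bias used to generate noise-free sales figures
true_weights = np.array([0.5, 1.5, -2.0])
true_bias = 3.0
sales = spend @ true_weights + true_bias

# One model, three inputs: it learns one weight per column plus a bias
multi_model = LinearRegression()
multi_model.fit(spend, sales)

print(multi_model.coef_)        # learned w1, w2, w3
print(multi_model.intercept_)   # learned bias
```

With noise-free data the learned weights should match the ones used to generate it; the only change from simple regression is that the input has three columns instead of one.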


For further information, follow the link below for a practical example of using multiple linear regression to predict a stock price from two independent variables: interest rate and unemployment rate.

https://datatofish.com/multiple-linear-regression-python/