# Linear Regression

Predict the transaction price of an observation using the real estate dataset.


## Import Required Libraries

**Note:** By convention, you can tell the difference between a class and a function by the capitalization of its name.

- A **class** will be capitalized
- A **function** will be lowercase
- A **method**, or a function belonging to a class, will also be lowercase. You can call a method through an instance of a class (instance method) or through the class itself (static method or class method)

References: 
- [Understanding what a class is](https://www.hackerearth.com/practice/python/object-oriented-programming/classes-and-objects-i/tutorial/)
- [Differences between functions and methods](https://www.tutorialspoint.com/difference-between-method-and-function-in-python)
- [Different types of methods](https://www.bogotobogo.com/python/python_differences_between_static_method_and_class_method_instance_method.php)
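
The naming convention above can be sketched with a small example. The names `square`, `Rectangle`, `area`, and `describe` are hypothetical and exist only for this illustration; they do not come from any library used in this notebook.

```python
# Hypothetical names for illustration only.

def square(x):
    """A plain function: lowercase name, called on its own."""
    return x * x


class Rectangle:
    """A class: capitalized name, acts as a blueprint for instances."""

    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        """Instance method: called through an instance, which arrives as `self`."""
        return self.width * self.height

    @staticmethod
    def describe():
        """Static method: callable through the class itself, no instance needed."""
        return "A rectangle has four right angles."


rect = Rectangle(2, 3)        # calling the class creates an instance
print(square(4))              # function call
print(rect.area())            # instance method call
print(Rectangle.describe())   # static method call
```

The same pattern appears in the imports below: `LinearRegression` is a class, `train_test_split` is a function, and `fit` is an instance method.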





In [0]:
# Collection libraries 
import numpy as np
import pandas as pd

# Visual libraries 
import matplotlib.pyplot as plt
import seaborn as sns

# Helper for splitting training and testing data
from sklearn.model_selection import train_test_split

# Models/Estimators
from sklearn.linear_model import LinearRegression

# Model evaluation functions
from sklearn.metrics import r2_score, mean_squared_error

#### Notes about imports with this notebook:
We will re-import some of the libraries when we use these modules. This is to get you used to importing them and understanding their classes and functions. Reference the documentation to understand each library's classes, methods, and functions. 

## Load Data

<hr>

##### Mount Drive - **Google Colab Only Step**

When using Google Colab, we need to mount Google Drive before we can access files stored on it. Run the Python cell below, open the link it generates, and paste the resulting authorization code into the cell's prompt.



In [0]:
from google.colab import drive
drive.mount('/content/drive')

Change Directory To Access The Dependent Files - **Google Colab Only Step**

In [0]:
directory = "student"
if (directory == "student"):
  %cd drive/Colab\ Notebooks/machine-learning/
else:
  %cd drive/Shared\ drives/Rubrik/Data\ Science\ Track/machine-learning

<hr> 

### Import Real Estate Dataset
Read in the real estate dataset using the following path and store it in a variable called `df`.

#### Import the cleaned real estate dataset
- Use pandas' `read_csv()` function

#### Pandas' `read_csv()` parameters:
- `filepath_or_buffer` (string): path of csv to import

```python 
filepath_or_buffer = './data/cleaned_and_feature_engineered_real_estate.csv'
```

In [0]:
df = pd.read_csv(filepath_or_buffer = './data/cleaned_and_feature_engineered_real_estate.csv')

### Show Head Of Dataset

In [0]:
df.head()

Unnamed: 0,tx_price,beds,baths,sqft,year_built,lot_size,basement,median_age,married,college_grad,property_tax,insurance,median_school,num_schools,tx_year,lifestyle_avg,two_and_two,exterior_walls_Brick,exterior_walls_Brick veneer,exterior_walls_Combination,exterior_walls_Metal,exterior_walls_Other,exterior_walls_Siding (Alum/Vinyl),exterior_walls_Wood,exterior_walls_missing,roof_Asphalt,roof_Composition Shingle,roof_Other,roof_Shake Shingle,roof_missing,property_type_Apartment / Condo / Townhouse,property_type_Single-Family
0,295850.0,1.0,1.0,584.0,2013.0,0.0,0.0,33.0,65.0,84.0,234.0,81.0,9.0,3.0,2013.0,1.493259,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
1,216500.0,1.0,1.0,612.0,1965.0,0.0,1.0,39.0,73.0,69.0,169.0,51.0,3.0,3.0,2006.0,0.676598,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0
2,279900.0,1.0,1.0,615.0,1963.0,0.0,0.0,28.0,15.0,86.0,216.0,74.0,8.0,3.0,2012.0,2.298254,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
3,379900.0,1.0,1.0,618.0,2000.0,33541.0,0.0,36.0,25.0,91.0,265.0,92.0,9.0,3.0,2005.0,2.47365,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
4,340000.0,1.0,1.0,634.0,1992.0,0.0,0.0,37.0,20.0,75.0,88.0,30.0,9.0,3.0,2002.0,1.661371,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0


### Show Tail Of Dataset

In [0]:
df.tail()

Unnamed: 0,tx_price,beds,baths,sqft,year_built,lot_size,basement,median_age,married,college_grad,property_tax,insurance,median_school,num_schools,tx_year,lifestyle_avg,two_and_two,exterior_walls_Brick,exterior_walls_Brick veneer,exterior_walls_Combination,exterior_walls_Metal,exterior_walls_Other,exterior_walls_Siding (Alum/Vinyl),exterior_walls_Wood,exterior_walls_missing,roof_Asphalt,roof_Composition Shingle,roof_Other,roof_Shake Shingle,roof_missing,property_type_Apartment / Condo / Townhouse,property_type_Single-Family
1877,385000.0,5.0,6.0,6381.0,2004.0,224334.0,1.0,46.0,76.0,87.0,1250.0,381.0,10.0,3.0,2002.0,-0.792553,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1
1878,690000.0,5.0,6.0,6501.0,1956.0,23086.0,1.0,42.0,73.0,61.0,1553.0,473.0,9.0,3.0,2015.0,0.247411,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1
1879,600000.0,5.0,6.0,7064.0,1995.0,217800.0,1.0,43.0,87.0,66.0,942.0,287.0,8.0,1.0,1999.0,-0.643123,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
1880,759900.0,5.0,6.0,7500.0,2006.0,8886.0,1.0,43.0,61.0,51.0,803.0,245.0,5.0,2.0,2009.0,-0.524305,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1
1881,735000.0,5.0,6.0,7515.0,1958.0,10497.0,1.0,37.0,80.0,86.0,1459.0,444.0,9.0,3.0,2015.0,-0.696683,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1


<hr> 

<br>

## Separate the dataset's features from the target variable

**Tasks:**
- Print shape of original DataFrame before manipulating the DataFrame
- Create a new DataFrame called `X` to contain only the features 
- Create a new Series called `y` to contain only the labels

<br>

### Question: 
Why would you split the data this way?



Answer:

We do this to separate the features of the dataset from the target value. For this problem, `tx_price` is the target variable because we want to predict the transaction price from the other features in the dataset.

<br>

### Print Shape Of Original DataFrame
We will do this to confirm our manipulations later


In [0]:
df.shape

(1882, 32)

### Create A DataFrame Called `X` To Hold All The Features
**Note:** `X` is uppercase because it's a 2D array / matrix. A matrix holds multiple rows and more than one column. 

**Tip:** Consider using the DataFrame's `drop` method to create this new DataFrame


#### DataFrame's `drop` method parameters:
- `labels` (string or list of strings): index or column labels to drop
- `axis`  ({0 or ‘index’, 1 or ‘columns’}): default 0; whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’)
- `inplace` (bool): default False; If True, do operation inplace and return None.
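
As a minimal sketch of how these parameters interact, the example below uses a small made-up DataFrame named `toy` (an assumption for illustration; only the column names mirror the real estate dataset).

```python
import pandas as pd

# Toy data; the column names mirror the dataset, the frame itself is made up.
toy = pd.DataFrame({
    "tx_price": [295850.0, 216500.0, 279900.0],
    "beds": [1, 1, 2],
    "sqft": [584, 612, 615],
})

# axis=1 drops a column; the default inplace=False returns a new DataFrame.
features = toy.drop(labels="tx_price", axis=1)

# axis=0 drops a row by its index label instead.
fewer_rows = toy.drop(labels=0, axis=0)

print(features.shape)    # one column fewer than `toy`
print(fewer_rows.shape)  # one row fewer than `toy`
print(toy.shape)         # `toy` itself is unchanged
```

Because `inplace` defaults to `False`, the result has to be assigned to a name, otherwise the dropped version is lost.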



In [0]:
X = df.drop(labels="tx_price", axis=1)

#### Show Shape of `X` to make sure we created the features DataFrame correctly:

It should have 1882 rows and 31 columns

In [0]:
X.shape

(1882, 31)

### Print Head Of Features Matrix `X`

In [0]:
X.head()

Unnamed: 0,beds,baths,sqft,year_built,lot_size,basement,median_age,married,college_grad,property_tax,insurance,median_school,num_schools,tx_year,lifestyle_avg,two_and_two,exterior_walls_Brick,exterior_walls_Brick veneer,exterior_walls_Combination,exterior_walls_Metal,exterior_walls_Other,exterior_walls_Siding (Alum/Vinyl),exterior_walls_Wood,exterior_walls_missing,roof_Asphalt,roof_Composition Shingle,roof_Other,roof_Shake Shingle,roof_missing,property_type_Apartment / Condo / Townhouse,property_type_Single-Family
0,1.0,1.0,584.0,2013.0,0.0,0.0,33.0,65.0,84.0,234.0,81.0,9.0,3.0,2013.0,1.493259,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
1,1.0,1.0,612.0,1965.0,0.0,1.0,39.0,73.0,69.0,169.0,51.0,3.0,3.0,2006.0,0.676598,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0
2,1.0,1.0,615.0,1963.0,0.0,0.0,28.0,15.0,86.0,216.0,74.0,8.0,3.0,2012.0,2.298254,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
3,1.0,1.0,618.0,2000.0,33541.0,0.0,36.0,25.0,91.0,265.0,92.0,9.0,3.0,2005.0,2.47365,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
4,1.0,1.0,634.0,1992.0,0.0,0.0,37.0,20.0,75.0,88.0,30.0,9.0,3.0,2002.0,1.661371,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0


### Create A Series Called `y` To Hold All The Labels
**Note:** `y` is lowercase because it's a Series, a 1D vector that holds one value per row.



In [0]:
y = df.loc[:, 'tx_price']

#### Show Shape of `y` to make sure we created the target series correctly:

It should have 1882 rows in a single column

**Note:** the shape will print as `(1882,)`. A Series is one-dimensional, so its shape has a single entry: 1882 values, which you can think of as 1882 rows in one column.
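
To see why the shape has only one entry, the sketch below selects the same column two ways from a made-up two-row DataFrame (`toy` is an assumption for illustration).

```python
import pandas as pd

toy = pd.DataFrame({"tx_price": [295850.0, 216500.0], "beds": [1, 1]})

as_series = toy.loc[:, "tx_price"]     # single label -> 1D Series
as_frame = toy.loc[:, ["tx_price"]]    # list of labels -> 2D DataFrame

print(as_series.shape, as_series.ndim)
print(as_frame.shape, as_frame.ndim)
```

Selecting with a single label gives the 1D Series used for `y`; wrapping the label in a list keeps the result two-dimensional.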

In [0]:
y.shape

(1882,)

### Print Head Of Label Series `y`

In [0]:
y.head()

0    295850.0
1    216500.0
2    279900.0
3    379900.0
4    340000.0
Name: tx_price, dtype: float64

<hr>
<br>

## Split Data Into Training And Testing 

Even though we will perform cross-validation in the near future, we still want to split the dataset into a training set and a testing set. After we find the best estimator (the best-fitted model) by training with specific hyperparameter values, we can evaluate it on unseen testing data. Comparing performance on the training and testing data helps us understand whether the model is overfitting or underfitting. 

Additional Resources:
- [Learn more about overfitting and underfitting](https://github.com/SoftStackFactory/PythonDataScienceHandbook/blob/master/notebooks/05.03-Hyperparameters-and-Model-Validation.ipynb)
- [Interested in how to better fit a model?](https://github.com/SoftStackFactory/PythonDataScienceHandbook/blob/master/notebooks/05.04-Feature-Engineering.ipynb)

Note: The second resource is very informative about feature engineering, but we specifically want to emphasize the **Derived Features** section. 
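
One way to spot overfitting is to compare the error on the training data with the error on held-out data. The sketch below does this with synthetic data and NumPy polynomial fits rather than the real estate dataset; the data, the split, and the helper `mse` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a straight line plus noise.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(scale=0.2, size=x.size)

# Hold out every other point as a stand-in for unseen testing data.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def mse(actual, predicted):
    """Mean squared error between two arrays."""
    return float(np.mean((actual - predicted) ** 2))


results = {}
for degree in (1, 6):
    # Fit a polynomial of the given degree to the training points only.
    coefficients = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (
        mse(y_train, np.polyval(coefficients, x_train)),  # training error
        mse(y_test, np.polyval(coefficients, x_test)),    # testing error
    )
    print(degree, results[degree])
```

A training error far below the testing error suggests overfitting, while high errors on both suggest underfitting.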

<br>

#### Split Data Into Training And Testing Sets Using The `train_test_split` Function
 
Requirements: 
- pass in `0.20` as the argument for the `test_size` parameter
- pass in `1` as the argument for the `random_state` parameter

**Note:** We set the `random_state` parameter to a fixed value so that running this notebook multiple times, or on different computers, produces the same split of data. This is important for reproducing experiments and simulations.

[`train_test_split` function documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
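
The effect of a fixed `random_state` can be checked directly. The sketch below uses small made-up arrays (an assumption for illustration) and calls `train_test_split` twice with the same arguments.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 observations with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same random_state.
_, X_test_a, _, y_test_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
_, X_test_b, _, y_test_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

print(y_test_a, y_test_b)  # the same observations are selected both times
print(len(y_test_a))       # 20% of the 10 observations
```

Leaving `random_state` unset would generally give a different split on each call, which makes results harder to compare between runs.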

### Import `train_test_split` Function From Sklearn's Library

In [0]:
# Helper for splitting training and testing sets
from sklearn.model_selection import train_test_split

#### Split Data Into Training And Testing Sets 

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)


<hr>

### Create Model
We can use Scikit-Learn's ``LinearRegression`` estimator to fit this data and construct the best-fit line:

Steps: 

1. Make sure the model class / blueprint is imported
2. Create an instance of this model class / blueprint by invoking the constructor with the desired hyper-parameters 

Requirements: 
- No hyperparameters will be set for this model

[Linear Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [0]:
model = LinearRegression()

### Fit Model to Training Data
Fitting the estimator lets it learn the relationship between the features and the target, so that it can predict target values for unseen samples.
This is the training/learning aspect of the model.

[Fit Linear Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit)
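
To make "learning" concrete, the sketch below fits a `LinearRegression` to made-up data that follows `y = 2x + 1` exactly and then inspects the learned parameters. The names `X_toy`, `y_toy`, and `toy_model` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy[:, 0] + 1.0

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

print(toy_model.coef_)       # learned slope, one value per feature
print(toy_model.intercept_)  # learned intercept
```

On the real estate data the fitted model holds one coefficient per feature column, which can be inspected the same way through `coef_`.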

In [0]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### Make Predictions Against Testing Set 
Now you can use the fitted model to predict values for unseen observations, here the testing set `X_test`

[Predict Linear Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict)

In [0]:
predictions = model.predict(X_test)

### Evaluate Model

We can now evaluate how the fitted model performed using different metric formulas. Note that there are many different ways of evaluating the models, some metrics analyze different aspects of how the model performed. 

#### Use The Following Metric Scores To Measure Model Performance:

- Mean Squared Error [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error)
  - Best possible score is 0, higher values are worse.


- R2 Score [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score)
  - Best possible score is 1.0, lower values are worse.

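Note: The R2 score is often easier to interpret than the mean squared error. It is unitless, whereas the mean squared error is expressed in squared units of the target (here, squared dollars) and therefore depends on the scale of the prices.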
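
The two metrics can also be computed by hand from their standard definitions, which may help when interpreting the scores. The sketch below uses made-up actual and predicted prices (an assumption for illustration) and adds the root mean squared error, which is not used elsewhere in this notebook, because it is expressed in the same units as the target.

```python
import numpy as np

# Made-up actual and predicted prices.
actual = np.array([300000.0, 200000.0, 400000.0, 500000.0])
predicted = np.array([320000.0, 210000.0, 390000.0, 450000.0])

residuals = actual - predicted

# MSE: mean of the squared residuals, in squared dollars.
mse = np.mean(residuals ** 2)

# RMSE: square root of the MSE, back in dollars.
rmse = np.sqrt(mse)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, rmse, r2)
```

An R2 of 1 means the predictions match the actual values exactly, and an R2 of 0 means the model does no better than always predicting the mean of the actual values.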

#### Resources: 
- [How to evaluate a linear regression model](https://www.ritchieng.com/machine-learning-evaluate-linear-regression-model/)



In [0]:
print("mean squared error score:", mean_squared_error(y_test, predictions) )  # Best possible score is 0, higher values are worse

print("r2 score:", r2_score(y_test, predictions, multioutput='variance_weighted') ) # Best possible score is 1.0, lower values are worse

mean squared error score: 11982745969.198545
r2 score: 0.4667882556254307


#### Compare the `y_test` (actual values) to the `predictions` (values created by the learned model)

Print the first five values of both to see the difference in values

In [0]:
display(y_test.head())
display(predictions[:5])

1347    360000.0
1868    765000.0
887     333250.0
650     590000.0
102     369900.0
Name: tx_price, dtype: float64

array([426719.81285466, 656464.19741555, 493557.15898164, 584228.1574835 ,
       387041.72778977])

<hr>

## Let Us Try To Increase Performance

### Let Us Build A Function To Do The Following:
  - This function will take a DataFrame parameter called `df`
  - Create a features DataFrame (`X`) and a target series (`y`)
  - Split the data into training and testing sets
    - pass in `0.20` as the argument for the `test_size` parameter
    - pass in `1` as the argument for the `random_state` parameter
  - Instantiate a linear regression model
  - Fit the model
  - Make predictions
  - Print the evaluation of the model using the mean squared error and r2 scoring functions

**Note:** 

For the mean squared error score the best possible score is 0, higher values are worse. 

For the r2 score the best possible score is 1, lower values are worse

<br>

We will automate this because we will be modifying the DataFrame and then assessing whether these modifications improved our model's performance.

In [0]:
def build_and_evaluate(df):
  # Create a feature matrix
  X = df.drop(labels="tx_price", axis=1)

  # Create a label series
  y = df.loc[:, 'tx_price']

  # Split data into train and test sets
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=1)

  # Create a Model
  model = LinearRegression()

  # Fit the model 
  model.fit(X_train, y_train)

  # Create Predictions Series
  predictions = model.predict(X_test)

  # Print Evaluations
  print("mean squared error score:", mean_squared_error(y_test, predictions) )  # Best possible score is 0, higher values are worse
  print("r2 score:", r2_score(y_test, predictions, multioutput='variance_weighted') ) # Best possible score is 1.0, lower values are worse

### Run the function to make sure it works 
Pass in the original unmanipulated DataFrame `df` to the function to see the model's performance.

In [0]:
build_and_evaluate(df)

mean squared error score: 11842349431.506424
r2 score: 0.47303566193441826


## Try To Increase Our Performance By Altering The Dataset

We have two approaches to manipulating the dataset. 

### One:
Pass a manipulated copy of the original DataFrame as the argument to the `df` parameter of the `build_and_evaluate()` function to see if the model's performance improved. Do not manipulate the original inplace yet, because if the manipulation doesn't improve the model we want to disregard it.
- If the performance increased, make the manipulation of the original DataFrame permanent so that the change persists. 
- If the performance decreased, try a different operation on the DataFrame (not inplace), and make it permanent only if it improves the model's performance. 

```python
build_and_evaluate(df[:100]) # pass in only the first 100 rows of the dataset; this is not an inplace operation, so the original dataset is not updated
```
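
The direction of the slice matters here. As a small sketch on a made-up ten-row DataFrame (`toy` is an assumption for illustration), `[:3]` keeps the first three rows while `[3:]` keeps everything after them, and neither changes the original.

```python
import pandas as pd

# Made-up stand-in for the real estate DataFrame.
toy = pd.DataFrame({"tx_price": range(10), "sqft": range(100, 110)})

first_rows = toy[:3]   # the first 3 rows
remaining = toy[3:]    # every row after the first 3

print(len(first_rows))  # rows kept by [:3]
print(len(remaining))   # rows kept by [3:]
print(len(toy))         # the original keeps all of its rows
```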

### Two 
Before manipulating the original DataFrame, make a copy and save it to a variable named `df_before_manipulation`, just in case the next operation decreases the performance.

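We want to make a deep copy (an independent copy of every value) rather than just assigning the DataFrame to a second variable name. Plain assignment copies nothing: both names refer to the same object in memory, so a change made through one name is visible through the other. Check the shallow versus deep copy reference if you want a deeper understanding. You can also search the web for "Difference between passing by reference versus passing by value".

#### Assignment Example (No Copy Is Made)

```python 
same_object = df  # not a copy: `same_object` and `df` are the same DataFrame
```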

#### Deep Copy Example:

``` python
deep_copy = df.copy()
```

- [Pandas Deep Copy Reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html)

- [Shallow Versus Deep Copy Reference](https://www.geeksforgeeks.org/copy-python-deep-copy-shallow-copy/)
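
The difference between assigning a DataFrame to a second name and calling `copy()` can be checked on a small made-up DataFrame. The names `original`, `alias`, and `backup` below are assumptions for illustration.

```python
import pandas as pd

original = pd.DataFrame({"sqft": [584, 612], "property_tax": [234.0, 169.0]})

alias = original           # no copy: a second name for the same object
backup = original.copy()   # deep copy (deep=True is the default)

# Modify the original after taking the backup.
original["property_tax_per_sqft"] = original["property_tax"] / original["sqft"]

print(alias is original)                          # both names share one object
print("property_tax_per_sqft" in alias.columns)   # the alias sees the new column
print("property_tax_per_sqft" in backup.columns)  # the backup does not

# Reverting is then a matter of rebinding the name to the backup.
original = backup
```

This is why a backup made with `copy()` can be used to undo a manipulation, while a second name for the same DataFrame cannot.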


<hr>

### See The Difference In Performance Between Training A Model With A Lot Of Data versus A Small Amount Of Data

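Pass in 50 rows of the dataset into the `build_and_evaluate` function to see the difference in performance.

**Hint:** 
Refer back to approach one: pass a sliced, non-inplace version of the DataFrame as the argument. Note that `df[:50]` selects the first 50 rows, while `df[50:]` skips the first 50 rows and keeps the rest.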

In [0]:
build_and_evaluate(df[50:])  # keeps every row after the first 50 (1832 rows); use df[:50] to pass only 50 rows

mean squared error score: 12776004943.123436
r2 score: 0.46262142541680096


### Questions 
- Did this increase or decrease the performance? Why?

### Create a new feature on the DataFrame and pass this new DataFrame as an argument to the `build_and_evaluate()` function to see if the modification improved the performance

**Hint:**
Create a copy of the original DataFrame first, just in case this manipulation doesn't increase the performance of the model and we need to disregard it and revert to the previous version of the DataFrame. Call this backup of the DataFrame `df_before_manipulation` 


#### Tips For Feature Engineering:
When feature engineering, look at the correlations between the target variable and the other features. 

Unfortunately, we cannot create new features by combining features with the target variable, because we do not have the target variable at the time of prediction,
e.g. `df['new_feature'] = df['target'] / df['other_feature']`

Instead, we can swap out the target for features that are highly correlated with it.

For example, when looking at the correlations between the target variable and the other features in the real estate DataFrame, we notice that the feature `property_tax` is highly correlated with `tx_price` (the target variable). So instead of creating the feature `price_per_sqft`, we can create `property_tax_per_sqft`. 
 
**Can't Do / Wrong:** `df['price_per_sqft'] = df['tx_price'] / df['sqft']`
- We cannot do this because we will not have the `tx_price` data point of an unseen observation at the time of prediction

**Can Do:** `df['property_tax_per_sqft'] = df['property_tax'] / df['sqft']`

Try your best to create a mapping using other correlated features. You'll only get better over time and with the help of domain knowledge.
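
As a rough sketch of this workflow, the example below computes each feature's correlation with the target and then builds a ratio feature from inputs only. The DataFrame `toy` is an assumption for illustration: its values are the `tx_price`, `sqft`, and `property_tax` entries from the five rows shown by `df.head()` above, and five rows are far too few to draw conclusions from, so run the same calls on the full `df` in practice.

```python
import pandas as pd

# Values copied from the first five rows shown by df.head() above.
toy = pd.DataFrame({
    "tx_price": [295850.0, 216500.0, 279900.0, 379900.0, 340000.0],
    "sqft": [584.0, 612.0, 615.0, 618.0, 634.0],
    "property_tax": [234.0, 169.0, 216.0, 265.0, 88.0],
})

# Correlation of every feature with the target (the target itself is dropped).
correlations = toy.corr()["tx_price"].drop(labels="tx_price")
print(correlations)

# Build the new feature from inputs only, never from the target.
toy["property_tax_per_sqft"] = toy["property_tax"] / toy["sqft"]
print(toy.head())
```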


In [0]:
df_before_manipulation = df.copy()

df['property_tax_per_sqft'] = df['property_tax'] / df['sqft']

build_and_evaluate(df)

mean squared error score: 11842349431.506424
r2 score: 0.47303566193441826


### Questions 
- Did this increase or decrease the performance?
- If it didn't, what would we do with `df_before_manipulation`?

<hr> 

### Your Turn 
Try to bring up the performance of the model

In [0]:
df_before_manipulation = df.copy()

df['baths_per_sqft'] = df['baths'] / df['sqft']

build_and_evaluate(df)

mean squared error score: 11842349431.506424
r2 score: 0.47303566193441826


### Questions 
- Did this increase or decrease the performance?
- If it didn't, what would we do with `df_before_manipulation`?