# Final Project Template

For the final project for this module, you are asked to use data analysis techniques and linear regression to create a model to predict housing prices. 

In Video 7.9, Dr. Williams presented you with an example of data analysis in which housing prices were predicted by using just the columns `OverallQual` and `MassVnrArea` from the data provided. In Video 7.10, Dr. Williams showed more examples of data visualization and manipulation in addition to a more detailed analysis of the data.

Your challenge in this project is to improve Dr. Williams' results from Video 7.9 by choosing different variables in the *dataframe* to create your model. Although in Video 7.10 you are offered a sample data analysis which uses five columns from the data provided, your project submission must include an analysis of at least three additional variables and offer other solutions that improve the results obtained by Dr. Williams in these two videos.

Before you fill out the project outline template below, make sure you:

- Read through the template completely to understand the instructions for the structure of the project.
- Have a clear understanding of what to do to create a model that will return the results you want to find.
- Use Markdown to edit the template.

<div class="alert alert-block alert-success">
The purpose of this Jupyter Notebook is to give you a structure to follow when you are solving your problem and developing your model with Python. Make sure you follow it carefully. You can add more subsections if needed, but remember to fill out every section provided in the template.
</div>

<div class="alert alert-block alert-danger">
Delete all cells above, including this one, before submitting your final Notebook.
</div>

# Housing_Prediction

**Rajat Nathan**


# Index

- [Abstract](#Abstract)
- [1. Introduction](#1.-Introduction)
- [2. The Data](#2.-The-Data)
    - [2.1 Import the Data](#2.1-Import-the-Data)
    - [2.2 Data Exploration](#2.2-Data-Exploration)
    - [2.3 Data Preparation](#2.3-Data-Preparation)
    - [2.4 Correlation](#2.4-Correlation)
- [3. Project Description](#3.-Project-Description)
    - [3.1 Linear Regression](#3.1-Linear-Regression)
    - [3.2 Analysis](#3.2-Analysis)
    - [3.3 Results](#3.3-Results)
    - [3.4 Verify Your Model Against Test Data](#3.4-Verify-Your-Model-Against-Test-Data)
- [Conclusion](#Conclusion)
- [References](#References)

[Back to top](#Index)


##  Abstract

__This is a brief description (150 words or less) of your analysis and results of your prediction model. Complete this portion of the template after you are done working on your project.__

My housing prediction model uses linear regression to estimate the relationship between a dependent variable and a set of independent variables. The model expresses this relationship as a linear equation, which requires computing a weight for each independent variable such that:

`Y = w1 * X1 + w2 * X2 + w3 * X3 + ...`

For this project, I identified the 8 independent variables in the training data that showed the highest correlation with the dependent variable, `SalePrice`.

Fitting this data with the linear regression model from the scikit-learn library (`sklearn`) resulted in an R^2 value of **0.8824472376719946** on the training data.

Applying this model to the test data gave an R^2 value of **0.7705677801288706**.

[Back to top](#Index)


## 1. Introduction

__Introduce your project using 300 words or less. Describe all the processes you followed to solve the problem and create your prediction model. Start by summarizing the steps that you intend to perform and then elaborate on this section after you have completed your project.__


For this project, I identified the 8 independent variables in the training data that showed the highest correlation with the dependent variable, `SalePrice`. The original dataframe has shape 100 x 82.

I first removed the columns with a large number of NaNs (a count of 15 or more) and interpolated the NaNs in the remaining numerical columns, so that I could work with a cleaner training data set. At this point the dataframe has shape 100 x 77.

From the remaining data, I identified the 38 categorical columns and used **one hot encoding** to transform them into numerical columns (this process also removed the small number of NaNs in the categorical data). I then joined the encoded columns to the cleaned data set, giving a shape of 100 x 248, and **dropped** the original categorical columns, leaving a dataframe of shape 100 x 210.

In the resulting data set, I calculated the correlation of the dependent variable with each remaining column to identify the eight independent variables (X) with the highest correlation with the dependent variable, `SalePrice` (Y). One of these variables, `BsmtQual_Ex` (indicating excellent basement quality), was produced by the **one hot encoding** step of the data cleansing and transformation process.

Finally, a dataframe with the top 8 variables most correlated with `SalePrice` was created and used as input to the linear regression model.



[Back to top](#Index)

## 2. The Data

For each of the steps below, make sure you include a description of your steps as well as your complete code. 

[Back to top](#Index)

### 2.1 Import the Data

__Import the necessary libraries and the data for the project. Include any auxiliary pandas *functions* that can be used to retrieve preliminary information about your data.__

__Make sure to include a description of the data.__

```python
import pandas as pd                # data handling
import numpy as np
from sklearn import linear_model   # linear regression model
import matplotlib.pyplot as plt    # plotting
import seaborn as sns

# Load the training data
data = pd.read_csv("houseSmallData.csv")
```

Summary statistics of the dependent variable Y, `SalePrice`:

* count:   100.000000
* mean:    173820.660000
* std:     72236.552886
* min:     40000.000000
* 25%:     129362.500000
* 50%:     153750.000000
* 75%:     207750.000000
* max:     438780.000000
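
Statistics like these can be retrieved with a few pandas helpers. The sketch below is illustrative only: it uses a small made-up frame (`demo`) in place of `houseSmallData.csv`, so the numbers it prints are not the project's.

```python
import pandas as pd

# Hypothetical stand-in for houseSmallData.csv so the snippet runs on its own.
demo = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "OverallQual": [7, 6, 7, 7, 8],
    "Alley": [None, None, "Pave", None, None],
})

# (rows, columns) of the frame
print(demo.shape)

# count, mean, std, min, quartiles, and max of the dependent variable
stats = demo["SalePrice"].describe()
print(stats)

# number of missing values per column
missing = demo.isnull().sum()
print(missing)
```

On the real data, the same three calls would be applied to `data` instead of `demo`.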

[Back to top](#Index)

### 2.2 Data Exploration

Create graphs displaying the relationships between the variables that you consider most important to solve the problem of predicting housing prices.


Include a description of the results displayed by each *plot*.

Correlation of the top 8 independent variables with the dependent variable, `SalePrice`:

* SalePrice:     1.000000
* OverallQual:   0.855061
* GrLivArea:     0.735129
* GarageArea:    0.688249
* BsmtQual_Ex:   0.680094
* GarageCars:    0.663441
* YearBuilt:     0.658636
* TotalBsmtSF:   0.616297
* GarageYrBlt:   0.589361

![scatter_overqual](/Users/rajatnathan/Scatter_SP_OverallQual.png "img_scatter_1")

[Back to top](#Index)

### 2.3 Data Preparation

**Determine if there are any missing values in the data. Did the data need to be reshaped? If yes, include a description of the steps you followed to clean the data.**

The original dataframe has shape 100 x 82.

I first removed the columns with a large number of NaNs (a count of 15 or more) and interpolated the NaNs in the remaining numerical columns, so that I could work with a cleaner training data set. At this point the dataframe has shape 100 x 77.

From the remaining data, I identified the 38 categorical columns and used **one hot encoding** to transform them into numerical columns (this process also removed the small number of NaNs in the categorical data). I then joined the encoded columns to the cleaned data set, giving a shape of 100 x 248, and **dropped** the original categorical columns, leaving a dataframe of shape 100 x 210.
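
The preparation steps described above could look roughly like the following. This is a minimal sketch on a hypothetical four-row frame (`raw`), not the project's actual code; the threshold `NAN_LIMIT` is set to 3 here only because the toy frame is so small, whereas the write-up uses 15 on the 100-row data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the training data.
raw = pd.DataFrame({
    "SalePrice": [200000, 150000, 180000, 220000],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],   # one gap to interpolate
    "Alley": [np.nan, np.nan, np.nan, "Pave"],   # mostly missing
    "BsmtQual": ["Ex", "Gd", np.nan, "Ex"],      # categorical with one NaN
})

NAN_LIMIT = 3  # assumption for this toy frame; the write-up uses 15

# 1. Drop columns whose NaN count reaches the limit.
nan_counts = raw.isnull().sum()
cleaned = raw.drop(columns=nan_counts[nan_counts >= NAN_LIMIT].index)

# 2. Interpolate the remaining gaps in numeric columns.
numeric_cols = cleaned.select_dtypes(include=np.number).columns
cleaned[numeric_cols] = cleaned[numeric_cols].interpolate()

# 3. One-hot encode categorical columns; a row with NaN gets all zeros.
categorical_cols = cleaned.select_dtypes(exclude=np.number).columns
dummies = pd.get_dummies(cleaned[categorical_cols], dtype=int)

# 4. Join the encoded columns and drop the original categorical ones.
prepared = cleaned.join(dummies).drop(columns=categorical_cols)
print(prepared)
```

Because `pd.get_dummies` ignores NaN by default, a missing category simply becomes a row of zeros across that column's indicator variables, which is consistent with the observation that encoding removed the NaNs in the categorical data.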


[Back to top](#Index)

### 2.4 Correlation

**Describe the correlation between the variables in your data. How can the correlation help you make an educated guess about how to proceed with your analysis? Will you explore different variables based on the correlation you found? If so, describe what you did and be sure to include what you found with the new set of variables.**

Correlation of the top 8 independent variables with the dependent variable, `SalePrice`:

* SalePrice:     1.000000
* OverallQual:   0.855061
* GrLivArea:     0.735129
* GarageArea:    0.688249
* **BsmtQual_Ex:   0.680094**
* GarageCars:    0.663441
* YearBuilt:     0.658636
* TotalBsmtSF:   0.616297
* GarageYrBlt:   0.589361

After the **one hot encoding** step described in the Introduction, the independent variables in the resulting dataframe were assessed for correlation with the dependent variable, `SalePrice`.

One of the new independent variables identified via this process was `BsmtQual_Ex`, with a correlation value of 0.680094.
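
A ranking of this kind can be produced by correlating every column with `SalePrice` and sorting. The sketch below assumes an all-numeric frame; the values are made up, and `TOP_N` is set to 2 for brevity where the write-up keeps 8.

```python
import pandas as pd

# Hypothetical all-numeric frame; in the project this would be the prepared training data.
prepared = pd.DataFrame({
    "SalePrice":   [100, 150, 200, 250, 300],
    "OverallQual": [4, 5, 7, 8, 9],
    "GrLivArea":   [900, 1400, 1300, 1800, 2100],
    "MoSold":      [6, 2, 9, 2, 6],
})

TOP_N = 2  # assumption for this toy frame; the write-up keeps the top 8

# Pearson correlation of every column with SalePrice, strongest first.
corr = prepared.corr()["SalePrice"].sort_values(ascending=False)

# Skip the first entry (SalePrice with itself) and keep the next TOP_N columns.
top_features = corr.index[1:TOP_N + 1].tolist()
print(corr)
print(top_features)
```

Sorting the signed values favors positive relationships; sorting `corr.abs()` instead would also surface strongly negative ones.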


[Back to top](#Index)

## 3. Project Description

**Describe, using 150 words or less, how your analysis improves upon the analysis performed by Dr. Williams. Explain the variables that you analyzed, why you selected them, and what relationships you determined in your analysis.
Make sure you explain specifically what findings you derived from your analysis of the data.**

Using **one hot encoding** in my analysis makes it possible to use categorical data columns in addition to the numeric columns used by Dr. Williams. This approach identified the **BsmtQual** variable, whose value `Ex` (meaning excellent) has a high correlation (0.680094) with the dependent variable, `SalePrice`, thus providing additional improvement to the model as per the findings below:

Fitting this data with the linear regression model from the scikit-learn library (`sklearn`) resulted in an R^2 value of **0.8824472376719946** on the training data.

Applying this model to the test data gave an R^2 value of **0.7705677801288706**.

[Back to top](#Index)

### 3.1 Linear Regression

**Give a description (500 words or less) of the algorithm you use in this project. Include mathematical and computational details about linear regression.**

**Include details about the theory (origin of the method, derivation, and formulas) and the necessary steps to implement the algorithm using Python.**

The housing prediction model implemented here uses linear regression to estimate the relationship between the dependent and independent variables. The model expresses this relationship as a linear equation, which requires computing a weight for each independent variable such that:

`Y = w1 * X1 + w2 * X2 + w3 * X3 + ...`
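
Ordinary least squares chooses the weights that minimize the sum of squared differences between the predicted and actual values of Y. As a minimal sketch, the example below fits made-up, noise-free data with scikit-learn and then repeats the fit by hand with NumPy. Note that `LinearRegression` also fits an intercept term by default, in addition to the weights in the equation above.

```python
import numpy as np
from sklearn import linear_model

# Made-up data generated from y = 3*x1 + 2*x2 + 5 with no noise.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 3 * X[:, 0] + 2 * X[:, 1] + 5

# scikit-learn: fit the weights (coef_) and the intercept.
model = linear_model.LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)

# Same fit by hand: append a column of ones for the intercept and
# solve the least-squares problem min ||A w - y||^2.
A = np.column_stack([X, np.ones(len(X))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)

# R^2 on the data the model was fitted to (1.0 here because there is no noise).
r2 = model.score(X, y)
print(r2)
```

Both approaches recover the weights 3 and 2 and the intercept 5 up to floating-point error, since the data contain no noise.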

[Back to top](#Index)

### 3.2 Analysis 

**Implement the algorithm on your data according to the examples in Video 7.9 and Video 7.10.**

**Try to improve the results of your model analysis by including a different number of variables in your code for linear regression. Use what you learned about the correlation between variables when you explored your data to help you select these variables.**

**Compare the results of at least three different groups of variables. In other words, run a linear regression algorithm on at least three different sets of independent variables. How many variables to include in each set is up to you.**

For each step, make sure you include your code. Ensure that your code is commented.


Group 1 (baseline) : 8 independent variables : R^2 = 0.8824472376719946

Group 2 (adversary_1) : 5 independent variables : R^2  = 0.836871490561969

Group 3 (adversary_2) : 11 independent variables : R^2  = 0.8848705848801298

As the results show, R^2 increases as the number of independent variables increases; however, the gain is marginal going from 8 to 11 variables (0.8825 to 0.8849). As such, using 8 independent variables for prediction seems to be optimal for this data set.
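
A comparison across groups of variables can be organized as a small loop. The sketch below uses synthetic data and hypothetical column names (`a` to `d`) and group names, not the project's variables; it only illustrates the pattern of fitting each group and collecting its training R^2.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic training frame (an assumption for illustration, not the project data).
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
train["SalePrice"] = (
    5 * train["a"] + 3 * train["b"] + 0.1 * train["c"] + rng.normal(size=100)
)

def training_r2(frame, features, target="SalePrice"):
    """Fit a linear regression on the given columns and return its training R^2."""
    model = linear_model.LinearRegression()
    model.fit(frame[features], frame[target])
    return model.score(frame[features], frame[target])

# Three nested groups of independent variables.
groups = {
    "group_small": ["a"],
    "group_medium": ["a", "b"],
    "group_large": ["a", "b", "c", "d"],
}
scores = {name: training_r2(train, cols) for name, cols in groups.items()}
for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.4f}")
```

When each group contains the previous one, the training R^2 of a least-squares fit cannot decrease as variables are added, so a small gain from extra variables is weak evidence on its own. A score on held-out data is likely a better guide to how many variables to keep.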

[Back to top](#Index)

### 3.3 Results

**What are your results? Which model performed better? Can you explain why? Include a detailed summary and a description of the metrics used to compute the accuracy of your predictions.**

**For each step, make sure you include your code. Ensure that your code is commented.**

Group 1 (baseline), with 8 independent variables and a training R^2 of 0.8824472376719946, offers the best trade-off between fit and number of variables. It also produces an R^2 value of **0.7705677801288706** on the test data.
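
The metric reported here, R^2 (the coefficient of determination), compares the model's squared errors with those of a baseline that always predicts the mean. As a minimal sketch with made-up numbers (the helper `r_squared` is a hypothetical name, not part of the project code), it can be computed as:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared errors of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared errors of the mean baseline
    return 1.0 - ss_res / ss_tot

# Made-up sale prices and predictions, purely for illustration.
actual = [100.0, 200.0, 300.0, 400.0]
perfect = [100.0, 200.0, 300.0, 400.0]
mean_only = [250.0, 250.0, 250.0, 250.0]
off_by_some = [120.0, 180.0, 330.0, 370.0]

print(r_squared(actual, perfect))      # 1.0
print(r_squared(actual, mean_only))    # 0.0
print(r_squared(actual, off_by_some))
```

scikit-learn's `LinearRegression.score` returns this same quantity. On unseen data the value can fall below zero if the model does worse than the mean baseline.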


[Back to top](#Index)

### 3.4 Verify Your Model Against Test Data

**Now that you have a prediction model, it's time to test your model against test data to confirm its accuracy on new data. The test data is located in the file `jtest.csv`.**

**What do you observe? Are these results in accordance with what you found earlier? How can you justify this?**

The R^2 value on the test data is **0.7705677801288706**.

The lower R^2 is perhaps due to biases in the training data that are exposed when the model is applied to the test data.

It should be noted, though, that both the training and test data are small samples, consisting of only 100 rows each.

I believe that the model could be greatly improved with additional data (5-10x as much).
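
One practical detail when scoring a model on separate test data is that the test frame needs the same feature columns as the training frame, including any one-hot columns. The sketch below shows one way to handle this, using hypothetical miniature frames rather than the project files; it is not necessarily how the results above were produced.

```python
import pandas as pd
from sklearn import linear_model

# Hypothetical miniature train/test frames; the project uses its own CSV files.
train = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 6],
    "BsmtQual": ["Gd", "Gd", "Ex", "Ex", "Ex", "TA"],
    "SalePrice": [120, 150, 210, 260, 300, 140],
})
test = pd.DataFrame({
    "OverallQual": [6, 8],
    "BsmtQual": ["Gd", "Gd"],   # no "Ex" rows, so BsmtQual_Ex would be missing
    "SalePrice": [155, 230],
})

features = ["OverallQual", "BsmtQual_Ex"]

# Encode both frames the same way.
train_enc = pd.get_dummies(train, columns=["BsmtQual"], dtype=int)
test_enc = pd.get_dummies(test, columns=["BsmtQual"], dtype=int)

# Give the test frame exactly the training feature columns,
# filling indicator columns that are absent from the test data with 0.
X_train = train_enc[features]
X_test = test_enc.reindex(columns=features, fill_value=0)

model = linear_model.LinearRegression().fit(X_train, train_enc["SalePrice"])
train_r2 = model.score(X_train, train_enc["SalePrice"])
test_r2 = model.score(X_test, test_enc["SalePrice"])
print("train R^2:", train_r2)
print("test R^2:", test_r2)
```

Without the `reindex` step, a category that never appears in the test data would leave the test frame one column short and the prediction call would fail.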

[Back to top](#Index)

## Conclusion

**Describe your conclusions. Explain which approach worked better in terms of results. What did you learn about data analysis techniques by creating your prediction model?**

The approach that worked best was:
1. Data preparation and cleansing: removing NaNs in the data
2. Using one hot encoding to transform the categorical data into numeric data
3. Using the variables most highly correlated with the dependent variable and finding the optimal number of variables to use

I learned a lot about data modeling during this exercise. Key lessons for me:
* Experiment to determine whether the prediction model should use the dependent variable directly or a function of it, such as the log of `SalePrice` (I confirmed that the model R^2 with the log transform was lower than without it: 0.85 vs. 0.88, respectively)
* Use a larger set of data where possible. This becomes evident given the large difference in model R^2 between the training and test data sets
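
The log-transform experiment mentioned above can be sketched as follows. The data here are made up so that price grows exponentially with a single feature; the project data need not behave this way, and the variable names are illustrative.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score

# Made-up data where the target grows exponentially with the feature.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = 100.0 * np.exp(0.5 * X[:, 0])

# Model A: predict the target directly.
raw_model = linear_model.LinearRegression().fit(X, y)
raw_r2 = raw_model.score(X, y)

# Model B: predict log(target), then convert predictions back with exp.
log_model = linear_model.LinearRegression().fit(X, np.log(y))
pred = np.exp(log_model.predict(X))
log_r2_original_scale = r2_score(y, pred)

print("direct fit R^2:", raw_r2)
print("log fit R^2 (original units):", log_r2_original_scale)
```

An R^2 computed on log values measures fit in log units, so it is not directly comparable with an R^2 computed on raw prices. Converting the predictions back with `np.exp` before scoring, as above, is one way to compare the two models on the same scale.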



[Back to top](#Index)

## References

Add all references you used to complete this project.

Use this format for articles:
- Author Last Name, Author First Name. “Article Title.” Journal Title Volume #, no. Issue # (year): page range.

- Ex: Doe, John. “Data Engineering.” Data Engineering Journal 18, no. 4 (2021): 12-18.

Use this format for websites:
- Author Last Name, Author First Name. “Title of Web Page.” Name of Website. Publishing organization, publication or revision date if available. Access date if no other date is available. URL .

- Doe, John. “Data Engineering.” Data Engineer Resource. Cengage, 2021. www.dataengineerresource.com .
