## Linear Regression With Python

In [1]:
import warnings
warnings.filterwarnings("ignore")

## Housing Dataset

Your neighbor is a real estate agent and wants some help predicting housing prices for regions in the USA. It would be great if you could build a model that takes a few features of a house and returns an estimate of what the house would sell for.

She has asked you if you could help her out with your new data science skills. You say yes, and decide that Linear Regression might be a good path to solve this problem!

Your neighbor then gives you some information about a bunch of houses in regions of the United States. It is all in the data set USA_Housing.csv.

The data contains the following columns:

* 'Avg. Area Income': Average income of residents of the city the house is located in
* 'Avg. Area House Age': Average age of houses in the same city
* 'Avg. Area Number of Rooms': Average number of rooms for houses in the same city
* 'Avg. Area Number of Bedrooms': Average number of bedrooms for houses in the same city
* 'Area Population': Population of the city the house is located in
* 'Price': Price that the house sold at
* 'Address': Address of the house

**Let's get started!**
## Check out the data
We've been able to get some housing price data from your neighbor as a CSV file. Let's get our environment ready with the libraries we'll need and then import the data!
### Import Libraries

In [2]:
import pandas as pd
import numpy as np

In [3]:
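# This binds the top-level matplotlib package (not pyplot) to the name plt,
# so plotting functions are reached as plt.pyplot.<function>
import matplotlib as plt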
import seaborn as sns

In [4]:
# To see the visualisations within the notebook
%matplotlib inline
sns.set_style('darkgrid')

### Check out the Data

In [5]:
USA_Housing = pd.read_csv("USA_Housing.csv")

FileNotFoundError: [Errno 2] File b'USA_Housing.csv' does not exist: b'USA_Housing.csv'
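
When `pd.read_csv` raises a `FileNotFoundError`, the CSV is usually not in the notebook's working directory. Below is a minimal sketch of a loader that reports where it looked. The helper name `load_housing_csv` and the tiny demo file are hypothetical and not part of the original notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_housing_csv(csv_path):
    """Read a CSV into a DataFrame, with a clearer error if the file is missing."""
    path = Path(csv_path)
    if not path.is_file():
        # Hypothetical helper: report the directory that was searched.
        raise FileNotFoundError(
            f"{path.name} not found in {path.resolve().parent}; "
            "copy the file there or pass its full path."
        )
    return pd.read_csv(path)


# Demonstration on a throwaway file with made-up rows (not the real dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_housing.csv"
    demo_path.write_text("Area Population,Price\n30000,1000000\n40000,1250000\n")
    demo_df = load_housing_csv(demo_path)

print(demo_df.shape)
```

With the real file in place, the call would presumably be `USA_Housing = load_housing_csv("USA_Housing.csv")`.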

In [None]:
USA_Housing.head()
# Each row represents one house: the last column is its address, and 'Price' is its sale price.
# The remaining columns are statistics for the city or area the house is located in: average income,
# average house age, average number of rooms and bedrooms, and population.


In [None]:
USA_Housing.mean(numeric_only=True)  # skip the text 'Address' column, which has no mean

In [None]:
USA_Housing.info()
# Shows the number of entries and columns, plus the dtype and non-null count of each column.

In [None]:
USA_Housing.describe()

In [None]:
# To reference column names 
USA_Housing.columns # An Index holding the column names

### Creating Some Plots to Check the Data

In [None]:
# If the data isn't extremely large, seaborn's pairplot gives a quick overview: pass in the entire dataframe.
sns.pairplot(USA_Housing)


* From the pairplot above, most features look approximately normally distributed. The exception is the average number of bedrooms, which clusters around the discrete values 2, 3, 4, 5 and 6. There is some noise, but 4 to 5 distinct bands are still visible in the bedroom feature, which lines up with the fact that bedroom counts are discrete.

In [None]:
# Let's check the distribution of one column: Price
sns.histplot(USA_Housing['Price'], kde=True)  # histplot replaces distplot, which is deprecated in recent seaborn releases
# We look at Price because it is our target column, the value we will be predicting.
# From the plot we can infer that the average price falls somewhere between
# $1M and $1.5M and that prices are approximately normally distributed.

In [None]:
USA_Housing.select_dtypes(include='number').corr()  # correlations are only defined for numeric columns

In [None]:
# Plotting a heatmap of the correlation between each pair of numeric columns
sns.heatmap(USA_Housing.select_dtypes(include='number').corr(),annot=True)
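
To read a correlation matrix with the target in mind, it can help to pull out just the target's column and sort it. The sketch below uses a small synthetic dataframe. The column names, coefficients and noise level are made up for illustration and are not taken from USA_Housing.csv.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumed values, not the real housing dataset).
rng = np.random.default_rng(0)
n = 200
income = rng.normal(68000, 10000, n)
rooms = rng.normal(7, 1, n)
price = 20 * income + 120000 * rooms + rng.normal(0, 50000, n)
demo = pd.DataFrame(
    {"income": income, "rooms": rooms, "price": price, "address": ["x"] * n}
)

# Keep only numeric columns, since correlation is undefined for text.
numeric = demo.select_dtypes(include="number")

# Correlation of every feature with the target, strongest first.
corr_with_price = (
    numeric.corr()["price"].drop("price").sort_values(ascending=False)
)
print(corr_with_price)
```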


## Training a Linear Regression Model

Let's now begin to train our regression model! We will first need to split our data into an X array that contains the features to train on, and a y array with the target variable, in this case the Price column. We will toss out the Address column because it only has text info that the linear regression model can't use.

### X and y arrays

In [None]:
USA_Housing.head(1)

In [None]:
X = USA_Housing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]
y = USA_Housing['Price']
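
An alternative to listing every feature by name is to drop the target and the text column and keep whatever remains. This is a sketch on a three-row made-up dataframe. The rows are invented, and only the column names mirror the dataset.

```python
import pandas as pd

# Tiny made-up frame that mirrors a few of the dataset's column names.
demo = pd.DataFrame(
    {
        "Avg. Area Income": [60000.0, 72000.0, 65000.0],
        "Area Population": [30000.0, 41000.0, 36000.0],
        "Price": [1.0e6, 1.4e6, 1.2e6],
        "Address": ["1 Example St", "2 Example St", "3 Example St"],
    }
)

# Features: everything except the target and the free-text address.
X_demo = demo.drop(columns=["Price", "Address"])
# Target: the column we want to predict.
y_demo = demo["Price"]

print(list(X_demo.columns), X_demo.shape, y_demo.shape)
```

Listing the columns explicitly, as the cell above does, makes the feature set obvious at a glance. Dropping columns is shorter when there are many features.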

## Train Test Split

Now let's split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Pass in the X and y data and specify the test size (here 40% of the rows are held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
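
To see what `test_size=0.4` and `random_state` do, here is a small sketch on a made-up 10-row array. The `_demo` names and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
# 40% of 10 rows go to the test set, the remaining 60% to training.
print(X_tr.shape, X_te.shape)

# The same random_state reproduces exactly the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
print(np.array_equal(X_te, X_te2))
```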


## Creating and Training the Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression() 


In [None]:

lm.fit(X_train,y_train)
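
A quick way to build confidence in what `fit` does is to train on synthetic data whose true coefficients are known, and check that they are recovered. The data, coefficients and noise level below are made up for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear relationship (assumed values).
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(500, 2))
true_coef = np.array([3.0, -2.0])
true_intercept = 5.0
y_syn = X_syn @ true_coef + true_intercept + rng.normal(scale=0.1, size=500)

model = LinearRegression()
model.fit(X_syn, y_syn)

# The estimates should land close to the values used to generate the data.
print(model.coef_, model.intercept_)
```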


## Model Evaluation

Let's evaluate the model by checking out its coefficients and how we can interpret them.

In [None]:
print(lm.intercept_)

In [None]:
lm.coef_ # Returns coefficient for each feature

In [None]:
# Let's create a dataframe from these coefficients, since each coefficient corresponds to one column of X_train
X_train.columns

In [None]:
coeff_df = pd.DataFrame(lm.coef_,X_train.columns,columns=['Coeff'])

In [None]:
coeff_df 

Interpreting the coefficients:

- Holding all other features fixed, a 1 unit increase in **Avg. Area Income** is associated with an **increase of \$21.52**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area House Age** is associated with an **increase of \$164883.28**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Rooms** is associated with an **increase of \$122368.67**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Bedrooms** is associated with an **increase of \$2233.80**.
- Holding all other features fixed, a 1 unit increase in **Area Population** is associated with an **increase of \$15.15**.
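
The "holding all other features fixed" reading can be checked numerically: raise one feature by 1, leave the rest alone, and the prediction changes by exactly that feature's coefficient. The sketch below uses the coefficients listed above. The intercept is a placeholder and the example house is made up, since neither appears in the text. The difference does not depend on either.

```python
import numpy as np

# Coefficients as listed above, in the same order as the columns of X.
coef = np.array([21.52, 164883.28, 122368.67, 2233.80, 15.15])
intercept = 0.0  # placeholder: the fitted intercept is not shown in the text


def predict(features):
    """Linear model prediction: intercept + sum(coef_i * feature_i)."""
    return intercept + features @ coef


# Made-up house: income, house age, rooms, bedrooms, population.
base = np.array([68000.0, 6.0, 7.0, 4.0, 36000.0])

# Same house, but with Avg. Area House Age increased by one unit.
older = base.copy()
older[1] += 1.0

delta = predict(older) - predict(base)
print(round(delta, 2))
```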

Does this make sense? Probably not, because Jose Portilla made up this data. If you want real data to repeat this sort of analysis, check out the [Boston dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html).

- If the above coefficients came from real data, we would expect some of them to be negative. For example, the price would typically go down as the age of the house increases.

In [None]:
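# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import only works with older scikit-learn releases.
from sklearn.datasets import load_boston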

In [None]:
boston = load_boston()

In [None]:
boston.keys()

In [None]:
print(boston['DESCR'])

For linear regression on the above dataset, see: https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155

## Predictions from our Model

Let's grab predictions off our test set and see how well it did!

In [None]:
predictions = lm.predict(X_test)
# Ask the model to predict prices for the test-set features, which it did not see during training.

In [None]:
# Printing the raw predictions isn't very informative, so we plot them against the actual y_test values
plt.pyplot.scatter(y_test,predictions)
# Points that fall close to a straight diagonal line mean the predictions are close to the actual prices, i.e. the errors are small.

## Regression Evaluation Metrics


Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average error.
- **MSE** is more popular than MAE, because squaring the errors makes MSE "punish" larger errors more heavily, which tends to be useful in the real world.
- **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are **loss functions**, because we want to minimize them.
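
The three formulas above translate directly into NumPy. This is a sketch on four made-up values, chosen so the arithmetic can be followed by hand.

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 4.0])

errors = y_true - y_pred          # [1, 0, -2, 3]

mae = np.mean(np.abs(errors))     # (1 + 0 + 2 + 3) / 4 = 1.5
mse = np.mean(errors ** 2)        # (1 + 0 + 4 + 9) / 4 = 3.5
rmse = np.sqrt(mse)               # square root of 3.5, back in the units of y

print(mae, mse, rmse)
```

The single error of 3 contributes 9 of the 14 squared-error units, which is the "punishing" effect of MSE described above.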

In [None]:
from sklearn import metrics

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('R2 Score:', metrics.r2_score(y_test, predictions))  # R2 is reported as is, with no square root
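
The R2 score is not defined in the text above. Its usual definition is one minus the ratio of the residual sum of squares to the total sum of squares, so it is unitless and reported directly. Below is a sketch on made-up values that compares a manual calculation with `metrics.r2_score`.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 4.0])

# Residual sum of squares: squared error left over after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: squared error of always predicting the mean of y.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, metrics.r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean. The score is negative when the model does worse than that.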