# Linear Regression with Python

The dataset has the following columns:

* 'Avg. Area Income': Avg. income of residents of the city the house is located in
* 'Avg. Area House Age': Avg. age of houses in the same city
* 'Avg. Area Number of Rooms': Avg. number of rooms for houses in the same city
* 'Avg. Area Number of Bedrooms': Avg. number of bedrooms for houses in the same city
* 'Area Population': Population of the city the house is located in
* 'Price': Price that the house sold at
* 'Address': Address of the house

In [None]:
# dataframe
import pandas as pd
# fast array calculation
import numpy as np
# visualization
import matplotlib.pyplot as plt
%matplotlib inline

## Understand the data

In [None]:
USAhousing = pd.read_csv('USA_Housing.csv')

In [None]:
USAhousing.head()

In [None]:
USAhousing.info()

In [None]:
USAhousing.describe()

In [None]:
USAhousing.columns

## Feature Engineering

In [None]:
features = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]

In [None]:
target = USAhousing['Price']

## Training a Linear Regression Model

The data is split into features and a target, conventionally written as X and y. This notebook uses the more descriptive names `features` and `target` instead.

## Train Test Split

Now let's split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.
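As a quick illustration of what the split does, here is a minimal sketch on a made-up 10-row frame. The `toy` frame and its column names are assumptions for this example only and are not part of the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature column and one target column.
toy = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2.0})

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(
    toy[["x"]], toy["y"], test_size=0.2, random_state=101
)

print(len(x_train), len(x_test))  # 8 2

# Features and targets stay aligned: each half shares the same row index.
print(x_test.index.equals(y_test.index))

# Repeating the call with the same random_state reproduces the same split.
x_train2, x_test2, _, _ = train_test_split(
    toy[["x"]], toy["y"], test_size=0.2, random_state=101
)
print(x_test.index.equals(x_test2.index))
```

Fixing `random_state` is what makes the results repeatable from one run to the next.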

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=101)

In [None]:
len(features_test)

In [None]:
len(features_train)

In [None]:
len(target_train)

In [None]:
len(target_test)

## Creating and Training the Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

In [None]:
lm.fit(features_train,target_train)

## Model Evaluation

Let's evaluate the model by checking out its coefficients and how we can interpret them.

In [None]:
# print the intercept
print(lm.intercept_)

In [None]:
coeff_df = pd.DataFrame(lm.coef_,features.columns,columns=['Coefficient'])
coeff_df

Interpreting the coefficients:

- Holding all other features fixed, a 1 unit increase in **Avg. Area Income** is associated with an **increase of \$21.52** in price.
- Holding all other features fixed, a 1 unit increase in **Avg. Area House Age** is associated with an **increase of \$164883.28** in price.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Rooms** is associated with an **increase of \$122368.67** in price.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Bedrooms** is associated with an **increase of \$2233.80** in price.
- Holding all other features fixed, a 1 unit increase in **Area Population** is associated with an **increase of \$15.15** in price.
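The "holding all other features fixed" reading can be checked directly. The sketch below uses synthetic data (an assumption for illustration, unrelated to the housing file): it raises one feature by one unit and compares the change in the prediction with that feature's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): y = 3*a + 5*b + 10, no noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3 * X[:, 0] + 5 * X[:, 1] + 10

model = LinearRegression().fit(X, y)

# Take one row, then raise only the first feature by one unit.
base = X[:1].copy()
bumped = base.copy()
bumped[0, 0] += 1

# The change in the prediction matches the first coefficient.
delta = model.predict(bumped)[0] - model.predict(base)[0]
print(delta, model.coef_[0])
```

Because the model is linear, this holds for any starting row, not just the one chosen here.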

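Does this make sense? Probably not, because I made up this data. If you want real data to repeat this sort of analysis, note that the Boston dataset (`load_boston`) was deprecated in scikit-learn 1.0 and removed in 1.2. One alternative is the [California housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html):



    from sklearn.datasets import fetch_california_housing
    # Downloads the data on first use; as_frame=True returns pandas objects
    california = fetch_california_housing(as_frame=True)
    print(california.DESCR)
    california_df = california.frame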

## Predictions from our Model

Let's grab predictions off our test set and see how well it did!

In [None]:
predictions = lm.predict(features_test)

In [None]:
features_test.head()

In [None]:
target_test[:5]


In [None]:
target_test.head()

In [None]:
predictions[:5]

## Regression Evaluation Metrics


Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
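The three formulas above can be computed directly with NumPy. This is a minimal sketch using made-up values for `y_true` and `y_pred` (both names are assumptions for this example):

```python
import numpy as np

# Hypothetical actual and predicted values, chosen only for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 4.0])

errors = y_true - y_pred         # [1, 0, -2, 3]

mae = np.mean(np.abs(errors))    # (1 + 0 + 2 + 3) / 4 = 1.5
mse = np.mean(errors ** 2)       # (1 + 0 + 4 + 9) / 4 = 3.5
rmse = np.sqrt(mse)              # square root of the MSE

print(mae, mse, rmse)
```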

Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average absolute error.
- **MSE** is more popular than MAE, because squaring "punishes" larger errors more heavily, which tends to be useful in the real world.
- **RMSE** is even more popular than MSE, because RMSE is interpretable in the same units as "y".

All of these are **loss functions**, because we want to minimize them.
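To see how squaring "punishes" larger errors, compare two hypothetical sets of errors with the same total. The values and the `summarize` helper below are assumptions made up for this sketch.

```python
import numpy as np

# Two hypothetical sets of absolute errors with the same total (8).
even_errors = np.array([2.0, 2.0, 2.0, 2.0])   # spread evenly
spiky_errors = np.array([0.0, 0.0, 0.0, 8.0])  # one large miss

def summarize(errors):
    """Return (MAE, MSE, RMSE) for an array of errors."""
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    return mae, mse, np.sqrt(mse)

even = summarize(even_errors)    # MAE 2.0, MSE 4.0, RMSE 2.0
spiky = summarize(spiky_errors)  # MAE 2.0, MSE 16.0, RMSE 4.0
print(even)
print(spiky)
```

Both sets have the same MAE, but the single large miss quadruples the MSE and doubles the RMSE.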

In [None]:
from sklearn import metrics

In [None]:
print('MAE:', metrics.mean_absolute_error(target_test, predictions))
print('MSE:', metrics.mean_squared_error(target_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(target_test, predictions)))