# Linear Regression Project



### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data

In [2]:
df = pd.read_csv("../DATA/AMES_Final_DF.csv")

In [3]:
df.head()

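| | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Mas Vnr Area | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | ... | Sale Type_ConLw | Sale Type_New | Sale Type_Oth | Sale Type_VWD | Sale Type_WD | Sale Condition_AdjLand | Sale Condition_Alloca | Sale Condition_Family | Sale Condition_Normal | Sale Condition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 141.0 | 31770 | 6 | 5 | 1960 | 1960 | 112.0 | 639.0 | 0.0 | 441.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 80.0 | 11622 | 5 | 6 | 1961 | 1961 | 0.0 | 468.0 | 144.0 | 270.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 81.0 | 14267 | 6 | 6 | 1958 | 1958 | 108.0 | 923.0 | 0.0 | 406.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 93.0 | 11160 | 7 | 5 | 1968 | 1968 | 0.0 | 1065.0 | 0.0 | 1045.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 74.0 | 13830 | 5 | 5 | 1997 | 1998 | 0.0 | 791.0 | 0.0 | 137.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |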


In [4]:
df['SalePrice']

0       215000
1       105000
2       172000
3       244000
4       189900
         ...  
2920    142500
2921    131000
2922    132000
2923    170000
2924    188000
Name: SalePrice, Length: 2925, dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Columns: 274 entries, Lot Frontage to Sale Condition_Partial
dtypes: float64(11), int64(263)
memory usage: 6.1 MB


**The label we are trying to predict is the SalePrice column. Separate the data into X features and y labels.**

In [6]:
X = df.drop('SalePrice', axis = 1)
y = df['SalePrice']

**Use scikit-learn to split X and y into a training set and a test set. Since we will later be using a Grid Search strategy, set the test proportion to 10%. To get the same data split as the solutions notebook, specify random_state=101.**

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)
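
As a quick sanity check on what a 10% split means for 2925 rows, here is a minimal sketch on placeholder data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins (not the Ames features); only the row count is taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the Ames frame (2925).
# The feature values are arbitrary; only the row count matters here.
n_rows = 2925
X_demo = np.arange(n_rows * 3).reshape(n_rows, 3)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=101
)

# The test size is rounded up, so 10% of 2925 rows gives 293 test rows.
print(X_tr.shape, X_te.shape)  # (2632, 3) (293, 3)

# Re-running with the same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=101
)
same_split = np.array_equal(X_te, X_te_again)
```

Fixing `random_state` is what makes the split repeatable from run to run.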

**The dataset's features have a variety of scales and units. For optimal regression performance, scale the X features.**

In [9]:
from sklearn.preprocessing import StandardScaler

In [10]:
scaler = StandardScaler()
scaler.fit(X_train)

StandardScaler()

In [11]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
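
The scaler above is fitted on the training rows only, so the test rows are transformed with the training mean and standard deviation. The sketch below illustrates this on hypothetical two-feature data (`train_demo` and `test_demo` are made up for illustration) by comparing `StandardScaler` against the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical two-feature data on very different scales.
train_demo = rng.normal(loc=[1500.0, 5.0], scale=[400.0, 1.5], size=(200, 2))
test_demo = rng.normal(loc=[1500.0, 5.0], scale=[400.0, 1.5], size=(20, 2))

demo_scaler = StandardScaler().fit(train_demo)   # statistics from training rows only
train_scaled = demo_scaler.transform(train_demo)
test_scaled = demo_scaler.transform(test_demo)   # reuses the training mean and std

# Equivalent manual computation: subtract the training mean,
# then divide by the training (population) standard deviation.
manual_test = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print(np.allclose(test_scaled, manual_test))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

Fitting on the training set alone keeps information about the test rows out of the preprocessing step.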

**We will use an Elastic Net model. Create an instance of the default ElasticNet model with scikit-learn.**

In [12]:
from sklearn.linear_model import ElasticNet

In [13]:
elastic_net_model = ElasticNet()

**The Elastic Net model has two main parameters, alpha and the L1 ratio. Create a dictionary parameter grid of values for the ElasticNet. Feel free to experiment with these values, but keep in mind that your results may not match the solution choices exactly.**

In [14]:
param_grid = {'alpha': [0.1, 1, 10, 50, 100], 'l1_ratio': [0.1,0.2,0.5,0.8,1]}
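
For reference, the two grid parameters enter the objective that scikit-learn's `ElasticNet` minimizes roughly as follows. This is written from the library's documented form, with $n$ the number of training samples, $w$ the coefficient vector, and $\rho$ standing in for `l1_ratio` (the symbol $\rho$ is a notational choice made here, and the intercept is omitted for brevity):

$$
\min_{w}\; \frac{1}{2n}\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

Larger `alpha` means a stronger overall penalty. With $\rho = 1$ the squared L2 term vanishes and the penalty is a pure L1 (Lasso) penalty; with $\rho = 0$ only the squared L2 (ridge-style) penalty remains.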

**Using scikit-learn, create a GridSearchCV object and run a grid search for the best parameters for your model based on your scaled training data. [In case you are curious about the warnings you may receive for certain parameter combinations](https://stackoverflow.com/questions/20681864/lasso-on-sklearn-does-not-converge)**

In [15]:
from sklearn.model_selection import GridSearchCV

In [16]:
grid_model = GridSearchCV(elastic_net_model, param_grid=param_grid, scoring='neg_root_mean_squared_error', cv=20, verbose=1)

In [17]:
grid_model.fit(X_train,y_train)

Fitting 20 folds for each of 25 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:  1.7min finished


GridSearchCV(cv=20, estimator=ElasticNet(),
             param_grid={'alpha': [0.1, 1, 10, 50, 100],
                         'l1_ratio': [0.1, 0.2, 0.5, 0.8, 1]},
             scoring='neg_root_mean_squared_error', verbose=1)
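
Because the scoring string is `'neg_root_mean_squared_error'`, the scores stored by the search are negated RMSE values, and "best" means closest to zero. The following is a small sketch on synthetic data (generated with `make_regression`, not the Ames set; the reduced grid and `cv=5` are assumptions chosen to keep it fast) showing how to read the cross-validated RMSE back out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the Ames housing features.
X_syn, y_syn = make_regression(
    n_samples=300, n_features=10, noise=15.0, random_state=0
)

demo_grid = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid={"alpha": [0.1, 1, 10], "l1_ratio": [0.1, 0.5, 1]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
demo_grid.fit(X_syn, y_syn)

# Scores are negated errors, so flip the sign to recover the RMSE.
best_cv_rmse = -demo_grid.best_score_
print(demo_grid.best_params_, best_cv_rmse)

# cv_results_ holds one mean score per parameter combination (3 x 3 = 9 here).
mean_scores = demo_grid.cv_results_["mean_test_score"]
```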

**The best combination of parameters for the model**

In [18]:
grid_model.best_params_

{'alpha': 100, 'l1_ratio': 1}
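
An `l1_ratio` of 1 means the selected model uses a pure L1 penalty, which is the Lasso. The sketch below, on synthetic data rather than the Ames features, checks that `ElasticNet(l1_ratio=1)` and `Lasso` with the same `alpha` give matching coefficients; the `alpha=1.0` value is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic stand-in data; not the Ames housing features.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

enet_l1 = ElasticNet(alpha=1.0, l1_ratio=1, max_iter=10000).fit(X_syn, y_syn)
lasso = Lasso(alpha=1.0, max_iter=10000).fit(X_syn, y_syn)

# Same penalty, same solver: the fitted coefficients should agree.
coefs_match = np.allclose(enet_l1.coef_, lasso.coef_)

# The L1 penalty can set coefficients exactly to zero.
n_zeroed = int(np.sum(lasso.coef_ == 0))
print(coefs_match, n_zeroed)
```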

**Evaluate your model's performance on the unseen 10% scaled test set.**

In [19]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [20]:
y_pred = grid_model.predict(X_test)

In [21]:
mean_absolute_error(y_test,y_pred)

14195.354900562173

In [22]:
np.sqrt(mean_squared_error(y_test,y_pred))

20558.508566893168

In [23]:
df['SalePrice'].mean()

180815.53743589742
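
To put the two error metrics in context, they can be expressed as a percentage of the mean sale price. The numbers below are copied from the cell outputs above; the variable names are just for illustration.

```python
# Values copied from the notebook outputs above.
mae = 14195.354900562173
rmse = 20558.508566893168
mean_price = 180815.53743589742

mae_pct = 100 * mae / mean_price
rmse_pct = 100 * rmse / mean_price

print(f"MAE  is {mae_pct:.2f}% of the mean sale price")   # 7.85%
print(f"RMSE is {rmse_pct:.2f}% of the mean sale price")  # 11.37%
```

The RMSE is larger than the MAE because squaring the residuals gives extra weight to the largest errors.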

### Using Lasso Regression

In [24]:
from sklearn.linear_model import Lasso

In [25]:
lasso_model = Lasso(max_iter = 100000)

In [26]:
param_grid = {'alpha': [0.1, 1, 5, 10, 20, 50, 75, 100, 150, 170, 200]}

In [27]:
grid_model_lasso = GridSearchCV(lasso_model, param_grid=param_grid, scoring='neg_root_mean_squared_error', cv=5,
    verbose=1)

In [28]:
grid_model_lasso.fit(X_train,y_train)

Fitting 5 folds for each of 11 candidates, totalling 55 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  55 out of  55 | elapsed:  4.8min finished


GridSearchCV(cv=5, estimator=Lasso(max_iter=100000),
             param_grid={'alpha': [0.1, 1, 5, 10, 20, 50, 75, 100, 150, 170,
                                   200]},
             scoring='neg_root_mean_squared_error', verbose=1)

In [29]:
grid_model_lasso.best_params_

{'alpha': 150}

In [30]:
y_pred = grid_model_lasso.predict(X_test)

In [31]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [32]:
mean_absolute_error(y_test,y_pred)

14195.033797515907

In [33]:
np.sqrt(mean_squared_error(y_test,y_pred))

20571.9973663202

In [34]:
from sklearn.linear_model import LassoCV

In [136]:
lasso_cv_model = LassoCV(eps = 0.0001, n_alphas = 1000, cv= 20, max_iter = 100000)

In [137]:
lasso_cv_model.fit(X_train, y_train)

LassoCV(cv=20, eps=0.0001, max_iter=100000, n_alphas=1000)

In [138]:
lasso_cv_model.alpha_

94.63821243944452
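
Unlike the hand-written grids above, `LassoCV` builds its own grid of alpha values: `eps` sets the ratio between the smallest and largest alpha tried, and the selected `alpha_` is one point on that grid. Here is a minimal sketch on synthetic data, assuming a coarser `eps=0.001`, the default grid size, and `cv=5` to keep it quick (the notebook cell uses `eps=0.0001`, `n_alphas=1000`, and `cv=20`).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in data; not the Ames housing features.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

demo_lasso_cv = LassoCV(eps=0.001, cv=5, max_iter=10000).fit(X_syn, y_syn)

# alphas_ is the grid that was searched; eps is alpha_min / alpha_max.
alpha_grid = demo_lasso_cv.alphas_
ratio = alpha_grid.min() / alpha_grid.max()

# The chosen alpha_ is one of the grid points.
chosen_in_grid = bool(np.isclose(alpha_grid, demo_lasso_cv.alpha_).any())
print(ratio, chosen_in_grid)
```

This is why the selected alpha in the notebook (about 94.6) is not one of the round values from the manual grid.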

In [139]:
y_pred = lasso_cv_model.predict(X_test)

In [140]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [141]:
mean_absolute_error(y_test,y_pred)

14202.343839738256

In [142]:
np.sqrt(mean_squared_error(y_test,y_pred))

20565.603364930557