# Ames Housing Project Suggestions

Data science is not a linear process. In this project, in particular, you will likely find that EDA, data cleaning, and exploratory visualizations will constantly feed back into each other. Here's an example:

1. During basic EDA, you identify many missing values in a column/feature.
2. You consult the data dictionary and use domain knowledge to decide _what_ a missing value means for this feature.
3. You impute a reasonable value for the missing value.
4. You plot the distribution of your feature.
5. You realize what you imputed has negatively impacted your data quality.
6. You cycle back, re-load your clean data, re-think your approach, and find a better solution.

Then you move on to your next feature. _There are dozens of features in this dataset._

Figuring out programmatically concise and repeatable ways to clean and explore your data will save you a lot of time.
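
One way to make cleaning repeatable is to keep the imputation rules in a single mapping and apply them with one function, so re-running the whole cleaning step after a change of mind is a single call. The sketch below is a minimal illustration on a toy frame; the names `FILL_VALUES` and `fill_missing` are hypothetical and not part of the project materials.

```python
import numpy as np
import pandas as pd

# Hypothetical rules: column name -> value used to replace NaN.
FILL_VALUES = {"Alley": "None", "Garage_Area": 0}


def fill_missing(df, fill_values):
    """Return a copy of df with NaNs replaced per column.

    Columns named in fill_values but absent from df are skipped.
    """
    present = {col: val for col, val in fill_values.items() if col in df.columns}
    return df.fillna(value=present)


# Toy data standing in for the real training frame.
toy = pd.DataFrame({
    "Alley": ["Pave", np.nan],
    "Garage_Area": [480.0, np.nan],
    "Lot_Area": [9000, 11000],
})
cleaned = fill_missing(toy, FILL_VALUES)
```

Because the function returns a new frame, the raw data stays untouched and the same rules can be applied to both the training and test sets.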

The outline below does not necessarily cover every single thing that you will want to do in your project. You may choose to do some things in a slightly different order. Many students choose to work in a single notebook for this project. Others choose to separate sections out into separate notebooks. Check with your local instructor for their preference and further suggestions.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import r2_score

%matplotlib inline

In [None]:
train = pd.read_csv('/Users/AakashSharma/Documents/DSI/Submissions/Project2/datasets/train.csv')
test = pd.read_csv('/Users/AakashSharma/Documents/DSI/Submissions/Project2/datasets/test.csv')


## EDA

- **Read the data dictionary.**
- Determine _what_ missing values mean.
- Figure out what each categorical value represents.
- Identify outliers.
- Consider whether discrete values are better represented as categorical or continuous. (Are relationships to the target linear?)
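
The checks above can be sketched on a toy frame as follows. This is only an illustration, assuming a per-column missing-value summary and a per-level mean of the target are useful starting points; the toy values are made up and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real data.
toy = pd.DataFrame({
    "Overall_Qual": [3, 5, 5, 7, 7, 9],
    "Pool_QC": [np.nan, np.nan, "Gd", np.nan, np.nan, np.nan],
    "SalePrice": [90000, 140000, 150000, 210000, 230000, 400000],
})

# Count and percentage of missing values per column, largest first.
missing = toy.isnull().sum()
missing_summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})
missing_summary = missing_summary[missing_summary["n_missing"] > 0]
missing_summary = missing_summary.sort_values("n_missing", ascending=False)

# Mean target per level of a discrete feature. Roughly even steps between
# levels suggest a linear relationship; uneven steps may argue for
# treating the feature as categorical.
mean_price_by_qual = toy.groupby("Overall_Qual")["SalePrice"].mean()
```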

## Data Cleaning
- Decide how to impute null values.
- Decide how to handle outliers.
- Do you want to combine any features?
- Do you want to have interaction terms?
- Do you want to manually drop collinear features?
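
Two of the questions above, interaction terms and collinear features, can be explored with a few lines of pandas. The sketch below uses synthetic data; the column name `Qual_x_Area` and the 0.9 correlation threshold are assumptions chosen for illustration, not recommended values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features; Garage_Area is built to track Garage_Cars closely.
toy = pd.DataFrame({
    "Garage_Cars": rng.integers(0, 4, size=n).astype(float),
    "Overall_Qual": rng.integers(1, 11, size=n).astype(float),
    "Gr_Liv_Area": rng.normal(1500, 400, size=n),
})
toy["Garage_Area"] = toy["Garage_Cars"] * 250 + rng.normal(0, 10, size=n)

# Interaction term: the product of two existing features.
toy["Qual_x_Area"] = toy["Overall_Qual"] * toy["Gr_Liv_Area"]

# Absolute pairwise correlations among the original features.
base_cols = ["Garage_Cars", "Garage_Area", "Overall_Qual", "Gr_Liv_Area"]
corr = toy[base_cols].corr().abs()

# Keep only the upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.9]  # assumed threshold
```

Pairs listed in `high_pairs` are candidates for dropping one member or combining the two into a single feature.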

In [None]:
train.columns = train.columns.str.replace(" ", "_")
test.columns = test.columns.str.replace(" ", "_")
train.columns = train.columns.str.replace("/", "_")
test.columns = test.columns.str.replace("/", "_")

In [None]:
train.isnull().sum().sum()

In [None]:
test.isnull().sum().sum()

In [None]:
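# Preview only: the result is not assigned, so `train` itself is unchanged.
train.dropna()

In [None]:
# Preview only: the result is not assigned, so `test` itself is unchanged.
test.dropna()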

In [None]:
# def convert_nan(data, column):
#    return data[column].replace(np.nan, 0, inplace = True)

In [None]:
# Categorical features: replace NaN with the string 'None'.
none_cols = ['Alley', 'Mas_Vnr_Type', 'Fireplace_Qu', 'Garage_Type', 'Garage_Finish',
             'Garage_Qual', 'Garage_Cond', 'Pool_QC', 'Fence', 'Misc_Feature',
             'Electrical', 'Bsmt_Qual', 'Bsmt_Cond', 'Bsmt_Exposure',
             'BsmtFin_Type_1', 'BsmtFin_Type_2']

# Numeric features: replace NaN with 0.
zero_cols = ['Lot_Frontage', 'Mas_Vnr_Area', 'Garage_Yr_Blt', 'Garage_Cars',
             'Garage_Area', 'Bsmt_Full_Bath', 'Bsmt_Half_Bath', 'BsmtFin_SF_1',
             'BsmtFin_SF_2', 'Bsmt_Unf_SF', 'Total_Bsmt_SF']

# Apply the same rules to both frames so train and test stay consistent
# (the numeric basement columns were previously filled in train only).
for df in (train, test):
    df[none_cols] = df[none_cols].fillna('None')
    df[zero_cols] = df[zero_cols].fillna(0)

In [None]:
train.dtypes

In [None]:
test.dtypes

In [None]:
train.shape

In [None]:
test.shape

In [None]:
test.isnull().sum().sum()

In [None]:
train.isnull().sum().sum()

## Exploratory Visualizations
- Look at distributions.
- Look at correlations.
- Look at relationships to target (scatter plots for continuous, box plots for categorical).
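
The last bullet can be sketched with plain matplotlib: a scatter plot for a continuous feature and one box per level for a categorical feature. The data below is synthetic, and the relationship between area and price is invented purely so the plot has a visible shape.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic stand-in for the training data.
toy = pd.DataFrame({
    "Gr_Liv_Area": rng.normal(1500, 400, size=100),
    "Central_Air": rng.choice(["Y", "N"], size=100),
})
toy["SalePrice"] = 100 * toy["Gr_Liv_Area"] + rng.normal(0, 20000, size=100)

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(12, 5))

# Continuous feature vs. target: scatter plot.
ax_scatter.scatter(toy["Gr_Liv_Area"], toy["SalePrice"], alpha=0.5)
ax_scatter.set_xlabel("Gr_Liv_Area")
ax_scatter.set_ylabel("SalePrice")

# Categorical feature vs. target: one box per category level.
labels, groups = [], []
for name, prices in toy.groupby("Central_Air")["SalePrice"]:
    labels.append(name)
    groups.append(prices.to_numpy())
ax_box.boxplot(groups)
ax_box.set_xticks(range(1, len(labels) + 1))
ax_box.set_xticklabels(labels)
ax_box.set_xlabel("Central_Air")
ax_box.set_ylabel("SalePrice")
```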

In [None]:
plt.figure(figsize = (20,20))
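# numeric_only=True (pandas >= 1.5) skips the string columns, which would otherwise raise in pandas 2.x.
sns.heatmap(np.round(train.corr(numeric_only = True), 2), annot = True)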

In [None]:
# Correlation of each numeric feature with the target, weakest to strongest.
train.corr(numeric_only = True)['SalePrice'].sort_values()

In [None]:
sns.pairplot(train, y_vars = ['Overall_Qual', 'Gr_Liv_Area', 'Garage_Area', 'PID', 'Enclosed_Porch'], x_vars = ['SalePrice'])

In [None]:
train.hist(figsize = (15, 15));

In [None]:
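# Pass the series by keyword; recent seaborn versions treat the first positional argument as `data`.
sns.boxplot(x = train['SalePrice'])

In [None]:
sns.boxplot(x = train['Overall_Qual'])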

In [None]:
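# Use the public select_dtypes API to pick numeric columns.
# Note: this also keeps identifier columns such as Id and PID.
features = [col for col in train.select_dtypes(include = 'number').columns if col != 'SalePrice']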
X = train[features]
y = train['SalePrice']

### Model Prep: Train/Test Split


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [None]:
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

### Model Prep: Instantiate Our Models


In [None]:
lr = LinearRegression()

In [None]:
lasso = LassoCV(n_alphas = 200)

In [None]:
ridge = RidgeCV(alphas = np.linspace(.1, 10, 100))

### Cross validation
Use `cross_val_score` to evaluate all three models.

In [None]:
lr_scores = cross_val_score(lr, X_train, y_train, cv = 3)
lr_scores.mean()

In [None]:
lasso_scores = cross_val_score(lasso, X_train, y_train, cv = 3)
lasso_scores.mean()

In [None]:
ridge_scores = cross_val_score(ridge, X_train, y_train, cv = 3)
ridge_scores.mean()
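
In the cells above, the scaler is fit on all of `X_train` before cross-validation, so each validation fold has already influenced the scaling. One common way to avoid that is to wrap the scaler and the model in a pipeline so the scaler is refit inside every fold. The sketch below shows the idea on synthetic data; `X_demo`, `y_demo`, and the `models` dictionary are hypothetical names, and this is an alternative arrangement rather than the approach used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with a known linear signal.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(150, 5))
true_coefs = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=150)

# Each pipeline refits the scaler on the training part of every fold.
models = {
    "lr": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), RidgeCV(alphas=np.linspace(0.1, 10, 100))),
}

# Mean R^2 across 3 folds for each model.
cv_means = {
    name: cross_val_score(model, X_demo, y_demo, cv=3).mean()
    for name, model in models.items()
}
```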

### Model Fitting and Evaluation

#### Ridge Regression Model: Train Data

In [None]:
ridge.fit(X_train, y_train)

In [None]:
ridge.score(X_train, y_train)

In [None]:
ridge.score(X_test, y_test)

In [None]:
ridge_scores.mean()

In [None]:
pred = ridge.predict(X_test)

In [None]:
r2_score(y_test, pred)

In [None]:
pd.Series(ridge.coef_, index = features).plot.bar(figsize = (15, 7))

In [None]:
residuals = y_test - pred

In [None]:
plt.scatter(pred, residuals)
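
R² is unitless, so it can help to summarize the residuals in dollars as well. A minimal sketch of the root mean squared error, computed with NumPy on made-up numbers (the arrays below are illustrative and are not model output):

```python
import numpy as np

# Made-up actual and predicted sale prices.
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# Residuals use the same sign convention as above: actual minus predicted.
residuals_demo = y_true - y_pred

# Root mean squared error, in the same units as the target.
rmse = np.sqrt(np.mean(residuals_demo ** 2))
```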

#### Ridge Regression Model: Test Data

In [None]:
# Baseline submission: predict the mean training SalePrice for every test row.
ybar = train.SalePrice.mean()
ybar

In [None]:
first_submission_example = pd.DataFrame(test.Id)

In [None]:
first_submission_example['SalePrice'] = ybar

In [None]:
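# index = False keeps the DataFrame index out of the file, leaving only Id and SalePrice.
first_submission_example.to_csv('/Users/AakashSharma/Documents/DSI/Submissions/Project2/datasets/predict_test1.csv', index = False)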

In [None]:
first_submission_example.describe()
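
The submission above uses the training mean as a baseline. A later submission would typically replace that constant with model predictions, reusing the scaler that was fit on the training data. The sketch below shows the shape of that step on tiny made-up frames; `train_demo`, `test_demo`, and `feature_cols` are hypothetical, and identifier columns are deliberately left out of the features here.

```python
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

# Made-up frames standing in for the cleaned train and test data.
train_demo = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "Gr_Liv_Area": [900, 1200, 1500, 1800, 2100, 2400],
    "Overall_Qual": [4, 5, 6, 6, 7, 8],
    "SalePrice": [100000, 135000, 170000, 190000, 230000, 280000],
})
test_demo = pd.DataFrame({
    "Id": [7, 8],
    "Gr_Liv_Area": [1000, 2000],
    "Overall_Qual": [5, 7],
})
feature_cols = ["Gr_Liv_Area", "Overall_Qual"]

# Fit the scaler and the model on training data only.
scaler = StandardScaler().fit(train_demo[feature_cols])
model = RidgeCV(alphas=[0.1, 1.0, 10.0])
model.fit(scaler.transform(train_demo[feature_cols]), train_demo["SalePrice"])

# Apply the same scaler to the test features, then predict.
test_preds = model.predict(scaler.transform(test_demo[feature_cols]))
submission = pd.DataFrame({"Id": test_demo["Id"], "SalePrice": test_preds})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
```

The test frame must contain the same feature columns, in the same order, as the frame the scaler and model were fit on.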

## Business Recommendations
- Which features appear to add the most value to a home?
- Which features hurt the value of a home the most?
- What are things that homeowners could improve in their homes to increase the value?
- What neighborhoods seem like they might be a good investment?
- Do you feel that this model will generalize to other cities? How could you revise your model to make it more universal, or what data would you need from another city to make a comparable model?