# Project 2 Model Workbook (Lasso / Kaggle / Friday)

---
I used this workbook to run Lasso models with different numbers of input variables and one of three transformations of the target y: log(y), 1/log(y), and 1/y. Overall, none of these changes improved my Kaggle score, which was my focus during Friday's work.

---
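
Each of the three target transformations has to be undone before predictions can be compared with real prices. The following is a minimal sketch of the forward and inverse mappings; the `prices` array is made up for illustration and is not taken from the training data.

```python
import numpy as np

# Hypothetical sale prices, for illustration only
prices = np.array([100000.0, 150000.0, 250000.0])

# Forward transforms applied to the target before fitting
log_y = np.log(prices)
inv_log_y = 1 / np.log(prices)
inv_y = 1 / prices

# Inverse transforms applied to predictions to get back to dollars
from_log = np.exp(log_y)
from_inv_log = np.exp(1 / inv_log_y)
from_inv = 1 / inv_y

print(np.allclose(from_log, prices),
      np.allclose(from_inv_log, prices),
      np.allclose(from_inv, prices))
```
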
Import cleaned data from prior steps.  
Select columns for model.  
Add dummies.  
Add polynomial features.  
Standard-scale the features.  
Then run various models.

---


In [549]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import r2_score
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
import pickle

%matplotlib inline

In [516]:
cleantrain = './datasets/train_nonulls8.csv'
cleantest =  './datasets/test_nonulls8.csv'

df = pd.read_csv(cleantrain)
df_test = pd.read_csv(cleantest)

y = df['SalePrice']                             # For train, save SalePrice in new var
df.drop('SalePrice', axis =1, inplace = True)   # For train, drop SalePrice and use new var
Id_y_hat = df_test['Id']                        # Save Ids for Kaggle output
train_Id = df['Id']                             # Save Ids from train data (need?)

y = 1/y                                 # y is skewed; this run models 1/y (log(y) and 1/log(y) were also tried)

# print(df.shape)
# print(df_test.shape)
# print(len(y))

#### Dummies
---
The functions below make it easier to add dummy variables to the data.

---

In [517]:
# Return the names of all non-numeric (object-dtype) columns as a list
def get_dummy_list(df_in):
    is_text = (df_in.dtypes == 'object').to_numpy()   # boolean mask, one entry per column
    return list(df_in.columns[is_text])

# This takes a list of variables in and a dataframe.  It outputs the dataframe with dummy columns added
def add_dummies(dlist, df_in):
    df_out = pd.get_dummies(df_in, columns = dlist, drop_first = True)
    return(df_out)

# After dummy variables are created, any remaining text-based columns need to be removed from the data.
def remove_text_var(df_in):
    is_numeric = (df_in.dtypes != 'object').to_numpy()   # boolean mask, one entry per column
    return df_in.loc[:, is_numeric]
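
As a rough illustration of what the dummy step produces, here is a small self-contained sketch on a made-up frame. The `toy` data is hypothetical; the column names only imitate the style of the housing data.

```python
import pandas as pd

# Hypothetical two-column frame: one numeric column, one text column
toy = pd.DataFrame({
    "Lot Area": [8000, 9500, 7200],
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),
})

# Find the text columns, then one-hot encode them.
# drop_first=True drops the first category (here "Grvl") to avoid a redundant column.
text_cols = [c for c in toy.columns if toy[c].dtype == object]
encoded = pd.get_dummies(toy, columns=text_cols, drop_first=True)

print(list(encoded.columns))
```

With `drop_first=True`, a two-category column such as `Street` collapses to a single indicator column.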

#### Polynomial Additions
---
The following functions make it easier to add polynomial variables to the data.  
A list of lists gets passed in along with a dataframe.  
In this way, polynomial terms can be added by groups of variables rather than in one huge mass.  
The unify function is needed as well in case the dummies or polynomials result in different columns for train versus test data.

---

In [518]:
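def add_poly(var_list, df_in):
    df_slice = df_in[var_list]                           # only poly the variables in list
    # include_bias=True also writes a constant column named '1' on every call
    poly = PolynomialFeatures(2, include_bias = True, interaction_only=True)
    new_vals = poly.fit_transform(df_slice)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    new_names = poly.get_feature_names_out(df_slice.columns)
    for name, values in zip(new_names, new_vals.T):      # add (or overwrite) one column per term
        df_in[name] = values
    return(df_in)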

def unify_columns(df1, df2):
    clist1 = set(df1.columns)
    clist2 = set(df2.columns)
    master_list = clist1.union(clist2)
    add_to_1 = list(master_list - clist1)
    add_to_2 = list(master_list - clist2)
  
    if len(add_to_1) > 0:
        for each in add_to_1:
            df1[each] = 0
    
    if len(add_to_2) > 0:
        for each in add_to_2:
            df2[each] = 0

    return (df1.sort_index(axis = 1), df2.sort_index(axis = 1))
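
The same alignment idea can be sketched with `DataFrame.reindex`, which fills any missing columns with zeros in one call. This is an alternative formulation rather than the function used in this workbook, and the two toy frames are hypothetical.

```python
import pandas as pd

# Hypothetical frames whose dummy columns differ between train and test
train_toy = pd.DataFrame({"Area": [1, 2], "Roof_Hip": [0, 1]})
test_toy = pd.DataFrame({"Area": [3, 4], "Roof_Shed": [1, 0]})

# Union of all column names, sorted so both frames share the same order
all_cols = sorted(set(train_toy.columns) | set(test_toy.columns))

# reindex adds any missing column and fills it with 0
train_aligned = train_toy.reindex(columns=all_cols, fill_value=0)
test_aligned = test_toy.reindex(columns=all_cols, fill_value=0)

print(list(train_aligned.columns) == list(test_aligned.columns))
```

Unlike assigning columns in place, `reindex` returns new frames and leaves the inputs untouched.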

### Variable Selection
---
The next section brings together the data and calls the functions above to create the dataset that will go into the regression.

---

In [519]:
col_all = df.columns

# These are dropped from model before considering correlations
# Generally they are columns that have been zeroed out from LassoCV.
# I've iterated through columns using top correlated in, Lasso out, etc.

cols_to_exclude = [
#                   'Low Qual Fin SF', '2nd Flr SF', '1st Flr SF',
#                   'BsmtFin SF 1', 'BsmtFin SF 2',
#                   'Misc Val', 'Misc Feature',
                   'Id', 'PID'
#                   'Garage Area', 'Garage Cars','Full Bath',
#                   'TotRms AbvGrd', 'Garage Yr Blt'
                  ]


cols_in = col_all.drop(cols_to_exclude)

train_model = df[cols_in]                                #only take cols_in into model
test_model = df_test[cols_in]

# These numeric columns are really categories, so cast them to strings here.
# If this were done in data prep, it would be reversed when the CSV is re-read.
train_model = train_model.astype({"MS SubClass": str, "Mo Sold": str})
test_model = test_model.astype({"MS SubClass": str, "Mo Sold": str})




feat = get_dummy_list(train_model)                        # Add dummies for all non-numeric columns
feat.remove('Neighborhood')                               # Except Neighborhood: I want my feature based on it, not its raw dummies
train_model = add_dummies(feat, train_model)
test_model = add_dummies(feat, test_model)

train_model = remove_text_var(train_model)                #remove non-numeric variable
test_model = remove_text_var(test_model)



---

The section below lists the groups of variables that the user wants to expand with polynomial terms.  User input required.

---



In [520]:
#poly_list = [['Full Bath', 'Half Bath'],
#             ['Garage Cars', 'Garage Area'],
#             ['Wood Deck SF', 'Open Porch SF', 'Screen Porch'],
#             ['Bsmt Unf SF', 'BsmtFin SF 1', 'BsmtFin SF 2'],
#             ['1st Flr SF', '2nd Flr SF', 'Gr Liv Area', 'TotRms AbvGrd'] ,
#             ['Year Built','Year Remod/Add','Yr Latest Change']
#            ]

#poly_list = [poly_all]

# Runs 20 on:
# poly_list = [['Garage Cars', 'Garage Area'],
#              ['Bsmt Unf SF', 'BsmtFin SF 1', 'BsmtFin SF 2'],
#              ['1st Flr SF', '2nd Flr SF', 'Gr Liv Area', 'TotRms AbvGrd']
#             ]

poly_list = [
            ['1st Flr SF', '2nd Flr SF', 'Gr Liv Area', 'TotRms AbvGrd'],
             ['Wood Deck SF', 'Open Porch SF', 'Screen Porch'],
             ['Garage Cars', 'Garage Area'],
             ['Full Bath', 'Half Bath'],
             ['Bsmt Unf SF', 'BsmtFin SF 1', 'BsmtFin SF 2']
            ]

for each in poly_list:
    train_model = add_poly(each, train_model)
    test_model = add_poly(each, test_model)


train_model, test_model = unify_columns(train_model, test_model)       # create columns where needed so cols match

---
The final step in pulling together the model is to make the final variable selections.  I have chosen to pick those based on correlations with the target (positive and negative) across all of the features prepared above.

---

In [521]:
numb_vars = 145  # Up to twice this number of variables will be included (top positive plus top negative).

# Correlation of every column with the (transformed) target, computed once
corr_with_y = pd.concat([train_model, y], axis = 1).corr()['SalePrice']

top_corr = list(corr_with_y.sort_values(ascending = False).index[1:numb_vars+1])   # skip SalePrice itself
top_neg_corr = list(corr_with_y.sort_values(ascending = True).index[0:numb_vars])

train_model_f = train_model[top_corr].copy()                 
test_model_f = test_model[top_corr].copy()                  

train_model_f[top_neg_corr] = train_model[top_neg_corr]          #final train data
test_model_f[top_neg_corr] = test_model[top_neg_corr]            #final test data
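
The selection rule can be sketched on synthetic data as follows. Everything here (`x_pos`, `x_neg`, `noise`, `target`, and `k`) is made up for illustration, and the sketch drops undefined correlations before ranking, which the cell above does not do.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic features: one pushes the target up, one pushes it down, one is unrelated
toy = pd.DataFrame({
    "x_pos": rng.normal(size=200),
    "x_neg": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
target = 3 * toy["x_pos"] - 2 * toy["x_neg"] + 0.1 * rng.normal(size=200)

# Correlation of each feature with the target, highest first
corr = toy.corrwith(target).dropna().sort_values(ascending=False)

k = 1                                # features to keep from each end of the ranking
top_pos = list(corr.index[:k])       # most positively correlated
top_neg = list(corr.index[-k:])      # most negatively correlated
selected = toy[top_pos + top_neg]

print(top_pos, top_neg)
```

Taking features from both ends of the ranking keeps strongly negative predictors that a plain "top k by correlation" cut would discard.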

## Train Test Split
Now that the data is set up, we split the "train" dataset into train/test components so that we can test various modeling techniques.  Once a model is in shape to produce Kaggle output, the full train_model above will be used to calibrate the model and the test_model input will be used to make predictions.

In [522]:
X_train, X_test, y_train, y_test = train_test_split(train_model_f, y, random_state = 1929)

## To make sure the data is in good shape, run all cells above this point.  Select models to run and analyze below.

# Lasso without Pipeline - Using to Zero In on Kaggle Scores

In [523]:
ss = StandardScaler()
ss.fit(X_train)
X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)
train_f = ss.transform(train_model_f)
test_f = ss.transform(test_model_f)
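
The scaling pattern used here (fit the scaler on the training split only, then reuse it everywhere) can be sketched on synthetic data. The arrays below are randomly generated and stand in for the real feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic feature matrix and target, for illustration only
X = rng.normal(loc=50, scale=10, size=(100, 3))
y_toy = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y_toy, random_state=0)

# Learn means and standard deviations from the training split only
scaler = StandardScaler().fit(X_tr)

# Apply the same training-set statistics to both splits
X_tr_ss = scaler.transform(X_tr)
X_te_ss = scaler.transform(X_te)

print(X_tr_ss.mean(axis=0).round(6), X_te_ss.mean(axis=0).round(2))
```

The training columns come out with mean 0 and unit variance; the held-out columns are only approximately centered, because they were scaled with statistics they did not contribute to.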

In [524]:
ls = LassoCV(max_iter=10000, cv=2, n_alphas = 500)
ls.fit(X_train_ss, y_train)

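# n is the run counter kept in the notebook session; it must already exist before this cell runs
n += 1
ns = f'{n:02d}'        # two-digit run label, e.g. '07' or '15'
print(ns)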

#ytrain = np.exp(1/ls.predict(X_train_ss))
#ytest = np.exp(1/ls.predict(X_test_ss))
#ykaggle = np.exp(1/ls.predict(test_f))


#ytrain = np.exp(ls.predict(X_train_ss))
#ytest = np.exp(ls.predict(X_test_ss))
#ykaggle = np.exp(ls.predict(test_f))

# ytrain = (ls.predict(X_train_ss))
# ytest = ls.predict(X_test_ss)
# ykaggle = ls.predict(test_f)

ytrain = 1/(ls.predict(X_train_ss))
ytest = 1/ls.predict(X_test_ss)
ykaggle = 1/ls.predict(test_f)

15


In [525]:
# Actual y's re-transformed
A_y_train =(1/y_train)
A_y_test = (1/y_test)

In [526]:
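print(ls.alpha_)
# Note: r2_score expects (y_true, y_pred). The calls below pass the predictions first,
# and the scores printed underneath were produced with that argument order.
print(r2_score(ytrain, A_y_train ))
print(r2_score(ytest,A_y_test))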

1.648951282780468e-07
0.7281324307046508
0.7067477477582707
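
`r2_score` takes the true values first and the predictions second, and the result changes if the two are swapped. A small sketch with made-up numbers shows the difference:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([150.0, 200.0, 300.0, 350.0])

# Documented order: r2_score(y_true, y_pred)
score_true_first = r2_score(actual, predicted)

# Swapped order: the total sum of squares is now taken around the predictions' mean
score_pred_first = r2_score(predicted, actual)

print(score_true_first, score_pred_first)
```

The residuals are identical in both calls, but the denominator is the variance of whichever array is passed first, so the two scores differ whenever the predictions are more or less spread out than the actual values.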


In [527]:
ykaggle[:5]

array([ 92194.26565224, 153392.41362789, 194913.51966602,  97553.04578572,
       159686.1498723 ])

In [528]:
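# Store fit results systematically, keyed by run number
# (results must already exist as a dict; argument order matches the r2_score calls above)
var1 = ns +'_train'
var2 = ns +'_test'
results[var1] = r2_score(ytrain, A_y_train)
results[var2] = r2_score(ytest,A_y_test)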

In [529]:
results

{'01_test': 0.8820455601866684,
 '01_train': 0.9343717741830373,
 '02_test': 0.870407407104832,
 '02_train': 0.9170755319994874,
 '03_test': 0.851336843690979,
 '03_train': 0.8908876023241936,
 '04_test': 0.8810922173597352,
 '04_train': 0.9278445119966676,
 '05_test': 0.8820455601866684,
 '05_train': 0.9343717741830373,
 '06_test': 0.8820133907177033,
 '06_train': 0.9349582618135205,
 '07_test': 0.8822502257738419,
 '07_train': 0.935353966696235,
 '08_test': 0.8629402961916399,
 '08_train': 0.9167370516277821,
 '09_test': 0.8854261749335861,
 '09_train': 0.9333126493935868,
 '10_test': 0.8822912486437898,
 '10_train': 0.9349383919301046,
 '11_test': 0.8815219338853812,
 '11_train': 0.9360313999586102,
 '12_test': 0.8556696632833527,
 '12_train': 0.895638865953223,
 '13_test': 0.8101145025114722,
 '13_train': 0.4420531868467207,
 '14_test': 0.7055343733827828,
 '14_train': 0.7258285054374725,
 '15_test': 0.7067477477582707,
 '15_train': 0.7281324307046508}

In [530]:
# Store predictions systematically, one column per run
ytrain_df[ns] = ytrain
ytest_df[ns] = ytest
yka_df[ns] = ykaggle


In [497]:
n

13

In [547]:
numb = 'fri07'
outfileb = './submissions/'+numb+'.csv'
out_to_kaggle = pd.DataFrame({'Id':Id_y_hat, 'SalePrice':(1/3)*(yka_df['10']+yka_df['07']+yka_df['15'])})
out_to_kaggle.to_csv(outfileb, index = False)
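
The submission above is an equal-weight average of three runs. As a small sketch of the same blend, the snippet below averages selected columns with `mean(axis=1)`; the `preds` frame holds made-up numbers, not real predictions.

```python
import pandas as pd

# Hypothetical predictions from three runs, one column per run label
preds = pd.DataFrame({
    "07": [100.0, 200.0],
    "10": [110.0, 190.0],
    "15": [90.0, 210.0],
})

# Equal-weight blend of the chosen runs, row by row
runs_to_blend = ["07", "10", "15"]
blend = preds[runs_to_blend].mean(axis=1)

print(blend.tolist())
```

Keeping the run labels in a list makes it easy to change which runs go into the blend without rewriting the arithmetic.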

In [None]:
import pickle

# obj0, obj1, obj2 are created here...

# Saving the objects (pickle needs binary mode in Python 3):
with open('objs.pkl', 'wb') as f:
    pickle.dump([obj0, obj1, obj2], f)

# Getting back the objects:
with open('objs.pkl', 'rb') as f:
    obj0, obj1, obj2 = pickle.load(f)

In [554]:
with open('friday_variables', 'wb') as fileout:
    pickle.dump([yka_df, ytrain_df, ytest_df, results], fileout)
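
A quick round trip confirms that objects saved this way load back unchanged. The sketch below uses small made-up objects and a temporary directory, so it does not touch the `friday_variables` file.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the stored predictions and results
preds_toy = [1.0, 2.0, 3.0]
results_toy = {"01_train": 0.93, "01_test": 0.88}

# Write to a throwaway location
path = os.path.join(tempfile.mkdtemp(), "toy_variables.pkl")

with open(path, "wb") as fileout:          # binary write mode
    pickle.dump([preds_toy, results_toy], fileout)

with open(path, "rb") as filein:           # binary read mode
    loaded_preds, loaded_results = pickle.load(filein)

print(loaded_preds == preds_toy, loaded_results == results_toy)
```

The objects come back in the same order they were dumped, so the unpacking on load has to match the list passed to `pickle.dump`.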

In [259]:
# Set up dataframes to store predictions systematically, one column per run

ytrain_df = pd.DataFrame({
    "05":y_05_train,
    "06":ytrain
})

ytest_df = pd.DataFrame({
    "05":y_05_test,
    "06":ytest
})
yka_df = pd.DataFrame({
    "05":y_05_kaggle,
    "06":ykaggle
})

In [260]:
ytrain_df

              05             06
0  194948.279274  194486.035567
1  105825.928871  105118.848421
2  194348.418859  194096.032580
3  147722.627878  150598.874421
4  160407.139049  160915.671212
5  413637.865897  415197.561074
6  207896.128094  209619.443596
7  379441.001666  378559.193455
8  171762.178196  170435.771367
9  147461.321910  146884.098325
