## How to submit to Kaggle

This notebook walks you through building a linear regression model with two features and submitting its predictions to Kaggle.

In [1]:
import pandas as pd
import seaborn as sns

%matplotlib inline

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

You'll need both `train.csv` and `test.csv`:

In [3]:
df = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [4]:
df.head(3)

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,...,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,...,0,0,,,,0,1,2010,WD,109000


Let's use `Overall Qual` and `Gr Liv Area` to build our model, since they are the two features most strongly correlated with `SalePrice`:

In [5]:
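# numeric_only=True skips the text columns (needed on pandas 2.0+, where corr() no longer drops them silently)
df.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)[:5]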

SalePrice       1.000000
Overall Qual    0.800207
Gr Liv Area     0.697038
Garage Area     0.650270
Garage Cars     0.648220
Name: SalePrice, dtype: float64

Defining `X` and `y`:

In [6]:
features = ['Overall Qual', 'Gr Liv Area']

X = df[features]
y = df['SalePrice']

Train-test split:

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Initialize, fit, and evaluate the model:

In [8]:
lr = LinearRegression()

lr.fit(X_train, y_train)
lr.score(X_train, y_train), lr.score(X_test, y_test)



(0.7194163570829553, 0.7530182162122661)

Nice! The model scores well enough on the held-out split (`X_test`, `y_test`) that I'm comfortable using it to make predictions on more unseen data.

This next line of code won't work. Let's see why:

In [9]:
lr.predict(test)

ValueError: could not convert string to float: 'WD '

This happens because we're passing the whole `test` dataframe to the model, string columns included, rather than just the two columns we selected for `X`. Compare their shapes:

In [None]:
test.shape

In [None]:
X.shape

`test` will need to have the _exact same columns_ as `X` in order for our model to work.
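Before subsetting, it can help to see exactly which columns differ. Here's a minimal sketch of that check. It uses a tiny made-up dataframe (`toy_test`, with invented values) as a stand-in for the real `test`, so the names and numbers below are assumptions for illustration only:

```python
import pandas as pd

# The two feature names used to train the model in this notebook.
features = ['Overall Qual', 'Gr Liv Area']

# Hypothetical stand-in for `test`; the values are made up.
toy_test = pd.DataFrame({
    'Id': [1, 2],
    'Overall Qual': [6, 7],
    'Gr Liv Area': [1500, 1800],
    'Sale Type': ['WD ', 'New'],
})

# Columns the model needs that are absent from the test data.
missing = [col for col in features if col not in toy_test.columns]

# Columns in the test data that the model was never trained on.
extra = [col for col in toy_test.columns if col not in features]

print('missing:', missing)
print('extra:', extra)
```

If `missing` is empty, selecting the `features` columns from the dataframe is safe. Anything listed in `extra` is what needs to be left out before calling `predict`.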

I'll subset `test` with the list `features` that I saved above:

In [None]:
test_subset = test[features]
test_subset.head(3)

Now I can store predictions:

In [None]:
preds = lr.predict(test_subset)

Of course, we can't score these predictions ourselves: we don't have the true values of `y` for `test`. We need to upload our predictions to Kaggle so that it can calculate the RMSE.
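We can still get a rough local estimate of the RMSE from the held-out split made earlier, since we do know `y_test`. Here's a small sketch of the calculation. The helper name `rmse` and the example prices are my own (hypothetical) additions, and Kaggle's score may differ from a local estimate because it is computed on different rows:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices, for illustration only.
example_true = [100000, 200000, 300000]
example_pred = [110000, 190000, 300000]

example_rmse = rmse(example_true, example_pred)
print(round(example_rmse, 2))
```

In this notebook, the equivalent call would be `rmse(y_test, lr.predict(X_test))`.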

Kaggle expects a very specific CSV file: one row for every observation, and two columns named `Id` and `SalePrice`. The `Id` column should contain the house ID associated with each predicted sale price. One way to build that file is to start from a dataframe like this:

In [None]:
to_submit = pd.DataFrame({
    'Id': test['Id'],
    'SalePrice': preds
})

Here, we have a dataframe with appropriately labeled columns:

In [None]:
to_submit.head(3)

Export to csv:

In [None]:
to_submit.to_csv('submissions.csv', index=False)

Note: Setting `index=False` keeps the dataframe's index from being written out as an extra column. **Kaggle will not accept a csv file that contains the indices.**
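One optional sanity check before uploading is to read the file back in and confirm it has the layout described above. This is only a sketch: it writes a toy two-row submission (made-up IDs and prices) to a temporary folder, so `toy_submit` and `path` are assumptions standing in for `to_submit` and your real file:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for `to_submit`; the values are made up.
toy_submit = pd.DataFrame({
    'Id': [2658, 2718],
    'SalePrice': [150000.0, 210000.0],
})

# Write to a temporary folder so nothing in the working directory is touched.
path = os.path.join(tempfile.mkdtemp(), 'submissions.csv')
toy_submit.to_csv(path, index=False)

# Read the file back and check its layout.
check = pd.read_csv(path)
columns_ok = list(check.columns) == ['Id', 'SalePrice']  # no stray index column
no_missing = not check.isna().any().any()                # no empty predictions

print(columns_ok, no_missing, check.shape)
```

On the real file, the row count in `check.shape` should also match the number of rows in `test`.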

Good luck!