# Template for the House Price Machine Learning Competition on Kaggle.com
## --- SOLUTION ---

## 7 Steps of ML Projects
The 7 Steps below are universal to the majority of Machine Learning / Data Science projects, so this structure can be carried forward to other projects:
1. Frame the Problem and Ideate Potential Solutions
2. Acquire the Data
3. Exploratory Data Analysis (EDA)
4. Data Wrangling and Feature Engineering
5. Select and Train an ML Algorithm Model
6. Evaluate Results and Fine-Tune Your Model
7. Launch, Monitor, and Maintain Your System

# Step 1: Frame The Problem and Ideate Potential Solutions

## Project Overview

#### Description from Kaggle 
[https://www.kaggle.com/c/house-prices-advanced-regression-techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)

**Start here if...**
You have some experience with R or Python and machine learning basics. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition. 

**Competition Description**
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

**Practice Skills**
- Creative feature engineering 
- Advanced regression techniques like random forest and gradient boosting

**Acknowledgments**
The [Ames Housing dataset](http://www.amstat.org/publications/jse/v19n3/decock.pdf) was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 

## Frame the Problem and Ideate Potential Solutions

This looks to be a pretty straightforward challenge.  Similar to the Titanic challenge, this falls under the supervised learning branch of ML, since we will be using data with known house prices to predict unknown house prices.

**Examples of supervised learning ML algorithms that could be used are:**
- Linear Regression
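- Logistic Regression (a classification algorithm, so it is listed for completeness rather than as a fit for predicting a continuous price)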
- k-Nearest Neighbors
- Support Vector Machines (SVM)
- Decision Trees and Random Forests
- Neural Networks

At first glance, intuition tells me that Linear Regression or Decision Trees / Random Forests will provide the best results, but I will withhold judgment until the model is chosen in a later step.
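To make the supervised learning idea concrete, here is a minimal sketch using made-up numbers (not the Ames data): a straight line is fitted to houses with known prices and then used to estimate prices for houses without one. The variable names and the single `area` feature are assumptions chosen for illustration only.

```python
import numpy as np

# Made-up toy data: living area (sq ft) and known sale prices.
area_known = np.array([800.0, 1200.0, 1500.0, 2000.0])
price_known = np.array([100000.0, 150000.0, 187500.0, 250000.0])

# "Training": fit a straight line price = slope * area + intercept.
slope, intercept = np.polyfit(area_known, price_known, deg=1)

# "Prediction": estimate prices for houses whose price is unknown.
area_unknown = np.array([1000.0, 1800.0])
price_predicted = slope * area_unknown + intercept
print(price_predicted)
```

The real project follows the same pattern, only with many more features and a model chosen in Step 5.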

# Step 2: Acquire the Data

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

## Download the Datasets

Download and unzip the datasets from the [data tab of the Kaggle house price competition page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).  
Then put the 4 files in a subfolder named `Datasets`, next to this notebook.
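Before importing anything, it can help to confirm the files landed in the right place. The sketch below is one possible check; the helper name `missing_dataset_files` is hypothetical, and the exact file names (in particular the `.csv` extension on the sample submission) are assumptions based on the files referenced in this notebook.

```python
from pathlib import Path

# Assumed file names; adjust if your download differs.
EXPECTED_FILES = [
    "train.csv",
    "test.csv",
    "sample_submission.csv",
    "data_description.txt",
]

def missing_dataset_files(folder="Datasets", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]

missing = missing_dataset_files()
if missing:
    print("Missing files:", missing)
else:
    print("All dataset files found.")
```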

## First submission to Kaggle.com (benchmark)
Kaggle provides a pre-made sample submission file called `sample_submission` that shows the proper format of the file you will need to submit later to see your results.  These sample submission files are perfect for testing the Kaggle upload process and for setting an early benchmark to gauge your prediction skills against.

#### Upload and submit the sample_submission
- Go to the [House Price Kaggle web page]() and click the `Submit Predictions` button.  
- Either drag the file onto the page or click the upload icon and select the file.
- Write a brief description (e.g. "sample_submission benchmark upload"), then click Make Submission.

#### Confirm the upload worked and check your score
You should see a score on the sample submission of 0.40890.  
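Unlike competitions that are scored on percent accuracy between 0 and 1.00, this one is scored on the root mean squared error (RMSE) between the logarithm of the predicted price and the logarithm of the actual sale price. The score is therefore an error measure: lower is better, and 0 would be a perfect score. Getting a perfect score is nearly impossible, and if you did, you were either very lucky or possibly cheated.  
So, with the benchmark error at about 0.41, there is a lot of room for improvement, and any model we build should aim to score below it.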
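For reference, Kaggle's evaluation page for this competition describes the score as the RMSE between the logarithm of the predicted price and the logarithm of the observed sale price, so lower values are better. A minimal sketch of that calculation is shown below; the function name `rmsle` and the predicted values are assumptions for illustration, and Kaggle's own scoring code may differ in detail.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); assumes all prices are positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_error = np.log(y_pred) - np.log(y_true)
    return float(np.sqrt(np.mean(log_error ** 2)))

# Two actual sale prices from train.csv and two made-up predictions.
actual = [208500, 181500]
predicted = [200000, 190000]
print(rmsle(actual, predicted))
```

Because the error is taken on logarithms, missing a cheap house by 10% costs about as much as missing an expensive house by 10%.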

## Import the Datasets

### 2A) Import test.csv to `test` dataframe and run head() to make sure it loaded properly

In [2]:
test = pd.read_csv("Datasets/test.csv")
test.head()

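| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |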


### 2B) Import train.csv to `train` dataframe and run head() to make sure it loaded properly

In [3]:
train = pd.read_csv("Datasets/train.csv")
train.head()

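| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |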


### 2C) Make the `Id` column the index for both dataframes

In [4]:
train.set_index('Id', inplace=True)
test.set_index('Id', inplace=True)
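One thing to watch for: because `set_index(..., inplace=True)` consumes the `Id` column, re-running that cell raises a `KeyError`. An alternative, sketched below on a tiny in-memory CSV (two rows taken from the train output above), is to set the index while reading the file with the `index_col` parameter of `pd.read_csv`. The names `csv_text`, `df_a`, and `df_b` are illustrative only.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv with just three of its columns.
csv_text = "Id,LotArea,SalePrice\n1,8450,208500\n2,9600,181500\n"

# Option A: read first, then move Id into the index (returns a new dataframe).
df_a = pd.read_csv(io.StringIO(csv_text))
df_a = df_a.set_index("Id")

# Option B: make Id the index while reading.
df_b = pd.read_csv(io.StringIO(csv_text), index_col="Id")

# Both routes should give the same dataframe.
print(df_a.equals(df_b))
```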

## Combine the train and test files into `combined` dataframe for improved EDA

Note that train has a column for `SalePrice` but test does not (since that's what we will be predicting in this project).  
Ensure that the combined dataframe keeps this column, with null values in the rows taken from test.

### 2D) Create a `combined` dataframe that combines/merges/concatenates/appends train and test. Then run head and tail to ensure it worked

In [22]:
combined = pd.concat([train, test], axis=0, sort=False)
combined.head()

| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500.0 |
| 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500.0 |
| 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500.0 |
| 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000.0 |
| 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000.0 |


In [23]:
combined.tail()

| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2915 | 160 | RM | 21.0 | 1936 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2006 | WD | Normal | NaN |
| 2916 | 160 | RM | 21.0 | 1894 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2006 | WD | Abnorml | NaN |
| 2917 | 20 | RL | 160.0 | 20000 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2006 | WD | Abnorml | NaN |
| 2918 | 85 | RL | 62.0 | 10441 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | MnPrv | Shed | 700 | 7 | 2006 | WD | Normal | NaN |
| 2919 | 60 | RL | 74.0 | 9627 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 11 | 2006 | WD | Normal | NaN |
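The head and tail outputs show the expected pattern: train rows carry a `SalePrice` (now stored as a float), and test rows have it missing. The sketch below reproduces this on two tiny made-up frames and shows one possible way to split the data back apart later, assuming every train row has a price. All names ending in `_toy` or `_back` are illustrative, not part of the project.

```python
import pandas as pd

# Tiny stand-ins for train and test, indexed by Id.
train_toy = pd.DataFrame(
    {"LotArea": [8450, 9600], "SalePrice": [208500, 181500]},
    index=pd.Index([1, 2], name="Id"),
)
test_toy = pd.DataFrame(
    {"LotArea": [11622, 14267]},
    index=pd.Index([1461, 1462], name="Id"),
)

# Stack the rows; SalePrice is filled with NaN for the test rows.
combined_toy = pd.concat([train_toy, test_toy], axis=0, sort=False)
print(combined_toy)

# Split back apart, using the missing SalePrice to identify test rows.
is_test_row = combined_toy["SalePrice"].isna()
train_back = combined_toy[~is_test_row]
test_back = combined_toy[is_test_row].drop(columns="SalePrice")
```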


# Step 3: Exploratory Data Analysis (EDA)

### 3A) Read the `data_description.txt` file to get a better understanding of the data

### 3B) Run info and describe on `combined` to see column types, non-null counts, and summary statistics
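As a rough sketch of what this step can look like, the example below runs `info()` and `describe()` on a small made-up dataframe and then ranks columns by their share of missing values, which is often a useful first look before data wrangling. The frame `toy` and the name `missing_share` are assumptions for illustration; on the real data you would call the same methods on `combined`.

```python
import numpy as np
import pandas as pd

# Made-up frame that mimics a few columns of the combined data.
toy = pd.DataFrame(
    {
        "LotFrontage": [65.0, 80.0, np.nan, 60.0],
        "Alley": [None, None, "Pave", None],
        "SalePrice": [208500.0, 181500.0, np.nan, np.nan],
    }
)

toy.info()             # column dtypes and non-null counts
print(toy.describe())  # summary statistics for the numeric columns

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```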

# Step 4: Data Wrangling and Feature Engineering

# Step 5: Select and Train an ML Algorithm Model

# Step 6: Evaluate Results and Fine-Tune Your Model

# Step 7: Launch, Monitor, and Maintain Your System