Part 1: Data Cleaning (Houses in Iowa Dataset)

The Data Science Workflow

In general, there are a few key steps to begin working with a dataset.

First, we need to understand what the dataset is actually about, and what we are trying to 
do with it. Key to this stage is understanding what each row of the dataset represents, as 
well as what each column indicates. You can read more about this dataset on the Kaggle 
competition page (https://www.kaggle.com/c/house-prices-advanced-regression-techniques/), or 
by reading the data dictionary in crash-course/Kaggle/DATA/house-prices/data_description.txt.

Second, before we can even do exploratory data analysis, we need to __clean our dataset__. 
It is very helpful to identify the size of the dataset, so we know how many samples we 
have. We need to determine a consistent method of dealing with missing values, such as 
setting them to a fixed value, removing the feature entirely, or interpolating values. Other 
crucial steps are splitting the data into training and validation sets, as well as creating 
some basic plots.
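
The cleaning steps above can be sketched on a small made-up dataframe. The column names mirror ones from the data dictionary, but the values, the 50% drop threshold, the median fill, and the 80/20 split are all illustrative assumptions, not choices prescribed by this project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 6120, 7420],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, 85.0, 75.0, np.nan, 51.0, 50.0],
    "PoolQC": [np.nan] * 9 + ["Gd"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000,
                  143000, 307000, 200000, 129900, 118000],
})

# 1. Size of the dataset: (number of rows, number of columns).
n_rows, n_cols = toy.shape

# 2. Count missing values in each column.
missing_counts = toy.isna().sum()

# 3a. Remove features that are mostly missing (the 50% cutoff is an assumption).
mostly_missing = missing_counts[missing_counts / n_rows > 0.5].index
cleaned = toy.drop(columns=mostly_missing)

# 3b. Fill the remaining numeric gaps with the column median (one option of many).
cleaned["LotFrontage"] = cleaned["LotFrontage"].fillna(cleaned["LotFrontage"].median())

# 4. Hold out 20% of the rows for validation; the seed makes the split repeatable.
validation = cleaned.sample(frac=0.2, random_state=0)
training = cleaned.drop(validation.index)
```

Whatever strategy is chosen for missing values, the same rule should be applied to the training, validation, and test data so that the model sees consistently prepared inputs.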

Third, we have exploratory data analysis (EDA). The point of this phase is to inspect and 
visualize key relationships, trends, outliers, and issues with our data. By conducting 
EDA, we get a better sense of the underlying structure of our data, and which features are 
most important. Especially since we have 81 columns, we would like to select the features 
that are most important to our analysis before we begin modeling SalePrice. Once we have 
an idea about which features to use and how they relate to one another, our modeling stage 
will be more informed and robust.
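
One common first look during EDA is to rank numeric features by how strongly they correlate with the response. The sketch below uses invented values for a few columns named in the data dictionary. Ranking by absolute Pearson correlation is just one possible screening heuristic: it only captures linear relationships and does not cover categorical features.

```python
import pandas as pd

# Toy data; the values are invented purely for illustration.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
    "OverallQual": [7, 6, 7, 7, 8, 5],
    "YrSold": [2008, 2007, 2008, 2006, 2008, 2009],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# Pearson correlation of each numeric feature with the response.
correlations = toy.corr()["SalePrice"].drop("SalePrice")

# Rank features by absolute correlation, strongest first.
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```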

Fourth, we have the modeling phase, consisting of model selection and model training. 
Here, we select a predictive model to train on our features, and then actually train the 
model! In this first week, we will be using linear regression as a first pass. In later 
weeks of the SUSA CX Kaggle Capstone project, we will be using more advanced models like 
random forests and neural networks. Depending on the model we are using, it is important to 
verify the model's assumptions before fully committing to that model. Additionally, we may 
use validation in this stage to select hyperparameters for the chosen model.
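
As a rough illustration of what a first-pass linear regression does, the sketch below fits a line by ordinary least squares using NumPy. The data are generated from a known line (an assumption made so the result is easy to check), and `np.linalg.lstsq` is used here only as one convenient way to solve the problem; it is not necessarily the tool used later in this project.

```python
import numpy as np

# Toy data generated from a known line with no noise:
# price = 50000 + 100 * area. All values are illustrative.
area = np.array([1000.0, 1500.0, 2000.0, 2500.0])
price = 50000.0 + 100.0 * area

# Design matrix with a column of ones so the model can learn an intercept.
X = np.column_stack([np.ones_like(area), area])

# Ordinary least squares: coefficients minimizing the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, slope = beta

# Predictions from the fitted line.
predicted = X @ beta
```

Because the toy data lie exactly on a line, the fitted intercept and slope recover the values used to generate them. With real, noisy data the fit will only approximate the underlying relationship.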

Finally, we have the model evaluation phase. Here, we compute a metric for our model's 
performance, usually by summing the squared errors of the model's predictions on the test 
set. This stage allows us to compare data cleaning and model selection decisions 
effectively, by giving us a single comparable value for performance across our 
potential models.
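
To make the idea of summing squared errors concrete, here is a small sketch with made-up actual and predicted prices. The root mean squared error (RMSE) is included as a closely related metric that is easier to read because it is in the same units as the prices; the numbers are purely illustrative.

```python
import math

# Invented actual and predicted sale prices for three held-out houses.
actual = [200000.0, 150000.0, 300000.0]
predicted = [210000.0, 140000.0, 290000.0]

# Error of each prediction.
errors = [p - a for p, a in zip(predicted, actual)]

# Sum of squared errors (SSE): larger mistakes are penalized more heavily.
sse = sum(e ** 2 for e in errors)

# Root mean squared error (RMSE): the typical size of an error, in dollars.
rmse = math.sqrt(sse / len(errors))
```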

I. Understanding our Dataset

Our dataset is about houses in Iowa! According to the Kaggle webpage, the competition is 
as follows:

Ask a home buyer to describe their dream house, and they probably won’t begin with the 
height of the basement ceiling or the proximity to an east-west railroad. But this 
playground competition’s dataset proves that much more influences price negotiations than 
the number of bedrooms or a white-picket fence. With 79 explanatory variables describing 
(almost) every aspect of residential homes in Ames, Iowa, this competition challenges you 
to predict the final price of each home.

More explicitly, our dataset has 81 columns, or features:

- SalePrice, our response variable $Y$ that we are trying to predict accurately and 
  precisely
- Id, a simple identification variable
- 79 explanatory variables $X_k$ that we can use to predict SalePrice. Some of the 
  variables are categorical, and others are quantitative.

The goal of the next four weeks is to create a model that trains on (some of) the 79 
explanators from the training set to predict SalePrice well in the test set.
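
The column roles described above can be sketched as follows, on a tiny invented dataframe rather than the real data. Splitting by dtype is only a first approximation: some columns stored as numbers (for example MSSubClass in the data dictionary) are really categorical codes, so the data dictionary should have the final say.

```python
import pandas as pd

# Toy dataframe with one column of each role; the values are invented.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "SalePrice": [208500, 181500, 223500],
})

# Response variable Y.
y = toy["SalePrice"]

# Explanatory variables X_k: everything except the identifier and the response.
X = toy.drop(columns=["Id", "SalePrice"])

# First-pass split of the explanators into quantitative and categorical columns.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]
```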

How will we know which explanators to use? We can start with some intuition and research 
into what each column represents by reading the data dictionary in 
crash-course/Kaggle/DATA/house-prices/data_description.txt.

Please take a moment to read over this dictionary, as you will need to have a keen sense 
of these features for the weeks ahead. Can you come up with five features you suspect will 
be important in determining the SalePrice?

Now that we've talked a bit about this dataset, let's actually take a look at it. The 
first step is to load in the data! We will store it in a pandas DataFrame.

In [6]:
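import pandas as pd

# These paths are relative to the working directory, which is assumed here to be
# crash-course/Kaggle (the folder that contains DATA/).
train = pd.read_csv('DATA/house-prices/train.csv')
test = pd.read_csv('DATA/house-prices/test.csv')

# Let's see what the training dataframe looks like!
train.head(10)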

FileNotFoundError: File b'DATA/house-prices/train.csv' does not exist
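
The FileNotFoundError above means the relative path did not resolve to an existing file from the directory the notebook was running in. A quick way to diagnose this, sketched here with only the standard library, is to print the working directory and check whether the file exists before loading it.

```python
from pathlib import Path

# Relative path used in the cell above.
train_path = Path("DATA/house-prices/train.csv")

# Where Python is resolving relative paths from right now.
print("Working directory:", Path.cwd())

# The absolute location that the relative path points to, and whether it exists.
print("Looking for:", train_path.resolve())
print("Exists:", train_path.exists())
```

If the file is reported as missing, either start the notebook from the folder that contains DATA/ or adjust the path passed to pd.read_csv accordingly.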