# SUSA CX Kaggle Capstone
## Part 1: Introduction, Exploratory Data Analysis, and Feature Selection

### Table Of Contents
* [Introduction](#section1)
* [Data Science Workflow](#section2)
    1. [Understanding Our Dataset](#i)
    2. [Data Cleaning](#ii)
    3. [EDA](#iii)
    4. [Modeling](#iiii)
    5. [Model Evaluation](#v)
* [Conclusion](#conclusion)
* [Additional Reading](#reading)


### Hosted and maintained by the [Statistics Undergraduate Students Association (SUSA)](https://susa.berkeley.edu). Originally authored by [Arun Ramamurthy](mailto:contact@arun.run), [Patrick Chao](mailto:prc@berkeley.edu), & [Noah Gundotra](mailto:noah.gundotra@berkeley.edu).


<a id='section1'></a>
# SUSA CX Kaggle Capstone Project

Welcome to the last four weeks of your semester in SUSA's Career Exploration committee! Now that you've participated in nearly a dozen workshops on Python, R, data science, and machine learning, we're going to guide you through a four-week collaborative Kaggle competition with your peers in Career Exploration. We want to give you the experience of working with real data, using real machine learning algorithms, in an educational setting. You will have to make use of your data computing skills in Python, dive into reading kernels on the Kaggle website, use visualization and feature engineering to improve your score, and maybe even pick up a few advanced deep learning models along the way. 

If this sounds a bit intimidating right now, do not fret! Your SUSA Mentors will be there to mentor you through the whole thing. Don't worry if our code, our approach, or any part of this process seems unnatural. It is. It takes time to learn, and it takes time to teach. We're learning as we make these tutorials, too! So without further ado, let's get cracking!

## What is [Kaggle](https://www.kaggle.com/)?

Kaggle is a platform that hosts datasets & data science competitions. Kaggle started off as a small community of data science practitioners looking to hone their skills and meet like-minded folks, not unlike SUSA! :)


Today they host many different types of competitions, with rewards ranging from [prize money from 1k to 1000k](https://www.kaggle.com/c/data-science-bowl-2018) to [jobs](https://www.kaggle.com/jobs), [internships](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries) (e.g. at Two Sigma), and [prestige](https://www.kaggle.com/c/imagenet-object-detection-challenge). The best competitors get special badges that give their comments, contributions, and teams special status in the community. You can build your reputation by posting helpful notebooks (called `kernels` by Kaggle), like this one, on a Kaggle dataset.

There's lots more to Kaggle. If you're interested in some examples of getting started with Kaggle competitions, or in how it works, see our [Kaggle Workshop slides!](https://github.com/ngundotra/SUSA_Kaggle_Workshop)

### Kaggle Competition: Housing Prices

In this project, we will be working with the [**Housing Prices**](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/) competition on the Kaggle website. The competition deals with accurately predicting the sales prices of houses in Ames, Iowa. More information on the dataset and competition will follow shortly.

### Accessing the `House Prices` Dataset

While you can access these datasets online at the [Kaggle webpage for this competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data), we have already downloaded the data for you and placed it in the `crash-course/Kaggle/DATA/house-prices/` directory. 

## Overview of the CX Capstone Project

So where does Kaggle fit into your CX education?

## Logistics

This project is going to be tough and long, but we will be there with you through the whole thing. Additionally, you won't be working alone - we have used your input to assign you into teams of 3 - 4 members. These will be your team members for the rest of the semester, so get to know each other a little!

The project will last the remaining 4 weeks of Career Ex, and one hour per each weekly CX meeting will be devoted to working on it with your team (The other hour will be devoted to guest speakers). However, that only leaves 4 hours of in-class time to work on the project. In order to complete it successfully, it is highly likely that you and your team will have to do some work outside of the weekly CX meetings. To help you stay accountable for yourself, we will be providing you with space and times to meet with SUSA Mentors to help.


### Teams 

You can find the breakdown of the teams with this link:
https://docs.google.com/document/d/1twOPXsil1ZyQ1_Rt770ubUbmdQto36OYNNhhBODukPI/edit?usp=sharing

I would start by getting the Slack handle or phone number (or other contact info) of all the other members in your group so that you can easily communicate outside of the weekly meetings. It's up to you guys to determine how you are going to meet. We highly encourage all work on the project to be done together, in-person.

### Mandatory Office Hours

Because this is such a large project, you and your team will surely have to work on it outside of meetings. To encourage you to seek help on this project, we are making it **mandatory** for you and your group to attend **two (2)** SUSA office hours over the next 4 weeks. This will allow questions to be answered outside of the regular meetings and will help promote collaboration with more experienced SUSA members.

The schedule of SUSA office hours is below:
https://susa.berkeley.edu/calendar#officehours-table

We understand that most of you will end up going to Arun or Patrick's office hours, but we highly encourage you to go to other people's office hours as well. There are many qualified SUSA mentors who can help and this could be an opportunity for you to meet them.

### SUSA Datathon

One other time that we are hoping you will work on your Kaggle projects is during the SUSA Datathon. The time and place of the event are still tentative, but it will likely be 4-8PM on Sunday 4/15. At the Datathon, members from many SUSA committees, notably Data Consulting, Research & Publication, and Career Exploration, will all meet up and work on their respective projects together. There will be many other SUSA members there to help you and it should be a great environment for you and your group to work. This is also another great opportunity to meet other experienced SUSA members and get a taste for other committees in the club. More details about this event will be released later.

### Git

Given that this is a collaborative project, you'll need to work with your team members on the same codebase simultaneously! This is fortunately simple with Git, which you learned in your very first workshop. Visit `py0` if you need a refresher, but we will be going over the steps for collaborative work here too.

1. First, decide on which one of you will be hosting the forked GitHub repository for `crash-course`. Ideally, this would be someone with some GitHub experience and a GitHub account. If no one on your team has a GitHub account, one of you should sign up for one. For our examples, we will call this person's account name `rprincess`.

2. Next, have the above person navigate to the [SUSA crash-course repository](https://github.com/SUSA-org/crash-course) and click the `Fork` button. GitHub will make a copy of the crash-course repository in your team member's account. 

3. Each one of you can download the `rprincess` repository to your local computer with the following command: `git clone https://github.com/rprincess/crash-course.git`. 

4. Feel free to work within the repository and use `git pull origin` and `git add -A && git commit -am "My example message here" && git push` to pull/push your local repository to the `rprincess` online repository.

If you have any questions, just Slack Lucas, Noah, Patrick, or Arun and we can help you with your Git workflow.

<a id='section2'></a>
# The Data Science Workflow

In general, there are a few key steps to begin working with a dataset. 

First, we need to **understand what the dataset actually is about**, and what we are trying to do with it. Key to this stage is understanding what each row of the dataset represents, as well as what each column indicates. You can read more about this dataset by looking at the dataset itself, or reading the data dictionary in `crash-course/Kaggle/DATA/house-prices/data_description.txt`.

Second, before we can even do exploratory data analysis, we need to **clean our dataset**. It is very helpful to identify the size of the dataset, so we know how many samples we have. We need to determine a consistent method of dealing with missing values, such as setting them to a value, removing the feature entirely, interpolating values, etc. Other crucial steps are separating into training and validation, as well as creating elementary data plots.

Third, we have **exploratory data analysis (EDA)**. The point of this phase is to inspect & visualize key relationships, trends, outliers, and issues with our data. By conducting EDA, we get a better sense of the underlying structure of our data, and which features are most important. Especially since we have 81 features, we would like to select the features that are most important to our analysis before we begin modeling `SalePrice`. Once we have an idea about which features to use and how they relate to one another, our modeling stage will be more informed and robust. 

Fourth, we have the **modeling phase**, consisting of **model selection** and **model training**. Here, we select a predictive model to train on our features, and then actually train the model! In this first week, we will be using linear regression as a first-pass. In later weeks of the SUSA CX Kaggle Capstone project, we will be using more advanced models like random forest and neural networks. Depending on the model we are using, it is important to verify the model's assumptions before fully pledging to that model. Additionally, we may use **validation** in this stage to select certain hyperparameters for our model selection.

Finally, we have the **model evaluation phase**. Here, we compute a metric for our model's performance, usually by summing the squared errors of the model's predictions on the test set. This stage allows us to effectively compare various data cleaning and model selection decisions, by giving us a single comparable value for performance across our potential models.
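As a minimal sketch of this last step, the snippet below computes a sum of squared errors for two candidate models. The helper name `sum_squared_error` and all of the prices are made up for illustration; they do not come from the competition data.

```python
import numpy as np

def sum_squared_error(y_true, y_pred):
    """Hypothetical helper: sum of squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_true - y_pred) ** 2))

# Made-up sale prices and the predictions of two hypothetical models
actual = [200000, 150000, 300000]
model_a = [210000, 140000, 290000]   # off by 10,000 on every house
model_b = [250000, 150000, 300000]   # perfect twice, off by 50,000 once

sse_a = sum_squared_error(actual, model_a)
sse_b = sum_squared_error(actual, model_b)

# The model with the smaller value fits these houses better under this metric
print("Model A:", sse_a)
print("Model B:", sse_b)
```

Because the errors are squared, one large miss can outweigh several small ones, which is why model B scores worse here despite being exactly right on two of the three houses.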

<a id='i'></a>
## I. Understanding our Dataset 
Our dataset is about houses in Iowa! According to the Kaggle webpage, the competition is as follows:

> Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

More explicitly, our dataset has 81 columns, or features: 
1. `SalePrice`, our response variable $Y$ that we are trying to predict accurately and precisely 
2. `Id`, a simple identification variable
3. 79 explanatory variables $X_k$ that we can use to predict `SalePrice`. Some of the variables are categorical, and others are continuous quantitative. 

The goal of the next four weeks is to create a model that trains on (some of) the 79 explanators from the training set to predict `SalePrice` well in the test set. 

How will we know which explanators to use? We can start with some intuition and research into what each column represents by reading the data dictionary in `crash-course/Kaggle/DATA/house-prices/data_description.txt`. 

> Please take a moment to read over this dictionary, as you will need to have a keen sense of these features for the weeks ahead.

Now that we've talked a bit about this dataset, let's actually take a look at it. The first step is to load in the data! We will store this in a `pandas` dataframe.

In [1]:
import pandas as pd
train = pd.read_csv('DATA/house-prices/train.csv')
test = pd.read_csv('DATA/house-prices/test.csv')

#Let's see what the training dataframe looks like!
train.head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,...,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,307000
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,...,0,,,Shed,350,11,2009,WD,Normal,200000
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2008,WD,Abnorml,129900
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,...,0,,,,0,1,2008,WD,Normal,118000


## Questions for Understanding:
> 1. How many features (columns) do we have? How many entries (rows)?
> 2. What does a single row represent in our dataset?
> 3. List 3 issues/questions that you see from the dataset. Some ideas to get started: What does `LotShape` represent? Is it a good thing that we have so many features? How does a large number of features affect our modeling approach?
> 4. What are some important things we should do for data cleaning and exploration? Some ideas to get started: are all the values in `LotFrontage` numeric? What about `Alley`? How should we fix missing/`NA` values that appear sporadically in some columns? What about columns that are almost entirely full of `NA` values? Some columns are qualitative strings, whereas others are qualitative numerics - how might this affect our cleaning?

<a id='ii'></a>
# II. Data Cleaning
First, let's see how big our dataset is.

In [2]:
print("Training size",train.shape)
print("Test size",test.shape)

Training size (1460, 81)
Test size (1459, 80)


This tells us that we have $1460$ datapoints in the training set, and we would like to predict on $1459$ samples. There are $80$ total features, and one response variable we would like to predict. However, not all these 'features' are actually useful, such as `Id`. Thus we need to understand what the actual variables mean. A huge part of data science is actually understanding the variables. Of course it is possible to throw the data into some machine learning model and have it spit out predictions, but without actually understanding what data you are dealing with and how to feed the data into the model, your model is worthless.
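One quick way to see where the extra training column comes from, and to get a first rough split between numeric and non-numeric columns, is sketched below. The two small dataframes are made-up stand-ins for `train` and `test`, so the snippet runs without the data files.

```python
import pandas as pd

# Toy stand-ins for the real train/test dataframes (values are made up)
toy_train = pd.DataFrame({
    "Id": [1, 2],
    "LotArea": [8450, 9600],
    "Street": ["Pave", "Pave"],
    "SalePrice": [208500, 181500],
})
toy_test = pd.DataFrame({
    "Id": [3],
    "LotArea": [11250],
    "Street": ["Pave"],
})

# Columns that appear in the training data but not in the test data
only_in_train = sorted(set(toy_train.columns) - set(toy_test.columns))
print(only_in_train)

# A first guess at which columns are numeric and which are not
numeric_cols = list(toy_train.select_dtypes(include="number").columns)
other_cols = list(toy_train.select_dtypes(exclude="number").columns)
print(numeric_cols)
print(other_cols)
```

Keep in mind that a numeric dtype does not guarantee a quantitative variable: a column such as `MSSubClass` is stored as numbers but encodes categories, so the data dictionary still has the final word.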

Spend a good deal of time reading over the data dictionary. It is located in  
`crash-course/Kaggle/DATA/house-prices/data_description.txt`

## Questions for Understanding:
> 1. There are many categorical variables. What are some possibilities to deal with these variables? (Hint: [one hot encoding](https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding))
> 2. Are there any categorical variables that we can convert to numerical/quantitative variables as well? How might we do that?
> 3. Are there any variables that are just irrelevant and we can ignore?

One helpful method is looking at the unique values of a feature.

In [3]:
#We can view the unique values of a given feature
train['Foundation'].unique()

array(['PConc', 'CBlock', 'BrkTil', 'Wood', 'Slab', 'Stone'], dtype=object)

## Dealing with NA values
In almost all datasets, we will have NA values. These can be a pain to deal with, as there are many viable choices of what to do. First, it is good to see what columns have NA values.

In [4]:
#Sum the number of NA's in each column
train.isnull().sum()

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
                 ... 
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         0
TotRmsAbvGrd        0
Functional          0
Fireplaces          0
FireplaceQu       690
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageCars          0
GarageArea          0
GarageQual         81
GarageCond         81
PavedDrive

It seems that many of the features have a great deal of NAs. However, this is not necessarily the case. The astute reader will notice that some variables, like `Alley`, `PoolQC`, and `Fence`, have NA as an actual value. For example, for `PoolQC` (pool quality), we have these possibilities.

PoolQC: Pool quality
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       NA	No Pool
So an NA here is not the usual "not available"; it can actually be a legitimate value! We need to parse through the data dictionary to see when NAs are legitimate values, and when they actually mean missing data.
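Part of the confusion comes from how `pandas` reads the file: by default, `read_csv` treats the literal string `NA` as a missing value. The sketch below uses a tiny made-up CSV (not the real data) to show the default behaviour next to `keep_default_na=False`, which keeps `NA` as an ordinary string.

```python
import io
import pandas as pd

# A tiny made-up CSV in which "NA" means "No Pool"
csv_text = "Id,PoolQC\n1,NA\n2,Ex\n3,NA\n"

# Default behaviour: the string "NA" is parsed as a missing value (NaN)
default_df = pd.read_csv(io.StringIO(csv_text))
n_missing_default = int(default_df["PoolQC"].isnull().sum())

# With keep_default_na=False, "NA" is kept as a regular string
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
n_missing_raw = int(raw_df["PoolQC"].isnull().sum())

print(n_missing_default, n_missing_raw)
print(raw_df["PoolQC"].tolist())
```

Note that `keep_default_na=False` applies to every column, so any column where `NA` really does mean missing data would then need separate handling. This is only one possible approach; the rest of this notebook keeps the default parsing and deals with the NAs afterwards.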

## Questions for Understanding:
> 1. Go through the data dictionary and find all features that have NA as legitimate values.
> 2. How can we refactor the variables in question 1 appropriately?


The features that we found with NA as a legitimate value are:
`Alley`, `BsmtQual`, `BsmtCond`, `BsmtExposure`, `BsmtFinType1`, `BsmtFinType2`, `FireplaceQu`, `GarageType`, `GarageFinish`, `GarageQual`, `GarageCond`, `PoolQC`, `Fence`, `MiscFeature`.

`Alley`: NA for no alley access  
`BsmtQual`, `BsmtCond`, `BsmtExposure`, `BsmtFinType1`, `BsmtFinType2`: NA for no basement  
`FireplaceQu`: NA for no fireplace  
`GarageType`, `GarageFinish`, `GarageQual`, `GarageCond`: NA for no garage  
`PoolQC`: NA for no pool  
`Fence`: NA for no fence  
`MiscFeature`: NA for no other miscellaneous features (i.e. elevator, 2nd garage, shed, tennis court, other) 
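One possible way to refactor such columns is to replace the NaN with an explicit label, while leaving columns where NaN means truly missing untouched. The sketch below uses a made-up toy dataframe, and the label `"None"` is an arbitrary choice for illustration (the helper functions later in this notebook use their own labels internally).

```python
import numpy as np
import pandas as pd

# Toy dataframe with made-up rows
toy = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan],     # NaN stands for "no alley access"
    "PoolQC": [np.nan, np.nan, "Ex"],      # NaN stands for "no pool"
    "LotFrontage": [65.0, np.nan, 80.0],   # NaN here really is missing data
})

# Two of the columns listed above where NA is a legitimate value
na_is_a_value = ["Alley", "PoolQC"]

# Work on a copy so the original dataframe is left unchanged
relabelled = toy.copy()
relabelled[na_is_a_value] = relabelled[na_is_a_value].fillna("None")

# Only the truly missing value is still counted as NA
remaining = relabelled.isnull().sum()
print(remaining)
```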

## Data Cleaning Functions
We provide some functions to help with data cleaning and preprocessing. One creates one hot encodings of various features, and the other converts a categorical feature into numerical values.

### One Hot Encoding for Unordered Categorical Variables
Suppose we have a feature `committee` for a data frame of SUSA members. Let's say for the sake of simplicity there are four committees, `CX`, `DC`, `RP`, and `WD`. There is no inherent ordering to these values, but each member is only in one committee. Thus we can replace this categorical variable with four boolean variables, one for each committee. If a member is a part of `CX`, then they will have a $1$ for `CX` and $0$ for all other variables. We have written this function for you in `oneHotFeature`.
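To make the committee example concrete, here is a small sketch built on `pd.get_dummies`. The member names are invented, and the snippet is a standalone illustration rather than a replacement for `oneHotFeature`.

```python
import pandas as pd

# Made-up SUSA members, each in exactly one committee
members = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Di"],
    "committee": ["CX", "DC", "RP", "WD"],
})

# One indicator column per committee; astype(int) gives 0/1 instead of False/True
onehot = pd.get_dummies(members["committee"], prefix="committee").astype(int)

# Swap the original categorical column for the indicator columns
encoded = members.drop("committee", axis=1).join(onehot)
print(encoded)
```

Every row has exactly one $1$ across the four new columns, which is where the name "one hot" comes from.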

### Ordered Categorical Variables
Suppose we have a feature `attendance` for a data frame of SUSA members. Let's say there are $3$ possible values, `Good`, `Ok`, `Poor`. These inherently have an ordering: `Good` is better than `Ok`, which is better than `Poor`. We can assign a numerical value to these, such as $0$ for `Poor`, $1$ for `Ok`, and $2$ for `Good`. We have written this function for you in `categoricalToQuantitative`.
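The attendance example can be sketched in a few lines with `Series.map`. The attendance values below are made up, and this is only an illustration of the idea, not the implementation of `categoricalToQuantitative`.

```python
import pandas as pd

# Made-up attendance records
attendance = pd.Series(["Good", "Poor", "Ok", "Good"])

# Larger numbers mean better attendance
order = {"Poor": 0, "Ok": 1, "Good": 2}

attendance_score = attendance.map(order)
print(attendance_score.tolist())
```

One thing to watch for with `map`: any value that is missing from the dictionary silently becomes NaN, which is one reason a helper that checks the mapping against the column's unique values can be useful.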


Please read over the following functions and read the comments carefully! They describe in detail what the functions do, and are **very** crucial to the data cleaning process.

In [5]:
# Data Cleaning Functions

def oneHotFeature(df, features, withNA=True):
    """
    This function is for unordered categorical features 

    This function takes in input as: 
    Data frame 'df'
    List of features to one hot encode 'features'
    Boolean for how to deal with NA's 'withNA'

    This function creates a set of output features from a categorical feature
    The output features are one hot encodings with names featureName_{value}
    Returns a new data frame (does not modify original dataframe) with appended features
    Usually the NA values are also considered
    Will add a column featureName_none one hot encoding of NA values
    If the boolean withNA is false, it will not consider NA values

    If a feature is not found, it will ignore that feature and attempt to one hot the other features

    Code from Numpy and Pandas SUSA guide
    crash-course/Python/Numpy and Pandas.ipynb
    """
    # Copy over data
    newDf = df.copy()
    for feature in features:
        try:
            if withNA:
                newDf[feature] = newDf[feature].fillna('none')
            col_onehot = pd.get_dummies(newDf[feature], prefix=feature) 
            newDf = newDf.drop(feature, axis=1)
            newDf = newDf.join(col_onehot)
        except KeyError:
            print("No such feature", feature, "found in the dataframe when trying to one-hot encode")
    return newDf



def categoricalToQuantitative(df,feature, mapping,assumeInOrder = False):
    """
    This function is for ordered categorical features 

    This function takes in input as: 
    Data frame 'df'
    A single categorical feature to be mapped 'feature'
    A mapping dictionary 'mapping'
    Boolean 'assumeInOrder' for treating 'mapping' as an ordered list instead

    This function takes a dataframe and categorical feature, and maps the categorical values
    using the dictionary mapping
    Returns a new data frame (does not modify original dataframe) with modified values for the feature column
    For mappings with NA values, use 'NA' in the dictionary, this function properly deals with them

    If the mapping is just from 0 to n-1 for n values
    Then set assumeInOrder to True, and mapping is instead a list of ordering from worst to best

    If a feature is not found, the function prints a message and returns an unmodified copy
    """
    newDf = df.copy()
    try:
        currFeature = df[feature]
    except KeyError:
        print("No such feature", feature, "found in the dataframe when trying to map")
        return newDf
    # Check if input mapping and all unique values for feature are equal
    newDf[feature] = newDf[feature].fillna('nan')
    if assumeInOrder:
        keys = mapping
    else:
        keys = mapping.keys()
    uniqueSet = set(newDf[feature].unique())
    keySet = set(keys)
    if 'NA' in keySet:
        keySet.add('nan')
        keySet.remove('NA')
    diff = uniqueSet.difference(keySet)
    diff2 = keySet.difference(uniqueSet)
    if len(diff) != 0:
        print("Missing value(s)",diff,"in mapping",feature,"unable to map all values")
        return newDf
    if len(diff2) != 0:
        print("Warning: no such values",diff2,"in feature",feature,"(they may not appear)")
    # Create mapping
    if assumeInOrder:
        newMapping ={}
        for i in range(len(mapping)):
            if mapping[i] == 'NA':
                newMapping['nan'] = i 
            else:
                newMapping[mapping[i]] = i
    else:    
        newMapping = mapping.copy()
        if 'NA' in newMapping.keys():
            newMapping['nan'] = newMapping['NA']
    newDf[feature] = newDf[feature].apply(lambda feat: newMapping[feat] if feat in newMapping.keys() else feat)
    return newDf

Let's use the first function, `oneHotFeature`. Consider the `Foundation` feature. There are $6$ possible values: Brick & Tile, Cinder Block, Poured Concrete, Slab, Stone, and Wood. There isn't an inherent ordering to these, where one is better than another. The best way to preprocess this feature is to one hot encode it. We remove the `Foundation` feature and replace it with $6$ separate boolean features, one for each of the possible values.

The dataframe is shown below. Keep in mind that the original dataframe, `train`, is not modified in this function call.

In [6]:
oneHotFeature(train,['Foundation']).head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,YrSold,SaleType,SaleCondition,SalePrice,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,2008,WD,Normal,208500,0,0,1,0,0,0
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,2007,WD,Normal,181500,0,1,0,0,0,0
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,2008,WD,Normal,223500,0,0,1,0,0,0
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,2006,WD,Abnorml,140000,1,0,0,0,0,0
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,2008,WD,Normal,250000,0,0,1,0,0,0
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,...,2009,WD,Normal,143000,0,0,0,0,0,1
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,...,2007,WD,Normal,307000,0,0,1,0,0,0
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,...,2009,WD,Normal,200000,0,1,0,0,0,0
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,...,2008,WD,Abnorml,129900,1,0,0,0,0,0
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,...,2008,WD,Normal,118000,1,0,0,0,0,0


In [7]:
# Train is not modified
train['Foundation'].head(10)

0     PConc
1    CBlock
2     PConc
3    BrkTil
4     PConc
5      Wood
6     PConc
7    CBlock
8    BrkTil
9    BrkTil
Name: Foundation, dtype: object

Now we may try the other function, `categoricalToQuantitative`, which maps categorical data that has a natural quantitative ordering. Consider the feature `FireplaceQu`. There are $6$ possible values, `Ex`, `Gd`, `TA`, `Fa`, `Po`, `NA`, corresponding to Excellent, Good, Average, Fair, Poor, and No fireplace. These could be mapped to the values from $0$ to $5$, in order of quality. Thus we can make a dictionary of values, where `Ex` corresponds to $5$, `Gd` corresponds to $4$, ..., `NA` corresponds to zero. This is written out below.

In [8]:
mapping = {}
mapping['NA'] = 0
mapping['Po'] = 1
mapping['Fa'] = 2
mapping['TA'] = 3
mapping['Gd'] = 4
mapping['Ex'] = 5

print(train['FireplaceQu'].head(10))
print(categoricalToQuantitative(train,'FireplaceQu',mapping)['FireplaceQu'].head(10))

0    NaN
1     TA
2     TA
3     Gd
4     TA
5    NaN
6     Gd
7     TA
8     TA
9     TA
Name: FireplaceQu, dtype: object
0    0
1    3
2    3
3    4
4    3
5    0
6    4
7    3
8    3
9    3
Name: FireplaceQu, dtype: int64


However, it is a bit annoying to type out the entire dictionary like that if we would like to order from $0$ to $5$. Another parameter in `categoricalToQuantitative` is `assumeInOrder`. If `assumeInOrder` is true, then `mapping` is instead a list of values rather than a dictionary, and we can just pass in the order of values from worst to best. This maps from $0$ to $n-1$ when there are $n$ possible values. This is more convenient, but it may be better at times to be able to customize how you want to map the values.

In [9]:
mapping = ['NA','Po','Fa','TA','Gd','Ex']
#Notice these mappings are the same
print(categoricalToQuantitative(train,'FireplaceQu',mapping,assumeInOrder=True)['FireplaceQu'].head(10))

0    0
1    3
2    3
3    4
4    3
5    0
6    4
7    3
8    3
9    3
Name: FireplaceQu, dtype: int64
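The in-order shortcut amounts to numbering the list positions. As a small standalone sketch (plain Python, independent of the helper function), the same dictionary can be built from the ordered list with `enumerate`:

```python
# Ordered from worst to best, as in the list-based call above
order = ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex']

# Position in the list becomes the numeric value
mapping_from_list = {value: rank for rank, value in enumerate(order)}

# The dictionary typed out by hand earlier
mapping_by_hand = {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

print(mapping_from_list == mapping_by_hand)
```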


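If you would like to remove a feature, use the pattern below, replacing `dataframe` and `feature` with your own dataframe and column name. Run exactly as written, the placeholder names raise a `NameError`.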

In [10]:
dataframe.drop(feature, axis=1)

NameError: name 'dataframe' is not defined
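Here is the same pattern on a concrete, made-up dataframe. It also shows that `drop` returns a new dataframe rather than changing the original, so the result needs to be assigned to a name if you want to keep it.

```python
import pandas as pd

# Toy dataframe with made-up values
toy = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600]})

# drop returns a new dataframe without the 'Id' column
without_id = toy.drop("Id", axis=1)

cols_before = list(toy.columns)         # the original still has 'Id'
cols_after = list(without_id.columns)   # the result does not

print(cols_before)
print(cols_after)
```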

Use the above two functions to clean the dataset! This may take a while, but doing a good job should take a while. Decide which variables are not worth keeping, which categorical features need to be changed, and how they should be changed. Consider how to deal with NA values. We recommend saving the final cleaned file as a csv so that you can easily reopen and share it, and keeping all of your cleaning commands together neatly in the code block below.

Some recommendations:
> 1. Convert all features into quantitative values
> 2. While cleaning, keep in mind some features that you feel are very helpful.
> 3. Remove features that do not seem important.

In [11]:
# DATA CLEANING
# To start off data cleaning
clean = train.copy()

clean = categoricalToQuantitative(clean,'FireplaceQu',['NA','Po','Fa','TA','Gd','Ex'],assumeInOrder=True)
clean = oneHotFeature(clean,['Foundation'])

# DATA CLEANING SOLUTION

# ====================== DO NOT INCLUDE IN FINAL WORKBOOK ========================
# ================================================================================
unorderedFeatures = ['MSZoning','Street','Alley','LandContour','LotConfig','Neighborhood',
                     'BldgType','HouseStyle','RoofStyle','RoofMatl','MasVnrType','Heating','Electrical',
                     'GarageType','GarageFinish','MiscFeature','SaleType','SaleCondition']

unimportantFeatures = ['LotConfig','BsmtUnfSF','3SsnPorch','Id']
for feat in unimportantFeatures:
    clean = clean.drop(feat, axis=1)

clean = oneHotFeature(clean,unorderedFeatures)
clean = categoricalToQuantitative(clean,'LotShape',['IR3','IR2','IR1','Reg'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'Utilities',['ELO','NoSeWa','NoSewr','AllPub'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'LandSlope',['Sev','Mod','Gtl'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'ExterQual',['Po','Fa','TA','Gd','Ex'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'ExterCond',['Po','Fa','TA','Gd','Ex'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'BsmtQual',['NA','Po','Fa','TA','Gd','Ex'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'BsmtCond',['NA','Po','Fa','TA','Gd','Ex'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'BsmtExposure',['NA','No','Mn','Av','Gd'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'BsmtFinType1',['NA','Unf','LwQ','Rec','BLQ','ALQ','GLQ'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'BsmtFinType2',['NA','Unf','LwQ','Rec','BLQ','ALQ','GLQ'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'HeatingQC',['Po','Fa','TA','Gd','Ex'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'CentralAir',['N','Y'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'KitchenQual',['Po','Fa','TA','Gd','Ex'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'Functional',['Sal','Sev','Maj2','Maj1','Mod','Min2','Min1','Typ'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'GarageQual',['NA','Po','Fa','TA','Gd','Ex'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'GarageCond',['NA','Po','Fa','TA','Gd','Ex'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'PavedDrive',['N','P','Y'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'PoolQC',['NA','Fa','TA','Gd','Ex'],assumeInOrder=True)
clean = categoricalToQuantitative(clean,'Fence',['NA','MnWw','GdWo','MnPrv','GdPrv'],assumeInOrder=True)

# For dealing with
# `Condition1, `Condition2`
#'Exterior1st','Exterior2nd'

def kHotFeature(df, features,prefix, withNA=True):
    #Copy over data
    newDf = df.copy()
    allHot = []
    for feature in features:
        try:
            if withNA:
                newDf[feature] = newDf[feature].fillna('none')
            col_onehot = pd.get_dummies(newDf[feature]) 
            allHot.append(col_onehot)
            newDf = newDf.drop(feature, axis=1)
        except KeyError:
            print("No such feature", feature, "found in the dataframe when trying to k-hot encode")
            return newDf
          
    #Get all unique names
    allNames = set()
    for hotTable in allHot:
        allNames = allNames.union(set(hotTable.columns.values))
    for col in allNames:
        for hotTable in allHot:
            if col in hotTable.keys():
                hotTable[col] = hotTable[col].apply(lambda val: bool(val))

    for col in allNames:
        validCols = pd.DataFrame()
        for i in range(len(allHot)):
            hotTable = allHot[i]
            if col in hotTable.keys():
                validCols[str(i)] = hotTable[col]
        newDf[prefix+col] = validCols.any(axis=1)*1
    return newDf

clean = kHotFeature(clean,['Condition1','Condition2'],prefix="Condition")
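# NOTE: this call reuses the "Condition" prefix, which is why the Exterior columns
# show up as e.g. `ConditionVinylSd` in the output below; prefix="Exterior" would be clearer.
clean = kHotFeature(clean,['Exterior1st','Exterior2nd'],prefix="Condition")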



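# Drop numeric columns whose total is below 15 (in practice, rarely-set 0/1 indicator columns).
# Iterating over the summed Series keeps each name paired with its own total,
# even if pandas skips non-numeric columns when summing.
lowOccurenceFeatures = [name for name, val in clean.sum(numeric_only=True).items() if val < 15]
clean = clean.drop(lowOccurenceFeatures, axis=1)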
    
# ================================================================================
# ================================================================================

clean.head(10)

No such feature LotConfig found in the dataframe when trying to one-hot encode


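| | MSSubClass | LotFrontage | LotArea | LotShape | Utilities | LandSlope | OverallQual | OverallCond | YearBuilt | YearRemodAdd | ... | ConditionCemntBd | ConditionVinylSd | ConditionAsbShng | ConditionWd Sdng | ConditionBrkFace | ConditionMetalSd | ConditionWd Shng | ConditionWdShing | ConditionHdBoard | ConditionPlywood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 65.0 | 8450 | 3 | 3 | 2 | 7 | 5 | 2003 | 2003 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 20 | 80.0 | 9600 | 3 | 3 | 2 | 6 | 8 | 1976 | 1976 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 60 | 68.0 | 11250 | 2 | 3 | 2 | 7 | 5 | 2001 | 2002 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 70 | 60.0 | 9550 | 2 | 3 | 2 | 7 | 5 | 1915 | 1970 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 60 | 84.0 | 14260 | 2 | 3 | 2 | 8 | 5 | 2000 | 2000 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 50 | 85.0 | 14115 | 2 | 3 | 2 | 5 | 5 | 1993 | 1995 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 20 | 75.0 | 10084 | 3 | 3 | 2 | 8 | 5 | 2004 | 2005 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 60 | NaN | 10382 | 2 | 3 | 2 | 7 | 6 | 1973 | 1973 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 8 | 50 | 51.0 | 6120 | 3 | 3 | 2 | 7 | 5 | 1931 | 1950 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 9 | 190 | 50.0 | 7420 | 3 | 3 | 2 | 5 | 6 | 1939 | 1950 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |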
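
To make the k-hot idea behind `kHotFeature` concrete, here is a minimal sketch on a made-up three-row dataframe. The toy values and the names `toy`, `hot1`, `hot2`, and `k_hot` are assumptions for illustration only, and this is a simplified variant rather than the function used above: each column is one-hot encoded separately, then the two tables are merged so a category is flagged if either column contains it.

```python
import pandas as pd

# Hypothetical toy data: two columns drawing from the same set of categories
toy = pd.DataFrame({
    "Condition1": ["Norm", "Feedr", "Artery"],
    "Condition2": ["Norm", "Norm", "Feedr"],
})

# One-hot encode each column on its own
hot1 = pd.get_dummies(toy["Condition1"]).astype(int)
hot2 = pd.get_dummies(toy["Condition2"]).astype(int)

# Add the tables (categories missing from one side count as 0), then cap at 1
# so a category that appears in both columns is still a single 0/1 flag
k_hot = hot1.add(hot2, fill_value=0).clip(upper=1).astype(int).add_prefix("Condition")
print(k_hot)
```

Each row can now have more than one flag set, which is what distinguishes k-hot from ordinary one-hot encoding.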


To save the cleaned dataframe as a CSV file, run this line of code.

In [12]:
# Save to csv
clean.to_csv('DATA/house-prices/train_cleaned.csv')
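
One thing to be aware of when saving: by default `to_csv` also writes the row index, and reading the file back with `pd.read_csv` then produces an extra column named `Unnamed: 0`. The sketch below shows the difference on a made-up two-row dataframe written to a temporary directory; the file name `toy_cleaned.csv` and the variable names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the cleaned dataframe
toy = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_cleaned.csv")

    # Default behavior: the index is written as an unnamed first column
    toy.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False skips the index, so the columns round-trip unchanged
    toy.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Either choice works as long as you load the file consistently, for example with `pd.read_csv(path, index_col=0)` when the index was saved.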

<a id='iii'></a>
# III. Exploratory Data Analysis

Now that we've cleaned our dataset, it's time to explore it.

## Data Visualization


<a id='iiii'></a>
# IV. Modeling

An important part of any good statistician's toolkit.

> ***What I cannot create, I cannot understand.***
>
>  \- Richard Feynman (Honorary Statistician)

## Feature Selection

Modeling begins with a guess. To create a model for data, you must guess the relevant features necessary to recreate the data distribution. Even among expert statisticians, this is regarded as a hard skill, bordering on art. Oftentimes, industry experts will provide insight into which features are the right ones to select, and statisticians (like ourselves) will have to create a model from those features.

There are lots of mathematical principles to guide your feature selection. So we'll begin with a theorem.
> **Theorem 1.1**
>
> -- just kidding we're not that evil

But seriously, there are plenty of principles for doing this correctly. For now, do whatever feels intuitive. Explore. Create. Be inefficient. Only through walking can you learn to run.

## A First Approach to Machine Learning: Linear Regression

Linear regression is the most important tool in a modeler's toolkit. It's the basis from which almost all other modeling techniques arise. Essentially, it's a way to model one variable as a weighted combination of other variables in the data.

$$ \hat{Y} = aA + bB +cC + \ldots $$
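where $a, b, c$ are (scalar) weight values, $A, B, C$ are features in the dataset, and $\hat{Y}$ is our modeled variable. (By default, scikit-learn's `LinearRegression` also fits a constant intercept term, which we will call the bias later on.)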
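
As a quick numerical illustration of the formula, the weighted combination is just a matrix-vector product. The weights and feature values below are made up for the example, and the names `weights`, `features`, and `y_hat` are hypothetical.

```python
import numpy as np

# Hypothetical weights a, b, c
weights = np.array([2.0, -1.0, 0.5])

# Two hypothetical rows of data, each with feature values A, B, C
features = np.array([
    [1.0, 2.0, 4.0],
    [3.0, 0.0, 2.0],
])

# Y_hat = a*A + b*B + c*C, computed for every row at once
y_hat = features @ weights
print(y_hat)  # [2. 7.]
```

Fitting a linear regression means choosing the weights so that `y_hat` lands as close as possible to the true values.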

Mathematicians have derivatives, functions, and domains.

Statisticians have linear regression, data, and raw untamed IQ.

Okay so the gameplan is to give an example of how to **a) select features** and how to **b) create a model**. Specifically, we ask you to model the `SalePrice` column of the dataframe from any combination of features you choose.

For our linear model, we will use the `LinearRegression` model from scikit-learn. The docs are provided [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

#### [Meta Tips]
A good way to approach something you are not familiar with is to ask yourself a series of questions. Here are my steps for this one:
1. What is a linear model?
2. What inputs does it take? What output does it give?
3. How do I get inputs from the Pandas dataframe into a format that works with the Linear Regression model?
4. Did I run the model correctly?
5. How can I tell?
6. What are other inputs I can try?

... etc
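
One way to answer questions 2 through 5 is to try the model on a tiny dataset where you already know the right answer. The sketch below uses a made-up dataframe in which `y` is exactly `3*a + 2*b + 1`, so a correctly run model should recover those numbers; the column names and the variables `toy`, `X_toy`, `y_toy`, and `toy_model` are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data where y = 3*a + 2*b + 1 exactly
toy = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 0.0, 2.0, 1.0]})
toy["y"] = 3 * toy["a"] + 2 * toy["b"] + 1

# LinearRegression expects a 2-D input of shape (n_rows, n_features)
# and a target of shape (n_rows,)
X_toy = toy[["a", "b"]].values
y_toy = toy["y"].values
print(X_toy.shape, y_toy.shape)

# Fit, then check that the learned weights and intercept match what we built in
toy_model = LinearRegression().fit(X_toy, y_toy)
print(toy_model.coef_, toy_model.intercept_)
```

If the printed weights are close to 3 and 2 and the intercept is close to 1, you know the inputs were shaped correctly and the model ran as intended.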

In [23]:
# Here's what our 'SalePrice' column looks like
clean['SalePrice'].head(5)

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

In [24]:
import numpy as np
# dataframe['column'].values returns the numpy array of that column
# to check the type of each array, you can use `.dtype` on any numpy array

print(clean['LotArea'].values[:5])
print(clean['Utilities'].values[:5])
print(clean['OverallCond'].values[:5])

[ 8450  9600 11250  9550 14260]
[3 3 3 3 3]
[5 8 5 5 5]


In [45]:
clean[clean.columns[10:20]].head(5)

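| | MasVnrArea | ExterQual | ExterCond | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 196.0 | 3 | 2 | 4 | 3 | 1 | 6 | 706 | 1 | 0 |
| 1 | 0.0 | 2 | 2 | 4 | 3 | 4 | 5 | 978 | 1 | 0 |
| 2 | 162.0 | 3 | 2 | 4 | 3 | 2 | 6 | 486 | 1 | 0 |
| 3 | 0.0 | 2 | 2 | 3 | 4 | 1 | 5 | 216 | 1 | 0 |
| 4 | 350.0 | 3 | 2 | 4 | 3 | 3 | 6 | 655 | 1 | 0 |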


In [63]:
clean['MasVnrArea'].unique()

array([1.960e+02, 0.000e+00, 1.620e+02, 3.500e+02, 1.860e+02, 2.400e+02,
       2.860e+02, 3.060e+02, 2.120e+02, 1.800e+02, 3.800e+02, 2.810e+02,
       6.400e+02, 2.000e+02, 2.460e+02, 1.320e+02, 6.500e+02, 1.010e+02,
       4.120e+02, 2.720e+02, 4.560e+02, 1.031e+03, 1.780e+02, 5.730e+02,
       3.440e+02, 2.870e+02, 1.670e+02, 1.115e+03, 4.000e+01, 1.040e+02,
       5.760e+02, 4.430e+02, 4.680e+02, 6.600e+01, 2.200e+01, 2.840e+02,
       7.600e+01, 2.030e+02, 6.800e+01, 1.830e+02, 4.800e+01, 2.800e+01,
       3.360e+02, 6.000e+02, 7.680e+02, 4.800e+02, 2.200e+02, 1.840e+02,
       1.129e+03, 1.160e+02, 1.350e+02, 2.660e+02, 8.500e+01, 3.090e+02,
       1.360e+02, 2.880e+02, 7.000e+01, 3.200e+02, 5.000e+01, 1.200e+02,
       4.360e+02, 2.520e+02, 8.400e+01, 6.640e+02, 2.260e+02, 3.000e+02,
       6.530e+02, 1.120e+02, 4.910e+02, 2.680e+02, 7.480e+02, 9.800e+01,
       2.750e+02, 1.380e+02, 2.050e+02, 2.620e+02, 1.280e+02, 2.600e+02,
       1.530e+02, 6.400e+01, 3.120e+02, 1.600e+01, 

In [74]:
from sklearn.linear_model import LinearRegression # There are lots of other models from this module you can try!

def get_features(data, col_list, y_name):
    """
    Return a numpy feature matrix and target vector from a pandas dataframe.
    This is not a smart function. It might break.

    data(DataFrame): e.g. train, clean
    col_list(list): list of columns to extract data from
    y_name(string): name of the column to treat as the y column

    Rows with a missing value in any selected column are dropped first, so with
    n remaining rows this returns an np.array of shape (n, len(col_list)) and
    one of shape (n,).
    """

    # Select the feature columns plus the target, drop incomplete rows, convert to numpy
    feature_matrix = data[col_list + [y_name]].dropna().values
    return feature_matrix[:, :-1], feature_matrix[:, -1]
    

# Initialize our linear regression model
first_model = LinearRegression()

# X is a matrix of inputs, Y is the variable we are trying to learn
feature_cols = ['LotArea', 'Utilities', 'OverallCond', 'BsmtFinSF1', 'MasVnrArea']
X, Y = get_features(clean, feature_cols, 'SalePrice')

# Fit the model to the data
first_model.fit(X, Y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

<a id='v'></a>
# V. Model Evaluation

Now it's time to actually see how your model performed!

Just mess around with the `.predict` method of your model object. See what it does.

In [75]:
# for example, predict the price of a (nonexistent) house whose features are all 0
prediction = first_model.predict(np.zeros((1,len(feature_cols))))
print("Our prediction for a house with every feature ({}) set to 0 is:\n{:.2f}".format(", ".join(feature_cols), prediction[0]))

Our prediction for a house with every feature (LotArea, Utilities, OverallCond, BsmtFinSF1, MasVnrArea) set to 0 is:
-29399.97


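So our bias (the intercept) is about -29,400: a negative sale price, which is obviously nonsensical. We are extrapolating far outside our dataset here, since no real house has zero lot area, so don't read the intercept as a typical or average price. It is simply the value the model predicts when every feature is 0.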

In [76]:
r2_coeff = first_model.score(X, Y)
bias = first_model.intercept_
print("r^2 coeff: {:.3f}".format(r2_coeff))
print("bias: {:.2f}".format(bias))

r^2 coeff: 0.329
bias: -29399.97
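
For a regression model, `.score` returns the coefficient of determination $R^2$: one minus the ratio of the model's squared error to the squared error of always predicting the mean. A value of 0.329 therefore means our five features explain roughly a third of the variance in `SalePrice`. The sketch below computes $R^2$ by hand on made-up numbers; the arrays and variable names are hypothetical.

```python
import numpy as np

# Hypothetical true values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Squared error of the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Squared error of the baseline that always predicts the mean of y_true
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
print("r^2: {:.3f}".format(r2))  # r^2: 0.900
```

An $R^2$ of 1 means perfect predictions, while 0 means the model does no better than guessing the mean every time.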


To get the actual loss of the model, we'll compute the mean squared error on the training dataset. **NOTE:** we are training and evaluating our model on the same dataset, so this number cannot reveal overfitting and will tend to look better than performance on houses the model has never seen. With only five features, a linear model has little room to overfit, so this is still a decent way to introduce modeling. In the future, ~20% of the dataset should be set aside for evaluating the model. This is to make sure models *generalize* their predictions.

In [77]:
def get_loss(model, data, col_list, true_col_name):
    """Returns L2 loss between Y_hat and true values
    
    model(Model object): model we use to predict values
    data(DataFrame): where we get our data from
    col_list(list): list of column names that our model uses to predict on
    true_col_name(String): name of the column in data we wish to predict
    """
    X, Y_true = get_features(data, col_list, true_col_name)
    Y_hat = model.predict(X)
    return np.mean((Y_true-Y_hat)**2)

loss = get_loss(model=first_model, data=clean, col_list=feature_cols, true_col_name='SalePrice')
print("Mean Squared Error loss of our model: {:.2f}".format(loss))

Mean Squared Error loss of our model: 4215215723.11
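
Two follow-ups on this number. First, taking the square root gives the root mean squared error, which is in the same units as `SalePrice`: the MSE above corresponds to an RMSE of roughly 65,000. Second, here is a minimal sketch of the ~20% holdout idea mentioned earlier, using scikit-learn's `train_test_split`. The data is synthetic and the names `X_toy`, `Y_toy`, and `holdout_model` are assumptions; in the notebook you would split the `X, Y` returned by `get_features` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (X, Y): a linear signal plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 3))
Y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + 4.0 + rng.normal(0, 1.0, size=200)

# Hold out 20% of the rows; random_state only makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(
    X_toy, Y_toy, test_size=0.2, random_state=0
)

# Fit on the training rows only
holdout_model = LinearRegression().fit(X_train, Y_train)

# Compare the error on rows the model saw with rows it did not
train_mse = np.mean((Y_train - holdout_model.predict(X_train)) ** 2)
test_mse = np.mean((Y_test - holdout_model.predict(X_test)) ** 2)
print("train MSE: {:.2f}, test MSE: {:.2f}, test RMSE: {:.2f}".format(
    train_mse, test_mse, np.sqrt(test_mse)))
```

A test error far above the training error is the usual sign that a model is overfitting.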


<a id='conclusion'></a>
# Conclusion

This ends Part 1 of our Kaggle capstone primer, which covered data cleaning, feature selection, and a first linear regression model. While this was just an introduction, we hope that you can now see some of the workflow patterns associated with machine learning. Feel free to play around with the code above, especially the choice of feature columns, to get a better feel for how the model behaves. As always, please email [`contact@arun.run`](mailto:contact@arun.run) or [`prc@berkeley.edu`](mailto:prc@berkeley.edu) with any questions or concerns whatsoever. Happy machine learning!

## Sneak Peek at SUSA Kaggle Competition II

After Spring Break, we will be guiding you through a four-week collaborative Kaggle competition with your peers in Career Exploration! We want to give you the experience of working with real data, using real machine learning algorithms, in an educational setting. You will have to choose either Python or R, and dive into reading kernels on the Kaggle website, use visualization and feature engineering to improve your score, and maybe even pick up a few advanced deep learning models along the way. If this sounds a bit intimidating right now, do not fret! Your SUSA Mentors will be there to mentor you through the whole thing. So rest up during Spring Break, and come back ready to tackle your biggest data challenge yet!

<a id='reading'></a>
# Additional Reading
* For more information on the Kaggle API, a command-line program used to download and manage Kaggle datasets, visit the [Kaggle API GitHub page](https://github.com/Kaggle/kaggle-api).
* For an interactive guide to learning R and Python, visit [DataCamp](https://www.datacamp.com/), a paid tutorial website for learning data computing.
