In [0]:
# basic libraries you need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# The following is code for uploading a file to the colab.research.google 
# environment.

# library for uploading files
from google.colab import files 

def upload_files():
    # initiates the upload - follow the dialogues that appear
    uploaded = files.upload()

    # verify the upload
    for fn in uploaded.keys():
        print('User uploaded file "{name}" with length {length} bytes'.format(
            name=fn, length=len(uploaded[fn])))

    # uploaded files need to be written to file to interact with them
    # as part of a file system
    for filename in uploaded.keys():
        with open(filename, 'wb') as f:
            f.write(uploaded[filename])

# Data Preparation

This assignment is about preparing data for machine learning experiments. We have given you a dataset called "Melbourne_housing_FULL.csv". You can read more about this dataset here:

https://www.kaggle.com/anthonypino/melbourne-housing-market

The dataset is for a housing price prediction task. The Price column is the target value and the rest of the columns are features. We will ask you to perform a series of cleaning tasks. The final result should be a dataset that is ready for ML.

Load the dataset using the code below. Notice that some columns are dropped. This is intentional as some of the columns in the original dataset are difficult to work with.


In [0]:
upload_files()

In [0]:
house_df = pd.read_csv("Melbourne_housing_FULL.csv")
house_df = house_df.drop(["Suburb", "Address", "Date", "Postcode", "SellerG", "CouncilArea", "Lattitude", "Longtitude"], axis=1)

In [0]:
#Run EDA
house_df.head()

In [0]:
house_df.info()

### 1) Remove Null prices

Remove any row where the Price column is null. Remember, if you are worried about messing up the data while testing ideas, make a copy of it by calling `.copy()` on the original dataframe. For the rest of the exercise, assume that operations should be done on the dataframe with no null Prices.
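
As a rough illustration of the idea, here is a minimal sketch on a made-up toy frame (not the housing data). The names `toy_df` and `priced_df` and all values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for house_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [850000.0, np.nan, 1200000.0, np.nan],
})

# Work on a copy so the original frame stays untouched.
priced_df = toy_df.copy()

# Keep only the rows whose Price is present (boolean mask).
priced_df = priced_df[priced_df["Price"].notnull()]

# An equivalent shortcut would be: toy_df.dropna(subset=["Price"])
print(len(toy_df), "rows before,", len(priced_df), "rows after")
```

The boolean mask and `dropna(subset=...)` should give the same result here; which one you use is a matter of taste.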

In [0]:
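# Hints (replace Feature with a real column name):
# house_df.Feature.notnull()   -> boolean mask, True where a value is present
# house_df.Feature.isnull()    -> boolean mask, True where a value is missing
# house_df.loc[mask]           -> rows where the boolean mask is True
# code goes here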

### 2) Replace null "Regionname" values with the most frequent region name

Don't use sklearn imputer objects for this. The `Imputer` class this assignment was written around is not equipped to handle categorical data, so fill the values directly with pandas.

Hint: You can get the most frequent value in a series with `series_object.value_counts().index[0]`.
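
The hint can be tried out on a small made-up series first. This is only a sketch: the region names and the variable names `regions`, `most_frequent`, and `filled` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Toy series with two missing entries; the region names are made up.
regions = pd.Series(
    ["Northern", "Southern", np.nan, "Northern", np.nan],
    name="Regionname",
)

# value_counts() ignores NaN by default and sorts by count, descending,
# so the first index entry is the most frequent value.
most_frequent = regions.value_counts().index[0]

# Replace the missing entries with that value.
filled = regions.fillna(most_frequent)
print(filled.tolist())
```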

In [0]:
# code goes here

### 3) Impute the numerical columns with the mean value of that column

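Use scikit-learn's imputer class with default arguments to do this. In current releases this is `SimpleImputer` from `sklearn.impute`; the older `Imputer` class in `sklearn.preprocessing` was removed in scikit-learn 0.22. We started you off by listing the numerical columns.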
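
A minimal sketch of mean imputation on a made-up toy frame, assuming a scikit-learn version that provides `SimpleImputer`. The frame `toy_df`, the list `toy_columns`, and the values are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column; values are made up.
toy_df = pd.DataFrame({
    "Landsize": [100.0, np.nan, 300.0],
    "Distance": [2.0, 4.0, np.nan],
})
toy_columns = ["Landsize", "Distance"]

# With default arguments the strategy is "mean".
imputer = SimpleImputer()

# fit_transform returns a NumPy array; assign it back to the same columns.
toy_df[toy_columns] = imputer.fit_transform(toy_df[toy_columns])
print(toy_df)
```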

In [0]:
# SimpleImputer replaces sklearn.preprocessing.Imputer (removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
num_columns = ["Landsize",
               "Distance",
               "BuildingArea",
               "Propertycount"]

# code goes here

### 4) Impute the integer columns with the most frequent value of that column

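Use scikit-learn's imputer class with `strategy="most_frequent"`. In current releases this is `SimpleImputer` from `sklearn.impute`; the older `Imputer` class in `sklearn.preprocessing` was removed in scikit-learn 0.22. We started you off by listing the integer columns.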
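
The same pattern with the most frequent value can be sketched as follows, again on made-up toy data and assuming a scikit-learn version that provides `SimpleImputer`. The names `toy_df` and `toy_columns` are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame; missing values force the columns to be stored as floats.
toy_df = pd.DataFrame({
    "Bathroom": [1.0, 2.0, np.nan, 1.0],
    "Car": [2.0, np.nan, 2.0, 1.0],
})
toy_columns = ["Bathroom", "Car"]

# Fill each column with its own most frequent value.
imputer = SimpleImputer(strategy="most_frequent")
toy_df[toy_columns] = imputer.fit_transform(toy_df[toy_columns])
print(toy_df)
```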

In [0]:
# SimpleImputer replaces sklearn.preprocessing.Imputer (removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
int_columns = ["Bedroom2",
               "Bathroom",
               "Car",
               "YearBuilt"]

# code goes here

### 5) Discretize the BuildingArea Column by making a new column named BuildingAreaDiscrete

More specifically, make a new column with three categories: "small", "medium", and "large". We listed the labels to use below.

Use the `pd.qcut` function to do this. Note: You may have to pass the argument `duplicates="drop"` if you get an error about bin edges not being unique.
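
To see what `pd.qcut` does, here is a small sketch on six made-up areas. The series `areas`, the list `toy_labels`, and the values are assumptions for the example, chosen so that no bin edges collide.

```python
import pandas as pd

# Six made-up building areas.
areas = pd.Series([50, 80, 100, 120, 150, 200], name="BuildingArea")
toy_labels = ["small", "medium", "large"]

# q=3 splits the values into three bins with (roughly) equal counts.
discrete = pd.qcut(areas, q=3, labels=toy_labels)
print(discrete.tolist())
```

Because the bins are based on quantiles rather than fixed widths, each label ends up with about the same number of rows.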


In [0]:
building_labels = ["small", "medium", "large"]
new_column_name = "BuildingAreaDiscrete"
# code goes here

### 6) Make dummy variables of the categorical columns

We use `pd.get_dummies` to do this, but if you are comfortable with a different technique, go ahead and try that. We identified the categorical columns we want "dummified" for you.
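
A minimal sketch of what "dummifying" a column looks like, using a made-up toy frame. The names `toy_df` and `dummies_df` and the values are assumptions for the example.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column; values are made up.
toy_df = pd.DataFrame({
    "Type": ["h", "u", "t", "h"],
    "Rooms": [3, 2, 3, 4],
})

# Each category in "Type" becomes its own indicator column (Type_h, ...),
# and the original "Type" column is dropped from the result.
dummies_df = pd.get_dummies(toy_df, columns=["Type"])
print(dummies_df.columns.tolist())
```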

In [0]:
cat_columns = ["Type", "Method", "Regionname", "BuildingAreaDiscrete"]

# code goes here

### 7) Remove the top 1% of Prices

Remove rows that have a Price in the top 1% of prices. This corresponds to prices above the 99th percentile. Use the `quantile` method to accomplish this.
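
The percentile cut can be sketched on made-up prices like this. The frame `toy_df` and the names `cutoff` and `trimmed_df` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# 200 made-up prices: 1.0, 2.0, ..., 200.0
toy_df = pd.DataFrame({"Price": np.arange(1, 201, dtype=float)})

# The 99th percentile of the Price column.
cutoff = toy_df["Price"].quantile(0.99)

# Keep rows at or below the cutoff, which drops the top 1%.
trimmed_df = toy_df[toy_df["Price"] <= cutoff]
print(len(toy_df), "rows before,", len(trimmed_df), "rows after")
```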

In [0]:
# code goes here

### 8) Engineer a new column called "BathroomRatio"

Make a new column named "BathroomRatio". This column is the number of bathrooms divided by the number of rooms. This gives an idea of the number of bathrooms in proportion to the size of the house.
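
A quick sketch of the ratio on a made-up toy frame. It assumes the room count lives in a column named `Rooms` and the bathroom count in `Bathroom`; check the column names in your own dataframe before relying on this.

```python
import pandas as pd

# Toy frame; the column names are assumed and the values are made up.
toy_df = pd.DataFrame({
    "Rooms": [2, 4, 5],
    "Bathroom": [1.0, 2.0, 1.0],
})

# Element-wise division of one column by another.
toy_df["BathroomRatio"] = toy_df["Bathroom"] / toy_df["Rooms"]
print(toy_df)
```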

In [0]:
# Code goes here

### 9) Separate the Price column from the other features

Make two new variables:

1) A Series that contains just the Price column
2) A DataFrame that contains every other feature

Use reasonable names for these two new variables.
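
One possible way to do the split, sketched on a made-up toy frame. The names `toy_df`, `prices`, and `features` are assumptions for the example.

```python
import pandas as pd

# Toy frame; values are made up.
toy_df = pd.DataFrame({
    "Rooms": [2, 3],
    "Distance": [2.5, 10.0],
    "Price": [850000.0, 1200000.0],
})

# Selecting a single column gives a Series (the target).
prices = toy_df["Price"]

# Dropping that column gives a DataFrame with everything else.
features = toy_df.drop("Price", axis=1)
print(type(prices).__name__, features.columns.tolist())
```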

In [0]:
# Code goes here

### 10) Divide the data into a train and test set

Use `train_test_split` from `sklearn.model_selection` to accomplish this. Use a test size of 10%, and split the variables you made in step **9)**.
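
A minimal sketch of a 10% split on made-up data. The names `features`, `prices`, `X_train`, `X_test`, `y_train`, and `y_test` are assumptions for the example, and `random_state=0` is an optional extra used here only to make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 20 made-up rows of features and matching prices.
features = pd.DataFrame({"Rooms": list(range(1, 21))})
prices = pd.Series([100000.0 * r for r in range(1, 21)], name="Price")

# test_size=0.1 holds out 10% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.1, random_state=0
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Features and targets are shuffled together, so each row of `X_train` still lines up with the same row of `y_train`.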

In [0]:
# Code goes here

### 11) Scale the data

Scale the features using standardization. You can use any technique you like for this.

Bonus: Fit the scaler on the training features only, then transform (scale) the test features using the mean and standard deviation learned from the training features. This is most easily done with the `StandardScaler` class in `sklearn.preprocessing`.
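
The fit-on-train, transform-on-test pattern can be sketched like this on made-up arrays. The arrays `X_train` and `X_test` and their values are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test features (two columns each).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only,
# and scale the training data in the same step.
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics on the test data; do not refit here.
X_test_scaled = scaler.transform(X_test)
print(scaler.mean_, X_test_scaled)
```

Fitting on the training set alone keeps information about the test set from leaking into the preprocessing step.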

In [0]:
# Code goes here