## Using Pandas to Get Familiar with Data

In [1]:
import pandas as pd

In [3]:
melbourne_file_path = "/Users/deepshah/Desktop/Coding_Stuff/PYTHON/Intro to Machine Learning/melb_data.csv"
# read and store the data
melbourne_data = pd.read_csv(melbourne_file_path)
# print summary of the data
melbourne_data.describe()

,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


# First Machine Learning Model

In [4]:
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [5]:
# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
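
To make the effect of `dropna(axis=0)` concrete, here is a minimal sketch on a made-up three-row table (the values are hypothetical and are not taken from `melb_data.csv`). With `axis=0`, any row containing at least one missing value is removed, which is why the row count in the summaries goes from 13580 before this step to 6196 afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, np.nan, 142.0],
    "YearBuilt": [1900.0, 1970.0, np.nan],
})

# axis=0 drops every ROW that contains at least one missing value
complete_rows = toy.dropna(axis=0)

print(len(toy), "rows before,", len(complete_rows), "rows after")
print(complete_rows)
```

Only the first row survives here, because each of the other two rows has a missing value in some column.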

## Selecting the Prediction Target

You can pull out a single column (a variable) using dot notation.

The column is stored in a Series, which is broadly like a DataFrame with only a single column of data.

We will use dot notation to select the column we want to predict, which is called the prediction target.

By convention, the prediction target is called y.

In [6]:
y = melbourne_data.Price
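
As a small illustration of the dot notation used above, the sketch below builds a hypothetical two-row table and checks that `toy.Price` and `toy["Price"]` return the same Series. The `toy` table is an assumption made for the example.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"Price": [1035000.0, 1465000.0], "Rooms": [2, 3]})

y_dot = toy.Price          # dot notation
y_bracket = toy["Price"]   # bracket notation selects the same column

print(type(y_dot).__name__)
print(y_dot.equals(y_bracket))
```

Bracket notation is the safer choice when a column name contains spaces or clashes with an existing DataFrame attribute, since dot notation cannot reach those columns.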

## Choosing "Features"

The columns that are fed into our model (and later used to make predictions) are called "features".

Here, those are the columns used to determine the home price.

In [10]:
melbourne_features = ["Rooms", "Bathroom", "Landsize", "Lattitude", "Longtitude"]

In [11]:
# By convention, this data is called X
X = melbourne_data[melbourne_features]

In [12]:
X.describe()

,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635


In [13]:
X.head()

,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


## Building Your Model

To create the model, we will be using the scikit-learn (sklearn) library.

The steps to building and using a model are:
1. Define
2. Fit
3. Predict
4. Evaluate

In [14]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure the same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

In [15]:
print("Making predictions:")
print(X.head())
print("Prices:", end="")
print(melbourne_model.predict(X.head()))

Making predictions:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
Prices:[1035000. 1465000. 1600000. 1876000. 1636000.]


# Model Validation

We evaluate the model we built because we need to check its quality before deploying it.

Metrics for Summarizing Model Quality

Mean Absolute Error (MAE)

The prediction error for each house is:

error = actual - predicted

MAE takes the absolute value of each error and then averages those absolute errors.
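
The error formula above can be turned into MAE by hand: take the absolute value of each error, then average. The following is a minimal sketch with three made-up prices (hypothetical values, not model output), using plain Python so the arithmetic is visible.

```python
# Hypothetical actual and predicted prices for three houses
actual = [1000000.0, 1500000.0, 800000.0]
predicted = [1100000.0, 1450000.0, 800000.0]

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Mean Absolute Error: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print(errors)
print(mae)
```

The errors are -100000, 50000, and 0, so the absolute errors sum to 150000 and the MAE is 50000. Taking absolute values keeps positive and negative errors from cancelling each other out.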

In [18]:
from sklearn.metrics import mean_absolute_error
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

This measure is called an "in-sample" score, meaning we used a single "sample" of houses for both building the model and evaluating it.

## The Problem with "In-Sample" Scores

Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on houses it hasn't seen before. This data is called validation data.

The scikit-learn library has a "train_test_split" function to break up the data into two pieces. We will use one part as training data for building the model and the other part as validation data to calculate "mean_absolute_error".

In [21]:
from sklearn.model_selection import train_test_split

# split data into training and validation data. Both features and target
# The split is based on a random number generator
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define model (no random_state is set here, so the score can vary slightly between runs)
melbourne_model = DecisionTreeRegressor()

# Fit model
melbourne_model.fit(train_X, train_y)

# get predictions
predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, predictions))

273940.23283839034
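
To see what the split above does to the row counts, here is a minimal sketch on a hypothetical eight-row array. When neither `test_size` nor `train_size` is passed, `train_test_split` holds out 25% of the rows for validation. The `*_toy` names are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 rows with 2 features each, plus 8 targets
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.arange(8)

# Same call shape as in the notebook: default split sizes, fixed seed
train_X_toy, val_X_toy, train_y_toy, val_y_toy = train_test_split(
    X_toy, y_toy, random_state=0
)

print(len(train_X_toy), "training rows,", len(val_X_toy), "validation rows")
```

Every row ends up in exactly one of the two pieces, so the validation rows are never seen during fitting.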


# Underfitting and Overfitting

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).


This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.


At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.
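
The contrast described above can be sketched on synthetic data. The example below is an illustration under stated assumptions, not part of the Melbourne analysis: it fits a very shallow tree (2 leaves) and an unrestricted tree to a noisy sine curve and compares training and validation MAE for each.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): a sine curve plus random noise
rng = np.random.default_rng(0)
X_toy = np.linspace(0, 10, 300).reshape(-1, 1)
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.3, size=300)

tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=0)

results = {}
for label, leaves in [("shallow", 2), ("unrestricted", None)]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[label] = (train_mae, val_mae)
    print(f"{label}: train MAE {train_mae:.3f}, validation MAE {val_mae:.3f}")
```

The unrestricted tree memorizes the training rows, so its training error is essentially zero while its validation error is not, which is the overfitting pattern. The two-leaf tree has a clearly nonzero error even on the training rows, which is the underfitting pattern.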

In [23]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, train_y, val_X, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [25]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, train_y, val_X, val_y)
    print(f"Max_leaf_nodes: {max_leaf_nodes}, MAE: {my_mae}")

Max_leaf_nodes: 5, MAE: 385696.54278937966
Max_leaf_nodes: 50, MAE: 279794.61143891385
Max_leaf_nodes: 500, MAE: 261718.1134423186
Max_leaf_nodes: 5000, MAE: 271996.1207230471
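
One way to pick the tree size from these results is to store each score in a dictionary and take the key with the lowest validation MAE. The sketch below copies the four scores printed above; the names `scores` and `best_tree_size` are assumptions made for the example.

```python
# Validation MAE for each max_leaf_nodes value, copied from the output above
scores = {
    5: 385696.54278937966,
    50: 279794.61143891385,
    500: 261718.1134423186,
    5000: 271996.1207230471,
}

# The key with the smallest MAE is the best of the candidates tried
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)
```

Among the four candidates, 500 leaves gives the lowest validation error: 5 and 50 leaves underfit, while 5000 leaves starts to overfit.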


# Random Forests

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. 

In [26]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
pred = forest_model.predict(val_X)
print(mean_absolute_error(val_y, pred))

207190.6873773146
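
To illustrate the averaging described above, the sketch below fits a small forest on synthetic data (hypothetical, not the Melbourne data) and checks that the forest's prediction matches the mean of its individual trees' predictions. It relies on the fitted `estimators_` attribute, which holds the component trees; `n_estimators=10` is chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical): two features with a noisy linear target
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 10, size=(100, 2))
y_toy = 2 * X_toy[:, 0] + X_toy[:, 1] + rng.normal(0, 0.5, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(X_toy, y_toy)

# Predict the first five rows with each component tree, then average
per_tree = np.array([tree.predict(X_toy[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_toy[:5])))
```

Because each tree is trained on a different bootstrap sample of the rows, the individual predictions differ, and averaging them smooths out the errors of any single tree.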
