Intro to Machine Learning

In [None]:
# Reading Data
import pandas as pd

data = pd.read_csv('melb_data.csv')
print("Data Columns:\n")
print(data.columns)

Data Columns:

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')


In [2]:
# dropna with axis=0 drops every row that has a missing value (think of "na" as "not available")
data = data.dropna(axis=0)
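
A minimal standalone sketch of what dropna(axis=0) does, using a tiny made-up frame (the values below are hypothetical, not taken from melb_data.csv). Any row with at least one missing value is removed, and the surviving rows keep their original index labels, which is why the index of a cleaned frame can have gaps.

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 2 each have one missing value
df = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, None, 142.0],
    "YearBuilt": [1900.0, 1900.0, None],
})

# axis=0 means "drop rows" (axis=1 would drop columns instead)
cleaned = df.dropna(axis=0)

print(len(df), "rows before,", len(cleaned), "rows after")
print("Remaining index labels:", cleaned.index.tolist())
```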

In [3]:
# Use dot notation to select the "prediction target", the column we want to predict
price = data.Price
print("Price:\n")
print(price)

Price:

1        1035000.0
2        1465000.0
4        1600000.0
6        1876000.0
7        1636000.0
           ...    
12205     601000.0
12206    1050000.0
12207     385000.0
12209     560000.0
12212    2450000.0
Name: Price, Length: 6196, dtype: float64


The columns we pass to the model as inputs (and later use to make predictions) are called "features".

In [None]:
features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

In [5]:
req_data = data[features]
print(req_data.describe())
print(req_data.head())

             Rooms     Bathroom      Landsize    Lattitude   Longtitude
count  6196.000000  6196.000000   6196.000000  6196.000000  6196.000000
mean      2.931407     1.576340    471.006940   -37.807904   144.990201
std       0.971079     0.711362    897.449881     0.075850     0.099165
min       1.000000     1.000000      0.000000   -38.164920   144.542370
25%       2.000000     1.000000    152.000000   -37.855438   144.926198
50%       3.000000     1.000000    373.000000   -37.802250   144.995800
75%       4.000000     2.000000    628.000000   -37.758200   145.052700
max       8.000000     8.000000  37000.000000   -37.457090   145.526350
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954


In [6]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=1)

model.fit(req_data, price)

# Predict on the same rows the model was trained on (in-sample predictions)
results = model.predict(req_data)

# Show the first 10 predictions
for prediction in results[:10]:
    print(prediction)

1035000.0
1465000.0
1600000.0
1876000.0
1636000.0
1097000.0
1350000.0
750000.0
1310000.0
1200000.0


The prediction error for each house is:
error = actual − predicted
For the Mean Absolute Error (MAE) metric, we take the absolute value of each error, which makes every error non-negative, and then average those absolute errors.
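
The arithmetic can be sketched by hand in plain Python. The three prices and predictions below are hypothetical values chosen only to make the steps easy to follow.

```python
# Hypothetical sale prices and model predictions (not from the dataset)
actual = [200_000.0, 350_000.0, 500_000.0]
predicted = [210_000.0, 330_000.0, 500_000.0]

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Absolute value removes the sign, so over- and under-predictions don't cancel
abs_errors = [abs(e) for e in errors]

# MAE is the average of the absolute errors
mae = sum(abs_errors) / len(abs_errors)
print(mae)  # 10000.0
```

Without the absolute value, the errors -10000 and 20000 would partly cancel and understate how far off the predictions are.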

In [None]:
from sklearn.metrics import mean_absolute_error

print("Mean Absolute Error:", end=" ")
print(round(mean_absolute_error(price, results)))

Mean Absolute Error: 1116


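The error above was measured on the same data the model was trained on, so it says little about how the model will do on houses it has not seen. The train_test_split function breaks the data into two pieces: we use one piece as training data to fit the model, and the other as validation data to calculate mean_absolute_error.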
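
A small standalone sketch of how train_test_split behaves, on made-up arrays rather than the housing data (the names X_demo and y_demo are illustrative only). It assumes scikit-learn's default of holding out 25% of the rows when no test_size is given.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 toy rows with 2 features each; row i is [2*i, 2*i + 1] and its target is i
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# With no test_size given, a quarter of the rows go to the second piece
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)
print(len(X_tr), "training rows,", len(X_va), "validation rows")

# Features and targets are shuffled together, so each row keeps its own target
print((X_va[:, 0] == 2 * y_va).all())

# Reusing the same random_state reproduces the same split
_, X_va_again, _, _ = train_test_split(X_demo, y_demo, random_state=0)
print((X_va == X_va_again).all())
```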

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(req_data, price, random_state = 0)

model.fit(X_train, y_train)
results = model.predict(X_test)

# Show the first 10 validation predictions
for prediction in results[:10]:
    print(prediction)

print('\n')

print(mean_absolute_error(y_test, results))

900000.0
696750.0
1120000.0
1590000.0
630000.0
591000.0
885000.0
1634000.0
1155000.0
715000.0


273518.01872175594


Overfitting: a model matches the training data almost perfectly but does poorly on validation data and other new data. A very deep tree, with many leaves that each hold only a few houses, tends to overfit.

Underfitting: a model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data. A very shallow tree tends to underfit, because it doesn't divide the houses into very distinct groups.
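
Both failure modes can be seen side by side in a standalone sketch on synthetic data. Everything here is an assumption made for illustration: the noisy sine curve, the sample size, and the choice of 2 leaf nodes for the "shallow" tree. It is not the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus noise (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(400, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(0, 0.3, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, random_state=0)

scores = {}
# max_leaf_nodes=2 forces a very shallow tree; None lets the tree grow freely
for label, leaves in [("shallow", 2), ("unrestricted", None)]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
    val_mae = mean_absolute_error(y_va, tree.predict(X_va))
    scores[label] = (train_mae, val_mae)
    print(f"{label:>12}: train MAE {train_mae:.3f}, validation MAE {val_mae:.3f}")
```

The unrestricted tree memorizes the training rows, so its training MAE is zero while its validation MAE is not. The shallow tree has a clearly non-zero error even on the data it was trained on.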

In [12]:
def get_mean(max_size, X_train, X_test, y_train, y_test):
    """Fit a tree capped at max_size leaf nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_size, random_state=0)
    model.fit(X_train, y_train)
    result = model.predict(X_test)
    return mean_absolute_error(y_test, result)

In [37]:
for i in [5, 50, 500, 5000]:
    mean = get_mean(i, X_train, X_test, y_train, y_test)
    print(f"Leaf Node:{i} \t\t Mean: {round(mean)}")

Leaf Node:5 		 Mean: 385697
Leaf Node:50 		 Mean: 279795
Leaf Node:500 		 Mean: 261718
Leaf Node:5000 		 Mean: 271321
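
Reading the output above, the validation MAE falls as the tree grows from 5 to 500 leaf nodes and rises again at 5000, which suggests 500 is the best of the four candidates. A short sketch of picking that value programmatically, with the MAE figures copied from the printed results (the dictionary name is illustrative):

```python
# Validation MAE for each max_leaf_nodes value, copied from the output above
mae_by_leaf_nodes = {5: 385697, 50: 279795, 500: 261718, 5000: 271321}

# Pick the tree size with the lowest validation MAE
best_size = min(mae_by_leaf_nodes, key=mae_by_leaf_nodes.get)
print(best_size)  # 500
```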
