# Regression: Using a Real World Dataset

Created for Green Level High School AI & Machine Learning Club

Dataset Source: https://www.kaggle.com/camnugent/california-housing-prices 

Dataset overview: The data describes California housing and comes from the 1990 census. Each row summarizes one block of houses rather than a single house.


Features we will use (named as they appear in the CSV columns)
* longitude: A measure of how far west a block is; values are negative in California, so a more negative value is farther west

* latitude: A measure of how far north a block is; a higher value is farther north

* housing_median_age: Median age of a house within a block; a lower number is a newer building

* total_rooms: Total number of rooms within a block

* population: Total number of people residing within a block

* households: Total number of households within a block, where a household is a group of people residing within one home unit

* median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)


Label (what we will predict)
* median_house_value: Median house value for households within a block (measured in US Dollars)


In [4]:
# STEP 1: Import Dataset
import pandas as pd
data = pd.read_csv('housing.csv')
print(data.head())
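
Before choosing features, it helps to know which columns hold missing or non-numeric values. The sketch below runs that check with only the standard library on three made-up rows (the values are hypothetical, not copied from housing.csv, and `summarize` is a helper invented for this illustration). On the real DataFrame, `data.isna().sum()` and `data.dtypes` report the same information.

```python
import csv
import io

# Hypothetical sample rows laid out like three of the housing.csv columns.
SAMPLE = """total_rooms,total_bedrooms,ocean_proximity
880,129,NEAR BAY
7099,,NEAR BAY
1467,190,INLAND
"""

def summarize(csv_text):
    """Return (missing_counts, non_numeric_columns) for a CSV string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing = {}
    non_numeric = set()
    for name in rows[0]:
        values = [row[name] for row in rows]
        # An empty cell counts as a missing value.
        missing[name] = sum(1 for v in values if v == "")
        for v in values:
            if v == "":
                continue
            try:
                float(v)
            except ValueError:
                # At least one entry cannot be read as a number.
                non_numeric.add(name)
    return missing, non_numeric

missing, non_numeric = summarize(SAMPLE)
print(missing)
print(non_numeric)
```

In this sample, `total_bedrooms` has one empty cell and `ocean_proximity` holds text, which are the two kinds of problem a scikit-learn model cannot accept as-is.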

In [None]:
# STEP 2: Splitting Up the Dataset

# train_test_split divides a dataset that does not already come split into train and test sets
from sklearn.model_selection import train_test_split

# First we need to remove the 'label' column from the dataset
# Why do we need to do this?
# ocean_proximity is also dropped because it holds strings, not numbers;
# total_bedrooms is dropped because it has missing values
X = data.drop(columns=['median_house_value', 'ocean_proximity', 'total_bedrooms'])
y = data['median_house_value']

# Partition the dataset: 75% of the rows for training, the remaining 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)
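
To see what a 75/25 split means, here is a minimal standard-library sketch of the idea: shuffle the rows, then cut the shuffled list at the 75% mark. The function `simple_train_test_split` is a hypothetical helper written for this illustration; it is not scikit-learn's implementation and will not reproduce the same rows that `train_test_split` picks.

```python
import random

def simple_train_test_split(rows, train_size=0.75, seed=0):
    """Shuffle a copy of rows and cut it into train and test parts."""
    shuffled = list(rows)
    # A fixed seed makes the shuffle repeatable, similar in spirit to random_state.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]

train, test = simple_train_test_split(range(20))
print(len(train), len(test))  # 15 5
```

Every row lands in exactly one of the two parts, so the model is later tested only on rows it never saw during training.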


In [None]:
# STEP 3: Training

from sklearn import tree  # We will use tree.DecisionTreeRegressor(), not tree.DecisionTreeClassifier()

regression = tree.DecisionTreeRegressor()
regression.fit(X_train, y_train)

DecisionTreeRegressor()

In [None]:
# STEP 4: Testing
y_predict = regression.predict(X_test)
print(y_predict) 

[131000. 205200. 160300. ... 281300. 314900. 246400.]


In [None]:
# STEP 5: Measuring Error

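# Mean squared log error compares log(1 + value) for each prediction and true label,
# so it measures relative error rather than error in dollars
from sklearn.metrics import mean_squared_log_error
error = mean_squared_log_error(y_test, y_predict)
print(error)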

0.10303647638916677
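
To make the error number easier to interpret, the sketch below computes mean squared log error by hand using the formula mean((log(1 + actual) - log(1 + predicted))^2). The helper `msle` and the dollar amounts are made up for illustration. The example suggests that this metric depends on the ratio between prediction and truth, not on the dollar difference.

```python
import math

def msle(actual, predicted):
    """Mean of squared differences between log(1 + actual) and log(1 + predicted)."""
    squared = [
        (math.log1p(a) - math.log1p(p)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return sum(squared) / len(squared)

# Hypothetical values: both predictions are twice the true value.
cheap = msle([100000.0], [200000.0])       # off by $100,000
expensive = msle([400000.0], [800000.0])   # off by $400,000
perfect = msle([250000.0], [250000.0])

print(cheap, expensive, perfect)
```

Both doubled predictions score close to ln(2)^2, about 0.48, even though their dollar errors differ by a factor of four. Read the same way, an error of about 0.103 has a square root near 0.32 in log units, which corresponds to predictions that are typically off by a factor of roughly 1.4.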
