This exercise will test your ability to read a data file and understand statistics about the data.

In later exercises, you will apply techniques to filter the data, build a machine learning model, and iteratively improve your model.

The course examples use data from Melbourne, and the original exercise asks you to apply the techniques to a new dataset. This notebook works through the same steps on the Melbourne house price data (`melb_data.csv`).

In [4]:
import pandas as pd

## Step 1: Loading Data
Read the Melbourne data file into a Pandas DataFrame called `filedata`.
### Review The Data


In [5]:
filepath = "melb_data.csv"
filedata = pd.read_csv(filepath)
filedata.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


In [6]:
filedata.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

## Drop the rows with missing data
### Not the recommended approach, but it will do for now

In [7]:
filedata = filedata.dropna(axis=0)
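
To see why dropping rows is costly, it can help to count the missing values first. In the `describe()` output above, `BuildingArea` has only 7,130 values out of 13,580 rows, and after `dropna` only 6,196 rows remain. The sketch below shows the same effect on a small made-up DataFrame; the `toy` data and its values are assumptions for illustration, not part of the Melbourne file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real file; the values are made up.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
    "YearBuilt": [1900.0, 1900.0, np.nan, 2014.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna(axis=0) removes every row that has at least one missing value.
complete_rows = toy.dropna(axis=0)
print(f"rows before: {len(toy)}, rows after: {len(complete_rows)}")
```

Running `filedata.isna().sum()` before the `dropna` call would give the same per-column view for the real data.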

## Step 1: Specify Prediction Target
Select the target variable, which corresponds to the sales price. Save this to a new variable called `price`. You'll need to print a list of the columns to find the name of the column you need.

In [8]:
price = filedata.Price
print(price)

1        1035000.0
2        1465000.0
4        1600000.0
6        1876000.0
7        1636000.0
           ...    
12205     601000.0
12206    1050000.0
12207     385000.0
12209     560000.0
12212    2450000.0
Name: Price, Length: 6196, dtype: float64


## Step 2: Create X
Now you will create a DataFrame called `x` holding the predictive features.

Since you want only some columns from the original data, you'll first create a list with the names of the columns you want in `x`.

You'll use just the following columns in the list:

* Rooms
* Bathroom
* Landsize
* Lattitude
* Longtitude

After you've created that list of features, use it to create the DataFrame that you'll use to fit the model.

In [22]:
features = ['Rooms','Bathroom','Landsize','Lattitude','Longtitude'] 
x = filedata[features]


## Review Data
Before building a model, take a quick look at `x` to verify it looks sensible.

In [23]:
print(x.head(10))
print(x.describe())

    Rooms  Bathroom  Landsize  Lattitude  Longtitude
1       2       1.0     156.0   -37.8079    144.9934
2       3       2.0     134.0   -37.8093    144.9944
4       4       1.0     120.0   -37.8072    144.9941
6       3       2.0     245.0   -37.8024    144.9993
7       2       1.0     256.0   -37.8060    144.9954
9       2       1.0     220.0   -37.8010    144.9989
11      3       2.0     214.0   -37.8085    144.9964
12      2       2.0       0.0   -37.8078    144.9965
15      2       1.0     238.0   -37.8090    144.9976
16      3       2.0     113.0   -37.8056    144.9930
             Rooms     Bathroom      Landsize    Lattitude   Longtitude
count  6196.000000  6196.000000   6196.000000  6196.000000  6196.000000
mean      2.931407     1.576340    471.006940   -37.807904   144.990201
std       0.971079     0.711362    897.449881     0.075850     0.099165
min       1.000000     1.000000      0.000000   -38.164920   144.542370
25%       2.000000     1.000000    152.000000   -37.85543

### Install scikit-learn (imported as `sklearn`), the tool we will use to build our models

In [11]:
!pip3 install scikit-learn



## Step 3: Specify and Fit Model
Create a `DecisionTreeRegressor` and save it as `model`. Ensure you've done the relevant import from sklearn to run this command.

Then fit the model you just created using the data in `x` and `price` that you saved above.

In [24]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state=1)
model.fit(x,price)

DecisionTreeRegressor(random_state=1)

## Step 4: Make Predictions
Make predictions with the model's `predict` method using `x` as the data. Save the results to a variable called `predicted_home_prices`.

In [25]:
print("Making predictions on the first 10 houses")
print(x.head(10))
print('The predictions are')
predicted_home_prices = model.predict(x.head(10))
print(predicted_home_prices)

Making predictions on the first 10 houses
    Rooms  Bathroom  Landsize  Lattitude  Longtitude
1       2       1.0     156.0   -37.8079    144.9934
2       3       2.0     134.0   -37.8093    144.9944
4       4       1.0     120.0   -37.8072    144.9941
6       3       2.0     245.0   -37.8024    144.9993
7       2       1.0     256.0   -37.8060    144.9954
9       2       1.0     220.0   -37.8010    144.9989
11      3       2.0     214.0   -37.8085    144.9964
12      2       2.0       0.0   -37.8078    144.9965
15      2       1.0     238.0   -37.8090    144.9976
16      3       2.0     113.0   -37.8056    144.9930
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000. 1097000. 1350000.  750000.
 1310000. 1200000.]


## Think About Your Results

Use the `head` method to compare the top few predictions to the actual home values (in `price`) for those same homes. Anything surprising?

In [28]:
print(model.predict(x.head(10)))
print(price.head(10))

[1035000. 1465000. 1600000. 1876000. 1636000. 1097000. 1350000.  750000.
 1310000. 1200000.]
1     1035000.0
2     1465000.0
4     1600000.0
6     1876000.0
7     1636000.0
9     1097000.0
11    1350000.0
12     750000.0
15    1310000.0
16    1200000.0
Name: Price, dtype: float64


## Checking our results

In [29]:
from sklearn.metrics import mean_absolute_error
predicted_home_prices = model.predict(x)
mean_absolute_error(price,predicted_home_prices)

1115.7467183128902
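
As a quick check of what `mean_absolute_error` computes, the sketch below works out the MAE by hand on three made-up prices and compares it with scikit-learn's result. The `actual` and `predicted` arrays are illustrative assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration only.
actual = np.array([1_000_000.0, 750_000.0, 1_200_000.0])
predicted = np.array([1_100_000.0, 700_000.0, 1_200_000.0])

# MAE is the mean of the absolute differences between actual and predicted values.
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)
print(manual_mae, sklearn_mae)
```

The absolute errors here are 100,000, 50,000 and 0, so both calculations give 50,000.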

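This in-sample error looks small, but only because the model is scored on the same data it was trained on. Splitting the data to get an out-of-sample score (see Model Validation below) gives a much larger error, so the model is not very good at predicting prices for houses it has not seen.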

# Model Validation


## Step 1: Split Your Data
Use the `train_test_split` function to split up your data.

Recall, your features are loaded in the DataFrame `x` and your target is loaded in `price`.

## Step 2: Specify and Fit the Model

Create a `DecisionTreeRegressor` model and fit it to the relevant data.
Set `random_state` to 1 again when creating the model.

## Step 3: Make Predictions with Validation data


## Step 4: Calculate the Mean Absolute Error in Validation Data


In [37]:
from sklearn.model_selection import train_test_split

# Hold out part of the data for validation
train_X, val_X, train_y, val_y = train_test_split(x, price, random_state=0)
# random_state is not set on the model here, so results may vary slightly between runs
model = DecisionTreeRegressor()
model.fit(train_X, train_y)
# Get predicted prices on validation data
val_predictions = model.predict(val_X)
print(f'predictions: {val_predictions[:5]}')
error = mean_absolute_error(val_y, val_predictions)
print(f'error: {error}')

predictions: [ 900000.  696750. 1130000. 1590000.  630000.]
error: 276878.10673552833
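
The gap between the in-sample MAE (about 1,116) and the validation MAE (about 276,878) is what you would expect from an unrestricted decision tree, which can fit its training rows almost exactly. The sketch below reproduces the effect on synthetic data; the `toy_x` and `toy_price` values are generated purely for illustration and are not related to the Melbourne file.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: price depends on size plus random noise.
rng = np.random.default_rng(0)
size = rng.uniform(50, 300, 200)
toy_x = pd.DataFrame({"size": size})
toy_price = pd.Series(5000 * size + rng.normal(0, 100_000, 200))

# By default train_test_split holds out 25% of the rows for validation.
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_x, toy_price, random_state=0
)

toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(toy_train_X, toy_train_y)

# Score the same model on the rows it was trained on and on the held-out rows.
train_mae = mean_absolute_error(toy_train_y, toy_model.predict(toy_train_X))
val_mae = mean_absolute_error(toy_val_y, toy_model.predict(toy_val_X))
print(f"rows: {len(toy_train_X)} train, {len(toy_val_X)} validation")
print(f"in-sample MAE: {train_mae:.0f}, validation MAE: {val_mae:.0f}")
```

The tree memorises the noise in its training rows, so only the validation score says anything about new houses.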


# Underfitting and Overfitting

#### Defining the MAE (mean absolute error) using decision trees to get a better prediction.



In [30]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_node, train_X, val_X, train_y, val_y):
    # Fit a tree limited to max_leaf_node leaves and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_node, random_state=1)
    model.fit(train_X, train_y)
    pred_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, pred_val)
    return mae

### Testing different sized trees

#### We can see that, of the sizes tested, a tree with 500 leaf nodes works best for our problem. With fewer leaves the tree underfits, and with more it overfits.

In [20]:
for max_leaf_node in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_node, train_X, val_X, train_y, val_y)
    print(f'max leaf nodes {max_leaf_node} \t\t Mean Absolute Error: {my_mae}')

max leaf nodes 5 		 Mean Absolute Error: 385696.54278937966
max leaf nodes 50 		 Mean Absolute Error: 279794.61143891385
max leaf nodes 500 		 Mean Absolute Error: 261718.1134423186
max leaf nodes 5000 		 Mean Absolute Error: 271996.1207230471
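
Rather than reading the best size off the printed output, one option is to keep the scores in a dictionary and take the key with the smallest MAE. The sketch below uses the four values printed above; the names `mae_by_size` and `best_size` are illustrative, and in the notebook the dictionary could be built with `get_mae` instead of being typed in.

```python
# Validation MAE values copied from the output above, keyed by max_leaf_nodes.
mae_by_size = {
    5: 385696.54278937966,
    50: 279794.61143891385,
    500: 261718.1134423186,
    5000: 271996.1207230471,
}

# Pick the tree size whose validation MAE is lowest.
best_size = min(mae_by_size, key=mae_by_size.get)
print(best_size)
```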


#### Defining the MAE (mean absolute error) using random forests to get a better prediction.

#### Note that until now we have been using decision trees. A random forest combines many decision trees to reduce underfitting/overfitting and give a better prediction model.

#### With the same train/validation split, the random forest (MAE of about 207,191) does better than our best decision tree (MAE of about 261,718).

In [21]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X,train_y)
prediction = forest_model.predict(val_X)
mae = mean_absolute_error(val_y,prediction)
print(mae)

207190.6873773146
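
To make "a bunch of decision trees combined together" concrete, the sketch below checks that a `RandomForestRegressor` prediction is the average of the predictions of its individual trees, which are available in the fitted `estimators_` attribute. The data is synthetic and the names `toy_X`, `toy_y` and `forest` are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on the first feature plus noise.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(100, 2))
toy_y = 3 * toy_X[:, 0] + rng.normal(0, 1, 100)

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(toy_X, toy_y)

# Predict with each tree separately, then average across trees.
new_rows = toy_X[:5]
per_tree = np.array([tree.predict(new_rows) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's own prediction matches the average of its trees.
print(np.allclose(averaged, forest.predict(new_rows)))
```

Averaging many trees, each trained on a different bootstrap sample, smooths out the errors that any single deep tree makes.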
