# Data Science Course Annotations

## Machine Learning

**Machine learning** is a technique for building models that learn patterns from data in order to make predictions or decisions.

In [1]:
import pandas as pd

## Familiarizing with the data

Pandas is one of the primary tools data scientists use to explore data.
The most important part of the Pandas library is the DataFrame. **A DataFrame holds the type of data you might think of as a table.** This is similar to a sheet in Excel, or a table in a SQL database.
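
As a quick illustration of the table idea, here is a minimal sketch that builds a tiny DataFrame by hand. The name `homes` and its values are made up for this example and are not part of the Melbourne dataset.

```python
import pandas as pd

# Hypothetical toy data: each dict key becomes a column, each list item a row.
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [650000.0, 903000.0, 1330000.0],
})

print(homes.shape)            # (number of rows, number of columns)
print(list(homes.columns))    # column labels
print(homes["Price"].mean())  # a single column behaves like a labelled array
```

Selecting one column with `homes["Price"]` returns a pandas Series, which is the same kind of object as `melbourne_data.Price` later in the notebook.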

In [2]:
melbourne_data = pd.read_csv("melb_data.csv")

In [3]:
melbourne_data.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


In [4]:
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [5]:
melbourne_data = melbourne_data.dropna(axis=0)
# Drop every row that contains at least one missing (NaN) value.
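
Dropping every row with any missing value can discard a lot of data: the `count` rows of the two `describe()` outputs in this notebook go from 13,580 to 6,196. The sketch below, on a made-up frame called `toy`, compares that with dropping rows only when the columns you plan to use are missing. Whether that trade-off suits the course exercise is an assumption, not something the notebook states.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two different columns.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, 210.0],
    "CouncilArea": ["Yarra", "Yarra", None, "Yarra"],
})

# Drop a row if ANY column is missing (what the notebook does).
all_clean = toy.dropna(axis=0)

# Drop a row only if one of the listed columns is missing.
subset_clean = toy.dropna(axis=0, subset=["Rooms", "BuildingArea"])

print(len(toy), len(all_clean), len(subset_clean))
```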

In [6]:
y = melbourne_data.Price

---

## Machine Learning Models

The columns selected to go into our model are called **features**. One common approach is to **use all columns except the target** (the column that will be predicted). This notebook instead picks a handful of numeric columns by hand.
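
Both ways of choosing features can be sketched on a small made-up frame. The names `toy`, `x_all` and `x_some` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Landsize": [156.0, 134.0, 120.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

target = "Price"
y_toy = toy[target]

# Option 1: every column except the target.
x_all = toy.drop(columns=[target])

# Option 2: an explicit, hand-picked list of feature columns.
x_some = toy[["Rooms"]]

print(list(x_all.columns))
print(list(x_some.columns))
```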

In [7]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']

In [8]:
x = melbourne_data[melbourne_features]

In [9]:
x.describe()

Unnamed: 0,Rooms,Bathroom,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,141.568645,1964.081988,-37.807904,144.990201
std,0.971079,0.711362,897.449881,90.834824,38.105673,0.07585,0.099165
min,1.0,1.0,0.0,0.0,1196.0,-38.16492,144.54237
25%,2.0,1.0,152.0,91.0,1940.0,-37.855438,144.926198
50%,3.0,1.0,373.0,124.0,1970.0,-37.80225,144.9958
75%,4.0,2.0,628.0,170.0,2000.0,-37.7582,145.0527
max,8.0,8.0,37000.0,3112.0,2018.0,-37.45709,145.52635


In [10]:
x.head()

Unnamed: 0,Rooms,Bathroom,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude
1,2,1.0,156.0,79.0,1900.0,-37.8079,144.9934
2,3,2.0,134.0,150.0,1900.0,-37.8093,144.9944
4,4,1.0,120.0,142.0,2014.0,-37.8072,144.9941
6,3,2.0,245.0,210.0,1910.0,-37.8024,144.9993
7,2,1.0,256.0,107.0,1890.0,-37.806,144.9954


Scikit-learn is the library most commonly used for this kind of data modeling.
Building and using a model involves **four steps**:
- Define: Choose the type of model and set some of its parameters.
- Fit: Capture patterns from the observed data. This is the heart of data modeling.
- Predict: Use the fitted model to estimate the target for given feature values.
- Evaluate: Determine how accurate the model's predictions are.
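
The four steps above can be sketched end to end on synthetic data. Everything here (the `X_demo`/`y_demo` arrays, the `max_depth=4` setting, the 150/50 split) is an assumption chosen for illustration, not the notebook's own setup.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on two features plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 0.5, size=200)

# Define: pick the model type and its parameters.
demo_model = DecisionTreeRegressor(max_depth=4, random_state=0)

# Fit: learn patterns from the first 150 rows.
demo_model.fit(X_demo[:150], y_demo[:150])

# Predict: estimate the target for the remaining 50 rows.
demo_preds = demo_model.predict(X_demo[150:])

# Evaluate: compare predictions with the known answers.
demo_mae = mean_absolute_error(y_demo[150:], demo_preds)
print(demo_mae)
```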

In [11]:
# Create a decision tree model with sklearn and fit it to the features and target selected above.

from sklearn.tree import DecisionTreeRegressor

melbourne_model = DecisionTreeRegressor(random_state=1)

melbourne_model.fit(x,y)

In [12]:
print('Predictions for the first 10 rows of the dataset')
print(y.head(10))
# Print each prediction on its own line; a plain loop avoids the stray
# list of None values that a list comprehension would leave as cell output.
for prediction in melbourne_model.predict(x.head(10)):
    print(prediction)

Predictions for the first 10 rows of the dataset
1     1035000.0
2     1465000.0
4     1600000.0
6     1876000.0
7     1636000.0
9     1097000.0
11    1350000.0
12     750000.0
15    1310000.0
16    1200000.0
Name: Price, dtype: float64
1035000.0
1465000.0
1600000.0
1876000.0
1636000.0
1097000.0
1350000.0
750000.0
1310000.0
1200000.0

---

## Model Validation

Almost every model should be validated before its predictions are trusted. The most common measure of model quality is **predictive accuracy**: how close the model's predictions are to what actually happens.

### Mean Absolute Error

There are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error (also called MAE).** Let's break down this metric starting with the last word, error.

The prediction error for each house is:

**error=actual−predicted**

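So, if a house costs 150,000 and you **predicted it would cost 100,000, the error is 50,000.** MAE takes the absolute value of each error, so that positive and negative errors do not cancel out, and then averages those absolute errors.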
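
As a worked example of the metric, the sketch below computes each error, takes its absolute value, and averages the results. The prices are made up; the first pair is the 150,000 / 100,000 house from the text.

```python
# Hypothetical actual and predicted prices for three houses.
actual = [150000.0, 200000.0, 320000.0]
predicted = [100000.0, 210000.0, 300000.0]

# error = actual - predicted, computed per house
errors = [a - p for a, p in zip(actual, predicted)]

# Absolute errors: the sign is dropped so over- and under-predictions both count.
abs_errors = [abs(e) for e in errors]

# Mean absolute error: the average of the absolute errors.
mae = sum(abs_errors) / len(abs_errors)

print(errors)
print(mae)
```

This is the same quantity that `mean_absolute_error` from `sklearn.metrics` returns for the same inputs.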

In [13]:
from sklearn.metrics import mean_absolute_error # Importing mean absolute error test

In [14]:
predicted_home_prices = melbourne_model.predict(x)

In [15]:
mean_absolute_error(y,predicted_home_prices)

434.71594577146544

In [16]:
# The score above is misleading: the model was evaluated on the same rows it was trained on,
# so it says little about how the model performs on data it has not seen.
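
To see why scoring on the training rows is misleading, here is a small sketch on synthetic data. The data, the noise level and the variable names are assumptions for illustration. An unrestricted decision tree can memorise its training rows, so its in-sample error says nothing about new data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature, two irrelevant ones, noisy target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=300)

# By default train_test_split holds out 25% of the rows for validation.
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# Error on rows the tree has already seen vs. rows it has not.
in_sample_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
held_out_mae = mean_absolute_error(y_va, tree.predict(X_va))

print(in_sample_mae, held_out_mae)
```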

In [17]:
from sklearn.model_selection import train_test_split # Import the function that splits the dataset into training and validation sets.

In [18]:
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)

In [19]:
melbourne_model = DecisionTreeRegressor()

In [20]:
melbourne_model.fit(train_x, train_y)

In [21]:
val_predictions = melbourne_model.predict(val_x)

In [22]:
mean_absolute_error(val_y, val_predictions)

254666.26016785024

---

## Underfitting and Overfitting

**Overfitting** happens when the model fits the training data almost perfectly but performs poorly on the validation set.

**Underfitting** happens when the model fails to capture important patterns in the data, so it performs poorly even on the training set.

<img title="a title" alt="Alt text" src="AXSEOfI.png">

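For the decision tree used throughout this notebook, a common remedy is to **limit the size of the tree**, for example by passing the **max_leaf_nodes** argument to the DecisionTreeRegressor constructor (`max_depth` limits the depth directly). More leaves allow a more detailed fit and risk overfitting; fewer leaves give a coarser fit and risk underfitting.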
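
One way to pick a value is to try several and compare validation MAE. The sketch below does this on synthetic data; the helper `get_mae`, the candidate values and the data itself are assumptions for illustration, not results from the Melbourne dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, X_tr, X_va, y_tr, y_va):
    """Fit a tree limited to max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, model.predict(X_va))


# Synthetic data: a non-linear signal in the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = 5.0 * np.sin(X_demo[:, 0]) + rng.normal(0, 1.0, size=400)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

# Validation MAE for a few tree sizes, from very coarse to very detailed.
scores = {n: get_mae(n, X_tr, X_va, y_tr, y_va) for n in [2, 5, 25, 100, 300]}
best_size = min(scores, key=scores.get)

for n, score in scores.items():
    print(n, round(score, 3))
print("best max_leaf_nodes:", best_size)
```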

## Random Forests
A **random forest** builds many trees and averages the predictions made by each component tree. It usually has much better predictive accuracy than a single decision tree (i.e. a lower MAE).
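
The averaging can be checked directly, since a fitted `RandomForestRegressor` exposes its component trees in the `estimators_` attribute. This is a minimal sketch on synthetic data; the data and the choice of 10 trees are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Predictions of each individual tree for the first five rows: shape (10, 5).
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# Averaging across trees should reproduce the forest's own prediction.
manual_average = per_tree.mean(axis=0)
forest_preds = forest.predict(X_demo[:5])

print(np.allclose(manual_average, forest_preds))
```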

In [24]:
from sklearn.ensemble import RandomForestRegressor


forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_x, train_y)
melb_preds = forest_model.predict(val_x)
mean_absolute_error(val_y, melb_preds)

191669.7536453626