# Your First Machine Learning Model

In [12]:
import pandas as pd

melbourne_file_path = "./melb_data.csv"
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


## Selecting Data For Modeling

You can pull out a single variable (column) using **dot-notation**. The column is stored in a *Series*, which is broadly like a DataFrame with only a single column of data.

We will use *dot notation* to select a column that we want to predict. This is called the *prediction target*. By convention, we will call it y.
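
As a minimal sketch of column selection, the snippet below builds a tiny, made-up DataFrame (the name `toy_data` and its values are illustrative assumptions, not taken from the Melbourne data) and checks that dot notation and bracket notation return the same Series.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [500000.0, 750000.0, 900000.0],
})

# Dot notation and bracket notation select the same column.
price_dot = toy_data.Price
price_bracket = toy_data["Price"]

print(type(price_dot).__name__)          # the selected column is a Series
print(price_dot.equals(price_bracket))   # both notations give the same data
```

Dot notation only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute; bracket notation works for any column name.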

In [13]:
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

## Drop missing values

The data has missing values (some variables were not recorded for some houses). We will use "dropna", which drops the rows that contain missing values.

In [14]:
melbourne_data = melbourne_data.dropna(axis=0)
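
To make the effect of `dropna(axis=0)` concrete, here is a small sketch on invented data (the `toy_data` frame is an assumption for illustration only). Any row with at least one missing value is removed, and the surviving rows keep their original index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two incomplete rows (index 1 and 2).
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, 210.0],
    "YearBuilt": [1900.0, 1970.0, np.nan, 2005.0],
})

# axis=0 drops rows (not columns) that contain any missing value.
complete_rows = toy_data.dropna(axis=0)

print(len(toy_data), "rows before,", len(complete_rows), "rows after")
print(complete_rows.index.tolist())  # original labels are kept, not renumbered
```

This is the same mechanism that reduces the Melbourne data from 13580 rows to 6196 rows, and it explains why the remaining rows have gaps in their index labels.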

## Selecting the Prediction Target

To select a single column of data from a DataFrame, we can use **dot notation** and save it as a Series object. In this case, we are selecting the column that we want to use as the prediction target, which is conventionally referred to as **"y"**. Therefore, to save the house prices from the Melbourne dataset, we can use the following code.

In [15]:
y = melbourne_data.Price

## Choosing Features

The inputs to our model, which are used to make predictions, are referred to as "features." In our case, the features are the columns used to determine the home price. Depending on the situation, we may use all columns except the target as features or we may choose to use only a subset of features.

For the current model, we will use only a few features.

We can select multiple features by listing their column names:

In [16]:
melbourne_features = ["Rooms", "Bathroom", "Landsize", "Lattitude", "Longtitude"]

By convention, the DataFrame that holds the feature columns is called **X**.

In [22]:
X = melbourne_data[melbourne_features]

The code **melbourne_data[melbourne_features]** is used to create a new DataFrame **X** containing only the columns of the original DataFrame **melbourne_data** that are listed in the melbourne_features list.

This line of code uses square bracket notation to subset the melbourne_data DataFrame and select only the columns that are present in the melbourne_features list. This new DataFrame X will be used as the input (i.e., feature matrix) for the machine learning model that we will build to predict home prices.
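
A short sketch of the difference between the two kinds of bracket selection, using invented data (`toy_data` and `toy_features` are hypothetical names): passing a list of column names returns a DataFrame, while passing a single name returns a Series.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Price": [500000.0, 750000.0, 900000.0],
})

toy_features = ["Rooms", "Bathroom"]

X_toy = toy_data[toy_features]   # list of names -> DataFrame with those columns
rooms = toy_data["Rooms"]        # single name   -> Series

print(type(X_toy).__name__, X_toy.shape)
print(type(rooms).__name__, rooms.shape)
```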

In [24]:
X.describe()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | -37.45709 | 145.52635 |


In [25]:
X.head()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 4 | 4 | 1.0 | 120.0 | -37.8072 | 144.9941 |
| 6 | 3 | 2.0 | 245.0 | -37.8024 | 144.9993 |
| 7 | 2 | 1.0 | 256.0 | -37.806 | 144.9954 |


## Building Your Model

To create machine learning models, we will be using the scikit-learn library, commonly abbreviated as sklearn. Scikit-learn is a widely-used library for building models on tabular (i.e., spreadsheet-like) data such as DataFrames.

There are four main steps involved in building and using a machine learning model:

1. **Define**: We need to determine what type of model we want to use and set any relevant parameters. For example, we might choose to use a decision tree model, and we would need to specify the maximum depth of the tree.
2. **Fit**: This is the process of training the model on a given set of input data, allowing it to learn the patterns in the data.
3. **Predict**: After the model has been trained, we can use it to make predictions on new data.
4. **Evaluate**: Finally, we need to determine how accurate the model's predictions are by comparing them to the true values (i.e., the targets) for a set of data that was not used in training. There are several metrics that can be used to evaluate model performance, such as mean squared error or R-squared.
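
The four steps can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with NumPy, not the Melbourne data), the names `X_demo`, `y_demo`, and `demo_model` are hypothetical, and `max_depth=4` is an arbitrary choice that echoes the maximum-depth example in the Define step.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): two features, noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 1, size=200)

# Hold back some rows so the evaluation uses data not seen in training.
train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=0)

# 1. Define: choose the model type and its parameters.
demo_model = DecisionTreeRegressor(max_depth=4, random_state=0)

# 2. Fit: learn patterns from the training data.
demo_model.fit(train_X, train_y)

# 3. Predict: generate predictions for rows the model has not seen.
val_preds = demo_model.predict(val_X)

# 4. Evaluate: compare predictions with the true targets.
val_mae = mean_absolute_error(val_y, val_preds)
print(f"Validation MAE: {val_mae:.2f}")
```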

In [28]:
from sklearn.tree import DecisionTreeRegressor

melbourne_model = DecisionTreeRegressor(random_state=1)

melbourne_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

The first line, **from sklearn.tree import DecisionTreeRegressor**, imports the DecisionTreeRegressor class from the scikit-learn tree module. This class will be used to define and train a decision tree regression model.

The second line, **melbourne_model = DecisionTreeRegressor(random_state=1)**, creates an instance of the DecisionTreeRegressor class and assigns it to the variable melbourne_model. Model training can involve some randomness, so setting **random_state** to a fixed number (here 1) ensures that the results are the same on every run.

The third line, **melbourne_model.fit(X, y)**, trains the decision tree regression model on the input features X and target variable y. During fitting, the model learns patterns in the training data that relate the input features to the target variable. Once the model is trained, we can use it to make predictions on new data.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.


In [29]:
print("Making predictions for 5 houses: ")
print(X.head())
print("Predictions are: ")
print(melbourne_model.predict(X.head()))

Making predictions for 5 houses: 
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
Predictions are: 
[1035000. 1465000. 1600000. 1876000. 1636000.]


## Model Validation

When evaluating a model, the measure that matters most is its predictive accuracy. A common mistake is to make predictions with the training data and compare them to the target values in that same data. This overstates the model's accuracy, because the model has already seen those targets; a better approach is described later in this section.

Comparing predicted and actual prices house by house would produce a long list that is hard to interpret, so we summarize the model's quality in a single number. One such metric is Mean Absolute Error (MAE): take the absolute value of each prediction error (i.e., the difference between the actual value and the predicted value) and average those absolute errors.

For example, if the actual value of a house is 150,000 dollars and the model predicts it to be 100,000 dollars, the prediction error would be 50,000 dollars. By taking the absolute value of this error, we get a positive number of 50,000 dollars. We would do this for each prediction and actual value pair, then calculate the average of all the absolute errors to get the MAE.

The resulting MAE value tells us, on average, how much our model's predictions are off. A smaller MAE value indicates better predictive accuracy.
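
The calculation can be checked by hand on a few invented values. In this sketch, the `actual` and `predicted` arrays are made-up numbers (the first pair matches the 150,000 versus 100,000 dollar example above), and the manual result is compared against scikit-learn's `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented prices for illustration; the first pair mirrors the example in the text.
actual = np.array([150000.0, 200000.0, 320000.0, 410000.0])
predicted = np.array([100000.0, 210000.0, 300000.0, 410000.0])

# Absolute error per house: 50000, 10000, 20000, 0
abs_errors = np.abs(actual - predicted)

# MAE is the average of the absolute errors: 80000 / 4 = 20000
mae_by_hand = abs_errors.mean()
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand, mae_sklearn)
```

Taking absolute values first means that over-predictions and under-predictions cannot cancel each other out.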

In [5]:
import pandas as pd

melbourne_file_path = './melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)

filtered_melbourne_data = melbourne_data.dropna(axis=0)

y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.tree import DecisionTreeRegressor

melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(X, y)

DecisionTreeRegressor()






Now that we have the model, let's calculate the mean absolute error:

In [6]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

434.71594577146544


When evaluating a model's performance, it is important to consider its ability to make accurate predictions on **new data**, rather than just the data used to build the model. The evaluation metric calculated on the same data used to build the model is known as an **"in-sample"** score.

However, relying solely on in-sample scores can be misleading. For example, a model may identify a pattern in the training data that does not hold true in the larger population. This can result in **inaccurate predictions** when the model is used on new data.

To avoid this problem, it is common practice to set aside some data from the training set to use as a validation set. The model is built on the remaining data, and its performance is evaluated on the validation set. This provides a more realistic estimate of the model's performance on new, unseen data.
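
The gap between an in-sample score and a validation score can be illustrated with a small sketch. The data here is synthetic and the variable names are hypothetical; the only point is that an unrestricted decision tree scores far better on rows it was trained on than on held-out rows.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): noisy target driven by one feature.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(0, 2, size=300)

# Set aside validation rows that the model never sees during fitting.
train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)

# In-sample score: predictions on the same rows used for training.
in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))

# Validation score: predictions on held-out rows.
validation_mae = mean_absolute_error(val_y, model.predict(val_X))

print(f"In-sample MAE:  {in_sample_mae:.3f}")
print(f"Validation MAE: {validation_mae:.3f}")
```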



*Data sourced from Kaggle*

# Underfitting and Overfitting

In practical applications, it's common for a decision tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets **deeper**, the dataset gets divided into smaller groups of houses, resulting in fewer houses in each leaf. If there are only a few houses in a leaf, the predictions may be accurate for those houses, but not for new data, which can cause **overfitting**. However, if the tree is **too shallow**, the houses won't be divided into distinct enough groups, causing **underfitting**. If a tree divides houses into only 2 or 4 groups, the predictions may be inaccurate even for the training data. Therefore, finding the right balance in dividing the houses into groups is crucial for **accurate predictions**.
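
A rough sketch of how depth relates to group size, using synthetic data rather than the Melbourne data (all names and values are assumptions). It fits trees of increasing `max_depth` and reports how many leaves each has and how many training rows fall into a leaf on average.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): 1000 rows, two features.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(1000, 2))
y_demo = X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 1, size=1000)

leaf_counts = {}
for depth in [1, 2, 5, 10]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_demo, y_demo)
    n_leaves = tree.get_n_leaves()
    leaf_counts[depth] = n_leaves
    # A tree of depth d can have at most 2**d leaves.
    avg_rows = len(X_demo) / n_leaves
    print(f"max_depth={depth:2d}  leaves={n_leaves:4d}  avg rows per leaf={avg_rows:.1f}")
```

A shallow tree leaves hundreds of rows in each group, while a deep tree leaves only a handful, which is the situation in which predictions start to follow noise in the training data.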

### Example

In [16]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree limited to max_leaf_nodes leaves and score it on the validation set
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

The function takes five arguments:

1. **"max_leaf_nodes"**: an integer specifying the maximum number of leaf nodes for the decision tree model.
2. **"train_X"**: a Pandas DataFrame containing the features of the training set.
3. **"val_X"**: a Pandas DataFrame containing the features of the validation set.
4. **"train_y"**: a Pandas Series containing the target values of the training set.
5. **"val_y"**: a Pandas Series containing the target values of the validation set.

Within the function, a decision tree regression model is initialized with the specified maximum number of leaf nodes and a fixed random seed of 0 using the DecisionTreeRegressor class from scikit-learn. The model is then trained on the training set using the fit() method of the model object. The predict() method is used to generate predictions on the validation set, and the mean absolute error between the actual target values and the predicted target values is computed using the mean_absolute_error() function from scikit-learn.

Finally, the computed MAE is returned by the function as a float value. The purpose of this function is to allow for the comparison of the performance of different decision tree models with varying maximum number of leaf nodes on a validation set, which can be useful in tuning the hyperparameters of the model.

In [17]:
import pandas as pd
    
melbourne_file_path = 'melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 

filtered_melbourne_data = melbourne_data.dropna(axis=0)

y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)

First, the Pandas library is imported as "pd". The file path to the Melbourne dataset is stored in a variable called "melbourne_file_path". Then, the dataset is loaded into a Pandas DataFrame called "melbourne_data" using the read_csv() function.

The next line drops any rows with **missing values** from the dataset using the dropna() method with the argument "axis=0". This creates a new DataFrame called "filtered_melbourne_data" that contains only the rows with **complete data**.

The **target variable**, which is the **price of the houses** in the dataset, is assigned to a Pandas Series called **"y"**. The **list** of features that will be used as **predictor variables** in the machine learning model is assigned to a list called "melbourne_features". Then, the predictor variables are extracted from the filtered dataset using the indexing operator, and are **stored in** a new DataFrame called **"X"**.

The next step is to split the data into **training** and **validation** sets using the train_test_split() function from scikit-learn. The function takes the **predictor variables (X)** and the **target variable (y)** as arguments, along with the "random_state" parameter set to 0 for reproducibility of the split. The function returns four sets of data: the **training set of predictor variables** and **target variable (train_X and train_y)**, and the **validation set of predictor variables** and **target variable (val_X and val_y)**.

This technique of splitting the data into training and validation sets is a common practice in machine learning to evaluate the performance of the model on data it has not seen before. The training set is used to fit the model to the data, while the validation set is used to evaluate how well the model generalizes to new data.
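
The behavior of `train_test_split` can be seen on a small made-up array. This sketch assumes the default split size (scikit-learn holds out 25% of the rows when neither `test_size` nor `train_size` is given) and shows that a fixed `random_state` reproduces the same split and that features and targets stay aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data for illustration: row i of X_demo is [2*i, 2*i + 1], and its target is i.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=0)
print(train_X.shape, val_X.shape)  # 75 training rows and 25 validation rows

# Rows are shuffled, but each feature row keeps its matching target.
print(np.array_equal(train_X[:, 0] // 2, train_y))

# The same random_state gives the same split again.
train_X2, val_X2, train_y2, val_y2 = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(val_X, val_X2))
```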

In [18]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  347380
Max leaf nodes: 50  		 Mean Absolute Error:  258171
Max leaf nodes: 500  		 Mean Absolute Error:  243495
Max leaf nodes: 5000  		 Mean Absolute Error:  254983


Of the options listed, **500** is the optimal number of leaves.
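
Picking the best candidate does not have to be done by eye. As a small sketch, the dictionary below holds the MAE values copied from the loop output above (the name `scores` is a hypothetical choice), and `min` with a key function returns the tree size with the lowest validation error.

```python
# Validation MAE for each candidate, copied from the printed loop output.
scores = {5: 347380, 50: 258171, 500: 243495, 5000: 254983}

# min() iterates over the dictionary keys and compares them by their MAE value.
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)
```

In a live notebook, the dictionary could instead be built by calling get_mae once for each candidate size.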

## Conclusion

1. **Overfitting** occurs when a model captures noise or patterns that exist only in the training data, leading to poor performance in making predictions on new data.
2. **Underfitting** occurs when a model is too simple to capture the underlying patterns in the data, also leading to poor performance. 

To guard against both problems, we evaluate the model on validation data that it has not seen during training. This allows us to try out various candidate models and keep the one with the lowest validation error.

# Random Forests


When building a decision tree, there is a trade-off between **overfitting** and **underfitting**. A **deep tree** with many leaves may overfit the training data, as it relies on information from only a few houses at each leaf. On the other hand, a **shallow tree** with few leaves may underfit the data and miss important distinctions.

Even the most advanced modeling techniques still struggle with this trade-off. However, some models have found ways to achieve better performance. One such model is the **random forest**, which uses **multiple trees** and **averages their predictions** to make a final prediction. This approach often results in much better predictive accuracy than a single decision tree. Additionally, random forests tend to perform well with their **default parameters**. More advanced models can be trained for even better performance, but this often requires careful parameter tuning.
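
The averaging step can be verified directly. This sketch uses synthetic data and a deliberately small forest (`n_estimators=10` is an arbitrary choice for illustration); it relies on the fitted trees being exposed through scikit-learn's `estimators_` attribute and checks that the forest's prediction equals the mean of the individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(X_demo, y_demo)

new_rows = X_demo[:5]

# Predict with every tree separately: one row of predictions per tree.
per_tree = np.array([tree.predict(new_rows) for tree in forest.estimators_])

# Averaging across trees reproduces the forest's own prediction.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(new_rows)

print(per_tree.shape)
print(np.allclose(manual_average, forest_prediction))
```

Each tree is trained on a different random sample of the rows, so the individual trees make different errors and the average tends to be more stable than any single tree.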

In [20]:
import pandas as pd
    
melbourne_file_path = './melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 

melbourne_data = melbourne_data.dropna(axis=0)

y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)

We build a **random forest model** similarly to how we built a **decision tree** in scikit-learn - this time using the **RandomForestRegressor** class instead of DecisionTreeRegressor.

In [24]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state = 1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

191669.7536453626


This code imports the **RandomForestRegressor model** from the scikit-learn library, which is a type of **ensemble model** that trains **multiple decision trees** and **combines their predictions** to make a final prediction. It also imports the mean_absolute_error function from the same library, which is used to **measure the performance of the model**.

The code then creates an **instance** of the RandomForestRegressor model with a specified **random state** and **assigns** it to the variable **forest_model**. It fits the model to the **training data** using the **fit method**, which trains the model on the training set.

Next, the model makes **predictions** on the **validation set** using the **predict method**, and the mean_absolute_error function is used to **calculate the mean absolute error between the predicted values and the actual values in the validation set**. The mean absolute error is a common evaluation metric that measures the average absolute difference between the predicted values and the actual values. Finally, the mean absolute error is printed to the console.
