# INTRO TO MACHINE LEARNING

## Using pandas to get familiar with your data

In [3]:
import pandas as pd


melbourne_file_path='melb_data.csv'
melbourne_data=pd.read_csv(melbourne_file_path)
melbourne_data.describe()

In [5]:
melbourne_data=pd.read_csv("melb_data.csv")

In [6]:
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [7]:
# just dropping missing values for now
melbourne_data=melbourne_data.dropna(axis=0)
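
#### Before dropping rows, it can help to see how much data is missing. The sketch below uses a small made-up DataFrame (the name `toy` and all of its values are illustrative and not taken from melb_data.csv) to show what dropna(axis=0) removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real file.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
    "Price": [500000.0, 750000.0, 900000.0, 650000.0],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# axis=0 drops every row that contains at least one missing value.
toy_clean = toy.dropna(axis=0)
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

#### Dropping rows is the simplest option, but it discards every house that has a gap in any column, so the cleaned data can be much smaller than the original.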

### Selecting The Prediction Target

#### You can pull out a variable with dot notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data. We'll use dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y. So the code we need to save the house prices in the Melbourne data is:

In [10]:
y=melbourne_data.Price
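
#### As a minimal sketch of what dot notation returns, the example below builds a two-row DataFrame with made-up values (the name `toy` is hypothetical) and checks that dot notation and bracket notation select the same Series.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the Melbourne file.
toy = pd.DataFrame({"Rooms": [2, 3], "Price": [500000.0, 750000.0]})

y_dot = toy.Price          # dot notation
y_bracket = toy["Price"]   # bracket notation selects the same column

print(type(y_dot).__name__)
print(y_dot.equals(y_bracket))
```

#### Bracket notation is the safer choice when a column name contains spaces or clashes with a DataFrame method name, because dot notation cannot reach such columns.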

### Choosing 'Features'

#### The columns that are passed into our model (and later used to make predictions) are called "features." By convention, this data is called X.

In [13]:
melbourne_features=['Rooms','Bathroom','Landsize','Lattitude','Longtitude']

In [14]:
X=melbourne_data[melbourne_features]

In [15]:
X.describe()

          Rooms  Bathroom    Landsize   Lattitude  Longtitude
count    6196.0    6196.0      6196.0      6196.0      6196.0
mean   2.931407   1.57634   471.00694  -37.807904  144.990201
std    0.971079  0.711362  897.449881     0.07585    0.099165
min         1.0       1.0         0.0   -38.16492   144.54237
25%         2.0       1.0       152.0  -37.855438  144.926198
50%         3.0       1.0       373.0   -37.80225    144.9958
75%         4.0       2.0       628.0    -37.7582    145.0527
max         8.0       8.0     37000.0   -37.45709   145.52635


In [16]:
X.head()

   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954


### Building Your Model

#### The steps to building and using a model are:

##### 1. Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
##### 2. Fit: Capture patterns from the provided data. This is the heart of modeling.
##### 3. Predict: Just what it sounds like.
##### 4. Evaluate: Determine how accurate the model's predictions are.
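
#### The four steps above can be sketched end to end on synthetic data. Everything in this example (the variable names `rooms` and `price` and the way the prices are generated) is an assumption made for illustration, not part of the Melbourne dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only): price rises with the number of rooms.
rng = np.random.default_rng(0)
rooms = rng.integers(1, 6, size=200).reshape(-1, 1)
price = 200_000 * rooms.ravel() + rng.normal(0, 20_000, size=200)

model = DecisionTreeRegressor(random_state=1)   # 1. Define
model.fit(rooms, price)                         # 2. Fit
predictions = model.predict(rooms)              # 3. Predict
mae = mean_absolute_error(price, predictions)   # 4. Evaluate
print(f"MAE on the data used for fitting: {mae:,.0f}")
```

#### Note that this sketch evaluates on the same rows it was fitted on, which keeps it short but gives an optimistic score.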

In [19]:
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure the same results each run.
melbourne_model=DecisionTreeRegressor(random_state=1)
# Fit model
melbourne_model.fit(X,y)


In [20]:
print(" Making predictions for the following 5 houses: ")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

 Making predictions for the following 5 houses: 
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]


## Where are they located?

In [22]:
import folium
import pandas as pd

# The first 5 houses and the prices the decision tree predicted for them
data = {
    'Latitude': [-37.8079, -37.8093, -37.8072, -37.8024, -37.8060],
    'Longitude': [144.9934, 144.9944, 144.9941, 144.9993, 144.9954],
    'Price Prediction': [1035000, 1465000, 1600000, 1876000, 1636000]
}

df = pd.DataFrame(data)

# Create a base map centered around Melbourne
melbourne_map = folium.Map(location=[-37.8136, 144.9631], zoom_start=13)

# Add markers for each house
for idx, row in df.iterrows():
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=f"Predicted Price: ${row['Price Prediction']:,.0f}",
        icon=folium.Icon(color="blue", icon="home")
    ).add_to(melbourne_map)

# Display map
melbourne_map


## Model Validation

### What is Model Validation

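#### There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Let's break down this metric starting with the last word, error. The prediction error for each house is:
##### error = actual − predicted
#### With the MAE metric, we take the absolute value of each error, which converts it to a positive number, and then take the average of those absolute errors.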
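
#### As a small worked example with made-up prices (not taken from the dataset), the sketch below computes each error, takes the absolute values, averages them, and checks the result against scikit-learn's mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted prices for three houses.
actual = np.array([150_000, 200_000, 250_000])
predicted = np.array([100_000, 210_000, 250_000])

error = actual - predicted               # [50000, -10000, 0]
mae_by_hand = np.mean(np.abs(error))     # (50000 + 10000 + 0) / 3 = 20000
mae_sklearn = mean_absolute_error(actual, predicted)
print(mae_by_hand, mae_sklearn)
```

#### Taking absolute values stops positive and negative errors from cancelling each other out in the average.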

### To calculate the MAE, we first need a model.

In [27]:
# Data loading code 
import pandas as pd
# load the data
melbourne_file_path="melb_data.csv"
melbourne_data=pd.read_csv(melbourne_file_path)
# Drop rows that have a missing value in any column
filtered_melbourne_data=melbourne_data.dropna(axis=0)
# Choose target and features
y=filtered_melbourne_data.Price
melbourne_features=['Rooms','Bathroom','Landsize','BuildingArea','YearBuilt','Lattitude','Longtitude']
X=filtered_melbourne_data[melbourne_features]
from sklearn.tree import DecisionTreeRegressor 
# Define model
melbourne_model=DecisionTreeRegressor()
#Fit model
melbourne_model.fit(X,y)

#### Now that we have a model, we can calculate the mean absolute error:

In [29]:
from sklearn.metrics import mean_absolute_error
predicted_home_prices=melbourne_model.predict(X)
mean_absolute_error(y,predicted_home_prices)

434.71594577146544

## The Problem with "In-Sample" scores

#### The MAE above was computed on the same data that was used to fit the model, which makes it an "in-sample" score. Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use it to test the model's accuracy on data it hasn't seen before. This data is called validation data.

### Coding It

#### The scikit-learn library has a function, train_test_split, to break up the data into two pieces. We'll use some of that data as training data to fit the model, and we'll use the other data as validation data to calculate mean_absolute_error.

In [34]:
from sklearn.model_selection import train_test_split
# Split data into training and validation data, for both features and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X,val_X,train_y,val_y=train_test_split(X,y,random_state=0)
# define the model
melbourne_model=DecisionTreeRegressor()
#Fit the model
melbourne_model.fit(train_X,train_y)
# Get predicted prices on validation data
val_predictions=melbourne_model.predict(val_X)
print(mean_absolute_error(val_y,val_predictions))

260282.7785668173
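
#### To see what train_test_split does on its own, here is a minimal sketch on 100 made-up rows. The names prefixed with `toy_` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(100).reshape(-1, 1)   # 100 made-up feature rows
toy_y = np.arange(100)                  # 100 made-up targets

# With no size arguments, 25% of the rows are held out for validation.
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=0
)
print(toy_train_X.shape, toy_val_X.shape)

# Passing the same random_state reproduces the same split.
_, toy_val_X_again, _, _ = train_test_split(toy_X, toy_y, random_state=0)
print(np.array_equal(toy_val_X, toy_val_X_again))
```

#### The test_size argument changes the held-out share if the default quarter is not what you want.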


## Underfitting and Overfitting

### Experimenting with different models

#### The in-sample MAE above was about 435, while the MAE on validation data was more than 260,000. This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly on validation and other new data.
#### When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on the training data, that is called underfitting.
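
#### Both failure modes can be reproduced on synthetic data. The sketch below is an illustration under assumptions: the noisy sine data and all variable names are made up. It compares a tree limited to two leaves with an unrestricted tree.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (illustrative only).
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(400, 1))
toy_y = np.sin(toy_X).ravel() + rng.normal(0, 0.3, size=400)
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)

results = {}
for label, leaves in [("underfit", 2), ("unrestricted", None)]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    # Store (training MAE, validation MAE) for each tree.
    results[label] = (
        mean_absolute_error(tr_y, tree.predict(tr_X)),
        mean_absolute_error(va_y, tree.predict(va_X)),
    )
    print(label, results[label])
```

#### The unrestricted tree memorizes the training rows, so its training error is essentially zero while its validation error is not. The two-leaf tree has a clearly nonzero error even on the rows it was trained on.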

### Example:
#### We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

In [39]:
from sklearn.metrics import mean_absolute_error  # to evaluate model performance
from sklearn.tree import DecisionTreeRegressor  # the model used for predictions

def get_mae(max_leaf_nodes,train_X,val_X,train_y,val_y):
    # max_leaf_nodes: how complex the tree can be
    # train_X, val_X: training and validation features
    # train_y, val_y: training and validation labels
    # Create a decision tree with the given number of leaf nodes (controls overfitting);
    # random_state=0 ensures reproducibility.
    model=DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes,random_state=0)
    model.fit(train_X,train_y)  # train the model on the training data
    preds_val=model.predict(val_X)  # predict on the validation features
    # Average absolute difference between the predictions and the true values
    mae=mean_absolute_error(val_y,preds_val)
    return mae

In [40]:
# Data loading code 
import pandas as pd
# load the data
melbourne_file_path="melb_data.csv"
melbourne_data=pd.read_csv(melbourne_file_path)
# Drop rows that have a missing value in any column
filtered_melbourne_data=melbourne_data.dropna(axis=0)
# Choose target and features
y=filtered_melbourne_data.Price
melbourne_features=['Rooms','Bathroom','Landsize','BuildingArea','YearBuilt','Lattitude','Longtitude']
X=filtered_melbourne_data[melbourne_features]
from sklearn.model_selection import train_test_split
# Split data into training and validation data, for both features and target.
train_X,val_X,train_y,val_y=train_test_split(X,y,random_state=0)



In [41]:
# Compare MAE with differing values of max_leaf_nodes.
for max_leaf_nodes in [5,50,500,5000]:
    my_mae=get_mae(max_leaf_nodes,train_X,val_X,train_y,val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error:%d" %(max_leaf_nodes,my_mae))

Max leaf nodes: 5 		 Mean Absolute Error:347380
Max leaf nodes: 50 		 Mean Absolute Error:258171
Max leaf nodes: 500 		 Mean Absolute Error:243495
Max leaf nodes: 5000 		 Mean Absolute Error:255575
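
#### The validation MAE falls from 5 to 500 leaves and rises again at 5000, so the smallest tree underfits and the largest one overfits. A minimal sketch of picking the best size programmatically, using the four scores printed above (the names `scores` and `best_tree_size` are hypothetical):

```python
# Validation MAE for each max_leaf_nodes value, copied from the printed output.
scores = {5: 347380, 50: 258171, 500: 243495, 5000: 255575}

# Select the leaf count whose MAE is smallest.
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)
```

#### In the notebook itself, the same dictionary could be built by calling get_mae for each candidate value instead of typing the scores in.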


## Random Forest

### The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree.
#### We build a random forest model similarly to how we built a decision tree in scikit-learn - this time using the RandomForestRegressor class instead of DecisionTreeRegressor.

In [44]:
from sklearn.ensemble import RandomForestRegressor  # the random forest model
from sklearn.metrics import mean_absolute_error  # function to calculate mean absolute error (MAE)
forest_model=RandomForestRegressor(random_state=1)  # fixed random_state for reproducibility
forest_model.fit(train_X,train_y)  # fit the model on the training features and target
melb_preds=forest_model.predict(val_X)  # predict on the validation set
print(mean_absolute_error(val_y,melb_preds))  # calculate and print the validation MAE

191669.7536453626
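
#### To make the averaging concrete, the sketch below fits a small forest on synthetic data (all names and values are made up for illustration) and checks that the forest's prediction equals the mean of the predictions of its individual trees, which scikit-learn exposes through the estimators_ attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(200, 2))
toy_y = 3 * toy_X[:, 0] + toy_X[:, 1] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(toy_X, toy_y)

# Average the predictions of the individual trees by hand.
per_tree = np.array([tree.predict(toy_X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(toy_X[:5])))
```

#### Because each tree is trained on a different random sample of the rows, their individual errors partly cancel when averaged, which is one reason the forest's validation MAE above is lower than the single tree's.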
