<a href="https://colab.research.google.com/github/HallsofLearningCoding/IAOI_Recap/blob/main/Week5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to Week 5! First and foremost, let's finish up last week's lesson on classification models. I asked you to look into two additional models, and I will briefly describe each.

## Decision Trees

A decision tree is like a flow chart that breaks down decisions. At each step, or node, an if statement is used to determine which branch of the tree comes next. Let's take a look at the Melbourne data again, and use it to create a decision tree.


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

#We'll be using this dataset from kagglehub
#To check to ensure that it was downloaded successfully, please click Files. It's to the left.
import kagglehub
path = kagglehub.dataset_download("dansbecker/melbourne-housing-snapshot")
#print("Path to dataset files:", path)


df = pd.read_csv(os.path.join(path, "melb_data.csv"))
df.head()
df.columns
melbourne_data = df.dropna(axis=0)

In [None]:
features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = df[features]
y = df.Price

X.describe()


|       | Rooms    | Bathroom | Landsize    | Lattitude  | Longtitude |
|-------|----------|----------|-------------|------------|------------|
| count | 13580.0  | 13580.0  | 13580.0     | 13580.0    | 13580.0    |
| mean  | 2.937997 | 1.534242 | 558.416127  | -37.809203 | 144.995216 |
| std   | 0.955748 | 0.691712 | 3990.669241 | 0.07926    | 0.103916   |
| min   | 1.0      | 0.0      | 0.0         | -38.18255  | 144.43181  |
| 25%   | 2.0      | 1.0      | 177.0       | -37.856822 | 144.9296   |
| 50%   | 3.0      | 1.0      | 440.0       | -37.802355 | 145.0001   |
| 75%   | 3.0      | 2.0      | 651.0       | -37.7564   | 145.058305 |
| max   | 10.0     | 8.0      | 433014.0    | -37.40853  | 145.52635  |


In [None]:
model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)

In [None]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     202.0   -37.7996    144.9984
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
3      3       2.0      94.0   -37.7969    144.9969
4      4       1.0     120.0   -37.8072    144.9941
The predictions are
[1480000. 1035000. 1465000.  850000. 1600000.]
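To make the "flow chart of if statements" idea concrete, here is a minimal sketch that prints the rules a small tree learns. The tiny dataset, the feature names `rooms` and `landsize`, and the variable names are made up for illustration; they are not the Melbourne data.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Tiny made-up dataset: [rooms, landsize] -> price (illustrative values only)
X_toy = [[1, 100], [2, 150], [3, 300], [4, 500]]
y_toy = [300_000, 450_000, 700_000, 1_000_000]

# Keep the tree shallow so the printed rules stay readable
toy_tree = DecisionTreeRegressor(max_depth=2, random_state=1)
toy_tree.fit(X_toy, y_toy)

# export_text renders the tree as nested "feature <= threshold" conditions
rules = export_text(toy_tree, feature_names=["rooms", "landsize"])
print(rules)
```

Each indented line in the printout is one branch of the flow chart, and each `value` line is the prediction at the end of that branch.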


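## Random Forests
As you might guess, a random forest is simply a collection of decision trees. Each tree is trained on a random sample of the data, and the forest combines the trees' predictions, which usually gives a more stable result than any single tree.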
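As a rough sketch of the "collection of trees" idea, the example below trains a small forest on synthetic data (the data, `X_demo`, `y_demo`, and the choice of 20 trees are assumptions for illustration) and checks that, for regression, the forest's prediction is the average of its trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for this example
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 0.5, size=200)

# A forest of 20 decision trees; random_state is the seed
forest = RandomForestRegressor(n_estimators=20, random_state=1)
forest.fit(X_demo, y_demo)

# Ask every individual tree for its prediction, then average them
sample = X_demo[:3]
per_tree = np.array([tree.predict(sample) for tree in forest.estimators_])
forest_pred = forest.predict(sample)

print(per_tree.mean(axis=0))  # average over the 20 trees
print(forest_pred)            # the forest's own prediction
```

The two printed arrays should match, which is all a regression forest does at prediction time.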

Now on to the real lesson!

Thus far we have created a number of different models, typically following the structure below:


*   Data Retrieval: Getting data
*   Data Preparation: Ensuring that data is combined and in a usable state
*   Data Exploration: Taking a look at the data to better understand trends, and determine the features that a model could use to make assumptions/predictions
*   Data Modelling: Creating the model!

However, there is a step that we didn't cover that is extremely important when creating a model. This step is _Evaluation_, i.e. measuring how effective a model is, and then adjusting the model where needed.

Today we will be looking into various methods of evaluating a model:


*   Model Evaluation Metrics
*   Confusion Matrix and ROC Curve
*   Cost Functions
*   Underfitting, Overfitting Theory
*   Hyperparameter Tuning
*   Cross-Validation Practice







Firstly, let's talk a bit about...

## Reproducibility
Generally, once we evaluate a model we may want to retrain it and re-evaluate it. However, comparing the results is only meaningful with reproducibility, i.e. being able to run the same code on the same data and get the same results. There are a few things that we can do to reduce the randomness that would give variable results.

#### Setting a seed
A seed is a number that fixes randomness so that results don't change each time. It is almost exactly like a Minecraft seed: if you want to save a Minecraft world that has been randomly generated, you would save the seed and use it again when creating your world.

You may have seen me do this a few times, for example in the following lines of code:

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(random_state=1)

```
Here, "random_state" would be the seed. This will usually be our means of increasing reproducibility, but I will give general information about other important factors.
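Here is a small sketch of what the seed buys us. The arrays `X_demo` and `y_demo` are made up for illustration; the point is only that splitting twice with the same `random_state` gives the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 columns
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Split twice with the same seed
_, test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same seed, same rows in the test set
same_split = np.array_equal(test_a, test_b)
print(same_split)
```

If you remove `random_state`, the rows that land in the test set can change from run to run, and so can any metric you compute on them.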

#### Keeping devices consistent
If you use different hardware (CPUs or GPUs) when running a model, your model's results may change slightly. Therefore, it is generally good to specify a device, and keep it consistent when using a model.


#### Keeping inference output consistent
Inference is when a trained model is used to make predictions on brand new data, or data never seen by the model. There are some aspects of the model that may perform differently during inference as opposed to training, therefore before attempting inference it is good to manually diminish these differences.

Now that we have this context, let's discuss model evaluation.

## Preparing to understand metrics
Please take a look [here](https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/) to better understand common problems when creating a model.


## Key Model Evaluation Metrics
I will give a very brief description of a few metrics below, however there is quite a bit of important information on them that I will not get the chance to share here.

As a result, I strongly recommend taking a deeper dive into evaluation metrics, for instance looking [here](https://www.geeksforgeeks.org/metrics-for-machine-learning-model/) and/or into other resources.

If you'd like a very (very) deep dive, you can look at documentation, for instance the scikit-learn docs [here](https://scikit-learn.org/stable/modules/model_evaluation.html).


When looking at these and other resources, of course, you should understand what these metrics mean and how to use them. However, it is **almost more important** to know when to use the metrics, and what the metrics look like when they are *good* or *bad* for the model.

### Accuracy
The accuracy of a model is the fraction of predictions that were "correct", i.e. the number of correct predictions divided by the total number of predictions.



```
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_true, y_pred)
```

### Precision
The precision of a model shows how many of the predicted positives were actually positive.


```
from sklearn.metrics import precision_score

precision_score(y_true, y_pred)
```

### Recall/Sensitivity
The recall of a model determines how many of the actual positives were correctly predicted.
```
from sklearn.metrics import recall_score

recall_score(y_true, y_pred)
```

Did it sound like I just said the same thing three times? Fair! These three terms are easy to confuse.  Use the metrics when you want to answer the following questions.

**Accuracy:** How often is the model right overall?

**Precision:** How often is the model right *when it predicts positive*?

**Recall:** How many of the *actual positives* did the model manage to find?
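To see how the three answers can differ on the same predictions, here is a small worked example in plain Python. The labels are made up (1 = positive, 0 = negative); only the formulas matter.

```python
# Made-up labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(y_true)  # right overall: 7 of 10
precision = tp / (tp + fp)          # right when predicting positive: 2 of 3
recall = tp / (tp + fn)             # actual positives found: 2 of 4

print(accuracy, precision, recall)
```

The model looks decent on accuracy (0.7), but it only finds half of the actual positives (recall 0.5), which is exactly the kind of gap these metrics are meant to expose.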


### F1-Score
The F1 score is the harmonic mean of precision and recall, so it is only high when both precision and recall are high.


```
from sklearn.metrics import f1_score

f1_score(y_true, y_pred)
```
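The formula behind `f1_score` is `2 * precision * recall / (precision + recall)`. The short sketch below, with made-up precision and recall values, compares it with a plain average to show how F1 punishes a lopsided model.

```python
# Made-up values: a model with great precision but poor recall
precision = 0.9
recall = 0.1

# A plain (arithmetic) average hides the weak recall
arithmetic_mean = (precision + recall) / 2

# F1 is the harmonic mean, which is pulled toward the smaller value
f1 = 2 * precision * recall / (precision + recall)

print(arithmetic_mean)  # 0.5
print(round(f1, 2))     # 0.18
```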


###  Confusion Matrix
Imagine a matrix as a tabular arrangement of numbers, with rows and columns.

The confusion matrix, in its simplest form, is a 2x2 matrix (meaning 2 rows and 2 columns) made up of the number of True Negatives (TN), False Negatives (FN), False Positives (FP) and True Positives (TP).

Note that the confusion matrix grows with the number of classes to predict: a problem with N classes gives an N x N matrix.



```
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(actual, predictions, normalize=None)

#You would definitely want to visualize this with a graph instead of just print(cm)
```
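As a runnable sketch of the snippet above, here are made-up values for `actual` and `predictions` (1 = positive, 0 = negative), along with how scikit-learn lays out the four counts.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
actual      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(actual, predictions)

# scikit-learn puts actual classes on the rows and predicted classes on
# the columns, so for labels [0, 1] the layout is:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```

For a plot instead of a printout, `ConfusionMatrixDisplay.from_predictions(actual, predictions)` from `sklearn.metrics` is one option in recent scikit-learn versions.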



###  ROC
An ROC, or Receiver Operating Characteristic curve, is a graph of the true positive rate plotted against the false positive rate at each classification threshold.

### AUC
The Area Under the ROC Curve (AUC) is, well....the area under the ROC curve. It represents the probability that the model will rank a randomly chosen positive example higher than a randomly chosen negative example.



[Here it's explained pretty clearly](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)
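Here is a minimal sketch of computing both with scikit-learn. The labels and scores are made up; note that ROC/AUC need the model's *scores or probabilities* (for example from `predict_proba`), not just the final 0/1 predictions.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and model scores (higher score = "more likely positive")
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (false positive rate, true positive rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under that curve
auc = roc_auc_score(y_true, y_score)

print(fpr)
print(tpr)
print(auc)  # 0.75
```

You can check the 0.75 by hand: of the four (positive, negative) pairs, the positive example gets the higher score in three of them.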


## Model Optimization
Having evaluated a model, we often choose to optimize the model in one way or another. While we won't have time to go through these, it is important to note them as you, more than likely, would have to use them in the future. Some important examples include:

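- Cost functions: calculate the cost/error/loss, i.e. a measure of how far the model's predictions are from the true values. We can use the error information to improve the model.

- Hyperparameter Tuning: finds the optimal hyperparameter values, i.e. the settings we choose before training, such as the maximum depth of a decision tree.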
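To give a feel for how these two ideas fit together, here is a rough sketch (not the only way to do it) that uses scikit-learn's `GridSearchCV` to try a few `max_depth` values for a decision tree, scoring each one by mean squared error with 5-fold cross-validation. The synthetic data and the candidate depths are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for this example: a noisy sine wave
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.2, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid={"max_depth": [1, 3, 5, 10]},  # hyperparameter values to try
    scoring="neg_mean_squared_error",          # cost function (negated, so higher is better)
    cv=5,                                      # 5-fold cross-validation
)
search.fit(X_demo, y_demo)

print(search.best_params_)   # the depth with the lowest average error
print(-search.best_score_)   # that depth's mean squared error
```

A depth that is too small tends to underfit and one that is too large tends to overfit, so the search is a simple way to look for a value in between.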



## Task

Now that the theory is out of the way, I would like you to pick any class project that we have worked on, and add at least three evaluation metrics.

## HW

Although we aren't quite done with machine learning (next week we will be looking further into unsupervised learning), I'd like us to get a head start in the race to understanding *neural networks*.

As a result, for homework I would like you to build a brain!

**Due Next Week (Week 6):** https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+T-FX-01+V1

Please read each cell carefully before running! Don't just mindlessly press run :)

**Due the Week Afterwards (Week 7):** Create a neural network similar to the one from the course, on anything that you'd like!