# TRAIN TEST SPLIT, EVALUATIONS AND 
# UNDERSTANDING BIAS VARIANCE

1. How can we use our current data set to both teach and test our model, i.e., without manually inputting new observations?

2. How do we know if our models are good?

3. What do overfitting and underfitting mean?

![image.png](attachment:image.png)

**Training Dataset:** The sample of data used to fit the model.

**Validation Dataset:** The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.


**Test Dataset:** The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
![ttsplit.png](attachment:ttsplit.png)
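
As a minimal sketch of how the three datasets defined above can be carved out of one table, `train_test_split` can be applied twice: once to hold out the test set, and again to separate validation rows from what remains. The bundled copy of iris (`load_iris`), the `_demo` variable names, the 60/20/20 proportions, and the seed are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled copy of iris, so the sketch runs without a download
X_demo, y_demo = load_iris(return_X_y=True)

# Step 1: hold out 20% of all rows as the final test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Step 2: take 25% of the remaining rows (20% of the total) for validation
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 90 30 30
```

The validation rows are the ones to use while tuning hyperparameters, so the test rows stay untouched until the final evaluation.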

A train-test split works best when you have enough observations to cover all common possibilities, so that each subset reflects the typical distribution of potential outcomes. If you do not have enough data, there are other methods, such as k-fold cross-validation.

Efficiency matters too: if the number of observations is very large, you could end up with long run times.

Considerations:
- CPU cost to train the model
- CPU cost to evaluate the model
- Is the training set representative of the data?
- Is the test set representative of the data?

Commonly used splits are:
- Train 0.80, Test 0.20 (80% and 20%)
- Train 0.67, Test 0.33
- Train 0.50, Test 0.50
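
To make these proportions concrete, here is a small sketch of how each `test_size` value turns into row counts. The 150-row stand-in array (the same size as iris) and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(150)  # stand-in for 150 observations

split_sizes = {}
for test_size in (0.2, 0.33, 0.5):
    train_rows, test_rows = train_test_split(
        rows, test_size=test_size, random_state=0)
    split_sizes[test_size] = (len(train_rows), len(test_rows))
    print(test_size, split_sizes[test_size])
```

With 150 rows this gives 120/30, 100/50, and 75/75 (train/test); the test count is rounded up when the fraction does not divide evenly.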


In [None]:
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(url, header=None, names=col_names)

In [None]:
# all features and response need to be numbers for scikit-learn
# map each iris species to a number
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

In [None]:
# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]
print (X)

In [None]:
# store response vector in "y"
y = iris.species_num
print(y)

In [None]:
# example of the code working split into train test sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(iris)

In [None]:
train

In [None]:
test

In [None]:
# split into train test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

### Repeatable Train-Test Splits
By default the split is random, so each run assigns different rows to the train and test sets. To get the same split every time, set `random_state`. The value seeds the random number generator: the rows are still shuffled, but in the same way on every run. It is not a hyperparameter to tune. Any fixed integer works, and different integers simply produce different (but repeatable) splits.

You will find many examples that use 42, a reference to Douglas Adams' The Hitchhiker's Guide to the Galaxy, where 42 is the answer to life, the universe, and everything.
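
A compact way to check this behaviour, sketched on a 20-row stand-in array rather than the iris data (the array, the second seed 7, and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(20)  # stand-in for 20 observations

# Same seed twice: the shuffle, and therefore the split, is identical
train_a, test_a = train_test_split(rows, test_size=0.33, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.33, random_state=42)
same_seed_matches = np.array_equal(test_a, test_b)
print("Same seed, same test rows:", same_seed_matches)

# Another seed is just as repeatable, but usually selects different rows
train_c, test_c = train_test_split(rows, test_size=0.33, random_state=7)
print("Seed 42 test rows:", sorted(test_a.tolist()))
print("Seed 7 test rows: ", sorted(test_c.tolist()))
```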


Let's run this a few times to prove it works.

In [None]:
# split into train test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
# summarize first 5 rows
X_train.head()

In [None]:
# split into train test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
# summarize first 5 rows
X_train.head()

In [None]:
# split into train test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state = 42)
# summarize first 5 rows
X_train.head()

In [None]:
# split into train test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state = 42)
# summarize first 5 rows
X_train.head()

**Stratified Train-Test Splits**

Classification datasets often do not have a balanced number of observations for each class to be predicted.
Stratification preserves the class proportions of the full dataset in both the train and test sets, so each class is represented fairly. It is driven by the response vector: pass `stratify=y`.
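
Because iris has the same number of rows per class, the effect is easier to see on imbalanced labels. The sketch below uses hypothetical labels (90 of class 0, 10 of class 1) and a placeholder feature column, both assumptions for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 of class 0 and 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print("train:", train_counts)  # 72 of class 0, 8 of class 1
print("test: ", test_counts)   # 18 of class 0, 2 of class 1
```

Both subsets keep the original 9:1 ratio. Without `stratify`, the number of class 1 rows landing in the test set would depend on the shuffle.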

Let's first look at the dataset as it is.

In [None]:
# split into train test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state = 42)

In [None]:
from collections import Counter
print(Counter(y))
print(Counter(y_train))
print(Counter(y_test))

In [None]:
# split into train test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state = 42, stratify = y)
print(Counter(y))
print(Counter(y_train))
print(Counter(y_test))

In [None]:
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

## Revisit KNN with Test Train Split

In [None]:
# Review
import sklearn.metrics as metrics
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
# 1 create a feature matrix and response vector
# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]
# store response vector in "y"
y = iris.species_num

#2 Choose estimator
from sklearn.neighbors import KNeighborsClassifier
#3 Instantiate the estimator
knn = KNeighborsClassifier(n_neighbors=1)

#4 Fit the model(model Training)
knn.fit(X_train,y_train)

#5 predict
y_pred=knn.predict(X_test)

#Evaluate
print("Accuracy of test data: ", metrics.accuracy_score(y_test,y_pred))

In [None]:
# Same but now with Test train split
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
# 1 create a feature matrix and response vector

# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]
# store response vector in "y"
y = iris.species_num

# Test Train Split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33)

#2 Choose estimator
from sklearn.neighbors import KNeighborsClassifier
#3 Instantiate the estimator
knn = KNeighborsClassifier(n_neighbors=1)

#4 Fit the model(model Training)
knn.fit(X_train,y_train)

#5 predict
y_pred=knn.predict(X_test)

#Evaluate
print("Accuracy of test data: ", metrics.accuracy_score(y_test,y_pred))

In [None]:
# Same but now with Test train split
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
# 1 create a feature matrix and response vector

# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]
# store response vector in "y"
y = iris.species_num

# Test Train Split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33, random_state = 42)

#2 Choose estimator
from sklearn.neighbors import KNeighborsClassifier
#3 Instantiate the estimator
knn = KNeighborsClassifier(n_neighbors=1)

#4 Fit the model(model Training)
knn.fit(X_train,y_train)

#5 predict
y_pred=knn.predict(X_test)

#Evaluate
print("Accuracy of test data: ", metrics.accuracy_score(y_test,y_pred))

In [None]:
# Same but now with Test train split
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
# 1 create a feature matrix and response vector

# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]
# store response vector in "y"
y = iris.species_num

# Test Train Split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33, random_state = 42, stratify = y)

#2 Choose estimator
from sklearn.neighbors import KNeighborsClassifier
#3 Instantiate the estimator
knn = KNeighborsClassifier(n_neighbors=1)

#4 Fit the model(model Training)
knn.fit(X_train,y_train)

#5 predict
y_pred=knn.predict(X_test)

#Evaluate
print("Accuracy of test data: ", metrics.accuracy_score(y_test,y_pred))

In [None]:

# Same but now with Test train split
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
# 1 create a feature matrix and response vector

# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]
# store response vector in "y"
y = iris.species_num

# Test Train Split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33, random_state = 42, stratify = y)

#2 Choose estimator
from sklearn.neighbors import KNeighborsClassifier
#3 Instantiate the estimator
knn = KNeighborsClassifier(n_neighbors=11)

#4 Fit the model(model Training)
knn.fit(X_train,y_train)

#5 predict
y_pred=knn.predict(X_test)

#Evaluate
print("Accuracy of test data: ", metrics.accuracy_score(y_test,y_pred))

In [None]:
# locating better values of K for higher accuracy
k_range= range (1,26)
scores=[]
for k in k_range :
    knn=KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train,y_train)
    y_pred=knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test,y_pred))


#### Scores of various K ranges

In [None]:
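import matplotlib.pyplot as plt
# plot test accuracy against each value of K tried above
plt.plot(k_range, scores)
plt.xlabel('Value of K')
plt.ylabel('Test accuracy')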

#### Cross Validation
Cross-validation includes resampling and sample splitting methods that use different portions of the data to test and train a model on different iterations.
![1920px-K-fold_cross_validation_EN.svg.png](attachment:1920px-K-fold_cross_validation_EN.svg.png)
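
To show what the folds in the diagram look like in code, here is a minimal sketch using `KFold` on a 10-row stand-in array. The array, the 5 folds, and the seed are assumptions for illustration. For a classifier, `cross_val_score` with an integer `cv` uses a stratified variant of the same idea.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)  # stand-in for 10 observations
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

tested_rows = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(rows), start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
    tested_rows.extend(test_idx.tolist())

# Across the 5 folds, every row is used for testing exactly once
print(sorted(tested_rows) == rows.tolist())
```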

In [None]:
# Cross Validation
from sklearn.model_selection import cross_val_score
accuracy_rate = []

for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    score = cross_val_score(knn, X, y, cv=10) 
    accuracy_rate.append(score.mean())

In [None]:
error_rate = []

for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    score = cross_val_score(knn, X, y, cv=10)
    error_rate.append(1-score.mean())

In [None]:

plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate, color='blue', linestyle='dashed', marker='o', markerfacecolor='red', markersize=10)

plt.title('Error rate VS K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
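
Instead of reading the best K off the plot, the lowest cross-validated error can be located programmatically. This is a self-contained sketch that mirrors the loop above but uses the bundled copy of iris (`load_iris`) and `_demo` names, both assumptions so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

k_values = list(range(1, 40))
cv_errors = []
for k in k_values:
    knn_demo = KNeighborsClassifier(n_neighbors=k)
    fold_scores = cross_val_score(knn_demo, X_demo, y_demo, cv=10)
    cv_errors.append(1 - fold_scores.mean())

# np.argmin returns the first minimum, so ties resolve to the smallest K
best_k = k_values[int(np.argmin(cv_errors))]
print("Lowest cross-validated error at K =", best_k)
```

Several values of K often tie or come very close, so the choice among them is a judgment call rather than a single right answer.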

In [None]:
#3 Instantiate the estimator
knn = KNeighborsClassifier(n_neighbors=17)

#4 Fit the model(model Training)
knn.fit(X_train,y_train)

#5 predict
y_pred=knn.predict(X_test)

#Evaluate
print("Accuracy of test data: ", metrics.accuracy_score(y_test,y_pred))

### Let's look at a few evaluations other than accuracy

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
conmat = confusion_matrix(y_test, y_pred)

#### Classification Report and another way to think of accuracy
* Precision measures the accuracy of positive predictions
* Recall measures the completeness of positive predictions
* F1-score combines precision and recall into a single number (their harmonic mean)
* Support is the number of occurrences of the class in a dataset


#### What is the logic?
- Accuracy: the number of correct predictions divided by the total number of predictions
- Confusion matrix: counts of true positives, true negatives, false positives, and false negatives
- Precision (positive predictive value): `TP / (TP + FP)`
- Recall (sensitivity or true positive rate): `TP / (TP + FN)`
- F1 score: the harmonic mean of precision and recall, `2 * precision * recall / (precision + recall)`
- Support: the number of observations of that class in the specific dataset
- AUC-ROC: the area under the ROC curve, a performance metric that does not depend on a single decision threshold
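
The formulas above can be worked through by hand on a small binary example and compared with scikit-learn's helpers. The labels below are hypothetical and chosen only so the counts are easy to follow.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical binary labels (1 = positive, 0 = negative)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # 3
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # 2
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # 1
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # 4

accuracy = (tp + tn) / len(pairs)                    # 7 / 10
precision = tp / (tp + fp)                           # 3 / 5
recall = tp / (tp + fn)                              # 3 / 4
f1 = 2 * precision * recall / (precision + recall)   # 2 / 3

print("by hand:", accuracy, precision, recall, round(f1, 3))
print("sklearn:",
      accuracy_score(y_true_demo, y_pred_demo),
      precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      round(f1_score(y_true_demo, y_pred_demo), 3))
```

For a multi-class problem such as iris, `classification_report` computes these per class by treating each class in turn as the positive one.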

In [None]:
# Importing all necessary libraries
from sklearn.metrics import accuracy_score

# Calculating the accuracy of classifier
print(f"Accuracy of the classifier is: {accuracy_score(y_test, y_pred)}")

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

### Confusion Matrix

![Accuracy-and-Precision.png](attachment:Accuracy-and-Precision.png)

![1%20n5Jfp2pPuJd82Akix-YMAw.webp](attachment:1%20n5Jfp2pPuJd82Akix-YMAw.webp)

* TRUE POSITIVE - AN ACTUAL POSITIVE THAT IS PREDICTED AS POSITIVE
* TRUE NEGATIVE - AN ACTUAL NEGATIVE THAT IS PREDICTED TO BE NEGATIVE
* FALSE POSITIVE(TYPE 1 ERROR) -  AN ACTUAL NEGATIVE IS PREDICTED TO BE POSITIVE
* FALSE NEGATIVE(TYPE 2 ERROR) - AN ACTUAL POSITIVE IS PREDICTED TO BE NEGATIVE
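
For a two-class problem, these four counts can be read straight out of scikit-learn's confusion matrix. The sketch below uses hypothetical binary labels. In scikit-learn's layout, rows are actual classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels (1 = positive, 0 = negative)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes, in label order [0, 1]
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# Flattening the 2x2 matrix row by row yields TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in cm_demo.ravel())
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```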


In [None]:
# importing all necessary libraries
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
cf_matrix =confusion_matrix(y_test, y_pred)
cf_matrix


In [None]:
import seaborn as sns
sns.heatmap(cf_matrix, annot=True)

In [None]:
import numpy as np
# Add a % 
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True, fmt='.2%', cmap='Blues')
plt.xlabel('predict')
plt.ylabel('actual')

## Bias Variance(another form of evaluation) 
![Screenshot%202024-01-08%20170036.png](attachment:Screenshot%202024-01-08%20170036.png) 

![image.png](attachment:image.png)

|Algorithm|Bias|Variance|
|:---|:---|:---|
|Linear Regression|High Bias|Low Variance|
|Decision Tree|Low Bias|High Variance|
|Bagging|Low Bias|High Variance (less than Decision Tree)|
|Random Forest|Low Bias|High Variance (less than Decision Tree and Bagging)|

Simple example

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

# allow plots to appear in the notebook
%matplotlib inline

Brain and body weight

In [None]:
url = 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x01.txt'
col_names = ['id', 'brain', 'body']
# raw string so that \s is passed to the regex engine rather than read as a string escape
mammals = pd.read_table(url, sep=r'\s+', skiprows=33, names=col_names, index_col='id')
mammals.head()

In [None]:
mammals.describe()

In [None]:
#We're going to focus on a smaller subset in which the body weight is less than 200:
# only keep rows in which the body weight is less than 200
mammals = mammals[mammals.body < 200]
mammals.shape



We're now going to pretend that there are only 51 mammal species in existence. In other words, we are pretending that this is the entire dataset of brain and body weights for every known mammal species.

Let's create a scatterplot (using Seaborn) to visualize the relationship between brain and body weight:


In [None]:
sns.lmplot(x='body', y='brain', data=mammals, ci=None, fit_reg=False)

There appears to be a relationship between brain and body weight for mammals.


Making a prediction

Now let's pretend that a new mammal species is discovered. We measure the body weight of every member of this species that we can find, and calculate an average body weight of 100. We want to predict the average brain weight of this species (rather than measuring it directly). How might we do this?


In [None]:
sns.lmplot(x='body', y='brain', data=mammals, ci=None)

We drew a straight line that appears to best capture the relationship between brain and body weight. So, we might predict that our new species has a brain weight of about 45, since that's the approximate y value when x=100.

This is known as a "linear model" or a "linear regression model", which we will study in a future class.
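
Reading a prediction off a fitted line can be sketched numerically as well. The numbers below are made up so that they lie exactly on a line that gives 45 at a body weight of 100. They are not taken from the mammals dataset, and `np.polyfit` is used here only as a convenient least-squares fit.

```python
import numpy as np

# Made-up body and brain weights that lie exactly on a line (illustration only)
body_demo = np.array([20.0, 60.0, 120.0, 180.0])
brain_demo = 0.3 * body_demo + 15.0

# Fit brain = slope * body + intercept by least squares
slope, intercept = np.polyfit(body_demo, brain_demo, deg=1)

# Read the prediction off the fitted line at body weight 100
predicted_brain = slope * 100 + intercept
print(round(float(predicted_brain), 1))  # 45.0
```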



Making a prediction from a sample

Earlier, I said that this dataset contained every known mammal species. That's very convenient, but in the real world, all you ever have is a sample of data. A more realistic situation would be to only have brain and body weights for (let's say) half of the 51 known mammals.

When that new mammal species (with a body weight of 100) is discovered, we still want to make an accurate prediction for the brain weight, but this task might be more difficult since we don't have all of the data that we would ideally like to have.

Let's simulate this situation by assigning each of the 51 observations to either universe 1 or universe 2:


In [None]:
# set a random seed for reproducibility
np.random.seed(12345)

# randomly assign every observation to either universe 1 or universe 2
mammals['universe'] = np.random.randint(1, 3, len(mammals))
mammals.head()

In [None]:
# col='universe' subsets the data by universe and creates two separate plots
sns.lmplot(x='body', y='brain', data=mammals, ci=None, col='universe')

The line looks pretty similar between the two plots, despite the fact that they used separate samples of data. In both cases, we would predict a brain weight of about 45.
It's easier to see the degree of similarity by placing them on the same plot:

In [None]:
# hue='universe' subsets the data by universe and creates a single plot
sns.lmplot(x='body', y='brain', data=mammals, ci=None, hue='universe')

What was the point of this exercise? This was a visual demonstration of a high bias, low variance model:

- It's high bias because it doesn't fit the data particularly well.
- It's low variance because it doesn't change much depending on which observations happen to be available in that universe.



Let's try something completely different.
First a vocab word:
**Polynomial**  
Here is a basic refresher https://www.youtube.com/watch?v=Rxx4HepHI_E


What would a low bias, high variance model look like? Let's try polynomial regression, with a tenth-order polynomial:


In [None]:
sns.lmplot(x='body', y='brain', data=mammals, ci=None, col='universe', order=10)



- It's low bias because the models match the data quite well!
- It's high variance because the models are widely different depending on which observations happen to be available in that universe. (For a body weight of 100, the brain weight prediction would be 40 in one universe and 0 in the other universe!)




Can we find a middle ground?

Perhaps we can create a model that has less bias than the linear model, and less variance than the tenth-order polynomial?

Let's try a second order polynomial instead:


In [None]:
sns.lmplot(x='body', y='brain', data=mammals, ci=None, col='universe', order=2)

This seems better. In both the left and right plots, it fits the data pretty well, but not too well.
This is the essence of the bias-variance tradeoff: 

You are seeking a model that appropriately balances bias and variance, and thus will generalize to new data (known as "out-of-sample" data).


For any model, we have to find the right balance between bias and variance, so that the model captures the essential patterns in the data while ignoring the noise present in it. This is called the bias-variance tradeoff. Managing it keeps the model's error on new data as low as possible.
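
The tradeoff can also be measured rather than eyeballed. The sketch below simulates many "universes" and records each model's prediction at one point. It does not use the mammals data: the quadratic "true" curve, the noise level, the 500 repetitions, and the seed are all assumptions made for illustration, and `np.polyfit` stands in for the polynomial fits drawn by `lmplot`.

```python
import numpy as np

rng = np.random.default_rng(0)

x_grid = np.linspace(-1, 1, 20)  # fixed inputs shared by every universe
true_curve = x_grid ** 2         # assumed "true" relationship
x_new, true_new = 1.0, 1.0       # point at which predictions are compared

predictions = {1: [], 2: [], 10: []}  # polynomial degree -> predictions
for _ in range(500):                  # 500 simulated universes
    y_sample = true_curve + rng.normal(0, 0.3, size=x_grid.size)
    for degree in predictions:
        coeffs = np.polyfit(x_grid, y_sample, deg=degree)
        predictions[degree].append(np.polyval(coeffs, x_new))

# Bias: how far the average prediction is from the truth
# Variance: how much the prediction moves from universe to universe
bias = {d: float(np.mean(p)) - true_new for d, p in predictions.items()}
variance = {d: float(np.var(p)) for d, p in predictions.items()}
for d in predictions:
    print(f"degree {d:>2}: bias={bias[d]:+.3f}  variance={variance[d]:.3f}")
```

Under these assumptions the straight line is consistently off (high bias) but stable, while the tenth-order fit is right on average but swings widely between universes (high variance).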

Detail review with further examples
https://www.geeksforgeeks.org/bias-vs-variance-in-machine-learning/


In [None]:
# Quick review of evaluations. Note: X and y still hold the iris features and labels defined earlier, not the brain and body weight data

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33)

In [None]:
from sklearn import metrics
knn=KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
y_pred=knn.predict(X_test)
print("Accuracy of test data: ", metrics.accuracy_score(y_test,y_pred))

In [None]:
from sklearn import metrics
knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,y_train)
y_pred=knn.predict(X_test)
print("Accuracy of test data: ", metrics.accuracy_score(y_test,y_pred))

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import metrics
knn=KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
y_pred=knn.predict(X_test)
print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))
print("Accuracy of test data: ", metrics.accuracy_score(y_test,y_pred))
print("Classification Report")
print(classification_report(y_test, y_pred))
