__Overfitting__  
Overfitting is a common problem in machine learning and statistical modeling in which a model learns the training data too well, including its noise and outliers, and therefore generalizes poorly to new, unseen data. In essence, an overfitted model performs well on the training data but fails to perform accurately on test or validation data.  

__Overfitting__ means being too confident in predictions that worked on the training data.  
We mentioned the distinction between training data and test data, where a linear model with five predictor variables (such as cabin size, sauna size, and distance to the lake) fitted to training data containing only three cabins will dutifully replicate the three prices exactly.  

Still, it is clear that perfect 'predictions' on the training data are no guarantee of perfect predictions for any other data. This, in essence, is an extreme case of overfitting.  
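A minimal sketch of this extreme case, assuming NumPy is available and reusing three of the training cabins (with all five predictors) that appear in the training table of this notebook. `np.linalg.lstsq` stands in here for whatever fitting routine is actually used, and the variable names are illustrative only.

```python
import numpy as np

# Five predictors per cabin: size, sauna size, distance to water,
# indoor bathrooms, neighbor proximity (values from the training table).
X_train = np.array([
    [25, 2, 50, 1, 500],
    [39, 3, 10, 1, 1000],
    [13, 2, 13, 1, 1000],
], dtype=float)
y_train = np.array([127900, 222100, 143750], dtype=float)

# Least-squares fit with more coefficients (5) than cabins (3):
# the system is underdetermined, so the training prices can be hit exactly.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
train_pred = X_train @ coef

print(np.round(train_pred))              # matches y_train
print(np.allclose(train_pred, y_train))  # True

# Two cabins the model has never seen.
X_new = np.array([[82, 5, 20, 2, 120], [130, 6, 10, 2, 600]], dtype=float)
y_new = np.array([268000, 460700], dtype=float)
print(np.round(X_new @ coef), y_new)     # nothing guarantees these agree
```

The perfect fit on the three training cabins says nothing about the last line: the predictions for the two unseen cabins have to be checked separately.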

__Training Data__  
• Purpose: Used to train the machine learning model. It is the data the model learns from, allowing it to pick up patterns, relationships, and features in the data.  
• Usage: During training, the model adjusts its parameters (like weights in neural networks) based on this data to minimize errors and improve accuracy.  
• Size: Typically, the majority of the available data is used for training to ensure the model has enough information to learn effectively.  
• Example: In a supervised learning scenario, if you are training a model to recognize handwritten digits, the training data will consist of images of handwritten digits along with their correct labels.


__Testing Data__  
• Purpose: Used to evaluate the performance of the trained model. It serves as a benchmark to assess how well the model generalizes to new, unseen data.  
• Usage: After the model has been trained, it is tested on the testing data to measure its accuracy, precision, recall, or other relevant metrics. The model should not have seen this data during training, which ensures an unbiased evaluation.  
• Size: A smaller portion of the available data is typically set aside as testing data. This ensures the model's performance is evaluated on data it was not exposed to during training.  

#### Key Differences:
* Purpose:
  * Training data: used to teach the model.
  * Testing data: used to evaluate the model's performance in predicting.
* Exposure to the model:
  * Training data: the model sees this data during the training phase.
  * Testing data: the model does not see this data during training; it is used only for evaluation.
* Impact on model:
  * Training data: directly influences the model's learning process and parameter adjustment.
  * Testing data: used to assess how well the model has learned and how it performs on new, unseen data.
 
#### Importance of the Distinction:  
The separation of training and testing data is crucial to avoid overfitting and to ensure that the model can generalize well to new data. If the same data is used for both training and testing, the result is an overly optimistic estimate of the model's performance, because the model is essentially being "tested" on data it has already seen.

In [20]:
import pandas as pd

In [24]:
# A single cabin: five predictor values plus its predicted price.
cabin1 = {
    'Size': [66],
    'Sauna': [4],
    'Lake Distance': [15],
    'Bath Room': [2],
    'Neighbor Distance': [500],
    'Predicted Price': [258250],
}

In [25]:
cabin1 = pd.DataFrame(cabin1, index = ['Cabin1'])

In [26]:
cabin1

Unnamed: 0,Size,Sauna,Lake Distance,Bath Room,Neighbor Distance,Predicted Price
Cabin1,66,4,15,2,500,258250


Back to the tourism cabins in Finland ....  


### Note
Although this cabin is our "test data" point, it is also included in the training data.  

If we feed the cabin details back into the nearest neighbor method, it will simply find the exact same cabin in the training data and determine that it is the nearest neighbor of the test data point.  
Therefore, it will predict the price as ¢ 258,250.  
Since this was the price in the test data, we find that the price is predicted exactly. The same goes for any other cabin in the training data.
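The behaviour described above can be sketched with a hand-rolled one-nearest-neighbor predictor. This is an illustration under assumptions, not the course's implementation: it uses the five cabins from the training table in this notebook, plain Euclidean distance without feature scaling, and a hypothetical helper name `predict_nearest_neighbor`.

```python
import numpy as np

# Predictors: size, sauna size, distance to water, indoor bathrooms,
# neighbor proximity (values from the training table).
X_train = np.array([
    [25, 2, 50, 1, 500],
    [39, 3, 10, 1, 1000],
    [13, 2, 13, 1, 1000],
    [82, 5, 20, 2, 120],
    [130, 6, 10, 2, 600],
], dtype=float)
y_train = np.array([127900, 222100, 143750, 268000, 460700], dtype=float)


def predict_nearest_neighbor(x, X, y):
    """Return the price of the training cabin closest to x (Euclidean distance)."""
    distances = np.linalg.norm(X - x, axis=1)
    return y[np.argmin(distances)]


# Use every training cabin as its own "test" point: the nearest neighbor
# is the cabin itself (distance 0), so its own price comes back.
predictions = np.array(
    [predict_nearest_neighbor(x, X_train, y_train) for x in X_train]
)
print(predictions)
print(np.array_equal(predictions, y_train))  # True
```

A training error of zero here only reflects that the method memorises the training cabins; it is not evidence of accuracy on cabins outside the training data.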

As you noticed, when we added more cabins to the training data than there were predictors (in our case five), __the linear model could no longer fit the training data perfectly.__  
In other words, the training error is not zero. This happens virtually always for a linear model when the __number of predictor variables is less than the sample size (unless the data happen to be very special).__  

Non-linear models, which we'll study in the next chapter, allow more flexibility, so we may still get a very small training error even if the sample size is larger than the number of predictor variables.  




In [91]:
# Five training cabins: five predictor columns plus the price (output).
data = {
    'size': [25, 39, 13, 82, 130],
    'sauna size': [2, 3, 2, 5, 6],
    'distance to water': [50, 10, 13, 20, 10],
    'idoor bathrooms': [1, 1, 1, 2, 2],
    'neighbor proximity': [500, 1000, 1000, 120, 600],
    'price in ¢(output)': [127900, 222100, 143750, 268000, 460700],
}

# Two-level row index: the data set name, then the cabin label.
cabins = ['Cabin1', 'Cabin2', 'Cabin3', 'Cabin4', 'Cabin5']
table = pd.DataFrame(data, index=pd.MultiIndex.from_product([['Training Data'], cabins]))

In [92]:
table

Unnamed: 0,Unnamed: 1,size,sauna size,distance to water,idoor bathrooms,neighbor proximity,price in ¢(output)
Training Data,Cabin1,25,2,50,1,500,127900
Training Data,Cabin2,39,3,10,1,1000,222100
Training Data,Cabin3,13,2,13,1,1000,143750
Training Data,Cabin4,82,5,20,2,120,268000
Training Data,Cabin5,130,6,10,2,600,460700


(Number of predictor variables) < (sample size of the training data) ...  
Then the (training error) is not zero ...  
OR  
the linear model can no longer fit the training data perfectly.  
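The same effect can be reproduced on the training table. As a sketch, and as an assumption made purely for illustration, the model below keeps only two predictors (size and sauna size) plus an intercept, so that the five cabins outnumber the three coefficients; NumPy's `lstsq` is used as a stand-in fitting routine.

```python
import numpy as np

# Two predictors plus an intercept: 3 coefficients, 5 cabins.
size = np.array([25, 39, 13, 82, 130], dtype=float)
sauna = np.array([2, 3, 2, 5, 6], dtype=float)
price = np.array([127900, 222100, 143750, 268000, 460700], dtype=float)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(size), size, sauna])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

# Training error: how far the fitted prices are from the real ones.
residuals = price - X @ coef
train_rmse = np.sqrt(np.mean(residuals ** 2))
print(train_rmse)  # greater than zero: the fit is no longer exact
```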

Also, the most important observation is that (a small training error) (does not guarantee that the model actually predicts new data well).  

This is especially true if the small training error is obtained by fitting a complex, possibly non-linear model to a small training data set.  

In fact, a zero training error may still be followed by very poor prediction accuracy on test data that does not overlap with the training data.  

### How to Avoid Overfitting:
1- The first line of defense is splitting your data into training and testing data.  
When you train the model with one part of your original data and test its performance with another, you will have at least some idea of how well your model generalises to unseen data.  
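A split like this can be sketched in a few lines. The helper name `split_train_test`, the 20% test fraction, and the fixed seed are all assumptions chosen for illustration; libraries such as scikit-learn provide their own splitting utilities.

```python
import numpy as np


def split_train_test(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the row indices and split them into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = max(1, int(round(n_samples * test_fraction)))
    # The first n_test shuffled rows become test data, the rest training data.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_train_test(n_samples=10, test_fraction=0.2)
print(sorted(train_idx), sorted(test_idx))

# No row appears in both sets, so the test rows stay unseen during training.
print(set(train_idx).isdisjoint(test_idx))  # True
```

The index arrays can then be used to select rows, for example with `table.iloc[train_idx]` on a pandas DataFrame.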

2- The second line of defense, if you still have doubts about whether the data was split in a good way, is to split the data into n different sets and train the model n times, each time with a different combination of n-1 sets, with the remaining set being used as the test set.  
This way you will get n estimates of how your selected model performs on unseen data.  
This is called (k-fold cross-validation), here with k = n folds, and it is one of the simplest ways of doing cross-validation. When every set holds a single data point, it is known as (leave-one-out cross-validation).  
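The procedure can be sketched on the five training cabins from this notebook, using n = 5 sets of one cabin each (the leave-one-out case). The two-predictor linear model and the use of `np.linalg.lstsq` are assumptions made to keep the example small.

```python
import numpy as np

size = np.array([25, 39, 13, 82, 130], dtype=float)
sauna = np.array([2, 3, 2, 5, 6], dtype=float)
price = np.array([127900, 222100, 143750, 268000, 460700], dtype=float)

# Design matrix: intercept, size, sauna size.
X = np.column_stack([np.ones_like(size), size, sauna])

n = len(price)
abs_errors = []
for held_out in range(n):
    train_mask = np.arange(n) != held_out   # the other n-1 cabins
    coef, *_ = np.linalg.lstsq(X[train_mask], price[train_mask], rcond=None)
    prediction = X[held_out] @ coef         # predict the held-out cabin
    abs_errors.append(abs(prediction - price[held_out]))

print(np.round(abs_errors))   # one error estimate per held-out cabin
print(np.mean(abs_errors))    # average error on cabins unseen during training
```

Each cabin is used exactly once as test data, so every error in `abs_errors` comes from a model that never saw that cabin.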

Since overfitting is such a pain in the neck in machine learning, many people have spent time and energy coming up with ways to combat it.  
Most of these are outside the scope of this course, but you can read about methods like (regularisation and dropout). Regularisation is widely used in linear and logistic regression as well as in neural networks, while dropout is specific to neural networks.


# Exercise:
Let's imagine for a moment that you have a data set of 1000 email messages labeled as either spam or not. Out of the 1000 messages, 990 are legitimate emails, and 10 are spam.  
Then you split your data into training and test data sets in such a way that both labels are present in both sets in equal ratios, and then train a classifier on the training data.  
What would you set as the baseline accuracy that your model has to outperform in order to be considered worthwhile?



In [93]:
total_data_set = 1000
legitimate_mails = 990
spam_mails = 10

In [104]:
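# Baseline: always predict the majority class (legitimate),
# which is correct for 990 out of 1000 messages.
base_line_accuracy = f'{100 * legitimate_mails / total_data_set} %'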

In [105]:
base_line_accuracy

'99.0 %'

$$Great   Good      Bye$$  
$$👊🏻👊🏻👊🏻👊🏻$$