# 01. Introduction to Machine Learning 

Definitions:
 - MLOps: People who integrate models, automate pipelines, and work with things like cloud services, Kubernetes, and data warehouses (more of an engineering position)
 - Scientific Paper: A document that has passed peer review
 - Machine Learning: Making computers learn from data
 - Reproducibility: The guarantee that another person with our data and following our work will get the same results
 - [High Cardinality](https://arxiv.org/abs/2307.02071#:~:text=High%2Dcardinality%20categorical%20variables%20are,difficulties%20with%20high%2Dcardinality%20variables.) (as an ML concept): [Each categorical variable consists of unique values](https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b). A categorical feature is said to have high cardinality when there are too many of these unique values.
 ```
 High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set, or in other words, there are few data points per level. Machine learning methods can have difficulties with high-cardinality variables.
 ```
 - High cardinality leads to [The Curse of Dimensionality](https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b#:~:text=curse%20of%20dimensionality.-,The%20Curse%20of%20Dimensionality,-Here%20is%20a) 
 ```
 As the number of features grows, the amount of data we need to accurately be able to distinguish between these features (in order to give us a prediction) and generalize our model (learned function) grows EXPONENTIALLY.
 ```
 
 Cardinality can be reduced with a simple aggregating function: for example, 50 categories can be grouped into 3 to 5 broader ones.
 - Features with very high or very low entropy are almost always useless. A feature needs some variance: its values should be neither all the same nor all different.
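
The two ideas above (aggregating categories and checking entropy) can be sketched in plain Python. This is only an illustration under assumptions: the function names, the `top_k` rule for deciding which categories to keep, and the sample city data are all made up for the example and are not from the lecture.

```python
from collections import Counter
import math


def group_rare_categories(values, top_k=3, other_label="Other"):
    """Keep the top_k most frequent categories; merge the rest into one bucket."""
    counts = Counter(values)
    keep = {category for category, _ in counts.most_common(top_k)}
    return [v if v in keep else other_label for v in values]


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical feature with 6 distinct levels, reduced to 3 levels plus "Other".
cities = ["Sofia"] * 5 + ["Plovdiv"] * 3 + ["Varna"] * 2 + ["Ruse", "Burgas", "Vidin"]
reduced = group_rare_categories(cities, top_k=3)
print(sorted(set(reduced)))

# The two extremes: a constant column and a column of unique IDs.
constant = ["a"] * 8
unique_ids = list(range(8))
print(shannon_entropy(constant), shannon_entropy(unique_ids))
```

A constant column has zero entropy and a column of unique IDs has the maximum possible entropy for its length; neither helps a model tell rows apart in a way that generalizes.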

## The Scientific Method
<div>
<img src="attachment:c0792a54-d417-4224-95bf-6c9d6707ed08.png" width="400"/>
</div>

Steps:
 - Ask a question: We need a testable valid question
 - Research: We need to see what other people have already done on the topic so we don't reinvent the wheel or waste our time
 - Hypothesis: We form a specific question that can be answered with Yes or No. We assume the answer is Yes, but we have to test it.
 - Test with an Experiment: Data Science work in this case
 - Analyze your results: MLflow - compare results from different experiments
    - Hypothesis is True: Go to next step
    - Hypothesis is False: Go back to Hypothesis step
 - Report your results
 
The whole process is strongly iterative.

## Applied Machine Learning Process (KDD - Knowledge Discovery in Databases)

1. Problem definition - Make sure the problem is well-defined and that you're solving the right problem
2. Data analysis - Get familiar with the available data
3. Data preparation - Get the data ready for modelling
4. Algorithm evaluation - Test and compare algorithms
5. Result improvement - Use results to create better models (e.g. fine-tuning, ensembles)
6. Result presentation - Describe the problem and solution to non-specialists

## Machine Learning Definition

Making computers learn from data:

![image.png](attachment:5fc9f80d-7235-4fcd-a7e0-4b912ed40a13.png)

## Types of Machine Learning models 

1. Supervised learning:
 - We train the program on previously known (labelled) data
 - Examples: regression, classification
2. Unsupervised learning
 - We leave the program to find patterns in data
 - Examples: clustering analysis, dimensionality reduction
3. Reinforcement learning - the learner is an agent, not a model
 - Usually treated as a separate paradigm: the agent learns from reward signals rather than from labelled examples
 - As it learns, it changes its environment
 - The program learns continuously
 - In practice it is often combined with supervised and unsupervised techniques
 - Examples: learning to play a game by observing other players, learning to drive a car


### ML in a nutshell (How our model tries to approximate reality)

<div>
<img src="attachment:40d37a5f-a2bf-4ac0-9001-1a5edf877668.png" width="600"/>
<img src="attachment:4c563331-3a7b-4bbc-9986-59dbe52d4c95.png" width="600"/>
</div>

Regression - The target variable is a real number\
Classification - The target variable takes one of a limited set of values (classes)
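
To make the distinction concrete, here is a minimal sketch of both tasks in plain Python. The two algorithms (a least-squares line and a 1-nearest-neighbour rule) and the toy data are choices made for illustration only; the notes do not prescribe them.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def nearest_neighbour_label(train_points, train_labels, query):
    """1-nearest-neighbour classification on one-dimensional inputs."""
    distances = [abs(p - query) for p in train_points]
    return train_labels[distances.index(min(distances))]


# Regression: the prediction is a real number.
slope, intercept = fit_line([1, 2, 3, 4], [2.0, 4.0, 6.0, 8.0])
predicted_value = slope * 5 + intercept

# Classification: the prediction is one of a fixed set of labels.
predicted_label = nearest_neighbour_label(
    [1.0, 1.5, 8.0, 9.0], ["small", "small", "large", "large"], 7.2
)
print(predicted_value, predicted_label)
```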

### [Z-score normalization](https://www.statology.org/z-score-normalization/#:~:text=Z%2Dscore%20normalization%20refers%20to,the%20standard%20deviation%20is%201.&text=where%3A,%CE%BC%3A%20Mean%20of%20data)

Z-score normalization refers to the process of normalizing every value in a dataset such that the mean of all of the values is 0 and the standard deviation is 1

$$ z = \tfrac{x - \mu}{\sigma} $$

where: 
- x: Original value
- μ: Mean of data
- σ: Standard deviation of data
- z: New value
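
A minimal sketch of the formula above, using only the standard library. One assumption to flag: it uses the population standard deviation (`statistics.pstdev`); the notes do not say whether the population or the sample version is meant. The function name and sample data are illustrative.

```python
import statistics


def z_score_normalize(values):
    """Return (x - mean) / std for every x, using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("z-score is undefined when all values are identical")
    return [(x - mean) / std for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]
z_values = z_score_normalize(data)
print(z_values)
print(statistics.fmean(z_values), statistics.pstdev(z_values))
```

After the transformation the values have mean 0 and standard deviation 1, but their minimum and maximum are not fixed.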

### [Min-max normalization](https://www.codecademy.com/article/normalization)
Min-max normalization is one of the most common ways to normalize data. For every feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.

* Min-max normalization has one fairly significant downside: it does not handle outliers very well

![image.png](attachment:5b1c687a-1c6e-4038-b0a2-db92e4abcc0a.png)

$$ z = \tfrac{\text{value} - \text{min}}{\text{max} - \text{min}} $$
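
The formula can be sketched in a few lines of plain Python, which also makes the outlier problem visible. The function name and the sample numbers are made up for illustration.

```python
def min_max_normalize(values):
    """Map the minimum to 0, the maximum to 1, and everything else in between."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("min-max scaling is undefined when all values are identical")
    return [(x - low) / (high - low) for x in values]


clean = [10, 20, 30, 40]
with_outlier = [10, 20, 30, 40, 1000]

scaled_clean = min_max_normalize(clean)
scaled_outlier = min_max_normalize(with_outlier)
print(scaled_clean)
print(scaled_outlier)
```

With the single outlier present, the four ordinary values are squeezed into a tiny interval near 0, because the outlier itself defines the maximum.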
### [Z-score vs min-max](https://stats.stackexchange.com/questions/547446/z-score-vs-min-max-normalization)

![image.png](attachment:b1d3512e-d9d1-45df-9550-e3ad0b2e35f8.png)

```
What I do not understand and what is not intuitive for me at all is to use z-score for feature scaling.

Why is z-score used? What is the motivation to not use min-max and to use z-score? Why is it a good idea to scale your data in standard deviations from the mean? What was the motivation to use z-score for scaling? Why is min-max not used all the time? What problem does z-score solve what min-max does not solve?
```
---
```
The answer to your specific question about why z-score normalisation handles outliers better is largely to do with how standard deviations are calculated in the first place. If there are outliers, then the effect that the deviation from the mean related to those outliers will have on the final statistic (i.e, the standard deviation; the same value that will be used to normalise the feature) will be mitigated by the rest of the deviations within that same feature. In short, standard deviation is an aggregated calculation so individual values will carry less weight with the more observations there are. 

Conversely with min-max scaling where the values used to normalise the data will literally be the outliers themselves (assuming there are outliers of course). No aggregating, no averaging, just take the minimum value, take the maximum value and normalise all the observations in the feature relative to those values. If those minimum and maximum values happen to be outliers then you can see how they will impact the resulting normalisation. 

As far as I can see, how important this difference is will probably depend on the model that the data is being preprocessed for, and the question of "why those outliers would be kept in the data in the first place" is also valid, but maybe that's another discussion entirely. Anyway, Hope this helps.
```

* Min-max normalization: Guarantees all features will have the exact same scale but does not handle outliers well.
* Z-score normalization: Handles outliers, but does not produce normalized data with the exact same scale.

## Encoding
### [Label encoding](https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/)
In machine learning projects, we usually deal with datasets that have several categorical columns, some of which are ordinal. For example, a column for income level with the values low, medium, and high can be encoded as 1, 2, and 3, where 1 represents 'low', 2 'medium', and 3 'high'. This type of encoding tries to preserve the meaning of the values: higher numbers are assigned to the elements with higher priority.

This may cause priority issues during model training: a label with a higher value may be treated as having higher priority than a label with a lower value, even when the categories have no natural order.

Values that were independent become dependent: the model can add, subtract, or multiply the encoded numbers as if they were quantities.
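
A minimal sketch of label encoding for the income example, in plain Python. The mapping dictionary and function name are illustrative, not a library API.

```python
# Explicit mapping so the numeric order matches the ordinal meaning.
INCOME_ORDER = {"low": 1, "medium": 2, "high": 3}


def label_encode(values, mapping):
    """Replace each category with its integer code."""
    return [mapping[v] for v in values]


incomes = ["low", "high", "medium", "low"]
encoded = label_encode(incomes, INCOME_ORDER)
print(encoded)

# The side effect described above: the codes now support arithmetic,
# so numerically "low" + "medium" equals "high".
print(INCOME_ORDER["low"] + INCOME_ORDER["medium"] == INCOME_ORDER["high"])
```

For a truly ordinal feature this ordering is useful; for a nominal feature (e.g. colours) the same arithmetic would be meaningless.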
### [One-hot encoding](https://www.geeksforgeeks.org/ml-one-hot-encoding/)
It is called one-hot because the 1 marks the position where the category is true. In pandas it is available as **get_dummies()**.
<div>
<img src="attachment:67510e4d-e63f-431f-8aa6-d2e77aec6f96.png" width="600"/>
</div>

The values remain independent, but the number of dimensions explodes when a feature has many categories.
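
To show what happens under the hood, here is a hand-written sketch of one-hot encoding in plain Python. It is not how pandas implements `get_dummies()`; the function name, the alphabetical column order, and the colour data are assumptions made for the example.

```python
def one_hot_encode(values):
    """Return the column names and one binary row per input value."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # the single "hot" entry
        rows.append(row)
    return categories, rows


colours = ["red", "green", "blue", "green"]
columns, rows = one_hot_encode(colours)
print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1, and the number of columns equals the number of distinct categories, which is where the dimensionality problem comes from.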
### [Multi-hot encoding](https://stats.stackexchange.com/questions/467633/what-exactly-is-multi-hot-encoding-and-how-is-it-different-from-one-hot)
Imagine you have five different classes, e.g. ['cat', 'dog', 'fish', 'bird', 'ant']. With one-hot encoding you would represent the presence of 'dog' as a five-dimensional binary vector like [0,1,0,0,0]. With multi-hot encoding (in this sense often also called binary encoding) you would first label-encode your classes, so that a single number represents the presence of a class (e.g. 1 for 'dog'), and then convert the numerical labels to binary vectors of size $\lceil \log_2 5 \rceil = 3$.
```
Examples:

'cat'  = [0,0,0]  
'dog'  = [0,0,1]  
'fish' = [0,1,0]  
'bird' = [0,1,1]  
'ant'  = [1,0,0]   
```
This representation is basically a middle way between label encoding, where you introduce false class relationships (0 < 1 < 2 < ... < 4, thus 'cat' < 'dog' < ... < 'ant') but only need a single value to represent class presence, and one-hot encoding, where you need a vector of size n (which can be huge!) to represent all classes but have no false relationships.

Note: multi-hot-encoding introduces false additive relationships, e.g. [0,0,1] + [0,1,0] = [0,1,1] that is 'dog' + 'fish' = 'bird'. That is the price you pay for the reduced representation.
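
The table and the note above can be reproduced with a short sketch in plain Python. The function name is made up; the vector width is computed with `int.bit_length()`, which for n classes gives the same result as ⌈log2 n⌉ when n > 1.

```python
def binary_encode(classes):
    """Label-encode classes by position, then write each label in binary."""
    width = max(1, (len(classes) - 1).bit_length())  # ceil(log2(n)) for n > 1
    codes = {}
    for label, name in enumerate(classes):
        # Most significant bit first, e.g. label 4 -> [1, 0, 0] for width 3.
        codes[name] = [(label >> bit) & 1 for bit in reversed(range(width))]
    return codes


codes = binary_encode(["cat", "dog", "fish", "bird", "ant"])
for name, vector in codes.items():
    print(name, vector)

# The false additive relationship: 'dog' + 'fish' gives the code of 'bird'.
dog_plus_fish = [a + b for a, b in zip(codes["dog"], codes["fish"])]
print(dog_plus_fish == codes["bird"])
```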

## Interesting stuff

No sufficiently powerful, consistent formal system can prove its own consistency (Gödel's second incompleteness theorem).

Greedy algorithms (e.g. gradient descent) try to optimize step by step. The algorithm does not have complete information; at each step it picks the best move given what it can see (the next N steps), and then repeats. This does not guarantee reaching the global minimum, so the result is not guaranteed to be correct. To improve the odds, we can start it from several different places.
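
A small sketch of the "start from different places" idea, in plain Python. The function `f`, the learning rate, the step count, and the starting points are all arbitrary choices for illustration; `f` was picked only because it has one local and one global minimum.

```python
def f(x):
    """A curve with a local minimum near x = 1.13 and a global one near x = -1.30."""
    return x**4 - 3 * x**2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(start, learning_rate=0.01, steps=1000):
    """Repeatedly step downhill from start; only sees the local slope."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x


# Fixed starting points keep the example deterministic.
starts = [-2.0, 0.5, 2.0]
candidates = [gradient_descent(s) for s in starts]
best_x = min(candidates, key=f)

for start, x in zip(starts, candidates):
    print(f"start={start:+.1f} -> x={x:+.4f}, f(x)={f(x):+.4f}")
print("best:", best_x)
```

Two of the three runs get stuck in the shallower minimum on the right; only the run starting on the left finds the deeper one. More starting points raise the chance of finding the global minimum but never guarantee it.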

Cleaning the data and exploring the data is a two-way street: cleaning makes the data more interpretable, and interpreting it makes it easier to clean.