# <span style='color:black'> Machine Learning: </span>
A method to teach computers to learn from data.
- Supervised: the computer is provided with labeled data. The goal is to learn a general rule that maps inputs to outputs.
    - Image classification, linear regression, spam detection.
- Unsupervised: No labeled data. The goal is to find patterns or relations in data.
    - Clustering, anomaly detection, dimensionality reduction.
- Reinforcement learning: The computer (agent) interacts with an environment and learns to make decisions from the rewards and penalties it receives in response.

ML algorithms can also be divided into two categories:
- **Probabilistic**
    - These are based on the idea of estimating the underlying probability distribution of the data, and using this distribution to make predictions. These models typically use Bayesian methods, which involve updating the probability distribution as new data becomes available. Examples of probabilistic ML models include Gaussian processes, hidden Markov models, and Bayesian networks.
- **Statistical**
    - These models are based on the idea of finding the best fixed model that explains the data, given a set of assumptions and a criterion for model selection. These models typically use frequentist methods, which involve optimizing the parameters of the model using maximum likelihood estimation or maximum a posteriori estimation. Examples of statistical ML models include linear regression, logistic regression, and support vector machines.

In summary, probabilistic models estimate probability distributions, while statistical models find the best fixed model that explains the data, given a set of assumptions.



# <span style='color:black'> Deep Learning: </span>
Deep learning is a subfield of machine learning based on neural networks (NN). The interacting layers of a network extract features from data and learn to make decisions from them. Typical applications:
- Natural language processing
- Computer vision, image recognition
- Speech recognition


Machine learning algorithms are also divided into two main categories, **parametric** and **non-parametric**. Parametric algorithms make assumptions about the form of the relation between inputs and outputs. For example, linear regression assumes a linear relation between the features and the corresponding target. Non-parametric algorithms such as KNN make no such assumption about the form of the relation.

# <span style='color:blue'> K-nearest neighbor (KNN) </span>

* A **non-parametric** model
* A **non-linear** model

It makes few assumptions about the structure of the data and usually gives accurate results, but it is sensitive to small changes in the dataset.

* Classifier
* Regressor

Instance based or memory based supervised learning. It is computationally expensive when the dataset is large.

- **KNN classifier**: memorizes the entire training set.


Four things should be specified:

    1) A distance metric.
        * it controls the distance function between points and thus which points are considered as nearest in finding neighbors.
        * Typically Euclidean (Minkowski with p = 2)
    2) How many nearest neighbors to look at? (Model complexity)
        * k=5
    3) Optional weighting function on the neighbor points.
        * Ignored
    4) Methods for aggregating the classes of neighbor points.
        * Majority class

### Relation between $k$ and model complexity

* **Reducing $k$** in a KNN classifier **increases** the variance of the decision boundaries and the risk of **overfitting**, because very local changes are captured.

* With **$k=$ the total number** of points in the training set, the result is a **single decision**: the **most frequent** class in the training set.
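
The four choices and the effect of $k$ can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation: the function name `knn_predict` and the toy data are assumptions made up for this example.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1) Distance metric: Euclidean (Minkowski with p = 2).
    distances = [math.dist(x, query) for x in train_X]
    # 2) Keep the indices of the k nearest neighbors.
    nearest = sorted(range(len(train_X)), key=lambda i: distances[i])[:k]
    # 3) No weighting; 4) aggregate by majority class.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): class "a" near the origin, class "b" further out.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
train_y = ["a", "a", "a", "b", "b"]

local_vote = knn_predict(train_X, train_y, (5, 5.5), k=1)
global_vote = knn_predict(train_X, train_y, (5, 5.5), k=len(train_X))
print(local_vote, global_vote)
```

With `k=1` the query follows its closest neighbor, while with `k` equal to the training set size every query receives the most frequent class, regardless of where it lies.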

### Drawback:

When the training data has many samples, or each sample has many features, KNN predictions become slow.

For data sets with hundreds of thousands of features, especially sparse ones, we should use another model instead of KNN.

# <span style='color:blue'>  Linear regression: </span>

Linear regression is a **parametric** statistical algorithm that assumes a linear relationship between inputs and outputs. It predicts the target as a weighted sum of the features, and the goal is to find the line of best fit that minimizes the difference between predicted and actual values.

There are different methods to find the optimal parameter values: the normal equation, or optimization algorithms like gradient descent.

Linear regression can be extended to handle non-linear relationships by applying a non-linear transformation to the inputs. This is called polynomial regression and is explained below.




**Least squares linear regression** (AKA: ordinary least squares)

* It minimizes the sum of squared errors (equivalently, the mean squared error) between targets and predictions to find the weights $w$ and the bias/intercept parameter $b$

 $RSS(w,b) = \sum_{i=1}^N (y_i - (w \cdot x_i + b))^2$

### <span style='color:green'>scikit-learn:</span>

* $w$ : linreg.coef_

* $b$: linreg.intercept_

    - The trailing ' _ ' in linreg.coef_ means the attribute was learned from the training data and was not set by the user.
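
As a small sketch of how these attributes are read after fitting, the example below uses noise-free toy data generated from $y = 2x + 1$. The data values are assumptions chosen so that the learned parameters are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data generated from y = 2*x + 1 (values chosen for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0

linreg = LinearRegression().fit(X, y)

# Attributes ending in "_" exist only after fitting.
print(linreg.coef_)       # learned weights w, one per feature
print(linreg.intercept_)  # learned bias b

# Residual sum of squares on the training data.
rss = np.sum((y - linreg.predict(X)) ** 2)
print(rss)
```

Because the data lies exactly on a line, the fitted weight and intercept recover the generating values and the residual sum of squares is numerically zero.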

## Comparing KNN and linear regression:

* **KNN**:
    - does not make many assumptions about the structure of the data.
    - gives potentially accurate but sometimes unstable predictions that are sensitive to small changes in the training data.
    - better on training set.

* **Linear regression**:
    - makes strong assumptions about the structure of the data: linear relationship.
    - gives stable but potentially inaccurate predictions.
    - better on unseen data.
    - very extendable to new data beyond the training set.
    - no parameter to control the complexity.

## <span style='color:orange'>  Regularization: </span>
Regularization prevents **overfitting** by restricting the model typically to reduce its complexity. 

* Ridge regression (L2)
* Lasso regression (L1)

### Ridge regression:

Ridge regression is a type of linear regression that adds a regularization term to the cost function to prevent overfitting. It is useful when some features are correlated. It shrinks the weights toward zero, which reduces the influence of less important variables without removing them. A hyperparameter $\alpha$ controls the strength of the regularization. The optimal parameters can be found in closed form with a regularized version of the normal equation.

It uses the same least-squares criterion but adds a penalty for **large variations** in the weight parameters.

$RSS_{ridge}(w,b) = \sum_{i=1}^N (y_i - (w \cdot x_i+b))^2 + \alpha \sum_{j=1}^P w_j^2$

**Higher** $\alpha$ means **more** regularization and **simpler** models.

### Lasso regression:

As in ridge regression, a regularization term is added to the cost function that causes the $w$ coefficients to shrink toward zero to prevent overfitting. There is no closed-form solution, so the optimal parameters are found with iterative optimization algorithms (scikit-learn uses coordinate descent). Unlike ridge regression, lasso regression tends to give sparse solutions: it sets some parameters exactly to zero, which helps to select the most important features for the prediction.


$RSS_{lasso}(w,b) = \sum_{i=1}^N (y_i - (w \cdot x_i+b))^2 + \alpha \sum_{j=1}^P |w_j|$

With lasso, a subset of the coefficients is forced to be exactly zero. This is called a sparse solution and acts as a kind of **feature selection**.

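In scikit-learn, the default is $\alpha = 1$ for both Ridge and Lasso; $\alpha = 0$ corresponds to no regularization, i.e. ordinary least squares.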
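
The difference between the two penalties can be seen on synthetic data. The sketch below assumes a made-up dataset with ten features of which only the first two influence the target; the value $\alpha = 0.5$ is an arbitrary choice for illustration. Note that scikit-learn's Lasso scales the squared-error term by the number of samples, so its $\alpha$ is not directly comparable to the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): 10 features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=0.5).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge shrinks weights but keeps them non-zero; lasso sets some exactly to zero.
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print(ridge_zeros, lasso_zeros)
print(np.round(lasso.coef_, 2))
```

Counting exact zeros in `coef_` is a quick way to see which features lasso has dropped.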


### Use
* **Ridge**: many small/medium sized effects.
* **Lasso**: only a few variables with medium/large effects.

# Polynomial Features:

Generate polynomial and interaction features.

* It is still a **linear** model.
* Polynomial feature expansion is often combined with a regularization learning method like ridge regression.
* Using higher degrees leads to more complex models and regularization might be needed to avoid overfitting.
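
As a small illustration of what the expansion produces, the sketch below applies scikit-learn's `PolynomialFeatures` with degree 2 to a single made-up sample with two features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample with features x0 = 2, x1 = 3

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Columns: 1, x0, x1, x0^2, x0*x1, x1^2
print(X_poly)
```

A linear model fitted on `X_poly` is still linear in its weights, even though it is non-linear in the original features.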

# Linear model for Classification

# <span style='color:blue'>  Logistic regression </span>

It is a **statistical linear method** for binary classification i.e., it gives binary outcomes (0/1). It performs best when the classes are well-balanced and data is not too complex.

* Linear model
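* Default: binary classification, but it can be extended to multi-class problems
* The logistic function (activation function) is applied to the linear combination of the features to give an estimated probability; thresholding that probability (typically at 0.5) determines the class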
* Parameter $C$ controls **regularization**
    - default: $C = 1$ Ridge (L2) regularization
    - **Higher** $C$ corresponds to **less** regularization
* Unlike kernelized SVMs, logistic regression has no $\gamma$ parameter: $\gamma$ controls the curvature of **non-linear** decision boundaries, while the decision boundary here is linear.
* Feature normalization is important here, because regularization penalizes all weights on the same scale
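
The prediction step can be sketched in plain Python. The weights and bias below are hypothetical values picked by hand, not fitted parameters, and the helper names `sigmoid` and `predict_class` are made up for this example.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(w, x, b, threshold=0.5):
    """Return (probability of class 1, predicted class) for one sample."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear part w.x + b
    p = sigmoid(z)
    return p, int(p >= threshold)


# Hypothetical weights and bias, chosen by hand for illustration.
w, b = [1.0, -2.0], 0.5
p_pos, class_pos = predict_class(w, [3.0, 0.0], b)  # z = 3.5
p_neg, class_neg = predict_class(w, [0.0, 2.0], b)  # z = -3.5
print(p_pos, class_pos, p_neg, class_neg)
```

The decision boundary is the set of points where $w \cdot x + b = 0$, which is why the model stays linear even though the output passes through a non-linear function.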

# <span style='color:blue'>  Support vector machine (SVM): </span>


Support vector machine is a supervised algorithm that can be used for
- regression,
- classification.

It finds a hyperplane to separate the data into two different classes. SVM is good when the data has many features and the boundary is not linear. The algorithm uses a technique called the kernel trick to transform the data into a higher dimensional space where a linear boundary can separate the classes.

A kernel function is a mathematical function that is used to transform the input data into a higher-dimensional space where it becomes linearly separable. SVM uses a kernel function to handle non-linearly separable data, while LSVM uses a linear kernel function.

SVM can use different types of kernel functions, such as polynomial, radial basis function (RBF), and sigmoid, which allow them to handle **non-linearly** separable data more effectively. However, these kernel functions can be computationally expensive, which can make SVMs less efficient for large datasets.


The main advantage of using a linear kernel function in LSVM is its simplicity and efficiency, as it doesn't require any additional computation to transform the data into a higher-dimensional space, which makes it computationally efficient for large datasets. Additionally, it's simpler to understand and implement, and it can be regularized to prevent overfitting.

* Apply the **sign function** as activation function to produce a binary output
    - feature vector -> linear function: $\text{sign}(w \cdot x+b)$ -> class value
* **Classifier margin** is defined as the width by which the decision boundary area can be increased before hitting a data point.
* The **best** classifier has the **maximum** margin.
* The **maximum** margin classifier is called the **linear support vector machine (LSVM)**
* Parameter $C$ controls **regularization**
    - default: $C = 1$ Ridge (L2) regularization
    - **Higher** $C$ corresponds to **less** regularization.
        * Fit the training data as well as possible
        * Each individual data point is important to classify correctly.
* Parameter $\gamma$ is a hyperparameter that has to be set before training the model.
    - $\gamma$ is a parameter for **non-linear** hyperplanes.
    - $\gamma$ decides how much curvature we want in the decision boundary.
    - **Higher** $\gamma$ tries to fit the training data set exactly, which increases the risk of overfitting.
* Parameter kernel
    - show the type of hyperplane used to separate the data.
        * linear hyperplane : ‘linear’ (a line in the case of 2D data). 
        * non-linear hyperplane: ‘rbf’ and ‘poly’
        
* **Drawback**:
    

* **Benefit** (linear SVM):
    - Simple and easy to train
    - Fast prediction
    - Scales well to very large datasets
    
### <span style='color:green'>scikit-learn</span>:

* sklearn.svm.SVC
* sklearn.svm.LinearSVC
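
To illustrate the effect of the kernel parameter, the sketch below fits `SVC` with a linear and an RBF kernel on two concentric rings, a dataset that no straight line can separate. The dataset settings are assumptions chosen for illustration, and the scores are training accuracies only.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

linear_acc = linear_svm.score(X, y)  # training accuracy
rbf_acc = rbf_svm.score(X, y)
print(linear_acc, rbf_acc)
```

The RBF kernel separates the rings while the linear kernel cannot, which is the situation the kernel trick is meant for.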

## Multiclass classifier:

* For linear models such as LinearSVC, scikit-learn turns a multi-class problem into a set of binary problems (one class against all other classes).
    - The class with the **highest score** is chosen as the **predicted class**.
    - model.coef_ gives us $n$ (number of classes) sets of parameters, one per class.
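
A quick way to see the one-set-of-parameters-per-class structure is to inspect the fitted attributes. The sketch below assumes the iris dataset bundled with scikit-learn (3 classes, 4 features) and a `LinearSVC` with an enlarged `max_iter` to help convergence.

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # 3 classes, 4 features

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

print(clf.coef_.shape)       # one row of weights per class
print(clf.intercept_.shape)  # one bias per class

# The predicted class is the one with the highest decision score.
scores = clf.decision_function(X[:5])
predicted = clf.classes_[scores.argmax(axis=1)]
print(predicted)
```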



# <span style='color:blue'>  Linear Support vector machine (LSVM): </span>

Like the support vector machine, the linear support vector machine (LSVM) is a supervised learning algorithm that can be used for both classification and regression tasks. The main difference between them is the type of kernel function they use.

The choice of algorithm depends on the specific problem: both SVM and LSVM have their own advantages and limitations.

# <span style='color:blue'>  Kernelized Support vector machine (KSVM): </span>

????

# <span style='color:orange'>  Train-test split </span>

When the data is randomly split into training and test sets, the model is trained on the training set and the test set is used for validation. A typical split is 70:30 or 80:20. With **limited data** there is a risk of **high bias**, because the information in the held-out samples is not used for training.

To solve this problem we go for **Cross validation** approach.

# <span style='color:orange'>  Cross validation: </span>

Cross validation estimates how well a classifier generalizes. It gives a **less biased** estimate than a single train-test split.

* **why?** The test set represents data that has not been seen during training but has the **same** general attributes as the original data set. Evaluating on a single test set has a problem: the result may look good for that specific test set by chance. So we go for **cross-validation**.

* **k-fold cross validation**:
Cross-validation splits the dataset into k parts (e.g. k=5), uses one part as the test set and the rest as the training set, and evaluates the model. It then uses the next part as the test set, and so on. At the end we look at the average result.

* **Stratified cross-validation** means that when splitting the data, the proportions of classes in each fold are made as close as possible to the actual proportions of the classes in the overall data set.


It has two steps:

1) Splitting the data set into subsets (k folds):

    - Each fold has approximately the same size
    - Data can be randomly selected in each fold or stratified

2) Rotating, training and validating among them
    
    - All folds are used to train the model except one, which is used for validation.
    - That validation fold should be rotated until all folds have become a validation fold once and only once.
    - Each example should be contained in one and only one fold.
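
The two steps can be sketched in plain Python. This is a simplified illustration without shuffling or stratification; the function name `k_fold_indices` is made up for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each index appears in exactly one validation fold, and the remaining indices form the training set for that round.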
    
    
## <span style='color:green'>scikit-learn</span>:

### Key functions: 

* cross_val_score: returns the score of each test fold
* cross_val_predict: returns the prediction for each observation in the input dataset, made when it was part of the test set
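
A minimal usage sketch of both functions, assuming the iris dataset bundled with scikit-learn and a KNN classifier as the model. For classifiers, an integer `cv` uses stratified folds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

scores = cross_val_score(clf, X, y, cv=5)         # one accuracy score per fold
predictions = cross_val_predict(clf, X, y, cv=5)  # one prediction per sample

print(scores)
print(scores.mean())
```

Reporting the mean together with the spread of the fold scores gives a sense of how much the result depends on the particular split.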

# <span style='color:blue'> Decision Tree </span>

* Classifier
* Regressor
    - the tree routes a sample to a leaf as in classification, then predicts the mean target value of the training samples in that leaf.

Based on a series of **if-then** / **yes-no** questions. Each question is a node. 




Different types of nodes:

    1) Root node
        - Top
    2) Decision node
        - Middle
    3) Terminal node (Leaf)
        - bottom

### Drawback:

**Overfitting** is likely to happen.
To prevent overfitting, two pruning strategies are used:
* pre-pruning
    - early stop of growing tree
* post-pruning
    - grow the full tree, then remove nodes that add little predictive value
    
### Benefit:
* Works well with different types of features (categorical and non-categorical)
* No need for normalization
    
## <span style='color:green'>scikit-learn</span>:

### Key parameters: 
Controlling one of the parameters below is usually enough to control overfitting.

* **max_depth**: controls the maximum depth of the tree, i.e. the maximum number of split points between the root and any leaf.
    - Most common way to reduce the complexity of the model.
* **min_samples_leaf**: minimum number of data instances a leaf must have; splits that would create a smaller leaf are not made.
* **max_leaf_nodes**: limit total number of leaves in the tree.
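
The effect of pre-pruning can be checked by comparing a fully grown tree with a depth-limited one. The sketch below assumes the iris dataset bundled with scikit-learn; `max_depth=3` is an arbitrary value for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Depth and number of leaves are simple measures of model complexity.
print(full_tree.get_depth(), full_tree.get_n_leaves(), full_tree.score(X, y))
print(pruned_tree.get_depth(), pruned_tree.get_n_leaves(), pruned_tree.score(X, y))
```

The limited tree is smaller and fits the training data less closely, which is the trade-off that pre-pruning makes to reduce overfitting.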



# <span style='color:blue'> Random forest </span>

To reduce **overfitting** in decision trees, a random forest uses an ensemble of trees.
* Classifier
* Regressor

Here in Random forest algorithm,
- The data used to build each tree is **randomly** selected.
- The features considered at each split are also **randomly** selected.

## <span style='color:green'>scikit-learn</span>:

### Key parameters: 

* **n_estimators**: the number of trees in the random forest.
* **max_features**: the number of features in the random subset considered at each split.

### Methods:
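* decision_function(X)
    - predicts confidence scores for samples (available on estimators such as SVC and LogisticRegression; RandomForestClassifier does not provide it).
* predict(X)
    - Predict class for X.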
* predict_proba(X)
    - Predict class probabilities for X.
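
A short sketch of the parameters and methods in use, assuming the iris dataset bundled with scikit-learn. The values `n_estimators=10` and `max_features=2` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=10, max_features=2, random_state=0)
forest.fit(X, y)

print(len(forest.estimators_))       # number of fitted trees
labels = forest.predict(X[:3])       # predicted classes
proba = forest.predict_proba(X[:3])  # class probabilities, averaged over the trees
print(labels)
print(proba)
```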


# <span style='color:blue'> Naive Bayes Classifiers </span>
* Linear classifier
* Based on Bayes theorem
* Based on simple probabilistic model
    - Each feature is independent of all the others.
    - All the predictors have an equal effect on the outcome.
        * learning is very fast.
* Commonly used for
    - sentiment analysis
    - spam filtering
    - recommendation systems
        
Different types of Bayes Classifier:
* Bernoulli:
    - Binary features (Boolean variables)
        * True/False
        * word presence/absence in a text (if a word occurs in the text or not)
        * Spam/not spam
        * 0/1
    - Can work well on small datasets compared to other classification algorithms.
* Multinomial:
    - mostly used for document classification problem
        * whether a document belongs to the category of sports, politics, technology etc.
    - for data that follows a multinomial distribution.
    - discrete features
        * **frequency** of the words present in the document
* Gaussian:
    - Continuous/real-valued features
    - assume that the data for each class was generated by a simple class specific Gaussian distribution.
    - Decision boundary is in general a parabolic curve between the classes.
    - used for high dimensional data
    - partial_fit: supports incremental training on batches of data when the full dataset does not fit in memory
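
As a minimal sketch of the Gaussian variant, the example below fits `GaussianNB` on the iris dataset bundled with scikit-learn, whose features are continuous. The dataset choice is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # continuous, real-valued features

model = GaussianNB().fit(X, y)
train_accuracy = model.score(X, y)

# Per-class Gaussian means learned from the data: shape (n_classes, n_features).
print(model.theta_.shape)
print(train_accuracy)
```

Training only requires estimating a mean and a variance per class and feature, which is why learning is fast.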


### Drawback:
- features are assumed to be independent of each other, which rarely holds in practice

### Benefit:
- Easy to implement
- Fast

# <span style='color:blue'> Dummy estimators </span>

A dummy estimator provides a simple baseline to compare against more complex algorithms.

- Classifier
- Regressor


**Dummy Classifier** is a classifier model that makes predictions without trying to find patterns in the data.
- provides null accuracy baseline
- strategy:
    * **most_frequent**: predicts the most frequent label in the training data
        - the default in current scikit-learn versions is **prior**, which predicts the same label but returns the class prior from predict_proba
    * **stratified**: makes random predictions that follow the training set class distribution
    * **uniform**: generates predictions uniformly at random, so each class is equally likely
    * **constant**: predicts a constant label provided by the user

**Dummy regressor**

- strategy:
    * **mean**: predicts the mean of the training set.
        - default of the Dummy regressor
    * **median**: predicts the median of the training set
    * **quantile**: predicts a user-specified quantile of the training set targets
    * **constant**: predicts a constant value provided by the user
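
The baselines can be sketched on a tiny made-up dataset. The labels and target values below are assumptions chosen so that the results are easy to verify by hand.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

X = np.zeros((6, 1))                    # features are ignored by dummy estimators
y_class = np.array([0, 0, 0, 0, 1, 1])  # imbalanced labels, made up for illustration
y_value = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])

dummy_clf = DummyClassifier(strategy="most_frequent").fit(X, y_class)
dummy_reg = DummyRegressor(strategy="mean").fit(X, y_value)

baseline_accuracy = dummy_clf.score(X, y_class)    # fraction of the majority class
baseline_prediction = dummy_reg.predict(X[:1])[0]  # mean of the training targets
print(baseline_accuracy, baseline_prediction)
```

A real model should clearly beat these numbers; if it does not, the features or the model need another look.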
    
    
### Drawback:
- Ineffective

### Benefit:
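- gives a baseline value for a metric
- helps reveal large class imbalance: if the dummy classifier already scores high accuracy, accuracy alone is not informative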

# <span style='color:black'> Optimization algorithms </span>
 The goal of optimization algorithms is to find the set of parameters that minimize the cost function or objective function.

- **Normal equation** is a non-iterative method to find an optimal solution for certain types of linear regression. It solves the problem exactly by minimizing the difference between predicted values and actual values. An alternative to solving linear regression is using gradient descent which is an iterative optimization approach.

- **Gradient descent** (GD) is an iterative first-order optimization algorithm used to find a local minimum of a given function. The algorithm repeatedly moves in the direction of the negative gradient, which is the direction of steepest descent. At each iteration, it updates the model parameters by a small step that reduces the value of the loss function. The step size is determined by the learning rate, a hyperparameter. Gradient descent does not work for all functions: the function has to be differentiable, and it has to be convex for the result to be guaranteed to be the global minimum. It is easy and efficient to implement, but on non-convex functions it can get stuck in local minima, and the learning rate needs tuning.
    - **Batch Gradient Descent**: In this variant, the gradient is calculated using the whole dataset before updating the parameters.
    - **Stochastic Gradient Descent (SGD)**: In this variant, the gradient is calculated using a single sample from the dataset at each iteration.
    - **Mini-batch Gradient Descent**: In this variant, the gradient is calculated using a small subset (batch) of the dataset at each iteration.
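
The normal equation and batch gradient descent can be compared on a small least-squares problem. The sketch below uses noise-free toy data from $y = 2x + 1$; the learning rate and iteration count are assumptions that happen to work for this data, not general recommendations.

```python
import numpy as np

# Toy data from y = 2*x + 1 (noise-free, chosen for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones so the bias is learned as a weight.
A = np.column_stack([x, np.ones_like(x)])

# Normal equation: solve (A^T A) theta = A^T y in one step.
theta_exact = np.linalg.solve(A.T @ A, A.T @ y)

# Batch gradient descent on the mean squared error.
theta = np.zeros(2)
learning_rate = 0.1
for _ in range(5000):
    gradient = 2.0 / len(y) * A.T @ (A @ theta - y)  # uses the whole dataset
    theta -= learning_rate * gradient

print(theta_exact, theta)
```

Both approaches reach the same parameters here because the squared-error loss of linear regression is convex. Replacing the full dataset in the gradient with one sample or a small batch would give the stochastic and mini-batch variants.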
