# ML Basics 
##### 15-April-2022

### Model Regularisation: How to reduce a model's complexity so that it doesn't overfit
    
#### Weight Penalty
To prevent overfitting, we add a penalty term to our loss function:

`L_reg = L + Lambda * R`

Here,

L = Loss function (MSE, log-loss, etc.)

R = Regulariser

Lambda = Regularisation Strength

#### Example: L2 Penalty
Here R = sum of the squares of the weights (not including the bias).

This prevents the weights from becoming too large, which reduces overfitting. It can be optimised using gradient descent since it is easily differentiable.
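
As a rough illustration of the L2 penalty, here is a minimal NumPy sketch that uses MSE as the loss L and runs plain gradient descent on `L_reg`. The toy data, the learning rate and the helper names (`l2_regularised_mse`, `gradient_step`, `fit`) are made up for this example.

```python
import numpy as np

def l2_regularised_mse(w, b, X, y, lam):
    """Return L_reg = L + lam * R, with L = MSE and R = sum of squared weights."""
    residual = X @ w + b - y
    mse = np.mean(residual ** 2)
    penalty = np.sum(w ** 2)  # the bias b is deliberately left out
    return mse + lam * penalty

def gradient_step(w, b, X, y, lam, lr=0.1):
    """One gradient descent step on the regularised loss."""
    n = len(y)
    residual = X @ w + b - y
    grad_w = (2.0 / n) * (X.T @ residual) + 2.0 * lam * w
    grad_b = (2.0 / n) * np.sum(residual)  # no penalty term for the bias
    return w - lr * grad_w, b - lr * grad_b

# Toy data, invented for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3

def fit(lam, steps=500):
    w, b = np.zeros(3), 0.0
    for _ in range(steps):
        w, b = gradient_step(w, b, X, y, lam)
    return w, b

w_plain, _ = fit(lam=0.0)
w_reg, _ = fit(lam=1.0)
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))
```

With a larger Lambda the learned weight vector should come out shorter, which is the shrinking effect described above.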


#### Another example: L1 Penalty
Here R = sum of the absolute values of the weights (not including the bias).
- Disadvantage: the absolute value is not differentiable at 0, so it cannot be optimised well with simple gradient descent (it needs some more advanced optimisation techniques).
- Advantage: it leads to sparse solutions, i.e. the model can drive some of the weights to exactly 0, so it depends only on a subset of the features.
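
The notes do not say which advanced technique is meant. One common choice is proximal gradient descent with soft-thresholding, sketched below under that assumption. The data and the names `soft_threshold` and `proximal_step` are hypothetical, and the bias is left out for brevity.

```python
import numpy as np

def soft_threshold(w, t):
    """Shrink each weight towards 0 by t; weights with |w| <= t become exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def proximal_step(w, X, y, lam, lr=0.1):
    """Gradient step on the MSE, then soft-thresholding to handle the L1 penalty."""
    n = len(y)
    grad = (2.0 / n) * (X.T @ (X @ w - y))
    return soft_threshold(w - lr * grad, lr * lam)

# Toy data, invented for illustration: only features 0 and 3 influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0])

w = np.zeros(5)
for _ in range(500):
    w = proximal_step(w, X, y, lam=0.5)
print(w)
```

The weights of the irrelevant features should end up at exactly 0, while the two useful weights stay non-zero (slightly shrunk), which is the sparsity described above.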

#### Other regularisation methods:
 - Dimensionality reduction (remove some features or apply principal component analysis to get good features)
 - Data augmentation (distort, flip, rotate images, etc.)
 - Dropout (will be discussed in coming sessions)
 - Early stopping
 - Collect more data (with more data it is harder to overfit)
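
Early stopping is only named in the list above. A common form of it, assumed here, is to stop training once the validation loss has not improved for a fixed number of epochs (the "patience") and keep the weights from the best epoch. The function name and the loss values below are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights we would keep."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Made-up validation losses: they improve, then rise as the model overfits
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.49]
print(early_stopping_epoch(val_losses))
```

In a real training loop the losses would be computed one epoch at a time, so training stops before the later epochs are ever run.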
 



## Data Preprocessing 

#### What if one feature was in km and the other in m?

##### Feature scaling:
Ex: Car distance travelled & car length, annual income & age, etc.

Feature scaling brings all features to a comparable range, so that no feature (input) dominates the others just because its numbers are larger.
    
    

##### Data Encoding:
Say you have data with 5 categories, for example which subject each student scored the highest in.

If we label them as
` 1,2,3,4,5 `, the ML model will assume that the order is significant and that the difference between neighbouring categories is the same, which is wrong.

So in these cases we use one-hot encoding.
``` 
    1 -> 1 0 0 0 0
    2 -> 0 1 0 0 0
    3 -> 0 0 1 0 0
```

Only one of them is high (1), depending on the category. No artificial order is imposed on the categories, which helps the model understand the problem better.
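
A minimal NumPy sketch of one-hot encoding, assuming the categories are already coded as the integers 1 to 5. The label values are invented.

```python
import numpy as np

# Hypothetical labels: best subject per student, coded as integers 1..5
labels = np.array([1, 2, 3, 5, 2])
num_categories = 5

# Row i of the identity matrix is the one-hot vector for category i + 1
one_hot = np.eye(num_categories, dtype=int)[labels - 1]
print(one_hot)
```

Each row of `one_hot` contains a single 1 in the column of its category and 0 everywhere else.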
    
**Feature scaling can be done by Standardisation:**

`X' = (X - mean(X)) / std(X)`

**and by Normalisation:**

`X' = (X - min(X)) / (max(X) - min(X))`
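
Both formulas can be applied column by column with NumPy. The numbers below are made up, with one feature in km and the other in m, as in the question above.

```python
import numpy as np

# Hypothetical data: column 0 = distance travelled (km), column 1 = car length (m)
X = np.array([[12000.0, 4.2],
              [45000.0, 3.9],
              [88000.0, 4.8],
              [30000.0, 4.5]])

# Standardisation: each column ends up with mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalisation (min-max): each column ends up in the range [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std)
print(X_norm)
```

After either transformation the two columns are on a comparable scale, even though the raw values differ by several orders of magnitude.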
    

## Feature Engineering
Feature engineering is the pre-processing step of machine learning that extracts useful features from raw data. It includes:
- **Feature Creation:**
    Creating new variables that will be most helpful for our model.
    This can mean adding new features or removing existing ones.
    For example, a feature called cost per sq. ft, computed from the total cost and area of a property, can be a useful feature and can also help spot mistakes in the data using domain knowledge.

- **Transformations:** 
    Transforming features from one representation to another, for example rotating, zooming, cropping or distorting images.

- **Feature Extraction:**
    Extracting useful features from the data. Without distorting the original relationships or losing significant information,
    this compresses the data into quantities that are manageable for algorithms to process.
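
The cost per sq. ft example from Feature Creation can be sketched as follows. The property values are invented, and the "5 times the median" rule for flagging suspicious rows is an arbitrary choice for illustration, not something from the session.

```python
import numpy as np

# Hypothetical property data
total_cost = np.array([250000.0, 480000.0, 9000000.0, 310000.0])
area_sqft = np.array([1000.0, 1600.0, 1200.0, 1240.0])

# New feature created from two existing ones
cost_per_sqft = total_cost / area_sqft

# Flag rows whose cost per sq. ft is far above the median (factor 5 is arbitrary)
median = np.median(cost_per_sqft)
suspicious = cost_per_sqft > 5 * median

print(cost_per_sqft)
print(suspicious)
```

Here the third property stands out, which could point to a data entry mistake such as an extra zero in the cost.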
    
### Exploratory Data Analysis:
Exploratory data analysis (EDA) is a powerful and simple tool that can be used to improve your understanding of your data, by exploring its properties. 
The technique is often applied when the goal is to create new hypotheses or find patterns in the data. It’s often used on large amounts of qualitative or quantitative data that haven’t been analyzed before.

This involves exploring and transforming your data to understand it well. The patterns and relations you find will be useful when choosing the features as well as when choosing and implementing your model.
Plotting the data in various ways also helps to visualise and understand it.

# Multi Layered Perceptron (MLP)

Note: In the previous topics, I've only given surface-level knowledge of each topic. You must read up on them to learn more details, e.g. momentum in gradient descent.

*Reference: Slides borrowed from Introduction to Deep Learning Course by HSE University on Coursera.*

### Linear Binary Classification:
   <img src="Images/LinearBinaryClassification.jpg" alt="LinearBinaryClassification" width="400"/>
   
   Here, we have two features x1 and x2, two weights w1 and w2 and a bias w0.
   We predict one class if the linear function `w0 + w1*x1 + w2*x2` is positive and the other class if it is negative. Thus this is used for binary classification.
   
   Rather than applying a hard threshold (sign function) to this output, if we pass it through a sigmoid function, we get logistic regression.
  ### Logistic Regression:
  
  <img src="Images/LogisticRegression.jpg" alt="LogisticRegression" width="600"/>
  
  We know that the output of the linear function is proportional to the distance from the line.
  Passing it through the sigmoid function converts this distance into a measure of confidence, i.e. the closer the output is to 1 or 0, the more likely the classification is correct.
  An output of ~0.5 means the point is near the border between the classes and we are not that confident that it is classified correctly.
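
A small sketch of the difference between the hard linear classifier and logistic regression. The weights and the three points are hypothetical and chosen by hand.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: bias w0 and weights (w1, w2) for features (x1, x2)
w0, w = -1.0, np.array([2.0, -1.0])

points = np.array([[3.0, 0.0],    # far on the positive side of the line
                   [0.5, 0.0],    # exactly on the line
                   [-2.0, 1.0]])  # far on the negative side of the line

scores = w0 + points @ w                  # linear function
hard_labels = np.where(scores > 0, 1, 0)  # linear binary classification
probabilities = sigmoid(scores)           # logistic regression

print(scores)
print(hard_labels)
print(probabilities)
```

The point on the line gets a probability of exactly 0.5, while the two distant points get values close to 1 and 0. A score of exactly 0 is assigned to class 0 here, which is an arbitrary tie-breaking choice.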
  
  Now say we have this triangle problem:
  <img src="Images/TriangleProblem.jpg" alt="TriangleProblem" width="400"/>
  
  Clearly, we cannot separate the two classes perfectly with a single line, i.e. with logistic regression alone.
  But say we found one line like this:
  <img src="Images/Line1.jpg" alt="TriangleProblem" width="600"/>
  
  Similarly, with three lines, each line gives us one new calculated feature:
  <img src="Images/NewFeatures.jpg" alt="TriangleProblem" width="600"/>
  
  So we can now train a linear model on the three calculated features:
  <img src="Images/ComputationGraph.jpg" alt="TriangleProblem" width="600"/>

  This is what we call a Multi Layered Perceptron (MLP):
  <img src="Images/MLP.jpg" alt="TriangleProblem" width="600"/>

  It has three types of layers, namely:
  - The Input layer
  - The Output Layer
  - The hidden layers
  
  Each node is called a neuron, and the sigmoid function used after each linear combination is called an activation function.
  
  Thus the MLP belongs to a group of machine learning models called **Neural Networks**.
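
To make the idea concrete, here is a minimal forward pass of such an MLP in NumPy, with three hidden neurons (one per line) and one output neuron. The triangle corners and all weights are hand-picked assumptions for illustration. They are not trained and are not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layer: three neurons, one per edge of a made-up triangle
# with corners (0, 0), (2, 0) and (1, 2). Weights are hand-picked, not trained.
SHARPNESS = 10.0
W_hidden = SHARPNESS * np.array([[0.0, 1.0],     # x2 > 0
                                 [2.0, -1.0],    # x2 < 2 * x1
                                 [-2.0, -1.0]])  # x2 < 4 - 2 * x1
b_hidden = SHARPNESS * np.array([0.0, 0.0, 4.0])

# Output layer: a linear model on the three calculated features,
# which is close to 1 only when all three hidden neurons are close to 1
w_out = SHARPNESS * np.array([1.0, 1.0, 1.0])
b_out = SHARPNESS * -2.5

def mlp_forward(x):
    hidden = sigmoid(W_hidden @ x + b_hidden)  # three new features
    return sigmoid(w_out @ hidden + b_out)     # final prediction

inside = mlp_forward(np.array([1.0, 0.5]))
outside = mlp_forward(np.array([3.0, 0.5]))
print(inside, outside)
```

The point inside the triangle should get an output close to 1 and the point outside an output close to 0. In practice the weights would be learned by gradient descent rather than set by hand.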

We will discuss neural networks further in the next session.