#### **Introduction to Machine Learning**

Imagine teaching a child to recognize animals. You show them pictures of cats and dogs, telling them which is which. After seeing many examples, the child learns to identify new animals they've never seen before. Machine Learning works the same way!


Machine Learning (ML) is a **subfield of artificial intelligence** that focuses on building systems that **learn patterns from data** and improve their performance over time **without being explicitly programmed for every scenario**. 
Instead of writing fixed rules, we provide data and allow algorithms to infer the underlying relationships.


**Why Was Machine Learning Needed?**

Machine learning emerged because traditional rule-based programming reached its limits in real-world problems. 
- Rule-Based Systems Do Not Scale
    - Developers manually define rules: if condition → then action
    - This works only when rules are simple, stable, and well-defined
- Data Became Abundant
    - With the rise of the internet, sensors and IoT devices, mobile applications, and enterprise systems, we started generating massive volumes of data. When data is available, it is often easier and more effective to learn patterns from it than to hard-code expert knowledge. Machine learning uses this data to extract insights automatically.
- Many Problems Lack Explicit Algorithms
    - Problems such as fraud detection and recommendation systems have no clear logical solution, so we cannot write deterministic algorithms for them. ML provides a way to approximate complex functions from input to output.
- Improved Performance Over Time
    - Unlike traditional programs, ML models can improve as more data becomes available, so performance can increase without rewriting code.
    - Example: Search engines, recommender systems, and voice assistants become more accurate as usage grows.

**Bottom Line:** Decisions should be backed by data. You cannot hard-code every real-world complexity as explicit logic.

#### **Evolution of Machine learning**
- Rule-Based Systems (1950s–1970s)
    - Early AI relied on hand-crafted rules written by humans. These systems worked only for simple, well-defined problems and failed to scale to real-world complexity.

- Statistical & Probabilistic Methods (1980s–1990s)
    - Focus shifted to statistics and probability. Algorithms such as linear regression, decision trees, k-nearest neighbors, and Naive Bayes learned patterns from data rather than fixed rules.

- Classical Machine Learning (1990s–2010)
    - More robust algorithms emerged, including SVMs, ensemble methods (bagging, boosting, random forests), and improved optimization techniques. ML became practical due to better computing power and data availability.

- Big Data & Feature Engineering Era (2010–2015)
    - Explosion of data and distributed computing (Hadoop, Spark). Performance depended heavily on manual feature engineering combined with classical ML models.

- Deep Learning Era (2012–Present)
    - Neural networks with many layers (CNNs, RNNs, Transformers) enabled automatic feature learning. Major breakthroughs occurred in vision, speech, and natural language processing.

- Modern ML Systems (Present)
    - Focus expanded beyond models to end-to-end systems: MLOps, scalable training, real-time inference, foundation models, and responsible AI.

In one line:
Machine learning evolved from hand-coded rules → statistical learning → classical ML → deep learning → large-scale intelligent systems.


**Ques:** If machine learning has been around since the 1950s, why has its usage increased only now?

**Ans:**
- The amount of available data has increased significantly.
- Computation power has increased drastically in recent years.
- Computation costs have dropped sharply in recent years.

Data --> EDA --> FE --> ML --> find features --> Task complete

(EDA = exploratory data analysis, FE = feature engineering.)

If a new set of features comes up, just redo the FE and give the result to the ML model, and it will come up with solutions. You do not need to rewrite the 100 lines of rule-based code.

#### **Machine Learning is a Subset of Artificial Intelligence**

**Artificial Intelligence** is concerned with building systems that can perform tasks that typically require human intelligence, such as reasoning, planning, and ultimately decision making.

**Machine Learning** is a subset of AI that focuses specifically on *enabling systems to learn from data and improve performance over time* without explicit rule-based programming. Generally speaking, machine learning is one of the main techniques used to achieve artificial intelligence.

**Show venn diagram for AI, ML, DL, etc.** 

#### **Types of Machine Learning**

**What is Learning?**

Learning simply means finding patterns. When a machine finds these patterns automatically, we call the process machine learning.

There are three ways in which a machine can learn:

- **Supervised learning**: As the name suggests, the model learns under supervision, as if it had a teacher. It is trained on labelled data, i.e. input-output pairs.
  - Think of team members who need feedback from a supervisor to confirm that they are doing the work correctly.
  - Similarly, think of the features as questions and the target as the answers: we tell the machine which answer goes with which question.
  - Features are the input variables, also known as independent variables.
  - Targets are the labels, i.e. the known output or ground truth, also known as the dependent variable.
  - Example: To decide the income for a new job, you need to know the applicant's age, years of experience, and tech stack. These are the independent variables; together they determine the income, which is why income is the dependent variable.
  - Real-life examples: house price prediction, credit card eligibility
  - Question we ask the ML model: Given data, **predict some answer**
  

- **Unsupervised learning** : Learning without labels. 
  - The model finds hidden patterns on its own, and then makes groups like grouping similar customers together. 
  - Features --> ML --> groups.
  - Question we ask the ML model: Given the data, **segment it into groups.**


- **Reinforcement learning**: Closest to how humans learn.
  - An agent learns by interacting with an environment and receiving rewards or penalties.
  - Example: autonomous cars, robotics, playing chess, etc.
  - Unlikely to be used in our day-to-day jobs.
  

### **Machine learning prep**
Analogy:
- A teacher wants to evaluate whether a student is smart or not.
- She teaches the student 10 questions.
- There are now two ways to test the student: give the same 10 questions in the test, or give new questions.
    - If the student is asked the same 10 questions, they might have memorised them and give a false impression of being smart, i.e. they can score 100%.
    - For a holistic evaluation, it is important to use completely new questions, because these check thorough understanding.
- Similarly, when we provide data to a model and ask it to learn the patterns, it can memorise all the existing patterns and perform perfectly on them, but when it is given new data, it **might** underperform.
- To avoid this, we split the data into two parts: the train dataset, which the model sees and learns from, and the test dataset, which it does not see.
- When the model is then tested on the unseen data, the result shows how well the model is actually likely to perform in the real world.

This process of dividing the data into 2 (or 3) parts is called **Data Splitting**.

You can do slightly better by splitting into three parts. For 100 records, say:
- Train set (70) : Study material
- Validation set (20) : Weekly test
- Test set (10) : Final exam

This is like teaching a student, checking their learning in unit tests, improving, and finally testing them in the annual exam.

General rule of thumb for a train/test split: 70-80% train, 20-30% test. When you have a very large number of rows, a much smaller test share (even 95-5) can still leave enough test rows for a reliable estimate; with little data, a very small test set gives a noisy estimate.  
General rule of thumb for a train/validation/test split: 70% train, 20% validation, 10% test.
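
The 70/20/10 split described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `train_val_test_split`, its parameters, and the fixed seed are assumptions made for this example, not part of any library.

```python
import random

def train_val_test_split(rows, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle rows and cut them into train / validation / test lists.

    Whatever is left after train_frac and val_frac becomes the test set.
    """
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed -> reproducible split
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

records = list(range(100))  # stand-in for 100 data rows
train, val, test = train_val_test_split(records)
print(len(train), len(val), len(test))  # 70 20 10
```

Shuffling before cutting matters: if the rows are sorted (for example by date or by label), cutting without shuffling can give splits that look very different from each other.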

#### 1. Data splitting 

General Steps:
- Build a model using the train dataset.
- Check the model's accuracy on the validation data and note the result, say 70%.
- If this is not acceptable, change some parameters and check the accuracy on the validation dataset again. Say it is now 76%. ACCEPTABLE!
- Now evaluate on the test data and check the accuracy, say 72%, i.e. very near to the validation accuracy.
- PERFECT!!

If the test accuracy is far from the validation accuracy, we need to make some tweaks.
- Start by checking the composition of each of the three datasets, i.e. train, validation, and test.
- There may be data imbalance or skewness.
- Make sure each split contains a mix of the different types of data points observed during EDA, and then retrain the model on the mixed data.

#### **Errors, Cost Function and Loss Function**
Suppose that for x1 = 2, x2 = 3 the actual value is y = 10 and the ML model predicted ŷ = 3.

There needs to be a mechanism that tells us how wrong the predicted value is compared with the actual value.
This comparison is called **evaluation**, and the **evaluation metrics** you compute help you **decide how well the model is learning**, i.e. whether it is a good model or not.

- **Error** is the difference between the actual value and the predicted value.

In the context of machine learning, error calculation happens at two levels:
1) Error for each single data point
2) Error representing the overall model

Errors
- For each data point ==> we use a loss function. A **loss function** (also called an error function) measures how wrong a model's prediction is for a single data point.
- For a set of data (train, validation, etc.) ==> we use a cost function, which calculates the total error over the dataset. The **cost function** aggregates the loss over all training examples and represents the overall performance of the model.


Say, for example, you have the train data and want to check the model's performance. You will use a cost function, which of course internally uses a loss function to calculate the loss on each data point. 
Train set error calculation --> you need a cost function --> you calculate the loss on each data point using a loss function (a formula)
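
The relationship between the two can be sketched as follows. Squared error is used as the loss and the mean as the aggregation; both are assumptions chosen for illustration, and the function names and the small dataset are invented for this example.

```python
def squared_loss(y_true, y_pred):
    """Loss for ONE data point: squared difference between actual and predicted."""
    return (y_true - y_pred) ** 2

def cost(y_trues, y_preds, loss_fn=squared_loss):
    """Cost for a WHOLE dataset: average of the per-point losses."""
    losses = [loss_fn(t, p) for t, p in zip(y_trues, y_preds)]
    return sum(losses) / len(losses)

# The single point from the text: actual y = 10, predicted y_hat = 3
print(squared_loss(10, 3))  # 49

# Hypothetical small dataset (values invented for illustration)
actual = [10, 4, 7]
predicted = [3, 4, 9]
print(cost(actual, predicted))  # (49 + 0 + 4) / 3
```

Passing the loss in as `loss_fn` mirrors the idea in the text: the cost function does the aggregation, and the loss function is the formula applied to each data point.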

#### **Underfitting versus Overfitting**

Underfitting and overfitting are modelling errors related to how well a model learns patterns from data and generalises to unseen data.

- If training loss << test/validation loss ====> the model is **overfitting**.
    - The model has learnt the patterns in the training set, including its noise, but these do not carry over to the test data. It performs superbly on the training data but poorly on the test data.

- If training loss ≈ test/validation loss and both are low, that's the ideal case.
    - "Ideal case" is not a technical term; it is used here for general clarity.

- If training loss and test/validation loss are both high ====> the model is **underfitting**.
    - The two losses are usually close to each other here, but both are bad.
    - The model is not capturing even the basic patterns in the training data.


**Underfitting** : This occurs when a model is too simple to capture the underlying patterns or structures in data. 
- Key characteristics 
    - Poor performance on training data
    - Poor performance on test/validation data
    - High bias, low variance
    - Model fails to learn important relationships

- Common Causes

    - Model complexity is too low (e.g., linear model for highly non-linear data)
    - Insufficient or irrelevant features
    - Excessive regularization
    - Inadequate training time

- Example
    - Fitting a straight line (linear regression) to data that clearly follows a curved relationship.

- Symptoms
    - Training error: High
    - Test error: High  

**Overfitting** : Overfitting occurs when a model learns the training data too well, including noise and random fluctuations, and fails to generalize to new data.

- Key Characteristics
    - Excellent performance on training data
    - Poor performance on test/validation data
    - Low bias, high variance
    - Model memorizes instead of generalizing

- Common Causes

    - Model complexity is too high
    - Too many parameters relative to the dataset size
    - Training for too many epochs
    - Lack of regularization

- Example
    - A very deep decision tree that perfectly classifies training data but performs poorly on unseen data.

- Symptoms
    - Training error: Very low
    - Test error: High
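
The symptoms listed above can be reproduced on synthetic data. The sketch below is a toy setup under stated assumptions: the data follows an assumed pattern y = 2x + 1 plus noise, and three hand-written models stand in for "too simple", "about right", and "too flexible". None of the function names come from a library.

```python
import random

def mse(model, xs, ys):
    """Mean squared error of model(x) against the true ys."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_mean(xs, ys):
    """Too simple: ignores x and always predicts the average y."""
    avg = sum(ys) / len(ys)
    return lambda x: avg

def fit_line(xs, ys):
    """Straight line fitted by ordinary least squares (closed form)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    variance = sum((x - mx) ** 2 for x in xs)
    slope = covariance / variance
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_memorizer(xs, ys):
    """Too flexible: returns the y of the closest training x (1-nearest neighbour)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

rng = random.Random(0)  # fixed seed so the run is reproducible

def make_data(n):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]  # assumed true pattern + noise
    return xs, ys

x_train, y_train = make_data(200)
x_test, y_test = make_data(100)

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(x_train, y_train)
    results[name] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"{name:10s} train MSE={results[name][0]:.2f}  test MSE={results[name][1]:.2f}")
```

The mean-only model should show high error on both sets (underfitting), while the memorizer gets zero training error because every training point is its own nearest neighbour, yet it makes errors on the test set because it has also memorised the noise (overfitting).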

In train, the model has 30% accuracy; in test, the accuracy is 50%. Why?
Two factors are possible:
- The sample is small.
- Randomness. If the test set happens to consist mostly of the kinds of cases the model handles well in training, the test accuracy will be better.

**Types of evaluation metrics (FOR REGRESSION):** 
- Sum of all the errors/losses. Simple summation!
- Average of all the errors/losses. Average!
- Mean squared error (MSE) -> Mean of the squared errors. Squaring penalises larger errors more heavily. This is analogous to a student focusing most on the subject with the lowest score, because that subject has the largest error, i.e. the largest deviation from the ideal score of 100.
- Mean absolute error (MAE) -> Mean of the absolute errors.
- Root mean squared error (RMSE) -> Square root of the MSE. It brings the error back to the same units as the target, which helps when the MSE values are very large and hard to interpret.
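- Modified mean squared error (mMSE) -> Commonly defined as half of the MSE, i.e. the sum of squared errors divided by 2n. The factor of 1/2 cancels the 2 that appears when the square is differentiated.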
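
As a rough illustration, the metrics in this list can be computed by hand as below. The function name and the sample values are invented for this example, and the "modified MSE" entry is taken here to be MSE/2, which is an assumption about the intended definition.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute the regression metrics listed above from two equal-length lists."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mse = sum(e ** 2 for e in errors) / n
    return {
        "sum_of_errors": sum(errors),          # positive and negative errors cancel out
        "mean_error": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "half_MSE": mse / 2,                   # assumed meaning of "modified MSE"
    }

actual = [3, 5, 8, 10]      # hypothetical true values
predicted = [2, 5, 10, 7]   # hypothetical predictions -> errors 1, 0, -2, 3
metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example the plain sum of errors is only 2 even though one prediction is off by 3, because the errors of opposite sign cancel. That is why the absolute or squared versions are usually preferred.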

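**Types of evaluation metrics (FOR CLASSIFICATION):** 
- True Positive (TP): actual positive, predicted positive
- True Negative (TN): actual negative, predicted negative
- False Positive (FP): actual negative, predicted positive
- False Negative (FN): actual positive, predicted negative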

These are counted for each row of the dataset, and we use them to define the evaluation metrics.
- Accuracy: (TP+TN)/all, i.e. all correctly predicted cases / all cases. Overall correctness.
- Precision: TP/(TP+FP), i.e. of everything we predicted as positive (e.g. flagged as spam), how much was actually positive.
- Recall: TP/(TP+FN), i.e. how many of the actual spams got caught.
- F1 score: Harmonic mean of precision and recall, i.e. a balance of the two.

  F1 = 2*precision *recall/(precision + recall)
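
The formulas above can be checked on a tiny spam example. This is a minimal sketch: the function name, the label encoding (1 = spam, 0 = not spam), and the eight sample labels are all assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Count TP/TN/FP/FN row by row, then derive the four metrics."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 1 = spam, 0 = not spam
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 0, 0, 1]
scores = classification_metrics(actual, predicted)
print(scores)  # TP=2, TN=3, FP=1, FN=2 -> accuracy 0.625, precision 2/3, recall 0.5
```

Here accuracy looks moderate at 0.625, but recall shows that only half of the actual spam was caught, which is the kind of detail a single accuracy number hides.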

### Linear Regression

Linear regression is called linear not because it fits a straight line, but because of how it treats the parameters: the prediction is a simple weighted sum of the features, so it is linear in the parameters. 
A graph that looks polynomial can therefore still come from linear regression. For example, y = m0 + m1*x + m2*x^2 is curved in x but linear in the parameters m0, m1, m2, because x and x^2 are simply treated as two features.

### Flow of the data

Data --> EDA --> FE --> final data

Final data --> Train and test 

Train --> formula, say y = mx + c. We start with random initial values of m & c --> prediction --> error --> gradient 

Plotting the **cost function** gives the curve (or surface) of loss values obtained for the different values of m1, m2, etc. that the machine learning algorithm tries.

https://aero-learn.imperial.ac.uk/vis/Machine%20Learning/gradient_descent_3d.html

Ultimately we want to reduce the loss as much as possible, ideally to zero. We do this using gradient descent.

For each set of parameter values we get a different value of the loss.

If we plot all the loss values against the parameter values, we get a 3D surface, as shown at the URL above.

Ideally we want the lowest possible loss. Gradient descent is the method that finds the values of m0, m1, m2 **that give the lowest loss**.

**Gradient Descent**: An optimisation algorithm used in machine learning. It is an iterative method that adjusts the parameters of a model to find the minimum of the loss function.

Loss + loss function ===> gradient descent computes the gradient ====> update the values of m0, m1, m2. Find the new loss, give it to gradient descent again, and so on. Keep optimising until the loss no longer decreases.

https://www.playwithml.com/concepts/gradient-descent

The goal of gradient descent is to find the minimum loss.
- We use differentiation to calculate slope (gradient) of the cost function at our current position
- Slope/gradient will tell us the direction of the steepest ascent (uphill)
- Take a step in the opposite direction i.e. towards steepest descent (downhill)

Differentiation simply gives the rate of change of the output when the input changes, i.e. the value of the slope at that point.
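
The loop described above (predict, measure the error, compute the gradient, step downhill) can be sketched for the formula y = mx + c. This is a minimal illustration, not a production implementation: it assumes mean squared error as the cost, starts from zeros rather than random values so the run is reproducible, and uses a learning rate and step count picked by hand for this tiny dataset.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0  # starting guess (zeros instead of random, for reproducibility)
    n = len(xs)
    for _ in range(steps):
        # 1. Predict with the current parameters and measure the errors
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        # 2. Gradient of MSE = (1/n) * sum((m*x + c - y)^2) with respect to m and c
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (2 / n) * sum(errors)
        # 3. The gradient points uphill, so step in the opposite direction
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c

# Hypothetical data generated from y = 2x + 1 with no noise
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, c = gradient_descent(xs, ys)
print(round(m, 3), round(c, 3))  # close to 2.0 and 1.0
```

The learning rate controls the size of each downhill step: if it is too large the updates can overshoot the minimum and diverge, and if it is too small the loss decreases very slowly.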