### **Feature Extraction**

**Q :** What are the data types to be dealt with?
* Text
* Numerical
* Image
* Categorical

**Q :** Name the 2 steps to be performed on raw data before feeding it into the ML algorithm.

1. Data cleaning - to remove noise & inconsistencies in the data
2. Feature extraction (or **re-scaling** in case of **numerical data**)

**Q :** What is feature extraction?

* process of transforming raw data into a set of relevant, informative features that can be used as inputs to a ML model
* reduces the dimensionality of the data by selecting/ transforming input variables while preserving the most important information

**Q :** What is the need for feature extraction?

Raw data is usually :
* high-dimensional
* noisy or redundant
* inappropriate for ML algos

Hence, feature extraction is a much needed process, because it can :
* reduce the number of features to simplify the model and improve computational efficiency,
* give a compact representation of the data to save storage and processing time,
* enhance accuracy by removing irrelevant or redundant features,
* simplify the dataset to make the results easier to interpret.


**Q :** Feature extraction is crucial for building a successful ML model. Justify.

||Result|What FE does?|
|--|----|----|
|1|Improved Accuracy|extract features that capture the most relevant information, leading to better predictions|
|2|Reduced Overfitting|focuses on essential features, so, models are less likely to overfit to noise in the data|
|3|Enhanced Speed|Reduces data complexity to ensure faster training and inference times|
|4|Model Generalization is high|Well-extracted features generalize better to unseen data|

**Q :** Feature extraction is different for different data types. Illustrate.

| **Data Type**       | **Feature Extraction Method**                                         |
|---------------------|-----------------------------------------------------------------------|
| **Categorical Data** | Encoding (One-Hot Encoding, Label Encoding, Frequency Encoding)      |
| **Image Data**       | CNN-Based Feature Extraction, HOG, SIFT, Flattening (Preprocessing)  |
| **Text Data**        | Text Representation (TF-IDF, Word Embeddings, Bag of Words)          |
| **Numerical Data**   | Statistical Methods (PCA, LDA, Feature Scaling, Binning)             |
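
As a rough illustration of the numerical row of the table, the sketch below scales a small synthetic dataset and then applies PCA. The data, the variable names and the choice of 2 components are assumptions made only for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# hypothetical data: 4 numerical columns, but only 2 independent signals
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X_num = np.column_stack([
    base[:, 0],
    base[:, 0] * 2 + rng.normal(scale=0.01, size=100),  # near-duplicate of column 0
    base[:, 1],
    base[:, 1] - 1,                                      # shifted copy of column 2
])

# feature scaling: zero mean and unit variance per column
X_scaled = StandardScaler().fit_transform(X_num)

# PCA: keep 2 components that preserve most of the variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_num.shape, "->", X_reduced.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())
```

Because two of the four columns are redundant, two components retain almost all of the variance here; on real data the number of components has to be chosen by inspection.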


**Q :** Elaborate about the need of FE in categorical data.

* transforming non-numeric, discrete categories into meaningful numeric representations that ML algorithms can process effectively.
* since most ML models work with numerical input, encoding categorical variables is critical for leveraging the predictive power of the model.
* Non-numeric nature is undesirable as algorithms like linear regression or SVMs cannot process strings or categories directly.
* Some categorical variables (e.g., ZIP codes or product IDs) may have many distinct values.
* Improper encoding can introduce noise or cause the model to memorize data.
* Proper encoding captures relationships between categories (ordinal, nominal) and retains meaningful structure.
* Efficient encoding reduces the impact of high-cardinality categorical features.
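
A minimal sketch of the points above, assuming sklearn's `OneHotEncoder` for a nominal variable and `OrdinalEncoder` for an ordinal one. The colour and size values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# nominal variable (no order): one-hot encoding
colour_train = np.array([["red"], ["green"], ["blue"], ["green"]], dtype=object)
colour_test = np.array([["blue"], ["purple"]], dtype=object)  # "purple" was never seen in training

onehot = OneHotEncoder(handle_unknown="ignore")  # unseen categories become an all-zero row
onehot.fit(colour_train)
colour_test_encoded = onehot.transform(colour_test).toarray()
print(onehot.categories_)    # learned categories, sorted alphabetically
print(colour_test_encoded)

# ordinal variable (has an order): the order is passed explicitly
size = np.array([["low"], ["high"], ["medium"]], dtype=object)
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
size_encoded = ordinal.fit_transform(size)
print(size_encoded.ravel())
```

The encoder is fitted on the train values only, so a category that appears only in the test data does not get a column of its own.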


**Q :** Give the steps involved in ML model building.

1. Identify & separate inputs and output from the whole dataset ie. D ---> X, y
2. Split both input & output into train and test sets ie. X, y ---> X_train, X_test, y_train, y_test
3. Apply data pre-processing on train set of input data ie. X_train ---> X_train_transformed
4. Apply data pre-processing on test set of input data ie. X_test ---> X_test_transformed
5. train the model on X_train_transformed & y_train
6. Evaluate the model on X_train_transformed to get y_train_pred. Compare y_train & y_train_pred to find the train score (Mock Evaluation step)
7. Make the necessary corrections (change the data preprocessing strategy or the algorithm) if the train score is not satisfactory.
8. Evaluate the model by predicting on X_test_transformed to get y_pred, then compare y_pred & y_test
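
The steps above can be sketched as follows. The synthetic dataset, the `StandardScaler` pre-processing and the `LogisticRegression` algorithm are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# step 1: D ---> X, y (synthetic data stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# step 2: X, y ---> X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# step 3: learn the pre-processing from the train set only
scaler = StandardScaler()
X_train_transformed = scaler.fit_transform(X_train)

# step 4: reuse the already fitted scaler on the test set (transform, not fit)
X_test_transformed = scaler.transform(X_test)

# step 5: train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_transformed, y_train)

# step 6: train score (mock evaluation)
train_score = accuracy_score(y_train, model.predict(X_train_transformed))

# step 8: test score
test_score = accuracy_score(y_test, model.predict(X_test_transformed))
print("train score:", train_score, "test score:", test_score)
```

Step 7 (corrections) is a manual decision, so it has no line of code here.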

**Q :** Data preprocessing comprises _____ & _____.

* data cleaning, feature extraction

**Q :** Name the two things that are dependent on the output variable y.
1. Choice of Algorithm
2. Choice of Evaluation metric

**Q :** What are the major procedures for cleaning text data?

* Stop words removal
* Removal of special characters
* Stemming or Lemmatization
* conversion to lowercase

### **Data Leakage**

**Q :** Why is data preprocessing done after splitting data?

* to prevent data leakage

**Q :** What is the need to prevent data leakage?

**What is the need of testing of the ML model on test data ?**
* it provides an unbiased evaluation of the model's generalization ability on unseen data
* we need to ensure that the model performs well beyond the training set
* it helps to identify issues like overfitting/ underfitting, which can compromise real-world performance. 
* Test data simulates real-world scenarios, offering insight into how the model will behave in practical applications. 
* By evaluating the model using various performance metrics, testing provides a quantitative measure of performance. 
* It also helps in comparing different models or hyperparameters objectively.
* it builds confidence in deploying the model by ensuring it meets the desired performance standards.

**Why is data leakage a villain ?**

* test data is a simulation of real data.
* if the model has already got a glimpse of it due to preprocessing before train-test split, the test results are useless.
* testing won't serve its purpose.
* Data leakage (DL) leads to overly optimistic model performance by allowing information from the test set to influence the training process. 
* results in a model that appears to be accurate during evaluation but performs poorly in real-world scenarios. 
* undermines the integrity of the validation process, making it impossible to gauge the model’s true generalization ability. 
* DL can occur subtly, such as through shared preprocessing steps, making it harder to detect.
* ultimately, it compromises the model's reliability, rendering its predictions untrustworthy.

**Q :** Demonstrate the problem of data leakage using an example.

**Pre-processing before splitting :**

1. **Raw dataset**

||sentence|
|--|----|
|1|This is a big city.|
|2|This is the biggest city,|
|3|This is a $ warm city.|
|4|This is the # warmest city.|
|5|This @ is a cold city.|

2. **After Preprocessing (Data cleaning)**

||sentence|
|--|----|
|1|big city|
|2|big city|
|3|warm city|
|4|warm city|
|5|cold city|

3. **After Preprocessing (FE using BoW)**

||big|city|cold|warm|
|--|----|----|----|---|
|1|1|1|0|0|
|2|1|1|0|0|
|3|0|1|0|1|
|4|0|1|0|1|
|5|0|1|1|0|

4. **Train-test-split**

Train set :

||big|city|cold|warm|
|--|----|----|----|---|
|1|1|1|0|0|
|2|1|1|0|0|
|3|0|1|0|1|

Test set :

||big|city|cold|warm|
|--|----|----|----|---|
|1|0|1|0|1|
|2|0|1|1|0|

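The BoW vocabulary was built from the whole dataset, so it contains the column `cold`, a word that occurs only in a test sentence. Information from the test set has therefore leaked into the features used during training, leading to artificially inflated performance metrics and misleading results.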


**Pre-processing after splitting :**

1. **Raw dataset**

||sentence|
|--|----|
|1|This is a big city.|
|2|This is the biggest city,|
|3|This is a $ warm city.|
|4|This is the # warmest city.|
|5|This @ is a cold city.|

2. **After Train test split**

Train set :

||sentence|
|--|----|
|1|This is a big city.|
|2|This is the biggest city,|
|3|This is a $ warm city.|

Test set :

||sentence|
|--|----|
|1|This is the # warmest city.|
|2|This @ is a cold city.|

3. **After pre-processing (data cleaning)**

Train set :

||sentence|
|--|----|
|1|big city|
|2|big city|
|3|warm city|

Test set :

||sentence|
|---|----|
|1|warm city|
|2|cold city|

4. **After Preprocessing (FE using BoW)**

Train set :

||big|city|warm|
|--|----|----|--|
|1|1|1|0|
|2|1|1|0|
|3|0|1|1|

Test set :

||big|city|warm|
|---|----|----|---|
|1|0|1|1|
|2|0|1|0|

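Since the vocabulary is built from the train set alone, the test-only word `cold` gets no column and contributes nothing to the test features. This approach accurately reflects the model’s real-world generalization ability, providing a more honest evaluation.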
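
The two scenarios can be reproduced with sklearn's `CountVectorizer`. This is only a sketch that starts from the already cleaned sentences of the example; the variable names are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# cleaned sentences from the example above
train_sentences = ["big city", "big city", "warm city"]
test_sentences = ["warm city", "cold city"]

# leaky: vocabulary learned from train + test together
leaky = CountVectorizer()
leaky.fit(train_sentences + test_sentences)
leaky_vocab = sorted(leaky.vocabulary_)
print(leaky_vocab)   # contains "cold", a word that exists only in the test set

# correct: vocabulary learned from the train set only
clean = CountVectorizer()
clean.fit(train_sentences)
clean_vocab = sorted(clean.vocabulary_)
print(clean_vocab)   # no "cold" column

# the test set is only transformed with the train vocabulary
test_bow = clean.transform(test_sentences).toarray()
print(test_bow)
```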

||Data cleaning|Feature Extraction|
|--|----|-----|
|Numerical data|numpy, pandas|sklearn|
|Categorical data|numpy, pandas|sklearn|
|Text data|nltk, spacy, fuzzywuzzy|sklearn (BoW & TF-IDF), gensim (Word2Vec & GloVe & FastText), Keras/ PyTorch (RNN, LSTM, Transformers)|
|Image data|PIL, OpenCV|Keras, PyTorch|

### **NLP**

**Q :** What is Hugging Face?

* not just a library
* a platform and ecosystem 
* encompasses various libraries, models, tools, and resources for NLP and ML
* Hugging Face provides:
   1. **Model Hub** : A repository for thousands of pre-trained models.
   2. **Datasets Hub** : A collection of datasets for NLP, computer vision, and other ML domains.
   3. **Spaces** : A platform to host, share, and deploy ML apps (using tools like Streamlit or Gradio).
   4. **Inference API** : A service to run models in the cloud without worrying about infrastructure.
   5. **Community** : A collaborative space for developers and researchers.

**Q :** What is Transformers library?

* Python-based open-source library developed by Hugging Face. 
* focuses on pre-trained transformer models, making it easier to integrate them into projects.

|Feature|Description|
|---|----|
|Wide Model Support|Provides access to transformer-based architectures like BERT, GPT, RoBERTa, T5, and more.|
|Pre-Trained Models|Ready-to-use models for tasks like text classification, question answering, translation, and text generation.|
|Multi-Framework Support|Compatible with PyTorch, TensorFlow, and JAX.|
|Ease of Fine-Tuning|Tools to fine-tune models on custom datasets with minimal code.|
|Tokenizers Library|Highly efficient tokenization utilities optimized for speed and accuracy.|
|Integration with Datasets and Accelerate|Seamless support for Hugging Face’s Datasets and Accelerate libraries.|

**Q :** Compare HF and transformers library w.r.t their use cases.

||When to use?|
|----|-----|
|Hugging Face (Ecosystem)|If you need an end-to-end solution, from pre-trained models to deployment and hosting/For collaborative projects or sharing your models and apps with others.|
|Transformers Library|If you're focusing on development and fine-tuning models programmatically./For implementing custom pipelines and deep integration into machine learning workflows.|

**Q :** Although tokenization is the 1st step in text data cleaning, it is not an isolated step. Justify.

* breaks the text into words, enabling identification of stop-words
* splits the text, making it easier to isolate special characters for removal
* Lowercasing is often applied to tokens after splitting the text
* tokenized words are required as input to apply stemming or lemmatization

**Q :** Distinguish between stemming & lemmatization.

**Stemming**: 
 * process of reducing a word to its root or base form 
 * by removing suffixes or prefixes
 * often without considering the actual meaning 
 * e.g., "running" → "run" 
 * faster processing speed

**Lemmatization**: 
 * process of reducing a word to its dictionary or base form (lemma) 
 * considering its context and meaning 
 * e.g., "better" → "good" 
 * processing speed is slow
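
A small sketch of the stemming side, assuming NLTK's `PorterStemmer` (which needs no extra downloads). The word list is chosen only for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "studies", "better"]

# stemming works by rule-based suffix stripping, without looking at meaning
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

"running" becomes "run", while "studies" becomes "studi", which is not a dictionary word. A lemmatizer such as `WordNetLemmatizer` (which requires the `wordnet` corpus to be downloaded first) is meant to return dictionary forms instead, at the cost of speed.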

**Q :** What are the 2 factors under trade-off while considering the choice between stemming and lemmatization?

1. Processing speed
2. Preservation of word meaning/ context

**Q :** Construct a Python function named preprocess, that will do text data cleaning.

```python
# import required dependencies
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# download the required NLTK resources if not already downloaded
nltk.download("punkt")      # tokenizer models used by word_tokenize
nltk.download("punkt_tab")  # newer NLTK releases look for this instead of "punkt"
nltk.download("wordnet")    # dictionary used by WordNetLemmatizer
nltk.download("stopwords")

# build the stop-word set once instead of once per word
STOP_WORDS = set(stopwords.words("english"))

def preprocess(raw_text, flag):

    # replace special characters with a space so that words stay separated
    text = re.sub("[^a-zA-Z0-9]", " ", raw_text)

    # change text to lowercase
    text = text.lower()

    # tokenize into words
    words = word_tokenize(text)

    # remove stopwords
    words = [word for word in words if word not in STOP_WORDS]

    # Stemming/ Lemmatization
    if flag == "stem":
        stemmer = PorterStemmer()
        words = [stemmer.stem(word) for word in words]
    else :
        lemmatizer = WordNetLemmatizer()
        words = [lemmatizer.lemmatize(word) for word in words]

    preprocessed_text = " ".join(words)
    words_in_processed_text = len(words)

    return pd.Series([preprocessed_text, words_in_processed_text])
```


### **CV & Hyperparameter Tuning**

**Q : What does train score value indicate?**

* high value implies learning has happened
* low value implies that no learning has happened

**Q : For a classification model, is there a cut-off score that indicates a desirable model?**

* No such cut-off exists
* It totally depends on the problem statement and scenario

**Q : (Algo, hyperparameter) ordered pairs for a single problem can be a huge number. How to decide the best combination?**

* GridSearchCV
* RandomizedSearchCV

Both classes are available in sklearn. CV stands for cross validation.

#### **WHAT ?**

**Q : What is Cross validation?**

* technique used to assess the performance and generalization ability of a machine learning algorithm
* done by dividing the data into multiple parts and training/testing on different subsets
* helps to avoid overfitting 
* ensures that the model performs well on unseen data



**Q : How is Cross validation done? Explain step-by-step.**

1. Divide the data into K folds. Fix an algorithm
2. Keep aside a fold for testing and compile rest of the K-1 folds into a training set and run the algo over it to get a model.
3. test that model over the test fold.
4. Repeat steps 2,3 for all K-folds.
5. The average value of K performance scores gives an idea about the algorithm's generalization ability. It is a more reliable indicator.
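
The steps above can be written out by hand with sklearn's `KFold`. The synthetic data and the choice of `LogisticRegression` as the fixed algorithm are assumptions for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# step 1: divide the data into K folds (K = 5) and fix an algorithm
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # step 2: train a fresh model on the K-1 training folds
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # step 3: test that model on the fold kept aside
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# step 5: average of the K scores
mean_score = np.mean(fold_scores)
print("fold scores:", fold_scores)
print("mean CV score:", mean_score)
```

`cross_val_score` from `sklearn.model_selection` wraps the same loop in a single call.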

**Q : What are other types of CV?**

1. **Stratified K-Fold CV :**
 - Whole data is divided into k-folds
 - the class distribution is preserved across all folds
 - useful when there is class imbalance in the dataset

2. **Leave-One-Out-CV :**
 - Computationally very expensive
 - a datapoint is used as a test point and training is done on rest of the dataset
 - repeated for all datapoints in the dataset
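
A short sketch comparing how the splitters behave on an imbalanced label vector. The 90/10 class split and the unshuffled `KFold` are assumptions picked to make the difference visible.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold

# hypothetical imbalanced labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratified K-fold: every validation fold keeps the 90:10 ratio
skf = StratifiedKFold(n_splits=5)
stratified_positives = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(stratified_positives)

# plain K-fold without shuffling: all positives land in the last fold
kf = KFold(n_splits=5)
plain_positives = [int(y[val_idx].sum()) for _, val_idx in kf.split(X)]
print(plain_positives)

# leave-one-out: one split per datapoint, hence the high cost
n_loo_splits = LeaveOneOut().get_n_splits(X)
print(n_loo_splits)
```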


**Q : Can cross validation be considered a thorough learning of the data?**

* Although it appears to be a thorough learning process, CV is not actually that
* During cross-validation, the algo is trained from scratch for each fold. It doesn’t retain knowledge from previous folds.
* It is not thorough learning, but instead it is a thorough evaluation process
* It's like thinking twice or thrice before fixing an algo for solving a problem

CV does not contribute to the model's actual learning but rather evaluates how well an ML algorithm generalizes to unseen data. Each fold in CV serves only as a temporary training-validation split, and once CV is complete, those models are discarded. The final model is then trained separately on the full dataset. Hence, CV is a thorough evaluation technique rather than a learning process.
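
One way to see this in sklearn: `cross_val_score` fits clones of the estimator, so the object passed in stays untrained. The dataset and algorithm below are assumptions for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=6, random_state=1)
model = LogisticRegression(max_iter=1000)

# CV trains and discards 5 temporary copies of the estimator
scores = cross_val_score(model, X, y, cv=5)

# the original estimator has learned nothing yet (no coefficients)
fitted_after_cv = hasattr(model, "coef_")
print("fitted after CV:", fitted_after_cv)

# the final model is trained separately, after CV is complete
model.fit(X, y)
fitted_after_fit = hasattr(model, "coef_")
print("fitted after fit:", fitted_after_fit)
```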

**Q : Extend the 'student learning for exam' analogy of ML model building pipeline to include CV step.**

CV is more like screening tests or entrance exams, where:

 - Multiple test setups are designed to evaluate the student (or algorithm).
 - Each test is independent; there’s no learning transfer between tests.
 - Once the student (algorithm) clears the screening (CV), the real training begins (training on the entire dataset).

#### **WHY ?**

**Q : What is the prime purpose behind CV?**

* Validation/ Evaluation

**Q : What is that the CV is trying to evaluate?**

- CV evaluates an algorithm, not a specific model instance
- CV isn’t about improving or retaining knowledge in the model being trained
- CV is about understanding :
  * How well the algorithm (e.g., Random Forest, SVM) performs on the given data.
  * How **consistent** the performance is across multiple data splits.

**Q : Average of CV scores is a highly reliable indicator of what?**

- of the algorithm’s generalization ability on unseen data.

**Q : Why is it highly reliable?**

1. **Multiple Validation Splits** 
  - CV evaluates the algorithm on multiple train-validation splits rather than a single train-test split, reducing bias from any specific data partition.
2. **Reduced Variance** 
  - Averaging the scores across folds smooths out fluctuations caused by random data variations, leading to a more stable estimate of performance.
3. **Better Approximation of Real-World Performance**
 - Since CV tests the algorithm on diverse subsets of data, the average score reflects how well the algorithm would perform on truly unseen data.
4. **Prevents Overfitting to a Single Split**
  - Without CV, a single train-test split might give an overoptimistic or overly pessimistic estimate, while CV provides a balanced evaluation.

**Q : "CV mitigates problem of overfitting and underfitting of a ML model (not ML algorithm)". True or False. Justify.**

**False.**  

- CV helps evaluate and detect overfitting or underfitting
- but it does **not** directly mitigate these issues in the final **ML model** 
- Instead, CV provides insight into how well an **ML algorithm** generalizes by testing it on multiple train-validation splits. 
- If overfitting or underfitting is observed, actions like changing the **algorithm, data preprocessing, or regularization techniques** must be taken to address it. 
- CV itself does not alter the model’s parameters or learning process—it only assesses an algorithm's performance.

**Q : Doing CV is like thinking twice or thrice before fixing a ML algo for solving a problem. Justify.**

- CV allows us to **evaluate** an ML algorithm on multiple data splits before finalizing it for the problem. 
- Just as thinking twice or thrice helps in making a well-informed decision, CV helps in assessing whether an algorithm generalizes well to unseen data, avoiding **hasty or biased conclusions** based on a single train-test split
- provides a clearer picture of an algorithm’s **stability and consistency**, helping us decide whether it is the right choice for the given problem.

#### **WHERE ?**

CV is used in machine learning model evaluation, feature selection, and hyperparameter tuning to ensure robust performance.

It is applied when data is limited, when avoiding overfitting is crucial, or when comparing models or hyperparameter settings.

**Q : How is CV used for hyperparameter tuning? Explain step-by-step.**

1. **Initial Data Split**
  * Typically, the dataset is initially split into a training set (e.g., 80%) & a testing set
  * training set is used for CV ie. further divided into folds
  * the test set, also called the **holdout set**, is held out and never seen by the model during training or CV

2. **Folding the training Set**
  * training dataset is divided into k equal-sized subsets (folds).
  * One subset is used for validation, and the remaining k-1 subsets are used for training.

3. **Training**
  * (k-1) subsets form a single training set
  * we get a single model upon running the algorithm on this training set
  * this model is validated using the held-out validation fold

4. **Repeat**
  * This process is repeated k times, each time using a different fold for validation purpose.
  * hence, cross-validation

5. **Average Performance**
  * In the end, we have k different models and their performance scores
  * The final performance is computed as the average of the scores across all k validation folds
  * this value gives idea about how a single hyperparameter setting performs on an average
  * the same process is done for various hyperparameter settings

6. **Comparison**
  * after comparing average performances of various hyperparameter settings, the best hyperparameter setting is chosen

7. **Actual training**
  * the best hyperparameter setting is locked
  * the algo is run under this setting over the entire train set
  * a single final model is obtained which will be tested on the holdout set
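
The whole procedure can be sketched by hand before relying on `GridSearchCV`. The KNN algorithm, the candidate values of `n_neighbors` and the synthetic data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=7)

# step 1: initial split, the holdout set is not touched until the very end
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

# steps 2-5: 5-fold CV on the train set for every hyperparameter setting
candidate_k = [1, 5, 15]
mean_cv_scores = {}
for k in candidate_k:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    mean_cv_scores[k] = scores.mean()

# step 6: compare the average performances and pick the best setting
best_k = max(mean_cv_scores, key=mean_cv_scores.get)

# step 7: lock the setting, train on the entire train set, test on the holdout set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
holdout_score = final_model.score(X_holdout, y_holdout)
print(mean_cv_scores, best_k, holdout_score)
```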


**Q : For each hyperparameter setting, what is the need of training over k different subsets & then computing average performance? Why not on a single training set? What is the need of this double work? Or in other words why cross validate for a single hyperparameter setting?**

OR

**Q : What is the need of cross validation over single simple validation?**

**A :**

(i) *What is the problem with a Single Training-Validation Split?*
 
**Bias from Random Splitting :** 
  - single split may accidentally contain "easy" or "hard" examples in the training or test set
  - will lead to overly optimistic or pessimistic performance evaluation
  - model may perform well on this specific split but poorly on unseen data

**Overfitting to a Particular Split :** 
  - The model might learn patterns specific to the given training set
  - since you're testing on just one test set, you don't know if the model generalizes well across different subsets
  - A good test score might give a false sense of confidence.

**Data Imbalance Issues :**
  - Important patterns might be underrepresented in the test set due to class imbalance or sampling biases
  - will lead to misleading performance evaluation
  - CV ensures all patterns are tested across different subsets

(ii) *Why k-fold CV is done for each hyperparameter setting?*

**Reducing variance in Performance Estimation :**
  - training on k different subsets allows the model to be evaluated across diverse portions of the data
  - will lead to a more reliable estimate of performance
  - averaged score smooths out any fluctuations caused by randomness in a single split

**Ensuring Generalization Ability :**
  - By using different validation sets in each fold, we ensure the model performs well on all parts of the data, not just one specific subset
  - helps in selecting hyperparameters that generalize well to unseen data

**Efficient use of Data :**
  - With CV, each data point gets to be in the validation set exactly once and in the training set k-1 times, maximizing the use of limited data.
  - Training on the full dataset in different parts ensures no information is wasted.
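
To get a feel for the first point, one can compare scores from several single splits with a 5-fold average. The noisy synthetic dataset and the ten random seeds are assumptions of this sketch, and the exact numbers will differ from dataset to dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# small, noisy dataset so that the choice of split matters
X, y = make_classification(n_samples=150, n_features=10, flip_y=0.2, random_state=3)

# ten different single train-validation splits
single_split_scores = []
for seed in range(10):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    single_split_scores.append(model.score(X_val, y_val))

# one 5-fold cross validation over the same data
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("single-split scores range from", min(single_split_scores), "to", max(single_split_scores))
print("5-fold mean score:", cv_scores.mean())
```

The spread between the lowest and highest single-split score shows how much a conclusion could depend on one lucky or unlucky split.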

**Q : Summarize the differences and similarities between 'CV with hyperparameter tuning' and 'CV without hyperparameter tuning' in a table.** 


| Aspect                              | CV Without Hyperparameter Tuning                          | CV With Hyperparameter Tuning                                |
|-------------------------------------|----------------------------------------------------------|------------------------------------------------------------|
| **Purpose**                         | Evaluate model performance and generalization ability.    | Find the best hyperparameter settings for the model.        |
| **Hyperparameters**                 | Fixed throughout the process.                            | Multiple hyperparameter combinations are tested.            |
| **Number of Models**                | k models (from k folds).                                 | k models per hyperparameter combination, leading to more models overall. |
| **Performance Score**               | Average performance score across k folds.                | Average performance score across k folds for each hyperparameter combination. |
| **Final Model**                     | Retrained on the full dataset with fixed hyperparameters. | Retrained on the full dataset using the best hyperparameters found during tuning. |
| **Computational Cost**              | Relatively low.                                           | Higher due to multiple hyperparameter combinations being evaluated. |
| **Focus**                           | Assess the model’s generalization ability.               | Optimize the model's performance through parameter adjustment. |
| **Use Cases**                       | Simple evaluation tasks or when hyperparameters are predetermined. | When tuning hyperparameters to maximize model performance.  |
| **Analogy (Student Example)**       | Multiple mock tests to revise concepts, final revision based on fixed study plan. | Mock tests + optimizing study plan for best performance before final revision. |
| **Practical Application**           | Less commonly used, mainly for baseline evaluation.       | Widely used in practice for building high-performing models. |
| **Risk of Data Leakage**            | Minimal, provided test data remains separate.             | Minimal, provided test data remains separate.               |


#### **HOW ?**

**Q : How to implement simple cross validation to check performance of RandomForest classifier algorithm on a dataset?**

```python
# import dependencies
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# split the data (X and y are assumed to be already defined)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# set up the algorithm apparatus
model = RandomForestClassifier(random_state = 42)

# perform CV
kf = KFold(n_splits = 5, shuffle = True, random_state = 42)
cv_scores = cross_val_score(model, X_train, y_train, cv = kf, scoring = 'accuracy')

# average performance of the algo
print("Cross-Validation Scores:", cv_scores)
print("Average CV Score:", np.mean(cv_scores))
```

**Q : Which module of sklearn has the classes GridSearchCV & RandomizedSearchCV?**

* model_selection module

```python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
```

**Q : What is the purpose of the verbose parameter of GridSearchCV?**

* It controls how much progress information (what is happening internally while running a grid search CV) is printed
* verbose = 0 means nothing is printed
* verbose = 1 prints a brief summary, such as the number of candidates and fits
* higher values (2, 3, ...) print progressively more detail for each fit, such as computation time and scores

**Q : How is GridSearchCV executed in Python using sklearn?**

```python
'''STEP 01 - Import dependencies'''
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

'''STEP 02 - Define model'''
model = SVC()

'''STEP 03 - Setup a parameter grid'''
# parameter grid is a dictionary basically
param_grid = {
    'C': [0.1, 1, 10],  
    'kernel': ['linear', 'rbf'],  
    'gamma': [0.01, 0.1, 1]
}

'''STEP 04 - Initialize a GridSearchCV object'''
# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')

'''STEP 05 - Run the object over the training set'''
grid_search.fit(X_train, y_train)

# Best parameters and best score
print(grid_search.best_params_)
print(grid_search.best_score_)
best_model = grid_search.best_estimator_

# Evaluate final model on test set
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)
```


**Q : What about RandomizedSearchCV?**

```python
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Define model 
model = SVC()

# Define parameter distribution
# note: uniform(loc, scale) samples from the interval [loc, loc + scale]
param_dist = {
    'C': uniform(0.1, 10),        # C is drawn from [0.1, 10.1]
    'kernel': ['linear', 'rbf'],  
    'gamma': uniform(0.01, 1)     # gamma is drawn from [0.01, 1.01]
}

# Perform Randomized Search with 10 iterations
random_search = RandomizedSearchCV(model, param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)

# Best parameters and best score
print(random_search.best_params_)
print(random_search.best_score_)
```


**Q : Is hyperparameter tuning the only method to rectify overfitting issues with a model? If not, what are the other remedies?**

No,

1. Get more data
2. Data Augmentation
3. Early stopping
4. Dropout (for deep learning models)
5. Feature engineering (Transformation, selection)
6. Regularization
7. Ensembles


### **Best Model**

To solve a problem, I have a list of algorithms and several hyperparameter settings for each of them. By running CV for each algorithm, I get its optimal hyperparameter setting. Out of these per-algorithm best candidates, how do I choose one? How do I finalize a model?

**Q : What is training phase in ML model building pipeline?**

$\text{training dataset} \to \boxed{\text{algorithm}} \to \text{model}$

**Q : What is training time?**

- refers to the total duration required for an ML algorithm to learn from the training dataset. 
- includes the time taken for multiple iterations (epochs), parameter updates, and optimization steps.

**Q : What are the implications provided by training time of a model?**

1. **Scalability**

Longer training times may not be practical for large datasets or real-time applications.

2. **Computational Cost**

More training time often means higher resource consumption (CPU, GPU, memory).

3. **Model Performance**

Longer training might improve performance, but excessive training can lead to overfitting.

4. **Experimentation Efficiency**

Faster training allows quick iterations to experiment with different hyperparameters and architectures.

**Q : What is the size of a ML model?**

- Model size refers to the amount of storage required to save the trained model
- typically measured in kilobytes (KB), megabytes (MB), or gigabytes (GB). 
- depends on the number of parameters, layers, and data precision.

**Q : What are the implications provided by the size of a ML model?**

1. **Deployment Feasibility** - Large models may not be suitable for edge devices (mobile, IoT) due to storage and processing limitations.

2. **Inference Speed** - Smaller models usually have faster inference times, making them ideal for real-time applications.

3. **Memory Efficiency** - Large models consume more RAM and may require specialized hardware (e.g., GPUs, TPUs).

4. **Transferability** - A compact model is easier to deploy over networks, especially in cloud-based applications.

**Q : What is Prediction Phase in ML model building pipeline?**


$\text{input test dataset} \to \boxed{\text{trained model}} \to \text{output predictions}$

**Q : Define prediction time.**

* also called inference time
* time taken by a trained ML model to generate predictions on new, unseen data during the testing or deployment phase
* Some models perform better when processing multiple inputs at once (batch inference), while others optimize for single-instance inference.

**Q : While building real time applications, what kind of prediction time is ideal?**

- In critical applications like fraud detection, self-driving cars, or healthcare diagnostics, low inference time is essential for real-time decision-making.
- A delay in prediction could lead to severe consequences (e.g., an autonomous vehicle reacting too late).


**Q : Does prediction time affect scalability and UX?**

Yes. Faster inference improves user experience in web applications (e.g., chatbots, recommendation systems).
A slow model may struggle with high traffic and reduce scalability.

**Q : What does prediction time taken by a model imply about the hardware requirements?**

- High inference times indicate that the model may be too complex for edge devices or mobile applications.
- Lighter models (e.g., quantized models or pruned networks) can help reduce latency.

**Q : What are the factors that affect the prediction time taken by a model?**

1. **Model Complexity** - More layers and parameters increase computation time.
2. **Hardware** - CPUs are generally slower than GPUs/TPUs for deep learning models.
3. **Optimizations** - Techniques like model quantization, pruning, and distillation can reduce prediction time.
4. **Batch Size** - Larger batches usually improve throughput (less time per sample on average), but each batch takes longer to return and needs more memory.

**Q : Compare the following algorithms based on training time, prediction time and model size :**

* (i) KNN classifier
* (ii) Logistic Regressor
* (iii) Decision tree classifier

**(i) KNN**
 - Least training time $\because$ it doesn't actually "train" a model; it just stores the dataset.
 - takes more inference time, since it must calculate the distance from the test point to all training points.
 - model size is large because it keeps the entire dataset in memory.

**(ii) Logistic Regressor**
 - takes more training time than KNN, as it spends time learning coefficients that capture patterns in the data and relationships between features
 - inference time is low $\because$ it computes a weighted sum of input features and applies a sigmoid function.
 - model size is small since it only stores a few learned coefficients.

**(iii) DT Classifier**
 - higher training time. It depends on how deep the tree grows and the splitting criteria (e.g., Gini, entropy).
 - lower inference time, since making a prediction just involves traversing the tree.
 - Can be small or large depending on depth and pruning; deep trees have more nodes, increasing size.
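
These three quantities can be measured directly. The sketch below assumes a synthetic dataset, default hyperparameters and the pickled byte length as a stand-in for model size; actual timings depend on the machine and the data.

```python
import pickle
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

report = {}
for name, model in models.items():
    # training time
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds = time.perf_counter() - start

    # prediction time
    start = time.perf_counter()
    model.predict(X)
    predict_seconds = time.perf_counter() - start

    # model size, approximated by the size of the pickled object
    size_bytes = len(pickle.dumps(model))
    report[name] = (fit_seconds, predict_seconds, size_bytes)

for name, (fit_s, pred_s, size) in report.items():
    print(f"{name}: fit {fit_s:.4f}s, predict {pred_s:.4f}s, size {size} bytes")
```

The pickled KNN model is much larger than the logistic regression model because it carries the training data with it.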

**Q : In Python, how to identify whether class or function from the import statement?**

- By convention, a class name begins with a capital letter (CapWords style)
- eg.
```python
from sklearn.pipeline import Pipeline
# here, 'Pipeline' is a class
```

- By convention, a function name begins with a small letter (snake_case style)
- eg. 
```python
from sklearn.model_selection import train_test_split
# here, 'train_test_split' is a function
```

- although this is not mandatory, it is the convention usually followed.
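
Since the naming is only a convention, a programmatic check is more dependable. The sketch below uses the standard library's `inspect.isclass`; `defaultdict` is included as an example of a class that does not follow the capital-letter convention.

```python
import inspect
from collections import defaultdict

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# follows the convention: capitalised name, and it is a class
is_pipeline_class = inspect.isclass(Pipeline)

# follows the convention: lowercase name, and it is not a class
is_split_class = inspect.isclass(train_test_split)

# breaks the convention: lowercase name, yet it is a class
is_defaultdict_class = inspect.isclass(defaultdict)

print(is_pipeline_class, is_split_class, is_defaultdict_class)
```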