Assignment Questions


# 1. What is a parameter?

- A parameter in theoretical terms refers to:

>- A variable or constant that characterises or controls a system, mathematical expression, or model but is not the main variable of interest.

In more depth:

    Mathematical context:
A parameter might be a number or symbol that specifies a family of functions or equations.

    Statistical context:
A parameter refers to a numerical characteristic of a population (like its mean or standard deviation) that we typically want to estimate based on a sample.

    General or theoretical context:
A parameter sets conditions or controls under which a process operates, or it guides the form a theory might take.

# 2. What is correlation? What does negative correlation mean?

Correlation refers to a statistical measure that shows the relationship between two variables — whether and how strongly they move together.

>- Positive correlation means both variables move in the same direction. If one increases, the other increases; if one decreases, the other decreases.
Example: Height and weight — taller people tend to weigh more.

>- Negative correlation means the two variables move in opposite directions. If one increases, the other decreases, or vice versa.
Example: Speed and travel time — the faster you drive, the less time it takes to reach your destination.

The strength of this relationship is typically measured by Pearson’s correlation coefficient (r), which ranges from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no correlation.
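
As a minimal sketch, Pearson's r can be computed directly from its definition with NumPy. The speed and travel-time values below are made up for illustration (a fixed 120 km trip), and the variable names are assumptions, not part of the original answer.

```python
import numpy as np

# Hypothetical data: driving speed (km/h) and travel time (hours) for a 120 km trip
speed = np.array([40.0, 60.0, 80.0, 100.0, 120.0])
travel_time = 120.0 / speed

# Pearson's r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means
dx = speed - speed.mean()
dy = travel_time - travel_time.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r, 3))
```

The result is negative and close to -1, which matches the speed and travel time example above: as one goes up, the other goes down.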

# 3. Define Machine Learning. What are the main components in Machine Learning?

>- Machine Learning (ML) refers to the process by which computers learn from data and improve their performance without being explicitly programmed with hardcoded instructions.
Instead of following step-by-step instructions, ML algorithms identify patterns and relationships within the data, allowing them to make predictions, decisions, or perform tasks with growing accuracy over time.

Machine Learning typically involves the following main components:

    Data  

The raw information used to train and evaluate the algorithm (examples, samples, training set, test set).

    Model

The mathematical representation that maps input data to predictions (for example, decision trees, neural networks, or regression models).

    Training Algorithm

The procedure or method by which the model is trained and updated (such as gradient descent).

    Objective/ Loss Function

A function that measures how far the model's predictions are from the true values (for example, mean squared error or cross-entropy). Training tries to minimize it.

    Evaluation/ Validation

Techniques (like cross-validation or a separate test set) used to gauge the model’s ability to generalize to new, previously unseen data.

    Predictions/ Inference

The process by which the trained model applies its knowledge to make predictions or decisions on new inputs.
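
The components above can be sketched in a few lines of scikit-learn. This is only an illustration under assumptions: the iris dataset, the logistic regression model, and all variable names are choices made here, not something the original answer prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Data: features X and labels y (iris is used only as a convenient example)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model and training algorithm: fit() runs the solver that learns the parameters
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Loss function: cross-entropy (log loss) measured on the training data
train_loss = log_loss(y_train, model.predict_proba(X_train))

# Evaluation: accuracy on data the model has not seen during training
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Inference: predict the class of one new sample
new_sample = [[5.1, 3.5, 1.4, 0.2]]
prediction = model.predict(new_sample)

print(train_loss, test_accuracy, prediction)
```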

# 4. How does loss value help in determining whether the model is good or not?

The loss value is a key metric for evaluating how well your model is performing during training (and afterwards, on validation or test data). Here’s how it helps:

    What loss measures:
The loss quantifies the error between your model’s predictions and the true values.

>- A lower loss means your predictions are close to the ground truth.

>- A higher loss signals greater error.

    Determining if a model is good or not:

>- If the loss drops and stabilises at a low value during training, it typically means your model is learning well.

>- If the training loss plateaus at a high value, or the validation loss starts to increase, it might be a sign of problems such as underfitting, overfitting, or a poor architecture.

    Training vs Validation Loss:
Comparing training and validation loss is crucial:

>- If training and validation loss are both high, your model might be underfitting.

>- If training loss is low but validation loss is high, your model might be overfitting (memorizing instead of generalizing).

>- If both are low and close to each other, your model is likely performing well.
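
A small sketch of this comparison, assuming synthetic data and a decision tree whose depth controls how flexible the model is. The data, the depths, and the use of mean squared error as the loss are all illustrative choices.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve (made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

losses = {}
for depth in (1, 4, None):  # None lets the tree grow until it memorizes the training set
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_loss = mean_squared_error(y_train, model.predict(X_train))
    val_loss = mean_squared_error(y_val, model.predict(X_val))
    losses[depth] = (train_loss, val_loss)
    print(f"max_depth={depth}: train={train_loss:.3f}, val={val_loss:.3f}")
```

With data like this, the shallow tree typically shows high loss on both sets (underfitting), while the unrestricted tree drives the training loss to zero and leaves a clearly larger validation loss (overfitting).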

    Other considerations:
While the loss is a helpful indicator, it's not the whole story. Often, you will want to combine it with other metrics (like accuracy, precision, recall, or F1 score) and perform additional diagnostics (confusion matrix, ROC curve, etc.).

# 5. What are continuous and categorical variables?


    Continuous variables:
These can take any numerical value within a range, including fractional values.
Examples include:

Height (e.g. 160.5 cm)

Weight (e.g. 72.3 kg)

Time (e.g. 2.7 hours)

Distance (e.g. 5.5 km)

    Categorical variables:
These can take a limited number of separate, non-numeric values or categories.
Examples include:

Gender (e.g. male, female)

Eye color (e.g. blue, green, brown)

Marital status (e.g. single, married, divorced)

Type of vehicle (e.g. car, bike, truck)

# 6. How do we handle categorical variables in Machine Learning? What are the common techniques?

>- Categorical variables need special treatment because ML models typically work with numerical data. Here are some common techniques to handle categorical variables:

    Label Encoding
>- What: Encodes each category with a unique integer.

>- When to use: If there’s ordinal significance (like low, medium, high).

>- Pros: Simple; maintains order if applicable.

>- Cons: May confuse the algorithm if there’s no real ordering (integer might be interpreted as implying a numerical relation).
   
    One-Hot Encoding (Dummy variables)
>- What: Transforms each category into a new column with 0 or 1.

>- When to use: If there’s no ordinal relation.

>- Pros: Prevents algorithm from interpreting numerical relationships.

>- Cons: Increases dimensionality (especially with many unique categories).

    Target / Mean Encoding (with caution)
>- What: Encodes each category by the mean of the target variable for that category.

>- When to use: Features with a large number of categories (high cardinality).

>- Pros: Reduces dimensionality while retaining information.

>- Cons: Higher risk of data leakage or overfitting if not done carefully (should be done within cross-validation).

    Binary/Hash Encoding (for high cardinality)
>- Encodes each category into a fixed number of bits or hashed components.

>- Helps reduce dimensionality while retaining distinctions.
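
The first three techniques can be sketched with pandas and scikit-learn. The DataFrame and column names below are hypothetical. scikit-learn's OrdinalEncoder is used here as the feature-column form of label encoding, because it lets the category order be stated explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data for illustration
df = pd.DataFrame({
    "size": ["low", "high", "medium", "low"],
    "color": ["red", "blue", "red", "green"],
    "price": [10.0, 30.0, 20.0, 12.0],   # target variable
})

# Label (ordinal) encoding: integers that respect the order low < medium < high
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["size_encoded"] = ordinal.fit_transform(df[["size"]]).ravel()

# One-hot encoding: one 0/1 column per color
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Target (mean) encoding: replace each color by the mean price for that color.
# In practice, compute the means on training folds only to avoid leakage.
df["color_target"] = df.groupby("color")["price"].transform("mean")

print(df)
print(one_hot)
```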



# 7. What do you mean by training and testing a dataset?

When you train and test a dataset, you’re following a standard procedure to make sure your algorithm performs well and can generalize to new, unseen data.

Here’s a breakdown:

    Training a dataset:

>- This is the portion of your data used to teach or “train” your algorithm.

>- The algorithm looks for patterns, relationships, and structures within this data.

>- For example: If you’re training a classifier to identify cats and dogs, your training set includes many labeled photos of cats and dogs.

    Testing a dataset:

>- This is a separate set of data that’s withheld from training.

>- After training, you feed this new, previously unseen data into your algorithm.

>- The algorithm’s performance here lets you gauge its ability to generalize — whether it performs well with new inputs instead of just memorizing the training data.

# 8. What is sklearn.preprocessing?

>- sklearn.preprocessing refers to a module in scikit-learn (sklearn) that provides utilities for transforming raw data into a format that's more suitable for machine learning models.

This typically involves scaling, normalizing, or encoding the data.

# 9. What is a Test set?

>- A test set refers to a collection of data that is kept aside and not used during the training process of a machine learning model. Instead, it’s used afterwards to evaluate the model’s performance in a realistic scenario — that is, on data it’s never seen before.

# 10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

>- Usually, we use train_test_split from sklearn.model_selection. For example:

    from sklearn.model_selection import train_test_split

    # Toy data for illustration: X (features) and y (labels)
    X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
    y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

    # Split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.20,   # 20% for testing
        random_state=42   # for reproducibility
    )

Here’s a standard approach to tackling a Machine Learning problem:

1. Understand the problem.

>- What kind of problem is it? (Regression, Classification, Clustering, etc.)

>- What’s the business objective?

2. Gather and prepare the data.

>- Acquire the data (CSV, database, API).

>- Handle missing values, duplicates, and inconsistencies.

>- Perform exploratory data analysis (EDA) — visualize, compute statistics, and identify patterns.

3. Feature engineering.

>- Select or create relevant features.

>- Transform or scale if needed (normalize, standardize).

>- Encode categorical variables.

4. Split the data.

>- Split into training, validation, and/or testing sets.

5. Choose and train a model.

>- Select algorithm(s).

>- Train the algorithm(s) on training data.

6. Evaluate the model.

>- Use appropriate metrics (accuracy, RMSE, F1-score).

>- Validate against a separate test set or through cross-validation.

7. Hyperparameter Tuning (if needed).

>- Use grid search, random search, or Bayesian optimization.

8. Finalize and deploy.

>- Prepare for production.

>- Monitor performance over time.
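
Steps 2 to 7 can be sketched as one short scikit-learn workflow. This is an illustration under assumptions: the iris dataset stands in for real data, and the pipeline, model, and parameter grid are example choices rather than a recommended setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 2: gather data (iris stands in for a real dataset)
X, y = load_iris(return_X_y=True)

# Step 4: hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3 and 5: feature scaling and the model combined in one pipeline
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step 7: tune the regularization strength C with 5-fold cross-validation
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Step 6: evaluate the best model on the held-out test set
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Putting the scaler inside the pipeline means it is refit on each cross-validation training fold, so no information from the validation folds leaks into preprocessing.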

# 11. Why do we have to perform EDA before fitting a model to the data?

>- We perform Exploratory Data Analysis (EDA) before fitting a model for several key reasons:

    1. Understand the data’s structure and relationships

>- EDA helps us discover:

>- The distribution of each variable

>- Relationships or correlations between variables

>- The presence of outliers or anomalies

>- Patterns or clusters in the data

    2. Identify data issues and prepare for modeling

>- Before we feed the data into a model, we need to make sure it’s “clean” and appropriate for the algorithm. EDA lets us:

>- Handle missing values

>- Detect and deal with duplicates

>- Transform variables if needed (such as scaling or normalizing)

>- Encode categorical variables

>- Remove or address outliers

    3. Improve Model Performance and Interpretability

>- Using the knowledge gained from EDA, we can:

>- Select the most relevant features

>- Reduce dimensionality if there’s redundancy

>- Apply appropriate transformations to aid algorithm performance

>- Provide context for interpreting the eventual results

    4. Avoid “Garbage In, Garbage Out”

>- If we feed poor or misunderstood data into a model, the algorithm’s output will reflect those problems — which can undermine its credibility and utility.
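
A few typical EDA checks can be sketched with pandas. The small DataFrame below is hypothetical and was built to contain one missing value, one duplicate row, and one outlier. The 1.5 * IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a missing value, a duplicate row, and an outlier
df = pd.DataFrame({
    "height_cm": [160.5, 172.0, 168.2, 172.0, np.nan, 181.3],
    "weight_kg": [55.0, 72.3, 65.1, 72.3, 80.4, 400.0],
})

print(df.describe())                  # distribution of each variable
missing = df.isna().sum()             # missing values per column
duplicates = df.duplicated().sum()    # number of fully duplicated rows
correlations = df.corr()              # pairwise Pearson correlations

# Simple outlier check: values more than 1.5 * IQR above the third quartile
q1, q3 = df["weight_kg"].quantile([0.25, 0.75])
outliers = df[df["weight_kg"] > q3 + 1.5 * (q3 - q1)]

print(missing, duplicates, correlations, outliers, sep="\n\n")
```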

# 12. What is correlation?

>- Correlation refers to the statistical relationship or association between two or more variables.

    Key points about correlation:

>- Direction: It can be positive, negative, or zero (no correlation).

>- Strength: Measured by the correlation coefficient (r), which ranges from -1 to +1.

# 13. What does negative correlation mean?

>- A negative correlation means that two variables move in **opposite directions**.

    When one increases, the other decreases, or vice versa.

- Picture these examples:

>- The faster you drive, the less time it takes to reach your destination (negative correlation).
>- As stress increases, quality of sleep often drops (negative correlation).
>- An increase in exercise might be related to a decrease in weight (negative correlation).

    The strength of a negative correlation is measured by a correlation coefficient (r) that falls between -1 and 0:

>- -1 = perfect negative correlation (they move in exact opposition).

>- -0.5 = moderate negative correlation.

>- 0 = no correlation at all.

# 14. How can you find correlation between variables in Python?

>- You can find the correlation between variables in Python using the pandas corr() method or the NumPy corrcoef() function.
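
A short sketch of both approaches, using made-up study data (the column names are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_of_tv": [5, 4, 4, 2, 1],
})

# pandas: full correlation matrix (Pearson by default)
corr_matrix = df.corr()

# pandas: correlation between two specific columns
r_pandas = df["hours_studied"].corr(df["exam_score"])

# NumPy: corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(df["hours_studied"], df["exam_score"])[0, 1]

print(corr_matrix)
print(r_pandas, r_numpy)
```

Both calls return the same Pearson coefficient. DataFrame.corr() also accepts method="spearman" or method="kendall" when a rank-based measure is more appropriate.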

# 15. What is causation? Explain difference between correlation and causation with an example.

>- Causation refers to a cause-and-effect relationship, where one event directly produces or brings about another event.

>- Correlation, meanwhile, means two events are related or move together, but this doesn’t necessarily indicate that one directly causes the other.

    Example:

>- Let's say we observe these two facts:

>- The number of ice cream sales increases.

>- The number of drowning accidents also increases at the same time.

>- The two are correlated, but ice cream sales do not cause drownings. Both rise because of a third factor, hot weather, which leads more people to buy ice cream and more people to go swimming. That is correlation without causation.

# 16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

>- An optimizer is an algorithm or method used to adjust the parameters of a model (typically weights and biases in a neural network) to minimize an objective or loss function.
Essentially, the optimizer guides the training process by following the slope (gradient) of the loss to find a minimum — hopefully the minimum that results in the best performance.

>- Types of Optimizers (with Examples)

1. Gradient Descent (GD) or Batch Gradient Descent
Description: Computes the gradient of the loss over the entire training set to perform a single update.

Example:

    # Pseudo-code
    for epoch in range(epochs):
        gradient = compute_full_batch_gradient(model, training_data)
        model.weights -= learning_rate * gradient

2. Stochastic Gradient Descent (SGD)
Description: Computes the gradient for a single training example or a small mini-batch instead of the entire dataset.

Example:

    # Pseudo-code
    for epoch in range(epochs):
        for x, y in training_data:
            gradient = compute_gradient(model, x, y)
            model.weights -= learning_rate * gradient

3. Mini-batch Gradient Descent
Description: Combines the benefits of both methods by computing the gradient on a small batch of samples.

Example:

    # Pseudo-code
    batch_size = 32
    for epoch in range(epochs):
        for batch in batches(training_data, batch_size):
            gradient = compute_batch_gradient(model, batch)
            model.weights -= learning_rate * gradient


4. Momentum SGD
Description: Accumulates momentum from previous updates, adding a fraction of the previous update to accelerate convergence.

Example:

    # Pseudo-code
    v = 0
    beta = 0.9

    for epoch in range(epochs):
        for x, y in training_data:
            gradient = compute_gradient(model, x, y)
            v = beta * v + gradient
            model.weights -= learning_rate * v


5. Adagrad (Adaptive Gradient Algorithm)
Description: Adapts the learning rate for each parameter by dividing it by the square root of the accumulated sum of that parameter's past squared gradients.

Example:

    # Pseudo-code
    G = 0
    epsilon = 1e-8

    for epoch in range(epochs):
        for x, y in training_data:
            gradient = compute_gradient(model, x, y)
            G += gradient**2
            model.weights -= learning_rate * gradient / (np.sqrt(G) + epsilon)


6. RMSProp (Root Mean Square Propagation)
Description: Similar to Adagrad, but uses an exponentially decaying average of squared gradients instead of their sum.

Example:

    # Pseudo-code
    G = 0
    beta = 0.9
    epsilon = 1e-8

    for epoch in range(epochs):
        for x, y in training_data:
            gradient = compute_gradient(model, x, y)
            G = beta * G + (1 - beta) * gradient**2
            model.weights -= learning_rate * gradient / (np.sqrt(G) + epsilon)



7. Adam (Adaptive Moment Estimation)
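Description: Combines momentum (a running average of gradients) with RMSProp (a running average of squared gradients), and applies a bias correction to both averages.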

Example:

    # Pseudo-code
    v = 0
    G = 0
    beta1 = 0.9
    beta2 = 0.999
    epsilon = 1e-8
    t = 0

    for epoch in range(epochs):
        for x, y in training_data:
            t += 1
            gradient = compute_gradient(model, x, y)
            v = beta1 * v + (1 - beta1) * gradient
            G = beta2 * G + (1 - beta2) * (gradient**2)
            v_corrected = v / (1 - beta1**t)
            G_corrected = G / (1 - beta2**t)
            model.weights -= learning_rate * v_corrected / (np.sqrt(G_corrected) + epsilon)
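
The pseudo-code above can be turned into a runnable sketch with NumPy. The version below makes several simplifying assumptions: the data is a made-up linear-regression problem, the loss is mean squared error, and every update uses the full-batch gradient so the results are deterministic. The function names and hyperparameter values are illustrative only.

```python
import numpy as np

# Synthetic linear-regression data (made up): y = 3*x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=100)
X = np.column_stack([x, np.ones_like(x)])   # second column models the bias


def gradient(w):
    """Gradient of the mean squared error with respect to the weights w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def gradient_descent(steps=500, lr=0.1):
    w = np.zeros(2)
    for _ in range(steps):
        w -= lr * gradient(w)
    return w


def momentum(steps=500, lr=0.1, beta=0.9):
    w, v = np.zeros(2), np.zeros(2)
    for _ in range(steps):
        v = beta * v + gradient(w)   # accumulate past gradients
        w -= lr * v
    return w


def adam(steps=500, lr=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8):
    w, m, s = np.zeros(2), np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        s = beta2 * s + (1 - beta2) * g ** 2     # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction
        s_hat = s / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(s_hat) + epsilon)
    return w


for name, optimizer in [("GD", gradient_descent), ("Momentum", momentum), ("Adam", adam)]:
    print(name, optimizer())
```

All three should end up near a slope of 3 and a bias of 2. On a real model the gradient would come from mini-batches, as in the loops above.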






# 17. What is sklearn.linear_model ?

>- sklearn.linear_model is a module within scikit-learn (sklearn) — a popular machine learning library in Python — that contains a collection of classes and functions for linear models.

- What are linear models?
>-  Linear models are methods for understanding or predicting a variable (the output or target) as a linear combination of other variables (the inputs or features).
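
As a minimal sketch, LinearRegression from this module can recover the slope and intercept of data that follows a straight line. The data below is made up so that y = 2x + 1 exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2*x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X[:, 0] + 1.0

model = LinearRegression()
model.fit(X, y)

# The learned coefficient and intercept should recover the slope and offset
print(model.coef_, model.intercept_)
print(model.predict([[5.0]]))
```

The same module also provides other linear estimators such as Ridge, Lasso, and LogisticRegression, which share the fit() and predict() interface.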

# 18. What does model.fit() do? What arguments must be given?

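>- model.fit() trains the model: it learns the model's parameters from the training data. For a neural network (for example, a Keras model), each training step involves:

>- Forward pass: Computes predictions for each training example.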

>- Loss calculation: Measures the error by comparing predictions to true values.

>- Backward pass (backpropagation): Computes the gradient of the loss with respect to each parameter.

>- Weight update: Adjusts the parameters in the direction that reduces the loss.

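# Arguments:

>- Required: the training inputs X and, for supervised learning, the target values y, as in model.fit(X_train, y_train). Scikit-learn estimators follow this form.

>- The arguments below are optional and apply to Keras-style models:

>- epochs (integer): Number of times to iterate over the training data.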

>- batch_size (integer): Number of samples per update.

>- validation_data (tuple): Validation inputs and targets (for evaluating performance after each epoch).

>- shuffle (boolean): Whether to shuffle training samples each epoch.

>- callbacks (list): List of callbacks (like ModelCheckpoint, EarlyStopping) to aid training.

>- steps_per_epoch: Number of batches to process in each epoch (typically used with generators).



# 19. What does model.predict() do? What arguments must be given?

>- The model.predict() method is a way to generate predictions or outputs from a trained machine learning or deep learning model.

# What it does:
It takes in new, unseen data (typically called input samples) and produces a prediction — for instance:

>- Probabilities for each class in a classifier

>- Continuous values for regression

>- Encoded output for a generative model, etc.

# What arguments are typically required?

The most essential and typically required argument is:

>- The input data — normally called X.

This should be in a format that your model expects (for instance, a NumPy array, a Tensor, or a DataFrame), with the appropriate dimensions.
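
A minimal scikit-learn sketch of predict(), assuming a tiny made-up dataset with one feature and two classes. For classifiers, predict() returns class labels, while predict_proba() returns the per-class probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: one feature, two classes
X_train = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

X_new = np.array([[2.5], [8.5]])             # must be 2D: (n_samples, n_features)
labels = model.predict(X_new)                # predicted class per sample
probabilities = model.predict_proba(X_new)   # one probability per class

print(labels)
print(probabilities)
```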



# 20. What are continuous and categorical variables?

>- Continuous and categorical variables are two main types of variables you can work with in data and statistics.

    Continuous variables:

>- What they represent: Quantitative (numerical) data — they can take any value within a range.

>- Examples: Height (169.5 cm), weight (72.3 kg), temperature (23°C), or time (5.7 seconds).

>- Key feature: Between any two values, there’s an infinite number of possible values.

    Categorical variables:

>- What they represent: Qualitative (descriptive) data — they represent groups or categories.

>- Examples: Gender (male/female), color (red/blue/green), grade (A, B, C), or country (Italy, Spain, USA).

>- Key feature: Categories are discrete and separate; there’s no numerical range in which you can have fractional or intermediary values.
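
In pandas, a rough first split between the two kinds of variables can be made by column dtype. The DataFrame below is hypothetical. A numeric dtype does not guarantee a continuous variable (integer codes can stand for categories), so this is only a starting point.

```python
import pandas as pd

# Hypothetical dataset for illustration
df = pd.DataFrame({
    "height_cm": [160.5, 172.0, 181.3],
    "weight_kg": [55.0, 72.3, 80.4],
    "eye_color": ["blue", "green", "brown"],
    "vehicle": ["car", "bike", "car"],
})

# Numeric columns are treated as continuous here; the rest as categorical
continuous_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols)
print(categorical_cols)
print(df["vehicle"].value_counts())   # how often each category occurs
```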

# 21. What is feature scaling? How does it help in Machine Learning?

>- Feature scaling refers to the process of standardizing or normalizing the range of independent variables or features in your dataset. This typically involves transforming the values to a common scale (like 0–1 or with a mean of 0 and standard deviation of 1).

# How does it help in Machine Learning?

>- Feature scaling is crucial for many algorithms due to their mathematical mechanisms:

>- Distance-Based Methods (like KNN, K-Means): Without scaling, a feature with large values dominates the distance metric.

>- Gradient Descent-Based Methods (like Logistic Regression, Neural Networks): Proper scaling helps the algorithm converge faster and more stably.

>- Regularization (like Ridge, Lasso): If features are not on a similar scale, penalties may be unfairly distributed across coefficients.

# 22. How do we perform scaling in Python?

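>- Scaling is usually done with the scaler classes in sklearn.preprocessing, by calling fit_transform() on the training data and transform() on the test data:

>- StandardScaler: Standardizes each feature (mean = 0, standard deviation = 1).

>- MinMaxScaler: Normalizes each feature to the range [0, 1].

>- RobustScaler: Scales using the median and interquartile range, which reduces the influence of outliers.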
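
A short sketch of the three scalers on made-up data with two features on very different scales and one outlier in the second feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Made-up data: two features on very different scales, with one outlier in the second
X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
    [4.0, 4000.0],
    [5.0, 50000.0],
])

X_standard = StandardScaler().fit_transform(X)   # mean 0, std 1 per column
X_minmax = MinMaxScaler().fit_transform(X)       # min 0, max 1 per column
X_robust = RobustScaler().fit_transform(X)       # (x - median) / IQR per column

print(X_standard.round(2))
print(X_minmax.round(2))
print(X_robust.round(2))
```

In a real workflow the scaler would be fitted on the training set only and then applied to the test set, so that test-set statistics do not influence training.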



# 23. What is sklearn.preprocessing?

>- sklearn.preprocessing refers to a module in scikit-learn (sklearn) that provides utilities for transforming raw data into a format that's more suitable for machine learning models.

This typically involves scaling, normalizing, or encoding the data.

# 24. How do we split data for model fitting (training and testing) in Python?

>- To split your data into training and testing sets in Python, you typically use train_test_split from sklearn.model_selection.

Here’s an example:

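    # First, import the function (plus an example dataset and model)
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Example data in X (features) and y (labels); iris is used only for illustration
    X, y = load_iris(return_X_y=True)

    # Split into 80% training and 20% testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.20,
        random_state=42
    )

    # Now you can proceed with training:
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # And later evaluate (mean accuracy for a classifier):
    print(model.score(X_test, y_test))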


# 25. Explain data encoding?

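>- Data encoding refers to the process of converting information from one format or representation into another, typically for storage, transmission, or processing. In machine learning, it usually means converting categorical values into numbers (for example, with label encoding or one-hot encoding) so that a model can use them.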