Assignment Questions


# 1. What is a parameter?
  - In the context of machine learning, a parameter is an internal variable of the model whose value is learned from the training data, unlike a hyperparameter, which is set before training.

  -  Key Characteristics of Parameters:
    - Learned during training.
    - Define the model itself.
    - Examples: weights in neural networks, coefficients in linear regression, etc.
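
As a minimal sketch of what "learned during training" means, the snippet below fits a straight line to synthetic data with NumPy. The data values and variable names are assumptions made for illustration; the slope and intercept are the parameters recovered from the data.

```python
import numpy as np

# Synthetic data generated from y = 3x + 2 (values chosen for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 * x + 2.0

# Fit a degree-1 polynomial; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)

# The slope and intercept are the parameters learned from the data.
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```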


# 2. What is correlation?
  - Correlation is a statistical measure that shows the strength and direction of a relationship between two variables.

   - Key Points:
     - It tells us how closely two variables move together.
     - The value of correlation ranges between -1 and +1.
  
   - Correlation helps us understand which features are strongly related to the target variable or to each other.
   - It’s useful for feature selection — to keep the most relevant features and avoid multicollinearity.

   - What does negative correlation mean?
     - A negative correlation means that as one variable increases, the other variable decreases, and vice versa.
     - In feature engineering, if two features have a strong negative correlation, one might be used to predict the other — or you might decide to remove one if it’s not helpful for the model.



# 3. Define Machine Learning. What are the main components in Machine Learning?
  - Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computers to learn from data and make predictions or decisions without being explicitly programmed.

  - Main Components in Machine Learning
    - Data:
      - The foundation of ML — includes both features (inputs) and target (output).
      - Example: For predicting house prices:
        - Features: size, number of bedrooms
        - Target: house price
    
    - Model
      - The learned representation of the patterns in the data, produced by running a learning algorithm on the training set.
      - Examples: Linear Regression, Decision Tree, Random Forest, etc.
   
    - Training
      - The process of feeding data to the model so it can learn the relationship between inputs and outputs.

    - Prediction/Inference
      - Once trained, the model can make predictions on new, unseen data.

    - Evaluation
      - Checking how well the model performs using metrics like:
        - Accuracy, Precision, Recall (for classification)
        - MSE, RMSE (for regression)

    - Features
      - The input variables used for training the model.
      - Feature engineering (creating and transforming input variables) is a key part of this step.

    - Algorithm
      - The mathematical logic or method that the model uses to learn from data.
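
The classification metrics named under Evaluation can be computed by hand, as in the rough sketch below. The label arrays are made-up values for illustration, not results from a real model.

```python
import numpy as np

# Hypothetical true labels and model predictions for a binary classifier.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Count true positives, false positives, and false negatives.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = float(np.mean(y_true == y_pred))  # share of correct predictions
precision = tp / (tp + fp)                   # of predicted positives, how many are right
recall = tp / (tp + fn)                      # of actual positives, how many were found

print(accuracy, precision, recall)
```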




# 4. How does loss value help in determining whether the model is good or not?
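  - The loss value measures how far the model's predictions are from the actual values.
     - Low loss = Model is predicting well (close to actual values).
     - High loss = Model is making poor predictions (far from actual values).
     - Loss should be compared on training data and held-out data: a low training loss with a high test loss indicates overfitting, not a good model.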
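
To make the low-versus-high comparison concrete, here is a small sketch using mean squared error as the loss. The choice of MSE and the numbers are assumptions for illustration; other loss functions follow the same idea.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
good_pred = [2.9, 5.1, 7.2]   # close to the actual values
poor_pred = [6.0, 1.0, 10.0]  # far from the actual values

loss_good = mse(y_true, good_pred)
loss_poor = mse(y_true, poor_pred)
print(loss_good, loss_poor)
```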


# 5. What are continuous and categorical variables?
  - 1. Continuous Variables:
     - These are numerical values that can take any value within a range — including decimals.
     - Examples:
       - Age (22.5 years)
       - Height (5.9 feet)
       - Salary (₹55,000.75)
       - Temperature (37.2°C)
     - They are measured rather than counted, and can in principle be recorded to any level of precision.

  - 2. Categorical Variables:
     - These represent distinct groups or categories. They usually take on a limited number of values (often labels or names).
     - Examples:
       - Gender (Male, Female, Other)
       - City (Mumbai, Delhi, Bangalore)
       - Product Type (Electronics, Clothing, Groceries)
     - They are not measured — they are labeled or grouped.



# 6. How do we handle categorical variables in Machine Learning? What are the common techniques?
  - Most machine learning algorithms cannot process strings or labels directly, so categorical variables are usually encoded as numbers.

  - Common Techniques to Handle Categorical Variables:
    - 1. Label Encoding
      - Converts categories into integer labels (0, 1, 2, …).
      - The integers imply an order, so it suits ordinal data or target labels; for nominal features a model may read a false ranking into the numbers.
    
    - 2. One-Hot Encoding
      - Creates a new binary column for each category.
      - Values are 0 or 1.

    - 3. Ordinal Encoding
      - Similar to label encoding, but you define the order

    - 4. Frequency Encoding
      - Replace each category with how often it appears.

    - 5. Target Encoding (Mean Encoding)
      - Replace category with the mean of the target variable for that category
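
The five techniques above can be sketched with plain pandas. The toy DataFrame, its column names, and the size ordering are assumptions for illustration. Note that target encoding computed on the full dataset, as done here for brevity, leaks the target; in practice it is usually fitted on the training portion only.

```python
import pandas as pd

# Hypothetical toy data: one ordinal feature, one nominal feature, one target.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "city": ["Mumbai", "Delhi", "Mumbai", "Bangalore"],
    "price": [100.0, 300.0, 200.0, 120.0],
})

# 1. Label encoding: integer codes assigned in alphabetical order of the categories.
df["city_label"] = df["city"].astype("category").cat.codes

# 2. One-hot encoding: one binary column per category.
city_onehot = pd.get_dummies(df["city"], prefix="city")

# 3. Ordinal encoding: integer codes in an order that we define ourselves.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_ordinal"] = df["size"].map(size_order)

# 4. Frequency encoding: replace each category with how often it appears.
df["city_freq"] = df["city"].map(df["city"].value_counts())

# 5. Target (mean) encoding: replace each category with the mean target value.
df["city_target"] = df["city"].map(df.groupby("city")["price"].mean())

print(df)
print(city_onehot)
```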


    

# 7. What do you mean by training and testing a dataset?
  - In Machine Learning, we split the dataset into two main parts:

  - 1. Training Dataset
    - This is the part of the data used to teach the model.
    - The model learns patterns and relationships from this data.
    - Usually takes up 70% to 80% of the entire dataset.

  - 2. Testing Dataset
    - This part is used to evaluate how well the trained model performs on new, unseen data.
    - It checks the generalization ability of the model.
    - Usually takes up 20% to 30% of the dataset.




# 8. What is sklearn.preprocessing?
  - sklearn.preprocessing is a module in the Scikit-learn library that provides tools to prepare and transform data before feeding it into a machine learning model.



# 9. What is a Test set?
  - A test set is a portion of the dataset that is not used during training. Instead, it's used to evaluate the performance of a machine learning model after it has been trained.

  - Purpose:
    - To check how well the model performs on unseen data.
    - It helps determine whether the model can generalize beyond the training data.




# 10. How do we split data for model fitting (training and testing) in Python?
  - We use the train_test_split() function from Scikit-learn to split our dataset into:
    - Training set: used to train the model
    - Testing set: used to evaluate the model

   - How do you approach a Machine Learning problem?

     - Understand the Problem (Domain Knowledge)
       - What are you trying to predict or classify?
       - What’s the goal — Accuracy? Revenue? Time-saving?
       - Is it a classification, regression, or clustering task?
    
     - Collect & Explore the Data (EDA 🔍)
       - Load the dataset (CSV, SQL, API, etc.)
       - Check shape, types, and summary stats
       - Visualize key features: histograms, boxplots, scatterplots
       - Check for missing values, outliers, duplicates



# 11. Why do we have to perform EDA before fitting a model to the data?
  - EDA (Exploratory Data Analysis) is like getting to know your data before trusting it. It helps you:
     - Understand the structure
     - Detect issues
     - Discover patterns
     - And ultimately make better decisions for model building

  - Here's Why EDA Is Important:
    - 1. Understand the Data
       - What are the columns?
       - What types of variables are they? (categorical, continuous)
       - Are they relevant to the problem?

    - 2. Detect Missing or Corrupt Data
       - Many ML models can’t handle missing values.
       - EDA helps you find and decide whether to fill, drop, or flag them.

    - 3. Spot Outliers and Anomalies
      - Outliers can distort model performance.
      - You might want to remove or treat them.

    - 4. Check Distributions
      - Helps decide which preprocessing steps are needed:
         - Normalization or Standardization?
         - Log transformation?
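
A rough sketch of these four checks with pandas follows. The toy dataset and its column names are hypothetical, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; a real one would typically be loaded from a file.
df = pd.DataFrame({
    "size_sqft": [850, 900, 1200, np.nan, 950, 10000],
    "bedrooms": [2, 2, 3, 3, 2, 4],
    "city": ["Mumbai", "Delhi", "Mumbai", "Delhi", None, "Mumbai"],
})

# 1. Understand the data: shape, column types, summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe())

# 2. Detect missing or duplicated data.
missing_counts = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())

# 3. Spot outliers with the 1.5 * IQR rule (one common convention).
q1, q3 = df["size_sqft"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["size_sqft"] < q1 - 1.5 * iqr) | (df["size_sqft"] > q3 + 1.5 * iqr)
outliers = df.loc[outlier_mask, "size_sqft"].tolist()

# 4. Check the distribution: strong positive skew may suggest a log transformation.
skewness = df["size_sqft"].skew()

print(missing_counts, duplicate_rows, outliers, skewness)
```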



# 12. What is correlation?
  - Same as the answer to question 2.



# 13. What does negative correlation mean?
  - Negative correlation means that as one variable increases, the other decreases.

  - Example:
    - Outdoor Temperature ↑ → Sales of Winter Jackets ↓
    - Hours Spent Watching TV ↑ → Grades in School ↓ (possibly!)
  - These pairs have negative relationships — as one goes up, the other tends to go down.
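
The temperature and jacket example can be checked numerically. The figures below are invented for illustration; a Pearson coefficient close to -1 indicates a strong negative relationship.

```python
import numpy as np

# Invented data: as temperature rises, winter jacket sales fall.
temperature_c = np.array([5, 10, 15, 20, 25, 30], dtype=float)
jacket_sales = np.array([120, 100, 75, 50, 30, 10], dtype=float)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(temperature_c, jacket_sales)[0, 1]
print(f"Pearson r = {r:.3f}")
```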



# 14. How can you find correlation between variables in Python?
  - Steps:

     - Use the DataFrame.corr() method from pandas

```python
import pandas as pd

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Correlation matrix over the numeric columns
# (numeric_only=True skips text columns, which raise an error in pandas 2.x otherwise)
correlation_matrix = df.corr(numeric_only=True)

# Print it
print(correlation_matrix)
```

     - Visualize with a heatmap (optional but powerful)

```python
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
```



# 15. What is causation? Explain difference between correlation and causation with an example.
  - Causation means one variable directly affects or causes a change in another.

  
  - Correlation vs. Causation

  - Correlation
    - Relationship: Variables move together
    - Direction: Symmetric (no cause-and-effect direction implied)
    - Proof: Statistical (but not always meaningful)
    - Example: Ice cream sales ↑ ↔ Drowning cases ↑ (both rise in hot weather; neither causes the other)

  - Causation
    - Relationship: One variable causes the other
    - Direction: Directional (A → B)
    - Proof: Requires deeper evidence or experiments
    - Example: Smoking → Lung cancer
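
The ice cream and drowning example can be simulated as a sketch. It assumes a hypothetical common cause (temperature) that drives both variables; all coefficients and noise levels are invented. Once the effect of temperature is removed from both series, the correlation between them largely disappears, which is what we would expect if neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical common cause: daily temperature.
temperature = rng.normal(25, 5, n)

# Both variables depend on temperature, but not on each other.
ice_cream = 2.0 * temperature + rng.normal(0, 3, n)
drownings = 0.5 * temperature + rng.normal(0, 1, n)

# Raw correlation is strong even though there is no causal link.
raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Correlation after controlling for temperature (a partial correlation).
partial_r = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw r = {raw_r:.2f}, r after controlling for temperature = {partial_r:.2f}")
```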



# 16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
   - An optimizer is an algorithm that adjusts the model's weights to minimize the loss function during training.

  - Different types of optimizers

    - 1. Gradient Descent (GD)
      - Moves in the direction of steepest descent (negative gradient)
      - Updates weights once per pass, using the gradient computed over the entire dataset
      - Rarely used in deep learning because each update is slow on large datasets

    - 2. Stochastic Gradient Descent (SGD)
      - Updates weights after each training sample
      - Faster per update, but the updates are noisier

    - 3. Mini-batch Gradient Descent
      - A middle ground: updates weights after small batches of data
      - Less noisy than SGD and cheaper per update than full-batch gradient descent

    - 4. Adam (Adaptive Moment Estimation)
      - Combines Momentum and RMSProp
      - Very popular for deep learning tasks
      - Adjusts learning rate for each parameter individually
      
    - 5. RMSProp
      - Keeps a moving average of squared gradients and scales each parameter's learning rate by it
      - Works well for RNNs and deep networks
    
    - 6. Adagrad
      - Adapts the learning rate per parameter: frequently updated parameters get smaller steps, rarely updated ones get larger steps
      - Works well for sparse data (like NLP or recommender systems)
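
As an illustrative sketch, four of these optimizers (batch gradient descent, SGD, mini-batch gradient descent, and Adam) can be written in NumPy for a one-feature linear regression. The synthetic data, learning rates, epoch counts, and function names are all assumptions chosen for the example; RMSProp and Adagrad are not shown. The only difference between the first three is the batch size.

```python
import numpy as np

# Synthetic data from y = 4x + 1 plus a little noise (values are illustrative).
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 1))
y = 4.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=200)

def gradients(w, b, X_batch, y_batch):
    """Gradients of mean squared error for the model y_hat = w * x + b."""
    error = w * X_batch[:, 0] + b - y_batch
    return np.array([2.0 * np.mean(error * X_batch[:, 0]), 2.0 * np.mean(error)])

def train(batch_size, lr=0.1, epochs=200, use_adam=False):
    w, b = 0.0, 0.0
    m, v, t = np.zeros(2), np.zeros(2), 0      # Adam state: moment estimates, step count
    beta1, beta2, eps = 0.9, 0.999, 1e-8       # commonly used Adam defaults
    order_rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = order_rng.permutation(len(X))    # shuffle the samples each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            g = gradients(w, b, X[batch], y[batch])
            if use_adam:
                t += 1
                m = beta1 * m + (1 - beta1) * g        # momentum-like average of gradients
                v = beta2 * v + (1 - beta2) * g ** 2   # RMSProp-like average of squared gradients
                m_hat = m / (1 - beta1 ** t)           # bias correction
                v_hat = v / (1 - beta2 ** t)
                step = lr * m_hat / (np.sqrt(v_hat) + eps)
            else:
                step = lr * g                          # plain gradient step
            w, b = w - step[0], b - step[1]
    return w, b

w_gd, b_gd = train(batch_size=len(X))                     # batch GD: whole dataset per update
w_sgd, b_sgd = train(batch_size=1, lr=0.01, epochs=20)    # SGD: one sample per update
w_mb, b_mb = train(batch_size=32)                         # mini-batch GD
w_adam, b_adam = train(batch_size=32, lr=0.05, use_adam=True)

print(w_gd, b_gd, w_sgd, b_sgd, w_mb, b_mb, w_adam, b_adam)
```

All four runs should land near the true values w = 4 and b = 1; they differ in how many updates they need and how noisy the path is.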




# 17. What is sklearn.linear_model ?
  - sklearn.linear_model is a module in Scikit-learn that provides classes to implement linear models for:
     - Regression
     - Classification
    
  - It contains models like Linear Regression, Logistic Regression, Ridge, Lasso, etc.
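
A brief sketch of the four models named above, fitted on tiny synthetic data. The data and the alpha values are assumptions for illustration; Ridge and Lasso shrink the coefficient slightly compared with plain linear regression.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_reg = 2.0 * X.ravel() + 1.0            # regression target: y = 2x + 1
y_clf = np.array([0, 0, 0, 1, 1, 1])     # classification target

# Regression models
reg = LinearRegression().fit(X, y_reg)
ridge = Ridge(alpha=1.0).fit(X, y_reg)   # L2-regularized
lasso = Lasso(alpha=0.1).fit(X, y_reg)   # L1-regularized

# Classification model
clf = LogisticRegression().fit(X, y_clf)

reg_pred = reg.predict(np.array([[7.0]]))[0]
clf_pred = clf.predict(np.array([[1.0], [6.0]]))

print(reg.coef_[0], ridge.coef_[0], lasso.coef_[0])
print(reg_pred, clf_pred)
```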



# 18. What does model.fit() do? What arguments must be given?
  - model.fit() is the function that trains your model.
  - It tells the model:
    - "Here is the input data (X) and the correct answers (y) — now learn the patterns!"


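  - Arguments
    - X_train : Input features (independent variables). Required.
    - y_train : Target values (dependent variable/labels). Required for supervised models.
    - epochs : Number of passes over the data. Optional, and specific to deep learning APIs such as Keras; Scikit-learn's fit() does not accept it.
    - batch_size : Number of samples per weight update. Optional, and likewise specific to Keras-style APIs.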


# 19. What does model.predict() do? What arguments must be given?
  - model.predict() is used after training, to make predictions on new/unseen data.

  - Argument:
    - X_test : The input data (features) for which you want predictions

  - X_test should have the same number of features as X_train used during training.

  - What Does It Return
    - For regression models: predicted numeric values
    - For classification models: predicted class labels in Scikit-learn (probabilities come from the separate predict_proba() method); Keras-style models typically return probabilities from predict()
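
A minimal sketch of the fit-then-predict workflow with a Scikit-learn classifier. The small arrays are invented for illustration; note that X_test has the same number of columns as X_train.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: two features per sample, binary labels.
X_train = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 1.5],
                    [6.0, 3.0], [7.0, 3.5], [8.0, 4.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# New, unseen samples with the same two features.
X_test = np.array([[1.5, 0.7], [7.5, 3.8]])

model = LogisticRegression()
model.fit(X_train, y_train)           # training: learn from X_train and y_train

labels = model.predict(X_test)        # predicted class labels
probs = model.predict_proba(X_test)   # class probabilities, one row per sample

print(labels)
print(probs)
```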



# 20. What are continuous and categorical variables?
  - Same as the answer to question 5.



# 21. What is feature scaling? How does it help in Machine Learning?
  - Feature Scaling is the process of normalizing or standardizing the range of independent variables (features) so that they are on a similar scale.

  - It helps make sure no feature dominates the others just because of its larger values.

  - How It Helps:
    - Makes gradient descent converge faster, which speeds up training
    - Can improve accuracy for distance-based models (k-NN, SVM, k-means) and regularized models
    - Has little effect on tree-based models, which split on one feature at a time



# 22. How do we perform scaling in Python?
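  - Steps to perform scaling in Python:
    - 1. Import the scalers from sklearn.preprocessing
    - 2. Prepare a sample dataset
    - 3. Standard scaling (z-score): subtract the mean and divide by the standard deviation
    - 4. Min-max scaling (normalization): rescale each feature to the range [0, 1]
    - 5. Robust scaling: subtract the median and divide by the interquartile range, which limits the influence of outliers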
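
The steps above might look like this in code. The sample array is made up, with one deliberately extreme value in the second column to show how the scalers differ.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Sample dataset: two features on very different scales, with one outlier (10000).
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0],
              [5.0, 10000.0]])

X_std = StandardScaler().fit_transform(X)     # mean 0, standard deviation 1 per column
X_minmax = MinMaxScaler().fit_transform(X)    # each column rescaled to [0, 1]
X_robust = RobustScaler().fit_transform(X)    # median 0, scaled by interquartile range

print(X_std)
print(X_minmax)
print(X_robust)
```

With a train/test split, the usual practice is to call fit (or fit_transform) on the training data only and then transform on the test data, so that no information from the test set leaks into the scaling.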
    


# 23. What is sklearn.preprocessing?
  - sklearn.preprocessing is a module in Scikit-learn that provides tools to prepare your data before feeding it into a machine learning model.

  - It includes techniques for scaling, normalizing, encoding, and transforming data.



# 24. How do we split data for model fitting (training and testing) in Python?
  - We use train_test_split() from sklearn.model_selection.

  - What each argument to train_test_split() means:
    - X : Input features (independent variables)
    - y : Target values (dependent variable/labels)
    - test_size=0.2 : 20% of the data for testing, 80% for training
    - random_state : Sets a seed so the split is reproducible
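
A short sketch of such a call on a made-up dataset of 10 samples. The seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up dataset: 10 samples with 2 features each, and binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```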



# 25. Explain data encoding?
  - Data encoding means converting categorical (non-numeric) data into numeric format so that machine learning models can understand and use it.

  - ML models work with numbers, not text — so encoding turns values like "Red", "Green", or "Yes" into usable numbers.
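
As a sketch of how the "Red", "Green", and "Yes" style values become numbers, the snippet below uses three encoders from sklearn.preprocessing. The sample values and the size ordering are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# One-hot encoding for a nominal feature (no natural order).
colors = np.array([["Red"], ["Green"], ["Blue"], ["Green"]])
onehot = OneHotEncoder()
colors_onehot = onehot.fit_transform(colors).toarray()  # sparse result made dense
color_columns = onehot.categories_[0].tolist()          # column order (alphabetical)

# Ordinal encoding for a feature whose order we define explicitly.
sizes = np.array([["small"], ["large"], ["medium"]])
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes_encoded = ordinal.fit_transform(sizes)

# Label encoding, typically used for the target column.
answers = ["Yes", "No", "Yes"]
answers_encoded = LabelEncoder().fit_transform(answers)

print(color_columns)
print(colors_onehot)
print(sizes_encoded.ravel())
print(answers_encoded)
```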


