**Questions**

1. What is a parameter?

    In machine learning, a parameter is an internal variable that a model learns from data during training, such as a weight in a neural network or a coefficient in linear regression. Parameters define how input features are transformed into outputs and are adjusted to minimize error.
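
As a small illustration of learned parameters, the sketch below fits a scikit-learn `LinearRegression` to toy data and reads back the coefficient and intercept it learned. The data values are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (values chosen for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X, y)

# The learned parameters: one coefficient per feature, plus an intercept
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```

Because the data lies exactly on a line, the learned slope and intercept should match the values used to generate it, up to floating-point error.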

2. What is correlation? What does negative correlation mean?

    Correlation:

    A statistical measure of how strongly two variables move together, usually reported as a coefficient between -1 and +1.

    It does not indicate causation, only that there is a relationship.

    Negative Correlation:

    - A negative correlation (or inverse correlation) occurs when one variable tends to increase as the other decreases, and vice versa.

    - Graphically, this appears as a downward-sloping trend in a scatter plot.
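
A minimal sketch of a negative correlation, using NumPy and invented numbers (the variable names `hours_tv` and `exam_score` are hypothetical):

```python
import numpy as np

# Hypothetical data: as hours of TV go up, exam scores go down
hours_tv = np.array([1, 2, 3, 4, 5])
exam_score = np.array([90, 85, 70, 65, 50])

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient
# between the two inputs
r = np.corrcoef(hours_tv, exam_score)[0, 1]
print(round(r, 3))
```

The coefficient comes out close to -1, which is what a strong downward trend looks like numerically.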

3. Define Machine Learning. What are the main components in Machine Learning?

    Machine learning is the practice of learning patterns from data and applying them to new data, rather than hand-coding explicit rules.

    Main components include:

    1. Data
    2. Algorithms
    3. Models
    4. Features

4. How does loss value help in determining whether the model is good or not?

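    A loss value is a numerical measure of how far a model's predictions are from the true labels, computed with a chosen loss function (such as MSE for regression or cross-entropy for classification). A lower loss means a better fit to the data it was computed on, so the loss lets us compare models and track progress during training. Comparing training loss with validation loss also helps detect overfitting: the training loss keeps falling while the validation loss starts to rise.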
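
To make this concrete, here is a small sketch that computes mean squared error by hand for two sets of predictions. The helper `mse` and the numbers are illustrative, not taken from any particular model.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [3.0, 5.0, 7.0]
preds_a = [2.5, 5.0, 7.5]   # close to the targets
preds_b = [0.0, 9.0, 4.0]   # far from the targets

loss_a = mse(y_true, preds_a)
loss_b = mse(y_true, preds_b)
print(loss_a, loss_b)
```

The predictions that sit closer to the targets produce the smaller loss, which is the sense in which a lower loss indicates a better model on that data.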

5. What are continuous and categorical variables?

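    Continuous variables: Numeric values that can take any value within a range (e.g., height or temperature) and support arithmetic operations.

    Categorical variables: Values that fall into a fixed set of categories, either nominal (no order, e.g., color) or ordinal (ordered, e.g., low/medium/high). The gaps between categories have no numeric meaning.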

6. How do we handle categorical variables in Machine Learning? What are the common techniques?

    Common techniques:
    1. Label / Ordinal Encoding
    2. One-Hot Encoding
    3. Binary / Base-N Encoding
    4. Target Encoding

    Choose encoding method based on: variable type (nominal/ordinal), cardinality, and model type.

    Always prevent data leakage by fitting encoders on only training data.

    For high-dimensional categories, prefer compact methods (binary, hashing, embeddings).

    For interpretability, stick with one‑hot or label encoding where feasible.
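
The sketch below shows two of these encoders from scikit-learn, fitted on training data only and then applied to test data. The frames `train` and `test` and their columns are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical training and test frames
train = pd.DataFrame({"color": ["red", "green", "blue"], "size": ["S", "M", "L"]})
test = pd.DataFrame({"color": ["green", "purple"], "size": ["L", "S"]})

# Nominal feature: one-hot encode, ignoring categories unseen during fit
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(train[["color"]])
color_test = onehot.transform(test[["color"]]).toarray()

# Ordinal feature: supply the category order explicitly
ordinal = OrdinalEncoder(categories=[["S", "M", "L"]])
ordinal.fit(train[["size"]])
size_test = ordinal.transform(test[["size"]])

print(onehot.categories_)
print(color_test)
print(size_test)
```

With `handle_unknown="ignore"`, a category that never appeared in training (here `"purple"`) is encoded as a row of zeros rather than raising an error.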


7. What do you mean by training and testing a dataset?

    Training set:
    The training dataset is used to fit the model's parameters (such as weights in a neural network or coefficients in regression).

    During training:
    - The learning algorithm processes input-label pairs (in supervised learning).
    - It adjusts parameters to reduce error via optimization (e.g., gradient descent).
    - This is akin to a student studying from textbooks: the model learns patterns in the data.

    Test set: The test dataset contains unseen examples and is never used during training or hyperparameter tuning.

    - After training is complete, the model makes predictions on this data. We then compute metrics (accuracy, precision, F1, etc.).

    - The test performance indicates how well the model generalizes to new real-world data, much like an exam after studying.
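
A minimal sketch of the train-then-test workflow on the Iris dataset. The choice of `LogisticRegression`, the 80/20 split, and `random_state=42` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)            # training: parameters are learned here

# Testing: score predictions on the held-out rows
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(train_acc, test_acc)
```

A test accuracy far below the training accuracy would suggest the model has memorized the training data rather than learned patterns that generalize.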


8. What is sklearn.preprocessing?

    sklearn.preprocessing is a scikit-learn submodule that provides tools for transforming raw data into formats suited to machine learning models (e.g., scaling, encoding, and normalization).

9. What is a Test set?

    In machine learning, a test set is a subset of the dataset used to evaluate the performance of a trained model on unseen data. It is kept separate from the training and validation sets to ensure an unbiased assessment of the model's ability to generalize to new, real-world data.

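10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

    To split the data, use train_test_split from sklearn.model_selection, where X holds the features and y the target:

          from sklearn.model_selection import train_test_split

          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    A typical approach to a machine learning problem: define the problem and the evaluation metric, collect the data, explore it (EDA), clean and preprocess it, split it into training and test sets, train a model, evaluate it on the test set, and iterate.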


11. Why do we have to perform EDA before fitting a model to the data?

    Reasons:

    1. Understanding the Dataset
    2. Identifying Data Quality Issues
    3. Feature Engineering and Transformation
    4. Selecting the Right Model
    5. Visualizing Data Distributions
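
A few pandas calls cover the first checks most EDA passes start with. The sketch below uses a made-up DataFrame; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value and one categorical column
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51],
    "income": [40000, 52000, 61000, 150000],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

summary = df.describe()                   # count, mean, std, quartiles (numeric columns)
missing = df.isna().sum()                 # missing values per column
city_counts = df["city"].value_counts()   # category frequencies

print(summary)
print(missing)
print(city_counts)
```

Output like this surfaces missing values, outliers (such as the large `income` value), and category imbalance before any model is fitted.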

12. What is correlation?

      A statistical measure of how strongly two variables move together.

      It does not indicate causation, only that there is a relationship.

13. What does negative correlation mean?

    Negative Correlation:

    A negative correlation (or inverse correlation) occurs when one variable tends to increase as the other decreases, and vice versa.

    Graphically, this appears as a downward-sloping trend in a scatter plot.

14. How can you find correlation between variables in Python?

    1. Using Pandas:
       Pandas computes the correlation matrix for a DataFrame with the .corr() method. It uses the Pearson correlation coefficient by default, which measures the linear relationship between variables.

    2. Using NumPy:
       NumPy's np.corrcoef() function computes the Pearson correlation coefficient for two or more variables.

    3. Using Seaborn for Visualization:
       Seaborn, built on top of Matplotlib, provides a high-level interface for statistical graphics. You can visualize the correlation matrix as a heatmap.
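
The three approaches can be sketched together as follows. The DataFrame and its column names are invented for illustration, and the Seaborn heatmap is left commented out so the block runs without a plotting backend.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 71, 80],
    "hours_slept": [9, 8, 8, 6, 5],
})

# 1. Pandas: full correlation matrix (Pearson by default)
corr_matrix = df.corr()

# 2. NumPy: coefficient between two specific variables
r_numpy = np.corrcoef(df["hours_studied"], df["exam_score"])[0, 1]

print(corr_matrix.round(2))
print(round(r_numpy, 2))

# 3. Seaborn: optional heatmap, if seaborn and matplotlib are installed
# import seaborn as sns
# import matplotlib.pyplot as plt
# sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
# plt.show()
```

Pandas and NumPy agree on the Pearson coefficient for the same pair of columns; the heatmap only changes how the matrix is presented.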

15. What is causation? Explain difference between correlation and causation with an example

    Causation refers to a direct cause-and-effect relationship between two variables, where a change in one variable directly leads to a change in another. In contrast, correlation indicates a statistical association between two variables, but it does not imply that one causes the other.

    Example: Ice Cream Sales and Sunburns:
    A classic example illustrating the difference is the relationship between ice cream sales and sunburn incidents. Data may show that both ice cream sales and sunburn cases increase during summer months. However, this is a correlation, not causation. The underlying cause is sun exposure, which leads to both higher ice cream consumption and increased risk of sunburn. Therefore, while the two variables are correlated, one does not cause the other.
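
The ice cream example can be simulated. In the sketch below, a synthetic `sun` variable drives both `ice_cream` and `sunburns`; the coefficients 2.0 and 3.0, the noise levels, and the sample size are arbitrary assumptions. Neither outcome influences the other, yet they correlate strongly until the shared cause is removed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic common cause: daily sun exposure
sun = rng.uniform(0, 10, size=1000)

# Both outcomes depend on sun exposure, but not on each other
ice_cream = 2.0 * sun + rng.normal(0, 1, size=1000)
sunburns = 3.0 * sun + rng.normal(0, 1, size=1000)

# Raw correlation between the two outcomes
r_raw = np.corrcoef(ice_cream, sunburns)[0, 1]

# Subtract the (known, simulated) effect of sun exposure from each
ice_resid = ice_cream - 2.0 * sun
burn_resid = sunburns - 3.0 * sun
r_resid = np.corrcoef(ice_resid, burn_resid)[0, 1]

print(round(r_raw, 3), round(r_resid, 3))
```

Once the effect of sun exposure is accounted for, the leftover correlation is close to zero, which is what we expect when the association came entirely from a shared cause.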

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

    In machine learning, an optimizer is an algorithm or method used to adjust the weights and biases of a model to minimize the loss function during training. The goal is to find the optimal parameters that lead to the best performance of the model.

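    1. Stochastic Gradient Descent (SGD): updates the parameters using the gradient computed on a single sample or mini-batch, scaled by a fixed learning rate.

            from tensorflow.keras.optimizers import SGD
            optimizer = SGD(learning_rate=0.01)

    2. Momentum: adds a fraction of the previous update to the current one, which smooths the path and speeds up progress along consistent gradient directions.

            from tensorflow.keras.optimizers import SGD
            optimizer = SGD(learning_rate=0.01, momentum=0.9)

    3. Nesterov Accelerated Gradient (NAG): a momentum variant that evaluates the gradient at the look-ahead position, which often reduces overshooting.

            from tensorflow.keras.optimizers import SGD
            optimizer = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

    4. Adagrad (Adaptive Gradient Algorithm): adapts the learning rate per parameter, shrinking it for parameters that have accumulated large gradients.

            from tensorflow.keras.optimizers import Adagrad
            optimizer = Adagrad(learning_rate=0.01)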
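
To show what these update rules do, here is a NumPy sketch on a one-dimensional toy loss f(w) = (w - 3)^2. It is not how Keras implements its optimizers; it uses the exact gradient rather than mini-batches, and the function names, learning rates, and step count are assumptions for illustration.

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, minimized at w = 3."""
    return 2.0 * (w - 3.0)

def plain_gd(w, lr=0.1, steps=200):
    for _ in range(steps):
        w -= lr * grad(w)                 # step against the gradient
    return w

def momentum_gd(w, lr=0.1, momentum=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        v = momentum * v - lr * grad(w)   # velocity remembers past updates
        w += v
    return w

def adagrad(w, lr=1.0, eps=1e-8, steps=200):
    g_sq = 0.0
    for _ in range(steps):
        g = grad(w)
        g_sq += g ** 2                        # accumulate squared gradients
        w -= lr * g / (np.sqrt(g_sq) + eps)   # step shrinks as g_sq grows
    return w

w_gd = plain_gd(0.0)
w_mom = momentum_gd(0.0)
w_ada = adagrad(0.0)
print(w_gd, w_mom, w_ada)
```

All three reach the minimum at w = 3 on this simple loss; the differences between them matter more on noisy, high-dimensional losses where gradients vary in scale and direction.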




17. What is sklearn.linear_model ?

    The sklearn.linear_model module in scikit-learn provides a suite of linear models for both regression and classification tasks, such as LinearRegression, Ridge, Lasso, and LogisticRegression. These models are foundational in machine learning due to their simplicity, interpretability, and efficiency.

18. What does model.fit() do? What arguments must be given?

      In scikit-learn, the model.fit() method trains a model on a given dataset. It adjusts the model's internal parameters to learn patterns from the data, enabling predictions on new, unseen data.

      For supervised models, two arguments are required: X, the feature matrix of shape (n_samples, n_features), and y, the target values. Unsupervised estimators such as KMeans need only X.

        from sklearn.linear_model import LinearRegression

        # Sample training data
        X_train = [[1], [2], [3], [4], [5]]
        y_train = [1, 2, 3, 4, 5]

        # Initialize the model
        model = LinearRegression()

        # Fit the model to the training data
        model.fit(X_train, y_train)


19. What does model.predict() do? What arguments must be given?

    In scikit-learn, the model.predict() method makes predictions on new, unseen data after a model has been trained with fit(). The only required argument is X, the feature matrix for the samples to predict, with the same number of features that was used in fit(). It returns one prediction per row.


        from sklearn.datasets import load_iris
        from sklearn.neighbors import KNeighborsClassifier

        # Load dataset
        iris = load_iris()
        X, y = iris.data, iris.target

        # Train model
        model = KNeighborsClassifier()
        model.fit(X, y)

        # Predict class labels for new data
        new_data = [[5.1, 3.5, 1.4, 0.2], [6.0, 3.0, 4.7, 1.5]]
        predictions = model.predict(new_data)
        print(predictions)


20. What are continuous and categorical variables?

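      Continuous variables: Numeric values that can take any value within a range (e.g., height or temperature) and support arithmetic operations.

      Categorical variables: Values that fall into a fixed set of categories, either nominal (no order, e.g., color) or ordinal (ordered, e.g., low/medium/high). The gaps between categories have no numeric meaning.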

21. What is feature scaling? How does it help in Machine Learning?

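    Feature scaling is a data preprocessing technique that brings feature values onto a similar range so that no feature dominates simply because of its units. It helps gradient-based models converge faster and keeps distance-based models (such as k-NN, k-means, and SVM) from being skewed toward features with large values.

    Feature scaling transforms numeric features to a similar scale using techniques such as

    1. Standardization: rescales to mean 0 and standard deviation 1
    2. Normalization: rescales to a fixed range, usually [0, 1]
    3. Unit vector: rescales a vector to length 1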
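
The three techniques reduce to short formulas. The NumPy sketch below applies each to a single made-up feature vector `x`; scikit-learn's scalers do the same arithmetic column by column (or row by row for unit-vector scaling).

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardization (z-score): subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()

# Normalization (min-max): map the smallest value to 0 and the largest to 1
normalized = (x - x.min()) / (x.max() - x.min())

# Unit vector: divide by the Euclidean norm so the vector has length 1
unit_vector = x / np.linalg.norm(x)

print(standardized)
print(normalized)
print(unit_vector)
```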


22. How do we perform scaling in Python?

          import pandas as pd
          from sklearn.model_selection import train_test_split
          # StandardScaler and RobustScaler follow the same fit/transform pattern
          from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

          # Small example dataset
          data = {'height': [170, 180, 165, 175], 'weight': [70, 80, 65, 75]}
          df = pd.DataFrame(data)
          X = df.values

          X_train, X_test = train_test_split(X, test_size=0.5, random_state=1)

          # Fit the scaler on the training data only, then reuse it on the test data
          scaler = MinMaxScaler()
          X_train_scaled = scaler.fit_transform(X_train)
          X_test_scaled = scaler.transform(X_test)

          print("Train scaled:\n", X_train_scaled)
          print("Test scaled:\n", X_test_scaled)


23. What is sklearn.preprocessing?

      sklearn.preprocessing is a powerful module in scikit-learn that provides a wide variety of transformers to preprocess your data before training a machine learning model.

24. How do we split data for model fitting (training and testing) in Python?
    1. Import the function
    2. Decide your data arrays
    3. Use train_test_split


      import pandas as pd
      from sklearn.model_selection import train_test_split

      # 'data.csv' and the 'target' column are placeholder names
      df = pd.read_csv('data.csv')
      X = df.drop('target', axis=1)   # features
      y = df['target']                # label

      X_train, X_test, y_train, y_test = train_test_split(
          X, y,
          test_size=0.25,
          random_state=104,
          shuffle=True
      )

      print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)



25. Explain data encoding?

      Data encoding refers to the process of converting categorical data (like strings or text labels) into numeric values that machine learning models can process.

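      Types:
      1. Nominal (one-hot) encoding: one binary column per category, for categories with no order
      2. Label and ordinal encoding: one integer per category; in the ordinal case the integers follow the category order
      3. Target-guided ordinal encoding: categories are ranked by the mean of the target within each category, and the rank becomes the code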
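
Target-guided ordinal encoding is the least familiar of the three, so here is a minimal pandas sketch. The `city` and `price` columns are hypothetical, and in practice the mapping should be computed on training data only to avoid leakage.

```python
import pandas as pd

# Hypothetical training data
train = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B", "C"],
    "price": [100, 300, 120, 200, 320, 210],
})

# Mean target per category, sorted from lowest to highest
means = train.groupby("city")["price"].mean().sort_values()

# Rank categories by mean target: the lowest mean gets code 0
mapping = {category: rank for rank, category in enumerate(means.index)}
train["city_encoded"] = train["city"].map(mapping)

print(mapping)
print(train)
```

The resulting integers carry information about the target, which can help tree-based and linear models, at the cost of a higher risk of overfitting on rare categories.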