# Feature Engineering

# Assignment Questions

1. What is a parameter?
  - A parameter is a model configuration internal to the algorithm, learned from the training data (e.g., coefficients in linear regression or weights in neural networks).
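
As a minimal sketch of what "learned from the training data" looks like in practice, the snippet below fits a `LinearRegression` on a toy dataset and reads back the learned parameters. The data values are invented for illustration (they follow y = 3x + 2), so the learned slope and intercept should land close to 3 and 2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 3x + 2 (illustrative values, not from the source)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X.ravel() + 2.0

model = LinearRegression()
model.fit(X, y)

# Parameters are whatever the algorithm learned during fit():
# one coefficient per feature plus an intercept
print(model.coef_)       # learned slope
print(model.intercept_)  # learned intercept
```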

2. What is correlation?  What does negative correlation mean?
  - Correlation measures the strength and direction of the linear relationship between two variables. The correlation coefficient ranges from -1 to 1.

  - Negative correlation means that as one variable increases, the other variable tends to decrease. The relationship moves in opposite directions. For example, as the price of a product increases, the demand for it typically decreases, showing a negative correlation.
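
The price and demand example can be sketched numerically. The figures below are hypothetical and chosen only so that demand falls as price rises; the resulting coefficient should come out close to -1.

```python
import numpy as np

# Hypothetical price/demand figures, invented for illustration
price = np.array([10, 12, 14, 16, 18, 20])
demand = np.array([100, 92, 85, 70, 66, 50])

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the price-demand correlation
r = np.corrcoef(price, demand)[0, 1]
print(round(r, 3))  # negative value: demand tends to fall as price rises
```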

3. Define Machine Learning. What are the main components in Machine Learning?
  - Machine Learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed for every task.

  - Main components:
    - Dataset
    - Model/Algorithm
    - Features
    - Loss function
    - Optimizer
    - Evaluation metrics


4. How does loss value help in determining whether the model is good or not?
  - The loss value measures how far the model's predictions are from the actual values. On the same data and with the same loss function, a lower loss indicates a better fit:

    - High loss = poor model performance (large prediction errors)
    - Low loss = good model performance (small prediction errors)
    - Loss helps in comparing different models and tracking improvement during training.
    - A low training loss alone is not enough: if the loss on held-out data is much higher, the model is overfitting.
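
To make the comparison concrete, here is a small sketch that uses mean squared error as the loss. The two sets of predictions are hypothetical stand-ins for two models; any other loss function could be swapped in.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])

# Hypothetical predictions from two models (illustrative values)
pred_a = np.array([2.9, 5.2, 6.8, 9.1])    # close to the targets
pred_b = np.array([1.0, 8.0, 4.0, 12.0])   # far from the targets

loss_a = mean_squared_error(y_true, pred_a)
loss_b = mean_squared_error(y_true, pred_b)

# The model with the lower loss makes smaller errors on this data
print(loss_a, loss_b)
```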

5. What are continuous and categorical variables?
  - Continuous variables: Numerical values that can take any value within a range (e.g., height, temperature, salary)
  - Categorical variables: Variables that represent discrete categories or groups (e.g., gender, color, brand names)

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
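  - Categorical values must be converted to numbers before most models can use them. Common techniques:
    - Label Encoding: assigns each category an integer label
    - One-Hot Encoding: creates one binary column per category
    - Ordinal Encoding: assigns integers that preserve the rank order of the categories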
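
A minimal sketch of label and one-hot encoding on a made-up `color` column. Note that scikit-learn's `LabelEncoder` is designed for target labels; it is used on a feature here only to show the integer mapping, and `pd.get_dummies` is one of several ways to one-hot encode.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up example column
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: classes are sorted alphabetically, then numbered from 0
label_encoder = LabelEncoder()
df["color_label"] = label_encoder.fit_transform(df["color"])

# One-hot encoding: one binary column per distinct category
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df)
print(one_hot)
```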

7. What do you mean by training and testing a dataset?
  - Training: Using a portion of the data to teach the model patterns and relationships.
  - Testing: Using a separate portion of the data to evaluate how well the model performs on unseen data.
  - This split helps assess whether the model can generalize to new data rather than just memorizing the training data.

8. What is sklearn.preprocessing?
  - `sklearn.preprocessing` is a module in scikit-learn that provides tools for data preprocessing, including:

    - Feature scaling (StandardScaler, MinMaxScaler)
    - Encoding categorical variables (LabelEncoder, OneHotEncoder)
    - Data transformation and normalization
    - Missing-data handling is closely related, but in current scikit-learn versions the imputers live in `sklearn.impute` (e.g., SimpleImputer)

9. What is a Test set?
  - A test set is a portion of the data (typically 20-30%) that is kept separate from training and used only for final model evaluation. It provides an unbiased assessment of model performance on completely unseen data.

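10. How do we split data for model fitting (training and testing) in Python?
How do you approach a Machine Learning problem?
  - To split the data, use `train_test_split()` from `sklearn.model_selection`. Here `X` and `y` are assumed to be the feature matrix and target already loaded:

    ```python
    from sklearn.model_selection import train_test_split

    # Hold out 20% of the rows for testing; random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    ```

  - A typical approach to a Machine Learning problem, in order:
    - Define the problem
    - Collect and clean the data
    - EDA (explore the data)
    - Feature engineering
    - Train/test split
    - Model selection
    - Train the model
    - Evaluate and improve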
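
The workflow steps can be sketched end to end on a small dataset. This is one possible arrangement, not a prescribed recipe: it assumes the Iris dataset bundled with scikit-learn, a `StandardScaler` as the only preprocessing step, and `LogisticRegression` as the chosen model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collect data: a small, clean dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)

# Train/test split: hold out 20% of the rows, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model selection: scaling followed by a linear classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train the model
model.fit(X_train, y_train)

# Evaluate on data the model has not seen
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```

Wrapping the scaler and the model in one pipeline means the scaler is fitted on the training rows only, which avoids leaking test-set statistics into training.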

11.  Why do we have to perform EDA before fitting a model to the data?
  - EDA helps to:
    - Understand data distribution and quality
    - Identify missing values and outliers
    - Discover patterns and relationships
    - Select relevant features
    - Choose appropriate preprocessing techniques
    - Identify potential data issues early
    - Make informed decisions about model selection
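
A few of these checks can be sketched with pandas. The DataFrame below is made up for illustration (one missing age, one extreme salary), and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value and one extreme salary
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "salary": [40000, 52000, 61000, 58000, 1000000],
})

# Distribution summary: count, mean, std, min, quartiles, max
print(df.describe())

# Missing values per column
missing = df.isna().sum()
print(missing)

# Flag outliers with the 1.5 * IQR rule (one common convention)
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]
print(outliers)
```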

12. What is correlation?
  - Correlation measures the strength and direction of the linear relationship between two variables. The correlation coefficient ranges from -1 to 1.

13. What does negative correlation mean?
  - Negative correlation means that as one variable increases, the other variable tends to decrease. The relationship moves in opposite directions. For example, as the price of a product increases, the demand for it typically decreases, showing a negative correlation.

14. How can you find correlation between variables in Python?
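  - Use pandas, NumPy, SciPy, or a seaborn heatmap. The snippet assumes `df` is a DataFrame with numeric columns and `x`, `y` are equal-length numeric arrays:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import pearsonr

# pandas: pairwise Pearson correlation between all columns
correlation_matrix = df.corr()

# NumPy: corrcoef returns a 2x2 matrix; [0, 1] is the x-y correlation
correlation = np.corrcoef(x, y)[0, 1]

# SciPy: Pearson coefficient together with its p-value
correlation, p_value = pearsonr(x, y)

# seaborn: show the correlation matrix as an annotated heatmap
sns.heatmap(correlation_matrix, annot=True)
plt.show()
```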

15. What is causation? Explain difference between correlation and causation with an example.
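  - Causation means that a change in one variable directly produces a change in another variable.
  - Difference:
    - Correlation: A statistical association between variables. It does not explain why they move together.
    - Causation: One variable actually causes changes in another.
  - Example: Ice cream sales and drowning incidents both rise in summer, so they are correlated, but eating ice cream does not cause drowning. Hot weather drives both. By contrast, pressing a car's accelerator causes the car to speed up, which is causation.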
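
A small simulation can make the distinction concrete. In this sketch a hypothetical third variable (temperature) drives two quantities that never influence each other, yet the two still end up strongly correlated. All coefficients and noise levels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden common cause: daily temperature (hypothetical values)
temperature = rng.uniform(10, 35, size=500)

# Both quantities depend on temperature, but not on each other
ice_cream_sales = 20 * temperature + rng.normal(0, 40, size=500)
swimming_accidents = 0.5 * temperature + rng.normal(0, 2, size=500)

# Strong correlation appears even though neither variable causes the other
r = np.corrcoef(ice_cream_sales, swimming_accidents)[0, 1]
print(round(r, 2))
```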

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
  - An optimizer is an algorithm that adjusts model parameters to minimize the loss function.
  - Types:

    - SGD (Stochastic Gradient Descent): Updates parameters using the gradient from a single sample (or a small mini-batch) at a time
    - Adam: Adaptive learning rate optimizer that combines momentum with RMSprop-style gradient scaling
    - RMSprop: Adapts the learning rate using a moving average of recent squared gradients
    - Adagrad: Adapts the learning rate using the accumulated sum of all past squared gradients, so the effective step size shrinks over time
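
As an illustration of the simplest optimizer in the list, here is a hand-written sketch of SGD fitting a one-feature linear model with squared-error loss. The data, learning rate, and epoch count are assumptions chosen so the example converges; real projects would normally use a library optimizer instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise-free toy data from y = 2x + 1 (illustrative)
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 1.0

w, b = 0.0, 0.0   # parameters to learn
lr = 0.1          # learning rate (assumed value)

for epoch in range(50):
    # Visit the samples in a fresh random order each epoch
    for i in rng.permutation(len(X)):
        error = (w * X[i] + b) - y[i]
        # Gradient of the squared error for this single sample
        w -= lr * 2 * error * X[i]
        b -= lr * 2 * error

print(w, b)  # should approach the true values 2 and 1
```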

17. What is sklearn.linear_model ?
  - `sklearn.linear_model` is a module containing linear models for regression and classification:

    - LinearRegression
    - LogisticRegression
    - Ridge and Lasso regression
    - ElasticNet

18. What does model.fit() do? What arguments must be given?
  - `model.fit()` trains the model on the training data. For supervised models the required arguments are:
    - X: Training features (input data)
    - y: Training targets (output data)

    ```python
    model.fit(X_train, y_train)
    ```

19. What does model.predict() do? What arguments must be given?
  - `model.predict()` makes predictions on new data using the fitted model. Required argument:
    - X: Features for prediction

    ```python
    model.predict(X_test)
    ```
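
Putting `fit()` and `predict()` together, a minimal sketch with a made-up one-feature dataset might look like this. `LogisticRegression` is used only as an example estimator; the same two calls apply to other supervised scikit-learn models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: small values belong to class 0, large values to class 1
X_train = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)        # learn parameters from features and targets

# New, unseen inputs: predict() needs only the features
X_new = np.array([[1.5], [9.5]])
predictions = model.predict(X_new)
print(predictions)
```
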
20. What are continuous and categorical variables?
  - Continuous variables: Numerical values that can take any value within a range (e.g., height, temperature, salary)
  - Categorical variables: Variables that represent discrete categories or groups (e.g., gender, color, brand names)

21. What is feature scaling? How does it help in Machine Learning?
  - Feature scaling is a data preprocessing technique that transforms the numerical features of a dataset to a similar scale.
  - This helps prevent features with larger values from disproportionately influencing the model, leading to improved performance and faster convergence, especially in algorithms sensitive to feature magnitudes.

22. How do we perform scaling in Python?
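  - Use the scalers in `sklearn.preprocessing`. Fit the scaler on the training data only, then apply the same transformation to the test data. The snippet assumes `X_train` and `X_test` come from an earlier train/test split:

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standard scaling: zero mean and unit variance per feature
standard_scaler = StandardScaler()
X_train_std = standard_scaler.fit_transform(X_train)
X_test_std = standard_scaler.transform(X_test)  # reuse the training statistics

# Min-max scaling: rescale each feature to the [0, 1] range
minmax_scaler = MinMaxScaler()
X_train_mm = minmax_scaler.fit_transform(X_train)
X_test_mm = minmax_scaler.transform(X_test)  # reuse the training min and max
```
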
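
To show what the two scalers compute, the sketch below reproduces them by hand on a tiny made-up column and checks the result against scikit-learn. It assumes the default settings: `StandardScaler` uses the population standard deviation, and `MinMaxScaler` maps to the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Tiny made-up feature column
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standard scaling by hand: subtract the mean, divide by the standard deviation
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X)

# Min-max scaling by hand: shift by the minimum, divide by the range
mm_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
mm_sklearn = MinMaxScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))
```
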
23. What is sklearn.preprocessing?
  - `sklearn.preprocessing` is a module in scikit-learn that provides tools for data preprocessing, including:

    - Feature scaling (StandardScaler, MinMaxScaler)
    - Encoding categorical variables (LabelEncoder, OneHotEncoder)
    - Data transformation and normalization
    - Missing-data handling is closely related, but in current scikit-learn versions the imputers live in `sklearn.impute` (e.g., SimpleImputer)

24. How do we split data for model fitting (training and testing) in Python?
  - Using `train_test_split()` from `sklearn.model_selection`:

    ```python
    from sklearn.model_selection import train_test_split

    # Hold out 20% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    ```


25. Explain data encoding?
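  - Data encoding converts categorical variables to numeric form so that models can use them.

  - Methods:
    - Label Encoding: Assigns each category an integer label
    - One-Hot Encoding: Creates one binary column per category
    - Ordinal Encoding: Assigns integers that follow the rank order of the categories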
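
For ranked categories, a minimal sketch with scikit-learn's `OrdinalEncoder` is shown below. The `small < medium < large` ordering is an assumption made for this example; passing it through `categories` keeps the encoder from falling back to alphabetical order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up size column; OrdinalEncoder expects a 2D array (rows x features)
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# State the rank order explicitly: small -> 0, medium -> 1, large -> 2
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded = encoder.fit_transform(sizes)

print(encoded.ravel())
```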