# Feature Engineering

1. What is a parameter?
  - A parameter in machine learning is an internal variable of a model that is learned and updated automatically during training to optimize the model's performance on the given data. Parameters include weights and biases in neural networks, or coefficients in regression, which the algorithm adjusts while learning from data. These values define how input features are mapped to the output, directly impacting the predictions made by the model.
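As a small illustration of learned parameters, the sketch below fits a straight line to made-up data and reads back the slope (weight) and intercept (bias). The data and variable names are assumptions chosen for illustration; `np.polyfit` stands in for any training procedure.

```python
import numpy as np

# Illustrative data generated from y = 2x + 1 (an assumption, not from the notes)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Least-squares fit of a degree-1 polynomial; returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

print(f"learned weight (slope): {slope:.2f}")
print(f"learned bias (intercept): {intercept:.2f}")
```

The two printed values are the model's parameters: nobody set them by hand; they were estimated from the data.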

2. What is correlation? What does negative correlation mean?
  - Correlation is a statistical measure that describes the relationship between two variables, indicating how changes in one variable are associated with changes in another. When two variables tend to increase or decrease together, they are positively correlated; when one tends to rise as the other falls, they are negatively correlated. The strength and direction of a linear relationship are measured by the correlation coefficient, which ranges from -1 to +1.

  Negative correlation means that as one variable increases, the other decreases, and vice versa. For example, if the amount of rainfall increases, the number of sunny days decreases, showing a negative correlation. Negative correlation is represented by a correlation coefficient less than zero, with -1 indicating a perfect negative relationship.

3. Define Machine Learning. What are the main components in Machine Learning?
  - Machine Learning is a branch of artificial intelligence that focuses on developing algorithms that allow computers to learn patterns and make predictions or decisions from data without explicit programming for each task.

   Main Components in Machine Learning
    
    a) Data: The foundational information used to train and evaluate models; it includes features (input variables) and labels (target outcomes).
    b) Algorithms: Mathematical procedures or models that extract patterns from data and produce outputs or predictions.
    c) Model Training: The process where an algorithm learns from the training data by adjusting its internal parameters to optimize performance.
    d) Testing & Evaluation: Assessing the model’s accuracy and generalization using new or unseen data following training.
    e) Inference/Prediction: Applying the trained model to make decisions or predictions on real-world data.

  Each of these components works together to enable a machine learning system to learn from data and generate useful results.

4. How does loss value help in determining whether the model is good or not?
  - The loss value is a quantitative measure of how much a model's predictions deviate from the actual target values, directly reflecting how well—or poorly—the model is performing.

  Role of Loss Value in Model Evaluation

    a) A low loss value means the model’s predictions are close to the true values, indicating good performance.
    b) A high loss value signals that predictions are inaccurate, which means the model needs improvement or further tuning.
    c) During training, minimizing the loss helps the model learn and improve by adjusting its internal parameters to achieve better accuracy.
    d) The loss function can be used to compare the effectiveness of different models or algorithms on the same task.
  Thus, by monitoring the loss value, one can determine whether a model is performing well and when it is sufficiently trained.
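To make the idea concrete, here is a minimal sketch that computes mean squared error (one common loss function) for two hypothetical sets of predictions. The numbers and the names `preds_model_a` and `preds_model_b` are invented for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0, 9.0]
preds_model_a = [2.9, 5.1, 7.2, 8.8]   # close to the targets
preds_model_b = [1.0, 8.0, 4.0, 12.0]  # far from the targets

loss_a = mean_squared_error(y_true, preds_model_a)
loss_b = mean_squared_error(y_true, preds_model_b)
print(f"model A loss: {loss_a:.3f}, model B loss: {loss_b:.3f}")
```

Model A has the lower loss, so its predictions sit closer to the targets. A low loss on training data alone does not guarantee good generalization, so the comparison is most meaningful on held-out data.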

5. What are continuous and categorical variables?
  - Continuous variables are numerical variables that can take an infinite number of values within a given range, often representing measurements. Examples include height, weight, temperature, and time.

  Categorical variables are variables that represent distinct groups or categories without any numerical meaning, often used to classify data into groups. Examples include gender, hair color, city of residence, or types of fruit.

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
  - Categorical variables in machine learning need to be converted into numerical representations because most algorithms require numerical input to learn patterns effectively.

  Common Techniques to Handle Categorical Variables

    a) Label Encoding: Assigns each unique category an integer value. It is simple and works well for ordinal categories where a natural order exists (e.g., low=0, medium=1, high=2). However, it can mislead algorithms if applied to nominal data without true order.
    b) One-Hot Encoding: Converts each category into binary columns representing presence (1) or absence (0). This is suitable for nominal variables without any intrinsic order but can increase feature dimensionality if many categories exist.
    c) Frequency Encoding: Replaces categories with their frequency counts in the dataset, useful for high-cardinality features.
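    d) Effect Encoding (Deviation Encoding): Similar to one-hot encoding with one column dropped, but the reference category is coded as -1 in every column instead of 0. In linear models, the coefficients are then interpreted relative to the overall mean rather than to a reference category.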
    e) Dropping Categorical Variables: Sometimes categorical features without predictive value may be dropped to simplify the model.

  Proper encoding improves model performance and ensures the algorithms interpret categorical data meaningfully.
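Two of the techniques above can be sketched with pandas alone. The `city` column and its values are made up for illustration; `pd.get_dummies` is used here as one possible way to one-hot encode.

```python
import pandas as pd

# Hypothetical nominal feature
df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi"]})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Frequency encoding: replace each category with its count in the column
counts = df["city"].value_counts()
df["city_freq"] = df["city"].map(counts)

print(pd.concat([df, one_hot], axis=1))
```

One-hot encoding adds one column per category, while frequency encoding keeps a single numeric column, which is why the latter is attractive for high-cardinality features.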

7. What do you mean by training and testing a dataset?
  - Training a dataset in machine learning means using a portion of the data to teach the model to learn patterns and relationships. During training, the model adjusts its parameters based on this data to minimize errors and improve its predictions.

  Testing a dataset refers to using a separate portion of data that the model has never seen before to evaluate how well it has learned and how accurately it can predict new or unseen data. Testing helps check the model's generalization ability and detect overfitting.

8. What is sklearn.preprocessing?
  - The sklearn.preprocessing module in scikit-learn provides various utility functions and classes to transform raw feature data into a form that is more suitable for machine learning models.

  Key Features of sklearn.preprocessing

    a) It includes methods for scaling (standardization, normalization) and centering of data.
    b) Supports encoding of categorical features into numerical formats, such as one-hot encoding and label encoding.
    c) Provides transformation techniques like binarization, power transforms to make data more Gaussian-like, and quantile transformation.
    d) These preprocessing steps help improve model performance by making input data consistent, scaled, and better distributed for learning algorithms.

  In summary, sklearn.preprocessing is essential for preparing and transforming raw data before training models.

9. What is a Test set?
  - A Test set in machine learning is a portion of the dataset that is kept separate and not used during the training of the model. It serves as an independent data sample to evaluate the trained model's performance on unseen data.

10.  How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?
   - In Python, data is commonly split into training and testing sets using the train_test_split() function from the scikit-learn library. This function randomly divides the dataset into subsets typically with about 70-80% for training and 20-30% for testing, helping evaluate model performance on unseen data.

   Approach to a Machine Learning Problem

    a) Understand the Problem: Clearly define the objective in non-technical terms and determine the type of problem (classification, regression, clustering).
    b) Explore and Prepare Data: Collect, clean, and preprocess the data, including handling missing values, outliers, and feature engineering.
    c) Split Data: Divide into training and testing (and optionally validation) datasets.
    d) Select Algorithms: Choose appropriate models based on problem type, data size, and complexity.
    e) Train Models: Fit models on training data, tuning hyperparameters as needed.
    f) Evaluate Models: Assess performance using relevant metrics on testing data to check generalization.
    g) Fine-tune and Validate: Iterate with cross-validation and parameter tuning to improve accuracy.
    h) Deploy: Use the model in real-world scenarios once performance is satisfactory.
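The steps above can be sketched end to end on a small built-in dataset. This is only one possible arrangement: the choice of the iris dataset, logistic regression, and a 75/25 split are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load data: features X and class labels y (a classification problem)
X, y = load_iris(return_X_y=True)

# Split: hold out 25% for testing, keeping class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Select and train: scaling plus a linear classifier in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on data the model has not seen
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in a pipeline means the scaling statistics are learned from the training split only, which keeps the test evaluation honest.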

11. Why do we have to perform EDA before fitting a model to the data?
   - Performing Exploratory Data Analysis (EDA) before fitting a model is essential because it helps to deeply understand the dataset and its characteristics. EDA reveals patterns, trends, and anomalies such as outliers or missing values that can significantly affect model performance if left unexamined.

   Importance of EDA before Model Fitting

    a) Understanding Data Structure: EDA helps to comprehend how many features exist, their types, distributions, and interactions, aiding in better model design.
    b) Detecting Errors and Outliers: It identifies unusual data points that can skew the learning process or bias results, allowing corrective actions before training.
    c) Feature Selection and Engineering: Insights gained guide the selection of relevant features and motivate transformations or creation of new features, improving model accuracy.
    d) Choosing the Right Model: Understanding relationships between variables helps in selecting suitable algorithms and tuning parameters effectively.
    e) Ensuring Data Quality: EDA uncovers missing values, duplicates, and inconsistencies that must be addressed to prevent issues during training.

   In summary, EDA acts as a foundational step to prepare and refine data, ensuring informed decisions in machine learning modeling and ultimately yielding better, reliable predictive performance.
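A few of these checks can be sketched with pandas. The tiny DataFrame below is invented for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value, a duplicate row, and an extreme age
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32, 120],
    "city": ["Pune", "Delhi", "Delhi", None, "Delhi", "Pune"],
})

print(df.dtypes)        # feature types
print(df.describe())    # summary statistics for numeric columns

missing_per_column = df.isna().sum()          # missing values per column
duplicate_rows = int(df.duplicated().sum())   # fully repeated rows

# Flag outliers with the 1.5 * IQR rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(outliers)
```

Here the check flags the age of 120 as an outlier and reports one duplicate row, both of which would be worth investigating before any model is fitted.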

12. What is correlation?
   - Correlation is a statistical measure that describes the relationship between two variables, indicating how changes in one variable are associated with changes in another. When two variables tend to increase or decrease together, they are positively correlated; when one tends to rise as the other falls, they are negatively correlated. The strength and direction of a linear relationship are measured by the correlation coefficient, which ranges from -1 to +1.

13. What does negative correlation mean?
   -  Negative correlation means that as one variable increases, the other decreases, and vice versa. For example, if the amount of rainfall increases, the number of sunny days decreases, showing a negative correlation. Negative correlation is represented by a correlation coefficient less than zero, with -1 indicating a perfect negative relationship.

14. How can you find correlation between variables in Python?
   - Correlation between variables in Python can be found using several libraries, mainly pandas, NumPy, and SciPy.

   Using pandas:
   The .corr() method on a pandas DataFrame computes the pairwise correlation of columns, excluding NA/null values. The default method is Pearson correlation, but Kendall and Spearman can be selected through the method parameter.
   Example:

    import pandas as pd
    df = pd.DataFrame({'x': [1,2,3,4], 'y': [5,6,7,8]})
    correlation_matrix = df.corr(method='pearson')
    print(correlation_matrix)

   Using NumPy:
   NumPy's np.corrcoef() function calculates the Pearson correlation coefficient matrix between two arrays.
   Example:

    import numpy as np
    x = np.array([1, 2, 3, 4])
    y = np.array([5, 6, 7, 8])
    corr = np.corrcoef(x, y)
    print(corr)

   Using SciPy:
   For more statistical detail, SciPy's pearsonr(), spearmanr(), and kendalltau() functions provide correlation coefficients along with p-values.
   Example:
    
    from scipy.stats import pearsonr
    x = [1, 2, 3, 4]
    y = [5, 6, 7, 8]
    corr, p_value = pearsonr(x, y)
    print(corr, p_value)

15. What is causation? Explain difference between correlation and causation with an example.
   - Causation means that one event or variable directly causes a change in another. It describes a cause-and-effect relationship where the occurrence of one event is responsible for producing an effect in another.

   Difference Between Correlation and Causation:

    a) Correlation is when two variables move together in some pattern, but one does not necessarily cause the other. It indicates a statistical association, but not a direct cause.
    b) Causation implies that changes in one variable directly bring about changes in another, establishing a cause-effect link.

   Example:

    a) Correlation example: Ice cream sales and number of sunburn cases increase together during summer. These two variables are correlated because both rise at the same time, but buying ice cream does not cause sunburn; instead, both are influenced by a third factor—sunny weather.
    b) Causation example: Heavy rainfall causes river water levels to rise. Here, rainfall directly causes an increase in river levels demonstrating causation.

  In summary, correlation means two variables are related, while causation means one variable actually causes changes in the other. Understanding this distinction is crucial to avoid erroneous conclusions in data analysis.
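The ice cream and sunburn example can be simulated. In this sketch, temperature drives both quantities and neither affects the other; all coefficients and noise levels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated confounder: daily temperature
temperature = rng.uniform(15, 35, size=500)

# Both variables depend on temperature, but not on each other
ice_cream_sales = 10 * temperature + rng.normal(0, 20, size=500)
sunburn_cases = 2 * temperature + rng.normal(0, 5, size=500)

raw_corr = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]

def residuals(target, driver):
    """Part of target left over after removing a linear fit on driver."""
    slope, intercept = np.polyfit(driver, target, deg=1)
    return target - (slope * driver + intercept)

# Correlation after removing the effect of temperature from both variables
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(sunburn_cases, temperature),
)[0, 1]

print(f"raw correlation: {raw_corr:.2f}")
print(f"correlation after controlling for temperature: {partial_corr:.2f}")
```

The raw correlation is strong, yet it largely disappears once temperature is accounted for, which is the signature of a shared cause rather than a direct causal link.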

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
   - An Optimizer in machine learning is an algorithm used to adjust the model's parameters (like weights) iteratively to minimize the loss function, thereby improving the model’s accuracy and performance.

    Different Types of Optimizers with Examples:   
    a) Gradient Descent (GD):
       Updates model parameters using the gradient of the loss function computed across the entire dataset.
       Example: Works well for small datasets but is computationally expensive for large ones, since every update requires a full pass over the data.

    b) Stochastic Gradient Descent (SGD):
       Updates parameters using gradients computed from one sample at a time.
       Faster and less memory-intensive, works well for large datasets.
       Example: Common in deep learning for faster convergence on large data.
  
    c) Adagrad (Adaptive Gradient):
       Adjusts learning rate individually for each parameter based on historical gradient information.
       Useful for sparse data.
       Example: Effective where features have different frequencies.

    d) RMSprop (Root Mean Square Propagation):
       Similar to Adagrad but uses a moving average of squared gradients to normalize learning rates, preventing rapid decay.
       Example: Popular in training recurrent neural networks.

    e) Adam (Adaptive Moment Estimation):
       Combines momentum and RMSprop; keeps moving averages of both the gradients and their squared values.
       Adjusts learning rate adaptively and includes bias correction.
       Example: Widely used due to fast convergence and robustness in deep learning.

   Each optimizer has strengths and is suited to different model types and data situations. Choosing the right optimizer helps improve training efficiency and model accuracy.
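As a rough sketch of how two of these update rules differ, the code below minimizes a toy one-parameter loss, (w - 3)^2, with plain gradient descent and with Adam. The loss function, learning rates, and step counts are assumptions chosen for illustration; real training would use a library optimizer.

```python
def grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

# Plain gradient descent: step against the gradient with a fixed learning rate
w_gd = 0.0
for _ in range(200):
    w_gd -= 0.1 * grad(w_gd)

# Adam: moving averages of the gradient (m) and squared gradient (v),
# each corrected for the bias introduced by starting at zero
w_adam, m, v = 0.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w_adam -= lr * m_hat / (v_hat ** 0.5 + eps)

print(f"gradient descent: w = {w_gd:.4f}")
print(f"Adam:             w = {w_adam:.4f}")
```

Both runs should end near w = 3. Stochastic gradient descent would use the same update as the first loop, with the gradient estimated from a single sample or a small batch instead of the full dataset.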

17. What is sklearn.linear_model ?
   - The sklearn.linear_model module in scikit-learn provides a variety of linear models for regression and classification tasks where the target is assumed to be a linear combination of the input features.

    Key Points about sklearn.linear_model:
     a) It includes popular algorithms such as LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet, and more for fitting linear models with different regularization and regression strategies.
     b) For example, LinearRegression fits a linear model by minimizing the residual sum of squares between observed targets and predicted outputs.
     c) The module supports both dense and sparse input data and offers flexibility in handling intercept terms and parameters.
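A minimal sketch comparing two estimators from this module on made-up data. The data follows y = 2x + 1 exactly, and `alpha=10.0` is an arbitrary regularization strength chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Illustrative data: y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Ordinary least squares: minimizes the residual sum of squares
ols = LinearRegression().fit(X, y)

# Ridge: same objective plus an L2 penalty on the coefficients
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS   coef:", ols.coef_, "intercept:", ols.intercept_)
print("Ridge coef:", ridge.coef_, "intercept:", ridge.intercept_)
```

The ridge coefficient comes out smaller than the least-squares one because the penalty shrinks coefficients toward zero.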


18. What does model.fit() do? What arguments must be given?
   - The model.fit() method in scikit-learn is used to train a machine learning model by learning from the provided data. It takes the input features and the corresponding target values and adjusts the internal parameters of the model based on this data to minimize errors and optimize performance.

    What model.fit() does:
     a) Takes the feature matrix X and target vector y.
     b) Performs data validation and checks.
     c) Runs the training process by optimizing model parameters (e.g., coefficients in linear regression).
     d) Stores the learned parameters within the model object for future use in prediction.

    Required Arguments:
     a) X: Feature matrix (2D array-like), where rows represent samples and columns represent features.
     b) y: Target vector (1D array-like) containing the true labels or values corresponding to each sample in X. This is required for supervised learning tasks.
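A short sketch of a `fit()` call, using `LinearRegression` as an example estimator and invented data with two features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# X has shape (n_samples, n_features) = (4, 2); y has shape (4,)
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([2.0, 5.0, 6.0, 9.0])  # generated as 2*x0 + 1*x1

model = LinearRegression()
fitted = model.fit(X, y)  # fit() returns the estimator itself

# Learned parameters are stored in attributes ending with an underscore
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

Because `fit()` returns the estimator, calls can be chained, for example `LinearRegression().fit(X, y)`.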


19. What does model.predict() do? What arguments must be given?
   - The model.predict() method in scikit-learn is used to make predictions on new or unseen data using a trained machine learning model. After the model has been trained with fit(), predict() takes input feature data and outputs the predicted target values or class labels.

    What model.predict() does:
     a) Receives new data points (features) as input.
     b) Uses the learned parameters from training to estimate the corresponding output or class.
     c) Returns an array of predictions for each input sample.

    Required Arguments:
     X: A 2D array-like structure containing the input features for the data points to predict. The shape should match the features used during training (number of columns/features should be the same).
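A short sketch of `predict()` on new samples, again with `LinearRegression` and invented data. The second call deliberately passes the wrong number of features to show the resulting error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Train on one feature: y = 2x
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X_train, y_train)

# New samples must be 2D with the same number of columns as X_train
X_new = np.array([[4.0], [5.0]])
predictions = model.predict(X_new)
print(predictions)

# Passing two features to a model trained on one raises ValueError
try:
    model.predict(np.array([[4.0, 1.0]]))
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print("feature mismatch:", mismatch_error)
```

`predict()` returns one value per input row, so two input samples give an array of length two.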


20. What are continuous and categorical variables?
   - Continuous variables are numerical variables that can take any value within a specific range or interval and represent measurable quantities. Examples include height, weight, temperature, and time, where measurements can have decimals or fractions and values can vary continuously.

   Categorical variables are variables that represent distinct groups or categories without numeric meaning. These variables classify data into categories such as gender, color, or city, which can be nominal (no order) or ordinal (with order). Examples are hair color, pizza topping types, or education levels.


21. What is feature scaling? How does it help in Machine Learning?
   - Feature scaling is a data preprocessing technique in machine learning where numerical features are transformed to a common scale or range. This ensures that all features contribute equally to the model's learning process rather than being dominated by features with larger magnitudes.

    How Feature Scaling Helps in Machine Learning:
     a) Improves Algorithm Performance: Algorithms like gradient descent converge faster when features are scaled because it prevents one feature from disproportionately influencing the model updates.
     b) Optimizes Distance-Based Methods: Methods such as k-Nearest Neighbors (k-NN), K-Means clustering, and Support Vector Machines depend on distance calculations, which can be biased if features have different scales. Scaling balances all feature contributions.
     c) Prevents Numerical Instability: Large scale disparities can cause computational issues like overflow or underflow; scaling keeps calculations stable.
     d) Ensures Equal Feature Importance: It prevents features with bigger ranges from dominating smaller-range features, helping models treat all features fairly.
     e) Makes Model Training More Efficient: Especially for algorithms using gradient-based optimization, scaling reduces training time and yields better results.

  Common methods of feature scaling include Min-Max Scaling (normalizes data to the range [0, 1]), Standardization (z-score normalization), and Robust Scaling (based on percentiles to handle outliers).
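The three methods can be written out by hand with NumPy to show the arithmetic. The sample values are invented, and the robust variant shown here (median and interquartile range) is one common definition.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: map the smallest value to 0 and the largest to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the standard deviation
z_score = (x - x.mean()) / x.std()

# Robust scaling: subtract the median, divide by the interquartile range
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print("min-max:", min_max)
print("z-score:", z_score)
print("robust: ", robust)
```

After min-max scaling the values span 0 to 1; after standardization they have mean 0 and standard deviation 1.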


22. How do we perform scaling in Python?
   - Feature scaling in Python is commonly performed using the sklearn.preprocessing module from scikit-learn, which provides various scalers such as StandardScaler and MinMaxScaler for standardization and normalization respectively.

    How to Perform Feature Scaling in Python:
     a) Using StandardScaler (Z-score Standardization): Transforms data to have zero mean and unit variance.
     b) Using MinMaxScaler (Normalization): Scales features to a given range, typically [0, 1].
     c) Workflow:
        Instantiate the scaler object.
        Use fit() on training data to compute necessary statistics (mean, std, min, max).
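        Use transform() to scale the data, reusing the statistics learned from the training set for the test set.
        fit_transform() combines both steps for the training data; call only transform() on test data so that test statistics do not leak into training.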
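The workflow above can be sketched as follows. The arrays are invented for illustration; the important detail is that each scaler is fitted on the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test = np.array([[2.5, 250.0], [5.0, 500.0]])

# Standardization: statistics come from the training data
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)   # reuse training mean and std

# Min-max scaling to the default range [0, 1]
mm_scaler = MinMaxScaler()
X_train_mm = mm_scaler.fit_transform(X_train)
X_test_mm = mm_scaler.transform(X_test)     # reuse training min and max

print(X_train_std)
print(X_test_mm)
```

Test values outside the training range, such as 5.0 here, map outside [0, 1] after min-max scaling. That is expected, since the scaler only knows the training minimum and maximum.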


23. What is sklearn.preprocessing?
   - The sklearn.preprocessing module in scikit-learn provides a collection of utility functions and transformer classes to preprocess and transform raw feature data into a format more suitable for machine learning models.

    Key Features:
     a) Methods for scaling (e.g., standardization, normalization) to bring features to comparable ranges.
     b) Methods to binarize or threshold data.
     c) Encoding categorical features into numeric form (e.g., one-hot encoding, label encoding).
     d) Generating polynomial features and interaction terms for more complex relationships.
     e) Power transforms to make data more Gaussian-like.
     f) Outlier-robust scaling (e.g., RobustScaler) and feature generation utilities. Missing-value imputation is provided by the separate sklearn.impute module.
     g) Seamless integration with scikit-learn pipelines via the Transformer API for consistent training and test-time transformations.

   This module simplifies data preprocessing tasks which are essential for improving model performance and speeding up convergence during training.


24. How do we split data for model fitting (training and testing) in Python?
   - In Python, the common way to split data for model fitting into training and testing sets is by using the train_test_split() function from the sklearn.model_selection module.

   How to Use train_test_split:

   a) Import the function:

    from sklearn.model_selection import train_test_split

   b) Separate your dataset into features (X) and target/labels (y).

   c) Split the data: test_size=0.25 means 25% of data is reserved for testing, 75% for training. random_state ensures the split is reproducible.

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

   d) Use X_train, y_train to train your model and X_test, y_test to evaluate it.
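Putting the steps together on a small synthetic dataset. The arrays below are placeholders, and the `stratify=y` argument is an optional addition that keeps class proportions similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples, 2 features, balanced binary labels
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("train shape:", X_train.shape)
print("test shape:", X_test.shape)
```

With 20 samples and `test_size=0.25`, 5 samples go to the test set and 15 to the training set.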


25. Explain data encoding?
   - Data encoding in machine learning refers to the process of converting categorical variables or other data types into numerical representations that can be understood and processed by machine learning algorithms.

    Why Encoding is Needed:
     a) Most machine learning algorithms operate on numerical data and cannot directly handle non-numeric data like text labels.
     b) Encoding transforms these categorical variables into numeric format, enabling algorithms to learn patterns and make predictions effectively.

    Common Encoding Techniques:
     a) Label Encoding: Assigns each category a unique integer value. Suitable for ordinal data (with an inherent order), e.g., education levels "bachelor=0", "master=1", "PhD=2".
     b) One-Hot Encoding: Creates binary/dummy variables for each category, marking presence as 1 and absence as 0. Suitable for nominal data with no order, e.g., colors "red", "green", "blue" each becoming separate columns.
     c) Ordinal Encoding: Converts categories into integers respecting their order, used for ordered categories.
     d) Mean/Target Encoding: Encodes categories based on the mean of the target variable for each category, useful for high-cardinality variables.

   
      

   Example:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    data = {'Region': ['North', 'West', 'East', 'South'],
            'Product': ['Apples', 'Oranges', 'Bananas', 'Apples']}
    df = pd.DataFrame(data)

    # Label Encoding for 'Region' (for illustration only: categories are numbered
    # alphabetically, which implies an order that regions do not have)
    le = LabelEncoder()
    df['Region_encoded'] = le.fit_transform(df['Region'])

    # One-Hot Encoding for 'Product' (sparse_output requires scikit-learn >= 1.2)
    ohe = OneHotEncoder(sparse_output=False)
    product_encoded = ohe.fit_transform(df[['Product']])
    df_ohe = pd.DataFrame(product_encoded, columns=ohe.get_feature_names_out(['Product']))

    # Attach the one-hot columns to the original frame
    df = pd.concat([df, df_ohe], axis=1)
    print(df)

   Output:

      Region  Product  Region_encoded  Product_Apples  Product_Bananas  Product_Oranges
    0  North   Apples               1             1.0              0.0              0.0
    1   West  Oranges               3             0.0              0.0              1.0
    2   East  Bananas               0             0.0              1.0              0.0
    3  South   Apples               2             1.0              0.0              0.0
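The remaining two techniques, ordinal and mean/target encoding, can be sketched in a similar way. The columns and values are invented for illustration, and the target encoding shown is the simplest possible version, without smoothing.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data with an ordered feature, a nominal feature, and a binary target
df = pd.DataFrame({
    "education": ["bachelor", "PhD", "master", "bachelor"],
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "target": [1, 0, 1, 1],
})

# Ordinal encoding: pass the category order explicitly so the integers respect it
ordinal = OrdinalEncoder(categories=[["bachelor", "master", "PhD"]])
df["education_encoded"] = ordinal.fit_transform(df[["education"]]).ravel()

# Mean/target encoding: replace each city with the mean target for that city
city_means = df.groupby("city")["target"].mean()
df["city_target_encoded"] = df["city"].map(city_means)

print(df)
```

In practice the target means should be computed from the training split only, otherwise information about the test labels leaks into the features.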
