# Feature Engineering

1. What is a parameter?
 - A parameter is a numerical value that defines certain characteristics of a model or function. For instance, in a linear regression model, parameters include the coefficients of the variables and the intercept. Parameters are adjusted during training to minimize error and improve model predictions.
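 To make this concrete, here is a minimal sketch that fits a linear regression and reads back its learned parameters. The toy data (generated from y = 2x + 1) is an assumption made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from y = 2x + 1 (no noise).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

# The learned parameters: one coefficient per feature, plus the intercept.
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

 Because the data is noise-free, the fitted slope and intercept should recover the values used to generate it.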

2. What is correlation? What does negative correlation mean?
 - Correlation measures the statistical relationship between two variables. A negative correlation indicates that as one variable increases, the other tends to decrease. For example, hours of exercise and body weight are often negatively correlated.

3. Define Machine Learning. What are the main components in Machine Learning?
 - Machine Learning is the field of study that allows computers to learn from data and make predictions or decisions without being explicitly programmed. Its main components include:

 * Data: Input for training.

 * Model: A representation (e.g., linear regression).

 * Algorithms: Methods for training the model.

 * Loss Function: Measures prediction error.

 * Optimization: Updates parameters to reduce loss.

4. How does loss value help in determining whether the model is good or not?
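 - The loss value quantifies the error between the model's predictions and the actual values, so a lower loss indicates better predictions on the data it was computed on. For example, Mean Squared Error (MSE) is the average squared difference between predicted and actual values. To judge whether a model is good, compare the loss on training data with the loss on held-out data: a low training loss paired with a much higher held-out loss suggests overfitting.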
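 As a small illustration, the sketch below computes MSE for two sets of predictions against the same targets. All numbers are made up for the example; the manual calculation is shown next to scikit-learn's mean_squared_error to make the formula visible.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and two competing sets of predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_a = np.array([2.5, 5.0, 7.5, 9.0])   # close to the targets
y_pred_b = np.array([1.0, 6.0, 10.0, 7.0])  # further from the targets

# MSE by hand: average of the squared differences.
mse_manual = np.mean((y_true - y_pred_a) ** 2)

# The same metric via scikit-learn.
mse_a = mean_squared_error(y_true, y_pred_a)
mse_b = mean_squared_error(y_true, y_pred_b)

print(f"manual={mse_manual}, model A={mse_a}, model B={mse_b}")
```

 The predictions with the lower MSE (model A here) are the ones closer to the actual values.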

5. What are continuous and categorical variables?


 - Continuous Variables: These take numeric values within a range (e.g., height, weight).

 - Categorical Variables: These represent discrete categories or labels (e.g., gender: Male or Female).

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
 - Categorical variables are encoded into numerical forms using techniques such as:

     * One-Hot Encoding: Converts each category into its own binary column.

     * Label Encoding: Assigns an integer label to each category.
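 The two techniques can be sketched as follows. The `color` column is a hypothetical example; pandas' get_dummies is used for one-hot encoding and scikit-learn's LabelEncoder for integer labels. Note that scikit-learn documents LabelEncoder as intended for target labels, so for input features OrdinalEncoder may be the better fit.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: categories are sorted, then mapped to 0, 1, 2, ...
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(df["color"])

print(one_hot)
print(list(label_encoder.classes_), list(labels))
```

 Label encoding imposes an artificial order (blue < green < red here), which is why one-hot encoding is usually preferred for nominal features in linear or distance-based models.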

7. What do you mean by training and testing a dataset?
 - Training means fitting a model on a dataset so that it learns the patterns and relationships in that data. Testing means evaluating the trained model on unseen data to check its accuracy and how well it generalizes.

8. What is sklearn.preprocessing?
 - sklearn.preprocessing provides functions for data preprocessing tasks in Python, such as normalization, scaling, and encoding. For example, StandardScaler standardizes numerical data to have a mean of 0 and a standard deviation of 1.

9. What is a Test set?
 - A Test set is a subset of data that is held out from training and used only to evaluate the trained model. It shows how well the model performs on unseen data, which gives an honest estimate of accuracy and helps detect overfitting.
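 One way to see why a test set matters is to compare training and test scores for a model that can memorize its training data. The sketch below is an illustration under assumptions: the synthetic noisy sine data and the unrestricted DecisionTreeRegressor are choices made for this example, not part of the original notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: a sine curve plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A tree with no depth limit can fit the training points (and their noise) exactly.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)  # R^2 on data the model has seen
test_r2 = tree.score(X_test, y_test)     # R^2 on held-out data

print(f"train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```

 A large gap between the two scores is the signal of overfitting that only the held-out data can reveal.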

10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?
 - Data is split using scikit-learn's train_test_split function, as in the code cell that follows this answer.

     The Machine Learning approach includes:

     * Understanding the problem.

     * Preprocessing data.

     * Choosing a model.

     * Training and testing.

     * Evaluating results.

In [None]:
'''
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
'''

11. Why do we have to perform EDA before fitting a model to the data?
 - Exploratory Data Analysis (EDA) is a critical step before fitting a model to the data because it helps uncover patterns, anomalies, and relationships in the dataset. Here's why EDA is essential:

     * Data Cleaning: EDA helps identify missing values, outliers, or inconsistencies in the dataset. Cleaning these issues ensures the model isn’t negatively impacted by noise or errors.

     * Understanding the Data: It provides insights into the dataset's structure, distributions, and variable types, helping you select appropriate algorithms and preprocessing techniques.

     * Feature Selection: EDA highlights which features are relevant for prediction and which ones may be redundant or irrelevant, improving model performance.

     * Detecting Relationships: It allows you to identify correlations or dependencies between variables, which can inform feature engineering or model selection.

     * Avoiding Bias: By exploring the data, you can spot biases or imbalances (e.g., class imbalances) that could skew the model's predictions, enabling you to address them beforehand.

     Performing EDA helps ensure that the data is ready for modeling and that the model yields meaningful and reliable results.
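 A few of these checks can be sketched with pandas. The small DataFrame below is hypothetical (including the deliberately suspicious age of 120 and the imbalanced `churned` column); real EDA would usually add plots as well.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values, an outlier, and class imbalance.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 120],
    "income": [40000, 52000, 61000, np.nan, 48000, 50000],
    "churned": ["no", "no", "yes", "no", "no", "no"],
})

# Data cleaning: count missing values per column.
missing_per_column = df.isna().sum()

# Understanding the data: summary statistics expose ranges and outliers.
summary = df.describe()

# Avoiding bias: check how balanced the target classes are.
class_counts = df["churned"].value_counts()

# Detecting relationships: pairwise correlation of numeric columns.
numeric_corr = df[["age", "income"]].corr()

print(missing_per_column, summary, class_counts, numeric_corr, sep="\n\n")
```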

12. What is correlation?
 - Correlation is a statistical measure that indicates the strength and direction of the linear relationship between two variables. It ranges from -1 to 1:

     * +1: Perfect positive correlation (variables increase together).

     * 0: No linear correlation (the variables may still be related in a nonlinear way).

     * -1: Perfect negative correlation (one variable increases while the other decreases).

     For example, there is a positive correlation between the amount of time spent studying and exam scores.

13. What does negative correlation mean?
 - Negative correlation refers to an inverse relationship between two variables: as one increases, the other tends to decrease. It is represented by correlation values from -1 up to, but not including, 0. For example, more daily exercise time is often associated with lower body weight, which is a negative correlation.

14. How can you find correlation between variables in Python?
 - In Python, correlation between variables can be computed using the corr() method in pandas or the pearsonr() function from the scipy library. The pandas example in the code cell below outputs the correlation matrix for the variables in the DataFrame.

In [None]:
'''
import pandas as pd

# Example DataFrame
data = {'X': [1, 2, 3, 4, 5], 'Y': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data)

# Compute correlation
correlation = df.corr()
print(correlation)
'''
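 For completeness, here is a minimal sketch of the scipy route, reusing the same X and Y values as the pandas example. pearsonr returns the correlation coefficient together with a p-value for the null hypothesis of no correlation.

```python
from scipy.stats import pearsonr

# Same values as the pandas example: Y falls as X rises.
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]

# r is the Pearson correlation coefficient; p_value tests for no correlation.
r, p_value = pearsonr(x, y)
print(f"r={r:.3f}, p={p_value:.3g}")
```

 pandas' corr() also computes the Pearson coefficient by default, so both approaches should agree on r for this data.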

15. What is causation? Explain the difference between correlation and causation with an example.
 - Causation means that one variable directly causes a change in another. It implies a cause-and-effect relationship, unlike correlation, which only indicates a statistical relationship.

     * Example of Correlation: Ice cream sales and drowning incidents are correlated because both increase in summer.

     * Example of Causation: Turning on a light switch causes the light to turn on.

     Correlation does not imply causation, as correlations can occur due to coincidence or third variables.
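 The ice cream example can be simulated to show how a third variable produces a correlation without causation. Everything in this sketch is an assumption made for illustration: the synthetic temperature data, the coefficients, and the `residuals` helper, which removes the linear effect of temperature before correlating what is left.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical confounder: daily temperature drives both quantities.
temperature = rng.uniform(10, 35, size=365)
ice_cream_sales = 20 * temperature + rng.normal(scale=50, size=365)
drownings = 0.3 * temperature + rng.normal(scale=1.5, size=365)

# Raw correlation between the two outcomes (neither causes the other).
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def residuals(values, driver):
    """Return what is left of `values` after a linear fit on `driver`."""
    slope, intercept = np.polyfit(driver, values, deg=1)
    return values - (slope * driver + intercept)


# Correlation after controlling for temperature (a partial correlation).
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw={raw_corr:.2f}, controlling for temperature={partial_corr:.2f}")
```

 In this simulation the raw correlation is strong, while the correlation after accounting for temperature is close to zero, which is what one would expect when a third variable explains the relationship.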

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
 - An optimizer is an algorithm used in Machine Learning to minimize the loss function by adjusting model parameters like weights and biases. Common optimizers include:

     * Gradient Descent: Updates parameters in the direction of the steepest descent of the loss function.

     * Adam (Adaptive Moment Estimation): Combines the advantages of RMSProp and Momentum, making it efficient and widely used.

     * Stochastic Gradient Descent (SGD): Uses random subsets of data for updates, improving speed for large datasets.

     Example of using Adam optimizer in Python:

In [None]:
'''
from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=0.001)
'''
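 To show what an optimizer does under the hood, here is a minimal sketch of plain (full-batch) gradient descent written with NumPy. It fits a line to hypothetical data generated from y = 2x + 1 by minimizing MSE; the learning rate and iteration count are assumptions picked for this toy problem.

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0

# Parameters to learn, starting from zero.
w, b = 0.0, 0.0
learning_rate = 0.05

for _ in range(2000):
    error = (w * X + b) - y            # prediction error per sample
    grad_w = 2.0 * np.mean(error * X)  # d(MSE)/dw
    grad_b = 2.0 * np.mean(error)      # d(MSE)/db
    # Step against the gradient, i.e. in the direction of steepest descent.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w={w:.4f}, b={b:.4f}")
```

 SGD would follow the same update rule but compute the gradients on a random sample or mini-batch at each step, and Adam would additionally adapt the step size per parameter.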

17. What is sklearn.linear_model?
 - sklearn.linear_model is a module in the scikit-learn library that provides tools for implementing linear models in Python, such as linear regression, logistic regression, and ridge regression. For example, LinearRegression() can be used to fit a linear model to a dataset:

In [None]:
'''
from sklearn.linear_model import LinearRegression

# Create model
model = LinearRegression()
'''

18. What does model.fit() do? What arguments must be given?
 - The model.fit() method trains a Machine Learning model on the provided data by learning the parameters that best fit it.

     Arguments:

     * X: Features (input data).

     * y: Target (output labels). This is required for supervised models.

     Example:

In [None]:
'''
model.fit(X_train, y_train)
'''

19. What does model.predict() do? What arguments must be given?
 - The model.predict() method is used to make predictions on new, unseen data after the model has been trained.

     Arguments:

     * X: Features of the new data.

     Example:

In [None]:
'''
predictions = model.predict(X_test)
'''
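 Putting the previous answers together, the sketch below runs the whole sequence of splitting, fitting, and predicting in one runnable piece. The synthetic two-feature dataset is an assumption; the short snippets above leave X and y undefined, so this fills them in for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: y depends linearly on two features, plus small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)          # learn coefficients from the training set
predictions = model.predict(X_test)  # one prediction per test row
r2 = model.score(X_test, y_test)     # R^2 on the held-out data

print(predictions[:3], f"R^2={r2:.4f}")
```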

20. What are continuous and categorical variables?

 - Continuous Variables: Take numeric values within a range (e.g., height, temperature).

 - Categorical Variables: Represent discrete categories or labels (e.g., gender: male/female, colors: red/blue).

21. What is feature scaling? How does it help in Machine Learning?

 - Feature scaling is a technique to standardize the range of independent variables so that they contribute equally to the model. Without scaling, models like SVM or KNN may give undue importance to features with larger ranges. Common techniques include normalization (scaling to [0,1]) and standardization (mean=0, standard deviation=1).
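 The difference between the two techniques can be seen on a small array whose columns have very different ranges. The values are hypothetical; MinMaxScaler and StandardScaler are the scikit-learn implementations of normalization and standardization.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales.
X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
    [4.0, 4000.0],
])

# Normalization: each column is rescaled to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each column gets mean 0 and standard deviation 1.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```

 After either transformation the two columns are on comparable scales, so a distance-based model such as KNN no longer sees the second feature as dominant.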

22. How do we perform scaling in Python?
 - In Python, scaling can be performed using scikit-learn's StandardScaler or MinMaxScaler.

 Example: the code cell below uses StandardScaler, which scales the features to have a mean of 0 and a standard deviation of 1.

In [None]:
'''
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
'''

23. What is sklearn.preprocessing?
 - sklearn.preprocessing is a module in scikit-learn that provides tools for preprocessing data. It includes functions for scaling (StandardScaler), normalizing (Normalizer), encoding categorical variables (OneHotEncoder), and more. Preprocessing ensures raw data is transformed into a format suitable for Machine Learning models.

24. How do we split data for model fitting (training and testing) in Python?
 - Data splitting is done using the train_test_split function from scikit-learn.

 Example: in the code cell below, test_size=0.2 means 80% of the data is used for training and 20% for testing, and random_state=42 makes the split reproducible.

In [None]:
'''
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
'''

25. Explain data encoding.

 - Data encoding is the process of converting categorical variables into numerical formats for Machine Learning. Common techniques include:

     * One-Hot Encoding: Converts categories into binary columns.

     * Label Encoding: Assigns integers to categories.

     For example:

In [None]:
'''
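from sklearn.preprocessing import OneHotEncoder

# categorical_data is assumed to be a 2D array-like of shape (n_samples, n_features)
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(categorical_data)  # returns a sparse matrix by default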
'''
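 A runnable version of this idea is sketched below. The `categorical_data` values are made up, and handle_unknown="ignore" is an optional setting chosen here so that categories not seen during fitting are encoded as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical feature, shaped (n_samples, 1).
categorical_data = np.array([["red"], ["blue"], ["green"], ["blue"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoded_data = encoder.fit_transform(categorical_data)

# The result is sparse by default; convert to a dense array to inspect it.
dense = encoded_data.toarray()

# A category that was not present during fitting becomes a row of zeros.
unseen = encoder.transform(np.array([["purple"]])).toarray()

print(encoder.categories_)
print(dense)
print(unseen)
```

 The columns follow the sorted category order stored in encoder.categories_ (blue, green, red for this data).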