# Feature Engineering

1. What is a parameter?
- In machine learning, a parameter is a variable that the model learns from the data during the training process. These are the internal settings of the model that are adjusted to make better predictions.
- This is different from a hyperparameter (like learning rate or the number of trees in a random forest), which is set by the developer before training begins.
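The distinction can be sketched with a small Ridge regression. This is a minimal illustration on made-up data, not a recipe: `alpha` stands in for a hyperparameter we pick ourselves, while `coef_` and `intercept_` are the parameters the model learns during `fit`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data generated from y = 3*x + 2 (values chosen only for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X.ravel() + 2.0

# alpha is a hyperparameter: we choose it before training starts
model = Ridge(alpha=0.001)
model.fit(X, y)

# coef_ and intercept_ are parameters: the model learned them from the data
print("learned slope:", model.coef_[0])
print("learned intercept:", model.intercept_)
```

With such a small `alpha`, the learned slope and intercept land very close to the 3 and 2 used to generate the data.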

2. What is correlation?
 What does negative correlation mean?
 - Correlation is a statistical measure that describes the strength and direction of a linear relationship between two variables. It is typically measured by a correlation coefficient, which ranges from -1 to +1.
    - +1: Perfect positive correlation.

    - 0: No linear correlation (the variables may still be related in a non-linear way).

    - -1: Perfect negative correlation.

 - A negative correlation (a value between 0 and -1) means there is an inverse relationship between two variables. As one variable increases, the other variable tends to decrease.
 - Example: The amount of time spent studying and the number of mistakes made on a test. As study time (variable A) increases, the number of mistakes (variable B) tends to decrease.
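The study-time example can be checked numerically. The sketch below uses invented numbers for six hypothetical students and computes the Pearson correlation coefficient with NumPy.

```python
import numpy as np

# Hypothetical data: hours studied and mistakes made by six students
study_hours = np.array([1, 2, 3, 4, 5, 6])
mistakes = np.array([12, 10, 9, 6, 4, 2])

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is the
# coefficient between the first and second variable
r = np.corrcoef(study_hours, mistakes)[0, 1]
print(f"Pearson r = {r:.3f}")
```

Because mistakes fall almost in a straight line as study hours rise, the coefficient comes out close to -1.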

3. Define Machine Learning. What are the main components in Machine Learning?
 - Machine Learning (ML) is a subfield of artificial intelligence (AI) where computer systems are given the ability to "learn" from data—identifying patterns and making decisions—without being explicitly programmed for every task.

- The main components of a typical ML system are:
  - Data: The input (features and labels) used to train the model and evaluate its performance.
  - Model: The algorithm or architecture (e.g., Linear Regression, Neural Network) that processes the data and makes a prediction.
  - Loss Function (or Cost Function): A method for measuring the model's error. It quantifies how "bad" the model's prediction was compared to the actual answer.
  - Optimizer: The mechanism that "tunes" the model's parameters (see Q1) to minimize the loss function. It's how the model actually learns.

4. How does loss value help in determining whether the model is good or not?
- The loss value is the primary measure of a model's error. It directly tells you how poorly the model is performing on the data.

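- A high loss value means the model's predictions are far from the true values on that data, so the model fits poorly.

- A low loss value means the model's predictions are close to the true values. A low loss on the training data alone is not enough, though: a model counts as good only if its loss is also low on data it has not seen during training.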

- During training, the goal is to use the optimizer to adjust the model's parameters until the loss value is as low as possible.
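To make "high" and "low" loss concrete, here is a minimal sketch using mean squared error, one common loss for regression. The function name and the numbers are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
close_preds = [2.9, 5.1, 7.2]   # predictions near the true values
far_preds = [6.0, 1.0, 10.0]    # predictions far from the true values

loss_close = mean_squared_error(y_true, close_preds)
loss_far = mean_squared_error(y_true, far_preds)
print("loss for close predictions:", loss_close)
print("loss for far predictions:", loss_far)
```

The second set of predictions produces a much larger loss, which is the signal the optimizer uses to decide how to adjust the parameters.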

5. What are continuous and categorical variables?
- Continuous Variables: These are numerical variables that can take on any value within a given range. They are measurable.

  - Examples: Temperature (30.5°C), height (172.3 cm), price ($54.99).

- Categorical Variables: These are variables that represent distinct groups or labels. They are qualitative and have a finite (or fixed) number of possible values.

  - Examples: 'Color' (Red, Green, Blue), 'City' (New York, London, Tokyo), 'Yes/No'.

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
 - Machine learning models are mathematical, so they require numerical input. We cannot feed them text labels like 'Red' or 'Cat'. We must convert these categorical variables into numbers using encoding.


- The two most common techniques are:

  - Label Encoding: This assigns a unique integer to each category.

    - Example: 'Red' = 0, 'Green' = 1, 'Blue' = 2.

    - When to use: Best for ordinal data, where the categories have a natural order (e.g., 'Small' < 'Medium' < 'Large'). It's risky for nominal data (like colors) because the model might incorrectly learn that 'Blue' (2) is "greater than" 'Green' (1).

  - One-Hot Encoding: This creates new binary (0 or 1) columns for each category.

    - Example: For a 'Color' feature:

    - 'Red' becomes [1, 0, 0]

    - 'Green' becomes [0, 1, 0]

    - 'Blue' becomes [0, 0, 1]

    - When to use: This is the safest and most common method for nominal data (where there is no order), as it avoids creating a false ranking.
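Both techniques can be sketched with pandas. The DataFrame and column names below are hypothetical. Note that Scikit-learn's `LabelEncoder` is documented for encoding target labels; for input features, an explicit mapping (as here) or `OrdinalEncoder` is the usual route, and `OneHotEncoder` is the Scikit-learn counterpart of `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical dataset with one ordinal and one nominal feature
df = pd.DataFrame({
    "size": ["Small", "Large", "Medium", "Small"],
    "color": ["Red", "Green", "Blue", "Red"],
})

# Ordinal feature: map each category to an integer that respects its order
size_order = {"Small": 0, "Medium": 1, "Large": 2}
df["size_encoded"] = df["size"].map(size_order)

# Nominal feature: one-hot encode so no artificial ranking is introduced
color_dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

encoded = pd.concat([df[["size_encoded"]], color_dummies], axis=1)
print(encoded)
```

The explicit mapping keeps the order under our control, whereas an automatic integer encoder would typically assign codes alphabetically.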

7. What do you mean by training and testing a dataset?
- This refers to splitting our main dataset into two separate parts:

- Training Set (e.g., 80% of the data): This is the "textbook" for the model. The model looks at this data (both the features and the correct answers) to learn the underlying patterns.

- Testing Set (e.g., 20% of the data): This is the "final exam." This data is kept separate and is never shown to the model during training. After the model is trained, we use the testing set to evaluate how well it performs on new, unseen data. This gives an honest measure of the model's generalization.

8. What is sklearn.preprocessing?
- sklearn.preprocessing is a module (a collection of tools) within the Scikit-learn library in Python. It contains essential functions for cleaning, transforming, and preparing data before it is fed into a machine learning model.


- Common tools in this module include:

  - StandardScaler (for feature scaling)

  - MinMaxScaler (for feature scaling)

  - OneHotEncoder (for categorical data)

  - LabelEncoder (for categorical data)

9. What is a Test set?
- The Test set is the portion of the dataset that is held back and not used during the model training process. Its sole purpose is to provide an unbiased evaluation of the final, trained model. By making predictions on the test set and comparing them to the known true answers, you can measure the model's accuracy, precision, or other metrics on data it has never seen before.

10. How do we split data for model fitting (training and testing) in Python?
 How do you approach a Machine Learning problem?
 - We use the train_test_split function from Scikit-learn's model_selection module.

 - Here is the standard Python code:

In [None]:
from sklearn.model_selection import train_test_split

# Assume X is your dataframe of features and y is your target variable (answers)

# Define X (your features) and y (your target variable) here using your dataset
# Example:
# X = your_dataframe[['feature1', 'feature2', ...]]
# y = your_dataframe['target_variable']


# Split the data: 80% for training, 20% for testing
# random_state=42 ensures the split is reproducible (we get the same split every time)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


- A typical machine learning workflow involves several key steps:

- Define the Problem: What are we trying to predict? Is it a classification (e.g., 'Spam' or 'Not Spam') or a regression (e.g., 'House Price') problem?

- Gather Data: Collect the raw data needed for the project.

- Exploratory Data Analysis (EDA): Analyze the data to understand it. This involves finding missing values, identifying outliers, and visualizing relationships between variables.

- Data Preprocessing & Feature Engineering: Clean the data. This includes:

  - Handling missing values (e.g., filling or dropping them).

  - Encoding categorical variables (e.g., One-Hot Encoding).

  - Scaling features (e.g., StandardScaler).

  - Creating new features from existing ones.
- Model Selection: Choose one or more ML algorithms to try (e.g., Linear Regression, Random Forest, SVM).

- Data Splitting: Split the data into training and testing sets.

- Model Training: Fit the chosen model(s) to the training data using model.fit().

- Model Evaluation: Test the model's performance on the testing data using metrics like accuracy, F1-score, or Mean Squared Error.

- Hyperparameter Tuning: Adjust the model's hyperparameters (e.g., learning rate) to find the best possible version of the model.

- Deployment: If the model is good enough, deploy it so it can make predictions on new, real-world data.
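The core of this workflow (preprocess, split, train, evaluate) can be sketched in a few lines. The data here is synthetic and stands in for a real dataset, and the choice of `LogisticRegression` is only an assumption for illustration; EDA, tuning, and deployment are left out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gathered data: two features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Data splitting: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocessing and model chained together, so the scaler is fit
# on the training data only
model = make_pipeline(StandardScaler(), LogisticRegression())

# Model training
model.fit(X_train, y_train)

# Model evaluation on unseen data
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the model in one pipeline is a design choice that keeps the preprocessing tied to the training split automatically.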

11. Why do we have to perform EDA before fitting a model to the data?
- EDA (Exploratory Data Analysis) is like "reading the textbook before taking the test." You must perform EDA to:

- Understand Your Data: See what features you have, what data types they are (categorical/continuous), and how they are distributed.

- Identify Errors: Find missing values, outliers, or incorrect data (e.g., an 'Age' of 500) that need to be cleaned.

- Find Relationships: Check for correlations between variables. This helps in feature selection (deciding which variables are important).

- Guide Modeling: The insights from EDA (e.g., "this data is not linear") help you choose the right type of ML model to use.

- Fitting a model to "dirty" or un-analyzed data will almost always result in a poor, unreliable model.
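A first EDA pass often looks something like the sketch below. The DataFrame is a made-up example containing a missing value and an implausible age, and the threshold of 120 is an arbitrary assumption for flagging suspicious rows.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with typical problems: NaNs and an impossible age
df = pd.DataFrame({
    "age": [25, 32, 47, 500, 38, np.nan],
    "salary": [50000, 64000, 120000, 58000, np.nan, 72000],
    "city": ["London", "Tokyo", "London", "New York", "Tokyo", "London"],
})

# Understand the data: column types and summary statistics
print(df.dtypes)
print(df.describe())

# Identify errors: missing values per column and implausible entries
missing_counts = df.isna().sum()
implausible_ages = df[df["age"] > 120]
print(missing_counts)
print(implausible_ages)

# Find relationships: correlations between the numeric columns
correlations = df[["age", "salary"]].corr()
print(correlations)
```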

12. What is correlation?
- Correlation is a statistical measure that describes the strength and direction of a linear relationship between two variables. It is typically measured by a correlation coefficient, which ranges from -1 to +1.

13. What does negative correlation mean?
- A negative correlation (a value between 0 and -1) means there is an inverse relationship between two variables. As one variable increases, the other variable tends to decrease.

14. How can you find correlation between variables in Python?
- The easiest way is by using the Pandas library. If we have our data in a DataFrame called df, we just call the .corr() method.
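A minimal sketch of `.corr()` in use, with an invented DataFrame. `DataFrame.corr()` computes the Pearson coefficient by default and returns a matrix; `Series.corr()` gives the coefficient for a single pair.

```python
import pandas as pd

# Hypothetical data for six students
df = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5, 6],
    "mistakes": [12, 10, 9, 6, 4, 2],
    "score": [52, 60, 63, 75, 84, 93],
})

# Full correlation matrix between all numeric columns
corr_matrix = df.corr()
print(corr_matrix.round(2))

# Correlation for one specific pair of columns
pair_r = df["study_hours"].corr(df["mistakes"])
print(f"study_hours vs mistakes: {pair_r:.3f}")
```

The diagonal of the matrix is always 1, since every variable is perfectly correlated with itself.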

15. What is causation? Explain difference between correlation and causation with an example.
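- Causation: This means that a change in one variable directly causes a change in another.
- Difference: Correlation does not imply causation. This is one of the most important rules in data analysis. Just because two variables move together (correlation) does not mean one is making the other happen (causation).
- Example: Ice cream sales and sunburn cases both rise in summer, so they are positively correlated, but eating ice cream does not cause sunburn. Hot, sunny weather drives both.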
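A small simulation can show how a hidden common cause produces correlation without causation. Everything here is invented for illustration: temperature drives both ice cream sales and sunburn cases, and neither variable affects the other. Once the effect of temperature is removed from both, the correlation largely disappears.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden common cause (assumed for this sketch): daily temperature
temperature = rng.uniform(10, 35, size=500)

# Both variables depend on temperature, but not on each other
ice_cream_sales = 20 * temperature + rng.normal(0, 50, size=500)
sunburn_cases = 3 * temperature + rng.normal(0, 10, size=500)

# Raw correlation between the two outcome variables
r_raw = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]

def residuals(values, driver):
    """Remove the linear effect of `driver` from `values`."""
    slope, intercept = np.polyfit(driver, values, 1)
    return values - (slope * driver + intercept)

# Correlation after controlling for temperature
r_controlled = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(sunburn_cases, temperature),
)[0, 1]

print(f"raw correlation: {r_raw:.2f}")
print(f"after controlling for temperature: {r_controlled:.2f}")
```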

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
 - An Optimizer is the "engine" of a machine learning model. Its job is to systematically change the model's parameters (like its weights) with the single goal of minimizing the loss function (i.e., making the model's error as small as possible). It's the core of the "learning" process.


- Common Types of Optimizers:

- Gradient Descent (GD) / Batch GD:

  - How it works: It calculates the gradient of the loss over the entire training dataset and then takes one step (updates the parameters) in the direction that reduces the loss.

  - Pro/Con: It's accurate but extremely slow on large datasets because it has to look at every single data point just to make one update.

- Stochastic Gradient Descent (SGD):

  - How it works: It does the opposite. It calculates the gradient from only one data point at a time and updates the parameters immediately.

  - Pro/Con: It's very fast, but the updates are "noisy" and "jumpy" because they are based on just one sample. It can be hard to find the perfect minimum.

- Mini-Batch Gradient Descent:

  - How it works: This is the practical compromise and the most common choice. It calculates the gradient from a small batch (e.g., 32 or 64 data points) at a time and updates the parameters.

  - Pro/Con: It provides a balance: each update is much cheaper than in Batch GD, and the updates are less noisy than in SGD.

- Adam (Adaptive Moment Estimation):

  - How it works: An advanced optimizer that is very popular, especially in deep learning. It's a "smart" optimizer that adapts the learning rate for each parameter individually.

  - Pro/Con: It often converges (finds the minimum loss) much faster and more reliably than other optimizers. It's frequently the default choice.
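The three gradient descent variants differ only in how many samples feed each update, which a single function with a `batch_size` argument can illustrate. This is a from-scratch sketch on synthetic data with hand-picked learning rates, not how any library implements its optimizers. Adam is not shown; it would additionally keep running averages of the gradients to adapt the step size per parameter.

```python
import numpy as np

# Synthetic data from y = 3x + 2 plus a little noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=200)

def train(X, y, batch_size, learning_rate=0.1, epochs=200, seed=0):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    shuffle_rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(y)
    for _ in range(epochs):
        order = shuffle_rng.permutation(n)  # new sample order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            x_batch, y_batch = X[idx, 0], y[idx]
            error = (w * x_batch + b) - y_batch
            # Gradients of the batch's mean squared error
            grad_w = 2.0 * np.mean(error * x_batch)
            grad_b = 2.0 * np.mean(error)
            # Step against the gradient to reduce the loss
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Batch GD: one update per epoch using all samples
w_batch, b_batch = train(X, y, batch_size=len(y))
# Mini-batch GD: one update per 32 samples
w_mini, b_mini = train(X, y, batch_size=32)
# SGD: one update per sample, with a smaller step to tame the noise
w_sgd, b_sgd = train(X, y, batch_size=1, learning_rate=0.01, epochs=50)

print("batch GD:   ", w_batch, b_batch)
print("mini-batch: ", w_mini, b_mini)
print("SGD:        ", w_sgd, b_sgd)
```

All three runs should end near the slope 3 and intercept 2 used to generate the data; they differ in how many updates they make and how noisy each update is.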

17. What is sklearn.linear_model ?
- sklearn.linear_model is a module in the Scikit-learn library that contains its linear models, which make predictions from a weighted sum of the input features (like y = mx + b).

- This module includes:
  - LinearRegression: The standard model for regression (predicting a continuous value like price).
  - LogisticRegression: Used for classification (predicting a category like 'Yes' or 'No'), despite its name.
  - Ridge and Lasso: Advanced types of linear regression that include regularization to prevent overfitting.
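A short sketch of these four classes on made-up data. The `alpha` values are arbitrary assumptions; the point is only that all of them share the same `fit`/`predict` interface and that regularization shrinks the learned slope slightly.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Regression: predict a continuous value (hypothetical prices)
y_price = np.array([110.0, 205.0, 290.0, 410.0, 495.0, 605.0])
slopes = {}
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X, y_price)
    slopes[type(model).__name__] = float(model.coef_[0])
print(slopes)

# Classification: predict a category (0 = 'No', 1 = 'Yes')
y_label = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression()
clf.fit(X, y_label)
predicted_classes = clf.predict(np.array([[1.5], [5.5]]))
print(predicted_classes)
```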

18. What does model.fit() do? What arguments must be given?
- What it does: model.fit() is the command that starts the training process. It tells the model to look at the training data, learn the patterns, and adjust its internal parameters to minimize the loss.

- Arguments: It must be given the training data.

- X: The features (the "inputs" or "questions") of the training set.

- y: The target (the "labels" or "answers") of the training set.

  - Example: model.fit(X_train, y_train)

19. What does model.predict() do? What arguments must be given?
- What it does: model.predict() is used after the model has been trained (using .fit()). It takes new, unseen data (for which you don't know the answer) and uses the model's learned patterns to generate a prediction.


- Arguments: It only needs the features of the new data.

- X: The features of the data you want to predict (e.g., X_test).

  - Example: predictions = model.predict(X_test)
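Putting `fit` and `predict` together, a minimal sketch with invented numbers follows. Note that Scikit-learn expects `X` to be two-dimensional (rows are samples, columns are features), even when there is only one feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data following y = 2x + 3 (made up for illustration)
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([5.0, 7.0, 9.0, 11.0])

# New inputs whose answers the model has not seen
X_new = np.array([[5.0], [10.0]])

model = LinearRegression()
model.fit(X_train, y_train)          # learn parameters from features + targets
predictions = model.predict(X_new)   # only features are needed here
print(predictions)
```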

20. What are continuous and categorical variables?
- These are two primary types of data:

- Continuous Variables: Numerical data that can be measured. They can take on any value within a range (including fractions/decimals).

- Examples: Height, weight, temperature, time.

- Categorical Variables: Data that represents groups, labels, or categories. They have a limited number of possible values.

- Examples: 'Gender' (Male, Female, Other), 'Payment Method' (Credit Card, PayPal, Cash).

21. What is feature scaling? How does it help in Machine Learning?
- What it is: Feature scaling is a preprocessing technique used to standardize the range of features. For example, it can transform all features so they fall between 0 and 1, or so they have a mean of 0 and a standard deviation of 1.

- How it helps: Many ML algorithms are sensitive to the scale of the data.

- Example: Imagine you have two features: 'Age' (range: 20-70) and 'Salary' (range: 50,000-200,000).

- The 'Salary' feature has much larger numerical values. A distance-based algorithm like k-Nearest Neighbors (k-NN) or an SVM will be dominated by the 'Salary' feature, because differences in salary swamp differences in age when distances are computed. It will effectively treat 'Salary' as more important than 'Age' simply because its numbers are bigger.

- Scaling puts all features on a "level playing field" (e.g., both 'Age' and 'Salary' are transformed to a range of 0 to 1), so the algorithm treats them fairly and can find the true patterns.
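The effect on a distance-based method can be sketched by comparing Euclidean distances before and after scaling. The four people below are invented; the fourth row exists only so the salary range matches the 50,000 to 200,000 range used in the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: age, salary (hypothetical people)
people = np.array([
    [20.0, 50_000.0],    # person A
    [70.0, 52_000.0],    # person B: very different age, similar salary
    [21.0, 60_000.0],    # person C: similar age, different salary
    [45.0, 200_000.0],   # person D: sets the upper end of the salary range
])

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Unscaled: salary differences dominate, so A looks closer to B than to C
raw_ab = distance(people[0], people[1])
raw_ac = distance(people[0], people[2])

# Scaled to [0, 1]: age differences now count as much as salary differences
scaled = MinMaxScaler().fit_transform(people)
scaled_ab = distance(scaled[0], scaled[1])
scaled_ac = distance(scaled[0], scaled[2])

print("raw:    A-B", raw_ab, " A-C", raw_ac)
print("scaled: A-B", scaled_ab, " A-C", scaled_ac)
```

Before scaling, the 50-year age gap between A and B barely registers; after scaling, A's nearest neighbor switches from B to C.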

22. How do we perform scaling in Python?
- We use the sklearn.preprocessing module. The two most common methods are StandardScaler and MinMaxScaler.


- Example using StandardScaler (Z-score scaling): This scales the data to have a mean of 0 and a standard deviation of 1.

In [None]:
from sklearn.preprocessing import StandardScaler

# 1. Create the scaler object
scaler = StandardScaler()

# 2. Fit and transform the TRAINING data
#    (It learns the mean/std from X_train and then scales it)
X_train_scaled = scaler.fit_transform(X_train)

# 3. Only TRANSFORM the TESTING data
#    (It uses the mean/std learned from the *training* data to scale the test data)
X_test_scaled = scaler.transform(X_test)

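- Note: It is crucial to call fit_transform on the training set but only transform on the test set. Fitting the scaler on the test set would let test-set statistics (its mean and standard deviation) influence the preprocessing, which is a form of "data leakage" and can make the evaluation overly optimistic.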
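The same fit-on-train, transform-on-test pattern applies to MinMaxScaler. The sketch below uses small invented arrays so the result is easy to inspect.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data with columns: age, salary
X_train = np.array([[20.0, 50_000.0], [45.0, 120_000.0], [70.0, 200_000.0]])
X_test = np.array([[30.0, 80_000.0], [80.0, 210_000.0]])

scaler = MinMaxScaler()

# Learn each column's min and max from the training data, then scale it
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training min and max to scale the test data
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```

Because the test set is scaled with the training set's minimum and maximum, test values outside the training range (such as the age of 80 here) can fall outside 0 to 1. That is expected behavior, not an error.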

23. What is sklearn.preprocessing?
- sklearn.preprocessing is a module in the Scikit-learn (sklearn) library in Python. It provides a suite of tools used for data transformation and cleaning before applying a machine learning model. Its main functions are for feature scaling (e.g., StandardScaler), encoding categorical variables (e.g., OneHotEncoder), and normalizing data.

24. How do we split data for model fitting (training and testing) in Python?
- The standard method is to use the train_test_split function from Scikit-learn's model_selection module.

25. Explain data encoding?
- Data encoding is the process of converting data from one format into another.

- In machine learning, this term specifically refers to converting categorical variables (which are text-based or non-numeric) into a numerical representation so that machine learning algorithms can process them.

- The most common types are:

- Label Encoding: Converts labels into single numbers (e.g., 'Small'=0, 'Medium'=1, 'Large'=2).

- One-Hot Encoding: Converts labels into new binary columns (e.g., 'Red' = [1, 0, 0], 'Green' = [0, 1, 0]).