## Feature Engineering Assignment

1. What is a parameter?
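--> In machine learning, a parameter is an internal variable of a model whose value is learned from the training data, such as the weights and bias in linear regression. More generally, in programming, a parameter is a variable used to pass information into a function, procedure, or query: a placeholder that gets filled in when the function or command is run.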

2. What is correlation? What does negative correlation mean?
--> Correlation is a statistical measure that shows the strength and direction of a relationship between two variables.

A negative correlation means:

As one variable increases, the other decreases.

They move in opposite directions.

3. Define Machine Learning. What are the main components in Machine Learning?
--> Machine Learning (ML) is a branch of artificial intelligence (AI) that focuses on building systems that can learn from data, identify patterns, and make decisions with minimal human intervention.

 Main Components of Machine Learning:

a. Data

The foundation of ML.

Can be structured (like tables) or unstructured (like images, text).

Needs to be cleaned and prepared before training.

b. Model

A mathematical representation that learns from data.

For example: linear regression, decision trees, neural networks.

c. Algorithm

The method used to train the model on data.

Examples: Gradient Descent, k-Means, Random Forest.

d. Features

The input variables used to make predictions.

Example: For predicting house prices, features might include size, location, number of rooms, etc.

e. Label (Target)

The outcome we want the model to predict.

Example: House price, email being spam or not.

f. Training

The process of feeding data to the model so it can learn.

The model adjusts its internal settings (parameters) to improve predictions.

g. Testing (Evaluation)

After training, the model is tested on unseen data to check its accuracy and performance.

h. Prediction

Using the trained model to make decisions or forecasts on new data.

4. How does loss value help in determining whether the model is good or not?
--> The loss value is a number that shows how far off the model's predictions are from the actual results.

It measures the error between the predicted values and the true values (labels).

The lower the loss, the better the model is at making accurate predictions.

 How It Helps:

Training Process: The model uses the loss value to learn. Algorithms (like gradient descent) adjust the model to minimize the loss.

Model Selection: You can compare different models based on their loss values.

Overfitting Detection:

Low training loss but high validation loss = overfitting.

Both low = good generalization.
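
As a minimal sketch of the comparison above, the snippet below computes mean squared error (one common regression loss) on hypothetical training and validation predictions. The arrays and the `mse` helper are made up for illustration only.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical labels and predictions (made-up numbers)
y_train_true = [3.0, 5.0, 7.0, 9.0]
y_train_pred = [3.1, 4.9, 7.0, 9.1]
y_val_true = [4.0, 6.0, 8.0]
y_val_pred = [6.0, 3.5, 10.5]

train_loss = mse(y_train_true, y_train_pred)
val_loss = mse(y_val_true, y_val_pred)

# A large gap between the two losses is a warning sign of overfitting
print(f"train loss = {train_loss:.4f}, validation loss = {val_loss:.4f}")
```

Here the training loss is tiny while the validation loss is much larger, which is the overfitting pattern described above.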

5. What are continuous and categorical variables?
--> a. Continuous Variables

These are numerical variables that can take on any value within a range — including decimals.

 Examples:
Height (e.g., 172.5 cm)

Weight (e.g., 65.2 kg)

Temperature (e.g., 37.8°C)

Income (e.g., $45,300.75)

 Key Features:

Infinite possible values

Often measured, not counted

Can perform mathematical operations (like mean, standard deviation)

b. Categorical Variables

These represent groups or categories — they describe qualities or characteristics.

 Examples:

Gender (Male, Female, Other)

Country (USA, India, Brazil)

Car brand (Toyota, Ford, BMW)

Grade (A, B, C, D)

 Key Features:

Limited set of values

Values are labels, not numbers (even if coded as numbers)

Can be ordinal (ordered) or nominal (unordered)

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
--> Since most machine learning algorithms work with numerical data, we need to convert categorical variables into numbers before feeding them into the model.

 Common Techniques to Handle Categorical Variables:

a.  Label Encoding

Converts each category into a unique number (e.g., Male = 0, Female = 1).

Simple but can introduce unintended order.

b. One-Hot Encoding

Creates binary (0/1) columns for each category.

Prevents the model from thinking one value is “greater” than another.

c. Ordinal Encoding

Assigns ordered numbers to categories.

d. Target Encoding (Mean Encoding)

Replaces a category with the mean of the target variable for that category.

e.  Binary Encoding / Hashing

More advanced techniques to handle high-cardinality data (like zip codes or user IDs).

They use far fewer columns than one-hot encoding while still avoiding a single artificial ordering of the categories.
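
A few of these techniques can be sketched with pandas alone. The DataFrame, column names, and the `size_order` mapping below are hypothetical and exist only to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],      # nominal
    "size": ["small", "large", "medium", "small"],  # ordinal
    "price": [10.0, 20.0, 30.0, 40.0],              # target
})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Ordinal encoding: explicit mapping that preserves the order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# Target (mean) encoding: replace each category with its mean target.
# Computed on the whole toy frame here; in practice it should be
# computed on the training data only to avoid leaking the target.
color_means = df.groupby("color")["price"].mean()
df["color_target_enc"] = df["color"].map(color_means)

print(one_hot)
print(df)
```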

7. What do you mean by training and testing a dataset?
--> a. Training a Dataset

Training is the process where the model learns patterns from the data.

You feed a part of your dataset (called the training set) to the model.

The model tries to learn the relationship between inputs (features) and outputs (labels/targets).

It adjusts its internal parameters to reduce errors (based on loss function).

 Example:

You give a model house size and location → it learns to predict house prices.

b. Testing a Dataset

Testing is the process of checking how well the model performs on unseen data.

After training, you test the model using a different portion of the data (called the test set).

This helps you evaluate how well the model generalizes to new, real-world data.

You calculate performance metrics like accuracy, precision, recall, or RMSE depending on the problem.

8. What is sklearn.preprocessing?
--> sklearn.preprocessing is a module in Scikit-learn (sklearn) — a popular Python library for machine learning — that provides tools for data preprocessing and transformation.

Before feeding data into a machine learning model, it often needs to be cleaned, scaled, or encoded. That’s where sklearn.preprocessing comes in — it helps make your data model-ready.

9. What is a Test set?
--> A test set is a portion of your dataset that you set aside to evaluate the performance of your trained model.

It simulates real-world data — the model hasn't seen this data before.

Helps determine how well your model will generalize to new, unseen data.

Prevents you from fooling yourself — if your model only performs well on training data, it's probably overfitting.

10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?
--> To split data for model fitting (training and testing) in Python, we commonly use the train_test_split function from Scikit-learn (sklearn). This allows us to easily divide our dataset into training and testing sets.

 Steps for Splitting Data:

Import necessary libraries

Load your data

Use train_test_split to divide the data

Fit the model on the training data

Evaluate the model on the testing data
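
The five steps above can be sketched as follows. The data is synthetic (generated from an assumed relationship y ≈ 3x + 2), and the choice of `LinearRegression` and mean squared error is just one reasonable option, not the only one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y is roughly 3*x + 2 plus noise (made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=100)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)        # learn from the training set only

y_pred = model.predict(X_test)     # predict on rows the model has not seen
test_mse = mean_squared_error(y_test, y_pred)
print(f"test MSE = {test_mse:.3f}")
```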

Here’s a typical approach to solving a machine learning problem:

a.  Define the Problem

Understand what you're trying to solve (e.g., classification, regression, clustering).

Clearly define the goal (predict house prices, classify emails as spam, etc.).

b. Collect and Understand Data

Gather data from reliable sources (datasets, APIs, etc.).

Understand the data — explore the features (input variables) and labels (output variable).

Visualize data to identify patterns or outliers (e.g., using matplotlib or seaborn).

c. Preprocess the Data

Clean the data: Handle missing values, remove duplicates.

Feature engineering: Create meaningful new features or select the most relevant ones.

Handle categorical data: Convert categorical variables into numerical values using techniques like one-hot encoding or label encoding.

Scale the data: Standardize or normalize the data to bring features to a common scale.

d. Split the Data into Training and Testing Sets

Use train_test_split to divide the data into training and testing sets, typically a 70–80% split for training and 20–30% for testing.

e. Choose the Right Model

Select a machine learning algorithm based on your problem:

Classification: Logistic Regression, Decision Trees, Random Forests, SVM, etc.

Regression: Linear Regression, Decision Trees, etc.

Clustering: K-Means, DBSCAN, etc.

Choose the model that best fits the nature of the data and the problem you're trying to solve.

f. Train the Model

Fit the model to your training data and allow it to learn the patterns.

g. Evaluate the Model

Use the test set to evaluate the model's performance.

Check metrics like:

Accuracy, Precision, Recall, F1-Score for classification.

MSE (Mean Squared Error), RMSE (Root Mean Squared Error) for regression.

h. Tune Hyperparameters (Optional)

Use techniques like Grid Search or Random Search to fine-tune the model's hyperparameters for better performance.

i. Deploy the Model (Optional)

Once satisfied with the performance, deploy the model for real-world predictions or integrate it into a production environment.
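
As a rough end-to-end sketch of steps c through h, the snippet below chains scaling and a classifier and tunes one hyperparameter. It assumes a synthetic dataset from `make_classification` in place of real data, and the grid of `C` values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real, already cleaned dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Split before any fitting so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling and model chained, so the scaler is fit on training folds only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Small grid over the regularization strength C, with 5-fold cross-validation
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# Evaluate the best model on the held-out test set
accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, f"test accuracy = {accuracy:.2f}")
```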

11. Why do we have to perform EDA before fitting a model to the data?
--> Exploratory Data Analysis (EDA) is the process of analyzing and visualizing the data to understand its main characteristics, identify patterns, and detect anomalies or outliers before applying machine learning algorithms.

Why Perform EDA Before Model Fitting:

a. Understanding the Data Structure

EDA helps you understand what the features (input variables) and labels (output variables) are.

You can check for missing values, data types (numerical, categorical), and whether any feature requires transformation (scaling, encoding, etc.).

b. Detecting Data Quality Issues

Missing values: You can identify if some features have missing values and decide how to handle them (e.g., imputation or removal).

Duplicates: You may find duplicate rows that can affect model performance.

Outliers: Outliers can skew the model’s performance, and detecting them during EDA allows you to either remove or transform them.

Inconsistent data: Values may be inconsistent, like categories being spelled differently or unusual formatting.

c. Feature Relationships

EDA helps you understand the relationships between features and the target variable.

You can use correlation analysis to check which features are strongly related to the target variable, which helps you decide which features to keep.

It can help you identify potential non-linear relationships or multicollinearity (high correlation between features) that could affect certain algorithms like linear regression.

d. Choosing the Right Model

Understanding whether the problem is classification or regression influences model choice.

EDA can reveal the distribution of the target variable — for example, a skewed distribution might require log transformation before applying a model.

You can also decide if you need to balance the dataset (e.g., if you're working with imbalanced classes in classification).

e. Feature Engineering

EDA may uncover the need for new features or the transformation of existing ones.

For example, if a feature has a skewed distribution, you might want to apply log transformation to normalize it.

It also helps identify which features are redundant and could be removed, improving model performance.

f. Improving Model Interpretability

It can give you insights into the data’s structure, helping you better understand and explain the model’s behavior later on.

This is particularly important if you need to make data-driven decisions or explain the model to stakeholders.
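
Several of these checks take only a line or two of pandas. The sketch below uses a small hypothetical housing table that deliberately contains a missing value, a duplicate row, an inconsistent spelling, and an outlier; the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with deliberate quality problems
df = pd.DataFrame({
    "size_sqft": [800, 950, 1200, 1200, 1500, np.nan, 20000],
    "rooms": [2, 2, 3, 3, 4, 3, 5],
    "city": ["Pune", "pune", "Delhi", "Delhi", "Mumbai", "Delhi", "Mumbai"],
    "price": [40, 48, 65, 65, 90, 60, 120],
})

print(df.dtypes)                              # data type of each column
missing_per_column = df.isna().sum()          # missing values per column
duplicate_rows = int(df.duplicated().sum())   # exact duplicate rows
print(df.describe())                          # range and spread of numeric columns

# Inconsistent spellings show up as separate categories
city_counts = df["city"].value_counts()

# Simple IQR rule to flag potential outliers in one column
q1, q3 = df["size_sqft"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[df["size_sqft"] > q3 + 1.5 * iqr]

# Correlation of each numeric feature with the target
target_corr = df.select_dtypes("number").corr()["price"]
print(target_corr)
```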


12. What is correlation?
--> Correlation is a statistical measure that describes the relationship between two variables. It tells you how strongly one variable is related to another and whether they move in the same direction (positive correlation) or opposite directions (negative correlation).

13. What does negative correlation mean?
--> Negative correlation refers to a relationship between two variables where as one variable increases, the other decreases, and vice versa. In other words, the two variables move in opposite directions.

Example of Negative Correlation:

Temperature and Heating Costs:

As temperature increases, the need for heating (costs) decreases.

Interpretation: The colder it is, the higher the heating costs. The warmer it gets, the lower the heating costs.

Correlation: As temperature (X) goes up, heating costs (Y) go down.
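
A quick numeric check of this example, using made-up monthly figures, shows a correlation coefficient close to -1:

```python
import numpy as np

# Hypothetical monthly averages (made-up numbers)
temperature_c = np.array([-5, 0, 5, 10, 15, 20, 25])
heating_cost = np.array([210, 180, 150, 115, 80, 50, 20])

# Pearson correlation coefficient between the two series
r = np.corrcoef(temperature_c, heating_cost)[0, 1]
print(f"r = {r:.3f}")  # negative, and close to -1
```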

14. How can you find correlation between variables in Python?
--> In Python, the Pandas library provides an easy way to compute the correlation between variables in a dataset. The most commonly used tool is the DataFrame .corr() method, which by default computes the Pearson correlation coefficient for each pair of numeric columns.

Step-by-Step Approach to Find Correlation:

a. Import Libraries

You need Pandas (and optionally Seaborn or Matplotlib for visualization) to work with the dataset.

b. Create or Load a DataFrame

You can either create a DataFrame from scratch or load a dataset.

c. Use the .corr() Method

The .corr() method will compute the Pearson correlation by default for all pairs of numerical variables in the DataFrame.

d. Interpret the Correlation Matrix

1.000: A perfect positive correlation between two variables.

0.000: No linear correlation (the variables may still be related in a non-linear way).

-1.000: A perfect negative correlation (one variable increases as the other decreases).
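
The steps above can be sketched as follows. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: study hours, exam score, and hours of TV watched
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 71, 80],
    "tv_hours": [5, 4, 3, 2, 1],
})

corr_matrix = df.corr()   # Pearson correlation by default
print(corr_matrix.round(3))

# method="spearman" gives rank correlation instead
spearman = df.corr(method="spearman")
print(spearman.round(3))
```

The resulting matrix is symmetric with 1.0 on the diagonal. It can also be passed to a plotting function such as a heatmap for a visual overview.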

15. What is causation? Explain difference between correlation and causation with an example.
--> Causation refers to a cause-and-effect relationship between two variables. In this type of relationship, a change in one variable directly causes a change in another variable. The key idea behind causation is that the change in one variable brings about a change in another variable.

a. Correlation

Measures the relationship between two variables, but doesn’t imply that one causes the other.

Correlation can be positive, negative, or zero.

Does not imply one variable depends on the other.

Example: Ice cream sales and drowning incidents (they are positively correlated because both rise in hot weather, but neither causes the other).

b. Causation

Indicates that one variable directly causes the change in another.

Causation shows a clear cause-effect relationship.

One variable depends on the other.

Example: Smoking and lung cancer (smoking causes lung cancer).
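
Correlation without causation often arises when a third variable influences both. Below is a minimal simulation under an assumed setup in which temperature drives both ice cream sales and drowning incidents; all coefficients and noise levels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confounder: daily temperature drives both quantities
temperature = rng.uniform(10, 35, size=500)
ice_cream_sales = 20 * temperature + rng.normal(0, 50, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1.5, size=500)

# The two outcomes are correlated although neither causes the other
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Part of y left over after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# After removing the effect of temperature, the association mostly vanishes
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw = {raw_corr:.2f}, controlling for temperature = {partial_corr:.2f}")
```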

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
--> An optimizer in machine learning and deep learning is an algorithm used to adjust the model's parameters (e.g., weights in neural networks) to minimize the loss function. The goal of the optimizer is to find the best set of parameters that result in the lowest possible error (or loss), thereby improving the model's accuracy.

Optimizer's Role: It uses the gradient of the loss function to determine how to adjust the parameters.

How it Works: The optimizer makes incremental adjustments to the parameters based on the gradients of the loss function, and continues this process until the optimal parameters are found or a stopping criterion is met.

Types of Optimizers

a. Gradient Descent (GD)

Gradient Descent is the most basic optimizer. It computes the gradient of the loss function with respect to each parameter, and then updates the parameters in the direction that reduces the loss.

How it works:

The algorithm starts by initializing the parameters randomly.

It then computes the gradient of the loss function.

The parameters are updated by subtracting the gradient multiplied by a learning rate.

Example:

Learning Rate: If the learning rate (η) is set to 0.01, the optimizer updates the parameters in small steps, gradually reducing the error.
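
A minimal sketch of batch gradient descent, assuming a one-feature linear model and mean squared error as the loss. The data is synthetic (y = 2x + 1), and the parameters start at zero rather than at random values so the run is reproducible.

```python
import numpy as np

# Synthetic data from y = 2x + 1 (no noise, for clarity)
X = np.linspace(0, 1, 50)
y = 2 * X + 1

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.01     # eta

for _ in range(20000):
    error = (w * X + b) - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    # Step against the gradient, scaled by the learning rate
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

Every update here uses the entire dataset, which is what distinguishes batch gradient descent from the variants that follow.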

b. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a variant of gradient descent where the model parameters are updated after evaluating each individual data point (or a small batch of data points).

How it works:

Instead of computing the gradient over the entire dataset, SGD computes the gradient on a single training sample (or a batch).

This leads to more frequent updates, which can result in faster convergence, although each update is noisier than one computed on the full dataset.

c.  Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a hybrid of Batch Gradient Descent and Stochastic Gradient Descent. In mini-batch gradient descent, the dataset is divided into smaller batches, and the gradient is computed for each batch.

How it works:

Instead of computing the gradient for the entire dataset (batch) or a single data point (stochastic), mini-batch uses a subset (mini-batch) of the training data.

The parameters are updated after every mini-batch.

Formula: Similar to SGD, but the gradient is averaged over a small batch of data points.
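
A hedged sketch of the idea, again on a one-feature linear model with synthetic data. The helper name `minibatch_gd` and its defaults are hypothetical; setting `batch_size=1` reduces it to SGD and `batch_size=len(X)` to batch gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=200)
y = 2 * X + 1 + rng.normal(0, 0.05, size=200)   # synthetic, made-up data

def minibatch_gd(X, y, batch_size, epochs=200, learning_rate=0.1):
    """Fit y ~ w*x + b by gradient descent on mini-batches."""
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = (w * X[idx] + b) - y[idx]
            # Gradient averaged over the current mini-batch only
            w -= learning_rate * 2 * np.mean(error * X[idx])
            b -= learning_rate * 2 * np.mean(error)
    return w, b

w, b = minibatch_gd(X, y, batch_size=32)
print(f"w = {w:.2f}, b = {b:.2f}")
```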

d. Momentum-based Optimizer

Momentum is an enhancement to the gradient descent algorithm that helps accelerate convergence by adding a fraction of the previous update to the current update.

How it works:

Momentum takes into account the previous gradients and updates the parameters with a combination of the current gradient and the previous gradient, smoothing the optimization path.

It uses a momentum term to maintain the direction of the previous update, preventing oscillations and helping the model converge faster.
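
One common form of the momentum update can be sketched on a simple quadratic loss. The loss function, starting point, and hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def grad(theta):
    """Gradient of f(theta) = 0.5 * (theta[0]**2 + 25 * theta[1]**2)."""
    return np.array([theta[0], 25.0 * theta[1]])

theta = np.array([5.0, 5.0])     # starting point (arbitrary)
velocity = np.zeros_like(theta)
learning_rate = 0.02
momentum = 0.9                   # fraction of the previous update that is kept

for _ in range(300):
    # New update = momentum * previous update - learning_rate * current gradient
    velocity = momentum * velocity - learning_rate * grad(theta)
    theta = theta + velocity

print(theta)   # should end up close to the minimum at (0, 0)
```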

e.  Adam (Adaptive Moment Estimation)

Adam is an adaptive optimization algorithm that combines the benefits of both Momentum and RMSprop (an optimization algorithm that adjusts the learning rate for each parameter). Adam computes adaptive learning rates for each parameter using estimates of the first and second moments of the gradients.

How it works:

Adam keeps exponentially decaying averages of the gradients (first moment, the mean) and of the squared gradients (second moment, the uncentered variance), which helps to adjust the learning rate for each parameter based on its own gradient statistics.

Adam also includes a momentum-like term to accumulate previous gradients.
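
The Adam update can be sketched in a few lines of NumPy. The quadratic loss and starting point are assumptions for illustration; `beta1`, `beta2`, and `eps` use commonly quoted default values, and the learning rate is chosen for this toy problem.

```python
import numpy as np

def grad(theta):
    """Gradient of f(theta) = 0.5 * (theta[0]**2 + 25 * theta[1]**2)."""
    return np.array([theta[0], 25.0 * theta[1]])

theta = np.array([5.0, 5.0])
m = np.zeros_like(theta)     # first-moment estimate (running mean of gradients)
v = np.zeros_like(theta)     # second-moment estimate (running mean of squared gradients)
learning_rate, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias correction, because m and v start at zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step: the gradient average is divided by its own scale
    theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + eps)

print(theta)   # should move toward the minimum at (0, 0)
```

In practice, deep learning libraries provide tested implementations, so a hand-written loop like this is mainly useful for understanding the update rule.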

17. What is sklearn.linear_model ?
--> sklearn.linear_model is a module in the scikit-learn library that provides implementations of various linear models for regression and classification tasks. These models are based on the principle of linear relationships between input features (independent variables) and output (dependent variable).

Linear models are widely used in machine learning for their simplicity, efficiency, and interpretability. In sklearn.linear_model, you will find algorithms such as Linear Regression, Logistic Regression, Ridge Regression, Lasso Regression, ElasticNet, and others.
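
As a small illustration, the sketch below fits three of these models to the same synthetic data and compares the learned coefficients. The data-generating coefficients and the `alpha` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Synthetic target: only the first two features matter (made-up coefficients)
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

coefs = {}
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    coefs[type(model).__name__] = model.coef_
    print(type(model).__name__, np.round(model.coef_, 2))
```

Ridge and Lasso add a penalty on the coefficient sizes; Lasso in particular tends to push the coefficient of the irrelevant third feature to (or very near) zero.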

18. What does model.fit() do? What arguments must be given?
--> In machine learning, the fit() method is used to train a model on a given dataset. The purpose of the fit() method is to learn the underlying patterns in the data and adjust the model's parameters (weights) accordingly.

For supervised learning, the fit() method adjusts the model's parameters by learning from both the input features and the target labels (or values) of the data.

For unsupervised learning, it uses the features of the data to find patterns, clusters, or representations without the need for target labels.

Once the model has been "fitted", it can make predictions on new, unseen data using methods like predict().

For supervised models, the fit() method typically requires two main arguments (unsupervised models need only X):

a. X: The feature matrix (also known as the input data or independent variables).

Shape: It is a 2D array (or DataFrame) where each row represents a sample and each column represents a feature.

Data type: Typically, a NumPy array or pandas DataFrame.

Example: If you have a dataset of houses, X could be the features like square footage, number of rooms, location, etc.

b. y: The target vector (also known as labels or dependent variables).

Shape: A 1D array (or series) of target values for each sample in X. This is usually the outcome you are trying to predict.

Data type: Typically, a NumPy array or pandas Series.

Example: For a house price prediction task, y could be the actual house prices.
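
The difference between supervised and unsupervised calls to fit() can be sketched like this. The house data is hypothetical, and `LinearRegression` and `KMeans` are used only as representative estimators.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Hypothetical houses: [square footage, number of rooms]
X = np.array([[800, 2], [1000, 3], [1200, 3], [1500, 4], [1800, 4]])
y = np.array([40.0, 52.0, 60.0, 78.0, 90.0])    # made-up prices

# Supervised: fit() needs both the features and the targets
regressor = LinearRegression()
regressor.fit(X, y)
print(regressor.coef_, regressor.intercept_)     # learned parameters

# Unsupervised: fit() is called with the features only
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
clusterer.fit(X)
print(clusterer.labels_)                         # cluster index for each row
```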

19. What does model.predict() do? What arguments must be given?
--> The predict() method is used to make predictions on new, unseen data after the model has been trained (i.e., after fitting the model using model.fit()).

Purpose: The primary goal of model.predict() is to generate predictions based on the learned patterns from the training data.

The method takes in input features and outputs predicted values or classes based on the model's learned parameters.

For regression: It returns predicted continuous values (e.g., predicted house prices).

For classification: It returns predicted class labels (e.g., whether an email is spam or not, or the class of an image). In scikit-learn, class probabilities come from a separate method, predict_proba(), on models that support it.

model.predict() typically requires one argument:

a. X: The feature matrix (also known as the input data or independent variables).

Shape: It is a 2D array (or DataFrame), just like the input data used for training. Each row represents a sample (or data point), and each column represents a feature (or attribute).

Data type: Typically, a NumPy array or pandas DataFrame.

Example: For house price prediction, X could contain the size of houses, the number of rooms, and other relevant features.

Note: The input data X provided to predict() must have the same number of features as the data used to fit the model (X_train).
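
A minimal sketch with a tiny hypothetical dataset, showing predict() next to predict_proba() for a classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical dataset: one feature, binary label
X_train = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

X_new = np.array([[2.5], [8.5]])             # same number of columns as X_train
labels = model.predict(X_new)                # hard class labels
probabilities = model.predict_proba(X_new)   # one probability per class

print(labels)
print(probabilities.round(3))
```

Passing an array with a different number of columns than `X_train` to predict() raises an error, which is the constraint described in the note above.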

20. What are continuous and categorical variables?
--> a. Continuous Variables

Definition: Continuous variables are those that can take on an infinite number of values within a given range or interval. These variables can be measured and divided into smaller increments. They represent quantities or measurements that can be expressed with decimal points or fractions.

Examples:

Height (e.g., 5.6 feet, 5.75 feet, 5.9 feet)

Weight (e.g., 65.5 kg, 70.2 kg)

Temperature (e.g., 20.1°C, 25.5°C)

Age (e.g., 25.5 years, 30 years)

Characteristics:

They have infinite possible values within a given range.

Measured on an interval or ratio scale.

Unlike discrete numerical variables (which are counted), continuous variables are measured and can take fractional values.

Use in Machine Learning:

Continuous variables are often used for regression models, where we predict a continuous output.

Techniques for continuous variables might include normalization or standardization to bring them into a similar scale.

b. Categorical Variables

Definition: Categorical variables are those that represent categories or groups. These variables can take on a limited number of distinct values or categories, which are often not numerical but represent qualitative aspects of the data.

Examples:

Gender (e.g., Male, Female)

Country (e.g., USA, India, Germany)

Product Type (e.g., Electronics, Clothing, Furniture)

Marital Status (e.g., Single, Married, Divorced)

Types of Categorical Variables:

Nominal Variables:

These have no inherent order or ranking among the categories.

Examples: Color (Red, Blue, Green), Country (USA, India, Japan).

Ordinal Variables:

These have a natural order or ranking, but the intervals between categories are not necessarily equal.

Examples: Education Level (High School, Bachelor's, Master's, PhD), Rating Scale (Poor, Average, Good, Excellent).

Characteristics:

They take a limited, fixed set of values (discrete).

Can be encoded as numerical values using techniques like one-hot encoding or label encoding for use in machine learning.

Use in Machine Learning:

Categorical variables are often used for classification tasks, where the goal is to predict a category.

Techniques like one-hot encoding, label encoding, and binary encoding are used to convert categorical variables into a format that models can process.

21. What is feature scaling? How does it help in Machine Learning?
--> Feature scaling is a technique used in machine learning to normalize or standardize the range of independent variables (features) in a dataset. In many machine learning algorithms, the magnitude of the features can significantly affect the performance and accuracy of the model. Feature scaling ensures that all features are on the same scale or range.

Feature scaling is particularly important for algorithms that rely on calculating distances between data points or gradients during training. These include algorithms like K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Logistic Regression, and Gradient Descent-based models (e.g., Linear Regression).

How Does Feature Scaling Help in Machine Learning:

Equal Contribution of Features: Scaling helps all features contribute comparably to the model’s learning process. Without scaling, features with larger values can dominate distance-based or gradient-based models, which may lead to biased predictions.

Improves Algorithm Performance: Many machine learning algorithms work better and faster when the features are on the same scale. For example:

Distance-based algorithms like KNN, SVM, and K-means clustering depend on measuring distances between data points. If one feature has a much larger scale than others, it can skew the distance calculation.

Gradient-based algorithms like Linear Regression or Logistic Regression use optimization techniques (e.g., gradient descent) to minimize the error. Feature scaling ensures that the gradient descent converges faster and more efficiently.

Prevents Numerical Instability: Some algorithms, like Neural Networks, are sensitive to features that are on vastly different scales. Feature scaling helps avoid numerical instability during training.

Faster Convergence: For optimization algorithms (like Gradient Descent), feature scaling can speed up the convergence of the algorithm, making the training process faster and more efficient.
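
The effect on distance-based methods can be seen with a small worked example. The five people below are hypothetical; the comparison of interest is between the first three rows.

```python
import numpy as np

# Hypothetical people: [age in years, income in dollars]
data = np.array([
    [25.0, 50_000.0],   # a
    [60.0, 50_500.0],   # b: very different age, similar income
    [26.0, 52_000.0],   # c: similar age, somewhat different income
    [40.0, 80_000.0],
    [50.0, 30_000.0],
])

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

# Unscaled: income dominates, so b looks closer to a than c does
unscaled_ab = euclidean(data[0], data[1])
unscaled_ac = euclidean(data[0], data[2])

# Standardize each column (z-score) using the column mean and standard deviation
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

# Scaled: age now counts as well, so c is closer to a than b is
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(f"unscaled: a-b = {unscaled_ab:.1f}, a-c = {unscaled_ac:.1f}")
print(f"scaled:   a-b = {scaled_ab:.2f}, a-c = {scaled_ac:.2f}")
```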

22. How do we perform scaling in Python?
--> a.  Min-Max Scaling (Normalization)

Min-Max scaling scales the data to a specified range, usually 0 to 1. It's useful when you want to normalize your features and make sure they are on the same scale.

Steps:
Import the MinMaxScaler from sklearn.preprocessing.

Fit the scaler to the training feature data.

Transform the data using fit_transform() on the training set and transform() on the test set.

b. Standardization (Z-Score Normalization)

Standardization scales the data so that it has a mean of 0 and a standard deviation of 1. This is useful when your data follows a Gaussian distribution or when using algorithms that assume normally distributed data.

Steps:

Import the StandardScaler from sklearn.preprocessing.

Fit the scaler to the training feature data.

Transform the data using fit_transform() on the training set and transform() on the test set.

c. Robust Scaling

Robust scaling uses the median and interquartile range (IQR) to scale features, which makes it more robust to outliers compared to Min-Max scaling or standardization.

Steps:

Import the RobustScaler from sklearn.preprocessing.

Fit the scaler to the training feature data.

Transform the data using fit_transform() on the training set and transform() on the test set.

d. Categorical Data

Feature scaling applies to numerical features. Categorical variables are not scaled; they are first converted to numerical form using techniques like Label Encoding and One-Hot Encoding.
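
The three scalers described above can be sketched side by side. The single-feature arrays are hypothetical and include one outlier so the difference between the scalers is visible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one outlier (100)
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_test = np.array([[2.5]])

results = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    # Learn the scaling statistics from the training data only
    X_train_scaled = scaler.fit_transform(X_train)
    # Reuse those statistics on the test data; do not refit
    X_test_scaled = scaler.transform(X_test)
    results[type(scaler).__name__] = X_train_scaled.ravel()
    print(type(scaler).__name__,
          X_train_scaled.ravel().round(2),
          X_test_scaled.ravel().round(2))
```

With the outlier present, min-max scaling squeezes the first four values close to 0, while robust scaling keeps them spread around the median.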

23. What is sklearn.preprocessing?
--> sklearn.preprocessing is a module in scikit-learn (a popular Python machine learning library) that provides a set of tools to preprocess and scale the data before feeding it into a machine learning model. Preprocessing involves transforming raw data into a format that is suitable for analysis and model building.

24. How do we split data for model fitting (training and testing) in Python?
--> Steps to Split Data in Python:

a. Import the Necessary Libraries:

You'll need train_test_split() from sklearn.model_selection.

Import other libraries like numpy or pandas for handling your data.

b. Prepare Your Data:

You'll need your features (X) and target labels (y) ready for splitting.

c. Use train_test_split():

train_test_split() will split your data into training and testing sets. You can specify the test size (the proportion of data to be used for testing), the random state (for reproducibility), and other parameters.
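
A short sketch of the call with the main options, using hypothetical arrays. The `stratify` argument is optional and is shown here because it keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, binary labels (12 zeros, 8 ones)
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,      # 25% of the rows go to the test set
    random_state=42,     # fixed seed so the split is reproducible
    stratify=y,          # keep the class proportions similar in both sets
)

print(X_train.shape, X_test.shape)
print(y_train, y_test)
```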

25. Explain data encoding?
--> Data encoding in machine learning refers to the process of converting categorical (non-numeric) data into a numerical format that can be used by machine learning algorithms. Most machine learning algorithms, especially in scikit-learn, require numerical inputs, so data encoding helps transform categorical variables (like strings) into numeric values.

There are several methods for encoding data, and the appropriate method depends on the nature of the categorical variable, such as whether the categories are nominal (unordered) or ordinal (ordered).
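
As a hedged illustration of matching the encoder to the variable type, the sketch below uses scikit-learn's `OrdinalEncoder` for an ordered column and `OneHotEncoder` for an unordered one. The example columns are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical columns, one value per row
education = np.array(
    [["High School"], ["PhD"], ["Bachelor's"], ["Master's"]], dtype=object
)  # ordinal
country = np.array(
    [["India"], ["USA"], ["Japan"], ["India"]], dtype=object
)  # nominal

# Ordinal data: pass the category order explicitly so the codes respect it
ordinal_encoder = OrdinalEncoder(
    categories=[["High School", "Bachelor's", "Master's", "PhD"]]
)
education_codes = ordinal_encoder.fit_transform(education)

# Nominal data: one 0/1 column per category, with no implied order
one_hot_encoder = OneHotEncoder()
country_one_hot = one_hot_encoder.fit_transform(country).toarray()

print(education_codes.ravel())
print(one_hot_encoder.categories_[0])
print(country_one_hot)
```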
