## FEATURE ENGINEERING ASSIGNMENT

Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its
application.

Min-Max scaling, also known as normalization, is a data preprocessing technique that rescales each numerical feature to a fixed range, typically 0 to 1. For each value x of a feature, the scaled value is (x - min) / (max - min), where min and max are the smallest and largest values of that feature. Putting all features on a common scale keeps features with large raw ranges from dominating distance-based or gradient-based models.

Here's an example of how Min-Max scaling can be applied in Python using the scikit-learn library:

In [2]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Example dataset
data = np.array([[5, 10], [2, 8], [3, 6], [1, 12]])

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print(scaled_data)


[[1.         0.66666667]
 [0.25       0.33333333]
 [0.5        0.        ]
 [0.         1.        ]]
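The same result can be reproduced by hand, which makes the formula explicit. The following is a minimal sketch using only NumPy on the same dataset; the variable names (col_min, col_max, manual_scaled) are our own and not part of scikit-learn.

```python
import numpy as np

# Same example dataset as above, as floats so the division is exact
data = np.array([[5, 10], [2, 8], [3, 6], [1, 12]], dtype=float)

# Column-wise minimum and maximum (axis=0 runs down the rows of each column)
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Min-Max formula applied to every column at once via broadcasting
manual_scaled = (data - col_min) / (col_max - col_min)

print(manual_scaled)
```

Each column is scaled independently: the smallest value in a column maps to 0 and the largest maps to 1.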


Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling?
Provide an example to illustrate its application.

The Unit Vector technique, also known as vector normalization, rescales each sample (each row of the dataset) so that its feature vector has a unit norm. The length or magnitude of the vector becomes 1, while its direction in the original feature space is preserved.

Unlike Min-Max scaling, which rescales each feature (column) independently into a fixed range such as 0 to 1, the Unit Vector technique works on each sample and keeps only the relative proportions of its features. It is particularly useful when the direction of the feature vectors matters but their magnitude does not, for example when samples are compared by cosine similarity.

Here's an example of how to apply Unit Vector scaling in Python using the scikit-learn library:

In [3]:
from sklearn.preprocessing import Normalizer
import numpy as np

# Example dataset
data = np.array([[1, 2], [2, 4], [3, 6]])

# Create a Normalizer object
scaler = Normalizer(norm='l2')

# Transform the data
scaled_data = scaler.transform(data)

print(scaled_data)


[[0.4472136  0.89442719]
 [0.4472136  0.89442719]
 [0.4472136  0.89442719]]


In this example, we have a dataset with two features represented by the columns. We want to apply Unit Vector scaling to the data.

The Normalizer class from scikit-learn is used for Unit Vector scaling. We create an instance of the scaler, scaler, and specify the norm parameter as 'l2' to perform L2 normalization, which scales the feature vectors to have a Euclidean length of 1.

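The transform method is then used to apply the scaling to the data. Because the Normalizer is stateless, it can be called without a prior fit step. The resulting scaled_data has the same shape as the original dataset, but each row has a unit norm. All three rows come out identical here because the original rows are multiples of one another and therefore point in the same direction.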
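To make the difference from Min-Max scaling concrete, both techniques can be written out by hand. This is a minimal NumPy sketch on the same dataset; the variable names are our own.

```python
import numpy as np

data = np.array([[1, 2], [2, 4], [3, 6]], dtype=float)

# Unit Vector (L2) scaling: divide each ROW by its own Euclidean length
row_norms = np.linalg.norm(data, axis=1, keepdims=True)
unit_scaled = data / row_norms

# Min-Max scaling: rescale each COLUMN using that column's min and max
col_min, col_max = data.min(axis=0), data.max(axis=0)
minmax_scaled = (data - col_min) / (col_max - col_min)

print(unit_scaled)    # every row is [0.4472136, 0.89442719]
print(minmax_scaled)  # rows are [0, 0], [0.5, 0.5], [1, 1]
```

Unit Vector scaling discards the length of each row and keeps its direction, while Min-Max scaling keeps the ordering of values within each column.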

Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an
example to illustrate its application.

PCA (Principal Component Analysis) is a statistical technique used for dimensionality reduction in data analysis and machine learning. It identifies the most significant patterns and relationships in a dataset by transforming the original features into a new set of uncorrelated variables called principal components.

The main goal of PCA is to reduce the dimensionality of a dataset while retaining as much information as possible. It achieves this by projecting the data onto a lower-dimensional subspace that captures the maximum variance in the original data.

In [4]:
from sklearn.decomposition import PCA
import numpy as np

# Example dataset
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Create a PCA object with desired number of components
pca = PCA(n_components=2)

# Fit and transform the data
transformed_data = pca.fit_transform(data)

print(transformed_data)


[[-7.79422863  0.        ]
 [-2.59807621  0.        ]
 [ 2.59807621  0.        ]
 [ 7.79422863 -0.        ]]


In this example, we have a dataset with three features represented by the columns. We want to apply PCA to reduce the dimensionality of the data to two dimensions.

The PCA class from scikit-learn is used to perform PCA. We create an instance of the PCA object, pca, and specify the number of components we want to keep (in this case, 2).

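The fit_transform method is then used to fit the PCA model on the data and transform it into the new lower-dimensional representation. The resulting transformed_data has the same number of rows as the original dataset, but each row has only two values, one per principal component. In this particular dataset the second column is zero (up to floating-point rounding) because the four rows lie on a single straight line, so the first component already captures all of the variance.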
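To check how much variance each component carries, one option is to inspect the explained_variance_ratio_ attribute of the fitted model. A minimal sketch on the same dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=float)

pca = PCA(n_components=2)
pca.fit(data)

# Fraction of the total variance captured by each retained component
ratio = pca.explained_variance_ratio_
print(ratio)
```

For this dataset the first ratio is essentially 1 and the second is essentially 0, so a single component would already be enough.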

Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature
Extraction? Provide an example to illustrate this concept.

PCA (Principal Component Analysis) can be used for feature extraction, which involves creating new features from the original ones rather than selecting a subset of them (selecting a subset is feature selection). Feature extraction aims to reduce the dimensionality of the data by building new features that capture the most important information or patterns.

PCA achieves feature extraction by transforming the original features into a new set of uncorrelated variables called principal components. These principal components are linear combinations of the original features and are sorted in descending order of variance. The first few principal components capture the most significant patterns in the data.

By selecting a subset of the top-ranked principal components, PCA effectively performs feature extraction. The selected principal components can serve as the new features, representing the most relevant information in the original dataset.

In [5]:
from sklearn.decomposition import PCA
import numpy as np

# Example dataset
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create a PCA object with desired number of components
pca = PCA(n_components=2)

# Fit and transform the data
extracted_features = pca.fit_transform(data)

print(extracted_features)


[[-5.19615242e+00  2.56395025e-16]
 [ 0.00000000e+00  0.00000000e+00]
 [ 5.19615242e+00  2.56395025e-16]]


In this example, we have a dataset with three original features represented by the columns. We want to perform feature extraction using PCA to reduce the dimensionality to two components.

The PCA class from scikit-learn is used, and we create an instance of the PCA object, pca, specifying the desired number of components (in this case, 2).

The fit_transform method is then used to fit the PCA model on the data and transform it into the new feature space. The resulting extracted_features will have the same number of rows as the original dataset, but each row will have only two values corresponding to the selected principal components.
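The statement that each extracted feature is a linear combination of the original features can be verified directly. The sketch below uses the fitted model's mean_ and components_ attributes to rebuild the extracted features by hand; the variable name manual is our own.

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

pca = PCA(n_components=2)
extracted = pca.fit_transform(data)

# Each row of components_ holds the weights of one principal component
# over the three original features.
print(pca.components_)

# Center the data, then take the weighted sums defined by components_
manual = (data - pca.mean_) @ pca.components_.T
print(np.allclose(manual, extracted))
```

The rows of components_ are the weights, so every new feature is a weighted sum of the centered original features.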

Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset
contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to
preprocess the data.

To preprocess the data for building a recommendation system for a food delivery service, you can use Min-Max scaling to normalize the features such as price, rating, and delivery time. Min-Max scaling will ensure that all these features are on a common scale and have values within a specific range, typically between 0 and 1. Here's how you can use Min-Max scaling for each feature:

Price: Suppose the price feature ranges from $5 to $20. Using Min-Max scaling, you can transform the price values to a range between 0 and 1. For example, if you have a price value of $10, the scaled value will be:

Scaled Price = (Price - Min(Price)) / (Max(Price) - Min(Price))
= (10 - 5) / (20 - 5)
≈ 0.333

Rating: If the rating feature ranges from 1 to 5, you can apply Min-Max scaling to transform the rating values to the range between 0 and 1. For instance, if you have a rating value of 4, the scaled value will be:

Scaled Rating = (Rating - Min(Rating)) / (Max(Rating) - Min(Rating))
= (4 - 1) / (5 - 1)
= 0.75

Delivery Time: Suppose the delivery time feature ranges from 30 minutes to 60 minutes. You can use Min-Max scaling to transform the delivery time values to the range between 0 and 1. For example, if you have a delivery time of 45 minutes, the scaled value will be:

Scaled Delivery Time = (Delivery Time - Min(Delivery Time)) / (Max(Delivery Time) - Min(Delivery Time))
= (45 - 30) / (60 - 30)
= 0.5

By applying Min-Max scaling to the price, rating, and delivery time features, all these features will be scaled to a common range between 0 and 1. This scaling ensures that no single feature dominates the others due to differences in their original scales. It allows for fair comparison and effective utilization of the features in the recommendation system.
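In code, all three features can be scaled in one step. The following is a minimal sketch with a small made-up table (the rows and the new_order values are hypothetical, chosen to match the ranges used above). The scaler is fitted once and then reused on new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: [price in $, rating, delivery time in minutes]
orders = np.array([
    [5.0, 1.0, 30.0],
    [10.0, 4.0, 45.0],
    [20.0, 5.0, 60.0],
])

scaler = MinMaxScaler()
scaled_orders = scaler.fit_transform(orders)
print(scaled_orders)  # second row is approximately [0.333, 0.75, 0.5]

# A new order is scaled with the min and max learned from the fitted data
new_order = np.array([[12.5, 3.0, 40.0]])
scaled_new = scaler.transform(new_order)
print(scaled_new)
```

Reusing the fitted scaler matters in practice: new items should be scaled with the same minimum and maximum as the data the recommendation model was built on.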

Q6. You are working on a project to build a model to predict stock prices. The dataset contains many
features, such as company financial data and market trends. Explain how you would use PCA to reduce the
dimensionality of the dataset.

To reduce the dimensionality of a dataset containing many features, such as company financial data and market trends for predicting stock prices, PCA (Principal Component Analysis) can be used. PCA helps identify the most significant patterns and relationships in the dataset by transforming the original features into a new set of uncorrelated variables called principal components.

Here's how you can use PCA to reduce the dimensionality of the dataset:

Data Preprocessing: Begin by cleaning the dataset, for example by handling missing values and removing or encoding non-numeric fields, so that it is in a suitable numeric format for PCA.

Feature Standardization: Since PCA is sensitive to the scale of the features, it is essential to standardize the features. This involves subtracting the mean of each feature and dividing by its standard deviation, ensuring that all features have zero mean and unit variance.

Apply PCA: Use a PCA implementation (e.g., scikit-learn's PCA) to perform dimensionality reduction. Specify the desired number of principal components you want to retain based on the trade-off between dimensionality reduction and retaining sufficient information.

Fit PCA: Fit the PCA model on the standardized training data only, so that information from later periods does not leak into the transformation. The model analyzes the covariance structure among the features to determine the principal components.

Explained Variance Ratio: Examine the explained variance ratio, which indicates the amount of variance explained by each principal component. This helps assess the contribution of each component and decide how many components to retain.

Dimensionality Reduction: Select the desired number of principal components based on the explained variance ratio. Retain the components that capture the majority of the variance in the data while discarding less significant components.

Transform Data: Project the standardized dataset onto the selected principal components. This produces a new dataset whose columns are the retained components.

Model Training: Use the reduced-dimensional dataset to train your stock price prediction model. The reduced features can help suppress noise and focus the model on the dominant patterns, which may improve its performance.

By applying PCA, you can reduce the dimensionality of the dataset while retaining most of the variance captured by the principal components. This reduces computational cost and can help limit overfitting. The trade-off is that each component is a linear combination of the original features, so the reduced features are usually harder to interpret than the raw inputs.
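The steps above can be sketched as a small scikit-learn pipeline. No real stock data is available here, so the example uses synthetic data (200 rows and 20 correlated features generated from 3 hidden factors). The 95% variance threshold and the time-ordered split are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for financial features: 20 columns driven by 3 hidden factors
factors = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 20))
X = factors @ loadings + 0.1 * rng.normal(size=(200, 20))

# Time-ordered split: earlier rows for fitting, later rows held out
X_train, X_test = X[:150], X[150:]

# Standardize, then keep as many components as needed for 95% of the variance
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_train_reduced = reducer.fit_transform(X_train)
X_test_reduced = reducer.transform(X_test)  # reuses the training mean, scale, and components

pca = reducer.named_steps["pca"]
cumulative = pca.explained_variance_ratio_.cumsum()
print(X_train_reduced.shape, X_test_reduced.shape)
print(cumulative)
```

Passing a float between 0 and 1 as n_components asks scikit-learn to choose the number of components from the explained variance, which avoids hard-coding a count.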

Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the
values to a range of -1 to 1.

In [6]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Original dataset
data = np.array([1, 5, 10, 15, 20]).reshape(-1, 1)

# Create a MinMaxScaler object with desired feature range
scaler = MinMaxScaler(feature_range=(-1, 1))

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print(scaled_data)


[[-1.        ]
 [-0.57894737]
 [-0.05263158]
 [ 0.47368421]
 [ 1.        ]]


In this example, we have the original dataset stored in the data array. We reshape the array to be a column vector to match the expected input format of the MinMaxScaler.

We then create an instance of the MinMaxScaler class, scaler, and specify the desired feature range as (-1, 1).

Next, we fit the scaler on the data and simultaneously transform it using the fit_transform method. The resulting scaled_data will contain the Min-Max scaled values.

The values in the scaled_data array have been transformed to the range of -1 to 1 using Min-Max scaling. The value 1 in the original dataset (the minimum) corresponds to -1 in the scaled dataset, and the value 20 (the maximum) corresponds to 1, while the other values are scaled proportionally between these bounds.
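The same numbers can be obtained by hand in two steps: first scale to the range 0 to 1, then stretch and shift to the target range. This is a minimal NumPy sketch; the variable names are our own.

```python
import numpy as np

values = np.array([1, 5, 10, 15, 20], dtype=float)
new_min, new_max = -1.0, 1.0

# Step 1: ordinary Min-Max scaling to the range [0, 1]
unit = (values - values.min()) / (values.max() - values.min())

# Step 2: stretch to the width of the target range and shift to its lower bound
scaled = unit * (new_max - new_min) + new_min

print(scaled)  # the minimum (1) maps to -1 and the maximum (20) maps to 1
```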

Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform
Feature Extraction using PCA. How many principal components would you choose to retain, and why?

To perform feature extraction using PCA on a dataset with features like height, weight, age, gender, and blood pressure, the number of principal components to retain would depend on the desired balance between dimensionality reduction and the amount of information preserved. Here are the steps to determine the number of principal components to retain:

Data Preprocessing: Begin by handling missing values and encoding the categorical gender feature numerically (for example as 0/1), since PCA operates only on numeric data.

Standardize Features: Since PCA is sensitive to the scale of the features, it is essential to standardize the features. This involves subtracting the mean of each feature and dividing by its standard deviation, ensuring that all features have zero mean and unit variance.

Apply PCA: Use a PCA implementation (e.g., scikit-learn's PCA) to perform dimensionality reduction. Apply PCA on the standardized dataset.

Explained Variance Ratio: After applying PCA, analyze the explained variance ratio. The explained variance ratio indicates the proportion of variance explained by each principal component. It helps understand how much information is preserved by each component.

Scree Plot: Plot the explained variance (or explained variance ratio) of each principal component against its component number. A companion plot of the cumulative explained variance ratio shows how many components are needed to retain a given percentage of the total variance.

Elbow Method: Look for an "elbow" in the scree plot, the point after which the explained variance of additional components levels off. The number of principal components at this elbow point is a common heuristic for selecting the number of components to retain.

Retain Principal Components: Based on the scree plot, cumulative explained variance ratio, and elbow method, choose the number of principal components that explain a sufficient amount of variance while still reducing the dimensionality of the dataset. This choice is often a trade-off between dimensionality reduction and the amount of information retained.

Transform Data: Transform the original dataset into the lower-dimensional subspace using the selected number of principal components.

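The optimal number of principal components to retain can vary depending on the dataset and the specific requirements of the application. With only five features, a common rule of thumb is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold, such as 90% to 95%. The final choice should combine the explained variance ratio, the scree plot, and domain knowledge.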
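The procedure above can be sketched as follows. No real measurements are given, so the data is synthetic and purely illustrative: the distributions, the 0/1 gender encoding, and the 90% threshold are all assumptions. The number of components chosen for a real dataset may differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in data with some built-in correlation between features
height = rng.normal(170, 10, n)                    # cm
weight = 0.9 * height - 85 + rng.normal(0, 5, n)   # kg, correlated with height
age = rng.uniform(20, 70, n)                       # years
gender = rng.integers(0, 2, n)                     # assumed 0/1 encoding
blood_pressure = 100 + 0.5 * age + 0.2 * weight + rng.normal(0, 5, n)

X = np.column_stack([height, weight, age, gender, blood_pressure])

# Standardize, then fit PCA with all five components to inspect the variance
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%
k = int(np.searchsorted(cumulative, 0.90) + 1)

print(cumulative)
print(k)
```

Once k is chosen, PCA(n_components=k) can be refitted on the standardized data to produce the extracted features.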