Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.
Min-Max Scaling:
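Min-Max scaling, often simply called normalization, linearly rescales each numeric feature to a fixed range, usually 0 to 1, although any range such as -1 to 1 can be used. The smallest value in a column maps to the lower bound, the largest maps to the upper bound, and the relative spacing between values is preserved. In preprocessing it is commonly applied before algorithms that are sensitive to feature scale, such as k-nearest neighbors or models trained with gradient descent.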
Example:
Consider a dataset with a single feature, price, with values: [10, 20, 30, 40, 50].

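Find the minimum and maximum values of price: X_min = 10 and X_max = 50.
Apply the Min-Max scaling formula, X_scaled = (X - X_min) / (X_max - X_min), to each value in the dataset. For example, 20 becomes (20 - 10) / (50 - 10) = 0.25, so the scaled values are [0, 0.25, 0.5, 0.75, 1].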
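
The price example can be checked numerically. The following is a minimal sketch using NumPy; the helper name min_max_scale and its feature_range parameter are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values so the minimum maps to the lower bound
    of feature_range and the maximum maps to the upper bound."""
    x = np.asarray(values, dtype=float)
    low, high = feature_range
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # A constant feature has no spread; map everything to the lower bound.
        return np.full_like(x, low)
    unit = (x - x_min) / (x_max - x_min)   # values now lie in [0, 1]
    return unit * (high - low) + low       # stretch/shift to the target range

price = [10, 20, 30, 40, 50]
scaled_price = min_max_scale(price)
print(scaled_price.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Calling min_max_scale(price, feature_range=(-1, 1)) would map the same values to the range -1 to 1. In practice, scikit-learn's MinMaxScaler provides the same transformation for whole datasets.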
Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.
Unit Vector Scaling:
Unit Vector scaling, also known as normalization to unit length, rescales each sample's feature vector so that it has a unit norm (i.e., a length of 1). Each vector is divided by its own norm, most commonly the Euclidean (L2) norm. This technique is often used in text classification and clustering.

Difference from Min-Max Scaling:

Min-Max scaling transforms each feature individually based on its minimum and maximum values.
Unit Vector scaling transforms the entire feature vector to have a unit norm, preserving the direction but not the magnitude.
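
As an illustration of the difference, the sketch below scales a single hypothetical sample [3, 4] to unit length using the L2 norm. The function name to_unit_vector and the sample values are assumptions chosen for the example.

```python
import numpy as np

def to_unit_vector(vector):
    """Divide a feature vector by its L2 (Euclidean) norm so its length is 1."""
    x = np.asarray(vector, dtype=float)
    norm = np.linalg.norm(x)
    if norm == 0:
        # The zero vector has no direction; return it unchanged.
        return x
    return x / norm

sample = [3, 4]                 # length is sqrt(3^2 + 4^2) = 5
unit = to_unit_vector(sample)
print(unit.tolist())            # [0.6, 0.8]
print(np.linalg.norm(unit))     # length of the scaled vector
```

The scaled vector points in the same direction as [3, 4] but has length 1. Min-Max scaling, by contrast, would work down each column (feature) using that column's minimum and maximum, not across the values of one sample.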

Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.
Principal Component Analysis (PCA):
PCA is a statistical technique used to emphasize variation and bring out strong patterns in a dataset. It does so by transforming the original variables into a new set of variables called principal components, which are orthogonal (uncorrelated) and ordered such that the first few retain most of the variation present in the original variables.

Steps:

Standardize the data.
Calculate the covariance matrix.
Calculate the eigenvalues and eigenvectors of the covariance matrix.
Sort the eigenvalues and their corresponding eigenvectors.
Select the top k eigenvectors to form a new feature space (principal components).
Example:
Consider a dataset with two correlated features: x1 and x2.

Standardize the data.
Calculate the covariance matrix.
Compute eigenvalues and eigenvectors.
Project the data onto the top principal components.
The result is a lower-dimensional representation of the data, which captures most of the variability with fewer dimensions.
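
The steps above can be sketched directly with NumPy. The data here is synthetic (x2 is generated as a noisy multiple of x1, which is an assumption made purely for illustration), and k = 1 is chosen to reduce two dimensions to one.

```python
import numpy as np

# Synthetic data: x2 is strongly correlated with x1 (illustrative assumption).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# 1. Standardize each feature to mean 0 and standard deviation 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (columns are variables).
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition; eigh is suited to symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort eigenvalues (and matching eigenvectors) from largest to smallest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 5. Project the standardized data onto the top k eigenvectors.
k = 1
X_reduced = X_std @ eigenvectors[:, :k]

explained_ratio = eigenvalues / eigenvalues.sum()
print(X_reduced.shape)      # (200, 1)
print(explained_ratio)      # share of variance captured by each component
```

Because the two features are so strongly correlated, the first component captures nearly all of the variance, which is why dropping the second loses little information.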

Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.
Relationship:
PCA can be considered a feature extraction method because it transforms the original set of features into a new set of features (principal components) that capture the most important information in the data.

Using PCA for Feature Extraction:

Compute the principal components.
Select the top k principal components that capture the most variance.
Project the original data onto these k components to obtain a new feature set.
Example:
Consider a dataset with features [height, weight, age, blood pressure].

Standardize the data.
Perform PCA and obtain principal components.
Select the top 2 components (if they capture, say, 95% of the variance).
Project the data onto these 2 components, reducing the dimensionality while retaining most of the information.
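
A possible sketch of this example with scikit-learn follows. The measurements are randomly generated stand-ins (an assumption, since the original gives no data), so the variance captured by two components will not necessarily be 95%.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for [height, weight, age, blood pressure].
rng = np.random.default_rng(42)
n = 100
height = rng.normal(170, 10, n)
weight = 70 + 0.9 * (height - 170) + rng.normal(0, 5, n)   # tied to height
age = rng.uniform(20, 70, n)
blood_pressure = 100 + 0.5 * age + rng.normal(0, 5, n)     # tied to age
X = np.column_stack([height, weight, age, blood_pressure])

# Standardize, then extract two new features (principal components).
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_features = pca.fit_transform(X_std)

print(X_features.shape)                      # (100, 2)
print(pca.explained_variance_ratio_.sum())   # variance kept by the 2 components
print(pca.components_)                       # weights of original features in each component
```

Each row of pca.components_ shows how one extracted feature is built as a weighted combination of the four standardized originals.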

Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.
Steps to Use PCA:

Standardize the data: Ensure all features have a mean of 0 and a standard deviation of 1.
Compute the covariance matrix of the standardized data.
Calculate the eigenvalues and eigenvectors of the covariance matrix.
Select the top k eigenvectors corresponding to the largest eigenvalues to form the principal components.
Project the standardized data onto the new feature space defined by these k principal components.
Example:

Standardize the data: suppose we have features such as revenue, profit, market_trend1, market_trend2, etc.
Compute the covariance matrix.
Calculate eigenvalues and eigenvectors.
Select the top k components: choose k such that the principal components explain, say, 95% of the variance.
Transform the data: project the standardized data onto these k components.
This reduces the number of features while retaining most of the important information, making the model more efficient and potentially more accurate.
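
One way this workflow might look with scikit-learn is sketched below. The feature matrix is synthetic (10 hypothetical features driven by 3 hidden factors, an assumption made only so the example runs), and passing a fraction as n_components asks PCA to keep just enough components to reach that share of explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for financial and market features (not real stock data):
# 10 observed features generated from 3 hidden factors plus a little noise.
rng = np.random.default_rng(7)
n_days, n_factors, n_features = 500, 3, 10
factors = rng.normal(size=(n_days, n_factors))
loadings = rng.normal(size=(n_factors, n_features))
X = factors @ loadings + 0.1 * rng.normal(size=(n_days, n_features))

# Standardize so that no feature dominates because of its units.
X_std = StandardScaler().fit_transform(X)

# Keep the smallest number of components explaining at least 95% of variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_std)

print(pca.n_components_)                     # k chosen by PCA
print(pca.explained_variance_ratio_.sum())   # at least 0.95
print(X_reduced.shape)                       # (500, k)
```

The reduced matrix X_reduced would then replace the original features as input to the prediction model.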

Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?
Steps:

Encode gender first, since it is categorical and cannot be standardized as text, then standardize the resulting numeric columns.
Compute the covariance matrix.
Calculate eigenvalues and eigenvectors.
Determine the number of principal components to retain by looking at the explained variance ratio.



import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small illustrative sample with 4 rows
data = pd.DataFrame({
    'Height': [170, 160, 175, 180],
    'Weight': [70, 60, 80, 90],
    'Age': [25, 30, 45, 50],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Blood Pressure': [120, 110, 130, 140]
})

# One-hot encode the categorical Gender column so every column is numeric
data_encoded = pd.get_dummies(data, columns=['Gender'])

# Standardize every column (including the Gender dummies) to mean 0, std 1
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_encoded)

# Fit PCA with all components to see how much variance each one explains
pca = PCA()
pca.fit(data_scaled)
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

# Smallest number of components whose cumulative explained variance reaches 95%
components_to_retain = np.argmax(cumulative_variance >= 0.95) + 1
components_to_retain

2

Choosing Principal Components:

Calculate the cumulative explained variance.
Retain enough components to explain at least 95% of the variance.
For example, if the first 3 principal components explain 95% of the variance, retain 3 components.
Why:

Retaining components that explain 95% of the variance ensures that most of the information in the original features is preserved while reducing dimensionality, which helps in reducing computational complexity and potentially improving model performance.