Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

Answer--> Min-Max scaling, also known as normalization, is a data preprocessing technique used to rescale numerical features to a fixed range, typically between 0 and 1. 

Here's an example to illustrate the application of Min-Max scaling:

In [1]:
import numpy as np

# Raw data
data = np.array([20, 30, 40, 50, 60])

# Calculate the minimum and maximum values
min_val = np.min(data)
max_val = np.max(data)

# Apply Min-Max scaling
scaled_data = (data - min_val) / (max_val - min_val)

# Print the scaled data
print(scaled_data)


[0.   0.25 0.5  0.75 1.  ]


Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

Answer--> The Unit Vector technique, also known as unit normalization, is a data preprocessing technique that rescales each feature vector (one sample's values across the features) to have a unit norm. Each vector is divided by its magnitude, so the result has a magnitude of 1.

Here's an example to illustrate the application of the Unit Vector technique:

Consider a dataset with two features representing the height (in centimeters) and weight (in kilograms) of individuals:

```
Height: [160, 170, 180]
Weight: [50, 60, 70]
```

To apply the Unit Vector technique, we follow these steps:

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

df = pd.DataFrame({
    "Height": [160, 170, 180],
    "Weight": [50, 60, 70]
})

# Each row (one individual's [Height, Weight]) is treated as a vector.
# normalize() divides every row by its L2 norm by default (axis=1).
# Normalizing a single column row by row would trivially return all 1s.
unit_vectors = normalize(df[["Height", "Weight"]])

print("Unit vectors (Height, Weight):")
print(np.round(unit_vectors, 3))

Unit vectors (Height, Weight):
[[0.954 0.298]
 [0.943 0.333]
 [0.932 0.362]]


The resulting vectors have unit norms, meaning their magnitudes are 1. This technique is useful when the direction or orientation of a vector matters more than its magnitude, for example when samples are compared by the angle between them rather than by their absolute values.

Differences between Unit Vector technique and Min-Max scaling:
- Min-Max scaling rescales features to a fixed range, while the Unit Vector technique scales vectors to have unit norms.
- Min-Max scaling preserves relative relationships and proportions within the range, while the Unit Vector technique focuses on vector direction or orientation.
- Min-Max scaling is feature-wise, while the Unit Vector technique is vector-wise.
- Min-Max scaling is used when features have different scales or units, while the Unit Vector technique emphasizes vector directionality.
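
The feature-wise versus vector-wise difference can be checked side by side. The following is a minimal sketch on the same height and weight table, assuming scikit-learn's `MinMaxScaler` and `normalize`; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# Same small table as above: one row per individual, columns = [height, weight].
X = np.array([[160.0, 50.0],
              [170.0, 60.0],
              [180.0, 70.0]])

# Min-Max scaling works column by column (feature-wise).
minmax_scaled = MinMaxScaler().fit_transform(X)

# Unit-vector scaling works row by row (vector-wise), using the L2 norm.
unit_scaled = normalize(X, norm="l2")

print(minmax_scaled)                         # every column spans 0 to 1
print(np.linalg.norm(unit_scaled, axis=1))   # every row has length 1
```

After Min-Max scaling each column runs from 0 to 1, while after unit-vector scaling each row has length 1 and the columns no longer share a fixed range.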

Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

Answer--> PCA (Principal Component Analysis) is a statistical technique used for dimensionality reduction. It aims to transform a high-dimensional dataset into a lower-dimensional space while preserving the maximum amount of information or variance in the data. It achieves this by finding the principal components, which are new orthogonal axes that capture the most significant variations in the data.  

Let's consider an example to illustrate the application of PCA for dimensionality reduction:

Suppose we have a dataset with four features: "height," "weight," "age," and "income" of individuals. The dataset contains information for 1000 individuals.

To apply PCA for dimensionality reduction, we follow these steps:

1. Standardize the data: Since PCA is sensitive to the scale of the features, it is often necessary to standardize the data by subtracting the mean and dividing by the standard deviation. This step ensures that features with larger scales do not dominate the analysis.

2. Calculate the covariance matrix: The covariance matrix measures the relationships between pairs of features. It helps in understanding the variability and correlation present in the data.

3. Compute the eigenvectors and eigenvalues: The eigenvectors represent the principal components, and the corresponding eigenvalues indicate the amount of variance explained by each principal component. The eigenvectors are computed from the covariance matrix.

4. Select the desired number of principal components: Based on the eigenvalues, we can determine the number of principal components we want to retain. These components should capture a significant portion of the total variance in the data.

5. Project the data onto the selected principal components: The data is transformed by projecting it onto the selected principal components. This projection reduces the dimensionality of the data while retaining as much information or variance as possible.

For example, let's say after performing PCA, we decide to keep only the first two principal components, which explain most of the variance in the data.

The transformed dataset will now have two features (the first two principal components) instead of the original four features.

By reducing the dimensionality using PCA, we can visualize the data in a lower-dimensional space, simplify subsequent analysis tasks, and potentially improve computational efficiency. The retained principal components capture the most significant variations in the data, allowing us to focus on the most important aspects while discarding less informative features.
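
The five steps can be sketched directly in NumPy. This is only an illustration under stated assumptions: the 1000 x 4 table is synthetic (random numbers standing in for height, weight, age, and income), and keeping two components is the choice made in the example above, not a general rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 1000 x 4 table; the values are invented.
n = 1000
height = rng.normal(170, 10, n)
weight = 0.9 * height - 85 + rng.normal(0, 5, n)   # correlated with height
age = rng.uniform(18, 70, n)
income = 1000 * age + rng.normal(0, 8000, n)       # correlated with age
X = np.column_stack([height, weight, age, income])

# 1. Standardize each feature (zero mean, unit standard deviation).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (4 x 4).
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition. eigh is meant for symmetric matrices and returns
#    eigenvalues in ascending order, so reorder them to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep the first k components and inspect the variance share of each.
k = 2
explained = eigvals / eigvals.sum()

# 5. Project the standardized data onto the selected components.
X_reduced = X_std @ eigvecs[:, :k]

print(X_reduced.shape)        # (1000, 2)
print(explained.round(3))     # variance share per component, largest first
```

In practice `sklearn.decomposition.PCA` performs steps 2 to 5 internally; the manual version is shown only to make each step visible.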

Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

Answer--> To preprocess the data for building a recommendation system for a food delivery service, Min-Max scaling can be used to normalize the features such as price, rating, and delivery time. Here's how Min-Max scaling can be applied to each feature:

1. Determine the minimum and maximum values of each feature in the dataset.

For example:
   - Price: The minimum price of food items could be 5 dollars, and the maximum price could be 50 dollars.
   - Rating: The minimum rating could be 1.0, and the maximum rating could be 5.0.
   - Delivery time: The minimum delivery time could be 15 minutes, and the maximum delivery time could be 60 minutes.

2. Apply the Min-Max scaling formula to each feature:
   - Scaled value = (value - minimum) / (maximum - minimum)

   For instance:
   - Scaled Price = (Price - 5) / (50 - 5)
   - Scaled Rating = (Rating - 1.0) / (5.0 - 1.0)
   - Scaled Delivery time = (Delivery time - 15) / (60 - 15)

   The scaled values will now range between 0 and 1.

3. After scaling, the data will be ready to be used in the recommendation system. The scaled values ensure that each feature is proportionally transformed within the desired range. This step is particularly useful in scenarios where the features have different scales or units.

By applying Min-Max scaling, features such as price, rating, and delivery time are brought onto a common scale, so they can be compared and analyzed together. This preprocessing step keeps features with larger raw values from dominating and preserves the relative ordering and spacing of values within each feature. It helps the recommendation system weigh each feature appropriately when suggesting food items to users based on their preferences.
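
A minimal sketch of this preprocessing with pandas and scikit-learn might look as follows. The menu items are hypothetical and were chosen to stay inside the example ranges above; the column names are assumptions as well.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical menu items (invented values within the example ranges).
features = pd.DataFrame({
    "price": [5.0, 12.5, 27.5, 50.0],        # dollars
    "rating": [1.0, 3.0, 4.5, 5.0],          # stars
    "delivery_time": [15, 30, 45, 60],       # minutes
})

# Fit the scaler: it learns the minimum and maximum of each column.
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(features)

# Create a new DataFrame with the scaled features.
scaled_data = pd.DataFrame(scaled_features, columns=features.columns)
print(scaled_data)
```

The fitted `scaler` should be reused with `scaler.transform(...)` on new items so that they are scaled with the same minimum and maximum; new values outside the fitted range will fall outside 0 to 1.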

Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.

Answer--> Here's how PCA can be applied to achieve dimensionality reduction:

1. Data preprocessing: Before applying PCA, it's crucial to preprocess the dataset by standardizing the features. Standardization involves subtracting the mean and dividing by the standard deviation of each feature. This step ensures that all features are on a comparable scale, which is essential for PCA to work effectively.

2. Covariance matrix calculation: Compute the covariance matrix using the standardized dataset. The covariance matrix measures the relationships and variances between pairs of features. It provides crucial information for determining the principal components.

3. Eigenvalue decomposition: Perform eigenvalue decomposition on the covariance matrix to obtain the eigenvectors and eigenvalues. The eigenvectors represent the principal components, while the eigenvalues signify the amount of variance explained by each principal component.

4. Selection of principal components: Determine the number of principal components to retain based on the eigenvalues. You can choose a threshold, such as retaining components that explain a certain percentage of the total variance (e.g., 95% variance explained). Alternatively, you can decide to keep a fixed number of components that are most relevant to the problem.

5. Projection of data: Project the original dataset onto the selected principal components to obtain a reduced-dimensional representation. Each instance in the dataset is transformed into a new set of values corresponding to the retained principal components.

By applying PCA for dimensionality reduction, you reduce the number of features in the dataset while retaining as much of the variance as possible. This can simplify the dataset, remove redundancy among correlated features, and potentially improve the performance of the stock price prediction model. It also helps mitigate the curse of dimensionality, which can lead to overfitting and computational inefficiency on high-dimensional datasets. One trade-off is that each principal component is a linear combination of the original features, so the components are usually harder to interpret than the raw financial indicators.
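
As a rough sketch of the same workflow with scikit-learn, the example below standardizes and reduces a wide table in one pipeline and keeps enough components to explain 95% of the variance. The data is synthetic (30 random features driven by 5 hidden factors), not real market data, and the 400/100 time-ordered split is an assumption made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 500 days x 30 features built from 5 hidden factors plus noise.
factors = rng.normal(size=(500, 5))
loadings = rng.normal(size=(5, 30))
X = factors @ loadings + 0.1 * rng.normal(size=(500, 30))

# Time-ordered split: fit on earlier rows only so later data does not leak in.
X_train, X_test = X[:400], X[400:]

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that share (requires the "full" SVD solver).
reducer = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
)
X_train_reduced = reducer.fit_transform(X_train)
X_test_reduced = reducer.transform(X_test)

pca = reducer.named_steps["pca"]
print(X_train_reduced.shape, X_test_reduced.shape)
print(pca.explained_variance_ratio_.sum())
```

Fitting the scaler and PCA on the training rows only, then applying them to later rows, keeps information from the test period out of the preprocessing.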

Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

In [3]:
from sklearn.preprocessing import MinMaxScaler

# Original dataset
data = [1, 5, 10, 15, 20]

# Create an instance of MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))

# Reshape the data to 2D array (required by MinMaxScaler)
data = [[value] for value in data]

# Perform Min-Max scaling
scaled_data = scaler.fit_transform(data)

# Flatten the scaled data back to 1D array
scaled_data = [value[0] for value in scaled_data]

# Print the scaled data
print(scaled_data)

[-0.9999999999999999, -0.5789473684210525, -0.05263157894736836, 0.47368421052631593, 1.0]
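
The result can be cross-checked by hand. For a target range [a, b], Min-Max scaling computes x' = a + (x - min) * (b - a) / (max - min). The short sketch below applies that formula with NumPy; the names `lo` and `hi` are illustrative.

```python
import numpy as np

data = np.array([1, 5, 10, 15, 20], dtype=float)
lo, hi = -1.0, 1.0   # target range

# x' = lo + (x - min) * (hi - lo) / (max - min)
scaled = lo + (data - data.min()) * (hi - lo) / (data.max() - data.min())

print(np.round(scaled, 4))
```

Rounded to four decimals, the values are -1, -0.5789, -0.0526, 0.4737, and 1, which agrees with the `MinMaxScaler` output up to floating-point rounding.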


Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

Answer--> Here's an approach to determine the number of principal components to retain:

- Standardize the data: Before applying PCA, it is generally recommended to standardize the data by subtracting the mean and scaling the features to have unit variance. This step ensures that features with larger scales do not dominate the analysis.

- Perform PCA: Apply PCA to the standardized dataset. The result will provide the principal components and their associated eigenvalues.

- Compute explained variance ratio: Calculate the explained variance ratio for each principal component. This can be done by dividing each eigenvalue by the sum of all eigenvalues.

- Calculate cumulative explained variance ratio: Compute the cumulative sum of the explained variance ratios.

- Determine the number of PCs to retain: Examine the cumulative explained variance ratio plot. Look for the point where adding additional PCs provides diminishing returns in terms of explained variance. This can be subjective and depends on the specific dataset and the desired level of information retention.
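
The procedure above can be sketched as follows. Since no actual measurements are given, the data here is synthetic and invented for illustration; gender is encoded as 0/1 (an assumption, because PCA needs numeric input), and the 90% threshold is an arbitrary example cutoff rather than a recommended value.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 300

# Synthetic stand-in data for the five features.
height = rng.normal(170, 10, n)
weight = 0.9 * height - 85 + rng.normal(0, 6, n)
age = rng.uniform(20, 80, n)
gender = rng.integers(0, 2, n)   # categorical feature encoded as 0/1
blood_pressure = 90 + 0.5 * age + 0.2 * weight + rng.normal(0, 8, n)

df = pd.DataFrame({
    "height": height,
    "weight": weight,
    "age": age,
    "gender": gender,
    "blood_pressure": blood_pressure,
})

# Standardize, then fit PCA with all components kept.
X_std = StandardScaler().fit_transform(df)
pca = PCA().fit(X_std)

# Cumulative explained variance ratio, one entry per component.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90
n_keep = int(np.argmax(cumulative >= threshold) + 1)

print(cumulative.round(3))
print(n_keep)
```

The number printed for this synthetic data says nothing about a real dataset; the same steps would need to be run on the actual measurements before choosing how many components to retain.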