### Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

## Min-Max Scaling

Min-Max scaling, also known as Min-Max normalization, is a data preprocessing technique that rescales the values of each feature to a specific range, typically $[0, 1]$. It is particularly useful when preparing data for machine learning algorithms that are sensitive to the scale of their inputs, such as those based on distance metrics (e.g., k-nearest neighbors, support vector machines).

### How It Works

The Min-Max scaling formula is as follows:

$  x' = \frac{x - \min(x)}{\max(x) - \min(x)}  $

where:
- $ x $ is the original value.
- $ \min(x) $ is the minimum value in the feature.
- $ \max(x) $ is the maximum value in the feature.
- $ x' $ is the scaled value.

This formula scales each feature to the range $[0, 1]$. The target can be changed to any desired range $[a, b]$ using the following formula:

$  x' = a + \frac{(x - \min(x)) \times (b - a)}{\max(x) - \min(x)}  $
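
As a quick check of the general formula, the sketch below implements it with NumPy and compares the result with scikit-learn's `MinMaxScaler`, whose `feature_range` parameter sets the target range. The helper name `min_max_scale` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def min_max_scale(x, a=0.0, b=1.0):
    """Rescale a 1-D array to [a, b] using the formula above."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # The formula divides by max(x) - min(x), which is zero here
        raise ValueError("constant feature: max(x) - min(x) is zero")
    return a + (x - x_min) * (b - a) / (x_max - x_min)


values = np.array([2.0, 4.0, 6.0, 10.0])  # illustrative values only
manual = min_max_scale(values, a=-1.0, b=1.0)

# scikit-learn expects a 2-D array of shape (n_samples, n_features)
scaler = MinMaxScaler(feature_range=(-1.0, 1.0))
from_sklearn = scaler.fit_transform(values.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, from_sklearn))
```

The explicit check for a constant feature is a design choice of this sketch; the formula itself is undefined in that case.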

### Advantages

- **Uniform Scale**: Rescales features to a common range, so that no feature dominates distance or gradient computations purely because of its units.
- **Improved Convergence**: Gradient-based optimizers often converge faster when all inputs share a similar scale.
- **Compatibility**: Suitable for algorithms that expect input data within a bounded range.

### Disadvantages

- **Sensitive to Outliers**: Because the scaling depends only on the minimum and maximum values, a single extreme value stretches the range and compresses the remaining values into a narrow band.
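
The effect of an outlier can be seen in a small sketch. The two arrays are made-up values that differ only in their last entry, and `min_max` is a hypothetical helper that applies the $[0, 1]$ formula.

```python
import numpy as np


def min_max(x):
    """Apply the [0, 1] Min-Max formula to a 1-D array."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


clean = np.array([10.0, 20.0, 30.0, 40.0, 50.0])           # made-up values
with_outlier = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])  # last value replaced by an outlier

print(min_max(clean))         # evenly spread over [0, 1]
print(min_max(with_outlier))  # the four ordinary values are squeezed close to 0
```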

### Example

Let's illustrate Min-Max scaling with an example in Python using `sklearn.preprocessing.MinMaxScaler`.




In [44]:
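import seaborn as sns
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load seaborn's example "tips" dataset and pick the two numeric columns to scale
df = sns.load_dataset("tips")
features = ['total_bill', 'tip']
print(df[features])

# Fit the scaler on both columns and rescale each one to [0, 1]
scaler = MinMaxScaler()
df1 = pd.DataFrame(scaler.fit_transform(df[features]),
                   columns=['Scaled_total_bill', 'Scaled_tip'])
print(df1)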


     total_bill   tip
0         16.99  1.01
1         10.34  1.66
2         21.01  3.50
3         23.68  3.31
4         24.59  3.61
..          ...   ...
239       29.03  5.92
240       27.18  2.00
241       22.67  2.00
242       17.82  1.75
243       18.78  3.00

[244 rows x 2 columns]
     Scaled_total_bill  Scaled_tip
0             0.291579    0.001111
1             0.152283    0.073333
2             0.375786    0.277778
3             0.431713    0.256667
4             0.450775    0.290000
..                 ...         ...
239           0.543779    0.546667
240           0.505027    0.111111
241           0.410557    0.111111
242           0.308965    0.083333
243           0.329074    0.222222

[244 rows x 2 columns]


### Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

## Unit Vector Technique in Feature Scaling

The Unit Vector technique, also known as normalization or vector normalization, scales each feature vector (row of the data) to have unit norm (length 1). The most common choice is the $L^2$ norm, in which case every data point ends up on the surface of the unit hypersphere; other norms such as $L^1$ can also be used.

### How It Works

For a given feature vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$, the normalized vector $\mathbf{x'}$ is calculated as follows:

$ \mathbf{x'} = \frac{\mathbf{x}}{\|\mathbf{x}\|} $

where $ \|\mathbf{x}\| $ is the norm of the vector $ \mathbf{x} $. For the $ L^2 $ norm, this is:

$ \|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2} $ 

Thus, each component of the $ L^2 $-normalized vector is:

  $ x_i' = \frac{x_i}{\sqrt{\sum_{j=1}^n x_j^2}} $  
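
The formula can be sketched in a few lines of NumPy. The helper name `l2_normalize_rows`, the sample rows, and the choice to leave all-zero rows unchanged are assumptions made for this illustration.

```python
import numpy as np


def l2_normalize_rows(X):
    """Divide each row by its L2 norm; all-zero rows are left unchanged."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    return X / norms


X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [5.0, 12.0]])  # illustrative rows

X_unit = l2_normalize_rows(X)
print(X_unit)
print(np.linalg.norm(X_unit, axis=1))  # length of every row after scaling
```

For the first row, the norm is $\sqrt{3^2 + 4^2} = 5$, so the scaled row is $[0.6, 0.8]$.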

### Differences from Min-Max Scaling

- **Min-Max Scaling**: Transforms each feature (column) to a fixed range, typically $[0, 1]$ or $[-1, 1]$, based on the minimum and maximum values of that feature.
- **Unit Vector Scaling**: Normalizes each data point (row) independently to have unit norm, so that the length of each feature vector is 1. It preserves the direction of each data point rather than the range of each feature.
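
The column-wise versus row-wise distinction can be made concrete with a small NumPy sketch. The matrix is made up for illustration; rows are samples and columns are features.

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 600.0]])  # illustrative data only

# Min-Max scaling: the statistics are taken per column (feature)
col_min, col_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Unit vector scaling: each row (sample) is divided by its own L2 norm
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))  # every column spans 0 to 1
print(np.linalg.norm(X_unit, axis=1))              # every row has length 1
```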

### Example

Let's illustrate Unit Vector scaling with an example in Python using the `sklearn.preprocessing.normalize` function.



In [45]:

import pandas as pd
import seaborn as sns
from sklearn.preprocessing import normalize

# Load seaborn's example "iris" dataset and keep the four numeric columns
df = sns.load_dataset("iris")
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
print(df[features])

# Scale every row to unit L2 norm (normalize defaults to norm='l2', axis=1)
df1 = pd.DataFrame(normalize(df[features]), columns=features)
print(df1)

     sepal_length  sepal_width  petal_length  petal_width
0             5.1          3.5           1.4          0.2
1             4.9          3.0           1.4          0.2
2             4.7          3.2           1.3          0.2
3             4.6          3.1           1.5          0.2
4             5.0          3.6           1.4          0.2
..            ...          ...           ...          ...
145           6.7          3.0           5.2          2.3
146           6.3          2.5           5.0          1.9
147           6.5          3.0           5.2          2.0
148           6.2          3.4           5.4          2.3
149           5.9          3.0           5.1          1.8

[150 rows x 4 columns]
     sepal_length  sepal_width  petal_length  petal_width
0        0.803773     0.551609      0.220644     0.031521
1        0.828133     0.507020      0.236609     0.033801
2        0.805333     0.548312      0.222752     0.034269
3        0.800030     0.539151      0.260879    

### Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

### Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms a high-dimensional dataset into a lower-dimensional one while retaining most of the original variance. PCA achieves this by identifying the directions (principal components) along which the variance in the data is maximized.

### How It Works

1. **Standardization**: Standardize the data to have a mean of 0 and a standard deviation of 1 for each feature.

2. **Covariance Matrix Computation**: Compute the covariance matrix to understand the relationships between different features.

3. **Eigen Decomposition**: Perform eigen decomposition on the covariance matrix to find the eigenvalues and eigenvectors. The eigenvectors represent the directions of the principal components, and the eigenvalues represent the magnitude of variance in these directions.

4. **Sorting Eigenvalues**: Sort the eigenvalues in descending order and select the top $k$ eigenvectors corresponding to the largest eigenvalues. These eigenvectors form the principal components.

5. **Projection**: Project the standardized data onto the new $k$-dimensional subspace formed by the principal components.
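
The five steps can be sketched directly with NumPy. This is a minimal illustration of the procedure as described, using the same sample values as the scikit-learn example in this section; scikit-learn's `PCA` may take a different numerical route internally, and the sign of each component is arbitrary, so projected values can differ by a factor of -1.

```python
import numpy as np

# Same ten two-feature points as the scikit-learn example in this section
X = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

# 1. Standardize: zero mean and unit (population) standard deviation per feature
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(Z, rowvar=False)

# 3. Eigen decomposition (eigh is meant for symmetric matrices; eigenvalues come back ascending)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort in descending order and keep the top k eigenvectors
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 1
W = eigvecs[:, :k]

# 5. Project the standardized data onto the k-dimensional subspace
scores = Z @ W

explained_ratio = eigvals / eigvals.sum()
print(explained_ratio)
print(scores.shape)
```

With `k = 1` the ten points are reduced from two dimensions to one; `explained_ratio[0]` indicates how much of the total variance that single component keeps.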

### Advantages

- **Dimensionality Reduction**: Reduces the number of features while retaining most of the variance, simplifying the dataset.
- **Noise Reduction**: Helps in removing noise and redundant features.
- **Improved Performance**: Can improve the performance of machine learning algorithms by reducing overfitting and computational complexity.

### Example

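Let's illustrate PCA with an example in Python using `sklearn.decomposition.PCA`. The sample data has only two features and the code keeps both components (`n_components=2`), so no dimensions are dropped yet; the explained variance ratio in the output shows how much each component carries, and setting `n_components=1` would reduce the data to a single dimension.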




In [46]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample data
data = {
    'feature1': [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
    'feature2': [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
}

df = pd.DataFrame(data)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Initialize PCA
pca = PCA(n_components=2)

# Fit and transform the data
pca_data = pca.fit_transform(scaled_data)

# Create a DataFrame for the PCA data
pca_df = pd.DataFrame(pca_data, columns=['PC1', 'PC2'])

print("Original Data:\n", df)
print("\nPCA Data:\n", pca_df)
print("\nExplained Variance Ratio:", pca.explained_variance_ratio_)

Original Data:
    feature1  feature2
0       2.5       2.4
1       0.5       0.7
2       2.2       2.9
3       1.9       2.2
4       3.1       3.0
5       2.3       2.7
6       2.0       1.6
7       1.0       1.1
8       1.5       1.6
9       1.1       0.9

PCA Data:
         PC1       PC2
0 -1.086432 -0.223524
1  2.308937  0.178081
2 -1.241919  0.501509
3 -0.340782  0.169919
4 -2.184290 -0.264758
5 -1.160739  0.230481
6  0.092605 -0.453317
7  1.482108  0.055667
8  0.567226  0.021305
9  1.563287 -0.215361

Explained Variance Ratio: [0.96296464 0.03703536]


### Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

### Relationship Between PCA and Feature Extraction

Principal Component Analysis (PCA) is closely related to feature extraction. Feature extraction involves transforming the original features into a new set of features that capture the most important information in the data. PCA achieves this by identifying and projecting the data onto the principal components, which are the directions of maximum variance in the data.

### How PCA is Used for Feature Extraction

1. **Variance Maximization**: PCA identifies the directions (principal components) along which the variance in the data is maximized.
2. **Dimensionality Reduction**: By selecting a subset of the principal components, PCA reduces the number of features while retaining most of the original variance.
3. **New Feature Space**: The principal components form a new feature space where each component is a linear combination of the original features. These new features are uncorrelated and capture the most significant patterns in the data.
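
The two claims in the last point, that each extracted feature is a linear combination of the originals and that the extracted features are uncorrelated, can be checked with a short NumPy sketch. It uses the three-feature sample values from the example in this section and an SVD of the standardized data; this is one way to obtain the principal directions, chosen here for brevity.

```python
import numpy as np

# The three-feature sample values from the example in this section
X = np.array([
    [2.5, 2.4, 3.5], [0.5, 0.7, 0.2], [2.2, 2.9, 3.2], [1.9, 2.2, 2.9],
    [3.1, 3.0, 4.1], [2.3, 2.7, 3.3], [2.0, 1.6, 3.0], [1.0, 1.1, 1.5],
    [1.5, 1.6, 2.5], [1.1, 0.9, 1.4],
])
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature

# SVD of the standardized data: the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
weights = Vt[:2]           # weights of the first two components, shape (2, 3)
extracted = Z @ weights.T  # each new feature is a weighted sum of the original features

# Correlation among the original features versus among the extracted ones
print(np.corrcoef(Z, rowvar=False).round(3))
print(np.corrcoef(extracted, rowvar=False).round(3))
```

The first matrix shows strongly correlated inputs, while the off-diagonal entries of the second should be zero up to rounding.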

### Example

Let's illustrate how PCA can be used for feature extraction with an example in Python using `sklearn.decomposition.PCA`.

In [47]:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample data
data = {
    'feature1': [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
    'feature2': [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
    'feature3': [3.5, 0.2, 3.2, 2.9, 4.1, 3.3, 3.0, 1.5, 2.5, 1.4]
}

df = pd.DataFrame(data)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Initialize PCA
pca = PCA(n_components=2)

# Fit and transform the data
pca_data = pca.fit_transform(scaled_data)

# Create a DataFrame for the PCA data
pca_df = pd.DataFrame(pca_data, columns=['PC1', 'PC2'])

print("Original Data:\n", df)
print("\nPCA Data:\n", pca_df)
print("\nExplained Variance Ratio:", pca.explained_variance_ratio_)
print("\nPrincipal Components:\n", pca.components_)


Original Data:
    feature1  feature2  feature3
0       2.5       2.4       3.5
1       0.5       0.7       0.2
2       2.2       2.9       3.2
3       1.9       2.2       2.9
4       3.1       3.0       4.1
5       2.3       2.7       3.3
6       2.0       1.6       3.0
7       1.0       1.1       1.5
8       1.5       1.6       2.5
9       1.1       0.9       1.4

PCA Data:
         PC1       PC2
0 -1.373427  0.202617
1  3.103600 -0.339408
2 -1.338163 -0.566786
3 -0.451958 -0.113369
4 -2.578542  0.110996
5 -1.326280 -0.276351
6 -0.156429  0.584672
7  1.756273  0.019510
8  0.493276  0.151515
9  1.871650  0.226603

Explained Variance Ratio: [0.95922786 0.03318559]

Principal Components:
 [[-0.5824797  -0.56946131 -0.58002691]
 [ 0.33491824 -0.81832751  0.46708657]]


### Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

### Using Min-Max Scaling to Preprocess Data for a Food Delivery Recommendation System

In a food delivery recommendation system, the dataset may contain features such as price, rating, and delivery time. These features have different ranges and units (currency, a star rating, minutes), which can distort scale-sensitive algorithms such as similarity or distance computations. Min-Max scaling puts all three on a common range so that no feature dominates purely because of its units.

### Steps for Min-Max Scaling

1. **Identify Features**: Identify the features to be scaled, such as price, rating, and delivery time.
2. **Compute Minimum and Maximum Values**: Calculate the minimum and maximum values for each feature.
3. **Apply Min-Max Scaling**: Use the Min-Max scaling formula to transform the values of each feature to the range [0, 1].
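
In a recommendation setting the scaler is typically fitted once and then reused for listings that arrive later. The sketch below shows that pattern with `MinMaxScaler.fit` and `transform`; the historical values match the example in this section, while `new_listing` is a hypothetical record invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Historical listings (same illustrative values as the example in this section)
history = pd.DataFrame({
    "price": [10, 15, 20, 25, 30],
    "rating": [4.2, 4.5, 4.0, 3.8, 4.7],
    "delivery_time": [30, 25, 35, 40, 20],
})

scaler = MinMaxScaler()
scaler.fit(history)  # steps 1-2: learn the per-feature minimum and maximum

# Hypothetical new listing that was not part of the fitted data
new_listing = pd.DataFrame({"price": [35], "rating": [4.4], "delivery_time": [30]})
scaled_new = scaler.transform(new_listing)  # step 3, reusing the stored minimum and maximum

print(scaler.data_min_, scaler.data_max_)
print(scaled_new)
```

Because the new price (35) is above the fitted maximum (30), its scaled value is 1.25, outside $[0, 1]$. If that is undesirable, `MinMaxScaler(clip=True)` is one option in recent scikit-learn versions.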

### Min-Max Scaling Formula

$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $

where:
- $ x $ is the original value.
- $ \min(x) $ is the minimum value in the feature.
- $ \max(x) $ is the maximum value in the feature.
- $ x' $ is the scaled value.

### Example

Let's illustrate Min-Max scaling with an example dataset containing price, rating, and delivery time.

In [48]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = {
    'price': [10, 15, 20, 25, 30],
    'rating': [4.2, 4.5, 4.0, 3.8, 4.7],
    'delivery_time': [30, 25, 35, 40, 20]
}

df = pd.DataFrame(data)

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(df)

# Create a DataFrame for the scaled data
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

print("Original Data:\n", df)
print("\nScaled Data:\n", scaled_df)


Original Data:
    price  rating  delivery_time
0     10     4.2             30
1     15     4.5             25
2     20     4.0             35
3     25     3.8             40
4     30     4.7             20

Scaled Data:
    price    rating  delivery_time
0   0.00  0.444444           0.50
1   0.25  0.777778           0.25
2   0.50  0.222222           0.75
3   0.75  0.000000           1.00
4   1.00  1.000000           0.00


### Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.

## Using PCA to Reduce Dimensionality in Stock Price Prediction

In a stock price prediction project, the dataset may contain numerous features, such as company financial data and market trends. High-dimensional data can lead to issues like overfitting, increased computational complexity, and difficulty in visualizing the data. Principal Component Analysis (PCA) can be used to reduce the dimensionality of the dataset while retaining most of the variance.

### Steps for Using PCA

1. **Standardize the Data**: Standardize the features to have a mean of 0 and a standard deviation of 1.
2. **Compute the Covariance Matrix**: Calculate the covariance matrix to understand the relationships between features.
3. **Perform Eigen Decomposition**: Compute the eigenvalues and eigenvectors of the covariance matrix.
4. **Select Principal Components**: Choose the top $k$ principal components that explain the most variance.
5. **Transform the Data**: Project the original data onto the new feature space defined by the selected principal components.
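
With scikit-learn, these steps can be wrapped in a single pipeline so that the same standardization and projection can be re-applied to new rows. The sketch below is run on synthetic data generated only for illustration (it is not real market data); the number of rows, the two hidden factors, and the choice `k = 2` are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide financial feature table (NOT real market data):
# 200 rows, 8 correlated columns driven by 2 hidden factors plus noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = factors @ mixing + 0.1 * rng.normal(size=(200, 8))

# Standardize, then keep the top k principal components
k = 2
reducer = make_pipeline(StandardScaler(), PCA(n_components=k))
X_reduced = reducer.fit_transform(X)

# make_pipeline names each step after its lowercased class name
pca = reducer.named_steps["pca"]
print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())
```

Calling `reducer.transform` on later rows applies the mean, standard deviation, and components learned during fitting.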

### Example

Let's illustrate how PCA can be used to reduce the dimensionality of a stock price prediction dataset with an example in Python using `sklearn.decomposition.PCA`.



In [49]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample data representing financial and market features
data = {
    'revenue': [100, 150, 200, 250, 300],
    'profit': [10, 15, 20, 25, 30],
    'market_cap': [500, 700, 800, 900, 1000],
    'debt': [50, 60, 70, 80, 90],
    'dividend': [2, 2.5, 3, 3.5, 4],
    'interest_rate': [1.5, 1.7, 1.8, 1.9, 2.0],
    'inflation_rate': [2.1, 2.3, 2.2, 2.4, 2.5],
    'gdp_growth': [3.5, 3.6, 3.7, 3.8, 3.9]
}

df = pd.DataFrame(data)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Keep the first two principal components
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)

# Create a DataFrame for the PCA data
pca_df = pd.DataFrame(pca_data, columns=['PC1', 'PC2'])

print("Original Data:\n", df)
print("\nPCA Data:\n", pca_df)
print("\nExplained Variance Ratio:", pca.explained_variance_ratio_)
print("\nPrincipal Components:\n", pca.components_)




Original Data:
    revenue  profit  market_cap  debt  dividend  interest_rate  inflation_rate  \
0      100      10         500    50       2.0            1.5             2.1   
1      150      15         700    60       2.5            1.7             2.3   
2      200      20         800    70       3.0            1.8             2.2   
3      250      25         900    80       3.5            1.9             2.4   
4      300      30        1000    90       4.0            2.0             2.5   

   gdp_growth  
0         3.5  
1         3.6  
2         3.7  
3         3.8  
4         3.9  

PCA Data:
         PC1       PC2
0  4.150439  0.065120
1  1.593253 -0.581157
2  0.151408  0.681863
3 -1.992580 -0.049597
4 -3.902520 -0.116229

Explained Variance Ratio: [0.97468911 0.02057227]

Principal Components:
 [[-0.35714391 -0.35714391 -0.35544647 -0.35714391 -0.35714391 -0.35544647
  -0.33099357 -0.35714391]
 [ 0.14510176  0.14510176  0.07327591  0.14510176  0.14510176  0.07327591
  -0.94

### Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

## Min-Max Scaling to Transform Values to a Range of -1 to 1

Given a dataset containing the values: $[1, 5, 10, 15, 20]$, we will perform Min-Max scaling to transform the values to a range of -1 to 1.

### Min-Max Scaling Formula

To scale the values to a range of $[-1, 1]$, we use the following formula:

$ x' = \frac{(x - \min(x)) \cdot (new_{max} - new_{min})}{\max(x) - \min(x)} + new_{min} $

where:
- $ x $ is the original value.
- $ \min(x) $ is the minimum value in the dataset.
- $ \max(x) $ is the maximum value in the dataset.
- $ new_{min} $ is the new minimum value (-1).
- $ new_{max} $ is the new maximum value (1).
- $ x' $ is the scaled value.

### Calculation

Let's apply this formula to each value in the dataset.
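
Worked by hand, with $ \min(x) = 1 $, $ \max(x) = 20 $ (so $ \max(x) - \min(x) = 19 $) and $ new_{max} - new_{min} = 2 $, the formula gives:

$ x'(1) = \frac{(1 - 1) \cdot 2}{19} - 1 = -1 $

$ x'(5) = \frac{(5 - 1) \cdot 2}{19} - 1 = \frac{8}{19} - 1 = -\frac{11}{19} \approx -0.5789 $

$ x'(10) = \frac{(10 - 1) \cdot 2}{19} - 1 = \frac{18}{19} - 1 = -\frac{1}{19} \approx -0.0526 $

$ x'(15) = \frac{(15 - 1) \cdot 2}{19} - 1 = \frac{28}{19} - 1 = \frac{9}{19} \approx 0.4737 $

$ x'(20) = \frac{(20 - 1) \cdot 2}{19} - 1 = \frac{38}{19} - 1 = 1 $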

In [50]:

```python
import numpy as np

# Original data
data = np.array([1, 5, 10, 15, 20])

# Min and max values of the original data
min_val = np.min(data)
max_val = np.max(data)

# New min and max values for the scaled data
new_min = -1
new_max = 1

# Min-Max scaling function
def min_max_scale(x, min_val, max_val, new_min, new_max):
    return ((x - min_val) * (new_max - new_min) / (max_val - min_val)) + new_min

# Apply Min-Max scaling to each value
scaled_data = min_max_scale(data, min_val, max_val, new_min, new_max)

scaled_data
```

array([-1.        , -0.57894737, -0.05263158,  0.47368421,  1.        ])

### Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

## Choosing Number of Principal Components for Feature Extraction using PCA

When performing Feature Extraction using PCA on a dataset containing features like height, weight, age, gender, and blood pressure, the number of principal components to retain depends on several factors, including the desired level of dimensionality reduction and the variance explained by each principal component.

### Steps to Decide Number of Principal Components

1. **Standardize the Data**: Ensure all features are standardized to have a mean of 0 and a standard deviation of 1.
  
2. **Compute PCA**: Apply PCA to the standardized data and obtain the explained variance ratio for each principal component.

3. **Cumulative Explained Variance**: Calculate the cumulative explained variance and choose the number of principal components that explain a significant portion (e.g., 95%) of the total variance.

### Example Consideration

Let's hypothesize that after performing PCA on the dataset, we obtain the following explained variance ratios for each principal component:

- PC1 explains 70% of the variance.
- PC2 explains 20% of the variance.
- PC3 explains 5% of the variance.
- PC4 explains 3% of the variance.
- PC5 explains 2% of the variance.

In this scenario, the first two principal components (PC1 and PC2) together explain $70\% + 20\% = 90\%$ of the variance, which falls short of a 95% threshold. Adding PC3 brings the cumulative total to 95%, so three components are needed to meet that threshold. Retaining only two components is a reasonable choice if 90% of the variance is acceptable for the task at hand.
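
The cumulative shares for the hypothesized ratios listed above can be tallied with a short sketch. Whole-number percentages are used instead of fractions so that the threshold comparisons are not affected by floating-point rounding; `components_needed` is a hypothetical helper name.

```python
import numpy as np

# Hypothesized explained variance per component, in whole percent (from the list above)
explained_pct = np.array([70, 20, 5, 3, 2])
cumulative_pct = np.cumsum(explained_pct)


def components_needed(threshold_pct):
    """Smallest number of leading components whose cumulative share reaches the threshold."""
    return int(np.argmax(cumulative_pct >= threshold_pct)) + 1


print(cumulative_pct)         # [ 70  90  95  98 100]
print(components_needed(90))  # 2
print(components_needed(95))  # 3
```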

### Decision Criteria

- **Threshold**: A common practice is to set a threshold, such as retaining principal components that cumulatively explain at least 95% of the variance.
  
- **Visualization**: Consider visualizing the data in reduced dimensions (e.g., using scatter plots of principal components) to assess if the retained components capture meaningful patterns.

- **Application**: Evaluate how the reduced dataset performs in downstream tasks like prediction or clustering.
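
The threshold criterion can also be handed to scikit-learn directly: a float `n_components` between 0 and 1 (with `svd_solver="full"`) asks `PCA` to keep just enough components to exceed that share of the variance. The data below is randomly generated for illustration only (it is not real patient data), and the relationships between the columns are assumptions of this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up measurements for the five features (NOT real patient data)
rng = np.random.default_rng(42)
n = 100
height = rng.normal(170, 10, n)
weight = 0.9 * (height - 100) + rng.normal(0, 5, n)     # loosely tied to height
age = rng.integers(20, 70, n).astype(float)
gender = rng.integers(0, 2, n).astype(float)
blood_pressure = 100 + 0.5 * age + rng.normal(0, 8, n)  # loosely tied to age
X = np.column_stack([height, weight, age, gender, blood_pressure])

Z = StandardScaler().fit_transform(X)

# Keep enough components to explain more than 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
Z_reduced = pca.fit_transform(Z)

print(pca.n_components_)
print(np.cumsum(pca.explained_variance_ratio_))
```

The number of components actually kept is available as `pca.n_components_` after fitting; it depends on how strongly the features are correlated.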

### Conclusion

The decision on how many principal components to retain in PCA for feature extraction depends on the specific goals of the analysis, the trade-off between dimensionality reduction and information retention, and the variance explained by each principal component. In general, retaining principal components that collectively explain a high percentage of the variance ensures that important patterns and relationships in the data are preserved while reducing the complexity of the dataset.


In [51]:


import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample dataset
data = {
    'height': [160, 165, 170, 175, 180],
    'weight': [60, 65, 70, 75, 80],
    'age': [25, 30, 35, 40, 45],
    'gender': [0, 1, 0, 1, 1],
    'blood_pressure': [120, 130, 125, 140, 135]
}

df = pd.DataFrame(data)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Initialize PCA
pca = PCA()

# Fit PCA and transform the data
pca_data = pca.fit_transform(scaled_data)

# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Choosing number of principal components to retain (e.g., cumulative explained variance of 95%)
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
num_components = np.argmax(cumulative_variance_ratio >= 0.95) + 1

# Selecting top principal components
pca_selected = PCA(n_components=num_components)
pca_data_selected = pca_selected.fit_transform(scaled_data)

# Creating a DataFrame for the PCA-transformed data
pca_df = pd.DataFrame(data=pca_data_selected, columns=[f'PC{i+1}' for i in range(num_components)])

# Display results
print("Original Data:\n", df)
print("\nPCA-transformed Data:\n", pca_df)
print("\nExplained Variance Ratio:\n", explained_variance_ratio)
print("\nCumulative Explained Variance Ratio:\n", cumulative_variance_ratio)
print("\nNumber of Principal Components Selected:", num_components)


Original Data:
    height  weight  age  gender  blood_pressure
0     160      60   25       0             120
1     165      65   30       1             130
2     170      70   35       0             125
3     175      75   40       1             140
4     180      80   45       1             135

PCA-transformed Data:
         PC1       PC2
0  3.080633 -0.098238
1  0.690610  1.283370
2  0.773679 -1.179750
3 -1.933855  0.465176
4 -2.611067 -0.470558

Explained Variance Ratio:
 [8.44931430e-01 1.39452568e-01 1.56160014e-02 2.57855783e-33
 5.48726886e-36]

Cumulative Explained Variance Ratio:
 [0.84493143 0.984384   1.         1.         1.        ]

Number of Principal Components Selected: 2
