## Q1

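**Min-Max Scaling** is a **feature scaling** technique used in data preprocessing. It linearly rescales a numerical feature to a fixed range, typically \([0, 1]\). The formula for Min-Max Scaling is:

\[
X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
\]

Where:
- \(X\) = original feature value
- \(X_{\min}\), \(X_{\max}\) = minimum and maximum values of the feature

After scaling, all values lie within the range \([0, 1]\).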
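
As a minimal sketch, min-max scaling, (x - min) / (max - min), can be applied by hand with NumPy. The sample values and the variable names `x` and `x_scaled` are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative values (an assumption for this sketch, not real data)
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: subtract the minimum, then divide by the range
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)  # the smallest value maps to 0, the largest to 1
```

Because the transform is linear, the relative spacing between values is preserved; only the range changes.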



In [None]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Example DataFrame
data = {'total_bill': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Initialize the scaler (default feature_range is (0, 1))
scaler = MinMaxScaler()

# Select the column as a 2D DataFrame, since scikit-learn expects 2D input
data_2d = df[['total_bill']]  # or use .to_frame() or .values.reshape(-1, 1)

# Fit the scaler (learns the column's minimum and maximum)
scaler.fit(data_2d)

# Transform the data
scaled_data = scaler.transform(data_2d)
print(scaled_data)

[[0.  ]
 [0.25]
 [0.5 ]
 [0.75]
 [1.  ]]


## Q2

The **Unit Vector Technique** (also called **Normalization** or **L2 Normalization**) is a feature scaling method that transforms each data point into a vector with a magnitude (length) of 1. It divides every value in a sample (row) by that sample's Euclidean norm (L2 norm), so the resulting feature vector has a norm of 1. The formula for normalization is:

\[
X_{\text{normalized}} = \frac{X}{\|X\|}
\]

Where:
- \(X\) = original feature vector
- \(\|X\|\) = Euclidean norm (L2 norm) of the vector, calculated as \(\sqrt{X_1^2 + X_2^2 + \dots + X_n^2}\)
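
A minimal sketch of the formula above, computed by hand with NumPy. The array values and the names `X`, `norms`, and `X_unit` are assumptions for illustration; each row is treated as one feature vector.

```python
import numpy as np

# Each row is one sample (feature vector); values are illustrative
X = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])

# Euclidean (L2) norm of each row: sqrt(x_1^2 + x_2^2 + ...)
norms = np.linalg.norm(X, axis=1, keepdims=True)

# Divide each row by its own norm
X_unit = X / norms

# Every row now has length 1 (up to floating-point rounding)
print(np.linalg.norm(X_unit, axis=1))
```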

---

### How It Differs from Min-Max Scaling
1. **Purpose**:
   - **Unit Vector**: Scales data so that the feature vector has a magnitude of 1 (used for direction rather than scale).
   - **Min-Max Scaling**: Scales data to a specific range (e.g., [0, 1]) based on minimum and maximum values.

2. **Output Range**:
   - **Unit Vector**: Each component lies in [-1, 1] (in [0, 1] for non-negative data), but the values depend on the vector's magnitude rather than on a feature's minimum and maximum.
   - **Min-Max Scaling**: Values are strictly bounded (e.g., between 0 and 1).

3. **Use Case**:
   - **Unit Vector**: Useful for algorithms that rely on distances or angles between data points (e.g., cosine similarity, k-nearest neighbors).
   - **Min-Max Scaling**: Useful for algorithms sensitive to feature magnitudes or when features have different scales.



In [None]:
from sklearn.preprocessing import Normalizer
import pandas as pd

# Example dataset
data = {'feature1': [1, 2, 3], 'feature2': [4, 5, 6]}
df = pd.DataFrame(data)

# Initialize the Normalizer (Unit Vector)
normalizer = Normalizer(norm='l2')  # Use L2 norm for Euclidean normalization

# Fit and transform the data
normalized_data = normalizer.fit_transform(df)

# Convert the result back to a DataFrame
df_normalized = pd.DataFrame(normalized_data, columns=df.columns)
print(df_normalized)

   feature1  feature2
0  0.242536  0.970143
1  0.371391  0.928477
2  0.447214  0.894427


## Q3

### What is PCA (Principal Component Analysis)?

**Principal Component Analysis (PCA)** is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving as much variance (information) as possible. It does this by identifying the directions (called **principal components**) in the data that capture the most variance and projecting the data onto these directions.

---

### Key Concepts of PCA
1. **Principal Components**: These are the orthogonal (uncorrelated) directions in the data that maximize variance. The first principal component captures the most variance, the second captures the second most, and so on.
2. **Dimensionality Reduction**: By selecting a subset of principal components, you can reduce the number of features (dimensions) in your dataset while retaining most of the information.
3. **Variance Retention**: PCA allows you to quantify how much variance is retained when reducing dimensions.

---

### How PCA is Used in Dimensionality Reduction
1. **Standardize the Data**: PCA is sensitive to the scale of the data, so features should be standardized (mean = 0, variance = 1).
2. **Compute Covariance Matrix**: This matrix captures the relationships between features.
3. **Eigenvalue Decomposition**: Compute the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors represent the principal components, and the eigenvalues represent the amount of variance captured by each component.
4. **Select Principal Components**: Choose the top \(k\) eigenvectors (principal components) that capture the most variance.
5. **Transform Data**: Project the original data onto the selected principal components to obtain the reduced-dimensional representation.
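
The five steps above can be sketched directly in NumPy. This is a minimal illustration under stated assumptions, not a replacement for a library implementation: the small array `X`, the choice `k = 2`, and all variable names are hypothetical.

```python
import numpy as np

# Illustrative data: 4 samples, 3 features
X = np.array([[2, 3, 5],
              [4, 6, 7],
              [6, 8, 10],
              [8, 9, 12]], dtype=float)

# 1. Standardize each feature (mean 0, variance 1)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the features (columns)
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition; eigh is suited to symmetric matrices
#    and returns eigenvalues in ascending order, so re-sort descending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep the top k eigenvectors
k = 2
W = eigvecs[:, :k]

# 5. Project the standardized data onto the selected components
X_reduced = X_std @ W

print(eigvals / eigvals.sum())  # share of variance per component
print(X_reduced.shape)          # (4, 2)
```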

---

### Example of PCA for Dimensionality Reduction

#### Step 1: Standardize the Data
Assume you have a dataset with 3 features:

| feature1 | feature2 | feature3 |
|----------|----------|----------|
| 2        | 3        | 5        |
| 4        | 6        | 7        |
| 6        | 8        | 10       |
| 8        | 9        | 12       |

Standardize the data so each feature has a mean of 0 and a variance of 1.

#### Step 2: Compute Principal Components
PCA identifies the directions (principal components) that capture the most variance. Suppose the top 2 principal components are:
- PC1: Captures 80% of the variance
- PC2: Captures 15% of the variance

#### Step 3: Transform Data
Project the standardized 3D data onto the 2D space defined by PC1 and PC2. The projected values below are illustrative, not computed from the table above.

| PC1      | PC2      |
|----------|----------|
| -1.2     | 0.3      |
| -0.5     | 0.1      |
| 0.8      | -0.2     |
| 0.9      | -0.2     |

Now, the data is reduced from 3D to 2D while retaining 95% of the variance.
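
As a hedged sketch, the same three-feature table can be run through scikit-learn to obtain actual numbers. The variance percentages and projected values in the walkthrough above are hypothetical, so the printed results will differ from them.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# The 4 x 3 example table
df = pd.DataFrame({
    "feature1": [2, 4, 6, 8],
    "feature2": [3, 6, 8, 9],
    "feature3": [5, 7, 10, 12],
})

# Standardize, then keep the first two principal components
X_std = StandardScaler().fit_transform(df)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                     # (4, 2)
print(pca.explained_variance_ratio_)  # actual variance share per component
```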


## Q4

### Relationship Between PCA and Feature Extraction

**Principal Component Analysis (PCA)** is a technique commonly used for **feature extraction**. Feature extraction is the process of transforming raw data into a set of features that are more informative, non-redundant, and suitable for machine learning tasks. PCA achieves this by identifying the most important directions (principal components) in the data and projecting the data onto these directions.

---

### How PCA is Used for Feature Extraction
1. **Identify Principal Components**: PCA computes new features (principal components) that are linear combinations of the original features. These components are orthogonal (uncorrelated) and capture the maximum variance in the data.
2. **Reduce Dimensionality**: By selecting a subset of principal components, you can reduce the number of features while retaining most of the information in the data.
3. **Transform Data**: The original data is transformed into a new feature space defined by the principal components.
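
To make the "linear combination" idea concrete, the sketch below inspects the weights scikit-learn stores in `components_` and reproduces the transform by hand. The data values and the variable name `manual` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data: 4 samples, 3 features
X = np.array([[2, 3, 5],
              [4, 6, 7],
              [6, 8, 10],
              [8, 9, 12]], dtype=float)

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# Each row of components_ holds the weights of one principal component
# over the original (standardized) features
print(pca.components_)

# The new features are weighted sums of the centered original features
manual = (X_std - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X_std)))
```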

---

### Key Differences Between PCA and Traditional Feature Extraction
- **PCA**: Creates new features (principal components) that are linear combinations of the original features. These components are uncorrelated and ranked by the amount of variance they explain.
- **Traditional Feature Extraction**: Often involves domain-specific techniques to create new features (e.g., extracting texture features from images or frequency features from audio signals).

---

### Example of PCA for Feature Extraction

#### Scenario:
You have a dataset with 3 features: `feature1`, `feature2`, and `feature3`. You want to extract 2 new features that capture the most important information in the data.

#### Original Data:
| feature1 | feature2 | feature3 |
|----------|----------|----------|
| 2        | 3        | 5        |
| 4        | 6        | 7        |
| 6        | 8        | 10       |
| 8        | 9        | 12       |

#### Step 1: Standardize the Data
PCA is sensitive to feature scale, so standardize the data first (mean = 0, variance = 1).

#### Step 2: Apply PCA
PCA computes the principal components (new features). Suppose the top 2 principal components are:
- **PC1**: Captures 80% of the variance
- **PC2**: Captures 15% of the variance

#### Step 3: Transform Data
The standardized data is projected onto the new feature space defined by PC1 and PC2. The values below are illustrative.

| PC1      | PC2      |
|----------|----------|
| -1.2     | 0.3      |
| -0.5     | 0.1      |
| 0.8      | -0.2     |
| 0.9      | -0.2     |

Now, the 3 original features are replaced with 2 new features (PC1 and PC2) that capture 95% of the variance.
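
One property worth checking is that the extracted features are uncorrelated even when the original features are not. The sketch below uses synthetic data (an assumption, generated with a fixed seed) in which two features are nearly copies of each other.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: features 0 and 1 share a common signal, feature 2 is independent
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(100, 1)),
    2 * base + 0.1 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

X_std = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(X_std)

# Original features 0 and 1 are strongly correlated
corr_original = np.corrcoef(X_std, rowvar=False)
print(corr_original[0, 1])

# The covariance between the two extracted features is numerically zero
cov_scores = np.cov(scores, rowvar=False)
print(cov_scores[0, 1])
```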

---





## Q5

---

### **Why Use Min-Max Scaling?**
The features `price`, `rating`, and `delivery_time` likely have **different scales**:
- **Price**: Could range from $5 to $50.
- **Rating**: Might be on a scale of 1 to 5.
- **Delivery Time**: Could range from 20 to 90 minutes.

Algorithms such as **k-nearest neighbors (k-NN)** and **neural networks**, both common in recommendation systems, are sensitive to feature scales: without scaling, a feature with a large numeric range (here, price or delivery time) contributes far more to distances and gradients than rating does. Min-Max scaling puts all features on the same scale (e.g., [0, 1]), preventing one feature from dominating the others.
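
A minimal sketch of this preprocessing step with scikit-learn's `MinMaxScaler`. The rows below are hypothetical values invented for illustration, not data from the actual project.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical food delivery records (assumed values)
df = pd.DataFrame({
    "price": [5, 12, 30, 50],
    "rating": [3.5, 4.8, 1.0, 5.0],
    "delivery_time": [20, 45, 90, 60],
})

# Scale every column to [0, 1] independently
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_scaled)
```

In a real project the scaler would typically be fitted on the training split only and then reused to transform validation and test data.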




## Q6

### **Using PCA for Dimensionality Reduction in Stock Price Prediction**

In a stock price prediction project, the dataset often contains **many features** (e.g., company financial data, market trends, technical indicators). These features can be **highly correlated** or **redundant**, leading to inefficiency and overfitting in the model. **Principal Component Analysis (PCA)** is a powerful technique to reduce dimensionality while retaining most of the information in the data.

---

### **Steps to Apply PCA for Dimensionality Reduction**

#### **1. Standardize the Data**
PCA is sensitive to the scale of the features, so the first step is to standardize the data (mean = 0, variance = 1). This ensures that all features contribute equally to the principal components.

```python
from sklearn.preprocessing import StandardScaler

# Standardize the dataset (`data` is the numeric feature matrix, shape (n_samples, n_features))
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)
```

---

#### **2. Compute Principal Components**
PCA identifies **principal components** (PCs), which are linear combinations of the original features. These components are orthogonal (uncorrelated) and ranked by the amount of variance they explain.

```python
from sklearn.decomposition import PCA

# Apply PCA
pca = PCA()  # By default, computes all components
pca.fit(data_standardized)
```

---

#### **3. Analyze Explained Variance**
The **explained variance ratio** tells you how much variance each principal component captures. This helps decide how many components to retain.

```python
# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
```

- Example Output:
  ```
  [0.45, 0.30, 0.15, 0.05, 0.03, ...]
  ```
  - The first PC explains 45% of the variance.
  - The second PC explains 30% of the variance.
  - The third PC explains 15% of the variance, and so on.

---

#### **4. Select the Number of Components**
Choose the number of components (\(k\)) that capture a significant portion of the variance (e.g., 95%). This reduces dimensionality while retaining most of the information.

```python
# Retain components that explain 95% of the variance
pca = PCA(n_components=0.95)  # Automatically selects k
data_pca = pca.fit_transform(data_standardized)
```
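
To see the automatic selection in action without real market data, the sketch below builds a synthetic dataset (an assumption: 20 features generated from 3 hidden factors plus small noise) and lets `PCA(n_components=0.95)` choose \(k\). The fitted attribute `n_components_` reports how many components were kept.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for correlated features: 3 hidden factors drive 20 columns
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 20))

X_std = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep enough components to reach that variance share
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_std)

print(pca.n_components_)                    # number of components kept
print(pca.explained_variance_ratio_.sum())  # at least 0.95
```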

---

#### **5. Transform the Data**
Project the original data onto the selected principal components to obtain the reduced-dimensional representation.

```python
# Transformed data with reduced dimensions
print(data_pca.shape)  # e.g., (n_samples, k)
```

---

### **Example of PCA in Stock Price Prediction**

#### **Dataset Features**
Suppose the dataset contains the following features:
- **Company Financial Data**: Revenue, profit, debt-to-equity ratio, etc.
- **Market Trends**: Moving averages, RSI, MACD, etc.
- **Economic Indicators**: Interest rates, inflation, GDP growth, etc.

#### **Step 1: Standardize the Data**
- Scale all features to have a mean of 0 and a standard deviation of 1.

#### **Step 2: Apply PCA**
- Compute the principal components. Suppose the top 3 components explain 90% of the variance:
  - PC1: Captures 60% of the variance (e.g., a combination of revenue, profit, and GDP growth).
  - PC2: Captures 25% of the variance (e.g., a combination of moving averages and RSI).
  - PC3: Captures 5% of the variance (e.g., a combination of debt-to-equity ratio and interest rates).

#### **Step 3: Transform the Data**
- Replace the original 20+ features with 3 principal components.

| Original Data (20+ Features) | Transformed Data (3 PCs) |
|------------------------------|--------------------------|
| Revenue, Profit, RSI, ...    | PC1, PC2, PC3            |

---

### **Advantages of Using PCA**
1. **Dimensionality Reduction**:
   - Reduces the number of features, making the model computationally efficient.
   - Helps avoid the **curse of dimensionality** in high-dimensional datasets.

2. **Noise Reduction**:
   - By keeping only the high-variance components, PCA discards low-variance directions, which often (though not always) correspond to noise.

3. **Multicollinearity Handling**:
   - PCA creates orthogonal (uncorrelated) components, eliminating multicollinearity issues in the dataset.

4. **Improved Model Performance**:
   - Can reduce overfitting by removing redundant features and giving the model fewer, uncorrelated inputs.

---

### **Limitations and Considerations**
1. **Interpretability**:
   - Principal components are linear combinations of the original features and may not have a clear physical meaning.

2. **Loss of Information**:
   - If too few components are retained, some information may be lost, potentially affecting model accuracy.

3. **Non-Linear Relationships**:
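   - PCA is a linear technique and may not capture non-linear relationships in the data. For such cases, consider **Kernel PCA**. **t-SNE** also handles non-linear structure, but it is mainly a visualization tool and does not provide a transform for new data, so it is less suited to building model features.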
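
As a brief sketch of the Kernel PCA alternative, scikit-learn provides `KernelPCA`. The example uses the `make_circles` toy dataset rather than stock data, and the RBF kernel with `gamma=10` is a hand-picked assumption, not a tuned value.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Toy non-linear data: two concentric circles
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# RBF kernel lets the projection follow non-linear structure
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

print(X_kpca.shape)  # (200, 2)
```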

---

### **Python Code Example**

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Example dataset
data = {
    'revenue': [100, 200, 150, 300],
    'profit': [10, 20, 15, 30],
    'debt_to_equity': [0.5, 0.3, 0.4, 0.6],
    'RSI': [70, 50, 60, 40],
    'MACD': [0.1, -0.2, 0.05, -0.1]
}
df = pd.DataFrame(data)

# Step 1: Standardize the data
scaler = StandardScaler()
data_standardized = scaler.fit_transform(df)

# Step 2: Apply PCA
pca = PCA(n_components=0.95)  # Retain 95% of the variance
data_pca = pca.fit_transform(data_standardized)

# Convert the result to a DataFrame
df_pca = pd.DataFrame(data_pca, columns=[f'PC{i+1}' for i in range(data_pca.shape[1])])
print("Reduced Data (PCA):")
print(df_pca)

# Explained variance ratio
print("\nExplained Variance Ratio:")
print(pca.explained_variance_ratio_)
```

---

### **Conclusion**
By using PCA, you can reduce the dimensionality of the stock price prediction dataset while retaining most of its variance. This can improve model efficiency and reduce overfitting. Note that components are selected by variance, not by their relevance to the price target, so the effect on predictive accuracy should be validated.


## Q7

In [None]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Original dataset
data = np.array([1, 5, 10, 15, 20]).reshape(-1, 1)  # Reshape to 2D for sklearn

# Initialize MinMaxScaler with the desired range
scaler = MinMaxScaler(feature_range=(-1, 1))

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

# Flatten the result back to 1D
scaled_data = scaled_data.flatten()
print(scaled_data)

[-1.         -0.57894737 -0.05263158  0.47368421  1.        ]
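
For reference, a sketch of the arithmetic behind `feature_range=(-1, 1)`: scale to [0, 1] first, then stretch and shift to the target range [a, b]. The names `a`, `b`, and `x01` are assumptions for illustration.

```python
import numpy as np

x = np.array([1, 5, 10, 15, 20], dtype=float)
a, b = -1.0, 1.0  # target range

# Step 1: standard min-max scaling to [0, 1]
x01 = (x - x.min()) / (x.max() - x.min())

# Step 2: stretch by the width of the target range and shift by its lower bound
x_scaled = x01 * (b - a) + a

print(x_scaled)
```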


## Q8

### **Feature Extraction Using PCA**

For the dataset containing the features `[height, weight, age, gender, blood pressure]`, we can use **Principal Component Analysis (PCA)** to extract the most important features (principal components) and reduce dimensionality. Here's how to approach this:

---

### **Step 1: Preprocess the Data**
1. **Encode Categorical Variables**:
   - The `gender` feature is categorical (e.g., Male/Female). Convert it to numerical values (e.g., 0 for Male, 1 for Female) using **one-hot encoding** or **label encoding**.

2. **Standardize the Data**:
   - PCA is sensitive to the scale of the features. Standardize all features to have a mean of 0 and a standard deviation of 1.

```python
from sklearn.preprocessing import StandardScaler

# Standardize the dataset (`data` is the feature matrix with `gender` already encoded)
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)
```
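
The encoding step can be sketched as follows, assuming `gender` arrives as the strings `"Male"` and `"Female"`. The records are hypothetical, and the mapping (0 for Male, 1 for Female) follows the convention mentioned above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw records with a string-valued gender column
df = pd.DataFrame({
    "height": [160, 170, 155, 180, 165],
    "weight": [60, 70, 55, 80, 65],
    "age": [25, 30, 22, 35, 28],
    "gender": ["Male", "Female", "Male", "Female", "Male"],
    "blood_pressure": [120, 130, 110, 140, 125],
})

# Label-encode the binary category
df["gender"] = df["gender"].map({"Male": 0, "Female": 1})

# Standardize all columns (mean 0, standard deviation 1)
data_standardized = StandardScaler().fit_transform(df)

print(data_standardized.mean(axis=0).round(6))
print(data_standardized.std(axis=0).round(6))
```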

---

### **Step 2: Apply PCA**
1. Compute the principal components (PCs) using PCA.
2. Analyze the **explained variance ratio** to determine how much variance each PC captures.

```python
from sklearn.decomposition import PCA

# Apply PCA
pca = PCA()
pca.fit(data_standardized)

# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
```

---

### **Step 3: Decide How Many Components to Retain**
The number of principal components to retain depends on the **explained variance**. A common approach is to retain enough components to capture a significant portion of the variance (e.g., 95%).

#### Example Output:
Suppose the explained variance ratio is:
```
[0.45, 0.30, 0.15, 0.07, 0.03]
```
- PC1: 45% of variance
- PC2: 30% of variance
- PC3: 15% of variance
- PC4: 7% of variance
- PC5: 3% of variance

#### Cumulative Explained Variance:
- PC1 + PC2: 75% of variance
- PC1 + PC2 + PC3: 90% of variance
- PC1 + PC2 + PC3 + PC4: 97% of variance

#### Decision:
- Retain **3 principal components** to capture **90% of the variance**, if a 90% threshold is acceptable.
- Alternatively, retain **4 principal components** to capture **97% of the variance**, which is what the 95% threshold mentioned above requires.
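
The cumulative sums above can be reproduced with `np.cumsum`. In this sketch the ratios are the hypothetical values from the example, and `components_needed` is a helper name introduced here for illustration.

```python
import numpy as np

# Hypothetical explained variance ratios from the example
ratios = np.array([0.45, 0.30, 0.15, 0.07, 0.03])
cumulative = np.cumsum(ratios)

def components_needed(cumulative, target):
    """Return the smallest k whose cumulative explained variance reaches target."""
    return int(np.argmax(cumulative >= target)) + 1

print(components_needed(cumulative, 0.95))  # 4
print(components_needed(cumulative, 0.80))  # 3
```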

---

### **Step 4: Transform the Data**
Project the original data onto the selected principal components.

```python
# Retain 3 principal components
pca = PCA(n_components=3)
data_pca = pca.fit_transform(data_standardized)

# Transformed data with reduced dimensions
print(data_pca.shape)  # e.g., (n_samples, 3)
```

---

### **Why Retain 3 or 4 Principal Components?**
1. **Trade-off Between Dimensionality and Information**:
   - Retaining 3 components captures 90% of the variance, significantly reducing dimensionality while preserving most of the information.
   - Retaining 4 components captures 97% of the variance, which is closer to the original dataset but with fewer features.

2. **Avoid Overfitting**:
   - Reducing the number of features helps prevent overfitting, especially in smaller datasets.

3. **Interpretability**:
   - Fewer components make it easier to visualize and interpret the data (e.g., in 2D or 3D plots).

---

### **Python Code Example**

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Example dataset
data = {
    'height': [160, 170, 155, 180, 165],
    'weight': [60, 70, 55, 80, 65],
    'age': [25, 30, 22, 35, 28],
    'gender': [0, 1, 0, 1, 0],  # Encoded as 0 (Male) and 1 (Female)
    'blood_pressure': [120, 130, 110, 140, 125]
}
df = pd.DataFrame(data)

# Step 1: Standardize the data
scaler = StandardScaler()
data_standardized = scaler.fit_transform(df)

# Step 2: Apply PCA
pca = PCA(n_components=3)  # Retain 3 principal components
data_pca = pca.fit_transform(data_standardized)

# Convert the result to a DataFrame
df_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3'])
print("Reduced Data (PCA):")
print(df_pca)

# Explained variance ratio
print("\nExplained Variance Ratio:")
print(pca.explained_variance_ratio_)
```

---

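### **Output**
The values below are illustrative placeholders that mirror the hypothetical variance ratios from Step 3; they are not the result of running the code above. In this toy dataset `weight` is exactly `height - 100` and the remaining features are strongly correlated, so the first component will likely explain far more than 45% of the variance.

#### Reduced Data (PCA), illustrative:
|       PC1 |       PC2 |       PC3 |
|----------:|----------:|----------:|
| -1.264911 |  0.316228 | -0.158114 |
| -0.632456 |  0.158114 |  0.079057 |
|  0.632456 | -0.158114 | -0.079057 |
|  1.264911 | -0.316228 |  0.158114 |
|  0.000000 |  0.000000 |  0.000000 |

#### Explained Variance Ratio, illustrative:
```
[0.45, 0.30, 0.15]
```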

---

### **Conclusion**
By retaining **3 principal components**, we reduce the dimensionality of the dataset from 5 features to 3 while capturing **90% of the variance** in the hypothetical scenario above. This makes the dataset more manageable for modeling and visualization while preserving most of the information. If more of the variance must be preserved, retain **4 components** to capture **97% of the variance**.