# Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its
# application.

 Min-Max Scaling is a data preprocessing technique used to normalize the range of features or variables in a dataset. It transforms the values of each feature to a specific range, typically between 0 and 1. This is achieved by subtracting the minimum value of the feature and dividing by the range (maximum value minus minimum value).
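
As a quick check of the formula described above, x' = (x - min) / (max - min), the following is a minimal sketch that applies it by hand with NumPy. The values are borrowed from the `Size` column of the example for this question, and the variable names are only illustrative.

```python
import numpy as np

# Illustrative values (same as the 'Size' column in the example for this question)
x = np.array([600, 2500, 1200, 3000, 900], dtype=float)

# Min-Max scaling: subtract the minimum, then divide by the range (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)
```

The smallest value maps to 0 and the largest to 1, with the other values placed proportionally in between.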
 
How is it Used in Data Preprocessing?
Min-Max Scaling is commonly applied in the preprocessing stage of machine learning projects so that features with large numeric ranges do not dominate features with small ranges. This matters most for distance-based algorithms (e.g., k-nearest neighbors), for support vector machines, and for models trained with gradient-based optimization (e.g., neural networks), which tend to converge faster on scaled inputs.

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Create a DataFrame
data = {'Size': [600, 2500, 1200, 3000, 900],
        'Price': [150000, 400000, 250000, 500000, 180000]}
df = pd.DataFrame(data)

# Apply Min-Max Scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)

# Convert to DataFrame for better readability
scaled_df = pd.DataFrame(scaled_data, columns=['Size', 'Price'])

print(scaled_df)


       Size     Price
0  0.000000  0.000000
1  0.791667  0.714286
2  0.250000  0.285714
3  1.000000  1.000000
4  0.125000  0.085714


# Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling?
# Provide an example to illustrate its application.

The Unit Vector technique, also known as Normalization or Vector Normalization, is a feature scaling method that transforms each data point in the dataset to a unit vector. A unit vector has a magnitude (or Euclidean norm) of 1. This technique scales each feature vector (i.e., row in the dataset) so that the entire vector has a length of 1.

How Does it Differ from Min-Max Scaling?

Range vs. Magnitude:
Min-Max Scaling adjusts each feature to a specific range (commonly [0, 1]) based on the minimum and maximum values of that feature across all data points.
Unit Vector Scaling adjusts each data point (i.e., row) so that its overall magnitude is 1, maintaining the direction of the data point in feature space but normalizing its length.
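
To make the row-wise versus column-wise distinction concrete, here is a minimal sketch that computes both scalings by hand with NumPy. The small matrix `X` and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical data: 2 samples (rows), 2 features (columns)
X = np.array([[3.0, 4.0],
              [1.0, 2.0]])

# Unit vector scaling: divide each row by its own Euclidean (L2) norm
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

# Min-Max scaling: rescale each column using that column's min and max
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_unit)    # every row now has length 1
print(X_minmax)  # every column now spans [0, 1]
```

Unit vector scaling uses only the values within a row, whereas Min-Max scaling needs the minimum and maximum of each column across all rows.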


In [2]:
import pandas as pd
from sklearn.preprocessing import Normalizer

# Create a DataFrame
data = {'Feature1': [3, 1, 5],
        'Feature2': [4, 2, 6]}
df = pd.DataFrame(data)

# Apply Unit Vector Scaling (Normalization)
normalizer = Normalizer()
normalized_data = normalizer.fit_transform(df)

# Convert to DataFrame for better readability
normalized_df = pd.DataFrame(normalized_data, columns=['Feature1', 'Feature2'])

print(normalized_df)


   Feature1  Feature2
0  0.600000  0.800000
1  0.447214  0.894427
2  0.640184  0.768221


# Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an
# example to illustrate its application.

Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while preserving as much variance as possible. It does this by transforming the original features into a new set of uncorrelated features called principal components, which are ordered by the amount of variance they capture from the data.

How PCA is Used in Dimensionality Reduction:
Standardize the Data: PCA works on mean-centered data and is sensitive to feature scale, so features are typically standardized (zero mean, unit variance) before applying PCA.
Compute the Covariance Matrix: This matrix captures the relationships between different features.
Calculate Eigenvectors and Eigenvalues: The eigenvectors of the covariance matrix give the directions of the principal components, and the eigenvalues give the variance captured along each direction (its importance).
Select Principal Components: The top components (with the highest eigenvalues) are selected to reduce dimensionality, keeping the most important information.
Transform the Data: The original data is projected onto the selected principal components to obtain a lower-dimensional representation.
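
The five steps above can be sketched directly with NumPy. This is a minimal illustration under stated assumptions, not how scikit-learn implements PCA internally (scikit-learn uses a singular value decomposition). The data values mirror the sample DataFrame in this answer, and `k` and the variable names are illustrative.

```python
import numpy as np

# Illustrative data: 10 samples, 3 features
X = np.column_stack([
    [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
    [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
    [1.5, 0.2, 1.8, 1.4, 2.0, 1.7, 1.2, 0.9, 1.0, 0.5],
])

# 1. Standardize: zero mean and unit variance per feature
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(Z, rowvar=False)

# 3. Eigen-decomposition (eigh is meant for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Select the top k components
k = 2
W = eigvecs[:, :k]

# 5. Project the data onto the selected components
projected = Z @ W

print(projected.shape)
print(eigvals[:k] / eigvals.sum())  # share of variance per kept component
```

The sign of each projected column is arbitrary, so the values may differ from a library's output by a factor of -1 per component.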

In [3]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample DataFrame
data = {'Feature1': [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
        'Feature2': [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
        'Feature3': [1.5, 0.2, 1.8, 1.4, 2.0, 1.7, 1.2, 0.9, 1.0, 0.5]}
df = pd.DataFrame(data)

# Standardize the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Apply PCA to reduce to 2 components
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df_scaled)

# Convert to DataFrame for readability
pca_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])

print(pca_df)


        PC1       PC2
0 -1.180178 -0.288121
1  2.961279  0.123194
2 -1.628910  0.512043
3 -0.469285  0.181215
4 -2.604368 -0.298370
5 -1.455327  0.223483
6  0.098673 -0.418672
7  1.545265  0.188687
8  0.695080  0.024176
9  2.037771 -0.247635


# Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature
# Extraction? Provide an example to illustrate this concept.

Relationship Between PCA and Feature Extraction
Principal Component Analysis (PCA) and Feature Extraction are closely related concepts.

Feature Extraction involves creating new features from the existing ones to capture the most important information. It aims to reduce the dimensionality of the data while retaining essential characteristics.

PCA is a specific method used for feature extraction. It transforms the original features into a new set of uncorrelated features (principal components) that capture the maximum variance in the data. These principal components are linear combinations of the original features and are ordered by the amount of variance they explain.

How PCA Can Be Used for Feature Extraction
PCA can be used for feature extraction by selecting the top principal components that capture the most variance. These components can then be used as new features in machine learning models, potentially improving performance by reducing noise and multicollinearity in the original features.
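
To illustrate that the extracted features are linear combinations of the original ones, the sketch below recomputes the PCA scores by hand from the fitted `components_` matrix. The data is hypothetical and the variable names are assumptions; `components_` and `mean_` are attributes of scikit-learn's fitted `PCA` object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 6 samples, 3 correlated features
X = np.array([[1.0, 2.1, 0.5],
              [2.0, 3.9, 1.5],
              [3.0, 6.2, 1.0],
              [4.0, 8.1, 2.5],
              [5.0, 9.8, 2.0],
              [6.0, 12.1, 3.5]])

Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(Z)

# Each row of components_ holds the weights of one principal component
# over the standardized input features
print(pca.components_)

# Recompute the scores by hand as weighted sums of the features
manual_scores = (Z - pca.mean_) @ pca.components_.T
print(np.allclose(scores, manual_scores))
```

Inspecting the weights in `components_` shows which original features contribute most to each extracted feature.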

In [4]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample DataFrame
data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': [2, 4, 6, 8, 10],
        'Feature3': [5, 4, 3, 2, 1],
        'Feature4': [10, 9, 8, 7, 6]}
df = pd.DataFrame(data)

# Standardize the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Apply PCA to extract 2 principal components
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df_scaled)

# Convert to DataFrame for readability
pca_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])

print(pca_df)


        PC1           PC2
0  2.828427  3.648565e-16
1  1.414214 -1.216188e-16
2 -0.000000  0.000000e+00
3 -1.414214  1.216188e-16
4 -2.828427  2.432377e-16
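
In the output above, PC2 is zero up to floating-point error. That is expected for this sample data: every feature is an exact linear function of Feature1, so a single component carries all the variance. A minimal sketch to confirm this, using the same sample values and the fitted `explained_variance_ratio_` attribute:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same sample values as the cell above
df = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5],
                   'Feature2': [2, 4, 6, 8, 10],
                   'Feature3': [5, 4, 3, 2, 1],
                   'Feature4': [10, 9, 8, 7, 6]})

pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(df))

# Fraction of the total variance captured by each component
ratios = pca.explained_variance_ratio_
print(ratios)
```

With real data the features are rarely perfectly correlated, so the later components usually carry a small but nonzero share of the variance.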


# Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset
# contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to
# preprocess the data.

To preprocess this dataset, I would fit a Min-Max scaler on price, rating, and delivery time so that each feature is mapped to the [0, 1] range. Because these features are measured on different scales, this keeps any one of them (such as price, which can take large values) from disproportionately influencing the similarity or distance computations in the recommendation system.

In [5]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample DataFrame
data = {'price': [10, 20, 15, 30, 25],
        'rating': [4.5, 3.8, 4.0, 5.0, 4.2],
        'delivery_time': [30, 45, 20, 50, 40]}
df = pd.DataFrame(data)

# Initialize Min-Max Scaler
scaler = MinMaxScaler()

# Apply Min-Max Scaling to the dataset
scaled_data = scaler.fit_transform(df)

# Convert scaled data to a DataFrame for readability
scaled_df = pd.DataFrame(scaled_data, columns=['price', 'rating', 'delivery_time'])

print(scaled_df)


   price    rating  delivery_time
0   0.00  0.583333       0.333333
1   0.50  0.000000       0.833333
2   0.25  0.166667       0.000000
3   1.00  1.000000       1.000000
4   0.75  0.333333       0.666667
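
Once fitted, the same scaler can be reused on items that arrive later, so they are scaled with the minimum and maximum learned from the original data. The sketch below assumes a hypothetical new restaurant; the names `train` and `new_item` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same sample values as the cell above, treated here as the fitting data
train = pd.DataFrame({'price': [10, 20, 15, 30, 25],
                      'rating': [4.5, 3.8, 4.0, 5.0, 4.2],
                      'delivery_time': [30, 45, 20, 50, 40]})
scaler = MinMaxScaler().fit(train)

# Hypothetical new item, scaled with the already-fitted scaler
new_item = pd.DataFrame({'price': [20], 'rating': [4.4], 'delivery_time': [35]})
new_scaled = scaler.transform(new_item)

print(new_scaled)
```

Values outside the range seen during fitting would map outside [0, 1], which is worth keeping in mind when new data can exceed the original minimum or maximum.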


# Q6. You are working on a project to build a model to predict stock prices. The dataset contains many
# features, such as company financial data and market trends. Explain how you would use PCA to reduce the
# dimensionality of the dataset.

I would first standardize the features, since financial and market variables are measured on very different scales, and then fit PCA and keep the fewest components that retain a chosen share of the total variance (95% in the example below). The resulting components replace the original features as model inputs.


In [6]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Example dataset with financial and market features
data = {
    'Revenue': [100, 200, 150, 300, 250],
    'Profit': [10, 20, 15, 30, 25],
    'Assets': [500, 600, 550, 700, 650],
    'Liabilities': [200, 220, 210, 240, 230],
    'Market_Cap': [1000, 1100, 1050, 1150, 1200],
    'Volume': [10000, 15000, 12000, 16000, 14000],
    'PE_Ratio': [15, 16, 15.5, 17, 16.5],
    'Dividend_Yield': [3, 2.5, 3, 2, 2.8]
}

df = pd.DataFrame(data)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Apply PCA
pca = PCA(n_components=0.95)  # Retain 95% of variance
principal_components = pca.fit_transform(scaled_data)

# Convert to DataFrame for readability
pca_df = pd.DataFrame(principal_components, columns=[f'PC{i+1}' for i in range(principal_components.shape[1])])

print(pca_df)


        PC1       PC2
0  3.890336 -0.261988
1 -0.390455 -0.479979
2  2.037196  0.170841
3 -3.793704 -0.721570
4 -1.743372  1.292696
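
When `n_components` is given as a fraction, scikit-learn chooses the number of components itself. A short sketch of how one might inspect that choice, using the same sample values and the fitted attributes `n_components_` and `explained_variance_ratio_`:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same sample values as the cell above
df = pd.DataFrame({
    'Revenue': [100, 200, 150, 300, 250],
    'Profit': [10, 20, 15, 30, 25],
    'Assets': [500, 600, 550, 700, 650],
    'Liabilities': [200, 220, 210, 240, 230],
    'Market_Cap': [1000, 1100, 1050, 1150, 1200],
    'Volume': [10000, 15000, 12000, 16000, 14000],
    'PE_Ratio': [15, 16, 15.5, 17, 16.5],
    'Dividend_Yield': [3, 2.5, 3, 2, 2.8],
})

scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
principal_components = pca.fit_transform(scaled)

# Number of components actually kept, and the variance they explain cumulatively
cumulative = pca.explained_variance_ratio_.cumsum()
print(pca.n_components_)
print(cumulative)
```

The last value of the cumulative sum shows how much of the total variance the retained components preserve.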


# Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the
# values to a range of -1 to 1.

In [7]:
import numpy as np

# Original dataset
data = np.array([1, 5, 10, 15, 20])

# Min-Max Scaling parameters
min_range = -1
max_range = 1

# Min and Max of the original data
data_min = data.min()
data_max = data.max()

# Apply Min-Max Scaling
scaled_data = (data - data_min) * (max_range - min_range) / (data_max - data_min) + min_range

print(scaled_data)


[-1.         -0.57894737 -0.05263158  0.47368421  1.        ]
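
The same result can be obtained with scikit-learn by passing the target range through the `feature_range` parameter of `MinMaxScaler`. This is a minimal sketch; the reshape is needed because the scaler expects a 2D array of shape (n_samples, n_features).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([1, 5, 10, 15, 20], dtype=float)

# Scale to the range [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data.reshape(-1, 1)).ravel()

print(scaled)
```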


# Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform
# Feature Extraction using PCA. How many principal components would you choose to retain, and why?

In [8]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Sample DataFrame with hypothetical data
data = {'height': [170, 160, 180, 175, 165],
        'weight': [70, 60, 80, 75, 65],
        'age': [25, 30, 35, 40, 45],
        'gender': ['male', 'female', 'male', 'female', 'male'],
        'blood_pressure': [120, 130, 110, 140, 125]}
df = pd.DataFrame(data)

# Preprocessing: One-Hot Encoding for 'gender'
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['height', 'weight', 'age', 'blood_pressure']),
        ('cat', OneHotEncoder(drop='first'), ['gender'])
    ])

scaled_data = preprocessor.fit_transform(df)

# Apply PCA
pca = PCA()
pca.fit(scaled_data)

# Explained variance
explained_variance = pca.explained_variance_ratio_
cumulative_variance = explained_variance.cumsum()

# Determine how many components to keep for 95% variance
n_components = next(i for i, total in enumerate(cumulative_variance) if total >= 0.95) + 1

print(f"Number of components to retain: {n_components}")
print(f"Explained variance by each component: {explained_variance}")
print(f"Cumulative explained variance: {cumulative_variance}")


Number of components to retain: 3
Explained variance by each component: [5.34257167e-01 2.96725017e-01 1.56550356e-01 1.24674603e-02
 9.00734575e-36]
Cumulative explained variance: [0.53425717 0.83098218 0.98753254 1.         1.        ]

Three components are retained for this sample data because they are the smallest set whose cumulative explained variance (about 98.8%) reaches the 95% threshold; the first two components alone explain only about 83.1%.
