### Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

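Ans - Min-Max scaling is a data preprocessing technique that rescales each feature to a common range, typically 0 to 1, using x' = (x - min) / (max - min), where min and max are that feature's smallest and largest values. It prevents features with larger values from dominating the learning process in machine learning algorithms, especially those based on distances or gradients. The example below scales the `total_bill` column of seaborn's `tips` dataset.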
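
The formula can also be applied by hand. The following is a minimal NumPy sketch; the helper name `min_max_scale` and the sample values are hypothetical and not part of the original notebook.

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D array to [0, 1] (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # A constant feature has no spread; return zeros to avoid dividing by zero.
        return np.zeros_like(x)
    return (x - x.min()) / span

bills = np.array([10.0, 20.0, 15.0, 25.0, 30.0])  # made-up values
scaled = min_max_scale(bills)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.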



In [1]:
import seaborn as sns
import pandas as pd 
from sklearn.preprocessing import MinMaxScaler
df = sns.load_dataset("tips")
scaler = MinMaxScaler()

In [2]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [3]:
scaler.fit(df[["total_bill"]])

In [4]:
df["total_bill"] = scaler.transform(df[["total_bill"]]).ravel()  # values are now scaled to the range 0 to 1

In [5]:
df.head()  # total_bill now holds the scaled values

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,0.291579,1.01,Female,No,Sun,Dinner,2
1,0.152283,1.66,Male,No,Sun,Dinner,3
2,0.375786,3.5,Male,No,Sun,Dinner,3
3,0.431713,3.31,Male,No,Sun,Dinner,2
4,0.450775,3.61,Female,No,Sun,Dinner,4


### Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

Ans - The Unit Vector technique, often called normalization, divides a vector by its norm (usually the L2 norm) so that the result has a length of 1. Min-Max scaling, by contrast, maps each feature to a fixed range such as 0 to 1, based on that feature's minimum and maximum. Unit-vector scaling can be applied per sample (each row becomes a unit vector, which is the default `axis=1` in scikit-learn's `normalize`) or per feature (each column becomes a unit vector, `axis=0`). It is useful when the direction of a vector matters more than its magnitude. The example below uses `axis=0`, so the `fare` and `age` columns are each rescaled to unit length.
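
To make the difference between scaling rows and scaling columns concrete, here is a small NumPy sketch. The array values are made up for illustration.

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [6.0, 8.0]])  # made-up values: 2 samples, 2 features

# Row-wise: each sample (row) is divided by its own L2 norm.
row_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Column-wise: each feature (column) is divided by its own L2 norm.
col_unit = X / np.linalg.norm(X, axis=0, keepdims=True)

print(row_unit)
print(col_unit)
```

After row-wise scaling every row has length 1, and after column-wise scaling every column has length 1.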

In [6]:
from sklearn.preprocessing import normalize
data = {
    'fare': [50.0, 20.0, 100.0, 30.0],
    'age': [25, 30, 22, 35]
}
titanic = pd.DataFrame(data)

# Extract the "fare" and "age" columns
columns = titanic[['fare', 'age']].values

# Normalize each column (axis=0) to unit L2 norm with sklearn's normalize function.
# The default, axis=1, would normalize each row instead.
normalized_columns = normalize(columns, axis=0)

In [7]:
normalized_columns

array([[0.42562827, 0.43961247],
       [0.17025131, 0.52753496],
       [0.85125653, 0.38685897],
       [0.25537696, 0.61545745]])

### Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

Ans - Principal Component Analysis (PCA) is a technique for reducing the dimensionality of a dataset while retaining important information. It involves finding new directions (principal components) in the data that capture the most variance.

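Here are the steps required:

1. Center the data by subtracting each feature's mean.
2. Compute the covariance matrix of the centered data.
3. Find the eigenvectors and eigenvalues of the covariance matrix.
4. Select the top eigenvectors (principal components), ranked by eigenvalue.
5. Project the centered data onto the selected principal components.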
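
The steps above can be sketched directly in NumPy. This is an illustrative version, not how scikit-learn implements PCA internally. It uses the same four data points as the scikit-learn example in this answer and keeps a single component.

```python
import numpy as np

data = np.array([[5, 10],
                 [10, 15],
                 [2, 4],
                 [15, 30]], dtype=float)

# 1. Center the data.
centered = data - data.mean(axis=0)

# 2. Compute the covariance matrix (features are in columns).
cov = np.cov(centered, rowvar=False)

# 3. Find eigenvalues and eigenvectors; eigh is meant for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Select the eigenvector with the largest eigenvalue.
top_component = eigenvectors[:, np.argmax(eigenvalues)]

# 5. Project the centered data onto that component.
projected = centered @ top_component
print(projected)
```

The sign of an eigenvector is arbitrary, so the projection may come out with every sign flipped relative to another implementation. The magnitudes should still agree.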

In [8]:
import numpy as np
from sklearn.decomposition import PCA
data = np.array([[5, 10],
                 [10, 15],
                 [2, 4],
                 [15, 30]])

# Create a PCA
pca = PCA(n_components=1)

# Fit the PCA model to the data and transform the data to the new space
reduced_data = pca.fit_transform(data)

In [9]:
reduced_data ## reduced to one dimension

array([[ -5.59311761],
       [  1.12710687],
       [-12.30122302],
       [ 16.76723375]])

### Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

###### Relationship between PCA and Feature Extraction 

PCA (Principal Component Analysis) is one method of feature extraction. Feature extraction is any technique that transforms the original features of a dataset into a new, usually smaller, set of features. PCA does this without using labels: each new feature (principal component) is a linear combination of the original features, chosen to capture as much of the data's variance as possible. Other feature extraction methods may instead aim for features that are more informative or discriminative for a specific task.

###### Use of PCA to extract features

PCA can be used as a feature extraction technique when you want to reduce the dimensionality of your dataset while preserving as much information as possible. It involves selecting a subset of the most important principal components (new features) to represent the data.
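
The sketch below shows what "new features" means in practice: each extracted feature is a weighted sum of the centered original features. It uses NumPy's SVD on small synthetic data. The sizes (20 samples, 6 features, 2 components) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 6))  # 20 synthetic samples, 6 original features

X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
components = Vt[:k]                    # shape (2, 6): weights on the original features
extracted = X_centered @ components.T  # shape (20, 2): the new features

# Share of total variance captured by each principal direction.
explained_ratio = S**2 / np.sum(S**2)
print(extracted.shape, explained_ratio[:k])
```

Each row of `components` holds the weights that combine the six original features into one extracted feature.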

In [10]:
import numpy as np
from sklearn.decomposition import PCA

# Create synthetic data with features representing pixel values
np.random.seed(42)
data = np.random.rand(100, 784)  # 100 images with 28x28 pixels each

# Apply PCA for feature extraction
pca = PCA(n_components=50)
reduced_data = pca.fit_transform(data)

In [11]:
reduced_data

array([[-3.72585183e-01, -7.80426847e-01,  2.57979403e-01, ...,
         1.51068069e-01, -8.36077155e-01,  3.91466911e-01],
       [-1.54016590e+00, -9.18066395e-01,  1.85965416e+00, ...,
        -3.19249530e-01,  6.39850080e-01, -1.43101765e+00],
       [-1.99253999e-01,  2.23871692e-01, -8.49146957e-01, ...,
        -6.63910112e-01,  1.02218288e+00,  8.64677988e-01],
       ...,
       [-1.02561497e+00, -4.35847746e-01,  1.11649705e-04, ...,
        -1.74148388e-01,  1.13122654e-01, -1.82846284e-02],
       [-3.41443171e-01, -7.25966878e-01, -1.37117766e+00, ...,
         5.23414197e-01,  1.12774227e+00, -5.09522490e-02],
       [-1.31821721e+00,  1.01920088e+00,  2.32253878e-01, ...,
         2.52473491e-02, -4.08119204e-01,  4.77516033e-01]])

### Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

Ans - Price, rating, and delivery time are measured in different units and ranges, so the features with larger values, such as price or delivery time, could dominate the learning process. Min-Max scaling rescales each column independently to the range 0 to 1 by subtracting the column minimum and dividing by the column range. The steps are: fit a `MinMaxScaler` on the three columns, transform them, and use the scaled values as input to the recommendation model.

In [12]:
# First, generate some sample data
df = pd.DataFrame({
    'price': [10.0, 20.0, 15.0, 25.0, 30.0],
    'rating': [4.2, 3.8, 4.5, 4.0, 3.7],
    'delivery_time': [30, 45, 25, 50, 40]})
df

Unnamed: 0,price,rating,delivery_time
0,10.0,4.2,30
1,20.0,3.8,45
2,15.0,4.5,25
3,25.0,4.0,50
4,30.0,3.7,40


In [13]:
from sklearn.preprocessing import MinMaxScaler 

In [14]:
scaler = MinMaxScaler() 
scaled_Data = scaler.fit_transform(df) ## transforming the data

In [15]:
scaled_df = pd.DataFrame(scaled_Data, columns=['price','rating','delivery_time'])
scaled_df

Unnamed: 0,price,rating,delivery_time
0,0.0,0.625,0.2
1,0.5,0.125,0.8
2,0.25,1.0,0.0
3,0.75,0.375,1.0
4,1.0,0.0,0.6


### Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.

Ans - 
###### Steps to use PCA to reduce the dimensionality of the dataset

1. Data Collection: Gather a dataset containing various features related to company financial data and market trends.

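2. Data Preprocessing: Clean and preprocess the data, including handling missing values and scaling the features. Scaling matters because PCA is driven by variance, so features with large numeric ranges would otherwise dominate the components.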

3. Apply PCA: Instantiate a PCA model, specifying the desired number of components or explained variance. Fit the PCA model on the preprocessed data.

4. Explained Variance Ratio: Check the explained variance ratio of each principal component to understand their importance in capturing the data's variance. Decide how many components to keep based on cumulative explained variance.

5. Transform the Data: Transform the original data using the fitted PCA model to obtain lower-dimensional representations.

6. Modeling: Split the transformed data into training and testing sets. Train a stock price prediction model (e.g., Linear Regression, Random Forest) on the reduced data.

7. Evaluation and Interpretation: Make predictions using the trained model. Evaluate the model's performance using appropriate metrics (e.g., Mean Squared Error). Interpret the model's predictions and insights to inform trading decisions.
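
The steps above can be sketched as a scikit-learn pipeline. Everything about the data here is an assumption: the features and target are synthetic stand-ins generated from three hidden factors, not real financial data. The sketch also splits the data before fitting, so the scaler and PCA learn only from the training rows. That is a small reordering of the steps above, and `shuffle=False` keeps the rows in their original order, as one might want for time-ordered data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 200 rows, 20 correlated features driven by 3 hidden factors.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(200, 20))
y = 2.0 * latent[:, 0] + rng.normal(scale=0.1, size=200)  # synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

# Standardize, keep enough components for 95% of the variance, then regress.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())
model.fit(X_train, y_train)

n_kept = model.named_steps["pca"].n_components_
mse = mean_squared_error(y_test, model.predict(X_test))
print(n_kept, mse)
```

Passing a float between 0 and 1 as `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction.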

### Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

In [16]:
A = np.array([[1, 5, 10, 15, 20]]).T  # column vector: one feature, five samples
scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit(A)
new_A = scaler.transform(A)

In [17]:
new_A

array([[-1.        ],
       [-0.57894737],
       [-0.05263158],
       [ 0.47368421],
       [ 1.        ]])
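
The same result can be checked by hand with the general formula x' = low + (x - min) * (high - low) / (max - min). The helper name `min_max_scale_to_range` below is hypothetical and used only for this check.

```python
import numpy as np

def min_max_scale_to_range(x, low, high):
    """Scale a 1-D array to [low, high] (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    unit = (x - x.min()) / (x.max() - x.min())  # first scale to [0, 1]
    return low + unit * (high - low)            # then stretch and shift

values = [1, 5, 10, 15, 20]
scaled = min_max_scale_to_range(values, -1, 1)
print(scaled)
```

For example, the value 5 becomes -1 + (5 - 1) * 2 / 19, which is about -0.579.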

### Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

In [18]:
data = pd.DataFrame(np.random.rand(100,5),columns =['height', 'weight', 'age', 'gender', 'blood pressure'])

In [19]:
data.head()

Unnamed: 0,height,weight,age,gender,blood pressure
0,0.644689,0.753715,0.026725,0.177861,0.621223
1,0.525584,0.051354,0.93541,0.75851,0.358796
2,0.536398,0.374256,0.455289,0.596184,0.398467
3,0.620718,0.465147,0.657261,0.262222,0.171192
4,0.013582,0.60198,0.980781,0.420387,0.212628


In [20]:
from sklearn.decomposition import PCA
pca = PCA()

In [21]:
pca.fit(data)

In [22]:
# Cumulative share of variance explained by the first k components
cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
# argmax returns the zero-based index of the first value >= 0.95, so add 1 to get a count
num_components_to_retain = np.argmax(cumulative_variance_ratio >= 0.95) + 1

# Transform the data using the chosen number of components
reduced_data = pca.transform(data)[:, :num_components_to_retain]



In [23]:
print(num_components_to_retain)  # all 5 components are needed to reach 95% of the variance
print(reduced_data[:5])

5
[[ 0.59484083  0.00999533  0.29923482  0.00410237  0.07114741]
 [-0.65532063 -0.01385689 -0.10425449  0.11720733  0.08370854]
 [-0.1239956  -0.05674449  0.10550438 -0.06860186  0.07135177]
 [-0.04550106 -0.27252296  0.01289805  0.23405312 -0.1864289 ]
 [-0.13015997 -0.33683331 -0.6486938   0.0131586  -0.0417413 ]]
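
All five components are retained here because the data comes from `np.random.rand`: the five columns are independent with roughly equal variance, so each component explains about a fifth of the total and none can be dropped without falling below 95%. Real measurements such as height and weight are usually correlated, and then fewer components suffice. The sketch below illustrates this with synthetic features that all follow one hidden factor. The data is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Five synthetic features that all track one underlying factor, plus small noise.
base = rng.normal(size=(100, 1))
X = base @ np.ones((1, 5)) + 0.05 * rng.normal(size=(100, 5))

# A float n_components keeps just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(X)

print(pca.n_components_, pca.explained_variance_ratio_)
```

With real data, the features would normally be standardized first, and a categorical column such as gender would need to be encoded numerically before PCA could be applied.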

