# Feature Engineering Assignment 

### Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

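Min-Max scaling is a preprocessing technique that linearly rescales each numeric feature to a fixed range, usually [0, 1]. It is commonly applied before training models that are sensitive to feature magnitude, such as neural networks and distance-based methods like k-nearest neighbours.
Formula of min-max scaling --> **x_scaled = (x_i - x_min) / (x_max - x_min)**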
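
As a small hand check of the formula, the sketch below applies it to five fare values in plain Python. The helper name `min_max_scale` is made up for this illustration. Because the minimum and maximum are taken from these five values only, the results will not match a scaler fitted on a full dataset.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

# Five illustrative fare values (min = 5.0, max = 27.0)
fares = [7.0, 5.0, 7.5, 27.0, 9.0]
scaled = min_max_scale(fares)
print([round(v, 6) for v in scaled])
# → [0.090909, 0.0, 0.113636, 1.0, 0.181818]
```

The smallest value maps to 0, the largest to 1, and everything else keeps its relative position in between.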

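scikit-learn provides this as `MinMaxScaler`. The example below applies it to the `fare`, `tip`, and `total` columns of seaborn's `taxis` dataset.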

In [1]:
import pandas as pd

In [2]:
import seaborn as sns

In [3]:
df = sns.load_dataset('taxis')

In [4]:
df.head()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.7,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.1,0.0,13.4,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan


In [5]:
df.columns

Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
       'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
       'pickup_borough', 'dropoff_borough'],
      dtype='object')

In [6]:
from sklearn.preprocessing import MinMaxScaler

In [7]:
scaler = MinMaxScaler()

In [8]:
min_max=scaler.fit_transform(df[[ 'fare', 'tip','total']])

In [9]:
df2=pd.DataFrame(min_max,columns="norm_" + scaler.get_feature_names_out())

In [10]:
df2.head()

Unnamed: 0,norm_fare,norm_tip,norm_total
0,0.040268,0.064759,0.067139
1,0.026846,0.0,0.046104
2,0.043624,0.071084,0.074112
3,0.174497,0.185241,0.205452
4,0.053691,0.033133,0.069733


In [11]:
df3=pd.concat([df,df2],axis=1)

In [12]:
print(df3.head())

               pickup             dropoff  passengers  distance  fare   tip  \
0 2019-03-23 20:21:09 2019-03-23 20:27:24           1      1.60   7.0  2.15   
1 2019-03-04 16:11:55 2019-03-04 16:19:00           1      0.79   5.0  0.00   
2 2019-03-27 17:53:01 2019-03-27 18:00:25           1      1.37   7.5  2.36   
3 2019-03-10 01:23:59 2019-03-10 01:49:51           1      7.70  27.0  6.15   
4 2019-03-30 13:27:42 2019-03-30 13:37:14           3      2.16   9.0  1.10   

   tolls  total   color      payment            pickup_zone  \
0    0.0  12.95  yellow  credit card        Lenox Hill West   
1    0.0   9.30  yellow         cash  Upper West Side South   
2    0.0  14.16  yellow  credit card          Alphabet City   
3    0.0  36.95  yellow  credit card              Hudson Sq   
4    0.0  13.40  yellow  credit card           Midtown East   

            dropoff_zone pickup_borough dropoff_borough  norm_fare  norm_tip  \
0    UN/Turtle Bay South      Manhattan       Manhattan   0.040268

### Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

Unit vector scaling (also called normalization) rescales each individual sample, i.e. each row, so that its feature vector has unit norm. Each value in the row is divided by the row's Euclidean (L2) length, which is the square root of the sum of the squares of its components. After scaling, every sample has length 1, so the direction of the vector is preserved but its magnitude is not.

Min-max scaling, in contrast, works on each feature (column) independently. It subtracts the minimum value of the feature from all of its values and then divides by the range of the feature, which is the difference between its maximum and minimum values.

The table below summarizes the difference between unit vector scaling and min-max scaling:

| Aspect | Unit Vector Scaling | Min-Max Scaling |
|---|---|---|
| Operates on | Each sample (row) | Each feature (column) |
| Purpose | Rescales each sample so that it has unit length | Rescales each feature to a fixed range, usually [0, 1] |
| Formula | $x' = \frac{x}{\|x\|}$ | $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$ |
| Advantages | Useful when only the direction of a sample matters, e.g. cosine similarity between text vectors | Puts all features on a common scale, which helps distance-based and gradient-based algorithms |
| Disadvantages | Discards the magnitude of each sample and does not equalize the scales of different features | Sensitive to outliers, since a single extreme value sets the minimum or maximum and compresses the remaining values |


In [13]:
from sklearn.preprocessing import normalize

In [14]:
df=sns.load_dataset('iris')

In [15]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [16]:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [17]:
# normalize() returns the scaled array directly; by default it L2-normalizes each row
normalized = normalize(df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])

In [18]:
df2 = pd.DataFrame(normalized,columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])

In [19]:
df2.head() ## Normalized dataset

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,0.803773,0.551609,0.220644,0.031521
1,0.828133,0.50702,0.236609,0.033801
2,0.805333,0.548312,0.222752,0.034269
3,0.80003,0.539151,0.260879,0.034784
4,0.790965,0.569495,0.22147,0.031639
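
As a rough check of the unit-vector formula, the sketch below reproduces the first row of the normalized output above by hand, using only the standard library. The helper name `unit_vector` is made up for this illustration and is not part of scikit-learn.

```python
import math

def unit_vector(row):
    """Divide every component of a row by the row's Euclidean (L2) length."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0:
        # An all-zero row has no direction; return it unchanged.
        return list(row)
    return [v / norm for v in row]

# First iris sample: sepal_length, sepal_width, petal_length, petal_width
sample = [5.1, 3.5, 1.4, 0.2]
scaled = unit_vector(sample)

print([round(v, 6) for v in scaled])
# Length of the scaled row; should be 1 up to floating-point error
print(math.sqrt(sum(v * v for v in scaled)))
```

Unlike min-max scaling, the result depends only on the other values in the same row, not on the other rows of the dataset.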


### Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

Principal component analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset. It finds a set of new variables that are linear combinations of the original variables. These new variables, called principal components, are mutually uncorrelated and ordered so that the first captures as much of the variance in the data as possible, the second captures as much of the remaining variance, and so on.

PCA is often used in dimensionality reduction because keeping only the first few components discards the directions with the least variance. When a dataset has a large number of dimensions, it can be difficult to see the relationships between the variables. Projecting onto two or three components makes the data easier to visualize while still preserving most of the variance.

PCA can also improve the performance of machine learning algorithms. Many algorithms become slower or overfit when there are many correlated features. PCA reduces the number of features (not the number of samples), which can speed up training and remove some noise. Because PCA is sensitive to feature scale, the features are usually standardized first.

_Example: Consider a dataset that contains information about a group of people, including their height, weight, and age. This dataset has three dimensions. We can use PCA to reduce it to two dimensions by keeping the two principal components that capture the most variance. The first principal component will likely be a combination of height and weight, since these two are usually correlated, and the second will likely be dominated by age. With only two dimensions, the data can be plotted on a plane, which makes it easier to see the relationships between the variables._
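
The height, weight, and age example can be sketched with NumPy alone. The data here is synthetic and generated only for illustration: weight is made to depend on height by construction, and age is independent. PCA is computed through an SVD of the standardized data, which is one common way to do it, not necessarily how any particular library implements it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (assumed distributions, for illustration only)
height = rng.normal(170, 8, n)                          # cm
weight = 70 + 0.9 * (height - 170) + rng.normal(0, 5, n)  # kg, correlated with height
age = rng.normal(40, 12, n)                             # years, independent

X = np.column_stack([height, weight, age])

# Standardize each column to zero mean and unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)

# Project onto the first two principal components: 3 dimensions -> 2
Z_2d = Z @ Vt[:2].T

print("Explained variance ratio:", explained_ratio)
print("Reduced shape:", Z_2d.shape)
```

Because height and weight move together in this synthetic data, the first component should carry well over a third of the variance, and dropping the third component loses comparatively little.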



### Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

In [20]:
import numpy as np
import pandas as pd
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

# Generate random data
price = np.random.normal(50, 10, 100)  # Mean: 50, Standard deviation: 10
rating = np.random.randint(1, 6, 100)  # Range: 1 to 5
delivery_time = np.random.normal(3, 1, 100)  # Mean: 3, Standard deviation: 1

# Create a DataFrame
data = pd.DataFrame({
    'price': price,
    'rating': rating,
    'delivery_time': delivery_time
})

In [21]:
data.head()

Unnamed: 0,price,rating,delivery_time
0,54.967142,1,4.049347
1,48.617357,5,4.325106
2,56.476885,1,3.734501
3,65.230299,3,2.045503
4,47.658466,2,2.248821


In [22]:
from sklearn.preprocessing import MinMaxScaler

In [23]:
scaler=MinMaxScaler()

In [24]:
min_max=scaler.fit_transform(data[['price','rating','delivery_time']])

In [25]:
df=pd.DataFrame(min_max,columns="processed_"+scaler.get_feature_names_out())

In [26]:
df2=pd.concat([data,df],axis=1)

In [27]:
print(df2.head())

       price  rating  delivery_time  processed_price  processed_rating  \
0  54.967142       1       4.049347         0.696879              0.00   
1  48.617357       5       4.325106         0.554890              1.00   
2  56.476885       1       3.734501         0.730639              0.00   
3  65.230299       3       2.045503         0.926376              0.50   
4  47.658466       2       2.248821         0.533448              0.25   

   processed_delivery_time  
0                 0.627436  
1                 0.688406  
2                 0.557825  
3                 0.184394  
4                 0.229347  


### Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.

In [28]:
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Define the number of samples
num_samples = 1000

# Generate random data for features
company_name = ['Company A', 'Company B', 'Company C', 'Company D', 'Company E']
financial_data = np.random.uniform(0, 100, (num_samples, 3))  # 3 financial data features
market_trend = np.random.uniform(-1, 1, num_samples)  # 1 market trend feature

# Generate random target variable (stock prices)
stock_prices = np.random.uniform(0, 200, num_samples)

# Create a DataFrame
data = pd.DataFrame({
    'Company': np.random.choice(company_name, num_samples),
    'FinancialData1': financial_data[:, 0],
    'FinancialData2': financial_data[:, 1],
    'FinancialData3': financial_data[:, 2],
    'MarketTrend': market_trend,
    'StockPrice': stock_prices
})

In [29]:
data.head()

Unnamed: 0,Company,FinancialData1,FinancialData2,FinancialData3,MarketTrend,StockPrice
0,Company C,37.454012,95.071431,73.199394,0.345406,114.399176
1,Company C,59.865848,15.601864,15.599452,0.593363,161.086466
2,Company D,5.808361,86.617615,60.111501,-0.499064,152.032186
3,Company C,70.807258,2.058449,96.990985,0.249748,30.779981
4,Company E,83.244264,21.233911,18.182497,0.143492,29.849894


In [30]:
data.columns

Index(['Company', 'FinancialData1', 'FinancialData2', 'FinancialData3',
       'MarketTrend', 'StockPrice'],
      dtype='object')

In [31]:
from sklearn.decomposition import PCA

In [32]:
from sklearn.preprocessing import StandardScaler

In [33]:
scaler=StandardScaler()

In [34]:
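# Note: StockPrice is the prediction target. It is standardized here together with the
# features, but in a real pipeline it should be left out of X so PCA does not use the target.
X=scaler.fit_transform(data[['FinancialData1', 'FinancialData2', 'FinancialData3', 'MarketTrend', 'StockPrice']])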

In [35]:
# PCA() with no arguments keeps every component; pass n_components to actually reduce dimensions
pca = PCA()
principal_components= pca.fit_transform(X)

principal_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2','PC3','PC4','PC5'])

In [36]:
principal_df.head()

Unnamed: 0,PC1,PC2,PC3,PC4,PC5
0,-0.088251,-0.645954,-1.366606,-0.549838,1.076891
1,0.37214,1.122241,1.17424,1.114218,1.02038
2,-0.679308,0.680248,-1.081302,-1.836184,0.564674
3,0.811412,-2.008004,1.528948,0.058229,-0.542202
4,0.747677,0.056992,0.584695,1.615508,-1.145488


### Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

In [37]:
x=[1, 5, 10, 15, 20]

In [38]:
from sklearn.preprocessing import MinMaxScaler

In [39]:
# Apply MinMaxScaler with the requested output range of -1 to 1
scaler=MinMaxScaler(feature_range=(-1, 1))

In [40]:
#Convert to 2D
x_2d = np.array(x).reshape(-1, 1)

In [41]:
min_max=scaler.fit_transform(x_2d)

In [42]:
# Create a DataFrame
df = pd.DataFrame({'x': x, 'min-max': min_max.flatten()})

In [43]:
df

Unnamed: 0,x,min-max
0,1,-1.0
1,5,-0.578947
2,10,-0.052632
3,15,0.473684
4,20,1.0
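
To reach a target range of -1 to 1, the min-max formula generalizes to x' = a + (x - x_min)(b - a) / (x_max - x_min), where a and b are the new lower and upper bounds. The sketch below works this out by hand for the values in the question. The function name `min_max_scale_to_range` is hypothetical and used only here.

```python
def min_max_scale_to_range(values, new_min=-1.0, new_max=1.0):
    """Linearly map values so that min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the range is zero")
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]

x = [1, 5, 10, 15, 20]
scaled = min_max_scale_to_range(x)
print([round(v, 6) for v in scaled])
# → [-1.0, -0.578947, -0.052632, 0.473684, 1.0]
```

The smallest value, 1, maps to -1 and the largest, 20, maps to 1. For example, 5 becomes -1 + (5 - 1) * 2 / 19 ≈ -0.578947.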


### Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

In [44]:
import numpy as np
np.random.seed(42)
height = np.random.randint(165,175,50)
weight = np.random.randint(45,95,50)
age = np.random.randint(15,75,50)
gender = np.array(["Male", "Female"] * 25)[:50]
blood_pressure = np.random.randint(80, 140,50)


In [45]:
import pandas as pd
df=pd.DataFrame({'height':height, 'weight':weight, 'age':age,'blood pressure': blood_pressure,"gender":gender})

In [46]:
df.head()

Unnamed: 0,height,weight,age,blood pressure,gender
0,171,91,65,130,Male
1,168,79,31,111,Female
2,172,58,22,118,Male
3,169,61,49,128,Female
4,171,80,49,131,Male


In [47]:
df.columns

Index(['height', 'weight', 'age', 'blood pressure', 'gender'], dtype='object')

In [48]:
from sklearn.preprocessing import StandardScaler

In [49]:
scaler = StandardScaler()

In [50]:
data = scaler.fit_transform(df[['height', 'weight', 'age', 'blood pressure']])

In [51]:
from sklearn.decomposition import PCA

In [52]:
pca=PCA()

In [53]:
df[['PC1', 'PC2', 'PC3', 'PC4']]=pca.fit_transform(data)

In [54]:
df.head()

Unnamed: 0,height,weight,age,blood pressure,gender,PC1,PC2,PC3,PC4
0,171,91,65,130,Male,1.350729,-1.116672,-0.28203,1.504568
1,168,79,31,111,Female,0.453537,-0.035229,1.030281,-0.42221
2,172,58,22,118,Male,-1.553033,-0.48611,0.32992,-0.271014
3,169,61,49,128,Female,0.171601,-0.755986,-0.772,-0.465955
4,171,80,49,131,Male,0.428808,-1.198966,-0.048954,0.746466


In [55]:
# Print the explained variance ratio
print("Explained Variance Ratio:")
print(pca.explained_variance_ratio_)

# Print the loadings of the principal components
print("Loadings of Principal Components:")
print(pca.components_)

Explained Variance Ratio:
[0.36336598 0.25181597 0.21370164 0.17111642]
Loadings of Principal Components:
[[-0.63519635  0.55929447  0.5286136   0.06544431]
 [-0.11311056 -0.14149466  0.13440342 -0.9742284 ]
 [-0.08156822  0.63423866 -0.74608978 -0.18557473]
 [ 0.75965665  0.51469435  0.38190761 -0.11026367]]


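##### Explained Variance Ratio:
The explained variance ratio indicates the proportion of total variance captured by each principal component. In this case, PC1 captures 36.34% of the variance, PC2 captures 25.18%, PC3 captures 21.37%, and PC4 captures 17.11%. A common rule is to retain the smallest number of components whose cumulative explained variance passes a chosen threshold. Here the cumulative values are 36.34%, 61.52%, 82.89%, and 100%, so no small subset dominates: even the last component carries 17.11%. This is expected, because the features were generated independently at random. Retaining all four components (PC1 to PC4) preserves all of the variance, while keeping only the first three would retain about 82.89%.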
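
One way to turn the printed ratios into a decision is to accumulate them and pick the first component count that crosses a threshold. The sketch below does this with the values printed above. The 80% threshold is an assumption chosen for illustration, not a rule from the assignment.

```python
from itertools import accumulate

# Explained variance ratios as printed by pca.explained_variance_ratio_ above
ratios = [0.36336598, 0.25181597, 0.21370164, 0.17111642]

# Running total: variance retained when keeping the first k components
cumulative = list(accumulate(ratios))

threshold = 0.80  # assumed cut-off; pick whatever the project can tolerate
n_keep = next(k for k, total in enumerate(cumulative, start=1) if total >= threshold)

for k, total in enumerate(cumulative, start=1):
    print(f"first {k} component(s): {total:.2%}")
print("components needed for the threshold:", n_keep)
```

With this threshold, three components are enough; a stricter threshold such as 95% would require all four.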

##### Loadings of Principal Components:
The loadings represent the contribution of each original feature to the principal components. Each row of `pca.components_` is one component, and the columns follow the input order (height, weight, age, blood pressure). Higher absolute loadings indicate a stronger relationship between the feature and the component. In this case, PC1 is a mix of height (-0.64), weight (0.56), and age (0.53), with almost no contribution from blood pressure (0.07). PC2 is dominated by blood pressure (-0.97). PC3 is mainly a contrast between age (-0.75) and weight (0.63). PC4 is driven mostly by height (0.76), with smaller contributions from weight (0.51) and age (0.38).
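
Reading the loadings can be automated by ranking the features of each component by absolute loading. The sketch below uses the matrix printed above, with columns in the order height, weight, age, blood pressure; the variable names are illustrative.

```python
features = ['height', 'weight', 'age', 'blood pressure']

# Rows are PC1..PC4, copied from the printed pca.components_ output
loadings = [
    [-0.63519635,  0.55929447,  0.5286136,   0.06544431],
    [-0.11311056, -0.14149466,  0.13440342, -0.9742284],
    [-0.08156822,  0.63423866, -0.74608978, -0.18557473],
    [ 0.75965665,  0.51469435,  0.38190761, -0.11026367],
]

dominant = []
for i, row in enumerate(loadings, start=1):
    # Sort features by the magnitude of their loading, largest first
    ranked = sorted(zip(features, row), key=lambda fr: abs(fr[1]), reverse=True)
    dominant.append(ranked[0][0])
    summary = ", ".join(f"{name} ({value:+.2f})" for name, value in ranked)
    print(f"PC{i}: {summary}")

print(dominant)
# → ['height', 'blood pressure', 'age', 'height']
```

The sign of a loading only shows direction; the magnitude is what indicates how strongly a feature drives the component.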

##### Conclusion
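Based on this information, I would retain all four principal components, because the variance is spread fairly evenly across them and every numeric feature (height, weight, age, blood pressure) loads strongly on at least one component. If some reduction were required, the first three components would still retain about 82.89% of the variance. Gender is categorical and was not part of the PCA input; it would need to be encoded numerically before it could be included.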

## The End