## Feature Transformation


### A. Using the PCA algorithm for dimensionality reduction
You have a dataset of measurements of several variables for wine
(alcohol, ash, magnesium, and so on). Apply the PCA algorithm and transform the data
so that most of the variation in the measurements is captured by a small
number of principal components, making it easier to distinguish between red and
white wine by inspecting those components.

Dataset Link: https://media.geeksforgeeks.org/wp-content/uploads/Wine.csv


In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [2]:
# Assumes the file from the dataset link has been saved locally as 'wine.csv'
data = pd.read_csv('wine.csv')

In [4]:
data.head()

|   | Alcohol | Malic_Acid | Ash | Ash_Alcanity | Magnesium | Total_Phenols | Flavanoids | Nonflavanoid_Phenols | Proanthocyanins | Color_Intensity | Hue | OD280 | Proline | Customer_Segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 | 1 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 | 1 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 | 1 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 | 1 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 | 1 |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Alcohol               178 non-null    float64
 1   Malic_Acid            178 non-null    float64
 2   Ash                   178 non-null    float64
 3   Ash_Alcanity          178 non-null    float64
 4   Magnesium             178 non-null    int64  
 5   Total_Phenols         178 non-null    float64
 6   Flavanoids            178 non-null    float64
 7   Nonflavanoid_Phenols  178 non-null    float64
 8   Proanthocyanins       178 non-null    float64
 9   Color_Intensity       178 non-null    float64
 10  Hue                   178 non-null    float64
 11  OD280                 178 non-null    float64
 12  Proline               178 non-null    int64  
 13  Customer_Segment      178 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 19.6 KB


- The goal is to reduce the number of variables while still capturing most of the variance in the dataset, which may make it easier to distinguish between wine types.

In [7]:
# StandardScaler and PCA were already imported in the first cell

# Separate the features from the target variable.
# Assumption: 'Customer_Segment' is the target and indicates the type of wine.
features = data.columns[:-1]  # all columns except the last one
x = data.loc[:, features].values  # feature matrix, shape (178, 13)
y = data.loc[:, ['Customer_Segment']].values  # target values, shape (178, 1)

In [8]:
# Standardize each feature to mean 0 and variance 1; PCA is sensitive to feature scale
x = StandardScaler().fit_transform(x)
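
As a rough illustration of what the standardization step does, the sketch below z-scores a small made-up array by hand. The data and the names `toy` and `toy_scaled` are hypothetical and not part of the notebook. `StandardScaler` divides by the population standard deviation (`ddof=0`), so the same convention is used here.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 2 features on very different scales
rng = np.random.default_rng(0)
toy = np.column_stack([
    rng.normal(loc=13.0, scale=0.8, size=6),     # an alcohol-like column
    rng.normal(loc=750.0, scale=300.0, size=6),  # a proline-like column
])

# Z-score each column: subtract the column mean, then divide by the
# population standard deviation (ddof=0)
toy_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0, ddof=0)

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

Without this step, a large-valued column such as `Proline` would dominate the variance and therefore the leading principal components.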

In [9]:
# Applying PCA
# We will start with no reduction to see the variance explained by each component
pca = PCA()
principalComponents = pca.fit_transform(x)

In [10]:
# Let's check the explained variance ratio for each component
explained_variance = pca.explained_variance_ratio_

explained_variance

array([0.36198848, 0.1920749 , 0.11123631, 0.0706903 , 0.06563294,
       0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
       0.01736836, 0.01298233, 0.00795215])
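
To make these ratios less of a black box, the following sketch computes them by hand as the eigenvalues of the covariance matrix divided by their sum. It runs on made-up correlated data (the arrays `latent`, `mixing`, and `demo` are assumptions for illustration, not the wine data) and uses only NumPy.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 200 samples, 4 correlated features built from 2 latent factors
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
demo = latent @ mixing + 0.1 * rng.normal(size=(200, 4))

# Standardize the columns, as done for the wine features
demo_std = (demo - demo.mean(axis=0)) / demo.std(axis=0)

# Eigenvalues of the covariance matrix, sorted from largest to smallest
# (eigvalsh returns them in ascending order, so reverse the result)
cov = np.cov(demo_std, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Each component's share of the total variance
ratio = eigenvalues / eigenvalues.sum()
print(ratio)
print(ratio.sum())  # the shares add up to 1
```

Because the demo data is driven by two latent factors, the first two shares should account for most of the variance, which is the same pattern PCA exploits on the wine measurements.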

In [11]:
# Calculate the cumulative variance explained by the principal components
cumulative_variance = np.cumsum(explained_variance)
cumulative_variance

array([0.36198848, 0.55406338, 0.66529969, 0.73598999, 0.80162293,
       0.85098116, 0.89336795, 0.92017544, 0.94239698, 0.96169717,
       0.97906553, 0.99204785, 1.        ])

In [12]:
# Find the smallest number of components whose cumulative explained variance is at least 90%
# np.argmax returns the index of the first True value; +1 converts that 0-based index to a count
n_components = np.argmax(cumulative_variance >= 0.9) + 1
n_components

8
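
scikit-learn can also pick the component count itself: passing a float between 0 and 1 as `n_components` (with `svd_solver="full"`) keeps just enough components to exceed that share of the variance. The sketch below compares this shortcut with the manual `cumsum`/`argmax` approach on made-up data; the array `demo` is a hypothetical stand-in and not the wine data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 178 samples, 13 correlated features
latent = rng.normal(size=(178, 5))
demo = latent @ rng.normal(size=(5, 13)) + 0.5 * rng.normal(size=(178, 13))
demo_scaled = StandardScaler().fit_transform(demo)

# Manual approach: fit all components, then find the first cumulative ratio >= 0.9
full = PCA().fit(demo_scaled)
cumulative = np.cumsum(full.explained_variance_ratio_)
manual_k = int(np.argmax(cumulative >= 0.9) + 1)

# Shortcut: let PCA choose the number of components for a 90% variance target
auto = PCA(n_components=0.9, svd_solver="full").fit(demo_scaled)

print(manual_k, auto.n_components_)
```

The two counts should agree unless a cumulative ratio lands exactly on the threshold, since the shortcut requires the retained variance to be strictly greater than the target.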

In [13]:
# Apply PCA again, this time keeping only the number of components determined above
pca = PCA(n_components=n_components)
principalComponents_reduced = pca.fit_transform(x)

In [15]:
# The shape of the data after applying PCA reduction
reduced_data_shape = principalComponents_reduced.shape
reduced_data_shape

(178, 8)
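
The task statement asks about telling wine types apart by inspecting the principal components, which the cells above do not yet do. One possible way, sketched below, is to compare the average position of each class along the first two components. As an assumption, this sketch loads scikit-learn's bundled wine data via `load_wine` as a stand-in for `Wine.csv`; its class labels are 0, 1, 2 and may not match the `Customer_Segment` coding. The names `pc_df` and `class_means` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for Wine.csv
X, y = load_wine(return_X_y=True)

# Same preprocessing as above: standardize, then project onto 2 components
X_scaled = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(X_scaled)

# Attach the class labels to the component scores
pc_df = pd.DataFrame(scores, columns=["PC1", "PC2"])
pc_df["label"] = y

# Average position of each class in the PC1/PC2 plane
class_means = pc_df.groupby("label")[["PC1", "PC2"]].mean()
print(class_means)
```

Well-separated class means along PC1 or PC2 would suggest that a scatter plot of these two components is enough to tell the classes apart by eye.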