LINEAR DISCRIMINANT ANALYSIS (LDA)

Linear Discriminant Analysis (LDA) is a supervised learning algorithm widely used for dimensionality reduction and classification. It aims to project data onto a lower-dimensional space while maximizing the separability between different classes. Unlike PCA (unsupervised), LDA considers class labels during the dimensionality reduction process.

\. Mathematical Intuition: The goal of LDA is to find a transformation that maximizes the distance between class means (separability) while minimizing the spread of data points within each class (compactness). 
To achieve this, LDA uses the following ratio:
- $J(w) = \dfrac{w^T S_b w}{w^T S_w w}$
    - where,
        - $S_b$: Between-class scatter matrix (quantifies class separability).
        - $S_w$: Within-class scatter matrix (quantifies intra-class variability).
        - $w$: Projection vector (or weight vector).

Maximizing $J(w)$ favors projection directions along which the class means are far apart relative to the spread of the points within each class.
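As a rough illustration of the criterion, the sketch below evaluates J(w) for two candidate directions on a small made-up two-class dataset. The data points, the helper name `fisher_ratio`, and the two directions are assumptions chosen for the example.

```python
import numpy as np

# Toy data (assumed for illustration): two classes of 2-D points
X0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X1 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])


def fisher_ratio(w, classes):
    """Return J(w) = (w^T S_b w) / (w^T S_w w) for a list of class arrays."""
    all_points = np.vstack(classes)
    mu = all_points.mean(axis=0)
    d = all_points.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for Xc in classes:
        mu_c = Xc.mean(axis=0)
        centered = Xc - mu_c
        S_w += centered.T @ centered                  # within-class scatter
        S_b += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter
    return float(w @ S_b @ w) / float(w @ S_w @ w)


w_good = np.array([1.0, 1.0])    # roughly along the line joining the class means
w_bad = np.array([1.0, -1.0])    # roughly perpendicular to it
j_good = fisher_ratio(w_good, [X0, X1])
j_bad = fisher_ratio(w_bad, [X0, X1])
print(j_good, j_bad)
```

The direction aligned with the class means scores the higher ratio, and rescaling w leaves J(w) unchanged, which is why only the direction of w matters.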


\. Steps Involved in LDA:

1. Compute Class Means
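- For each class $C_i$, compute its mean vector $\mu_i$:
    - $\mu_i = \frac{1}{N_i} \sum_{x \in C_i} x$
    - where,
        - $N_i$: Number of data points in class $C_i$.
        - $x$: Data points in class $C_i$.

- For the entire dataset, calculate the overall mean $\mu$:
    - $\mu = \frac{1}{N} \sum_{i} N_i \mu_i$, where $N = \sum_i N_i$ is the total number of data points.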
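A minimal NumPy sketch of this step, using a small made-up dataset (the arrays `X` and `y` are assumptions for illustration):

```python
import numpy as np

# Toy data (assumed): 6 points, 2 features, 2 classes
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array([0, 0, 0, 1, 1, 1])

classes = [int(c) for c in np.unique(y)]

# Per-class mean vectors and class sizes
class_means = {c: X[y == c].mean(axis=0) for c in classes}
class_counts = {c: int((y == c).sum()) for c in classes}

# Overall mean as the count-weighted average of the class means
overall_mean = sum(class_counts[c] * class_means[c] for c in classes) / len(X)
print(class_means, overall_mean)
```

The weighted average of the class means is the same vector as the plain mean of all rows, so either form can be used.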
 
2. Compute the Scatter Matrices:
- To calculate the within-class scatter matrix ($S_w$): The within-class scatter matrix quantifies the spread of data points within each class ->
    - $S_w = \sum_{i} S_i$
    - where,
        - $S_i$ (scatter for class $C_i$) is:
            - $S_i = \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T$

\. Intuition: Measures how tightly clustered data points are around their class mean.

- To calculate the between-class scatter matrix ($S_b$): The between-class scatter matrix quantifies the separation of class means relative to the overall mean ->
    - $S_b = \sum_{i} N_i (\mu_i - \mu)(\mu_i - \mu)^T$

\. Intuition: Measures how far apart class means are from the overall mean.
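The two scatter matrices can be computed directly from their definitions. The sketch below does so on synthetic data and also builds the total scatter matrix, which equals the sum of the within-class and between-class scatter. The helper name `scatter_matrices` and the random data are assumptions for illustration.

```python
import numpy as np


def scatter_matrices(X, y):
    """Return (S_w, S_b) for data X of shape (n_samples, n_features) and labels y."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        centered = Xc - mu_c
        S_w += centered.T @ centered                          # sum of (x - mu_i)(x - mu_i)^T
        S_b += Xc.shape[0] * np.outer(mu_c - mu, mu_c - mu)   # N_i (mu_i - mu)(mu_i - mu)^T
    return S_w, S_b


# Synthetic data (assumed): 3 classes, 3 features, 20 points per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(20, 3)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 20)

S_w, S_b = scatter_matrices(X, y)

# Total scatter around the overall mean
X_centered = X - X.mean(axis=0)
S_t = X_centered.T @ X_centered
print(np.allclose(S_w + S_b, S_t))
```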

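3. Solve the Generalized Eigenvalue Problem: Find the optimal projection vector $w$ by solving $S_b w = \lambda S_w w$. When $S_w$ is invertible, this is equivalent to the ordinary eigenvalue equation ->
- $S_w^{-1} S_b w = \lambda w$
- Here:
    - $w$: Eigenvector (direction of the new axis).
    - $\lambda$: Eigenvalue (the value of $J(w)$ attained along that eigenvector, so a larger eigenvalue means stronger class separation).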
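A small sketch of this step with NumPy, on a made-up two-class dataset (the data and variable names are assumptions). It uses `np.linalg.solve` rather than forming the inverse of the within-class scatter explicitly, which is one reasonable choice among several.

```python
import numpy as np

# Toy data (assumed): 2 classes, 2 features
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array([0, 0, 0, 1, 1, 1])

mu = X.mean(axis=0)
S_w = np.zeros((2, 2))
S_b = np.zeros((2, 2))
for c in np.unique(y):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    S_w += (Xc - mu_c).T @ (Xc - mu_c)
    S_b += len(Xc) * np.outer(mu_c - mu, mu_c - mu)

# Eigen-decomposition of S_w^{-1} S_b
A = np.linalg.solve(S_w, S_b)
eigvals, eigvecs = np.linalg.eig(A)
eigvals, eigvecs = eigvals.real, eigvecs.real   # drop any zero imaginary parts

# Sort from largest to smallest eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

w = eigvecs[:, 0]   # most discriminative direction
print(eigvals, w)
```

With two classes only one eigenvalue is nonzero, and the leading eigenvector is parallel to the within-class-scatter-corrected difference of the two class means.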

4. Choose the Top m Eigenvectors: To reduce the data to an m-dimensional space, select the m eigenvectors corresponding to the largest eigenvalues. Because $S_b$ has rank at most $C-1$ for $C$ classes, at most $C-1$ eigenvalues are nonzero, so $m \le C-1$. Stack these eigenvectors column-wise to form the transformation matrix $W$ ->
- $W = [w_1, w_2, \dots, w_m]$

5. Project Data onto the New Feature Space: Transform the original data points $x$ into the new feature space using the transformation matrix $W$ ->
- $y = W^T x$
- Where:
    - $x$: Original data point.
    - $y$: Transformed data point in the new space.
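Putting the five steps together, a from-scratch projection might look like the sketch below. The function name `lda_projection_matrix` and the synthetic three-class data are assumptions for illustration. This is not how scikit-learn implements LDA internally, and its output may differ from scikit-learn's in scaling and sign.

```python
import numpy as np


def lda_projection_matrix(X, y, m):
    """Return a (d, m) matrix W whose columns are the top-m discriminant directions."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_w += (Xc - mu_c).T @ (Xc - mu_c)
        S_b += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    # Eigenvectors of S_w^{-1} S_b, ordered by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][:m]
    return eigvecs.real[:, order]


# Synthetic data (assumed): 3 classes, 4 features, 30 points per class
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(30, 4)) for c in centers])
y = np.repeat([0, 1, 2], 30)

W = lda_projection_matrix(X, y, m=2)   # m <= C - 1 = 2
Y = X @ W                              # row-wise form of y = W^T x
print(W.shape, Y.shape)
```

Because the data points are stored as rows of `X`, the projection is written `X @ W`; each row of `Y` equals `W.T @ x` for the corresponding row `x`.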


\. Key Considerations:
1. Assumptions:
- Classes can be separated reasonably well by linear boundaries (LDA only produces linear decision boundaries).
- Data within each class follows a Gaussian distribution.
- All classes share the same covariance matrix.

2. Limitations:
- If these assumptions are violated, LDA may perform poorly.
- Sensitive to noise and outliers.

3. Strengths:
- Captures class information during dimensionality reduction.
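- Widely used on high-dimensional datasets (e.g., face recognition), although the within-class scatter matrix becomes singular when features outnumber samples, so regularization or a prior PCA step is typically needed.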


\. Geometric Intuition
- LDA finds the line (or hyperplane) that maximizes the distance between the projected means of different classes while minimizing the variance within each class.
- This line is oriented such that it captures the directions of maximum class separability.


\. Comparison with PCA: While both LDA and PCA perform dimensionality reduction, their goals are different:



In [None]:
import pandas as pd

# Side-by-side summary of how LDA and PCA differ
comparisonWithPCA = pd.DataFrame(
    columns=['Feature', 'LDA', 'PCA'],
    data=[
        ['Supervised/Unsupervised', 'Supervised (uses class labels).', 'Unsupervised (ignores labels).'],
        ['Objective', 'Maximize class separability.', 'Maximize total variance.'],
        ['Focus', 'Retains discriminative information.', 'Retains global structure.'],
        ['Dimensionality', 'Reduces to at most 𝐶−1 dimensions (C = classes).', 'Reduces to any number of dimensions.'],
    ],
)
comparisonWithPCA

|   | Feature | LDA | PCA |
|---|---------|-----|-----|
| 0 | Supervised/Unsupervised | Supervised (uses class labels). | Unsupervised (ignores labels). |
| 1 | Objective | Maximize class separability. | Maximize total variance. |
| 2 | Focus | Retains discriminative information. | Retains global structure. |
| 3 | Dimensionality | Reduces to at most 𝐶−1 dimensions (C = classes). | Reduces to any number of dimensions. |
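The dimensionality row can be checked with scikit-learn. The sketch below uses the bundled Iris dataset (4 features, 3 classes) as a stand-in, which is an assumption and not the dataset analyzed in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features, 3 classes

# PCA ignores y and can keep up to min(n_samples, n_features) components
X_pca = PCA(n_components=3).fit_transform(X)

# LDA uses y and can keep at most min(n_features, n_classes - 1) = 2 components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# Requesting a third discriminant axis is rejected
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)
    too_many_accepted = True
except ValueError:
    too_many_accepted = False

print(X_pca.shape, X_lda.shape, too_many_accepted)
```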


In [42]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics

In [8]:
raw_data = pd.read_csv(r"S:\VS code\python\Data _Analytics\Dataset\customer.csv")
raw_data

Unnamed: 0.1,Unnamed: 0,Customer,State,CustomerLifetimeValue,Response,Coverage,Education,EffectiveToDate,EmploymentStatus,Gender,...,MonthsSincePolicyInception,NumberofOpenComplaints,NumberofPolicies,PolicyType,Policy,RenewOfferType,SalesChannel,ClaimAmount,VehicleClass,VehicleSize
0,0,BU79786,Washington,2763.519279,No,Basic,Bachelor,2/24/11,Employed,F,...,5,0,1,Corporate Auto,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize
1,1,QZ44356,Arizona,6979.535903,No,Extended,Bachelor,1/31/11,Unemployed,F,...,42,0,8,Personal Auto,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize
2,2,AI49188,Nevada,12887.431650,No,Premium,Bachelor,2/19/11,Employed,F,...,38,0,2,Personal Auto,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize
3,3,WW63253,California,7645.861827,No,Basic,Bachelor,1/20/11,Unemployed,M,...,65,0,7,Corporate Auto,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize
4,4,HB64268,Washington,2813.692575,No,Basic,Bachelor,2/3/11,Employed,M,...,44,0,1,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9129,9129,LA72316,California,23405.987980,No,Basic,Bachelor,2/10/11,Employed,M,...,89,0,2,Personal Auto,Personal L1,Offer2,Web,198.234764,Four-Door Car,Medsize
9130,9130,PK87824,California,3096.511217,Yes,Extended,College,2/12/11,Employed,F,...,28,0,1,Corporate Auto,Corporate L3,Offer1,Branch,379.200000,Four-Door Car,Medsize
9131,9131,TD14365,California,8163.890428,No,Extended,Bachelor,2/6/11,Unemployed,M,...,37,3,2,Corporate Auto,Corporate L2,Offer1,Branch,790.784983,Four-Door Car,Medsize
9132,9132,UP19263,California,7524.442436,No,Extended,College,2/3/11,Employed,M,...,3,0,3,Personal Auto,Personal L2,Offer3,Branch,691.200000,Four-Door Car,Large


In [6]:
raw_data.columns

Index(['Unnamed: 0', 'Customer', 'State', 'CustomerLifetimeValue', 'Response',
       'Coverage', 'Education', 'EffectiveToDate', 'EmploymentStatus',
       'Gender', 'Income', 'LocationCode', 'MaritalStatus',
       'MonthlyPremiumAuto', 'MonthsSinceLastClaim',
       'MonthsSincePolicyInception', 'NumberofOpenComplaints',
       'NumberofPolicies', 'PolicyType', 'Policy', 'RenewOfferType',
       'SalesChannel', 'ClaimAmount', 'VehicleClass', 'VehicleSize'],
      dtype='object')

In [9]:
cleaned_data = raw_data.drop(columns=['Unnamed: 0', 'Customer', 'State','PolicyType'])
cleaned_data

Unnamed: 0,CustomerLifetimeValue,Response,Coverage,Education,EffectiveToDate,EmploymentStatus,Gender,Income,LocationCode,MaritalStatus,...,MonthsSinceLastClaim,MonthsSincePolicyInception,NumberofOpenComplaints,NumberofPolicies,Policy,RenewOfferType,SalesChannel,ClaimAmount,VehicleClass,VehicleSize
0,2763.519279,No,Basic,Bachelor,2/24/11,Employed,F,56274,Suburban,Married,...,32,5,0,1,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize
1,6979.535903,No,Extended,Bachelor,1/31/11,Unemployed,F,0,Suburban,Single,...,13,42,0,8,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize
2,12887.431650,No,Premium,Bachelor,2/19/11,Employed,F,48767,Suburban,Married,...,18,38,0,2,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize
3,7645.861827,No,Basic,Bachelor,1/20/11,Unemployed,M,0,Suburban,Married,...,18,65,0,7,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize
4,2813.692575,No,Basic,Bachelor,2/3/11,Employed,M,43836,Rural,Single,...,12,44,0,1,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9129,23405.987980,No,Basic,Bachelor,2/10/11,Employed,M,71941,Urban,Married,...,18,89,0,2,Personal L1,Offer2,Web,198.234764,Four-Door Car,Medsize
9130,3096.511217,Yes,Extended,College,2/12/11,Employed,F,21604,Suburban,Divorced,...,14,28,0,1,Corporate L3,Offer1,Branch,379.200000,Four-Door Car,Medsize
9131,8163.890428,No,Extended,Bachelor,2/6/11,Unemployed,M,0,Suburban,Single,...,9,37,3,2,Corporate L2,Offer1,Branch,790.784983,Four-Door Car,Medsize
9132,7524.442436,No,Extended,College,2/3/11,Employed,M,21941,Suburban,Married,...,34,3,0,3,Personal L2,Offer3,Branch,691.200000,Four-Door Car,Large


In [12]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9134 entries, 0 to 9133
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   CustomerLifetimeValue       9134 non-null   float64
 1   Response                    9134 non-null   object 
 2   Coverage                    9134 non-null   object 
 3   Education                   9134 non-null   object 
 4   EffectiveToDate             9134 non-null   object 
 5   EmploymentStatus            9134 non-null   object 
 6   Gender                      9134 non-null   object 
 7   Income                      9134 non-null   int64  
 8   LocationCode                9134 non-null   object 
 9   MaritalStatus               9134 non-null   object 
 10  MonthlyPremiumAuto          9134 non-null   int64  
 11  MonthsSinceLastClaim        9134 non-null   int64  
 12  MonthsSincePolicyInception  9134 non-null   int64  
 13  NumberofOpenComplaints      9134 

Encoding the categorical columns and scaling the numeric ones

In [17]:
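# Pick the column groups before encoding, so label-encoded integer columns
# can never be picked up by the numeric selector
categorical_cols = cleaned_data.select_dtypes(include=['object']).columns
numeric_cols = cleaned_data.select_dtypes(include=['float64', 'int64']).columns
scaled_data = cleaned_data.copy()

# Label-encode each categorical column with its own encoder
for col in categorical_cols:
    scaled_data[col] = LabelEncoder().fit_transform(scaled_data[col])

# Standardize the numeric columns to zero mean and unit variance
scaled_data[numeric_cols] = StandardScaler().fit_transform(scaled_data[numeric_cols])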

scaled_data




Unnamed: 0,CustomerLifetimeValue,Response,Coverage,Education,EffectiveToDate,EmploymentStatus,Gender,Income,LocationCode,MaritalStatus,...,MonthsSinceLastClaim,MonthsSincePolicyInception,NumberofOpenComplaints,NumberofPolicies,Policy,RenewOfferType,SalesChannel,ClaimAmount,VehicleClass,VehicleSize
0,-0.762878,0,0,0,47,1,0,0.612827,1,1,...,1.678099,-1.543287,-0.422250,-0.822648,2,0,0,-0.169640,5,1
1,-0.149245,0,1,0,24,4,0,-1.239617,1,2,...,-0.208186,-0.217334,-0.422250,2.106160,5,2,0,2.400737,0,1
2,0.710636,0,2,0,41,1,0,0.365710,1,1,...,0.288205,-0.360680,-0.422250,-0.404247,5,0,0,0.455734,5,1
3,-0.052263,0,0,0,12,4,1,-1.239617,1,1,...,0.288205,0.606907,-0.422250,1.687759,1,0,2,0.329769,3,1
4,-0.755575,0,0,0,52,1,1,0.203390,0,2,...,-0.307465,-0.145661,-0.422250,-0.822648,3,0,0,-1.018843,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9129,2.241590,0,0,0,32,1,1,1.128558,2,1,...,0.288205,1.466984,-0.422250,-0.404247,3,1,3,-0.811934,0,1
9130,-0.714411,1,1,1,34,1,0,-0.528450,1,0,...,-0.108908,-0.719046,-0.422250,-0.822648,2,0,1,-0.188956,0,1
9131,0.023135,0,1,0,55,4,1,-1.239617,1,2,...,-0.605299,-0.396517,2.873245,-0.404247,1,0,1,1.227937,0,1
9132,-0.069935,0,1,1,52,1,1,-0.517356,1,1,...,1.876656,-1.614960,-0.422250,0.014154,4,2,1,0.885113,0,0
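The cell above fits the encoder and scaler on the full dataset before the train/test split, so the held-out rows influence the scaling statistics. One possible alternative, sketched below, is to fit the preprocessing on the training rows only and reuse it on the test rows. The miniature frame `demo` and its two columns are a hypothetical stand-in for `cleaned_data`, and `OrdinalEncoder` is swapped in for `LabelEncoder` because scikit-learn documents `LabelEncoder` as intended for target values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical miniature frame standing in for cleaned_data
demo = pd.DataFrame({
    "Income": [56274, 0, 48767, 0, 43836, 71941],
    "Coverage": ["Basic", "Extended", "Premium", "Basic", "Basic", "Extended"],
})
train, test = demo.iloc[:4], demo.iloc[4:]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Income"]),     # standardize numeric columns
    ("cat", OrdinalEncoder(), ["Coverage"]),   # integer-code categorical columns
])

train_t = preprocess.fit_transform(train)   # statistics learned from training rows only
test_t = preprocess.transform(test)         # reused, not refit, on held-out rows
print(train_t, test_t, sep="\n")
```

Note that `OrdinalEncoder` raises an error by default if the test rows contain a category that was not seen during fitting.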


Modeling the dataset

In [34]:
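# Features: every column except the last two (VehicleClass, the target, and VehicleSize)
X = scaled_data.iloc[:, :-2]
# Target: the label-encoded vehicle class
y = scaled_data['VehicleClass']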


In [39]:
# Hold out 20% of the rows for evaluation
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [56]:
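# n_components=1 keeps only the first discriminant axis in the transformed output;
# it does not change what predict() returns
lda_model = LinearDiscriminantAnalysis(n_components=1)
lda_model.fit_transform(x_train, y_train)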

array([[-3.13072853],
       [-3.84247362],
       [-1.34862905],
       ...,
       [ 3.4727    ],
       [ 4.83462829],
       [-3.1058057 ]])
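In scikit-learn, `n_components` controls how many discriminant axes `transform` returns, while `predict` relies on the fitted class means and shared covariance. The sketch below illustrates this on the Iris dataset, used here as an assumed stand-in for the customer data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # stand-in data: 4 features, 3 classes

lda_1 = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
lda_2 = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

# The projected output has one column per requested component
shape_1 = lda_1.transform(X).shape
shape_2 = lda_2.transform(X).shape

# Share of between-class variance captured by each retained axis
ratios = lda_2.explained_variance_ratio_

# Class predictions do not depend on n_components
same_predictions = np.array_equal(lda_1.predict(X), lda_2.predict(X))
print(shape_1, shape_2, ratios, same_predictions)
```

Inspecting `explained_variance_ratio_` is one way to judge how much discriminative information a single axis retains before settling on `n_components=1`.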

In [57]:
predictions = lda_model.predict(x_test)
predictions

array([3, 0, 0, ..., 3, 0, 3])

In [60]:
print(f"Accuracy: {metrics.accuracy_score(y_test, predictions)*100:.2f}%")

Accuracy: 71.65%
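A single accuracy figure is easier to interpret next to a majority-class baseline and a confusion matrix, especially if one class dominates the target. The sketch below shows the idea on synthetic imbalanced data from `make_classification`. The data and all variable names are assumptions, so the numbers it prints say nothing about the customer dataset.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem with imbalanced classes (assumed for illustration)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

base_acc = accuracy_score(y_te, baseline.predict(X_te))
lda_acc = accuracy_score(y_te, lda.predict(X_te))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, lda.predict(X_te))
print(f"baseline: {base_acc:.2%}, LDA: {lda_acc:.2%}")
print(cm)
```

Applied to the notebook's own split, the same three calls would take `x_train`, `y_train`, `x_test`, and `y_test` in place of the synthetic arrays.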
