## DEMONSTRATING THE EFFECT OF DIMENSIONALITY REDUCTION IN TWO DATASETS 
 - DATASET 1: Contains only numerical features.
 - DATASET 2: Contains a mixture of both numerical and categorical features.

### DATASET 1: WISCONSIN BREAST CANCER DATASET

In [34]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [35]:
# Step 1: Load and preprocess the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

In [36]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [37]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [38]:
# Step 2: Apply PCA for dimensionality reduction
pca = PCA(n_components=2)  # Keep only the first 2 principal components
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
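Before committing to two components, it can help to check how much variance they actually retain. The following is a minimal sketch, assuming the same split and scaling as above; the names `pca_full`, `cumulative_variance`, and `n_components_95`, and the 95% threshold, are illustrative choices and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the same split and scaling used in the cells above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Fit PCA with all components to inspect the full variance spectrum
pca_full = PCA().fit(X_train_scaled)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

print("Variance retained by 2 components:", cumulative_variance[1])

# Smallest number of components whose cumulative variance reaches 95%
# (the 95% threshold is an arbitrary, illustrative choice)
n_components_95 = int(np.searchsorted(cumulative_variance, 0.95) + 1)
print("Components needed for 95% variance:", n_components_95)
```

If two components retain only a modest share of the variance, a larger `n_components` may be worth trying before comparing classifiers.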

## Implement Logistic Regression and LDA on the original (unreduced) dataset

In [39]:
# Logistic Regression
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_scaled, y_train)
# Predict using the model
y_pred_lr = lr.predict(X_test_scaled)

# Linear Discriminant Analysis (LDA) used as a classifier
lda = LinearDiscriminantAnalysis()
lda.fit(X_train_scaled, y_train)
# Predict using the LDA model
y_pred = lda.predict(X_test_scaled)

# Evaluate the model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy for Logistic Regression:", accuracy_lr)

# Evaluate the LDA classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy for LDA:", accuracy)

Accuracy for Logistic Regression: 0.9736842105263158
Accuracy for LDA: 0.956140350877193
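Accuracy alone does not show which class the errors fall in. The sketch below uses `confusion_matrix` and `classification_report`, which are imported at the top of the notebook but not otherwise used, to break the results down per class. It rebuilds the split so that it runs on its own; the `models` and `matrices` dictionaries are illustrative names, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "LDA": LinearDiscriminantAnalysis(),
}

matrices = {}
for name, model in models.items():
    y_pred = model.fit(X_train_scaled, y_train).predict(X_test_scaled)
    # Rows are true classes, columns are predicted classes
    matrices[name] = confusion_matrix(y_test, y_pred)
    print(name)
    print(matrices[name])
    print(classification_report(y_test, y_pred, target_names=data.target_names))
```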


## Implement Logistic Regression and LDA on reduced dataset

In [40]:
# Step 3: Implement Logistic Regression and LDA using reduced dataset
# Logistic Regression

lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_pca, y_train)
# Predict using the model
y_pred_lr = lr.predict(X_test_pca)


# LDA
lda = LinearDiscriminantAnalysis()
lda.fit(X_train_pca, y_train)
# Predict using the LDA model
y_pred = lda.predict(X_test_pca)

# Evaluate the model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy for Logistic Regression:", accuracy_lr)
# Evaluate the LDA classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy for LDA:", accuracy)

Accuracy for Logistic Regression: 0.9912280701754386
Accuracy for LDA: 0.956140350877193
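A single 80/20 split yields a test set of 114 samples, so a difference of one or two predictions shifts accuracy by about one to two percentage points. One way to get a steadier comparison is cross-validation with the scaler and PCA inside a pipeline, so that both are refit on each training fold. This is a sketch under assumptions: the 5-fold `StratifiedKFold` setup and the `pipelines` and `cv_scores` names are illustrative and do not come from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

pipelines = {
    "all 30 features": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "2 PCA components": make_pipeline(
        StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=1000)
    ),
}

cv_scores = {}
for name, pipe in pipelines.items():
    # Scaling and PCA are fit inside each fold, so no test-fold data leaks in
    cv_scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```

Comparing the fold-to-fold standard deviation with the gap between the two means gives a rough sense of whether the difference is meaningful.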


### DATASET 2: BMI DATASET

In [41]:
data = pd.read_csv("bmi_train.csv")
data.head()

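|   | Gender | Height | Weight | Index |
|---|--------|--------|--------|-------|
| 0 | Male | 161 | 89 | 4 |
| 1 | Male | 179 | 127 | 4 |
| 2 | Male | 172 | 139 | 5 |
| 3 | Male | 153 | 104 | 5 |
| 4 | Male | 165 | 68 | 2 |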


In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Gender  400 non-null    object
 1   Height  400 non-null    int64 
 2   Weight  400 non-null    int64 
 3   Index   400 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 12.6+ KB


In [43]:
# Encode the categorical Gender column as integers
from sklearn.preprocessing import LabelEncoder
encod = LabelEncoder()
data['Gender'] = encod.fit_transform(data['Gender'])
data.head(10)

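|   | Gender | Height | Weight | Index |
|---|--------|--------|--------|-------|
| 0 | 1 | 161 | 89 | 4 |
| 1 | 1 | 179 | 127 | 4 |
| 2 | 1 | 172 | 139 | 5 |
| 3 | 1 | 153 | 104 | 5 |
| 4 | 1 | 165 | 68 | 2 |
| 5 | 1 | 172 | 92 | 4 |
| 6 | 1 | 182 | 108 | 4 |
| 7 | 1 | 179 | 130 | 5 |
| 8 | 1 | 142 | 71 | 4 |
| 9 | 0 | 158 | 153 | 5 |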
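To see which integer each category received, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a small hypothetical sample in place of the real `Gender` column, on the assumption that the column holds the strings `Male` and `Female`. `LabelEncoder` sorts the class labels, so the codes follow alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small hypothetical sample standing in for the Gender column
gender = pd.Series(["Male", "Female", "Female", "Male"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(gender)

# classes_ is sorted, so the position of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes
print(encoder.inverse_transform(encoded))
```

For a two-valued column this integer coding is harmless; for a categorical feature with more than two unordered values, one-hot encoding is usually the safer choice because integer codes imply an ordering.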


In [44]:
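# Map the integer Index codes 0-5 to descriptive labels.
# pd.cut uses right-inclusive bins, so the bin (k-1, k] captures code k.
bins = (-1,0,1,2,3,4,5)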
Status = ['Malnourished', 'Underweight', 'Fit', 'Slightly Overweight', 'Overweight', 'Extremely overweighted']
data['Index'] = pd.cut(data['Index'], bins = bins, labels = Status)
data.head()

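|   | Gender | Height | Weight | Index |
|---|--------|--------|--------|-------|
| 0 | 1 | 161 | 89 | Overweight |
| 1 | 1 | 179 | 127 | Overweight |
| 2 | 1 | 172 | 139 | Extremely overweighted |
| 3 | 1 | 153 | 104 | Extremely overweighted |
| 4 | 1 | 165 | 68 | Fit |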
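The binning step can be verified in isolation. The sketch below applies the same `bins` and labels to the six possible codes and compares the result with an explicit dictionary lookup via `Series.map`. That lookup is an alternative shown for illustration, not what the notebook uses, and the `codes`, `binned`, and `mapped` names are hypothetical.

```python
import pandas as pd

bins = (-1, 0, 1, 2, 3, 4, 5)
status = ['Malnourished', 'Underweight', 'Fit', 'Slightly Overweight',
          'Overweight', 'Extremely overweighted']

# One row per possible Index code
codes = pd.Series([0, 1, 2, 3, 4, 5])

# Right-inclusive bins: (-1, 0] -> code 0, (0, 1] -> code 1, and so on
binned = pd.cut(codes, bins=bins, labels=status)

# Equivalent explicit lookup from code to label
mapped = codes.map(dict(enumerate(status)))

print(pd.DataFrame({"code": codes, "pd.cut": binned, "map": mapped}))
```

Both approaches give the same labels for integer codes; the dictionary lookup states the intent more directly, while `pd.cut` would also handle non-integer values.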


In [45]:
X = data[['Gender', "Height", "Weight"]]
y = data["Index"]

In [46]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [47]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Implement Logistic Regression and LDA on the original (unreduced) dataset

In [48]:
# Logistic Regression
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_scaled, y_train)
# Predict using the model
y_pred_lr = lr.predict(X_test_scaled)

# Linear Discriminant Analysis (LDA) used as a classifier
lda = LinearDiscriminantAnalysis()
lda.fit(X_train_scaled, y_train)
# Predict using the LDA model
y_pred = lda.predict(X_test_scaled)

# Evaluate the model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy for Logistic Regression:", accuracy_lr)

# Evaluate the LDA classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy for LDA:", accuracy)

Accuracy for Logistic Regression: 0.7625
Accuracy for LDA: 0.85


## Implement Logistic Regression and LDA on reduced dataset

In [49]:
# Step 2: Apply PCA for dimensionality reduction
pca = PCA(n_components=2)  # Keep 2 of the 3 possible principal components
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

In [50]:
# Step 3: Implement Logistic Regression and LDA using reduced dataset
# Logistic Regression

lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_pca, y_train)
# Predict using the model
y_pred_lr = lr.predict(X_test_pca)


# LDA
lda = LinearDiscriminantAnalysis()
lda.fit(X_train_pca, y_train)
# Predict using the LDA model
y_pred = lda.predict(X_test_pca)

# Evaluate the model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy for Logistic Regression:", accuracy_lr)
# Evaluate the LDA classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy for LDA:", accuracy)

Accuracy for Logistic Regression: 0.4375
Accuracy for LDA: 0.4375


## Conclusion

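**Effect of Dimensionality Reduction on Dataset 1**
- On the Breast Cancer dataset, reducing the 30 features to 2 principal components raises Logistic Regression accuracy from 97.4% to 99.1%, while LDA accuracy is unchanged at 95.6%. The Logistic Regression gain corresponds to two additional correct predictions on the 114-sample test set, so it is a small improvement on this particular split rather than a large one.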


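**Effect of Dimensionality Reduction on Dataset 2**

- On the BMI dataset, applying PCA causes a significant drop in accuracy: both models fall to 43.75%, down from 76.25% for Logistic Regression and 85% for LDA. This dataset has only three features, so keeping two principal components discards one of the three dimensions, and the discarded direction appears to have carried information needed to separate the classes.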
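One possible reason PCA hurts a three-feature dataset can be illustrated with synthetic data. If the standardized features are close to uncorrelated, each principal component explains roughly a third of the variance, so keeping two of three discards about a third. The sketch below is a hypothetical illustration only: the columns are randomly generated stand-ins for Gender, Height, and Weight, and the correlations among the real columns in `bmi_train.csv` have not been checked here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_samples = 400

# Hypothetical stand-ins for Gender, Height, Weight: three independent columns
synthetic = np.column_stack([
    rng.integers(0, 2, n_samples),     # binary, like the encoded Gender
    rng.uniform(140, 200, n_samples),  # loosely like Height
    rng.uniform(50, 160, n_samples),   # loosely like Weight
])

scaled = StandardScaler().fit_transform(synthetic)

# Fit PCA with all 3 components to see how variance is distributed
ratios = PCA().fit(scaled).explained_variance_ratio_
cumulative = np.cumsum(ratios)

print("Explained variance ratio per component:", ratios.round(3))
print("Variance retained by 2 components:", cumulative[1].round(3))
```

Running the same check on the notebook's own `X_train_scaled` would show how much variance the two retained components keep for the real data. PCA also ignores the class labels, so a low-variance direction that it drops can still be the one that separates the classes.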