# Task 31: Dimensionality Reduction Techniques

Dimensionality reduction techniques simplify complex datasets, reduce computational costs, and help 
prevent overfitting in machine learning models. Common methods include Principal Component Analysis 
(PCA), which transforms data into principal components with the greatest variance; Linear Discriminant 
Analysis (LDA), which finds linear combinations of features that best separate classes; and 
t-Distributed Stochastic Neighbor Embedding (t-SNE), a non-linear technique for visualizing 
high-dimensional data. Other techniques include Independent Component Analysis (ICA) for separating a 
multivariate signal into independent components, and feature selection methods like SelectKBest and 
Recursive Feature Elimination (RFE). To see the impact of these techniques, pick a high-dimensional 
dataset and observe how the results of a model change when dimensionality reduction is applied. 
This will help you understand the effectiveness of each technique in improving model performance.

### Importing necessary libraries and dataset

In [49]:
import numpy as np  # needed for np.vstack in the t-SNE section
import pandas as pd
from sklearn.decomposition import PCA, FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.manifold import TSNE
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

file_path = r'C:\Users\Huawei\Desktop\Iris.csv'
data = pd.read_csv(file_path)

### Define features and target

In [50]:
X = data.drop(['Id', 'Species'], axis=1)
y = data['Species']

### Encode target variable

In [51]:
y = pd.factorize(y)[0]  # factorize returns (codes, uniques); [0] keeps the integer code assigned to each row
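
To make the encoding step concrete, here is a minimal sketch of what `pd.factorize` returns. The four-row `species` series is a hypothetical stand-in for the `Species` column, not data from the notebook.

```python
import pandas as pd

# Hypothetical four-row sample standing in for the Species column.
species = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"]
)

# factorize returns a pair: integer codes per row, and the unique labels.
codes, uniques = pd.factorize(species)

print(codes)          # one integer per row, numbered by first appearance
print(list(uniques))  # the label that each integer stands for
```

Codes are assigned in order of first appearance, so the mapping depends on the row order of the input.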

### Standardize the features

In [52]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
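
As a quick check of what standardization does, the sketch below confirms that each scaled column has mean 0 and standard deviation 1. It assumes scikit-learn's bundled `load_iris` data in place of the local CSV file, and the `X_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for the CSV used in the notebook.
X_demo = load_iris().data
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# After scaling, every feature column is centered and has unit variance.
col_means = X_demo_scaled.mean(axis=0)
col_stds = X_demo_scaled.std(axis=0)

print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```

Note that the notebook fits the scaler on the full dataset before splitting. Fitting it on the training rows only would keep test-set statistics out of preprocessing.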

### Split the data into training and testing sets

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
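
To judge whether a reduction technique helps, it is useful to have a reference score from the same classifier trained on all four features. The following is a sketch under assumptions: it loads the bundled `load_iris` data rather than the CSV, uses `LogisticRegression` as the classifier (the estimator the notebook imports), and uses illustrative names suffixed with `_b` so the notebook's own variables are not overwritten.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data in place of the local CSV.
X_b, y_b = load_iris(return_X_y=True)
X_b = StandardScaler().fit_transform(X_b)

# Same split settings as the notebook: 20% test, random_state=42.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_b, y_b, test_size=0.2, random_state=42
)

# Train on all four features, with no dimensionality reduction.
baseline_model = LogisticRegression()
baseline_model.fit(Xb_train, yb_train)
baseline_accuracy = accuracy_score(yb_test, baseline_model.predict(Xb_test))
print("Baseline Accuracy: ", baseline_accuracy * 100)
```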

### Principal Component Analysis (PCA)

In [55]:
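pca = PCA(n_components=2)  # reduce the data to 2 dimensions
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# `model` is not defined in the cells shown; a LogisticRegression classifier is
# assumed here, matching the estimator used in the RFE section.
model = LogisticRegression()
model.fit(X_train_pca, y_train)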
y_pred_pca = model.predict(X_test_pca)
pca_accuracy = accuracy_score(y_test, y_pred_pca)
print("PCA Accuracy: ", pca_accuracy*100)

PCA Accuracy:  90.0
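
One way to see how much information two principal components retain is the `explained_variance_ratio_` attribute of a fitted `PCA`. This is a minimal sketch that assumes the bundled `load_iris` data and fits on all rows for simplicity. The names `X_p` and `pca_demo` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data, standardized as in the notebook.
X_p = StandardScaler().fit_transform(load_iris().data)

pca_demo = PCA(n_components=2).fit(X_p)

# Fraction of the total variance captured by each retained component.
ratios = pca_demo.explained_variance_ratio_
print(ratios)
print("Total variance retained:", ratios.sum())
```

Components are ordered by decreasing variance, so the first ratio is always the largest.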


### Linear Discriminant Analysis (LDA)

In [56]:
lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)
model.fit(X_train_lda, y_train)
y_pred_lda = model.predict(X_test_lda)
lda_accuracy = accuracy_score(y_test, y_pred_lda)
print("LDA Accuracy: ", lda_accuracy*100)

LDA Accuracy:  100.0
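
LDA can produce at most `min(n_classes - 1, n_features)` components, which is 2 for the three Iris species. The sketch below illustrates that limit. It assumes the bundled `load_iris` data and a recent scikit-learn release, where asking for more components raises a `ValueError`.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: bundled Iris data in place of the local CSV.
X_l, y_l = load_iris(return_X_y=True)

# Two components is the maximum for three classes.
lda_ok = LinearDiscriminantAnalysis(n_components=2).fit(X_l, y_l)
projected = lda_ok.transform(X_l)
print(projected.shape)

# Requesting a third component exceeds min(n_classes - 1, n_features).
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X_l, y_l)
    too_many_rejected = False
except ValueError as err:
    too_many_rejected = True
    print("Rejected:", err)
```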


### t-Distributed Stochastic Neighbor Embedding (t-SNE)

In [None]:
tsne = TSNE(n_components=2, random_state=42, perplexity=10)
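# perplexity must be smaller than the number of samples being embedded; it roughly sets how many neighbors each point considers.
# TSNE has no transform() method, so train and test rows are embedded together; the test rows therefore influence the embedding.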
X_combined = np.vstack((X_train, X_test))
X_combined_tsne = tsne.fit_transform(X_combined)
X_train_tsne = X_combined_tsne[:X_train.shape[0]]
X_test_tsne = X_combined_tsne[X_train.shape[0]:]
model.fit(X_train_tsne, y_train)
y_pred_tsne = model.predict(X_test_tsne)
tsne_accuracy = accuracy_score(y_test, y_pred_tsne)
print("t-SNE Accuracy: ", tsne_accuracy*100)

t-SNE Accuracy:  96.66666666666667
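
The reason the train and test rows are stacked before embedding is that scikit-learn's `TSNE` offers only `fit_transform`, with no separate `transform` for unseen data. A small sketch, assuming the bundled `load_iris` data and illustrative names:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data, standardized as in the notebook.
X_t = StandardScaler().fit_transform(load_iris().data)

tsne_demo = TSNE(n_components=2, random_state=42, perplexity=10)

# Unlike PCA or LDA, TSNE cannot project new rows after fitting.
has_transform = hasattr(tsne_demo, "transform")
print("TSNE has transform():", has_transform)

# All rows must therefore be embedded in a single call.
embedding = tsne_demo.fit_transform(X_t)
print(embedding.shape)
```

Because the test rows take part in building the embedding, an accuracy measured this way is not a clean held-out estimate. t-SNE is generally better suited to visualization than to feeding a classifier.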


### Independent Component Analysis (ICA)

In [58]:
ica = FastICA(n_components=2, random_state=42)
X_train_ica = ica.fit_transform(X_train)
X_test_ica = ica.transform(X_test)
model.fit(X_train_ica, y_train)
y_pred_ica = model.predict(X_test_ica)
ica_accuracy = accuracy_score(y_test, y_pred_ica)
print("ICA Accuracy: ", ica_accuracy*100)

ICA Accuracy:  93.33333333333333
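
As a partial sanity check on ICA, the recovered components should at least be uncorrelated with each other. This is a weaker property than statistical independence, which is what ICA aims for. The sketch assumes the bundled `load_iris` data and the default whitening of `FastICA`. The name `sources` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import FastICA
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data, standardized as in the notebook.
X_i = StandardScaler().fit_transform(load_iris().data)

# Estimate two components from the four standardized features.
sources = FastICA(n_components=2, random_state=42).fit_transform(X_i)

# Correlation matrix of the two components; off-diagonal entries
# should be close to zero because the data is whitened first.
corr = np.corrcoef(sources, rowvar=False)
print(np.round(corr, 6))
```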




### Feature Selection (SelectKBest)

In [59]:
selector = SelectKBest(score_func=f_classif, k=2)
# score_func=f_classif scores each feature with the ANOVA F-value, which measures how well that feature separates the classes.
# k=2 keeps the 2 features with the highest scores.
X_train_kbest = selector.fit_transform(X_train, y_train)
X_test_kbest = selector.transform(X_test)
model.fit(X_train_kbest, y_train)
y_pred_kbest = model.predict(X_test_kbest)
kbest_accuracy = accuracy_score(y_test, y_pred_kbest)
print("SelectKBest Accuracy: ", kbest_accuracy*100)

SelectKBest Accuracy:  100.0
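
Unlike PCA or LDA, `SelectKBest` keeps original columns, so it is possible to check which ones were chosen. The sketch below assumes the bundled `load_iris` data, whose feature names differ from the CSV column headers, and fits on all rows for simplicity.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: bundled Iris data in place of the local CSV.
iris = load_iris()

selector_demo = SelectKBest(score_func=f_classif, k=2)
selector_demo.fit(iris.data, iris.target)

# Print the ANOVA F-value of every feature.
for name, score in zip(iris.feature_names, selector_demo.scores_):
    print(f"{name}: {score:.1f}")

# get_support() returns a boolean mask over the original columns.
mask = selector_demo.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]
print("Selected:", selected)
```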


### Feature Selection (Recursive Feature Elimination)

In [60]:
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2)  # LogisticRegression is the estimator RFE uses to rank features
X_train_rfe = rfe.fit_transform(X_train, y_train)
X_test_rfe = rfe.transform(X_test)
model_rfe = LogisticRegression()
model_rfe.fit(X_train_rfe, y_train)
y_pred_rfe = model_rfe.predict(X_test_rfe)
rfe_accuracy = accuracy_score(y_test, y_pred_rfe)
print("RFE Accuracy: ", rfe_accuracy*100)

RFE Accuracy:  100.0
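
RFE records which features survived and the order in which the others were dropped. A minimal sketch of inspecting that, assuming the bundled `load_iris` data and illustrative names:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data, standardized as in the notebook.
iris_r = load_iris()
X_r = StandardScaler().fit_transform(iris_r.data)

rfe_demo = RFE(estimator=LogisticRegression(), n_features_to_select=2)
rfe_demo.fit(X_r, iris_r.target)

# support_ marks kept features; ranking_ is 1 for kept features and
# grows for features eliminated earlier.
for name, keep, rank in zip(
    iris_r.feature_names, rfe_demo.support_, rfe_demo.ranking_
):
    print(f"{name}: kept={keep}, rank={rank}")
```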


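### My understanding of the change in results of the model when dimensionality reduction is applied

- LDA, SelectKBest, and RFE each reached 100% accuracy on the test set while using only two features or components.
- PCA (90%), ICA (93.33%), and t-SNE (96.67%) scored slightly lower. The test set holds only 30 samples, so each misclassified sample costs about 3.3 percentage points, and these gaps amount to one to three samples.
- The outputs above report no accuracy for a model trained on all four features, so they show that a reduced feature space (a subset or combination of the input features) can preserve model performance. They do not by themselves show an improvement over the full feature set.
- The choice of technique affects the result, which highlights the importance of selecting an appropriate dimensionality reduction method. Here the methods that use the class labels (LDA, SelectKBest, RFE) scored highest.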
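
Because a single 30-sample test set makes the techniques hard to rank, a cross-validated comparison can give a steadier picture. The sketch below is one possible way to do it, not the notebook's method. It assumes the bundled `load_iris` data, wraps each reducer in a pipeline so that scaling and reduction are refit on every training fold, and leaves out t-SNE because it cannot transform held-out rows. The `reducers` and `cv_scores` names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data in place of the local CSV.
X_cv, y_cv = load_iris(return_X_y=True)

# Each reducer keeps two dimensions, as in the notebook.
reducers = {
    "PCA": PCA(n_components=2),
    "LDA": LinearDiscriminantAnalysis(n_components=2),
    "ICA": FastICA(n_components=2, random_state=42),
    "SelectKBest": SelectKBest(score_func=f_classif, k=2),
    "RFE": RFE(estimator=LogisticRegression(), n_features_to_select=2),
}

cv_scores = {}
for name, reducer in reducers.items():
    # Scaling and reduction are fit inside each fold, on training rows only.
    pipe = make_pipeline(StandardScaler(), reducer, LogisticRegression())
    cv_scores[name] = cross_val_score(pipe, X_cv, y_cv, cv=5).mean()

for name, score in cv_scores.items():
    print(f"{name}: {score * 100:.2f}")
```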