In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [11]:
# Importing the dataset: the first 13 columns are the features,
# the 14th column is the class label
dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

In [12]:
# --- Split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [13]:
# --- Define classifiers ---
models = {
    'Logistic': LogisticRegression(random_state=0),
    'SVMl': SVC(kernel='linear', random_state=0),
    'SVMnl': SVC(kernel='rbf', random_state=0),
    'KNN': KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2),
    'Navie': GaussianNB(),
    'Decision': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'Random': RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
}

# --- Different PCA component counts ---
pca_components = [2, 3, 4, 5]

# --- Result table ---
pca_results = pd.DataFrame(columns=list(models.keys()) + ['Explained_Variance'])


In [14]:
for n in pca_components:
    # Fit PCA on the training set only, then apply the same projection to the test set
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)

    # Fraction of the total variance captured by each retained component
    explained_var = pca.explained_variance_ratio_

    row_name = f"PCA_{n}"
    pca_results.loc[row_name, 'Explained_Variance'] = f"{explained_var} (Total={sum(explained_var):.2f})"

    # Refit every classifier on the reduced features and record its test accuracy
    for name, model in models.items():
        model.fit(X_train_pca, y_train)
        y_pred = model.predict(X_test_pca)
        acc = accuracy_score(y_test, y_pred)
        pca_results.loc[row_name, name] = round(acc, 3)

print("\nFinal PCA-based Accuracy Table:\n")
print(pca_results)


Final PCA-based Accuracy Table:

      Logistic   SVMl  SVMnl    KNN  Navie Decision Random  \
PCA_2    0.978  0.978  0.978  0.978  0.978    0.978  0.978   
PCA_3    0.978    1.0    1.0    1.0    1.0    0.956  0.956   
PCA_4    0.978  0.956  0.978  0.956    1.0    0.978  0.911   
PCA_5    0.978  0.956    1.0    1.0    1.0    0.978  0.978   

                                      Explained_Variance  
PCA_2               [0.37281068 0.18739996] (Total=0.56)  
PCA_3    [0.37281068 0.18739996 0.10801208] (Total=0.67)  
PCA_4  [0.37281068 0.18739996 0.10801208 0.07619859] ...  
PCA_5  [0.37281068 0.18739996 0.10801208 0.07619859 0...  


The output above shows the results of using Principal Component Analysis (PCA) to preprocess the data for several machine learning classification models. It reports the test accuracy of each model, and the explained variance, for different numbers of principal components.

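The table compares the accuracy of seven classification models using 2, 3, 4, or 5 principal components (PCA_2 to PCA_5): Logistic (logistic regression), SVMl (linear-kernel SVM), SVMnl (RBF-kernel SVM), KNN (5 nearest neighbors), Navie (the notebook's label for Gaussian Naive Bayes), Decision (decision tree), and Random (random forest with 10 trees).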

PCA_2: All models achieve an accuracy of 0.978, which indicates that the first two principal components alone are highly effective for this dataset, even though they capture only about 56% of the variance.

PCA_3: Four models (SVMl, SVMnl, KNN, and Navie) rise to 1.0 (100%), Logistic stays at 0.978, and Decision and Random drop to 0.956. Every change here is a single step of about 0.022, and all values in the table are multiples of roughly 1/45, which suggests a test set of about 45 samples in which one step is one prediction. The third component therefore helps some models and hurts the tree-based ones only marginally; a difference this small on a single split may simply be noise.
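To make the size of these differences concrete, the accuracy values can be converted into counts of misclassified test samples. This is a minimal sketch, assuming a test set of 45 samples; that number is inferred from the accuracy values and is not stated in the notebook. The names `N_TEST` and `errors_for` are hypothetical.

```python
# ASSUMPTION: 45 test samples, inferred from the accuracy values in the table;
# the notebook itself never prints the test-set size.
N_TEST = 45

# Distinct accuracy values that appear in the results table
reported = [1.0, 0.978, 0.956, 0.911]


def errors_for(acc, n_test=N_TEST):
    """Return the number of misclassified samples whose accuracy rounds to acc.

    Returns None if no error count out of n_test reproduces the value.
    """
    for wrong in range(n_test + 1):
        if round((n_test - wrong) / n_test, 3) == acc:
            return wrong
    return None


for acc in reported:
    print(f"accuracy {acc}: {errors_for(acc)} misclassified out of {N_TEST}")
```

Under that assumption, 0.978 and 1.0 differ by a single test sample, which is worth keeping in mind when ranking the models.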

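PCA_4 and PCA_5: The accuracy values become more varied across the models as more components are added. Logistic stays at 0.978 throughout and Navie stays at 1.0. SVMl falls to 0.956 with PCA_4 and stays there with PCA_5, while SVMnl and KNN dip with PCA_4 (to 0.978 and 0.956) and return to 1.0 with PCA_5. Decision returns to 0.978 for both. Random shows the largest swing: it falls to 0.911 with PCA_4, the lowest value in the table, and recovers to 0.978 with PCA_5. No model improves steadily as components are added, so retaining more variance does not by itself guarantee higher test accuracy. The extra components may add noise, although fluctuations of this size on a single train/test split are too small to establish a trend.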
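One way to check whether such fluctuations reflect the number of components or just the particular split is to repeat the comparison with cross-validation. The following is a rough sketch, not part of the original notebook. It assumes scikit-learn's built-in `load_wine()` data as a stand-in for Wine.csv (the two may differ), and it uses only logistic regression to stay short.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: load_wine() stands in for Wine.csv, which is not available here.
X, y = load_wine(return_X_y=True)

# Five stratified folds; shuffling with a fixed seed keeps the run repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for n in [2, 3, 4, 5]:
    # The pipeline refits the scaler and PCA inside each training fold,
    # so no information from the held-out fold leaks into preprocessing.
    pipe = make_pipeline(
        StandardScaler(),
        PCA(n_components=n),
        LogisticRegression(random_state=0),
    )
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
    cv_scores[n] = (scores.mean(), scores.std())
    print(f"PCA_{n}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the fold-to-fold standard deviation is comparable to the gaps between rows of the table, those gaps should not be read as a real effect of the component count.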

Analysis of Explained Variance

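The Explained_Variance column, printed as a second block because the table wraps, lists the fraction of the data's total variance captured by each retained principal component, with the cumulative total in parentheses. The PCA_4 and PCA_5 rows are truncated in the printed output, so their totals and the fifth ratio are not visible.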

PCA_2: The first two principal components capture a total of 56% of the variance in the data.

PCA_3: Adding the third principal component brings the cumulative explained variance up to 67%.

PCA_4: The fourth component brings the cumulative explained variance to about 74.4% (the four printed ratios sum to 0.744); the total itself is cut off in the output.
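The cumulative totals can be checked directly from the ratios printed in the output. This is a small sketch using only the four ratios that are visible; the fifth is truncated in the output and is deliberately left out. The variable names are illustrative.

```python
from itertools import accumulate

# Per-component explained variance ratios, copied from the printed output.
# The fifth ratio is truncated there, so only four are listed.
ratios = [0.37281068, 0.18739996, 0.10801208, 0.07619859]

# Running sum: entry k is the variance captured by the first k+1 components
cumulative = list(accumulate(ratios))

for n, total in enumerate(cumulative, start=1):
    print(f"first {n} component(s): {total:.3f}")
# first 1 component(s): 0.373
# first 2 component(s): 0.560
# first 3 component(s): 0.668
# first 4 component(s): 0.744
```

Inside the notebook, the same running totals could be obtained with NumPy's `cumsum` applied to `pca.explained_variance_ratio_`, assuming NumPy is imported.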