# Fadhla Mohamed
# Mutua
# SM3201434

# **Analysis of Supervised and Unsupervised Learning on UCI Dataset (ID: 267)**  

## **1. Data Pretreatment**  



### **1.1 Data Loading and Inspection**  
The dataset was obtained from the UCI Machine Learning Repository using `fetch_ucirepo(id=267)`. After loading the data, we examined its structure, including the number of samples, features, and class distribution. The repository reports the following:
1. The data has no missing values.
2. The target (class label) is already encoded as 0 or 1.
3. There are 1372 samples and 4 features.
4. All four features are continuous.

data: [1372 rows x 5 columns]

| variance | skewness | curtosis | entropy  | targets |
|----------|----------|----------|----------|---------|
| 3.62160  | 8.66610  | -2.8073  | -0.44699 | 0       |
| 4.54590  | 8.16740  | -2.4586  | -1.46210 | 0       |
| 3.86600  | -2.63830 | 1.9242   | 0.10645  | 0       |
| 3.45660  | 9.52280  | -4.0112  | -3.59440 | 0       |
| 0.32924  | -4.45520 | 4.5718   | -0.98880 | 0       |
| ...      | ...      | ...      | ...      | ...     |
| 0.40614  | 1.34920  | -1.4501  | -0.55949 | 1       |
| -1.38870 | -4.87730 | 6.4774   | 0.34179  | 1       |
| -3.75030 | -13.45860| 17.5932  | -2.77710 | 1       |
| -3.56370 | -8.38270 | 12.3930  | -1.28230 | 1       |
| -2.54190 | -0.65804 | 2.6842   | 1.19520  | 1       |


### **1.2 Scaling, Normalization and Sorting Issues in the Dataset** 
The dataset consists of numerical features whose values lie on different scales: across all features, the raw values range from -13.7731 to 17.9274. To ensure proper model training and clustering, **feature scaling** was applied using standardization (z-score normalization). The rows in the preview above are also ordered by class, which is why the reassembled dataset below is randomized.

The data was split into training and test sets immediately after loading, then rescaled, and finally reassembled for the models that require the full dataset.
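
The split-then-standardize step can be sketched as follows, using only the Python standard library. The source does not state which statistics were used for rescaling; this sketch assumes the mean and standard deviation are computed on the training rows only and then applied to both sets. The function names, the 20% test fraction, and the seed are hypothetical.

```python
import random
import statistics

def train_test_split_rows(rows, test_fraction, seed):
    """Shuffle row indices and split the rows into train and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

def fit_scaler(train_rows):
    """Return per-feature (mean, std) computed on the training rows only."""
    columns = list(zip(*train_rows))
    return [(statistics.fmean(c), statistics.pstdev(c)) for c in columns]

def transform(rows, params):
    """Apply z = (x - mean) / std to every feature of every row."""
    return [[(x - m) / s for x, (m, s) in zip(row, params)] for row in rows]

# Feature values of the first five rows shown in the preview of section 1.1.
rows = [
    [3.62160, 8.66610, -2.8073, -0.44699],
    [4.54590, 8.16740, -2.4586, -1.46210],
    [3.86600, -2.63830, 1.9242, 0.10645],
    [3.45660, 9.52280, -4.0112, -3.59440],
    [0.32924, -4.45520, 4.5718, -0.98880],
]

train, test = train_test_split_rows(rows, test_fraction=0.2, seed=0)
params = fit_scaler(train)               # statistics come from the training rows
train_scaled = transform(train, params)
test_scaled = transform(test, params)    # test rows reuse the training statistics
```

Fitting the scaler on the training rows alone keeps information from the test rows out of the preprocessing; whether the original analysis did this is not stated.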

full dataset randomized and scaled: [1372 rows x 5 columns]

| variance | skewness | curtosis | entropy  | targets |
|----------|----------|----------|----------|---------|
| 0.904618 | 1.601126 | -1.265374 | -1.495569 | 0.0     |
| 1.532814 | -0.691013 | -0.000450 | 0.973356 | 0.0     |
| -0.367168 | -1.662094 | 1.257462 | 0.697353 | 1.0     |
| -2.299623 | 1.344148 | -0.419396 | -2.767430 | 1.0     |
| -0.539056 | -0.520896 | 0.148416 | 0.520688 | 1.0     |
| ...      | ...      | ...      | ...      | ...     |
| 0.706408 | 0.908746 | -0.465262 | 0.769656 | 0.0     |
| 1.130878 | 0.958700 | -0.751494 | 0.639514 | 0.0     |
| -1.804741 | 0.344855 | -0.217882 | -0.196042 | 1.0     |
| -0.369069 | -0.631649 | -0.471420 | 0.563440 | 1.0     |
| 1.394944 | -1.047881 | 0.753370 | 1.083987 | 0.0     |




---



## **2. Unsupervised Learning**  

### **2.1 PCA for Visualization**  
**Principal Component Analysis (PCA)** was applied to reduce the dataset to two dimensions for visualization. The first two principal components were plotted, with points colored by their actual class labels.

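**Observations (from plot):**  
- The two classes overlap in the plane of the first two principal components, so they are **not linearly separable** in this reduced space.  
- This overlap suggests that a linear model restricted to these two components would struggle with classification; by itself it says nothing about separability in the full four-dimensional feature space.  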
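
To make the projection step concrete, here is a minimal sketch of a two-component PCA in plain Python. It uses power iteration with deflation on the covariance matrix, which is a swapped-in technique chosen to avoid dependencies; it is not necessarily how the plot above was produced. All function names are hypothetical.

```python
import math

def center(rows):
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def mat_vec(matrix, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

def dominant_eigenpair(matrix, iterations=1000):
    """Power iteration: (eigenvalue, unit eigenvector) of the largest eigenvalue."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v

def pca_2d(rows):
    """Project rows onto the two directions of largest variance."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    l1, v1 = dominant_eigenpair(cov)
    # Deflation: remove the first component so the next one can be found.
    deflated = [[cov[a][b] - l1 * v1[a] * v1[b] for b in range(d)]
                for a in range(d)]
    l2, v2 = dominant_eigenpair(deflated)
    projected = [[sum(x * y for x, y in zip(r, v1)),
                  sum(x * y for x, y in zip(r, v2))] for r in centered]
    return projected, (v1, v2), (l1, l2)

# Feature values of the first five rows shown in the preview of section 1.1.
rows = [
    [3.62160, 8.66610, -2.8073, -0.44699],
    [4.54590, 8.16740, -2.4586, -1.46210],
    [3.86600, -2.63830, 1.9242, 0.10645],
    [3.45660, 9.52280, -4.0112, -3.59440],
    [0.32924, -4.45520, 4.5718, -0.98880],
]
projected, components, variances = pca_2d(rows)
```

Each entry of `projected` is a 2D point that could be plotted and colored by its class label; `variances` holds the variance captured by each component.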



![PCA](PCA.png)

### **2.2 K-Means Clustering**  
**K-Means clustering** was applied with `k=2` (assuming two clusters).  

**Results:**  
- When using **only the first two PCA components**, k-means **misclassified a large share of the points**, showing that a 2D projection may not contain enough information.  
- When using **all features**, clustering improved but misclassifications persisted.   

With only the first two PCA components, the metrics against the true labels are:

| Metric        | 0.0  | 1.0  | Macro Avg | Weighted Avg | Accuracy |
|--------------|------|------|-----------|--------------|----------|
| Precision    | 0.50 | 0.38 | 0.44      | 0.45         |   -   |
| Recall       | 0.46 | 0.42 | 0.44      | 0.44         |    -      |
| F1-Score     | 0.48 | 0.40 | 0.44      | 0.44         |    -      |
| Support      | 762  | 610  | 1372      | 1372         |   -       |
| Accuracy      | -  | -  | -      | -         |   0.44       |

Reading this table:

1. Precision:
    - For class 0.0, the model's precision is 0.50, meaning that when it predicts class 0, it is correct 50% of the time.
    - For class 1.0, the precision is 0.38, so when it predicts class 1, it is correct 38% of the time.

2. Recall:
    - For class 0.0, recall is 0.46, meaning the model correctly identifies 46% of actual class 0 instances.
    - For class 1.0, recall is 0.42, meaning it correctly identifies 42% of actual class 1 instances.

3. F1-Score:
    - For class 0.0, the F1-score (the harmonic mean of precision and recall) is 0.48.
    - For class 1.0, the F1-score is 0.40, showing lower performance in predicting this class.

Overall, the clustering assigns the correct class to 44% of the 1372 samples.
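
As a worked illustration of how the metrics in such a table are defined, the sketch below computes per-class precision, recall, and F1 from two label lists. The label lists are toy values chosen for the example, not data from this analysis, and the function names are hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Toy labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1, 1]

metrics_class_0 = per_class_metrics(y_true, y_pred, label=0)
metrics_class_1 = per_class_metrics(y_true, y_pred, label=1)
overall_accuracy = accuracy(y_true, y_pred)
```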

For comparison, k-means on the full set of four features gives:

| Metric        | 0.0  | 1.0  | Accuracy | Macro Avg | Weighted Avg |
|--------------|------|------|----------|-----------|--------------|
| Precision    | 0.61 | 0.50 | -     | 0.56      | 0.56         |
| Recall       | 0.55 | 0.57 |   -   | 0.56      | 0.56         |
| F1-Score     | 0.58 | 0.53 |   -   | 0.56      | 0.56         |
| Support      | 762  | 610  |   -   | 1372      | 1372         |
| Accuracy      | -  | -  | 0.56      | -         |   -       |

Every metric is higher when all four features are used, which indicates that the two-component PCA projection discards information relevant to the class structure.
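
One caveat when scoring k-means against class labels is that cluster ids are arbitrary: with two clusters, swapping the ids turns an accuracy of `a` into `1 - a`. The source does not say how cluster ids were matched to classes, so the following is only a sketch of one possible matching, which tries every assignment of class labels to cluster ids and keeps the best. It assumes there are no more clusters than classes; the function names are hypothetical and the labels are toy values.

```python
from itertools import permutations

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def best_label_mapping(y_true, cluster_ids):
    """Return (mapping, accuracy) for the cluster-to-class mapping that scores best."""
    labels = sorted(set(y_true))
    clusters = sorted(set(cluster_ids))
    best_mapping, best_accuracy = None, -1.0
    # Each permutation assigns a distinct class label to every cluster id.
    for assigned in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        score = accuracy(y_true, [mapping[c] for c in cluster_ids])
        if score > best_accuracy:
            best_mapping, best_accuracy = mapping, score
    return best_mapping, best_accuracy

# Toy example: the clustering is good, but its ids are swapped.
y_true = [0, 0, 0, 1, 1, 1]
cluster_ids = [1, 1, 0, 0, 0, 0]

raw_accuracy = accuracy(y_true, cluster_ids)
mapping, aligned_accuracy = best_label_mapping(y_true, cluster_ids)
```

In this toy case the raw accuracy is 1/6 while the aligned accuracy is 5/6, which shows how much the matching convention can matter.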

![2_nn_PCA](2_nn_PCA.png)

![2_nn_full](2_nn_full.png)

### **2.3 t-SNE for Nonlinear Projection**  
We used **t-SNE** for dimensionality reduction and visualized the data in 2D.  

**Observations:**  
- t-SNE provided a **better separation** than PCA, suggesting some non-linear class structure.  
- The class distributions are still somewhat mixed, indicating potential challenges for clustering algorithms.  


![t_SNE](t_SNE.png)

### **2.4 DBSCAN Clustering**  
We applied **DBSCAN**, a density-based clustering algorithm.  

**Results:**   
- It identified core clusters but also **classified some points as noise**.  
- The results depended strongly on the hyperparameters `eps` and `min_samples`; `eps` was derived from nearest-neighbor distances computed with `n_neighbors = 50`.
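
A common way to base `eps` on neighbor distances is the k-distance curve: sort every point's distance to its k-th nearest neighbor and pick the value at the elbow. The sketch below assumes this is the kind of analysis meant here; the chord-based elbow rule, the function names, and the toy points are all illustrative choices, not the author's procedure.

```python
import math

def k_distances(points, k):
    """Sorted distances from each point to its k-th nearest neighbor (self excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

def elbow_value(curve):
    """Value of the point lying farthest below the chord joining the curve's endpoints."""
    n = len(curve)
    first, last = curve[0], curve[-1]
    best_index, best_gap = 0, 0.0
    for i, value in enumerate(curve):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - value
        if gap > best_gap:
            best_index, best_gap = i, gap
    return curve[best_index]

# Toy data: two tight groups of four points and one isolated outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]

curve = k_distances(points, k=2)
eps_candidate = elbow_value(curve)
```

With the real data one would call `k_distances` with `k=50` on the scaled features and inspect the curve before trusting any automatic elbow choice.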

#### **DBSCAN Metrics with Noise Removed**

| Cluster | Precision | Recall | F1-Score | Support | **Accuracy** |
|---------|-----------|--------|----------|---------|---------|
| **0.0** | 1.00     | 0.83   | 0.91     | 521     | -  |
| **1.0** | 0.96     | 0.80   | 0.88     | 369     |-     |
| **2.0** | 0.00     | 0.00   | 0.00     | 0       |-     |
| **3.0** | 0.00     | 0.00   | 0.00     | 0       |-     |
| **Accuracy**  | -  | -  | - | - |**0.82**     |
| **Macro Avg** | 0.49 | 0.41 | 0.45 | 890 |-     |
| **Weighted Avg** | 0.99 | 0.82 | 0.90 | 890 |-     |

Reading this table:

The table lists four cluster labels (0.0, 1.0, 2.0, and 3.0). Assuming support counts true-label instances, as in a standard classification report, the zero support for 2.0 and 3.0 means that no true class carries those labels: they are most likely extra clusters found by DBSCAN, and every point placed in them counts as an error for class 0.0 or 1.0.
The majority of the data points are assigned to clusters 0.0 and 1.0.

1. Precision:
    - Cluster 0.0 has a precision of 1.00, meaning all points assigned to this cluster were correctly grouped (no false positives).
    - Cluster 1.0 has a precision of 0.96, indicating that most points were correctly assigned, but a few may have been misclassified.

2. Recall:
    - Cluster 0.0 has a recall of 0.83, meaning 83% of the actual members of this cluster were successfully identified.
    - Cluster 1.0 has a recall of 0.80, meaning 80% of the actual points belonging to this cluster were captured.
    - Recall is lower than precision because part of each class ended up outside the cluster matching its class, mostly in the extra clusters 2.0 and 3.0.

3. F1-Score:
    - Cluster 0.0: 0.91 (high, meaning both precision and recall are strong).
    - Cluster 1.0: 0.88 (also high, but slightly lower than cluster 0.0).

With an accuracy of 0.82, 82% of the 890 non-noise points were assigned to the cluster matching their class.

- Macro Average: The unweighted mean of precision, recall, and F1-score across clusters. Since clusters 2.0 and 3.0 have zero support, their zero scores pull the macro average down.
- Noise points were removed before computing these metrics, so the scores describe only the 890 clustered points out of 1372; the figures would be lower if noise points counted as errors.
- Weighted Average: Averages the scores while considering the number of points in each cluster. The weighted values are high because the meaningful clusters (0.0 and 1.0) have strong performance.
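
The difference between the two averages can be checked by hand on the recall column of the table above. This is a minimal sketch with hypothetical function names; because the inputs are the rounded table values, results agree with the table only up to rounding.

```python
def macro_average(scores):
    """Unweighted mean over all listed clusters, including empty ones."""
    return sum(scores) / len(scores)

def weighted_average(scores, supports):
    """Mean of the scores weighted by the support of each cluster."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# Recall and support for clusters 0.0, 1.0, 2.0, 3.0, copied from the table.
recall = [0.83, 0.80, 0.00, 0.00]
support = [521, 369, 0, 0]

macro_recall = macro_average(recall)                 # the table reports 0.41
weighted_recall = weighted_average(recall, support)  # the table reports 0.82
```

The two zero-support clusters halve the macro average but have no effect on the weighted one, which is why the two rows of the table differ so much.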

![Best_epsilon](Best_epsilon.png)

![DBSCAN](DBSCAN.png)


---



## **3. Supervised Learning**  



### **3.1 Logistic Regression**
- The model reaches a high accuracy of 0.9770. The confusion matrix shows only 23 incorrect predictions, which matches this accuracy for the 1001 evaluated instances (the 568 + 433 support in the performance comparison table), not for all 1372 samples.
- Evaluating the effect of **regularization** using cross-validation gives the following best parameters per metric:

| Metric      | Best Score | Parameters                          |
|------------|-----------|------------------------------------|
| Accuracy   | 0.9900    | {'penalty': 'l1', 'C': 2.1544}   |
| Precision  | 0.9901    | {'penalty': 'l1', 'C': 10.0}     |
| Recall     | 0.9906    | {'penalty': 'l1', 'C': 2.1544}   |
| F1-Score   | 0.9898    | {'penalty': 'l1', 'C': 2.1544}   |

- The logistic regression model therefore performs well on the training data, with cross-validated scores of about 0.99 on every metric.
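
The cross-validated search over regularization settings can be sketched as below. This is a generic outline under stated assumptions, not the author's code: the grid is a hypothetical log-spaced one chosen because it contains the reported values 2.1544 and 10.0, the fold count of 5 is assumed, and `score_fn` is a placeholder for training and scoring a model on one fold.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, validation_indices) for each of n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def grid_search(candidates, score_fn, n_samples, n_folds=5):
    """Return (best_params, best_mean_score) over all candidates.

    score_fn(params, train_idx, val_idx) must fit a model on the training
    indices and return its score on the validation indices.
    """
    best = None
    for params in candidates:
        scores = [score_fn(params, train, val)
                  for train, val in k_fold_indices(n_samples, n_folds)]
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[1]:
            best = (params, mean_score)
    return best

# Hypothetical grid: C from 0.01 to 10 in steps of one third of a decade.
c_grid = [10 ** (e / 3) for e in range(-6, 4)]
candidates = [{"penalty": penalty, "C": c}
              for penalty in ("l1", "l2") for c in c_grid]
```

Running the search once per metric (accuracy, precision, recall, F1) would produce one best setting per row, as in the table above.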

![confusion_matrix](confusion_matrix.png)


### **3.2 Decision Tree (ID3 Algorithm)**  
- Greedy algorithm for tree construction.  
- Hyperparameters (depth, minimum samples per leaf) were optimized via cross-validation. 

The algorithm reaches an accuracy of 0.9470 on the training data.

Cross-validation gives:
- Mean Accuracy: 0.9410
- Best Accuracy: 0.9599
- Best Tree:

    - variance <= 0.0827  
        - skewness <= -0.2305 → **1.0**  
        - skewness > -0.2305  
            - skewness <= 0.3693 → **1.0**  
            - skewness > 0.3693  
                - variance <= -0.7789  
                    - skewness <= 1.0834 → **1.0**  
                    - skewness > 1.0834 → **0.0**  
                - variance > -0.7789 → **0.0**  
    - variance > 0.0827  
        - variance <= 0.9059  
            - curtosis <= -0.3654  
                - skewness <= 0.8130 → **1.0**  
                - skewness > 0.8130 → **0.0**  
            - curtosis > -0.3654 → **0.0**  
        - variance > 0.9059 → **0.0** 
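
Read as code, the best tree is a short chain of threshold tests. The sketch below transcribes the splits listed above into a function (the function name is hypothetical) and assumes the inputs are the standardized feature values, since the thresholds are on the scaled data. The tree never tests entropy, so that feature is not an argument.

```python
def predict_best_tree(variance, skewness, curtosis):
    """Return the class (0.0 or 1.0) assigned by the reported best tree."""
    if variance <= 0.0827:
        if skewness <= -0.2305:
            return 1.0
        if skewness <= 0.3693:
            return 1.0
        # Here skewness > 0.3693.
        if variance <= -0.7789:
            return 1.0 if skewness <= 1.0834 else 0.0
        return 0.0
    # Here variance > 0.0827.
    if variance <= 0.9059:
        if curtosis <= -0.3654:
            return 1.0 if skewness <= 0.8130 else 0.0
        return 0.0
    return 0.0

# First three rows of the scaled preview in section 1.2 (targets 0, 0, 1).
first = predict_best_tree(0.904618, 1.601126, -1.265374)
second = predict_best_tree(1.532814, -0.691013, -0.000450)
third = predict_best_tree(-0.367168, -1.662094, 1.257462)
```

On these three rows the predictions agree with the listed targets.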



### **3.3 Naive Bayes Classifier**  
- Assumes feature independence.  
- Reached lower accuracy than the other classifiers, plausibly because the independence assumption does not hold for these features.

The Gaussian Naive Bayes classifier reaches an accuracy of 0.8421, and cross-validation gives:
- Mean Accuracy: 0.8382
- Best Accuracy: 0.8978

- Best parameters:
 
| Parameter       | Value |
|----------------|----------------------------------------------------------------------------------|
| n_labels       | 2 |
| unique_labels  | [0., 1.] |
| n_classes      | 2 |
| mean           | [[ 0.7461,  0.4655, -0.2144, -0.0412,  0. ], [-0.6915, -0.4813,  0.1274, -0.0640,  1. ]] |
| variance       | [[0.5237, 0.8054, 0.6522, 1.0811, 1e-9], [0.4388, 0.9101, 1.7591, 0.9921, 1e-9]] |
| prior          | [-0.6161, -0.7767] |
| Score          | 0.8978 |
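
To show how these parameters produce a prediction, here is a minimal Gaussian Naive Bayes scoring sketch. It rests on two assumptions that the source does not confirm: the `prior` row holds log-priors (their exponentials sum to about 1), and the fifth entry of each mean and variance vector corresponds to the target column and is dropped. Names are hypothetical.

```python
import math

# Per-class parameters from the table above, first four entries only (assumption).
MEANS = [[0.7461, 0.4655, -0.2144, -0.0412],
         [-0.6915, -0.4813, 0.1274, -0.0640]]
VARIANCES = [[0.5237, 0.8054, 0.6522, 1.0811],
             [0.4388, 0.9101, 1.7591, 0.9921]]
LOG_PRIORS = [-0.6161, -0.7767]  # assumed to be natural-log priors

def log_posteriors(x):
    """Unnormalized log-posterior of each class for one standardized sample."""
    scores = []
    for mean, var, log_prior in zip(MEANS, VARIANCES, LOG_PRIORS):
        # Independence assumption: the log-likelihood is a sum over features.
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, mean, var)
        )
        scores.append(log_prior + log_likelihood)
    return scores

def predict(x):
    """Index of the class with the highest log-posterior."""
    scores = log_posteriors(x)
    return scores.index(max(scores))

label_at_mean_0 = predict(MEANS[0])
label_at_mean_1 = predict(MEANS[1])
```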



### **3.4 k-Nearest Neighbors (k-NN)**  
- Hyperparameter `k` was tuned via cross-validation.  
- Performed well but is computationally expensive at prediction time, since each query is compared against the stored training points.

With p = 2 (Euclidean distance) and k = 5 neighbors, the accuracy is 0.9900.

Performing cross-validation we get:
- Best Hyperparameters: k=2, distance=euclidean, p=1
- Best Cross-Validation Accuracy: 0.9985
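
For reference, a minimal k-NN sketch with a Minkowski distance, where `p = 2` gives the Euclidean and `p = 1` the Manhattan distance. The function names and toy points are hypothetical, and this brute-force version is only meant to illustrate the roles of `k` and `p`.

```python
from collections import Counter

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_x, train_y, query, k, p=2):
    """Majority label among the k training points closest to the query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    # On a tie, most_common returns the label encountered first,
    # which is the label of the nearer neighbor.
    return votes.most_common(1)[0][0]

# Toy training set for illustration only.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_x, train_y, (0.5, 0.5), k=3)
near_far_group = knn_predict(train_x, train_y, (5.5, 5.5), k=3, p=1)
```

Every prediction scans the whole training set, which is the cost referred to above.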



### **3.5 Performance Comparison**  
The table below reports **accuracy, precision, recall, and F1-score** for the supervised models on the test set.  

| Model                | Accuracy | Precision | Recall | F1-Score | Support |
|----------------------|----------|-----------|--------|----------|----------|
| Naive Bayes | 0.8422      | **0**: 0.8486 <br> **1**: 0.8329       | **0**: 0.8785 <br> **1**: 0.7945    | **0**: 0.8633 <br> **1**: 0.8132      |**0**: 568 <br> **1**: 433      |
| Decision Tree       | 0.9471      | **0**: 0.9813 <br> **1**: 0.9077       | **0**: 0.9243 <br> **1**: 0.9769    | **0**: 0.9519 <br> **1**: 0.9410      |**0**: 568 <br> **1**: 433      |
|    k-NN      | 0.9970      | **0**: 1.0000 <br> **1**: 0.9931       | **0**: 0.9947 <br> **1**: 1.0000    | **0**: 0.9974 <br> **1**: 0.9965      |**0**: 568 <br> **1**: 433      |
|  Logistic Regression  | 0.9770      | **0**: 0.9964 <br> **1**: 0.9535       | **0**: 0.9630 <br> **1**: 0.9954    | **0**: 0.9794 <br> **1**: 0.9740      |**0**: 568 <br> **1**: 433      |

1. k-NN achieves the highest accuracy (0.9970) with near-perfect precision, recall, and F1-score for both classes, making it the best-performing model. However, k-NN can be sensitive to noisy data and computationally expensive for large datasets.

2. Decision Tree also performs well (0.9471 accuracy) but is weaker than both k-NN and Logistic Regression. It has high precision and recall but may be prone to overfitting, depending on the depth of the tree.

3. Logistic Regression performs better than Decision Tree, with 0.9770 accuracy. Its weakest figures are the precision for class 1 (0.9535) and the recall for class 0 (0.9630), so most of its errors are class 0 samples predicted as class 1.

4. Naive Bayes has the lowest accuracy (0.8422) among the models, with a clearly lower recall for class 1 (0.7945). This means it produces more false negatives for class 1, potentially due to its assumption of feature independence.


The table below reports the same metrics for the unsupervised methods, computed against the true labels. The k-means rows cover all 1372 samples, and the DBSCAN row covers the 890 non-noise points.

| Model                | Accuracy | Precision | Recall | F1-Score | Support |
|----------------------|----------|-----------|--------|----------|----------|
|  k-means (2 PCA components)  | 0.44      | **0**: 0.50 <br> **1**: 0.38       | **0**: 0.46 <br> **1**: 0.42    | **0**: 0.48 <br> **1**: 0.40     |**0**: 762 <br> **1**: 610     |
|  k-means (full data)  | 0.56      | **0**: 0.61 <br> **1**: 0.50       | **0**: 0.55 <br> **1**: 0.57    | **0**: 0.58 <br> **1**: 0.53      |**0**: 762 <br> **1**: 610      |
|  DBSCAN (no Noise)  | 0.82      | **0**: 1.00 <br> **1**: 0.96       | **0**: 0.83 <br> **1**: 0.80    | **0**: 0.91 <br> **1**: 0.88      |**0**: 521 <br> **1**: 369      |


The supervised models (Naive Bayes, Decision Tree, k-NN, Logistic Regression) achieve strong accuracy scores, ranging from 84.22% to 99.70%, while the unsupervised methods (k-means on two PCA components, k-means on the full data, DBSCAN without noise) perform significantly worse, with accuracy ranging from 44% to 82%.

This suggests that the supervised models are well suited to the classification task, while the clustering methods, which never see the labels, struggle to separate the two classes.



---



## **4. Recommendations**  

1. **Feature Engineering:**  
   - Use polynomial features to capture non-linear relationships or kernel methods for better class separation.  
2. **Ensemble Methods:**  
   - Use **Random Forest** or **Gradient Boosting** for better generalization.
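
As an illustration of the first recommendation, the sketch below expands a feature vector with all monomials up to a given degree, which lets a linear model fit curved class boundaries. The function name is hypothetical and no bias term is added; this is one simple way to build such features, not a prescribed one.

```python
from itertools import combinations_with_replacement

def polynomial_features(row, degree=2):
    """Return all monomials of the input features with total degree 1..degree."""
    features = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(row)), d):
            value = 1.0
            for index in combo:
                value *= row[index]
            features.append(value)
    return features

# Two features x and y expand to x, y, x*x, x*y, y*y.
expanded = polynomial_features([2.0, 3.0])
```

For the four features of this dataset, degree 2 yields 14 columns (4 linear and 10 quadratic terms).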



---
