<a href="https://colab.research.google.com/github/MohammedAbdurRehman/CS-351L---AI-Lab-GitHub-Repository_2022299/blob/main/Mohammad_Abdur_Rehman_CS351L_Lab04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



### Model: k-Nearest Neighbors (k-NN)
- **Description**: k-NN is a simple, non-parametric classification algorithm. It classifies a data point based on the majority class among its k nearest neighbors in the feature space.
- **Working**: The algorithm calculates the distance (typically Euclidean) between the input instance and all training samples, identifies the closest k instances, and assigns the most common class label among them.
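
The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the scikit-learn implementation: the function name `knn_predict` and the toy points are hypothetical, and ties are broken by whichever label `Counter` returns first.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest samples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Return the most common label among those neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two points near the origin (class 0), three near (1, 1) (class 1).
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_y = [0, 0, 1, 1, 1]

prediction = knn_predict(train_X, train_y, (0.2, 0.1), k=3)
print(prediction)
```

Note that there is no separate training step here: all of the work happens at prediction time, which is why prediction cost grows with the size of the training set.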

### Dataset: RCV1
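- **Description**: The Reuters Corpus Volume 1 (RCV1) dataset contains over 800,000 news documents with 47,236 features. As loaded by scikit-learn, the features are cosine-normalized, log TF-IDF term weights stored in a sparse matrix, not raw word counts.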
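- **Target Classes**: The dataset supports multi-label classification: each document can belong to several of 103 topic categories. For simplicity, only the first category column is used here, which turns the task into binary classification (the document belongs to that category or not).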
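
Reducing a multi-label target to one binary label amounts to slicing a single column out of the sparse indicator matrix. The sketch below shows the idea on a small, made-up 4 x 3 matrix (a stand-in for `rcv1.target`, which is assumed to have the same CSR layout with one column per category).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical indicator matrix: rows are documents, columns are categories.
Y = csr_matrix(
    np.array(
        [[1, 0, 1],
         [0, 0, 1],
         [1, 1, 0],
         [0, 0, 0]],
        dtype=np.uint8,
    )
)

# Slice the first category column and flatten it into a 1-D label vector.
y = Y[:, 0].toarray().ravel()

# Count how many documents fall in each class (0 = not in category, 1 = in category).
counts = np.bincount(y, minlength=2)
print(y, counts)
```

Checking the class counts right after this step is worthwhile, because a single category is often heavily outnumbered by the documents that do not carry it.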

### Code Workflow
1. **Import Libraries**: Essential libraries for data handling, model building, and visualization are imported.
2. **Load Dataset**: The RCV1 dataset is fetched, extracting features and a single target label.
3. **Display Information**: Key statistics about the dataset, such as the number of samples, features, and class distribution, are printed.
4. **Split Data**: The dataset is divided into training and testing sets to evaluate model performance.
5. **Train Model**: If the training set has at most 10,000 samples, the k-NN model (k = 3) is trained on the training data; otherwise steps 5 to 8 are skipped.
6. **Make Predictions**: The model predicts classes for the test data.
7. **Evaluate Performance**: The accuracy and a classification report are generated to assess the model's effectiveness.
8. **Visualize Results**: Truncated SVD projects the test features onto two components, and the predictions are displayed in a 2D scatter plot.
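
Steps 4 to 7 can also be run on a random subsample while keeping the features sparse. The sketch below is a minimal illustration under stated assumptions: it uses a synthetic sparse matrix as a stand-in for `rcv1.data`, a made-up labelling rule, and a hypothetical subsample size of 1,000 rows, none of which come from the notebook.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(42)

# Synthetic stand-in for rcv1.data: 2,000 sparse rows with 500 features.
X = sparse_random(2000, 500, density=0.02, format="csr", random_state=rng)
# Hypothetical binary label: 1 when any of the first 50 features is non-zero.
y = (np.asarray(X[:, :50].sum(axis=1)).ravel() > 0).astype(int)

# Draw a random subsample of rows so the neighbour search stays cheap.
idx = rng.choice(X.shape[0], size=1000, replace=False)
X_sub, y_sub = X[idx], y[idx]

# Stratify so both classes keep their proportions in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X_sub, y_sub, test_size=0.3, random_state=42, stratify=y_sub
)

# KNeighborsClassifier accepts CSR input and uses brute-force search for it.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the synthetic subsample: {accuracy * 100:.2f}%")
```

On the real dataset, the same pattern would replace the synthetic `X` and `y` with the fetched features and label; the accuracy printed here says nothing about RCV1 itself.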



In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_rcv1
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Load the full RCV1 dataset (downloaded and cached on the first call)
rcv1 = fetch_rcv1()
X = rcv1.data  # Features (sparse CSR matrix)
y = rcv1.target[:, 0].toarray().ravel()  # Keep only the first category column as a binary label

# Display dataset information
print("RCV1 Dataset Information:")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Number of target classes: {np.unique(y).shape[0]}")
print(f"Class distribution: {np.bincount(y)}\n")

# Split the data into training and testing sets (X stays sparse after the split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Implementing k-Nearest Neighbors with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# Convert to dense arrays only when the training set is small enough to fit in memory.
# Note: KNeighborsClassifier also accepts sparse input, so the conversion is optional.
if X_train.shape[0] <= 10000:  # Process only if the dataset is small
    X_train_dense = X_train.toarray()
    X_test_dense = X_test.toarray()

    knn.fit(X_train_dense, y_train)  # Train the k-NN model

    # Making predictions on the test set
    y_pred = knn.predict(X_test_dense)

    # Evaluate the model's performance
    accuracy = accuracy_score(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred)

    # Print accuracy and classification report
    print(f"Accuracy of k-NN: {accuracy * 100:.2f}%")
    print("\nClassification Report:\n", classification_rep)

    # Reducing to 2 dimensions for visualization
    svd = TruncatedSVD(n_components=2)
    X_test_2d = svd.fit_transform(X_test_dense)

    # Visualizing the data after classification (2D projection)
    plt.figure(figsize=(8, 6))
    plt.scatter(X_test_2d[:, 0], X_test_2d[:, 1], c=y_pred, cmap='coolwarm', edgecolor='k', s=10)
    plt.title('RCV1 Dataset Classification Visualization (After Classification)')
    plt.xlabel('SVD Component 1')
    plt.ylabel('SVD Component 2')
    colorbar = plt.colorbar(label='Predicted Classes')
    colorbar.set_ticks([0, 1])
    colorbar.set_ticklabels(['Class 0', 'Class 1'])  # Adjust based on the chosen target
    plt.show()
else:
    print("Dataset too large for dense conversion.")


RCV1 Dataset Information:
Number of samples: 804414
Number of features: 47236
Number of target classes: 2
Class distribution: [780089  24325]

Dataset too large for dense conversion.
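
The final message follows from the size guard: with a 70/30 split of 804,414 samples, the training set is far above the 10,000-row limit, so no model is trained and no accuracy is reported. A rough estimate of what the dense conversion would cost is sketched below, assuming 8-byte floating-point values and a test set rounded up to a whole number of rows.

```python
import math

n_samples, n_features = 804_414, 47_236  # taken from the printed dataset information
test_size = 0.3

# Assumption: the test set size is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

bytes_per_value = 8  # assumption: 64-bit floats
dense_bytes = n_train * n_features * bytes_per_value

print(f"Training rows: {n_train}")
print(f"Dense training matrix: {dense_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense training matrix alone would need roughly 213 GB, which is why the features are best left sparse or subsampled before fitting.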
