# Cancer Diagnosis Using K-Nearest Neighbors Classification

### Importing Libraries and Loading Data
<p style="font-size:17px">We start by importing the required libraries and loading the dataset from Cancer_Data.csv. Each row carries a diagnosis label, which the KNN classifier is trained to predict from the remaining columns.</p>

In [13]:
import pandas as pd
import numpy as np

data = pd.read_csv('Cancer_Data.csv')  # Load data from the 'Cancer_Data.csv' file
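Before preprocessing, it can help to check which columns contain no data at all. The sketch below uses a small, made-up CSV string (not the real `Cancer_Data.csv`) whose rows end with a trailing comma. That is one common way pandas ends up with a column named `Unnamed: N`; whether it is the cause in this dataset is an assumption, and the column `radius_mean` is a hypothetical name used only for illustration.

```python
import io

import pandas as pd

# Hypothetical three-column CSV; the trailing commas create an empty fourth column
csv_text = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,13.54,\n"
sample = pd.read_csv(io.StringIO(csv_text))

# Columns in which every value is missing
empty_columns = sample.columns[sample.isna().all()].tolist()
print(empty_columns)
```

Running the same `isna().all()` check on `data` would list any fully null columns before they are dropped.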

### Data Preprocessing
<p style="font-size:17px">After loading the data, we preprocess it for the model. We drop the null column Unnamed: 32, then standardize the numeric columns with StandardScaler and store the result in data_standardized.</p>

In [14]:
# Drop the null column 'Unnamed: 32'
data = data.drop('Unnamed: 32', axis=1)

# Extract numeric columns for standardization
numeric_columns = data.select_dtypes(include='number').columns

# Apply StandardScaler to each numeric column of the DataFrame
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_standardized = pd.DataFrame(
    scaler.fit_transform(data[numeric_columns]), columns=numeric_columns
)
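To make the transformation concrete: StandardScaler subtracts each column's mean and divides by its population standard deviation (ddof=0), so every scaled column has mean 0 and standard deviation 1. A minimal sketch on a made-up array, not on the cancer data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration; two features on very different scales
values = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

scaled = StandardScaler().fit_transform(values)

# Same result computed by hand: z = (x - mean) / std, with ddof=0
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

This matters for KNN because distances are otherwise dominated by whichever feature has the largest numeric range.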

### Splitting Data into Train and Test Sets
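<p style="font-size:17px">Now we split the dataset into training and testing sets using an 80:20 split. The features X and the target y are both taken from data (with Unnamed: 32 removed), so the scaled copy data_standardized does not feed into the steps that follow.</p>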

In [15]:
from sklearn.model_selection import train_test_split

# Split data into X (features) and y (target)
y = data['diagnosis']
X = data.drop('diagnosis', axis=1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
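As a rough illustration of what an 80:20 split produces, the sketch below splits a hypothetical 100-row frame. The class labels `'B'` and `'M'`, the 60/40 class balance, and the `stratify` argument are all assumptions for the example; the split above does not stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame: 60 'B' rows and 40 'M' rows
toy = pd.DataFrame({
    "feature": range(100),
    "diagnosis": ["B"] * 60 + ["M"] * 40,
})
X_toy = toy.drop("diagnosis", axis=1)
y_toy = toy["diagnosis"]

# stratify=y_toy keeps the class proportions equal in both parts (optional)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```

Stratifying can be worth considering when one diagnosis is much rarer than the other, since a plain random split may leave few examples of it in the test set.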

### Model Selection with GridSearchCV
<p style="font-size:17px">Next, we run a grid search over n_neighbors (1 to 30) and weights ('uniform' or 'distance') for a KNeighborsClassifier. The search uses 5-fold cross-validation on the training set with accuracy as the scoring metric.</p>

In [16]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn = KNeighborsClassifier()
param_grid = {'n_neighbors': list(range(1, 31)), 'weights': ['uniform', 'distance']}

grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
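One way to combine scaling with the grid search is to place StandardScaler and the classifier in a `Pipeline`, so the scaler is fit only on the training folds of each cross-validation split. The sketch below is not the method used in this notebook: it runs on synthetic data from `make_classification`, and the step names `scaler` and `knn` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
demo_grid = {
    "knn__n_neighbors": list(range(1, 31)),
    "knn__weights": ["uniform", "distance"],
}

demo_search = GridSearchCV(pipeline, demo_grid, cv=5, scoring="accuracy")
demo_search.fit(X_demo_train, y_demo_train)

print(demo_search.best_params_)
# With the default refit=True, the best pipeline is already refit on the
# full training set, so it can be scored on the test set directly.
print(demo_search.score(X_demo_test, y_demo_test))
```

Keeping the scaler inside the pipeline also avoids computing scaling statistics from rows that later serve as validation or test data.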

### Model Evaluation
<p style="font-size:17px">We read the best parameters and their mean cross-validated accuracy from the grid search and print both. We then retrain a KNeighborsClassifier with those parameters on the full training set and report its accuracy on the held-out test set.</p>

In [17]:
from sklearn.metrics import accuracy_score

# Best parameter combination and its mean cross-validated accuracy
optimal_params = grid_search.best_params_
optimal_accuracy = grid_search.best_score_

print("Optimal parameters:", optimal_params)
print("Optimal accuracy:", optimal_accuracy)

# Retrain a classifier with the best parameters on the full training set
optimal_knn = KNeighborsClassifier(**optimal_params)
optimal_knn.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = optimal_knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Optimal parameters: {'n_neighbors': 1, 'weights': 'uniform'}
Optimal accuracy: 0.7912087912087913
Accuracy: 0.7807017543859649
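Accuracy alone does not show which class the errors fall on, which matters in a diagnostic setting. A minimal sketch of a per-class breakdown follows, using short made-up label lists rather than the notebook's `y_test` and `y_pred`; the label values `'B'` and `'M'` are assumptions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels for illustration; not taken from the dataset
y_true = ["M", "M", "M", "B", "B", "B", "B", "B"]
y_hat = ["M", "B", "M", "B", "B", "B", "M", "B"]

# Rows are true classes, columns are predicted classes, in the order given by labels
matrix = confusion_matrix(y_true, y_hat, labels=["B", "M"])
print(matrix)

print(accuracy_score(y_true, y_hat))
# Share of true 'M' cases that were predicted as 'M'
print(recall_score(y_true, y_hat, pos_label="M"))
```

In this toy case the accuracy is 0.75, while only two of the three true `'M'` cases are recovered. The same calls could be applied to `y_test` and `y_pred` to see how the test errors are distributed.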
