## KNN ASSIGNMENT 3

Q1. Write a Python code to implement the KNN classifier algorithm on load_iris dataset in
sklearn.datasets.

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the KNN classifier with k=3 (you can change k as needed)
knn = KNeighborsClassifier(n_neighbors=3)

# Train the classifier on the training data
knn.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = knn.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 1.0


This code loads the Iris dataset, splits it into training and testing sets, initializes the KNN classifier with n_neighbors=3, trains the classifier on the training data, and then predicts the labels for the test set. Finally, it calculates and prints the accuracy of the classifier. You can adjust the value of n_neighbors to change the number of neighbors used in the KNN algorithm.

Q2. Write a Python code to implement the KNN regressor algorithm on load_boston dataset in
sklearn.datasets.

`load_boston` has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.
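
Because `load_boston` is unavailable, this question cannot be answered as written on scikit-learn 1.2 or later. As a minimal sketch, the same KNN regressor workflow can be shown on another bundled regression dataset. The choice of `load_diabetes` is an assumption made for illustration and is not part of the original assignment, and `n_neighbors=5` is an arbitrary starting value rather than a tuned one.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Assumption: load_diabetes stands in for the removed load_boston dataset
X, y = load_diabetes(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the KNN regressor (k=5 is an arbitrary starting point)
knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train, y_train)

# Predict the targets for the test set
y_pred = knn_regressor.predict(X_test)

# Evaluate with regression metrics instead of accuracy
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error:", mse)
print("R^2 score:", r2)
```

A regressor predicts the average target of the K nearest neighbors, so it is scored with error metrics such as mean squared error or R^2 rather than accuracy.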

Q3. Write a Python code snippet to find the optimal value of K for the KNN classifier algorithm using
cross-validation on load_iris dataset in sklearn.datasets.

To find the optimal value of K for the KNN classifier on the load_iris dataset, loop over a range of candidate K values and score each one with cross-validation. The K with the highest mean cross-validated accuracy is then taken as the optimum.

In [6]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Define the range of K values to try
k_values = list(range(1, 31))

# Initialize lists to store mean accuracies and standard deviations for each K
mean_accuracies = []
std_accuracies = []

# Perform cross-validation for each value of K
for k in k_values:
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    accuracies = cross_val_score(knn_classifier, X, y, cv=5)
    mean_accuracies.append(accuracies.mean())
    std_accuracies.append(accuracies.std())

# Find the optimal K with the highest mean accuracy (the smallest K wins on ties)
best_index = mean_accuracies.index(max(mean_accuracies))
optimal_k = k_values[best_index]
optimal_mean_accuracy = mean_accuracies[best_index]
optimal_std_accuracy = std_accuracies[best_index]

print(f"Optimal K: {optimal_k}")
print(f"Mean Accuracy: {optimal_mean_accuracy:.4f} (±{optimal_std_accuracy:.4f})")


Optimal K: 6
Mean Accuracy: 0.9800 (±0.0163)


In this code, we use 5-fold cross-validation to evaluate the classifier's performance for each value of K in the range of 1 to 30 (you can adjust the range as needed). We then find the K value that yields the highest mean accuracy and print the optimal K along with the mean accuracy and its standard deviation.

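Note that cross-validation gives a more robust estimate of the model's performance than a single train/test split, which makes it easier to detect overfitting (K too small) or underfitting (K too large). It does not prevent either on its own, but it is the basis for choosing hyperparameters like K in KNN.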
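
The same search can also be written with `GridSearchCV`, as a hedged alternative to the manual loop. The sketch below additionally holds out a test set that plays no part in choosing K, so the final score is not biased by the selection itself. The held-out split and the `stratify=y` argument are additions for illustration and are not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set that is never used while selecting K
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Search K = 1..30 with 5-fold cross-validation on the training set only
param_grid = {"n_neighbors": list(range(1, 31))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
test_accuracy = search.score(X_test, y_test)  # uses the refitted best model

print("Best K:", best_k)
print("Cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", test_accuracy)
```

Because the split differs from the loop above, the selected K may not match the value reported there.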

Q4. Implement the KNN regressor algorithm with feature scaling on load_boston dataset in
sklearn.datasets.

`load_boston` has been removed from scikit-learn since version 1.2, for the ethical reasons quoted under Q2, so this question cannot be run as written on current releases.

Q5. Write a Python code snippet to implement the KNN classifier algorithm with weighted voting on
load_iris dataset in sklearn.datasets.

To implement the K-Nearest Neighbors (KNN) classifier algorithm with weighted voting on the load_iris dataset from sklearn.datasets, we can modify the standard KNN classifier by using the weights parameter. Setting weights='distance' in the KNeighborsClassifier class will give more weight to closer neighbors when making predictions. Closer neighbors will have a higher influence on the decision, and their contributions will be inversely proportional to their distance.

In [8]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the KNN classifier with weighted voting using 'distance'
knn_classifier = KNeighborsClassifier(n_neighbors=3, weights='distance')

# Train the classifier on the training data
knn_classifier.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = knn_classifier.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 1.0


In this code, we initialize the KNN classifier with weights='distance', which means that the voting weight of each neighbor will be determined by the inverse of its distance to the query point. This way, closer neighbors will have a more substantial influence on the decision-making process during the classification.

You can adjust the value of n_neighbors as needed for the number of neighbors to consider in the KNN algorithm.
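
On the Iris split above, uniform and distance-weighted voting both reach the same accuracy, so the effect of the weights is not visible. The toy example below is a sketch on made-up one-dimensional data (the points and labels are hypothetical) where the two schemes disagree: two distant neighbors of class 0 outvote one close neighbor of class 1 under uniform voting, but not under inverse-distance weighting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two class-0 points far from the query, one class-1 point close to it
X_train = np.array([[0.0], [1.0], [3.0]])
y_train = np.array([0, 0, 1])
query = np.array([[2.5]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_train, y_train)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

# Uniform: class 0 gets 2 votes, class 1 gets 1 vote
uniform_label = uniform_knn.predict(query)[0]

# Distance: class 0 gets 1/2.5 + 1/1.5 (about 1.07), class 1 gets 1/0.5 = 2.0
distance_label = distance_knn.predict(query)[0]

print("Uniform vote:", uniform_label)             # 0
print("Distance-weighted vote:", distance_label)  # 1
```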

Q6. Implement a function to standardise the features before applying KNN classifier.

To implement a function for standardizing the features before applying the K-Nearest Neighbors (KNN) classifier, we can use the StandardScaler from sklearn.preprocessing. The StandardScaler removes the mean of each feature and scales it to unit variance, so every feature ends up with a mean of 0 and a standard deviation of 1. This puts all features on a comparable scale, so that no single feature dominates the distance calculations in the KNN algorithm simply because of its units.

In [12]:
from sklearn.preprocessing import StandardScaler

def standardize_features(X_train, X_test):
    """
    Standardize the features of the training and testing sets.

    Parameters:
        X_train (numpy array or pandas DataFrame): Training features.
        X_test (numpy array or pandas DataFrame): Testing features.

    Returns:
        X_train_std (numpy array): Standardized training features.
        X_test_std (numpy array): Standardized testing features.
    """
    # Initialize the StandardScaler
    scaler = StandardScaler()

    # Fit the scaler on the training data and transform it
    X_train_std = scaler.fit_transform(X_train)

    # Transform the testing data using the same scaler
    X_test_std = scaler.transform(X_test)

    return X_train_std, X_test_std


In [13]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features using the custom function
X_train_std, X_test_std = standardize_features(X_train, X_test)

# Initialize the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)

# Train the classifier on the standardized training data
knn_classifier.fit(X_train_std, y_train)

# Predict the labels for the standardized test set
y_pred = knn_classifier.predict(X_test_std)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 1.0


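The standardize_features function fits the scaler on the training set only and reuses those statistics to transform the test set, so no information from the test set leaks into preprocessing. Standardization matters for KNN because its distance calculations are sensitive to the scale of each feature.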
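
To make the transformation itself explicit, the sketch below standardizes the features by hand with NumPy and checks the result against StandardScaler. The helper name `standardize_manual` is hypothetical and introduced only for this illustration. It assumes NumPy array inputs and that no feature has zero variance.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def standardize_manual(X_train, X_test):
    """Z-score both sets using the mean and standard deviation of the training set."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
    return (X_train - mean) / std, (X_test - mean) / std

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

X_train_manual, X_test_manual = standardize_manual(X_train, X_test)

# Cross-check against scikit-learn's scaler
scaler = StandardScaler().fit(X_train)
matches_train = np.allclose(X_train_manual, scaler.transform(X_train))
matches_test = np.allclose(X_test_manual, scaler.transform(X_test))
print("Matches StandardScaler:", matches_train and matches_test)
```

Only the training columns are guaranteed to have mean 0 and standard deviation 1 afterwards; the test columns are close to those values but not exactly equal, because they are scaled with the training statistics.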

Q7. Write a Python function to calculate the euclidean distance between two points.

In [14]:
import numpy as np

def euclidean_distance(point1, point2):
    """
    Calculate the Euclidean distance between two points in a multi-dimensional space.

    Parameters:
        point1 (list or numpy array): The coordinates of the first point.
        point2 (list or numpy array): The coordinates of the second point.

    Returns:
        float: The Euclidean distance between the two points.
    """
    # Convert the points to numpy arrays for vectorized operations
    point1 = np.array(point1)
    point2 = np.array(point2)

    # Sum the squared differences between the coordinates
    squared_diff = np.sum((point1 - point2) ** 2)

    # Take the square root to get the Euclidean distance
    euclidean_dist = np.sqrt(squared_diff)

    return euclidean_dist


In [15]:
point1 = [1, 2, 3]
point2 = [4, 5, 6]

distance = euclidean_distance(point1, point2)
print("Euclidean distance:", distance)


Euclidean distance: 5.196152422706632


The function takes two points as input (either lists or numpy arrays) and calculates the Euclidean distance between them by computing the squared differences between their coordinates and then taking the square root of the sum of squared differences.
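
As a quick sanity check, the same distance can be obtained from existing library routines. The sketch below uses `np.linalg.norm` on the difference vector and `math.dist` from the standard library (Python 3.8 or later), applied to the same two example points.

```python
import math
import numpy as np

point1 = [1, 2, 3]
point2 = [4, 5, 6]

# The Euclidean distance is the L2 norm of the difference vector
norm_distance = np.linalg.norm(np.array(point1) - np.array(point2))

# Standard-library equivalent for two equal-length sequences
stdlib_distance = math.dist(point1, point2)

print("np.linalg.norm:", norm_distance)
print("math.dist:", stdlib_distance)
```

Both values agree with the hand-written function up to floating-point rounding: each coordinate differs by 3, so the distance is the square root of 27.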

Q8. Write a Python function to calculate the manhattan distance between two points.

In [16]:
import numpy as np

def manhattan_distance(point1, point2):
    """
    Calculate the Manhattan distance between two points in a multi-dimensional space.

    Parameters:
        point1 (list or numpy array): The coordinates of the first point.
        point2 (list or numpy array): The coordinates of the second point.

    Returns:
        int or float: The Manhattan distance between the two points (an integer when both inputs are integers).
    """
    # Convert the points to numpy arrays for vectorized operations
    point1 = np.array(point1)
    point2 = np.array(point2)

    # Calculate the absolute differences between the coordinates
    absolute_diff = np.abs(point1 - point2)

    # Sum the absolute differences to get the Manhattan distance
    manhattan_dist = np.sum(absolute_diff)

    return manhattan_dist


In [17]:
point1 = [1, 2, 3]
point2 = [4, 5, 6]

distance = manhattan_distance(point1, point2)
print("Manhattan distance:", distance)


Manhattan distance: 9


The function takes two points as input (either lists or numpy arrays) and calculates the Manhattan distance between them by computing the absolute differences between their coordinates and then summing up these absolute differences.
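
The Manhattan and Euclidean distances are both special cases of the Minkowski distance, which is also what `KNeighborsClassifier` uses by default (`metric='minkowski'` with `p=2`). The helper below is a minimal sketch; the function name `minkowski_distance` is hypothetical and it assumes `p >= 1`.

```python
import numpy as np

def minkowski_distance(point1, point2, p=2):
    """Minkowski distance of order p: p=1 is Manhattan, p=2 is Euclidean."""
    point1 = np.asarray(point1, dtype=float)
    point2 = np.asarray(point2, dtype=float)
    # Sum |difference|^p over the coordinates, then take the p-th root
    return float(np.sum(np.abs(point1 - point2) ** p) ** (1.0 / p))

point1 = [1, 2, 3]
point2 = [4, 5, 6]

manhattan = minkowski_distance(point1, point2, p=1)
euclidean = minkowski_distance(point1, point2, p=2)

print("p=1 (Manhattan):", manhattan)  # 9.0
print("p=2 (Euclidean):", euclidean)
```

Passing `p=1` to `KNeighborsClassifier` switches the classifier from Euclidean to Manhattan distance without changing anything else in the workflow.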