1. What is K-Nearest Neighbors (KNN) and how does it work?


K-Nearest Neighbors (KNN) is a simple yet powerful machine learning algorithm used for both classification and regression tasks. It works by identifying the "k" nearest data points to a new data point, and then making predictions based on the majority class or the average value of those neighbors.
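
The procedure described above can be sketched from scratch in a few lines. This is a minimal illustration, not a production implementation: the function name `knn_predict` and the toy data are made up for the example, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority class among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.2, 1.4]))
pred_b = knn_predict(X_train, y_train, np.array([8.7, 8.2]))
print(pred_a, pred_b)
```

For regression, the last line of `knn_predict` would return `y_train[nearest].mean()` instead of the majority vote.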

2. What is the difference between KNN Classification and KNN Regression?


KNN Classification and KNN Regression are both methods used in the K-Nearest Neighbors algorithm, but they differ in the type of output they predict. KNN Classification predicts a category or class label for a given input, while KNN Regression predicts a continuous numerical value.

3. What is the role of the distance metric in KNN?


In KNN (k-Nearest Neighbors), the distance metric plays a crucial role in determining the similarity between data points. It quantifies the distance between a new data point and its neighbors in the training dataset, allowing the algorithm to identify the k-nearest neighbors. These neighbors are then used to classify the new point or make predictions.
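
A small sketch of why the choice of metric matters: the same query can have a different nearest neighbor under Euclidean (L2) and Manhattan (L1) distance. The points below are hypothetical and were picked so that the two metrics disagree.

```python
import numpy as np

# Hypothetical points chosen so that the two metrics disagree
query = np.array([0.0, 0.0])
candidates = np.array([[3.0, 3.0],   # candidate 0
                       [0.0, 5.0]])  # candidate 1

diff = candidates - query
euclidean = np.sqrt((diff ** 2).sum(axis=1))   # L2 distance
manhattan = np.abs(diff).sum(axis=1)           # L1 distance

nearest_euclidean = int(np.argmin(euclidean))
nearest_manhattan = int(np.argmin(manhattan))
print("Euclidean:", euclidean, "-> nearest is candidate", nearest_euclidean)
print("Manhattan:", manhattan, "-> nearest is candidate", nearest_manhattan)
```

In scikit-learn the metric is selected with the `metric` and `p` parameters of `KNeighborsClassifier` (`p=2` is Euclidean, `p=1` is Manhattan).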

4. What is the Curse of Dimensionality in KNN?


The "curse of dimensionality" in k-Nearest Neighbors (k-NN) refers to the phenomenon where the performance of the algorithm deteriorates as the number of features (dimensions) in the data increases, particularly in high-dimensional spaces. This happens because, in high dimensions, distances between data points become less meaningful, and the data becomes increasingly sparse, making it difficult to find true nearest neighbors.
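
The loss of meaning in distances can be observed numerically. The sketch below assumes uniformly random points in the unit hypercube and measures the relative contrast between the farthest and nearest point from a query; the sample size and dimensions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
contrast = {}
for d in [2, 10, 100, 1000]:
    points = rng.random((500, d))      # 500 uniform points in the unit hypercube
    query = rng.random(d)
    dist = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast[d] = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:4d}  relative contrast={contrast[d]:.3f}")
```

As the dimension grows, the contrast shrinks, so the "nearest" neighbor is barely closer than any other point.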

5. How can we choose the best value of K in KNN?


Choosing 'k' in k-Nearest Neighbors (KNN) is a bias-variance trade-off: a small 'k' follows the training data closely and is sensitive to noise, while a large 'k' smooths the decision boundary and can blur real differences between classes. A common rule of thumb is to start near the square root of the number of training points, but the value is usually tuned with cross-validation, often by plotting validation error against 'k' and looking for an elbow, and informed by domain knowledge. For binary classification, an odd 'k' avoids tied votes.
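
A minimal cross-validation sketch for picking 'k', assuming the Iris dataset and 5-fold cross-validation; the candidate range of 'k' is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

k_values = list(range(1, 22, 2))   # candidate values of k (assumed range)
mean_scores = []
for k in k_values:
    # Scaling lives inside the pipeline so each fold is scaled on its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores.append(cross_val_score(model, X, y, cv=5).mean())

best_k = k_values[mean_scores.index(max(mean_scores))]
print("best k:", best_k, "mean CV accuracy:", max(mean_scores))
```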

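6. What are KD Tree and Ball Tree in KNN?


KD Tree and Ball Tree are spatial index structures that speed up nearest-neighbor search, so that KNN does not have to compare a query against every training point. A KD Tree recursively splits the data along one feature axis at a time, while a Ball Tree groups points into nested hyperspheres based on proximity. The choice between them often depends on the characteristics of the data and the specific requirements of the application.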

7. When should you use KD Tree vs. Ball Tree?

KD Trees are usually the faster choice in low dimensions. Ball Trees are slower than KD Trees in low dimensions (d ≤ 3) but a lot faster in high dimensions. Both are affected by the curse of dimensionality, but Ball Trees tend to still work if the data exhibits local structure (e.g. lies on a low-dimensional manifold).
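
Both structures are exact indexes, so they return the same neighbors and differ only in speed. The sketch below uses scikit-learn's `KDTree` and `BallTree` on random data; the data shape and `leaf_size` are arbitrary assumptions.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(42)
X = rng.random((200, 3))          # 200 random points in 3 dimensions
query = X[:1]                     # use the first point as the query

kd_dist, kd_idx = KDTree(X, leaf_size=30).query(query, k=3)
ball_dist, ball_idx = BallTree(X, leaf_size=30).query(query, k=3)

print("KD-Tree neighbors:  ", kd_idx[0], kd_dist[0])
print("Ball Tree neighbors:", ball_idx[0], ball_dist[0])
```

In `KNeighborsClassifier`, the same choice is made through the `algorithm` parameter (`'kd_tree'`, `'ball_tree'`, `'brute'`, or `'auto'`).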

8. What are the disadvantages of KNN?


KNN (K-Nearest Neighbors) has several disadvantages, including high computational cost and memory requirements, sensitivity to irrelevant features and data scaling, and potential issues with high-dimensional datasets and imbalanced classes.

9. How does feature scaling affect KNN?


Feature scaling significantly impacts KNN because the algorithm is driven entirely by distances. Scaling makes all features contribute comparably to the distance calculation and prevents features with larger numeric ranges from dominating it. On datasets whose features have vastly different ranges, this usually gives better accuracy and a more robust model. KNN has no iterative training phase, so the benefit shows up in prediction quality rather than in faster convergence.
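
A short comparison of KNN with and without scaling, assuming scikit-learn's wine dataset, whose features differ in range by orders of magnitude (for example proline versus hue). The value `n_neighbors=5` and the 5-fold split are arbitrary choices.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_acc = cross_val_score(raw_knn, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_knn, X, y, cv=5).mean()
print(f"unscaled: {raw_acc:.3f}  scaled: {scaled_acc:.3f}")
```

Putting the scaler inside the pipeline means it is refit on the training part of every fold, which avoids leaking test-fold statistics.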

10. What is PCA (Principal Component Analysis)?

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique. It re-expresses the data along a new set of uncorrelated axes, called principal components, ordered by how much of the data's variance each one captures. Keeping only the first few components gives a lower-dimensional representation that retains most of the variation in the original features.

11. How does PCA work?



Principal Component Analysis (PCA) works by transforming high-dimensional data into a lower-dimensional space, identifying the principal components (new axes) that capture the most variance in the data, while preserving the most important information.

12. What is the geometric intuition behind PCA?


PCA will find the “line” (in 2D) or “plane” (in 3D) that best captures the direction in which the data has the most spread or variance. The first principal component (PC1) captures the greatest variance in the data. It's the line (or axis) that best fits the data points.

13. What is the difference between Feature Selection and Feature Extraction?


Feature selection chooses a subset of the most relevant original features, while feature extraction creates new features from existing ones, aiming to capture hidden patterns or reduce dimensionality.

14. What are Eigenvalues and Eigenvectors in PCA?


In Principal Component Analysis (PCA), eigenvectors represent the principal components (new axes of the data), while eigenvalues indicate the amount of variance explained by each corresponding eigenvector (principal component).
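
The link between eigen-decomposition and PCA can be checked directly. The sketch below, assuming the Iris dataset, computes the eigenvalues and eigenvectors of the covariance matrix with NumPy and compares them with what scikit-learn's `PCA` reports.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_centered = X - X.mean(axis=0)            # PCA works on mean-centered data

cov = np.cov(X_centered, rowvar=False)     # 4x4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric input, ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # reorder by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA().fit(X)
print("eigenvalues:         ", eigvals)
print("explained_variance_: ", pca.explained_variance_)
print("variance ratio:      ", eigvals / eigvals.sum())
```

The eigenvectors match the rows of `pca.components_` up to sign, since an eigenvector and its negation describe the same axis.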

15. How do you decide the number of components to keep in PCA?


A widely applied approach is to choose the number of principal components by examining a scree plot, which shows the proportion of variance explained by each component. Look for the point at which the variance explained by each subsequent component drops off sharply and then levels out; this point is often referred to as the elbow of the scree plot. Another common rule is to keep enough components to reach a target cumulative explained variance, such as 95%.
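
As a numeric complement to eyeballing the plot, the number of components can be read off the cumulative explained variance. The sketch below assumes the standardized wine dataset and a 95% target, both chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95   # assumed target for this example
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print("cumulative variance:", np.round(cumulative, 3))
print("components needed for 95%:", n_keep)

# scikit-learn applies the same rule when n_components is a float in (0, 1)
pca_95 = PCA(n_components=threshold).fit(X_std)
print("PCA(n_components=0.95) kept:", pca_95.n_components_)
```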

16. Can PCA be used for classification?


While Principal Component Analysis (PCA) is primarily a dimensionality reduction technique, it can be indirectly used for classification purposes. PCA can transform data into a lower-dimensional space, making it easier to visualize and potentially separate different classes. However, it's important to understand that PCA doesn't directly classify data; it prepares the data for use with a classification algorithm.

17. What are the limitations of PCA?

PCA (Principal Component Analysis) has limitations that users should be aware of, including loss of interpretability, sensitivity to outliers, computational complexity for large datasets, potential for overfitting, data scaling requirements, and inability to capture non-linear relationships.

18. How do KNN and PCA complement each other?

KNN and PCA complement each other by improving KNN's performance and efficiency, particularly when dealing with high-dimensional data. PCA, a dimensionality reduction technique, helps KNN by removing redundant and less significant features, thus reducing overfitting and computational cost. This improved data representation allows KNN to learn more effectively and make more accurate predictions.
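
One way to try this combination is a scikit-learn pipeline that scales, projects with PCA, and then classifies with KNN. The sketch below assumes the digits dataset (64 features) and keeps 20 components; both choices are illustrative, and whether PCA helps accuracy depends on the data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)   # 64 pixel features per image

knn_only = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=20),
                        KNeighborsClassifier(n_neighbors=5))

acc_knn = cross_val_score(knn_only, X, y, cv=5).mean()
acc_pca_knn = cross_val_score(pca_knn, X, y, cv=5).mean()
print(f"KNN on 64 features: {acc_knn:.3f}")
print(f"PCA(20) + KNN:      {acc_pca_knn:.3f}")
```

With PCA in front, every distance computation runs on 20 values instead of 64, which is where the reduction in computational cost comes from.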

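19. How does KNN handle missing values in a dataset?


KNN cannot compute distances on rows with missing entries directly, so missing values are usually filled in first, and KNN itself can do the filling. Imputation with K-Nearest Neighbors (KNN) estimates each missing value from the closest data points, where closeness is measured on the features that are present using a distance metric such as Euclidean distance. The missing value is then set to the average of those nearest neighbors' values for that feature, optionally weighted by their proximity.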
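
A minimal sketch using scikit-learn's `KNNImputer` on a tiny hand-made matrix. With the default `weights="uniform"`, each missing entry becomes the plain mean of that feature over the two nearest rows that have it; passing `weights="distance"` would weight by proximity instead.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small example matrix with two missing entries
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2)   # default weights="uniform"
filled = imputer.fit_transform(X)
print(filled)
```

Distances are computed with a NaN-aware Euclidean metric that uses only the coordinates present in both rows.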

20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?



PCA focuses on maximizing variance to capture the most important features whereas LDA focuses on maximizing the separation between different classes. PCA produces principal components that are orthogonal and maximize variance. LDA produces linear discriminants that maximize class separation.
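
The difference shows up in the API as well: PCA is fit without labels, while LDA requires them. The sketch below assumes the Iris dataset and projects to two dimensions with each method.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA is unsupervised: it never sees the labels
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it needs y, and yields at most (n_classes - 1) components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print("PCA projection shape:", X_pca.shape)
print("LDA projection shape:", X_lda.shape)
```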

In [None]:
#21  Train a KNN Classifier on the Iris dataset and print model accuracy


from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

iris = datasets.load_iris()
X, y = iris.data[:, :], iris.target

Xtrain, Xtest, y_train, y_test = train_test_split(X, y)
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)

knn = neighbors.KNeighborsClassifier(n_neighbors=4)
knn.fit(Xtrain, y_train)
y_pred = knn.predict(Xtest)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

In [None]:
#22 Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE)


from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

k = 3

# The task asks for a regressor, so the target must be continuous.
# Synthetic data from make_regression is used here; sizes and noise are arbitrary.
X, y = make_regression(n_samples=200, n_features=2, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

knn = KNeighborsRegressor(n_neighbors = k)
knn = knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
MSE = mean_squared_error(y_test, y_pred) # Calculating MSE
print(MSE)

In [None]:
#24  Train a KNN Classifier with different values of K and visualize decision boundaries


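import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Importing the data (pandas equivalent of the R snippet pasted in the original cell).
# csv_path is a placeholder: point it at an Iris CSV whose first four columns
# are the measurements and whose label column is 'Name'.
csv_path = "iris.csv"
data = pd.read_csv(csv_path)
print(data.head())      # Top observations present in the data
print(data.shape)       # Check the dimensions of the data
print(data.describe())  # Summarise the data

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(data.iloc[:,0:4], data['Name'])

# NOTE: `test` (a single query row with the same four feature columns) was
# defined outside this cell in the original run and must be supplied.

# Predicted class
print(neigh.predict(test))
# -> ['Iris-virginica']

# 3 nearest neighbors
print(neigh.kneighbors(test)[1])
# -> [[141 139 120]]

# Both models predicted the same class ('Iris-virginica') and the same nearest
# neighbors ([141 139 120]), so we can conclude that the model runs as expected.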

In [None]:
#25  Apply Feature Scaling before training a KNN model and compare results with unscaled data



# If scikit-learn is missing, install it with: pip install scikit-learn
# (the PyPI package is named 'scikit-learn', not 'sklearn')
import pandas as pd
# splitting training and testing data
from sklearn.model_selection import train_test_split
# data normalization with sklearn
from sklearn.preprocessing import MinMaxScaler


data = pd.read_csv("train.csv")
print("Big Mart Data")


print(data.columns)
X = data[['Item_Weight', 'Item_MRP']]
y = data['Item_Outlet_Sales']

print(X.head())
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=27)


# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)
print("Scaled Train Data: \n\n")
print(X_train_norm)

# transform testing data
X_test_norm = norm.transform(X_test)
print("\n\nScaled Test Data: \n\n")
print(X_test_norm)

In [None]:
#26 Train a PCA model on synthetic data and print the explained variance ratio for each component



from sklearn.decomposition import PCA
from sklearn.datasets import make_classification

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=5, random_state=42)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X)

# Explained variance ratio (fraction of total variance captured by each component)
explained_variance_ratio = pca.explained_variance_ratio_
total_explained_variance = explained_variance_ratio.sum()

# Print results
print(f"Explained Variance Ratio:\n{explained_variance_ratio}")
print(f"Total Explained Variance Ratio: {total_explained_variance:.4f}")

In [None]:
#27 Apply PCA before training a KNN Classifier and compare accuracy with and without PCA



import numpy as np
# Pandas - to deal with tables
import pandas as pd
# matplotlib's pyplot - for general plotting
import matplotlib.pyplot as plt
# Seaborn - to plot statistical data
# (the alias "sns" stands for "Samuel Norman Seaborn")
import seaborn as sns
# scikit-learn's MinMaxScaler - for min-max normalisation of features
from sklearn.preprocessing import MinMaxScaler
# collections - for alternative containers. Useful for counting
import collections

# ---> Principal Component Analysis (PCA)
# scikit-learn's PCA
from sklearn.decomposition import PCA
# scikit-learn's Pipeline - to successively transform features and make estimations
from sklearn.pipeline import Pipeline
# plotly Express - for plotting
import plotly.express as px
# Make plotly's plots in the notebook to work without any issues
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# ---> K-Nearest Neighbour Classifier (KNN)
# scikit-learn's KNN
from sklearn.neighbors import KNeighborsClassifier
# scikit-learn's GridSearchCV for hyperparameter tuning (using Grid Search and
# cross-validation)
from sklearn.model_selection import GridSearchCV
# scikit-learn's several classification metrics
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
)




In [None]:
#28 Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV


import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('heart.csv')
print(df.head())

X = df.drop('target', axis = 1)

y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# The task asks for a KNN classifier, so tune KNN rather than a random forest
knn = KNeighborsClassifier()
knn_params = [{'n_neighbors': list(range(1, 31)), 'weights': ['uniform', 'distance']}]

clf = GridSearchCV(knn, knn_params, cv = 10, scoring='accuracy')

clf.fit(X_train, y_train)

print(clf.best_params_)

print(clf.best_score_)

In [None]:
#29 Train a KNN Classifier and check the number of misclassified samples


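from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# X_scaled and y were not defined in this cell; the Iris dataset is assumed here
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Perform cross-validation for different values of k
k_values = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_scaled, y, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# Get the best cv score and the value of k that produced it
best_cv_score = max(cv_scores)
optimal_k = k_values[cv_scores.index(best_cv_score)]

# Train with the optimal k and count the misclassified test samples
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=optimal_k).fit(X_train, y_train)
misclassified = (knn.predict(X_test) != y_test).sum()
print(f"k={optimal_k}, misclassified samples: {misclassified} of {len(y_test)}")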


In [None]:
#30 Train a PCA model and visualize the cumulative explained variance.



import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Example Data
np.random.seed(42)
X = np.random.rand(100, 3)  # 100 samples with 3 features

# Step 1: Standardize the Data
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Step 2-5: PCA
pca = PCA()
X_pca = pca.fit_transform(X_std)

# Plot Explained Variance Ratio
explained_var_ratio = pca.explained_variance_ratio_
cumulative_var_ratio = np.cumsum(explained_var_ratio)

plt.plot(range(1, len(cumulative_var_ratio) + 1), cumulative_var_ratio, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Explained Variance Ratio vs. Number of Principal Components')
plt.show()

In [3]:
#31.Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare
#accuracy



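import matplotlib.pyplot as plt  # needed for plt.legend() / plt.show() below
import seaborn as sns
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)
print(iris.target.shape)

# Compare uniform vs. distance weighting on the same held-out split
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target)
for weights in ["uniform", "distance"]:
    clf = neighbors.KNeighborsClassifier(n_neighbors=5, weights=weights)
    clf.fit(X_train, y_train)
    print(weights, "accuracy:", clf.score(X_test, y_test))

# Scatter plot of the two petal features, coloured by species
iris_df = sns.load_dataset("iris")
iris_df["ID"] = iris_df.index
iris_df["ratio"] = iris_df["sepal_length"]/iris_df["sepal_width"]
sns.lmplot(x="petal_length", y="petal_width", data=iris_df, hue="species", fit_reg=False, legend=False)
plt.legend()
plt.show()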

In [None]:
#32  Train a KNN Regressor and analyze the effect of different K values on performance


import pandas as pd
df = pd.read_csv('train.csv')
df.head()
df.isnull().sum()
#missing values in Item_Weight and Outlet_Size need to be imputed
mean = df['Item_Weight'].mean() #imputing Item_Weight with mean
df['Item_Weight'] = df['Item_Weight'].fillna(mean)

mode = df['Outlet_Size'].mode() #imputing Outlet_Size with mode
df['Outlet_Size'] = df['Outlet_Size'].fillna(mode[0])
df.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1, inplace=True)
df = pd.get_dummies(df)
from sklearn.model_selection import train_test_split
train , test = train_test_split(df, test_size = 0.3)

x_train = train.drop('Item_Outlet_Sales', axis=1)
y_train = train['Item_Outlet_Sales']

x_test = test.drop('Item_Outlet_Sales', axis = 1)
y_test = test['Item_Outlet_Sales']

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

x_train_scaled = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train_scaled)

x_test_scaled = scaler.transform(x_test)  # reuse the scaler fitted on the training data
x_test = pd.DataFrame(x_test_scaled)

#import required packages
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from math import sqrt
import matplotlib.pyplot as plt
%matplotlib inline

rmse_val = [] #to store rmse values for different k
for K in range(1, 21):
    model = neighbors.KNeighborsRegressor(n_neighbors = K)

    model.fit(x_train, y_train)  #fit the model
    pred=model.predict(x_test) #make prediction on test set
    error = sqrt(mean_squared_error(y_test,pred)) #calculate rmse
    rmse_val.append(error) #store rmse values
    print('RMSE value for k= ' , K , 'is:', error)

In [None]:
#33 Implement KNN Imputation for handling missing values in a dataset


# KNN Imputer

from sklearn.impute import KNNImputer
import numpy as np

X = [ [3, np.nan, 5], [1, 0, 0], [3, 3, 3] ]  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
print("X: ", X)
print("===========")


imputer = KNNImputer(n_neighbors= 1)
impute_with_1 = imputer.fit_transform(X)

print("\nImpute with 1 Neighbour: \n", impute_with_1)



imputer = KNNImputer(n_neighbors= 2)
impute_with_2 = imputer.fit_transform(X)

print("\n Impute with 2 Neighbours: \n", impute_with_2)


In [None]:
#34 Train a PCA model and visualize the data projection onto the first two principal components



from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# Load dataset
winedata = load_wine()
X, y = winedata['data'], winedata['target']
print("X shape:", X.shape)
print("y shape:", y.shape)

# Show any two features
plt.figure(figsize=(8,6))
plt.scatter(X[:,1], X[:,2], c=y)
plt.xlabel(winedata["feature_names"][1])
plt.ylabel(winedata["feature_names"][2])
plt.title("Two particular features of the wine dataset")
plt.show()

# Show any three features
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(projection='3d')
ax.scatter(X[:,1], X[:,2], X[:,3], c=y)
ax.set_xlabel(winedata["feature_names"][1])
ax.set_ylabel(winedata["feature_names"][2])
ax.set_zlabel(winedata["feature_names"][3])
ax.set_title("Three particular features of the wine dataset")
plt.show()

# Show first two principal components without scaler
pca = PCA()
plt.figure(figsize=(8,6))
Xt = pca.fit_transform(X)
plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names']))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("First two principal components")
plt.show()

# Show first two principal components with scaler
pca = PCA()
pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)])
plt.figure(figsize=(8,6))
Xt = pipe.fit_transform(X)
plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names']))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("First two principal components after scaling")
plt.show()

In [6]:
#35  Train a PCA model on a high-dimensional dataset and visualize the Scree plot


import plotly.express as px

df = px.data.iris()
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

fig = px.scatter_matrix(
    df,
    dimensions=features,
    color="species"
)
fig.update_traces(diagonal_visible=False)
fig.show()


import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

pca = PCA()
components = pca.fit_transform(df[features])
labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    components,
    labels=labels,
    dimensions=range(4),
    color=df["species"]
)
fig.update_traces(diagonal_visible=False)
fig.show()



import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
df = housing.data
n_components = 2

pca = PCA(n_components=n_components)
components = pca.fit_transform(df)

total_var = pca.explained_variance_ratio_.sum() * 100

labels = {str(i): f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'Median Price'

fig = px.scatter_matrix(
    components,
    color=housing.target,
    dimensions=range(n_components),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)
fig.show()




In [None]:
# 37 Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score

from sklearn.model_selection import GridSearchCV

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

X = iris.data
y = iris.target

k_range = list(range(1,31))
weight_options = ["uniform", "distance"]
leaf_size=[10,20,30,40,50,60]
p=[1,2,3,4,5]
algorithm=['auto', 'ball_tree', 'kd_tree', 'brute']

param_grid = dict(n_neighbors = k_range, weights = weight_options,algorithm= algorithm ,leaf_size = leaf_size , p =p)
#print (param_grid)
knn = KNeighborsClassifier()

grid = GridSearchCV(knn, param_grid, cv = 10, scoring = 'accuracy')
grid.fit(X,y)

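print (grid.best_score_)
print (grid.best_params_)
print (grid.best_estimator_)

# Precision, recall and F1-score of the best model, from out-of-fold predictions
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

y_pred = cross_val_predict(grid.best_estimator_, X, y, cv = 10)
print (classification_report(y, y_pred, target_names = iris.target_names))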

In [None]:
#38 Train a PCA model and analyze the effect of different numbers of components on accuracy


import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Example Data
np.random.seed(42)
X = np.random.rand(100, 3)  # 100 samples with 3 features

# Step 1: Standardize the Data
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Step 2-5: PCA
pca = PCA()
X_pca = pca.fit_transform(X_std)

# Plot Explained Variance Ratio
explained_var_ratio = pca.explained_variance_ratio_
cumulative_var_ratio = np.cumsum(explained_var_ratio)

plt.plot(range(1, len(cumulative_var_ratio) + 1), cumulative_var_ratio, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Explained Variance Ratio vs. Number of Principal Components')
plt.show()



In [None]:
#40  Train a PCA model and visualize how data points are transformed before and after PCA



from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# Load dataset
winedata = load_wine()
X, y = winedata['data'], winedata['target']
print("X shape:", X.shape)
print("y shape:", y.shape)

# Show any two features
plt.figure(figsize=(8,6))
plt.scatter(X[:,1], X[:,2], c=y)
plt.xlabel(winedata["feature_names"][1])
plt.ylabel(winedata["feature_names"][2])
plt.title("Two particular features of the wine dataset")
plt.show()

# Show any three features
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(projection='3d')
ax.scatter(X[:,1], X[:,2], X[:,3], c=y)
ax.set_xlabel(winedata["feature_names"][1])
ax.set_ylabel(winedata["feature_names"][2])
ax.set_zlabel(winedata["feature_names"][3])
ax.set_title("Three particular features of the wine dataset")
plt.show()

# Show first two principal components without scaler
pca = PCA()
plt.figure(figsize=(8,6))
Xt = pca.fit_transform(X)
plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names']))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("First two principal components")
plt.show()

# Show first two principal components with scaler
pca = PCA()
pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)])
plt.figure(figsize=(8,6))
Xt = pipe.fit_transform(X)
plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names']))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("First two principal components after scaling")
plt.show()

In [None]:
# 43. Train a KNN Classifier and evaluate using ROC-AUC score


import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# generate two class dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_features=20, random_state=27)

# split into train-test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=27)

print(pd.DataFrame(X))
print(pd.Series(y))


# train models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# logistic regression
model1 = LogisticRegression()
# knn
model2 = KNeighborsClassifier(n_neighbors=4)

# fit model
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)

# predict probabilities
pred_prob1 = model1.predict_proba(X_test)
pred_prob2 = model2.predict_proba(X_test)

# auc roc curve


from sklearn.metrics import roc_curve

# roc curve for models
fpr1, tpr1, thresh1 = roc_curve(y_test, pred_prob1[:,1], pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(y_test, pred_prob2[:,1], pos_label=1)

# roc curve for tpr = fpr
random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)


from sklearn.metrics import roc_auc_score

# auc scores
auc_score1 = roc_auc_score(y_test, pred_prob1[:,1])
auc_score2 = roc_auc_score(y_test, pred_prob2[:,1])

print(auc_score1, auc_score2)

In [None]:
#44  Train a PCA model and visualize the variance captured by each principal component


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set(style="whitegrid")

# The original cell never built a PCA model; the wine dataset is assumed here
X, _ = load_wine(return_X_y=True)
pca = PCA().fit(StandardScaler().fit_transform(X))

# Variance captured by each principal component
ratios = pca.explained_variance_ratio_
print(ratios)

plt.bar(np.arange(1, len(ratios) + 1), ratios)
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.title("Variance captured by each principal component")
plt.show()

In [None]:

# Train a KNN Classifier and perform feature selection before training


# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Training the K-NN model on the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test, y_pred)
print(cm)
print(ac)

In [None]:
#46


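from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# GET mnist data
mnist = fetch_openml(name='mnist_784', as_frame=False, parser='auto')
X = mnist.data

# Visualize one digit (stands in for the undefined helper plot_MNIST_sample)
plt.imshow(X[0].reshape(28, 28), cmap='gray')
plt.axis('off')
plt.show()

# Perform PCA
pca = PCA()
score = pca.fit_transform(X)        # data projected onto the principal components
evectors = pca.components_.T        # columns are eigenvectors of the covariance matrix
evals = pca.explained_variance_     # corresponding eigenvalues, in descending order

# Plot the eigenvalues (scree plot), limiting the x-axis to 100 for zooming
# (stands in for the undefined helper plot_eigenvalues)
plt.plot(range(1, len(evals) + 1), evals, 'o-')
plt.xlim([0, 100])
plt.xlabel('Component')
plt.ylabel('Eigenvalue')
plt.title('Scree plot')
plt.show()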

In [None]:
#47  Train a KNN Classifier and visualize the decision boundary

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets, neighbors
from mlxtend.plotting import plot_decision_regions


def knn_comparison(data, k):
    # data is expected to be a DataFrame with columns 'X', 'Y' and 'class'
    x = data[['X', 'Y']].values
    y = data['class'].astype(int).values
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf.fit(x, y)
    # Plotting decision region
    plot_decision_regions(x, y, clf=clf, legend=2)
    # Adding axes annotations
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Knn with K=' + str(k))
    plt.show()

In [None]:
#48  Train a PCA model and analyze the effect of different numbers of components on data variance



import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Example Data
np.random.seed(42)
X = np.random.rand(100, 3)  # 100 samples with 3 features

# Step 1: Standardize the Data
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Step 2-5: PCA
pca = PCA()
X_pca = pca.fit_transform(X_std)

# Plot Explained Variance Ratio
explained_var_ratio = pca.explained_variance_ratio_
cumulative_var_ratio = np.cumsum(explained_var_ratio)

plt.plot(range(1, len(cumulative_var_ratio) + 1), cumulative_var_ratio, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Explained Variance Ratio vs. Number of Principal Components')
plt.show()