# kNN Project

I am following a tutorial to teach myself how to use the k-nearest neighbors algorithm in `scikit-learn`. I will also be working with the `pandas` library and with `plotly.express` for visualizations. The data comes from the file `500hits.csv`, which covers baseball players who played for MLB teams. We will be dropping the following players or categories from the data before we begin:

 - Players that retired within the last 5 years because they do not qualify for the Hall of Fame.
 - Players that have used steroids.
 - Pete Rose, because he is banned from baseball for betting on games, including games of the Cincinnati Reds, the team he was managing in 1989.

**Let the fun begin!**

In [1]:
import pandas as pd
import plotly.express as px
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler 
from sklearn.metrics import confusion_matrix, classification_report
# the MinMaxScaler scales features to a specified range (default: 0 to 1)

In [2]:
# Loading the data

df = pd.read_csv('500hits.csv', encoding='latin-1')

In [3]:
df.head()

Unnamed: 0,PLAYER,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,BA,HOF
0,Ty Cobb,24,3035,11434,2246,4189,724,295,117,726,1249,357,892,178,0.366,1
1,Stan Musial,22,3026,10972,1949,3630,725,177,475,1951,1599,696,78,31,0.331,1
2,Tris Speaker,22,2789,10195,1882,3514,792,222,117,724,1381,220,432,129,0.345,1
3,Derek Jeter,20,2747,11195,1923,3465,544,66,260,1311,1082,1840,358,97,0.31,1
4,Honus Wagner,21,2792,10430,1736,3430,640,252,101,0,963,327,722,15,0.329,1


In [4]:
# Cleaning the data
# Will be dropping the columns 'PLAYER' and 'CS'

df = df.drop(columns=['PLAYER', 'CS'])

In [5]:
df.head()

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA,HOF
0,24,3035,11434,2246,4189,724,295,117,726,1249,357,892,0.366,1
1,22,3026,10972,1949,3630,725,177,475,1951,1599,696,78,0.331,1
2,22,2789,10195,1882,3514,792,222,117,724,1381,220,432,0.345,1
3,20,2747,11195,1923,3465,544,66,260,1311,1082,1840,358,0.31,1
4,21,2792,10430,1736,3430,640,252,101,0,963,327,722,0.329,1


In [6]:
# Making a visualization

corr_matrix = df.corr()

fig = px.imshow(corr_matrix,
                labels = dict(color = 'correlation'),
                x = list(corr_matrix.columns),
                y = list(corr_matrix.index),
                color_continuous_scale=px.colors.sequential.ice,
                zmin = -1, zmax = 1,
                text_auto = True,
                aspect='auto',
                origin = "lower")
fig.update_layout(title = 'RBI and MLB Hall of Fame Correlation')
fig.show()
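Each cell of the heatmap is a correlation coefficient between two columns (`df.corr()` uses Pearson correlation by default). As a rough sketch of how one such cell is computed, the hypothetical `pearson` helper below, which is not part of the notebook, applies the formula to the `H` and `R` values of the five players printed by `df.head()`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# H and R for the five players shown in df.head()
hits = [4189, 3630, 3514, 3465, 3430]
runs = [2246, 1949, 1882, 1923, 1736]

r = pearson(hits, runs)
print(round(r, 3))
```

Five rows are far too few to say anything about the full dataset; the point is only to show what a single cell of the matrix represents.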

In [7]:
# Feature / label selection
# HOF (Hall of Fame) is going to be the label/target variable

X = df.iloc[:,0:13]
y = df.iloc[:,13]

In [8]:
# random_state ensures the train_test_split is reproducible

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)
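To illustrate why a fixed `random_state` makes the split reproducible, here is a minimal sketch of a seeded shuffle-and-cut split. It is not scikit-learn's implementation; `simple_split` is a hypothetical helper that only mimics the idea with the standard library.

```python
import random

def simple_split(rows, test_size=0.2, seed=None):
    """Shuffle a copy of rows with a seeded RNG, then cut off the test portion."""
    rng = random.Random(seed)          # same seed -> same shuffle order
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

rows = list(range(20))
train_a, test_a = simple_split(rows, test_size=0.2, seed=11)
train_b, test_b = simple_split(rows, test_size=0.2, seed=11)

print(test_a == test_b)            # same seed, same test set
print(len(train_a), len(test_a))   # sizes of the two parts
```

Without a seed, each run would shuffle differently, so the accuracy reported later could change from run to run.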

In [9]:
# Scaling the features to the range 0 to 1
# MinMaxScaler is a reasonable choice in the following situations:
'''
1. Features have different units or scales
2. Gradient-descent-based algorithms (linear regression, logistic regression, and neural networks)
3. Distance-based algorithms (kNN, support vector machines, and clustering algorithms such as k-means)
4. When outliers are not a major concern
5. When you need to maintain interpretability
'''

scaler = MinMaxScaler(feature_range=(0,1))

X_train = scaler.fit_transform(X_train)
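# Note: fit_transform refits the scaler on the test set, so the test data is
# scaled with its own min/max. The usual pattern is scaler.transform(X_test),
# which reuses the training fit. The outputs below come from the code as written.
X_test = scaler.fit_transform(X_test)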
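Min-max scaling maps each feature value to `(x - min) / (max - min)`. The sketch below uses two hypothetical helpers, `fit_min_max` and `apply_min_max`, written in plain Python rather than with `MinMaxScaler`, to show the difference between reusing the training minimum and maximum on the test data and refitting on the test data. The numbers are made up for illustration.

```python
def fit_min_max(values):
    """Learn the min and max of one feature (what 'fit' does per column)."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with a previously learned min and max (what 'transform' does)."""
    return [(v - lo) / (hi - lo) for v in values]

train_hits = [1000, 2000, 3000]   # made-up training feature
test_hits = [1500, 2500]          # made-up test feature

# Reuse the scale learned from the training data
lo, hi = fit_min_max(train_hits)
test_reused = apply_min_max(test_hits, lo, hi)

# Refit on the test data, so it gets its own scale
test_refit = apply_min_max(test_hits, *fit_min_max(test_hits))

print(test_reused)  # [0.25, 0.75]
print(test_refit)   # [0.0, 1.0]
```

With the training scale, 1,500 hits maps to the same number in both sets. After refitting, it becomes 0.0 simply because it is the smallest test value. In scikit-learn terms, the first variant corresponds to calling `scaler.transform(X_test)` after `scaler.fit_transform(X_train)`.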

### k-Nearest Neighbors (kNN) Algorithm

The k-nearest neighbors (kNN) algorithm is a simple, supervised machine learning algorithm that can be used for both classification and regression tasks. The key idea behind kNN is that similar data points are close to each other in the feature space.

For classification, the algorithm works as follows:
1. **Choose the number of neighbors (k)**: This is a user-defined constant, typically a small positive integer.
2. **Calculate the distance**: Compute the distance between the point to be classified and all the points in the training data.
3. **Identify the k-nearest neighbors**: Select the k training examples that are closest to the point in question.
4. **Vote for the labels**: For classification, each of the k neighbors votes for their class label, and the label with the most votes is assigned to the point.

For regression, the algorithm works similarly, but instead of voting for the label, the algorithm takes the average of the values of the k-nearest neighbors.
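The classification steps listed above can be sketched in a few lines of plain Python. This is an illustration on made-up 2-D points, not how `KNeighborsClassifier` is implemented; `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Step 2: distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: majority vote over their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up points: class 0 clusters near the origin, class 1 near (1, 1)
points = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.9, 0.8), (1.0, 0.9), (0.8, 1.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.15, 0.15), k=3))  # 0
print(knn_predict(points, labels, (0.85, 0.95), k=3))  # 1
```

For regression, the vote in step 4 would be replaced by the mean of the neighbors' target values. With an even `k`, a tie between two classes is possible; in this sketch the tie goes to whichever label was counted first.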

#### Euclidean Distance Formula

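The Euclidean distance between two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ in an n-dimensional space is given by the formula:

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
$$

In two dimensions, with points $(x_1, y_1)$ and $(x_2, y_2)$, this reduces to:

$$
d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
$$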

This formula computes the straight-line (or "as-the-crow-flies") distance between the two points.
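As a small worked example, the hypothetical `euclidean` helper below computes the distance for points with any number of coordinates by summing the squared differences per coordinate. With two coordinates it is exactly the two-dimensional case. The player values are made up.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

# Two dimensions: the classic 3-4-5 right triangle
print(euclidean((0, 0), (3, 4)))  # 5.0

# Three scaled features per player (made-up values in the 0-to-1 range)
player_a = (0.2, 0.5, 0.9)
player_b = (0.4, 0.1, 0.6)
print(round(euclidean(player_a, player_b), 4))
```

Because every squared difference contributes to the sum, a feature with a much larger numeric range would dominate the distance, which is why the features are scaled before fitting kNN.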


In [10]:
# Creating the model

kNN = KNeighborsClassifier(n_neighbors = 8)

In [11]:
# Fitting the training data with the model

kNN.fit(X_train, y_train)

In [12]:
# Making predictions with the model that we have made

y_pred = kNN.predict(X_test)

print(y_pred)

[0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0
 0 1 0 0 1 0 0 0 1 0 0 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 0 1 0 0 0 0
 1 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1]


In [13]:
# To see how accurate we are we can do kNN.score(X_test, y_test)

kNN.score(X_test, y_test)

0.8279569892473119

In [14]:
'''
The confusion matrix allows you to see the following
(rows are true labels, columns are predicted labels, with class 0 first):

true negatives (Upper Left)
false positives (Upper Right)
false negatives (Lower Left)
true positives (Lower Right)
'''

cm = confusion_matrix(y_test, y_pred)
cm

array([[55, 12],
       [ 4, 22]], dtype=int64)
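The counts in this matrix are enough to recompute the headline metrics by hand. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, classes sorted as 0 then 1) and treats class 1, Hall of Fame, as the positive class.

```python
# Counts copied from the confusion matrix above
cm = [[55, 12],
      [4, 22]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted Hall of Famers, how many really are
recall = tp / (tp + fn)      # of the real Hall of Famers, how many were found

print(round(accuracy, 4))    # 0.828
print(round(precision, 2))   # 0.65
print(round(recall, 2))      # 0.85
```

The accuracy matches the value returned by `kNN.score(X_test, y_test)`: 77 correct predictions out of 93 test samples.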

In [15]:
# Making the Classification Report

cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.93      0.82      0.87        67
           1       0.65      0.85      0.73        26

    accuracy                           0.83        93
   macro avg       0.79      0.83      0.80        93
weighted avg       0.85      0.83      0.83        93
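The last two rows of the report are two ways of averaging the per-class scores. As a sketch for the precision column, using the per-class values implied by the confusion matrix (55 of 59 predictions of class 0 correct, 22 of 34 predictions of class 1 correct) and the supports 67 and 26:

```python
# Per-class precision from the confusion matrix counts
precision_by_class = {0: 55 / (55 + 4), 1: 22 / (22 + 12)}
support = {0: 67, 1: 26}   # number of true samples per class in the test set

# Macro average: every class counts equally
macro = sum(precision_by_class.values()) / len(precision_by_class)

# Weighted average: each class counts in proportion to its support
total = sum(support.values())
weighted = sum(precision_by_class[c] * support[c] for c in support) / total

print(round(macro, 2))      # 0.79
print(round(weighted, 2))   # 0.85
```

The weighted average sits closer to the class 0 score because class 0 makes up most of the test set, while the macro average gives the smaller Hall of Fame class the same weight.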

