In [1]:
import pandas as pd

The MinMaxScaler from sklearn.preprocessing is a tool used to scale numerical features to a specified range, typically between 0 and 1. It transforms each feature individually by rescaling its values linearly so that they fall within the defined range.
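
As a minimal sketch of the linear rescaling described above, the snippet below scales a small made-up array and compares the result with the formula (x - min) / (max - min) applied column by column. The names X_toy, scaler_toy, and X_manual are illustrative and not part of this notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaler_toy = MinMaxScaler()  # the default feature_range is (0, 1)
X_scaled = scaler_toy.fit_transform(X_toy)

# The same rescaling done by hand, one column at a time.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```

Each column is scaled independently, so both features end up spanning 0 to 1 regardless of their original units.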

In [3]:
from sklearn.preprocessing import MinMaxScaler


The train_test_split function from sklearn.model_selection divides a dataset into two subsets: one for training a model and one for evaluating its performance. The model is trained on one portion of the data, while the remaining portion, unseen during training, is used to evaluate it.
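
A small sketch of what the function returns, using made-up arrays rather than this notebook's data. The names X_toy, y_toy, X_tr, X_te, y_tr, and y_te are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and a binary target.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing; the rest is used for training.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # 8 training rows, 2 test rows
```

Every row lands in exactly one of the two subsets, so the test rows are never seen while fitting.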

In [5]:
from sklearn.model_selection import train_test_split

The KNeighborsClassifier from sklearn.neighbors is a machine learning algorithm based on the k-nearest neighbors (KNN) approach, which is primarily used for classification tasks. This algorithm works by identifying the k nearest data points (neighbors) to a given sample in the feature space and assigning the class label most commonly found among these neighbors.
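
The majority vote can be illustrated with a made-up one-dimensional example. This is only a sketch: the points, labels, and the name knn_toy are hypothetical, and k=3 is chosen so that the vote cannot tie.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: class 0 clusters near 0, class 1 clusters near 1.
X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)

# Each query point takes the majority label of its 3 nearest training points.
pred = knn_toy.predict([[0.15], [1.05]])
print(pred)  # [0 1]
```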

In [7]:
from sklearn.neighbors import KNeighborsClassifier

The confusion_matrix function from sklearn.metrics is a valuable tool for evaluating the performance of a classification model. It provides a detailed summary of the model's predictions by comparing the true labels of the dataset with the predicted labels. The result is presented in the form of a matrix, where each row represents the instances in an actual class, and each column represents the instances in a predicted class.

For binary classification with labels 0 and 1, the confusion matrix has four components, which scikit-learn lays out as [[TN, FP], [FN, TP]]:

True Positives (TP): Instances correctly predicted as belonging to the positive class.

True Negatives (TN): Instances correctly predicted as belonging to the negative class.

False Positives (FP): Instances incorrectly predicted as belonging to the positive class (also known as Type I errors).

False Negatives (FN): Instances incorrectly predicted as belonging to the negative class (also known as Type II errors).
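
The four counts can be read off a binary confusion matrix as in the sketch below. The label lists are made up for illustration; with labels 0 and 1, scikit-learn puts the negative class in the first row and column.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for seven samples.
y_true_toy = [0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 1, 0, 1, 1, 0, 1]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# Rows are actual classes and columns are predicted classes,
# so flattening the 2x2 matrix yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print(tn, fp, fn, tp)  # 2 1 1 3
```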

In [9]:
from sklearn.metrics import confusion_matrix

The classification_report function from sklearn.metrics is a comprehensive tool for evaluating the performance of a classification model. It generates a detailed summary of key metrics that provide insights into how well the model is performing across different classes. These metrics include precision, recall, F1-score, and support.

Precision measures the accuracy of positive predictions. It is the ratio of true positives to all predicted positives, TP / (TP + FP). High precision means that few of the model's positive predictions are false positives.

Recall (also called sensitivity or true positive rate) measures the ability of the model to correctly identify all relevant instances. It is the ratio of true positives to the total actual positives. High recall means the model identifies most of the true positives.

F1-Score is the harmonic mean of precision and recall. It combines the two into a single number that is high only when both precision and recall are high.

Support refers to the number of true instances for each class in the dataset. It gives an idea of the distribution of the dataset across the classes.
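
The definitions above can be checked by hand against scikit-learn's own metric functions. The sketch below uses made-up labels in which the positive class has TP = 3, FP = 1, and FN = 1; all variable names are illustrative.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: class 1 is the positive class.
y_true_toy = [0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 1, 0, 1, 1, 0, 1]

# Counts for the positive class, read off the two lists above.
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, precision_score(y_true_toy, y_pred_toy))
print(recall, recall_score(y_true_toy, y_pred_toy))
print(f1, f1_score(y_true_toy, y_pred_toy))
```

The support for class 1 here is 4, the number of samples whose true label is 1.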

In [11]:
from sklearn.metrics import classification_report

In [12]:
## Build out the model

In [13]:
df = pd.read_csv(r'C:\Users\user\Desktop\Data/500hits.csv', encoding = 'latin-1')

In [14]:
df.head()

Unnamed: 0,PLAYER,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,BA,HOF
0,Ty Cobb,24,3035,11434,2246,4189,724,295,117,726,1249,357,892,178,0.366,1
1,Stan Musial,22,3026,10972,1949,3630,725,177,475,1951,1599,696,78,31,0.331,1
2,Tris Speaker,22,2789,10195,1882,3514,792,222,117,724,1381,220,432,129,0.345,1
3,Derek Jeter,20,2747,11195,1923,3465,544,66,260,1311,1082,1840,358,97,0.31,1
4,Honus Wagner,21,2792,10430,1736,3430,640,252,101,0,963,327,722,15,0.329,1


In [15]:
df = df.drop(columns = ['PLAYER', 'CS'])

In [16]:
df.head()

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA,HOF
0,24,3035,11434,2246,4189,724,295,117,726,1249,357,892,0.366,1
1,22,3026,10972,1949,3630,725,177,475,1951,1599,696,78,0.331,1
2,22,2789,10195,1882,3514,792,222,117,724,1381,220,432,0.345,1
3,20,2747,11195,1923,3465,544,66,260,1311,1082,1840,358,0.31,1
4,21,2792,10430,1736,3430,640,252,101,0,963,327,722,0.329,1


In [17]:
# Select the independent variables (features) used for prediction: the first 13 columns, YRS through BA
X = df.iloc[:,0:13]

In [18]:
# Select the target variable to be predicted: the last column, HOF
y = df.iloc[:,13]

This splits the dataset into training and testing sets using the train_test_split function from the sklearn.model_selection module. The purpose of this split is to train the model on one portion of the data (training set) and evaluate its performance on unseen data (testing set). Here's a breakdown of the parameters and their roles:

X and y:

X represents the features or independent variables of the dataset.
y represents the target or dependent variable (what we want to predict).

test_size=0.2: This specifies the proportion of the dataset to include in the test split. In this case, 20% of the data is allocated for testing, and the remaining 80% is used for training.

random_state=11: This ensures reproducibility. By setting a fixed random state, the same split will be generated each time the code is run. This is particularly useful for debugging or comparing model performance.

Outputs (X_train, X_test, y_train, y_test):

X_train and y_train are the feature and target variables for the training set, used to fit the model.
X_test and y_test are the feature and target variables for the testing set, used to evaluate the model's performance.
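
The effect of random_state can be seen in a short sketch with synthetic data: two calls with the same seed return identical splits. The arrays X_demo and y_demo are hypothetical stand-ins, not the notebook's X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 25 samples with 2 features each.
X_demo = np.arange(50).reshape(25, 2)
y_demo = np.arange(25) % 2

# Two independent calls with the same seed and test size.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=11)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=11)

# Compare X_train, X_test, y_train, y_test pairwise.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)              # True
print(len(split_a[1]))   # 5 test rows, i.e. 20% of 25
```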

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 11, test_size=0.2)

This code initializes a Min-Max scaler using the MinMaxScaler class from the sklearn.preprocessing module. Min-Max scaling transforms the features of the dataset so that their values fall within a specified range, in this case between 0 and 1.

In [22]:
scaler = MinMaxScaler(feature_range=(0,1))

In [23]:
# Fit the scaler on the training features (learn each column's min and max) and scale X_train to [0, 1]
X_train = scaler.fit_transform(X_train)

In [24]:
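# Scale the features in X_test to [0, 1] with the MinMaxScaler.
# Note: fit_transform refits the scaler on the test set's own min and max.
# The usual practice is scaler.transform(X_test), which reuses the training-set min and max.
X_test = scaler.fit_transform(X_test)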
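
The difference between reusing a fitted scaler and refitting it on the test data can be seen in a small sketch with made-up numbers. The arrays and the name scaler_toy are hypothetical; the point is only that the two calls give different results for the same test values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
X_train_toy = np.array([[0.0], [10.0]])
X_test_toy = np.array([[5.0], [8.0]])

# Fit on the training data only, then reuse the learned min and max.
scaler_toy = MinMaxScaler().fit(X_train_toy)
reused = scaler_toy.transform(X_test_toy)

# Refitting on the test data uses the test set's own min and max instead.
refit = MinMaxScaler().fit_transform(X_test_toy)

print(reused.ravel())  # [0.5 0.8]
print(refit.ravel())   # [0. 1.]
```

With transform, a test value of 5 maps to 0.5 because it sits halfway along the training range. With a refit, the same value maps to 0, so training and test features no longer share a scale.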

The code initializes a K-Nearest Neighbors (KNN) classifier with the following specification:

KNeighborsClassifier: This is a machine learning algorithm used for classification tasks, where it assigns labels to new data points based on the majority class of their nearest neighbors in the training dataset.
n_neighbors=8: This parameter sets the number of neighbors (data points) to consider when making a prediction. In this case, it uses the 8 nearest neighbors to determine the class of a test sample.

In [26]:
knn = KNeighborsClassifier(n_neighbors=8)

In [27]:
# Trains the K-Nearest Neighbors (KNN) classifier using the training data:
knn.fit(X_train, y_train)


This code uses the trained K-Nearest Neighbors (KNN) classifier to make predictions on the test data:

knn: This is the KNN classifier that was previously trained using the fit() method.
predict(): This method is used to predict the target labels (i.e., class labels) for new data based on the trained model.
X_test: This is the feature matrix of the test data for which the predictions are being made.

In [29]:
y_pred = knn.predict(X_test)

In [30]:
print(y_pred)

[0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0
 0 1 0 0 1 0 0 0 1 0 0 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 0 1 0 0 0 0
 1 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1]


In [31]:
knn.score(X_test, y_test)

0.8279569892473119

In [32]:
cm = confusion_matrix(y_test, y_pred)

In [33]:
print(cm)

[[55 12]
 [ 4 22]]
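
As a quick check, the accuracy returned by knn.score can be recovered from the matrix printed above. The sketch below copies the four counts into a NumPy array; the name cm_reported is illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm_reported = np.array([[55, 12],
                        [4, 22]])

# scikit-learn's binary layout is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm_reported.ravel()

# Accuracy is the share of correct predictions (the diagonal) among all samples.
accuracy = (tn + tp) / cm_reported.sum()
print(tn, fp, fn, tp)
print(round(accuracy, 4))  # 0.828
```

So 55 non-Hall-of-Fame players and 22 Hall-of-Fame players were classified correctly, 12 were wrongly predicted as Hall of Fame, and 4 Hall-of-Fame players were missed.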


In [34]:
cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.93      0.82      0.87        67
           1       0.65      0.85      0.73        26

    accuracy                           0.83        93
   macro avg       0.79      0.83      0.80        93
weighted avg       0.85      0.83      0.83        93
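
The per-class figures and the two averages in the report can be reproduced from the confusion matrix counts shown earlier ([[55, 12], [4, 22]]). This is a sketch with illustrative variable names, not a replacement for classification_report.

```python
import numpy as np

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm_reported = np.array([[55, 12],
                        [4, 22]])

correct = np.diag(cm_reported)          # correctly classified samples per class
support = cm_reported.sum(axis=1)       # actual samples per class: [67, 26]
predicted = cm_reported.sum(axis=0)     # predicted samples per class: [59, 34]

precision = correct / predicted
recall = correct / support
f1 = 2 * precision * recall / (precision + recall)

# Macro average weights both classes equally; weighted average uses support.
macro_f1 = f1.mean()
weighted_f1 = np.average(f1, weights=support)

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The weighted average sits closer to the class 0 scores because class 0 accounts for 67 of the 93 test samples.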

