<a href="https://colab.research.google.com/github/Shnku/pythoning_stuff/blob/main/mathml/ClassificationofWine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wine Classification using a KNN Model

The sklearn wine dataset is a classic dataset for practicing classification. It contains information about *178 different wines*, with **13 features** describing their chemical properties, and a target variable indicating the wine's class (0, 1, or 2).

## Importing libraries


In [142]:
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import confusion_matrix
from IPython.display import display
import pandas as pd
import numpy as np

## Loading the Dataset


- `return_X_y=True` means the function will return the feature data (X) and target data (y) separately.

- `as_frame=True` means the data will be returned as pandas objects: a DataFrame for the features and a Series for the target.

The **feature data** is assigned to the variable `wine_dataX`, and the **target data** is assigned to the variable `wine_datay`.


In [143]:
wine_dataX,wine_datay = load_wine(return_X_y=True,as_frame=True)

In [144]:
display(wine_dataX)
display(wine_datay)

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0


Unnamed: 0,target
0,0
1,0
2,0
3,0
4,0
...,...
173,2
174,2
175,2
176,2


## Splitting wine data into training and testing sets
***`train_test_split:`*** This is a function from the sklearn.model_selection module. Its purpose is to split data into random train and test subsets.

1. **`test_size=0.3:`** This parameter specifies that 30% of the data should be allocated to the testing set. The remaining 70% will be used for training.

2. **`random_state=62:`** This parameter ensures that the data is split in the same way every time the code is run. This is important for reproducibility. If a different random_state is used, then the data will be split into different training and testing sets, potentially causing variance in model performance comparisons.


3. **`train_X,test_X,train_y,test_y:`** These are the output variables that will store the split data:
    -  train_X: The features used to train the model.
    - test_X: The features used to evaluate the model's performance on unseen data.
    - train_y: The target variable corresponding to the training data.
    - test_y: The target variable corresponding to the testing data, used to evaluate the model's accuracy in predicting against unseen data.


In [145]:
train_X,test_X,train_y,test_y = train_test_split(wine_dataX,wine_datay,test_size=0.3,random_state=62)

In [146]:
print(train_X.shape)
print(test_X.shape)

(124, 13)
(54, 13)
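
The 124/54 shapes above follow from rounding the test share up: 30% of 178 rows is 53.4, which becomes 54 test rows. The idea of a seeded shuffle-and-cut split can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation: `simple_split` is a hypothetical helper, and the rows it selects will differ from those chosen by `train_test_split`.

```python
import math
import random


def simple_split(rows, test_size=0.3, seed=62):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    n_test = math.ceil(len(rows) * test_size)  # 178 * 0.3 = 53.4 -> 54
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # same seed -> same order
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(178))                        # stand-in for the 178 wines
train, test = simple_split(rows)
print(len(train), len(test))                   # 124 54

# Re-running with the same seed reproduces the split exactly.
assert simple_split(rows) == (train, test)
```

Because the shuffle is driven entirely by the seed, the same seed always yields the same train and test rows, which is the reproducibility property described for `random_state`.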


## Creating the KNN Classifier
**`n_neighbors=3`**: This is a parameter of the KNeighborsClassifier. It tells the model to consider the 3 nearest neighbors when classifying a new data point. In essence, *it looks at the 3 most similar data points to help categorize the new one.*

In [147]:
KNN = KNeighborsClassifier(n_neighbors=3)
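
To show what "looking at the 3 nearest neighbors" means, here is a minimal from-scratch sketch using Euclidean distance and a majority vote. It assumes plain tuples as points; `knn_predict` and the toy data are hypothetical and made up for illustration. scikit-learn's classifier uses optimized neighbor searches and may differ in details such as tie-breaking.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D points (made up for illustration, not taken from the wine data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.1, 1.0)))  # 0
print(knn_predict(points, labels, (4.0, 4.5)))  # 1
```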

## Training the Model
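**`fit:`** This is the method that trains the model. We are essentially teaching the model how to classify wine types based on the training data. For KNN, this mainly means storing (and indexing) the training samples so that the nearest neighbors can be looked up at prediction time.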

In [148]:
KNN.fit(train_X,train_y)

## Testing the Model

Think of the KNN model as a wine expert who has tasted many wines (training data) and learned to identify them based on their characteristics (features). \
Now, you're giving the expert a new set of wines (`test_X`) to taste, and `y_pred` will hold the expert's predictions about the types of those new wines, based on that experience.

In [149]:
y_pred = KNN.predict(test_X)
print(y_pred)

[0 2 1 1 2 0 0 0 1 0 1 0 1 1 2 2 2 1 0 1 1 1 0 0 1 0 0 2 0 2 2 2 2 1 1 1 0
 2 1 1 2 1 2 1 0 1 0 2 1 2 1 2 1 2]


In [150]:
y_actul = np.asarray(test_y)
print(y_actul)

[0 2 2 1 0 0 1 0 2 0 2 0 1 2 1 2 1 1 1 2 2 1 0 0 1 0 0 2 0 2 1 1 2 1 1 1 0
 1 2 1 0 2 2 1 2 1 0 1 1 1 1 0 2 2]


In [151]:
res = [7 if i == j else 0 for i, j in zip(y_pred, y_actul)]
print(np.array(res))  # 7 = correct prediction / 0 = wrong prediction

true_count = np.count_nonzero(res)  # Count True values
false_count = len(res) - true_count  # Count False values

print(f"\nNo of True values: {true_count}, ({(true_count / len(res)) * 100} %)")
print(f"No of False values: {false_count}, ({(false_count / len(res)) * 100} %)")

[7 7 0 7 0 7 0 7 0 7 0 7 7 0 0 7 0 7 0 0 0 7 7 7 7 7 7 7 7 7 0 0 7 7 7 7 7
 0 0 7 0 0 7 7 0 7 7 0 7 0 7 0 0 7]

No of True values: 32, (59.25925925925925 %)
No of False values: 22, (40.74074074074074 %)
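
The same tally can be written more compactly by summing the comparisons directly. The sketch below uses only the first ten values of the `y_pred` and `y_actul` arrays printed above, so its result (6 of 10) is for that sample and not for the full test set; `pred_sample` and `true_sample` are names introduced here for illustration.

```python
# First ten predictions and true labels copied from the printed arrays above.
pred_sample = [0, 2, 1, 1, 2, 0, 0, 0, 1, 0]
true_sample = [0, 2, 2, 1, 0, 0, 1, 0, 2, 0]

# True counts as 1 and False as 0, so summing the comparisons counts the hits.
hits = sum(p == t for p, t in zip(pred_sample, true_sample))
accuracy = hits / len(true_sample)

print(hits, accuracy)  # 6 0.6
```

On the full NumPy arrays, `(y_pred == y_actul).mean()` gives the same fraction as the 32/54 tally.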


### Explanation of TP, FP, TN, and FN:

Think of it like this:

- **TP (True Positive):** You predicted YES, and the answer is YES. (Correct prediction)
- **FP (False Positive):** You predicted YES, but the answer is NO. (Wrong prediction - Type I error)
- **TN (True Negative):** You predicted NO, and the answer is NO. (Correct prediction)
- **FN (False Negative):** You predicted NO, but the answer is YES. (Wrong prediction - Type II error)

#### How they relate to your Wine Classification:

Imagine your KNN model is trying to classify wines into three classes (0, 1, or 2). Here's how these terms apply:

- **TP:** The model predicts a wine to be class '1', and it actually is class '1'.
- **TN:** The model predicts a wine to not be class '0', and it actually is not class '0' (it could be class '1' or '2').
- **FP:** The model predicts a wine to be class '2', but it's actually class '0'.
- **FN:** The model predicts a wine to not be class '1', but it actually is class '1'.

---

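#### Matrix representation for 3 wine classes

With more than two classes, TP, FP and FN are counted per class: each diagonal cell is a TP for its class, and each off-diagonal cell is an FN for the actual class and an FP for the predicted class.

|                     | Predict C0 | Predict C1 | Predict C2 |
|---------------------|-------------------|-------------------|-------------------|
| **Actual Class_0** | TP (C0)           | FN (C0) / FP (C1) | FN (C0) / FP (C2) |
| **Actual Class_1** | FN (C1) / FP (C0) | TP (C1)           | FN (C1) / FP (C2) |
| **Actual Class_2** | FN (C2) / FP (C0) | FN (C2) / FP (C1) | TP (C2)           |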
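
The YES/NO definitions above extend to three classes by picking one class as YES and treating the other two as NO. The following sketch counts the four outcomes that way. `one_vs_rest_counts` is a hypothetical helper, and the data is just the first ten true labels and predictions from the arrays printed earlier, so the counts describe that sample only.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) when `positive` is the YES class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # predicted YES, actually YES
        elif predicted == positive:
            fp += 1   # predicted YES, actually NO
        elif actual == positive:
            fn += 1   # predicted NO, actually YES
        else:
            tn += 1   # predicted NO, actually NO
    return tp, fp, fn, tn


# First ten true labels and predictions from the printed arrays above.
true_sample = [0, 2, 2, 1, 0, 0, 1, 0, 2, 0]
pred_sample = [0, 2, 1, 1, 2, 0, 0, 0, 1, 0]

for cls in (0, 1, 2):
    print(cls, one_vs_rest_counts(true_sample, pred_sample, cls))
# 0 (4, 1, 1, 4)
# 1 (1, 2, 1, 6)
# 2 (1, 1, 2, 6)
```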

#### Now, trying to count them manually

In [152]:
# Predicted class 0, actual class 0: true positives for class 0
# Note: `y_pred == 0 and y_actul == 0` raises an error on arrays; use np.logical_and (or &)
TP0 = np.count_nonzero(np.logical_and(y_pred == 0, y_actul == 0))
print(TP0)
# Predicted class 0, actual class 1: false positives for class 0
FP0_actual1 = np.count_nonzero(np.logical_and(y_pred == 0, y_actul == 1))
print(FP0_actual1)
# Predicted class 0, actual class 2: also false positives for class 0
FP0_actual2 = np.count_nonzero(np.logical_and(y_pred == 0, y_actul == 2))
print(FP0_actual2)

12
2
1


In [153]:
# Predicted class 1, actual class 0: false positives for class 1
FP1_actual0 = np.count_nonzero(np.logical_and(y_pred == 1, y_actul == 0))
print(FP1_actual0)
# Predicted class 1, actual class 1: true positives for class 1
TP1 = np.count_nonzero(np.logical_and(y_pred == 1, y_actul == 1))
print(TP1)
# Predicted class 1, actual class 2: also false positives for class 1
FP1_actual2 = np.count_nonzero(np.logical_and(y_pred == 1, y_actul == 2))
print(FP1_actual2)

0
13
9


In [154]:
# Predicted class 2, actual class 0: false positives for class 2
FP2_actual0 = np.count_nonzero(np.logical_and(y_pred == 2, y_actul == 0))
print(FP2_actual0)
# Predicted class 2, actual class 1: also false positives for class 2
FP2_actual1 = np.count_nonzero(np.logical_and(y_pred == 2, y_actul == 1))
print(FP2_actual1)
# Predicted class 2, actual class 2: true positives for class 2
TP2 = np.count_nonzero(np.logical_and(y_pred == 2, y_actul == 2))
print(TP2)

3
7
7


### Accuracy Score, Recall, Confusion Matrix and Precision Score
$$ \text{accuracy} = \frac{TP+TN}{TP+FP+TN+FN} $$

$$ \text{precision} = \frac{TP}{TP+FP} $$

$$ \text{recall} = \frac{TP}{TP+FN} $$


In [155]:
accuracy_score(test_y,y_pred)

0.5925925925925926

In [156]:
precision_score(test_y,y_pred,average='micro')

0.5925925925925926

In [157]:
confusion_matrix(test_y,y_pred)

array([[12,  0,  3],
       [ 2, 13,  7],
       [ 1,  9,  7]])
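
Per-class precision and recall can be read off this matrix directly, since rows are actual classes and columns are predicted classes. The sketch below does so in plain Python with the matrix values copied from the output above. The variable names are introduced here for illustration, and the micro-average is computed by pooling the counts over all classes.

```python
# Confusion matrix printed above: rows are actual classes, columns are predictions.
cm = [
    [12, 0, 3],
    [2, 13, 7],
    [1, 9, 7],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

precisions, recalls = [], []
for c in range(n_classes):
    tp = cm[c][c]
    fp = sum(cm[r][c] for r in range(n_classes)) - tp  # rest of column c
    fn = sum(cm[c]) - tp                               # rest of row c
    precisions.append(tp / (tp + fp))
    recalls.append(tp / (tp + fn))
    print(f"class {c}: precision={precisions[c]:.3f} recall={recalls[c]:.3f}")

# Micro-averaging pools TP and FP over all classes before dividing.
micro_precision = sum(cm[c][c] for c in range(n_classes)) / total
print(f"micro precision={micro_precision:.4f}")  # 0.5926
```

For this particular matrix the row totals and column totals coincide (15, 22, 17), so precision equals recall for every class. The micro-averaged precision is 32/54, the same value as the accuracy score.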

### f1_Score

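The F-score is used to evaluate the performance of classification models. It combines precision ($P$) and recall ($R$) into a single number:

$$ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$

The general form weights recall by a factor $\beta$; setting $\beta = 1$ recovers $F_1$:

$$ F_\beta = \frac{(\beta^2+1)PR}{\beta^2 P+R} $$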

In [158]:
from sklearn.metrics import f1_score
f1_score(test_y,y_pred,average='micro')

0.5925925925925926
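
As a rough cross-check of the formula, the F-score can also be computed by hand. The sketch below assumes the per-class precision and recall values derived from the confusion matrix above (12/15, 13/22 and 7/17, which happen to coincide for precision and recall). `f_beta` is a hypothetical helper, not a scikit-learn function, and the macro average is shown only for comparison with the micro average used in the notebook.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both are zero
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# (precision, recall) per class, derived from the confusion matrix above.
per_class = [(12 / 15, 12 / 15), (13 / 22, 13 / 22), (7 / 17, 7 / 17)]

f1_per_class = [f_beta(p, r) for p, r in per_class]
macro_f1 = sum(f1_per_class) / len(f1_per_class)  # unweighted mean over classes

print([round(f, 3) for f in f1_per_class])  # [0.8, 0.591, 0.412]
print(round(macro_f1, 3))                   # 0.601
```

The macro average treats each class equally, which is why it differs slightly from the micro-averaged value of about 0.593 reported above.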