# Implementación de un modelo de clasificación KNN para predecir la calidad de vinos según sus caracteristicas quimicas. El objetivo es clasificar los vinos en 2 categorias:

- Calidad baja (0): Vinos con puntuación menor a 7
- Calidad alta (1): Vinos con puntuación mayor a 7

El dataset presenta un balance donde los vinos con buena calidad son muy pocos:

0 :   0.86429

1 :   0.13571

In [9]:
import pandas as pd
import numpy as np
# from scikitlearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [10]:
df= pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/refs/heads/main/winequality-red.csv',delimiter=";")

In [11]:
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [12]:
# Ver info general
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [13]:
df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


## KNN para clasificación

In [14]:
# Convertir calidad en variable binaria
df['quality_binary'] = df['quality'].apply(lambda x: 1 if x >= 7 else 0)
# Separar variables predictoras y target
X = df.drop(['quality', 'quality_binary'], axis=1)
y = df['quality_binary']


In [15]:
print(y.value_counts(normalize=True))

quality_binary
0    0.86429
1    0.13571
Name: proportion, dtype: float64


In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [17]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [18]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

0,1,2
,n_neighbors,5
,weights,'uniform'
,algorithm,'auto'
,leaf_size,30
,p,2
,metric,'minkowski'
,metric_params,
,n_jobs,


In [13]:
from sklearn.metrics import classification_report, confusion_matrix
y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[268   9]
 [ 26  17]]
              precision    recall  f1-score   support

           0       0.91      0.97      0.94       277
           1       0.65      0.40      0.49        43

    accuracy                           0.89       320
   macro avg       0.78      0.68      0.72       320
weighted avg       0.88      0.89      0.88       320



## Optimización de hiperparametro con gridsearch

In [19]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': range(1, 31)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='f1', n_jobs=-1)
grid.fit(X_train_scaled, y_train)

print("Mejor k:", grid.best_params_)
print("Mejor score:", grid.best_score_)

Mejor k: {'n_neighbors': 1}
Mejor score: 0.5408835139913306


In [20]:
print("Mejor k:", grid.best_params_)
print("Mejor F1-score en validación cruzada:", grid.best_score_)

Mejor k: {'n_neighbors': 1}
Mejor F1-score en validación cruzada: 0.5408835139913306


In [23]:
from sklearn.metrics import classification_report, confusion_matrix
best_knn = grid.best_estimator_
y_pred_best = best_knn.predict(X_test_scaled)
print(classification_report(y_test, y_pred_best))

              precision    recall  f1-score   support

           0       0.95      0.95      0.95       277
           1       0.67      0.65      0.66        43

    accuracy                           0.91       320
   macro avg       0.81      0.80      0.80       320
weighted avg       0.91      0.91      0.91       320



In [24]:
# Matriz de confusión
cm = confusion_matrix(y_test, y_pred_best)
print("Matriz de confusión:")
print(cm)

Matriz de confusión:
[[263  14]
 [ 15  28]]


En la matriz de confusión solo se presentan 14 falsos positivos de un total de 277 casos. Es un modelo confiable para identificar vinos de calidad baja (95% de exactitud) por lo que si se requiere identificar con mayor exactitud vinos de alta calidad, se tienen que implementar técnicas más avanzadas. De un total de 43 vinos de alta calidad, solo detecta 28 de forma correcta. 