## Distance-Based and Bayesian Classifiers <a class="anchor" id="1"></a>

This notebook applies three classification models to a dataset:

a. 1NN
b. KNN with K = 3, 5, 7, and 9
c. Naive Bayes

Each model is evaluated with three validation methods:

a. Hold-Out 70/30
b. 10-Fold Cross-Validation
c. Leave-One-Out

The dataset is the glass classification dataset, `glass.xls`. We start with a quick preprocessing pass so that it is ready to work with afterwards.
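
To give a rough idea of how a distance-based classifier and a validation scheme fit together, here is a minimal NumPy-only sketch of k-nearest neighbours scored with Leave-One-Out. The toy points and the helper names `knn_predict` and `leave_one_out_accuracy` are illustrative assumptions, not the implementation used with the glass data.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances to x
    nearest = np.argsort(dists, kind="stable")[:k]     # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # ties go to the smallest label

def leave_one_out_accuracy(X, y, k=1):
    """Hold out each sample once, train on the rest, and average the hits."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                  # everything except sample i
        hits += int(knn_predict(X[mask], y[mask], X[i], k) == y[i])
    return hits / len(X)

# Hypothetical toy data: two well-separated clusters of three points each
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

for k in (1, 3):
    print(f"k={k}: LOO accuracy = {leave_one_out_accuracy(X_toy, y_toy, k)}")
```

With `k=1` this is the 1NN model. Hold-Out and 10-Fold differ only in how the train/test masks are built.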

## 1. Import libraries <a class="anchor" id="2"></a>

In [1]:
# Env: Python 3.12
import os  # for file and directory manipulation
import matplotlib
matplotlib.use('qt5agg')  # or 'tkagg' if this backend fails
import matplotlib.pyplot as plt  # for data visualization
import numpy as np  # for linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px  # for interactive plots
import seaborn as sns  # for statistical data visualization

In [2]:
import warnings

warnings.filterwarnings('ignore')

## 2. Import dataset <a class="anchor" id="3"></a>

The dataset describes glass classification.

In [3]:
# Despite the .xls extension, the file is parsed as comma-separated text
data = './glass.xls'

df = pd.read_csv(data)
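
Since a file with an `.xls` extension is being parsed by `pd.read_csv`, its contents are presumably plain comma-separated text. The sketch below illustrates the same call on an in-memory stand-in: `sample_csv` and `sample_df` are hypothetical names, and the three rows are copied from the dataset preview in this notebook.

```python
import io

import pandas as pd

# Inline stand-in for ./glass.xls (assumption: the real file is comma-separated
# text with a header row, which is what pd.read_csv expects by default)
sample_csv = """RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
"""

sample_df = pd.read_csv(io.StringIO(sample_csv))
print(sample_df.shape)    # (rows, columns)
print(sample_df.dtypes)   # nine float features plus the integer Type label
```

A genuine Excel workbook would need `pd.read_excel` instead.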

### View the dimensions of the dataset <a class="anchor" id="4.1"></a>

In [4]:
df_shape = df.shape
print(f"DataFrame rows: {df_shape[0]}")
print(f"DataFrame columns: {df_shape[1]}")

DataFrame rows: 214
DataFrame columns: 10


### Preview the dataset <a class="anchor" id="4.2"></a>

In [5]:
df.head()

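|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |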


### View a summary of the dataset <a class="anchor" id="4.5"></a>

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   RI      214 non-null    float64
 1   Na      214 non-null    float64
 2   Mg      214 non-null    float64
 3   Al      214 non-null    float64
 4   Si      214 non-null    float64
 5   K       214 non-null    float64
 6   Ca      214 non-null    float64
 7   Ba      214 non-null    float64
 8   Fe      214 non-null    float64
 9   Type    214 non-null    int64  
dtypes: float64(9), int64(1)
memory usage: 16.8 KB


### View statistical properties of the dataset <a class="anchor" id="4.6"></a>

In [7]:
df.describe()

|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 |
| mean | 1.518365 | 13.40785 | 2.684533 | 1.444907 | 72.650935 | 0.497056 | 8.956963 | 0.175047 | 0.057009 | 2.780374 |
| std | 0.003037 | 0.816604 | 1.442408 | 0.49927 | 0.774546 | 0.652192 | 1.423153 | 0.497219 | 0.097439 | 2.103739 |
| min | 1.51115 | 10.73 | 0.0 | 0.29 | 69.81 | 0.0 | 5.43 | 0.0 | 0.0 | 1.0 |
| 25% | 1.516522 | 12.9075 | 2.115 | 1.19 | 72.28 | 0.1225 | 8.24 | 0.0 | 0.0 | 1.0 |
| 50% | 1.51768 | 13.3 | 3.48 | 1.36 | 72.79 | 0.555 | 8.6 | 0.0 | 0.0 | 2.0 |
| 75% | 1.519157 | 13.825 | 3.6 | 1.63 | 73.0875 | 0.61 | 9.1725 | 0.0 | 0.1 | 3.0 |
| max | 1.53393 | 17.38 | 4.49 | 3.5 | 75.41 | 6.21 | 16.19 | 3.15 | 0.51 | 7.0 |


#### Check for missing values

In [8]:
df['Type'].isna().sum()


np.int64(0)

#### Check unique values

In [9]:
print(df['Type'].nunique())
df['Type'].unique()

6


array([1, 2, 3, 5, 6, 7])

#### View the frequency distribution of the `Type` variable

In [10]:
df['Type'].value_counts(normalize=False, dropna=False)

Type
2    76
1    70
7    29
3    17
5    13
6     9
Name: count, dtype: int64
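
The counts show a clear class imbalance. As a small worked example, the relative frequencies can be derived from the counts printed above. The `counts` Series below is typed in by hand from that output, and the "majority-class baseline" framing is an added assumption about how one might read the numbers, not a result from the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output
counts = pd.Series({2: 76, 1: 70, 7: 29, 3: 17, 5: 13, 6: 9}, name="count")

# Relative frequency of each glass type
shares = (counts / counts.sum()).round(3)
print(shares)

# A classifier that always predicts the most frequent class (Type 2)
# would be right this fraction of the time
print("Majority-class baseline accuracy:", shares.max())
```

On the full DataFrame, `df['Type'].value_counts(normalize=True)` gives the same proportions directly.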

In [15]:
plt.figure(figsize=(6, 8))
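# Assigning hue avoids the seaborn (>= 0.13) deprecation of palette without hue
sns.countplot(x="Type", hue="Type", data=df, palette="Set1", legend=False)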
plt.show()

### Explore categorical variables <a class="anchor" id="6.2"></a>

In [14]:
categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))
print('The categorical variables are :', categorical)


There are 0 categorical variables

The categorical variables are : []


### Explore numerical variables <a class="anchor" id="6.5"></a>

In [16]:
numerical = [var for var in df.columns if df[var].dtype!='O']
print('There are {} numerical variables\n'.format(len(numerical)))
print('The numerical variables are :', numerical)

There are 10 numerical variables

The numerical variables are : ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'Type']
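
Note that `Type` appears in this list only because the class label is stored as an integer. One possible way to keep it out of numeric summaries is to cast it to a categorical dtype, sketched below on a hypothetical two-row frame `toy` (the values are taken from the preview, and `numeric_features` is an illustrative name).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the glass DataFrame
toy = pd.DataFrame({"RI": [1.52101, 1.51761],
                    "Na": [13.64, 13.89],
                    "Type": [1, 1]})

# Mark the class label as categorical so it is no longer treated as a number
toy["Type"] = toy["Type"].astype("category")

# select_dtypes with np.number now returns only the measured features
numeric_features = toy.select_dtypes(include=[np.number]).columns.tolist()
print(numeric_features)
```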


### Explore problems in numerical variables <a class="anchor" id="6.7"></a>


### Missing values in numerical variables

In [17]:
df[numerical].isnull().sum()

RI      0
Na      0
Mg      0
Al      0
Si      0
K       0
Ca      0
Ba      0
Fe      0
Type    0
dtype: int64

### Outliers in numerical variables

In [18]:
# Summary statistics rounded to whole numbers for a quick scan of each range;
# pass a decimals argument to round() (e.g. 2) for finer detail
print(round(df[numerical].describe()))

          RI     Na     Mg     Al     Si      K     Ca     Ba     Fe   Type
count  214.0  214.0  214.0  214.0  214.0  214.0  214.0  214.0  214.0  214.0
mean     2.0   13.0    3.0    1.0   73.0    0.0    9.0    0.0    0.0    3.0
std      0.0    1.0    1.0    0.0    1.0    1.0    1.0    0.0    0.0    2.0
min      2.0   11.0    0.0    0.0   70.0    0.0    5.0    0.0    0.0    1.0
25%      2.0   13.0    2.0    1.0   72.0    0.0    8.0    0.0    0.0    1.0
50%      2.0   13.0    3.0    1.0   73.0    1.0    9.0    0.0    0.0    2.0
75%      2.0   14.0    4.0    2.0   73.0    1.0    9.0    0.0    0.0    3.0
max      2.0   17.0    4.0    4.0   75.0    6.0   16.0    3.0    1.0    7.0


### Find outliers in the numerical variables


In [22]:
numer_cols = numerical  # all numerical columns, including the Type label

# Draw boxplots to visualize outliers, two plots per row
n_rows = (len(numer_cols) + 1) // 2
plt.figure(figsize=(20, 4 * n_rows))

# Create one boxplot per numerical column
for i, col in enumerate(numer_cols, 1):
    plt.subplot(n_rows, 2, i)
    sns.boxplot(data=df, x=col)
    plt.title(f"Boxplot for {col}", fontsize=14)
    plt.xlabel(col, fontsize=12)

# Overall title for the entire figure
plt.suptitle("Boxplots to Identify Outliers in Numerical Data", fontsize=16, weight='bold')

# Adjust layout for better spacing
plt.tight_layout(rect=[0, 0, 1, 0.96])  # Leave space for the suptitle
plt.show()

In [23]:
# Plot histograms to check each distribution, two plots per row
n_rows = (len(numer_cols) + 1) // 2
plt.figure(figsize=(15, 4 * n_rows))

for i, col in enumerate(numer_cols, 1):
    plt.subplot(n_rows, 2, i)
    sns.histplot(df[col], bins=10, kde=True)
    plt.title(f"Histogram for {col}", fontsize=14)
    plt.xlabel(col, fontsize=12)

# Overall title for the entire figure
plt.suptitle("Distribution in Numerical Data", fontsize=16, weight='bold')

# Adjust layout for better spacing
plt.tight_layout(rect=[0, 0, 1, 0.96])  # Leave space for the suptitle
plt.show()

In [32]:
def detect_outliers_only(df):
    """Count IQR-rule outliers in every numeric column of df."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    outlier_info = []
    for col in numeric_cols:
        non_null = df[col].dropna()
        if len(non_null) == 0:
            continue
        Q1 = non_null.quantile(0.25)
        Q3 = non_null.quantile(0.75)
        IQR = Q3 - Q1
        # Columns with zero IQR (such as Ba, whose quartiles are both 0) are skipped
        if IQR > 0:
            lb = Q1 - 1.5 * IQR  # lower fence
            ub = Q3 + 1.5 * IQR  # upper fence
            outliers = non_null[(non_null < lb) | (non_null > ub)]
            pct = 100 * len(outliers) / len(non_null)
            outlier_info.append((col, len(outliers), pct, lb, ub))
    out_df = pd.DataFrame(
        outlier_info,
        columns=['Variable', 'Outliers', 'Porcentaje', 'Límite_Inferior', 'Límite_Superior'],
    )
    return out_df


print(detect_outliers_only(df))

  Variable  Outliers  Porcentaje  Límite_Inferior  Límite_Superior
0       RI        17    7.943925          1.51257          1.52311
1       Na         7    3.271028         11.53125         15.20125
2       Mg         0    0.000000         -0.11250          5.82750
3       Al        18    8.411215          0.53000          2.29000
4       Si        12    5.607477         71.06875         74.29875
5        K         7    3.271028         -0.60875          1.34125
6       Ca        26   12.149533          6.84125         10.57125
7       Fe        12    5.607477         -0.15000          0.25000
8     Type        29   13.551402         -2.00000          6.00000
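
As a worked check of the 1.5 × IQR rule, the fences for `Na` can be recomputed by hand from the quartiles in the `df.describe()` output (25% = 12.9075, 75% = 13.825). The helper `iqr_bounds` is a hypothetical name introduced only for this sketch. It also suggests why `Ba` is missing from the table: both of its quartiles are 0, so its IQR is 0 and the column is skipped.

```python
def iqr_bounds(q1, q3, factor=1.5):
    """Return the (lower, upper) fences, or None when the IQR is not positive."""
    iqr = q3 - q1
    if iqr <= 0:
        return None
    return q1 - factor * iqr, q3 + factor * iqr

# Quartiles copied from the df.describe() output
na_bounds = iqr_bounds(12.9075, 13.825)   # Na: IQR = 0.9175
ba_bounds = iqr_bounds(0.0, 0.0)          # Ba: IQR = 0, so no fences

print("Na fences:", na_bounds)
print("Ba fences:", ba_bounds)
```

The `Na` fences agree with the 11.53125 and 15.20125 reported in the table. The `Type` row, by contrast, only reflects that class 7 lies above the upper fence of the label codes, which says nothing about measurement outliers.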


## 7. Multivariate Analysis <a class="anchor" id="7"></a>


In [33]:
# corr() method computes the pairwise correlation of columns
correlation = df.corr(numeric_only=True)
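
By default `DataFrame.corr` computes the Pearson coefficient. The sketch below spells out that formula and compares it with `np.corrcoef`. The helper `pearson_r` is a hypothetical name, and the inputs are just the five `RI` and `Ca` values from the preview, far too few to say anything about the full dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()   # centre both variables
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# First five rows of RI and Ca, copied from the preview (illustration only)
ri = [1.52101, 1.51761, 1.51618, 1.51766, 1.51742]
ca = [8.75, 7.83, 7.78, 8.22, 8.07]

r_manual = pearson_r(ri, ca)
r_numpy = float(np.corrcoef(ri, ca)[0, 1])
print(r_manual, r_numpy)   # the two values should agree up to rounding error
```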

### Heat Map <a class="anchor" id="7.1"></a>

In [None]:
plt.figure(figsize=(16,12))
plt.title('Correlation Heatmap of the Glass Dataset')
ax = sns.heatmap(correlation, square=True, annot=True, fmt='.2f', cmap='coolwarm_r')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=30)
plt.show()