# Lab 2 - Clustering
Group 14

## Data understanding
This lab uses the data in the file `Datos_SenecaféAlpes.csv`. The sections below cover loading the data, understanding it, and assessing its quality.

In [40]:
import os
import numpy as np
import pandas as pd
from sklearn import tree
import sklearn

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # for 3D plots
import seaborn as sns; sns.set()
 

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler

# Models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans, DBSCAN

# Metrics
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix,
    silhouette_score, davies_bouldin_score, silhouette_samples
)
from sklearn.metrics import ConfusionMatrixDisplay

from sklearn.decomposition import PCA
import joblib

np.random.seed(42)

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 50)


In [41]:
# Load the data
df = pd.read_csv('Datos_SenecaféAlpes.csv', encoding="UTF-8", sep=";")
df.shape

(14291, 19)

In [42]:
df.head(5)

| | ID | Area | Perimetro | LongitudEjeMayor | LongitudEjeMenor | RelacionAspecto | Excentricidad | AreaConvexa | DiametroEquivalente | Medida | Solidez | Redondez | Compacidad | FactorForma1 | FactorForma2 | FactorForma3 | FactorForma4 | DefectoVisible | MétodoSecado |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | G006149 | 50836 | 923618.0 | 358.515147 | 181.388899 | alargado | NaN | NaN | 254.413847 | 0.804762 | 0.98384 | 0.748853 | 0.709632 | 0.007052 | 0.001103 | 0.503578 | 0.995321 | Normal | Lavado |
| 1 | G007234 | 62764 | 1003767.0 | 409.207082 | 198.330199 | Alargado | NaN | 64158.0 | 282.689948 | 0.703995 | 0.978272 | 0.782807 | 0.690824 | 0.00652 | 0.000916 | 0.477237 | 0.984666 | NaN | Natural |
| 2 | G007054 | 59965 | 994266.0 | 389.088529 | 197.967275 | Alargado | 0.860886 | 60910.0 | 276.314692 | 0.661581 | 0.984485 | 0.762259 | 0.710159 | 0.006489 | 0.001018 | 0.504326 | 0.991211 | Normal | Natural |
| 3 | G006619 | 55035 | 917.6 | 379.346822 | 185.390577 | Alargado | 0.872446 | 55591.0 | NaN | 0.799695 | 0.989998 | 0.821376 | 0.697811 | 0.006893 | 0.001008 | 0.486941 | 0.99638 | Normal | Lavado |
| 4 | G013353 | 39324 | 737773.0 | 262.520242 | 191.176858 | Alargado | 0.685326 | 39758.0 | 223.760747 | 0.775392 | 0.989084 | 0.907867 | 0.852356 | 0.006676 | 0.002174 | 0.726511 | 0.99763 | Normal | Lavado |


The data dictionary is included below to explain the meaning of each column.

In [43]:
diccionario = pd.read_excel('Diccionario_SenecaféAlpes.xlsx')
pd.set_option("display.max_colwidth", None) 
display(diccionario)

| | ATRIBUTO | DESCRIPCIÓN |
|---|---|---|
| 0 | ID | Unique code generated for each inspected coffee bean. |
| 1 | Área | Surface occupied by the bean, measured as the total number of pixels within its boundary. Indicates the size of the bean. |
| 2 | Perímetro | Length of the bean's boundary, equivalent to its circumference. Reflects the complexity and continuity of the edge. |
| 3 | LongitudEjeMayor | Distance between the endpoints of the longest line that can be drawn along the bean. Represents the maximum length of the bean. |
| 4 | LongitudEjeMenor | Length of the longest line that can be drawn perpendicular to the major axis. Reflects the maximum transverse width of the bean. |
| 5 | RelaciónAspecto | Ratio of the major-axis length to the minor-axis length. Indicates whether the bean is Alargado (elongated, > 1.3) or Redondeado (rounded, ≤ 1.3). |
| 6 | Excentricidad | Measure of how far the shape deviates from a circle, based on the equivalent ellipse. Values close to 0 indicate circular shapes; values close to 1, elongated shapes. |
| 7 | ÁreaConvexa | Number of pixels in the smallest convex polygon that encloses the bean. Helps identify irregularities in the edge. |
| 8 | DiámetroEquivalente | Diameter of a circle with the same area as the bean. Makes it easier to compare beans of different shapes through an equivalent circular measure. |
| 9 | Medida | Ratio of the bean's area to the area of its bounding box. Assesses how well the bean fills its minimum bounding rectangle. |
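
The dictionary gives an explicit rule for `RelaciónAspecto`: the ratio of the major axis to the minor axis, labelled Alargado above 1.3 and Redondeado otherwise. As a minimal sketch, assuming the label can be recomputed from `LongitudEjeMayor` and `LongitudEjeMenor`, the rule could be applied as follows. The `toy` frame and the `RelacionAspectoDerivada` column are hypothetical names used only for illustration, and the values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    "LongitudEjeMayor": [358.5, 262.5, 200.0, np.nan],
    "LongitudEjeMenor": [181.4, 191.2, 160.0, 150.0],
})

# Ratio of major to minor axis; NaN propagates when either axis is missing.
ratio = toy["LongitudEjeMayor"] / toy["LongitudEjeMenor"]

# Dictionary rule: > 1.3 -> "Alargado", <= 1.3 -> "Redondeado".
label = pd.Series(
    np.where(ratio > 1.3, "Alargado", "Redondeado"),
    index=toy.index,
    dtype="object",
).where(ratio.notna())  # keep missing ratios as NaN instead of guessing a label

toy["RelacionAspectoDerivada"] = label
print(toy)
```

A derived label of this kind could serve to cross-check the recorded `RelacionAspecto` values or to fill in missing ones, provided both axis lengths are themselves valid.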


In [44]:
df.dtypes

ID                      object
Area                     int64
Perimetro              float64
LongitudEjeMayor       float64
LongitudEjeMenor       float64
RelacionAspecto         object
Excentricidad          float64
AreaConvexa            float64
DiametroEquivalente    float64
Medida                 float64
Solidez                float64
Redondez               float64
Compacidad             float64
FactorForma1           float64
FactorForma2           float64
FactorForma3           float64
FactorForma4           float64
DefectoVisible          object
MétodoSecado            object
dtype: object

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14291 entries, 0 to 14290
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   14291 non-null  object 
 1   Area                 14291 non-null  int64  
 2   Perimetro            13054 non-null  float64
 3   LongitudEjeMayor     13890 non-null  float64
 4   LongitudEjeMenor     14291 non-null  float64
 5   RelacionAspecto      13825 non-null  object 
 6   Excentricidad        13687 non-null  float64
 7   AreaConvexa          12868 non-null  float64
 8   DiametroEquivalente  12368 non-null  float64
 9   Medida               14291 non-null  float64
 10  Solidez              11985 non-null  float64
 11  Redondez             12228 non-null  float64
 12  Compacidad           13641 non-null  float64
 13  FactorForma1         13172 non-null  float64
 14  FactorForma2         13185 non-null  float64
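 15  FactorForma3         13813 non-null  float64
 16  FactorForma4         13132 non-null  float64
 17  DefectoVisible       11356 non-null  object 
 18  MétodoSecado         13704 non-null  object 
dtypes: float64(14), int64(1), object(4)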

In [46]:
print("Null values")
print(df.isnull().sum())

Null values
ID                        0
Area                      0
Perimetro              1237
LongitudEjeMayor        401
LongitudEjeMenor          0
RelacionAspecto         466
Excentricidad           604
AreaConvexa            1423
DiametroEquivalente    1923
Medida                    0
Solidez                2306
Redondez               2063
Compacidad              650
FactorForma1           1119
FactorForma2           1106
FactorForma3            478
FactorForma4           1159
DefectoVisible         2935
MétodoSecado            587
dtype: int64
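
Raw null counts are easier to judge as a share of the 14291 rows. For example, the 2935 missing `DefectoVisible` values are about 20.5% of the data. The sketch below shows one way to build such a summary. It runs on a small hypothetical frame (`toy`) rather than on `df`, and `null_summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for df.
toy = pd.DataFrame({
    "Area": [50836, 62764, 59965, 55035],
    "Solidez": [0.98, np.nan, 0.98, np.nan],
    "DefectoVisible": ["Normal", None, "Normal", "Normal"],
})

# Null count and percentage per column, sorted from most to least affected.
null_summary = (
    toy.isnull().sum().to_frame("nulls")
    .assign(pct=lambda t: (t["nulls"] / len(toy) * 100).round(2))
    .sort_values("pct", ascending=False)
)
print(null_summary)
```

Replacing `toy` with `df` would give the same summary for the full dataset.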


In [47]:
df.describe()

| | Area | Perimetro | LongitudEjeMayor | LongitudEjeMenor | Excentricidad | AreaConvexa | DiametroEquivalente | Medida | Solidez | Redondez | Compacidad | FactorForma1 | FactorForma2 | FactorForma3 | FactorForma4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14291.0 | 13054.0 | 13890.0 | 14291.0 | 13687.0 | 12868.0 | 12368.0 | 14291.0 | 11985.0 | 12228.0 | 13641.0 | 13172.0 | 13185.0 | 13813.0 | 13132.0 |
| mean | 53055.408999 | 772987.0 | 319.985592 | 202.178613 | 0.749977 | 53575.397809 | 253.001741 | 0.749844 | 0.986774 | 0.87308 | 0.799242 | 0.00656 | 0.001712 | 0.643183 | 0.994292 |
| std | 29396.080372 | 326649.3 | 86.378452 | 45.494541 | 0.099438 | 29566.387814 | 60.54233 | 0.050774 | 0.025947 | 0.063237 | 0.067643 | 0.001164 | 0.000601 | 0.100857 | 0.039081 |
| min | -62716.0 | -1012143.0 | -421.444657 | -200.838672 | -0.835004 | -78423.0 | -448.402605 | -0.798706 | -0.989042 | -0.896861 | -0.843901 | -0.007982 | -0.002673 | -0.683269 | -0.998527 |
| 25% | 36338.0 | 676860.8 | 253.319858 | 175.881052 | 0.715144 | 36720.0 | 215.302463 | 0.718767 | 0.985597 | 0.832824 | 0.762127 | 0.005903 | 0.001151 | 0.581047 | 0.993663 |
| 50% | 44660.0 | 772034.5 | 296.682345 | 192.43787 | 0.764392 | 45107.5 | 238.579492 | 0.760232 | 0.988279 | 0.883353 | 0.800994 | 0.006645 | 0.001691 | 0.641648 | 0.996377 |
| 75% | 61311.0 | 955409.8 | 376.548109 | 216.847844 | 0.810441 | 62109.25 | 279.672481 | 0.786942 | 0.989991 | 0.916803 | 0.834405 | 0.007273 | 0.002169 | 0.696366 | 0.997889 |
| max | 254616.0 | 1921685.0 | 738.860154 | 460.198497 | 0.911423 | 251082.0 | 569.374358 | 0.866195 | 0.994378 | 0.990685 | 0.987303 | 0.010451 | 0.003665 | 0.974767 | 0.999733 |
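
The `min` row is negative for every numeric column, although the dictionary describes these attributes as pixel counts, lengths, and ratios, none of which can be negative. The sketch below shows one possible way to count and locate such values. It uses a small hypothetical frame (`toy`), and the names `negative_counts` and `rows_with_negative` are illustrative. How the negative values should be treated (dropped, set to missing, or sign-corrected) is a decision the source does not specify.

```python
import pandas as pd

# Toy frame (hypothetical rows) with a few negative measurements.
toy = pd.DataFrame({
    "Area": [50836, -62716, 59965],
    "Perimetro": [923618.0, 1003767.0, -1012143.0],
    "Excentricidad": [0.86, 0.87, -0.84],
})

numeric_cols = toy.select_dtypes(include="number").columns

# Number of negative values per numeric column.
negative_counts = (toy[numeric_cols] < 0).sum()

# Boolean mask of rows containing at least one negative value.
rows_with_negative = (toy[numeric_cols] < 0).any(axis=1)

print(negative_counts)
print(toy[rows_with_negative])
```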


In [48]:
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = [c for c in df.columns if c not in numeric_cols]
print(numeric_cols)
print(categorical_cols)


['Area', 'Perimetro', 'LongitudEjeMayor', 'LongitudEjeMenor', 'Excentricidad', 'AreaConvexa', 'DiametroEquivalente', 'Medida', 'Solidez', 'Redondez', 'Compacidad', 'FactorForma1', 'FactorForma2', 'FactorForma3', 'FactorForma4']
['ID', 'RelacionAspecto', 'DefectoVisible', 'MétodoSecado']


### Analysis of categorical columns

In [49]:
df.value_counts("RelacionAspecto", dropna=False)

RelacionAspecto
Alargado      12047
Redondeado     1739
NaN             466
alargado         29
redondeado       10
Name: count, dtype: int64

In [50]:
df.value_counts("DefectoVisible", dropna=False)

DefectoVisible
Normal    9096
NaN       2935
normal    2260
Name: count, dtype: int64

In [51]:
df.value_counts("MétodoSecado", dropna=False)

MétodoSecado
Lavado     6260
Natural    4639
lavado     1552
natural    1176
NaN         587
Honey        64
honey        13
Name: count, dtype: int64
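
The three counts above show the same categories written with different capitalization (`Lavado` and `lavado`, `Normal` and `normal`, and so on). As a minimal sketch, assuming that case is the only difference between these variants, the labels could be unified with pandas string methods. The `toy` frame below is hypothetical and stands in for `df`.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical rows) reproducing the mixed capitalization.
toy = pd.DataFrame({
    "MétodoSecado": ["Lavado", "lavado", "Natural", "natural", "Honey", "honey", np.nan],
})

# Trim whitespace and capitalize the first letter; NaN values are left untouched.
toy["MétodoSecado"] = toy["MétodoSecado"].str.strip().str.capitalize()

counts = toy["MétodoSecado"].value_counts(dropna=False)
print(counts)
```

Applied to the full dataset, and under the same assumption, `MétodoSecado` would reduce to three categories: Lavado (6260 + 1552 = 7812), Natural (4639 + 1176 = 5815), and Honey (64 + 13 = 77), plus the 587 missing values.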