# UNIT 2 - DATA PREPROCESSING
## PD3
Processing of the wine dataset, available at: https://archive.ics.uci.edu/dataset/109/wine

The goal of this activity is to apply several preprocessing steps to a specific dataset.

In [9]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def convert_string_to_float(df: pd.DataFrame) -> pd.DataFrame:
    # Coerce every column to a numeric dtype (in place); unparseable values become NaN.
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    return df

# Scaler used later for the sklearn-based standardization.
scaler = StandardScaler()

# Column names are passed explicitly when reading the file.
names = [
    "class","Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium",
    "Total phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins",
    "Color intensity","Hue","OD280/OD315 of diluted wines","Proline"
]
wine = pd.read_csv("wine.csv", names=names)

wine = convert_string_to_float(wine)


### First 10 rows of the dataset:

In [10]:
wine.head(10)

Unnamed: 0,class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
5,1,14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450
6,1,14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290
7,1,14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295
8,1,14.83,1.64,2.17,14.0,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045
9,1,13.86,1.35,2.27,16.0,98,2.98,3.15,0.22,1.85,7.22,1.01,3.55,1045


### Per-column statistics
For each column we report:
- Minimum
- Maximum
- Mean
- Standard deviation

In [11]:
# Skip the first entry of `names` ("class"), which is the label rather than a measurement.
for name in names[1:]:
    print(
        f"Attribute {name} | Maximum: {wine[name].max()} | Minimum: {wine[name].min()} | Mean: {wine[name].mean()} | StdDev: {wine[name].std()}"
    )

Attribute Alcohol | Maximum: 14.83 | Minimum: 11.03 | Mean: 13.00061797752809 | StdDev: 0.8118265380058575
Attribute Malic acid | Maximum: 5.8 | Minimum: 0.74 | Mean: 2.3363483146067416 | StdDev: 1.1171460976144627
Attribute Ash | Maximum: 3.23 | Minimum: 1.36 | Mean: 2.3665168539325845 | StdDev: 0.27434400906081485
Attribute Alcalinity of ash | Maximum: 30.0 | Minimum: 10.6 | Mean: 19.49494382022472 | StdDev: 3.339563767173505
Attribute Magnesium | Maximum: 162 | Minimum: 70 | Mean: 99.74157303370787 | StdDev: 14.282483515295665
Attribute Total phenols | Maximum: 3.88 | Minimum: 0.98 | Mean: 2.295112359550562 | StdDev: 0.6258510488339893
Attribute Flavanoids | Maximum: 5.08 | Minimum: 0.34 | Mean: 2.0292696629213487 | StdDev: 0.9988586850169467
Attribute Nonflavanoid phenols | Maximum: 0.66 | Minimum: 0.13 | Mean: 0.3618539325842696 | StdDev: 0.12445334029667937
Attribute Proanthocyanins | Maximum: 3.58 | Minimum: 0.41 | Mean: 1.5908988764044945 | StdDev: 0.5723588626747613
Attribute Color intensity | M
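
The same four statistics can also be gathered in a single call. The following is a minimal sketch, assuming a small hypothetical `toy` frame (built from a few values of the rows shown earlier) rather than the full dataset; `DataFrame.agg` with a list of function names returns one row per statistic.

```python
import pandas as pd

# Hypothetical toy frame standing in for two of the wine columns (not the full dataset).
toy = pd.DataFrame({
    "Alcohol": [14.23, 13.20, 13.16, 14.37],
    "Ash": [2.43, 2.14, 2.67, 2.50],
})

# One row per statistic, one column per attribute.
# "std" is the sample standard deviation (ddof=1), the same default as Series.std().
stats = toy.agg(["min", "max", "mean", "std"])
print(stats)
```

Applied to the notebook's data, this would presumably be called as `wine[names[1:]].agg([...])`, which avoids the explicit loop.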

### Standardizing the dataset

Two different approaches are used for the standardization: the first performs the calculation with pandas and the second uses sklearn. The results are then compared.

In [12]:
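# Keep an independent copy of the raw data for sklearn; plain assignment would only alias `wine`.
wine_copy = wine.copy()

# Manual standardization with pandas.
# Note that `names` includes the `class` label, so it is standardized as well.
for name in names:
    wine[name] = (wine[name] - wine[name].mean()) / wine[name].std()

# Standardization with sklearn's StandardScaler (returns a NumPy array).
wine_copy = scaler.fit_transform(wine_copy)

wine.head(5)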

Unnamed: 0,class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,-1.210529,1.514341,-0.560668,0.2314,-1.166303,1.908522,0.806722,1.031908,-0.657708,1.221438,0.251009,0.361158,1.842721,1.010159
1,-1.210529,0.245597,-0.498009,-0.825667,-2.483841,0.018094,0.567048,0.731565,-0.818411,-0.543189,-0.292496,0.404908,1.110317,0.962526
2,-1.210529,0.196325,0.021172,1.106214,-0.267982,0.08811,0.806722,1.212114,-0.497005,2.129959,0.268263,0.317409,0.786369,1.391224
3,-1.210529,1.686791,-0.345835,0.486554,-0.806975,0.9283,2.484437,1.462399,-0.979113,1.029251,1.182732,-0.426341,1.180741,2.328007
4,-1.210529,0.294868,0.227053,1.835226,0.450674,1.278379,0.806722,0.661485,0.226158,0.400275,-0.318377,0.361158,0.448336,-0.037767


In [13]:
pd.DataFrame(wine_copy, columns = names).head(5)

Unnamed: 0,class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,-1.213944,1.518613,-0.56225,0.232053,-1.169593,1.913905,0.808997,1.034819,-0.659563,1.224884,0.251717,0.362177,1.84792,1.013009
1,-1.213944,0.24629,-0.499413,-0.827996,-2.490847,0.018145,0.568648,0.733629,-0.820719,-0.544721,-0.293321,0.406051,1.113449,0.965242
2,-1.213944,0.196879,0.021231,1.109334,-0.268738,0.088358,0.808997,1.215533,-0.498407,2.135968,0.26902,0.318304,0.788587,1.395148
3,-1.213944,1.69155,-0.346811,0.487926,-0.809251,0.930918,2.491446,1.466525,-0.981875,1.032155,1.186068,-0.427544,1.184071,2.334574
4,-1.213944,0.2957,0.227694,1.840403,0.451946,1.281985,0.808997,0.663351,0.226796,0.401404,-0.319276,0.362177,0.449601,-0.037874


Comparing the two standardization methods shows that the differences are minimal: every value differs by about 0.3% between the two tables.
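
The small gap between the two tables is consistent with the different standard-deviation conventions of the two libraries: `Series.std()` in pandas defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below illustrates this on a hypothetical toy column `x` (not the wine data); the results should differ only by the constant factor sqrt(n / (n - 1)).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column; any numeric data would do.
df = pd.DataFrame({"x": [1.0, 2.0, 4.0, 7.0, 11.0]})
n = len(df)

# pandas: Series.std() defaults to ddof=1 (sample standard deviation).
z_pandas = ((df["x"] - df["x"].mean()) / df["x"].std()).to_numpy()

# scikit-learn: StandardScaler divides by the population standard deviation (ddof=0).
z_sklearn = StandardScaler().fit_transform(df[["x"]]).ravel()

# The two results differ only by the constant factor sqrt(n / (n - 1)).
factor = np.sqrt(n / (n - 1))
print(np.allclose(z_sklearn, z_pandas * factor))

# Passing ddof=0 to pandas reproduces the scikit-learn values.
z_pandas_pop = ((df["x"] - df["x"].mean()) / df["x"].std(ddof=0)).to_numpy()
print(np.allclose(z_sklearn, z_pandas_pop))
```

For the 178 rows of the wine dataset the factor is sqrt(178 / 177) ≈ 1.0028, which matches the ratio between corresponding entries of the two tables (for example, -1.213944 / -1.210529 ≈ 1.0028).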

### Splitting the dataset into training and test sets

For this step, 70% of the dataset is used for training and 30% for testing. The split is done with sklearn's train_test_split.

In [14]:
train, test = train_test_split(wine, test_size = 0.30)

train.head(5)

Unnamed: 0,class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
34,-1.210529,0.627452,-0.480106,1.033313,-0.148206,0.718252,0.087701,0.501302,-0.577356,-0.088928,-0.37014,0.623658,0.363828,1.105425
162,1.37,-0.18553,0.835747,0.778159,0.750115,0.438189,-1.030776,-1.430903,1.913537,-1.102279,0.225128,-0.382591,-0.706609,-0.561731
105,0.079735,-0.7152,0.191248,-0.35181,0.750115,-0.682064,-0.982841,-0.189486,2.395645,-0.298587,-1.01717,-0.426341,0.96947,-1.371493
41,-1.210529,0.504273,1.345976,-0.898568,-0.208094,-0.682064,0.247483,0.651474,-0.738059,-0.193758,-0.335632,-0.207591,0.546929,0.914893
99,0.079735,-0.875332,0.746233,-0.570513,-0.447646,-0.822096,0.886613,0.961828,0.708266,2.129959,-1.189711,2.023657,0.30749,-1.082519


In [15]:
print(
    "Total samples: " + str(len(wine)),
    "Total training samples: " + str(len(train)),
    "Total test samples: " + str(len(test))
)


Total samples: 178 Total training samples: 124 Total test samples: 54


We can see that the split was done correctly: of the 178 samples in the original dataset, about 70% (124 samples) went to the training set, while the remaining 30% (54 samples) went to the test set.
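
The sizes 124 and 54 can be checked with a little arithmetic: 0.30 × 178 = 53.4, which `train_test_split` rounds up to 54 test samples, leaving 124 for training. The sketch below reproduces this on hypothetical placeholder data `X` with the same number of rows; `random_state=0` is an assumption added here for reproducibility and is not part of the original call.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 178   # number of rows in the wine dataset
test_size = 0.30

# Hypothetical placeholder data: only the number of rows matters here.
X = np.arange(n_samples).reshape(-1, 1)

# random_state is an assumption for reproducibility; the notebook's call omits it.
train, test = train_test_split(X, test_size=test_size, random_state=0)

# A fractional test_size is rounded up to a whole number of test samples.
expected_test = math.ceil(test_size * n_samples)
print(len(train), len(test), expected_test)
```

Without a fixed `random_state`, the rows that land in each subset change on every run, although the subset sizes stay the same.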