# Practical Assignment No. 1 - Classification and Regression Problems

Author: Francisco Ledesma

## Multiclass Classification Problem
This problem uses the Yeast dataset: "Localization of yeast proteins based on information from the amino acid sequences that make up the protein" <br>
URL: https://archive.ics.uci.edu/ml/datasets/Yeast

### Yeast Dataset
Load the dataset into `df`.

In [225]:
import pandas as pd

In [226]:
columns = ["mcg","gvh","alm","mit","erl","pox","vac","nuc"]
# Raw string so that \s is not treated as an invalid escape sequence
df = pd.read_csv("yeast.data", names=columns + ['Class'], sep=r'\s+')
df = df.drop(["erl", "pox"], axis=1)  # Drop the problematic columns...
columns = ["mcg","gvh","alm","mit","vac","nuc"]

In [227]:
class_balance = df.Class.value_counts()
for name in class_balance[4:].index:                    # Keep only the 4 most frequent classes
    df.drop(df[df['Class']==name].index,inplace=True)
df = df.reset_index(drop=True)                          # reset_index returns a new frame, so assign it back
df.Class.value_counts()


CYT    463
NUC    429
MIT    244
ME3    163
Name: Class, dtype: int64

In [228]:
from sklearn.model_selection import train_test_split

In [229]:
# Sum of three cubes that evaluates to 42
seed = (-1)*80538738812075974**3 + 80435758145817515**3 + 12602123297335631**3
train, test = train_test_split(df, test_size=0.3, stratify=df['Class'], random_state=seed)
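
As a quick sanity check on what `stratify` does, the sketch below builds a synthetic label column with the class counts reported above and confirms that a 70/30 stratified split keeps roughly the same class proportions in both partitions. The names `toy`, `toy_train`, `toy_test` and `proportions` are illustrative and not part of the notebook; `random_state=42` is used because the seed expression above evaluates to 42.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported above (features are omitted).
counts = {"CYT": 463, "NUC": 429, "MIT": 244, "ME3": 163}
toy = pd.DataFrame({"Class": [name for name, n in counts.items() for _ in range(n)]})

toy_train, toy_test = train_test_split(
    toy, test_size=0.3, stratify=toy["Class"], random_state=42
)

# Class proportions in the full set and in each partition, aligned by class name.
proportions = pd.DataFrame({
    "full": toy["Class"].value_counts(normalize=True),
    "train": toy_train["Class"].value_counts(normalize=True),
    "test": toy_test["Class"].value_counts(normalize=True),
})
print(proportions.round(3))
```

With stratification the three columns should differ only by rounding, which is why the class counts of the training set can serve as prior estimates later on.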

In [230]:
# Compute the class balance (per-class sample counts in the training set)
class_balance = train.Class.value_counts()

In [231]:
import matplotlib.pyplot as plt

def meanVector_covMatrix(class_vector, df, columns, onlyclass=None):
    "Return a dict {class_name: [mean_vector, covariance_matrix]} estimated from df"
    if onlyclass is not None:
        class_vector = onlyclass
    result = {}
    for class_name in class_vector:
        # Keep only the feature columns: the string 'Class' column must not reach cov()
        class_data = df.loc[df['Class'] == class_name, columns]
        mean_vector = class_data.mean().tolist()
        covMatrix = class_data.cov()
        result[class_name] = [mean_vector, covMatrix]
    return result

In [232]:
class_vector = train['Class'].unique()
# Compute the mean vector and covariance matrix for each class
multivariate_stats_values = meanVector_covMatrix(class_vector,train,columns)

In [233]:
from scipy.stats import multivariate_normal
import numpy as np

def evaluate_data(x, classes_stats, class_vector, balance_class):
    "Apply Bayes' rule and return the vector of unnormalized posterior scores, one per class"
    prob_vector = []
    for class_name in class_vector:
        p = multivariate_normal.pdf(x, mean=classes_stats[class_name][0], cov=classes_stats[class_name][1], allow_singular=True)
        #print('P(x|y='+class_name+').p(y='+class_name+')=',round(p*balance_class[class_name],3))
        # p(y=k|x) is proportional to p(x|y=k) * p(y=k).
        # balance_class holds class counts rather than probabilities, which leaves the argmax unchanged.
        prob_vector.append(p*balance_class[class_name])
    return prob_vector


def estimate_class(data, multivariate_stats_values, class_vector, balance_class):
    "Evaluate the input and return the name of the class with the highest posterior probability"
    prob_vector = evaluate_data(data, multivariate_stats_values, class_vector, balance_class)
    return class_vector[np.argmax(prob_vector)]


def isResultOk(data, multivariate_stats_values, class_vector, balance_class):
    "Estimate the most probable class for the sample and return True if it matches the true class"
    name = data.iloc[-1]  # the last field of the row is the class label
    data = [np.delete(data.values, -1).astype(float)]  # remaining fields are the features
    result = estimate_class(data, multivariate_stats_values, class_vector, balance_class)
    return bool(name == result)
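
For reference, the decision rule that the functions above implement can be written out explicitly. The symbols $\mu_k$, $\Sigma_k$, $N_k$ and $d$ are introduced here for illustration (class mean, class covariance, class count in the training set, and number of features); they do not appear in the notebook under those names.

$$
\hat{y}(x) = \arg\max_k \; p(x \mid y=k)\, p(y=k),
\qquad
p(x \mid y=k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}
\exp\!\left(-\tfrac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1} (x-\mu_k)\right)
$$

The code multiplies the density by the count $N_k$ instead of the prior $p(y=k) = N_k / \sum_j N_j$. Since the denominator is the same for every class, the $\arg\max$ is unaffected.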


In [234]:
def scoreModel(test, stats, balance_class):
    total = test.shape[0]
    corrects = 0
    # Count the correct estimates, i.e. the cases where the most probable class matches the true class.
    for i in range(0,total):
        if isResultOk(test.iloc[i], stats, class_vector, balance_class):
            corrects += 1
    print('Accuracy:',round(corrects/total*100,3),'%')

In [235]:
scoreModel(train,multivariate_stats_values,class_balance)

Accuracy: 61.606 %


In [236]:
scoreModel(test,multivariate_stats_values,class_balance)

Accuracy: 63.333 %


## Comparison using sklearn

In [237]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.preprocessing import LabelEncoder

In [238]:
enc = LabelEncoder()

In [239]:
y = enc.fit_transform(train.Class.values)
y_test = enc.transform(test.Class.values)

In [240]:
clf = QuadraticDiscriminantAnalysis()

In [241]:
clf.fit(train[columns].values,y)

In [242]:
acc = clf.score(train[columns].values,y)
print('Accuracy:',round(acc*100,3),'%')

Accuracy: 61.606 %


In [243]:
acc = clf.score(test[columns].values,y_test)
print('Accuracy:',round(acc*100,3),'%')

Accuracy: 63.333 %
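
The identical accuracies are expected if both approaches fit the same model: one Gaussian per class with its own covariance matrix, weighted by the class frequency. The sketch below checks this on synthetic data (not the Yeast dataset), assuming default `QuadraticDiscriminantAnalysis` settings. The names `X_toy`, `y_toy`, `scores`, `manual_pred` and `qda_pred` are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic three-class data with different spreads per class (hypothetical).
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 2.5])]
sizes = [120, 80, 60]
X_toy = np.vstack([
    rng.normal(loc=m, scale=1.0 + 0.3 * k, size=(n, 2))
    for k, (m, n) in enumerate(zip(means, sizes))
])
y_toy = np.repeat(np.arange(3), sizes)

# Manual rule: class-conditional Gaussian density times the empirical prior.
scores = np.column_stack([
    multivariate_normal.pdf(
        X_toy,
        mean=X_toy[y_toy == k].mean(axis=0),
        cov=np.cov(X_toy[y_toy == k], rowvar=False),  # unbiased estimate (ddof=1)
    ) * np.mean(y_toy == k)
    for k in range(3)
])
manual_pred = scores.argmax(axis=1)

# scikit-learn QDA with default settings (priors taken from class frequencies).
qda_pred = QuadraticDiscriminantAnalysis().fit(X_toy, y_toy).predict(X_toy)

agreement = np.mean(manual_pred == qda_pred)
print(f"Agreement between the manual rule and QDA: {agreement:.3f}")
```

If the two sets of predictions agree on essentially every sample, the hand-written classifier can be read as a from-scratch version of QDA.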


## List of pending items...

- Evaluate micro and macro F1-score metrics for the classification problem
- Evaluate at least the 3 most frequent classes (done)
- Starting from the original dataset, set up a binary classification problem (splitting the dataset in 2): the classes can be merged, or only the 2 most frequent ones kept. Then solve the new binary problem and evaluate AUC-ROC metrics
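
The pending metrics in the list above could be computed along these lines. This is a minimal sketch on synthetic data from `make_classification`, not on the Yeast dataset. It assumes a QDA model as in the comparison section, and every variable name in it is illustrative. For the binary case it simply keeps two of the synthetic classes, which is only one of the options mentioned in the list.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem with 6 features (hypothetical stand-in data).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Multiclass part: micro- and macro-averaged F1 on a held-out split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0
)
y_hat = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr).predict(X_te)
f1_micro = f1_score(y_te, y_hat, average="micro")  # pools all decisions together
f1_macro = f1_score(y_te, y_hat, average="macro")  # unweighted mean of per-class F1
acc = accuracy_score(y_te, y_hat)
print(f"micro F1: {f1_micro:.3f}  macro F1: {f1_macro:.3f}  accuracy: {acc:.3f}")

# Binary part: keep classes 0 and 1 only, then score with the probability of class 1.
mask = y_syn < 2
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(
    X_syn[mask], y_syn[mask], test_size=0.3, stratify=y_syn[mask], random_state=0
)
binary_model = QuadraticDiscriminantAnalysis().fit(Xb_tr, yb_tr)
auc = roc_auc_score(yb_te, binary_model.predict_proba(Xb_te)[:, 1])
print(f"ROC AUC (binary): {auc:.3f}")
```

In single-label multiclass problems the micro-averaged F1 coincides with accuracy, so the macro average is the more informative figure when classes are imbalanced, as they are here.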