In [1]:
import pandas as pd

# Testing

Al momento de diseñar modelos, es importante conocer la efectividad de nuestro modelo. Para esto, es necesario realizar pruebas del modelo con datos diferentes con el que fue generado, a esta generacion del modelo a partir de unos datos se le llama training, y a las pruebas testing.
Para esto es necesario saber si un modelo es bueno o malo, y tambien, si el modelo sirve para una aplicacion real, o en general para cualquier dato entrante.

## Sampling

<div>
    <br>
    <img src="sampling.png" width="500"/>
    <br>
</div>

Para esto, al momento de elegir una muestra de la poblacion (sample), es necesario que esta sea representativa de la poblacion entera, pero no nos centraremos en tecnicas de muestreo, en cambio con unos datos ya dados, nos aseguraremos de que el modelo pueda funcionar con datos fuera de la muestra, para esto se realiza el proceso de testing.

# Train - Test Split

<div>
    <br>
    <img src="test.png" width="500"/>
    <br>
</div>

Los test mas comunes solo realizan una division entre train y split, para entrenamiento y pruebas, esto en realidad es una mala practica, ya que directamente estaremos validando la efectividad de nuestro modelo en datos que deberian representar nuestra poblacion, por lo que al final nuestro modelo sera suceptible a esos datos.

Para esto un acercamiento mas adecuado, es realizar 3 divisiones, entrenamiento, validacion, y pruebas, los nuevos datos intermedios serviran para validar la efectividad de nuestro modelo, y finalmente la de pruebas o test sera el indicador real de funcionamiento.

Un problema que se nos puede llegar a generar, es como realizamos la esta division de datos de entrenamiento, validacion, y pruebas. La division recomendada se realiza de manera aleatoria, pero se puede dar el caso de que los datos de training y testing esten muy relacionados, y de un falso resultado por encima del comportamiento real, o al contrario.

De aca surge las divisiones k-fold y monte carlo cross validation

## k-fold

<div>
    <br>
    <img src="kfold.png" width="500"/>
    <br>
</div>

## Monte Carlo Cross Validation

<div>
    <br>
    <img src="monte_carlo.png" width="500"/>
    <br>
</div>

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

In [3]:
iris_x, iris_y = datasets.load_iris(return_X_y = True, as_frame = True)
iris = iris_x.assign(target=iris_y)
iris.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [22]:
wine_x, wine_y = datasets.load_wine(return_X_y = True, as_frame = True)
wine = wine_x.assign(target=wine_y)

In [5]:
from sklearn.preprocessing import LabelEncoder

students = pd.read_csv("students_scores.csv")
encoder = LabelEncoder()
students["Gender"]=encoder.fit_transform(students["Gender"])
students["EthnicGroup"]=encoder.fit_transform(students["EthnicGroup"])
students["ParentEduc"]=encoder.fit_transform(students["ParentEduc"])
students["LunchType"]=encoder.fit_transform(students["LunchType"])
students["TestPrep"]=encoder.fit_transform(students["TestPrep"])
students.head()

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,MathScore,ReadingScore,WritingScore
0,0,0,1,1,1,1,72,72,74
1,1,0,2,4,1,0,69,90,88
2,2,0,1,3,1,1,90,95,93
3,3,1,0,0,0,1,47,57,44
4,4,1,2,4,1,1,76,78,75


In [6]:
boston = pd.read_csv("boston.csv")
boston.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [7]:
diabetes_x, diabetes_y = datasets.load_diabetes(return_X_y = True, as_frame = True)
diabetes = diabetes_x.assign(target=diabetes_y)
diabetes.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0


# Support Vector Machine

<div>
    <br>
    <img src="svm.png" width="500"/>
    <br>
</div>

In [24]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [25]:
X_train, X_test, y_train, y_test = train_test_split(iris.drop("target", axis=1),
                                                    iris["target"],
                                                    test_size=0.2,
                                                    shuffle=True)

In [26]:
svm = SVC()
svm.fit(X_train, y_train)
f1_score(y_test, svm.predict(X_test), average="weighted")

0.966750208855472

In [37]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
svm = SVC()
kf = KFold(n_splits=5)
scores = cross_val_score(svm, iris.drop("target",axis=1), iris["target"], cv=kf, scoring='f1_weighted')
scores.mean()

0.9334459142508678

# Neural Networks

<div>
    <br>
    <img src="nn.png" width="500"/>
    <br>
</div>

In [28]:
from sklearn.neural_network import MLPClassifier

In [29]:
X_train, X_test, y_train, y_test = train_test_split(iris.drop("target", axis=1),
                                                    iris["target"],
                                                    test_size=0.2,
                                                    shuffle=True)

In [40]:
mlp = MLPClassifier()
mlp.fit(X_train, y_train)
f1_score(y_test, mlp.predict(X_test), average="weighted")



0.9668534080298786

In [41]:
from sklearn.model_selection import ShuffleSplit
mlp = MLPClassifier()
sp = ShuffleSplit(n_splits=5)
scores = cross_val_score(mlp, iris.drop("target",axis=1), iris["target"], cv=kf, scoring='f1_weighted')
scores.mean()



0.9182657567242071