<a href="https://colab.research.google.com/github/A00829752/TC3006C/blob/main/clasificadorSinFramework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementing a machine learning technique without a framework

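This notebook implements a machine learning model without relying on a framework such as scikit-learn. scikit-learn is imported only to split the data and to compute the accuracy score; the model itself is written with NumPy.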

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

We use the Titanic dataset, which was also used in the group challenge. It contains information about the Titanic's passengers, and the goal is to predict whether each passenger survived. The dataset is available at: https://www.kaggle.com/competitions/titanic

In [22]:
from google.colab import drive

drive.mount("/content/gdrive")
!pwd  # show current path
%cd "/content/gdrive/MyDrive/AI"
data_titanic = pd.read_csv("data_titanic.csv")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/.shortcut-targets-by-id/1FNDCkfZgBaoLUenHQpb6h3yPrqX4y7SU/AI
/content/gdrive/.shortcut-targets-by-id/1FNDCkfZgBaoLUenHQpb6h3yPrqX4y7SU/AI


In [23]:
print(data_titanic.shape)

(891, 12)


In [24]:
data_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [25]:
data_titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [26]:
data_titanic.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [27]:
data_titanic = data_titanic.drop(columns=['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'])

In [28]:
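# Encode the categorical columns as integers (assigning the result back
# instead of relying on an in-place replace on a column selection)
data_titanic['Sex'] = data_titanic['Sex'].map({'male': 0, 'female': 1})
data_titanic['Embarked'] = data_titanic['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# Coerce every column to a numeric dtype; the result must be assigned to take effect
data_titanic = data_titanic.apply(pd.to_numeric, errors='coerce')
# Drop rows that still contain missing values
data_titanic = data_titanic.dropna()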

The columns judged least relevant were dropped, the categorical variables were converted to numeric codes, and rows with null values were removed.
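As a minimal sketch of this encoding step, the snippet below applies it to a toy frame. The three rows are made up for illustration and are not taken from the dataset, and `Series.map` is used here as one possible way to do the encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, only to illustrate the encoding; not real Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "Q"],
    "Age": [22.0, 38.0, np.nan],
})

# Map each category to an integer code; missing or unmapped values become NaN.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Keep only the rows that have no missing values.
toy_clean = toy.dropna()
print(toy_clean)
```

Only the first row survives, because the second is missing `Embarked` and the third is missing `Age`. The same mechanism explains why the real dataset shrinks from 891 to 712 rows.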

In [29]:
print(data_titanic.shape)

(712, 6)


In [30]:
data_titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  712 non-null    int64  
 1   Pclass    712 non-null    int64  
 2   Sex       712 non-null    int64  
 3   Age       712 non-null    float64
 4   Fare      712 non-null    float64
 5   Embarked  712 non-null    float64
dtypes: float64(3), int64(3)
memory usage: 38.9 KB


In [31]:
data_titanic.describe()

| | Survived | Pclass | Sex | Age | Fare | Embarked |
|---|---|---|---|---|---|---|
| count | 712.0 | 712.0 | 712.0 | 712.0 | 712.0 | 712.0 |
| mean | 0.404494 | 2.240169 | 0.363764 | 29.642093 | 34.567251 | 0.261236 |
| std | 0.491139 | 0.836854 | 0.48142 | 14.492933 | 52.938648 | 0.521561 |
| min | 0.0 | 1.0 | 0.0 | 0.42 | 0.0 | 0.0 |
| 25% | 0.0 | 1.0 | 0.0 | 20.0 | 8.05 | 0.0 |
| 50% | 0.0 | 2.0 | 0.0 | 28.0 | 15.64585 | 0.0 |
| 75% | 1.0 | 3.0 | 1.0 | 38.0 | 33.0 | 0.0 |
| max | 1.0 | 3.0 | 1.0 | 80.0 | 512.3292 | 2.0 |


This leaves 712 samples and 5 features. The output variable is binary, so this is a classification problem.

We use a logistic regression model. It is a parametric model: it has a fixed number of parameters determined by the number of input features (one weight per feature plus a bias term), and it produces a categorical prediction.
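For reference, the model can be written out as follows. The notation is an assumption of this sketch rather than the notebook's own: $w$ is the weight vector, $b$ the bias, $m$ the number of samples, and $\hat{y}^{(i)}$ the predicted probability for sample $i$.

$$\hat{y}^{(i)} = \sigma\left(w^\top x^{(i)} + b\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Taking the mean binary cross-entropy as the loss,

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right],$$

its gradients are

$$\frac{\partial J}{\partial w} = \frac{1}{m} X^\top \left(\hat{y} - y\right), \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right).$$

These are the quantities that `dw` and `db` compute in the class that follows. With 5 input features, the model has 5 weights and 1 bias, for 6 parameters in total.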

In [32]:
class regLogistica:
    def __init__(self, aprendizaje, iteraciones):
        self.aprendizaje = aprendizaje
        self.iteraciones = iteraciones
        self.pesos = None
        self.sesgo = None

    def sigmoide(self, z):
        return 1 / (1 + np.exp(-z))

    def inParametros(self, numCar):
        self.pesos = np.zeros(numCar)
        self.sesgo = 0

    def fit(self, X, y):
        numMuestras, numCar = X.shape
        self.inParametros(numCar)

        for _ in range(self.iteraciones):
            y_pred = self.sigmoide(np.dot(X, self.pesos) + self.sesgo)

            # Compute the gradients
            dw = (1/numMuestras) * np.dot(X.T, (y_pred - y))
            db = (1/numMuestras) * np.sum(y_pred - y)

            # Update the parameters
            self.pesos -= self.aprendizaje * dw
            self.sesgo -= self.aprendizaje * db

    def predict(self, X):
        y_pred = self.sigmoide(np.dot(X, self.pesos) + self.sesgo)
        y_pred_class = [1 if i > 0.5 else 0 for i in y_pred]
        return y_pred_class
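To make the update inside `fit` concrete, the following sketch traces a single gradient descent iteration by hand. The toy data, the learning rate of 0.1, and the names `sigmoid`, `w`, `b`, and `lr` are assumptions chosen for illustration; they mirror the class above but are not part of it.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Hypothetical toy data: 4 samples with a single feature.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Parameters start at zero, as in inParametros.
w = np.zeros(1)
b = 0.0
lr = 0.1  # assumed learning rate for this illustration

# One iteration of gradient descent.
y_hat = sigmoid(X @ w + b)        # every prediction is 0.5 at the start
dw = X.T @ (y_hat - y) / len(y)   # gradient with respect to the weights
db = np.sum(y_hat - y) / len(y)   # gradient with respect to the bias
w -= lr * dw
b -= lr * db

print(w, b)
```

After this step the weight moves from 0 to 0.05 while the bias stays at 0, because the positive and negative errors cancel in `db` but not in `dw`.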

To evaluate the model we use accuracy, the fraction of predictions the model gets right. We compute it on both the training set and the test set and compare the two.
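The notebook relies on scikit-learn's `accuracy_score` for this, but the same quantity can be computed directly with NumPy. A minimal sketch, where the helper name `accuracy` and the example labels are hypothetical:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels, only for illustration: 3 of the 4 predictions are correct.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(score)
```

Comparing the arrays element by element yields booleans, and their mean is the proportion of matches.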

In [35]:
feature_names = ['Age', 'Sex', 'Pclass', 'Embarked', 'Fare']
X = data_titanic[feature_names]  # predictor variables
y = data_titanic['Survived']     # response variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

RL = regLogistica(aprendizaje=0.001, iteraciones=1000)
RL.fit(X_train, y_train)

y_pred = RL.predict(X_train)
print("Accuracy train:", accuracy_score(y_train, y_pred))

Accuracy train: 0.687170474516696


Predicting on the training set gives an accuracy of 68.7%.

In [36]:
y_pred = RL.predict(X_test)
print("Accuracy test:", accuracy_score(y_test, y_pred))

Accuracy test: 0.6713286713286714


On the test set, accuracy drops to 67.1%. That it falls so little is a good sign: the small gap between training and test accuracy suggests the model is not overfitting. The actual test-set labels are compared with the model's predictions below.

In [41]:
print("Survivors vs Predictions")
# Walk the true labels and the predictions in parallel
for actual, predicted in zip(y_test, y_pred):
    print("            ", actual, "|", predicted)

Survivors vs Predictions
             1 | 0
             0 | 0
             1 | 0
             1 | 1
             0 | 0
             0 | 0
             0 | 0
             1 | 1
             0 | 0
             1 | 1
             0 | 0
             1 | 0
             1 | 1
             1 | 0
             0 | 0
             0 | 0
             1 | 0
             1 | 0
             0 | 0
             0 | 0
             0 | 0
             0 | 0
             0 | 0
             0 | 0
             0 | 1
             0 | 1
             0 | 1
             0 | 1
             0 | 1
             0 | 0
             1 | 1
             1 | 0
             0 | 0
             1 | 0
             1 | 0
             1 | 1
             0 | 0
             1 | 0
             0 | 0
             0 | 0
             0 | 0
             1 | 0
             1 | 0
             0 | 0
             0 | 0
             1 | 0
             0 | 0
             1 | 1
             1 | 1
             0 | 0
             1 | 1
