In this project, we built a neural network with TensorFlow/Keras to predict whether a patient has diabetes from several health indicators. Along the way, we applied a few key techniques to improve the model's performance.

Data Preprocessing:

The Pima Indians Diabetes dataset was cleaned and standardized with StandardScaler to help the model converge.
The data was split into 80% for training and 20% for testing.
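
As a rough illustration of what standardization does, the sketch below rescales a single feature with statistics learned from the training values only and then reuses them on the test values. The numbers and the helper names `fit_scaler` and `transform` are hypothetical; the notebook itself relies on scikit-learn's StandardScaler.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # avoid dividing by zero for a constant column
    return mu, sigma

def transform(column, mu, sigma):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(x - mu) / sigma for x in column]

# Hypothetical glucose readings, not taken from the dataset
train_glucose = [85.0, 100.0, 115.0, 140.0]
test_glucose = [110.0, 160.0]

mu, sigma = fit_scaler(train_glucose)            # fit on training data only
train_scaled = transform(train_glucose, mu, sigma)
test_scaled = transform(test_glucose, mu, sigma)  # reuse the training statistics
print(train_scaled, test_scaled)
```

Fitting on the training split alone keeps information from the test split out of the preprocessing step.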

Neural Network Architecture:

The improved network has three hidden layers (Dense with ReLU activation).
Dropout layers (0.3) were added after the first two hidden layers to reduce overfitting.
The output layer uses a sigmoid activation for binary classification (0: no diabetes, 1: diabetes).
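
To illustrate what a Dropout rate of 0.3 does during training, here is a minimal pure-Python sketch of inverted dropout: each activation is zeroed with probability 0.3 and the survivors are scaled by 1 / (1 - 0.3) so the expected activation stays the same. The function name `dropout` and the constant input are assumptions made for the example; Keras handles this internally and disables dropout at inference time.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)            # fixed seed so the sketch is reproducible
activations = [1.0] * 10_000      # hypothetical layer output
dropped = dropout(activations, rate=0.3, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(f"zeroed: {zero_fraction:.3f}, mean after dropout: {mean_activation:.3f}")
```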

Training and Evaluation:

The model was trained for 50 epochs with a batch size of 10.
binary_crossentropy was used as the loss function and adam as the optimizer.
The accuracy on the test set was printed as the final result.
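
For reference, binary cross-entropy averages -(y * log(p) + (1 - y) * log(1 - p)) over the samples. The sketch below computes it by hand for four hypothetical predictions; the clipping constant `eps` is an assumption that mimics the small epsilon numerical libraries use to avoid log(0).

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels and sigmoid outputs, not taken from the notebook
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.8, 0.4, 0.6]

loss = binary_crossentropy(y_true, y_prob)
print(f"loss: {loss:.4f}")
```

Confident correct predictions contribute little to the loss, while the two predictions on the wrong side of 0.5 dominate it.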



In [22]:
# Libraries
import nltk
import string
import re
import pandas as pd
from nltk.corpus import reuters
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

In [33]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('reuters')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

# Part 1

# NLP

In [30]:
# Load the Reuters dataset from NLTK
documents = reuters.fileids()
text_data = [reuters.raw(doc_id) for doc_id in documents[:500]]  # Select the first 500 documents

df = pd.DataFrame({'text': text_data})

df.head()



                                                text
0  ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...
1  CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...
2  JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...
3  THAI TRADE DEFICIT WIDENS IN FIRST QUARTER\n ...
4  INDONESIA SEES CPO PRICE RISING SHARPLY\n Ind...


In [31]:
# Define preprocessing functions

def preprocess_text(text):
    """Clean the text by lowercasing it and removing punctuation."""
    text = text.lower()  # Convert to lowercase
    # re.escape keeps characters such as ], \ and ^ literal inside the character class
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # Remove punctuation
    return text

def lemmatize_text(text):
    """Lemmatize the text, returning an empty string for invalid input."""
    lemmatizer = WordNetLemmatizer()
    if not isinstance(text, str) or text.strip() == "":
        return ""  # Return an empty string if the input is not valid text
    tokens = word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(lemmatized_words)

def compute_tfidf(texts):
    """Compute the TF-IDF matrix for a collection of texts."""
    vectorizer = TfidfVectorizer(max_features=5000)  # Cap at 5000 features for performance
    tfidf_matrix = vectorizer.fit_transform(texts)
    return pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
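
To make the TF-IDF weighting concrete, the following is a minimal pure-Python sketch on a hypothetical three-document corpus. It assumes scikit-learn's default TfidfVectorizer settings (raw term counts, smoothed IDF ln((1 + n) / (1 + df)) + 1, L2 row normalization) and uses plain whitespace tokenization as a simplification; `tfidf` and `toy_docs` are illustrative names, not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where each row holds L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

toy_docs = ["grain price rise", "grain stock fall", "oil price rise"]  # hypothetical corpus
vocab, rows = tfidf(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(w, 3) for w in row])
```

Terms that occur in fewer documents (such as "stock") receive a higher weight than terms shared across documents (such as "grain"), which is why most entries in a large TF-IDF matrix are zero.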

In [35]:
# Apply preprocessing
df['clean_text'] = df['text'].apply(preprocess_text)
df['lemmatized_text'] = df['clean_text'].apply(lemmatize_text)

# Compute the TF-IDF matrix
tfidf_df = compute_tfidf(df['lemmatized_text'])


In [37]:
# Lemmatized texts
print("First lemmatized texts:")
print(df[['text', 'lemmatized_text']].head())


First lemmatized texts:
                                                text  \
0  ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...   
1  CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...   
2  JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...   
3  THAI TRADE DEFICIT WIDENS IN FIRST QUARTER\n  ...   
4  INDONESIA SEES CPO PRICE RISING SHARPLY\n  Ind...   

                                     lemmatized_text  
0  asian exporter fear damage from usjapan rift m...  
1  china daily say vermin eat 712 pct grain stock...  
2  japan to revise longterm energy demand downwar...  
3  thai trade deficit widens in first quarter tha...  
4  indonesia see cpo price rising sharply indones...  


In [38]:
# TF-IDF
print("\nTF-IDF matrix (first 5 rows):")
print(tfidf_df.head())


TF-IDF matrix (first 5 rows):
    03   05  050  056  0560361  063  071  07381881  087   09  ...  zayre  \
0  0.0  0.0  0.0  0.0      0.0  0.0  0.0       0.0  0.0  0.0  ...    0.0   
1  0.0  0.0  0.0  0.0      0.0  0.0  0.0       0.0  0.0  0.0  ...    0.0   
2  0.0  0.0  0.0  0.0      0.0  0.0  0.0       0.0  0.0  0.0  ...    0.0   
3  0.0  0.0  0.0  0.0      0.0  0.0  0.0       0.0  0.0  0.0  ...    0.0   
4  0.0  0.0  0.0  0.0      0.0  0.0  0.0       0.0  0.0  0.0  ...    0.0   

   zealand  zeebregts  zeeuw  zenex  zennoh  zinc  zondervan  zone  zurich  
0      0.0        0.0    0.0    0.0     0.0   0.0        0.0   0.0     0.0  
1      0.0        0.0    0.0    0.0     0.0   0.0        0.0   0.0     0.0  
2      0.0        0.0    0.0    0.0     0.0   0.0        0.0   0.0     0.0  
3      0.0        0.0    0.0    0.0     0.0   0.0        0.0   0.0     0.0  
4      0.0        0.0    0.0    0.0     0.0   0.0        0.0   0.0     0.0  

[5 rows x 5000 columns]


# Simple Neural Network

In [57]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout  # Dropout is used by the improved model in Part 2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np


In [48]:
# Load the dataset
dataset_url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
df = pd.read_csv(dataset_url, names=columns)


In [49]:
# Separate features and labels
X = df.drop(columns=["Outcome"]).values
y = df["Outcome"].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data (fit on the training set only, then apply to the test set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [50]:
# Build the neural network
model = Sequential([
    Dense(16, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')  # Binary classification
])






In [51]:
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])



In [52]:
# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=10, validation_data=(X_test, y_test))


Epoch 1/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 2s 7ms/step - accuracy: 0.3619 - loss: 0.8205 - val_accuracy: 0.4091 - val_loss: 0.7141
Epoch 2/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.4672 - loss: 0.6965 - val_accuracy: 0.6753 - val_loss: 0.6374
Epoch 3/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6523 - loss: 0.6314 - val_accuracy: 0.7532 - val_loss: 0.5771
Epoch 4/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6791 - loss: 0.5790 - val_accuracy: 0.7857 - val_loss: 0.5366
Epoch 5/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.7127 - loss: 0.5476 - val_accuracy: 0.7597 - val_loss: 0.5132
Epoch 6/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.7612 - loss: 0.4978 - val_accuracy: 0.7727 - val_loss: 0.4980
Epoch 7/50
62/62 ━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7d7b4eeb2050>

In [53]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Loss: {loss:.4f}, Accuracy: {accuracy:.4f}')

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.7003 - loss: 0.5590
Loss: 0.5603, Accuracy: 0.7403


In [58]:
# Make predictions
y_pred_probs = model.predict(X_test)  # Probabilities
y_pred = (y_pred_probs > 0.5).astype(int)  # Convert to classes 0 or 1

# Build a DataFrame with the actual ('Real') and predicted ('Predicho') values
predictions_df = pd.DataFrame({
    'Real': y_test.flatten(),
    'Predicho': y_pred.flatten()
})

# Show the first rows of the table
print(predictions_df.head(20))

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
    Real  Predicho
0      0         0
1      0         0
2      0         0
3      0         0
4      0         0
5      0         1
6      0         0
7      0         1
8      0         1
9      0         1
10     1         0
11     0         1
12     1         0
13     0         1
14     0         0
15     1         0
16     0         0
17     0         0
18     1         1
19     1         1
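
The 20 rows printed above can be tallied into confusion-matrix counts to see what kind of errors the model makes on this small slice. The sketch below copies the two columns by hand; the helper `confusion_counts` is an illustrative name, and 20 rows are too few to draw conclusions about the full test set.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# 'Real' and 'Predicho' columns of the 20 rows shown above
real = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1]

tp, tn, fp, fn = confusion_counts(real, pred)
accuracy_20 = (tp + tn) / len(real)
recall_20 = tp / (tp + fn)       # share of actual positives that were detected
precision_20 = tp / (tp + fp)    # share of predicted positives that were correct
print(tp, tn, fp, fn, accuracy_20, recall_20, precision_20)
```

Accuracy alone hides the split between false positives and false negatives, which matters for a screening task such as diabetes prediction.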


# Part 2

# Adding Layers

In [60]:
# Build the improved neural network
model = Sequential([
    Dense(32, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),  # Helps prevent overfitting
    Dense(16, activation='relu'),
    Dropout(0.3),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')  # Binary classification
])



In [61]:
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


In [62]:
# Train the model
history = model.fit(X_train, y_train, epochs=50, batch_size=10, validation_data=(X_test, y_test))



Epoch 1/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.6491 - loss: 0.6920 - val_accuracy: 0.6429 - val_loss: 0.6338
Epoch 2/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6584 - loss: 0.6223 - val_accuracy: 0.6558 - val_loss: 0.5940
Epoch 3/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6531 - loss: 0.5978 - val_accuracy: 0.6688 - val_loss: 0.5627
Epoch 4/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6802 - loss: 0.5578 - val_accuracy: 0.6818 - val_loss: 0.5442
Epoch 5/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7110 - loss: 0.5171 - val_accuracy: 0.7338 - val_loss: 0.5297
Epoch 6/50
62/62 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7502 - loss: 0.5101 - val_accuracy: 0.7662 - val_loss: 0.5167
Epoch 7/50
62/62 ━━━━━━━━━━

In [63]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Loss: {loss:.4f}, Accuracy: {accuracy:.4f}')


5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.7272 - loss: 0.5296
Loss: 0.5334, Accuracy: 0.7468


In [64]:
# Make predictions
y_pred_probs = model.predict(X_test)  # Probabilities
y_pred = (y_pred_probs > 0.5).astype(int)  # Convert to classes 0 or 1

# Build a DataFrame with the actual ('Real') and predicted ('Predicho') values
predictions_df = pd.DataFrame({
    'Real': y_test.flatten(),
    'Predicho': y_pred.flatten()
})

# Show the first rows of the table
print(predictions_df.head(20))

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
    Real  Predicho
0      0         0
1      0         0
2      0         0
3      0         0
4      0         0
5      0         1
6      0         0
7      0         1
8      0         1
9      0         1
10     1         0
11     0         1
12     1         0
13     0         1
14     0         0
15     1         1
16     0         0
17     0         0
18     1         1
19     1         1


# Conclusion

The model reached a test accuracy of about 0.75 when predicting diabetes. Adding extra layers and Dropout slightly improved accuracy (from 0.7403 to 0.7468) and lowered the loss (from 0.5603 to 0.5334), while helping to guard against overfitting. However, its performance could still be improved. Given the limitations of my computer, going further is somewhat difficult.
