<a href="https://colab.research.google.com/github/Diego-CB/DS-Proyecto/blob/main/modelo/proyecto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model
### Predicting effective arguments

Group:
- Cristian Aguirre: 20231
- Diego Córdova: 20212
- Marco Jurado: 20308
- Paola Contreras: 20213
- Paola de León: 20361

In [116]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

## Loading the Training Data

In [117]:
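# Google Drive file id of the training CSV (named file_id to avoid shadowing the built-in id)
file_id = '1kzPayZj888s0RkHlxYHGXzHwdb63fEYH'
url = 'https://drive.google.com/uc?id=' + file_id
data = pd.read_csv(url)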
data.head()

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness
0,0013cc385424,007ACE74B050,"Hi, i'm Isaac, i'm going to be writing about h...",Lead,Adequate
1,9704a709b505,007ACE74B050,"On my perspective, I think that the face is a ...",Position,Adequate
2,c22adee811b6,007ACE74B050,I think that the face is a natural landform be...,Claim,Adequate
3,a10d361e54e4,007ACE74B050,"If life was on Mars, we would know by now. The...",Evidence,Adequate
4,db3e453ec4e2,007ACE74B050,People thought that the face was formed by ali...,Counterclaim,Adequate


## Cleaning the Training Dataset

The ***discourse_id*** and ***essay_id*** columns are dropped because they are only identifiers and carry no information relevant to the model. If kept, they could introduce noise and lead to poor predictions.

In addition, an ***index*** column is added and the ***discourse_text*** column is stored in a separate variable, since it will serve as the input to the word embedding layer.

In [118]:
data.drop('discourse_id', axis=1, inplace=True)
data.drop('essay_id', axis=1, inplace=True)
data['index'] = data.index
data.head()

Unnamed: 0,discourse_text,discourse_type,discourse_effectiveness,index
0,"Hi, i'm Isaac, i'm going to be writing about h...",Lead,Adequate,0
1,"On my perspective, I think that the face is a ...",Position,Adequate,1
2,I think that the face is a natural landform be...,Claim,Adequate,2
3,"If life was on Mars, we would know by now. The...",Evidence,Adequate,3
4,People thought that the face was formed by ali...,Counterclaim,Adequate,4


In [119]:
texto_original = data['discourse_text']
data.drop('discourse_text', axis=1, inplace=True)
data.head()

Unnamed: 0,discourse_type,discourse_effectiveness,index
0,Lead,Adequate,0
1,Position,Adequate,1
2,Claim,Adequate,2
3,Evidence,Adequate,3
4,Counterclaim,Adequate,4


### Encoding the Categorical Variables
Here the ***discourse_effectiveness*** and ***discourse_type*** variables are categorical.

In [120]:
# Assign an integer code to each category, in order of first appearance
type_map = {cat:index for index, cat in enumerate(data['discourse_type'].unique())}
print('> encoding map for discourse_type', type_map)
data['discourse_type'] = [type_map[cat] for cat in data['discourse_type']]
data[['discourse_type']].head()

> encoding map for discourse_type {'Lead': 0, 'Position': 1, 'Claim': 2, 'Evidence': 3, 'Counterclaim': 4, 'Rebuttal': 5, 'Concluding Statement': 6}


Unnamed: 0,discourse_type
0,0
1,1
2,2
3,3
4,4


In [121]:
# Same scheme for the target; a separate name keeps the discourse_type map intact
effectiveness_map = {cat:index for index, cat in enumerate(data['discourse_effectiveness'].unique())}
print('> encoding map for discourse_effectiveness', effectiveness_map)
data['discourse_effectiveness'] = [effectiveness_map[cat] for cat in data['discourse_effectiveness']]
data[['discourse_effectiveness']].head()

> encoding map for discourse_effectiveness {'Adequate': 0, 'Ineffective': 1, 'Effective': 2}


Unnamed: 0,discourse_effectiveness
0,0
1,0
2,0
3,0
4,0
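
Because the integer codes depend on the order in which `unique()` encounters each label, it can help to keep an inverse mapping for turning predicted class ids back into labels. A minimal sketch, assuming the mapping printed above; the names `label_to_id`, `id_to_label`, and `decode` are hypothetical and not part of the notebook:

```python
# Mapping copied from the printed output above
label_to_id = {'Adequate': 0, 'Ineffective': 1, 'Effective': 2}

# Invert it so integer class ids can be read as labels again
id_to_label = {idx: label for label, idx in label_to_id.items()}

def decode(ids):
    """Translate a sequence of class ids back to their original labels."""
    return [id_to_label[i] for i in ids]

decoded = decode([0, 2, 1])
print(decoded)
```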


In [122]:
data.head()

Unnamed: 0,discourse_type,discourse_effectiveness,index
0,0,0,0
1,1,0,1
2,2,0,2
3,3,0,3
4,4,0,4


## Generating Text Sequences



The embedding layer will be used to perform ***word embedding*** on the arguments given as input.
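
Conceptually, an embedding layer is a lookup table with one trainable row per token id. The sketch below illustrates that lookup with NumPy only; the table values are random and the sizes (`vocab_size`, `embedding_dim`) are toy assumptions, not the notebook's actual values:

```python
import numpy as np

vocab_size = 6       # toy vocabulary size (id 0 is used for padding)
embedding_dim = 2    # each token is represented by a 2-D vector

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# One padded sequence of token ids, similar to what a tokenizer produces
token_ids = np.array([0, 0, 3, 1, 5])

# The lookup is plain row indexing: one vector per token id
embedded = embedding_table[token_ids]
print(embedded.shape)  # (5, 2)
```

In a Keras `Embedding` layer the rows of this table are weights that are learned during training rather than fixed random values.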

In [123]:
texto_original

0        Hi, i'm Isaac, i'm going to be writing about h...
1        On my perspective, I think that the face is a ...
2        I think that the face is a natural landform be...
3        If life was on Mars, we would know by now. The...
4        People thought that the face was formed by ali...
                               ...                        
36760    For many people they don't like only asking on...
36761    also people have different views and opinions ...
36762    Advice is something that can impact a persons ...
36763    someone can use everything that many people sa...
36764    In conclusion asking for an opinion can be ben...
Name: discourse_text, Length: 36765, dtype: object

In [124]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Tokenize the sentences
tokenizador = Tokenizer()
tokenizador.fit_on_texts(texto_original)
secuencias = tokenizador.texts_to_sequences(texto_original)

# Pad the sequences so they all have the same length
# (pad_sequences accepts the list of lists directly; wrapping it in np.array is unnecessary)
secuencias = pad_sequences(secuencias)

# Input and output dimensions of the embedding layer
long_vocab = len(tokenizador.word_index) + 1
dim_incrustamiento = 2  # Represent each word as a 2-D vector

pd.DataFrame(secuencias).head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,836,837,838,839,840,841,842,843,844,845
0,0,0,0,0,0,0,0,0,0,0,...,17,155,25,24,9,8,57,3,343,300
1,0,0,0,0,0,0,0,0,0,0,...,59,69,18,58,6,8,8,3,343,300
2,0,0,0,0,0,0,0,0,0,0,...,8,86,120,17,155,6,30,14,8571,536
3,0,0,0,0,0,0,0,0,0,0,...,7,39,6,508,436,105,381,3,343,300
4,0,0,0,0,0,0,0,0,0,0,...,7603,28,10,398,6,38,50,120,17,155
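
The leading zeros in the table come from `pad_sequences`, which by default pads at the start of each sequence up to the length of the longest one. A pure-Python sketch of that behavior, where `pad_left` is a hypothetical helper written only for illustration:

```python
def pad_left(sequences, value=0):
    """Left-pad every sequence with `value` to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[value] * (max_len - len(seq)) + list(seq) for seq in sequences]

toy_sequences = [[17, 155], [8, 86, 120, 17], [3]]
padded = pad_left(toy_sequences)
for row in padded:
    print(row)
```

Since the longest argument in the dataset has 846 tokens, every shorter argument is padded to that length, which is why most rows start with long runs of zeros.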


## Splitting the Data

In [125]:
y = data['discourse_effectiveness']
X = data.copy()
X.drop('discourse_effectiveness', axis=1, inplace=True)

In [126]:
from sklearn.model_selection import train_test_split

# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train

Unnamed: 0,discourse_type,index
3994,3,3994
3069,6,3069
4749,2,4749
34914,3,34914
2878,2,2878
...,...,...
16850,3,16850
6265,2,6265
11284,0,11284
860,3,860


The ***index*** column created earlier is used to split the tokenized text data.

The ***index*** column is then dropped from the "X" datasets, having served its purpose of splitting the sequences.
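
The cell that performs this split does not appear in the notebook, although `secuencias_train` and `secuencias_test` are used later on. A minimal sketch of how it could look, using toy data in place of `secuencias` and `X`; selecting rows through the ***index*** column is an assumption about the intended approach:

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 6 padded sequences and a feature frame carrying the row index
secuencias = np.arange(18).reshape(6, 3)
X = pd.DataFrame({'discourse_type': [0, 1, 2, 3, 4, 5]})
X['index'] = X.index

# Pretend train_test_split returned these shuffled row subsets
X_train = X.loc[[4, 0, 3, 5]]
X_test = X.loc[[2, 1]]

# Use the saved row positions to pick the matching tokenized sequences
secuencias_train = secuencias[X_train['index'].to_numpy()]
secuencias_test = secuencias[X_test['index'].to_numpy()]

print(secuencias_train.shape, secuencias_test.shape)
```

An alternative that avoids the helper column is to pass `secuencias` to `train_test_split` together with `X` and `y`, since the function splits all the arrays it receives consistently.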

In [127]:
# Keep only discourse_type as a 1-D array; this also discards the helper index column
X_train = X_train['discourse_type'].to_numpy()
X_test = X_test['discourse_type'].to_numpy()
X_train

array([3, 6, 2, ..., 0, 3, 6])

## Model
Layers:
1. Embedding
2. LSTM
3. Dense

In [128]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, LSTM, Dropout, GRU

modelo = Sequential()
batch_size = 32

# Word embedding layer
modelo.add(Embedding(
    input_dim=long_vocab,
    output_dim=dim_incrustamiento,
    input_length=len(secuencias[0]),
    batch_input_shape=(batch_size, len(secuencias[0]))  # Specifies the batch size and the sequence length
))

# GRU layer
modelo.add(GRU(
    2,  # Number of units in the GRU layer
    return_sequences=True,
    stateful=True,
    recurrent_initializer='glorot_uniform',
    batch_input_shape=(batch_size, len(secuencias[0]), dim_incrustamiento)
))

modelo.add(Flatten())  # Flatten the 846x2 matrix into a 1692-D vector for the Dense layer
# Final Dense layer for prediction
# NOTE: this layer has long_vocab units, although the target discourse_effectiveness has only 3 classes
modelo.add(Dense(long_vocab, activation='softmax'))

modelo.compile(optimizer = 'adam', loss ='crossentropy', metrics=['accuracy'])
modelo.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_18 (Embedding)    (32, 846, 2)              61592     
                                                                 
 gru_5 (GRU)                 (32, 846, 2)              36        
                                                                 
 flatten_13 (Flatten)        (32, 1692)                0         
                                                                 
 dense_22 (Dense)            (32, 30796)               52137628  
                                                                 
Total params: 52199256 (199.12 MB)
Trainable params: 52199256 (199.12 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [129]:
# proportionInput = Input(shape=(1,))
# proportion = Dense(1, activation='sigmoid')(proportionInput)

# reviewInput = Input(shape=(128,))
# embedding = Embedding(50000, 128, input_length=128)(reviewInput)
# lstm = LSTM(64, dropout=0.2, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l1(0.01))(embedding)
# dropout = Dropout(0.2)(lstm)
# lstm = LSTM(64, dropout=0.2, kernel_regularizer=tf.keras.regularizers.l1(0.01))(dropout)
# concat = tf.keras.layers.concatenate([lstm, proportion])
# dense = Dense(1, activation='sigmoid')(concat)

# model = tf.keras.Model(inputs=[reviewInput, proportionInput], outputs=dense)
# model.summary()

In [130]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, LSTM, Dropout, GRU
import tensorflow as tf

# First pipeline: text processing
batch_size = 32

# Word embedding layer
text_inputs = Input(shape=(len(secuencias[0]),))
text_pipeline = Embedding(
    # NOTE: input_dim is meant to be the vocabulary size (long_vocab);
    # len(secuencias[0]) is the sequence length, so larger token ids fall out of range
    input_dim=len(secuencias[0]),
    output_dim=2,
    input_length=len(secuencias[0]),  # sequence length, not the number of samples
    # batch_size=batch_size,  # Specifies the batch size and the sequence length
)(text_inputs)

# GRU layer
# text_pipeline = GRU(
#     2,  # Number of units in the GRU layer
#     return_sequences=True,
#     stateful=True,
#     recurrent_initializer='glorot_uniform',
#     # batch_size=batch_size
# )(text_pipeline)
text_pipeline = LSTM(
  64, dropout=0.2, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l1(0.01)
)(text_pipeline)

text_pipeline = Flatten()(text_pipeline)

# Second pipeline: prediction
predict_inputs = Input(shape=(1,))
predict_pipeline = Dense(long_vocab, activation='softmax')(predict_inputs)

concat = tf.keras.layers.concatenate([predict_pipeline, text_pipeline])
# One unit per effectiveness class (a single softmax unit would always output 1.0)
dense = Dense(3, activation='softmax')(concat)

model = tf.keras.Model(inputs=[text_inputs, predict_inputs], outputs=dense)
model.summary()

Model: "model_9"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_24 (InputLayer)       [(None, 846)]                0         []                            
                                                                                                  
 embedding_19 (Embedding)    (None, 846, 2)               1692      ['input_24[0][0]']            
                                                                                                  
 input_25 (InputLayer)       [(None, 1)]                  0         []                            
                                                                                                  
 lstm_13 (LSTM)              (None, 846, 64)              17152     ['embedding_19[0][0]']        
                                                                                            

In [131]:

print(len(secuencias_train), len(secuencias_train[0]))
print(X_train.shape)
print(y_train.shape)


29412 846
(29412,)
(29412,)


In [132]:
modelo.fit(
    [secuencias_train, X_train],
    y_train,
    batch_size = 32,
    epochs = 15,
    verbose = 2,
    validation_data = ([secuencias_test, X_test], y_test)
)

ValueError: ignored
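
The traceback above is truncated, so the exact cause cannot be confirmed. One plausible reason is that `fit` is called on `modelo`, the single-input Sequential model, with a two-element input list, while the two-input functional model is named `model` and was never compiled. The sketch below shows a small two-input model trained end to end on synthetic data. All sizes, the layer widths, the name `sketch_model`, and the choice of `sparse_categorical_crossentropy` for integer labels are assumptions made for illustration, not the notebook's configuration:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate

# Hypothetical small sizes so the sketch runs quickly
vocab_size = 50   # stands in for long_vocab
seq_len = 12      # stands in for len(secuencias[0])
n_types = 7       # number of discourse_type categories
n_classes = 3     # Adequate / Ineffective / Effective

# Text branch: token ids -> embeddings -> LSTM summary vector
text_inputs = Input(shape=(seq_len,), name="text")
x = Embedding(input_dim=vocab_size, output_dim=2)(text_inputs)
x = LSTM(8)(x)

# Second branch: the encoded discourse_type as a single numeric feature
type_inputs = Input(shape=(1,), name="discourse_type")
t = Dense(4, activation="relu")(type_inputs)

# Merge both branches and predict one probability per class
merged = Concatenate()([x, t])
outputs = Dense(n_classes, activation="softmax")(merged)

sketch_model = tf.keras.Model(inputs=[text_inputs, type_inputs], outputs=outputs)
sketch_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # labels are integer class ids
    metrics=["accuracy"],
)

# Synthetic data with the same structure as the notebook's inputs
rng = np.random.default_rng(0)
seqs = rng.integers(1, vocab_size, size=(64, seq_len))
types = rng.integers(0, n_types, size=(64, 1)).astype("float32")
labels = rng.integers(0, n_classes, size=(64,))

sketch_model.fit([seqs, types], labels, batch_size=32, epochs=1, verbose=0)
probs = sketch_model.predict([seqs[:4], types[:4]], verbose=0)
print(probs.shape)
```

The key points are that the model receiving the list `[sequences, types]` must itself declare two inputs, that it is compiled before `fit`, and that the output layer has one unit per class.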

In [None]:
# modelo.fit(
#     X_train,
#     y_train,
#     batch_size = 32,
#     epochs = 15,
#     verbose = 2,
#     validation_data = (X_test, y_test)
# )

In [None]:
X_train