<a href="https://colab.research.google.com/github/Diego-CB/DS-Proyecto/blob/main/modelo/proyecto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model
### Predicting effective arguments

Group:
- Cristian Aguirre: 20231
- Diego Córdova: 20212
- Marco Jurado: 20308
- Paola Contreras: 20213
- Paola de León: 20361

In [67]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

## Loading the Training Data

In [68]:
file_id = '1kzPayZj888s0RkHlxYHGXzHwdb63fEYH'  # Google Drive file id (avoids shadowing the built-in `id`)
url = 'https://drive.google.com/uc?id=' + file_id
data = pd.read_csv(url)
data.head()

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness
0,0013cc385424,007ACE74B050,"Hi, i'm Isaac, i'm going to be writing about h...",Lead,Adequate
1,9704a709b505,007ACE74B050,"On my perspective, I think that the face is a ...",Position,Adequate
2,c22adee811b6,007ACE74B050,I think that the face is a natural landform be...,Claim,Adequate
3,a10d361e54e4,007ACE74B050,"If life was on Mars, we would know by now. The...",Evidence,Adequate
4,db3e453ec4e2,007ACE74B050,People thought that the face was formed by ali...,Counterclaim,Adequate


## Cleaning the Training Dataset

The ***discourse_id*** and ***essay_id*** columns are dropped because they are only identifiers with no relevance to the model. On the contrary, they could introduce noise that leads to poor predictions.

An ***index*** column is also added, and the ***discourse_text*** column is stored in a separate variable, since it will serve as the input to the word embedding layer.

In [69]:
data.drop('discourse_id', axis=1, inplace=True)
data.drop('essay_id', axis=1, inplace=True)
data['index'] = data.index
data.head()

Unnamed: 0,discourse_text,discourse_type,discourse_effectiveness,index
0,"Hi, i'm Isaac, i'm going to be writing about h...",Lead,Adequate,0
1,"On my perspective, I think that the face is a ...",Position,Adequate,1
2,I think that the face is a natural landform be...,Claim,Adequate,2
3,"If life was on Mars, we would know by now. The...",Evidence,Adequate,3
4,People thought that the face was formed by ali...,Counterclaim,Adequate,4


A variable with the length of each text, measured in characters, is added as an input feature.

In [70]:
claim_sizes = [len(text) for text in data['discourse_text']]
data['claim_size'] = claim_sizes
data.head()

Unnamed: 0,discourse_text,discourse_type,discourse_effectiveness,index,claim_size
0,"Hi, i'm Isaac, i'm going to be writing about h...",Lead,Adequate,0,317
1,"On my perspective, I think that the face is a ...",Position,Adequate,1,210
2,I think that the face is a natural landform be...,Claim,Adequate,2,105
3,"If life was on Mars, we would know by now. The...",Evidence,Adequate,3,362
4,People thought that the face was formed by ali...,Counterclaim,Adequate,4,101
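
Should a word count be preferred over the character count stored in ***claim_size***, a minimal sketch could look like the following. The helper name `word_count` is hypothetical, and splitting on whitespace is an assumed definition of a word.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens in a text."""
    return len(text.split())

sample = "I think that the face is a natural landform"
char_len = len(sample)         # number of characters, as len(text) computes
word_len = word_count(sample)  # number of whitespace-separated words
print(char_len, word_len)
```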


In [71]:
texto_original = data['discourse_text']
data.drop('discourse_text', axis=1, inplace=True)
data.head()

Unnamed: 0,discourse_type,discourse_effectiveness,index,claim_size
0,Lead,Adequate,0,317
1,Position,Adequate,1,210
2,Claim,Adequate,2,105
3,Evidence,Adequate,3,362
4,Counterclaim,Adequate,4,101


### Encoding Categorical Variables
The ***discourse_effectiveness*** and ***discourse_type*** variables are categorical, so each one is mapped to integer codes.

In [72]:
type_map = {cat:index for index, cat in enumerate(data['discourse_type'].unique())}
print('> encoding map for discourse_type', type_map)
data['discourse_type'] = [type_map[cat] for cat in data['discourse_type']]
data[['discourse_type']].head()

> encoding map for discourse_type {'Lead': 0, 'Position': 1, 'Claim': 2, 'Evidence': 3, 'Counterclaim': 4, 'Rebuttal': 5, 'Concluding Statement': 6}


Unnamed: 0,discourse_type
0,0
1,1
2,2
3,3
4,4


In [73]:
# Use a separate name so the discourse_type map above is not overwritten
effectiveness_map = {cat:index for index, cat in enumerate(data['discourse_effectiveness'].unique())}
print('> encoding map for discourse_effectiveness', effectiveness_map)
data['discourse_effectiveness'] = [effectiveness_map[cat] for cat in data['discourse_effectiveness']]
data[['discourse_effectiveness']].head()

> encoding map for discourse_effectiveness {'Adequate': 0, 'Ineffective': 1, 'Effective': 2}


Unnamed: 0,discourse_effectiveness
0,0
1,0
2,0
3,0
4,0
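
The maps above assign integers in order of first appearance, which is the order `unique()` returns, so the codes depend on the row order of the dataset. The same idea can be sketched in plain Python with toy values; `dict.fromkeys` is used here only as a stand-in for `unique()`, and the sample list is made up.

```python
values = ['Lead', 'Position', 'Claim', 'Lead', 'Evidence', 'Claim']

# dict.fromkeys keeps the first occurrence of each value, in order
mapping = {cat: code for code, cat in enumerate(dict.fromkeys(values))}
encoded = [mapping[v] for v in values]

print(mapping)
print(encoded)
```

Because of this dependence on row order, the printed map is worth saving so the same codes can be reused at prediction time.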


In [74]:
data.head()

Unnamed: 0,discourse_type,discourse_effectiveness,index,claim_size
0,0,0,0,317
1,1,0,1,210
2,2,0,2,105
3,3,0,3,362
4,4,0,4,101


## Generating Text Sequences

The embedding layer will be used to compute ***word embeddings*** of the arguments given as input. To feed it, each text is first converted into a padded sequence of integer token ids.

In [75]:
texto_original

0        Hi, i'm Isaac, i'm going to be writing about h...
1        On my perspective, I think that the face is a ...
2        I think that the face is a natural landform be...
3        If life was on Mars, we would know by now. The...
4        People thought that the face was formed by ali...
                               ...                        
36760    For many people they don't like only asking on...
36761    also people have different views and opinions ...
36762    Advice is something that can impact a persons ...
36763    someone can use everything that many people sa...
36764    In conclusion asking for an opinion can be ben...
Name: discourse_text, Length: 36765, dtype: object

In [76]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Tokenize the texts
tokenizador = Tokenizer()
tokenizador.fit_on_texts(texto_original)
secuencias = tokenizador.texts_to_sequences(texto_original)

# Pad the sequences so they all have the same length
# (pass the list directly; wrapping ragged lists in np.array is deprecated)
secuencias = pad_sequences(secuencias)

# Input and output dimensions of the embedding layer
long_vocab = len(tokenizador.word_index) + 1  # +1 because index 0 is reserved for padding
dim_incrustamiento = 2  # Represent each word as a 2D vector

pd.DataFrame(secuencias).head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,836,837,838,839,840,841,842,843,844,845
0,0,0,0,0,0,0,0,0,0,0,...,17,155,25,24,9,8,57,3,343,300
1,0,0,0,0,0,0,0,0,0,0,...,59,69,18,58,6,8,8,3,343,300
2,0,0,0,0,0,0,0,0,0,0,...,8,86,120,17,155,6,30,14,8571,536
3,0,0,0,0,0,0,0,0,0,0,...,7,39,6,508,436,105,381,3,343,300
4,0,0,0,0,0,0,0,0,0,0,...,7603,28,10,398,6,38,50,120,17,155
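
The leading zeros in the table come from padding: by default `pad_sequences` pads at the front of each sequence up to the longest length. The idea can be sketched in plain Python. `pad_pre` is a hypothetical helper written only for illustration, not the Keras implementation, and the token ids are toy values.

```python
def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + list(seq) for seq in sequences]

toy = [[17, 155, 25], [8], [3, 343]]
for row in pad_pre(toy):
    print(row)
```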


## Data Split

In [77]:
y = data['discourse_effectiveness']
X = data.copy()
X.drop('discourse_effectiveness', axis=1, inplace=True)

In [78]:
from sklearn.model_selection import train_test_split

# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train

Unnamed: 0,discourse_type,index,claim_size
3994,3,3994,472
3069,6,3069,151
4749,2,4749,67
34914,3,34914,238
2878,2,2878,19
...,...,...,...
16850,3,16850,100
6265,2,6265,47
11284,0,11284,223
860,3,860,209


The ***index*** column created earlier is used to split the tokenized text data in the same way.

In [79]:
# Select rows by position, in the same (shuffled) order as X_train / X_test,
# so that every sequence stays aligned with its features and label
secuencias_train = secuencias[X_train['index'].values]
secuencias_test = secuencias[X_test['index'].values]
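
When selecting the sequences for each split, the rows should follow the same shuffled order as the feature matrix, so that each sequence stays paired with its features and label. A minimal sketch with toy data, where every name and value is an illustrative assumption:

```python
sequences = [[0, 5], [0, 7], [3, 9], [1, 2]]  # one padded sequence per original row
labels = ['a', 'b', 'c', 'd']                 # toy label per original row
train_rows = [2, 0, 3]                        # shuffled row positions, as a random split would give

# Index in the order given by train_rows, not in ascending order
train_sequences = [sequences[i] for i in train_rows]
train_labels = [labels[i] for i in train_rows]
print(list(zip(train_sequences, train_labels)))
```

Filtering with `enumerate` would instead return the rows in ascending order, which no longer matches a shuffled split.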

The ***index*** column is now removed from the "X" datasets, since it was only needed to split the sequences.

In [80]:
# Reassign instead of dropping in place to avoid modifying a slice of the original frame
X_train = X_train.drop(columns='index')
X_test = X_test.drop(columns='index')
X_train

Unnamed: 0,discourse_type,claim_size
3994,3,472
3069,6,151
4749,2,67
34914,3,238
2878,2,19
...,...,...
16850,3,100
6265,2,47
11284,0,223
860,3,209


Finally, they are converted to numpy arrays to be given as input to the model.

In [81]:
X_train = X_train.values
X_test = X_test.values
X_train

array([[  3, 472],
       [  6, 151],
       [  2,  67],
       ...,
       [  0, 223],
       [  3, 209],
       [  6, 339]])

## Model
Layers:
1. Embedding
2. LSTM
3. Dense

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, LSTM, Dropout, GRU
import tensorflow as tf

# First pipeline: text processing
batch_size = 32

text_secuence_size = len(secuencias[0])  # length of each padded sequence
embedding_dim = 4

# Word embedding layer
text_inputs = Input(
    shape=(text_secuence_size,)
)
text_pipeline = Embedding(
    input_dim=long_vocab,  # vocabulary size: token ids range from 0 to long_vocab - 1
    output_dim=embedding_dim,
)(text_inputs)

# GRU layer
# text_pipeline = GRU(
#     64,  # Number of units in the GRU layer
#     return_sequences=True,
#     stateful=True,
#     recurrent_initializer='glorot_uniform',
# )(text_pipeline)

text_pipeline = LSTM(
  text_secuence_size // embedding_dim, dropout=0.5, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l1(0.01),
)(text_pipeline)
text_pipeline = Dropout(rate=0.5)(text_pipeline)

text_pipeline = LSTM(
  text_secuence_size // embedding_dim, dropout=0.5, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l1(0.01),
)(text_pipeline)
text_pipeline =Dropout(rate=0.5)(text_pipeline)

text_pipeline = Flatten()(text_pipeline)

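# Second pipeline: prediction
predict_inputs = Input(shape=(2,))  # two features: discourse_type and claim_size
predict_pipeline = Dense(long_vocab, activation='softmax')(predict_inputs)

concat = tf.keras.layers.concatenate([predict_pipeline, text_pipeline])
# One output unit per effectiveness class; a softmax over a single unit would always return 1.0
dense = Dense(3, activation='softmax')(concat)

modelo = tf.keras.Model(inputs=[text_inputs, predict_inputs], outputs=dense)
# The labels are integer class ids (0, 1, 2), so a sparse categorical loss is used
modelo.compile(optimizer = 'adam', loss ='sparse_categorical_crossentropy', metrics=['accuracy'], run_eagerly=True)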
modelo.summary()
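
For reference, softmax turns a vector of scores into probabilities that sum to 1, and the predicted class is the position of the largest probability. A plain-Python sketch with made-up scores follows; the function and values are illustrative assumptions, not part of the model above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 0.2, 2.5])     # one score per class
predicted = probs.index(max(probs))  # argmax: position of the highest probability
print(probs, predicted)
print(softmax([3.7]))                # a single score always maps to [1.0]
```

With a single output unit the softmax is constant, so a problem with three effectiveness classes needs three output units.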

In [None]:
print('Number of training sequences:', len(secuencias_train))
print('Number of training rows:', len(X_train))
assert len(secuencias_train) == len(X_train)

Finally, the sequences are converted to tensors so they can be fed to the model, and the target variables ***y_test*** and ***y_train*** are converted to numpy arrays.

In [None]:
secuencias_train_list = tf.stack(secuencias_train)
secuencias_test_list = tf.stack(secuencias_test)
y_train = y_train.values
y_test = y_test.values

In [None]:
y_test

## Training the Model

In [None]:
modelo.fit(
    [secuencias_train_list, X_train],
    y_train,
    batch_size = 64,
    epochs = 5,
    verbose = 'auto',
    validation_data = ([secuencias_test_list, X_test], y_test)
)
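
The `accuracy` metric reported during training is the fraction of examples whose predicted class matches the true label. A minimal sketch with toy class ids, where `accuracy` is a hypothetical helper and not the Keras metric itself:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

print(accuracy([0, 2, 1, 0], [0, 2, 0, 0]))
```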

# Deployment Functions

Process the input claim (text)

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

def procesar_texto(texto: str):
  ''' Requires the tokenizer fitted above (`tokenizador`) '''

  # Tokenize the text; texts_to_sequences expects a list of texts,
  # a bare string would be split into single characters
  secuencia = tokenizador.texts_to_sequences([texto])

  # Pad the sequence to the same length used during training
  secuencia = pad_sequences(secuencia, maxlen=text_secuence_size)
  claim_size = len(texto)  # length in characters, as in the training data
  return secuencia, claim_size

Process the claim type

In [None]:
claim_map = {'Lead': 0, 'Position': 1, 'Claim': 2, 'Evidence': 3, 'Counterclaim': 4, 'Rebuttal': 5, 'Concluding Statement': 6}

def procesar_claim(claim: str):
  '''
  Uses the dictionary above as the "encoder"
  '''
  if claim not in claim_map:
    raise ValueError(f'argument type \'{claim}\' not accepted')

  encoded_claim = claim_map[claim]
  return encoded_claim

To be passed as input to the model, the data must be structured as follows

[
  generated sequences,
  [
    claim type,
    claim size
  ]
]

In [None]:
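# Example

texto_claim = 'asdasdasadasd'
tipo_claim = 'Counterclaim'

# Use new names so the training `secuencias` and the built-in `input` are not overwritten
secuencia_claim, size_claim = procesar_texto(texto_claim)
tipo_encoded = procesar_claim(tipo_claim)

model_input = [
    secuencia_claim,
    np.array([[tipo_encoded, size_claim]])  # shape (1, 2): a batch with one row of features
]
model_input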

This input list is passed to the model for prediction.
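
A sketch of how the model output might be turned back into a label, assuming the model returns one probability per class in the order of the effectiveness encoding printed earlier. The probabilities below are made up for illustration, and `decode_prediction` is a hypothetical helper.

```python
# Inverse of the effectiveness encoding printed earlier in the notebook
code_to_label = {0: 'Adequate', 1: 'Ineffective', 2: 'Effective'}

def decode_prediction(probabilities):
    """Return the label whose class probability is highest."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return code_to_label[best]

example_probs = [0.2, 0.1, 0.7]  # made-up values standing in for one row of model output
print(decode_prediction(example_probs))
```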