# Basic Question-Answering Model

## Loading modules

In [3]:
import warnings

# Ignore all warnings
warnings.filterwarnings('ignore')

import pandas as pd
pd.set_option('display.max_colwidth', None)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

## Loading the data

This dataset contains questions from the game show Jeopardy!. It was downloaded from Kaggle and is available at the following link: https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions?resource=download

In [9]:
%cd C:\\Users\\gabri\\Documents\\GitHub\\Prueba_DACD\\Pregunta 2

C:\Users\gabri\Documents\GitHub\Prueba_DACD\Pregunta 2


In [10]:
df = pd.read_csv("JEOPARDY_CSV.csv")

## Exploratory data analysis

In this part of the code, we take a first look at the data and apply any treatment needed before preprocessing.

In [11]:
df.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1    Air Date    216930 non-null  object
 2    Round       216930 non-null  object
 3    Category    216930 non-null  object
 4    Value       216930 non-null  object
 5    Question    216930 non-null  object
 6    Answer      216928 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


To avoid running out of memory in later stages of the model, we keep only 1,000 questions.

In [13]:
df = df.sample(n=1000)

In [14]:
df = df[[" Question"," Answer"]]

In [15]:
df.isna().sum()

 Question    0
 Answer      0
dtype: int64

In [16]:
df.head()

Unnamed: 0,Question,Answer
97082,"This hero went ""Forth upon the Gitche Gumee...with his fishing-line of cedar"" to catch a sturgeon",Hiawatha
185687,"He was sent to bring Isolde to Cornwall to marry his uncle, King Mark",Tristan
93958,The European part of Turkey lies entirely on this peninsula,Balkan Peninsula
151067,"<a href=""http://www.j-archive.com/media/2011-06-10_DJ_26.jpg"" target=""_blank"">This</a> precocious <a href=""http://www.j-archive.com/media/2011-06-10_DJ_26a.jpg"" target=""_blank"">little girl</a> first appeared in the comics in 1933 as the niece of an aunt named Fritzi",Nancy
16880,"Near nudity, not high heels & swimsuits, is on display in Cranach the Elder's <a href=""http://www.j-archive.com/media/2008-07-23_DJ_12.jpg"" target=""_blank"">painting</a> of <a href=""http://www.j-archive.com/media/2008-07-23_DJ_12a.jpg"" target=""_blank"">this</a> beauty contest",The Judgment of Paris


## Preprocessing

In this part we focus on cleaning the text so that we can feed it to our model and train it.

In [17]:
df[" Question"] = df[" Question"].str.replace(r'[^\w\s]', '', regex=True)  # Remove non-alphanumeric characters
df[" Question"] = df[" Question"].str.lower()  # Convert the text to lowercase
df[" Answer"] = df[" Answer"].fillna("")  # Replace missing answers with empty strings
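
Some of the sampled questions shown earlier contain HTML anchor tags (`<a href=...>`). Removing punctuation alone would leave fragments of those tags and their URLs in the text. The sketch below is one possible way to strip the tags before cleaning; the helper name `clean_question` and the tag pattern are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical helper: not part of the original notebook.
TAG_RE = re.compile(r"<[^>]+>")      # naive match for HTML tags such as <a href="...">
PUNCT_RE = re.compile(r"[^\w\s]")    # same pattern the notebook uses for punctuation

def clean_question(text):
    """Strip HTML tags, remove punctuation, lowercase, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)     # replace each tag with a space
    text = PUNCT_RE.sub("", text)    # drop non-alphanumeric characters
    text = text.lower()
    return " ".join(text.split())    # normalize runs of whitespace

sample = ('<a href="http://www.j-archive.com/media/2011-06-10_DJ_26.jpg" '
          'target="_blank">This</a> precocious little girl')
print(clean_question(sample))
```

If this fits the data, it could be applied to the whole column with `df[" Question"].map(clean_question)`.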

We use TfidfVectorizer to convert the text into numeric features, and we set "Answer" as the target for the model.

In [18]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df[" Question"])
y = df[" Answer"]
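
To make the TF-IDF weighting more concrete, here is a minimal pure-Python sketch of the computation. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. It uses a plain whitespace tokenizer, so its output will not match the library's exactly. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative sketch only)."""
    tokenized = [doc.split() for doc in docs]   # simplistic whitespace tokenizer
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalize; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

vecs = tfidf(["this state has sunshine", "this peninsula"])
print(vecs[1])
```

In the second document, "peninsula" gets a higher weight than "this" because "this" appears in both documents.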

## Training

We split the dataset into training and test sets so that we can train on 80% of the data and evaluate on the remaining 20%.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
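
Because every distinct answer is treated as its own class, a classifier can only ever predict answers that occur in the training split. The sketch below shows one way to measure how many test answers are reachable at all; the helper `seen_fraction` is a hypothetical name introduced here for illustration.

```python
def seen_fraction(train_labels, test_labels):
    """Fraction of test labels that also occur among the training labels."""
    test_labels = list(test_labels)
    if len(test_labels) == 0:
        return 0.0
    known = set(train_labels)   # answers the model could possibly predict
    hits = sum(1 for label in test_labels if label in known)
    return hits / len(test_labels)

# Toy example with made-up splits
print(seen_fraction(["Arizona", "Copernicus", "Arizona"], ["Arizona", "Nancy"]))
```

With the notebook's variables, this could be called as `seen_fraction(y_train, y_test)`. The result is an upper bound on the exact-match accuracy achievable on the test set.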

We train a random forest classifier (`RandomForestClassifier`) for the classification task.

In [20]:
model = RandomForestClassifier()
model.fit(X_train, y_train)

## Validation

We evaluate the model's results. To do so, we create a function that replicates the preprocessing applied earlier.

In [21]:
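import re

def predice_respuesta(preg):
    # Apply the same cleaning used on the training questions
    preg = re.sub(r'[^\w\s]', '', preg)  # Remove non-alphanumeric characters
    preg = preg.lower()                  # Convert to lowercase
    preg_vector = vectorizer.transform([preg])
    res = model.predict(preg_vector)[0]
    return res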

In [24]:
input_ = "The European part of Turkey lies entirely on this peninsula"
pred_res = predice_respuesta(input_)
print(f"Answer: {pred_res}")

Answer: Balkan Peninsula
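
The question used above also appears among the sampled rows printed earlier, so it may have been part of the training split; a correct answer there says little about how the model generalizes. A held-out measure is more informative. The sketch below computes exact-match accuracy in plain Python; the function name `exact_match_accuracy` is an assumption introduced here.

```python
def exact_match_accuracy(y_true, y_pred):
    """Share of predictions that are exactly equal to the expected answer."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        return 0.0
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

# Toy example with made-up predictions
print(exact_match_accuracy(["Arizona", "Nancy", "Tristan", "Hiawatha"],
                           ["Arizona", "Arizona", "Tristan", "Copernicus"]))
```

With the notebook's variables, this would be `exact_match_accuracy(y_test, model.predict(X_test))`; scikit-learn's `model.score(X_test, y_test)` reports the same mean accuracy.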
