# Next-word prediction with neural networks

This project builds a dataset of song lyrics, turns the words into embeddings, and feeds them to a neural network that predicts the next word from the preceding ones. The corpus covers 79 musical genres and contains 379,893 song lyrics by 4,239 artists; only the English-language lyrics are meant to be used.


Data acquisition.

In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("neisse/scrapped-lyrics-from-6-genres")

print("Path to dataset files:", path)


Path to dataset files: C:\Users\bugy1\.cache\kagglehub\datasets\neisse\scrapped-lyrics-from-6-genres\versions\3


In [2]:
import os
import pandas as pd

file_name = 'lyrics-data.csv'
complete_route = os.path.join(path, file_name)  # portable path separator
df = pd.read_csv(complete_route)
df.head()

Unnamed: 0,ALink,SName,SLink,Lyric,language
0,/ivete-sangalo/,Arerê,/ivete-sangalo/arere.html,"Tudo o que eu quero nessa vida,\nToda vida, é\...",pt
1,/ivete-sangalo/,Se Eu Não Te Amasse Tanto Assim,/ivete-sangalo/se-eu-nao-te-amasse-tanto-assim...,Meu coração\nSem direção\nVoando só por voar\n...,pt
2,/ivete-sangalo/,Céu da Boca,/ivete-sangalo/chupa-toda.html,É de babaixá!\nÉ de balacubaca!\nÉ de babaixá!...,pt
3,/ivete-sangalo/,Quando A Chuva Passar,/ivete-sangalo/quando-a-chuva-passar.html,Quando a chuva passar\n\nPra quê falar\nSe voc...,pt
4,/ivete-sangalo/,Sorte Grande,/ivete-sangalo/sorte-grande.html,A minha sorte grande foi você cair do céu\nMin...,pt


Check which languages are available in the dataset.

In [3]:
languages = df['language'].unique()
print("Available languages: ", languages)

Available languages:  ['pt' 'es' 'en' nan 'it' 'gl' 'fr' 'de' 'tl' 'et' 'fi' 'pl' 'da' 'st' 'sv'
 'ro' 'af' 'no' 'eu' 'rw' 'sw' 'ga' 'cy' 'ca' 'ny' 'ko' 'ar' 'gd' 'tr'
 'id' 'su' 'lg' 'ru' 'nl' 'sq' 'is' 'cs' 'jw' 'lv' 'hu' 'ms' 'ku' 'zh'
 'hr' 'ht' 'fa' 'mg' 'vi' 'ja' 'hmn' 'sr' 'iw' 'sl']


This project only makes predictions in English.

In [4]:
df_en = df[df['language'] == 'en']
df_en.head()

Unnamed: 0,ALink,SName,SLink,Lyric,language
69,/ivete-sangalo/,Careless Whisper,/ivete-sangalo/careless-whisper.html,I feel so unsure\nAs I take your hand and lead...,en
86,/ivete-sangalo/,Could You Be Loved / Citação Musical do Rap: S...,/ivete-sangalo/could-you-be-loved-citacao-musi...,"Don't let them fool, ya\nOr even try to school...",en
88,/ivete-sangalo/,Cruisin' (Part. Saulo),/ivete-sangalo/cruisin-part-saulo.html,"Baby, let's cruise, away from here\nDon't be c...",en
111,/ivete-sangalo/,Easy,/ivete-sangalo/easy.html,"Know it sounds funny\nBut, I just can't stand ...",en
140,/ivete-sangalo/,For Your Babies (The Voice cover),/ivete-sangalo/for-your-babies-the-voice-cover...,You've got that look again\nThe one I hoped I ...,en


The text in the Lyric column needs to be cleaned before tokenizing.

In [5]:
import re 
from nltk.tokenize import word_tokenize
import numpy as np


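def preprocessing(text):
    """Lowercase a lyric, strip non-letter characters, and return its tokens."""
    if pd.isna(text):
        return []

    text = str(text).lower()
    # Keep letters, whitespace and apostrophes; everything else becomes a space
    text = re.sub(r'[^a-zA-Z\s\']', ' ', text)
    # Collapse newlines and repeated spaces into single spaces
    text = re.sub(r'\s+', ' ', text).strip()

    # word_tokenize needs NLTK's tokenizer data to be downloaded beforehand
    tokens = word_tokenize(text)
    return tokens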

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_en = df_en.assign(tokens=df_en['Lyric'].apply(preprocessing))
df_en['tokens'].head()


69     [i, feel, so, unsure, as, i, take, your, hand,...
86     [do, n't, let, them, fool, ya, or, even, try, ...
88     [baby, let, 's, cruise, away, from, here, do, ...
111    [know, it, sounds, funny, but, i, just, ca, n'...
140    [you, 've, got, that, look, again, the, one, i...
Name: tokens, dtype: object
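
To make the effect of the two regular expressions easier to see, here is a minimal sketch that applies the same cleaning steps to one string. The helper name `clean_lyric` and the sample text are made up for illustration, and a plain whitespace split stands in for `word_tokenize`, so contractions are not split the way NLTK splits them.

```python
import re

def clean_lyric(text):
    """Lowercase, keep only letters/whitespace/apostrophes, collapse spaces."""
    text = str(text).lower()
    text = re.sub(r"[^a-zA-Z\s']", " ", text)  # digits and punctuation become spaces
    text = re.sub(r"\s+", " ", text).strip()   # newlines and repeated spaces collapse
    return text

# Hypothetical sample, not taken verbatim from the dataset
sample = "Don't let them fool, ya\nOr even try to school ya! (x2)"
cleaned = clean_lyric(sample)

print(cleaned)
# Whitespace split only; word_tokenize would additionally split "don't" into "do" and "n't"
print(cleaned.split())
```

Apostrophes survive the cleaning, which is why tokens such as `n't` and `'s` show up in the tokenized output above.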

Tokenizer setup

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_en['tokens'].apply(' '.join))

word_index = tokenizer.word_index
# Tokenizer indices start at 1 and index 0 is reserved for padding,
# so the vocabulary size is len(word_index) + 1
total_words = len(word_index) + 1

# Build one training sequence per prefix of each song: [w1, w2], [w1, w2, w3], ...
input_sequences = []
for tokens in df_en['tokens']:
    if len(tokens) > 1:
        for i in range(1, len(tokens)):
            n_gram_sequence = tokens[:i+1]
            input_sequences.append(n_gram_sequence)

max_sequence_len = max(len(seq) for seq in input_sequences)
input_sequences = tokenizer.texts_to_sequences([' '.join(seq) for seq in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

# The last id of each sequence is the target; everything before it is the context
X = input_sequences[:, :-1]
y = input_sequences[:, -1]

# One-hot encode the targets for categorical_crossentropy
y_one_hot = np.zeros((len(y), total_words), dtype=np.int8)
y_one_hot[np.arange(len(y)), y] = 1
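
The prefix-and-pad step above can be hard to picture, so the following sketch reproduces it in plain Python on a single four-word line. The `word_index` dictionary here is a hand-written toy vocabulary (an assumption for illustration, not the fitted tokenizer), and the list-based padding only imitates what `pad_sequences(..., padding='pre')` does.

```python
tokens = ["i", "feel", "so", "unsure"]
# Toy vocabulary; ids start at 1 because 0 is used for padding
word_index = {"i": 1, "feel": 2, "so": 3, "unsure": 4}

# Every prefix of length >= 2 becomes one training sample
prefixes = [tokens[:i + 1] for i in range(1, len(tokens))]
encoded = [[word_index[w] for w in seq] for seq in prefixes]

# Left-pad with zeros to a common length (the effect of padding='pre')
max_len = max(len(seq) for seq in encoded)
padded = [[0] * (max_len - len(seq)) + seq for seq in encoded]

X_toy = [row[:-1] for row in padded]  # context ids
y_toy = [row[-1] for row in padded]   # id of the word to predict

for context, target in zip(X_toy, y_toy):
    print(context, "->", target)
```

Because id 0 is taken by padding, a vocabulary with `len(word_index)` words needs `len(word_index) + 1` output classes.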

Build the neural network model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Configuration
embedding_dim = 100  # Size of each word vector
lstm_units = 150     # Units in the LSTM layer

# Build the model
model = Sequential()
model.add(Embedding(input_dim=total_words, output_dim=embedding_dim, input_length=X.shape[1]))
model.add(LSTM(lstm_units, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(total_words, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Model summary
model.summary()
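
The model is compiled with `categorical_crossentropy`, which expects one-hot targets with one column per vocabulary word. The sketch below estimates how large such a matrix gets and checks that the loss only depends on the probability of the true class, which is why Keras' `sparse_categorical_crossentropy` (integer targets) could be a lighter alternative. The sample count and vocabulary size are hypothetical round numbers, not measurements from this corpus.

```python
import math

def one_hot_bytes(num_samples, vocab_size, bytes_per_entry=1):
    """Memory of a dense one-hot matrix (1 byte per entry for int8)."""
    return num_samples * vocab_size * bytes_per_entry

def integer_label_bytes(num_samples, bytes_per_entry=4):
    """Memory of a plain vector of integer class ids (4 bytes for int32)."""
    return num_samples * bytes_per_entry

num_samples = 1_000_000  # hypothetical number of training sequences
vocab_size = 50_000      # hypothetical vocabulary size

print(one_hot_bytes(num_samples, vocab_size) / 1e9, "GB for one-hot targets")
print(integer_label_bytes(num_samples) / 1e6, "MB for integer targets")

# Cross-entropy against a one-hot target reduces to -log(p[true class])
probs = [0.1, 0.7, 0.2]  # made-up softmax output
target = 1
one_hot = [0, 1, 0]
dense_loss = -sum(t * math.log(p) for t, p in zip(one_hot, probs))
sparse_loss = -math.log(probs[target])
print(dense_loss, sparse_loss)
```

Switching losses would mean passing `y` instead of `y_one_hot` to `fit`; whether that is worthwhile depends on how many sequences the corpus actually produces.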

Train the model

In [None]:
history = model.fit(X, y_one_hot, 
                    epochs=50, 
                    batch_size=32,
                    validation_split=0.2,
                    verbose=1)
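
One detail of `validation_split` worth keeping in mind: Keras takes the validation samples from the end of the arrays, before any shuffling. Since the sequences here are generated song by song, the held-out 20% would come from the last songs only. The sketch below illustrates the effect with made-up labels (`artist_a`, `artist_b`, `artist_c` are placeholders) and shows one possible remedy, shuffling the sample order up front.

```python
import random

# Hypothetical sample labels: samples are stored artist by artist
samples = ["artist_a"] * 6 + ["artist_b"] * 2 + ["artist_c"] * 2

n_val = int(len(samples) * 0.2)  # validation_split=0.2
split = len(samples) - n_val     # validation = everything from this index on

# Without shuffling, only the last artist ends up in the validation slice
print(set(samples[split:]))

# Shuffling the order first spreads every artist across both parts
rng = random.Random(0)  # fixed seed so the shuffle is reproducible
order = list(range(len(samples)))
rng.shuffle(order)
shuffled = [samples[i] for i in order]
print(shuffled[split:])
```

With NumPy arrays the same idea would be a single permutation applied to both `X` and the targets before calling `fit`.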

In [None]:
# Function to predict the next word (simplified)
def predecir_siguiente_palabra(modelo, tokenizer, texto, num_predicciones=3):
    """
    Predict the next word for a given input text.
    Returns a list of (word, probability) pairs, most likely first.
    """
    # 1. Clean and tokenize the input text
    tokens = texto.lower().split()  # Lowercase and split into words
    tokens = [palabra for palabra in tokens if palabra in tokenizer.word_index]

    if len(tokens) == 0:
        return []  # No known words in the text

    # 2. Convert words to integer ids (using the tokenizer)
    secuencia_numeros = tokenizer.texts_to_sequences([' '.join(tokens)])

    # 3. Pad the sequence to the length the model expects
    secuencia_rellena = pad_sequences(secuencia_numeros, maxlen=X.shape[1], padding='pre')

    # 4. Run the prediction
    prediccion = modelo.predict(secuencia_rellena, verbose=0)

    # 5. Pick the most probable words
    indices_top = np.argsort(prediccion[0])[-num_predicciones:][::-1]

    resultados = []
    for indice in indices_top:
        palabra = tokenizer.index_word.get(indice, '???')
        probabilidad = prediccion[0][indice]
        resultados.append((palabra, probabilidad))

    return resultados

# Usage example
texto_ejemplo = "machine learning"
predicciones = predecir_siguiente_palabra(model, tokenizer, texto_ejemplo)

print(f"For: '{texto_ejemplo}'")
print("Possible next words:")
for palabra, prob in predicciones:
    print(f"  {palabra} ({prob:.3f})")
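
The ranking step inside the function (`np.argsort(...)[-k:][::-1]`) simply selects the `k` highest probabilities in descending order. A plain-Python sketch of the same idea follows; the probability vector and the `index_word` mapping are invented for illustration and do not come from the trained model.

```python
def top_k(probabilities, k=3):
    """Return (index, probability) pairs for the k largest entries, highest first."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i],
                    reverse=True)
    return [(i, probabilities[i]) for i in ranked[:k]]

# Made-up softmax output over a 5-entry vocabulary (index 0 is padding)
probs = [0.05, 0.40, 0.10, 0.30, 0.15]
index_word = {1: "love", 2: "you", 3: "me", 4: "baby"}  # hypothetical mapping

for idx, p in top_k(probs):
    print(index_word.get(idx, "???"), p)
```

Index 0 has no entry in `index_word`, which is the case the `'???'` fallback covers.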