## Dataset
https://www.kaggle.com/datasets/ramjasmaurya/poem-classification-nlp?select=Poem_classification+-+train_data.csv
Contains poems labeled with one of four genres: Affection, Environment, Music, and Death. The data comes pre-split into training and test sets.

## Motivation
I'm very interested in NLP and in getting models to understand human language. This time I wanted to use neural nets to see if I can accurately predict a poem's genre. I will be using a stacked LSTM, a Bidirectional LSTM, and a CNN. A plain RNN is rarely used in practice because of the vanishing gradient problem, whereas an LSTM is designed to mitigate it. CNNs are usually used for image classification, but I wanted to set one up for NLP.

In [1]:
from tensorflow import keras
from tensorflow.keras import layers, models, losses, optimizers, callbacks
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense, Flatten, Embedding, LSTM, Bidirectional,
                                     Conv1D, MaxPooling1D, GlobalMaxPooling1D)
from tensorflow.keras.preprocessing.text import Tokenizer
# to_categorical and pad_sequences are imported from keras.utils directly
# (the keras.utils.np_utils path is not available in newer Keras releases)
from tensorflow.keras.utils import to_categorical, pad_sequences


import pandas as pd
import numpy as np
import re

import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from gensim.models import Word2Vec
stopwords = stopwords.words('english')

## Loading dataset and EDA

In [2]:
train_df = pd.read_csv("Poem_classification - train_data.csv")
test_df = pd.read_csv("Poem_classification - test_data.csv")
train_df.head()

Unnamed: 0,Genre,Poem
0,Music,
1,Music,In the thick brushthey spend the...
2,Music,Storms are generous. ...
3,Music,—After Ana Mendieta Did you carry around the ...
4,Music,for Aja Sherrard at 20The portent may itself ...


In [3]:
train_df.isnull().sum()

Genre    0
Poem     4
dtype: int64

In [4]:
test_df.isnull().sum()

Genre    0
Poem     0
dtype: int64

In [5]:
train_df = train_df.dropna().reset_index(drop = True)

In [6]:
train_df.isnull().sum()

Genre    0
Poem     0
dtype: int64

In [7]:
train_df.head()

Unnamed: 0,Genre,Poem
0,Music,In the thick brushthey spend the...
1,Music,Storms are generous. ...
2,Music,—After Ana Mendieta Did you carry around the ...
3,Music,for Aja Sherrard at 20The portent may itself ...
4,Music,"for Bob Marley, Bavaria, November 1980 Here i..."


## Converting the categorical data to numerical labels
I had issues before where the neural net would not accept the genre labels as strings, so each genre is mapped to an integer.

In [24]:
labelencoder = LabelEncoder()
# Fit the encoder on the training genres only, then reuse the same mapping for the test set
train_df['label'] = labelencoder.fit_transform(train_df['Genre'])
test_df['label'] = labelencoder.transform(test_df['Genre'])

In [25]:
train_df.head()

Unnamed: 0,Genre,Poem,label,clean_poem
0,Music,In the thick brushthey spend the...,3,in the thick brushthey spend the hottest part ...
1,Music,Storms are generous. ...,3,storm are generous something so easy to surren...
2,Music,—After Ana Mendieta Did you carry around the ...,3,after ana mendieta did you carry around the ma...
3,Music,for Aja Sherrard at 20The portent may itself ...,3,for aja sherrard at 20the portent may itself b...
4,Music,"for Bob Marley, Bavaria, November 1980 Here i...",3,for bob marley bavaria november 1980 here is t...


In [26]:
test_df.head()

Unnamed: 0,Genre,Poem,label,clean_poem
0,Music,A woman walks by the bench I’m sitting onwith ...,3,a woman walk by the bench im sitting onwith he...
1,Music,"Because I am a boy, the untouchability of beau...",3,because i am a boy the untouchability of beaut...
2,Music,"Because today we did not leave this world,We n...",3,because today we did not leave this worldwe no...
3,Music,"Big Bend has been here, been here. Shouldn’t i...",3,big bend ha been here been here shouldnt it ha...
4,Music,"I put shells there, along the lip of the road....",3,i put shell there along the lip of the roadbiv...


In [27]:
train_df.Genre.value_counts()

Music          238
Death          231
Environment    227
Affection      141
Name: Genre, dtype: int64
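As a rough yardstick for the accuracies reported later, the genre counts above give the share of the largest class, which is what a model would score on the training set by always predicting that genre. This is a small sketch using only the counts printed above; the test set may be distributed differently, so treat it as an approximate baseline.

```python
# Genre counts copied from the value_counts() output above
genre_counts = {"Music": 238, "Death": 231, "Environment": 227, "Affection": 141}

total = sum(genre_counts.values())
majority_genre = max(genre_counts, key=genre_counts.get)

# Accuracy on the training set if the model always predicted the largest genre
majority_share = genre_counts[majority_genre] / total

# Expected accuracy of guessing uniformly at random among the genres
chance_level = 1 / len(genre_counts)

print(f"{total} poems, majority genre: {majority_genre}")
print(f"Always predicting {majority_genre}: {majority_share:.3f}")
print(f"Uniform random guessing: {chance_level:.3f}")
```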

In [28]:
train_df.label.value_counts()

3    238
1    231
2    227
0    141
Name: label, dtype: int64
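Matching the two sets of counts shows how the genres were numbered: `LabelEncoder` assigns integer codes in sorted order of the class names. The following is a plain-Python sketch of that behavior for illustration (not scikit-learn's implementation); the sample `genres` list is made up.

```python
# Made-up sample of genre strings, standing in for train_df['Genre']
genres = ["Music", "Death", "Environment", "Affection", "Music"]

# Sorted unique class names: the position in this list becomes the label
classes = sorted(set(genres))
genre_to_label = {genre: code for code, genre in enumerate(classes)}

labels = [genre_to_label[g] for g in genres]

print(genre_to_label)
print(labels)
```

This is why Music shows up as 3 and Affection as 0 in the `label` column.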

## Cleaning 

In [13]:
# 1. function that makes all text lowercase.
def make_lowercase(test_string):
    return test_string.lower()

# 2. function that removes all punctuation. 
def remove_punc(test_string):
    test_string = re.sub(r'[^\w\s]', '', test_string)
    return test_string

# 3. function that removes all stopwords.
def remove_stopwords(test_string):
    # Break the sentence down into a list of words
    words = word_tokenize(test_string)
    
    # Make a list to append valid words into
    valid_words = []
    
    # Loop through all the words
    for word in words:
        
        # Check if word is not in stopwords. Stopwords was imported from nltk.corpus
        if word not in stopwords:
            
            # If word not in stopwords, append to our valid_words
            valid_words.append(word)

    # Join the list of words together into a string
    a_string = ' '.join(valid_words)

    return a_string

# 4. function that reduces each word to its lemma (dictionary base form)
def lem_words(a_string):
    # Initialize our lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Break the sentence down into a list of words
    words = word_tokenize(a_string)
    
    # Make a list to append lemmatized words into
    valid_words = []

    # Loop through all the words
    for word in words:
        # Lemmatize the word (with no POS tag given, WordNet treats it as a noun)
        lemmed_word = lemmatizer.lemmatize(word)
        
        # Append lemmatized word to our valid_words
        valid_words.append(lemmed_word)
        
    # Join the list of words together into a string
    a_string = ' '.join(valid_words)

    return a_string

In [14]:
def text_processing_pipeline(a_string):
    # Note: remove_stopwords is defined above but is not applied in this pipeline
    a_string = make_lowercase(a_string)
    a_string = remove_punc(a_string)
    a_string = lem_words(a_string)
    return a_string

In [15]:
train_df["clean_poem"] = train_df["Poem"].apply(text_processing_pipeline)
test_df["clean_poem"] = test_df["Poem"].apply(text_processing_pipeline)
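To see what the first two steps of the pipeline do, here is a small standalone sketch that applies the same lowercasing and punctuation regex to a made-up sentence (lemmatization is left out so the snippet needs no NLTK data). It shows why contractions such as "I'm" end up as "im" in `clean_poem`.

```python
import re

def lowercase_and_strip_punctuation(text):
    # Same two steps as make_lowercase and remove_punc above
    text = text.lower()
    # Drop every character that is neither a word character nor whitespace
    return re.sub(r"[^\w\s]", "", text)

# Made-up sample sentence
sample = "I'm sitting on the bench, waiting."
cleaned = lowercase_and_strip_punctuation(sample)
print(cleaned)
```

Note that the regex does not insert spaces, so words that were already glued together in the raw text (such as "brushthey") stay glued.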

In [16]:
train_df.head()

Unnamed: 0,Genre,Poem,label,clean_poem
0,Music,In the thick brushthey spend the...,3,in the thick brushthey spend the hottest part ...
1,Music,Storms are generous. ...,3,storm are generous something so easy to surren...
2,Music,—After Ana Mendieta Did you carry around the ...,3,after ana mendieta did you carry around the ma...
3,Music,for Aja Sherrard at 20The portent may itself ...,3,for aja sherrard at 20the portent may itself b...
4,Music,"for Bob Marley, Bavaria, November 1980 Here i...",3,for bob marley bavaria november 1980 here is t...


## Vectorize X_train
Each poem is converted to a sequence of 500 integer word indices (padded or truncated to that length) so that the neural nets can accept it as input.

In [17]:
# Limit the tokenizer's vocab size to 10000
max_words = 10000

# Create the tokenizer
tokenizer = Tokenizer(num_words=max_words)

# Fit the tokenizer on the training poems
tokenizer.fit_on_texts(train_df["clean_poem"])

# Create the sequence for each poem, turning each word into its index
sequences = tokenizer.texts_to_sequences(train_df["clean_poem"])

index_word = tokenizer.index_word

# Limit each sequence to 500 words
max_length = 500

# Pad or truncate the sequences so they all have the same length of 500
X_train = pad_sequences(sequences, maxlen=max_length, padding='post')
print(X_train.shape)

(837, 500)
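For intuition about what the tokenizer and `pad_sequences` are doing, here is a simplified plain-Python/NumPy sketch. It is not the Keras implementation: the helper names are made up, it splits on whitespace only, and the two sample texts are invented. It does mirror three behaviors worth knowing: more frequent words get smaller indices, index 0 is reserved for padding, and sequences longer than `maxlen` lose their beginning (the Keras default is `truncating='pre'`).

```python
from collections import Counter
import numpy as np

def build_word_index(texts):
    # Most frequent word gets index 1; index 0 is reserved for padding
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank + 1 for rank, (word, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, word_index, num_words):
    # Unknown words and words with index >= num_words are dropped
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_post(sequences, maxlen):
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]               # keep the last maxlen tokens
        padded[row, :len(trimmed)] = trimmed  # zeros fill the tail
    return padded

# Invented sample texts
texts = ["the rose is red", "the violet is blue and the sky is blue"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=10000)
padded = pad_post(sequences, maxlen=6)
print(padded)
```

In the notebook, any poem longer than 500 tokens is therefore represented by its last 500 tokens only.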


## Vectorize X_test

In [18]:
# Reuse the tokenizer already fitted on the training poems; it is never fitted on the test poems

# Create the sequence for each test poem, turning each word into its index
sequences = tokenizer.texts_to_sequences(test_df["clean_poem"])

# Pad or truncate to the same length of 500 used for X_train
X_test = pad_sequences(sequences, maxlen=max_length, padding='post')
print(X_test.shape)

(150, 500)
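Because the tokenizer was created without an `oov_token`, any word in a test poem that never appeared in the training poems is dropped when the sequence is built. The sketch below estimates how many tokens that affects; the tiny vocabulary and poem are made-up stand-ins for the real data.

```python
# Made-up stand-ins for the training vocabulary and one cleaned test poem
train_vocab = {"the", "rose", "is", "red"}
test_poem = "the violet is blue".split()

# Words the tokenizer would keep versus silently drop
kept = [w for w in test_poem if w in train_vocab]
dropped = [w for w in test_poem if w not in train_vocab]

oov_rate = len(dropped) / len(test_poem)
print(f"kept={kept}, dropped={dropped}, out-of-vocabulary rate={oov_rate:.2f}")
```

With the real data, `train_vocab` could be built as `set(tokenizer.word_index)` and the rate averaged over `test_df["clean_poem"]`.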


In [19]:
y_train = to_categorical(train_df['label'])
y_test = to_categorical(test_df['label'])
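`to_categorical` turns each integer label into a one-hot row, which is the target format `categorical_crossentropy` expects. A NumPy-only sketch of the same transformation, using a few made-up labels:

```python
import numpy as np

# Made-up integer labels, one per poem
labels = np.array([3, 1, 2, 0])
num_classes = 4

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype="float32")[labels]
print(one_hot)
```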

## Stacked LSTM

In [20]:
# This creates the Neural Network
model = Sequential()

# The embedding layer learns a 100-dimensional vector for each word index during training
# (similar in spirit to word2vec vectors, but trained from scratch on this task).
model.add(Embedding(max_words, 100, input_length=max_length))
model.add(LSTM(50, return_sequences=True, dropout=0.2))
model.add(LSTM(50, return_sequences=True, dropout=0.2))
model.add(LSTM(50, dropout=0.2))
model.add(Dense(4, activation='softmax'))
optimizer = optimizers.Adam(learning_rate=0.003)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 500, 100)          1000000   
                                                                 
 lstm (LSTM)                 (None, 500, 50)           30200     
                                                                 
 lstm_1 (LSTM)               (None, 500, 50)           20200     
                                                                 
 lstm_2 (LSTM)               (None, 50)                20200     
                                                                 
 dense (Dense)               (None, 4)                 204       
                                                                 
Total params: 1,070,804
Trainable params: 1,070,804
Non-trainable params: 0
_________________________________________________________________
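The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the helper function names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    # One vector of length dim per word index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

counts = [
    embedding_params(10000, 100),  # embedding
    lstm_params(100, 50),          # lstm   (input is the 100-dim embedding)
    lstm_params(50, 50),           # lstm_1 (input is the previous LSTM's 50 units)
    lstm_params(50, 50),           # lstm_2
    dense_params(50, 4),           # dense
]
print(counts, sum(counts))
```

Almost all of the roughly 1.07 million parameters sit in the embedding layer, which is a lot to learn from 837 poems.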


In [21]:
hist = model.fit(X_train, y_train, 
                 validation_split=0.2, 
                 epochs=30, batch_size=20)

Epoch 1/30 ... Epoch 30/30 (per-epoch loss and accuracy values were not preserved in this output)


In [22]:
acc = model.evaluate(X_test, y_test, verbose=0)[1]
print('Test accuracy with stacked LSTM:', acc)

Test accuracy with stacked LSTM: 0.07999999821186066


## Bidirectional LSTM

In [29]:
model = models.Sequential()
model.add(layers.Embedding(max_words, 32, input_length=max_length))
model.add(Bidirectional(layers.LSTM(64, dropout=0.2)))
model.add(layers.Dense(4, activation='softmax'))
optimizer = optimizers.Adam(learning_rate=0.003)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()



Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 500, 32)           320000    
                                                                 
 bidirectional (Bidirectiona  (None, 128)              49664     
 l)                                                              
                                                                 
 dense_1 (Dense)             (None, 4)                 516       
                                                                 
Total params: 370,180
Trainable params: 370,180
Non-trainable params: 0
_________________________________________________________________


In [31]:
hist = model.fit(X_train, y_train, 
                 validation_split=0.2, 
                 epochs=30, batch_size=20)

Epoch 1/30 ... Epoch 30/30 (per-epoch loss and accuracy values were not preserved in this output)


In [32]:
acc = model.evaluate(X_test, y_test, verbose=0)[1]
print('Test accuracy with Bidirectional LSTM:', acc)

Test accuracy with Bidirectional LSTM: 0.2266666740179062


## CNN

In [44]:
model = Sequential()
model.add(Embedding(max_words, 32, input_length=max_length))
# Two 1D convolutions (32 filters, kernel size 7) with max pooling in between
model.add(Conv1D(32, 7, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Conv1D(32, 7, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(layers.Dense(4, activation='softmax'))

optimizer = optimizers.Adam(learning_rate=0.003)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()



Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 500, 32)           320000    
                                                                 
 conv1d_2 (Conv1D)           (None, 494, 32)           7200      
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 98, 32)           0         
 1D)                                                             
                                                                 
 conv1d_3 (Conv1D)           (None, 92, 32)            7200      
                                                                 
 global_max_pooling1d_1 (Glo  (None, 32)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_1 (Dense)             (None, 4)                
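The sequence lengths and convolution parameter counts in the summary follow from the layer settings. The sketch below reproduces them, assuming the Keras defaults used in the model ('valid' padding, stride 1 for the convolutions, pooling stride equal to the pool size); the helper names are made up.

```python
def conv1d_output_length(length, kernel_size):
    # 'valid' padding with stride 1: the kernel must fit entirely inside the sequence
    return length - kernel_size + 1

def maxpool1d_output_length(length, pool_size):
    # Non-overlapping windows; leftover steps at the end are dropped
    return length // pool_size

def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of shape (kernel_size, in_channels) plus a bias per filter
    return kernel_size * in_channels * filters + filters

after_conv1 = conv1d_output_length(500, 7)
after_pool = maxpool1d_output_length(after_conv1, 5)
after_conv2 = conv1d_output_length(after_pool, 7)

print(after_conv1, after_pool, after_conv2)
print(conv1d_params(32, 32, 7))
```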

In [45]:
hist = model.fit(X_train, y_train, 
                 validation_split=0.2, 
                 epochs=30, batch_size=20)

Epoch 1/30 ... Epoch 30/30 (per-epoch loss and accuracy values were not preserved in this output)


In [47]:
acc = model.evaluate(X_test, y_test, verbose=0)[1]

print('Test accuracy with CNN:', acc)

Test accuracy with CNN: 0.36000001430511475


In [23]:
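# Sample text to classify (passed to the tokenizer as-is, without text_processing_pipeline)
cleaned_text = 'Roses are red Voilets are blue but I know I do not love you.'
sequence = tokenizer.texts_to_sequences([cleaned_text])
padded_sequence = pad_sequences(sequence, maxlen=max_length, padding='post')
# Softmax probabilities in label order: Affection, Death, Environment, Music
model.predict(padded_sequence)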



array([[0.2120746 , 0.33572647, 0.08293388, 0.36926502]], dtype=float32)
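The output is one probability per label, in label order (0 = Affection, 1 = Death, 2 = Environment, 3 = Music). A short sketch of turning it back into a genre name, using the probabilities printed above; the `class_names` list is written out by hand here.

```python
import numpy as np

# Genre names in label order, as assigned by the label encoder
class_names = ["Affection", "Death", "Environment", "Music"]

# Probabilities copied from the prediction output above
probabilities = np.array([[0.2120746, 0.33572647, 0.08293388, 0.36926502]])

# Index of the largest probability in the first (only) row
predicted_index = int(np.argmax(probabilities, axis=1)[0])
predicted_genre = class_names[predicted_index]
print(predicted_genre, probabilities[0, predicted_index])
```

With the fitted encoder available, `labelencoder.inverse_transform([predicted_index])` gives the same name. The top two probabilities are close here, so this is not a confident prediction.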

## Conclusion
I got poor results from these neural nets. The stacked LSTM was underfitting, with both training and validation accuracy low, and its test accuracy was low as well (0.08). The Bidirectional LSTM did better on test accuracy (0.23) but overfitted during training; maybe I should add more layers and dropout. The CNN was overfitting as well but performed better than the Bidirectional LSTM (0.36). For reference, uniform random guessing over four genres would score about 0.25, so both LSTMs landed below chance.

What was weird was that each epoch took only a few seconds for all three neural nets. That is probably just a consequence of how small the dataset is (837 training poems), and slower epochs would not by themselves improve accuracy. I would have preferred more data, since neural nets are data hungry.