# Analyzing sentences, part 1

For this project I will develop an AI model that analyzes sentences and describes their sentiment.

The data comes from Twitter's "Sentiment140" dataset, which contains 1.6 million tweets tagged for sentiment analysis.

The columns are as follows:
0. The sentiment of the tweet (0 = negative, 2 = neutral, 4 = positive).
1. The ID of the tweet
2. The date the tweet was sent
3. The query (if there is one)
4. The user who sent the tweet
5. The text of the tweet

Let's load and preprocess this dataset. As we only need the sentiment and text columns, we will ignore the other columns. We will also convert the sentiment labels from 0, 2, 4 to 0, 1, 2 respectively for ease of handling.
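
As a minimal sketch of the column selection and label remapping just described, the snippet below parses a single row with the standard library only. The sample row and the helper name parse_row are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical row in the header-less, six-column layout described above.
sample = '"4","1234567890","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","some_user","I love this!"\n'

COLUMNS = ["sentiment", "id", "date", "query", "user", "text"]
LABEL_MAP = {0: 0, 2: 1, 4: 2}  # negative, neutral, positive


def parse_row(fields):
    """Keep only the remapped sentiment label and the tweet text."""
    record = dict(zip(COLUMNS, fields))
    return LABEL_MAP[int(record["sentiment"])], record["text"]


rows = [parse_row(fields) for fields in csv.reader(io.StringIO(sample))]
print(rows)  # [(2, 'I love this!')]
```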

For this dataset, the preprocessing code would look like this:

In [1]:
import pandas as pd
import re
import os
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Define column names
col_names = ['sentiment','id','date','query','user','text']

# Check if preprocessed data exists
if os.path.exists("./dataset/preprocessed_data.csv"):
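    # Load preprocessed data; the token lists were written to CSV as strings
    # such as "['good', 'day']", so parse them back into Python lists
    import ast
    data = pd.read_csv("./dataset/preprocessed_data.csv",
                       converters={'text': ast.literal_eval})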
else:
    # Load data
    data = pd.read_csv("./dataset/training.1600000.processed.noemoticon.csv", 
                       header=None, names=col_names, encoding='latin-1')

    # Replace sentiment labels
    data['sentiment'] = data['sentiment'].replace({0: 0, 2: 1, 4: 2})

    # We only need 'sentiment' and 'text' columns
    data = data[['sentiment', 'text']]

    # Data cleaning and tokenization
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    def clean_text(text):
        # Convert text to lowercase and remove punctuation
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)

        # Tokenization
        word_tokens = word_tokenize(text)

        # Remove stop words and lemmatize remaining words
        text = [lemmatizer.lemmatize(w) for w in word_tokens if w not in stop_words]

        return text

    # Apply cleaning function to each text in our dataset
    data['text'] = data['text'].apply(clean_text)

    # Save preprocessed data to a new CSV file
    data.to_csv("./dataset/preprocessed_data.csv", index=False)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\carlo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\carlo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\carlo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
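
To see what the lowercase and regular-expression steps of clean_text do to typical tweet content, here is a small sketch that needs no NLTK. It is a simplification: it splits on whitespace instead of calling word_tokenize and skips stop-word removal and lemmatization. The helper name strip_and_split and the sample tweet are assumptions made for illustration.

```python
import re


def strip_and_split(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # same pattern as in clean_text
    return text.split()


tokens = strip_and_split("@some_user I LOVE this!!! http://example.com/a")
print(tokens)
# ['some_user', 'i', 'love', 'this', 'httpexamplecoma']
```

Note that mentions lose their @ sign and URLs collapse into a single meaningless token, so it may be worth removing both before this step.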


The next step is to convert the words in our dataset into numerical vectors that our neural network can understand. There are several ways to do this, but one of the most common and effective is to use a technique called word embeddings.

Word embeddings are a way of representing words as vectors in a high-dimensional space, so that words with similar meanings are close to each other in that space. There are several ways to generate word embeddings, but one of the most common is to use a pre-trained model such as Word2Vec or GloVe.
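
The idea that similar words sit close together can be illustrated with cosine similarity. The three-dimensional vectors below are made up by hand for this sketch; real embeddings are learned from data and usually have tens or hundreds of dimensions.

```python
import math

# Made-up 3-dimensional vectors, chosen by hand for illustration only.
toy_embeddings = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "awful": [-0.7, -0.9, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])
print(sim_close, sim_far)  # the first value is near 1, the second is negative
```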

In our case, we will use the Keras Tokenizer API to encode our words as integer indices; the embedding vectors themselves are learned by the Embedding layer of our model rather than taken from a pre-trained model. First, we will fit the tokenizer on our text data so that it learns the vocabulary. Then, we will transform our sequences of words into sequences of numbers with the texts_to_sequences method. Finally, we will use the pad_sequences method to make sure that all our sequences have the same length, which is a requirement for feeding the data to our neural network.
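
The three steps can be sketched in plain Python to show what happens to the data. This is a simplified re-implementation for illustration, not the Keras code itself; the function names fit_vocabulary, to_sequences and pad are hypothetical. It assumes, as Keras does by default, that indices start at 1 in order of descending frequency and that shorter sequences are padded with zeros on the left.

```python
from collections import Counter


def fit_vocabulary(token_lists):
    """Assign integer indices starting at 1, most frequent word first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Stable sort: words with equal counts keep their first-seen order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: idx for idx, word in enumerate(ordered, start=1)}


def to_sequences(token_lists, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[t] for t in tokens if t in word_index]
            for tokens in token_lists]


def pad(sequences, pad_value=0):
    """Left-pad every sequence with zeros up to the longest length."""
    max_len = max(len(s) for s in sequences)
    return [[pad_value] * (max_len - len(s)) + s for s in sequences]


docs = [["good", "day"], ["good", "movie", "great", "cast"], ["bad", "day"]]
word_index = fit_vocabulary(docs)
padded = pad(to_sequences(docs, word_index))
print(word_index)
print(padded)  # [[0, 0, 1, 2], [1, 3, 4, 5], [0, 0, 6, 2]]
```

Because index 0 is reserved for padding, the vocabulary size passed to an embedding layer is the number of words plus one.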

In [2]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Instantiate a Tokenizer
tokenizer = Tokenizer()

# Fit the Tokenizer to our text data
tokenizer.fit_on_texts(data['text'])

# Transform words into sequences of numbers
sequences = tokenizer.texts_to_sequences(data['text'])

# Pad all sequences to have the same length
padded_sequences = pad_sequences(sequences)

For sentiment analysis, a commonly used network architecture is the recurrent neural network (RNN), and more specifically, the Long Short-Term Memory (LSTM) variant. LSTMs are useful because they are able to learn long-term dependencies, which is important in text analysis where the meaning of a word may depend on the words preceding it.
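
To make the long-term memory idea concrete, here is a sketch of a single LSTM cell step using the standard gate equations, reduced to scalars so it runs with the standard library only. The weight values are arbitrary assumptions, and a real layer such as the one used below works on vectors and learns its weights during training.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell."""
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old, add some new
    h = o * math.tanh(c)     # hidden state exposed to the next layer
    return h, c


# Arbitrary weights, for illustration only.
weights = {k: 0.5 for k in ("wi", "ui", "wf", "uf", "wo", "uo", "wg", "ug")}
weights.update({"bi": 0.0, "bf": 1.0, "bo": 0.0, "bg": 0.0})

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:  # a toy input sequence
    h, c = lstm_step(x, h, c, weights)
    print(round(h, 4), round(c, 4))
```

The cell state c is carried from step to step, and the forget gate f decides how much of it survives, which is what lets information from early words influence later ones.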

In [3]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Vocabulary size (add 1 because Tokenizer indices start at 1)
vocab_size = len(tokenizer.word_index) + 1

# Length of the input sequences
input_length = len(padded_sequences[0])

# Dimension of the embedding space
embedding_dim = 50

# Create the model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=input_length))
model.add(LSTM(units=32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))  # Three units for three classes


# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [5]:
from keras.models import load_model
from keras.callbacks import ModelCheckpoint
from keras.utils import to_categorical
# Define where to save or load the model
filepath = "./best_model.hdf5"

# If the checkpoint file exists, load the model from the checkpoint
if os.path.exists(filepath):
    model = load_model(filepath)

# Create a callback that saves the full model every 1000 batches
# (monitor and mode only take effect when save_best_only=True)
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, mode='min', save_freq=1000)



# Convert labels to one-hot encoding
labels = to_categorical(data['sentiment'])

# Train the neural network
model.fit(padded_sequences, labels, epochs=10, validation_split=0.2, callbacks=[checkpoint])



Epoch 1/10
  999/40000 [..............................] - ETA: 2:59:40 - loss: 0.4396 - accuracy: 0.7933
Epoch 1: saving model to .\best_model.hdf5


 1999/40000 [>.............................] - ETA: 2:55:26 - loss: 0.4389 - accuracy: 0.7935
Epoch 1: saving model to .\best_model.hdf5
 2999/40000 [=>............................] - ETA: 2:50:47 - loss: 0.4378 - accuracy: 0.7941
Epoch 1: saving model to .\best_model.hdf5
 3999/40000 [=>............................] - ETA: 2:46:10 - loss: 0.4380 - accuracy: 0.7939
Epoch 1: saving model to .\best_model.hdf5
 4999/40000 [==>...........................] - ETA: 2:41:32 - loss: 0.4372 - accuracy: 0.7946
Epoch 1: saving model to .\best_model.hdf5
 5999/40000 [===>..........................] - ETA: 2:36:56 - loss: 0.4352 - accuracy: 0.7961
Epoch 1: saving model to .\best_model.hdf5
 6999/40000 [====>.........................] - ETA: 2:32:54 - loss: 0.4337 - accuracy: 0.7970
Epoch 1: saving model to .\best_model.hdf5
 7615/40000 [====>.........................] - ETA: 2:32:10 - loss: 0.4331 - accuracy: 0.7978

 7633/40000 [====>.........................] - ETA: 2:32:09 - loss: 0.4331 - accuracy: 0.7978

In [8]:
from keras.models import load_model
from keras.utils import to_categorical
filepath = "./best_model.hdf5"

if os.path.exists(filepath):
    model = load_model(filepath)
# Convert labels to one-hot encoding
labels = to_categorical(data['sentiment'])
# Evaluate the model on the full dataset (this includes the training examples)
loss, accuracy = model.evaluate(padded_sequences, labels)

# Print the loss and accuracy
print('Loss:', loss)
print('Accuracy:', accuracy)

NameError: name 'to_categorical' is not defined