# Twitter Sentiment Classification
**This notebook trains a sentiment classification model on the Sentiment140 dataset of 1.6 million (16 lakh) tweets.**

In [1]:
!pip install contractions

Collecting contractions
  Obtaining dependency information for contractions from https://files.pythonhosted.org/packages/bb/e4/725241b788963b460ce0118bfd5c505dd3d1bdd020ee740f9f39044ed4a7/contractions-0.1.73-py2.py3-none-any.whl.metadata
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Obtaining dependency information for textsearch>=0.0.21 from https://files.pythonhosted.org/packages/e2/0f/6f08dd89e9d71380a369b1f5b6c97a32d62fc9cfacc1c5b8329505b9e495/textsearch-0.0.24-py2.py3-none-any.whl.metadata
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Obtaining dependency information for anyascii from https://files.pythonhosted.org/packages/4f/7b/a9a747e0632271d855da379532b05a62c58e979813814a57fa3b3afeb3a4/anyascii-0.3.2-py3-none-any.whl.metadata
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick 

The **contractions** library expands contractions in text. Contractions are shortened forms of words or phrases that omit one or more letters, such as "**don't**" for "**do not**" or "**can't**" for "**cannot**".
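As a rough illustration of what contraction expansion does, the sketch below uses a tiny hand-written mapping. `SAMPLE_CONTRACTIONS` and `expand_sample_contractions` are hypothetical names and are not part of the `contractions` package; the notebook itself relies on `contractions.fix`, which ships a much larger dictionary.

```python
import re

# Hypothetical, deliberately tiny mapping used only for illustration.
SAMPLE_CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

# One alternation that matches any key as a whole word, case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in SAMPLE_CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_sample_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: SAMPLE_CONTRACTIONS[m.group(0).lower()], text)

print(expand_sample_contractions("i can't believe it's not working, don't ask"))
```

A lookup table like this only handles the forms it lists, which is why a maintained dictionary is the more practical choice for real tweets.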

In [2]:
pip install --upgrade nltk

Collecting nltk
  Obtaining dependency information for nltk from https://files.pythonhosted.org/packages/a6/0a/0d20d2c0f16be91b9fa32a77b76c60f9baf6eba419e5ef5deca17af9c582/nltk-3.8.1-py3-none-any.whl.metadata
  Downloading nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 22.4 MB/s eta 0:00:00
Installing collected packages: nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.2.4
    Uninstalling nltk-3.2.4:
      Successfully uninstalled nltk-3.2.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
preprocessing 0.1.13 requires nltk==3.2.4, but you have nltk 3.8.1 which is incompatible.
Successfully installed nltk-3.8.1
Note: you may need to restart the kernel to 

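This upgrades the **NLTK library** to the latest version available on the **Python Package Index (PyPI)**. The log above also reports a dependency conflict: the installed `preprocessing 0.1.13` package pins `nltk==3.2.4`.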

In [3]:
import numpy as np
import pandas as pd
import contractions
import tensorflow as tf
import os #it provides a portable way to interact with the operating system

**numpy (as np):** NumPy is a Python library for numerical computing. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.

**pandas (as pd):** Pandas is a widely-used library for data manipulation and analysis in Python. It provides high-performance, easy-to-use data structures (e.g., DataFrame) and data analysis tools for working with structured/tabular data.

**contractions:** As mentioned earlier, this library is used to expand contractions in text data, making it easier to preprocess text for natural language processing tasks.

**tensorflow (as tf):** TensorFlow is an open-source machine learning framework developed by Google. It provides tools for building and training various machine learning models, including deep learning models, for tasks such as classification, regression, clustering, and more.

**os:** The os module provides a portable way to interact with the operating system. It offers functions for accessing the filesystem, working with file paths, manipulating environment variables, and executing system commands.

These libraries are commonly used in data science, machine learning, and natural language processing projects to perform various tasks efficiently.

In [4]:
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import cpu_count
import re #Regular expressions are used for pattern matching and text manipulation tasks.
from spacy.lang.en.stop_words import STOP_WORDS #Stop words are common words (e.g., "the", "is", "and") that are often removed from text during natural language processing tasks because they typically don't carry significant meaning
from nltk import word_tokenize
import nltk
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn

**concurrent.futures.ThreadPoolExecutor:** This class from the `concurrent.futures` module provides a high-level interface for executing callables asynchronously in a pool of worker threads. Because of CPython's global interpreter lock, threads mainly help with I/O-bound work; CPU-bound pure-Python functions gain little from them.

**multiprocessing.cpu_count:** This function returns the number of CPU cores available on the current system. It's commonly used for determining the number of processes or threads to utilize for parallel processing tasks.

**re:** The re module provides support for working with regular expressions in Python. Regular expressions are used for pattern matching and text manipulation tasks.

**spacy.lang.en.stop_words.STOP_WORDS:** This is a set of stop words provided by the spaCy library for the English language. Stop words are common words (e.g., "the", "is", "and") that are often removed from text during natural language processing tasks because they typically don't carry significant meaning.

**nltk.word_tokenize:** This function from the NLTK library is used to tokenize text into individual words or tokens.

**nltk:** The NLTK (Natural Language Toolkit) library is a comprehensive toolkit for natural language processing tasks in Python. It provides tools and resources for tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, and more.

**nltk.corpus.stopwords:** This module in NLTK provides a list of stop words for various languages. Stop words are commonly removed from text during preprocessing to reduce noise and improve the efficiency of text processing tasks.

**nltk.pos_tag:** This function is used for part-of-speech tagging in NLTK. It assigns a part-of-speech tag (e.g., noun, verb, adjective) to each word in a text.

**nltk.stem.WordNetLemmatizer:** This class is used for lemmatization in NLTK. Lemmatization is the process of reducing words to their base or canonical form (i.e., lemma).

**nltk.corpus.wordnet:** This module in NLTK provides access to WordNet, a lexical database of the English language. WordNet is used for tasks such as synonym detection, semantic similarity, and word sense disambiguation.

These imports indicate that you're setting up a text processing pipeline using various libraries and tools for tasks such as tokenization, stop word removal, part-of-speech tagging, and lemmatization.
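The thread-pool pattern used throughout this notebook can be sketched in isolation. The example below is a minimal illustration with hypothetical sample strings (`sample_tweets` is not taken from the dataset); it shows that `pool.map` returns results in input order, which is what makes it safe to assign the results straight back to a DataFrame column.

```python
from concurrent.futures import ThreadPoolExecutor

def normalize(text):
    """Lowercase one piece of text and trim surrounding whitespace."""
    return text.strip().lower()

# Hypothetical sample inputs, not taken from the dataset.
sample_tweets = ["  Hello World ", "GOOD Morning", "Not Today  "]

# Executor.map returns results in the same order as the inputs,
# regardless of which worker thread finishes first.
with ThreadPoolExecutor(max_workers=4) as pool:
    normalized = list(pool.map(normalize, sample_tweets))

print(normalized)
```

For CPU-bound steps such as regular-expression substitution, a process pool or a vectorized pandas string method may be faster than threads; that is a design trade-off rather than something the notebook measures.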

In [5]:
from keras.layers import TextVectorization
from keras.layers import Input,Embedding,LSTM,GRU,Dense
from keras.models import Sequential
import pickle # used for saving and loading trained machine learning models, including deep learning models.
import locale #handle various locale-specific information, such as formatting numbers, dates, and currency values according to the user's locale.

**keras.layers.TextVectorization:** This layer is used for vectorizing a batch of strings into either integer indices or a dense representation (e.g., one-hot encoding) using a vocabulary of tokens.

**Input:** This layer is used to instantiate a Keras tensor.

**Embedding:** This layer is used for word embeddings, which map words or tokens to dense vectors of real numbers.

**LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit):** These layers are used for implementing recurrent neural networks (RNNs), which are commonly used for sequence modeling tasks.

**Dense:** This layer implements a densely connected neural network layer, where each neuron is connected to every neuron in the previous layer.

**keras.models.Sequential:** This is a linear stack of layers in Keras. You can create a Sequential model by passing a list of layer instances to the constructor.

**pickle:** This module in Python is used for serializing and deserializing Python objects. It's often used for saving and loading trained machine learning models, including deep learning models.

**locale:** This module provides a way to handle various locale-specific information, such as formatting numbers, dates, and currency values according to the user's locale.
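To make the idea of turning strings into integer indices concrete, here is a pure-Python sketch of the steps a text-vectorization layer performs: build a frequency-ranked vocabulary, map tokens to indices, and pad to a fixed length. It is an illustration under stated assumptions, not the Keras implementation; `build_vocabulary` and `vectorize` are hypothetical helpers. It follows the convention of reserving index 0 for padding and index 1 for out-of-vocabulary tokens.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Most frequent tokens first; index 0 is padding and 1 is out-of-vocabulary."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index for index, token in enumerate(most_common, start=2)}

def vectorize(text, vocabulary, sequence_length):
    """Map tokens to indices, then truncate or right-pad with 0 to a fixed length."""
    indices = [vocabulary.get(token, 1) for token in text.split()]
    indices = indices[:sequence_length]
    return indices + [0] * (sequence_length - len(indices))

corpus = ["good day good mood", "bad day"]  # hypothetical toy corpus
vocab = build_vocabulary(corpus, max_tokens=10)
print(vectorize("good day unknown", vocab, sequence_length=5))
```

The fixed output length is what lets a batch of tweets be stacked into a single rectangular array for the embedding layer.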

In [6]:
locale.getpreferredencoding = lambda:"UTF-8"

**UTF-8** is a widely used character encoding that covers characters from virtually all languages and is compatible with most modern systems and applications. The cell above overrides `locale.getpreferredencoding` so that it always reports UTF-8.

In [7]:
nltk.download("punkt") #tokenization

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
nltk.download("stopwords") #stop-word lists for several languages, e.g. "the", "is", "and" for English

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
nltk.download("averaged_perceptron_tagger") #for pos_tag

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [10]:
nltk.download("wordnet")


[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [11]:
cpu_count()

4

In [12]:
data = pd.read_csv("/kaggle/input/sentiment140/training.1600000.processed.noemoticon.csv",encoding="latin-1",header = None)

#header=None: the file has no header row, so pandas numbers the columns 0 to 5 instead of reading names from the first line

In [13]:
data.shape

(1600000, 6)

In [14]:
data

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
| ... | ... | ... | ... | ... | ... | ... |
| 1599995 | 4 | 2193601966 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | AmandaMarie1028 | Just woke up. Having no school is the best fee... |
| 1599996 | 4 | 2193601969 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | TheWDBoards | TheWDB.com - Very cool to hear old Walt interv... |
| 1599997 | 4 | 2193601991 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | bpbabe | Are you ready for your MoJo Makeover? Ask me f... |
| 1599998 | 4 | 2193602064 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | tinydiamondz | Happy 38th Birthday to my boo of alll time!!! ... |


In [15]:
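data.head #without parentheses this displays the bound method (and with it the whole frame); call data.head() to get only the first five rows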

<bound method NDFrame.head of          0           1                             2         3  \
0        0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY   
1        0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY   
2        0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY   
3        0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
4        0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
...     ..         ...                           ...       ...   
1599995  4  2193601966  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599996  4  2193601969  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599997  4  2193601991  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599998  4  2193602064  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599999  4  2193602129  Tue Jun 16 08:40:50 PDT 2009  NO_QUERY   

                       4                                                  5  
0        _TheSpecialOne_  @switchfoot http://twitpic.com/2y1zl - Awww, t...  
1          scotthamil

In [16]:
data.rename(columns={0:"target",1:"ids",2:"date",3:"flag",4:"user",5:"text"},
            inplace=True)

In [17]:
data.head()

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [18]:
data["text"].iloc[0]

"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

In [19]:
reduced_data = data.drop(labels=data.columns[1:5],axis=1)

In [20]:
reduced_data.head()

| | target | text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


In [21]:
#data.rename(columns = {0:"Target",1:"tweet_text"},inplace = True)

In [22]:
def normalize_tweet(tweet_text):
    return tweet_text.lower() #converts the tweet to lowercase

In [23]:
with ThreadPoolExecutor(max_workers=4) as pool: #4 worker threads; these are threads, not processes, so CPU-bound Python code is still serialized by the GIL
    reduced_data["text"] = list(pool.map(normalize_tweet,list(reduced_data["text"])))

In [24]:
reduced_data["text"].head()

0    @switchfoot http://twitpic.com/2y1zl - awww, t...
1    is upset that he can't update his facebook by ...
2    @kenichan i dived many times for the ball. man...
3      my whole body feels itchy and like its on fire 
4    @nationwideclass no, it's not behaving at all....
Name: text, dtype: object

In [25]:
# def normalize_text(tweet_text):
#     return contractions.fix(tweet_text)

In [26]:
def expand_contractions(tweet_text):
    return contractions.fix(tweet_text)

In [27]:
with ThreadPoolExecutor(max_workers=4) as pool:
    reduced_data["text"] = list(pool.map(expand_contractions, list(reduced_data["text"])))

In [28]:
# with ThreadPoolExecutor(max_workers = 4) as pool:
#     reduced_data["text"] = list(pool.map(expand_contractions,list(reduced_data["text"])))

In [29]:
reduced_data["text"].head()

0    @switchfoot http://twitpic.com/2y1zl - awww, t...
1    is upset that he cannot update his facebook by...
2    @kenichan i dived many times for the ball. man...
3      my whole body feels itchy and like its on fire 
4    @nationwideclass no, it is not behaving at all...
Name: text, dtype: object

In [30]:
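#note: the range A-z also matches the characters [ \ ] ^ _ and the backtick, and the space inside the
#mention and hashtag classes lets a match run on through the following words until the next punctuation mark
regex_pattern = r'@[a-zA-z0-9 ]+|#[a-zA-Z0-9 ]+|\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*|\W+|\d+|<("[^"]*"|\'[^\']*\'|[^\'">])*>|_+|[^\u0000-\u007f]+'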

In [31]:
def remove_noisy_tokens(tweet_text):
    return re.sub(pattern=regex_pattern,repl=" ",string=tweet_text)

In [32]:
with ThreadPoolExecutor(max_workers=4) as pool:
    reduced_data["text"] = list(pool.map(remove_noisy_tokens,list(reduced_data["text"])))

In [33]:
reduced_data["text"].head()

0      twitpic com  y zl awww that is a bummer you ...
1    is upset that he cannot update his facebook by...
2          managed to save   the rest go out of bounds
3      my whole body feels itchy and like its on fire 
4      it is not behaving at all i am mad why am i ...
Name: text, dtype: object
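In the output above, tweet 2 has lost its opening words ("i dived many times for the ball"). The mention class in `regex_pattern` contains a space, so a match that starts at `@kenichan` continues through the following words up to the first full stop. The sketch below shows a narrower alternative that handles only mentions, hashtags, and URLs; `STRICT_PATTERN` and `strip_mentions_and_urls` are hypothetical names, and this is not a drop-in replacement for the full pattern used in the notebook.

```python
import re

# Hypothetical, narrower pattern: mentions and hashtags stop at the first
# character that is not a letter, digit, or underscore; URLs stop at whitespace.
STRICT_PATTERN = re.compile(r"@\w+|#\w+|\w+://\S+")

def strip_mentions_and_urls(tweet_text):
    """Replace mentions, hashtags, and URLs with a space, then collapse whitespace."""
    cleaned = STRICT_PATTERN.sub(" ", tweet_text)
    return " ".join(cleaned.split())

print(strip_mentions_and_urls("@kenichan i dived many times for the ball."))
print(strip_mentions_and_urls("@switchfoot http://twitpic.com/2y1zl - awww"))
```

Punctuation and digits would still need a separate pass, as in the notebook's later cleaning step.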

In [34]:
def remove_remaining_noisy_tokens(tweet_text):
    #drop single-character words, non-ASCII runs, underscores, and runs of non-word characters
    return re.sub(r'\b\w\b|[^\u0000-\u007f]+|_+|\W+',repl=" ",string=tweet_text)

In [35]:
with ThreadPoolExecutor(max_workers=4) as pool:
    reduced_data["text"] = list(pool.map(remove_remaining_noisy_tokens,list(reduced_data["text"])))

In [36]:
reduced_data["text"].head()

0     twitpic com   zl awww that is   bummer you sh...
1    is upset that he cannot update his facebook by...
2            managed to save the rest go out of bounds
3      my whole body feels itchy and like its on fire 
4     it is not behaving at all   am mad why am   h...
Name: text, dtype: object

In [37]:
def tokenize_tweet_text(tweet_text):
    #split the cleaned tweet into a list of word tokens
    return word_tokenize(tweet_text)

In [38]:
with ThreadPoolExecutor(max_workers=4) as pool:
    reduced_data["text"] = list(pool.map(tokenize_tweet_text,list(reduced_data["text"])))

In [39]:
reduced_data["text"].head()

0    [twitpic, com, zl, awww, that, is, bummer, you...
1    [is, upset, that, he, can, not, update, his, f...
2    [managed, to, save, the, rest, go, out, of, bo...
3    [my, whole, body, feels, itchy, and, like, its...
4    [it, is, not, behaving, at, all, am, mad, why,...
Name: text, dtype: object

In [40]:
en_stop_words = set(stopwords.words('english')).union(STOP_WORDS) #kept as a set so each membership test is a constant-time lookup

In [41]:
def is_not_stopword(token):
    #True for tokens that should be kept
    return token not in en_stop_words

In [42]:
def remove_stopwords(tweet_text):
    return list(filter(is_not_stopword,tweet_text))

In [43]:
with ThreadPoolExecutor(max_workers=4) as pool:
    reduced_data["text"] = list(pool.map(remove_stopwords,list(reduced_data["text"])))

In [44]:
reduced_data["text"].head()

0    [twitpic, com, zl, awww, bummer, shoulda, got,...
1    [upset, update, facebook, texting, cry, result...
2                        [managed, save, rest, bounds]
3                     [body, feels, itchy, like, fire]
4                                      [behaving, mad]
Name: text, dtype: object
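The output above shows that tweet 4 ("it is not behaving at all ...") is reduced to `[behaving, mad]`: the word "not" is treated as a stop word and removed. For sentiment work it may be worth keeping negations. The sketch below is one possible way to do that, using a hypothetical miniature stop-word set and a hypothetical `keep` parameter; it is not what the notebook does.

```python
def drop_stop_words(tokens, stop_words, keep=frozenset()):
    """Remove stop words, except those explicitly listed in `keep`."""
    return [token for token in tokens if token not in stop_words or token in keep]

# Hypothetical miniature stop-word set; the notebook builds a much larger one.
sample_stop_words = {"it", "is", "not", "at", "all"}
tokens = ["it", "is", "not", "behaving", "at", "all"]

print(drop_stop_words(tokens, sample_stop_words))
print(drop_stop_words(tokens, sample_stop_words, keep={"not"}))
```

Whether keeping negations improves accuracy on this dataset is an open question that would need to be checked empirically.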

In [45]:
def get_wnet_pos_tag(treebank_tag):
    #treebank_tag is a (token, tag) pair as returned by pos_tag
    token, tag = treebank_tag
    if tag.startswith('J'):
        return (token,wn.ADJ)
    elif tag.startswith('V'):
        return (token,wn.VERB)
    elif tag.startswith('N'):
        return (token,wn.NOUN)
    elif tag.startswith('R'):
        return (token,wn.ADV)
    else:
        #any other tag falls back to noun
        return (token,wn.NOUN)

In [46]:
def get_pos_tag(tweet_text):

    return list(map(get_wnet_pos_tag,pos_tag(tweet_text)))

In [47]:
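import os
import nltk

# Replace 'path_to_nltk_data' with the actual path to your NLTK data directory.
# NLTK reads NLTK_DATA when it is first imported, so this assignment does not affect
# the module imported earlier; the download below still uses the default directory.
os.environ['NLTK_DATA'] = 'path_to_nltk_data'

# Download WordNet
nltk.download('wordnet')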


[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [48]:
pos_tag(reduced_data["text"][0])

[('twitpic', 'NN'),
 ('com', 'NN'),
 ('zl', 'NN'),
 ('awww', 'IN'),
 ('bummer', 'NN'),
 ('shoulda', 'NN'),
 ('got', 'VBD'),
 ('david', 'JJ'),
 ('carr', 'NN'),
 ('day', 'NN')]
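The Treebank tags shown above have to be reduced to WordNet's four part-of-speech classes before lemmatization. The sketch below expresses the same mapping as `get_wnet_pos_tag` with a dictionary lookup, using plain string literals so it runs without NLTK data; `TREEBANK_PREFIX_TO_WORDNET` and `to_wordnet_pos` are hypothetical names, and the sample pairs are copied from the output above.

```python
# WordNet's part-of-speech constants are single letters; these literals equal
# wn.ADJ, wn.VERB, wn.NOUN and wn.ADV in nltk.corpus.wordnet.
TREEBANK_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(tagged_token):
    """Convert one (token, treebank_tag) pair to (token, wordnet_pos)."""
    token, treebank_tag = tagged_token
    # Any tag outside the four families falls back to noun, as in the notebook.
    return token, TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1], "n")

tagged = [("awww", "IN"), ("got", "VBD"), ("david", "JJ"), ("carr", "NN")]
print([to_wordnet_pos(pair) for pair in tagged])
```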

In [None]:
reduced_data["text"] = list(map(get_pos_tag,list(reduced_data["text"])))

In [None]:
reduced_data["text"].head()

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
def lemmatize_token(token_pos_tuple):

    if token_pos_tuple is None:
        return ""
    else:
        return lemmatizer.lemmatize(word=token_pos_tuple[0],pos=token_pos_tuple[1])

In [None]:
def lemmatize_text(tweet_text):

    if len(tweet_text) > 0:
        return list(map(lemmatize_token,tweet_text))
    else:
        return [""]

In [None]:
reduced_data["text"] = list(map(lemmatize_text,list(reduced_data["text"])))

In [None]:
reduced_data["text"].head()

In [None]:
# with open("/kaggle/working/tokenized_data.pkl","wb") as file_handle:
#     pickle.dump(reduced_data,file_handle)

In [None]:
#!mv /kaggle/working/tokenized_data.pkl /drive/MyDrive

In [None]:
#!cp /content/drive/MyDrive/tokenized_data.pkl /content

In [None]:
with open("/kaggle/working/tokenized_data.pkl","rb") as file_handle:
    reduced_data = pickle.load(file_handle)

In [None]:
reduced_data.head()

In [None]:
max_tokens = 30000
max_sequence_len = int(reduced_data["text"].apply(len).max()) #length, in tokens, of the longest tweet

In [None]:
reduced_rows_idx = np.argwhere(np.array(reduced_data["text"].apply(lambda x: len(x))) >= 10)

In [None]:
reduced_data = reduced_data.iloc[reduced_rows_idx.reshape(reduced_rows_idx.shape[0],)]

In [None]:
reduced_data["text"] = list(map(lambda x: " ".join(x), list(reduced_data["text"])))

In [None]:
reduced_data.head()

In [None]:
filtered_reduced_data = reduced_data[reduced_data["text"] != ""].copy() #copy so later column assignments do not act on a view of reduced_data

In [None]:
filtered_reduced_data.shape

In [None]:
vectorize_layer = TextVectorization(max_tokens=max_tokens,output_sequence_length=max_sequence_len)

In [None]:
vectorize_layer.adapt(data=np.array(filtered_reduced_data["text"]))

In [None]:
# import os

# # Get the path to the working directory
# working_directory = "/kaggle/working/"

# # Construct the full path to the pickle file in the working directory
# pickle_file_path = os.path.join(working_directory, "token_integer_mapping.pkl")

# # Save the configuration and weights to the pickle file
# with open(pickle_file_path, "wb") as file_handle:
#     pickle.dump({"config": vectorize_layer.get_config(),
#                  "weights": vectorize_layer.get_weights()}, file_handle)


In [None]:
with open("/kaggle/working/token_integer_mapping.pkl","wb") as file_handle:
    pickle.dump({"config":vectorize_layer.get_config(),
                 "weights":vectorize_layer.get_weights()},file_handle)

In [None]:
#!mv token_integer_mapping.pkl /content/drive/MyDrive/

In [None]:
#!mv /content/token_integer_mapping.pkl /content/drive/MyDrive

In [None]:
#!cp /content/drive/MyDrive/token_integer_mapping.pkl /content

In [None]:
with open("/kaggle/working/token_integer_mapping.pkl","rb") as file_handle:

  vectorize_layer_attributes = pickle.load(file_handle)

In [None]:
vectorize_layer_attributes["config"]

In [None]:
loaded_vectorization_layer = TextVectorization.from_config(vectorize_layer_attributes["config"])

In [None]:
# loaded_vectorization_layer

In [None]:
vectorize_layer_attributes["weights"]

In [None]:
loaded_vectorization_layer.set_weights(vectorize_layer_attributes["weights"])

In [None]:
def vectorize_tweet(raw_tweet):
    #map one preprocessed tweet string to its padded integer sequence
    return loaded_vectorization_layer(raw_tweet).numpy()

In [None]:
len(vectorize_tweet(filtered_reduced_data["text"].iloc[0]))

In [None]:
filtered_reduced_data["text"].iloc[0] #positional access; the index labels have gaps after filtering

In [None]:
vectorized_tweets = list(filtered_reduced_data["text"].apply(vectorize_tweet))

In [None]:
#-ls

In [None]:
vectorized_train_tweets = vectorized_tweets[0:200000]
vectorized_cv_tweets = vectorized_tweets[200000:]
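The preview of `data` earlier in the notebook shows target 0 in the first rows and target 4 in the last rows, so the file appears to be ordered by label. If that ordering survives the filtering, a purely positional split would give a validation slice dominated by one class. A minimal sketch of shuffling the row positions before splitting is shown below; `shuffled_split` is a hypothetical helper and the sizes are toy values.

```python
import random

def shuffled_split(num_rows, train_size, seed=0):
    """Return disjoint train and validation index lists in shuffled order."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    return indices[:train_size], indices[train_size:]

# Toy sizes for illustration; the notebook uses 200000 training rows.
train_idx, cv_idx = shuffled_split(num_rows=10, train_size=7)
print(len(train_idx), len(cv_idx))
```

The same index lists would have to be applied to both the vectorized tweets and the labels so that the pairs stay aligned.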

In [None]:
np.savez_compressed("./vectorized_train_tweets.npz",*vectorized_train_tweets)
np.savez_compressed("./vectorized_cv_tweets.npz",*vectorized_cv_tweets)

In [None]:
#these paths come from a Colab session; the files are read back from the working directory below
#!mv /content/vectorized_train_tweets.npz /content/drive/MyDrive
#!mv /content/vectorized_cv_tweets.npz /content/drive/MyDrive

In [None]:
filtered_reduced_data["target"] = filtered_reduced_data["target"].astype(str)

In [None]:
tweet_labels = list(filtered_reduced_data["target"].replace(to_replace=filtered_reduced_data["target"].unique(),
                                        value=list(range(len(filtered_reduced_data["target"].unique())))))
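The `replace` call above assigns class ids in order of first appearance, so the result depends on row order. An explicit dictionary makes the encoding fixed and fails loudly on unexpected values. This is a sketch with a hypothetical `LABEL_MAP` and `encode_targets`; it relies on Sentiment140 marking negative tweets with 0 and positive tweets with 4, the two values visible in the preview.

```python
# Raw target value -> class id; 0 is negative and 4 is positive in Sentiment140.
LABEL_MAP = {0: 0, 4: 1}

def encode_targets(targets):
    """Map raw targets (ints or numeric strings) to 0/1; unknown values raise KeyError."""
    return [LABEL_MAP[int(target)] for target in targets]

print(encode_targets(["0", "4", "4", "0"]))
```

The function accepts any iterable, so a pandas Series such as `filtered_reduced_data["target"]` could be passed directly.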

In [None]:
train_labels = tweet_labels[0:200000]
cv_labels = tweet_labels[200000:]

In [None]:
train_mb_size = 1000
num_epochs = 10
train_size = 200000

In [None]:
loaded_vectorized_train_tweets = np.load("./vectorized_train_tweets.npz")

In [None]:
def train_datagen():
    #yield (tweets, labels) minibatches, repeating the training set num_epochs times
    for _ in range(num_epochs):
        for i in range(train_size//train_mb_size):
            tweets_mb_list = [loaded_vectorized_train_tweets["arr_"+str(arr_idx)] for arr_idx in range(i*train_mb_size,(i+1)*train_mb_size)]
            tweets_labels_mb_list = [np.array(tweet_label) for tweet_label in train_labels[i*train_mb_size:(i+1)*train_mb_size]]

            #labels get a trailing axis, giving shape (batch, 1)
            yield np.array(tweets_mb_list), np.expand_dims(np.array(tweets_labels_mb_list),-1)

In [None]:
for tweets_mb,tweets_labels in train_datagen():

  print(tweets_mb.shape)
  print(tweets_labels.shape)

  break
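The generator above looks up one archive entry per tweet for every batch. If the vectorized tweets fit in memory as a single 2-D array, slicing that array is a simpler way to produce the same batch shapes. The sketch below assumes such an array exists; `minibatches`, `toy_features`, and `toy_labels` are hypothetical and use toy sizes.

```python
import numpy as np

def minibatches(features, labels, batch_size, num_epochs):
    """Yield (features, labels) batches; the final partial batch is dropped."""
    num_batches = len(features) // batch_size
    for _ in range(num_epochs):
        for i in range(num_batches):
            rows = slice(i * batch_size, (i + 1) * batch_size)
            # Labels get a trailing axis so each batch has shape (batch, 1).
            yield features[rows], labels[rows].reshape(-1, 1)

# Toy data: 10 "tweets" of 4 token ids each, with alternating 0/1 labels.
toy_features = np.arange(40).reshape(10, 4)
toy_labels = np.arange(10) % 2

first_x, first_y = next(minibatches(toy_features, toy_labels, batch_size=3, num_epochs=1))
print(first_x.shape, first_y.shape)
```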

In [None]:
cv_mb_size = 1577
cv_size = 14193

In [None]:
loaded_vectorized_cv_tweets = np.load("./vectorized_cv_tweets.npz")

In [None]:
def cv_datagen():
    #yield (tweets, labels) validation minibatches, one full pass per training epoch
    for _ in range(num_epochs):
        for i in range(cv_size//cv_mb_size):
            tweets_mb_list = [loaded_vectorized_cv_tweets["arr_"+str(arr_idx)] for arr_idx in range(i*cv_mb_size,(i+1)*cv_mb_size)]
            tweets_labels_mb_list = [np.array(tweet_label) for tweet_label in cv_labels[i*cv_mb_size:(i+1)*cv_mb_size]]

            #labels get a trailing axis, giving shape (batch, 1)
            yield np.array(tweets_mb_list), np.expand_dims(np.array(tweets_labels_mb_list),-1)

In [None]:
for tweets_mb,tweets_labels in cv_datagen():

  print(tweets_mb.shape)
  print(tweets_labels.shape)

  break

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip

In [None]:
!unzip glove.6B.zip #wget saves the archive into the current working directory

In [None]:
max_vocabulary_size = len(loaded_vectorization_layer.get_vocabulary())
embedding_output_dim = 50

In [None]:
def create_bin_class_rnn():

  rnn_model = Sequential()

  rnn_model.add(Input(shape=(None,),dtype="int64"))
  rnn_model.add(Embedding(input_dim=max_vocabulary_size,output_dim=embedding_output_dim,input_length=max_sequence_len))
  rnn_model.add(LSTM(units=50))
  rnn_model.add(Dense(units=1,activation="sigmoid"))

  return rnn_model

In [None]:
rnn_model = create_bin_class_rnn()

In [None]:
rnn_model.summary()

In [None]:
rnn_model.compile(optimizer="rmsprop", #the Keras default, stated explicitly
                  loss="binary_crossentropy",
                  metrics=["accuracy",
                           tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall()])

In [None]:
training_data_gen = train_datagen()
cv_data_gen = cv_datagen()

In [None]:
rnn_model.fit(training_data_gen,
              epochs=num_epochs,
              validation_data=cv_data_gen,
              steps_per_epoch=200,validation_steps=9)
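The `steps_per_epoch` and `validation_steps` values passed to `fit` follow from the batch sizes defined earlier. The short check below recomputes them from the constants in the cells above, as a sanity check rather than a required part of the pipeline.

```python
# Values copied from the cells above.
train_size, train_mb_size = 200000, 1000
cv_size, cv_mb_size = 14193, 1577
num_epochs = 10

steps_per_epoch = train_size // train_mb_size   # full training batches per epoch
validation_steps = cv_size // cv_mb_size        # full validation batches per pass

# Each generator loops num_epochs times, so it yields exactly as many
# batches as fit() requests over the whole run.
total_train_batches = steps_per_epoch * num_epochs
print(steps_per_epoch, validation_steps, total_train_batches)
```

Because 14193 is an exact multiple of 1577, no validation rows are dropped by the integer division.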