๐Ÿ“ Determining the tonality of the text using LSTM


AliSultanov2000/Wise-advice

The goal of this project is to create an ML model (based on an RNN) that determines the tone of a text: positive or negative (so the ML problem is binary classification). The main application area of the project is chatbots.
Project stages:

  • Problem statement and its translation into machine-learning terms
  • Choosing metrics
  • Text data parsing
  • Text data preprocessing
  • Creating an RNN (specifically, an LSTM) model
  • Creating an ML Pipeline
  • Saving the entire model in a Docker image

The stack of technologies used: Python, Docker, Jupyter Notebook (Google Colab), Keras, NumPy, Pandas, Matplotlib

Model limitations: the model can only process English text
Metric: Precision

There are various ways to represent text as a vector or matrix, for example:

  • Bag of words (n-grams). The text is represented as a vector
  • Numerical encoding of words. The text is represented as a vector
  • One-hot encoding. The text is represented as a vector
  • Dense vector representation of words. The text is represented as a matrix

In this project, a dense vector representation (embedding) of words is used. To do this, we set the maximum number of words that will be represented as vectors, as well as the maximum number of words in one text. The numerical values in each word's vector are trainable parameters that are learned during training of the recurrent neural network.
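A minimal sketch of this setup in Keras (the vocabulary size of 10,000 and the embedding dimension of 64 are assumed values, since they are not stated here; only the text length of 200 comes from the pipeline below):

```python
from tensorflow.keras.layers import Embedding

MAX_WORDS = 10_000  # assumed maximum number of words represented as vectors
MAX_LEN = 200       # maximum number of words in one text (used when padding)

# Each of the MAX_WORDS most frequent words gets a trainable 64-dimensional
# vector, so a text of MAX_LEN words becomes a (MAX_LEN, 64) matrix; the
# vector values are learned together with the rest of the network.
embedding = Embedding(input_dim=MAX_WORDS, output_dim=64)
```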

ML Pipeline:

  • Using our own tokenizer, we split the text into tokens
  • Using the tokenizer dictionary, we match a number to each token, obtaining a sequence
  • Resizing the sequence (in our case, to 200 elements): if the sequence is shorter than 200, zeros are added at the beginning; if it is longer than 200, it is trimmed to the desired size
  • Passing the sequence to the LSTM model; as a result, we get a prediction for the text (see the sketch after this list)
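A minimal sketch of these steps with the standard Keras tokenizer (the toy corpus and variable names are illustrative, and the vocabulary size is an assumption):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 10_000  # assumed vocabulary size
MAX_LEN = 200       # sequence length used in this project

# Toy training corpus, purely for illustration
train_texts = ["great advice, very helpful", "terrible advice, waste of time"]

# Build the token-to-number dictionary from the training texts
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(train_texts)

# Tokens -> sequence of numbers via the tokenizer dictionary
sequences = tokenizer.texts_to_sequences(["very helpful advice"])

# Zeros are added at the beginning (padding='pre') if the sequence is
# shorter than 200 elements; longer sequences are trimmed to 200
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding='pre')
print(padded.shape)  # (1, 200)
```

The padded array is what gets passed to the LSTM model for a prediction.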

As a result of this project, a recurrent neural network (LSTM) was obtained. To obtain the best metric values, the batch size and the number of epochs were varied. Changes to the complexity of the model itself were also considered: the number of layers and cells (for the same purpose, a Batch Normalization layer was added). Using tokenizers from spaCy and NLTK did not give a significant improvement, so the standard tokenizer from Keras was used. The model is currently ready for deployment to production (wrapped in a Docker image).
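The exact architecture is not listed here, so the following is only a plausible sketch combining the pieces mentioned above (trainable embeddings, an LSTM layer, Batch Normalization, a sigmoid output for the binary decision, and Precision as the metric); the layer sizes and optimizer are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, BatchNormalization, Dense
from tensorflow.keras.metrics import Precision

MAX_WORDS = 10_000  # assumed vocabulary size
MAX_LEN = 200       # sequence length from the pipeline above

model = Sequential([
    # Trainable dense word vectors (see the representation section above)
    Embedding(input_dim=MAX_WORDS, output_dim=64),
    # LSTM layer; the number of cells (64) is an assumed value
    LSTM(64),
    # Batch Normalization, added while tuning model complexity
    BatchNormalization(),
    # Single sigmoid unit for the binary positive/negative decision
    Dense(1, activation='sigmoid'),
])

model.compile(optimizer='adam',            # assumed optimizer
              loss='binary_crossentropy',  # standard loss for binary classification
              metrics=[Precision()])       # the project's chosen metric
```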