The goal of this project is to create an ML model (based on an RNN) that determines the sentiment of a text: positive or negative (so the ML problem is binary classification). The main application area of the project: chatbots
Project stages:
- Problem statement and its translation into machine-learning terms
- Choosing metrics
- Text data parsing
- Text data preprocessing
- Creating an RNN (specifically, LSTM) model
- Creating an ML Pipeline
- Saving the entire model in a Docker image
Technology stack: Python, Docker, Jupyter Notebook (Google Colab), Keras, NumPy, Pandas, Matplotlib
Model limitations: the model can process English-language text only
Metric: precision (TP / (TP + FP), i.e. the share of texts predicted positive that are actually positive)
There are various ways of representing text as a vector or matrix, for example (a short illustration follows this list):
- Bag of words (N-grams): the text is represented as a vector
- Numerical encoding of words: the text is represented as a vector
- One-hot encoding: the text is represented as a vector
- Dense vector representation of words: the text is represented as a matrix
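A tiny illustration (not from the project code; the toy texts are made up) of two of these representations using the Keras tokenizer:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the movie was good", "the movie was bad"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# Numerical encoding: each word is replaced by its dictionary index
print(tokenizer.texts_to_sequences(texts))  # e.g. [[1, 2, 3, 4], [1, 2, 3, 5]]

# Bag of words: one count vector per text (index 0 is reserved by Keras)
print(tokenizer.texts_to_matrix(texts, mode="count"))
```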
In this project, a dense vector representation of words is used. To do this, we fix the maximum number of words that will be represented as vectors (the dictionary size) and the maximum number of words in one text. The numerical values of each word vector are trainable parameters, learned during the training of the recurrent neural network
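A minimal sketch of this setup in Keras (the dictionary size and embedding dimension below are assumptions for illustration; only the 200-word text length is fixed by the project):

```python
from tensorflow.keras.layers import Embedding

max_words = 10000   # assumed maximum number of words in the dictionary
max_len = 200       # maximum number of words in one text
embedding_dim = 64  # assumed length of each trainable word vector

# Maps integer sequences of shape (batch, max_len) to a matrix of shape
# (batch, max_len, embedding_dim); the word vectors are trainable parameters
# learned together with the rest of the network.
embedding = Embedding(input_dim=max_words, output_dim=embedding_dim,
                      input_length=max_len)
```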
ML Pipeline:
- Using the fitted tokenizer, split the text into tokens
- Using the tokenizer's dictionary, map each token to a number, producing a sequence
- Resize the sequence to a fixed length (200 elements in our case): if the sequence has fewer than 200 numbers, zeros are added to the beginning; if it has more than 200, the sequence is trimmed to the desired size
- Pass the sequence to the LSTM model, which returns the answer for this text (see the sketch after this list)
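A sketch of this inference pipeline, assuming `tokenizer` is the fitted Keras tokenizer and `model` is the trained LSTM model (the helper name `predict_sentiment` is hypothetical):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 200  # fixed sequence length used by the project

def predict_sentiment(text, tokenizer, model):
    # Steps 1-2: tokenize the text and map each token to its dictionary number
    sequence = tokenizer.texts_to_sequences([text])
    # Step 3: pre-pad with zeros or trim so the sequence is exactly MAX_LEN long
    padded = pad_sequences(sequence, maxlen=MAX_LEN, padding="pre", truncating="pre")
    # Step 4: pass the sequence to the LSTM model; the output is P(positive)
    return float(model.predict(padded)[0][0])
```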
The result of this project is a recurrent neural network (LSTM). To obtain the best metric values, the batch size and the number of epochs were varied. Changing the complexity of the model itself was also considered: the number of layers and cells (for the same purpose, a Batch Normalization layer was added). Tokenizers from spaCy and NLTK did not give a significant improvement, so the standard Keras tokenizer was used. The model is now ready for deployment to production (wrapped in a Docker image)
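A hedged sketch of the kind of model described (the layer sizes and optimizer are assumptions, not the exact final configuration):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, BatchNormalization, Dense
from tensorflow.keras.metrics import Precision

model = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=200),  # assumed sizes
    LSTM(64),                        # assumed number of LSTM cells
    BatchNormalization(),            # added to help training, as noted above
    Dense(1, activation="sigmoid"),  # binary output: positive vs. negative
])

# Precision is the metric chosen for the project
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[Precision()])
```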