Rotem Shperling - 305699514
Eitan Shteinberg - 305809535
Notebook | Description |
---|---|
ds_project.ipynb | Main notebook of the project |
colab_gpu.ipynb | Contains the phases that were conducted on the Google Colab GPU |
Directory | Description |
---|---|
models | Contains the trained models of the politicians |
history | Contains the history & parameters of the trained models |
text_speeches | Contains the original raw politician speeches |
generated_speeches | Contains the speeches generated by the models |
dataframe | Contains the dataframes that are used for the classification model |
In this assignment we were asked to perform 3 major tasks:
- Build a classification model based on the speeches of 3 chosen politicians.
- Build a model to predict the speeches of each politician (separate model for each of them).
- Test the trained classification model on the generated speeches.
Technical aspects:
- We collected 80 speeches from each politician.
- We generated 24 speeches (30% of the original count) for each politician using his trained model.
- We trained each model for 200 epochs in order to get better results.
What is Google Colab:
Colaboratory is a Google research project created to help disseminate machine learning education and research.
It's a Jupyter notebook environment that requires no setup to use and runs entirely in the cloud.
Colaboratory notebooks are stored in Google Drive and can be shared just as you would with Google Docs or Sheets.
What were our main benefits from using it:
- The use of the Tesla K80 GPU with the Keras library, which let us run very heavy computations such as model training and speech generation in much less time.
- It is FREE.
Cons of using it:
- The runtime environment occasionally crashed and we had to start over from scratch (we trained each model ~10 times before the final version).
- Importing and exporting files was incredibly painful.
- Working with Google Drive was not convenient.
Where did we use it:
- Training the RNN LSTM models - 200 epochs
- Generating the speeches of each politician
- Support in generating the generated speeches dataframe
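The speech-generation step above repeatedly samples the next character from the model's predicted probabilities. The sketch below shows that sampling loop in plain Python; the real project feeds a trained Keras LSTM here, so `fake_predict`, `chars`, and the probabilities are placeholder assumptions, not the project's actual model.

```python
# Hedged sketch of the character-sampling loop used to generate text from a
# trained character-level model. `fake_predict` stands in for model.predict().
import random

chars = ['a', 'b', 'c', ' ']  # toy vocabulary (the real one covers all characters)

def fake_predict(seed_text):
    # Placeholder for the trained model: returns one probability per character.
    return [0.4, 0.3, 0.2, 0.1]

def sample(probs):
    """Draw the index of one character according to its probability."""
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1  # guard against floating-point rounding

def generate(seed, length):
    """Extend the seed one sampled character at a time."""
    text = seed
    for _ in range(length):
        probs = fake_predict(text)
        text += chars[sample(probs)]
    return text

random.seed(0)
speech = generate('a', 20)
```

Sampling (rather than always taking the arg-max character) keeps the generated speeches varied instead of looping on the single most likely phrase.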
Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance spaCy library.
With the basics — tokenization, part-of-speech tagging, dependency parsing, etc. — offloaded to another library, textacy focuses on tasks
facilitated by the ready availability of tokenized, POS-tagged, and parsed text.
In this code we use an LSTM (Long Short Term Memory) Neural Network which is a special kind of RNN,
capable of learning long-term dependencies.
LSTM was introduced by Hochreiter & Schmidhuber (1997), and was refined and popularized by many people afterwards.
LSTMs work tremendously well on a large variety of problems and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem.
Remembering information for long periods of time is practically their default behavior.
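The gating mechanism behind this behavior can be shown in a few lines. Below is a minimal sketch of one LSTM cell step in pure Python with scalar states (the project itself uses Keras' `LSTM` layer); the weight values are arbitrary placeholders, not trained parameters.

```python
# One LSTM time step for scalar input/state (the vector case is analogous).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """w maps, for each gate g in {i, f, o, c}: input weight w[g+'x'],
    recurrent weight w[g+'h'], and bias w[g+'b'] (all made-up scalars)."""
    i = sigmoid(w['ix'] * x + w['ih'] * h_prev + w['ib'])          # input gate
    f = sigmoid(w['fx'] * x + w['fh'] * h_prev + w['fb'])          # forget gate
    o = sigmoid(w['ox'] * x + w['oh'] * h_prev + w['ob'])          # output gate
    c_tilde = math.tanh(w['cx'] * x + w['ch'] * h_prev + w['cb'])  # candidate
    c = f * c_prev + i * c_tilde  # cell state: keep old info via f, add new via i
    h = o * math.tanh(c)          # hidden state exposed to the next layer
    return h, c

# Run a toy sequence through the cell with every weight set to 0.5.
w = {k: 0.5 for k in ('ix', 'ih', 'ib', 'fx', 'fh', 'fb',
                      'ox', 'oh', 'ob', 'cx', 'ch', 'cb')}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, w)
```

The forget gate `f` multiplying the old cell state is what lets the network carry information across many time steps by default, which is the property the text describes.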
Multinomial Logistic Regression is a classification method that generalizes logistic regression to multiclass problems,
i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the
different possible outcomes of a categorically distributed dependent variable, given a set of independent variables,
which may be real-valued, binary-valued, categorical-valued and more.
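The prediction side of this model is just a softmax over per-class linear scores. Here is a minimal pure-Python sketch (the project's classifier is fitted with a library; the weights, biases, and feature values below are arbitrary placeholders).

```python
# Multinomial (softmax) logistic regression prediction for one sample.
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def predict_proba(x, W, b):
    """Class probabilities for sample x: softmax of per-class linear scores."""
    scores = [sum(wi * xi for wi, xi in zip(w_k, x)) + b_k
              for w_k, b_k in zip(W, b)]
    return softmax(scores)

# Toy setup: 3 classes (politicians), 2 features per speech.
W = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]  # made-up weights
b = [0.1, 0.0, -0.1]                        # made-up biases
probs = predict_proba([1.0, 2.0], W, b)
pred = probs.index(max(probs))  # arg-max class is the predicted politician
```

The softmax guarantees the outputs are valid probabilities that sum to 1, which is exactly the "probabilities of the different possible outcomes" described above.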