This is the repository for the paper "Style transfer in NLP: a framework and multilingual analysis with Friends TV series".
Style transfer is an important and rapidly developing area of Natural Language Processing. These days, more and more methods and models are proposed that generate text in a predefined style. In this paper we propose a framework for style transfer based on the "Friends" TV series. The trained models are able to mimic any of the 6 main characters of this famous TV series in English and Russian. We also present a dialogue dataset of "Friends" subtitles in English and its automatic Russian translation. In addition, we perform a multilingual comparison of "Friends" style transfer in the two languages.
This folder contains the data for the Telegram bot:
- data: DB for storing the state of each chat and the rating given to each message; paths to models and log
- models: folder template holding the pre-trained models
- ui: utilities for enhancing the UI
- utils: database control, model uploader and rating
- main.py: the main file that starts the bot itself
This folder contains all our output datasets:
- bigram_pics: pictures of the friends without background
- data_for_tone_analysis: statistics of tone analysis from positive and negative words
- generated: phrases generated by GPT3-Large
- questions: questions in English and Russian for manual assessment of generated phrases
- scripts: all scripts with speakers' annotation and phrases of all friends in English and Russian
- train_data: train data for two-step fine-tuning of GPT3-Large models, split in a 9 to 1 ratio (monologues and cleaned dialogue lines), in English and Russian
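
The 9 to 1 split mentioned for train_data can be illustrated with a minimal sketch. It is not the code used in this repository: the function name `split_9_to_1`, the fixed seed, the shuffling step and the placeholder phrases are all assumptions made for illustration.

```python
import random


def split_9_to_1(samples, seed=0):
    """Shuffle a copy of `samples` and split it into 90% / 10% parts.

    Hypothetical helper: shuffling with a fixed seed is an assumption,
    the repository may split its data differently.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 9) // 10  # index where the last 10% begins
    return shuffled[:cut], shuffled[cut:]


# Placeholder data standing in for character phrases.
phrases = [f"phrase {i}" for i in range(20)]
train, held_out = split_9_to_1(phrases)
print(len(train), len(held_out))  # 18 2
```

A fixed seed keeps the split reproducible between runs, which makes later evaluation comparable.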
This folder contains all Jupyter notebooks:
- bigrams_trigrams: a notebook to create bigrams and trigrams for each friend
- binary_classifier: notebooks for binary classifiers (training + evaluation)
- multilabel_classifier: a notebook for multilabel classifiers
- Other files:
  - Parser.ipynb: parses the website with the series' scripts
  - Data_preparation.ipynb: cleans the parsed scripts of irrelevant symbols and words
  - Statistics.ipynb: computes statistics of the most frequently used words and visualizes them
  - Phrases_Preprocessing.ipynb: gets phrases that are common to the friends and hard for a classifier to detect, in English
  - Ru_Phrases_Preprocessing.ipynb: gets phrases that are common to the friends and hard for a classifier to detect, in Russian
  - Text_Analysis.ipynb: brief analysis of the most frequently used words
  - Metrics.ipynb: preprocessing and further tone analysis
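
As a rough illustration of what creating bigrams and trigrams for each friend can look like, here is a minimal sketch. It is not taken from the bigrams_trigrams notebook: the helper names `ngrams` and `top_ngrams_per_speaker`, the whitespace tokenization and the toy lines are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all consecutive n-token tuples from `tokens`."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams_per_speaker(lines, n, k=3):
    """Count n-grams separately per speaker and keep the k most frequent.

    `lines` is an iterable of (speaker, text) pairs. Lowercasing and
    whitespace splitting are simplifying assumptions.
    """
    counts = {}
    for speaker, text in lines:
        tokens = text.lower().split()
        counts.setdefault(speaker, Counter()).update(ngrams(tokens, n))
    return {speaker: c.most_common(k) for speaker, c in counts.items()}


# Made-up toy lines, not samples from the dataset.
lines = [
    ("Joey", "how are you"),
    ("Joey", "how are you today"),
    ("Ross", "we are fine"),
]
top_bigrams = top_ngrams_per_speaker(lines, n=2, k=1)
print(top_bigrams["Joey"])
```

Passing `n=3` to the same function yields trigram counts.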
The checkpoints of the trained models are stored here.