In [6]:
# from preprocess import PreprocessFromMap

Note: This notebook documents most of the steps and decisions made for this project, 
in case someone else wants to reuse this code and pipeline. 
The code assumes that the raw folder obtained from the TikTok metadata keeps the same format :)

# Preprocess raw folders and translate

## Step 1: folder moving and name changing
No impactful decisions are made here. This step only consists of a few functions that get the pipeline started:


In [7]:
    # preprocess = PreprocessFromMap(r'00 2\00', 'txt_00 2',
    #                                to_map=True,
    #                                edit_txt=True,
    #                                divide_into_folders=False,
    #                                translate=False,
    #                                construct_df=False)
    
    # preprocess.to_map()
    # preprocess.edit_txt()


The only things worth mentioning are that this step removes the timestamps from the text and strips the 'WEBVTT' line from the .txt files. Furthermore, the language of the text is recorded in the new file, so that the language of each file can be detected when translating.
To execute this step, you have to pass the original directory (in this case: '00 2\00') and a name for the new directory (in this case: 'txt_00 2').
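The cleaning described above can be sketched roughly as follows. This is a minimal illustration, not the actual `edit_txt` implementation: the function name `strip_webvtt`, the regular expression, and the sample transcript are assumptions made for the example, and it does not cover how the language is stored in the new file.

```python
import re

# Matches WebVTT cue timing lines such as "00:00:01.000 --> 00:00:04.000".
# The hours part is optional in WebVTT, so "00:01.000 --> 00:04.000" matches too.
TIMESTAMP_RE = re.compile(
    r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)


def strip_webvtt(raw: str) -> str:
    """Return only the spoken text from a WebVTT-style transcript (hypothetical helper)."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # drop blank separator lines
        if stripped.startswith("WEBVTT"):
            continue  # drop the header line
        if TIMESTAMP_RE.match(stripped):
            continue  # drop cue timing lines
        kept.append(stripped)
    return "\n".join(kept)


# Made-up sample transcript, only for illustration.
sample = """WEBVTT

00:00:00.000 --> 00:00:02.500
hallo allemaal

00:00:02.500 --> 00:00:05.000
welkom bij mijn video
"""

cleaned = strip_webvtt(sample)
print(cleaned)
```

Only the subtitle text survives; the header, the timing lines, and the blank separators are dropped.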

## Step 2: Translating
Before translation, the decision is made to only translate the Dutch files. Those were by far the majority of the text files, and limiting translation to Dutch makes it a lot easier to check whether the translation makes sense.
First, I made a function that takes all the Dutch files and moves them in groups of 30 to different folders within a translation directory. I did this so the folders can be translated one part at a time. The translation takes a lot of computational power and time; translating the files in chunks makes the run easier to oversee and to correct when an error pops up or the wifi disconnects. 

The function has an internal variable called 'desired_amount', which sets how many text files go into each folder. By default, this value is 30. The same variable comes back in the translation function, so if you want to translate more files per folder, change it in both places. 
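The batching idea can be sketched like this. It is a minimal illustration under assumptions, not the actual `divide_into_folders` method: the helper name `move_in_batches`, the `batch_000` folder naming, and the assumption that the source folder already holds only Dutch files are all invented for the example. Only the `desired_amount` default of 30 comes from the notebook.

```python
import shutil
import tempfile
from pathlib import Path


def move_in_batches(src_dir, dest_dir, desired_amount=30):
    """Move .txt files from src_dir into numbered subfolders of dest_dir (hypothetical helper)."""
    src, dest = Path(src_dir), Path(dest_dir)
    files = sorted(src.glob("*.txt"))
    created = []
    for start in range(0, len(files), desired_amount):
        # One folder per group of at most desired_amount files.
        batch_dir = dest / f"batch_{start // desired_amount:03d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        for f in files[start:start + desired_amount]:
            shutil.move(str(f), str(batch_dir / f.name))
        created.append(batch_dir)
    return created


# Demo on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "dutch"
    src.mkdir()
    for i in range(65):
        (src / f"video_{i:03d}.txt").write_text("tekst", encoding="utf-8")

    batches = move_in_batches(src, Path(tmp) / "translation")
    batch_sizes = [len(list(b.glob("*.txt"))) for b in batches]

print(batch_sizes)
```

With 65 files and the default of 30, the last folder simply holds the remainder, so a failed run only has to be repeated for the folder it stopped in.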

In [4]:
    # preprocess = PreprocessFromMap(r'00 2\00', 'txt_00 2',
    #                                to_map=False,
    #                                edit_txt=False,
    #                                divide_into_folders=True,
    #                                translate=False,
    #                                construct_df=False)
    
    # preprocess.divide_into_folders()

For the translation, I picked a transformer model from Hugging Face: 'facebook/mbart-large-50-many-to-many-mmt'. There were a couple of reasons for choosing this model. First, it can translate between many different languages (not just Dutch-English), which leaves open the possibility of translating more languages in the future. Second, it is a fairly recent and very large transformer model (many parameters), which may make translation more accurate by making better use of context.

Before using this model, I roughly tested the accuracy of the translation. This was not a rigorous evaluation, just a quick look at the performance of the model, since I did not expect to find many better-performing models without fine-tuning. I took the first fifty Dutch files from the folder (so not a random sample), moved them to a testing directory, and had them translated with this model. For the test, I put the results in a .csv with the original Dutch next to it and checked whether the main subject or content was retained in the translation. The translation does not have to be perfect, but the main subject should be clear from it. 

Rating each translation subjectively gives an accuracy of about 0.71. However, most of the failed translations were not solely the fault of the model. Since this is TikTok data, a lot of the original Dutch is already nonsense; many times I could not make sense of the Dutch either (or pick out the subject). So, as a performance metric for the model alone, the score of 0.71 can be considered a lower bound. 

These are the functions for model initialization, text splitting, and translation:

In [5]:
    # preprocess = PreprocessFromMap(r'00 2\00', 'txt_00 2',
    #                                to_map=False,
    #                                edit_txt=False,
    #                                divide_into_folders=False,
    #                                translate=True,
    #                                construct_df=False)
    
    # preprocess.translation_model_init(text)
    # preprocess.split_text(text, max_length=750)
    # preprocess.translate()

Naturally, the functions above assume that the previous steps have been run.
Furthermore, this model runs on Nvidia CUDA, so it does its computation on the GPU instead of the CPU, which is a lot faster for generation. 
Important: If you do not have an Nvidia GPU, or have not installed PyTorch with CUDA (12.1) enabled, this model will not run as written.
You can fix this by removing the .to('cuda') calls from the function. If you are not sure, create a variable that is set to 'cuda' when it is available and to 'cpu' otherwise, and pass that variable to .to() instead.
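The fallback variable mentioned above could look like the sketch below. The helper name `pick_device` is an assumption for illustration; `torch.cuda.is_available()` is the standard PyTorch check. The import is wrapped in a `try` so the snippet also runs on a machine without PyTorch installed.

```python
def pick_device() -> str:
    """Return 'cuda' when a CUDA-enabled PyTorch can see a GPU, otherwise 'cpu' (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed at all, so only the CPU answer makes sense.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# In the translation code, the hard-coded .to('cuda') calls would then
# become .to(device) for both the model and the tokenized inputs.
```

On the CPU the model still works, but generation will be considerably slower.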

One further comment: the split_text function handles text files that are too long for the model to process at once. It splits such a text into parts (the size is controlled by max_length, 750 in the call above) and feeds the parts to the model one by one. 
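A minimal sketch of such a splitting step is shown below. It is not the actual `split_text` implementation: the name `split_into_chunks` is hypothetical, and the sketch assumes that `max_length` counts characters and that splitting on whitespace is acceptable. The real function may count tokens or split on sentences instead.

```python
def split_into_chunks(text: str, max_length: int = 750) -> list:
    """Split text on whitespace into chunks of at most max_length characters (hypothetical helper)."""
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than max_length is cut into hard pieces.
        while len(word) > max_length:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_length])
            word = word[max_length:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_length:
            current = candidate
        else:
            # The chunk is full: store it and start a new one with this word.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


# Made-up long text, only for illustration.
long_text = "woord " * 400
chunks = split_into_chunks(long_text, max_length=750)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk can then be translated separately and the translated parts joined back together in order.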

## Step 3: creating a complete dataframe

Not much has to be said here. All English files and translated Dutch files are gathered and stacked underneath each other in one pandas dataframe, which is then exported to .csv.
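The gathering step can be sketched as follows. Note the substitution: the notebook uses a pandas dataframe, while this sketch uses only the standard-library `csv` module so that it runs without extra dependencies. The folder names, column names, and the `collect_rows` helper are assumptions for illustration, not the actual `construct_df` implementation.

```python
import csv
import tempfile
from pathlib import Path


def collect_rows(folder, language):
    """Read every .txt file in folder into a dict with its name, source language, and text (hypothetical helper)."""
    return [
        {
            "file": path.name,
            "source_language": language,
            "text": path.read_text(encoding="utf-8").strip(),
        }
        for path in sorted(Path(folder).glob("*.txt"))
    ]


# Demo on throwaway files; the folder layout here is made up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "english").mkdir()
    (root / "dutch_translated").mkdir()
    (root / "english" / "a.txt").write_text("hello everyone\n", encoding="utf-8")
    (root / "dutch_translated" / "b.txt").write_text("welcome to my video\n", encoding="utf-8")

    # English rows first, translated Dutch rows underneath.
    rows = collect_rows(root / "english", "en") + collect_rows(root / "dutch_translated", "nl")

    out_path = root / "combined.csv"
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file", "source_language", "text"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back to check what was written.
    with out_path.open(newline="", encoding="utf-8") as fh:
        loaded = list(csv.DictReader(fh))

print(loaded)
```

Keeping a column for the source language makes it possible to tell translated rows apart from original English rows later on.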
This is the function:

In [ ]:
    # preprocess = PreprocessFromMap(r'00 2\00', 'txt_00 2',
    #                                to_map=False,
    #                                edit_txt=False,
    #                                divide_into_folders=False,
    #                                translate=False,
    #                                construct_df=True)
    
    # preprocess.construct_df()