In [6]:
# from preprocess import PreprocessFromMap

Note: This notebook contains most steps and decisions made for this project. 
This is in case someone else wants to use this code and pipeline. 
Also, this code assumes that the raw folder obtained from the TikTok metadata stays the same in format :)

# Preprocess raw folders and translate

## Step 1: folder moving and name changing
Not many impactful decisions are made here. This step just contained a few functions to get the pipeline started. These are the different functions:


In [7]:
    # preprocess = PreprocessFromMap(r'00 2\00', 'txt_00 2',
    #                                to_map=True,
    #                                edit_txt=True,
    #                                divide_into_folders=False,
    #                                translate=False,
    #                                construct_df=False)
    
    # preprocess.to_map()
    # preprocess.edit_txt()


The only things worth mentioning are that this function erases the timestamps from the text and removes the 'WEBVTT' line from the .txt files. Furthermore, the decision is made to incorporate the language of the text in the new file, so that the different languages can be detected for translation.
To execute this step, you have to provide the original directory (in this case: '00 2\00') and a name for the new directory (in this case: 'txt_00 2').
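
To make the cleaning concrete, here is a minimal sketch of removing the 'WEBVTT' line and the timestamp lines from a subtitle file. It is not the code inside `edit_txt`; the helper name `strip_vtt` and the exact timestamp pattern are assumptions for illustration.

```python
import re

# Assumed WebVTT cue timing format, e.g. "00:00:01.000 --> 00:00:04.000"
# (the hour part is optional).
TIMESTAMP_LINE = re.compile(
    r"^(\d{2}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(\d{2}:)?\d{2}:\d{2}\.\d{3}"
)


def strip_vtt(raw: str) -> str:
    """Hypothetical helper: keep only the spoken text of a WebVTT transcript."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        # Skip blank lines, the header line and the cue timing lines.
        if not stripped or stripped == "WEBVTT" or TIMESTAMP_LINE.match(stripped):
            continue
        kept.append(stripped)
    return " ".join(kept)


sample = (
    "WEBVTT\n\n"
    "00:00:00.000 --> 00:00:02.000\nHello there\n\n"
    "00:00:02.000 --> 00:00:04.000\nhow are you\n"
)
print(strip_vtt(sample))
```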

## Step 2: Translating
Before translation, the decision is made to only translate the Dutch files. This is because those were by far the majority of text files, and it makes it a lot easier to check whether the translation makes sense.
First, I made a function that takes all the Dutch files and moves them in groups of 30 to different folders within a translation directory. I did this so that the files could be translated in parts. The translation takes a lot of computational power and time, and translating the files in chunks makes the process easier to oversee and to correct when an error pops up or the wifi disconnects.

The function has an internal variable called 'desired_amount', which sets how many text files there should be per folder. The same variable comes back in the translation function. By default, this value is set to 30. If you want to translate more files per folder, this can be changed (make sure to also edit it in the translation function!).
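
As an illustration of the grouping, the sketch below splits a list of file names into groups of `desired_amount`. It only shows the batching logic, not the actual moving of files; the helper name `batch_files` and the `nld_` file-name prefix are assumptions.

```python
from typing import List, Sequence


def batch_files(filenames: Sequence[str], desired_amount: int = 30) -> List[List[str]]:
    """Hypothetical helper: split file names into consecutive groups."""
    if desired_amount < 1:
        raise ValueError("desired_amount must be at least 1")
    return [
        list(filenames[i:i + desired_amount])
        for i in range(0, len(filenames), desired_amount)
    ]


# Made-up file names; the 'nld_' prefix is an assumption for this example.
names = [f"nld_{i}.txt" for i in range(65)]
batches = batch_files(names)
print([len(batch) for batch in batches])
```

The last group is simply smaller when the number of files is not a multiple of `desired_amount`.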

In [4]:
    # preprocess = PreprocessFromMap(r'00 2\00', 'txt_00 2',
    #                                to_map=False,
    #                                edit_txt=False,
    #                                divide_into_folders=True,
    #                                translate=False,
    #                                construct_df=False)
    
    # preprocess.divide_into_folders()

For the translation, I picked a transformer model from Hugging Face: 'facebook/mbart-large-50-many-to-many-mmt'. There were a couple of reasons for choosing this model. First of all, it is able to translate between many different languages (not just Dutch-English), which leaves open the possibility of translating more languages in the future. Furthermore, it is a fairly recent transformer model that is also very large (many parameters). This could make the translation more accurate, possibly because it incorporates context more successfully.

Before using this model, I roughly tested the accuracy of the translation. This was not a rigorous testing process, just a rough look at the performance of the model, since I did not expect to find many better performing (non-fine-tuned) models. I took the first fifty Dutch files from the folder (so not a random sample) and moved them to a testing directory. I then had them translated with this model. For the test, I put the results in a .csv with the original Dutch next to them. I then roughly checked whether the main subject or content is retained in the translation. So, the translation does not have to be perfect, but the main subject should be clear from it.

By subjectively rating each translation, the accuracy comes to about 0.71. However, most of the wrongly translated files did not come only as a fault of the model. Since this is TikTok data, it became apparent that there is a lot of nonsense in the original Dutch already present. Many times I could not make sense of the Dutch either (or pick out the subject). So, as a performance metric for the model only, the score of 0.71 can be considered a lower bound. 

This is the function for model initialization:

In [5]:
    # preprocess = PreprocessFromMap(r'00 2\00', 'txt_00 2',
    #                                to_map=False,
    #                                edit_txt=False,
    #                                divide_into_folders=False,
    #                                translate=True,
    #                                construct_df=False)
    
    # preprocess.translation_model_init(text)
    # preprocess.split_text(text, max_length=750)
    # preprocess.translate()

Naturally, the above functions assume that the previous functions have been run.
Furthermore, this model is run with NVIDIA CUDA. This makes the model do its computation on the GPU instead of the CPU, which is a lot faster for generation.
Important: If you do not have an NVIDIA GPU, or have not installed PyTorch with CUDA (12.1) enabled, this model will not run.
You can fix this by removing the .to('cuda') from the function. Or, if you are not sure, create a separate variable that is set to 'cuda' and falls back to 'cpu' if CUDA is not available.
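
A minimal sketch of such a fallback variable is shown below. The helper name `pick_device` is an assumption; `torch.cuda.is_available()` is the standard PyTorch check. The sketch also returns 'cpu' when PyTorch is not installed at all, so it can be run anywhere.

```python
def pick_device() -> str:
    """Hypothetical helper: return 'cuda' when a CUDA GPU is usable, else 'cpu'."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed: nothing can run on the GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The model and the tokenized inputs would then be moved with `.to(device)` instead of `.to('cuda')`.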

One further comment: the split_text function handles text files that are too big for the model to process all at once. It splits such a text into parts and feeds these to the model one by one.
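
The idea of splitting can be sketched as follows. This is not the pipeline's `split_text`: the sketch assumes that `max_length` counts characters and that splitting on whitespace is acceptable, and the name `split_on_words` is made up for the example.

```python
from typing import List


def split_on_words(text: str, max_length: int = 750) -> List[str]:
    """Hypothetical helper: cut text into parts of at most max_length characters.

    Words are never cut in half, so a single word longer than max_length
    ends up as its own (too long) part.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_length and current:
            # Adding this word would exceed the limit: close the current part.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


long_text = "word " * 400
parts = split_on_words(long_text, max_length=750)
print(len(parts), [len(part) for part in parts])
```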

## Step 3: creating a complete dataframe

Not much has to be said here. All English files and (translated) Dutch files are gathered and put underneath each other in one pandas dataframe, which is then exported to .csv.
This is the function:

In [ ]:
    # preprocess = PreprocessFromMap(r'00 2\00', 'txt_00 2',
    #                                to_map=False,
    #                                edit_txt=False,
    #                                divide_into_folders=False,
    #                                translate=False,
    #                                construct_df=True)
    
    # preprocess.construct_df()

# Filter out noise

When looking at the complete dataframe, a lot of transcripts consisted of things like song lyrics, meaningless words and instances where content extraction would be nearly impossible (even for human eyes). Of course, some transcripts only make sense together with the corresponding video, and this is probably the main cause of these 'meaningless transcripts'. For clarity's sake, here I am only talking about utterly meaningless transcripts and not transcripts that were a little vague. It is good to keep in mind that this is still social media data, and thus riddled with slang and abbreviations. Examples of the things that in my opinion need filtering are:

eng_7282795689363393800.txt,"  Will, will, will, will, will, will,  will, will, will, will, will, will, will,  will, will, will, will,

eng_7283110983239847200.txt,"  Run, run, run,  run!  I'm all set.  Oh, oh, oh, oh! One "

It is clear that these will not be useful for analysis. If the reader disagrees, this step can be skipped and the analysis can be done on the raw dataframe.

This step required a lot of experimentation, since you want to filter out the mess while removing as little as possible of the (even slightly) meaningful content. The first idea was to use k-means and hope to find clusters of mess (or noise) that I could then simply filter out. I tried this by making embeddings of each transcript with the google/t5-base model from the Hugging Face platform. These embeddings could then be clustered. This did not work out, probably because the data is too high-dimensional. I also tried PCA before clustering, but this only helped a little. Judging from the silhouette score, there was no clear cluster division present. The highest silhouette score achieved (after PCA) was about 0.095, which in my opinion is terrible. I also tried other clustering algorithms (DBSCAN and Gaussian mixture modelling). These did not work out either. If someone wants to repeat this (for example with better embeddings), the code for the clustering is kept in the supplementary code in this notebook.


## Step 1

Since the first idea did not work out, I had to get a little creative and decided that a heuristic approach might work well. A lot of these noisy lines followed a certain pattern. They either had too few words to extract meaning from, or they repeated a lot of words (as in the first example).
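
The two patterns can be sketched as a simple check. This is an illustration, not the code of the `Filtering` class: it assumes that word diversity means the share of unique words in a transcript and that values at or below the limits count as noise. The name `looks_like_noise` is made up.

```python
def looks_like_noise(text: str, max_words: int = 10, max_diversity: float = 0.3) -> bool:
    """Hypothetical check: True if the text is very short or very repetitive."""
    # Lowercase the words and strip surrounding punctuation before counting.
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if len(words) <= max_words:
        return True
    # Assumed definition of diversity: unique words divided by total words.
    diversity = len(set(words)) / len(words)
    return diversity <= max_diversity


repetitive = "Will, " * 17
normal = ("Today I went to the market and bought some fresh vegetables "
          "for dinner with friends")
print(looks_like_noise(repetitive), looks_like_noise(normal))
```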

In [ ]:
from filtering import Filtering

# new_csv = Filtering('translated_df_copy.csv', do_heuristic_filter=True, max_words_heuristic=10, max_diversity_heuristic=0.3, do_perplexity_filter=False, filter_out_perplexity=False)

# new_csv.heuristics(translated_df)

As you can see, this function takes two parameters (max_words_heuristic and max_diversity_heuristic). These parameters have to be tuned manually to check that the filter does not remove too much meaningful content. The parameters are set by default to 10 and 0.3, because this configuration worked best in my testing with n=100 randomly selected filtered-out transcripts. Of course, this testing consisted of subjectively judging whether the filter removed something I thought had meaning, and vice versa. The rough results from this subjective test are below:

Positive class = meaning unclear, 
Negative class = meaning clear


| | Actual pos | Actual neg |
|:---|---:|---:|
| Predicted pos | 95 | 5 |
| Predicted neg | 9 | 91 |

Reading the rows as predictions and the columns as the actual class, this means the precision with these parameters is 0.95 (95/100) and the recall (true positive rate) is about 0.91 (95/104).
A further note is that most of the mistakes came from the word diversity parameter, so a case could be made for a slightly higher setting (maybe 0.35 or 0.40). However, I was pretty pleased with these results.
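
For reference, the standard metrics can be computed directly from the four counts in the table, reading the rows as predictions and the columns as the actual class:

```python
# Counts taken from the table above (positive class = meaning unclear).
tp, fp = 95, 5   # predicted positive: actually positive / actually negative
fn, tn = 9, 91   # predicted negative: actually positive / actually negative

precision = tp / (tp + fp)      # share of filtered transcripts that were truly unclear
recall = tp / (tp + fn)         # share of truly unclear transcripts that were filtered
specificity = tn / (tn + fp)    # share of clear transcripts that were kept

print(f"precision={precision:.3f} recall={recall:.3f} specificity={specificity:.3f}")
```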

## Step 2

After the heuristics, I wanted some way to filter out all the things that did not make sense, but fell outside of having to do with too little word diversity or too little text. These are things where there are a lot of random words, but absolutely no (or very little) meaning.

For this problem, I eventually landed on a rather unorthodox solution: I thought it could be worth a try to use the perplexity score. Perplexity is normally used to evaluate the performance of language models. However, I thought that if you take a language model that has been shown to perform well, you can calculate the perplexity score for each transcript and remove those that are above a certain threshold. The perplexity score is a measure of how 'surprised' the model is by each word, given the words that came before it (for LLMs this can include a long context).
Again, since perplexity is usually used as a measure of overall model performance on a text corpus, this method is unorthodox, but with a carefully chosen exclusion threshold it seemed to work very well.
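
As a small worked example of the definition: perplexity is the exponential of the average negative log-probability that the model assigns to each token. The sketch below uses made-up token probabilities instead of a real model, so the numbers are purely illustrative.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Made-up probabilities: a 'predictable' and a 'surprising' sequence.
predictable = [math.log(0.5)] * 6
surprising = [math.log(0.01)] * 6
print(perplexity(predictable), perplexity(surprising))
```

A model that gives every token a probability of 0.5 has a perplexity of about 2, while one that gives every token 0.01 has a perplexity of about 100, so incoherent text ends up with the higher score.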

In [ ]:

# new_csv = Filtering('translated_df_copy.csv', do_heuristic_filter=False, do_perplexity_filter=True, model='gpt2-xl', filter_out_perplexity=True, perplexity_threshold=100)

# new_csv.perplexity_filter(translated_df)
# new_csv.filter_away_perplexity(translated_df)

These two functions take two parameters: model and perplexity_threshold. The model is initialized by default as gpt2-xl. I wanted to choose a model that is known to perform pretty well, and I chose the largest variant of GPT-2 in the hope that it separates finer details better. This is a rather large model that takes a while to download and run, so of course this can be changed to e.g. 'gpt2'. Perplexity is calculated from the loss of the model and added as a separate column in the dataframe (without filtering anything yet).

One thing to mention is that I had the gpt2 deployed with truncation enabled. This requires some explanation. The model can take only a maximum of 1024 tokens. Luckily, this is quite a lot. However, some of the transcripts have more than this and thus, could not be run. I decided that the first 1024 tokens might be enough to determine whether or not this transcript was vague and messy. This way I could cut all the tokens after 1024 and the perplexity of the longer transcripts is calculated on the first 1024 tokens. You could debate this choice. However, I think this is not so terrible an assumption to make that you could determine whether the transcript is good or not by the first 1024 tokens (since this is quite a lot of words already). If you disagree, dividing the text into chunks and inserting each part separately (maybe with a sliding window to keep as much context as possible?), and taking the weighted perplexity might also work.
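
The chunked alternative could look roughly like the sketch below. It only covers the bookkeeping: which token ranges go to the model, how many new tokens each window scores, and how the per-window losses are combined. The names `window_spans` and `weighted_perplexity` and the stride of 512 are assumptions; the mean loss per window would still have to come from the model.

```python
import math
from typing import List, Sequence, Tuple


def window_spans(n_tokens: int, max_length: int = 1024,
                 stride: int = 512) -> List[Tuple[int, int, int]]:
    """Hypothetical helper: (start, end, n_scored) per sliding window.

    Each window sees up to max_length tokens, but only the n_scored tokens
    that no earlier window has scored count towards the loss.
    """
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + max_length, n_tokens)
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans


def weighted_perplexity(mean_losses: Sequence[float], n_scored: Sequence[int]) -> float:
    """Combine per-window mean losses, weighted by the tokens each one scored."""
    total = sum(n_scored)
    weighted_loss = sum(loss * n for loss, n in zip(mean_losses, n_scored)) / total
    return math.exp(weighted_loss)


spans = window_spans(2500)
print(spans)
```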

This second function (when enabled) makes a new dataframe in which the rows with a perplexity score above the threshold are filtered out. From subjective inspection, a good threshold is 100. The lower you set it, the more strictly it judges the coherence of sentences and context. However, this is still social media data, which is riddled with 'weird' sentence structures, slang and skips, even in the text I'd rather keep. Thus, it is a good idea to set this threshold fairly high to leave some room. However, you can adjust it depending on how much you want filtered out.

Both of these methods deleted a large number of rows that were mostly filled with messy data.
With the current tuning of parameters: 
The heuristics approach filtered out 429 rows
The perplexity approach filtered out 622 rows
So, out of 6210 rows, 1051 rows were filtered out (about 17%)

You could argue this is too strict. However, I noticed that it mostly filters out the mess. Still, a lot of data is left that could be hard to classify. But such is to be expected with social media data. 

# Initial Analyses

With the filtered dataset, I could try some initial analyses on the data. This required a lot of experimenting and resulted in a lot of data and interesting relationships to explore. In order to not present anything double or unnecessary, I will just move on to some of the visualizations I made of the initial results. 

**IMPORTANT**: Reading this notebook on GitHub might not display the graphs, since the image files are accessed locally.


NOTE: I only did two kinds of NLP analysis: a sentiment analysis and a topic analysis. For both analyses I used a similar model, namely a pretrained RoBERTa model trained on tweets.

sentiment: "cardiffnlp/twitter-roberta-base-sentiment-latest"
topic: "cardiffnlp/twitter-roberta-large-topic-latest"

## Sentiment

For sentiment, I first have some raw percentages. There were few enough of these that I did not think it necessary to make a graph out of them.

Out of 5158 transcripts
2120 transcripts were labelled as positive (41.1%).
1334 transcripts were labelled as negative (25.9%).
1704 transcripts were labelled as neutral (33.0%).

1175 of the 2120 (55.4%) positive transcripts were labelled with confidence over 0.8 (confidently positive)
669 of the 2120 (31.6%)  positive transcripts were labelled with confidence over 0.9 (very confidently positive)

402 of the 1334 (30.1%) negative transcripts were labelled with confidence over 0.8 (confidently negative)
155 of the 1334 (11.6%) negative transcripts were labelled with confidence over 0.9 (very confidently negative)

However, I did make a density plot of the confidences for each of the different sentiment labels:

![Alt text](results_graphs/roberta_sent_density.png)

## Topic

The graph below shows the distribution of all the classes present in the transcripts as labelled by the RoBERTa model.
Note that the first graph shows a high-threshold variant, whereas the second graph shows a low-threshold variant. The threshold pertains to the confidence of the labelling. For the high threshold, I only wanted the model to assign a label if it was at least 0.5 (50%) sure it was right. This is very strict. The low-threshold variant allows for labelling when the model is at least 0.1 (10%) sure.
Because of the lower threshold, the model could also pick up on a secondary label in some cases. The grey part of the bars in the second graph indicates the presence of secondary labels. Secondary labels allow a bit more nuance to be detected in some cases, since not all texts necessarily have only one subject or topic. Interestingly, the distribution of classes stays roughly the same between the high and the low threshold, and even between primary labels only and primary plus secondary labels. There are some small changes, for example for relationships. This is likely because a lot of primary labels are diaries and daily life, and those can often also qualify for relationships as a second label. You can see this in the heatmap later.

![Alt text](results_graphs/roberta_topic_distribution.png)
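
A minimal sketch of the thresholding is given below. It assumes that the model returns one confidence score per topic and that at most one secondary label is kept; the helper name `assign_topics` and the example scores are made up.

```python
from typing import Dict, Optional, Tuple


def assign_topics(scores: Dict[str, float],
                  threshold: float) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: primary and secondary label at or above the threshold."""
    # Sort the labels that pass the threshold from most to least confident.
    kept = sorted(
        ((score, label) for label, score in scores.items() if score >= threshold),
        reverse=True,
    )
    primary = kept[0][1] if kept else None
    secondary = kept[1][1] if len(kept) > 1 else None
    return primary, secondary


# Made-up scores for one transcript.
scores = {"diaries_&_daily_life": 0.62, "relationships": 0.18, "news_&_social_concern": 0.03}
print(assign_topics(scores, threshold=0.5))
print(assign_topics(scores, threshold=0.1))
```

With the high threshold only the primary label survives; with the low threshold the same transcript also gets a secondary label.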

## Sentiment over topics

The graph below shows the distribution of sentiment labels per individual topic. You can see that the most frequent sentiment differs a lot depending on the topic. Most interesting is the dominance of negative labels within social concern and news. Furthermore, you can see the positive label being quite dominant in almost all categories, especially within fashion and style. This is likely because fashion and style often go together with advertisement, which is usually positive (just a thought).

![Alt text](results_graphs/roberta_sent_over_topic.png)

## Topic co-occurrence

Below you will see a heatmap showing the co-occurrence of topics. With the introduction of primary and secondary topics, it is possible to see which topics most often go together. I don't think there are a lot of super interesting things here. Many different topics co-occur with diaries and daily life, which makes sense since it is the largest category. You can especially see relationships and family often co-occurring with diaries. This suggests to me that it might not be a bad idea to group these three together in one class in the future?

![Alt text](results_graphs/roberta_topic_co_ocurrence.png)
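
The counts behind a heatmap like this can be gathered with a simple counter. The sketch assumes that each transcript has a primary label and an optional secondary label, and it counts ordered (primary, secondary) pairs; the helper name `count_co_occurrence` and the example labels are made up.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def count_co_occurrence(pairs: Iterable[Tuple[str, Optional[str]]]) -> Counter:
    """Hypothetical helper: count how often each (primary, secondary) pair occurs."""
    counts: Counter = Counter()
    for primary, secondary in pairs:
        if secondary is None:
            # Transcripts with only a primary label do not co-occur with anything.
            continue
        counts[(primary, secondary)] += 1
    return counts


# Made-up labels for four transcripts.
pairs = [
    ("diaries_&_daily_life", "relationships"),
    ("diaries_&_daily_life", "family"),
    ("diaries_&_daily_life", "relationships"),
    ("sports", None),
]
counts = count_co_occurrence(pairs)
print(counts.most_common())
```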

## Some closing notes for the initial analysis

The analyses with the RoBERTa model are included in the pipeline as part of an analysis class, so that in the future they can be run on all the data. This script will be expanded with models other than RoBERTa, which would make it possible to compare model results.

In [ ]:
# from init_analysis import NLP_Analysis
# 
# analysis = NLP_Analysis('translated_df_copy_heuristic_perplexity_out100.csv')
# analyzed_df = analysis.roberta_tweet()

# Supplementary code

## KMeans and other clustering algorithms on T5-Embeddings

In [1]:
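# import matplotlib.pyplot as plt
# from sklearn.cluster import DBSCAN, KMeans
# from sklearn.decomposition import PCA
# from sklearn.metrics import silhouette_score
# from sklearn.mixture import GaussianMixture
# from sklearn.preprocessing import StandardScaler
#
# def pca(embeddings):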
#     scaler = StandardScaler()
#     z_embeddings = scaler.fit_transform(embeddings)
#     pca_model = PCA(n_components=160)
#     reduced_embeddings = pca_model.fit_transform(z_embeddings)
#     explained_variance_ratio = pca_model.explained_variance_ratio_
# 
#     def n_components():
#         cum_var = explained_variance_ratio.cumsum()
#         plt.figure(figsize=(8, 6))
#         plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o', linestyle='--', color='red')
#         plt.xlabel('components')
#         plt.ylabel('cum variance')
#         plt.title('PCA scree plot')
#         plt.grid(True)
#         plt.show()
# 
#         n_components_80 = next(i for i, cumulative in enumerate(cum_var) if cumulative >= 0.80) + 1
#         print(n_components_80)
# 
#     # n_components()
#     # The 90% variance argues for 326 components
#     # The 80% variance argues for 186 components
#     # This still seems large for kmeans (curse of dimensionality), but let's try it
#     # Otherwise, different clustering algorithm or kmeans_mini batch
# 
#     return reduced_embeddings, pca_model

# def kmeans(reduced_embeddings):
# 
#     def scree(embeddings):
#         inertia = []
#         for k in range(1,21):
#             kmeans_model = KMeans(n_clusters=k)
#             kmeans_model.fit(embeddings)
#             inertia.append(kmeans_model.inertia_)
# 
#         plt.plot(range(1, 21), inertia, marker='o', color='red')
#         plt.title('scree')
#         plt.xlabel('n_clusters')
#         plt.ylabel('inertia')
#         plt.show()
#     # scree(reduced_embeddings)
#     # Scree plot indicated 10 clusters
#     kmeans_model = KMeans(n_clusters=10)
#     kmeans_model.fit(reduced_embeddings)
#     silhouette = silhouette_score(reduced_embeddings, kmeans_model.labels_)
#     return kmeans_model.labels_, silhouette
# 
#     # silhouette score of this kmeans was 0.08 which is very poor

# This only identified one cluster
# def db_cluster(reduced_embeddings):
#     dbscan = DBSCAN(eps=2, min_samples=770)
#     dbscan.fit(reduced_embeddings)
#     labels = dbscan.labels_
#     score = silhouette_score(reduced_embeddings, labels, metric='cosine')
# 
#     return labels, score
# 
# 
# # This also gave silhouette score of at most 0.975
# def GMM(reduced_embeddings):
#     gmm = GaussianMixture(n_components=12)
#     gmm.fit(reduced_embeddings)
#     labels = gmm.predict(reduced_embeddings)
#     return labels, gmm
