# BERTopic


In this notebook, all 12 dataframes (TikTok and Instagram) are combined into one so that a single, shared set of topics is derived for all data points. The plan is to add to each row of the combined dataframe its topic and the words BERTopic associates with that topic. Afterwards, the topic information is merged back into the 12 individual dataframes.


The `bertopic` package for transformer based topic modeling is used.

Documentation: https://maartengr.github.io/BERTopic/index.html and a tutorial notebook: https://github.com/MaartenGr/BERTopic/blob/master/notebooks/BERTopic.ipynb


Paper: Grootendorst, Maarten: "BERTopic: Neural topic modeling with a class-based TF-IDF procedure", arXiv preprint arXiv:2203.05794, 2022.


Parts of Michael Achmann's notebook are used:
```
Michael Achmann. (2023). michaelachmann/social-media-lab: 2023-12-04 (v0.0.6). Zenodo. https://doi.org/10.5281/zenodo.8199901
```

## Steps:

1. Merge the 12 data frames into one big data frame using pandas' concat function.
2. Perform BERTopic analysis on the combined data frame to obtain topics and associated words for each row.
3. Add the BERTopic and BERT words columns to the combined data frame.
4. Merge the updated information back into the original 12 data frames using the unique IDs.

### 1. Create one big dataframe

1. Access all 12 dataframes
2. Concat their ids, descriptions (Instagram), OCRs (Instagram), body (TikTok) columns into one new dataframe

In [64]:
import pandas as pd

In [87]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [88]:
# INSTAGRAM
df_insta_climatecrisis = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/INSTA_DATA_COMPLETE/climatecrisis.csv')
df_insta_climatechange = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/INSTA_DATA_COMPLETE/climatechange.csv')
df_insta_savetheplanet = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/INSTA_DATA_COMPLETE/savetheplanet.csv')
df_insta_klimakrise = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/INSTA_DATA_COMPLETE/klimakrise.csv')
df_insta_klimawandel = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/INSTA_DATA_COMPLETE/klimawandel.csv')
df_insta_klimaschutz = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/INSTA_DATA_COMPLETE/klimaschutz.csv')

#TIKTOK
df_tiktok_climatecrisis = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/TikTok_DATA_COMPLETE/tiktok_climatecrisis_ende.csv')
df_tiktok_climatechange = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/TikTok_DATA_COMPLETE/tiktok_climatechange_ende.csv')
df_tiktok_savetheplanet = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/TikTok_DATA_COMPLETE/tiktok_savetheplanet_ende.csv')
df_tiktok_klimakrise = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/TikTok_DATA_COMPLETE/tiktok_klimakrise_ende.csv')
df_tiktok_klimawandel = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/TikTok_DATA_COMPLETE/tiktok_klimawandel_ende.csv')
df_tiktok_klimaschutz = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/TikTok_DATA_COMPLETE/tiktok_klimaschutz_ende.csv')

In [None]:
df_insta_climatecrisis.head(2)

In [None]:
df_tiktok_climatecrisis.head(2)

Keep only the `Id` and `Description` columns of the Instagram dataframes (the OCR text is not included in this run).

In [88]:
# This cell needs to be run 6 times, once per Instagram dataframe
# INSTAGRAM

# Create a dataframe that holds only the id and the description
df_description = df_insta_klimawandel[['Id', 'Description']].copy()  # swap in each of the 6 Instagram dataframes

# Rename the text column
df_description = df_description.rename(columns={'Description': 'Text'})

# Add the 'Text Type' column
df_description['Text Type'] = 'Caption'

# Drop rows where 'Text' is NaN or empty
new_df = df_description.dropna(subset=['Text'])
new_df = new_df[new_df['Text'].str.strip() != '']

# Reset the index (returns a new dataframe, which avoids a SettingWithCopyWarning)
new_df = new_df.reset_index(drop=True)

In [74]:
# run only for the first of the six CSVs; the other five are appended in the next cell
all_insta = new_df

In [89]:
# do this 5 times for the 5 other csvs
all_insta = pd.concat([all_insta, new_df], ignore_index=True)
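
Instead of re-running the preparation cell by hand for every CSV, the same steps can be wrapped in a function and applied in a loop. The following is a minimal sketch, not the code that was run for this notebook: `prepare_captions` is a hypothetical helper and the two tiny demo frames stand in for the dataframes loaded above.

```python
import pandas as pd


def prepare_captions(df, id_col, text_col):
    """Return an Id/Text/Text Type frame without empty captions (hypothetical helper)."""
    out = df[[id_col, text_col]].rename(columns={id_col: "Id", text_col: "Text"})
    out = out.dropna(subset=["Text"])                # drop missing captions
    out = out[out["Text"].str.strip() != ""]         # drop whitespace-only captions
    return out.assign(**{"Text Type": "Caption"}).reset_index(drop=True)


# Demo data standing in for two of the Instagram dataframes
demo_frames = [
    pd.DataFrame({"Id": ["a_1.jpg", "a_2.jpg"], "Description": ["Save the planet", None]}),
    pd.DataFrame({"Id": ["b_1.jpg", "b_2.jpg"], "Description": ["  ", "Klimaschutz jetzt"]}),
]

# One concat over all prepared frames replaces the repeated append cell
all_captions = pd.concat(
    [prepare_captions(frame, "Id", "Description") for frame in demo_frames],
    ignore_index=True,
)
print(all_captions)
```

For the TikTok dataframes the same helper would be called with the column names `"id"` and `"body"`.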

In [90]:
len(all_insta)

3802

In [45]:
all_insta[3500:3505]

Unnamed: 0,Id,Text,Text Type
3500,klimaschutz_163.jpg,Danke an @ingwar.pero⁠\n.⁠\n.⁠\n.⁠\n#klimaschu...,Caption
3501,klimaschutz_164.jpg,"Am einen Tag kleben sie auf der Straße, am and...",Caption
3502,klimaschutz_165.jpg,Deutschland ist einer Analyse des Forschungsin...,Caption
3503,klimaschutz_166.jpg,Streit zwischen ÖVP und Grünen: Dass sich Nied...,Caption
3504,klimaschutz_167.jpg,Ja zum Klimaschutz-Gesetz: 59.1 Prozent der St...,Caption


In [91]:
# save for later
all_insta.to_csv("BERTopic_Insta_DataPREP_onlyDescription.csv")

Keep only the `id` and `body` columns of the TikTok dataframes.

In [66]:
# This cell needs to be run 6 times, once per TikTok dataframe
# TIKTOK

# Create a dataframe that holds only the id and the caption text
df_body = df_tiktok_klimaschutz[['id', 'body']].copy()  # swap in each of the 6 TikTok dataframes

# Rename the columns to match the Instagram dataframe
df_body = df_body.rename(columns={'body': 'Text', 'id': 'Id'})

# Add the 'Text Type' column
df_body['Text Type'] = 'Caption'

# Drop rows where 'Text' is NaN or empty
new_df = df_body.dropna(subset=['Text'])
new_df = new_df[new_df['Text'].str.strip() != '']

# Reset the index (returns a new dataframe, which avoids a SettingWithCopyWarning)
new_df = new_df.reset_index(drop=True)
print(new_df)

                      Id                                               Text  \
0    7229641670344690970  Dürfen Kinder keine Bedürfnisse äußern? #kinde...   
1    6971355508833864965  #Rassismus in den #Medien #Fridaysforfuture #K...   
2    7095428402261282053  Hey Leute, ich hab in dem Video paar Tipps zum...   
3    7313149913834032416  Fidnest du die strafe angemessen? #klimakleber...   
4    7280502024813251872  Klima-Krise erklärt in 23 Sekunden! #klima #kl...   
..                   ...                                                ...   
556  7099500572104985861  Doch beim Second-Hand Kauf fällt all das weg, ...   
557  7276326619596344608  Quelle:Hr Fernsehen maintower #notarzt #notdie...   
558  7187443510562393350  Was geht ab in Lützerath? 🏚 ➡️ Aktivisten habe...   
559  7267819656864435488  Die #Nutztierhaltung ist haupttreiber der #Umw...   
560  7050114244121054470  Was würdest du tun?!🤯 Drück das Plus weg!#Klim...   

    Text Type  
0     Caption  
1     Caption  
2  

In [50]:
# only run for the first tiktok csv
all_tiktok = new_df

In [67]:
# then run 5x to append the other 5 tiktok dfs
all_tiktok = pd.concat([all_tiktok, new_df], ignore_index=True)

In [68]:
len(all_tiktok)

3103

In [52]:
all_tiktok.head(3)


Unnamed: 0,Id,Text,Text Type
0,7197008454655937798,Tackling the climate crisis starts with realis...,Caption
1,6992513609523891462,ways to combat the climate crisis 💚 #foryou #c...,Caption
2,7104514452074007813,climate change says hello 👋🏻💀 #singer #climate...,Caption


In [69]:
# save tiktok df
all_tiktok.to_csv("BERTopic_TikTok_DataPREP.csv")

In [92]:
# concat insta and tiktok dfs
allData_df = pd.concat([all_insta, all_tiktok], ignore_index=True)

In [93]:
# save the complete df
allData_df.to_csv("BERTopic_allDataPREP.csv")

### 2. BERTopic analysis

Perform BERTopic analysis on the combined data frame to obtain topics and associated words for each row.

In [1]:
!pip install -q bertopic



In [2]:
from bertopic import BERTopic

In the following cells we download the NLTK stopword lists for English, Spanish, Italian, and German and apply them as described in [the documentation](https://maartengr.github.io/BERTopic/faq.html#how-do-i-remove-stop-words).

In [3]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

STOPWORDS = stopwords.words('english') + stopwords.words('spanish') + stopwords.words('italian') + stopwords.words('german')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=STOPWORDS)
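
To see what this vectorizer configuration does, here is a small illustrative sketch on two made-up sentences. The four-word stopword list and the sentences are assumptions for the demo only; the notebook itself uses the full NLTK lists. Stopwords are removed before n-grams are built, so a bigram can join two words that were originally separated by a stopword.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny hypothetical stopword list, standing in for the NLTK lists used above
demo_stopwords = ["the", "is", "der", "ist"]

# Same settings as vectorizer_model: unigrams and bigrams, stopwords removed
demo_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words=demo_stopwords)
demo_vectorizer.fit(["The climate crisis is real", "Der Klimawandel ist real"])

# vocabulary_ maps each learned term to its column index
vocabulary = sorted(demo_vectorizer.vocabulary_)
print(vocabulary)
```

The resulting vocabulary contains terms such as `climate crisis` and `crisis real` but none of the stopwords, which is why the topic representations further down are free of filler words.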

Now we're ready to create our corpus in `docs`, a list of text documents to pass to `BERTopic`.

In [None]:
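# all_data_df is the combined dataframe from step 1 (named allData_df there);
# the cell that loads it, presumably from the saved CSV, is not shown in this notebook
docs = all_data_df["Text"]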

In [10]:
print(docs)

0       Not your typical ice cube. 🧊\n\nAlthough it is...
1       Not your typical ice cube. 🧊\n\nAlthough it is...
2       2023 will be the hottest year recorded in hist...
3       I just know someone would try drinking them to...
4       According to one professor, we can stop global...
                              ...                        
6900    Doch beim Second-Hand Kauf fällt all das weg, ...
6901    Quelle:Hr Fernsehen maintower #notarzt #notdie...
6902    Was geht ab in Lützerath? 🏚 ➡️ Aktivisten habe...
6903    Die #Nutztierhaltung ist haupttreiber der #Umw...
6904    Was würdest du tun?!🤯 Drück das Plus weg!#Klim...
Name: Text, Length: 6905, dtype: object


In [11]:
# We're dealing with German and English texts, therefore we choose 'multilingual'
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True, vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)

2024-03-06 16:11:54,786 - BERTopic - Embedding - Transforming documents to embeddings.



2024-03-06 16:22:03,351 - BERTopic - Embedding - Completed ✓
2024-03-06 16:22:03,354 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-06 16:22:48,872 - BERTopic - Dimensionality - Completed ✓
2024-03-06 16:22:48,875 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-06 16:22:56,219 - BERTopic - Cluster - Completed ✓
2024-03-06 16:22:56,241 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-06 16:23:00,201 - BERTopic - Representation - Completed ✓


#### Topic Visualization and Reduction

Now that the model is fitted, we inspect the topics and reduce them to a manageable number (from 141 to 20). The topics are then labelled and added to `all_data_df` in a new column `BERTopic`.

We look at the most frequent topics first as they best represent the collection of documents.

In [62]:
freq = topic_model.get_topic_info(); freq

Unnamed: 0,Topic,Count,Name,CustomName,Representation,Representative_Docs
0,-1,2714,-1_klimawandel_klimaschutz_climatechange_klima...,"klimawandel, klimaschutz, climatechange","[klimawandel, klimaschutz, climatechange, klim...",[#repost from @theslowfactory The atrocities a...
1,0,2447,0_klimaschutz_klimawandel_klimakrise_climatech...,"klimaschutz, klimawandel, klimakrise","[klimaschutz, klimawandel, klimakrise, climate...",[#klima #klimawandel #klimakrise #klimakatastr...
2,1,279,1_savetheplanet_pa_pa pa_viral,"savetheplanet, pa, pa pa","[savetheplanet, pa, pa pa, viral, foryou, fyp,...",[I have released all my dreams\n\n#surrealism ...
3,2,165,2_ice_gletscher_schnee_antarktis,"ice, gletscher, schnee","[ice, gletscher, schnee, antarktis, arktis, kl...",[Selbst in der ewigen Antarktis ist der Klimaw...
4,3,157,3_heat_record_temperatures_grad,"heat, record, temperatures","[heat, record, temperatures, grad, temperature...",[Global temperatures in July and August were a...
5,4,146,4_congo_kinder_people_climate,"congo, kinder, people","[congo, kinder, people, climate, help, 000, tu...",[Father holds the hand of his 15-year-old daug...
6,5,142,5_oil_shell_willow_stop,"oil, shell, willow","[oil, shell, willow, stop, project, climate, a...",[#StopWillow || This week is our last chance t...
7,6,116,6_penguins_daisygilardini_coral_penguin,"penguins, daisygilardini, coral","[penguins, daisygilardini, coral, penguin, spe...",[Photo by @daisygilardini / Chinstrap penguins...
8,7,109,7_city_cities_carfree_carsdestroyedourcities,"city, cities, carfree","[city, cities, carfree, carsdestroyedourcities...",[One of the most successful freeway removal pr...
9,8,96,8_wald_bäume_tree_klimawandel,"wald, bäume, tree","[wald, bäume, tree, klimawandel, forests, land...",[🇩🇪 Klimawandel und Naturschutz \n\nDer Klimaw...


In [63]:
len(freq)   # at first it was 141, it was then reduced to 20

20

In [None]:
freq.to_csv('freq.csv', index=False) # to save the topic info in another df

In [21]:
# this was used to reduce the topics to 20; all the visualisations were run both before and after
topic_model.reduce_topics(docs, nr_topics=20)

2024-03-06 16:27:05,322 - BERTopic - Topic reduction - Reducing number of topics
2024-03-06 16:27:09,363 - BERTopic - Topic reduction - Reduced number of topics from 141 to 20


<bertopic._bertopic.BERTopic at 0x7b85fba61ba0>


Topic -1 collects all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that were generated:

In [64]:
topic_model.visualize_topics()

In [72]:
topic_model.get_topic(0)  # Select the most frequent topic

[('klimaschutz', 0.01499402919947828),
 ('klimawandel', 0.013647721124402159),
 ('klimakrise', 0.013107535636366222),
 ('climatechange', 0.010022381424719678),
 ('klima', 0.009721155766507781),
 ('climate', 0.008749924084627238),
 ('climatecrisis', 0.008112924358257629),
 ('mehr', 0.007935651288609073),
 ('deutschland', 0.007247855910014634),
 ('fyp', 0.005912089484850443)]

In [74]:
topic_model.visualize_barchart(top_n_topics=20)

In [85]:
# to add columns with the top 3 words of the topic
topic_labels = topic_model.generate_topic_labels(nr_words=3, topic_prefix=False, word_length=20, separator=', ')
topic_model.set_topic_labels(topic_labels)
# 2714 outliers
topic_model.get_topic_info()


Unnamed: 0,Topic,Count,Name,CustomName,Representation,Representative_Docs
0,-1,2714,-1_klimawandel_klimaschutz_climatechange_klima...,"klimawandel, klimaschutz, climatechange","[klimawandel, klimaschutz, climatechange, klim...",[#repost from @theslowfactory The atrocities a...
1,0,2447,0_klimaschutz_klimawandel_klimakrise_climatech...,"klimaschutz, klimawandel, klimakrise","[klimaschutz, klimawandel, klimakrise, climate...",[#klima #klimawandel #klimakrise #klimakatastr...
2,1,279,1_savetheplanet_pa_pa pa_viral,"savetheplanet, pa, pa pa","[savetheplanet, pa, pa pa, viral, foryou, fyp,...",[I have released all my dreams\n\n#surrealism ...
3,2,165,2_ice_gletscher_schnee_antarktis,"ice, gletscher, schnee","[ice, gletscher, schnee, antarktis, arktis, kl...",[Selbst in der ewigen Antarktis ist der Klimaw...
4,3,157,3_heat_record_temperatures_grad,"heat, record, temperatures","[heat, record, temperatures, grad, temperature...",[Global temperatures in July and August were a...
5,4,146,4_congo_kinder_people_climate,"congo, kinder, people","[congo, kinder, people, climate, help, 000, tu...",[Father holds the hand of his 15-year-old daug...
6,5,142,5_oil_shell_willow_stop,"oil, shell, willow","[oil, shell, willow, stop, project, climate, a...",[#StopWillow || This week is our last chance t...
7,6,116,6_penguins_daisygilardini_coral_penguin,"penguins, daisygilardini, coral","[penguins, daisygilardini, coral, penguin, spe...",[Photo by @daisygilardini / Chinstrap penguins...
8,7,109,7_city_cities_carfree_carsdestroyedourcities,"city, cities, carfree","[city, cities, carfree, carsdestroyedourcities...",[One of the most successful freeway removal pr...
9,8,96,8_wald_bäume_tree_klimawandel,"wald, bäume, tree","[wald, bäume, tree, klimawandel, forests, land...",[🇩🇪 Klimawandel und Naturschutz \n\nDer Klimaw...


In [109]:
#NOT USED
# Replace topic numbers with custom names.
# topic_labels is ordered like get_topic_info() (outlier topic -1 first), so indexing it
# with the raw topic number would be off by one; the labels are looked up via the table instead.
# topic_model.topics_ is used because it reflects the assignments after reduce_topics.
topic_info = topic_model.get_topic_info()
label_by_topic = dict(zip(topic_info['Topic'], topic_info['CustomName']))
custom_topic_labels = [label_by_topic.get(topic, "Unknown Topic") for topic in topic_model.topics_]

# Add the custom topic labels to the DataFrame as a new column
all_data_df['CustomTopic'] = custom_topic_labels

In [None]:
# NOT USED
topics_to_merge = [-1, 0, 1]
print(topics_to_merge)
topic_model.merge_topics(docs, topics_to_merge=topics_to_merge)
topic_model.get_topic_info().head()

### 3. Create Dataframe with added BERTopic Column



Add the BERTopic and BERT words columns to the combined data frame.

In [31]:
# df of docs and topic
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})

In [32]:
df

Unnamed: 0,Document,Topic
0,Not your typical ice cube. 🧊\n\nAlthough it is...,2
1,Not your typical ice cube. 🧊\n\nAlthough it is...,2
2,2023 will be the hottest year recorded in hist...,3
3,I just know someone would try drinking them to...,-1
4,"According to one professor, we can stop global...",-1
...,...,...
6900,"Doch beim Second-Hand Kauf fällt all das weg, ...",-1
6901,Quelle:Hr Fernsehen maintower #notarzt #notdie...,-1
6902,Was geht ab in Lützerath? 🏚 ➡️ Aktivisten habe...,16
6903,Die #Nutztierhaltung ist haupttreiber der #Umw...,-1


In [82]:
# number of post descriptions assigned to the outlier topic (-1)
n_outliers = (df['Topic'] == -1).sum()
print(n_outliers)

2714
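
With 6905 documents in total, 2714 outliers correspond to roughly 39 % of the corpus. To get the size of every topic at once rather than of a single one, `value_counts` can be used. The sketch below runs on a small made-up `Topic` column (an assumption for illustration); in the notebook the same calls would be applied to `df['Topic']`.

```python
import pandas as pd

# Made-up topic assignments standing in for df['Topic']
demo_topics = pd.Series([2, 2, 3, -1, -1, -1, 16], name="Topic")

# Number of documents per topic, largest topic first
topic_sizes = demo_topics.value_counts()

# Size and share of the outlier topic (-1)
n_outliers = int((demo_topics == -1).sum())
outlier_share = n_outliers / len(demo_topics)

print(topic_sizes)
print(f"{n_outliers} outliers ({outlier_share:.0%} of all documents)")
```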


In [77]:
all_data_df.head()

Unnamed: 0.1,Unnamed: 0,Id,Text,Text Type
0,0,climatechange_1.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,Caption
1,1,climatechange_2.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,Caption
2,2,climatechange_3.jpg,2023 will be the hottest year recorded in hist...,Caption
3,3,climatechange_4.jpg,I just know someone would try drinking them to...,Caption
4,4,climatechange_5.jpg,"According to one professor, we can stop global...",Caption


In [78]:
# Merge all_data_df with the document/topic dataframe df, matching 'Text' to 'Document'.
# Note: a caption that occurs more than once matches every copy of itself, so this
# merge returns more rows than all_data_df has (see the repeated rows in the output below).
merged_df = pd.merge(all_data_df, df, left_on='Text', right_on='Document', how='left')

# Drop the extra 'Document' column and the 'Text Type' column (every row is a caption)
merged_df.drop(['Document', 'Text Type'], axis=1, inplace=True)
# merged_df now contains all_data_df with the 'Topic' column appended

In [81]:
merged_df.head()

Unnamed: 0.1,Unnamed: 0,Id,Text,Topic
0,0,climatechange_1.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,2
1,0,climatechange_1.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,2
2,1,climatechange_2.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,2
3,1,climatechange_2.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,2
4,2,climatechange_3.jpg,2023 will be the hottest year recorded in hist...,3
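
The repeated rows in this output come from merging on the caption text: identical captions match each other, so every duplicate is multiplied. Because `docs` was taken directly from `all_data_df["Text"]`, the topic list has one entry per row in the same order, and the topics could also be attached by position. The sketch below contrasts the two approaches on made-up data; the frame and topic list are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

# Made-up data: two posts share the same caption
demo_data = pd.DataFrame({
    "Id": ["post_1", "post_2", "post_3"],
    "Text": ["same caption", "same caption", "other caption"],
})
demo_topics = [2, 2, 3]  # one topic per row, in the same order as demo_data

# Text-based merge: each duplicate caption matches every copy of itself
by_text = pd.merge(
    demo_data,
    pd.DataFrame({"Document": demo_data["Text"], "Topic": demo_topics}),
    left_on="Text",
    right_on="Document",
    how="left",
)

# Position-based assignment: exactly one topic per original row
by_position = demo_data.assign(Topic=demo_topics)

print(len(by_text), len(by_position))
```

In the notebook, the position-based variant would amount to assigning `topic_model.topics_` as a new column of `all_data_df`, which keeps the row count at 6905.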


In [47]:
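# THISDF is not defined anywhere in this notebook. Judging by the columns of final_df below,
# it holds per-document topic information as returned by topic_model.get_document_info(docs).
final_df = pd.merge(all_data_df, THISDF, left_on='Text', right_on='Document', how='left')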

In [49]:
final_df

Unnamed: 0.1,Unnamed: 0,Id,Text,Text Type,Document,Topic,Name,CustomName,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,0,climatechange_1.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,Caption,Not your typical ice cube. 🧊\n\nAlthough it is...,2,2_ice_gletscher_schnee_antarktis,"ice, gletscher, schnee","[ice, gletscher, schnee, antarktis, arktis, kl...",[Selbst in der ewigen Antarktis ist der Klimaw...,ice - gletscher - schnee - antarktis - arktis ...,0.906538,False
1,0,climatechange_1.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,Caption,Not your typical ice cube. 🧊\n\nAlthough it is...,2,2_ice_gletscher_schnee_antarktis,"ice, gletscher, schnee","[ice, gletscher, schnee, antarktis, arktis, kl...",[Selbst in der ewigen Antarktis ist der Klimaw...,ice - gletscher - schnee - antarktis - arktis ...,1.000000,False
2,1,climatechange_2.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,Caption,Not your typical ice cube. 🧊\n\nAlthough it is...,2,2_ice_gletscher_schnee_antarktis,"ice, gletscher, schnee","[ice, gletscher, schnee, antarktis, arktis, kl...",[Selbst in der ewigen Antarktis ist der Klimaw...,ice - gletscher - schnee - antarktis - arktis ...,0.906538,False
3,1,climatechange_2.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,Caption,Not your typical ice cube. 🧊\n\nAlthough it is...,2,2_ice_gletscher_schnee_antarktis,"ice, gletscher, schnee","[ice, gletscher, schnee, antarktis, arktis, kl...",[Selbst in der ewigen Antarktis ist der Klimaw...,ice - gletscher - schnee - antarktis - arktis ...,1.000000,False
4,2,climatechange_3.jpg,2023 will be the hottest year recorded in hist...,Caption,2023 will be the hottest year recorded in hist...,3,3_heat_record_temperatures_grad,"heat, record, temperatures","[heat, record, temperatures, grad, temperature...",[Global temperatures in July and August were a...,heat - record - temperatures - grad - temperat...,1.000000,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9184,6902,7187443510562393350,Was geht ab in Lützerath? 🏚 ➡️ Aktivisten habe...,Caption,Was geht ab in Lützerath? 🏚 ➡️ Aktivisten habe...,16,16_lützerath_räumung_rwe_innen,"lützerath, räumung, rwe","[lützerath, räumung, rwe, innen, polizei, brau...",[Gerade beginnt die Räumung des Dorfs Lützerat...,lützerath - räumung - rwe - innen - polizei - ...,1.000000,False
9185,6903,7267819656864435488,Die #Nutztierhaltung ist haupttreiber der #Umw...,Caption,Die #Nutztierhaltung ist haupttreiber der #Umw...,-1,-1_klimawandel_klimaschutz_climatechange_klima...,"klimawandel, klimaschutz, climatechange","[klimawandel, klimaschutz, climatechange, klim...",[#repost from @theslowfactory The atrocities a...,klimawandel - klimaschutz - climatechange - kl...,0.441703,False
9186,6903,7267819656864435488,Die #Nutztierhaltung ist haupttreiber der #Umw...,Caption,Die #Nutztierhaltung ist haupttreiber der #Umw...,-1,-1_klimawandel_klimaschutz_climatechange_klima...,"klimawandel, klimaschutz, climatechange","[klimawandel, klimaschutz, climatechange, klim...",[#repost from @theslowfactory The atrocities a...,klimawandel - klimaschutz - climatechange - kl...,0.659057,False
9187,6904,7050114244121054470,Was würdest du tun?!🤯 Drück das Plus weg!#Klim...,Caption,Was würdest du tun?!🤯 Drück das Plus weg!#Klim...,0,0_klimaschutz_klimawandel_klimakrise_climatech...,"klimaschutz, klimawandel, klimakrise","[klimaschutz, klimawandel, klimakrise, climate...",[#klima #klimawandel #klimakrise #klimakatastr...,klimaschutz - klimawandel - klimakrise - clima...,0.350770,False


In [50]:
final_df.to_csv("all_BERTopic_Info.csv")

In [51]:
only_basic_bert_info = final_df

In [58]:
only_basic_bert_info = only_basic_bert_info.drop(["Text Type", "Topic", "Document", "CustomName", "Representation", "Representative_Docs", "Top_n_words", "Representative_document"], axis=1)

In [59]:
only_basic_bert_info = only_basic_bert_info.rename(columns={"Name": "BERTopic"})

In [60]:
only_basic_bert_info

Unnamed: 0.1,Unnamed: 0,Id,Text,BERTopic,Probability
0,0,climatechange_1.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,2_ice_gletscher_schnee_antarktis,0.906538
1,0,climatechange_1.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,2_ice_gletscher_schnee_antarktis,1.000000
2,1,climatechange_2.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,2_ice_gletscher_schnee_antarktis,0.906538
3,1,climatechange_2.jpg,Not your typical ice cube. 🧊\n\nAlthough it is...,2_ice_gletscher_schnee_antarktis,1.000000
4,2,climatechange_3.jpg,2023 will be the hottest year recorded in hist...,3_heat_record_temperatures_grad,1.000000
...,...,...,...,...,...
9184,6902,7187443510562393350,Was geht ab in Lützerath? 🏚 ➡️ Aktivisten habe...,16_lützerath_räumung_rwe_innen,1.000000
9185,6903,7267819656864435488,Die #Nutztierhaltung ist haupttreiber der #Umw...,-1_klimawandel_klimaschutz_climatechange_klima...,0.441703
9186,6903,7267819656864435488,Die #Nutztierhaltung ist haupttreiber der #Umw...,-1_klimawandel_klimaschutz_climatechange_klima...,0.659057
9187,6904,7050114244121054470,Was würdest du tun?!🤯 Drück das Plus weg!#Klim...,0_klimaschutz_klimawandel_klimakrise_climatech...,0.350770


In [157]:
only_basic_bert_info.to_csv("only_basic_BERTopic_info.csv", index = False)

### 4. Add BERTopic to original 12 dataframes

Merge the updated information back into the original 12 data frames using the unique IDs.

In [158]:
# use this one so each single dataframe gets a column BERTopic appended
# with the topic number and the first three most frequent words
# of its topic
only_basic_bert_info[6800:6900]

Unnamed: 0.1,Unnamed: 0,id,Text,BERTopic,Probability
6800,5028,7120666070175796485,HAT DIE KÖNIGIN ÜBERLEBT ??? @beesteez #beeste...,1_savetheplanet_pa_pa pa_viral,0.352054
6801,5029,7247679923660361006,Replying to @simplyburiedinbooks How to Zero...,-1_klimawandel_klimaschutz_climatechange_klima...,0.250891
6802,5030,7267702109674638638,Can you tell im craving mexican food?🥴🌮✨🌱🌎 #wh...,-1_klimawandel_klimaschutz_climatechange_klima...,0.208912
6803,5031,7121966214003330310,#greenscreen #thereisnoplanetb #fyp #climatech...,-1_klimawandel_klimaschutz_climatechange_klima...,0.656021
6804,5032,7205676012816878891,Google “stop the willow project.” Sign every p...,-1_klimawandel_klimaschutz_climatechange_klima...,0.378042
...,...,...,...,...,...
6895,5113,7163977121520553262,"sources: darksky.org, cescos.fau.edu #savethep...",0_klimaschutz_klimawandel_klimakrise_climatech...,0.574413
6896,5114,6965056820553534725,Für Tipps folgt mir dort gerne —> mariacmdy 😂 ...,-1_klimawandel_klimaschutz_climatechange_klima...,0.399995
6897,5115,7092304430090947845,Parte 1#TZD_MOPA #semtags🙌 #stumbleguys #savet...,1_savetheplanet_pa_pa pa_viral,1.000000
6898,5116,7204901930928295211,"the willow project will effect ecosystems, ani...",5_oil_shell_willow_stop,0.592996


For the Instagram Data:
This needs to be done for each of the 6 dataframes.
So replace all the names and run it 6 times.

In [133]:
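# Left-join the topic labels onto the Instagram dataframe via the common column 'Id'.
# only_basic_bert_info can list an Id several times (see the repeated rows above), so
# duplicates are dropped first; otherwise the join would add rows and the column copied
# over in the next cell would no longer line up.
merged_insta_df = pd.merge(df_insta_klimaschutz, only_basic_bert_info[['Id', 'BERTopic']].drop_duplicates(subset='Id'), on='Id', how='left')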

In [130]:
df_insta_klimaschutz['BERTopic'] = merged_insta_df['BERTopic']
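
An alternative to merging and then copying the column over is to look up each post's topic by its id with `Series.map`, which never changes the number of rows. This is a minimal sketch on made-up frames; `demo_topics` and `demo_posts` are hypothetical stand-ins for `only_basic_bert_info` and one of the Instagram dataframes.

```python
import pandas as pd

# Stand-in for only_basic_bert_info: the first id appears twice
demo_topics = pd.DataFrame({
    "Id": ["k_1.jpg", "k_1.jpg", "k_2.jpg"],
    "BERTopic": ["0_klimaschutz", "0_klimaschutz", "2_ice"],
})

# Stand-in for one original Instagram dataframe
demo_posts = pd.DataFrame({"Id": ["k_1.jpg", "k_2.jpg", "k_3.jpg"], "Likes": [10, 20, 30]})

# Lookup table id -> topic label; keep the first entry for repeated ids
topic_by_id = demo_topics.drop_duplicates(subset="Id").set_index("Id")["BERTopic"]

# Ids without a topic (for example posts whose caption was empty) become NaN
demo_posts["BERTopic"] = demo_posts["Id"].map(topic_by_id)
print(demo_posts)
```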

In [131]:
df_insta_klimaschutz

Unnamed: 0,Id,Account,User Name,Followers at Posting,Post Created,Post Created Date,Post Created Time,Type,Total Interactions,Likes,...,Photo,Title,Description,Image Text,Sponsor Id,Sponsor Name,Total Interactions (weighted — Likes 1x Comments 1x ),Overperforming Score,language,BERTopic
0,klimaschutz_1.jpg,tagesschau,tagesschau,4589981.0,2023-07-20 10:00:11 CEST,20.07.23,10:00:11,Album,215428,213214,...,https://scontent-sea1-1.cdninstagram.com/v/t39...,,Bei Reisen durch Europa ist die klimafreundlic...,,,,215428,03. May,Language.GERMAN,-1_klimawandel_klimaschutz_climatechange_klima...
1,klimaschutz_2.jpg,ZDF heute-show,heuteshow,2169088.0,2023-01-16 15:13:19 CET,16.01.23,15:13:19,Photo,149974,149074,...,https://scontent-sea1-1.cdninstagram.com/v/t51...,,NSFW\n#heuteshow,"‎outdoor, ‎'‎Porn hub 2KAufrufe ld agen RWE & ...",,,149974,Feb 91,Language.ENGLISH,1_savetheplanet_pa_pa pa_viral
2,klimaschutz_3.jpg,tagesschau,tagesschau,4798823.0,2023-12-13 18:00:32 CET,13.12.23,18:00:32,Photo,131109,125662,...,https://scontent-sea1-1.cdninstagram.com/v/t39...,,Die Klimaschutzgruppe Letzte Generation hat in...,„Letzte Generation Weihnachtsbäüme mitFarbebes...,,,131109,Feb 57,Language.GERMAN,0_klimaschutz_klimawandel_klimakrise_climatech...
3,klimaschutz_4.jpg,tagesschau,tagesschau,4480794.0,2023-04-21 17:30:06 CEST,21.04.23,17:30:06,Album,12901,124401,...,https://scontent-sea1-1.cdninstagram.com/v/t39...,,Ein Tempolimit von 130 Kilometern pro Stunde i...,,,,12901,Jan 83,Language.GERMAN,0_klimaschutz_klimawandel_klimakrise_climatech...
4,klimaschutz_5.jpg,tagesschau,tagesschau,4540786.0,2023-06-19 12:07:00 CEST,19.06.23,12:07:00,Album,118155,114588,...,https://scontent-sea1-1.cdninstagram.com/v/t39...,,"Um die Klimaziele zu erreichen, könnten in den...",,,,118155,Jan 67,Language.GERMAN,-1_klimawandel_klimaschutz_climatechange_klima...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
683,klimaschutz_696.jpg,RTL Aktuell,rtlaktuell,471745.0,2023-05-08 14:56:29 CEST,08.05.23,14:56:29,Photo,2518,1299,...,https://scontent-sea1-1.cdninstagram.com/v/t51...,,Die jüngste Extremhitze in Spanien und anderen...,RT Laut Studie Klimawandel macht Extremhitze 1...,,,2518,Feb 28,Language.GERMAN,0_klimaschutz_klimawandel_klimakrise_climatech...
684,klimaschutz_697.jpg,AfD-Fraktion im Deutschen Bundestag,afdimbundestag,51389.0,2023-05-04 11:24:43 CEST,04.05.23,11:24:43,Photo,2511,2390,...,https://scontent-sea1-1.cdninstagram.com/v/t39...,,❌Den grünen Sumpf austrocknen!\n \n➡Laut Medie...,Milliardengeschäft durch Enteignung Der grüne ...,,,2511,01. Apr,Language.GERMAN,-1_klimawandel_klimaschutz_climatechange_klima...
685,klimaschutz_698.jpg,Jörg Spengler,joerg_spengler,11617.0,2023-11-10 06:45:16 CET,10.11.23,06:45:16,Photo,2501,2490,...,https://scontent-sea1-1.cdninstagram.com/v/t51...,,"Die #Klimakrise ist keine Frage des Glaubens, ...",,,,2501,Jan 74,Language.GERMAN,-1_klimawandel_klimaschutz_climatechange_klima...
686,klimaschutz_699.jpg,NABU Bundesverband,nabu,200623.0,2023-07-15 16:00:00 CEST,15.07.23,16:00:00,Album,2499,2483,...,https://scontent-sea1-1.cdninstagram.com/v/t51...,,Es ist mal wieder Zeit für ein paar Gute Nachr...,,,,2499,-1.26,Language.GERMAN,1_savetheplanet_pa_pa pa_viral


In [134]:
# WARNING: if executed, the CSV on Drive gets overwritten!
# Write the modified DataFrame back to a CSV file
#df_insta_klimaschutz.to_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/INSTA_DATA_COMPLETE/klimaschutz.csv', index=False)


For the TikTok Data: This needs to be done for each of the 6 dataframes. So replace all the names and run it 6 times.

In [147]:
only_basic_bert_info_TIKTOK = only_basic_bert_info.copy()  # copy, so renaming the column below leaves only_basic_bert_info untouched

In [None]:
only_basic_bert_info_TIKTOK.rename(columns={'Id': 'id'}, inplace=True)

In [159]:
# Keep only the TikTok rows: TikTok ids consist of digits only,
# whereas Instagram ids look like 'klimaschutz_163.jpg'
pattern = r'^\d+$'  # matches strings that contain only digits
filtered_df_for_TIKTOK = only_basic_bert_info_TIKTOK[only_basic_bert_info_TIKTOK['id'].astype(str).str.match(pattern)].copy()

# filtered_df_for_TIKTOK now contains only the rows whose 'id' consists of digits;
# .copy() lets the next cell change the column without a SettingWithCopyWarning

In [163]:
filtered_df_for_TIKTOK['id'] = filtered_df_for_TIKTOK['id'].astype(int)
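
The id-based split between the two platforms can be sketched as follows. The demo frame is made up, although its ids follow the two formats seen in this notebook. `str.fullmatch` is used here instead of `str.match` with `^...$` anchors, and the conversion targets `int64` explicitly because TikTok ids are far too large for a 32-bit integer.

```python
import pandas as pd

# Made-up rows mixing an Instagram file name with two TikTok ids
demo_info = pd.DataFrame({
    "Id": ["klimaschutz_163.jpg", "7229641670344690970", "6971355508833864965"],
    "BERTopic": ["0_klimaschutz", "1_savetheplanet", "-1_klimawandel"],
})

# True where the whole id consists of digits, i.e. for TikTok rows
is_tiktok = demo_info["Id"].astype(str).str.fullmatch(r"\d+")

# .copy() gives independent frames that can be modified safely
tiktok_info = demo_info[is_tiktok].copy()
tiktok_info["Id"] = tiktok_info["Id"].astype("int64")
insta_info = demo_info[~is_tiktok].copy()

print(tiktok_info.dtypes["Id"], len(tiktok_info), len(insta_info))
```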

In [171]:
len(filtered_df_for_TIKTOK)

3787

In [None]:
#df_tiktok_klimaschutz
#df_tiktok_klimaschutz = pd.read_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/TikTok_DATA_COMPLETE/tiktok_klimaschutz_ende.csv')

In [190]:
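# Left-join the topic labels onto the TikTok dataframe via the common column 'id'.
# Repeated ids are dropped first so that the join keeps exactly one row per post
# and the column copied over in the next cell lines up.
merged_tiktok_df = pd.merge(df_tiktok_savetheplanet, filtered_df_for_TIKTOK[['id', 'BERTopic']].drop_duplicates(subset='id'), on='id', how='left')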

In [191]:
df_tiktok_savetheplanet['BERTopic'] = merged_tiktok_df['BERTopic']

In [None]:
df_tiktok_savetheplanet

In [193]:
# WARNING: if executed, the CSV on Drive gets overwritten!
# Write the modified DataFrame back to a CSV file
# df_tiktok_savetheplanet.to_csv('/content/drive/MyDrive/Klimawandel Projekt/Daten/TikTok_DATA_COMPLETE/tiktok_savetheplanet_ende.csv', index=False)