# 1. Import Necessary Libraries

Make sure Microsoft Visual C++ is installed on your PC (on Windows, some of BERTopic's dependencies may need it to install).

In [None]:
import pandas as pd
import numpy as np
import bertopic 


# 2. Load Your Data

Load the articles from your CSV file (e.g. studies.csv) using pandas.

In [None]:
import pandas as pd

# Load the data
df = pd.read_csv('studies.csv')
df.head()
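
The later cells rely on a 'text' column (section 3) and a 'File' column (section 8). A minimal sketch of a check for those columns is shown here; the helper name `missing_columns` and the toy DataFrame are made up for illustration, and in the notebook you would pass the loaded `df` instead.

```python
import pandas as pd

# Column names the later cells rely on: 'text' (section 3) and 'File' (section 8)
REQUIRED_COLUMNS = ["text", "File"]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the DataFrame."""
    return [col for col in required if col not in frame.columns]


# Toy frame for illustration only; in the notebook, pass the loaded df instead
toy = pd.DataFrame({"text": ["Some article text."], "Title": ["Example"]})
print(missing_columns(toy))
```

An empty list means all the columns are there. Anything else tells you which column to rename or add before continuing.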

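# 3. Prepare Your Text Data
We clean up the text:
- Remove text between parentheses and all special characters (only letters are kept)
- Convert to lower case
- Lemmatize each word (reduce it to its dictionary form, e.g. 'classes' -> 'class')
- Remove stop words
- Remove words of only one or two letters ('a', 'I', 'at', ...)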

In [None]:
import re
import unicodedata
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import download

# Ensure WordNet and stopwords are downloaded
download('wordnet')
download('stopwords')
stop_words = set(stopwords.words('english'))
# Customize stopword list as needed
stop_words.update(['study'])

# Minimum word size threshold
minWordSize = 3

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Preprocessing function that outputs a cleaned sentence
def preprocess_text(text):
    # Normalize Unicode characters
    text = unicodedata.normalize('NFKD', text)
    
    # Remove text between parentheses and non-alphabetic characters
    text = re.sub(r'\(.*?\)', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Convert text to lowercase and split into words, lemmatizing each word
    lemmatized_words = [lemmatizer.lemmatize(word.lower()) for word in text.split()]
    
    # Filter out stopwords and short words
    words = [w for w in lemmatized_words if w not in stop_words and len(w) >= minWordSize]
    
    # Join the cleaned words back into a single string
    return ' '.join(words)

# Apply the preprocessing function to the 'text' column, producing clean sentences
df['text_clean'] = df['text'].apply(preprocess_text)

# Display the updated DataFrame
df.head()
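
To see what the cleaning steps do to a single sentence, here is a small sketch that uses only the standard library. It is a simplification of `preprocess_text`: lemmatization is left out and the stop-word set is a tiny made-up list standing in for NLTK's English list, so the result will differ slightly from the real function.

```python
import re
import unicodedata

# Hypothetical mini stop-word list, standing in for NLTK's English list
STOP_WORDS = {"the", "and", "was", "with", "study"}
MIN_WORD_SIZE = 3


def clean_without_lemmatizing(text):
    """Apply the same cleaning steps as preprocess_text, minus lemmatization."""
    text = unicodedata.normalize("NFKD", text)
    text = re.sub(r"\(.*?\)", "", text)       # drop text between parentheses
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # keep letters and whitespace only
    words = [w.lower() for w in text.split()]
    return " ".join(w for w in words if w not in STOP_WORDS and len(w) >= MIN_WORD_SIZE)


example = "The study (Smith, 2020) was done with 25 mixed-age classes."
print(clean_without_lemmatizing(example))  # done mixedage classes
```

Note that removing the hyphen glues 'mixed-age' together into 'mixedage'. If that matters for your texts, you could replace hyphens with a space before removing the other characters.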




# 4. Initialize and Fit BERTopic
The good thing about BERTopic is that it does most of the work automatically (meaning I do not need to bore you to death with details about how it works behind the scenes).

We need to do 3 things:
1. Initialize the BERTopic model
2. 'Fit' the model: this means running the model on your data, as you would run a simple linear regression
3. Look at the topics via 

To get started, let's just use the default settings.

In [None]:
from bertopic import BERTopic

# Initialize BERTopic model
topic_model = BERTopic(calculate_probabilities=True)

# Fit the model with preprocessed text sentences
topics, probabilities = topic_model.fit_transform(df['text_clean'])

# View and inspect topics
topic_model.get_topic_info()


We get two quite intuitive topics:
1. mixed-age classes
2. the quality of care

The model has done a good job of identifying these topics, but with the default settings every topic needs at least 10 documents, which tends to produce fewer, broader topics. If you want more refined topics, you can tweak the min_topic_size and nr_topics parameters.

+ min_topic_size: the minimum number of documents in a topic (default is 10).
    + too high = fewer topics
    + lower values allow smaller, more specific topics
+ nr_topics: the number of topics you want to end up with. BERTopic merges similar topics until this number is reached, so it can reduce the number of topics but cannot create more topics than the model found.

There are other parameters, but I would not recommend changing them unless you really want to do a deep dive into the statistical properties.


In [None]:
# Initialize BERTopic model
topic_model = BERTopic(calculate_probabilities=True, min_topic_size=5, nr_topics=10)

# Fit the model with preprocessed text sentences
topics, probabilities = topic_model.fit_transform(df['text_clean'])

# View and inspect topics
topic_model.get_topic_info()
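
To check whether the new settings changed anything, it can help to count the documents per topic. The sketch below uses a made-up list of assignments; in the notebook you would use the `topics` list returned by `fit_transform`. BERTopic gives the label -1 to documents that do not fit any topic (outliers).

```python
from collections import Counter

# Made-up assignments for illustration; in the notebook use the `topics` list
example_topics = [0, 0, 1, -1, 1, 0, 2, -1, 0, 2]

sizes = Counter(example_topics)
n_outliers = sizes.pop(-1, 0)  # BERTopic uses -1 for documents without a topic

print(f"{len(sizes)} topics, {n_outliers} outliers out of {len(example_topics)} documents")
for topic, size in sorted(sizes.items()):
    print(f"Topic {topic}: {size} documents")
```

If most documents end up as outliers, lowering min_topic_size further is one thing to try.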



# 5. Visualize Topics
We can call .visualize_topics() to create a 2D representation of the topics. The graph is an interactive Plotly graph, which can be exported to HTML:

Note: If you get the error 'ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed', go to the terminal and run 'pip install --upgrade nbformat' (you may need to restart the notebook kernel afterwards).

In [None]:
# Visualize topics with an interactive plot
topic_model.visualize_topics()
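
If you want to keep the interactive graph outside the notebook, you can write it to an HTML file. A minimal sketch, assuming the object returned by `visualize_topics()` is a Plotly figure (which has a `write_html` method); the helper name `save_interactive_plot` and the file name are made up for illustration.

```python
from pathlib import Path


def save_interactive_plot(fig, path):
    """Write a Plotly figure to a standalone HTML file and return the path."""
    path = Path(path)
    fig.write_html(str(path))  # Plotly figures provide write_html
    return path


# In the notebook (topic_model is the fitted BERTopic model from above):
# fig = topic_model.visualize_topics()
# save_interactive_plot(fig, "topics.html")
```

The resulting file opens in any browser and keeps the slider and hover information.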

You can use the slider to select a topic, which then lights up red. If you hover over a topic, you see general information about it, including its size and its corresponding words.

We can also ask for a representation of the corresponding words for each topic:

In [None]:
topic_model.visualize_barchart()

# 6. Visualize Topic Hierarchy
The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we can use scipy.cluster.hierarchy to create clusters and visualize how they relate to one another. We can also see what happens to the topic representations when merging topics. 

In [None]:
hierarchical_topics = topic_model.hierarchical_topics(df['text_clean'])
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)


If you hover over the black circles, you will see the topic representation at that level of the hierarchy. These representations help you understand the effect of merging certain topics. Some might be logical to merge whilst others might not. Moreover, we can now see which sub-topics can be found within certain larger themes.

You can also print a text version of the topic representation at the different levels (a bit less pretty, but maybe easier to read).

In [None]:
tree = topic_model.get_topic_tree(hierarchical_topics)
print(tree)


# 7. Visualize documents

We can visualize the documents (=texts) inside the topics to see if they were assigned correctly or whether they make sense. To do so, we can use the topic_model.visualize_documents() function. This function recalculates the document embeddings and reduces them to 2-dimensional space for easier visualization purposes. 

In [None]:
topic_model.visualize_documents(df['text'])

When you hover over a point, you can see which text it is. The color tells you which topic it belongs to. While this is very pretty, it can be more practical to open an Excel or CSV file that contains the original text together with the assigned topic and its topic words:

In [None]:
# Assign each document to its most probable topic.
# Note: this also assigns documents that BERTopic marked as outliers (-1);
# use `topics` instead of the argmax if you want to keep the outlier label.
df["topic_number"] = np.argmax(probabilities, axis=1)

# Look up the topic words by topic number. The rows of get_topic_info() are
# not indexed by topic number, so index by the 'Topic' column before mapping.
info = topic_model.get_topic_info()
topic_names = info.set_index('Topic')['Representation']

df['topic_name'] = df['topic_number'].map(topic_names)

# Save to a new CSV file
df.to_csv("studies_lobke_with_topics.csv", index=False)


In [None]:
df.head()

We can also look at the topic distribution per document, i.e. the probability that the text belongs to each topic (if a topic is not shown in the graph, its probability is 0 or very close to 0). For example, the topic distribution for the sixth document (Python starts counting at 0, so the sixth document has index 5):

In [None]:
topic_model.visualize_distribution(probabilities[5])
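
The indexing can be illustrated with a small made-up array. This is a sketch, assuming `probabilities` is a 2D array with one row per document and one column per topic, as returned when calculate_probabilities=True; the numbers below are invented.

```python
import numpy as np

# Made-up probabilities for 6 documents and 3 topics (rows = documents)
example_probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.9, 0.0, 0.1],
    [0.2, 0.5, 0.3],
    [0.0, 0.6, 0.4],
])

sixth_document = example_probabilities[5]           # index 5 = sixth row
most_likely_topic = int(np.argmax(sixth_document))  # column with the highest value

print(sixth_document)
print(most_likely_topic)  # 1
```

The same `np.argmax` idea is what assigns a topic number to every row of the DataFrame above.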

# 8. Topics per full article

We count how often each topic is assigned to the texts of each full article (the articles are identified by the 'File' column).

In [None]:
import matplotlib.pyplot as plt

# Calculate the count of times each topic is chosen within each article
article_topic_counts = df.groupby('File')['topic_number'].value_counts().unstack(fill_value=0)

# Rename columns to 'Topic X'
article_topic_counts.columns = [f'Topic {i}' for i in article_topic_counts.columns]

# Display the table
print(article_topic_counts)

# Plot the distribution for each article
article_topic_counts.plot(kind='bar', stacked=True, figsize=(15, 7))
plt.title('Topic Distribution per Article (Count)')
plt.xlabel('Article')
plt.ylabel('Count')
plt.legend(title='Topics', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

We could also do the same, but with proportions instead of counts.

In [None]:
import matplotlib.pyplot as plt

# Calculate the proportion of times each topic is chosen within each article
article_topic_proportions = df.groupby('File')['topic_number'].value_counts(normalize=True).unstack(fill_value=0)

# Rename columns to 'Topic X'
article_topic_proportions.columns = [f'Topic {i}' for i in article_topic_proportions.columns]

# Display the table
print(article_topic_proportions)

# Plot the distribution for each article
article_topic_proportions.plot(kind='bar', stacked=True, figsize=(15, 7))
plt.title('Topic Distribution per Article (Proportion)')
plt.xlabel('Article')
plt.ylabel('Proportion')
plt.legend(title='Topics', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
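
To make the difference between counts and proportions concrete, here is the same groupby logic on a tiny made-up table. The file names and topic numbers are invented; only the column names 'File' and 'topic_number' come from the notebook.

```python
import pandas as pd

# Made-up data: one row per text fragment, with its article and assigned topic
toy = pd.DataFrame({
    "File": ["a.pdf", "a.pdf", "a.pdf", "b.pdf"],
    "topic_number": [0, 0, 1, 1],
})

# Rows = articles, columns = topic numbers
counts = toy.groupby("File")["topic_number"].value_counts().unstack(fill_value=0)
proportions = (
    toy.groupby("File")["topic_number"].value_counts(normalize=True).unstack(fill_value=0)
)

print(counts)
print(proportions.round(2))
```

In the counts table a.pdf has 2 fragments in topic 0 and 1 in topic 1. In the proportions table each row adds up to 1, which makes articles of different lengths easier to compare.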