
# 📦 **Packages**

In [1]:
%%capture
# DataMapPlot
!git clone https://github.com/TutteInstitute/datamapplot.git
!pip install datamapplot/.

# GPU-accelerated HDBSCAN + UMAP
!pip install cudf-cu12 dask-cudf-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install cuml-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install cugraph-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install cupy-cuda12x -f https://pip.cupy.dev/aarch64

# OpenAI
!pip install openai

# Topic Modelling
!pip install bertopic datasets
!pip install sentence-transformers

# Progress bar
!pip install tqdm

# 📄 **Data**

In [None]:
import textwrap
import pandas as pd

# Import raw data
berlingske = pd.read_csv('raw_article_Berlingske_2024-03-04.csv')
jyllands_posten = pd.read_csv('raw_article_Jyllands_Posten_2024-03-04.csv')
politiken = pd.read_csv('raw_article_Politiken_2024-03-04.csv')

# Merge the datasets
dataset = pd.concat([berlingske, jyllands_posten, politiken], ignore_index=True)

## Light Pre-Processing

In [None]:
# Removing rows with NaN values in the Article column
dataset = dataset.dropna(subset=['Article'])

# Reset the index
dataset.reset_index(drop=True, inplace=True)

In [None]:
# Merging 'Brief' and 'Article' into one column 'Content'
dataset['Content'] = dataset.apply(lambda row: str(row['Article']) if pd.isna(row['Brief']) else str(row['Brief']) + " " + str(row['Article']), axis=1)

In [None]:
dataset.head()

In [4]:
# Example view of title and content
print(textwrap.fill(dataset['Title'][7050], width=140))
print(textwrap.fill(dataset['Content'][7050], width=140))

Dansk intelligent hjertealarm skal redde liv
Danske forskere udvikler en alarm, der kan forudsige hjerteanfald hos personer med avancerede pacemakere. Der er ikke oplæsning af denne
artikel, så den oplæses derfor med maskinstemme. Kontakt os gerne påautomatiskoplaesning@pol.dk, hvis du hører ord, hvis udtale kan
forbedres. Indtil nu har danske hjertelæger selv skullet analysere tusindvis af signaler og sendinger fra pacemakere og andre hjerteenheder
landet over, og dermed vurdere om patienten har brug for hjælp. Men en ny intelligent hjertealarm skal i fremtiden give lægerne en hjælpende
hånd. Den danskudviklede hjertealarm, der potentielt kan redde menneskeliv, bliver i øjeblikket testet af lægerne på Rigshospitalet. Tariq
Andersen, der er adjunkt på Datalogisk Institut på Københavns Universitet, er med til at teste alarmen. » Alarmen består af en algoritme,
der kigger på data fra folks hjerteenheder. Den kigger så på det data og laver en forudsigelse af, hvad ris

In [None]:
len(dataset['Content'])

# 💬 **Utilizing OpenAI and Together.ai API**

In [6]:
from google.colab import userdata

In [7]:
from openai import OpenAI

In [8]:
TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')

In [9]:
client = OpenAI(base_url="https://api.together.xyz/v1", api_key=TOGETHER_API_KEY)

In [10]:
# Testing that the API connection and the hosted model work by running a prompt
system = "You are a helpful assistant"
user = "Explain artificial intelligence as if I am 5?"

In [11]:
completion = client.chat.completions.create(
  model="NousResearch/Nous-Hermes-2-Mistral-7B-DPO",
  messages=[
    {"role": "system", "content": system},
    {"role": "user", "content": user}
  ],
  temperature=0.2,
)

In [12]:
print(textwrap.fill(completion.choices[0].message.content, width=140))

Alright, imagine you have a toy robot that can learn and do things on its own. Artificial intelligence is like giving that toy robot the
ability to understand and solve problems, just like a real person. It can learn from experiences, recognize things, and even make decisions.
So, AI is like making smart robots or machines that can think and work like humans.


## **Prompt Engineering for Translation and Data Cleaning**

In [None]:
from tqdm import tqdm

# Initialize progress bar
pbar = tqdm(total=len(dataset))

  0%|          | 0/5 [00:00<?, ?it/s]

### **Translating Titles**

In [None]:
# Translating article titles to English

# Iterate over dataset rows
for index, row in dataset.iterrows():
    # Access title content from the 'Title' column
    title_content = row['Title']

    # Define the system and user messages
    system = "You are a helpful, respectful and honest assistant specialized in translating text data from Danish to English."
    user = "Translate the following text from Danish to English:\n\n" + title_content

    # Create completion using Together.ai API and Mistral
    completion = client.chat.completions.create(
        model="NousResearch/Nous-Hermes-2-Mistral-7B-DPO",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user}
        ],
        temperature=0.2
    )

    # Get the translated title from the completion
    translated_title = textwrap.fill(completion.choices[0].message.content, width=100)

    # Add the translated title to the new column 'Title_translated'
    dataset.at[index, 'Title_translated'] = translated_title

    # Update progress bar
    pbar.update(1)

# Close progress bar
pbar.close()
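
The request pattern in the loop above (one system message, one user message, a fixed model and temperature) recurs for every prompt in this notebook. A minimal sketch of a helper that wraps it is shown here; `ask_model` and the stand-in `fake_client` are hypothetical names introduced only for illustration, and the fake client exists so the helper can be tried without network access.

```python
from types import SimpleNamespace

MODEL = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO"


def ask_model(client, system, user, model=MODEL, temperature=0.2):
    """Send one system/user message pair and return the reply text."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=temperature,
    )
    return completion.choices[0].message.content


# Stand-in client for a dry run (hypothetical, not part of any library):
# it mimics the attribute path client.chat.completions.create(...) and
# echoes the user message back.
def _fake_create(model, messages, temperature):
    reply = SimpleNamespace(content="echo: " + messages[-1]["content"])
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

print(ask_model(fake_client, "You are a translator.", "Hej verden"))
# prints: echo: Hej verden
```

With the real `client` from the earlier cell, the loop body could then reduce to a single `ask_model(client, system, user)` call per row.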

In [None]:
dataset[['Title', 'Title_translated']].head(20)

### **Translating and Cleaning Content**

In [None]:
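# Translating the Content column and cleaning the translated content

# Re-create the progress bar, since the one used for the titles has been closed
pbar = tqdm(total=len(dataset))

# Iterate over dataset rows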
for index, row in dataset.iterrows():
    # Access content from the 'Content' column
    article_content = row['Content']

    # Define the system and user messages for translation
    translation_system = "You are a helpful, respectful and honest assistant specialized in translating text data from Danish to English."
    translation_user = "Translate the following text from Danish to English:\n\n" + article_content

    # Create completion for translation using Together.ai API and Mistral
    translation_completion = client.chat.completions.create(
        model="NousResearch/Nous-Hermes-2-Mistral-7B-DPO",
        messages=[
            {"role": "system", "content": translation_system},
            {"role": "user", "content": translation_user}
        ],
        temperature=0.2
    )

    # Get translated text
    translated_content = translation_completion.choices[0].message.content

    # Add translated text to new column 'Content_translated'
    dataset.at[index, 'Content_translated'] = translated_content

    # Define the system and user messages for cleaning
    cleaning_system = "You are a helpful, respectful and honest assistant specialized in cleaning text data"
    cleaning_user = "Clean the following text. Keep only the core content and remove irrelevant information:\n\n" + translated_content

    # Create completion for cleaning using Together.ai API and Mistral
    cleaning_completion = client.chat.completions.create(
        model="NousResearch/Nous-Hermes-2-Mistral-7B-DPO",
        messages=[
            {"role": "system", "content": cleaning_system},
            {"role": "user", "content": cleaning_user}
        ],
        temperature=0.2
    )

    # Get cleaned text
    cleaned_content = cleaning_completion.choices[0].message.content

    # Add cleaned text to new column 'Content_cleaned'
    dataset.at[index, 'Content_cleaned'] = cleaned_content

    # Update progress bar
    pbar.update(1)

# Close progress bar
pbar.close()

In [None]:
dataset[['Content', 'Content_translated', 'Content_cleaned']].head()

In [None]:
# Saving the cleaned dataset as csv
dataset.to_csv('dataset_cleaned.csv', index=False)

### **Filtering the Dataset**

In [140]:
# Select one of three distinct time periods: set the two names below and uncomment the matching filter line
time_period = '2014_2016'
time_period_viz = '2014-2016'

dataset_filtered = dataset[(dataset['Date'] >= '2014-01-01') & (dataset['Date'] <= '2016-12-31')] # Time period 1
#dataset_filtered = dataset[(dataset['Date'] >= '2017-01-01') & (dataset['Date'] <= '2019-12-31')] # Time period 2
#dataset_filtered = dataset[dataset['Date'] >= '2020-01-01'] # Time period 3
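
Switching periods by commenting lines in and out is easy to get out of sync with `time_period`. A minimal sketch of a lookup-based alternative follows. The date bounds are copied from the cell above; the keys `2017_2019` and `2020_onwards`, the `PERIODS` table and the `period_mask` function are hypothetical names, and the sketch assumes every date is a plain ISO `YYYY-MM-DD` string with no missing values.

```python
# Period bounds copied from the filtering cell; None marks an open upper bound.
PERIODS = {
    "2014_2016": ("2014-01-01", "2016-12-31"),
    "2017_2019": ("2017-01-01", "2019-12-31"),
    "2020_onwards": ("2020-01-01", None),
}


def period_mask(dates, period):
    """Return one boolean per date, True where the date falls inside the period.

    ISO 'YYYY-MM-DD' strings sort chronologically as plain text, so string
    comparison is enough here.
    """
    start, end = PERIODS[period]
    return [start <= d and (end is None or d <= end) for d in dates]


sample_dates = ["2013-12-31", "2014-01-01", "2016-12-31", "2018-06-15", "2021-03-04"]
print(period_mask(sample_dates, "2014_2016"))
# prints: [False, True, True, False, False]
```

Because pandas accepts a boolean list of matching length as a row selector, the same mask could be applied as `dataset[period_mask(dataset['Date'], time_period)]`.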

In [None]:
len(dataset_filtered['Content_cleaned'])

In [142]:
from datasets import Dataset

# Create a dictionary instead of a pandas dataframe
data_dict = {
    "Content_cleaned": dataset_filtered["Content_cleaned"].tolist(),
    "Title_translated": dataset_filtered["Title_translated"].tolist()
}

# Create a dataset object
dataset_dict = Dataset.from_dict(data_dict)

# Extract cleaned content to train on and corresponding titles
content = dataset_dict["Content_cleaned"]
titles = dataset_dict["Title_translated"]

## **Prompt Engineering for Topic Labelling**

### **Prompt Template**



In [143]:
# System prompt describes information given to all conversations
system_prompt = """
You are a helpful, respectful and honest assistant specialized in labeling topics of news articles.
"""

In [144]:
# Example prompt demonstrating the output we are looking for
example_prompt = """
I have a topic that contains the following documents:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
"""

example_output = """Environmental impacts of eating meat"""

In [145]:
# Our main prompt with documents ([DOCUMENTS]) and keywords ([KEYWORDS]) tags
main_prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
"""

In [146]:
prompt = system_prompt + example_prompt + example_output + main_prompt

In [147]:
print(prompt)


You are a helpful, respectful and honest assistant specialized in labeling topics of news articles.

I have a topic that contains the following documents:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
Environmental impacts of eating meat
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic
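
The `[DOCUMENTS]` and `[KEYWORDS]` tags in the template are placeholders that get replaced per topic before the prompt is sent. The sketch below illustrates that substitution with plain string replacement; `fill_prompt` is a hypothetical helper and the sample documents and keywords are made up. It is not BERTopic's own implementation, which may format the documents differently.

```python
def fill_prompt(template, documents, keywords):
    """Substitute the [DOCUMENTS] and [KEYWORDS] tags in a prompt template."""
    # One bullet per document, mirroring the layout of the example prompt.
    doc_block = "\n".join("- " + doc for doc in documents)
    filled = template.replace("[DOCUMENTS]", doc_block)
    return filled.replace("[KEYWORDS]", ", ".join(keywords))


template = (
    "I have a topic that contains the following documents:\n"
    "[DOCUMENTS]\n\n"
    "The topic is described by the following keywords: '[KEYWORDS]'."
)

filled_prompt = fill_prompt(
    template,
    documents=["Hospitals test a heart alarm.", "An algorithm reads pacemaker data."],
    keywords=["heart", "alarm", "pacemaker"],
)
print(filled_prompt)
```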

# 🗨️ **BERTopic**

## **Preparing Embeddings**

In [148]:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("BAAI/bge-small-en")
embeddings = embedding_model.encode(content, show_progress_bar=True)

Batches:   0%|          | 0/88 [00:00<?, ?it/s]

## **Sub-models**

In [177]:
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN

# UMAP parameters
umap_model = UMAP(n_components=2, min_dist=0.0, metric='cosine', random_state=42)

# HDBSCAN parameters
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)


In [178]:
# Pre-reduce embeddings for visualization purposes
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)

### **Representation Models**

In [179]:
prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]

Based on the information above, please give a topic label of maximum 4 words:
topic: <label>
"""

In [180]:
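# Note: this rebinds the name OpenAI to BERTopic's representation class;
# the API client `client` was already created with openai.OpenAI above
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI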

# KeyBERT
keybert = KeyBERTInspired()

# MMR
mmr = MaximalMarginalRelevance(diversity=0.3)

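# LLM-generated labels via the Together.ai client
# (the model is a Mistral 7B fine-tune; the "Mixtral" key used below is only a name)
openai_rep = OpenAI(client, model="NousResearch/Nous-Hermes-2-Mistral-7B-DPO",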
                    chat=True,
                    prompt=prompt,
                    nr_docs=5,
                    delay_in_seconds=3)


# All representation models
representation_model = {
    "KeyBERT": keybert,
    "Mixtral": openai_rep,
    "MMR": mmr,
}

# 🔥 **Training**

In [None]:
from bertopic import BERTopic

topic_model = BERTopic(

  # Sub-models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=15,
  verbose=True
)

# Train model
topics, probs = topic_model.fit_transform(content, embeddings)

In [None]:
# Get topic info
topic_info = topic_model.get_topic_info()

# Save topic info as CSV for future reference
topic_info.to_csv(f'topic_info_{time_period}.csv', index=False)

# Show topics
topic_info

In [None]:
topic_model.get_topic(1, full=True)["KeyBERT"]

In [184]:
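# Extract and add topic labels to the overall dataframe for future reference
# (work on a copy so the assignment does not target a slice of `dataset`)
dataset_filtered = dataset_filtered.copy()
dataset_filtered['Topic'] = topics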

# Merge with KeyBERT and Mixtral labels
topic_info_subset = topic_info[['Topic', 'KeyBERT', 'Mixtral']]
dataset_cleaned_with_labels = pd.merge(dataset_filtered, topic_info_subset, on='Topic', how='left')

# Save the dataset as csv
dataset_cleaned_with_labels.to_csv(f'dataset_cleaned_with_labels_{time_period}.csv', index=False)

In [None]:
dataset_cleaned_with_labels

In [186]:
mixtral_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["Mixtral"].values()]
topic_model.set_topic_labels(mixtral_labels)
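
The list comprehension above keeps only the first line of the first entry in each topic's representation. The sketch below reproduces that step on hand-written data, together with a light clean-up (dropping quotes and punctuation, falling back to "Unlabelled"). The `representations` dictionary is a hypothetical stand-in that assumes each topic maps to a list of `(text, score)` pairs with the generated label in the first pair; `first_line_label` is likewise an illustrative name.

```python
import re

# Hypothetical stand-in for topic_model.get_topics(full=True)["Mixtral"].
representations = {
    -1: [("Miscellaneous news\nSome extra explanation", 1)],
    0: [('"AI in healthcare"', 1)],
    1: [("", 1)],
}


def first_line_label(entries):
    """Take the first line of the first entry and tidy it into a plain label."""
    raw = entries[0][0].split("\n")[0]
    # Drop double quotes, collapse runs of non-word characters into one space.
    cleaned = re.sub(r"\W+", " ", raw.replace('"', "")).strip()
    return cleaned or "Unlabelled"


labels = [first_line_label(entries) for entries in representations.values()]
print(labels)
# prints: ['Miscellaneous news', 'AI in healthcare', 'Unlabelled']
```

The extra `.strip()` is an addition of this sketch; it removes the leading or trailing space that the regular expression can leave behind.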

# 📊 **Visualize**

In [None]:
topic_model.visualize_documents(titles, reduced_embeddings=reduced_embeddings, hide_annotations=True, hide_document_hover=False, custom_labels=True, title=f"News Articles on Artificial Intelligence - Topics from {time_period_viz}")

In [169]:
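import PIL.Image  # importing the submodule explicitly; a bare `import PIL` does not guarantee PIL.Image is loaded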
import numpy as np
import requests

# Prepare logo
bertopic_logo_response = requests.get(
    "https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png",
    stream=True,
    headers={'User-Agent': 'My User Agent 1.0'}
)
bertopic_logo = np.asarray(PIL.Image.open(bertopic_logo_response.raw))

In [None]:
import datamapplot
import re

# Create a label for each document
llm_labels = [re.sub(r'\W+', ' ', label[0][0].split("\n")[0].replace('"', '')) for label in topic_model.get_topics(full=True)["Mixtral"].values()]
llm_labels = [label if label else "Unlabelled" for label in llm_labels]
all_labels = [llm_labels[topic+topic_model._outliers] if topic != -1 else "Unlabelled" for topic in topics]

# Run the visualization
datamapplot.create_plot(
    reduced_embeddings,
    all_labels,
    label_font_size=10,
    title=f"News Articles on Artificial Intelligence - Topics from {time_period_viz}",
    sub_title="Topics labeled with `Nous-Hermes-2-Mistral-7B-DPO`",
    label_wrap_width=20,
    use_medoids=True,
    #logo=bertopic_logo,
    #logo_width=0.16
)
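
The index `topic + topic_model._outliers` in the labelling step shifts topic ids into list positions. A minimal sketch of the idea follows, assuming the label list is ordered by topic id and that the offset is 1 when an outlier topic (-1) exists and 0 otherwise; `labels_per_document` and the sample values are hypothetical.

```python
def labels_per_document(topics, topic_labels, has_outlier_topic):
    """Map each document's topic id to its label text.

    topic_labels is ordered by topic id. When an outlier topic (-1) exists,
    its label sits at position 0, so every real topic is shifted by one.
    """
    offset = 1 if has_outlier_topic else 0
    return [
        topic_labels[topic + offset] if topic != -1 else "Unlabelled"
        for topic in topics
    ]


topic_labels = ["Outliers", "AI in healthcare", "Robots and jobs"]  # ids -1, 0, 1
doc_topics = [0, -1, 1, 0]
doc_labels = labels_per_document(doc_topics, topic_labels, has_outlier_topic=True)
print(doc_labels)
# prints: ['AI in healthcare', 'Unlabelled', 'Robots and jobs', 'AI in healthcare']
```

Without the offset, topic 0 would pick up the outlier label and every other topic would receive its neighbour's label.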