# Creating topics for the cases
This notebook explores the topics we can create from the cases. It will involve a lot of trial and error, but it will hopefully be well documented.

In [7]:
import pickle
import requests
import random
import pandas as pd
import numpy as np
from pathlib import Path
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
def read_pickle(file_path):
    with open(file_path, "rb") as f:
        return pickle.load(f)

def pickle_object(obj, file_path):
    with open(file_path, "wb") as f:
        pickle.dump(obj, f)

def flatten_embeddings(embedding_dict):
    """Stacks all the embeddings from the dict into one big matrix."""
    # np.vstack expects a sequence, so turn the dict view into a list first
    return np.vstack(list(embedding_dict.values()))

def flatten_list(lst):
    return [elem for sublist in lst for elem in sublist]

def get_paragraphs(paragraph_dict):
    return flatten_list(list(paragraph_dict.values()))


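def load_text_url(text_url: str):
    """Takes the URL of a newline-separated text file and returns its lines as a list."""
    response = requests.get(text_url)
    # Fail loudly on HTTP errors instead of treating an error page as data
    response.raise_for_status()
    return response.text.split("\n")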

## Loading the data

In [9]:
DATA_DIR = Path("../../BscThesisData/data")
MODEL_PATH = Path("../models")
embedding_dict = read_pickle(DATA_DIR / "embedding_dict.pkl")
clean_paragraphs = read_pickle(DATA_DIR / "paragraph_dict.pkl")

Now it's time to stack the embeddings into a single matrix and flatten the paragraphs into a matching list of documents for BERTopic to process.

In [10]:
embeddings = flatten_embeddings(embedding_dict)
docs = get_paragraphs(clean_paragraphs)
print(random.sample(docs, 5))

['Sagen drejede sig om, hvorvidt udgifter til arbejde udført af en ekstern konsulent, der dels bestod i bistand ved salg af datterselskabsaktier, dels bestod i formidling af finansiering (formidlingshonorar), kunne fradrages efter henholdsvis statsskattelovens § 6, stk. 1, litra a og ligningslovens § 8, stk. 3, litra c.', 'Skatterådet bekræfter, at fortjeneste indvundet ved modtagelse af erstatning for indgåelse af aftale om dyrkningspraksis, samt anvendelse af sprøjtemidler bliver sidestillet med ekspropriationserstatning, så § 11 i ejendomsavancebeskatningsloven finder anvendelse og erstatningen bliver friholdt for ejendomsavancebeskatning.', 'Landsretten fandt det ikke dokumenteret, at udgifterne havde en sådan konkret og direkte tilknytning til selskabets indkomsterhvervelse, at de kunne anses for erhvervsmæssige udgifter. Efter anskaffelsernes karakter var det en nærliggende mulighed, at udgifterne var afholdt for at opfylde private formål, og appellanten havde ikke dokumenteret, 
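
Since `embeddings` and `docs` are flattened separately, they only line up row by row if both dicts have the same keys in the same order and the same number of entries per key. Here's a minimal sketch of a sanity check for that. Note that `check_alignment` and the toy data are made up for illustration and aren't part of the notebook's helpers; the sketch only assumes that every dict value supports `len()` (a list of paragraphs, or a 2-D array with one row per paragraph).

```python
def check_alignment(embedding_dict, paragraph_dict):
    """Return the total row count if both dicts line up, else raise ValueError.

    Hypothetical helper: assumes each value supports len().
    """
    # Flattening follows dict order, so the keys must match in order too
    if list(embedding_dict) != list(paragraph_dict):
        raise ValueError("keys differ or are in a different order")
    total = 0
    for key in embedding_dict:
        n_emb, n_par = len(embedding_dict[key]), len(paragraph_dict[key])
        if n_emb != n_par:
            raise ValueError(f"{key}: {n_emb} embeddings vs {n_par} paragraphs")
        total += n_emb
    return total


# Toy data (made up): two cases with 2 and 1 paragraphs, 3-dimensional vectors
toy_embeddings = {
    "case_a": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "case_b": [[0.7, 0.8, 0.9]],
}
toy_paragraphs = {
    "case_a": ["first paragraph", "second paragraph"],
    "case_b": ["third paragraph"],
}
n_rows = check_alignment(toy_embeddings, toy_paragraphs)
print(n_rows)  # 3
```

On the real data, the returned count should equal both `len(docs)` and `embeddings.shape[0]`.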

### Now we initialize the models

In [11]:
# Writing to test-directory
TEST_PATH = Path("../../explainlp/tests")
pickle_object(docs[:50], TEST_PATH / "example_docs.pkl")
pickle_object(embeddings[:50, :], TEST_PATH / "example_embeddings.pkl")

## Loading the cleaning models

In [12]:
STOP_WORD_URL = "https://gist.githubusercontent.com/berteltorp/0cf8a0c7afea7f25ed754f24cfc2467b/raw/305d8e3930cc419e909d49d4b489c9773f75b2d6/stopord.txt"
STOP_WORDS = load_text_url(STOP_WORD_URL)
vectorizer_model = CountVectorizer(stop_words=STOP_WORDS)
pickle_object(vectorizer_model, MODEL_PATH / "vectorizer.pkl")

In [13]:
# Powering up the transformer!
topic_model = BERTopic("Maltehb/-l-ctra-danish-electra-small-cased", vectorizer_model=vectorizer_model, nr_topics=5)

In [14]:
topics, probs = topic_model.fit_transform(docs, embeddings)

In [23]:
topic_model.visualize_topics()

In [24]:
# One row per document: assigned topic, its probability, and the paragraph text
preds_df = pd.DataFrame(list(zip(topics, probs, docs)), columns=["topic", "prob", "doc"])
preds_df.to_csv(DATA_DIR / "doc_topics.csv", index=False)
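
Before digging into the topics, it can help to see how the documents are spread across them, and how many landed in the outlier topic `-1`. The sketch below does that on a made-up list of topic assignments; `toy_topics` is purely illustrative and stands in for the `topics` list returned by `fit_transform`.

```python
from collections import Counter

# Made-up topic assignments; -1 is BERTopic's outlier topic
toy_topics = [0, 1, -1, 0, 2, -1, 0]

counts = Counter(toy_topics)
outlier_share = counts[-1] / len(toy_topics)

# Print the number of documents per topic, outliers first
for topic_id, n_docs in sorted(counts.items()):
    print(topic_id, n_docs)
print(f"outlier share: {outlier_share:.2f}")  # outlier share: 0.29
```

A large outlier share would suggest that many paragraphs don't fit any of the topics well.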

In [15]:
topic_dict = topic_model.get_topics()
# Keep only the words (drop the scores) and skip the outlier topic -1
topic_dict_clean = {k: [tup[0] for tup in word_list] for k, word_list in topic_dict.items() if k != -1}
pickle_object(topic_dict_clean, DATA_DIR / "topic_dict.pkl")
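
To make the cleaning step concrete, here is the same comprehension applied to a small toy dictionary with the same shape as `topic_dict` (topic id mapped to a list of `(word, score)` tuples). The words and scores are made up for illustration.

```python
# Toy dictionary (made-up words and scores): topic id -> [(word, score), ...]
toy_topic_dict = {
    -1: [("stk", 0.03), ("fandt", 0.02)],
    0: [("selskabet", 0.05), ("aktier", 0.04)],
    1: [("moms", 0.06), ("afgift", 0.03)],
}

# Drop the scores and leave out the outlier topic -1
toy_clean = {
    topic_id: [word for word, _score in word_list]
    for topic_id, word_list in toy_topic_dict.items()
    if topic_id != -1
}
print(toy_clean)  # {0: ['selskabet', 'aktier'], 1: ['moms', 'afgift']}
```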

In [16]:
topic_dict

{-1: [('stk', 0.03187539136267379),
  ('fandt', 0.028636827703240927),
  ('retten', 0.026334731197899103),
  ('sagsøgeren', 0.024490987745089433),
  ('skatterådet', 0.024472563562004945),
  ('skat', 0.023786131734937745),
  ('kr', 0.02189166245869958),
  ('sagen', 0.018884532431091453),
  ('landsretten', 0.01829822263005172),
  ('selskabet', 0.017723318903009332),
  ('jf', 0.01677112374454011),
  ('danmark', 0.016296933473401293),
  ('anses', 0.01593309145386382),
  ('spørger', 0.015778214047986752),
  ('skatteyderen', 0.01563522152784892),
  ('idet', 0.014497951646911051),
  ('selskab', 0.014340868812810654),
  ('forbindelse', 0.013995196463239701),
  ('nr', 0.013559370035959325),
  ('landsskatteretten', 0.013448290546233446),
  ('bekræfte', 0.012468524168281998),
  ('grundlag', 0.012194453867617012),
  ('endvidere', 0.012153099198262563),
  ('virksomhed', 0.011695388374274307),
  ('000', 0.011621739671120051),
  ('gældende', 0.01129413226434),
  ('samt', 0.010930661854322456),
  ('gr

In [25]:
# Saving the model
topic_model.save(str(MODEL_PATH / "topic_model"), save_embedding_model=False)


Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.

