<h1> Analysis 2 (BERTopic) </h1>



This analysis used BERTopic; see the detailed documentation [here](https://maartengr.github.io/BERTopic/api/bertopic.html). Because it involved many manual steps that were specific to our data (e.g. removing certain topics by index), only the two key steps are shown.

<h3> Visualize documents by topic colour and ntp-ftp-stp class colour </h3>

The graph generated by this cell comes from a modified version of BERTopic's <code>visualize_documents</code> and <code>visualize_hierarchical_documents</code> functions, so that one can see how the modelled topics reflect normal, fast or slow time perception (ntp, ftp, stp). In colour mode 1 (slider value 0), documents are coloured according to their topic; in colour mode 2 (slider value 1), documents are coloured white-red-blue according to their ntp-ftp-stp class (e.g. "slower" seed word -> stp).

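To use this function, it has to be added to the BERTopic source code as a method of the <code>BERTopic</code> class, because it relies on <code>self</code>. You also need to pass a dictionary <code>color_map2</code>, with documents as keys and RGB values as values, and a list <code>keep_list3</code> that gives the topic of each document, in the same order as <code>docs</code>.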
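
As a minimal sketch of how these two inputs could be built, assuming every document already has a time-perception class label and a topic assignment. The example sentences and the names `classes`, `topic_per_doc` and `CLASS_COLOURS` are hypothetical, and the colour assignment (ntp white, ftp red, stp blue) is an assumption read off the white-red-blue description above:

```python
# Hypothetical example data: one class label and one topic per document.
docs = ["time felt normal", "time flew by", "time slowed to a crawl"]
classes = ["ntp", "ftp", "stp"]
topic_per_doc = [0, 1, 1]  # would be passed as keep_list3

# Assumed colour scheme: ntp -> white, ftp -> red, stp -> blue.
CLASS_COLOURS = {
    "ntp": [255, 255, 255],
    "ftp": [255, 0, 0],
    "stp": [0, 0, 255],
}

# Documents as keys, RGB values as values (would be passed as color_map2).
color_map2 = {doc: CLASS_COLOURS[cls] for doc, cls in zip(docs, classes)}
```

Because the dictionary is keyed by the document text, two identical documents share one entry and therefore one colour.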

<br>

All modifications to the BERTopic source code are marked with the comment "#CHANGED HERE". The most important line that was changed is this one:

 ```marker=dict(size=5, opacity=0.5, color = [color_map2[doc] if doc is not None else [255, 255, 255] for doc in selection.doc]) ```
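
To illustrate what the list comprehension in that line produces, here is a small stand-alone sketch. It assumes a plain list in place of `selection.doc`; the documents and colours are hypothetical. The trailing `None` stands for the extra row that the function appends to carry the topic label, which has no document and so falls back to white:

```python
# Hypothetical colour lookup, keyed by document text.
color_map2 = {
    "time flew by": [255, 0, 0],
    "time slowed down": [0, 0, 255],
}

# Two documents plus the label row, whose document is None.
docs_in_trace = ["time flew by", "time slowed down", None]

# One colour per point; the label row gets the white fallback.
colours = [
    color_map2[doc] if doc is not None else [255, 255, 255]
    for doc in docs_in_trace
]
```

A document that is missing from `color_map2` raises a `KeyError` here, so the dictionary needs an entry for every document that is plotted.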

In [None]:
# Imports used by the function below
from typing import List

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from umap import UMAP

def modified_visualize_documents(self,
                                    docs: List[str],
                                    topics: List[int] = None,
                                    embeddings: np.ndarray = None,
                                    reduced_embeddings: np.ndarray = None,
                                    sample: float = None,
                                    hide_annotations: bool = False,
                                    hide_document_hover: bool = False,
                                    custom_labels: bool = False,
                                    title: str = "<b>Documents and Topics</b>",
                                    width: int = 1200,
                                    height: int = 750,
                                    color_map2 = None, #CHANGED HERE
                                    keep_list3 = None): #CHANGED HERE
        """ Visualize documents and their topics in 2D

        Arguments:
            topic_model: A fitted BERTopic instance.
            docs: The documents you used when calling either `fit` or `fit_transform`
            topics: A selection of topics to visualize.
                    Not to be confused with the topics that you get from `.fit_transform`.
                    For example, if you want to visualize only topics 1 through 5:
                    `topics = [1, 2, 3, 4, 5]`.
            embeddings: The embeddings of all documents in `docs`.
            reduced_embeddings: The 2D reduced embeddings of all documents in `docs`.
            sample: The percentage of documents in each topic that you would like to keep.
                    Value can be between 0 and 1. Setting this value to, for example,
                    0.1 (10% of documents in each topic) makes it easier to visualize
                    millions of documents as a subset is chosen.
            hide_annotations: Hide the names of the traces on top of each cluster.
            hide_document_hover: Hide the content of the documents when hovering over
                                specific points. Helps to speed up generation of visualization.
            custom_labels: Whether to use custom topic labels that were defined using 
                        `topic_model.set_topic_labels`.
            title: Title of the plot.
            width: The width of the figure.
            height: The height of the figure.
            color_map2: Dictionary with documents as keys and RGB values as values,
                        used to colour the documents in the second colour mode. #CHANGED HERE
            keep_list3: The topic of each document in `docs`, in the same order as `docs`. #CHANGED HERE

        Examples:

        To visualize the topics simply run:

        ```python
        topic_model.visualize_documents(docs)
        ```

        Do note that this re-calculates the embeddings and reduces them to 2D.
        The advised and preferred pipeline for using this function is as follows:

        ```python
        from sklearn.datasets import fetch_20newsgroups
        from sentence_transformers import SentenceTransformer
        from bertopic import BERTopic
        from umap import UMAP

        # Prepare embeddings
        docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
        sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
        embeddings = sentence_model.encode(docs, show_progress_bar=False)

        # Train BERTopic
        topic_model = BERTopic().fit(docs, embeddings)

        # Reduce dimensionality of embeddings, this step is optional
        # reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

        # Run the visualization with the original embeddings
        topic_model.visualize_documents(docs, embeddings=embeddings)

        # Or, if you have reduced the original embeddings already:
        topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
        ```

        Or if you want to save the resulting figure:

        ```python
        fig = topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
        fig.write_html("path/to/file.html")
        ```

        <iframe src="../../getting_started/visualization/documents.html"
        style="width:1000px; height: 800px; border: 0px;""></iframe>
        """
        topic_per_doc = keep_list3 #CHANGED HERE

        # Sample the data to optimize for visualization and dimensionality reduction
        if sample is None or sample > 1:
            sample = 1

        indices = []
        for topic in set(topic_per_doc):
            s = np.where(np.array(topic_per_doc) == topic)[0]
            size = len(s) if len(s) < 100 else int(len(s) * sample)
            indices.extend(np.random.choice(s, size=size, replace=False))
        indices = np.array(indices)

        df = pd.DataFrame({"topic": np.array(topic_per_doc)[indices]})
        df["doc"] = [docs[index] for index in indices]
        df["topic"] = [topic_per_doc[index] for index in indices]

        # Extract embeddings if not already done
        if sample is None:
            if embeddings is None and reduced_embeddings is None:
                embeddings_to_reduce = self._extract_embeddings(df.doc.to_list(), method="document")
            else:
                embeddings_to_reduce = embeddings
        else:
            if embeddings is not None:
                embeddings_to_reduce = embeddings[indices]
            elif embeddings is None and reduced_embeddings is None:
                embeddings_to_reduce = self._extract_embeddings(df.doc.to_list(), method="document")

        # Reduce input embeddings
        if reduced_embeddings is None:
            umap_model = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit(embeddings_to_reduce)
            embeddings_2d = umap_model.embedding_
        elif sample is not None and reduced_embeddings is not None:
            embeddings_2d = reduced_embeddings[indices]
        elif sample is None and reduced_embeddings is not None:
            embeddings_2d = reduced_embeddings

        unique_topics = set(topic_per_doc)
        if topics is None:
            topics = unique_topics

        # Combine data
        df["x"] = embeddings_2d[:, 0]
        df["y"] = embeddings_2d[:, 1]

        # Prepare text and names
        if self.custom_labels_ is not None and custom_labels:
            names = [self.custom_labels_[topic + self._outliers] for topic in unique_topics]
        else:
            names = [f"{topic}_" + "_".join([word for word, value in self.get_topic(topic)][:3]) for topic in unique_topics]



        # Outliers and non-selected topics
        non_selected_topics = set(unique_topics).difference(topics)
        if len(non_selected_topics) == 0:
            non_selected_topics = [-1]

        selection = df.loc[df.topic.isin(non_selected_topics), :]
        selection["text"] = ""
        selection.loc[len(selection), :] = [None, None, selection.x.mean(), selection.y.mean(), "Other documents"]



        all_traces = []
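        # Build two sets of traces: level 0 uses the default colour per topic,
        # level 1 colours each document by its ntp-ftp-stp class via color_map2
        for level in range(2): #CHANGED FROM HERE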
            traces = []


            if level == 0:

                # Selected topics
                for name, topic in zip(names, unique_topics):
                    if topic in topics and topic != -1:
                        selection = df.loc[df.topic == topic, :]
                        selection["text"] = ""

                        if not hide_annotations:
                            selection.loc[len(selection), :] = [None, None, selection.x.mean(), selection.y.mean(), name]

                        traces.append(
                            go.Scattergl(
                                x=selection.x,
                                y=selection.y,
                                hovertext=selection.doc if not hide_document_hover else None,
                                hoverinfo="text",
                                text=selection.text,
                                mode='markers+text',
                                name=name,
                                textfont=dict(
                                    size=12,
                                ),
                                marker=dict(size=5, opacity=0.5)
                            )
                        )

            elif level == 1: 
                # Selected topics
                for name, topic in zip(names, unique_topics):
                    if topic in topics and topic != -1:
                        selection = df.loc[df.topic == topic, :]
                        selection["text"] = ""

                        if not hide_annotations:
                            selection.loc[len(selection), :] = [None, None, selection.x.mean(), selection.y.mean(), name]


                        traces.append(
                            go.Scattergl(
                                x=selection.x,
                                y=selection.y,
                                hovertext=selection.doc if not hide_document_hover else None,
                                hoverinfo="text",
                                text=selection.text,
                                mode='markers+text',
                                name=name,
                                textfont=dict(size=12),
                                marker=dict(size=5, opacity=0.5, color = [color_map2[doc] if doc is not None else [255, 255, 255] for doc in selection.doc]) #TO HERE
                            )
         
                        )
            all_traces.append(traces)



        # Track and count traces
        nr_traces_per_set = [len(traces) for traces in all_traces]
        trace_indices = [(0, nr_traces_per_set[0])]
        for index, nr_traces in enumerate(nr_traces_per_set[1:]):
            start = trace_indices[index][1]
            end = nr_traces + start
            trace_indices.append((start, end))

        # Visualization
        fig = go.Figure()
        for traces in all_traces:
            for trace in traces:
                fig.add_trace(trace)

        for index in range(len(fig.data)):
            if index >= nr_traces_per_set[0]:
                fig.data[index].visible = False

        # Create and add slider
        steps = []
        for index, indices in enumerate(trace_indices):
            step = dict(
                method="update",
                label=str(index),
                args=[{"visible": [False] * len(fig.data)}]
            )
            for index in range(indices[1]-indices[0]):
                step["args"][0]["visible"][index+indices[0]] = True
            steps.append(step)

        sliders = [dict(
            currentvalue={"prefix": "Colour mode: "},
            pad={"t": 20},
            steps=steps
        )]
        

        # Add grid in a 'plus' shape
        x_range = (df.x.min() - abs((df.x.min()) * .15), df.x.max() + abs((df.x.max()) * .15))
        y_range = (df.y.min() - abs((df.y.min()) * .15), df.y.max() + abs((df.y.max()) * .15))
        fig.add_shape(type="line",
                    x0=sum(x_range) / 2, y0=y_range[0], x1=sum(x_range) / 2, y1=y_range[1],
                    line=dict(color="#CFD8DC", width=2))
        fig.add_shape(type="line",
                    x0=x_range[0], y0=sum(y_range) / 2, x1=x_range[1], y1=sum(y_range) / 2,
                    line=dict(color="#9E9E9E", width=2))
        fig.add_annotation(x=x_range[0], y=sum(y_range) / 2, text="D1", showarrow=False, yshift=10)
        fig.add_annotation(y=y_range[1], x=sum(x_range) / 2, text="D2", showarrow=False, xshift=10)

        # Stylize layout
        fig.update_layout(
            sliders=sliders,
            template="simple_white",
            title={
                'text': f"{title}",
                'x': 0.5,
                'xanchor': 'center',
                'yanchor': 'top',
                'font': dict(
                    size=22,
                    color="Black")
            },
            width=width,
            height=height
        )

        fig.update_xaxes(visible=False)
        fig.update_yaxes(visible=False)
        return fig 
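
The function above takes `self`, so it has to run as a method of a fitted model. The approach described in this notebook is to add it to BERTopic itself. As an alternative sketch only, and not the approach used here, a function can also be bound to a single object at runtime with `types.MethodType`. The class `StandInModel` and the shortened function body below are hypothetical placeholders so that the example runs without BERTopic:

```python
import types


class StandInModel:
    """Hypothetical stand-in for a fitted topic model."""


def modified_visualize_documents(self, docs, color_map2=None, keep_list3=None):
    # Placeholder body; the real function is the one defined in the cell above.
    return f"{type(self).__name__}: {len(docs)} docs"


model = StandInModel()

# Bind the function to this one instance so that `self` is filled in automatically.
model.modified_visualize_documents = types.MethodType(
    modified_visualize_documents, model
)

result = model.modified_visualize_documents(["doc a", "doc b"])
```

Binding to one instance leaves the installed library untouched, which may be easier to reproduce than editing its files.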

<h4> Erowid Quotes </h4>

The data behind [Erowid Quotes](https://akseli-ilmanen.github.io/BSc-Dissertation/) was created in this cell.

In [None]:
import pandas as pd

#CSV file for Erowid Quotes finder tool

#for urls with no placeholders - find quote on the Erowid page
def prepare_url(s, url):
    s = s.replace(" .", ".").replace(" ,", ",").replace(" i ", " I ")
    words = s.split()  # split the string into a list of words
    first_three_words = ' '.join(words[:3])  # join the first three words with a space
    last_three_words = ' '.join(words[-3:])  # join the last three words with a space
    updated_url = url + "#:~:text=" + first_three_words.replace(" ", "%20") + "," + last_three_words.replace(" ", "%20")
    return (updated_url)



substances_classes_list_lowercase_subst = ["Serotonergic psychedelics", "Dissociative psychedelics", "Entactogens", "Deliriants", "Depressant sedatives", "Stimulants", "Antidepressants antipsychotics", "lsd", "psilocybin mushrooms", "dmt", "mdma", "cannabis spp", "salvia divinorum"]


#create an empty DataFrame that collects the representative quotes
df4 = pd.DataFrame(columns=["Topic Nr", "Topic", "Substance", "Class", "Document"])


for substance_or_class in substances_classes_list_lowercase_subst:
    print(substance_or_class)

    #get temp df
    temp_df2  = pd.read_pickle(f"BERTopic files/BERTopic docs/BERTopic_df2_{substance_or_class}.pkl")



    #for substances "LSD", "DMT", "Psilocybin mushrooms", "MDMA", "Cannabis spp", "Salvia divinorum" use own topics (not class topics)
    large_substances = ["lsd", "dmt", "psilocybin mushrooms", "mdma", "cannabis spp", "salvia divinorum"]
    if substance_or_class.lower() not in large_substances:
        temp_df2 = temp_df2[~temp_df2.filter(items=['substance']).isin(large_substances).any(axis=1)]

    #remove outlier rows
    #temp_df2 = temp_df2[temp_df2.Topic != -1]

    #sort df2 by highest probabilities
    temp_df2 = temp_df2.sort_values(by=["Probability"], ascending=False)
    temp_df2.reset_index(drop=True, inplace=True) 

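    #set custom labels for the topic model
    #(temp_topic_model is assumed to be the fitted BERTopic model for this substance or class; loading it is not shown here)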
    labels = temp_topic_model.generate_topic_labels(nr_words=3, topic_prefix=True, word_length=15, separator=" - ")
    temp_topic_model.set_topic_labels(labels)



    
    substance_topics_combos = []
 
    for i, topic_nr in enumerate(temp_df2.Topic):
        if topic_nr != -1:
            substance = temp_df2.loc[i, "substance"]
            substance_topics_combo = str(topic_nr) + substance
            substance_topics_combos.append(substance_topics_combo)
            if substance_topics_combos.count(substance_topics_combo) <= 10: 
                class_ = temp_df2.loc[i, "classes"]
                url = temp_df2.loc[i, "url"]
                doc = temp_df2.loc[i, "Document"]
                prob = round(temp_df2.loc[i, "Probability"], 3)
                #get topic labels from topic_info df
                if  len(str(topic_nr)) == 1: 
                    topic = "00" + temp_df2.loc[i, "CustomName"]
                elif len(str(topic_nr)) == 2: 
                    topic = "0" + temp_df2.loc[i, "CustomName"]
                else:
                    topic = temp_df2.loc[i, "CustomName"]
                #change url so it leads to the quote on the Erowid page
                if any(substring in doc for substring in ["miranda", "megan", "matt", "alexa"]): #exclude names not removed by Spacy in pre-processing
                    pass 
                elif any(substring in doc for substring in ["PERSON", "ORG", "GPE", "LOC"]):
                    doc = "..." + doc + f"...- TBS:{prob}  (NO URL)" 
                else:
                    url = prepare_url(doc, url)    
                    doc = "..." + doc + f"... - TBS:{prob}  (<a href={url}>URL</a>)"       

                df4.loc[len(df4.index)] = [topic_nr, topic, substance, class_, doc]

#save
df4.to_csv("Representative Quotes Per Topic-Substance.csv")
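
`prepare_url` above builds a text-fragment link of the form `#:~:text=start,end` and only replaces spaces with `%20`. As a hedged sketch of a stricter variant, the version below percent-encodes every reserved character with `urllib.parse.quote` and additionally encodes `-`, which has a special meaning inside text fragments. The function name `build_text_fragment_url`, the parameter `n_words` and the example URL are hypothetical, and the punctuation clean-up done in `prepare_url` is left out for brevity:

```python
from urllib.parse import quote


def encode_fragment_part(text: str) -> str:
    """Percent-encode text for use inside a text fragment."""
    # quote() with safe="" encodes spaces, commas and ampersands;
    # "-" is left alone by quote(), so it is encoded separately.
    return quote(text, safe="").replace("-", "%2D")


def build_text_fragment_url(quote_text: str, url: str, n_words: int = 3) -> str:
    """Link to a quote via its first and last n_words words."""
    words = quote_text.split()
    start = " ".join(words[:n_words])
    end = " ".join(words[-n_words:])
    return f"{url}#:~:text={encode_fragment_part(start)},{encode_fragment_part(end)}"


example = build_text_fragment_url(
    "well-known trip report from a long night",
    "https://example.org/report.html",  # hypothetical URL
)
```

As in `prepare_url`, a quote with fewer than six words yields overlapping start and end snippets, so very short quotes may not be highlighted reliably.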