# Text Analysis with Large Language Models

As we have seen in the previous notebooks, Large Language Models are very powerful text processing engines. We can therefore prompt them to do tasks beyond text summarization for us.

In this notebook, we will use an LLM to tackle the text analysis tasks we have previously implemented in [Session 2](../../../02-text-by-the-numbers/).


## Loading our API key

At this point, you should have set up a file named `secrets.env` containing your OpenAI API key. We will now use a lightweight Python package, `python-dotenv` (imported as `dotenv`), to read this file and set its contents as environment variables:


In [None]:
from dotenv import load_dotenv
import os

load_dotenv("../secrets.env")

# Check that the key was loaded. Do not print the key itself! We want to keep it secret
os.getenv("OPENAI_API_KEY") is not None

## Choosing a model

Since we want to process a rather long text all at once, let's pick a model with a large context window:


In [None]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)

## Loading a document

Pick any document you like! To keep the code from getting too complex, you may want to pick a single text of fewer than 10,000 words.


In [None]:
from pathlib import Path

docs_dir = Path(
    "~/shared/RR-workshop-data/federalist-papers-dataset/split"
).expanduser()

with open(docs_dir / "federalist_1.txt") as f:
    doc = f.read()
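
To check whether a document stays under that limit, a rough word count is usually enough. The following is a minimal sketch: the helper `word_count` and the sample string are illustrative and not part of the original notebook. Splitting on whitespace only approximates the number of tokens the model will actually see.

```python
def word_count(text: str) -> int:
    """Return a rough word count by splitting on whitespace."""
    return len(text.split())


# Illustrative sample; in the notebook you would call word_count(doc) instead.
sample = "AFTER an unequivocal experience of the inefficacy of the subsisting federal government"
n_words = word_count(sample)
print(n_words, n_words < 10_000)
```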

## Tokenization

When prompting the LLM to tokenize the text, try to add an instruction that makes the output easy to process! For example, you could ask for the output as a comma-separated list of values.


In [None]:
tokenization_prompt = (
    "Tokenize the following text into a comma-separated list of words: \n\n" + doc
)
tokenized = llm.predict(tokenization_prompt)
tokenized
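
If the model follows the instruction, the response is a single string that can be split into a Python list. The sketch below assumes a well-formed comma-separated response; `parse_token_list` and the sample response are hypothetical and only meant to illustrate the idea. Real responses may deviate from the requested format, so it is worth inspecting the result.

```python
def parse_token_list(raw: str) -> list[str]:
    """Split a comma-separated string into tokens, dropping empty entries."""
    return [token.strip() for token in raw.split(",") if token.strip()]


# Hypothetical LLM response; in the notebook you would pass `tokenized` instead.
sample_response = "To, the, People, of, the, State, of, New, York"
tokens = parse_token_list(sample_response)
print(tokens)
```

Note that splitting on commas means commas in the original text cannot survive as tokens of their own.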

## Stopword removal

For stopword removal, we can try several approaches: we could provide a list of stopwords to remove, or let the LLM decide entirely on its own which words count as stopwords.


In [None]:
stopwords_removed = llm.predict(
    "Remove all stop words from the following tokenized text: \n\n" + tokenized
)
stopwords_removed
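
The cell above lets the LLM decide what counts as a stopword. The other approach, providing an explicit list, could look roughly like the sketch below. The helper `build_stopword_prompt`, the prompt wording, and the short stopword list are assumptions made for illustration; they are not taken from the original notebook.

```python
def build_stopword_prompt(tokenized_text: str, stopwords: list[str]) -> str:
    """Build a prompt that tells the model exactly which words to remove."""
    stopword_list = ", ".join(sorted(set(stopwords)))
    return (
        "Remove the following stop words from the tokenized text below. "
        "Remove only these words and keep the comma-separated format.\n\n"
        f"Stop words: {stopword_list}\n\n"
        f"Text: {tokenized_text}"
    )


# Illustrative stopword list and input; in the notebook you would pass `tokenized`.
custom_stopwords = ["the", "of", "to", "a", "and"]
prompt = build_stopword_prompt("To, the, People, of, the, State", custom_stopwords)
print(prompt)
# The prompt could then be sent with llm.predict(prompt), as in the cell above.
```

An explicit list makes the step easier to reproduce and to compare against a conventional stopword filter.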

## Stemming


In [None]:
stemmed = llm.predict(
    "Apply stemming to the following tokenized text: \n\n" + stopwords_removed
)
stemmed

## Lemmatization


In [None]:
lemmatized = llm.predict(
    "Apply lemmatization to the following tokenized text: \n\n" + stopwords_removed
)
lemmatized

## Topic modeling


In [None]:
topics = llm.predict(
    "List the 5 main topics discussed in the following text: \n\n" + doc
)
print(topics)

## Discussion

While it took relatively little work to achieve the results above, we have no insight into _how_ the LLM arrived at its output. This makes it much harder to reproduce the results, to discuss them, and to reason about them.

However, we can also create workflows that combine conventional analysis techniques with the Large Language Model's ability to generate natural language.

A great example of such a workflow is [BERTopic](https://maartengr.github.io/BERTopic/api/bertopic.html): This technique first extracts topics in a way similar to what we have seen in Session 2, but then leverages a Large Language Model to find natural language representations for those topics. So instead of a collection of keywords that represents the topic, you end up with a more intelligible title, close to what we saw above. In contrast to letting the LLM do the entire topic modeling, we can backtrack through the analysis and identify which words or passages support each topic.
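
The division of labor in such a hybrid workflow can be sketched in a few lines. This is not how BERTopic works internally; it is a deliberately simplified illustration in which plain word frequencies stand in for the topic extraction step, and the LLM would only be asked to name the result. The helpers `top_keywords` and `build_label_prompt`, the sample text, and the stopword set are all hypothetical.

```python
from collections import Counter
import re


def top_keywords(text: str, stopwords: set[str], k: int = 5) -> list[str]:
    """Return the k most frequent lowercase words that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in stopwords)
    return [word for word, _ in counts.most_common(k)]


def build_label_prompt(keywords: list[str]) -> str:
    """Ask the model for a short title that summarizes the given keywords."""
    return (
        "Suggest a short, descriptive title for a topic described by "
        "these keywords: " + ", ".join(keywords)
    )


# Illustrative text and stopword list; neither comes from the workshop data.
sample_text = (
    "The union protects liberty. A strong union secures liberty and "
    "the government of the union."
)
sample_stopwords = {"the", "a", "and", "of"}
keywords = top_keywords(sample_text, sample_stopwords, k=2)
label_prompt = build_label_prompt(keywords)
print(keywords)
print(label_prompt)
```

Because the keywords come from a transparent counting step, each one can be traced back to the passages in which it occurs, while the LLM contributes only the readable title.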


<table >
<tbody>
  <tr>
    <td style="padding:0px;border-width:0px;vertical-align:center">    
    Created by Simon Stone for Dartmouth College Library under <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons CC BY-NC 4.0 License</a>.<br>For questions, comments, or improvements, email <a href="mailto:researchdatahelp@groups.dartmouth.edu">Research Data Services</a>.
    </td>
    <td style="padding:0 0 0 1em;border-width:0px;vertical-align:center"><img alt="Creative Commons License" src="https://i.creativecommons.org/l/by/4.0/88x31.png"/></td>
  </tr>
</tbody>
</table>
