## Exploring the OpenAI API

In this notebook we continue exploring the OpenAI API. Through week three of the semester, our content in Text Mining hasn't been especially amenable to using OpenAI as a tool to actually _do_ the text mining for us. This week we will use the API to do normalization and tokenization and to carry out the work in the "Patterns in Text" assignment.

Once you start using the API, you start incurring costs. The pricing structure is outlined on the [pricing page](https://openai.com/pricing); make sure to read it. OpenAI also provides a [tokenizer tool](https://platform.openai.com/tokenizer) that shows how a piece of text breaks into tokens.
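Before sending a lot of text through the API, it can help to do a quick back-of-the-envelope cost estimate. The sketch below assumes a flat rate per 1,000 tokens; `estimate_cost` is a hypothetical helper and the rate shown is a made-up placeholder, so check the pricing page for the real number for your model.

```python
def estimate_cost(n_tokens, price_per_1k_tokens):
    """Return the approximate cost in dollars for n_tokens at a flat per-1,000-token rate."""
    return n_tokens / 1000 * price_per_1k_tokens


# Placeholder rate for illustration only (an assumption, not a real price).
example_price_per_1k = 0.002

# Rough cost of pushing 250,000 tokens through the API at that rate.
print(f"${estimate_cost(250_000, example_price_per_1k):.2f}")
```

Paste a sample of your text into the tokenizer tool to get a feel for how many tokens it contains, then scale up.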


In [None]:
import os
import openai

You need to tell the API module about your key. Either method below will work. The environment variable is better because your key never appears in plain text. Feel free to hard-code the key for now, but be careful not to push any code to GitHub with the API key visible.

In [None]:
#openai.api_key = "sk-..."  # paste your own key here; never commit a real key
openai.api_key = os.getenv("OPENAI_API_KEY")

# If you want to set up the environment variable, try a prompt like this:
prompt = """
My professor has a line of code like this: 

openai.api_key = os.getenv("OPENAI_API_KEY")

Can you tell me how to set this up on my system? 
"""

In [None]:
# Here's the example from last class (uses the pre-1.0 `openai` package interface).
response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are some famous astronomical observatories?"}
  ],
  temperature=0,
  max_tokens=1024
)

---

Now we'll read in Beowulf and play around with tokenizing it.

In [None]:
with open("../data/beowulf.txt", "r") as file:
    beowulf = file.read()

In [None]:
beowulf[:100]

In [None]:
len(beowulf.split())

In [None]:
def count_tokens_in_text(text_chunk):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an expert in NLP."},
            {"role": "user", "content": 
             f"""How many tokens are in the following text?
                 Please reply with the number of tokens enclosed in brackets.
                 Here's an example: "there are [XXX] tokens in this text."
                 
                 Here's the text:"
                 \n\n {text_chunk}"""}
        ],
        temperature=0,
        max_tokens=1024
    )
    return response.choices[0].message["content"] 

def get_most_common_token(text_chunk, remove_stopwords=True):

    if remove_stopwords:
        user_prompt = f"""What is the most common token in the following text? 
                          Please remove stopwords.\n\n \"{text_chunk}\" """
    else:
        user_prompt = f"What is the most common token in the following text?\n\n \"{text_chunk}\""
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an expert in NLP."},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0,
        max_tokens=1024
    )
    return response.choices[0].message["content"] 


def clean_text(text_chunk):
    
    user_prompt = f"""I am going to give you some text. Please perform the following
        transformations on the text and return only the transformed text. 
        
        1. Cast everything to lowercase
        2. Remove all marks of punctuation
        3. Remove stopwords
        4. Convert all whitespace characters to spaces
        5. Remove any tokens that contain non-alphabetic characters
        
        Here is the text:\n
         {text_chunk} """
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an expert in NLP."},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0,
        max_tokens=2048
    )
    return response.choices[0].message["content"] 
    


In [None]:
# Split the text into fixed-size chunks of 1,000 characters each.
# Slicing by character count can cut a word in half at a chunk boundary.
chunk_size = 1000
text_chunks = [beowulf[i:i+chunk_size] for i in range(0, len(beowulf), chunk_size)]
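Since we are asking the model about tokens, a chunk that starts or ends in the middle of a word can muddy the counts. One alternative is to build chunks out of whole words. The sketch below is one way to do that; `chunk_by_words` is a hypothetical helper, and note that it collapses line breaks and runs of whitespace into single spaces.

```python
def chunk_by_words(text, max_chars=1000):
    """Split text into chunks of roughly max_chars characters without cutting words.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks = []
    current = []       # words collected for the chunk being built
    current_len = 0    # length of " ".join(current)
    for word in text.split():
        # The extra 1 is the space that would join this word to the previous one.
        added_len = len(word) + (1 if current else 0)
        if current and current_len + added_len > max_chars:
            chunks.append(" ".join(current))
            current = [word]
            current_len = len(word)
        else:
            current.append(word)
            current_len += added_len
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Calling `chunk_by_words(beowulf, max_chars=chunk_size)` would give chunks comparable in size to `text_chunks`, each ending on a word boundary.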



In [None]:
print(text_chunks[1])

In [None]:
count_tokens_in_text(text_chunks[5])

In [None]:
len(text_chunks[5].split())
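The model replies with a sentence, so comparing its answer to the whitespace count takes one more step: pulling the number out of the brackets we asked for. Here is a minimal sketch using a regular expression; `extract_bracketed_count` is a hypothetical helper, and it returns `None` if the model ignored the requested format.

```python
import re


def extract_bracketed_count(reply):
    """Return the first integer found inside square brackets, or None if there is none."""
    # Allow optional spaces and thousands separators, e.g. "[ 1,024 ]".
    match = re.search(r"\[\s*(\d[\d,]*)\s*\]", reply)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

You could then compare `extract_bracketed_count(count_tokens_in_text(text_chunks[5]))` directly with `len(text_chunks[5].split())`.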

In [None]:
get_most_common_token(text_chunks[2], remove_stopwords=True)

In [None]:
get_most_common_token(text_chunks[1], remove_stopwords=False)
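To judge whether the model's answer is right, it helps to have a local count to compare against. The sketch below uses `collections.Counter` on lowercased, whitespace-split tokens; `most_common_token_local` is a hypothetical helper, and because it does not strip punctuation, "king," and "king" are counted as different tokens.

```python
from collections import Counter


def most_common_token_local(text, stopwords=frozenset()):
    """Return (token, count) for the most frequent lowercased whitespace token, or None."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    if not tokens:
        return None
    return Counter(tokens).most_common(1)[0]
```

Pass a set of stopwords as the second argument to mirror the `remove_stopwords=True` call.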

In [None]:
clean_text(text_chunks[2])
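The five steps in the `clean_text` prompt are easy to do locally, which gives us something to check the model's output against. This is a sketch under a few assumptions: `clean_text_local` is a hypothetical helper, the stopword list is a tiny made-up one (the model will use whatever it considers stopwords), and `string.punctuation` only covers ASCII punctuation.

```python
import string

# Tiny illustrative stopword list (an assumption, not the list the model uses).
EXAMPLE_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was"}


def clean_text_local(text, stopwords=EXAMPLE_STOPWORDS):
    """Apply the five cleaning steps from the prompt locally and return the cleaned string."""
    text = text.lower()                                                # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # 2. remove punctuation
    tokens = text.split()                                              # 4. split on any whitespace
    tokens = [t for t in tokens if t not in stopwords]                 # 3. remove stopwords
    tokens = [t for t in tokens if t.isalpha()]                        # 5. keep alphabetic tokens only
    return " ".join(tokens)                                            # rejoin with single spaces
```

Comparing `clean_text_local(text_chunks[2])` with `clean_text(text_chunks[2])` shows where the model skipped a step or dropped extra words.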