<a href="https://colab.research.google.com/github/Madeira-International-Workshop-in-ML/2023-prompt-engineering-code/blob/student/open_ai_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A word about Word2Vec and word embeddings

Let's start this notebook by looking at Word2Vec, a word-embedding technique that illustrates what the input embedding layer of a Transformer model does.

We will show how to train a Word2Vec model to be used as an input embedding layer for a Transformer model.

The output of Word2Vec is a collection of word vectors. Vectors that lie close together in the vector space belong to words used in similar contexts, which tends to signal similar meanings; vectors that lie far apart belong to words used in different contexts.

The intuition behind this model is that words that appear in similar contexts are semantically similar.
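
Closeness between word vectors is usually measured with cosine similarity. The sketch below illustrates the idea with made-up three-dimensional vectors; the words and values are assumptions chosen by hand, not the output of a trained model (real Word2Vec vectors typically have 100 or more dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen by hand for illustration only
toy_vectors = {
    "film": [0.9, 0.1, 0.0],
    "movie": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_film_movie = cosine_similarity(toy_vectors["film"], toy_vectors["movie"])
sim_film_banana = cosine_similarity(toy_vectors["film"], toy_vectors["banana"])
print(f"film vs movie:  {sim_film_movie:.3f}")
print(f"film vs banana: {sim_film_banana:.3f}")
```

A value near 1 means the two vectors point in almost the same direction, while a value near 0 means they are nearly unrelated.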

Let's start by importing the required libraries and loading the data.

In [None]:
import nltk
import pandas as pd
from gensim.models import Word2Vec
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/diogonunofreitas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/diogonunofreitas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/diogonunofreitas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
corpus = pd.read_csv('imdb.csv')

## Tokenization and Stopword Removal

We will start by tokenizing the text into words, which is required for training the Word2Vec model.

After that, it is recommended to remove stopwords and punctuation from the text before training the Word2Vec model. Stopwords are words that appear very frequently in the English language, such as "the", "a", "an", etc.
These words carry little meaning on their own, so removing them lets the model focus on the more informative words.

After removing the stopwords, we will lemmatize the words. Lemmatization is the process of converting a word to its base form. For example, "running" is converted to "run" when it is treated as a verb. Note that NLTK's `WordNetLemmatizer`, used below, treats every word as a noun unless a part-of-speech tag is passed, so by default it mainly reduces plurals (for example, "movies" becomes "movie"). This is done to reduce the number of unique words in the corpus, which in turn reduces the vocabulary size of the Word2Vec model.

In [None]:
# Build the stopword set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = nltk.stem.WordNetLemmatizer()


def process_text(text):
    """
    Tokenize the text, remove stopwords and punctuation, and lemmatize the words
    :param text: the text to process
    :return: a list of lowercase, lemmatized tokens without stopwords or punctuation
    """
    # Tokenize the lowercased text
    tokenized_text = word_tokenize(text.lower())
    # Remove stopwords
    filtered_text = [word for word in tokenized_text if word not in STOP_WORDS]
    # Keep purely alphabetic tokens (drops punctuation and numbers)
    filtered_text = [word for word in filtered_text if word.isalpha()]
    # Lemmatize words (treated as nouns by default)
    return [LEMMATIZER.lemmatize(word) for word in filtered_text]

In [None]:
# Apply the function to the corpus to get a list of tokenized reviews
tokenized_reviews = [process_text(text) for text in corpus['Review']]

## Training the Word2Vec Model

In [None]:
# Train the Word2Vec model on the tokenized reviews using the CBOW (Continuous Bag of Words) training algorithm (sg=0).
# CBOW predicts the central word (the target word) from the context words that surround it.
# Take the sentence "the quick brown fox jumps over the lazy dog": we can form pairs of (context window, target word).
# With one context word on each side of the target (two context words in total), the pairs look like
# ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy), and so forth.
# The model's goal is to predict the target word using the context words as input.
# Here, window=5 allows up to five words on each side of the target.
model = Word2Vec(tokenized_reviews, window=5, sg=0)
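
The (context, target) pairs described in the comment above can be generated with a few lines of plain Python. This is a minimal sketch for illustration only: `cbow_pairs` is a hypothetical helper, not part of gensim, and it does not reproduce gensim's internal training procedure. With `window=1` (one word on each side) it yields pairs such as ([quick, fox], brown).

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs for a CBOW-style model.

    `window` is the maximum number of words taken on each side of the target.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = cbow_pairs(sentence, window=1)
for context, target in pairs[:4]:
    print(context, "->", target)
```

Words at the edges of the sentence simply get a shorter context, since there is nothing to the left of the first word or to the right of the last one.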

In [None]:
# Choose the word to look up
word = 'shoot'

# Get the vector for a word
word_vector = model.wv[word]

# Find similar words
similar_words = model.wv.most_similar(word, topn=5)

# Print the word vector and similar words
print(f"Vector for '{word}': {word_vector}")
print(f"Similar words to '{word}': {similar_words}")

Vector for 'shoot': [-1.6354848   0.02940555 -0.03065043  0.18312892 -0.29582706 -0.761482
 -1.370863    0.88662237  0.82866555 -0.2131901   0.20288606 -0.9310113
 -0.16546102 -0.9916288   0.7659614  -0.71086603  0.60045326 -0.9254737
  0.9910457   0.7306788  -0.43744752 -0.73582894 -1.2232022  -0.41490057
  0.23851106 -0.18415537  0.21422204  1.3858869   0.3235057   0.39649987
  0.37479663  0.3931285  -0.9485503   0.4688766  -0.7412996  -0.06449657
  0.5491741   0.12971085 -0.13941239 -0.30687982 -0.6296342  -0.7170775
  0.55186385  0.25287125 -0.4231826  -0.6350234   0.06494009  1.1232893
 -0.8005206  -0.592499    1.1786201  -0.89500403  0.4767624   0.38472432
  0.7104765   0.14611271 -0.04292127  1.2554514   0.6084654  -0.16617449
 -0.2234725   0.18799862 -0.18125822 -0.04208365 -0.3744834  -0.51432025
  0.5683422   1.230871    0.29570186  1.334292    0.67582184  0.8105357
  0.43456602 -1.1591966  -0.03199024  0.08510557 -1.3268903   0.43598405
 -0.43810228 -1.4577626  -0.34515244 -

# OpenAI API

The rest of the notebook is dedicated to the OpenAI API. We will show how to use the API for sentiment analysis and summarization.

Let's start by installing and importing the OpenAI library, as well as the other libraries required for this part of the notebook.

In [None]:
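# The code below uses the legacy `openai.ChatCompletion` interface, which was removed in openai 1.0
!pip install "openai<1"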
!pip install python-dotenv
!pip install pypdf

In [None]:
import openai

## Completion function

The following function will be used to get the completion from the OpenAI API.

In [None]:
def get_completion(prompt, model="gpt-3.5-turbo", temperature=0):
    """
    Get the completion from the OpenAI API
    :param prompt: the prompt to send to the API
    :param model: the model to use
    :param temperature: the temperature to use
    :return: the completion
    """
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature
    )
    completion = response.choices[0].message["content"]
    return completion
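
The `temperature` parameter controls how random the model's output is; the default of `0` used above keeps answers as repeatable as possible. A common way to picture this is a softmax over candidate-token scores divided by the temperature. The sketch below illustrates that general idea with made-up scores; it is an assumption-laden illustration, not a description of how the OpenAI API is implemented internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.5]
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures concentrate almost all probability on the highest-scoring candidate, while higher temperatures spread it more evenly and make sampled outputs more varied.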

## Configuration of the OpenAI API

Before we can use the API, we need to configure it with our API key. The API key will be stored in a file called `.env`, which will be loaded by the `dotenv` library and used to configure the API via environment variables.

The API key can be found in the OpenAI dashboard.

In [None]:
from dotenv import load_dotenv
import os

In [None]:
_ = load_dotenv('.env')

In [None]:
openai.api_key = os.getenv('OPENAI_API_KEY')
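
`os.getenv` returns `None` when a variable is missing, so a typo in `.env` only surfaces later as an authentication error. A small guard can fail earlier with a clearer message. This is a minimal sketch: `require_env` and `DEMO_API_KEY` are hypothetical names used for illustration, and the `.env` file is assumed to contain one `NAME=value` entry per line, such as `OPENAI_API_KEY=<your key>`.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add a line such as {name}=<your key> to .env "
            "and run load_dotenv again."
        )
    return value


# Demonstration with a hypothetical variable name, so no real key is needed
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same helper could be called with `'OPENAI_API_KEY'` in place of the plain `os.getenv` call.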

_Optional:_ Define a helper function to print contents in markdown format.

In [None]:
from IPython.display import Markdown, display

In [None]:
def printmd(string):
    """
    Print markdown content in the notebook.
    :param string: the markdown content
    """
    display(Markdown(string))

Now, we can test our API configuration by sending a prompt to the API and printing the completion.

In [None]:
printmd(get_completion("Craft a concise invitation for the AI Summit on 13-09-2023 at 2pm at the Museu da Imprensa da Madeira, keeping in mind that the audience will be tech professionals, researchers, and enthusiasts, and the format should be in a standard invitation style."))

**Now that we have configured the API, we can start using it to perform several tasks.**

## Sentiment Analysis

The following function will be used to perform sentiment analysis using the OpenAI API. Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral.

In [None]:
def get_sentiment(text):
    prompt = f"What is the sentiment of the following movie review, which is delimited with triple backticks? Give your answer as a single word, either 'positive', 'negative' or 'neutral'. Review text: ```{text}```"
    return get_completion(prompt)
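
Even when asked for a single word, a model may answer with extra punctuation, quotes, or a full sentence. The sketch below shows one possible way to map such answers onto the three expected labels; `normalize_sentiment` is a hypothetical helper, and the example answers are written by hand rather than taken from the API.

```python
VALID_LABELS = ("positive", "negative", "neutral")


def normalize_sentiment(raw_answer):
    """Map a free-form model answer to one of the expected labels, or 'unknown'."""
    cleaned = raw_answer.strip().strip("'\"`.!").lower()
    if cleaned in VALID_LABELS:
        return cleaned
    # Otherwise check whether any expected label appears in the answer,
    # testing the labels in the order they are listed in VALID_LABELS
    for label in VALID_LABELS:
        if label in cleaned:
            return label
    return "unknown"


# Hand-written example answers, for illustration only
for answer in ["Positive.", " 'negative' ", "The sentiment is neutral", "I cannot tell"]:
    print(repr(answer), "->", normalize_sentiment(answer))
```

Returning `'unknown'` instead of guessing makes it easy to spot and recheck reviews for which the model did not follow the requested format.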

In [None]:
# Let's test the function with the first ten reviews from the corpus
for i, review in enumerate(corpus['Review'].head(10)):
    print(f"Review {i}: {review} --> Sentiment: {get_sentiment(review)}\n")

## Inferring Topics

Topic inference is the task of identifying the main topics discussed in a piece of text.

In [None]:
# Read the suggestions from the text file
with open('desired-topics.txt', 'r') as f:
    suggestions = f.read()

In [None]:
prompt = f"""
Determine five topics that the participants of the MML '23 want to see discussed in the event. The suggestions are delimited by triple backticks. Each line corresponds to a user suggestion.
Make each item one or two words long.
Format your response as a list of items separated by commas.

Suggestions: ```{suggestions}```
"""

printmd(get_completion(prompt))
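
Because the prompt asks for a comma-separated list, the answer can be turned into a Python list for further processing. The following is a minimal sketch: `parse_topics` is a hypothetical helper, and `example_response` is a hand-written stand-in for a model answer, not real output.

```python
def parse_topics(response, expected=5):
    """Split a comma-separated answer into a clean list of topic strings."""
    topics = [item.strip(" \n\t.-*") for item in response.split(",")]
    topics = [topic for topic in topics if topic]
    if len(topics) != expected:
        print(f"Warning: expected {expected} topics, got {len(topics)}")
    return topics


# Hypothetical response, written by hand for illustration
example_response = "Prompt engineering, LLM evaluation, Computer vision, Ethics, MLOps."
topics = parse_topics(example_response)
print(topics)
```

The warning is only a sanity check: the model is not guaranteed to return exactly the requested number of items.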

# Summarization and Q&A

The following section reads a PDF file and summarizes it using the OpenAI API. It then answers a few questions about the text.

In [None]:
from pypdf import PdfReader

In [None]:
# Read the PDF file and extract the text of every page
reader = PdfReader("regulation.pdf")

text = ""
for page in reader.pages:
    text += page.extract_text()
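
Models accept only a limited amount of input, so the text of a long PDF may not fit in a single prompt. One possible workaround is to split the text into smaller pieces and process them one at a time. The sketch below is an illustration under stated assumptions: `split_into_chunks` is a hypothetical helper, and the default of 8000 characters is an arbitrary placeholder, since the real limit is measured in tokens and depends on the model.

```python
def split_into_chunks(text, max_chars=8000):
    """Split text into pieces of at most max_chars, breaking at whitespace when possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to break at the last space or newline inside the window
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut
        chunks.append(text[start:end].strip())
        start = end
    return [chunk for chunk in chunks if chunk]


# Small demonstration on synthetic text
sample = "word " * 100
chunks = split_into_chunks(sample, max_chars=50)
print(len(chunks), "chunks; longest has", max(len(c) for c in chunks), "characters")
```

Each chunk could then be summarized separately and the partial summaries combined in a final prompt.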

In [None]:
prompt = f"""Your task is to perform the following actions:
1 - Summarize the following text delimited by <> in at most 100 words. Focus on eligibility criteria, application process, and evaluation criteria.
2 - Translate the summary into French.
Text: <{text}>"""

printmd(get_completion(prompt))

In [None]:
prompt = f"""Your task is to answer the following questions, based on text delimited by <>:
1 - What are the eligibility criteria for the funding?
2 - What is the maximum amount of funding that can be requested?
Text: <{text}>"""

printmd(get_completion(prompt))