# A word about the Transformer's input embeddings: Word2Vec

Let's start this notebook by looking at Word2Vec, a model that learns the kind of word embeddings a Transformer model takes as input.

We will show how to train a Word2Vec model whose vectors can be used as the input embedding layer of a Transformer model.

The output of Word2Vec is a collection of word vectors. Vectors that lie close together in the vector space belong to words with similar meanings, as inferred from their contexts, while vectors that lie far apart belong to words that are used in different contexts.

The intuition behind this model is that words that appear in similar contexts are semantically similar.
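"Close" in vector space is usually measured with cosine similarity. The sketch below shows the idea on three made-up 3-dimensional vectors; the `toy_vectors` values and the `cosine_similarity` helper are illustrative assumptions, not output of a trained model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical embeddings, invented for illustration only
toy_vectors = {
    "film": [0.9, 0.8, 0.1],
    "movie": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_close = cosine_similarity(toy_vectors["film"], toy_vectors["movie"])
sim_far = cosine_similarity(toy_vectors["film"], toy_vectors["banana"])
print(f"film vs movie:  {sim_close:.3f}")
print(f"film vs banana: {sim_far:.3f}")
```

Words used in similar contexts should end up with a similarity near 1, unrelated words with a much lower value.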

Let's start by importing the required libraries and loading the data.

In [1]:
import nltk
import pandas as pd
from gensim.models import Word2Vec
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/diogonunofreitas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/diogonunofreitas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
corpus = pd.read_csv('imdb.csv')

## Tokenization and Stopword Removal

We will start by tokenizing the text into words, which is required for training the Word2Vec model.

After that, it is recommended to remove stopwords and punctuation from the text before training the Word2Vec model. Stopwords are words that appear very frequently in the English language, such as "the", "a", "an", etc.
These words carry little information about the meaning of a text, so it is recommended to remove them.

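After removing the stopwords, we will lemmatize the words. Lemmatization is the process of converting a word to its base form. For example, the verb "running" is converted to "run". This reduces the number of unique words in the corpus, which in turn reduces the size of the Word2Vec model. Note that NLTK's `WordNetLemmatizer` needs the `wordnet` corpus (`nltk.download('wordnet')`) and, when called without a part-of-speech tag as in the function below, treats every word as a noun: it maps "movies" to "movie" but leaves "running" unchanged unless `pos='v'` is passed.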

In [3]:
def process_text(text):
    """
    Tokenize the text, remove stopwords and punctuation, and lemmatize the words
    :param text: the text to process
    :return: the list of lowercased, lemmatized tokens without stopwords or punctuation
    """
    # Tokenize the text
    tokenized_text = word_tokenize(text.lower())
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_text = [word for word in tokenized_text if word not in stop_words]
    # Remove punctuation
    filtered_text = [word for word in filtered_text if word.isalpha()]
    # Lemmatize words
    lemmatizer = nltk.stem.WordNetLemmatizer()
    filtered_text = [lemmatizer.lemmatize(word) for word in filtered_text]
    return filtered_text
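To see what the first three steps do to a single sentence without downloading any NLTK data, here is a minimal stand-in. It is a sketch under stated assumptions: the regular expression replaces `word_tokenize`, `TOY_STOPWORDS` is a tiny hypothetical list rather than NLTK's English list, and lemmatization is left out.

```python
import re

# Tiny hypothetical stopword list; NLTK's English list is much longer
TOY_STOPWORDS = {"the", "a", "an", "was", "is", "but", "and"}


def toy_process_text(text):
    """Lowercase, tokenize, drop stopwords, then drop non-alphabetic tokens."""
    # Split into runs of word characters and single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Remove stopwords
    tokens = [token for token in tokens if token not in TOY_STOPWORDS]
    # Remove punctuation and anything else that is not purely alphabetic
    return [token for token in tokens if token.isalpha()]


print(toy_process_text("The movie was great, but the ending was a mess!"))
# ['movie', 'great', 'ending', 'mess']
```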

In [4]:
# Apply the function to the corpus to get a list of tokenized reviews
tokenized_reviews = [process_text(text) for text in corpus['Review']]

## Training the Word2Vec Model

In [5]:
# Train the Word2Vec model on the tokenized reviews with the CBOW (Continuous Bag of Words) algorithm (sg=0).
# CBOW predicts the central word (the target word) from the context words that surround it.
# Take the sentence "the quick brown fox jumps over the lazy dog". With one context word on each side
# of the target, the (context window, target word) pairs are ([quick, fox], brown), ([the, brown], quick),
# ([the, dog], lazy), and so forth. The model's goal is to predict the target word from the context words.
# Here window=5 allows up to five context words on each side of the target.
model = Word2Vec(tokenized_reviews, window=5, sg=0)
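The (context window, target word) pairs described in the comment can be generated with a few lines of plain Python. This is only a sketch of the pair-construction step, not gensim's training code; `cbow_pairs` is a hypothetical helper and `window` counts the words taken on each side of the target.

```python
def cbow_pairs(tokens, window=1):
    """Return (context_words, target_word) pairs with `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = cbow_pairs(sentence, window=1)
for context, target in pairs:
    print(context, "->", target)
```

Words at the edges of the sentence simply get a shorter context, for example `['quick'] -> the`.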

In [6]:
# Word to look up
word = 'shoot'

# Get the vector for a word
word_vector = model.wv[word]

# Find similar words
similar_words = model.wv.most_similar(word, topn=5)

# Print the word vector and similar words
print(f"Vector for '{word}': {word_vector}")
print(f"Similar words to '{word}': {similar_words}")

Vector for 'shoot': [-1.601348    0.09291331 -0.2091694  -0.20012675 -0.31293353 -0.98299575
 -1.4431851   0.86283994  0.5607004  -0.20436175 -0.1266774  -0.96774113
 -0.10618712 -0.7847869   0.719473   -1.0028324   0.64026845 -0.5788989
  0.97191536  0.6465201  -0.50941294 -0.44322893 -1.1775478  -0.9309422
  0.509612   -0.00369132 -0.825812    1.200712   -0.19466943  0.35298246
  1.0321238   0.6536332  -1.3436902   0.6521783  -0.9366522  -0.304327
 -0.00662516 -0.08758615  0.11713576 -0.41246325 -0.49395525 -0.3112347
  0.89092785  0.5860399  -0.6303957  -0.5766113  -0.13125284  1.4730711
 -0.8066987  -1.042017    0.6502339  -0.4309696   0.68188727  0.3263426
  0.43316984  0.21201862 -0.25643376  0.9354706   0.59966797 -0.3857056
  0.16552098  0.00366997 -0.39653325 -0.66316974 -0.02419852 -0.45007494
  0.41504377  1.2620479   0.43472221  1.7092502   0.4752763   0.829365
 -0.09821726 -1.4151767  -0.56747913  0.15081604 -1.1768506   0.65523297
  0.02793373 -1.4035088  -0.34144238 -0.3
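A lookup such as `model.wv[word]` raises a `KeyError` when the word is not in the model's vocabulary. With gensim's default `min_count=5`, words that occur fewer than five times in the corpus are dropped before training, so rare words have no vector. The sketch below imitates that filtering step on made-up data; `build_vocab` and `toy_reviews` are hypothetical and not part of gensim.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count=5):
    """Keep only the tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {token for token, n in counts.items() if n >= min_count}


# Made-up corpus: "good" and "film" occur 5 times, "shoot" and "scene" only twice
toy_reviews = [["good", "film"]] * 5 + [["shoot", "scene"]] * 2
vocab = build_vocab(toy_reviews, min_count=5)

print(sorted(vocab))
print("shoot" in vocab)
```

In gensim 4, a similar membership check on a trained model is `word in model.wv.key_to_index`, which avoids the exception.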

# OpenAI API

The rest of the notebook is dedicated to the OpenAI API. We will show how to use the API for sentiment analysis, topic inference, summarization, and question answering.

Let's start by installing and importing the OpenAI library.

In [7]:
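# The code below uses openai.ChatCompletion, which was removed in openai 1.0, so pin an earlier version
!pip install "openai<1"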



In [8]:
import openai

## Completion function

The following function will be used to get the completion from the OpenAI API.

In [9]:
def get_completion(prompt, model="gpt-3.5-turbo", temperature=0):
    """
    Get the completion from the OpenAI API
    :param prompt: the prompt to send to the API
    :param model: the model to use
    :param temperature: the temperature to use
    :return: the completion
    """
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature
    )
    completion = response.choices[0].message["content"]
    return completion
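The `temperature` parameter controls how random the generated text is: lower values make the output more focused and repeatable. A common way to picture it is that the model's scores (logits) for the next token are divided by the temperature before being turned into probabilities. The sketch below illustrates that picture with made-up logits; it is an assumption about the general mechanism, not code from the OpenAI API.

```python
import numpy as np


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by `temperature` (> 0)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the maximum for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()


# Hypothetical scores for three candidate next tokens
toy_logits = [2.0, 1.0, 0.5]
for t in (1.0, 0.5, 0.1):
    print(f"temperature={t}: {softmax_with_temperature(toy_logits, t).round(3)}")
```

As the temperature approaches 0, almost all probability mass moves to the highest-scoring token, which is why `temperature=0` gives near-deterministic answers.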

## Configuration of the OpenAI API

Before we can use the API, we need to configure it with our API key. The API key will be stored in a file called `.env`, which will be loaded by the `dotenv` library and used to configure the API via environment variables.

The API key can be found in the OpenAI dashboard.

In [10]:
from dotenv import load_dotenv
import os

In [11]:
_ = load_dotenv('.env')

In [12]:
openai.api_key = os.getenv('OPENAI_API_KEY')
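`os.getenv` returns `None` when the variable is missing, and the problem then only surfaces at the first API call. A small guard makes the failure explicit. This is a sketch: `require_env` is a hypothetical helper, and `DEMO_API_KEY` is a placeholder variable set here only so that the example runs on its own.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file.")
    return value


# Placeholder variable for demonstration; in the notebook the value comes from .env
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the notebook, this would be used as `openai.api_key = require_env('OPENAI_API_KEY')`, with the `.env` file containing a line of the form `OPENAI_API_KEY=<your key>`.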

_Optional:_ Define a helper function to print contents in markdown format.

In [13]:
from IPython.display import Markdown, display

In [14]:
def printmd(string):
    """
    Print markdown content in the notebook.
    :param string: the markdown content
    """
    display(Markdown(string))

Now, we can test our API configuration by sending a prompt to the API and printing the completion.

In [15]:
printmd(get_completion(
    "Craft a concise invitation for the AI Summit on 13-09-2023 at 2pm at the Museu da Imprensa da Madeira, keeping in mind that the audience will be tech professionals, researchers, and enthusiasts, and the format should be in a standard invitation style."))

Dear [Recipient],

We are delighted to extend our warm invitation to the highly anticipated AI Summit, taking place on 13-09-2023 at 2pm. This prestigious event will be held at the esteemed Museu da Imprensa da Madeira, gathering tech professionals, researchers, and enthusiasts under one roof.

Join us for an enlightening afternoon dedicated to the latest advancements and trends in artificial intelligence. Immerse yourself in captivating discussions, thought-provoking presentations, and invaluable networking opportunities with industry experts and like-minded individuals.

Date: 13-09-2023
Time: 2pm
Venue: Museu da Imprensa da Madeira

We eagerly anticipate your presence at this remarkable gathering, where knowledge and innovation converge. Kindly RSVP by [RSVP date] to secure your seat at this exclusive event.

We look forward to welcoming you to the AI Summit and sharing an unforgettable experience.

Sincerely,

[Your Name]
[Your Organization]

**Now that we have configured the API, we can start using it to perform several tasks.**

## Sentiment Analysis

The following function will be used to perform sentiment analysis using the OpenAI API. Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral.

In [16]:
def get_sentiment(text):
    prompt = f"What is the sentiment of the following movie review, which is delimited with triple backticks? Give your answer as a single word, either 'positive', 'negative' or 'neutral'. Review text: ```{text}```"
    return get_completion(prompt)
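Even with an explicit instruction, the model may answer with different capitalization or trailing punctuation (for example "Negative."). If the labels are going to be counted or compared, it can help to normalize them first. A minimal sketch, with `normalize_sentiment` and the `"unknown"` fallback as hypothetical choices:

```python
import re

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def normalize_sentiment(raw):
    """Map a raw model reply to one of the allowed labels, or 'unknown'."""
    # Lowercase and drop everything that is not a letter (spaces, quotes, periods)
    cleaned = re.sub(r"[^a-z]", "", raw.strip().lower())
    return cleaned if cleaned in ALLOWED_SENTIMENTS else "unknown"


print(normalize_sentiment("Negative."))
print(normalize_sentiment(" 'positive' "))
print(normalize_sentiment("I am not sure"))
```

Replies that are not exactly one of the three labels are reported as `"unknown"` instead of being guessed.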

In [17]:
# Let's test the function with a few reviews from the corpus
for i in range(10):
    print(
        f"Review {i}: {corpus.iloc[i]['Review']}--> Sentiment: {get_sentiment(corpus.iloc[i]['Review'])} \n")

Review 0: this was an absolutely terrible movie  don t be lured in by christopher walken or michael ironside  both are great actors  but this must simply be their worst role in history  even their great acting could not redeem this movie s ridiculous storyline  this movie is an early nineties us propaganda piece  the most pathetic scenes were those when the columbian rebels were making their cases for revolutions  maria conchita alonso appeared phony  and her pseudo love affair with walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning  i am disappointed that there are movies like this  ruining actor s like christopher walken s good name  i could barely sit through it --> Sentiment: negative 
Review 1: i have been known to fall asleep during films  but this is usually due to a combination of things including  really tired  being warm and comfortable on the sette and having just eaten a lot  however on this occasion i fell asleep because the fil

## Inferring Topics

Topic inference is the task of identifying the main topics discussed in a piece of text.

In [18]:
# Read the suggestions from the text file
with open('desired-topics.txt', 'r') as f:
    suggestions = f.read()

In [19]:
prompt = f"""
Determine five topics that the participants of the MML '23 want to see discussed in the event. The suggestions are delimited by triple single quotes. Each line corresponds to a user suggestion.
Make each item one or two words long.
Format your response as a list of items separated by commas.

Suggestions: '''{suggestions}'''
"""

printmd(get_completion(prompt))

Machine Learning, Cloud AI, Generative AI, AI in Digital Marketing, Computer Vision
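Because the prompt asks for a comma-separated list, the reply can be turned into a Python list and checked against the formatting instructions. A minimal sketch, where `parse_topics` is a hypothetical helper and `reply` is the response shown above:

```python
def parse_topics(response):
    """Split a comma-separated reply into a list of trimmed, non-empty topics."""
    return [topic.strip() for topic in response.split(",") if topic.strip()]


reply = "Machine Learning, Cloud AI, Generative AI, AI in Digital Marketing, Computer Vision"
topics = parse_topics(reply)

# The prompt asked for items of one or two words; flag those that are longer
too_long = [topic for topic in topics if len(topic.split()) > 2]

print(topics)
print(too_long)
```

Here the check flags "AI in Digital Marketing", which shows that the model does not always follow length constraints exactly.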

# Summarization and Q&A 

The following section reads a PDF file and summarizes it using the OpenAI API. It also answers a few questions about the text.

In [20]:
import PyPDF2

In [24]:
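# Read the PDF file and extract the text.
# PdfReader, .pages and extract_text() replace the older PdfFileReader, numPages/getPage
# and extractText() names, which were removed in PyPDF2 3.0.
text = ""
with open('regulation.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    for page in pdf_reader.pages:
        text += page.extract_text()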


In [22]:
prompt = f"""Your task is to perform the following actions:
1 - Summarize the following text delimited by <> in at most 100 words. Focus on eligibility criteria, application process, and evaluation criteria.
2 - Translate the summary into French.
Text: <{text}>"""

printmd(get_completion(prompt))

Public Financing Regulation establishes guidelines for the allocation and management of public funds. Eligibility criteria include demonstrating public interest, complying with laws, and submitting a detailed proposal. The application process involves submitting an application form, presenting the proposal, and evaluation based on established criteria. Funds are allocated based on merit and alignment with public interest. Recipients are monitored for compliance and accountability measures are in place. Transactions involving public funds are documented and reports are published. An appeals and grievance mechanism is established. Non-compliance may result in suspension or legal action. The regulation is subject to review and amendments. Effective upon approval and publication. 

Réglementation sur le financement public établit des lignes directrices pour l'allocation et la gestion des fonds publics. Les critères d'éligibilité comprennent la démonstration d'un intérêt public, la conformité aux lois et la soumission d'une proposition détaillée. Le processus de demande implique la soumission d'un formulaire de demande, la présentation de la proposition et l'évaluation selon des critères établis. Les fonds sont alloués en fonction du mérite et de l'alignement avec l'intérêt public. Les bénéficiaires sont surveillés pour assurer la conformité et des mesures de responsabilité sont en place. Les transactions impliquant des fonds publics sont documentées et des rapports sont publiés. Un mécanisme d'appel et de plainte est établi. Le non-respect peut entraîner la suspension ou des poursuites judiciaires. La réglementation est sujette à révision et à des amendements. Entrée en vigueur après approbation et publication.
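The prompt above pastes the whole PDF text into a single request, which only works while the document fits in the model's context window. For longer documents, one common workaround is to split the text into chunks, summarize each chunk, and then summarize the summaries. The sketch below covers only the splitting step; `chunk_text` is a hypothetical helper and the `max_chars` default is an arbitrary illustrative value, not a documented API limit.

```python
def chunk_text(text, max_chars=8000):
    """Split text into chunks of at most `max_chars` characters, breaking on whitespace.

    A single word longer than `max_chars` is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            # Adding this word would exceed the limit: close the current chunk
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("aa bb cc dd", max_chars=5))
# ['aa bb', 'cc dd']
```

Each chunk could then be passed to `get_completion` with the summarization prompt, at the cost of one API call per chunk.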

In [23]:
prompt = f"""Your task is to answer the following questions, based on text delimited by <>:
1 - What are the eligibility criteria for the funding?
2 - What is the maximum amount of funding that can be requested?
Text: <{text}>"""

printmd(get_completion(prompt))

1 - The eligibility criteria for the funding include demonstrating a clear and compelling public interest or benefit, complying with all applicable laws and regulations, and submitting a detailed proposal outlining the project or program's goals, budget, and expected outcomes.
2 - The maximum amount of funding that can be requested is not specified in the given text.