<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Language Models 1

**Description:** This lesson offers some examples of language models, giving a basic outline of concepts such as:

* Historical Approaches to NLP
* Word embeddings
* Transformers

Learners will use the Gensim and 🤗 Transformers libraries to explore aspects of language models including:

* Word Vectors
* Text Generation
* Sentiment Analysis
* Named Entity Recognition
* Question Answering
* Summarization

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion Time:** 75 minutes

**Knowledge Required:** 
* Python Basics
* Pandas Basics

**Knowledge Recommended:** 
* Python Intermediate
* Pandas Intermediate
* A Basic Grasp of Neural Networks

**Data Format:** None

**Libraries Used:** 
* [Gensim](https://radimrehurek.com/gensim/) - for examining word embeddings
* [🤗 Transformers](https://huggingface.co/docs/transformers/index) - provides APIs and tools to easily download and train pretrained models
* [PyTorch](https://pytorch.org/) - a popular machine learning framework
* [xFormers](https://github.com/facebookresearch/xformers) - for improving transformer computation speed

**Research Pipeline:** None
___

<h3 style="color:red; display:inline">Note:</h3>

Language models come in many sizes. The models for this notebook were tested on the given tasks, but for other models/tasks it is a good idea to check the model size and requirements. If you load or use a language model that is too big, you may fill all of the available space (10 GB) and/or memory (8 GB) in your lab. If the memory is full, try restarting the kernel (or restarting the lab). If the disk space is full, before deleting your own files, delete the .cache directory to clear out downloaded models from your space. You can do this by running the following code cell:


In [None]:
# Delete the .cache folder
!rm -r /home/jovyan/.cache/

In [None]:
# Check current disk space usage
!df -h /home/jovyan/
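
The same check can be done from Python. The following is a minimal sketch, assuming downloaded models live in a `.cache` folder inside your home directory (as in the cell above). The helper names `directory_size_bytes` and `to_gigabytes` are hypothetical and not part of any library.

```python
import os
import shutil
from pathlib import Path


def directory_size_bytes(path):
    """Return the total size of regular files under `path` (0 if it does not exist)."""
    root = Path(path)
    if not root.exists():
        return 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = Path(dirpath) / name
            try:
                # Skip symlinks so that linked files are not counted twice
                if not file_path.is_symlink():
                    total += file_path.stat().st_size
            except OSError:
                # Ignore files that disappear or cannot be read
                continue
    return total


def to_gigabytes(num_bytes):
    """Convert a byte count to gigabytes (1 GB = 1024**3 bytes here)."""
    return num_bytes / 1024**3


home = Path.home()
cache_dir = home / ".cache"  # assumed cache location

usage = shutil.disk_usage(home)
print(f"Disk free: {to_gigabytes(usage.free):.2f} GB of {to_gigabytes(usage.total):.2f} GB")
print(f".cache size: {to_gigabytes(directory_size_bytes(cache_dir)):.2f} GB")
```

Comparing the two numbers shows how much space would be recovered by deleting the cache.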

___

# Installations

In [None]:
# Install Click, a dependency for Gensim
!pip install click

In [None]:
# Install PyTorch (CPU-only build), a popular machine learning framework
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

In [None]:
# Install en_core_web_md, the spaCy English pipeline with word embeddings
#!python3 -m spacy download en_core_web_md

In [None]:
# Install 🤗 Transformers
!pip install transformers

In [None]:
# Install xFormers
!pip install xformers

In [None]:
# Install SentencePiece, a subword tokenizer and detokenizer for natural language processing
# that supports byte-pair encoding (BPE) and unigram language models
!pip install sentencepiece

In [None]:
# Install sacremoses, a Python port of the Moses tokenizer
!pip install sacremoses

# Import libraries

In [None]:
from transformers import pipeline, set_seed
import gensim
import gensim.downloader
import pandas as pd

# Word Vectors with Gensim

In [None]:
# List the pre-trained embeddings available from Gensim
list(gensim.downloader.info()['models'].keys())

In [None]:
# Download an example set of pretrained embeddings: glove-wiki-gigaword-100
# "Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/)"
# The final number in each model name is the number of dimensions in each word vector.
# More dimensions mean larger files but hopefully also more accurate vectors.
trained_model = gensim.downloader.load('glove-wiki-gigaword-100')

In [None]:
# Find the most similar words to a given word
trained_model.most_similar('fish')

In [None]:
# Try the most famous example used to describe Word Vectors
trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
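
To see what `most_similar` is doing with the `positive` and `negative` lists, here is a minimal sketch in plain Python. The three-dimensional vectors are invented for illustration (real GloVe vectors are learned from text), and the arithmetic is a simplified version of the idea rather than Gensim's exact implementation.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.2, 0.5, 0.5],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_most_similar(vectors, positive, negative, topn=3):
    """Add the positive vectors, subtract the negative ones, then rank the other words."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Leave the input words out of the ranking
    excluded = set(positive) | set(negative)
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


print(toy_most_similar(toy_vectors, positive=["woman", "king"], negative=["man"]))
```

With these toy values, "queen" ranks first because `king - man + woman` points in nearly the same direction as the "queen" vector.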

# Text Generation
By default, the 🤗 Transformers text generation pipeline uses the Generative Pre-trained Transformer 2 (GPT-2) model by [OpenAI](https://openai.com/). Released in 2019, GPT-2 is a precursor of GPT-3.5, the model used for ChatGPT. You can find more information by reading its [model card](https://huggingface.co/gpt2/tree/main) on the 🤗 Hugging Face website. The example uses several parameters:

* `set_seed` Fixes the random seed so that text generation returns the same output each time it is run with the same seed value.
* `prompt` The text that the generator continues.
* `max_length` The maximum length of the returned text in tokens, including the prompt. Longer text takes more time to generate, and the upper limit is set by the model.
* `num_return_sequences` The number of sequences to return for the prompt.

In [None]:
# Text Generation
input_text = "The Legend of Zelda: Tears of the Kingdom is a video game that"

generator = pipeline('text-generation', model='gpt2')

def create_text(prompt):
    # Uncomment the next line to get the same output on every run
    #set_seed(42)
    return generator(prompt, max_length=100, num_return_sequences=3)

create_text(input_text)
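
The effect of a seed, a length limit, and multiple return sequences can be illustrated without a neural network. This is a toy sketch, not how GPT-2 works internally: the `next_words` table and the `toy_generate` function are hypothetical, and the length is counted in words rather than tokens.

```python
import random

# Hypothetical table of words that may follow each word, invented for illustration
next_words = {
    "the": ["game", "kingdom", "hero"],
    "game": ["is", "was"],
    "kingdom": ["is", "was"],
    "hero": ["is", "was"],
    "is": ["fun", "long", "the"],
    "was": ["fun", "long", "the"],
    "fun": ["and"],
    "long": ["and"],
    "and": ["the"],
}


def toy_generate(prompt, max_length, num_return_sequences, seed=None):
    """Extend the prompt by sampling one word at a time until max_length words."""
    rng = random.Random(seed)  # the same seed gives the same sequence of choices
    sequences = []
    for _ in range(num_return_sequences):
        words = prompt.lower().split()
        # max_length counts the prompt words as well as the generated ones
        while len(words) < max_length:
            candidates = next_words.get(words[-1], ["the"])
            words.append(rng.choice(candidates))
        sequences.append(" ".join(words))
    return sequences


first_run = toy_generate("The game", max_length=8, num_return_sequences=3, seed=42)
second_run = toy_generate("The game", max_length=8, num_return_sequences=3, seed=42)
print(first_run)
print(first_run == second_run)
```

Running the function twice with the same seed returns identical sequences, while leaving the seed as `None` lets the output vary between runs.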

# Sentiment Analysis

By default, the 🤗 Transformers text-classification pipeline uses the [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) model. This model is based on a distilled, uncased version of [BERT](https://huggingface.co/bert-base-uncased) that has been fine-tuned on the [Stanford Sentiment Treebank 2](https://huggingface.co/datasets/sst2) (SST-2) dataset. SST-2 is a binary classification dataset for training models to learn the sentiment of words, phrases, and sentences. It contains 215,154 unique, manually labeled phrases of varying lengths. The dataset card describes SST-2:

> The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.

In [None]:
# Sentiment Analysis Prompts

prompt1 = """
I really wanted to like this game as I enjoyed the first one on the switch. Problem is, tears is FAR more difficult than the original. It’s not a casual game at all anymore. It takes far too much time, effort, research (online) to figure out where to find hidden areas and there are lots of “stuck points” (dead ends) that have no way of getting out except going to previous game save.

Pros:
1. Beautiful graphics and environments
2. Mending/Attaching, Ascending
3. Dynamic loading of zones (developers should be commended)

Cons:
1. The controls are lazy. The jump button is at the top, sprint is where jump should be. Attack is on opposite of where every other game is. It’s equivalent of having one App on an iPhone that doesn’t use basic swipe functionality. Quite frankly, it’s bad design
2. The first Zelda showed what/where you’re supposed to do/get to in a shrine via a quick cut scene. This one does none of that. What’s worse, is that on the sky island, you’re actually given a new power and then if you leave (because the puzzle is obscenely hidden) - you don’t actually have the power. Puzzles are 10x’s harder. Example of nearly impossible puzzles use the rewind power (talk about next level difficulty: there’s a two clock handle gate puzzle that took me two days to get through. One puzzle!)
3. Far more fighting/combat. Takes away the fun experience of exploring and talking to people doing side quests. And combat is just far more difficult in this game. The first robot shrine to train you, is just hard/horrible for a first tutorial.
4. Sky island (first zone) is just overall at a ridiculously high level of difficulty. I got so disappointed (too much running around and end areas/or falling) that I abandoned the game for a few weeks and eventually came back. I had to watch several YouTube videos to figure out where the 4th shrine entrance was in a dark cave.

"""

prompt2 = """
Where to begin with this game? Did you enjoy Breath of the Wild? Were you frustrated by certain aspects? If the answer was yes to both, you’ll love this game.

I have been a Zelda fan all the way back to playing Zelda 2 when I was 4. And when Breath of the Wild came out, I was blown away, but also frustrated by aspects of it. Yes the game did away with most of the standard Zelda formula, but enough still remained for me to enjoy it as a Zelda game. What blew me away though was the size of the game, the focus on exploration and many other things. However, many things frustrated me like how long it took to get around, warping to shrines was easy, but losing your horse was always a serious pain.

Now we arrive at the new game Tears of the Kingdom, clearly a sequel to BOTW that uses the same map, but manages to keep it fresh by introducing sky islands and the depths. And first off, if you felt BOTW had a massive map, this will blow you away. There is so much more to explore, but yet the game moves SO much faster by changing the towers from things you have to slowly climb to activate to just making them usually have a small puzzle to get them activated. Once you have these towers up, the game gets MUCH easier to navigate as each launches you miles into the sky where you can glide down toward the next closest one, or perhaps a shrine or other notable area. This eliminated the problem I had with horses and in fact, I barely used mine over the course of the entire game.

The new powers you gain at the start also open this up to a realm of creativity previously offered by other games like Minecraft, but in a Zelda game, it feels fresh. I can only say this game will feel like a dream to any engineers or someone with an interest in building. You can truly get lost in crafting weapons or vehicles, but even if only done when necessary it’s a lot of fun. I who think the shrine puzzles have gotten much harder (at least for someone like me), but they were still fun and brilliant to experience.

"""

In [None]:
# Sentiment Analysis Pipeline
classifier = pipeline("text-classification")

In [None]:
def classify_sentiment(prompt):
    output = classifier(prompt)
    return output

classify_sentiment(prompt2)
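
The classifier returns a label and a score. For a two-label model like this one, the score is a probability obtained by applying a softmax to the model's raw outputs (logits). The sketch below shows that final step in plain Python. The logit values are made up for illustration, and `to_prediction` is a hypothetical helper, not a library function.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, labels):
    """Pick the most probable label and report its probability as the score."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {"label": labels[best], "score": probabilities[best]}


labels = ["NEGATIVE", "POSITIVE"]
toy_logits = [-1.2, 2.3]  # made-up values standing in for real model output

prediction = to_prediction(toy_logits, labels)
print(prediction)
```

A score close to 1.0 means the model strongly prefers one label; a score near 0.5 means the two labels were almost tied.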


# Named Entity Recognition

The `aggregation_strategy` parameter defines how tokens that belong to the same entity, such as "New" and "York" in "New York", are grouped together. Remember, the tokenization may also be at the subword level, so you could see "Microsoft" broken up into "Micro" and "soft". Additional aggregation strategies such as "first", "average", and "max" are discussed in the 🤗 Transformers [documentation](https://huggingface.co/transformers/v4.7.0/_modules/transformers/pipelines/token_classification.html).

In [None]:
# Named Entity Recognition Pipeline
ner_tagger = pipeline("ner", aggregation_strategy="simple")

In [None]:
def extract_entities(prompt):
    output = ner_tagger(prompt)
    return pd.DataFrame(output)

extract_entities(prompt2)
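
To make the grouping idea concrete, here is a simplified sketch of what an aggregation step does with token-level predictions. It assumes `B-`/`I-` style tags and `##` subword markers; the input is hand-written for illustration, and the `group_entities` function is hypothetical. The real pipeline also tracks scores and character offsets.

```python
def group_entities(token_predictions):
    """Merge adjacent tokens that share an entity type into one entity."""
    entities = []
    current = None
    for token, tag in token_predictions:
        if tag == "O":  # "O" marks a token outside any entity
            if current is not None:
                entities.append(current)
                current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        is_subword = token.startswith("##")
        starts_new = (
            current is None
            or entity_type != current["entity_group"]
            or (prefix == "B" and not is_subword)
        )
        if starts_new:
            if current is not None:
                entities.append(current)
            current = {"entity_group": entity_type, "word": token}
        elif is_subword:
            # Glue subword pieces back onto the previous piece
            current["word"] += token[2:]
        else:
            current["word"] += " " + token
    if current is not None:
        entities.append(current)
    return entities


# Hand-written token-level predictions, for illustration only
toy_predictions = [
    ("Micro", "B-ORG"), ("##soft", "I-ORG"), ("opened", "O"), ("an", "O"),
    ("office", "O"), ("in", "O"), ("New", "B-LOC"), ("York", "I-LOC"),
]

print(group_entities(toy_predictions))
```

Without aggregation, each of the four entity tokens would be reported separately; with it, the output contains one organization and one location.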

# Question Answering

The Question Answering pipeline has two required parameters: 

* `question` The question being asked
* `context` The source material that should be used to answer this question

In [None]:
# Question Answering Pipeline
reader = pipeline("question-answering")

In [None]:
question = "What are the best parts of Tears of the Kingdom?"

def answer_question(question):
    output = reader(question=question, context=prompt2)
    return pd.DataFrame([output])

answer_question(question)
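
The default question-answering model is extractive: it does not write a new answer, it selects a span of the `context`. A common way to do this is to score every token as a possible start and as a possible end, then pick the best-scoring pair. The sketch below illustrates that idea with made-up scores. The `best_answer_span` function is hypothetical, and it works on whole words and raw scores, whereas the real pipeline reports character offsets and a probability.

```python
def best_answer_span(tokens, start_scores, end_scores, max_answer_len=5):
    """Return the span whose start score plus end score is highest."""
    best = None
    for start in range(len(tokens)):
        # Only consider spans of at most max_answer_len tokens
        last = min(start + max_answer_len, len(tokens))
        for end in range(start, last):
            score = start_scores[start] + end_scores[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    score, start, end = best
    return {
        "answer": " ".join(tokens[start:end + 1]),
        "start_token": start,
        "end_token": end,
        "score": score,
    }


context_tokens = "The towers launch you miles into the sky".split()
# Made-up scores standing in for model output, one per token
toy_start_scores = [0.1, 0.2, 0.1, 0.0, 2.5, 1.0, 0.3, 0.2]
toy_end_scores = [0.0, 0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 3.0]

span = best_answer_span(context_tokens, toy_start_scores, toy_end_scores)
print(span)
```

Because the answer is always a slice of the context, a question whose answer is not stated in the context will still return some span, usually with a low score.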

# Summarization

The `clean_up_tokenization_spaces` parameter removes extraneous spaces created through the detokenization process. Tokenization breaks a string into separate tokens; detokenization joins a series of tokens back into a string.

In [None]:
# Summarization pipeline
summarizer = pipeline("summarization")

In [None]:
def summarize(text):
    outputs = summarizer(text, max_length=75, clean_up_tokenization_spaces=True)
    return outputs[0]['summary_text']

print(summarize(prompt2))
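
To illustrate the kind of cleanup involved, here is a small sketch that removes the stray spaces left when tokens are joined with a space between each one. The replacement list is an assumption chosen for illustration, and `clean_up_spaces` is a hypothetical function; the rules used by the library may differ.

```python
def clean_up_spaces(text):
    """Remove spaces that detokenization leaves before punctuation and contractions."""
    replacements = [
        (" .", "."),
        (" ?", "?"),
        (" !", "!"),
        (" ,", ","),
        (" n't", "n't"),
        (" 's", "'s"),
        (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


tokens = ["The", "game", "is", "fun", ",", "but", "it", "is", "n't", "easy", "."]
joined = " ".join(tokens)  # naive detokenization: a space between every token

print(joined)
print(clean_up_spaces(joined))
```

The first line printed shows the raw joined tokens, and the second shows the same sentence with normal spacing restored.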

# Translation

The translation pipeline may have length limitations based on the model selected. If your text is long, you may need to break it up into smaller chunks for analysis.

In [None]:
# Split the review into two chunks at character 952
chunk1 = prompt2[:952]
chunk2 = prompt2[952:]

translator = pipeline('translation_en_to_de', model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(chunk1, clean_up_tokenization_spaces=True, min_length=100)
print(chunk1 + '\n')
print(outputs[0]['translation_text'])

In [None]:
# Show the second chunk, which has not been translated yet
print(chunk2)
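
Splitting at a fixed character position can cut a sentence in half. One alternative is to split on sentence boundaries and pack whole sentences into chunks. The sketch below assumes a simple character budget as a stand-in for the model's token limit. The `chunk_by_sentence` function and its `max_chars` parameter are hypothetical, and the regular expression is a rough sentence splitter, not a full one.

```python
import re


def chunk_by_sentence(text, max_chars):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # The next sentence does not fit, so close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "One two. Three four! Five six? Seven."
sample_chunks = chunk_by_sentence(sample, max_chars=20)
for piece in sample_chunks:
    print(piece)
```

Each chunk could then be passed to the translator in turn and the translated pieces joined back together.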