<h1>AI in Fact and Fiction - Summer 2021</h1>
<h2>Natural Language Processing</h2>

In this lab, we will explore several natural language processing techniques (including deep learning models) to perform some useful language tasks.

* Use [Google Colab](https://colab.research.google.com/github/AIFictionFact/Summer2021/blob/master/lab3.ipynb) to run the python code, and to complete any missing lines of code.
* You might find it helpful to save this notebook on your Google Drive.
* Please make sure to fill the required information in the **Declaration** cell.
* Once you complete the lab, please download the .ipynb file (File --> Download .ipynb).
* Then rename the downloaded notebook using the naming convention `lab3_YourRCS.ipynb` (replace 'YourRCS' with your RCS ID, for example 'lab3_senevo.ipynb').
* Submit the .ipynb file in LMS.

<p>Due Date/Time: <b>Friday, Jul 23 1.00 PM ET</b></p>

<p>Estimated Time Needed: <b>4 hours</b></p>

<p>Total Tasks: <b>15</b></p>
<p>Total Points: <b>50</b></p>

<hr>


**Declaration**

*Your Name* :

*Your RCS ID* :

*Collaborators (if any)* :

*Online Resources consulted (if any):*

# Part 1 - Data Cleaning and Exploratory Data Analysis

Data cleaning is a time-consuming and often unenjoyable task, yet it's a very important one. Keep in mind: "garbage in, garbage out". Feeding dirty data into a model will give us meaningless results.

Specifically, we'll be walking through:

1. **Getting the data** - in this case, we'll be scraping data from a website
2. **Cleaning the data** - we will walk through popular text pre-processing techniques
3. **Organizing the data** - we will organize the cleaned data into a way that is easy to input into other algorithms

The output of this part of the lab will be clean, organized data in two standard text formats:

* Corpus - a collection of text
* Document-Term Matrix - word counts in matrix format

We will try to scrape IMDB movie reviews from the IMDB website in this part.

## Getting the Data

This is the part where you have to do a bit of data sleuthing. I checked the IMDB website and found that the movie reviews are available at
https://www.imdb.com/title/[movie_id]/reviews, and that each individual review sits in an HTML element with the class "content".

In [None]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes the reviews
def url_to_review(url):
    '''Returns review data specifically from imdb.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser")
    reviews = []
    for row in soup.find_all(class_="content"):
      reviews.append(row.text)
    return reviews

# Names of the movies we have seen / will see in this class
movies = ['The Day the Earth Stood Still', 
          '2001: A Space Odyssey', 
          'Short Circuit']

# Movie Review URLs on IMDB 
urls = ['https://www.imdb.com/title/tt0043456/reviews',
        'https://www.imdb.com/title/tt0062622/reviews',
        'https://www.imdb.com/title/tt0091949/reviews']

# This may take a few minutes to run
reviews = [url_to_review(u) for u in urls]

print(reviews)
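
To see what selecting elements by the "content" class does without touching the network, here is a minimal offline sketch. It uses Python's standard-library `HTMLParser` as a stand-in for BeautifulSoup's `find_all(class_="content")`, and the sample HTML (including the `lister-item` wrapper) is invented for illustration, not copied from IMDB.

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collects the text of every element whose class list includes 'content'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside a matching element
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # nested tag inside a review
        elif "content" in classes:
            self.depth = 1           # a new review starts here
            self.reviews.append("")

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.reviews[-1] += data

# Invented sample markup; void tags such as <br> are not handled by this sketch
sample_html = """
<div class="lister-item"><div class="content"><b>Great</b> film.</div></div>
<div class="content">Too slow for me.</div>
<p class="other">Not a review.</p>
"""

parser = ContentExtractor()
parser.feed(sample_html)
print(parser.reviews)  # ['Great film.', 'Too slow for me.']
```

BeautifulSoup handles malformed markup and void tags far more robustly, which is why the lab code uses it for the real pages.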

A good practice is to save (pickle) the files for later use. Also, note how we are replacing the spaces with underscores in the movie name.

In [None]:
# Make a new directory to hold the text files (safe to re-run; does nothing if it already exists)
import os
os.makedirs("reviews", exist_ok=True)

for i, movie in enumerate(movies):
    movie_file_name = movie.replace(" ", "_")
    with open("reviews/" + movie_file_name + ".txt", "wb") as file:
        pickle.dump(reviews[i], file)

Now let's load the pickled files.

In [None]:
# Load pickled files
data = {}
for i, movie in enumerate(movies):
    movie_file_name = movie.replace(" ", "_")
    with open("reviews/" + movie_file_name + ".txt", "rb") as file:
        data[movie] = pickle.load(file)

Let's double check if the data has been loaded properly.

In [None]:
# Double check to make sure data has been loaded properly
data.keys()

More checks.

In [None]:
data['The Day the Earth Stood Still'][:2]

### Task 1 (5 points)

Write code to append your two favorite movies to the `movies` list and the `urls` list. Retrieve the reviews into a variable called `my_reviews`, and pickle those new movie reviews. Load the pickled files, print the total number of reviews for each movie, and the last review in the dataset for each movie.

In [None]:
# Type code for Task 1 here.

## Cleaning the Data


When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve our results.

**Common data cleaning steps on all text:**

* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common non-sensical text (`\n`)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:** 

* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
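
As a rough illustration of the common steps listed above, here is a minimal sketch on a single made-up sentence. The stop-word list is a tiny hypothetical one chosen for this example, and splitting on whitespace is the simplest possible tokenizer; the lab itself relies on scikit-learn for tokenization and stop words later on.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only
TOY_STOP_WORDS = {"the", "a", "is", "it", "was", "in", "and"}

def basic_clean(text):
    """Lowercase, strip punctuation and digit-bearing words, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    tokens = text.split()  # whitespace tokenization
    return [t for t in tokens if t not in TOY_STOP_WORDS]

sample = "The robot Gort was AMAZING in 1951, and it is still scary!"
tokens = basic_clean(sample)
print(tokens)  # ['robot', 'gort', 'amazing', 'still', 'scary']
```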

In [None]:
# Let's take a look at our data again
next(iter(data.keys()))

In [None]:
# Notice that our dictionary is currently in key: movie, value: list of text format
next(iter(data.values()))

In [None]:
# We are going to change this to key: movie, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

We can either keep it in dictionary format or put it into a pandas dataframe.

In [None]:
import pandas as pd
pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['review']
data_df = data_df.sort_index()
data_df

Let's take a look at the reviews for "The Day the Earth Stood Still"

In [None]:
data_df.review.loc['The Day the Earth Stood Still']

Apply a first round of text cleaning techniques.

In [None]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, 
    remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)

Let's take a look at the updated text.


In [None]:
data_clean = pd.DataFrame(data_df.review.apply(round1))
data_clean

### Task 2 (5 points)

Let's apply a second round of cleaning to get rid of some additional punctuation and non-sensical text that was missed the first time around. Hint: we do not want the `\n`. Similarly, please check the reviews to see if there are any other such characters we need to clean out. Please complete the `clean_text_round2` function.

In [None]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''
    # Type your code here
    return text

round2 = lambda x: clean_text_round2(x)

## Organizing The Data
We mentioned earlier that the output of this notebook will be clean, organized data in two standard text formats:

* Corpus - a collection of text
* Document-Term Matrix - word counts in matrix format

**Corpus**

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [None]:
# Let's take a look at our dataframe
data_df

In [None]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

**Document-Term Matrix**

For many of the techniques we'll be doing later in this lab, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.
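
To make the row-and-column layout concrete, here is a hand-rolled sketch of a document-term matrix built with only the standard library. The two toy documents are invented, and this is not how CountVectorizer is implemented internally; it only shows the shape of the result.

```python
from collections import Counter

# Two invented documents
docs = {
    "doc_a": "robot saves earth robot",
    "doc_b": "alien visits earth",
}

# Count the words in each document
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The sorted vocabulary becomes the matrix columns
vocab = sorted(set(word for c in counts.values() for word in c))

# One row per document, one column per vocabulary word
dtm = {name: [c[word] for word in vocab] for name, c in counts.items()}

print(vocab)  # ['alien', 'earth', 'robot', 'saves', 'visits']
for name, row in dtm.items():
    print(name, row)
# doc_a [0, 1, 2, 1, 0]
# doc_b [1, 1, 0, 0, 1]
```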

In [None]:
# We are going to create a document-term matrix using CountVectorizer, 
# and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.review)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm

Let's pickle it for later use.

In [None]:
data_dtm.to_pickle("dtm.pkl")

Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object.

In [None]:
data_clean.to_pickle('data_clean.pkl')
pickle.dump(cv, open("cv.pkl", "wb"))

### Task 3 (5 points)

Play around with CountVectorizer's parameters. (Type your code in the cell below.)

What is ngram_range? What is min_df? and max_df? (This is a written question, and please type your answer in the cell below the next.)

In [None]:
# Type your code to experiment with the CountVectorizer's parameters (2 points)

__What is ngram_range?__ (1 point)

_Type your answer here_

__What is min_df?__ (1 point)

_Type your answer here_

__What is max_df?__ (1 point)

_Type your answer here_


## Exploratory Data Analysis

After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.

When working with numerical data, some of the exploratory data analysis (EDA) techniques we can use include finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the same when working with text data. We are going to find some of the more obvious patterns with EDA before identifying the hidden patterns with machine learning (ML) techniques. We are going to look at the following for each movie:

1. **Most common words** - find these and create word clouds
2. **Size of vocabulary** - look at the number of unique words
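
As a toy illustration of these two measures, the sketch below computes the most common words and the vocabulary size for two invented review strings. The titles and texts are placeholders, not data from the corpus.

```python
from collections import Counter

# Invented stand-ins for two sets of cleaned reviews
toy_reviews = {
    "Movie A": "robot robot alien ship robot alien",
    "Movie B": "alien alien planet",
}

summary = {}
for title, review_text in toy_reviews.items():
    tokens = review_text.split()
    top_two = Counter(tokens).most_common(2)   # most frequent words with counts
    vocab_size = len(set(tokens))              # number of unique words
    summary[title] = (top_two, vocab_size)
    print(title, top_two, vocab_size)
# Movie A [('robot', 3), ('alien', 2)] 3
# Movie B [('alien', 2), ('planet', 1)] 2
```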

### Most Common Words

Read in the document-term matrix.

In [None]:
data = pd.read_pickle('dtm.pkl')
data = data.transpose()
data.head()

Find the top 30 words in the reviews.

In [None]:
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c]= list(zip(top.index, top.values))

top_dict

Print the top 15 words in each movie review.

In [None]:
for movie, top_words in top_dict.items():
    print(movie)
    print(', '.join([word for word, count in top_words[:15]]))
    print('---')

At this point, we could go on and create word clouds. However, by looking at these top words, you can see that some of them have very little meaning and could be added to a stop words list, so let's do just that.

In [None]:
# Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 30 words for each movie
words = []
for movie in data.columns:
    top = [word for (word, count) in top_dict[movie]]
    for t in top:
        words.append(t)
        
words

Let's aggregate this list and identify the most common words along with how many times they occur.

In [None]:
Counter(words).most_common()

If all the movies have it as a top word, exclude it from the list.

In [None]:
add_stop_words = [word for word, count in Counter(words).most_common() if count == len(movies)]
add_stop_words

Let's update our document-term matrix with the new list of stop words.


In [None]:
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data
data_clean = pd.read_pickle('data_clean.pkl')

# Add new stop words
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate document-term matrix
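# Pass a list: recent scikit-learn releases reject a frozenset here
cv = CountVectorizer(stop_words=list(stop_words))
data_cv = cv.fit_transform(data_clean.review)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())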
data_stop.index = data_clean.index

# Pickle it for later use
pickle.dump(cv, open("cv_stop.pkl", "wb"))
data_stop.to_pickle("dtm_stop.pkl")

Let's make some word clouds!

First, install the `wordcloud` package, if it is not installed already.

In [None]:
#!pip install wordcloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)

# Reset the output dimensions
plt.rcParams['figure.figsize'] = [16, 6]

# Create subplots for each movie
for index, movie in enumerate(data.columns):
    wc.generate(data_clean.review[movie])
    
    plt.subplot(3, 4, index+1)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(movie)
    
plt.show()

Find the number of unique words that each set of movie reviews have.

In [None]:
# Identify the non-zero items in the document-term matrix, meaning that the word occurs at least once
unique_list = []
for movie in data.columns:
    uniques = data[movie].to_numpy().nonzero()[0].size
    unique_list.append(uniques)

# Create a new dataframe that contains this unique word count
# Use data.columns (not `movies`) so each count is paired with the movie it was computed for
data_words = pd.DataFrame(list(zip(data.columns, unique_list)), columns=['movie', 'unique_words'])
data_unique_sort = data_words.sort_values(by='unique_words')
data_unique_sort

Let's plot our findings.

In [None]:
import numpy as np

y_pos = np.arange(len(data_words))

plt.subplot(1, 2, 1)
plt.barh(y_pos, data_unique_sort.unique_words, align='center')
plt.yticks(y_pos, data_unique_sort.movie)
plt.title('Number of Unique Words', fontsize=20)

**Analysis of Specific Movies**

Since these three movies feature Robots, Fiction, and Aliens, let's see how many times those words are mentioned in these movie reviews.

In [None]:
Counter(words).most_common()

Let's isolate these words.

In [None]:
# Let's isolate these words
data_ff_words = data.transpose()[['robot', 'fiction', 'scifi', 'alien']]
data_ff = pd.concat([data_ff_words.robot, data_ff_words.fiction + data_ff_words.scifi, data_ff_words.alien], axis=1)
data_ff.columns = ['robot', 'fiction', 'alien']
data_ff

### Task 5 (2 points)

What are some other techniques you can use to analyze the dataset? (This is a written task).

_Please type your answer here._

# Part 2 - Sentiment Analysis

So far, all of the analysis we've done has been pretty generic - looking at counts, creating scatter plots, etc. These techniques could be applied to numeric data as well.

When it comes to text data, there are a few popular techniques that we'll be going through in the next few tasks, starting with sentiment analysis. A few key points to remember with sentiment analysis.

**TextBlob Module:** Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary depending on where it appears in a sentence. The TextBlob module allows us to take advantage of these labels.

**Sentiment Labels:** Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus' sentiment is the average of these.
  * **Polarity:** How positive or negative a word is. -1 is very negative. +1 is very positive.
  * **Subjectivity:** How subjective, or opinionated a word is. 0 is fact. +1 is very much an opinion.
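
The averaging idea can be sketched with a made-up lexicon. The words and scores below are hypothetical, and TextBlob's real lexicon and rules (for example, its handling of negation and intensifiers) are more involved than this plain average.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity)
TOY_LEXICON = {
    "great": (0.75, 0.75),
    "classic": (0.25, 0.25),
    "boring": (-1.0, 1.0),
}

def toy_sentiment(text):
    """Average the scores of the words that appear in the toy lexicon."""
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    if not scores:
        return 0.0, 0.0  # no known words: treat as neutral and factual
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

print(toy_sentiment("A great classic"))  # (0.5, 0.5)
print(toy_sentiment("Boring"))           # (-1.0, 1.0)
```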

Let's take a look at the sentiment of the various movie reviews.

We'll start by reading in the corpus, which preserves word order. Let's quickly inspect the `corpus`.

In [None]:
data = pd.read_pickle('corpus.pkl')
data

We need to install TextBlob, if not already installed.

In [None]:
#!pip install textblob

Let's create quick lambda functions to find the polarity and subjectivity of each review.

In [None]:
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

data['polarity'] = data['review'].apply(pol)
data['subjectivity'] = data['review'].apply(sub)
data

Let's plot the results.

In [None]:
plt.rcParams['figure.figsize'] = [10, 8]

for index, movie in enumerate(data.index):
    x = data.polarity.loc[movie]
    y = data.subjectivity.loc[movie]
    plt.scatter(x, y, color='blue')
    plt.text(x+.001, y+.001, movie, fontsize=10)
    plt.xlim(-.01, .2) 
    
plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

plt.show()

### Task 6 (3 points)

The sentiment we obtained was for the entire corpus. Please obtain the sentiment values for the first 5 reviews for each of the three movies.
Please check out TextBlob's sentiment analysis API for more information: https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis

In [None]:
# Type your answer here.


# Part 3 - Topic Modeling

Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

We will be covering the steps on how to do **Latent Dirichlet Allocation (LDA)**, which is one of many topic modeling techniques. It was specifically designed for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.

Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic make sense. If they don't make sense, you can try changing up the number of topics, the terms in the document-term matrix, model parameters, or even try a different model.

First, let's read in our document-term matrix.

In [None]:
data = pd.read_pickle('dtm_stop.pkl')
data

Install the necessary modules for LDA with gensim (if not installed already).

In [None]:
#!pip install gensim

Import the necessary modules for LDA with gensim.

In [None]:
from gensim import matutils, models
import scipy.sparse

# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

One of the required inputs is a term-document matrix.

In [None]:
tdm = data.transpose()
tdm.head()

We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus.

In [None]:
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix.

In [None]:
cv = pickle.load(open("cv_stop.pkl", "rb"))
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
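
To make the two gensim inputs less abstract, here is a pure-Python sketch of what they contain. The three-word vocabulary and the counts are invented; the point is only that each document becomes a list of `(term_id, count)` pairs and that `id2word` maps each id back to its term.

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_ (term -> column index)
vocabulary = {"alien": 0, "earth": 1, "robot": 2}

# Invert it to index -> term, the mapping used to label topic words
id2word_toy = {index: term for term, index in vocabulary.items()}

# Term-document matrix: rows are terms, columns are documents
tdm_toy = [
    [1, 0],  # alien
    [1, 1],  # earth
    [0, 2],  # robot
]

# Bag-of-words view: for each document, keep only the non-zero (term_id, count) pairs
n_docs = len(tdm_toy[0])
bow_corpus = [
    [(term_id, row[doc]) for term_id, row in enumerate(tdm_toy) if row[doc] > 0]
    for doc in range(n_docs)
]

print(id2word_toy)  # {0: 'alien', 1: 'earth', 2: 'robot'}
print(bow_corpus)   # [[(0, 1), (1, 1)], [(1, 1), (2, 2)]]
```

Documents sit in the columns here, which is why the lab transposes the document-term matrix before handing it to `Sparse2Corpus`.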

Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term), we need to specify two other parameters - the number of topics and the number of passes. Let's start the number of topics at 2, see if the results make sense, and increase the number from there.

In [None]:
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

LDA for num_topics = 3

In [None]:
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

LDA for num_topics = 4

In [None]:
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as well.

One popular trick is to look only at terms that are from one part of speech (only nouns, only adjectives, etc.).

We will need to download the following packages from nltk.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Let's create a function to pull out nouns from a string of text.

In [None]:
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)

Read in the cleaned data, before the CountVectorizer step.

In [None]:
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Apply the nouns function to the reviews to keep only the nouns.

In [None]:
data_nouns = pd.DataFrame(data_clean.review.apply(nouns))
data_nouns

Create a new document-term matrix using only nouns.

In [None]:
# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate a document-term matrix with only nouns
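# Pass a list: recent scikit-learn releases reject a frozenset here
cvn = CountVectorizer(stop_words=list(stop_words))
data_cvn = cvn.fit_transform(data_nouns.review)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())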
data_dtmn.index = data_nouns.index
data_dtmn

In [None]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [None]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

### Task 7 (2 points)

Write the code to print 3 topics on the document-term matrix with nouns. (1 point)

What three topics would you identify from the results you see? _(This is a written answer question.)_ (1 point)

In [None]:
# Type your code here.

_Type your 3 chosen topics (1 point)_

### Task 8 (5 points)

Complete the function to pull out both nouns and adjectives from a string of text. Adjectives are marked as 'JJ' in nltk. (2 points)

Apply the `nouns_adj` function to the reviews and determine the 3 best topics. (1 point)

What are the 3 best topics? How do they compare to the previous topics you selected? (2 points)


In [None]:
def nouns_adj(text):
  # Type your code to complete the function.
  return #fix me

corpusna = "" #fix me

# Apply the `nouns_adj` function to the reviews and determine the 3 best topics
ldana = "" # fix me

_Please type your answer to "What are the 3 best topics"? (1 point)_

 _Please type your answer to "How do they compare to the previous topics you selected?" (1 point)_

**Identify topics in the reviews for each movie**

Let's take a look at which topic each set of movie reviews contains. If you have defined `corpusna` and `ldana` correctly above, the following code should run and display the topics for each movie based on their IMDB reviews.

In [None]:
corpus_transformed = ldana[corpusna]
list(zip([a for [(a,b)] in corpus_transformed], data_dtmna.index))

# Part 4 - Word Vectors

Next, we turn our attention to some of the more advanced (and recent) deep learning NLP libraries to learn about vector representations of language.

First, we will take a look at [spaCy](https://spacy.io).

Let's first install spacy (if not installed already).



In [None]:
!pip install -U spacy
!python -m spacy download en_core_web_md

spaCy can compare two objects and predict how similar they are – for example, documents, spans or single tokens.

The Doc, Token and Span objects have a .similarity method that takes another object and returns a floating-point similarity score, usually between 0 and 1, where a higher value means the two objects are more similar.

One thing that's very important: In order to use similarity, you need a larger spaCy model that has word vectors included.

For example, the medium or large English model – but not the small one. So if you want to use vectors, always go with a model that ends in "md" or "lg". You can find more details on this in the [models documentation](https://spacy.io/models).
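
By default, spaCy computes this score as the cosine similarity between vectors, with a Doc represented by the average of its token vectors. The sketch below reproduces that idea in plain Python. The 3-dimensional vectors are invented for illustration (real spaCy vectors in the medium model have 300 dimensions), so the numbers it prints say nothing about spaCy's actual scores.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for this example
toy_vectors = {
    "i": [0.1, 0.3, 0.0],
    "like": [0.4, 0.1, 0.2],
    "pizza": [0.0, 0.9, 0.5],
    "pasta": [0.1, 0.8, 0.6],
}

def average_vector(words):
    """Represent a document as the element-wise mean of its word vectors."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_docs = cosine_similarity(
    average_vector(["i", "like", "pizza"]),
    average_vector(["i", "like", "pasta"]),
)
sim_tokens = cosine_similarity(toy_vectors["pizza"], toy_vectors["pasta"])
print(round(sim_docs, 3), round(sim_tokens, 3))
```

Because the two toy documents share most of their words, their averaged vectors point in almost the same direction, which is worth keeping in mind when interpreting document-level scores.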

Here's an example. Let's say we want to find out whether two documents are similar.

First, we load the medium English model, "en_core_web_md".

We can then create two doc objects and use the first doc's similarity method to compare it to the second.

Here, a fairly high similarity score of 0.86 is predicted for "I like fast food" and "I like pizza".

The same works for tokens.

According to the word vectors, the tokens "pizza" and "pasta" are kind of similar, and receive a high score.

In [None]:
import spacy
# Load the model with vectors
nlp = spacy.load("en_core_web_md")

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

### Task 9 (2 points)

Obtain two movie reviews from our mini movie review corpus and see how similar they are.

In [None]:
# Type your code here

**Word vectors in spaCy**

To give you an idea of what those vectors look like, here's an example.

First, we load the medium model again, which ships with word vectors.

Next, we can process a text and look up a token's vector using the .vector attribute.

The result is a 300-dimensional vector of the word "movie".

In [None]:
# Load a larger model with vectors
nlp = spacy.load("en_core_web_md")

doc = nlp("movie")
# Access the vector of the first (and only) token via the token.vector attribute
print(doc[0].vector)

Predicting similarity can be useful for many types of applications. For example, to recommend a user similar texts based on the ones they have read. It can also be helpful to flag duplicate content, like posts on an online platform.


In [None]:
review1 = "I love the movie"
review2 = "I hate the movie"

doc1 = nlp(review1)
doc2 = nlp(review2)

print(doc1.similarity(doc2))

### Task 10 (2 points)

The statements "I love the movie" and "I hate the movie" received a high similarity score in the vector space, despite being opposites. Why might this be? _(This is a written answer question)._

_Type your answer here._

# Part 5 - Text Classification, Generation, and Summarization



In this part of the lab, we will be exploring the [HuggingFace](https://huggingface.co/) library that implements the state-of-the-art transformer models we discussed in class.

Let's first install `transformers` if not installed already.

In [None]:
!pip install transformers[sentencepiece]

Before we move on to text generation, let's see how transformers can perform some of the tasks we already covered in this lab.

Note: these transformer models are really large. For example, BERT-large has 24 layers and a total of 340M parameters! Altogether it is 1.34GB, so expect these transformer models to take a couple of minutes to download to your Colab instance. (Note that this download does not use your own network bandwidth or your Google Drive space; it happens between the Google instance and wherever the model is stored on the web.)

**Sentiment Analysis**

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
reviews = [
    review1,
    review2,
]
review_sentiment = classifier(reviews)
result = "\n".join("{} {}".format(x, y) for x, y in zip(reviews, review_sentiment))
print(result)

### Task 11 (2 points)

Using HuggingFace's transformer library, print the sentiment scores of at least two random movie reviews from each of the movies in our small IMDB corpus. If there are `n` movies, you must print at least `2n` sentiment scores.

In [None]:
#Type your code here

**Text classification**

The sentiment analysis we saw earlier can be thought of as a type of binary classification problem. However, a more challenging task is where we need to classify texts that haven’t been labelled. This is a common scenario in real-world applications because annotating text is usually time-consuming and requires domain expertise.

For this use case, the **zero-shot-classification** pipeline in HuggingFace is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. You’ve already seen how the model can classify a sentence as positive or negative using those two labels — but it can also classify the text using any other set of labels you like.

We will use the description of the AI in Fact and Fiction class, and along with several candidate topics, let's see what the transformer model classifies this text to be. As you can see, the category "Education" has a higher score.

In [None]:
course_description = '''The class will explore current AI topics through reading, writing, programming,
and exploring some of the classic fiction that has formed people's (mis)perceptions
of machine intelligence. This course will give students an appreciation on how to
separate fiction from fact, and how to critically evaluate the impact current and
upcoming AI topics will have on society.'''
text_classifier = pipeline("zero-shot-classification")
text_classifier(
    course_description,
    candidate_labels=["education", "politics", "business"],
)

### Task 12 (4 points)

Using the entire IMDB movie review corpus, and the individual sets of movie reviews for each of the three movies we have seen / will be seeing in class, classify them according to 3 distinct topics (these topics cannot be the sentiment, i.e., positive or negative, or it cannot be "review"). (3 points)

What are your observations on the topic distribution across the 3 movies based on their reviews? (1 point) _(This is a written answer question.)_

In [None]:
# Type your code here

_Type your answer here_

**Text Generation**

Now let’s see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it’s normal if you don’t get the same results when you run the code again.
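
The role of randomness can be illustrated with a toy sampler. The next-word probabilities below are invented, and a real model computes such probabilities over a very large vocabulary with a neural network; the sketch only shows why an unseeded run can differ from the previous one while a seeded run repeats.

```python
import random

# Hypothetical next-word probabilities after the prompt "we will teach you how to"
next_word_probs = {"code": 0.5, "write": 0.3, "think": 0.2}

def sample_next_word(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Two generators with the same seed produce the same sequence of draws
rng_a = random.Random(0)
rng_b = random.Random(0)
run_a = [sample_next_word(next_word_probs, rng_a) for _ in range(5)]
run_b = [sample_next_word(next_word_probs, rng_b) for _ in range(5)]
print(run_a == run_b)  # True

# An unseeded generator may give a different sequence on every run
run_c = [sample_next_word(next_word_probs, random.Random()) for _ in range(5)]
print(run_c)
```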

In [None]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

**Using a specific model**

The previous example used the default model for text generation, but you can also choose a particular model from the [Transformer Model Hub](https://huggingface.co/models). Go to the Model Hub and click on the corresponding tag on the left to display only the models supported for a given task, e.g., text generation.

Let’s try the distilgpt2 model! Here’s how to load it in the same pipeline as before:

In [None]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

### Task 13 (3 points)

Generate a movie review with at most 100 words for your favorite movie. You must use another model that is not `distilgpt2` for this task. Please check the available text generation models from the [Transformer Model Hub](https://huggingface.co/models) for this purpose. (2 points)

How do you compare the quality of this review to the same review that would be generated from the `distilgpt2` model? (1 point) _(This is a written answer question.)_

**Text Summarization**

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. Here’s an example where we shorten the course description, capping the summary length with `max_length=20` (a limit counted in tokens, not words):

In [None]:
summarizer = pipeline("summarization")
summarizer(course_description, max_length=20)

### Task 14 (2 points)

Shorten the entire movie review corpus (i.e., all the reviews from all the movies) to 100 words.

In [None]:
# Type your code here.

# Part 6 - Question Answering

**BERT**, or **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, is a method of training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The academic paper can be found here: https://arxiv.org/abs/1810.04805. (You do not have to read it as part of this lab, but it is a good reference if you want to understand the inner workings of BERT.)

The model we are using in this lab is a pre-trained model released by Google that was trained for many, many hours on Wikipedia and [Book Corpus](https://arxiv.org/pdf/1506.06724.pdf), a dataset containing more than 10,000 books of different genres. For question answering, we can get decent results using a BERT model that has already been fine-tuned on the Stanford Question Answering Dataset (SQuAD) benchmark: https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/.

Create a question-answering pipeline and ask it a question about the course description.

In [None]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="What is this course about?",
    context=course_description
)

### Task 15 (3 points)

Let's get some answers to some questions about the movie "The Day the Earth Stood Still" based on the reviews for that movie. You must compile at least 5 questions about the movie from the reviews. Your code must use a specific question-answering model from the [Transformer Model Hub](https://huggingface.co/models). (2 points)

What is the model you selected, and why did you select it? (you may compare it with some other model and describe pros and cons of your chosen model) (1 point) _(This is a written answer question.)_

In [None]:
#Type your code here

_Type your answer here_