<a href="https://colab.research.google.com/github/SAC-CS112-Nguyen-Austin/CS112Java/blob/master/Copy_of_UCI_Lsci_109_Miniproject_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini-Project 1: Movie Plots

We will be dealing with a large dataset of movie plot descriptions scraped from IMDB. For each movie, we have a plot description, the year the movie came out, a set of genres, and the movie's average rating out of 10.

Your mission is to characterize the distinctive features of each movie genre and how they have changed over time using the plot descriptions.

You will be writing code and text. The directives in **bold** indicate things you are required to complete.

We will start off reading the data into dataframes and formatting the dataframes. You should start by executing all this code, but you do not need to worry about the details of how it works. Your focus should be on the textual data analysis, starting with the header "Part 1" below.

**Group members**: 

**Author contribution statement**: 

## Part 0: Importing libraries and data

We'll import a bunch of useful stuff, including NLP libraries `nltk` and `spacy`. You can use these as you see fit.

In [0]:
import io

import requests
import pandas as pd
import altair as alt
import nltk
import spacy
import numpy as np
import gensim


# You could use NLTK or Spacy to do tokenization and other preprocessing
nltk.download("punkt") # 
nlp = spacy.load("en_core_web_sm")

alt.data_transformers.enable('default', max_rows=None)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


DataTransformerRegistry.enable('default')

Now we download data from online CSV into a `pandas` dataframe.

In [0]:
# From https://stackoverflow.com/questions/32400867/pandas-read-csv-from-url

url = "http://socsci.uci.edu/~rfutrell/teaching/lsci109-w2020/data/movies_small.csv"
s = requests.get(url).content.decode('utf-8')
df = pd.read_csv(io.StringIO(s), index_col=0)

In [0]:
# Take a look at the dataframe.
df

Unnamed: 0,title,year,synopsis,rating,History,Thriller,Family,Horror,Mystery,Drama,Music,Adventure,Comedy,Action,Musical,Fantasy,Documentary,Animation,Western,War,Short,Film-Noir,Romance,Sport,Sci-Fi,Biography,News,Crime
0,The Bling Ring,2013,Five teenagers climb over an iron gate and bre...,5.6,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True
1,Forever Mine,1999,The highly stylized film begins in the early 8...,5.3,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True
2,Machine Gun Preacher,2011,The film is an adaptation of Childers' memoir ...,6.8,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,True
3,"Good Morning, Vietnam",1987,"With the war in Vietnam escalating in 1965, Ai...",7.3,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False
4,Two Bits,1995,"A hot summer day in South Philadelphia, 1933, ...",6.3,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2883,The Walk,2015,Philippe Petit (Joseph Gordon-Levitt) stands a...,7.3,False,True,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
2884,Mean Girls,2004,Cady (Lindsay Lohan) is the 16-year-old home-s...,7.0,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2885,The Star Chamber,1983,"On a South Los Angeles, California, street, tw...",6.3,False,True,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
2886,You're Next,2011,"In the opening scene, Adam and Talia (Kate Lyn...",6.6,False,True,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [0]:
# This is the set of genres that the movies are classified into.
# A movie may belong to multiple genres.

genres = [
 'History',
 'Thriller',
 'Family',
 'Horror',
 'Mystery',
 'Drama',
 'Music',
 'Adventure',
 'Comedy',
 'Action',
 'Musical',
 'Fantasy',
 'Documentary',
 'Animation',
 'Western',
 'War',
 'Short',
 'Film-Noir',
 'Romance',
 'Sport',
 'Sci-Fi',
 'Biography',
 'News',
 'Crime',
]

In [0]:
# Visualization: What is the distribution of genres by year in our sample?

all_but_title = [x for x in df.columns if x != 'title']
genre_counts = df.drop('rating', axis=1).drop('synopsis', axis=1).drop('title', axis=1).groupby('year').sum().reset_index()
genre_counts = pd.melt(genre_counts, id_vars=['year'])
chart = alt.Chart(genre_counts)
chart.mark_line().encode(
    x='year',
    y='value',
    color='variable',
    tooltip=['variable'],
).interactive()

In [0]:
# Reshape the dataframe so that there is only one genre column
# Note that this means there will be duplicate entries: some movies will occupy more than one row.

melted = df.drop('rating', axis=1).melt(id_vars=['title', 'year', 'synopsis'])
df2 = melted[melted['value']].drop('value', axis=1)
df2.columns = ['title', 'year', 'synopsis', 'genre']
df2

Unnamed: 0,title,year,synopsis,genre
12,JFK,1991,The film opens with a narration (by an uncredi...,History
30,Gone with the Wind,1939,"The film opens in Tara, a cotton plantation ow...",History
112,An American Haunting,2005,This synopsis is from the UNRATED VERSION.A gi...,History
123,JFK,1991,The film opens with a narration (by an uncredi...,History
140,The Last Emperor,1987,Arrival.\nA train pulls into a station in Nort...,History
...,...,...,...,...
69269,Murder on the Orient Express,1974,The murderHaving sorted a matter out in the Mi...,Crime
69272,Heathers,1988,Veronica is part of the most popular clique at...,Crime
69274,Strangers on a Train,1951,Amateur Tennis star Guy Haines (Farley Granger...,Crime
69277,The Counterfeiters,2007,The film opens with Salomon (Sali) gambling he...,Crime


In [0]:
# Concatenate all synopses within genre and year.
# Now we have three columns: First is genre, second is year, third is the concatenation of all movie synopses for that genre and year.
df_by_genre_and_year = df2.groupby(['genre', 'year']).aggregate({'synopsis': " ".join}).reset_index()
df_by_genre_and_year

Unnamed: 0,genre,year,synopsis
0,Action,1903,"First, in the opening scene, two masked robber..."
1,Action,1931,Small-time Italian-American criminals Caesar E...
2,Action,1939,"Not quite. Parody, yes; remake, no. Read revie..."
3,Action,1940,The editor of the New York Globe (Harry Davenp...
4,Action,1947,"Akbar, India. Jean Preston ('Patricia Morison'..."
...,...,...,...
1157,Western,2014,Its the 1870s America. When settler John kills...
1158,Western,2015,Note: The movie is divided into six narrative ...
1159,Western,2016,In this remake of the 1960 film of the same na...
1160,Western,2017,"In 1892, a Comanche war party descends on the ..."


In [0]:
# Further simplifying, we can make a dataframe that concatenates all movie synopses per genre, regardless of year.
# Or we can do it by year.
df_by_genre = df2.groupby(['genre']).aggregate({'synopsis': " ".join}).reset_index()
df_by_year = df2.groupby(['year']).aggregate({'synopsis': " ".join}).reset_index()
df_by_genre


Unnamed: 0,genre,synopsis
0,Action,The film is an adaptation of Childers' memoir ...
1,Adventure,The movie opens with an incident on the island...
2,Animation,Ten-year-old Chihiro (voice: Daveigh Chase in ...
3,Biography,Five teenagers climb over an iron gate and bre...
4,Comedy,"With the war in Vietnam escalating in 1965, Ai..."
5,Crime,Five teenagers climb over an iron gate and bre...
6,Documentary,[from www.007.com]\nEverything Or Nothing focu...
7,Drama,Five teenagers climb over an iron gate and bre...
8,Family,The film begins at a Russian space station whe...
9,Fantasy,The figure of a limp woman with long curly hai...


In [0]:
df_by_year

Unnamed: 0,year,synopsis
0,1903,"First, in the opening scene, two masked robber..."
1,1927,"Set in Southern California, the movie opens wi..."
2,1931,The opening scene is on a quiet road surrounde...
3,1932,A family pulls together to help a member in fi...
4,1933,The 1933 version of the work of H.G. Wells ope...
...,...,...
85,2015,Herman Melville visits old Thomas Nickerson ab...
86,2016,The opening text states that there were over 2...
87,2017,"Moscow, USSRMarch 1, 1953Based on a true story..."
88,2018,The Holy Grail is an icon that has existed mor...


## Part 1: Featurization

The first step in the analysis is to convert the text to a vector representation. To do this, we will represent the text as a bag of words. To do this effectively, we need to build a vocabulary index that has an entry for all the words in all our texts.

**Build a vocabulary index** for all the words in all the synopses below. Remember to tokenize and do any other preprocessing steps you find fit. You might want to exclude stop words.

*Note*. This may run slowly.

*Note*.  You should not expect that you will complete the featurization part here and then go on to the next part and never come back. More likely, you will do some featurization here, then as you do your analysis later on, you'll realize you should have done something different back here. In that case, you should come back and change the way you do featurization, then run all the code again. I fully expect that you will come back and alter the featurization code here a few times in response to the results you get later. In your responses to the questions below, you should talk about this process and what you learned.

In [0]:
# Below we collect all the synopses into a single huge string variable.
all_text = " ".join(df['synopsis'])[:100]
UNK = "<!!!!UNK!!!!>"

def preprocess(text):
  # TODO modify this function as you please
  tokens = nltk.tokenize.word_tokenize(text) # Replace this with real tokenization
  utokens = [token.casefold() for token in tokens]
  return utokens

def make_embedding_index(tokens):
  index = {}
  max_index = 0
  for token in tokens:
    if token not in index:
      index[token] = max_index
      max_index += 1
  index[UNK] = max_index
  return index

In [0]:
vocabulary_index = make_embedding_index(preprocess(all_text))
vocabulary_index

{'.': 16,
 '<!!!!UNK!!!!>': 18,
 'an': 4,
 'and': 7,
 'at': 12,
 'break': 8,
 'climb': 2,
 'driveway': 15,
 'five': 0,
 'ga': 17,
 'gate': 6,
 'into': 9,
 'iron': 5,
 'mansion': 11,
 'of': 14,
 'over': 3,
 'teenagers': 1,
 'the': 10,
 'top': 13}

**Answer the questions below.** 

1. How did you decide to tokenize the text? Why did you choose that strategy? Do you notice any problems with the tokenization strategy? If so, do you think these going to cause problems with your analysis? Why or why not?

2. What other preprocessing steps did you do and why? Did you exclude stop words or other words? Why or why not?

3. How many entries are there in your vocabulary index? Does this seem reasonable?



Now that you have a vocabulary index, we want to create vector representations of all the texts. For now, let's focus on the movies by genre, and later we can focus on movies by year, title, etc. 


**Write a function to create a bag-of-words vector representation of a text**, using the vocabulary index that you built above. 

In [0]:
def embed(tokens, index):
  a = np.zeros(len(index))
  for token in tokens:
    if token in index:
      a[index[token]] += 1
    else:
      a[index[UNK]] += 1
  return a

a = embed(preprocess(all_text), vocabulary_index)
a


array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 4., 1., 1., 1., 1., 1., 1.,
       1., 0.])

Now that you have a function `embed`, we will apply it in the following way. The bit of code below will create a dataframe called `counts_by_genre` where each column is a genre, and each row is an entry from your vocabulary index. The number in each cell is the count of how many times that word was seen in that genre.

You should probably not change the code below. If you write the functions `embed` and `preprocess` properly, and you create `vocabulary_index`, then it should just run.

In [0]:
def reverse_index(index):
  """ reverse_index(index) is a list xs where xs[index[w]] == w """
  return [word for _, word in sorted((v, k) for k, v in index.items())]




# The code below creates the counts_by_genre dataframe using the embedding function and the vocabulary index
# vocabulary_index is assumed to be a dict mapping words to integers
# embed is assumed to be a function mapping sequences of tokens to vectors

# First apply the embed function to every text in the dataframe
vectors = df_by_genre['synopsis'].map(lambda text: embed(preprocess(text), vocabulary_index))

# Now create a dictionary mapping genre names to vectors
d = dict(zip(df_by_genre['genre'], vectors))

# Use that dictionary to make a dataframe
counts_by_genre = pd.DataFrame(d)

# Add a column to the dataframe that contains the word for each index
counts_by_genre['word'] = reverse_index(vocabulary_index)


In [0]:
# Display the resulting dataframe, to make sure it looks right 
counts_by_genre

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,Film-Noir,History,Horror,Music,Musical,Mystery,News,Romance,Sci-Fi,Short,Sport,Thriller,War,Western,word
0,162.0,138.0,8.0,18.0,161.0,120.0,9.0,242.0,36.0,78.0,1.0,14.0,77.0,14.0,28.0,96.0,2.0,105.0,96.0,0.0,3.0,197.0,14.0,19.0,five
1,7.0,7.0,0.0,5.0,20.0,10.0,1.0,42.0,0.0,2.0,0.0,4.0,24.0,2.0,0.0,10.0,0.0,13.0,17.0,0.0,0.0,23.0,1.0,0.0,teenagers
2,103.0,104.0,17.0,9.0,57.0,38.0,4.0,77.0,38.0,47.0,0.0,3.0,56.0,4.0,14.0,26.0,0.0,37.0,63.0,0.0,11.0,112.0,3.0,0.0,climb
3,1855.0,1849.0,321.0,243.0,2009.0,1164.0,10.0,2778.0,650.0,969.0,27.0,129.0,1057.0,141.0,264.0,925.0,1.0,1224.0,1283.0,10.0,91.0,2289.0,212.0,117.0,over
4,4068.0,4168.0,587.0,592.0,4158.0,2341.0,47.0,5746.0,1393.0,2062.0,48.0,248.0,1751.0,309.0,462.0,1668.0,10.0,2304.0,2998.0,22.0,275.0,4531.0,445.0,197.0,an
5,107.0,104.0,0.0,4.0,12.0,10.0,0.0,21.0,6.0,3.0,0.0,4.0,25.0,0.0,0.0,22.0,0.0,31.0,94.0,0.0,2.0,80.0,6.0,1.0,iron
6,101.0,154.0,4.0,7.0,57.0,24.0,0.0,84.0,45.0,94.0,0.0,1.0,57.0,0.0,5.0,24.0,0.0,26.0,77.0,0.0,0.0,99.0,5.0,0.0,gate
7,42366.0,44042.0,7965.0,5311.0,47464.0,26143.0,420.0,62238.0,16926.0,24860.0,507.0,2905.0,21559.0,3152.0,5361.0,19180.0,55.0,28607.0,28692.0,118.0,2824.0,50370.0,4171.0,2322.0,and
8,271.0,257.0,64.0,33.0,316.0,158.0,1.0,395.0,113.0,163.0,4.0,16.0,155.0,19.0,45.0,114.0,0.0,217.0,184.0,1.0,13.0,310.0,35.0,12.0,break
9,4499.0,4712.0,979.0,362.0,4180.0,2196.0,32.0,4734.0,1869.0,2623.0,25.0,195.0,2623.0,203.0,550.0,1834.0,4.0,1969.0,3354.0,12.0,189.0,4927.0,344.0,167.0,into


## Part 2: Words Associated with Movie Genres

Let's figure out what words are most associated with each genre. To do that, let's turn our giant dataframe containing word counts into another giant dataframe containing tf-idf values.

In [0]:
# We'll need this number to calculate tf-idf
NUM_GENRES = len(genres)

# Let's make a copy of our dataframe
tfidf_by_genre = counts_by_genre.copy()
all_counts = tfidf_by_genre[[
 'History',
 'Thriller',
 'Family',
 'Horror',
 'Mystery',
 'Drama',
 'Music',
 'Adventure',
 'Comedy',
 'Action',
 'Musical',
 'Fantasy',
 'Documentary',
 'Animation',
 'Western',
 'War',
 'Short',
 'Film-Noir',
 'Romance',
 'Sport',
 'Sci-Fi',
 'Biography',
 'News',
 'Crime',
]]

** Add columns to the dataframe `tfidf_by_genre` containing document frequencies and overall frequencies. **

In [0]:
tfidf_by_genre['document_frequency'] = (all_counts > 0).sum(axis=1)

tfidf_by_genre['overall_frequency'] = all_counts.loc[:, "overall_frequency"]

tfidf_by_genre

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,Film-Noir,History,Horror,Music,Musical,Mystery,News,Romance,Sci-Fi,Short,Sport,Thriller,War,Western,word,document_frequency,overall_frequency
0,162.0,138.0,8.0,18.0,161.0,120.0,9.0,242.0,36.0,78.0,1.0,14.0,77.0,14.0,28.0,96.0,2.0,105.0,96.0,0.0,3.0,197.0,14.0,19.0,five,23,
1,7.0,7.0,0.0,5.0,20.0,10.0,1.0,42.0,0.0,2.0,0.0,4.0,24.0,2.0,0.0,10.0,0.0,13.0,17.0,0.0,0.0,23.0,1.0,0.0,teenagers,16,
2,103.0,104.0,17.0,9.0,57.0,38.0,4.0,77.0,38.0,47.0,0.0,3.0,56.0,4.0,14.0,26.0,0.0,37.0,63.0,0.0,11.0,112.0,3.0,0.0,climb,20,
3,1855.0,1849.0,321.0,243.0,2009.0,1164.0,10.0,2778.0,650.0,969.0,27.0,129.0,1057.0,141.0,264.0,925.0,1.0,1224.0,1283.0,10.0,91.0,2289.0,212.0,117.0,over,24,
4,4068.0,4168.0,587.0,592.0,4158.0,2341.0,47.0,5746.0,1393.0,2062.0,48.0,248.0,1751.0,309.0,462.0,1668.0,10.0,2304.0,2998.0,22.0,275.0,4531.0,445.0,197.0,an,24,
5,107.0,104.0,0.0,4.0,12.0,10.0,0.0,21.0,6.0,3.0,0.0,4.0,25.0,0.0,0.0,22.0,0.0,31.0,94.0,0.0,2.0,80.0,6.0,1.0,iron,17,
6,101.0,154.0,4.0,7.0,57.0,24.0,0.0,84.0,45.0,94.0,0.0,1.0,57.0,0.0,5.0,24.0,0.0,26.0,77.0,0.0,0.0,99.0,5.0,0.0,gate,17,
7,42366.0,44042.0,7965.0,5311.0,47464.0,26143.0,420.0,62238.0,16926.0,24860.0,507.0,2905.0,21559.0,3152.0,5361.0,19180.0,55.0,28607.0,28692.0,118.0,2824.0,50370.0,4171.0,2322.0,and,24,
8,271.0,257.0,64.0,33.0,316.0,158.0,1.0,395.0,113.0,163.0,4.0,16.0,155.0,19.0,45.0,114.0,0.0,217.0,184.0,1.0,13.0,310.0,35.0,12.0,break,23,
9,4499.0,4712.0,979.0,362.0,4180.0,2196.0,32.0,4734.0,1869.0,2623.0,25.0,195.0,2623.0,203.0,550.0,1834.0,4.0,1969.0,3354.0,12.0,189.0,4927.0,344.0,167.0,into,24,


** Replace each column with tfidf instead of raw count values. **

In [0]:
idf = np.log(NUM_GENRES / tfidf_by_genre['document_frequency'])
for genre in genres:
  # tfidf_by_genre[genre] gives you the vector of counts for the given genre
  # we want to replace those counts with the tfidf
  tfidf_by_genre[genre] = ... # TODO

In [0]:
tfidf_by_genre

**Answer the questions below.**

1. What are the top 10 most distinctive words per genre? Do these seem reasonable and interesting? 

2. Did you do any filtration or change your algorithms in any way to get the current results? Why or why not? 

3. What observations do you have about these words? What do they indicate to you about the genres? 

Instead of looking at the most distinctive words, it is sometimes useful to look at certain words that are of interest. For example, we may be interested in how gender roles are different in movies across genres and across time. We can study this by looking at *ratios of frequencies* for certain words. For the gender example, we could look at the frequency for the pronoun *she* divided by the combined frequencies of the pronouns *he* and *she* --- giving you the probability that any given subject pronoun is *she* as opposed to *he*.

**Answer the questions below.**

1. Why would it be a good idea to look at the probability of *she* as opposed to *he*, instead of the raw frequencies? Why is tf-idf not appropriate here?
2. Does this seem like a reasonable way to study gender roles in movies? List at least one advantage and one disadvantage. What is one way you could overcome the disadvantage(s)?



The code below will reshape the dataframe `counts_by_genre` into a new dataframe called `counts_by_genre2` where each row is a genre, and each column is a word. This will be useful for you to calculate the probability of *she* vs. *he*. Run this code and verify that the dataframe which is output at the end seems sensible.

In [0]:
counts_by_genre2 = counts_by_genre.T # .T means transpose
counts_by_genre2.columns = counts_by_genre['word']
counts_by_genre2 = counts_by_genre2[counts_by_genre2['word'] != 'word']
counts_by_genre2


**Calculate the proportion of subject pronouns which are *she* as opposed to *he* per genre in this dataset.** Display the results below. That is, calculate for each genre:
(count of *she*) / (count of *she* + count of *he*)

In [0]:
she_proportions = ... # TODO

You can use the code below to visualize your results. Below I have written a function which takes as input two arrays: the first is called `categories`, and in this case it represents the genres. The second is called `numbers`, and this case it is the proportions of the pronoun *she* (or it could be any sequence of numbers). The result is a plot of the proportions by genre. Right now, as an example, the code will visualize a series of meaningless numbers.

In [0]:
def plot(categories, numbers):
  df = pd.DataFrame({'categories': categories, 'numbers': numbers})
  return alt.Chart(df).mark_bar().encode(
    x='categories',
    y='numbers',
    tooltip=['categories', 'numbers'],
  ).interactive()

Below, use the function `plot` to make a plot of the proportion of the pronoun *she* by genre.

In [0]:
categories = sorted(genres)
numbers = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24])

plot(categories, numbers)

**Answer the questions below**. 

1. Which genres have the highest rate of *she* as opposed to *he*? 

2. What patterns do you notice? Are you surprised by anything? 

## Part 3: Movies over Time

Now we will examine how the plots of movies have changed over time.

One way to do this is to look at all the plot synopses from within a single year as a document, and find the most distinctive words per year.



Based on your code above, **create dataframes `counts_by_year` and `tfidf_by_year`** similar to `counts_by_genre` and `tfidf_by_genre`, but grouped by year instead of genre. Feel free to copy and reuse code from above.

In [0]:
YEARS = list(set(df_by_year['year']))
NUM_YEARS = len(YEARS) # 102 years (non-consecutive) are covered in our data


**Answer the questions below**.

1. What are the 10 most distinctive words per year for the years 1990-2019?
2. Did you do any extra filtration or change any algorithms to arrive at these words?
3. What observations do you have about these words? What do they indicate to you about how movie content has changed over the years? Did any of the words surprise you?

**Calculate the she-to-he proportion per year and display the results below.** Same as we did by genre, but try it by year. 

**Use the visualization function `plot()` from above to visualize your results.**

**Answer the questions below.** 
1. What patterns do you notice? Are you surprised by anything?



## Part 4: Your turn

Think of another pair of words for which it might be interesting to track their relative probabilities, like we did for *he* and *she*.

**Calculate the probabilities for this pair of words by genre or by year. Use `plot()` to display the results. Try to find something interesting!**

**Answer the questions below.**

1. What difficulties did you encounter in this analysis and what was your strategy for overcoming them? 

2. Were the preprocessing steps from earlier still appropriate here? If not, why?

3. What insight did you gain about movies over time or over genres?