# Programming Assessment Prompt #1

This prompt asks us to find the 5 most popular genres within the given *movies* dataset.

In this notebook I will be:
-  Munging the data into a form most conducive to analyses for the given question
-  Determining which statistic (mean/median/count) is best to establish a ranking
-  Computing the ranking

We start by importing necessary modules and the dataset:

In [11]:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
import ast

movies = pd.read_csv("movie_data.csv")

movies.head(5)

Unnamed: 0,id,title,release_date,box_office_revenue,runtime,genres,summary
0,0,Ghosts of Mars,2001-08-24,14010832.0,98.0,"[""Space western"", ""Horror"", ""Supernatural"", ""T...","Set in the second half of the 22nd century, th..."
1,1,White Of The Eye,1987,,110.0,"[""Erotic thriller"", ""Psychological thriller"", ...",A series of murders of rich young women throug...
2,2,A Woman in Flames,1983,,106.0,"[""Drama""]","Eva, an upper class housewife, becomes frustra..."
3,3,The Sorcerer's Apprentice,2002,,86.0,"[""Adventure"", ""Fantasy"", ""World cinema"", ""Fami...","Every hundred years, the evil Morgana returns..."
4,4,Little city,1997-04-04,,93.0,"[""Romance Film"", ""Ensemble Film"", ""Comedy-dram...","Adam, a San Francisco-based artist who works a..."


## Now we want to check the type of each feature to make sure it matches our initial intuition

In [12]:
for feature in movies.columns:
    print(feature, "...", type(movies[feature][0]))

id ... <class 'numpy.int64'>
title ... <class 'str'>
release_date ... <class 'str'>
box_office_revenue ... <class 'numpy.float64'>
runtime ... <class 'numpy.float64'>
genres ... <class 'str'>
summary ... <class 'str'>


We can see that "genres" is actually a string and not a list. Let's change it into a list!

## Munging

We convert list-looking strings into lists using the "literal_eval" function from the ast module.

In [13]:
movies['genres'] = pd.Series(ast.literal_eval(genres) for genres in movies['genres'])

#Let's check our work:
type(movies['genres'][0])

list
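
As a small standalone illustration of what "literal_eval" does, here is a sketch on a made-up string shaped like one entry of the raw genres column (the sample string and the variable names are assumptions for the example, not values pulled from the run above):

```python
import ast

# Hypothetical sample, shaped like one entry of the raw 'genres' column
raw = '["Space western", "Horror", "Supernatural"]'

parsed = ast.literal_eval(raw)
print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[1])

# Unlike eval, literal_eval only accepts Python literals; anything else raises ValueError
try:
    ast.literal_eval("len('abc')")
    rejected = False
except ValueError:
    rejected = True
print("function call rejected:", rejected)
```

The second half is the reason to prefer it over a plain eval when parsing strings that come from a file.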

Next we split by genres so each movie has a row for each genre it falls under. This allows us to more easily work with the data and ask genre-centric questions of the data.

Note: All of these list comprehensions build full intermediate lists in memory, which makes us prone to memory errors. My small laptop can handle it as long as I am not running a bunch of other processes in Jupyter at the same time.

In [14]:
#all_genres stores every genre from every movie's list, flattened into one 1-D array
#This makes it easy to construct our dataframe later
all_genres = np.hstack(movies.genres)

#Each feature is stacked so that a value repeats for as many genres as were in the original genre list
all_titles = np.hstack([[title]*len(genre) for title, genre in movies[['title', 'genres']].values])
all_release_dates = np.hstack([[release]*len(genre) for release, genre in movies[['release_date', 'genres']].values])
all_revenues = np.hstack([[revenue]*len(genre) for revenue, genre in movies[['box_office_revenue', 'genres']].values])
all_runtimes = np.hstack([[runtime]*len(genre) for runtime, genre in movies[['runtime', 'genres']].values])
all_summaries = np.hstack([[summary]*len(genre) for summary, genre in movies[['summary', 'genres']].values])

movies_split = pd.DataFrame({'genres':all_genres, 'titles':all_titles, 'release_date':all_release_dates,
                     'box_office_revenue':all_revenues, 'runtime':all_runtimes, 'summary':all_summaries})

#Let's check to see if this yields what we expect...
movies_split.head(10)

Unnamed: 0,box_office_revenue,genres,release_date,runtime,summary,titles
0,14010832.0,Space western,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
1,14010832.0,Horror,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
2,14010832.0,Supernatural,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
3,14010832.0,Thriller,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
4,14010832.0,Science Fiction,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
5,14010832.0,Action,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
6,14010832.0,Adventure,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
7,,Erotic thriller,1987,110.0,A series of murders of rich young women throug...,White Of The Eye
8,,Psychological thriller,1987,110.0,A series of murders of rich young women throug...,White Of The Eye
9,,Thriller,1987,110.0,A series of murders of rich young women throug...,White Of The Eye
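
As an aside, pandas 0.25 and later provide `DataFrame.explode`, which performs this one-row-per-genre expansion without building the repeated lists by hand. The sketch below is not what was run above; it uses a two-movie toy frame (values copied from the head of the dataset, genre lists shortened) just to show the idea.

```python
import pandas as pd

# Toy frame: two movies from the head of the dataset, with shortened genre lists
toy = pd.DataFrame({
    "title": ["Ghosts of Mars", "A Woman in Flames"],
    "runtime": [98.0, 106.0],
    "genres": [["Horror", "Thriller", "Action"], ["Drama"]],
})

# explode repeats every other column once per element of the list in 'genres'
toy_split = toy.explode("genres").reset_index(drop=True)
print(toy_split)
```

One difference to keep in mind: `explode` keeps a movie whose genre list is empty as a single row with NaN in the genre column, whereas the `np.hstack` approach above drops such a movie entirely.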


Next we want to look at what the actual genres are. To save the reader from a ton of scrolling I have hidden the output showing all unique genres.

In [15]:
unique_genres = list(set(all_genres))

print(len(unique_genres))

unique_genres;

363


Note there are 363 unique genres. Looking through them, we can see that a number are, for all intents and purposes, the same genre spelled slightly differently, or are synonyms of other genres.

Using the tools provided by the package "fuzzywuzzy" we can create a mapping dictionary to replace certain genre labels. Here is a good example of the two different functions we will use from fuzzywuzzy:

In [16]:
#Each of these is a member of unique_genres

#partial_ratio matches on substrings and is good for matching things like: "New York Mets" and "Mets"
print("partial_ratio: ", fuzz.partial_ratio("family", "family-oriented"))

#ratio compares the two full strings using an edit-distance (Levenshtein-style) similarity.
#It scores poorly on pairs like this one, where one string is contained in the other and partial_ratio does well.
print("ratio: ", fuzz.ratio("family", "family-oriented"))

#Here we see a case where ratio is appropriate
print("ratio: ", fuzz.ratio("documentary", "documetary"))

partial_ratio:  100
ratio:  57
ratio:  95
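
For intuition about what these two scores measure, here is a rough standard-library approximation. It is a sketch and not fuzzywuzzy's actual implementation: `ratio_sketch` and `partial_ratio_sketch` are names made up for this example. The first scales `difflib.SequenceMatcher.ratio()` to 0-100, and the second simply tries every window of the longer string that has the length of the shorter one.

```python
from difflib import SequenceMatcher

def ratio_sketch(a, b):
    """Similarity of the two full strings, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def partial_ratio_sketch(a, b):
    """Best full-string score of the shorter string against any
    equally long window of the longer string (brute force)."""
    shorter, longer = sorted((a, b), key=len)
    n = len(shorter)
    if n == 0:
        return 0
    return max(ratio_sketch(shorter, longer[i:i + n])
               for i in range(len(longer) - n + 1))

print(partial_ratio_sketch("family", "family-oriented"))
print(ratio_sketch("family", "family-oriented"))
print(ratio_sketch("documentary", "documetary"))
```

On these three pairs the sketch gives the same scores as the fuzzywuzzy output above; that agreement should not be assumed for arbitrary strings.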


Using these two string-matching metrics we construct a mapping dictionary. We use it as a temporary step so that the next dictionary contains only non-empty genre-to-genre mappings.

In [48]:
temp_mapper = {genre : [] for genre in unique_genres}

for i in range(len(unique_genres)):
    for j in range(len(unique_genres)):
        
        if i != j:
            ratio_score = fuzz.ratio(unique_genres[i],unique_genres[j])
            partial_score = fuzz.partial_ratio(unique_genres[i],unique_genres[j])
            
            #We choose the thresholds here to be restrictive in ratio and very restrictive in partial_ratio
            #These values were found through trial and error
            if ratio_score >= 85 or partial_score == 100:
                temp_mapper[unique_genres[i]].append(unique_genres[j])

#After looking through temp_mapper we can find an appropriate mapping visually
genre_mapper = {"Action": "Action/Adventure", "Adventure": "Action/Adventure", "Music": "Musical",
               "Computer Animation": "Animation", "Animated Cartoon": "Animation", "Backstage Musical": "Musical",
               "Beach Party film": "Beach Film", "Biographical Film": "Biography", "Biopic [feature]": "Biography",
               "Breakdance": "Dance", "Alien invasion": "Alien Film", "Children\'s/Family": "Children\'s",
               "Children\'s Entertainment": "Children\'s", "Children\'s Fantsy": "Children\'s", 
               "Children\'s Issues": "Children\'s", "Comdedy": "Comedy", "Comedy horror": "Horror Comedy",
               "Comedy film": "Comedy", "Coming-of-age film": "Coming of age", "Detective Fiction": "Detective",
               "Education": "Educational", "Extreme Sports": "Sports", "Gross-out film": "Gross out",
               "World History": "History", "Humour": "Comedy", "Monster movie": "Monster", "Prison film": "Prison",
               "Superhero movie": "Superhero", "Sword and sorcery films": "Sword and sorcery"}

When choosing the mapping we left appropriately specific deviations from a more general genre alone (e.g. Psychological thriller doesn't get mapped to Thriller), but we did bucket things like Humour -> Comedy. Manual creation was necessary since fuzzywuzzy was overzealous in its matching and we would have been left with improper combinations. We now replace each genre found in the keys of the mapping with the genre found in the corresponding value.

In [None]:
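#Assign the result back: inplace=True on a column selection triggers warnings in newer pandas and may not update movies_split
movies_split['genres'] = movies_split['genres'].replace(genre_mapper)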

#We also want to drop duplicate rows where a given movie has both the genre mapped to and from in 'genres'
movies_split = movies_split.drop_duplicates()
movies_split.head(10)
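
To see why the duplicate drop is needed, consider a toy version of the split frame (a sketch with a two-entry mapper and three made-up rows, not the real data). A movie tagged with both "Action" and "Adventure" ends up with two identical rows once both labels are mapped to "Action/Adventure":

```python
import pandas as pd

# Two entries taken from genre_mapper; the frame itself is a toy example
toy_mapper = {"Action": "Action/Adventure", "Adventure": "Action/Adventure"}
toy = pd.DataFrame({
    "titles": ["Ghosts of Mars"] * 3,
    "genres": ["Horror", "Action", "Adventure"],
})

toy["genres"] = toy["genres"].replace(toy_mapper)
print(toy)          # the last two rows are now identical

deduped = toy.drop_duplicates()
print(deduped)      # one row per (movie, mapped genre)
```

Without the drop, the merged genre would be counted twice for that movie in any count-based ranking.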

Awesome! Now that our data is in a format that lends itself to genre-level analysis, let's look at how we will define and quantify "popular".

## Establishing a measure of Popularity

There are a few different metrics we could use to quantify popularity from the data we are given.
- The mean box office revenue across all movies in a genre
- The median box office revenue across all movies in a genre
- The number of times a given genre appears in our dataset

The reasoning for this last measure would be that, given the sampling from Wikipedia was random, it roughly represents the actual population of movies. If so, it would make some sense that a more popular genre would have more movies made in that genre than a less popular genre. Hence, the higher the count of occurrences for a given genre, the more popular.

To begin, let's take a look at the distributions of box office revenues for a few genres to see if the mean might be a good statistical descriptor. We will use a few of the genres with the largest sample sizes.

In [None]:
movies_split['genres'].groupby(movies_split['genres']).count().nlargest(50)

In [None]:
#Non-null box office revenues for three of the largest genres
horror_revenue = movies_split[movies_split['genres'] == 'Horror']['box_office_revenue'].dropna()

comedy_revenue = movies_split[movies_split['genres'] == 'Comedy']['box_office_revenue'].dropna()

drama_revenue = movies_split[movies_split['genres'] == 'Drama']['box_office_revenue'].dropna()

fig, ax = plt.subplots(1,3, sharex=True)

ax[0].hist(horror_revenue, alpha = 0.5, color = 'r')
ax[0].set_title('Horror')
ax[0].set_ylabel('Counts')
ax[0].set_xlabel("Revenue")

ax[1].hist(comedy_revenue, alpha = 0.5, color = 'g')
ax[1].set_title('Comedy')
ax[1].set_ylabel("Counts")
ax[1].set_xlabel("Revenue")

ax[2].hist(drama_revenue, alpha = 0.5, color = 'b')
ax[2].set_title('Drama')
ax[2].set_ylabel("Counts")
ax[2].set_xlabel("Revenue")

plt.tight_layout()
plt.show()

Taking a look at the plots above we can see that the vast majority of box office revenues are found at the lower end of the range. Also we see that for each genre plotted there are a few significant outliers which indicates that the mean will be greatly affected by these outliers. Because of this we assume that either median or genre-count are better overall descriptions of popularity via box office revenue.
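
A tiny numeric sketch makes the outlier argument concrete. The revenue figures below are invented for illustration (four modest earners and one blockbuster) and are not taken from the dataset:

```python
from statistics import mean, median

# Hypothetical revenues in dollars: four modest earners and one blockbuster
revenues = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 500_000_000]
without_blockbuster = revenues[:-1]

print("mean with / without the outlier:  ", mean(revenues), mean(without_blockbuster))
print("median with / without the outlier:", median(revenues), median(without_blockbuster))
```

The single large value moves the mean from 2.5 million to 102 million, while the median only moves from 2.5 million to 3 million.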

In [None]:
def top_n_median(n):
    return movies_split['box_office_revenue'].groupby(movies_split['genres']).median().nlargest(n)

#Let's take a look at the top 20 genres and see if they align with our intuition (i.e. Comedy/Drama/etc...)
top_n_median(20)

This does not match our intuition whatsoever! I have never heard of anyone going to a "Space Opera", which certainly wouldn't be true for the actual most popular genre.

Let's look at count and see what we find.

In [None]:
movies_split['box_office_revenue'].groupby(movies_split['genres']).count().nlargest(20)

This list certainly makes the most sense. Keep in mind that 7,587 of our 42,204 movies have non-empty values for box office revenue. The movie industry, like any other, is about serving a market, so it is easy to see how demand dictates supply. Treating the dataset as a random sample of that supply, we can reason that the frequency with which a genre occurs is directly related to consumer demand, which in turn is directly related to "popularity".
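
Since the prompt asks for the five most popular genres, the last step under the count measure is simply to take the top five. Here is a minimal sketch using `collections.Counter` on made-up labels; the list `toy_labels` and its counts are hypothetical stand-ins for the genres column of movies_split, not results from the dataset:

```python
from collections import Counter

# Hypothetical genre labels, one per (movie, genre) row
toy_labels = (["Drama"] * 6 + ["Comedy"] * 5 + ["Thriller"] * 4
              + ["Horror"] * 3 + ["Romance Film"] * 2 + ["Space western"])

# most_common(5) returns (genre, count) pairs, largest count first
top_five = Counter(toy_labels).most_common(5)
for genre, count in top_five:
    print(genre, count)
```

On the real frame, `movies_split['genres'].value_counts().head(5)` should give the equivalent ranking over all rows.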