# Programming Assessment Prompt #1

This prompt asks us to find the 5 most popular genres within the given *movies* dataset.

In this notebook I will be:
-  Munging the data into a form most conducive to analyses for the given question
-  Determining which statistic (mean/median/count) is best to establish a ranking
-  Computing the ranking

We start by importing necessary modules and the dataset:

In [25]:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
import ast

movies = pd.read_csv("movie_data.csv")

movies.head(5)

Unnamed: 0,id,title,release_date,box_office_revenue,runtime,genres,summary
0,0,Ghosts of Mars,2001-08-24,14010832.0,98.0,"[""Space western"", ""Horror"", ""Supernatural"", ""T...","Set in the second half of the 22nd century, th..."
1,1,White Of The Eye,1987,,110.0,"[""Erotic thriller"", ""Psychological thriller"", ...",A series of murders of rich young women throug...
2,2,A Woman in Flames,1983,,106.0,"[""Drama""]","Eva, an upper class housewife, becomes frustra..."
3,3,The Sorcerer's Apprentice,2002,,86.0,"[""Adventure"", ""Fantasy"", ""World cinema"", ""Fami...","Every hundred years, the evil Morgana returns..."
4,4,Little city,1997-04-04,,93.0,"[""Romance Film"", ""Ensemble Film"", ""Comedy-dram...","Adam, a San Francisco-based artist who works a..."


Now we want to check the type of each feature to make sure it matches our initial intuition.

In [22]:
for feature in movies.columns:
    print(feature, "...", type(movies[feature][0]))

id ... <class 'numpy.int64'>
title ... <class 'str'>
release_date ... <class 'str'>
box_office_revenue ... <class 'numpy.float64'>
runtime ... <class 'numpy.float64'>
genres ... <class 'str'>
summary ... <class 'str'>


We can see that "genres" is actually a string and not a list. Let's change it into a list!

## Munging

In [26]:
#TODO: use ast.literal_eval instead of manual string cleaning
def genre_cleaner(row):
    genre_string = row['genres']
    
    genre_string = genre_string.replace(',','').replace('"','')
    
    #We lowercase in case capitalization conventions vary
    #Note: split() breaks on whitespace, so a multi-word genre such as "space western" becomes separate tokens
    genre_list = genre_string.strip("[]").lower().split()
    
    return genre_list

#Here we reassign genre column. We change the original dataset as there is no reason to keep old format
movies['genres'] = movies.apply(lambda row: genre_cleaner(row), axis = 1)

#Let's check our work:
type(movies['genres'][0])

list
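The comment at the top of the cell above mentions `ast.literal_eval`. As a minimal sketch of that alternative (the helper name `parse_genres` is hypothetical, and this is not what produced the outputs in the rest of this notebook), the genre string can be parsed as a Python list literal, which keeps multi-word genres such as "space western" intact instead of splitting them into separate tokens:

```python
import ast

def parse_genres(genre_string):
    """Parse a string like '["Space western", "Horror"]' into a list of lowercase genres."""
    # literal_eval safely evaluates the string as a Python literal (here, a list of strings)
    return [genre.lower() for genre in ast.literal_eval(genre_string)]

example = '["Space western", "Horror", "Supernatural"]'
print(parse_genres(example))
```

Switching to this parser would change the set of unique genres downstream, so the counts and mappings later in the notebook would need to be recomputed.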

Next we split by genres so each movie has a row for each genre it falls under. This allows us to more easily work with the data and ask genre-centric questions of the data.

Note: Building all of these lists in memory makes us prone to memory errors on larger datasets, though my small laptop can handle it as long as I am not running many other processes in Jupyter.

In [29]:
#all_genres stores all the genres split out of the lists and stacked horizontally
#This makes it easy to construct our dataframe later
all_genres = np.hstack(movies.genres)

#Each feature is stacked so that a value repeats for as many genres as were in the original genre list
all_titles = np.hstack([[title]*len(genre) for title, genre in movies[['title', 'genres']].values])
all_release_dates = np.hstack([[release]*len(genre) for release, genre in movies[['release_date', 'genres']].values])
all_revenues = np.hstack([[revenue]*len(genre) for revenue, genre in movies[['box_office_revenue', 'genres']].values])
all_runtimes = np.hstack([[runtime]*len(genre) for runtime, genre in movies[['runtime', 'genres']].values])
all_summaries = np.hstack([[summary]*len(genre) for summary, genre in movies[['summary', 'genres']].values])

movies_split = pd.DataFrame({'genres':all_genres, 'titles':all_titles, 'release_date':all_release_dates,
                     'box_office_revenue':all_revenues, 'runtime':all_runtimes, 'summary':all_summaries})

#Let's check to see if this yields what we expect...
movies_split.head(10)

Unnamed: 0,box_office_revenue,genres,release_date,runtime,summary,titles
0,14010832.0,space,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
1,14010832.0,western,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
2,14010832.0,horror,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
3,14010832.0,supernatural,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
4,14010832.0,thriller,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
5,14010832.0,science,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
6,14010832.0,fiction,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
7,14010832.0,action,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
8,14010832.0,adventure,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
9,,erotic,1987,110.0,A series of murders of rich young women throug...,White Of The Eye
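As a possible alternative to stacking each column by hand, pandas (version 0.25 or later) provides `DataFrame.explode`, which repeats the other columns once per list element. The sketch below uses a small toy frame built from two rows of the dataset rather than the full `movies` frame. The names `toy` and `toy_split` are illustrative only.

```python
import pandas as pd

# Toy frame with a list-valued genres column, mimicking movies after cleaning
toy = pd.DataFrame({
    "title": ["Ghosts of Mars", "A Woman in Flames"],
    "runtime": [98.0, 106.0],
    "genres": [["horror", "thriller"], ["drama"]],
})

# explode gives each list element its own row and repeats the other columns
toy_split = toy.explode("genres").reset_index(drop=True)
print(toy_split)
```

This produces the same one-row-per-genre layout without building intermediate lists for every column.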


Next we want to look at what the actual genres are. To save the reader from a ton of scrolling I have hidden the output showing all unique genres.

In [40]:
unique_genres = list(set(all_genres))

print(len(unique_genres))

unique_genres;

386


Note there are 386 unique genres. Also note that a number of these genres are identical except for a small spelling mistake, or are very close synonyms of one another.

Using the tools provided by the package "fuzzywuzzy" we can create a mapping dictionary to replace certain genre labels. Here is a good example of the two different functions we will use from fuzzywuzzy:

In [43]:
#Each of these is a member of unique_genres

#partial_ratio matches on substrings and is good for matching things like: "New York Mets" and "Mets"
print("partial_ratio: ", fuzz.partial_ratio("family", "family-oriented"))

#ratio scores the whole-string similarity, based on the edit (Levenshtein) distance from one string to another.
#We can see that it does not perform well on substring pairs, where partial_ratio does.
print("ratio: ", fuzz.ratio("family", "family-oriented"))

#Here we see a case where ratio is appropriate
print("ratio: ", fuzz.ratio("documentary", "documetary"))

partial_ratio:  100
ratio:  57
ratio:  95
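To make the difference between the two scores concrete, here is a rough standard-library approximation using `difflib.SequenceMatcher`. The helpers `simple_ratio` and `simple_partial_ratio` are hypothetical names and are not fuzzywuzzy's implementation, whose partial matching picks candidate windows differently. The sketch only illustrates the idea: a whole-string score versus the best score over windows of the longer string.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def simple_partial_ratio(a, b):
    """Best whole-string score of the shorter string against any same-length window of the longer one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    windows = (long_[i:i + len(short)] for i in range(len(long_) - len(short) + 1))
    return max(simple_ratio(short, window) for window in windows)

print("partial:", simple_partial_ratio("family", "family-oriented"))
print("ratio:", simple_ratio("family", "family-oriented"))
print("ratio:", simple_ratio("documentary", "documetary"))
```

On these three pairs the approximation gives 100, 57 and 95, in line with the fuzzywuzzy output above. Agreement on other inputs is not guaranteed.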


Using these two string-matching metrics we construct a mapping dictionary. It serves as a temporary structure: the dictionary built from it afterwards keeps only the genres that matched at least one other genre.

In [48]:
temp_mapper = {genre : [] for genre in unique_genres}

for i in range(len(unique_genres)):
    for j in range(len(unique_genres)):
        
        if i != j:
            ratio_score = fuzz.ratio(unique_genres[i],unique_genres[j])
            partial_score = fuzz.partial_ratio(unique_genres[i],unique_genres[j])
            
            #We choose the threshold here to be restrictive in ratio and very restrictive in partial_ratio
            #These values were found through trial and error
            if ratio_score >= 85 or partial_score == 100:
                temp_mapper[unique_genres[i]].append(unique_genres[j])

#The following takes out all of the empty genre-to-genre elements
#This makes it easier to visually go over
genre_mapper = {}

for key, value in temp_mapper.items():
    if len(value) > 0:
        genre_mapper[key] = value
        
genre_mapper;

In [52]:
unique_genres

['black',
 'juvenile',
 'prison',
 'demonic',
 'rock',
 'backstage',
 'hip',
 'rape',
 'about',
 'erotic',
 'sorcery',
 'business',
 'noir',
 'music',
 'black-and-white',
 'musical',
 'martial',
 'bias',
 'pictures',
 'investing',
 'gangster',
 'auto',
 'hollywood',
 'tokusatsu',
 'anti-war',
 'vampire',
 'adventure',
 'racing',
 'news',
 'neorealism',
 'japanese',
 'latino',
 'animated',
 'tragicomedy',
 'comedy-drama',
 'bollywood',
 'rockumentary',
 'finance',
 'company',
 'piece',
 'candid',
 'anthropology',
 'jungle',
 'legal',
 'reboot',
 'fan',
 'heavenly',
 'mondo',
 'gothic',
 'motion',
 'interest',
 'applied',
 'studies',
 'military',
 'giallo',
 'mockumentary',
 'remake',
 '&',
 'softcore',
 'theatrical',
 'kitchen',
 '[feature]',
 'indian',
 'chinese',
 'swashbuckler',
 'filipino',
 'manners',
 'libraries',
 'private',
 'period',
 'splatter',
 'computer',
 'thrillers',
 'humour',
 'space',
 'cavalry',
 'personal',
 'flick',
 'chase',
 'out',
 'blaxploitation',
 'new',
 'hyb

Again, in the interest of saving screen space, we hide the output. After looking through the mapping, I arrived at the manually entered version of genre_mapper shown in the next cell. Manual creation was necessary since fuzzywuzzy was a bit too overzealous in its matching and would have left us with improper combinations. We then replace each genre that appears as a key with the genre given as its corresponding value.

In [51]:
genre_mapper = {}

genre_mapper = {'action': 'action/adventure', 'adventure': 'action/adventure', 'age':'coming-of-age',
               'animal': 'animals', 'art': 'arts', 'child': 'children\'s/family', 'children\'s': 'children\'s/family',
               'comdedy': 'comedy', 'humour': 'comedy', 'computers': 'computer', 'breakdance': 'dance', 
               'documetary': 'documentary', 'education': 'educational', 'homoeroticism': 'erotic', 'erotica': 'erotic',
               'family': 'children\'s/family', 'family-oriented': 'children\'s/family', 'fantasies': 'fantasy',
               'fictional': 'fiction', 'gross-out': 'gross', 'investing': 'finance', 'film-opera': 'opera',
               'pictures': 'picture', 'thrillers': 'thriller'}

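#Here we use the mapping we just created to replace like genres
#Assigning the result back avoids relying on inplace replacement through a column selection
movies_split['genres'] = movies_split['genres'].replace(genre_mapper)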

movies_split.head(10)

Unnamed: 0,box_office_revenue,genres,release_date,runtime,summary,titles
0,14010832.0,space,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
1,14010832.0,western,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
2,14010832.0,horror,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
3,14010832.0,supernatural,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
4,14010832.0,thriller,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
5,14010832.0,science,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
6,14010832.0,fiction,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
7,14010832.0,action/adventure,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
8,14010832.0,action/adventure,2001-08-24,98.0,"Set in the second half of the 22nd century, th...",Ghosts of Mars
9,,erotic,1987,110.0,A series of murders of rich young women throug...,White Of The Eye
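Note that rows 7 and 8 above now carry the same genre for the same movie, since "action" and "adventure" both map to "action/adventure". If each movie should count at most once per genre (an assumption about the intended analysis, not something the prompt states), the duplicates can be dropped after remapping. A minimal sketch on toy data, with illustrative names `toy`, `toy_mapper` and `deduped`:

```python
import pandas as pd

# Toy frame mimicking a few rows of movies_split for a single movie
toy = pd.DataFrame({
    "titles": ["Ghosts of Mars"] * 3,
    "genres": ["horror", "action", "adventure"],
})
toy_mapper = {"action": "action/adventure", "adventure": "action/adventure"}

# Remap like genres, then keep one row per (title, genre) pair
toy["genres"] = toy["genres"].replace(toy_mapper)
deduped = toy.drop_duplicates(subset=["titles", "genres"]).reset_index(drop=True)
print(deduped)
```

Deduplicating on title alone would merge distinct movies that share a title, so on the real data a unique movie identifier would be a safer key if one is carried through the split.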


Awesome! Now that we have our data in a format that lends itself to genre analyses, let's look at how we will define and quantify "popular".

## Establishing a measure of Popularity

There are a few different metrics we could use to quantify popularity from the data we are given.
- The mean box office revenue across all movies in a genre
- The median box office revenue across all movies in a genre
- The number of times a given genre appears in our dataset

The reasoning for this last measure would be that, given the sampling from Wikipedia was random, it roughly represents the actual population of movies. If so, it would make some sense that a more popular genre would have more movies made in that genre than a less popular genre. Hence, the higher the count of occurrences for a given genre, the more popular.
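The three candidate metrics can be computed side by side with a single `groupby`. The sketch below runs on a made-up toy frame (the revenue figures are invented for illustration, and the names `toy` and `summary` are not from the notebook). The same pattern should apply to `movies_split`. Note that `"count"` ignores missing revenues, whereas `size()` counts every row, so the two can differ for genres with many missing values.

```python
import pandas as pd

# Invented toy data: one row per (movie, genre), with one missing revenue
toy = pd.DataFrame({
    "genres": ["horror", "horror", "drama", "drama", "drama"],
    "box_office_revenue": [10.0, 30.0, 5.0, None, 7.0],
})

# Mean, median and non-missing count of revenue per genre
summary = toy.groupby("genres")["box_office_revenue"].agg(
    mean_revenue="mean", median_revenue="median", n_with_revenue="count"
)
# Number of appearances per genre, regardless of missing revenue
summary["n_movies"] = toy.groupby("genres").size()

print(summary.sort_values("n_movies", ascending=False))
```

Sorting by a different column, for example `median_revenue`, would give the ranking under that metric instead.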