# Introduction

Because we need to scrape a lot of data, we perform the initial analysis for milestone 2 on a subset of movies and the actors who play in those movies. This is done to show the viability of our approach. As soon as we have the full data set, we will scale up our analysis.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df_movies_subset=pd.read_pickle('movies_subset.pkl')
df_actors_awards=pd.read_pickle('actors_awards_subset.pkl')
df_movies=pd.read_csv('data/title.basics.tsv', sep='\t')
df_actors=pd.read_csv('data/name.basics.tsv', sep='\t')
df_ratings=pd.read_csv('data/title.ratings.tsv', sep='\t')



## Raw Data

We start by merging the different data sets with information about the movies into one data set.

### Movies data

In [3]:
#adding our scraped data to the original movie data set
movies_data=df_movies_subset.merge(df_movies,how='inner', on='tconst')
movies_data=movies_data.merge(df_ratings, how='left', on='tconst').drop(['titleType','isAdult','endYear'],axis=1)
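The two merges above behave differently: the inner join keeps only scraped movies that also appear in the basics table, while the left join keeps every remaining movie even if it has no rating. A minimal sketch on toy frames (the `tconst` values and the frame names `scraped`, `basics` and `ratings` are made up for illustration) that uses `indicator=True` to check how many movies end up without a rating:

```python
import pandas as pd

# Toy stand-ins for the scraped subset, title.basics and title.ratings (illustrative values only)
scraped = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3'], 'oscarWins': [0, 1, 0]})
basics = pd.DataFrame({'tconst': ['tt1', 'tt2'], 'primaryTitle': ['A', 'B']})
ratings = pd.DataFrame({'tconst': ['tt1'], 'averageRating': [7.5]})

# Inner join: movies missing from `basics` are dropped
merged = scraped.merge(basics, how='inner', on='tconst')

# Left join: every remaining movie is kept, missing ratings become NaN
merged = merged.merge(ratings, how='left', on='tconst', indicator=True)

# '_merge' is 'left_only' for movies that found no rating
n_without_rating = (merged['_merge'] == 'left_only').sum()
print(merged)
print(n_without_rating)
```

Checking these counts on the real data would show how much of the subset is lost in the inner join and how sparse the ratings are.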

In [4]:
with pd.option_context('display.max_columns', 50):
    display(movies_data.head())

Unnamed: 0,tconst,stars,oscarWins,nominations,wins,releaseDate,releaseCountry,plotKeywords,budget,worldwideGross,metascore,musicProducer,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0014799,"[nm0265550, nm0370407, nm0550195]",0,0,0,1924-05-31,UK,[],,,,,The Conspirators,The Conspirators,1924,\N,"Crime,Drama",,
1,tt0014843,"[nm0427659, nm0107574, nm0421138]",0,0,0,1924-08-24,USA,[],,,,,The Desert Outlaw,The Desert Outlaw,1924,60,Western,,
2,tt0014809,"[nm0267916, nm0119572, nm0055809]",0,0,0,1924-04-08,USA,[],,,,,Crossed Trails,Crossed Trails,1924,50,Western,,
3,tt0014751,"[nm0403710, nm0744408]",0,0,0,,,[],,,,,La buenaventura de Pitusín,La buenaventura de Pitusín,1924,\N,\N,,
4,tt0014812,"[nm0556953, nm0531962, nm0645941]",0,0,0,1924-12-28,USA,[],,,,,Curlytop,Curlytop,1924,60,"Drama,Romance",,


#### Description of the data

- tconst: Unique IMDb identifier of each movie, given as string
- stars: Three main actors playing in the movie, given as list of strings, scraped
- oscarWins: Number of Oscars awarded to the movie, given as float, scraped
- nominations: Number of general award nominations, given as float, scraped
- wins: Number of awards won, given as integer, scraped
- releaseDate: Release date of the movie, given as DateTime, scraped
- releaseCountry: Country of release, given as string, scraped
- plotKeywords: Keywords describing the plot of the movie, given as list of strings, scraped
- budget: Production budget of the movie, given as string, scraped
- worldwideGross: Worldwide revenue, given as string, scraped
- metascore: Weighted average of the scores assigned to the movie's reviews by a large group of respected critics, summarizing the range of their opinions, given as float, scraped
- musicProducer: Producer of the music in the movie, string, scraped
- primaryTitle: English title, string
- originalTitle: Original title, string
- startYear: Year of release, integer
- runtimeMinutes: Runtime of the movie in minutes, string, scraped
- genres: Genres the movie can be attributed to, string
- averageRating: Average rating of the movie given by IMDb users, float
- numVotes: Number of users that have scored the movie, float



### Actors data

In [5]:
df_actors_awards.head()

Unnamed: 0,nconst,year,category,w_n,description,movie,tconst
0,nm0309470,2005,David,Nominee,Best Supporting Actress (Migliore Attrice non ...,Cuore sacro,tt0429898
1,nm0309470,1968,Golden Plate,Winner,,Grazie zia,tt0063033
2,nm0309470,2005,Golden Ciak,Winner,Best Supporting Actress (Migliore Attrice Non ...,Cuore sacro,tt0429898
3,nm0309470,1967,Golden Globe,Winner,Best Actress (Migliore Attrice),Svegliati e uccidi,tt0061049
4,nm0309470,1968,Golden Goblet,Winner,Best Actress (Migliore Attrice),Grazie zia,tt0063033


In [6]:
df_actors.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0043044,tt0050419,tt0072308,tt0053137"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0117057,tt0038355,tt0071877,tt0037382"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0054452,tt0049189,tt0059956,tt0057345"
3,nm0000004,John Belushi,1949,1982,"actor,writer,soundtrack","tt0077975,tt0080455,tt0078723,tt0072562"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0083922,tt0069467,tt0050986,tt0050976"


## Data Wrangling

### Movie Data

In movies_data we need to change a few columns to achieve our goal:

- releaseDate: we only want the month, to see if there is seasonality; the release year is given separately
- budget and worldwideGross: these are given in different currencies depending on the movie, and the two values of a given movie are not always in the same currency. To proceed, we first need to convert the strings into integers and then convert all amounts into dollars to make them comparable.
- runtimeMinutes: transform to integers
- genres: transform to list of strings

Furthermore, there are some columns we are not interested in, which we can drop.
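The runtime conversion is not implemented in the cells below. The IMDb files mark missing values with `\N` (visible in the output above), which is presumably why the column is read as string. A minimal sketch of one way to do the conversion, on a toy series rather than the real column:

```python
import pandas as pd

# Toy column mimicking runtimeMinutes, where '\N' marks a missing value
runtime = pd.Series(['60', '50', r'\N', '95'], name='runtimeMinutes')

# Non-numeric entries such as '\N' become NaN;
# 'Int64' is pandas' nullable integer dtype, so missing values are allowed
runtime_int = pd.to_numeric(runtime, errors='coerce').astype('Int64')
print(runtime_int)
```

An alternative would be to pass `na_values='\\N'` to `pd.read_csv` when loading the files, so that the placeholder is treated as missing from the start.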

In [7]:
#extracting month
movies_data['releaseMonth']=movies_data['releaseDate'].map(lambda x: x.month if pd.notna(x) else None)

In [8]:
#first we need to change worldwideGross and budget to integers
import re

def to_int(x):
    """Return (amount, is_dollar) for a money string such as '$1,000,000'."""
    #missing values carry no amount and no currency
    if pd.isna(x):
        return np.nan, False
    digits = re.sub('[^0-9]', '', x)
    val = int(digits) if digits else np.nan
    return val, x.startswith('$')

#apply returns a Series of tuples, so we unpack it into two columns with zip
movies_data['budget'], movies_data['budget_in_dollar'] = zip(*movies_data['budget'].apply(to_int))
movies_data['worldwideGross'], movies_data['worldwideGross_in_dollar'] = zip(*movies_data['worldwideGross'].apply(to_int))
#now we can calculate the revenue percentage
#note: only meaningful where budget and worldwideGross are in the same currency
movies_data['revenue'] = movies_data['worldwideGross']/movies_data['budget']-1
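Until all amounts are converted into dollars, one cautious option is to compute the revenue percentage only for movies where both budget and worldwide gross are already given in dollars. The sketch below assumes the column names from the cell above and uses made-up numbers in a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame with the columns produced by the cell above (illustrative values only)
toy = pd.DataFrame({
    'budget': [100.0, 200.0, np.nan],
    'budget_in_dollar': [True, False, False],
    'worldwideGross': [150.0, 400.0, 300.0],
    'worldwideGross_in_dollar': [True, True, True],
})

# Only rows where both amounts are in dollars can be compared directly
comparable = toy['budget_in_dollar'] & toy['worldwideGross_in_dollar']

# Relative return (gross / budget - 1), left as NaN where the currencies differ
toy['revenue'] = (toy['worldwideGross'] / toy['budget'] - 1).where(comparable)
print(toy)
```

This keeps the non-dollar rows in the data set, so they can be filled in later once a currency conversion is available.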

In [None]:
#splitting genres into a list of strings
#Needs to be improved as genres are just ordered alphabetically, not by importance
movies_data.genres=movies_data.genres.str.split(',')
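Since the genres are ordered alphabetically, picking the first entry would not reflect importance. One possible alternative is to keep all genres and work with one row per movie and genre. A minimal sketch on a toy frame (the `tconst` values are made up):

```python
import pandas as pd

# Toy frame with comma-separated genres, as in title.basics
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3'],
    'genres': ['Crime,Drama', 'Western', 'Drama,Romance'],
})

toy['genres'] = toy['genres'].str.split(',')

# One row per (movie, genre) pair, so every genre of a movie is counted
per_genre = toy.explode('genres')
genre_counts = per_genre['genres'].value_counts()
print(genre_counts)
```

Note that a movie with several genres contributes to several groups in this layout, which has to be kept in mind when aggregating.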

In [None]:
with pd.option_context('display.max_columns', 50):
    display(movies_data.head())

### Actors Data

From the awards data of the actors we want to build a score for each actor. Wins and nominations in the top 10 movie awards according to https://www.therichest.com/most-popular/top-10-most-prestigious-movie-awards-in-the-world/ will be scored higher than wins and nominations in the other awards.

In [None]:
best_award=['Golden Globe','Oscar', 'Golden Lion','Grand Jury Prize','Golden Leopard','European Film Award', 'Filmfare Award','Golden Berlin Bear','BAFTA Film Award',"Palme d'Or"]

In [None]:
df_actors_awards['Important']=[False]*df_actors_awards.shape[0]

In [None]:
df_actors_awards.loc[df_actors_awards.category.isin(best_award),'Important']=True

In [None]:
df_actors_awards.head()
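The flag can then be turned into a score per actor. The sketch below is only one possible scheme: the weights in `BASE_POINTS` and `IMPORTANT_FACTOR` are hypothetical placeholders, not values we have settled on, and the toy records only mimic the shape of df_actors_awards.

```python
import pandas as pd

# Toy award records in the same shape as df_actors_awards (illustrative values only)
awards = pd.DataFrame({
    'nconst': ['nm1', 'nm1', 'nm1', 'nm2'],
    'w_n': ['Winner', 'Nominee', 'Winner', 'Nominee'],
    'Important': [True, True, False, False],
})

# Hypothetical weights: a win counts more than a nomination,
# and awards flagged as important count double
BASE_POINTS = {'Winner': 2, 'Nominee': 1}
IMPORTANT_FACTOR = 2

points = awards['w_n'].map(BASE_POINTS)
# keep the base points for ordinary awards, multiply them for important ones
points = points.where(~awards['Important'], points * IMPORTANT_FACTOR)

# Sum the points of all award records per actor
actor_score = points.groupby(awards['nconst']).sum()
print(actor_score)
```

Keeping the weights in two named constants makes it easy to try different weightings later.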

In the actors data set we want to have the gender as an additional feature.

In [None]:
#splitting professions
df_actors.primaryProfession=df_actors.primaryProfession.str.split(',')

In [None]:
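def m_f(x):
    #infer gender from the listed professions: 'actor' -> 'M', 'actress' -> 'F'
    if isinstance(x, list):
        if 'actor' in x:
            return 'M'
        elif 'actress' in x:
            return 'F'
    #no acting profession listed or missing value
    return None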

In [None]:
df_actors['gender']=df_actors.primaryProfession.apply(m_f)

In [None]:
df_actors.head()

## Data exploration

Now that we have the data sets cleaned we can do the initial data analysis.

#### Actors data

#### Movies data

In [None]:
plt.figure(figsize=(15,7))
chart=sns.barplot(x='releaseCountry',y='oscarWins',data=pd.DataFrame(movies_data[movies_data['oscarWins']>0].groupby('releaseCountry')['oscarWins'].sum().reset_index()),palette='colorblind')
chart.set_xticklabels(chart.get_xticklabels(),rotation=45)
plt.title('Number of Oscars by release country')
plt.xlabel('Release Country')
plt.ylabel('Number of Oscar Wins')
plt.show()

In [None]:
plt.figure(figsize=(12,7))
sns.barplot(x='releaseMonth', y='oscarWins', data=movies_data.groupby('releaseMonth')['oscarWins'].sum().reset_index(),palette='colorblind')
plt.ylabel('Number of Oscar wins')
plt.xlabel('Release Month')
plt.title('Barplot of oscar wins per release month')
plt.show()