# COMP 10020 Introduction to Programming 2
## Assignment 2 - Movie Moneyball

The rise in popularity of **data science** has led to its techniques being applied in some unexpected places, and to the release of some very interesting datasets. There are now very few areas in which data science is not making a difference.

One area where data is driving decision making is the movie business. Data is being used to understand the performance of movies and even make decisions about what movies to make. The goal of this assignment is to use data about movie releases from 1916 to 2017 to answer a series of questions. 

### Import Useful Packages
Import useful packages for data science.

In [None]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
%matplotlib inline  
import seaborn as sns

### Question 1: Import Data

Details of a large collection of movies released between 1916 and 2017 are contained in the files **movies.csv** and **casts.csv**. Each row within **movies.csv** describes a movie, using the following fields:

* **id**: A unique ID of the movie
* **title**: The title of the movie
* **tagline**: The tagline for the movie
* **homepage**: A URL for the homepage of the movie (may no longer exist)
* **release_date**: The release date of the movie as dd/mm/yyyy
* **genre**: A category indicating the main genre of the movie
* **budget**: The budget of the movie in US dollars
* **keywords**: A list of keywords describing the movie
* **original_language**: The original language of the movie (as a two-letter abbreviation, e.g. en for English)
* **revenue**: The revenue earned by the movie in US dollars
* **runtime**: The runtime of the movie in minutes
* **status**: The status of the movie (one of *Released*, *Rumored*, or *Post Production*)
* **vote_average**: The average rating for the movie (from 0 to 10)
* **vote_count**: The number of ratings that have been provided for the movie
* **director**: The director of the movie
       
Each row within **casts.csv** contains the id of a movie and the name of an actor in that movie. This file uses the following fields:

* **id**: The id of the movie
* **billing**: The billing of the actor (one of *cast_0*, *cast_1*, *cast_2*, or *cast_3*)
* **actor**: The name of the actor

Load the two datasets (**movies.csv** and **casts.csv**) into pandas data frames: **movies** and **casts**. Display the first five rows from each.
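
As a reminder of the general pattern, here is a minimal sketch of reading CSV text into a data frame and previewing it. The two rows and the name `toy` are made up for illustration and are not taken from **movies.csv**; an in-memory string stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for a CSV file (not real assignment data).
toy_csv = io.StringIO(
    "id,title,release_date\n"
    "1,Example Movie A,15/07/2001\n"
    "2,Example Movie B,03/12/1999\n"
)

# read_csv accepts a file path or a file-like object and returns a DataFrame.
toy = pd.read_csv(toy_csv)

# head(n) returns the first n rows (n defaults to 5).
print(toy.head())
```

Note that dates are not parsed by default, so a column such as `release_date` is read in as plain strings.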

In [None]:
# Write code here


### Question 2: Tidy

Define a function to extract the month part from a date given as dd/mm/yyyy.

In [None]:
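def date_to_month(d):
    # Anything that is not a string (e.g. a missing date stored as NaN) has no month
    if not isinstance(d, str):
        return np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
    else:
        date_parts = d.split('/')
        month = date_parts[1]  # dd/mm/yyyy: the month is the second part
        return int(month)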

Extract months from the release date for each movie to add a new column **release_month**. The pandas Series **apply** function used together with the **date_to_month** function defined above can be used for this.
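
To see how **apply** behaves on its own, including when a value is missing, here is a small sketch on made-up data. The helper `date_to_day` and the three dates are hypothetical and only illustrate the mechanism.

```python
import numpy as np
import pandas as pd

def date_to_day(d):
    """Hypothetical helper: return the day part of a dd/mm/yyyy string."""
    if not isinstance(d, str):
        return np.nan  # missing values are passed through as NaN
    return int(d.split('/')[0])

# Made-up dates, one of them missing.
dates = pd.Series(["15/07/2001", None, "03/12/1999"])

# apply calls the function once per element and collects the results in a new Series.
days = dates.apply(date_to_day)
print(days)
```

Because one result is NaN, pandas stores the whole result as floats rather than integers.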

In [None]:
movies['release_month'] = movies["release_date"].apply(date_to_month)
movies.head()

**a)** Define a function to extract the year part from a date given as dd/mm/yyyy.

In [None]:
def date_to_year(d):
    
    # Write code here


**b)** Extract the year from the release date for each movie to add a new column **release_year**. The pandas Series **apply** function used together with the **date_to_year** function defined above can be used for this.

In [None]:
# Write code here


### Question 3: Simple Analysis

Use simple data analysis to answer the following questions. 
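
The pandas Series methods **nunique** and **value_counts** are likely to be useful here. The sketch below shows them on a made-up data frame; the column names and values are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "colour": ["red", "blue", "red", "red"],
})

# Number of distinct values in a column.
n_distinct = toy["colour"].nunique()

# How often each value occurs, sorted from most to least frequent.
counts = toy["colour"].value_counts()

# Label of the most frequent value.
most_common = counts.idxmax()

print(n_distinct)
print(counts)
print(most_common)
```

Calling `counts.head(n)` keeps only the `n` most frequent values.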

**a)** How many directors have released movies included in the dataset?

In [None]:
# Write code here


**b)** Generate and print a table showing how many movies have been released in each *genre*.

In [None]:
# Write code here


**c)** How many movies have been released under the *Horror* genre?

In [None]:
# Write code here


**d)** In which month are movies most frequently released?

In [None]:
# Write code here


**e)** Who are the ten most prolific *directors* in the dataset? (We define the most *prolific* directors as those who have released the most movies.)

In [None]:
# Write code here


### Question 4: Deeper Analysis

Use slightly more advanced data analysis to answer the following questions.
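
Several of these questions combine row filtering with a per-group summary. The following is a rough sketch of that pattern on made-up data; `team`, `score` and `season` are hypothetical column names chosen for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "team": ["north", "south", "north", "south", "north"],
    "score": [10.0, 20.0, 30.0, 60.0, 50.0],
    "season": [1999, 2001, 1998, 2005, 2007],
})

# Average score per team.
mean_by_team = toy.groupby("team")["score"].mean()

# Keep only rows from season 2000 onwards, then total the score per team,
# sorted so the largest total comes first.
recent = toy[toy["season"] >= 2000]
total_by_team = recent.groupby("team")["score"].sum().sort_values(ascending=False)

print(mean_by_team)
print(total_by_team)
```

A grouped result like `mean_by_team` is itself a Series, so it can be passed straight to a plotting call such as `mean_by_team.plot(kind="bar")`.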

**a)** Draw an appropriate data visualisation that shows the number of movies released each year.

In [None]:
# Write code here


**b)** What is the average duration of a movie (in minutes)?

In [None]:
# Write code here


**c)** How many times has there been a movie described with the keyword '*spoof*'?

**Hint:** Experiment with the *str.contains* method from *pandas.Series*.
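
As a starting point for experimenting, here is a small sketch of *str.contains* on made-up strings. The Series `notes` and the search term are illustrative assumptions; the sketch also assumes that some entries may be missing, which is why `na=False` is passed.

```python
import pandas as pd

# Made-up descriptions, one of them missing.
notes = pd.Series(["time travel, robot", None, "robot uprising", "heist"])

# str.contains returns one boolean per element.
# na=False treats missing entries as "no match"; regex=False does a plain substring search.
has_robot = notes.str.contains("robot", na=False, regex=False)

# True counts as 1, so summing the booleans counts the matches.
print(has_robot.sum())
```

Keep in mind that this is substring matching, so a search for `robot` would also match a longer word such as `robotics`.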

In [None]:
# Write code here


**d)** Which actor has starred in the most movies in the dataset? 

In [None]:
# Write code here


**e)** Have movies by *Steven Spielberg* had higher revenues, on average, than movies by *Ridley Scott*?

**Bonus:** Can you plot a data visualisation to support this conclusion?

In [None]:
# Write code here


**f)** Which director has earned the highest cumulative revenue for movies in the dataset released in the year 2000 or later?

In [None]:
# Write code here


### Question 5: Merging Data

Answer questions that require merging the **movies** and **casts** datasets.
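
The general idea of a merge can be sketched on two made-up tables that share an `id` column. The frames `films` and `people`, and all of their values, are hypothetical and only illustrate how rows are matched.

```python
import pandas as pd

# Made-up tables for illustration only; both share the key column "id".
films = pd.DataFrame({"id": [1, 2, 3], "year": [2001, 2001, 1999]})
people = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "name": ["Person X", "Person Y", "Person X", "Person Z"],
})

# An inner merge keeps one row for every matching (film, person) pair.
merged = pd.merge(films, people, on="id", how="inner")

# After merging, columns from both tables can be used together.
in_2001 = merged[merged["year"] == 2001]
appearances = in_2001["name"].value_counts()

print(merged)
print(appearances)
```

Because a film can match several people, the merged table can have more rows than `films`. This matters when summing a per-film column after a merge.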

**a)** Which actor starred in the most movies in the year 2002?

In [None]:
# Write code here


**b)** Have movies in the dataset starring '*Will Smith*' earned more cumulative revenue than movies starring '*Chris Rock*', or vice versa? 

In [None]:
# Write code here
