# Digital Futures - Project 3 

## 1.1 Importing the data 

In [None]:
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import ast 
import datetime

In [None]:
movies = pd.read_csv('TMDB_movies.csv')
movies.head(5)

In [None]:
movies.dtypes

Converting the stringified lists of dictionaries into lists of names...

In [None]:
# *1

# Columns that hold stringified lists of dictionaries
dict_columns = ['genres', 'keywords', 'production_companies',
                'production_countries', 'spoken_languages']

for column in dict_columns:
    # Parse each string into a list of dictionaries...
    movies[column] = movies[column].apply(ast.literal_eval)
    # ...then keep only the 'name' value from each dictionary
    movies[column] = movies[column].apply(lambda items: [item['name'] for item in items])
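
The conversion above assumes every cell is a well-formed string. As a minimal sketch, a more defensive version could return an empty list for missing or malformed cells instead of raising an error. The helper name `extract_names` and the toy rows below are made up for illustration and are not part of the project.

```python
import ast

import pandas as pd


def extract_names(cell):
    """Return the 'name' values from a stringified list of dictionaries.

    Missing or malformed cells give an empty list instead of raising.
    """
    if not isinstance(cell, str):
        return []
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    return [item['name'] for item in parsed
            if isinstance(item, dict) and 'name' in item]


# Made-up rows standing in for a column such as 'genres'
toy = pd.DataFrame({
    'genres': [
        '[{"id": 1, "name": "Action"}, {"id": 2, "name": "Drama"}]',
        '[]',
        None,
        'not a list',
    ]
})

toy['genres'] = toy['genres'].apply(extract_names)
print(toy['genres'].tolist())
```

Whether silently returning an empty list is appropriate depends on the task; raising an error may be preferable when malformed cells should be investigated.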

In [None]:
movies.head()

Converting the release_date data type to a datetime format:

In [None]:
movies['release_date'] = pd.to_datetime(movies['release_date'], format='%Y-%m-%d')

In [None]:
movies.dtypes

Making a 'profit' column:

In [None]:
movies['profit'] = movies['revenue'] - movies['budget']

## 1.2 Null Handling

In [None]:
null_movies = movies.isnull() 
null_movies.sum() #total nulls = 3941

There are a large number of nulls in the 'homepage' and 'tagline' columns, with just 3, 2 and 1 in the 'overview', 'runtime' and 'release_date' columns respectively. 

In [None]:
movies.dropna(
    axis = 0, 
    how = 'any', 
    subset = [ 'overview','release_date'], 
    inplace = True   
)

On investigation, the nulls in the 'overview' and 'release_date' columns appear to fall in the same rows as the nulls in the 'runtime' column. It makes sense to remove these rows, as they account for only a small proportion of the total nulls.
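
One way to check that the nulls line up is to look for rows where 'runtime' is null but neither 'overview' nor 'release_date' is. The sketch below uses a small made-up dataframe rather than the TMDB file, so the values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: the null 'runtime' values fall in rows that also lack
# 'overview' or 'release_date'
toy = pd.DataFrame({
    'overview': ['a', None, 'c', 'd'],
    'release_date': ['2000-01-01', '2001-01-01', None, '2003-01-01'],
    'runtime': [100.0, np.nan, np.nan, 95.0],
})

runtime_null = toy['runtime'].isnull()
other_null = toy[['overview', 'release_date']].isnull().any(axis=1)

# Rows where 'runtime' is null but neither of the other two columns is
runtime_only = toy[runtime_null & ~other_null]
print(len(runtime_only))
```

A count of zero means that dropping on 'overview' and 'release_date' also removes every null in 'runtime'.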

In [None]:
def null_vals(dataframe):
    # Count the nulls in each column
    null_counts = dataframe.isnull().sum()
    total_cnt = len(dataframe)
    null_table = null_counts.to_frame('null')
    # Express each count as a percentage of the total number of rows
    null_table['percent'] = round((null_table['null'] / total_cnt) * 100, 3)

    return null_table.sort_values('percent', ascending=False)

null_vals(movies)

Now looking at the 'homepage' and 'tagline' columns.

These columns have a combined null count of 3927 following the removal of the previously affected rows. 

As there are too many nulls to remove the affected rows, and no numerical distribution from which to infer replacement values, removing the columns is the best option for this task.

In [None]:
movies.drop(columns = ['homepage', 'tagline'], inplace = True)
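
The same decision could also be expressed as a rule rather than by naming the columns. This is only a sketch on made-up data: the `threshold` value is hypothetical and was not used in this project.

```python
import numpy as np
import pandas as pd

# Made-up data; 'homepage' is mostly null, 'title' is complete
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'homepage': [None, None, 'http://example.com', None],
    'runtime': [90.0, np.nan, 120.0, 100.0],
})

threshold = 0.5  # hypothetical cut-off, not a value used in this project

# Fraction of nulls in each column
null_share = toy.isnull().mean()

# Columns whose null fraction exceeds the cut-off
to_drop = null_share[null_share > threshold].index.tolist()

trimmed = toy.drop(columns=to_drop)
print(to_drop)
```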

In [None]:
movies.head()

## 1.3 Interesting Insights

### 1.3.1 The correlation between vote average and decade of release

In [None]:
pd.set_option('display.max_rows', 20)

Creating a 'year' column 

In [None]:
movies['year'] = pd.DatetimeIndex(movies['release_date']).year
movies.head()

Making a function that creates a new column 'decade'

In [None]:
# *2

def what_decade(year):
    if year < 1920:
        return '1910s'
    elif year < 1930:
        return '1920s'
    elif year < 1940:
        return '1930s'
    elif year < 1950:
        return '1940s'
    elif year < 1960:
        return '1950s'
    elif year < 1970:
        return '1960s'
    elif year < 1980:
        return '1970s'
    elif year < 1990:
        return '1980s'
    elif year < 2000:
        return '1990s'
    elif year < 2010:
        return '2000s'
    else:
        return '2010s'
    
    
movies['decade'] = movies['year'].apply(what_decade)


movies[['release_date', 'decade']] 

movies.head()
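
As an alternative to the chain of conditions in what_decade, the decade label can be computed arithmetically. The sketch below uses made-up years and clamps the result to the same 1910s to 2010s range; it assumes the 'year' column holds whole numbers with no nulls.

```python
import pandas as pd

# Made-up release years
toy = pd.DataFrame({'year': [1916, 1927, 1999, 2000, 2016]})

# Floor each year to the start of its decade, clamp to the 1910s-2010s
# range used by what_decade, then format as a label such as '1990s'
decade_start = (toy['year'] // 10 * 10).clip(lower=1910, upper=2010)
toy['decade'] = decade_start.astype(int).astype(str) + 's'

print(toy['decade'].tolist())
```

This avoids adding a new branch each time a decade is added, at the cost of being slightly less explicit than the original function.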

Grouping by the decade and finding the average vote score...

In [None]:
year_stats = movies.groupby('decade')[['vote_average']].mean().reset_index()

year_stats

Producing a bar plot of decade against 'vote_average':

In [None]:
plt.figure(figsize=(14,6))

sns.barplot( data = year_stats,
             x = 'decade',
             y = 'vote_average',
             palette = 'winter' 
           )

plt.xlabel('Decade')
plt.ylabel('Average Rating')
plt.title('Average Rating by Decade')

plt.show()



There is a clear negative trend as time goes on, with the average rating steadily decreasing over the decades, from 7.4 in the 1910s to 5.85 in the 2010s.

I decided to dive deeper into this by looking at the number of votes by decade to see if this potentially had an effect on the average vote.

In [None]:
vote_stats = movies.groupby('decade')[['vote_count']].mean().reset_index()
vote_stats

Plotting the average number of votes per movie against decade:

In [None]:
plt.figure(figsize=(14,6))

sns.barplot( data = vote_stats,
             x = 'decade',
             y = 'vote_count',
             palette = 'winter' 
           )

plt.xlabel('Decade')
plt.ylabel('Number of Votes')
plt.title('Average Number of Votes by Decade')

plt.show()


There is a clear upward trend in the number of votes as time goes on. This plot almost mirrors the previous graph, as the decades with a higher average rating tend to have fewer votes than those with a lower average rating.

This may indicate that a higher number of votes has a negative effect on a movie's rating.
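
The decade averages only suggest this relationship. A more direct check would be to correlate 'vote_count' with 'vote_average' movie by movie. The sketch below shows the idea on made-up values, not on the TMDB data, so it says nothing about the real result.

```python
import pandas as pd

# Made-up movie-level data, not taken from the TMDB file
toy = pd.DataFrame({
    'vote_count':   [10, 25, 40, 300, 1200, 5000],
    'vote_average': [7.9, 7.5, 7.6, 6.4, 6.1, 6.6],
})

# Pearson correlation measures the linear relationship
pearson = toy['vote_count'].corr(toy['vote_average'])

# Spearman correlation is the Pearson correlation of the ranks, which is
# less sensitive to a few movies with very large vote counts
spearman = toy['vote_count'].rank().corr(toy['vote_average'].rank())

print(round(pearson, 2), round(spearman, 2))
```

Even a negative coefficient would not show that more votes cause lower ratings, since the decade of release could be driving both.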

### 1.3.2 The correlation between budget and revenue

It makes sense that budget and revenue will be correlated; however, interesting inferences can be made when grouping these variables by different columns.

Investigating the data, I noticed the films that hadn't been released had a budget and revenue of zero. 

Creating a new dataframe, 'movies_released', solely detailing the records of released movies:

In [None]:
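# .copy() makes an independent dataframe, so the later in-place edits do not act on a slice of 'movies'
movies_released = movies[movies['status']=='Released'].copy()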
movies_released.status.unique()

I found there remained 4 rows with a revenue of 0 and 3 with a budget of 0.  

As there was some overlap between the rows with zeros in either column, I decided to drop these rows as they might distort later calculations.

In [None]:
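## converting the 0's in 'budget' and 'revenue' to nulls, leaving zeros in other columns untouched
movies_released[['budget', 'revenue']] = movies_released[['budget', 'revenue']].replace(0, np.nan)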

In [None]:
## dropping the new nulls

movies_released.dropna( 
    axis = 0, 
    how = 'any', 
    subset = [ 'budget', 'revenue'], 
    inplace = True 
)

movies_released[['status', 'budget','revenue']]


Firstly, taking a look at the correlation across the released movies dataframe as a whole.

In [None]:
budget_revenue = movies_released[['revenue','budget']]
budget_revenue

Plotting a correlation heatmap:

In [None]:
plt.figure(figsize = (5,5))

sns.heatmap(budget_revenue.corr(),
            annot = True)

plt.title('The Correlation Between Revenue and Budget')
plt.show()

There is a strong positive correlation between the two variables. This value differs, however, when the data is grouped by other variables.

Firstly, we'll take a look at runtime...

In [None]:
# using the released movies dataframe so the result is comparable with the heatmap above
length = movies_released.groupby('runtime')[['budget', 'revenue']].mean()
length.head()

In [None]:
plt.figure(figsize = (5,5))

sns.heatmap(length.corr(),
            annot = True,
           )

plt.title('Grouped by Runtime')
plt.show()

There is a slight increase in the correlation between budget and revenue when the data is grouped by the runtime.

In [None]:
language = movies_released.groupby('original_language')[['budget', 'revenue']].\
        mean().sort_values(by = 'revenue', ascending = False)
language

In [None]:
sns.heatmap(language.corr(),
            annot = True,
           )

plt.title('Grouped by language')
plt.show()

Again, there is an increase in the correlation between budget and revenue when grouped by the original language. 

I found this insight interesting as it showed that the grouping variable affects the measured correlation between revenue and budget. This is particularly true of the original language of the movie: when the data is grouped by this variable, the correlation coefficient moves closer to one, indicating a very strong positive correlation between the two variables. Part of this increase is likely due to the averaging itself, which smooths out the variation between individual movies within each group.
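
The effect of averaging can be seen on a small example. The sketch below uses made-up budgets and revenues in which the movies within each language are noisy but the language means line up closely, so the grouped correlation comes out higher than the movie-level one.

```python
import pandas as pd

# Made-up data: three movies in each of three languages
toy = pd.DataFrame({
    'original_language': ['en'] * 3 + ['fr'] * 3 + ['ja'] * 3,
    'budget':  [10, 20, 30, 2, 4, 6, 5, 10, 15],
    'revenue': [40, 20, 60, 6, 2, 4, 20, 5, 20],
})

# Correlation across individual movies
raw_corr = toy['budget'].corr(toy['revenue'])

# Correlation across the per-language means
grouped = toy.groupby('original_language')[['budget', 'revenue']].mean()
grouped_corr = grouped['budget'].corr(grouped['revenue'])

print(round(raw_corr, 2), round(grouped_corr, 2))
```

The grouped coefficient describes how language-level averages relate to each other, which is a different question from how budget and revenue relate for an individual movie.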

## 1.4 A Deeper Dive Into Individual Movies

### 1.4.1 Higher Ground

I thought I would take a closer look at the unreleased movies:

In [None]:
movies_unreleased = movies[movies['status'] != 'Released']
movies_unreleased.status.unique()
movies_unreleased[['budget', 'release_date', 'original_title', 'revenue', 
                       'popularity', 'vote_average', 'profit', 'status']]

The movie 'Higher Ground' stood out to me as noteworthy, as it was the only unreleased film ('post-production' or 'rumored') that had revenue data.

In [None]:
Higher_Ground = movies_unreleased[movies_unreleased['title'] == 'Higher Ground']
Higher_Ground

However, when comparing the revenue to the budget, it appeared to have made a loss, with a profit of -1,158,267.

The average revenue over the entire dataset was 10 times larger than that of 'Higher Ground', and the mean 'vote_average' was close to one rating point higher than that of the movie.

In [None]:
movies_released[['budget', 'revenue', 'vote_average', 'popularity','runtime','profit']].describe()

Comparing the film to other unreleased movies... 

In [None]:
Not_Higher_Ground = movies_unreleased[movies_unreleased['title'] != 'Higher Ground']
Not_Higher_Ground[['budget', 'revenue', 'vote_average', 'popularity', 'runtime','profit']].describe()

It sat just below the 'vote_average' mean and had a runtime close to the maximum.

Another interesting point was that its popularity rating was marginally higher than that of the next most popular 'unreleased' movie.

In contrast, when compared with the released movies dataset, its popularity rating was far lower than the average.

In conclusion, compared with the rest of the movies in the original dataset, 'Higher Ground' was an unsuccessful movie. It made an overall loss while having a reasonably low popularity and average vote rating. However, compared with the other unreleased movies, it was marginally more popular and had a profit greater than the mean, though this may be because just two other movies in the smaller dataframe have profit data.

### 1.4.2 Pirates of the Caribbean: On Stranger Tides

Another interesting film was the one with the largest budget, 'Pirates of the Caribbean: On Stranger Tides'.

In [None]:
movies[['title', 'budget']].sort_values(by = 'budget', ascending = False)

Notably, it had a budget 80 million greater than the second most expensive film, another film in the Pirates of the Caribbean franchise, 'At World's End'.

In [None]:
Stranger_tides = movies[movies['title'] == 'Pirates of the Caribbean: On Stranger Tides']

Looking at some of the key columns:

In [None]:
Stranger_tides[['budget', 'revenue', 'vote_average', 'popularity','profit','vote_count']]

Notably, there is a large disparity between budget and revenue, with the revenue being around 660 million greater than the budget despite the seemingly low 'vote_average'.

How does the film stand against the general dataset measures?

In [None]:
Not_stranger_tides = movies_released[movies_released['title'] != 'Pirates of the Caribbean: On Stranger Tides']

In [None]:
Not_stranger_tides[['budget', 'revenue', 'vote_average', 'popularity','profit', 'vote_count']].describe()

The popularity of the film is far greater than the mean of the other released movies and, despite the seemingly low rating, it appears to be slightly higher rated than the general average.

Furthermore, the film's profit was far greater than the average, placing the movie in the top 25%.

In conclusion, 'Pirates of the Caribbean: On Stranger Tides' was a highly successful movie when judged primarily on popularity and profit. The film's average rating was lower than might be expected; however, this may in part be due to its vote count being considerably greater than the mean for the other released movies.
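
A "top 25%" placement can be checked directly against the 75th percentile rather than read off the describe() table. The sketch below uses made-up profit figures and a hypothetical `film_profit`; none of the numbers come from the TMDB data.

```python
import pandas as pd

# Made-up profit figures for the comparison group
other_profits = pd.Series([-5, 0, 10, 20, 35, 50, 80, 120])
film_profit = 100  # hypothetical value for the film of interest

# 75th percentile of the comparison group: values above it are in the top 25%
q75 = other_profits.quantile(0.75)

# Share of comparison movies with a profit at or below the film's
share_at_or_below = (other_profits <= film_profit).mean()

print(film_profit > q75, share_at_or_below)
```

The share is more informative than the quartile alone, since it distinguishes a film just inside the top quarter from one near the very top.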

## 2. References 

*1 - Georgia Dias - helped me convert the table's dictionaries into a list format

*2 - Rowan Jarvis - helped with the troubleshooting when it came to writing my function