This code performs some exploratory data analysis (EDA) on the IMDB movies dataset, as practice in data visualisation. First, the relevant modules are imported:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerBase
from matplotlib.text import Text
from matplotlib.legend import Legend


The dataframe is first loaded into data. Calling .head() gives an idea of what the data looks like. There are over 10,000 rows in the dataframe, which makes visualising each entry individually impractical. Going forward, it will be important to group the data appropriately and use averages where necessary.

In [None]:
data = pd.read_csv('imdb_movies.csv')

print(data.head())
print(len(data))

There are 12 columns of data. First, the distribution of different languages will be considered. 

In [None]:
language = data.groupby(['orig_lang'])['orig_lang'].count()

print(len(language))

There are 54 entries, which is too many to visualise clearly. To make this easier, we can group the less common languages (those with 50 titles or fewer) as 'Other':

In [None]:
readable_language = {'Other': 0}

# Keep languages with more than 50 titles; add the rest to 'Other'
for lang, count in language.items():
    if count > 50:
        readable_language[lang] = count
    else:
        readable_language['Other'] += count
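
The same grouping can also be expressed without an explicit loop. The following is a minimal sketch on made-up data: the `sample` series and the `threshold` of 2 are assumptions for illustration only, standing in for `data['orig_lang']` and the cut-off of 50 used above.

```python
import pandas as pd

# Hypothetical sample standing in for data['orig_lang']
sample = pd.Series(['English'] * 6 + ['French'] * 3 + ['Korean', 'Thai'])

threshold = 2  # hypothetical cut-off; the cell above uses 50
counts = sample.value_counts()

# Keep languages above the threshold and sum the rest into 'Other'
grouped = counts[counts > threshold].to_dict()
grouped['Other'] = int(counts[counts <= threshold].sum())

print(grouped)
```

Using a boolean mask on the counts keeps the threshold in one place, which makes it easier to experiment with different cut-offs.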

We can then visualize these as a pie chart:

In [None]:
plt.pie(readable_language.values(), labels=readable_language.keys(), autopct="%0.2f%%")
plt.title('Percentage of language for each title in IMDB database')
plt.show()

As English is the language of the overwhelming majority of titles with 72.87%, it might be helpful to view the data in terms of non-English titles. 

In [None]:
readable_language.pop(' English')

plt.pie(readable_language.values(), labels=readable_language.keys(), autopct="%0.2f%%")
plt.title('Percentage of non-English titles in IMDB database')
plt.show()

We can view the average scores for the titles based on their language. We first need to calculate the mean for each language:

In [None]:
mean_scores = data.groupby(['orig_lang'])['score'].mean()

To group the scores in the same way, we compare each language to the keys of the readable_language dictionary. A language that is not among the keys is one that fell into the 'Other' class above. The mean scores of those languages are collected and averaged to give a single figure for the 'Other' group.

In [None]:
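lang_scores = {}
others = []

# Note: ' English' was popped from readable_language above, so its mean
# is pooled into 'Other' here along with the less common languages.
for lang, mean_score in mean_scores.items():
    if lang in readable_language:
        lang_scores[lang] = mean_score
    else:
        others.append(mean_score)

# Unweighted mean of the per-language means, not a mean over individual titles
lang_scores['Other'] = sum(others) / len(others)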
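
Averaging the per-language means gives every language equal weight, however many titles it has. If an average over the individual titles is wanted instead, one option is to relabel the minor languages before grouping. The sketch below uses made-up data: `sample` and the `major` set are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Hypothetical data standing in for the 'orig_lang' and 'score' columns
sample = pd.DataFrame({
    'orig_lang': ['English', 'English', 'English', 'Thai', 'Thai', 'Thai', 'Czech'],
    'score':     [70, 60, 80, 50, 50, 50, 90],
})
major = {'English'}  # hypothetical set of languages kept as their own group

# Relabel every minor language as 'Other' before averaging, so each title counts once
group = sample['orig_lang'].where(sample['orig_lang'].isin(major), 'Other')
weighted = sample.groupby(group)['score'].mean()

# For comparison: the unweighted mean of the per-language means
per_lang = sample.groupby('orig_lang')['score'].mean()
unweighted_other = per_lang.drop(labels=list(major)).mean()

print(weighted)
print(unweighted_other)
```

In this toy example the two approaches give different values for 'Other', because the single Czech title counts as much as the three Thai titles in the unweighted version.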

This data can then be visualized as a bar chart:

In [None]:
fig = plt.figure()

plt.bar(lang_scores.keys(), lang_scores.values())
plt.title('Average rating by language group')
plt.xlabel('Title language')
plt.ylabel('Average user score')
plt.show()

We can use a scatterplot to see the relationship between the average budget and the average score for each country. The values in 'country' are stored as ISO 3166-1 alpha-2 country codes. To provide a legend that identifies each country on the plot, we can build a dictionary that matches each code with its country name. I load a .csv file that contains a list of the countries and their codes. Two codes raised errors because they were not in the list, so I looked them up and add them through the error handling in the loop.

In [None]:
countries = data.groupby(['country'])['country'].count()
country_code_dict = {}

# country_codes is the DataFrame read from the country-code .csv mentioned above
# (the loading step is not shown here)
for entry in countries.index:
    try:
        matches = country_codes.loc[country_codes['Alpha-2 code'] == entry,
                                    'English short name lower case']
        country_code_dict[matches.iloc[0]] = entry
    except IndexError:
        # Codes missing from the list are named by hand
        if entry == 'SU':
            country_code_dict['USSR'] = entry
        else:
            country_code_dict['International'] = entry

country_code_df = pd.DataFrame(list(country_code_dict.items()), columns=['Countries', 'Codes'])
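
An alternative to catching the IndexError is to build a plain code-to-name dictionary once and look each code up with a default. This is only a sketch under assumptions: `country_codes_demo`, `manual_names` and the made-up code 'XX' are hypothetical, and only the column names and the 'SU' entry come from the cell above.

```python
import pandas as pd

# Hypothetical lookup table; the column names mirror those used above
country_codes_demo = pd.DataFrame({
    'English short name lower case': ['France', 'Japan'],
    'Alpha-2 code': ['FR', 'JP'],
})
# Codes missing from the table are added by hand
manual_names = {'SU': 'USSR'}

code_to_name = dict(zip(country_codes_demo['Alpha-2 code'],
                        country_codes_demo['English short name lower case']))
code_to_name.update(manual_names)

codes_in_data = ['FR', 'JP', 'SU', 'XX']  # 'XX' is a made-up unknown code
names = {code: code_to_name.get(code, 'Unknown') for code in codes_in_data}

print(names)
```

Keeping the manual additions in their own dictionary makes it clear which names came from the file and which were looked up separately.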

I wanted to add a legend to the scatterplot to improve readability by showing the country code next to the country name. As the legend only works with handles that are artists (i.e. not strings), I looked up code on Stack Overflow to help turn the country-code string into a handle for the legend:

In [None]:
class TextHandlerB(HandlerBase):
    def create_artists(self, legend, text, xdescent, ydescent,
                       width, height, fontsize, trans):
        tx = Text(width / 2., height / 2., text, fontsize=fontsize,
                  ha="center", va="center", fontweight="bold")
        return [tx]

Legend.update_default_handler_map({str : TextHandlerB()})

The x and y values for the scatterplot (mean budget and mean score per country) are stored in the variables countries_b and countries_s:

In [None]:
countries_b = data.groupby(['country'])['budget_x'].mean()
countries_s = data.groupby(['country'])['score'].mean()

The mean scores and budgets are plotted for each country, and annotate is used to label each point with its country code for identification:

In [None]:
plt.scatter(countries_b, countries_s)
plt.title('Country mean budget vs score')

# Look values up by country code rather than by integer position,
# which is deprecated for a Series with a string index
for code in countries_b.index:
    plt.annotate(code, (countries_b[code], countries_s[code]))

The handles argument passes the country codes to the legend as handles. As there are 60 countries, the legend needs to be positioned outside the plot area so that it stays readable and does not overlap any of the points.

In [None]:
plt.legend(handles=list(country_code_df['Codes']), labels=list(country_code_df['Countries']), ncols=10, loc='upper center', bbox_to_anchor=(0.5, -0.05),
          fancybox=True, shadow=True)
plt.xlabel('Mean budgets')
plt.ylabel('Mean user score')
plt.subplots_adjust(bottom = 0.18, top = 0.936)
plt.show()

A boxplot shows how the scores are distributed, including where the lowest and highest quarters of the scores fall. To make the plot easier to read accurately, I build a list of values from 0 to 98 in steps of 2, save it to x, and set it as the .xticks() on the plot:

In [None]:
# Tick positions from 0 to 98 in steps of 2
x = list(range(0, 100, 2))


plt.boxplot(data['score'], vert=False)
plt.title('Boxplot showing the distribution of scores in the database')
plt.ylabel('Plot showing the range of scores')
plt.xlabel('Range of scores from 0 - 100')
plt.xticks(x)
plt.margins(x=0, y=0)
plt.show()

From this, we can see that the lowest 25% of scores are 59 and below, and the highest 25% are 71 and above. We can filter for these and save them in variables:

In [None]:
lower_scores = data[data['score'] < 60]
higher_scores = data[data['score'] > 70]
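
The cut-offs read from the boxplot can also be checked numerically with quantile(). The sketch below uses a made-up `scores` series as a stand-in for `data['score']`, so the values it produces are illustrative only.

```python
import pandas as pd

# Hypothetical scores standing in for data['score']
scores = pd.Series([40, 55, 59, 62, 65, 68, 70, 71, 75, 90])

# First and third quartiles (linear interpolation is the pandas default)
q1, q3 = scores.quantile([0.25, 0.75])

# Filter the bottom and top quarters using the computed cut-offs
lower = scores[scores <= q1]
higher = scores[scores >= q3]

print(q1, q3)
print(len(lower), len(higher))
```

Computing the quartiles directly avoids having to read values off the axis, and the filters update automatically if the data changes.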

We can use boxplots to see how these titles have done in terms of revenue:

In [None]:
plt.boxplot([lower_scores['revenue'], higher_scores['revenue']], labels=['Lower scores', 'Higher scores'], vert=False, notch=True, patch_artist=True, 
            boxprops = dict(linestyle='-', linewidth=1, color='k', facecolor='#d0f3f7'), 
            medianprops= dict(linestyle='-', linewidth=2, color='r'))


plt.ylabel('Range of scores')
plt.xlabel('Revenue of the titles')
plt.title('The revenue made for the titles in the lowest and highest score quartiles')

plt.show()

We can see that both plots are right-skewed, and that revenue for the higher scores is greater at Q3 and above. The outliers for the higher scores also tend to sit at higher levels of revenue. The boxplot would therefore suggest a relationship between higher scores and higher revenue. However, the Q1 and median (Q2) figures for the lower and higher scores are relatively similar. This is useful to note if we wanted to build a model for predicting revenue, as it shows that score would not be the sole determining factor.
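
The quartile comparison described above can be tabulated as well as plotted. This is a minimal sketch on made-up numbers: `sample` is a hypothetical stand-in for the 'score' and 'revenue' columns, and the same < 60 and > 70 filters are reused.

```python
import pandas as pd

# Hypothetical data standing in for the 'score' and 'revenue' columns
sample = pd.DataFrame({
    'score':   [50, 55, 58, 72, 75, 80],
    'revenue': [1.0, 2.0, 9.0, 1.0, 3.0, 20.0],
})

lower = sample[sample['score'] < 60]['revenue']
higher = sample[sample['score'] > 70]['revenue']

# Q1, median (Q2) and Q3 of revenue for each score group, side by side
summary = pd.DataFrame({
    'Lower scores': lower.quantile([0.25, 0.5, 0.75]),
    'Higher scores': higher.quantile([0.25, 0.5, 0.75]),
})

print(summary)
```

A table like this puts the Q1, Q2 and Q3 figures next to each other, which makes it easier to see where the two groups are similar and where they diverge.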