# Wikipedia Movie Plots EDA

In this kernel we perform an exploratory data analysis of the Wikipedia Movie Plots dataset: https://www.kaggle.com/jrobischon/wikipedia-movie-plots

The first step is to import the necessary libraries. We start with the basic ones and add any later imports to this same cell, so that all the imports stay in one place. Then we provide the file path and read the data.

In [None]:
import pandas as pd
import numpy as np

pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
print("Setup Complete")

In [None]:
# Specify the path of the CSV file to read
my_filepath = "../input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv"

# Read the file into a variable my_data
my_data = pd.read_csv(my_filepath)

## Context
Plot summary descriptions scraped from Wikipedia.

## Content
The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:

* Release Year - Year in which the movie was released
* Title - Movie title
* Origin/Ethnicity - Origin of movie (i.e. American, Bollywood, Tamil, etc.)
* Director - Director(s)
* Cast - Main actors and actresses
* Genre - Movie Genre(s)
* Wiki Page - URL of the Wikipedia page from which the plot description was scraped
* Plot - Long form description of movie plot (WARNING: May contain spoilers!!!)

Let's have a look at the first few rows of the dataset.

In [None]:
my_data.head()

As we can see, most of the data is categorical, so we will need to take that into consideration when choosing plots for this dataset.

Now let's get the basic information about this dataset.

In [None]:
my_data.info()

The dataset is fairly large, containing about 35k observations. This is an advantage if we later want to build a machine learning model on this data. There are 8 columns/variables describing the data. Only the "Release Year" column is numerical; the rest are categorical.

# Visualize the data
### Movie industry trend

First, let's see the overall trend in the movie production industry over a century.

In [None]:
# Check out the overall trend in movie releases over the years around the world 
plt.figure(figsize=(10,6))
# distplot is deprecated in recent seaborn releases; histplot draws a plain histogram
sns.histplot(x=my_data["Release Year"])
plt.title("Number of movies released around the world \n over the years", loc="center")

Looks like it is growing almost exponentially! Let's see who is driving this growth.
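One rough way to check the "almost exponential" impression is to count releases per decade and look at the ratio between consecutive decades: ratios that stay roughly constant and above 1 would be consistent with exponential growth. The sketch below runs on a small made-up list of years. `decade_counts` and `growth_ratios` are hypothetical helper names, not part of the original kernel.

```python
from collections import Counter

def decade_counts(years):
    """Count how many release years fall into each decade (e.g. 1994 -> 1990)."""
    counts = Counter((int(year) // 10) * 10 for year in years)
    return dict(sorted(counts.items()))

def growth_ratios(counts):
    """Ratio of each decade's count to the count of the previous decade present."""
    decades = list(counts)
    return {
        later: counts[later] / counts[earlier]
        for earlier, later in zip(decades, decades[1:])
    }

# Made-up sample years, for illustration only
sample_years = [1901, 1912, 1915, 1921, 1923, 1925, 1927]
counts = decade_counts(sample_years)
ratios = growth_ratios(counts)
print(counts)
print(ratios)
```

With the real data, `decade_counts(my_data["Release Year"])` should work the same way. Keep in mind that the most recent decade in the dataset is likely incomplete, so its ratio would understate the growth.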

# World leaders

In [None]:
my_data.rename(columns={"Origin/Ethnicity":"Origin"}, inplace=True)

# How many Origins are there in the dataset? 
len(my_data["Origin"].unique())

In [None]:
# catplot is a figure-level function: it creates its own figure, sized via height and aspect
sns.catplot(x="Origin", kind="count", data=my_data, height=5, aspect=2)
plt.xticks(rotation=45, 
    horizontalalignment='right')
plt.title("Total number of movies released per ethnicity over the years (1900-2020)", fontsize=15)
plt.xlabel("")
plt.ylabel("")

Out of the 24 origins represented in the dataframe, **the US** is the undisputed leader.

It is interesting to note that quite a few of the ethnicities belong to India. To make further analysis easier, let's add a new column to the dataframe containing the countries to which these ethnicities refer.

In [None]:
# Keys must match the Origin labels exactly; any label missing here becomes NaN after .map()
equiv_dict = {"American":"The US", "Australian":"Australia", "Bangladeshi":"Bangladesh", 
              "British":"Great Britain", "Canadian":"Canada", "Chinese":"China", 
              "Egyptian":"Egypt", "Hong Kong":"Hong Kong", "Filipino":"The Philippines", 
              "Assamese":"India", "Bengali":"India", "Bollywood":"India", "Kannada":"India", 
              "Malayalam":"India", "Marathi":"India", "Punjabi":"India", "Tamil":"India", 
              "Telugu":"India", "Japanese":"Japan", "Malaysian":"Malaysia", "Maldivian":"Maldives", 
              "Russian":"Russia", "South_Korean":"South_Korea","Turkish":"Turkey"}
my_data["Country"] = my_data["Origin"].map(equiv_dict)
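Because `.map()` silently returns NaN for any label that has no dictionary entry, it can be worth checking for unmapped origins right after this step. The sketch below illustrates the idea on toy data. The `origins` and `mapping` objects are small stand-ins for `my_data["Origin"]` and `equiv_dict`, and the "Unlisted" label is invented for the example.

```python
import pandas as pd

# Toy stand-ins for my_data["Origin"] and equiv_dict (illustration only)
origins = pd.Series(["American", "Tamil", "British", "Tamil", "Unlisted"], name="Origin")
mapping = {"American": "The US", "Tamil": "India", "British": "Great Britain"}

countries = origins.map(mapping)

# Labels with no dictionary entry come back as NaN after .map()
unmapped = sorted(origins[countries.isna()].unique())
print(unmapped)
```

On the real dataframe the equivalent check would be `my_data.loc[my_data["Country"].isna(), "Origin"].unique()`. An empty result means every origin was mapped to a country.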

In [None]:
# catplot is a figure-level function: it creates its own figure, sized via height and aspect
sns.catplot(x="Country", kind="count", data=my_data, height=5, aspect=2)
plt.xticks(rotation=45, 
    horizontalalignment='right')
plt.title("Total number of movies released in each country over the years (1900-2020)", fontsize=15)
plt.xlabel("")
plt.ylabel("")

This way it is much easier to analyse the data, as there are fewer categories.

Now we can see that second place belongs to **India**, which produced about half as many movies over the century as the US.
It is followed by **Great Britain** and **Japan**, both with much lower movie production than the 2 leaders.
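The ranking read off the chart can also be checked numerically by comparing each country's total with the leader's total. This is a minimal sketch on made-up data. The `country` series stands in for `my_data["Country"]`, and the counts are invented.

```python
import pandas as pd

# Toy stand-in for my_data["Country"]; the counts are made up
country = pd.Series(["The US"] * 4 + ["India"] * 2 + ["Japan"])

# value_counts() sorts in descending order, so the first entry is the leader
totals = country.value_counts()
share_vs_leader = totals / totals.iloc[0]
print(share_vs_leader.round(2))
```

Applied to the real column, a value near 0.5 for India would support the "about half as many" reading of the plot.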

In [None]:
# Group the data by the "Country" and "Release Year" columns 
# to visualize the periods when movie production was most intensive in each country.
by_country_by_year = my_data.groupby(["Country","Release Year"]).size().unstack()

plt.figure(figsize=(14,10))
g = sns.heatmap(
    by_country_by_year, 
    #square=True, # make cells square
    cbar_kws={'fraction' : 0.02}, # shrink colour bar
    cmap='OrRd', # use orange/red colour map
    linewidth=1 # space between cells
)

We can see that all 4 leaders started producing movies much earlier than most other countries. This partly explains the big difference in the numbers.

# The US

If we look at **the US numbers**, we see that the movie industry developed progressively until the 1960s. Over the next 20 years there was a decline, and then it started to grow again.
This decline is explained in this article: https://en.wikipedia.org/wiki/New_Hollywood

In short, the advent of television and a markedly changed audience (a younger generation) put an end to "Old Hollywood", while "New Hollywood" was only starting to develop.

Another reason for the much faster growth of the American movie industry over the past 20 years could be the beginning of the digital age, which made making movies cheaper and much faster.

Now let's take a closer look at the Indian movie industry.

In [None]:
india = my_data[["Country", "Release Year"]].query('Country == "India" ').groupby("Release Year").size()

plt.figure(figsize=(10,5))
plt.title("Movie production industry growth in India")
plt.ylabel("Number of movies")
sns.lineplot(data=india)

# India

We can clearly see that **the Indian movie production industry** was growing steadily until the 1980s. In the late 1980s, Hindi cinema experienced a period of stagnation due to increasing violence, decline in musical melodic quality, and rise in video piracy, leading to middle-class family audiences abandoning theatres.

The turning point came with the musical romance Chandni (1989). It was instrumental in ending the era of violent action films in Indian Cinema and rejuvenating the romantic musical genre. It also set a new template for Bollywood musical romance films that defined Hindi cinema in the coming years.

Reference: https://en.wikipedia.org/wiki/Cinema_of_India#New_Bollywood_(1990s%E2%80%93present)

Since 2000 it has been growing at a much higher rate. This could be partly connected to India's growing population:
http://statisticstimes.com/demographics/population-of-india.php

The digital age has probably also had an effect on the speed of movie production.

# Movie genres

It would be interesting to compare what kinds of genres are the most popular in different countries.

First, let's visualize the data in the "Plot" column using word clouds for the 2 leaders: the US and India.

### The US

In [None]:
# American word cloud

# Generate a word cloud image
usa = " ".join(plot for plot in my_data[my_data["Country"]=="The US"].Plot)
d = '../input/flags-pics2/'
usa_mask = np.array(Image.open(d + 'american-flag-1399556531Ci4.jpg'))
stopwords=set(STOPWORDS)
stopwords.update(["tell",'tells',"take","one","two","see","will","now"])
wordcloud_usa = WordCloud(stopwords=stopwords, background_color="white", mode="RGBA", max_words=1000, mask=usa_mask).generate(usa)

# create coloring from image
image_colors = ImageColorGenerator(usa_mask)
plt.figure(figsize=[10,10])
plt.imshow(wordcloud_usa.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis("off")

plt.show()

The most common words used in the plot column for American movies are:

* find
* leave
* return
* father
* friend
* house
* kill
* meet
* help

This suggests that genres like "mystery", "drama" and "thriller", and perhaps some "romance" and "melodrama", tend to be the most frequent ones.

### India

In [None]:
# Indian word cloud

# Generate a word cloud image
india = " ".join(plot for plot in my_data[my_data["Country"]=="India"].Plot)
d = '../input/flags-pics2/'
india_mask = np.array(Image.open(d + 'india-flag.jpg'))
stopwords=set(STOPWORDS)
stopwords.update(["tell",'tells',"take","one","two","see","will","now","meanwhile","give","ask"])
wordcloud_india = WordCloud(stopwords=stopwords, background_color="white", mode="RGBA", max_words=1000, mask=india_mask).generate(india)

# create coloring from image
image_colors = ImageColorGenerator(india_mask)
plt.figure(figsize=[10,10])
plt.imshow(wordcloud_india.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis("off")

plt.show()

The most common words used in the plot column for Indian movies are:

* father
* mother
* friend
* son
* family
* house
* love
* kill
* life
* find

From this quick analysis it seems that family is an important value in India (which is common knowledge about Indian culture), so it is logical that many movies revolve around this topic. Other important words include "kill", "life" and "find". From that we can guess that genres like "detective", "thriller", "crime" and "drama" are going to be the most popular among Indian audiences.

Now let's see which genres are actually the most popular in this dataset. We will perform this analysis for the 2 leaders, the US and India, so that we can compare the results with the predictions drawn from the word clouds. As there are 2265 genres in the dataset, we will first identify the 20 most common ones.

In [None]:
# Take the 20 most frequent genres (not simply the first 20 unique values)
genre_counts = my_data.Genre.value_counts()
pop_genres = list(genre_counts.index[:20])

plt.figure(figsize=(12,6))

sns.countplot(x=my_data.Genre, order=pop_genres, palette=sns.color_palette("Pastel1", 20))
plt.title('Most frequent Genre types',fontsize=16)
plt.ylabel('Number of movies', fontsize=12)
plt.xlabel('Genre', fontsize=12)
plt.xticks(size=12,rotation=60)
plt.yticks(size=12)
sns.despine(bottom=True, left=True)
plt.show()

The most frequent genre is "unknown", probably due to data collection issues. Unfortunately, this subset of the dataframe is not very useful for our purpose, so we can disregard it for now. However, if we later build a machine learning model that predicts the genre of a movie, these rows could be used to try out the final model on some real data.

Now, let's make a heatmap that shows which movie genre of the top 20 genres is the most popular in different countries.

p.s. As the genre data is quite messy, it should be cleaned/preprocessed before creating an ML model: we need to identify all the existing genre categories and label each row of the dataset with them.
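The cleaning step mentioned above could start by splitting each raw genre string into individual lower-cased labels. The sketch below is only one possible approach. The separator list is an assumption that should be adjusted after inspecting the real values, and `split_genres` is a hypothetical helper name.

```python
import re
from collections import Counter

# Separators assumed to appear between genre labels; adjust after inspecting the data
_SEPARATORS = re.compile(r"\s*(?:,|/|;|&|\band\b)\s*")

def split_genres(raw):
    """Turn one raw Genre string into a list of lower-cased labels."""
    if not isinstance(raw, str):
        return []  # NaN or None: no labels
    parts = _SEPARATORS.split(raw.strip().lower())
    # Drop empty fragments and the "unknown" placeholder
    return [part.strip() for part in parts if part.strip() and part.strip() != "unknown"]

# Made-up sample values, for illustration only
sample = ["Comedy, Drama", "drama", "unknown", "Romance/Comedy", "crime and drama"]
label_counts = Counter(label for raw in sample for label in split_genres(raw))
print(label_counts.most_common(3))
```

On the dataframe this could be applied with `my_data["Genre"].map(split_genres)`, which would give each row a list of labels suitable for multi-label encoding.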

In [None]:
by_country_by_genre = my_data.groupby(["Country","Genre"]).size().unstack()
by_genre_top20 = by_country_by_genre.loc[:, by_country_by_genre.columns.isin(pop_genres)]

plt.figure(figsize=(14,8))
sns.heatmap(
    by_genre_top20, 
    #square=True, # make cells square
    cbar_kws={'fraction' : 0.02}, # shrink colour bar
    cmap='OrRd', # use orange/red colour map
    linewidth=1 # space between cells
)

As we can see, in **the US the most popular genres** are "drama", "comedy", "horror", "western" and "adventure". Except for "comedy" (which I guess is hard to predict from just a word cloud), our prediction is in line with reality.

For **the Indian data**, most of the genre information is missing, so it is much harder to see the real picture, but we can see that "drama" and "comedy" also prevail. We can observe the same situation in the British movie data.

# Top directors and cast
Finally, let's see who the most prolific director is (the one who directed the most movies) and which cast appears most often.

We will remove the "Unknown" director and null cast values to make the analysis clearer.
Also, while analyzing the Cast data, we noticed a few small issues. First, the same act was listed under 2 different names: "Three Stooges" and "The Three Stooges". This distorted the counts, so we merged the two. Second, while removing the null cast values, we noticed that some rows contain not nulls but non-breaking spaces, so we removed those as well.

### Top cast

In [None]:
# Getting rid of null values and invisible characters (non-breaking spaces)
top_cast = my_data[(my_data.Cast.notnull()) & (my_data.Cast != " ")]
top_cast.set_index("Cast",inplace=True)
top_cast.rename(index={'Three Stooges':'The Three Stooges'},inplace=True)
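A slightly more defensive variant of the cleaning above would normalize each cast value before filtering, so that stray non-breaking spaces, repeated whitespace and known aliases are all handled in one place. This is a sketch under assumptions. `clean_cast` and `CAST_ALIASES` are hypothetical names, and the alias table contains only the one pair mentioned in the text.

```python
import re

# Hypothetical alias table; the analysis above merges only this one pair
CAST_ALIASES = {"Three Stooges": "The Three Stooges"}

def clean_cast(value):
    """Return a tidied cast string, or None if the value is effectively empty."""
    if not isinstance(value, str):
        return None  # NaN or None
    # Replace non-breaking spaces, then collapse runs of whitespace
    text = re.sub(r"\s+", " ", value.replace("\u00a0", " ")).strip()
    if not text:
        return None
    return CAST_ALIASES.get(text, text)

print(clean_cast("Three Stooges"))
print(clean_cast("\u00a0"))
```

On the dataframe, `my_data["Cast"].map(clean_cast)` followed by `dropna()` would give a series that can be passed to `value_counts()` directly.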

In [None]:
plt.figure(figsize=(14,10))
plt.title('Top cast (based on the number of movies)',fontsize=16)

sns.countplot(y=top_cast.index,order=pd.value_counts(top_cast.index)[:20].index,palette=sns.color_palette("Pastel1", 20))

plt.xlabel('Number of movies',fontsize=12)
plt.ylabel('',fontsize=12)
plt.yticks(size=12)
plt.show()

By far, the most frequent cast of this dataframe is ... **The Three Stooges**, an American vaudeville and comedy team active from 1922 until 1970, best known for their 190 short subject films by Columbia Pictures that have been regularly airing on television since 1958. 

https://en.wikipedia.org/wiki/The_Three_Stooges

The second place belongs to **Tom and Jerry** :)
It is an American animated series of comedy short films created in 1940 by William Hanna and Joseph Barbera. Best known for its 161 theatrical short films by Metro-Goldwyn-Mayer, the series centers on a rivalry between the title characters Tom, a cat, and Jerry, a mouse. They won seven Academy Awards for Animated Short Film.

https://en.wikipedia.org/wiki/Tom_and_Jerry

### Top directors

In [None]:
top_director = my_data[my_data.Director != "Unknown"]

In [None]:
plt.figure(figsize=(14,5))
plt.title('Top Directors (based on the number of movies directed)',fontsize=14)

# Pass the column as x= explicitly; newer seaborn versions do not accept it positionally
sns.countplot(x=top_director.Director,order=top_director.Director.value_counts().iloc[:20].index,palette=sns.color_palette("Pastel1", 20))

plt.xlabel('',fontsize=10)
plt.ylabel('Number of movies directed',fontsize=10)
plt.xticks(size=11,rotation=60)
plt.show()

The difference between directors seems less pronounced in this case, but if we consider that filming even 1 movie might mean years of work, it changes the perception of the data.

The top director based on the number of movies directed is **Michael Curtiz**, a Hungarian-born American film director. The Wikipedia article about Michael Curtiz confirms that he is recognized as one of the most prolific directors in history. He was nominated five times for an Oscar and won twice, once for Best Short Subject for "Sons of Liberty" and once as Best Director for "Casablanca". 
https://en.wikipedia.org/wiki/Michael_Curtiz

**Hanna-Barbera** takes second place, which is not surprising, since these 2 directors (William Hanna and Joseph Barbera) created the "Tom and Jerry" cartoons, which hold second place among the most frequent casts in this dataframe! 

# Conclusion

* In this analysis we've discovered that the Wikipedia Movie Plots dataset is fairly large (~35k movies). This could be an advantage when building a Genre prediction model or a Recommendation system.
* The overall tendency of the world movie production industry is near-exponential growth, which is led by the US and India, followed by Great Britain and Japan.
* We've looked more closely into the reasons for this growth in the US and India.
* We've explored the most popular movie genres of this dataset and tried to predict the most frequent genres for the US and India based on the word clouds.
* Lastly, we've identified the top directors and cast of this dataframe, both American. The Three Stooges is the most frequent cast and Michael Curtiz is the top director based on the number of movies directed.

Further work on this dataset would include developing a Genre prediction algorithm and/or a Recommendation system.