# **TMDB - Detailed EDA and Time Series** <a id="0"></a> <br>

[TMDB](https://www.themoviedb.org/?language=en-US) (The Movie Database) is a community-built portal containing data about movies and TV shows, which makes it a direct competitor of IMDb. 

As the film industry grows every year, more and more movies are being released, and as the taste of cinema-goers changes with time, the film industry has to adapt as well. The amount of money flowing to and from this market is immense, so interest in trend prediction is growing. On the other side, it would be difficult for an ordinary person to follow all the news from this world, and websites like TMDB give them a good overview of what is currently playing. The database given to us in this competition is a movie database containing a vast amount of information about movies dating back to the 70s. This gives us a great opportunity to learn how the film industry has developed over the years, who the biggest players are and what trends we can observe. And of course, we can hone our data analysis and machine learning skills along the way. Let's start our investigation!

![Imgur](https://i.imgur.com/7dSrGWx.jpg)
(*screenshot of the main page of TMDB*)

#### **Content:**
1. [EDA](#1)  
    - [Number of titles per year](#2)  
    - [Revenue per year](#3)  
    - [Budget per year](#4)  
    - [Runtime per year](#5)       
2. [Correlations](#6)  
    - [Overview of numerical variables](#7)  
3. [Categorical variables](#8)  


---
**SOME INSIGHTS**
* Average length of a movie is about **107 minutes**
* The **longest** movie: ["Carlos"](https://www.themoviedb.org/movie/43434-carlos?language=en-US) is **5hrs 38min** long
* The **most popular** movies: ["Minions"](https://www.themoviedb.org/movie/211672-minions) and ["Wonder Woman"](https://www.themoviedb.org/movie/297762-wonder-woman?language=en-US)
* Movies with the **biggest budget**: "[Pirates of the Caribbean: On Stranger Tides](https://www.themoviedb.org/movie/1865-pirates-of-the-caribbean-on-stranger-tides?language=en-US)" and "[Pirates of the Caribbean: At World's End](https://www.themoviedb.org/movie/285-pirates-of-the-caribbean-at-world-s-end?language=en-US)"
* The **biggest** movie **producers**: *Warner Bros.* and *Universal Pictures*
* The most popular **genres**: *Drama* and *Comedy*
* **Bruce Willis** has appeared in the most movies (62), followed by **Morgan Freeman** (61).

You will find much more in the analysis itself, so enjoy!

---
First, let's load the required libraries:

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os  # operating system library
import matplotlib.pyplot as plt  # basic visualisation library
import seaborn as sns  # advanced visualisations
import ast  # Abstract Syntax Trees
import itertools  # Iterating tools
import re  # Regular Expressions

Reading data and visual inspection of the first 4 rows.

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.head(4)

The columns with their description and type:


| No. | Column                | Description                                     | Type   |
|-----|-----------------------|-------------------------------------------------|--------|
| 1.  | id                    | Identification number                           | int    |
| 2.  | belongs_to_collection | Contains TMDB Id, Name, Poster and Backdrop url | object |
| 3.  | budget                | Budget of the movie in USD                      | int    |
| 4.  | genres                | Genres to which the movie classifies            | object |
| 5.  | homepage              | Link to a homepage of the movie on TMDB         | object |
| 6.  | imdb_id               | IMDb identification number                      | object |
| 7.  | original_language     | Original language of the movie                  | object |
| 8.  | original_title        | Original title of the movie                     | object |
| 9.  | overview              | Short description of the movie's plot           | object |
| 10. | popularity            | Popularity score                                | float  |
| 11. | poster_path           | Link to the movie's poster on TMDB page         | object |
| 12. | production_companies  | Companies participating in the movie production | object |
| 13. | production_countries  | Country of production                           | object |
| 14. | release_date          | Movie's release date in mm/dd/yy format         | object |
| 15. | runtime               | Movie's runtime in minutes                      | int    |
| 16. | spoken_languages      | Languages spoken in the movie                   | object |
| 17. | status                | Status of the movie (released or rumored)       | object |
| 18. | tagline               | A short tag sentence                            | object |
| 19. | title                 | English title of the movie                      | object |
| 20. | Keywords              | Keywords for search engines                     | object |
| 21. | cast                  | List of actors                                  | object |
| 22. | crew                  | List of crew members                            | object |
| 23. | revenue               | Revenue of the movie in USD                     | int    |


Checking the shape of the datasets. We expect the test dataset to have one column fewer.

In [None]:
train_shape = train.shape
test_shape = test.shape
print("Train dataset has {} rows and {} columns.".format(train_shape[0], train_shape[1]))
print("Test dataset has {} rows and {} columns.".format(test_shape[0], test_shape[1]))

Checking the shapes of our datasets reveals that the number of columns is correct and that the test dataset has over 1/3 more rows. The next step is to check if we have any duplicates.

In [None]:
if sum(train.duplicated()) == 0:
    print("There are no duplicates in a train dataset.")
else:
    print("There are {} duplicates in a train dataset.".format(sum(train.duplicated())))
    
# the second check must look at the test dataset, not train again
if sum(test.duplicated()) == 0:
    print("There are no duplicates in a test dataset.")
else:
    print("There are {} duplicates in a test dataset.".format(sum(test.duplicated())))

Everything looks fine so far - no duplicates. In the next step, the number of missing values will be checked.

In [None]:
train_pct_nans = round(train.isnull().sum()/train_shape[0]*100,2).to_frame().sort_values(by=[0], ascending=False)
test_pct_nans = round(test.isnull().sum()/test_shape[0]*100,2).to_frame().sort_values(by=[0], ascending=False)

pct_nans = pd.concat([train_pct_nans,test_pct_nans],axis=1, sort=False).reset_index()
pct_nans.columns = ["column","train","test"]
melted = pct_nans.melt(id_vars="column").rename(columns=str.title)
melted.head()

In [None]:
# create a bar chart
sns.set_style("darkgrid")
plt.figure(figsize=(20,8))
sns.barplot(x="Column", y="Value", hue="Variable", data=melted)

plt.axhline(10, ls="--")
plt.xticks(rotation=90, fontsize=13)
plt.title("Percentage of missing values", fontsize=13)
plt.ylabel("Missing values [%]", fontsize=13)
plt.show()

Both datasets have a similar percentage of missing values in each column. Only 3 columns exceed 10%. These are: 
* "belongs_to_collection"
* "homepage"
* "tagline"
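As a compact cross-check of the percentages above, the same figures can be obtained with `isnull().mean()`. The snippet below is only a sketch on a made-up toy frame (the `toy` data and the `above_threshold` name are assumptions for illustration, not part of the competition data):

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "homepage": [None, "http://a", None, None],
    "tagline": ["x", None, "y", "z"],
    "budget": [1, 2, 3, 4],
})

# isnull().mean() gives the fraction of missing entries per column
pct_missing = (toy.isnull().mean() * 100).round(2)

# Keep only the columns above the 10% threshold, largest first
above_threshold = pct_missing[pct_missing > 10].sort_values(ascending=False)
print(above_threshold)
```

Applied to `train` or `test` instead of `toy`, this should list the same columns that cross the dashed 10% line in the bar chart.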

Now I will concatenate the datasets into one, to make the seaborn plots easier to control and to avoid writing separate code for the test and train datasets every time. 

In [None]:
train["src"] = "train"
test["src"] = "test"
all_data = pd.concat([train,test], sort=False)
all_data.head(3)

## 1 - Exploratory Data Analysis [^](#0) <a id="1"></a> <br>
Initial data preparation. I will look into release date, budget, runtime, revenue and some more variables.  

First we have to clean the dates. Because they are in MM/DD/YY format, we have to be able to distinguish 1918 from 2018. Therefore, we will choose a threshold year of 2018 (two-digit years from 18 upwards are read as 19xx, the rest as 20xx) and check the results manually.

In [None]:
melted[melted["Column"]=="release_date"]

We have 0.02% of missing release dates. This is a negligible amount and I will remove it for simplicity.

In [None]:
all_data.dropna(subset=["release_date"], inplace=True)

In [None]:
def clean_dates(row):
    '''
    Expand the two-digit year in a row's release_date (mm/dd/yy) to four digits.
    Years 18-99 are read as 19xx and years 00-17 as 20xx.
    '''
    text = row["release_date"]
    yr = re.findall(r"\d+/\d+/(\d+)",text)
    
    if int(yr[0]) >= 18:
        return text[:-2] + "19" + yr[0]
    else:
        return text[:-2] + "20" + yr[0]

In [None]:
all_data["release_date"] = all_data.apply(clean_dates, axis=1)    # applying cleaning function
all_data["release_date"] = pd.to_datetime(all_data["release_date"])    # converting release_date column to datetime type
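To see the threshold rule in isolation, here is a minimal sketch on a few made-up date strings. The helper `expand_two_digit_year` and its `pivot` parameter are hypothetical names used only for this illustration; it follows the same rule as `clean_dates` (18 and above goes to 19xx, below 18 goes to 20xx):

```python
import pandas as pd

def expand_two_digit_year(date_str, pivot=18):
    """Expand 'm/d/yy' to 'm/d/yyyy': yy >= pivot -> 19yy, otherwise 20yy."""
    month, day, yy = date_str.split("/")
    century = "19" if int(yy) >= pivot else "20"
    return "{}/{}/{}{}".format(month, day, century, yy.zfill(2))

# Made-up sample dates in the same mm/dd/yy format as the dataset
samples = pd.Series(["2/20/15", "7/14/87", "12/1/17", "1/1/21"])
parsed = pd.to_datetime(samples.map(expand_two_digit_year), format="%m/%d/%Y")
print(parsed.dt.year.tolist())
```

Note that a fixed pivot can only be right as long as the dataset spans less than 100 years, which is why the manual checks that follow are still worthwhile.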

In [None]:
check = all_data[all_data["release_date"].dt.year==2017]
check[["id","title","release_date","runtime","budget","revenue"]].head()

In [None]:
first_date = all_data["release_date"].min()
last_date = all_data["release_date"].max()

first_movie = all_data[all_data["release_date"]==first_date]
first_movie[["id","title","release_date","runtime","budget","revenue","poster_path"]]

In [None]:
url = "https://image.tmdb.org/t/p/original"+first_movie["poster_path"].values[0]
url

**NOTE:** Since this database was released on Kaggle, TMDB has changed its domain name and the links in the database are all broken. Therefore, I'll use links extracted manually.

The oldest movie in our database is black and white [Mickey](https://www.themoviedb.org/movie/54242-mickey) from 1921. 
<img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/tB7XlpNbGMBD7XrnXkbLmTWvnpJ.jpg" width="200"> 

In [None]:
last_movie = all_data[all_data["release_date"]==last_date]
last_movie[["id","title","release_date","runtime","budget","revenue","poster_path"]]

In [None]:
url = "https://image.tmdb.org/t/p/original"+last_movie["poster_path"].values[0]
url

The newest movie in our database is [Good Time](https://www.themoviedb.org/movie/429200-good-time) from 2017. 
<img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/s6DJXJU3HzX24Ij3VWg5MfVGHrI.jpg" width="200"> 

To check that movies from 1917 and 2017 have not been mixed up, I will look at the first 10 movies from the 2017 subset.

In [None]:
all_data.loc[:,"release_year"] = all_data.loc[:,"release_date"].dt.year
all_data.loc[:,"release_month"] = all_data.loc[:,"release_date"].dt.month
movies_2017 = all_data[all_data["release_year"]==2017]
movies_2017[["id","title","release_date","runtime","budget","revenue"]].head(10)

All the movies listed above are indeed from 2017. It is not a perfect method, as I am not checking all the movies, but it gives a good indication of correctness. I will do the same for 2018.

In [None]:
movies_2018 = all_data[all_data["release_year"]==2018]
movies_2018[["id","title","release_date","runtime","budget","revenue"]].head(10)

There are no movies in this subset - this is a good sign as the first movie was from 1921 and the last from 2017. Therefore we would expect it to be empty.

In [None]:
TS = all_data.loc[:,["original_title","release_date","budget","runtime","revenue","release_year","release_month"]]
#TS = train.copy()
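# no dropna() here: its result was never assigned, and dropping NaNs would remove every test row (they have no revenue)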
TS.head()

### a) Number of titles per year [^](#0) <a id="2"></a> <br>

In [None]:
plt.figure(figsize=(25,10))
sns.countplot(x="release_year", data=TS)
plt.xticks(rotation=70, fontsize=12)
plt.xlabel("Release Year", size=15)
plt.ylabel("# of movies", size=15)
plt.show()

The number of titles per year increases steadily, with a rapid rise after the mid-1970s. The last year (2017) is incomplete, hence the sharp drop at the end.

### b) Revenue per year [^](#0) <a id="3"></a> <br>

In [None]:
max_rev = all_data["revenue"].max()
max_rev_title = all_data["original_title"][all_data["revenue"] == max_rev].values[0]

plt.figure(figsize=(16,8))
ax = sns.boxplot(x="release_year", y="revenue", data = all_data)
ax.set_title("Revenue of  movies between 1969-2017")
plt.xticks(rotation=70)
plt.text(2,1.35e9,"Max revenue: {:,.0f} $ ('{}')".format(max_rev, max_rev_title), fontweight="bold")
plt.text(2,1.3e9,"Mean revenue: {:,.0f} $".format(all_data["revenue"].mean()), fontweight="bold")
plt.xlabel("Release Year", size=15)
plt.ylabel("Revenue", size=15)
plt.show()

The boxplot above shows that the typical (median) revenue of movies has stayed relatively similar over the years, but recent years contain more individual movies with very high revenue. The biggest one in our dataset is ["The Avengers"](https://www.themoviedb.org/movie/24428-the-avengers?language=en-US) from 2012 with a revenue of 1,519,557,910 $.

<img src="https://image.tmdb.org/t/p/original/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg" width="200"> 

### c) Budget per year [^](#0) <a id="4"></a> <br>

In [None]:
zero_budget = all_data[all_data["budget"]==0]
print(len(zero_budget))
zero_budget[["title","release_date","budget"]].head(10)

A quick online search makes it clear that these **0s are actually missing values**, e.g.:

<a href="https://en.wikipedia.org/wiki/Control_Room_(film)">"Control Room"</a> - actual budget is around 60 000 $.

<a href="https://en.wikipedia.org/wiki/The_Invisible_Woman_(2013_film)">"The Invisible Woman"</a> - actual budget is around 12 000 000 $.
<table><tr>
<td> <img src="https://image.tmdb.org/t/p/original/jvPKEho10KUY71Di9oyKKezue92.jpg" style="width: 250px;"/> </td>
<td> <img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/1qurTmuj36prAz4BG71hC9r2n02.jpg" style="width: 250px;"/> </td>
</tr></table>

For missing/corrupted data we have a couple of options, including:
1. Drop observations (not recommended). I am not dropping anything here.
2. Fix data manually (good, but very time consuming and not feasible with a lot of missing data). Below I will fix some of the missing budgets manually to show how to do it.
3. Replace missing/corrupted data with some value, e.g. the mean. I will replace the remaining missing budgets with the mean budget of the other movies released in the same year.
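Option 3 can be sketched on a toy frame as follows. This is only an illustration under assumptions (made-up numbers, a hypothetical `budget_filled` column) and uses `groupby(...).transform("mean")` rather than a loop over years:

```python
import numpy as np
import pandas as pd

# Made-up data: a budget of 0 stands for "missing"
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2000, 2001, 2001],
    "budget": [10.0, 0.0, 30.0, 0.0, 8.0],
})

# Treat 0 as missing so that it does not drag the yearly mean down
budget = toy["budget"].replace(0, np.nan)

# Mean of the known budgets within each release year, aligned to the rows
yearly_mean = budget.groupby(toy["release_year"]).transform("mean")

# Fill only the missing entries; known budgets stay untouched
toy["budget_filled"] = budget.fillna(yearly_mean)
print(toy)
```

A year in which every budget is missing would stay NaN with this approach, so such years would still need separate handling.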

Replacing missing budgets manually:

In [None]:
all_data.loc[all_data['id'] == 7,'budget'] = 60000 # Control Room 
all_data.loc[all_data['id'] == 8,'budget'] = 50000000 # Muppet Treasure Island
all_data.loc[all_data['id'] == 17,'budget'] = 12000000 # The Invisible Woman
all_data.loc[all_data['id'] == 11,'budget'] = 10000000 # Revenge of the Nerds II:
all_data.loc[all_data['id'] == 31,'budget'] = 8000000 # Caché
all_data.loc[all_data['id'] == 38,'budget'] = 4000000 # Final: The Rapture
all_data.loc[all_data['id'] == 48,'budget'] = 5000000 # Wilson
all_data.loc[all_data['id'] == 52,'budget'] = 800000 # The Last Flight of Noah's Ark
all_data.loc[all_data['id'] == 53,'budget'] = 100000000 # For Keeps
all_data.loc[all_data['id'] == 55,'budget'] = 200000000 # Son in Law
all_data.loc[all_data['id'] == 56,'budget'] = 50000000 # Queen to Play
all_data.loc[all_data['id'] == 62,'budget'] = 46000000 # Fallen
all_data.loc[all_data['id'] == 67,'budget'] = 4000000 # A Touch of Sin
all_data.loc[all_data['id'] == 73,'budget'] = 19000000 # Czech Dream
all_data.loc[all_data['id'] == 89,'budget'] = 30000000 # Sommersby
all_data.loc[all_data['id'] == 91,'budget'] = 75000000 # His Secret Life
all_data.loc[all_data['id'] == 93,'budget'] = 8500 # Hunger
all_data.loc[all_data['id'] == 97,'budget'] = 300 # Target
all_data.loc[all_data['id'] == 102,'budget'] = 14000000 # Going by the Book
all_data.loc[all_data['id'] == 103,'budget'] = 19000000 # Won't Back Down
all_data.loc[all_data['id'] == 116,'budget'] = 2400000 # Back to 1942
all_data.loc[all_data['id'] == 117,'budget'] = 60000000 # Wild Hogs
all_data.loc[all_data['id'] == 118,'budget'] = 2000000 # Boxing Helena
all_data.loc[all_data['id'] == 126,'budget'] = 9000000 # Corvette Summer
all_data.loc[all_data['id'] == 141,'budget'] = 12000000 # Girl with a Pearl Earring
all_data.loc[all_data['id'] == 146,'budget'] = 14000000 # Police Academy 5
all_data.loc[all_data['id'] == 148,'budget'] = 18000000 # Beethoven

Two additional useful columns (year and month of movie release) are needed in order to replace the missing values with a yearly mean. They were already created above; the cell below simply recomputes them.

In [None]:
all_data.loc[:,"release_year"] = all_data.loc[:,"release_date"].dt.year
all_data.loc[:,"release_month"] = all_data.loc[:,"release_date"].dt.month

In [None]:
years = all_data.loc[:,"release_year"].unique()
all_data["budget"] = all_data["budget"].astype(float)  # yearly means are floats

for year in years:
    in_year = all_data["release_year"]==year
    year_mean = all_data.loc[in_year & (all_data["budget"]!=0), "budget"].mean()
    # NaN never compares equal to anything, so `!= np.nan` is always True; use pd.notna instead
    if pd.notna(year_mean):
        all_data.loc[in_year & (all_data["budget"]==0), "budget"] = year_mean

Let's verify if everything was performed correctly - missing budgets were replaced and manually corrected ones are unaltered. 

In [None]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)
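# select by movie id: index labels repeat after the concat, so positional .iloc lookups would pick the wrong rows
all_data[all_data["id"].isin(zero_budget["id"])][["title","budget"]].head(10)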

Everything looks fine.

In [None]:
max_bud = all_data["budget"].max()
max_bud_title = all_data["original_title"][all_data["budget"] == max_bud].values[0]

plt.figure(figsize=(16,8))
ax = sns.boxplot(x="release_year", y="budget", data = all_data)  # all_data holds the corrected budgets
ax.set_title("Budgets of  movies between 1969-2017")
plt.xticks(rotation=70)
plt.text(2,3.4e8,"Mean budget: {:,.0f} $".format(all_data["budget"].mean()), fontweight="bold")
plt.text(2,3.3e8,"Max budget: {:,.0f} $ ('{}')".format(max_bud, max_bud_title), fontweight="bold")
plt.show()

`TS` was copied from `all_data` before the budgets were fixed, so it still holds the original zeros. Let's check whether any movies with a budget of 0 remain in `all_data`.

In [None]:
b_zeros = all_data[all_data.budget==0]
b_zeros.head()

### d) Runtime per year [^](#0) <a id="5"></a> <br>

In [None]:
max_run = all_data["runtime"].max()
max_run_title = all_data["original_title"][all_data["runtime"] == max_run].values[0]

plt.figure(figsize=(16,8))
ax = sns.boxplot(x="release_year", y="runtime", data = all_data)
ax.set_title("Runtime of movies between 1969-2017")
plt.xticks(rotation=70)
plt.text(2,300,"Mean runtime: {:.0f} minutes".format(train["runtime"].mean()), fontweight="bold")
plt.text(2,280,"Max runtime: {:.0f} minutes ('{}')".format(max_run, max_run_title), fontweight="bold")
plt.show()

Generally, the mean runtime of the movies over the entire time span is around 108 minutes. The longest movie is ["Carlos"](https://www.themoviedb.org/movie/43434-carlos?language=en-US), which is 338 minutes long (5h 38min).
It seems we have some movies with a duration of 0 minutes. Let's investigate.
<img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/883l2CsEuNPqv8aCAOuOavxdLAy.jpg" width="200"> 

In [None]:
r_zeros = all_data[all_data.runtime==0][["id","title"]]
r_zeros.head()

Once again, after looking these titles up online, it appears that the 0s are missing values, e.g.:

<a href="https://en.wikipedia.org/wiki/The_Worst_Christmas_of_My_Life_(film)">"Il peggior Natale della mia vita (The Worst Christmas of My Life)"</a> - actual runtime is 86 minutes.

<a href="https://en.wikipedia.org/wiki/The_Worst_Week_of_My_Life_(film)">"La peggior settimana della mia vita (The Worst Week of My Life)"</a> - actual runtime is 93 minutes.

Let's remove corrupted lines and plot the same graph once again.


In [None]:
# keep only rows whose runtime is not recorded as 0 (a boolean mask is safe with the repeated index labels)
TS_clean_runtime = all_data[all_data["runtime"] != 0]

max_run = TS_clean_runtime["runtime"].max()
max_run_title = TS_clean_runtime["original_title"][TS_clean_runtime["runtime"] == max_run].values[0]

plt.figure(figsize=(16,8))
ax = sns.boxplot(x="release_year", y="runtime", data=TS_clean_runtime)
ax.set_title("Runtime of movies between 1969-2017")
plt.xticks(rotation=70)
plt.text(2,300,"Mean runtime: {:.0f} minutes".format(TS_clean_runtime["runtime"].mean()), fontweight="bold")
plt.text(2,280,"Max runtime: {:.0f} minutes ('{}')".format(max_run, max_run_title), fontweight="bold")
plt.show()

## 2 - Correlations [^](#0) <a id="6"></a> <br>

### a) Overview of numerical columns [^](#0) <a id="7"></a> <br>
Before we go into a detailed analysis of each variable, let's take a quick overview with seaborn's PairGrid.

In [None]:
fig = sns.PairGrid(all_data[['runtime', 'budget', 'popularity','revenue']].dropna())
# define top, bottom and diagonal plots
fig.map_upper(plt.scatter, color='purple')
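fig.map_lower(sns.kdeplot, cmap='cool')  # the "_d" palette variants are not available in recent seaborn
fig.map_diag(sns.histplot, bins=30, kde=True);  # histplot replaces the deprecated distplot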

### RUNTIME - BUDGET

In [None]:
all_data.plot(x="runtime",y="budget", kind="scatter",figsize=(12,8))
runtime_mean = np.mean(all_data.runtime)
plt.axvline(x=runtime_mean, c="red", linestyle="--")
plt.text(runtime_mean-10, 3.5e8,"mean: {0:0.1f}".format(runtime_mean), color="red", rotation=90)
plt.ylim([0,4e8])
plt.show()

In [None]:
plt.figure(figsize=(12,7))
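# histplot replaces the deprecated distplot; stat="density" keeps the y-axis scale used for the label below
ax = sns.histplot(all_data["runtime"].dropna(), kde=True, stat="density", shrink=0.75, alpha=0.6, color="g")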
plt.axvline(x=runtime_mean, c="red", linestyle="--")
plt.text(runtime_mean+5, 0.025,"mean: {0:0.1f}".format(runtime_mean), color="red", rotation=90)
plt.show()

### POPULARITY - BUDGET

In [None]:
all_data.plot(x="popularity",y="budget", kind="scatter",figsize=(12,7))
pop_mean = np.mean(all_data.popularity)
plt.axvline(x=pop_mean, c="red", linestyle="--")
plt.text(pop_mean-10, 3.5e8,"mean: {0:0.1f}".format(pop_mean), color="red", rotation=90)
plt.ylim([0,4e8])
plt.show()

For clarity let's remove the outliers and see the most popular movie.

In [None]:
pop = all_data[all_data.popularity<50]
pop.plot(x="popularity",y="budget", kind="scatter",figsize=(12,7))
plt.axvline(x=pop_mean, c="red", linestyle="--")
plt.text(pop_mean-1, 3.5e8,"mean: {0:0.1f}".format(pop_mean), color="red", rotation=90)
plt.ylim([0,4e8])
plt.show()

### TOP3 - POPULARITY

In [None]:
top3 = all_data.sort_values(by="popularity", ascending=False)[["original_title","release_year"]][:3]
top3

The 3 most popular movies are: [Minions](https://www.themoviedb.org/movie/211672-minions), [Wonder Woman](https://www.themoviedb.org/movie/297762-wonder-woman?language=en-US), [Beauty and the Beast](https://www.themoviedb.org/movie/321612-beauty-and-the-beast?language=en-US).

<table><tr>
<td> <img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/vlOgaxUiMOA8sPDG9n3VhQabnEi.jpg" style="width: 250px;"/> </td>
<td> <img src="https://image.tmdb.org/t/p/original/gfJGlDaHuWimErCr5Ql0I8x9QSy.jpg" style="width: 250px;"/> </td>
<td> <img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/tWqifoYuwLETmmasnGHO7xBjEtt.jpg" style="width: 250px;"/> </td>
</tr></table>

### TOP3 - BUDGET

In [None]:
top3 = all_data.sort_values(by="budget", ascending=False)[["original_title","release_year"]][:3]
top3

<table><tr>
<td> <img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/8fxendLObfOewRllxiM4X9Ey7S4.jpg" style="width: 250px;"/> </td>
<td> <img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/cVFKkJ8Kcuf9zSUdUNB7iKOm4nh.jpg" style="width: 250px;"/> </td>
<td> <img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/4ssDuvEDkSArWEdyBl2X5EHvYKU.jpg" style="width: 250px;"/> </td>
</tr></table>

### POPULARITY - REVENUE

In [None]:
pop.plot(x="popularity",y="revenue", kind="scatter", figsize=(12,8))
plt.show()

### TOP3 - REVENUE

In [None]:
top3 = all_data.sort_values(by="revenue", ascending=False)[["original_title","release_year"]][:3]
top3

<table><tr>
<td> <img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg" style="width: 250px;"/> </td>
<td> <img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/ktofZ9Htrjiy0P6LEowsDaxd3Ri.jpg" style="width: 250px;"/> </td>
<td> <img src="https://www.themoviedb.org/t/p/w600_and_h900_bestv2/4ssDuvEDkSArWEdyBl2X5EHvYKU.jpg" style="width: 250px;"/> </td>
</tr></table>

### CLUSTERMAP - Summary of Correlations

In [None]:
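# correlate numeric columns only; recent pandas raises on text columns such as original_title
sns.clustermap(TS.select_dtypes("number").corr(), annot=True, linewidths=.5, fmt= '.2f')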
plt.show()

## 3 - Categorical variables [^](#0) <a id="8"></a> <br>

In [None]:
all_data.original_language.nunique()

The movies in our database come in 36 different original languages.

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x=all_data['original_language'].sort_values())
plt.show()

The most popular language is English.

In [None]:
non_english = all_data['original_language'][all_data['original_language'] != "en"].value_counts()
plt.figure(figsize=(15,5))
sns.barplot(x=non_english.index, y=non_english.values)
plt.title("Number of non-English movies per language")
plt.show()

In [None]:
prod_c = all_data["production_companies"].to_frame()

prod_c.dropna(inplace=True)

# convert the strings into lists
prod_c["production_companies"] = prod_c["production_companies"].apply(lambda x: ast.literal_eval(x))

prod_c_clean = pd.DataFrame(list(itertools.chain(*prod_c["production_companies"].values.tolist())))
prod_c_clean.head()
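The flattening step above may be easier to follow on a tiny example. The sketch below uses two made-up cells in the same stringified list-of-dicts shape as the `production_companies` column (the studio names and ids are invented for illustration):

```python
import ast
import itertools
import pandas as pd

# Two made-up cells in the same stringified list-of-dicts shape as the column
raw = pd.Series([
    "[{'name': 'Studio A', 'id': 1}, {'name': 'Studio B', 'id': 2}]",
    "[{'name': 'Studio A', 'id': 1}]",
])

# Strings -> real Python lists of dicts
parsed = raw.apply(ast.literal_eval)

# Chain all lists together so that each dict becomes one row
flat = pd.DataFrame(list(itertools.chain.from_iterable(parsed)))
print(flat["name"].value_counts())
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a CSV file.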

In [None]:
tmp1 = prod_c_clean["name"].value_counts()
TOP20 = tmp1[:20].to_frame()
TOP20

In [None]:
plt.figure(figsize=(16,6))
ax = sns.barplot(x=TOP20.index, y=TOP20.iloc[:, 0])  # first column holds the counts; its name differs between pandas versions
plt.title("Number of movies produced by companies")
plt.xticks(rotation=90)
plt.show()

In [None]:
genres = all_data.loc[:,["id","original_title","genres", "release_date","release_year"]]
genres.loc[:,"genres"] = genres.loc[:,"genres"].fillna("None")
genres.head(5)

In [None]:
genres.isna().sum()

In [None]:
def extract_genres(row):
    if row == "None":
        return "None"
    else:
        results = re.findall(r"'name': '(\w+\s?\w+)'", row)
        return results

In [None]:
genres["genres"] = genres["genres"].apply(extract_genres)
genres["genres"].head(10)

In [None]:
unique_genres = genres["genres"].apply(pd.Series).stack().unique()
print("Number of genres: {}".format(len(unique_genres)))
print("Genres: {}".format(unique_genres))

In [None]:
genres_total = pd.get_dummies(genres["genres"].apply(pd.Series).stack()).sum()

In [None]:
genres_total.sort_values(inplace=True, ascending=False)
genres_total.head()

In [None]:
plt.figure(figsize=(15,5))
ax = sns.barplot(x=genres_total.index, y=genres_total.values)
plt.xticks(rotation=90)
plt.title("Popularity of genres overall")
plt.ylabel("count")
plt.show()

In [None]:
genres_pp = genres.reindex(columns=[*genres.columns.tolist(),*unique_genres], fill_value=0)
genres_pp.head()

In [None]:
# this approach is quite slow but very easy to code and to understand (a vectorised approach would be faster)
for index, row in genres_pp.iterrows():
    for genre in unique_genres:
        if genre in row["genres"]:
            genres_pp.loc[index, genre] = 1
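For larger frames, the row-by-row loop can be replaced by a vectorised alternative. The following is a sketch under assumptions, not the method used in this notebook: it works on a made-up `toy` frame whose `genres` column already holds lists, and it relies on `Series.explode` (pandas 0.25 or newer):

```python
import pandas as pd

# Made-up frame: each row holds a list of genre names
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Drama", "Comedy"], ["Drama"], ["Action", "Drama"]],
})

# One row per (movie, genre) pair; the original row label is repeated
exploded = toy["genres"].explode()

# Indicator columns per genre, collapsed back to one row per movie
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

encoded = toy.join(indicators)
print(encoded)
```

The `groupby(level=0)` step assumes that row labels are unique, so on a concatenated frame like `all_data` one would call `reset_index(drop=True)` first.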

In [None]:
genres_by_years = genres_pp.groupby("release_year")[unique_genres].sum()
other_genres = ["TV Movie", "None", "Foreign", "Western", "Documentary", "War","Music","History", "Animation"]
genres_by_years.rename(columns={"Science Fiction":"Sci-Fi"}, inplace=True)
#genres_by_years["Others"] = genres_by_years.loc[:,other_genres].sum(axis=1)
genres_by_years.drop(other_genres, axis=1, inplace=True)
genres_by_years.head()

In [None]:
plt.figure(figsize=(17,10))
sns.lineplot(data=genres_by_years[:-1], dashes=False)
plt.ylim([0,250])
plt.title("Popularity of genres over years")
plt.show()

In [None]:
genres_by_years_5yrs = genres_by_years.rolling(5).mean()

#sns.set_style("darkgrid")
plt.figure(figsize=(17,10))
plt.xlim([1975,2020])
plt.ylim([0,250])

plt.xlabel("release year", size=14)
plt.ylabel("# of movies", size=14)


# plotting everything but the last incomplete year
sns.lineplot(data=genres_by_years_5yrs[:-1], dashes=False, palette ="Paired")
plt.legend(prop={'size': 14})

plt.title("Popularity of genres from 1975 (5yr SMA)")
for g in genres_by_years_5yrs.columns:
    y_loc = genres_by_years_5yrs.iloc[-2][g]
    plt.text(2016.5, y_loc+1, g)
plt.show()
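As a reminder of what the 5-year simple moving average (SMA) does, here is a minimal sketch on a made-up yearly count series (the numbers are invented for illustration):

```python
import pandas as pd

# Made-up yearly counts for a single genre
counts = pd.Series([10, 20, 30, 40, 50, 60], index=range(2010, 2016), name="Drama")

# Each value is the mean of the current year and the four preceding years;
# the first four entries are NaN because the window is not yet full
sma_5 = counts.rolling(5).mean()
print(sma_5)
```

This is also why the smoothed curves in the plot start four years later than the raw yearly data.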

In [None]:
cast = all_data.loc[:,["id","original_title","cast","release_date"]]
cast.loc[:,"cast"] = all_data.loc[:,"cast"].fillna("None")
cast.head(10)

In [None]:
def extract_cast(row):
    if row == "None":
        return "None"
    else:
        results = re.findall(r"\'name\': \'(\w+\s\w+-?\w+)'", row)
        return results

In [None]:
cast["cast"] = cast["cast"].apply(extract_cast)
cast.head()

In [None]:
# number of unique actors
unique_actors = cast["cast"].apply(pd.Series).stack().unique()
print("Number of unique actors: {}".format(len(unique_actors)))

In [None]:
actors = cast["cast"].apply(pd.Series).stack().value_counts()
actors.head(20)

In [None]:
act_mov = []
mov = range(1,27)
for i in mov:
    actors_movies = actors[actors.values >i]
    act_mov.append(len(actors_movies))

In [None]:
plt.figure(figsize=(16,8))
sns.barplot(x=list(mov), y=act_mov)
plt.xlabel("Number of movies played in")
plt.show()

In [None]:
actors_top30 = actors[:30]

cast = cast["cast"].apply(pd.Series).stack()

def masking(actor):
    if actor in actors_top30:
        return actor
    else:
        return "Other"
    
cast = cast.apply(masking)
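cast_dummies = pd.get_dummies(cast.apply(pd.Series).stack()).groupby(level=0).sum()  # sum(level=0) was removed in pandas 2.0
cast_dummies.loc[cast_dummies["Other"] > 1, "Other"] = 1  # cap the "Other" count at 1 without chained assignment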
cast_dummies.head()

**MORE TO COME SOON** but if you liked what you've seen so far please consider upvoting! Thanks!