##### <h1 style="text-align:center;"> Movie Content Rating Classification from their Poster Image using CNNs</h1>

<h3 style="text-align:center;">- Final Project -</h3>
<h5 style="text-align:center;">author: Curtis Zhuang</h5>

# Introduction 
### Problem Description
This project aims to ***classify a movie's content rating based only on its poster image.*** <br>

Content rating is an important attribute of a movie: it indicates the intended audience and the kind of content the movie may contain. Although the rating is easy to find by searching for a movie, our group thinks it is also useful to be able to estimate it from the poster alone. That is what our real-time system does: once the client receives a poster, it predicts the content rating.

### Proposed Approach
We use a *deep neural network*, specifically a **Convolutional Neural Network** (CNN), to classify the content rating of a given movie poster image.

Our data comes from the Kaggle **IMDB 5000 Movie Dataset** (source: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset), which includes content rating as a variable. <br>

We use a **web scraping approach**: with the IMDb id of each movie, we retrieve its poster image from the IMDb movie page and save it locally. 

We then construct our Convolutional Neural Network to classify movie content ratings based on poster characteristics, and save the model for our client.

Once the modeling part was done, we designed our **client-server part**. The client sends a movie name to the server; the server downloads the poster and sends it back to the client. The poster is then passed to the model, which predicts its content rating.

### Acknowledgement

**This project is based on the work by Davide Iacobelli on his repo: https://github.com/davideiacobs/-Movie-Genres-Classification-from-their-Poster-Image-using-CNNs.**


# Step 1: _Webscraping_
 
We took the dataset and reduced it to the essential variables, then cleaned the remaining data to make scraping easier. 
In this task we use ***BeautifulSoup***, a Python library for web scraping. <br>

*Through Webscraping we can easily get the poster link of each film simply going on its IMDB page and taking the content of the _src_ HTML tag corresponding to the poster.* (Original words from the Davide Project.)

Once we have all poster links, we add them to our dataset.
<br><br>
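The extraction step described above comes down to two things: find the poster container, then read the `src` attribute of the `<img>` inside it. The sketch below shows that logic on a hand-written HTML fragment. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without network access or extra packages. The `PosterParser` class and the sample markup are illustrative assumptions; the real IMDb page layout may differ and can change over time.

```python
from html.parser import HTMLParser


class PosterParser(HTMLParser):
    """Collect (alt, src) of the first <img> inside <div class="poster">."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # nesting depth inside the poster div (0 = outside)
        self.poster = None   # (alt, src) once an image has been found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            if tag == "div":
                self._depth += 1
            elif tag == "img" and self.poster is None:
                self.poster = (attrs.get("alt"), attrs.get("src"))
        elif tag == "div" and "poster" in (attrs.get("class") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1


# Hypothetical markup, only loosely modelled on a movie page.
SAMPLE_HTML = """
<div class="header"><img alt="logo" src="https://example.com/logo.png"></div>
<div class="poster">
  <a href="#"><img alt="Avatar Poster" src="https://example.com/avatar.jpg"></a>
</div>
"""

parser = PosterParser()
parser.feed(SAMPLE_HTML)
print(parser.poster)
```

The image in the header is ignored because it sits outside the poster container, which is the same effect the `find_all('div', attrs={'class': 'poster'})` call has in the scraping loop.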




In [4]:
import numpy as np
import pandas as pd
import glob
import os
import scipy.misc
import imageio
import skimage.transform
import skimage
from tqdm import tqdm
import requests  
import re
from bs4 import BeautifulSoup  
from urllib.request import urlretrieve
import ast 
import matplotlib.pyplot as plt 


savelocation = 'imdb_posters/'

### Dataset Cleaning

In [15]:
movie = pd.read_csv('movie_metadata.csv', header = 1)

In [16]:
len(movie)

5043

In [17]:
movie

| | genres | movie_title | movie_imdb_link | content_rating |
|---|---|---|---|---|
| 0 | Action\|Adventure\|Fantasy\|Sci-Fi | Avatar聽 | http://www.imdb.com/title/tt0499549/?ref_=fn_t... | PG-13 |
| 1 | Action\|Adventure\|Fantasy | Pirates of the Caribbean: At World's End聽 | http://www.imdb.com/title/tt0449088/?ref_=fn_t... | PG-13 |
| 2 | Action\|Adventure\|Thriller | Spectre聽 | http://www.imdb.com/title/tt2379713/?ref_=fn_t... | PG-13 |
| 3 | Action\|Thriller | The Dark Knight Rises聽 | http://www.imdb.com/title/tt1345836/?ref_=fn_t... | PG-13 |
| 4 | Documentary | Star Wars: Episode VII - The Force Awakens聽 ... | http://www.imdb.com/title/tt5289954/?ref_=fn_t... | |
| ... | ... | ... | ... | ... |
| 5038 | Comedy\|Drama | Signed Sealed Delivered聽 | http://www.imdb.com/title/tt3000844/?ref_=fn_t... | |
| 5039 | Crime\|Drama\|Mystery\|Thriller | The Following聽 | http://www.imdb.com/title/tt2071645/?ref_=fn_t... | TV-14 |
| 5040 | Drama\|Horror\|Thriller | A Plague So Pleasant聽 | http://www.imdb.com/title/tt2107644/?ref_=fn_t... | |
| 5041 | Comedy\|Drama\|Romance | Shanghai Calling聽 | http://www.imdb.com/title/tt2070597/?ref_=fn_t... | PG-13 |


In [18]:
### We need to drop NA ratings 

In [23]:
movie.dropna(axis = 0, subset =  ['content_rating'], inplace = True)

In [25]:
movie.movie_title

0                                         Avatar聽
1       Pirates of the Caribbean: At World's End聽
2                                        Spectre聽
3                          The Dark Knight Rises聽
5                                    John Carter聽
                          ...                    
5036                             The Mongol King聽
5037                                   Newlyweds聽
5039                   The Following聽            
5041                            Shanghai Calling聽
5042                           My Date with Drew聽
Name: movie_title, Length: 4740, dtype: object

In [None]:
### Adjust movie title: strip whitespace, then drop the stray trailing character (shown as 聽) that ends every title

In [30]:
movie['movie_title'] = movie.movie_title.astype('str').str.strip().str[:-1]
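The cell above removes the last character of every title unconditionally. That works here because each title in this file ends with the stray `聽` (it looks like a non-breaking space decoded with the wrong character set, though that is an assumption about how the CSV was produced). A slightly more defensive variant, sketched below with a hypothetical `clean_title` helper, removes that character only when it is actually present, so a title that is already clean is not shortened.

```python
# Characters to remove from the end of a title: the stray '聽' seen in the
# output above, plus a genuine non-breaking space. This set is an assumption.
STRAY = "聽\u00a0"


def clean_title(raw: str) -> str:
    """Strip surrounding whitespace and any trailing stray characters."""
    return raw.strip().rstrip(STRAY).rstrip()


titles = ["Avatar聽", "The Following聽            ", "Spectre"]
print([clean_title(t) for t in titles])
```

With pandas, the same helper could be applied through `movie['movie_title'].astype(str).map(clean_title)`.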

In [34]:
### Drop Genre

In [35]:
movie.drop(['genres'], axis = 1, inplace = True)

In [36]:
### Extract the IMDb id (e.g. tt0499549) from the link and append it as a new column

In [41]:
movie['imdb_id'] = movie.movie_imdb_link.str[26:35]

In [42]:
movie.imdb_id

0       tt0499549
1       tt0449088
2       tt2379713
3       tt1345836
5       tt0401729
          ...    
5036    tt0430371
5037    tt1880418
5039    tt2071645
5041    tt2070597
5042    tt0378407
Name: imdb_id, Length: 4740, dtype: object
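Slicing characters 26 to 35 works because every link in this dataset starts with `http://www.imdb.com/title/` and is followed by a nine-character id. If the prefix or id length ever varied, a pattern match would be less fragile. The following is a minimal sketch of that alternative; the `extract_imdb_id` helper and the second sample link are assumptions made for illustration.

```python
import re

# An IMDb title id is "tt" followed by digits, right after "/title/".
IMDB_ID = re.compile(r"/title/(tt\d+)")


def extract_imdb_id(link):
    """Return the IMDb id found in `link`, or None if there is none."""
    match = IMDB_ID.search(link)
    return match.group(1) if match else None


print(extract_imdb_id("http://www.imdb.com/title/tt0499549/"))
print(extract_imdb_id("https://imdb.com/title/tt2379713/"))
print(extract_imdb_id("not a link"))
```

The pandas counterpart would be `movie.movie_imdb_link.str.extract(r"/title/(tt\d+)", expand=False)`, which leaves `NaN` where no id is found instead of a misaligned slice.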

In [47]:
movie.drop(['movie_imdb_link'], axis = 1, inplace = True)

### Scraping

In [48]:
movie['imdb_link'] = ["https://www.imdb.com/title/"+str(x) for x in movie['imdb_id']]

In [50]:
# With the links, we can look up the poster image URL on each IMDb page
imdbURLS = movie['imdb_link'].tolist()
imdbIDS = movie['imdb_id'].tolist()
records = [] 

# Iterate over URL and id together so the id always matches the current page
for x, imdbID in tqdm(zip(imdbURLS, imdbIDS), total=len(imdbURLS)): 
    # web scraping
    r = requests.get(x)
    soup = BeautifulSoup(r.text, 'html.parser')  
    results = soup.find_all('div', attrs={'class':'poster'})  
    if results:
        img = results[0].find('img')  # first <img> inside the poster container
        postername = img['alt'] 
        imgurl = img['src'] 
        records.append((x, postername, imgurl))
    else:
        # no poster on the page: remove this movie from the dataset
        movie = movie[movie.imdb_id != imdbID]    


100%|██████████| 4740/4740 [1:06:10<00:00,  1.19it/s]
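The run above makes 4,740 requests over roughly an hour, so a single network error partway through would abort the whole loop. One possible safeguard is a small retry wrapper around the request. The sketch below is an assumption about how that could look, not part of the original notebook; `fetch_with_retry` and the fake fetcher used for the demo are hypothetical names.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url); on an exception wait `delay` seconds and try again.

    Returns the result of fetch, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # with requests, catch requests.RequestException
            print(f"attempt {attempt} for {url} failed: {exc}")
            if attempt < retries:
                sleep(delay)
    return None


# Demo with a fake fetcher that fails twice, then succeeds (no network needed).
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>page for {url}</html>"


page = fetch_with_retry(
    flaky_fetch, "https://www.imdb.com/title/tt0499549", sleep=lambda s: None
)
print(page)
```

In the scraping loop this could be used as `r = fetch_with_retry(lambda u: requests.get(u, timeout=10), x)`, treating a `None` result the same way as a page without a poster.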


In [51]:
poster_df = pd.DataFrame(records, columns=['imdb_link', 'postername', 'poster_link'])

In [52]:
poster_df['poster_link']

0       https://m.media-amazon.com/images/M/MV5BMTYwOT...
1       https://m.media-amazon.com/images/M/MV5BMjIyNj...
2       https://m.media-amazon.com/images/M/MV5BOWQ1MD...
3       https://m.media-amazon.com/images/M/MV5BMTk4OD...
4       https://m.media-amazon.com/images/M/MV5BMDEwZm...
                              ...                        
4735    https://m.media-amazon.com/images/M/MV5BMjA2Nz...
4736    https://m.media-amazon.com/images/M/MV5BMjAzNT...
4737    https://m.media-amazon.com/images/M/MV5BZjgzMD...
4738    https://m.media-amazon.com/images/M/MV5BNjA1OD...
4739    https://m.media-amazon.com/images/M/MV5BMjM1ZG...
Name: poster_link, Length: 4740, dtype: object

In [53]:
df_movietotal = pd.merge(movie, poster_df, on='imdb_link')

In [54]:
df_movietotal

| | movie_title | content_rating | imdb_id | imdb_link | postername | poster_link |
|---|---|---|---|---|---|---|
| 0 | Avatar | PG-13 | tt0499549 | https://www.imdb.com/title/tt0499549 | Avatar Poster | https://m.media-amazon.com/images/M/MV5BMTYwOT... |
| 1 | Pirates of the Caribbean: At World's End | PG-13 | tt0449088 | https://www.imdb.com/title/tt0449088 | Pirates of the Caribbean: At World's End Poster | https://m.media-amazon.com/images/M/MV5BMjIyNj... |
| 2 | Spectre | PG-13 | tt2379713 | https://www.imdb.com/title/tt2379713 | Spectre Poster | https://m.media-amazon.com/images/M/MV5BOWQ1MD... |
| 3 | The Dark Knight Rises | PG-13 | tt1345836 | https://www.imdb.com/title/tt1345836 | The Dark Knight Rises Poster | https://m.media-amazon.com/images/M/MV5BMTk4OD... |
| 4 | John Carter | PG-13 | tt0401729 | https://www.imdb.com/title/tt0401729 | John Carter Poster | https://m.media-amazon.com/images/M/MV5BMDEwZm... |
| ... | ... | ... | ... | ... | ... | ... |
| 4993 | The Mongol King | PG-13 | tt0430371 | https://www.imdb.com/title/tt0430371 | The Mongol King Poster | https://m.media-amazon.com/images/M/MV5BMjA2Nz... |
| 4994 | Newlyweds | Not Rated | tt1880418 | https://www.imdb.com/title/tt1880418 | Newlyweds Poster | https://m.media-amazon.com/images/M/MV5BMjAzNT... |
| 4995 | The Following | TV-14 | tt2071645 | https://www.imdb.com/title/tt2071645 | The Following Poster | https://m.media-amazon.com/images/M/MV5BZjgzMD... |
| 4996 | Shanghai Calling | PG-13 | tt2070597 | https://www.imdb.com/title/tt2070597 | Shanghai Calling Poster | https://m.media-amazon.com/images/M/MV5BNjA1OD... |
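Both `movie` and `poster_df` have 4,740 rows, yet the merged frame shows row labels up to 4996. An inner join grows like this when a key value occurs more than once on both sides, which suggests (as an assumption, since the duplicates are not shown) that some `imdb_link` values are repeated in the dataset. The sketch below illustrates the arithmetic with plain Python: the number of joined rows is the sum, over keys, of the product of the per-side counts. The `inner_join_size` helper and the toy keys are hypothetical.

```python
from collections import Counter


def inner_join_size(left_keys, right_keys):
    """Number of rows an inner join on these key columns would produce."""
    left, right = Counter(left_keys), Counter(right_keys)
    return sum(n * right[key] for key, n in left.items())


keys = ["tt1", "tt2", "tt2", "tt3"]        # 'tt2' is listed twice
print(inner_join_size(keys, keys))         # 1*1 + 2*2 + 1*1 = 6 rows from 4

unique = list(dict.fromkeys(keys))         # drop repeats, keep order
print(inner_join_size(unique, unique))     # 3 rows from 3
```

If repeated links are indeed the cause, calling `drop_duplicates(subset='imdb_link')` on both frames before `pd.merge` would keep one row per movie, and passing `validate='one_to_one'` to `pd.merge` would raise an error instead of silently duplicating rows.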


In [55]:
df_movietotal.to_csv('movie_metadataWithPoster_rating.csv', sep='\t')

# Step 2: _Posters Download_

With the poster links stored in our dataset, we can download each poster with `urlretrieve`.

In [56]:
df_movietotal = pd.read_csv("movie_metadataWithPoster_rating.csv", sep='\t')

In [58]:
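# Treat empty ratings as missing, then drop every row that has a missing value.
# Assigning the result back avoids calling inplace on a column selection,
# which newer pandas versions may not apply to the original frame.
df_movietotal['content_rating'] = df_movietotal['content_rating'].replace('', np.nan)
df_movietotal.dropna(inplace=True)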

In [59]:
df_movietotal['content_rating']

0           PG-13
1           PG-13
2           PG-13
3           PG-13
4           PG-13
          ...    
4993        PG-13
4994    Not Rated
4995        TV-14
4996        PG-13
4997           PG
Name: content_rating, Length: 4996, dtype: object

In [60]:
df_poster = df_movietotal[['imdb_id','poster_link']]

In [61]:
df_poster

| | imdb_id | poster_link |
|---|---|---|
| 0 | tt0499549 | https://m.media-amazon.com/images/M/MV5BMTYwOT... |
| 1 | tt0449088 | https://m.media-amazon.com/images/M/MV5BMjIyNj... |
| 2 | tt2379713 | https://m.media-amazon.com/images/M/MV5BOWQ1MD... |
| 3 | tt1345836 | https://m.media-amazon.com/images/M/MV5BMTk4OD... |
| 4 | tt0401729 | https://m.media-amazon.com/images/M/MV5BMDEwZm... |
| ... | ... | ... |
| 4993 | tt0430371 | https://m.media-amazon.com/images/M/MV5BMjA2Nz... |
| 4994 | tt1880418 | https://m.media-amazon.com/images/M/MV5BMjAzNT... |
| 4995 | tt2071645 | https://m.media-amazon.com/images/M/MV5BZjgzMD... |
| 4996 | tt2070597 | https://m.media-amazon.com/images/M/MV5BNjA1OD... |


In [62]:
df_poster['poster_link']

0       https://m.media-amazon.com/images/M/MV5BMTYwOT...
1       https://m.media-amazon.com/images/M/MV5BMjIyNj...
2       https://m.media-amazon.com/images/M/MV5BOWQ1MD...
3       https://m.media-amazon.com/images/M/MV5BMTk4OD...
4       https://m.media-amazon.com/images/M/MV5BMDEwZm...
                              ...                        
4993    https://m.media-amazon.com/images/M/MV5BMjA2Nz...
4994    https://m.media-amazon.com/images/M/MV5BMjAzNT...
4995    https://m.media-amazon.com/images/M/MV5BZjgzMD...
4996    https://m.media-amazon.com/images/M/MV5BNjA1OD...
4997    https://m.media-amazon.com/images/M/MV5BMjM1ZG...
Name: poster_link, Length: 4996, dtype: object

In [63]:
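not_found = []
os.makedirs(savelocation, exist_ok=True)  # make sure the target folder exists
for index, row in tqdm(df_poster.iterrows()):
    url = row['poster_link']
    if "https://m.media-amazon.com/images/M" in str(url):
        imdb_id = row['imdb_id']  # named imdb_id to avoid shadowing the built-in id()
        jpgname = savelocation + imdb_id + '.jpg'
        urlretrieve(url, jpgname)
    else:
        # remember the index label of rows without a usable poster link
        not_found.append(index)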

4996it [07:47, 10.69it/s]
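The download loop stops at the first failed request, because `urlretrieve` raises on network and HTTP errors. A variant that records failures and carries on might look like the sketch below. It is an illustration under assumptions: `download_posters` and the fake fetcher are hypothetical, and the demo writes to a temporary folder instead of contacting any server.

```python
import os
import tempfile
from urllib.request import urlretrieve


def download_posters(items, save_dir, fetch=urlretrieve):
    """Download each (imdb_id, url) pair to save_dir/<imdb_id>.jpg.

    Returns the list of imdb_ids whose download raised an error.
    """
    os.makedirs(save_dir, exist_ok=True)
    failed = []
    for imdb_id, url in items:
        target = os.path.join(save_dir, f"{imdb_id}.jpg")
        try:
            fetch(url, target)
        except (OSError, ValueError):  # URLError and HTTPError are OSError subclasses
            failed.append(imdb_id)
    return failed


# Demo with a fake fetcher so the example runs offline.
def fake_fetch(url, target):
    if "missing" in url:
        raise OSError("404")
    with open(target, "wb") as fh:
        fh.write(b"\xff\xd8\xff")


demo_dir = tempfile.mkdtemp()
failed = download_posters(
    [
        ("tt0000001", "https://example.com/a.jpg"),
        ("tt0000002", "https://example.com/missing.jpg"),
    ],
    demo_dir,
    fetch=fake_fetch,
)
print(failed)
print(sorted(os.listdir(demo_dir)))
```

The returned ids could then be used to remove the corresponding rows from the dataset, in the same way the notebook handles rows without a poster link.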


In [64]:
# Verify that each downloaded file is a readable image
from os import listdir
from PIL import Image
   
for filename in listdir(savelocation):
    if filename.endswith('.jpg'):
        try:
            with Image.open(savelocation+filename) as img:  # open the image file
                img.verify()  # raises if the file is not a valid image
        except (IOError, SyntaxError) as e:
            print('Bad file:', filename)  # print the names of corrupt files
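The check above prints corrupt files but does not keep their names, so nothing downstream can act on them. As a lightweight complement, the sketch below collects suspicious files into a list by testing only the file signature: a JPEG file begins with the bytes `FF D8 FF`. This is a weaker test than Pillow's `verify()` (for example, it will not notice a truncated image) and is shown as an assumption-laden illustration; `find_non_jpegs` is a hypothetical helper.

```python
import os
import tempfile


def find_non_jpegs(folder):
    """Return the .jpg files in `folder` that do not start with FF D8 FF."""
    bad = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".jpg"):
            continue
        with open(os.path.join(folder, name), "rb") as fh:
            if fh.read(3) != b"\xff\xd8\xff":
                bad.append(name)
    return bad


# Demo on a temporary folder with one plausible and one broken file.
check_dir = tempfile.mkdtemp()
with open(os.path.join(check_dir, "tt0000001.jpg"), "wb") as fh:
    fh.write(b"\xff\xd8\xff\xe0rest-of-file")
with open(os.path.join(check_dir, "tt0000002.jpg"), "wb") as fh:
    fh.write(b"<html>error page</html>")

bad_files = find_non_jpegs(check_dir)
print(bad_files)
```

A server that answers with an HTML error page instead of an image is caught by this check, since the saved file then starts with text instead of the JPEG marker.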

In [65]:
# not_found holds index labels (from iterrows), so drop by label, not by position
df_movietotal.drop(index=not_found, inplace=True)

In [66]:
# Collect the "Unnamed: ..." columns, which are index columns written by earlier to_csv calls
columns_to_drop = [col for col in df_movietotal.columns if "Unnamed" in col]

In [67]:
df_movietotal.drop(columns_to_drop, axis=1, inplace=True)

In [68]:
df_movietotal.to_csv('movie_metadataWithPoster_rating.csv', sep='\t')