# Top 100 Movies of All Time Analysis
## Data Processing

#### Loading the Dataset

In [22]:
import pandas as pd

df = pd.read_csv('../data/raw/Top_100_Movies.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,rank,title,description,genre,rating,id,year,imdbid,imdb_link,image
0,0,1,The Shawshank Redemption,Two imprisoned men bond over a number of years...,['Drama'],9.3,top1,1994,tt0111161,https://www.imdb.com/title/tt0111161,https://m.media-amazon.com/images/M/MV5BMDFkYT...
1,1,2,The Godfather,The aging patriarch of an organized crime dyna...,"['Crime', 'Drama']",9.2,top2,1972,tt0068646,https://www.imdb.com/title/tt0068646,https://m.media-amazon.com/images/M/MV5BM2MyNj...
2,2,3,The Dark Knight,When the menace known as the Joker wreaks havo...,"['Action', 'Crime', 'Drama']",9.0,top3,2008,tt0468569,https://www.imdb.com/title/tt0468569,https://m.media-amazon.com/images/M/MV5BMTMxNT...
3,3,4,The Godfather Part II,The early life and career of Vito Corleone in ...,"['Crime', 'Drama']",9.0,top4,1974,tt0071562,https://www.imdb.com/title/tt0071562,https://m.media-amazon.com/images/M/MV5BMWMwMG...
4,4,5,12 Angry Men,The jury in a New York City murder trial is fr...,"['Crime', 'Drama']",9.0,top5,1957,tt0050083,https://www.imdb.com/title/tt0050083,https://m.media-amazon.com/images/M/MV5BMWU4N2...


#### Scraping MPAA Rating and Runtime

In [23]:
# store imdb links in list for extra data extraction
imdb_links = df['imdb_link'].tolist()

In [24]:
# drop unnecessary columns and rename rating to imdb_rating
df.drop(columns=['Unnamed: 0', 'imdbid', 'image', 'imdb_link', 'description', 'id'], inplace=True)

df.rename(columns={'rating': 'imdb_rating'}, inplace=True)

df.head()

Unnamed: 0,rank,title,genre,imdb_rating,year
0,1,The Shawshank Redemption,['Drama'],9.3,1994
1,2,The Godfather,"['Crime', 'Drama']",9.2,1972
2,3,The Dark Knight,"['Action', 'Crime', 'Drama']",9.0,2008
3,4,The Godfather Part II,"['Crime', 'Drama']",9.0,1974
4,5,12 Angry Men,"['Crime', 'Drama']",9.0,1957


In [25]:
from bs4 import BeautifulSoup
import requests

In [26]:
# headers to allow scraping
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# initialize lists for ratings and times
rating_list = []
time_list = []

for link in imdb_links:
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # check to see if rating exists on IMDB and add to list
    try:
        rating = soup.select_one('div.iPPbjm li:nth-of-type(2) a').text
    except AttributeError:
        rating = 'Not Rated'
    rating_list.append(rating)
    
    # the runtime is the third list item when a rating is listed, otherwise the second
    try:
        runtime_text = soup.select_one('div.iPPbjm li:nth-of-type(3)').text
    except AttributeError:
        runtime_text = soup.select_one('div.iPPbjm li:nth-of-type(2)').text

    # convert a runtime such as '2h 22m', '2h', or '45m' to total minutes
    run_time = 0
    for part in runtime_text.split():
        if part.endswith('h'):
            run_time += int(part[:-1]) * 60
        elif part.endswith('m'):
            run_time += int(part[:-1])
    time_list.append(run_time)
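
The runtime conversion in the loop above can also be pulled into a small helper that is easy to test without hitting IMDb. The sketch below is a hypothetical `parse_runtime` function (not part of the original notebook), and it assumes runtimes are rendered as text such as `2h 22m`, `2h`, or `45m`.

```python
import re


def parse_runtime(text: str) -> int:
    """Convert a runtime string such as '2h 22m' to total minutes (hypothetical helper)."""
    # optional hours part followed by an optional minutes part
    match = re.fullmatch(r'\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*', text)
    if match is None or not any(match.groups()):
        raise ValueError(f'unrecognized runtime: {text!r}')
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


print(parse_runtime('2h 22m'))
print(parse_runtime('2h'))
print(parse_runtime('45m'))
```

Raising on unrecognized text, rather than silently returning 0, makes a page-layout change on the scraped site show up immediately.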

In [27]:
# add ratings and runtime to df
df['rating'] = rating_list
df['run_time'] = time_list

df.head()

Unnamed: 0,rank,title,genre,imdb_rating,year,rating,run_time
0,1,The Shawshank Redemption,['Drama'],9.3,1994,R,142
1,2,The Godfather,"['Crime', 'Drama']",9.2,1972,R,175
2,3,The Dark Knight,"['Action', 'Crime', 'Drama']",9.0,2008,PG-13,152
3,4,The Godfather Part II,"['Crime', 'Drama']",9.0,1974,R,202
4,5,12 Angry Men,"['Crime', 'Drama']",9.0,1957,Approved,96


#### Cleaning the Data Further

In [28]:
# floor each year to the start of its decade, e.g. 1994 -> 1990
df['decade'] = df['year'] // 10 * 10

df.head(5)

Unnamed: 0,rank,title,genre,imdb_rating,year,rating,run_time,decade
0,1,The Shawshank Redemption,['Drama'],9.3,1994,R,142,1990
1,2,The Godfather,"['Crime', 'Drama']",9.2,1972,R,175,1970
2,3,The Dark Knight,"['Action', 'Crime', 'Drama']",9.0,2008,PG-13,152,2000
3,4,The Godfather Part II,"['Crime', 'Drama']",9.0,1974,R,202,1970
4,5,12 Angry Men,"['Crime', 'Drama']",9.0,1957,Approved,96,1950


In [29]:
# split genres into their own columns and strip the list elements from the string
genre_split = df['genre'].str.replace(r"[',\[\]]", "", regex=True).str.split(' ', expand=True)

genre_split = genre_split.rename(columns={0: 'genre_1', 1: 'genre_2', 2: 'genre_3'})
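
The regex-and-split approach above works as long as no genre name contains a space. As an alternative sketch, the genre strings can be parsed with `ast.literal_eval` and spread across columns. The small frame below reuses three rows from the output shown earlier, and the `sample`, `genre_lists`, and `genre_cols` names are illustrative only.

```python
import ast

import pandas as pd

sample = pd.DataFrame({
    'title': ['The Shawshank Redemption', 'The Godfather', 'The Dark Knight'],
    'genre': ["['Drama']", "['Crime', 'Drama']", "['Action', 'Crime', 'Drama']"],
})

# parse each string into a real Python list
genre_lists = sample['genre'].apply(ast.literal_eval)

# spread the lists across columns; shorter lists are padded with missing values
genre_cols = pd.DataFrame(genre_lists.tolist(), index=sample.index)
genre_cols.columns = [f'genre_{i + 1}' for i in genre_cols.columns]

print(genre_cols)
```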

In [30]:
# add genre columns to original df
df = pd.concat([df, genre_split], axis=1)

df.head()

Unnamed: 0,rank,title,genre,imdb_rating,year,rating,run_time,decade,genre_1,genre_2,genre_3
0,1,The Shawshank Redemption,['Drama'],9.3,1994,R,142,1990,Drama,,
1,2,The Godfather,"['Crime', 'Drama']",9.2,1972,R,175,1970,Crime,Drama,
2,3,The Dark Knight,"['Action', 'Crime', 'Drama']",9.0,2008,PG-13,152,2000,Action,Crime,Drama
3,4,The Godfather Part II,"['Crime', 'Drama']",9.0,1974,R,202,1970,Crime,Drama,
4,5,12 Angry Men,"['Crime', 'Drama']",9.0,1957,Approved,96,1950,Crime,Drama,


In [31]:
# reorder columns so the split genres follow genre and decade follows year
df = df[['rank', 'title', 'genre', 'genre_1', 'genre_2', 'genre_3', 'imdb_rating', 'year', 'decade', 'rating', 'run_time']]

df

Unnamed: 0,rank,title,genre,genre_1,genre_2,genre_3,imdb_rating,year,decade,rating,run_time
0,1,The Shawshank Redemption,['Drama'],Drama,,,9.3,1994,1990,R,142
1,2,The Godfather,"['Crime', 'Drama']",Crime,Drama,,9.2,1972,1970,R,175
2,3,The Dark Knight,"['Action', 'Crime', 'Drama']",Action,Crime,Drama,9.0,2008,2000,PG-13,152
3,4,The Godfather Part II,"['Crime', 'Drama']",Crime,Drama,,9.0,1974,1970,R,202
4,5,12 Angry Men,"['Crime', 'Drama']",Crime,Drama,,9.0,1957,1950,Approved,96
...,...,...,...,...,...,...,...,...,...,...,...
95,96,Reservoir Dogs,"['Crime', 'Thriller']",Crime,Thriller,,8.3,1992,1990,R,99
96,97,Ikiru,['Drama'],Drama,,,8.3,1952,1950,Not Rated,143
97,98,Lawrence of Arabia,"['Adventure', 'Biography', 'Drama']",Adventure,Biography,Drama,8.3,1962,1960,PG,218
98,99,Citizen Kane,"['Drama', 'Mystery']",Drama,Mystery,,8.3,1941,1940,PG,119


#### Export the Dataframe

In [32]:
df.to_csv('../data/processed/top_100_movies_processed.csv', index=False, header=True)

## Data Analysis

#### Loading the Processed Dataset

In [3]:
import pandas as pd

df = pd.read_csv('../data/processed/top_100_movies_processed.csv')

df.head()

Unnamed: 0,rank,title,genre,genre_1,genre_2,genre_3,imdb_rating,year,decade,rating,run_time
0,1,The Shawshank Redemption,['Drama'],Drama,,,9.3,1994,1990,R,142
1,2,The Godfather,"['Crime', 'Drama']",Crime,Drama,,9.2,1972,1970,R,175
2,3,The Dark Knight,"['Action', 'Crime', 'Drama']",Action,Crime,Drama,9.0,2008,2000,PG-13,152
3,4,The Godfather Part II,"['Crime', 'Drama']",Crime,Drama,,9.0,1974,1970,R,202
4,5,12 Angry Men,"['Crime', 'Drama']",Crime,Drama,,9.0,1957,1950,Approved,96


#### Genre Analysis

In [4]:
# get unique genres from the 3 genre columns
unique_genres = set()

genre_1 = df['genre_1'].value_counts().index

genre_2 = df['genre_2'].value_counts().index

genre_3 = df['genre_3'].value_counts().index

unique_genres.update(genre_1, genre_2, genre_3)

unique_genres = sorted(list(unique_genres))

unique_genres

['Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'Film-Noir',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']

There are 20 unique genres in the top 100 movies.

In [5]:
import ast

# Convert string representation of lists to actual lists
df['genre'] = df['genre'].apply(ast.literal_eval)

In [6]:
exploded_df = df.explode('genre')

genres_df = exploded_df.groupby('genre')

genre_counts = genres_df['title'].count().sort_values(ascending=False)

In [7]:
import plotly.express as px

In [8]:
def plotly_bar_graph(df: pd.DataFrame, x, y, color = None, labels = None, title: str = None):
    fig = px.bar(df, x=x, y=y, color=color, 
             labels=labels,
             title=title)
    fig.update_xaxes(tickangle=-45)
    return fig

In [9]:
x = genre_counts.index
y = genre_counts.values
color = x
labels = {'y': 'Movie Count', 'genre': 'Genre'}
title = 'Top Movies by Genre'

fig = plotly_bar_graph(genre_counts, x, y, color, labels, title)

fig.show()

Drama is by far the most common genre in the top 100, while Musical and Film-Noir are the least common.

In [10]:
genre_ratings = genres_df['imdb_rating'].mean().round(2).sort_values(ascending=False)

x = genre_ratings.index
y = genre_ratings.values
color = x
labels = {'y': 'Average IMDB Rating', 'genre': 'Genre'}
title = 'Average IMDB Rating by Genre'

fig = plotly_bar_graph(genre_ratings, x, y, color, labels, title)

fig.update_yaxes(range=[8.25, 8.65])

fig.show()

The highest-rated genres on IMDb are History, Crime, and Western, while the lowest is Musical. The averages vary little across genres: the highest-rated genre averages 8.63 and the lowest averages 8.3. Musical is represented by a single movie, so its average reflects one rating.
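
Because some genres are represented by very few movies, it can help to read each average alongside the number of movies behind it. A minimal sketch, assuming a frame shaped like `exploded_df`: the five rows below come from the `df.head()` output earlier, so the numbers illustrate the method rather than the full dataset.

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['The Shawshank Redemption', 'The Godfather', 'The Dark Knight',
              'The Godfather Part II', '12 Angry Men'],
    'genre': [['Drama'], ['Crime', 'Drama'], ['Action', 'Crime', 'Drama'],
              ['Crime', 'Drama'], ['Crime', 'Drama']],
    'imdb_rating': [9.3, 9.2, 9.0, 9.0, 9.0],
})

# one row per (movie, genre) pair, then mean rating and movie count per genre
genre_summary = (
    sample.explode('genre')
    .groupby('genre')['imdb_rating']
    .agg(mean_rating='mean', movie_count='count')
    .round({'mean_rating': 2})
    .sort_values('mean_rating', ascending=False)
)

print(genre_summary)
```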

In [11]:
genre_decades_df = exploded_df.groupby(['genre', 'decade'])['title'].count().reset_index()

genre_decades_df = genre_decades_df.pivot(index='genre', columns='decade')['title'].fillna(0)

fig = px.imshow(genre_decades_df, x=genre_decades_df.columns, y=genre_decades_df.index, text_auto=True, color_continuous_scale='Cividis', 
                title='Top 100 Movies Genre Popularity Across Decades')
fig.update_layout(width=1000,height=1000)
fig.show()

The majority of the top 100 movies were released in the 1980s or later. The earliest decade in the dataset is the 1930s, with 9 movies, and the most recent is the 2020s, with 6 movies. Drama is the only genre with a top 100 movie in every decade. Musical and Film-Noir each have their only top 100 movie in the 1950s. Action, adventure, drama, and crime movies are heavily concentrated in the 1980s through the 2010s.
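
Each cell of the heatmap counts genre tags, so a movie with three genres contributes to three cells. To count movies per decade without that double counting, one option is to count distinct titles. The sketch below uses the five rows shown in `df.head()` and is an illustration, not the notebook's own method.

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['The Shawshank Redemption', 'The Godfather', 'The Dark Knight',
              'The Godfather Part II', '12 Angry Men'],
    'genre': [['Drama'], ['Crime', 'Drama'], ['Action', 'Crime', 'Drama'],
              ['Crime', 'Drama'], ['Crime', 'Drama']],
    'decade': [1990, 1970, 2000, 1970, 1950],
})

exploded = sample.explode('genre')

# genre tags per decade: what a heatmap column sums to
tags_per_decade = exploded.groupby('decade')['title'].count()

# distinct movies per decade: each title counted once
movies_per_decade = sample.groupby('decade')['title'].nunique()

comparison = pd.DataFrame({'genre_tags': tags_per_decade, 'movies': movies_per_decade})
print(comparison)
```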

In [12]:
import plotly.graph_objects as go

fig = go.Figure()
for i in range(len(genre_decades_df.index)):
    fig.add_trace(go.Scatter(x=genre_decades_df.columns, y=genre_decades_df.values[i],
                    name=genre_decades_df.index[i]
                    ))
    
fig.update_layout(height=750, title_text="Top 100 Movies Genre Popularity Trends Across Decades", xaxis_title='Decade of Release', yaxis_title='Movie Count')

fig.show()

###### Top 100 Movies Genre Popularity Across Decades
- Sci-Fi: No Sci-Fi movies from the 1930s through the 1950s appear in the top 100. The genre first appears in the 1960s with one movie, then grows steadily from the 1970s to the 2010s, with one, two, two, two, and three movies in the respective decades. This increase halted in the 2020s, which have no Sci-Fi movie in the top 100.
- Thriller: 
- War:
- Western:
- Action: Action movies started to gain popularity in the 1950s-1970s, with one movie released each decade, while the previous two decades don't have a movie in the top 100. The genre saw significant growth in the 1980s, with four movies released. It peaked in popularity during the 1990s, 2000s, and 2010s, with five movies released in each of those decades. However, the genre saw a decline in the 2020s, dropping back down to just one movie.
- Adventure: Adventure movies began to gain popularity in the 1960s, with three movies released. However, the genre experienced a decline in the 1970s, with only one movie released. The genre then saw significant growth from the 1980s through the 2010s, with five, four, six, and seven movies released in each respective decade. In the 2020s, the genre regressed, with just one movie released.
- Animation: Animation movies didn't start to gain popularity until the 1980s, with one movie released. The genre was better represented from the 1990s to the 2010s, with three, two, and four movies released in the respective decades. In the 2020s, the genre regressed, with only one movie released.
- Biography: Biographical movies don't appear in the top 100 until the 1960s, which have one movie. The 1970s saw no releases in this genre. From the 1980s to the 2010s the genre stayed small, with one movie released in the 1980s, three in the 1990s, and one each in the 2000s and 2010s.
- Comedy: Comedy movies started off popular, with two movies released in the 1930s and one movie released per decade from the 1940s to the 1960s. However, the 1970s saw no releases in this genre. The genre recovered between the 1980s and the 2010s, with one, two, one, and three movies released in the respective decades. In the 2020s, the genre regressed, with no movies released.
- Crime: Crime movies were somewhat popular from the 1930s to the 1980s, with one movie released in each of the 1930s, 1960s, and 1980s; two movies in each of the 1950s and 1970s; and none in the 1940s. The genre saw a rapid surge in the 1990s, with nine movies released. However, it gradually declined over the next three decades, with three movies released in the 2000s, one in the 2010s, and none in the 2020s.
- Drama: Drama movies were very popular from the 1930s to the 1950s, experiencing steady growth with two, four, and six movies released in each respective decade. The genre saw a slight decline in the 1960s, with three movies released. However, it experienced rapid growth from the 1970s to the 1990s, with four, six, and nineteen movies released in each respective decade. Although the genre saw a slight decline in the 2000s, with fourteen movies released, and the 2010s, with eleven movies released, it remained highly popular. In the 2020s, no movies were released in the genre.
- Family: Family movies gained some popularity in the 1940s, with one movie released. The genre didn’t see further growth until the 2000s, when two movies were released. However, these were the only two decades that experienced any growth, as the genre failed to see any other releases in the remaining decades.
- Fantasy: Fantasy movies gained some popularity in the 1940s, with one movie released. The genre then declined, with no movies released in the 1950s and 1960s. However, there was a steady increase in popularity from the 1970s to the 1980s, with one and two movies released in each respective decade. After this peak, the genre steadily declined, with one movie released in the 1990s and none in the 2000s. There was a small resurgence in the 2010s with one movie released, but the genre regressed again in the 2020s with no movies released.
- Film-Noir: Film-Noir movies experienced a single spike in popularity in the 1950s, with one movie released. This was the only decade that saw any growth, as the genre failed to see any other releases in the earlier or later decades.
- History: History movies saw their first spike in popularity in the 1990s, with two movies released, and a second spike in the 2020s. These were the only decades to see any growth, as the genre had no releases in any other decade.
- Horror: Horror movies gained consistent popularity from the 1960s to the 1980s, with one movie released in each of these decades. However, these were the only decades that saw any growth, as the genre failed to see any other releases in the following decades.
- Music: Music movies saw a brief increase in popularity in the 1980s, with one movie released before the genre regressed in the 1990s. This decline was short-lived, as the genre experienced consistent popularity in the 2000s and 2010s, with one movie released in each decade. However, this growth did not continue, as the genre failed to see a release in the 2020s.
- Musical: Musical movies experienced a single spike in popularity in the 1950s, with one movie released. This was the only decade to see any growth, as the genre failed to have any other releases in the remaining decades.
- Mystery: Mystery movies remained popular from the 1930s to the 1970s, with one, one, two, three, and one movie released in the respective decades. However, this popularity did not extend into the 1980s, as no movies were released. This decline was short-lived, as the genre's popularity rapidly increased in the 1990s and 2000s, with two and four movies released, respectively. Despite this resurgence, the genre's popularity declined again, with no releases in the 2010s and 2020s.
- Romance: Romance movies remained popular from the 1930s to the 1950s, with two, one, and one movie released in the respective decades. The genre experienced a lull in the 1960s and 1970s, with no movies released. This decline ended in the 1980s with the release of one movie. The genre then peaked in the 1990s, with three movies released, before regressing in the 2000s with just one movie. This marked the end of the genre's popularity, as no movies were released in the 2010s or 2020s. 
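
Descriptions like the ones above can be cross-checked against the pivot table by extracting, for each genre, the first decade with a top 100 movie, the peak decade, and the total count. A rough sketch, assuming a table shaped like `genre_decades_df` (genres as rows, decades as columns): the small table here is built from the five rows shown in `df.head()`, and the `summarize_genres` helper is hypothetical.

```python
import pandas as pd


def summarize_genres(genre_decades: pd.DataFrame) -> pd.DataFrame:
    """Summarize a genre-by-decade count table, one row per genre (hypothetical helper)."""
    # 1 where a decade has at least one movie; idxmax returns the first such decade
    # (a genre with no movies at all would report the first column, so filter those out first)
    has_movie = genre_decades.gt(0).astype(int)
    return pd.DataFrame({
        'first_decade': has_movie.idxmax(axis=1),
        'peak_decade': genre_decades.idxmax(axis=1),
        'peak_count': genre_decades.max(axis=1),
        'total': genre_decades.sum(axis=1),
    })


# counts derived from the five sample rows: decades as columns, genres as rows
sample = pd.DataFrame(
    {1950: [0, 1, 1], 1970: [0, 2, 2], 1990: [0, 0, 1], 2000: [1, 1, 1]},
    index=pd.Index(['Action', 'Crime', 'Drama'], name='genre'),
)

summary = summarize_genres(sample)
print(summary)
```

When several decades tie for the peak, `idxmax` reports the earliest one, so ties are worth checking by eye.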