# IMDbMovies

## Contents <a id='back'></a>

1. [Introduction](#introduction)
2. [Loading the Dataset](#loading-the-dataset)
3. [Data Cleaning](#data-cleaning)
   - 3.1 [Header style](#header-style)
   - 3.2 [Data Transformation](#data-transformation)
   - 3.3 [Handling Missing Values](#handling-missing-values)
4. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)
   - 4.1 [Overview of the Dataset](#overview-of-the-dataset)
   - 4.2 [Visualizing Data](#visualizing-data)

## 1. Introduction <a name="introduction"></a>

The IMDbMovies dataset contains information on more than 9,000 movies listed on IMDb. It includes key features such as title, director, writer, genres, runtime, release year, budget, and gross revenue.

The primary goal of this analysis is to gain deeper insight into how the film industry has evolved over the years.

[Back to contents](#back)

## 2. Loading the Dataset <a name="loading-the-dataset"></a>

The IMDbMovies dataset is available on [kaggle.com](https://www.kaggle.com/datasets/elvinrustam/imdb-movies-dataset). We will use the uncleaned version so that we can carry out our own cleanup.

The first step is to fix a few rows whose values are correct but shifted one column to the right. The affected movies are:
 
- Sleepy Hollow
- Texas Chainsaw
- G.I. Joe: Retaliation
- The Woman in the Window
- The Social Dilemma
- Bank of Dave
- Eden
- CHIPS
- The Diving Bell and the Butterfly
- Untamed Heart
- Roxanne
- The Postcard Killings
- Cover Girl

We'll fix these rows with Google Docs.
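For comparison, a one-column shift like this could also be undone in pandas. The snippet below is only a minimal sketch on a made-up frame: the names `toy`, `shift_row_left`, and `shifted_titles` are hypothetical, and it assumes the title landed in the second column and that nothing meaningful overflowed past the last column.

```python
import pandas as pd

# Made-up frame: row 0 has every value one column too far to the right.
toy = pd.DataFrame(
    {
        "Title": [None, "Movie B"],
        "Summary": ["Movie A", "Summary B"],
        "Director": ["Summary A", "Director B"],
        "Writer": ["Director A", "Writer B"],
    }
)


def shift_row_left(frame, row_label):
    """Return a copy of frame with one row's values moved one column to the left."""
    fixed = frame.copy()
    values = fixed.loc[row_label].tolist()
    # Drop the first (misplaced) cell and pad the end with a missing value.
    fixed.loc[row_label] = values[1:] + [None]
    return fixed


# Hypothetical list of titles that ended up in the second column.
shifted_titles = ["Movie A"]
mask = toy["Summary"].isin(shifted_titles)
for label in toy[mask].index:
    toy = shift_row_left(toy, label)

print(toy)
```

A scripted fix like this keeps the correction reproducible, whereas a manual edit has to be repeated whenever the raw file is downloaded again.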

Note that "The Fall of Minneapolis" has a similar issue, but it is also missing many values, and its budget value (302) is implausible. We'll drop this row.
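If the row were dropped in pandas rather than in the spreadsheet, it might look like the sketch below. The frame `toy` and the names `is_dropped` and `cleaned` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame(
    {
        "Title": ["Leo", "The Fall of Minneapolis", "Thanksgiving"],
        "Budget": [None, "302", None],
    }
)

# Keep every row except the one being discarded, then renumber the index.
is_dropped = toy["Title"] == "The Fall of Minneapolis"
cleaned = toy[~is_dropped].reset_index(drop=True)

print(cleaned)
```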

The rest of the preprocessing will be done in this notebook. We can now load the dataset.

In [3]:
import pandas as pd

df = pd.read_csv("../IMDbMovies.csv")

Let's see what the data looks like.

In [5]:
df.head(5)

Unnamed: 0,Title,Summary,Director,Writer,Main Genres,Motion Picture Rating,Runtime,Release Year,Rating,Number of Ratings,Budget,Gross in US & Canada,Gross worldwide,Opening Weekend Gross in US & Canada
0,Napoleon,An epic that details the checkered rise and fa...,Ridley Scott,David Scarpa,"Action,Adventure,Biography",R,2h 38m,2023.0,6.7/10,38K,,"$37,514,498","$84,968,381","$20,638,887Nov 26, 2023"
1,The Hunger Games: The Ballad of Songbirds & Sn...,Coriolanus Snow mentors and develops feelings ...,Francis Lawrence,"Michael Lesslie,Michael Arndt,Suzanne Collins","Action,Adventure,Drama",PG-13,2h 37m,2023.0,7.2/10,37K,"$100,000,000 (estimated)","$105,043,414","$191,729,235","$44,607,143Nov 19, 2023"
2,The Killer,"After a fateful near-miss, an assassin battles...",David Fincher,"Andrew Kevin Walker,Luc Jacamon,Alexis Nolent","Action,Adventure,Crime",R,1h 58m,2023.0,6.8/10,117K,,,"$421,332",
3,Leo,A 74-year-old lizard named Leo and his turtle ...,"David Wachtenheim,Robert Smigel,Robert Marianetti","Paul Sado,Robert Smigel,Adam Sandler","Animation,Comedy,Family",PG,1h 42m,2023.0,7.0/10,10K,,,,
4,Thanksgiving,"After a Black Friday riot ends in tragedy, a m...",Eli Roth,"Eli Roth,Jeff Rendell","Horror,Mystery,Thriller",R,1h 46m,2023.0,7.0/10,9.1K,,"$25,408,677","$29,666,585","$10,306,272Nov 19, 2023"


There seem to be many missing values in the Budget and Gross columns, and some columns are in formats that are awkward to work with. Time for cleanup.
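One way to put a number on the missing values is to count them per column. The sketch below uses a small hand-built frame (`sample`, a hypothetical name) with four of the rows from the preview above; on the real data the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Four rows copied from the preview; only three columns are kept for brevity.
sample = pd.DataFrame(
    {
        "Title": ["Napoleon", "The Killer", "Leo", "Thanksgiving"],
        "Budget": [np.nan, np.nan, np.nan, np.nan],
        "Gross worldwide": ["$84,968,381", "$421,332", np.nan, "$29,666,585"],
    }
)

# Number of missing values in each column.
missing_counts = sample.isna().sum()

# Fraction of missing values in each column (0.0 to 1.0).
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```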

[Back to contents](#back)

## 3. Data Cleaning <a name="data-cleaning"></a>

### 3.1 Header style <a name="header-style"></a>

First, we'll normalize the names of the existing columns: lowercase, with underscores instead of spaces.

In [9]:
base_columns_names = df.columns
base_columns_names

Index(['Title', 'Summary', 'Director', 'Writer', 'Main Genres',
       'Motion Picture Rating', 'Runtime', 'Release Year', 'Rating',
       'Number of Ratings', 'Budget', 'Gross in US & Canada',
       'Gross worldwide', 'Opening Weekend Gross in US & Canada'],
      dtype='object')

In [13]:
fixed_columns_names = {}

for name in base_columns_names:
    fixed_name = name.lower()
    # Turn "us & canada" into "us&canada" to avoid extra '_' characters in the next step.
    fixed_name = fixed_name.replace(' & ', '&')
    fixed_name = fixed_name.replace(' ', '_')
    fixed_columns_names[name] = fixed_name

fixed_columns_names

{'Title': 'title',
 'Summary': 'summary',
 'Director': 'director',
 'Writer': 'writer',
 'Main Genres': 'main_genres',
 'Motion Picture Rating': 'motion_picture_rating',
 'Runtime': 'runtime',
 'Release Year': 'release_year',
 'Rating': 'rating',
 'Number of Ratings': 'number_of_ratings',
 'Budget': 'budget',
 'Gross in US & Canada': 'gross_in_us&canada',
 'Gross worldwide': 'gross_worldwide',
 'Opening Weekend Gross in US & Canada': 'opening_weekend_gross_in_us&canada'}

In [15]:
df = df.rename(columns=fixed_columns_names)
df.columns

Index(['title', 'summary', 'director', 'writer', 'main_genres',
       'motion_picture_rating', 'runtime', 'release_year', 'rating',
       'number_of_ratings', 'budget', 'gross_in_us&canada', 'gross_worldwide',
       'opening_weekend_gross_in_us&canada'],
      dtype='object')
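The same renaming can also be written without an explicit loop, using the vectorized string methods on a pandas `Index`. This is just an alternative sketch; `raw_columns` and `snake_columns` are hypothetical names, and only a few of the column names are included.

```python
import pandas as pd

# A few of the original column names, for illustration.
raw_columns = pd.Index(
    ["Title", "Main Genres", "Gross in US & Canada", "Gross worldwide"]
)

# Lowercase, collapse " & " into "&", then replace remaining spaces with "_".
snake_columns = (
    raw_columns.str.lower()
    .str.replace(" & ", "&", regex=False)
    .str.replace(" ", "_", regex=False)
)

print(list(snake_columns))
```

On the real frame, the result could be assigned back with `df.columns = ...`. The dictionary approach used above has the advantage of keeping an explicit old-to-new mapping around for inspection.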

With the columns renamed, let's move on to improving the format of certain columns. We might end up renaming some of them again.

[Back to contents](#back)