# Movie Data Analysis

Dataset: Information about movies from 1980-2020, scraped from IMDb

Link: https://www.kaggle.com/datasets/danielgrijalvas/movies

Business Understanding/Goal: Learn about what makes movies popular!

## 1. Import Libraries

In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Load Dataset

In [15]:
df = pd.read_csv("data/movies.csv")

## 3. Understand the Data

In [16]:
df.shape  # not echoed: a notebook cell only displays its last bare expression
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7668 entries, 0 to 7667
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      7668 non-null   object 
 1   rating    7591 non-null   object 
 2   genre     7668 non-null   object 
 3   year      7668 non-null   int64  
 4   released  7666 non-null   object 
 5   score     7665 non-null   float64
 6   votes     7665 non-null   float64
 7   director  7668 non-null   object 
 8   writer    7665 non-null   object 
 9   star      7667 non-null   object 
 10  country   7665 non-null   object 
 11  budget    5497 non-null   float64
 12  gross     7479 non-null   float64
 13  company   7651 non-null   object 
 14  runtime   7664 non-null   float64
dtypes: float64(5), int64(1), object(9)
memory usage: 898.7+ KB
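
The non-null counts above can be turned into missing-value counts and percentages. The sketch below is a minimal illustration: it re-enters a few of the counts from the `df.info()` output by hand rather than loading the CSV, and the names `total_rows`, `non_null`, and `summary` are made up for this example. On the real frame, `df.isna().sum()` should give the same missing counts directly.

```python
import pandas as pd

# Counts copied by hand from the df.info() output above (7668 rows in total).
total_rows = 7668
non_null = pd.Series({"rating": 7591, "budget": 5497, "gross": 7479, "company": 7651})

# Missing values per column, as a count and as a percentage of all rows.
missing = total_rows - non_null
missing_pct = (missing / total_rows * 100).round(1)

summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(summary.sort_values("missing", ascending=False))
```

By this count, `budget` has far more gaps than any other column, which matters for any later comparison of budget against gross.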


Columns that look useful:
* name, rating, genre, year, score, votes, director, star, country, budget, gross, company, runtime

From this, I could try to figure out:
* Which genres have the best scores
* Which directors have the most success at the box office
* How the rating (PG, PG-13, etc.) affects success at the box office
* Whether runtime affects viewers' opinion of the movie
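
As a rough sketch of how the first question might be approached, the snippet below groups by genre and averages the score. It is only an illustration: `sample` is a hand-built frame using the `genre` and `score` values of the nine rows visible in the duplicate-name listing later in this notebook, not the full dataset, so the ranking it prints says nothing about the real data.

```python
import pandas as pd

# Nine (genre, score) pairs copied from rows shown later in this notebook.
sample = pd.DataFrame({
    "genre": ["Horror", "Crime", "Drama", "Drama", "Adventure",
              "Horror", "Comedy", "Comedy", "Drama"],
    "score": [7.5, 5.2, 6.3, 7.7, 6.9, 5.8, 6.7, 6.0, 6.5],
})

# Mean score and number of movies per genre, best-scoring genre first.
genre_scores = (
    sample.groupby("genre")["score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(genre_scores)
```

Keeping the `count` column next to the mean makes it easier to spot genres whose average rests on only a handful of movies.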

In [17]:
df.describe()

| | year | score | votes | budget | gross | runtime |
|---|---|---|---|---|---|---|
| count | 7668.0 | 7665.0 | 7665.0 | 5497.0 | 7479.0 | 7664.0 |
| mean | 2000.405451 | 6.390411 | 88108.5 | 35589880.0 | 78500540.0 | 107.261613 |
| std | 11.153508 | 0.968842 | 163323.8 | 41457300.0 | 165725100.0 | 18.581247 |
| min | 1980.0 | 1.9 | 7.0 | 3000.0 | 309.0 | 55.0 |
| 25% | 1991.0 | 5.8 | 9100.0 | 10000000.0 | 4532056.0 | 95.0 |
| 50% | 2000.0 | 6.5 | 33000.0 | 20500000.0 | 20205760.0 | 104.0 |
| 75% | 2010.0 | 7.1 | 93000.0 | 45000000.0 | 76016690.0 | 116.0 |
| max | 2020.0 | 9.3 | 2400000.0 | 356000000.0 | 2847246000.0 | 366.0 |


## 4. Data Processing/Preparation

In [18]:
df.columns

Index(['name', 'rating', 'genre', 'year', 'released', 'score', 'votes',
       'director', 'writer', 'star', 'country', 'budget', 'gross', 'company',
       'runtime'],
      dtype='object')

### 4.1. Dropping Unnecessary Columns

In [19]:
df = df[['name', 'rating', 'genre', 'year', 'score', 'votes',
       'director', 'star', 'country', 'budget', 'gross', 'company',
       'runtime']]

df.shape

(7668, 13)
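
An alternative to listing every column to keep is to name only the columns to remove. The sketch below assumes the two columns left out above (`released` and `writer`) are the only ones being discarded; `empty` and `dropped` are hypothetical names, and the frame has no rows because only the column labels matter here.

```python
import pandas as pd

# The 15 column names reported by df.columns above.
columns = ['name', 'rating', 'genre', 'year', 'released', 'score', 'votes',
           'director', 'writer', 'star', 'country', 'budget', 'gross',
           'company', 'runtime']
empty = pd.DataFrame(columns=columns)

# Drop by name instead of selecting the 13 columns to keep.
dropped = empty.drop(columns=["released", "writer"])
print(list(dropped.columns))
```

Both forms should leave the same 13 columns in the same order; `drop` is shorter when only a few columns are removed.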

In [20]:
df = df.rename(columns={"name": "Name",
                        "rating": "Rating",
                        "genre": "Genre",
                        "year": "Year",
                        "score": "Score",
                        "votes": "Votes",
                        "director": "Director",
                        "star": "Star",
                        "country": "Country",
                        "budget": "Budget",
                        "gross": "Gross",
                        "company": "Company",
                        "runtime": "Runtime"})

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7668 entries, 0 to 7667
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      7668 non-null   object 
 1   Rating    7591 non-null   object 
 2   Genre     7668 non-null   object 
 3   Year      7668 non-null   int64  
 4   Score     7665 non-null   float64
 5   Votes     7665 non-null   float64
 6   Director  7668 non-null   object 
 7   Star      7667 non-null   object 
 8   Country   7665 non-null   object 
 9   Budget    5497 non-null   float64
 10  Gross     7479 non-null   float64
 11  Company   7651 non-null   object 
 12  Runtime   7664 non-null   float64
dtypes: float64(5), int64(1), object(7)
memory usage: 778.9+ KB
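
Since every new name here is just the old name with its first letter capitalized, the same rename could be written without a dictionary. This is a sketch of that shortcut, not a replacement for the cell above: `DataFrame.rename` accepts a function for `columns`, and `lower_case` and `renamed` are hypothetical names for an empty stand-in frame.

```python
import pandas as pd

# Stand-in frame with the 13 lower-case column names kept in section 4.1.
lower_case = pd.DataFrame(columns=['name', 'rating', 'genre', 'year', 'score',
                                   'votes', 'director', 'star', 'country',
                                   'budget', 'gross', 'company', 'runtime'])

# Apply str.capitalize to every column label.
renamed = lower_case.rename(columns=str.capitalize)
print(list(renamed.columns))
```

The explicit dictionary remains the safer choice if some columns later need names that are not a simple capitalization.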


### 4.2. Checking for Duplicates

In [21]:
df.duplicated().sum()  # counts fully identical rows; not echoed because it is not the cell's last expression

df.loc[df.duplicated(["Name"], keep=False)].sort_values(["Name"])

| | Name | Rating | Genre | Year | Score | Votes | Director | Star | Country | Budget | Gross | Company | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 483 | A Nightmare on Elm Street | R | Horror | 1984 | 7.5 | 212000.0 | Wes Craven | Heather Langenkamp | United States | 1800000.0 | 2.550714e+07 | New Line Cinema | 91.0 |
| 5712 | A Nightmare on Elm Street | R | Crime | 2010 | 5.2 | 95000.0 | Samuel Bayer | Jackie Earle Haley | United States | 35000000.0 | 1.156952e+08 | New Line Cinema | 95.0 |
| 7556 | After the Wedding | PG-13 | Drama | 2019 | 6.3 | 6700.0 | Bart Freundlich | Julianne Moore | United States | | 2.790019e+06 | Sony Pictures Classics | 112.0 |
| 4995 | After the Wedding | R | Drama | 2006 | 7.7 | 33000.0 | Susanne Bier | Mads Mikkelsen | Denmark | | 1.163272e+07 | Zentropa Entertainments | 120.0 |
| 7481 | Aladdin | PG | Adventure | 2019 | 6.9 | 239000.0 | Guy Ritchie | Will Smith | United Kingdom | 183000000.0 | 1.050694e+09 | Walt Disney Pictures | 128.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 163 | Venom | R | Horror | 1981 | 5.8 | 2600.0 | Piers Haggard | Klaus Kinski | United Kingdom | | 5.229643e+06 | Morison Film Group | 92.0 |
| 3708 | Where the Heart Is | PG-13 | Comedy | 2000 | 6.7 | 32000.0 | Matt Williams | Natalie Portman | United States | 15000000.0 | 4.086372e+07 | Twentieth Century Fox | 120.0 |
| 1812 | Where the Heart Is | R | Comedy | 1990 | 6.0 | 1500.0 | John Boorman | Dabney Coleman | United States | 15000000.0 | 1.106475e+06 | Touchstone Pictures | 107.0 |
| 836 | Wuthering Heights | Not Rated | Drama | 1985 | 6.5 | 339.0 | Jacques Rivette | Fabienne Babe | France | | | La Cecilia | 130.0 |
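
In the rows shown above, each pair of movies sharing a name has a different year, director, and star, so they look like remakes or unrelated films with the same title rather than duplicated records. One way to check this is to test for duplicates on `Name` and `Year` together. The sketch below does that on three of the visible pairs only (`sample` is a hand-built frame); whether the same holds for the elided rows would need to be checked on the full `df`.

```python
import pandas as pd

# Name/Year pairs copied from the rows visible in the output above.
sample = pd.DataFrame({
    "Name": ["A Nightmare on Elm Street", "A Nightmare on Elm Street",
             "After the Wedding", "After the Wedding",
             "Where the Heart Is", "Where the Heart Is"],
    "Year": [1984, 2010, 2019, 2006, 2000, 1990],
})

# Rows that share a name with another row.
same_name = sample.duplicated(["Name"], keep=False).sum()
# Rows that share both name and year with another row.
same_name_and_year = sample.duplicated(["Name", "Year"], keep=False).sum()

print(same_name, same_name_and_year)
```

If the second count is also zero on the full dataset, the repeated names can be kept as distinct movies.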
