# Business Understanding

## Project Overview
For this project, you will use exploratory data analysis to generate insights for a business stakeholder.

## Business Problem
Your company now sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies. You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of your company's new movie studio can use to help decide what type of films to create.

## Project Objectives
### Main Objective
To analyze movie data and uncover patterns in sales, popularity, ratings, and director influence across genres, providing actionable insights for business growth and strategy.

### Specific Objectives
1. **Genre by Sales**  
   Identify which genres generate the most revenue and analyze trends contributing to their sales performance.  
   Tables: `bom.movie_gross.csv`, `rt.movie_info.tsv`

2. **Genre by Popularity**  
   Understand which genres are most popular among audiences and explore factors driving their popularity.  
   Tables: `bom.movie_gross.csv`, `tmdb.movies.csv`

3. **Genre by Rating**  
   Examine the ratings of movies across different genres to evaluate their critical reception.  
   DB: `im.db` Tables: `movie_basics`, `movie_ratings`

4. **Directors by Genre**  
   Determine which directors are most associated with specific genres and assess their impact on genre success.  
   DB: `im.db` Tables: `movie_basics`, `directors`

### The Data
In the folder `zippedData` are movie datasets from:

* [Box Office Mojo](https://www.boxofficemojo.com/)
* [IMDB](https://www.imdb.com/)
* [Rotten Tomatoes](https://www.rottentomatoes.com/)
* [TheMovieDB](https://www.themoviedb.org/)
* [The Numbers](https://www.the-numbers.com/)

Because it was collected from various locations, the different files have different formats. Some are compressed CSV (comma-separated values) or TSV (tab-separated values) files that can be opened using spreadsheet software or `pd.read_csv`, while the data from IMDB is located in a SQLite database.

![movie data erd](https://raw.githubusercontent.com/learn-co-curriculum/dsc-phase-2-project-v3/main/movie_data_erd.jpeg)

Note that the above diagram shows ONLY the IMDB data. You will need to look carefully at the features to figure out how the IMDB data relates to the other provided data files.

It is up to you to decide what data from this to use and how to use it. If you want to make this more challenging, you can scrape websites or make API calls to get additional data. If you are feeling overwhelmed or behind, we recommend you use only the following data files:

* `im.db.zip`
  * Zipped SQLite database (you will need to unzip then query using SQLite)
  * `movie_basics` and `movie_ratings` tables are most relevant
* `bom.movie_gross.csv.gz`
  * Compressed CSV file (you can open without expanding the file using `pd.read_csv`)
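
As a rough sketch of the two access patterns described above, the snippet below writes a tiny gzip-compressed CSV and a tiny SQLite file to a temporary directory, then reads each one back. The file names (`sample_gross.csv.gz`, `sample.db`) and the sample rows are made up for illustration; only `pd.read_csv`, `sqlite3.connect`, and `pd.read_sql` are the calls you would point at the real project files.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd

tmp_dir = tempfile.mkdtemp()

# Hypothetical stand-in for a compressed CSV such as bom.movie_gross.csv.gz
csv_path = os.path.join(tmp_dir, "sample_gross.csv.gz")
pd.DataFrame({"title": ["Movie A", "Movie B"], "year": [2010, 2011]}).to_csv(
    csv_path, index=False, compression="gzip"
)

# pandas infers gzip compression from the .gz extension, so no manual unzip is needed
gross_sample = pd.read_csv(csv_path)

# Hypothetical stand-in for the unzipped SQLite database
db_path = os.path.join(tmp_dir, "sample.db")
with closing(sqlite3.connect(db_path)) as conn:
    conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
    conn.execute("INSERT INTO movie_basics VALUES ('tt0000001', 'Movie A')")
    conn.commit()
    # read a whole table into a DataFrame with a plain SQL query
    basics_sample = pd.read_sql("SELECT * FROM movie_basics;", conn)

print(gross_sample.shape, basics_sample.shape)
```

Wrapping the connection in `contextlib.closing` is one way to make sure it is closed even if a query fails.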

# Data Understanding 

In [2]:
#importing libraries for data manipulation (pandas, numpy) and visualization (seaborn, matplotlib)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import sqlite3
import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')

In [3]:
# set the maximum number of columns to 40 to display all columns
pd.set_option('display.max_columns', 40)

<b>rt.movie_info.tsv</b>

In [4]:
movie_df = pd.read_csv('Datasets/rt.movie_info.tsv', sep='\t')
movie_df.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [5]:
movie_df.tail()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1555,1996,Forget terrorists or hijackers -- there's a ha...,R,Action and Adventure|Horror|Mystery and Suspense,,,"Aug 18, 2006","Jan 2, 2007",$,33886034.0,106 minutes,New Line Cinema
1556,1997,The popular Saturday Night Live sketch was exp...,PG,Comedy|Science Fiction and Fantasy,Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993","Apr 17, 2001",,,88 minutes,Paramount Vantage
1557,1998,"Based on a novel by Richard Powell, when the l...",G,Classics|Comedy|Drama|Musical and Performing Arts,Gordon Douglas,,"Jan 1, 1962","May 11, 2004",,,111 minutes,
1558,1999,The Sandlot is a coming-of-age story about a g...,PG,Comedy|Drama|Kids and Family|Sports and Fitness,David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993","Jan 29, 2002",,,101 minutes,
1559,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures


In [6]:
movie_df.shape

(1560, 12)

In [7]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


<b>bom.movie_gross.csv</b>

In [8]:
gross_df = pd.read_csv("Datasets/bom.movie_gross.csv")

In [9]:
gross_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [10]:
gross_df.tail()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018
3386,An Actor Prepares,Grav.,1700.0,,2018


In [11]:
gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [12]:
gross_df.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


<b>tmdb.movies.csv</b>

In [13]:
tmdb_df = pd.read_csv("Datasets/tmdb.movies.csv")
tmdb_df.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [14]:
tmdb_df.tail()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.6,2018-10-13,Laboratory Conditions,0.0,1
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.6,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.6,2018-10-01,The Last One,0.0,1
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.6,2018-06-22,Trailer Made,0.0,1
26516,26516,"[53, 27]",309885,en,The Church,0.6,2018-10-05,The Church,0.0,1


In [15]:
tmdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


In [16]:
tmdb_df.describe()

Unnamed: 0.1,Unnamed: 0,id,popularity,vote_average,vote_count
count,26517.0,26517.0,26517.0,26517.0,26517.0
mean,13258.0,295050.15326,3.130912,5.991281,194.224837
std,7654.94288,153661.615648,4.355229,1.852946,960.961095
min,0.0,27.0,0.6,0.0,1.0
25%,6629.0,157851.0,0.6,5.0,2.0
50%,13258.0,309581.0,1.374,6.0,5.0
75%,19887.0,419542.0,3.694,7.0,28.0
max,26516.0,608444.0,80.773,10.0,22186.0


<b>im.db</b>

In [17]:
# connect to the IMDB SQLite database
conn = sqlite3.connect('Datasets/im.db')


In [18]:
#getting table names
cursor = conn.cursor()
cursor.execute("""SELECT name
    FROM sqlite_master
    WHERE type = 'table';""")
print(cursor.fetchall())

[('movie_basics',), ('directors',), ('known_for',), ('movie_akas',), ('movie_ratings',), ('persons',), ('principals',), ('writers',)]
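
Once the table names are known, SQLite's `PRAGMA table_info` can show each table's columns before loading it in full. The sketch below runs against an in-memory database with a hypothetical `movie_ratings` table whose columns mirror the ones seen later in this notebook; it is an illustration, not a dump of the real schema.

```python
import sqlite3

import pandas as pd

# In-memory database used purely for illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER)"
)

# list the tables, as in the cell above, but straight into a DataFrame
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table';", demo_conn)

# one row per column: cid, name, type, notnull, dflt_value, pk
schema = pd.read_sql("PRAGMA table_info(movie_ratings);", demo_conn)

print(tables["name"].tolist())
print(schema[["name", "type"]])
demo_conn.close()
```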


In [19]:
mbasics_df = pd.read_sql("""SELECT * FROM movie_basics;""",conn)

In [20]:
mbasics_df.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [21]:
mbasics_df.tail()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,
146143,tt9916754,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,2013,,Documentary


In [22]:
mbasics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [23]:
rating_df = pd.read_sql("""SELECT * FROM movie_ratings;""",conn)

In [24]:
rating_df.head()

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [25]:
rating_df.tail()

Unnamed: 0,movie_id,averagerating,numvotes
73851,tt9805820,8.1,25
73852,tt9844256,7.5,24
73853,tt9851050,4.7,14
73854,tt9886934,7.0,5
73855,tt9894098,6.3,128


In [26]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [27]:
rating_df.describe()

Unnamed: 0,averagerating,numvotes
count,73856.0,73856.0
mean,6.332729,3523.662
std,1.474978,30294.02
min,1.0,5.0
25%,5.5,14.0
50%,6.5,49.0
75%,7.4,282.0
max,10.0,1841066.0


In [28]:
directors_df = pd.read_sql("""SELECT * FROM directors;""",conn)

In [29]:
directors_df.head()

Unnamed: 0,movie_id,person_id
0,tt0285252,nm0899854
1,tt0462036,nm1940585
2,tt0835418,nm0151540
3,tt0835418,nm0151540
4,tt0878654,nm0089502


In [30]:
directors_df.tail()

Unnamed: 0,movie_id,person_id
291169,tt8999974,nm10122357
291170,tt9001390,nm6711477
291171,tt9001494,nm10123242
291172,tt9001494,nm10123248
291173,tt9004986,nm4993825


In [31]:
directors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 291174 entries, 0 to 291173
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   movie_id   291174 non-null  object
 1   person_id  291174 non-null  object
dtypes: object(2)
memory usage: 4.4+ MB


In [32]:
#closing database
conn.close()

## Data Cleaning

### Missing Values

In [33]:
movie_df.isna().sum()

id                 0
synopsis          62
rating             3
genre              8
director         199
writer           449
theater_date     359
dvd_date         359
currency        1220
box_office      1220
runtime           30
studio          1066
dtype: int64

In [34]:
# drop the columns with more than 1000 null rows
movie_df = movie_df.drop(['currency', 'box_office', 'studio'],axis=1)
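
The three columns dropped here are the ones whose null counts exceed 1000 in the `isna().sum()` output. A minimal sketch of selecting such columns by threshold rather than by name follows; the sample frame and the `threshold` value are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: 'studio' is mostly missing, 'genre' has a single gap
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "studio": [None, None, None, "A", None],
    "genre": ["Drama", "Comedy", None, "Drama", "Horror"],
})

threshold = 3  # assumed cut-off for this toy example (the notebook uses 1000)

# count nulls per column and keep the names that exceed the threshold
null_counts = sample.isna().sum()
to_drop = null_counts[null_counts > threshold].index.tolist()

trimmed = sample.drop(columns=to_drop)
print(to_drop, list(trimmed.columns))
```

Computing the list keeps the drop step in sync with the data if the file is refreshed, at the cost of being less explicit than naming the columns.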

In [35]:
# replace missing genre values with the most frequent genre (mode)
genre_mode = movie_df['genre'].mode()[0]
movie_df['genre'] = movie_df['genre'].fillna(genre_mode)
movie_df['genre'].isna().sum()

0

In [36]:
# replace missing rating values with the most frequent rating (mode)
rating_mode = movie_df['rating'].mode()[0]
movie_df['rating'] = movie_df['rating'].fillna(rating_mode)
movie_df['rating'].isna().sum()

0
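
Mode imputation fills each gap with the most frequent category. The sketch below shows the pattern on a made-up `rating` column and assigns the result back to the column, which avoids relying on in-place modification of a column selected from the frame (recent pandas versions may warn about that pattern).

```python
import pandas as pd

# Hypothetical sample with one missing rating
ratings = pd.DataFrame({"rating": ["R", "PG", None, "R", "NR"]})

# mode() returns a Series because ties are possible; [0] takes the first value
rating_mode = ratings["rating"].mode()[0]

# assign the filled column back instead of using inplace=True
ratings["rating"] = ratings["rating"].fillna(rating_mode)

print(rating_mode, ratings["rating"].tolist())
```

With only 3 missing ratings and 8 missing genres out of 1560 rows, this choice should have little effect on the distributions, though dropping those rows would be an equally defensible option.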

In [37]:
#drop the rest with nulls
movie_df.dropna(inplace=True)
movie_df.isna().sum()

id              0
synopsis        0
rating          0
genre           0
director        0
writer          0
theater_date    0
dvd_date        0
runtime         0
dtype: int64

In [38]:
gross_df.isna().sum()

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64

In [39]:
# replace missing domestic and foreign gross values with 0
gross_df['foreign_gross'] = gross_df['foreign_gross'].fillna(0)
gross_df['domestic_gross'] = gross_df['domestic_gross'].fillna(0)

In [40]:
#drop the rest with nulls
gross_df.dropna(inplace=True)
gross_df.isna().sum()

title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64

In [41]:
# converting 'foreign_gross' to float 
gross_df['foreign_gross'] = pd.to_numeric(gross_df['foreign_gross'],errors='coerce')

# calculating 'total_gross' as the sum of 'domestic_gross' and 'foreign_gross'
gross_df['total_gross'] = gross_df['domestic_gross'] + gross_df['foreign_gross']

gross_df[['domestic_gross', 'foreign_gross', 'total_gross']].head()


Unnamed: 0,domestic_gross,foreign_gross,total_gross
0,415000000.0,652000000.0,1067000000.0
1,334200000.0,691300000.0,1025500000.0
2,296000000.0,664300000.0,960300000.0
3,292600000.0,535700000.0,828300000.0
4,238700000.0,513900000.0,752600000.0
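
`errors='coerce'` silently turns any string that cannot be parsed into `NaN`, and this conversion runs after the null handling above, so it can be worth checking whether any non-null entries were lost. The sketch below uses a made-up sample; the value `'1,131.6'` is a hypothetical example of a string with a thousands separator and is not taken from the output shown in this notebook.

```python
import pandas as pd

# Hypothetical object-typed column, similar in spirit to foreign_gross
sample = pd.DataFrame({"foreign_gross": ["652000000", "1,131.6", None, "0"]})

converted = pd.to_numeric(sample["foreign_gross"], errors="coerce")

# entries that were present as text but became NaN during conversion
lost_mask = converted.isna() & sample["foreign_gross"].notna()
lost = sample.loc[lost_mask, "foreign_gross"]
print(lost.tolist())

# one possible repair: strip the separator before converting
cleaned = pd.to_numeric(
    sample["foreign_gross"].str.replace(",", "", regex=False), errors="coerce"
)
print(cleaned.tolist())
```

If the real column does contain such strings, it is also worth verifying that they are expressed in the same unit as the other rows before adding them into `total_gross`.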


In [42]:
tmdb_df.isna().sum()

Unnamed: 0           0
genre_ids            0
id                   0
original_language    0
original_title       0
popularity           0
release_date         0
title                0
vote_average         0
vote_count           0
dtype: int64

In [43]:
mbasics_df.isna().sum()

movie_id               0
primary_title          0
original_title        21
start_year             0
runtime_minutes    31739
genres              5408
dtype: int64

In [44]:
rating_df.isna().sum()

movie_id         0
averagerating    0
numvotes         0
dtype: int64

In [45]:
directors_df.isna().sum()

movie_id     0
person_id    0
dtype: int64

### Changing Columns

In [46]:
# Renaming columns in tmdb_df
tmdb_df = tmdb_df.rename(columns={'Unnamed: 0': 'id', 'id': 'tmdb_id'})

tmdb_df.head()


Unnamed: 0,id,genre_ids,tmdb_id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


### Checking Duplicates

In [47]:
movie_df.duplicated().sum()

0

In [48]:
gross_df.duplicated().sum()

0

In [49]:
tmdb_df.duplicated().sum()

0

In [50]:
mbasics_df.duplicated().sum()

0

In [51]:
# drop_duplicates returns a new DataFrame, so assign the result back
mbasics_df = mbasics_df.drop_duplicates()
mbasics_df.duplicated().sum()

0

In [52]:
rating_df.duplicated().sum()

0

In [53]:
directors_df.duplicated().sum()

127639

In [54]:
directors_df.drop_duplicates(inplace=True)
directors_df.duplicated().sum()

0
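
`duplicated()` flags the second and later copies of rows that match on every column, so a film with two different directors keeps both rows while exact repeats are removed. Dropping the 127639 flagged rows from 291174 leaves 163535 rows in `directors_df`. The sketch below reproduces the behaviour on made-up IDs.

```python
import pandas as pd

# Hypothetical sample: one exact repeat, and one film with two directors
directors = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt2", "tt2"],
    "person_id": ["nm1", "nm1", "nm2", "nm3"],
})

# only the repeated (tt1, nm1) row counts as a duplicate
n_dupes = directors.duplicated().sum()

# reset_index gives the cleaned frame a gap-free index
deduped = directors.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```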

### Feature Engineering

In [55]:
# keep the number before the space in strings such as '104 minutes', then convert it to a numeric type
movie_df['runtime'] = movie_df['runtime'].str.split(" ").str[0]
movie_df['runtime'] = pd.to_numeric(movie_df['runtime'], errors='coerce')

# rename the column from 'runtime' to 'runtime_in_minutes'
movie_df = movie_df.rename(columns={'runtime': 'runtime_in_minutes'})

# preview the first few rows
movie_df['runtime_in_minutes'].head()

0    104
1    108
2    116
3    128
5     95
Name: runtime_in_minutes, dtype: int64
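
An alternative to splitting on the space is to pull out the digits with a regular expression, which also tolerates missing values and unexpected suffixes. This is a sketch on made-up strings, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical runtime strings, including one missing value
runtime = pd.Series(["104 minutes", "108 minutes", None, "95 minutes"])

# extract the first run of digits; expand=False returns a Series
digits = runtime.str.extract(r"(\d+)", expand=False)

# missing entries stay NaN, so the result is a float column
minutes = pd.to_numeric(digits, errors="coerce")

print(minutes.tolist())
```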

In [56]:
# Split 'genre' into 'main_genre' and 'supporting_genre'
movie_df['main_genre'] = movie_df['genre'].str.split('|').str[0]
movie_df['supporting_genre'] = movie_df['genre'].str.split('|').apply(lambda x: '|'.join(x[1:]) if len(x) > 1 else '')

# Preview the result
movie_df[['genre', 'main_genre', 'supporting_genre']].head()


Unnamed: 0,genre,main_genre,supporting_genre
0,Action and Adventure|Classics|Drama,Action and Adventure,Classics|Drama
1,Drama|Science Fiction and Fantasy,Drama,Science Fiction and Fantasy
2,Drama|Musical and Performing Arts,Drama,Musical and Performing Arts
3,Drama|Mystery and Suspense,Drama,Mystery and Suspense
5,Drama|Kids and Family,Drama,Kids and Family
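
Treating the first entry as the main genre assumes the order inside the `genre` string is meaningful, which is worth checking. If it is not, one option for genre-level analysis is to give every listed genre its own row. The sketch below does this with `explode` on two made-up rows copied from the preview format.

```python
import pandas as pd

# Hypothetical sample using the same pipe-separated format
movies = pd.DataFrame({
    "id": [1, 3],
    "genre": [
        "Action and Adventure|Classics|Drama",
        "Drama|Science Fiction and Fantasy",
    ],
})

# split each string into a list, then expand to one row per genre
long_genres = (
    movies.assign(genre=movies["genre"].str.split("|"))
    .explode("genre")
    .reset_index(drop=True)
)

# each film now contributes once to every genre it lists
genre_counts = long_genres["genre"].value_counts()
print(genre_counts)
```

With this layout a film is counted once per genre, so totals across genres will exceed the number of films.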


In [57]:
# Convert 'theater_date' and 'dvd_date' columns to datetime format
movie_df['theater_date'] = pd.to_datetime(movie_df['theater_date'], format='%b %d, %Y')
movie_df['dvd_date'] = pd.to_datetime(movie_df['dvd_date'], format='%b %d, %Y')

# preview the result
movie_df[['theater_date', 'dvd_date']].head()

Unnamed: 0,theater_date,dvd_date
0,1971-10-09,2001-09-25
1,2012-08-17,2013-01-01
2,1996-09-13,2000-04-18
3,1994-12-09,1997-08-27
5,2000-03-03,2000-07-11
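
Passing an explicit `format` makes parsing strict: any string that does not match raises an error. When a column may contain stray values, `errors='coerce'` converts them to `NaT` instead so they can be counted and inspected. A small sketch on made-up values:

```python
import pandas as pd

# Hypothetical date strings in the 'Oct 9, 1971' style, plus one bad value
dates = pd.Series(["Oct 9, 1971", "Aug 17, 2012", "not a date"])

# unparseable entries become NaT rather than raising
parsed = pd.to_datetime(dates, format="%b %d, %Y", errors="coerce")

# the .dt accessor then exposes parts such as the year
years = parsed.dt.year

print(parsed.isna().sum(), years.tolist())
```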


In [58]:
# convert 'release_date' to datetime format
tmdb_df['release_date'] = pd.to_datetime(tmdb_df['release_date'], format='%Y-%m-%d')

# extract the year and create a new column 'release_year'
tmdb_df['release_year'] = tmdb_df['release_date'].dt.year

tmdb_df[['release_date', 'release_year']].head()

Unnamed: 0,release_date,release_year
0,2010-11-19,2010
1,2010-03-26,2010
2,2010-05-07,2010
3,1995-11-22,1995
4,2010-07-16,2010


### Saving Dataset

In [59]:
movie_df.to_csv("Datasets/movie_info_clean.csv")

In [60]:
gross_df.to_csv("Datasets/movie_gross_clean.csv")

In [61]:
# merging mbasics_df and rating_df on 'movie_id'
movie_basics_rating_df = pd.merge(mbasics_df, rating_df, on='movie_id', how='inner')

movie_basics_rating_df.to_csv("Datasets/movie_basics_rating_clean.csv")
movie_basics_rating_df.head()


Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119
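
An inner join keeps only films that appear in both tables. Since `movie_basics` has 146144 rows and `movie_ratings` has 73856, at most 73856 rows can survive the merge. The sketch below, on made-up IDs, uses the `validate` and `indicator` options of `pd.merge` to confirm the key is unique on both sides and to count how many films have no rating.

```python
import pandas as pd

# Hypothetical samples: three films, two of them rated
basics = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "primary_title": ["A", "B", "C"],
})
ratings = pd.DataFrame({"movie_id": ["tt1", "tt3"], "averagerating": [7.0, 6.5]})

# validate raises if movie_id repeats on either side;
# indicator adds a _merge column naming each row's origin
merged = pd.merge(
    basics, ratings, on="movie_id", how="left",
    validate="one_to_one", indicator=True,
)

unrated = (merged["_merge"] == "left_only").sum()

# equivalent to the inner join used above
rated_only = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(unrated, len(rated_only))
```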


In [62]:
directors_df.to_csv("Datasets/director_clean.csv")
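
By default `to_csv` also writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again (the same column name seen in `tmdb.movies.csv`). The sketch below shows the round trip with and without `index=False`, using an in-memory buffer and made-up rows in place of a file.

```python
import io

import pandas as pd

# Hypothetical sample frame
sample = pd.DataFrame({"movie_id": ["tt1", "tt2"], "person_id": ["nm1", "nm2"]})

# default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```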

# Data Preparation


# Modeling


# Evaluation