# Cleaning Movie Dataset

The dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows). It has many columns that we don't need here. I am going to clean the dataset, remove the unwanted columns, and save the result as `test.csv`.

## 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
movie_df = pd.read_csv("data/imdb_top_1000.csv")
print(movie_df.head(2))

                                         Poster_Link  \
0  https://m.media-amazon.com/images/M/MV5BMDFkYT...   
1  https://m.media-amazon.com/images/M/MV5BM2MyNj...   

               Series_Title Released_Year Certificate  Runtime         Genre  \
0  The Shawshank Redemption          1994           A  142 min         Drama   
1             The Godfather          1972           A  175 min  Crime, Drama   

   IMDB_Rating                                           Overview  Meta_score  \
0          9.3  Two imprisoned men bond over a number of years...        80.0   
1          9.2  An organized crime dynasty's aging patriarch t...       100.0   

               Director          Star1           Star2       Star3  \
0        Frank Darabont    Tim Robbins  Morgan Freeman  Bob Gunton   
1  Francis Ford Coppola  Marlon Brando       Al Pacino  James Caan   

            Star4  No_of_Votes        Gross  
0  William Sadler      2343110   28,341,469  
1    Diane Keaton      1620367  134,966,411

In [3]:
# Let us now look at the columns and decide which to keep and which to drop.
movie_df.dtypes

Poster_Link       object
Series_Title      object
Released_Year     object
Certificate       object
Runtime           object
Genre             object
IMDB_Rating      float64
Overview          object
Meta_score       float64
Director          object
Star1             object
Star2             object
Star3             object
Star4             object
No_of_Votes        int64
Gross             object
dtype: object

We need only the following columns: `Series_Title`, `Director`, `Star1`, `Star2`, `Star3`, and `Star4`. Everything else can be dropped from the dataset.

In [4]:
movie_df.drop(["Poster_Link", "Released_Year", "Certificate", "Runtime", "Genre", "IMDB_Rating", "Overview", "Meta_score", "No_of_Votes", "Gross"], axis = 1, inplace = True)
movie_df.head()

|   | Series_Title | Director | Star1 | Star2 | Star3 | Star4 |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler |
| 1 | The Godfather | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton |
| 2 | The Dark Knight | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine |
| 3 | The Godfather: Part II | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton |
| 4 | 12 Angry Men | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler |
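
The same result can be reached by naming the columns to keep instead of the ones to drop, which may be less fragile if the source file ever gains extra columns. The following is a minimal sketch on a two-row stand-in for the real file; `sample_df`, `keep_cols`, and `selected_df` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Two-row stand-in for data/imdb_top_1000.csv (illustrative only,
# with a couple of unwanted columns left in on purpose).
sample_df = pd.DataFrame({
    "Series_Title": ["The Shawshank Redemption", "The Godfather"],
    "Released_Year": ["1994", "1972"],
    "Director": ["Frank Darabont", "Francis Ford Coppola"],
    "Star1": ["Tim Robbins", "Marlon Brando"],
    "Star2": ["Morgan Freeman", "Al Pacino"],
    "Star3": ["Bob Gunton", "James Caan"],
    "Star4": ["William Sadler", "Diane Keaton"],
    "Gross": ["28,341,469", "134,966,411"],
})

keep_cols = ["Series_Title", "Director", "Star1", "Star2", "Star3", "Star4"]

# Selecting by name returns only these columns, in this order.
# .copy() makes the result independent of the original frame.
selected_df = sample_df[keep_cols].copy()
print(selected_df.columns.tolist())
```

With the real file, passing `usecols=keep_cols` to `pd.read_csv` should have a similar effect while skipping the unwanted columns at load time.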


I want all four stars of a movie in a single column called `Actors`, so I will combine the `Star1` to `Star4` columns into one list per row.

In [5]:
movie_df["Actors"] = movie_df.apply(lambda record : [record["Star1"],record["Star2"],record["Star3"],record["Star4"]], axis = 1)

In [6]:
movie_df.head()

|   | Series_Title | Director | Star1 | Star2 | Star3 | Star4 | Actors |
|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | [Tim Robbins, Morgan Freeman, Bob Gunton, Will... |
| 1 | The Godfather | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | [Marlon Brando, Al Pacino, James Caan, Diane K... |
| 2 | The Dark Knight | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | [Christian Bale, Heath Ledger, Aaron Eckhart, ... |
| 3 | The Godfather: Part II | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | [Al Pacino, Robert De Niro, Robert Duvall, Dia... |
| 4 | 12 Angry Men | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | [Henry Fonda, Lee J. Cobb, Martin Balsam, John... |
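
The `apply` call with `axis = 1` runs a Python lambda once per row. For 1000 rows that is perfectly fine, but the same `Actors` column can also be built without a row-wise lambda. This is a sketch of one alternative on a small stand-in frame (`sample_df` and `star_cols` are illustrative names), not a claim that it is needed here.

```python
import pandas as pd

star_cols = ["Star1", "Star2", "Star3", "Star4"]

# Stand-in frame with the same star columns as the notebook.
sample_df = pd.DataFrame({
    "Series_Title": ["The Shawshank Redemption", "The Godfather"],
    "Star1": ["Tim Robbins", "Marlon Brando"],
    "Star2": ["Morgan Freeman", "Al Pacino"],
    "Star3": ["Bob Gunton", "James Caan"],
    "Star4": ["William Sadler", "Diane Keaton"],
})

# to_numpy() gives a 2-D array of the star columns; tolist() turns it
# into one Python list per row, which is assigned as a column of lists.
sample_df["Actors"] = sample_df[star_cols].to_numpy().tolist()
print(sample_df["Actors"].iloc[0])
```

Each cell of `Actors` holds a plain Python list, the same shape of value the lambda version produces.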


Now we can drop the individual `Star` columns.

In [7]:
movie_df.drop(["Star1","Star2","Star3","Star4"], axis = 1, inplace = True)
movie_df.head()

|   | Series_Title | Director | Actors |
|---|---|---|---|
| 0 | The Shawshank Redemption | Frank Darabont | [Tim Robbins, Morgan Freeman, Bob Gunton, Will... |
| 1 | The Godfather | Francis Ford Coppola | [Marlon Brando, Al Pacino, James Caan, Diane K... |
| 2 | The Dark Knight | Christopher Nolan | [Christian Bale, Heath Ledger, Aaron Eckhart, ... |
| 3 | The Godfather: Part II | Francis Ford Coppola | [Al Pacino, Robert De Niro, Robert Duvall, Dia... |
| 4 | 12 Angry Men | Sidney Lumet | [Henry Fonda, Lee J. Cobb, Martin Balsam, John... |


Let's export this to CSV so that we can explore the dataset in a graph database.

In [8]:
# Keep the index in the output so that it can serve as each movie's ID.
movie_df.to_csv("data/test.csv", index = True)
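
One thing to watch when loading this file later: CSV has no list type, so each `Actors` list is written as its string representation and comes back as a plain string. The sketch below shows one way to recover the lists with `ast.literal_eval`. It writes to an in-memory buffer instead of `data/test.csv`, and the `index_label="id"` argument is an assumption added here to give the ID column a name; the notebook itself leaves the index header blank.

```python
import ast
import io

import pandas as pd

# One-row stand-in for the cleaned frame (illustrative only).
sample_df = pd.DataFrame({
    "Series_Title": ["12 Angry Men"],
    "Director": ["Sidney Lumet"],
    "Actors": [["Henry Fonda", "Lee J. Cobb", "Martin Balsam", "John Fiedler"]],
})

# Write to an in-memory buffer; index_label names the index column "id".
buffer = io.StringIO()
sample_df.to_csv(buffer, index=True, index_label="id")

# Plain read: the Actors cell is a string such as "['Henry Fonda', ...]".
buffer.seek(0)
plain_df = pd.read_csv(buffer, index_col="id")
print(type(plain_df.loc[0, "Actors"]))

# Read with a converter: literal_eval parses the string back into a list.
buffer.seek(0)
parsed_df = pd.read_csv(
    buffer, index_col="id", converters={"Actors": ast.literal_eval}
)
print(parsed_df.loc[0, "Actors"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.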