
# Project: MOVIES DATASET CLEANUP

by 
**Godswill Enaohwo** 
godswilleo@gmail.com | https://linkedin.com/in/godswillenaohwo | +2347035600213



<a id='intro'></a>
# INTRODUCTION

This dataset contains information on 10,866 movies collected from The Movie Database (TMDb). Its columns include:

- Certain columns, like __cast__ and __genres__, which contain multiple values separated by pipe (|) characters.
- __popularity__, __vote_count__ and __vote_average__, which together indicate how popular a movie is among audiences.
- __budget__ and __revenue__, which hold the film's budget and the revenue it made, respectively.
- The final two columns, ending with ___adj__, which show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.

The dataset was obtained from kaggle.com.

# QUESTIONS TO ANSWER
The final analysis of this dataset is intended to answer the following questions:
> 1. Which movie genre was produced most in the last ten years?
> 2. What are the two most profitable movie genres in the last ten years?
> 3. What range of movie runtime is most preferred by moviegoers?
> 4. Which production company made the largest profit within the time frame covered by the dataset?
> 5. Which production company made the most films within the time frame covered by the dataset?
> 6. Which genre of movie is most produced by the company with the highest profit within the time frame covered by the dataset?

    


In [364]:
# importing the required dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [365]:
# The data is now loaded from the csv file into the notebook
mv_df = pd.read_csv('tmdb-movies.csv', sep=',')


## INVESTIGATING THE DATASET
The dataframe is now investigated to determine whether it contains problems that could hinder the ability to answer the questions above.

In [366]:
# checking out the datatype of the columns

mv_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

> * Some columns, like keywords, tagline and homepage, are not relevant to the analysis that will be carried out on the dataset.


In [367]:
# sampling five rows from the first 11 columns

mv_df.iloc[:, 0:11].sample(5)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords
10275,35588,tt0107004,0.625238,35000000,18635620,Geronimo: An American Legend,Jason Patric|Wes Studi|Gene Hackman|Robert Duv...,,Walter Hill,A Warrior. A Leader. A Legend.,apache|historical figure|historical|cavalry|am...
3457,60308,tt1210166,1.081676,50000000,110206216,Moneyball,Brad Pitt|Jonah Hill|Robin Wright|Philip Seymo...,http://www.moneyball-movie.com/,Bennett Miller,What are you really worth?,underdog|based on novel|baseball|teamwork|partner
412,226458,tt1945044,0.256286,0,0,Backmask,Stephen Lang|Kelly Blatz|Brittany Curran|Gage ...,,Marcus Nispel,nederlands,possession
6702,14171,tt0763840,0.546519,12000000,0,Home of the Brave,Samuel L. Jackson|Jessica Biel|50 Cent|Christi...,,Irwin Winkler,,post traumatic stress disorder|u.s. soldier|i...
3488,64685,tt0477302,0.907952,40000000,55247881,Extremely Loud & Incredibly Close,Thomas Horn|Tom Hanks|Sandra Bullock|Max von S...,http://extremelyloudandincrediblyclose.warnerb...,Stephen Daldry,"This is not a story about September 11th, it's...",based on novel|key|scavenger hunt|death of fat...


> * All the genres a movie falls under are stored in a single column, separated by "|".

In [368]:
# sampling five rows from the last 11 columns

mv_df.iloc[:, 10:21].sample(5)

Unnamed: 0,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
10664,world war ii|island|bomber|pianosa|american,A bombardier in World War II tries desperately...,121,War|Comedy|Drama,Paramount|Filmways Productions,6/24/70,30,6.7,1970,0.0,0.0
2315,independent film,Danny Foster doesn't have much: an apartment a...,0,Drama|Music|Romance,,8/14/10,29,7.3,2010,0.0,0.0
5318,canada|suffering|village|paralysis|independent...,A small mountain community in Canada is devast...,112,Drama,Fine Line Features,5/14/97,41,6.4,1997,6792302.0,4433451.0
3807,male nudity|horror|anthology|gay interest|driv...,It's the closing night at the last drive-in th...,120,Horror|Comedy,ArieScope Pictures,10/14/11,11,5.9,2011,0.0,0.0
6381,underdog|secret society|friendship|pistol|self...,"In a blue-collar American town, a group of tee...",105,Comedy|Crime|Drama|Romance,Zentropa Entertainments|Nimbus Film Production...,1/22/05,15,5.5,2005,0.0,0.0


In [369]:
# checking out the uppermost records of the dataset

mv_df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


In [370]:
# checking out the bottom records of the dataset

mv_df.tail()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
10861,21,tt0060371,0.080598,0,0,The Endless Summer,Michael Hynson|Robert August|Lord 'Tally Ho' B...,,Bruce Brown,,...,"The Endless Summer, by Bruce Brown, is one of ...",95,Documentary,Bruce Brown Films,6/15/66,11,7.4,1966,0.0,0.0
10862,20379,tt0060472,0.065543,0,0,Grand Prix,James Garner|Eva Marie Saint|Yves Montand|Tosh...,,John Frankenheimer,Cinerama sweeps YOU into a drama of speed and ...,...,Grand Prix driver Pete Aron is fired by his te...,176,Action|Adventure|Drama,Cherokee Productions|Joel Productions|Douglas ...,12/21/66,20,5.7,1966,0.0,0.0
10863,39768,tt0060161,0.065141,0,0,Beregis Avtomobilya,Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z...,,Eldar Ryazanov,,...,An insurance agent who moonlights as a carthie...,94,Mystery|Comedy,Mosfilm,1/1/66,11,6.5,1966,0.0,0.0
10864,21449,tt0061177,0.064317,0,0,"What's Up, Tiger Lily?",Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh...,,Woody Allen,WOODY ALLEN STRIKES BACK!,...,"In comic Woody Allen's film debut, he took the...",80,Action|Comedy,Benedict Pictures Corp.,11/2/66,22,5.4,1966,0.0,0.0
10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Harold P. Warren|Tom Neyman|John Reynolds|Dian...,,Harold P. Warren,It's Shocking! It's Beyond Your Imagination!,...,A family gets lost on the road and stumbles up...,74,Horror,Norm-Iris,11/15/66,15,1.5,1966,127642.279154,0.0


> It is observed that some movies have 0 as their budget and/or revenue figure. Such zeros are inaccurate: they most likely stand in for missing figures.
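A minimal sketch of how the zero entries could be counted and, optionally, masked. It uses three rows copied from the samples above; the `masked` frame and the choice to treat 0 as missing are assumptions for illustration, not a step taken in this notebook.

```python
import pandas as pd

# Three rows copied from the sample output above (title, budget, revenue)
sample = pd.DataFrame({
    "original_title": ["Moneyball", "Backmask", "Home of the Brave"],
    "budget": [50000000, 0, 12000000],
    "revenue": [110206216, 0, 0],
})

# Count how many rows carry a 0 in each money column
zero_counts = (sample[["budget", "revenue"]] == 0).sum()
print(zero_counts)

# Assumption: treat 0 as "unknown" so it is not mistaken for a real figure.
# Series.mask replaces the values where the condition is True with NaN.
masked = sample.assign(
    budget=sample["budget"].mask(sample["budget"] == 0),
    revenue=sample["revenue"].mask(sample["revenue"] == 0),
)
print(masked)
```

On the full dataframe the same comparison would be run against `mv_df` instead of the toy `sample`.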

In [371]:
# Checking for the total number of empty data on each column

mv_df.isna().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

> It is observed that 9 of the 21 columns contain missing values. In 4 of them (homepage, tagline, keywords and production_companies) the gaps are large enough to be a major issue, while the number of missing values in the remaining 5 columns is negligible.
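The split between major and minor gaps can be reproduced with a small sketch. The counts are copied from the `isna().sum()` output above; the `THRESHOLD` name and its value of 1,000 are assumptions chosen for illustration.

```python
import pandas as pd

# Missing-value counts copied from the isna().sum() output above
missing = pd.Series({
    "imdb_id": 10, "cast": 76, "homepage": 7930, "director": 44,
    "tagline": 2824, "keywords": 1493, "overview": 4, "genres": 23,
    "production_companies": 1030,
})

THRESHOLD = 1000  # assumed cut-off separating major gaps from minor ones

major = missing[missing > THRESHOLD].sort_values(ascending=False)
minor = missing[missing <= THRESHOLD].sort_values(ascending=False)

print("Major gaps:", list(major.index))
print("Minor gaps:", list(minor.index))
```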

In [372]:
# Checking for duplicate records

mv_df.duplicated().sum()

1

> One duplicate record is observed among the 10,866 records.
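Before removing a duplicate it can help to look at the rows involved. The sketch below uses a made-up toy frame (not rows from the dataset) to show how `duplicated(keep=False)` marks every copy of a repeated row, whereas the default only marks the later copies.

```python
import pandas as pd

# Toy frame with one fully repeated row (made-up values)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "original_title": ["A", "B", "B", "C"],
})

# Default behaviour: only the second copy is flagged
n_flagged = toy.duplicated().sum()

# keep=False flags every copy, which makes the pair easy to inspect
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first copy and removes the rest
deduped = toy.drop_duplicates()
```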

## Issues in the dataframe
After assessing the dataset, the following were noted as issues:
>    1. Missing data in some columns
>    2. One duplicate record
>    3. There is no column for profit
>    4. Genres are bundled together in one column, separated by "|"
>    5. Production companies are also bundled together in one column, separated by "|"
 



<a id='wrangling'></a>
## CleanUp

Having investigated the data and noted the observations, the cleanup of the dataframe will now commence to take care of the issues listed above.
    

### Issue 1. Fixing the Missing data in some columns

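> This will be fixed in two operations:
   1. The columns that contain more than 1,000 missing values and are not needed for the analysis (tagline, keywords and homepage) will be dropped. production_companies also has more than 1,000 missing values (1,030), but it is kept because the questions about production companies depend on it.
   2. The missing values in the remaining columns will be filled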

#### Operation 1: DROPPING THE COLUMNS WITH OVER 1000 EMPTY DATA

#### Code

In [373]:
mv_df.drop(['tagline','keywords','homepage'], axis=1, inplace=True)


#### Test

In [374]:
# checking to ensure the removed columns are no longer part of the dataframe

mv_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   director              10822 non-null  object 
 8   overview              10862 non-null  object 
 9   runtime               10866 non-null  int64  
 10  genres                10843 non-null  object 
 11  production_companies  9836 non-null   object 
 12  release_date          10866 non-null  object 
 13  vote_count            10866 non-null  int64  
 14  vote_average          10866 non-null  float64
 15  release_year       

#### Comment

The unnecessary columns, which also contained more than 1,000 missing values each, have now been removed.

#### Operation 2: FILLING COLUMNS WITH MISSING DATA
> This is done with a custom function that replaces the missing values in each string column with "No " followed by the name of the column.

#### Code

In [375]:
def fill_str_column(dataframe):
    # Fills missing values in string (object dtype) columns with "No " plus the column name
    # e.g. a column named "cast" will have its missing values filled with "No cast"
    
    for column in dataframe.columns:
        # isinstance(series, object) is always True, so check the column dtype instead
        if dataframe[column].dtype == object:
            dataframe[column] = dataframe[column].fillna("No " + column)
        
fill_str_column(mv_df)

#### Test


In [376]:
# Testing to ensure all missing string data has been filled
mv_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10866 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10866 non-null  object 
 7   director              10866 non-null  object 
 8   overview              10866 non-null  object 
 9   runtime               10866 non-null  int64  
 10  genres                10866 non-null  object 
 11  production_companies  10866 non-null  object 
 12  release_date          10866 non-null  object 
 13  vote_count            10866 non-null  int64  
 14  vote_average          10866 non-null  float64
 15  release_year       

#### Comment
All of the columns now contain the required number of records
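A note on restricting the fill to text columns: `isinstance(x, object)` is true for every Python value, so it cannot tell string columns from numeric ones. The sketch below, on a made-up toy frame, checks the column dtype through `pd.api.types.is_numeric_dtype` instead; this is one possible approach, not necessarily the one the notebook relies on.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with a gap in a text column and in a numeric column
toy = pd.DataFrame({
    "cast": ["Brad Pitt", None],
    "runtime": [124.0, np.nan],
})

# Every Python value is an instance of object, so this check never filters anything
print(isinstance(toy["runtime"], object))  # True

# Fill only the columns whose dtype is not numeric
for column in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[column]):
        toy[column] = toy[column].fillna("No " + column)

print(toy)
```

With this check the numeric gap stays as NaN rather than being overwritten with a string, which keeps the column usable for arithmetic.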

### ISSUE 2: THERE IS A DUPLICATE RECORD IN THE DATASET


#### Code

In [377]:
mv_df.drop_duplicates(inplace=True)

#### Test

In [378]:
mv_df.duplicated().sum()

0

### ISSUE 3: THERE IS NO COLUMN FOR PROFIT
In order to calculate the profit recorded by each movie, a profit column will be created for the dataset.
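To make the calculation concrete, here is a minimal sketch on two rows whose inflation-adjusted figures are copied from outputs in this notebook (as displayed, so they are rounded). The `profit_reliable` flag is a hypothetical addition that marks rows where a 0 budget or revenue makes the profit figure doubtful; it is not part of the cleanup performed here.

```python
import pandas as pd

# Adjusted figures copied from the notebook outputs (display-rounded)
toy = pd.DataFrame({
    "original_title": ["Jurassic World", "Imagine That"],
    "budget_adj": [137999900.0, 55902020.0],
    "revenue_adj": [1392446000.0, 0.0],
})

# Profit is the adjusted revenue minus the adjusted budget
toy["profit"] = toy["revenue_adj"] - toy["budget_adj"]

# Hypothetical flag: a profit is only trusted when both figures are non-zero
toy["profit_reliable"] = (toy["budget_adj"] > 0) & (toy["revenue_adj"] > 0)

print(toy)
```

The second row shows why the flag may be useful: a revenue of 0 turns the whole budget into an apparent loss.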

#### Code

In [486]:
# creating the Profit column

mv_df['profit'] = mv_df['revenue_adj'] - mv_df['budget_adj']

#### Test

In [487]:
mv_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10865 non-null  int64  
 1   imdb_id               10865 non-null  object 
 2   popularity            10865 non-null  float64
 3   budget                10865 non-null  int64  
 4   revenue               10865 non-null  int64  
 5   original_title        10865 non-null  object 
 6   cast                  10865 non-null  object 
 7   director              10865 non-null  object 
 8   overview              10865 non-null  object 
 9   runtime               10865 non-null  int64  
 10  genres                10865 non-null  object 
 11  production_companies  10865 non-null  object 
 12  release_date          10865 non-null  object 
 13  vote_count            10865 non-null  int64  
 14  vote_average          10865 non-null  float64
 15  release_year       

### ISSUE 4: ALL THE GENRES EACH MOVIE FALLS UNDER ARE BUNDLED IN ONE COLUMN SEPARATED BY "|"
The genres will now be separated into different records wherever a single film falls under multiple genres, in order to facilitate proper analysis of genres.

#### Code

In [533]:
# Build one output row per (movie, genre) pair
newobj = {column: [] for column in mv_df.columns}

for index, row in mv_df.iterrows():
    # split() returns a one-element list when there is no "|",
    # so single-genre movies need no separate branch
    for genre in row['genres'].split("|"):
        for column in mv_df.columns:
            # copy every field unchanged except genres, which gets the single genre
            newobj[column].append(genre if column == 'genres' else row[column])

In [535]:
# Creating a dataframe from the object above and then saving the dataframe as a csv file
pd.DataFrame(newobj).to_csv('genre_cleaned.csv', index=False)

#### Test

In [536]:
# assessing the newly created csv file

dfgenre = pd.read_csv("genre_cleaned.csv", sep=",")

In [537]:
dfgenre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26978 entries, 0 to 26977
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    26978 non-null  int64  
 1   imdb_id               26978 non-null  object 
 2   popularity            26978 non-null  float64
 3   budget                26978 non-null  int64  
 4   revenue               26978 non-null  int64  
 5   original_title        26978 non-null  object 
 6   cast                  26978 non-null  object 
 7   director              26978 non-null  object 
 8   overview              26978 non-null  object 
 9   runtime               26978 non-null  int64  
 10  genres                26978 non-null  object 
 11  production_companies  26978 non-null  object 
 12  release_date          26978 non-null  object 
 13  vote_count            26978 non-null  int64  
 14  vote_average          26978 non-null  float64
 15  release_year       

In [538]:
# A look at a record in the original dataframe
mv_df[mv_df['id'] == 19724]

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,director,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj,profit
1685,19724,tt0780567,0.336167,55000000,0,Imagine That,Eddie Murphy|Thomas Haden Church|Yara Shahidi|...,Karey Kirkpatrick,A financial executive who can't stop his caree...,107,Comedy|Drama|Family|Fantasy,Paramount Pictures|Di Bonaventura Pictures|Nic...,6/19/09,77,5.7,2009,55902020.0,0.0,-55902020.0


In [539]:
# A look at the same record in the new genre dataframe
dfgenre[dfgenre['id'] == 19724]

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,director,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj,profit
3819,19724,tt0780567,0.336167,55000000,0,Imagine That,Eddie Murphy|Thomas Haden Church|Yara Shahidi|...,Karey Kirkpatrick,A financial executive who can't stop his caree...,107,Comedy,Paramount Pictures|Di Bonaventura Pictures|Nic...,6/19/09,77,5.7,2009,55902020.0,0.0,-55902020.0
3820,19724,tt0780567,0.336167,55000000,0,Imagine That,Eddie Murphy|Thomas Haden Church|Yara Shahidi|...,Karey Kirkpatrick,A financial executive who can't stop his caree...,107,Drama,Paramount Pictures|Di Bonaventura Pictures|Nic...,6/19/09,77,5.7,2009,55902020.0,0.0,-55902020.0
3821,19724,tt0780567,0.336167,55000000,0,Imagine That,Eddie Murphy|Thomas Haden Church|Yara Shahidi|...,Karey Kirkpatrick,A financial executive who can't stop his caree...,107,Family,Paramount Pictures|Di Bonaventura Pictures|Nic...,6/19/09,77,5.7,2009,55902020.0,0.0,-55902020.0
3822,19724,tt0780567,0.336167,55000000,0,Imagine That,Eddie Murphy|Thomas Haden Church|Yara Shahidi|...,Karey Kirkpatrick,A financial executive who can't stop his caree...,107,Fantasy,Paramount Pictures|Di Bonaventura Pictures|Nic...,6/19/09,77,5.7,2009,55902020.0,0.0,-55902020.0


> A new record has now been created for each genre.
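The same one-row-per-genre reshaping can also be sketched with pandas' built-in `Series.str.split` and `DataFrame.explode`. This is an alternative technique, not the loop used in this notebook; the toy frame holds two rows whose titles and genres are copied from outputs above.

```python
import pandas as pd

# Two rows with genres copied from the notebook outputs
toy = pd.DataFrame({
    "id": [19724, 21],
    "original_title": ["Imagine That", "The Endless Summer"],
    "genres": ["Comedy|Drama|Family|Fantasy", "Documentary"],
})

exploded = (
    toy.assign(genres=toy["genres"].str.split("|"))  # "A|B" -> ["A", "B"]
       .explode("genres")                            # one row per list element
       .reset_index(drop=True)                       # renumber rows 0..n-1
)

print(exploded)
```

All other columns are repeated on each new row, which mirrors the result shown for "Imagine That" above.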

### ISSUE 5: ALL THE PRODUCTION COMPANY NAMES FOR EACH MOVIE ARE COMBINED IN ONE COLUMN SEPARATED BY "|"
The names of the production companies will be separated into different records, as was done for the genres. This makes it possible to analyse and answer questions 4 and 5.


In [549]:
# Build one output row per (movie, production company) pair
newobj_prod = {column: [] for column in mv_df.columns}

for index, row in mv_df.iterrows():
    # split() returns a one-element list when there is no "|",
    # so single-company movies need no separate branch
    for company in row['production_companies'].split("|"):
        for column in mv_df.columns:
            # copy every field unchanged except production_companies, which gets the single company
            newobj_prod[column].append(company if column == 'production_companies' else row[column])

In [550]:
# converting the object above to a dataframe and then saving it as a csv file
pd.DataFrame(newobj_prod).to_csv("prodname_cleaned.csv", index=False)

#### Test

In [551]:
# Accessing the csv file above
dfprod = pd.read_csv("prodname_cleaned.csv", sep=",")

In [552]:
dfprod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24255 entries, 0 to 24254
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    24255 non-null  int64  
 1   imdb_id               24255 non-null  object 
 2   popularity            24255 non-null  float64
 3   budget                24255 non-null  int64  
 4   revenue               24255 non-null  int64  
 5   original_title        24255 non-null  object 
 6   cast                  24255 non-null  object 
 7   director              24255 non-null  object 
 8   overview              24255 non-null  object 
 9   runtime               24255 non-null  int64  
 10  genres                24255 non-null  object 
 11  production_companies  24255 non-null  object 
 12  release_date          24255 non-null  object 
 13  vote_count            24255 non-null  int64  
 14  vote_average          24255 non-null  float64
 15  release_year       

In [553]:
# A look at a record in the original dataframe
mv_df[mv_df['id'] == 19724]

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,director,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj,profit
1685,19724,tt0780567,0.336167,55000000,0,Imagine That,Eddie Murphy|Thomas Haden Church|Yara Shahidi|...,Karey Kirkpatrick,A financial executive who can't stop his caree...,107,Comedy|Drama|Family|Fantasy,Paramount Pictures|Di Bonaventura Pictures|Nic...,6/19/09,77,5.7,2009,55902020.0,0.0,-55902020.0


In [555]:
# A look at the same record in the new production company dataframe
dfprod[dfprod['id'] == 19724]

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,director,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj,profit
4136,19724,tt0780567,0.336167,55000000,0,Imagine That,Eddie Murphy|Thomas Haden Church|Yara Shahidi|...,Karey Kirkpatrick,A financial executive who can't stop his caree...,107,Comedy|Drama|Family|Fantasy,Paramount Pictures,6/19/09,77,5.7,2009,55902020.0,0.0,-55902020.0
4137,19724,tt0780567,0.336167,55000000,0,Imagine That,Eddie Murphy|Thomas Haden Church|Yara Shahidi|...,Karey Kirkpatrick,A financial executive who can't stop his caree...,107,Comedy|Drama|Family|Fantasy,Di Bonaventura Pictures,6/19/09,77,5.7,2009,55902020.0,0.0,-55902020.0
4138,19724,tt0780567,0.336167,55000000,0,Imagine That,Eddie Murphy|Thomas Haden Church|Yara Shahidi|...,Karey Kirkpatrick,A financial executive who can't stop his caree...,107,Comedy|Drama|Family|Fantasy,Nickelodeon Movies,6/19/09,77,5.7,2009,55902020.0,0.0,-55902020.0
4139,19724,tt0780567,0.336167,55000000,0,Imagine That,Eddie Murphy|Thomas Haden Church|Yara Shahidi|...,Karey Kirkpatrick,A financial executive who can't stop his caree...,107,Comedy|Drama|Family|Fantasy,Goldcrest Pictures,6/19/09,77,5.7,2009,55902020.0,0.0,-55902020.0
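One further sanity check, sketched under assumptions: the number of rows after the split should equal the total number of "|"-separated names in the original column. The toy frame below uses the four company names shown above for "Imagine That" plus the single-company row for "Beregis Avtomobilya"; `expected_rows` is a hypothetical helper name.

```python
import pandas as pd

# Company names copied from the notebook outputs
toy = pd.DataFrame({
    "id": [19724, 39768],
    "production_companies": [
        "Paramount Pictures|Di Bonaventura Pictures|Nickelodeon Movies|Goldcrest Pictures",
        "Mosfilm",
    ],
})

# Each "|" adds one extra name, so names per row = separators + 1
expected_rows = toy["production_companies"].str.count(r"\|").add(1).sum()

exploded = (
    toy.assign(production_companies=toy["production_companies"].str.split("|"))
       .explode("production_companies")
       .reset_index(drop=True)
)

# The reshaped frame should have exactly one row per company name
print(len(exploded), expected_rows)
```

The same comparison could be run between `mv_df` and `dfprod` to confirm that no company name was lost or repeated during the split.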

