# Instructions

For Part A, you need to scrape an IMDb web page to find the top movies sorted by user votes. For each movie, you need to pull:
- movie_id
- rank
- title 
- year
- rating
- runtime
- votes 

The URL of a page that includes movies released between 2018 and 2020, sorted by number of votes, is: 

https://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start=1&title_type=feature&year=2018,2020


Please click the URL and investigate how you can pull movie_id, rank, title, and the other fields from the webpage. This webpage, however, only includes 50 movies. Hence, if you want to extract the top 250 movies between 2018 and 2020 according to the number of votes, you need to click “next” 4 times and parse 4 more pages. Fortunately, we can do this by modifying the URL a little. For example, the URL

https://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start=51&title_type=feature&year=2018,2020

(note that start=1 in the first URL and start=51 in this one) takes us to the next page. Likewise, https://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start=101&title_type=feature&year=2018,2020
leads to the third page. 
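
The start values can also be generated in code rather than typed by hand. The sketch below is a minimal illustration that assumes the query parameters shown in the URLs above; BASE_URL and build_page_url are hypothetical names and are not part of the assignment template.

```python
from urllib.parse import urlencode

# Assumption: the base address and parameter names are copied from the example URLs above.
BASE_URL = "https://www.imdb.com/search/title/"


def build_page_url(first_year, last_year, start):
    """Build the search URL for one result page (hypothetical helper)."""
    params = {
        "at": 0,
        "sort": "num_votes,desc",
        "start": start,
        "title_type": "feature",
        "year": f"{first_year},{last_year}",
    }
    # safe="," keeps the commas literal instead of encoding them as %2C
    return BASE_URL + "?" + urlencode(params, safe=",")


# Five pages of 50 movies cover the top 250: start = 1, 51, 101, 151, 201
starts = [1 + 50 * page for page in range(5)]
urls = [build_page_url(2018, 2020, start) for start in starts]
for url in urls:
    print(url)
```

The second URL printed is the start=51 address quoted above, so the loop reproduces the effect of clicking “next”.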


**You need to write code where I have <span style="color:red">'''  Your code here ...    '''.</span>**

***
Now let’s look at each function in detail. The parameter “top_number” in the function read_m_by_voting(first_year, last_year, top_number) is the number of top movies you want to retrieve. For example, read_m_by_voting(2018,2020,500) means that we want to extract the top 500 movies released between 2018 and 2020, sorted by users' votes.

This function returns a list of dictionaries. Each dictionary represents one of the top movies, which could look like the following:

    {
        'movie_id': 'tt6324278',
        'title': 'Abominable',
        'year': '(2019)',
        'rank': '358.',
        'runtime': '97 min',
        'rating': '7.0',
        'votes': '34,093'
    }

In order to implement read_m_by_voting(first_year, last_year, top_number), you first need to implement the function read_m_from_url(url, num_of_m=50). This function extracts num_of_m movies from a URL. It also returns a list of dictionaries, each of which represents a movie. As described above, to extract, say, the top 120 movies, you need to parse 3 web pages because each page includes only 50 movies. The read_m_from_url function lets you extract a specific number of movies from a single URL. 

After you implement “read_m_by_voting”, which will return a list of top movies, you need to implement the function write_movies_csv(final_list, filename) to write the movies list to a csv file.

***
You probably want to work on read_m_from_url first, then read_m_by_voting, and then write_movies_csv. For each of the functions, I have included a test function. For instance, for the function read_m_from_url, I have included a test function called test_read_m_from_url(). Please uncomment the test function to test your code. The test function test_write_movies_csv() outputs "IMDb_TopVoted.csv". You need to test each function before moving on to the next. Even within a function, you may need to use print() to check your code very carefully.

***

After you are done scraping the needed data, clean and transform it as needed so that it is ready for enriching the given "Movies.csv" dataset.

Finally, export the enriched dataset to a CSV file.
Use the following naming convention: Project_3_PartA_Lastname.csv
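
The scraped values are all strings with decoration around them ('(2019)', '358.', '97 min', '34,093'). One possible cleaning step is sketched below. It assumes the field formats shown in the example dictionary above; first_int and clean_movie are hypothetical helpers, and how the result is joined to "Movies.csv" depends on that file's columns, which are not shown here.

```python
import re


def first_int(text):
    """Return the first run of digits in text as an int, or None if none is found."""
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None


def clean_movie(raw):
    """Convert one scraped movie (all strings) into typed values (hypothetical helper)."""
    year_match = re.search(r"\d{4}", raw["year"])  # handles '(2019)' and '(I) (2019)'
    return {
        "movie_id": raw["movie_id"].strip(),
        "title": raw["title"].strip(),
        "rank": first_int(raw["rank"]),        # '358.'   -> 358
        "year": int(year_match.group()) if year_match else None,
        "rating": float(raw["rating"]) if raw["rating"] else None,
        "runtime": first_int(raw["runtime"]),  # '97 min' -> 97
        "votes": first_int(raw["votes"]),      # '34,093' -> 34093
    }


raw_movie = {
    "movie_id": "tt6324278",
    "title": "Abominable",
    "year": "(2019)",
    "rank": "358.",
    "runtime": "97 min",
    "rating": "7.0",
    "votes": "34,093",
}
cleaned = clean_movie(raw_movie)
print(cleaned)
```

Values that cannot be parsed become None rather than raising, so leftover gaps can be counted after cleaning.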



### Helper functions
***

The helper functions section includes some functions you need to use when implementing the three functions described above. It includes the implementation of two helper functions and a test function for each: read_html(url), test_read_html(), process_str_with_comma(string), and test_process_str_with_comma(). For instance, running test_process_str_with_comma() will help you understand what the function process_str_with_comma(string) does.



In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import math

# Helper functions:

In [2]:
"""
The read_html(url) function returns the contents of the HTML page at a URL as raw bytes.
"""
def read_html(url):
    response = requests.get(url)
    content = response.content
    return content

#===========================================================================

"""
If a string contains a comma, enclose the string in double quotes so that it can be written to a
comma-separated values (CSV) file without the comma splitting the field.
"""
def process_str_with_comma(string):
    if ',' in string:
        new_string = '"' + string.strip() + '"'
    else:
        new_string = string
    return new_string


### Test  read_html 

In [3]:
def test_read_html():
    print (read_html('https://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start=1&title_type=feature&year=2018,2020'))


In [4]:
test_read_html()

b'\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n\n        <meta charset="utf-8">\n\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>Feature Film,\nReleased between 2018-01-01 and 2020-12-31\n(Sorted by Number of Votes Descending) - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n    if (typeof uex == \'function\') {\n      uex("ld", "LoadTitle", {wb: 1});\n    }\n</script>\n\n        <

### Test  process_str_with_comma 

In [5]:
def test_process_str_with_comma():
    """output:
    string: it is a string
    string: "it is a string, right"
    """
    string = 'it is a string'
    print ("string: " + process_str_with_comma(string))
    string = 'it is a string, right'
    print ("string: " + process_str_with_comma(string))


In [6]:
test_process_str_with_comma()

string: it is a string
string: "it is a string, right"
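
process_str_with_comma handles values that contain a comma, but it does not escape double quotes that are already inside a value. Python's standard csv module covers both cases. The sketch below only illustrates that alternative and is not the approach the assignment asks for; to_csv_line is a hypothetical helper and the second sample row is made up.

```python
import csv
import io


def to_csv_line(values):
    """Format one row with the csv module (hypothetical helper for illustration)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    # The writer appends a line terminator; strip it to get just the row text
    return buffer.getvalue().rstrip("\r\n")


line_with_commas = to_csv_line(["tt8633462", "Quo Vadis, Aida?", "34,694"])
line_with_quotes = to_csv_line(['A "quoted" title', "7.0"])
print(line_with_commas)  # tt8633462,"Quo Vadis, Aida?","34,694"
print(line_with_quotes)  # "A ""quoted"" title",7.0
```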


***

## read_m_from_url

Inside this function, you need to write your code to pull the movie information.
For each movie, you need to pull :
- movie_id
- title 
- rank
- year
- rating
- runtime
- votes 

To give examples of how to pull data from a web page, I have included the code to pull the movie_id, title, and votes.
You need to include your code to pull the other needed movie information (rank, year, ......). None of the collected fields should have missing values.
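
Because the provided extraction code stores an empty string whenever a lookup fails, missing values can be detected by looking for empty fields in the returned list. The following is a small sketch of such a check; REQUIRED_KEYS and find_missing are hypothetical names, and the blank runtime in the second sample entry is made up to show a failure.

```python
REQUIRED_KEYS = ["movie_id", "title", "rank", "year", "rating", "runtime", "votes"]


def find_missing(movies):
    """Return (list index, key) pairs whose value is absent or empty (hypothetical helper)."""
    missing = []
    for index, movie in enumerate(movies):
        for key in REQUIRED_KEYS:
            if not str(movie.get(key, "")).strip():
                missing.append((index, key))
    return missing


sample = [
    {"movie_id": "tt7286456", "title": "Joker", "rank": "1.", "year": "(2019)",
     "rating": "8.4", "runtime": "122 min", "votes": "1,074,230"},
    {"movie_id": "tt6324278", "title": "Abominable", "rank": "358.", "year": "(2019)",
     "rating": "7.0", "runtime": "", "votes": "34,093"},  # runtime left blank on purpose
]
problems = find_missing(sample)
print(problems)  # [(1, 'runtime')]
```

An empty result means every movie has all seven fields.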

In [7]:
def read_m_from_url(url, num_of_m=50):
    print(url)
    # This function reads a number of movies from a URL. The default number is 50.
    
    html_string = read_html(url) # given a URL, read the HTML page (read_html returns its raw bytes). 
    # The read_html function is included in the helper functions. Please take a look.
    
    # create a soup object
    soup = BeautifulSoup(html_string, "html.parser")
    
    '''
    Click the URL and investigate how you can pull movie_id, rank, title,... from the webpage.
    
    To investigate the html of a web page , For example:
    URL: https://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start=1&title_type=feature&year=2018,2020
    Right-click anywhere on the webpage, and at the very bottom of the menu that pops up, 
    you will see "Inspect", Click on it.
    '''
    
    # Fetch the div that includes all the movies. This can be done with the find and find_all functions.
    # For example, find_all('div') gives you all divs on the page. Both find and find_all
    # can take two parameters:
    # in the code below, 'div' is the tag name and 'lister-list' is the value of the tag's class attribute.
    # movie_list = soup.find('div', class_='lister-list') is an equivalent, more explicit form: 
    # "find a div with attribute class = 'lister-list'".
    
    # Since each IMDb page has only one div with class = 'lister-list', we can use find rather than find_all. 
    # find_all returns a list of div tags, while find returns only the first match.
    
    movie_list = soup.find('div', 'lister-list') # this div contains all the listed movies in 
                                                 # the requested html web page.
    
    list_movies = [] # initialize the function return value, which is a list of movies. 
                     # This list will contain the scraped data transformed to a structured format.
    
    # Use count to track the number of movies processed. It starts at 0: no movie has been processed yet.
    count = 0
    
    # Each movie is listed in a div with class 'lister-item mode-advanced'.
    divs = movie_list.find_all('div','lister-item mode-advanced') # find all the movies listed on the page.
    for d in divs:
        dict_each_movie = {}  # initialize the movie dictionary to store the movie information.

        # Pulling the movie_id
        try:
            h = d.find('h3','lister-item-header') 
            movie_id= h.find('a').attrs['href']
            movie_id= movie_id[7:-1]
            
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            movie_id=""
        finally:
            dict_each_movie["movie_id"] = movie_id
            print(movie_id)
            
            

        # Pulling the title
        try:
            h = d.find('h3','lister-item-header') 
            title= h.find('a').text
        except Exception:
            title=""
        finally:
            dict_each_movie["title"] = title
            print(title)
            
            
        
        # Pulling the rank
        try:
            h = d.find('h3','lister-item-header') 
            rank = h.find('span', class_='lister-item-index unbold text-primary').text
        except Exception:
            rank=""
        finally:
            dict_each_movie["rank"] = rank
            print(rank)
            
        
        
        # Pulling the year
        try:
            h = d.find('h3','lister-item-header') 
            year = h.find('span', class_='lister-item-year text-muted unbold').text
        except Exception:
            year=""
        finally:
            dict_each_movie["year"] = year
            print(year)

            
        # Pulling the runtime
        try:
            h = d.find('p','text-muted') 
            runtime = h.find('span', class_='runtime').text
        except Exception:
            runtime=""
        finally:
            dict_each_movie["runtime"] = runtime
            print(runtime)

        
        # Pulling the rating
        try:
            h = d.find('div','ratings-bar') 
            rating = h.find('strong').text
        except Exception:
            rating=""
        finally:
            dict_each_movie["rating"] = rating
            print(rating)

            
        # Pulling the votes
        try: 
            div1= d.find('div','lister-item-content')
            # string= is the current name of the older text= argument in Beautiful Soup
            votes= div1.find('span', string='Votes:').find_next('span').text
        except Exception:
            votes = ""
        finally:
            dict_each_movie["votes"] = votes
            print(votes)


        list_movies.append(dict_each_movie)  # To add the movie information to the movies list.

        count +=1
        print('===============================')
        print()
        if count == num_of_m:
            break # to exit from the loop.

    return list_movies
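
The slice movie_id[7:-1] assumes that the link has exactly the form /title/<id>/. A regular expression is one way to make the extraction tolerant of extra path or query text. This is a sketch under the assumption that IMDb title ids are "tt" followed by digits, as in the outputs shown in this notebook; extract_movie_id is a hypothetical helper, and the second and third links below are made-up variants.

```python
import re


def extract_movie_id(href):
    """Return the 'tt...' id found in an IMDb title link, or '' if there is none."""
    match = re.search(r"/title/(tt\d+)", href)
    return match.group(1) if match else ""


plain_id = extract_movie_id("/title/tt7286456/")
query_id = extract_movie_id("/title/tt7286456/?ref_=example")
no_id = extract_movie_id("/name/nm0000001/")
print(plain_id)     # tt7286456
print(query_id)     # tt7286456
print(repr(no_id))  # ''
```

Returning an empty string on failure matches the convention used by the try/except blocks in read_m_from_url.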


### Test  read_m_from_url

In [8]:
def test_read_m_from_url():
    """ output:
    [{'movie_id': 'tt7286456', 'title': 'Joker', 'year': '(2019)', 'rank': '1.', 'runtime': '122 min', 'rating': '8.4', 'votes': '1,074,230'}, {'movie_id': 'tt4154796', 'title': 'Avengers: Endgame', 'year': '(2019)', 'rank': '2.', 'runtime': '181 min', 'rating': '8.4', 'votes': '945,461'}, {'movie_id': 'tt4154756', 'title': 'Avengers: Infinity War', 'year': '(2018)', 'rank': '3.', 'runtime': '149 min', 'rating': '8.4', 'votes': '928,596'}, {'movie_id': 'tt1825683', 'title': 'Black Panther', 'year': '(2018)', 'rank': '4.', 'runtime': '134 min', 'rating': '7.3', 'votes': '678,964'}, {'movie_id': 'tt6751668', 'title': 'Parasite', 'year': '(2019)', 'rank': '5.', 'runtime': '132 min', 'rating': '8.6', 'votes': '666,646'}, {'movie_id': 'tt7131622', 'title': 'Once Upon a Time... In Hollywood', 'year': '(2019)', 'rank': '6.', 'runtime': '161 min', 'rating': '7.6', 'votes': '642,048'}, {'movie_id': 'tt8946378', 'title': 'Knives Out', 'year': '(2019)', 'rank': '7.', 'runtime': '130 min', 'rating': '7.9', 'votes': '534,299'}, {'movie_id': 'tt5463162', 'title': 'Deadpool 2', 'year': '(2018)', 'rank': '8.', 'runtime': '119 min', 'rating': '7.7', 'votes': '519,760'}, {'movie_id': 'tt8579674', 'title': '1917', 'year': '(2019)', 'rank': '9.', 'runtime': '119 min', 'rating': '8.3', 'votes': '495,380'}, {'movie_id': 'tt4154664', 'title': 'Captain Marvel', 'year': '(2019)', 'rank': '10.', 'runtime': '123 min', 'rating': '6.8', 'votes': '493,817'}, {'movie_id': 'tt1727824', 'title': 'Bohemian Rhapsody', 'year': '(2018)', 'rank': '11.', 'runtime': '134 min', 'rating': '7.9', 'votes': '489,064'}, {'movie_id': 'tt6644200', 'title': 'A Quiet Place', 'year': '(2018)', 'rank': '12.', 'runtime': '90 min', 'rating': '7.5', 'votes': '482,141'}, {'movie_id': 'tt4633694', 'title': 'Spider-Man: Into the Spider-Verse', 'year': '(2018)', 'rank': '13.', 'runtime': '117 min', 'rating': '8.4', 'votes': '430,153'}, {'movie_id': 'tt6966692', 'title': 'Green Book', 'year': '(2018)', 'rank': '14.', 
'runtime': '130 min', 'rating': '8.2', 'votes': '428,762'}, {'movie_id': 'tt6723592', 'title': 'Tenet', 'year': '(2020)', 'rank': '15.', 'runtime': '150 min', 'rating': '7.4', 'votes': '426,125'}, {'movie_id': 'tt1477834', 'title': 'Aquaman', 'year': '(2018)', 'rank': '16.', 'runtime': '143 min', 'rating': '6.9', 'votes': '417,286'}, {'movie_id': 'tt1270797', 'title': 'Venom', 'year': '(2018)', 'rank': '17.', 'runtime': '112 min', 'rating': '6.7', 'votes': '410,565'}, {'movie_id': 'tt2527338', 'title': 'Star Wars: The Rise Of Skywalker', 'year': '(2019)', 'rank': '18.', 'runtime': '141 min', 'rating': '6.5', 'votes': '404,527'}, {'movie_id': 'tt1677720', 'title': 'Ready Player One', 'year': '(2018)', 'rank': '19.', 'runtime': '140 min', 'rating': '7.4', 'votes': '398,599'}, {'movie_id': 'tt6320628', 'title': 'Spider-Man: Far from Home', 'year': '(2019)', 'rank': '20.', 'runtime': '129 min', 'rating': '7.4', 'votes': '383,087'}, {'movie_id': 'tt1517451', 'title': 'A Star Is Born', 'year': '(2018)', 'rank': '21.', 'runtime': '136 min', 'rating': '7.6', 'votes': '356,745'}, {'movie_id': 'tt1302006', 'title': 'The Irishman', 'year': '(2019)', 'rank': '22.', 'runtime': '209 min', 'rating': '7.8', 'votes': '352,691'}, {'movie_id': 'tt5095030', 'title': 'Ant-Man and the Wasp', 'year': '(2018)', 'rank': '23.', 'runtime': '118 min', 'rating': '7.0', 'votes': '346,822'}, {'movie_id': 'tt2584384', 'title': 'Jojo Rabbit', 'year': '(2019)', 'rank': '24.', 'runtime': '108 min', 'rating': '7.9', 'votes': '340,733'}, {'movie_id': 'tt1950186', 'title': 'Ford v Ferrari', 'year': '(2019)', 'rank': '25.', 'runtime': '152 min', 'rating': '8.1', 'votes': '337,490'}, {'movie_id': 'tt3778644', 'title': 'Solo: A Star Wars Story', 'year': '(2018)', 'rank': '26.', 'runtime': '135 min', 'rating': '6.9', 'votes': '313,683'}, {'movie_id': 'tt4912910', 'title': 'Mission: Impossible - Fallout', 'year': '(2018)', 'rank': '27.', 'runtime': '147 min', 'rating': '7.7', 'votes': '308,761'}, 
{'movie_id': 'tt6146586', 'title': 'John Wick: Chapter 3 - Parabellum', 'year': '(2019)', 'rank': '28.', 'runtime': '130 min', 'rating': '7.4', 'votes': '306,871'}, {'movie_id': 'tt2737304', 'title': 'Bird Box', 'year': '(2018)', 'rank': '29.', 'runtime': '124 min', 'rating': '6.6', 'votes': '305,135'}, {'movie_id': 'tt2798920', 'title': 'Annihilation', 'year': '(I) (2018)', 'rank': '30.', 'runtime': '115 min', 'rating': '6.8', 'votes': '298,808'}, {'movie_id': 'tt0448115', 'title': 'Shazam!', 'year': '(2019)', 'rank': '31.', 'runtime': '132 min', 'rating': '7.0', 'votes': '295,848'}, {'movie_id': 'tt4881806', 'title': 'Jurassic World: Fallen Kingdom', 'year': '(2018)', 'rank': '32.', 'runtime': '128 min', 'rating': '6.2', 'votes': '283,862'}, {'movie_id': 'tt8367814', 'title': 'The Gentlemen', 'year': '(2019)', 'rank': '33.', 'runtime': '113 min', 'rating': '7.8', 'votes': '282,900'}, {'movie_id': 'tt2948372', 'title': 'Soul', 'year': '(2020)', 'rank': '34.', 'runtime': '100 min', 'rating': '8.1', 'votes': '280,267'}, {'movie_id': 'tt7653254', 'title': 'Marriage Story', 'year': '(2019)', 'rank': '35.', 'runtime': '137 min', 'rating': '7.9', 'votes': '274,033'}, {'movie_id': 'tt7784604', 'title': 'Hereditary', 'year': '(2018)', 'rank': '36.', 'runtime': '127 min', 'rating': '7.3', 'votes': '270,157'}, {'movie_id': 'tt3606756', 'title': 'Incredibles 2', 'year': '(2018)', 'rank': '37.', 'runtime': '118 min', 'rating': '7.6', 'votes': '269,810'}, {'movie_id': 'tt6857112', 'title': 'Us', 'year': '(II) (2019)', 'rank': '38.', 'runtime': '116 min', 'rating': '6.8', 'votes': '257,673'}, {'movie_id': 'tt8772262', 'title': 'Midsommar', 'year': '(2019)', 'rank': '39.', 'runtime': '148 min', 'rating': '7.1', 'votes': '255,303'}, {'movie_id': 'tt0437086', 'title': 'Alita: Battle Angel', 'year': '(2019)', 'rank': '40.', 'runtime': '122 min', 'rating': '7.3', 'votes': '247,522'}, {'movie_id': 'tt5727208', 'title': 'Uncut Gems', 'year': '(2019)', 'rank': '41.', 'runtime': '135 
min', 'rating': '7.4', 'votes': '246,741'}, {'movie_id': 'tt6139732', 'title': 'Aladdin', 'year': '(2019)', 'rank': '42.', 'runtime': '128 min', 'rating': '6.9', 'votes': '245,331'}, {'movie_id': 'tt7349662', 'title': 'BlacKkKlansman', 'year': '(2018)', 'rank': '43.', 'runtime': '135 min', 'rating': '7.5', 'votes': '242,325'}, {'movie_id': 'tt4123430', 'title': 'Fantastic Beasts: The Crimes of Grindelwald', 'year': '(2018)', 'rank': '44.', 'runtime': '134 min', 'rating': '6.5', 'votes': '239,448'}, {'movie_id': 'tt7126948', 'title': 'Wonder Woman 1984', 'year': '(2020)', 'rank': '45.', 'runtime': '151 min', 'rating': '5.4', 'votes': '232,962'}, {'movie_id': 'tt7349950', 'title': 'It Chapter Two', 'year': '(2019)', 'rank': '46.', 'runtime': '169 min', 'rating': '6.5', 'votes': '230,874'}, {'movie_id': 'tt6105098', 'title': 'The Lion King', 'year': '(2019)', 'rank': '47.', 'runtime': '118 min', 'rating': '6.8', 'votes': '226,997'}, {'movie_id': 'tt6823368', 'title': 'Glass', 'year': '(2019)', 'rank': '48.', 'runtime': '129 min', 'rating': '6.6', 'votes': '224,677'}, {'movie_id': 'tt1979376', 'title': 'Toy Story 4', 'year': '(2019)', 'rank': '49.', 'runtime': '100 min', 'rating': '7.7', 'votes': '223,650'}, {'movie_id': 'tt2704998', 'title': 'Game Night', 'year': '(I) (2018)', 'rank': '50.', 'runtime': '100 min', 'rating': '6.9', 'votes': '220,159'}]    
    """
    url = "http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=1&title_type=feature&year=2018,2020"
    print ("Movies list: ", read_m_from_url(url))

In [9]:
test_read_m_from_url()

http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=1&title_type=feature&year=2018,2020
tt7286456
Joker
1.
(I) (2019)
122 min
8.4
1,306,451

tt4154796
Avengers: Endgame
2.
(2019)
181 min
8.4
1,145,181

tt4154756
Avengers: Infinity War
3.
(2018)
149 min
8.4
1,092,729

tt6751668
Parasite
4.
(2019)
132 min
8.5
818,953

tt1825683
Black Panther
5.
(2018)
134 min
7.3
783,211

tt7131622
Once Upon a Time in Hollywood
6.
(2019)
161 min
7.6
753,406

tt8946378
Knives Out
7.
(2019)
130 min
7.9
701,391

tt8579674
1917
8.
(2019)
119 min
8.2
601,322

tt5463162
Deadpool 2
9.
(2018)
119 min
7.7
590,658

tt4154664
Captain Marvel
10.
(2019)
123 min
6.8
566,950

tt1727824
Bohemian Rhapsody
11.
(2018)
134 min
7.9
545,495

tt4633694
Spider-Man: Into the Spider-Verse
12.
(2018)
117 min
8.4
539,646

tt6644200
A Quiet Place
13.
(2018)
90 min
7.5
537,924

tt6723592
Tenet
14.
(2020)
150 min
7.3
515,445

tt6966692
Green Book
15.
(2018)
130 min
8.2
501,061

tt6320628
Spider-Man: Far from Home
16.
(2019

***
##  read_m_by_voting

This function takes two years (first_year, last_year) and a number (top_number) as inputs. 
If first_year = 2018, last_year = 2020, and top_number = 500, we want to retrieve the top 500 movies 
released between 2018 and 2020. To do so, we need to do two things:

##### 1. Construct a URL. An example URL is 
    https://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start=1&title_type=feature&year=2018,2020
    
        This URL means that the web page will display movies released between 2018 and 2020, 
        sorted by number of user votes in descending order ("sort=num_votes,desc"),
        starting from the first movie (start=1; we need to set the start index in the URL).
    
        IMDb displays just 50 movies for this URL. To see more 
        movies, we need to click "next" on the web page. Clicking the next button
        leads to a new URL:
           https://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start=51&title_type=feature&year=2018,2020

        Comparing the two URLs above, we can see that the second one has start=51, i.e., IMDb provides 
        another 50 movies, starting from movie No. 51. If we want to retrieve the top 61 movies, we need to open 
        two web pages with two URLs. 
        If we want to retrieve the top 256 movies, we need to open 6 different URLs.
        The natural way to construct the different URLs is a loop.
    
    
##### 2. Read movies from the URL using the "read_m_from_url" function. 
      This function opens a URL and reads the number of movies you need from that page. 
      For example, if we want to read the top 61 movies, we first open 
      the URL https://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start=1&title_type=feature&year=2018,2020
      
      We read all 50 movies on the page by calling read_m_from_url(url), and then we open 
      the second URL https://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start=51&title_type=feature&year=2018,2020, but 
      we only need to read 61-50=11 movies from that page. With current_index = 51 and top_number = 61, we 
      need to retrieve (top_number - current_index + 1) movies. We can do this by calling
      read_m_from_url(url, top_number - current_index + 1).
      
      Hence, we need an if statement here: the last page may need fewer than 50 movies. Depending on top_number, 
         we may need to open multiple URLs (e.g., for the top 256, we need to open 6 URLs).
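
The arithmetic in the two steps above can be checked without touching the network. The sketch below lists, for a given top_number, the start index of each page and how many movies to read from it; page_plan is a hypothetical helper used only to illustrate the calculation.

```python
import math


def page_plan(top_number, per_page=50):
    """Return (start, count) pairs: where each page starts and how many movies to read."""
    plan = []
    num_pages = math.ceil(top_number / per_page)
    for page in range(num_pages):
        start = page * per_page + 1
        # Full pages yield per_page movies; the last page yields whatever is still needed
        count = min(per_page, top_number - start + 1)
        plan.append((start, count))
    return plan


plan_61 = page_plan(61)
plan_256 = page_plan(256)
print(plan_61)                          # [(1, 50), (51, 11)]
print(len(plan_256), plan_256[-1])      # 6 (251, 6)
```

For the top 61 movies the second page contributes 61 - 51 + 1 = 11 movies, which agrees with the example in the text.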
    

In [10]:
m_per_page = 50 # by default, IMDb returns 50 movies per page.
def read_m_by_voting(first_year, last_year, top_number):
    
    current_index = 1  # initialize current_index. In the first iteration, we need to have start = 1.
    
    final_list = []  # initialize the return value. This method returns a list. Each item in the list is a dictionary. 
                     # Each dictionary includes information regarding a movie.

    num_pages = int(math.ceil(top_number / float(m_per_page)))  # number of pages to open
    for i in range(num_pages):
        url= 'http://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start='+str(current_index)+'&title_type=feature&year='+str(first_year)+','+ str(last_year)

        if i == num_pages - 1:  # last page: read only the movies still needed
            lis = read_m_from_url(url, top_number - current_index + 1)
        else:
            lis = read_m_from_url(url, m_per_page)
        final_list += lis
        current_index += m_per_page

    return final_list

### Test read_m_by_voting

In [11]:
def test_read_m_by_voting():
    """output:
    [{'movie_id': 'tt7286456', 'title': 'Joker', 'year': '(2019)', 'rank': '1.', 'genres': 'Crime, Drama, Thriller', 'runtime': '122 min', 'rating': '8.4', 'votes': '"1,074,179"'}, {'movie_id': 'tt4154796', 'title': 'Avengers: Endgame', 'year': '(2019)', 'rank': '2.', 'genres': 'Action, Adventure, Drama', 'runtime': '181 min', 'rating': '8.4', 'votes': '"945,422"'}]
    """
    print (read_m_by_voting(2018,2020,500)) # This will print a list of the top 500 movies.

In [12]:
test_read_m_by_voting()

http://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start=1&title_type=feature&year=2018,2020
tt7286456
Joker
1.
(I) (2019)
122 min
8.4
1,306,451

tt4154796
Avengers: Endgame
2.
(2019)
181 min
8.4
1,145,181

tt4154756
Avengers: Infinity War
3.
(2018)
149 min
8.4
1,092,729

tt6751668
Parasite
4.
(2019)
132 min
8.5
818,953

tt1825683
Black Panther
5.
(2018)
134 min
7.3
783,211

tt7131622
Once Upon a Time in Hollywood
6.
(2019)
161 min
7.6
753,406

tt8946378
Knives Out
7.
(2019)
130 min
7.9
701,391

tt8579674
1917
8.
(2019)
119 min
8.2
601,322

tt5463162
Deadpool 2
9.
(2018)
119 min
7.7
590,658

tt4154664
Captain Marvel
10.
(2019)
123 min
6.8
566,950

tt1727824
Bohemian Rhapsody
11.
(2018)
134 min
7.9
545,495

tt4633694
Spider-Man: Into the Spider-Verse
12.
(2018)
117 min
8.4
539,646

tt6644200
A Quiet Place
13.
(2018)
90 min
7.5
537,924

tt6723592
Tenet
14.
(2020)
150 min
7.3
515,445

tt6966692
Green Book
15.
(2018)
130 min
8.2
501,061

tt6320628
Spider-Man: Far from Home
16.
(201

tt7040874
A Simple Favor
101.
(2018)
117 min
6.8
152,034

tt5052474
Sicario: Day of the Soldado
102.
(2018)
122 min
7.0
151,887

tt6412452
The Ballad of Buster Scruggs
103.
(2018)
133 min
7.3
151,673

tt6266538
Vice
104.
(2018)
132 min
7.2
150,837

tt4566758
Mulan
105.
(2020)
115 min
5.7
150,280

tt4500922
Maze Runner: The Death Cure
106.
(2018)
143 min
6.2
144,859

tt5814060
The Nun
107.
(2018)
96 min
5.3
144,426

tt13143964
Borat Subsequent Moviefilm
108.
(2020)
95 min
6.6
143,290

tt3794354
Sonic the Hedgehog
109.
(2020)
99 min
6.5
142,835

tt7959026
The Mule
110.
(2018)
116 min
7.0
140,353

tt7395114
The Devil All the Time
111.
(2020)
138 min
7.1
138,532

tt2283336
Men in Black: International
112.
(2019)
114 min
5.6
137,155

tt2854926
Tag
113.
(I) (2018)
100 min
6.5
136,621

tt3829266
The Predator
114.
(2018)
107 min
5.3
135,821

tt9893250
I Care a Lot
115.
(2020)
118 min
6.3
135,056

tt7984766
The King
116.
(I) (2019)
140 min
7.3
133,917

tt1618434
Murder Mystery
117.
(2019)
97 mi

tt3387520
Scary Stories to Tell in the Dark
201.
(2019)
108 min
6.2
78,695

tt8350360
Annabelle Comes Home
202.
(2019)
106 min
5.9
77,988

tt3861390
Dumbo
203.
(2019)
112 min
6.3
77,764

tt7014006
Eighth Grade
204.
(2018)
93 min
7.4
77,465

tt2990140
The Christmas Chronicles
205.
(2018)
104 min
7.0
77,225

tt8155288
Happy Death Day 2U
206.
(2019)
100 min
6.2
77,140

tt6111574
The Woman in the Window
207.
(2020)
100 min
5.7
77,054

tt10618286
Mank
208.
(2020)
131 min
6.8
76,912

tt2709692
The Grinch
209.
(2018)
85 min
6.4
76,856

tt5220122
Hotel Transylvania 3: Summer Vacation
210.
(2018)
97 min
6.3
76,589

tt6921996
Johnny English Strikes Again
211.
(2018)
89 min
6.2
76,172

tt6977338
Good Boys
212.
(2019)
90 min
6.7
75,740

tt4532826
Robin Hood
213.
(2018)
116 min
5.3
75,455

tt5033998
Charlie's Angels
214.
(2019)
118 min
4.9
75,137

tt6679794
Outlaw King
215.
(2018)
121 min
6.9
74,360

tt10280276
Coolie No. 1
216.
(2020)
134 min
4.2
73,721

tt1137450
Death Wish
217.
(2018)
107 min
6.

tt0800325
The Dirt
301.
(2019)
107 min
7.0
51,138

tt7504726
The Call of the Wild
302.
(2020)
100 min
6.7
51,126

tt8239946
Tumbbad
303.
(2018)
104 min
8.2
51,114

tt5814534
Spies in Disguise
304.
(2019)
102 min
6.8
51,014

tt5116302
Togo
305.
(2019)
113 min
7.9
50,576

tt5783956
When We First Met
306.
(2018)
97 min
6.4
50,563

tt10431500
Miracle in Cell No. 7
307.
(2019)
132 min
8.2
50,190

tt5073642
Color Out of Space
308.
(2019)
111 min
6.2
50,023

tt7347846
The Lodge
309.
(2019)
108 min
6.0
49,908

tt7125860
If Beale Street Could Talk
310.
(2018)
119 min
7.1
49,871

tt5431890
Official Secrets
311.
(2019)
112 min
7.3
49,238

tt1086064
Bill & Ted Face the Music
312.
(2020)
91 min
5.9
49,034

tt5177088
The Girl in the Spider's Web
313.
(2018)
115 min
6.1
48,773

tt7734218
Stuber
314.
(2019)
93 min
6.2
48,588

tt1289403
The Guernsey Literary and Potato Peel Pie Society
315.
(2018)
124 min
7.3
48,170

tt8544498
The Way Back
316.
(I) (2020)
108 min
6.7
47,963

tt6491178
Dragged Across Co

tt8633462
Quo Vadis, Aida?
401.
(2020)
101 min
8.0
34,694

tt8983202
Kabir Singh
402.
(2019)
173 min
7.0
34,691

tt8108268
The Tashkent Files
403.
(2019)
134 min
8.1
34,576

tt6141246
The Aeronauts
404.
(2019)
100 min
6.6
34,569

tt10324144
Article 15
405.
(2019)
130 min
8.1
34,534

tt6802308
The 15:17 to Paris
406.
(2018)
94 min
5.3
34,355

tt4761916
Unfriended: Dark Web
407.
(2018)
92 min
6.0
34,335

tt6195094
Incident in a Ghostland
408.
(2018)
91 min
6.4
34,263

tt8850222
Peninsula
409.
(2020)
116 min
5.5
34,111

tt9900782
Kaithi
410.
(2019)
145 min
8.5
34,020

tt9806192
I Lost My Body
411.
(2019)
81 min
7.5
33,976

tt9398640
Between Two Ferns: The Movie
412.
(2019)
82 min
6.1
33,838

tt6865690
The Professor
413.
(I) (2018)
90 min
6.7
33,744

tt5935704
Padmaavat
414.
(2018)
164 min
7.0
33,721

tt5198068
Wolfwalkers
415.
(2020)
103 min
8.0
33,719

tt10003008
The Rental
416.
(2020)
88 min
5.7
33,622

tt2076298
Ip Man 4: The Finale
417.
(2019)
107 min
7.0
33,401

tt3907584
All the Bri

***
## write_movies_csv
This function writes the movies to a CSV file.
Each row in the CSV file represents a movie. 
 - The parameter final_list is the list of movies returned by the function read_m_by_voting. 
 - The parameter filename is the name of the output file.
 
 ***
 Important note: make sure to complete the code of the read_m_from_url function before running write_movies_csv.
 ***
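
Once the file has been written, it is worth reading it back to confirm that every row still has seven fields, since an unquoted comma in a title or vote count would shift the columns. The sketch below shows one way to do that with the standard csv module. check_csv and EXPECTED_HEADER are hypothetical names, the header is the one used by write_movies_csv, and the demonstration file is a made-up two-line example written to a temporary folder.

```python
import csv
import os
import tempfile

EXPECTED_HEADER = ["movie_id", "rank", "title", "year", "rating", "runtime", "votes"]


def check_csv(filename):
    """Return the number of data rows, raising ValueError if the file is malformed."""
    with open(filename, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    if not rows or rows[0] != EXPECTED_HEADER:
        raise ValueError("unexpected header")
    for line_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_HEADER):
            raise ValueError(f"line {line_number} has {len(row)} fields")
    return len(rows) - 1


# Demonstration on a tiny file whose title and votes both contain commas
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("movie_id,rank,title,year,rating,runtime,votes\n")
        f.write('tt8633462,401.,"Quo Vadis, Aida?",(2020),8.0,101 min,"34,694"\n')
    data_rows = check_csv(path)
print(data_rows)  # 1
```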

In [13]:

def write_movies_csv(final_list, filename):

    lis = [] # to write the file, we create a list of strings.
    
    fields = ["movie_id", "rank", "title", "year", "rating", "runtime", "votes"]
    header = ",".join(fields)
    lis.append(header) # add the header to the list
    for movie in final_list:
        # quote any value that contains a comma, then join the values in header order
        string = ",".join(process_str_with_comma(movie[field]) for field in fields)
        print (string)
        lis.append(string)# add the string to the list

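    # Write the list of strings to the CSV file; the with block closes the file even if a write fails.
    # encoding="utf-8" keeps titles with non-ASCII characters intact on every platform.
    with open(filename, "w", encoding="utf-8") as f:
        for s in lis:
            f.write("%s\n" % s)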


In [14]:
# The output of the test_write_movies_csv function is the CSV file named in the write_movies_csv call below.
def test_write_movies_csv(): 
    li = read_m_by_voting(2018, 2020, 500) # To read the top voted 500 movies between 2018 and 2020 from imdb.
    print(li)
    print("================================================================")
    write_movies_csv(li,"IMDb_TopVoted_PartA_Errichetti_Martin_RothEagle.csv") 


In [15]:
test_write_movies_csv()

http://www.imdb.com/search/title/?at=0&sort=num_votes,desc&start=1&title_type=feature&year=2018,2020
tt7286456
Joker
1.
(I) (2019)
122 min
8.4
1,306,451

tt4154796
Avengers: Endgame
2.
(2019)
181 min
8.4
1,145,181

tt4154756
Avengers: Infinity War
3.
(2018)
149 min
8.4
1,092,729

tt6751668
Parasite
4.
(2019)
132 min
8.5
818,953

tt1825683
Black Panther
5.
(2018)
134 min
7.3
783,211

tt7131622
Once Upon a Time in Hollywood
6.
(2019)
161 min
7.6
753,406

tt8946378
Knives Out
7.
(2019)
130 min
7.9
701,391

tt8579674
1917
8.
(2019)
119 min
8.2
601,322

tt5463162
Deadpool 2
9.
(2018)
119 min
7.7
590,658

tt4154664
Captain Marvel
10.
(2019)
123 min
6.8
566,950

tt1727824
Bohemian Rhapsody
11.
(2018)
134 min
7.9
545,495

tt4633694
Spider-Man: Into the Spider-Verse
12.
(2018)
117 min
8.4
539,646

tt6644200
A Quiet Place
13.
(2018)
90 min
7.5
537,924

tt6723592
Tenet
14.
(2020)
150 min
7.3
515,445

tt6966692
Green Book
15.
(2018)
130 min
8.2
501,061

tt6320628
Spider-Man: Far from Home
16.
(201

tt7040874
A Simple Favor
101.
(2018)
117 min
6.8
152,034

tt5052474
Sicario: Day of the Soldado
102.
(2018)
122 min
7.0
151,887

tt6412452
The Ballad of Buster Scruggs
103.
(2018)
133 min
7.3
151,673

tt6266538
Vice
104.
(2018)
132 min
7.2
150,837

tt4566758
Mulan
105.
(2020)
115 min
5.7
150,280

tt4500922
Maze Runner: The Death Cure
106.
(2018)
143 min
6.2
144,859

tt5814060
The Nun
107.
(2018)
96 min
5.3
144,426

tt13143964
Borat Subsequent Moviefilm
108.
(2020)
95 min
6.6
143,290

tt3794354
Sonic the Hedgehog
109.
(2020)
99 min
6.5
142,835

tt7959026
The Mule
110.
(2018)
116 min
7.0
140,353

tt7395114
The Devil All the Time
111.
(2020)
138 min
7.1
138,532

tt2283336
Men in Black: International
112.
(2019)
114 min
5.6
137,155

tt2854926
Tag
113.
(I) (2018)
100 min
6.5
136,621

tt3829266
The Predator
114.
(2018)
107 min
5.3
135,821

tt9893250
I Care a Lot
115.
(2020)
118 min
6.3
135,056

tt7984766
The King
116.
(I) (2019)
140 min
7.3
133,917

tt1618434
Murder Mystery
117.
(2019)
97 mi

tt3387520
Scary Stories to Tell in the Dark
201.
(2019)
108 min
6.2
78,695

tt8350360
Annabelle Comes Home
202.
(2019)
106 min
5.9
77,988

tt3861390
Dumbo
203.
(2019)
112 min
6.3
77,764

tt7014006
Eighth Grade
204.
(2018)
93 min
7.4
77,465

tt2990140
The Christmas Chronicles
205.
(2018)
104 min
7.0
77,225

tt8155288
Happy Death Day 2U
206.
(2019)
100 min
6.2
77,140

tt6111574
The Woman in the Window
207.
(2020)
100 min
5.7
77,054

tt10618286
Mank
208.
(2020)
131 min
6.8
76,912

tt2709692
The Grinch
209.
(2018)
85 min
6.4
76,856

tt5220122
Hotel Transylvania 3: Summer Vacation
210.
(2018)
97 min
6.3
76,589

tt6921996
Johnny English Strikes Again
211.
(2018)
89 min
6.2
76,172

tt6977338
Good Boys
212.
(2019)
90 min
6.7
75,740

tt4532826
Robin Hood
213.
(2018)
116 min
5.3
75,455

tt5033998
Charlie's Angels
214.
(2019)
118 min
4.9
75,137

tt6679794
Outlaw King
215.
(2018)
121 min
6.9
74,360

tt10280276
Coolie No. 1
216.
(2020)
134 min
4.2
73,721

tt1137450
Death Wish
217.
(2018)
107 min
6.

tt0800325
The Dirt
301.
(2019)
107 min
7.0
51,138

tt7504726
The Call of the Wild
302.
(2020)
100 min
6.7
51,126

tt8239946
Tumbbad
303.
(2018)
104 min
8.2
51,114

tt5814534
Spies in Disguise
304.
(2019)
102 min
6.8
51,014

tt5116302
Togo
305.
(2019)
113 min
7.9
50,576

tt5783956
When We First Met
306.
(2018)
97 min
6.4
50,563

tt10431500
Miracle in Cell No. 7
307.
(2019)
132 min
8.2
50,190

tt5073642
Color Out of Space
308.
(2019)
111 min
6.2
50,023

tt7347846
The Lodge
309.
(2019)
108 min
6.0
49,908

tt7125860
If Beale Street Could Talk
310.
(2018)
119 min
7.1
49,871

tt5431890
Official Secrets
311.
(2019)
112 min
7.3
49,238

tt1086064
Bill & Ted Face the Music
312.
(2020)
91 min
5.9
49,034

tt5177088
The Girl in the Spider's Web
313.
(2018)
115 min
6.1
48,773

tt7734218
Stuber
314.
(2019)
93 min
6.2
48,588

tt1289403
The Guernsey Literary and Potato Peel Pie Society
315.
(2018)
124 min
7.3
48,170

tt8544498
The Way Back
316.
(I) (2020)
108 min
6.7
47,963

tt6491178
Dragged Across Co

tt8633462
Quo Vadis, Aida?
401.
(2020)
101 min
8.0
34,694

tt8983202
Kabir Singh
402.
(2019)
173 min
7.0
34,691

tt8108268
The Tashkent Files
403.
(2019)
134 min
8.1
34,576

tt6141246
The Aeronauts
404.
(2019)
100 min
6.6
34,569

tt10324144
Article 15
405.
(2019)
130 min
8.1
34,534

tt6802308
The 15:17 to Paris
406.
(2018)
94 min
5.3
34,355

tt4761916
Unfriended: Dark Web
407.
(2018)
92 min
6.0
34,335

tt6195094
Incident in a Ghostland
408.
(2018)
91 min
6.4
34,263

tt8850222
Peninsula
409.
(2020)
116 min
5.5
34,111

tt9900782
Kaithi
410.
(2019)
145 min
8.5
34,020

tt9806192
I Lost My Body
411.
(2019)
81 min
7.5
33,976

tt9398640
Between Two Ferns: The Movie
412.
(2019)
82 min
6.1
33,838

tt6865690
The Professor
413.
(I) (2018)
90 min
6.7
33,744

tt5935704
Padmaavat
414.
(2018)
164 min
7.0
33,721

tt5198068
Wolfwalkers
415.
(2020)
103 min
8.0
33,719

tt10003008
The Rental
416.
(2020)
88 min
5.7
33,622

tt2076298
Ip Man 4: The Finale
417.
(2019)
107 min
7.0
33,401

tt3907584
All the Bri

# Importing the given dataset "Movies.csv" into a pandas DataFrame called df1

In [16]:
df1 = pd.read_csv('Movies.csv', low_memory = False)

# Importing the scraped data from the IMDb_TopVoted.csv file into a pandas DataFrame called df2

In [17]:
# Import the collected dataset "IMDb_TopVoted.csv".
# To handle Latin characters that may be contained in the CSV file,
# pass encoding="ISO-8859-1" to pd.read_csv().
# Example: df1 = pd.read_csv('thefilename.csv', encoding="ISO-8859-1")
# Using encoding="ISO-8859-1" avoids UnicodeDecodeError exceptions.

df2 = pd.read_csv('IMDb_TopVoted_PartA_Errichetti_Martin_RothEagle.csv', encoding= "ISO-8859-1", low_memory = False)
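
One caveat worth illustrating: ISO-8859-1 can decode any byte sequence, so it never raises, but it only reproduces the original text if the file was also written in that encoding. `write_movies_csv` opens the file without an explicit `encoding`, so the bytes on disk depend on the platform default. The sketch below uses an illustrative accented title (not taken from the scraped data) to show what happens when UTF-8 bytes are read as ISO-8859-1.

```python
# Illustrative accented title; an assumption, not a row from the dataset.
title = "Les Mis\u00e9rables"

# Suppose the file was written as UTF-8 (a common platform default).
utf8_bytes = title.encode("utf-8")

# Reading those bytes as ISO-8859-1 succeeds but garbles the accented letter.
misread = utf8_bytes.decode("iso-8859-1")

# Reading with the encoding that was used for writing restores the title.
round_trip = utf8_bytes.decode("utf-8")

print(misread)
print(round_trip)
```

Passing the same explicit `encoding` to both `open(...)` when writing and `pd.read_csv(...)` when reading avoids this kind of silent mismatch.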

## Data cleansing and transformation to convert the columns datatype for df2.

In [18]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   movie_id  500 non-null    object 
 1   rank      500 non-null    float64
 2   title     500 non-null    object 
 3   year      500 non-null    object 
 4   rating    500 non-null    float64
 5   runtime   500 non-null    object 
 6   votes     500 non-null    object 
dtypes: float64(2), object(5)
memory usage: 27.5+ KB


In [19]:
# Cleaning and transforming df2
# rank, year, runtime, and votes should have an integer data type.


# Reformatting values in the 'year' column: drop the parentheses and any
# Roman-numeral prefix such as '(I) ', keeping only the four-digit year
df2['year'] = df2['year'].replace(['(II) (2019)'],['2019']) 
df2['year'] = df2['year'].replace(['(I) (2020)'],['2020'])
df2['year'] = df2['year'].replace(['(III) (2018)'],['2018'])
df2['year'] = df2['year'].replace(['(II) (2020)'],['2020'])
df2['year'] = df2['year'].replace(['(I) (2019)'],['2019'])
df2['year'] = df2['year'].replace(['(III) (2019)'],['2019'])
df2['year'] = df2['year'].replace(['(II) (2018)'],['2018'])
df2['year'] = df2['year'].replace(['(I) (2018)'],['2018'])
df2['year'] = df2['year'].replace(['(IV) (2020)'],['2020'])
df2['year'] = df2['year'].replace(['(2018)'],['2018'])
df2['year'] = df2['year'].replace(['(2019)'],['2019'])
df2['year'] = df2['year'].replace(['(2020)'],['2020'])

# Reformatting values in the 'runtime' column: remove the ' min' suffix.
# str.replace removes the literal suffix; rstrip(' min') would instead strip
# any trailing ' ', 'm', 'i', or 'n' characters.
df2['runtime'] = df2['runtime'].str.replace(' min', '', regex=False)

# Reformatting values in the 'votes' column: remove the thousands separators
df2['votes'] = df2['votes'].str.replace(',', '', regex=False)


# Unique column values 
for col in df2.columns:
    print(col)

    unique = df2[col].unique()
    print(unique, '\n*************************************\n\n')



movie_id
['tt7286456' 'tt4154796' 'tt4154756' 'tt6751668' 'tt1825683' 'tt7131622'
 'tt8946378' 'tt8579674' 'tt5463162' 'tt4154664' 'tt1727824' 'tt4633694'
 'tt6644200' 'tt6723592' 'tt6966692' 'tt6320628' 'tt1270797' 'tt1477834'
 'tt2527338' 'tt1677720' 'tt5095030' 'tt1950186' 'tt2584384' 'tt1302006'
 'tt1517451' 'tt3778644' 'tt6146586' 'tt2737304' 'tt8367814' 'tt0448115'
 'tt4912910' 'tt2948372' 'tt8772262' 'tt2798920' 'tt7784604' 'tt4881806'
 'tt7653254' 'tt6857112' 'tt3606756' 'tt5727208' 'tt4123430' 'tt0437086'
 'tt7126948' 'tt7349950' 'tt6139732' 'tt7349662' 'tt9243946' 'tt7975244'
 'tt1979376' 'tt6105098' 'tt6823368' 'tt7713068' 'tt2935510' 'tt2704998'
 'tt8332922' 'tt8228288' 'tt1051906' 'tt5164214' 'tt7984734' 'tt1365519'
 'tt6806448' 'tt3281548' 'tt8936646' 'tt5083738' 'tt7846844' 'tt5606664'
 'tt1213641' 'tt6565702' 'tt2873282' 'tt6499752' 'tt3741700' 'tt1560220'
 'tt6450804' 'tt8106534' 'tt1070874' 'tt2066051' 'tt9620292' 'tt4520988'
 'tt3104988' 'tt4779682' 'tt5104604' 'tt22

In [20]:
# Converting various column types to an integer

df2 = df2.astype({'rank': 'int64','year': 'int64', 'runtime': 'int64', 'votes': 'int64'})
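
The cleaning and conversion steps above enumerate every `year` variant by hand. A more general sketch, assuming every raw value contains a four-digit year in parentheses, extracts the year with a regular expression instead. The three sample rows reuse raw values from the scraper output above; the DataFrame names `raw` and `clean` are illustrative.

```python
import pandas as pd

# Raw values as they appear in the scraper output (Joker, Tag, Avengers: Endgame).
raw = pd.DataFrame({
    "rank": [1.0, 113.0, 2.0],
    "year": ["(I) (2019)", "(I) (2018)", "(2019)"],
    "runtime": ["122 min", "100 min", "181 min"],
    "votes": ["1,306,451", "136,621", "1,145,181"],
})

clean = raw.copy()
# Capture the four digits inside the final parentheses, ignoring '(I) ' etc.
clean["year"] = clean["year"].str.extract(r"\((\d{4})\)", expand=False)
# Remove the literal ' min' suffix and the thousands separators.
clean["runtime"] = clean["runtime"].str.replace(" min", "", regex=False)
clean["votes"] = clean["votes"].str.replace(",", "", regex=False)
# Convert the cleaned text columns (and the float rank) to integers.
clean = clean.astype({"rank": "int64", "year": "int64",
                      "runtime": "int64", "votes": "int64"})

print(clean.dtypes)
print(clean)
```

The regular expression handles any Roman-numeral prefix without listing each one, so a new variant such as `(V) (2020)` would not need another line of code.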

In [21]:
print("Converted column types:")
df2.info()

Converted column types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   movie_id  500 non-null    object 
 1   rank      500 non-null    int64  
 2   title     500 non-null    object 
 3   year      500 non-null    int64  
 4   rating    500 non-null    float64
 5   runtime   500 non-null    int64  
 6   votes     500 non-null    int64  
dtypes: float64(1), int64(4), object(2)
memory usage: 27.5+ KB


# Enrich the given dataset (df1) by merging it with the scraped data (df2).

In [22]:
df2.head()

Unnamed: 0,movie_id,rank,title,year,rating,runtime,votes
0,tt7286456,1,Joker,2019,8.4,122,1306451
1,tt4154796,2,Avengers: Endgame,2019,8.4,181,1145181
2,tt4154756,3,Avengers: Infinity War,2018,8.4,149,1092729
3,tt6751668,4,Parasite,2019,8.5,132,818953
4,tt1825683,5,Black Panther,2018,7.3,134,783211


In [23]:
df1.head()

Unnamed: 0,movie_id,titleType,originalTitle,isAdult,genres
0,tt7286456,movie,Joker,0,"Crime,Drama,Thriller"
1,tt4154796,movie,Avengers: Endgame,0,"Action,Adventure,Drama"
2,tt4154756,movie,Avengers: Infinity War,0,"Action,Adventure,Sci-Fi"
3,tt6751668,movie,Gisaengchung,0,"Comedy,Drama,Thriller"
4,tt1825683,movie,Black Panther,0,"Action,Adventure,Sci-Fi"


In [24]:
# Merging both datasets

In [25]:
df = pd.merge(left=df1, right=df2, how='left', on='movie_id')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       500 non-null    object 
 1   titleType      500 non-null    object 
 2   originalTitle  500 non-null    object 
 3   isAdult        500 non-null    int64  
 4   genres         500 non-null    object 
 5   rank           493 non-null    float64
 6   title          493 non-null    object 
 7   year           493 non-null    float64
 8   rating         493 non-null    float64
 9   runtime        493 non-null    float64
 10  votes          493 non-null    float64
dtypes: float64(5), int64(1), object(5)
memory usage: 46.9+ KB


In [26]:
# Identifying missing values 

df.isnull().sum()

movie_id         0
titleType        0
originalTitle    0
isAdult          0
genres           0
rank             7
title            7
year             7
rating           7
runtime          7
votes            7
dtype: int64

In [27]:
# Seven films in the client's dataset were not found in the scraped IMDb data, representing 1.4% of the client's dataset (7 of 500).
# Because so few rows are affected, we elected to remove them.

df = df.dropna(axis=0, how='any')
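
Before dropping unmatched rows, it can help to see exactly which `movie_id` values failed to match. A minimal sketch, using the `indicator=True` option of `pd.merge` on two tiny stand-in frames: the first two IDs come from the data above, while `tt0000000` and its title are made up to force a mismatch.

```python
import pandas as pd

# Stand-ins for df1 (client data) and df2 (scraped data).
left = pd.DataFrame({
    "movie_id": ["tt7286456", "tt4154796", "tt0000000"],   # last ID is made up
    "originalTitle": ["Joker", "Avengers: Endgame", "Placeholder"],
})
right = pd.DataFrame({
    "movie_id": ["tt7286456", "tt4154796"],
    "rank": [1, 2],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
merged = pd.merge(left, right, how="left", on="movie_id", indicator=True)

# Rows present only in the left frame are the ones dropna() would remove.
unmatched = merged.loc[merged["_merge"] == "left_only", "movie_id"].tolist()
print(unmatched)
```

Listing the unmatched IDs makes it possible to check whether they are genuinely absent from the scraped pages or simply fell outside the top 500.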

In [28]:
df.isnull().sum()

movie_id         0
titleType        0
originalTitle    0
isAdult          0
genres           0
rank             0
title            0
year             0
rating           0
runtime          0
votes            0
dtype: int64

In [29]:
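# The left merge introduced NaN values, which forced rank, year, runtime, and votes to float64.
# Now that the rows with null values are removed, these columns can be converted back to integers.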

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 493 entries, 0 to 498
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       493 non-null    object 
 1   titleType      493 non-null    object 
 2   originalTitle  493 non-null    object 
 3   isAdult        493 non-null    int64  
 4   genres         493 non-null    object 
 5   rank           493 non-null    float64
 6   title          493 non-null    object 
 7   year           493 non-null    float64
 8   rating         493 non-null    float64
 9   runtime        493 non-null    float64
 10  votes          493 non-null    float64
dtypes: float64(5), int64(1), object(5)
memory usage: 46.2+ KB


In [30]:
# Converting data types

df = df.astype({'rank': 'int64','year': 'int64', 'runtime': 'int64', 'votes': 'int64'})

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 493 entries, 0 to 498
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       493 non-null    object 
 1   titleType      493 non-null    object 
 2   originalTitle  493 non-null    object 
 3   isAdult        493 non-null    int64  
 4   genres         493 non-null    object 
 5   rank           493 non-null    int64  
 6   title          493 non-null    object 
 7   year           493 non-null    int64  
 8   rating         493 non-null    float64
 9   runtime        493 non-null    int64  
 10  votes          493 non-null    int64  
dtypes: float64(1), int64(5), object(5)
memory usage: 46.2+ KB
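
The float64 columns seen after the merge arise because NaN cannot be stored in a plain `int64` column. The sketch below reproduces that effect on a small Series and contrasts the two ways out: dropping the missing rows and converting back, as done above, or switching to pandas' nullable `Int64` dtype. The variable names are illustrative.

```python
import pandas as pd

ranks = pd.Series([1, 2, 3], dtype="int64")

# Asking for a label that does not exist introduces NaN and upcasts to float64,
# which is the same effect an unmatched row has in a left merge.
with_gap = ranks.reindex([0, 1, 2, 3])

# Option 1 (used in this notebook): drop the missing rows, then convert back.
restored = with_gap.dropna().astype("int64")

# Option 2: keep the missing row by using the nullable integer dtype.
nullable = with_gap.astype("Int64")

print(with_gap.dtype, restored.dtype, nullable.dtype)
```

Option 2 is mainly useful when the unmatched rows should be kept for later inspection rather than discarded.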


### Rearrange the dataset fields to be listed in the following order: 
 movie_id, rank, votes, title, originalTitle, year, rating, titleType, isAdult, runtime,  genres

In [31]:
# Rearrange the dataset fields by selecting the columns in the required order
df = df[['movie_id', 'rank', 'votes', 'title', 'originalTitle', 'year', 'rating', 'titleType', 'isAdult', 'runtime', 'genres']]

# Sorting the dataset based on the "rank" column

df = df.sort_values(by=['rank'])
df.head()

Unnamed: 0,movie_id,rank,votes,title,originalTitle,year,rating,titleType,isAdult,runtime,genres
0,tt7286456,1,1306451,Joker,Joker,2019,8.4,movie,0,122,"Crime,Drama,Thriller"
1,tt4154796,2,1145181,Avengers: Endgame,Avengers: Endgame,2019,8.4,movie,0,181,"Action,Adventure,Drama"
2,tt4154756,3,1092729,Avengers: Infinity War,Avengers: Infinity War,2018,8.4,movie,0,149,"Action,Adventure,Sci-Fi"
3,tt6751668,4,818953,Parasite,Gisaengchung,2019,8.5,movie,0,132,"Comedy,Drama,Thriller"
4,tt1825683,5,783211,Black Panther,Black Panther,2018,7.3,movie,0,134,"Action,Adventure,Sci-Fi"


# Export the enriched dataset to a CSV file:

In [32]:
# Use the following naming convention: 
#  Project_3_PartA_Lastname.csv

df.to_csv('Project_3_PartA_Errichetti_Martin_RothEagle.csv')
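
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. Whether that column is wanted depends on the submission requirements, which this section does not state. The sketch below uses in-memory buffers and a two-row stand-in frame to show the difference `index=False` makes.

```python
import io
import pandas as pd

frame = pd.DataFrame({"movie_id": ["tt7286456", "tt4154796"], "rank": [1, 2]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()

print(cols_with)
print(cols_without)
```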