# Instructions
         
For Part A, you need to scrape an IMDb web page to find the top movies sorted by user votes. For each movie, you need to pull:
- movie_id
- rank
- title 
- runtime
- year
- rating
- votes

The URL of a page that lists movies released between 2018 and 2020, sorted by number of votes, is:

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc

Please click the URL and investigate how you can pull movie_id, rank, title, and the other fields from the webpage.


**You need to write your code wherever you see <span style="color:red">'''  Your code here ...    '''.</span>**

***
Now let’s look at the read_m_from_html_string(url, num_of_m=50) function in detail. The parameter “num_of_m” is the number of top movies you want to retrieve. For example, read_m_from_html_string(url, 500) means that we want to extract the top 500 movies released between 2018 and 2020, sorted by users' votes.

This function returns a list of dictionaries. Each dictionary represents one of the top movies, which could look like the following:

{
  
    'movie_id': 'tt7286456',
    'rank': '1.',
    'title': 'Joker',
    'runtime': '2h 2m',
    'year': '2019',
    'rating': '8.4',
    'votes': '1,421,777',
}


After you implement “read_m_from_html_string”, which returns a list of top movies, you need to export the movie list to a CSV file.


***

After you are done scraping the needed data, you should clean and transform it as needed to make it ready for enriching the given "Movies.csv" dataset.
***

Finally, export the enriched dataset to a CSV file using the following naming convention: Project_3_PartA_Lastname.csv




In [2]:
import warnings
warnings.filterwarnings('ignore')
from bs4 import BeautifulSoup
import pandas as pd

***

## read_m_from_html_string

Inside this function, you need to write the code that pulls the movie information from the provided Movies 500 HTML String text file.

For each movie, you need to pull:
- movie_id
- rank
- title 
- runtime
- year
- rating
- votes

To give an example of how to pull data from the web page HTML string, I have included the code to pull the movie_id.
You need to include your code to pull the other needed movie information (title, rank, year, ......). None of the collected fields should have missing values.

The URL of a page that lists movies released between 2018 and 2020, sorted by number of votes, is:

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc

Please click the URL and investigate how you can pull movie_id, rank, title,... from the webpage using the Inspect option.
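
As a minimal sketch of the kind of lookup the Inspect view leads to, the snippet below parses a tiny hand-written HTML fragment with BeautifulSoup. The fragment and its class names (`movie-list`, `movie-item`, `movie-link`) are invented for illustration and are not IMDb's real markup; only the `find` and `find_all` calls carry over to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; the class names are invented for this example.
html = """
<ul class="movie-list">
  <li class="movie-item"><a class="movie-link" href="/title/tt0000001/">1. First Film</a></li>
  <li class="movie-item"><a class="movie-link" href="/title/tt0000002/">2. Second Film</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; the second argument filters on the class attribute.
container = soup.find("ul", "movie-list")

# find_all() returns every matching tag inside the container.
items = container.find_all("li", "movie-item")

titles = []
for item in items:
    link = item.find("a", "movie-link")
    # The link text has the form "<rank>. <title>", so split once on ". ".
    rank, title = link.text.strip().split(". ", 1)
    titles.append((rank, title, link.attrs["href"]))

print(titles)
```

The same pattern (locate one container, then loop over its repeated child tags) is what the function below applies to the saved IMDb page.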



In [4]:
# This function reads a number of movies from the saved HTML string of a URL. The default value of num_of_m is 50.
def read_m_from_html_string(url, num_of_m=50):
    
    print(url)
    
    with open('TopVoted_500_Movies_HTML.txt', 'r', encoding="utf8") as file:
        html_string = file.read()   # to read the HTML file as a string
        # I have included the Movies 500 HTML String.txt file in the project folder. Please take a look.
    
    # create a soup object
    soup = BeautifulSoup(html_string, "html.parser")
    '''
    Click the URL and investigate how you can pull movie_id, rank, title,... from the webpage.
    To investigate the HTML of a web page, for example:
    URL: https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc
    right-click anywhere on the webpage and, at the very bottom of the menu that pops up,
    click "Inspect".
    '''
    '''
    Fetch the div that includes all the movies. This can be done with the find and find_all functions.
    For example, find_all('div') gives you all the divs on the page. Both functions
    accept a second parameter: in the code below, 'div' is the tag name and
    'ipc-page-grid__item ipc-page-grid__item--span-2' is the value of the tag's class attribute.
    You can also write movie_list = soup.find('div', class_='ipc-page-grid__item ipc-page-grid__item--span-2').
    Here you explicitly say: I want to find a div with
    attribute class = 'ipc-page-grid__item ipc-page-grid__item--span-2'.

    Since we only need the single div that wraps the movie list, we can use find rather than find_all.
    find_all() returns a list of div tags, while find() returns only the first matching div.
    '''
    movie_list = soup.find('div', 'ipc-page-grid__item ipc-page-grid__item--span-2') 
    # This div contains all the movies listed in the requested HTML string.
    
    list_movies = [] # initialize the function's return value, which is a list of movies.
                     # This list will contain the scraped data transformed to a structured format.
    
    # Use count to track the number of movies processed. It starts at 0: no movie has been processed yet.
    count = 0
    
    # Each movie is listed in an li with the class 'ipc-metadata-list-summary-item'.
    divs = movie_list.find_all('li', 'ipc-metadata-list-summary-item') # find all the movies listed on the page
    for d in divs:
        dict_each_movie = {}  # initialize the movie dictionary to store the movie information.

        # Pulling the movie_id
        try:
            movie_id= d.find('a', 'ipc-title-link-wrapper').attrs['href']
            movie_id= movie_id[7:16]
            
        except:
            movie_id=""
        finally:
            dict_each_movie["movie_id"] = movie_id
            print(movie_id)
            
        # Pulling the rank
        try:
            rank_and_title = d.find('h3','ipc-title__text').text.strip()
            rank = rank_and_title.split('.')[0]
            
        except:
            rank=""
        finally:
            dict_each_movie["rank"] = rank
            print(rank)

        # Pulling the title
        try:
            title = d.find('h3', 'ipc-title__text').text.split('. ', 1)[1].strip()

        except:
            title=""
        finally:
            dict_each_movie["title"] = title
            print(title)     
        
        # Pulling the runtime
        try:
            runtime_d= d.find('div', 'sc-5bc66c50-5 hVarDB dli-title-metadata')
            runtime_s= runtime_d.find_all('span', 'sc-5bc66c50-6 OOdsw dli-title-metadata-item')
            m_runtime= runtime_s[1].text.strip()
            
        except:
            m_runtime=""
        finally:
            dict_each_movie["m_runtime"] = m_runtime
            print(m_runtime)   
        
        # Pulling the year
        try:
            year = d.find('span', 'sc-5bc66c50-6 OOdsw dli-title-metadata-item').text.strip()
          
        except:
            year=""
        finally:
            dict_each_movie["year"] = year
            print(year)
        
        # Pulling the rating (out of 10)
        try:
            rating= d.find('span', 'ipc-rating-star--rating').text
            
        except:
            rating=""
        finally:
            dict_each_movie["rating"] = rating
            print(rating)
        
        # Pulling the votes
        try:
            votes = d.find('span', 'ipc-rating-star--voteCount').text
            votes = votes.replace("(", "").replace(")", "")
            vote_number = float(votes[0:len(votes) - 1])  # numeric part, e.g. 1.6
            vote_k = votes[-1]                            # suffix, e.g. 'M' or 'K'

            if vote_k == 'M':
                votes = int(vote_number * 1000000)
            else:
                votes = int(vote_number * 1000)
        
        except:
            votes=""
        finally:
            dict_each_movie["votes"] = votes
            print(votes)
        
        list_movies.append(dict_each_movie)  # To add the movie information to the movies list.

        count +=1
        print('===============================')
        print()
        if count == num_of_m:
            break # to exit from the loop.

    return list_movies
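
The slice `movie_id[7:16]` in the function above always keeps exactly nine characters, so it relies on every identifier being that long, and the votes branch treats every suffix other than `M` as thousands. As a hedged alternative, the sketch below shows two small helpers that do not depend on those assumptions. The names `extract_movie_id` and `parse_vote_count` are hypothetical, and the `/title/tt.../` link format is assumed from the slice used above.

```python
import re

def extract_movie_id(href):
    """Return the 'tt...' identifier found in a title link, or '' if there is none."""
    # Assumed link shape: /title/tt<digits>/... (inferred from the [7:16] slice).
    match = re.search(r"/title/(tt\d+)", href)
    return match.group(1) if match else ""

def parse_vote_count(text):
    """Convert an abbreviated count such as '(1.6M)' or '(878K)' to an int."""
    cleaned = text.strip().strip("()").strip()
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = cleaned[-1:].upper()
    if suffix in multipliers:
        return int(round(float(cleaned[:-1]) * multipliers[suffix]))
    # No suffix: treat the text as a plain number, allowing thousands separators.
    return int(cleaned.replace(",", ""))

print(extract_movie_id("/title/tt7286456/?ref_=sr_t_1"))
print(parse_vote_count("(1.6M)"))
```

Both helpers raise or return an empty value on unexpected input, so they can still be wrapped in the same try/except pattern used in the function.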


###  Call statement to scrape the top-voted 500 movies
##### read_m_from_html_string(url,500)

In [6]:
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc"

Movies_list = read_m_from_html_string(url,500)  #to read the topVoted 500 movies
Movies_list

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc
tt7286456
1
Joker
2h 2m
2019
8.4
1600000

tt4154796
2
Avengers: Endgame
3h 1m
2019
8.4
1300000

tt4154756
3
Avengers: Infinity War
2h 29m
2018
8.4
1200000

tt6751668
4
Parasite
2h 12m
2019
8.5
1000000

tt7131622
5
Once Upon a Time... in Hollywood
2h 41m
2019
7.6
878000

tt1825683
6
Black Panther
2h 14m
2018
7.3
856000

tt8946378
7
Knives Out
2h 10m
2019
7.9
796000

tt8579674
8
1917
1h 59m
2019
8.2
703000

tt4633694
9
Spider-Man: Into the Spider-Verse
1h 57m
2018
8.4
702000

tt5463162
10
Deadpool 2
1h 59m
2018
7.6
691000

tt4154664
11
Captain Marvel
2h 3m
2019
6.8
624000

tt6644200
12
A Quiet Place
1h 30m
2018
7.5
613000

tt6723592
13
Tenet
2h 30m
2020
7.3
612000

tt1727824
14
Bohemian Rhapsody
2h 14m
2018
7.9
609000

tt6966692
15
Green Book
2h 10m
2018
8.2
605000

tt6320628
16
Spider-Man: Far from Home
2h 9m
2019
7.4
577000

tt1270797
17
Venom
1h 52m
2018
6.6
560000

tt1477834
1

[{'movie_id': 'tt7286456',
  'rank': '1',
  'title': 'Joker',
  'm_runtime': '2h 2m',
  'year': '2019',
  'rating': '8.4',
  'votes': 1600000},
 {'movie_id': 'tt4154796',
  'rank': '2',
  'title': 'Avengers: Endgame',
  'm_runtime': '3h 1m',
  'year': '2019',
  'rating': '8.4',
  'votes': 1300000},
 {'movie_id': 'tt4154756',
  'rank': '3',
  'title': 'Avengers: Infinity War',
  'm_runtime': '2h 29m',
  'year': '2018',
  'rating': '8.4',
  'votes': 1200000},
 {'movie_id': 'tt6751668',
  'rank': '4',
  'title': 'Parasite',
  'm_runtime': '2h 12m',
  'year': '2019',
  'rating': '8.5',
  'votes': 1000000},
 {'movie_id': 'tt7131622',
  'rank': '5',
  'title': 'Once Upon a Time... in Hollywood',
  'm_runtime': '2h 41m',
  'year': '2019',
  'rating': '7.6',
  'votes': 878000},
 {'movie_id': 'tt1825683',
  'rank': '6',
  'title': 'Black Panther',
  'm_runtime': '2h 14m',
  'year': '2018',
  'rating': '7.3',
  'votes': 856000},
 {'movie_id': 'tt8946378',
  'rank': '7',
  'title': 'Knives Out',


In [8]:
# to convert the movies list of dicts to a DataFrame
df_movies = pd.DataFrame(Movies_list)
df_movies

|  | movie_id | rank | title | m_runtime | year | rating | votes |
|---|---|---|---|---|---|---|---|
| 0 | tt7286456 | 1 | Joker | 2h 2m | 2019 | 8.4 | 1600000 |
| 1 | tt4154796 | 2 | Avengers: Endgame | 3h 1m | 2019 | 8.4 | 1300000 |
| 2 | tt4154756 | 3 | Avengers: Infinity War | 2h 29m | 2018 | 8.4 | 1200000 |
| 3 | tt6751668 | 4 | Parasite | 2h 12m | 2019 | 8.5 | 1000000 |
| 4 | tt7131622 | 5 | Once Upon a Time... in Hollywood | 2h 41m | 2019 | 7.6 | 878000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | tt3089630 | 496 | Artemis Fowl | 1h 35m | 2020 | 4.3 | 31000 |
| 496 | tt1308728 | 497 | The Happytime Murders | 1h 31m | 2018 | 5.5 | 31000 |
| 497 | tt1138238 | 498 | The Dissident | 1h 59m | 2020 | 7.8 | 31000 |
| 498 | tt1031014 | 499 | Fatman | 1h 40m | 2020 | 5.9 | 31000 |


***
# Export the collected movies to the Final_IMDb_TopVoted.csv file


In [10]:
df_movies.to_csv('Final_IMDb_TopVoted.csv', index = False)

# Importing the given dataset "Movies.csv" to Pandas DataFrame called df1

In [12]:
# Importing the csv file to df1 and print the df1.

df1 = pd.read_csv("Movies.csv")


# Import the scraped data from the Final_IMDb_TopVoted.csv file to a Pandas DataFrame called df2

In [14]:
# You need to import the collected dataset "Final_IMDb_TopVoted.csv" and print df2.
# To handle Latin characters that may be contained in the csv file
# with no issue, use encoding="ISO-8859-1" with pd.read_csv().
# Example: df1 = pd.read_csv('thefilename.csv', encoding="ISO-8859-1")
# Using encoding="ISO-8859-1" will avoid UnicodeDecodeError exceptions.

df2 = pd.read_csv('Final_IMDb_TopVoted.csv', encoding="ISO-8859-1")
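
One caveat is worth checking: `to_csv` writes UTF-8 unless told otherwise, and ISO-8859-1 maps every byte to some character, so reading a UTF-8 file with it never fails but can silently garble accented titles. The sketch below illustrates this with a made-up title that is not taken from the dataset.

```python
# A sample title with a non-ASCII character, encoded as UTF-8.
title = "Les Mis\u00e9rables"
utf8_bytes = title.encode("utf-8")

# Decoding with ISO-8859-1 never raises, but each UTF-8 byte becomes its own character.
as_latin1 = utf8_bytes.decode("ISO-8859-1")

# Decoding with the encoding the bytes were written in restores the title.
as_utf8 = utf8_bytes.decode("utf-8")

print(as_latin1)
print(as_utf8)
```

If the scraped titles turn out to contain such characters, passing the same encoding to both `to_csv` and `read_csv` keeps them intact.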

# Data cleansing and transformation for df2.

In [16]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie_id   500 non-null    object 
 1   rank       500 non-null    int64  
 2   title      500 non-null    object 
 3   m_runtime  500 non-null    object 
 4   year       500 non-null    int64  
 5   rating     500 non-null    float64
 6   votes      500 non-null    int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 27.5+ KB


In [18]:
# Cleaning and transforming df2
 # rank, year, and votes should have an integer data type.
 # The runtime column should be renamed to runtimeMinutes and its values should be in minutes,
 # for example: 2h 2m should be 122
    
'''  Your code here ...    '''
# Convert rank, year, and votes to integer
df2['rank'] = df2['rank'].astype(int)
df2['year'] = df2['year'].astype(int)
df2['votes'] = df2['votes'].astype(int)

# Rename runtime column to runtimeMinutes
df2 = df2.rename(columns={'m_runtime': 'runtimeMinutes'})

# Convert runtime to minutes
def convert_to_minutes(m_runtime):
    if pd.isna(m_runtime):
        return None
    parts = m_runtime.split()
    minutes = 0
    for part in parts:
        if 'h' in part:
            minutes += int(part.replace('h', '')) * 60
        elif 'm' in part:
            minutes += int(part.replace('m', ''))
    return minutes

df2['runtimeMinutes'] = df2['runtimeMinutes'].apply(convert_to_minutes)

# Convert runtimeMinutes to integer
df2['runtimeMinutes'] = df2['runtimeMinutes'].astype('Int64')
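
As an alternative to applying a Python function row by row, the same "2h 2m" to 122 conversion can be sketched with pandas string methods. The sample values below are made up, and the regular expression assumes the runtime is always written as an optional hour part followed by an optional minute part.

```python
import pandas as pd

# Made-up sample values in the "Xh Ym" format of the scraped runtime column.
runtimes = pd.Series(["2h 2m", "3h 1m", "45m", "2h", None])

# Capture the optional hour and minute numbers as two separate columns (0 and 1).
parts = runtimes.str.extract(r"(?:(\d+)h)?\s*(?:(\d+)m)?")
hours = pd.to_numeric(parts[0]).fillna(0)
minutes = pd.to_numeric(parts[1]).fillna(0)

# Keep missing inputs missing instead of turning them into 0 minutes.
total = (hours * 60 + minutes).where(runtimes.notna()).astype("Int64")

print(total.tolist())
```

For 500 rows the speed difference is negligible, so the choice between the two approaches is mostly a matter of readability.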


In [20]:
df2.head()

|  | movie_id | rank | title | runtimeMinutes | year | rating | votes |
|---|---|---|---|---|---|---|---|
| 0 | tt7286456 | 1 | Joker | 122 | 2019 | 8.4 | 1600000 |
| 1 | tt4154796 | 2 | Avengers: Endgame | 181 | 2019 | 8.4 | 1300000 |
| 2 | tt4154756 | 3 | Avengers: Infinity War | 149 | 2018 | 8.4 | 1200000 |
| 3 | tt6751668 | 4 | Parasite | 132 | 2019 | 8.5 | 1000000 |
| 4 | tt7131622 | 5 | Once Upon a Time... in Hollywood | 161 | 2019 | 7.6 | 878000 |


# Enrich the given dataset (df1) by merging it with the scraped data (df2).

In [22]:
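# Merge the two dataframes into one dataframe called df.
# movie_id is the only column the two dataframes share, so it is named explicitly as the join key.
df = pd.merge(df1, df2, on='movie_id')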
df.head()


|  | movie_id | originalTitle | description | ratingCategory | genres | rank | title | runtimeMinutes | year | rating | votes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt7286456 | Joker | During the 1980s, a failed stand-up comedian i... | R | Crime,Drama,Thriller | 1 | Joker | 122 | 2019 | 8.4 | 1600000 |
| 1 | tt4154796 | Avengers: Endgame | After the devastating events of Avengers: Infi... | PG-13 | Action,Adventure,Drama | 2 | Avengers: Endgame | 181 | 2019 | 8.4 | 1300000 |
| 2 | tt4154756 | Avengers: Infinity War | The Avengers and their allies must be willing ... | PG-13 | Action,Adventure,Sci-Fi | 3 | Avengers: Infinity War | 149 | 2018 | 8.4 | 1200000 |
| 3 | tt6751668 | Parasite | Greed and class discrimination threaten the ne... | R | Drama,Thriller | 4 | Parasite | 132 | 2019 | 8.5 | 1000000 |
| 4 | tt1825683 | Black Panther | T'Challa, heir to the hidden but advanced king... | PG-13 | Action,Adventure,Sci-Fi | 6 | Black Panther | 134 | 2018 | 7.3 | 856000 |


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   movie_id        498 non-null    object 
 1   originalTitle   498 non-null    object 
 2   description     498 non-null    object 
 3   ratingCategory  495 non-null    object 
 4   genres          498 non-null    object 
 5   rank            498 non-null    int64  
 6   title           498 non-null    object 
 7   runtimeMinutes  498 non-null    Int64  
 8   year            498 non-null    int64  
 9   rating          498 non-null    float64
 10  votes           498 non-null    int64  
dtypes: Int64(1), float64(1), int64(3), object(6)
memory usage: 43.4+ KB
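
The merged frame has 498 rows while df2 had 500, so some scraped movies found no match in Movies.csv (assuming movie_id values are unique in both frames). A minimal sketch of how such unmatched rows could be listed is shown below; the two small frames are invented stand-ins for df1 and df2.

```python
import pandas as pd

# Toy stand-ins for df1 (given dataset) and df2 (scraped data); the rows are invented.
given = pd.DataFrame({"movie_id": ["tt001", "tt002", "tt003"],
                      "genres": ["Drama", "Action", "Comedy"]})
scraped = pd.DataFrame({"movie_id": ["tt002", "tt003", "tt004"],
                        "rank": [1, 2, 3]})

# An outer merge with indicator=True adds a _merge column saying where each key was found.
check = pd.merge(given, scraped, on="movie_id", how="outer", indicator=True)

only_given = check.loc[check["_merge"] == "left_only", "movie_id"].tolist()
only_scraped = check.loc[check["_merge"] == "right_only", "movie_id"].tolist()
matched = int((check["_merge"] == "both").sum())

print(only_given, only_scraped, matched)
```

Running the same check on df1 and df2 would show which movie_id values are dropped by the inner merge and whether the cause is a genuinely missing movie or a mismatched identifier.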


# Rearrange the dataset fields to be listed in the following order:
movie_id, rank, title, originalTitle, description, year, votes, rating, runtimeMinutes, ratingCategory, genres

In [26]:
# Rearrange the dataset fields.

df = df[['movie_id', 'rank', 'title', 'originalTitle', 'description', 'year', 'votes', 'rating', 'runtimeMinutes', 'ratingCategory', 'genres']]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   movie_id        498 non-null    object 
 1   rank            498 non-null    int64  
 2   title           498 non-null    object 
 3   originalTitle   498 non-null    object 
 4   description     498 non-null    object 
 5   year            498 non-null    int64  
 6   votes           498 non-null    int64  
 7   rating          498 non-null    float64
 8   runtimeMinutes  498 non-null    Int64  
 9   ratingCategory  495 non-null    object 
 10  genres          498 non-null    object 
dtypes: Int64(1), float64(1), int64(3), object(6)
memory usage: 43.4+ KB


# Export the enriched dataset to a CSV file:

In [28]:
# Use the following naming convention: 
#  Project_3_PartA_Lastname.csv

df.to_csv('Final_Project_3_PartA_Group2.csv', index =False)
