# Instructions
         
For Part A, you need to scrape an IMDb web page to find the top movies sorted by user votes. For each movie, you need to pull:
- movie_id
- rank
- title 
- runtime
- year
- rating
- votes

The URL of a page that lists movies released between 2018 and 2020, sorted by number of votes, is:

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc

Please open the URL and investigate how you can pull movie_id, rank, title, and the other fields from the webpage.


**You need to write your code where you see <span style="color:red">'''  Your code here ...    '''.</span>**

***
Now let's look at the read_m_from_html_string(url, num_of_m=50) function in detail. The parameter "num_of_m" is the number of top movies you want to retrieve. For example, read_m_from_html_string(url, 500) means that we want to extract the top 500 movies released between 2018 and 2020, sorted by users' votes.

This function returns a list of dictionaries. Each dictionary represents one of the top movies, which could look like the following:

{
  
    'movie_id': 'tt7286456',
    'rank': '1.',
    'title': 'Joker',
    'runtime': '2h 2m',
    'year': '2019',
    'rating': '8.4',
    'votes': '1,421,777',
}


After you implement “read_m_from_html_string”, which will return a list of top movies, you need to export the movies list to a csv file.


***

After you are done scraping the needed data, clean and transform it as needed to make it ready for enriching the given "Movies.csv" dataset.
***

Finally, export the enriched dataset to a CSV file:
Use the following naming convention: Project_3_PartA_Lastname.csv




In [4]:
import warnings
warnings.filterwarnings('ignore')
from bs4 import BeautifulSoup
import pandas as pd

***

## read_m_from_html_string

Inside this function, you need to write your code to pull the movies information from the provided Movies 500 HTML String text file.

For each movie, you need to pull :
- movie_id
- rank
- title 
- runtime
- year
- rating
- votes

As an example of how to pull data from the web page HTML string, I have included the code to pull the movie_id.
You need to include your own code to pull the other movie information (title, rank, year, ...). None of the collected fields should have missing values.

The URL of a page that lists movies released between 2018 and 2020, sorted by number of votes, is:

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc

Please click the URL and investigate how you can pull movie_id, rank, title,... from the webpage using the Inspect option.
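
As a minimal sketch of the find / find_all pattern, the snippet below parses a small hand-written HTML fragment. The fragment, its layout, and the query strings in the links are assumptions made for illustration; only the class names `ipc-title-link-wrapper` and `ipc-title__text` are taken from the function in this notebook, and the live IMDb markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics two search results; not copied from IMDb.
sample_html = """
<ul>
  <li>
    <a class="ipc-title-link-wrapper" href="/title/tt7286456/?ref_=sr_t_1">
      <h3 class="ipc-title__text">1. Joker</h3>
    </a>
  </li>
  <li>
    <a class="ipc-title-link-wrapper" href="/title/tt4154796/?ref_=sr_t_2">
      <h3 class="ipc-title__text">2. Avengers: Endgame</h3>
    </a>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

sample_movies = []
# find_all returns every matching tag; find would return only the first one.
for h3 in soup.find_all("h3", class_="ipc-title__text"):
    link = h3.find_previous("a", class_="ipc-title-link-wrapper")
    # The href looks like "/title/tt7286456/...", so index 2 holds the id.
    movie_id = link["href"].split("/")[2]
    # "1. Joker" -> rank "1", title "Joker"; split once so ". " inside a title survives.
    rank, title = h3.text.strip().split(". ", 1)
    sample_movies.append({"movie_id": movie_id, "rank": rank, "title": title})

print(sample_movies)
```

The same two string operations (splitting the href on `/` and splitting the heading text once on `. `) are what turn the raw tags into the movie_id, rank, and title fields.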



In [7]:
def read_m_from_html_string(url, num_of_m=50):
    
    print(url)
    
    with open('TopVoted_500_Movies_HTML.txt', 'r', encoding="utf8") as file:
        html_string = file.read()   # to read the html file as a string
        # I have included the Movies 500 HTML String.txt file in the project folder. Please take a look.
    
    # create a soup object
    soup = BeautifulSoup(html_string, "html.parser")
    
    '''
    Open the URL and investigate how you can pull movie_id, rank, title, ... from the webpage.
    To inspect the HTML of a web page, for example:
    URL: https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc
    right-click anywhere on the page and click "Inspect" at the very bottom of the menu that pops up.
    '''
    '''
    Fetch the tags that hold the movies by using the find and find_all functions.
    For example, find_all('div') gives you all divs on the page. Both functions accept
    a second argument that filters on the class attribute, for example
    soup.find('div', 'ipc-page-grid__item ipc-page-grid__item--span-2').
    Here you explicitly say: I want to find a div with
    attribute class = 'ipc-page-grid__item ipc-page-grid__item--span-2'.

    find() returns only one tag (the first match), while find_all() returns a list of tags.
    Each movie on the page has its own h3 with class = 'ipc-title__text',
    so find_all is used below to collect one h3 per movie.
    '''
    movie_list = soup.find_all('h3', class_='ipc-title__text') 
    # this h3 contains the rank and title of the listed movies in the requested html web page string.
    
    if not movie_list:
        print("Error: Movie list container not found in the HTML.")
        return []  # Return an empty list if the container is not found
    
    list_movies = []  # initialize the function return value, which is a list of movies. 
    
    # Use count to track the number of movies processed. Now it's 0 - No movie has been processed yet.
    count = 0
    
    # Loop over each movie's h3 tag (class 'ipc-title__text').
    for d in movie_list:
        dict_each_movie = {}  # initialize the movie dictionary to store the movie data.

        # Extract the movie_id, rank, and title from the HTML
        try:
            movie_id = d.find_previous('a', class_='ipc-title-link-wrapper')['href'].split('/')[2]
            title_rank = d.text.strip()
            rank, title = title_rank.split('. ', 1)
        # If extraction fails, default to empty strings
        except Exception:
            movie_id = ""
            rank = ""
            title = ""
        # Add the extracted or default values to the movie dictionary
        finally:
            dict_each_movie["movie_id"] = movie_id
            dict_each_movie["rank"] = rank
            dict_each_movie["title"] = title

        # Locate the metadata div and extract runtime, year, and rating category
        try:
            parent_debug = d.find_parent('div') # Navigate to the parent div of the current h3 tag
            higher_parent = parent_debug.find_parent('div') if parent_debug else None # Go one level higher if possible
            metadata_div = higher_parent.find('div', class_='sc-5bc66c50-5 hVarDB dli-title-metadata') if higher_parent else None
            
            if metadata_div:
                metadata_items = metadata_div.find_all('span', class_='sc-5bc66c50-6 OOdsw dli-title-metadata-item')
                # Extract metadata fields if available
                year = metadata_items[0].text.strip() if len(metadata_items) > 0 else "N/A"
                runtime = metadata_items[1].text.strip() if len(metadata_items) > 1 else "N/A"
                rating_category = metadata_items[2].text.strip() if len(metadata_items) > 2 else "N/A"
            else:
                # Default to "N/A" if metadata is not found
                year = "N/A"
                runtime = "N/A"
                rating_category = "N/A"
        except Exception:
            # Handle exceptions by setting default values
            year = "N/A"
            runtime = "N/A"
            rating_category = "N/A"
        finally:
            # Add metadata to the movie dictionary
            dict_each_movie["runtime"] = runtime
            dict_each_movie["year"] = year

        # Extract the rating from the HTML
        try:
            rating = d.find_next('span', class_='ipc-rating-star--rating').text.strip()
        except Exception:
            # Default to "N/A" if rating is not found
            rating = "N/A"
        finally:
            dict_each_movie["rating"] = rating

        # Extract the number of votes from the HTML
        try:
            votes = d.find_next('span', class_='ipc-rating-star--voteCount').text.strip().strip('()')
        except Exception:
            # Default to "N/A" if votes are not found
            votes = "N/A"
        finally:
            dict_each_movie["votes"] = votes

        # Append the movie dictionary to the list of movies
        list_movies.append(dict_each_movie)

        # Increment the counter and break the loop if the desired number of movies is processed
        count += 1
        if count == num_of_m:
            break 

    return list_movies


###  Call statement to scrape the top-voted 500 movies
##### read_m_from_html_string(url,500)
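
Since none of the collected fields should have missing values, a quick check on the returned list can help. The helper below is a hypothetical sketch (the name `count_missing` is not part of the assignment); it treats `""` and `"N/A"` as missing because those are the defaults the function above falls back to.

```python
from collections import Counter

def count_missing(movies, placeholders=("", "N/A")):
    """Count, per field, how many movies hold a placeholder instead of a value."""
    missing = Counter()
    for movie in movies:
        for field, value in movie.items():
            if value is None or value in placeholders:
                missing[field] += 1
    return dict(missing)

# Hypothetical records for illustration; the second one lacks a runtime.
sample = [
    {"movie_id": "tt7286456", "rank": "1", "title": "Joker", "runtime": "2h 2m"},
    {"movie_id": "tt0000000", "rank": "2", "title": "Example", "runtime": "N/A"},
]
print(count_missing(sample))  # {'runtime': 1}
```

Calling `count_missing(Movies_list)` on the scraped list should return an empty dictionary when every field was found.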

In [14]:
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc"

Movies_list = read_m_from_html_string(url,500)  # To read the topVoted 500 movies
Movies_list # Display to verify structure and content

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc


[{'movie_id': 'tt7286456',
  'rank': '1',
  'title': 'Joker',
  'runtime': '2h 2m',
  'year': '2019',
  'rating': '8.4',
  'votes': '1.6M'},
 {'movie_id': 'tt4154796',
  'rank': '2',
  'title': 'Avengers: Endgame',
  'runtime': '3h 1m',
  'year': '2019',
  'rating': '8.4',
  'votes': '1.3M'},
 {'movie_id': 'tt4154756',
  'rank': '3',
  'title': 'Avengers: Infinity War',
  'runtime': '2h 29m',
  'year': '2018',
  'rating': '8.4',
  'votes': '1.2M'},
 {'movie_id': 'tt6751668',
  'rank': '4',
  'title': 'Parasite',
  'runtime': '2h 12m',
  'year': '2019',
  'rating': '8.5',
  'votes': '1M'},
 {'movie_id': 'tt7131622',
  'rank': '5',
  'title': 'Once Upon a Time... in Hollywood',
  'runtime': '2h 41m',
  'year': '2019',
  'rating': '7.6',
  'votes': '878K'},
 {'movie_id': 'tt1825683',
  'rank': '6',
  'title': 'Black Panther',
  'runtime': '2h 14m',
  'year': '2018',
  'rating': '7.3',
  'votes': '856K'},
 {'movie_id': 'tt8946378',
  'rank': '7',
  'title': 'Knives Out',
  'runtime': '2h 1

In [16]:
# Convert the list of movie dictionaries to a Pandas DataFrame
df_movies = pd.DataFrame(Movies_list)  # Each dictionary in Movies_list becomes a row, with keys as column headers
df_movies  # Display the DataFrame to verify the data structure and contents

       movie_id  rank                             title runtime  year  rating  votes
0     tt7286456     1                             Joker   2h 2m  2019     8.4   1.6M
1     tt4154796     2                 Avengers: Endgame   3h 1m  2019     8.4   1.3M
2     tt4154756     3            Avengers: Infinity War  2h 29m  2018     8.4   1.2M
3     tt6751668     4                          Parasite  2h 12m  2019     8.5     1M
4     tt7131622     5  Once Upon a Time... in Hollywood  2h 41m  2019     7.6   878K
..          ...   ...                               ...     ...   ...     ...    ...
495   tt3089630   496                      Artemis Fowl  1h 35m  2020     4.3    31K
496   tt1308728   497             The Happytime Murders  1h 31m  2018     5.5    31K
497  tt11382384   498                     The Dissident  1h 59m  2020     7.8    31K
498  tt10310140   499                            Fatman  1h 40m  2020     5.9    31K


***
# Export the collected movies to the IMDb_TopVoted.csv file.


In [19]:
df_movies.to_csv('IMDb_TopVoted.csv', index = False)

# Import the given dataset "Movies.csv" into a pandas DataFrame called df1

In [22]:
# Import the csv file to df1 and print the df1.

df1 = pd.read_csv('Movies.csv')
print(df1)

      movie_id                originalTitle  \
0    tt7286456                        Joker   
1    tt4154796            Avengers: Endgame   
2    tt4154756       Avengers: Infinity War   
3    tt6751668                     Parasite   
4    tt1825683                Black Panther   
..         ...                          ...   
495  tt9072352                        Relic   
496  tt1006569  Episode dated 9 August 2005   
497  tt8652728                        Waves   
498  tt7748244                 Mortal World   
499  tt6768578                       Dogman   

                                           description ratingCategory  \
0    During the 1980s, a failed stand-up comedian i...              R   
1    After the devastating events of Avengers: Infi...          PG-13   
2    The Avengers and their allies must be willing ...          PG-13   
3    Greed and class discrimination threaten the ne...              R   
4    T'Challa, heir to the hidden but advanced king...          PG-13 

# Import the scraped data from the IMDb_TopVoted.csv file to Pandas DataFrame called df2
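
One caveat about the encoding used in the next cell: ISO-8859-1 maps every byte to a character, so it never raises a decode error, but `to_csv` writes UTF-8 by default, and UTF-8 text read as ISO-8859-1 comes back garbled wherever it contains non-ASCII characters. The sketch below shows the effect with the standard library only; the sample title is an assumption chosen for illustration.

```python
# Hypothetical title containing a non-ASCII character.
title = "Amélie"

utf8_bytes = title.encode("utf-8")           # what a UTF-8 CSV stores on disk
as_latin1 = utf8_bytes.decode("ISO-8859-1")  # no error, but the text is garbled
as_utf8 = utf8_bytes.decode("utf-8")         # round-trips cleanly

print(as_latin1)  # AmÃ©lie
print(as_utf8)    # Amélie
```

If titles in df2 look like the first line after loading, reading the file with `encoding="utf-8"` is likely the better match for a file produced by `to_csv`.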

In [25]:
# Import the collected dataset "IMDb_TopVoted.csv" and print df2.
# To handle Latin characters that may be contained in the CSV file
# without issue, use encoding="ISO-8859-1" with pd.read_csv().
# Example: df1 = pd.read_csv('thefilename.csv', encoding="ISO-8859-1")
# Using encoding="ISO-8859-1" will avoid UnicodeDecodeError exceptions.

df2 = pd.read_csv('IMDb_TopVoted.csv', encoding="ISO-8859-1")

print(df2) # Display to verify data structure and contents

       movie_id  rank                             title runtime  year  rating  \
0     tt7286456     1                             Joker   2h 2m  2019     8.4   
1     tt4154796     2                 Avengers: Endgame   3h 1m  2019     8.4   
2     tt4154756     3            Avengers: Infinity War  2h 29m  2018     8.4   
3     tt6751668     4                          Parasite  2h 12m  2019     8.5   
4     tt7131622     5  Once Upon a Time... in Hollywood  2h 41m  2019     7.6   
..          ...   ...                               ...     ...   ...     ...   
495   tt3089630   496                      Artemis Fowl  1h 35m  2020     4.3   
496   tt1308728   497             The Happytime Murders  1h 31m  2018     5.5   
497  tt11382384   498                     The Dissident  1h 59m  2020     7.8   
498  tt10310140   499                            Fatman  1h 40m  2020     5.9   
499  tt10065694   500                        Antebellum  1h 45m  2020     5.8   

    votes  
0    1.6M  
1  

# Data cleansing and transformation for df2.
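
The two string conversions needed for df2 are small pieces of arithmetic: `2h 2m` is 2 × 60 + 2 = 122 minutes, and `1.6M` is 1.6 × 1,000,000 = 1,600,000 votes. The following is a regex-based sketch of that arithmetic, offered as one possible approach and not as the required solution; the helper names `runtime_to_minutes` and `votes_to_int` are hypothetical.

```python
import re

def runtime_to_minutes(runtime):
    """Convert strings such as '2h 2m', '2h' or '45m' to total minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*", runtime)
    if match is None or not any(match.groups()):
        return None  # unrecognized format, e.g. 'N/A'
    hours, minutes = (int(part) if part else 0 for part in match.groups())
    return hours * 60 + minutes

def votes_to_int(votes):
    """Convert strings such as '1.6M', '878K' or '1,421,777' to an integer."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = votes.replace(",", "").strip()
    factor = multipliers.get(text[-1:].upper(), 1)
    number = text[:-1] if factor != 1 else text
    # round() guards against float artifacts before the value becomes an int
    return round(float(number) * factor)

print(runtime_to_minutes("2h 2m"))  # 122
print(votes_to_int("1.6M"))         # 1600000
print(votes_to_int("878K"))         # 878000
```

Either helper can be applied to a column with `df2['votes'].apply(votes_to_int)` in the same way as the functions in the cells below.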

In [28]:
df2.info() # Display concise summary

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   movie_id  500 non-null    object 
 1   rank      500 non-null    int64  
 2   title     500 non-null    object 
 3   runtime   500 non-null    object 
 4   year      500 non-null    int64  
 5   rating    500 non-null    float64
 6   votes     500 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 27.5+ KB


In [30]:
# Clean and transform df2
 # rank, year, and votes should have a numeric integer data type.
 # The runtime column should be renamed to runtimeMinutes and the value should be in minutes,
 # for example: 2h 2m should be 122

df2 = pd.read_csv('IMDb_TopVoted.csv', encoding="ISO-8859-1")

df2['rank'] = pd.to_numeric(df2['rank'], errors='coerce').astype('Int64')
df2['year'] = pd.to_numeric(df2['year'], errors='coerce').astype('Int64')

# Convert votes to numeric values
def convert_votes(vote_str):
    if pd.isna(vote_str):
        return None
    vote_str = vote_str.replace(',', '')
    # round() avoids truncating a product that lands just below an integer in floating point
    if 'M' in vote_str:
        return round(float(vote_str.replace('M', '')) * 1_000_000)
    elif 'K' in vote_str:
        return round(float(vote_str.replace('K', '')) * 1_000)
    return int(vote_str)

# Apply the conversion function to the 'votes' column
df2['votes'] = df2['votes'].apply(convert_votes)

# Function to convert runtime strings to total minutes
def convert_runtime_to_minutes(runtime):
    if pd.isna(runtime):
        return None
    time_parts = runtime.split('h')
    if len(time_parts) == 2:
        hours = int(time_parts[0].strip())
        minutes = int(time_parts[1].strip('m').strip()) if 'm' in time_parts[1] else 0
        return hours * 60 + minutes
    elif 'm' in runtime:
        return int(runtime.strip('m').strip())
    return None

# Apply the runtime conversion function and create a new column
df2['runtimeMinutes'] = df2['runtime'].apply(convert_runtime_to_minutes)
# Drop the original 'runtime' column
df2 = df2.drop(columns=['runtime'])

# Display the first few rows of the updated DataFrame
print(df2.head())

    movie_id  rank                             title  year  rating    votes  \
0  tt7286456     1                             Joker  2019     8.4  1600000   
1  tt4154796     2                 Avengers: Endgame  2019     8.4  1300000   
2  tt4154756     3            Avengers: Infinity War  2018     8.4  1200000   
3  tt6751668     4                          Parasite  2019     8.5  1000000   
4  tt7131622     5  Once Upon a Time... in Hollywood  2019     7.6   878000   

   runtimeMinutes  
0             122  
1             181  
2             149  
3             132  
4             161  


# Enrich the given dataset (df1) by merging it with the scraped data (df2).
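
An inner merge keeps only the `movie_id` values present in both frames, which is why the merged result can have fewer rows than either input. The sketch below uses two tiny hypothetical frames (the third id in each is made up) and an outer merge with `indicator=True` to list the rows an inner merge would drop.

```python
import pandas as pd

# Hypothetical stand-ins for df1 (given data) and df2 (scraped data).
left = pd.DataFrame({
    "movie_id": ["tt7286456", "tt4154796", "tt0000001"],
    "originalTitle": ["Joker", "Avengers: Endgame", "Only In Given Data"],
})
right = pd.DataFrame({
    "movie_id": ["tt7286456", "tt4154796", "tt0000002"],
    "rank": [1, 2, 3],
})

# Inner merge: only ids found in both frames survive.
inner = pd.merge(left, right, on="movie_id", how="inner")

# Outer merge with an indicator column shows where each row came from.
audit = pd.merge(left, right, on="movie_id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(len(inner))                     # 2
print(sorted(unmatched["movie_id"]))  # ['tt0000001', 'tt0000002']
```

In this notebook both inputs have 500 rows while the merged frame ends at index 460, so some ids appear in only one of the two datasets; the same audit on df1 and df2 would show which ones.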

In [33]:
# Merge the two dataframes to one dataframe called df.
df = pd.merge(df1, df2, on="movie_id", how="inner")
print(df)

      movie_id           originalTitle  \
0    tt7286456                   Joker   
1    tt4154796       Avengers: Endgame   
2    tt4154756  Avengers: Infinity War   
3    tt6751668                Parasite   
4    tt1825683           Black Panther   
..         ...                     ...   
456  tt9252468                   Mosul   
457  tt3152592                  Scoob!   
458  tt9072352                   Relic   
459  tt8652728                   Waves   
460  tt7748244            Mortal World   

                                           description ratingCategory  \
0    During the 1980s, a failed stand-up comedian i...              R   
1    After the devastating events of Avengers: Infi...          PG-13   
2    The Avengers and their allies must be willing ...          PG-13   
3    Greed and class discrimination threaten the ne...              R   
4    T'Challa, heir to the hidden but advanced king...          PG-13   
..                                                 ...   

# Rearrange the dataset fields to be listed in the following order:
movie_id, rank, title, originalTitle, description, year, votes, rating, runtimeMinutes, ratingCategory, genres

In [36]:
# Rearrange the dataset fields.
df = df[
    [
        "movie_id",
        "rank",
        "title",
        "originalTitle",
        "description",
        "year",
        "votes",
        "rating",
        "runtimeMinutes",
        "ratingCategory",
        "genres",
    ]
]
print(df)

      movie_id  rank                   title           originalTitle  \
0    tt7286456     1                   Joker                   Joker   
1    tt4154796     2       Avengers: Endgame       Avengers: Endgame   
2    tt4154756     3  Avengers: Infinity War  Avengers: Infinity War   
3    tt6751668     4                Parasite                Parasite   
4    tt1825683     6           Black Panther           Black Panther   
..         ...   ...                     ...                     ...   
456  tt9252468   492                   Mosul                   Mosul   
457  tt3152592   494                  Scoob!                  Scoob!   
458  tt9072352   491                   Relic                   Relic   
459  tt8652728   473                   Waves                   Waves   
460  tt7748244   477            Mortal World            Mortal World   

                                           description  year    votes  rating  \
0    During the 1980s, a failed stand-up comedian i...

# Export the enriched dataset to a CSV file:

In [39]:
# Use the following naming convention: 
# Project_3_PartA_Lastname.csv
df.to_csv('Project_3_PartA_Group7.csv', index=False)