<div align="center">

# **Data Collection and Web Scraping for Movie Dataset**

</div>

&nbsp;

## **Overview**
As part of Phase 1 of the Data Science project, this document outlines the process of gathering a dataset by scraping movie-related information from [The Movie Database (TMDB)](https://www.themoviedb.org). The goal was to collect a comprehensive dataset containing details such as movie titles, ratings, genres, directors, cast members, budgets, revenues, and user reviews. A total of **9994 movies** were successfully scraped, forming a dataset that will serve as the foundation for subsequent phases of the project, including visualization, analysis, and model development. 

&nbsp;<br>
&nbsp;

## **Methodology**

&nbsp;

### **Tools and Libraries**
The following Python libraries were used for web scraping and data processing:
- **`requests`**: For sending HTTP requests to fetch web pages.
- **`BeautifulSoup` (from `bs4`)**: For parsing HTML content and extracting relevant data.
- **`pandas`**: For structuring the extracted data into a tabular format and saving it as a CSV file.
- **`json`**: For parsing JSON data embedded in the HTML.
- **`ast`**: For safely evaluating stringified Python objects, such as the lists and dictionaries stored as text in the CSV file.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import ast 

### **Data Source**

The data was scraped from TMDB's "Top Rated Movies" section, which lists movies across multiple pages. Each movie's individual page was further scraped to gather detailed information.



### **Scraping Process**


**1.** **Initial Setup**:
   - The base URL for TMDB's top-rated movies was defined, and pagination was handled by iterating through pages 1 to 500.
   - A CSV file (`movies_data.csv`) was used to store the scraped data incrementally, allowing the process to resume from the last completed page in case of interruptions.

**2.** **Page-Level Scraping**:
   - For each page, the HTML content was fetched and parsed using `BeautifulSoup`.
   - Movie cards were identified, and links to individual movie pages were extracted for further scraping.

**3.** **Movie-Level Scraping**:
   - For each movie, detailed information was extracted, including:
     - **Basic Details**: Movie name, release date, runtime, certification, and tagline.
     - **Ratings**: User score from the movie's page.
     - **Genres**: A list of genres associated with the movie.
     - **Financials**: Budget and revenue figures.
     - **Cast and Crew**: Directors and cast members.
     - **Keywords**: Normal (rounded) and tone-related (bold) keywords.
     - **Reviews**: User reviews, including the reviewer's name, score, and review text. For each reviewer, their most-watched genres were also extracted from their profile page.
     - **Content Score**: Additional metadata about the movie's content.

**4.** **Error Handling**:
   - Default values (e.g., `'N/A'`) were assigned when specific data fields were missing or unavailable.
   - Missing elements were guarded with conditional checks, and the request for each reviewer's profile page was wrapped in a `try`/`except` block so that a failure there did not stop the run.

**5.** **Data Storage**:
   - After scraping each page, the accumulated data was written back to the CSV file to ensure progress was saved.
   - The final dataset was structured as a pandas DataFrame for easy manipulation and analysis.
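
The resume step described in the setup can be illustrated in isolation. The sketch below is a minimal illustration rather than the project's exact code: `next_start_page` and its `per_page` parameter are hypothetical names, and the default of 20 movie cards per page is an assumption taken from the `// 20` used in the scraping cell.

```python
def next_start_page(rows_saved: int, per_page: int = 20) -> int:
    """Return the 1-based page to resume from, given the rows already saved.

    Assumes every completed page contributed exactly `per_page` rows.
    """
    if rows_saved < 0 or per_page <= 0:
        raise ValueError("rows_saved must be >= 0 and per_page must be > 0")
    return rows_saved // per_page + 1


print(next_start_page(0))     # 1   (nothing saved yet)
print(next_start_page(40))    # 3   (pages 1 and 2 are complete)
print(next_start_page(6999))  # 350 (one row short of 350 full pages)
```

Because the calculation relies on every page contributing exactly `per_page` rows, a page with fewer cards would make it point at a page that was already saved, as the last call shows. Recording the page number in each row would be one way to avoid that.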

In [None]:
base_url = 'https://www.themoviedb.org'
base_movie_url = 'https://www.themoviedb.org/movie'
temp_url = '/top-rated?page='
output_file = 'movies_data.csv'

# Load previously scraped rows, if any, so an interrupted run can resume
try:
    existing_data = pd.read_csv(output_file)
    all_page_data = existing_data.to_dict('records')
except FileNotFoundError:
    all_page_data = []

# Start from the last completed page + 1 (assumes 20 movie cards per page)
start_page = len(all_page_data) // 20 + 1

for num in range(start_page, 501): 
    resp1 = requests.get(base_movie_url + temp_url + str(num)).text
    soup_data = BeautifulSoup(resp1, 'lxml')
    all_div = soup_data.find_all('div', class_='card style_1')

    page_data = [] 

    for items in all_div:
        inner_div = items.find('div', class_='content')
        inner_link = inner_div.find('a')['href']
        full_link = base_url + inner_link

        inner_data_req = requests.get(full_link).text
        new_soup_data = BeautifulSoup(inner_data_req, 'lxml')

        movie_name = inner_div.find('h2').text.strip() if inner_div.find('h2') else 'N/A'
        movie_date = inner_div.find('p').text.strip() if inner_div.find('p') else 'N/A'

        rating_div = new_soup_data.find('div', 'user_score_chart')
        if rating_div:
            rating = rating_div["data-percent"]
        else:
            rating = 'N/A'

        genre_list = []
        genre_span = new_soup_data.find('span', class_='genres')
        if genre_span:
            genre_links = genre_span.find_all('a')
            genre_list = [link.text.strip() for link in genre_links if link.text.strip()]

        run_time_find = new_soup_data.find('span', class_='runtime')
        if run_time_find:
            run_time = run_time_find.text.strip()
        else:
            run_time = "N/A"

        # Extract certification
        certification_span = new_soup_data.find('span', class_='certification')
        certification = certification_span.get_text(strip=True) if certification_span else 'N/A'

        # Extract overview
        ovr_view = new_soup_data.find('div', class_='overview')
        if ovr_view and ovr_view.find('p'): 
            overview = ovr_view.find('p').text.strip()
        else:
            overview = 'N/A'

        # Extract tagline
        tagline_h3 = new_soup_data.find('h3', class_='tagline')
        tagline = tagline_h3.get_text(strip=True) if tagline_h3 else 'N/A'


        facts_section = new_soup_data.find('section', class_='facts left_column')

        language = 'N/A'
        budget = 'N/A'
        revenue = 'N/A'

        # Extract language, budget, and revenue
        if facts_section:
            # Extract Original Language
            language_bdi = facts_section.find('bdi', string='Original Language')
            if language_bdi:
                language_p = language_bdi.find_parent('p')
                if language_p:
                    language = language_p.get_text(strip=True).replace('Original Language', '').strip()

            # Extract Budget
            budget_bdi = facts_section.find('bdi', string='Budget')
            if budget_bdi:
                budget_p = budget_bdi.find_parent('p')
                if budget_p:
                    budget = budget_p.get_text(strip=True).replace('Budget', '').strip()

            # Extract Revenue
            revenue_bdi = facts_section.find('bdi', string='Revenue')
            if revenue_bdi:
                revenue_p = revenue_bdi.find_parent('p')
                if revenue_p:
                    revenue = revenue_p.get_text(strip=True).replace('Revenue', '').strip()


        # Extract Directors (start from an empty list so a page without a
        # crew section does not reuse the previous movie's directors)
        directors = []
        ol_profile = new_soup_data.find('ol', class_='people no_image')
        if ol_profile:
            li_profile = ol_profile.find_all('li', class_='profile')
            director_set = set()
            for li in li_profile:
                character_p = li.find('p', class_="character")
                if character_p and 'Director' in character_p.text:
                    director_a = li.find('a')
                    if director_a:
                        director_set.add(director_a.text.strip())

            directors = list(director_set)

        # Extract Cast
        cast_list = []
        cast_section = new_soup_data.find('ol', class_='people scroller')
        if cast_section:
            cast_cards = cast_section.find_all('li', class_='card')
            for card in cast_cards:
                actor_p = card.find('p')
                if actor_p:
                    actor_a = actor_p.find('a')
                    if actor_a:
                        cast_list.append(actor_a.get_text(strip=True))

        # Extract keywords
        keyword_rounded = []
        keyword_bold = []
        keywords_section = new_soup_data.find('section', class_='keywords right_column')
        if keywords_section:
            ul_tag = keywords_section.find('ul')
            if ul_tag:
                for li in ul_tag.find_all('li'):
                    a_tag = li.find('a')
                    if a_tag:
                        if 'rounded' in a_tag.get('class', []):
                            keyword_rounded.append(a_tag.get_text(strip=True))
                        elif '!border' in a_tag.get('class', []):
                            keyword_bold.append(a_tag.get_text(strip=True))



        # Extract reviews
        reviews_list = []
        inner_content_div = new_soup_data.find('div', class_='inner_content')
        if inner_content_div:
            reviews_link = inner_content_div.find('p', class_='new_button')
            if reviews_link:
                base_reviews_url = base_url + reviews_link.find('a')['href']
                page_num = 1

                while True:
                    reviews_url = f"{base_reviews_url}?page={page_num}"
                    reviews_page = requests.get(reviews_url).text
                    reviews_soup = BeautifulSoup(reviews_page, 'lxml')

                    review_containers = reviews_soup.find_all('div', class_='review_container')
                    if not review_containers:
                        break

                    for container in review_containers:
                        review_contents = container.find_all('div', class_='content')
                        for review_content in review_contents:

                            # Extract writer
                            writer_h5 = review_content.find('h5')
                            writer = writer_h5.find('a').text if writer_h5 and writer_h5.find('a') else 'N/A'

                            # Extract rating
                            rating_div = review_content.find('div', class_='rating_border rating')
                            score = rating_div.text.strip() if rating_div else 'N/A'

                            # Extract review text (the default covers reviews with no teaser block)
                            review_text = 'N/A'
                            teaser_div = review_content.find('div', class_='teaser')
                            if teaser_div:
                                read_more_link = teaser_div.find('a', class_='underline')
                                if read_more_link:
                                    full_review_url = base_url + read_more_link['href']
                                    full_review_page = requests.get(full_review_url).text
                                    full_review_soup = BeautifulSoup(full_review_page, 'lxml')
                                    full_review_div = full_review_soup.find('div', class_='content column pad')
                                    if full_review_div:
                                        all_paragraphs = full_review_div.find_all('p')
                                        review_text = ' '.join(p.get_text(strip=True) for p in all_paragraphs)
                                    else:
                                        review_text = teaser_div.get_text(strip=True)
                                else:
                                    review_text = teaser_div.get_text(strip=True)

                            # Extract writer's most watched genres
                            most_watched_genres = []
                            if writer != 'N/A' and writer_h5.find('a'):
                                writer_link = base_url + writer_h5.find('a')['href']
                                try:
                                    writer_page = requests.get(writer_link).text
                                    writer_soup = BeautifulSoup(writer_page, 'lxml')
                                    
                                    for script in writer_soup.find_all('script'):
                                        if 'var genreData' in script.text:
                                            for line in script.text.split(';'):
                                                if 'var genreData' in line:
                                                    json_str = line.split('=', 1)[1].strip()
                                                    genre_data = json.loads(json_str)
                                                    # Convert to list of tuples -> (genre, count of the reviewed movies/series)
                                                    most_watched_genres = [(item['name'], item['count']) for item in genre_data]
                                                    break
                                    
                                except Exception as e:
                                    print(f"Error getting genres for {writer}: {e}")
                                    most_watched_genres = []

                            reviews_list.append({
                                'writer': writer,
                                'score': score,
                                'review': review_text,
                                'most_watched_genres': most_watched_genres
                            })

                    page_num += 1


        # Extract content_score and content_score_description
        content_score = 'N/A'
        content_score_div = new_soup_data.find('div', class_='content_score')
        if content_score_div and content_score_div.find('p'):
            content_score = content_score_div.find('p').text.strip()

        content_score_description = 'N/A'
        content_score_wrapper_div = new_soup_data.find('div', class_='content_score_wrapper')
        if content_score_wrapper_div:
            description_p = content_score_wrapper_div.find('p', attrs={'dir': 'auto'})
            if description_p:
                content_score_description = description_p.text.strip()


        movie_data = {
            'movie_name': movie_name,
            'release_date': movie_date,
            'rating': rating,
            'genre': genre_list, 
            'run_time': run_time,
            'certification': certification,
            'overview': overview,
            'tagline': tagline,
            'director': directors,  
            'language': language,
            'budget': budget,
            'revenue': revenue,
            'normal_keyword_(rounded)': keyword_rounded,
            'tone_keyword_(bold)': keyword_bold,  
            'cast': cast_list,
            'reviews':  reviews_list,
            'content_score': content_score,
            'content_score_description': content_score_description 
        }

        page_data.append(movie_data)

    # Add current page data to all data
    all_page_data.extend(page_data)

    # Save after each page
    df = pd.DataFrame(all_page_data)
    df.to_csv(output_file, index=False)
    print(f"Saved data up to page {num}")        

Saved data up to page 344
Saved data up to page 345
Saved data up to page 346
Saved data up to page 347
Saved data up to page 348
Saved data up to page 349
Saved data up to page 350


## **Dataset Specifications**
The collected dataset includes the following fields:
- `movie_name`: Name of the movie.
- `release_date`: Release date of the movie.
- `rating`: User rating percentage.
- `genre`: List of genres associated with the movie.
- `run_time`: Duration of the movie.
- `certification`: Age certification (e.g., PG, R).
- `overview`: Brief summary of the movie.
- `tagline`: Tagline or slogan of the movie.
- `director`: List of directors.
- `language`: Original language of the movie.
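- `budget`: Production budget, stored as a formatted string (e.g., `$25,000,000.00`, or `-` when TMDB lists no figure).
- `revenue`: Revenue generated by the movie, stored in the same string format as `budget`.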
- `normal_keyword_(rounded)`: List of general keywords.
- `tone_keyword_(bold)`: List of tone-related keywords.
- `cast`: List of main cast members.
- `reviews`: List of user reviews, each containing:
  - `writer`: Reviewer's name.
  - `score`: Reviewer's rating.
  - `review`: Full review text.
  - `most_watched_genres`: List of `(genre, count)` pairs taken from the reviewer's profile page.
- `content_score`: Content score shown on the movie's TMDB page (e.g., `100.0`).
- `content_score_description`: Description of the content score.
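
Because the CSV stores several of these fields as text, later phases will probably need to convert them. The helpers below are a hedged sketch of one way to do that using only the standard library. The function names are hypothetical, and the input formats (`$25,000,000.00`, `2h 22m`, stringified lists) are assumed from the sample rows rather than verified across the whole dataset.

```python
import ast
import re


def parse_money(text):
    """Convert '$25,000,000.00' to 25000000.0; return None for '-', 'N/A', or non-strings."""
    if not isinstance(text, str):
        return None
    cleaned = text.replace('$', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_runtime_minutes(text):
    """Convert '2h 22m' to 142; return None when no hours or minutes are found."""
    if not isinstance(text, str):
        return None
    match = re.fullmatch(r'(?:(\d+)h)?\s*(?:(\d+)m)?', text.strip())
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


def parse_list(text):
    """Convert a stringified list such as "['Drama', 'Crime']" back into a list."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []


print(parse_money('$25,000,000.00'))     # 25000000.0
print(parse_runtime_minutes('2h 22m'))   # 142
print(parse_list("['Drama', 'Crime']"))  # ['Drama', 'Crime']
```

Returning `None` or an empty list for unparseable input keeps placeholder values such as `-` and `N/A` from being mistaken for real numbers.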

In [5]:
data = pd.read_csv('movies_data.csv')

In [6]:
data.head()

Unnamed: 0,movie_name,release_date,rating,genre,run_time,certification,overview,tagline,director,language,budget,revenue,normal_keyword_(rounded),tone_keyword_(bold),cast,reviews,content_score,content_score_description
0,The Shawshank Redemption,17-Feb-95,87.0,"['Drama', 'Crime']",2h 22m,15,Imprisoned in the 1940s for the double murder ...,Fear can hold you prisoner. Hope can set you f...,['Frank Darabont'],English,"$25,000,000.00","$28,341,469.00","['prison', 'corruption', 'police brutality', '...",['admiring'],"['Morgan Freeman', 'Tim Robbins', 'Bob Gunton'...","[{'writer': 'John Chard', 'score': '100%', 're...",100.0,Yes! Looking good!
1,The Godfather,24-Aug-72,87.0,"['Drama', 'Crime']",2h 55m,15,"Spanning the years 1945 to 1955, a chronicle o...",An offer you can't refuse.,['Francis Ford Coppola'],English,"$6,000,000.00","$245,066,411.00","['italy', 'loss of loved one', 'love at first ...","['macabre', 'aggressive', 'vindictive', 'suspe...","['Marlon Brando', 'Al Pacino', 'James Caan', '...","[{'writer': 'futuretv', 'score': '100%', 'revi...",100.0,Yes! Looking good!
2,Selena Gomez: My Mind & Me,4-Nov-22,87.0,"['Documentary', 'Music']",1h 35m,12A,"After years in the limelight, Selena Gomez ach...","Every breath, a breakthrough.",['Alek Keshishian'],English,-,-,['music documentary'],[],"['Selena Gomez', 'Raquelle Stevens', 'Ashley C...","[{'writer': 'sufian-123', 'score': 'N/A', 'rev...",76.0,Pump it up! We're close now.
3,The Godfather Part II,20-Dec-74,86.0,"['Drama', 'Crime']",3h 22m,15,In the continuing saga of the Corleone crime f...,The rise and fall of the Corleone empire.,['Francis Ford Coppola'],English,"$13,000,000.00","$102,600,000.00","['italy', 'new york city', ""new year's eve"", '...","['melancholy', 'aggressive', 'vindictive', 'ca...","['Al Pacino', 'Robert Duvall', 'Diane Keaton',...","[{'writer': 'jkbbr549', 'score': 'N/A', 'revie...",100.0,Yes! Looking good!
4,Schindler's List,18-Feb-94,86.0,"['Drama', 'History', 'War']",3h 15m,15,The true story of how businessman Oskar Schind...,"Whoever saves one life, saves the world entire.",['Steven Spielberg'],English,"$22,000,000.00","$321,365,567.00","['factory', 'concentration camp', 'hero', 'hol...","['disturbed', 'philosophical', 'loving', 'hope...","['Liam Neeson', 'Ben Kingsley', 'Ralph Fiennes...","[{'writer': 'Mayurpanchamia', 'score': '80%', ...",100.0,Yes! Looking good!


In [8]:
data.columns

Index(['movie_name', 'release_date', 'rating', 'genre', 'run_time',
       'certification', 'overview', 'tagline', 'director', 'language',
       'budget', 'revenue', 'normal_keyword_(rounded)', 'tone_keyword_(bold)',
       'cast', 'reviews', 'content_score', 'content_score_description'],
      dtype='object')

In [None]:
for review_str in data['reviews'].dropna():
    try:
        reviews_list = ast.literal_eval(review_str)
        if reviews_list:
            review_format = reviews_list[0]
            break
    except (SyntaxError, ValueError):
        continue

review_format

{'writer': 'John Chard',
 'score': '100%',
 'review': 'Some birds aren\'t meant to be caged. The Shawshank Redemption is written and directed by Frank Darabont. It is an adaptation of the Stephen King novella Rita Hayworth and Shawshank Redemption. Starring Tim Robbins and Morgan Freeman, the film portrays the story of Andy Dufresne (Robbins), a banker who is sentenced to two life sentences at Shawshank State Prison for apparently murdering his wife and her lover. Andy finds it tough going but finds solace in the friendship he forms with fellow inmate Ellis "Red" Redding (Freeman). While things start to pick up when the warden finds Andy a prison job more befitting his talents as a banker. However, the arrival of another inmate is going to vastly change things for all of them. There was no fanfare or bunting put out for the release of the film back in 94, with a title that didn\'t give much inkling to anyone about what it was about, and with Columbia Pictures unsure how to market it, S

In [7]:
len(data)

9994

## **Challenges**

**1.** **Pagination Handling**:
   - The script resumed after an interruption by reading the existing CSV file and computing the next page as `len(all_page_data) // 20 + 1`, which assumes 20 movies per page. This avoided re-scraping pages that had already been saved.

**2.** **Dynamic Content**:
   - Some data (e.g., reviewer genres) was not present in the HTML markup itself but in a JavaScript variable (`var genreData`) inside a `<script>` tag. The script located that assignment and parsed its value with the `json` library.

**3.** **Missing Data**:
   - Default values were assigned to handle missing or incomplete data fields, ensuring the dataset remained consistent.
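
The embedded-JavaScript case can be sketched on its own. The example below is an illustrative alternative to splitting the script text on `;`: it uses a regular expression to isolate the array assigned to `genreData`. The function name `extract_genre_counts` and the sample script string are hypothetical. Only the variable name `genreData` and the `name` and `count` keys come from the scraping code above.

```python
import json
import re


def extract_genre_counts(script_text):
    """Return [(genre, count), ...] parsed from a 'var genreData = [...];' assignment."""
    # Capture the JSON array between 'var genreData =' and the closing ';'
    match = re.search(r'var\s+genreData\s*=\s*(\[.*?\])\s*;', script_text, re.DOTALL)
    if not match:
        return []
    try:
        genre_data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    return [(item['name'], item['count']) for item in genre_data]


# Made-up script content in the shape the scraper expects
sample_script = (
    'var total = 3; '
    'var genreData = [{"name": "Drama", "count": 12}, {"name": "Comedy", "count": 5}]; '
    'var other = 1;'
)

print(extract_genre_counts(sample_script))  # [('Drama', 12), ('Comedy', 5)]
```

Matching the whole array rather than splitting on `;` means a semicolon inside a string value would not cut the JSON short, although the pattern still assumes the assignment ends with `];`.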
