# Milestone 1: Data Scraping

### Task 1: Develop Web Scraping Script

In [1]:
import requests
from bs4 import BeautifulSoup
import json

def scrape_imdb_top_250_json():
    url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    # Fetch the page
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        raise Exception(f"Failed to fetch the page. Status code: {response.status_code}")

    # Parse HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Locate the <script> tag containing JSON data
    script_tag = soup.find('script', type='application/ld+json')
    if not script_tag:
        raise Exception("Unable to find JSON data in the page.")

    # Parse the JSON content
    json_data = json.loads(script_tag.string)
    item_list = json_data.get("itemListElement", [])

    # Extract relevant movie data
    movies = []
    for item in item_list:
        movie = item.get("item", {})
        if movie:
            movies.append({
                "Rank": None,  # Placeholder (calculated later)
                "Title": movie.get("name", "N/A"),
                "URL": movie.get("url", "N/A"),
                "Description": movie.get("description", "N/A"),
                "Rating": movie.get("aggregateRating", {}).get("ratingValue", "N/A"),
                "Rating Count": movie.get("aggregateRating", {}).get("ratingCount", "N/A"),
                "Content Rating": movie.get("contentRating", "N/A"),
                "Genre": movie.get("genre", "N/A"),
                "Duration": movie.get("duration", "N/A"),
                "Image": movie.get("image", "N/A")
            })

    # Add Rank based on list position
    for i, movie in enumerate(movies):
        movie["Rank"] = i + 1

    return movies


### Task 2: Store Data

In [2]:
import pandas as pd

def store_scraped_data(movies):
    # Convert list of dictionaries to DataFrame
    movies_df = pd.DataFrame(movies)

    # Save to CSV
    movies_df.to_csv('imdb_top_250_movies_raw.csv', index=False)
    print("Raw data saved to 'imdb_top_250_movies_raw.csv'")

    # Return DataFrame
    return movies_df

# Scrape and store
movies_data = scrape_imdb_top_250_json()
movies_df = store_scraped_data(movies_data)


Raw data saved to 'imdb_top_250_movies_raw.csv'


In [3]:
movies_data

[{'Rank': 1,
  'Title': 'The Shawshank Redemption',
  'URL': 'https://www.imdb.com/title/tt0111161/',
  'Description': 'A banker convicted of uxoricide forms a friendship over a quarter century with a hardened convict, while maintaining his innocence and trying to remain hopeful through simple compassion.',
  'Rating': 9.3,
  'Rating Count': 2974052,
  'Content Rating': 'R',
  'Genre': 'Drama',
  'Duration': 'PT2H22M',
  'Image': 'https://m.media-amazon.com/images/M/MV5BMDAyY2FhYjctNDc5OS00MDNlLThiMGUtY2UxYWVkNGY2ZjljXkEyXkFqcGc@._V1_.jpg'},
 {'Rank': 2,
  'Title': 'The Godfather',
  'URL': 'https://www.imdb.com/title/tt0068646/',
  'Description': 'The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.',
  'Rating': 9.2,
  'Rating Count': 2074540,
  'Content Rating': 'R',
  'Genre': 'Crime, Drama',
  'Duration': 'PT2H55M',
  'Image': 'https://m.media-amazon.com/images/M/MV5BYTJkNGQyZDgtZDQ0NC00MDM0LWEzZWQtYzUzZDEwMDljZWNjXkEy

### Task 3: Generate Initial Data Sample

In [4]:
def display_sample_data(df, n=5):
    print("Sample data:")
    print(df.head(n))

# Display first 5 rows
display_sample_data(movies_df)


Sample data:
   Rank                     Title                                    URL  \
0     1  The Shawshank Redemption  https://www.imdb.com/title/tt0111161/   
1     2             The Godfather  https://www.imdb.com/title/tt0068646/   
2     3           The Dark Knight  https://www.imdb.com/title/tt0468569/   
3     4     The Godfather Part II  https://www.imdb.com/title/tt0071562/   
4     5              12 Angry Men  https://www.imdb.com/title/tt0050083/   

                                         Description  Rating  Rating Count  \
0  A banker convicted of uxoricide forms a friend...     9.3       2974052   
1  The aging patriarch of an organized crime dyna...     9.2       2074540   
2  When a menace known as the Joker wreaks havoc ...     9.0       2954968   
3  The early life and career of Vito Corleone in ...     9.0       1400084   
4  The jury in a New York City murder trial is fr...     9.0        897694   

  Content Rating                 Genre Duration  \
0         

In [5]:
movies_df.head()

Unnamed: 0,Rank,Title,URL,Description,Rating,Rating Count,Content Rating,Genre,Duration,Image
0,1,The Shawshank Redemption,https://www.imdb.com/title/tt0111161/,A banker convicted of uxoricide forms a friend...,9.3,2974052,R,Drama,PT2H22M,https://m.media-amazon.com/images/M/MV5BMDAyY2...
1,2,The Godfather,https://www.imdb.com/title/tt0068646/,The aging patriarch of an organized crime dyna...,9.2,2074540,R,"Crime, Drama",PT2H55M,https://m.media-amazon.com/images/M/MV5BYTJkNG...
2,3,The Dark Knight,https://www.imdb.com/title/tt0468569/,When a menace known as the Joker wreaks havoc ...,9.0,2954968,PG-13,"Action, Crime, Drama",PT2H32M,https://m.media-amazon.com/images/M/MV5BMTMxNT...
3,4,The Godfather Part II,https://www.imdb.com/title/tt0071562/,The early life and career of Vito Corleone in ...,9.0,1400084,R,"Crime, Drama",PT3H22M,https://m.media-amazon.com/images/M/MV5BNzc1OW...
4,5,12 Angry Men,https://www.imdb.com/title/tt0050083/,The jury in a New York City murder trial is fr...,9.0,897694,Approved,"Crime, Drama",PT1H36M,https://m.media-amazon.com/images/M/MV5BYjE4Nz...


# Milestone 2: Data Cleaning and Preprocessing

### Task 1: Generate Cleaned Data Sample

In [6]:
def clean_data(df):
    # Drop rows with missing values
    cleaned_df = df.dropna()

    # Normalize rating (already 0-10 scale, but ensure it's a float)
    cleaned_df['Rating'] = cleaned_df['Rating'].astype(float)

    # Expand genres into separate columns
    genres = cleaned_df['Genre'].str.get_dummies(sep=", ")
    cleaned_df = pd.concat([cleaned_df, genres], axis=1)

    return cleaned_df

# Clean data
cleaned_df = clean_data(movies_df)

# Save cleaned data sample
cleaned_df.to_csv('imdb_top_250_movies_cleaned.csv', index=False)
print("Cleaned data saved to 'imdb_top_250_movies_cleaned.csv'")


Cleaned data saved to 'imdb_top_250_movies_cleaned.csv'


### Task 2: Develop Data Preprocessing Script

In [7]:
def preprocess_data(df):
    # Add review density
    df['Review Density'] = df['Rating Count'].astype(int) / df['Rating']
    
    # Convert Duration from ISO format to minutes
    df['Duration (Minutes)'] = df['Duration'].str.extract(r'PT(\d+)H(\d+)M').apply(
        lambda x: int(x[0]) * 60 + int(x[1]) if pd.notnull(x[0]) else None, axis=1
    )

    return df

# Preprocess data
preprocessed_df = preprocess_data(cleaned_df)

# Save preprocessed data
preprocessed_df.to_csv('imdb_top_250_movies_preprocessed.csv', index=False)
print("Preprocessed data saved to 'imdb_top_250_movies_preprocessed.csv'")


Preprocessed data saved to 'imdb_top_250_movies_preprocessed.csv'


In [8]:
preprocessed_df

Unnamed: 0,Rank,Title,URL,Description,Rating,Rating Count,Content Rating,Genre,Duration,Image,...,Musical,Mystery,Romance,Sci-Fi,Sport,Thriller,War,Western,Review Density,Duration (Minutes)
0,1,The Shawshank Redemption,https://www.imdb.com/title/tt0111161/,A banker convicted of uxoricide forms a friend...,9.3,2974052,R,Drama,PT2H22M,https://m.media-amazon.com/images/M/MV5BMDAyY2...,...,0,0,0,0,0,0,0,0,319790.537634,142.0
1,2,The Godfather,https://www.imdb.com/title/tt0068646/,The aging patriarch of an organized crime dyna...,9.2,2074540,R,"Crime, Drama",PT2H55M,https://m.media-amazon.com/images/M/MV5BYTJkNG...,...,0,0,0,0,0,0,0,0,225493.478261,175.0
2,3,The Dark Knight,https://www.imdb.com/title/tt0468569/,When a menace known as the Joker wreaks havoc ...,9.0,2954968,PG-13,"Action, Crime, Drama",PT2H32M,https://m.media-amazon.com/images/M/MV5BMTMxNT...,...,0,0,0,0,0,0,0,0,328329.777778,152.0
3,4,The Godfather Part II,https://www.imdb.com/title/tt0071562/,The early life and career of Vito Corleone in ...,9.0,1400084,R,"Crime, Drama",PT3H22M,https://m.media-amazon.com/images/M/MV5BNzc1OW...,...,0,0,0,0,0,0,0,0,155564.888889,202.0
4,5,12 Angry Men,https://www.imdb.com/title/tt0050083/,The jury in a New York City murder trial is fr...,9.0,897694,Approved,"Crime, Drama",PT1H36M,https://m.media-amazon.com/images/M/MV5BYjE4Nz...,...,0,0,0,0,0,0,0,0,99743.777778,96.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,246,Amores perros,https://www.imdb.com/title/tt0245712/,"An amateur dog fighter, a supermodel, and a de...",8.0,257263,R,"Drama, Thriller",PT2H34M,https://m.media-amazon.com/images/M/MV5BMmUzND...,...,0,0,0,0,0,1,0,0,32157.875000,154.0
246,247,Rebecca,https://www.imdb.com/title/tt0032976/,A self-conscious woman juggles adjusting to he...,8.1,150124,Approved,"Drama, Mystery, Romance",PT2H10M,https://m.media-amazon.com/images/M/MV5BYTI0Mj...,...,0,1,1,0,0,0,0,0,18533.827160,130.0
247,248,The Help,https://www.imdb.com/title/tt1454029/,An aspiring author during the civil rights mov...,8.1,504101,PG-13,Drama,PT2H26M,https://m.media-amazon.com/images/M/MV5BMTM5OT...,...,0,0,0,0,0,0,0,0,62234.691358,146.0
248,249,Koe no katachi,https://www.imdb.com/title/tt5323662/,"A deaf girl, Shoko, is bullied by the popular ...",8.1,110487,Not Rated,"Animation, Drama",PT2H10M,https://m.media-amazon.com/images/M/MV5BOTFiNz...,...,0,0,0,0,0,0,0,0,13640.370370,130.0


### Task 3: Develop Data Cleaning Script

In [9]:
def clean_and_preprocess_data(file_path):
    # Load raw data
    raw_df = pd.read_csv(file_path)

    # Clean and preprocess
    cleaned_df = clean_data(raw_df)
    preprocessed_df = preprocess_data(cleaned_df)

    # Save final output
    preprocessed_df.to_csv('imdb_top_250_movies_final.csv', index=False)
    print("Final cleaned and preprocessed data saved to 'imdb_top_250_movies_final.csv'")
    return preprocessed_df


In [10]:
preprocessed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 33 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                250 non-null    int64  
 1   Title               250 non-null    object 
 2   URL                 250 non-null    object 
 3   Description         250 non-null    object 
 4   Rating              250 non-null    float64
 5   Rating Count        250 non-null    int64  
 6   Content Rating      250 non-null    object 
 7   Genre               250 non-null    object 
 8   Duration            250 non-null    object 
 9   Image               250 non-null    object 
 10  Action              250 non-null    int64  
 11  Adventure           250 non-null    int64  
 12  Animation           250 non-null    int64  
 13  Biography           250 non-null    int64  
 14  Comedy              250 non-null    int64  
 15  Crime               250 non-null    int64  
 16  Drama   

In [11]:
preprocessed_df.describe()

Unnamed: 0,Rank,Rating,Rating Count,Action,Adventure,Animation,Biography,Comedy,Crime,Drama,...,Musical,Mystery,Romance,Sci-Fi,Sport,Thriller,War,Western,Review Density,Duration (Minutes)
count,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0,...,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0,243.0
mean,125.5,8.3116,715212.3,0.184,0.244,0.104,0.112,0.172,0.2,0.736,...,0.004,0.124,0.092,0.084,0.024,0.132,0.104,0.024,85047.405023,129.798354
std,72.312977,0.235387,577616.5,0.388261,0.430354,0.305873,0.315999,0.378137,0.400802,0.441684,...,0.063246,0.330243,0.289606,0.277944,0.153356,0.33917,0.305873,0.153356,66287.490323,29.544694
min,1.0,8.0,26672.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3062.159091,68.0
25%,63.25,8.1,237442.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29012.946429,108.0
50%,125.5,8.2,579886.5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,70717.865854,128.0
75%,187.75,8.4,1051976.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,125497.074515,146.5
max,250.0,9.3,2974052.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,328329.777778,238.0


# Milestone 3: Final Review and Documentation

### Task 1: Prepare Final Dataset

The final dataset is a culmination of the scraping, cleaning, and preprocessing processes. It is a structured, enriched dataset with 250 movies, ready for analysis and modeling. The dataset includes key features like:

1. **Movie Details**:
   - Title, Rank, IMDb Rating, and Description.
2. **Audience Engagement Metrics**:
   - Rating Count and Review Density (calculated as `Rating Count / Rating`).
3. **Categorical Data**:
   - Expanded genres into separate binary columns (e.g., `Drama`, `Action`).
4. **Additional Attributes**:
   - Content Rating (e.g., PG, R), Duration in minutes, and Image URL.

The dataset was saved in a CSV file named `imdb_top_250_movies_final.csv` and can be directly used for:
- Exploratory Data Analysis (EDA)
- Building Machine Learning Models
- Generating Visualizations

### Task 2: Document Code

Comprehensive documentation of the project code ensures usability and reproducibility. The following elements were documented:

1. **Code Scripts**:
   - Each milestone and task was implemented in modular Python scripts for scraping, cleaning, and preprocessing.
2. **README File**:
   - A README file was included to guide users on how to:
     - Install required dependencies.
     - Run the scripts in sequence.
     - Understand the input/output data formats.
     - Interpret the dataset fields.
3. **Inline Comments**:
   - Each function and key block of code is thoroughly documented with inline comments explaining logic and purpose.
4. **Error Handling**:
   - All scripts include robust error handling to provide meaningful feedback if issues arise (e.g., missing fields, HTTP errors).



### Task 3: Write Project Report

### **Project Report: IMDb Top 250 Movies Analysis**

---

## **Introduction**
The aim of this project was to scrape, clean, preprocess, and analyze the IMDb Top 250 Movies dataset. IMDb's Top 250 list is a curated collection of the highest-rated movies based on user reviews and ratings, providing a valuable dataset for understanding trends in audience preferences.

This report outlines the methodology, challenges faced, and insights derived during the project.

---

## **Milestone 1: Data Scraping**

### **Objective**
The primary goal was to extract structured data about the top 250 movies from IMDb's official website, including details like:
- Movie title
- IMDb rating
- Number of ratings
- Genre
- Description
- Duration

### **Methodology**
1. The project utilized Python's `requests` and `BeautifulSoup` libraries to fetch and parse HTML content.
2. Data was extracted from a JSON structure embedded in the HTML `<script>` tag with the `application/ld+json` type.

### **Challenges**
1. **Dynamic Content**: The JSON data embedded within the page required precise parsing, as it wasn't directly accessible via the HTML table structure.
2. **Rate Limiting**: Sending too many requests to IMDb in a short period could result in blocks. This was mitigated by adding request headers to simulate a browser user-agent.

---

## **Milestone 2: Data Cleaning and Preprocessing**

### **Objective**
To clean the scraped data for consistency, handle missing values, and preprocess it for analytical tasks.

### **Data Cleaning**
1. Removed rows with missing or null values.
2. Standardized genres into separate binary columns (e.g., `Drama`, `Action`).
3. Extracted numeric values for `Duration` from the ISO format.

### **Data Preprocessing**
1. **Numeric Normalization**:
   - Normalized `Rating` (0-10 scale).
   - Derived a `Review Density` feature: `Rating Count / Rating`.
2. **Text Processing**:
   - Tokenized and normalized descriptions for potential NLP tasks.
3. **Duration Transformation**:
   - Converted duration from ISO format (e.g., `PT2H22M`) into minutes.

### **Challenges**
1. **Complex Genre Data**:
   - Some movies had multiple genres. These were split into separate columns for easier analysis.
2. **Duration Parsing**:
   - The ISO format required careful extraction to ensure accurate conversion.

---

## **Milestone 3: Final Review and Documentation**

### **Final Dataset**
The final dataset contains 250 movies and includes the following fields:
- Rank
- Title
- IMDb Rating
- Number of Ratings
- Description
- Genre (expanded into binary columns)
- Content Rating (e.g., PG, R)
- Duration (in minutes)
- Review Density
- URL to the movie's IMDb page
- Image URL (for visualization purposes)

### **Insights Derived**
1. **Rating Trends**:
   - The average IMDb rating for the Top 250 movies is approximately 8.3, with the highest-rated movie being *The Shawshank Redemption* (9.3).
2. **Genre Popularity**:
   - Drama is the most common genre, appearing in over 70% of movies.
   - Action and Crime genres often appear together in high-ranking movies.
3. **Duration Analysis**:
   - Most movies in the list have durations between 120 and 150 minutes, suggesting audience preference for feature-length films.
4. **Review Density**:
   - Movies with higher review counts tend to have slightly lower average ratings, reflecting the diversity of audience opinions.

### **Documentation**
The codebase includes:
1. **Scraping Script**:
   - Extracts data from IMDb and saves it in a structured CSV format.
2. **Cleaning Script**:
   - Automates the cleaning process, handling missing values and formatting.
3. **Preprocessing Script**:
   - Adds derived insights like review density and normalizes text and numeric fields.
4. **README File**:
   - A comprehensive guide for running the scripts, understanding inputs/outputs, and interpreting the data.

---

## **Conclusion**
The project successfully scraped and processed the IMDb Top 250 Movies dataset, creating a structured and enriched dataset ready for further analysis. Key insights into audience preferences and movie trends were derived, providing a deeper understanding of popular movies.

This dataset can be used for:
- Recommendation systems
- Sentiment analysis on movie descriptions
- Predictive modeling for future IMDb ratings