# Group 10: Summary

### Data Source and Description

- The primary data for this project was sourced from a publicly available curated list titled "The 500 Greatest Movies of All Time According to Letterboxd". 

- This list is maintained on the Letterboxd platform and represents a ranked compilation of films based on community ratings and critical acclaim. 

- The list spans a wide range of genres, languages, and time periods, making it suitable for a comprehensive exploratory analysis of global cinematic trends.

- Each entry in the list includes the title of a film but does not provide structured metadata such as release year, director, genre, rating, runtime, or box office performance. 

- To enrich the dataset and support further analysis, supplementary data was gathered from the Open Movie Database (OMDb), which is a widely used and accessible API for retrieving film metadata based on title searches. 

- The OMDb API returns a range of structured fields, including IMDb rating, Metascore, content rating, runtime, genres, directors, actors, plot summaries, vote counts, and gross earnings.

- The dataset resulting from this combined approach contains enriched OMDb metadata for nearly all of the 500 film titles (496 in the final dataset), enabling a multidimensional analysis of some of the most critically acclaimed films of all time.

Dataset description: the collected dataset contains the following variables.

- **Title**: The name of the movie
- **Year**: The year the movie was released
- **Rated**: The content rating of the movie
- **Runtime**: The duration of the movie in minutes
- **Genre**: The genre(s) of the movie
- **Director**: The director of the movie
- **Stars**: The main stars of the movie
- **IMDB_Rating**: The IMDb rating of the movie
- **Metascore**: The Metascore rating of the movie
- **Votes**: The number of votes the movie has received
- **Gross**: The gross earnings of the movie (in millions of dollars)
- **Plot**: A brief description of the movie (dropped during cleaning)

### Data Retrieval Technique

- The data retrieval process consisted of two major components: 
    1. web scraping;  
    2. API integration

- The first phase involved web scraping to extract all 500 movie titles from the Letterboxd list. Since the full list is distributed across five paginated views on the website, a loop-based approach was implemented using the `requests` and `BeautifulSoup` libraries in Python. For each page, movie titles were extracted from structured HTML elements and compiled into a single list. Each title was then cleaned and standardized to support accurate lookup via the OMDb API.

- In the second phase, the compiled list of titles was used to programmatically query the OMDb API. Each movie title was submitted as a search query, and structured metadata was returned in JSON format. A delay was implemented between API calls to ensure rate limit compliance. The returned records were stored in a tabular format and reviewed for completeness.

- To address potential mismatches and incomplete responses, a title correction dictionary was created. This allowed for re-submission of failed queries with revised or corrected movie titles (e.g., adding apostrophes, correcting translations, or adjusting special characters). This additional step significantly improved the match rate with OMDb and ensured a more complete final dataset.

- The final retrieval process yielded a well-structured DataFrame containing detailed information for nearly all of the 500 films. All scripts for scraping and API retrieval are available in the corresponding project notebooks, and the entire process was designed for reproducibility and transparency.
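
The second phase and the correction-dictionary retry can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `TITLE_CORRECTIONS`, `build_omdb_url`, and `fetch_all` are hypothetical names, and the HTTP call is passed in as `get_json` so the logic can be run without network access. It assumes OMDb's `t` (title) and `apikey` query parameters and the `"Response"` field in the returned JSON.

```python
import time
from urllib.parse import urlencode

OMDB_URL = "https://www.omdbapi.com/"

# Hypothetical example entry: scraped title -> title that OMDb recognises.
TITLE_CORRECTIONS = {"Schindlers List": "Schindler's List"}


def build_omdb_url(title, api_key):
    """Build a title-lookup URL from the `t` and `apikey` parameters."""
    return f"{OMDB_URL}?{urlencode({'t': title, 'apikey': api_key})}"


def fetch_all(titles, get_json, api_key, corrections=TITLE_CORRECTIONS, delay=0.0):
    """Query every title; retry once with a corrected title when a lookup fails."""
    records, failed = [], []
    for title in titles:
        data = get_json(build_omdb_url(title, api_key))
        if data.get("Response") != "True" and title in corrections:
            # Second attempt with the remapped title.
            data = get_json(build_omdb_url(corrections[title], api_key))
        if data.get("Response") == "True":
            records.append(data)
        else:
            failed.append(title)
        time.sleep(delay)  # pause between calls to stay within rate limits
    return records, failed
```

In real use, `get_json` could be something like `lambda url: requests.get(url, timeout=10).json()`, and the `failed` list shows which titles still need an entry in the correction dictionary.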

### Data Enrichment and Cleaning

- Data Enrichment: Some movies failed to return results from the OMDb API due to minor discrepancies (e.g., missing apostrophes, alternate translations, or formatting issues). A title correction dictionary was created to remap these entries to their accurate titles. This allowed for a second round of API queries, significantly improving coverage and reducing data loss.

- The retrieved raw data was transformed into a tidy tabular format. 

- Renamed and standardized column names (e.g., changed 'Title' to 'Movie Name') for better clarity and usability.

- Removed unnecessary columns such as 'Plot' that were not used in analysis or modeling.

- Fixed year inconsistencies by extracting 4-digit years using regular expressions from formats like '1984–1990'.

- Cleaned numeric columns such as 'Votes', 'Gross', and 'Runtime' by removing non-numeric characters (e.g., commas, '$', 'M') and converting to appropriate numeric types.

- Runtime values were initially stored as unstructured strings in formats like `"142 min"` or `"2h 22m"`; all entries were converted to a consistent numeric format representing duration in minutes.

- Handled missing data:

    - Imputed missing values in 'Runtime' using the median.

    - Imputed missing 'Votes' with the median after cleaning.

    - Filled missing 'Rated' entries with "Not Rated" to retain these records for analysis.

- Duplicate rows were checked and confirmed to be absent. 
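
The cleaning rules above can be illustrated with a few small helpers. This is a plain-Python sketch under stated assumptions: the function names are hypothetical, the notebook presumably applied equivalent logic to DataFrame columns, and the exact regular expressions used there are not known.

```python
import re
from statistics import median


def parse_year(text):
    """Return the first 4-digit year found, e.g. '1984–1990' -> 1984."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None


def parse_runtime(text):
    """Convert '142 min' or '2h 22m' to minutes; return None if unparseable."""
    match = re.fullmatch(r"(?:(\d+)\s*h)?\s*(?:(\d+)\s*m(?:in)?)?", text.strip().lower())
    if match and (match.group(1) or match.group(2)):
        return int(match.group(1) or 0) * 60 + int(match.group(2) or 0)
    return None


def parse_number(text):
    """Strip symbols such as '$', ',' and 'M' and return a float, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def impute_median(values):
    """Replace None entries with the median of the known values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]
```

For example, `parse_runtime("2h 22m")` and `parse_runtime("142 min")` both give `142`, and `impute_median([100, None, 140])` fills the gap with `120.0`.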


### Data Quality Tests

- Comprehensive data validation was performed using automated Python assert statements to ensure dataset reliability and readiness for analysis.

- All tests passed, confirming data integrity across structure, types, ranges, and completeness.

- Volume checks verified that the dataset has sufficient records (496 movies).

- Range validation ensured:

    - IMDb ratings are within [0.0, 10.0]

    - Years fall between 1880 and 2025

    - Runtimes are positive

- Type checks confirmed that:

    - 'Votes' and 'Gross' are stored as numeric (float)

    - 'Rated' is stored as an object (string) column and treated as a categorical variable

- Completeness checks verified that no missing values remain in critical fields like 'Rated'

- A reusable verification pipeline was created, acting as both a guardrail for quality and implicit documentation of expected data structure.
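
A reusable check of this kind might look like the sketch below. It is an assumption-laden stand-in rather than the project's pipeline: `validate` is a hypothetical name, rows are plain dictionaries instead of a DataFrame, the column names follow the variable list earlier in this summary, and the sample values are placeholders, not real data.

```python
def validate(rows, min_rows=1):
    """Raise AssertionError if any structural, range, type or completeness check fails."""
    assert len(rows) >= min_rows, "too few records"
    for row in rows:
        assert 0.0 <= row["IMDB_Rating"] <= 10.0, "IMDb rating out of range"
        assert 1880 <= row["Year"] <= 2025, "year out of range"
        assert row["Runtime"] > 0, "runtime must be positive"
        assert isinstance(row["Votes"], float), "Votes must be float"
        assert isinstance(row["Gross"], float), "Gross must be float"
        assert isinstance(row["Rated"], str) and row["Rated"], "Rated is missing"
    keys = [(row["Title"], row["Year"]) for row in rows]
    assert len(keys) == len(set(keys)), "duplicate rows found"
    return True


# Placeholder rows for illustration only.
sample = [
    {"Title": "Film A", "Year": 1994, "Runtime": 142, "IMDB_Rating": 8.3,
     "Votes": 250000.0, "Gross": 28.3, "Rated": "R"},
    {"Title": "Film B", "Year": 1954, "Runtime": 207, "IMDB_Rating": 8.6,
     "Votes": 90000.0, "Gross": 0.3, "Rated": "Not Rated"},
]
validate(sample, min_rows=2)
```

Because each assertion carries a message, a failing run states which expectation was violated, which is what lets the checks double as documentation of the expected structure.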

### Summary Statistics

The data analysis section explores key quantitative metrics across the top 500 movies and highlights important trends through summary statistics and visualizations.

- **Data Types and Structure**  
  The dataset includes both numeric and categorical variables. A preliminary breakdown was done to separate numeric features (e.g., IMDb rating, runtime, votes, gross) from categorical ones (e.g., rating category, genre, director).

- **Distributional Insights**  
  Histograms and boxplots were generated for all numeric columns. These revealed that:
  - **IMDb ratings** are tightly clustered between 7.7 and 8.7, with a strong central tendency around 8.3.
  - **Runtime** values show a peak between 90 and 130 minutes, consistent with standard feature-length films.
  - **Vote counts** are heavily right-skewed, with a small number of movies receiving millions of votes and many others under 100k.

- **Descriptive Statistics**  
  Summary statistics using `.describe()` were applied to `Votes` and `Gross`, showing:
  - A **median vote count** significantly lower than the mean, confirming skew.
  - A wide range of gross revenues, from low-budget films to blockbusters crossing hundreds of millions.

- **Extrema Identification**  
  Specific queries were run to identify standout films:
  - The movies with the **highest IMDb rating**, **highest Metascore**, **most votes**, and **highest gross** were printed.
  - Likewise, the movies with the **lowest values** in those fields were also extracted.
  - Additional insights like the **longest and shortest runtimes** were computed and attributed to specific films and years.

- **Categorical Breakdown**  
  A frequency count of movie certificates (content ratings like PG, R, etc.) was produced using `.value_counts()`, followed by a `countplot` to visualize their distribution:
  - The largest category, with over 150 films, is 'Not Rated', which the analysis places in the **All Ages** group.
  - The 'R' category follows with around 130 movies, indicating content for **Adults**.
  - 'PG-13', 'PG', and 'Approved' occupy the third, fourth, and fifth positions, each with approximately 50-70 movies; 'PG-13' and 'PG' represent content for **Teens/PG**.
  - 'Passed' and 'G' each have just under 20 movies and, like 'Approved', belong to the **All Ages** group.
  - Other categories have significantly fewer movies, typically fewer than 10.

These statistical insights provide a strong foundation for identifying patterns in critical and commercial success, as well as content trends across the most celebrated films. More detailed information along with inference for each step is mentioned in the notebook.
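
The skew check, certificate counts, and extrema lookups described above can be sketched in plain Python. The notebook used pandas (`.describe()`, `.value_counts()`); the version below is only an illustration of the same ideas, `summarize` is a hypothetical helper, and the rows are placeholder values rather than figures from the dataset.

```python
from collections import Counter
from statistics import mean, median


def summarize(rows):
    """Return a skew indicator, certificate counts and extrema for a list of film rows."""
    votes = [row["Votes"] for row in rows]
    return {
        # A mean well above the median points to a right-skewed distribution.
        "votes_right_skewed": mean(votes) > median(votes),
        # Certificate frequencies, most common first.
        "certificate_counts": Counter(row["Rated"] for row in rows).most_common(),
        "highest_rated": max(rows, key=lambda row: row["IMDB_Rating"])["Title"],
        "longest": max(rows, key=lambda row: row["Runtime"])["Title"],
        "shortest": min(rows, key=lambda row: row["Runtime"])["Title"],
    }


# Placeholder rows for illustration only.
sample = [
    {"Title": "Film A", "Votes": 2_000_000, "Rated": "R", "IMDB_Rating": 8.7, "Runtime": 152},
    {"Title": "Film B", "Votes": 90_000, "Rated": "Not Rated", "IMDB_Rating": 8.3, "Runtime": 207},
    {"Title": "Film C", "Votes": 50_000, "Rated": "Not Rated", "IMDB_Rating": 8.1, "Runtime": 88},
    {"Title": "Film D", "Votes": 30_000, "Rated": "PG", "IMDB_Rating": 7.9, "Runtime": 101},
]
summary = summarize(sample)
```

With these placeholder rows, one heavily voted film pulls the mean far above the median, which is the same pattern the real vote counts show.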

## Visualizations

### Distribution of IMDb Ratings
![IMDb Ratings](/Users/tanvimehrotra/Downloads/New/imgs/imdb-ratings.png)

### Distribution of Movie Durations
![Movie Durations](/Users/tanvimehrotra/Downloads/New/imgs/movie-durations.png)

### Histogram and Boxplots of Key Numeric Fields
![Histograms and Boxplots](/Users/tanvimehrotra/Downloads/New/imgs/histograms.png)

### Gross Revenue vs IMDb Rating (coloured by Certificate)
![Interactive Plot](/Users/tanvimehrotra/Downloads/New/imgs/interactive-plot.png)

### Top 10 Most Common Genres
![Top Genres](/Users/tanvimehrotra/Downloads/New/imgs/top-genres.png)

### Top 10 Most Frequent Actors
![Top Actors](/Users/tanvimehrotra/Downloads/New/imgs/top-actors.png)

### Top 10 Directors with Most Movies
![Top Directors](/Users/tanvimehrotra/Downloads/New/imgs/top-directors.png)

### Distribution of Movie Certificates
![Movie Certificates](/Users/tanvimehrotra/Downloads/New/imgs/movie-certs.png)

### Movies with Lowest Ratings, Gross, and Runtime
![Lowest Metrics](/Users/tanvimehrotra/Downloads/New/imgs/lowest-metrics.png)

### Movies with Highest Ratings, Gross, and Runtime
![Highest Metrics](/Users/tanvimehrotra/Downloads/New/imgs/highest-metrics.png)


The table below describes the different content ratings.

| Content Rating | Description |
|----------------|-------------|
| Not Rated      | Movies that have not been assigned a specific rating. |
| R              | Restricted; children under 17 require an accompanying parent or adult guardian. |
| PG             | Parental guidance suggested; some material may not be suitable for children. |
| Approved       | Movies that have been approved for all audiences. |
| PG-13          | Parents strongly cautioned; some material may be inappropriate for children under 13. |
| G              | Suitable for all ages and audiences. |
| Passed         | Movies that have been passed for all audiences. |
| TV-MA          | Intended for mature audiences only; may contain graphic violence, explicit sex, or strong language. |
| Unrated        | Movies with an unknown or missing rating. |
| TV-G           | General audience; suitable for all ages. |
| M/PG           | Intended for mature audiences; parental guidance suggested. |
| GP             | General audiences, parental guidance suggested (the predecessor of PG). |
| TV-14          | Parents strongly cautioned; some material may be inappropriate for children under 14. |
| NC-17          | Adults only; no one 17 and under admitted. |

# Key Findings

- **Revenue and vote counts are both positively correlated with year of release**, reflecting an increasingly engaged global audience and evolving movie markets. As the film industry expands and the population grows, both viewership and revenue potential rise accordingly.

- **A noticeable decline in Metascore over time suggests a shift in critical standards or content focus**, potentially indicating changing critic expectations or a tradeoff between commercial appeal and critical acclaim in modern films.

- **IMDb Rating remains a strong indicator of both quality and commercial potential**, as shown by the regression model, in which features like vote count and gross earnings significantly influence rating prediction.

- **Regression analysis also reveals that an 'R' rating and a higher vote count are among the stronger predictors of IMDb Rating**, highlighting the importance of audience engagement and content targeting in influencing success.

- **The films in the dataset are heavily concentrated from the 1980s onward**, showing a shift toward modern filmmaking practices and cultural preferences. However, this trend is nuanced rather than strictly linear.

- **Film durations have gradually increased over the years**, often ranging from 90 to 150 minutes or more. This may reflect growing narrative complexity, increased budgets, and shifting audience expectations.

- **There is a positive correlation between audience ratings and critic ratings overall**, suggesting general alignment in perception, though discrepancies arise where critics focus more on artistic merit and audiences on entertainment value.

- **The dataset reflects a broad mix of content ratings (R, PG-13, etc.)**, with R the most prevalent among films that carry a rating. These often explore universal themes like love, family, and identity, resonating across demographics.

- **International representation is strong among the top directors**, including notable Japanese auteurs like Akira Kurosawa and Hayao Miyazaki, showing that cinematic excellence transcends geography, though Hollywood continues to dominate in global reach and revenue.
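
Several of the findings above rest on correlations (release year against votes and revenue, audience ratings against critic ratings). As a reference for how such a coefficient is computed, here is a minimal Pearson correlation sketch. `pearson_r` is a hypothetical helper, the notebook may well have used a library routine instead, and the example inputs are placeholders rather than values from the dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder values: a perfectly increasing and a perfectly decreasing relationship.
rising = pearson_r([1, 2, 3], [2, 4, 6])
falling = pearson_r([1, 2, 3], [3, 2, 1])
```

A value near +1 indicates that the two variables rise together, a value near -1 that one falls as the other rises, and a value near 0 that there is little linear relationship.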
