# CIND 820 – Big Data Analytics Project  
## Milestone 3 – Initial Results and Code  

**Author:** Kabilan Puvaneswaran  
**Student Number:** 501038628  
**Supervisor:** Tamer Abdou  
**Date of Submission:** November 10, 2025  

**Course:** CIND 820 – Big Data Analytics Project  
**Institution:** Toronto Metropolitan University  

---

### **Project Title:**  
**Streaming Platforms as Cultural Intermediaries: An Exploratory Analysis of Netflix, Amazon Prime Video, and Disney+**


## Data Analysis and Initial Results

This project examines how three streaming services (Netflix, Amazon Prime Video, and Disney+) differ in catalogue size, content composition, and cultural diversity. After cleaning and merging the Kaggle datasets, the final dataset includes roughly 13,000 movies and TV shows, down from more than 19,000 original records once duplicates, invalid entries, and records missing essential fields were removed. Most missing fields were identified after enrichment with IMDb data. Even after three matching passes using different criteria, over 6,000 titles could not be matched to retrieve their missing country and rating information. 
All preprocessing and transformation steps are documented in `Cleaning.ipynb`, and the resulting dataset is stored as `clean_streaming_metadata.csv`.

---

### Platform Representation
Across all platforms, about three-quarters of entries are movies and the remaining quarter are TV shows. The *titles_by_platform* figure shows a clear hierarchy in catalogue size: Netflix holds the largest share, followed by Amazon Prime, with Disney+ trailing behind. It confirms Netflix’s numerical dominance. However, Amazon’s count is close behind, which may indicate greater catalogue overlap than is often assumed, while Disney+ remains proportionally small, in part because it is the newest service. 
This context explains why catalogue-level patterns in later sections are influenced most by Netflix’s share.

---

### Geographic and Cultural Diversity
While the *top_countries* chart provides some perspective on global representation, the country field remains highly inconsistent.  
Even after multiple cleaning and matching passes, many entries could not be aligned to IMDb records, leaving roughly one-third of titles without reliable country data.  
After reviewing the data, several matched values were also incorrect.  
>For example, Naruto appeared as being from *the Netherlands* and Death Note as *India*.
  
These mismatches likely stem from partial IMDb lookups, variable formatting, and earlier cleaning steps that introduced non-standard codes.

Because of this, country-level insights should be treated as approximate rather than definitive.  
They still provide a general sense of where production is concentrated, mainly the United States, India, Japan, and the U.K., but the data is not accurate enough for modelling.  
This limitation was acknowledged during preprocessing; however, the field remains useful for contextual and descriptive summaries of global media diversity.

---

### Genre Distribution
The *top_genres* chart shows the most common genres across the entire dataset rather than by individual platform.  
Drama, Comedy, and Documentary appear as the dominant categories overall, making up a large portion of available titles.  
This distribution reflects general viewing and production trends across major streaming platforms rather than specific platform strategies.  
The combined plot provides a broader view of global content composition and cultural reach.  
The *eda_top_15_genres_overall* figure confirms that Drama, Comedy, and Documentary dominate the full dataset, but the *eda_heatmap_platform_by_genre_top_15* visual adds additional detail by showing how genre frequency differs across services.  
> Netflix covers the widest variety, including international genres such as Crime, Thriller, and Anime, whereas Disney+ concentrates almost entirely on Family, Animation, and Adventure content.

The *eda_platform_share_within_top_15_genres* figure also highlights that Amazon Prime’s contribution is uneven. It is strong in Action and Thriller but limited elsewhere.  

Together, these plots demonstrate how genre diversity varies by platform even when overall genre rankings look similar.
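
The platform-by-genre counts and within-genre shares behind these figures can be approximated with a crosstab. The following is a minimal sketch, not the notebook's actual code: the rows are made up for illustration, and it assumes `genres` is stored as a Python list per title.

```python
import pandas as pd

# Made-up rows for illustration only; `genres` holds Python lists per title.
titles = pd.DataFrame({
    "platform": ["Netflix", "Netflix", "Disney+", "Amazon Prime"],
    "genres": [["Drama", "Crime"], ["Drama"], ["Family", "Animation"], ["Action", "Drama"]],
})

# One row per (title, genre) pair so multi-genre titles count once per genre.
long = titles.explode("genres")

# Rows are genres, columns are platforms, cells are title counts.
counts = pd.crosstab(long["genres"], long["platform"])

# Keep the most frequent genres overall (15 in the report; 3 here for brevity).
top_n = 3
top = counts.loc[counts.sum(axis=1).nlargest(top_n).index]

# Each platform's share within a genre; every row sums to 1.
share = top.div(top.sum(axis=1), axis=0)
print(share.round(2))
```

Normalizing by row answers "who supplies this genre?", while normalizing by column would instead answer "what does this platform's catalogue consist of?"; the two views can rank platforms differently.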

---

### Release-Year Trends
The *content_release_trends_by_platform* histogram shows a clear decade-by-decade pattern.  
Netflix output expands rapidly after 2013, aligning with its global rollout.  
Amazon Prime’s additions rise slowly through 2015–2020, and Disney+ entries appear primarily from 2019 onward.  
These trajectories reflect differing platform launch timelines rather than shared industry growth.

---

### IMDb Ratings
The *imdb_rating_distribution_by_platform* figure shows moderate variation across platforms.  
Disney+ titles trend slightly higher, possibly reflecting its curated, franchise-based catalogue, while Amazon exhibits a wider range.  
Overall, the median IMDb rating sits around 6.5, indicating relatively consistent quality across services.


---

### Platform Share
The *platform_distribution* chart confirms that Netflix dominates overall share, followed by Amazon Prime and Disney+.  
Together, they account for nearly the entire mainstream streaming catalogue represented in this analysis.

---

### Summary
- Netflix leads in scale, range, and global presence.  
- Disney+ maintains a smaller, more focused catalogue with higher average ratings.  
- Amazon Prime sits between the two in both size and diversity.  
- Streaming production accelerated sharply after 2015.  
- Data limitations in *country* fields prompted a shift toward genre- and language-based cultural analysis.  

These initial findings establish the descriptive baseline for the popularity prediction and content clustering components developed in later milestones.


## Data Preparation and Cleaning

Three raw Kaggle datasets were used: *Netflix Shows*, *Amazon Prime Movies and TV Shows*, and *Disney Movies and TV Shows*.  
Each dataset had minor schema differences and variable field completeness.  
The goal of cleaning was to produce one standardized metadata table suitable for descriptive and exploratory analysis.

---

### Consolidation
- Standardized column names and adopted a common schema:  
  `title`, `type`, `country`, `release_year`, `genres`, `imdb_rating`, `imdb_votes`,  
  `director`, `cast`, `date_added`, `duration`, and a new field `platform` identifying data source.  
- Combined all three datasets into a single DataFrame.  
- Removed duplicate titles using a composite key of `title + platform + release_year`.
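
The consolidation steps above can be sketched in pandas as follows. This is an illustrative outline rather than the code in `Cleaning.ipynb`: the three small frames, the placeholder titles, and the `RENAME` map are assumptions, and the real source column names may differ.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the three raw Kaggle tables.
netflix = pd.DataFrame({
    "title": ["Title A", "Title A", "Title B"],
    "type": ["TV Show", "TV Show", "Movie"],
    "release_year": [2017, 2017, 2018],
    "listed_in": ["Crime, Drama", "Crime, Drama", "Drama"],
})
amazon = pd.DataFrame({
    "title": ["Title B"],
    "type": ["Movie"],
    "release_year": [2018],
    "listed_in": ["Drama"],
})
disney = pd.DataFrame({
    "title": ["Title C"],
    "type": ["Movie"],
    "release_year": [2017],
    "listed_in": ["Animation, Family"],
})

# Assumed rename map from a raw column name to the common schema.
RENAME = {"listed_in": "genres"}

def consolidate(frames):
    """Tag each frame with its platform, align columns, stack, and de-duplicate."""
    tagged = [
        df.rename(columns=RENAME).assign(platform=name)
        for name, df in frames.items()
    ]
    combined = pd.concat(tagged, ignore_index=True)
    # Composite key from the text: title + platform + release_year.
    deduped = combined.drop_duplicates(subset=["title", "platform", "release_year"])
    return deduped.reset_index(drop=True)

combined = consolidate({"Netflix": netflix, "Amazon Prime": amazon, "Disney+": disney})
print(combined[["title", "platform", "release_year"]])
```

Because `platform` is part of the key, a title carried by two services is kept once per service, which is what the platform comparisons need; only repeats within a single platform are dropped.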

---

### IMDb Integration
- The OMDb API was tested for enrichment but was discovered to be impractical due to strict daily request-rate limits.  
- IMDb ratings and votes were instead merged from an external IMDb metadata dump using `title` and `release_year` as join keys.  
- Approximately 60% of entries have valid IMDb data; unmatched titles carried missing values (`NaN`) after the merge and were then removed.
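
A left join with a merge indicator is one way to perform this enrichment and measure the match rate. The sketch below uses made-up rows; lower-casing and trimming the title before joining is an assumption added for illustration, not a step confirmed by the notebook.

```python
import pandas as pd

# Made-up catalogue and IMDb rows for illustration only.
catalogue = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C"],
    "release_year": [2015, 2018, 2020],
})
imdb = pd.DataFrame({
    "title": ["title a ", "Title B"],
    "release_year": [2015, 2018],
    "imdb_rating": [7.4, 6.1],
    "imdb_votes": [1200, 850],
})

def title_key(series):
    """Assumed normalization: trim whitespace and lower-case the title."""
    return series.str.strip().str.lower()

catalogue["title_key"] = title_key(catalogue["title"])
imdb["title_key"] = title_key(imdb["title"])

# Left join keeps every catalogue row; `_merge` records whether a match was found.
merged = catalogue.merge(
    imdb.drop(columns="title"),
    on=["title_key", "release_year"],
    how="left",
    indicator=True,
)
match_rate = (merged["_merge"] == "both").mean()

# Drop titles that found no IMDb record, then discard the helper columns.
matched = merged.dropna(subset=["imdb_rating"]).drop(columns=["title_key", "_merge"])
print(f"match rate: {match_rate:.0%}, rows kept: {len(matched)}")
```

Joining on title and year alone can pair unrelated works that share both values, which is one plausible source of mismatches of the kind reported for the `country` field.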

---

### Cleaning and Transformation
- Converted list-like fields (`genres`, `country`) to standardized Python lists for consistent parsing.  
- Normalized `duration` separately for TV shows, which are labelled by season count rather than runtime.  
- Coerced `release_year` to integer and removed impossible or null values.  
- Dropped rows with missing `title` or `type`.  
- After all filters and merges, the dataset was reduced from roughly 19,000 to about 13,000 rows.
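
As a rough illustration of these transformations, the sketch below parses a comma-separated field into lists, coerces `release_year`, and drops unusable rows. The sample rows and the year bounds (`min_year`, `max_year`) are assumptions; the notebook may define "impossible" values differently.

```python
import pandas as pd

# Made-up raw rows covering the failure cases described above.
raw = pd.DataFrame({
    "title": ["Title A", None, "Title C", "Title D"],
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "release_year": ["2015", "2018", "20xx", "1850"],
    "genres": ["Drama, Comedy", "Action", " Documentary ", None],
})

def to_list(value):
    """Split a comma-separated string into a list of trimmed labels."""
    if not isinstance(value, str):
        return []
    return [part.strip() for part in value.split(",") if part.strip()]

def clean(df, min_year=1900, max_year=2025):
    """Apply the list parsing, year coercion, and row filters (bounds are assumed)."""
    out = df.copy()
    out["genres"] = out["genres"].apply(to_list)
    # Unparseable years become NaN instead of raising.
    out["release_year"] = pd.to_numeric(out["release_year"], errors="coerce")
    out = out.dropna(subset=["title", "type", "release_year"])
    out = out[out["release_year"].between(min_year, max_year)].copy()
    out["release_year"] = out["release_year"].astype(int)
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

In this toy input only the first row survives: one row lacks a title, one has an unparseable year, and one falls outside the assumed year range.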

---

### Country Field Limitations
- The `country` column is only about 60% complete and often contains multi-country or ambiguous entries.  
- Several mismatches occurred during enrichment (e.g., *Naruto* listed as the Netherlands; *Death Note* listed as India).  
- It remains useful for descriptive summaries of cultural reach but not as a reliable analytical variable.

---

### Output and Reproducibility
- The final cleaned dataset is saved as `clean_streaming_metadata.csv`.  
- All steps are reproducible through a single execution of `Cleaning.ipynb`, provided the three Kaggle datasets are pre-downloaded and placed in the same directory.  
- The code’s automatic download functions can fail due to connection or API issues, so using local copies ensures consistent execution.  
- The IMDb metadata used for enrichment is not stored in the GitHub repository due to file size constraints but can be easily accessed from public IMDb datasets or standard archives if replication is needed.

---

This cleaning pipeline standardizes all three streaming sources into a unified structure with minimal duplication and improved metadata quality, forming the foundation for the exploratory analysis completed in Milestone 3.


## Preliminary Modeling – Popularity Prediction

To test whether basic metadata fields can predict title popularity, a simple logistic regression model was trained using `platform`, `type`, `release_year`, and the top 20 genres as features.  
A binary label (`popular = 1`) was assigned to titles with IMDb ratings ≥ 7.0.

### Model Setup
- **Algorithm:** Logistic Regression (baseline)
- **Features:** Platform, Type, Release Year, Top 20 Genre Indicators  
- **Target:** Binary popularity label (IMDb rating ≥ 7.0)  
- **Train/Test Split:** 70 / 30 (stratified)
- **Dataset:** `clean_streaming_metadata.csv`
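
A baseline with this setup could be assembled in scikit-learn roughly as shown below. This is a sketch under stated assumptions, not the project's actual training code: the genre indicator column names, the scaling of `release_year`, `max_iter`, the random seed, and the synthetic demo rows are all illustrative choices.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_model(genre_columns):
    """One-hot encode categoricals, scale the year, pass 0/1 genre flags through."""
    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["platform", "type"]),
        ("year", StandardScaler(), ["release_year"]),
        ("genres", "passthrough", list(genre_columns)),
    ])
    return Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])

def fit_baseline(df, genre_columns, threshold=7.0, seed=42):
    """Label titles at or above the rating threshold and fit on a stratified 70/30 split."""
    y = (df["imdb_rating"] >= threshold).astype(int)
    X = df[["platform", "type", "release_year", *genre_columns]]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed
    )
    model = build_model(genre_columns).fit(X_train, y_train)
    return model, X_test, y_test

# Synthetic demo rows; the hypothetical genre_* columns stand in for the top-20 indicators.
demo = pd.DataFrame({
    "platform": ["Netflix", "Amazon Prime", "Disney+", "Netflix"] * 10,
    "type": ["Movie", "TV Show"] * 20,
    "release_year": [2000 + (i % 20) for i in range(40)],
    "imdb_rating": [5.0 + (i % 5) * 0.75 for i in range(40)],
    "genre_Drama": [i % 2 for i in range(40)],
    "genre_Horror": [(i + 1) % 2 for i in range(40)],
})
model, X_test, y_test = fit_baseline(demo, ["genre_Drama", "genre_Horror"])
print(f"held-out accuracy on synthetic data: {model.score(X_test, y_test):.3f}")
```

Wrapping preprocessing and the classifier in one `Pipeline` means the encoder and scaler are fitted on the training split only, which avoids leaking test-set information into the features.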

### Model Performance
| Metric | Score |
|:--------|:------:|
| Accuracy | **0.729** |
| Precision | 0.578 |
| Recall | 0.439 |
| F1-Score | 0.499 |

| ![Confusion Matrix – Logistic Regression](../figures/cm_regression.png) |
|:--:|

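The model achieved ~73% accuracy, above a 50% random baseline. However, the reported accuracy, precision, and recall together imply that only about 31% of test titles are labelled popular, so a classifier that always predicts "not popular" would already score roughly 69%; the gain over that majority-class baseline is therefore modest.  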
Genres such as *Documentaries*, *Dramas*, and *Docuseries* showed the strongest positive influence on popularity, while *Horror* and *Thrillers* trended negatively.
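
The reported metrics can be cross-checked with a few lines of arithmetic. The sketch below recomputes F1 from precision and recall, and derives the positive-class share and majority-class baseline implied by the table; the helper name `implied_class_balance` is introduced here for illustration only.

```python
def implied_class_balance(accuracy, precision, recall):
    """Return (positive_rate, majority_baseline) implied by three reported metrics.

    With P the share of positive labels in the test set:
      TP = recall * P
      FN = (1 - recall) * P
      FP = TP * (1 - precision) / precision
    and the error rate 1 - accuracy equals FP + FN.
    """
    errors_per_positive = recall * (1 - precision) / precision + (1 - recall)
    positive_rate = (1 - accuracy) / errors_per_positive
    majority_baseline = max(positive_rate, 1 - positive_rate)
    return positive_rate, majority_baseline

# Values taken from the Model Performance table.
accuracy, precision, recall = 0.729, 0.578, 0.439

f1 = 2 * precision * recall / (precision + recall)
positive_rate, baseline = implied_class_balance(accuracy, precision, recall)

print(f"F1 = {f1:.3f}")                              # 0.499, matching the table
print(f"implied positive share = {positive_rate:.3f}")   # about 0.307
print(f"majority-class baseline = {baseline:.3f}")       # about 0.693
```

Since the implied baseline is close to the model's accuracy, precision, recall, and F1 are more informative than accuracy for judging this classifier.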

### Interpretation
While modest, these results suggest that platform-level and genre metadata contain some predictive information about perceived popularity.  
This proof-of-concept will serve as the foundation for more advanced models in future milestones.

---

## Summary and Interpretation

The work completed for this milestone focused on creating a unified, high-quality dataset and conducting exploratory data analysis across Netflix, Amazon Prime Video, and Disney+.  
Three separate Kaggle datasets were cleaned, standardized, and merged into a single, consistent table of approximately 13,000 titles.  
All preprocessing steps, including schema alignment, duplicate removal, IMDb integration, and field normalization, are documented in `Cleaning.ipynb`.  
This version forms the reproducible data foundation for the project.

The analysis used six figures to summarize patterns across the combined catalogue. 
Most releases occur after 2010, coinciding with the global expansion of streaming platforms.  
Genre distribution is dominated by Drama, Comedy, and Documentary, reflecting mainstream production and viewing trends.  
IMDb ratings are broadly consistent across services, with a median of around 6.5, while geographic data remain incomplete and are treated cautiously in interpretation.

Overall, the data cleaning and exploratory analysis provide a clear descriptive overview of the streaming landscape and establish a reliable dataset for future analytical stages.  
All outputs, figures, and notebooks are included in the project repository for verification and reproducibility.


## References

Asmar, A., Raats, T., & Van Audenhove, L. (2023). *Streaming difference(s): Netflix and the branding of diversity.* Critical Studies in Television, 18(1), 24–40.  

Bansal, S. (2019). *Netflix Movies and TV Shows [Data set].* Kaggle. https://www.kaggle.com/datasets/shivamb/netflix-shows  

Bansal, S. (2020a). *Amazon Prime Movies and TV Shows [Data set].* Kaggle. https://www.kaggle.com/datasets/shivamb/amazon-prime-movies-and-tv-shows  

Bansal, S. (2020b). *Disney+ Movies and TV Shows [Data set].* Kaggle. https://www.kaggle.com/datasets/shivamb/disney-movies-and-tv-shows  

Internet Movie Database (IMDb). (2025). *IMDb Datasets [Data files].* IMDb. https://www.imdb.com/interfaces/  

Jenkins, H. (2006). *Convergence culture: Where old and new media collide.* New York University Press.  

Lobato, R. (2019). *Netflix nations: The geography of digital distribution.* New York University Press.  

Lotz, A. D. (2017). *Portals: A treatise on internet-distributed television.* Michigan Publishing.  

Open Movie Database (OMDb). (n.d.). *OMDb API: The Open Movie Database [Web service].* Retrieved November 2025, from https://www.omdbapi.com/  

UNESCO. (2022). *Re|Shaping policies for creativity: Addressing culture as a global public good.* UNESCO.  
