The goal of this project is to use Netflix data (7787 rows, 12 columns) to classify and group movies and shows into specific clusters. We use K-Means clustering, Agglomerative clustering, and a content-based recommendation system to analyze the data and provide personalized suggestions to consumers based on their preferences.

Ashif-khan033/Netflix_Movies_and_TV_Shows_Clustering

Netflix Movies and TV Shows Clustering & Netflix Recommender System

To ensure a good user experience and help prevent subscriber churn, it is essential for Netflix, the world's leading online streaming service provider with over 220 million subscribers as of 2022, to effectively cluster the shows on its platform.

Table of Contents

  1. Problem Statement
  2. Objective
  3. Dataset
  4. Data Pipeline
  5. Conclusion

Problem Statement

The goal of this project is to analyze the Netflix catalog of movies and TV shows, sourced from the third-party search engine Flixable, and group the titles into relevant clusters. This will help enhance the user experience and prevent subscriber churn for Netflix, the world's largest online streaming service provider, which had over 220 million subscribers as of 2022-Q2. The dataset, which includes movies and TV shows as of 2019, is analyzed to uncover insights and trends in the rapidly growing world of streaming entertainment.

Objective

The objective of the project was to analyze the Netflix dataset and identify trends and patterns in the content that is available on the platform. The goal was to gain insights that could be used to improve the user experience and make recommendations for future content.

Data Pipeline

  1. Know Your Data: The first step in this project was to examine the various features of the dataset, understand the structure of the data and identify any patterns or trends. We looked at the shape of the data, the data types of each feature, and a statistical summary.
  2. Exploratory Data Analysis: We conducted an exploratory analysis of the data to identify patterns and dependencies, and to draw conclusions that would be useful for further processing.
  3. Data Cleaning: We checked the dataset for duplicate rows, then handled null values and outliers by imputing empty strings and dropping some of the rows with null values.
  4. Textual Data Preprocessing: We used techniques such as stop word removal, punctuation removal, conversion to lowercase, stemming, tokenization, and word vectorization to prepare the textual data for clustering. We also used Principal Component Analysis (PCA) to handle the curse of dimensionality.
  5. Cluster Implementation: We used the K-Means and Agglomerative Hierarchical clustering algorithms to cluster the movies and shows, and determined the optimal number of clusters for each.
  6. Content-Based Recommendation System: We built a content-based recommendation system using the similarity matrix obtained from cosine similarity, which will provide the user with 10 recommendations based on the type of movie/show they have watched.
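
Steps 4 and 5 above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the project's actual notebook code: the toy `docs` corpus, the helper `best_k_by_silhouette`, and the candidate range of cluster counts are all assumptions made for the example. Stemming is omitted here because scikit-learn's vectorizer does not stem on its own; it would need an extra library such as NLTK.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus standing in for the combined text attributes
# (cast, country, genre, director, rating, description) of each title.
docs = [
    "space crew explores distant planet alien threat",
    "astronauts stranded on alien planet fight to survive",
    "alien invasion forces space fleet into battle",
    "couple falls in love during summer wedding",
    "wedding planner finds unexpected love and romance",
    "romance blooms between rivals at summer camp",
    "detective hunts serial killer in dark city",
    "police detective investigates murder in small city",
    "killer stalks journalist investigating murder case",
]

# The vectorizer lowercases, tokenizes, strips punctuation, and removes
# English stop words before computing TF-IDF weights.
tfidf = TfidfVectorizer(lowercase=True, stop_words="english", max_features=10000)
X = tfidf.fit_transform(docs).toarray()

# A float n_components asks PCA to keep just enough components to explain
# that fraction of the variance (80% here).
pca = PCA(n_components=0.80, svd_solver="full")
X_reduced = pca.fit_transform(X)


def best_k_by_silhouette(features, k_values):
    """Fit K-Means for each k and return the k with the highest silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return max(scores, key=scores.get), scores


best_k, scores = best_k_by_silhouette(X_reduced, range(2, 6))

# Cluster with both algorithms using the selected number of clusters.
kmeans_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_reduced)
agglo_labels = AgglomerativeClustering(n_clusters=best_k).fit_predict(X_reduced)

print("components kept:", X_reduced.shape[1])
print("selected k:", best_k)
```

In this sketch the same `best_k` is reused for Agglomerative clustering only to keep the example short; the project itself chose the Agglomerative cluster count separately by inspecting a dendrogram.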

Conclusion

The project identified a number of trends and patterns in the Netflix dataset. These could be used to improve the user experience and to make recommendations for future content.

  • The dataset contained 7787 records and 12 attributes.
  • We started by working on the missing values in the dataset and conducting exploratory data analysis (EDA).
  • It was discovered that Netflix hosts more movies than television shows on its platform, and the total number of shows added to Netflix is expanding at an exponential rate. Additionally, most of the shows were made in the United States.
  • The following attributes were chosen as the basis for clustering: cast, country, genre, director, rating, and description. The TF-IDF vectorizer was used to tokenize, preprocess, and vectorize the values in these attributes.
  • TF-IDF vectorization produced 10000 features in total.
  • The problem of dimensionality was dealt with through the use of Principal Component Analysis (PCA). Because 3000 components were able to account for more than 80% of the variance, the total number of components was limited to 3000.
  • We first built clusters with the K-Means clustering algorithm. The optimal number of clusters was found to be 6, based on the elbow method and silhouette score analysis.
  • The Agglomerative clustering algorithm was then used to create clusters, and the optimal number of clusters was determined to be 7. This was obtained after visualizing the dendrogram.
  • The similarity matrix generated by applying cosine similarity was used to construct a content-based recommender system. The user will receive ten recommendations from this recommender system based on the type of show they watched.
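
The content-based recommender described in the last bullet can be sketched as below. This is an illustrative example under stated assumptions, not the project's code: the `titles` and `descriptions` lists are made-up stand-ins for the dataset, and `recommend` is a hypothetical helper name. It uses scikit-learn's `TfidfVectorizer` and `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and descriptions standing in for the Netflix catalog.
titles = [
    "Star Voyage",
    "Alien Dawn",
    "Summer Vows",
    "Second Chance",
    "Night Watch",
    "Cold Case",
]
descriptions = [
    "space crew explores a distant alien planet",
    "astronauts stranded on an alien planet fight to survive",
    "a couple falls in love at a summer wedding",
    "a wedding planner finds unexpected love",
    "a detective hunts a killer in a dark city",
    "a city detective reopens an old murder case",
]

# Vectorize the text and compute the pairwise cosine similarity matrix.
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(descriptions)
similarity = cosine_similarity(tfidf_matrix)


def recommend(title, top_n=10):
    """Return up to top_n titles most similar to the given title."""
    idx = titles.index(title)
    # Sort all titles by similarity to the query, highest first.
    order = np.argsort(similarity[idx])[::-1]
    # Skip the query title itself and keep the best matches.
    return [titles[i] for i in order if i != idx][:top_n]


print(recommend("Star Voyage", top_n=3))
```

With only six toy titles the function returns at most five results; on the full catalog the default `top_n=10` would yield the ten recommendations mentioned above.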
