This project uses PySpark, the Python API for Apache Spark, to analyze Spotify song data. The script covers data exploration, clustering, dimensionality reduction, and collaborative filtering for song recommendations.
The analysis is conducted on two main datasets:
- Final Spotify Database (Final_database.csv): This dataset contains detailed information about various Spotify songs, including features like danceability, energy, instrumentalness, valence, and more.
- Database for Calculating Popularity (Database_to_calculate_popularity.csv): This dataset includes information about the popularity and listening statistics of songs, such as position, track URI, and country.
Both datasets are utilized to extract insights and patterns related to song popularity, artist trends, and user preferences.
Python
PySpark
Pandas
Seaborn
Scikit-learn
SQL
HDFS
Clustering
MapReduce
pip install pyspark pandas seaborn scikit-learn
Exploratory data analysis (EDA) is the process of uncovering the structure and less obvious properties of a dataset by examining it in different ways: finding duplicates, finding and handling null values, and visualizing the data through plots, charts, and graphs. Here we run SQL queries, produce plots and figures, and use PySpark to filter results from the dataset.
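The duplicate and null checks described above can be illustrated with a small standard-library sketch. This is not the project's PySpark code: the column names and sample rows are hypothetical, and an empty cell is assumed to mean a null value. It only shows what the two checks compute.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real columns of Final_database.csv may differ.
SAMPLE_CSV = """title,artist,danceability,energy
Song A,Artist 1,0.80,0.70
Song B,Artist 2,,0.40
Song A,Artist 1,0.80,0.70
Song C,Artist 3,0.55,
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row."""
    counts = Counter(tuple(row.items()) for row in rows)
    return sum(n - 1 for n in counts.values())


def null_counts(rows):
    """Count empty cells per column (assumption: empty string means null)."""
    nulls = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                nulls[column] += 1
    return dict(nulls)


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = load_rows(SAMPLE_CSV)
print("duplicates:", count_duplicates(rows))
print("nulls per column:", null_counts(rows))
print("rows after de-duplication:", len(drop_duplicates(rows)))
```

On a dataset of realistic size the same checks would be run on a distributed DataFrame rather than on Python lists, but the logic is the same: group identical rows, count empty cells per column, then decide whether to drop or fill them.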
Clustering is the task of dividing unlabeled data points into clusters such that points in the same cluster are more similar to each other than to points in other clusters. In simple terms, the aim is to identify groups with similar traits and assign them to clusters.
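To make the idea concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). It is an illustration only, not the project's PySpark implementation: the feature pairs are hypothetical (danceability, energy) values, and the initial centroids are simply taken to be the first `k` points.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm.

    Assumption for this sketch: the first k points serve as initial centroids.
    Returns (assignments, centroids).
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical (danceability, energy) pairs: three lively and three calm songs.
songs = [
    (0.90, 0.85), (0.15, 0.20), (0.85, 0.80),
    (0.10, 0.25), (0.80, 0.90), (0.20, 0.15),
]

labels, centers = kmeans(songs, k=2)
print("cluster labels:", labels)
print("centroids:", centers)
```

The lively songs end up in one cluster and the calm songs in the other, which is the "similar traits" grouping described above. Real data would usually need feature scaling and a more careful choice of initial centroids.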
- Data Loading: The script loads Spotify song data from CSV files (Final_database.csv and Database_to_calculate_popularity.csv) using PySpark.
- Exploratory Data Analysis (EDA): Conducts EDA to understand various aspects of the dataset, such as popular artists, genres, and trends over time.
- Clustering: Applies clustering techniques, including KMeans, to identify patterns and similarities among songs.
- Dimensionality Reduction: Uses PCA (Principal Component Analysis) for reducing the dimensionality of the data and visualizing it in two dimensions.
- Collaborative Filtering: Implements collaborative filtering with the ALS (Alternating Least Squares) algorithm to provide song recommendations based on user listening history.
- Song Similarity: It allows users to find the most similar songs to a given input song using cosine similarity.
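The song-similarity step in the list above can be sketched in plain Python. This is a hedged illustration rather than the script's actual code: the catalog, the song names, and the feature values are hypothetical, and each vector is assumed to hold danceability, energy, instrumentalness, and valence in that order.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention for this sketch: a zero vector matches nothing.
    return dot / (norm_a * norm_b)


def most_similar(target, catalog, top_n=3):
    """Rank every other song in the catalog by similarity to the target."""
    target_vec = catalog[target]
    scores = [
        (name, cosine_similarity(target_vec, vec))
        for name, vec in catalog.items()
        if name != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical features: [danceability, energy, instrumentalness, valence].
catalog = {
    "Song A": [0.90, 0.80, 0.05, 0.70],
    "Song B": [0.85, 0.75, 0.10, 0.65],
    "Song C": [0.20, 0.30, 0.90, 0.10],
    "Song D": [0.50, 0.50, 0.50, 0.50],
}

for name, score in most_similar("Song A", catalog):
    print(f"{name}: {score:.3f}")
```

Cosine similarity compares the direction of two vectors and ignores their length, so features on very different scales (tempo in BPM next to danceability in the range 0 to 1, for example) would normally be scaled first.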