Spotify Song Data Analysis with PySpark

Overview

This project uses PySpark, the Python API for Apache Spark, to analyze Spotify song data. The script covers data exploration, clustering, dimensionality reduction, and collaborative filtering for song recommendations.

Dataset

The analysis is conducted on two main datasets:

Final Spotify Database (Final_database.csv): This dataset contains detailed information about various Spotify songs, including features like danceability, energy, instrumentalness, valence, and more.
Database for Calculating Popularity (Database_to_calculate_popularity.csv): This dataset includes information about the popularity and listening statistics of songs, such as position, track URI, and country.

Both datasets are utilized to extract insights and patterns related to song popularity, artist trends, and user preferences.

Requirements

Python

PySpark

Pandas

Seaborn

Scikit-learn

SQL

HDFS

Clustering

MapReduce

Install the required dependencies:

pip install pyspark pandas seaborn scikit-learn

EDA (Exploratory Data Analysis)

EDA is the process of uncovering patterns and hidden information in a dataset by examining it in different ways, such as finding duplicates, finding and handling null values, and visualizing the data through plots, charts, and graphs. Here we run SQL queries, produce plots and figures, and use PySpark to filter results from the dataset.

Clustering

Clustering is the task of dividing unlabeled data points into clusters such that points in the same cluster are more similar to each other than to points in other clusters. In simple words, the aim is to group data points with similar traits into the same cluster.
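To make the idea concrete, the sketch below runs a plain-Python version of the k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its members). The project itself uses KMeans through PySpark; this dependency-free version is only an illustration, and the function name `kmeans` and the sample (danceability, energy) values are assumptions made for the example.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Cluster points with Lloyd's algorithm, starting from the given centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical (danceability, energy) pairs: three lively and three calm songs.
points = [
    (0.90, 0.80), (0.85, 0.90), (0.80, 0.85),
    (0.20, 0.10), (0.15, 0.20), (0.10, 0.15),
]
assignments, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
```

Fixed starting centroids keep the example deterministic; library implementations usually pick them with a randomized scheme instead.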

Key Features

Data Loading: Loads Spotify song data from CSV files (Final_database.csv and Database_to_calculate_popularity.csv) using PySpark.
Exploratory Data Analysis (EDA): Conducts EDA to understand various aspects of the dataset, such as popular artists, genres, and trends over time.
Clustering: Applies clustering techniques, including KMeans, to identify patterns and similarities among songs.
Dimensionality Reduction: Uses PCA (Principal Component Analysis) to reduce the dimensionality of the data and visualize it in two dimensions.
Collaborative Filtering: Implements collaborative filtering with the ALS (Alternating Least Squares) algorithm to provide song recommendations based on user listening history.
Song Similarity: Finds the songs most similar to a given input song using cosine similarity.
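For the song-similarity feature, cosine similarity between two feature vectors is their dot product divided by the product of their lengths, so songs whose features point in the same direction score close to 1. The following is a minimal plain-Python sketch of that lookup, not the project's implementation: the function names, the song keys, and the four-value feature vectors (in the order danceability, energy, instrumentalness, valence) are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def most_similar(query, catalog, top_n=3):
    """Rank every other song in the catalog by similarity to the query song."""
    scores = [
        (name, cosine_similarity(catalog[query], features))
        for name, features in catalog.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical feature vectors for four songs.
catalog = {
    "song_a": [0.80, 0.70, 0.10, 0.60],
    "song_b": [0.78, 0.72, 0.12, 0.58],
    "song_c": [0.20, 0.30, 0.90, 0.10],
    "song_d": [0.60, 0.50, 0.30, 0.40],
}
top_two = most_similar("song_a", catalog, top_n=2)
```

Because cosine similarity ignores vector length, features on very different scales are usually normalized first so that no single feature dominates the score.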


Shams261/BIG_DATA_REPOSITORY
