Shams261/BIG_DATA_REPOSITORY

Spotify Song Data Analysis with PySpark

Overview

This project uses PySpark, the Python API for Apache Spark, to analyze Spotify song data. The script covers data exploration, clustering, dimensionality reduction, and collaborative filtering for song recommendations.

Dataset

The analysis is conducted on two main datasets:

  1. Final Spotify Database (Final_database.csv): This dataset contains detailed information about various Spotify songs, including features like danceability, energy, instrumentalness, valence, and more.
  2. Database for Calculating Popularity (Database_to_calculate_popularity.csv): This dataset includes information about the popularity and listening statistics of songs, such as position, track URI, and country.

Both datasets are utilized to extract insights and patterns related to song popularity, artist trends, and user preferences.

Requirements

Python

PySpark

Pandas

Seaborn

Scikit-learn

SQL

HDFS

Clustering

MapReduce

Install the required dependencies:

pip install pyspark pandas seaborn scikit-learn

EDA (Exploratory Data Analysis)

EDA is the process of uncovering patterns and issues in a dataset by examining it in different ways, such as finding duplicates, finding and handling null values, and visualizing the data through plots, charts, and graphs. Here we run SQL queries, produce plots and figures, and use PySpark to filter results from the dataset.

Clustering

Clustering divides unlabeled data points into groups so that points in the same cluster are more similar to each other than to points in other clusters. In short, the aim is to group items with similar traits together.
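To make the idea concrete, here is a small plain-Python sketch of the k-means procedure (Lloyd's algorithm). It only illustrates the assign-then-update loop and is not the PySpark KMeans implementation the project uses; the `kmeans` helper and the (danceability, energy) pairs are hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm: returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical (danceability, energy) pairs: three lively songs, three calm ones.
songs = [(0.90, 0.80), (0.85, 0.90), (0.80, 0.85),
         (0.20, 0.10), (0.10, 0.30), (0.15, 0.20)]

# Initial centroids are picked by hand here to keep the run deterministic.
labels, centroids = kmeans(songs, centroids=[songs[0], songs[3]])
print(labels)
```

In practice the initial centroids are usually chosen randomly or with a seeding scheme, so cluster numbering can differ between runs.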

Key Features

  1. Data Loading: The script loads Spotify song data from CSV files (Final_database.csv and Database_to_calculate_popularity.csv) using PySpark.

  2. Exploratory Data Analysis (EDA): Conducts EDA to understand various aspects of the dataset, such as popular artists, genres, and trends over time.

  3. Clustering: Applies clustering techniques, including KMeans, to identify patterns and similarities among songs.

  4. Dimensionality Reduction: Uses PCA (Principal Component Analysis) to reduce the dimensionality of the data and visualize it in two dimensions.

  5. Collaborative Filtering: Implements collaborative filtering with the ALS (Alternating Least Squares) algorithm to provide song recommendations based on user listening history.

  6. Song Similarity: Lets users find the songs most similar to a given input song using cosine similarity.
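The song-similarity feature in the last item can be sketched in plain Python as follows. This is an illustration under assumptions rather than the project's code: the `cosine_similarity` and `most_similar` helpers, the song titles, and the three-value feature vectors (imagined as danceability, energy, instrumentalness) are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_a * norm_b)


def most_similar(title, features, top_n=2):
    """Rank every other song by cosine similarity to the given song."""
    query = features[title]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in features.items()
        if other != title
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical feature vectors: (danceability, energy, instrumentalness).
features = {
    "Song A": (0.90, 0.80, 0.10),
    "Song B": (0.85, 0.75, 0.15),
    "Song C": (0.10, 0.20, 0.90),
}

ranked = most_similar("Song A", features)
print(ranked)
```

Cosine similarity compares the direction of the vectors and ignores their length, so features on very different scales may need normalizing first.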

About

THIS IS THE BIG DATA GROUP PROJECT

Contributors 4
