### Project Title: Beyond the Playlist: Insights from Spotify Streaming Data
#### Done By: Nozipho Sithembiso Ndebele & Thabisisle Xaba
---

<div style="text-align: center;">
<img src="alexander-shatov-JlO3-oY5ZlQ-unsplash-scaled.jpg" alt="Project banner image" width="1000"/>
</div>

---

## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**

### Purpose
This project aims to analyze Spotify user data to gain insights into music streaming behaviors, trends, and preferences. By leveraging data visualization techniques, the project will explore listening patterns, track popularity, and user engagement with different artists and albums over time.

### Significance
Understanding music streaming behavior is valuable for multiple applications, such as:

* Identifying the most played tracks, artists, and albums over a period of time

* Analyzing the impact of time, mood, and external factors on music choices

* Helping artists, record labels, and streaming platforms improve recommendations and user experiences

* Understanding the reasons behind track play and end events to optimize playlist curation

By applying data visualization techniques to this dataset, we aim to uncover patterns in user listening habits and provide actionable insights for both music industry professionals and everyday listeners.

### Problem Domain
Spotify generates vast amounts of user data, including track plays, timestamps, and listening durations. Effectively interpreting this data requires the use of visualization techniques to extract meaningful insights and trends.

### Challenges
* Data Volume: Handling and processing large amounts of streaming data efficiently.

* User Behavior Complexity: Understanding listening patterns influenced by factors like time of day, mood, and device type.

* Feature Engineering: Identifying key attributes that contribute to user engagement and preferences.

### Key Questions
* What are the most popular songs, artists, and albums in the dataset?

* How do listening habits change over time (daily, weekly, monthly trends)?

* What factors influence track play and end reasons?

* Can we identify user-specific music preferences based on historical data?

* How do external factors (e.g., time of day, weekday vs. weekend) affect listening behavior?

---
<a id="one"></a>
## **Importing Packages**

### Purpose
To set up the Python environment with the necessary libraries for data manipulation, visualization, and machine learning. These libraries will facilitate data preprocessing, feature extraction, model training, and evaluation.

### Details
* Pandas: For handling and analyzing data.

* NumPy: For numerical operations.

* Matplotlib/Seaborn/Plotly: For static and interactive data visualization to understand trends and patterns.

* scikit-learn: For building and evaluating machine learning models.

* NLTK and `re`: For text preprocessing and natural language processing tasks.

---

In [None]:
# Import necessary packages  

# Data manipulation and analysis  
import pandas as pd  # Pandas for data handling  
import numpy as np  # NumPy for numerical operations  

# Data visualization  
import matplotlib.pyplot as plt  # Matplotlib for static plots  
import seaborn as sns  # Seaborn for statistical visualization  
import plotly.express as px  # Plotly for interactive plots  

# Natural Language Processing  
import nltk  # Natural Language Toolkit  
from nltk.corpus import stopwords  # Stopword removal  
from nltk.tokenize import word_tokenize  # Tokenization  
import re  # Regular expressions for text cleaning  

# Configure visualization settings
sns.set(style='whitegrid')  # Set the default style for Seaborn plots
plt.rcParams['figure.figsize'] = (10, 6)  # Set default figure size for Matplotlib

# Suppress warnings
import warnings  # Import the warnings module
warnings.filterwarnings('ignore')  # Ignore all warning messages




---
<a id="two"></a>
## **Data Collection and Description**
### Purpose
This section describes how the Spotify dataset was collected and provides insights into its structure. Understanding the dataset's composition is crucial for effective data visualization and interpretation.

### Details
The dataset consists of complete music streaming history data from Spotify users, including track details, timestamps, and listening behaviors.

* Source: The dataset is sourced from Spotify's user data exports, APIs, or publicly available repositories like Kaggle.

* Method of Collection: Data was gathered by tracking user interactions with Spotify, recording timestamps, track metadata, and reasons for playing or stopping each song.

* Size: The raw dataset contains 149,860 records across 11 columns (148,675 records after duplicate rows are removed).

* Scope: Covers various aspects of user engagement, including song play duration, artist preferences, and playlist interactions.

* Types of Data:

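  * spotify_track_uri – Unique identifier for each track

  * ts – Timestamp of the play event

  * platform – Platform used to stream the track (e.g., web player)

  * ms_played – Number of milliseconds the track was played

  * track_name – Name of the song played

  * artist_name – Name of the artist performing the track

  * album_name – Name of the album the track belongs to

  * reason_start – Reason the track started playing (e.g., autoplay, clickrow, trackdone)

  * reason_end – Reason the track stopped playing (e.g., clickrow, nextbtn, unknown)

  * shuffle – Whether shuffle mode was on (True/False)

  * skipped – Whether the track was skipped (True/False)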

By leveraging this dataset, we aim to create meaningful visualizations that provide deep insights into user streaming habits and music consumption trends.

---
<a id="three"></a>
## **Loading Data**
### Purpose
The purpose of this section is to load the dataset into the notebook for further manipulation and analysis. This is the first step in working with the data, and it allows us to inspect the raw data and get a sense of its structure.

### Details
In this section, we will load the dataset into a Pandas DataFrame and display the first few rows to understand what the raw data looks like. This will help in planning the next steps of data cleaning and analysis.


---

In [2]:
# Load the dataset into a Pandas DataFrame

# The dataset is stored in a CSV file named 'spotify_history.csv'
df = pd.read_csv('spotify_history.csv')

In [3]:
# df is the original dataset (DataFrame), this creates a copy of it
df_copy = df.copy()

# Now 'df_copy' is an independent copy of 'df'. Changes to 'df_copy' won't affect 'df'.


In [4]:
# Display the first few rows of the dataset to get a sense of what the raw data looks like
df_copy.head()

| | spotify_track_uri | ts | platform | ms_played | track_name | artist_name | album_name | reason_start | reason_end | shuffle | skipped |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2J3n32GeLmMjwuAzyhcSNe | 2013-07-08 02:44:34 | web player | 3185 | Say It, Just Say It | The Mowgli's | Waiting For The Dawn | autoplay | clickrow | False | False |
| 1 | 1oHxIPqJyvAYHy0PVrDU98 | 2013-07-08 02:45:37 | web player | 61865 | Drinking from the Bottle (feat. Tinie Tempah) | Calvin Harris | 18 Months | clickrow | clickrow | False | False |
| 2 | 487OPlneJNni3NWC8SYqhW | 2013-07-08 02:50:24 | web player | 285386 | Born To Die | Lana Del Rey | Born To Die - The Paradise Edition | clickrow | unknown | False | False |
| 3 | 5IyblF777jLZj1vGHG2UD3 | 2013-07-08 02:52:40 | web player | 134022 | Off To The Races | Lana Del Rey | Born To Die - The Paradise Edition | trackdone | clickrow | False | False |
| 4 | 0GgAAB0ZMllFhbNc3mAodO | 2013-07-08 03:17:52 | web player | 0 | Half Mast | Empire Of The Sun | Walking On A Dream | clickrow | nextbtn | False | False |


In [5]:
# Display the number of rows and columns in the dataset to understand its size
df_copy.shape

(149860, 11)

In [6]:
# Check the structure of the dataset
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149860 entries, 0 to 149859
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   spotify_track_uri  149860 non-null  object
 1   ts                 149860 non-null  object
 2   platform           149860 non-null  object
 3   ms_played          149860 non-null  int64 
 4   track_name         149860 non-null  object
 5   artist_name        149860 non-null  object
 6   album_name         149860 non-null  object
 7   reason_start       149717 non-null  object
 8   reason_end         149743 non-null  object
 9   shuffle            149860 non-null  bool  
 10  skipped            149860 non-null  bool  
dtypes: bool(2), int64(1), object(8)
memory usage: 10.6+ MB
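
The `info()` output shows `ts` stored as `object` (plain strings), so time-based questions need it parsed first. Below is a minimal sketch of one way to do that, using a tiny stand-in frame built from the rows displayed earlier. The derived column names (`hour`, `day_name`, `is_weekend`, `minutes_played`) are illustrative assumptions and do not exist in the original dataset.

```python
import pandas as pd

# Tiny stand-in for the real data, using values from the head() output
sample = pd.DataFrame({
    "ts": ["2013-07-08 02:44:34", "2013-07-08 02:45:37", "2013-07-08 03:17:52"],
    "ms_played": [3185, 61865, 0],
})

# Parse the timestamp strings; values that do not match become NaT instead of raising
sample["ts"] = pd.to_datetime(sample["ts"], format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Derive time-based features (hypothetical column names)
sample["hour"] = sample["ts"].dt.hour
sample["day_name"] = sample["ts"].dt.day_name()
sample["is_weekend"] = sample["ts"].dt.dayofweek >= 5  # Monday=0 ... Sunday=6

# Convert milliseconds to minutes for easier reading
sample["minutes_played"] = sample["ms_played"] / 60_000

print(sample.dtypes)
print(sample)
```

The same steps could be applied to `df_copy` once the parsing has been checked on a sample; counting `NaT` values afterwards is a quick way to spot timestamps that did not match the expected format.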


---
<a id="four"></a>
## **Data Cleaning and Filtering**
### Purpose
Before analyzing the data, it is crucial to clean and filter it. This process involves handling missing values, removing duplicate rows, correcting errors, and filtering out irrelevant records. These steps ensure that the analysis is based on accurate and reliable data.

### Details
In this section, we will:

* Check for Missing Values: Identify if there are any missing values in the dataset and handle them accordingly.
* Remove Duplicates: Ensure there are no duplicate rows that could skew the analysis.
* Correct Errors: Look for and correct any obvious data entry errors.
* Filter Data: Depending on the analysis requirements, filter the data to include only relevant records.
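
The "Filter Data" step is not implemented in the cells that follow, so here is one hypothetical filter as a sketch. It assumes that plays shorter than some threshold (the `head()` output includes a row with `ms_played` of 0) are not meaningful listens. The threshold `MIN_MS_PLAYED` is an assumption chosen for illustration, not a value taken from this project.

```python
import pandas as pd

MIN_MS_PLAYED = 1_000  # hypothetical threshold (1 second); adjust to the analysis

# Stand-in rows based on the head() output
sample = pd.DataFrame({
    "track_name": ["Say It, Just Say It", "Born To Die", "Half Mast"],
    "ms_played": [3185, 285386, 0],
})

# Boolean mask: True for rows that meet the minimum play time
is_meaningful = sample["ms_played"] >= MIN_MS_PLAYED

# Keep only the matching rows and renumber the index
filtered = sample.loc[is_meaningful].reset_index(drop=True)

print(f"Kept {len(filtered)} of {len(sample)} rows")
```

Whether very short plays should be dropped depends on the question: they are noise for "most listened" rankings but are useful evidence when studying skips.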

In [7]:
# 1. Check for missing values in the dataset

def check_missing_values(df):
    """
    Check for missing values in the dataset and display the number of missing values per column.

    Parameters:
    df (pandas.DataFrame): The dataset to check for missing values.

    Returns:
    pandas.Series: A series showing the number of missing values for each column.
    """
    # Count missing values per column and display them
    print("Missing values per column:")
    missing_values = df.isnull().sum()
    print(missing_values)
    return missing_values


In [8]:
# Check the working copy (df_copy) for missing values
missing_values = check_missing_values(df_copy)


Missing values per column:
spotify_track_uri      0
ts                     0
platform               0
ms_played              0
track_name             0
artist_name            0
album_name             0
reason_start         143
reason_end           117
shuffle                0
skipped                0
dtype: int64


Missing values appear only in the reason_start (143) and reason_end (117) columns, each under 0.1% of the 149,860 rows. All other columns are complete. Because both columns are categorical, the missing entries are imputed with a placeholder label in the next step rather than dropped.
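
Raw counts are easier to judge alongside their share of the dataset. The sketch below reports both for columns that have gaps. The helper name `summarize_missing` and the sample frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_missing(df):
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(3)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Small stand-in frame with gaps in the two reason columns
sample = pd.DataFrame({
    "reason_start": ["autoplay", np.nan, "clickrow", "trackdone"],
    "reason_end": ["clickrow", "clickrow", np.nan, np.nan],
    "ms_played": [3185, 61865, 285386, 134022],
})

missing_summary = summarize_missing(sample)
print(missing_summary)
```

Calling `summarize_missing(df_copy)` on the real data would show how small a share of rows the missing reasons represent.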

In [9]:
def impute_missing_values(df):
    """
    Imputes missing values in the 'reason_start' and 'reason_end' columns with 'Unknown'.
    
    Args:
    df (pandas.DataFrame): The dataframe to process.

    Returns:
    pandas.DataFrame: The dataframe with missing values imputed.
    """
    # Check initial missing values
    missing_before = df[['reason_start', 'reason_end']].isnull().sum().sum()
    print(f"\nMissing values before imputation: {missing_before}")

    # Impute missing values
    df[['reason_start', 'reason_end']] = df[['reason_start', 'reason_end']].fillna("Unknown")

    # Check missing values after imputation
    missing_after = df[['reason_start', 'reason_end']].isnull().sum().sum()
    print(f"Missing values after imputation: {missing_after}")

    if missing_after == 0:
        print("All missing values have been successfully replaced with 'Unknown'.")
    else:
        print("Some missing values remain. Please review the data.")

    return df


In [10]:
# Call the function
df_copy = impute_missing_values(df_copy)


Missing values before imputation: 260
Missing values after imputation: 0
All missing values have been successfully replaced with 'Unknown'.
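
One thing to watch: the `head()` output shown earlier already contains a lowercase `unknown` in `reason_end`, so filling gaps with `'Unknown'` may leave two spellings of the same category. A minimal sketch of normalizing the labels follows. The sample values are a stand-in, and whether other case variants occur in the full data is not known here.

```python
import pandas as pd

# Stand-in data with both spellings present after imputation
sample = pd.DataFrame({
    "reason_start": ["clickrow", "Unknown", "trackdone"],
    "reason_end": ["unknown", "Unknown", "nextbtn"],
})

# Trim whitespace and lowercase so equivalent labels collapse into one
for col in ["reason_start", "reason_end"]:
    sample[col] = sample[col].str.strip().str.lower()

print(sample["reason_end"].value_counts())
```

Comparing `value_counts()` before and after the normalization is a simple way to confirm that only spelling variants were merged.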


In [11]:
def remove_duplicates(df):
    """
    Checks for duplicate rows in the dataset and removes them if any are found.

    Args:
    df (pandas.DataFrame): The dataframe to check for duplicate rows.

    Returns:
    pandas.DataFrame: The dataframe with duplicate rows removed, if any existed.
    """
    # Check for duplicate rows
    duplicate_rows = df.duplicated().sum()
    print(f"\nNumber of duplicate rows: {duplicate_rows}")
    
    # Remove duplicates if any exist
    if duplicate_rows > 0:
        df.drop_duplicates(inplace=True)
        print(f"Duplicate rows removed. Updated dataframe has {len(df)} rows.")
    else:
        print("No duplicate rows found.")
    
    return df

In [12]:
df_copy = remove_duplicates(df_copy)


Number of duplicate rows: 1185
Duplicate rows removed. Updated dataframe has 148675 rows.


The check found 1185 duplicate rows, which were removed, leaving 148,675 unique records. No further deduplication is required.
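
Before dropping duplicates, it can help to look at them, since identical rows in a streaming log may come from a repeated export rather than from real repeated plays. The sketch below lists every member of each duplicate group and then removes the extras. The URIs in the sample are placeholders, not real track identifiers.

```python
import pandas as pd

# Stand-in data: the first two rows are exact duplicates (placeholder URIs)
sample = pd.DataFrame({
    "spotify_track_uri": ["a1", "a1", "b2", "c3"],
    "ts": [
        "2013-07-08 02:44:34",
        "2013-07-08 02:44:34",
        "2013-07-08 02:45:37",
        "2013-07-08 02:50:24",
    ],
    "ms_played": [3185, 3185, 61865, 285386],
})

# keep=False flags every row in a duplicate group, not just the repeats
dupe_groups = sample[sample.duplicated(keep=False)].sort_values(["ts", "spotify_track_uri"])
print(dupe_groups)

# Drop the repeats and renumber the index so it has no gaps
deduped = sample.drop_duplicates().reset_index(drop=True)
print(f"{len(sample) - len(deduped)} duplicate row(s) removed, {len(deduped)} remain")
```

Returning a new frame from `drop_duplicates()` instead of using `inplace=True` also avoids modifying the frame that was passed in.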


## **Saving the Cleaned Dataset**
### Purpose

This section outlines how to save the cleaned dataset for future use. Saving the dataset ensures that the data cleaning process does not need to be repeated and allows for consistent use in subsequent analyses.

### Details

We will save the cleaned dataset as a CSV file.

In [13]:
# Save the cleaned dataset to a new CSV file

def save_cleaned_dataset(df, filename='cleaned_spotify_history.csv'):
    """
    Saves the cleaned dataframe to a CSV file.

    Args:
    df (pandas.DataFrame): The cleaned dataframe to save.
    filename (str): The name of the file to save the dataframe to (default is 'cleaned_spotify_history.csv').

    Returns:
    None
    """
    # Save the cleaned dataset to a CSV file
    df.to_csv(filename, index=False)
    print(f"Cleaned dataset saved successfully as '{filename}'.")


In [14]:
save_cleaned_dataset(df_copy)


Cleaned dataset saved successfully as 'cleaned_spotify_history.csv'.
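
A quick way to confirm a saved file is usable is to read it back and compare it with the frame in memory. This is a sketch that writes a small stand-in frame to a temporary directory. The file name `cleaned_sample.csv` is hypothetical and chosen so the real output file is not touched.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned dataframe
cleaned = pd.DataFrame({
    "track_name": ["Born To Die", "Half Mast"],
    "ms_played": [285386, 0],
    "shuffle": [False, False],
})

# Write to a temporary location, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_sample.csv")
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare basic structure between the saved and in-memory versions
same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
print(f"Shape matches: {same_shape}, columns match: {same_columns}")
```

Note that CSV does not store dtypes, so a parsed datetime column would come back as strings and need to be parsed again after loading.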


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**

Exploratory data analysis (EDA) is the process of examining a dataset to summarize its key features, often through visualization. It aims to discover patterns, spot anomalies, and formulate hypotheses that inform subsequent analysis.

#### Advantages

- Helps in understanding the data before modeling.
- Provides insights that guide feature selection and engineering.
- Assists in choosing appropriate modeling techniques.
- Uncovers potential data quality issues early.
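
As an illustration of the kind of summary EDA produces, the sketch below ranks artists by total listening time, one of the key questions of this project. It uses the five rows from the `head()` output as stand-in data and is not the analysis carried out by the authors. The output column names `plays`, `total_ms`, and `minutes` are assumptions.

```python
import pandas as pd

# Stand-in data taken from the head() output
sample = pd.DataFrame({
    "artist_name": [
        "The Mowgli's",
        "Calvin Harris",
        "Lana Del Rey",
        "Lana Del Rey",
        "Empire Of The Sun",
    ],
    "ms_played": [3185, 61865, 285386, 134022, 0],
})

# Count plays and sum listening time per artist
top_artists = (
    sample.groupby("artist_name")
    .agg(plays=("ms_played", "size"), total_ms=("ms_played", "sum"))
    .sort_values("total_ms", ascending=False)
)

# Express listening time in minutes for readability
top_artists["minutes"] = (top_artists["total_ms"] / 60_000).round(2)

print(top_artists.head(3))
```

Ranking by listening time and by play count can give different answers, because an artist with many skipped tracks accumulates plays without much listening time.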

`The following methods were employed to communicate our objective:`



---


---
<a id="nine"></a>
## **Conclusion and Future Work**


##### Conclusion



##### Future Work

To build upon this study, future work could focus on the following areas:



---
<a id="ten"></a>
## **References**


**Contributors**: Nozipho Sithembiso Ndebele & Thabisisle Xaba
