<a href="https://colab.research.google.com/github/C-Lion/HU-DS-BC/blob/main/Explanatory_Data_Analysis_and_Data_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Scope and Limitations

This analysis originally sought to determine whether the emotional tone of songs (as measured by valence and energy) changed significantly after the murder of George Floyd on May 25, 2020.

However, after inspecting the dataset, it became clear that:

- The majority of `release_date` values were missing or invalid.
- The valid dates in the dataset ranged from 1921 to 2017.
- No data was available for the year 2020 or later, making it impossible to compare song emotional tone before and after the target date.

As a result, the dataset could not support the intended research question. Nevertheless, this process demonstrated the importance of verifying data coverage early in the analysis workflow. The notebook includes all verification steps to transparently document the limitations encountered.
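
The coverage check described above can be sketched as a small helper. This is a minimal illustration, not the exact code used in this notebook: the function name `date_coverage` and the sample values are hypothetical, and it assumes the date column holds ISO-style strings.

```python
import pandas as pd

def date_coverage(dates, target):
    """Summarize how well a date column covers a target date (hypothetical helper)."""
    parsed = pd.to_datetime(dates, errors="coerce")  # unparseable values become NaT
    target = pd.Timestamp(target)
    return {
        "rows": int(len(parsed)),
        "missing": int(parsed.isna().sum()),
        "earliest": parsed.min(),
        "latest": parsed.max(),
        # True only if the target date lies inside the observed date range
        "covers_target": bool(parsed.min() <= target <= parsed.max()),
    }

# Made-up sample values, for illustration only
sample = pd.Series(["1921-01-01", "2017-12-31", None, "not a date"])
print(date_coverage(sample, "2020-05-25"))
```

Running a check like this before any modeling or plotting makes a coverage gap visible at the start of the workflow.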


# Research Question
How did the emotional tone of popular songs, measured by valence and energy, change before, during, and after the Civil Rights Movement (analysis window: 1940–1980), and how might these changes reflect broader societal shifts or resistance?

# Problem Statement
Popular music has historically served as both a reflection and an instrument of social sentiment. The Civil Rights Movement (1954–1968) marked a critical era in American history, encompassing both widespread activism and systemic pushback. This analysis seeks to understand how the emotional tone of popular songs—as quantified by Spotify's valence (positivity) and energy (intensity) features—varied during this era.

By segmenting the timeline into pre-Movement (1940–1953), during the Movement (1954–1968), and post-Movement (1969–1980), we aim to explore whether musical expression mirrored evolving societal tensions, cultural resilience, or hope. This work contributes to the broader conversation on how data science can uncover cultural insights and how music may reflect or resist dominant narratives during times of social change.
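
One way to express this segmentation in code is with `pd.cut`. The sketch below assumes a numeric `year` column; the names `PERIOD_BINS`, `PERIOD_LABELS`, and `label_period` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Bin edges are right-inclusive: (1939, 1953], (1953, 1968], (1968, 1980]
PERIOD_BINS = [1939, 1953, 1968, 1980]
PERIOD_LABELS = ["pre-Movement", "during Movement", "post-Movement"]

def label_period(years):
    """Map each year to its period; years outside 1940-1980 become NaN."""
    return pd.cut(years, bins=PERIOD_BINS, labels=PERIOD_LABELS)

# Illustrative years that sit on and around the period boundaries
demo = pd.DataFrame({"year": [1939, 1940, 1953, 1954, 1968, 1969, 1980, 1981]})
demo["period"] = label_period(demo["year"])
print(demo)
```

Years outside the analysis window are left unlabeled, so they can be dropped explicitly rather than silently assigned to a neighboring period.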


# Hypothesis

The emotional tone of popular music—reflected in valence and energy—shifted during the Civil Rights Movement. We hypothesize that music released during the movement (1954–1968) may exhibit lower valence and/or higher energy, corresponding to heightened societal tensions and calls for resistance.
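
The notebook does not yet specify how this hypothesis would be tested. As one possible approach, the sketch below runs a two-sided permutation test on the difference in mean valence between two groups. The function name and the synthetic numbers are assumptions made for illustration; they are not results from the dataset.

```python
import numpy as np

def permutation_test_mean_diff(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)  # reassign group labels at random
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the p-value strictly above zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Synthetic valence values, NOT taken from the Spotify data
rng = np.random.default_rng(42)
during = rng.normal(0.50, 0.15, size=300)
before = rng.normal(0.55, 0.15, size=300)
diff, p = permutation_test_mean_diff(during, before)
print(f"mean difference = {diff:.3f}, p = {p:.4f}")
```

A permutation test makes no normality assumption, which suits bounded features such as valence and energy; other tests (for example Welch's t-test or Mann-Whitney U) would also be reasonable choices.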



# Setting up the environment for the data project in Python


In [45]:
pip install pip



In [46]:
pip install --upgrade pip




# Set up Numerical Python and Data Structures:

In [47]:
pip install numpy




In [48]:
pip install pandas



# Set up Visualization Libraries

In [49]:
pip install matplotlib




In [50]:
pip install seaborn



## Geospatial Visualization Tools

In [51]:
pip install folium



In [52]:
pip install geopandas



# Set up Scientific Python

In [53]:
pip install scipy



In [54]:
pip install statsmodels



In [55]:
pip install -U scikit-learn



In [56]:
# I assume this version of scikit-learn will be sufficient, since this
# project focuses on data science rather than machine learning, but it is
# something to keep in mind.
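
After the installs, a quick check can confirm that each package is importable. This is a minimal sketch; `missing_modules` is a hypothetical helper, and the list uses import names (scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the packages installed above
EXPECTED_MODULES = [
    "numpy", "pandas", "matplotlib", "seaborn", "folium",
    "geopandas", "scipy", "statsmodels", "sklearn",
]

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

print("Missing modules:", missing_modules(EXPECTED_MODULES))
```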

# Load dataset

In [57]:
import pandas as pd


In [58]:
spotify_df = pd.read_csv("/content/data.csv")

ParserError: Error tokenizing data. C error: EOF inside string starting at row 137432
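
An "EOF inside string" error generally means the parser reached the end of the file while still inside a quoted field. One possible cause is an incomplete upload, although that has not been verified here. The sketch below is a simple diagnostic, not a fix: the hypothetical helper `find_unclosed_quote` scans lines and reports where an unterminated quoted field begins, assuming the file uses `"` as the quote character and escapes quotes by doubling them.

```python
import io

def find_unclosed_quote(lines, quotechar='"'):
    """Return the 1-based line number where an unterminated quoted field starts, or None."""
    open_line = None
    for lineno, line in enumerate(lines, start=1):
        # An odd number of quote characters on a line toggles the in-quote state
        if line.count(quotechar) % 2 == 1:
            open_line = lineno if open_line is None else None
    return open_line

# Made-up sample with a missing closing quote on the third line
sample = io.StringIO('name,year\n"Song A",1921\n"Song B,1930\nSong C,1940\n')
print(find_unclosed_quote(sample))
```

For the real file, the same function could be called on an open file handle, for example `open("/content/data.csv", encoding="utf-8", newline="")`. The row number in the pandas message is the parser's own count and may not match the physical line number.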

In [None]:
spotify_df.head(7)

In [None]:
spotify_df.tail(7)

In [None]:
spotify_df.describe()

# Verify the data needed for my research question is present
Inspect the columns to confirm if release_date, valence, and energy are present.



## Verify Dataset Columns
To confirm this dataset supports the research question, we must ensure that it includes a release date and emotion-related metrics like valence and energy.


In [None]:
# Check the column names
spotify_df.columns
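
Rather than reading the column list by eye, the check can be made explicit. The sketch below uses a hypothetical helper, `missing_columns`, and a small made-up frame in place of `spotify_df`.

```python
import pandas as pd

REQUIRED_COLUMNS = {"release_date", "valence", "energy"}

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df, sorted alphabetically."""
    return sorted(set(required) - set(df.columns))

# Made-up frame standing in for spotify_df
demo = pd.DataFrame({"release_date": ["1965-01-01"], "valence": [0.6]})
print(missing_columns(demo))
```

An empty list means every required column is present.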

## Define Periods for Comparison
To evaluate emotional tone before and after the murder of George Floyd (May 25, 2020), we will split the dataset into two time periods.


In [None]:
# Ensure release_date is in datetime format
spotify_df['release_date'] = pd.to_datetime(spotify_df['release_date'], errors='coerce')

# Filter for year 2020 only
spotify_2020 = spotify_df[spotify_df['release_date'].dt.year == 2020].copy()

# Create period labels based on the cutoff date
cutoff = pd.to_datetime("2020-05-25")
# Vectorized comparison: rows on or after the cutoff are labeled 'After'
spotify_2020['period'] = (spotify_2020['release_date'] < cutoff).map(
    {True: 'Before', False: 'After'}
)

# Show the distribution by period
spotify_2020['period'].value_counts()


## Data Types & Type Conversion

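We examined the dataframe's types and determined that `release_date` should be converted to a datetime format for time-based analysis. With `errors='coerce'`, any value that cannot be parsed becomes `NaT` instead of raising an error.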

```python
spotify_df['release_date'] = pd.to_datetime(spotify_df['release_date'], errors='coerce')
```


## Emotional Tone by Period
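This boxplot compares the distribution of valence (the positivity of songs) before and after May 25, 2020. A boxplot shows differences in median and spread, but it does not by itself establish statistical significance.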


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set visual style
sns.set(style="whitegrid")

# Create boxplot for valence
plt.figure(figsize=(8, 6))
sns.boxplot(x='period', y='valence', data=spotify_2020, palette="coolwarm")
plt.title('Valence of Songs Before vs After May 25, 2020')
plt.xlabel('Period')
plt.ylabel('Valence')
plt.tight_layout()
plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Drop rows with missing valence values
filtered_df = spotify_2020.dropna(subset=['valence'])

# Create the boxplot
plt.figure(figsize=(8, 6))
sns.boxplot(x='period', y='valence', data=filtered_df)
plt.title('Valence of Songs Before and After May 25, 2020')
plt.xlabel('Period')
plt.ylabel('Valence')
plt.tight_layout()
plt.show()


In [None]:
print(spotify_2020.shape)

In [None]:
print("Dataset shape:", spotify_2020.shape)

# Checking to see what the data contains

In [None]:
print("Total rows:", spotify_df.shape[0])
print("Missing release_date values:", spotify_df['release_date'].isna().sum())
print("Earliest release_date:", spotify_df['release_date'].min())
print("Latest release_date:", spotify_df['release_date'].max())
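
If the count of missing `release_date` values is large, one possible explanation is that the raw column mixes precisions, such as year-only strings alongside full dates. This has not been verified for this dataset. Under that assumption, the sketch below pulls the four-digit year directly from the raw strings; `extract_year` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def extract_year(raw_dates):
    """Extract a leading four-digit year from raw date strings as a nullable integer."""
    years = raw_dates.astype(str).str.extract(r"^\s*(\d{4})", expand=False)
    # Non-matching entries become NaN, then <NA> in the nullable Int64 dtype
    return pd.to_numeric(years, errors="coerce").astype("Int64")

# Made-up values mixing year-only, year-month, and full dates
raw = pd.Series(["1921", "1965-03-12", "1970-05", None, "unknown"])
print(extract_year(raw))
```

This would need to run on the column as loaded from the CSV, before `pd.to_datetime` overwrites it.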


When we filtered with `spotify_df['release_date'].dt.year == 2020`, there were zero matches, because the dataset only includes songs up to 2017.

Since `spotify_2020` was empty, the boxplot had nothing to draw.
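
A small guard before plotting would surface this situation immediately instead of producing a blank figure. The helper name `require_rows` below is hypothetical; it simply raises when a filtered frame has no rows.

```python
import pandas as pd

def require_rows(df, description):
    """Return df unchanged, or raise a clear error if it has no rows."""
    if df.empty:
        raise ValueError(
            f"No rows available for {description}; check the filter and the data coverage."
        )
    return df

# Made-up frame whose filter matches nothing, mirroring the empty 2020 subset
demo = pd.DataFrame({"year": [1921, 2017], "valence": [0.4, 0.5]})
try:
    require_rows(demo[demo["year"] == 2020], "songs released in 2020")
except ValueError as err:
    print(err)
```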



# Data Cleaning & Preparation

In [None]:
# Ensure release_date is datetime
spotify_df['release_date'] = pd.to_datetime(spotify_df['release_date'], errors='coerce')

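# Drop rows with missing valence, energy, or release_date.
# .copy() makes this an independent frame, so adding 'year' below
# cannot trigger a SettingWithCopyWarning.
spotify_clean = spotify_df.dropna(subset=['valence', 'energy', 'release_date']).copy()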

# Extract year from release_date
spotify_clean['year'] = spotify_clean['release_date'].dt.year

# Filter for relevant range
spotify_clean = spotify_clean[(spotify_clean['year'] >= 1921) & (spotify_clean['year'] <= 2017)]

# Verify result
spotify_clean[['year', 'valence', 'energy']].describe()


# Line Plot: Valence over Time

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Aggregate valence by year
yearly_valence = spotify_clean.groupby('year')['valence'].mean().reset_index()

# Plot
plt.figure(figsize=(12, 6))
sns.lineplot(x='year', y='valence', data=yearly_valence)
plt.title('Average Valence of Songs (1921–2017)')
plt.xlabel('Year')
plt.ylabel('Average Valence')
plt.grid(True)
plt.tight_layout()
plt.show()


## Interpretation of Visualization

The line plot of average valence (musical positivity) from 1921 to 2017 reveals distinct historical patterns. There is a noticeable dip in emotional tone during the 1930s and 1940s, which aligns with the Great Depression and World War II era. From the mid-1950s through the 1980s, valence trends upward, possibly reflecting post-war recovery, economic growth, and cultural optimism.

After 1990, valence appears to fluctuate with a slight overall decline, suggesting a gradual shift in popular musical mood. This may correspond to changing listener preferences, industry trends, or broader sociocultural developments.

Further statistical analysis would be required to validate these observations, but this exploratory visualization suggests that historical context may influence the emotional tone of popular music over time.
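
As a first step toward that analysis, the direction of a trend within a chosen window can be quantified with a least-squares slope. The sketch below is descriptive only and is not a significance test. The function name `linear_trend` and the synthetic yearly means are assumptions for illustration; the input is expected to look like `yearly_valence`, with a `year` column and a value column.

```python
import numpy as np
import pandas as pd

def linear_trend(yearly, value_col, start, end):
    """Least-squares slope of value_col per year between start and end (inclusive)."""
    window = yearly[(yearly["year"] >= start) & (yearly["year"] <= end)]
    x = window["year"].to_numpy(dtype=float)
    y = window[value_col].to_numpy(dtype=float)
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope

# Synthetic yearly means that rise by exactly 0.01 per year (not real data)
years = np.arange(1950, 1961)
demo = pd.DataFrame({"year": years, "valence": 0.40 + 0.01 * (years - 1950)})
print(linear_trend(demo, "valence", 1950, 1960))
```

A positive slope indicates rising average valence over the window. Judging whether such a slope is distinguishable from noise would require a formal test on the underlying songs.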


# Plot Average Energy Over Time

In [None]:
# Aggregate energy by year
yearly_energy = spotify_clean.groupby('year')['energy'].mean().reset_index()

# Plot
plt.figure(figsize=(12, 6))
sns.lineplot(x='year', y='energy', data=yearly_energy)
plt.title('Average Energy of Songs (1921–2017)')
plt.xlabel('Year')
plt.ylabel('Average Energy')
plt.grid(True)
plt.tight_layout()
plt.show()


## Preliminary Reflection on Trends

The rising trend in song energy from the 1950s onward aligns with the emergence and dominance of electrified musical styles like rock, disco, electronic, and hip-hop. These genres tend to feature higher tempos, amplified instrumentation, and more rhythmic intensity—elements strongly correlated with Spotify's "energy" metric.

The dip in energy during the 1930s–1940s may reflect both the recording technology of the era and the somber cultural mood during the Great Depression and World War II.

While these interpretations remain speculative at this stage, they suggest a strong case for deeper domain research in the next phase of this project.
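
To make decade-level statements such as the 1930s–1940s dip easier to check, the yearly data can be summarized by decade. This is a minimal sketch with a hypothetical helper, `decade_means`, applied to made-up rows. It assumes a frame with `year`, `valence`, and `energy` columns, like `spotify_clean`.

```python
import pandas as pd

def decade_means(df, columns=("valence", "energy")):
    """Average the given columns by decade (for example, 1930 covers 1930-1939)."""
    decade = ((df["year"] // 10) * 10).rename("decade")
    return df.groupby(decade)[list(columns)].mean()

# Made-up rows for illustration, not values from the dataset
demo = pd.DataFrame({
    "year": [1931, 1939, 1940, 1945],
    "valence": [0.2, 0.4, 0.5, 0.7],
    "energy": [0.1, 0.3, 0.2, 0.4],
})
print(decade_means(demo))
```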


A more detailed exploration of historical musical trends will be conducted in Stage 3, using peer-reviewed musicology or sociocultural studies to validate the patterns observed here.