# COGS 108 - Final Project 

# Overview

Whether venting about a broken heart, a racist president, or being sent to fight a war you don’t agree with, music is one of the most common forms of expression and catharsis. Given music’s power to express a perspective, it’s no surprise that the Vietnam War era is remembered for the iconic music (among other media) that arose from it. 
	When President Donald Trump was elected, many people and music forums discussed how artists would be inspired by the election, predicting that countless protest songs and political anthems would arise from this period. Despite this prediction, the vast majority of popular music seems to be apolitical.
	Has music become less political, or have people stopped listening to political music? Does our nostalgia overemphasize the music of the Vietnam War era? Our goal is to expose the indirect influences of major events like the Vietnam War on music and to make connections (if any) between politics and music. 


# Names

- Albert Putra Purnama
- Austin Moss-Ennis
- Chandler Ennis
- Xirui He

# Research Question

1. Compared to the Vietnam War’s influence on popular music, how did Donald Trump’s election win and campaign platform influence popular music in America?

## Background and Prior Work

&emsp;&emsp;Whether venting about a broken heart, a racist president, or being sent to fight a war you don’t agree with, music is one of the most common forms of expression and catharsis. Given music’s power to express a perspective, it’s no surprise that the Vietnam War era is remembered for the iconic music (among other media) that arose from it. <br>
&emsp;&emsp;When President Donald Trump was elected, many people and music forums discussed how artists would be inspired by the election, predicting that countless protest songs and political anthems would arise from this period. Despite this prediction, the vast majority of popular music seems to be apolitical. <br>
&emsp;&emsp;Has music become less political, or have people stopped listening to political music? Does our nostalgia overemphasize the music of the Vietnam War era? Our goal is to expose the indirect influences of major events like the Vietnam War on music and to make connections (if any) between politics and music. 

References:
- [What has America been singing about?](https://journals.sagepub.com/doi/full/10.1177/0305735617748205)

# Hypothesis


The number of popular political/protest songs during the Vietnam War will be drastically higher than the number of political songs today, because youth participation in politics is now much lower than it was during the Vietnam War.

# Dataset(s)

**Dataset 1**
- Dataset Name: Kaggle 380000 Metro Lyrics
- Link to the dataset: https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics
- Number of observations: 380000+
- Description: This dataset supports our initial, simple modeling. It is clean data provided on Kaggle in CSV format, ready to be analyzed with a simple statistical model.

**Dataset 2**
- Dataset Name: Top Songs from 1958 to 2019
- Link to the dataset: https://www.billboard.com
- Number of observations: 
- Description: This dataset was scraped from billboard.com. The scraping process is explained in the Data retrieval section of Part 2.

# Setup
Please make sure you are using `python3` notebook.

In [1]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
import re
import string

In [2]:
# Read the data into variables. This cell is kept separate because
# loading takes about 5 seconds on a MacBook Pro and may take longer
# on slower machines.
country_songs_data = pd.read_csv('Clean_Country_Data.csv')
hot_songs_data = pd.read_csv('Clean_Hot_100_Chart.csv')
# The year is the first component of the 'Last Week Charted' date string.
# Each frame derives its year column from its own rows.
country_songs_data['year'] = country_songs_data.apply(lambda row: int(row['Last Week Charted'].split('-')[0]), axis=1)
hot_songs_data['year'] = hot_songs_data.apply(lambda row: int(row['Last Week Charted'].split('-')[0]), axis=1)
#hip_songs_data['year'] = hip_songs_data.apply(lambda row: int(row['Last Week Charted'].split('-')[0]), axis=1)

FileNotFoundError: File b'Clean_Country_Data.csv' does not exist

In [None]:
# make sure that the data is large enough
assert len(country_songs_data) > 1000, "Country data not large enough"
assert len(hot_songs_data) > 1000, "Hot 100 data not large enough"
#assert len(hip_songs_data) > 1000, "Hip Hop data not large enough"

# visually check that the data is parsed correctly
display(country_songs_data.head())
display(hot_songs_data.head())
#display(hip_songs_data.head())

**Explanations on the lyrics data** <br>

| Column name | Description |
|-------------|-------------|
| Artist | The artist, singer, or creator of the song |
| Song | The title of the song |
| Date | The date the song got released |
| Current Rank | The rank of this song on the billboard during the release |
| Last Weeks Position | The rank of this song on the billboard last week |
| Weeks on Chart | The number of weeks the song is in the billboard top 100 |
| Peak Position | The highest position of the song ever achieved in the chart |

**Explanations on the song data** <br>

| Column name | Description |
|-------------|-------------|
| Artist | The artist, singer, or creator of the song |
| Song | The title of the song |
| Date | The date the song got released |
| Years on Chart | The range of years the song is in the billboard top 100 |
| Weeks on Chart | The number of weeks the song is in the billboard top 100 |
| Peak Position | The highest position of the song ever achieved in the chart |
| Political Value | A numerical representation of how political the song is <br>as determined by our sentiment analysis program |

This marks the end of the input data. By this point, all inputs should have been loaded<br>
 into the notebook's memory. Next, we define some constants. <br>


In [None]:
# Declare constants
CONST_RAW_DATA_USED = 0.5 # Fraction of the input data we want to use;
                    # 0.5 means 50% of the entire set.
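
`CONST_RAW_DATA_USED` is declared here, but none of the cells shown applies it. The following is a minimal sketch of one way such a fraction could be used, assuming the rows are held in a plain list. The helper name `take_fraction` and the fixed seed are illustrative and not part of the project. With a pandas DataFrame, `DataFrame.sample(frac=..., random_state=...)` plays the same role.

```python
import random

CONST_RAW_DATA_USED = 0.5  # fraction of the input rows to keep


def take_fraction(rows, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of `rows`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    count = int(len(rows) * fraction)
    # A seeded generator keeps the subset identical across notebook runs.
    return random.Random(seed).sample(rows, count)


# Hypothetical stand-in rows; the real inputs are the chart DataFrames.
songs = ["song %d" % i for i in range(10)]
subset = take_fraction(songs, CONST_RAW_DATA_USED)
print(len(subset))
```

Seeding the sampler is a design choice: it trades randomness between runs for results that can be reproduced when the notebook is re-executed.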

---
# Data Cleaning

1. Get billboard data from 1960s to date.
2. Get lyrics from billboard songs (1960s to date) from genius.com
3. Get popular protests, war, or politics related songs

In [None]:
country_songs_data = country_songs_data.dropna()
hot_songs_data = hot_songs_data.dropna()
#hip_songs_data = hip_songs_data.dropna()

In [None]:
# Contractions and slang, expanded before punctuation is stripped.
# Order matters: specific forms come before the generic suffix rules.
CONTRACTIONS = [
    ("can't", "can not"), ("won't", "will not"), ("ain't", "aint"),
    ("n't", " not"), ("'ll", " will"), ("'re", " are"), ("'ve", " have"),
    ("'m", " am"), ("how'd", "how did"), ("'d", " would"),
    ("it's", "it is"), ("'til", "until"), ("'s", "s"), ("in'", "ing"),
    ("'cause", "because"), ("gon'", "going to"), ("gonna", "going to"),
    ("'bout", "about"), ("y'all", "you all"), ("tryna'", "trying to"),
    ("lil'", "little"), ("'em", "them"), ("'im", "him"),
    ("wanna'", "want to"),
]

# (canonical term, variants folded into it). Groups are applied in this
# order, so "racist hat" is handled before "racist" becomes "racism".
TERM_GROUPS = [
    ("gun", [
        "bullet", "guns", "aa-12", "ak-47", "ak-5", "ak-74", "ar-15", "calico",
        "caliber", "desert eagle", "draco", "famas", "five-seven", "five seven",
        "p90", "glock", "luger", "m16", "m1", "m21", "m4", "m9", "mac-10",
        "mac-11", "mac-12", "master key", "ruger", "blackhawk", "sks", "tec-9",
        "uzi", "colt", "armalite", "beretta", "heckler & koch", "kel-tec",
        "intratec", "mossberg", "sig", "taurus", "winchester", "smith & wesson",
        "smith and wesson", "revolver",
    ]),
    ("war", ["combat", "fighting", "battlefield", "battle", "warfare",
             "war-worn", "war-torn", "bloodshed"]),
    # includes common inflections of "die" and "kill"
    ("death", ["die", "died", "dying", "dead", "kill", "killed", "killing"]),
    ("soldiers", ["soldier", "private", "trooper", "troops", "army",
                  "military", "navy", "veteran", "vet"]),
    ("protest", ["protesting", "protests", "picket line", "picket sign"]),
    ("vietnam", ["viet cong", "nam", "vietnamese"]),
    ("president", ["chief", "presidential", "lyndon johnson", "lbj", "nixon",
                   "richard nixon", "donald trump"]),
    ("communism", ["communist", "commie"]),
    ("america", ["usa", "united states", "uncle sam", "american",
                 "red white and blue", "red white blue"]),
    ("maga", ["make america great", "making america great"]),
    ("red hat", ["maga hat", "red cap", "racist hat"]),
    ("the wall", ["the wall", "wall"]),
    ("mexico", ["mexican"]),
    ("racism", ["racist"]),
    ("fascism", ["fascist"]),
]

_PUNCT_TABLE = str.maketrans('', '', string.punctuation + '‘’')


def _strip_punct(text):
    """Delete punctuation and collapse every run of whitespace to one space."""
    text = text.translate(_PUNCT_TABLE).replace('…', ' ')
    return re.sub(r'\s+', ' ', text).strip()


def _compile(variants):
    # Variants get the same punctuation stripping as the lyrics ("ak-47"
    # becomes "ak47"). Longest first, so "richard nixon" wins over "nixon".
    # \b keeps "die" from matching inside "soldier"; s? folds simple plurals.
    cleaned = sorted((_strip_punct(v) for v in variants), key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(re.escape(v) for v in cleaned) + r')s?\b')


TERM_PATTERNS = [(canonical, _compile(variants)) for canonical, variants in TERM_GROUPS]


def clean_lyrics(lyrics):
    # Normalize case-sensitive names before lowercasing them
    lyrics = lyrics.replace("Johnson", "president")
    lyrics = lyrics.lower().strip()
    for old, new in CONTRACTIONS:
        lyrics = lyrics.replace(old, new)
    # Drop bracketed annotations such as section markers
    lyrics = re.sub(r"\[.*?\]", "", lyrics)
    lyrics = _strip_punct(lyrics)
    lyrics = lyrics.replace(" ya ", " you ")
    # Fold topic variants into their canonical terms
    for canonical, pattern in TERM_PATTERNS:
        lyrics = pattern.sub(canonical, lyrics)
    # Remove digits last so terms such as "ak-47" can still be matched above
    lyrics = re.sub(r"[0-9]", "", lyrics)
    # Collapse any double spaces left behind by the removals
    return re.sub(r" +", " ", lyrics).strip()
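
Normalization rules of this kind behave very differently depending on whether they match substrings or whole words. The following is a minimal sketch of the difference, using two of the terms from `clean_lyrics` on a made-up sentence (the sentence is purely illustrative):

```python
import re

text = "the soldier died as the velvet curtain fell"

# Plain substring replacement also rewrites the inside of longer words.
substring = text.replace("die", "death").replace("vet", "soldiers")

# Anchoring the pattern with \b restricts it to whole words.
whole_word = re.sub(r"\bdie\b", "death", text)
whole_word = re.sub(r"\bvet\b", "soldiers", whole_word)

print(substring)
print(whole_word)
```

The substring version turns "soldier" and "velvet" into nonsense tokens, while the whole-word version leaves the sentence untouched because neither "die" nor "vet" appears as a word of its own.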


# Data Analysis & Results

1. Get the frequently used words in popular protest, politics, or war songs
2. Get the statistics of the data based on groups.

## Get protest songs

We will use this later for the baseline of war related words

In [None]:
#import nltk
#from string import punctuation
import operator

In [None]:
#nltk.download()

In [None]:

country_songs_data['Lyrics'] = country_songs_data.apply (lambda row: clean_lyrics(row['Lyrics']), axis=1)
hot_songs_data['Lyrics'] = hot_songs_data.apply (lambda row: clean_lyrics(row['Lyrics']), axis=1)
#hip_songs_data['Lyrics'] = hip_songs_data.apply (lambda row: clean_lyrics(row['Lyrics']), axis=1)
display(country_songs_data.head())
display(hot_songs_data.head())
#hip_songs_data.head()

In [None]:
# freq = dict()
# #print (pol_df['Lyrics'])
# for lyrics in pol_df['lyrics']:
#     lyrics = [w for w in lyrics.split()]
#     for word in lyrics:
#         if word in freq.keys():
#             freq[word] += 1
#         else:
#             freq[word] = 1

# sorted_freq = sorted(freq.items(), key=operator.itemgetter(1))
# print(sorted_freq[-40:])
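
The frequency count outlined in the commented-out cell can also be written with `collections.Counter` from the standard library. This is only a sketch: `protest_lyrics` is a hypothetical stand-in for the cleaned lyrics of the protest songs, since the `pol_df` frame referenced in that cell is not loaded in this notebook.

```python
from collections import Counter

# Hypothetical stand-in for the cleaned lyrics of the protest songs.
protest_lyrics = [
    "war what is it good for",
    "give peace a chance",
    "war is over if you want it",
]

freq = Counter()
for lyrics in protest_lyrics:
    # update() adds one count per word in the list it receives.
    freq.update(lyrics.split())

# most_common(n) returns the n most frequent (word, count) pairs.
top_words = freq.most_common(3)
print(top_words)
```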

## Data Analytics

- Group by year
- Group by genre

In [None]:
war_related_words = [
    'fight', 'stand', 'war', 'hate', 'president', 'propaganda', 'gun', 'death', 
    'protest', 'soldiers', 'vietnam', 'communism', 'america'
]
race_related_words = [
    'city', 'stand', 'man', 'black', 'hate', 'politics', 'gun', 'death', 'racism', 'mexico'
]
election_related_words = [
    'politics', 'president', 'propaganda', 'protest', 'america'
]
trump_related_words = [
    'trump', ' maga ', 'red hat', 'the wall', 'fascism'
]

In [None]:
country_group_by_year = country_songs_data.groupby('year')
hot_group_by_year = hot_songs_data.groupby('year')
#hip_group_by_year = hip_songs_data.groupby('year')

In [None]:
def getYearlyValues(songs_group_by_year):
    """Count keyword-matching songs per year.

    Returns 12 dicts keyed by year, three per topic in the order war, race,
    election, Trump: the number of songs containing any topic word, that
    number divided by the songs in that year, and the number of songs
    containing more than a threshold of distinct topic words.
    """
    thres = 4
    topics = [
        # (keywords, threshold, first year counted or None for all years)
        (war_related_words, thres, None),
        (race_related_words, thres, None),
        (election_related_words, thres, None),
        (trump_related_words, 1, 2015),
    ]
    results = [dict() for _ in range(3 * len(topics))]

    for year, group in songs_group_by_year:
        if year < 1000:  # skip implausible years
            continue
        for i, (words, limit, first_year) in enumerate(topics):
            if first_year is not None and year < first_year:
                any_count = over_count = 0
            else:
                # Substring test: 'war' also matches inside e.g. 'toward'
                hits = [sum(w in l for w in words) for l in group['Lyrics']]
                any_count = sum(h > 0 for h in hits)
                over_count = sum(h > limit for h in hits)
            results[3 * i][year] = any_count
            results[3 * i + 1][year] = any_count / len(group)
            results[3 * i + 2][year] = over_count
    return results
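
To make the per-year ratio concrete, here is a small standalone sketch of the same idea in plain Python. The songs and the shortened word list are made up for illustration, and the sketch tests whole words (`lyrics.split()`) rather than substrings, which is an assumption that differs from `getYearlyValues`.

```python
from collections import defaultdict

# Hypothetical (year, cleaned lyrics) pairs standing in for the chart data.
songs = [
    (1968, "bring the soldiers home from the war"),
    (1968, "sugar sugar honey honey"),
    (2017, "dancing all night long"),
    (2017, "no more war no more hate"),
    (2017, "love on the weekend"),
]
war_words = ["war", "hate", "soldiers"]  # shortened list for the example

by_year = defaultdict(list)
for year, lyrics in songs:
    by_year[year].append(lyrics)

ratio_by_year = {}
for year, group in sorted(by_year.items()):
    # A song counts once if it contains at least one topic word.
    matching = sum(
        any(w in lyrics.split() for w in war_words) for lyrics in group
    )
    ratio_by_year[year] = matching / len(group)

print(ratio_by_year)
```

Dividing by `len(group)` is what makes years with very different numbers of charted songs comparable.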

In [None]:
country_vals_by_year = getYearlyValues(country_group_by_year)
hot_vals_by_year = getYearlyValues(hot_group_by_year)
#hip_vals_by_year = getYearlyValues(hip_songs_data)

In [None]:
fig = plt.figure()

plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[0].keys()), list(country_vals_by_year[0].values()))

fig.suptitle('Sum of war related country songs over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('War related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[1].keys()), list(country_vals_by_year[1].values()))
fig.suptitle('Ratio of war related country songs to country songs produced over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('War related / total songs', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[2].keys()), list(country_vals_by_year[2].values()))
fig.suptitle('War related country songs above the keyword threshold', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('War related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[3].keys()), list(country_vals_by_year[3].values()))
fig.suptitle('Sum of race related country songs over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Race related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[4].keys()), list(country_vals_by_year[4].values()))
fig.suptitle('Ratio of race related country songs to country songs produced over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Race related / total songs', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[5].keys()), list(country_vals_by_year[5].values()))
fig.suptitle('Race related country songs above the keyword threshold', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Race related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[6].keys()), list(country_vals_by_year[6].values()))
fig.suptitle('Sum of election related country songs over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Election related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[7].keys()), list(country_vals_by_year[7].values()))
fig.suptitle('Ratio of election related country songs to country songs produced over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Election related / total songs', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[8].keys()), list(country_vals_by_year[8].values()))
fig.suptitle('Election related country songs above the keyword threshold', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Election related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[9].keys()), list(country_vals_by_year[9].values()))
fig.suptitle('Sum of Trump related country songs over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Trump related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[10].keys()), list(country_vals_by_year[10].values()))
fig.suptitle('Ratio of Trump related country songs to country songs produced over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Trump related / total songs', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(country_vals_by_year[11].keys()), list(country_vals_by_year[11].values()))
fig.suptitle('Trump related country songs above the keyword threshold', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Trump related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()

plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[0].keys()), list(hot_vals_by_year[0].values()))

fig.suptitle('Sum of war related Hot 100 songs over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('War related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[1].keys()), list(hot_vals_by_year[1].values()))
fig.suptitle('Ratio of war related Hot 100 songs to Hot 100 songs produced over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('War related / total songs', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[2].keys()), list(hot_vals_by_year[2].values()))
fig.suptitle('War related Hot 100 songs above the keyword threshold', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('War related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[3].keys()), list(hot_vals_by_year[3].values()))
fig.suptitle('Sum of race related Hot 100 songs over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Race related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[4].keys()), list(hot_vals_by_year[4].values()))
fig.suptitle('Ratio of race related Hot 100 songs to Hot 100 songs produced over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Race related / total songs', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[5].keys()), list(hot_vals_by_year[5].values()))
fig.suptitle('Race related Hot 100 songs above the keyword threshold', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Race related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[6].keys()), list(hot_vals_by_year[6].values()))
fig.suptitle('Sum of election related Hot 100 songs over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Election related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[7].keys()), list(hot_vals_by_year[7].values()))
fig.suptitle('Ratio of election related Hot 100 songs to Hot 100 songs produced over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Election related / total songs', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[8].keys()), list(hot_vals_by_year[8].values()))
fig.suptitle('Election related Hot 100 songs above the keyword threshold', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Election related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[9].keys()), list(hot_vals_by_year[9].values()))
fig.suptitle('Sum of Trump related Hot 100 songs over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Trump related', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[10].keys()), list(hot_vals_by_year[10].values()))
fig.suptitle('Ratio of Trump related Hot 100 songs and songs produced over the years', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Trump related / total songs', fontsize=16)
plt.show()

In [None]:
fig = plt.figure()
plt.xlim(1950,2020)
plt.bar(list(hot_vals_by_year[11].keys()), list(hot_vals_by_year[11].values()))
fig.suptitle('Trump related Hot 100 songs above the keyword threshold', fontsize=20)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Trump related', fontsize=16)
plt.show()
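
The plotting cells above differ only in the series index, the title, and the y-axis label, so they could be generated in a loop. The sketch below is one possible way to do that. The names `TOPICS`, `plot_specs`, and `plot_all` are hypothetical, and the titles are approximations of the ones used above rather than exact copies.

```python
TOPICS = ["war", "race", "election", "Trump"]


def plot_specs(chart_name):
    """Return (series index, title, y-axis label) for the 12 yearly series."""
    specs = []
    for t, topic in enumerate(TOPICS):
        label = topic + " related"
        ylabel = label[0].upper() + label[1:]
        specs.append((3 * t,
                      "Sum of %s %s songs over the years" % (label, chart_name),
                      ylabel))
        specs.append((3 * t + 1,
                      "Ratio of %s %s songs to all %s songs" % (label, chart_name, chart_name),
                      ylabel + " / total songs"))
        specs.append((3 * t + 2,
                      "%s %s songs above the keyword threshold" % (ylabel, chart_name),
                      ylabel))
    return specs


def plot_all(vals_by_year, chart_name):
    """Draw one bar chart per series returned by getYearlyValues."""
    # Imported here so plot_specs can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    for index, title, ylabel in plot_specs(chart_name):
        series = vals_by_year[index]
        fig = plt.figure()
        plt.xlim(1950, 2020)
        plt.bar(list(series.keys()), list(series.values()))
        fig.suptitle(title, fontsize=20)
        plt.xlabel("Year", fontsize=18)
        plt.ylabel(ylabel, fontsize=16)
        plt.show()
```

With these helpers, `plot_all(country_vals_by_year, "country")` and `plot_all(hot_vals_by_year, "Hot 100")` would produce the same 24 figures.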

## Year based analysis

The initial idea behind grouping by year is to see how politics affects the songs released in that year.  
The first figure shows a notably large number of war related songs in 2006. One possible explanation is that,  
by 2006, many Americans had come to believe that the war in Iraq was about securing Iraq's natural resources.
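
A peak year like this can be read off the yearly dictionaries programmatically instead of from the figure. The following is a minimal sketch on made-up toy counts in the same `{year: count}` shape that `getYearlyValues` returns; the numbers are not results from our data.

```python
# Toy counts for illustration only, not values from the datasets.
toy_counts_by_year = {2001: 3, 2002: 7, 2003: 5}

# max() over the keys, ranked by their counts, gives the peak year.
peak_year = max(toy_counts_by_year, key=toy_counts_by_year.get)
print(peak_year, toy_counts_by_year[peak_year])
```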

# Part 2

In this part we analyze a different dataset. The main purpose is to analyze data we curated ourselves instead of the data provided on Kaggle. We curated our own data for two reasons:
1. The Kaggle dataset used above does not go beyond 2016; our new dataset includes songs up to March 2019.
2. The songs in the Kaggle dataset may not be popular songs, whereas the data we scraped come from Billboard's top 100 charts and are therefore popular.

---
## Data retrieval

The following is the code that we used to get the data from [billboard](https://www.billboard.com) and [genius](https://genius.com/)

> Please note that the following code is not the complete version of the code. Please refer to the github repo [here](https://github.com/albertputrapurnama/FinalProject108) for the complete python script used for actual dataset retrieval.

We will guide you briefly on how it works

```python
    # filename: DataWrangler_toCSV.py
    from datetime import date, datetime, timedelta
    import csv
    import inquirer
    from bs4 import BeautifulSoup
    import requests
    import time
    import urllib2
    import json
    import re
    import unicodedata

    lyric_dict = dict()
    nan = 0

    charts = {'Hot 100': ["https://www.billboard.com/charts/hot-100/", 1958, 8, 4],
              'Pop Songs': ["https://www.billboard.com/charts/pop-songs/", 1992, 10, 3],
              'Rock Songs': ["https://www.billboard.com/charts/rock-songs/", 2009, 6, 20],
              'Hip-Hop/R&B Songs': ["https://www.billboard.com/charts/r-b-hip-hop-songs/", 1958, 10, 20],
              'Country Songs': ["https://www.billboard.com/charts/country-songs/", 1958, 10, 20]
              }
```

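*The code above imports the libraries and defines our constants. Each entry in `charts` holds the chart URL followed by what the main script uses as the starting year, month, and day.*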
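
As a minimal sketch of how one `charts` entry could be turned into the first page to scrape, assuming the three numbers are the year, month, and day of the first chart (the main script unpacks them as `base_year`, `base_month`, `base_day`):

```python
from datetime import date

# One entry copied from the `charts` dictionary.
charts = {"Hot 100": ["https://www.billboard.com/charts/hot-100/", 1958, 8, 4]}

url, year, month, day = charts["Hot 100"]
first_chart = date(year, month, day)

# scrape_site builds each page address by appending the ISO date to the URL.
first_page = url + first_chart.isoformat()
print(first_page)
```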

---
```python
    # filename: DataWrangler_toCSV.py
    def info(s, dt):
        artist = s.get("data-artist").encode('ascii', 'ignore').decode('ascii')
        song = s.get("data-title").encode('ascii', 'ignore').decode('ascii')
        rank = s.get("data-rank")
        lyrics = get_lyrics(song, artist).encode('ascii', 'ignore').decode('ascii')
        if s.find(class_="chart-list-item__last-week") is None:
            last = '0'
        else:
            last = s.find(class_="chart-list-item__last-week").contents[0]
        if s.find(class_="chart-list-item__weeks-at-one") is None:
            peak = '0'
        else:
            peak = s.find(class_="chart-list-item__weeks-at-one").contents[0]
        if s.find(class_="chart-list-item__weeks-on-chart") is None:
            weeks = '0'
        else:
            weeks = s.find(class_="chart-list-item__weeks-on-chart").contents[0]
        return [artist, song, dt.isoformat(), rank, last, peak, weeks, lyrics]
```

`info` is a helper function that extracts the fields we want from a single chart entry.

`info` takes two parameters:
1. **s**: a BeautifulSoup tag for one chart entry (an element with class `chart-list-item`). More documentation on BeautifulSoup is available [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
2. **dt**: the date of the chart page the entry came from.

`info` returns a list of the scraped fields in the format needed for further analysis.

```python
    def scrape_site(writer, url, chart_date):
        global nan
        start_time = time.time()
        new_url = url + chart_date.isoformat()
        req = requests.get(new_url)
        soup = BeautifulSoup(req.text, "html.parser")
        songs = soup.find_all(class_="chart-list-item")
        hundred_list = [info(x, chart_date) for x in songs]
        print("Time taken: " + str(time.time() - start_time))
        print("NaNs = " + str(nan))
        writer.writerows(hundred_list)
```
`scrape_site` uses the `info` function to scrape the data from a given URL.

`scrape_site` takes three parameters:
1. **writer**: the `csv.writer` object that writes the scraped rows to the output file.
2. **url**: the base URL of the chart to scrape; the chart date is appended to it.
3. **chart_date**: the publishing date of the chart.

```python
    def get_lyrics(song, artist):
        global nan
        global lyric_dict
        # Makes a URL that the genius API is capable of reading/requesting
        genius_url = "https://api.genius.com/search?q="

        # ... more code here (elided; see the complete script in the repository) ...
        
            lyrics = get_lyrics_from_url(lyric_url)
            lyric_dict[url] = lyrics
            return lyrics
```

`get_lyrics` is a large function that takes a song title and an artist and returns cleanly formatted lyrics.

Many other helper functions come into play here, but to keep the report simple we do not cover them. Feel free to download the complete Python script from the GitHub repository [here](https://github.com/albertputrapurnama/FinalProject108)
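
Two ideas visible in the excerpt are building a search URL and caching lyrics in `lyric_dict` so each page is fetched only once. The sketch below illustrates both in Python 3 under stated assumptions: `build_search_url`, `cached_lyrics`, and `fake_fetch` are hypothetical names, the real request to the Genius API (including authentication) is replaced by a stand-in function, and the original script may do this differently.

```python
from urllib.parse import quote_plus

GENIUS_SEARCH_URL = "https://api.genius.com/search?q="
lyric_cache = {}


def build_search_url(song, artist):
    """Percent-encode the query so spaces and symbols are URL-safe."""
    return GENIUS_SEARCH_URL + quote_plus(song + " " + artist)


def cached_lyrics(url, fetch):
    """Return lyrics for `url`, calling `fetch` only on the first request."""
    if url not in lyric_cache:
        lyric_cache[url] = fetch(url)
    return lyric_cache[url]


calls = []


def fake_fetch(url):
    # Stand-in for a real network request.
    calls.append(url)
    return "la la la"


url = build_search_url("Fortunate Son", "Creedence Clearwater Revival")
cached_lyrics(url, fake_fetch)
cached_lyrics(url, fake_fetch)
print(url)
print(len(calls))
```

Caching matters here because the same song appears on the chart for many consecutive weeks, so most lookups repeat an earlier one.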

---
The following code block is the main code

```python
    # filename: DataWrangler_toCSV.py
    billboard_url, base_year, base_month, base_day = inquire()

    curr_date = date(base_year, base_month, base_day)
    end_date = last_day()
    # Ask user for input to name the spreadsheet.
    sheetTitle = raw_input("Enter spreadsheet title: ").replace(" ", "_")
    sheetTitle = sheetTitle + ".csv"
    with open(sheetTitle, mode='w') as billboard_chart:
        charter = csv.writer(billboard_chart, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        charter.writerow(
            ["Artist", "Song", "Date", "Current Rank", "Last Weeks Position", "Peak Position", "Weeks on Chart", "Lyrics"])
        #************Section 2**********************
        while curr_date != end_date:
            scrape_site(charter, billboard_url, curr_date)
            print("Finished charting the week of: " + curr_date.isoformat())
            curr_date = date_increment(curr_date)
        #*******************************************
```

This is where we put all the function definitions to work. We open a file for writing, then run the scraper on the dates we specified. This is shown in the code marked as **Section 2**.
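
The helpers `date_increment` and `last_day` are not shown in this report. As a rough sketch of the date loop, assuming charts are published one week apart (the generator name `weekly_dates` is hypothetical):

```python
from datetime import date, timedelta


def weekly_dates(start, end):
    """Yield chart dates one week apart, from `start` up to and including `end`."""
    current = start
    # `<=` ends the loop even when `end` is not a whole number of weeks away.
    while current <= end:
        yield current
        current += timedelta(weeks=1)


dates = list(weekly_dates(date(1958, 8, 4), date(1958, 8, 30)))
print([d.isoformat() for d in dates])
```

Comparing with `<=` rather than `!=` is a defensive choice: a loop that waits for exact equality never terminates if the end date does not fall on a chart date.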

# Ethics & Privacy

&emsp;&emsp;Since we focus on music and politics, most of our data is publicly available and is unlikely to raise privacy issues.<br>
&emsp;&emsp;Most of our ethical considerations revolve around the community and the political figures themselves. We live in an era where social constructs such as race, gender, and sexual orientation cannot be ignored. We have to be careful about possible racial discrimination revealed in the data we collected. One way to mitigate this is to evaluate the data source and look for racial discrimination before using the data. The same method can be used to mitigate gender discrimination. Sexual orientation is not included in the dataset; although an artist's sexual orientation could be looked up, doing so would be complicated and exhaustive work, especially since there are thousands of songs with thousands of different writers and musicians.<br>
&emsp;&emsp;Since we analyze artists’ responses to the political activity around them, we have to be careful not to propose conclusions that may endanger an artist’s career or life. Likewise, political figures should also be protected. This can be addressed by anonymizing any entity (artist or public figure) that we analyze. Doing so serves two purposes: it protects the identities of artists and public figures, and it pushes the research toward a more general political and musical scope.<br>
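
One simple way to approach the anonymization described above is to replace each name with a salted hash. This is only a sketch: the function name `pseudonym` and the salt value are hypothetical. Strictly speaking, this is pseudonymization rather than full anonymization, since anyone who knows the salt can recompute the labels, so the salt would have to stay private.

```python
import hashlib


def pseudonym(name, salt="project-secret"):
    """Map a name to a short, stable label that does not reveal the name."""
    digest = hashlib.sha256((salt + name.lower()).encode("utf-8")).hexdigest()
    return "artist_" + digest[:8]


a = pseudonym("Some Artist")
b = pseudonym("some artist")
c = pseudonym("Another Artist")
print(a == b, a == c)
```

Lowercasing before hashing means differently capitalized spellings of the same name map to the same label, which keeps per-artist grouping intact after the names are removed.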

# Conclusion & Discussion

*Fill in your discussion information here*