# Research Question
#### Is there a relationship between how happy a country is and the music its people listen to?

### Introduction
A country's happiness is often attributed to factors such as freedom, social support, life expectancy, and health. The music that people in a country listen to is usually not counted among these factors, given its weak relationship to happiness and its versatility across the world. Our research question does not attempt to attribute a country's happiness index score to the type of music it listens to (this would be difficult to establish given reverse causality). Instead, we are interested in observing which trends are apparent in the music listened to in some of the happiest and least happy countries.

#### Some additional questions that we are seeking to answer...
- Are there songs that are popular in all countries regardless of happiness rank?
- Are there genres/styles of music that are consistent with "happy" countries? With "sad" countries? For example, can we expect happy countries to listen to more pop music than sad ones?
- What are the characteristics of songs that are popular in happy countries? Fast or slow tempo? Live vocals or autotune?

#### The datasets we will be using:

All the datasets were found on Kaggle.
<br>
1. ["World Happiness Report 2017"](https://www.kaggle.com/unsdsn/world-happiness) by Sustainable Development Solutions Network
<br>
The Sustainable Development Solutions Network ranks 155 countries by their happiness levels, and calculates a "happiness score" for each country using six factors: economic production, social support, life expectancy, freedom, absence of corruption, and generosity. This dataset is for the year 2017.
2. ["Spotify's Worldwide Daily Song Ranking"](https://www.kaggle.com/edumucelli/spotifys-worldwide-daily-song-ranking) by Kaggle user
<br>
For each of 54 countries, this dataset provides the top 200 songs per day from January 1, 2017 to January 9, 2018.
- *Note: the Kaggle description says 53 countries, but we found 54 countries. Perhaps the description was not updated when the dataset was.*
3. ["Spotify Dataset 1921-2020, 160k+ Tracks"](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data.csv) by Kaggle user
<br>
This dataset contains characteristics of over 175,000 songs taken directly from the Spotify Web API. Some of these values include scores for danceability, beats per minute (bpm), and liveness (the likelihood that the song is a live recording).

<hr>

## Data prep

In [114]:
#importing relevant packages
import requests #package for http requests
import bs4 # package for html parsing
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Data cleaning for "World Happiness Report"

In [72]:
#"World Happiness Report 2017"
happy2017=pd.read_csv("2017.csv")
#cleaning up col names
happy2017columns=happy2017.columns
happy2017columns= [x.lower() for x in happy2017columns]
happy2017columns= [x.replace("..",".") for x in happy2017columns]
happy2017columns= [x.replace(".","_") for x in happy2017columns]
happy2017.columns= happy2017columns
happy2017.head()

Unnamed: 0,country,happiness_rank,happiness_score,whisker_high,whisker_low,economy_gdp_per_capita_,family,health_life_expectancy_,freedom,generosity,trust_government_corruption_,dystopia_residual
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715
3,Switzerland,4,7.494,7.561772,7.426227,1.56498,1.516912,0.858131,0.620071,0.290549,0.367007,2.276716
4,Finland,5,7.469,7.527542,7.410458,1.443572,1.540247,0.809158,0.617951,0.245483,0.382612,2.430182


#### About the "World Happiness Report 2017"
##### How were the scores calculated?
*(Edited from Kaggle description)*
>The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder (10 being the best possible life and 0 the worst possible life) and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2013-2016 and use the Gallup weights to make the estimates representative.
<br><br>The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contribute to making life evaluations higher in each country than they are in Dystopia, an imaginary country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country. 
The purpose in establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables, thus allowing each sub-bar to be of positive width.

##### Column Descriptions
- `happiness_score` the sum of the `economy_gdp_per_capita_`, `family`, `health_life_expectancy_`, `freedom`, `generosity`, `trust_government_corruption_`, and `dystopia_residual` scores. Apart from `dystopia_residual`, these individual scores reflect the "six factors" used to calculate happiness in the description above.<br>
- `happiness_rank` the ranking of each country's happiness score, from the highest happiness score to the lowest<br>
- `country` the country being ranked/scored<br>

* **For our research purposes, we will only be keeping the following columns: `country`,`happiness_rank`, and `happiness_score`.**
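As a quick sanity check on the `happiness_score` description above, the sketch below adds up the component columns for one country and compares the total with the reported score. The values are Norway's row copied from the `happy2017.head()` output; the variable names are only illustrative.

```python
import pandas as pd

# Norway's row, copied from the happy2017.head() output above
norway_row = pd.Series({
    "happiness_score": 7.537,
    "economy_gdp_per_capita_": 1.616463,
    "family": 1.533524,
    "health_life_expectancy_": 0.796667,
    "freedom": 0.635423,
    "generosity": 0.362012,
    "trust_government_corruption_": 0.315964,
    "dystopia_residual": 2.277027,
})

# every column except the score itself is one component of the score
components = norway_row.drop("happiness_score")
component_sum = components.sum()

print(round(component_sum, 3))  # 7.537
```

The same check could be run on the full dataframe before the extra columns are dropped.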

In [77]:
#removing all columns except for "country", "happiness_rank", and "happiness_score"
happy2017=happy2017.iloc[:,:3]
happy2017.head()

Unnamed: 0,country,happiness_rank,happiness_score
0,Norway,1,7.537
1,Denmark,2,7.522
2,Iceland,3,7.504
3,Switzerland,4,7.494
4,Finland,5,7.469


In [115]:
#"Spotify's Worldwide Song Ranking"
#This .csv file was so big that not only could we not push it to GitHub, but it was also difficult to load the file on Sheets.
#Locally on her own computer, Eva split up data.csv into individual .csv files by country so that we could work with the data.

#allspotifydata=pd.read_csv("data.csv")
#countries=pd.unique(allspotifydata['Region'])

#allcountries=[]
#for country in countries:
    #allcountries.append(allspotifydata[allspotifydata['Region']==country])

#count=1
#for df in allcountries:
    #name='country'+str(count)+'.csv'
    #df.to_csv(r'C:\Users\Eva\Downloads\country'+str(count)+'.csv')
    #count=count+1

In [102]:
argentina=pd.read_csv("argentina.csv", parse_dates=['Date'])
australia=pd.read_csv("australia.csv", parse_dates=['Date'])
austria=pd.read_csv("austria.csv", parse_dates=['Date'])
belgium=pd.read_csv("belgium.csv", parse_dates=['Date'])
bolivia=pd.read_csv("bolivia.csv", parse_dates=['Date'])
brazil=pd.read_csv("brazil.csv", parse_dates=['Date'])
canada=pd.read_csv("canada.csv", parse_dates=['Date'])
chile=pd.read_csv("chile.csv", parse_dates=['Date'])
colombia=pd.read_csv("colombia.csv", parse_dates=['Date'])
costarica=pd.read_csv("costarica.csv", parse_dates=['Date'])
czechrepublic=pd.read_csv("czechrepublic.csv", parse_dates=['Date'])
denmark=pd.read_csv("denmark.csv", parse_dates=['Date'])
dominicanrepublic=pd.read_csv("dominicanrepublic.csv", parse_dates=['Date'])
ecuador=pd.read_csv("ecuador.csv", parse_dates=['Date'])
elsalvador=pd.read_csv('elsalvador.csv', parse_dates=['Date'])
estonia=pd.read_csv('estonia.csv', parse_dates=['Date'])
finland=pd.read_csv("finland.csv", parse_dates=['Date'])
france=pd.read_csv("france.csv", parse_dates=['Date'])
germany=pd.read_csv("germany.csv", parse_dates=['Date'])
Global=pd.read_csv('global.csv', parse_dates=['Date'])
greece=pd.read_csv('greece.csv', parse_dates=['Date'])
guatemala=pd.read_csv("guatemala.csv", parse_dates=['Date'])
honduras=pd.read_csv("honduras.csv", parse_dates=['Date'])
hongkong=pd.read_csv("hongkong.csv", parse_dates=['Date'])
hungary=pd.read_csv("hungary.csv", parse_dates=['Date'])
iceland=pd.read_csv("iceland.csv", parse_dates=['Date'])
indonesia=pd.read_csv("indonesia.csv", parse_dates=['Date'])
ireland=pd.read_csv("ireland.csv", parse_dates=['Date'])
italy=pd.read_csv("italy.csv", parse_dates=['Date'])
japan=pd.read_csv("japan.csv", parse_dates=['Date'])
latvia=pd.read_csv("latvia.csv", parse_dates=['Date'])
lithuania=pd.read_csv("lithuania.csv", parse_dates=['Date'])
luxembourg=pd.read_csv("luxembourg.csv", parse_dates=['Date'])
malaysia=pd.read_csv("malaysia.csv", parse_dates=['Date'])
mexico=pd.read_csv("mexico.csv", parse_dates=['Date'])
netherlands=pd.read_csv("netherlands.csv", parse_dates=['Date'])
newzealand=pd.read_csv("newzealand.csv", parse_dates=['Date'])
norway=pd.read_csv("norway.csv", parse_dates=['Date'])
panama=pd.read_csv("panama.csv", parse_dates=['Date'])
paraguay=pd.read_csv("paraguay.csv", parse_dates=['Date'])
peru=pd.read_csv("peru.csv", parse_dates=['Date'])
philippines=pd.read_csv("philippines.csv", parse_dates=['Date'])
poland=pd.read_csv("poland.csv", parse_dates=['Date'])
portugal=pd.read_csv("portugal.csv", parse_dates=['Date'])
singapore=pd.read_csv("singapore.csv", parse_dates=['Date'])
slovakia=pd.read_csv("slovakia.csv", parse_dates=['Date'])
spain=pd.read_csv("spain.csv", parse_dates=['Date'])
sweden=pd.read_csv("sweden.csv", parse_dates=['Date'])
switzerland=pd.read_csv("switzerland.csv", parse_dates=['Date'])
taiwanprovinceofchina=pd.read_csv("taiwan.csv", parse_dates=['Date'])
turkey=pd.read_csv("turkey.csv", parse_dates=['Date'])
unitedkingdom=pd.read_csv("unitedkingdom.csv", parse_dates=['Date'])
unitedstates=pd.read_csv("unitedstates.csv", parse_dates=['Date'])
uruguay=pd.read_csv("uruguay.csv", parse_dates=['Date'])

Unnamed: 0             int64
Position               int64
Track Name            object
Artist                object
Streams                int64
URL                   object
Date          datetime64[ns]
Region                object
dtype: object

#### Creating a new happiness dataframe that contains only the relevant countries
As noted in our dataset descriptions, the "World Happiness Report 2017" (`happy2017` dataframe) contains happiness data for 155 countries, while "Spotify's Worldwide Daily Song Ranking" contains only 54 countries. We needed to find the overlapping countries to <br>
1. Create a new happiness ranking excluding the countries not found in the song ranking dataset, subsetted in the dataframe **happy**
<br>
2. Figure out which "Spotify's Worldwide Daily Song Ranking" country .csv files we do not need.
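The overlap described above can also be sketched in vectorized form, without looping over rows. This is only a minimal sketch on toy data: `happy_toy` and `spotify_names` are hypothetical stand-ins for `happy2017` and `allspotifycountries`, and the values are illustrative.

```python
import pandas as pd

# toy stand-ins for happy2017 and allspotifycountries (illustrative values)
happy_toy = pd.DataFrame({
    "country": ["Norway", "Costa Rica", "Israel", "United States"],
    "happiness_rank": [1, 12, 11, 14],
    "happiness_score": [7.537, 7.079, 7.213, 6.993],
})
spotify_names = ["norway", "costarica", "unitedstates", "Global"]

# normalize names the same way as the .csv file names: lower case, no spaces
normalized = happy_toy["country"].str.lower().str.replace(" ", "", regex=False)

# 1. keep only the rows whose normalized name has a Spotify file
in_both = normalized.isin(spotify_names)
happy_subset = happy_toy[in_both].reset_index(drop=True)

# 2. Spotify entries with no happiness score (here, the "Global" chart)
not_needed = sorted(set(spotify_names) - set(normalized))

print(happy_subset)
print(not_needed)
```

Because the comparison relies on normalized names, a country that is spelled differently in the two datasets would still need to be handled by hand.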

In [110]:
#list of all 54 countries from "Spotify's Worldwide Songs"
allspotifycountries=["argentina", "australia", "austria", "belgium", "brazil","bolivia", "canada", "chile", "colombia", "costarica", "czechrepublic","denmark", "dominicanrepublic", "estonia", "elsalvador", "Global","greece", "ecuador", "finland", "france", "germany", "guatemala", "honduras", "hongkong", "hungary", "iceland", "indonesia", "ireland", "italy", "japan", "latvia", "lithuania", "luxembourg", "malaysia", "mexico", "netherlands", "newzealand", "norway", "panama", "paraguay", "peru", "philippines", "poland", "portugal", "singapore", "slovakia", "spain", "sweden", "switzerland", "taiwanprovinceofchina", "turkey", "unitedkingdom", "unitedstates", "uruguay"]

#list for countries that are found in both datasets.
allcountries=[]
for row in range(len(happy2017)):
    country=happy2017.loc[row,'country']
    country=country.lower()
    country=country.replace(" ","")
    if country in allspotifycountries:
        allcountries.append(country)      

#Happy Index from 2017 with only 'country', 'happiness_rank', and 'happiness_score' columns and only with countries that have their own dataset for spotify daily.
#rows are collected in a list first because DataFrame.append was removed in pandas 2.0
rows=[]
for row in range(len(happy2017)):
    country=happy2017.loc[row,'country']
    country=country.lower()
    country=country.replace(" ","")
    if country in allcountries:
        rows.append({'country':happy2017.loc[row,'country'],'happiness_rank':happy2017.loc[row,'happiness_rank'],'happiness_score':happy2017.loc[row,'happiness_score']})
happy=pd.DataFrame(rows, columns=['country','happiness_rank','happiness_score'])

        
#Noticed that Taiwan is referred to as Taiwan Province of China on Happy Index.
# newrow={'Country':happy2017.loc[32,'Country'],'Happiness.Rank':happy2017.loc[32,'Happiness.Rank'],'Happiness.Score':happy2017.loc[32,'Happiness.Score']}
# happy=happy.append(newrow, ignore_index=True)

#allspotifydata[allspotifydata['Region']==country])
# allcountries.append("Taiwan")
print("There are "+ str(len(allcountries)) +" countries total that we can use to address our research question.")
print("\n")
print('These are the spotify datsets we should use: ' + str(allcountries))
print("\n")
print("This is the updated happiness ranking:")
print(happy)

There are 52 countries total that we can use to address our research question.


These are the spotify datsets we should use: ['norway', 'denmark', 'iceland', 'switzerland', 'finland', 'netherlands', 'canada', 'newzealand', 'sweden', 'australia', 'costarica', 'austria', 'unitedstates', 'ireland', 'germany', 'belgium', 'luxembourg', 'unitedkingdom', 'chile', 'brazil', 'czechrepublic', 'argentina', 'mexico', 'singapore', 'uruguay', 'guatemala', 'panama', 'france', 'taiwanprovinceofchina', 'spain', 'colombia', 'slovakia', 'malaysia', 'ecuador', 'elsalvador', 'poland', 'italy', 'japan', 'lithuania', 'latvia', 'bolivia', 'peru', 'estonia', 'turkey', 'paraguay', 'philippines', 'hungary', 'indonesia', 'dominicanrepublic', 'greece', 'portugal', 'honduras']


This is the updated happiness ranking:
                     country  happiness_rank  happiness_score
0                     Norway             1.0            7.537
1                    Denmark             2.0            7.522
2             

### Data cleaning for Spotify song ranking for country .csv files
#### We don't need the top 200 songs per day in a year
The "Spotify's Worldwide Daily Song Ranking" dataset from Kaggle is far larger than we need for our analysis and research purposes. With the top 200 songs per day for a year, each country .csv file should have around 365*200 (73,000) entries. We decided to remove the bottom 150 songs per day in each country's dataset, keeping only the top 50. <br>
It is possible that later we may decide to remove even more.
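A minimal sketch of this reduction on toy data is given below. It filters on the `Position` column rather than on row order, which is an assumption on our part: it relies on `Position` running from 1 to 200 within each day, as in the country files.

```python
import pandas as pd

# toy chart: 2 days x 200 positions, mimicking the Position and Date columns
days = pd.to_datetime(["2017-01-01", "2017-01-02"])
toy = pd.DataFrame({
    "Position": list(range(1, 201)) * len(days),
    "Date": [d for d in days for _ in range(200)],
})

# keep the top 50 positions of each day
top50 = toy[toy["Position"] <= 50]

print(len(toy), len(top50))  # 400 100
print(365 * 200, 365 * 50)   # 73000 18250
```

Filtering on `Position` does not depend on every day having exactly 200 rows, so it still works if some days are incomplete.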

In [119]:
#adding the countries we will work with based on the subset above in allcountries. We will use this to subset the first 50 songs below
countries_to_subset = [norway, denmark, iceland, switzerland, finland, 
                        netherlands, canada, newzealand, sweden, australia, costarica, austria, 
                        unitedstates, ireland, germany, belgium, luxembourg, unitedkingdom, 
                        chile, brazil, czechrepublic, argentina, mexico, singapore, uruguay, guatemala,
                        panama, france, spain, colombia, slovakia, malaysia, ecuador, elsalvador, poland, 
                        italy, japan,lithuania, latvia, bolivia, peru, estonia, turkey, paraguay, 
                        philippines, hungary, indonesia, dominicanrepublic, greece, portugal, honduras, taiwanprovinceofchina]

In [33]:
#fake_norway = norway.groupby(norway.index//200).head(50)

def first_fifty(dataframe):
    """
    Returns the first 50 observations in every 200 observations.
    There should only be 200 observations in 1 day, and there are 365 days per country in this data,
    which is the purpose of this function.
    
    Parameter dataframe: this is the country's dataframe which we will work with.
    Precondition: a pandas dataframe object with a 0, 1, 2, ... index
    """
    # integer division of the row index by 200 puts each day's 200 rows into one group
    return dataframe.groupby(dataframe.index//200).head(50)

In [117]:
#running each country in countries_to_subset through first_fifty and keeping the returned subsets
#(the original dataframes are left unchanged)
countries_top50 = [first_fifty(file) for file in countries_to_subset]

# ?????
#### Brief Explanation of Code Above
The goal of this step is to store each subsetted dataframe under a new name (the country's name with a 1 added to it), so as not to alter the original dataframe.

Ex: The top 50 of every 200 songs (one day) in the original Norway dataframe would be kept in the new dataframe Norway1.

In [35]:
# taiwan1.head()

#### We cannot use the data from 2018
We only have happiness rankings for the year 2017, but the Spotify rankings start on January 1, 2017 and stop on January 9, 2018. Though this is only 9 days in 2018, we cannot use this part of the dataset.
<br>
To maintain consistency across our datasets, the function below excludes all observations, or songs, from 2018 that were included in our dataframes.

In [125]:
def not18(dataframe):
    dataframe['Date'] = pd.to_datetime(dataframe['Date'])
    dataframe = dataframe[dataframe['Date'].dt.year != 2018]
    return dataframe

In [126]:
# FAILED CODE
# We wanted not18 to run as a procedure so that we could loop it through countries_to_subset. This did not work because
# not18 returns a new dataframe instead of modifying its argument, and the loop discarded the returned value,
# so we had to call not18() on every country dataframe and assign the result.

#for file in countries_to_subset: 
#    not18(file)
not18(norway)
## estelle's comments
## shouldn't there be much fewer than 72,400 rows? 365x200 is 73,000. did the first_fifty not work
## also, should we just call not18(norway)? why do we use an assignment statement norway2=not18(norway)
## also what is the unnamed column

Unnamed: 0.1,Unnamed: 0,Position,Track Name,Artist,Streams,URL,Date,Region
74195,370995,196,Sucker for You,Matt Terry,9267,https://open.spotify.com/track/4vkVvmjIiQibQ6z...,2018-01-09,no
74196,370996,197,HUMBLE. - SKRILLEX REMIX,Skrillex,9167,https://open.spotify.com/track/4d0vDZBKQFL4sVB...,2018-01-09,no
74197,370997,198,SUBEME LA RADIO,Enrique Iglesias,9160,https://open.spotify.com/track/7nKBxz47S9SD79N...,2018-01-09,no
74198,370998,199,Anyway,Tyron Hapi,9150,https://open.spotify.com/track/77ngtwBosDuAwiQ...,2018-01-09,no
74199,370999,200,Redbone,Childish Gambino,9046,https://open.spotify.com/track/3kxfsdsCpFgN412...,2018-01-09,no


# ???
### Example of the function not18 above
This applies `not18` to one of our country datasets (Norway). As the output of `norway2.tail()` further below shows, the result only contains songs from January 2017 through December 2017.

In [None]:
#running not18 on every country and storing each result under a new name
norway2 = not18(norway)
denmark2= not18(denmark)
iceland2 = not18(iceland)
switzerland2= not18(switzerland)
finland2 = not18(finland)
netherlands2= not18(netherlands)
canada2 = not18(canada)
newzealand2= not18(newzealand)
sweden2= not18(sweden)
australia2= not18(australia)
costarica2= not18(costarica)
austria2 = not18(austria)
unitedstates2= not18(unitedstates)
ireland2= not18(ireland)
germany2= not18(germany)
belgium2= not18(belgium)
luxembourg2= not18(luxembourg)
unitedkingdom2= not18(unitedkingdom)
chile2= not18(chile)
brazil2= not18(brazil)
czechrepublic2= not18(czechrepublic)
argentina2= not18(argentina)
mexico2= not18(mexico)
singapore2= not18(singapore)
uruguay2= not18(uruguay)
guatemala2= not18(guatemala)
panama2= not18(panama)
france2= not18(france)
spain2= not18(spain)
colombia2= not18(colombia)
slovakia2= not18(slovakia)
malaysia2= not18(malaysia)
ecuador2= not18(ecuador)
elsalvador2= not18(elsalvador)
poland2= not18(poland)
italy2= not18(italy)
japan2= not18(japan)
lithuania2= not18(lithuania)
latvia2= not18(latvia)
bolivia2= not18(bolivia)
peru2= not18(peru)
estonia2= not18(estonia)
turkey2= not18(turkey)
paraguay2= not18(paraguay)
philippines2= not18(philippines)
hungary2= not18(hungary)
indonesia2= not18(indonesia)
dominicanrepublic2= not18(dominicanrepublic)
greece2= not18(greece)
portugal2= not18(portugal)
honduras2= not18(honduras)
taiwan2= not18(taiwanprovinceofchina)
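A more compact alternative to the assignments above is sketched below, under the assumption that the country dataframes are kept in a dictionary keyed by name. The helper `keep_2017` and the dictionaries `toy_frames` and `frames_2017` are hypothetical names used only for illustration, and the helper keeps rows dated 2017 rather than dropping rows dated 2018.

```python
import pandas as pd

def keep_2017(dataframe):
    """Return only the rows dated in 2017, leaving the input untouched (hypothetical helper)."""
    dates = pd.to_datetime(dataframe["Date"])
    return dataframe[dates.dt.year == 2017]

# toy stand-ins for two country dataframes
toy_frames = {
    "norway": pd.DataFrame({"Date": ["2017-12-31", "2018-01-01"], "Position": [1, 1]}),
    "denmark": pd.DataFrame({"Date": ["2017-06-01", "2017-06-02"], "Position": [1, 1]}),
}

# the loop works here because each returned dataframe is stored, not discarded
frames_2017 = {name: keep_2017(df) for name, df in toy_frames.items()}

print({name: len(df) for name, df in frames_2017.items()})  # {'norway': 1, 'denmark': 2}
```

With this layout, `frames_2017["norway"]` would play the role of `norway2`.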

In [38]:
norway2.tail()

Unnamed: 0.1,Unnamed: 0,Position,Track Name,Artist,Streams,URL,Date,Region
72395,369195,196,Bade naken,Plumbo,12700,https://open.spotify.com/track/4vGVOcH8F0O0TkT...,2017-12-31,no
72396,369196,197,"Happy - From ""Despicable Me 2""",Pharrell Williams,12627,https://open.spotify.com/track/5b88tNINg4Q4nrR...,2017-12-31,no
72397,369197,198,Would You Ever,Skrillex,12616,https://open.spotify.com/track/57p8CBvPOxrvyCb...,2017-12-31,no
72398,369198,199,False Alarm,Matoma,12535,https://open.spotify.com/track/7gZQfdEQpmwAoPH...,2017-12-31,no
72399,369199,200,Driving Home For Christmas,Chris Rea,12505,https://open.spotify.com/track/0ZoHHROTzwIYeNA...,2017-12-31,no


### About.....
column descriptors

### Limitations
1) Given that we are dealing with a relatively small sample covering only 2017, albeit with 50 songs per day, our conclusions about what type of music a happy or unhappy country listens to may not be as accurate as we would like. We can say, however, that the culture of a country likely does not vary significantly from one year to the next, so the top genres and happiness scores may reflect the country's values to an extent.

2) Some countries were omitted because we could only use countries that appear in both the happiness index and our Spotify data. In order to draw some observations, we had to find overlap in songs. This may introduce bias, since some potentially happy countries with potentially significant relationships to music are overlooked entirely due to data availability. This mainly concerns the big-picture question of whether certain genres are more common in happy countries.

3) We are relying on the genre and danceability scores for specific songs provided by a dataset. It is possible that this dataset contains subjective information, so the genres and danceability scores may not match the labels that the people of each country would use. This metric of genre, further, may be interpreted as...

In [39]:
# unique = np.array(norway2['Track Name'].unique())
# p = [norway2[norway2['Track Name']==i].loc[:,'Streams'].mean() for i in unique]
# plt.plot(unique, p)
# plt.xlabel('Track Name')
# plt.ylabel('Streams')

In [40]:
for x in norway2:
    print(x)

Unnamed: 0
Position
Track Name
Artist
Streams
URL
Date
Region


In [41]:
def change(d):
    '''
    changes column names to lower case with underscores instead of spaces
    and drops rows with missing values; modifies d in place and returns nothing
    '''
    new_colnames = [x.lower().replace(' ', '_') for x in d.columns]
    d.columns= new_colnames 
    d.dropna(inplace=True)

In [113]:
filez= [norway2, denmark2, iceland2, switzerland2, finland2, 
                       netherlands2, canada2, newzealand2, sweden2, australia2, costarica2, austria2, 
                       unitedstates2, ireland2, germany2, belgium2, luxembourg2, unitedkingdom2, 
                       chile2, brazil2, czechrepublic2, argentina2, mexico2, singapore2, uruguay2, guatemala2,
                       panama2, france2, spain2, colombia2, slovakia2, malaysia2, ecuador2, elsalvador2, poland2, 
                       italy2, japan2,lithuania2, latvia2, bolivia2, peru2, estonia2, turkey2, paraguay2, 
                       philippines2, hungary2, indonesia2, dominicanrepublic2, greece2, portugal2, honduras2, taiwan2]

NameError: name 'taiwan2' is not defined

In [None]:
for file in filez:
    change(file)

In [None]:
# change(norway2)
belgium2.head()

In [None]:
plt.figure(figsize=(80,20))

#the happy dataframe uses the cleaned, lower-case column names
plt.scatter(happy['country'], happy['happiness_score'])
plt.xlabel('Country', fontsize=50)
plt.ylabel('Happiness Score', fontsize=50)
plt.title('Country Happiness Score 2017', fontsize=50)

In [None]:
happy.describe()