### Adding attributes and displaying descriptive statistics

In this notebook, the countries are grouped into their respective continents. The interaction variable Popularity x Gender is also added, after recoding the gender categories as 0, 1 and 2. Finally, the descriptive statistics are shown.

In [1]:
import pandas as pd

In [2]:
df_artist = pd.read_csv("Nodelist_Final3.csv")
df_artist.head()

Unnamed: 0,Artist,Gender,Country,Popularity
0,Martin Garrix,male,Netherlands,74
1,David Guetta,male,France,85
2,Dimitri Vegas & Like Mike,male,Belgium,67
3,24hrs,male,United States,49
4,2WEI,male,Germany,60


### Changing gender categories

The gender categories are recoded as numbers (0 for mixed, 1 for male and 2 for female) to get more meaningful ERGM results.

In [3]:
# Map each gender label to its numeric code (0 = mixed, 1 = male, 2 = female)
df_artist["Gender_numeric"] = df_artist["Gender"].map({"mixed": 0, "male": 1, "female": 2})
df_artist.head()

Unnamed: 0,Artist,Gender,Country,Popularity,Gender_numeric
0,Martin Garrix,male,Netherlands,74,1
1,David Guetta,male,France,85,1
2,Dimitri Vegas & Like Mike,male,Belgium,67,1
3,24hrs,male,United States,49,1
4,2WEI,male,Germany,60,1
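
A quick sanity check can confirm that every gender label received a code. The following is a minimal sketch on a small made-up frame: the rows, the `"unknown"` label and the name `gender_codes` are illustrative assumptions, not part of the real node list. It uses `Series.map`, which turns any label missing from the dictionary into NaN so that gaps are easy to spot.

```python
import pandas as pd

# Toy data standing in for the node list (illustrative, not the real file)
toy = pd.DataFrame({
    "Artist": ["A", "B", "C", "D"],
    "Gender": ["male", "female", "mixed", "unknown"],
})

# Same coding as in the notebook: 0 = mixed, 1 = male, 2 = female
gender_codes = {"mixed": 0, "male": 1, "female": 2}

# map() returns NaN for any label that is missing from the dictionary
toy["Gender_numeric"] = toy["Gender"].map(gender_codes)

# Collect the labels that were not covered by the mapping
unmapped = toy.loc[toy["Gender_numeric"].isna(), "Gender"].tolist()
print(unmapped)
```

An empty list means that all labels were recoded; here the hypothetical `"unknown"` label is the only one reported.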


### Adding continents

All countries are assigned to their respective continents. The continents used are the geographical continents. Russia is assigned to Europe, as 75% of Russia's population lives in the European part of the country. The duo AREA21, whose members come from the Netherlands and the US, is also assigned to Europe, as the Dutch member has far more monthly listeners on Spotify (21 million) than his American partner (2 million).

In [4]:
# Add a Continent column that starts as a copy of Country; the country names are replaced by continents below
df_artist["Continent"] = df_artist["Country"]

In [5]:
europe_lst = ["Netherlands", "France", "Belgium", "Germany", "United Kingdom & Finland", "Norway", "United Kingdom",
              "Italy", "Romania", "Lithuania", "Czechia", "Slovakia", "Denmark", "Poland", "Ireland", "Spain", "Estonia",
              "Sweden", "Iceland", "Hungary", "Russia", "Netherlands and United States"]
north_amer_lst = ["United States", "Puerto Rico", "Mexico", "Canada", "Barbados", "Cuba"]
south_amer_lst = ["Brazil", "Colombia", "Peru", "Argentina", "Ecuador"]
africa_lst = ["Egypt", "South Africa"]
oceania_lst = ["Australia"]
asia_lst = ["Thailand", "Japan"]

# Replace each country name with its continent. The result is assigned back to the column,
# because replace(inplace=True) on a column selection triggers warnings in recent pandas versions.
continent_lists = {"Europe": europe_lst, "North America": north_amer_lst, "South America": south_amer_lst,
                   "Africa": africa_lst, "Oceania": oceania_lst, "Asia": asia_lst}
for continent, countries in continent_lists.items():
    df_artist["Continent"] = df_artist["Continent"].replace(dict.fromkeys(countries, continent))

In [6]:
df_artist.head()

Unnamed: 0,Artist,Gender,Country,Popularity,Gender_numeric,Continent
0,Martin Garrix,male,Netherlands,74,1,Europe
1,David Guetta,male,France,85,1,Europe
2,Dimitri Vegas & Like Mike,male,Belgium,67,1,Europe
3,24hrs,male,United States,49,1,North America
4,2WEI,male,Germany,60,1,Europe
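
Because the continent column starts as a copy of the country column, any country that is missing from the lists silently keeps its country name. The sketch below shows one way to detect such leftovers. The toy rows, the shortened lookup `country_to_continent` and the placeholder country `"Atlantis"` are assumptions made for illustration only.

```python
import pandas as pd

# Toy data (illustrative); "Atlantis" stands in for a country missing from the lookup
toy = pd.DataFrame({"Country": ["Netherlands", "Brazil", "Japan", "Atlantis"]})

# Hypothetical, shortened lookup; the lists in the notebook are longer
country_to_continent = {
    "Netherlands": "Europe",
    "Brazil": "South America",
    "Japan": "Asia",
}

# Countries found in the lookup are replaced, all others keep their original value
toy["Continent"] = toy["Country"].replace(country_to_continent)

# Any value that is not a known continent name was never mapped
known_continents = set(country_to_continent.values())
leftover = toy.loc[~toy["Continent"].isin(known_continents), "Country"].tolist()
print(leftover)
```

If `leftover` is empty, every country was assigned to a continent.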


### Popularity x Gender column

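A column with Popularity multiplied by the numeric gender code is added. Female artists are underrepresented in this dataset, so this column gives them more weight, which makes the ERGM output easier to compare. Note that with this coding the column equals the popularity for male artists, twice the popularity for female artists, and 0 for mixed acts.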

In [7]:
#Popularity X gender column
df_artist["Popularity_gender"] = df_artist["Gender_numeric"] * df_artist["Popularity"]
df_artist.head()

Unnamed: 0,Artist,Gender,Country,Popularity,Gender_numeric,Continent,Popularity_gender
0,Martin Garrix,male,Netherlands,74,1,Europe,74
1,David Guetta,male,France,85,1,Europe,85
2,Dimitri Vegas & Like Mike,male,Belgium,67,1,Europe,67
3,24hrs,male,United States,49,1,North America,49
4,2WEI,male,Germany,60,1,Europe,60
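
To see what the gender coding does to the interaction column, the sketch below applies the same multiplication to three made-up artists with identical popularity. The values are illustrative assumptions and do not come from the node list.

```python
import pandas as pd

# Three hypothetical artists with the same popularity but different gender codes
toy = pd.DataFrame({
    "Gender_numeric": [0, 1, 2],  # 0 = mixed, 1 = male, 2 = female
    "Popularity": [60, 60, 60],
})

# Same operation as in the notebook: gender code times popularity
toy["Popularity_gender"] = toy["Gender_numeric"] * toy["Popularity"]
print(toy)
```

With equal popularity, the mixed act ends up at 0, the male artist at 60 and the female artist at 120, so the weight of an artist in this column depends entirely on the chosen coding.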


In [8]:
df_artist.to_csv("Nodelist_Final3_adjusted.csv", index = False)

### Descriptive statistics

In [9]:
df_artist[["Artist", "Continent"]].groupby(by = "Continent").count()

Continent,Artist
Africa,6
Asia,4
Europe,187
North America,170
Oceania,4
South America,13


In [10]:
df_artist["Popularity"].describe()

count    384.000000
mean      42.375000
std       22.918404
min        0.000000
25%       26.000000
50%       43.000000
75%       59.000000
max       96.000000
Name: Popularity, dtype: float64

In [11]:
df_artist[["Artist", "Gender"]].groupby(by = "Gender").count()

Gender,Artist
female,49
male,316
mixed,19
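
Expressing these counts as percentages makes the imbalance between the groups easier to read. The sketch below recomputes the shares from the three counts shown above; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the groupby output above
gender_counts = pd.Series({"female": 49, "male": 316, "mixed": 19}, name="Artist")

# Share of each group as a percentage of all artists, rounded to one decimal
shares = (gender_counts / gender_counts.sum() * 100).round(1)
print(shares)
```

On the full data frame, `df_artist["Gender"].value_counts(normalize=True)` should give the same proportions directly.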


In [12]:
df_artist["Popularity_gender"].describe()

count    384.000000
mean      46.257812
std       32.203720
min        0.000000
25%       23.000000
50%       45.000000
75%       63.250000
max      176.000000
Name: Popularity_gender, dtype: float64