# Exam Question 2 Nadine Kanbier 4283724

## Question 2

###### Choose two genres from the song dataset from exercise 3.1 to examine and compare the gender bias in songs of both genres. Explain why the two genres you choose are relevant to compare in this context, and formulate a hypothesis. Train two word embeddings models (one for each genre), and use the lists of female and male words uploaded to BB (Assignments / exam 1 / question 2) for your analysis. Compare the biases between the two genres you choose using the method by Wevers. Interpret the results and relate them to your hypothesis.

##### Your answer must consist of the following:
##### • A statement on the relevance of your comparison and the hypothesis (ca.150 words)
##### • The complete code to answer the question with a short comment for every step (max 2 sentences per step)
##### • Interpretation and conclusion (ca. 200 words)

I will be comparing the gender bias in Pop music versus the gender bias in Country music. Country music is more popular in the rural and conservative states of the United States (see https://trends.google.com/trends/explore?date=all&geo=US&q=%2Fm%2F064t9,%2Fm%2F01lyv). Because rural and conservative states tend towards more 'traditional' gender roles (i.e. women more associated with family and men with work), I will specifically compare differences in family and work related words.
    
Pop music is more popular in the more liberal and progressive states. I expect a more modern approach towards women in pop music. That being said, I expect there will still be a bias towards women, only less strong than that in Country music. 

Therefore, it is hypothesized that there is a stronger bias in family-related words towards women in Country music than in Pop music. It is also hypothesized that there is a stronger bias in work-related words towards men in Country music than in Pop music.

In [7]:
# First, load the data with the correction path. 
import pandas as pd
import pickle

df = pd.read_csv('english_cleaned_lyrics.csv')

PATH_DF = '/Users/nadinekanbier/Desktop/Applied Data Science/Periode 2/Data Mining/Assignment/english_cleaned_lyrics.csv'
PATH_CORRECTION = '/Users/nadinekanbier/Desktop/Applied Data Science/Periode 2/Data Mining/Assignment/indx2newdate.p'

def load_dataset(data_path, path_correction):
    df = pd.read_csv(data_path)
    indx2newdate = pickle.load(open(PATH_CORRECTION, 'rb'))
    df['year'] = df['index'].apply(lambda x: int(indx2newdate[x][0][:4]) if indx2newdate[x][0] != '' else 0)
    return df[df.year > 1960][['song', 'year', 'artist', 'genre', 'lyrics']]
    
dataset = load_dataset(PATH_DF, PATH_CORRECTION)

In [9]:
# Split the datasets for the genres I want to compare: country and pop.
country = dataset.lyrics[dataset.genre == 'Country']
pop = dataset.lyrics[dataset.genre == 'Pop']

In [10]:
# tokenize the text
import spacy

nlp = spacy.load("en_core_web_sm")
processed_country = [text for text in nlp.pipe(country, 
                                              disable=["ner",
                                                       "parser"])]

tokenized_country = [[word.text for word in text if not word.is_punct] 
                    for text in processed_country]

processed_pop = [text for text in nlp.pipe(pop, 
                                              disable=["ner",
                                                       "parser"])]

tokenized_pop = [[word.text for word in text if not word.is_punct] 
                    for text in processed_pop]

In [11]:
# Training a word embedding for the country lyrics.
import gensim
from gensim.models import Word2Vec

SIZE = 300 # dimensions of the embeddings
SG = 1 # whether to use skip-gram or CBOW (we use skip-gram)
WINDOW = 10 # the window size
N_WORKERS = 1 # number of workers to use
MIN_COUNT = 1

model = Word2Vec(size=SIZE,
                sg=SG,
                window=WINDOW, 
                min_count=MIN_COUNT,
                workers=N_WORKERS)

model.build_vocab(tokenized_country)

model.train(tokenized_country,
           total_examples=model.corpus_count,
           epochs=model.epochs) 

(8126886, 10779570)

In [12]:
# Training a word embedding model for the pop lyrics.
import gensim
from gensim.models import Word2Vec

SIZE = 300 # dimensions of the embeddings
SG = 1 # whether to use skip-gram or CBOW (we use skip-gram)
WINDOW = 10 # the window size
N_WORKERS = 1 # number of workers to use
MIN_COUNT = 1

model2 = Word2Vec(size=SIZE,
                sg=SG,
                window=WINDOW, 
                min_count=MIN_COUNT,
                workers=N_WORKERS)

model2.build_vocab(tokenized_pop)

model2.train(tokenized_pop,
           total_examples=model.corpus_count,
           epochs=model.epochs) 

(22203431, 30293585)

In [17]:
# import list of male and female words.
import numpy as np

male = pd.read_pickle('male_words.p')
female = pd.read_pickle('female_words.p')

In [31]:
# checks if word is in vocabulary (i.e. has been seen by the model before)
male_words_country = [word for word in male if word in model.wv.vocab]
male_words_pop = [word for word in male if word in model2.wv.vocab]
female_words_country = [word for word in female if word in model.wv.vocab]
female_words_pop = [word for word in female if word in model2.wv.vocab]

In [37]:
# mean embedding for female words in pop
mean_embed_fem = np.mean([model2.wv[word] for word in female_words_pop], axis = 0)
print(mean_embed_fem.shape)

(300,)


In [38]:
# mean embedding for male words in pop
mean_embed_male = np.mean([model2.wv[word] for word in male_words_pop], axis = 0)
print(mean_embed_male.shape)

(300,)


In [39]:
# mean embedding for male words in country
mean_embed_male_c = np.mean([model.wv[word] for word in male_words_country], axis = 0)
print(mean_embed_male.shape)

(300,)


In [40]:
# mean embedding for female words in country
mean_embed_female_c = np.mean([model.wv[word] for word in female_words_country], axis = 0)
print(mean_embed_male.shape)

(300,)


In [41]:
# loading categorical words
cat_data = pd.read_csv("word_cats.csv")

In [143]:
categories = cat_data.columns

In [144]:
# calculating bias for the categories, trained with pop word embedding model
import pandas as pd
# Initialize objects for the loop
biases = {}
lst_bias = []
lst_meanbias = []
for word in categories:
    if word in model2.wv.vocab:
        dist_male = np.linalg.norm(np.subtract(model2.wv[word], mean_embed_male))
        dist_female = np.linalg.norm(np.subtract(model2.wv[word], mean_embed_fem))
        biases = {'word': word,
                 'bias': (dist_male - dist_female)}
        lst_bias.append(biases)
df_bias_pop_work = pd.DataFrame(lst_bias)
df_bias_pop_work.sort_values('bias', ascending = False)

Unnamed: 0,word,bias
2,family,0.050389
5,leisure,0.049098
0,affect,0.01981
3,body,0.011299
6,money,6.5e-05
4,work,-0.005069
1,social,-0.006579
7,occupation,-0.018106


In [145]:
# calculating bias for categories, trained with country word embedding model
import pandas as pd
# Initialize objects for the loop
biases = {}
lst_bias = []
lst_meanbias = []
for word in categories:
    if word in model.wv.vocab:
        dist_male = np.linalg.norm(np.subtract(model.wv[word], mean_embed_male_c))
        dist_female = np.linalg.norm(np.subtract(model.wv[word], mean_embed_female_c))
        biases = {'word': word,
                 'bias': (dist_male - dist_female)}
        lst_bias.append(biases)
df_bias_country_work = pd.DataFrame(lst_bias)
df_bias_country_work.sort_values('bias', ascending = False)

Unnamed: 0,word,bias
3,body,0.082413
2,family,0.025205
4,work,0.009881
6,money,0.000149
1,social,-0.009652
0,affect,-0.049788
7,occupation,-0.076094
5,leisure,-0.2023


It was hypothesized that there is a stronger bias in family-related words towards women in Country music than in Pop music (H1). It was also hypothesized that there is a stronger bias in work-related words towards men in Country music than in Pop music (H2).

H1 can be rejected: there is a stronger bias for family-related words in the model that was trained with Pop lyrics (0.05 vs. 0.03). This means that pop lyrics are more biased towards women when it comes to family-related words. 
H2 can also be rejected: there is a bias towards men in work-related words in the model that was trained with Pop music (-0.005). However, in Country music work-related words are biased towards women (0.01). 

In conclusion, Country music might be more progressive than we expected. According to our results, Country music is less biased towards women than Pop music when it comes to family and work related words. 

Overall, the number of categories trained with the Pop lyrics are more biased towards women than men compared to the Country music.