In [2]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (8, 8)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Alternative: ignore only R runtime warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [3]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

In [4]:
%%R

require('tidyverse')
source('functions.R')

R[write to console]: Loading required package: tidyverse



── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


R[write to console]: Loading required package: emojifont

R[write to console]: Loading required package: ggrepel

R[write to console]: Loading required package: ggtext

R[write to console]: Loading required package: ggforce



### My Hypothesis Question: 
**Do white Grammy nominees stand a higher chance of winning than nominees of color?**


In [5]:
# Import data
df = pd.read_csv("musicians-matched.csv")
df = df[['id','year','work','category','status','cleaned_musician','ethnicity','status/ethnicity','total_percentage']]
df.head()

| | id | year | work | category | status | cleaned_musician | ethnicity | status/ethnicity | total_percentage |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022 | We Are | album-of-the-year | winner | Jon Batiste | UR | UR winner | 0.666667 |
| 1 | 2 | 2022 | Love for Sale | album-of-the-year | nominee | Tony Bennett Lady Gaga | white | white nominee | 0.666667 |
| 2 | 3 | 2022 | Justice | album-of-the-year | nominee | Justin Bieber | white | white nominee | 0.666667 |
| 3 | 4 | 2022 | Planet Her | album-of-the-year | nominee | Doja Cat | UR | UR nominee | 0.666667 |
| 4 | 5 | 2022 | Happier Than Ever | album-of-the-year | nominee | Billie Eilish | white | white nominee | 0.666667 |


### 1. Exploratory Analysis

In [40]:
# Pivot table to look up the chance of winning by category and ethnicity

pivot = pd.pivot_table(df, values='id', 
                            index=['category','ethnicity'], 
                            columns='status', 
                            aggfunc='count',
                            fill_value=0)

pivot['total'] = pivot['nominee']+pivot['winner']
pivot['chance'] = pivot['winner']/pivot['total']

pivot.reset_index(inplace=True)
pivot

| | category | ethnicity | nominee | winner | total | chance |
|---|---|---|---|---|---|---|
| 0 | album-of-the-year | UR | 63 | 11 | 74 | 0.148649 |
| 1 | album-of-the-year | white | 92 | 23 | 115 | 0.2 |
| 2 | best-new-artist | UR | 70 | 13 | 83 | 0.156627 |
| 3 | best-new-artist | white | 85 | 21 | 106 | 0.198113 |
| 4 | record-of-the-year | UR | 79 | 11 | 90 | 0.122222 |
| 5 | record-of-the-year | white | 77 | 23 | 100 | 0.23 |


In [38]:
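# Chance of winning copied from the pivot table and rounded to two decimals,
# reshaped to one row per category
data = [['album-of-the-year',0.15,0.2],['best-new-artist',0.16,0.2],['record-of-the-year',0.12,0.23]]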

pivot_test = pd.DataFrame(data, columns=['category','UR','white'])
pivot_test

| | category | UR | white |
|---|---|---|---|
| 0 | album-of-the-year | 0.15 | 0.2 |
| 1 | best-new-artist | 0.16 | 0.2 |
| 2 | record-of-the-year | 0.12 | 0.23 |
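
The rates above are per category. As a rough sketch, the pooled chance of winning across all three categories can be worked out from the same nominee and winner counts in the pivot table. The dictionary layout and variable names here are my own for illustration, and the counts are typed in by hand rather than read from `df`.

```python
# (nominee count, winner count) per category and ethnicity, copied from the pivot table
counts = {
    ("album-of-the-year", "UR"): (63, 11),
    ("album-of-the-year", "white"): (92, 23),
    ("best-new-artist", "UR"): (70, 13),
    ("best-new-artist", "white"): (85, 21),
    ("record-of-the-year", "UR"): (79, 11),
    ("record-of-the-year", "white"): (77, 23),
}

# Add up winners and total nominations for each ethnicity across categories
winners = {"UR": 0, "white": 0}
totals = {"UR": 0, "white": 0}
for (category, ethnicity), (nominee, winner) in counts.items():
    winners[ethnicity] += winner
    totals[ethnicity] += nominee + winner

# Pooled chance of winning = winners / (nominees + winners)
pooled_chance = {eth: winners[eth] / totals[eth] for eth in winners}
for eth, chance in pooled_chance.items():
    print(f"{eth}: {winners[eth]} wins out of {totals[eth]} nominations ({chance:.3f})")
```

Pooling hides differences between categories, so these numbers are best read next to the per-category rates rather than in place of them.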


### 2. Journalistic —> Statistical Inquiry

- Null hypothesis: There is no difference between white and non-white nominees' chance of winning.
- Alternative hypothesis: White nominees on average stand a higher chance of winning than non-white nominees.

The statistical test I'm applying to my hypothesis is the chi-square test.

In [45]:
%%R -i df
# Chi-square test: chance of winning by ethnicity, pooled across all categories
to_test <- table(df$status, df$ethnicity)
print(to_test)
chisq.test(to_test, correct=FALSE)

         
           UR white
  nominee 212   254
  winner   35    67

	Pearson's Chi-squared test

data:  to_test
X-squared = 4.256, df = 1, p-value = 0.03911
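
As a cross-check on the R output above, here is a minimal Python sketch that computes the Pearson chi-square statistic by hand from the pooled 2x2 counts. The function name `chi_square_2x2` is my own, and the p-value shortcut through `math.erfc` only holds for one degree of freedom, which is the case for a 2x2 table.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if status and ethnicity were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: nominee, winner. Columns: UR, white.
stat, p_value = chi_square_2x2(212, 254, 35, 67)
print(f"X-squared = {stat:.3f}, p-value = {p_value:.4f}")
```

The statistic and p-value should agree with `chisq.test(..., correct=FALSE)` up to rounding. The test is two-sided: it measures any difference between the groups, not specifically a higher chance for white nominees.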



In [32]:
%%R -i df
# Chi-square test: chance of winning by ethnicity, Album of the Year only
album_df <- df%>%filter(category=="album-of-the-year")
album_test <- table(album_df$status, album_df$ethnicity)
print(album_test)
chisq.test(album_test, correct=FALSE)

         
          UR white
  nominee 63    92
  winner  11    23

	Pearson's Chi-squared test

data:  album_test
X-squared = 0.80479, df = 1, p-value = 0.3697



In [34]:
%%R -i df
# Chi-square test: chance of winning by ethnicity, Record of the Year only
record_df <- df%>%filter(category=="record-of-the-year")
record_test <- table(record_df$status, record_df$ethnicity)
print(record_test)
chisq.test(record_test, correct=FALSE)

         
          UR white
  nominee 79    77
  winner  11    23

	Pearson's Chi-squared test

data:  record_test
X-squared = 3.745, df = 1, p-value = 0.05297



In [35]:
%%R -i df
# Chi-square test: chance of winning by ethnicity, Best New Artist only
new_df <- df%>%filter(category=="best-new-artist")
new_test <- table(new_df$status, new_df$ethnicity)
print(new_test)
chisq.test(new_test, correct=FALSE)

         
          UR white
  nominee 70    85
  winner  13    21

	Pearson's Chi-squared test

data:  new_test
X-squared = 0.54307, df = 1, p-value = 0.4612



#### Conclusion:
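- The two sets of test results point in different directions. Pooled across all three categories, the p-value (0.039) is below the conventional 0.05 threshold, so the pooled data give some evidence against my null hypothesis. Within each category, the p-values (0.37 for Album of the Year, 0.053 for Record of the Year, 0.46 for Best New Artist) are not small enough to reject it.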
- For the individual categories, the observed difference in the chance of winning could be the result of chance alone.
- The summary stats, however, show that white nominees on average do stand a slightly higher chance of winning, both within each category and across categories.
- Some caveats about my data: my sample size is small (three categories, compared with all Grammy nominations across all categories).
- In addition, I did not separate the artists on tracks that feature both white and non-white artists. For example, "Old Town Road", a track by Lil Nas X (a Black artist) featuring Billy Ray Cyrus (who is white), is marked as "UR" in the dataset. The results might be different if I broke down the artists on these tracks.


I am also having trouble interpreting p-values that are > 0.05. Does a p-value larger than 0.05 mean that my hypothesis is false? Or does it suggest a possibility that the hypothesis might be true?
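
One way to get a feel for what such a p-value measures is to simulate the null hypothesis directly. The sketch below is a simplified permutation test, not the chi-square test used above: it takes the Record of the Year counts, hands out the 34 wins at random among the 190 nominations, and counts how often the split between groups is at least as lopsided as the one observed. The function name, the number of simulations, and the seed are my own choices.

```python
import random

def permutation_p_value(ur_total, white_total, ur_winners, white_winners,
                        n_sims=20000, seed=0):
    """Share of random win assignments at least as lopsided as the observed one."""
    rng = random.Random(seed)
    n = ur_total + white_total
    n_winners = ur_winners + white_winners
    # UR wins expected if ethnicity had no bearing on winning
    expected_ur = n_winners * ur_total / n
    observed_gap = abs(ur_winners - expected_ur)
    hits = 0
    for _ in range(n_sims):
        # Nominations 0 .. ur_total-1 stand for UR nominees, the rest for white nominees
        drawn = rng.sample(range(n), n_winners)
        sim_ur = sum(1 for i in drawn if i < ur_total)
        # Small tolerance so a gap exactly equal to the observed one is counted
        if abs(sim_ur - expected_ur) >= observed_gap - 1e-9:
            hits += 1
    return hits / n_sims

# Record of the Year: 90 UR nominations (11 wins), 100 white nominations (23 wins)
p_record = permutation_p_value(90, 100, 11, 23)
print(f"simulated p-value: {p_record:.3f}")
```

Read this way, the p-value is the share of "no real difference" worlds that still produce a gap this large. A value above 0.05 means the data are compatible with the null hypothesis at that threshold. It does not show that the null is true or that the alternative is false, only that this sample cannot distinguish between them.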

### 3. Statistical —> Journalistic Inquiry
Reporting Qs that my exploratory plots raised:
- Questions relating to the statistical tests (Qs for statisticians):
    - Am I running the appropriate statistical tests for my hypothesis?
    - Are there any issues in the way the ethnicity variable is classified?
    - Should I restructure or reclassify my data so that the dataset better represents the reality of diversity in Grammy nominations?

- Why is the chance of winning for UR nominees lowest in the Record of the Year category? What about other award categories?
- What is the Grammys' nomination and voting process like?
- What sorts of criteria are used to determine which nominee wins a Grammy? For instance, quantitative evaluations such as record sales and chart rankings (like the Billboard Hot 100), or qualitative evaluations such as critics' opinions and creative value?
- Have there been past accusations against the Grammys relating to racial diversity and representation? What kinds of claims did the accusers raise?
- Why did the Academy increase the number of nominees in each category in recent years?
- Has the Recording Academy ever addressed problems regarding racial diversity in its nomination or voting process? If so, did its acknowledgement and reforms have an effect on the racial demographics of nominees and winners?

Next steps:
1. Contact statisticians who have worked with award nomination data to go over my dataset for possible improvements to the data structure, as well as advice on the statistical tests used in my rough analysis. Potential source: the [Annenberg Inclusion Initiative at USC](https://assets.uscannenberg.org/docs/aii-inclusion-recording-studio-20200117.pdf).
2. Reach out to journalists who have covered this topic to learn about their findings on racial diversity at the Grammys.
3. Reach out to advocacy groups and activists working for diversity in the Recording Academy.
4. Reach out to the Recording Academy for detailed figures on the racial demographics of its general members and voting members.

*Appendix: My data studio project using this dataset: https://xinyitu.github.io/grammys/*