**What Makes an All Star?**

**Development of a machine learning model to predict which RuPaul's Drag Race contestants are most likely to be selected for an All Stars season, based on their performance in previous seasons.**

**Note:** For the purposes of this exploration, I use the term "selected" to describe a contestant who has competed in previous seasons and is competing in the target All Stars season. I am aware that not all contestants who are selected for All Stars accept, and that this information is not widely distributed, so it is impossible to obtain accurate data on which contestants are selected for All Stars. "Selected for All Stars" in this context should therefore be read as shorthand for "selected and accepted an invitation to compete in All Stars". This is further discussed in the "Reviewing sample output and refining the input parameters" section under "Ineligibility and declined invitations".

**Retrieving input data**

Run prep module to get data of contestant performance and All Stars selection/performance for All Stars Seasons 1-4 scraped from Wikipedia and cleaned for analysis. (The All_Stars_Data_Prep Python script can be found in the GitHub repository for this project.)

Additionally, set a seed value to be used throughout the procedure for splitting and bootstrapping data.

In [1]:
import All_Stars_Data_Prep as all_stars_prep
model_data = all_stars_prep.get_all_stars_selection_model_data(range(1,5))

seed_value = 0

Load function to get Pearson correlations and p-values to explore which features might be best used in the model.

In [2]:
def get_pearson_correlations (model_data, dv, feature_columns):

    # Create output dataframe
    df = pd.DataFrame(columns=['Independent Variable', 'Correlation', 'P-Value'])
    
    # For each independent variable in the feature columns, get Pearson correlation/p-value and add to dataframe
    for iv in feature_columns:
        [corr, pval] = pearsonr(model_data[iv], model_data[dv])
        new_row = [iv, corr, pval]
        df.loc[df.shape[0] + 1] = new_row
        
    # Sort by p-value
    df = df.sort_values(by=['P-Value'])

    return df
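For intuition about what `pearsonr` reports when the dependent variable is a 0/1 outcome such as `Competed`, the coefficient itself can be computed by hand. The following is a minimal sketch on invented toy values (not the project data); `pearson_r` is a hypothetical helper written for illustration and it does not compute the p-value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data invented for illustration: "High" placements and a 0/1 outcome
toy_high = [0, 1, 1, 2, 3, 4]
toy_competed = [0, 0, 0, 1, 0, 1]
toy_r = pearson_r(toy_high, toy_competed)
print(round(toy_r, 3))
```

A positive value means higher feature values tend to go with an outcome of 1, and a negative value means the opposite.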

**Exploratory analysis**

Load packages for data analysis.

In [3]:
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

***Correlations***

Explore correlations to see what independent variables are correlated with a contestant being selected for All Stars.

In [4]:
feature_cols = [ 'Win', 'High', 'Safe', 'Low', 'Bottom', 'Eliminated', 'Guest', 
                 'Season Winner', 'Season Runner-Up', 'Season Miss Congeniality',
                 'Total Appearances', 'Years Since Last Competed' ]

pcorr = get_pearson_correlations(model_data, 'Competed', feature_cols)
pcorr

|  | Independent Variable | Correlation | P-Value |
|---|---|---|---|
| 10 | Season Miss Congeniality | 0.18912 | 0.000235 |
| 2 | High | 0.148623 | 0.003969 |
| 12 | Years Since Last Competed | -0.144958 | 0.004972 |
| 7 | Guest | -0.120837 | 0.019407 |
| 8 | Season Winner | -0.111353 | 0.031322 |
| 3 | Safe | 0.110064 | 0.033347 |
| 9 | Season Runner-Up | 0.102309 | 0.048027 |
| 6 | Eliminated | -0.098907 | 0.055998 |
| 11 | Total Appearances | 0.091991 | 0.075594 |
| 5 | Bottom | 0.051252 | 0.322911 |


As a viewer of the show, this is quite an interesting list of correlations.

The independent variable with the strongest correlation and the lowest p-value for whether a contestant is selected to compete in an All Stars season is whether the contestant was crowned Miss Congeniality in a main season. This may have something to do with the "fan favourite" nature of the award (as noted by Trinity the Tuck in the Season 9 reunion). Upon examination, only 3 of the 11 eligible Miss Congeniality winners (Ivy Winters, Cynthia Lee Fontaine, and Nina West) did not return for an All Stars season; this count excludes Heidi N Closet, who was awarded Miss Congeniality after the most recent All Stars cast had already been announced. Of these three, Cynthia Lee Fontaine was asked to return for a main season (Season 9), and Nina West has only been eligible for one selection round and may be selected for subsequent seasons. This correlation makes a lot of sense in the context of the show.

***Confusion matrix for subset of independent variables***

Create a list of feature names to be used in an initial model, selecting those with a p-value of less than 0.05.

In [5]:
feature_cols_selected = list(pcorr.loc[pcorr['P-Value'] < 0.05]['Independent Variable'])
feature_cols_selected

['Season Miss Congeniality',
 'High',
 'Years Since Last Competed',
 'Guest',
 'Season Winner',
 'Safe',
 'Season Runner-Up']

Load function to get confusion matrix for a selected set of feature columns.

In [6]:
def get_confusion_matrix (model_data, feature_cols, dependent_variable, seed_val=0, test_size=0.25, solver='liblinear'):  
    
    # Split into independent and dependent variables
    X = model_data[feature_cols]
    y = model_data[dependent_variable]
    
    # Split into train and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed_val)
    
    # Instantiate the model
    lr = LogisticRegression(solver=solver)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
    
    return cnf_matrix

Examine confusion matrix for the selected columns.

In [7]:
confusion_matrix = get_confusion_matrix(model_data, feature_cols_selected, 'Competed')
confusion_matrix

array([[85,  0],
       [ 9,  0]])

Although the overall accuracy of the model is high, at 90.43% (85 of 94 test records), the recall for the two classes is 0% and 100% (for outcomes of "selected" and "not selected" respectively). The model is predicting an outcome of 0 (not selected) for every contestant in the test set.
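Accuracy and per-class recall follow directly from the four cells of a confusion matrix. The sketch below assumes scikit-learn's layout for a binary 0/1 outcome (rows are actual classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`); `summarise_confusion_matrix` is a hypothetical helper, not part of the project code.

```python
def summarise_confusion_matrix(cm):
    """Return accuracy and per-class recall for a 2x2 matrix laid out as
    [[TN, FP], [FN, TP]] (rows = actual class, columns = predicted class)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        # Share of actual "not selected" records predicted as not selected
        'recall_not_selected': tn / (tn + fp),
        # Share of actual "selected" records predicted as selected
        'recall_selected': tp / (tp + fn),
    }

# The matrix printed above
summary = summarise_confusion_matrix([[85, 0], [9, 0]])
for name, value in summary.items():
    print(f'{name}: {value:.2%}')
```

The helper also accepts the NumPy array returned by `metrics.confusion_matrix`, since unpacking works row by row.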

This behaviour indicates that my dataset is imbalanced. An imbalanced dataset is one where records with one output classification (the minority) are vastly outnumbered by records with another output classification (the majority), resulting in the model predicting the majority output every time.

Check how many records belong to each output class to get the ratio of class instances.

In [8]:
model_data[model_data['Competed']==1].count()['Contestant']

39

In [9]:
model_data[model_data['Competed']==0].count()['Contestant']

335

The overall class ratio is 39:335 - just over 10% of the records belong to the minority class (contestants selected for the All Stars season).

This is consistent with the nature of this dataset, since out of at least 46 contestants that could be selected for an All Stars season (up to almost 130 for All Stars Season 4), only 10-12 contestants are selected, making a complete dataset imbalanced.

This dataset will need to be rebalanced, which I will try to achieve by over-sampling. I have decided to use over-sampling in this case as the overall number of records is less than 500 and under-sampling is best used where there is a large number of records (>10,000). For this, I will use the random oversampling technique.
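Conceptually, random oversampling duplicates randomly chosen minority-class records until the classes are the same size. The following dependency-free sketch illustrates the idea on invented toy records; `random_oversample` is a hypothetical stand-in written for illustration and is not how imblearn implements it internally.

```python
import random

def random_oversample(rows, label_key, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every class
    matches the size of the largest class (for two classes, this balances the
    minority against the majority)."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    majority_size = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to make up the shortfall
        shortfall = majority_size - len(group)
        balanced.extend(rng.choice(group) for _ in range(shortfall))
    return balanced

# Toy records invented for illustration: 2 selected, 8 not selected
toy_rows = [{'Contestant': f'Queen {i}', 'Competed': 1 if i < 2 else 0} for i in range(10)]
balanced_rows = random_oversample(toy_rows, 'Competed')
print(len(balanced_rows))
```

Because the added records are exact copies, oversampling changes the class balance the model sees but adds no new information about the minority class.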

**Rebalancing by oversampling**

Import imblearn library and create function to get an oversampled dataframe.

In [10]:
from imblearn import over_sampling as imbsample

def get_oversampled_df (model_data, feature_columns, dependent_variable, sampling_strategy='minority', seed_val=0):

    # Get dependent and independent variables
    X = model_data[feature_columns]
    y = model_data[dependent_variable]
    
    # Create an oversample object and fit it to the data
    oversample_obj = imbsample.RandomOverSampler(sampling_strategy=sampling_strategy, random_state=seed_val)
    X_over, y_over = oversample_obj.fit_resample(X, y)
    
    # Combine variables into a single dataframe for output
    oversampled_df = X_over.copy()
    oversampled_df[dependent_variable] = y_over
    
    return oversampled_df

oversampled_df = get_oversampled_df(model_data, feature_cols, 'Competed')

Modify the confusion matrix function to allow for oversampling of training data only.

In [11]:
def get_confusion_matrix (model_data, feature_cols, dependent_variable, seed_val=0, test_size=0.25, 
                          solver='liblinear', oversample_training_data=False, sampling_strategy='minority'):  
    
    # Split into independent and dependent variables
    X = model_data[feature_cols]
    y = model_data[dependent_variable]
    
    # Split into train and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed_val)
    
    # Oversample training data if specified
    if oversample_training_data:
        oversample_obj = imbsample.RandomOverSampler(sampling_strategy=sampling_strategy, random_state=seed_val)
        X_train, y_train = oversample_obj.fit_resample(X_train, y_train)
    
    # Instantiate the model
    lr = LogisticRegression(solver=solver)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
    
    return cnf_matrix

***Correlations with oversampled data***

Re-run correlations with oversampled dataframe.

In [12]:
pcorr = get_pearson_correlations(oversampled_df, 'Competed', feature_cols)
pcorr

|  | Independent Variable | Correlation | P-Value |
|---|---|---|---|
| 12 | Years Since Last Competed | -0.285527 | 4.910271e-14 |
| 10 | Season Miss Congeniality | 0.26483 | 3.239433e-12 |
| 2 | High | 0.262642 | 4.941161e-12 |
| 8 | Season Winner | -0.23829 | 4.192707e-10 |
| 6 | Eliminated | -0.222549 | 5.781456e-09 |
| 7 | Guest | -0.190145 | 7.126181e-07 |
| 9 | Season Runner-Up | 0.17968 | 2.863899e-06 |
| 3 | Safe | 0.159656 | 3.30255e-05 |
| 11 | Total Appearances | 0.152788 | 7.163963e-05 |
| 5 | Bottom | 0.077942 | 0.0437189 |


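Here we can see that many more of the correlations are statistically significant. Bear in mind that random oversampling duplicates minority-class records, so part of the drop in p-values comes from the larger sample size rather than from new information. Once again, I'm going to select the feature columns where the p-value is less than 0.05 (5.0e-2) as a starting point and work on forward and backward subset selection from there.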

In [13]:
feature_cols_selected = list(pcorr.loc[pcorr['P-Value'] < 0.05]['Independent Variable'])
feature_cols_selected

['Years Since Last Competed',
 'Season Miss Congeniality',
 'High',
 'Season Winner',
 'Eliminated',
 'Guest',
 'Season Runner-Up',
 'Safe',
 'Total Appearances',
 'Bottom']

***Confusion matrix for subset of independent variables with oversampled data***

Examine confusion matrix with the new dataset.

In [14]:
confusion_matrix = get_confusion_matrix(model_data, feature_cols_selected, 'Competed', oversample_training_data=True)
confusion_matrix

array([[72, 13],
       [ 3,  6]])

The overall accuracy with the oversampled training data is 82.98% (78 of 94), with a recall of 66.67% (6 of 9) for the group that were selected to compete and 84.71% (72 of 85) for the group that were not selected. While the overall accuracy and the recall for the non-competing class have dropped, the recall for the class selected to compete in All Stars has increased dramatically.

**Retrieve selected model as an object**

Adjust the get_confusion_matrix() function to create a function returning the model itself rather than the confusion matrix.

In [15]:
def get_model (model_data, feature_cols, dependent_variable, seed_val=0, test_size=0.25, 
               solver='liblinear', oversample_training_data=False, sampling_strategy='minority'):  
    
    # Split into independent and dependent variables
    X = model_data[feature_cols]
    y = model_data[dependent_variable]
    
    # Split into train and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed_val)
    
    # Oversample training data if specified
    if oversample_training_data:
        oversample_obj = imbsample.RandomOverSampler(sampling_strategy=sampling_strategy, random_state=seed_val)
        X_train, y_train = oversample_obj.fit_resample(X_train, y_train)
    
    # Instantiate the model
    lr = LogisticRegression(solver=solver)
    lr.fit(X_train, y_train)
    
    return lr

Retrieve model object.

In [16]:
model = get_model(model_data, feature_cols_selected, 'Competed', oversample_training_data=True)

**Make predictions about the contestants selected for All Stars Season 5**

Get model data for All Stars Season 5 only, then manually add in which contestants have been selected (the data is not yet on Wikipedia in a form that can be retrieved by Wikipedia_Web_Scrape).

In [17]:
model_data_season_5 = all_stars_prep.get_all_stars_selection_model_data(5)

Get predicted values for All Stars Season 5 model data and print the names of contestants predicted to be contestants in All Stars Season 5.

In [18]:
predictions = model.predict(model_data_season_5[feature_cols_selected])

predicted_contestants = list(model_data_season_5.loc[predictions==1]['Contestant'])
all_stars_5_selected = ['Alexis Mateo', 'Blair St. Clair', 'Derrick Barry', 'India Ferrah', 'Jujubee',
                        'Mariah Balenciaga', 'Mayhem Miller', 'Miz Cracker', 'Ongina', 'Shea Coulee']

predicted_contestants_correct = [name for name in predicted_contestants if name in all_stars_5_selected]
predicted_contestants_incorrect = [name for name in predicted_contestants if name not in all_stars_5_selected]

List of correct predictions:

In [19]:
predicted_contestants_correct

['Shea Coulee', 'Miz Cracker', 'Blair St. Clair', 'Jujubee']

List of incorrect predictions:

In [20]:
predicted_contestants_incorrect

['Willam',
 'Ivy Winters',
 'Darienne Lake',
 'Joslyn Fox',
 'Trinity K. Bonet',
 'Pearl',
 'Jaidynn Diore Fierce',
 'Acid Betty',
 'Peppermint',
 'Trinity the Tuck',
 'Alexis Michelle',
 "Nina Bo'nina Brown",
 'Eureka',
 'Kameron Michaels',
 "Asia O'Hara",
 'Brooke Lynn Hytes',
 "A'Keria C. Davenport",
 'Silky Nutmeg Ganache',
 'Nina West',
 'Shuga Cain',
 'Plastique Tiara',
 "Ra'Jah O'Hara",
 'Yara Sofia',
 'Katya',
 'Roxxxy Andrews',
 "Phi Phi O'Hara",
 'Kennedy Davenport',
 'BeBe Zahara Benet',
 'BenDeLaCreme',
 'Monique Heart',
 'Naomi Smalls',
 'Valentina']

While this looks like a long list of incorrect predictions and only a few correct ones, it actually reveals some gaps in my independent variables that might be useful for refining the model.

**Reviewing sample output and refining the input parameters**

Reviewing the list of incorrect predictions, there are a few features noticeable to a regular viewer of the show that could inform how the model is further developed.

***Prior All Stars competitors***

The first noticeable feature of the incorrect predictions is that a high proportion of them had already competed in All Stars (notably Trinity the Tuck, who won the prior All Stars season). The reason that I had left previous All Stars competitors as possible competitors in future All Stars seasons is that it has happened in the past - Manila Luzon, Latrice Royale, Jujubee, and Alexis Mateo all competed in All Stars Season 1 and then again in either All Stars Season 4 or All Stars Season 5.

The problem with this from a data perspective is that it biases the model heavily in favour of previous All Stars competitors, since it is their metrics that inform the original model. Additionally, they will have more appearances and placements by default, as a result of appearing on an additional season, and these tend to be correlated with selection. As a result, previous All Stars competitors are massively over-represented in the predicted competitors.

The adjustment I will be making here is to make a contestant ineligible for future seasons if they have previously competed in any All Stars season, excluding All Stars Season 1. This is because all of the re-selected All Stars competitors so far are from All Stars Season 1, which was widely considered to be a particularly poor season (mainly because of the convoluted teams format), so the re-selection of these contestants could be seen as compensation for their participation in All Stars Season 1. No participants from All Stars Seasons 2-4 have been re-selected for another All Stars season, so this makes sense for now, although it may have to be revisited if this changes in the future.
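The rule described above can be expressed as a small predicate. This is a sketch only: `is_eligible` and its inputs (a list of the All Stars season numbers a contestant has already competed in, plus the target season number) are hypothetical and do not reflect how the prep module stores this information.

```python
def is_eligible(prior_all_stars_seasons, target_season):
    """True unless the contestant has already competed in an earlier
    All Stars season other than Season 1."""
    return not any(
        season != 1 and season < target_season
        for season in prior_all_stars_seasons
    )

# No prior All Stars appearance: eligible
print(is_eligible([], 5))
# Competed only in All Stars Season 1: still eligible
print(is_eligible([1], 5))
# Competed in All Stars Season 4: ineligible for Season 5
print(is_eligible([4], 5))
```

Keeping the Season 1 exception as a single condition makes the rule easy to revisit if contestants from later All Stars seasons are re-selected in future.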

***Ineligibility and declined invitations***

There are a number of reasons that a contestant may find themselves effectively ineligible for an All Stars invitation:
- The contestant was disqualified for conduct outside the show (Sherry Pie, Season 12)
- The contestant has had disagreements with the show/RuPaul (Pearl, Season 7; Willam, Season 4)
- The contestant has since left drag as a full-time/professional career (Ivy Winters, Season 5)
- The contestant has since left drag entirely (Tyra Sanchez, Season 2)
- The contestant is now deceased (Sahara Davenport, Season 2)
- The contestant is now judging and/or hosting a Drag Race spin-off (Brooke Lynn Hytes, Season 11, judge/host of Drag Race Canada)

A participant might also receive an invitation for an All Stars season and then decline it. All Stars contestants Trinity the Tuck, Shea Coulee, Valentina, and Ben Delacreme have reportedly stated that they turned down an earlier All Stars invitation in order to compete on a later season, while Phi Phi O'Hara allegedly turned down an All Stars Season 1 invitation due to scheduling conflicts. Kim Chi and Laganja Estranja have also both said that they previously turned down an All Stars invitation. In addition, Ben Delacreme and Adore Delano both voluntarily left the first All Stars seasons they competed in, likely lowering their chances of appearing in a subsequent season.

None of this data can be easily obtained via a web scrape, and any such data would be incomplete and unreliable compared to the season data (challenge wins, etc.), as it is based on rumours and individual statements from the contestants. These can't be verified, as production does not comment on who they have invited to take part in All Stars. It is also possible that other contestants have been asked back for an All Stars season but have refrained from commenting on it publicly.

As a metric, this is a difficult one to compile: the reasons are varied and would be impossible to web-scrape without an overly complex algorithm that is beyond the scope of this project. For those reasons, I had initially planned not to consider them in my analysis. However, this is clearly having an impact on the output, so I now think it is better to filter out contestants known to be effectively ineligible, even though the underlying data will be messy and incomplete.
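One way such a filter could work is to keep a manually compiled set of (contestant, All Stars season) pairs and drop matching records before modelling. The sketch below uses placeholder names and seasons invented for illustration; `remove_ineligible` and the `'All Stars Season'` key are assumptions, not the actual column names or functions in the project.

```python
def remove_ineligible(records, ineligible_pairs):
    """Drop records whose (contestant, All Stars season) pair is flagged as ineligible."""
    return [
        record for record in records
        if (record['Contestant'], record['All Stars Season']) not in ineligible_pairs
    ]

# Placeholder names and seasons, invented purely for illustration
ineligible_pairs = {('Contestant A', 5), ('Contestant B', 4), ('Contestant B', 5)}
records = [
    {'Contestant': 'Contestant A', 'All Stars Season': 4},
    {'Contestant': 'Contestant A', 'All Stars Season': 5},
    {'Contestant': 'Contestant B', 'All Stars Season': 5},
    {'Contestant': 'Contestant C', 'All Stars Season': 5},
]
eligible_records = remove_ineligible(records, ineligible_pairs)
print([(r['Contestant'], r['All Stars Season']) for r in eligible_records])
```

Flagging per season rather than per contestant allows someone to remain in the data for seasons that took place before they became ineligible.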

***Popularity***

Popularity with fans and success outside the show is usually a major factor in determining whether a contestant is selected for All Stars, even if the contestant does not perform outstandingly well by the episode metrics during their season (such as challenge wins). Examples include Trixie Mattel (who had a country music career and successful web series outside the show and was selected for, and subsequently won, All Stars Season 3) and Jasmine Masters (who had a short run on the show but found success on YouTube and was selected for All Stars 4).

If you refer to the Pearson correlation tables, you will notice that whether or not a contestant won Miss Congeniality in their original season has the strongest correlation and the lowest p-value in the untreated dataset, and is second only to Years Since Last Competed in the over-sampled dataset. Prior to Season 10, the Miss Congeniality prize was voted for by fans, and this was notably controversial, with Trinity the Tuck renaming the award "Fan Favourite" in the Season 9 reunion. This incident brought about discussions that the award was more of a "fan favourite" prize in the cases of Pandora Boxx, Ben Delacreme, and Katya, where some felt that the Miss Congeniality winners were the most beloved by fans, even if there were more "congenial" contestants in their respective seasons. However, even after the award began to be voted for by the other contestants, the season's Miss Congeniality was consistently one of the season's fan favourites regardless (Monet X Change, Nina West, and Heidi N Closet).

However, it is difficult to measure popularity by an objective metric. Follower counts on Instagram/Twitter might make a good proxy, but I am not aware of a way to get follower counts as they stood prior to selection, and additional season appearances will boost a contestant's following after airing. This is particularly true of contestants such as Manila Luzon (All Stars Season 4) and Shangela (All Stars Season 3), who had large gaps between their prior season and All Stars season appearances and experienced a surge of engagement with their social media after the airing of their respective All Stars seasons.

The best metric for which historic data can be obtained that might serve as a proxy for popularity is probably Google Trends, which will show how many times a contestant's drag name has been searched prior to All Stars selection. This will involve creating some sort of web scraper to download the data for each contestant's normalized drag name.

I will then take the highest score prior to a cut-off date specified for each All Stars season. Other approaches would be either an "area under the curve" approach, summing the scores prior to the cut-off, or averaging the scores between the contestant's first appearance and the cut-off date. However, these might produce results that are biased for or against contestants depending on how long it has been since their first appearance. Taking the single highest Google Trends score for each contestant will not only be easier to implement, but should also give a more reliable measure. Where contestants go by multiple names (e.g. Trinity the Tuck/Trinity Taylor), the highest score across the names will be used.
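The "highest score before the cut-off, across all of a contestant's names" rule can be sketched as follows. The dates and scores are invented for illustration, and `peak_trend_score` and its input shape (a dictionary mapping each name to a list of `(date, score)` pairs) are assumptions about how downloaded Google Trends data might be held, not an existing API.

```python
from datetime import date

def peak_trend_score(series_by_name, cutoff):
    """Highest score observed strictly before `cutoff` across all of a
    contestant's names; None if there are no observations before the cut-off."""
    scores = [
        score
        for series in series_by_name.values()
        for day, score in series
        if day < cutoff
    ]
    return max(scores) if scores else None

# Invented values for illustration only
toy_series = {
    'Trinity the Tuck': [(date(2018, 1, 7), 40), (date(2018, 6, 3), 55), (date(2019, 2, 3), 100)],
    'Trinity Taylor': [(date(2017, 4, 2), 72), (date(2018, 1, 7), 30)],
}
peak = peak_trend_score(toy_series, cutoff=date(2018, 9, 1))
print(peak)
```

Google Trends reports interest relative to the peak within each query, so scores downloaded in separate queries may need a shared reference term before they can be compared between contestants; this sketch does not handle that step.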

Another benefit of using Google Trends as opposed to social media followings is that it will prevent bias against controversial contestants, such as Roxxxy Andrews and Phi Phi O'Hara. These contestants were notably unpopular in their initial seasons, but were invited back to All Stars for a "redemption" story arc. Google Trends should measure contestants in terms of both popularity and controversy, both of which appear to be factors in being invited back for an All Stars season, whereas simply looking at social media metrics might measure only popularity.

***Conclusions and actions to take from sample output***

- There appears to be a bias towards competitors of All Stars Seasons 2-4, despite none of these competitors appearing in subsequent seasons, so I will be removing these contestants from consideration for All Stars seasons that occur after their first All Stars appearance.
- I will manually create a file of ineligible contestants for All Stars selection and create a function to remove these contestants' records from the seasons for which they are ineligible.
- I will build another web scraper to download Google Trends data for contestants and another function to retrieve the highest Google Trends score for each contestant and add it as a parameter.