**What Makes an All Star?
Development of a machine learning model to predict which RuPaul's Drag Race contestants are most likely to be selected for an All Stars season, based on their performance in previous seasons.**

**Limitation:** For the purposes of this exploration, I use the term "selected" to describe a contestant who competed in a previous season and is competing in the target All Stars season. I am aware that not all contestants who are invited to All Stars accept, and that this information is not widely distributed, so it is impossible to obtain accurate data on which contestants were invited. "Selected for All Stars" in this context should therefore be read as shorthand for "selected and accepted an invitation to compete in All Stars".

Further limitations will be discussed at the end of the analysis.

**Retrieving input data**

Run the prep module to get contestant performance data and All Stars selection/performance data for All Stars Seasons 1-4, scraped from Wikipedia and cleaned for analysis.

In [1]:
import All_Stars_Data_Prep as all_stars_prep
model_data = all_stars_prep.get_all_stars_selection_model_data(range(1,5))

Load function to get Pearson correlations and p-values to explore which features might be best used in the model.

In [37]:
def get_pearson_correlations(model_data, dv, feature_columns):

    # Create output dataframe
    df = pd.DataFrame(columns=['Independent Variable', 'Correlation', 'P-Value'])

    # For each independent variable in the feature columns, get Pearson correlation/p-value and add to dataframe
    for iv in feature_columns:
        corr, pval = pearsonr(model_data[iv], model_data[dv])
        df.loc[df.shape[0] + 1] = [iv, corr, pval]

    return df
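
For reference, the coefficient that `pearsonr` reports can be computed by hand. The sketch below is a plain-Python illustration using made-up toy values rather than data from this analysis; `pearson_r` is a hypothetical helper name, and it does not compute the p-value. When the dependent variable is binary (as `Competed` is), the Pearson coefficient is the same quantity as the point-biserial correlation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (numerator)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (denominator terms)
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Toy example (illustrative values only): a count-style feature
# against a binary 0/1 outcome
toy_feature = [1, 2, 3, 4]
toy_outcome = [0, 0, 1, 1]
r = pearson_r(toy_feature, toy_outcome)
print(round(r, 4))
```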

**Exploratory analysis**

Load packages for data analysis.

In [6]:
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

***Correlations***

Explore correlations to see what independent variables are correlated with a contestant being selected for All Stars.

In [7]:
feature_cols = [ 'Win', 'High', 'Safe', 'Low', 'Bottom', 'Eliminated', 'Guest', 
                 'Season Winner', 'Season Runner-Up', 'Season Miss Congeniality',
                 'Total Appearances', 'Years Since Last Competed' ]
get_pearson_correlations(model_data, 'Competed', feature_cols)

| | Independent Variable | Correlation | P-Value |
|---|---|---|---|
| 1 | Win | 0.037493 | 0.468539 |
| 2 | High | 0.147827 | 0.004069 |
| 3 | Safe | 0.114581 | 0.026302 |
| 4 | Low | -0.006113 | 0.905957 |
| 5 | Bottom | 0.052854 | 0.306697 |
| 6 | Eliminated | -0.095364 | 0.064713 |
| 7 | Guest | -0.116583 | 0.023771 |
| 8 | Season Winner | -0.110695 | 0.03188 |
| 9 | Season Runner-Up | 0.102937 | 0.046077 |
| 10 | Season Miss Congeniality | 0.189553 | 0.000218 |


As a viewer of the show, this is quite an interesting list of correlations.

The independent variable with the strongest and most statistically significant correlation with selection for an All Stars season is whether the contestant was crowned Miss Congeniality in a main season. This may have something to do with the "fan favourite" nature of the award (as noted by Trinity the Tuck in the Season 9 reunion). Upon examination, only 3 of the 11 eligible Miss Congeniality winners (Ivy Winters, Cynthia Lee Fontaine, and Nina West) did not return for an All Stars season; this count excludes Heidi N Closet, who was awarded Miss Congeniality after the most recent All Stars cast had already been announced. Of these three, Cynthia Lee Fontaine was asked to return for a main season (Season 9), and Nina West has only been eligible for one selection round and may be selected for subsequent seasons. This correlation makes a lot of sense in the context of the show.

***Confusion matrix for subset of independent variables***

Create a list of feature names to be used in an initial model, selecting those with a p-value of less than 0.05.

In [9]:
feature_cols_selected = [ 'High', 'Safe', 'Guest', 'Season Winner', 'Season Runner-Up', 
                          'Season Miss Congeniality', 'Years Since Last Competed' ]
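
The same p-value filter can be applied programmatically instead of by eye. The sketch below is a plain-Python illustration that hard-codes the p-values from the rows displayed in the correlations table; the rows for 'Total Appearances' and 'Years Since Last Competed' are not displayed there, so they are left out here rather than guessed. The names `p_values`, `ALPHA`, and `selected` are assumptions for this example. With the dataframe returned by `get_pearson_correlations`, the equivalent filter would be a boolean mask on the 'P-Value' column.

```python
# (feature, p-value) pairs copied from the rows displayed in the correlations table
p_values = [
    ("Win", 0.468539),
    ("High", 0.004069),
    ("Safe", 0.026302),
    ("Low", 0.905957),
    ("Bottom", 0.306697),
    ("Eliminated", 0.064713),
    ("Guest", 0.023771),
    ("Season Winner", 0.03188),
    ("Season Runner-Up", 0.046077),
    ("Season Miss Congeniality", 0.000218),
]

ALPHA = 0.05  # significance threshold used in the text

# Keep only the features whose p-value falls below the threshold
selected = [name for name, p in p_values if p < ALPHA]
print(selected)
```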

Load function to get confusion matrix for a selected set of feature columns.

In [25]:
def get_confusion_matrix(model_data, feature_cols, dependent_variable, seed_val=0, test_size=0.25, solver='liblinear'):
    # Separate the feature columns from the target column
    X = model_data[feature_cols]
    y = model_data[dependent_variable]

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed_val)

    # Instantiate the model and fit it to the training set
    lr = LogisticRegression(solver=solver)
    lr.fit(X_train, y_train)

    # Predict on the test set and compare predictions with the actual classes
    y_pred = lr.predict(X_test)
    cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

    return cnf_matrix

Examine confusion matrix for the selected columns.

In [31]:
get_confusion_matrix(model_data, feature_cols_selected, 'Competed')

array([[84,  0],
       [10,  0]])

The model predicts an outcome of 0 (not selected) for every contestant in the test set, giving a recall of 0% for the class that was selected and 100% for the class that was not. This suggests that my dataset is imbalanced.
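
The recall figures can be read directly off the confusion matrix. The sketch below is a plain-Python illustration using the matrix output above, assuming scikit-learn's layout of actual classes in rows and predicted classes in columns; `recall_per_class` is a hypothetical helper name. It also computes overall accuracy, which stays high even though no selected contestant is identified.

```python
# Confusion matrix from the output above: rows are actual classes (0, 1),
# columns are predicted classes (0, 1)
cnf_matrix = [[84, 0],
              [10, 0]]

def recall_per_class(matrix):
    """Recall for each class: correct predictions divided by actual instances."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

recalls = recall_per_class(cnf_matrix)

# Accuracy: correct predictions (the diagonal) divided by all predictions
correct = sum(cnf_matrix[i][i] for i in range(len(cnf_matrix)))
accuracy = correct / sum(sum(row) for row in cnf_matrix)

print(recalls)
print(round(accuracy, 3))
```

An accuracy of 84 out of 94 looks respectable in isolation, which is why per-class recall is the more informative measure here.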

Check how many records belong to each output class to get the ratio of class instances.

In [36]:
model_data[model_data['Competed']==1].count()['Contestant']

39

In [35]:
model_data[model_data['Competed']==0].count()['Contestant']

337

The overall class ratio is 39:337. This is due to the nature of the data: from a pool of at least 46 contestants who could be selected for an All Stars season (rising to almost 130 for All Stars Season 4), only 10-12 are selected each time, so a complete dataset is inevitably imbalanced.

This dataset will need to be rebalanced, which I will try to achieve by over-sampling. I have decided to use over-sampling in this case as the overall number of records is less than 500 and under-sampling is best used where there is a large number of records (>10,000). For this, I will use the Synthetic Minority Over-sampling Technique (SMOTE).
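
To put numbers on the imbalance and on the two rebalancing options, the arithmetic can be sketched as follows. This is a plain-Python illustration using only the two class counts reported above; the variable names are assumptions, and "parity" (equal class sizes) is assumed as the rebalancing target.

```python
# Class counts reported above
selected_count = 39       # Competed == 1
not_selected_count = 337  # Competed == 0

total = selected_count + not_selected_count
minority_share = selected_count / total
imbalance_ratio = not_selected_count / selected_count

# Dataset size if the majority class is cut down to match the minority
undersampled_total = 2 * selected_count
# Dataset size if the minority class is topped up to match the majority
oversampled_total = 2 * not_selected_count
synthetic_needed = not_selected_count - selected_count

print(total, round(minority_share, 3), round(imbalance_ratio, 2))
print(undersampled_total, oversampled_total, synthetic_needed)
```

Under-sampling to parity would leave only 78 records to train and test on, which is the practical reason over-sampling is the more attractive option for a dataset of this size.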

**Rebalancing by oversampling**

In [1]:
import imblearn
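
The idea behind SMOTE is to create new minority-class records by interpolating between an existing minority record and one of its nearest minority-class neighbours. The sketch below is a simplified, dependency-free illustration of that idea on made-up toy points; it is not imblearn's implementation, and `smote_sketch` and its parameters are hypothetical names. In practice the imblearn package provides this through `imblearn.over_sampling.SMOTE` and its `fit_resample` method, and it is usually applied to the training split only so that the test set keeps the original class ratio.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between neighbours.

    Simplified illustration of the SMOTE idea, not imblearn's implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours at random
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment between them
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority-class points (illustrative values only, not contestant data)
toy_minority = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [2.0, 2.0]]
new_points = smote_sketch(toy_minority, n_synthetic=6, k=2)
print(len(new_points))
```

One caveat for this dataset: several features are binary flags (such as 'Season Winner'), and interpolation can produce fractional values for them, so the synthetic records should be checked before modelling.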

**Limitations**

**Ineligibility:** Some contestants are effectively ineligible for selection due to factors occurring outside of the show. This includes Sherry Pie, who was disqualified from Season 12 for conduct outside the show, and Pearl, who was allegedly informed by production that she would never be selected for an All Stars season as a result of comments that she made about RuPaul and the show after her run.

Other contestants are not ineligible for selection, but have stated that they would not be interested in competing in an All Stars season and have turned down invitations to do so. Some contestants, such as Trinity the Tuck, Shea Couleé, Valentina, and BenDeLaCreme, have stated that they turned down an earlier All Stars invitation in order to compete in a later season, while Phi Phi O'Hara allegedly turned down an All Stars Season 1 invitation due to scheduling conflicts. Kim Chi and Laganja Estranja have also said that they turned down an All Stars invitation, and neither has appeared in any subsequent All Stars season. Others have quit drag as a full-time career (Ivy Winters) or disavowed their drag personas entirely (Tyra Sanchez).

None of this data can be easily obtained via a web scrape. Any data of this kind would be incomplete and unreliable compared to the season data (challenge wins, etc.), as it is based on rumours and individual statements from the contestants, which cannot be verified because production does not comment on who has been invited to take part in All Stars. Additionally, it is possible that other contestants have been asked back for an All Stars season but have refrained from commenting on it publicly.

**Fan favourites:** Popularity with fans and success outside the show is usually a major factor in determining whether a contestant is selected for All Stars, even if the contestant does not perform outstandingly well by the episode metrics during their season (such as challenge wins). Examples include Trixie Mattel (who had a country music career and successful web series outside the show and was selected for, and subsequently won, All Stars Season 3) and Jasmine Masters (who had a short run on the show but found success on YouTube and was selected for All Stars 4).

If you refer to the correlations table at [7], you will notice that the variable with the strongest and most statistically significant correlation is whether or not a contestant won Miss Congeniality in their original season. Prior to Season 10, the Miss Congeniality prize was voted for by fans (notably, this was controversial in the cases of Pandora Boxx, BenDeLaCreme, Katya, and in particular Valentina, with Trinity the Tuck renaming the award "Fan Favourite" in the Season 9 reunion). Even after the vote for the Miss Congeniality award passed to the other contestants, the winner was consistently a fan favourite regardless.

However, it is difficult to measure popularity by an objective metric. Follower counts on Instagram/Twitter might make a good proxy, but I am not aware of a way to get follower counts as they stood prior to selection, and additional season appearances boost a contestant's following after airing. This is particularly true of contestants such as Manila Luzon (All Stars Season 4) and Shangela (All Stars Season 3), who had large gaps between their prior season and their All Stars appearance and experienced a resurgence in social media following after their All Stars season aired.