<a href="https://colab.research.google.com/github/AGR-Yes/RuPauls-Drag-Race-Winner-Prediction/blob/main/RPDR_Winner_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RuPaul's Drag Race Winner Prediction

**Author:** Anton Reyes

## **Introduction**

### **Requirements and Imports**

#### **Imports**

**Basic Libraries**

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis



In [5]:
import numpy as np
import pandas as pd

**Visualization Libraries**

* `matplotlib.pyplot` contains functions to create static and interactive plots
* `seaborn` is a library based on matplotlib that allows for data visualization
* `plotly` is a library for creating interactive plots; `plotly.express` is its high-level interface

In [6]:
import matplotlib.pyplot as plt
#import seaborn as sns
import plotly.express as px
#from jupyter_dash import JupyterDash

import dash
import dash_bootstrap_components as dbc
import dash_core_components as dcc
import dash_html_components as html


**Natural Language Processing Libraries**
* `re` is a module that allows the use of regular expressions
* `nltk` provides functions for processing text data
* `Counter` is from Python's collections module, which is helpful for counting tokens
* `string` contains functions for string operations

In [7]:
import re
#import nltk
#from nltk.stem import WordNetLemmatizer
#from nltk.corpus import stopwords
#from nltk.tokenize import RegexpTokenizer
#from nltk.probability import FreqDist
#from nltk.stem import WordNetLemmatizer
#from collections import Counter
import string


**Google Drive**
* `google.colab` is a library that allows the Colab notebook to mount Google Drive

In [8]:
#from google.colab import drive
#drive.mount('/content/drive')

#### **Datasets and Files**

The following `csv` file was used for this project:

- `RPDR Database_winner.csv` contains all the winners of the Drag Race franchise as well as the placements of each winner in certain maxi (major) challenges. This dataset also contains the final four of some Drag Race franchises as of March 31, 2023.

## **Data Collection**

Importing the dataset

In [9]:
url = "https://raw.githubusercontent.com/AGR-Yes/RuPauls-Drag-Race-Winner-Prediction/main/RPDR_Winners_2023.csv?token=GHSAT0AAAAAACAXZ3YRNZFMQSK64PVJ7FQIZBKZJAQ"

df = pd.read_csv(url)
df.head()

Unnamed: 0,Placement,Country,Season,Queen,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical,Unnamed: 11,Unnamed: 12
0,winner,AUS,1,Kita Mean,SAFE,HIGH,,WIN,,HIGH,,,4
1,winner,AUS,2,Spankie Jackzon,BTM,HIGH,,HIGH,WIN,WIN,,,5
2,winner,CAN,1,Priyanka,SAFE,BTM,SAFE,WIN,HIGH,WIN,,,6
3,winner,CAN,2,Icesis Couture,WIN,SAFE,WIN,BTM,SAFE,HIGH,BTM,ALL 7,7
4,winner,CAN,3,Gisele Lullaby,WIN,WIN,HIGH,,,,BTM,,4


## **Description of the Dataset**

Here, we find the shape of the dataset.

In [10]:
df.shape

(51, 13)

The `info` summary of the dataframe lists the `non-null` count of each column. Several challenge columns have fewer than 51 `non-null` entries, which means they contain missing values.

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Placement    51 non-null     object
 1   Country      51 non-null     object
 2   Season       51 non-null     int64 
 3   Queen        51 non-null     object
 4   Design       36 non-null     object
 5   Snatch Game  50 non-null     object
 6   Ball         36 non-null     object
 7   Makeover     41 non-null     object
 8   Acting       40 non-null     object
 9   Girl Groups  25 non-null     object
 10  Rusical      28 non-null     object
 11  Unnamed: 11  7 non-null      object
 12  Unnamed: 12  51 non-null     int64 
dtypes: int64(2), object(11)
memory usage: 5.3+ KB
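
The gaps visible in the `info` output can also be counted directly. The following is a minimal sketch on a toy frame (the rows are illustrative, not taken from the dataset) that uses `isna().sum()` to list the number of missing entries per column.

```python
import pandas as pd

# Toy frame standing in for a few challenge columns (illustrative values only)
toy = pd.DataFrame({
    "Design": ["SAFE", "BTM", None, "WIN"],
    "Snatch Game": ["HIGH", "HIGH", "BTM", "SAFE"],
    "Rusical": [None, None, None, "BTM"],
})

# Number of missing entries per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

On the real dataframe the same call would be `df.isna().sum()`.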


By counting the queens and their placements, we can confirm that there are 51 contestants (rows) to work with.

The information above also shows that some columns (maxi challenges) contain null values. This is because not every maxi challenge listed appears in every season and franchise of Drag Race.

In [12]:
print('Number of Queens:', df['Placement'].count())
display(df['Placement'].value_counts())

Number of Queens: 51


winner    43
final      8
Name: Placement, dtype: int64

## **Exploratory Data Analysis**

The following questions are asked to guide the EDA.

1. How many placements are there per challenge column?
2. What is the most occurring placement per challenge?
3. Which challenge has the most complete appearances? Which had the least?
4. What is the most number of challenges per season? How many is the least?
5. What is the average number of *notable* challenges per season?

### **1. How many placements are there per challenge column?**

We first select the `challenge` columns and then create a new dataframe with the counts of the 5 different placements per column.

In [13]:
placements = df[['Design', 'Snatch Game', 'Ball', 'Makeover', 'Acting', 'Girl Groups', 'Rusical']].apply(pd.Series.value_counts)

We then make a list of rows to make a custom order in the new dataframe.

In [14]:
row_order = ['WIN','HIGH','SAFE','LOW','BTM']

By resetting the index, we can easily access the dataframe when needed, especially when it comes to visualization.

In [15]:
placements = placements.reindex(row_order).reset_index()

placements

Unnamed: 0,index,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical
0,WIN,10,16,12,13,12,7,4
1,HIGH,5,11,12,16,12,7,7
2,SAFE,17,15,9,2,13,7,11
3,LOW,2,3,1,4,1,1,2
4,BTM,2,5,2,6,2,3,4
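
The counting and reordering steps above can be sketched on a toy frame. This is an illustration under assumptions: the rows are made up, and the `fillna(0)` step is an addition (not in the original cells) that turns placements which never occur into a count of 0 instead of `NaN`.

```python
import pandas as pd

# Toy challenge columns (illustrative values only)
toy = pd.DataFrame({
    "Design": ["SAFE", "BTM", "SAFE", "WIN"],
    "Ball": [None, "WIN", "WIN", "HIGH"],
})

# value_counts is applied to each column; the results are aligned on the placement labels
counts = toy.apply(pd.Series.value_counts)

# Impose a custom best-to-worst order; labels absent from the data become NaN rows
row_order = ["WIN", "HIGH", "SAFE", "LOW", "BTM"]
ordered = counts.reindex(row_order).fillna(0).astype(int)
print(ordered)
```

Without `reindex`, the rows come back in the order pandas produces when it aligns the per-column results, which is why the explicit `row_order` list is useful for plotting.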


### **2. What is the most occurring placement per challenge?**

By getting the mode of each `challenge` column, we can see the most frequently occurring placement for each challenge, regardless of season.

In [16]:
challenge_mode = pd.DataFrame(df[['Design', 'Snatch Game', 'Ball', 'Makeover', 'Acting', 'Girl Groups', 'Rusical']].mode().iloc[0])

challenge_mode.transpose()

Unnamed: 0,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical
0,SAFE,WIN,HIGH,HIGH,SAFE,HIGH,SAFE


Even though we can see the most frequently occurring placement, it should be noted that not every contestant competed in all of the notable challenges, since some seasons did not include them.
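
One caveat when reading the mode: the `placements` table shows ties (for example, `Ball` has 12 `WIN` and 12 `HIGH`), and `.mode().iloc[0]` reports only the first of the tied values in sorted order. The sketch below uses a toy frame (illustrative values) to show how ties can be surfaced instead of hidden.

```python
import pandas as pd

toy = pd.DataFrame({
    "Ball": ["WIN", "HIGH", "WIN", "HIGH", "SAFE"],     # WIN and HIGH tie at 2 each
    "Acting": ["SAFE", "SAFE", "WIN", "HIGH", "SAFE"],  # SAFE is the only mode
})

# DataFrame.mode returns one row per tied value, sorted within each column
modes = toy.mode()
first_only = modes.iloc[0]  # what the notebook cell keeps

# Listing every mode per column makes the ties visible
tied = {col: toy[col].mode().tolist() for col in toy.columns}
print(first_only)
print(tied)
```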

### **3. Which challenge has the most complete appearances? Which had the least?**

According to the dataset description, there are 51 contestants in the dataset: 43 are winners and 8 are top-four finalists.

Now, we count how many times each challenge appears across all contestants' seasons by counting the non-null values in each column.

In [17]:
appear = pd.DataFrame(df[['Design', 'Snatch Game', 'Ball', 'Makeover', 'Acting', 'Girl Groups', 'Rusical']].count()).reset_index()
appear.sort_values([0], ascending=False)

Unnamed: 0,index,0
1,Snatch Game,50
3,Makeover,41
4,Acting,40
0,Design,36
2,Ball,36
6,Rusical,28
5,Girl Groups,25


By counting the non-null values in each challenge column, we can see that `Snatch Game` appears in almost every season (50 of 51 rows), while `Girl Groups` appears the least (25 rows).
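
The most and least frequent challenges can also be read off programmatically. This is a minimal sketch on toy data (illustrative values) using `idxmax` and `idxmin` on the per-column counts.

```python
import pandas as pd

toy = pd.DataFrame({
    "Snatch Game": ["HIGH", "HIGH", "BTM", "SAFE"],
    "Design": ["SAFE", "BTM", None, "WIN"],
    "Rusical": [None, None, None, "BTM"],
})

# count() ignores missing values, so it gives the number of appearances per challenge
appearances = toy.count()

most_common = appearances.idxmax()   # challenge with the most non-null rows
least_common = appearances.idxmin()  # challenge with the fewest non-null rows
print(most_common, least_common)
```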

### **4. What is the most number of challenges per season? How many is the least?**

We create a new dataframe for counting the challenge placements.

In [18]:
desc = pd.DataFrame(df[['Design', 'Snatch Game', 'Ball', 'Makeover', 'Acting', 'Girl Groups', 'Rusical']])

After that, we count all the `non-null` values present in the `desc` dataframe.

In [19]:
desc['count'] = desc.apply(lambda row: row.count(), axis = 1)

We then use the `.describe` function to quantitatively describe the `count` column.

In [20]:
desc['count'].describe()

count    51.000000
mean      5.019608
std       1.122323
min       3.000000
25%       4.000000
50%       5.000000
75%       6.000000
max       7.000000
Name: count, dtype: float64

We can see here that the highest number of challenges in a single season is `7`, while the lowest is `3`.

### **5. What is the average number of challenges per season?**

Based on the `desc` dataframe, the average number of challenges per season is about `5` (mean of `5.02`). The median of the column is also `5`.

## **Data Preprocessing**


In [21]:
df.head()

Unnamed: 0,Placement,Country,Season,Queen,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical,Unnamed: 11,Unnamed: 12
0,winner,AUS,1,Kita Mean,SAFE,HIGH,,WIN,,HIGH,,,4
1,winner,AUS,2,Spankie Jackzon,BTM,HIGH,,HIGH,WIN,WIN,,,5
2,winner,CAN,1,Priyanka,SAFE,BTM,SAFE,WIN,HIGH,WIN,,,6
3,winner,CAN,2,Icesis Couture,WIN,SAFE,WIN,BTM,SAFE,HIGH,BTM,ALL 7,7
4,winner,CAN,3,Gisele Lullaby,WIN,WIN,HIGH,,,,BTM,,4


##### **Concatenating Columns**

Instead of displaying the `Country` and `Season` separately, we join them into one column.

In [22]:
df['code'] = df['Country'].astype(str) + df['Season'].astype(str)

df.head()

Unnamed: 0,Placement,Country,Season,Queen,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical,Unnamed: 11,Unnamed: 12,code
0,winner,AUS,1,Kita Mean,SAFE,HIGH,,WIN,,HIGH,,,4,AUS1
1,winner,AUS,2,Spankie Jackzon,BTM,HIGH,,HIGH,WIN,WIN,,,5,AUS2
2,winner,CAN,1,Priyanka,SAFE,BTM,SAFE,WIN,HIGH,WIN,,,6,CAN1
3,winner,CAN,2,Icesis Couture,WIN,SAFE,WIN,BTM,SAFE,HIGH,BTM,ALL 7,7,CAN2
4,winner,CAN,3,Gisele Lullaby,WIN,WIN,HIGH,,,,BTM,,4,CAN3
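
The cast to string matters here because `Season` is an integer column and cannot be concatenated with text directly. A minimal sketch on toy rows (illustrative values):

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["AUS", "CAN", "US"], "Season": [1, 2, 15]})

# Season is int64, so it is cast to str before concatenation
toy["code"] = toy["Country"] + toy["Season"].astype(str)
print(toy)
```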


##### **Dropping and Reorganizing of Columns**

Before dropping any columns, we first get the column names.

In [23]:
df.columns

Index(['Placement', 'Country', 'Season', 'Queen', 'Design', 'Snatch Game',
       'Ball', 'Makeover', 'Acting', 'Girl Groups', 'Rusical', 'Unnamed: 11',
       'Unnamed: 12', 'code'],
      dtype='object')

We make a list of columns to drop that would not be needed further in the analysis.

In [24]:
drop_col = [ "Country", "Season", "Queen", 'Unnamed: 11', 'Unnamed: 12']

In [25]:
df = df.drop(drop_col, axis = 1)

df.head()

Unnamed: 0,Placement,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical,code
0,winner,SAFE,HIGH,,WIN,,HIGH,,AUS1
1,winner,BTM,HIGH,,HIGH,WIN,WIN,,AUS2
2,winner,SAFE,BTM,SAFE,WIN,HIGH,WIN,,CAN1
3,winner,WIN,SAFE,WIN,BTM,SAFE,HIGH,BTM,CAN2
4,winner,WIN,WIN,HIGH,,,,BTM,CAN3


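We now reposition the `code` column so that it comes right after `Placement`. We first list the current column order and then reselect the columns in the desired order.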

In [26]:
df.columns.tolist()

['Placement',
 'Design',
 'Snatch Game',
 'Ball',
 'Makeover',
 'Acting',
 'Girl Groups',
 'Rusical',
 'code']

In [27]:
df = df[['Placement', 'code', 'Design', 'Snatch Game', 'Ball', 'Makeover', 'Acting', 'Girl Groups', 'Rusical']]

df.head(2)

Unnamed: 0,Placement,code,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical
0,winner,AUS1,SAFE,HIGH,,WIN,,HIGH,
1,winner,AUS2,BTM,HIGH,,HIGH,WIN,WIN,
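
As an alternative to retyping the full column list, a column can be moved with `pop` and `insert`. This is a sketch on a toy frame, not the approach used in the notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    [["winner", "SAFE", "HIGH", "AUS1"]],
    columns=["Placement", "Design", "Snatch Game", "code"],
)

# Remove the column, then put it back at position 1 (right after Placement)
code_col = toy.pop("code")
toy.insert(1, "code", code_col)
print(toy.columns.tolist())
```

This avoids errors when the list of challenge columns changes, since only the moved column is named.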


##### **Count of Row**

This is to find out the number of episodes that are available in a season or that a contestant has participated in.

Using the count function, we can count the number of non-null values.

We add the counted values as a new column in `df_score`

In [28]:
placements

Unnamed: 0,index,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical
0,WIN,10,16,12,13,12,7,4
1,HIGH,5,11,12,16,12,7,7
2,SAFE,17,15,9,2,13,7,11
3,LOW,2,3,1,4,1,1,2
4,BTM,2,5,2,6,2,3,4
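
The counting step described in this subsection could look like the sketch below. It is an illustration under assumptions: the toy rows are made up, and the names `challenge_cols` and `count` are hypothetical choices rather than the notebook's own.

```python
import pandas as pd

toy = pd.DataFrame({
    "Placement": ["winner", "final"],
    "code": ["AUS1", "BEL1"],
    "Design": ["SAFE", None],
    "Snatch Game": ["HIGH", "LOW"],
    "Ball": [None, None],
})

# Count only the challenge columns so Placement and code do not inflate the total
challenge_cols = ["Design", "Snatch Game", "Ball"]
toy["count"] = toy[challenge_cols].count(axis=1)
print(toy)
```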


##### **Splitting the Dataset**

We split the dataset into `winners` and `finalists` so that we can drop more columns and work only with the columns we need.

In [29]:
winners = df.loc[df['Placement'] == 'winner']

winners.head()

Unnamed: 0,Placement,code,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical
0,winner,AUS1,SAFE,HIGH,,WIN,,HIGH,
1,winner,AUS2,BTM,HIGH,,HIGH,WIN,WIN,
2,winner,CAN1,SAFE,BTM,SAFE,WIN,HIGH,WIN,
3,winner,CAN2,WIN,SAFE,WIN,BTM,SAFE,HIGH,BTM
4,winner,CAN3,WIN,WIN,HIGH,,,,BTM


In [30]:
finalists = df.loc[df['Placement'] == 'final'].reset_index(drop = True)

finalists.head()

Unnamed: 0,Placement,code,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical
0,final,US15,SAFE,SAFE,HIGH,WIN,HIGH,LOW,WIN
1,final,US15,SAFE,HIGH,HIGH,HIGH,WIN,HIGH,HIGH
2,final,US15,HIGH,SAFE,WIN,HIGH,SAFE,HIGH,HIGH
3,final,US15,WIN,SAFE,SAFE,BTM,SAFE,SAFE,HIGH
4,final,BEL1,,LOW,SAFE,WIN,SAFE,,
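
The split can be sketched with a single boolean mask. This assumes, as the `value_counts` output suggests, that `Placement` only takes the values `winner` and `final`; the toy rows are illustrative. The `.copy()` call is an addition that keeps later edits to the subset from triggering chained-assignment warnings.

```python
import pandas as pd

toy = pd.DataFrame({
    "Placement": ["winner", "final", "winner", "final"],
    "code": ["AUS1", "US15", "CAN1", "BEL1"],
})

is_winner = toy["Placement"] == "winner"

# .copy() gives each subset its own data, independent of the original frame
winners = toy[is_winner].copy()
finalists = toy[~is_winner].reset_index(drop=True)
print(winners)
print(finalists)
```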


### **Data Cleaning**

### **Feature Extraction**

##### **Column Conversion**

This column conversion turns the categorical placements into numeric values. It does not affect the main dataframe, only a copy of it.

Before converting any objects to integer values, we first check the kinds of placements available. Since the `Snatch Game` column has the most non-null values, it is used for this check.

In [31]:
print(df['Snatch Game'].unique())

['HIGH' 'BTM' 'SAFE' 'WIN' 'LOW' nan]


Now that we have the placements, we map the values to scores on a copy of the dataframe. Another copy is made for a one-hot encoded dataset.

In [32]:
df_score = df.copy(deep = True) #score df
df_ohe = df.copy(deep = True) #one hot encoded df
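
The notebook does not show how `df_ohe` is encoded, so the following is only one possible sketch: it uses `pd.get_dummies` on a toy frame (illustrative values) to expand each challenge column into one indicator column per placement.

```python
import pandas as pd

toy = pd.DataFrame({
    "code": ["AUS1", "AUS2", "CAN3"],
    "Design": ["SAFE", "BTM", "WIN"],
    "Ball": [None, None, "HIGH"],
})

# Each listed column is replaced by 0/1 indicator columns named <column>_<placement>;
# missing values get 0 in every indicator column
ohe = pd.get_dummies(toy, columns=["Design", "Ball"], dtype=int)
print(ohe)
```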

In [33]:
col_list = ['Design', 'Snatch Game', 'Ball', 'Makeover', 'Acting', 'Girl Groups', 'Rusical']

**Mapping the Scores per Placement**

Here we map each placement to a score, from `4` for `WIN` down to `0` for `BTM`.

In [34]:
df_score[col_list] = df_score[col_list].replace({'WIN':4,
                                 'HIGH':3,
                                 'SAFE':2,
                                 'LOW':1,
                                 'BTM':0})

In [35]:
df_score.head()

Unnamed: 0,Placement,code,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical
0,winner,AUS1,2.0,3.0,,4.0,,3.0,
1,winner,AUS2,0.0,3.0,,3.0,4.0,4.0,
2,winner,CAN1,2.0,0.0,2.0,4.0,3.0,4.0,
3,winner,CAN2,4.0,2.0,4.0,0.0,2.0,3.0,0.0
4,winner,CAN3,4.0,4.0,3.0,,,,0.0
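
The scores appear as `2.0` rather than `2` because columns with missing values are stored as floats. The sketch below reproduces this on a toy frame (illustrative values) using `Series.map` as an alternative to `replace`; the `total` column is a hypothetical addition to show how the scores might be summed.

```python
import pandas as pd

score_map = {"WIN": 4, "HIGH": 3, "SAFE": 2, "LOW": 1, "BTM": 0}

toy = pd.DataFrame({
    "Design": ["SAFE", "BTM", None],        # contains a missing value
    "Snatch Game": ["HIGH", "HIGH", "WIN"],  # complete column
})

# map() looks each placement up in the dictionary; missing values stay NaN
scored = toy.apply(lambda col: col.map(score_map))

# Row totals skip NaN by default, so a missing challenge contributes nothing
scored["total"] = scored.sum(axis=1)
print(scored)
```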


# **References**