<a href="https://colab.research.google.com/github/AGR-Yes/RuPauls-Drag-Race-Winner-Prediction/blob/main/RPDR_Winner_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RuPaul's Drag Race Winner Prediction

**Author:** Anton Reyes

## **Introduction**

### **Requirements and Imports**

#### **Imports**

**Basic Libraries**

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis



In [1]:
import numpy as np
import pandas as pd

**Visualization Libraries**

* `matplotlib.pyplot` contains functions to create interactive plots
* `seaborn` is a library based on matplotlib that allows for data visualization
* `wordcloud` contains functions for generating wordclouds from text data

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
#from wordcloud import WordCloud
#from wordcloud import ImageColorGenerator

**Natural Language Processing Libraries**
* `re` is a module that allows the use of regular expressions
* `nltk` provides functions for processing text data
* `Counter` is from Python's `collections` module and is helpful for counting tokens
* `string` contains functions for string operations

In [4]:
import re
#import nltk
#from nltk.stem import WordNetLemmatizer
#from nltk.corpus import stopwords
#from nltk.tokenize import RegexpTokenizer
#from nltk.probability import FreqDist
#from collections import Counter
import string


**Google Drive**
* `google.colab` is a library that allows the Colab notebook to mount Google Drive

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### **Datasets and Files**

The following `csv` file was used for this project:

- `RPDR Database_winner.csv` contains all the winners of the Drag Race franchise as well as the placements of each winner in certain maxi (major) challenges. This dataset also contains the final four of some Drag Race franchises as of March 31, 2023.

## **Data Collection**

The dataset is read from the project's GitHub repository into a dataframe.

In [5]:
url = "https://raw.githubusercontent.com/AGR-Yes/RuPauls-Drag-Race-Winner-Prediction/main/RPDR_Winners_2023.csv?token=GHSAT0AAAAAACAXZ3YRNZFMQSK64PVJ7FQIZBKZJAQ"

df = pd.read_csv(url)
df.head()

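|   | Placement | Country | Season | Queen | Design | Snatch Game | Ball | Makeover | Acting | Girl Groups | Rusical | Unnamed: 11 | Unnamed: 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | winner | AUS | 1 | Kita Mean | SAFE | HIGH | NaN | WIN | NaN | HIGH | NaN | NaN | 4 |
| 1 | winner | AUS | 2 | Spankie Jackzon | BTM | HIGH | NaN | HIGH | WIN | WIN | NaN | NaN | 5 |
| 2 | winner | CAN | 1 | Priyanka | SAFE | BTM | SAFE | WIN | HIGH | WIN | NaN | NaN | 6 |
| 3 | winner | CAN | 2 | Icesis Couture | WIN | SAFE | WIN | BTM | SAFE | HIGH | BTM | ALL 7 | 7 |
| 4 | winner | CAN | 3 | Gisele Lullaby | WIN | WIN | HIGH | NaN | NaN | NaN | BTM | NaN | 4 |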


## **Description of the Dataset**

Here, we find the shape of the dataset.

In [6]:
df.shape

(51, 13)

The `info` summary of the dataframe lists the non-null count and dtype of each column. Several challenge columns have fewer than 51 non-null values.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Placement    51 non-null     object
 1   Country      51 non-null     object
 2   Season       51 non-null     int64 
 3   Queen        51 non-null     object
 4   Design       36 non-null     object
 5   Snatch Game  50 non-null     object
 6   Ball         36 non-null     object
 7   Makeover     41 non-null     object
 8   Acting       40 non-null     object
 9   Girl Groups  25 non-null     object
 10  Rusical      28 non-null     object
 11  Unnamed: 11  7 non-null      object
 12  Unnamed: 12  51 non-null     int64 
dtypes: int64(2), object(11)
memory usage: 5.3+ KB


By displaying the number of queens and the placements, we can confirm that there are 51 queens (43 winners and 8 finalists), one per row, to work with.

The information above also shows that several maxi-challenge columns contain null values. This is because not every listed maxi challenge appears in every season and franchise of Drag Race.

In [8]:
# print() returns None, so it is called on its own instead of being passed to display()
print('Number of Queens:', df['Placement'].count())
display(df['Placement'].value_counts())

Number of Queens: 51


winner    43
final      8
Name: Placement, dtype: int64
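The null values noted above can also be tallied per column. The following is a minimal sketch on a made-up toy dataframe (the queens `A`, `B`, `C` and their placements are illustrative only, not taken from the dataset); the same `isna().sum()` call should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; all values are made up for illustration.
toy = pd.DataFrame({
    "Queen": ["A", "B", "C"],
    "Design": ["WIN", np.nan, "SAFE"],
    "Snatch Game": ["HIGH", "BTM", "WIN"],
    "Rusical": [np.nan, np.nan, "LOW"],
})

# Number of missing entries in each column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)
```

Columns with many missing entries correspond to challenges that were absent from many seasons.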

## **Exploratory Data Analysis**

The following questions are asked to guide the EDA.

1. How many placements are there per challenge column?
2. What is the most frequent placement per challenge?
3. Which challenge has the most recorded appearances? Which has the fewest?
4. What is the largest number of challenges per season? What is the smallest?
5. What is the average number of *notable* challenges per season?

### **1. How many placements are there per challenge column?**

### **2. What is the most frequent placement per challenge?**

### **3. Which challenge has the most recorded appearances? Which has the fewest?**

### **4. What is the largest number of challenges per season? What is the smallest?**

### **5. What is the average number of *notable* challenges per season?**
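As a rough sketch of how the first three questions might be approached, the snippet below uses a made-up toy dataframe rather than the real data. The names `toy` and `challenge_cols` are placeholders introduced here, and this is one possible approach, not necessarily the one the analysis will take.

```python
import numpy as np
import pandas as pd

# Toy data with invented placements; not taken from the dataset.
toy = pd.DataFrame({
    "Design": ["WIN", "SAFE", np.nan, "WIN"],
    "Snatch Game": ["HIGH", "WIN", "BTM", "HIGH"],
    "Rusical": [np.nan, "LOW", np.nan, np.nan],
})
challenge_cols = ["Design", "Snatch Game", "Rusical"]

# Question 1: how often each placement occurs in each challenge column.
placement_counts = toy[challenge_cols].apply(pd.Series.value_counts)

# Question 2: the most frequent placement per challenge
# (if there is a tie, mode() lists every tied value and iloc[0] keeps the first).
most_frequent = toy[challenge_cols].mode().iloc[0]

# Question 3: challenges with the most and the fewest recorded placements.
recorded = toy[challenge_cols].count()
most_complete = recorded.idxmax()
least_complete = recorded.idxmin()

print(placement_counts)
print(most_frequent)
print(most_complete, least_complete)
```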

## **Data Preprocessing**

#### **Data Preprocessing**

##### **Dropping of Columns**

Before dropping any columns, we first get the column names.

In [None]:
df.columns

Index(['Placement', 'Country', 'Season', 'Queen', 'Design', 'Snatch Game',
       'Ball', 'Makeover', 'Acting', 'Girl Groups', 'Rusical', 'Unnamed: 11',
       'Unnamed: 12'],
      dtype='object')

We drop the columns that would not be needed further in the analysis.

In [None]:
drop_col = ['Country', 'Season', 'Unnamed: 11', 'Unnamed: 12']

df = df.drop(columns=drop_col)

df.head()

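|   | Placement | Queen | Design | Snatch Game | Ball | Makeover | Acting | Girl Groups | Rusical |
|---|---|---|---|---|---|---|---|---|---|
| 0 | winner | Kita Mean | SAFE | HIGH | NaN | WIN | NaN | HIGH | NaN |
| 1 | winner | Spankie Jackzon | BTM | HIGH | NaN | HIGH | WIN | WIN | NaN |
| 2 | winner | Priyanka | SAFE | BTM | SAFE | WIN | HIGH | WIN | NaN |
| 3 | winner | Icesis Couture | WIN | SAFE | WIN | BTM | SAFE | HIGH | BTM |
| 4 | winner | Gisele Lullaby | WIN | WIN | HIGH | NaN | NaN | NaN | BTM |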
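For context on the `Unnamed: …` columns dropped above: pandas assigns names of the form `Unnamed: <position>` to columns whose header cell is blank. The sketch below reproduces this with a tiny in-memory CSV (its contents are made up, and whether the project's file has blank header cells is an assumption) and drops such columns by name pattern instead of listing them by hand.

```python
import io

import pandas as pd

# Made-up CSV whose last two header cells are blank.
csv_text = "Queen,Design,,\nA,WIN,,4\nB,SAFE,note,7\n"
toy = pd.read_csv(io.StringIO(csv_text))

# Blank header cells are named after their position: "Unnamed: 2", "Unnamed: 3".
unnamed = [col for col in toy.columns if col.startswith("Unnamed:")]

# Drop every auto-named column in one step.
toy = toy.drop(columns=unnamed)
print(toy.columns.tolist())
```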


##### **Column Conversion**

This step converts the categorical placement data to numeric values. The conversion is applied to a copy, so the main dataframe is not affected.

Before converting any objects to integer values, we first check which placements occur. Since `Snatch Game` has the most non-null values of any challenge column (50 of 51), it is used for this check.

In [None]:
print(df['Snatch Game'].unique())

['HIGH' 'BTM' 'SAFE' 'WIN' 'LOW' nan]


Now that we have the placements, we assign each one a score and map the values on a copy of the dataframe. A second copy is made for the one-hot encoded dataset.

In [None]:
df_score = df.copy(deep = True) #score df
df_ohe = df.copy(deep = True) #one hot encoded df
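The one-hot encoded copy could be built with `pd.get_dummies`. This is a minimal sketch on made-up toy data, assuming one indicator column per challenge and placement pair is what is wanted; `toy` and `challenge_cols` are placeholder names.

```python
import numpy as np
import pandas as pd

# Toy data with invented placements; not taken from the dataset.
toy = pd.DataFrame({
    "Queen": ["A", "B", "C"],
    "Design": ["WIN", np.nan, "SAFE"],
    "Snatch Game": ["HIGH", "BTM", "WIN"],
})
challenge_cols = ["Design", "Snatch Game"]

# One 0/1 indicator column per (challenge, placement) pair.
# By default a missing placement gets 0 in every indicator of that challenge.
toy_ohe = pd.get_dummies(toy, columns=challenge_cols, dtype=int)
print(toy_ohe)
```

Passing `dummy_na=True` would add a separate indicator for missing placements, which may be useful if the absence of a challenge is itself informative.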

In [None]:
# Names of the maxi-challenge columns
col_list = ['Design', 'Snatch Game', 'Ball', 'Makeover', 'Acting', 'Girl Groups', 'Rusical']

**Mapping the Scores per Placement**

Here we map the scores based on placements.

In [None]:
# Higher scores correspond to better placements
placement_scores = {'WIN': 4, 'HIGH': 3, 'SAFE': 2, 'LOW': 1, 'BTM': 0}

df_score[col_list] = df_score[col_list].replace(placement_scores)

In [None]:
df_score.head()

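|   | Placement | Queen | Design | Snatch Game | Ball | Makeover | Acting | Girl Groups | Rusical |
|---|---|---|---|---|---|---|---|---|---|
| 0 | winner | Kita Mean | 2.0 | 3.0 | NaN | 4.0 | NaN | 3.0 | NaN |
| 1 | winner | Spankie Jackzon | 0.0 | 3.0 | NaN | 3.0 | 4.0 | 4.0 | NaN |
| 2 | winner | Priyanka | 2.0 | 0.0 | 2.0 | 4.0 | 3.0 | 4.0 | NaN |
| 3 | winner | Icesis Couture | 4.0 | 2.0 | 4.0 | 0.0 | 2.0 | 3.0 | 0.0 |
| 4 | winner | Gisele Lullaby | 4.0 | 4.0 | 3.0 | NaN | NaN | NaN | 0.0 |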
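The scores appear as floats (`2.0`, `3.0`) because a column that contains `NaN` cannot be stored with NumPy's integer dtype. The sketch below shows this on made-up toy data and, as an optional alternative, casts to pandas' nullable `Int64` dtype. It uses `Series.map` rather than the `replace` call above; unlike `replace`, `map` turns any value missing from the dictionary into `NaN`.

```python
import numpy as np
import pandas as pd

placement_scores = {"WIN": 4, "HIGH": 3, "SAFE": 2, "LOW": 1, "BTM": 0}

# Toy data with invented placements; not taken from the dataset.
toy = pd.DataFrame({
    "Design": ["WIN", np.nan, "SAFE"],
    "Snatch Game": ["HIGH", "BTM", "WIN"],
})

# Map each column; the missing placement stays NaN, which makes "Design" float64.
scored = toy.apply(lambda col: col.map(placement_scores))
print(scored.dtypes)

# Optional: the nullable integer dtype keeps whole numbers and shows gaps as <NA>.
scored_int = scored.astype("Int64")
print(scored_int)
```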


##### **Count per Row**

This step finds the number of challenge episodes recorded for each queen, that is, how many of the listed maxi challenges her season included.

Calling `count` with `axis = 'columns'` counts the non-null values in each row.

In [None]:
df[col_list].count(axis = 'columns')

0     4
1     5
2     6
3     7
4     4
5     4
6     5
7     4
8     5
9     4
10    5
11    4
12    5
13    6
14    3
15    5
16    5
17    5
18    4
19    6
20    3
21    4
22    5
23    5
24    5
25    5
26    7
27    6
28    7
29    6
30    6
31    6
32    4
33    5
34    5
35    3
36    4
37    5
38    5
39    5
40    4
41    6
42    5
43    7
44    7
45    7
46    7
47    4
48    4
49    4
50    4
dtype: int64

We add the counted values as a new column in `df_score`.

In [None]:
df_score['episode count'] = df[col_list].count(axis = 'columns')

We describe the `episode count` column to see which number of episodes should be used when testing the data.

In [None]:
df_score['episode count'].describe()

count    51.000000
mean      5.019608
std       1.122323
min       3.000000
25%       4.000000
50%       5.000000
75%       6.000000
max       7.000000
Name: episode count, dtype: float64
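One way the episode count might be used is to keep only the queens with at least a chosen number of recorded challenges. The sketch below runs on made-up toy scores; the threshold `min_episodes` is a hypothetical value picked for illustration, not one the notebook has settled on.

```python
import numpy as np
import pandas as pd

# Toy scores with invented values; not taken from the dataset.
toy_score = pd.DataFrame({
    "Queen": ["A", "B", "C"],
    "Design": [4, np.nan, 2],
    "Snatch Game": [3, 0, 4],
    "Rusical": [np.nan, np.nan, 1],
})
challenge_cols = ["Design", "Snatch Game", "Rusical"]

# Non-null challenge scores per row, as in the notebook's count step.
toy_score["episode count"] = toy_score[challenge_cols].count(axis="columns")

# Hypothetical threshold: keep queens with at least this many recorded challenges.
min_episodes = 2
enough_data = toy_score[toy_score["episode count"] >= min_episodes]
print(enough_data)
```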

### **Data Cleaning**

### **Feature Extraction**

## **Modeling and Evaluation**

### **Modeling**

#### **Model Training**

#### **Hyperparameter Tuning**

### **Evaluation**

#### **Feature Importance**

## **Conclusion**

# **References**