# **You're A Winner, Baby!**

**Author:** Anton Reyes

## **Introduction**

### **Requirements and Imports**

#### **Imports**

**Basic Libraries**

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis



In [1]:
import numpy as np
import pandas as pd

**Visualization Libraries**

* `matplotlib.pyplot` contains functions to create interactive plots
* `seaborn` is a library based on matplotlib that allows for data visualization
* `plotly` is an open-source graphing library for Python

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

**Natural Language Processing Libraries**

* `re` is a module that allows the use of regular expressions
* `string` contains functions for string operations

In [3]:
import re
import string

#### **Datasets and Files**

The following `csv` file was used for this project:

- `finalists.csv` lists every finalist of the Drag Race franchise and each contestant's placement in their season. The data is current as of April 22, 2023.

## **Data Collection**

We import the dataset and preview its first five rows.

In [4]:
dataset = "data/finalists.csv"

df = pd.read_csv(dataset)
df.head()

Unnamed: 0,Rank WRE,Rank,Country,Season,Code,W/R/E,Queen,Z-Score,Percent,Score,...,Low,Safe,High,SWin,Win,LSA,T,Unnamed: 36,Unnamed: 37,Unnamed: 38
0,,,,,,,,,,,...,-1,0,1,1.5,2,,,Never/--,Rank,Rank
1,18.0,8.0,US,AS1,US_AS1,W,Chad Michaels,0.47,100.00%,5.0,...,0,0,1,0.0,3,,,--,1220,1
2,129.0,159.0,US,AS1,US_AS1,R,Raven,-2.26,-120.00%,-6.0,...,1,0,1,0.0,0,LSA,2.0,--,237,140
3,75.0,159.0,US,AS1,US_AS1,E,Jujubee,-2.26,-120.00%,-6.0,...,1,0,1,0.0,0,LSA,2.0,--,237,140
4,6.0,8.0,US,AS1,US_AS1,E,Shannel,0.47,100.00%,5.0,...,0,0,1,0.0,3,,,--,1220,1


## **Description of the Dataset**

Here, we find the shape of the dataset.

In [5]:
df.shape

(161, 39)
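
The shape reports 161 rows, yet the first row of the `head()` output above names no queen and appears to hold a legend for the scoring columns rather than a contestant. A minimal sketch of dropping such a row, using a toy frame in place of the real data (the `toy` and `contestants` names are hypothetical, and whether the notebook removes the row this way is an assumption):

```python
import pandas as pd

# Toy frame mimicking the layout above: row 0 is a legend row with no queen.
toy = pd.DataFrame(
    {
        "Queen": [None, "Chad Michaels", "Raven"],
        "W/R/E": [None, "W", "R"],
        "Win": [2, 3, 0],
    }
)

# Keep only rows that actually name a contestant, then renumber from 0.
contestants = toy.dropna(subset=["Queen"]).reset_index(drop=True)

print(contestants.shape)
```

Filtering on `Queen` rather than dropping row 0 by position keeps the step valid even if the file is later re-sorted.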

The `info` output shows that the core columns hold 160 non-null values out of 161 entries, so each has one missing value, while several of the numbered columns have more missing values.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 39 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Rank WRE     160 non-null    float64
 1   Rank         160 non-null    float64
 2   Country      160 non-null    object 
 3   Season       160 non-null    object 
 4   Code         160 non-null    object 
 5   W/R/E        160 non-null    object 
 6   Queen        160 non-null    object 
 7   Z-Score      160 non-null    float64
 8   Percent      160 non-null    object 
 9   Score        160 non-null    float64
 10  Episode      160 non-null    float64
 11  1            154 non-null    float64
 12  2            154 non-null    float64
 13  3            156 non-null    float64
 14  4            160 non-null    float64
 15  5            159 non-null    float64
 16  6            142 non-null    float64
 17  7            139 non-null    float64
 18  8            93 non-null     float64
 19  9       

## **Exploratory Data Analysis Part 1**

The following questions guide the EDA:

1. How many finalists are there per class?
2. How many unique values are there in the `Season` column?
3. How many unique values are there in the `Country` column?
4. What are the measures of central tendency in the `Episode` column?

### **1. How many finalists are there per class?**

Before getting the number of contestants per class, we first define the classes:

| Class | Meaning | Definition |
|:---:|---| --- |
| W | Winner | Contestant has won the season |
| R | Runner-Up | Contestant became the runner-up or placed 2nd |
| E | Eliminated | Contestant was eliminated at the finale and placed 3rd/4th |

With the classes defined, we get the total number of contestants by counting the non-null entries in the `W/R/E` column.

In [7]:
df[["W/R/E"]].count()

W/R/E    160
dtype: int64

Next, we break the 160 contestants down by class with `.value_counts()` and sum the counts to check that they add up to the total.

In [18]:
print("Division of classes:", df[["W/R/E"]].value_counts())
print("Sum of the classes:", df[["W/R/E"]].value_counts().sum())

Division of classes: W/R/E
R        68
W        45
E        44
TBA       3
dtype: int64
Sum of the classes: 160
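
Raw counts can be easier to compare as shares of the total. The sketch below rebuilds a Series from the counts printed above and uses `value_counts(normalize=True)`; the `classes`, `counts`, and `shares` names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class labels rebuilt from the counts shown above (R=68, W=45, E=44, TBA=3).
classes = pd.Series(
    ["R"] * 68 + ["W"] * 45 + ["E"] * 44 + ["TBA"] * 3,
    name="W/R/E",
)

counts = classes.value_counts()                          # absolute counts per class
shares = classes.value_counts(normalize=True).round(3)   # fraction of all contestants

print(counts)
print(shares)
```

Calling `value_counts()` on a single column (`df["W/R/E"]`) instead of a one-column frame (`df[["W/R/E"]]`) returns a Series indexed by the plain class label, which is simpler to look up.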


### **2. How many unique values are there in the `Season` column?**

In [9]:
df[['Season']].count()

Season    160
dtype: int64
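
Note that `.count()` returns the number of non-null entries, not the number of distinct values. A small sketch of the difference on a toy Series (the sample labels are made up for illustration); the same calls would apply to the `Country` column:

```python
import pandas as pd

seasons = pd.Series(["1", "S1", "1", "AS1", None], name="Season")

n_entries = seasons.count()     # non-null entries
n_unique = seasons.nunique()    # distinct non-null labels

print(n_entries, n_unique)
print(seasons.dropna().unique())  # the distinct labels themselves
```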

The value counts for the `Season` column reveal inconsistencies in the labels: some seasons carry an `S` prefix (e.g. `S1`), while others are plain numbers (e.g. `1`).

In [10]:
df[['Season']].value_counts()

Season
1         26
S1        24
2         16
S2        10
3          7
14         5
15         4
AS6        4
S13        4
S11        4
AS7        4
S10        4
AS4        4
AS3        4
AS2        4
AS1        4
4          4
S9         4
AS5        3
S12        3
S3         3
S4         3
S5         3
S6         3
S7         3
S8         3
dtype: int64
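
One possible way to reconcile labels such as `S1` and `1` is sketched below with the `re` module imported earlier. It assumes that `S<n>` and `<n>` denote the same season number, which the dataset does not confirm, and it leaves All Stars labels such as `AS6` untouched. The `normalize_season` helper is hypothetical.

```python
import re


def normalize_season(label: str) -> str:
    """Strip a leading 'S' from labels like 'S1' so they match '1'.

    Assumption: 'S<n>' and '<n>' refer to the same season number.
    Labels that do not match, such as 'AS6', are returned unchanged.
    """
    cleaned = label.strip()
    match = re.fullmatch(r"S(\d+)", cleaned)
    return match.group(1) if match else cleaned


labels = ["1", "S1", "S13", "AS6", "14"]
normalized = [normalize_season(label) for label in labels]
print(normalized)  # ['1', '1', '13', 'AS6', '14']
```

On a dataframe, the same helper could be applied with `df["Season"].dropna().map(normalize_season)`.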

### **3. How many unique values are there in the `Country` column?**

In [11]:
df[["Country"]].count()

Country    160
dtype: int64

The value counts for the `Country` column show 14 distinct values. However, the entries with a `_WORLD` suffix (`CAN_WORLD`, `UK_WORLD`) are recorded as if they were separate countries rather than a *type* of season, when in fact `_WORLD` seasons are similar to an All Stars season.

In [12]:
df[["Country"]].value_counts()

Country  
US           79
UK           14
CAN          10
ESP           8
AUS           7
HOL           7
ITA           7
THI           6
CAN_WORLD     4
FRA           4
PH            4
UK_WORLD      4
BEL           3
SWE           3
dtype: int64
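
A minimal sketch of separating the season type from the country code, assuming the `_WORLD` suffix is the only marker involved. The `Is World` column name is hypothetical and not part of the dataset.

```python
import pandas as pd

countries = pd.DataFrame({"Country": ["US", "UK", "CAN_WORLD", "UK_WORLD"]})

# Flag the _WORLD seasons first, then strip the suffix from the country code.
countries["Is World"] = countries["Country"].str.endswith("_WORLD")
countries["Country"] = countries["Country"].str.replace(r"_WORLD$", "", regex=True)

print(countries)
```

After this step `CAN_WORLD` and `CAN` share the same country code, and the season type remains available in its own column.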

### **4. What are the measures of central tendency in the `Episode` column?**

In [13]:
df['Episode'].describe()

count    160.00000
mean       8.81875
std        2.40511
min        5.00000
25%        7.00000
50%        9.00000
75%       11.00000
max       13.00000
Name: Episode, dtype: float64

In [14]:
df['Episode'].max() - df['Episode'].min()

8.0
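
`describe()` reports the mean and the median (the 50% row) but not the mode. The sketch below computes all three measures, plus the range, on a toy Series; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

episodes = pd.Series([5, 7, 9, 9, 11, 13], name="Episode")

mean = episodes.mean()
median = episodes.median()
modes = episodes.mode().tolist()  # mode() returns a Series because ties are possible
value_range = episodes.max() - episodes.min()

print(mean, median, modes, value_range)
```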

## **Data Preprocessing**


In [28]:
df.head()

Unnamed: 0,Placement,Country,Season,Queen,Design,Snatch Game,Ball,Makeover,Acting,Girl Groups,Rusical,Unnamed: 11,Unnamed: 12
0,winner,AUS,1,Kita Mean,SAFE,HIGH,,WIN,,HIGH,,,4
1,winner,AUS,2,Spankie Jackzon,BTM,HIGH,,HIGH,WIN,WIN,,,5
2,winner,CAN,1,Priyanka,SAFE,BTM,SAFE,WIN,HIGH,WIN,,,6
3,winner,CAN,2,Icesis Couture,WIN,SAFE,WIN,BTM,SAFE,HIGH,BTM,ALL 7,7
4,winner,CAN,3,Gisele Lullaby,WIN,WIN,HIGH,,,,BTM,,4


Rather than displaying the `Country` and `Season` columns separately, we join them into a single column.
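
A minimal sketch of that join on a toy frame, mirroring the `US_AS1` style of the `Code` column in the first dataset. The underscore separator and the `Code` target name are assumptions about the intended result.

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["AUS", "CAN", "CAN"], "Season": [1, 2, 3]})

# Cast Season to string first; adding a string column to an integer column raises a TypeError.
toy["Code"] = toy["Country"] + "_" + toy["Season"].astype(str)

print(toy["Code"].tolist())  # ['AUS_1', 'CAN_2', 'CAN_3']
```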

## **Exploratory Data Analysis Part 2**

## **Saving Dataframes as CSVs**

In [51]:
# Forward slashes avoid the invalid "\m" escape and work on every OS.
# main_df.to_csv("data/main.csv")
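
When a dataframe is saved with its default row index, reading the file back produces an extra unnamed column. The sketch below shows a round trip with `index=False`, writing to a temporary directory so that it runs anywhere; the `frame` contents and the file name are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

frame = pd.DataFrame(
    {"Queen": ["Kita Mean", "Priyanka"], "Placement": ["winner", "winner"]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "main.csv"
    frame.to_csv(path, index=False)  # index=False keeps the row index out of the file
    restored = pd.read_csv(path)

print(restored.columns.tolist())  # ['Queen', 'Placement']
```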
