# Data Preprocessing for the Bachelorette Predictor
### Kwame V. Taylor

I need to preprocess the data from ```joined_df.csv``` to get it ready for modeling. Then I will continue in my main notebook, ```bachelorette-predictor.ipynb```, where I will re-explore the data and begin modeling.

### Set up Env

In [1]:
import numpy as np
import pandas as pd

### Import the tidied data

In [2]:
df = pd.read_csv('joined_df.csv', index_col=0)

In [3]:
df.head()

CONTESTANT,Name,Age,Occupation,Hometown,ElimWeek,Season,ELIMINATION-1,ELIMINATION-2,ELIMINATION-3,ELIMINATION-4,...,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,Ryan Sutter,29,Firefighter,"Vail, Colorado",,1,,,,,...,,D14,D1,D1,D1,D1,,,,
1_CHARLIE_M,Charlie Maher,28,Financial Analyst,"Hermosa Beach, California",6.0,1,,,,,...,,D14,D1,D1,D1,D1,,,,
1_RUSS_X,Russ,30,Writer,"San Rafael, California",5.0,1,,,,,...,,D14,D1,D1,D1,,,,,
1_GREG_T,Greg T.,28,Importer,"Manhattan, New York",4.0,1,,,,E,...,,D14,D5,D1,,,,,,
1_BOB_G,Bob Guiney,31,Mortgage Broker,"Ferndale, Michigan",3.0,1,,,E,,...,,D14,D5,,,,,,,


### Summarize the data

In [4]:
df.Age.value_counts()

27    43
28    40
29    31
26    29
31    28
30    24
25    18
33    14
32    13
35     7
34     6
24     3
36     3
42     1
41     1
37     1
38     1
23     1
Name: Age, dtype: int64

In [5]:
df.Hometown.value_counts()

Dallas, Texas                 7
San Francisco, California     6
Chicago, Illinois             4
Boston, Massachusetts         4
San Diego, California         3
                             ..
Casa Grande, Arizona          1
Blackfoot, Idaho              1
Phillips Ranch, California    1
Lockport, Illinois            1
Warren, Michigan              1
Name: Hometown, Length: 225, dtype: int64

In [6]:
df.ElimWeek.value_counts()

1.0     79
2.0     37
3.0     33
4.0     24
5.0     23
6.0     18
7.0     16
8.0     10
9.0      8
10.0     7
Name: ElimWeek, dtype: int64

In [7]:
df.Season.value_counts()

5     29
12    25
1     25
10    24
9     24
2     24
11    23
8     23
7     23
6     23
4     21
Name: Season, dtype: int64

### Drop unnecessary features / choose features for my MVP

In [8]:
df['DATES-1'].value_counts()

Series([], Name: DATES-1, dtype: int64)

There were no dates in the first episode of each season, so I'll drop that feature.
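
The same check can be made for every column at once. This is a minimal sketch on a made-up toy frame (the values are not from `joined_df.csv`); it lists the columns in which every value is missing, which are safe candidates to drop.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "DATES-1": [np.nan, np.nan, np.nan],
    "DATES-2": ["D14", np.nan, "D1"],
})

# A column in which every value is missing carries no information.
empty_cols = [col for col in toy.columns if toy[col].isna().all()]
print(empty_cols)
```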

In [9]:
df.columns

Index(['Name', 'Age', 'Occupation', 'Hometown', 'ElimWeek', 'Season',
       'ELIMINATION-1', 'ELIMINATION-2', 'ELIMINATION-3', 'ELIMINATION-4',
       'ELIMINATION-5', 'ELIMINATION-6', 'ELIMINATION-7', 'ELIMINATION-8',
       'ELIMINATION-9', 'ELIMINATION-10', 'DATES-1', 'DATES-2', 'DATES-3',
       'DATES-4', 'DATES-5', 'DATES-6', 'DATES-7', 'DATES-8', 'DATES-9',
       'DATES-10'],
      dtype='object')

```ELIMINATION-10``` is the column that has win information. I could also use ```ElimWeek``` to predict how many weeks a contestant will last.

I'll choose to predict ```ElimWeek``` for now, but I may change this in a future iteration after I have given the matter more thought.

In [10]:
df = df.drop(columns=['ELIMINATION-1', 'ELIMINATION-2', 'ELIMINATION-3', 'ELIMINATION-4',
       'ELIMINATION-5', 'ELIMINATION-6', 'ELIMINATION-7', 'ELIMINATION-8',
       'ELIMINATION-9', 'ELIMINATION-10', 'Hometown', 'Occupation', 'Name', 'Season', 'DATES-1'])

In [11]:
df

CONTESTANT,Age,ElimWeek,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,29,,D14,D1,D1,D1,D1,,,,
1_CHARLIE_M,28,6.0,D14,D1,D1,D1,D1,,,,
1_RUSS_X,30,5.0,D14,D1,D1,D1,,,,,
1_GREG_T,28,4.0,D14,D5,D1,,,,,,
1_BOB_G,31,3.0,D14,D5,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
12_JAKE_P,26,1.0,,,,,,,,,
12_JONATHAN_H,29,1.0,,,,,,,,,
12_NICK_S,26,1.0,,,,,,,,,
12_PETER_M,26,1.0,,,,,,,,,


### Handle NaNs in ElimWeek

I think the NaNs in ElimWeek represent that the contestant won their season/made it to the end of the show. I'll do a quick Google search to confirm this...

In [12]:
df[df.ElimWeek.isna()].head(3)

CONTESTANT,Age,ElimWeek,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,29,,D14,D1,D1,D1,D1,,,,
2_IAN_M,29,,D7,D1,D4,D1,D1,D1,,,
4_JESSE_C,26,,D7,D10,D1,D4,D1,D1,D1,,


In [13]:
df.ElimWeek.value_counts()

1.0     79
2.0     37
3.0     33
4.0     24
5.0     23
6.0     18
7.0     16
8.0     10
9.0      8
10.0     7
Name: ElimWeek, dtype: int64

Yes, a NaN means that the contestant won that season. I will encode this as 11.0, one week past the latest elimination week in the data (10), to represent that the contestant lasted to the end of their season of the show.

In [14]:
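# Fill the winners' missing ElimWeek with 11.0 (select the column as a Series, not a one-column DataFrame)
df['ElimWeek'] = df['ElimWeek'].fillna(11.0)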
df.head()

CONTESTANT,Age,ElimWeek,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,29,11.0,D14,D1,D1,D1,D1,,,,
1_CHARLIE_M,28,6.0,D14,D1,D1,D1,D1,,,,
1_RUSS_X,30,5.0,D14,D1,D1,D1,,,,,
1_GREG_T,28,4.0,D14,D5,D1,,,,,,
1_BOB_G,31,3.0,D14,D5,,,,,,,


### Feature engineering with ```DATES``` and ```ELIMINATION``` features

Header          | Description
----------------|------------
`ELIMINATION-1` | Who was eliminated in week 1
`ELIMINATION-2` | Who was eliminated in week 2
`ELIMINATION-3` | Who was eliminated in week 3
`ELIMINATION-4` | Who was eliminated in week 4
`ELIMINATION-5` | Who was eliminated in week 5
`ELIMINATION-6` | Who was eliminated in week 6
`ELIMINATION-7` | Who was eliminated in week 7
`ELIMINATION-8` | Who was eliminated in week 8
`ELIMINATION-9` | Who was eliminated in week 9
`ELIMINATION-10`| Who was eliminated in week 10
`DATES-1`       | Who was on which date in week 1
`DATES-2`       | Who was on which date in week 2
`DATES-3`       | Who was on which date in week 3
`DATES-4`       | Who was on which date in week 4
`DATES-5`       | Who was on which date in week 5
`DATES-6`       | Who was on which date in week 6
`DATES-7`       | Who was on which date in week 7
`DATES-8`       | Who was on which date in week 8
`DATES-9`       | Who was on which date in week 9
`DATES-10`      | Who was on which date in week 10

- Elimination values denote either an elimination (starts with "E") or a rose (starts with "R").
- Eliminations supersede roses.
- "E" denotes a standard elimination, typically at a rose ceremony. "EQ" means the contestant quit. "EF" means the contestant was fired by production. "ED" denotes a date elimination. "EU" denotes an unscheduled elimination, one that takes place at a time outside of a date or rose ceremony.
- "R" means the contestant received a rose. "R1" means the contestant got a first impression rose.
- "D1" means a one-on-one date, "D2" means a 2-on-1, "D3" means a 3-on-1 group date, and so on.
- Weeks of the show are delimited by rose ceremonies, and may not line up exactly with episodes.
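
The code scheme in the list above can be written down as a small lookup. This is only a sketch: the helper names `describe_elimination` and `date_group_size` are hypothetical and are not used elsewhere in this notebook; the code meanings are taken from the list.

```python
import re

# Meanings copied from the list above; the dictionary name is hypothetical.
ELIMINATION_KINDS = {
    "E": "standard elimination",
    "EQ": "quit",
    "EF": "fired by production",
    "ED": "date elimination",
    "EU": "unscheduled elimination",
    "R": "rose",
    "R1": "first impression rose",
}

def describe_elimination(code):
    """Return a readable label for an ELIMINATION-* value, or None if blank or unknown."""
    if not isinstance(code, str):
        return None  # NaN (a float) means nothing was recorded that week
    return ELIMINATION_KINDS.get(code.strip().upper())

def date_group_size(code):
    """Return N for a DATES-* value such as 'D14' (an N-on-1 date), or None."""
    if not isinstance(code, str):
        return None
    match = re.fullmatch(r"D(\d+)", code.strip())
    return int(match.group(1)) if match else None

print(describe_elimination("EQ"))
print(date_group_size("D14"))
```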

Because Eliminations are directly correlated with Win, I'm just going to get dummies for dates, and drop eliminations.

In [15]:
df['DATES-2'].value_counts()

D14    42
D7     38
D8     15
D1     15
D13    13
D6     12
D12    11
D10     9
Name: DATES-2, dtype: int64

To quantify the dates, I am going to make them each a float value that is equal to ```1``` divided by the ```number of people on the date```.

My reasoning is anecdotal: I would guess that the fewer other people are present on a date, the more that date can affect a contestant's standing.

Therefore, the highest value a contestant can have for a given week is 1, meaning they went on a date in that episode and it was a one-on-one.

First I will remove the ```D``` from the beginning of each value.

In [16]:
def remove_Ds(df, col):
    """Strip the leading 'D' from date codes, e.g. 'D14' -> '14'. NaNs are left alone."""
    df[col] = df[col].replace(to_replace = r'^D', value = '', regex = True)
    return df

In [17]:
# Apply to DATES-2 through DATES-10 (DATES-1 was dropped earlier)
for col in [f'DATES-{week}' for week in range(2, 11)]:
    df = remove_Ds(df, col)

Now I will perform my calculation. I have to convert all of the date features to floats first. Then I will turn NaNs into 0s.

In [18]:
df.dtypes

Age           int64
ElimWeek    float64
DATES-2      object
DATES-3      object
DATES-4      object
DATES-5      object
DATES-6      object
DATES-7      object
DATES-8      object
DATES-9      object
DATES-10     object
dtype: object

In [19]:
# Convert DATES-2 through DATES-10 from strings to floats
for col in [f"DATES-{week}" for week in range(2, 11)]:
    df[col] = pd.to_numeric(df[col], downcast="float")

In [20]:
df.dtypes

Age           int64
ElimWeek    float64
DATES-2     float32
DATES-3     float32
DATES-4     float32
DATES-5     float32
DATES-6     float32
DATES-7     float32
DATES-8     float32
DATES-9     float32
DATES-10    float32
dtype: object

Now I'll fill the rest of the NaNs (just present in the dates features) with 0.

In [21]:
df = df.fillna(0)
df.head()

CONTESTANT,Age,ElimWeek,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,29,11.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_CHARLIE_M,28,6.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_RUSS_X,30,5.0,14.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1_GREG_T,28,4.0,14.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BOB_G,31,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
df.head(30)

CONTESTANT,Age,ElimWeek,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,29,11.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_CHARLIE_M,28,6.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_RUSS_X,30,5.0,14.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1_GREG_T,28,4.0,14.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BOB_G,31,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_JAMIE_B,27,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_MIKE_X,24,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_ROB_X,29,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BRIAN_C,28,2.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BRIAN_S,28,2.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
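def calculate_dates(df, col, new_col_name):
    # Weight each date as 1 / group size.
    # pandas does not raise ZeroDivisionError for float columns: dividing by 0.0
    # yields inf, so a try/except cannot catch it. The inf values are replaced afterwards.
    df[new_col_name] = 1 / df[col]
    return df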

In [34]:
# Create Dates2-Calculated through Dates10-Calculated
for week in range(2, 11):
    df = calculate_dates(df, f'DATES-{week}', f'Dates{week}-Calculated')

In [35]:
df = df.replace([np.inf, -np.inf], np.nan)

In [36]:
df = df.fillna(0)
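
Why the `inf` replacement is needed can be seen on a tiny example. This is a sketch on made-up group sizes, not the notebook's data: a float division by zero in pandas produces `inf` instead of raising an exception, so the zeros ("no date that week") have to be handled after the division.

```python
import numpy as np
import pandas as pd

# Made-up group sizes; 0.0 means "no date that week".
group_size = pd.Series([14.0, 1.0, 0.0, 5.0])

raw = 1 / group_size
print(raw.tolist())  # the third entry is inf; no exception is raised

# Turn the infinities into 0 so that "no date" contributes nothing.
weight = raw.replace([np.inf, -np.inf], 0.0)
print(weight.tolist())
```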

In [37]:
df.head(20)

CONTESTANT,Age,ElimWeek,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10,Dates2-Calculated,Dates3-Calculated,Dates4-Calculated,Dates5-Calculated,Dates6-Calculated,Dates7-Calculated,Dates8-Calculated,Dates9-Calculated,Dates10-Calculated
1_RYAN_S,29,11.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.071429,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_CHARLIE_M,28,6.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.071429,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_RUSS_X,30,5.0,14.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.071429,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1_GREG_T,28,4.0,14.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.2,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BOB_G,31,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_JAMIE_B,27,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_MIKE_X,24,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_ROB_X,29,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BRIAN_C,28,2.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BRIAN_S,28,2.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Calculate One-on-One Score

In [38]:
def calculate_score(df):
    # Sum the per-week date weights, then average over the weeks on the show
    calculated_cols = [f'Dates{week}-Calculated' for week in range(2, 11)]
    df['One-on-One_Score'] = df[calculated_cols].sum(axis=1) / (df['ElimWeek'] + 1.0)
    return df

# Apply it; the One-on-One_Score column appears in the outputs further down
df = calculate_score(df)

The choice to give winning contestants an ```ElimWeek``` of 11 may cause slightly lower One-on-One scores for them, because the dividing value (that I'm using to average the amount of One-on-One time per week) is based on ```ElimWeek``` (the code divides by ```ElimWeek + 1```), yet there are no dates in the 11th week. I will compensate for this in a future iteration.
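
One possible way to compensate, shown only as a sketch and not as the method this notebook settles on, is to cap the week count at the last week that can contain a date before dividing. The toy values, the `date_total` column (standing in for the sum of the `Dates*-Calculated` columns), and the constant `LAST_DATE_WEEK` are assumptions made for the illustration.

```python
import pandas as pd

# Toy values only; 'date_total' stands in for the summed Dates*-Calculated columns.
toy = pd.DataFrame(
    {"ElimWeek": [11.0, 6.0, 2.0], "date_total": [4.0, 4.0, 0.5]},
    index=["winner", "week6", "week2"],
)

LAST_DATE_WEEK = 10  # assumption: DATES-10 is the last week that can hold a date

# Cap the week count so winners are not divided by a week that has no dates.
weeks_with_dates = toy["ElimWeek"].clip(upper=LAST_DATE_WEEK)
toy["score_capped"] = toy["date_total"] / (weeks_with_dates + 1.0)
print(toy)
```

With the cap, the winner's total is divided by 11 instead of 12, while contestants eliminated in week 10 or earlier are unaffected.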

### Create ```FirstDate``` feature

This feature will be a boolean that represents whether a contestant was chosen to go on a date in the second episode of their season, which is the episode in which the first dates occur.

In [41]:
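# df.loc[df['column name'] condition, 'new column name'] = 'value if condition is met'

# First attempt, kept for reference. Calling it fails: ['DATES-2'] > 0.0 compares a
# list with a float (TypeError), and a plain `if` cannot branch row by row on a column.
# The next cell uses df.loc instead.
def get_first_date(df):
    if df[['DATES-2'] > 0.0]:
        df['FirstDate'] = 1
    else:
        df['FirstDate'] = 0
    return df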

In [44]:
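# df.loc[df['column name'] condition, 'new column name'] = 'value if condition is met'

def get_first_date(df):
    # Rows with a week-2 date get 1.0; all other rows are left as NaN
    # and are filled with 0 a few cells later.
    df.loc[df['DATES-2'] > 0.0, 'FirstDate'] = 1.0
    return df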

In [48]:
df = get_first_date(df)

df.head(17)

CONTESTANT,Age,ElimWeek,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,...,Dates3-Calculated,Dates4-Calculated,Dates5-Calculated,Dates6-Calculated,Dates7-Calculated,Dates8-Calculated,Dates9-Calculated,Dates10-Calculated,One-on-One_Score,FirstDate
1_RYAN_S,29,11.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.339286,1.0
1_CHARLIE_M,28,6.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.581633,1.0
1_RUSS_X,30,5.0,14.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.511905,1.0
1_GREG_T,28,4.0,14.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.254286,1.0
1_BOB_G,31,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067857,1.0
1_JAMIE_B,27,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067857,1.0
1_MIKE_X,24,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067857,1.0
1_ROB_X,29,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067857,1.0
1_BRIAN_C,28,2.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,1.0
1_BRIAN_S,28,2.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,1.0


In [50]:
df.isna().sum()

Age                     0
ElimWeek                0
DATES-2                 0
DATES-3                 0
DATES-4                 0
DATES-5                 0
DATES-6                 0
DATES-7                 0
DATES-8                 0
DATES-9                 0
DATES-10                0
Dates2-Calculated       0
Dates3-Calculated       0
Dates4-Calculated       0
Dates5-Calculated       0
Dates6-Calculated       0
Dates7-Calculated       0
Dates8-Calculated       0
Dates9-Calculated       0
Dates10-Calculated      0
One-on-One_Score        0
FirstDate             109
dtype: int64

In [51]:
df = df.fillna(0)
df.head(17)

CONTESTANT,Age,ElimWeek,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,...,Dates3-Calculated,Dates4-Calculated,Dates5-Calculated,Dates6-Calculated,Dates7-Calculated,Dates8-Calculated,Dates9-Calculated,Dates10-Calculated,One-on-One_Score,FirstDate
1_RYAN_S,29,11.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.339286,1.0
1_CHARLIE_M,28,6.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.581633,1.0
1_RUSS_X,30,5.0,14.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.511905,1.0
1_GREG_T,28,4.0,14.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.254286,1.0
1_BOB_G,31,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067857,1.0
1_JAMIE_B,27,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067857,1.0
1_MIKE_X,24,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067857,1.0
1_ROB_X,29,3.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067857,1.0
1_BRIAN_C,28,2.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,1.0
1_BRIAN_S,28,2.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,1.0


In [52]:
df.isna().sum()

Age                   0
ElimWeek              0
DATES-2               0
DATES-3               0
DATES-4               0
DATES-5               0
DATES-6               0
DATES-7               0
DATES-8               0
DATES-9               0
DATES-10              0
Dates2-Calculated     0
Dates3-Calculated     0
Dates4-Calculated     0
Dates5-Calculated     0
Dates6-Calculated     0
Dates7-Calculated     0
Dates8-Calculated     0
Dates9-Calculated     0
Dates10-Calculated    0
One-on-One_Score      0
FirstDate             0
dtype: int64

### Create more features

I want to create a ```Homestate``` feature (extracting the state from ```Hometown``` with a regex, then encoding it with ```pd.get_dummies```). I will do this in my second iteration of the pipeline.
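
A rough sketch of what that could look like, assuming every ```Hometown``` value follows the "City, State" pattern seen in the first rows of the data. The three hometowns below are taken from the `df.head()` output; the regex and the `Homestate_` prefix are choices made for the illustration.

```python
import pandas as pd

hometowns = pd.Series(
    ["Vail, Colorado", "Hermosa Beach, California", "San Rafael, California"],
    name="Hometown",
)

# Capture everything after the last comma as the state.
homestate = hometowns.str.extract(r",\s*([^,]+)$")[0].str.strip()
homestate.name = "Homestate"

# One indicator column per state.
state_dummies = pd.get_dummies(homestate, prefix="Homestate")
print(state_dummies.columns.tolist())
```

Note that ```Hometown``` is dropped earlier in this notebook, so it would have to be kept (or re-read from the CSV) for this step to work on the real data.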

### Drop unneeded columns that were used to feature-engineer

### Train, validate, test split

In [None]:
# y is what I'm predicting: ElimWeek
# I'll stratify by ElimWeek
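
A possible shape for the stratified split, written with pandas only. This is a sketch under assumptions: the function name `stratified_split`, the 60/20/20 proportions, and the seed are all hypothetical, and the index is assumed to be unique. scikit-learn's `train_test_split` with its `stratify` argument is a common alternative.

```python
import pandas as pd

def stratified_split(df, target, frac_val=0.2, frac_test=0.2, seed=123):
    """Split df into train/validate/test, sampling each target class separately."""
    # Draw the test rows class by class so every target value is represented.
    test = df.groupby(target, group_keys=False).sample(frac=frac_test, random_state=seed)
    rest = df.drop(test.index)
    # Rescale the validate fraction because `rest` is smaller than `df`.
    val_frac = frac_val / (1.0 - frac_test)
    validate = rest.groupby(target, group_keys=False).sample(frac=val_frac, random_state=seed)
    train = rest.drop(validate.index)
    return train, validate, test

# Toy data: 10 rows for each of three ElimWeek values.
toy = pd.DataFrame({"Age": range(30), "ElimWeek": [1.0, 2.0, 3.0] * 10})
train, validate, test = stratified_split(toy, "ElimWeek")
print(len(train), len(validate), len(test))
```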

### Scale the data

I'll scale the data on my next iteration through the data science pipeline.
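
As a placeholder for that iteration, here is a minimal min-max scaling sketch. The helper names and the toy values are hypothetical; the one point it is meant to illustrate is that the minimum and maximum are learned from the training split only and then reused on the other splits.

```python
import pandas as pd

def fit_min_max(train, columns):
    """Record each column's min and max from the training split only."""
    return train[columns].min(), train[columns].max()

def apply_min_max(df, columns, col_min, col_max):
    """Rescale columns with the stored training min/max; returns a copy."""
    scaled = df.copy()
    for col in columns:
        span = col_max[col] - col_min[col]
        # A constant column has span 0; divide by 1 instead to avoid inf/NaN.
        scaled[col] = (df[col] - col_min[col]) / (span if span != 0 else 1)
    return scaled

# Made-up values for illustration.
train = pd.DataFrame({"Age": [24, 28, 32], "One-on-One_Score": [0.0, 0.25, 0.5]})
test = pd.DataFrame({"Age": [30], "One-on-One_Score": [0.125]})

cols = ["Age", "One-on-One_Score"]
col_min, col_max = fit_min_max(train, cols)
test_scaled = apply_min_max(test, cols, col_min, col_max)
print(test_scaled)
```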

### Export data and make .py file

# Notes to Self:

**Dec 4, 2020**

ElimWeek should be the target variable, not Win, to compensate for the overbearing correlation between the number of dates (over the span of a season) and Win. So I'll use regression to predict how many weeks a contestant will last.

Keep in mind that the predictor should be able to do so within the first few episodes of a season, for future use to have the most value.

Am I looking at a time series? It does not have dates, though, only weeks.

Tomorrow: Look up information about time series analysis that samples by week.

Alternatively, I could create new features out of the ```DATES``` features, such as:
   * Had a one-on-one? (boolean)
   * Went on a date in the first 2 episodes? (boolean)
   * Went on a date in the first 3 episodes? (boolean)
   * Went on a date in the first 4 episodes? (boolean)
   * Had the most one-on-one time during their time on the show (new calculation, which divides the calculated total date values -- per contestant, across all 10 episodes -- by the number of weeks the contestant was on the show, and rescales it)
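
The boolean ideas in the list above could be sketched as follows. The toy group sizes are made up, the column names `HadOneOnOne` and `DateInFirstN` are hypothetical, and the sketch treats weeks and episodes as the same thing, which the data notes say is not always exact.

```python
import pandas as pd

# Toy group sizes after the 'D' prefix is removed and NaNs are filled with 0.
toy = pd.DataFrame(
    {"DATES-2": [14.0, 0.0, 7.0], "DATES-3": [1.0, 0.0, 0.0], "DATES-4": [0.0, 5.0, 0.0]},
    index=["a", "b", "c"],
)
date_cols = ["DATES-2", "DATES-3", "DATES-4"]

# Had at least one one-on-one date (group size exactly 1).
toy["HadOneOnOne"] = (toy[date_cols] == 1.0).any(axis=1)

def dated_by_episode(df, last_episode):
    """True when the contestant went on any date up to and including last_episode."""
    cols = [f"DATES-{week}" for week in range(2, last_episode + 1)]
    return (df[cols] > 0).any(axis=1)

toy["DateInFirst2"] = dated_by_episode(toy, 2)
toy["DateInFirst3"] = dated_by_episode(toy, 3)
toy["DateInFirst4"] = dated_by_episode(toy, 4)
print(toy)
```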

**Dec 5, 2020**

I have decided to save the time-series analysis for a future iteration, when I have more time. For now, I will focus on creating features that don't have a direct relationship with how many weeks a contestant will last. I could use linear regression and treat ElimWeek as a continuous variable, or I could use classification and treat it as a multi-class problem.