# Data Preprocessing for the Bachelorette Predictor
### Kwame V. Taylor

I need to preprocess the data from ```joined_df.csv``` to prepare it for modeling. Then I will continue in my main notebook, ```bachelorette-predictor.ipynb```, where I will re-explore the data and begin modeling.

### Set up Env

In [1]:
import numpy as np
import pandas as pd

import sklearn.preprocessing
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor

import warnings
warnings.filterwarnings("ignore")

### Import the tidied data

In [2]:
df = pd.read_csv('joined_df.csv', index_col=0)

In [3]:
df.head()

CONTESTANT,Name,Age,Occupation,Hometown,ElimWeek,Season,ELIMINATION-1,ELIMINATION-2,ELIMINATION-3,ELIMINATION-4,...,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,Ryan Sutter,29,Firefighter,"Vail, Colorado",,1,,,,,...,,D14,D1,D1,D1,D1,,,,
1_CHARLIE_M,Charlie Maher,28,Financial Analyst,"Hermosa Beach, California",6.0,1,,,,,...,,D14,D1,D1,D1,D1,,,,
1_RUSS_X,Russ,30,Writer,"San Rafael, California",5.0,1,,,,,...,,D14,D1,D1,D1,,,,,
1_GREG_T,Greg T.,28,Importer,"Manhattan, New York",4.0,1,,,,E,...,,D14,D5,D1,,,,,,
1_BOB_G,Bob Guiney,31,Mortgage Broker,"Ferndale, Michigan",3.0,1,,,E,,...,,D14,D5,,,,,,,


### Summarize the data

In [4]:
df.Age.value_counts()

27    43
28    40
29    31
26    29
31    28
30    24
25    18
33    14
32    13
35     7
34     6
24     3
36     3
42     1
41     1
37     1
38     1
23     1
Name: Age, dtype: int64

In [5]:
df.Hometown.value_counts()

Dallas, Texas                7
San Francisco, California    6
Boston, Massachusetts        4
Chicago, Illinois            4
Orlando, Florida             3
                            ..
Mukilteo, Washington         1
Derby, Kansas                1
Glen Ellyn, Illinois         1
Peculiar, Missouri           1
Fort Lauderdale, Florida     1
Name: Hometown, Length: 225, dtype: int64

In [6]:
df.ElimWeek.value_counts()

1.0     79
2.0     37
3.0     33
4.0     24
5.0     23
6.0     18
7.0     16
8.0     10
9.0      8
10.0     7
Name: ElimWeek, dtype: int64

In [7]:
df.Season.value_counts()

5     29
12    25
1     25
10    24
9     24
2     24
11    23
8     23
7     23
6     23
4     21
Name: Season, dtype: int64

### Drop unnecessary features / choose features for my MVP

In [8]:
df['DATES-1'].value_counts()

Series([], Name: DATES-1, dtype: int64)

There were no dates in the first episode of each season, so I'll drop that feature.
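A quick way to confirm which columns carry no information at all is to look for columns in which every value is missing. The sketch below uses a tiny made-up frame (the values are illustrative, not taken from ```joined_df.csv```):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values are illustrative only
toy = pd.DataFrame({
    'DATES-1': [np.nan, np.nan, np.nan],
    'DATES-2': ['D14', np.nan, 'D1'],
})

# Columns in which every value is NaN carry no signal and can be dropped
empty_cols = toy.columns[toy.isna().all()].tolist()
toy = toy.drop(columns=empty_cols)
print(empty_cols)
```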

In [9]:
df.columns

Index(['Name', 'Age', 'Occupation', 'Hometown', 'ElimWeek', 'Season',
       'ELIMINATION-1', 'ELIMINATION-2', 'ELIMINATION-3', 'ELIMINATION-4',
       'ELIMINATION-5', 'ELIMINATION-6', 'ELIMINATION-7', 'ELIMINATION-8',
       'ELIMINATION-9', 'ELIMINATION-10', 'DATES-1', 'DATES-2', 'DATES-3',
       'DATES-4', 'DATES-5', 'DATES-6', 'DATES-7', 'DATES-8', 'DATES-9',
       'DATES-10'],
      dtype='object')

```ELIMINATION-10``` is the column that has win information. I could also use ```ElimWeek``` to predict how many weeks a contestant will last.

I'll choose to predict ```ElimWeek``` for now, but I may change this in a future iteration after giving the matter more thought.

In [10]:
df = df.drop(columns=['ELIMINATION-1', 'ELIMINATION-2', 'ELIMINATION-3', 'ELIMINATION-4',
       'ELIMINATION-5', 'ELIMINATION-6', 'ELIMINATION-7', 'ELIMINATION-8',
       'ELIMINATION-9', 'ELIMINATION-10', 'Occupation', 'Name', 'Hometown'])

In [11]:
df

CONTESTANT,Age,ElimWeek,Season,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,29,,1,,D14,D1,D1,D1,D1,,,,
1_CHARLIE_M,28,6.0,1,,D14,D1,D1,D1,D1,,,,
1_RUSS_X,30,5.0,1,,D14,D1,D1,D1,,,,,
1_GREG_T,28,4.0,1,,D14,D5,D1,,,,,,
1_BOB_G,31,3.0,1,,D14,D5,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12_JAKE_P,26,1.0,12,,,,,,,,,,
12_JONATHAN_H,29,1.0,12,,,,,,,,,,
12_NICK_S,26,1.0,12,,,,,,,,,,
12_PETER_M,26,1.0,12,,,,,,,,,,


### Handle NaNs in ElimWeek

I think the NaNs in ElimWeek represent that the contestant won their season/made it to the end of the show. I'll do a quick Google search to confirm this...

In [12]:
df[df.ElimWeek.isna()].head(3)

CONTESTANT,Age,ElimWeek,Season,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,29,,1,,D14,D1,D1,D1,D1,,,,
2_IAN_M,29,,2,,D7,D1,D4,D1,D1,D1,,,
4_JESSE_C,26,,4,,D7,D10,D1,D4,D1,D1,D1,,


In [13]:
df.ElimWeek.value_counts()

1.0     79
2.0     37
3.0     33
4.0     24
5.0     23
6.0     18
7.0     16
8.0     10
9.0      8
10.0     7
Name: ElimWeek, dtype: int64

Yes, NaN means that the contestant won their season. I will encode this as 11.0, indicating that the contestant lasted to the end of the season.

In [14]:
df['ElimWeek'] = df['ElimWeek'].fillna(11.0)
df.head()

CONTESTANT,Age,ElimWeek,Season,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,29,11.0,1,,D14,D1,D1,D1,D1,,,,
1_CHARLIE_M,28,6.0,1,,D14,D1,D1,D1,D1,,,,
1_RUSS_X,30,5.0,1,,D14,D1,D1,D1,,,,,
1_GREG_T,28,4.0,1,,D14,D5,D1,,,,,,
1_BOB_G,31,3.0,1,,D14,D5,,,,,,,
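Filling the NaNs with 11.0 folds the "won the season" marker into the week count. If win information is wanted later, one option is to record it before filling. This is a minimal sketch on made-up rows; the ```Winner``` column name is my own assumption and is not part of the dataset:

```python
import numpy as np
import pandas as pd

# Made-up rows: one winner (NaN) and two eliminated contestants
toy = pd.DataFrame({'ElimWeek': [np.nan, 6.0, 1.0]}, index=['A', 'B', 'C'])

# Hypothetical flag: a missing ElimWeek means the contestant was never eliminated
toy['Winner'] = toy['ElimWeek'].isna().astype(int)

# Same encoding as in the cell above: winners "last" until week 11
toy['ElimWeek'] = toy['ElimWeek'].fillna(11.0)
print(toy)
```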


### Feature engineering with ```DATES``` and ```ELIMINATION``` features

Header          | Description
----------------|------------
`ELIMINATION-1` | Who was eliminated in week 1
`ELIMINATION-2` | Who was eliminated in week 2
`ELIMINATION-3` | Who was eliminated in week 3
`ELIMINATION-4` | Who was eliminated in week 4
`ELIMINATION-5` | Who was eliminated in week 5
`ELIMINATION-6` | Who was eliminated in week 6
`ELIMINATION-7` | Who was eliminated in week 7
`ELIMINATION-8` | Who was eliminated in week 8
`ELIMINATION-9` | Who was eliminated in week 9
`ELIMINATION-10`| Who was eliminated in week 10
`DATES-1`       | Who was on which date in week 1
`DATES-2`       | Who was on which date in week 2
`DATES-3`       | Who was on which date in week 3
`DATES-4`       | Who was on which date in week 4
`DATES-5`       | Who was on which date in week 5
`DATES-6`       | Who was on which date in week 6
`DATES-7`       | Who was on which date in week 7
`DATES-8`       | Who was on which date in week 8
`DATES-9`       | Who was on which date in week 9
`DATES-10`      | Who was on which date in week 10

- Values in the `ELIMINATION` columns connote either an elimination (starts with "E") or a rose (starts with "R").
- Eliminations supersede roses.
- "E" connotes a standard elimination, typically at a rose ceremony. "EQ" means the contestant quits. "EF" means the contestant was fired by production. "ED" connotes a date elimination. "EU" connotes an unscheduled elimination, one that takes place at a time outside of a date or rose ceremony.
- "R" means the contestant received a rose. "R1" means the contestant got a first impression rose.
- "D1" means a one-on-one date, "D2" means a 2-on-1, "D3" means a 3-on-1 group date, and so on.
- Weeks of the show are delimited by rose ceremonies, and may not line up exactly with episodes.
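The code list above can be turned into a small lookup for sanity-checking raw values. This is only a sketch: the function name ```describe_code``` and the label wording are my own, and it covers only the codes listed above.

```python
import re

# Labels paraphrase the code list above; names here are hypothetical
ELIMINATION_KINDS = {
    'E': 'standard elimination',
    'EQ': 'quit',
    'EF': 'fired by production',
    'ED': 'date elimination',
    'EU': 'unscheduled elimination',
    'R': 'rose',
    'R1': 'first impression rose',
}

def describe_code(code):
    """Return a human-readable description of an ELIMINATION or DATES code."""
    if code in ELIMINATION_KINDS:
        return ELIMINATION_KINDS[code]
    # Date codes are "D" followed by the number of contestants on the date
    match = re.fullmatch(r'D(\d+)', code)
    if match:
        size = int(match.group(1))
        return 'one-on-one date' if size == 1 else f'{size}-on-1 date'
    raise ValueError(f'unrecognized code: {code!r}')

print(describe_code('EQ'))
print(describe_code('D14'))
```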

Because Eliminations are directly correlated with Win, I'm just going to get dummies for dates, and drop eliminations.

In [15]:
df['DATES-2'].value_counts()

D14    42
D7     38
D8     15
D1     15
D13    13
D6     12
D12    11
D10     9
Name: DATES-2, dtype: int64

## Average One-on-One Score

To quantify the dates, I am going to make them each a float value that is equal to ```1``` divided by the ```number of people on the date```.

My reasoning behind this is that, anecdotally, I would guess that the fewer other people present on a date, the more that that date can affect the standing of a contestant on the date.

Therefore, the highest value a date can have is 1, meaning the contestant went on a date in that episode and it was a 1-on-1.
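The scoring rule described above can be sketched in one step on a few illustrative values. This is a compact alternative under the assumption that every date value has the form ```D<number>```; it is not the step-by-step approach used in the cells of this notebook.

```python
import pandas as pd

# Illustrative values in the same format as the DATES columns
dates = pd.Series(['D14', 'D1', None, 'D5'])

# Pull out the group size, then score = 1 / size; no date scores 0
size = pd.to_numeric(dates.str.extract(r'D(\d+)', expand=False))
score = (1 / size).fillna(0)
print(score)
```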

First I will remove the ```D``` from the beginning of each value.

In [16]:
def remove_Ds(df, col):
    df[col] = df[col].replace(to_replace = r'D', value = '', regex = True)
    return df

In [17]:
# DATES-1 is empty, so start at week 2
for week in range(2, 11):
    df = remove_Ds(df, f'DATES-{week}')

Now I will perform my calculation. I have to convert all of the date features to floats first. Then I will turn NaNs into 0s.

In [18]:
df.dtypes

Age           int64
ElimWeek    float64
Season        int64
DATES-1     float64
DATES-2      object
DATES-3      object
DATES-4      object
DATES-5      object
DATES-6      object
DATES-7      object
DATES-8      object
DATES-9      object
DATES-10     object
dtype: object

In [19]:
# convert the stripped date strings to floats, one column per week
for week in range(2, 11):
    col = f'DATES-{week}'
    df[col] = pd.to_numeric(df[col], downcast="float")

In [20]:
df.dtypes

Age           int64
ElimWeek    float64
Season        int64
DATES-1     float64
DATES-2     float32
DATES-3     float32
DATES-4     float32
DATES-5     float32
DATES-6     float32
DATES-7     float32
DATES-8     float32
DATES-9     float32
DATES-10    float32
dtype: object

Now I'll fill the rest of the NaNs (just present in the dates features) with 0.

In [21]:
df = df.fillna(0)
df.head()

CONTESTANT,Age,ElimWeek,Season,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,29,11.0,1,0.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_CHARLIE_M,28,6.0,1,0.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_RUSS_X,30,5.0,1,0.0,14.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1_GREG_T,28,4.0,1,0.0,14.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BOB_G,31,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
df.head(30)

CONTESTANT,Age,ElimWeek,Season,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
1_RYAN_S,29,11.0,1,0.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_CHARLIE_M,28,6.0,1,0.0,14.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_RUSS_X,30,5.0,1,0.0,14.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1_GREG_T,28,4.0,1,0.0,14.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BOB_G,31,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_JAMIE_B,27,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_MIKE_X,24,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_ROB_X,29,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BRIAN_C,28,2.0,1,0.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BRIAN_S,28,2.0,1,0.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
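def calculate_dates(df, col, new_col_name):
    # Dividing by a pandas column never raises ZeroDivisionError; a group size
    # of 0 (no date that week) produces inf instead, so map inf to a score of 0
    df[new_col_name] = (1 / df[col]).replace([np.inf, -np.inf], 0)
    return df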

In [24]:
for week in range(1, 11):
    df = calculate_dates(df, f'DATES-{week}', f'Dates{week}-OneonOneScore')

In [25]:
df = df.replace([np.inf, -np.inf], np.nan)

In [26]:
df = df.fillna(0)
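Dividing by a pandas Series that contains zeros yields ```inf``` rather than raising ```ZeroDivisionError```, which is why the infinities have to be replaced. A small illustration with made-up group sizes:

```python
import numpy as np
import pandas as pd

sizes = pd.Series([14.0, 1.0, 0.0])  # 0.0 stands for "no date that week"

# Unlike plain Python floats, pandas does not raise ZeroDivisionError here
raw = 1 / sizes
print(np.isinf(raw).tolist())

# Replacing inf with 0 gives the intended "no date" score
clean = raw.replace([np.inf, -np.inf], 0)
print(clean.tolist())
```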

In [27]:
df.head(20)

CONTESTANT,Age,ElimWeek,Season,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,...,Dates1-OneonOneScore,Dates2-OneonOneScore,Dates3-OneonOneScore,Dates4-OneonOneScore,Dates5-OneonOneScore,Dates6-OneonOneScore,Dates7-OneonOneScore,Dates8-OneonOneScore,Dates9-OneonOneScore,Dates10-OneonOneScore
1_RYAN_S,29,11.0,1,0.0,14.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.071429,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_CHARLIE_M,28,6.0,1,0.0,14.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.071429,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
1_RUSS_X,30,5.0,1,0.0,14.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.071429,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1_GREG_T,28,4.0,1,0.0,14.0,5.0,1.0,0.0,0.0,0.0,...,0.0,0.071429,0.2,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BOB_G,31,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_JAMIE_B,27,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_MIKE_X,24,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_ROB_X,29,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BRIAN_C,28,2.0,1,0.0,14.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_BRIAN_S,28,2.0,1,0.0,14.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Calculate Average One-on-One Score

In [28]:
#def calculate_score(df):
#    df['One-on-One_Score'] = (df['Dates2-AvgOneonOneScore'] + df['Dates3-AvgOneonOneScore'] + df['Dates4-AvgOneonOneScore'] + df['Dates5-AvgOneonOneScore'] + df['Dates6-AvgOneonOneScore'] + df['Dates7-AvgOneonOneScore'] + df['Dates8-AvgOneonOneScore'] + df['Dates9-AvgOneonOneScore'] + df['Dates10-AvgOneonOneScore']) / (df['ElimWeek'] + 1.0)
#    return df

In [29]:
#df = calculate_score(df)
#df

The choice to give winning contestants an ```ElimWeek``` of 11 may cause slightly lower One-on-One scores for winners, because the divisor (which I'm using to average the amount of One-on-One time per week) is based on ```ElimWeek```, yet there are no dates in the 11th week. I will compensate for this in a future iteration.
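One possible way to compensate, sketched on made-up numbers: divide by the number of weeks in which a contestant could actually have had a date, capped at the last week with dates. The cap, the ```AvgOneonOneScore``` name, and the toy values are all assumptions of mine; since seasons appear to differ in length, a per-season cap may be more accurate.

```python
import pandas as pd

# Illustrative per-week scores for weeks 2-4 only (the real data runs to week 10)
toy = pd.DataFrame({
    'ElimWeek': [11.0, 3.0, 1.0],
    'Dates2-OneonOneScore': [0.5, 0.2, 0.0],
    'Dates3-OneonOneScore': [1.0, 0.2, 0.0],
    'Dates4-OneonOneScore': [1.0, 0.0, 0.0],
})
LAST_DATE_WEEK = 4  # assumption for this toy frame; 10 in the real data

score_cols = [c for c in toy.columns if c.endswith('OneonOneScore')]

# Weeks in which a contestant could have had a date: week 2 through their
# last week, capped at the final date week so winners are not penalized
weeks_with_dates = toy['ElimWeek'].clip(upper=LAST_DATE_WEEK) - 1

# Avoid dividing by zero for contestants eliminated in week 1
toy['AvgOneonOneScore'] = (
    toy[score_cols].sum(axis=1) / weeks_with_dates.where(weeks_with_dates > 0)
).fillna(0)
print(toy['AvgOneonOneScore'])
```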

**Dec 6, 2020**

I have decided to drop the One-on-One Score feature because it is too closely tied to the target variable. As the season progresses and fewer contestants remain, those contestants naturally get more relative one-on-one time than earlier in the season. Also, I want my predictor to be usable after the 3rd episode of a season, so that users (and I) can predict the frontrunners before the 4th episode of the show.

Instead, I will keep the One-on-One scores for ```DATES-2``` and ```DATES-3```.

### Create ```FirstDate``` feature

This feature will be a boolean that represents whether a contestant was chosen to go on a date in the second episode of their season, which is the episode in which the first dates occur.

In [30]:
# df.loc[df['column name'] condition, 'new column name'] = 'value if condition is met'

def get_first_date(df):
    df.loc[df['DATES-2'] > 0.0, 'FirstDate'] = 1.0
    return df
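The ```.loc``` assignment leaves contestants without a date as NaN until they are filled later. An alternative sketch that yields 1.0 and 0.0 in a single step, shown on made-up group sizes:

```python
import pandas as pd

# Made-up group sizes for week 2; 0.0 means no date
toy = pd.DataFrame({'DATES-2': [14.0, 0.0, 1.0]})

# True where the contestant went on any date in week 2, stored as 1.0 / 0.0
toy['FirstDate'] = (toy['DATES-2'] > 0).astype(float)
print(toy)
```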

In [31]:
df = get_first_date(df)

df.head(17)

CONTESTANT,Age,ElimWeek,Season,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,...,Dates2-OneonOneScore,Dates3-OneonOneScore,Dates4-OneonOneScore,Dates5-OneonOneScore,Dates6-OneonOneScore,Dates7-OneonOneScore,Dates8-OneonOneScore,Dates9-OneonOneScore,Dates10-OneonOneScore,FirstDate
1_RYAN_S,29,11.0,1,0.0,14.0,1.0,1.0,1.0,1.0,0.0,...,0.071429,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
1_CHARLIE_M,28,6.0,1,0.0,14.0,1.0,1.0,1.0,1.0,0.0,...,0.071429,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
1_RUSS_X,30,5.0,1,0.0,14.0,1.0,1.0,1.0,0.0,0.0,...,0.071429,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1_GREG_T,28,4.0,1,0.0,14.0,5.0,1.0,0.0,0.0,0.0,...,0.071429,0.2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_BOB_G,31,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_JAMIE_B,27,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_MIKE_X,24,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_ROB_X,29,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_BRIAN_C,28,2.0,1,0.0,14.0,0.0,0.0,0.0,0.0,0.0,...,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_BRIAN_S,28,2.0,1,0.0,14.0,0.0,0.0,0.0,0.0,0.0,...,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [32]:
df.isna().sum()

Age                        0
ElimWeek                   0
Season                     0
DATES-1                    0
DATES-2                    0
DATES-3                    0
DATES-4                    0
DATES-5                    0
DATES-6                    0
DATES-7                    0
DATES-8                    0
DATES-9                    0
DATES-10                   0
Dates1-OneonOneScore       0
Dates2-OneonOneScore       0
Dates3-OneonOneScore       0
Dates4-OneonOneScore       0
Dates5-OneonOneScore       0
Dates6-OneonOneScore       0
Dates7-OneonOneScore       0
Dates8-OneonOneScore       0
Dates9-OneonOneScore       0
Dates10-OneonOneScore      0
FirstDate                109
dtype: int64

In [33]:
df = df.fillna(0)
df.head(17)

CONTESTANT,Age,ElimWeek,Season,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,...,Dates2-OneonOneScore,Dates3-OneonOneScore,Dates4-OneonOneScore,Dates5-OneonOneScore,Dates6-OneonOneScore,Dates7-OneonOneScore,Dates8-OneonOneScore,Dates9-OneonOneScore,Dates10-OneonOneScore,FirstDate
1_RYAN_S,29,11.0,1,0.0,14.0,1.0,1.0,1.0,1.0,0.0,...,0.071429,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
1_CHARLIE_M,28,6.0,1,0.0,14.0,1.0,1.0,1.0,1.0,0.0,...,0.071429,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
1_RUSS_X,30,5.0,1,0.0,14.0,1.0,1.0,1.0,0.0,0.0,...,0.071429,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1_GREG_T,28,4.0,1,0.0,14.0,5.0,1.0,0.0,0.0,0.0,...,0.071429,0.2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_BOB_G,31,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_JAMIE_B,27,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_MIKE_X,24,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_ROB_X,29,3.0,1,0.0,14.0,5.0,0.0,0.0,0.0,0.0,...,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_BRIAN_C,28,2.0,1,0.0,14.0,0.0,0.0,0.0,0.0,0.0,...,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1_BRIAN_S,28,2.0,1,0.0,14.0,0.0,0.0,0.0,0.0,0.0,...,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [34]:
df.isna().sum()

Age                      0
ElimWeek                 0
Season                   0
DATES-1                  0
DATES-2                  0
DATES-3                  0
DATES-4                  0
DATES-5                  0
DATES-6                  0
DATES-7                  0
DATES-8                  0
DATES-9                  0
DATES-10                 0
Dates1-OneonOneScore     0
Dates2-OneonOneScore     0
Dates3-OneonOneScore     0
Dates4-OneonOneScore     0
Dates5-OneonOneScore     0
Dates6-OneonOneScore     0
Dates7-OneonOneScore     0
Dates8-OneonOneScore     0
Dates9-OneonOneScore     0
Dates10-OneonOneScore    0
FirstDate                0
dtype: int64

### Create more features

I want to create a ```Homestate``` feature (with regex and with get dummies). I will do this in my second iteration of the pipeline.
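A minimal sketch of what that could look like, assuming the ```Hometown``` column is kept (it is dropped earlier in this notebook) and that every value has the form "City, State". The three hometowns are taken from the first rows of the data.

```python
import pandas as pd

hometowns = pd.Series(
    ['Vail, Colorado', 'Hermosa Beach, California', 'San Rafael, California'],
    name='Hometown',
)

# Capture everything after the last comma as the state name
homestate = hometowns.str.extract(r',\s*([^,]+)$', expand=False).str.strip()

# One indicator column per state
dummies = pd.get_dummies(homestate, prefix='Homestate')
print(dummies)
```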

### Drop unneeded columns that were used to feature-engineer

In [35]:
df.columns

Index(['Age', 'ElimWeek', 'Season', 'DATES-1', 'DATES-2', 'DATES-3', 'DATES-4',
       'DATES-5', 'DATES-6', 'DATES-7', 'DATES-8', 'DATES-9', 'DATES-10',
       'Dates1-OneonOneScore', 'Dates2-OneonOneScore', 'Dates3-OneonOneScore',
       'Dates4-OneonOneScore', 'Dates5-OneonOneScore', 'Dates6-OneonOneScore',
       'Dates7-OneonOneScore', 'Dates8-OneonOneScore', 'Dates9-OneonOneScore',
       'Dates10-OneonOneScore', 'FirstDate'],
      dtype='object')

In [36]:
df['DATES-1'].value_counts()

0.0    264
Name: DATES-1, dtype: int64

In [37]:
df = df.drop(columns=['DATES-1', 'DATES-2', 'DATES-3', 'DATES-4', 'DATES-5',
       'DATES-6', 'DATES-7', 'DATES-8', 'DATES-9', 'DATES-10', 'Dates1-OneonOneScore', 'Dates4-OneonOneScore',
       'Dates5-OneonOneScore', 'Dates6-OneonOneScore', 'Dates7-OneonOneScore',
       'Dates8-OneonOneScore', 'Dates9-OneonOneScore', 'Dates10-OneonOneScore'])
df

CONTESTANT,Age,ElimWeek,Season,Dates2-OneonOneScore,Dates3-OneonOneScore,FirstDate
1_RYAN_S,29,11.0,1,0.071429,1.0,1.0
1_CHARLIE_M,28,6.0,1,0.071429,1.0,1.0
1_RUSS_X,30,5.0,1,0.071429,1.0,1.0
1_GREG_T,28,4.0,1,0.071429,0.2,1.0
1_BOB_G,31,3.0,1,0.071429,0.2,1.0
...,...,...,...,...,...,...
12_JAKE_P,26,1.0,12,0.000000,0.0,0.0
12_JONATHAN_H,29,1.0,12,0.000000,0.0,0.0
12_NICK_S,26,1.0,12,0.000000,0.0,0.0
12_PETER_M,26,1.0,12,0.000000,0.0,0.0


In [38]:
df['Dates2-OneonOneScore'].value_counts()

0.000000    109
0.071429     42
0.142857     38
0.125000     15
1.000000     15
0.076923     13
0.166667     12
0.083333     11
0.100000      9
Name: Dates2-OneonOneScore, dtype: int64

### Train, validate, test split

Using https://stats.stackexchange.com/questions/282803/response-is-an-integer-should-i-use-classification-or-regression as a reference, I have decided that I will use regression (not multi-class classification) to solve this problem.

> Regression is **supervised** machine learning technique for predicting a **continuous** target variable.
>
> -Codeup DS curriculum

In [39]:
# y is what I'm predicting: ElimWeek
# I'll stratify by ElimWeek

def train_validate_test(df, target):
    # split df into test (20%) and train_validate (80%)
    train_validate, test = train_test_split(df, stratify=df[target], test_size=.2, random_state=666)

    # split train_validate off into train (70% of 80% = 56%) and validate (30% of 80% = 24%)
    train, validate = train_test_split(train_validate, test_size=.3, random_state=666)
        
    # split train into X (dataframe, drop target) & y (series, keep target only)
    X_train = train.drop(columns=[target])
    y_train = train[target]
    
    # split validate into X (dataframe, drop target) & y (series, keep target only)
    X_validate = validate.drop(columns=[target])
    y_validate = validate[target]
    
    # split test into X (dataframe, drop target) & y (series, keep target only)
    X_test = test.drop(columns=[target])
    y_test = test[target]
    
    return X_train, y_train, X_validate, y_validate, X_test, y_test

In [40]:
# split data 
X_train, y_train, X_validate, y_validate, X_test, y_test = train_validate_test(df, 'ElimWeek')

In [41]:
X_train.head()

CONTESTANT,Age,Season,Dates2-OneonOneScore,Dates3-OneonOneScore,FirstDate
2_RYAN_M,30,2,0.142857,0.5,1.0
5_ADAM_D,27,5,0.0,0.0,0.0
4_GRAHAM_B,29,4,1.0,0.1,1.0
11_BEN_H,26,11,0.125,0.166667,1.0
10_BRADLEY_W,32,10,0.071429,0.083333,1.0


In [42]:
y_train.head()

CONTESTANT
2_RYAN_M        4.0
5_ADAM_D        1.0
4_GRAHAM_B      6.0
11_BEN_H        8.0
10_BRADLEY_W    3.0
Name: ElimWeek, dtype: float64

In [43]:
print('Shape of train:', X_train.shape, '| Shape of validate:', X_validate.shape, '| Shape of test:', X_test.shape)

Shape of train: (147, 5) | Shape of validate: (64, 5) | Shape of test: (53, 5)


I might split manually by season in a future iteration of this project.

Note to self from previous version of README:
> The goal of this project is to create a machine learning model that accurately predicts which contestant will win a given season of the ABC reality show The Bachelorette, based on data from Seasons 1-2 and 4-9 (train), Season 10 & 11 (validate), and Seasons 11 & 12 (test).

### Scale the data

I'll scale the data on my next iteration through the data science pipeline.
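For reference, here is a hand-written sketch of min-max scaling on made-up values. It mirrors what ```MinMaxScaler``` does with its default range, but it is written in plain pandas; the frame names and numbers are illustrative only. The key point is that the range is learned from train alone.

```python
import pandas as pd

# Made-up feature values standing in for X_train and X_validate
X_train_toy = pd.DataFrame({'Age': [24, 30, 36], 'Season': [1, 4, 8]})
X_validate_toy = pd.DataFrame({'Age': [27, 42], 'Season': [9, 10]})

# Learn the range from train only, so validate/test never leak into the scaler
col_min = X_train_toy.min()
col_range = X_train_toy.max() - col_min

X_train_scaled = (X_train_toy - col_min) / col_range
X_validate_scaled = (X_validate_toy - col_min) / col_range
print(X_validate_scaled)
```

Values outside the range seen in train (such as an age of 42 here) land outside 0 to 1, which is expected when the scaler is fit on train only.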

### Export data and make .py file

I will now place the code in this notebook into functions in a ```preprocessing.py``` file and move on to exploration and modeling.

## Notes to Self:

**Dec 4, 2020**

ElimWeek should be the target variable, not Win, to compensate for the overbearing correlation between the number of dates (over the span of a season) and Win. So I'll use regression to predict how many weeks a contestant will last.

Keep in mind that the predictor should be able to do so within the first few episodes of a season, so that it has the most value for future use.

Am I looking at a time series? It does not have dates, though, only weeks.

Tomorrow: Look up information about time series analysis that samples by week.

Alternatively, I could create new features out of the ```DATES``` features, such as:
   * Had a one-on-one? (boolean)
   * Went on a date in the first 2 episodes? (boolean)
   * Went on a date in the first 3 episodes? (boolean)
   * Went on a date in the first 4 episodes? (boolean)
   * Had the most one-on-one time during their time on the show (new calculation, which divides the calculated total date values -- per contestant, across all 10 episodes -- by the number of weeks the contestant was on the show, and rescales it)
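The boolean ideas in this list could be sketched as follows. The feature names (```DateInFirst2```, ```HadOneOnOne```, and so on) are hypothetical, and the toy frame only covers weeks 2 to 4 with made-up group sizes.

```python
import pandas as pd

# Made-up group sizes per week; 0 means no date that week
toy = pd.DataFrame({
    'DATES-2': [14.0, 0.0, 0.0],
    'DATES-3': [1.0, 0.0, 5.0],
    'DATES-4': [1.0, 0.0, 0.0],
})

# Hypothetical feature names; n counts episodes, and episode 1 has no dates
for n in (2, 3, 4):
    cols = [f'DATES-{week}' for week in range(2, n + 1)]
    toy[f'DateInFirst{n}'] = (toy[cols] > 0).any(axis=1).astype(int)

# Had at least one one-on-one (group size of exactly 1) in any listed week
date_cols = ['DATES-2', 'DATES-3', 'DATES-4']
toy['HadOneOnOne'] = (toy[date_cols] == 1).any(axis=1).astype(int)
print(toy)
```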

**Dec 5, 2020**

I have decided that I will save the time-series analysis for when I have more time for a future iteration. For now, I will focus on creating features that don't have a direct relationship with how many weeks a contestant will last. I could use linear regression and treat ElimWeek as a continuous variable, or I could use classification and treat it as a multi-class problem.

**Dec 5, 2020**

I decided to redo my train, validate, test splits, treating my data sort of like a time-series.

I'll split by Season for train, validate, and test.

i.e.,
```
Seasons 1 to 8 (63.6% of rows) | Seasons 9 and 10 (18.2%) | Seasons 11 and 12 (18.2%)
```

In [44]:
df.Season.value_counts()

5     29
12    25
1     25
10    24
9     24
2     24
11    23
8     23
7     23
6     23
4     21
Name: Season, dtype: int64

In [45]:
df.dtypes

Age                       int64
ElimWeek                float64
Season                    int64
Dates2-OneonOneScore    float32
Dates3-OneonOneScore    float32
FirstDate               float64
dtype: object

In [46]:
df[df['Season'] <= 8].shape

(168, 6)

In [47]:
df.loc[(df.Season == 9) | (df.Season == 10)].shape

(48, 6)

In [48]:
df.loc[(df.Season == 11) | (df.Season == 12)].shape

(48, 6)

In [49]:
def train_validate_test(df, target):
    '''
    This function manually splits the df into
    train (about 63.6% of rows), validate (about 18.2%), and test (about 18.2%),
    splitting by Season.
    '''
    # split df into train, validate, and test
    train = df[df['Season'] <= 8]
    validate = df.loc[(df.Season == 9) | (df.Season == 10)]
    test = df.loc[(df.Season == 11) | (df.Season == 12)]
        
    # split train into X & y
    X_train = train.drop(columns=[target])
    y_train = train[target]
    
    # split validate into X & y
    X_validate = validate.drop(columns=[target])
    y_validate = validate[target]
    
    # split test into X & y
    X_test = test.drop(columns=[target])
    y_test = test[target]
    
    print('Shape of train:', X_train.shape, '| Shape of validate:', X_validate.shape, '| Shape of test:', X_test.shape)

    return X_train, y_train, X_validate, y_validate, X_test, y_test, train, validate, test

Now I'll put this new split method into ```preprocessing.py``` and the main project notebook.
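A small sketch of the same idea with ```isin```, plus a check that every row lands in exactly one split. The toy frame has one row per season (the data has no Season 3), so the shares below are shares of seasons, not of contestants; the ```splits``` dictionary is my own illustration.

```python
import pandas as pd

# Toy frame: one row per season present in the data (there is no Season 3)
toy = pd.DataFrame({'Season': [1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12]})

# Hypothetical season assignment matching the split described above
splits = {
    'train': [1, 2, 4, 5, 6, 7, 8],
    'validate': [9, 10],
    'test': [11, 12],
}

parts = {name: toy[toy['Season'].isin(seasons)] for name, seasons in splits.items()}

# Every row lands in exactly one split
assert sum(len(p) for p in parts.values()) == len(toy)

shares = {name: round(len(p) / len(toy), 3) for name, p in parts.items()}
print(shares)
```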

# Preprocessing, continued

**Dec 6, 2020**

I'm going to parse some data from df.Hometown values.

In [50]:
states_regions = {
        'AK': 'O',
        'AL': 'S',
        'AR': 'S',
        'AS': 'O',
        'AZ': 'W',
        'CA': 'W',
        'CO': 'W',
        'CT': 'N',
        'DC': 'N',
        'DE': 'N',
        'FL': 'S',
        'GA': 'S',
        'GU': 'O',
        'HI': 'O',
        'IA': 'M',
        'ID': 'W',
        'IL': 'M',
        'IN': 'M',
        'KS': 'M',
        'KY': 'S',
        'LA': 'S',
        'MA': 'N',
        'MD': 'N',
        'ME': 'N',
        'MI': 'M',
        'MN': 'M',
        'MO': 'M',
        'MP': 'O',
        'MS': 'S',
        'MT': 'W',
        'NA': 'O',
        'NC': 'S',
        'ND': 'M',
        'NE': 'M',
        'NH': 'N',
        'NJ': 'N',
        'NM': 'W',
        'NV': 'W',
        'NY': 'N',
        'OH': 'M',
        'OK': 'S',
        'OR': 'W',
        'PA': 'N',
        'PR': 'O',
        'RI': 'N',
        'SC': 'S',
        'SD': 'M',
        'TN': 'S',
        'TX': 'S',
        'UT': 'W',
        'VA': 'S',
        'VI': 'O',
        'VT': 'N',
        'WA': 'W',
        'WI': 'M',
        'WV': 'S',
        'WY': 'W'
}

us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

In [51]:
#for key in us_state_abbrev:
#    df.loc[df['Homestate'].str.contains(key), 'HomestateAbbr'] = us_state_abbrev[key]

Putting this directly into preprocessing.py to use in main notebook.
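One caveat with matching state names by substring: ```str.contains('Kansas')``` also matches "Arkansas", so a loop over the dictionary can relabel Arkansas hometowns as KS. A sketch of an exact lookup instead, using small excerpts of the two dictionaries above; the first hometown is made up for illustration and the variable names are my own:

```python
import pandas as pd

# Small excerpts of the two dictionaries above, enough for the example rows
us_state_abbrev_excerpt = {'Arkansas': 'AR', 'Kansas': 'KS', 'Colorado': 'CO'}
states_regions_excerpt = {'AR': 'S', 'KS': 'M', 'CO': 'W'}

hometowns = pd.Series(['Little Rock, Arkansas', 'Derby, Kansas', 'Vail, Colorado'])

# Take the text after the last comma and look it up exactly;
# a substring test such as str.contains('Kansas') would also match 'Arkansas'
state = hometowns.str.rsplit(',', n=1).str[-1].str.strip()
abbr = state.map(us_state_abbrev_excerpt)
region = abbr.map(states_regions_excerpt)
print(pd.DataFrame({'state': state, 'abbr': abbr, 'region': region}))
```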