# Data Preprocessing for the Bachelorette Predictor
### Kwame V. Taylor

I need to preprocess the data in ```joined_df.csv``` to get it ready for modeling. After that, I will return to my main notebook, ```bachelorette-predictor.ipynb```, to re-explore the data and begin modeling.

### Set up Env

In [4]:
import numpy as np
import pandas as pd

### Import the tidied data

In [9]:
df = pd.read_csv('joined_df.csv', index_col=0)

In [17]:
df.head()

CONTESTANT | Name | Age | Occupation | Hometown | ElimWeek | Season | ELIMINATION-1 | ELIMINATION-2 | ELIMINATION-3 | ELIMINATION-4 | ... | DATES-1 | DATES-2 | DATES-3 | DATES-4 | DATES-5 | DATES-6 | DATES-7 | DATES-8 | DATES-9 | DATES-10
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1_RYAN_S | Ryan Sutter | 29 | Firefighter | Vail, Colorado | NaN | 1 | NaN | NaN | NaN | NaN | ... | NaN | D14 | D1 | D1 | D1 | D1 | NaN | NaN | NaN | NaN
1_CHARLIE_M | Charlie Maher | 28 | Financial Analyst | Hermosa Beach, California | 6.0 | 1 | NaN | NaN | NaN | NaN | ... | NaN | D14 | D1 | D1 | D1 | D1 | NaN | NaN | NaN | NaN
1_RUSS_X | Russ | 30 | Writer | San Rafael, California | 5.0 | 1 | NaN | NaN | NaN | NaN | ... | NaN | D14 | D1 | D1 | D1 | NaN | NaN | NaN | NaN | NaN
1_GREG_T | Greg T. | 28 | Importer | Manhattan, New York | 4.0 | 1 | NaN | NaN | NaN | E | ... | NaN | D14 | D5 | D1 | NaN | NaN | NaN | NaN | NaN | NaN
1_BOB_G | Bob Guiney | 31 | Mortgage Broker | Ferndale, Michigan | 3.0 | 1 | NaN | NaN | E | NaN | ... | NaN | D14 | D5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN


### Summarize the data

In [15]:
df.Age.value_counts()

27    43
28    40
29    31
26    29
31    28
30    24
25    18
33    14
32    13
35     7
34     6
24     3
36     3
42     1
41     1
37     1
38     1
23     1
Name: Age, dtype: int64

In [18]:
df.Hometown.value_counts()

Dallas, Texas                7
San Francisco, California    6
Chicago, Illinois            4
Boston, Massachusetts        4
San Diego, California        3
                            ..
Huntsville, Texas            1
Santa Monica, California     1
Raleigh, North Carolina      1
Carthage, Illinois           1
Lockport, Illinois           1
Name: Hometown, Length: 225, dtype: int64

In [19]:
df.ElimWeek.value_counts()

1.0     79
2.0     37
3.0     33
4.0     24
5.0     23
6.0     18
7.0     16
8.0     10
9.0      8
10.0     7
Name: ElimWeek, dtype: int64

In [25]:
df.Season.value_counts()

5     29
12    25
1     25
10    24
9     24
2     24
11    23
8     23
7     23
6     23
4     21
Name: Season, dtype: int64

### Get dummies for ```DATES``` and ```ELIMINATION``` features

Header          | Description
----------------|------------
`ELIMINATION-1` | Who was eliminated in week 1
`ELIMINATION-2` | Who was eliminated in week 2
`ELIMINATION-3` | Who was eliminated in week 3
`ELIMINATION-4` | Who was eliminated in week 4
`ELIMINATION-5` | Who was eliminated in week 5
`ELIMINATION-6` | Who was eliminated in week 6
`ELIMINATION-7` | Who was eliminated in week 7
`ELIMINATION-8` | Who was eliminated in week 8
`ELIMINATION-9` | Who was eliminated in week 9
`ELIMINATION-10`| Who was eliminated in week 10
`DATES-1`       | Who was on which date in week 1
`DATES-2`       | Who was on which date in week 2
`DATES-3`       | Who was on which date in week 3
`DATES-4`       | Who was on which date in week 4
`DATES-5`       | Who was on which date in week 5
`DATES-6`       | Who was on which date in week 6
`DATES-7`       | Who was on which date in week 7
`DATES-8`       | Who was on which date in week 8
`DATES-9`       | Who was on which date in week 9
`DATES-10`      | Who was on which date in week 10

- Values in the `ELIMINATION` columns denote either an elimination (starts with "E") or a rose (starts with "R").
- Eliminations supersede roses.
- "E" denotes a standard elimination, typically at a rose ceremony. "EQ" means the contestant quit. "EF" means the contestant was fired by production. "ED" denotes a date elimination. "EU" denotes an unscheduled elimination, one that takes place outside of a date or rose ceremony.
- "R" means the contestant received a rose. "R1" means the contestant got a first impression rose.
- "D1" means a one-on-one date, "D2" means a 2-on-1, "D3" means a 3-on-1 group date, and so on.
- Weeks of the show are delimited by rose ceremonies, and may not line up exactly with episodes.
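
The legend above could be turned into small lookup helpers. This is only a sketch: `date_group_size`, `elimination_label`, and `ELIMINATION_LABELS` are hypothetical names that do not appear in the notebook, and the labels are my own shorthand for the codes described in the list.

```python
import re

import pandas as pd


def date_group_size(code):
    """Return the number of contestants on a date ("D14" -> 14), or None."""
    if pd.isna(code):
        return None
    match = re.fullmatch(r"D(\d+)", str(code).strip())
    return int(match.group(1)) if match else None


# Hypothetical labels for the ELIMINATION codes in the legend.
# Longer prefixes come first so "EQ" is not mistaken for a plain "E".
ELIMINATION_LABELS = {
    "EQ": "quit",
    "EF": "fired by production",
    "ED": "date elimination",
    "EU": "unscheduled elimination",
    "E": "standard elimination",
    "R1": "first impression rose",
    "R": "rose",
}


def elimination_label(code):
    """Map an ELIMINATION code to a label, or None if missing/unknown."""
    if pd.isna(code):
        return None
    code = str(code).strip()
    for prefix, label in ELIMINATION_LABELS.items():
        if code.startswith(prefix):
            return label
    return None


print(date_group_size("D14"))
print(elimination_label("EQ"))
```

The dictionary relies on insertion order (guaranteed in Python 3.7+) so that two-character codes are matched before their one-character prefixes.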

For the sake of my MVP, I'm just going to get dummies for dates, not eliminations.
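
A minimal sketch of that step, assuming `pd.get_dummies` is applied to every column whose name starts with `DATES-`. The toy frame uses three rows and three date columns from the `df.head()` output; `toy`, `date_cols`, and `encoded` are illustrative names, not ones used in the notebook.

```python
import pandas as pd

# Toy frame with values copied from the df.head() output.
toy = pd.DataFrame(
    {
        "Age": [29, 28, 31],
        "DATES-2": ["D14", "D14", "D14"],
        "DATES-3": ["D1", "D1", "D5"],
        "DATES-4": ["D1", "D1", None],
    },
    index=pd.Index(["1_RYAN_S", "1_CHARLIE_M", "1_BOB_G"], name="CONTESTANT"),
)

# Select the date columns by prefix instead of listing all ten by hand.
date_cols = [col for col in toy.columns if col.startswith("DATES-")]

# One indicator column per (week, date code) pair, e.g. "DATES-3_D5".
# A missing value (no date that week) becomes all zeros for that week.
encoded = pd.get_dummies(toy, columns=date_cols, dtype=int)

print(encoded.columns.tolist())
```

Passing `dummy_na=True` would instead add an explicit "no date" indicator per week, which may or may not be useful for the model.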

### Create features

### Create target feature (to predict on)
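
The notebook does not yet say what the target will be. One possibility, sketched here under an assumption the data does not confirm, is a binary flag for contestants with no recorded `ElimWeek`. The column name `never_eliminated` is hypothetical.

```python
import numpy as np
import pandas as pd

# Age and ElimWeek for three rows from the df.head() output.
toy = pd.DataFrame(
    {"Age": [29, 28, 30], "ElimWeek": [np.nan, 6.0, 5.0]},
    index=pd.Index(["1_RYAN_S", "1_CHARLIE_M", "1_RUSS_X"], name="CONTESTANT"),
)

# ASSUMPTION: a missing ElimWeek means the contestant was never eliminated.
# This should be checked against the source data before it is relied on.
toy["never_eliminated"] = toy["ElimWeek"].isna().astype(int)

print(toy)
```

If the goal is instead to predict how long a contestant lasts, `ElimWeek` itself could serve as the target, with the missing values handled separately.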

### Drop unnecessary features / choose features for my MVP

In [27]:
df.head()

CONTESTANT | Name | Age | Occupation | Hometown | ElimWeek | Season | ELIMINATION-1 | ELIMINATION-2 | ELIMINATION-3 | ELIMINATION-4 | ... | DATES-1 | DATES-2 | DATES-3 | DATES-4 | DATES-5 | DATES-6 | DATES-7 | DATES-8 | DATES-9 | DATES-10
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1_RYAN_S | Ryan Sutter | 29 | Firefighter | Vail, Colorado | NaN | 1 | NaN | NaN | NaN | NaN | ... | NaN | D14 | D1 | D1 | D1 | D1 | NaN | NaN | NaN | NaN
1_CHARLIE_M | Charlie Maher | 28 | Financial Analyst | Hermosa Beach, California | 6.0 | 1 | NaN | NaN | NaN | NaN | ... | NaN | D14 | D1 | D1 | D1 | D1 | NaN | NaN | NaN | NaN
1_RUSS_X | Russ | 30 | Writer | San Rafael, California | 5.0 | 1 | NaN | NaN | NaN | NaN | ... | NaN | D14 | D1 | D1 | D1 | NaN | NaN | NaN | NaN | NaN
1_GREG_T | Greg T. | 28 | Importer | Manhattan, New York | 4.0 | 1 | NaN | NaN | NaN | E | ... | NaN | D14 | D5 | D1 | NaN | NaN | NaN | NaN | NaN | NaN
1_BOB_G | Bob Guiney | 31 | Mortgage Broker | Ferndale, Michigan | 3.0 | 1 | NaN | NaN | E | NaN | ... | NaN | D14 | D5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN


In [26]:
df.columns

Index(['Name', 'Age', 'Occupation', 'Hometown', 'ElimWeek', 'Season',
       'ELIMINATION-1', 'ELIMINATION-2', 'ELIMINATION-3', 'ELIMINATION-4',
       'ELIMINATION-5', 'ELIMINATION-6', 'ELIMINATION-7', 'ELIMINATION-8',
       'ELIMINATION-9', 'ELIMINATION-10', 'DATES-1', 'DATES-2', 'DATES-3',
       'DATES-4', 'DATES-5', 'DATES-6', 'DATES-7', 'DATES-8', 'DATES-9',
       'DATES-10'],
      dtype='object')

In [30]:
# Keep only Age, ElimWeek, Season, and the DATES-* columns for the MVP.
df = df.drop(columns=[
    'ELIMINATION-1', 'ELIMINATION-2', 'ELIMINATION-3', 'ELIMINATION-4', 'ELIMINATION-5',
    'ELIMINATION-6', 'ELIMINATION-7', 'ELIMINATION-8', 'ELIMINATION-9', 'ELIMINATION-10',
    'Hometown', 'Occupation', 'Name',
])

In [33]:
df

CONTESTANT | Age | ElimWeek | Season | DATES-1 | DATES-2 | DATES-3 | DATES-4 | DATES-5 | DATES-6 | DATES-7 | DATES-8 | DATES-9 | DATES-10
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1_RYAN_S | 29 | NaN | 1 | NaN | D14 | D1 | D1 | D1 | D1 | NaN | NaN | NaN | NaN
1_CHARLIE_M | 28 | 6.0 | 1 | NaN | D14 | D1 | D1 | D1 | D1 | NaN | NaN | NaN | NaN
1_RUSS_X | 30 | 5.0 | 1 | NaN | D14 | D1 | D1 | D1 | NaN | NaN | NaN | NaN | NaN
1_GREG_T | 28 | 4.0 | 1 | NaN | D14 | D5 | D1 | NaN | NaN | NaN | NaN | NaN | NaN
1_BOB_G | 31 | 3.0 | 1 | NaN | D14 | D5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
12_JAKE_P | 26 | 1.0 | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
12_JONATHAN_H | 29 | 1.0 | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
12_NICK_S | 26 | 1.0 | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
12_PETER_M | 26 | 1.0 | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN


### Scale the data
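
The scaling method has not been chosen yet. As one option, here is a sketch of min-max scaling for `Age` in plain pandas. `min_max_scale` is a hypothetical helper, and the ages are the five shown in the `df.head()` output.

```python
import pandas as pd

ages = pd.Series(
    [29, 28, 30, 28, 31],
    index=["1_RYAN_S", "1_CHARLIE_M", "1_RUSS_X", "1_GREG_T", "1_BOB_G"],
    name="Age",
)


def min_max_scale(values):
    """Rescale a numeric Series to [0, 1]; a constant Series maps to 0."""
    span = values.max() - values.min()
    if span == 0:
        return values * 0.0
    return (values - values.min()) / span


age_scaled = min_max_scale(ages)
print(age_scaled)
```

When modeling, the minimum and maximum should come from the training split only and then be reused on the validation and test splits, so that no information leaks from held-out data.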

### Export data and make .py file