## Data Set

The data set we will be using is the 2017 Halloween Candy Hierarchy data set, discussed in this [boingboing](https://boingboing.net/2017/10/30/the-2017-halloween-candy-hiera.html) article. You can also read more about the data in the [Science Creative Quarterly](https://www.scq.ubc.ca/so-much-candy-data-seriously/).

The following are the rating instructions from the survey:  

> Basically, consider that feeling you get when you receive this item in your Halloween haul. Does it make you really happy (JOY)? Or is it something that you automatically place in the junk pile (DESPAIR)? MEH for indifference, and you can leave blank if you have no idea what the item is.

Note that the data set used here has been slightly altered from its original state. If you want to perform any analysis for future projects, download the data directly from the links above.

This is a great example of a messy data set, especially since respondents were allowed to enter free text in a number of the fields. Note also that some of the comments in the file might be considered inappropriate by some readers, but cleaning this type of data is a normal part of many data science projects.

In [10]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 20)

In [11]:
# Read the survey file and keep an untouched copy in candy_full
candy_full = pd.read_csv('candy.csv', encoding='iso-8859-1')
candy = candy_full.copy()

In [12]:

candy.head()

Unnamed: 0,Internal ID,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,"Q5: STATE, PROVINCE, COUNTY, ETC",Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,...,Q8: DESPAIR OTHER,Q9: OTHER COMMENTS,Q10: DRESS,Unnamed: 113,Q11: DAY,Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo],"Click Coordinates (x, y)"
0,90258773,,,,,,,,,,...,,,,,,,,,,
1,90272821,No,Male,44.0,USA,NM,MEH,DESPAIR,JOY,MEH,...,,Bottom line is Twix is really the only candy w...,White and gold,,Sunday,,1.0,,,"(84, 25)"
2,90272829,,Male,49.0,USA,Virginia,,,,,...,,,,,,,,,,
3,90272840,No,Male,40.0,us,or,MEH,DESPAIR,JOY,MEH,...,,Raisins can go to hell,White and gold,,Sunday,,1.0,,,"(75, 23)"
4,90272841,No,Male,23.0,usa,exton pa,JOY,DESPAIR,JOY,DESPAIR,...,,,White and gold,,Friday,,1.0,,,"(70, 10)"


In [13]:
candy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2479 entries, 0 to 2478
Columns: 120 entries, Internal ID to Click Coordinates (x, y)
dtypes: float64(4), int64(1), object(115)
memory usage: 2.3+ MB


In [14]:
candy.info(max_cols=120)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2479 entries, 0 to 2478
Data columns (total 120 columns):
 #    Column                                                                                 Non-Null Count  Dtype  
---   ------                                                                                 --------------  -----  
 0    Internal ID                                                                            2479 non-null   int64  
 1    Q1: GOING OUT?                                                                         2368 non-null   object 
 2    Q2: GENDER                                                                             2437 non-null   object 
 3    Q3: AGE                                                                                2394 non-null   object 
 4    Q4: COUNTRY                                                                            2414 non-null   object 
 5    Q5: STATE, PROVINCE, COUNTY, ETC                                   

In [15]:
for col in candy.columns:
    print(col)

Internal ID
Q1: GOING OUT?
Q2: GENDER
Q3: AGE
Q4: COUNTRY
Q5: STATE, PROVINCE, COUNTY, ETC
Q6 | 100 Grand Bar
Q6 | Anonymous brown globs that come in black and orange wrappers	(a.k.a. Mary Janes)
Q6 | Any full-sized candy bar
Q6 | Black Jacks
Q6 | Bonkers (the candy)
Q6 | Bonkers (the board game)
Q6 | Bottle Caps
Q6 | Box'o'Raisins
Q6 | Broken glow stick
Q6 | Butterfinger
Q6 | Cadbury Creme Eggs
Q6 | Candy Corn
Q6 | Candy that is clearly just the stuff given out for free at restaurants
Q6 | Caramellos
Q6 | Cash, or other forms of legal tender
Q6 | Chardonnay
Q6 | Chick-o-Sticks (we donÕt know what that is)
Q6 | Chiclets
Q6 | Coffee Crisp
Q6 | Creepy Religious comics/Chick Tracts
Q6 | Dental paraphenalia
Q6 | Dots
Q6 | Dove Bars
Q6 | Fuzzy Peaches
Q6 | Generic Brand Acetaminophen
Q6 | Glow sticks
Q6 | Goo Goo Clusters
Q6 | Good N' Plenty
Q6 | Gum from baseball cards
Q6 | Gummy Bears straight up
Q6 | Hard Candy
Q6 | Healthy Fruit
Q6 | Heath Bar
Q6 | Hershey's Dark Chocolate
Q6 | HersheyÕ

In [16]:
candy.rename(columns=lambda x: x.replace('Õ', "'"), inplace=True)
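
One plausible explanation for the stray `Õ` (an assumption; the source does not say how the file was produced) is that the file was saved in Mac OS Roman, where the byte `0xD5` is the typographic apostrophe `’`. Reading that byte as `iso-8859-1` yields `Õ`. The round trip can be checked with the standard codecs:

```python
# Assumption: the CSV was written as Mac OS Roman but read as ISO-8859-1.
curly_apostrophe = '\u2019'                      # the typographic apostrophe
raw_byte = curly_apostrophe.encode('mac_roman')  # a single byte, 0xD5
misread = raw_byte.decode('iso-8859-1')          # what an ISO-8859-1 reader sees

print(raw_byte, misread)
```

If that is what happened, passing `encoding='mac_roman'` to `pd.read_csv` might make the rename step unnecessary, though this has not been verified against `candy.csv`.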

In [17]:
Q1 = candy.duplicated().sum()
Q1

17

In [18]:
Q2 = candy.duplicated(subset=['Internal ID']).sum()
Q2

19
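
The two counts differ because `duplicated()` flags a row only when every column matches an earlier row, whereas `subset=['Internal ID']` compares the ID alone. A toy frame (made-up values, not taken from the survey) illustrates the difference:

```python
import pandas as pd

toy = pd.DataFrame({
    'Internal ID': [1, 1, 2, 2, 3],
    'answer':      ['JOY', 'JOY', 'MEH', 'DESPAIR', 'JOY'],
})

# Rows identical to an earlier row in every column
full_row_dupes = toy.duplicated().sum()

# Rows whose ID already appeared, regardless of the other columns
id_dupes = toy.duplicated(subset=['Internal ID']).sum()

print(full_row_dupes, id_dupes)
```

Here the second row with ID 2 has a different answer, so it counts as a duplicate only when the comparison is restricted to the ID.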

In [19]:
candy.drop_duplicates(subset=['Internal ID'], inplace=True)

In [20]:
candy = candy.drop(columns = ['Internal ID','Q5: STATE, PROVINCE, COUNTY, ETC','Q7: JOY OTHER','Q8: DESPAIR OTHER',
                               'Q9: OTHER COMMENTS','Unnamed: 113','Click Coordinates (x, y)'])

In [21]:
candy.shape

(2460, 113)

In [22]:
candy['Q2: GENDER'].value_counts()

Q2: GENDER
Male                  1466
Female                 839
I'd rather not say      83
Other                   30
Name: count, dtype: int64

In [23]:
Q3 = candy['Q2: GENDER'].isnull().sum()
Q3

42

In [24]:
candy.dropna(subset=['Q2: GENDER'], inplace=True)

**Exercise 6:** For this project, we want to use binary classification, which predicts one of two classes. Here the two classes are `Male` and `Female`, so select only the rows that contain either `Male` or `Female` in the `Q2: GENDER` column.

In [25]:
# .copy() makes the filtered frame independent, so later column assignments do not warn
candy = candy[candy['Q2: GENDER'].isin(['Male', 'Female'])].copy()

In [26]:
candy.shape

(2305, 113)

In [27]:
Q4 = candy['Q1: GOING OUT?'].isnull().sum()
Q4

77

In [28]:
# Assign the result back; inplace=True on a selected column is chained assignment
candy['Q1: GOING OUT?'] = candy['Q1: GOING OUT?'].fillna('No')

In [29]:
candy['Q1: GOING OUT?'].value_counts().sum()
candy.head()

Unnamed: 0,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,Q6 | Bonkers (the candy),Q6 | Bonkers (the board game),...,Q6 | Whatchamacallit Bars,Q6 | White Bread,Q6 | Whole Wheat anything,Q6 | York Peppermint Patties,Q10: DRESS,Q11: DAY,Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo]
1,No,Male,44.0,USA,MEH,DESPAIR,JOY,MEH,DESPAIR,DESPAIR,...,DESPAIR,DESPAIR,DESPAIR,DESPAIR,White and gold,Sunday,,1.0,,
2,No,Male,49.0,USA,,,,,,,...,,,,,,,,,,
3,No,Male,40.0,us,MEH,DESPAIR,JOY,MEH,MEH,DESPAIR,...,JOY,DESPAIR,DESPAIR,DESPAIR,White and gold,Sunday,,1.0,,
4,No,Male,23.0,usa,JOY,DESPAIR,JOY,DESPAIR,MEH,DESPAIR,...,JOY,DESPAIR,DESPAIR,JOY,White and gold,Friday,,1.0,,
5,No,Male,,,JOY,DESPAIR,JOY,,,,...,JOY,DESPAIR,DESPAIR,JOY,,,,1.0,,


In [30]:
candy_slice = candy.loc[:, 'Q6 | 100 Grand Bar':'Q11: DAY']
candy_slice

Unnamed: 0,Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,Q6 | Bonkers (the candy),Q6 | Bonkers (the board game),Q6 | Bottle Caps,Q6 | Box'o'Raisins,Q6 | Broken glow stick,Q6 | Butterfinger,...,Q6 | Trail Mix,Q6 | Twix,"Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein",Q6 | Vicodin,Q6 | Whatchamacallit Bars,Q6 | White Bread,Q6 | Whole Wheat anything,Q6 | York Peppermint Patties,Q10: DRESS,Q11: DAY
1,MEH,DESPAIR,JOY,MEH,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,...,DESPAIR,JOY,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,White and gold,Sunday
2,,,,,,,,,,,...,,,,,,,,,,
3,MEH,DESPAIR,JOY,MEH,MEH,DESPAIR,MEH,DESPAIR,DESPAIR,MEH,...,MEH,JOY,DESPAIR,JOY,JOY,DESPAIR,DESPAIR,DESPAIR,White and gold,Sunday
4,JOY,DESPAIR,JOY,DESPAIR,MEH,DESPAIR,MEH,DESPAIR,DESPAIR,MEH,...,DESPAIR,JOY,MEH,JOY,JOY,DESPAIR,DESPAIR,JOY,White and gold,Friday
5,JOY,DESPAIR,JOY,,,,MEH,MEH,DESPAIR,JOY,...,MEH,JOY,DESPAIR,DESPAIR,JOY,DESPAIR,DESPAIR,JOY,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2474,JOY,DESPAIR,MEH,DESPAIR,DESPAIR,MEH,MEH,DESPAIR,DESPAIR,MEH,...,JOY,JOY,MEH,JOY,DESPAIR,MEH,DESPAIR,MEH,White and gold,Friday
2475,MEH,DESPAIR,JOY,,,,,DESPAIR,DESPAIR,JOY,...,DESPAIR,JOY,,,JOY,DESPAIR,MEH,JOY,Blue and black,Friday
2476,MEH,DESPAIR,JOY,DESPAIR,MEH,JOY,DESPAIR,MEH,MEH,DESPAIR,...,MEH,MEH,MEH,JOY,MEH,DESPAIR,DESPAIR,MEH,Blue and black,Friday
2477,,,,,,,,,,,...,,,,,,,,,,


In [31]:
candy.loc[:, 'Q6 | 100 Grand Bar':'Q11: DAY'] = candy.loc[:, 'Q6 | 100 Grand Bar':'Q11: DAY'].fillna('NO_ANSWER')

In [32]:
candy.loc[:,'Q12: MEDIA [Daily Dish]':'Q12: MEDIA [Yahoo]'] = candy.loc[:,'Q12: MEDIA [Daily Dish]':'Q12: MEDIA [Yahoo]'].fillna(0.0)
candy.head()

Unnamed: 0,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,Q6 | Bonkers (the candy),Q6 | Bonkers (the board game),...,Q6 | Whatchamacallit Bars,Q6 | White Bread,Q6 | Whole Wheat anything,Q6 | York Peppermint Patties,Q10: DRESS,Q11: DAY,Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo]
1,No,Male,44.0,USA,MEH,DESPAIR,JOY,MEH,DESPAIR,DESPAIR,...,DESPAIR,DESPAIR,DESPAIR,DESPAIR,White and gold,Sunday,0.0,1.0,0.0,0.0
2,No,Male,49.0,USA,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,0.0,0.0,0.0,0.0
3,No,Male,40.0,us,MEH,DESPAIR,JOY,MEH,MEH,DESPAIR,...,JOY,DESPAIR,DESPAIR,DESPAIR,White and gold,Sunday,0.0,1.0,0.0,0.0
4,No,Male,23.0,usa,JOY,DESPAIR,JOY,DESPAIR,MEH,DESPAIR,...,JOY,DESPAIR,DESPAIR,JOY,White and gold,Friday,0.0,1.0,0.0,0.0
5,No,Male,,,JOY,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,JOY,DESPAIR,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,0.0,1.0,0.0,0.0


In [33]:
candy.loc[:,'Q6 | 100 Grand Bar':'Q12: MEDIA [Yahoo]'].isnull().sum()

Q6 | 100 Grand Bar                                                                        0
Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)    0
Q6 | Any full-sized candy bar                                                             0
Q6 | Black Jacks                                                                          0
Q6 | Bonkers (the candy)                                                                  0
                                                                                         ..
Q11: DAY                                                                                  0
Q12: MEDIA [Daily Dish]                                                                   0
Q12: MEDIA [Science]                                                                      0
Q12: MEDIA [ESPN]                                                                         0
Q12: MEDIA [Yahoo]                                                              

In [34]:
candy['Q4: COUNTRY'].unique()

array(['USA ', 'USA', 'us', 'usa', nan, 'canada', 'Canada', 'Us', 'US',
       'Murica', 'United States', 'uk', 'United Kingdom', 'united states',
       'Usa', 'United States ', 'United staes',
       'United States of America', 'UAE', 'England', 'UK', 'canada ',
       'United states', 'u.s.a.', '35', 'france',
       'United States of America ', 'america', 'U.S.A.', 'finland',
       'unhinged states', 'Mexico', 'Canada ', 'united states of america',
       'US of A', 'The United States', 'North Carolina ', 'Unied States',
       'Netherlands', 'germany', 'Europe', 'U S', 'u.s.', 'U.K. ',
       'Costa Rica', 'The United States of America', 'unite states',
       'U.S.', '46', 'Australia', 'Greece', 'USA? Hard to tell anymore..',
       "'merica", '45', 'United State', '32', 'France', 'australia',
       'Can', 'Canae', 'Trumpistan', 'Ireland', 'United Sates', 'Korea',
       'California', 'Unites States', 'Japan', 'USa', 'South africa',
       'I pretend to be from Canada, but I am

In [35]:
candy['Q4: COUNTRY'].nunique()

115

In [36]:
candy['Q4: COUNTRY'] = candy['Q4: COUNTRY'].fillna('Other')

In [37]:
candy['Q4: COUNTRY'].isnull().sum()

0

In [38]:
candy['Q4: COUNTRY'].nunique()

116

In [39]:
candy['Q4: COUNTRY'] = candy['Q4: COUNTRY'].replace(['australia', 'Australia'], 'Other')

In [40]:
candy['Q4: COUNTRY'].nunique()

114

In [41]:
import re
us_aliases = [
    'USA ', 'USA', 'us', 'usa', 'Us', 'US', 'Murica', 'United States', 'united states',
    'Usa', 'United States ', 'United staes', 'United States of America', 'United states',
    'u.s.a.', 'United States of America ', 'america', 'U.S.A.', 'unhinged states',
    'united states of america', 'US of A', 'The United States', 'North Carolina ',
    'Unied States', 'U S', 'u.s.', 'The United States of America', 'unite states',
    'U.S.', 'USA? Hard to tell anymore..', "'merica", 'United State', 'United Sates',
    'California', 'Unites States', 'USa',
    'I pretend to be from Canada, but I am really from the United States.', 'Usa ', 'United Stated',
    'New Jersey', 'United ststes', 'America', 'United Statss', 'murrika', 'USA! USA! USA!',
    'USAA', 'united States ', 'N. America', 'USSA', 'U.S. ', 'u s a', 'United Statea', 'united ststes',
    'USAUSAUSA!!!!'
]

# Case-insensitivity is requested with case=False; an inline (?i) flag that is
# not at the very start of a pattern raises an error on Python 3.11+.
pattern = '|'.join(re.escape(alias) for alias in us_aliases)

# The pattern is unanchored, so an alias is replaced wherever it occurs in a value.
candy['Q4: COUNTRY'] = candy['Q4: COUNTRY'].str.replace(pattern, 'USA', case=False, regex=True)

# Collapse leftovers produced by the substring replacement.
candy['Q4: COUNTRY'] = candy['Q4: COUNTRY'].replace(
    ['USAof USA', 'USAof A', 'USA? Hard to tell anymore..', 'USAd', 'USA! USA! USA!',
     'USAA', 'USASA', 'USAa', 'USAUSAUSA!!!!'], 'USA')

In [42]:
candy['Q4: COUNTRY'].unique()

array(['USA', 'Other', 'canada', 'Canada', 'uk', 'United Kingdom', 'USA ',
       'USA of USA', 'UAE', 'England', 'UK', 'canada ', '35', 'france',
       'USA of USA ', 'finland', 'Mexico', 'Canada ', 'USA of A',
       'Netherlands', 'germany', 'Europe', 'U.K. ', 'Costa Rica', '46',
       'Greece', '45', '32', 'France', 'Can', 'Canae', 'Trumpistan',
       'Ireland', 'Korea', 'Japan', 'South africa', 'Uk', 'Germany',
       'Canada`', 'Scotland', 'UK ', 'Denmark', 'France ', 'Switzerland',
       'UD', 'Scotland ', 'South Korea', 'CANADA', 'Indonesia',
       'The Netherlands', 'endland', 'soviet canuckistan', 'Singapore',
       'China', 'Taiwan', 'Ireland ', 'hong kong', 'spain', 'Sweden',
       'Hong Kong', 'Narnia', 'USA a', 'subscribe to dm4uz3 on youtube',
       'United kingdom', "I don't know anymore", 'Fear and Loathing'],
      dtype=object)
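
Values such as `'USA of USA'` in the output above arise because the pattern is unanchored: each alias is replaced wherever it occurs inside a longer answer. The sketch below reproduces the effect with plain `re` on a shortened alias list (a hypothetical subset chosen for illustration) and contrasts it with whole-string matching via `fullmatch`. The helper names `replace_anywhere` and `replace_whole` are made up for this example.

```python
import re

# Hypothetical subset; 'United States' precedes the longer alias, as in us_aliases.
aliases = ['United States', 'United States of America', 'america', 'us']
pattern = re.compile('|'.join(re.escape(a) for a in aliases), flags=re.IGNORECASE)

def replace_anywhere(value):
    # Unanchored: every occurrence of an alias inside the value is substituted.
    return pattern.sub('USA', value)

def replace_whole(value):
    # Anchored: rewrite only when the entire (stripped) value is one alias.
    return 'USA' if pattern.fullmatch(value.strip()) else value

for answer in ['United States of America', 'US of A', 'Australia']:
    print(repr(answer), '->', repr(replace_anywhere(answer)), '|', repr(replace_whole(answer)))
```

Whole-string matching leaves unrelated answers such as `'Australia'` untouched, at the cost of needing every spelling listed explicitly.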

In [43]:
candy['Q4: COUNTRY'] = candy['Q4: COUNTRY'].replace({'canada':'CA', 'Canada':'CA', 'canada ':'CA', 'Canada ':'CA', 'Can':'CA', 'Canae':'CA', 'Canada`':'CA', 'CANADA':'CA'})


In [44]:
candy['Q4: COUNTRY'].nunique()

59

In [45]:
# Spellings that are grouped under the 'EU' label
eu_aliases = [
    'uk', 'United Kingdom', 'England', 'UK', 'france', 'finland', 'Netherlands', 'germany',
    'Europe', 'U.K. ', 'Greece', 'France', 'Ireland', 'Uk', 'Germany', 'Scotland', 'UK ',
    'Denmark', 'France ', 'Switzerland', 'Scotland ', 'The Netherlands', 'Ireland ',
    'spain', 'Sweden', 'United kingdom'
]
candy['Q4: COUNTRY'] = candy['Q4: COUNTRY'].replace(eu_aliases, 'EU')

In [46]:
candy['Q4: COUNTRY'].nunique()

34

In [47]:
candy['Q4: COUNTRY'] = candy['Q4: COUNTRY'].replace({'UAE':'Other', '35':'Other', 'Mexico':'Other', 'Costa Rica':'Other',
       '46':'Other', '45':'Other', '32':'Other', 'Trumpistan':'Other', 'Korea':'Other', 'Japan':'Other', 'South africa':'Other',
       'UD':'Other', 'South Korea':'Other', 'Indonesia':'Other', 'endland':'Other', 'soviet canuckistan':'Other',
       'Singapore':'Other', 'China':'Other', 'Taiwan':'Other', 'hong kong':'Other', 'Hong Kong':'Other', 'Narnia':'Other',
       'subscribe to dm4uz3 on youtube':'Other', "I don't know anymore":'Other',
       'Fear and Loathing':'Other'})

In [48]:
candy['Q4: COUNTRY'].unique()

array(['USA', 'Other', 'CA', 'EU', 'USA ', 'USA of USA', 'USA of USA ',
       'USA of A', 'USA a'], dtype=object)
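
The array above still holds several spellings that evidently mean the United States (`'USA '`, `'USA of USA'`, `'USA a'`, and so on). One possible way to fold them together, assuming every remaining value that starts with `USA` refers to the same country, is to strip whitespace and then overwrite by prefix. The series below is a stand-in built from the values shown above, not the real column:

```python
import pandas as pd

# Stand-in for candy['Q4: COUNTRY'], using the unique values printed above
country = pd.Series(['USA', 'Other', 'CA', 'EU', 'USA ', 'USA of USA',
                     'USA of USA ', 'USA of A', 'USA a'])

# Remove stray leading/trailing whitespace, then collapse every 'USA...' variant
country = country.str.strip()
country = country.mask(country.str.startswith('USA'), 'USA')

print(sorted(country.unique()))
```

On this stand-in the result has four labels: `CA`, `EU`, `Other`, and `USA`.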

In [49]:
candy['Q3: AGE'].unique()

array(['44', '49', '40', '23', nan, '53', '33', '43', '56', '64', '37',
       '48', '54', '36', '45', '25', '34', '35', '38', '58', '50', '47',
       '16', '52', '63', '65', '41', '27', '31', '59', '61', '46', '42',
       '62', '29', '39', '32', '28', '69', '67', '30', '22', '51', '70',
       '24', '19', 'Old enough', '57', '60', '66', '12', 'Many', '55',
       '72', '?', '21', '11', 'no', '9', '68', '20', '6', '10', '71',
       '13', '26', '45-55', '7', '39.4', '74', '18', 'older than dirt',
       '17', '15', '8', '75', '5u', 'Enough', 'Over 50', '90', '76',
       'sixty-nine', 'ancient', '77', 'OLD', 'old', '73', '70 1/2', '14',
       'MY NAME JEFF', '4', '59 on the day after Halloween', 'old enough',
       'your mom', 'I can remember when Java was a cool new language',
       '60+'], dtype=object)

In [50]:
# Keep only ages made up entirely of digits; text, ranges, and decimals become NaN
age_index = candy['Q3: AGE'].str.isnumeric().fillna(False).astype(bool)
candy.loc[~age_index, 'Q3: AGE'] = np.nan
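
`str.isnumeric()` accepts only strings made up entirely of digit characters, so `'39.4'` is discarded along with the free-text answers. An alternative, sketched here on a few of the values from the `unique()` output above, is `pd.to_numeric(..., errors='coerce')`, which keeps decimals and turns everything unparseable into `NaN`. Switching would change which rows end up as unknown, so it is a different design choice and not a drop-in replacement.

```python
import pandas as pd

ages = pd.Series(['44', '39.4', '45-55', 'Old enough', None])

# Digits-only test, as used in the cell above
digits_only = ages.str.isnumeric().fillna(False).astype(bool)

# Numeric parsing: decimals survive, unparseable text becomes NaN
coerced = pd.to_numeric(ages, errors='coerce')

print(digits_only.tolist())
print(coerced.tolist())
```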

In [51]:
def categorize_age(age):
    if pd.isnull(age) or age == 'unknown':
        return 'unknown'
    age = float(age)
    if age <= 17:
        return '17 and under'
    elif age <= 25:
        return '18-25'
    elif age <= 35:
        return '26-35'
    elif age <= 45:
        return '36-45'
    elif age <= 55:
        return '46-55'
    else:
        return '56+'
candy['Q3: AGE'] = candy['Q3: AGE'].apply(categorize_age)
categories = ['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+']
candy['Q3: AGE'] = pd.Categorical(candy['Q3: AGE'], categories=categories, ordered=False)

In [52]:
candy['Q3: AGE']

1         36-45
2         46-55
3         36-45
4         18-25
5       unknown
         ...   
2474      18-25
2475      26-35
2476      26-35
2477        56+
2478        56+
Name: Q3: AGE, Length: 2305, dtype: category
Categories (7, object): ['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+']

In [53]:
candy['Q3: AGE'].value_counts()

Q3: AGE
36-45           768
46-55           525
26-35           520
56+             298
18-25            85
unknown          60
17 and under     49
Name: count, dtype: int64
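
The same banding can also be expressed with `pd.cut`, which takes the bin edges directly. This is a sketch of an alternative, not the approach used above, and the sample ages are invented. With the default `right=True`, each bin includes its upper edge, which matches the `<=` comparisons in `categorize_age`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([5, 17, 18, 25, 26, 45, 46, 56, np.nan])

edges = [-np.inf, 17, 25, 35, 45, 55, np.inf]
labels = ['17 and under', '18-25', '26-35', '36-45', '46-55', '56+']

# Bin the numeric ages; missing ages stay NaN at this point
bands = pd.cut(ages, bins=edges, labels=labels)

# Add an explicit 'unknown' category and use it for the missing ages
bands = bands.cat.add_categories('unknown').fillna('unknown')

print(bands.tolist())
```

Unlike the `ordered=False` categorical built above, `pd.cut` returns an ordered categorical by default.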

In [54]:
candy['Q3: AGE'].isnull().sum()

0

In [55]:
candy.reset_index(drop=True, inplace=True)

In [56]:
candy_reduced = candy.loc[:, 'Q6 | 100 Grand Bar':'Q6 | York Peppermint Patties']
candy_reduced

Unnamed: 0,Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,Q6 | Bonkers (the candy),Q6 | Bonkers (the board game),Q6 | Bottle Caps,Q6 | Box'o'Raisins,Q6 | Broken glow stick,Q6 | Butterfinger,...,Q6 | Three Musketeers,Q6 | Tolberone something or other,Q6 | Trail Mix,Q6 | Twix,"Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein",Q6 | Vicodin,Q6 | Whatchamacallit Bars,Q6 | White Bread,Q6 | Whole Wheat anything,Q6 | York Peppermint Patties
0,MEH,DESPAIR,JOY,MEH,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,...,JOY,JOY,DESPAIR,JOY,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR
1,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER
2,MEH,DESPAIR,JOY,MEH,MEH,DESPAIR,MEH,DESPAIR,DESPAIR,MEH,...,DESPAIR,JOY,MEH,JOY,DESPAIR,JOY,JOY,DESPAIR,DESPAIR,DESPAIR
3,JOY,DESPAIR,JOY,DESPAIR,MEH,DESPAIR,MEH,DESPAIR,DESPAIR,MEH,...,JOY,JOY,DESPAIR,JOY,MEH,JOY,JOY,DESPAIR,DESPAIR,JOY
4,JOY,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,NO_ANSWER,MEH,MEH,DESPAIR,JOY,...,JOY,JOY,MEH,JOY,DESPAIR,DESPAIR,JOY,DESPAIR,DESPAIR,JOY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2300,JOY,DESPAIR,MEH,DESPAIR,DESPAIR,MEH,MEH,DESPAIR,DESPAIR,MEH,...,MEH,MEH,JOY,JOY,MEH,JOY,DESPAIR,MEH,DESPAIR,MEH
2301,MEH,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,DESPAIR,DESPAIR,JOY,...,MEH,MEH,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,JOY,DESPAIR,MEH,JOY
2302,MEH,DESPAIR,JOY,DESPAIR,MEH,JOY,DESPAIR,MEH,MEH,DESPAIR,...,JOY,JOY,MEH,MEH,MEH,JOY,MEH,DESPAIR,DESPAIR,MEH
2303,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER


In [57]:
# Count, for each candy column, how many respondents answered JOY
joy_count = candy_reduced.apply(lambda col: col.str.contains('JOY').sum())

In [58]:
# Same count for DESPAIR
despair_count = candy_reduced.apply(lambda col: col.str.contains('DESPAIR').sum())

In [59]:
candy_reduced_transpose = candy_reduced.T
candy_reduced_transpose

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2295,2296,2297,2298,2299,2300,2301,2302,2303,2304
Q6 | 100 Grand Bar,MEH,NO_ANSWER,MEH,JOY,JOY,NO_ANSWER,JOY,JOY,MEH,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,JOY,MEH,JOY,JOY,MEH,MEH,NO_ANSWER,DESPAIR
Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),DESPAIR,NO_ANSWER,DESPAIR,DESPAIR,DESPAIR,NO_ANSWER,DESPAIR,DESPAIR,DESPAIR,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,DESPAIR,DESPAIR,MEH,DESPAIR,DESPAIR,DESPAIR,NO_ANSWER,DESPAIR
Q6 | Any full-sized candy bar,JOY,NO_ANSWER,JOY,JOY,JOY,NO_ANSWER,JOY,JOY,JOY,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,JOY,JOY,JOY,MEH,JOY,JOY,NO_ANSWER,JOY
Q6 | Black Jacks,MEH,NO_ANSWER,MEH,DESPAIR,NO_ANSWER,NO_ANSWER,DESPAIR,MEH,MEH,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,DESPAIR,DESPAIR,DESPAIR,DESPAIR,NO_ANSWER,DESPAIR,NO_ANSWER,DESPAIR
Q6 | Bonkers (the candy),DESPAIR,NO_ANSWER,MEH,MEH,NO_ANSWER,NO_ANSWER,DESPAIR,MEH,MEH,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,MEH,MEH,MEH,DESPAIR,NO_ANSWER,MEH,NO_ANSWER,DESPAIR
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q6 | Vicodin,DESPAIR,NO_ANSWER,JOY,JOY,DESPAIR,NO_ANSWER,MEH,DESPAIR,DESPAIR,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,JOY,DESPAIR,MEH,JOY,NO_ANSWER,JOY,NO_ANSWER,JOY
Q6 | Whatchamacallit Bars,DESPAIR,NO_ANSWER,JOY,JOY,JOY,NO_ANSWER,DESPAIR,MEH,MEH,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,MEH,NO_ANSWER,JOY,DESPAIR,JOY,MEH,NO_ANSWER,DESPAIR
Q6 | White Bread,DESPAIR,NO_ANSWER,DESPAIR,DESPAIR,DESPAIR,NO_ANSWER,DESPAIR,DESPAIR,DESPAIR,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,DESPAIR,DESPAIR,MEH,MEH,DESPAIR,DESPAIR,NO_ANSWER,MEH
Q6 | Whole Wheat anything,DESPAIR,NO_ANSWER,DESPAIR,DESPAIR,DESPAIR,NO_ANSWER,DESPAIR,DESPAIR,DESPAIR,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,MEH,DESPAIR,MEH,DESPAIR,MEH,DESPAIR,NO_ANSWER,DESPAIR


In [60]:
candy_reduced_transpose['joy_count'] = joy_count
candy_reduced_transpose['despair_count'] = despair_count

In [61]:
candy_reduced_transpose['net_feelies'] = candy_reduced_transpose['joy_count'] - candy_reduced_transpose['despair_count']


In [62]:
candy_net_sorted = candy_reduced_transpose[['joy_count', 'despair_count', 'net_feelies']].sort_values(by='net_feelies', ascending=False)
candy_net_sorted

Unnamed: 0,joy_count,despair_count,net_feelies
Q6 | Any full-sized candy bar,1477,15,1462
Q6 | Reese's Peanut Butter Cups,1416,88,1328
Q6 | Kit Kat,1367,47,1320
"Q6 | Cash, or other forms of legal tender",1363,63,1300
Q6 | Twix,1339,67,1272
...,...,...,...
Q6 | Dental paraphenalia,84,1356,-1272
Q6 | Real Housewives of Orange County Season 9 Blue-Ray,86,1398,-1312
Q6 | White Bread,43,1376,-1333
Q6 | Gum from baseball cards,43,1386,-1343
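
The same net score can be computed without transposing the frame, by comparing against the rating labels and summing the resulting boolean columns. A sketch on an invented three-respondent frame (the values are illustrative only):

```python
import pandas as pd

ratings = pd.DataFrame({
    'Q6 | Twix':        ['JOY', 'JOY', 'MEH'],
    'Q6 | White Bread': ['DESPAIR', 'DESPAIR', 'NO_ANSWER'],
})

# One row per candy: counts of JOY and DESPAIR answers
summary = pd.DataFrame({
    'joy_count': (ratings == 'JOY').sum(),
    'despair_count': (ratings == 'DESPAIR').sum(),
})
summary['net_feelies'] = summary['joy_count'] - summary['despair_count']
summary = summary.sort_values(by='net_feelies', ascending=False)

print(summary)
```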


In [63]:
candy_encode = candy.copy()

In [64]:
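# Encode only the target column rather than replacing values across the whole frame
candy_encode['Q2: GENDER'] = candy_encode['Q2: GENDER'].map({'Female': 0, 'Male': 1})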

In [65]:
candy_response = candy_encode['Q2: GENDER']
drop = ['Q2: GENDER', 'Q1: GOING OUT?', 'Q3: AGE', 'Q4: COUNTRY', 'Q10: DRESS', 'Q11: DAY', 'Q12: MEDIA [Daily Dish]', 'Q12: MEDIA [Science]', 'Q12: MEDIA [ESPN]', 'Q12: MEDIA [Yahoo]']
candy_features = candy_encode.drop(columns=drop)

In [66]:
candy_features_encoded = pd.get_dummies(candy_features, drop_first=True)
candy.reset_index(drop=True, inplace=True)
candy_features.reset_index(drop=True, inplace=True)
candy_response.reset_index(drop=True, inplace=True)
candy_features_encoded.reset_index(drop=True, inplace=True)
candy_features_encoded

Unnamed: 0,Q6 | 100 Grand Bar_JOY,Q6 | 100 Grand Bar_MEH,Q6 | 100 Grand Bar_NO_ANSWER,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)_JOY,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)_MEH,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)_NO_ANSWER,Q6 | Any full-sized candy bar_JOY,Q6 | Any full-sized candy bar_MEH,Q6 | Any full-sized candy bar_NO_ANSWER,Q6 | Black Jacks_JOY,...,Q6 | Whatchamacallit Bars_NO_ANSWER,Q6 | White Bread_JOY,Q6 | White Bread_MEH,Q6 | White Bread_NO_ANSWER,Q6 | Whole Wheat anything_JOY,Q6 | Whole Wheat anything_MEH,Q6 | Whole Wheat anything_NO_ANSWER,Q6 | York Peppermint Patties_JOY,Q6 | York Peppermint Patties_MEH,Q6 | York Peppermint Patties_NO_ANSWER
0,False,True,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,True,False,False,True,False,False,True,False,...,True,False,False,True,False,False,True,False,False,True
2,False,True,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,True,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,True,False,False
4,True,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2300,True,False,False,False,False,False,False,True,False,False,...,False,False,True,False,False,False,False,False,True,False
2301,False,True,False,False,False,False,True,False,False,False,...,False,False,False,False,False,True,False,True,False,False
2302,False,True,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2303,False,False,True,False,False,True,False,False,True,False,...,True,False,False,True,False,False,True,False,False,True
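
Each candy column yields only `_JOY`, `_MEH`, and `_NO_ANSWER` indicators because `drop_first=True` drops the first category in sorted order, here `DESPAIR`, which then acts as the baseline (all three indicators false). A small invented example:

```python
import pandas as pd

toy = pd.DataFrame({'Q6 | Twix': ['JOY', 'DESPAIR', 'MEH', 'NO_ANSWER']})

# DESPAIR sorts first, so its indicator column is the one that gets dropped
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded.iloc[1].tolist())  # the DESPAIR row
```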
