# Feature selection

To predict a given rating, we want features that each capture distinct information about the product being rated. Our dataset is limited, but it contains a text field. This notebook explores possible avenues and creates new features from our dataset.

## Covered in this notebook

1. Importing data
2. Variable overview / thoughts
3. Style as a dummy variable
4. Country
5. Variety text features
6. Brand
7. Removing unrated entries
8. Output

## Importing data

In [1]:
import pandas as pd

rr_df = pd.read_csv('../clean_data/clean_ramen_ratings.csv', thousands=',')
rr_df.head()

Unnamed: 0,Brand,Variety,Style,Country,Stars,2014,2015,2016,2017,2018
0,New Touch,T's Restaurant Tantanmen,Cup,Japan,3.75,5500,5540,5660,5660,5780
1,Just Way,Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...,Pack,Taiwan,1.0,710,680,770,820,830
2,Nissin,Cup Noodles Chicken Vegetable,Cup,United States,2.25,4280,4080,4100,4130,4400
3,Wei Lih,GGE Ramen Snack Tomato Flavor,Pack,Taiwan,2.75,710,680,770,820,830
4,Ching's Secret,Singapore Curry,Pack,India,3.75,5340,3260,4270,5420,6060


## Variable overview / thoughts

A summary of my thought process on which new features to create.

#### Brand


The Brand variable in this dataset seems difficult to use.

In [2]:
bvc_df = rr_df['Brand'].value_counts()
print('It has a cardinality of {0}, yet {1} brands are associated with fewer than 5 rows...'
      .format(bvc_df.shape[0], bvc_df[bvc_df < 5].shape[0])
)

It has a cardinality of 350, yet 228 brands are associated with fewer than 5 rows...


I could reduce the number of brands and convert the column to dummy variables for the top X brands, or to a single indicator variable (whether or not the brand is among the 10 biggest), but I'm not sure this would be useful.
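The two options mentioned above could look roughly like the sketch below. It runs on a small made-up Series (`toy_brands`) rather than on `rr_df`, and the cutoff `top_n = 2` is an illustrative assumption (the text suggests 10 for the real data).

```python
import pandas as pd

# Hypothetical toy data standing in for rr_df['Brand'].
toy_brands = pd.Series(
    ["Nissin", "Nissin", "Nissin", "Maruchan", "Maruchan", "Just Way"],
    name="Brand",
)

top_n = 2  # illustrative cutoff, not a value taken from the notebook
top_brands = toy_brands.value_counts().nlargest(top_n).index

# Option A: a single indicator, 1 if the brand is among the top_n most frequent.
is_top_brand = toy_brands.isin(top_brands).astype(int)

# Option B: keep the top_n brands, lump every other brand into "Other",
# then one-hot encode the reduced column.
reduced = toy_brands.where(toy_brands.isin(top_brands), "Other")
brand_dummies = pd.get_dummies(reduced, prefix="Brand")
```

Option A yields one column regardless of `top_n`, while Option B yields `top_n + 1` columns; both keep the rare brands from inflating the feature count.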

#### Variety

The Variety variable is also challenging. This column is likely to contain words from a varied set of languages, and I'm not familiar with text mining libraries. I will do my best to extract simple features and, time permitting, dig into a text handling library a little.

#### Style
This column is categorical and has a low cardinality. It is a good candidate to become a dummy variable. We will start by modifying this column.

#### Stars

Stars is the value we want to predict, so we will not generate any features that take this value into account.

#### Country & Years

Similarly, the Country variable in and of itself poses the same challenges as the Brand variable. It has a high cardinality and is not evenly spread across the data. However, each country maps to exactly one set of ramen consumption figures for 2014 to 2018, so those year columns carry the country information in numeric form. We can keep those values numeric (although we will probably want to normalize them), and we can also encode whether consumption increased over those 5 years as a possible indicator of the overall ramen quality.
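One way to check that the year columns really are determined by the country is to count distinct values per country. The sketch below uses a made-up frame (`toy`) with only two of the year columns; the column names match the dataset, the values are illustrative.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook this would be rr_df.
toy = pd.DataFrame({
    "Country": ["Japan", "Taiwan", "Japan", "Taiwan"],
    "2014": [5500, 710, 5500, 710],
    "2018": [5780, 830, 5780, 830],
})

# Number of distinct values in each year column within each country.
distinct_per_country = toy.groupby("Country")[["2014", "2018"]].nunique()

# True when every country maps to exactly one value per year column.
country_determines_consumption = bool((distinct_per_country == 1).all().all())
```

If this flag were `False` on the real data, dropping Country in favour of the year columns would lose information.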

## Style as a dummy variable

We use the pandas function get_dummies() to generate one new column per style and add these columns to our original dataset using the concat function.

In [3]:
style_dummies = pd.get_dummies(rr_df['Style'])
rr_df = pd.concat([rr_df, style_dummies], axis=1)
rr_df.head()

Unnamed: 0,Brand,Variety,Style,Country,Stars,2014,2015,2016,2017,2018,Bar,Bowl,Box,Can,Cup,Pack,Tray
0,New Touch,T's Restaurant Tantanmen,Cup,Japan,3.75,5500,5540,5660,5660,5780,0,0,0,0,1,0,0
1,Just Way,Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...,Pack,Taiwan,1.0,710,680,770,820,830,0,0,0,0,0,1,0
2,Nissin,Cup Noodles Chicken Vegetable,Cup,United States,2.25,4280,4080,4100,4130,4400,0,0,0,0,1,0,0
3,Wei Lih,GGE Ramen Snack Tomato Flavor,Pack,Taiwan,2.75,710,680,770,820,830,0,0,0,0,0,1,0
4,Ching's Secret,Singapore Curry,Pack,India,3.75,5340,3260,4270,5420,6060,0,0,0,0,0,1,0


We can now get rid of our original "Style" column.

In [4]:
rr_df = rr_df.drop(columns=['Style'])
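As an aside, the encode, concat, and drop steps can be collapsed into one call by passing `columns=` to `get_dummies`. This is a sketch on a made-up frame (`toy`); note that with the default prefix the new columns are named `Style_Cup`, `Style_Pack`, and so on, rather than the bare `Cup`, `Pack` names used in this notebook.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook this would be rr_df.
toy = pd.DataFrame({
    "Variety": ["Tantanmen", "Spicy Hot Sesame", "Chicken Vegetable"],
    "Style": ["Cup", "Pack", "Cup"],
})

# columns=["Style"] one-hot encodes that column and removes the original
# in the same call; dtype=int gives 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["Style"], dtype=int)
```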

## Country

As discussed above, country by itself is not the most interesting feature, but a country's consumption of ramen is! Let's normalize the ramen consumption for each year and add an indicator variable showing whether consumption increased over the 5 years covered.

In [5]:
years = ['2014', '2015', '2016', '2017', '2018']
rr_consumption_to_scale = rr_df[years]

In [6]:
from sklearn import preprocessing

consumption_np_array = rr_consumption_to_scale.values
scaler = preprocessing.MinMaxScaler()
consumption_scaled_np_array = scaler.fit_transform(consumption_np_array)
scaled_consumption_df = pd.DataFrame(consumption_scaled_np_array)
scaled_consumption_df.columns = rr_consumption_to_scale.columns
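For reference, `MinMaxScaler` with its default range of 0 to 1 rescales each column independently as `(x - min) / (max - min)`. The sketch below reproduces that formula in plain pandas on a made-up frame (`toy`); the values are illustrative, not taken from the full dataset.

```python
import pandas as pd

# Hypothetical consumption figures for three countries in two years.
toy = pd.DataFrame({
    "2014": [710.0, 5500.0, 4280.0],
    "2018": [830.0, 5780.0, 4400.0],
})

# Column-wise min-max scaling: each year is scaled against its own min and max.
col_min = toy.min()
col_max = toy.max()
manual_scaled = (toy - col_min) / (col_max - col_min)
```

After scaling, the smallest value of every column is 0 and the largest is 1, so values are comparable in range but no longer in absolute units.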

We now drop the original year columns and replace them with our normalized values.

In [7]:
rr_df = rr_df.drop(columns=years)
rr_df = pd.concat([rr_df, scaled_consumption_df], axis=1)
rr_df.head()

Unnamed: 0,Brand,Variety,Country,Stars,Bar,Bowl,Box,Can,Cup,Pack,Tray,2014,2015,2016,2017,2018
0,New Touch,T's Restaurant Tantanmen,Japan,3.75,0,0,0,0,1,0,0,0.123677,0.136813,0.146715,0.145021,0.14339
1,Just Way,Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...,Taiwan,1.0,0,0,0,0,0,1,0,0.015769,0.016576,0.019735,0.020791,0.020378
2,Nissin,Cup Noodles Chicken Vegetable,United States,2.25,0,0,0,0,1,0,0,0.096193,0.100693,0.106206,0.105749,0.109095
3,Wei Lih,GGE Ramen Snack Tomato Flavor,Taiwan,2.75,0,0,0,0,0,1,0,0.015769,0.016576,0.019735,0.020791,0.020378
4,Ching's Secret,Singapore Curry,India,3.75,0,0,0,0,0,1,0,0.120072,0.080406,0.110621,0.13886,0.150348


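We will now add the indicator of increase. Note that we calculate the indicator on the normalized data instead of the raw numbers. Because each year column is scaled against its own minimum and maximum, the comparison shows whether the country moved up within the range of consumption values observed across countries, rather than whether its raw consumption grew.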

In [8]:
# We use astype to make the boolean into a 0, 1 integer
rr_df['Five Year Consumption Increase'] = (rr_df['2014'] < rr_df['2018']).astype(int)
rr_df.head()

Unnamed: 0,Brand,Variety,Country,Stars,Bar,Bowl,Box,Can,Cup,Pack,Tray,2014,2015,2016,2017,2018,Five Year Consumption Increase
0,New Touch,T's Restaurant Tantanmen,Japan,3.75,0,0,0,0,1,0,0,0.123677,0.136813,0.146715,0.145021,0.14339,1
1,Just Way,Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...,Taiwan,1.0,0,0,0,0,0,1,0,0.015769,0.016576,0.019735,0.020791,0.020378,1
2,Nissin,Cup Noodles Chicken Vegetable,United States,2.25,0,0,0,0,1,0,0,0.096193,0.100693,0.106206,0.105749,0.109095,1
3,Wei Lih,GGE Ramen Snack Tomato Flavor,Taiwan,2.75,0,0,0,0,0,1,0,0.015769,0.016576,0.019735,0.020791,0.020378,1
4,Ching's Secret,Singapore Curry,India,3.75,0,0,0,0,0,1,0,0.120072,0.080406,0.110621,0.13886,0.150348,1
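Comparing normalized columns is not equivalent to comparing raw ones. The sketch below uses made-up numbers (`toy`) to show a case where raw consumption rises but the min-max scaled value falls, because the maximum for that year grew faster.

```python
import pandas as pd

# Hypothetical raw consumption for three countries.
toy = pd.DataFrame({
    "2014": [0.0, 100.0, 1000.0],
    "2018": [0.0, 150.0, 2000.0],
})

# Column-wise min-max scaling, the same formula MinMaxScaler applies by default.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Indicator computed on raw values versus on scaled values.
raw_increase = (toy["2014"] < toy["2018"]).astype(int)
scaled_increase = (scaled["2014"] < scaled["2018"]).astype(int)
```

For the middle row, raw consumption goes from 100 to 150, yet the scaled value drops from 0.1 to 0.075, so the two indicators disagree. Which version is more useful depends on whether absolute or relative growth is the signal of interest.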


Finally, we drop the Country column, which will not be used.

In [9]:
rr_df = rr_df.drop(columns=['Country'])

## Variety text features

A few features which should not be too hard to extract are:

1. Word frequency
2. Length of Variety

If we have more time, we will add to this list. Let's start by creating a word count across all Variety values.

In [10]:
# Join every Variety value into one long string, separated by spaces
variety_text = ' '.join(rr_df['Variety'])

In [11]:
import spacy
from collections import Counter

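# The "en" shortcut was removed in spaCy 3; this assumes the small English model is installed
nlp = spacy.load("en_core_web_sm")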

In [12]:
doc = nlp(variety_text)
# Keep lowercased tokens that are neither stop words nor punctuation
words = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct]

freq = Counter(words)
common_words = freq.most_common(30)
print(common_words)

[('noodles', 669), ('noodle', 519), ('instant', 443), ('flavour', 401), ('ramen', 344), ('chicken', 325), ('flavor', 323), ('spicy', 276), ('beef', 230), ('cup', 198), ('soup', 196), ('sauce', 145), ('rice', 143), ('artificial', 133), ('tom', 128), ('shrimp', 127), ('curry', 125), ('mi', 123), ('hot', 120), ('seafood', 109), ('bowl', 104), ('pork', 102), ('style', 90), ('yum', 87), ('udon', 79), ('goreng', 79), ('vermicelli', 59), ('demae', 58), ('oriental', 58), ('fried', 57)]
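The same counting idea can be sketched without spaCy, using only the standard library. The list `toy_varieties` and the tiny `stop_words` set are illustrative assumptions; spaCy's tokenizer and stop list are more thorough, so counts on the real data would differ.

```python
import re
from collections import Counter

# Hypothetical Variety values; in the notebook these come from rr_df['Variety'].
toy_varieties = [
    "Cup Noodles Chicken Vegetable",
    "Spicy Chicken Noodles",
    "Singapore Curry",
]

# A tiny hand-picked stop list, far shorter than spaCy's built-in one.
stop_words = {"the", "and", "with"}

# Lowercase each value, split it into alphabetic tokens, drop stop words.
tokens = [
    word
    for variety in toy_varieties
    for word in re.findall(r"[a-z']+", variety.lower())
    if word not in stop_words
]

toy_freq = Counter(tokens)
top_two = toy_freq.most_common(2)
```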


After looking at this list, it seems that many of these words refer to type or style rather than to the content. Let's simplify our selection by flagging descriptions that contain an indication of spiciness.

In [None]:
# We pick the terms meaning spicy out of the top 300 words by frequency...
freq.most_common(300)

After a lot of Google searching, here is a list of terms associated with spiciness. Let's add spiciness as a column by checking whether any of these words appear in the Variety text.

In [13]:
spicy_terms = ['spicy', 'hot', 'tom', 'yum', 'laksa', 'chili', 'masala', 'habanero', 'buldak', 'picante', 'camaron', 'sambal']

def is_spicy(text):
    for spicy_term in spicy_terms:
        if spicy_term in text.lower():
            return True
    return False

rr_df['Spicy'] = rr_df['Variety'].apply(is_spicy).astype(int)
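One caveat: `is_spicy` does substring matching, so 'tom' also matches 'Tomato' and 'hot' would match any longer word containing it. A possible variant, sketched below, matches whole words only by using regex word boundaries. The function name `is_spicy_whole_word` is hypothetical, and the term list is copied from the cell above so the block is self-contained.

```python
import re

# Same terms as in the notebook cell above.
spicy_terms = ['spicy', 'hot', 'tom', 'yum', 'laksa', 'chili', 'masala',
               'habanero', 'buldak', 'picante', 'camaron', 'sambal']

# \b anchors each term to word boundaries, so 'tom' no longer matches 'tomato'.
spicy_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, spicy_terms)) + r")\b",
    flags=re.IGNORECASE,
)

def is_spicy_whole_word(text):
    """Return True if any spicy term appears as a whole word in text."""
    return spicy_pattern.search(text) is not None
```

With this variant, "Tom Yum Shrimp" is still flagged while "GGE Ramen Snack Tomato Flavor" is not. The trade-off is that compound spellings (a term glued to another word) would no longer match.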

Let's also add a normalized value representing the length of the Variety name.

In [14]:
# Character count of each Variety value
variety_len_np_array = rr_df['Variety'].str.len().values
scaler = preprocessing.MinMaxScaler()
variety_len_scaled_np_array = scaler.fit_transform(variety_len_np_array.reshape(-1,1))
rr_df['Variety Length'] = variety_len_scaled_np_array

In [15]:
rr_df.head()

Unnamed: 0,Brand,Variety,Stars,Bar,Bowl,Box,Can,Cup,Pack,Tray,2014,2015,2016,2017,2018,Five Year Consumption Increase,Spicy,Variety Length
0,New Touch,T's Restaurant Tantanmen,3.75,0,0,0,0,1,0,0,0.123677,0.136813,0.146715,0.145021,0.14339,1,0,0.236559
1,Just Way,Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...,1.0,0,0,0,0,0,1,0,0.015769,0.016576,0.019735,0.020791,0.020378,1,1,0.602151
2,Nissin,Cup Noodles Chicken Vegetable,2.25,0,0,0,0,1,0,0,0.096193,0.100693,0.106206,0.105749,0.109095,1,0,0.27957
3,Wei Lih,GGE Ramen Snack Tomato Flavor,2.75,0,0,0,0,0,1,0,0.015769,0.016576,0.019735,0.020791,0.020378,1,1,0.27957
4,Ching's Secret,Singapore Curry,3.75,0,0,0,0,0,1,0,0.120072,0.080406,0.110621,0.13886,0.150348,1,0,0.129032


We then drop the Variety column.

In [16]:
rr_df = rr_df.drop(columns=['Variety'])

## Brand

We will not be using Brand in this analysis, as it is highly skewed toward a small number of brands. Their representation in the list could have been encoded as a feature, but I'm not sure whether this would introduce further bias during modeling, as it might be directly correlated with our output variable (rating).

In [17]:
rr_df = rr_df.drop(columns=['Brand'])
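For completeness, encoding a brand's representation as a feature could be done with a frequency encoding, sketched below on a made-up frame (`toy`). The column name `Brand Frequency` is hypothetical, and this step is not part of the pipeline in this notebook.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook this would be rr_df before dropping Brand.
toy = pd.DataFrame({"Brand": ["Nissin", "Nissin", "Nissin", "Just Way"]})

# Share of rows belonging to each brand (the shares sum to 1).
brand_share = toy["Brand"].value_counts(normalize=True)

# Map every row to its brand's share, giving a single numeric column.
toy["Brand Frequency"] = toy["Brand"].map(brand_share)
```

This keeps the feature count at one column no matter how many brands exist, at the cost of treating all brands with the same share as identical.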

## Removing unrated entries

Our dataset contains a few entries without ratings. We will not handle these in this notebook. Instead, we will try different imputation methods during modeling and measure their impact on the overall quality of the predictions.
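As a rough preview of what handling them might involve, the sketch below counts unrated rows and shows two simple options on a made-up frame (`toy`). It assumes unrated entries appear as NaN in the Stars column, which this notebook does not verify, and median filling is only one of many possible imputation methods.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing value (assumed to be stored as NaN).
toy = pd.DataFrame({"Stars": [3.75, np.nan, 2.25, 2.75]})

# How many rows have no rating.
n_unrated = int(toy["Stars"].isna().sum())

# Option A: drop the unrated rows entirely.
rated_only = toy.dropna(subset=["Stars"])

# Option B: fill missing ratings with the median of the observed ratings.
median_filled = toy["Stars"].fillna(toy["Stars"].median())
```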

## Output

We now write out our dataset with its new features so that we can start from it directly in the next notebook.

In [18]:
rr_df.to_csv('../clean_data/ramen_features.csv', encoding='utf-8', index=False)
rr_df.head()

Unnamed: 0,Stars,Bar,Bowl,Box,Can,Cup,Pack,Tray,2014,2015,2016,2017,2018,Five Year Consumption Increase,Spicy,Variety Length
0,3.75,0,0,0,0,1,0,0,0.123677,0.136813,0.146715,0.145021,0.14339,1,0,0.236559
1,1.0,0,0,0,0,0,1,0,0.015769,0.016576,0.019735,0.020791,0.020378,1,1,0.602151
2,2.25,0,0,0,0,1,0,0,0.096193,0.100693,0.106206,0.105749,0.109095,1,0,0.27957
3,2.75,0,0,0,0,0,1,0,0.015769,0.016576,0.019735,0.020791,0.020378,1,1,0.27957
4,3.75,0,0,0,0,0,1,0,0.120072,0.080406,0.110621,0.13886,0.150348,1,0,0.129032
