# Overview
This notebook explores the cleaned ramen-ratings CSV. Any engineered features or scripts will be added to explore.py.

# Findings
1. Ramen packaging and five-star ratings are independent.
2. Ramen country of origin and five-star ratings **have a dependent relationship.**
3. Ramen brand and five-star ratings are independent.

In [1]:
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

import wrangle

In [2]:
# create train split for exploration
train, _, _ = wrangle.prep_explore()
print('')

# check work
train.info()

Train size: (1515, 5) Validate size: (506, 5) Test size: (506, 5)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1515 entries, 2566 to 2127
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   brand       1515 non-null   object
 1   name        1515 non-null   object
 2   package     1515 non-null   object
 3   country     1515 non-null   object
 4   five_stars  1515 non-null   bool  
dtypes: bool(1), object(4)
memory usage: 60.7+ KB


# Initial Exploration of Ramen Packaging
Let's check if there's a dependent relationship between packaging and our target.

Hypotheses:
- $H_0$: Packaging and five-star ratings are independent.
- $H_a$: Packaging and five-star ratings have a dependent relationship.

Confidence level: 95% ($\alpha = 0.05$)

In [3]:
# set significance level (alpha) for a 95% confidence level
alpha = .05

In [4]:
# check dependence of packaging and target
package_5star_crosstab = pd.crosstab(train.package, train.five_stars)
_, p, _, _ = stats.chi2_contingency(package_5star_crosstab)

In [5]:
# check if p is significant
if p < alpha:
    print("Packaging and five-star ratings have a dependent relationship with 95% confidence.")
    print("p-value:", p)
else:
    print("Packaging and five-star ratings are independent, did not pass 95% confidence interval.")
    print("p-value:", p)

Packaging and five-star ratings are independent, did not pass 95% confidence interval.
p-value: 0.43924606238117814


**Packaging and five-star ratings are independent.** We fail to reject $H_0$ (p ≈ 0.44), so we will not use 'package' in our predictive model.
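
For reference, `stats.chi2_contingency` returns four values: the test statistic, the p-value, the degrees of freedom, and the expected counts under independence. The cells above keep only the p-value. Below is a minimal sketch on made-up counts (the table is hypothetical, not taken from the ramen data). Every row shares the same 3:1 ratio, so the test should find no evidence of dependence.

```python
import pandas as pd
from scipy import stats

# hypothetical counts for illustration only (not taken from the ramen data)
toy_crosstab = pd.DataFrame(
    {False: [90, 60, 30], True: [30, 20, 10]},
    index=['Pack', 'Cup', 'Bowl'],
)

# unpack all four return values instead of discarding three of them
chi2, p, dof, expected = stats.chi2_contingency(toy_crosstab)

print("chi2:", chi2)
print("p-value:", p)
print("degrees of freedom:", dof)
print("smallest expected count:", expected.min())
```

The degrees of freedom equal (rows - 1) * (columns - 1), and the expected counts are useful for checking the test's sample-size assumption.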

# Initial Exploration of Ramen Country of Origin
Let's check if there's a dependent relationship between country and our target.

Hypotheses:
- $H_0$: Country of origin and five-star ratings are independent.
- $H_a$: Country of origin and five-star ratings have a dependent relationship.

Confidence level: 95% ($\alpha = 0.05$)

In [6]:
# set significance level (alpha) for a 95% confidence level
alpha = .05

In [7]:
# create crosstab for chi-square statistical test
country_5star_crosstab = pd.crosstab(train.country, train.five_stars)
# limit only to countries with more than 5 observed reviews in both columns (a rough stand-in for the chi-square assumption of expected counts of at least 5)
enough_values_mask = (country_5star_crosstab[False] > 5) & (country_5star_crosstab[True] > 5)
# run chi-square test
_, p, _, _ = stats.chi2_contingency(country_5star_crosstab[enough_values_mask])

In [8]:
# check if p is significant
if p < alpha:
    print("Country of origin and five-star ratings have a dependent relationship with 95% confidence.")
    print("p-value:", p)
else:
    print("Country of origin and five-star ratings are independent, did not pass 95% confidence interval.")
    print("p-value:", p)

Country of origin and five-star ratings have a dependent relationship with 95% confidence.
p-value: 1.680501270556502e-06


**Country of origin and five-star ratings have a dependent relationship.** We will further explore country of origin and consider using it in our model.
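
The mask above filters on observed counts. The usual rule of thumb for chi-square is stated in terms of expected counts (at least 5 per cell), so an alternative is to filter on the expected table that `stats.chi2_contingency` returns. The sketch below uses made-up counts, and the threshold of 5 is an assumption based on that rule of thumb; it is not what the notebook's cells do.

```python
import pandas as pd
from scipy import stats

# hypothetical counts for illustration only (not taken from the ramen data)
toy_crosstab = pd.DataFrame(
    {False: [200, 150, 8, 3], True: [60, 90, 2, 1]},
    index=['Japan', 'South Korea', 'Nepal', 'Hungary'],
)

# expected counts under independence come back as the fourth return value
_, _, _, expected = stats.chi2_contingency(toy_crosstab)
expected = pd.DataFrame(expected, index=toy_crosstab.index, columns=toy_crosstab.columns)

# keep only rows where every expected count is at least 5
enough_expected_mask = (expected >= 5).all(axis=1)
filtered_crosstab = toy_crosstab[enough_expected_mask]

# re-run the test on the filtered table
_, p, _, _ = stats.chi2_contingency(filtered_crosstab)
print(filtered_crosstab)
print("p-value:", p)
```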

# Initial Exploration of Ramen Brand
Let's check if there's a dependent relationship between ramen brand and our target.

Hypotheses:
- $H_0$: Ramen brand and five-star ratings are independent.
- $H_a$: Ramen brand and five-star ratings have a dependent relationship.

Confidence level: 95% ($\alpha = 0.05$)

In [9]:
# set significance level (alpha) for a 95% confidence level
alpha = .05

In [10]:
# create crosstab for chi-square statistical test
brand_5star_crosstab = pd.crosstab(train.brand, train.five_stars)
# limit only to brands with more than 5 observed reviews in both columns (a rough stand-in for the chi-square assumption of expected counts of at least 5)
enough_values_mask = (brand_5star_crosstab[False] > 5) & (brand_5star_crosstab[True] > 5)
# run chi-square test
_, p, _, _ = stats.chi2_contingency(brand_5star_crosstab[enough_values_mask])

In [11]:
# check if p is significant
if p < alpha:
    print("Ramen brand and five-star ratings have a dependent relationship with 95% confidence.")
    print("p-value:", p)
else:
    print("Ramen brand and five-star ratings are independent, did not pass 95% confidence interval.")
    print("p-value:", p)

Ramen brand and five-star ratings are independent, did not pass 95% confidence interval.
p-value: 0.22292206264805428


**Ramen brand and five-star ratings are independent.** We fail to reject $H_0$ (p ≈ 0.22), so we will not use 'brand' in our predictive model.

# Breaking Down the Ramen Product Name
Nearly all ramen reviews in our dataset have a unique combination of brand and product name. Of the nearly 1,500 brand-and-product combinations in our exploration split, only 21 have two reviews, and none have more than two. Because so few products repeat, we can't run an initial chi-square test to see if product names have a dependent relationship with five-star ratings.

In [12]:
# show the review repeats and non-repeats
print("Brand+name with only one review:", (train[['brand','name']].value_counts() == 1).sum())
print("Brand+name with two reviews:", (train[['brand','name']].value_counts() == 2).sum())
print("Brand+name with more than two reviews:", (train[['brand','name']].value_counts() > 2).sum())

Brand+name with only one review: 1473
Brand+name with two reviews: 21
Brand+name with more than two reviews: 0
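
The same breakdown can be produced in one pass by counting the counts. This is a minimal sketch on made-up rows (the `toy` frame is hypothetical); it assumes the same `brand` and `name` columns as the train split.

```python
import pandas as pd

# hypothetical rows for illustration only
toy = pd.DataFrame({
    'brand': ['A', 'A', 'B', 'B', 'C'],
    'name':  ['Chicken', 'Chicken', 'Beef', 'Pork', 'Miso'],
})

# number of reviews for each brand+name combination
reviews_per_product = toy[['brand', 'name']].value_counts()

# number of combinations that have 1 review, 2 reviews, and so on
combos_by_review_count = reviews_per_product.value_counts().sort_index()
print(combos_by_review_count)
```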


Before we can understand the relationship between product name and the target, we will need to split out certain keywords in product names to use as features.

## Identifying Keywords to Use
**Ramen names will require multiple features** because a ramen product can have many attributes.

Another issue is that some product names use English and some do not. To accommodate this, **we will need to translate some words** to English and include them in our features (EX: put "soy" and "shoyu"/"shouyu" into one feature).
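
As a rough illustration of folding translations into one feature, the sketch below flags any name containing one of several spellings. The product names are made up, and `soy_terms` and `has_soy` are hypothetical names used only for this example.

```python
import pandas as pd

# hypothetical product names for illustration only
names = pd.Series([
    'Shoyu Ramen',
    'Soy Sauce Flavor Noodles',
    'Tonkotsu Shouyu',
    'Chicken Flavour',
])

# English and romanized spellings that should land in the same feature
soy_terms = ['Soy', 'Shoyu', 'Shouyu']

# one boolean column covers every spelling
has_soy = names.str.contains('|'.join(soy_terms))
print(has_soy.tolist())
```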

### Word Checks

In [13]:
# print a list of countries with a ramen product
print(train.country.unique().tolist())

['South Korea', 'USA', 'Hong Kong', 'UK', 'Thailand', 'Japan', 'Taiwan', 'Malaysia', 'India', 'Canada', 'Singapore', 'Philippines', 'China', 'Mexico', 'Indonesia', 'Cambodia', 'Netherlands', 'Australia', 'Nepal', 'Vietnam', 'Myanmar', 'Germany', 'Pakistan', 'Hungary', 'Colombia', 'Bangladesh', 'Brazil']


In [14]:
# checking product names by country (cell ran multiple times with different inputs)
# train[train.country == 'Taiwan'].name.tolist()

In [15]:
# checking count of values matching the string (cell ran multiple times with different inputs)
# (train.name.str.contains('Cake') == True).sum()

In [16]:
# checking row's values for rows containing matched string (cell ran multiple times with different inputs)
# train[train.name.str.contains('Teriyaki')]

### Identifying Remaining Keywords
The following keyword mask contains all the words I've designated as keywords. The sections below break this list down into readable groups.

In [17]:
# identify all keywords, prepare list for df.col.str.contains()
keyword_mask = '|'.join(['Vermicelli', 'Vernicalli', 'Bihun', 'Sano', 'Chicken', 'Chikin', 'Duck',
                         'Vegetable', 'Veggie', 'Vegetarian','Beef', 'Gomtang', 'Seolleongtang', 'Sukiyaki', 
                         'Nam Tok', 'Pork', 'Jjajangmen', 'Jiajang', 'Tonkotsu', 'Tomkotsu', 'Bacon', 'Budae',
                         'Seafood', 'Crab', 'Anchovy', 'Bajirak', 'Clam', 'Abalone', 'Scallop', 'Vongole', 
                         'Salmon', 'Lobster', 'Shrimp', 'Prawn', 'Tuna', 'Tteok', 'Rabokki', 'Raobokki',
                         'Spicy', 'Spice', 'Shin', 'Jjamppong', 'Jjambbong', 'Buldalk', 'Sutah',
                         'Habanero', 'Jinjja', 'Jin', 'Yeul', 'Mala', 'Teumsae', 'Bibim', 'Picante', 'Bulnak', 
                         'Volcano', 'Odongtong', 'Sriracha', 'Arrabiata', 'Tom Yum', 'Tom Yam', 'Tom Saab', 
                         'Tom Klong', 'Suki', 'Stir Fry', 'Bokkeum', 'Tteokbokki', 'Topokki', 'Yukgaejang', 
                         'Yakisoba', 'Yaki-Soba', 'Yakiosoba', 'Fried', 'Goreng', 'Ramyonsari',
                         'Keopnurungji', 'Sabalmyeon', 'Miso', 'Teriyaki', 'Mushroom', 'Udon', 'Udoin', 
                         'Tomato', 'Chili', 'Chilli', 'chili', 'Wonton', 'Wantan', 'Pickled', 'Sesame', 
                         'Superior', 'Carbonara', 'Chow Mein', 'Sweet', 'Pad Thai', 'Sour', 'sour', 'Curry', 
                         'Soy', 'Shoyu', 'Shiitake', 'Shitake', 'Tofu', 'Pho', 'Clear', 'Egg', 'Tempura', 
                         'Laksa', 'Buckwheat', 'Soba', 'Salt', 'Shio', 'Sio', 'Neapolitan',
                         'Napolitan', 'Spaghetti', 'Mayo', 'Barbecue', 'BBQ', 'Masala', 'Kimchi', 'Veg',
                         'Tteobokki', 'Rice', 'Mi', 'Onion', 'Pollo', 'Cheese'])

# create True/False for whether the row contains a keyword in the product name
train['has_keyword'] = train.name.str.contains(keyword_mask)
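
One caveat with substring matching: short keywords such as 'Mi' also match inside longer words such as 'Miso', which may inflate some counts. A possible refinement, not applied in this notebook, is to wrap keywords in regex word boundaries. The sketch below uses made-up product names.

```python
import re
import pandas as pd

# hypothetical product names for illustration only
names = pd.Series(['Mi Goreng', 'Miso Ramen', 'Ming Noodles'])

# plain substring match: 'Mi' is found inside 'Miso' and 'Ming' as well
substring_hits = names.str.contains('Mi')

# word-boundary match: 'Mi' must stand alone as a word
bounded_pattern = r'\b(?:' + '|'.join(re.escape(k) for k in ['Mi']) + r')\b'
bounded_hits = names.str.contains(bounded_pattern)

print(substring_hits.tolist())
print(bounded_hits.tolist())
```

The trade-off is that prefix-style keywords such as 'Veg' rely on substring matching to cover 'Vegetable' and 'Veggie', so those would need to be listed in full if word boundaries were used.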

Based on the above keywords, I ran the following two cells to check remaining values that I missed earlier. If I found a notable word, I added it to the above keyword list and re-ran the cells. I repeated this process until I was satisfied with the words I had designated as keywords.

In [18]:
# check if column is mostly True values
train.has_keyword.value_counts()

True     1345
False     170
Name: has_keyword, dtype: int64

In [19]:
# check all rows without keywords for each unique word's value counts in entire list (ran this cell multiple times)
(
    pd.Series( # make a Series of each instance of each word
        ' '.join(
                 train[~train.has_keyword]    # look at rows we haven't caught with a keyword yet
                 .name.tolist()        # put all 'name' cells in a list
                ).split()        # join all names into one string, then split the string into a list of each word
    ).value_counts()        # calculate the value counts of each word in the series
    .head(10)         # display the top 10 (changed from 30 to 10 after the words I wanted were captured)
)

Noodles    42
Noodle     34
Ramen      20
Instant    18
Cup        15
Flavor     13
Flavour    13
Sabor      10
Sauce      10
With       10
dtype: int64
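
An equivalent word count can be written with pandas string methods instead of joining and splitting by hand. This is a minimal sketch on made-up names, offered as an alternative to the cell above.

```python
import pandas as pd

# hypothetical product names for illustration only
names = pd.Series(['Instant Noodles', 'Cup Noodles', 'Instant Ramen Cup'])

# split each name into words, flatten to one word per row, then count each word
word_counts = names.str.split().explode().value_counts()
print(word_counts.head(10))
```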

### Grouping Keywords, Checking Value Counts
Now that we have a keyword list, I will organize it into groups and check the row count for each grouping, in preparation for eliminating groups in the next section. Each grouping will be considered as one feature or value; for example, the **'noodle_type'** feature would have a 'noodle' value that covers ['Noodle', 'Myeon', 'Myon'].

#### Noodle Type
* 'Noodle', 'Myeon', 'Myon' (665)
* 'Udon', 'Udoin' (49)
* 'Miso' (23)
* 'Rice', 'Mi' (239)
* 'Vermicelli', 'Vernicalli', 'Bihun', 'Sano' (43)
* 'Rice Cake', 'Tteok', 'Rabokki', 'Raobokki' (4)
* 'Wonton', 'Wantan' (5)
* 'Spaghetti', 'Carbonara', 'Neapolitan', 'Napolitan' (10)
* 'Buckwheat', 'Soba' (18)

In [20]:
# checking row count having the above noodle types (ran this cell multiple times)
train.name.str.contains("|".join(['Buckwheat', 'Soba'])).sum()

18

#### Meats
* 'Chicken', 'Chikin', 'Duck', 'Pollo', 'Buldalk' (205)
* 'Beef', 'Gomtang', 'Seolleongtang', 'Sukiyaki', 'Nam Tok', 'Sutah' (152)
* 'Pork', 'Jjajangmen', 'Jiajang', 'Tonkotsu', 'Tomkotsu', 'Bacon', 'Budae' (108)
* 'Seafood', 'Crab', 'Anchovy', 'Bajirak', 'Clam', 'Abalone', 'Scallop', 'Vongole', 'Salmon', 'Lobster', 'Shrimp', 'Prawn', 'Tuna', 'Jjamppong', 'Jjambbong' (198)
* 'Chili', 'Chilli', 'chili' (35)
* 'Chow Mein' (25)
* 'Egg' (5)
* 'Tofu' (2)
* 'Barbecue', 'BBQ' (9)

In [21]:
# checking row count having the above meat types (ran this cell multiple times)
train.name.str.contains("|".join(['Barbecue', 'BBQ'])).sum()

9

#### Vegetables
* 'Clear', 'Veg' (covers 'Vegetable', 'Veggie', 'Vegetarian' and 'Veg') (85)
* 'Kimchi', 'Sabalmyeon' (22)
* 'Mushroom', 'Shiitake', 'Shitake' (35)
* 'Tomato' (21)

In [22]:
# checking row count having the above veggie types (ran this cell multiple times)
train.name.str.contains("|".join(['Tomato'])).sum()

21

#### Taste
* 'Spicy', 'Spice', 'Shin', 'Jjamppong'/'Jjambbong'(seafood), 'Buldalk'(chicken), 'Sutah'(beef), 'Budae'(sausage), 'Habanero', 'Jinjja', 'Jin', 'Yeul', 'Mala', 'Teumsae', 'Bibim', 'Picante', 'Bulnak', 'Volcano', 'Odongtong', 'Sriracha', 'Arrabiata', 'Tom Yum', 'Tom Yam', 'Tom Saab', 'Tom Klong', 'Suki', 'Laksa' (304)
* 'Ramyonsari', 'Keopnurungji' (2)
* 'Salt', 'Shio', 'Sio' (17)
* 'Soy', 'Shoyu', 'Shouyu', 'Teriyaki' (70)
* 'Mayo' (6)
* 'Cheese' (11)
* 'Sweet' (18)
* 'Sour', 'sour' (19)
* 'Curry' (68)
* 'Sesame' (32)
* 'Pickle' (11)
* 'Masala' (9)

In [23]:
# checking row count having the above taste types (ran this cell multiple times)
train.name.str.contains("|".join(['Masala'])).sum()

9

#### Preparation
* 'Instant', 'Minute', 'Ramyun', 'Jinjja', 'Bibim' (296)
* 'Stir Fry', 'Bokkeum', 'Tteokbokki', 'Tteobokki', 'Topokki', 'Yukgaejang', 'Rabokki', 'Yakisoba', 'Yaki-Soba', 'Yakiosoba', 'Fried', 'Goreng', 'Tempura' (123)
    * Eliminate 'Non-Fried' from this feature
* 'Soup', 'Jjigae', 'Consomme' (109)
* 'Pad Thai' (4)
* 'Pho' (10)

In [24]:
# checking row count having the above preparation types (ran this cell multiple times)
train.name.str.contains("|".join(['Pho'])).sum()

10
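
Rather than re-running one cell per grouping, the counts can be collected in a single pass from a dictionary of groups. The sketch below uses made-up product names and only a few of the groupings; the dictionary keys are hypothetical labels.

```python
import pandas as pd

# hypothetical product names for illustration only
names = pd.Series([
    'Buckwheat Soba',
    'Tomato Noodles',
    'Pho Bo',
    'Masala Noodles',
    'Soba With Tomato',
])

# a few of the groupings listed above, keyed by a hypothetical group label
keyword_groups = {
    'buckwheat': ['Buckwheat', 'Soba'],
    'tomato': ['Tomato'],
    'masala': ['Masala'],
    'pho': ['Pho'],
}

# number of names matching at least one keyword in each group
group_counts = pd.Series({
    group: names.str.contains('|'.join(words)).sum()
    for group, words in keyword_groups.items()
})
print(group_counts)
```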

## Choosing Feasible Features
Now that we have row counts for each grouping, we can begin to consider what features are viable. Here is what we should consider:
1. A feature must have at least two values.
    * EX: the 'meat_type' feature has 'chicken', 'beef', 'pork', etc. as values
2. A crosstab of the feature against the target must have more than five observations in each cell for the chi-square statistical test.
3. The feature's values should not overlap with one another.
    * EX: 'taste_type' should not have separate 'sweet' and 'sour' values because some ramen have 'sweet & sour' in the product name
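
The second and third requirements can be checked programmatically. This is a minimal sketch on made-up data: the `toy` frame, its 'flavor' values, and the example names are hypothetical, and only the `five_stars` column name comes from the train split.

```python
import pandas as pd

# hypothetical feature values and target for illustration only
toy = pd.DataFrame({
    'flavor': ['chicken'] * 20 + ['beef'] * 20 + ['tofu'] * 4,
    'five_stars': ([True] * 8 + [False] * 12) * 2 + [True, False, False, False],
})

# cell-count check: each value needs more than five rows in both target columns
feature_crosstab = pd.crosstab(toy.flavor, toy.five_stars)
passes_cell_count = (feature_crosstab > 5).all(axis=1)
print(passes_cell_count)

# overlap check: count names that would fall under two values at once
names = pd.Series(['Sweet & Sour Noodles', 'Sweet Chili', 'Hot & Sour Soup'])
is_sweet = names.str.contains('Sweet')
is_sour = names.str.contains('Sour')
overlap = (is_sweet & is_sour).sum()
print("names matching both 'Sweet' and 'Sour':", overlap)
```

Here 'tofu' fails the cell-count check, and a nonzero overlap signals that 'sweet' and 'sour' should not be separate values of the same feature.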
    
### Features that Pass the Above Requirements
- noodle_type: 
    * **wheat** ('Udon', 'Udoin', 'Sano', 'Spaghetti', 'Carbonara', 'Neapolitan', 'Napolitan') (60)
    * **buckwheat** ('Buckwheat', 'Soba') (18)
    * **rice** ('Rice', 'Mi', 'Vermicelli', 'Vernicalli', 'Bihun') (261)
- flavor: 
    * **miso** ('Miso') (23)
    * **chicken** ('Chicken', 'Chikin', 'Duck', 'Pollo', 'Buldalk') (205)
    * **beef** ('Beef', 'Gomtang', 'Seolleongtang', 'Sukiyaki', 'Nam Tok', 'Sutah') (152)
    * **pork** ('Pork', 'Jjajangmen', 'Jiajang', 'Tonkotsu', 'Tomkotsu', 'Bacon') (106)
    * **crustacean** ('Crab', 'Lobster', 'Shrimp', 'Prawn') (108)
    * **mollusk** ('Bajirak', 'Clam', 'Abalone', 'Scallop', 'Vongole') (15)
    * **chili** ('Chili', 'Chilli', 'chili') (35)
    * **curry** ('Curry') (68)
    * **chow_mein** ('Chow Mein') (25)
    * **kimchi** ('Kimchi', 'Sabalmyeon') (22)
    * **mushroom** ('Mushroom', 'Shiitake', 'Shitake') (35)
    * **tomato** ('Tomato') (21)
    * **veggie** ('Clear', 'Veg') (85)
    * **sesame** ('Sesame') (32)
- spicy:
    * **True** ('Spicy', 'Spice', 'Shin', 'Jjamppong'/'Jjambbong'(seafood), 'Buldalk'(chicken), 'Sutah'(beef), 'Budae'(sausage), 'Habanero', 'Jinjja', 'Jin', 'Yeul', 'Mala', 'Teumsae', 'Bibim', 'Picante', 'Bulnak', 'Volcano', 'Odongtong', 'Sriracha', 'Arrabiata', 'Tom Yum', 'Tom Yam', 'Tom Saab', 'Tom Klong', 'Suki', 'Laksa') (304)
    * **False** (Everything not listed)
- fried:
    * **True** ('Stir Fry', 'Bokkeum', 'Tteokbokki', 'Tteobokki', 'Topokki', 'Yukgaejang', 'Rabokki', 'Yakisoba', 'Yaki-Soba', 'Yakiosoba', 'Fried', 'Goreng', 'Tempura') (123)
    * **False** (Everything not listed)
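
As a rough illustration of building the `fried` feature while leaving out 'Non-Fried' products, the sketch below uses a negative lookbehind in the regex. The product names are made up, only a subset of the 'fried' keywords is used, and this is one possible approach rather than the implementation that will go into explore.py.

```python
import pandas as pd

# hypothetical product names for illustration only
names = pd.Series([
    'Mi Goreng Fried Noodles',
    'Non-Fried Ramen',
    'Yakisoba',
    'Chicken Soup',
])

# a subset of the 'fried' keywords listed above
fried_terms = ['Stir Fry', 'Yakisoba', 'Fried', 'Goreng', 'Tempura']

# (?<!Non-) is a negative lookbehind: skip a keyword when 'Non-' comes right before it
fried_pattern = r'(?<!Non-)(?:' + '|'.join(fried_terms) + r')'
fried = names.str.contains(fried_pattern)
print(fried.tolist())
```

Note that a name containing both 'Non-Fried' and another keyword such as 'Goreng' would still be flagged, so a stricter rule may be needed depending on the data.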