# Overview
This notebook explores the cleaned ramen-ratings CSV. Any engineered features or scripts will be added to explore.py.

# Findings
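1. Ramen packaging and five-star ratings show no significant relationship (p ≈ 0.44), so we treat them as independent.
2. Ramen country of origin and five-star ratings **have a dependent relationship** (p < 0.001).
3. Ramen brand and five-star ratings show no significant relationship (p ≈ 0.22), so we treat them as independent.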

In [1]:
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

import wrangle

In [2]:
# create train split for exploration
train, _, _ = wrangle.prep_explore()
print('')

# check work
train.info()

Train size: (1515, 5) Validate size: (506, 5) Test size: (506, 5)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1515 entries, 2566 to 2127
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   brand       1515 non-null   object
 1   name        1515 non-null   object
 2   package     1515 non-null   object
 3   country     1515 non-null   object
 4   five_stars  1515 non-null   bool  
dtypes: bool(1), object(4)
memory usage: 60.7+ KB


# Initial Exploration of Ramen Packaging
Let's check if there's a dependent relationship between packaging and our target.

Hypotheses:
- $H_0$: Packaging and five-star ratings are independent.
- $H_a$: Packaging and five-star ratings have a dependent relationship.

Confidence level: 95% ($\alpha = 0.05$)

In [3]:
# set significance level (alpha) for a 95% confidence level
alpha = .05

In [4]:
# check dependence of packaging and target
package_5star_crosstab = pd.crosstab(train.package, train.five_stars)
_, p, _, _ = stats.chi2_contingency(package_5star_crosstab)

In [5]:
# compare p-value to the significance level
if p < alpha:
    print("Reject H0: packaging and five-star ratings appear dependent (alpha = 0.05).")
    print("p-value:", p)
else:
    print("Fail to reject H0: no significant evidence that packaging and five-star ratings are dependent (alpha = 0.05).")
    print("p-value:", p)

Fail to reject H0: no significant evidence that packaging and five-star ratings are dependent (alpha = 0.05).
p-value: 0.43924606238117814


**We found no significant evidence that packaging and five-star ratings are dependent** (p ≈ 0.44), so we treat them as independent. We will not use 'package' in our predictive model.

# Initial Exploration of Ramen Country of Origin
Let's check if there's a dependent relationship between country and our target.

Hypotheses:
- $H_0$: Country of origin and five-star ratings are independent.
- $H_a$: Country of origin and five-star ratings have a dependent relationship.

Confidence level: 95% ($\alpha = 0.05$)

In [6]:
# set significance level (alpha) for a 95% confidence level
alpha = .05

In [7]:
# create crosstab for chi-square statistical test
country_5star_crosstab = pd.crosstab(train.country, train.five_stars)
# keep only countries with more than 5 observed reviews in each rating group
# (a rough stand-in for chi-square's assumption about expected counts per cell)
enough_values_mask = (country_5star_crosstab[False] > 5) & (country_5star_crosstab[True] > 5)
# run chi-square test on the filtered rows
_, p, _, _ = stats.chi2_contingency(country_5star_crosstab.loc[enough_values_mask])

In [8]:
# compare p-value to the significance level
if p < alpha:
    print("Reject H0: country of origin and five-star ratings appear dependent (alpha = 0.05).")
    print("p-value:", p)
else:
    print("Fail to reject H0: no significant evidence that country of origin and five-star ratings are dependent (alpha = 0.05).")
    print("p-value:", p)

Reject H0: country of origin and five-star ratings appear dependent (alpha = 0.05).
p-value: 1.680501270556502e-06


**Country of origin and five-star ratings have a dependent relationship.** We will further explore country of origin and consider using it in our model.
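
The mask in this section filters on observed counts, while the chi-square assumption is usually stated in terms of expected counts (a common rule of thumb is at least 5 per cell). Here is a minimal sketch of checking expected counts directly, using the fourth value returned by `stats.chi2_contingency`. The `toy` data below is hypothetical and only stands in for the train split.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for the train split (not the real ramen data)
toy = pd.DataFrame({
    "country": ["Japan"] * 40 + ["USA"] * 40 + ["Fiji"] * 4,
    "five_stars": (
        [True] * 15 + [False] * 25    # Japan
        + [True] * 5 + [False] * 35   # USA
        + [True] * 1 + [False] * 3    # Fiji
    ),
})

crosstab = pd.crosstab(toy["country"], toy["five_stars"])

# chi2_contingency returns the expected counts under independence as its fourth value
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
expected = pd.DataFrame(expected, index=crosstab.index, columns=crosstab.columns)

# Flag countries where any expected count falls below 5
low_expected = (expected < 5).any(axis=1)
print(expected.round(2))
print("Countries with an expected count below 5:", list(low_expected[low_expected].index))
```

Countries flagged this way could be dropped or pooled into an "Other" group before rerunning the test; which option suits this dataset is a judgment call.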

# Initial Exploration of Ramen Brand
Let's check if there's a dependent relationship between ramen brand and our target.

Hypotheses:
- $H_0$: Ramen brand and five-star ratings are independent.
- $H_a$: Ramen brand and five-star ratings have a dependent relationship.

Confidence level: 95% ($\alpha = 0.05$)

In [9]:
# set significance level (alpha) for a 95% confidence level
alpha = .05

In [10]:
# create crosstab for chi-square statistical test
brand_5star_crosstab = pd.crosstab(train.brand, train.five_stars)
# keep only brands with more than 5 observed reviews in each rating group
# (a rough stand-in for chi-square's assumption about expected counts per cell)
enough_values_mask = (brand_5star_crosstab[False] > 5) & (brand_5star_crosstab[True] > 5)
# run chi-square test on the filtered rows
_, p, _, _ = stats.chi2_contingency(brand_5star_crosstab.loc[enough_values_mask])

In [11]:
# compare p-value to the significance level
if p < alpha:
    print("Reject H0: ramen brand and five-star ratings appear dependent (alpha = 0.05).")
    print("p-value:", p)
else:
    print("Fail to reject H0: no significant evidence that ramen brand and five-star ratings are dependent (alpha = 0.05).")
    print("p-value:", p)

Fail to reject H0: no significant evidence that ramen brand and five-star ratings are dependent (alpha = 0.05).
p-value: 0.22292206264805428


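**We found no significant evidence that ramen brand and five-star ratings are dependent** (p ≈ 0.22, among brands with enough reviews to test), so we treat them as independent. We will not use 'brand' in our predictive model.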
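
The packaging, country, and brand tests all follow the same pattern: build a crosstab, optionally filter sparse rows, run the chi-square test, and compare the p-value to alpha. That pattern could be wrapped in a helper, sketched below. The function name `chi2_independence`, its parameters, and the `toy` data are all hypothetical and are not part of `wrangle` or `explore.py`.

```python
import pandas as pd
from scipy import stats


def chi2_independence(df, feature, target="five_stars", alpha=0.05, min_count=None):
    """Hypothetical helper: chi-square test of independence between feature and target.

    If min_count is given, keep only crosstab rows where every observed
    count exceeds it (mirroring the masks used for country and brand).
    Returns the p-value and whether H0 is rejected at the given alpha.
    """
    crosstab = pd.crosstab(df[feature], df[target])
    if min_count is not None:
        crosstab = crosstab.loc[(crosstab > min_count).all(axis=1)]
    _, p, _, _ = stats.chi2_contingency(crosstab)
    return p, bool(p < alpha)


# Hypothetical toy data standing in for the train split
toy = pd.DataFrame({
    "package": ["Cup"] * 60 + ["Pack"] * 60,
    "five_stars": [True] * 50 + [False] * 10 + [True] * 10 + [False] * 50,
})

p, reject = chi2_independence(toy, "package", min_count=5)
print(f"p-value: {p:.3g}, reject H0: {reject}")
```

With a helper like this, each section would reduce to one call per feature, and the alpha would be set in a single place.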

# Breaking Down the Ramen Product Name
Nearly all ramen reviews in our dataset have a unique combination of brand and product name. Of the 1,494 brand-and-name combinations in our exploration split, only 21 have two reviews, and none has more than two.

In [12]:
# count reviews per brand+name combination, then show repeats and non-repeats
review_counts = train[['brand','name']].value_counts()
print("Brand+name with only one review:", (review_counts == 1).sum())
print("Brand+name with two reviews:", (review_counts == 2).sum())
print("Brand+name with more than two reviews:", (review_counts > 2).sum())

Brand+name with only one review: 1473
Brand+name with two reviews: 21
Brand+name with more than two reviews: 0


Because almost every product name is unique, the full name cannot serve as a categorical feature. Instead, we will split out certain keywords in product names to use as features.

## Identifying Keywords to Use
In an earlier analysis, I looked at ramen with the word 'flavor' in the product name. These are the keyword counts I found:
- Chicken (48); Beef (30); Pork (16); Shrimp (12); Fish/Seafood (10); Chow Mein (6);
- Chili/Chilli (7); Curry (6); Kimchi (5); Stir-Fry (4);
- Tom Yum (6); Tonkotsu (5); Miso (4); Wonton (2); Jjamppong (1); Shiodare (1); Bulgogi (1); Bulalo (1);
- Tomato (6); Mushroom (5); Sesame (5); Onion (3); Shiitake (1);
- Spicy (26); Hot (12); Soy (6); Sour (6); Sweet (2); Umami [savory] (2);
- Sriracha (3); 
- Abalone (4);
- Vegetable (11);
- has_Lime (8);
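
Counts like the ones listed above could be produced roughly as follows. This is a minimal sketch, not the earlier analysis itself: the `names` Series is hypothetical, and the case-insensitive matching and the regex for the Chili/Chilli spelling variants are assumptions.

```python
import pandas as pd

# Hypothetical product names standing in for train['name']
names = pd.Series([
    "Chicken Flavor",
    "Artificial Spicy Beef",
    "Hot & Spicy Chicken Flavour",
    "Tom Yum Shrimp",
    "Chilli Chicken",
])

# Label -> regex pattern; the optional "l" covers the Chili/Chilli spellings
keywords = {
    "Chicken": r"Chicken",
    "Beef": r"Beef",
    "Spicy": r"Spicy",
    "Chili/Chilli": r"Chill?i",
}

# Count how many names contain each keyword (case-insensitive)
keyword_counts = pd.Series({
    label: names.str.contains(pattern, case=False, regex=True).sum()
    for label, pattern in keywords.items()
}).sort_values(ascending=False)

print(keyword_counts)
```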

I will use these words to search the train split for more potential keywords. The additional keywords I find interesting are listed here, each with the question it raises:
- Instant or Minute (does the name indicate a worse rating?)
- Artificial or Imitation (do artificial products rate worse than others?)
- Stew (do stews outperform other styles of ramen?)
- Good or Premium or Delicious (does the name indicate a better rating?)
- Less Sodium (does less sodium make for a worse-rated product?)
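
The keyword groups above could be turned into boolean features and compared against the target along these lines. This is a sketch under stated assumptions: the `toy` rows and the feature names (`has_instant`, `has_artificial`, and so on) are hypothetical, and plain substring matching may catch unintended words (for example, "Good" would also match "Goodness").

```python
import pandas as pd

# Hypothetical rows standing in for the train split
toy = pd.DataFrame({
    "name": [
        "Instant Noodles Chicken Flavour",
        "2 Minute Noodles Masala Spicy",
        "Artificial Stew Beef",
        "Premium Roasted Beef Flavour",
        "Tom Yum Shrimp",
        "Hot & Spicy Chicken Less Sodium",
    ],
    "five_stars": [False, False, True, True, True, False],
})

# Hypothetical feature name -> regex alternation for each keyword group
keyword_groups = {
    "has_instant": r"Instant|Minute",
    "has_artificial": r"Artificial|Imitation",
    "has_stew": r"Stew",
    "has_premium": r"Good|Premium|Delicious",
    "has_less_sodium": r"Less Sodium",
}

# One boolean column per keyword group
for feature, pattern in keyword_groups.items():
    toy[feature] = toy["name"].str.contains(pattern, regex=True)

# Share of five-star ratings with and without each keyword group
for feature in keyword_groups:
    print(toy.groupby(feature)["five_stars"].mean(), end="\n\n")
```

On the real train split, each boolean column could then be passed through the same chi-square test used for packaging, country, and brand.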

In [13]:
# chicken (highest word count in previous analysis)
train[train.name.str.contains('Chicken')].name.value_counts().head(10)

Chicken                                                        5
Instant Noodles Chicken Flavour                                4
Artificial Chicken                                             3
Chicken Abalone Flavour                                        2
Imitation Chicken Vegetarian                                   2
Chicken Flavor                                                 2
Artificial Chicken Rice Vermicelli                             1
Good Chicken Bean Vermicelli                                   1
Pan Asian Kitchen Sweet & Sour Chicken Flavor Ramen Noodles    1
Spoon-it Creamy Chicken                                        1
Name: name, dtype: int64

In [14]:
# beef (second highest word count in previous analysis)
train[train.name.str.contains('Beef')].name.value_counts().head(10)

Beef                                            5
Artificial Spicy Beef                           3
Artificial Stew Beef                            2
Premium Instant Noodles Roasted Beef Flavour    2
Kung Fu Artificial Beef Rice Noodle             2
Instant Noodles Beef Flavour                    2
Hot & Sour Beef Noodles                         2
Hot & Spicy Beef                                2
Artificial Beef Flavor                          2
Beef Tongue Shio Mayo Ramen                     1
Name: name, dtype: int64

In [15]:
# spicy (third highest word count in previous analysis)
train[train.name.str.contains('Spicy')].name.value_counts().head(10)

Artificial Spicy Beef                                                    3
Sichuan Spicy Flavor                                                     2
Hot & Spicy Beef                                                         2
Chongqing Spicy Hot Noodles                                              1
What’s That? Leisure Meatballs Spicy Chicken Flavor                      1
Nature Is Delicious Spicy                                                1
Bowl Noodles Hot & Spicy Chicken Flavor Less Sodium Ramen Noodle Soup    1
Spicy Shrimp Cup Noodle                                                  1
2 Minute Noodles Hungrooo Masala Spicy                                   1
Guanmiao Dried Noodles With Spicy Sauce                                  1
Name: name, dtype: int64