# Overview
This project analyzes the Ramen Ratings dataset from Kaggle. The dataset has a few thousand ramen products and their ratings, from 0 to 5 stars. The data includes the review number, the ramen's brand, product name, packaging style, country of origin, rating, and whether or not the ramen is in the top 10. I framed this as a one-vs-rest classification problem: is a ramen rated five stars or not? The analysis uses independence tests and feature engineering to identify key drivers of five-star ratings, then uses these features to build a predictive model. The project identified key drivers using these methods and built a predictive model that performs better than the baseline model.

## The cool new stuff I accomplished for this project
- **Heavy keyword engineering**
    * Domain research to understand ramen products based on keyword
    * Translation to bring all keywords to consistent categories
    * Categorization on common factors based on domain research and translation
- **Multi-layered statistical testing to eliminate features**
    * Chi-Square tests to eliminate initial features that are not related to target
    * One-hot encoding of remaining features' categories
    * Chi-Square tests to eliminate one-hot-encoded categories that are not related to target
- **Clustering country and keyword features into low-, medium-, and high-rate five-star ratings groups**
    * Checked proportions of five-star rating counts against not-five-star rating counts for True in encoded feature
    * Checked proportions of five-star rating counts against not-five-star rating counts for False in encoded feature
    * Compared five-star proportions to check increase/decrease in proportion from False to True
    * Bracketed increasing, middle, and decreasing proportions from False to True
    
## Other stuff that I've done before
- Wrangle
    * Categorize and encode target into five_stars column (classes: is five-stars, isn't five stars)
    * Fix some values, drop some nulls, outliers, and duplicate rows, get rid of unnecessary columns
    * Create univariate visualizations
- Explore
    * Run Chi-Square testing to determine if feature is related to target
    * Feature engineering (overall)
    * Create bivariate visualizations
    * Choose features for model
- Model
    * Choose optimization priorities for the model (F1 Score)
    * Resample the target to address class imbalance
    * Create baseline model and multiple algorithmic models with varying hyperparameter combinations
    * Evaluate models on Validate (first out-of-sample split)
    * Choose best model in terms of our optimization priority
    * Calculate ROC AUC of baseline and best model
    * Evaluate baseline and best model on Test split
    
## Findings
1. The brand of ramen does not influence whether or not the ramen product has a five-star rating.
1. The packaging of ramen does not influence whether or not the ramen product has a five-star rating.
1. A ramen's country of origin has an influence on whether or not the ramen product has a five-star rating.
    - Malaysia has the highest five-star rating proportion of all origin countries.
    - Ramen originating from Malaysia, Singapore, or Taiwan have the highest proportion of five-star ratings.
    - Ramen from Hong Kong, Japan, South Korea, or Indonesia have the next-highest proportions of five-star ratings.
    - Ramen from China, Thailand, or USA have the lowest proportions of five-star ratings.
    - China has the lowest five-star rating proportion of all origin countries.
1. A ramen's noodle type does not influence whether or not the ramen product has a five-star rating.
1. A ramen's flavor influences whether or not the ramen product has a five-star rating.
    - Curry flavor has the highest proportion of five-star ratings for all flavor categories.
    - Ramen with curry or sesame flavor have the highest proportions of five-star ratings.
    - Ramen with pork flavor or the common crustaceans have the next-highest proportions of five-star ratings.
    - Chicken- and beef-flavored ramen products have the lowest five-star rating proportions of all flavors.
    - Chicken flavor has the lowest five-star rating proportion of all flavors.
1. A ramen's spicy status influences whether or not the ramen product has a five-star rating.
1. A ramen's fried status does not have an effect on whether or not the ramen product has a five-star rating.

## Model Results
- Features used: country and flavor brackets (as described above) and spicy status
- Evaluation Metric: F1 Score
- Best model: Logistic Regression
- Model performance: outperforms the baseline on F1 Score and ROC AUC for unseen data

# Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, auc, roc_auc_score

import wrangle
import model

# Wrangle
## Bottom Line Up Front: What I Did for Wrangle
1. Acquire ramen-ratings.csv from Kaggle
1. Rename a United States value to USA
1. Drop low-count ramen styles Box, Can, and Bar (8 rows)
1. Drop countries with fewer than 5 cumulative observations (29 rows)
1. Drop Unrated, nulls and duplicates (16 rows)
1. Replace Stars column with five_stars column
1. Drop 'Review #' and 'Top Ten' columns
1. Rename columns for easier exploration
1. Split cleaned data into Train, Validate, and Test splits for exploration and modeling
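
The split in the last step lives in `wrangle.py`, which is not shown here. As a hedged illustration only, a stratified 60/20/20 split would be consistent with the sizes printed below (1515 / 506 / 506). The function name `split_60_20_20`, the seed, and the use of stratification are assumptions, not the script's actual code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_60_20_20(df, target="five_stars", seed=42):
    """Hypothetical sketch: split a DataFrame into train/validate/test (60/20/20),
    stratifying on the target so each split keeps a similar five-star rate."""
    # hold out 20% of all rows as the Test split
    train_validate, test = train_test_split(
        df, test_size=0.20, stratify=df[target], random_state=seed
    )
    # 25% of the remaining 80% is 20% of the original data
    train, validate = train_test_split(
        train_validate, test_size=0.25, stratify=train_validate[target], random_state=seed
    )
    return train, validate, test

# toy data standing in for the cleaned ramen DataFrame
toy = pd.DataFrame({"five_stars": [True] * 20 + [False] * 80, "row_id": range(100)})
train_toy, validate_toy, test_toy = split_60_20_20(toy)
print(train_toy.shape, validate_toy.shape, test_toy.shape)
```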

In [2]:
# wrangle.py script to wrangle the data as described above
train, _, _ = wrangle.prep_explore()

train.head(3)

Train size: (1515, 5) Validate size: (506, 5) Test size: (506, 5)


| | brand | name | package | country | five_stars |
|---|---|---|---|---|---|
| 2566 | Samyang | Hot | Pack | South Korea | False |
| 332 | Nongshim | Shin Noodle Soup | Cup | USA | True |
| 1363 | Doll | Hello Kitty Dim Sum Noodle Japanese Curry Flavour | Cup | Hong Kong | False |


# Explore
## Bottom Line Up Front: What I Did for Explore
- Statistical testing on Ramen Brands that found brand is independent of five-star outcomes
- Statistical testing on Ramen Packaging that found packaging is independent of five-star outcomes
- Statistical testing on Country of Origin that found country is related to five-star outcomes
- Keyword engineering to categorize ramen products into noodle type, flavor, spicy status, and fried status categories
- Statistical testing on new features that found noodle type and fried status have no impact on five-star outcomes
- Statistical testing on new features that found ramen flavor and spicy status have an impact on five-star outcomes
- Dropped specific countries and flavors that did not have at least 5 reviews with five-star rating
- Analyzed proportions of five-star reviews to all reviews for each country and flavor category
- Grouped into high-, medium-, and low-proportion brackets for country and for flavor category
- Checked country and flavor category brackets along with spicy status in terms of five-star and non-five-star reviews
- Chose these features for modeling

## Ramen Brand is Independent
Hypotheses:
- $H_0$: Ramen brand and five-star ratings are independent.
- $H_a$: Ramen brand and five-star ratings have a dependent relationship.

Confidence level: 95% ($\alpha = 0.05$)

In [3]:
# set significance level (alpha) for a 95% confidence level
alpha = .05
# create crosstab for chi-square statistical test
brand_5star_crosstab = pd.crosstab(train.brand, train.five_stars)
# limit only to brands with sufficient value counts in crosstab (an assumption of chi-square)
enough_values_mask = (brand_5star_crosstab[False] > 5) & (brand_5star_crosstab[True] > 5)
# run chi-square test
_, p, _, _ = stats.chi2_contingency(brand_5star_crosstab[enough_values_mask])

# check if p is significant
if p < alpha:
    print("Ramen brand and five-star ratings have a dependent relationship with 95% confidence.")
    print("p-value:", p)
else:
    print("Ramen brand and five-star ratings are independent, did not pass 95% confidence interval.")
    print("p-value:", p)

Ramen brand and five-star ratings are independent, did not pass 95% confidence interval.
p-value: 0.22292206264805428
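
The crosstab, filter, and test pattern above repeats for packaging and country, so it could be wrapped in one helper. The following is a minimal sketch, not code from this notebook: the function name `chi2_independence` and its `min_count` parameter are hypothetical, and the toy data is made up. Note that a p-value above alpha means we fail to reject $H_0$; it does not prove independence.

```python
import pandas as pd
from scipy import stats

def chi2_independence(df, feature, target="five_stars", alpha=0.05, min_count=5):
    """Chi-square test of independence between a categorical feature and a boolean target.
    Categories with any observed crosstab cell at or below min_count are dropped first."""
    crosstab = pd.crosstab(df[feature], df[target])
    # keep only categories where every cell has more than min_count observations
    enough = (crosstab > min_count).all(axis=1)
    _, p, _, _ = stats.chi2_contingency(crosstab[enough])
    return {
        "feature": feature,
        "p": p,
        "reject_null": bool(p < alpha),
        "categories_tested": int(enough.sum()),
    }

# toy data: A skews five-star, B skews the other way, C is too small to test
toy = pd.DataFrame({
    "country": ["A"] * 50 + ["B"] * 50 + ["C"] * 5,
    "five_stars": [True] * 40 + [False] * 10 + [True] * 10 + [False] * 40 + [True] * 2 + [False] * 3,
})
result = chi2_independence(toy, "country")
print(result)
```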


## Ramen Packaging is Independent
Hypotheses:
- $H_0$: Packaging and five-star ratings are independent.
- $H_a$: Packaging and five-star ratings have a dependent relationship.

Confidence level: 95% ($\alpha = 0.05$)

In [4]:
# set significance level (alpha) for a 95% confidence level
alpha = .05
# check dependence of packaging and target
package_5star_crosstab = pd.crosstab(train.package, train.five_stars)
_, p, _, _ = stats.chi2_contingency(package_5star_crosstab)
# check if p is significant
if p < alpha:
    print("Packaging and five-star ratings have a dependent relationship with 95% confidence.")
    print("p-value:", p)
else:
    print("Packaging and five-star ratings are independent, did not pass 95% confidence interval.")
    print("p-value:", p)

Packaging and five-star ratings are independent, did not pass 95% confidence interval.
p-value: 0.43924606238117814


## Ramen Country of Origin is Related
Hypotheses:
- $H_0$: Country of origin and five-star ratings are independent.
- $H_a$: Country of origin and five-star ratings have a dependent relationship.

Confidence level: 95% ($\alpha = 0.05$)

In [5]:
# set significance level (alpha) for a 95% confidence level
alpha = .05
# create crosstab for chi-square statistical test
country_5star_crosstab = pd.crosstab(train.country, train.five_stars)
# limit only to countries with sufficient value counts in crosstab (an assumption of chi-square)
enough_values_mask = (country_5star_crosstab[False] > 5) & (country_5star_crosstab[True] > 5)
# run chi-square test
_, p, _, _ = stats.chi2_contingency(country_5star_crosstab[enough_values_mask])
# check if p is significant
if p < alpha:
    print("Country of origin and five-star ratings have a dependent relationship with 95% confidence.")
    print("p-value:", p)
else:
    print("Country of origin and five-star ratings are independent, did not pass 95% confidence interval.")
    print("p-value:", p)

Country of origin and five-star ratings have a dependent relationship with 95% confidence.
p-value: 1.680501270556502e-06


## Breaking Down the Ramen Product Name
### Why We Need to Engineer Features for Product Name
Nearly all ramen reviews in our dataset have a unique combination of brand and product name. Only 21 of the nearly 1500 brand-and-product combinations in our exploration split have two reviews, and no combination has more than two. Because so few product names repeat, we can't run initial chi-square tests to see if product names have a dependent relationship with five-star reviews.

In [6]:
# show the review repeats and non-repeats
print("Brand+name with only one review:", (train[['brand','name']].value_counts() == 1).sum())
print("Brand+name with two reviews:", (train[['brand','name']].value_counts() == 2).sum())
print("Brand+name with more than two reviews:", (train[['brand','name']].value_counts() > 2).sum())

Brand+name with only one review: 1473
Brand+name with two reviews: 21
Brand+name with more than two reviews: 0


### Solving Uniqueness Through Categorizing Keywords
The following keyword mask contains all the words I've designated as keywords. The sections below break this list down into readable groups.

In [7]:
# identify all keywords, prepare list for df.col.str.contains()
keyword_mask = '|'.join(['Vermicelli', 'Vernicalli', 'Bihun', 'Sano', 'Chicken', 'Chikin', 'Duck',
                         'Vegetable', 'Veggie', 'Vegetarian','Beef', 'Gomtang', 'Seolleongtang', 'Sukiyaki', 
                         'Nam Tok', 'Pork', 'Jjajangmen', 'Jiajang', 'Tonkotsu', 'Tomkotsu', 'Bacon', 'Budae',
                         'Seafood', 'Crab', 'Anchovy', 'Bajirak', 'Clam', 'Abalone', 'Scallop', 'Vongole', 
                         'Salmon', 'Lobster', 'Shrimp', 'Prawn', 'Tuna', 'Tteok', 'Rabokki', 'Raobokki',
                         'Spicy', 'Spice', 'Shin', 'Jjamppong', 'Jjambbong', 'Buldalk', 'Sutah', 'Budae', 
                         'Habanero', 'Jinjja', 'Jin', 'Yeul', 'Mala', 'Teumsae', 'Bibim', 'Picante', 'Bulnak', 
                         'Volcano', 'Odongtong', 'Sriracha', 'Arrabiata', 'Tom Yum', 'Tom Yam', 'Tom Saab', 
                         'Tom Klong', 'Suki', 'Stir Fry', 'Bokkeum', 'Tteokbokki', 'Topokki', 'Yukgaejang', 
                         'Rabokki', 'Yakisoba', 'Yaki-Soba', 'Yakiosoba', 'Fried', 'Goreng', 'Ramyonsari', 
                         'Keopnurungji', 'Sabalmyeon', 'Miso', 'Teriyaki', 'Mushroom', 'Udon', 'Udoin', 
                         'Tomato', 'Chili', 'Chilli', 'chili', 'Wonton', 'Wantan', 'Pickled', 'Sesame', 
                         'Superior', 'Carbonara', 'Chow Mein', 'Sweet', 'Pad Thai', 'Sour', 'sour', 'Curry', 
                         'Soy', 'Shoyu', 'Shiitake', 'Shitake', 'Tofu', 'Pho', 'Clear', 'Egg', 'Tempura', 
                         'Laksa', 'Buckwheat', 'Soba', 'Salt', 'Shio', 'Sio', 'Tomato', 'Neapolitan', 
                         'Napolitan', 'Spaghetti', 'Mayo', 'Barbecue', 'BBQ', 'Masala', 'Kimchi', 'Veg',
                         'Tteobokki', 'Rice', 'Onion', 'Pollo', 'Cheese', 'Betawi', 'Chah Chiang',
                         'Namja', 'Perisa', 'Kari', 'Jjawang', 'Jjajangmyeon', 'Sogokimyun', 'Jjajang',
                         'Ossyoi', 'Befikr', 'curry', 'Sotanghon', 'U-Dong', 'U-dong', 'Mi Goreng', 'Kocok',
                         'Chacharoni', 'Yakibuta', 'Cuchareable', 'RMy', 'Jalapeno', 'Biryani', 'Carne',
                         'Kimchee', 'Pad Kee Mao', 'Kalguksoo', 'Prok', 'Nipis', 'Jjampong', 'Buldak',
                         'tom Yum', 'Sesami', 'Kim Chee', 'Kebab', 'Hyoubanya', 'Batchoy', 'Gentong',
                         'Kokomen', 'Requeijao', 'Champong', 'Gallina', 'Bulalo', 'Wasabi', 'Kalamansi',
                         'Cabe', 'Oosterse', 'Kung Pao'])

# create True/False for whether the row contains a keyword in the product name
train['has_keyword'] = train.name.str.contains(keyword_mask)

Based on the above keywords, I ran the following two cells to check remaining values that I missed earlier. If I found a notable word, I added it to the above keyword list and re-ran the cells. I repeated this process until I was satisfied with the words I had designated as keywords.

In [8]:
# check if column is mostly True values
train.has_keyword.value_counts()

True     1377
False     138
Name: has_keyword, dtype: int64

In [9]:
# check all rows without keywords for each unique word's value counts in entire list (ran this cell multiple times)
(
    pd.Series( # make a Series of each instance of each word
        ' '.join(
                 train[~train.has_keyword]    # look at rows we haven't caught with a keyword yet
                 .name.tolist()        # put all 'name' cells in a list
                ).split()        # join all lists into one string, then split the string into a list of each word
    ).value_counts()        # calculate the value counts of each word in the series
    .head(10)         # display the top 10 (changed from 30 to 10 after the words I wanted were captured)
)

Noodles    34
Noodle     30
Ramen      19
Instant    16
Cup        14
Flavour    11
Sauce      11
Flavor     10
Rasa        9
Mi          8
dtype: int64

### Grouping Keywords, Checking Value Counts
Now that we have a keyword list, I will organize it into groups and check counts for each grouping. The goal of this is to prepare for group elimination of keywords in the next section. Each grouping will be considered as one feature or value; for example, the **'noodle_type'** feature would have a 'noodle' value that covers ['Noodle', 'Myeon', 'Myon']. 

#### Noodle Type
* 'Noodle', 'Myeon', 'Myon' (665)
* 'Udon', 'Udoin' (49)
* 'Miso' (23)
* 'Rice', 'Mi' (239)
* 'Vermicelli', 'Vernicalli', 'Bihun', 'Sano' (43)
* 'Rice Cake', 'Tteok', 'Rabokki', 'Raobokki' (4)
* 'Wonton', 'Wantan' (5)
* 'Spaghetti', 'Carbonara', 'Neapolitan', 'Napolitan' (10)
* 'Buckwheat', 'Soba' (18)

In [10]:
# checking row count having the above noodle types (ran this cell multiple times)
train.name.str.contains("|".join(['Buckwheat', 'Soba'])).sum()

18

#### Meats
* 'Chicken', 'Chikin', 'Duck', 'Pollo', 'Buldalk' (205)
* 'Beef', 'Gomtang', 'Seolleongtang', 'Sukiyaki', 'Nam Tok', 'Sutah' (152)
* 'Pork', 'Jjajangmen', 'Jiajang', 'Tonkotsu', 'Tomkotsu', 'Bacon', 'Budae' (108)
* 'Seafood', 'Crab', 'Anchovy', 'Bajirak', 'Clam', 'Abalone', 'Scallop', 'Vongole', 'Salmon', 'Lobster', 'Shrimp', 'Prawn', 'Tuna', 'Jjamppong', 'Jjambbong' (198)
* 'Chili', 'Chilli', 'chili' (35)
* 'Chow Mein' (25)
* 'Egg' (5)
* 'Tofu' (2)
* 'Barbecue', 'BBQ' (9)

In [11]:
# checking row count having the above meat types (ran this cell multiple times)
train.name.str.contains("|".join(['Barbecue', 'BBQ'])).sum()

9

#### Vegetables
* 'Clear', 'Veg' (covers 'Vegetable', 'Veggie', 'Vegetarian' and 'Veg') (85)
* 'Kimchi', 'Sabalmyeon' (22)
* 'Mushroom', 'Shiitake', 'Shitake' (35)
* 'Tomato' (21)

In [12]:
# checking row count having the above veggie types (ran this cell multiple times)
train.name.str.contains("|".join(['Tomato'])).sum()

21

#### Taste
* 'Spicy', 'Spice', 'Shin', 'Jjamppong'/'Jjambbong'(seafood), 'Buldalk'(chicken), 'Sutah'(beef), 'Budae'(sausage), 'Habanero', 'Jinjja', 'Jin', 'Yeul', 'Mala', 'Teumsae', 'Bibim', 'Picante', 'Bulnak', 'Volcano', 'Odongtong', 'Sriracha', 'Arrabiata', 'Tom Yum', 'Tom Yam', 'Tom Saab', 'Tom Klong', 'Suki', 'Laksa' (304)
* 'Ramyonsari', 'Keopnurungji' (2)
* 'Salt', 'Shio', 'Sio' (17)
* 'Soy', 'Shoyu', 'Shouyu', 'Teriyaki' (70)
* 'Mayo' (6)
* 'Cheese' (11)
* 'Sweet' (18)
* 'Sour', 'sour' (19)
* 'Curry' (68)
* 'Sesame' (32)
* 'Pickle' (11)
* 'Masala' (9)

In [13]:
# checking row count having the above taste types (ran this cell multiple times)
train.name.str.contains("|".join(['Masala'])).sum()

9

#### Preparation
* 'Instant', 'Minute', 'Ramyun', 'Jinjja', 'Bibim' (296)
* 'Stir Fry', 'Bokkeum', 'Tteokbokki', 'Tteobokki', 'Topokki', 'Yukgaejang', 'Rabokki', 'Yakisoba', 'Yaki-Soba', 'Yakiosoba', 'Fried', 'Goreng', 'Tempura' (123)
* 'Soup', 'Jjigae', 'Consomme' (109)
* 'Pad Thai' (4)
* 'Pho' (10)

In [14]:
# checking row count for a keyword group (ran this cell multiple times; the last run checked the lime keywords)
train.name.str.contains("|".join(['Lime', 'Jeruk Nipis', 'Kalamansi'])).sum()

11

### Choosing Feasible Features
Now that we have row counts for each grouping, we can begin to consider what features are viable. Here is what we should consider:
1. A feature must have at least two values
    * EX: 'meat_type' feature has 'chicken', 'beef', 'pork', etc. values
1. A crosstab of the feature must have more than five observations in each cell for the chi-square statistical test
1. The feature's values should not overlap with one another
    * EX: 'taste_type' should not have individual 'sweet' and 'sour' values because some ramen have 'sweet & sour' in the product name
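
The overlap requirement can be checked mechanically. The sketch below is one possible approach and not the method used in this notebook: it counts product names that match more than one keyword group. The function name `count_overlaps` and the toy product names are hypothetical.

```python
import re
import pandas as pd

def count_overlaps(names, groups):
    """Count product names that match more than one keyword group.
    names: Series of product names; groups: dict of {value_label: [keywords]}."""
    matches = pd.DataFrame({
        # re.escape keeps keywords literal inside the regex alternation
        label: names.str.contains("|".join(re.escape(word) for word in words))
        for label, words in groups.items()
    })
    # a row overlaps when two or more groups match the same name
    return int((matches.sum(axis=1) > 1).sum())

# toy product names, not taken from the dataset
toy_names = pd.Series(["Sweet & Sour Noodle", "Sweet Chili Ramen", "Hot & Sour Soup", "Plain Noodle"])
taste_groups = {"sweet": ["Sweet"], "sour": ["Sour", "sour"]}
overlap_count = count_overlaps(toy_names, taste_groups)
print("Names matching more than one taste value:", overlap_count)
```

A nonzero count suggests the candidate values should be merged, reordered by priority, or dropped before running a chi-square test.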
    
#### Features that Pass the Above Requirements
- noodle_type: 
    * **wheat** ('Udon', 'Udoin', 'U-Dong', 'U-dong', 'Sano', 'Spaghetti', 'Carbonara', 'Neapolitan', 'Napolitan', 'Kalguksoo') (63)
    * **buckwheat** ('Buckwheat', 'Soba') (18)
    * **rice** ('Rice', 'Vermicelli', 'Vernicalli', 'Bihun', 'Biryani', 'Tteokbokki', 'Tteobokki', 'Topokki', 'Rabokki') (109)
- flavor: 
    * **miso** ('Miso') (23)
    * **chicken** ('Chicken', 'Chikin', 'Duck', 'Pollo', 'Buldalk', 'Buldak', 'Requeijao', 'Gallina') (215)
    * **beef** ('Beef', 'Gomtang', 'Seolleongtang', 'Sukiyaki', 'Nam Tok', 'Sutah', 'Sogokimyun', 'Cuchareable', 'Carne', 'Kebab', 'Gentong', 'Bulalo', 'Yukgaejang') (163)
    * **pork** ('Pork', 'Prok', 'Jjajangmyeon', 'Jjajangmen', 'Jiajang', 'Jjajang', 'Chacharoni', 'Jjawang', 'Tonkotsu', 'Tomkotsu', 'Bacon', 'Ossyoi', 'Yakibuta', 'Batchoy') (115)
    * **crustacean** ('Crab', 'Lobster', 'Shrimp', 'Prawn') (108)
    * **mollusk** ('Bajirak', 'Clam', 'Abalone', 'Scallop', 'Vongole') (15)
    * **chili** ('Chili', 'Chilli', 'chili', 'Cabe') (37)
    * **curry** ('Curry', 'curry', 'Betawi', 'Perisa', 'Kari') (93)
    * **chow_mein** ('Chow Mein') (25)
    * **kimchi** ('Kimchi', 'Kimchee', 'Sabalmyeon', 'Kim Chee') (24)
    * **mushroom** ('Mushroom', 'Shiitake', 'Shitake') (35)
    * **tomato** ('Tomato') (21)
    * **veggie** ('Clear', 'Veg', 'Oosterse') (86)
    * **sesame** ('Sesame', 'Sesami') (33)
    * **lime** ('Lime', 'Jeruk Nipis', 'Kalamansi') (11)
- spicy:
    * **True** ('Spicy', 'Spice', 'Shin', 'Jjamppong'/'Jjambbong'/'Jjampong'/'Champong'(seafood), 'Buldalk'/'Buldak'(chicken), 'Sutah'(beef), 'Budae'(sausage), 'RMy', 'Habanero', 'Jinjja', 'Jin', 'Yeul', 'Mala', 'Teumsae', 'Bibim', 'Picante', 'Bulnak', 'Volcano', 'Odongtong', 'Sriracha', 'Arrabiata', 'Tom Yum', 'Tom Yam', 'tom Yum', 'Tom Saab', 'Tom Klong', 'Suki', 'Laksa', 'Chah Chiang', 'Namja', 'Befikr', 'Mi Goreng', 'Kocok', 'Jalapeno', 'Pad Kee Mao', 'Kokomen', 'Wasabi', 'Kung Pao', 'Kimchi', 'Kimchee', 'Sabalmyeon', 'Kim Chee', 'Nam Tok', 'Sogokimyun', 'Gentong', 'Chili', 'Chilli', 'chili', 'Cabe', 'Yukgaejang', 'Yakisoba', 'Yaki-Soba', 'Yakiosoba') (446)
    * **False** ('Miso', 'Requeijao', 'Seolleongtang', 'Sukiyaki', 'Jjajangmyeon', 'Jjajangmen', 'Jiajang', 'Jjajang', 'Chacharoni', 'Jjawang', 'Ossyoi', 'Batchoy', 'Bajirak', 'Mushroom', 'Shiitake', 'Shitake', 'Tomato', 'Clear') (99)
- fried:
    * **True** ('Stir Fry', 'Bokkeum', 'Tteokbokki', 'Tteobokki', 'Topokki', 'Yukgaejang', 'Rabokki', 'Yakisoba', 'Yaki-Soba', 'Yakiosoba', 'Fried', 'Goreng', 'Tempura', 'Kung Pao', 'Sukiyaki', 'Kebab', 'Gentong', 'Bulalo', 'Jjajangmyeon', 'Jjajangmen', 'Jiajang', 'Jjajang', 'Chacharoni', 'Jjawang', 'Tonkotsu', 'Tomkotsu', 'Bacon', 'Yakibuta', 'Batchoy', 'Chow Mein') (207)
    * **False** ('Requeijao', 'Yakisoba', 'Yaki-Soba', 'Yakiosoba', 'Gomtang', 'Seolleongtang', 'Nam Tok', 'Sutah', 'Sogokimyun', 'Cuchareable', 'Gomtang', 'Yukgaejang', 'Ossyoi', 'Clear') (43)
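
The feature-creation scripts are not written yet, so here is a hedged sketch of how keyword groups like the ones above could become a categorical column. The names `assign_category` and `FLAVOR_KEYWORDS` are hypothetical, the keyword lists are abbreviated, and the first-match-wins rule for names that hit several groups is an assumption.

```python
import re
import pandas as pd

# abbreviated groups copied from the flavor list above; the full lists would go here
FLAVOR_KEYWORDS = {
    "chicken": ["Chicken", "Chikin", "Pollo"],
    "beef": ["Beef", "Gomtang", "Carne"],
    "curry": ["Curry", "curry", "Kari"],
}

def assign_category(names, keyword_groups):
    """Label each product name with the first keyword group it matches.
    Names that match no group are left as missing values."""
    result = pd.Series(pd.NA, index=names.index, dtype="object")
    for label, words in keyword_groups.items():
        pattern = "|".join(re.escape(word) for word in words)
        # only label rows that no earlier group has claimed (assumed tie-break rule)
        hit = names.str.contains(pattern) & result.isna()
        result[hit] = label
    return result

# toy product names, not taken from the dataset
toy_names = pd.Series(["Chicken Curry Ramen", "Spicy Beef Noodle", "Kari Mee", "Plain Noodles"])
toy_flavor = assign_category(toy_names, FLAVOR_KEYWORDS)
print(toy_flavor)
```

Because dictionary order decides ties here, "Chicken Curry Ramen" lands in `chicken`; a different priority order would change that result.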
    
### Creating the Features
#### noodle_type

In [15]:
# build script to create noodle_type feature, run here

#### flavor

In [16]:
# build script to create flavor feature, run here

#### spicy

In [17]:
# build script to create spicy feature, run here

#### fried

In [18]:
# build script to create fried feature, run here (eliminate non-fried)

### Testing New Features for Relationship to Target
#### Feature noodle_type is Independent

In [19]:
# test noodle_type here

#### Feature flavor is Related to Target

In [20]:
# test flavor here

#### Feature spicy is Related to Target

In [21]:
# test spicy here

#### Feature fried is Independent

In [22]:
# test fried here

### Results of Breaking Down Product Name
- Features created:
- Features eliminated:
- Features kept:

# Univariate Look at Our Candidate Features
- Histograms

In [23]:
# create histograms here

# Each Country and Flavor Into Brackets
## Country
### Check Total Reviews and Number of Five-Star Reviews

In [24]:
# check counts

### Bracket Countries Based on Proportion of Five-Star Reviews

In [25]:
# bracket countries into high-, medium-, and low-proportion five star review brackets
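
This cell is still a placeholder. As a rough sketch of the bracketing idea described in the overview (compare the five-star proportion when a category is True against when it is False), something like the following could work. The function name `bracket_by_lift`, the `tol` cut-off of 0.05, and the toy data are all assumptions; the actual cut points used for the findings are not shown in this notebook.

```python
import pandas as pd

def bracket_by_lift(df, feature, target="five_stars", tol=0.05):
    """For each category, compare the five-star rate inside the category (True)
    with the rate outside it (False), then bracket by the size of the change."""
    rows = {}
    for category in df[feature].dropna().unique():
        in_category = df[feature] == category
        rate_in = df.loc[in_category, target].mean()    # five-star proportion when True
        rate_out = df.loc[~in_category, target].mean()  # five-star proportion when False
        change = rate_in - rate_out
        if change > tol:
            bracket = "high"
        elif change < -tol:
            bracket = "low"
        else:
            bracket = "medium"
        rows[category] = {"rate_in": rate_in, "rate_out": rate_out, "bracket": bracket}
    return pd.DataFrame.from_dict(rows, orient="index")

# toy data: three made-up countries with 10 reviews each
toy = pd.DataFrame({
    "country": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "five_stars": [True] * 8 + [False] * 2 + [True] * 2 + [False] * 8 + [True] * 5 + [False] * 5,
})
brackets = bracket_by_lift(toy, "country")
print(brackets)
```

The same function would apply unchanged to the flavor feature by passing a different column name.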

## Flavor
### Check Total Reviews and Number of Five-Star Reviews

In [26]:
# check counts

### Bracket Flavors Based on Proportion of Five-Star Reviews

In [27]:
# bracket flavors into high-, medium-, and low-proportion five star review brackets

## Final Features: Bivariate Look in Terms of Target

In [28]:
# visualize target/non-target charts for each feature here
# include exact numbers and proportions

## Results of Exploration
- Features kept for modeling:

# Model
## Bottom Line Up Front: What I Did for Model
- Prepared entire dataset with model features
- Chose F1 Score as our main evaluation metric because it balances precision and recall, which matters when classes are imbalanced
- Split dataset into Train, Validate, and Test
- Applied SMOTE+Tomek resampling to fix the class imbalance in our target for the Train split
- Built, fit several classification models and hyperparameter combinations on resampled Train split
- Evaluated baseline and all model performances on Validate, chose best model (Logistic Regression)
- Chose not to use Grid Search to optimize hyperparameters due to nature of Logistic Regression hyperparameters
- Evaluated baseline and best model's ROC Curve AUC
- Evaluated baseline and best model on sequestered Test split
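
To illustrate why F1 Score is a reasonable choice over plain accuracy with an imbalanced target, here is a small worked example in plain Python. All numbers are made up for illustration (100 reviews, 15 of them five-star), and the "predict never five-star" baseline is an assumption about what a simple baseline might look like, not necessarily the one used in this project.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for boolean labels (True is the positive class)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))        # true positives
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# hypothetical labels: 15 five-star reviews out of 100
y_true = [True] * 15 + [False] * 85
# hypothetical baseline that never predicts five stars
baseline_pred = [False] * 100
# hypothetical model: finds 9 of the 15 five-star reviews, with 6 false alarms
model_pred = [True] * 9 + [False] * 6 + [True] * 6 + [False] * 79

print("baseline:", accuracy(y_true, baseline_pred), precision_recall_f1(y_true, baseline_pred))
print("model:   ", accuracy(y_true, model_pred), precision_recall_f1(y_true, model_pred))
```

In this toy case the two accuracies are close (0.85 and 0.88), while F1 separates a baseline that finds no five-star reviews from a model that finds most of them.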

# Conclusion
Using nothing more than ramen product names, I was able to build several categorical features that took into account domain knowledge and translation. Some of these keyword-engineered features were statistically related to our target and were used in the model. In the end, our predictive model outperformed the baseline on F1 Score and ROC AUC.