# Overview
This project analyzes the Ramen Ratings dataset from Kaggle, which contains a few thousand ramen products and their ratings on a scale from 0 to 5 stars. Each record includes the review number, brand, product name, packaging style, country of origin, rating, and whether or not the ramen is in the top 10. I framed this as a one-vs-rest classification problem: is a ramen rated five stars or not? The analysis uses independence tests and feature engineering to identify key drivers of five-star ratings, then uses those features to build a predictive model. These methods identified several key drivers, and the resulting model performs better than the baseline.

## The cool new stuff I accomplished for this project
- **Heavy keyword engineering**
    * Domain research to understand ramen products based on keyword
    * Translation to bring all keywords to consistent categories
    * Categorization on common factors based on domain research and translation
- **Multi-layered statistical testing to eliminate features**
    * Chi-Square tests to eliminate initial features that are not related to target
    * One-hot encoding of remaining features' categories
    * Chi-Square tests to eliminate one-hot-encoded categories that are not related to target
- **Bracketing country and keyword features into low-, medium-, and high-proportion five-star groups**
    * Checked proportions of five-star rating counts against not-five-star rating counts for True in encoded feature
    * Checked proportions of five-star rating counts against not-five-star rating counts for False in encoded feature
    * Compared five-star proportions to check increase/decrease in proportion from False to True
    * Bracketed increasing, middle, and decreasing proportions from False to True
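
As a rough illustration of the second chi-square pass (testing one one-hot-encoded category against the target), the sketch below spells out Pearson's chi-square statistic for a 2x2 table using only the standard library. The function name `chi_square_2x2` and the counts are hypothetical, not taken from the project. In practice `scipy.stats.chi2_contingency` does this work; note that it applies Yates' continuity correction to 2x2 tables by default, so its numbers would differ slightly from this uncorrected version.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    table[i][j] = number of rows with encoded category == i and five_stars == j.
    Returns (statistic, p_value). No continuity correction is applied, and
    every row and column total is assumed to be non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = category False/True,
# columns = not five stars / five stars
stat, p = chi_square_2x2([[900, 100], [80, 40]])
print(f"chi2 = {stat:.2f}, p = {p:.4g}")
```

A category whose p-value stays above the chosen significance level would be a candidate for elimination, since the test finds no evidence that it is related to the target.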
    
## Other stuff that I've done before
- Wrangle
    * Categorize and encode target into five_stars column (classes: is five stars, isn't five stars)
    * Fix some values, drop some nulls, outliers, and duplicate rows, get rid of unnecessary columns
    * Create univariate visualizations
- Explore
    * Run Chi-Square testing to determine if feature is related to target
    * Feature engineering (overall)
    * Create bivariate visualizations
    * Choose features for model
- Model
    * Choose optimization priorities for the model (F1 Score)
    * Resample the target to address class imbalance
    * Create baseline model and multiple algorithmic models with varying hyperparameter combinations
    * Evaluate models on Validate (first out-of-sample split)
    * Choose best model in terms of our optimization priority
    * Calculate ROC AUC of baseline and best model
    * Evaluate baseline and best model on Test split
    
## Findings
1. The brand of ramen does not influence whether or not the ramen product has a five-star rating.
1. The packaging of ramen does not influence whether or not the ramen product has a five-star rating.
1. A ramen's country of origin has an influence on whether or not the ramen product has a five-star rating.
    - Malaysia has the highest five-star rating proportion of all origin countries.
    - Ramen originating from Malaysia, Singapore, or Taiwan have the highest proportion of five-star ratings.
    - Ramen from Hong Kong, Japan, South Korea, or Indonesia have the next-highest proportions of five-star ratings.
    - Ramen from China, Thailand, or USA have the lowest proportions of five-star ratings.
    - China has the lowest five-star rating proportion of all origin countries.
1. A ramen's noodle type does not influence whether or not the ramen product has a five-star rating.
1. A ramen's flavor influences whether or not the ramen product has a five-star rating.
    - Curry flavor has the highest proportion of five-star ratings of all flavor categories.
    - Ramen with curry or sesame flavor have the highest proportions of five-star ratings.
    - Ramen with pork flavor or the common crustaceans have the next-highest proportions of five-star ratings.
    - Chicken- and beef-flavored ramen products have the lowest five-star rating proportions of all flavors.
    - Chicken flavor has the lowest five-star rating proportion of all flavors.
1. A ramen's spicy status influences whether or not the ramen product has a five-star rating.
1. A ramen's fried status does not have an effect on whether or not the ramen product has a five-star rating.

## Model Results
- Features used: country and flavor brackets (as described above) and spicy status
- Evaluation Metric: F1 Score
- Best model: Logistic Regression
- Model performance: outperforms the baseline on F1 Score and ROC AUC for unseen data
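
To make the choice of metric concrete, here is a minimal sketch of how F1 Score separates two predictors that plain accuracy cannot tell apart on an imbalanced target. The labels, the always-negative baseline, and the helper names are hypothetical; the project's actual baseline and scores are not reproduced here.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced target: 2 five-star products out of 10
y_true = [True, True] + [False] * 8
always_negative = [False] * 10  # hypothetical baseline: never predict five stars
some_model = [True, False, True] + [False] * 7  # one hit, one miss, one false alarm

print(accuracy(y_true, always_negative), f1_score_binary(y_true, always_negative))
print(accuracy(y_true, some_model), f1_score_binary(y_true, some_model))
```

Both predictors reach 0.8 accuracy, but only the second finds any five-star product, which F1 reflects (0.5 versus 0.0).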

# Imports

```python
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_curve, auc, roc_auc_score
from sklearn.model_selection import GridSearchCV

import wrangle
import model
```

# Wrangle
## Bottom Line Up Front: What I Did for Wrangle
1. Acquire ramen-ratings.csv from Kaggle
1. Rename a United States value to USA
1. Drop low-count ramen styles Box, Can, and Bar (8 rows)
1. Drop countries with fewer than 5 cumulative observations (29 rows)
1. Drop Unrated, nulls and duplicates (16 rows)
1. Replace Stars column with five_stars column
1. Drop 'Review #' and 'Top Ten' columns
1. Rename columns for easier exploration
1. Split cleaned data into Train, Validate, and Test splits for exploration and modeling
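
A few of the steps above can be sketched with the standard library, treating each CSV record as a dict. This is not the contents of `wrangle.py`: the function name is hypothetical, and the raw column names `Brand`, `Variety`, `Style`, and `Country` are assumed from the Kaggle file (only `Stars`, `Review #`, and `Top Ten` are named in this write-up).

```python
from collections import Counter

def wrangle_rows(rows, min_country_count=5):
    """Apply a subset of the cleaning steps to a list of raw row dicts."""
    # Rename "United States" to "USA" so both spellings count as one country
    rows = [
        dict(r, Country="USA" if r["Country"] == "United States" else r["Country"])
        for r in rows
    ]
    # Drop countries with fewer than min_country_count observations
    counts = Counter(r["Country"] for r in rows)
    rows = [r for r in rows if counts[r["Country"]] >= min_country_count]

    cleaned = []
    for r in rows:
        stars = r.get("Stars")
        if stars in (None, "", "Unrated"):
            continue  # drop unrated or missing ratings
        # Replace Stars with the boolean target and rename the kept columns;
        # 'Review #' and 'Top Ten' are simply not copied over
        cleaned.append({
            "brand": r["Brand"],
            "name": r["Variety"],
            "package": r["Style"],
            "country": r["Country"],
            "five_stars": float(stars) == 5.0,
        })
    return cleaned
```

The raw rows could come from `csv.DictReader` over `ramen-ratings.csv`; the package-style filter and duplicate removal are left out to keep the sketch short.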

```python
# wrangle.py script to wrangle the data as described above
train, _, _ = wrangle.prep_explore()

train.head(3)
```

Train size: (1515, 5) Validate size: (506, 5) Test size: (506, 5)


| | brand | name | package | country | five_stars |
|---|---|---|---|---|---|
| 2566 | Samyang | Hot | Pack | South Korea | False |
| 332 | Nongshim | Shin Noodle Soup | Cup | USA | True |
| 1363 | Doll | Hello Kitty Dim Sum Noodle Japanese Curry Flavour | Cup | Hong Kong | False |
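
The printed split sizes add up to 2,527 rows and are consistent with roughly a 60/20/20 split. The sketch below shows one way such a split could be produced with the standard library. The fractions, seed, and function name are assumptions; `wrangle.prep_explore()` may well do it differently (for example, by stratifying on the target).

```python
import math
import random

def three_way_split(rows, test_frac=0.2, validate_frac=0.25, seed=42):
    """Shuffle rows, hold out test_frac as Test, then take
    validate_frac of the remainder as Validate; the rest is Train."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_validate = math.ceil(len(rest) * validate_frac)
    validate, train = rest[:n_validate], rest[n_validate:]
    return train, validate, test

# Row indices stand in for the 2,527 cleaned rows
train, validate, test = three_way_split(range(2527))
print(len(train), len(validate), len(test))  # 1515 506 506
```

Only Train is used for exploration, so Validate and Test stay unseen until modeling.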


# Explore
## Bottom Line Up Front: What I Did for Explore
- Statistical testing on Ramen Brands that found brand is independent of five-star outcomes
- Statistical testing on Ramen Packaging that found packaging is independent of five-star outcomes
- Statistical testing on Country of Origin that found country is related to five-star outcomes
- Keyword engineering to categorize ramen products into noodle type, flavor, spicy status, and fried status categories
- Statistical testing on new features that found noodle type and fried status have no impact on five-star outcomes
- Statistical testing on new features that found ramen flavor and spicy status have an impact on five-star outcomes
- Dropped specific countries and flavors that did not have at least 5 reviews with five-star rating
- Analyzed proportions of five-star reviews to all reviews for each country and flavor category
- Grouped into high-, medium-, and low-proportion brackets for country and for flavor category
- Checked country and flavor category brackets along with spicy status in terms of five-star and non-five-star reviews
- Chose these features for modeling
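
The bracketing step can be sketched as follows: for each category, compare the five-star rate of rows inside the category (encoded True) with the rate of rows outside it (encoded False), then bracket by the size of the change. The `margin` threshold, function names, and counts are hypothetical; the project's actual cut-offs are not stated here.

```python
def five_star_rate(rows):
    """Share of rows whose five_stars flag is True."""
    return sum(r["five_stars"] for r in rows) / len(rows)

def bracket_categories(rows, feature, margin=0.1):
    """Bracket each category of `feature` as high, medium, or low based on
    the change in five-star rate from False (outside) to True (inside).
    Assumes the feature has at least two categories."""
    brackets = {}
    for category in sorted({r[feature] for r in rows}):
        inside = [r for r in rows if r[feature] == category]
        outside = [r for r in rows if r[feature] != category]
        change = five_star_rate(inside) - five_star_rate(outside)
        if change > margin:
            brackets[category] = "high"
        elif change < -margin:
            brackets[category] = "low"
        else:
            brackets[category] = "medium"
    return brackets

def make_rows(country, n_five, n_other):
    """Build hypothetical rows for one country."""
    return ([{"country": country, "five_stars": True}] * n_five
            + [{"country": country, "five_stars": False}] * n_other)

# Hypothetical counts, not the project's data
rows = make_rows("Malaysia", 6, 4) + make_rows("Japan", 3, 7) + make_rows("China", 1, 9)
print(bracket_categories(rows, "country"))
# {'China': 'low', 'Japan': 'medium', 'Malaysia': 'high'}
```

Collapsing many sparse categories into three brackets gives the model a compact feature while keeping the direction of each category's effect.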

# Model
## Bottom Line Up Front: What I Did for Model
- Prepared entire dataset with model features
- Chose F1 Score as our main evaluation metric because it balances precision and recall, which plain accuracy does not do when classes are imbalanced
- Split dataset into Train, Validate, and Test
- Applied SMOTE+Tomek resampling to fix the class imbalance in our target for the Train split
- Built and fit several classification models with varying hyperparameter combinations on the resampled Train split
- Evaluated baseline and all model performances on Validate, chose best model (Logistic Regression)
- Chose not to use Grid Search to optimize hyperparameters due to nature of Logistic Regression hyperparameters
- Evaluated baseline and best model's ROC Curve AUC
- Evaluated baseline and best model on sequestered Test split
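
For readers unfamiliar with ROC AUC, it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name, labels, and scores are hypothetical, and in the notebook `sklearn.metrics.roc_auc_score` would be the natural tool.

```python
def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison: the fraction of (positive, negative)
    pairs where the positive is scored higher, counting ties as half."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of "five stars"
y_true = [True, True, False, False, False]
model_scores = [0.9, 0.4, 0.5, 0.2, 0.1]
constant_scores = [0.5] * 5  # a baseline that scores every product the same

print(roc_auc(y_true, model_scores))     # 5 of 6 pairs ranked correctly
print(roc_auc(y_true, constant_scores))  # 0.5
```

A predictor that scores every product identically lands at 0.5, so any value above that indicates the model ranks five-star products ahead of the rest more often than chance.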

# Conclusion
Working only from ramen product names, I was able to build several categorical features informed by domain knowledge and translation. Some of these keyword-engineered features were statistically related to our target and were used in the model. In the end, our predictive model outperformed the baseline on F1 Score and ROC AUC.