# **Introduction**

***

<p style="font-size:18px;"> This data set contains 4650 ramen reviews from Hans "The Ramen Rater" Lienesch, taken from his website (https://www.theramenrater.com/resources-2/the-list/).

<p style="font-size:18px;"> The Ramen Rater started reviewing ramen in 2002 and hasn't stopped since. Although the reviews are not dated, they are in order as per the 'Review #' column. A total of 4650 reviews over 21 years equates to a whopping 221 ramen reviews per year on average!

<p style="font-size:18px;"> The purpose of this case study is to explore and analyze the data to determine which ramen has the best reviews and whether the flavor or country of manufacture impacts the number of stars a ramen receives. I will begin by importing and cleaning the data before analyzing and visualizing it.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

import warnings
warnings.filterwarnings("ignore")

print("Libraries Imported")

Libraries Imported


In [2]:
df = pd.read_csv("/kaggle/input/ramen-ratings-2024/The-Big-List-20231127-Reviews-to-4650.csv")

df.head()

Unnamed: 0,Review #,Brand,Variety,Style,Country,Stars,T
0,4650,Jasmine,XXL Bihun Segera Penang White Curry,Pack,Malaysia,5.0,
1,4649,Indomie,Mi Instan Mi Keriting Goreng Spesial,Pack,Indonesia,5.0,
2,4648,MAMA,Oriental Kitchen Dried Instant Noodles Truffle...,Pack,Thailand,4.5,
3,4647,Ottogi,Jin Jjajang Smoked Black Bean Flavor,Pack,United States,4.5,
4,4646,Samyang Foods,Samyand Ramen,Pack,United States,5.0,


# **Data Cleaning**

***

<p style="font-size:18px;">  In this section of the case study I will clean the data to ensure that there are no missing or incorrect values. I will begin by checking the number of non-null values per column and the number of unique values per column.

In [3]:
df.info()
print("----------------------------------------")
print("----------------------------------------")
print("Unique values per column:")
df.nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4650 entries, 0 to 4649
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Review #  4650 non-null   int64  
 1   Brand     4650 non-null   object 
 2   Variety   4650 non-null   object 
 3   Style     4650 non-null   object 
 4   Country   4650 non-null   object 
 5   Stars     4648 non-null   object 
 6   T         0 non-null      float64
dtypes: float64(1), int64(1), object(5)
memory usage: 254.4+ KB
----------------------------------------
----------------------------------------
Unique values per column:


Review #    4650
Brand        698
Variety     4325
Style         10
Country       54
Stars         53
T              0
dtype: int64

<p style="font-size:18px;"> From this I can see that, with the exception of the columns "Stars" and "T", every column contains 4650 non-null values. As the column "T" only contains null values, it can be removed. In addition there are 2 missing values in the "Stars" column, and these rows will have to be dropped.

In [4]:
df = df.drop('T', axis=1)
print("Column Dropped")

Column Dropped
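
<p style="font-size:18px;"> As an aside, a column that is entirely null can also be removed without naming it. The snippet below is a minimal sketch on a made-up toy dataframe (the values are illustrative and not taken from the dataset), assuming the only goal is to drop columns in which every value is missing.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the ramen data; "T" mimics the empty column.
toy = pd.DataFrame({
    "Brand": ["Nissin", "Indomie", "MAMA"],
    "Stars": ["5", "4.5", None],
    "T": [np.nan, np.nan, np.nan],
})

# how="all" drops a column only when every value in it is null,
# so "Stars" (which has just one missing value) is kept.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

<p style="font-size:18px;"> Dropping the column by name, as done above, is more explicit. The dropna approach may be handy when several empty columns appear in an export.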


<p style="font-size:18px;"> There appears to be a huge number of unique values in the "Stars" column. I was expecting 11 values if you include half stars, or 21 values if you include quarter stars, so something doesn't seem right if there are 53 unique values. To check this I will print out all the unique values below.

In [5]:
df['Stars'].unique()

array(['5', '4.5', '3.75', '3.5', '2.5', '4.25', '4', '3.25', '1.5', '2',
       'Detail of the lid (click to enlarge). ', '1', nan, '3', '1.75',
       '435', '2.75', '0', '0.25', '2.25', '4.75', '35', '0.5', '0.75',
       '1.25', 'NS', 'NR', '3.5/2.5', '4/4', '5/5', '4.5/5', '5/2.5',
       '5/4', '4.25/5', 'Unrated', '3.50', '1.1', '2.1', '0.9', '3.1',
       '4.125', '3.125', '2.125', '2.9', '0.1', '2.8', '3.7', '3.4',
       '3.6', '2.85', '2.3', '3.2', '3.65', '1.8'], dtype=object)

<p style="font-size:18px;"> It appears that the rating system is a bit unconventional. For the most part the ratings are what I would expect, but there are several odd values such as 3.6, 2.85, 0.9 and 2.125. I cross-referenced the reviews in this dataset against the data source (https://www.theramenrater.com/resources-2/the-list/), where The Ramen Rater notes that:
    
> <p style="font-size:18px;">These reviews are based on my personal preferences, not on sales of popularity. Scores are in .25 increments – rounding is NOT recommended. Think of letter grading; a 3.5 score out of 5 stars – (3.5 * 2) * 10 = 70 = C. So, rounding a 3.5 to a 4.0 doesn’t reflect it correctly.
I started reviewing in 2002. I’m a stay at home dad and have very poor vision – typos may be in there here and there, but the whole set is pretty cleaned up as to such idiosyncrasies.
    
<p style="font-size:18px;"> With this knowledge in mind I will go through and round each star value to the nearest 0.25 and remove or amend invalid values. To begin I will find out how many rows have a non-numeric rating.

In [6]:
# Function to check if a value is numeric
def is_numeric(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

# Apply the function to the 'Stars' column to filter non-numeric values
non_numeric_values = df[~df['Stars'].apply(is_numeric)]['Stars']

print("Number of non-numeric values:",len(non_numeric_values))
non_numeric_values

Number of non-numeric values: 16


82      Detail of the lid (click to enlarge). 
1217                                        NS
1500                                        NR
1501                                        NR
2009                                        NR
2038                                   3.5/2.5
2039                                       4/4
2040                                       5/5
2041                                     4.5/5
2042                                     5/2.5
2043                                       5/4
2044                                    4.25/5
2045                                    4.25/5
2102                                   Unrated
2192                                   Unrated
3063                                   Unrated
Name: Stars, dtype: object
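
<p style="font-size:18px;"> The same check can be written without a helper function. The sketch below uses a small hypothetical sample of rating strings (mirroring the kinds of values seen above, not the real column) and relies on pd.to_numeric with errors="coerce", which turns unparseable entries into NaN.

```python
import pandas as pd

# Hypothetical sample of raw "Stars" strings, for illustration only.
raw = pd.Series(["5", "3.50", "NR", "4.25/5", "Unrated", None])

# errors="coerce" turns anything that cannot be parsed into NaN.
parsed = pd.to_numeric(raw, errors="coerce")

# Non-numeric strings are the entries that were present but failed to parse;
# genuinely missing values are excluded, as in the is_numeric approach above.
non_numeric = raw[parsed.isna() & raw.notna()]
print(non_numeric.tolist())
```

<p style="font-size:18px;"> Either approach should flag the same rows. The vectorised version is mainly a matter of taste on a dataset of this size.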

<p style="font-size:18px;"> It appears that there are 16 values which require cleaning. Values where there has been no rating can be removed, as there is no way to infer the correct rating. It also seems that some of the values are correct but have been entered in an invalid format, e.g. 4.25/5 instead of just 4.25. Where possible I will amend the handful of values which are not formatted correctly and remove the rows which do not have a rating.

In [7]:
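# Rows whose rating cannot be recovered are dropped
values_to_drop = ["Detail of the lid (click to enlarge). ", "NS", "NR", "Unrated", "3.5/2.5"]
# .copy() avoids a SettingWithCopyWarning when 'Stars' is reassigned below
df = df[~df['Stars'].isin(values_to_drop)].copy()

# Malformed ratings which can be amended are mapped to a single value
values_to_replace = {"5/5": "5","4/4": "5","4.5/5": "4.5","5/2.5": "2.5","5/4": "4","4.25/5": "4.25", "35":"3.5", "435":"4.35"}
df['Stars'] = df['Stars'].replace(values_to_replace)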

df['Stars'].unique()

array(['5', '4.5', '3.75', '3.5', '2.5', '4.25', '4', '3.25', '1.5', '2',
       '1', nan, '3', '1.75', '4.35', '2.75', '0', '0.25', '2.25', '4.75',
       '0.5', '0.75', '1.25', '3.50', '1.1', '2.1', '0.9', '3.1', '4.125',
       '3.125', '2.125', '2.9', '0.1', '2.8', '3.7', '3.4', '3.6', '2.85',
       '2.3', '3.2', '3.65', '1.8'], dtype=object)

<p style="font-size:18px;"> This looks much better, but there still appear to be NaN values in the dataset.

In [8]:
df['Stars'].isna().sum()

2

<p style="font-size:18px;"> As only 2 rows have NaN values I will remove these rows as well. Once they are removed, the "Stars" column only contains numeric values and can be converted from the object datatype to the float datatype.

In [9]:
df = df.dropna(subset=['Stars'])

# Converting "Stars" to float datatype
df['Stars'] = df['Stars'].astype('float')

print("Total length of dataset:",len(df))

Total length of dataset: 4639


<p style="font-size:18px;"> By removing the rows which contain null values or unusable data the total dataset has been reduced by 11 rows from the maximum 4650 to 4639.

<p style="font-size:18px;"> Next I need to identify how many reviews have incorrect values (any values which are not a multiple of 0.25).

In [10]:
star_errors = df[df['Stars'] % 0.25 != 0]

print("Number of star errors:", len(star_errors))
star_errors

Number of star errors: 24


Unnamed: 0,Review #,Brand,Variety,Style,Country,Stars
119,4531,Shoo Loong Kan,Lan Zhou La Mian,Box,China,4.35
4015,635,Ottogi,Buckwheat Bibim Ramyon,Pack,South Korea,1.1
4063,587,Sunlight,Steam Vermicelli,Pack,Taiwan,2.1
4115,535,Wai Wai,Chicken Flavor,Pack,Thailand,0.9
4116,534,Mama,Pho Ga,Pack,Thailand,3.1
4238,412,Maruchan,Midori No Tanuki,Bowl,Japan,4.125
4304,346,Paldo,Bowl Noodle Spicy Artificial Chicken,Bowl,South Korea,3.125
4316,334,Nissin,Big Cup Noodles Beef,Cup,United States,2.125
4460,190,Maggi,Perencah Tom Yam,Pack,Malaysia,2.9
4474,176,Paldo,Dosirac Pork,Tray,South Korea,4.125


<p style="font-size:18px;"> In total there are 24 values which need amending, as these reviews do not follow the star rating system as defined by The Ramen Rater.

In [11]:
# Nudge values upwards so that ties such as 4.125 round up rather than to the nearest even quarter
df['Stars'] = (df['Stars'] + 0.001)

# Round the 'Stars' column to the nearest 0.25
df['Stars'] = (df['Stars'] * 4).round() / 4

# Unique star values sorted in ascending order
np.sort(df['Stars'].unique())

array([0.  , 0.25, 0.5 , 0.75, 1.  , 1.25, 1.5 , 1.75, 2.  , 2.25, 2.5 ,
       2.75, 3.  , 3.25, 3.5 , 3.75, 4.  , 4.25, 4.5 , 4.75, 5.  ])

<p style="font-size:18px;"> All the values have been amended and the "Stars" column now only contains values from 0 to 5 in 0.25 increments.
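
<p style="font-size:18px;"> The small offset added before rounding matters because the pandas round method rounds ties to the nearest even number. The sketch below illustrates this on a few example ratings and shows one possible alternative (adding 0.5 and flooring), which is an assumption of mine rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd

stars = pd.Series([4.125, 3.125, 2.125, 4.35, 0.9])

# Series.round uses round-half-to-even: 4.125 * 4 = 16.5 rounds to 16, giving 4.0.
half_even = (stars * 4).round() / 4

# Adding 0.5 and flooring always rounds ties upwards (for non-negative values).
half_up = np.floor(stars * 4 + 0.5) / 4

print(half_even.tolist())
print(half_up.tolist())
```

<p style="font-size:18px;"> Both approaches agree on values that are not exact ties, such as 4.35 and 0.9. They only differ on the 0.125 values.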
    
<p style="font-size:18px;"> To continue it is important to check all the columns are clean and correct. I decided to look at the "Style" column next.

In [12]:
df['Style'].value_counts()

Style
Pack          2478
Bowl           926
Cup            908
Tray           207
Box            113
Restaurant       3
Boowl            1
Bottle           1
Can              1
Bar              1
Name: count, dtype: int64

<p style="font-size:18px;"> Unfortunately there is one spelling mistake where "Bowl" has been entered as "Boowl", so I will correct this. In addition there are several styles which only appear in a handful of rows. As these styles are too rare to analyse on their own, I will group them together and change their style to "Other".

In [13]:
df['Style'] = df['Style'].replace('Boowl', 'Bowl')
df.loc[df['Style'].isin(['Restaurant', 'Bottle', 'Can', 'Bar']), 'Style'] = 'Other'
df['Style'].value_counts()

Style
Pack     2478
Bowl      927
Cup       908
Tray      207
Box       113
Other       6
Name: count, dtype: int64
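
<p style="font-size:18px;"> Rare categories can also be grouped by a count threshold instead of listing them by hand. The sketch below uses made-up toy data and a hypothetical threshold called min_count, so the numbers are illustrative only.

```python
import pandas as pd

# Toy style column; the counts are invented for illustration.
styles = pd.Series(["Pack"] * 6 + ["Bowl"] * 5 + ["Cup"] * 4 + ["Can", "Bar", "Bottle"])

min_count = 4  # hypothetical threshold, chosen for this toy data only
counts = styles.value_counts()
rare = counts[counts < min_count].index

# Series.where keeps values where the condition holds and replaces the rest.
grouped = styles.where(~styles.isin(rare), "Other")
print(grouped.value_counts().to_dict())
```

<p style="font-size:18px;"> A threshold avoids hard-coding the style names, although listing them explicitly (as above) makes it clearer exactly which styles were merged.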

<p style="font-size:18px;"> Now that I was confident that the data was cleaned and usable, I used the describe function within pandas, which gives a quick overview of the data per column.

In [14]:
df.describe(include='all')

Unnamed: 0,Review #,Brand,Variety,Style,Country,Stars
count,4639.0,4639,4639,4639,4639,4639.0
unique,,695,4316,6,54,
top,,Nissin,Miso Ramen,Pack,Japan,
freq,,566,12,2478,949,
mean,2323.494072,,,,,3.741108
std,1342.722277,,,,,1.071223
min,1.0,,,,,0.0
25%,1160.5,,,,,3.5
50%,2321.0,,,,,3.75
75%,3487.5,,,,,4.5


<p style="font-size:18px;"> I wanted to ensure that the data no longer contained any null values before moving onto the next stage of this case study.

In [15]:
df.isna().sum()

Review #    0
Brand       0
Variety     0
Style       0
Country     0
Stars       0
dtype: int64

# **Feature Engineering**

***

<p style="font-size:18px;"> To explore the data further I wanted to determine if there were any trends which could be identified from the flavors of the ramen. To do this I split the string in the 'Variety' column into individual words then counted the frequency of each word to identify the most common words within the variety column. From this I hope to identify the most common flavors.

In [16]:
# Lowercase all strings in the "Variety" column
df['Variety'] = df['Variety'].str.lower()

# Split each string into words
df['Words'] = df['Variety'].str.split()

# Explode the list of words into separate rows
words_df = df.explode('Words')

# Count the frequency of each word
word_counts_df = words_df['Words'].value_counts().reset_index()
word_counts_df.columns = ['Word', 'Frequency']

word_counts_df[(word_counts_df['Frequency'] >= 500) & (word_counts_df['Frequency'] <= 2000)]

Unnamed: 0,Word,Frequency
0,noodles,1048
1,noodle,970
2,ramen,902
3,flavor,650
4,instant,637
5,flavour,577
6,chicken,528
7,spicy,527


<p style="font-size:18px;"> Although this was a manual task, by adjusting the frequency range in the filter above I was able to work through the most common words and identify the most common flavors. I stored these flavors in a list called "flavors".
    
<p style="font-size:18px;"> With the most common flavors identified I wrote a function which looks for each flavor in the list "flavors" within the variety text. If there is a match, the identified flavor is added to a new column in the dataframe called "Flavor". The function returns the first flavor in the list that matches, so a variety containing both "spicy" and "chicken" is labelled "chicken". Although this approach is not perfect, it quickly allows the flavor of each ramen to be identified without manually checking and amending every row.

In [17]:
# List of possible flavors
flavors = ["chicken", "spicy", "beef", "hot", "curry", "shrimp", "seafood", "pork", "tonkotsu","yakisoba", "miso", 
           "goreng", "sesame", "shoyu","vermicelli","soy", "vegtable", "kimchi", "chili", "vegetable", "vegetarian",
           "laksa", "mushroom", "crab", "pepper", "tomato", "cheese", "buldak", "prawn"]

# Function to extract flavors
def extract_flavor(variety, flavors):
    for flavor in flavors:
        if flavor.lower() in variety.lower():
            return flavor
    return None

# Apply the function to the dataframe
df['Flavor'] = df['Variety'].apply(lambda x: extract_flavor(x, flavors))

df.head(3)

Unnamed: 0,Review #,Brand,Variety,Style,Country,Stars,Words,Flavor
0,4650,Jasmine,xxl bihun segera penang white curry,Pack,Malaysia,5.0,"[xxl, bihun, segera, penang, white, curry]",curry
1,4649,Indomie,mi instan mi keriting goreng spesial,Pack,Indonesia,5.0,"[mi, instan, mi, keriting, goreng, spesial]",goreng
2,4648,MAMA,oriental kitchen dried instant noodles truffle...,Pack,Thailand,4.5,"[oriental, kitchen, dried, instant, noodles, t...",
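
<p style="font-size:18px;"> One limitation of substring matching is that a short flavor such as "hot" also matches inside longer words. The sketch below is a possible whole-word variant using regular expressions. The function name extract_flavor_whole_word, the shortened flavor list and the example strings are all hypothetical and only for illustration.

```python
import re

# Short, hypothetical subset of the flavor list, for illustration only.
flavors = ["chicken", "spicy", "hot", "soy"]

# \b anchors ensure only whole words match; list order still decides ties.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, flavors)) + r")\b")

def extract_flavor_whole_word(variety):
    """Return the first listed flavor that appears as a whole word, else None."""
    found = set(pattern.findall(variety.lower()))
    for flavor in flavors:
        if flavor in found:
            return flavor
    return None

print(extract_flavor_whole_word("Shotgun Noodles"))      # substring "hot" is ignored
print(extract_flavor_whole_word("Spicy Chicken Ramen"))  # "chicken" precedes "spicy" in the list
```

<p style="font-size:18px;"> Whole-word matching would likely leave more rows without a flavor, so whether it is an improvement depends on how the "Variety" text is written.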


<p style="font-size:18px;">  To ensure this approach was working as intended I decided to review the rows which did not contain a flavor in the newly created "Flavor" column. The rows which did not contain a flavor appeared to either have no flavor in the "Variety" column or a niche/unusual flavor.

In [18]:
print("Number of rows without a flavor:", len(df[df['Flavor'].isna()]))
df[df['Flavor'].isna()].head(3)

Number of rows without a flavor: 1452


Unnamed: 0,Review #,Brand,Variety,Style,Country,Stars,Words,Flavor
2,4648,MAMA,oriental kitchen dried instant noodles truffle...,Pack,Thailand,4.5,"[oriental, kitchen, dried, instant, noodles, t...",
3,4647,Ottogi,jin jjajang smoked black bean flavor,Pack,United States,4.5,"[jin, jjajang, smoked, black, bean, flavor]",
4,4646,Samyang Foods,samyand ramen,Pack,United States,5.0,"[samyand, ramen]",


<p style="font-size:18px;"> It looks like my approach has worked reasonably well: 1452 of the total 4639 rows (roughly 31%) did not have a flavor attributed to them. To ensure that every row had a value in the "Flavor" column I filled all the null values with "other". To keep the dataframe clean I then decided to drop the created "Words" column as this would provide no further benefit.

In [19]:
df['Flavor'] = df['Flavor'].fillna('other')
df = df.drop('Words', axis=1)
df.head()

Unnamed: 0,Review #,Brand,Variety,Style,Country,Stars,Flavor
0,4650,Jasmine,xxl bihun segera penang white curry,Pack,Malaysia,5.0,curry
1,4649,Indomie,mi instan mi keriting goreng spesial,Pack,Indonesia,5.0,goreng
2,4648,MAMA,oriental kitchen dried instant noodles truffle...,Pack,Thailand,4.5,other
3,4647,Ottogi,jin jjajang smoked black bean flavor,Pack,United States,4.5,other
4,4646,Samyang Foods,samyand ramen,Pack,United States,5.0,other


<p style="font-size:18px;"> Next I wanted to ensure that all the countries in the "Country" column were correct. I listed all the unique values in the "Country" column, which allowed me to quickly determine if there were any mistakes.

In [20]:
unique_countries = sorted(df['Country'].unique())
unique_countries_str = ", ".join(unique_countries)
unique_countries_str

'Australia, Bangladesh, Brazil, Cambodia, Canada, China, Colombia, Dubai, Estonia, Fiji, Finland, France, Germany, Ghana, Holland, Hong Kong, Hungary, India, Indonesia, Ireland, Israel, Italy, Japan, Malaysia, Mexico, Myanmar, Nepal, Netherlands, New Zealand, Nigeria, Pakistan, Peru, Philippines, Phillippines, Phlippines, Poland, Portugal, Russia, Russian Federation, Sarawak, Serbia, Singapore, Souh Korea, South Korea, Spain, Sweden, Taiwan, Thailand, UK, USA, Ukraine, United Kingdom, United States, Vietnam'

<p style="font-size:18px;"> Unfortunately I found several mistakes which I wanted to correct: spelling mistakes, regions or cities being used instead of country names, and abbreviated names. It is important for me to fix these mistakes as it will allow me to plot the data on a geographical map using plotly in my data analysis.

In [21]:
corrections = {  
    'Sarawak': 'Malaysia',
    'Holland': 'Netherlands',
    'Phillippines': 'Philippines',
    'Phlippines': 'Philippines',
    'Russian Federation': 'Russia',
    'Souh Korea': 'South Korea',
    'UK': 'United Kingdom',
    'United States': 'United States of America',
    'USA': 'United States of America',
    'Dubai' : 'United Arab Emirates',
    'Hong Kong' : 'China'
}

df['Country'] = df['Country'].replace(corrections)

<p style="font-size:18px;"> Once I was happy that all the countries had the correct name, I decided to add the ISO 3166-1 alpha-3 code for each country to the data, as this would be required to plot the data on a plotly map. Although this was a manual and slightly tedious task, I knew that once completed it would help with my data analysis.

In [22]:
# Manually mapping country names to ISO codes
country_to_iso = {
    'Australia': 'AUS',
    'Bangladesh': 'BGD',
    'Brazil': 'BRA',
    'Cambodia': 'KHM',
    'Canada': 'CAN',
    'China': 'CHN',
    'Colombia': 'COL',
    'Estonia': 'EST',
    'Fiji': 'FJI',
    'Finland': 'FIN',
    'France': 'FRA',
    'Germany': 'DEU',
    'Ghana': 'GHA',
    'Hong Kong': 'HKG',
    'Hungary': 'HUN',
    'India': 'IND',
    'Indonesia': 'IDN',
    'Ireland': 'IRL',
    'Israel': 'ISR',
    'Italy': 'ITA',
    'Japan': 'JPN',
    'Malaysia': 'MYS',
    'Mexico': 'MEX',
    'Myanmar': 'MMR',
    'Nepal': 'NPL',
    'Netherlands': 'NLD',
    'New Zealand': 'NZL',
    'Nigeria': 'NGA',
    'Pakistan': 'PAK',
    'Peru': 'PER',
    'Philippines': 'PHL',
    'Poland': 'POL',
    'Portugal': 'PRT',
    'Russia': 'RUS',
    'Serbia': 'SRB',
    'Singapore': 'SGP',
    'South Korea': 'KOR',
    'Spain': 'ESP',
    'Sweden': 'SWE',
    'Taiwan': 'TWN',
    'Thailand': 'THA',
    'Ukraine': 'UKR',
    'United Arab Emirates': 'ARE',
    'United Kingdom': 'GBR',
    'United States of America': 'USA',
    'Vietnam': 'VNM'
}

# Adding a new column with ISO codes
df['ISO_CC'] = df['Country'].map(country_to_iso)

df.head()

Unnamed: 0,Review #,Brand,Variety,Style,Country,Stars,Flavor,ISO_CC
0,4650,Jasmine,xxl bihun segera penang white curry,Pack,Malaysia,5.0,curry,MYS
1,4649,Indomie,mi instan mi keriting goreng spesial,Pack,Indonesia,5.0,goreng,IDN
2,4648,MAMA,oriental kitchen dried instant noodles truffle...,Pack,Thailand,4.5,other,THA
3,4647,Ottogi,jin jjajang smoked black bean flavor,Pack,United States of America,4.5,other,USA
4,4646,Samyang Foods,samyand ramen,Pack,United States of America,5.0,other,USA
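
<p style="font-size:18px;"> Because the mapping is typed by hand, it may be worth checking that no country was missed. The sketch below uses toy data in which "Atlantis" is a made-up entry, and assumes that any country missing from the dictionary should be flagged.

```python
import pandas as pd

# Toy data; "Atlantis" is a made-up entry that has no ISO code.
countries = pd.Series(["Japan", "Malaysia", "Atlantis", "Japan"])
country_to_iso = {"Japan": "JPN", "Malaysia": "MYS"}

iso = countries.map(country_to_iso)

# Series.map returns NaN for keys missing from the dictionary.
unmapped = sorted(countries[iso.isna()].unique())
print(unmapped)
```

<p style="font-size:18px;"> An empty list would indicate that every country received a code.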


<p style="font-size:18px;"> The data is now looking great. I am happy that all the columns have been cleaned and corrected. With the addition of the Flavor and ISO country code columns I would be able to analyze the data in further detail and create more striking visuals.

# **Data Analysis**

***

<p style="font-size:18px;"> With the data cleaned and ready, the next step is to analyse the data and find trends, patterns and insights. I want to explore how the Ramen Rater's reviews differ per country, whether his taste has changed over time, and how the brand and style of ramen may impact the score the Ramen Rater gives a ramen.
    
<p style="font-size:18px;"> To begin I wanted to explore the average reviews per country so I created a dataframe which calculates the average number of stars per country and the number of reviews for each country.

In [23]:
# Calculate average stars per country
avg_stars = df.groupby('Country').agg(
    ISO_CC=('ISO_CC', 'first'),
    Review_Count=('Stars', 'size'),
    Avg_Stars=('Stars', 'mean')
).reset_index().round(2)

avg_stars.head(3)

Unnamed: 0,Country,ISO_CC,Review_Count,Avg_Stars
0,Australia,AUS,26,3.27
1,Bangladesh,BGD,15,3.28
2,Brazil,BRA,32,3.62


<p style="font-size:18px;"> With the dataframe created I could now plot this data on a map using plotly, which creates an interactive and visually striking map with minimal code.

In [24]:
import plotly.graph_objects as go

hover_text = (avg_stars['Country'] + '<br>' + avg_stars['Avg_Stars'].astype(str) + ' Average Stars' + 
              '<br>' + avg_stars["Review_Count"].astype(str) + ' Number of Reviews')

fig = go.Figure(data=go.Choropleth(locations=avg_stars['ISO_CC'],z=avg_stars['Avg_Stars'],
    text=hover_text, hoverinfo='text', colorscale='RdYlGn', zmin=0, zmax=5,
    marker_line_width=0.5, colorbar_title='Avg Stars'))

fig.update_layout(title_text='Average Reviews by Country',width=1100, height=1000,
    geo=dict(showcoastlines=False, projection_type='miller', landcolor="white",
        showocean=True, oceancolor="lightblue", showcountries=True, countrycolor="grey"))

fig.show()

print("Please note that Singapore is not displayed on this map.")
print("Singapore has 154 reviews and 4.09 average stars")
print("---------------------------------------------------")
print("Average stars across entire data set:",df["Stars"].mean().round(2))

Please note that Singapore is not displayed on this map.
Singapore has 154 reviews and 4.09 average stars
---------------------------------------------------
Average stars across entire data set: 3.74


<p style="font-size:18px;"> Wow this is great to see! For the most part each country appears to have an average star rating in the 3-4 range. However there are a few outliers outside of this norm, such as Ireland, Canada and Portugal (to name a few), which have a lower average star rating than the other countries. Looking into this further, many of the countries have fewer than 5 reviews, so there is too little data on these countries to say that their averages are reliable.
    
<p style="font-size:18px;"> Knowing this I decided to plot this map again but only to show the countries which have a minimum of 5 reviews.

In [25]:
avg_stars_filtered = avg_stars[avg_stars["Review_Count"] >= 5]

hover_text = (avg_stars_filtered['Country'] + '<br>' + avg_stars_filtered['Avg_Stars'].astype(str) + ' Average Stars' + 
              '<br>' + avg_stars_filtered["Review_Count"].astype(str) + ' Number of Reviews')

fig = go.Figure(data=go.Choropleth(locations=avg_stars_filtered['ISO_CC'],z=avg_stars_filtered['Avg_Stars'],
    text=hover_text, hoverinfo='text', colorscale='RdYlGn', zmin=0, zmax=5, marker_line_width=0.5,
    colorbar_title='Avg Stars'))

fig.update_layout(
    title_text='Average Reviews by Country', width=1100, height=1000,
    geo=dict(showcoastlines=False, projection_type='miller', landcolor="white",
        showocean=True, oceancolor="lightblue", showcountries=True, countrycolor="grey"),)

fig.show()

print("Please note that Singapore is not displayed on this map.")
print("Singapore has 154 reviews and 4.09 average stars")

Please note that Singapore is not displayed on this map.
Singapore has 154 reviews and 4.09 average stars


<p style="font-size:18px;"> Next I wanted to see if the Ramen Rater's reviews improved by country over time. This would help determine if a particular country has improved its ramen over the 21+ years in which the Ramen Rater has been reviewing ramen. I decided to plot every review by country to see if there were any obvious trends.

In [26]:
# Scatter Plot
fig = px.scatter(df, x='Review #', y='Stars', color='Country',
                 title='Stars vs Time per Country',
                 labels={'Stars': 'Stars', 'Review #': 'Review Number'})

# Update hover template for all traces
fig.update_traces(hovertemplate= 'Review Number: %{x} <br> Stars: %{y}')

# Hide all countries except 'United States of America' (the others can be toggled from the legend)
fig.for_each_trace(lambda trace: trace.update(visible='legendonly') if trace.name != 'United States of America' else trace.update(visible=True))

# Lock the y-axis range
fig.update_layout(yaxis=dict(range=[df['Stars'].min() - 0.2, df['Stars'].max() + 0.2]))

fig.show()

<p style="font-size:18px;"> Although this is interesting to see, it is hard to tell if there is any correlation between reviews and time by country. To help simplify this I decided to create a correlation dataframe which calculates the correlation between time (review number) and stars for each country with at least 5 reviews, such that each country has an easy to understand correlation number. From this dataframe I am then able to visualise the data using a bar chart.

In [27]:
# Correlation Dataframe
country_counts = df['Country'].value_counts()
filtered_df = df[df['Country'].isin(country_counts[country_counts >= 5].index)]

correlation_df = filtered_df.groupby('Country').apply(lambda x: x['Review #'].corr(x['Stars'])).round(3).reset_index()
correlation_df.columns = ['Country', 'Correlation']

sorted_correlations = correlation_df.sort_values(by='Correlation', ascending=False).reset_index(drop=True)

print(sorted_correlations.head())
print("----------------------------")

# Bar Chart Plot
fig = px.bar(sorted_correlations, x='Country', y='Correlation', title='Correlation of Time and Stars by Country',
             labels={'Correlation': 'Correlation Coefficient ', 'Country': 'Country '}, color='Correlation',color_continuous_scale='portland')

fig.update_traces(hovertemplate='Country = <b>%{x}</b><br>Correlation = <b>%{y}</b>')


fig.show()

       Country  Correlation
0       Russia        0.765
1        India        0.495
2     Thailand        0.482
3  Philippines        0.433
4     Pakistan        0.393
----------------------------


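<p style="font-size:18px;"> Wow this is great to see! Russia, India and Thailand have the strongest positive correlation between review number and stars, which suggests that their ramen has shown the greatest improvement over time. On the other hand Poland, Nepal and Bangladesh have the most negative correlations, which suggests the greatest deterioration.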
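
<p style="font-size:18px;"> A correlation coefficient is easier to interpret alongside the number of reviews behind it. The sketch below uses invented toy data (countries "A" and "B" are placeholders) to show that a group with only two reviews can reach a correlation of 1 just as easily as a group with a clear trend.

```python
import pandas as pd

# Toy data: country "A" has a clear upward trend, "B" has only two reviews.
toy = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 2,
    "Review #": [1, 2, 3, 4, 5, 1, 2],
    "Stars": [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 4.0],
})

rows = []
for country, group in toy.groupby("Country"):
    rows.append({
        "Country": country,
        "n": len(group),  # number of reviews behind the coefficient
        "Correlation": group["Review #"].corr(group["Stars"]),
    })
summary = pd.DataFrame(rows)
print(summary)
```

<p style="font-size:18px;"> Reporting the review count next to each coefficient may help decide how much weight to give a result such as Russia's, where the sample size is not shown in the output above.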
    
<p style="font-size:18px;"> Moving on I wanted to explore how different brands may impact a review. I created a dataframe which calculates the number of reviews by brand and the average stars per brand. Please see below.

In [28]:
Brand_df = df.groupby('Brand').agg(Review_Count=('Brand', 'size'),
            Avg_Stars=('Stars', 'mean')).round(2).reset_index()

Brand_df = Brand_df.sort_values(by='Review_Count', ascending=False)

Brand_df.head(3)

Unnamed: 0,Brand,Review_Count,Avg_Stars
398,Nissin,566,3.86
328,Maruchan,182,3.66
374,Myojo,152,3.93


<p style="font-size:18px;"> With my newly created dataframe I wanted to visualise this data, so I decided to plot the 50 most reviewed brands on a bar chart with the Y axis representing the number of reviews and the colour of each bar representing the average stars.

In [29]:
fig = px.bar(Brand_df.head(50), x='Brand', y='Review_Count', color='Avg_Stars', 
             title='Frequency of Reviews per Brand',
             labels={'Brand': 'Brand', 'Review_Count': 'Number of Reviews'},
             hover_data={'Avg_Stars': ':.2f'},
             color_continuous_scale='Portland')

fig.update_layout(xaxis_tickangle=-70)

fig.update_traces(hovertemplate='<b>%{x}</b><br>Number of Reviews: %{y}<br>Average Stars: %{customdata:.2f}')

fig.show()

<p style="font-size:18px;"> The data shows that for the most part each brand has an average star rating of between 3 and 4, however there are a few notable outliers. "Mykuali" and "Mom's Dry Noodle" have the highest average brand scores, whereas "Mr. Noodles", "Annie Chun's" and "Wai Wai" have the lowest average scores.
    
<p style="font-size:18px;"> As the bar chart only shows averages, I wanted to see if there was a correlation between time and a brand's scores. This would determine if over time a brand has made improvements and its scores have increased, or if a brand has had a decline and its scores have decreased. Like before I decided to create a bar chart to visualise this rather than using a scatter graph.

In [30]:
brand_counts = df['Brand'].value_counts()
filtered_df = df[df['Brand'].isin(brand_counts[brand_counts >= 15].index)]

correlation_df = filtered_df.groupby('Brand').apply(lambda x: x['Review #'].corr(x['Stars'])).round(3).reset_index()
correlation_df.columns = ['Brand', 'Correlation']

sorted_correlations = correlation_df.sort_values(by='Correlation', ascending=False).reset_index(drop=True)

print(sorted_correlations.head())
print("----------------------------")

# Bar Chart Plot
fig = px.bar(sorted_correlations, x='Brand', y='Correlation', title='Correlation of Time and Stars by the Top 50 Brands',
             labels={'Correlation': 'Correlation Coefficient ', 'Brand': 'Brand '}, color='Correlation',color_continuous_scale='portland')

fig.update_traces(hovertemplate='Brand = <b>%{x}</b><br>Correlation = <b>%{y}</b>')

fig.update_layout(xaxis_tickangle=-65)

fig.show()

         Brand  Correlation
0       Kamfen        0.816
1  Little Cook        0.588
2         Mama        0.582
3      Samyang        0.483
4      Sau Tao        0.449
----------------------------


<p style="font-size:18px;"> I wanted to see which flavours the Ramen Rater prefers and whether his taste changed over time. To begin I created a dataframe which groups the reviews by flavour and calculates the average stars by flavour. From this I could then plot each flavour on a scatter plot to visualise the data.

In [31]:
# Flavor dataframe
flavor_df = df.groupby('Flavor').agg(Review_Count=('Flavor', 'size'),
            Avg_Stars=('Stars', 'mean')).round(2).reset_index()

flavor_df = flavor_df.sort_values(by='Avg_Stars', ascending=True)

print(flavor_df.head(3))
print("------------------------------------")

# Scatter plot using Plotly Express
fig = px.scatter(flavor_df, x='Avg_Stars', y='Flavor', color='Review_Count',
                 title='Flavor vs Average Stars',
                 labels={'Flavor': 'Flavor', 'Review_Count': 'Number of Reviews', 'Avg_Stars': 'Average Stars'},
                 hover_data=['Review_Count'], color_continuous_scale='Portland')

fig.update_traces(hovertemplate= "Flavor: <b>%{y}</b> <br>" + "Average Stars: <b>%{x}</b> <br>"
                  +"Number of reviews: <b>%{customdata}</b>")

fig.update_layout(width=1100,height=800,
                  yaxis_title='Flavor',xaxis_title='Average Stars',
                  xaxis=dict(range=[0, 5]),
                  yaxis=dict(categoryorder='array', categoryarray=flavor_df['Flavor']))

fig.show()

        Flavor  Review_Count  Avg_Stars
12    mushroom            37       2.76
26  vegetarian            45       3.06
25   vegetable            69       3.41
------------------------------------


<p style="font-size:18px;"> This is very interesting. It appears that the Ramen Rater tends to prefer ramen with the flavours laksa, goreng and buldak. Overall it appears that the Ramen Rater has a preference for spicier flavours. On the other hand the least liked flavours appear to be vegetable related, with the lowest average ratings belonging to vegetable, vegetarian and mushroom.

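<p style="font-size:18px;"> To take this one step further I wanted to see whether the stars for each flavour have changed over time. The reviews are not dated, so I used the 'Review #' column as a stand-in for time and calculated its correlation with the stars for every flavour with at least 15 reviews. I plotted the bar chart below to show this.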

In [32]:
flavor_counts = df['Flavor'].value_counts()
filtered_df = df[df['Flavor'].isin(flavor_counts[flavor_counts >= 15].index)]

correlation_df = filtered_df.groupby('Flavor').apply(lambda x: x['Review #'].corr(x['Stars'])).round(3).reset_index()
correlation_df.columns = ['Flavor', 'Correlation']

sorted_correlations = correlation_df.sort_values(by='Correlation', ascending=False).reset_index(drop=True)

print(sorted_correlations.head())
print("----------------------------")

# Bar Chart Plot
fig = px.bar(sorted_correlations, x='Flavor', y='Correlation', title='Correlation of Time and Stars by Flavor',
             labels={'Correlation': 'Correlation Coefficient ', 'Flavor': 'Flavor '}, color='Correlation',color_continuous_scale='portland')

fig.update_traces(hovertemplate='Flavor = <b>%{x}</b><br>Correlation = <b>%{y}</b>')

fig.update_layout(xaxis_tickangle=-65)

fig.show()

       Flavor  Correlation
0  vermicelli        0.516
1  vegetarian        0.378
2       laksa        0.346
3        pork        0.328
4       chili        0.317
----------------------------


<p style="font-size:18px;"> Most of the flavours show a positive correlation between review number and star rating, meaning their ratings have tended to rise over time. Although vegetarian ramen is one of the least liked flavours (based on average stars), it has one of the strongest positive correlations (0.378). This could suggest that the Ramen Rater is starting to prefer vegetarian ramen, or that vegetarian ramen products have improved and been refined over the years.
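
<p style="font-size:18px;"> It is worth noting that the average and the correlation measure different things: the average describes the level of the ratings, while the correlation with review number describes their direction over time. The sketch below uses made-up reviews (not taken from the data set) to show how a flavour can have a low average and still have a strong positive trend.

```python
import pandas as pd

# Hypothetical toy data: review numbers stand in for time, as in the case study
toy = pd.DataFrame({
    "Review #": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "Flavor":   ["vegetarian"] * 5 + ["laksa"] * 5,
    "Stars":    [1.0, 1.5, 2.0, 2.5, 3.0, 5.0, 4.5, 5.0, 4.5, 5.0],
})

# Level: average stars per flavour
summary = toy.groupby("Flavor").agg(Avg_Stars=("Stars", "mean"))

# Trend: Pearson correlation between review number and stars within each flavour
summary["Correlation"] = pd.Series({
    flavor: group["Review #"].corr(group["Stars"])
    for flavor, group in toy.groupby("Flavor")
})

print(summary.round(3))
```

<p style="font-size:18px;"> Here the toy "vegetarian" reviews have the lower average but rise steadily, while the toy "laksa" reviews are high throughout with no trend.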

<p style="font-size:18px;"> Finally, I wanted to explore how the style of the ramen could impact reviews. Once again I created a dataframe which groups the reviews by style and calculates the review count and average stars per style.

In [33]:
style_df = df.groupby('Style').agg(Review_Count=('Style', 'size'),
            Avg_Stars=('Stars', 'mean')).round(2).reset_index()

style_df = style_df.sort_values(by='Avg_Stars', ascending=False)

style_df.head(6)

Unnamed: 0,Style,Review_Count,Avg_Stars
1,Box,113,4.24
3,Other,6,3.88
4,Pack,2478,3.85
0,Bowl,927,3.68
5,Tray,207,3.57
2,Cup,908,3.49


<p style="font-size:18px;"> With the dataframe created I can now plot this data. I decided to plot a pie chart to visualise the percentage of reviews that each style accounts for. On top of this I wanted to plot every review by style against its review number, and finally to calculate the correlation between review number and stars for each style to see whether any style's ratings have improved over time.

In [34]:
# Pie Chart Plot
fig = px.pie(style_df, values='Review_Count', names='Style',
             title='Styles by Review Count', 
             hover_data=['Avg_Stars'],
             labels={'Review_Count': 'Number of Reviews'})

fig.update_traces(hovertemplate='%{label} <br> Review count: %{value} (%{percent})<br> Average Stars: %{customdata[0]}')

fig.update_layout(width=800, height=600)

fig.show()

# Scatter Plot
fig = px.scatter(df, x='Review #', y='Stars', color='Style',
                 title='Stars vs Time per Style',
                 labels={'Stars': 'Stars', 'Review #': 'Review Number'})

fig.update_traces(hovertemplate= 'Review Number: %{x} <br> Stars: %{y}')

fig.for_each_trace(lambda trace: trace.update(visible='legendonly') if trace.name != 'Pack' else trace.update(visible=True))

fig.show()

# Stars and Style Correlation by Time Dataframe
style_counts = df['Style'].value_counts()
filtered_df = df[df['Style'].isin(style_counts[style_counts >= 15].index)]

correlation_df = filtered_df.groupby('Style').apply(lambda x: x['Review #'].corr(x['Stars'])).round(3).reset_index()
correlation_df.columns = ['Style', 'Correlation']

sorted_correlations = correlation_df.sort_values(by='Correlation', ascending=False).reset_index(drop=True)

print("----------------------")
print(sorted_correlations)

----------------------
  Style  Correlation
0  Pack        0.265
1   Cup        0.076
2  Bowl        0.071
3  Tray       -0.002
4   Box       -0.042


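<p style="font-size:18px;"> From this I can see that the majority of the reviews are of pack style ramen. There is little correlation between review number and stars for most styles: only pack ramen shows a weak positive correlation (0.265), while cup, bowl, tray and box are all close to zero. The data also shows that box ramen is the Ramen Rater's preferred style, with an average score of 4.24 stars, although this is based on only 113 reviews compared with 2478 for pack.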
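
<p style="font-size:18px;"> To put the pack majority into numbers, the sketch below recomputes each style's share of the reviews. It is a small standalone example: the counts and averages are copied by hand from the style table printed earlier rather than read from the original dataframe, and the column name Share_Pct is my own.

```python
import pandas as pd

# Review counts and average stars copied from the style table in this case study
style_summary = pd.DataFrame({
    "Style": ["Box", "Other", "Pack", "Bowl", "Tray", "Cup"],
    "Review_Count": [113, 6, 2478, 927, 207, 908],
    "Avg_Stars": [4.24, 3.88, 3.85, 3.68, 3.57, 3.49],
})

# Share of all reviews per style, in percent
total_reviews = style_summary["Review_Count"].sum()
style_summary["Share_Pct"] = (100 * style_summary["Review_Count"] / total_reviews).round(1)

print(style_summary.sort_values("Share_Pct", ascending=False))
```

<p style="font-size:18px;"> On these counts pack ramen accounts for a little over half of the reviews, while box ramen, the highest rated style, makes up only a small fraction of them.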

# **Summary and Conclusion**

***

<p style="font-size:18px;"> Many thanks for reaching the end of my case study. I hope it has provided some insight into this dataset and the reviews of the Ramen Rater. In this case study I have imported, cleaned and analysed the data to find the Ramen Rater's preferred ramen and how the reviews have changed over time. From the data I have been able to find the following insights about the Ramen Rater:
    
<ul style="font-size:18px;"><li>He prefers ramen from Asian countries, South Asia in particular.
<li> Spicy flavours are preferred over non-spicy and vegetable ramen.
<li> Box and pack are the most preferred styles of ramen.
<li> Over time his tastes have changed and he has developed a greater dislike of tomato, yakisoba and cheese ramen.  
<li> His favourite brands (based on average score) are "Mykuali" and "Mom's Dry Noodle".  
</ul>
    
<p style="font-size:18px;"> Feel free to explore this dataset and see if you can find further interesting insights. To take this one step further, you could use machine learning algorithms to see whether it is possible to predict the Ramen Rater's star rating for a ramen given its brand, flavour, country of origin and style.
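
<p style="font-size:18px;"> As a possible starting point for that idea, the sketch below one-hot encodes two categorical columns and fits an ordinary least squares baseline using only pandas and NumPy. The toy reviews are made up for illustration and are not taken from the data set, and this is just one simple approach rather than a recommended model. A real attempt would also need to hold out some reviews to check how well the predictions generalise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy reviews; the real data set would supply these columns
toy = pd.DataFrame({
    "Style":   ["Pack", "Pack", "Cup", "Cup", "Box", "Box"],
    "Country": ["Malaysia", "Japan", "Malaysia", "Japan", "Malaysia", "Japan"],
    "Stars":   [4.5, 4.0, 3.5, 3.0, 5.0, 4.5],
})

# One-hot encode the categorical features; drop_first removes one redundant
# column per feature so the intercept represents the dropped categories
X = pd.get_dummies(toy[["Style", "Country"]], drop_first=True).astype(float)
X.insert(0, "Intercept", 1.0)
y = toy["Stars"].to_numpy()

# Ordinary least squares fit of stars on the encoded features
coef, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
predicted = X.to_numpy() @ coef

print(pd.Series(coef, index=X.columns).round(2))
print(np.round(predicted, 2))
```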
    
<p style="font-size:18px;"> <strong> Please feel free to leave any feedback on this case study and upvote if you enjoyed it!</strong>
    
<p style="font-size:18px;"> <strong> I hope you have a good day! 😄</strong>
    
    