Hello 🙌, welcome to my notebook. In this notebook we will explore interactive visualization and clustering, and build geo-maps that show how ramen star ratings and reviews are distributed around the world. Feel free to reach out if you have any questions or suggestions. Thank you!

![](https://realfood.tesco.com/media/images/1400x919-BeefRamen-db0570eb-10cf-437b-a497-6cd26837ee2d-0-1400x919.jpg) (www.realfood.tesco.com)

- Context: The Ramen Rater is a product review website for the hardcore ramen enthusiast (or "ramenphile"), with over 2500 reviews to date. This dataset is an export of "The Big List" (of reviews), converted to a CSV format.

- Content: Each record in the dataset is a single ramen product review. Review numbers are contiguous: more recently reviewed ramen varieties have higher numbers. Brand, Variety (the product name), Country, and Style (Cup? Bowl? Tray?) are pretty self-explanatory. Stars indicate the ramen quality, as assessed by the reviewer, on a 5-point scale; this is the most important column in the dataset!

In [None]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data1 = pd.read_csv('../input/ramen-ratings/ramen-ratings.csv')


In [None]:
data1.head()

In [None]:
data1.shape

In [None]:
data1.info()

In [None]:
'''Missing Value Chart'''
import matplotlib.pyplot as plt

plt.figure(figsize=(13, 3))
data1.isnull().mean(axis=0).plot.barh()
plt.title("Ratio of missing values per column")

In [None]:
'''Unique Columns'''

def unique_counts(data):
   for i in data.columns:
       count = data[i].nunique()
       print(i, ": ", count)
unique_counts(data1)

In [None]:
data1['Top Ten'].unique()

In [None]:
'''Handling Missing Values'''

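var = ["Top Ten"]
for i in var:
    # Assign the result back; fillna(inplace=True) on a selected column may not update data1
    data1[i] = data1[i].fillna(0)

data1.dropna(inplace=True)  # drop the remaining rows that contain missing values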

- The dataset has 2580 rows and 7 columns.
- All columns are typed as object, so features such as Review and Stars must be converted to numeric types later.
- Top Ten is the feature with the most missing values.
- We ignore Top Ten because it does not contain complete information about the top ten ramen of each year; its missing values are filled with 0.
- We check for duplicates and drop them.
- Rows with missing values in the other features are simply dropped, because there are only a few of them.
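
The cleaning steps listed above can be sketched on a tiny made-up table. This is only an illustration under assumptions: the `toy` frame, its values and the `clean` helper are hypothetical and not part of the notebook, and `pd.to_numeric(..., errors="coerce")` is just one possible way to turn Stars into numbers.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ramen table (all values are made up for illustration)
toy = pd.DataFrame({
    "Review #": [3, 2, 2, 1],
    "Style": ["Cup", "Pack", "Pack", np.nan],
    "Stars": ["3.5", "Unrated", "Unrated", "4"],
    "Top Ten": [np.nan, "2016 #1", "2016 #1", np.nan],
})

def clean(df):
    """Hypothetical helper: fill Top Ten, drop incomplete and duplicate rows, make Stars numeric."""
    out = df.copy()
    # Cast to object first so the integer 0 is accepted whatever string dtype pandas inferred
    out["Top Ten"] = out["Top Ten"].astype(object).fillna(0)
    out = out.dropna()            # drop rows that still miss a value in any other column
    out = out.drop_duplicates()   # drop exact duplicate rows
    # errors="coerce" turns non-numeric entries such as "Unrated" into NaN, which we then set to 0
    out["Stars"] = pd.to_numeric(out["Stars"], errors="coerce").fillna(0)
    return out.reset_index(drop=True)

cleaned = clean(toy)
print(cleaned)
```

Two rows survive in this toy example: the row with a missing Style is dropped, and the two identical Pack rows collapse into one.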

In [None]:
import plotly.express as px

custom_aggregation = {}
custom_aggregation["Style"] = "count"
data2 = data1.groupby("Style").agg(custom_aggregation)

data2.columns = ["Count"]
data2['Style'] = data2.index

fig = px.bar(data2, x='Style', y="Count", color="Style", title="Number of Ramen Style")
fig.show()

In [None]:
fig = px.box(data1, x="Style", y="Review #", color="Style", boxmode="overlay", title="Style x Review Boxplot")
fig.update_traces(quartilemethod="inclusive")
fig.show()

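- Pack is the most common ramen style in this dataset.
- Judging by the box plot, Cup is the most reviewed ramen style. Keep in mind that Review # is a sequential review number, so its distribution mainly shows how recently each style was reviewed.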

In [None]:
custom_aggregation = {}
custom_aggregation["Country"] = "count"
data2 = data1.groupby("Country").agg(custom_aggregation)

data2.columns = ["Count"]
data2['Country'] = data2.index

fig = px.bar(data2, x='Country', y="Count", color="Country", title="Number of Country")
fig.show()

In [None]:
data1.loc[data1['Stars'] == 'Unrated', 'Stars'] = 0  # .loc avoids chained assignment
data1['Stars'] = data1['Stars'].astype(float)

custom_aggregation = {}
custom_aggregation["Stars"] = "mean"
data2 = data1.groupby("Country").agg(custom_aggregation)

data2.columns = ["Stars"]
data2['Country'] = data2.index
data2['Stars'] = data2['Stars'].round(decimals=1)

fig = px.bar(data2, x='Country', y="Stars", color="Country", title="Stars Rating in Each Country")
fig.show()

In [None]:
custom_aggregation = {}
custom_aggregation["Variety"] = "count"
data2 = data1.groupby("Country").agg(custom_aggregation)

data2.columns = ["Count"]
data2['Country'] = data2.index

fig = px.bar(data2, x='Country', y="Count", color="Country", title="Ramen Variety in Each Country")
fig.show()

- Japan, USA and South Korea have the most ramen reviews.
- For the star rating, I take the mean rating per country; the highest-rated countries are Brazil, Sarawak, Malaysia and Indonesia.
- Japan, USA and South Korea also have the most ramen varieties.
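
The per-country counts and mean ratings discussed above can also be computed in a single `groupby` with named aggregation. The sketch below is an assumption-laden illustration: the `toy` frame and the column names `Count` and `MeanStars` are made up, not taken from the notebook.

```python
import pandas as pd

# Made-up reviews for three countries
toy = pd.DataFrame({
    "Country": ["Japan", "Japan", "USA", "Brazil", "USA", "Japan"],
    "Variety": ["A", "B", "C", "D", "E", "F"],
    "Stars": [4.0, 3.0, 2.5, 5.0, 3.5, 5.0],
})

# One pass over the groups gives both the number of varieties and the mean rating
summary = (
    toy.groupby("Country")
       .agg(Count=("Variety", "count"), MeanStars=("Stars", "mean"))
       .reset_index()
)

by_count = summary.sort_values("Count", ascending=False)
by_stars = summary.sort_values("MeanStars", ascending=False)
print(by_count)
print(by_stars)
```

In this toy data Brazil tops the mean rating with a single review, which is why it can help to read the mean next to the count.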

In [None]:
'''Building the TF-IDF matrix'''

import nltk, warnings
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import digits, punctuation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


X = data1["Variety"].unique()
stemmer = nltk.stem.porter.PorterStemmer()
stopword = nltk.corpus.stopwords.words('english')

def stem_and_filter(doc):
    tokens = [stemmer.stem(w) for w in analyzer(doc)]
    return [token for token in tokens if token.isalpha()]

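# Default analyzer: lowercases and tokenizes, but removes no stop words
analyzer = TfidfVectorizer().build_analyzer()
# Note: scikit-learn ignores `lowercase` and `stop_words` when `analyzer` is a callable,
# so the filtering here relies on max_df, which drops terms found in more than 30% of the documents
CV = TfidfVectorizer(lowercase=True, stop_words="english", analyzer=stem_and_filter, min_df=0.00, max_df=0.3)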
TF_IDF_matrix = CV.fit_transform(X)
print("TF_IDF_matrix :", TF_IDF_matrix.shape, "of", TF_IDF_matrix.dtype)

In [None]:
'''TF-IDF Embedding'''

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

svd = TruncatedSVD(n_components = 100)
normalizer = Normalizer(copy=False)
TF_IDF_embedded = svd.fit_transform(TF_IDF_matrix)
TF_IDF_embedded = normalizer.fit_transform(TF_IDF_embedded)
print("TF_IDF_embedded :", TF_IDF_embedded.shape, "of", TF_IDF_embedded.dtype)

In [None]:
'''Silhouette Scoring'''

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import numpy as np

score_tfidf = []
x = list(range(5, 155, 10))

for n_clusters in x:
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=10)
    kmeans.fit(TF_IDF_embedded)
    clusters = kmeans.predict(TF_IDF_embedded)
    silhouette_avg = silhouette_score(TF_IDF_embedded, clusters)
    rep = np.histogram(clusters, bins = n_clusters-1)[0]
    score_tfidf.append(silhouette_avg)
    
plt.figure(figsize=(20,16))
plt.subplot(2, 1, 1)
plt.plot(x, score_tfidf, label="TF-IDF matrix")
plt.title("Evolution of the Silhouette Score")
plt.legend()

In [None]:
'''KMeans Clustering'''

n_clusters = 70
kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30, random_state=0)
proj = kmeans.fit_transform(TF_IDF_embedded)
clusters = kmeans.predict(TF_IDF_embedded)
plt.figure(figsize=(10,10))
plt.scatter(proj[:,0], proj[:,1], c=clusters)
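# KMeans.fit_transform returns each sample's distance to every centroid, not a PCA projection
plt.title(f"Distances to the first two centroids ({n_clusters} clusters)", fontsize=20)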

In [None]:
'''TSNE Visualization'''
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
proj = tsne.fit_transform(TF_IDF_embedded)
plt.figure(figsize=(10,10))
plt.scatter(proj[:,0], proj[:,1], c=clusters)
plt.title("Visualization of the clustering with TSNE", fontsize="20")

In [None]:
'''WordClouding'''
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import random

plt.figure(figsize=(20,8))
wc = WordCloud()

for num, cluster in enumerate(random.sample(range(n_clusters), 12)):  # sample 12 of the fitted clusters
    plt.subplot(3, 4, num+1)
    wc.generate(" ".join(X[np.where(clusters==cluster)]))
    plt.imshow(wc, interpolation='bilinear')
    plt.title("Cluster {}".format(cluster))
    plt.axis("off")
    
plt.show()

- For the text transformation and clustering I learned a lot from this kernel: https://www.kaggle.com/miljan/customer-segmentation. Kindly check it out and upvote it!
- To turn the text into features I use TfidfVectorizer. Once we have the matrix, we reduce its dimensionality with truncated SVD (also known as LSA), a transformer that performs linear dimensionality reduction by means of truncated singular value decomposition.
- We then use silhouette scoring to search for the best number of clusters, and visualize the clustering with t-SNE.
- The silhouette score evaluates the quality of clusters produced by algorithms such as K-Means, in terms of how well each sample is grouped with other samples that are similar to it (dzone.com).
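
The TF-IDF, truncated SVD and normalization steps described above can be chained in one scikit-learn pipeline. This is a minimal sketch under assumptions: the six product names in `docs` are invented, the built-in English stop-word list replaces the stemming analyzer, and 3 components are used only because the toy corpus is tiny.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Invented product names standing in for the Variety column
docs = [
    "Spicy Chicken Noodle",
    "Chicken Flavour Ramen",
    "Spicy Beef Noodle Soup",
    "Tom Yum Shrimp Noodle",
    "Beef Flavour Cup Noodle",
    "Shrimp Tempura Udon",
]

lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),         # text -> sparse TF-IDF matrix
    TruncatedSVD(n_components=3, random_state=0),  # sparse matrix -> 3 dense LSA dimensions
    Normalizer(copy=False),                        # rescale every row to unit length
)
embedded = lsa.fit_transform(docs)
print(embedded.shape)  # (6, 3)
```

Normalizing the rows after the SVD means that Euclidean distances between documents behave like cosine distances, which is a common choice before running K-Means on text.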

![](https://miro.medium.com/max/700/1*cUcY9jSBHFMqCmX-fp8BvQ.jpeg) (www.towardsdatascience.com)

- For Variety we use 70 clusters, the number that best represents the ramen varieties in this dataset according to the silhouette scores.
- We then look at the popular words in a random sample of clusters using WordCloud.
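
Choosing the number of clusters from silhouette scores can be sketched as follows. The data here is synthetic (three well-separated blobs from `make_blobs`), and the candidate range and the `scores` and `best_k` names are assumptions for illustration, not the notebook's values.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embedded matrix: three well-separated groups
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)  # mean silhouette over all samples

# Keep the candidate with the highest average silhouette
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real text embeddings the curve is usually much flatter than on this toy data, so the maximum is better read as a hint than as a definitive answer.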

In [None]:
'''Clustering Using KMeans from Selected Feature'''

feature = ['Review #', 'Stars']
cls_ = data1[feature]

from sklearn.cluster import KMeans
nc = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in nc]
score = [kmeans[i].fit(cls_).score(cls_) for i in range(len(kmeans))]
plt.plot(nc,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

In [None]:
kmeans = KMeans(n_clusters=4, random_state=0).fit(cls_)
# cls_ shares its index with data1, so the labels can be attached directly (no merge needed)
data1['Cluster'] = kmeans.labels_

In [None]:
dict_article_to_cluster = {article : cluster for article, cluster in zip(X, clusters)}
cluster = data1['Variety'].apply(lambda x : dict_article_to_cluster[x])

In [None]:
data1 = pd.concat([data1,cluster], axis=1)

In [None]:
data1.columns = ["Review", "Brand", "Variety", "Style", 'Country', 'Stars', 'Top Ten', 'Cluster', 'Label'] #changing name columns
data1['Review'] = data1['Review'].astype(int)

In [None]:
data1.head()

- I cluster the data again, this time on the Stars and Review features, using KMeans.
- Based on the elbow curve, 4 clusters represent Stars and Review best.
- The resulting cluster labels are then combined with our dataset.
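
One thing worth keeping in mind with this kind of clustering: KMeans relies on Euclidean distance, so a feature with a large range (the review number) can outweigh one with a small range (the stars). The sketch below illustrates this on made-up data and shows standardization as one possible remedy. The `toy` frame, the `fit_labels` helper and the use of `StandardScaler` are assumptions for illustration and are not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Review": np.arange(1, 201),                 # spans 1..200
    "Stars": rng.choice([1.0, 5.0], size=200),   # spans 1..5
})

def fit_labels(features):
    """Hypothetical helper: cluster the given features into two groups."""
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

features = toy[["Review", "Stars"]]
toy["raw_cluster"] = fit_labels(features)
toy["scaled_cluster"] = fit_labels(StandardScaler().fit_transform(features))

# On the raw features each cluster covers one contiguous block of review numbers
raw_ranges = toy.groupby("raw_cluster")["Review"].agg(["min", "max"])
print(raw_ranges)

# After scaling, Stars gets a comparable say in the distance computation
print(pd.crosstab(toy["scaled_cluster"], toy["Stars"]))
```

If the clusters are meant to reflect ratings as much as review numbers, scaling the features first is one option to consider.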

In [None]:
cls_0 = data1[data1['Cluster']==0]
print(f'Mean stars rating in Cluster 0 : {round(cls_0.Stars.mean(),2)}')
print(f'Mean review in Cluster 0 : {round(cls_0.Review.mean(),2)}')
print(cls_0.groupby('Stars').size()[0:3].sort_values(ascending=False))
print(cls_0.groupby('Label').size().sort_values(ascending=False)[0:4])

In [None]:

# Word clouds for the four labels chosen to represent Cluster 0
labels = [15, 4, 53, 10]

fig, axes = plt.subplots(4, figsize=(50, 20))
for ax, label in zip(axes, labels):
    # Join the variety names themselves; str(Series) would also add index values and the dtype footer
    text = " ".join(cls_0.loc[cls_0['Label'] == label, 'Variety'])
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
    ax.imshow(wordcloud, interpolation="bilinear")
    ax.axis("off")

1. Cluster 0
    - Mean stars rating: 3.74
    - Mean review: 1602
    - Variety: 24, 25, 39, 55

Note: I only take 4 varieties (labels) to represent the ramen varieties in each cluster.
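
Picking the representative labels for a cluster can also be done programmatically instead of reading them off the printed counts. This is a small sketch on made-up data; the `top_labels` helper and the `toy` frame are hypothetical.

```python
import pandas as pd

# Made-up assignment of rows to a Stars/Review cluster and a text label
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 0, 0, 0, 1, 1, 1],
    "Label":   [7, 7, 7, 2, 2, 9, 2, 2, 5],
})

def top_labels(df, cluster, n=4):
    """Hypothetical helper: the n most frequent text labels inside one cluster."""
    subset = df[df["Cluster"] == cluster]
    # value_counts sorts by frequency, most common first
    return subset["Label"].value_counts().head(n).index.tolist()

print(top_labels(toy, 0, n=2))
print(top_labels(toy, 1, n=1))
```

The returned list could then drive the word cloud loop, so the label numbers never have to be typed by hand.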

In [None]:
cls_1 = data1[data1['Cluster']==1]
print(f'Mean stars rating in Cluster 1 : {round(cls_1.Stars.mean(),2)}')
print(f'Mean review in Cluster 1 : {round(cls_1.Review.mean(),2)}')
print(cls_1.groupby('Stars').size()[0:3].sort_values(ascending=False))
print(cls_1.groupby('Label').size().sort_values(ascending=False)[0:4])

In [None]:

# Word clouds for the four labels chosen to represent Cluster 1
labels = [47, 38, 39, 2]

fig, axes = plt.subplots(4, figsize=(50, 20))
for ax, label in zip(axes, labels):
    # Join the variety names themselves; str(Series) would also add index values and the dtype footer
    text = " ".join(cls_1.loc[cls_1['Label'] == label, 'Variety'])
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
    ax.imshow(wordcloud, interpolation="bilinear")
    ax.axis("off")

2. Cluster 1
    - Mean stars rating: 3.21
    - Mean review: 317
    - Variety: 47, 38, 39, 2

In [None]:
cls_2 = data1[data1['Cluster']==2]
print(f'Mean stars rating in Cluster 2 : {round(cls_2.Stars.mean(),2)}')
print(f'Mean review in Cluster 2 : {round(cls_2.Review.mean(),2)}')
print(cls_2.groupby('Stars').size()[0:3].sort_values(ascending=False))
print(cls_2.groupby('Label').size().sort_values(ascending=False)[0:4])

In [None]:

# Word clouds for the four labels chosen to represent Cluster 2
labels = [23, 54, 15, 39]

fig, axes = plt.subplots(4, figsize=(50, 20))
for ax, label in zip(axes, labels):
    # Join the variety names themselves; str(Series) would also add index values and the dtype footer
    text = " ".join(cls_2.loc[cls_2['Label'] == label, 'Variety'])
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
    ax.imshow(wordcloud, interpolation="bilinear")
    ax.axis("off")

3. Cluster 2
    - Mean stars rating: 3.93
    - Mean review: 2253
    - Variety: 23, 54, 15, 39

In [None]:
cls_3 = data1[data1['Cluster']==3]
print(f'Mean stars rating in Cluster 3 : {round(cls_3.Stars.mean(),2)}')
print(f'Mean review in Cluster 3 : {round(cls_3.Review.mean(),2)}')
print(cls_3.groupby('Stars').size()[0:3].sort_values(ascending=False))
print(cls_3.groupby('Label').size().sort_values(ascending=False)[0:4])

In [None]:

# Word clouds for the four labels chosen to represent Cluster 3
labels = [26, 39, 30, 49]

fig, axes = plt.subplots(4, figsize=(50, 20))
for ax, label in zip(axes, labels):
    # Join the variety names themselves; str(Series) would also add index values and the dtype footer
    text = " ".join(cls_3.loc[cls_3['Label'] == label, 'Variety'])
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
    ax.imshow(wordcloud, interpolation="bilinear")
    ax.axis("off")

4. Cluster 3
    - Mean stars rating: 3.71
    - Mean review: 957
    - Variety: 26, 39, 30, 49

In [None]:
fig = px.box(data1, x="Cluster", y="Review", color="Cluster", boxmode="overlay", title="Review Boxplot by Cluster")
fig.update_traces(quartilemethod="inclusive") # or "exclusive", or "linear" by default
fig.show()

In [None]:
fig = px.box(data1, x="Cluster", y="Stars", color="Cluster", boxmode="overlay", title="Stars Boxplot by Cluster")
fig.update_traces(quartilemethod="inclusive") # or "exclusive", or "linear" by default
fig.show()

In [None]:
import plotly.graph_objs as go
import warnings
from plotly.offline import init_notebook_mode,iplot
init_notebook_mode(connected=True)
warnings.filterwarnings("ignore")

custom_aggregation = {}
custom_aggregation["Stars"] = "mean"
data2 = data1.groupby("Country").agg(custom_aggregation)
data2.columns = ["Stars"]

# Take locations and values from the same frame so each country stays paired with its rating
temp = data2.reset_index(drop = False)

data = dict(type='choropleth',
            locations = temp['Country'],
            locationmode = 'country names', 
            z = temp['Stars'],
            text = temp['Country'], 
            colorbar = {'title':'Rating'},
            colorscale=[
            [0, "rgb(8, 29, 88)"], 
            [0.125, "rgb(37, 52, 148)"], 
            [0.25, "rgb(34, 94, 168)"], 
            [0.375, "rgb(29, 145, 192)"], 
            [0.5, "rgb(65, 182, 196)"], 
            [0.625, "rgb(127, 205, 187)"], 
            [0.75, "rgb(199, 233, 180)"], 
            [0.875, "rgb(237, 248, 217)"], 
            [1, "rgb(255, 255, 217)"]],    

            reversescale = False)

layout = dict(title='Ramen Rating per Country',

geo = dict(showframe = True, projection={'type':'mercator'}))

choromap = go.Figure(data = [data], layout = layout)

iplot(choromap, validate=False)

In [None]:
custom_aggregation = {}
custom_aggregation["Review"] = "mean"
data2 = data1.groupby("Country").agg(custom_aggregation)
data2.columns = ["Review"]

# Take locations and values from the same frame so each country stays paired with its value
temp = data2.reset_index(drop = False)

data = dict(type='choropleth',
            locations = temp['Country'],
            locationmode = 'country names', 
            z = temp['Review'],
            text = temp['Country'], 
            colorbar = {'title':'Mean Review #'},  # z holds the mean review number, not a count
            colorscale=[
            [0, "rgb(8, 29, 88)"], 
            [0.125, "rgb(37, 52, 148)"], 
            [0.25, "rgb(34, 94, 168)"], 
            [0.375, "rgb(29, 145, 192)"], 
            [0.5, "rgb(65, 182, 196)"], 
            [0.625, "rgb(127, 205, 187)"], 
            [0.75, "rgb(199, 233, 180)"], 
            [0.875, "rgb(237, 248, 217)"], 
            [1, "rgb(255, 255, 217)"]],    
            reversescale = False)

layout = dict(title='Ramen Review per Country',

geo = dict(showframe = True, projection={'type':'mercator'}))

choromap = go.Figure(data = [data], layout = layout)

iplot(choromap, validate=False)

- We check the distribution of Review and Stars across our clusters.
- It seems that our clustering separates the data very well!
- We then build geo-maps of the ramen star ratings and reviews.

Don't forget to upvote! Thank you! :)