# UIUC Data Mining Project Task2 

## instructions

Some questions to consider when building the cuisine map are the following:

- What's the best way of representing a cuisine? If all we know about a cuisine is just the name, then there is nothing we can do. However, if we can associate a cuisine with the restaurants offering the cuisine and all the information about the restaurants, particularly reviews, then we will have a basis to characterize cuisines and assess their similarity. Since review text contains a lot of useful information about a cuisine, a natural question is: what's the best way to represent a cuisine with review text data? Are some words more important in representing a cuisine than others?

- What's the best way of computing the similarity of two cuisines? Assuming that two cuisines can each be represented by their corresponding reviews, how should we compute their similarity?

- What's the best way of clustering cuisines? Clustering of cuisines can help reveal major categories of cuisines. How would the number of clusters impact the utility of your results for understanding cuisine categories? How does a clustering algorithm affect the visualization of the cuisine map?

- Is your cuisine map actually useful to at least some people? In what way? If it's not useful, how might you be able to improve it to make it more useful?

Note that most of these questions are open questions that nobody really has a good answer to, but they are practically important questions to address. Thus, by working on this task, you are really working on a frontier research topic in data mining. Your goal in this task is to do a preliminary exploration of these questions and help provide preliminary answers to them. You can address such questions by analyzing the visualization of the cuisine map and comparing the results of alternative ways of mining the data to assess which strategy seems to work better for what purpose. You are encouraged to think creatively about how to quantitatively evaluate clustering results. For example, you can consider separating all the reviews about one cuisine (e.g., Indian) into multiple disjoint subsets (e.g., Indian1, Indian2, and Indian3) and thus artificially create multiple separate cuisines that are known to be of the same category. You can then test your algorithm on such an artificial data set to see if it can really group these artificial subcategories of the same cuisine together or give them very high similarity values.

# Packages needed for this part

In [1]:
import os
import pandas as pd
import numpy as np
import string

import altair as alt

In [2]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

stop_words = set(stopwords.words('english'))

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
# stop_words

In [5]:
# string.punctuation

# Data Processing
    I randomly choose 50 categories from the given category dataset.

## Actions for this part
- Get all files from directory
- prepocess each categorie using following steps:
    - Toeknizing paragraph
    - Removing stop words
    - Removing all punctuation
    - stemming

In [6]:
# categories_url = 'categories_selected/'
# categories_list = os.listdir(categories_url)
# categories_list

In [7]:
# temp = []
# for i in categories_list:
#     temp.append(i)
# temp

## Preprocessing function for a single file

In [8]:
# test file 'Arabian.txt'
test_url = 'categories_selected/' + 'Arabian.txt'
def category_Preprocessing(file_url):
    para_record = []
    para_dict = {}
    with open(file_url) as f:
        lines = f.readlines()
        for line in lines:
            line = line.strip()
            if line != '':
                # Toeknizing paragraph 
                words = word_tokenize(line)
                # Removing stop words && Removing all punctuation
                for word in words:
                    word = word.lower()
                    if word not in stop_words and word not in string.punctuation and word[0] not in string.punctuation and word[-1] not in string.punctuation:
                        #stemming words
                        ps.stem(word)
                        para_record.append(word)
    #             print(word_tokenize(line))
    f.close()

    for word in para_record:
        if para_dict.get(word) != None:
            para_dict[word] += 1
        else:
            para_dict[word] = 1

    # remove all words appear once
    para_dict = {key:val for key, val in para_dict.items() if val > 1}
    
    # turn dictionary into string for further comparation
    para = ''

    for key, value in para_dict.items():
        for i in range(0, value):
            para += key
            para += ' '

    return para.strip()

# print(category_Preprocessing(test_url))    

In [9]:
# # preprocess test for a single file
# para_record = []
# para_dict = {}
# with open(test_url) as f:
#     lines = f.readlines()
#     for line in lines:
#         line = line.strip()
#         if line != '':
#             # Toeknizing paragraph 
#             words = word_tokenize(line)
#             # Removing stop words && Removing all punctuation
#             for word in words:
#                 word = word.lower()
#                 if word not in stop_words and word not in string.punctuation and word[0] not in string.punctuation and word[-1] not in string.punctuation:
#                     #stemming words
#                     ps.stem(word)
#                     para_record.append(word)
# #             print(word_tokenize(line))
# f.close()

# for word in para_record:
#     if para_dict.get(word) != None:
#         para_dict[word] += 1
#     else:
#         para_dict[word] = 1

# # remove all words appear once
# para_dict = {key:val for key, val in para_dict.items() if val > 1}

# # turn dictionary into string for further comparation
# para = ''

# for key, value in para_dict.items():
#     for i in range(0, value):
#         para += key
#         para += ' '

# # print(para_record)
# print(para.strip())


## Get sorted paragraph of all 50 samples

In [10]:
# function to get all paragraph form an array
def get_para_list(paras, file_url):
    paras.append(category_Preprocessing(file_url))
    return paras
    

In [11]:
# function to form a name list
def get_name_list(cate_list):
    name_list = []
    for cate in cate_list:
        name_list.append(cate.split('.')[0])
    return name_list

In [12]:
name_list = []
paras_list = []

cnt = 0

categories_url = 'categories_selected/'
categories_list = os.listdir(categories_url)

for i in categories_list:
    paras_list = get_para_list(paras_list, categories_url + i)

name_list = get_name_list(categories_list)

In [13]:
# len(paras_list)

In [14]:
# name_list

# Task 2.1: Visualization of the Cuisine Map
1. Similarity without using IDF (Using Cuisine Similarity)
2. Similarity Matrix from IDF (Colors only for better distinction.)
3. Similarity Matrix from LDA (Colors only for better distinction.)

## 1. Similarity using Cuisine Similarity

In [15]:
# Create the Document Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer = CountVectorizer()
sparse_matrix = count_vectorizer.fit_transform(paras_list)

doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix, 
                  columns=count_vectorizer.get_feature_names())
# df

# Compute Cosine Similarity
res = cosine_similarity(df, df)

res



array([[1.        , 0.70131137, 0.49922837, ..., 0.54464761, 0.61443961,
        0.7240215 ],
       [0.70131137, 1.        , 0.49745527, ..., 0.58526374, 0.61748964,
        0.7526091 ],
       [0.49922837, 0.49745527, 1.        , ..., 0.44357131, 0.38241327,
        0.4971215 ],
       ...,
       [0.54464761, 0.58526374, 0.44357131, ..., 1.        , 0.46323682,
        0.54668251],
       [0.61443961, 0.61748964, 0.38241327, ..., 0.46323682, 1.        ,
        0.61510107],
       [0.7240215 , 0.7526091 , 0.4971215 , ..., 0.54668251, 0.61510107,
        1.        ]])

### Visualization using Grid-Heat-Map

In [16]:
# process data for HeatMap suitable for altair
vis_data = pd.DataFrame(res)
vis_data.insert(loc=0, column='ID', value = np.arange(len(df)))
vis_data = vis_data.melt(id_vars=['ID'])


In [17]:
vis_data

Unnamed: 0,ID,variable,value
0,0,0,1.000000
1,1,0,0.701311
2,2,0,0.499228
3,3,0,0.629677
4,4,0,0.659567
...,...,...,...
2495,45,49,0.521534
2496,46,49,0.771266
2497,47,49,0.546683
2498,48,49,0.615101


In [18]:
vis_data.iat[0, 1]

0

In [19]:
for i in range(0, len(vis_data)):
    for j in range(0, 2):
        val = vis_data.iat[i, j]
        vis_data.iat[i, j] = name_list[val]

In [20]:
vis_data

Unnamed: 0,ID,variable,value
0,Arabian,Arabian,1.000000
1,Argentine,Arabian,0.701311
2,Automotive,Arabian,0.499228
3,Belgian,Arabian,0.629677
4,Brasseries,Arabian,0.659567
...,...,...,...
2495,Salad,Ukrainian,0.521534
2496,Scottish,Ukrainian,0.771266
2497,Seafood Markets,Ukrainian,0.546683
2498,Singaporean,Ukrainian,0.615101


In [21]:
Cuisine = alt.Chart(vis_data).mark_rect().encode(
    x=alt.X('variable:O', title="Cuisine", axis=alt.Axis(labelAngle=45)),
    y=alt.Y('ID:O', title="Cuisine"),
    color=alt.Color('value:Q', title="Similarity", scale = alt.Scale(scheme='greenblue'))
).properties(
    title='Task2.1_Cuisine_Similarity'
)
Cuisine

  for col_name, dtype in df.dtypes.iteritems():


In [22]:
Cuisine.save('Task2-1_noIDF.json')

## 2.Similarity Matrix from IDF (Colors only for better distinction.)

In [23]:
tfidf = TfidfVectorizer().fit_transform(paras_list)
idf_similarity = tfidf * tfidf.T
# print(idf_similarity)

In [24]:
vis_data = idf_similarity.toarray()
vis_data

array([[1.        , 0.34205414, 0.18367461, ..., 0.2432659 , 0.24071969,
        0.35857105],
       [0.34205414, 1.        , 0.18872535, ..., 0.27464666, 0.25183926,
        0.38079188],
       [0.18367461, 0.18872535, 1.        , ..., 0.15068367, 0.11301433,
        0.17661113],
       ...,
       [0.2432659 , 0.27464666, 0.15068367, ..., 1.        , 0.21236433,
        0.23873944],
       [0.24071969, 0.25183926, 0.11301433, ..., 0.21236433, 1.        ,
        0.25450225],
       [0.35857105, 0.38079188, 0.17661113, ..., 0.23873944, 0.25450225,
        1.        ]])

In [25]:
# process data for HeatMap suitable for altair
vis_data = pd.DataFrame(res)
vis_data.insert(loc=0, column='ID', value = np.arange(len(df)))
vis_data = vis_data.melt(id_vars=['ID'])

for i in range(0, len(vis_data)):
    for j in range(0, 2):
        val = vis_data.iat[i, j]
        vis_data.iat[i, j] = name_list[val]
vis_data

Unnamed: 0,ID,variable,value
0,Arabian,Arabian,1.000000
1,Argentine,Arabian,0.701311
2,Automotive,Arabian,0.499228
3,Belgian,Arabian,0.629677
4,Brasseries,Arabian,0.659567
...,...,...,...
2495,Salad,Ukrainian,0.521534
2496,Scottish,Ukrainian,0.771266
2497,Seafood Markets,Ukrainian,0.546683
2498,Singaporean,Ukrainian,0.615101


In [26]:
Cuisine = alt.Chart(vis_data).mark_rect().encode(
    x=alt.X('variable:O', title="Cuisine", axis=alt.Axis(labelAngle=45)),
    y=alt.Y('ID:O', title="Cuisine"),
    color=alt.Color('value:Q', title="Similarity", scale = alt.Scale(scheme='rainbow'))
).properties(
    title='Task2.1_TFIDF_Similarity'
)
Cuisine

  for col_name, dtype in df.dtypes.iteritems():


## 2.Similarity Matrix from LDA(Colors only for better distinction.)

- before we make visualization, we must make topic based on word

# Task 2.3: Incorporating Clustering in Cuisine Map
Use any similarity results from Task 2.1 or Task 2.2 to do clustering. Visualize the clustering results to show the major categories of cuisines. Vary the number of clusters to try at least two very different numbers of clusters, and discuss how this affects the quality or usefulness of the map. Use multiple clustering algorithms for this task. Also note in that each color is a separate cluster in the sample images below.