<h1><center>FOOD AND COOKING</center></h1>

### Ideas Explored in this notebook : 
1. Extracted data from web API in JSON format
2. Used python library Beautiful Soup for extracting data from the json file
3. Pulled out all the latest posts and comments on the posts along with the timestamp. Converted original UNIX timestamp to a readable format.
4. Cleaned the data for analysis - Removed punctuations, context specific stop words and unicode characters
5. Performed tokenization and Lemmatization to find root words as per the context
6. Explored Latent Dirichlet Allocation for topic modelling
7. Recorded the three trending topics
8. Found TF-IDF scores for each word in a topic
9. For repeating words in a topic, computed the average TF-IDF score
10. Gave more weightage to the inference of the different topics based on the words having highest TF-IDF scores in a topic

In [9]:
# Import Library

import requests
import json
from pprint import pprint
from datetime import datetime as dt
from bs4 import BeautifulSoup
import datetime
import string
import pandas as pd
import numpy as np
from collections import Counter
import pickle

### NLTK imports
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

### Gensim imports
import gensim
from gensim import corpora

# ignoring warnings
import warnings
warnings.filterwarnings('ignore')

<h1><center>DATA EXTRACTION</center></h1>

In [10]:
# Pulling data for Category : Food and Cooking
# r = requests.get('https://a.4cdn.org/ck/catalog.json')
# r = r.json()

# pprint(r)

In [11]:
#save into a JSON file
# import json
# import re
# with open('Food.json', 'w') as fout:
#     json.dump(r , fout)

In [16]:
#Read from a JSON file
with open(r"Food.json", "r") as read_file:
    data = json.load(read_file)

In [17]:
# More exploration
# Pulling the comments and replies to the comments from the JSON file
comments = []
date_time = []
for i in range(len(data)):
    for j in range(len(data[i]['threads'])):
        if 'com' in data[i]['threads'][j].keys() :
            one_comm = data[i]['threads'][j]['com']
            date_time += [datetime.datetime.fromtimestamp(data[i]['threads'][j]['time']).strftime("%B %d, %Y")]
            soup1 = BeautifulSoup(one_comm)
            comments += [soup1.get_text()]

In [18]:
print(date_time[0])
print(comments[0])

May 08, 2020
>checks the oven


### Observations:
1. We have extracted 146 posts from the Food and cooking section of the website.
2. The comments were made on May 4th, 2020 to May 8th, 2020

### Assumptions :
We are working with the assumption that the comments below every post would be from the same topic as the post

<h1><center>DATA PREPARATION/CLEANING</center></h1>

### 1.      CONVERTS WORDS TO THEIR BASE FORMS
### 2. DATA CLEANING
##### --> Removing punctuations, adding more context related stop words for removal, removing unicode characters

In [19]:
def clean_data(tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tokens):
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token)
    return cleaned_tokens


In [20]:
cleaned_data = []
with open("stop_words.txt", "r") as input:
        extra_stop_words = [line.split(",")[0] for line in input.read().splitlines()]
stop_words = stopwords.words('english')
newStopWords = ['/g/','n\'t','\'s','\'\'','``','would','get','like','use','one','\'m','http','n','0', 'thus','x','1','say','good','much','want','go','run','need','new','even','shit','fuck']
stop_words+=newStopWords
stop_words+=extra_stop_words
for text in comments:
    cleaned_data += [clean_data(word_tokenize(text), stop_words)]

### DATA EXPLORATION

### EXPERIMENT 1 : USING LDA to classify documents into different topics and finding topic-keyword distribution

#### Create a dictionary from data , then convert to bag-of-words corpus and save the dictionary and corpus for future use

In [21]:
dictionary = corpora.Dictionary(cleaned_data)
corpus = [dictionary.doc2bow(text) for text in cleaned_data]

In [22]:
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')

#### Fixing the number of topics to 3 
##### Why 3?
##### We are already aware that all the posts relate to food and cooking. Also we have assumed that all the comments on a post will be of the same topic as the post and not taken the comments for our analysis. 

In [23]:
NUM_TOPICS = 3
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model_Food.gensim')

In [25]:
topics = ldamodel.print_topics(num_words=6)
for topic in topics:
    print(topic)

(0, '0.007*"food" + 0.006*"eat" + 0.006*"burger" + 0.005*"post" + 0.004*"meal" + 0.004*"country"')
(1, '0.006*"recipe" + 0.006*"drink" + 0.005*"cheese" + 0.004*"eat" + 0.004*"add" + 0.004*"post"')
(2, '0.012*"food" + 0.011*"pizza" + 0.009*"recipe" + 0.008*"buy" + 0.007*"/ck/" + 0.006*"time"')


### Topic 1 includes words like "burger", "country", "cream". This sounds like a topic related to fast food.
### Topic 2 includes words like "meat", "butter". This is a topic which discusses non-veg and soem receipe with butter
### Topic 3 includes words like "recipe", "pizza", "buy",. This sounds like a discussion about a pizza receipe or a drink receipe

<h1><center>VISUALIZE THE TOPIC KEYWORDS</center></h1>

#### Saliency : a measure how mcuh the term tells you about the topic
#### Relevance : A weighted average of the probability of the word given the topic and the word given the topic normalized by the probability of the topic
#### The size of the bubble shows the importance of the topics relative to the data

#### We can view the frequency of the top words in a given topic by howevering over the topic

In [26]:
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('model_Food.gensim')

import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

Inference : 
1. Topic 1,2 and 3 are far apart in intertopic distance map showing the large semantic differences between the topics
2. All posts discussing Pizza and breakfast are in topic 1
3. Topic 2 mostly has non-vegetarians discussing chicken and meat
4. TOpic 3 discusses burgers, service and gourmet

<h1><center>COMPUTING TF_IDF</center></h1>

In [525]:
# Calculating Document Frequency. 
# Iterating through all words in all the documents and storing the doc id for each word.
# Creating a set if the word dosen't have a set else add to the set 

DF = {}

for i in range(N):
    tokens = cleaned_data[i]
    for w in tokens:
        try:
            DF[w].add(i)
        except:
            DF[w] = {i}

    tokens = cleaned_data[i]
    for w in tokens:
        try:
            DF[w].add(i)
        except:
            DF[w] = {i}
for i in DF:
    DF[i] = len(DF[i])

In [526]:
def doc_freq(word):
    c = 0
    try:
        c = DF[word]
    except:
        pass
    return c

In [527]:
# Calculating TF_IDF
def calc_tf_idf(cleaned_data):
    tf_idf = {}
    doc = 0
    N = len(cleaned_data)
    for i in range(N):
        tokens = cleaned_data[i]
        counter = Counter(tokens + cleaned_data[i])
        words_count = len(tokens + cleaned_data[i])
        for token in np.unique(tokens):
            tf = counter[token]/words_count
            df = doc_freq(token)
            idf = np.log(N+1/(df+1))
            tf_idf[doc, token] = tf*idf
        doc += 1 
    return tf_idf

In [528]:
# Finding the topic assigned to each document by LDA algorithm

def topic_assigned(ldamodel):
    checker = []
    topic_word_map = []
    for post in range(len(cleaned_data)):
        ques_vec = dictionary.doc2bow(cleaned_data[post])
        topic_vec = max(ldamodel[ques_vec],key=lambda item:item[1])
        train_classification = [comments[post],topic_vec[0]+1,topic_vec[1]]
        topic_word_map += [[cleaned_data[post],topic_vec[0]+1]]
        checker += [train_classification]
    return topic_word_map

In [531]:
#Finding word-topic mapping : Variable name : Topic_1,...
# Document-Topic-Word-Score mapping : Variable name : word_tf_idf_1 provides [doc_no,topic_no,word,score]
def get_mapping(topic_word_map, tf_idf):
    word_tf_idf_1,word_tf_idf_2,word_tf_idf_3 = [],[],[]
    Topic_1,Topic_2,Topic_3= [],[],[]
    for i in range(len(topic_word_map)):
        if(topic_word_map[i][1]==1):
            for idx in range(len(topic_word_map[i][0])):
                doc_no = i
                topic_no = 1
                word = topic_word_map[i][0][idx]
                score = str(round(tf_idf[(i,topic_word_map[i][0][idx])],2))
                word_tf_idf_1 += [[doc_no,topic_no,word,score]]
                Topic_1 += [word]
        elif(topic_word_map[i][1]==2):
            for idx in range(len(topic_word_map[i][0])):
                doc_no = i
                topic_no = 2
                word = topic_word_map[i][0][idx]
                score = str(round(tf_idf[(i,topic_word_map[i][0][idx])],2))
                word_tf_idf_2 += [[doc_no,topic_no,word,score]]
                Topic_2 += [word]
        else:
            for idx in range(len(topic_word_map[i][0])):
                doc_no = i
                topic_no = 3
                word = topic_word_map[i][0][idx]
                score = str(round(tf_idf[(i,topic_word_map[i][0][idx])],2))
                word_tf_idf_3 += [[doc_no,topic_no,word,score]]
                Topic_3 += [word]
    return word_tf_idf_1,word_tf_idf_2,word_tf_idf_3,Topic_1,Topic_2,Topic_3

                            
#print(word_tf_idf)    

In [554]:
# Finds tf-idf score for each word assigned to a topic (takes average score for 3items appearing multiple times in a topic)
def find_score(mapping):
    word = {}
    word_score = {}
    for i in mapping:
        if i[2] not in word.keys():
            word[i[2]] = [i[3]]
        else:
            word[i[2]].append(i[3])

    for key in word.keys():
        if(len(word[key])>1):
            score_sum = 0
            for idx in range(len(word[key])):
                score_sum += float(word[key][idx])
            word_score[key] = float(round(score_sum/len(word[key]),2))
            #print(key,round(score_sum/len(word[key]),2))
        else:
            word_score[key] = float(word[key][0])
    return word_score

scores = find_score(word_tf_idf_3)    

<h1><center>USING TF_IDF ALONG WITH TOPICS FOUND USING LDA TO COME UP WITH INTUITONS ABOUT THE TOPICS</center></h1>

In [555]:
# Calculating TF-IDF scores for the documents
tf_idf_scores = calc_tf_idf(cleaned_data)

# Find topic assigned to each document by LDA algorithm
lda_topics = topic_assigned(ldamodel)

# Get average TF-IDF scores for words in each topic
word_tf_idf_1,word_tf_idf_2,word_tf_idf_3,Topic_1,Topic_2,Topic_3 = get_mapping(topic_word_map, tf_idf_scores)
top1_avg_tfidf = find_score(word_tf_idf_1)
top2_avg_tfidf = find_score(word_tf_idf_2)
top3_avg_tfidf = find_score(word_tf_idf_3)

# Sort the topics and tf_idf scores in descending order
sorted_top1 = dict(sorted(top1_avg_tfidf.items(), key=operator.itemgetter(1),reverse=True))
sorted_top2 = dict(sorted(top2_avg_tfidf.items(), key=operator.itemgetter(1),reverse=True))
sorted_top3 = dict(sorted(top3_avg_tfidf.items(), key=operator.itemgetter(1),reverse=True))


In [558]:
sorted_top1

{'Mac': 2.49,
 'egg': 2.49,
 'always': 2.49,
 'Hot': 2.49,
 'Dog': 2.49,
 'Caduceus': 2.49,
 'Cellars': 2.49,
 'redbulls': 2.49,
 'hungry': 2.49,
 'man': 2.49,
 'post': 2.06,
 'Look': 1.66,
 'Great': 1.66,
 'Taste': 1.66,
 'plant': 1.66,
 'burger': 1.66,
 'sell': 1.66,
 'Staub': 1.66,
 'Le': 1.66,
 'Creuset': 1.66,
 '/menus/': 1.66,
 'boys': 1.66,
 'Pls': 1.66,
 'check': 1.66,
 'rate': 1.66,
 'cheese': 1.34,
 'many': 1.34,
 'spice': 1.29,
 'drinking': 1.25,
 'fine': 1.25,
 'evening': 1.25,
 'Filipino': 1.25,
 'dish': 1.25,
 'Anyone': 1.25,
 'ever': 1.25,
 'smoke': 1.25,
 'MSG': 1.25,
 'anyone': 1.25,
 'else': 1.25,
 'hooter': 1.25,
 'Mother': 1.25,
 'Day': 1.25,
 'Russians': 1.25,
 'Americans': 1.25,
 'vodka': 1.25,
 'sugar': 1.11,
 'tropical': 1.0,
 'theme': 1.0,
 'w/': 1.0,
 'malibu': 1.0,
 'sake': 1.0,
 'collapse': 1.0,
 'restaurant': 1.0,
 'chain': 1.0,
 'begunhttps': 1.0,
 '//www.sandiegoville.com/2020/05/souplantation-sweet-tomatoes-will-not-reopen.html': 1.0,
 'Rate': 1.0,
 'bre

In [559]:
sorted_top2

{'check': 2.49,
 'video': 2.49,
 'budgie': 2.49,
 'flaw': 2.49,
 'BEHOLD': 2.49,
 'MAN': 2.49,
 'Find': 2.49,
 'Lazeez': 2.49,
 'chad': 2.49,
 'Chicken': 2.49,
 'Royale': 2.49,
 'spiceracks': 2.49,
 'Since': 1.66,
 'previous': 1.66,
 'successful': 1.66,
 'banana': 1.66,
 'pancakes': 1.66,
 'type': 1.66,
 'guilty': 1.66,
 'pleasure': 1.66,
 'Girl': 1.66,
 'Post': 1.36,
 'eat': 1.34,
 'choccy': 1.25,
 'tolerate': 1.25,
 'cooky': 1.25,
 'serious': 1.25,
 '*tastes': 1.25,
 'smell': 1.25,
 'cat': 1.25,
 'piss*': 1.25,
 'keep': 1.25,
 'come': 1.25,
 'cupboard': 1.25,
 'expire': 1.25,
 'tuna': 1.11,
 'big': 1.09,
 'name': 1.08,
 'north': 1.0,
 'american': 1.0,
 'equivalent': 1.0,
 'Strong': 1.0,
 'Zero': 1.0,
 'tf': 1.0,
 '’': 1.0,
 'subway': 1.0,
 'copycat': 1.0,
 'tuna.Not': 1.0,
 'anchovies.It': 1.0,
 '/DEENZ/': 1.0,
 'oven': 0.98,
 'bread': 0.96,
 'cheese': 0.94,
 'difficult': 0.93,
 'popular': 0.9,
 'roast': 0.83,
 'potato': 0.83,
 'yellow': 0.83,
 'burger': 0.83,
 'sear': 0.83,
 'nicely

In [560]:
sorted_top3

{'bun': 4.99,
 'wrong': 4.99,
 'Thoughts': 4.99,
 'Favorite': 2.49,
 'donut': 2.49,
 'Seafood': 2.49,
 'gross': 2.49,
 'retarded': 2.49,
 'brilliant': 2.49,
 'sauerkraut': 1.66,
 'ITT': 1.66,
 'moment': 1.66,
 'unsubscribed': 1.66,
 'flavour': 1.66,
 'crisp': 1.66,
 'sandwich': 1.66,
 'cricket': 1.66,
 'Steve': 1.25,
 'brand': 1.25,
 'Fiestaware': 1.25,
 'chocolate': 1.25,
 'chip': 1.25,
 'cookie': 1.25,
 'Wendy': 1.25,
 'Mcdonalds': 1.25,
 'finna': 1.25,
 'slice': 1.25,
 'za': 1.25,
 'sawmill': 1.25,
 'sesame': 1.25,
 'steak': 1.25,
 '/sip/': 1.25,
 'generalwhat': 1.25,
 'sip': 1.25,
 'lately': 1.25,
 'Liquor': 1.25,
 'store': 1.25,
 'owner': 1.25,
 'hate': 1.04,
 'favorite': 1.03,
 'Post': 1.0,
 'strong': 1.0,
 'fact': 1.0,
 'shakey': 1.0,
 'cheese': 1.0,
 'frustrate': 1.0,
 'chew': 1.0,
 'unhealthy': 1.0,
 'spicy': 1.0,
 'noodle': 1.0,
 'yeah': 1.0,
 'bro': 1.0,
 'clean': 1.0,
 'fair': 1.0,
 'recipe': 0.93,
 'miss': 0.92,
 'burger': 0.91,
 'pasta': 0.88,
 'taste': 0.86,
 'ever': 0.8

<h1><center>INFERENCE</center></h1>
The hot topics in Food and Cooking have been unearthed from the posts on the website. Let's look at some main topics and the posts falling into it. : 

### 1. Topic 1 discusses Mac, egg , Hot, Dog, Burgers and some Russian, American, Filipino dish

   <font color='blue'>Some instances from documents assigned to this topic - Mac and cheese thread, Why do eggs always make me have to shit?, Can Russians do anything right? Americans do vodka better</font>
   
   
### 2. Topic 2 discusses some chicken, banana and pancakes

   <font color='blue'> Some instances from documents assigned to this topic - are pancakes a type of bread?, Are bananas really that difficult to eat?</font>
   
   
### 3. Topic 3 is discussing donuts and seafood, Mcdonalds and cookies among other things 

   <font color='blue'>Some instances from documents assigned to this topic - Seafood is gross, Coronachan turned out was the real shit >air travel and international travel in general now outlawed >you are given one final one way ticket trip to the country of your choice >you will have to live in this country, eventually integrate into its culture and only eat its food until you die>only remnants of other countries are already established chains like McDonalds and KFCWell /ck/ which country would you choose to live the rest of your days in?</font>


<h1><center>NEXT STEPS</center></h1>
1. We are getting a sense of what could be in the different topics looking at the highest TF-TDF scorers in each topic.

2. To get better intution on the details of the topics, we should include bi-gram and tri-gram information as features.