<h1><center>TV and Film </center></h1>

### Ideas explored in this notebook:

1. Extracted data from a web API in JSON format
2. Used the Python library Beautiful Soup to strip HTML markup from the comment text in the JSON data
3. Pulled out the latest posts and the replies to them, along with their timestamps, and converted the original UNIX timestamps to a readable format
4. Cleaned the data for analysis: removed punctuation, context-specific stop words and non-ASCII (Unicode) characters
5. Performed tokenization and lemmatization to reduce words to their root forms based on their part of speech
6. Explored Latent Dirichlet Allocation (LDA) for topic modelling
7. Recorded the three trending topics

    

In [10]:
# Importing libraries


import requests
import json
from pprint import pprint
from datetime import datetime as dt
from bs4 import BeautifulSoup
import datetime
import string
import pandas as pd
import numpy as np
from collections import Counter
import pickle

### NLTK imports
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

### Gensim imports
import gensim

### SKLEARN imports
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD


# ignoring warnings
import warnings
warnings.filterwarnings('ignore')

<h1><center>DATA EXTRACTION</center></h1>

In [319]:
#Pulling data for Category : Television and Film
r = requests.get('https://a.4cdn.org/tv/catalog.json')
r = r.json()

#pprint(r)

In [320]:
#save into a JSON file
import json
import re
with open('TV_Film.json', 'w') as fout:
    json.dump(r , fout)

In [321]:
# Read from a JSON file
with open(r"TV_Film.json", "r") as read_file:
    data = json.load(read_file)

In [322]:
# More exploration
# Pull the opening comment of each thread and the replies to it from the JSON data
comments = []
date_time = []
for page in data:
    for thread in page['threads']:
        if 'com' not in thread:
            continue  # skip threads without comment text
        # A thread's opening post and its replies share the 'com' and 'time' keys
        for post in [thread] + thread.get('last_replies', []):
            if 'com' in post:
                # Convert the UNIX timestamp to a readable date
                date_time.append(datetime.datetime.fromtimestamp(post['time']).strftime("%B %d, %Y"))
                # Strip HTML markup; naming the parser keeps results consistent across machines
                comments.append(BeautifulSoup(post['com'], 'html.parser').get_text())



In [323]:
len(comments)

697
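The traversal and the timestamp conversion can also be separated into two small helpers, which makes each step easier to check in isolation. The following is a minimal sketch, not the notebook's own code: `iter_posts`, `readable_date` and `sample_catalog` are hypothetical names, the sample data is made up to mimic the shape of the API response, and the date is interpreted in UTC (an assumption; the cell above uses the machine's local time zone). HTML stripping with Beautiful Soup would follow as in the cell above.

```python
from datetime import datetime, timezone


def iter_posts(catalog):
    """Yield (html_comment, unix_time) for each thread and its last replies."""
    for page in catalog:
        for thread in page.get("threads", []):
            if "com" not in thread:
                continue  # mirrors the cell above: threads without text are skipped
            # The opening post and its replies share the 'com' and 'time' keys.
            for post in [thread, *thread.get("last_replies", [])]:
                if "com" in post:
                    yield post["com"], post["time"]


def readable_date(unix_time):
    """Format a UNIX timestamp such as 1588939200 as 'May 08, 2020' (UTC)."""
    return datetime.fromtimestamp(unix_time, tz=timezone.utc).strftime("%B %d, %Y")


# Hypothetical miniature catalog with the same shape as the API response.
sample_catalog = [
    {"threads": [
        {"com": "First <b>post</b>", "time": 1588939200,
         "last_replies": [{"com": "A reply", "time": 1588939260},
                          {"time": 1588939300}]},  # reply without text is skipped
        {"time": 1588939400},                      # thread without text is skipped
    ]}
]

posts = list(iter_posts(sample_catalog))
dates = [readable_date(unix_time) for _, unix_time in posts]
print(posts)
print(dates)
```

Pinning the time zone makes the recorded dates reproducible regardless of where the notebook is run.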

<h1><center>DATA PREPARATION/CLEANING</center></h1>

### 1. CONVERT WORDS TO THEIR BASE FORMS (LEMMATIZATION)

### 2. DATA CLEANING

#### The next step is to clean the data and remove tokens that add no meaning to the analysis. Broadly, this means removing punctuation and stop words.

In [324]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
import string
def clean_data(tokens, stop_words = ()):
    """Lemmatize tokens, then drop punctuation and stop words."""
    lemmatizer = WordNetLemmatizer()  # create once rather than once per token
    cleaned_tokens = []

    for token, tag in pos_tag(tokens):
        # Map the Penn Treebank tag to a WordNet part of speech
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token)
    return cleaned_tokens
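
The function above treats every token that is neither a noun nor a verb as an adjective. A slightly finer mapping is sketched below as one possible alternative, not as the notebook's method: `wordnet_pos` is a hypothetical helper that also recognises adverbs and falls back to noun, which is the default part of speech of `WordNetLemmatizer.lemmatize`.

```python
def wordnet_pos(penn_tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if penn_tag.startswith("NN"):
        return "n"  # noun
    if penn_tag.startswith("VB"):
        return "v"  # verb
    if penn_tag.startswith("RB"):
        return "r"  # adverb
    if penn_tag.startswith("JJ"):
        return "a"  # adjective
    return "n"      # anything else: fall back to noun


# A few example tags and the WordNet letter each one maps to.
for tag in ["NNS", "VBD", "RB", "JJR", "DT"]:
    print(tag, "->", wordnet_pos(tag))
```

Inside `clean_data`, the `if`/`elif`/`else` chain could then be replaced by `pos = wordnet_pos(tag)`.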


#### Since the data comes from the TV and Film section of the website, people are certain to be talking about some show, film or season, so these words will be very frequent. They add no meaning to the analysis, because we want to know which movie, show or season is being discussed. We therefore add these obvious words to the list of stop words.

In [325]:
# Load extra stop words (the first comma-separated field of each line)
with open("stop_words.txt", "r") as stop_file:
    extra_stop_words = [line.split(",")[0] for line in stop_file.read().splitlines()]
cleaned_data = []
stop_words = stopwords.words('english')
newStopWords = ['think','movie','show','make','season','film','watch','people','know','look','thing','could','\'re','\'ve','every','never','end','time','/g/','n\'t','\'s','\'\'','``','would','get','like','use','one','\'m','http','n','0', 'thus','x','1','say','good','much','want','go','run','need','new','even','shit','fuck']
stop_words+=newStopWords
print(len(stop_words))
stop_words+=extra_stop_words
print(len(stop_words))
for text in comments:
    cleaned_data += [clean_data(word_tokenize(text), stop_words)]

225
416
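
Context-specific stop words such as the hand-picked list above can also be suggested from the data itself. The sketch below is an illustration under assumptions, not part of the original analysis: `suggest_stop_words` and its `min_share` threshold are hypothetical, and the sample documents are made up. It flags tokens that occur in at least a given share of the documents.

```python
from collections import Counter


def suggest_stop_words(token_lists, min_share=0.5):
    """Return tokens that appear in at least `min_share` of the documents."""
    n_docs = len(token_lists)
    if n_docs == 0:
        return []
    doc_freq = Counter()
    for tokens in token_lists:
        # Count each token once per document, ignoring case.
        doc_freq.update({token.lower() for token in tokens})
    return sorted(t for t, count in doc_freq.items() if count / n_docs >= min_share)


# Made-up tokenized comments for illustration.
sample_docs = [
    ["movie", "great", "movie"],
    ["watch", "movie"],
    ["Stannis", "movie", "watch"],
    ["kino"],
]
candidates = suggest_stop_words(sample_docs, min_share=0.5)
print(candidates)
```

The suggestions still need a manual review, since a word that is frequent across comments may be exactly the show that is trending.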


#### Removing unicode characters from Data

In [326]:
# Remove non-ASCII (unicode) characters
prepared_data = []
for tokens in cleaned_data:
    ascii_tokens = []  # renamed from `str` to avoid shadowing the built-in
    for token in tokens:
        # Drop every character that cannot be encoded as ASCII
        decoded = token.encode("ascii", "ignore").decode()
        if decoded != '':
            ascii_tokens.append(decoded)
    prepared_data.append(ascii_tokens)
    

146

### Observations:
#### We have extracted 697 comments (the `len(comments)` value reported above) from the TV and Film section of the website.
#### The comments were made on May 8th, 2020

<h1><center>DATA EXPLORATION</center></h1>

### LDA with Gensim
#### Create a dictionary from the data, convert it to a bag-of-words corpus, and save the dictionary and corpus for future use

In [327]:
from gensim import corpora
dictionary_TV = corpora.Dictionary(prepared_data)
corpus_TV = [dictionary_TV.doc2bow(text) for text in prepared_data]
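
To make the bag-of-words representation concrete, here is a small pure-Python sketch of the idea behind the dictionary and `doc2bow`. It is an illustration only: `build_vocabulary` and `to_bow` are hypothetical helpers, the sample tokens are made up, and the integer ids assigned by gensim may differ from the ones produced here.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Represent one document as a sorted list of (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())


# Made-up documents for illustration.
sample_docs = [["stannis", "kino", "stannis"], ["jedi", "kino"]]
vocab = build_vocabulary(sample_docs)
bow_corpus = [to_bow(doc, vocab) for doc in sample_docs]
print(vocab)
print(bow_corpus)
```

Word order is discarded in this representation; only the counts per document remain, which is all LDA uses.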

In [328]:
import pickle
pickle.dump(corpus_TV, open('corpus_TV.pkl', 'wb'))
dictionary_TV.save('dictionary_TV.gensim')

#### Let's try 3 topics

In [329]:
import gensim
NUM_TOPICS = 3
ldamodel = gensim.models.ldamodel.LdaModel(corpus_TV, num_topics = NUM_TOPICS, id2word=dictionary_TV, passes=15)
ldamodel.save('model_TV.gensim')

In [330]:
topics = ldamodel.print_topics(num_words=10)
for topic in topics:
    print(topic)

(0, '0.004*"//www.youtube.com/watch" + 0.004*"year" + 0.004*"/tv/" + 0.003*"girl" + 0.003*"hope" + 0.003*"post" + 0.003*"also" + 0.002*"two" + 0.002*"week" + 0.002*"base"')
(1, '0.018*"--" + 0.005*"first" + 0.003*"Brock" + 0.003*"old" + 0.003*"ever" + 0.003*"Kasady" + 0.003*"though" + 0.003*"anything" + 0.003*"true" + 0.002*"Joe"')
(2, '0.014*"STANNIS" + 0.003*"woman" + 0.003*"character" + 0.003*"kino" + 0.003*"Jedi" + 0.003*"give" + 0.003*"find" + 0.003*"point" + 0.003*"Miquela" + 0.003*"Germany"')


### Topic 1 - YouTube link, girl, week: a YouTube link is being referenced
### Topic 2 - Kasady, Brock, Joe: these characters are trending
### Topic 3 - Stannis, kino, Jedi, Miquela, Germany: TV series and films with these references are being discussed
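
To record the keywords of a topic programmatically, the printed topic strings can be split into (word, weight) pairs. The sketch below is one possible way to do this with a regular expression: `parse_topic` and `TERM_PATTERN` are hypothetical names, and the input is a shortened copy of the third topic string printed above. gensim's `LdaModel.show_topic` returns such pairs directly and is the more robust choice when the model object is available.

```python
import re

# Matches terms of the form 0.014*"STANNIS" in a printed topic string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn '0.014*"STANNIS" + 0.003*"woman"' into [(word, weight), ...]."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


topic_string = '0.014*"STANNIS" + 0.003*"woman" + 0.003*"character"'
pairs = parse_topic(topic_string)
top_words = [word for word, _ in pairs]
print(pairs)
print(top_words)
```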
    

<h1><center>Visualize the topic keywords</center></h1>

#### Saliency: a measure of how much a term tells you about the topics
#### Relevance: a weighted average of the log probability of the word given the topic and the log of that probability normalized by the overall probability of the word
#### Each bubble represents a topic, and the size of the bubble shows the prevalence of that topic in the data

#### We can view the frequency of the top words in a given topic by hovering over the topic
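
As a worked illustration of the relevance score, the sketch below computes λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)) for made-up probabilities. The function name `relevance`, the default `lam=0.6` and the numbers are assumptions for the example; they are not values taken from this model.

```python
import math


def relevance(p_word_given_topic, p_word, lam=0.6):
    """Relevance of a word to a topic for a weight `lam` between 0 and 1.

    lam = 1 ranks words purely by their probability within the topic;
    lam = 0 ranks them purely by how much more likely they are in the topic
    than in the corpus as a whole (their lift).
    """
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Two made-up words with the same in-topic probability:
# a common word (frequent everywhere) and a word specific to the topic.
common = relevance(p_word_given_topic=0.01, p_word=0.01)
specific = relevance(p_word_given_topic=0.01, p_word=0.001)
print(common, specific)
```

With equal in-topic probability, the topic-specific word scores higher whenever λ is below 1, which is why lowering λ in the visualization pushes generic words down the list.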

In [8]:
dictionary = gensim.corpora.Dictionary.load('dictionary_TV.gensim')
corpus = pickle.load(open('corpus_TV.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('model_TV.gensim')

# Note: in pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

<h1><center>Inference</center></h1>
#### 1. All three topics are prevalent: the bubbles are of similar size, which indicates comparable importance
#### 2. The topic model separates the topics well, as shown by the large, non-overlapping bubbles
    
The hot topics in the TV and Film section have been unearthed from the posts on the website. The main topics are:
1. Topic 1 discusses YouTube videos and dinosaurs
2. Topic 2 discusses some characters: Bob, Kasady, etc.
3. Topic 3 discusses Stannis, Miquela, Germany, kino and empire stories
   
    

<h1><center>Next Steps</center></h1>
1. LDA on single words, without bi-grams, tri-grams or longer n-grams, has a drawback: we do not know which phrases are being discussed and can derive very little information from keyword frequencies alone

2. Everyday words such as "us" and "nothing" are reported as important words in a topic, yet they tell us nothing more about it. Weighting each word by TF-IDF would help keep such words from being reported as important for a topic
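
Both next steps can be sketched in pure Python to show the idea. This is a simplified illustration under assumptions, not a finished implementation: `merge_frequent_bigrams`, `inverse_document_frequency`, the `min_count` threshold and the sample tokens are all hypothetical. In practice, gensim's `Phrases` and `TfidfModel` classes cover the same ground.

```python
import math
from collections import Counter


def merge_frequent_bigrams(token_lists, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times into 'a_b' tokens."""
    pair_counts = Counter()
    for tokens in token_lists:
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}

    merged_docs = []
    for tokens in token_lists:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(tokens[i] + "_" + tokens[i + 1])
                i += 2  # consume both halves of the bigram
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


def inverse_document_frequency(token_lists):
    """IDF = log(N / df): tokens present in every document get a weight of 0."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


# Made-up tokenized comments for illustration.
sample_docs = [
    ["star", "wars", "nothing"],
    ["star", "wars", "venom", "nothing"],
    ["nothing", "kino"],
]
merged = merge_frequent_bigrams(sample_docs, min_count=2)
idf = inverse_document_frequency(merged)
print(merged)
print(idf)
```

In this toy example the phrase survives as a single token `star_wars`, while the everyday word `nothing`, which appears in every document, receives an IDF of zero and would drop out of a TF-IDF weighted corpus.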