<h1><center>LITERATURE</center></h1>

### Ideas explored in this notebook:
1. Extracted data from a web API in JSON format
2. Used the Python library Beautiful Soup to strip HTML markup from the comment text in the JSON file
3. Pulled out the latest posts and their comments along with the timestamps, and converted the original UNIX timestamps to a readable format
4. Cleaned the data for analysis: removed punctuation, context-specific stop words and unicode characters
5. Performed tokenization and lemmatization to find root words as per the context, and created the corpus
6. Used the TF-IDF vectorizer to convert each document in the corpus into a sparse row of TF-IDF features
7. Performed singular value decomposition (SVD) on the TF-IDF feature matrix
8. Based on the singular values and singular vectors, found the words/phrases belonging to each topic
9. Derived inferences


In [2]:
import requests
import json
from pprint import pprint
from datetime import datetime as dt
from bs4 import BeautifulSoup
import datetime
import string
import pandas as pd
import numpy as np
from collections import Counter

### NLTK imports
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

### Gensim imports
import gensim

### SKLEARN imports
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

<h1><center>DATA EXTRACTION</center></h1>

In [3]:
# Pulling data for Category : Literature (/lit/)
# r = requests.get('https://a.4cdn.org/lit/catalog.json')
# r = r.json()

# pprint(r)

In [4]:
#save into a JSON file
# import json
# import re
# with open('Literature.json', 'w') as fout:
#     json.dump(r , fout)

In [6]:
# Read from a JSON file
with open(r"Literature.json", "r") as read_file:
    data = json.load(read_file)
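
The three cells above (fetch, save, read back) can be folded into one cache-aware helper. This is only a minimal sketch: `load_or_fetch` and `fake_fetch` are hypothetical names, and the payload is made up so that the example runs without network access. In the notebook, the fetch callable would presumably wrap the `requests.get(...).json()` call.

```python
import json
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return cached JSON from `path`; if the file is missing, call `fetch()` and cache the result."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    data = fetch()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    return data


def fake_fetch():
    # Hypothetical stand-in for the real API call; the payload is invented.
    return [{"page": 1, "threads": [{"no": 1, "com": "hello", "time": 0}]}]


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "Literature.json")
    first = load_or_fetch(cache, fake_fetch)    # file missing: fetches and writes the cache
    second = load_or_fetch(cache, lambda: [])   # file present: fetch is never called
    print(first == second)
```

Keeping the download behind a cache means the analysis can be rerun on the same snapshot even after the live catalog has changed.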

In [7]:
# More exploration
# Pulling the comments and replies to the comments from the JSON file
comments = []
date_time = []
for i in range(len(data)):
    for j in range(len(data[i]['threads'])):
        if 'com' in data[i]['threads'][j].keys() :
            one_comm = data[i]['threads'][j]['com']
            date_time += [datetime.datetime.fromtimestamp(data[i]['threads'][j]['time']).strftime("%B %d, %Y")]
            # Name the parser explicitly to avoid the "no parser specified" warning
            soup1 = BeautifulSoup(one_comm, "html.parser")
            comments += [soup1.get_text()]
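
The per-thread work in the loop above (strip the HTML, convert the UNIX timestamp) can be sketched with the standard library alone. Two assumptions are baked in: `html.parser.HTMLParser` is used as a stand-in for Beautiful Soup, and the timestamp is read as UTC, whereas `datetime.datetime.fromtimestamp` without a `tz` argument uses the machine's local time zone. `thread_to_record` and the sample thread are hypothetical.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags (stdlib stand-in for BeautifulSoup.get_text)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def thread_to_record(thread):
    """Return (date_string, plain_text) for one thread dict, or None if it has no 'com' field."""
    if "com" not in thread:
        return None
    parser = TextExtractor()
    parser.feed(thread["com"])
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Assumption: interpret the timestamp as UTC rather than local time.
    when = datetime.fromtimestamp(thread["time"], tz=timezone.utc)
    return when.strftime("%B %d, %Y"), text


sample = {"com": "Has cinema <br>surpassed the novel?", "time": 1589025600}
print(thread_to_record(sample))
```

Fixing the time zone makes the derived dates reproducible across machines, which matters when a post falls close to midnight.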

In [8]:
print(date_time[0])
print(comments[0])

October 14, 2018
/lit/ is for the discussion of literature, specifically books (fiction & non-fiction), short stories, poetry, creative writing, etc. If you want to discuss history, religion, or the humanities, go to /his/. If you want to discuss politics, go to /pol/. Philosophical discussion can go on either /lit/ or /his/, but those discussions of philosophy that take place on /lit/ should be based around specific philosophical works to which posters can refer.Check the wiki, the catalog, and the archive before asking for advice or recommendations, and please refrain from starting new threads for questions that can be answered by a search engine./lit/ is a slow board! Please take the time to read what others have written, and try to make thoughtful, well-written posts of your own. Bump replies are not necessary.Looking for books online? Check here:Guide to #bookzhttps://www.geocities.ws/prissy_90/Media/Texts/BookzHelp19kb.htmBookzzhttp://b-ok.org/Recommended Literaturehttp://4chanli

In [9]:
# The first comment is from Oct 2018 and is the guidelines for using the forum. Removing it for the current data analysis.
comments.pop(0)
date_time.pop(0)

'October 14, 2018'

In [9]:
len(comments)

144

### Observations:
1. We have extracted 144 posts from the Literature section of the website.
2. The comments were made on May 9th, 2020

### Assumptions :
We assume that the comments below each post are on the same topic as the post itself.

<h1><center>DATA PREPARATION/CLEANING</center></h1>

### 1. CONVERTING WORDS TO THEIR BASE FORMS
### 2. DATA CLEANING
##### --> Removing punctuation, adding more context-related stop words for removal, removing unicode characters

In [10]:
def clean_data(tokens, stop_words = ()):
    """Lemmatize POS-tagged tokens, then drop punctuation and stop words."""
    lemmatizer = WordNetLemmatizer()  # create once, not once per token
    cleaned_tokens = []

    for token, tag in pos_tag(tokens):
        # Map the Penn Treebank tag to a WordNet POS: noun, verb, else adjective
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token)
    return cleaned_tokens


In [11]:
cleaned_data = []
# stop_words.txt: one entry per line; only the text before the first comma is used
with open("stop_words.txt", "r") as stop_file:
    extra_stop_words = [line.split(",")[0] for line in stop_file.read().splitlines()]
stop_words = stopwords.words('english')
newStopWords = ['/g/','n\'t','\'s','\'\'','``','would','get','like','use','one','\'m','http','n','0', 'thus','x','1','say','good','much','want','go','run','need','new','even','shit','fuck']
stop_words += newStopWords
stop_words += extra_stop_words
for text in comments:
    cleaned_data.append(clean_data(word_tokenize(text), stop_words))
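
As far as the code shown goes, `clean_data` handles punctuation and stop words but does not strip non-ASCII characters. The sketch below shows one possible way to do that filtering with the standard library only. Lemmatization is deliberately left out, and `strip_noise` is a hypothetical helper, not part of the notebook.

```python
import string


def strip_noise(tokens, stop_words=()):
    """Drop non-ASCII characters, punctuation-only tokens, and stop words."""
    stop = {word.lower() for word in stop_words}
    cleaned = []
    for token in tokens:
        # Remove non-ASCII characters (the "unicode characters" step).
        token = token.encode("ascii", errors="ignore").decode("ascii")
        # Skip tokens that are now empty or consist only of punctuation.
        if not token or all(ch in string.punctuation for ch in token):
            continue
        if token.lower() in stop:
            continue
        cleaned.append(token)
    return cleaned


tokens = ["The", "café", "…", "books", "!!", "like", "Taoism"]
print(strip_noise(tokens, stop_words=["the", "like"]))
```

Dropping characters rather than whole tokens keeps words such as "café" in a truncated form; discarding the token entirely would be an equally defensible choice.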

<h1><center>DATA EXPLORATION</center></h1>

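## TF-IDF VECTORIZING
Using scikit-learn's TF-IDF vectorizer to convert each document in the corpus into a sparse row of TF-IDF features over unigrams, bigrams and trigrams. Note that the vectorizer below is fitted on the raw `comments`, with the stop-word list passed to it directly, rather than on the lemmatized `cleaned_data`.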
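
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on pre-tokenized documents. It follows what I understand to be scikit-learn's default formula (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), but it handles unigrams only and is an illustration, not the library's implementation. The toy documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns (vocabulary, rows) with L2-normalized TF-IDF rows."""
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = [["books", "taoism"], ["books", "philosophy"], ["cinema", "novel", "books"]]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the first toy document, "taoism" receives a higher weight than "books" because "books" occurs in every document and therefore has the smallest IDF.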

In [12]:
vectorizer = TfidfVectorizer(stop_words=stop_words,
                                 use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(comments)
#print(vectorizer.get_feature_names())

In [13]:
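# get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
vec = vectorizer.get_feature_names()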
len(vec)

7189

In [14]:
print(X[0])

  (0, 3413)	0.3046444355243244
  (0, 817)	0.3046444355243244
  (0, 6221)	0.3046444355243244
  (0, 2376)	0.3046444355243244
  (0, 3412)	0.3046444355243244
  (0, 816)	0.3046444355243244
  (0, 6220)	0.3046444355243244
  (0, 3673)	0.2324118217657168
  (0, 2373)	0.25181235460358126
  (0, 3407)	0.26467850914370317
  (0, 815)	0.3046444355243244
  (0, 6219)	0.26467850914370317


<h1><center>SVD decomposition of TF-IDF feature vector</center></h1>

Feature matrix : X, an m × n matrix where m is the number of documents and n is the number of terms

Process : We will perform SVD of the matrix X
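
For reference, the rank-$k$ truncated SVD of the $m \times n$ matrix $X$ can be written as follows. The notation ($U_k$, $\Sigma_k$, $V_k$) is mine and not the notebook's; in this notebook $m = 144$, $n = 7189$ and $k = 3$.

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k}
$$

$$
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0, \qquad
V_k^{\top} V_k = I_k
$$

$$
X V_k = U_k \Sigma_k
$$

The rows of $V_k^{\top}$ are what `TruncatedSVD` exposes as `components_` (one row per topic, one column per term), and $X V_k$ gives each document's coordinates in the $k$-dimensional topic space.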

In [15]:
# Feature Vector
X.shape

(144, 7189)

## Why SVD for finding the topics?
### The singular vectors computed are orthogonal to each other and point in the directions that capture the most variation in X. We need our final topics to be unrelated to each other and to cover the maximum range of discussions.

In [34]:
lsa = TruncatedSVD(n_components=3, n_iter=100)
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=3, n_iter=100,
             random_state=None, tol=0.0)

In [40]:
lsa.singular_values_
# The biggest singular value explains the maximum variance

array([1.49689484, 1.33859437, 1.31893058])

In [41]:
lsa.explained_variance_

array([0.01224463, 0.0100076 , 0.01202409])

In [29]:
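# components_ holds V^T: one row per component (topic), one column per term.
# The (100, 7189) output below appears to come from an earlier run with n_components=100.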
lsa.components_.shape

(100, 7189)

In [19]:
for i, comp in enumerate(lsa.components_): 
    print(i,comp)

0 [0.00045709 0.00045709 0.00045709 ... 0.00018687 0.00018687 0.00018687]
1 [0.0017405  0.0017405  0.0017405  ... 0.00063329 0.00063329 0.00063329]
2 [-0.00125716 -0.00125716 -0.00125716 ... -0.00018194 -0.00018194
 -0.00018194]


In [18]:
vectorizer.get_feature_names()

['011',
 '011 lit',
 '011 lit ing',
 '10',
 '10 works',
 '10 works period',
 '12',
 '12 year',
 '12 year old',
 '15290943the',
 '15290943the next',
 '15290943the next settle',
 '15k',
 '1wvc5f4up2g8klfe',
 '1wvc5f4up2g8klfe 8s5wlylkp5oegtn4ctifggm2xs4',
 '1wvc5f4up2g8klfe 8s5wlylkp5oegtn4ctifggm2xs4 edit',
 '2014',
 '20th',
 '20th century',
 '20th century writers',
 '21st',
 '21st century',
 '21st century late',
 '31',
 '31 page',
 '31 page previews',
 '3m3l4vizv',
 '3m3l4vizv uhere',
 '3m3l4vizv uhere neat',
 '400',
 '400 absolute',
 '400 absolute essentials',
 '400 cigarettes',
 '400 cigarettes week',
 '45',
 '45 sunday',
 '45 sunday come',
 '80',
 '80 cigarettes',
 '80 cigarettes day',
 '8s5wlylkp5oegtn4ctifggm2xs4',
 '8s5wlylkp5oegtn4ctifggm2xs4 edit',
 '8s5wlylkp5oegtn4ctifggm2xs4 edit usp',
 '90',
 '90 95',
 '90 95 agents',
 '90 collected',
 '90 collected fictions',
 '95',
 '95 agents',
 '95 agents female',
 '9mm',
 '9mm automaticw',
 '9mm automaticw mean',
 'a11t',
 'a11t kmru',

In [71]:
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print("Topic %d:" % i )
    for term in sortedTerms:
        print(term[0])
    print (" ")
# These words are the most important features for a topic

Topic 0:
books
read
books taoism
taoism
books read
books read kids
read kids
kids
philosophy
help
 
Topic 1:
start
lit
philosophy
book
start philosophy
greatest
time
reading
last
cinema
 
Topic 2:
start
philosophy
start philosophy
philosopher
australia
australia philosophy
consider greatest
consider greatest philosopher
country
country start
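
One caveat about the ranking above: mathematically, the sign of a singular vector is arbitrary (both v and -v are valid), and the third component printed earlier has negative entries. Sorting by raw weight can therefore hide terms with large negative loadings. The sketch below ranks by magnitude as an alternative; `top_terms` is a hypothetical helper and the weights are made up.

```python
def top_terms(terms, weights, k=10, by_abs=False):
    """Return the k terms with the largest weights, or the largest |weights| if by_abs is True."""
    if by_abs:
        key = lambda pair: abs(pair[1])
    else:
        key = lambda pair: pair[1]
    ranked = sorted(zip(terms, weights), key=key, reverse=True)
    return [term for term, _ in ranked[:k]]


# Made-up weights for illustration only.
terms = ["books", "cinema", "philosophy", "taoism"]
weights = [0.10, -0.70, 0.30, 0.05]
print(top_terms(terms, weights, k=2))               # ranks by signed weight
print(top_terms(terms, weights, k=2, by_abs=True))  # ranks by magnitude
```

Whether signed or absolute ranking is more appropriate depends on how the topics are meant to be read; comparing both lists is a cheap sanity check.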
 


<h1><center>INFERENCE</center></h1>
The hot topics in Literature have been unearthed from the posts on the website. The main topics are listed below (Topics 1 to 3 here correspond to Topic 0 to Topic 2 in the output above):

1. Topic 1 discusses children's books and Taoism (Chinese philosophy)
   <font color='blue'>Some instances from documents assigned to this topic - What are the best books on Taoism</font>


2. Topic 2 discusses general philosophy, books and cinema
   <font color='blue'>Some instances from documents assigned to this topic - Has cinema surpassed the novel?, Humanism has changed drastically since the 17th century, only someone who can't close read novels or engage with philosophy would come to this conclusion. Many modernist writers were immensely influenced by early cinema</font>


3. Topic 3 discusses Australian philosophy and the greatest philosopher
<font color='blue'>Some instances from documents assigned to this topic - I thought he was British. But yeah, John Finnis is likely the greatest Australian philosopher.</font>


<h1><center>NEXT STEPS</center></h1>
1. Tag every document to a topic (singular vector) using a nearest-neighbor method

2. We can see that the topics discussed are mostly Chinese and Australian philosophy. I would like to look at the demographic information of the users.
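
A minimal sketch of the first step, under the assumption that "nearest" means the component onto which a document has the largest absolute projection. Other choices, such as cosine similarity in the reduced space, would be equally reasonable. `assign_topics` is a hypothetical helper and all numbers are made up; with the fitted model, the projections would presumably come from `lsa.transform(X)`.

```python
def assign_topics(doc_vectors, components):
    """For each document, return the index of the component with the largest |projection|."""
    labels = []
    for doc in doc_vectors:
        # Absolute dot product of the document with each component.
        scores = [abs(sum(d * c for d, c in zip(doc, comp))) for comp in components]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy data: 3 documents over 4 terms, 2 components.
components = [[0.9, 0.4, 0.0, 0.1],
              [0.0, 0.1, 0.8, 0.6]]
doc_vectors = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.7, 0.7],
               [0.1, 0.9, 0.1, 0.0]]
print(assign_topics(doc_vectors, components))
```

Counting how many documents land on each label would also show whether a topic is driven by many posts or by a single long one.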
