### Web Crawling Script for Games Writer Kirk Hamilton ###

Step 1: import all necessary packages

In [None]:
import re
import urllib
import requests
from bs4 import BeautifulSoup
import pprint
import pandas as pd

Step 2: download a sample webpage.
You can also save the HTML page to your computer and open it in a text editor to inspect its content.

In [None]:
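# Note: despite its name, `url` holds the Response object, not the URL string
url = requests.get('https://kotaku.com/detroit-become-human-the-kotaku-review-1826277408')
url.raise_for_status()  # stop early if the request failed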
#URL Options:
#https://kotaku.com/detroit-become-human-the-kotaku-review-1826277408
#https://kotaku.com/hollow-knight-the-kotaku-review-1827367425
#https://kotaku.com/destiny-2-the-kotaku-review-1818530629
#https://kotaku.com/no-mans-sky-the-kotaku-review-1785383774
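
The note about saving the page to disk can be sketched as follows. This is a minimal example, not part of the original script: the helper name `save_html` and the file name `sample_page.html` are made up for illustration, and the sample bytes stand in for the downloaded page content so the cell runs without a network connection.

```python
import tempfile
from pathlib import Path

def save_html(content: bytes, path: Path) -> Path:
    """Write raw page bytes to disk and return the path that was written."""
    path.write_bytes(content)
    return path

# Stand-in bytes; in the notebook this would be the downloaded page content
sample = b"<html><head><title>Sample</title></head><body></body></html>"

# Hypothetical output location inside the system temp directory
target = Path(tempfile.gettempdir()) / "sample_page.html"
saved = save_html(sample, target)
print(saved, saved.stat().st_size, "bytes")
```

Writing bytes rather than text avoids guessing the page encoding; the saved file can then be opened in any text editor.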

Step 3: use BeautifulSoup to parse the webpage and extract the article text. The review body sits inside the div whose class attribute is "post-content entry-content js_entry-content"; each paragraph of the review is a p tag nested in that div.

In [None]:
soup = BeautifulSoup(url.content, 'html.parser')
title = soup.title.string
print(title)

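# Match the article body div by its full class string
# (a trailing space in this string can prevent the match)
divTag = soup.find_all('div', attrs={"class": "post-content entry-content js_entry-content"})
body = []
for tag in divTag:
    # Collect the text of every paragraph inside the article body
    for element in tag.find_all("p"):
        body.append(element.text)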

print(body)
# Join the paragraphs into one string; str(body) would keep the list's brackets and quotes
body = ' '.join(body)
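
To see the extraction step without hitting the live site, the same idea can be tried on a small inline page. This is only a sketch: `SAMPLE_HTML` is hand-written, the helper name `extract_paragraphs` is hypothetical, and matching on the single class `js_entry-content` is an assumption about which class is the most stable one to target.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a downloaded page (not real Kotaku markup)
SAMPLE_HTML = """
<html><head><title>Sample Review</title></head>
<body>
  <div class="post-content entry-content js_entry-content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
  <div class="sidebar"><p>Unrelated text.</p></div>
</body></html>
"""

def extract_paragraphs(html, css_class="js_entry-content"):
    """Return the text of every <p> inside divs that carry css_class."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    # class_ with a single name matches any div that has it among its classes
    for div in soup.find_all("div", class_=css_class):
        for p in div.find_all("p"):
            paragraphs.append(p.get_text(strip=True))
    return paragraphs

paragraphs = extract_paragraphs(SAMPLE_HTML)
body_text = " ".join(paragraphs)
print(body_text)
```

Matching one class name is more forgiving than matching the whole class string, since the match still works if the site adds or reorders other classes on the div.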

Step 4: split text into individual words

In [None]:
words = body.split()
words = [element.lower() for element in words]
#print(words)
print(len(words))
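
Splitting on whitespace leaves punctuation attached to words, so "game," and "game" count as different words. A regular expression is one possible way around this. The sketch below is an illustration, not the notebook's method: the function name `tokenize` and the sample sentence are made up, and the pattern only handles plain ASCII letters and straight apostrophes.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters, allowing internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

# Made-up sample sentence, not taken from any review
sample = "The game isn't subtle. It's big, loud, and earnest!"
tokens = tokenize(sample)
print(tokens)
# ['the', 'game', "isn't", 'subtle', "it's", 'big', 'loud', 'and', 'earnest']
```

Web pages often use curly apostrophes, which this pattern treats as word breaks; the pattern would need extending to cover them.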

Alternatively, use NLTK's tokenizer to split the text into words, then its part-of-speech tagger to label each word. For more details about tagging with NLTK, read the documentation at http://www.nltk.org/book/ch05.html

In [None]:
import nltk
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize(body)
#print(tokens)
tags = nltk.pos_tag(tokens)
#print(tags)
print(tags[0][0], tags[0][1])
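
The tagger returns a list of (token, tag) tuples, which is why `tags[0][0]` is the first word and `tags[0][1]` is its tag. As a small illustration of working with that structure, the sketch below keeps only the nouns. The tagged list is hand-written in the same shape, so no NLTK data download is needed; the sentence and its tags are an assumed example, not tagger output.

```python
# Hand-written sample in the same (token, tag) shape that nltk.pos_tag returns
sample_tags = [
    ("The", "DT"),
    ("androids", "NNS"),
    ("walk", "VBP"),
    ("through", "IN"),
    ("Detroit", "NNP"),
    (".", "."),
]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [token for token, tag in sample_tags if tag.startswith("NN")]
print(nouns)  # ['androids', 'Detroit']
```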

Remove stopwords

In [None]:
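#nltk.download('stopwords')
from nltk.corpus import stopwords
# Use a separate name so the imported stopwords module is not overwritten
# (overwriting it makes this cell fail when it is run a second time)
stop_words = set(stopwords.words('english'))
def removeStopwords(wordlist, stop_words):
    return [w for w in wordlist if w not in stop_words]
words = removeStopwords(words, stop_words)
print(words)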

Count word frequency

In [None]:
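counts = dict()
for word in words:
    counts[word] = counts.get(word, 0) + 1
# sorted() returns a new list, so keep the result instead of discarding it
sorted_counts = sorted(counts.items(), key=lambda item: item[1], reverse=True)
pprint.pprint(sorted_counts)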
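
The standard library's `collections.Counter` does the same counting and can return the most frequent entries directly. This is offered as an alternative sketch, not a replacement for the cell above; the word list is a made-up sample.

```python
from collections import Counter

# Made-up sample; in the notebook this would be the filtered `words` list
sample_words = ["game", "android", "game", "story", "android", "game"]

word_counts = Counter(sample_words)
top_two = word_counts.most_common(2)  # (word, count) pairs, most frequent first
print(top_two)  # [('game', 3), ('android', 2)]
```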


Sort words by frequency

The counting loop above works, but it takes more code and can be slow on long texts.
The following method uses the DataFrame data structure from the pandas package to count the words and sort them by frequency in one call.
The pandas documentation has more details on this data structure:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [None]:
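df = pd.DataFrame(words, columns=['word'])
# value_counts() counts each word and sorts from most to least frequent
x = df["word"].value_counts()
pprint.pprint(x)
filename_pfx = title.split(' ', 1)[0]  # first word of the page title (not used below)
x.to_csv('results.csv', sep=",")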
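
If the CSV should have explicit column headers, one option is to turn the counts into a two-column table before writing. The sketch below assumes the column names `word` and `count` (they are illustrative choices) and writes to an in-memory buffer instead of a file so the result can be inspected directly.

```python
import io
import pandas as pd

# Made-up sample; in the notebook this would be the filtered `words` list
sample_words = ["game", "android", "game", "story", "android", "game"]

freq = pd.Series(sample_words).value_counts()

# Name the index and the values so the table has two labelled columns
table = freq.rename_axis("word").reset_index(name="count")

buffer = io.StringIO()
table.to_csv(buffer, index=False)  # pass a file path instead to write to disk
print(buffer.getvalue())
```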