# **Scraping (part 2) and Text Analysis**

Last week, as you may remember, we worked on web scraping. We did two types of scraping to extract data from the internet. First, we scraped data directly from a Wikipedia page; after that, we talked about getting data from APIs.

Today we're going to finish our work on "scraping" by playing with the Reddit tool we spoke about, which is built on top of the API. Then we'll start learning about the types of text analysis we can do once we're done "scraping".

**SCRAPING PART 2**

First, let's review a few things from last week. This is just an overview; you don't need to remember it in too much detail.

First, we used a package called pandas to play with "dataframes", which are like spreadsheets. Below, see a review of the code that we used to "import" (load) the pandas package, and then the code we used to create a dataframe called data1. We then wrote code to pick out an item in this dataframe, and to "loop" through one column of a dataframe.

In [None]:
#import pandas
import pandas as pd

In [None]:
#create a dataframe called data1
d = {'col1': [1, 2], 'col2': [3, 4]}
data1 = pd.DataFrame(data=d)
print(data1)

In [None]:
#picking out an item in col2 of the dataframe, row 1.  Note that the first row is numbered row 0, so when we pick out the item in row 1, it's actually in the second row

data1["col2"][1]

In [None]:
#loop through all of the rows of the col2 column of the dataframe, and print the items (in range we put 2 because there are two rows)

for i in range(2):
  print(data1["col2"][i])
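
The same counting-from-zero idea works on a plain Python list, which can make it easier to see what the loop is doing. Here is a small sketch (the names `col2`, `second_item` and `items` are made up for this example). It also uses `len()` to count the rows, so we don't have to type the 2 by hand; `range(len(data1))` should work the same way on the dataframe above.

```python
# A column is just a sequence of items, so a plain list behaves similarly.
col2 = [3, 4]  # same values as col2 in data1

# Indexing starts at 0, so index 1 is the second item.
second_item = col2[1]
print(second_item)

# len() gives the number of items, so we don't hard-code the 2.
items = []
for i in range(len(col2)):
    items.append(col2[i])
    print(col2[i])
```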

We also learned how to upload a spreadsheet file into google colab, like this:

In [None]:
#import your file

from google.colab import files
uploaded = files.upload()
print(uploaded)



In [None]:
#put the file in a pandas dataframe by inserting its name, which should have printed out above, into the spot below.

pd.read_excel('FILENAMEGOESHERE.xlsx')

OK, now that we've reviewed that, let's go get spreadsheets of data from the Reddit API using an app built for us to do so.

Find the app here:

https://smm.ncsa.illinois.edu/

**TEXT ANALYSIS**

OK, so now we've got some data.

But what happens when we want to analyze this data? Well, we might just look at it closely or read it for ourselves. This would be easy, for example, if we just had one page of text to look at, but it would start to get hard if we had thousands.

For this reason, we might want to use computational methods of text analysis. Today, we're going to learn how to use some of them with code.

Please note, however, that we probably will not get through all of this today! So we'll start this text analysis work in today's workshop, and then we'll finish it in the next one.

First, let's grab some text to analyze. In honor of the internet, we're going to use the text of the Taylor Swift song "Shake It Off". I'm going to save that text below, in a string called text.


In [None]:
text = "I stay out too late Got nothing in my brain That's what people say That's what people say I go on too many dates But I can't make them stay At least that's what people say That's what people say But I keep cruising Can't stop, won't stop moving It's like I got this music in my mind Saying it's gonna be alright I never miss a beat I'm lightning on my feet And that's what they don't see That's what they don't see Players gonna play, play, play, play, play And the haters gonna hate, hate, hate, hate, hate (haters gonna hate) Baby, I'm just gonna shake, shake, shake, shake, shake I shake it off, I shake it off Heartbreakers gonna break Fakers gonna fake I'm just gonna shake I shake it off, I shake it off I shake it off, I shake it off I, I, I shake it off, I shake it off I, I, I shake it off, shake it off I, I, I shake it off, I shake it off I, I, I shake it off, I shake it off I, I, I shake it off, I shake it off I, I, I, shake it off, I shake it off I, I, I, shake it off, I shake it off"

OK, so there are ways to use Python to analyze text. But it's worth noting, before we do that, that there are now actually certain software programs that do some of these things for you, without your having to code it yourself.

Just like there was a nice little app we used to scrape the Reddit API for us, there are also nice apps for text analysis. Let's look at one here:

https://voyant-tools.org/


OK, now that we're back, let's learn how to do things like this with Python.

In [None]:
text.lower()

In [None]:
text.upper()

In [None]:
text.count("shake")
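
One thing to watch with `count()`: it counts matching pieces of the string, not whole words, and it cares about capital letters. Here's a small sketch on a made-up sentence (the `sample` string and variable names are just for illustration) showing three slightly different ways of counting.

```python
sample = "Shake it off, I shake it off. Milkshakes are great."

# count() is case-sensitive and matches inside longer words too,
# so this finds "shake" and the "shake" inside "Milkshakes", but not "Shake".
exact = sample.count("shake")

# Lowercasing first makes the count ignore capitalization.
any_case = sample.lower().count("shake")

# To count only the whole word, split into words and strip punctuation first.
words = [w.strip(".,") for w in sample.lower().split()]
whole_words = words.count("shake")

print(exact, any_case, whole_words)
```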

In [None]:
# Like the for loop above, we can iterate through the sequence of
# words returned by a method called split(). The split method is going to
# be *very important* for text analysis.
#
# By default, split() splits on whitespace (spaces, tabs, and line
# breaks), which is ideal for splitting a sentence into words. We'll worry
# more about what is returned by split() later, but this gets us started
# with working on words.

for word in text.split():
    print(word)
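
To see what `split()` does with no argument, compare it with `split(" ")`, which splits on single spaces only. This is a small sketch on a made-up string (`messy` is just an example with a double space and a line break in it).

```python
messy = "I  shake it\noff"  # note the double space and the line break

# With no argument, split() treats any run of whitespace as one separator.
default_split = messy.split()

# With " " as the argument, every single space is a separator, so the
# double space leaves an empty string and the line break is not split.
space_split = messy.split(" ")

print(default_split)
print(space_split)
```

For real text, which is often full of line breaks and stray spaces, the no-argument version is usually the one we want.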


In [None]:
# Let's generate some basic data from this very long string
print("Total characters:",len(text))
print("Total word count:",len(text.split()))
print("Paragraph count:",text.count("\n"))
print("Rough sentence count:",text.count("."))

In [None]:
# We now know how to process each word. Let's find long words.
# What is a long word?
long_word = 10
for word in text.split():

    # This is our first *conditional statement*
    # the "if" means if a certain thing is true. if the statement that follows is true, then the code will perform the action on the next line

    if len(word) >= long_word:
        print(word)
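
An `if` can also be paired with an `else`, which says what to do when the statement is not true. Here is a small sketch on a made-up line of lyrics (the names `sample`, `min_length`, `long_words` and `short_count` are just for this example), using a lower cutoff of 9 letters.

```python
sample = "Heartbreakers gonna break and I'm lightning on my feet"
min_length = 9  # our cutoff for a "long" word in this example

long_words = []
short_count = 0
for word in sample.split():
    if len(word) >= min_length:
        # the condition is true: keep the word
        long_words.append(word)
    else:
        # the condition is false: just count it
        short_count += 1

print(long_words)
print(short_count)
```

Keeping the cutoff in a variable means we only have to change it in one place if we decide a long word is something else.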

In [None]:
# Let's begin by loading up some important libraries/packages
import numpy as np
import glob
import nltk

#nltk in particular is very useful for text analysis, and the main package you'll often use

from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

# allow for displaying of graphics
%matplotlib inline

In [None]:
allwords=[]
words = tokenizer.tokenize(text)
for w in words:
    allwords.append(w)

freqdist = FreqDist(allwords)
freqlist = freqdist.most_common()
print(freqlist)
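
If you're curious what is going on under the hood, here is a rough sketch of the same word counting using only tools built into Python: `re.findall` with the same `\w+` pattern, and `Counter` from the `collections` module. This is not how nltk is written, just a stand-in to show the idea, and the `sample` string is made up.

```python
import re
from collections import Counter

sample = "shake shake shake it off, shake it off"

# \w+ grabs runs of letters, digits and underscores, dropping punctuation.
tokens = re.findall(r"\w+", sample)

# Counter tallies how many times each token appears.
counts = Counter(tokens)
top_word = counts.most_common(1)
print(top_word)

# Note that \w+ also breaks words at apostrophes.
apostrophe_tokens = re.findall(r"\w+", "can't stop")
print(apostrophe_tokens)
```

That last part is worth remembering when you read the frequency list: a stray "t" or "s" in the results usually comes from words like "can't" or "that's".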

In [None]:
import seaborn as sns
sns.set_style('darkgrid')
freqdist.plot(20);

In [None]:
# Now let's run nltk part of speech tagger on the tokens:

#normally, when we use the part of speech tagger on tokens, we do this. But notice colab is going to throw an error
#so let's practice reading error messages and dealing with them, what should we do?
pos = nltk.pos_tag(words)
print(pos)

In [None]:
#INSERT YOUR CODE HERE TO FIX THE PROBLEM



In [None]:
#ok, now let's continue

# Now we run the tagger on the tokens:
pos = nltk.pos_tag(words)
print(pos)

In [None]:
nouns = []
for p in pos:
    if p[1] == 'NN':
        nouns.append(p[0])
nounfreq = FreqDist(nouns)
nounfreqlist = nounfreq.most_common()

print(nounfreqlist)
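
Note that 'NN' is only one of the noun tags. In the tag set the nltk tagger uses by default, plural nouns are tagged 'NNS' and proper nouns 'NNP' or 'NNPS', so checking for exactly 'NN' leaves those out. Here is a small sketch of the difference, using a hand-written list of tagged words (the `tagged` list is made up for illustration, not real tagger output).

```python
# Hypothetical (word, tag) pairs in the same shape that pos_tag returns.
tagged = [
    ("players", "NNS"),
    ("gonna", "VBG"),
    ("play", "VB"),
    ("music", "NN"),
    ("Taylor", "NNP"),
    ("beat", "NN"),
]

# Exactly 'NN': singular common nouns only.
only_nn = [word for word, tag in tagged if tag == "NN"]

# Any tag starting with 'NN': also catches plural and proper nouns.
all_nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(only_nn)
print(all_nouns)
```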

**Assignment 1**

In [None]:
#alright, now you have a coding assignment, which we're going to walk through together
#OR leave for next time if we need to
#I want us to scrape a wiki page and then do one piece of text analysis from above to it
#so, we want to walk through these steps...

In [None]:
#first, find the code from last week to scrape a wiki page and run it below, in blocks you create

In [None]:
#second, take the text of the wiki page and give it a name, like "text1"
#do this in the same way that we named the "text" of the Taylor Swift song in the code above. Create a variable called text1 and set it equal to a string of the wiki page text

In [None]:
#do some type of analysis, from above, on the wiki page text, called "text1"

**Assignment 2** (we'll start here next time, most likely)

Analyze the reddit data we collected earlier

First, the easy way: upload your Reddit spreadsheet into Voyant and let's see what we can do
(hint: you might first want to create a new spreadsheet file with just the column of text you want to analyze)

Second, the hard way: upload the Reddit spreadsheet into this coding notebook and analyze it

In [None]:
#group activity, we'll work through this together

**Finally some more advanced text analysis with machine learning**

I will likely update this code before we use it next week, so stay tuned...

OK, but as you know from reading my article, there are more complex things to do with texts using machine learning, like topic modeling and classification. Next week, we're going to do both of those things, in addition to looking at a little image analysis. But, if we have time now, we can start topic modeling today.

First, let's discuss what topic modeling is (return to real life); ok, we're back...

If you want to see a whole very basic workflow of topic modeling, the most basic steps, here is a tutorial: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

This explains every block of this code, in case you want to topic model yourself. You'd have to substitute your own data for some of this.

In [None]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

In [None]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

#I had to add this to their code for colab; it's not in the original:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]
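
To make the first two steps of `clean()` easier to follow, here is a stripped-down sketch that uses only built-in Python. It is a simplification under some assumptions: `tiny_stop` is a tiny hand-picked stopword list (not nltk's real one), `clean_sketch` is a made-up name, and the lemmatizing step is left out entirely.

```python
import string

# A tiny made-up stopword list, standing in for nltk's English list.
tiny_stop = {"is", "to", "my", "but", "not", "have"}

def clean_sketch(doc):
    # Step 1: lowercase, split into words, and drop the stopwords.
    kept = [w for w in doc.lower().split() if w not in tiny_stop]
    # Step 2: rejoin and remove every punctuation character.
    no_punct = "".join(ch for ch in " ".join(kept) if ch not in string.punctuation)
    # Return a list of cleaned words, like doc_clean holds for each document.
    return no_punct.split()

result = clean_sketch("Sugar is bad to consume. My sister likes to have sugar, but not my father.")
print(result)
```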

In [None]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.

dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
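
The "document term matrix" here is really a list of "bags of words": for each document, a list of (word id, count) pairs. Here is a rough plain-Python sketch of that idea on two made-up documents. It is not gensim's actual code, and the id numbers gensim assigns may differ; the names `token_to_id` and `bows` are just for this example.

```python
from collections import Counter

# Two tiny, already-cleaned documents (made up for illustration).
docs = [["sugar", "bad", "sugar"], ["father", "sugar"]]

# Give every unique word an id number, in the order we first see it.
token_to_id = {}
for doc in docs:
    for token in doc:
        if token not in token_to_id:
            token_to_id[token] = len(token_to_id)

# For each document, count how many times each word id appears.
bows = []
for doc in docs:
    counts = Counter(token_to_id[token] for token in doc)
    bows.append(sorted(counts.items()))

print(token_to_id)
print(bows)
```

Notice that word order is thrown away: the model only sees which words appear in each document and how often.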

In [None]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

In [None]:
print(ldamodel.print_topics(num_topics=3, num_words=3))