<a href="https://colab.research.google.com/github/TessM2/content/blob/main/Workshop_3_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Modeling**

Last week, we looked at some basic methods of text analysis with nltk (tokenization, word counts, part-of-speech tagging, etc.).

But there are also more complicated (if equally limited, or more so) ways of engaging in textual, or imagistic, analysis.

A lot of these ways use machine learning models

We've discussed machine learning, but let's briefly pause to discuss what a machine learning model is...


OK, now that we've done that...

Today we're going to take a quick look at three types of text/image analysis models. 

As a caveat - each of these methods is questionable/problematic in its own way, and it's important to understand the pros/cons of each method. We'll definitely touch on this briefly, but I've also shared links where you can read about these methods in a bit more detail, if you are interested in using them

As another caveat - this is truly the briefest introduction, and, as in all our coding work, we're only skimming the surface here. But it's meant as a starter kit, so that you can pursue any of these methods further if you're interested

**Topic Modeling**

This is a method of extracting the "topics" that appear in a set of textual documents. Let's pause for a second to understand exactly what that means, in this context...

Here, also, are two links to more in-depth explanations of what topic modeling is and how one might use it, from a more humanist perspective

https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/
https://maria-antoniak.github.io/2022/07/27/topic-modeling-for-the-people.html

ok, let's look at some code for topic modeling...

In [None]:
#the first thing we need to do is have some documents we want to work with; 
#here we'll make them ourselves, and use a very small sample
#in reality, we'd gather the texts from a corpus of interest, and use a larger sample
#though there are questions concerning the use of topic models on short texts (we can discuss, though my tl;dr is that it still works with LDA even though other models might be preferable), you can try this on a group of tweets

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

As human readers, what topics might we say that we see across the above documents?

In [None]:
#next we need to clean and prepare our textual data; remember last time we discussed tokenization
#here we're going to "lemmatize" and delete "stopwords", among other things; let's discuss what that means

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

#we need the downloads below every time we start a new colab session; on your own machine you'd only need to run them once:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

#cleaning the text in each document in our list
doc_clean = [clean(doc).split() for doc in doc_complete] 
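
To make "stopwords" and "lemmatizing" a bit more concrete, here is a toy sketch in plain Python. The stopword set and the lemma lookup (`TOY_STOPWORDS`, `TOY_LEMMAS`) are hand-made, hypothetical stand-ins for what NLTK provides, and `toy_clean` is not the `clean` function above, just an illustration of the same idea. Note one difference: this sketch strips punctuation before filtering stopwords, so that a token like "not." would still be caught.

```python
import string

# Hypothetical, hand-made stand-ins for NLTK's stopword list and lemmatizer
TOY_STOPWORDS = {"is", "to", "my", "but", "not", "the", "a"}
TOY_LEMMAS = {"likes": "like", "doctors": "doctor", "experts": "expert"}

def toy_clean(doc):
    # lowercase and strip punctuation first, then split into tokens
    no_punct = doc.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    # drop the very common words that carry little topical meaning
    kept = [t for t in tokens if t not in TOY_STOPWORDS]
    # map inflected forms back to a base form where we know one
    return [TOY_LEMMAS.get(t, t) for t in kept]

example = toy_clean("My sister likes to have sugar, but not my father.")
print(example)
```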

One package we can use for topic modeling is gensim, though it's not the only one.
One type of topic model we can use is LDA though, again, it's not the only one
(see below on "model selection").

Here we're going to process the text further, to prepare for the modeling:
we'll put the features/tokens in a document-term matrix.
let's talk for a moment about what that is
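
As a rough picture of a document-term matrix, here is a minimal sketch in plain Python, using made-up, already-cleaned toy documents (`toy_docs` is hypothetical, not our corpus). Each row is a document, each column is a term, and each cell is how many times that term appears in that document. gensim stores the same information more compactly, as (term id, count) pairs, but the idea is the same.

```python
from collections import Counter

# Hypothetical toy documents, already tokenized and cleaned
toy_docs = [
    ["sugar", "bad", "sugar", "father"],
    ["father", "driving", "sister"],
    ["driving", "stress", "pressure"],
]

# Vocabulary: every unique term gets its own column
vocab = sorted({term for doc in toy_docs for term in doc})

# Dense matrix: one row per document, one column per term, cells hold counts
matrix = []
for doc in toy_docs:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```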

In [None]:
# Importing Gensim
import gensim
from gensim import corpora


# Creating the term dictionary of our corpus, where every unique term is assigned an index. 

dictionary = corpora.Dictionary(doc_clean)

# Converting the list of documents (corpus) into a document-term matrix using the dictionary prepared above. (In gensim, calling doc2bow on the dictionary object creates the word counts for us, as a list of (term id, count) pairs for each document.)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

running the model...

In [None]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

printing and examining the topics...thoughts?

In [None]:
print(ldamodel.print_topics(num_topics=3, num_words=3))

I want to highlight a few steps we just went through, in which we made a number of choices.
These are steps we go through when running/training almost every type of model,
and the choices we make in each of these steps affect the outcome.
They are:

1. preprocessing/cleaning (how do we process/prepare our text?)
2. feature selection (what textual elements do we use as features?)
3. Model selection (what type of model we use; here, lda; there are a few principles we might apply to choosing models, in general)
4. parameter or hyperparameter tuning (the "settings" of the model, from how many topics we choose to create to how many passes we make through the corpus during training)

You have to consider all of these choices with any type of model

Try rerunning the model above, changing some of the parameters (e.g. number of topics or passes)

**Classifier**

next we can use a classifier
these can work for both images and text
we've discussed these before, but let's review, with the classic example of the spam filter...

Here's an introduction to this method that also starts with the spam filter example

https://developers.google.com/machine-learning/guides/text-classification

Again, we're going to have to make choices through a number of steps. We're going to have to choose:
1. how we pre-process the text
2. what features we select to train the model on (lots of choices, in real classification)
3. what type of model we use. we'll practice with naive bayes, but others are possible (why choose this one?)
4. how we tune parameters/hyperparameters. Here, this matters quite a bit, as it will change the efficacy of the model by a good amount, especially the max_features parameter, as we'll see below



First, as always, we need some data. Basically, we need a corpus of two different categories of texts, in equal numbers, that we want to train our classifier to distinguish from one another

in most classification problems the standard wisdom is that you want 1,000 samples of each type of text, which have to be labeled. labeling of texts into categories is often collective (via multiple annotators), crowdsourced (mechanical turk), or drawing from existing labels (like genre labels, on netflix). but you can also label yourself.

Here's a sample corpus of two categories of texts, which are different types of Medium posts that I labeled. The first group are texts of Medium posts that are in what I call the "7 habits" style (let's look at these a second); the other group are texts of other randomly selected Medium posts from self-help publications. We're actually going to use just 585 of each.

To run a model, we want to start with a spreadsheet/matrix of the texts themselves, and then, associated with each, a label to indicate the type. We'll typically use the labels 1 and 0 for the two types. So, we want, with each of the 585 examples of the "7 habits" Medium posts, the label 1, and with each of the 585 examples of the other posts, the label 0.
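
Just to picture that layout, here is a tiny sketch with pandas. The four rows are invented for illustration; only the column names (`paragraphs` and `7 habits`) match the dataset we load in this workshop.

```python
import pandas as pd

# Hypothetical toy rows; the column names match the workshop dataset
toy = pd.DataFrame({
    "paragraphs": [
        "Seven habits that will change your mornings.",
        "Seven habits of people who never burn out.",
        "Why I stopped reading productivity advice.",
        "A letter to my younger self.",
    ],
    "7 habits": [1, 1, 0, 0],
})

# Check that the two categories appear in equal numbers
label_counts = toy["7 habits"].value_counts()
print(label_counts)
```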

Here we're going to import that data, and take a look at it

Please note that this method of importing data is something we only have to do on colab; in your own work, you'd import from your own machine. To bring your own data into colab, you'd use the method I put in the code from workshop 2 (I can show you again, if you need help)

In [None]:
#Now, let's download the dataset we're going to use to train the model as a zipped archive
import pandas as pd
import urllib.request
urllib.request.urlretrieve("https://github.com/TessM2/7habitsdata/archive/refs/heads/main.zip", "dataset.zip")
import zipfile
with zipfile.ZipFile("dataset.zip", 'r') as zf:
    zf.extractall()
#Clean up after ourselves
import os
os.remove("dataset.zip")
sevendata = pd.read_csv('7habitsdata-main/sevenhabits585formodel.csv')
sevendata

In [None]:
#the .head() method also lets us view the first few rows of a dataframe
print(sevendata.head())


In [None]:
# next let's import the packages we need
#we can do a lot of machine learning modeling, as people often do, with a package called scikit learn
#we're going to import scikit learn and some utilities/functions in that package we also might need
#what do you think countvectorizer does?

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

let's try running a naive bayes classifier on the full texts (the paragraphs, vs. the titles)
but first let's talk through some elements, here
let's talk, especially, about splitting the data up into **training and testing sets**
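
Here is a minimal sketch of a single training/testing split, assuming a tiny invented corpus (`texts` and `labels` below are hypothetical, not the workshop data). It uses scikit-learn's `train_test_split`, which the workshop code itself does not use, so treat this as one possible illustration rather than the method we follow below. The thing to notice is that the model only ever "sees" the training texts, and the vocabulary is also built from the training texts alone.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = "7 habits" style, 0 = other
texts = [
    "seven habits of happy people", "five habits to start today",
    "ten habits of great writers", "three habits that changed me",
    "a letter to my father", "notes on a long winter",
    "what my sister taught me", "driving home in the rain",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the texts for testing, keeping the classes balanced
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Build the vocabulary on the training texts only, then reuse it for the test texts
vec = CountVectorizer()
train_vectors = vec.fit_transform(X_train)
test_vectors = vec.transform(X_test)

# Train on the training set, score on the held-out test set
clf = MultinomialNB().fit(train_vectors, y_train)
print(clf.score(test_vectors, y_test))
```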


In [None]:

#vectorize the text and set max_features; this is our feature selection
countvec = CountVectorizer(max_features = 6547)
#below, you see another version of the above line we could use, commented out, which deletes stopwords
#countvec = CountVectorizer(stop_words = "english", max_features = 6547)

#vectorize the text (i.e., convert features to numerical values)
vectors = countvec.fit_transform(sevendata['paragraphs'])
labels = sevendata['7 habits']

#initialize the model (take off shelf; multinomial naive bayes is the type we're using)
mnb = MultinomialNB()

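#fit the model to the data (we'll need this fitted model later, when we inspect its features)
mnb.fit(vectors, labels)
#run and get accuracy scores, cross-validating across ten folds (cross_val_score fits its own copies of the model)
#what do we mean by cross validating? *discuss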

scores = cross_val_score(mnb, vectors, labels, cv=10)
scoreavg = (sum(scores) / len(scores))
print(scoreavg)
print(scores)
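
To show the mechanics behind cross-validation, here is a plain-Python sketch of how k folds carve up the data: each fold takes a turn as the test set while the rest is used for training. The helper `k_fold_indices` is hypothetical and simplified; for a classifier like ours, scikit-learn's `cross_val_score` uses stratified folds (keeping the 1/0 labels balanced in each fold), which this sketch ignores.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    indices = list(range(n_samples))
    # spread any remainder over the first few folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 10 samples split into 5 folds: each sample is tested exactly once
folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```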

  

In [None]:
#note that setting parameters is very important, here; especially the max_features
#what do you think max_features means?
#try some other max_features numbers. does the accuracy come out better?


In [None]:

#in reality, you might want to loop through a number of different max features to see where you get the highest accuracy
#I already did this and determined that 6574 was the peak, at about 70.8 percent
#but you could build your own loop, and set the max_features numbers you want to loop through, like this
#set, in the range of the loop, the numbers of max features you want to loop through (don't make the range too big, so it won't take too long; but you might try, e.g., a range of about 50 values)

In [None]:
#putting our code above in a loop, to look through ranges of max_features:

#notice this takes a while, so you might want to stop it part way through

for i in range(6500, 6600):
  countvec = CountVectorizer(max_features = i)
  vectors = countvec.fit_transform(sevendata['paragraphs'])
  labels = sevendata['7 habits']
  mnb = MultinomialNB()
  mnb.fit(vectors, labels)
  scores = cross_val_score(mnb, vectors, labels, cv=10) 
  scoreavg = (sum(scores) / len(scores))
  print(i)
  print(scoreavg)
  print(scores)
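
An alternative to writing the loop by hand, sketched here on a tiny invented corpus (`texts` and `labels` are hypothetical), is scikit-learn's `Pipeline` together with `GridSearchCV`. This is not what the loop above does: in the loop, the vocabulary is built from all of the texts before cross-validating, whereas here the vectorizer is refit inside each fold, so the test fold never influences which words are kept. The candidate values are arbitrary, chosen only to suit the toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = "7 habits" style, 0 = other
texts = [
    "seven habits of happy people", "five habits to start today",
    "ten habits of great writers", "three habits that changed me",
    "a letter to my father", "notes on a long winter",
    "what my sister taught me", "driving home in the rain",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Chain the vectorizer and the classifier so they are refit together per fold
pipeline = Pipeline([("vec", CountVectorizer()), ("nb", MultinomialNB())])

# Try each candidate max_features value with 2-fold cross-validation
search = GridSearchCV(pipeline, param_grid={"vec__max_features": [5, 10, 20]}, cv=2)
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```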

In [None]:
#after, we can look at the features the model "relied" on most to distinguish the two groups; 
#i.e., in this case, with simple word counts, we'll get this in the form of a proportion: how many times the word appeared in one type of text vs. in another
#in these proportions we always add one to the numerator and denominator, though, to avoid dividing by zero

#BEFORE WE DO THIS: make sure to rerun the model, once, with the parameters (max features) you want; this will run with that version of the model
#you might just rerun the first version of the model before the loop, then skip the loop

#getting top distinguishing words
#(get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
title_tokens = countvec.get_feature_names_out()

mnb.feature_count_
mnb.feature_count_.shape

# number of times each token appears across all non7hab texts
nonseven_token_count = mnb.feature_count_[0, :]
nonseven_token_count

# number of times each token appears across all 7hab texts
seven_token_count = mnb.feature_count_[1, :]
seven_token_count

# create a DataFrame of tokens with their separate non7hab and 7hab counts
tokens = pd.DataFrame({'token':title_tokens, 'nonseven':nonseven_token_count, 'seven':seven_token_count}).set_index('token')
tokens.head()

mnb.class_count_

# add 1 to 7hab and non7hab counts to avoid dividing by 0
tokens['nonseven'] = tokens.nonseven + 1
tokens['seven'] = tokens.seven + 1
tokens.sample(5, random_state=6)

# calculate the ratio of 7hab-to-non7hab for each token
tokens['seven_ratio'] = tokens.seven / tokens.nonseven
tokens.sample(5, random_state=6)

# examine the DataFrame sorted by seven_ratio
# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
tokens.sort_values('seven_ratio', ascending=False)


Notice again, we made lots of decisions here
1. how we preprocessed text (kept stopwords)
2. how we chose features (unigrams; could have chosen lots of other things, e.g. bigrams, trigrams, etc.)
3. how we chose model (naive bayes, why?)
4. how we chose parameters (max features)

Now try redoing the above code, but on the headline texts, instead of the full texts. what do you have to change in the code to redo that?

copy and paste new code below, making the necessary changes

Next time, I hope: image classification (no time today)