**Text Analysis**

Today we'll start looking at how to implement, in code, three types of text analysis you read about: tf-idf, NER, and POS tagging.

As we've already mentioned, these text analysis methods fall into two categories:
1. Based on (newly created) word counts/quantities alone, without incorporating a previously trained model (which makes it language agnostic) - tf-idf
2. Incorporating a model previously trained on a particular language (so, not language agnostic) - NER, POS

Today we'll talk about the code implementation of both (tf-idf and NER/POS); but let's start by quickly touching on encodings. What did you learn from the reading on text encoding?

Sample of where we've seen text encoding already:

In [None]:
#This is code to load a text file into colab. Can you spot a line that has to do with encoding? What do you think it does?

from google.colab import files

# Upload a file
uploaded = files.upload()

# Retrieve the first filename from the uploaded files
filename = list(uploaded.keys())[0]

# Open the file with UTF-8 encoding and read its content
with open(filename, 'r', encoding='utf-8') as f:
    content = f.read()

print(content)
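As a small aside on what `encoding='utf-8'` is doing in the cell above: an encoding is the rule for turning characters into bytes and back again. The sketch below uses only built-in Python (the sample string is made up for illustration) to show the same text stored under two different encodings, and what happens when bytes are read back with the wrong one.

```python
# A made-up sample string containing one non-ASCII character
text = "caf\u00e9"  # "café"

# Encoding turns characters into bytes; different encodings give different bytes
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(utf8_bytes)    # the accented letter takes two bytes in UTF-8
print(latin1_bytes)  # the accented letter takes one byte in Latin-1

# Decoding with the matching encoding recovers the original text
recovered = utf8_bytes.decode("utf-8")
print(recovered)

# Decoding with the wrong encoding gives garbled text instead of an error
garbled = utf8_bytes.decode("latin-1")
print(garbled)
```

This is why it helps to state the encoding explicitly when opening a file: the bytes on disk only become the right characters if they are read with the same rule that was used to write them.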


**Tf-idf**

Let's start with tf-idf, which is a method of reducing documents to word counts that is just one step more advanced than raw word counting. We've already discussed the concept, but now let's look at the implementation.

Let's say, for example, we have a group of five short texts in a dataframe, and we want to get tf-idf scores for each word in each text/document. Let's start by assembling the data for this example:

In [4]:
#Create dataframe of five texts

import pandas as pd

# Create sample data
data = {
    "document": ["doc1", "doc2", "doc3", "doc4", "doc5"],
    "text": [
        "apple banana orange",   # doc1
        "banana fruit orange",   # doc2 (overlapping "banana" and "orange")
        "grape apple banana",    # doc3 (overlapping "apple" and "banana")
        "kiwi mango papaya",     # doc4 (unique words)
        "banana apple mango"     # doc5 (mix of common and unique words)
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display DataFrame
print(df)


  document                 text
0     doc1  apple banana orange
1     doc2  banana fruit orange
2     doc3   grape apple banana
3     doc4    kiwi mango papaya
4     doc5   banana apple mango


As you learned from your homework, to create tf-idf vectors (and we'll talk about the term "vectors" in a moment), you need to use a machine learning library called scikit-learn. We'll be using this library throughout the course for other operations as well (though what we're doing now is not machine learning yet). From scikit-learn, the tool we want to use is the tf-idf vectorizer. Here's the line of code to load the vectorizer from scikit-learn:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


OK, so, we have the texts we want to "vectorize" (in the dataframe called df) and we have the function we want to use to "vectorize" them. But it's still a bit tricky, because we need everything to both "go in" and "come out" of the function in the right way. We want to be able to put our documents "in" the function, but we need to figure out what format they need to be in and how to get them there. And we probably want the output to be a nice table that tells us the tf-idf score for every term in every document, but that's potentially a tricky operation too. So, the first thing we need to do is take a look at the tf-idf vectorizer and learn what kind of "input" it takes and what kind of "output" it gives. This is essential for working with any prepackaged function: if we don't give it the data in the right format, it won't work, and if we don't understand the format of the output, we won't be able to manipulate it. Usually, you can find this information in the documentation of the function you're using. But here let's look at the tf-idf vectorizer and its details together:

In [None]:
#The tf-idf vectorizer function in a common usage looks like this:

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
#When we create or "initialize" the vectorizer in this way, with the parentheses empty, we do so using its default settings
#You might remember from your homework that one of those is what Melanie called "smoothing"
#In practice, you might want to change those default settings by adding arguments inside the parentheses, like stop_words="english", which would, for example, remove English stopwords (not a default setting)
#The documentation for tf-idf in scikit-learn will show you all the various settings
#When you're using a function like this for a "real" project, it's good to check out all those settings/parameters and think about which you might want to use and why (even if you settle on the default, it's best to understand it)

# Transform the text data
#As its input the vectorizer takes a LIST OF DOCUMENTS (so: documents, here, stands for a list of documents, each one a string)
#As its output, the tf-idf matrix, it gives us a matrix where each row is a document and each column is a word; each cell holds that word's tf-idf score in that document (which means the columns cover all the words in all the documents)
tfidf_matrix = vectorizer.fit_transform(documents)

# The matrix we get as output can be converted into a dataframe like this
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

#the first part of what's in parentheses, above, turns our matrix (who knows what that is?) into an array. An array could look like this
#(commented out, since it is only an illustration and not code to run):

# array([
#     [0.58, 0.58, 0,    0.58],  # Doc1
#     [0,    0.49, 0,    0.69],  # Doc2
#     [0.58, 0.58, 0,    0],     # Doc3
#     [0,    0,    0.58, 0],     # Doc4
#     [0.50, 0.50, 0.50, 0]      # Doc5
# ])

#the next part, get_feature_names_out, is a method of the tf-idf vectorizer that gets us the words in the same order as the columns of the array
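To make the matrix-versus-array step a little more concrete: the vectorizer's output is a *sparse* matrix, which stores only the non-zero scores together with their (row, column) positions, while `.toarray()` fills in the zeros to give a full grid. The following is a pure-Python imitation of that idea, not scikit-learn's or SciPy's actual implementation. The numbers come from the first two rows of the illustrative array above, and the names `sparse_entries` and `dense_rows` are made up for this sketch.

```python
# Non-zero scores only, keyed by (row, column): row = document, column = word
sparse_entries = {
    (0, 0): 0.58, (0, 1): 0.58, (0, 3): 0.58,  # Doc1
    (1, 1): 0.49, (1, 3): 0.69,                # Doc2
}
n_docs, n_words = 2, 4

# Build the dense version: every (row, column) slot gets a value, 0.0 if missing
dense_rows = [
    [sparse_entries.get((row, col), 0.0) for col in range(n_words)]
    for row in range(n_docs)
]

for row in dense_rows:
    print(row)
```

With many documents and a large vocabulary, most cells are zero, which is presumably why the sparse form is the default output and the dense form is something we ask for only when we want a readable table.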


OK so let's say we want to convert the texts in our dataframe, df, to tf-idf vectors and then print another dataframe that portrays them. What's the first thing we'd have to do?

Yep...make the docs in our dataframe a list

In [5]:
#we could write a whole for loop to put the column's texts into a list, or we could just use the shortcut method that puts the items of a column into a list:
# Extract the text column from the DataFrame and convert it into a list
fruit_documents = df["text"].tolist()

# Display the list
print(fruit_documents)


['apple banana orange', 'banana fruit orange', 'grape apple banana', 'kiwi mango papaya', 'banana apple mango']


OK, now we've got our list of documents, fruit_documents, let's try vectorizing:

In [6]:
# Initialize the vectorizer
vectorizer = TfidfVectorizer()

# Transform the documents into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(fruit_documents)

# Display the raw output (sparse matrix); let's take a look at the raw output before converting it to a dataframe
print(tfidf_matrix)


  (0, 0)	0.5626379641131365
  (0, 1)	0.47330879280030447
  (0, 6)	0.6778032959469461
  (1, 1)	0.4015651234424611
  (1, 6)	0.5750625560879445
  (1, 2)	0.7127752157729959
  (2, 0)	0.5039681962632367
  (2, 1)	0.42395393449691715
  (2, 3)	0.7525151949161979
  (3, 4)	0.6141889663426562
  (3, 5)	0.49552379079705033
  (3, 7)	0.6141889663426562
  (4, 0)	0.5626379641131365
  (4, 1)	0.47330879280030447
  (4, 5)	0.6778032959469461


Well, that's hard to read! So let's use the code we learned to convert to a df

In [7]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
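To see where these scores come from, here is a sketch that recomputes the row for doc1 by hand using only built-in Python. It assumes scikit-learn's documented defaults for `TfidfVectorizer`: a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, followed by scaling each document's vector to length 1. The helper names (`doc_freq`, `idf`, `weights`, `scores`) are made up for this example, and splitting on spaces stands in for the vectorizer's own tokenizer.

```python
import math

fruit_documents = [
    "apple banana orange",
    "banana fruit orange",
    "grape apple banana",
    "kiwi mango papaya",
    "banana apple mango",
]
tokenized = [doc.split() for doc in fruit_documents]
n_docs = len(tokenized)

def doc_freq(word):
    """Number of documents that contain the word at least once."""
    return sum(1 for tokens in tokenized if word in tokens)

def idf(word):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + doc_freq(word))) + 1

# Raw tf-idf weights for doc1: term count times idf
doc1 = tokenized[0]
weights = {word: doc1.count(word) * idf(word) for word in set(doc1)}

# L2 normalization: divide by the vector's length so the row has length 1
length = math.sqrt(sum(w ** 2 for w in weights.values()))
scores = {word: w / length for word, w in weights.items()}

for word in sorted(scores):
    print(f"{word:<10} {scores[word]:.4f}")
```

The three values should agree with the doc1 entries in the sparse output printed earlier (roughly 0.5626 for apple, 0.4733 for banana, and 0.6778 for orange). Notice that "banana", which appears in four of the five documents, gets the lowest score, and "orange", which appears in only two, gets the highest.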


Now you try; write the full code to:
-make a list of three documents with three words each
-load in scikit-learn and the tf-idf vectorizer
-use the vectorizer on your list of docs with default settings
-convert the output to a dataframe and print it

OK so why do we call these vectors?

**brief break to discuss vectors, for next week**

**NER and POS with spacy**

OK, now let's talk about the second process for today, NER and POS with spaCy.
Again, we've already talked about these methods, what they do, and why they are not "language agnostic" - i.e., they rely on pretrained models for a particular language. But now, let's practice the code.

First, let's grab a text to use for practice:

In [8]:
sample_text = """It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us,
we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period,
that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
There were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen
with a fair face, on the throne of France."""



OK, to do NER and POS we use a library called spaCy. spaCy allows us to load in models for different languages. Here we'll use English, but look at Mel's textbook for samples in other languages! Let's first look at the code for loading in spaCy, and then loading in one of its particular models (in this case, for English):

In [9]:
# Install spaCy (only needed if not already installed)
# !pip install spacy

# Import spaCy
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")


Next, we can take our "sample_text", which is a string, and pass it in that format to the spaCy pipeline, which will tag both named entities and parts of speech; then, to begin with, we'll print out the named entities:

In [10]:
# Process the text with spaCy
#the input of spacy, sample_text, is a single string. So, if we want to process multiple texts, we'll do them one at a time (see later); but let's just do one for now
doc = nlp(sample_text)

# Extract and display named entities
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")


the season - DATE
Light - GPE
the season - DATE
the spring - DATE
England - GPE
France - GPE


These are two of spaCy's labels: DATE and GPE (geopolitical entity). Does anyone remember other labels spaCy has for NER?

e.g., Person (PERSON) or Date (DATE)

You can also find them in Melanie's textbook or the official spaCy documentation (https://spacy.io/models/en). But here's a table:

In [18]:
# @title NER TAGS
import pandas as pd


# Define the NER entity table
ner_data = {
    "Entity Type": [
        "PERSON", "GPE", "ORG", "DATE", "TIME", "MONEY", "PERCENT",
        "LOC", "FAC", "PRODUCT", "EVENT", "WORK_OF_ART", "LAW",
        "LANGUAGE", "NORP", "ORDINAL", "CARDINAL", "QUANTITY", "MISC"
    ],
    "Shorthand Code": [
        "PERSON", "GPE", "ORG", "DATE", "TIME", "MONEY", "PERCENT",
        "LOC", "FAC", "PRODUCT", "EVENT", "WORK_OF_ART", "LAW",
        "LANGUAGE", "NORP", "ORDINAL", "CARDINAL", "QUANTITY", "MISC"
    ],
    "Description": [
        "People, including fictional characters.",
        "Countries, cities, states.",
        "Companies, agencies, institutions, etc.",
        "Absolute or relative dates or periods.",
        "Times smaller than a day.",
        "Monetary values, including currency.",
        "Percentage values (e.g., '20%').",
        "Non-GPE locations, mountain ranges, bodies of water.",
        "Buildings, airports, highways, bridges, etc.",
        "Objects, vehicles, foods, etc. (not services).",
        "Named hurricanes, battles, wars, sports events, etc.",
        "Titles of books, songs, artworks, etc.",
        "Named documents made into laws.",
        "Any named language.",
        "Nationalities, religious, or political groups.",
        "'First', 'second', etc.",
        "Numerals that do not fall under another type.",
        "Measurements, as of weight or distance.",
        "Miscellaneous entities that don't fit into other categories."
    ]
}

# Convert to a DataFrame
ner_df = pd.DataFrame(ner_data)

# Hide the code but show the table
# (display is imported here so that this cell runs on its own)
import IPython.display as display
display.display(ner_df)


    Entity Type  Shorthand Code  Description
0   PERSON       PERSON          People, including fictional characters.
1   GPE          GPE             Countries, cities, states.
2   ORG          ORG             Companies, agencies, institutions, etc.
3   DATE         DATE            Absolute or relative dates or periods.
4   TIME         TIME            Times smaller than a day.
5   MONEY        MONEY           Monetary values, including currency.
6   PERCENT      PERCENT         Percentage values (e.g., '20%').
7   LOC          LOC             Non-GPE locations, mountain ranges, bodies of ...
8   FAC          FAC             Buildings, airports, highways, bridges, etc.
9   PRODUCT      PRODUCT         Objects, vehicles, foods, etc. (not services).


OK, but that's not a beautiful format. What if we wanted to tag our named entities and look at them in a pretty, highlighted format? For that we can use "displacy", as you learned from the homework textbook:

In [11]:
from spacy import displacy

# Process the text with spaCy; note that we already did this step above, but we can do it again here
doc = nlp(sample_text)

# Display named entities in a nicely formatted way
displacy.render(doc, style="ent", jupyter=True)


Now you try. Type a brief sentence; tag the entities using NER; and print out the results via displacy:

OK, but let's say, now, we want to put our entities in a dataframe, showing how many times each appears?

In [12]:
import pandas as pd
from collections import Counter

# Process the text with spaCy
doc = nlp(sample_text)

# Extract entity text (making a list of the text of every entity, not its label)
entities = [ent.text for ent in doc.ents]

# Count occurrences of each entity (getting a count of each entity in the list, how many times it appears)
entity_counts = Counter(entities)

# Convert to a Pandas DataFrame
entity_df = pd.DataFrame(entity_counts.items(), columns=["Entity", "Count"])

# Display the DataFrame
print(entity_df)


       Entity  Count
0  the season      2
1       Light      1
2  the spring      1
3     England      1
4      France      1


Printing outputs of some of the lines of code above can help you understand what's going on:

In [15]:
entity_counts

Counter({'the season': 2,
         'Light': 1,
         'the spring': 1,
         'England': 1,
         'France': 1})

In [14]:
entity_counts.items()

dict_items([('the season', 2), ('Light', 1), ('the spring', 1), ('England', 1), ('France', 1)])
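`Counter` can also be explored on its own, without spaCy. The sketch below uses a hand-typed list standing in for the `entities` list (copied from the named-entity output above). It also shows `most_common()`, a `Counter` method that is not used in the cells above, as one way to get the entities ordered by frequency.

```python
from collections import Counter

# Hand-typed stand-in for the `entities` list built from doc.ents
entities = ["the season", "Light", "the season", "the spring", "England", "France"]

# Counter tallies how many times each item appears in the list
entity_counts = Counter(entities)

# Look up a single count like a dictionary
print(entity_counts["the season"])

# Items that never appeared count as 0 rather than raising an error
print(entity_counts["Paris"])

# most_common(n) returns the n most frequent (item, count) pairs
print(entity_counts.most_common(1))
```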

We're not going to do it now, but for homework you can use the above code to take your own named entity outputs and put them in a dataframe

Now let's look at POS!

To tag POS we use exactly the same code as for NER, applied to our sample text, but we deal with the output differently:

In [19]:
# Process the sample text with spaCy
doc = nlp(sample_text)

# We can use this code to print each token with its corresponding tags (coarse and fine grained) and the explanations of each tag
for token in doc:
    print(f"{token.text:<15} {token.pos_:<10} {token.tag_:<10} {spacy.explain(token.tag_)}")
    #by the way, what you see above is called "f-string format" and you've probably seen it before. It allows you to include variables in part of the string you print and is nice for printing outputs.
    #the 15 and 10s up there are formatting the output with alignment widths. you can play with what happens, on your own, if you delete or change them!

#note we get two tags for each word, a more coarse grained and then fine grained


It              PRON       PRP        pronoun, personal
was             AUX        VBD        verb, past tense
the             DET        DT         determiner
best            ADJ        JJS        adjective, superlative
of              ADP        IN         conjunction, subordinating or preposition
times           NOUN       NNS        noun, plural
,               PUNCT      ,          punctuation mark, comma
it              PRON       PRP        pronoun, personal
was             AUX        VBD        verb, past tense
the             DET        DT         determiner
worst           ADJ        JJS        adjective, superlative
of              ADP        IN         conjunction, subordinating or preposition
times           NOUN       NNS        noun, plural
,               PUNCT      ,          punctuation mark, comma
it              PRON       PRP        pronoun, personal
was             AUX        VBD        verb, past tense
the             DET        DT         determiner
age         
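The alignment widths in the f-string above can be tried out without spaCy. In this sketch the (token, coarse tag, fine tag) rows are hand-typed from the output above, and a `|` character is added only to make the padding visible. `:<15` pads on the right up to 15 characters (left-aligned), while `:>15` pads on the left instead.

```python
# Hand-typed rows copied from the tagger output above
rows = [("It", "PRON", "PRP"), ("was", "AUX", "VBD"), ("best", "ADJ", "JJS")]

# Left-aligned columns, as in the cell above
for text, pos, tag in rows:
    print(f"{text:<15}|{pos:<10}|{tag}")

# The same word with three different format settings
left_aligned = f"{'best':<15}|"
right_aligned = f"{'best':>15}|"
no_width = f"{'best'}|"

print(left_aligned)
print(right_aligned)
print(no_width)
```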

In [20]:
#here's how we'd print just the tags, no explanation:
for token in doc:
    print(f"{token.text:<15} {token.pos_:<10} {token.tag_:<10}")


It              PRON       PRP       
was             AUX        VBD       
the             DET        DT        
best            ADJ        JJS       
of              ADP        IN        
times           NOUN       NNS       
,               PUNCT      ,         
it              PRON       PRP       
was             AUX        VBD       
the             DET        DT        
worst           ADJ        JJS       
of              ADP        IN        
times           NOUN       NNS       
,               PUNCT      ,         
it              PRON       PRP       
was             AUX        VBD       
the             DET        DT        
age             NOUN       NN        
of              ADP        IN        
wisdom          NOUN       NN        
,               PUNCT      ,         
it              PRON       PRP       
was             AUX        VBD       
the             DET        DT        
age             NOUN       NN        
of              ADP        IN        
foolishness 

OK, let's say we wanted to visualize these using displacy; we could display the grammatical relationships, or "dependencies", between the tokens in this way. We won't go over parsing dependencies now, but if you want to analyze more complicated grammatical structures it's a good thing to look into.

In [21]:
from spacy import displacy

# Process the text with spaCy
doc = nlp(sample_text)

# Render the dependency visualization (each token is also labeled with its POS tag)
displacy.render(doc, style="dep", jupyter=True)


What if we wanted our POS tags in a dataframe?

In [22]:
import pandas as pd

# Process the sample text with spaCy
doc = nlp(sample_text)

# Extract token data
pos_data = {
    "Token": [token.text for token in doc],
    "Lemma": [token.lemma_ for token in doc],  # Lemmatized form of the word
    "POS": [token.pos_ for token in doc],  # Coarse POS tag
    "Fine POS": [token.tag_ for token in doc],  # Detailed POS tag
    "Dependency": [token.dep_ for token in doc],  # Syntactic dependency relation
}

# Convert to a DataFrame
pos_df = pd.DataFrame(pos_data)

# Display the DataFrame
import IPython.display as display
display.display(pos_df)


     Token   Lemma   POS    Fine POS  Dependency
0    It      it      PRON   PRP       nsubj
1    was     be      AUX    VBD       ccomp
2    the     the     DET    DT        det
3    best    good    ADJ    JJS       attr
4    of      of      ADP    IN        prep
...  ...     ...     ...    ...       ...
183  the     the     DET    DT        det
184  throne  throne  NOUN   NN        pobj
185  of      of      ADP    IN        prep
186  France  France  PROPN  NNP       pobj


Now you try; make a text, tag its parts of speech, and display them any way you like (maybe in a dataframe?)

Of course, we've been tagging only a single string with NER and POS tags. What if we wanted to tag multiple different documents? As I said, we can only put single strings/docs into spacy at a time, as opposed to lists of docs (like with tf-idf), so we'd have to put the documents in one by one. How would we do this? (e.g. Write a for loop?). This will be part of your homework!

**Homework**

We need to organize into groups!

Here's what you'll cover:
-Topic Modeling
-Sentiment Analysis
-Classification
-BERT (encodings)
-Transformers (LLMs like GPT, encodings)
-Image Analysis (Object Detection method)