# Introduction to Text Analysis Notebook

You are welcome to work either locally or in this Google Colab notebook. Below you will find code to recreate the dataset and some starter code for exploring named entity recognition (NER) and term frequency-inverse document frequency (TF-IDF). Please reach out to the instructors if you have questions or concerns. To use this notebook, you will need to install the following packages:

```bash
!pip install pandas
!pip install tqdm
!pip install altair
!pip install scikit-learn
!pip install requests
!pip install gutenbergpy
!pip install -U spacy
!pip install -U spacy-transformers
!python -m spacy download en_core_web_sm
!python -m spacy download xx_ent_wiki_sm
```

## Install and Import Libraries

In [None]:
# Importing libraries
import spacy
from spacy import displacy
import pandas as pd
from tqdm import tqdm
import altair as alt
import gutenbergpy.textget
import requests
import warnings
warnings.filterwarnings('ignore')

# Load spaCy's multilingual and English models (can be replaced with larger models if needed)
multi_nlp = spacy.load('xx_ent_wiki_sm')
eng_nlp = spacy.load('en_core_web_sm')

## Load Datasets

You can either use the premade dataset available in Google Drive (you will need to change the path to the file) or run the code below to rebuild the dataset from scratch.

**Be warned, this file is quite large because of the size of the novels, so you may want to use a subset of the novels to test this code.**

### Google Drive Dataset

You can download this dataset here: [https://drive.google.com/file/d/1LkaRtYph_lWtMPRyzZpECuzEMD3WPx26/view?usp=sharing](https://drive.google.com/file/d/1LkaRtYph_lWtMPRyzZpECuzEMD3WPx26/view?usp=sharing). It is very large, so make sure you don't push it to GitHub.


In [2]:
combined_novels_nyt_df = pd.read_csv("combined_novels_nyt_with_text.csv")

### Rerun Dataset Creation Code

In [None]:
novels_df = pd.read_csv("https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/main/datasets/top-500-novels/library_top_500.csv")
nyt_bestsellers_df = pd.read_csv("https://raw.githubusercontent.com/ecds/post45-datasets/main/nyt_full.tsv", sep='\t')
combined_novels_nyt_df = novels_df.merge(nyt_bestsellers_df, how='left', on=['author', 'title'])

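def get_text(url):
	# Download the plain text at `url`; return None if the URL is missing or the request fails
	if pd.notna(url):
		try:
			response = requests.get(url, timeout=30)  # avoid hanging on a slow server
			if response.status_code == 200:
				return response.text
		except requests.RequestException:
			return None

	return None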

from tqdm import tqdm

tqdm.pandas(desc="Progress")
combined_novels_nyt_df.loc[:, 'pg_eng_text'] = combined_novels_nyt_df.pg_eng_url.progress_apply(get_text)
combined_novels_nyt_df.loc[:, 'pg_orig_text'] = combined_novels_nyt_df['pg_orig_url'].progress_apply(get_text)

combined_novels_nyt_df['pg_eng_tokens'] = combined_novels_nyt_df['pg_eng_text'].str.split()
combined_novels_nyt_df['pg_orig_tokens'] = combined_novels_nyt_df['pg_orig_text'].str.split()

combined_novels_nyt_df['pg_eng_text_len'] = combined_novels_nyt_df.pg_eng_text.str.len()
combined_novels_nyt_df['pg_orig_text_len'] = combined_novels_nyt_df.pg_orig_text.str.len()
combined_novels_nyt_df['pg_eng_token_len'] = combined_novels_nyt_df.pg_eng_tokens.str.len()
combined_novels_nyt_df['pg_orig_token_len'] = combined_novels_nyt_df.pg_orig_tokens.str.len()

def clean_book(url):
	# Fetch a book by its Project Gutenberg ID (parsed from the URL) and strip the Gutenberg header and footer
	if pd.notna(url):
		pg_id = url.split('/pg')[-1].split('.')[0]
		try:
			raw_book = gutenbergpy.textget.get_text_by_id(pg_id) # with headers
			cleaned_book = gutenbergpy.textget.strip_headers(raw_book) # without headers
			return cleaned_book
		except Exception:
			return None
	return None

combined_novels_nyt_df.loc[:, 'cleaned_pg_eng_text'] = combined_novels_nyt_df.pg_eng_url.apply(clean_book)
combined_novels_nyt_df.loc[:, 'cleaned_pg_orig_text'] = combined_novels_nyt_df.pg_orig_url.apply(clean_book)

## NER Code

There are many libraries for doing NER but today we'll be using `spaCy`, which is one of the most popular. You can visit the documentation here [https://spacy.io/usage](https://spacy.io/usage).

spaCy is a much more complex library than we have used before, so let's try out an example and then talk through some of the documentation. Let's start with an example from their spaCy 101 guide [https://spacy.io/usage/spacy-101#annotations-ner](https://spacy.io/usage/spacy-101#annotations-ner) and try copying some code.

```python
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

In our case we'll change `nlp` to be `eng_nlp`. Let's try running this code and see what happens.

In [3]:
doc = eng_nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


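So what exactly does this output mean? Each line shows the entity's text, the character offsets where it starts and ends in the original string, and the label spaCy predicted for it (`ORG` for organizations, `GPE` for geopolitical entities such as countries and cities, `MONEY` for monetary values). To go further, let's explore a bit of the spaCy API. First, how could we check what our `doc` variable contains?

We should see `spacy.tokens.doc.Doc`, so let's take a look at the `Doc` documentation at <https://spacy.io/api/doc>. The documentation is pretty dense, but we should get a sense that a `Doc` is a sequence of [`Token`](https://spacy.io/api/token) objects and that it has a number of built-in methods and attributes.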

We can list these methods with the following code:

```python
# Let's dig into what this Class gives us
[prop for prop in dir(doc) if not prop.startswith('_')]

first_word = doc[0]
type(first_word)

[prop for prop in dir(first_word) if not prop.startswith('__')]
```

Take a look at the [`similarity`](https://spacy.io/api/doc#similarity) method and the [`ents`](https://spacy.io/api/doc#ents) attribute. What do they do?

Understanding what they are doing requires understanding how spaCy works. Below is a figure of their pipeline:

![spacy pipeline](https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

This gives us a bit of a sense of how this is working (recognize the tokenizer?), but let's also read their broad overview page <https://spacy.io/usage/facts-figures>.

Looking at the comparisons on that page, what do you think are the benefits and limitations of spaCy? How does spaCy create its models, and what exactly are these models?

Returning to our example above, let's try using one of spaCy's built-in visualizations to understand how this is working.

```python
from spacy import displacy
displacy.render(doc, style="dep")
```

We should now see something that looks like this:

![spacy pos](https://spacy.io/images/displacy.svg)

This visualization shows the part-of-speech tag spaCy assigns to each token, along with arcs for the grammatical dependencies between tokens. Part-of-speech tagging is a good window into how spaCy's trained pipeline works.

From the [spaCy docs](https://spacy.io/usage/linguistic-features#pos-tagging):

> After tokenization, spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes binary data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.

So the key thing to understand is that this super powerful library also comes with a lot of built-in assumptions about how language in your corpus is structured.

Now that we've considered some of the pros and cons, let's try out spaCy's NER with some of our data.

spaCy does have a limit on how much text it can process at once (by default, `nlp.max_length` is 1,000,000 characters), so we'll need to break up our text into smaller chunks. Let's try a small subset first:

In [6]:
text = combined_novels_nyt_df.cleaned_pg_eng_text[0][0:1000]
doc = eng_nlp(text)

for ent in doc.ents[0:10]:
	print(ent.text, ent.start_char, ent.end_char, ent.label_)

152K)\n\n\nFull 73 88 CARDINAL
Miguel de Cervantes\n\n\n\n Translated 129 167 PERSON
John Ormsby\n\n\n\n\nEbook 171 197 PERSON
Note\n\n\n\nThe 218 233 ORG
Ormsby 320 326 PERSON
J. W. Clark 391 402 PERSON
Gustave Dor\xc3\xa9 419 438 PERSON
Clark 440 445 PERSON
English 491 498 LANGUAGE
\xe2\x80\x98Don Quixote\xe2\x80\x99 507 542 ORG


We could visualize this using displacy as well.

In [8]:
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)

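You'll notice some odd-looking sequences in the text, like `\xe2\x80\x99s` or `\n`. These are escape sequences rather than readable characters: `\n` stands for a newline, and `\xe2\x80\x99` is the UTF-8 byte sequence for the right single quotation mark (`’`). They most likely show up literally because the Project Gutenberg text was downloaded as raw bytes and never decoded into a string.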
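These escape sequences can be turned back into readable text. The sketch below assumes the column holds the string form of a Python `bytes` object (for example, after a round trip through a CSV file); the sample string and the helper name `decode_bytes_repr` are hypothetical and not part of the notebook.

```python
import ast

def decode_bytes_repr(value):
    """Turn the string form of a bytes object (e.g. "b'se\\xc3\\xb1or'") into text."""
    # Assumption: affected values look like "b'...'"; anything else is returned unchanged.
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        raw = ast.literal_eval(value)  # safely parse the literal into real bytes
        return raw.decode("utf-8", errors="replace")
    return value

# Hypothetical sample modeled on the entities printed above
sample = "b'Gustave Dor\\xc3\\xa9\\n\\xe2\\x80\\x98Don Quixote\\xe2\\x80\\x99'"
print(decode_bytes_repr(sample))
```

If this assumption holds for your copy of the dataset, applying a helper like this to `cleaned_pg_eng_text` before running spaCy should remove most of the stray `\x..` fragments.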



More importantly, we are starting to see how NER works and what entities it identifies. Below we can see the entity types that spaCy can recognize.

![NER entities](https://devopedia.org/images/article/256/8660.1580659054.png)

We could for example be interested in what places are mentioned in these novels, so we would want to subset to only `GPE` entities. 

```python
for ent in doc.ents:
	if ent.label_ == 'GPE':
		print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

So let's try writing code that gets all the `GPE` entities from our novels.

In [13]:
def chunk_text(text, n=10000):
    return (text[i:i+n] for i in range(0, len(text), n))

def get_entities(text, nlp, n=10000, entity_type="GPE"):
    entities = []
    chunks = list(chunk_text(text, n))
    for chunk in tqdm(chunks, desc="Processing Text Chunks"):
        doc = nlp(chunk)
        entities.extend(ent.text for ent in doc.ents if ent.label_ == entity_type)
    return entities

# Assuming combined_novels_nyt_df and eng_nlp are already defined
# Copy the slice so that adding a column does not trigger pandas' SettingWithCopyWarning
test_df = combined_novels_nyt_df[0:1].copy()
# tqdm.pandas(desc="Identifying Entities")
test_df['pg_eng_gpe'] = test_df.cleaned_pg_eng_text.apply(get_entities, nlp=eng_nlp, n=10000, entity_type="GPE")


Processing Chunks: 100%|██████████| 251/251 [02:01<00:00,  2.06it/s]
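One caveat with `chunk_text` is that it cuts every `n` characters, so a word or entity name can be split across two chunks. A minimal sketch of an alternative that prefers to break at whitespace is shown here; the function name `chunk_text_on_whitespace` is hypothetical, and it is not the approach used to produce the results in this notebook.

```python
def chunk_text_on_whitespace(text, n=10000):
    """Yield chunks of at most n characters, preferring to break at whitespace."""
    start = 0
    while start < len(text):
        end = min(start + n, len(text))
        if end < len(text):
            # Look backwards for the last space or newline inside the window
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the current chunk
        # If no whitespace was found, fall back to a hard cut at n characters
        yield text[start:end]
        start = end

chunks = list(chunk_text_on_whitespace("Don Quixote rode out of La Mancha at dawn", n=15))
print(chunks)
```

Joining the chunks gives back the original text, so this could be swapped in for `chunk_text` inside `get_entities` without changing anything else.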


In [15]:
# explode the list of entities into separate rows
exploded_test_df = test_df.explode('pg_eng_gpe')
# group by the entities and count the number of times they appear
gpe_counts = exploded_test_df.groupby('pg_eng_gpe').size().reset_index(name='counts')
# sort the entities by the number of times they appear
gpe_counts = gpe_counts.sort_values(by='counts', ascending=False)
# plot the top 10 entities
alt.Chart(gpe_counts[0:10]).mark_bar().encode(
	y=alt.Y('pg_eng_gpe', sort='-x'),
	x='counts'
)
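The counting step above can also be sketched without pandas, which makes it easier to see what happens to each entity string. The example below uses a made-up list of entities and a hypothetical helper, `count_entities`; collapsing whitespace and ignoring case are assumptions about what counts as "the same place", not something spaCy does for you.

```python
from collections import Counter

def count_entities(entities):
    """Count entity strings after light normalization."""
    # Assumption: collapsing whitespace and ignoring case is acceptable for a first look
    normalized = (" ".join(ent.split()).casefold() for ent in entities)
    return Counter(name for name in normalized if name)

# Hypothetical entity list standing in for one row of `pg_eng_gpe`
sample_entities = ["La Mancha", "la  Mancha", "Spain", "Toledo", "Spain", "  "]
counts = count_entities(sample_entities)
print(counts.most_common(2))
```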

Now we are starting to see how NER might help us explore our cultural data. You'll notice `se\xc3\xb1or` in the results: `\xc3\xb1` is the escaped UTF-8 byte sequence for `ñ` (`c3` and `b1` are the hex values of its two bytes), so the term is really `señor`. It's unclear why spaCy labels this as a `GPE` entity, but the likely cause is a limitation of the model (remember we're using an English model). We could try experimenting with other spaCy models to see if this changes, but fundamentally this shows both the power and the danger of relying on these models.

#### What to try next?

1. Try using a different spaCy model and see how the results change.
2. Try running it on the full dataset and see what you find.
3. Try spaCy's parts of speech tagging and see if you can find any interesting patterns, like which pronouns are most common in these novels.

## TF-IDF Code

To run TF-IDF we'll be using the `scikit-learn` library. You can find the documentation here [https://scikit-learn.org/stable/](https://scikit-learn.org/stable/). In particular, we'll be using the `TfidfVectorizer` class [https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_matrix = vectorizer.fit_transform(combined_novels_nyt_df.cleaned_pg_eng_text.fillna(''))

# Convert the TF-IDF matrix to a DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Add the titles to the DataFrame
tfidf_df['title'] = combined_novels_nyt_df['title'].values

# Melt the DataFrame to get a long format DataFrame with terms and scores
melted_tfidf_df = tfidf_df.melt(id_vars=['title'], var_name='term', value_name='score')

# Sort the DataFrame by score in descending order
sorted_tfidf_df = melted_tfidf_df.sort_values(by='score', ascending=False)

# Display the top 10 results
sorted_tfidf_df.head(10)

| | title | term | score |
|---|---|---|---|
| 115304890 | The Last Days of Pompeii | the | 0.730242 |
| 115304767 | Germinal | the | 0.700247 |
| 115304868 | Nostromo | the | 0.69957 |
| 115304932 | Death In Venice | the | 0.687305 |
| 115304523 | The Grapes of Wrath | the | 0.670729 |
| 115304601 | The War of the Worlds | the | 0.668319 |
| 115304796 | The Phantom of the Opera | the | 0.667619 |
| 115304750 | Death Comes for the Archbishop | the | 0.659896 |
| 115304525 | The Last of the Mohicans | the | 0.65986 |
| 115304564 | A Journey to the Center of the Earth | the | 0.653512 |


You'll notice that our top results are all the word `the`. This is exactly the issue with common words and why removing stop words is so popular. So we can try removing the stop words and see what results we get.
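To see why `the` can still come out on top, it helps to compute the scores by hand. With scikit-learn's default settings, the inverse document frequency is smoothed as `ln((1 + n) / (1 + df)) + 1`, so a word that appears in every document keeps a weight of 1 rather than dropping to 0, and its high raw count dominates. The sketch below is a simplified pure-Python illustration of that formula on two made-up sentences; it is not scikit-learn's implementation, and `tfidf_scores` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Compute L2-normalized TF-IDF weights for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1, so idf is never below 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        results.append({term: w / norm for term, w in weights.items()})
    return results

# Made-up example documents
docs = [
    "the knight and the squire left the inn",
    "the whale sank the ship and the boats",
]
all_scores = tfidf_scores(docs)
for scores in all_scores:
    print(max(scores, key=scores.get))
```

In both toy documents the top-scoring term is `the`, for the same reason it tops the table above.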

Indeed, `TfidfVectorizer` has a number of parameters that you can set, including `stop_words`, `max_df`, `min_df`, and `ngram_range`. Let's experiment with some of these parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Let's try out some different parameters
tfidf = TfidfVectorizer(stop_words='english', max_df=0.5, min_df=2, ngram_range=(1, 2))
tfidf_matrix = tfidf.fit_transform(combined_novels_nyt_df['cleaned_pg_eng_text'].fillna(''))
```
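Of these parameters, `ngram_range` is perhaps the least obvious. Setting it to `(1, 2)` means the vocabulary includes both single words and pairs of adjacent words. The sketch below illustrates the idea in plain Python; `word_ngrams` is a hypothetical helper, not scikit-learn's own code.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams whose length falls within ngram_range (inclusive)."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams(["heart", "of", "darkness"]))
# → ['heart', 'of', 'darkness', 'heart of', 'of darkness']
```

Because every extra n-gram length adds many new terms, widening `ngram_range` can make the TF-IDF matrix much larger.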

In [23]:
vectorizer = TfidfVectorizer(stop_words="english")

# Fit and transform the text data
tfidf_matrix = vectorizer.fit_transform(combined_novels_nyt_df.cleaned_pg_eng_text.fillna(''))

# Convert the TF-IDF matrix to a DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Add the titles to the DataFrame
tfidf_df['title'] = combined_novels_nyt_df['title'].values

# Melt the DataFrame to get a long format DataFrame with terms and scores
melted_tfidf_df = tfidf_df.melt(id_vars=['title'], var_name='term', value_name='score')

# Sort the DataFrame by score in descending order
sorted_tfidf_no_stopwords_df = melted_tfidf_df.sort_values(by='score', ascending=False)

# Display the top 10 results
sorted_tfidf_no_stopwords_df.head(10)

| | title | term | score |
|---|---|---|---|
| 38042168 | Heidi | heidi | 0.821331 |
| 107126913 | A Christmas Carol | scrooge | 0.819125 |
| 8154752 | This Side of Paradise | amory | 0.786431 |
| 16185827 | Candide | candide | 0.748896 |
| 65341057 | Resurrection | nekhludoff | 0.732123 |
| 124364958 | Heart of Darkness | x80 | 0.684902 |
| 141660306 | Heart of Darkness | xe2 | 0.684902 |
| 141659937 | Uncle Tom's Cabin | xe2 | 0.684411 |
| 124364589 | Uncle Tom's Cabin | x80 | 0.684411 |
| 124364572 | The Adventures of Tom Sawyer | x80 | 0.684342 |


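Now we are getting words that are more specific to each novel, though some of the top terms are still debris such as `x80` and `xe2`. These are fragments of escaped UTF-8 byte sequences like `\xe2\x80\x99` (the right single quotation mark `’`), which the tokenizer splits at each backslash. We could remove these sequences from our text or use more of the `TfidfVectorizer` parameters to make them less important. We can also set a custom token pattern that keeps only alphabetic tokens, which drops fragments like `x80`.

```python
# Keep only tokens made of two or more letters, so fragments such as `x80` are dropped
tfidf = TfidfVectorizer(stop_words='english', max_df=0.5, min_df=2, ngram_range=(1, 2), token_pattern=r'(?u)\b[^\W\d_]{2,}\b')
tfidf_matrix = tfidf.fit_transform(combined_novels_nyt_df['cleaned_pg_eng_text'].fillna(''))
```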
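Since `token_pattern` is an ordinary regular expression, a quick way to check what a pattern keeps is to try it with Python's `re` module before handing it to the vectorizer. The sketch below compares scikit-learn's documented default pattern with a letters-only pattern; the letters-only pattern and the sample sentence are assumptions chosen for illustration.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"        # scikit-learn's documented default
LETTERS_ONLY = r"(?u)\b[^\W\d_]{2,}\b"    # assumption: keep runs of two or more letters

# Made-up sample containing an escaped UTF-8 sequence like those in our text
sample = r"Marlow\xe2\x80\x99s steamboat reached the 2 stations".lower()

default_tokens = re.findall(DEFAULT_PATTERN, sample)
letter_tokens = re.findall(LETTERS_ONLY, sample)
print(default_tokens)
print(letter_tokens)
```

The default pattern keeps `xe2`, `x80`, and `x99s` as tokens, while the letters-only pattern drops them because each fragment mixes letters and digits.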

Alternatively, we could use spaCy to help us clean this text. Let's try that and see if we get better results.

```python
# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Define a function to clean the text with spaCy
def clean_text_spacy(text):
    # Process the text with spaCy
    doc = nlp(text)
    # Remove stopwords, punctuation, and non-alphabetic tokens
    tokens = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
    # Join tokens back to a single string
    return ' '.join(tokens)

# Apply the cleaning function to the text data
combined_novels_nyt_df['cleaned_text'] = combined_novels_nyt_df['cleaned_pg_eng_text'].fillna('').apply(clean_text_spacy)
```

In [None]:
def chunk_text(text, n=10000):
    return (text[i:i+n] for i in range(0, len(text), n))

# Define a function to clean text with spaCy, processing in chunks
def clean_text_spacy(text, nlp, chunk_size=10000):
    cleaned_chunks = []
    chunks = list(chunk_text(text, chunk_size))
    
    # Chunk the text and process each chunk
    for chunk in tqdm(chunks, desc="Processing Chunks", leave=False):
        doc = nlp(chunk)  # Process each chunk with spaCy
        # Remove stopwords, punctuation, and non-alphabetic tokens, and apply lemmatization
        tokens = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
        # Join tokens of the current chunk and add to cleaned chunks
        cleaned_chunks.append(' '.join(tokens))
    
    # Join all chunks into a single string (optional, based on your needs)
    return ' '.join(cleaned_chunks)

# Assuming combined_novels_nyt_df and eng_nlp are already defined
tqdm.pandas(desc="Cleaning Text")
combined_novels_nyt_df['cleaned_pg_eng_text_spacy'] = combined_novels_nyt_df.cleaned_pg_eng_text.fillna('').progress_apply(clean_text_spacy, nlp=eng_nlp)


In [None]:
vectorizer = TfidfVectorizer(stop_words="english", min_df=1, max_df=0.7)

# Fit and transform the text data
tfidf_matrix = vectorizer.fit_transform(combined_novels_nyt_df.cleaned_pg_eng_text_spacy.fillna(''))

# Convert the TF-IDF matrix to a DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Add the titles to the DataFrame
tfidf_df['title'] = combined_novels_nyt_df['title'].values

# Melt the DataFrame to get a long format DataFrame with terms and scores
melted_tfidf_df = tfidf_df.melt(id_vars=['title'], var_name='term', value_name='score')

# Sort the DataFrame by score in descending order
sorted_tfidf_no_stopwords_min_max_df = melted_tfidf_df.sort_values(by='score', ascending=False)

# Display the top 10 results
sorted_tfidf_no_stopwords_min_max_df.head(10)

| | title | term | score |
|---|---|---|---|
| 44672498 | The Hunchback of Notre Dame | n | 0.992497 |
| 44672555 | The Hound of the Baskervilles | n | 0.988642 |
| 44672495 | The Count of Monte Cristo | n | 0.951079 |
| 44672602 | The Sun Also Rises | n | 0.950076 |
| 44672542 | Sons and Lovers | n | 0.946033 |
| 44672446 | Jane Eyre | n | 0.944493 |
| 44672469 | Les Misérables | n | 0.941966 |
| 44672520 | A Journey to the Center of the Earth | n | 0.941441 |
| 44672492 | Ulysses | n | 0.939559 |
| 44672491 | Ulysses | n | 0.939559 |


You'll notice that I've set `min_df=1` and `max_df=0.7`. According to the scikit-learn documentation for `TfidfVectorizer`, `min_df` is the minimum document frequency and `max_df` is the maximum document frequency; an integer is read as an absolute number of documents and a float as a proportion of the corpus. So here we keep words that appear in at least 1 document (the default, which filters nothing) and in at most 70% of the documents. Raising `min_df` to 2 would also drop words that occur in only a single novel. So you can imagine that these parameters can determine a lot of our analysis.
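The effect of these two thresholds is easier to see on a tiny corpus. The sketch below filters a vocabulary by document frequency in plain Python; `filter_vocabulary` and the four toy documents are hypothetical, and the code only illustrates the idea behind `min_df` and `max_df` rather than reproducing scikit-learn's implementation.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies between min_df and max_df."""
    n_docs = len(tokenized_docs)
    # Count each term once per document it appears in
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    # Integers are absolute document counts; floats are proportions of the corpus
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in df.items() if low <= count <= high)

# Toy corpus: "ship" is in all four documents, "sea" in three
docs = [
    ["ship", "sea", "whale"],
    ["ship", "sea", "island"],
    ["ship", "moor", "hound"],
    ["ship", "sea", "storm"],
]
print(filter_vocabulary(docs, min_df=1, max_df=0.7))
print(filter_vocabulary(docs, min_df=2, max_df=1.0))
```

With `max_df=0.7`, the widespread terms `ship` and `sea` are removed; with `min_df=2`, only those two terms remain. The same corpus gives very different vocabularies depending on these settings.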

In [None]:
vectorizer = TfidfVectorizer(stop_words="english", min_df=1, max_df=0.9, ngram_range=(1, 2), max_features=1000)

# Fit and transform the text data
tfidf_matrix = vectorizer.fit_transform(combined_novels_nyt_df.cleaned_pg_eng_text_spacy.fillna(''))

# Convert the TF-IDF matrix to a DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Add the titles to the DataFrame
tfidf_df['title'] = combined_novels_nyt_df['title'].values

# Melt the DataFrame to get a long format DataFrame with terms and scores
melted_tfidf_df = tfidf_df.melt(id_vars=['title'], var_name='term', value_name='score')

# Sort the DataFrame by score in descending order
sorted_tfidf_no_stopwords_min_max_ngram_df = melted_tfidf_df.sort_values(by='score', ascending=False)

# Display the top 10 results
sorted_tfidf_no_stopwords_min_max_ngram_df.head(10)