# Counting Words

Word counts in a document are a good starting point for descriptive analysis, visualization, and predictive modeling. For example, word counts are the data behind [Google Search Trends](https://trends.google.com/home) and the [Google NGram Viewer](https://books.google.com/ngrams/).

Across search trends, the ngram viewer, and similar tools, the goal is to provide a broad picture of what is being discussed, searched for, or mentioned.

Applying this same idea to municipal legislative data, we might try to understand which general topics are commonly discussed in a council by looking at the core words for each topic. For example, if the words "housing", "rent", and "affordability" are all spoken frequently in a city council meeting, we can broadly infer that those three terms (and more general ideas about "housing affordability") were important topics of the meeting.

So for this chapter, our goal is to **find and compare common topics discussed in a city council meeting, via word counts.**

## Let's Count

Counting words is such a common task that `scikit-learn` has a whole class for doing just that, called [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
count_vec = CountVectorizer()

# An example sentence we will count the words in
example = "Hello my name is Eva and I am commenting today in opposition of this bill."

# Get counts
counts = count_vec.fit_transform([example])
counts

<1x14 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

Don't be afraid of the "sparse matrix"; it is just an efficient storage mechanism that scikit-learn uses, keeping only the non-zero entries in memory. We can also view the results as a pandas dataframe by doing the following.

In [2]:
import pandas as pd

counts_df = pd.DataFrame(
    counts.toarray(),  # convert from sparse matrix to numpy
    columns=count_vec.get_feature_names_out(),  # store words as the column names
)
counts_df

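|   | am | and | bill | commenting | eva | hello | in | is | my | name | of | opposition | this | today |
|:--|:---|:----|:-----|:-----------|:----|:------|:---|:---|:---|:-----|:---|:-----------|:-----|:------|
| 0 | 1  | 1   | 1    | 1          | 1   | 1     | 1  | 1  | 1  | 1    | 1  | 1          | 1    | 1     |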


Let's take a moment to notice that all of the words have been lowercased.
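The table also shows that the one-letter word "I" is missing. The `CountVectorizer` documentation lists the default `token_pattern` as `r"(?u)\b\w\w+\b"`, which only keeps tokens of two or more word characters. As a rough sketch, we can apply that same pattern with Python's `re` module to our example sentence. This only approximates the tokenization step, not the whole scikit-learn pipeline, and the variable name `default_token_pattern` is ours.

```python
import re

example = "Hello my name is Eva and I am commenting today in opposition of this bill."

# The documented default token pattern: two or more word characters
default_token_pattern = r"(?u)\b\w\w+\b"

# Lowercase first (the default behavior), then extract tokens
tokens = re.findall(default_token_pattern, example.lower())

print(tokens)
print(len(tokens))
```

The single-character token "i" does not match the pattern, which is why the dataframe has 14 columns rather than 15.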

There are many decisions to make when counting, and we recommend reading the documentation for [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to see all of the options. We will highlight a few throughout the rest of this chapter.

First, as already noted, lowercasing is something you can enable or disable. The choice depends on how specific you want your counts to be: do you want "House" and "house" to be counted separately? The only difference between the two might be that one is the first word in a sentence, but there are many other cases where casing is important to keep.

In [3]:
# An example of turning off lowercasing
count_vec_no_lowercase = CountVectorizer(lowercase=False)
pd.DataFrame(
    count_vec_no_lowercase.fit_transform([example]).toarray(),
    columns=count_vec_no_lowercase.get_feature_names_out(),
)

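|   | Eva | Hello | am | and | bill | commenting | in | is | my | name | of | opposition | this | today |
|:--|:----|:------|:---|:----|:-----|:-----------|:---|:---|:---|:-----|:---|:-----------|:-----|:------|
| 0 | 1   | 1     | 1  | 1   | 1    | 1          | 1  | 1  | 1  | 1    | 1  | 1          | 1    | 1     |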


## Ngrams

Second, all of the words are counted individually, i.e. a single word gets a single count. What if we want to track pairs of words, three-word groups, or `N`-word groups in general? This is the idea behind n-grams: a contiguous sequence of `n` items from a given sample of text or speech. The items can be words, letters, syllables, etc.

For example, this type of tracking might be useful when we want to look for "housing affordability" as a single item instead of "housing" and "affordability" as separate items.
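To make the definition concrete, here is a minimal, library-free sketch of building word n-grams by sliding a window of size `n` over a list of tokens. The helper name `word_ngrams` and the sample sentence are ours, for illustration only.

```python
def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by a space."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


tokens = "housing affordability is a priority".split()

unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

With bigrams, "housing affordability" becomes a single countable item, while the unigrams still count "housing" and "affordability" separately.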

We can use `CountVectorizer` to construct any range of ngrams.

In [4]:
# Allow unigrams to trigrams (1 word to 3 word items)
count_vec_uni_tri = CountVectorizer(ngram_range=(1, 3))
pd.DataFrame(
    count_vec_uni_tri.fit_transform([example]).toarray(),
    columns=count_vec_uni_tri.get_feature_names_out(),
)

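|   | am | am commenting | am commenting today | and | and am | and am commenting | bill | commenting | commenting today | commenting today in | ... | of this | of this bill | opposition | opposition of | opposition of this | this | this bill | today | today in | today in opposition |
|:--|:---|:--------------|:--------------------|:----|:-------|:------------------|:-----|:-----------|:-----------------|:--------------------|:----|:--------|:-------------|:-----------|:--------------|:-------------------|:-----|:----------|:------|:---------|:--------------------|
| 0 | 1  | 1             | 1                   | 1   | 1      | 1                 | 1    | 1          | 1                | 1                   | ... | 1       | 1            | 1          | 1             | 1                  | 1    | 1         | 1     | 1        | 1                   |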


Note: as you increase your ngram range, you gain specificity at the cost of memory because you simply have to keep track of many more distinct grams.
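As a rough illustration of that cost, the sketch below counts how many distinct grams our one example sentence produces as the upper end of the ngram range grows. It is written in plain Python and assumes tokens of two or more word characters, as in the default `CountVectorizer` settings. The name `vocabulary_size_by_max_n` is ours.

```python
import re

example = "Hello my name is Eva and I am commenting today in opposition of this bill."

# Tokens of two or more word characters, lowercased
tokens = re.findall(r"(?u)\b\w\w+\b", example.lower())

vocabulary = set()
vocabulary_size_by_max_n = {}
for n in range(1, 4):
    # Add every contiguous n-token gram to the running vocabulary
    for i in range(len(tokens) - n + 1):
        vocabulary.add(" ".join(tokens[i : i + n]))
    vocabulary_size_by_max_n[n] = len(vocabulary)

print(vocabulary_size_by_max_n)
```

Even for a single short sentence, going from unigrams only to unigrams through trigrams nearly triples the number of columns we have to store, and the growth is much larger on full transcripts.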

## Working With Transcript Data

To get a better understanding of how to work with count data, let's use city council meeting transcript data made available by the Council Data Project.

To do so, let's first pull a small sample of meeting data, read the transcripts, and construct ngram counts.

In [None]:
from cdp_data import CDPInstances, datasets

df = datasets.get_session_dataset(
    CDPInstances.Seattle,
    start_datetime="2020-01-01",
    end_datetime="2021-01-01",
    store_transcript=True,
    store_transcript_as_csv=True,
    raise_on_error=False,
)
df.columns, df.shape


## How to Count Words

Let's use a single transcript as an example.

In [None]:
import pandas as pd

example_session = pd.read_csv(df.iloc[0].transcript_as_csv_path)
example_session

### Scikit Learn Count Vectorizer

If we want to track word counts over time, what we want is a large table with one row per session and word, along with that word's count.

Something like this:

| session_id | session_datetime | word | count |
|:-----------|:-----------------|:-------|:----|
| abcd-1234  | 2020-01-01       | hello  | 4   |
| abcd-1234  | 2020-01-01       | world  | 2   |
| ...        | ...              | ...    | ... |
| abcd-1234  | 2020-01-01       | ramen  | 2   |
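As a rough, library-free sketch of that target shape, the block below counts the words of one made-up session with `collections.Counter` and emits one row per word. The session id, date, and text are hypothetical values chosen to mirror the table above.

```python
from collections import Counter

# Hypothetical session metadata and text, for illustration only
session_id = "abcd-1234"
session_datetime = "2020-01-01"
session_text = "hello world hello ramen hello world ramen hello"

# Count each whitespace-separated word
word_counts = Counter(session_text.split())

# One row per (session, word) pair, matching the table layout
rows = [
    {
        "session_id": session_id,
        "session_datetime": session_datetime,
        "word": word,
        "count": count,
    }
    for word, count in word_counts.items()
]

for row in rows:
    print(row)
```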

Scikit Learn (`sklearn`) has a class, `CountVectorizer`, that makes this fairly simple. We just need to combine all of the text for a single session into a single string.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
all_session_text = " ".join(example_session.text)
counts = vectorizer.fit_transform([all_session_text])
vectorizer.get_feature_names_out()[:20]

## Words?

The `CountVectorizer` counts every token it finds, so some of the items counted are numbers and some are words. To see the count for each item, we can pair the feature names with the matching columns of the count matrix.

In [None]:
# Pair the first 20 feature names with their counts in the single transcript
for word, count in zip(
    vectorizer.get_feature_names_out()[:20],
    counts[0, :20].toarray()[0],
):
    print(f"'{word}' was used {count} times in the meeting")

In this transcript, "able" and "about" were used 12 and 11 times, respectively. Let's repeat this counting process for all of the meetings and store the data in a dataframe.

## Pass In All Documents At Once

In [None]:
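texts = []
for _, row in df.iterrows():
    # Because raise_on_error=False was used above, some sessions may not
    # have a readable transcript; read inside the try block and skip those
    try:
        session_transcript = pd.read_csv(row.transcript_as_csv_path)
        texts.append({
            "session_id": row.id,
            "session_datetime": row.session_datetime,
            # Drop empty sentences so the join only sees strings
            "session_text": " ".join(session_transcript.text.dropna().astype(str)),
        })
    except Exception:
        continue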

text_df = pd.DataFrame(texts)
text_df

In [None]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(text_df.session_text)
count_df = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
count_df["session_id"] = text_df["session_id"]
count_df["session_datetime"] = text_df["session_datetime"]
count_df

In [None]:
count_df = count_df.melt(id_vars=["session_id", "session_datetime"], var_name="word", value_name="word_count")
count_df

## Most Common Words (Average Across All Meetings)

Filler words are the most common.

In [None]:
count_df.groupby("word")["word_count"].mean().nlargest(20)
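Filler words like these are often called stop words, and one common response is to drop them before ranking. `CountVectorizer` accepts a `stop_words` parameter for this purpose (for example `stop_words="english"`). The sketch below shows the idea in plain Python using a deliberately tiny, hypothetical stop-word list and a made-up sentence, so the specific words and counts are illustrative only.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration
stop_words = {"the", "to", "and", "of", "that", "we", "is"}

text = "we want the council to fund housing and the council is ready to vote on housing"
tokens = text.split()

# Counts with and without the stop words
all_counts = Counter(tokens)
content_counts = Counter(token for token in tokens if token not in stop_words)

print(all_counts.most_common(4))
print(content_counts.most_common(4))
```

After filtering, the top of the ranking is made up of topical words such as "council" and "housing" rather than "the" and "to".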

## Plotting

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="darkgrid")

# Select the word we are interested in
housing_counts = count_df[count_df["word"] == "housing"]

# Plot the raw count of the word in each session over time
sns.lineplot(
    x="session_datetime",
    y="word_count",
    data=housing_counts,
)
_ = plt.xticks(rotation=35, ha="right")

## Why Is This Deceptive?

1. Meetings have different lengths. A longer meeting has more words overall, so raw counts are not directly comparable.
2. Discussion doesn't happen every day, so the series needs to be smoothed somehow.
3. "Housing" doesn't include "house", "houses", etc.
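To make the first point concrete, here is a small worked example with two made-up meetings. The word totals and "housing" counts are hypothetical numbers chosen only to show how raw counts and relative usage can disagree.

```python
# Hypothetical meetings: total words spoken and times "housing" was said
meetings = {
    "long_meeting": {"total_words": 10_000, "housing_count": 50},
    "short_meeting": {"total_words": 2_000, "housing_count": 20},
}

# Share of each meeting's words that were "housing"
shares = {
    name: meeting["housing_count"] / meeting["total_words"]
    for name, meeting in meetings.items()
}

for name, share in shares.items():
    print(f"{name}: {meetings[name]['housing_count']} uses, {share:.2%} of all words")
```

By raw count the long meeting mentions "housing" more often (50 versus 20), yet the word takes up twice as large a share of the short meeting (1% versus 0.5%).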

## Possible Solution: Make Each Word Count a "Percentage" of Total Words for The Meeting

In [None]:
count_df["percent_use_in_meeting"] = count_df["word_count"] / count_df.groupby("session_id")["word_count"].transform("sum")
count_df

## Replot

Interpretation: the y-axis is now the share of each session's words accounted for by this word. The values are fractions between 0 and 1, so a value of ~0.01 means that the word made up 1% of the words used (about 1 in every 100 words in the meeting was "housing").

In [None]:
# Select the word we are interested in
housing_counts = count_df[count_df["word"] == "housing"]

# Plot the word's share of each session's words over time
sns.lineplot(
    x="session_datetime",
    y="percent_use_in_meeting",
    data=housing_counts,
)
_ = plt.xticks(rotation=35, ha="right")

## Possible Solution: Rolling Mean Over One Month

In [None]:
rolling_30_days = count_df.set_index("session_datetime").sort_index(ascending=True).groupby("word").rolling("30D").agg({
    "percent_use_in_meeting": "mean"
}).reset_index()
rolling_30_days
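To show what a time-based rolling mean computes, here is a plain-Python sketch of a 30-day window. It assumes unique timestamps and a window that is closed on the right, i.e. each point is averaged with everything in the preceding 30 days up to and including itself, which is our understanding of the pandas default for offset windows such as `"30D"`. The function name `rolling_mean` and the observations are hypothetical.

```python
from datetime import datetime, timedelta


def rolling_mean(points, window=timedelta(days=30)):
    """For each timestamp t, average the values with timestamps in (t - window, t]."""
    points = sorted(points)
    results = []
    for current_time, _ in points:
        in_window = [
            value
            for timestamp, value in points
            if current_time - window < timestamp <= current_time
        ]
        results.append((current_time, sum(in_window) / len(in_window)))
    return results


# Hypothetical (session_datetime, share of words that were "housing") pairs
observations = [
    (datetime(2020, 1, 1), 0.010),
    (datetime(2020, 1, 15), 0.020),
    (datetime(2020, 2, 20), 0.040),
]

smoothed = rolling_mean(observations)
for when, value in smoothed:
    print(when.date(), round(value, 4))
```

The January 15 value is averaged with January 1 because both fall within the same 30 days, while the February 20 session has no neighbors in its window and keeps its own value.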

In [None]:
# Select the word we are interested in
housing_counts = rolling_30_days[rolling_30_days["word"] == "housing"]

# Plot the 30-day rolling mean of the word's share over time
sns.lineplot(
    x="session_datetime",
    y="percent_use_in_meeting",
    data=housing_counts,
)
_ = plt.xticks(rotation=35, ha="right")

## Possible Solution: Stem Each Word Prior to Counting

In [None]:
# Select the word variants we are interested in (plotted separately, not yet stemmed)
housing_counts = rolling_30_days[rolling_30_days["word"].isin(["housing", "house", "houses"])]

# Plot the 30-day rolling mean for each word variant as its own line
sns.lineplot(
    x="session_datetime",
    y="percent_use_in_meeting",
    hue="word",
    data=housing_counts,
)
_ = plt.xticks(rotation=35, ha="right")
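The plot above shows the three variants as separate lines. Stemming would instead collapse them into a single item before counting. The sketch below uses a toy suffix-stripping rule that we made up for illustration; it is not a real stemming algorithm, and in practice an established stemmer (for example, the Porter stemmer available in NLTK) would be a better choice. The word list is hypothetical.

```python
from collections import Counter


def toy_stem(word):
    """Strip one common suffix, keeping at least three characters (toy rule only)."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical tokens from a meeting
words = ["housing", "house", "houses", "rent", "rents", "housing"]

raw_counts = Counter(words)
stemmed_counts = Counter(toy_stem(word) for word in words)

print(raw_counts)
print(stemmed_counts)
```

After stemming, "housing", "house", and "houses" all contribute to one count, so a single line in the plot would capture every variant.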