# Categorical Word Frequencies


In this notebook, we explore some text data and compile the top N most frequently occurring terms within each categorical group.

### Import dependencies

In [None]:
import pandas as pd
import math
import matplotlib.pyplot as plt
from matplotlib import gridspec

### Load text dataset from scikit-learn's [`fetch_20newsgroups`](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)

scikit-learn's [`fetch_20newsgroups`](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) loads a pre-compiled dataset of newsgroup posts that (as its name suggests) spans 20 different topic categories.

In [None]:
from sklearn.datasets import fetch_20newsgroups

# define which categories we'd like to use
topic_categories = [
    'alt.atheism',
    'comp.graphics',
    'comp.sys.ibm.pc.hardware',
    'comp.windows.x',
    'misc.forsale',
    'rec.autos',
    'sci.space',
    'rec.motorcycles',
    'rec.sport.baseball',
    'sci.crypt']

# remove unnecessary components of each record to focus on just the 
# text body and filter by categories defined above
news = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), categories=topic_categories)

# documents
docs = news.data

# categories
cats = news.target

# Convert to pandas DataFrame
df = pd.DataFrame({"body": docs, "category": [news.target_names[x] for x in cats]})
df.head()

### How many documents per category?

In [None]:
df['category'].value_counts()

### Use pandas `apply` to broadcast the `split` function to every row's _body_ column

The pandas `apply` method calls a function on every value of a Series (here, one column of a DataFrame). Within the call to `apply`, `lambda` acts much like a JavaScript _arrow_ function: it is an abbreviated way to write a small, unnamed function. Below, `body` represents each row's text value in the _body_ column, and `body.split()` splits that string on whitespace (spaces, tabs and newlines) into a list of individual tokens.
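
As a minimal, self-contained sketch of the same pattern, the toy DataFrame below shows that a `lambda` and a named function passed to `apply` give the same result, and that `split()` with no arguments discards runs of whitespace, including newlines. The sample strings and the `tokenize` helper are illustrative only and are not part of the dataset.

```python
import pandas as pd

# toy data standing in for the newsgroup bodies (illustrative only)
toy = pd.DataFrame({"body": ["hello  world", "one\ntwo three"]})

def tokenize(body):
    """Named-function equivalent of `lambda body: body.split()`."""
    return body.split()

# both calls run the function once per value in the column
with_lambda = toy["body"].apply(lambda body: body.split())
with_named = toy["body"].apply(tokenize)

print(with_lambda.tolist())
```

Either form works; a named function is usually easier to reuse and test once the logic grows beyond a single expression.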

In [None]:
df['body_tokens'] = df['body'].apply(lambda body : body.split())

df.head()

### Group dataframe by category and combine each record's list of tokens

Performing a _sum_ aggregation on a column that contains lists merges each group's lists into one, because adding two lists with `+` concatenates them.
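
The sketch below illustrates that behaviour on toy data (all names and values are illustrative). It first shows the plain-Python equivalent, then an alternative aggregation that flattens each group with `itertools.chain`. Repeated `+` copies the growing list at every step, so the `chain` variant may be faster on large groups; that is worth timing on your own data rather than taking as given.

```python
from itertools import chain
import pandas as pd

toy = pd.DataFrame({
    "category": ["a", "a", "b"],
    "body_tokens": [["x", "y"], ["z"], ["w"]],
})

# plain Python: summing lists with an empty-list start value concatenates them
merged = sum([["x", "y"], ["z"]], [])

# alternative aggregation: flatten each group's lists in a single pass
flat = toy.groupby("category")["body_tokens"].agg(
    lambda lists: list(chain.from_iterable(lists))
)

print(merged)
print(flat.to_dict())
```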

In [None]:
category_tokens = df.groupby('category')['body_tokens'].sum()
category_tokens

## Explore the number of tokens by category

Below we're computing 3 metrics:
1. **Number of tokens** - Calculated by simply finding the length of each category's list of tokens
2. **Number of _unique_ tokens** - Calculated by first reducing the list of tokens down to unique values using the `set` function, then finding the length
3. **Lexical Diversity** - Ratio of unique terms to total terms
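
To make the three metrics concrete, here is a small worked example on a single made-up sentence (the sentence is illustrative and not drawn from the dataset):

```python
tokens = "the cat sat on the mat".split()

total = len(tokens)          # 6 tokens in total
unique = len(set(tokens))    # 5 distinct tokens ("the" appears twice)
lexical_diversity = unique / total

print(total, unique, round(lexical_diversity, 3))
```

A value close to 1 means few repeated tokens, while a value close to 0 means the same tokens recur often.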

In [None]:
explore_df = pd.DataFrame({"Total Number of Tokens": category_tokens.apply(lambda x: len(x)),
                        "Number of Unique Tokens": category_tokens.apply(lambda x: len(set(x)))})

explore_df["Lexical Diversity"] = explore_df['Number of Unique Tokens'] / explore_df['Total Number of Tokens']

explore_df

### Plot the Lexical Diversity

Keep in mind that we don't know the origin of this data, or the number of authors that generated the underlying records, so conclusions based purely on the aggregate-level diversity scores may be skewed.

In [None]:
explore_df['Lexical Diversity'].sort_values().plot(kind="barh", 
                                                   xlim=(0, max(explore_df['Lexical Diversity'].values)*1.1), 
                                                   figsize=(10,10), 
                                                   fontsize=15, 
                                                   title="Lexical Diversity by Category")
# add labels to plot
for j, v in enumerate(explore_df['Lexical Diversity'].sort_values()):
    plt.text(0.05, j, str(round(v,4)), color='white', fontweight='bold', va='center')

### Count the frequencies of each term in the word lists and return the top n most frequent

Below we're using the [`Counter`](https://docs.python.org/3/library/collections.html#collections.Counter) class, which takes an iterable and returns a dictionary-like object mapping each unique token to its frequency. Then, we're using a combination of `sorted` and `operator.itemgetter` to sort the dictionary's items by their values in descending order, as opposed to sorting by the keys.
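
As a small sketch of that sorting step on made-up tokens, the snippet below also compares the result with `Counter.most_common`, a built-in shortcut that returns the n most frequent items. The token string and variable names are illustrative.

```python
from collections import Counter
import operator

tokens = "a b a c b a".split()
counts = Counter(tokens)

# reverse sort of (token, count) pairs by the count stored at position 1
by_sorted = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)[:2]

# Counter's built-in shortcut for the n most frequent items
by_most_common = counts.most_common(2)

print(by_sorted)
```

Both approaches return a list of `(token, count)` tuples, so either could feed the plotting code later in the notebook.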

In [None]:
from collections import Counter
import operator

def wordListToFreqList(wordlist, top_n=10):
    """Return the top_n most frequent words as a list of (word, count) tuples"""
    
    # count each term's number of occurrences
    freqDict = Counter(wordlist)
    
    # sort the frequency dictionary by its values descending and return the items as a list of tuples
    sortedFreqs = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    
    return sortedFreqs[:top_n]

freqs = category_tokens.apply(lambda tokens: wordListToFreqList(tokens))
freqs

### Plot the most frequently occurring terms for each category

In [None]:
# Using gridspec allows us to dynamically add subplots in a grid
N = len(freqs.keys())
cols = 2
rows = int(math.ceil(N / cols))
gs = gridspec.GridSpec(rows, cols)

# define the figure space for the plots
fig = plt.figure()
fig.set_figheight(N*2)
fig.set_figwidth(20)

# iterate over number of categories to plot each one's top terms
for i in range(N):
    
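    # add a plot to the figure
    ax = fig.add_subplot(gs[i])
    # freqs is indexed by category name, so read the title from its index
    ax.set_title(f"Most Frequent Words for: {freqs.index[i]}", fontsize=14, fontweight='bold')

    # break the terms and term counts into two tuples
    # (iloc selects by position; freqs[i] would be treated as a label lookup on a string index)
    x,y = zip(*freqs.iloc[i])
    # plot the data
    ax.bar(x,y)
    # increase x-label font size on this subplot
    ax.tick_params(axis='x', labelsize=14)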
    # place numeric label on the bar
    for j, v in enumerate(y):
        ax.text(j, v/2, str(v), color='white', fontweight='bold', ha='center')
        
fig.tight_layout()
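
Two small idioms in the plotting cell above may be worth seeing in isolation: `zip(*pairs)` transposes a list of `(term, count)` tuples into separate sequences for the x and y values, and `math.ceil` gives the number of grid rows needed for a given number of subplots. The values below are illustrative, not taken from the dataset.

```python
import math

pairs = [("the", 12), ("of", 7), ("and", 5)]  # illustrative (term, count) tuples

# zip(*pairs) transposes the pairs into one tuple of terms and one of counts
terms, counts = zip(*pairs)

# number of grid rows needed to hold n_plots subplots in `cols` columns
n_plots, cols = 5, 2
rows = math.ceil(n_plots / cols)

print(terms, counts, rows)
```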