# ADS 509 Module 3: Group Comparison 

The task of comparing two groups of text is fundamental to textual analysis. There are innumerable applications: survey respondents from different segments of customers, speeches by different political parties, words used in Tweets by different constituencies, etc. In this assignment you will build code to effect comparisons between groups of text data, using the ideas learned in reading and lecture.

This assignment asks you to analyze the lyrics for the two artists you selected in Module 1 and the Twitter descriptions pulled for Robyn and Cher. If the results from that pull were not to your liking, you are welcome to use the zipped data from the “Assignment Materials” section. Specifically, you are asked to do the following: 

* Read in the data, normalize the text, and tokenize it. When you tokenize your Twitter descriptions, keep hashtags and emojis in your token set. 
* Calculate descriptive statistics on the two sets of lyrics and compare the results. 
* For each of the four corpora, find the words that are unique to that corpus. 
* Build word clouds for all four corpora. 

Each one of the analyses has a section dedicated to it below. Before beginning the analysis there is a section for you to read in the data and do your cleaning (tokenization and normalization). 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


In [52]:
import os
import re
import emoji
import pandas as pd

from collections import Counter
from nltk.corpus import stopwords
from string import punctuation
from wordcloud import WordCloud

In [53]:
# Speed up membership tests
punctuation = set(punctuation)
# Twitter-specific punctuation: keep hashtags
tw_punct = punctuation - {'#'}
# English stopwords
sw = set(stopwords.words('english'))
# Whitespace splitting regex
_whitespace_re = re.compile(r"\s+")
# Hashtag pattern
_hashtag_re = re.compile(r"^#[0-9A-Za-z]+")

# ----- Utility Functions -----
def contains_emoji(s):
    """Return True if any character in string is an emoji."""
    return any(emoji.is_emoji(ch) for ch in str(s))


def remove_stop(tokens):
    """Filter out English stopwords from a list of tokens."""
    return [t for t in tokens if t.lower() not in sw]


def remove_punctuation(text, punct_set=tw_punct):
    """Remove punctuation characters from text."""
    return ''.join(ch for ch in text if ch not in punct_set)


def tokenize(text):
    """Split text on whitespace (preserve hashtags & emojis)."""
    return [tok for tok in _whitespace_re.split(str(text)) if tok]


def prepare(text, pipeline):
    """Apply a sequence of transform functions to raw text."""
    data = text
    for fn in pipeline:
        data = fn(data)
    return data


def descriptive_stats(tokens, num_tokens=5, verbose=True):
    """
    Given a list of tokens, compute and optionally print:
      - total number of tokens
      - number of unique tokens
      - number of characters across tokens
      - lexical diversity = unique/total
      - top `num_tokens` most common tokens
    Returns [total, unique, diversity, characters].
    """
    total = len(tokens)
    unique = len(set(tokens))
    chars = sum(len(t) for t in tokens)
    diversity = unique / total if total else 0.0

    if verbose:
        print(f"There are {total} tokens in the data.")
        print(f"There are {unique} unique tokens in the data.")
        print(f"There are {chars} characters in the data.")
        print(f"The lexical diversity is {diversity:.3f} in the data.")
        print(f"\nThe {num_tokens} most common tokens are:")
        for tok, cnt in Counter(tokens).most_common(num_tokens):
            print(f"  {tok}: {cnt}")

    return [total, unique, diversity, chars]

## Data Ingestion

Use this section to ingest your data into the data structures you plan to use. Typically this will be a dictionary or a pandas DataFrame.

In [54]:
data_location = "/Users/bobbymarriott/Downloads/M1 Results"
twitter_folder = "twitter"
lyrics_folder  = "lyrics"

artist_twitter_files = {
    "cher": "cher_followers_data.txt",
    "robyn": "robynkonichiwa_followers_data.txt"
}

In [55]:
twitter_frames = []
for artist, fname in artist_twitter_files.items():
    path = os.path.join(data_location, twitter_folder, fname)
    df   = pd.read_csv(path, sep="\t", quoting=3, encoding="utf-8")
    df["artist"] = artist
    twitter_frames.append(df)

twitter_data = pd.concat(twitter_frames, ignore_index=True)
del twitter_frames  # tidy up

print("Loaded Twitter:", twitter_data.shape)
print(twitter_data.artist.value_counts(), "\n")

Loaded Twitter: (4353175, 8)
artist
cher     3994803
robyn     358372
Name: count, dtype: int64 



In [56]:
lyrics_data = {}
for artist in artist_twitter_files:
    artist_dir = os.path.join(data_location, lyrics_folder, artist)
    texts = []
    for fn in sorted(os.listdir(artist_dir)):
        if not fn.endswith(".txt"):
            continue
        with open(os.path.join(artist_dir, fn), encoding="utf-8") as f:
            texts.append(f.read())
    lyrics_data[artist] = texts
    print(f"Loaded {len(texts)} lyrics for {artist}")

Loaded 316 lyrics for cher
Loaded 104 lyrics for robyn


## Tokenization and Normalization

In this next section, tokenize and normalize your data. We recommend the following cleaning. 

**Lyrics** 

* Remove song titles
* Casefold to lowercase
* Remove stopwords (optional)
* Remove punctuation
* Split on whitespace

Removal of stopwords is up to you. Your descriptive statistic comparison will be different if you include stopwords, though TF-IDF should still find interesting features for you. Note that we remove stopwords before removing punctuation because the stopword set includes punctuation.
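The effect of the ordering can be seen in a small sketch. The stopword set below is a hypothetical stand-in for the NLTK English list (which likewise contains apostrophe forms such as "don't" and "you're"), and the helper names are illustrative only.

```python
from string import punctuation

# Hypothetical mini stopword list, standing in for NLTK's English list.
STOPWORDS = {"i", "don't", "you're", "the"}
PUNCT = set(punctuation)


def strip_punct(text):
    """Remove ASCII punctuation characters from a string."""
    return "".join(ch for ch in text if ch not in PUNCT)


def drop_stop(tokens):
    """Remove stopwords from a list of tokens."""
    return [t for t in tokens if t not in STOPWORDS]


text = "i don't know the way, you're lost"

# Order A: remove stopwords first, then strip punctuation from each token.
stop_first = [strip_punct(t) for t in drop_stop(text.lower().split())]
stop_first = [t for t in stop_first if t]

# Order B: strip punctuation first, then remove stopwords.
punct_first = drop_stop(strip_punct(text.lower()).split())

print(stop_first)   # ['know', 'way', 'lost']
print(punct_first)  # ['dont', 'know', 'way', 'youre', 'lost']
```

In order B the contractions lose their apostrophes before the stopword check, so they no longer match the list and remain in the token set.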

**Twitter Descriptions** 

* Casefold to lowercase
* Remove stopwords
* Remove punctuation other than emojis or hashtags
* Split on whitespace

Removing stopwords seems sensible for the Twitter description data. Remember to leave in emojis and hashtags, since you analyze those. 

In [61]:
rows = []
lyrics_base = os.path.join(data_location, lyrics_folder)
for artist in os.listdir(lyrics_base):
    artist_dir = os.path.join(lyrics_base, artist)
    if not os.path.isdir(artist_dir):
        continue

    for fn in sorted(os.listdir(artist_dir)):
        if not fn.endswith(".txt"):
            continue
        path = os.path.join(artist_dir, fn)
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
        rows.append({
            "artist": artist,
            "lyrics": text
        })

lyrics_data = pd.DataFrame(rows)


lyrics_data["lyrics"] = lyrics_data["lyrics"].fillna("").astype(str)
twitter_data["description"] = twitter_data["description"].fillna("").astype(str)


# Redefine prepare so that non-string input is treated as empty text
def prepare(text, pipeline):
    text = text if isinstance(text, str) else ""
    for fn in pipeline:
        text = fn(text)
    return text


# Shared pipeline for lyrics and Twitter descriptions.
# Punctuation is stripped before stopword removal here, so contractions
# such as "don't" become "dont" and are not matched by the stopword list.
my_pipeline = [
    str.lower,           # case-fold
    remove_punctuation,  # drop punctuation except '#'
    tokenize,            # split on whitespace
    remove_stop          # drop stopwords
]

# apply to lyrics
lyrics_data["tokens"] = lyrics_data["lyrics"].apply(lambda x: prepare(x, my_pipeline))
lyrics_data["num_tokens"] = lyrics_data["tokens"].map(len)

# apply to twitter descriptions
twitter_data["tokens"] = twitter_data["description"].apply(lambda x: prepare(x, my_pipeline))
twitter_data["num_tokens"] = twitter_data["tokens"].map(len)

print("=== Lyrics sample ===")
print(lyrics_data[["artist","num_tokens"]].head())
print("\n=== Twitter sample ===")
print(twitter_data[["artist","num_tokens"]].head())

=== Lyrics sample ===
  artist  num_tokens
0  robyn         205
1  robyn          66
2  robyn         119
3  robyn          77
4  robyn         174

=== Twitter sample ===
  artist  num_tokens
0   cher           0
1   cher           6
2   cher           3
3   cher           1
4   cher          17


In [62]:
twitter_data['has_emoji'] = twitter_data["description"].apply(contains_emoji)

Let's take a quick look at some descriptions with emojis.

In [63]:
twitter_data[twitter_data.has_emoji].sample(10)[["artist","description","tokens"]]

Unnamed: 0,artist,description,tokens
2397996,cher,I Love Justin Bieber ♥♡♥ But His Music Tho ♥♬♪♥,"[love, justin, bieber, ♥♡♥, music, tho, ♥♬♪♥]"
684664,cher,aqui vemos uma pessoa que não sabe oq está faz...,"[aqui, vemos, uma, pessoa, que, não, sabe, oq,..."
920091,cher,"My favs, sometimes 18+, men, porn, funny, horn...","[favs, sometimes, 18, men, porn, funny, horny😁❤️]"
1637272,cher,Wrestling Fan/xbox gammer.Hopeless Romantic. P...,"[wrestling, fanxbox, gammerhopeless, romantic,..."
3383606,cher,BE IN CONTROL OF YOUR DESTINY&ALWAYS KEEP YOUR...,"[control, destinyalways, keep, eyes, prize, ¥,..."
3934097,cher,Tomorrow never knows what it doesn’t know too ...,"[tomorrow, never, knows, doesn’t, know, soon, ..."
3345938,cher,Be yourself to free yourself 🌈,"[free, 🌈]"
925793,cher,artist writer movie fanatic scifi fantasy geek...,"[artist, writer, movie, fanatic, scifi, fantas..."
1002334,cher,| voluminous hair | ole miss | I ❤️ bears | ma...,"[voluminous, hair, ole, miss, ❤️, bears, may, ..."
419115,cher,love yourself😇 BeYoutiful💞 Learn to Forgive yo...,"[love, yourself😇, beyoutiful💞, learn, forgive,..."


With the data processed, we can now start work on the assignment questions. 

Q: What is one area of improvement to your tokenization that you could theoretically carry out? (No need to actually do it; let's not make perfect the enemy of good enough.)

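A: Emojis are not separated from adjacent words, so tokens such as `yourself😇` and `horny😁❤️` in the sample above are counted as distinct words. Replacing the naive whitespace split with a tweet-aware tokenizer such as NLTK's `TweetTokenizer` could help by treating emojis and hashtags as separate tokens. It may also handle punctuation without gluing neighboring words together, as happened with `fanxbox` from `Fan/xbox`.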
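As a rough illustration of the idea, the sketch below separates symbol characters from neighboring words before splitting on whitespace. It is a heuristic and not what `TweetTokenizer` does internally: it assumes that emojis fall in the Unicode "Symbol, other" category and simply drops the emoji variation selector. The function name `split_symbols` is hypothetical.

```python
import unicodedata

VARIATION_SELECTOR = "\ufe0f"  # marks emoji-style presentation; dropped here


def split_symbols(text):
    """Split on whitespace after padding symbol characters with spaces.

    Heuristic: any character in Unicode category "So" (Symbol, other),
    which covers most emojis, becomes its own token.
    """
    padded = []
    for ch in text:
        if ch == VARIATION_SELECTOR:
            continue
        if unicodedata.category(ch) == "So":
            padded.append(f" {ch} ")
        else:
            padded.append(ch)
    return "".join(padded).split()


print(split_symbols("love yourself😇 BeYoutiful💞"))
# ['love', 'yourself', '😇', 'BeYoutiful', '💞']
```

Hashtags are unaffected because `#` is classified as punctuation rather than as a symbol, so `#pagan` would stay a single token.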

## Calculate descriptive statistics on the two sets of lyrics and compare the results. 


In [64]:
cher_tokens = [tok
               for tokens in lyrics_data.loc[lyrics_data['artist']=='cher','tokens']
               for tok in tokens]
robyn_tokens = [tok
                for tokens in lyrics_data.loc[lyrics_data['artist']=='robyn','tokens']
                for tok in tokens]

print("=== Cher lyrics ===")
cher_stats = descriptive_stats(cher_tokens, verbose=True)

print("\n=== Robyn lyrics ===")
robyn_stats = descriptive_stats(robyn_tokens, verbose=True)

=== Cher lyrics ===
There are 35916 tokens in the data.
There are 3703 unique tokens in the data.
There are 172634 characters in the data.
The lexical diversity is 0.103 in the data.

The 5 most common tokens are:
  love: 1004
  im: 513
  know: 486
  dont: 440
  youre: 333

=== Robyn lyrics ===
There are 15227 tokens in the data.
There are 2156 unique tokens in the data.
There are 73787 characters in the data.
The lexical diversity is 0.142 in the data.

The 5 most common tokens are:
  know: 308
  dont: 301
  im: 299
  love: 275
  got: 251


Q: what observations do you make about these data? 

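A: Cher's lyrics contain 35,916 tokens with a lexical diversity of 0.103, while Robyn's contain 15,227 tokens with a diversity of 0.142. Robyn's lyrics therefore repeat words less often relative to their length. Cher's corpus still has more unique tokens overall (3,703 versus 2,156), and lexical diversity tends to fall as a corpus grows, so the gap should be read with caution. 'love' is by far Cher's most common token (1,004 occurrences), whereas Robyn's top tokens ('know', 'dont', 'im', 'love') are close together in count, which suggests a different emphasis in her songs. Tokens such as 'im', 'dont', and 'youre' rank highly in both lists because punctuation was stripped before stopword removal.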
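Because lexical diversity depends on corpus length, one way to make the comparison fairer is to compute it on equal-sized random samples. The following is a minimal sketch under that assumption; the function names, the toy corpora, and the seed are illustrative and not part of the assignment.

```python
import random


def lexical_diversity(tokens):
    """Unique tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def sampled_diversity(tokens, size, seed=0):
    """Lexical diversity of a random sample of at most `size` tokens."""
    rng = random.Random(seed)
    sample = rng.sample(tokens, min(size, len(tokens)))
    return lexical_diversity(sample)


# Toy corpora with the same vocabulary but different lengths.
short_corpus = ["love", "know", "dance", "night"] * 5    # 20 tokens
long_corpus = ["love", "know", "dance", "night"] * 50    # 200 tokens

print(lexical_diversity(short_corpus))  # 0.2
print(lexical_diversity(long_corpus))   # 0.02
print(sampled_diversity(long_corpus, size=20))
```

With the notebook's variables, a call such as `sampled_diversity(cher_tokens, len(robyn_tokens))` would put Cher's figure on the same footing as Robyn's.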


## Find tokens uniquely related to a corpus

Typically we would use TF-IDF to find unique tokens in documents. Unfortunately, we either have too few documents (if we view each data source as a single document) or too many (if we view each description as a separate document). In the latter case, our problem will be that descriptions tend to be short, so our matrix would be too sparse to support analysis. 

To avoid these problems, we will create a custom statistic to identify words that are uniquely related to each corpus. The idea is to find words that occur often in one corpus and infrequently in the other(s). Since corpora can be of different lengths, we will focus on the _concentration_ of tokens within a corpus. "Concentration" is simply the count of the token divided by the total corpus length. For instance, if a corpus had length 100,000 and a word appeared 1,000 times, then the concentration would be $\frac{1000}{100000} = 0.01$. If the same token had a concentration of $0.005$ in another corpus, then the concentration ratio would be $\frac{0.01}{0.005} = 2$. Very rare words can easily create infinite ratios, so you will also add a cutoff to your code so that a token must appear at least $n$ times for you to return it. 
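The arithmetic in the example above can be reproduced with a few lines of code. This is only a sketch: the `concentration` helper and the filler token `"other"` are hypothetical, chosen so that each toy corpus has exactly 100,000 tokens.

```python
from collections import Counter


def concentration(counts, token):
    """Count of `token` divided by the total number of tokens in the corpus."""
    total = sum(counts.values())
    return counts[token] / total if total else 0.0


# Two toy corpora of 100,000 tokens each.
corpus_a = Counter({"love": 1000, "other": 99000})
corpus_b = Counter({"love": 500, "other": 99500})

conc_a = concentration(corpus_a, "love")   # 1000 / 100000
conc_b = concentration(corpus_b, "love")   # 500 / 100000
ratio = conc_a / conc_b

print(conc_a, conc_b, ratio)  # 0.01 0.005 2.0
```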

An example of these calculations can be found in [this spreadsheet](https://docs.google.com/spreadsheets/d/1P87fkyslJhqXFnfYezNYrDrXp_GS8gwSATsZymv-9ms). Please don't hesitate to ask questions if this is confusing. 

In this section find 10 tokens for each of your four corpora that meet the following criteria: 

1. The token appears at least `n` times in all corpora.
1. The token is in the top 10 for the highest ratio of appearances in a given corpus vs. appearances in the other corpora.

You will choose a cutoff for yourself based on the size of the corpus you're working with. If you're working with the Robyn-Cher corpora provided, `n=5` seems to perform reasonably well.
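One possible reading of the two criteria is sketched below: a token is scored only if it reaches the cutoff in the target corpus and in every other corpus, which also rules out infinite ratios. The function name `top_ratio_tokens`, its parameters, and the toy counts are assumptions made for illustration, and the sketch assumes `n >= 1`.

```python
from collections import Counter


def top_ratio_tokens(target, others, n=5, k=10):
    """Return up to `k` tokens with the highest concentration ratio.

    target: Counter for the corpus of interest.
    others: list of Counters for the remaining corpora.
    A token is scored only if it appears at least `n` times in every corpus.
    """
    target_total = sum(target.values())
    other_counts = sum(others, Counter())
    other_total = sum(other_counts.values())

    scores = {}
    for tok, cnt in target.items():
        if cnt < n or any(c[tok] < n for c in others):
            continue
        conc_target = cnt / target_total
        conc_other = other_counts[tok] / other_total
        scores[tok] = conc_target / conc_other
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpora: "rare" and "night" each occur in only one corpus.
corpus_a = Counter({"love": 50, "baby": 10, "rare": 7})
corpus_b = Counter({"love": 10, "baby": 40, "night": 50})

print(top_ratio_tokens(corpus_a, [corpus_b], n=5, k=2))  # ['love', 'baby']
```

Tokens that occur in only one corpus, such as "rare" here, are excluded rather than given an infinite score.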

In [65]:
# 1. flatten tokens for each of the four corpora
corpora_tokens = {}
for artist in ["cher", "robyn"]:
    # lyrics
    lyr = lyrics_data.loc[lyrics_data["artist"] == artist, "tokens"]
    corpora_tokens[f"{artist}_lyrics"] = [tok for row in lyr for tok in row]
    # twitter
    tw = twitter_data.loc[twitter_data["artist"] == artist, "tokens"]
    corpora_tokens[f"{artist}_twitter"] = [tok for row in tw for tok in row]

# 2. counts & totals
counts = {name: Counter(toks) for name, toks in corpora_tokens.items()}
totals = {name: len(toks)          for name, toks in corpora_tokens.items()}
global_counts = Counter(tok for toks in corpora_tokens.values() for tok in toks)
grand_total   = sum(totals.values())

# 3. compute concentration ratio and pick top 10 per corpus
n = 5  # minimum count, applied to the target corpus only
unique_tokens = {}
for name, ctr in counts.items():
    other_total = grand_total - totals[name]
    other_counts = global_counts - ctr

    scores = {}
    for tok, cnt in ctr.items():
        if cnt < n:
            continue
        conc = cnt / totals[name]
        conc_other = other_counts[tok] / other_total if other_total > 0 else 0.0
        # if token never appears elsewhere, give it an "infinite" score
        scores[tok] = conc / conc_other if conc_other > 0 else float("inf")

    # top 10 by descending score
    top10 = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:10]
    unique_tokens[name] = [tok for tok, score in top10]

for name, toks in unique_tokens.items():
    print(f"\n=== Top 10 unique tokens for {name.replace('_',' ')} ===")
    print(toks)


=== Top 10 unique tokens for cher lyrics ===
['alegrã\xada', 'wontcha', 'geronimos', 'nooh', 'woahoh', 'milord', 'repossessing', 'rhymney', 'guilded', 'chiquitita']

=== Top 10 unique tokens for cher twitter ===
['csu', '#dcnative', 'sexagenarian', 'saunas', 'romanos', '831', 'ꕥ', 'masseuse', '#pagan', 'antibigot']

=== Top 10 unique tokens for robyn lyrics ===
['moneyman', 'tjaffs', 'bububurn', 'câ\x80\x99mon', 'headlessly', 'ultramagnetic', 'aprã©ndelo', 'yyou', 'rudegirl', 'transistors']

=== Top 10 unique tokens for robyn twitter ===
['cykla', 'däremellan', 'musikproducent', 'framförallt', 'økonomi', 'næsten', 'promenader', 'ställe', 'jämför', 'blåvitt']


Q: What are some observations about the top tokens? Do you notice any interesting items on the list? 

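A: Cher's lyrics list is made up of words concentrated in a handful of songs, such as "repossessing" and "geronimos", while her Twitter list contains niche hashtags and self-descriptions such as "#dcnative" and "#pagan". Robyn's lyrics list likewise contains terms such as "ultramagnetic", and her Twitter list consists entirely of Scandinavian-language words (for example Swedish "cykla" and "musikproducent"), which is consistent with a largely Scandinavian follower base. A few lyric tokens, such as `alegrã\xada`, show encoding damage in the source files. Because the cutoff is applied only to the target corpus, the tokens listed most likely never occur in the other corpora at all.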

## Build word clouds for all four corpora. 

For building wordclouds, we'll follow exactly the code of the text. The code in this section can be found [here](https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/ch01/First_Insights.ipynb). If you haven't already, you should absolutely clone the repository that accompanies the book. 


In [66]:
from matplotlib import pyplot as plt

def plot_wc(freq_df, title=None, max_words=200, stopwords=None):
    """Generate a word-cloud from the freq_df produced by count_words()."""
    # pull out the dict of token→freq
    freqs = freq_df['freq'].to_dict()
    # apply stopword filter if requested
    if stopwords is not None:
        freqs = {tok:cnt for tok,cnt in freqs.items() if tok not in stopwords}
    if not freqs:
        print(f"⚠️  no terms ≥ cutoff for “{title}”, skipping cloud.")
        return

    wc = WordCloud(
        width=800, height=400,
        background_color='white',
        max_words=max_words
    ).generate_from_frequencies(freqs)

    plt.figure(figsize=(10, 5))
    if title:
        plt.title(title, fontsize=16)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()


def count_words(df, column='tokens', preprocess=None, min_freq=2):
    """
    Count up all tokens in df[column], optionally piping each doc through preprocess(),
    then return a DataFrame of token→freq for freq>=min_freq.
    """
    counter = Counter()
    for doc in df[column]:
        toks = preprocess(doc) if preprocess is not None else doc
        counter.update(toks)

    freq_df = pd.DataFrame.from_dict(
        counter, orient='index', columns=['freq']
    )
    freq_df.index.name = 'token'
    # filter on the minimum frequency
    freq_df = freq_df.query('freq >= @min_freq') \
                     .sort_values('freq', ascending=False)
    return freq_df

In [67]:
# build frequency tables (lyrics at min_freq=5, twitter at min_freq=1)
freq_cl = count_words(
    lyrics_data[lyrics_data.artist == "Cher"],
    column="tokens",
    min_freq=5
)
freq_rl = count_words(
    lyrics_data[lyrics_data.artist == "Robyn"],
    column="tokens",
    min_freq=5
)
freq_ct = count_words(
    twitter_data[twitter_data.artist == "Cher"],
    column="tokens",
    min_freq=1
)
freq_rt = count_words(
    twitter_data[twitter_data.artist == "Robyn"],
    column="tokens",
    min_freq=1
)

print("Cher lyrics freq table:\n", freq_cl.head())
print("Robyn lyrics freq table:\n", freq_rl.head())

# plot
plot_wc(freq_cl, title="Cher – Lyrics",   stopwords=sw)
plot_wc(freq_rl, title="Robyn – Lyrics", stopwords=sw)
plot_wc(freq_ct, title="Cher – Twitter", stopwords=sw)
plot_wc(freq_rt, title="Robyn – Twitter",stopwords=sw)

Cher lyrics freq table:
 Empty DataFrame
Columns: [freq]
Index: []
Robyn lyrics freq table:
 Empty DataFrame
Columns: [freq]
Index: []
⚠️  no terms ≥ cutoff for “Cher – Lyrics”, skipping cloud.
⚠️  no terms ≥ cutoff for “Robyn – Lyrics”, skipping cloud.
⚠️  no terms ≥ cutoff for “Cher – Twitter”, skipping cloud.
⚠️  no terms ≥ cutoff for “Robyn – Twitter”, skipping cloud.


Q: What observations do you have about these (relatively straightforward) wordclouds? 

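A: No word clouds were produced in this run. All four frequency tables came back empty because the filters compare `artist` against "Cher" and "Robyn", while the column stores the lowercase labels "cher" and "robyn" (see the `value_counts()` output in the ingestion section). Filtering on the lowercase labels should populate the tables and render the clouds.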
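The empty frequency tables are what an exact, case-sensitive label comparison returns when the stored labels are lowercase. The sketch below reproduces the effect with plain Python rows standing in for the DataFrames; the `rows` data and the `token_freqs` helper are hypothetical. In the notebook itself, the equivalent change would presumably be to filter with `lyrics_data.artist == "cher"` and `twitter_data.artist == "robyn"`, and so on.

```python
from collections import Counter

# Hypothetical rows standing in for the notebook's DataFrames.
rows = [
    {"artist": "cher", "tokens": ["love", "believe", "love"]},
    {"artist": "robyn", "tokens": ["dancing", "love"]},
]


def token_freqs(rows, artist):
    """Count tokens for one artist, matching the label case-insensitively."""
    wanted = artist.casefold()
    counter = Counter()
    for row in rows:
        if row["artist"].casefold() == wanted:
            counter.update(row["tokens"])
    return counter


# An exact comparison against "Cher" matches nothing.
exact = [r for r in rows if r["artist"] == "Cher"]
print(len(exact))                 # 0

# A case-insensitive comparison finds the lowercase label.
print(token_freqs(rows, "Cher"))  # Counter({'love': 2, 'believe': 1})
```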