# ADS 509 Assignment 2.1: Tokenization, Normalization, Descriptive Statistics 

This notebook holds Assignment 2.1 for Module 2 in ADS 509, Applied Text Mining. Work through this notebook, writing code and answering questions where required. 

In the previous assignment you pulled lyrics data on two artists. In this assignment we explore this data set and a pull from the now-defunct Twitter API for the artists Cher and Robyn.  If, for some reason, you did not complete that previous assignment, data to use for this assignment can be found in the assignment materials section of Canvas. 

This assignment asks you to write a short function to calculate some descriptive statistics on a piece of text. Then you are asked to find some interesting and unique statistics on your corpora. 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


In [1]:
import os
import re
import emoji
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
from nltk.corpus import stopwords
from string import punctuation

sw = stopwords.words("english")

In [2]:
# Add any additional import statements you need here
import string
from nltk.tokenize import word_tokenize

In [3]:
# change `data_location` to the location of the folder on your machine.
data_location = "/users/rkartawi/Desktop/Ravita/MSADS/509/ads509-tm-scrape/"

# These subfolders should still work if you correctly stored the 
# data from the Module 1 assignment
twitter_folder = os.path.join(data_location, "twitter/")
lyrics_folder = os.path.join(data_location, "lyrics/")

In [4]:
def descriptive_stats(tokens, num_tokens = 5, verbose=True) :
    # Given a list of tokens, print number of tokens, number of unique tokens, and number of characters
    # Fill in the correct values here. 
    num_tokens_tot = len(tokens)
    num_unique_tokens = len(set(tokens))
    num_characters = sum(
        len(token) for token in tokens)    
    
    # Get lexical diversity (https://en.wikipedia.org/wiki/Lexical_diversity), and num_tokens most common tokens.
    lexical_diversity = num_unique_tokens / num_tokens_tot if num_tokens_tot > 0 else 0.0
    token_counts = Counter(tokens)
    most_common_tokens = token_counts.most_common(num_tokens)

    if verbose :        
        print(f"There are {num_tokens_tot} tokens in the data.")
        print(f"There are {num_unique_tokens} unique tokens in the data.")
        print(f"There are {num_characters} characters in the data.")
        print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")
    
    # print the five most common tokens
        print(f"The {num_tokens} most common tokens are:")
        for token, count in most_common_tokens:
            print(f"{token}: {count}")
    # else:
    #     # Return only the most common tokens if not verbose
    #     return most_common_tokens
        
    # Return a list with the number of tokens, number of unique tokens, lexical diversity, and number of characters, most common tokens
    return [num_tokens_tot, num_unique_tokens, lexical_diversity, num_characters, most_common_tokens]
    

In [5]:
text = """here is some example text with other example text here in this text""".split()
# Where text will be the top word and  next text word will be here then follow by text example here List the top five text
# Word from text example here is going to be first five word 
assert(descriptive_stats(text, verbose=True)[0] == 13)
assert(descriptive_stats(text, verbose=False)[1] == 9)
assert(abs(descriptive_stats(text, verbose=False)[2] - 0.69) < 0.02)
assert(descriptive_stats(text, verbose=False)[3] == 55)

There are 13 tokens in the data.
There are 9 unique tokens in the data.
There are 55 characters in the data.
The lexical diversity is 0.692 in the data.
The 5 most common tokens are:
text: 3
here: 2
example: 2
is: 1
some: 1


Q: Why is it beneficial to use assertion statements in your code? 

A: An assertion statement is a tool to test if a condition of the code is true. If the condition is True, the program continues running; if it evaluates to False, an AssertionError is raised, and the optional error message is displayed. 

Some of benefits of assertion statements are: 
- It helps catching bugs early by verifying conditions at specific points because it improves code reliability by ensuring assumptions hold.
- It makes code easier to understand by explicitly showing expected behaviors.
- It prevents invalid inputs or states from propagating through the program.
- During development, assertions complement testing with lightweight checks and it can be disabled in production to avoid performance issues.
  
Overall, they enhance code robustness and make debugging easier.

## Data Input

Now read in each of the corpora. For the lyrics data, it may be convenient to store the entire contents of the file to make it easier to inspect the titles individually, as you'll do in the last part of the assignment. In the solution, I stored the lyrics data in a dictionary with two dimensions of keys: artist and song. The value was the file contents. A data frame would work equally well. 

For the Twitter data, we only need the description field for this assignment. Feel free all the descriptions read it into a data structure. In the solution, I stored the descriptions as a dictionary of lists, with the key being the artist. 




In [6]:
# Read in the lyrics data
def read_lyrics_data(lyrics_folder): 
    lyrics_data = {}
    # Loop through each artist's subfolder
    for artist in os.listdir(lyrics_folder):
        artist_folder = os.path.join(lyrics_folder, artist)
        lyrics_data[artist] = {}
        
        # Loop through songs in artist's folder
        for song_file in os.listdir(artist_folder):
            song_path = os.path.join(artist_folder, song_file)
            with open(song_path, 'r', encoding='utf-8') as f:
                song_title = os.path.splitext(song_file)[0]  # without file ext.
                lyrics_content = f.read()
                lyrics_data[artist][song_title] = lyrics_content    
#                print(f"{artist}: {song_title}")

    return lyrics_data
lyrics_data = read_lyrics_data(lyrics_folder)

In [7]:
# Read in the twitter data
def read_twitter_data(twitter_folder):
    twitter_data = {}
    for file in os.listdir(twitter_folder):
        file_path = os.path.join(twitter_folder, file)
        with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
            twitter_data[file] = f.read().strip()
            
    # for filename, description in twitter_data.items():
    #     print(f"Description:\n{description}\n")
              
    return twitter_data
twitter_data = read_twitter_data(twitter_folder)

## Data Cleaning

Now clean and tokenize your data. Remove punctuation chacters (available in the `punctuation` object in the `string` library), split on whitespace, fold to lowercase, and remove stopwords. Store your cleaned data, which must be accessible as an interable for `descriptive_stats`, in new objects or in new columns in your data frame. 



In [8]:
punctuation = set(punctuation) # speeds up comparison
def clean_text(text):
    # Remove punctuation
    text = ''.join(char for char in text if char not in string.punctuation)
    # Convert to lowercase
    text = text.lower()
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = [word for word in tokens if word not in stop_words]
    return cleaned_tokens

In [None]:
#create your clean twitter data here
def clean_twitter_data(twitter_data):
    cleaned_twitter_data = {}
    for filename, description in twitter_data.items():
        cleaned_twitter_data[filename] = clean_text(description)
    return cleaned_twitter_data
cleaned_twitter_data = clean_twitter_data(twitter_data)

In [None]:
# create your clean lyrics data here
def clean_lyrics_data(lyrics_data):
    cleaned_lyrics = {}
    for artist, songs in lyrics_data.items():
        cleaned_lyrics[artist] = {}
        for song_title, lyrics in songs.items():
            cleaned_lyrics[artist][song_title] = clean_text(lyrics)
    return cleaned_lyrics

cleaned_lyrics_data = clean_lyrics_data(lyrics_data)

In [11]:
# Print twitter samples
# filename = "robynkonichiwa_followers_data.txt"
# file_path = os.path.join(twitter_folder, filename)
# with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
#     description = f.read().strip()
#     cleaned_tokens = clean_text(description)
# print(f"File Name: {filename}")
# print("Sample Tokens (up to 5):", cleaned_tokens[:50])

# # Print lyric samples
# num_samples = 3
# for artist, songs in list(cleaned_lyrics_data.items())[:num_samples]:
#     print(f"Artist: {artist}")
#     for song_title, tokens in list(songs.items())[:num_samples]:
#         print(f"Song Title: {song_title}")
#         print(f"Tokens: {tokens}")
#         print('-' * 40)

In [None]:
# Flatten the cleaned data
def flatten_data(cleaned_data):
    all_tokens = []
    for content in cleaned_data.values():
        # Nested data: 1st dictionary: Artist; 2nd : Song
        if isinstance(content, dict):  
            for tokens in content.values():
                all_tokens.extend(tokens)
        else:  # twitter: content is a list
            all_tokens.extend(content)
    return all_tokens

# Accessible as an interable for descriptive_stats,
flat_cleaned_lyrics = flatten_data(cleaned_lyrics_data)
flat_cleaned_twitter = flatten_data(cleaned_twitter_data)

## Basic Descriptive Statistics

Call your `descriptive_stats` function on both your lyrics data and your twitter data and for both artists (four total calls). 

In [None]:
# calls to descriptive_stats here
stats_lyrics = descriptive_stats(flat_cleaned_lyrics, num_tokens=5, verbose=True)
assert(stats_lyrics[0])

In [None]:
stats_twitters = descriptive_stats(flat_cleaned_twitter, verbose=True)
assert(stats_twitters[0])

Q: How do you think the "top 5 words" would be different if we left stopwords in the data? 

A: If we left stopwords which include conjunctions, prepositions, articles, pronouns, and auxiliary verbs, the result will generate less meaningful words, because there would be possibilities that these stopwords would be most dominant. It could skew the results since it emphasizes frequent words that don't add much insight into the text’s content. On the otherhands, removing stopwords allows the analysis to get more significant and content-rich words. This often leads to a list of top words that better reflect the core themes and topics of the text. By excluding stopwords, it would enhance the relevance and accuracy of text analysis. The final result will highlight words that contribute more meaningfully to understanding the text. 

---

Q: What were your prior beliefs about the lexical diversity between the artists? Does the difference (or lack thereof) in lexical diversity between the artists conform to your prior beliefs? 

A: Lexical diversity refers to the variety of unique words used in a text or speech. It reflects the richness of the vocabulary in the lyrics. I was under impression that most of artists use similar range of vocabulary in their lyrics as long as their songs express similar themes regardless the genre. After analyzing the artists' lyrics, I found that their lexical diversity was similar; which would support the belief that artists generally have a comparable vocabulary range. However, an interesting finding would suggest that factors such as genre or lyrical approach would play a more substantial role. These differences would challenge my prior assumption and highlight how artistic style impacts language or words use. For example, an artist who sings a love theme for hiphop or rap genre, the lyrics might contain more explicit languages compare to country or pop genre. If we include specific stopword based genre themes, the results confirm that lexical diversity is consistent across artists based on the themes. Thus, the analysis provides valuable insights into the relationship between lyrical content and lexical diversity.




## Specialty Statistics

The descriptive statistics we have calculated are quite generic. You will now calculate a handful of statistics tailored to these data.

1. Ten most common emojis by artist in the twitter descriptions.
1. Ten most common hashtags by artist in the twitter descriptions.
1. Five most common words in song titles by artist. 
1. For each artist, a histogram of song lengths (in terms of number of tokens) 

We can use the `emoji` library to help us identify emojis and you have been given a function to help you.


In [None]:
assert(emoji.is_emoji("❤️"))
assert(not emoji.is_emoji(":-)"))

### Emojis 😁

What are the ten most common emojis by artist in the twitter descriptions? 


In [None]:
# Your code here
def extract_emojis(text):
    return [char for char in text if char in emoji.EMOJI_DATA]

# Update Clean_twitter_data def
target_filenames = ['robynkonichiwa_followers_data.txt', 'cher_followers_data.txt']
def clean_twitter_data(twitter_data):
    cleaned_twitter_data = {}
    for filename, description in twitter_data.items():
        if filename in target_filenames:
            cleaned_description = clean_text(description)
            emojis = extract_emojis(cleaned_description)
            cleaned_twitter_data[filename] = {
                'description': cleaned_description,
                'emojis': emojis
            }
    return cleaned_twitter_data
    
cleaned_twitter_data = clean_twitter_data(twitter_data)


In [None]:
for filename, data in cleaned_twitter_data.items():
    print(f"Artist: {filename}, Description: {data['description'][:30]}, Emojis: {data['emojis'][:5]}")

In [None]:
def group_emojis_by_artist(cleaned_twitter_data):
    grouped_emojis = {}
    for filename, data in cleaned_twitter_data.items():
        artist = filename.split('_')[0]  
        if artist not in grouped_emojis:
            grouped_emojis[artist] = []
        grouped_emojis[artist].extend(data['emojis'])  
    return grouped_emojis

grouped_emojis = group_emojis_by_artist(cleaned_twitter_data)

def descriptive_stats_by_artist(grouped_emojis):
    for artist, emojis in grouped_emojis.items():
        print(f"Descriptive stats for {artist}:")
        descriptive_stats(emojis, num_tokens=10, verbose=True)

descriptive_stats_by_artist(grouped_emojis)

### Hashtags

What are the ten most common hashtags by artist in the twitter descriptions? 


In [22]:
def clean_text(text):
    # Ensure the input is a string
    if not isinstance(text, str):
        raise TypeError("Expected a string for text cleaning.")
    # Remove punctuation
    text = ''.join(char for char in text if char not in string.punctuation)
    # Convert to lowercase
    text = text.lower()
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = [word for word in tokens if word not in stop_words]
    
    # Join tokens back into a cleaned string
    return ' '.join(cleaned_tokens)

In [23]:
# Your code here
def extract_hashtags(text):
    return re.findall(r'#\w+', text)

# Updated clean_twitter_data def
def clean_twitter_data(twitter_data):
    cleaned_twitter_data = {}
    for filename, description in twitter_data.items():
        if filename in target_filenames:
            cleaned_description = clean_text(description)
            emojis = extract_emojis(cleaned_description)
            hashtags = extract_hashtags(cleaned_description)
            cleaned_twitter_data[filename] = {
                'description': cleaned_description,
                'emojis': emojis,
                'hashtags': hashtags
            }
    return cleaned_twitter_data
cleaned_twitter_data = clean_twitter_data(twitter_data)

In [25]:
def group_hashtags_by_artist(cleaned_data):
    grouped_hashtags = {}
    for filename, data in cleaned_data.items():
        artist = filename.split('_')[0]  # Extract artist name from filename
        if artist not in grouped_hashtags:
            grouped_hashtags[artist] = []
        grouped_hashtags[artist].extend(data['hashtags'])
    return grouped_hashtags

grouped_hashtags = group_hashtags_by_artist(cleaned_twitter_data)

def descriptive_stats_hashtags_by_artist(grouped_hashtags):
    for artist, hashtags in grouped_hashtags.items():
        print(f"Descriptive stats for {artist}:")
        stats = descriptive_stats(hashtags, num_tokens=10, verbose=True)
        # most_common is at index 5
        most_common_hashtags = stats[4]
        print(f"Top hashtags for {artist}:")
        for hashtag, count in most_common_hashtags:
            print(f"{hashtag}: {count}")

descriptive_stats_hashtags_by_artist(grouped_hashtags)

Descriptive stats for cher:
There are 0 tokens in the data.
There are 0 unique tokens in the data.
There are 0 characters in the data.
The lexical diversity is 0.000 in the data.
The 10 most common tokens are:
Top hashtags for cher:
Descriptive stats for robynkonichiwa:
There are 0 tokens in the data.
There are 0 unique tokens in the data.
There are 0 characters in the data.
The lexical diversity is 0.000 in the data.
The 10 most common tokens are:
Top hashtags for robynkonichiwa:


In [None]:
def group_hashtags_by_artist(cleaned_data):
    grouped = {}
    for filename, data in cleaned_data.items():
        artist = filename.split('_')[0]  # Extract artist name from filename
        if artist not in grouped:
            grouped[artist] = []
        grouped[artist].extend(data['hashtags'])
    return grouped

def get_most_common_hashtags_by_artist(grouped_hashtags, num_hashtags=10):
    most_common_hashtags = {}
    for artist, hashtags in grouped_hashtags.items():
        hashtag_counts = Counter(hashtags)
        most_common_hashtags[artist] = hashtag_counts.most_common(num_hashtags)
    return most_common_hashtags

# Clean and process the Twitter data
cleaned_twitter_data = clean_twitter_data(twitter_data)

# Group hashtags by artist
grouped_hashtags = group_hashtags_by_artist(cleaned_twitter_data)

# Get the ten most common hashtags by artist
most_common_hashtags_by_artist = get_most_common_hashtags_by_artist(grouped_hashtags, num_hashtags=10)

# Output the results
for artist, common_hashtags in most_common_hashtags_by_artist.items():
    print(f"Top 10 hashtags for {artist}:")
    for hashtag, count in common_hashtags:
        print(f"{hashtag}: {count}")


### Song Titles

What are the five most common words in song titles by artist? The song titles should be on the first line of the lyrics pages, so if you have kept the raw file contents around, you will not need to re-read the data.


In [None]:
# Your code here

### Song Lengths

For each artist, a histogram of song lengths (in terms of number of tokens). If you put the song lengths in a data frame with an artist column, matplotlib will make the plotting quite easy. An example is given to help you out. 


In [None]:
num_replicates = 1000

df = pd.DataFrame({
    "artist" : ['Artist 1'] * num_replicates + ['Artist 2']*num_replicates,
    "length" : np.concatenate((np.random.poisson(125,num_replicates),np.random.poisson(150,num_replicates)))
})

df.groupby('artist')['length'].plot(kind="hist",density=True,alpha=0.5,legend=True)

Since the lyrics may be stored with carriage returns or tabs, it may be useful to have a function that can collapse whitespace, using regular expressions, and be used for splitting. 

Q: What does the regular expression `'\s+'` match on? 

A: 


In [None]:
collapse_whitespace = re.compile(r'\s+')

def tokenize_lyrics(lyric) : 
    """strip and split on whitespace"""
    return([item.lower() for item in collapse_whitespace.split(lyric)])

In [None]:
# Your lyric length comparison chart here. 

**Reference:**
- Oracle and/or its affiliates. (n.d.). Programming With Assertions. Programming with assertions. https://docs.oracle.com/javase/8/docs/technotes/guides/language/assert.html#:~:text=An%20assertion%20is%20a%20statement,than%20the%20speed%20of%20light.
- Wikimedia Foundation. (2024, August 14). Lexical diversity. Wikipedia. https://en.wikipedia.org/wiki/Lexical_diversity
- OpenAI. (2024). ChatGPT (September 24 version) [Large language model]. https://chat.openai.com/chat