# Text Mining - Acquire and Analyze
# AJ Eckmann

In [1]:
import json
import os
import glob
import pprint
from nltk.corpus import stopwords
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

## Introduction:

For this aquire and analyze I will be using data that I acquired for one of my data set shares. This data set that I created comes from transcripts of all the office episodes ever made. The base transcripts can be found and downloaded from GitHub at: https://github.com/brianbuie/the-office.

I was actually given the idea to look for this data from seeing Mary’s data set share on the Friends transcripts. I am a big fan of the office and thought it would be a cool dataset to look at. Instead of looking at differences between episodes/seasons, I thought a cool analysis would be to look at the differences between characters.

In my initial data set share, I just created a dictionary for each of the office characters and appended all of the tokens that they said throughout the show. In this Acquire and Analyze, I take this a couple steps further. In addition to just looking at all the tokens (words) that each character says, I want to also be able to differentiate between seasons of the show. To do this I will need to re-format the inital dictionaries that I created in the data set share.

The purpose of this acquire and analyze is to take this previously messy json data, and turn it into interpretable code to analyze the sentiments and usage of all of the office characters throughout the course of the show

# The Data and Analysis

In this section I will format the data that I am using and also talk through some of my code/thought process on how to go about formatting the data so that I am able to analyze the sentiment of characters.

In the initial data share, I only sorted out the data by character. In this A&A I took this a step further and sorted it out by character, by season and episode, so that I could look more closely on some of the changes in sentiment and useage over time.

In [2]:
#get the filenames from the eplib (episode library) 
fileList = []
for filenames in os.walk('eplib'):
    for li in filenames[2]:
        fileList.append(li)

#create a list of dictionaries for each scene
scenes = []
for file in fileList:
    path = "eplib/"
    fil = path + file
    with open(fil, encoding='utf-8', mode='r') as currentFile:
        rd = currentFile.read()
        title = json.loads(rd)["title"]
        scen = json.loads(rd)["scenes"]
        
        ##This part is all about breaking up the json
        ##scen is a list of dictionaries that I itterate though
        ##char is each dictionary in this list, each of these dictionaries maps 1 to 1 with a character
        ##unlike in my data share, I redesigned this part to include season and episode in the dict so I can use that later
        
        for part in scen:
            for char in part:
                scene_dict = char
                scene_dict["season"] = json.loads(rd)["season"]
                scene_dict["episode"] = json.loads(rd)["episode"]
                scenes.append(scene_dict)


In the following section, I now have to take a list of nested dictionaries that I created above from the json files, and turn them into something workable for my analysis. The best way that I could come up with of doing this is by creating a triple nested dictionary. It sounds inneficient and hard to keep track of, but when I have to iterate through 700 different characters' transcripts later on, this methods proves to be very efficient.

The structure of the nested dictionaries is:

chardict (main dictionary, keys are characters)
season_dict (there is one of these dictionaries for each of the 700 characters, the keys to this dictionary are the seasons of the show (1-9) that this character had lines in)
episode_dict (this is the lowest level of dicitonary in this section, there is one of these dictionaries for each season for every character, the keys to this dictionary are episodes)

For example, a call to chardict['Michael'] might return:

{'season 1': { 'episode 1' : 'Hello, my name is Michael Scott', 'episode 2' : 'How are you...' } , 'season 2' : { 'episode 1' :...} }

In [3]:
#Initialize dictionary and iterate through the scene dictionaries
chardict = {}
for pers in scenes:
    new_char = pers['character']
    new_season = pers['season']
    new_episode = pers['episode']
    lines = pers['line'].split(" ")
    
    #adding characters to the dictionary for the first time
    if new_char not in chardict:    
        season_dict = {}
        episode_dict = {}
        episode_dict[new_episode] = lines
        season_dict[new_season] = episode_dict
        chardict[new_char] = season_dict
    
    #if character already in the dictionary
    else:      
        #if they don't already have a dictionary for this season, create one
        if new_season not in chardict[new_char]:
            episode_dict = {}
            episode_dict[new_episode] = lines
            #person_dict[new_season] = episode_dict
            chardict[new_char][new_season] = episode_dict
            
        
        else:
            #if they haven't already spoken in this episode, we need to create a new episode entry
            if new_episode not in chardict[new_char][new_season]:
                chardict[new_char][new_season][new_episode] = lines
            
            else:
                chardict[new_char][new_season][new_episode].extend(lines)
                

In [4]:
#Just a little eda... creating a dictionary of all the characters and how many words they spoke
#also counting how many words total (620,192)
bigl = {}
tot = 0
for pers in chardict.keys():
    pers_words = 0
    for seas in chardict[pers]:
        for ep in chardict[pers][seas]:
            tot += len(chardict[pers][seas][ep])
            pers_words += len(chardict[pers][seas][ep])
    bigl[pers] = pers_words 
    
print("There are " + str(tot) + " total words in the transcripts")
print("There are " + str(len(chardict)) + " different characters with lines in this dataset")

There are 620192 total words in the transcripts
There are 700 different characters with lines in this dataset


In [5]:
#sorting characters by how many words they said
import operator
sorted_char = sorted(bigl.items(), key=operator.itemgetter(1), reverse = True)

In [6]:
#Sorted character list by total words, printing top 25 for space reasons
sorted_char[:25]

[('Michael', 159101),
 ('Dwight', 82739),
 ('Jim', 62394),
 ('Pam', 48203),
 ('Andy', 47671),
 ('Angela', 14574),
 ('Erin', 14090),
 ('Kevin', 13887),
 ('Oscar', 12830),
 ('Ryan', 12678),
 ('Darryl', 12291),
 ('Kelly', 9696),
 ('Jan', 8351),
 ('Toby', 8349),
 ('Phyllis', 8019),
 ('Nellie', 7184),
 ('Robert California', 6234),
 ('Gabe', 6153),
 ('Stanley', 6109),
 ('David Wallace', 5310),
 ('Holly', 5099),
 ('Meredith', 4750),
 ('Creed', 4327),
 ('Deangelo', 3823),
 ('Jo', 3111)]

The next thing that I want to look at is the breakdown of total tokens and total characters by season.

In [7]:
character_seasons = {}
season_count = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0,'9':0}
only_once = 0
for char in chardict:
    szn_count = len(chardict[char])
    if szn_count == 1:
        only_once += 1
    szn_list = [szn_count]
    empty_list = [0,0,0,0,0,0,0,0,0]
    for seas in chardict[char]:
        empty_list[int(seas)-1] = 1
        season_count[str(seas)] += 1
    szn_list.append(empty_list)
    character_seasons[char] = szn_list
    

Below I printed the character_seasons output for Micheal Scott. Michael was the main character, but left the show after season 7. However, he did return during the very last episode of season 9, which is reflected in our data.

In [8]:
character_seasons['Michael']

[8, [1, 1, 1, 1, 1, 1, 1, 0, 1]]

Another cool case here is with a character like Holly, who appears in 3 non-consecutive seasons, all of this data checks out with the activities of the show which is cool to see given the complexity of digging through the character layers of the json.

In [9]:
character_seasons['Holly']

[3, [0, 0, 0, 1, 1, 0, 1, 0, 0]]

In the code chunk above I also created a dictionary that counted how many unique characters appeared in each of the 9 seasons. Note that season 1 was only 6 episodes, and season 4 was only 14 episodes, while the rest of the seasons were ~20 episodes long. This would indicate that the more episodes in a season, would result in more unique characters, which is pretty obvious.

In [10]:
season_count

{'1': 29,
 '2': 116,
 '3': 102,
 '4': 96,
 '5': 125,
 '6': 139,
 '7': 142,
 '8': 124,
 '9': 184}

I also ran a counter on the number of characters that only appeared in 1 of the 9 seasons. Interestingly enough, 569 of the 700 characters only appeared in 1 of the 9 seasons. This is a much higher number than I would've expected.

In [11]:
only_once

569

Now I will prepare the data to do some sentiment analysis on the trascripts of these characters and see if we can draw any insigts from the most positive/negative characters or just the general sentiment of the show.

In [12]:
#create a list of all the words spoken for each character, no more breakdown by season
scenes2 = []
for file in fileList:
    path = "eplib/"
    fil = path + file
    with open(fil, encoding='utf-8', mode='r') as currentFile:
        rd = currentFile.read()
        title = json.loads(rd)["title"]
        scen = json.loads(rd)["scenes"]
        for part in scen:
            scenes2.append(part)

In [13]:
#Now going through the new list of all the words spoken (no longer broken down by season) and
# recreating the dictionary, mapping a character to all their words

chardict = {}
for part in scenes2:
    for pers in part:
        if pers['character'] not in chardict:
            new = pers['character']
            line = pers['line'].split(" ")
            chardict[new] = line
        else:
            new = pers['character']
            line = pers['line'].split(" ")
            chardict[new].extend(line)

If you dig deeper into these json files, or the output from the dictionary above, you will notice there is one large problem when trying to analyze the spoken words from these characters solely based on transcript output attributed to them.

This problem is that, unspoken actions are also included in the trascript, mixed in with the spoken words. In order to do a proper sentiment analysis I have to deal with this problem. Fortunately all of these unspoken actions are enclosed with '[ ]'. This makes it easier to identify these unspoken actions. Now we have to iterate through the trascript and search for the opening and closing square brackets and exclude all the text between the two from our analysis.

An example of this would be:
{'Michael' : 'Hey, how are you doing? I'm [shakes his hand] doing well.'}

In [14]:
#I create a new dicitonary "talk_only", that excludes these unspoken lines
talk_only = {}
for char in chardict:
    
    #initializing a list of indexes of words that are unspoken so I can delete them in the end
    dele_list = []
    talk = chardict[char]
    leng = len(talk)
    for word in range(0,len(talk)):
        
        #make sure that we didn't index incorrectly
        if talk[word] == '':
            dele_list.append(word)
        
        #If a word is the start of this unspoken piec
        elif talk[word][0] == '[':
            #If it's also the end, just delete that word
            if talk[word][-1] == ']':
                dele_list.append(word)
                break
            #otherwise, keep going till the end of the transcript until we find the end of the unspoken
            else:
                for end in range(word,len(talk)):
                    if talk[end][-1] == ']':
                        dele_list.append(end)
                        break
                    else:
                        dele_list.append(end) 
                        
    #now delete all the indexes (in reverse so the indices don't change on us)
    for i in sorted(dele_list, reverse=True):
        del talk[i]
    
    #now go through and reformat the words that are actually spoken
    
    sw = stopwords.words("english")
    
    talk = map(str.strip, talk) 
    
    further = [w.lower() for w in talk if w.lower() not in sw and w.isalpha()]
    
    #reassign the new list to the character as their talk only (and correctly formatted text)
    
    talk_only[char] = further

In [15]:
#just a quick check to make sure we got rid of sw, stripped the lines and everything is lowecase
talk_only['Michael'][:25]

['right',
 'quarterlies',
 'look',
 'things',
 'come',
 'master',
 'let',
 'show',
 'like',
 'speak',
 'office',
 'michael',
 'regional',
 'manager',
 'dunder',
 'mifflin',
 'paper',
 'wanted',
 'talk',
 'done',
 'thank',
 'gentleman',
 'woman',
 'talking',
 'low']

# Sentiment Analysis

In [16]:
#using a file and code from earlier in the semester to classify each of these words as +/- on a 1/-1 scale

sentiment_scores = {}
            
# Open the file `tidytext_sentiments.txt`
# Fill up sentiment scores so it has values like 
# sentiment_scores['awesome'] = 1

with open("tidytext_sentiments.txt", "r") as infile:
    next(infile)
    for line in infile.readlines():
        line = line.strip().split()
        word = line[0]
        if line[1] == "negative":
            sentiment_scores[word] = -1
        elif line[1] == "positive":
            sentiment_scores[word] = 1

When looking at the sentiment of the characters, I wanted to look at more than just a total sentiment score. Looking at only the sentiment score here seemed flawed for a couple reasons. Firstly, looking at only the sentiment score (when scored on the 1/-1 scale) you get no context for how many words are being spoken. If the show is relatively happy, the character with the highest sentiment score will most likely be the person who talks the most. The way that I tried to account for this project was instead of only calculating a sentiment score for each character, I created a dictionary for each character. In this dictionary I stored the sentiment score, positive words, negative words, total words, average sentiment per word (sentiment score/total words), percent of words that were positive and percent of words that were negative. I also included an 'emotion' variable, which is simply what % of a characters words were classified as either positive or negative

In [17]:
char_sent = {}
for pers in talk_only:
    totals = {'sentiment':0,'positive':0,'negative':0,'total_words':0,'Avg_sent':0,'Pct_Pos':0,'Pct_Neg':0, 'Emotion' : 0}
    for word in talk_only[pers]:
        totals['total_words'] += 1
        if word in sentiment_scores:
            sco = sentiment_scores[word]
            totals['sentiment'] += sco
            if sco == 1:
                totals['positive'] += 1
            if sco == -1:
                totals['negative'] += 1
    
    if totals['total_words'] > 0:
        totals['Avg_sent'] = round(totals['sentiment'] / totals['total_words'],3)
        totals['Pct_Pos'] = round(totals['positive'] / totals['total_words'],3)
        totals['Pct_Neg'] = round(totals['negative'] / totals['total_words'],3)
        totals['Emotion'] = round((totals['negative'] + totals['positive'] ) /totals['total_words'],3)
    
    char_sent[pers] = totals
        

Below is a sample of the output from the char_sent dictionary.

In [18]:
char_sent['Michael']

{'sentiment': 3172,
 'positive': 6196,
 'negative': 3024,
 'total_words': 42278,
 'Avg_sent': 0.075,
 'Pct_Pos': 0.147,
 'Pct_Neg': 0.072,
 'Emotion': 0.218}

Next, I print the sorted values for the average emotion, percent negative, percent positive, and % emotion words. In the dicionaries that I create, I am filtering only the characters who speak > 1000 words to filter out the characters who may have only spoken a sentence or two that was havily positive or negative. This will also allow us to see both ends of the spectrum in the same print. In addition to the prints, I create dictionaries for the 'Avg_sent', 'Pct_Post', 'Pct_Neg' and 'Emotion'. These dictionaries take a character as a key and the value is a list of their ranks in these different sorted lists.

In this first code block and print, we can see that the character with the highest sentiment is David Wallace. He only appears in a a couple seasons of the show, but is generally up positive and happy, hence the high score. Second highest sentiment score was Ryan, which also makes sense, while he is not a super up-beat character, he is always talking about big ideas and business plans that he has and is generally in very good spirits. The next 3 characters are Holly, Pam, Phyllis who are all very cheery and generally happy characters.

The worst sentiment score (notably, with a score still > 0), was Meredith. She is portrayed in the show as having a pretty tough life and a lot of things go wrong for her so this score makes sense. The next lowest characters are Creed, Dwight, Kelly, Gabe and Stanley. This group of characters is typically the brutally honest and pesimistic group, so this scoring makes sense.

In [19]:
top_sents = {}

for char in char_sent:
    if char_sent[char]['total_words'] > 1000:
        top_sents[char] = char_sent[char]['Avg_sent']

sort_sent = sorted(top_sents.items(), key=lambda x: x[1], reverse=True)

for i in sort_sent:
    print(i[0], i[1])

David Wallace 0.091
Ryan 0.087
Holly 0.083
Pam 0.079
Phyllis 0.077
Angela 0.077
Michael 0.075
Jim 0.073
Andy 0.071
Erin 0.07
Nellie 0.07
Jan 0.069
Robert California 0.069
Deangelo 0.069
Kevin 0.066
Oscar 0.06
Toby 0.06
Darryl 0.06
Stanley 0.055
Gabe 0.054
Kelly 0.05
Dwight 0.049
Creed 0.045
Meredith 0.028


Next, I print and create the dictionary for the percent of negative words. Unsurprisingly, this list is very close to a reversed list of the sentiment score printed above. Below that is the printed rankings by percent positive.

In [20]:
top_neg = {}

for char in char_sent:
    if char_sent[char]['total_words'] > 1000:
        top_neg[char] = char_sent[char]['Pct_Neg']
        
sort_neg = sorted(top_neg.items(), key=lambda x: x[1], reverse=True)

for i in sort_neg:
    print(i[0], i[1])

Meredith 0.1
Kelly 0.089
Dwight 0.086
Creed 0.086
Gabe 0.084
Nellie 0.084
Stanley 0.078
Andy 0.077
Oscar 0.076
Angela 0.076
Toby 0.076
Robert California 0.076
Darryl 0.074
Michael 0.072
Jan 0.072
Holly 0.07
Deangelo 0.07
Kevin 0.069
Erin 0.068
Jim 0.063
David Wallace 0.061
Pam 0.06
Phyllis 0.058
Ryan 0.058


In [22]:
top_pos = {}

for char in char_sent:
    if char_sent[char]['total_words'] > 1000:
        top_pos[char] = char_sent[char]['Pct_Pos']
        
sort_pos = sorted(top_pos.items(), key=lambda x: x[1], reverse=True)

for i in sort_pos:
    print(i[0], i[1])

Angela 0.153
Holly 0.153
Nellie 0.153
David Wallace 0.152
Andy 0.149
Michael 0.147
Ryan 0.145
Robert California 0.145
Jan 0.141
Kelly 0.14
Pam 0.139
Deangelo 0.139
Erin 0.138
Gabe 0.138
Oscar 0.136
Toby 0.136
Jim 0.135
Dwight 0.135
Phyllis 0.135
Kevin 0.135
Stanley 0.134
Darryl 0.134
Creed 0.131
Meredith 0.128


Lastly I print the result of the emotion variable that I created. I created this varible as a way to see what % of the characters transcript had words that were emotion based (positive or negative), this could show the characters that show the most emotions, not necessarily positive or negative. Note that here I am only using total words that aren't in stop words from the lists created earlier.

In [23]:
emot = {}

for char in char_sent:
    if char_sent[char]['total_words'] > 1000:
        emot[char] = char_sent[char]['Emotion']

sort_emot = sorted(emot.items(), key=lambda x: x[1], reverse=True)

for i in sort_emot:
    print(i[0], i[1])

Nellie 0.237
Angela 0.229
Kelly 0.229
Meredith 0.227
Andy 0.226
Holly 0.223
Gabe 0.222
Dwight 0.221
Robert California 0.221
Michael 0.218
Creed 0.218
Jan 0.213
Toby 0.213
David Wallace 0.213
Stanley 0.212
Oscar 0.212
Deangelo 0.209
Darryl 0.208
Erin 0.207
Kevin 0.205
Ryan 0.202
Pam 0.2
Jim 0.198
Phyllis 0.193


Unsurprisingly, the people who appear at the top of the emotions variable rank are the characters who were at the very top (Angela, Holly) and very bottom (Kelly, Meredith, Gabe, Dwight) of the average sentiment rankings.

Next, I created dictionaries to store the characters ranks in each of these four rankings I created above. I store these 4 answers as a list, and then also classify people as 'positive' or 'negative' if their ranking in one of the categories is >= 5, the other ranking. The output of this dictionary is shown.

In [24]:
char_ranks = {}
final_emot_rk = {}

for rank, pers in enumerate(sort_pos):
    char_ranks[pers[0]] = [rank+1,0,0,0,""]
    
for rank, pers in enumerate(sort_neg):
    char_ranks[pers[0]][1] = rank+1
    if(char_ranks[pers[0]][1] < (char_ranks[pers[0]][0] - 4)):
        char_ranks[pers[0]][4] = "Negative"
    if(char_ranks[pers[0]][1] > (char_ranks[pers[0]][0] + 4)):
        char_ranks[pers[0]][4] = "Positive"
    net_pos_rank = char_ranks[pers[0]][1] - char_ranks[pers[0]][0]
    char_ranks[pers[0]][3] = net_pos_rank
    final_emot_rk[pers[0]] = net_pos_rank

for rank, pers in enumerate(sort_emot):
    char_ranks[pers[0]][2] = rank+1

    
print(char_ranks)

{'Angela': [1, 10, 2, 9, 'Positive'], 'Holly': [2, 16, 6, 14, 'Positive'], 'Nellie': [3, 6, 1, 3, ''], 'David Wallace': [4, 21, 14, 17, 'Positive'], 'Andy': [5, 8, 5, 3, ''], 'Michael': [6, 14, 10, 8, 'Positive'], 'Ryan': [7, 24, 21, 17, 'Positive'], 'Robert California': [8, 12, 9, 4, ''], 'Jan': [9, 15, 12, 6, 'Positive'], 'Kelly': [10, 2, 3, -8, 'Negative'], 'Pam': [11, 22, 22, 11, 'Positive'], 'Deangelo': [12, 17, 17, 5, 'Positive'], 'Erin': [13, 19, 19, 6, 'Positive'], 'Gabe': [14, 5, 7, -9, 'Negative'], 'Oscar': [15, 9, 16, -6, 'Negative'], 'Toby': [16, 11, 13, -5, 'Negative'], 'Jim': [17, 20, 23, 3, ''], 'Dwight': [18, 3, 8, -15, 'Negative'], 'Phyllis': [19, 23, 24, 4, ''], 'Kevin': [20, 18, 20, -2, ''], 'Stanley': [21, 7, 15, -14, 'Negative'], 'Darryl': [22, 13, 18, -9, 'Negative'], 'Creed': [23, 4, 11, -19, 'Negative'], 'Meredith': [24, 1, 4, -23, 'Negative']}


In the code chunk above, I also created a dictionary that maps a character to their net_ranking. This would correlate to, how much higher are they ranked in the 'percent of positive words' rankings than in the 'percent of negative words' rankings. This should give us a good idea of who is the most consistently positive (meaning without using a lot of negative words as well). The output is printed below. This is a similar list to the general sentiment score, with a few exceptions. In a list of 25 characters, this is not a very telling analysis. However, if we were to change the code to include all characters who spoke > 10 words, it would be hard to differentiate the people whose positive rank was 100 and negative rank was 700 as positive, but this method can help sort out these lists.

In [25]:
sort_final = sorted(final_emot_rk.items(), key=lambda x: x[1], reverse=True)

print("Rk)  Character   Net Positive")
for rk, i in enumerate(sort_final):
    print(rk+1,"  ", i[0], " ", i[1])

Rk)  Character   Net Positive
1    David Wallace   17
2    Ryan   17
3    Holly   14
4    Pam   11
5    Angela   9
6    Michael   8
7    Jan   6
8    Erin   6
9    Deangelo   5
10    Robert California   4
11    Phyllis   4
12    Nellie   3
13    Andy   3
14    Jim   3
15    Kevin   -2
16    Toby   -5
17    Oscar   -6
18    Kelly   -8
19    Gabe   -9
20    Darryl   -9
21    Stanley   -14
22    Dwight   -15
23    Creed   -19
24    Meredith   -23


In [26]:
# This just writes the original {person : all text} dictionary to a to a txt file.

header = ["Character", "Words"]
with open('office-transcript.txt','w', encoding='utf-8') as out_file:
    out_file.write('\t'.join(header)+'\n')
    
    for idx, key in enumerate(chardict):
        outl = [key, str(chardict[key])]
        
        out_file.write('\t'.join(outl) + '\n')

# Results

While there were not a lot of "wow" discoveries in this data analysis, I think that being able to pull raw json files of trascripts and properly dig through and classify them - then being able to confirm these results through analysis that would make complete sense to someone who watches the show is pretty cool.

Some of the things I thought were cool takeaways were:

- There were 700 unique characters who appeared on the office in all 9 seasons, and they spoke a total of > 620,000 words.
- The main character Michael Scott spoke 159,101 words over the course of the show. This was 25.7% of the total words spoken on the show, and only appeared in the first 7 seasons (and 1 episode in season 9).
- All of the main characters (>1000 spoken words), had sentiment scores of > 0.
- 569 of the 700 unique characters (81.3%) only appeared in 1 of the 9 seasons of the show.
- The expected positive and negative characters scored about exactly as a viewer would've expected, which is cool to see the confimation of the data and the accuracy of the sentiment classifier used.

Some futures directions that someone could go with this analysis are:

- Look at show sentiment (from all characters) for each specific episode. You could then create some visuals of the trends of the emotions thoughout the course of the show.
- Look at how closely the sentiment of the the title of an episode matches with the sentiment of the show.
- For characters that appear less frequently we could look at sentiments of the shows that they are in and see if they bring positive or negative emotions into the show.