### POS tag list

**CC**	coordinating conjunction<br>
**CD**	cardinal number<br>
**DT**	determiner<br>
**EX**	existential there (like: "there is" ... think of it like "there exists")<br>
**FW**	foreign word<br>
**IN**	preposition/subordinating conjunction<br>
**JJ**	adjective	'big'<br>
**JJR**	adjective, comparative	'bigger'<br>
**JJS**	adjective, superlative	'biggest'<br>
**LS**	list marker	1)<br>
**MD**	modal	could, will<br>
**NN**	noun, singular 'desk'<br>
**NNS**	noun, plural	'desks'<br>
**NNP**	proper noun, singular	'Harrison'<br>
**NNPS**	proper noun, plural	'Americans'<br>
**PDT**	predeterminer	'all the kids'<br>
**POS**	possessive ending	parent's<br>
**PRP**	personal pronoun	I, he, she<br>
**PRP\$**	possessive pronoun	my, his, hers<br>
**RB**	adverb	very, silently<br>
**RBR**	adverb, comparative	better<br>
**RBS**	adverb, superlative	best<br>
**RP**	particle	give up<br>
**TO**	to	go 'to' the store.<br>
**UH**	interjection	errrrrrrrm<br>
**VB**	verb, base form	take<br>
**VBD**	verb, past tense	took<br>
**VBG**	verb, gerund/present participle	taking<br>
**VBN**	verb, past participle	taken<br>
**VBP**	verb, non-3rd person sing. present	take<br>
**VBZ**	verb, 3rd person sing. present	takes<br>
**WDT**	wh-determiner	which<br>
**WP**	wh-pronoun	who, what<br>
**WP\$**	possessive wh-pronoun	whose<br>
**WRB**	wh-adverb	where, when<br>
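
The tags above fall into prefix families (NN* for nouns, VB* for verbs, JJ* for adjectives, RB* for adverbs), which makes coarse filtering straightforward. A minimal sketch, assuming tokens are already tagged as (word, tag) pairs in the shape `nltk.pos_tag` returns; the sample pairs and the `coarse_pos` helper are hypothetical and written by hand for illustration:

```python
# Map Penn Treebank tag prefixes to coarse categories (illustrative subset).
COARSE = {'NN': 'noun', 'VB': 'verb', 'JJ': 'adjective', 'RB': 'adverb'}

def coarse_pos(tag):
    '''Return a coarse category for a Penn Treebank tag, or 'other'.'''
    return COARSE.get(tag[:2], 'other')

# Hand-tagged sample; real input would come from a tagger.
tagged = [('Harrison', 'NNP'), ('took', 'VBD'), ('the', 'DT'),
          ('biggest', 'JJS'), ('desks', 'NNS'), ('silently', 'RB')]

nouns = [word for word, tag in tagged if coarse_pos(tag) == 'noun']
print(nouns)
```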

### Stopwords

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",<br>
"you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', <br>
'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's",<br>
'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', <br>
'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', <br>
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', <br>
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',<br>
'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about',<br>
'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',<br>
'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',<br>
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',<br>
'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some',<br>
'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',<br>
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',<br>
'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't",<br>
'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', <br>
"haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn',<br>
"needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', <br>
"weren't", 'won', "won't", 'wouldn', "wouldn't"]

In [123]:
import requests
import re
from bs4 import BeautifulSoup as BS
from pprint import pprint
from collections import Counter
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
def get_corpus(url):
    '''Pass in url. Retrieve content and return list of paragraphs.'''
    content = requests.get(url).text
    soup = BS(content, 'lxml')
    paragraphs = soup.find_all('p')
    paragraphs = [p.get_text() for p in paragraphs]
    return paragraphs

In [3]:
def view_tails(paragraphs, start=10, end=10):
    '''Pass in list of paragraphs and print out the first `start` and
    last `end` paragraphs (10 each by default).
    Use to quickly check what needs to be trimmed.
    '''
    pprint(paragraphs[:start])
    print('\n')
    pprint(paragraphs[-end:])

In [4]:
def replace_curly_str(paragraph):
    '''Pass in string and replace all curly quotes
    with standard quotes.
    '''
    return paragraph.replace("“", '"').replace("”", '"').replace("’", "'")

def replace_curly_quotes(*corpora):
    '''Pass in 1 or more corpora and replace all curly quotes
    with standard quotes. Return list of cleaned corpora.
    '''
    return [replace_curly_str(corpus) for corpus in corpora]

In [5]:
def clean_corpus(paragraphs, trim_start=0, trim_end=0, delete=None, replace_quotes=False, *args):
    '''Pass in list of paragraphs, # of paragraphs to drop from beginning
    and end, filler paragraphs to delete, bool specifying whether to replace 
    curly quotes, and optional tuples of format (to_replace, replace_with).
    '''
    # Trim list first before any paragraphs are deleted, changing the length.
    trim_end = len(paragraphs) - trim_end
    paragraphs = paragraphs[trim_start:trim_end]
    
    # Replace curly quotes and delete filler paragraphs. Check which params
    # were passed in so we only have to iterate over the list once here.
    if replace_quotes and delete is not None:
        paragraphs = [replace_curly_str(p) for p in paragraphs if p != delete]
    elif replace_quotes:
        paragraphs = [replace_curly_str(p) for p in paragraphs]
    elif delete is not None:
        paragraphs = [p for p in paragraphs if p != delete]
    
    # Make any miscellaneous replacements specified by user.
    for arg in args:
        paragraphs = [p.replace(arg[0], arg[1]) for p in paragraphs]
    return paragraphs
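
`clean_corpus` converts `trim_end` into an absolute end index instead of slicing with `-trim_end`. One reason this matters: with the default `trim_end=0`, a slice that ends at `-0` is empty. A small sketch on a made-up list:

```python
paragraphs = ['header', 'p1', 'p2', 'p3', 'footer']  # made-up data

trim_start, trim_end = 1, 0

# Negative-index slicing breaks when trim_end is 0, because -0 == 0.
naive = paragraphs[trim_start:-trim_end]

# Converting to an absolute end index handles 0 correctly.
safe = paragraphs[trim_start:len(paragraphs) - trim_end]

print(naive)
print(safe)
```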

In [6]:
# def separate_speakers(paragraphs, speakers, suffix=':'):
#     '''Pass in list of paragraphs and list of 2 speaker names. Return dict
#     with list of paragraphs for each speaker.
#     '''
#     speaker_lists = {}
#     speaker_suffs = [s + suffix for s in speakers]
#     speaker_lower = [s.lower() for s in speakers]
#     for s_suff, s_low in zip(speaker_suffs, speaker_lower):
#         speaker_lists[s_low] = [p.replace(s_suff, '').strip() for p in paragraphs\
#                                 if p.startswith(s_suff)]
#     speaker_lists['unlabeled'] = [p for p in paragraphs if not\
#                                   any(p.startswith(s) for s in speaker_suffs)]
#     #stripped_paragraphs = [p.strip(s_suff).strip() for p in paragraphs for s_suff in speaker_suffs]
#     #speaker_lists['unlabeled'] = list(set(stripped_paragraphs) - \
#     #    (set(speaker_lists[speaker_lower[0]]) | set(speaker_lists[speaker_lower[1]])))
#     return speaker_lists

In [60]:
# def separate_speakers(paragraphs, speakers, suffix=':'):
#     '''Pass in list of paragraphs and list of 2 speaker names. Return dict
#     with list of paragraphs for each speaker.
#     '''
#     speaker_suffs = [s + suffix for s in speakers]
#     speaker_lower = [s.lower() for s in speakers]
#     speaker_lists = {s: [] for s in speaker_lower}
#     speaker_lists['unlabeled'] = []
#     for p in paragraphs:
#         assigned = False
#         for s_suff, s_low in zip(speaker_suffs, speaker_lower):
#             if p.startswith(s_suff):
#                 speaker_lists[s_low].append(p.replace(s_suff, '').strip())
#                 assigned = True
#         if not assigned:
#             speaker_lists['unlabeled'].append(p)
#     return speaker_lists

In [59]:
def separate_speakers(paragraphs, speakers, suffix=':'):
    '''Pass in list of paragraphs and list of speaker names. Return dict
    with list of paragraphs for each speaker.
    
    Paragraphs without a speaker label are assigned to the most recent
    speaker and also stored in 'unlabeled' as (speaker, paragraph) tuples
    for manual checking.
    '''
    speaker_suffs = [s + suffix for s in speakers]
    speaker_lower = [s.lower() for s in speakers]
    speaker_lists = {s: [] for s in speaker_lower}
    speaker_lists['unlabeled'] = []
    previous_speaker = None
    
    for p in paragraphs:
        assigned = False
        for s_suff, s_low in zip(speaker_suffs, speaker_lower):
            if p.startswith(s_suff):
                # Strip only the leading label, not later occurrences.
                speaker_lists[s_low].append(p[len(s_suff):].strip())
                assigned = True
                previous_speaker = s_low
                break
        if not assigned:
            # Guard against unlabeled text before the first labeled paragraph.
            if previous_speaker is not None:
                speaker_lists[previous_speaker].append(p)
            speaker_lists['unlabeled'].append((previous_speaker, p))
    return speaker_lists
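
To see the "assign to previous speaker" idea on a small input, here is a simplified sketch. The `assign_speakers` helper and the toy transcript are hypothetical and not part of the notebook; it returns flat tuples rather than the dict used above. Slicing off the prefix, rather than calling `str.replace`, leaves later occurrences of a label inside the text intact.

```python
def assign_speakers(paragraphs, speakers, suffix=':'):
    '''Toy version: return list of (speaker, text, was_labeled) tuples.'''
    prefixes = {s + suffix: s.lower() for s in speakers}
    previous = None
    rows = []
    for p in paragraphs:
        for prefix, name in prefixes.items():
            if p.startswith(prefix):
                # Slice off only the leading label, not later occurrences.
                rows.append((name, p[len(prefix):].strip(), True))
                previous = name
                break
        else:
            # No label found: inherit the most recent speaker (may be None).
            rows.append((previous, p, False))
    return rows

transcript = ['Joe: Welcome back.', 'Jordan: Thanks.',
              'Jordan: As I said to Joe: sort yourself out.',
              'It takes time.']
rows = assign_speakers(transcript, ['Joe', 'Jordan'])
for row in rows:
    print(row)
```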

# refactoring 1070

In [9]:
# IN PROGRESS - REPLACING ABOVE CELL 1 WITH FUNCTION CALLS
url_1070 = r'https://erikamentari.wordpress.com/2018/02/27/jre-1070-jordan-peterson-transcript/'
raw_1070 = get_corpus(url_1070)

In [10]:
# IN PROGRESS - REPLACING BELOW CELL 2 WITH FUNCTION CALLS
cleaned_1070 = clean_corpus(raw_1070, 8, 10, None, True, ('Joe Rogan: ', 'Joe: '), 
                            ('Dr Jordan B Peterson: ', 'Jordan: '))
joined_1070 = ' '.join(cleaned_1070)

print(len(cleaned_1070))
print(len(joined_1070))

611
163379


In [11]:
# IN PROGRESS...
p_1070 = separate_speakers(cleaned_1070, ['Joe', 'Jordan'])
p_1070['joe'].insert(2, p_1070['unlabeled'].pop(0))
for key, val in p_1070.items():
    print('Paragraph count ({}): \n{}\n'.format(key.title(), len(val)))

Paragraph count (Joe): 
306

Paragraph count (Jordan): 
304

Paragraph count (Unlabeled): 
1



In [61]:
text_1070 = separate_speakers(cleaned_1070, ['Joe', 'Jordan'])
text_1070['unlabeled']

[('joe',
  'Boom and we\'re live. 12 Rules for Life. So without reading this…"So, what you\'re saying is…"'),
 ('joe', '18:18')]

## Load Episode 1070

In [12]:
# REPLACING
url = r'https://erikamentari.wordpress.com/2018/02/27/jre-1070-jordan-peterson-transcript/'

content = requests.get(url).text
soup = BS(content, 'lxml')
paragraphs = soup.find_all('p')
paragraphs = [p.get_text().replace('Joe Rogan: ', 'Joe: ')\
              .replace('Dr Jordan B Peterson: ', "Jordan: ")\
              for p in paragraphs[8:-10]]
paragraph_text = ' '.join(paragraphs)

print('Paragraphs:', len(paragraphs))
print('Characters:', len(paragraph_text))

Paragraphs: 611
Characters: 163379


In [13]:
# REPLACING
rogan_p = [p.replace('Joe:', '').strip() for p in paragraphs if p.startswith('Joe:')]
peterson_p = [p.replace('Jordan:', '').strip() for p in paragraphs if p.startswith('Jordan:')]
unlabeled = [p for p in paragraphs if not p.startswith('Joe:') and not p.startswith('Jordan:')]
rogan_p.insert(1, unlabeled.pop(0))

print('Rogan paragraph count:', len(rogan_p))
print('Peterson paragraph count:', len(peterson_p))
print('Unassigned text:', unlabeled)

Rogan paragraph count: 306
Peterson paragraph count: 304
Unassigned text: ['18:18']


In [14]:
rogan_words = ' '.join(rogan_p).split(' ')
peterson_words = ' '.join(peterson_p).split(' ')
rogan_count, peterson_count = len(rogan_words), len(peterson_words)
total_count = rogan_count + peterson_count
rogan_percent = rogan_count / total_count
peterson_percent = peterson_count / total_count

print('Total word count:', total_count)
print('Rogan word count:', rogan_count)
print('Peterson word count:', peterson_count)
print('% of conversation (Rogan): {}%'.format(round(100 * rogan_percent, 2)))
print('% of conversation (Peterson): {}%'.format(round(100 * peterson_percent, 2)))

Total word count: 29119
Rogan word count: 6736
Peterson word count: 22383
% of conversation (Rogan): 23.13%
% of conversation (Peterson): 76.87%
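
The counts above come from `str.split(' ')`, which treats every single space as a separator, so consecutive spaces produce empty strings that are counted as words. Whether the transcript actually contains such runs is not checked here; a quick illustration on a made-up string:

```python
text = 'So  what you are saying is'  # made-up; note the double space

by_space = text.split(' ')      # splits on each single space
by_whitespace = text.split()    # splits on any run of whitespace

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```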


In [15]:
common_rogan = Counter(rogan_words).most_common()
common_peterson = Counter(peterson_words).most_common()

print('Most common words (Rogan)\n\n' + str(common_rogan[:25]))
print('\nMost common words (Peterson)\n\n' + str(common_peterson[:25]))

Most common words (Rogan)

[('to', 187), ('of', 182), ('the', 178), ('I', 170), ('and', 157), ('a', 155), ('that', 145), ('you', 127), ('is', 126), ('this', 91), ('in', 87), ('people', 65), ('And', 64), ('what', 59), ('have', 59), ('it', 56), ('about', 53), ('think', 51), ('was', 50), ('are', 47), ('you’re', 42), ('not', 42), ('it’s', 41), ('with', 38), ('one', 35)]

Most common words (Peterson)

[('the', 868), ('to', 614), ('and', 524), ('you', 517), ('of', 498), ('a', 493), ('that', 466), ('I', 430), ('is', 291), ('in', 262), ('And', 244), ('it', 219), ('it’s', 195), ('like,', 189), ('that’s', 181), ('It’s', 174), ('what', 168), ('they', 156), ('have', 149), ('so', 144), ('was', 135), ('for', 134), ('people', 133), ('be', 128), ('do', 125)]
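
The lists above count `And` and `and` separately, and `like,` keeps its trailing comma, because the words were split on spaces without any normalization. A minimal sketch of one way to merge such variants before counting, using a made-up sample; the choice to strip curly quotes as well is an assumption based on the curly apostrophes visible in the output:

```python
import string
from collections import Counter

words = ['And', 'and', 'like,', 'like', 'It’s', 'it’s', 'people.']  # made-up sample

# string.punctuation is ASCII only, so add the curly quote characters.
strip_chars = string.punctuation + '“”‘’'

# Strip punctuation from both ends (inner apostrophes survive), then lowercase.
normalized = [w.strip(strip_chars).lower() for w in words]
normalized = [w for w in normalized if w]  # drop tokens that were only punctuation

counts = Counter(normalized)
print(counts.most_common())
```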


In [16]:
rogan_corpus = ' '.join(rogan_words)
peterson_corpus = ' '.join(peterson_words)

In [17]:
# replace curly quotes, then tokenize
rogan_corpus, peterson_corpus = replace_curly_quotes(rogan_corpus, peterson_corpus)
rogan_word_toke, rogan_p_toke = word_tokenize(rogan_corpus), sent_tokenize(rogan_corpus)
peterson_word_toke, peterson_p_toke = word_tokenize(peterson_corpus), sent_tokenize(peterson_corpus)

print('Rogan words: ' + str(len(rogan_word_toke)))
print('Peterson words: ' + str(len(peterson_word_toke)))

Rogan words: 8000
Peterson words: 27448


In [18]:
stop_words = stopwords.words('english')
rogan_word_toke = [w for w in rogan_word_toke if w.lower() not in stop_words]
peterson_word_toke = [w for w in peterson_word_toke if w.lower() not in stop_words]

print('Rogan words (stop words removed): ' + str(len(rogan_word_toke)))
print('Peterson words (stop words removed): ' + str(len(peterson_word_toke)))
print('\nRogan most common words:\n' + str(Counter(rogan_word_toke).most_common(25)))
print('\nPeterson most common words\n' + str(Counter(peterson_word_toke).most_common(25)))

Rogan words (stop words removed): 4205
Peterson words (stop words removed): 15093

Rogan most common words:
[('.', 449), (',', 368), ("'s", 148), ('people', 75), ("'re", 73), ('?', 62), ('think', 53), ('like', 45), ("n't", 45), ('one', 35), ('things', 30), (';', 25), ("'m", 24), ('Yeah', 24), ('Right', 23), ('know', 23), ('Yes', 22), ('get', 22), ('going', 22), ('mean', 21), ('way', 20), ("'ve", 20), ('saying', 19), ('right', 17), ('Well', 17)]

Peterson most common words
[(',', 1853), ('.', 1348), ("'s", 816), ('like', 308), ("n't", 230), ("'re", 200), ('?', 191), ('well', 165), (';', 163), ('people', 157), ('know', 150), ('think', 120), ('Yeah', 99), ('Well', 97), ('going', 94), ("'m", 84), ('right', 81), ('say', 80), ('things', 70), ('way', 65), ('want', 65), ('one', 61), ('life', 58), ('thing', 58), ('get', 58)]
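
Even with stop words removed, the lists above are dominated by punctuation and contraction fragments such as `'s` and `n't`: the stopword list contains `s` and `t` but not the forms the tokenizer emits. A minimal sketch of one possible extra filter, using a hand-copied subset of the stopword list and made-up tokens (both are illustrative assumptions):

```python
# Hand-copied subset of the stopword list, stored as a set for fast lookups.
stop_subset = {'i', 'the', 'is', 'a', 'to', 'it', 'do'}

# Made-up tokens in the style of word_tokenize output.
tokens = ['I', 'think', 'it', "'s", 'a', 'guide', 'to', 'life', ',', 'do',
          "n't", 'you', '?']

# Keep tokens that are not stopwords and consist only of letters;
# str.isalpha() rejects punctuation as well as fragments like "'s" and "n't".
content_words = [t for t in tokens
                 if t.lower() not in stop_subset and t.isalpha()]
print(content_words)
```

Note that `str.isalpha()` also drops numbers and hyphenated words, which may or may not be wanted.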


In [19]:
rogan_tagged = nltk.pos_tag(rogan_word_toke)
peterson_tagged = nltk.pos_tag(peterson_word_toke)

In [20]:
def chunk(tagged):
    '''Pass in tagged word tokens and find chunks.'''
    chunk_gram = '''chunk_name: {<JJ.?>+<NN.?>+}'''
    chunk_parser = nltk.RegexpParser(chunk_gram)
    chunked = chunk_parser.parse(tagged)
    #chunked.draw()
    return chunked
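
The grammar `{<JJ.?>+<NN.?>+}` asks for one or more adjective tags followed by one or more noun tags. The sketch below is a dependency-free approximation of that rule, not NLTK's implementation; `adjective_noun_chunks` and the tag sets are hypothetical. It assumes `.?` allows at most one extra character, so JJ/JJR/JJS and NN/NNS/NNP match while the four-letter NNPS does not; that reading is worth checking against NLTK itself.

```python
# Tags accepted by <JJ.?> and <NN.?> under the reading described above.
ADJ_TAGS = {'JJ', 'JJR', 'JJS'}
NOUN_TAGS = {'NN', 'NNS', 'NNP'}

def adjective_noun_chunks(tagged):
    '''Return a list of word lists, one per run of 1+ adjectives then 1+ nouns.'''
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        while j < n and tagged[j][1] in ADJ_TAGS:
            j += 1
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:
            k += 1
        if j > i and k > j:
            chunks.append([word for word, _ in tagged[i:k]])
            i = k          # continue after the chunk
        else:
            i += 1         # no chunk starts here
    return chunks

# Hand-copied (word, tag) pairs taken from the chunker output in this notebook.
tagged = [('guest', 'NN'), ('today', 'NN'), ('great', 'JJ'), ('powerful', 'JJ'),
          ('Jordan', 'NNP'), ('Peterson', 'NNP'), ('.', '.')]
result = adjective_noun_chunks(tagged)
print(result)
```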

In [21]:
chunk(rogan_tagged[:100])

LookupError: 

===========================================================================
NLTK was unable to find the gs file!
Use software specific configuration paramaters or set the PATH environment variable.
===========================================================================

Tree('S', [('guest', 'NN'), ('today', 'NN'), Tree('chunk_name', [('great', 'JJ'), ('powerful', 'JJ'), ('Jordan', 'NNP'), ('Peterson', 'NNP')]), ('.', '.'), ('Jordan', 'NNP'), ('podcast', 'NN'), ('multiple', 'NN'), ('times', 'NNS'), (',', ','), ("'s", 'VBZ'), ('one', 'CD'), Tree('chunk_name', [('favorite', 'JJ'), ('human', 'JJ'), ('beings', 'NNS'), ('talk', 'NN')]), (',', ','), ("'m", 'VBP'), Tree('chunk_name', [('happy', 'JJ'), ('exists', 'NNS')]), ('.', '.'), ("'s", 'POS'), Tree('chunk_name', [('brilliant', 'JJ'), ('man', 'NN')]), ('amazing', 'VBG'), ('book', 'NN'), ('right', 'NN'), ("'s", 'POS'), ('called', 'VBN'), ('twelve', 'NN'), ('rules', 'NNS'), ('life', 'NN'), (':', ':'), Tree('chunk_name', [('antidote', 'JJ'), ('chaos', 'NN')]), ('one', 'CD'), Tree('chunk_name', [('important', 'JJ'), ('messages', 'NNS')]), Tree('chunk_name', [('brilliant', 'JJ'), ('man', 'NN')]), (',', ','), ('please', 'VB'), Tree('chunk_name', [('welcome', 'JJ'), ('Jordan', 'NNP'), ('Peterson', 'NNP')]), ('.'

In [24]:
# import string

# print(string.punctuation)
# remove_punct = str.maketrans(string.punctuation, ' '*len(string.punctuation))
# a = 'abdlkja;c,3dk!.ejf}]dk'
# print(a)
# print(a.translate(remove_punct))
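
For reference, a runnable version of the commented-out idea above: build a translation table that maps each ASCII punctuation character to a space, then apply it with `str.translate`. This is a sketch of that one technique only; `string.punctuation` does not include curly quotes, so those would need separate handling.

```python
import string

# Translation table mapping every ASCII punctuation character to a space.
remove_punct = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

a = 'abdlkja;c,3dk!.ejf}]dk'
cleaned = a.translate(remove_punct)
print(cleaned)
print(cleaned.split())
```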

## Episode 877

In [74]:
url2 = ('https://beyondhumannature.wordpress.com/2018/03/12/joe-rogan-experience'
        '-ep-877-with-jordan-peterson-transcript/')
raw_p2 = get_corpus(url2)

In [77]:
p2 = clean_corpus(raw_p2, 11, 16, '\xa0', True)
texts = separate_speakers(p2, ['ROGAN', 'PETERSON'])
texts['unlabeled'] = texts['unlabeled'][1]
texts['rogan'].remove(texts['unlabeled'][1])

print('Raw text paragraphs:', len(raw_p2))
print('Cleaned text paragraphs:', len(p2))
print('Rogan paragraphs:', len(texts['rogan']))
print('Peterson paragraphs:', len(texts['peterson']))
print('\nUnassigned paragraphs:', texts['unlabeled'][1])

Raw text paragraphs: 423
Cleaned text paragraphs: 389
Rogan paragraphs: 178
Peterson paragraphs: 210

Unassigned paragraphs: [MUSIC 172:38]


In [140]:
cv = CountVectorizer()
tfidf = TfidfVectorizer()

rogan_corpus_877 = ' '.join(texts['rogan'])
cv_count_877 = cv.fit_transform([rogan_corpus_877])
tfidf_count_877 = tfidf.fit_transform([rogan_corpus_877])

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
rogan_words_877 = np.array(cv.get_feature_names_out())
rogan_counts_877 = cv_count_877.toarray().flatten()
rogan_tfidf_877 = tfidf_count_877.toarray().flatten()

In [145]:
tfidf2 = TfidfVectorizer(stop_words='english')
df_877_stop = pd.DataFrame(
    {'scaled_count': tfidf2.fit_transform([rogan_corpus_877]).toarray().flatten(),
    'word': np.array(tfidf2.get_feature_names_out())})
df_877_stop.sort_values('scaled_count', ascending=False)

| | scaled_count | word |
|---|---|---|
| 521 | 0.458033 | people |
| 400 | 0.286270 | just |
| 721 | 0.278091 | think |
| 424 | 0.253554 | like |
| 620 | 0.171762 | right |
| 773 | 0.171762 | ve |
| 210 | 0.147225 | don |
| 794 | 0.147225 | way |
| 309 | 0.139046 | going |
| 630 | 0.139046 | say |
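
The appearance of `ve` and `don` in the table is consistent with the default tokenization of scikit-learn's text vectorizers, which lowercase the text and keep only runs of two or more word characters. Contractions are therefore split at the apostrophe and single letters are dropped. A sketch that applies the same default pattern with `re` directly (the sample sentence is made up):

```python
import re

# scikit-learn's documented default token_pattern for its text vectorizers.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sample = "I don't think I've seen it"
tokens = re.findall(TOKEN_PATTERN, sample.lower())
print(tokens)
```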


In [143]:
df_877 = pd.DataFrame({'word': rogan_words_877, 'count': rogan_counts_877,
                       'scaled_count': rogan_tfidf_877})
df_877.sort_values('scaled_count', ascending=False)

| | count | scaled_count | word |
|---|---|---|---|
| 846 | 180 | 0.375681 | the |
| 39 | 177 | 0.369419 | and |
| 845 | 165 | 0.344374 | that |
| 992 | 152 | 0.317241 | you |
| 583 | 136 | 0.283848 | of |
| 870 | 119 | 0.248367 | to |
| 469 | 101 | 0.210799 | it |
| 465 | 98 | 0.204537 | is |
| 860 | 79 | 0.164882 | this |
| 853 | 71 | 0.148185 | they |
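
Because the vectorizer is fitted on a single document, every term has the same document frequency, so the IDF factor is identical for all terms. `scaled_count` then reduces to the raw count divided by the L2 norm of the count vector (assuming `TfidfVectorizer`'s default `norm='l2'`). The table above agrees: 180/177 and 0.375681/0.369419 are both about 1.017. A minimal sketch with a made-up three-word vocabulary:

```python
import math

counts = {'the': 180, 'and': 177, 'that': 165}  # made-up three-word vocabulary

# With one document the IDF weight is the same constant for every term,
# so it cancels out when the vector is L2-normalized.
norm = math.sqrt(sum(c ** 2 for c in counts.values()))
scaled = {word: c / norm for word, c in counts.items()}

for word, value in scaled.items():
    print(word, round(value, 6))
```

To get scores that distinguish words, the vectorizer would need to be fitted on several documents, for example one per speaker or per episode.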
