### POS tag list

**CC**	coordinating conjunction<br>
**CD**	cardinal number<br>
**DT**	determiner<br>
**EX**	existential there (like: "there is" ... think of it like "there exists")<br>
**FW**	foreign word<br>
**IN**	preposition/subordinating conjunction<br>
**JJ**	adjective	'big'<br>
**JJR**	adjective, comparative	'bigger'<br>
**JJS**	adjective, superlative	'biggest'<br>
**LS**	list marker	1)<br>
**MD**	modal	could, will<br>
**NN**	noun, singular	'desk'<br>
**NNS**	noun plural	'desks'<br>
**NNP**	proper noun, singular	'Harrison'<br>
**NNPS**	proper noun, plural	'Americans'<br>
**PDT**	predeterminer	'all the kids'<br>
**POS**	possessive ending	parent's<br>
**PRP**	personal pronoun	I, he, she<br>
**PRP\$**	possessive pronoun	my, his, hers<br>
**RB**	adverb	very, silently<br>
**RBR**	adverb, comparative	better<br>
**RBS**	adverb, superlative	best<br>
**RP**	particle	give up<br>
**TO**	to	go 'to' the store.<br>
**UH**	interjection	errrrrrrrm<br>
**VB**	verb, base form	take<br>
**VBD**	verb, past tense	took<br>
**VBG**	verb, gerund/present participle	taking<br>
**VBN**	verb, past participle	taken<br>
**VBP**	verb, sing. present, non-3rd person	take<br>
**VBZ**	verb, 3rd person sing. present	takes<br>
**WDT**	wh-determiner	which<br>
**WP**	wh-pronoun	who, what<br>
**WP\$**	possessive wh-pronoun	whose<br>
**WRB**	wh-adverb	where, when<br>
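
Related tags in the list above share a prefix (noun tags start with `NN`, verb tags with `VB`), so they can be collapsed into coarse groups. A minimal sketch, assuming tagged tokens in the `(word, tag)` shape that `nltk.pos_tag` returns. The `COARSE_PREFIXES` mapping and `coarse_tag` helper are hypothetical names, and the sample pairs are hand-written rather than real tagger output.

```python
from collections import Counter

# Hypothetical mapping from tag prefix to a coarse category.
COARSE_PREFIXES = [
    ('NN', 'noun'),
    ('VB', 'verb'),
    ('JJ', 'adjective'),
    ('RB', 'adverb'),
    ('PRP', 'pronoun'),
    ('WP', 'pronoun'),
]

def coarse_tag(tag):
    '''Map a Penn Treebank tag to a coarse category by prefix.'''
    for prefix, category in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return category
    return 'other'

# Hand-written (word, tag) pairs for illustration only.
sample = [('Harrison', 'NNP'), ('took', 'VBD'), ('the', 'DT'),
          ('biggest', 'JJS'), ('desks', 'NNS'), ('silently', 'RB')]
coarse_counts = Counter(coarse_tag(tag) for _, tag in sample)
print(coarse_counts)
```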

### Stopwords

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",<br>
"you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', <br>
'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's",<br>
'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', <br>
'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', <br>
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', <br>
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',<br>
'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about',<br>
'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',<br>
'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',<br>
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',<br>
'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some',<br>
'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',<br>
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',<br>
'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't",<br>
'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', <br>
"haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn',<br>
"needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', <br>
"weren't", 'won', "won't", 'wouldn', "wouldn't"]

In [1]:
import requests
import re
from bs4 import BeautifulSoup as BS
from collections import Counter
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from pprint import pprint

In [2]:
def get_corpus(url):
    '''Pass in url. Retrieve content and return list of paragraphs.'''
    content = requests.get(url).text
    soup = BS(content, 'lxml')
    paragraphs = soup.find_all('p')
    return [p.get_text() for p in paragraphs]

In [3]:
def view_tails(paragraphs, start=10, end=10):
    '''Pass in list of paragraphs and print the first `start` and last `end`
    paragraphs (10 each by default). Use to quickly check what needs to be trimmed.'''
    pprint(paragraphs[:start])
    print('\n')
    pprint(paragraphs[-end:])

In [4]:
def clean_corpus(paragraphs, trim_start=0, trim_end=0, delete=None, *args):
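    '''Pass in list of paragraphs, # of paragraphs to drop from beginning
    and end, a paragraph value to delete, and tuples of format
    (to_replace, replace_with). Note: the end index is computed from the
    list length before `delete` is applied.'''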
    trim_end = len(paragraphs) - trim_end
    if delete is not None:
        paragraphs = [p for p in paragraphs if p != delete]
    paragraphs = paragraphs[trim_start:trim_end]
    for arg in args:
        paragraphs = [p.replace(arg[0], arg[1]) for p in paragraphs]
    return paragraphs
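
In `clean_corpus`, the end index is computed before the `delete` filter runs, so every deleted paragraph reduces how many are trimmed from the end. One possible alternative is to trim by position first and delete afterwards, so the counts line up with what `view_tails` shows on the raw list. This is only a sketch: `clean_corpus_trim_first` is a hypothetical name, not a function from this notebook, and the toy list is made up.

```python
def clean_corpus_trim_first(paragraphs, trim_start=0, trim_end=0,
                            delete=None, *replacements):
    '''Hypothetical variant: trim by position first, then delete and replace.'''
    # Slice against the raw list so the counts match the untouched input.
    end = len(paragraphs) - trim_end
    trimmed = paragraphs[trim_start:end]
    if delete is not None:
        trimmed = [p for p in trimmed if p != delete]
    for old, new in replacements:
        trimmed = [p.replace(old, new) for p in trimmed]
    return trimmed

# Made-up input: one header, one footer, two filler paragraphs.
raw = ['header', '\xa0', 'Joe Rogan: Hi.', '\xa0', 'Guest: Hello.', 'footer']
cleaned = clean_corpus_trim_first(raw, 1, 1, '\xa0', ('Joe Rogan: ', 'Joe: '))
print(cleaned)
```

With the original ordering, the same call would keep `'footer'`, because the two deletions shrink the list before the slice is taken.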

In [55]:
def separate_speakers(paragraphs, suffix=': ', *args):
    '''Pass in list of paragraphs and speaker names as strings. Return dict
    mapping each lower-cased speaker name to that speaker's paragraphs, plus
    an 'unlabeled' key for paragraphs that match no speaker.'''
    speaker_paragraphs = {}
    for arg in args:
        arg_suff = arg + suffix
        speaker_paragraphs[arg.lower()] = [p for p in paragraphs if p.startswith(arg_suff)]
    speaker_paragraphs['unlabeled'] = [p for p in paragraphs 
                       if not any(p in ls for ls in speaker_paragraphs.values())]
    return speaker_paragraphs
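
The `unlabeled` step above checks every paragraph against every speaker list. A single-pass variant that assigns each paragraph to one bucket as it goes might look like the sketch below. `separate_speakers_single_pass` is a hypothetical name, and the toy transcript is invented for illustration.

```python
def separate_speakers_single_pass(paragraphs, suffix=': ', *speakers):
    '''Hypothetical variant: bucket each paragraph in one pass.'''
    buckets = {name.lower(): [] for name in speakers}
    buckets['unlabeled'] = []
    for p in paragraphs:
        for name in speakers:
            if p.startswith(name + suffix):
                buckets[name.lower()].append(p)
                break
        else:
            # No speaker prefix matched this paragraph.
            buckets['unlabeled'].append(p)
    return buckets

# Invented transcript lines, including one stray timestamp.
toy = ['ROGAN: Welcome back.', 'PETERSON: Thanks.', '18:18', 'ROGAN: So...']
result = separate_speakers_single_pass(toy, ': ', 'ROGAN', 'PETERSON')
print({key: len(val) for key, val in result.items()})
```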

In [6]:
def replace_curly_quotes(*corpora):
    '''Pass in 1 or more corpora and replace all curly quotes
    with standard quotes.
    '''
    output = []
    for corpus in corpora:
        new_corpus = corpus.replace("“", '"').replace("”", '"').replace("’", "'")
        output.append(new_corpus)
    return output

In [50]:
a = ['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr', 'stu', 'vwx', 'yz']
b = [group for i, group in enumerate(a) if i % 2 == 0]
c = [group for i, group in enumerate(a) if i < 3 or i > 6]
a_dict = {'b': b,
         'c': c}
print(a_dict)
d = [group for group in a if not any(group in val for val in a_dict.values())]
print(d)

print(a_dict.values())

def f_args(name, age=3, **kwargs):
    for key, val in kwargs.items():
        print(val)
        print(key)
f_args(name='jason', age=4, test=5, major='steven')

{'b': ['abc', 'ghi', 'mno', 'stu', 'yz'], 'c': ['abc', 'def', 'ghi', 'vwx', 'yz']}
['jkl', 'pqr']
dict_values([['abc', 'ghi', 'mno', 'stu', 'yz'], ['abc', 'def', 'ghi', 'vwx', 'yz']])
5
test
steven
major


## Load Episode 1070

In [8]:
url = r'https://erikamentari.wordpress.com/2018/02/27/jre-1070-jordan-peterson-transcript/'

content = requests.get(url).text
soup = BS(content, 'lxml')
paragraphs = soup.find_all('p')
paragraphs = [p.get_text().replace('Joe Rogan: ', 'Joe: ')\
              .replace('Dr Jordan B Peterson: ', "Jordan: ")\
              for p in paragraphs[8:-10]]
paragraph_text = ' '.join(paragraphs)

print('Paragraphs:', len(paragraphs))
print('Characters:', len(paragraph_text))

Paragraphs: 611
Characters: 163379


In [54]:
# IN PROGRESS - REPLACING ABOVE CELL WITH NEW FUNCTION CALLS
test_p = get_corpus(url)
cleaned_test = clean_corpus(test_p, 8, 10, None, ('Joe Rogan: ', 'Joe: '), 
                            ('Dr Jordan B Peterson: ', 'Jordan: '))
test_text = ' '.join(cleaned_test)

print(len(cleaned_test))
print(len(test_text))

test_dict = separate_speakers(cleaned_test, ': ', 'Joe', 'Jordan')
for key, val in test_dict.items():
    print(key, len(val))
cleaned_test == test_dict['unlabeled']

611
163379
Joe 303
Jordan 304
unlabeled 4


False

In [10]:
rogan_p = [p.replace('Joe:', '').strip() for p in paragraphs if p.startswith('Joe:')]
peterson_p = [p.replace('Jordan:', '').strip() for p in paragraphs if p.startswith('Jordan:')]
unlabeled = [p for p in paragraphs if not p.startswith('Joe:') and not p.startswith('Jordan:')]
rogan_p.insert(1, unlabeled.pop(0))

print('Rogan paragraph count:', len(rogan_p))
print('Peterson paragraph count:', len(peterson_p))
print('Unassigned text:', unlabeled)

Rogan paragraph count: 306
Peterson paragraph count: 304
Unassigned text: ['18:18']


In [11]:
rogan_words = ' '.join(rogan_p).split(' ')
peterson_words = ' '.join(peterson_p).split(' ')
rogan_count, peterson_count = len(rogan_words), len(peterson_words)
total_count = rogan_count + peterson_count
rogan_percent = rogan_count / total_count
peterson_percent = peterson_count / total_count

print('Total word count:', total_count)
print('Rogan word count:', rogan_count)
print('Peterson word count:', peterson_count)
print('% of conversation (Rogan): {}%'.format(round(100* rogan_percent, 2)))
print('% of conversation (Peterson): {}%'.format(round(100 * peterson_percent, 2)))

Total word count: 29119
Rogan word count: 6736
Peterson word count: 22383
% of conversation (Rogan): 23.13%
% of conversation (Peterson): 76.87%


In [12]:
common_rogan = Counter(rogan_words).most_common()
common_peterson = Counter(peterson_words).most_common()

print('Most common words (Rogan)\n\n' + str(common_rogan[:25]))
print('\nMost common words (Peterson)\n\n' + str(common_peterson[:25]))

Most common words (Rogan)

[('to', 187), ('of', 182), ('the', 178), ('I', 170), ('and', 157), ('a', 155), ('that', 145), ('you', 127), ('is', 126), ('this', 91), ('in', 87), ('people', 65), ('And', 64), ('what', 59), ('have', 59), ('it', 56), ('about', 53), ('think', 51), ('was', 50), ('are', 47), ('you’re', 42), ('not', 42), ('it’s', 41), ('with', 38), ('one', 35)]

Most common words (Peterson)

[('the', 868), ('to', 614), ('and', 524), ('you', 517), ('of', 498), ('a', 493), ('that', 466), ('I', 430), ('is', 291), ('in', 262), ('And', 244), ('it', 219), ('it’s', 195), ('like,', 189), ('that’s', 181), ('It’s', 174), ('what', 168), ('they', 156), ('have', 149), ('so', 144), ('was', 135), ('for', 134), ('people', 133), ('be', 128), ('do', 125)]
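
The counts above are case-sensitive and keep attached punctuation, so `'And'` and `'and'`, or `'like,'` and `'like'`, are tallied separately. A minimal sketch of normalizing before counting. `normalize_words` is a hypothetical helper and the sample sentence is made up.

```python
import string
from collections import Counter

def normalize_words(words):
    '''Lower-case each word and strip leading/trailing ASCII punctuation.'''
    cleaned = (w.lower().strip(string.punctuation) for w in words)
    # Drop tokens that were empty or punctuation only.
    return [w for w in cleaned if w]

# Made-up text that mixes case and trailing commas.
sample_words = 'And then, like, it’s and It’s like -- and'.split(' ')
counts = Counter(normalize_words(sample_words))
print(counts.most_common(3))
```

Curly apostrophes are not in `string.punctuation`, so contractions such as `it’s` pass through unchanged.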


In [13]:
rogan_corpus = ' '.join(rogan_words)
peterson_corpus = ' '.join(peterson_words)

In [14]:
# replace curly quotes, then tokenize
rogan_corpus, peterson_corpus = replace_curly_quotes(rogan_corpus, peterson_corpus)
rogan_word_toke, rogan_p_toke = word_tokenize(rogan_corpus), sent_tokenize(rogan_corpus)
peterson_word_toke, peterson_p_toke = word_tokenize(peterson_corpus), sent_tokenize(peterson_corpus)

print('Rogan words: ' + str(len(rogan_word_toke)))
print('Peterson words: ' + str(len(peterson_word_toke)))

Rogan words: 8000
Peterson words: 27448


In [15]:
stop_words = stopwords.words('english')
rogan_word_toke = [w for w in rogan_word_toke if w.lower() not in stop_words]
peterson_word_toke = [w for w in peterson_word_toke if w.lower() not in stop_words]

print('Rogan words (stop words removed): ' + str(len(rogan_word_toke)))
print('Peterson words (stop words removed): ' + str(len(peterson_word_toke)))
print('\nRogan most common words:\n' + str(Counter(rogan_word_toke).most_common(25)))
print('\nPeterson most common words\n' + str(Counter(peterson_word_toke).most_common(25)))

Rogan words (stop words removed): 4205
Peterson words (stop words removed): 15093

Rogan most common words:
[('.', 449), (',', 368), ("'s", 148), ('people', 75), ("'re", 73), ('?', 62), ('think', 53), ('like', 45), ("n't", 45), ('one', 35), ('things', 30), (';', 25), ("'m", 24), ('Yeah', 24), ('Right', 23), ('know', 23), ('Yes', 22), ('get', 22), ('going', 22), ('mean', 21), ('way', 20), ("'ve", 20), ('saying', 19), ('right', 17), ('Well', 17)]

Peterson most common words
[(',', 1853), ('.', 1348), ("'s", 816), ('like', 308), ("n't", 230), ("'re", 200), ('?', 191), ('well', 165), (';', 163), ('people', 157), ('know', 150), ('think', 120), ('Yeah', 99), ('Well', 97), ('going', 94), ("'m", 84), ('right', 81), ('say', 80), ('things', 70), ('way', 65), ('want', 65), ('one', 61), ('life', 58), ('thing', 58), ('get', 58)]
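
Even with stop words removed, the top of both lists is dominated by punctuation and contraction fragments (`'.'`, `','`, `"'s"`, `"n't"`). One simple option is to keep only purely alphabetic tokens. This is a blunt filter that also drops numbers and hyphenated words, so treat it as a sketch. `keep_alphabetic` is a hypothetical helper, and the tokens are hand-written in the style of `word_tokenize` output.

```python
from collections import Counter

def keep_alphabetic(tokens):
    '''Keep only tokens made up entirely of letters.'''
    return [t for t in tokens if t.isalpha()]

# Hand-written tokens for illustration only.
sample_tokens = ['people', '.', "'s", 'think', ',', "n't", 'people', '?', 'Yeah']
filtered = keep_alphabetic(sample_tokens)
print(Counter(filtered).most_common(2))
```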


In [16]:
rogan_tagged = nltk.pos_tag(rogan_word_toke)
peterson_tagged = nltk.pos_tag(peterson_word_toke)

In [18]:
def chunk(tagged):
    '''Pass in tagged word tokens and find chunks.'''
    chunk_gram = '''chunk_name: {<JJ.?>+<NN.?>+}'''
    chunk_parser = nltk.RegexpParser(chunk_gram)
    chunked = chunk_parser.parse(tagged)
    #chunked.draw()
    return chunked
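
The grammar `{<JJ.?>+<NN.?>+}` asks for one or more adjective tags (`JJ`, `JJR`, `JJS`) followed by one or more noun tags (`NN`, `NNS`, `NNP`). To make that pattern concrete without NLTK, the sketch below approximates it in plain Python. It is not how `nltk.RegexpParser` is implemented. `find_adj_noun_chunks` is a hypothetical helper, and the tagged tokens are typed in by hand.

```python
import re

ADJ = re.compile(r'JJ.?')   # JJ, JJR, JJS
NOUN = re.compile(r'NN.?')  # NN, NNS, NNP (NNPS is too long to match)

def find_adj_noun_chunks(tagged):
    '''Return runs of one or more JJ* tags followed by one or more NN* tags.'''
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and ADJ.fullmatch(tagged[j][1]):
            j += 1
        k = j
        while k < len(tagged) and NOUN.fullmatch(tagged[k][1]):
            k += 1
        if j > i and k > j:
            # At least one adjective and at least one noun: record the chunk.
            chunks.append(tagged[i:k])
            i = k
        else:
            i += 1
    return chunks

# Hand-typed (word, tag) pairs for illustration only.
tagged = [('guest', 'NN'), ('today', 'NN'), ('great', 'JJ'), ('powerful', 'JJ'),
          ('Jordan', 'NNP'), ('Peterson', 'NNP'), ('.', '.'), ('brilliant', 'JJ'),
          ('man', 'NN'), ('amazing', 'VBG')]
for chunk_tokens in find_adj_noun_chunks(tagged):
    print(' '.join(word for word, _ in chunk_tokens))
```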

In [19]:
chunk(rogan_tagged[:100])

LookupError: 

===========================================================================
NLTK was unable to find the gs file!
Use software specific configuration paramaters or set the PATH environment variable.
===========================================================================

Tree('S', [('guest', 'NN'), ('today', 'NN'), Tree('chunk_name', [('great', 'JJ'), ('powerful', 'JJ'), ('Jordan', 'NNP'), ('Peterson', 'NNP')]), ('.', '.'), ('Jordan', 'NNP'), ('podcast', 'NN'), ('multiple', 'NN'), ('times', 'NNS'), (',', ','), ("'s", 'VBZ'), ('one', 'CD'), Tree('chunk_name', [('favorite', 'JJ'), ('human', 'JJ'), ('beings', 'NNS'), ('talk', 'NN')]), (',', ','), ("'m", 'VBP'), Tree('chunk_name', [('happy', 'JJ'), ('exists', 'NNS')]), ('.', '.'), ("'s", 'POS'), Tree('chunk_name', [('brilliant', 'JJ'), ('man', 'NN')]), ('amazing', 'VBG'), ('book', 'NN'), ('right', 'NN'), ("'s", 'POS'), ('called', 'VBN'), ('twelve', 'NN'), ('rules', 'NNS'), ('life', 'NN'), (':', ':'), Tree('chunk_name', [('antidote', 'JJ'), ('chaos', 'NN')]), ('one', 'CD'), Tree('chunk_name', [('important', 'JJ'), ('messages', 'NNS')]), Tree('chunk_name', [('brilliant', 'JJ'), ('man', 'NN')]), (',', ','), ('please', 'VB'), Tree('chunk_name', [('welcome', 'JJ'), ('Jordan', 'NNP'), ('Peterson', 'NNP')]), ('.'

In [20]:
import string

print(string.punctuation)
remove_punct = str.maketrans(string.punctuation, ' '*len(string.punctuation))
a = 'abdlkja;c,3dk!.ejf}]dk'
print(a)
#b = 'abdlkja;c,3dk!.ejf}]dk'.remove(string.punctuation)
print(a.translate(remove_punct))
#print(b)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
abdlkja;c,3dk!.ejf}]dk
abdlkja c 3dk  ejf  dk
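
The cell above swaps each punctuation character for a space. If the goal is to delete punctuation instead, `str.maketrans` also accepts a third argument listing characters to remove. A short sketch using the same test string:

```python
import string

# Three-argument form: every character in the third argument is deleted.
delete_punct = str.maketrans('', '', string.punctuation)
sample = 'abdlkja;c,3dk!.ejf}]dk'
stripped = sample.translate(delete_punct)
print(stripped)
```

Deleting joins the neighboring characters, while the space-replacement version keeps the pieces apart for a later `split()`.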


In [27]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

## Episode 877

In [61]:
url2 = ('https://beyondhumannature.wordpress.com/2018/03/12/joe-rogan-experience'
        '-ep-877-with-jordan-peterson-transcript/')
raw_p2 = get_corpus(url2)
view_tails(raw_p2, 1, 18)
p2 = clean_corpus(raw_p2, 11, 16, '\xa0')
texts = separate_speakers(p2, ': ', 'ROGAN', 'PETERSON')

print(len(raw_p2))
print(len(p2))
print(len(texts['rogan']))
print(len(texts['peterson']))
print(len(texts['unlabeled']))

['Jordan Peterson, Podcast, Psychology']


['\xa0',
 'PETERSON: I’ve had lots, I’ve had lots of letters, obviously. Maybe, I don’t '
 'know, 2500 letters maybe to my email accounts now about this, and a very '
 'large number of them—maybe two hundred letters, a hundred and fifty to two '
 'hundred—have been from people on the radical left who’ve written to me and '
 'said that they can no longer speak because the authoritarian types, the PC '
 'authoritarians, have got so controlling that their once fashionable position '
 'is now being deemed unacceptable, and they’re alienated and excluded. When '
 'you see that happening with feminists like Germaine Greer, I mean—Germaine '
 'Greer who’s been banned from campuses—she’s not very happy with the idea '
 'that being a woman is something that’s been reduced to a whim, right, '
 'because she thinks that there’s more to being a woman than mere subjective '
 'choice. Well, that’s no longer a tenemal viewpoint on the left, and so '
 'people 

In [65]:
for i, text in enumerate(texts['unlabeled'][-8:]):
    print(i, text[:100])

0 —
1 

				View all posts by janearvine			

2 Fill in your details below or click an icon to log in:
3 

			You are commenting using your WordPress.com account.			
				( Log Out / 
				Change )
			


4 

			You are commenting using your Google+ account.			
				( Log Out / 
				Change )
			


5 

			You are commenting using your Twitter account.			
				( Log Out / 
				Change )
			


6 

			You are commenting using your Facebook account.			
				( Log Out / 
				Change )
			


7 Connecting to %s
