# PODCAST DATA

### Table of contents

- [Build Dataframe](#Build-dataframe)
- [Episode Counts](#Episode-counts)
- [Target Features](#Target-features)
- [Analysis](#Analysis)
- [Analysis Checkpoint 1](#Analysis-Checkpoint-1)
- [Analysis Checkpoint 2](#Analysis-Checkpoint-2)
- [Analysis Completed](#Analysis-completed)
- [Machine Learning Preparation](#ML-preparation)

In [1]:
import pandas as pd
import re
import pickle
import nltk
import spacy
from spacy.lang.en import English
import collections
from collections import Counter
from itertools import chain
import statistics
import math

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# initialize spacy objects
nlp = spacy.load('en_core_web_md')

In [3]:
%store -r Nightvale_df
%store -r myDNA_df
%store -r YWA_df
%store -r uu_df
%store -r radiolab_df
%store -r tal_df
%store -r bullseye
%store -r mother
%store -r hodgman
%store -r flophouse
%store -r switchblade
%store -r mbmbam
%store -r sawbones
%store -r wonderful
%store -r tgg
%store -r ffire
%store -r shmanners
%store -r taz
%store -r neoscum_df
%store -r allusionist_df


# %store -r freak_df
# %store -r Lore_df
# %store -r Invisible_df
# %store -r OnBeing_df
# %store -r StoryCorps_df

## Build dataframe
Read in each podcast's dataframes from spiders using jupyter notebook's %store -r function.
Manually construct main dataframe

In [4]:
# create a dataframe with all podcasts
data = pd.concat([Nightvale_df.reset_index(drop=True),
                  myDNA_df.reset_index(drop=True),
                  YWA_df.reset_index(drop=True),                  
                  uu_df.reset_index(drop=True),
                  radiolab_df.reset_index(drop=True),
                  tal_df.reset_index(drop=True),
                  bullseye,
                  mother,
                  hodgman,
                  flophouse,
                  switchblade,
                  mbmbam,
                  sawbones,
                  wonderful,
                  tgg,
                  ffire,
                  shmanners,
                  taz,
                  neoscum_df,
                  allusionist_df],
                  # StoryCorps_df, OnBeing_df, Invisible_df, Lore_df],
                  keys = ['Welcome to Nightvale','Move Your DNA','You\'re Wrong About','Unlocking Us',
                         'Radiolab','This American Life', 'Bullseye with Jesse Thorn','One Bad Mother',
                         'Judge John Hodgman','The Flophouse','Switchblade Sisters',
                         'MBMBaM','Sawbones','Wonderful','The Greatest Generation','Friendly Fire','Shmanners',
                          'The Adventure Zone','NeoScum', 'The Allusionist'], names=['podcast','#']).reset_index(level=1)

In [5]:
data.sample(10)
data.index.value_counts()

Unnamed: 0_level_0,#,Episode,Year,Title,Text,Podcast
podcast,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Shmanners,8,222.0,2020.0,ask shmanners idioms pt. 3,Shmanners 222: Ask Shmanners/Idioms Pt. 3 \nPu...,shmanners
Move Your DNA,22,34.0,,Thoughts on Incontinence,"[Episode 34: Thoughts on Incontinence, , Desc...",
The Greatest Generation,5,322.0,,final draft,Note: This show periodically replaces their ad...,the greatest generation
Radiolab,18,,2018.0,Gonads: X & Y,GONADS: X AND Y FINAL WEB TRANSCRIPT [ADVE...,
This American Life,509,512.0,,House Rules,"Prologue Ira Glass A few years back, w...",
This American Life,314,315.0,,The Parrot and the Potbellied Pig,Prologue Ira Glass When Rosie was a k...,
The Allusionist,19,9.0,2015.0,the space between,visit theallusionist.org/spaces to find out ...,
This American Life,347,348.0,,Tough Room,Prologue Ira Glass They can laugh abo...,
This American Life,108,109.0,,Notes on Camp,Prologue Ira Glass It's a typical camp...,
This American Life,664,667.0,,Wartime Radio,Prologue: Prologue Ira Glass When Dav...,


This American Life           734
Radiolab                     261
Welcome to Nightvale         180
Move Your DNA                108
The Allusionist               97
Bullseye with Jesse Thorn     63
MBMBaM                        32
Sawbones                      30
Wonderful                     29
One Bad Mother                29
Judge John Hodgman            28
Shmanners                     28
The Greatest Generation       28
Friendly Fire                 28
NeoScum                       20
You're Wrong About            19
The Adventure Zone            19
The Flophouse                 16
Switchblade Sisters           14
Unlocking Us                  12
Name: podcast, dtype: int64

In [6]:
data = data.drop(columns=['#', 'Podcast'])
data.sample(5)

Unnamed: 0_level_0,Episode,Year,Title,Text
podcast,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Welcome to Nightvale,171,2020.0,go to the mirror,What makes you You? Welcome to Night Vale. Do...
This American Life,243,,Later That Same Day,Prologue Ira Glass What can 20 years ...
This American Life,229,,Secret Government,Prologue Ira Glass John Podesta used t...
This American Life,405,,Inside Job,Prologue Ira Glass A friend of mine ra...
This American Life,255,,,Prologue Ira Glass One week before Ch...


## Episode counts

In [7]:
sum(data.index.value_counts())

1775

In [8]:
# get rid of texts less than 6500 characters
podcast_df = pd.DataFrame()

for i in range(len(data)):
    if len(data.iloc[i, 3]) > 6500:  # I kept changing this number to see what returned, this gets rid of the erroneous text
        podcast_df = podcast_df.append(data.iloc[i, :])

In [9]:
sum(podcast_df.index.value_counts())

1436

In [10]:
podcast_df.sample(10)

Unnamed: 0,Episode,Text,Title,Year
Radiolab,,"JAD ABUMRAD: Before we start, a quick heads...",In the No Part 2,2018.0
This American Life,554.0,Prologue Ira Glass Adriana was talking...,Not It!,
This American Life,370.0,Prologue Ira Glass There are people wh...,Ruining It for the Rest of Us,
Welcome to Nightvale,10.0,"Regret nothing, until it is too late. Then reg...",feral dogs,2012.0
NeoScum,7.0,"Mike Migdall (MM): It was really cool because,...",Walking the Edge,
Shmanners,218.0,Shmanners 218: Animal Crossing \nPublished Jul...,animal crossing,2020.0
Sawbones,355.0,Sawbones 355: The Great Smog \nPublished 2nd F...,the great smog,2021.0
This American Life,65.0,"Passing Ira Glass From WBEZ Chicago, i...",Who's Canadian?,
This American Life,182.0,Prologue Ira Glass Joe worked at this ...,Cringe,
This American Life,643.0,Prologue Ira Glass A couple months ag...,Damned If You Do…,


## Target features

- Number of hosts: whole number = number of regular hosts, 0.5 represents if the podcast regularly has guests
- genre (aka Tag 1)
- topic (aka Tag 2)
- scripted/unscripted
- fiction/nonfiction
- format: "chat" indicates general, unfocused conversation, and "recap" indicates specfic topic discussion
- rating: from iTunes, range from 4.6 to 4.9

This has been the hardest part of the project so far.  A lot of these categories are open to interpretation.

In [11]:
pod_feats = [['Welcome to Nightvale', 1, ['comedy', 'sci-fi'], 'scripted', 'fiction', 'news', 4.8],
             ['Move Your DNA', 1.5, ['health', 'fitness'], 'unscripted', 'nonfiction', 'chat', 4.8],
             ['You\'re Wrong About', 2, ['history', 'education'], 'unscripted', 'nonfiction', 'chat', 4.6],
             ['Unlocking Us', 1.5, ['health', 'lifestyle'], 'unscripted', 'nonfiction', 'interview', 4.6],
             ['Radiolab', 2, ['society', 'education'], 'unscripted', 'nonfiction', 'storytelling', 4.7],
             ['This American Life', 1.5, ['society','history'], 'unscripted', 'nonfiction', 'storytelling', 4.6],
             ['Bullseye with Jesse Thorn' , 1.5, ['comedy', 'society'], 'unscripted', 'nonfiction', 'interview', 4.7],
             ['One Bad Mother', 2.5, ['comedy', 'parenting'], 'unscripted', 'nonfiction', 'chat', 4.7],
             ['Judge John Hodgman', 1.5, ['comedy, advice'], 'unscripted', 'nonfiction', 'chat', 4.8],
             ['The Flophouse' , 3, ['comedy', 'movies'], 'unscripted', 'nonfiction', 'recap', 4.8],
             ['Switchblade Sisters', 1.5, ['comedy', 'movies'], 'unscripted', 'nonfiction', 'chat', 4.9],
             ['MBMBaM', 3, ['comedy','advice'], 'unscripted', 'nonfiction', 'chat', 4.9],
             ['Sawbones', 2, ['history', 'medicine'], 'unscripted', 'nonfiction', 'storytelling', 4.8],
             ['Wonderful', 2, ['comedy', 'society'], 'unscripted', 'nonfiction', 'chat', 4.9],
             ['The Greatest Generation', 2, ['comedy', 'TV'], 'unscripted', 'nonfiction', 'recap', 4.9],
             ['Friendly Fire', 3, ['history', 'movies'], 'unscripted', 'nonfiction', 'recap', 4.6],
             ['Shmanners', 2, ['society', 'advice'], 'unscripted', 'nonfiction', 'chat', 4.8],
             ['The Adventure Zone', 4, ['games', 'RP'], 'unscripted', 'fiction', 'LARP', 4.9],
             ['NeoScum', 5, ['games', 'RP'], 'unscripted', 'fiction', 'LARP', 4.9],
             ['The Allusionist', 1, ['education', 'language'], 'scripted', 'nonfiction','storytelling', 4.8]]

# In case you're a cool person reading this and don't know, LARP is live action role playing.

In [12]:
pod_feats_df = pd.DataFrame(pod_feats, columns = ['podcast', 'Hosts', 'Genre-Topic', 
                                                  'Scripted/Un', 'Fiction/Non', 
                                                  'Format', 'Rating']).set_index('podcast')
pod_feats_df

Unnamed: 0_level_0,Hosts,Genre-Topic,Scripted/Un,Fiction/Non,Format,Rating
podcast,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Welcome to Nightvale,1.0,"[comedy, sci-fi]",scripted,fiction,news,4.8
Move Your DNA,1.5,"[health, fitness]",unscripted,nonfiction,chat,4.8
You're Wrong About,2.0,"[history, education]",unscripted,nonfiction,chat,4.6
Unlocking Us,1.5,"[health, lifestyle]",unscripted,nonfiction,interview,4.6
Radiolab,2.0,"[society, education]",unscripted,nonfiction,storytelling,4.7
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6
Bullseye with Jesse Thorn,1.5,"[comedy, society]",unscripted,nonfiction,interview,4.7
One Bad Mother,2.5,"[comedy, parenting]",unscripted,nonfiction,chat,4.7
Judge John Hodgman,1.5,"[comedy, advice]",unscripted,nonfiction,chat,4.8
The Flophouse,3.0,"[comedy, movies]",unscripted,nonfiction,recap,4.8


In [13]:
sum(podcast_df.index.value_counts())

1436

In [14]:
podcast_df = pod_feats_df.join(podcast_df, on='podcast', sort=True)
podcast_df.sample(10)

Unnamed: 0_level_0,Hosts,Genre-Topic,Scripted/Un,Fiction/Non,Format,Rating,Episode,Text,Title,Year
podcast,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
You're Wrong About,2.0,"[history, education]",unscripted,nonfiction,chat,4.6,,Sarah Marshall The internet is like a pipe fu...,Tipper Gore vs. Heavy Metal: The Case Against ...,2021.0
The Allusionist,1.0,"[education, language]",scripted,nonfiction,storytelling,4.8,17.0,"to hear this episode or read more about it, vi...","fix, part i",2015.0
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6,113.0,Prologue Ira Glass From WBEZ Chicago ...,Windfall,
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6,121.0,Prologue Ira Glass This story is alway...,Twentieth Century Man,
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6,242.0,Prologue Ira Glass Here's the story th...,Enemy Camp,
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6,468.0,Prologue Ira Glass This spring my fri...,Switcheroo,
Radiolab,2.0,"[society, education]",unscripted,nonfiction,storytelling,4.7,,(SOUNDBITE OF MUSIC) UNIDENTIFIED PERSON: L...,Deep Cuts,2021.0
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6,686.0,Prologue: Prologue Ira Glass It's nea...,Umbrellas Up,
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6,511.0,Prologue Sarah Koenig Can you just say...,The Seven Things You’re Not Supposed to Talk A...,
The Allusionist,1.0,"[education, language]",scripted,nonfiction,storytelling,4.8,61.0,visit theallusionist.org/graphology to read ...,in your hand,2017.0


In [15]:
len(podcast_df)

1443

### Text-processing functions

In [16]:
# anticipating a log of ugly floats
def percent(decimal):
    decimal *= 100
    percentage = '{:.3f}'.format(decimal)
    percentage = float(percentage)
    return percentage

percent(0.45615981981)

45.616

In [17]:
for t in podcast_df.Text:
    if isinstance(t, float):
        print(t)

nan
nan
nan
nan
nan
nan
nan


In [18]:
podcast_df = podcast_df[podcast_df['Text'].notna()]

In [19]:
len(podcast_df)

1436

## Analysis

Here, I added about 50 non-lexical features.  Almost all of them use custom functions.

In [20]:
# add Tokens column
podcast_df['Tokens'] = podcast_df.Text.map(nlp)

#### columns so far:
- Tokens (spacy)

In [21]:
podcast_df.Tokens.loc['Welcome to Nightvale'][2][:500]

And now, the news. Have any of our listeners seen the glowing cloud that has been moving in from the west? Well, John Peters, you know, the farmer? He saw it over the Western Ridge this morning, said he would have thought it was the setting sun if it wasn’t for the time of day. Apparently the cloud glows in a variety of colors, perhaps changing from observer to observer, although all report a low whistling when it draws near. One death has already been attributed to the glowcloud.  But listen, it’s probably nothing. If we had to shut down the town for every mysterious event that at least one death could be attributed to, we’d never have time to do anything, right? That’s what the Sheriff’s Secret Police are saying, and I agree, although I would not go so far as to endorse their suggestion to “run directly at the cloud, shrieking and waving your arms, just to see what it does.”  The Apache Tracker, and I remind you that this is that white guy who wears the huge and cartoonishly inaccura

In [22]:
def top50(Tokens):
    counts = Counter(t.text for t in Tokens if t.is_alpha)
    return counts.most_common(50)

In [23]:
# add Top50 column
podcast_df['Top50'] = podcast_df.Tokens.map(top50)

#### columns so far:
- Tokens (spacy)
- top50 Tokens

In [24]:
# add token count (transcript_length) column
podcast_df['Token_count'] = podcast_df.Tokens.map(len)

#### columns so far:
- Tokens (spacy)
- top50 (50 most common tokens)
- Token_count (transcript length

In [25]:
def word_len(Tokens):
    if len(Tokens) > 10:
        lengths = [(w, len(w.text)) for w in Tokens if w.is_alpha]
    else:
        lengths = [('null',0)]
    
    avg = statistics.mean([l[-1] for l in lengths])
    
    return lengths, avg

In [26]:
# add token length column
podcast_df['Token_lengths'] = podcast_df.Tokens.map(lambda x: word_len(x)[0])

In [27]:
# add average token length column
podcast_df['Avg_token_len'] = podcast_df.Tokens.map(lambda x: word_len(x)[1])

#### columns so far:
- Tokens  (spacy)
- top50  (50 most common tokens)
- Token_count  (transcript length)
- Token_lengths  (list of tuples: (token, length))
- Avg_token_len  (float, mean of all alphabetic token lengths)

In [28]:
# TTR
import random

def get_ttr(Tokens):
    if len(Tokens) > 1:
        lower = [t.text.lower() for t in Tokens if t.is_alpha]
        rand = random.randint(0, len(lower))
        chunk = lower[rand:(rand + 300)]
        ttr = percent(len(set(lower))/len(lower))
    else:
        ttr = 0
        
    return ttr

In [29]:
# add TTR column
podcast_df['TTR'] = podcast_df.Tokens.map(get_ttr)

#### columns so far:
- Tokens (spacy)
- top50 (50 most common tokens)
- Token_count (transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (type/token ratio measured against 300 characters)

In [30]:
# read in k-bands
import pickle
f = open('data/goog_kband.pkl','rb')
goog_kband = pickle.load(f)
f.close()

goog_kband['throughout']

2

In [31]:
def get_kband(Tokens):
    if len(Tokens) > 1:
        kbands = []
        for t in Tokens:
            if t.lemma_ in goog_kband:
                kbands.append((t, goog_kband[t.lemma_]))
        avg_kband = statistics.mean([t[1] for t in kbands])
    else:
        kbands = 0
        avg_kband = 0
    
    return kbands, avg_kband

In [32]:
# add ('word', kband) column
podcast_df['kband'] = podcast_df.Tokens.map(lambda x: get_kband(x)[0])

In [33]:
# add average kband column
podcast_df['Avg_kband'] = podcast_df.Tokens.map(lambda x: get_kband(x)[1])

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)

In [34]:
def bigrams(Tokens):
    if len(Tokens) > 1:
        bigrams = []
        for t in Tokens[:-1]:
            if t.text.isalpha() and Tokens[t.i + 1].text.isalpha():
                bigram = (t.text.lower(), Tokens[t.i + 1].text.lower())
                bigrams.append(bigram)
        counts = Counter(b for b in bigrams).most_common(25)
    else:
        bigrams = 'null'
        
    return bigrams

In [35]:
# add bigrams column
podcast_df['Bigrams'] = podcast_df.Tokens.map(lambda x: bigrams(x))

In [36]:
# add 25 most common bigram column
podcast_df['Bigram_top25'] = podcast_df.Bigrams.map(lambda x: Counter(x).most_common(25))

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)

In [37]:
# add (token, part-of-speech) column
podcast_df['POS'] = podcast_df.Tokens.map(lambda t: [(w, w.pos_) for w in t])

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)
- POS  (list of tuples as (token, spacy POS tag))

In [38]:
# weighs pos frequency against total text length
def POS_frequency(POS_text):
    counts = Counter(elem[-1].upper() for elem in POS_text)
    total = len(POS_text)
    
    pos_freq = {}
    for (pos, count) in counts.items():
        pos_freq[pos] = percent(count/total)
        
    return pos_freq

In [39]:
# add {part-of-speech: frequency} column
podcast_df['POS_freq'] = podcast_df.POS.map(POS_frequency)

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)
- POS  (list of tuples as (token, spacy POS tag))
- POS_freq  (dictionary as {POS: % of entire document})

In [40]:
podcast_df.POS_freq[1]
spacy.explain('SCONJ')

{'SPACE': 14.54,
 'VERB': 9.559,
 'PUNCT': 14.211,
 'DET': 7.671,
 'NOUN': 12.356,
 'ADV': 3.604,
 'PRON': 7.704,
 'ADP': 6.239,
 'ADJ': 3.348,
 'SCONJ': 1.036,
 'CCONJ': 2.2,
 'AUX': 3.359,
 'NUM': 3.053,
 'PROPN': 6.557,
 'X': 0.785,
 'PART': 1.638,
 'INTJ': 2.139}

'subordinating conjunction'

In [41]:
# add part-of-speech frequency columns
podcast_df['Noun_freq'] = podcast_df.POS_freq.map(lambda x: x.get('NOUN', 'null'))
podcast_df['Proper_noun_freq'] = podcast_df.POS_freq.map(lambda x: x.get('PROPN', 'null'))
podcast_df['Verb_freq'] = podcast_df.POS_freq.map(lambda x: x.get('VERB', 'null'))
podcast_df['Adj_freq'] = podcast_df.POS_freq.map(lambda x: x.get('ADJ', 'null'))
podcast_df['Adv_freq'] = podcast_df.POS_freq.map(lambda x: x.get('ADV', 'null'))
podcast_df['Interjection_freq'] = podcast_df.POS_freq.map(lambda x: x.get('INTJ', 'null'))
podcast_df['Preposition_freq'] = podcast_df.POS_freq.map(lambda x: x.get('ADP', 'null'))
podcast_df['Conjunction_freq'] = podcast_df.POS_freq.map(lambda x: x.get('SCONJ', 'null'))

## Analysis Checkpoint 1

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)
- POS  (list of tuples as (token, spacy POS tag))
- POS_freq  (dictionary as {POS: % of entire document})
- Noun_freq  (float, % of tokens that are nouns)
- Verb_freq  (float, % of tokens that are verbs)
- Adj_freq  (float, % of tokens that are adjectives)
- Adv_freq  (float, % of tokens that are adverbs)
- Interjection_freq  (float, % of tokens that are interjections)
- Preposition_freq  (float, % of tokens that are prepositions)
- Conjunction_freq  (float, % of tokens that are conjunctions)

In [42]:
podcast_df.sample(5)

Unnamed: 0_level_0,Hosts,Genre-Topic,Scripted/Un,Fiction/Non,Format,Rating,Episode,Text,Title,Year,...,POS,POS_freq,Noun_freq,Proper_noun_freq,Verb_freq,Adj_freq,Adv_freq,Interjection_freq,Preposition_freq,Conjunction_freq
podcast,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6,354.0,"Prologue Ira Glass OK, this just in. ...",Mistakes Were Made,,...,"[( , SPACE), (Prologue, NOUN), ( , SPACE)...","{'SPACE': 3.289, 'NOUN': 11.861, 'PROPN': 7.03...",11.861,7.03,12.552,4.352,6.037,0.886,7.535,1.268
Radiolab,2.0,"[society, education]",unscripted,nonfiction,storytelling,4.7,,JA: Jad Abumrad SA: Simon Adler AM: Annie M...,The Curious Case of the Russian Flash Mob ...,2018.0,...,"[( , SPACE), (JA, PROPN), (:, PUNCT), (Jad, PR...","{'SPACE': 4.55, 'PROPN': 5.918, 'PUNCT': 13.03...",13.292,5.918,12.385,4.307,5.023,2.224,7.771,1.086
Radiolab,2.0,"[society, education]",unscripted,nonfiction,storytelling,4.7,,[RADIOLAB INTRO] PAT WALTERS: Jad? JAD ...,Dispatches from 1918,2020.0,...,"[( , SPACE), ([, PUNCT), (RADIOLAB, PROPN), (I...","{'SPACE': 4.183, 'PUNCT': 18.314, 'PROPN': 9.5...",12.351,9.528,9.364,3.914,5.076,1.957,7.768,0.978
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6,664.0,Prologue: Prologue Ira Glass A man wa...,The Room of Requirement,,...,"[( , SPACE), (Prologue, NOUN), (:, PUNCT), (P...","{'SPACE': 5.659, 'NOUN': 12.68, 'PUNCT': 12.47...",12.68,8.391,10.574,4.219,5.899,0.813,8.577,0.96
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6,253.0,Prologue Ira Glass It took a week to ...,The Middle of Nowhere,,...,"[( , SPACE), (Prologue, NOUN), ( , SPACE)...","{'SPACE': 3.639, 'NOUN': 13.448, 'PROPN': 7.63...",13.448,7.634,11.893,4.152,5.864,0.877,8.668,1.34


In [43]:
def POS_length(POS_text):
    pos_dict = {'NOUN': 0, 'VERB': 0, 'ADV': 0, 'ADJ': 0}
    pron_dict = {'i': 0, 'you': 0, 'she': 0, 'he': 0, 'it': 0, 'they': 0, 'we': 0}
    for (token, pos) in POS_text:
        if pos in pos_dict.keys():
            pos_dict[pos] = (pos_dict[pos] + len(token.text))/2
        if token.text in pron_dict.keys():
            pron_dict[token.text] = pron_dict[token.text] + 1
    
    if sum(pron_dict.values()) != 0:
        pron_total = sum(pron_dict.values())
    
    if sum(pron_dict.values()) != 0:
        for (p, c) in pron_dict.items():
            pron_dict[p] = percent(c/pron_total)
    
    
    return pos_dict, pron_dict

# Average word length of each POS
# POS_length[0][0] = noun
#           [0][1] = verb
#           [0][2] = adv
#           [0][3] = adj

# Individual pronoun occurrence weighed against total # of pronouns
# POS_length[1][1] = 'i'
#           [1][2] = 'you'
#           [1][3] ='she'
#           [1][4] = 'he'
#           [1][5] = 'it'
#           [1][6] = 'they'
#           [1][7] = 'we'


In [44]:
podcast_df['POS_length'] = podcast_df.POS.map(lambda p: POS_length(p)[0])

In [45]:
podcast_df['Avg_noun_len'] = podcast_df.POS_length.map(lambda d: d['NOUN'])
podcast_df['Avg_verb_len'] = podcast_df.POS_length.map(lambda d: d['VERB'])
podcast_df['Avg_adj_len'] = podcast_df.POS_length.map(lambda d: d['ADJ'])
podcast_df['Avg_adv_len'] = podcast_df.POS_length.map(lambda d: d['ADV'])

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)
- POS  (list of tuples as (token, spacy POS tag))
- POS_freq  (dictionary as {POS: % of entire document})
- Noun_freq  (float, % of tokens that are nouns)
- Verb_freq  (float, % of tokens that are verbs)
- Adj_freq  (float, % of tokens that are adjectives)
- Adv_freq  (float, % of tokens that are adverbs)
- Interjection_freq  (float, % of tokens that are interjections)
- Preposition_freq  (float, % of tokens that are prepositions)
- Conjunction_freq  (float, % of tokens that are conjunctions)
- POS_length  (dictionary as {POS: average character length})
- Avg_noun_len  (float, average character length of all nouns)
- Avg_verb_len  (float, average character length of all verbs)
- Avg_adj_len  (float, average character length of all adjectives)
- Avg_adv_len  (float, average character length of all adverbs)

In [46]:
podcast_df['Pron_counts'] = podcast_df.POS.map(lambda p: POS_length(p)[1])

In [47]:
podcast_df['i_count'] = podcast_df.Pron_counts.map(lambda d: d['i'])
podcast_df['you_count'] = podcast_df.Pron_counts.map(lambda d: d['you'])
podcast_df['she_count'] = podcast_df.Pron_counts.map(lambda d: d['she'])
podcast_df['he_count'] = podcast_df.Pron_counts.map(lambda d: d['he'])
podcast_df['it_count'] = podcast_df.Pron_counts.map(lambda d: d['it'])
podcast_df['they_count'] = podcast_df.Pron_counts.map(lambda d: d['they'])
podcast_df['we_count'] = podcast_df.Pron_counts.map(lambda d: d['we'])

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)
- POS  (list of tuples as (token, spacy POS tag))
- POS_freq  (dictionary as {POS: % of entire document})
- Noun_freq  (float, % of tokens that are nouns)
- Verb_freq  (float, % of tokens that are verbs)
- Adj_freq  (float, % of tokens that are adjectives)
- Adv_freq  (float, % of tokens that are adverbs)
- Interjection_freq  (float, % of tokens that are interjections)
- Preposition_freq  (float, % of tokens that are prepositions)
- Conjunction_freq  (float, % of tokens that are conjunctions)
- POS_length  (dictionary as {POS: average character length})
- Avg_noun_len  (float, average character length of all nouns)
- Avg_verb_len  (float, average character length of all verbs)
- Avg_adj_len  (float, average character length of all adjectives)
- Avg_adv_len  (float, average character length of all adverbs)
- Pron_counts  (dictionary as {pronoun: % of all pronoun occurrence that this pronoun makes up})
- i_count  (float, % of pronouns that are 'i')
- you_count  (float, % of pronouns that are 'you')
- she_count  (float, % of pronouns that are 'she')
- he_count  (float, % of pronouns that are 'he')
- it_count  (float, % of pronouns that are 'it')
- they_count (float, % of pronouns that are 'they')
- we_count  (float, % of pronouns that are 'we')

In [48]:
podcast_df.POS_freq[0]
podcast_df.Noun_freq[0]

{'SPACE': 13.087,
 'VERB': 10.211,
 'PUNCT': 13.748,
 'DET': 7.533,
 'NOUN': 11.726,
 'ADV': 4.144,
 'PRON': 9.205,
 'ADP': 6.561,
 'ADJ': 3.428,
 'SCONJ': 1.184,
 'CCONJ': 2.545,
 'AUX': 3.705,
 'NUM': 2.457,
 'PROPN': 5.564,
 'INTJ': 2.087,
 'PART': 2.057,
 'X': 0.735,
 'SYM': 0.025}

11.726

In [49]:
# most common verb lemmas
def verb_lemmas(POS_text):
    counts = Counter(elem[0].lemma_ for elem in POS_text if elem[1] == 'VERB')
    
    verb_counter = {}
    for (verb, value) in counts.most_common(20):
        verb_counter[verb] = percent(value/sum(counts.values()))
        
    return verb_counter

verb_lemmas(podcast_df.POS[1])
# spacy thinks that an apostrophe is a verb?  Wonder why.

{'’': 9.848,
 'be': 5.594,
 'know': 4.604,
 'have': 4.196,
 'think': 3.322,
 'get': 3.147,
 'go': 2.972,
 'laugh': 2.739,
 'do': 2.681,
 'make': 2.564,
 'see': 2.273,
 'say': 1.573,
 'feel': 1.573,
 'gon': 1.34,
 'chuckle': 1.224,
 'watch': 1.224,
 'take': 1.049,
 'want': 1.049,
 'love': 1.049,
 'mean': 0.991}

In [50]:
# add verb_lemmas column
podcast_df['verb_lemmas'] = podcast_df.POS.map(verb_lemmas)

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)
- POS  (list of tuples as (token, spacy POS tag))
- POS_freq  (dictionary as {POS: % of entire document})
- Noun_freq  (float, % of tokens that are nouns)
- Verb_freq  (float, % of tokens that are verbs)
- Adj_freq  (float, % of tokens that are adjectives)
- Adv_freq  (float, % of tokens that are adverbs)
- Interjection_freq  (float, % of tokens that are interjections)
- Preposition_freq  (float, % of tokens that are prepositions)
- Conjunction_freq  (float, % of tokens that are conjunctions)
- POS_length  (dictionary as {POS: average character length})
- Avg_noun_len  (float, average character length of all nouns)
- Avg_verb_len  (float, average character length of all verbs)
- Avg_adj_len  (float, average character length of all adjectives)
- Avg_adv_len  (float, average character length of all adverbs)
- Pron_counts  (dictionary as {pronoun: % of all pronoun occurrence that this pronoun makes up})
- i_count  (float, % of pronouns that are 'i')
- you_count  (float, % of pronouns that are 'you')
- she_count  (float, % of pronouns that are 'she')
- he_count  (float, % of pronouns that are 'he')
- it_count  (float, % of pronouns that are 'it')
- they_count (float, % of pronouns that are 'they')
- we_count  (float, % of pronouns that are 'we')
- verb_lemmas  (dictionary of 20 most common verb lemmas as {lemma: % of all verbs that verb comprises})

In [51]:
# add sent_toks column
podcast_df['Sent_toks'] = podcast_df.Text.map(nltk.sent_tokenize)

In [52]:
# minor alteration to unit_len
def sent_len(doc):
    sentlens = []
    for c in doc:
        length = len([l for l in c.split()])
        sentlens.append((c, length))
        
    return sentlens

In [53]:
sent_len(podcast_df.Sent_toks[0][:10])

[('  \nNote: This show periodically replaces their ad breaks with new promotional clips.',
  12),
 ('Because of this, both the \ntranscription for the clips and the timestamps after them may be inaccurate at the time of viewing this \ntranscript.',
  24),
 ('00:00:00  Music  Music  “Service and Deployment,” composed by Mark Isham, from the \nalbum Megan Leavey (Original Motion Picture Soundtrack) plays as \nJohn speaks.',
  23),
 ('It is a minimalist, ethereal synth melody.', 7),
 ('00:00:01  John  Host  Now this is a dog movie, so a certain percentage of our audience \nRoderick  has already decided it’s a 5–Milk-Bone film or whatever without even \nwatching it.',
  30),
 ('The dog people, you know the ones I mean, the \nAnubisians.', 11),
 ('When I was a kid, dogs roamed around outside doing \ndog things like shitting everywhere and licking their peanuts and \nhumping each other and severely biting kids on the leg that were \nonly trying to ride their bikes to the Northway Mall and wh

In [54]:
podcast_df['Sent_length'] = podcast_df.Sent_toks.map(sent_len)

In [55]:
podcast_df['Avg_sent_len'] = podcast_df.Sent_length.map(lambda s: statistics.mean([t[-1] for t in s]))

## Analysis Checkpoint 2

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)
- POS  (list of tuples as (token, spacy POS tag))
- POS_freq  (dictionary as {POS: % of entire document})
- Noun_freq  (float, % of tokens that are nouns)
- Verb_freq  (float, % of tokens that are verbs)
- Adj_freq  (float, % of tokens that are adjectives)
- Adv_freq  (float, % of tokens that are adverbs)
- Interjection_freq  (float, % of tokens that are interjections)
- Preposition_freq  (float, % of tokens that are prepositions)
- Conjunction_freq  (float, % of tokens that are conjunctions)
- POS_length  (dictionary as {POS: average character length})
- Avg_noun_len  (float, average character length of all nouns)
- Avg_verb_len  (float, average character length of all verbs)
- Avg_adj_len  (float, average character length of all adjectives)
- Avg_adv_len  (float, average character length of all adverbs)
- Pron_counts  (dictionary as {pronoun: % of all pronoun occurrence that this pronoun makes up})
- i_count  (float, % of pronouns that are 'i')
- you_count  (float, % of pronouns that are 'you')
- she_count  (float, % of pronouns that are 'she')
- he_count  (float, % of pronouns that are 'he')
- it_count  (float, % of pronouns that are 'it')
- they_count (float, % of pronouns that are 'they')
- we_count  (float, % of pronouns that are 'we')
- verb_lemmas  (dictionary of 20 most common verb lemmas as {lemma: % of all verbs that verb comprises})
- Sent_toks  (list of sentences)
- Sent_length  (list of tuples as (sentence, sentence length))
- Avg_sent_len  (float, average sentence length over entire transcript)

In [56]:
# add entity frequency column
def ent_counter(doc):
    ents = []    
    people = []
    length = len(doc)
    
    for ent in doc.ents:
        ents.append((ent.text, ent.label_))
        if ent.label_ == 'PERSON':
            people.append(ent.text.lower().strip())
    count = dict(Counter([ent[1] for ent in ents]))
    for label, c in count.items():
        count[label] = percent(c/length)
        
    return count, people

ent_counter(podcast_df.Tokens[0])[0]
        
        

{'WORK_OF_ART': 0.025,
 'PERSON': 3.039,
 'CARDINAL': 0.197,
 'FAC': 0.03,
 'LOC': 0.02,
 'ORDINAL': 0.039,
 'DATE': 0.207,
 'ORG': 0.197,
 'PRODUCT': 0.054,
 'GPE': 0.168,
 'EVENT': 0.03,
 'MONEY': 0.271,
 'NORP': 0.113,
 'PERCENT': 0.03,
 'TIME': 0.025,
 'QUANTITY': 0.01}

In [57]:
podcast_df['Ents'] = podcast_df.Tokens.map(lambda x: (ent_counter(x)[0], ent_counter(x)[-1]))
podcast_df.Ents[2][0]

{'PRODUCT': 0.014,
 'PERSON': 3.019,
 'ORG': 0.258,
 'NORP': 0.167,
 'CARDINAL': 0.324,
 'GPE': 0.176,
 'ORDINAL': 0.062,
 'MONEY': 0.21,
 'EVENT': 0.043,
 'DATE': 0.172,
 'TIME': 0.014,
 'WORK_OF_ART': 0.038,
 'LANGUAGE': 0.005,
 'LOC': 0.019,
 'FAC': 0.01,
 'QUANTITY': 0.024}

In [58]:
# VALUES ARE % OF ENTIRE PODCAST LENGTH
podcast_df['Organization'] = podcast_df.Ents.map(lambda x: x[0].get('ORG', 0))
podcast_df['Art'] = podcast_df.Ents.map(lambda x: x[0].get('WORK_OF_ART', 0))
podcast_df['Date'] = podcast_df.Ents.map(lambda x: x[0].get('DATE', 0))
podcast_df['Geopolitical'] = podcast_df.Ents.map(lambda x: x[0].get('GPE', 0))
podcast_df['Numbers'] = podcast_df.Ents.map(lambda x: x[0].get('CARDINAL', 0))
podcast_df['Event'] = podcast_df.Ents.map(lambda x: x[0].get('EVENT', 0))
podcast_df['Cash'] = podcast_df.Ents.map(lambda x: x[0].get('MONEY', 0))
podcast_df['Time'] = podcast_df.Ents.map(lambda x: x[0].get('TIME', 0))
podcast_df['Product'] = podcast_df.Ents.map(lambda x: x[0].get('PRODUCT', 0))

### Why no host recognition?
I originally wanted to try and do something with parsing out host names then singling out their speech.  This turned out to be a bit of a pipe dream, since A. spacy isn't very good at recognizing names in the somewhat chaotic transcripts and B. there's no way of differentiating between a speaker's tag (i.e. Jad Abumrad: talk talk talk) and a name just being mentioned in speech.  Also formatting is wildly different both from podcast to podcast and within the same podcasts.  For instance, Radiolab formats speaker tag four different ways (JA:, Jad, JAD, JAD ABUMRAD)

In [59]:
# my beautiful and *definitely* state-of-the-art spacy matcher pattern-maker.  I found it really annoying to have to
#      format a pattern matcher every time I wanted to look for something new
from spacy.matcher import Matcher

def pattern_maker():
    
    patterns = []
    add_items = True
    add_dict = True    
    
    while add_dict:        
        match_single = {}                    
            
        while add_items:       
            tag = input('enter tag (lowercase):  ').upper()
            if tag == '0':
                add_dict = False
                break

            string = input('enter string (all lowercase):  ')
            if string == 'true' or string == 'false':
                string = bool(string)

            if tag == 'POS':
                string = string.upper()
            
            match_single[tag] = string
            
            add_items = input('add more to this dict?(y/n)  ')
            if add_items == 'n':
                patterns.append(match_single)
                break
        
        continue
    
    if len(patterns) == 0:
        return
   
    return patterns

In [60]:
def pattern_matcher(pattern, doc): 
    matcher = Matcher(nlp.vocab)
    matcher.add('pattern', [pattern])
    matches = matcher(doc)
    
    
    match_strings = []
    for match_id, start, end in matches:
        matched_span = doc[start:end].text
        match_strings.append(matched_span)
    # print('{} matches found'.format(len(match_strings)))
    return match_strings

pattern_matcher([{'IS_PUNCT': True}], podcast_df.Tokens[0])[:20]
# sorry about the giant flash of punctuation, I'm not sure why it's even printing

[':',
 '.',
 ',',
 '.',
 '“',
 ',',
 '”',
 ',',
 '(',
 ')',
 '.',
 ',',
 '.',
 ',',
 '–',
 '-',
 '.',
 ',',
 ',',
 '.']

In [61]:
podcast_df['Punctuation'] = podcast_df.Tokens.map(lambda t: Counter([p for p in pattern_matcher([{'IS_PUNCT': True}], t) if p != ',']))
podcast_df.Punctuation[10]

Counter({':': 52,
         '.': 752,
         '“': 89,
         '”': 90,
         '(': 6,
         ')': 3,
         '-': 76,
         ';': 5,
         '?': 133,
         '—': 165,
         '!': 186,
         '[': 126,
         ']': 126,
         '&': 1,
         '…': 47,
         '‘': 10,
         '’': 6,
         '"': 1,
         '%': 2,
         '):': 3})

In [62]:
podcast_df['period_freq'] = podcast_df.Punctuation.map(lambda d: percent(d['.']/sum(d.values())))
podcast_df['excl_freq'] = podcast_df.Punctuation.map(lambda d: percent(d['!']/sum(d.values())))
podcast_df['quest_freq'] = podcast_df.Punctuation.map(lambda d: percent(d['?']/sum(d.values())))
podcast_df['hyph_freq'] = podcast_df.Punctuation.map(lambda d: percent(d['-']/sum(d.values())))
podcast_df.loc['MBMBaM', 'period_freq':'hyph_freq'][:20]
# numbers are percentage of all punctuation except commas

Unnamed: 0_level_0,period_freq,excl_freq,quest_freq,hyph_freq
podcast,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
MBMBaM,35.132,5.795,6.115,0.52
MBMBaM,33.926,8.562,6.422,0.889
MBMBaM,35.208,6.587,5.971,0.886
MBMBaM,35.177,6.525,5.234,1.391
MBMBaM,35.253,6.605,5.298,1.378
MBMBaM,32.447,6.844,5.219,0.739
MBMBaM,29.617,2.973,4.868,2.861
MBMBaM,31.822,3.05,4.334,3.491
MBMBaM,30.61,3.612,4.348,2.665
MBMBaM,31.569,4.714,4.946,3.053


In [63]:
podcast_df.columns

Index(['Hosts', 'Genre-Topic', 'Scripted/Un', 'Fiction/Non', 'Format',
       'Rating', 'Episode', 'Text', 'Title', 'Year', 'Tokens', 'Top50',
       'Token_count', 'Token_lengths', 'Avg_token_len', 'TTR', 'kband',
       'Avg_kband', 'Bigrams', 'Bigram_top25', 'POS', 'POS_freq', 'Noun_freq',
       'Proper_noun_freq', 'Verb_freq', 'Adj_freq', 'Adv_freq',
       'Interjection_freq', 'Preposition_freq', 'Conjunction_freq',
       'POS_length', 'Avg_noun_len', 'Avg_verb_len', 'Avg_adj_len',
       'Avg_adv_len', 'Pron_counts', 'i_count', 'you_count', 'she_count',
       'he_count', 'it_count', 'they_count', 'we_count', 'verb_lemmas',
       'Sent_toks', 'Sent_length', 'Avg_sent_len', 'Ents', 'Organization',
       'Art', 'Date', 'Geopolitical', 'Numbers', 'Event', 'Cash', 'Time',
       'Product', 'Punctuation', 'period_freq', 'excl_freq', 'quest_freq',
       'hyph_freq'],
      dtype='object')

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)
- POS  (list of tuples as (token, spacy POS tag))
- POS_freq  (dictionary as {POS: % of entire document})
- Noun_freq  (float, % of tokens that are nouns)
- Verb_freq  (float, % of tokens that are verbs)
- Adj_freq  (float, % of tokens that are adjectives)
- Adv_freq  (float, % of tokens that are adverbs)
- Interjection_freq  (float, % of tokens that are interjections)
- Preposition_freq  (float, % of tokens that are prepositions)
- Conjunction_freq  (float, % of tokens that are conjunctions)
- POS_length  (dictionary as {POS: average character length})
- Avg_noun_len  (float, average character length of all nouns)
- Avg_verb_len  (float, average character length of all verbs)
- Avg_adj_len  (float, average character length of all adjectives)
- Avg_adv_len  (float, average character length of all adverbs)
- Pron_counts  (dictionary as {pronoun: % of all pronoun occurrence that this pronoun makes up})
- i_count  (float, % of pronouns that are 'i')
- you_count  (float, % of pronouns that are 'you')
- she_count  (float, % of pronouns that are 'she')
- he_count  (float, % of pronouns that are 'he')
- it_count  (float, % of pronouns that are 'it')
- they_count (float, % of pronouns that are 'they')
- we_count  (float, % of pronouns that are 'we')
- verb_lemmas  (dictionary of 20 most common verb lemmas as {lemma: % of all verbs that verb comprises})
- Sent_toks  (list of sentences)
- Sent_length  (list of tuples as (sentence, sentence length))
- Avg_sent_len  (float, average sentence length over entire transcript)
- Ents  (dictionary as {spacy's ent tag of token: % of ent occurrence over document length})
- Organization  (float, % of organization entities against doc length)
- Art  (float, % of art entities against doc length)
- Date  (float, % of date entities against doc length)
- Geopolitical  (float, % of geopolitical entities (countries, cities, etc.) against doc length)
- Numbers  (float, % of number occurrence against doc length)
- Event  (float, % of event entity occurrence against doc length)
- Cash  (float, % of monetary value entity occurrence against doc length)
- Time  (float, % of time entity tags occurrence against doc length)
- Product  (float, % of product entity tag occurrence against doc length)

In [64]:
# make a regex for tags
expression = r'(?<=\[).+?(?=\])'
len(re.findall(expression, podcast_df.Text[100]))
re.findall(expression, podcast_df.loc['Welcome to Nightvale', 'Text'][0])[:20]

5

['phone rings', 'weather: “Lemonade in the Shade” by   ']

In [65]:
# maybe get len and parts of speech within tags
podcast_df['Tags'] = podcast_df.Text.map(lambda text: re.findall(expression, text))
len(podcast_df.Tags[0])
podcast_df.Tags[0]

120

['Background music fades into podcast theme.',
 'Drumroll begins, which leads into the theme song.',
 'Song fades down and plays quietly as host begins to speak.',
 'John groans and Ben chuckles.',
 'Laughs deliberately',
 'Adam makes a thoughtful sound and Ben laughs loudly.',
 'John laughs.',
 'Chuckles',
 'Ben laughs loudly.',
 'Ben laughs.',
 'Somberly',
 'chuckles',
 'sighs',
 'Laughs.',
 'Adam laughs.',
 'John laughs, Ben starts chuckling.',
 'John chuckles.',
 'John laughs.',
 'chuckles',
 'Chuckles',
 'Adam makes a couple of affirming sounds as Adam speaks.',
 'Adam chuckles.',
 'Ben makes a couple of affirming sounds as John speaks.',
 'Ben laughs.',
 'Ben and John laughs.',
 'chuckles',
 'Adam chuckles briefly.',
 'Adam laughs.',
 'Laughs.',
 'Chuckling, amused',
 'Chuckling',
 'Laughs',
 'Ben laughs heartily.',
 'Ben laughs heavily.',
 'chuckling',
 'Laughs and claps in the background.',
 'imitates a teenager’s voice',
 'A score of minimal, slightly unearthly, emotional synt

In [66]:
from spacy.tokenizer import Tokenizer

def avg_tag_length(tag_list):
    lengths = []
       
    for t in tag_list:
        t = t.split(' ')
        lengths.append(len(t))
       
    if len(lengths) > 0:
        avg_length = statistics.mean(lengths)
    else:
        avg_length = 0
    
    
    return avg_length
        
avg_tag_length(podcast_df.Tags[0])
    

2.9

In [67]:
podcast_df['Tag_len'] = podcast_df.Tags.map(avg_tag_length)
podcast_df.Tag_len[:20]

podcast
Friendly Fire    2.900000
Friendly Fire    2.664179
Friendly Fire    2.658730
Friendly Fire    2.385542
Friendly Fire    3.036364
Friendly Fire    2.712230
Friendly Fire    2.737374
Friendly Fire    2.000000
Friendly Fire    2.206186
Friendly Fire    2.151515
Friendly Fire    2.368852
Friendly Fire    2.288288
Friendly Fire    2.348837
Friendly Fire    2.180952
Friendly Fire    2.364583
Friendly Fire    2.425532
Friendly Fire    2.364865
Friendly Fire    2.384615
Friendly Fire    2.481928
Friendly Fire    2.607477
Name: Tag_len, dtype: float64

In [68]:
def tag_top_verb(tag_text_list):
    lemmas = []
    
    for t in tag_text_list:
        t = nlp(t)
        for token in t:
            if token.pos_=='VERB':
                lemmas.append(token.lemma_)    
    
    if len(lemmas) > 0:
        top_verb = Counter(lemmas).most_common(1)[0][0]
    else:
        top_verb = 'NaN'
        
    return top_verb
        
        
tag_top_verb(podcast_df.Tags[0])
    

'chuckle'

In [69]:
podcast_df['Tag_top_verb'] = podcast_df.Tags.map(tag_top_verb)
podcast_df.Tag_top_verb[115]
# including this column definitely reflects confirmation bias, since I think that 
#      having "laugh" in a tag will weight it in favor of being comedy

'crosstalk'

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)
- POS  (list of tuples as (token, spacy POS tag))
- POS_freq  (dictionary as {POS: % of entire document})
- Noun_freq  (float, % of tokens that are nouns)
- Verb_freq  (float, % of tokens that are verbs)
- Adj_freq  (float, % of tokens that are adjectives)
- Adv_freq  (float, % of tokens that are adverbs)
- Interjection_freq  (float, % of tokens that are interjections)
- Preposition_freq  (float, % of tokens that are prepositions)
- Conjunction_freq  (float, % of tokens that are conjunctions)
- POS_length  (dictionary as {POS: average character length})
- Avg_noun_len  (float, average character length of all nouns)
- Avg_verb_len  (float, average character length of all verbs)
- Avg_adj_len  (float, average character length of all adjectives)
- Avg_adv_len  (float, average character length of all adverbs)
- Pron_counts  (dictionary as {pronoun: % of all pronoun occurrence that this pronoun makes up})
- i_count  (float, % of pronouns that are 'i')
- you_count  (float, % of pronouns that are 'you')
- she_count  (float, % of pronouns that are 'she')
- he_count  (float, % of pronouns that are 'he')
- it_count  (float, % of pronouns that are 'it')
- they_count (float, % of pronouns that are 'they')
- we_count  (float, % of pronouns that are 'we')
- verb_lemmas  (dictionary of 20 most common verb lemmas as {lemma: % of all verbs that verb comprises})
- Sent_toks  (list of sentences)
- Sent_length  (list of tuples as (sentence, sentence length))
- Avg_sent_len  (float, average sentence length over entire transcript)
- Ents  (dictionary as {spacy's ent tag of token: % of ent occurrence over document length})
- Organization  (float, % of organization entities against doc length)
- Art  (float, % of art entities against doc length)
- Date  (float, % of date entities against doc length)
- Geopolitical  (float, % of geopolitical entities (countries, cities, etc.) against doc length)
- Numbers  (float, % of number occurrence against doc length)
- Event  (float, % of event entity occurrence against doc length)
- Cash  (float, % of monetary value entity occurrence against doc length)
- Time  (float, % of time entity tags occurrence against doc length)
- Product  (float, % of product entity tag occurrence against doc length)
- Tag_len  (float, average word length of text tags (i.e. [he laughed], [music plays]))
- Top_tag_verb  (string, most commonly-occurring verb lemma within tags)

In [70]:
# I apologize in advance to sensitive eyes and my parents
swears = ['fuck', 'fucking','fucker', 'shit','ass','asshole','damn','dammit', 'goddamnn','bitch','bitchy','cunt']

In [71]:
podcast_df['Swear_count'] = podcast_df.Tokens.map(lambda tokens: percent(len([t.text for t in tokens if t.text in swears])/len(tokens)))
podcast_df.Swear_count[-20:-10]

podcast
Wonderful             0.131
You're Wrong About    0.137
You're Wrong About    0.098
You're Wrong About    0.021
You're Wrong About    0.130
You're Wrong About    0.146
You're Wrong About    0.055
You're Wrong About    0.131
You're Wrong About    0.056
You're Wrong About    0.120
Name: Swear_count, dtype: float64

In [72]:
fake_swears = ['fudge','shoot','butthead','darn']
podcast_df['Fake_swear_count'] = podcast_df.Tokens.map(lambda tokens: percent(len([t.text for t in tokens if t.text in fake_swears])/len(tokens)))
podcast_df.Fake_swear_count[-20:-10]

podcast
Wonderful             0.028
You're Wrong About    0.000
You're Wrong About    0.000
You're Wrong About    0.000
You're Wrong About    0.000
You're Wrong About    0.000
You're Wrong About    0.009
You're Wrong About    0.006
You're Wrong About    0.000
You're Wrong About    0.000
Name: Fake_swear_count, dtype: float64

#### columns so far:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)
- POS  (list of tuples as (token, spacy POS tag))
- POS_freq  (dictionary as {POS: % of entire document})
- Noun_freq  (float, % of tokens that are nouns)
- Verb_freq  (float, % of tokens that are verbs)
- Adj_freq  (float, % of tokens that are adjectives)
- Adv_freq  (float, % of tokens that are adverbs)
- Interjection_freq  (float, % of tokens that are interjections)
- Preposition_freq  (float, % of tokens that are prepositions)
- Conjunction_freq  (float, % of tokens that are conjunctions)
- POS_length  (dictionary as {POS: average character length})
- Avg_noun_len  (float, average character length of all nouns)
- Avg_verb_len  (float, average character length of all verbs)
- Avg_adj_len  (float, average character length of all adjectives)
- Avg_adv_len  (float, average character length of all adverbs)
- Pron_counts  (dictionary as {pronoun: % of all pronoun occurrence that this pronoun makes up})
- i_count  (float, % of pronouns that are 'i')
- you_count  (float, % of pronouns that are 'you')
- she_count  (float, % of pronouns that are 'she')
- he_count  (float, % of pronouns that are 'he')
- it_count  (float, % of pronouns that are 'it')
- they_count (float, % of pronouns that are 'they')
- we_count  (float, % of pronouns that are 'we')
- verb_lemmas  (dictionary of 20 most common verb lemmas as {lemma: % of all verbs that verb comprises})
- Sent_toks  (list of sentences)
- Sent_length  (list of tuples as (sentence, sentence length))
- Avg_sent_len  (float, average sentence length over entire transcript)
- Ents  (dictionary as {spacy's ent tag of token: % of ent occurrence over document length})
- Organization  (float, % of organization entities against doc length)
- Art  (float, % of art entities against doc length)
- Date  (float, % of date entities against doc length)
- Geopolitical  (float, % of geopolitical entities (countries, cities, etc.) against doc length)
- Numbers  (float, % of number occurrence against doc length)
- Event  (float, % of event entity occurrence against doc length)
- Cash  (float, % of monetary value entity occurrence against doc length)
- Time  (float, % of time entity tags occurrence against doc length)
- Product  (float, % of product entity tag occurrence against doc length)
- Tag_len  (float, average word length of text tags (i.e. [he laughed], [music plays]))
- Top_tag_verb  (string, most commonly-occurring verb lemma within tags)
- Swear_count  (float, % of text tokens that are swear words)
- Fake_swear_count  (float, % of text tokens that are fake swear words)

In [73]:
string = 'I thought that there are three more of I than you.  I felt good about this.  Do you feel okay?'
pattern_matcher([{'POS': 'PRON'}, {'POS': 'AUX', 'OP': '*'}, {'LEMMA': 'feel'}], nlp(string))
pattern_matcher([{'POS': 'PRON'}, {'POS': 'AUX', 'OP': '*'}, {'LEMMA': 'think'}], nlp(string))

['I felt', 'you feel']

['I thought']

In [74]:
def opinions(Tokens):
    count = 0
    
    count += len(pattern_matcher([{'POS': 'PRON'}, {'POS': 'AUX', 'OP': '*'}, {'LEMMA': 'feel'}], Tokens))
    count += len(pattern_matcher([{'POS': 'PRON'}, {'POS': 'AUX', 'OP': '*'}, {'LEMMA': 'think'}], nlp(string)))
    
    verb_count = len([t for t in Tokens if t.pos_ == 'VERB'])
    count /= verb_count
    return count

opinions(nlp(string))
    

1.0

In [75]:
podcast_df['Opinion_count'] = podcast_df.Tokens.map(opinions)
podcast_df.Opinion_count[449]

0.0027070925825663237

In [76]:
def prep_per_sent(POSes, sent_toks):
    POSes = len([x[1] for x in POSes if x[1] == 'ADP'])
    adps = POSes/len(sent_toks)
    
    return adps

In [77]:
podcast_df['Prep_per_sent'] = podcast_df.apply(lambda x: prep_per_sent(POSes = x['POS'], sent_toks = x['Sent_toks']), axis=1)
podcast_df.Prep_per_sent[:157]

podcast
Friendly Fire    1.230342
Friendly Fire    1.159420
Friendly Fire    1.281599
Friendly Fire    1.381510
Friendly Fire    1.198267
                   ...   
Radiolab         0.943898
Radiolab         0.854637
Radiolab         0.950197
Radiolab         1.079086
Radiolab         0.716392
Name: Prep_per_sent, Length: 157, dtype: float64

In [78]:
podcast_df['Donation_appeal'] = podcast_df.Tokens.map(lambda t: len(pattern_matcher([{'LEMMA':'donate', 'DEP': 'ROOT'}], t)))
podcast_df.Donation_appeal.value_counts()
# this doesn't look like it'll be particularly useful, but I made it so it can't hurt to leave it in

0    1380
1      45
2       6
3       5
Name: Donation_appeal, dtype: int64

In [79]:
sm = ['twitter','facebook','instagram','linkedin','twitch','tik tok']

def social_count(tokens):
    count = 0
    for t in tokens:
        if t.text in sm:
            count += 1
    
    return count

In [80]:
podcast_df['Social'] = podcast_df.Tokens.map(social_count)
podcast_df.Social.value_counts()
# same as last column: doesn't seem like it'd be particularly useful, but might be

0    1359
1      30
2      24
3      21
4       2
Name: Social, dtype: int64

In [81]:
podcast_df.columns

Index(['Hosts', 'Genre-Topic', 'Scripted/Un', 'Fiction/Non', 'Format',
       'Rating', 'Episode', 'Text', 'Title', 'Year', 'Tokens', 'Top50',
       'Token_count', 'Token_lengths', 'Avg_token_len', 'TTR', 'kband',
       'Avg_kband', 'Bigrams', 'Bigram_top25', 'POS', 'POS_freq', 'Noun_freq',
       'Proper_noun_freq', 'Verb_freq', 'Adj_freq', 'Adv_freq',
       'Interjection_freq', 'Preposition_freq', 'Conjunction_freq',
       'POS_length', 'Avg_noun_len', 'Avg_verb_len', 'Avg_adj_len',
       'Avg_adv_len', 'Pron_counts', 'i_count', 'you_count', 'she_count',
       'he_count', 'it_count', 'they_count', 'we_count', 'verb_lemmas',
       'Sent_toks', 'Sent_length', 'Avg_sent_len', 'Ents', 'Organization',
       'Art', 'Date', 'Geopolitical', 'Numbers', 'Event', 'Cash', 'Time',
       'Product', 'Punctuation', 'period_freq', 'excl_freq', 'quest_freq',
       'hyph_freq', 'Tags', 'Tag_len', 'Tag_top_verb', 'Swear_count',
       'Fake_swear_count', 'Opinion_count', 'Prep_per_sent', 'Dona

In [82]:
podcast_df['Know'] = podcast_df.verb_lemmas.map(lambda x: x.get('know', 0))
podcast_df['Be'] = podcast_df.verb_lemmas.map(lambda x: x.get('be', 0))
podcast_df['Do'] = podcast_df.verb_lemmas.map(lambda x: x.get('do', 0))
podcast_df['Mean'] = podcast_df.verb_lemmas.map(lambda x: x.get('mean', 0))
podcast_df['Make'] = podcast_df.verb_lemmas.map(lambda x: x.get('make', 0))
podcast_df['Go'] = podcast_df.verb_lemmas.map(lambda x: x.get('go', 0))

## Analysis completed

#### all features:
- Tokens (spacy doc)
- top50 (Counter of 50 most common tokens and their counts)
- Token_count (int, transcript length)
- Token_lengths (list of tuples: (token, length))
- Avg_token_len (float, mean of all alphabetic token lengths)
- TTR  (float, type/token ratio measured against 300 characters)
- kband  (list of tuples as (word, kband))
- Avg_kband  (float, mean kband)
- Bigrams (list of bigram tuples)
- Bigram_top25  (Counter object of 25 most common bigrams and their counts)
- POS  (list of tuples as (token, spacy POS tag))
- POS_freq  (dictionary as {POS: % of entire document})
- Noun_freq  (float, % of tokens that are nouns)
- Verb_freq  (float, % of tokens that are verbs)
- Adj_freq  (float, % of tokens that are adjectives)
- Adv_freq  (float, % of tokens that are adverbs)
- Interjection_freq  (float, % of tokens that are interjections)
- Preposition_freq  (float, % of tokens that are prepositions)
- Conjunction_freq  (float, % of tokens that are conjunctions)
- POS_length  (dictionary as {POS: average character length})
- Avg_noun_len  (float, average character length of all nouns)
- Avg_verb_len  (float, average character length of all verbs)
- Avg_adj_len  (float, average character length of all adjectives)
- Avg_adv_len  (float, average character length of all adverbs)
- Pron_counts  (dictionary as {pronoun: % of all pronoun occurrence that this pronoun makes up})
- i_count  (float, % of pronouns that are 'i')
- you_count  (float, % of pronouns that are 'you')
- she_count  (float, % of pronouns that are 'she')
- he_count  (float, % of pronouns that are 'he')
- it_count  (float, % of pronouns that are 'it')
- they_count (float, % of pronouns that are 'they')
- we_count  (float, % of pronouns that are 'we')
- verb_lemmas  (dictionary of 20 most common verb lemmas as {lemma: % of all verbs that verb comprises})
- Sent_toks  (list of sentences)
- Sent_length  (list of tuples as (sentence, sentence length))
- Avg_sent_len  (float, average sentence length over entire transcript)
- Ents  (dictionary as {spacy's ent tag of token: % of ent occurrence over document length})
- Organization  (float, % of organization entities against doc length)
- Art  (float, % of art entities against doc length)
- Date  (float, % of date entities against doc length)
- Geopolitical  (float, % of geopolitical entities (countries, cities, etc.) against doc length)
- Numbers  (float, % of number occurrence against doc length)
- Event  (float, % of event entity occurrence against doc length)
- Cash  (float, % of monetary value entity occurrence against doc length)
- Time  (float, % of time entity tags occurrence against doc length)
- Product  (float, % of product entity tag occurrence against doc length)
- Tag_len  (float, average word length of text tags (i.e. [he laughed], [music plays]))
- Top_tag_verb  (string, most commonly-occurring verb lemma within tags)
- Swear_count  (float, % of text tokens that are swear words)
- Fake_swear_count  (float, % of text tokens that are fake swear words)
- Opinion_count  (float, occurrence of pronoun followed by optional auxiliary followed by lemma think or feel weighed
                    against total verb occurrence)
- Prep_per_sent  (float, average occurrence of prepositions per sentence)
- Donation_appeal  (int, count of "donate" occurring as a phrase root)
- Social_count  (int, count of how many times a social media platform is mentioned)
- Know  (int, extracted from verb_lemma dictionary column)
- Be  (int, extracted from verb_lemma dictionary column)
- Do  (int, extracted from verb_lemma dictionary column)
- Mean  (int, extracted from verb_lemma dictionary column)
- Make  (int, extracted from verb_lemma dictionary column)
- Go  (int, extracted from verb_lemma dictionary column)

In [83]:
podcast_df.sample(10)

Unnamed: 0_level_0,Hosts,Genre-Topic,Scripted/Un,Fiction/Non,Format,Rating,Episode,Text,Title,Year,...,Opinion_count,Prep_per_sent,Donation_appeal,Social,Know,Be,Do,Mean,Make,Go
podcast,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Radiolab,2.0,"[society, education]",unscripted,nonfiction,storytelling,4.7,,UNIDENTIFIED PERSON #1: Listener-supported WN...,The Great Vaccinator,2020.0,...,0.004596,1.173352,0,0,3.493,9.743,5.79,1.562,2.206,4.963
Radiolab,2.0,"[society, education]",unscripted,nonfiction,storytelling,4.7,,"JAD: Hey, I’m Jad Abumrad. ROBERT: I’m Rob...",Poop Train,2013.0,...,0.005455,1.076316,0,0,2.909,4.545,4.0,0.0,0.0,4.909
Radiolab,2.0,"[society, education]",unscripted,nonfiction,storytelling,4.7,,SONGS THAT CROSS BORDERS FINAL WEB TRANSCRIPT...,Songs that Cross Borders,2019.0,...,0.001684,0.917411,0,0,2.694,11.111,2.189,0.842,0.0,4.377
Wonderful,2.0,"[comedy, society]",unscripted,nonfiction,chat,4.9,152.0,Wonderful! 152: Air Milk \nPublished September...,air milk,2020.0,...,0.013774,1.02026,0,0,3.306,12.672,2.847,1.286,2.571,2.847
Welcome to Nightvale,1.0,"[comedy, sci-fi]",scripted,fiction,news,4.8,144.0,"It's turtles all the way down, but, man, it's ...",the dreamer,2019.0,...,0.005479,1.4375,0,0,1.644,4.932,1.096,0.0,0.0,1.644
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6,654.0,"Prologue Sean Cole From WBEZ Chicago,...",The Feather Heist,,...,0.003976,1.49125,0,0,3.181,8.217,1.723,1.59,0.0,2.651
Radiolab,2.0,"[society, education]",unscripted,nonfiction,storytelling,4.7,,JA: Jad Abumrad SA: Simon Adler AM: Annie M...,The Curious Case of the Russian Flash Mob ...,2018.0,...,0.009288,2.365759,0,0,4.85,6.502,3.922,0.0,1.651,2.374
This American Life,1.5,"[society, history]",unscripted,nonfiction,storytelling,4.6,483.0,Prologue Ira Glass I spoke with Julia...,Self-Improvement Kick,,...,0.00538,1.264352,0,0,2.959,8.541,3.43,0.0,2.555,3.093
Friendly Fire,3.0,"[history, movies]",unscripted,nonfiction,recap,4.6,152.0,Note: This show periodically replaces their ad...,final draft,,...,0.013535,1.073831,0,0,2.186,7.6,2.967,1.458,2.186,2.551
Radiolab,2.0,"[society, education]",unscripted,nonfiction,storytelling,4.7,,Jim Dixon: Hello there. This is Jim Dickson s...,An Ice-Cold Case,2013.0,...,0.004608,1.15655,0,0,2.535,8.065,2.304,0.0,0.0,1.382


## ML preparation

Separate columns by what kind of model they can fit.  Target features, numerical features, and lexical features will be three separate csvs.  Transfer csvs to crc and do machine learning there.

In [84]:
# separate out dataframe used for regression -- all numerical values
num_df = podcast_df[['Hosts','Rating','Token_count','Avg_token_len', 'Avg_sent_len', 'TTR','Avg_kband',
                     'Noun_freq','Proper_noun_freq','Verb_freq','Adj_freq','Adv_freq',
                     'Interjection_freq', 'Preposition_freq', 'Conjunction_freq', 'Avg_noun_len',
                     'Avg_verb_len', 'Avg_adj_len', 'Avg_adv_len', 'i_count', 'you_count', 'she_count',
                     'he_count', 'it_count', 'they_count', 'we_count', 'Know','Be', 'Do', 'Mean',
                     'Make', 'Go', 'Organization', 'Art', 'Date', 'Geopolitical', 'Numbers', 'Event',
                     'Cash', 'Time', 'Product', 'period_freq', 'excl_freq', 'quest_freq', 'hyph_freq',
                     'Tag_len', 'Swear_count', 'Fake_swear_count', 'Opinion_count', 'Prep_per_sent',
                     'Donation_appeal', 'Social']]

In [85]:
# separate target df
target_df = podcast_df[['Hosts', 'Genre-Topic', 'Scripted/Un', 'Fiction/Non', 'Format', 'Rating', 'Year']]
target_df

Unnamed: 0_level_0,Hosts,Genre-Topic,Scripted/Un,Fiction/Non,Format,Rating,Year
podcast,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Friendly Fire,3.0,"[history, movies]",unscripted,nonfiction,recap,4.6,
Friendly Fire,3.0,"[history, movies]",unscripted,nonfiction,recap,4.6,1951
Friendly Fire,3.0,"[history, movies]",unscripted,nonfiction,recap,4.6,
Friendly Fire,3.0,"[history, movies]",unscripted,nonfiction,recap,4.6,1970
Friendly Fire,3.0,"[history, movies]",unscripted,nonfiction,recap,4.6,1942
...,...,...,...,...,...,...,...
You're Wrong About,2.0,"[history, education]",unscripted,nonfiction,chat,4.6,2021
You're Wrong About,2.0,"[history, education]",unscripted,nonfiction,chat,4.6,2021
You're Wrong About,2.0,"[history, education]",unscripted,nonfiction,chat,4.6,2021
You're Wrong About,2.0,"[history, education]",unscripted,nonfiction,chat,4.6,2021


In [86]:
# make tag 1 and tag 2 their own columns, rather than a list in a single column
target_df['Tag1'] = target_df['Genre-Topic'].map(lambda x: x[0])
target_df['Tag2'] = target_df['Genre-Topic'].map(lambda x: x[-1])
target_df = target_df.drop(['Genre-Topic'], axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  target_df['Tag1'] = target_df['Genre-Topic'].map(lambda x: x[0])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  target_df['Tag2'] = target_df['Genre-Topic'].map(lambda x: x[-1])


In [87]:
podtoks_df = podcast_df[['Title', 'Tokens', 'Top50']]
podtoks_df

Unnamed: 0_level_0,Title,Tokens,Top50
podcast,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Friendly Fire,final draft,"( \n, Note, :, This, show, periodically, repl...","[(the, 484), (a, 431), (Host, 402), (I, 371), ..."
Friendly Fire,final draft,"( \n, Note, :, This, show, periodically, repl...","[(the, 481), (Host, 437), (a, 306), (I, 261), ..."
Friendly Fire,final draft,"( \n, Note, :, This, show, periodically, repl...","[(the, 536), (that, 400), (a, 392), (Host, 382..."
Friendly Fire,final draft,"( \n, Note, :, This, show, periodically, repl...","[(the, 461), (a, 300), (Host, 282), (that, 253..."
Friendly Fire,final draft,"( \n, Note, :, This, show, periodically, repl...","[(the, 505), (Host, 355), (a, 321), (that, 301..."
...,...,...,...
You're Wrong About,Vanessa Williams Part 2: Saving The Best For Last,"(Sarah, :, , The, point, of, shaming, someon...","[(to, 387), (of, 343), (the, 329), (that, 312)..."
You're Wrong About,The O.J. Simpson Trial: From the Mixed-Up File...,"(Sarah, :, , Economically, like, it, was, on...","[(the, 309), (of, 265), (to, 260), (like, 258)..."
You're Wrong About,"Bonus: ""The Dark Knight""","(Mike, , :, Ooh, ,, I, have, one, ,, I, have,...","[(the, 523), (I, 458), (like, 438), (to, 374),..."
You're Wrong About,"""Political Correctness""","(Also, ,, is, free, speech, a, right, in, the,...","[(the, 522), (of, 463), (to, 382), (a, 348), (..."


In [88]:
podtoks_df.to_csv('data/podtoks_df.csv', encoding='utf-8')

In [89]:
target_df.to_csv('data/target_df.csv', encoding='utf-8')

In [91]:
num_df.to_csv('data/num_df.csv')