In [1]:
import Archive
import Segment

In [2]:
# searchAllSegments uses the scraping API and its cursor feature to fetch every result for the (hardcoded) query.
# It returns all results, or stops early once it reaches the max_len passed as its argument.
# max_len defaults to 100,000, but the query has fewer than 50,000 results, so omitting it returns everything.
search_segments = Archive.searchAllSegments(100)
# This would return all results
#search_segments = Archive.searchAllSegments()

Fetching segments on page 0, found 0 segments already
Found 10000 segments, 39430 remaining
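
The cursor-based fetching described in the comments above can be sketched roughly as follows. This only illustrates the pattern and is not the actual implementation of Archive.searchAllSegments: FAKE_RESULTS, fetch_page, page_size, and search_all are hypothetical stand-ins for the real scraping API call, and truncating to max_len is an assumption.

```python
# Hypothetical stand-in data; the real query returns segment records from the scraping API.
FAKE_RESULTS = [f"segment-{i}" for i in range(25)]


def fetch_page(cursor=None, page_size=10):
    """Hypothetical page fetcher: returns (items, next_cursor).

    next_cursor is None once the last page has been served.
    """
    start = 0 if cursor is None else cursor
    items = FAKE_RESULTS[start:start + page_size]
    end = start + len(items)
    next_cursor = end if end < len(FAKE_RESULTS) else None
    return items, next_cursor


def search_all(max_len=100_000):
    """Follow cursors until the results run out or max_len is reached."""
    results = []
    cursor = None
    while True:
        items, cursor = fetch_page(cursor)
        results.extend(items)
        if cursor is None or len(results) >= max_len:
            break
    # Assumption: trim any overshoot from the last page.
    return results[:max_len]


print(len(search_all()))    # all 25 fake results
print(len(search_all(12)))  # stops after the second page and trims to 12
```

The loop always fetches whole pages, so the early exit is only checked between pages.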


In [3]:
# Given a list of segment objects (from searchAllSegments), downloadPages attempts to fetch each page, parse its information, and save it to disk in the specified folder.
# Some pages fail because they have no captions or are in an unexpected format; these are usually older segments.
# Files are stored as JSON strings, which keeps text and metadata together in one file and makes them easy to load into a dictionary.
# Only a subset of the list is used here so the cell doesn't run for too long.
Archive.downloadPages(search_segments[:50], folder_name='Demo')
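
As a rough illustration of the storage format mentioned in the comments, here is a minimal sketch of writing one segment to disk as a JSON string and reading it back. The record, save_segment, load_segment, and the example.txt file name are all hypothetical; downloadPages may name and organize its files differently.

```python
import json
import os
import tempfile

# Hypothetical record with text and metadata kept together in one dict.
record = {
    "snippets": [["3:05 am", "thank you very, very much"], ["3:06 am", "tough loss"]],
    "metadata": {"Title": "Example Show", "Network": "NBC"},
}


def save_segment(record, folder):
    """Write one segment as a JSON string under folder/<Title>/example.txt."""
    show_dir = os.path.join(folder, record["metadata"]["Title"])
    os.makedirs(show_dir, exist_ok=True)
    path = os.path.join(show_dir, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)
    return path


def load_segment(path):
    """Read the JSON string back into a plain dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = save_segment(record, tmp)
    loaded = load_segment(path)

print(loaded == record)
```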




In [4]:
# listLocalSegments returns all segment files in the specified local folder.
local_segments = Archive.listLocalSegments('Demo')
len(local_segments), local_segments[:5]

(30,
 ['Demo/Late Night With Jimmy Fallon/NBC November 11, 2009 3:05am-4:00am EST.txt',
  'Demo/Late Night With Jimmy Fallon/NBC November 4, 2009 12:35am-1:35am EST.txt',
  'Demo/American Morning/CNN September 11, 2009 6:00am-9:00am EDT.txt',
  'Demo/American Morning/CNN November 25, 2009 6:00am-9:00am EST.txt',
  'Demo/Larry King Live/CNN August 8, 2009 9:00pm-10:00pm EDT.txt'])
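
The paths above appear to follow a folder/show title/file name layout. Assuming that layout holds, a small sketch like this groups the files by show. group_by_show is a hypothetical helper, not part of Archive.

```python
import posixpath
from collections import defaultdict

# Paths copied from the output above.
paths = [
    'Demo/Late Night With Jimmy Fallon/NBC November 11, 2009 3:05am-4:00am EST.txt',
    'Demo/Late Night With Jimmy Fallon/NBC November 4, 2009 12:35am-1:35am EST.txt',
    'Demo/American Morning/CNN September 11, 2009 6:00am-9:00am EDT.txt',
    'Demo/American Morning/CNN November 25, 2009 6:00am-9:00am EST.txt',
    'Demo/Larry King Live/CNN August 8, 2009 9:00pm-10:00pm EDT.txt',
]


def group_by_show(paths):
    """Map each show title (the parent folder name) to its file names."""
    shows = defaultdict(list)
    for path in paths:
        show = posixpath.basename(posixpath.dirname(path))
        shows[show].append(posixpath.basename(path))
    return dict(shows)


for show, files in group_by_show(paths).items():
    print(show, len(files))
```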

In [5]:
# Open a file by passing its path, which can be found with listLocalSegments above or provided manually.
# Returns a Segment object that parses the JSON data from the file.
segment = Segment.openFile(local_segments[0])
# Read the metadata with the info function.
segment.info()

{'Topics': ['Jason Sudeikis',
  'New York',
  'Matt Bomer',
  'Kansas',
  'Johnny',
  'Taylor',
  'Higgins',
  'Gordon Heavyhand',
  'Richard Simmons',
  'Mike Trush',
  'Jon Bovi',
  'New York City',
  'Boston',
  'Esurance',
  'Kirk',
  'Reba Mcentire',
  'Geico',
  'S.c.',
  'Oreck Power Team',
  'Johnson'],
 'Network': 'NBC',
 'Duration': '00:55:00',
 'Title': 'Late Night With Jimmy Fallon',
 'Datetime': 'NBC November 11, 2009 3:05am-4:00am EST'}
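
The 'Datetime' and 'Duration' values are plain strings. If real datetime objects are needed, something like the sketch below could work. Both helpers are hypothetical and not part of Segment. They assume the network name is a single word, that the string always has the shape seen above, and that Python is running with an English locale (month names and am/pm are locale-dependent).

```python
from datetime import datetime, timedelta


def parse_segment_datetime(value):
    """Split 'NBC November 11, 2009 3:05am-4:00am EST' into its parts.

    Assumes a one-word network name at the start and a time zone at the end.
    """
    network, rest = value.split(" ", 1)
    date_and_times, tz = rest.rsplit(" ", 1)
    date_part, end_time = date_and_times.rsplit("-", 1)
    start = datetime.strptime(date_part, "%B %d, %Y %I:%M%p")
    return network, start, end_time, tz


def parse_duration(value):
    """Turn an 'HH:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


network, start, end_time, tz = parse_segment_datetime(
    "NBC November 11, 2009 3:05am-4:00am EST"
)
end = start + parse_duration("00:55:00")
print(network, start.isoformat(), end.strftime("%H:%M"), tz)
```

The start time plus the duration lands on the end time given in the string, which is a handy sanity check on the metadata.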

In [6]:
# Read the caption text using the text function.
print(segment.text()[:3000])

♪ [ cheers and applause ] >> jimmy: thank you very, very much, everybody. thank you, thank you. welcome. welcome to the "late night with jimmy fallon," everybody. welcome, happy tuesday. it's election day. and i voted here in new york this morning. it's not like the last election i voted in. you don't text in your vote. you guys know that? [ laughter ] you have to go wait in line and there's a machine -- the word lambert isn't anywhere. it's just awful. i thought it was nice, so everyone who voted got an "i voted" swine flu mask to leave with. [ laughter ] so, that was thoughtful. things are looking good for new york city mayor michael bloomberg, looks like he's going to win a third term. [ applause ] yeah, of course, he's spent the most money in new york electoral history, just barely exceeding the new york yankees' salary cap. [ laughter ] are you guys watching the series? [ cheers and applause ] you guys watching it, too? [ cheers ] tough loss for the yankees last night, but history

In [7]:
# Read the minute-by-minute snippets with the snippets function.
# It returns a formatted string, so the times can't be accessed programmatically; use the content property for that (see below).
print(segment.snippets()[:3000])

3:05 am : ♪ [ cheers and applause ] >> jimmy: thank you very, very much, everybody. thank you, thank you. welcome. welcome to the "late night with jimmy fallon," everybody. welcome, happy tuesday. it's election day. and i voted here in new york this morning. it's not like the last election i voted in. you don't text in your vote. you guys know that? [ laughter ] you have to go wait in line and there's a machine -- the word lambert isn't anywhere. it's just awful. i thought it was nice, so everyone who voted got an "i voted" swine flu mask to leave with. [ laughter ] so, that was thoughtful. things are looking good for new york city mayor michael bloomberg, looks like he's going to win a third term. [ applause ] yeah, of course, he's spent the most money in new york electoral history, just barely exceeding the new york yankees' salary cap. [ laughter ] are you guys watching the series? [ cheers and applause ] you guys watching it, too?

3:06 am : [ cheers ] tough loss for the yankees la

In [8]:
# Access the content using the content property.
for time, snippet in segment.content['snippets'][:10]:
    print(time, snippet[:100])

3:05 am ♪ [ cheers and applause ] >> jimmy: thank you very, very much, everybody. thank you, thank you. welc
3:06 am [ cheers ] tough loss for the yankees last night, but history was made when the phillies' chase utle
3:07 am he wrote on these notes. and this guy made a song about it. i'm so excited about it. the song is cal
3:08 am they were like, "that's impossible." you guys, a new study found that experiencing bad moods can act
3:09 am >> jimmy: hey. we g a great show, you guys. anyone see "gossip girl" last night? [ cheers ] yeah, i 
3:10 am matt bomer is joining us. [ cheers and applause ] he's good. he's hot. and we have a performance fro
3:11 am "gonna make sweet love" -- hilarious. well, anyway, today is election day and i'm almost afraid t as
3:12 am our patriotic duty ow ♪ [ laughter ] ♪ time for me to tap that patriotic booty tap that booty ♪ ♪ do
3:13 am [ laughter ] ♪ make sweet sweet love to your woman this election day ♪ ♪ make sweet love to your wom
3:14 am and yogurty
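
Because each entry is a (time, text) pair, it is easy to find the minutes in which a keyword is mentioned. A minimal sketch, using a few shortened rows from the output above as sample data; find_mentions is a hypothetical helper, not part of Segment.

```python
# Sample rows shortened from the output above.
snippets = [
    ["3:05 am", "welcome, happy tuesday. it's election day."],
    ["3:06 am", "tough loss for the yankees last night"],
    ["3:11 am", "well, anyway, today is election day"],
]


def find_mentions(snippets, keyword):
    """Return the times of all snippets whose text contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [time for time, text in snippets if keyword in text.lower()]


print(find_mentions(snippets, "election"))  # ['3:05 am', '3:11 am']
```

With a real segment, segment.content['snippets'] would take the place of the sample list.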

In [9]:
# The content property is a dict with the following format:
'''
{
  'snippets': [ [time, snippet_text], [time, snippet_text], ...],
  'metadata': {
      'Topics': [ 'Topic1', 'Topic2', ...],
      'Network': 'Bloomberg',
      'Duration': '01:00:00',
      ...
  }
}
'''
print('Content', segment.content.keys())
print('Metadata', segment.content['metadata'].keys())

Content dict_keys(['snippets', 'metadata'])
Metadata dict_keys(['Topics', 'Network', 'Duration', 'Title', 'Datetime'])
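
Given that layout, the two string views used earlier can be rebuilt from the dict. The sketch below is a guess at how that might look, based only on the outputs printed above; it is not the actual code behind segment.text() or segment.snippets(), and full_text and formatted_snippets are hypothetical names.

```python
# Small hand-made dict in the format described above.
content = {
    "snippets": [
        ["3:05 am", "you guys watching it, too?"],
        ["3:06 am", "[ cheers ] tough loss for the yankees"],
    ],
    "metadata": {"Network": "NBC", "Duration": "00:55:00"},
}


def full_text(content):
    """Join all snippet texts into one caption string (assumed separator: a space)."""
    return " ".join(text for _, text in content["snippets"])


def formatted_snippets(content):
    """Format each snippet as 'time : text', separated by blank lines."""
    return "\n\n".join(f"{time} : {text}" for time, text in content["snippets"])


print(full_text(content))
print(formatted_snippets(content))
```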
