*This is a prototype for preprocessing the data.*

*Here, I will use just the cnn/stories data and preprocess it.*

Tools: Python, TensorFlow, Stanford CoreNLP

## Imports

In [2]:
import os
import shutil
import hashlib
import struct
import collections
import subprocess
import tensorflow as tf
from tensorflow.core.example import example_pb2

Acceptable ways to end a sentence

In [None]:
END_TOKENS = ['.', '!', '?', '...', "'", "`", '"']

Define the directory where all stories are stored.

In [None]:
exp_stories_dir = "/content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/interim/exp_stories"

Number of files in the exp_stories folder.

**STEPS TAKEN FOR CONVERTING SIMPLE STORIES INTO CHUNKS OF BINARY FILES CONTAINING TOKENIZED VERSION OF THE STORY AND BROKEN DOWN INTO ARTICLE AND ABSTRACT**

1. Tokenize the stories and store the tokenized versions under the same file names (this simplifies the mapping).
2. Build a function to separate the article and abstract parts of each story.
3. Build a function to convert the URLs into their hashed file names.
4. Go through each story file whose name matches one of those URL hashes. There are three URL lists (train, val, test), so each file name belongs to one part of the dataset (train, val, or test).
5. Create containers that store each story's article and abstract, then write them to the corresponding .bin files.
6. Build the vocabulary from the training data.

*Tokenize a sentence using `edu.stanford.nlp.process.PTBTokenizer`*

In [None]:
!echo "Please tokenize this text." | java -cp '/content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/external/stanford-corenlp-4.5.4/stanford-corenlp-4.5.4.jar' edu.stanford.nlp.process.PTBTokenizer

Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 40.54 tokens per second.


Store all story file names inside `stories`

In [None]:
stories = os.listdir(exp_stories_dir)

This is an experimental directory to save the tokenized files.

In [None]:
tokenized_dir = "/content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/interim/exp_tokenized"

In [None]:
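# exist_ok=True lets the cell be re-run without failing when the directory already exists
os.makedirs(tokenized_dir, exist_ok=True)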

To tokenize files with `edu.stanford.nlp.process.PTBTokenizer` and save the results in the same pass, we need a mapping file.

Each line of the mapping file contains the path of a source story file, a tab, and the destination path (including the file name) where the tokenized version will be saved, e.g. `/path/to/story_files/fa8c8f4bf51d704fe61d9c722641e8a9889b4f1b.story TAB /path/to/tokenized_story_files/fa8c8f4bf51d704fe61d9c722641e8a9889b4f1b.story`

In [None]:
with open("mapping.txt", "w") as f:
  for s in stories:
    f.write("%s \t %s\n" % (os.path.join(exp_stories_dir, s), os.path.join(tokenized_dir, s)))

The command that calls `edu.stanford.nlp.process.PTBTokenizer` to tokenize each file in the stories directory and store the tokenized version in the tokenized directory is:

`
java -cp /path/to/stanford-corenlp-4.5.4/stanford-corenlp-4.5.4.jar edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines mapping.txt
`

Use `subprocess.call(command)` to run this command from Python.

In [None]:
command = ['java', '-cp', 
           '/content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/external/stanford-corenlp-4.5.4/stanford-corenlp-4.5.4.jar',
           'edu.stanford.nlp.process.PTBTokenizer', '-ioFileList', '-preserveLines', 'mapping.txt']
print("Tokenizing %i files in %s and saving in %s..." % (len(stories), exp_stories_dir, tokenized_dir))
subprocess.call(command)
print("Stanford CoreNLP Tokenizer has finished.")
os.remove("mapping.txt")

Tokenizing 4099 files in /content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/interim/exp_stories and saving in /content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/interim/exp_tokenized...
Stanford CoreNLP Tokenizer has finished.
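`subprocess.call` returns the exit status, but the cell above does not inspect it, so a failed Java run would still print the success message. The following is a minimal sketch of a stricter variant using `subprocess.run(..., check=True)`. The helper name `run_checked` is hypothetical, and the current Python interpreter stands in for the Java command so the example runs anywhere.

```python
import subprocess
import sys

def run_checked(command):
    """Run a command and raise CalledProcessError on a non-zero exit status."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout

# Stand-in command: the Python interpreter instead of the Java tokenizer.
output = run_checked([sys.executable, "-c", "print('tokenizer finished')"])
print(output.strip())
```

With the real tokenizer, the `command` list built above could be passed to the same helper, and a crash of the Java process would surface as an exception instead of passing silently.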


All 4099 story files from the experiment directory should now have a tokenized counterpart.

In [None]:
len(os.listdir(tokenized_dir))

4099
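Matching counts do not prove that every story was tokenized under the same name. As a rough sketch, the two directory listings can be compared as sets. The helper `missing_tokenized` is a hypothetical name, and temporary toy directories stand in for `exp_stories_dir` and `tokenized_dir`.

```python
import os
import tempfile

def missing_tokenized(stories_dir, tokenized_dir):
    """Return sorted file names present in stories_dir but absent from tokenized_dir."""
    return sorted(set(os.listdir(stories_dir)) - set(os.listdir(tokenized_dir)))

# Toy directories: two source stories, only one of them tokenized.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name in ("a.story", "b.story"):
        open(os.path.join(src, name), "w").close()
    open(os.path.join(dst, "a.story"), "w").close()
    missing = missing_tokenized(src, dst)

print(missing)
```

An empty result on the real directories would confirm that the name-preserving mapping worked for every file.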

## Read the .story files in python

In [None]:
os.listdir(exp_stories_dir)[0]

'fa8c8f4bf51d704fe61d9c722641e8a9889b4f1b.story'

In [None]:
file_path = os.path.join(exp_stories_dir, os.listdir(exp_stories_dir)[0])

In [None]:
with open(file_path, 'r') as file:
  print(file.read())

(CNN)More than a week after the Paris terror attacks and with an investigation in full swing, the evidence points to an international conspiracy by militants to bring terror to the streets of France's capital.

There are reports of a new accomplice, in addition to the three gunmen killed by French authorities and the widow of one, who escaped.

There's a money trail that points to Yemen and a cache of weapons reportedly found in an apartment.

This all comes as a nation continues to mourn and Parisians flock to newsstands in support of the satirical magazine targeted by the terrorists.

Everyone seems to want a piece of history.

Three million copies of Charlie Hebdo's first edition since the terrorist attacks flew off newsstand racks Wednesday. Another million or so went on sale Thursday.

The cover features a cartoon of the Prophet Mohammed crying as he holds a sign saying "Je suis Charlie," or "I am Charlie," beneath the headline "All is forgiven."  This run of the magazine could re

In [None]:
with open(file_path, 'r') as file:
  for line in file:
    print(line)

(CNN)More than a week after the Paris terror attacks and with an investigation in full swing, the evidence points to an international conspiracy by militants to bring terror to the streets of France's capital.



There are reports of a new accomplice, in addition to the three gunmen killed by French authorities and the widow of one, who escaped.



There's a money trail that points to Yemen and a cache of weapons reportedly found in an apartment.



This all comes as a nation continues to mourn and Parisians flock to newsstands in support of the satirical magazine targeted by the terrorists.



Everyone seems to want a piece of history.



Three million copies of Charlie Hebdo's first edition since the terrorist attacks flew off newsstand racks Wednesday. Another million or so went on sale Thursday.



The cover features a cartoon of the Prophet Mohammed crying as he holds a sign saying "Je suis Charlie," or "I am Charlie," beneath the headline "All is forgiven."  This run of the magaz

## Break the lines down and store them

In [None]:
lines = []
with open(file_path, 'r') as file:
  for line in file:
    line = line.strip()
    lines.append(line)

lines

["(CNN)More than a week after the Paris terror attacks and with an investigation in full swing, the evidence points to an international conspiracy by militants to bring terror to the streets of France's capital.",
 '',
 'There are reports of a new accomplice, in addition to the three gunmen killed by French authorities and the widow of one, who escaped.',
 '',
 "There's a money trail that points to Yemen and a cache of weapons reportedly found in an apartment.",
 '',
 'This all comes as a nation continues to mourn and Parisians flock to newsstands in support of the satirical magazine targeted by the terrorists.',
 '',
 'Everyone seems to want a piece of history.',
 '',
 "Three million copies of Charlie Hebdo's first edition since the terrorist attacks flew off newsstand racks Wednesday. Another million or so went on sale Thursday.",
 '',
 'The cover features a cartoon of the Prophet Mohammed crying as he holds a sign saying "Je suis Charlie," or "I am Charlie," beneath the headline "Al

## Build a function that reads a file and returns its lines as a list

In [None]:
def read_lines(file):
  lines = []

  with open(file, 'r') as f:
    for line in f:
      lines.append(line.strip())

  return lines
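A quick way to see what `read_lines` returns is to run it on a small throwaway file. The sketch below repeats the function so that it is self-contained. The explicit `encoding='utf-8'` is an assumption added here (the original relies on the platform default), and the file name `toy.story` is made up.

```python
import os
import tempfile

def read_lines(file):
    lines = []
    # encoding='utf-8' is an assumption; the notebook version uses the default.
    with open(file, 'r', encoding='utf-8') as f:
        for line in f:
            lines.append(line.strip())
    return lines

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.story')
    with open(path, 'w', encoding='utf-8') as f:
        f.write("First line.\n\n  Second line.  \n")
    toy_lines = read_lines(path)

print(toy_lines)
```

Blank lines are kept as empty strings and surrounding whitespace is stripped, which is why later steps skip `""` entries explicitly.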

## Read URLs from each .txt file of the url_lists folder

In [None]:
url_dir = '/content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/external/url_lists'

In [None]:
train_urls = os.path.join(url_dir, 'all_train.txt')
val_urls = os.path.join(url_dir, 'all_val.txt')
test_urls = os.path.join(url_dir, 'all_test.txt')

In [None]:
read_lines(train_urls)

['http://web.archive.org/web/20070716092219id_/http://us.cnn.com:80/2007/US/07/13/btsc.obrien.criminallyinsane/index.html',
 'http://web.archive.org/web/20070804173413id_/http://www.cnn.com:80/2007/SHOWBIZ/Movies/07/23/potter.radcliffe.reut/index.html?iref=newssearch',
 'http://web.archive.org/web/20070817151404id_/http://us.cnn.com:80/2007/US/08/02/bridge.survivors/index.html',
 'http://web.archive.org/web/20070827221123id_/http://www.cnn.com:80/2007/WORLD/meast/08/24/iraq.boyfolo/index.html?iref=topnews',
 'http://web.archive.org/web/20070830082937id_/http://www.cnn.com:80/2007/POLITICS/07/21/bush.colonoscopy/index.html?eref=rss_topstories',
 'http://web.archive.org/web/20070830193806id_/http://www.cnn.com:80/2007/US/law/08/24/michael.vick/index.html?eref=time_us',
 'http://web.archive.org/web/20070902195602id_/http://www.cnn.com:80/2007/WORLD/meast/08/15/iraq.prostitution/index.html?eref=ib_world',
 'http://web.archive.org/web/20070903175945id_/http://www.cnn.com:80/2007/POLITICS/08

## Convert the url links into hex hash

In [None]:
url = read_lines(train_urls)[0]
url_in_byte = bytes(url, 'utf-8')
print(url_in_byte)
h = hashlib.sha1()
h.update(url_in_byte)
h.hexdigest()

b'http://web.archive.org/web/20070716092219id_/http://us.cnn.com:80/2007/US/07/13/btsc.obrien.criminallyinsane/index.html'


'ee8871b15c50d0db17b0179a6d2beab35065f1e9'

## Function to convert a list of urls into a list of hex hashes

In [None]:
def hashhex(s):
  s = bytes(s, 'utf-8')

  h = hashlib.sha1()
  h.update(s)
  return h.hexdigest()

def get_url_hashes(url_list):
  return [hashhex(url) for url in url_list]

In [None]:
urls = read_lines(train_urls)[:10]
get_url_hashes(urls)

['ee8871b15c50d0db17b0179a6d2beab35065f1e9',
 '42c027e4ff9730fbb3de84c1af0d2c506e41c3e4',
 '06352019a19ae31e527f37f7571c6dd7f0c5da37',
 'a1ebb8bb4d370a1fdf28769206d572be60642d70',
 '24521a2abb2e1f5e34e6824e0f9e56904a2b0e88',
 '7fe70cc8b12fab2d0a258fababf7d9c6b5e1262a',
 '7c0e61ac829a3b3b653e2e3e7536cc4881d1f264',
 '5e22bbfc7232418b8d2dd646b952e404df5bd048',
 '017d27d00eb43678c15cb4a8dd4723a035323219',
 '0d43b97000ff852282c89d8d105e41495c0ee9bd']
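To sanity-check the hashing step independently of the URL lists, the same function can be run on a standard SHA-1 test input. The sketch below uses a compact one-line form of `hashhex` (equivalent to the version above) and the well-known digest of the string `"abc"`.

```python
import hashlib

def hashhex(s):
    """Return the 40-character hexadecimal SHA-1 digest of a string."""
    return hashlib.sha1(s.encode('utf-8')).hexdigest()

digest = hashhex("abc")
print(digest)
print(len(digest))
```

Every digest is 40 hexadecimal characters long, which matches the length of the `.story` file names seen earlier.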

## Finding out how many train, val and test samples are present in our experiment directory

In [None]:
train_urls_list = read_lines(train_urls)
val_urls_list = read_lines(val_urls)
test_urls_list = read_lines(test_urls)

train_hex = get_url_hashes(train_urls_list)
val_hex = get_url_hashes(val_urls_list)
test_hex = get_url_hashes(test_urls_list)

train_story_fnames = [s+'.story' for s in train_hex]
val_story_fnames = [s+'.story' for s in val_hex]
test_story_fnames = [s+'.story' for s in test_hex]

train_samples = 0
for story_file in train_story_fnames:
  if os.path.isfile(os.path.join(tokenized_dir, story_file)):
    train_samples+=1

val_samples = 0
for story_file in val_story_fnames:
  if os.path.isfile(os.path.join(tokenized_dir, story_file)):
    val_samples+=1

test_samples = 0
for story_file in test_story_fnames:
  if os.path.isfile(os.path.join(tokenized_dir, story_file)):
    test_samples+=1

print(train_samples, val_samples, test_samples)

1786 1220 1093
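The three counting loops above share the same shape, so they could be folded into one helper. This is only a sketch: `count_present` is a hypothetical name, and it checks membership against a single directory listing instead of calling `os.path.isfile` once per URL, which is an alternative technique and not the one used above.

```python
import os
import tempfile

def count_present(story_fnames, directory):
    """Count how many of the given file names exist in the directory."""
    present = set(os.listdir(directory))
    return sum(1 for name in story_fnames if name in present)

# Toy directory with two files; one of the three requested names is missing.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("x.story", "y.story"):
        open(os.path.join(tmp, name), "w").close()
    n_present = count_present(["x.story", "z.story", "y.story"], tmp)

print(n_present)
```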


All three splits are present in the experiment directory (1786 train, 1220 val, and 1093 test stories, 4099 in total), so we can work with this.

## Store the first story file path

In [None]:
for story_file in train_story_fnames:
  if os.path.isfile(os.path.join(tokenized_dir, story_file)):
    story_file = os.path.join(tokenized_dir, story_file)
    break

In [None]:
story_file

'/content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/interim/exp_tokenized/b787f83c7d6458e1574b3a2589759a6d60428da4.story'

## Read the story file

In [None]:
lines = read_lines(story_file)

In [None]:
lines

['( CNN ) -- The only thing cuter than a baby panda might be TWO baby pandas .',
 '',
 'CNN sent reporter Alina Machado and producer John Murgatroyd on Thursday to Zoo Atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the U.S. ) .',
 '',
 'Machado and Murgatroyd got to watch as veterinarians weighed and measured Cub " A , " took his temperature and listened to his heart . They \'ll be known as Cubs " A " and " B " until a naming ceremony 100 days after their birth , in keeping with Chinese tradition .',
 '',
 'Rules for the panda nursery : Before entering the " secure bio zone , " Machado and Murgatroyd were required to change into yellow gowns and put on booties over their shoes . Even the tripod got booties . Machado asked if she could hold one of the little guys but the zoo said no .',
 '',
 'Seeing the babies together was a special treat . Usually one panda is cared for by zoo staff while the other is with mom . Lun Lun is a ver

After tokenizing, each token (word or punctuation mark) is separated by a space.

## Build a function to fix sentences that do not end with an ending token

In [None]:
def fix_missing_period(line):
  if line == "": return line
  elif "@highlight" in line: return line
  elif line[-1] in END_TOKENS: return line
  
  else: return line + " ."
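A few toy inputs make the four branches of `fix_missing_period` easier to see. The sketch below repeats `END_TOKENS` and the function (with the same logic, laid out one statement per line) so that it runs on its own. The example sentences are made up.

```python
END_TOKENS = ['.', '!', '?', '...', "'", "`", '"']

def fix_missing_period(line):
    if line == "":
        return line              # blank lines stay blank
    if "@highlight" in line:
        return line              # marker lines are left untouched
    if line[-1] in END_TOKENS:
        return line              # already ends with an accepted token
    return line + " ."           # otherwise append a tokenized period

examples = ["", "@highlight", "is this a question ?", "twin pandas were born"]
fixed = [fix_missing_period(line) for line in examples]
print(fixed)
```

Only the last example changes. Note that the check looks at the final character only, so the `'...'` entry in `END_TOKENS` is effectively covered by `'.'`.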

## Preprocess the story file using .lower() and fix_missing_period()

In [None]:
lines = [line.lower() for line in lines]

lines = [fix_missing_period(line) for line in lines]

lines

['( cnn ) -- the only thing cuter than a baby panda might be two baby pandas .',
 '',
 'cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) .',
 '',
 'machado and murgatroyd got to watch as veterinarians weighed and measured cub " a , " took his temperature and listened to his heart . they \'ll be known as cubs " a " and " b " until a naming ceremony 100 days after their birth , in keeping with chinese tradition .',
 '',
 'rules for the panda nursery : before entering the " secure bio zone , " machado and murgatroyd were required to change into yellow gowns and put on booties over their shoes . even the tripod got booties . machado asked if she could hold one of the little guys but the zoo said no .',
 '',
 'seeing the babies together was a special treat . usually one panda is cared for by zoo staff while the other is with mom . lun lun is a ver

## Separate the article part and abstract part from the story file

In [None]:
next_line_highlight = False

article = []
abstract = []

for line in lines:
  if line == "":
    continue

  elif line == "@highlight":
    next_line_highlight = True

  elif next_line_highlight:
    abstract.append(line)
  else:
    article.append(line)

article, abstract

(['( cnn ) -- the only thing cuter than a baby panda might be two baby pandas .',
  'cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) .',
  'machado and murgatroyd got to watch as veterinarians weighed and measured cub " a , " took his temperature and listened to his heart . they \'ll be known as cubs " a " and " b " until a naming ceremony 100 days after their birth , in keeping with chinese tradition .',
  'rules for the panda nursery : before entering the " secure bio zone , " machado and murgatroyd were required to change into yellow gowns and put on booties over their shoes . even the tripod got booties . machado asked if she could hold one of the little guys but the zoo said no .',
  'seeing the babies together was a special treat . usually one panda is cared for by zoo staff while the other is with mom . lun lun is a very hands - on mo
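The story shown here is long, so the effect of the `@highlight` flag is easier to follow on a toy list of lines. The sketch below wraps the same loop in a hypothetical helper `split_story`. The input lines are invented for illustration.

```python
def split_story(lines):
    """Split story lines into article lines and highlight lines."""
    article, highlights = [], []
    next_line_highlight = False
    for line in lines:
        if line == "":
            continue                      # skip blank lines
        elif "@highlight" in line:
            next_line_highlight = True    # everything after this is summary
        elif next_line_highlight:
            highlights.append(line)
        else:
            article.append(line)
    return article, highlights

toy_lines = [
    "first article line .", "",
    "second article line .", "",
    "@highlight", "", "first summary .", "",
    "@highlight", "", "second summary .",
]
toy_article, toy_highlights = split_story(toy_lines)
print(toy_article)
print(toy_highlights)
```

The flag is never reset, so once the first `@highlight` marker is seen, every remaining non-blank line is treated as part of the abstract.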

## Join the article and abstract lines into single strings

In [None]:
SENTENCE_START = '<s>'
SENTENCE_END = '</s>'

In [None]:
print(' '.join(article))
print(' '.join(["%s %s %s" % (SENTENCE_START, sent, SENTENCE_END) for sent in abstract]))

( cnn ) -- the only thing cuter than a baby panda might be two baby pandas . cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) . machado and murgatroyd got to watch as veterinarians weighed and measured cub " a , " took his temperature and listened to his heart . they 'll be known as cubs " a " and " b " until a naming ceremony 100 days after their birth , in keeping with chinese tradition . rules for the panda nursery : before entering the " secure bio zone , " machado and murgatroyd were required to change into yellow gowns and put on booties over their shoes . even the tripod got booties . machado asked if she could hold one of the little guys but the zoo said no . seeing the babies together was a special treat . usually one panda is cared for by zoo staff while the other is with mom . lun lun is a very hands - on mom , machado reports . zo
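The `<s>` and `</s>` tags exist so that the individual highlight sentences can be recovered later from the single abstract string. The following is a sketch of that round trip. Both helper names (`tag_sentences`, `untag_sentences`) are hypothetical and not part of the notebook.

```python
SENTENCE_START = '<s>'
SENTENCE_END = '</s>'

def tag_sentences(sentences):
    """Wrap each sentence in <s> ... </s> and join them with spaces."""
    return ' '.join("%s %s %s" % (SENTENCE_START, s, SENTENCE_END) for s in sentences)

def untag_sentences(abstract):
    """Recover the list of sentences from a tagged abstract string."""
    sentences = []
    cursor = 0
    while True:
        start = abstract.find(SENTENCE_START, cursor)
        if start == -1:
            break
        end = abstract.find(SENTENCE_END, start + len(SENTENCE_START))
        if end == -1:
            break
        sentences.append(abstract[start + len(SENTENCE_START):end].strip())
        cursor = end + len(SENTENCE_END)
    return sentences

highlights = ["first summary .", "second summary ."]
tagged = tag_sentences(highlights)
print(tagged)
print(untag_sentences(tagged))
```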

## Define a function that does the above for all story files passed to it

In [None]:
def to_article_abstract(story_file):
  lines = read_lines(story_file)

  lines = [line.lower() for line in lines]

  lines = [fix_missing_period(line) for line in lines]

  article_lines = []
  highlights = []
  next_line_highlight = False

  for line in lines:
    if line=="":
      continue

    elif "@highlight" in line:
      next_line_highlight = True

    elif next_line_highlight:
      highlights.append(line)

    else:
      article_lines.append(line)

  article = ' '.join(article_lines)
  abstract = ' '.join(["%s %s %s" % (SENTENCE_START, sent, SENTENCE_END) for sent in highlights])

  return article, abstract

In [None]:
to_article_abstract(story_file)

('( cnn ) -- the only thing cuter than a baby panda might be two baby pandas . cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) . machado and murgatroyd got to watch as veterinarians weighed and measured cub " a , " took his temperature and listened to his heart . they \'ll be known as cubs " a " and " b " until a naming ceremony 100 days after their birth , in keeping with chinese tradition . rules for the panda nursery : before entering the " secure bio zone , " machado and murgatroyd were required to change into yellow gowns and put on booties over their shoes . even the tripod got booties . machado asked if she could hold one of the little guys but the zoo said no . seeing the babies together was a special treat . usually one panda is cared for by zoo staff while the other is with mom . lun lun is a very hands - on mom , machado reports .

## Create a tf_example instance that will store the story's article and abstract

In [None]:
tf_example = example_pb2.Example()

In [None]:
tf_example



In [None]:
article, abstract = to_article_abstract(story_file)
tf_example.features.feature['article'].bytes_list.value.extend([bytes(article, 'utf-8')])
tf_example.features.feature['abstract'].bytes_list.value.extend([bytes(abstract, 'utf-8')])

tf_example

features {
  feature {
    key: "abstract"
    value {
      bytes_list {
        value: "<s> twin pandas were born at the zoo atlanta on july 15 . </s> <s> cnn \'s alina machado and john murgatroyd watched cub \" a \" get a vet check . </s> <s> check out the instagram photos and footage they shot . </s>"
      }
    }
  }
  feature {
    key: "article"
    value {
      bytes_list {
        value: "( cnn ) -- the only thing cuter than a baby panda might be two baby pandas . cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) . machado and murgatroyd got to watch as veterinarians weighed and measured cub \" a , \" took his temperature and listened to his heart . they \'ll be known as cubs \" a \" and \" b \" until a naming ceremony 100 days after their birth , in keeping with chinese tradition . rules for the panda nursery : before entering the 

## Serialize the TensorFlow example object into a byte string

In [None]:
tf_example.SerializeToString()

b'\n\x93\x0b\n\xa6\t\n\x07article\x12\x9a\t\n\x97\t\n\x94\t( cnn ) -- the only thing cuter than a baby panda might be two baby pandas . cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) . machado and murgatroyd got to watch as veterinarians weighed and measured cub " a , " took his temperature and listened to his heart . they \'ll be known as cubs " a " and " b " until a naming ceremony 100 days after their birth , in keeping with chinese tradition . rules for the panda nursery : before entering the " secure bio zone , " machado and murgatroyd were required to change into yellow gowns and put on booties over their shoes . even the tripod got booties . machado asked if she could hold one of the little guys but the zoo said no . seeing the babies together was a special treat . usually one panda is cared for by zoo staff while the other is with m

In [None]:
tf_example_str = tf_example.SerializeToString()

str_len = len(tf_example_str)

## How struct can save the serialized string in binary format

In [None]:
struct.pack('q', str_len)

b'\x96\x05\x00\x00\x00\x00\x00\x00'

In [None]:
struct.pack('%ds'%str_len, tf_example_str)

b'\n\x93\x0b\n\xa6\t\n\x07article\x12\x9a\t\n\x97\t\n\x94\t( cnn ) -- the only thing cuter than a baby panda might be two baby pandas . cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) . machado and murgatroyd got to watch as veterinarians weighed and measured cub " a , " took his temperature and listened to his heart . they \'ll be known as cubs " a " and " b " until a naming ceremony 100 days after their birth , in keeping with chinese tradition . rules for the panda nursery : before entering the " secure bio zone , " machado and murgatroyd were required to change into yellow gowns and put on booties over their shoes . even the tripod got booties . machado asked if she could hold one of the little guys but the zoo said no . seeing the babies together was a special treat . usually one panda is cared for by zoo staff while the other is with m
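The two `struct.pack` calls produce an 8-byte length prefix followed by the raw payload. Below is a minimal in-memory sketch of that layout and how to undo it. The payload is a stand-in for `tf_example.SerializeToString()`, and the explicit little-endian format `'<q'` is an assumption made here for reproducibility (the notebook uses the native `'q'`, which gives the same bytes on little-endian machines).

```python
import struct

payload = b"serialized example bytes"  # stand-in for the serialized tf.Example

# Record layout: [8-byte signed length][payload bytes]
record = struct.pack('<q', len(payload)) + struct.pack('%ds' % len(payload), payload)

# Reading it back: first the length, then exactly that many bytes.
(length,) = struct.unpack('<q', record[:8])
(restored,) = struct.unpack('%ds' % length, record[8:8 + length])

print(length == len(payload), restored == payload)
print(struct.pack('<q', 1430))
```

The last line shows that the prefix `b'\x96\x05\x00\x00\x00\x00\x00\x00'` printed above encodes the length 1430 (0x0596), least significant byte first.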

## Bring everything into one function

In [None]:
def write_to_bin(url_file):
  url_list = read_lines(url_file)
  stories = get_url_hashes(url_list)

  story_fnames = [s+'.story' for s in stories]

  for story in story_fnames:
    if os.path.isfile(os.path.join(tokenized_dir, story)):
      story_file = os.path.join(tokenized_dir, story)

      article, abstract = to_article_abstract(story_file)

      tf_example = example_pb2.Example()
      tf_example.features.feature['article'].bytes_list.value.extend([bytes(article, 'utf-8')])
      tf_example.features.feature['abstract'].bytes_list.value.extend([bytes(abstract, 'utf-8')])

      tf_example_str = tf_example.SerializeToString()
      str_len = len(tf_example_str)

      print(tf_example_str, str_len)
      break

In [None]:
write_to_bin(train_urls)

b'\n\x93\x0b\n\xa6\t\n\x07article\x12\x9a\t\n\x97\t\n\x94\t( cnn ) -- the only thing cuter than a baby panda might be two baby pandas . cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) . machado and murgatroyd got to watch as veterinarians weighed and measured cub " a , " took his temperature and listened to his heart . they \'ll be known as cubs " a " and " b " until a naming ceremony 100 days after their birth , in keeping with chinese tradition . rules for the panda nursery : before entering the " secure bio zone , " machado and murgatroyd were required to change into yellow gowns and put on booties over their shoes . even the tripod got booties . machado asked if she could hold one of the little guys but the zoo said no . seeing the babies together was a special treat . usually one panda is cared for by zoo staff while the other is with m

## Modify the function so that it can save the binary files

In [None]:
exp_finished_dir = '/content/exp_finished'

if not os.path.isdir(exp_finished_dir): os.makedirs(exp_finished_dir)

exp_out_file = os.path.join(exp_finished_dir, 'exp_out.bin')

In [None]:
def write_to_bin(url_file, out_file):
  url_list = read_lines(url_file)
  stories = get_url_hashes(url_list)

  story_fnames = [s+'.story' for s in stories]
  max_itr = 10
  itr = 1

  with open(out_file, 'wb') as writer:
    for story in story_fnames:
      if os.path.isfile(os.path.join(tokenized_dir, story)):
        story_file = os.path.join(tokenized_dir, story)

        article, abstract = to_article_abstract(story_file)

        tf_example = example_pb2.Example()
        tf_example.features.feature['article'].bytes_list.value.extend([bytes(article, 'utf-8')])
        tf_example.features.feature['abstract'].bytes_list.value.extend([bytes(abstract, 'utf-8')])

        tf_example_str = tf_example.SerializeToString()
        str_len = len(tf_example_str)

        writer.write(struct.pack('q', str_len))
        writer.write(struct.pack('%ds'%str_len, tf_example_str))

        itr += 1

        if itr > max_itr:
          break

In [None]:
write_to_bin(train_urls, exp_out_file)
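To check that a file written this way can be read back record by record, something like the generator below could be used. It is a sketch under the assumption that the file contains only `[8-byte length][payload]` records written with the native `'q'` format, as in `write_to_bin`. The name `iter_records` and the toy payloads are hypothetical, and plain bytes stand in for serialized examples.

```python
import os
import struct
import tempfile

def iter_records(path):
    """Yield each payload from a file of [8-byte length][payload] records."""
    with open(path, 'rb') as reader:
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:
                break  # clean end of file
            (str_len,) = struct.unpack('q', len_bytes)
            (payload,) = struct.unpack('%ds' % str_len, reader.read(str_len))
            yield payload

# Write two toy records in the same format, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    toy_bin = os.path.join(tmp, 'toy.bin')
    with open(toy_bin, 'wb') as writer:
        for item in (b"first", b"second record"):
            writer.write(struct.pack('q', len(item)))
            writer.write(struct.pack('%ds' % len(item), item))
    records = list(iter_records(toy_bin))

print(records)
```

For the real file, each yielded payload could then be parsed with `example_pb2.Example.FromString(payload)` to recover the `article` and `abstract` features.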

## How to create a vocabulary from the stories

In [None]:
vocab_counter = collections.Counter()
vocab_counter

Counter()

In [None]:
article, abstract

('( cnn ) -- the only thing cuter than a baby panda might be two baby pandas . cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) . machado and murgatroyd got to watch as veterinarians weighed and measured cub " a , " took his temperature and listened to his heart . they \'ll be known as cubs " a " and " b " until a naming ceremony 100 days after their birth , in keeping with chinese tradition . rules for the panda nursery : before entering the " secure bio zone , " machado and murgatroyd were required to change into yellow gowns and put on booties over their shoes . even the tripod got booties . machado asked if she could hold one of the little guys but the zoo said no . seeing the babies together was a special treat . usually one panda is cared for by zoo staff while the other is with mom . lun lun is a very hands - on mom , machado reports .

In [None]:
[t for t in abstract.split(' ') if t not in [SENTENCE_START, SENTENCE_END]]

['twin',
 'pandas',
 'were',
 'born',
 'at',
 'the',
 'zoo',
 'atlanta',
 'on',
 'july',
 '15',
 '.',
 'cnn',
 "'s",
 'alina',
 'machado',
 'and',
 'john',
 'murgatroyd',
 'watched',
 'cub',
 '"',
 'a',
 '"',
 'get',
 'a',
 'vet',
 'check',
 '.',
 'check',
 'out',
 'the',
 'instagram',
 'photos',
 'and',
 'footage',
 'they',
 'shot',
 '.']

In [None]:
art_tokens = article.split(' ')
abs_tokens = abstract.split(' ')  # named abs_tokens to avoid shadowing the built-in abs()

abs_tokens = [t for t in abs_tokens if t not in [SENTENCE_START, SENTENCE_END]]

tokens = art_tokens + abs_tokens

tokens = [t.strip() for t in tokens]
tokens = [t for t in tokens if t!=""]

vocab_counter.update(tokens)

vocab_counter

Counter({'(': 2,
         'cnn': 3,
         ')': 2,
         '--': 1,
         'the': 15,
         'only': 2,
         'thing': 1,
         'cuter': 1,
         'than': 1,
         'a': 10,
         'baby': 2,
         'panda': 5,
         'might': 1,
         'be': 3,
         'two': 1,
         'pandas': 3,
         '.': 15,
         'sent': 1,
         'reporter': 1,
         'alina': 2,
         'machado': 6,
         'and': 9,
         'producer': 1,
         'john': 2,
         'murgatroyd': 4,
         'on': 5,
         'thursday': 1,
         'to': 7,
         'zoo': 5,
         'atlanta': 2,
         'for': 3,
         'behind': 1,
         '-': 3,
         'scenes': 1,
         'peek': 1,
         'at': 2,
         'twin': 2,
         'brothers': 1,
         'surviving': 1,
         'twins': 1,
         'in': 2,
         'u.s.': 1,
         'got': 2,
         'watch': 1,
         'as': 2,
         'veterinarians': 1,
         'weighed': 1,
         'measured': 1,
         'c
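The same counting logic is easier to verify on two short strings where the totals can be checked by hand. The sketch below wraps the steps in a hypothetical helper `count_tokens`. The toy article and abstract are invented.

```python
import collections

SENTENCE_START = '<s>'
SENTENCE_END = '</s>'

def count_tokens(article, abstract, counter):
    """Add the tokens of one story to the counter, ignoring sentence tags."""
    art_tokens = article.split(' ')
    abs_tokens = [t for t in abstract.split(' ')
                  if t not in (SENTENCE_START, SENTENCE_END)]
    tokens = [t.strip() for t in art_tokens + abs_tokens]
    counter.update(t for t in tokens if t != "")
    return counter

toy_counter = collections.Counter()
count_tokens("the panda ate . the panda slept .", "<s> the panda ate . </s>", toy_counter)
print(toy_counter)
```

Here `the`, `panda` and `.` each occur three times, `ate` twice and `slept` once, while the `<s>` and `</s>` tags never enter the counter.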

## Combine the vocabulary into our function

In [None]:
def write_to_bin(url_file, out_file, makevocab=False, vocab_path = "/finished_file/vocab"):
  url_list = read_lines(url_file)
  stories = get_url_hashes(url_list)

  story_fnames = [s+'.story' for s in stories]
  max_itr = 10
  itr = 1

  if makevocab:
    vocab_counter = collections.Counter()

  with open(out_file, 'wb') as writer:
    for story in story_fnames:
      if os.path.isfile(os.path.join(tokenized_dir, story)):
        story_file = os.path.join(tokenized_dir, story)

        article, abstract = to_article_abstract(story_file)

        tf_example = example_pb2.Example()
        tf_example.features.feature['article'].bytes_list.value.extend([bytes(article, 'utf-8')])
        tf_example.features.feature['abstract'].bytes_list.value.extend([bytes(abstract, 'utf-8')])

        tf_example_str = tf_example.SerializeToString()
        str_len = len(tf_example_str)

        writer.write(struct.pack('q', str_len))
        writer.write(struct.pack('%ds'%str_len, tf_example_str))

        if makevocab:
          art_tokens = article.split(' ')
          abs_tokens = abstract.split(' ')
          abs_tokens = [t for t in abs_tokens if t not in [SENTENCE_START, SENTENCE_END]]

          tokens = art_tokens + abs_tokens

          tokens = [t.strip() for t in tokens]
          tokens = [t for t in tokens if t!=""]

          vocab_counter.update(tokens)

        # Count every written story, whether or not a vocabulary is being built.
        itr += 1

        if itr > max_itr:
          break
  if makevocab:
    with open(vocab_path, 'w') as writer:
      for word, count in vocab_counter.most_common(20):
        writer.write(word + ' ' + str(count) + '\n')

In [None]:
vocab_path = '/content/exp_finished/vocab'
write_to_bin(train_urls, exp_out_file, True, vocab_path)

## Modify the function to save a larger vocabulary

In [None]:
VOCAB_SIZE = 4000

def write_to_bin(url_file, out_file, makevocab=False, vocab_path = "/finished_file/vocab"):
  print("Using %s to get urls and convert them to hashes for generating story file names" % url_file)
  url_list = read_lines(url_file)
  stories = get_url_hashes(url_list)

  story_fnames = [s+'.story' for s in stories]

  if makevocab:
    vocab_counter = collections.Counter()

  print("Writing binary tf_example_str to %s" % out_file)
  with open(out_file, 'wb') as writer:
    for story in story_fnames:
      if os.path.isfile(os.path.join(tokenized_dir, story)):
        story_file = os.path.join(tokenized_dir, story)

        article, abstract = to_article_abstract(story_file)

        tf_example = example_pb2.Example()
        tf_example.features.feature['article'].bytes_list.value.extend([bytes(article, 'utf-8')])
        tf_example.features.feature['abstract'].bytes_list.value.extend([bytes(abstract, 'utf-8')])

        tf_example_str = tf_example.SerializeToString()
        str_len = len(tf_example_str)

        writer.write(struct.pack('q', str_len))
        writer.write(struct.pack('%ds'%str_len, tf_example_str))

        if makevocab:
          art_tokens = article.split(' ')
          abs_tokens = abstract.split(' ')
          abs_tokens = [t for t in abs_tokens if t not in [SENTENCE_START, SENTENCE_END]]

          tokens = art_tokens + abs_tokens

          tokens = [t.strip() for t in tokens]
          tokens = [t for t in tokens if t!=""]

          vocab_counter.update(tokens)

  if makevocab:
    with open(vocab_path, 'w') as writer:
      for word, count in vocab_counter.most_common(VOCAB_SIZE):
        writer.write(word + ' ' + str(count) + '\n')
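
To check the vocab file after it is written, it can be read back into a dictionary. The sketch below assumes the `word count` per-line format produced above; `load_vocab` is a hypothetical helper that is not part of this notebook, and the sample counts are made up for illustration.

```python
import collections
import os
import tempfile

def load_vocab(vocab_path, max_size=None):
    """Read 'word count' lines into an ordered dict (hypothetical helper)."""
    vocab = collections.OrderedDict()
    with open(vocab_path, 'r') as reader:
        for line in reader:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            word, count = parts
            vocab[word] = int(count)
            if max_size is not None and len(vocab) >= max_size:
                break
    return vocab

# Demo with made-up counts, written the same way write_to_bin writes them.
demo_counter = collections.Counter({'the': 5, 'panda': 3, 'zoo': 1})
demo_path = os.path.join(tempfile.mkdtemp(), 'vocab')
with open(demo_path, 'w') as writer:
    for word, count in demo_counter.most_common(2):
        writer.write(word + ' ' + str(count) + '\n')

demo_vocab = load_vocab(demo_path)
print(demo_vocab)
```

Because `most_common` sorts by descending count, the file (and the dictionary read from it) lists the most frequent words first.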

## Use the function to create train.bin, val.bin & test.bin

In [None]:
exp_finished_dir = "/content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/preprocessed/exp_bins_vocab"

In [None]:
train_bin = os.path.join(exp_finished_dir, "train.bin")
val_bin = os.path.join(exp_finished_dir, "val.bin")
test_bin = os.path.join(exp_finished_dir, "test.bin")
vocab_path = os.path.join(exp_finished_dir, "vocab")

In [None]:
write_to_bin(train_urls, train_bin, makevocab=True, vocab_path=vocab_path)
write_to_bin(val_urls, val_bin)
write_to_bin(test_urls, test_bin)

Using /content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/external/url_lists/all_train.txt to get urls and convert them to hashes for generating story file names
Writing binary tf_example_str to /content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/preprocessed/exp_bins_vocab/train.bin
Using /content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/external/url_lists/all_val.txt to get urls and convert them to hashes for generating story file names
Writing binary tf_example_str to /content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/preprocessed/exp_bins_vocab/val.bin
Using /content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/external/url_lists/all_test.txt to get urls and convert them to hashes for generating story file names
Writing binary tf_example_str to /content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/preprocessed/exp_bins_vocab/test.bin


## Make chunks from the bin files

In [None]:
with open(exp_out_file, 'rb') as file:
  print(file.read())

b'\x96\x05\x00\x00\x00\x00\x00\x00\n\x93\x0b\n\xe7\x01\n\x08abstract\x12\xda\x01\n\xd7\x01\n\xd4\x01<s> twin pandas were born at the zoo atlanta on july 15 . </s> <s> cnn \'s alina machado and john murgatroyd watched cub " a " get a vet check . </s> <s> check out the instagram photos and footage they shot . </s>\n\xa6\t\n\x07article\x12\x9a\t\n\x97\t\n\x94\t( cnn ) -- the only thing cuter than a baby panda might be two baby pandas . cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) . machado and murgatroyd got to watch as veterinarians weighed and measured cub " a , " took his temperature and listened to his heart . they \'ll be known as cubs " a " and " b " until a naming ceremony 100 days after their birth , in keeping with chinese tradition . rules for the panda nursery : before entering the " secure bio zone , " machado and murgatroyd were

How `BufferedReader` objects work.

In [None]:
reader = open(exp_out_file, 'rb')
reader

<_io.BufferedReader name='/content/exp_finished/exp_out.bin'>

In [None]:
reader.read(8)

b'\x96\x05\x00\x00\x00\x00\x00\x00'

In [None]:
reader.read(8)

b'\n\x93\x0b\n\xe7\x01\n\x08'

The first 8 bytes encode the length of the serialized string returned by `example_pb2.Example().SerializeToString()`.

Why 8? Because the format character `q` (a signed `long long`) has a size of 8 bytes.
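
The size claim can be checked directly with the `struct` module. This is a small sketch; the explicit `<` (little-endian) prefix is an assumption added here so the bytes come out the same on any machine, whereas the notebook itself uses the native-order `'q'`.

```python
import struct

# 'q' is a signed 8-byte integer (C long long).
q_size = struct.calcsize('q')
print(q_size)

# Pack the length 1430; '<' forces little-endian byte order.
packed = struct.pack('<q', 1430)
print(packed)

# unpack returns a tuple, so destructure the single value.
(length,) = struct.unpack('<q', packed)
print(length)
```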

In [None]:
reader = open(exp_out_file, 'rb')

len_bytes = reader.read(8)
len_bytes

b'\x96\x05\x00\x00\x00\x00\x00\x00'

In [None]:
struct.unpack('q', len_bytes)

(1430,)

`struct.unpack` always returns a tuple, even for a single value.

In [None]:
str_len = struct.unpack('q', len_bytes)[0]

Read the next `str_len` bytes, which contain the serialized example.

In [None]:
example_str = struct.unpack('%ds'%str_len, reader.read(str_len))[0]
example_str

b'\n\x93\x0b\n\xe7\x01\n\x08abstract\x12\xda\x01\n\xd7\x01\n\xd4\x01<s> twin pandas were born at the zoo atlanta on july 15 . </s> <s> cnn \'s alina machado and john murgatroyd watched cub " a " get a vet check . </s> <s> check out the instagram photos and footage they shot . </s>\n\xa6\t\n\x07article\x12\x9a\t\n\x97\t\n\x94\t( cnn ) -- the only thing cuter than a baby panda might be two baby pandas . cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) . machado and murgatroyd got to watch as veterinarians weighed and measured cub " a , " took his temperature and listened to his heart . they \'ll be known as cubs " a " and " b " until a naming ceremony 100 days after their birth , in keeping with chinese tradition . rules for the panda nursery : before entering the " secure bio zone , " machado and murgatroyd were required to change into yellow 

Check the deserialized version of the example

In [None]:
example_pb2.Example.FromString(example_str)

features {
  feature {
    key: "abstract"
    value {
      bytes_list {
        value: "<s> twin pandas were born at the zoo atlanta on july 15 . </s> <s> cnn \'s alina machado and john murgatroyd watched cub \" a \" get a vet check . </s> <s> check out the instagram photos and footage they shot . </s>"
      }
    }
  }
  feature {
    key: "article"
    value {
      bytes_list {
        value: "( cnn ) -- the only thing cuter than a baby panda might be two baby pandas . cnn sent reporter alina machado and producer john murgatroyd on thursday to zoo atlanta for a behind - the - scenes peek at the twin panda brothers ( the only surviving panda twins in the u.s. ) . machado and murgatroyd got to watch as veterinarians weighed and measured cub \" a , \" took his temperature and listened to his heart . they \'ll be known as cubs \" a \" and \" b \" until a naming ceremony 100 days after their birth , in keeping with chinese tradition . rules for the panda nursery : before entering the 
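
The read steps above generalize to a loop over every record in a file. This is a minimal sketch, assuming the layout used in this notebook (an 8-byte length prefix followed by the payload); `iter_records` is a hypothetical helper, and the demo uses plain byte strings in place of serialized `Example` protos so that it runs without TensorFlow.

```python
import os
import struct
import tempfile

def iter_records(bin_path):
    """Yield each length-prefixed payload from a .bin file (hypothetical helper)."""
    with open(bin_path, 'rb') as reader:
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:
                break  # clean end of file
            if len(len_bytes) < 8:
                raise ValueError("truncated length prefix")
            str_len = struct.unpack('q', len_bytes)[0]
            payload = reader.read(str_len)
            if len(payload) < str_len:
                raise ValueError("truncated record")
            yield payload

# Demo: write three made-up payloads in the same layout, then read them back.
demo_payloads = [b'first example', b'second', b'']
demo_bin = os.path.join(tempfile.mkdtemp(), 'demo.bin')
with open(demo_bin, 'wb') as writer:
    for payload in demo_payloads:
        writer.write(struct.pack('q', len(payload)))
        writer.write(payload)

records = list(iter_records(demo_bin))
print(records)
```

With a real `.bin` file, each yielded payload could be passed to `example_pb2.Example.FromString` as in the cell above.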

In [None]:
tmp_finished_dir = "/content/exp_finished"
CHUNK_SIZE = 200

def chunk_file(bin_file):
  chunk = 0
  finished = False

  # Close the input file automatically once every record has been copied.
  with open(bin_file, 'rb') as reader:
    while not finished:
      # Named chunk_path so it does not shadow the chunk_file function.
      chunk_path = os.path.join(tmp_finished_dir, "chunk-%03d.bin" % chunk)

      with open(chunk_path, 'wb') as writer:
        for _ in range(CHUNK_SIZE):
          len_bytes = reader.read(8)

          # An empty read means the end of the input file.
          if not len_bytes:
            finished = True
            break

          str_len = struct.unpack('q', len_bytes)[0]
          example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]

          writer.write(struct.pack('q', str_len))
          writer.write(struct.pack('%ds' % str_len, example_str))

      chunk += 1

In [None]:
chunk_file(exp_out_file)

In [3]:
CHUNK_SIZE = 200
finished_dir = "/content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/preprocessed/exp_bins_vocab"
chunks_dir = os.path.join(finished_dir, "chunks")

def chunk_file(set_name):
  in_file = os.path.join(finished_dir, set_name+".bin")

  finished = False
  chunk = 0

  # Close the input file automatically once every record has been copied.
  with open(in_file, 'rb') as reader:
    while not finished:
      # Named chunk_path so it does not shadow the chunk_file function.
      chunk_path = os.path.join(chunks_dir, "%s-%03d.bin" % (set_name, chunk))

      with open(chunk_path, 'wb') as writer:
        for _ in range(CHUNK_SIZE):
          len_bytes = reader.read(8)

          # An empty read means the end of the input file.
          if not len_bytes:
            finished = True
            break

          str_len = struct.unpack('q', len_bytes)[0]
          example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]

          writer.write(struct.pack('q', str_len))
          writer.write(struct.pack('%ds' % str_len, example_str))

      chunk += 1

In [4]:
def chunk_all():
  # Create the chunks directory if it does not exist yet.
  os.makedirs(chunks_dir, exist_ok=True)

  set_names = ["train", "val", "test"]

  for set_name in set_names:
    chunk_file(set_name)

  print("Saved chunked data in %s..." % chunks_dir)

In [5]:
chunk_all()

Saved chunked data in %s... /content/drive/MyDrive/Projects/Suvidha-Foundation-Internship-Project/data/preprocessed/exp_bins_vocab/chunks
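
One way to sanity-check the chunking is to count the records in each chunk file and compare the total with the source `.bin` file. The sketch below assumes the same length-prefixed layout and the `<set_name>-NNN.bin` naming used above; `count_records` and `count_chunked` are hypothetical helper names, and the demo files contain made-up payloads.

```python
import glob
import os
import struct
import tempfile

def count_records(bin_path):
    """Count length-prefixed records in one file (hypothetical helper)."""
    count = 0
    with open(bin_path, 'rb') as reader:
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:
                return count
            str_len = struct.unpack('q', len_bytes)[0]
            reader.seek(str_len, os.SEEK_CUR)  # skip over the payload
            count += 1

def count_chunked(chunks_dir, set_name):
    """Map each chunk file name of a split to its record count."""
    pattern = os.path.join(chunks_dir, "%s-*.bin" % set_name)
    return {os.path.basename(p): count_records(p)
            for p in sorted(glob.glob(pattern))}

# Demo: two made-up chunk files holding 3 and 2 records.
demo_dir = tempfile.mkdtemp()
for chunk, n_records in enumerate([3, 2]):
    with open(os.path.join(demo_dir, "train-%03d.bin" % chunk), 'wb') as writer:
        for i in range(n_records):
            payload = b'x' * (i + 1)
            writer.write(struct.pack('q', len(payload)))
            writer.write(payload)

chunk_counts = count_chunked(demo_dir, "train")
print(chunk_counts)
```

A count of 0 for the last chunk is expected when the number of records is an exact multiple of `CHUNK_SIZE`, because the chunking loop opens a new file before it discovers that the input is exhausted.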
