
In [1]:
import tensorflow as tf
import pathlib

In [2]:
path_to_cnn_stories = tf.keras.utils.get_file(
    origin="https://huggingface.co/datasets/cnn_dailymail/resolve/main/data/cnn_stories.tgz",
    extract=True
)

path_to_dailymail_stories = tf.keras.utils.get_file(
    origin="https://huggingface.co/datasets/cnn_dailymail/resolve/main/data/dailymail_stories.tgz",
    extract=True
)

Downloading data from https://huggingface.co/datasets/cnn_dailymail/resolve/main/data/cnn_stories.tgz
Downloading data from https://huggingface.co/datasets/cnn_dailymail/resolve/main/data/dailymail_stories.tgz


In [3]:
path_to_cnn_stories, path_to_dailymail_stories

('/root/.keras/datasets/cnn_stories.tgz',
 '/root/.keras/datasets/dailymail_stories.tgz')

In [4]:
!ls -l /root/.keras/datasets

total 521960
drwxr-xr-x 3 root root      4096 Jul 23 16:03 cnn
-rw-r--r-- 1 root root 158577824 Jul 23 16:03 cnn_stories.tgz
drwxr-xr-x 3 root root      4096 Jul 23 16:04 dailymail
-rw-r--r-- 1 root root 375893739 Jul 23 16:04 dailymail_stories.tgz


In [5]:
cnn_stories_dir = pathlib.Path('/root/.keras/datasets/cnn/stories')
dailymail_stories_dir = pathlib.Path('/root/.keras/datasets/dailymail/stories')

In [6]:
cnn_stories_dir, dailymail_stories_dir

(PosixPath('/root/.keras/datasets/cnn/stories'),
 PosixPath('/root/.keras/datasets/dailymail/stories'))
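The two directory paths above are hard-coded to the Colab layout. As a minimal sketch, assuming the archive is extracted next to the downloaded `.tgz` (which is what the `ls` output above shows, but may differ in other environments or Keras versions), the same paths could be derived from the value returned by `get_file`. The helper name `stories_dir_for` is hypothetical and not part of the notebook.

```python
import pathlib

def stories_dir_for(archive_path, dataset_name):
    """Hypothetical helper: build <download dir>/<dataset_name>/stories.

    Assumes the archive was extracted beside the downloaded .tgz file,
    which matches the directory listing shown above.
    """
    return pathlib.Path(archive_path).parent / dataset_name / "stories"

# The archive path is the one printed by the notebook in cell [3].
cnn_dir = stories_dir_for("/root/.keras/datasets/cnn_stories.tgz", "cnn")
print(cnn_dir)
```
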

In [7]:
def print_filenames(dir_path, num_files=5):
  '''Prints the names of the `.story` files found in `dir_path`.
  At most `num_files` file names are shown.

  Arguments:
    dir_path: pathlib.Path, the directory whose file names
              should be printed.
    num_files: int, maximum number of file names to print.

  Returns:
    None
  '''

  count = 0
  for f in dir_path.glob('*.story'):
    print(f.name)
    count += 1

    if count == num_files:
      break
  else:
    print(f"Fewer than {num_files} files are present!")

In [8]:
print_filenames(cnn_stories_dir)

438411e10e1ef79b47cc48cd95296d85798c1e38.story
e453e379e8a70af2d3dff1c75c41b0a35edbe9cc.story
2079f35aca44978a7985afe0ddacdf02bedf98f2.story
4702f28c198223157bb8f69665b039d560eebb0f.story
db3e2ea79323a98379228b17cd3b9dec17dbd2cb.story


In [9]:
print_filenames(dailymail_stories_dir)

f4ba18635997139c751311b9f2ad18f455dd7c98.story
4a3ef32cff589c85ad0d22724e2ed747c0dacf87.story
5375ed75939108c72001b043d3b4799c47f32be9.story
fe9e57c21e21fb4ec26e394f0e92824f38d18a95.story
6a544b5cdd2384be6cc657b265d7aa2de72a99e0.story


In [12]:
# Define the global variables
dm_single_close_quote = u'\u2019' # unicode
dm_double_close_quote = u'\u201d'
END_TOKENS = ['.', '!', '?', '...', "'", "`", '"',
              dm_single_close_quote, dm_double_close_quote, ")"]

MAX_STORIES = 50000
TRAIN_SIZE = 40000
VAL_SIZE = 5000
TEST_SIZE = 5000

In [14]:
# Taking a sample .story file from cnn stories
sample_filename = "438411e10e1ef79b47cc48cd95296d85798c1e38.story"
sample_filedir = cnn_stories_dir

sample_filepath = sample_filedir / sample_filename
with open(sample_filepath, 'r') as f:
  sample_story = f.read()

print(sample_story)

New York (CNN) -- The U.S. population is expected to top out at close to 312.8 million people just around the time crowds gather to watch the ball drop on New Year's Eve, according to new census data released Thursday.

The figure represents a 0.7% increase from last year, adding 2,250,129 people to the U.S. population since the start of 2011, and a 1.3% increase since Census Day, April 1, 2010.

The agency estimates that beginning in January, one American will be born every eight seconds and one will die every 12 seconds.

U.S.-bound immigrants are also expected to add one person every 46 seconds.

That combination of births, deaths and migration is expected to add a single person to the U.S. population every 17 seconds, the Census Bureau said.

Meanwhile, millions are set to ring in the new year.

In New York, authorities are preparing for large crowds in Manhattan's Times Square, where Lady Gaga is expected to join Mayor Michael Bloomberg to push the button that drops the Waterford 

I am creating a function `fix_missing_period` that takes 2 arguments: `line`, the line to check and fix, and `end_tokens`, a list of all the tokens that count as the end of a sentence.

These are the steps -
1. Check if the line contains `@highlight`; if so, return the line as it is.
2. Check if the line is empty; if so, return the line as it is.
3. Check if the line ends with any of the `end_tokens`; if so, return the line as it is.
4. Only if none of the above conditions match, append `.` to the line.

In [16]:
def fix_missing_period(line, end_tokens=END_TOKENS):
  '''Adds a missing period to a story line that does not end with
  any of the given end_tokens.

  Arguments:
    line: string, the story line to check.
    end_tokens: list of strings, the tokens that count as a line ending.

  Returns:
    The line unchanged if it contains "@highlight", is empty, or already
    ends with an end token; otherwise the line with '.' appended.
  '''
  if "@highlight" in line:
    return line
  elif line == "":
    return line
  elif line[-1] in end_tokens:
    return line

  return line + '.'

In [18]:
fix_missing_period(sample_story.split('\n')[0])

"New York (CNN) -- The U.S. population is expected to top out at close to 312.8 million people just around the time crowds gather to watch the ball drop on New Year's Eve, according to new census data released Thursday."

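One detail of the check above: `line[-1]` is a single character, so the multi-character entry `'...'` in `END_TOKENS` is never matched by itself (lines ending in `...` still pass because their last character is `.`). A minimal sketch of a variant that compares whole suffixes with `str.endswith` is shown below. The name `fix_missing_period_endswith` is hypothetical, and for this particular token list the results are the same as the original function.

```python
END_TOKENS = ['.', '!', '?', '...', "'", "`", '"', '\u2019', '\u201d', ")"]

def fix_missing_period_endswith(line, end_tokens=END_TOKENS):
    """Variant of fix_missing_period that uses str.endswith (illustrative)."""
    if "@highlight" in line or line == "":
        return line
    # str.endswith accepts a tuple of suffixes, so multi-character
    # tokens such as '...' are compared as whole suffixes.
    if line.endswith(tuple(end_tokens)):
        return line
    return line + "."

print(fix_missing_period_endswith("us population nears 313 million"))  # period appended
print(fix_missing_period_endswith("what a year!"))                     # unchanged
print(fix_missing_period_endswith("@highlight"))                       # unchanged
```
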
I am creating a function `split_article_summary` which will split the story into article and summary parts.

The function takes only 1 argument, the `story`, which will be split into article and summary.

The steps to follow are -
1. Split the story by new line `\n`. I will get a list of lines.
2. Strip the lines by using list comprehension.
3. Use list comprehension to lowercase each line with `.lower()`.
4. Fix each line by adding a period if it has none, using the `fix_missing_period` function.
5. Make 2 empty lists for `article` and `summary`.
6. Go through each line. For each line, I need to check 4 things,
  * whether the line contains `@highlight`; if True, set `next_highlight` to `True`, because the summary line follows after a blank line. The flag is never reset, so every non-empty line after the first `@highlight` is treated as summary.
  * whether the line is empty (`""`); if True, ignore it.
  * whether `next_highlight` is True; if True, append the line to `summary`.
  * if none of the above, append the line to `article`.
7. After filling the `article` and `summary` lists with lines, join those sentences to make the whole article and summary. Here, I am using the `.join()` method.

In [19]:
def split_article_summary(story):
  '''Splits the story into 2 parts, the article and the summary of that
  article, and returns both.

  Arguments:
    story: string, the content of a .story file holding both the article
           and its summary.

  Returns:
    article, summary: the two parts of the story as separate strings.

  '''
  lines = story.split('\n')
  lines = [line.strip() for line in lines]
  lines = [line.lower() for line in lines]

  # Fix the ending period
  lines = [fix_missing_period(line) for line in lines]

  # List to contain the article and summary lines
  article = []
  summary = []

  # Indicator of whether the next line is the summary or not
  next_highlight = False

  for line in lines:
    if "@highlight" in line:
      next_highlight = True
    elif line=="":
      continue
    elif next_highlight:
      summary.append(line)
    else:
      article.append(line)

  article = ' '.join(article)
  summary = ' '.join(summary)

  return article, summary

In [20]:
split_article_summary(sample_story)

('new york (cnn) -- the u.s. population is expected to top out at close to 312.8 million people just around the time crowds gather to watch the ball drop on new year\'s eve, according to new census data released thursday. the figure represents a 0.7% increase from last year, adding 2,250,129 people to the u.s. population since the start of 2011, and a 1.3% increase since census day, april 1, 2010. the agency estimates that beginning in january, one american will be born every eight seconds and one will die every 12 seconds. u.s.-bound immigrants are also expected to add one person every 46 seconds. that combination of births, deaths and migration is expected to add a single person to the u.s. population every 17 seconds, the census bureau said. meanwhile, millions are set to ring in the new year. in new york, authorities are preparing for large crowds in manhattan\'s times square, where lady gaga is expected to join mayor michael bloomberg to push the button that drops the waterford cr
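To make the splitting behaviour easier to see than in the long output above, here is a minimal sketch on a made-up toy story. The function `split_toy_story` is a condensed, hypothetical restatement of the steps listed earlier (it is not the notebook's function), and `toy_story` is invented for illustration.

```python
# Single-character end tokens, mirroring END_TOKENS in the notebook.
END_CHARS = ('.', '!', '?', "'", '`', '"', '\u2019', '\u201d', ')')

def split_toy_story(story):
    """Condensed restatement of the splitting steps, for illustration only."""
    article, summary = [], []
    in_highlights = False
    for line in story.split('\n'):
        line = line.strip().lower()
        if line == "":
            continue
        if "@highlight" in line:
            in_highlights = True  # stays True for the rest of the story
            continue
        if not line.endswith(END_CHARS):
            line += '.'
        (summary if in_highlights else article).append(line)
    return ' '.join(article), ' '.join(summary)

toy_story = (
    "First sentence of the article.\n\n"
    "Second sentence without a period\n\n"
    "@highlight\n\n"
    "First highlight\n\n"
    "@highlight\n\n"
    "Second highlight"
)
article, summary = split_toy_story(toy_story)
print(article)  # first sentence of the article. second sentence without a period.
print(summary)  # first highlight. second highlight.
```

Every non-empty line after the first `@highlight` marker ends up in the summary, and the markers themselves are dropped.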

I am creating a function `get_articles_summaries` which will process the stories present in the cnn and dailymail directories and return the articles and summaries as two lists.

This function takes 2 arguments. One is `stories_dir`, a path object from the `pathlib` library (a `PosixPath` on Colab), and the other is `max_stories`, the maximum number of stories that we will extract from the directory.

The process is simple. We will follow these steps -
1. Create 2 empty lists, `articles` and `summaries`.
2. Loop through all the files present in the directory `stories_dir` using the `.glob` generator method.
3. Keep a `count` variable for the number of processed stories, and break from the loop when it hits `max_stories`.
4. Inside the loop, open the file in read mode (`'r'`), then use the `.read()` method to read the story.
5. After reading each story, split it into its article and summary parts and append them to the `articles` and `summaries` lists.
6. Return the 2 lists.

In [26]:
def get_articles_summaries(stories_dir, max_stories):
  '''Reads the stories in the stories_dir folder and returns their articles
  and summaries as two lists.

  Arguments:
    stories_dir: pathlib.Path, the directory where the stories are stored
    max_stories: int, maximum number of stories to read

  Returns:
    articles, summaries: two lists of strings of equal length.

  '''
  articles = []
  summaries = []

  count = 0
  for f in stories_dir.glob("*.story"):
    count += 1
    with open(f, 'r') as reader:
      story = reader.read()

      article, summary = split_article_summary(story)

      articles.append(article)
      summaries.append(summary)

    if count == max_stories:
      break

  return articles, summaries

```
cnn
  stories
    438411e10e1ef79b47cc48cd95296d85798c1e38.story
    e453e379e8a70af2d3dff1c75c41b0a35edbe9cc.story
    2079f35aca44978a7985afe0ddacdf02bedf98f2.story
    4702f28c198223157bb8f69665b039d560eebb0f.story
    db3e2ea79323a98379228b17cd3b9dec17dbd2cb.story
    ...
    ...
    ...

dailymail
  stories
    f4ba18635997139c751311b9f2ad18f455dd7c98.story
    4a3ef32cff589c85ad0d22724e2ed747c0dacf87.story
    5375ed75939108c72001b043d3b4799c47f32be9.story
    fe9e57c21e21fb4ec26e394f0e92824f38d18a95.story
    6a544b5cdd2384be6cc657b265d7aa2de72a99e0.story
    ...
    ...
    ...

```

Out of all the available `.story` files, we read only the first `MAX_STORIES` files that `.glob` yields.
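The order in which `.glob` yields files is not guaranteed, so which stories fall within the cap can vary between machines. A minimal sketch of a repeatable selection, assuming that sorting by file name is acceptable, is shown below. The helper `first_story_paths` is hypothetical, and the temporary directory only stands in for the real stories folder.

```python
import itertools
import pathlib
import tempfile

def first_story_paths(stories_dir, max_stories):
    """Hypothetical helper: return at most max_stories .story paths, sorted by name."""
    # Sorting the glob results makes the selection repeatable across runs.
    return list(itertools.islice(sorted(stories_dir.glob("*.story")), max_stories))

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = pathlib.Path(tmp)
    # Create three dummy story files and one unrelated file.
    for name in ("c", "a", "b"):
        (tmp_dir / f"{name}.story").write_text("text", encoding="utf-8")
    (tmp_dir / "notes.txt").write_text("ignored", encoding="utf-8")
    picked = [p.name for p in first_story_paths(tmp_dir, 2)]

print(picked)  # ['a.story', 'b.story']
```

Sorting requires listing the whole directory first, which is a small cost compared with reading the files themselves.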

In [27]:
cnn_articles, cnn_summaries = get_articles_summaries(cnn_stories_dir, MAX_STORIES)

len(cnn_articles)

50000

In [30]:
print(f"Total no of cnn stories captured are {len(cnn_articles)}\n\n")
print(f"One of the CNN articles: {cnn_articles[0]}\n\n")
print(f"The summary of this article: {cnn_summaries[0]}\n\n")

Total no of cnn stories captured are 50000


One of the CNN articles: new york (cnn) -- the u.s. population is expected to top out at close to 312.8 million people just around the time crowds gather to watch the ball drop on new year's eve, according to new census data released thursday. the figure represents a 0.7% increase from last year, adding 2,250,129 people to the u.s. population since the start of 2011, and a 1.3% increase since census day, april 1, 2010. the agency estimates that beginning in january, one american will be born every eight seconds and one will die every 12 seconds. u.s.-bound immigrants are also expected to add one person every 46 seconds. that combination of births, deaths and migration is expected to add a single person to the u.s. population every 17 seconds, the census bureau said. meanwhile, millions are set to ring in the new year. in new york, authorities are preparing for large crowds in manhattan's times square, where lady gaga is expected to join mayo