# Task 2: Summarization
Podcasts are a rapidly growing medium for news, commentary, entertainment, and learning. Some podcast shows release new episodes on a regular schedule (daily, weekly, etc.); others release them irregularly. Some shows feature short episodes of 5 minutes or less touching on one or two topics; others release episodes of 3 hours or more touching on a wide range of topics. Some are structured as news delivery, some as conversations, some as storytelling.

Given a podcast episode, its audio, and its transcription, return a short text snippet capturing the most important information in the content. Returned summaries should be grammatical, standalone statements that are significantly shorter than the input episode description.

The user task is to provide a short text summary that the user might read when deciding whether to listen to a podcast. Thus the summary should accurately convey the content of the podcast, and be short enough to quickly read on a smartphone screen. It should also be human-readable.

In [1]:
import pandas as pd
import numpy as np
import os
import re
import json

from IPython.display import display, Javascript

In [2]:
dataset_path = os.path.join(os.path.abspath(""), 'podcasts-no-audio-13GB')

### Training set inspection

In [3]:
metadata_path_train = os.path.join(dataset_path, 'metadata.tsv')
metadata_train = pd.read_csv(metadata_path_train, sep='\t')
print("Columns: ", metadata_train.columns)
print("Shape: ", metadata_train.shape)

Columns:  Index(['show_uri', 'show_name', 'show_description', 'publisher', 'language',
       'rss_link', 'episode_uri', 'episode_name', 'episode_description',
       'duration', 'show_filename_prefix', 'episode_filename_prefix'],
      dtype='object')
Shape:  (105360, 12)


Analyze one episode of the train dataset

In [4]:
i = 0
episode_example_train = metadata_train.iloc[i]
print(episode_example_train)
print("\nCopy this uri into the browser to listen to the episode:\n", episode_example_train['episode_uri'])

show_uri                                 spotify:show:2NYtxEZyYelR6RMKmjfPLB
show_name                                               Kream in your Koffee
show_description           A 20-something blunt female takes on the world...
publisher                                                        Katie Houle
language                                                              ['en']
rss_link                            https://anchor.fm/s/11b84b68/podcast/rss
episode_uri                           spotify:episode:000A9sRBYdVh66csG2qEdj
episode_name                                         1: It’s Christmas Time!
episode_description        On the first ever episode of Kream in your Kof...
duration                                                           12.700133
show_filename_prefix                             show_2NYtxEZyYelR6RMKmjfPLB
episode_filename_prefix                               000A9sRBYdVh66csG2qEdj
Name: 0, dtype: object

Copy this uri into the browser to listen to the epis

In [6]:
# extract the two reference characters (first a digit, then a letter or digit) used to locate the episode transcript
show_filename = episode_example_train['show_filename_prefix']
episode_filename = episode_example_train['episode_filename_prefix'] + ".json"
dir_1, dir_2 = re.match(r'show_(\d)(\w).*', show_filename).groups()

interval_folders = [range(0,3), range(3, 6), range(6,8)]

# check which is the main folder containing the transcript
main_dir = ""
for interval in interval_folders:
    if int(dir_1) in interval:
        main_dir = "podcasts-transcripts-{}to{}".format(interval[0], interval[-1])
assert main_dir != ""

# check that the transcript file exists at the path derived from the subfolders
transcipt_path = os.path.join(dataset_path, main_dir, "spotify-podcasts-2020", "podcasts-transcripts", dir_1, dir_2, show_filename, episode_filename)
assert os.path.isfile(transcipt_path)

print("Transcript path:\n", transcipt_path)

Transcript path:
 c:\Users\peppe\UNIBO\Natural Language Processing\lab\Project\podcasts-no-audio-13GB\podcasts-transcripts-0to2\spotify-podcasts-2020\podcasts-transcripts\2\N\show_2NYtxEZyYelR6RMKmjfPLB\000A9sRBYdVh66csG2qEdj.json
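
The path construction above is repeated later for the gold set, so it could be wrapped in a small helper. The following is only a minimal sketch: the function name `build_transcript_path` and the constant `FOLDER_INTERVALS` are hypothetical, and the folder layout (`0to2`, `3to5`, `6to7`) is assumed from the intervals used in the cell above rather than verified against the whole dataset.

```python
import os
import re

# Assumed grouping of the top-level folders by the first character after
# "show_" (taken from the intervals used in the notebook cell above).
FOLDER_INTERVALS = [(0, 2), (3, 5), (6, 7)]


def build_transcript_path(dataset_path, show_filename_prefix, episode_filename_prefix):
    """Return the expected transcript path for an episode (hypothetical helper)."""
    match = re.match(r'show_(\d)(\w)', show_filename_prefix)
    if match is None:
        raise ValueError(f"Unexpected show prefix: {show_filename_prefix!r}")
    dir_1, dir_2 = match.groups()

    # pick the top-level folder whose interval contains the first digit
    for low, high in FOLDER_INTERVALS:
        if low <= int(dir_1) <= high:
            main_dir = f"podcasts-transcripts-{low}to{high}"
            break
    else:
        raise ValueError(f"No transcript folder covers digit {dir_1!r}")

    return os.path.join(
        dataset_path, main_dir, "spotify-podcasts-2020", "podcasts-transcripts",
        dir_1, dir_2, show_filename_prefix, episode_filename_prefix + ".json",
    )
```

The helper only builds the path; checking that the file exists (as the `assert os.path.isfile(...)` above does) is left to the caller.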


In [7]:
with open(transcipt_path, 'r') as f:
    episode_json = json.load(f)
    # it seems that the last result in each transcript is a repetition of the first one, so we ignore it
    transcripts = [result["alternatives"][0]['transcript'] for result in episode_json["results"][:-1]]

print(f"Episode description:\n{episode_example_train['episode_description']}")
print(f"\nEpisode transcription:\n{' '.join(transcripts)}")

Episode description:
On the first ever episode of Kream in your Koffee, Katie talks about tips for Christmas shopping. We also get a little insight into who and what we’ll be hearing about in next weeks episode! 

Episode transcription:
Hello. Hello. Hello everyone. This is Katie and we are here together on our first ever episode of cream in your coffee. Thank you so much guys for humoring me on this podcast Journey. It's been a huge goal of mine for the past few years. I finally just growing a pair and jumping right into it and seeing how it goes. So, thank you again for bearing with me here.  Alrighty guys, so I figured for the first episode why not jump into a topic that is so exciting for pretty much everybody out there just around the corner and seven or eight days. We have the wonderful Christmas and I'm so excited Christmas is a great time of year for me. I just love the spirit of happiness and joy and the giving spirit. I love every single part of it and I hope that you guys ar
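
The extraction step above can be sketched as a reusable function that also tolerates results without a `transcript` key. This is an illustration under assumptions: the name `transcript_text` and the `drop_last` flag are hypothetical, the JSON layout (`results` → `alternatives` → `transcript`) is the one accessed in the cell above, and the toy dictionary is made up for the example. Unlike the cell above, this sketch strips surrounding whitespace from each chunk before joining.

```python
def transcript_text(episode_json, drop_last=True):
    """Join the first alternative of each result into one string (hypothetical helper)."""
    results = episode_json.get("results", [])
    if drop_last:
        # assumption from the notebook: the last result repeats earlier content
        results = results[:-1]

    chunks = []
    for result in results:
        alternatives = result.get("alternatives") or [{}]
        text = alternatives[0].get("transcript")
        if text:  # skip results that carry no transcript
            chunks.append(text.strip())
    return " ".join(chunks)


# Toy input (not real data) mimicking the assumed structure
example = {
    "results": [
        {"alternatives": [{"transcript": "Hello everyone.", "confidence": 0.9}]},
        {"alternatives": [{}]},
        {"alternatives": [{"transcript": " This is a test."}]},
        {"alternatives": [{"words": []}]},
    ]
}
print(transcript_text(example))
```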

### Golden set
**Golden set pool 1**

Creator-provided podcast episode descriptions will be judged by assessors who have read the entire transcript. The judgements will be on an EGFB (Excellent, Good, Fair, Bad) scale, approximately as follows:

* Excellent: the summary accurately conveys all the most important attributes of the episode, which could include topical content, genre, and participants. It contains almost no redundant material which isn’t needed when deciding whether to listen.
* Good: the summary conveys most of the most important attributes and gives the reader a reasonable sense of what the episode contains. Does not need to be fully coherent or well edited. It contains little redundant material which isn’t needed when deciding whether to listen.
* Fair: the summary conveys some attributes of the content but gives the reader an imperfect or incomplete sense of what the episode contains. It may contain some redundant material which isn’t needed when deciding whether to listen.
* Bad: the summary does not convey any of the most important content items of the episode or gives the reader an incorrect sense of what the episode contains. It may contain a lot of redundant information that isn’t needed when deciding whether to listen to the episode.

The pool will be filtered down to the subset that have been judged to be excellent. Evaluation on this subset of the labels is analogous to predicting the creator-provided episode description, in the subset of episodes where the description is high quality.

**Golden set pool 2**

Automatically generated summaries produced by participants for the same set of podcasts. They will be given relevance assessments (on an EGFB scale) by assessors. The pool will be filtered down to the subset judged to be excellent.

The test set will be selected to have qualified descriptions, which will serve as the standard against which submitted summaries are compared. Submitted summaries will be judged on the four-step EGFB scale according to how well they let a listener decide whether to listen to a podcast, by conveying the gist of what to expect from the episode. In this first year, the judgements are not expected to assess style, format, or other aesthetic or non-topical qualities.

In [14]:
metadata_path_gold = os.path.join(dataset_path, '150gold.tsv')
metadata_gold = pd.read_csv(metadata_path_gold, sep='\t')
print("Columns: ", metadata_gold.columns)
print("Shape: ", metadata_gold.shape)

Columns:  Index(['show name', 'episode name', 'episode id', 'creator description',
       'EGFB', 'lexrank summary', 'EGFB.1', 'textrank summary', 'EGFB.2',
       'lsa summary', 'EGFB.3', 'quasi-supervised summary', 'EGFB.4',
       'supervised summary', 'EGFB.5'],
      dtype='object')
Shape:  (150, 15)
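
The filtering to "excellent" items described for the golden set pools can be sketched on plain dictionaries, for instance the records obtained with `metadata_gold.to_dict("records")`. The function name `excellent_rows` is hypothetical, the assumption that the `EGFB` column grades the creator description follows from the column order shown above, and the toy rows below are invented for illustration only.

```python
def excellent_rows(rows, grade_column="EGFB"):
    """Keep only the rows whose grade in `grade_column` is 'E' (hypothetical helper)."""
    return [
        row for row in rows
        if str(row.get(grade_column, "")).strip().upper() == "E"
    ]


# Toy rows (not taken from the dataset)
toy_rows = [
    {"episode id": "ep-1", "EGFB": "E", "EGFB.1": "G"},
    {"episode id": "ep-2", "EGFB": "B", "EGFB.1": "E"},
    {"episode id": "ep-3", "EGFB": " e ", "EGFB.1": "F"},
]

# pool 1 analogue: creator descriptions judged excellent
print([row["episode id"] for row in excellent_rows(toy_rows)])
# pool 2 analogue for one system, assuming "EGFB.1" grades the lexrank summary
print([row["episode id"] for row in excellent_rows(toy_rows, "EGFB.1")])
```

Missing or non-string grades (such as `NaN` from pandas) simply fail the comparison and are dropped.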


In [42]:
i = 0
episode_example_gold = metadata_gold.iloc[i]
print(episode_example_gold)
print("\nCopy this uri into the browser to listen to the episode:\n", episode_example_gold['episode id'])

show name                                               Alpha Male Strategies
episode name                Passive Aggressive Women & Developing Mental S...
episode id                             spotify:episode:4KRC1TZ28FavN3J5zLHEtQ
creator description         Boost the podcast! Leave a 5-star review on th...
EGFB                                                                        B
lexrank summary             All right guys now as y'all guys might know so...
EGFB.1                                                                      G
textrank summary            There's no such thing as talk about passengers...
EGFB.2                                                                      F
lsa summary                 I'll pay for all you guys who don't know what ...
EGFB.3                                                                      F
quasi-supervised summary    All women a passive-aggressive. When a woman w...
EGFB.4                                                          

In [48]:
# find the corresponding episode in the train set
episode_example_gold = pd.concat([metadata_gold.iloc[i], metadata_train[metadata_train['episode_uri'] == episode_example_gold['episode id']].iloc[0]])

# extract the two reference characters (first a digit, then a letter or digit) used to locate the episode transcript
show_filename = episode_example_gold['show_filename_prefix']
episode_filename = episode_example_gold['episode_filename_prefix'] + ".json"
dir_1, dir_2 = re.match(r'show_(\d)(\w).*', show_filename).groups()

interval_folders = [range(0,3), range(3, 6), range(6,8)]

# check which is the main folder containing the transcript
main_dir = ""
for interval in interval_folders:
    if int(dir_1) in interval:
        main_dir = "podcasts-transcripts-{}to{}".format(interval[0], interval[-1])
assert main_dir != ""

# check that the transcript file exists at the path derived from the subfolders
transcipt_path_gold = os.path.join(dataset_path, main_dir, "spotify-podcasts-2020", "podcasts-transcripts", dir_1, dir_2, show_filename, episode_filename)
assert os.path.isfile(transcipt_path_gold)

print("Transcript path:\n", transcipt_path_gold)

Transcript path:
 c:\Users\peppe\UNIBO\Natural Language Processing\lab\Project\podcasts-no-audio-13GB\podcasts-transcripts-3to5\spotify-podcasts-2020\podcasts-transcripts\3\Z\show_3Z2DiiMgsKo9h6EfuWkVPY\4KRC1TZ28FavN3J5zLHEtQ.json


In [54]:
with open(transcipt_path_gold, 'r') as f:
    episode_json = json.load(f)
    # it seems that the last result in each transcript is a repetition of the first one, so we ignore it
    transcripts = [result["alternatives"][0]['transcript'] for result in episode_json["results"][:-1] if 'transcript' in result["alternatives"][0]]

print(f"Episode description:\n{episode_example_gold['episode_description']}")
print(f"\nLexrank description:\n{episode_example_gold['lexrank summary']}")
print(f"\nEpisode transcription:\n{' '.join(transcripts)}")

Episode description:
Boost the podcast! Leave a 5-star review on the Apple, and Stitcher podcast apps.  YouTube- AMS Check out my Patreon for my latest content - https://www.patreon.com/alphamalestra...  Coaching - http://www.alphamalestrategies.com  Instagram - https://www.instagram.com/alpha_male_s/  AMS Fitness channel - https://m.youtube.com/channel/UCwS7o_...  AMS Clothing - http://amsclothingbrand.com/  Purchase E-book @ amazon.com. Don't try to buy on amazon app. follow link @ https://www.amazon.com/dp/B07HDCLCK5/...  Purchase hard copy @https://www.amazon.com/gp/product/172...   ---   Support this podcast: https://anchor.fm/alphamalestrategies/support

Lexrank description:
All right guys now as y'all guys might know sometimes maybe she did have a little bit of interest but you just did some just went on a kilt that but a lot of times they don't even be that guys when you go up and get some of these women normal guys. Alright guys, so anytime a woman rejects you what she basical