# Transcript Summarization
This notebook reads the transcript dataset and generates summaries using the Bert Extractive Summarizer.

There are two methods of BERT summarization:
* Bert Extractive Summarizer
* SBert Summarizer

Hyperparameters to consider for the models:
* `ratio`: Ratio of sentences to use.
* `min_length`: Minimum length of sentence candidates to utilize for the summary.
* `max_length`: Maximum length of sentence candidates to utilize for the summary.
* `use_first`: Whether or not to use the first sentence.
* `algorithm`: Which clustering algorithm to use (`kmeans` or `gmm`).
* `num_sentences`: Number of sentences to use (overrides `ratio`).
* `return_as_list`: Whether or not to return the sentences as a list.
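
As a rough illustration of how `ratio` and `num_sentences` interact, the sketch below computes how many sentences a summary would keep. The helper `resolve_summary_size` is hypothetical and not part of the summarizer library; it only mirrors the behavior described above, where `num_sentences` overrides `ratio`. The rounding rule and the minimum of one sentence are assumptions.

```python
from typing import Optional


def resolve_summary_size(total_sentences: int,
                         ratio: float = 0.2,
                         num_sentences: Optional[int] = None) -> int:
    """Hypothetical helper: how many sentences the summary should keep."""
    if total_sentences <= 0:
        return 0
    if num_sentences is not None:
        # An explicit count overrides the ratio, capped at what is available.
        return min(num_sentences, total_sentences)
    # Assumption: round down, but always keep at least one sentence.
    return max(int(total_sentences * ratio), 1)


print(resolve_summary_size(26, ratio=0.2))                   # ratio only
print(resolve_summary_size(26, ratio=0.2, num_sentences=6))  # explicit count wins
```

With a fixed `num_sentences`, short and long transcripts get summaries of the same size, which is the choice this notebook makes later on.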

In [1]:
import pandas as pd
import numpy as np
from summarizer import Summarizer
from summarizer.sbert import SBertSummarizer
import re
import os

## Input Course URL
The course id can be found in the hyperlink for any page in the course.

In [2]:
url = 'https://www.coursera.org/learn/siads697698/lecture/3vwIb/how-to-do-a-standup'
# Raw string avoids invalid-escape warnings; [\w-]+ also accepts hyphenated course slugs
course = re.search(r'(?<=coursera\.org/learn/)([\w-]+)', url).group(0)
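
The extraction above raises an `AttributeError` if the URL does not contain a course segment. A minimal sketch of a more defensive variant follows; the function name `extract_course_id` is hypothetical, and the pattern assumes that course slugs consist of letters, digits, underscores, and hyphens.

```python
import re
from typing import Optional

# Assumption: course slugs consist of letters, digits, underscores, and hyphens.
COURSE_PATTERN = re.compile(r"coursera\.org/learn/([\w-]+)")


def extract_course_id(url: str) -> Optional[str]:
    """Return the course slug from a Coursera URL, or None if it is absent."""
    match = COURSE_PATTERN.search(url)
    return match.group(1) if match else None


url = "https://www.coursera.org/learn/siads697698/lecture/3vwIb/how-to-do-a-standup"
print(extract_course_id(url))
print(extract_course_id("https://example.com/not-a-course"))
```

Returning `None` lets the caller decide whether to stop or ask for another URL, instead of failing inside the regular expression call.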

## Load Directory

In [4]:
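# List the intermediate data folder (relative path, matching the cells below)
directory = os.listdir('./intermediate_data')
new = True  # stays True unless a summaries file for this course already exists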
for file in directory:
    if '{}_summaries'.format(course) in file:
        print("Course Already In Directory")
        new = False
        break

Course Already In Directory
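
The same check can be wrapped in a small function so that later cells can reuse the result. This is only a sketch: `summaries_exist` is a hypothetical helper, and the demonstration runs against a temporary directory rather than the real data folder.

```python
import tempfile
from pathlib import Path


def summaries_exist(data_dir: Path, course: str) -> bool:
    """Return True if any file name in data_dir contains '<course>_summaries'."""
    if not data_dir.is_dir():
        return False
    marker = "{}_summaries".format(course)
    return any(marker in path.name for path in data_dir.iterdir())


# Demonstration against a throwaway directory rather than the real data folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    before = summaries_exist(demo_dir, "siads697698")
    (demo_dir / "transcripts_siads697698_summaries.csv").touch()
    after = summaries_exist(demo_dir, "siads697698")

print(before, after)
```

A boolean result like this could be used to skip the summarization step when the output file is already present.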


## Read in Transcript DataFrame

In [7]:
df = pd.read_csv("./intermediate_data/transcripts_{}.csv".format(course), index_col=0)
df.head()

Unnamed: 0,course_id,video_title,transcripts,length
0,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,We'll see if anybody is joining us today. What...,20293
1,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"I'm going to do this, Git log.one line and tha...",20294
2,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"Well, not sure if anybody is joining this morn...",21840
3,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"now. Cool. There's lots of stuff here. Wow, l...",21840
4,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"Hello, nice to meet you. >> Nice to meet you t...",24317


## Generate First Transcript to Summarize


In [8]:
#Input url
print(course)
title = url.split('/')[-1]
print(title)

siads697698
how-to-do-a-standup


In [9]:
body = df[df['video_title']==title].transcripts.iloc[0]
print(len(body.split('. ')))
print(body)

26
I mentioned to you that we're going to do some biweekly stand-ups, so here's what to do in a stand-up. Some of you who might have worked at tech companies or similar organizations might already know about them, but if you're new, here is roughly what we're going to do. When it comes time for your stand-up, you're going to get on your webcam, or your screen recording tool, doesn't matter. You can show your face or not, and you're going to answer the following questions. What did my team work on this past week? What are we working on now? What issues are blocking us? Even if you're not especially blocked by anything, just say what are some of the challenges that you're facing or things that you're not sure of. To emphasize here, any team member can make the recording, so it doesn't have to be everybody on the team. I just want one representative from the team to make the stand-up and feel free to use your screen-sharing to show us any code snippets or cool results. It can be quite ca
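
Note that `body.split('. ')` only gives a rough sentence count, because it ignores sentences that end with a question mark or an exclamation mark. The sketch below shows one possible alternative using a regular expression; the boundary rule is an assumption and is not how the summarizer itself splits sentences.

```python
import re

# Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def count_sentences(text: str) -> int:
    """Count sentences by splitting after terminal punctuation."""
    parts = [p for p in SENTENCE_BOUNDARY.split(text.strip()) if p]
    return len(parts)


# Sample sentences taken from the transcript printed above.
sample = ("You can show your face or not, and you're going to answer "
          "the following questions. What did my team work on this past week? "
          "What are we working on now? What issues are blocking us?")

print(len(sample.split(". ")))   # period-only split
print(count_sentences(sample))   # punctuation-aware split
```

The difference matters when choosing `num_sentences`, since the period-only count understates how many candidate sentences a transcript with many questions contains.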

## Generate Summary

### Initialize Summarizer

In [10]:
model = Summarizer('distilbert-base-uncased', hidden=[-2], hidden_concat=True, random_state=42)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Generate Result

In [12]:
result = model(body, num_sentences=6, use_first=False) #min_length=60

### Print Result

In [13]:
result

"I mentioned to you that we're going to do some biweekly stand-ups, so here's what to do in a stand-up. Some of you who might have worked at tech companies or similar organizations might already know about them, but if you're new, here is roughly what we're going to do. When it comes time for your stand-up, you're going to get on your webcam, or your screen recording tool, doesn't matter. If you go over, that's fine, if you go under, that can also be okay. There is limited storage space on Slack, so please don't directly upload your video to Slack. Then I will be asking you to watch and comment on two other classmates stand-ups, one stand up per team every two weeks, don't worry, I will remind you when they are coming, you will not be in the dark about this, and I can also create, you know, I'll put one in the channel, a demo, so you can see a real one."

## Apply to All Transcripts


In [15]:
df['summary'] = df['transcripts'].apply(lambda x: model(x, num_sentences=6, use_first=False))
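
Applying the model row by row assumes that every transcript is a non-empty string. A minimal sketch of a guard follows; `safe_summarize` is a hypothetical wrapper, and `first_sentence` is only a stand-in for the real summarizer so that the example runs without downloading a model.

```python
from typing import Callable


def safe_summarize(text: object,
                   summarize: Callable[[str], str],
                   min_chars: int = 1) -> str:
    """Summarize text, returning '' for missing or empty transcripts."""
    # pandas represents missing CSV cells as float NaN, which is not a str.
    if not isinstance(text, str) or len(text.strip()) < min_chars:
        return ""
    return summarize(text)


def first_sentence(text: str) -> str:
    """Stand-in summarizer for this demo; the notebook would pass the model."""
    return text.split(". ")[0].rstrip(".") + "."


print(safe_summarize("Hello there. More text.", first_sentence))
print(repr(safe_summarize(float("nan"), first_sentence)))
```

In this notebook, the wrapper could be passed a callable such as `lambda t: model(t, num_sentences=6, use_first=False)` inside the `apply` call.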

In [18]:
df

Unnamed: 0,course_id,video_title,transcripts,length,summary
0,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,We'll see if anybody is joining us today. What...,20293,"Because at first time, I was thinking about us..."
1,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"I'm going to do this, Git log.one line and tha...",20294,"In this branch, let's create a new file and ca..."
2,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"Well, not sure if anybody is joining this morn...",21840,"If that doesn't get you exactly what you want,..."
3,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"now. Cool. There's lots of stuff here. Wow, l...",21840,If you're not very comfortable doing terminal ...
4,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"Hello, nice to meet you. >> Nice to meet you t...",24317,">> Yeah, you can do that too, so today, I'm go..."
5,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,model perhaps we don't have any rules like th...,24317,"And so the ideal model for each one could, it ..."
6,siads697698,how-to-write-an-effective-blog-post,It's not enough to just do data science on you...,7725,It could be that you've done something that ot...
7,siads697698,how-to-do-a-standup,I mentioned to you that we're going to do some...,3253,I mentioned to you that we're going to do some...
8,siads697698,how-to-collaborate-with-a-team,One of the most unexpectedly challenging parts...,13279,I don't know what a really reliable and certai...
9,siads697698,capstone-overview,"Hi, welcome to the capstone. My name's Dr. Ell...",5438,Office hours are not required or expected of y...


## Save DataFrame with Summaries
We save the transcripts together with their generated summaries as a CSV file for further analysis.

In [17]:
df.to_csv("./intermediate_data/transcripts_{}_summaries.csv".format(course))

## Next step
After saving the dataset here, run the next step in the workflow, [4-KeywordExtraction.ipynb](./4-KeywordExtraction.ipynb), or go back to [0-Workflow.ipynb](./0-Workflow.ipynb).

---

**Authors:** [Wei Zhou](mailto:weiwzhou@umich.edu), [Nick Capaldini](mailto:nickcaps@umich.edu), University of Michigan, August 21, 2022

---