# Work Summary

## 1. Extract subtitles from the video

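## 2. Generate preprocessed subtitles
1.  Remove auxiliary descriptions enclosed in brackets   
 ex. [Music]
2. Remove filler words that occur in spoken language
3. Remove repeated words
4. Add sentence-ending punctuation (rpunct)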

## 3. Summarization model tests
1. "facebook/bart-large-cnn" model
2. "sshleifer/distilbart-cnn-12-6" model
3. "human-centered-summarization/financial-summarization-pegasus" model
4. "Bert-extractive-summarization" model

In [2]:
# Install libraries

!pip install youtube_transcript_api
!pip install pytube
!pip install transformers
!pip install rpunct



In [3]:
import pandas as pd
import numpy as np
from youtube_transcript_api import YouTubeTranscriptApi as yta
from pytube import YouTube, extract
from transformers import pipeline
import re

#### Test videos
1.   yh_subtitle_merge - "https://www.youtube.com/watch?v=TYZoTlKN0rg" : 3 min 34 s. U.S. economy is ‘not oil independent’: Yahoo Finance’s Rick Newman, Yahoo Finance
2.   ted_subtitle_merge - "https://www.youtube.com/watch?v=uXrCeiQxWyc" : 18 min 55 s. What is economic value, and who creates it? | Mariana Mazzucato, TED
3. gr_subtitle_merge - "https://www.youtube.com/watch?v=3A19a8jpcDw" : 8 min 9 s. Tutorial: Disable GOS (Game Optimizing Service) on Samsung Galaxy S22 Ultra, 60FPS Genshin Impact!, Golden Reviewer
4. wion_subtitle_merge - "https://www.youtube.com/watch?v=eT4LpF87aGk" : 2 min 54 s. Ukraine war to lead to a frozen conflict? Russia's ploy to keep the West away | World English News, WION

## 1. Extract video subtitles
- Extract the subtitles from a YouTube video link
- After extraction, check the length and the beginning of the text

In [4]:
# Function that fetches subtitles from a link (returns both the unmerged and the merged subtitles)

def get_subtitle(youtube_link) :
  
  video_id = extract.video_id(youtube_link)
  subtitle = yta.get_transcript(video_id)
  
  subtitle_merge = ''
  for line in subtitle :
    subtitle_merge += line['text'] + " "
  
  return subtitle, subtitle_merge
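
Some transcripts (the TED one printed later in this notebook, for instance) carry line breaks inside each segment, and the merge loop leaves a trailing space. The following is a minimal sketch of an alternative merge step, assuming each segment is a dict with a `'text'` key as in `get_subtitle`; `merge_segments` is a hypothetical helper name, not part of the original notebook.

```python
def merge_segments(segments):
    """Join transcript segments into one string separated by single spaces."""
    # Assumption: every segment is a dict with a 'text' key, as in get_subtitle.
    pieces = (segment["text"] for segment in segments)
    # str.split() without arguments splits on any whitespace run, newlines included,
    # so re-joining with " " normalizes line breaks and double spaces.
    return " ".join(" ".join(pieces).split())


if __name__ == "__main__":
    demo = [{"text": "Value creation.\nWealth"}, {"text": " creation. "}]
    print(merge_segments(demo))
```

Note that this would change the reported lengths slightly, because the trailing space and embedded newlines are dropped.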

### (1) Yahoo

In [4]:
# Pass the link of the YouTube video whose subtitles we want
yh_subtitle, yh_subtitle_merge = get_subtitle("https://www.youtube.com/watch?v=TYZoTlKN0rg")
# Check the subtitle length
print(len(yh_subtitle_merge))

3491


In [5]:
# Check the beginning of the subtitles
print(yh_subtitle_merge)

as we watch the spike that we've been seeing in commodity prices we have heard a renewal of calls in the united states to drill more that the u.s should be energy independent our rick newman has been looking into that question um and it's it's not sort of as easy as that i think is the is the bottom line right rick there's a lot of confusion about this so i'm doing some reporting to try to bust some of these myths so people think the united states used to be energy independent so here's what that actually means when you combine all forms of energy that we produce in the united states that's oil but also natural gas also coal also renewables such as solar and wind yes we do consume more than we produce which i guess you could say makes us independent but we also participate in global markets which means we export a lot of that we still import energy and in terms of just oil we are not oil independent we have we still uh consume considerably more oil than we produce and we we have not be

### (2) Ted

In [5]:
# Pass the link of the YouTube video whose subtitles we want
ted_subtitle, ted_subtitle_merge = get_subtitle("https://www.youtube.com/watch?v=uXrCeiQxWyc")
# Check the subtitle length
print(len(ted_subtitle_merge))

19767


In [26]:
# Check the beginning of the subtitles
print(ted_subtitle_merge)

Value creation. Wealth creation. These are really powerful words. Maybe you think of finance,
you think of innovation, you think of creativity. But who are the value creators? If we use that word, we must be implying
that some people aren't creating value. Who are they? The couch potatoes? The value extractors? The value destroyers? To answer this question, we actually
have to have a proper theory of value. And I'm here as an economist
to break it to you that we've kind of lost our way
on this question. Now, don't look so surprised. What I mean by that is,
we've stopped contesting it. We've stopped actually asking
really tough questions about what is the difference between
value creation and value extraction, productive and unproductive activities. Now, let me just give you
some context here. 2009 was just about
a year and a half after one of the biggest
financial crises of our time, second only to the 1929 Great Depression, and the CEO of Goldman Sachs said Goldman Sachs workers are t

### (3) Golden Reviewer

In [31]:
# Pass the link of the YouTube video whose subtitles we want
gr_subtitle, gr_subtitle_merge = get_subtitle("https://www.youtube.com/watch?v=3A19a8jpcDw")
# Check the subtitle length
print(len(gr_subtitle_merge))

6735


In [32]:
# Check the beginning of the subtitles
print(gr_subtitle_merge)

hi guys welcome back to golden reviewer today i'm going to show you how to disable game optimization service on samsung smartphone first you search for this netgrad app in play store and install it then you open the app go to settings advanced options and select manage system apps because the apps we are going to block are all system apps then we search for the keyword game there are three apps we want to disable one is game optimization service another is game booster and game launcher as well for game booster plus and game plugins i think they they don't really matter but to be safe here i just uninstall them because they are not system apps so uh you can just uninstall them but for the three apps we are going to block their system apps so just click these two buttons here to block them from accessing the internet and then you press the button on the top left corner to make sure that netgard is enabled then you just accept all the promotes and until you see this key in your status ba

### (4) WION

In [51]:
# Pass the link of the YouTube video whose subtitles we want
wion_subtitle, wion_subtitle_merge = get_subtitle("https://www.youtube.com/watch?v=eT4LpF87aGk")
# Check the subtitle length
print(len(wion_subtitle_merge))

2178


In [52]:
# Check the beginning of the subtitles
print(wion_subtitle_merge)

the russia-ukraine war could be headed for a frozen conflict the situation on ground is evolving in such a manner that a frozen conflict is more likely the way forward to end the war experts say that humanitarian sees fires hinting towards that possibility as well but first let's tell you what a frozen conflict is a frozen conflict is a situation in which active armed conflict has ended but without a formal peace agreement or other political framework in place frozen conflict occurs in regions of a country no longer controlled by its central authorities these zones then remain under jurisdiction of the separators as a result states backing the separators run their puppet governments moreover the lack of non-violent solutions failed to permanently end conflict this form of conflict was unique to a few former soviet republics especially during the collapse of the soviet union in 1991. now russia is often accused of destabilizing its former soviet neighbors to keep them in its sphere of i

### Extracted subtitle variables per video
1.   yh_subtitle_merge - subtitle length : 3,491
2.   ted_subtitle_merge - subtitle length : 19,767
3. gr_subtitle_merge - subtitle length : 6,735
4. wion_subtitle_merge - subtitle length : 2,178

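## 2. Generate preprocessed subtitles
1.  Remove descriptions enclosed in parentheses or square brackets, if any   
 ex. [Music]
2. Remove filler words that occur in spoken language : no ready-made filler word list was available, so we collected frequently used filler words ourselves (through Google searches and similar) and built a list
3. Remove repeated words
4. Split sentences and add end punctuation (rpunct) : https://huggingface.co/felflare/bert-restore-punctuation?text=My+name+is+Clara+and+I+live+in+Berkeley%2C+California.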
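
Step 3 (removing repeated words) has no dedicated cell in the part of the notebook shown here, and the cleaned transcripts still contain runs such as "it's It's" and "we we". The following is one possible sketch of that step using a regular expression; `collapse_repeats` is a hypothetical helper, and the rule (collapse any immediately repeated word, case-insensitively) is an assumption about what "repeated words" should mean.

```python
import re

# A word (letters, digits, apostrophes) followed by one or more copies of itself.
REPEATED_WORD = re.compile(r"\b([\w']+)(?:\s+\1\b)+", flags=re.IGNORECASE)


def collapse_repeats(text):
    """Keep the first occurrence of each immediately repeated word."""
    return REPEATED_WORD.sub(r"\1", text)


if __name__ == "__main__":
    print(collapse_repeats("it's it's not easy and we we have"))
```

This only handles single-word stutters; repeated phrases such as "is the is the" are left alone, and legitimate doubles ("had had") would be collapsed too.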

In [7]:
# Assign the collected filler word list to the fillerwords variable
fillerwords = ["i mean", "basically", "you know", "umm", "um", "uh", "huh", "er", "eh", "ah", "like that", "just", "really", "somehow", "i guess", "i suppose", "like i said", "or something like that", "kind of", "sort of", "you see", "see what i mean", "yeah"]

In [8]:
fw_list = []
for i in fillerwords :
  fw_list.append(" " + i + " ")
fw_list = '|'.join(fw_list)
fw_list

' i mean | basically | you know | umm | um | uh | huh | er | eh | ah | like that | just | really | somehow | i guess | i suppose | like i said | or something like that | kind of | sort of | you see | see what i mean | yeah '
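
Because every alternative in the pattern above needs a space on both sides, a filler at the very start or end of the text is not matched, and neither is the second of two consecutive fillers (the first match already consumed the shared space). A hedged alternative is sketched below using word boundaries; the shortened `FILLERS` list and the helper names are illustrative assumptions, not the notebook's own code.

```python
import re

# Shortened list for illustration; the notebook's `fillerwords` could be passed instead.
FILLERS = ["i mean", "you know", "um", "uh", "sort of", "i guess", "just"]


def build_filler_pattern(fillers):
    """Compile one regex that matches any filler as a whole word or phrase."""
    # Longest phrases first so a multi-word filler wins over a shorter one.
    ordered = sorted(fillers, key=len, reverse=True)
    alternation = "|".join(re.escape(f) for f in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)


def remove_fillers(text, pattern):
    """Replace fillers with a space, then collapse whitespace runs."""
    without = pattern.sub(" ", text)
    return " ".join(without.split())


FILLER_PATTERN = build_filler_pattern(FILLERS)

if __name__ == "__main__":
    print(remove_fillers("um uh it's not sort of as easy as that i guess", FILLER_PATTERN))
```

The `\b` anchors keep "just" from matching inside "adjust", and `re.escape` guards against list entries that contain regex metacharacters.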

In [9]:
# Library for sentence segmentation and punctuation restoration (rpunct)
from rpunct import RestorePuncts

### (1) Yahoo

*   Before preprocessing : yh_subtitle_merge : 3,491
*   After preprocessing : yh_cleaned_text : 3,458

In [15]:
# Split the subtitles on spaces, drop every token that contains '[' or '(', and otherwise (an ordinary token) append it to the cleaned string

splited = yh_subtitle_merge.split()
yh_bracket_cleaned_subtitle = ""
for i in splited :
  if '[' in i or '(' in i :
    pass
  else :
    yh_bracket_cleaned_subtitle += (i + ' ')

len(yh_bracket_cleaned_subtitle)

# NOTE: this works when the () or [] holds a single word, but a description of two or more words separated by spaces is not removed properly.

3491
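
The limitation noted in the cell above (multi-word descriptions such as "(audience laughs)" are only partly removed) could be avoided by matching the whole bracketed span instead of individual tokens. A minimal sketch, assuming brackets are not nested; `remove_bracketed` is a hypothetical helper name.

```python
import re

# Either [ ... ] or ( ... ), with no closing bracket of the same kind inside.
BRACKETED = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def remove_bracketed(text):
    """Drop bracketed descriptions, including multi-word ones."""
    without = BRACKETED.sub(" ", text)
    # Collapse the whitespace left behind by the removed spans.
    return " ".join(without.split())


if __name__ == "__main__":
    print(remove_bracketed("hello [Music] world (audience laughs) again"))
```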

In [16]:
# The cleaned text is assigned to yh_fw_cleaned_test
yh_fw_cleaned_test = re.sub(fw_list, " ", yh_bracket_cleaned_subtitle)
yh_fw_cleaned_test

"as we watch the spike that we've been seeing in commodity prices we have heard a renewal of calls in the united states to drill more that the u.s should be energy independent our rick newman has been looking into that question and it's it's not as easy as that i think is the is the bottom line right rick there's a lot of confusion about this so i'm doing some reporting to try to bust some of these myths so people think the united states used to be energy independent so here's what that actually means when you combine all forms of energy that we produce in the united states that's oil but also natural gas also coal also renewables such as solar and wind yes we do consume more than we produce which you could say makes us independent but we also participate in global markets which means we export a lot of that we still import energy and in terms of oil we are not oil independent we have we still consume considerably more oil than we produce and we we have not been oil dependent since sin

In [17]:
# Use rpunct to split sentences and restore periods
rpunct = RestorePuncts()
yh_cleaned_text = rpunct.punctuate(yh_fw_cleaned_test)

# Check the text length and its beginning after preprocessing
print(len(yh_cleaned_text))
print(yh_cleaned_text[:1000])

3458
As we watch the spike that we've been seeing in commodity prices, we have heard a renewal of calls in the United States to drill more that the U.s should be energy independent. Our Rick Newman has been looking into that question and it's It's not as easy as that I think is the is the bottom line, right Rick? There's a lot of confusion about this, so I'm doing some reporting to try to bust some of these myths. So people think the United States used to be energy independent. So here's what that actually means when you combine all forms of energy that we produce in the United States, that's oil, but also natural gas. also coal. also renewables such as solar and wind. Yes, we do consume more than we produce, which you could say makes us independent. but we also participate in global markets, which means we export a lot of that. We still import energy, and in terms of oil, we are not oil independent. We have. We still consume considerably more oil than we produce, and we we have not be

### (2) TED

*   Before preprocessing : ted_subtitle_merge : 19,767
*   After preprocessing : ted_cleaned_text : 19,519

In [10]:
# Split the subtitles on spaces, drop every token that contains '[' or '(', and otherwise (an ordinary token) append it to the cleaned string

splited = ted_subtitle_merge.split()
ted_bracket_cleaned_subtitle = ""
for i in splited :
  if '[' in i or '(' in i :
    pass
  else :
    ted_bracket_cleaned_subtitle += (i + ' ')

len(ted_bracket_cleaned_subtitle)

19705

In [11]:
# The cleaned text is assigned to ted_fw_cleaned_test
ted_fw_cleaned_test = re.sub(fw_list, " ", ted_bracket_cleaned_subtitle)
ted_fw_cleaned_test

'Value creation. Wealth creation. These are powerful words. Maybe you think of finance, you think of innovation, you think of creativity. But who are the value creators? If we use that word, we must be implying that some people aren\'t creating value. Who are they? The couch potatoes? The value extractors? The value destroyers? To answer this question, we actually have to have a proper theory of value. And I\'m here as an economist to break it to you that we\'ve lost our way on this question. Now, don\'t look so surprised. What I mean by that is, we\'ve stopped contesting it. We\'ve stopped actually asking tough questions about what is the difference between value creation and value extraction, productive and unproductive activities. Now, let me give you some context here. 2009 was about a year and a half after one of the biggest financial crises of our time, second only to the 1929 Great Depression, and the CEO of Goldman Sachs said Goldman Sachs workers are the most productive in the

In [13]:
# Use rpunct to split sentences and restore periods
rpunct = RestorePuncts()
ted_cleaned_text = rpunct.punctuate(ted_fw_cleaned_test)

# Check the text length and its beginning after preprocessing
print(len(ted_cleaned_text))
print(ted_cleaned_text)

19519
Value Creation. Wealth Creation. These are powerful words. Maybe You think of finance, you think of innovation, you think of creativity. But Who are the value creators? If We use that word, we must be implying that some people aren't creating value. Who are they? The Couch potatoes? The Value extractors? The Value destroyers? To Answer this question, we actually have to have a proper theory of value. And I'm here as an economist to break it to you that we've lost our way on this question. Now, don't look so surprised. What I Mean by that is,, we've stopped contesting it. We've stopped actually asking tough questions about what is the difference between value creation and value extraction, productive and unproductive activities. Now, let me give you some context here. 2009 was about a year and a half after one of the biggest financial crises of our time, second only to the 1929 Great Depression, and the CEO of Goldman Sachs said Goldman Sachs Workers are the most productive in the

### (3) Golden Reviewer

*   Before preprocessing : gr_subtitle_merge : 6,735
*   After preprocessing : gr_cleaned_text : 6,756

In [33]:
# Split the subtitles on spaces, drop every token that contains '[' or '(', and otherwise (an ordinary token) append it to the cleaned string

splited = gr_subtitle_merge.split()
gr_bracket_cleaned_subtitle = ""
for i in splited :
  if '[' in i or '(' in i :
    pass
  else :
    gr_bracket_cleaned_subtitle += (i + ' ')

len(gr_bracket_cleaned_subtitle)

6735

In [34]:
# The cleaned text is assigned to gr_fw_cleaned_test
gr_fw_cleaned_test = re.sub(fw_list, " ", gr_bracket_cleaned_subtitle)
gr_fw_cleaned_test

"hi guys welcome back to golden reviewer today i'm going to show you how to disable game optimization service on samsung smartphone first you search for this netgrad app in play store and install it then you open the app go to settings advanced options and select manage system apps because the apps we are going to block are all system apps then we search for the keyword game there are three apps we want to disable one is game optimization service another is game booster and game launcher as well for game booster plus and game plugins i think they they don't matter but to be safe here i uninstall them because they are not system apps so you can uninstall them but for the three apps we are going to block their system apps so click these two buttons here to block them from accessing the internet and then you press the button on the top left corner to make sure that netgard is enabled then you accept all the promotes and until this key in your status bar that means it's running the next st

In [36]:
# Use rpunct to split sentences and restore periods
rpunct = RestorePuncts()
gr_cleaned_text = rpunct.punctuate(gr_fw_cleaned_test)

# Check the text length and its beginning after preprocessing
print(len(gr_cleaned_text))
print(gr_cleaned_text)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

### (4) WION

*   Before preprocessing : wion_subtitle_merge : 2,178
*   After preprocessing : wion_cleaned_text : 2,207

In [53]:
# Split the subtitles on spaces, drop every token that contains '[' or '(', and otherwise (an ordinary token) append it to the cleaned string

splited = wion_subtitle_merge.split()
wion_bracket_cleaned_subtitle = ""
for i in splited :
  if '[' in i or '(' in i :
    pass
  else :
    wion_bracket_cleaned_subtitle += (i + ' ')

len(wion_bracket_cleaned_subtitle)

2178

In [54]:
# The cleaned text is assigned to wion_fw_cleaned_test
wion_fw_cleaned_test = re.sub(fw_list, " ", wion_bracket_cleaned_subtitle)
wion_fw_cleaned_test

"the russia-ukraine war could be headed for a frozen conflict the situation on ground is evolving in such a manner that a frozen conflict is more likely the way forward to end the war experts say that humanitarian sees fires hinting towards that possibility as well but first let's tell you what a frozen conflict is a frozen conflict is a situation in which active armed conflict has ended but without a formal peace agreement or other political framework in place frozen conflict occurs in regions of a country no longer controlled by its central authorities these zones then remain under jurisdiction of the separators as a result states backing the separators run their puppet governments moreover the lack of non-violent solutions failed to permanently end conflict this form of conflict was unique to a few former soviet republics especially during the collapse of the soviet union in 1991. now russia is often accused of destabilizing its former soviet neighbors to keep them in its sphere of 

In [56]:
# Use rpunct to split sentences and restore periods
rpunct = RestorePuncts()
wion_cleaned_text = rpunct.punctuate(wion_fw_cleaned_test)

# Check the text length and its beginning after preprocessing
print(len(wion_cleaned_text))
print(wion_cleaned_text)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2207
The Russia-ukraine war could be headed for a frozen conflict. The situation on ground is evolving in such a manner that a frozen conflict is more likely the way forward to end the war. Experts say that humanitarian sees fires hinting towards that possibility as well. But first, let's tell you what a frozen conflict is: A frozen conflict is a situation in which active armed conflict has ended, but without a formal peace agreem

### Preprocessed variables per video
1.   yh_cleaned_text - subtitle length : 3,458
2.   ted_cleaned_text - subtitle length : 19,519
3. gr_cleaned_text - subtitle length : 6,756
4. wion_cleaned_text - subtitle length : 2,207

## 3. Summarization model tests
1.  "facebook/bart-large-cnn" model - abstractive summarization
2. "sshleifer/distilbart-cnn-12-6" model
3. "human-centered-summarization/financial-summarization-pegasus" model
4. "Bert-extractive-summarization" model

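### 1. Summary generation with the "facebook/bart-large-cnn" model
- This model only accepts limited-length input: its maximum sequence length is 1,024 tokens, so longer transcripts such as the TED talk (about 4,400 tokens) fail with an IndexError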

In [37]:
# Function that generates a summary with the "facebook/bart-large-cnn" model
def summarize(subtitle):
  summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
  summarized_subtitle = summarizer(subtitle, max_length=300, min_length=30, do_sample=False)

  return summarized_subtitle[0]["summary_text"]
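
`summarize` builds the pipeline on every call, so each timing later in the notebook also includes model loading. One way to separate the two costs is to cache the pipeline; the sketch below assumes the same `pipeline("summarization", ...)` call as above, and `get_summarizer` / `summarize_cached` are hypothetical helper names.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_summarizer(model_name="facebook/bart-large-cnn"):
    """Load the summarization pipeline once per model name and reuse it."""
    # Imported lazily so that merely defining this helper does not load transformers.
    from transformers import pipeline
    return pipeline("summarization", model=model_name)


def summarize_cached(subtitle, summarizer=None):
    """Same generation settings as `summarize`, but with a reusable pipeline."""
    if summarizer is None:
        summarizer = get_summarizer()
    result = summarizer(subtitle, max_length=300, min_length=30, do_sample=False)
    return result[0]["summary_text"]
```

Passing `summarizer` explicitly also makes it easy to swap in another model, or a stub when testing the surrounding code.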


### 2. "sshleifer/distilbart-cnn-12-6" model test

- Can be tested through the API: https://huggingface.co/sshleifer/distilbart-cnn-12-6

- Long document summarization 
  - Tutorial video https://www.youtube.com/watch?v=78KjuKYiF6s&t=228s
  - https://gist.github.com/saprativa/b5cb639e0c035876e0dd3c46e5a380fd


Testing a subtitle of 4,000 characters or fewer through the Hugging Face API gave a summary of mediocre quality:



'As we watch the spike that we've been seeing in commodity prices, we have heard a renewal of calls in the United States to drill more that the U.S. should be energy independent . Rick Newman has been looking into that question and it's not as easy as that I think is the is the bottom line .'

In [18]:
# Load the Model and Tokenizer
# import and initialize the tokenizer and model from the checkpoint

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

### 3. "human-centered-summarization/financial-summarization-pegasus" model test

- API test : https://huggingface.co/human-centered-summarization/financial-summarization-pegasus

- Result
'Oil, coal, solar and wind are all forms of energy in the United States.'

### 4. "Bert-extractive-summarization" model test

In [31]:
import time

#### (1) Yahoo

- "facebook/bart-large-cnn" model

In [32]:
# Time taken to summarize without sentence segmentation and punctuation restoration

start = time.time()

summarize(yh_subtitle_merge)

print(time.time()-start)

43.18373131752014
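
The cells in this section call `summarize` three times per text (once for timing, once for the length, once to print). A small timing helper lets one call serve all three purposes. This is a minimal sketch; `timed` is a hypothetical helper, and `time.perf_counter` is used instead of `time.time` because it is intended for measuring intervals.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # In the notebook this could be: summary, seconds = timed(summarize, yh_subtitle_merge)
    total, seconds = timed(sum, [1, 2, 3])
    print(total, seconds)
```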


In [33]:
# Length of the summary before preprocessing
print(len(summarize(yh_subtitle_merge)))

350


In [34]:
# Summary before preprocessing
print(summarize(yh_subtitle_merge))

Rick newman: There's a lot of confusion about this so i'm doing some reporting to try to bust some of these myths so people think the united states used to be energy independent. We are not oil independent we have we still uh consume considerably more oil than we produce and we have not been oil dependent since the early days of uh the oil economy.


In [35]:
# Time taken to summarize after sentence segmentation and punctuation restoration
start = time.time()

summarize(yh_cleaned_text)

print(time.time()-start)

33.42358994483948


In [36]:
# Length of the summary after preprocessing
print(len(summarize(yh_cleaned_text)))

315


In [37]:
# Summary after preprocessing
print(summarize(yh_cleaned_text))

The U.S. has not been oil dependent since since the early days of the oil economy, which goes all the way back to the 1860s. We have imported more oil than we have produced for something like the last 45 or 50 years, and that's probably going to continue. A lot of U.s oil actually gets exported to other countries.


- "sshleifer/distilbart-cnn-12-6" model

In [38]:
# Assign the document to summarize to the FileContent variable
yh_FileContent = yh_subtitle_merge
yh_FileContent
len(yh_FileContent)

3491

In [39]:
# Convert entire document to sentences using 'nltk' 
import nltk
nltk.download('punkt')
sentences = nltk.tokenize.sent_tokenize(yh_FileContent)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [40]:
# Create the chunks


# initialize

length = 0
chunk = ""
chunks = []
count = -1
for sentence in sentences:
  count += 1
  combined_length = len(tokenizer.tokenize(sentence)) + length # add the no. of sentence tokens to the length counter

  if combined_length  <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length = combined_length # update the length counter

    # if it is the last sentence
    if count == len(sentences) - 1:
      chunks.append(chunk.strip()) # save the chunk
    
  else: 
    chunks.append(chunk.strip()) # save the chunk
    
    # reset 
    length = 0 
    chunk = ""

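    # take care of the overflow sentence
    chunk += sentence + " "
    length = len(tokenizer.tokenize(sentence))

    # if the overflow sentence is also the last one, save it as its own chunk
    if count == len(sentences) - 1:
      chunks.append(chunk.strip())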
len(chunks)

1
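
The chunking loop above is repeated for every transcript, so it may be easier to maintain as a function. The sketch below is one possible refactoring, not the tutorial's original code: `chunk_sentences` is a hypothetical helper that takes the token counter as a parameter, so in the notebook it could be called with `lambda s: len(tokenizer.tokenize(s))` and `tokenizer.max_len_single_sentence`.

```python
def chunk_sentences(sentences, count_tokens, max_tokens):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens."""
    chunks = []
    current = []       # sentences in the chunk being built
    current_len = 0    # token count of the chunk being built
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    # Flush whatever is left, including a final overflow sentence.
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    # Whitespace word count stands in for a real tokenizer here.
    count_words = lambda s: len(s.split())
    print(chunk_sentences(["a b", "c d", "e f g"], count_words, 4))
```

A single sentence longer than `max_tokens` still becomes its own (over-length) chunk; handling that case would need a further split.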

In [41]:
# With special tokens added
sum([len(tokenizer(c).input_ids) for c in chunks])


741

In [42]:
len(tokenizer(yh_FileContent).input_ids)

742

In [43]:
# without special tokens added 
sum([len(tokenizer.tokenize(c)) for c in chunks])

739

In [44]:
len(tokenizer.tokenize(yh_FileContent))

740

In [47]:
# Get the inputs

start = time.time()

inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

# Output
for input in inputs:
  output = model.generate(**input)
  print(tokenizer.decode(*output, skip_special_tokens=True))

print(time.time()-start)

 As we watch the spike that we've been seeing in commodity prices we have heard a renewal of calls in the united states to drill more that the u.s should be energy independent. The united states have not been oil dependent since the 1860s so we have imported more oil than we have produced for something like the last 45 or 50 years.
18.068843364715576


In [49]:
# Assign the preprocessed document to summarize to the FileContent variable
yh_cleaned_FileContent = yh_cleaned_text
yh_cleaned_FileContent
len(yh_cleaned_FileContent)

3458

In [50]:
sentences = nltk.tokenize.sent_tokenize(yh_cleaned_FileContent)

In [51]:
# Create the chunks


# initialize

length = 0
chunk = ""
chunks = []
count = -1
for sentence in sentences:
  count += 1
  combined_length = len(tokenizer.tokenize(sentence)) + length # add the no. of sentence tokens to the length counter

  if combined_length  <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length = combined_length # update the length counter

    # if it is the last sentence
    if count == len(sentences) - 1:
      chunks.append(chunk.strip()) # save the chunk
    
  else: 
    chunks.append(chunk.strip()) # save the chunk
    
    # reset 
    length = 0 
    chunk = ""

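    # take care of the overflow sentence
    chunk += sentence + " "
    length = len(tokenizer.tokenize(sentence))

    # if the overflow sentence is also the last one, save it as its own chunk
    if count == len(sentences) - 1:
      chunks.append(chunk.strip())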
len(chunks)

1

In [52]:
# With special tokens added
sum([len(tokenizer(c).input_ids) for c in chunks])


775

In [53]:
len(tokenizer.tokenize(yh_cleaned_FileContent))

773

In [54]:
# Get the inputs

start = time.time()

inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

# Output
for input in inputs:
  output = model.generate(**input)
  print(tokenizer.decode(*output, skip_special_tokens=True))

print(time.time()-start)

 Rick Newman has been looking into that question and it's not as easy as that I think is the is the bottom line, right Rick? There's a lot of confusion about this, so I'm doing some reporting to try to bust some of these myths. We consume more than we produce, which you could say makes us independent, but we also participate in global markets. We still import energy, and in terms of oil, we are not oil independent.
19.027642250061035


- "human-centered-summarization/financial-summarization-pegasus" model

In [59]:
print(yh_subtitle_merge)

as we watch the spike that we've been seeing in commodity prices we have heard a renewal of calls in the united states to drill more that the u.s should be energy independent our rick newman has been looking into that question um and it's it's not sort of as easy as that i think is the is the bottom line right rick there's a lot of confusion about this so i'm doing some reporting to try to bust some of these myths so people think the united states used to be energy independent so here's what that actually means when you combine all forms of energy that we produce in the united states that's oil but also natural gas also coal also renewables such as solar and wind yes we do consume more than we produce which i guess you could say makes us independent but we also participate in global markets which means we export a lot of that we still import energy and in terms of just oil we are not oil independent we have we still uh consume considerably more oil than we produce and we we have not be

In [60]:
print(yh_cleaned_text)

As we watch the spike that we've been seeing in commodity prices, we have heard a renewal of calls in the United States to drill more that the U.s should be energy independent. Our Rick Newman has been looking into that question and it's It's not as easy as that I think is the is the bottom line, right Rick? There's a lot of confusion about this, so I'm doing some reporting to try to bust some of these myths. So people think the United States used to be energy independent. So here's what that actually means when you combine all forms of energy that we produce in the United States, that's oil, but also natural gas. also coal. also renewables such as solar and wind. Yes, we do consume more than we produce, which you could say makes us independent. but we also participate in global markets, which means we export a lot of that. We still import energy, and in terms of oil, we are not oil independent. We have. We still consume considerably more oil than we produce, and we we have not been oi

- "Bert-extractive-summarization" model

#### (2) TED

- "facebook/bart-large-cnn" model

In [61]:
# Time taken to summarize without sentence segmentation and punctuation restoration

start = time.time()

summarize(ted_subtitle_merge)

print(time.time()-start)

Token indices sequence length is longer than the specified maximum sequence length for this model (4415 > 1024). Running this sequence through the model will result in indexing errors


IndexError: ignored
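
The error above comes from passing about 4,400 tokens to a model whose limit is 1,024. A guard that checks the token count first and falls back to piecewise summarization would avoid it. The following is only a sketch under stated assumptions: `summarize_any_length` and all of its parameters are hypothetical, and the callables would be supplied by the notebook (for example `summarize` as `summarize_fn`, a tokenizer-based counter as `count_tokens`, and a sentence-based chunker as `split_fn`).

```python
def summarize_any_length(text, summarize_fn, count_tokens, split_fn, max_tokens=1024):
    """Summarize directly when the text fits the model, otherwise piece by piece."""
    if count_tokens(text) <= max_tokens:
        return summarize_fn(text)
    # Too long for one pass: summarize each piece and join the partial summaries.
    pieces = split_fn(text)
    return " ".join(summarize_fn(piece) for piece in pieces)


if __name__ == "__main__":
    # Stand-ins so the sketch runs without a model.
    count_words = lambda s: len(s.split())
    halves = lambda s: [" ".join(s.split()[:2]), " ".join(s.split()[2:])]
    print(summarize_any_length("a b c d", str.upper, count_words, halves, max_tokens=2))
```

Joined partial summaries can read as disconnected; running a second summarization pass over the joined text is a common follow-up, though it is not tested in this notebook.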

In [None]:
# Length of the summary before preprocessing
print(len(summarize(ted_subtitle_merge)))

In [None]:
# Summary before preprocessing
print(summarize(ted_subtitle_merge))

In [None]:
# Time taken to summarize after sentence segmentation and punctuation restoration
start = time.time()

summarize(ted_cleaned_text)

print(time.time()-start)

In [None]:
# Length of the summary after preprocessing
print(len(summarize(ted_cleaned_text)))

In [None]:
# Summary after preprocessing
print(summarize(ted_cleaned_text))

- "sshleifer/distilbart-cnn-12-6" model

In [14]:
# Assign the document to summarize to the FileContent variable
ted_FileContent = ted_subtitle_merge
ted_FileContent
len(ted_FileContent)

19767

In [16]:
# Convert entire document to sentences using 'nltk' 
import nltk
nltk.download('punkt')
sentences = nltk.tokenize.sent_tokenize(ted_FileContent)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [19]:
# Create the chunks


# initialize

length = 0
chunk = ""
chunks = []
count = -1
for sentence in sentences:
  count += 1
  combined_length = len(tokenizer.tokenize(sentence)) + length # add the no. of sentence tokens to the length counter

  if combined_length  <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length = combined_length # update the length counter

    # if it is the last sentence
    if count == len(sentences) - 1:
      chunks.append(chunk.strip()) # save the chunk
    
  else: 
    chunks.append(chunk.strip()) # save the chunk
    
    # reset 
    length = 0 
    chunk = ""

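    # take care of the overflow sentence
    chunk += sentence + " "
    length = len(tokenizer.tokenize(sentence))

    # if the overflow sentence is also the last one, save it as its own chunk
    if count == len(sentences) - 1:
      chunks.append(chunk.strip())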
len(chunks)

5

In [20]:
# With special tokens added
sum([len(tokenizer(c).input_ids) for c in chunks])

4422

In [21]:
len(tokenizer.tokenize(ted_FileContent))

Token indices sequence length is longer than the specified maximum sequence length for this model (4413 > 1024). Running this sequence through the model will result in indexing errors


4413

In [23]:
# Get the inputs
import time
start = time.time()

inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

# Output
for input in inputs:
  output = model.generate(**input)
  print(tokenizer.decode(*output, skip_special_tokens=True))

print(time.time()-start)

 Goldman Sachs CEO said Goldman Sachs workers are the most productive in the world. Peter Bergen: We've stopped asking tough questions about value creation and value extraction. He says the term "wealth creation" and "value" have become weak and lazy. Bergen says it's important to know what is the difference between productive and unproductive activities.
 Adam Smith, David Ricardo, Karl Marx and others asked the question "What is value?" They had a labor theory of value, but again, their focus was reproduction. Adam Smith had this really great example of the pin factory where he said if you only have one person making every bit of the. But if you actually invest in factory production and the division of labor, new thinking.
 Up until 1970, most of the financial sector was not even included in GDP. The U.N. called it the "banking problem" because it was seen as just kind of moving stuff around, not actually producing anything new. In the UK, between 10 and 20 percent of finance finds i

In [24]:
# Assign the preprocessed document to summarize to the FileContent variable
ted_cleaned_FileContent = ted_cleaned_text
ted_cleaned_FileContent
len(ted_cleaned_FileContent)

19519

In [25]:
sentences = nltk.tokenize.sent_tokenize(ted_cleaned_FileContent)

In [27]:
# Create the chunks


# initialize

length = 0
chunk = ""
chunks = []
count = -1
for sentence in sentences:
  count += 1
  combined_length = len(tokenizer.tokenize(sentence)) + length # add the no. of sentence tokens to the length counter

  if combined_length  <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length = combined_length # update the length counter

    # if it is the last sentence
    if count == len(sentences) - 1:
      chunks.append(chunk.strip()) # save the chunk
    
  else: 
    chunks.append(chunk.strip()) # save the chunk
    
    # reset 
    length = 0 
    chunk = ""

    # take care of the overflow sentence
    chunk += sentence + " "
    length = len(tokenizer.tokenize(sentence))
len(chunks)

5

In [28]:
# With special tokens added
sum([len(tokenizer(c).input_ids) for c in chunks])

4096

In [29]:
len(tokenizer.tokenize(ted_cleaned_FileContent))

4085

In [30]:
# Get the inputs

start = time.time()

inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

# Output
for batch in inputs:  # avoid shadowing the built-in input()
  output = model.generate(**batch)
  print(tokenizer.decode(output[0], skip_special_tokens=True))

print(time.time()-start)

 Goldman Sachs CEO said Goldman Sachs workers are the most productive in the world. Peter Bergen: We've stopped asking tough questions about what is the difference between value creation and value extraction, productive and unproductive activities. He argues that we have to have a proper theory of value to answer this question.
 Adam Smith argued that industrial labor was the source of the value that was getting siphoned out of the economy. Adam Smith showed that 10 specialized workers who had been invested in, in their human capital, could produce 4,800 pins a day, as opposed to one by an unspecialized worker. The big revolution that happened with the current system of economic thinking that we have, which is called "neoclassical Economics," was that the logic completely changed.
 Up until 1970, most of the financial sector was not even included in GDP. Instead of pausing and asking, "What is it actually doing?" -- that was a missed opportunity. This real focus on prices and also shar

(3) Golden Reviewer

In [38]:
# Time taken to summarize without sentence segmentation and punctuation restoration

start = time.time()

summarize(gr_subtitle_merge)

print(time.time()-start)

Token indices sequence length is longer than the specified maximum sequence length for this model (1370 > 1024). Running this sequence through the model will result in indexing errors


IndexError: ignored
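The warning above reports 1370 tokens against a limit of 1024, and the call then fails with an `IndexError`, so the cells that follow have no output. Because the raw transcript has no sentence boundaries to chunk on, one option is to cut the token sequence into fixed-size windows. The sketch below only illustrates that idea on a plain list of integers: `split_into_windows` is a hypothetical helper, and the window size of 1022 assumes two positions are reserved for special tokens.

```python
from typing import List, Sequence


def split_into_windows(token_ids: Sequence[int], max_len: int) -> List[List[int]]:
    """Split a token id sequence into consecutive windows of at most max_len ids."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(token_ids[i:i + max_len])
            for i in range(0, len(token_ids), max_len)]


# Stand-in for the 1370 token ids reported in the warning (assumption).
ids = list(range(1370))
windows = split_into_windows(ids, max_len=1022)
print([len(w) for w in windows])  # [1022, 348]
```

Fixed windows can cut a sentence in half, so the summaries of neighbouring windows may read less smoothly than summaries of sentence-aligned chunks.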

In [None]:
# Length of the summary before preprocessing
print(len(summarize(gr_subtitle_merge)))

In [None]:
# Summary before preprocessing
print(summarize(gr_subtitle_merge))

In [None]:
# Time taken to summarize after sentence segmentation and punctuation restoration
start = time.time()

summarize(gr_cleaned_text)

print(time.time()-start)

In [None]:
# Length of the summary after preprocessing
print(len(summarize(gr_cleaned_text)))

In [None]:
# Summary after preprocessing
print(summarize(gr_cleaned_text))

- "sshleifer/distilbart-cnn-12-6" model

In [39]:
# Assign the document to summarize to the FileContent variable
gr_FileContent = gr_subtitle_merge
gr_FileContent
len(gr_FileContent)

6735

In [40]:
sentences = nltk.tokenize.sent_tokenize(gr_FileContent)

In [41]:
# Create the chunks


# initialize

length = 0
chunk = ""
chunks = []
count = -1
for sentence in sentences:
  count += 1
  combined_length = len(tokenizer.tokenize(sentence)) + length # add the no. of sentence tokens to the length counter

  if combined_length  <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length = combined_length # update the length counter

    # if it is the last sentence
    if count == len(sentences) - 1:
      chunks.append(chunk.strip()) # save the chunk
    
  else: 
    chunks.append(chunk.strip()) # save the chunk
    
    # reset 
    length = 0 
    chunk = ""

    # take care of the overflow sentence
    chunk += sentence + " "
    length = len(tokenizer.tokenize(sentence))
len(chunks)

1

In [42]:
# With special tokens added
sum([len(tokenizer(c).input_ids) for c in chunks])

2

In [43]:
len(tokenizer.tokenize(gr_FileContent))

1368

In [44]:
# Get the inputs

start = time.time()

inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

# Output
for batch in inputs:  # avoid shadowing the built-in input()
  output = model.generate(**batch)
  print(tokenizer.decode(output[0], skip_special_tokens=True))

print(time.time()-start)

 CNN.com will feature iReporter photos in a weekly Travel Snapshots gallery. Please submit your best shots for next week's gallery of snapshots of places you want to visit. Visit CNN iReport.com/Travel next Wednesday for a new gallery of shots of travel next week.
9.549052715301514
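The summary above has nothing to do with the video. The counts reported earlier (1 chunk, 2 tokens with special tokens, 1368 tokens in the text) are consistent with the unpunctuated transcript being treated as one sentence that exceeds the limit: the loop saves the still-empty chunk, moves the sentence into a new chunk, and then ends without saving it, so the model receives an empty string. The sketch below reproduces that control flow with a whitespace token counter standing in for the real tokenizer; `original_loop` and `count_words` are hypothetical names used only for this illustration.

```python
from typing import Callable, List


def original_loop(sentences: List[str],
                  count_tokens: Callable[[str], int],
                  max_tokens: int) -> List[str]:
    """Same control flow as the chunking cell above, wrapped in a function."""
    length, chunk, chunks = 0, "", []
    for count, sentence in enumerate(sentences):
        combined_length = count_tokens(sentence) + length
        if combined_length <= max_tokens:
            chunk += sentence + " "
            length = combined_length
            if count == len(sentences) - 1:   # last sentence: save the chunk
                chunks.append(chunk.strip())
        else:
            chunks.append(chunk.strip())      # may save an empty chunk
            chunk = sentence + " "            # overflow sentence starts a new chunk
            length = count_tokens(sentence)   # ...which is never saved if it was the last one
    return chunks


def count_words(text: str) -> int:
    return len(text.split())


# One "sentence" of 1368 tokens against a limit of 1022 (stand-in values).
print(original_loop(["word " * 1368], count_words, max_tokens=1022))  # ['']
```

Appending the pending chunk once after the loop and skipping empty chunks would avoid this case, although a single sentence longer than the limit would still need to be split or truncated before it reaches the model.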


In [45]:
# Assign the preprocessed document to summarize to the FileContent variable
gr_cleaned_FileContent = gr_cleaned_text
gr_cleaned_FileContent
len(gr_cleaned_FileContent)

6756

In [46]:
sentences = nltk.tokenize.sent_tokenize(gr_cleaned_FileContent)

In [47]:
# Create the chunks


# initialize

length = 0
chunk = ""
chunks = []
count = -1
for sentence in sentences:
  count += 1
  combined_length = len(tokenizer.tokenize(sentence)) + length # add the no. of sentence tokens to the length counter

  if combined_length  <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length = combined_length # update the length counter

    # if it is the last sentence
    if count == len(sentences) - 1:
      chunks.append(chunk.strip()) # save the chunk
    
  else: 
    chunks.append(chunk.strip()) # save the chunk
    
    # reset 
    length = 0 
    chunk = ""

    # take care of the overflow sentence
    chunk += sentence + " "
    length = len(tokenizer.tokenize(sentence))
len(chunks)

2

In [48]:
# With special tokens added
sum([len(tokenizer(c).input_ids) for c in chunks])

1470

In [49]:
len(tokenizer.tokenize(gr_cleaned_FileContent))

1466

In [50]:
# Get the inputs

start = time.time()

inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

# Output
for batch in inputs:  # avoid shadowing the built-in input()
  output = model.generate(**batch)
  print(tokenizer.decode(output[0], skip_special_tokens=True))

print(time.time()-start)

 The game Optimization service actually download a full list of app signatures from Samsung server when it's running in the background. So we block them from accessing the internet so they cannot download the app name list and then we clear the app data to remove any already downloaded app names. Finally turn off your Wi-fi or data and ending and restart your phone.
 If we have Gos on, the power is restricted to a very low level, even from the beginning, so we don't see any change in power consumption. With Gos Off, we get more than 10 Fps more and the game indeed run much smoother. We see an overall more than 50 percent power consumption increase, especially for the first two minutes when the device has not started to throttle.
33.93785858154297


(4) WION

In [57]:
# Time taken to summarize without sentence segmentation and punctuation restoration

start = time.time()

summarize(wion_subtitle_merge)

print(time.time()-start)

27.419185161590576


In [59]:
# Length of the summary before preprocessing
print(len(summarize(wion_subtitle_merge)))

307


In [60]:
# Summary before preprocessing
print(summarize(wion_subtitle_merge))

The russia-ukraine war could be headed for a frozen conflict. A frozen conflict is a situation in which active armed conflict has ended but without a formal peace agreement or other political framework in place. frozen conflict occurs in regions of a country no longer controlled by its central authorities.


In [61]:
# Time taken to summarize after sentence segmentation and punctuation restoration
start = time.time()

summarize(wion_cleaned_text)

print(time.time()-start)

27.26302719116211


In [62]:
# Length of the summary after preprocessing
print(len(summarize(wion_cleaned_text)))

365


In [63]:
# Summary after preprocessing
print(summarize(wion_cleaned_text))

A frozen conflict is a situation in which active armed conflict has ended, but without a formal peace agreement or other political framework in place. Frozen conflict occurs in regions of a country no longer controlled by its central authorities. This form of conflict was unique to a few former Soviet republics, especially during the collapse of the Soviet Union.
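Each text above is summarized three times: once for the timing, once for the length, and once for the printed summary, at roughly 27 seconds per call. A small helper that returns both the result and the elapsed time lets one call serve all three purposes. This is a minimal sketch: `timed` is a hypothetical helper, and `fake_summarize` is only a stand-in so that the example runs without a model.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for the notebook's summarize() (assumption): returns the first 20 characters.
def fake_summarize(text: str) -> str:
    return text[:20]


summary, seconds = timed(fake_summarize, "A frozen conflict is a situation in which ...")
print(f"{seconds:.3f} s")  # elapsed time
print(len(summary))        # summary length
print(summary)             # summary text
```

In this notebook the call would look like `summary, seconds = timed(summarize, wion_cleaned_text)`, after which the length and the text can be printed without running the model again.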


- "sshleifer/distilbart-cnn-12-6" model

In [64]:
# Assign the document to summarize to the FileContent variable
wion_FileContent = wion_subtitle_merge
wion_FileContent
len(wion_FileContent)

2178

In [65]:
sentences = nltk.tokenize.sent_tokenize(wion_FileContent)

In [66]:
# Create the chunks


# initialize

length = 0
chunk = ""
chunks = []
count = -1
for sentence in sentences:
  count += 1
  combined_length = len(tokenizer.tokenize(sentence)) + length # add the no. of sentence tokens to the length counter

  if combined_length  <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length = combined_length # update the length counter

    # if it is the last sentence
    if count == len(sentences) - 1:
      chunks.append(chunk.strip()) # save the chunk
    
  else: 
    chunks.append(chunk.strip()) # save the chunk
    
    # reset 
    length = 0 
    chunk = ""

    # take care of the overflow sentence
    chunk += sentence + " "
    length = len(tokenizer.tokenize(sentence))
len(chunks)

1

In [67]:
# With special tokens added
sum([len(tokenizer(c).input_ids) for c in chunks])

452

In [68]:
len(tokenizer.tokenize(wion_FileContent))

451

In [69]:
# Get the inputs

start = time.time()

inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

# Output
for batch in inputs:  # avoid shadowing the built-in input()
  output = model.generate(**batch)
  print(tokenizer.decode(output[0], skip_special_tokens=True))

print(time.time()-start)

 The russia-ukraine war could be headed for a frozen conflict. A frozen conflict is a situation in which active armed conflict has ended but without a formal peace agreement or other political framework in place frozen conflict occurs in regions of a country no longer controlled by its central authorities. These zones then remain under jurisdiction of the separators.
14.656025409698486


In [70]:
# Assign the preprocessed document to summarize to the FileContent variable
wion_cleaned_FileContent = wion_cleaned_text
wion_cleaned_FileContent
len(wion_cleaned_FileContent)

2207

In [71]:
sentences = nltk.tokenize.sent_tokenize(wion_cleaned_FileContent)

In [72]:
# Create the chunks


# initialize

length = 0
chunk = ""
chunks = []
count = -1
for sentence in sentences:
  count += 1
  combined_length = len(tokenizer.tokenize(sentence)) + length # add the no. of sentence tokens to the length counter

  if combined_length  <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length = combined_length # update the length counter

    # if it is the last sentence
    if count == len(sentences) - 1:
      chunks.append(chunk.strip()) # save the chunk
    
  else: 
    chunks.append(chunk.strip()) # save the chunk
    
    # reset 
    length = 0 
    chunk = ""

    # take care of the overflow sentence
    chunk += sentence + " "
    length = len(tokenizer.tokenize(sentence))
len(chunks)

1

In [73]:
# With special tokens added
sum([len(tokenizer(c).input_ids) for c in chunks])

442

In [74]:
len(tokenizer.tokenize(wion_cleaned_FileContent))

440

In [75]:
# Get the inputs

start = time.time()

inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

# Output
for batch in inputs:  # avoid shadowing the built-in input()
  output = model.generate(**batch)
  print(tokenizer.decode(output[0], skip_special_tokens=True))

print(time.time()-start)

 A frozen conflict is a situation in which active armed conflict has ended, but without a formal peace agreement or other political framework in place. This form of conflict was unique to a few former Soviet republics, especially during the collapse of the Soviet Union in 1991.. now, Russia is often accused of destabilizing its former Soviet neighbors to keep them in its sphere of influence.
12.231482744216919
