# Using the facebook/bart-large-cnn model and KeyBERT to summarize text and extract keywords

### 1. Import the necessary dependencies and instantiate the summarizer object.

In [2]:
import pandas as pd
import json
from nltk import tokenize
from transformers import pipeline
from keybert import KeyBERT

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

### 2. Import the CSV file, concatenate the text column into a single string, then tokenize the text into sentences.

After loading the CSV file into a dataframe, I concatenate the text column into one string. With the text in that format, I use 'sent_tokenize' to split it into sentences.

In [3]:
transcript_df = pd.read_csv('transcript.csv')
transcript_df.columns = ['person_name','text']
text = transcript_df.text.str.cat(sep='')
sentences = tokenize.sent_tokenize(text)
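The separator used when concatenating rows affects where sentence boundaries can be detected afterwards. The following is a minimal sketch of that effect, assuming two made-up transcript rows and a simplified regex splitter (`naive_sentences`, a hypothetical helper that is not nltk's algorithm). In pandas, the corresponding change would be passing a different `sep` to `str.cat`.

```python
import re

# Hypothetical transcript rows standing in for the 'text' column.
rows = [
    "It is present in any kind of devices nowadays, starting from.",
    "Transportation like planes, cars, uh space shuttles.",
]

def naive_sentences(text):
    # Simplified stand-in for a sentence tokenizer: split after '.', '!' or '?'
    # only when whitespace follows. This is NOT nltk's algorithm.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

fused = naive_sentences("".join(rows))    # no separator between rows
spaced = naive_sentences(" ".join(rows))  # one space between rows

print(len(fused), fused)
print(len(spaced), spaced)
```

With an empty separator the two rows fuse into "from.Transportation", so this splitter sees a single sentence; with a space it sees two.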

### 3. Divide the text into chunks of 10 sentences each and see how many chunks there are.

In [4]:
chunks = [sentences[x:x+10] for x in range(0, len(sentences), 10)]

len(chunks)


17
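The slicing step can be wrapped in a small helper to make the chunk count easy to check. This is a sketch only: `chunk_sentences` is a hypothetical name, and the 163 placeholder sentences are an assumption, since the real sentence count is not shown above (any count from 161 to 170 yields 17 chunks of size 10).

```python
import math

def chunk_sentences(sentences, size=10):
    """Split a list into consecutive chunks of at most `size` items."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

# Hypothetical input: 163 placeholder sentences.
demo_sentences = [f"Sentence {n}." for n in range(163)]
demo_chunks = chunk_sentences(demo_sentences, size=10)

# The number of chunks is the sentence count divided by the size, rounded up.
assert len(demo_chunks) == math.ceil(len(demo_sentences) / 10)

print(len(demo_chunks))      # number of chunks
print(len(demo_chunks[-1]))  # size of the final, shorter chunk
```

The last chunk is usually shorter than the others, which is worth keeping in mind when a minimum summary length is requested.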

### 4. Loop through the chunks and summarize them.

In [None]:
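# Iterate over every chunk instead of a hard-coded count, so the last chunk is summarized too
for i in range(len(chunks)):
    # Join the chunk's sentences into one string so the pipeline returns a single summary per chunk
    chunks[i] = summarizer(" ".join(chunks[i]), max_length=130, min_length=30)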
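The loop can also be written as a function that accepts any summarizer callable, which makes the logic testable without downloading the model. This is a sketch under assumptions: `summarize_chunks` and `fake_summarizer` are hypothetical names, and the fake only mimics the return shape seen in the output below (a list holding one dict with a 'summary_text' key).

```python
def summarize_chunks(chunks, summarizer, max_length=130, min_length=30):
    """Return one summary string per chunk of sentences."""
    summaries = []
    for chunk in chunks:
        text = " ".join(chunk)  # one input string per chunk
        result = summarizer(text, max_length=max_length, min_length=min_length)
        summaries.append(result[0]["summary_text"])
    return summaries

def fake_summarizer(text, max_length, min_length):
    # Hypothetical stand-in for the real pipeline: keeps only the first sentence
    # and returns it in the same list-of-dict shape.
    first = text.split(". ")[0].rstrip(".") + "."
    return [{"summary_text": first}]

demo_chunks = [
    ["Software is everywhere.", "It runs in cars."],
    ["Tools differ in detail.", "Demos reveal that."],
]
demo_summaries = summarize_chunks(demo_chunks, fake_summarizer)
print(demo_summaries)
```

Iterating over the chunks directly avoids depending on a fixed chunk count, and returning plain strings leaves the original `chunks` list untouched.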

In [21]:
chunks

[[{'summary_text': "We are surrounded by software. It is present in any kind of devices nowadays, starting from.Transportation like planes, cars, uh space shuttles. It's everywhere and it's connected and. it's getting more and more involved with with us humans."}],
 [{'summary_text': "Umm. So it's the.Tool tips, announcements, walkthroughs.Any kind of self service functionality guidance through. And of course there comes many benefits with such platforms or such tools or.Augmenting augmentation of the existing tools we have or or customer services."}],
 [{'summary_text': 'The market and the digital option solutions is growing rapidly and companies are investing heavily to make sure that the products and services are. are usable. If atea wants to be ahead of the curve and be market leader, we need to know what customers want.'}],
 [{'summary_text': 'Most of the tools, if not all of them meet the the all basic business requirements. But they, as they say, the devil is always in the detai

### 5. Loop through the summarized chunks, extract the summary text from each, and append it to a new list. Then join all the strings into one unified summary.

In [23]:
# Each summarized chunk is a list with one dict; keep only its 'summary_text' value
summary_list = [chunk[0]['summary_text'] for chunk in chunks]

In [24]:
summary_list

["We are surrounded by software. It is present in any kind of devices nowadays, starting from.Transportation like planes, cars, uh space shuttles. It's everywhere and it's connected and. it's getting more and more involved with with us humans.",
 "Umm. So it's the.Tool tips, announcements, walkthroughs.Any kind of self service functionality guidance through. And of course there comes many benefits with such platforms or such tools or.Augmenting augmentation of the existing tools we have or or customer services.",
 'The market and the digital option solutions is growing rapidly and companies are investing heavily to make sure that the products and services are. are usable. If atea wants to be ahead of the curve and be market leader, we need to know what customers want.',
 'Most of the tools, if not all of them meet the the all basic business requirements. But they, as they say, the devil is always in the detail and only in demos we could really see the true nature of those platforms. So

In [25]:
summarized = ''.join(summary_list)

In [26]:
summarized

"We are surrounded by software. It is present in any kind of devices nowadays, starting from.Transportation like planes, cars, uh space shuttles. It's everywhere and it's connected and. it's getting more and more involved with with us humans.Umm. So it's the.Tool tips, announcements, walkthroughs.Any kind of self service functionality guidance through. And of course there comes many benefits with such platforms or such tools or.Augmenting augmentation of the existing tools we have or or customer services.The market and the digital option solutions is growing rapidly and companies are investing heavily to make sure that the products and services are. are usable. If atea wants to be ahead of the curve and be market leader, we need to know what customers want.Most of the tools, if not all of them meet the the all basic business requirements. But they, as they say, the devil is always in the detail and only in demos we could really see the true nature of those platforms. So what is our rec
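In the joined output above, consecutive summaries run together ("humans.Umm."). A small sketch of how the join separator changes that, assuming two shortened summary strings made up for illustration:

```python
# Hypothetical, shortened summaries standing in for summary_list.
parts = ["more involved with us humans.", "Umm. So it's the tool tips."]

no_sep = "".join(parts)       # summaries touch each other
with_space = " ".join(parts)  # one space between summaries

print(no_sep)
print(with_space)
```

Whether a space, a newline, or no separator is preferable depends on how the combined text is consumed downstream.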

### 6. Instantiate the KeyBERT model and extract the top 10 keywords in the text.

In [144]:
model = KeyBERT(model="distilbert-base-nli-mean-tokens")

2022-06-18 23:27:16,795 : INFO : Load pretrained SentenceTransformer: distilbert-base-nli-mean-tokens
2022-06-18 23:27:24,818 : INFO : Use pytorch device: cpu


In [172]:
%%time
out = model.extract_keywords(
    summarized,
    top_n=10,
    keyphrase_ngram_range=(1, 1),
    stop_words="english",
)

CPU times: user 7.07 s, sys: 1.02 s, total: 8.09 s
Wall time: 1.2 s


In [173]:
out

[('javascript', 0.3616),
 ('web', 0.3357),
 ('rapidly', 0.331),
 ('impressed', 0.3148),
 ('growing', 0.3116),
 ('augmentation', 0.3099),
 ('speed', 0.3088),
 ('space', 0.3049),
 ('cars', 0.3008),
 ('shuttles', 0.2977)]
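Each result is a (keyword, score) tuple, so the list can be filtered by score before it is used as topics. A minimal sketch, using the scores printed above; `filter_keywords` and the 0.31 threshold are assumptions chosen for illustration, not part of KeyBERT.

```python
# The (keyword, score) pairs shown in the output above.
keywords = [
    ('javascript', 0.3616), ('web', 0.3357), ('rapidly', 0.331),
    ('impressed', 0.3148), ('growing', 0.3116), ('augmentation', 0.3099),
    ('speed', 0.3088), ('space', 0.3049), ('cars', 0.3008), ('shuttles', 0.2977),
]

def filter_keywords(pairs, min_score):
    """Keep the keywords whose score is at least `min_score`."""
    return [word for word, score in pairs if score >= min_score]

strong = filter_keywords(keywords, min_score=0.31)
print(strong)
```

The scores here are close together, so any cut-off is a judgment call rather than a clear separation between relevant and irrelevant terms.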

### 7. Combine the topics and summarized chunks into one JSON-serializable dictionary, then create two separate dataframes for the topics and for the summarized chunks.

In [192]:
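# Keep only the keyword from each (keyword, score) tuple
topics = [keyword for keyword, _score in out]

# Build the dictionary once, after all topics are collected
output = {
    "topics": topics,
    "summarized_rows": summary_list
}
output_json = json.dumps(output)  # serialized JSON string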

In [193]:
output

{'topics': ['javascript',
  'web',
  'rapidly',
  'impressed',
  'growing',
  'augmentation',
  'speed',
  'space',
  'cars',
  'shuttles'],
 'summarized_rows': ["We are surrounded by software. It is present in any kind of devices nowadays, starting from.Transportation like planes, cars, uh space shuttles. It's everywhere and it's connected and. it's getting more and more involved with with us humans.",
  "Umm. So it's the.Tool tips, announcements, walkthroughs.Any kind of self service functionality guidance through. And of course there comes many benefits with such platforms or such tools or.Augmenting augmentation of the existing tools we have or or customer services.",
  'The market and the digital option solutions is growing rapidly and companies are investing heavily to make sure that the products and services are. are usable. If atea wants to be ahead of the curve and be market leader, we need to know what customers want.',
  'Most of the tools, if not all of them meet the the al
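To confirm that the dictionary survives serialization, it can be dumped to a JSON string and loaded back. A short sketch with a cut-down, hypothetical version of `output`:

```python
import json

# Hypothetical, shortened version of the dictionary built above.
demo_output = {
    "topics": ["javascript", "web"],
    "summarized_rows": ["We are surrounded by software."],
}

payload = json.dumps(demo_output, indent=2)  # JSON text, e.g. for writing to a file
restored = json.loads(payload)               # back to a Python dict

print(payload)
```

`json.dumps` returns a string and does not modify its argument, so its result has to be assigned or written somewhere to be of use.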

In [215]:
topics = pd.DataFrame(output['topics'])
topics.rename(columns = {0:'topics'}, inplace=True)
topics

| | topics |
|---|---|
| 0 | javascript |
| 1 | web |
| 2 | rapidly |
| 3 | impressed |
| 4 | growing |
| 5 | augmentation |
| 6 | speed |
| 7 | space |
| 8 | cars |
| 9 | shuttles |
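As an alternative to creating the dataframe and renaming column 0 afterwards, the column name can be supplied at construction time. A minimal sketch, assuming a shortened topic list:

```python
import pandas as pd

# Hypothetical, shortened topic list.
demo_topics = ["javascript", "web", "rapidly"]

# Passing a dict names the column directly, so no rename step is needed.
topics_df = pd.DataFrame({"topics": demo_topics})
print(topics_df)
```

Using a distinct name such as `topics_df` also avoids overwriting the original `topics` list with the dataframe.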


In [216]:
summaries = pd.DataFrame(output['summarized_rows'])
pd.options.display.max_colwidth = 400
summaries.rename(columns= {0:'summaries'}, inplace=True)
summaries

| | summaries |
|---|---|
| 0 | We are surrounded by software. It is present in any kind of devices nowadays, starting from.Transportation like planes, cars, uh space shuttles. It's everywhere and it's connected and. it's getting more and more involved with with us humans. |
| 1 | Umm. So it's the.Tool tips, announcements, walkthroughs.Any kind of self service functionality guidance through. And of course there comes many benefits with such platforms or such tools or.Augmenting augmentation of the existing tools we have or or customer services. |
| 2 | The market and the digital option solutions is growing rapidly and companies are investing heavily to make sure that the products and services are. are usable. If atea wants to be ahead of the curve and be market leader, we need to know what customers want. |
| 3 | Most of the tools, if not all of them meet the the all basic business requirements. But they, as they say, the devil is always in the detail and only in demos we could really see the true nature of those platforms. So what is our recommendation? |
| 4 | I have a question on the pricing. I know PRODUCT C is very pricey, but it sounds like you are also really impressed by what it offers us. What is would there be anything that actually? would allow us to pay that price in in other benefits. |
| 5 | 3rd Part I have is you know what kind of tool is this because I have a little bit of an issue to position this. To me it's the tools and methods application or tool set. You know something that you use to enhance other tools. It's not a tool that that you provide to the end user directly. |
| 6 | I would, I would follow up and I want and have it. Could we used it for PRODUCT E instead of the PRODUCT D. Now could we have one too?We could have, yeah. |
| 7 | It's a it's a sad solution. It's basically.Courtney, but that is added to the web page which invokes a JavaScript library. It doesn't store any kind of user data. |
| 8 | 40 myatea part we had meetings with Daniel and.Hello I think ohh and they are taking over that they have the capabilities and resources to do that for each shop. That is how the demand started all the way back in March 21. |
| 9 | Robert. Atea does customer Orient that not have any solution of the kind available today. We need to get experience. If this is also something to be used in internal systems, obviously make sense. |
