<a href="https://colab.research.google.com/github/Techseeker-404/News_title_generation_using_pegasus/blob/main/newstitlegen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### This project is divided into two parts
#### 1) Clustering a provided set of news documents
#### 2) Generating a title for each cluster

In [108]:
from google.colab import drive

In [109]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [110]:
import pandas as pd
import re
import warnings 
warnings.filterwarnings("ignore")

#### Loading dataset as a dataframe.

In [111]:
df = pd.read_csv("/content/drive/MyDrive/dataset.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Article,Author,Date,Keywords
0,0,"SAN JOSE, Calif. â Arm acquired Stream Tec...",Rick Merritt,06.13.2018,"4g, Accessories, Cellular, Cloud Computing, Co..."
1,1,SAN FRANCISCO â Semiconductor industry a...,Dylan McGrath,06.18.2018,"Automotive, Computers And Peripherals, Consume..."
2,2,"Since the dawn of the compute era, the trifec...",EETimes,05.25.2017,"Automotive/Transportation, Design Standards, D..."
3,3,"SAN JOSE, Calif. â The 3GPP released a set...",Rick Merritt,09.25.2018,"5g, Advanced Technology, Analog ICs, Associati..."
4,4,"SAN JOSE, Calif. â BrainChip described wha...",Rick Merritt,09.10.2018,"Advanced Technology, Communications And Networ..."


In [112]:
df.loc[0,"Article"]

'  SAN JOSE, Calif. â\x80\x94 Arm acquired Stream Technologies (Glasgow) in an effort to grow a business in paid services for devices on the Internet of Things. The move comes as the IoT is still in an early stage but widely seen to have huge potential with services expected to be one of its hottest sectors. Stream, a private company founded in 2000, claims that its connectivity management software and services are used by 770,000 devices carrying 2 terabytes of traffic daily. Though mainly focused on cellular, its offerings are network-agnostic, also supporting LoRa and satellite nets carrying IP and non-IP data. [Sponsored: Learn more about Computer Vision for the Masses] Stream serves a wide variety of applications including asset tracking, smart meters, and the U.K.â\x80\x99s National Rail system. Its services include support for billing and the so-called embedded subscriber identity module (eSIM), a software-based cellular ID. Earlier this year, Arm rolled out software that it cal
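
The `â\x80\x94` and `â\x80\x99` sequences in the output above look like UTF-8 bytes that were decoded as Latin-1 somewhere upstream. Below is a minimal sketch of one possible repair, assuming that diagnosis holds for the whole column. `fix_mojibake` is a hypothetical helper and is not part of this notebook's pipeline.

```python
def fix_mojibake(text: str) -> str:
    """Undo a 'UTF-8 read as Latin-1' round trip; return the input unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text was not damaged in this particular way, so leave it alone.
        return text

# "\u00e2\u0080\u0094" is how the three UTF-8 bytes of a long dash appear after the bad decode.
sample = "Calif. \u00e2\u0080\u0094 Arm acquired Stream"
repaired = fix_mojibake(sample)
print(repr(repaired))
```

If the assumption holds, the helper could be applied to the column before any cleaning, for example with `df["Article"].map(fix_mojibake)`.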

In [113]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  1000 non-null   int64 
 1   Article     1000 non-null   object
 2   Author      1000 non-null   object
 3   Date        1000 non-null   object
 4   Keywords    895 non-null    object
dtypes: int64(1), object(4)
memory usage: 39.2+ KB


### Choosing relevant Features.

In [114]:
df.loc[0,"Keywords"]

'4g, Accessories, Cellular, Cloud Computing, Communications And Networking Systems Or Equipment, Computers And Peripherals, Consumer Electronics & Appliances, Design Management, Hardware Development, Integrated Development Environments (ides), Internet Of Things, Microcontroller, Mobile, Open Source, Operating Systems, Peripherals, Programming Languages, Research & Development, Semiconductors, SoC, Software, Wireless, Wireless Networking'

### From the initial analysis, it is clear that only two feature columns are needed, namely Article and Keywords, as they alone contain relevant information.

In [115]:
df.drop(["Unnamed: 0","Author","Date"], axis = 1,inplace=True)

In [116]:
# Viewing the head of the dataframe
df.head(5)

Unnamed: 0,Article,Keywords
0,"SAN JOSE, Calif. â Arm acquired Stream Tec...","4g, Accessories, Cellular, Cloud Computing, Co..."
1,SAN FRANCISCO â Semiconductor industry a...,"Automotive, Computers And Peripherals, Consume..."
2,"Since the dawn of the compute era, the trifec...","Automotive/Transportation, Design Standards, D..."
3,"SAN JOSE, Calif. â The 3GPP released a set...","5g, Advanced Technology, Analog ICs, Associati..."
4,"SAN JOSE, Calif. â BrainChip described wha...","Advanced Technology, Communications And Networ..."


In [117]:
# Viewing the tail of the dataframe
df.tail(5)

Unnamed: 0,Article,Keywords
995,LONDON â Researchers from the University...,"Academia, Computers And Peripherals, Consumer ..."
996,The MIPI Alliance recently released MIPI A-PH...,"ADAS, Automotive, Interface, Lidar, Microcontr..."
997,If you didnât go to Las VegasÂ last week...,"Associations, Audio, Ces, Commercial, Computer..."
998,"SAN JOSE, Calif. â A veteran semiconductor ...","Advanced Technology, Career/Profession, Commun..."
999,Qualcommâs “Automotive Redefined” wasnât ...,"ADAS, Automotive, autonomous vehicles"


In [118]:
"""
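Combining the Article and Keywords feature columns into one column called "Document".
Missing keywords are filled with an empty string so that the article text is not lost to NaN.
"""
df["Document"] = df["Article"] + " " + df["Keywords"].fillna("")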

In [119]:
df.loc[0,"Document"]

'  SAN JOSE, Calif. â\x80\x94 Arm acquired Stream Technologies (Glasgow) in an effort to grow a business in paid services for devices on the Internet of Things. The move comes as the IoT is still in an early stage but widely seen to have huge potential with services expected to be one of its hottest sectors. Stream, a private company founded in 2000, claims that its connectivity management software and services are used by 770,000 devices carrying 2 terabytes of traffic daily. Though mainly focused on cellular, its offerings are network-agnostic, also supporting LoRa and satellite nets carrying IP and non-IP data. [Sponsored: Learn more about Computer Vision for the Masses] Stream serves a wide variety of applications including asset tracking, smart meters, and the U.K.â\x80\x99s National Rail system. Its services include support for billing and the so-called embedded subscriber identity module (eSIM), a software-based cellular ID. Earlier this year, Arm rolled out software that it cal

In [120]:
"""
Creating an extra dataframe checkpoint as a contingency, in case it is needed during prototyping.
"""
cols = ['Article', 'Keywords']    # Set columns to combine
df['combined'] = df[cols].apply(lambda row: ', '.join(row.values.astype(str)), axis=1)

# Define which column is index
df_i = df.set_index('combined') 

# Set the index to None
# df_i.index.names = [None] 

In [121]:
df_i.drop(["Article","Keywords","Document"],axis=1,inplace=True)

In [122]:
### Creating a data cleaning check point.
dfComb = df_i

### Creating a dataframe called "dfDoc", the combination of Article and Keywords, which will be used for the data cleaning pipeline

In [123]:


dfDoc = pd.DataFrame(df["Document"])
dfDoc.Document

0        SAN JOSE, Calif. â Arm acquired Stream Tec...
1          SAN FRANCISCO â Semiconductor industry a...
2       Since the dawn of the compute era, the trifec...
3        SAN JOSE, Calif. â The 3GPP released a set...
4        SAN JOSE, Calif. â BrainChip described wha...
                             ...                        
995        LONDON â Researchers from the University...
996     The MIPI Alliance recently released MIPI A-PH...
997        If you didnât go to Las VegasÂ last week...
998     SAN JOSE, Calif. â A veteran semiconductor ...
999     Qualcommâs “Automotive Redefined” wasnât ...
Name: Document, Length: 1000, dtype: object

In [124]:

dfDoc.loc[0,"Document"]


'  SAN JOSE, Calif. â\x80\x94 Arm acquired Stream Technologies (Glasgow) in an effort to grow a business in paid services for devices on the Internet of Things. The move comes as the IoT is still in an early stage but widely seen to have huge potential with services expected to be one of its hottest sectors. Stream, a private company founded in 2000, claims that its connectivity management software and services are used by 770,000 devices carrying 2 terabytes of traffic daily. Though mainly focused on cellular, its offerings are network-agnostic, also supporting LoRa and satellite nets carrying IP and non-IP data. [Sponsored: Learn more about Computer Vision for the Masses] Stream serves a wide variety of applications including asset tracking, smart meters, and the U.K.â\x80\x99s National Rail system. Its services include support for billing and the so-called embedded subscriber identity module (eSIM), a software-based cellular ID. Earlier this year, Arm rolled out software that it cal

In [125]:
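def remove_symbols_low(dfcolumns):
    # Replace every non-word character with a space, then lowercase.
    # regex=True is needed on pandas >= 2.0, where str.replace matches literally by default.
    dfcolumns = dfcolumns.str.replace(r'\W', ' ', regex=True)
    dfcolumns = dfcolumns.str.lower()
    return dfcolumns
dfDoc.Document = remove_symbols_low(dfDoc.Document)
dfDoc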

Unnamed: 0,Document
0,san jose calif â arm acquired stream tec...
1,san francisco â semiconductor industry a...
2,since the dawn of the compute era the trifec...
3,san jose calif â the 3gpp released a set...
4,san jose calif â brainchip described wha...
...,...
995,london â researchers from the university...
996,the mipi alliance recently released mipi a ph...
997,if you didnâ t go to las vegasâ last week...
998,san jose calif â a veteran semiconductor ...


In [126]:
"""
Converting dataframe series into string type for next level of data preprocessing
"""
dfDoc["Document"] = dfDoc["Document"].fillna('').apply(str)


In [127]:
dfDoc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Document  1000 non-null   object
dtypes: object(1)
memory usage: 7.9+ KB


#### Second level of cleaning: removing digits (the column was converted to str type in the previous step).

In [128]:
dfDoc.Document = dfDoc.Document.str.replace(r'\d+', '', regex=True)

In [129]:
dfDoc.loc[0,"Document"]

'  san jose  calif  â   arm acquired stream technologies  glasgow  in an effort to grow a business in paid services for devices on the internet of things  the move comes as the iot is still in an early stage but widely seen to have huge potential with services expected to be one of its hottest sectors  stream  a private company founded in   claims that its connectivity management software and services are used by   devices carrying  terabytes of traffic daily  though mainly focused on cellular  its offerings are network agnostic  also supporting lora and satellite nets carrying ip and non ip data   sponsored  learn more about computer vision for the masses  stream serves a wide variety of applications including asset tracking  smart meters  and the u k â  s national rail system  its services include support for billing and the so called embedded subscriber identity module  esim   a software based cellular id  earlier this year  arm rolled out software that it called kigen os to enable 
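
The two cleaning passes above (non-word characters, then digits) can be tried on a plain string with the standard `re` module. This is a small sketch rather than the notebook's own code; `clean_text` is a hypothetical name. It also shows why a stray `â` survives these passes: in Python 3, `\W` treats accented letters as word characters, so they are only removed later by the `[^a-zA-Z]` filter.

```python
import re

def clean_text(text: str) -> str:
    text = re.sub(r"\W", " ", text)   # punctuation and symbols become spaces
    text = text.lower()
    text = re.sub(r"\d+", "", text)   # digit runs are dropped entirely
    return text

cleaned = clean_text("Founded in 2000, Stream serves 770,000 devices.")
print(cleaned.split())

# Accented letters count as word characters, so this pass keeps them.
print(clean_text("caf\u00e9"))
```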

### Final level of text cleaning: keeping only alphabetic characters with a regex, then tokenizing each document into words.

In [130]:
def tokenize_words(data):
    data = re.sub('[^a-zA-Z]',' ',data)
    data = data.split()
    return data
wordTokenize = lambda x: tokenize_words(x)

In [131]:
dfDoc.Document = dfDoc.Document.apply(wordTokenize)

In [132]:
dfDoc.head(6)

Unnamed: 0,Document
0,"[san, jose, calif, arm, acquired, stream, tech..."
1,"[san, francisco, semiconductor, industry, anal..."
2,"[since, the, dawn, of, the, compute, era, the,..."
3,"[san, jose, calif, the, gpp, released, a, set,..."
4,"[san, jose, calif, brainchip, described, what,..."
5,"[as, demands, for, machine, learning, grow, so..."


#### ==========================================================================================================

### Importing the required natural language processing libraries.

In [133]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [134]:
# Instantiating required API's
Wordnetlem = WordNetLemmatizer()
tfidfVect = TfidfVectorizer(stop_words="english")

In [135]:
nltk.download('stopwords')
# Removing stop words: NLTK's English list plus a custom exclusion list.
# A set gives fast membership tests. Multi-word entries such as 'oh god'
# never match, because each token is a single word.
sw = set(stopwords.words('english'))
wordsetexclude = ['asap','gonna','wanna','bro','lit','more','most','far','cheeky',
             'uff','ceeya','on','in','and','wow','whoah',
              'wooww','astounding','no','yes','km','mile','god','allah','jesus','jesuschrist',
             'oh','oh god','etc','damn','needless','fine','http','www','website','dollar','co',
             'ltd','unit','union','got','heard','nope','little','us','we','our','almighty',
             'might','ought','buy','sell','sent','invite','a','of','the']
sw.update(wordsetexclude)
def remove_sw(data):
    data = [word for word in data if word not in sw]
    return data
dfDoc.Document = dfDoc.Document.apply(remove_sw)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [136]:
dfDoc.tail(3)

Unnamed: 0,Document
997,"[go, las, vegas, last, week, ces, shame, well,..."
998,"[san, jose, calif, veteran, semiconductor, exe..."
999,"[qualcomm, automotive, redefined, event, witne..."
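
The stop-word filter above works token by token, which has one consequence worth seeing in isolation: a phrase entry in the exclusion list can never match a single token. A minimal sketch with a tiny stand-in vocabulary (the real list comes from NLTK; the names below are illustrative only):

```python
base_stopwords = {"the", "of", "and"}        # stand-in for NLTK's English stop words
extra_stopwords = {"etc", "http", "oh god"}  # custom additions, including one phrase
stop_set = base_stopwords | extra_stopwords

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stop-word set.
    return [t for t in tokens if t not in stop_set]

tokens = "oh god the price of chips etc".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

Here "oh" and "god" both survive unless they are also listed individually, as they are in the notebook's `wordsetexclude`.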





#### Now moving on to an intermediate level of NLP feature engineering.



#### Lemmatizing words to reduce each one to its root form.

In [137]:
nltk.download('wordnet')
def lemmatize(data):
    data = [Wordnetlem.lemmatize(word) for word in data]
    return data
lemmatizWord = lambda x : lemmatize(x)
dfDoc.Document = dfDoc.Document.apply(lemmatizWord)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [138]:
"""
Recombining word tokens back.
"""
def joinwords(data):
    combined_text = ' '.join(data)
    return combined_text
combined = lambda x: joinwords(x)
dfDoc.Document = dfDoc.Document.apply(combined)
dfDoc.loc[0,"Document"]

'san jose calif arm acquired stream technology glasgow effort grow business paid service device internet thing move come iot still early stage widely seen huge potential service expected one hottest sector stream private company founded claim connectivity management software service used device carrying terabyte traffic daily though mainly focused cellular offering network agnostic also supporting lora satellite net carrying ip non ip data sponsored learn computer vision mass stream serf wide variety application including asset tracking smart meter u k national rail system service include support billing called embedded subscriber identity module esim software based cellular id earlier year arm rolled software called kigen o enable esim core arm integrate stream product nascent mbed iot service arm disclose much paid stream size stream revenue size nascent iot service business deal likely one example arm reaching beyond processor core sector dominates encouragement new owner softbank a

In [139]:
# Creating a variable with document column as Unicode type.
document = dfDoc.Document.values.astype("U")

In [140]:
document[0]

'san jose calif arm acquired stream technology glasgow effort grow business paid service device internet thing move come iot still early stage widely seen huge potential service expected one hottest sector stream private company founded claim connectivity management software service used device carrying terabyte traffic daily though mainly focused cellular offering network agnostic also supporting lora satellite net carrying ip non ip data sponsored learn computer vision mass stream serf wide variety application including asset tracking smart meter u k national rail system service include support billing called embedded subscriber identity module esim software based cellular id earlier year arm rolled software called kigen o enable esim core arm integrate stream product nascent mbed iot service arm disclose much paid stream size stream revenue size nascent iot service business deal likely one example arm reaching beyond processor core sector dominates encouragement new owner softbank a

In [141]:
features = tfidfVect.fit_transform(document)
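
To make the weighting behind `fit_transform` more concrete, here is a pure-Python sketch of TF-IDF using the smoothed formula that scikit-learn documents as its default: raw term counts, `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. It is an illustration only. It splits on whitespace and skips the vectorizer's own tokenization and stop-word handling, and `tfidf` is a hypothetical function name.

```python
import math
from collections import Counter

def tfidf(docs):
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so every document vector has unit length.
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

rows = tfidf(["arm iot service", "arm chip tariff", "arm chip design"])
print(rows[0])
```

A term that appears in every document ("arm") ends up with a lower weight than a term unique to one document ("iot"), which is what lets the clustering step focus on distinguishing vocabulary.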







### Importing KMeans from the scikit-learn library to cluster the documents

In [142]:
from sklearn.cluster import KMeans

In [143]:
"""
Initially taking 12 clusters.
"""
cluster_numbers = 12

#Kmeans clustering
modelKmeans = KMeans(n_clusters=cluster_numbers, init="k-means++",max_iter=500,n_init=1)


In [144]:
modelKmeans.fit(features)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=500,
       n_clusters=12, n_init=1, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
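
With `n_init=1` and `random_state=None`, the fitted clusters can differ from run to run, because the result depends on a single random initialization. The sketch below shows the assignment and update iterations in plain Python with a seeded start. It is a simplified illustration: it uses uniform random initialization instead of `k-means++`, dense tuples instead of a sparse TF-IDF matrix, and the function name `kmeans` and its parameters are assumptions made for this example.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)                      # fixed seed gives a repeatable start
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:                   # no change means convergence
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(points, k=2, seed=0)
print(labels)
```

In scikit-learn itself, passing a fixed `random_state` and a larger `n_init` to `KMeans` serves the same purpose of making runs repeatable and less sensitive to one unlucky start.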







### Adding the cluster labels to the original dataframe and creating a new dataframe with the Article, Keywords, and Clusters columns

In [145]:
# Kmeans Cluster
df["Clusters"] = modelKmeans.labels_


In [146]:
# Creating a DataFrame called dfclustered that holds the articles, keywords, and KMeans cluster labels
select_features = ["Article","Keywords","Clusters"]
dfclustered = df[select_features]
dfclustered.tail(23)

Unnamed: 0,Article,Keywords,Clusters
977,Semiconductor and electronics companies are g...,"Electronic Instrumentation Or Test, Test & Mea...",2
978,SAN FRANCISCOâIt has taken far longer and c...,"Events, ICs, Research & Development, Semicondu...",2
979,SAN FRANCISCO â Design starts for artificia...,,2
980,Iâm heading to Yosemite for a long weekend ...,"Analog ICs, Avago Technologies, Communications...",2
981,TORONTOÂ â Emerging use cases are revealin...,"Aerospace, Automotive, Defense, Encryption, Go...",2
982,Infineonâs sensor brand and family XENSIVTM...,,2
983,The 3GPP is the organization that has been ma...,Research & Development,2
984,I was taught engineering in a very traditiona...,"Academia, Computers And Peripherals, Industry ...",2
985,I have a lot of respect for cables and connec...,"Connectors/Sockets, EELife, Electronic Instrum...",2
986,"For decades scientists, writers and filmmake...","Advanced Technology, Communications And Networ...",2


In [147]:
len(dfclustered)

1000

In [148]:
modelKmeans.cluster_centers_.argsort()[:,::-1]

array([[  653, 19959,  4313, ..., 13287, 13288,     0],
       [17668,  2897, 18239, ..., 13227, 13228,     0],
       [15498, 17736, 15875, ..., 12070,  8731,  6068],
       ...,
       [ 2599,  2533,  3648, ..., 13320, 13321,     0],
       [ 5542, 18149, 17696, ..., 13365, 13366,     0],
       [ 8421, 18597,  7347, ..., 13322, 13323,     0]])

In [149]:
# Finding the top feature terms of each cluster by inspecting its centroid (center of gravity).
# This gives insight into which terms KMeans treats as most characteristic of each cluster.
#print("Cluster centroids: \n")
order_centroids = modelKmeans.cluster_centers_.argsort()[:,::-1]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
terms = tfidfVect.get_feature_names_out()

wordfeatureset = [] # Flat list of the top terms of all clusters, in cluster order
clusterset = [] # Every entry references the same flat list

wordfeatures = 200
for i in range(cluster_numbers):
    #print("Cluster %d:" %i)
    for j in order_centroids[i, :wordfeatures]:
        #print("%s" %terms[j])
        wordfeatureset.append(terms[j])
   
    clusterset.append(wordfeatureset)
    #print("----------")


In [150]:
# Creating a dict that maps each cluster label to its list of top terms
clustersmap = dict()
for i in range(0,cluster_numbers):
    #print(i*wordfeatures)
    # wordfeatureset is one flat list, so slice out the block of terms that belongs to cluster i
    clustersmap["Cluster: "+str(i)] = [wordfeatureset[i*wordfeatures: (i+1)*wordfeatures]]
    
    

In [151]:

dfclustermap = pd.DataFrame(clustersmap,index=["cluster"])
dfclustermap = dfclustermap.iloc[:,0:cluster_numbers]
dfclustermap

Unnamed: 0,Cluster: 0,Cluster: 1,Cluster: 2,Cluster: 3,Cluster: 4,Cluster: 5,Cluster: 6,Cluster: 7,Cluster: 8,Cluster: 9,Cluster: 10,Cluster: 11
cluster,"[amd, xilinx, deal, center, ai, fpgas, intel, ...","[tariff, china, trade, trump, administration, ...","[said, technology, semiconductor, chip, year, ...","[try, dario, gil, conference, dac, everybody, ...","[fibre, emulex, channel, card, hoogenboom, cre...","[rowen, cadence, underserved, embedded, compan...","[uber, patel, ized, design, bangalore, service...","[aae, safety, arm, cortex, mandyam, lock, auto...","[parameter, tool, design, ml, flow, correlatio...","[cartoon, caption, contest, eeweb, competition...","[eda, tool, tcl, flow, iso, qualification, evi...","[huawei, uk, gear, chinese, mobile, equipment,..."
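
The mapping above can also be built in one pass, with one list of terms per cluster instead of slicing a shared flat list. The sketch below uses plain Python lists and toy values so it runs on its own. `top_terms_per_cluster` is a hypothetical helper, and the centroid numbers are made up for illustration.

```python
def top_terms_per_cluster(centers, terms, n_top):
    """Map each cluster label to the n_top terms with the largest centroid weights."""
    result = {}
    for i, center in enumerate(centers):
        # Indices of the centroid's components, largest weight first.
        order = sorted(range(len(center)), key=lambda j: center[j], reverse=True)
        result[f"Cluster: {i}"] = [terms[j] for j in order[:n_top]]
    return result

terms = ["amd", "china", "tariff", "xilinx"]
centers = [
    [0.9, 0.0, 0.1, 0.7],   # toy centroid dominated by "amd" and "xilinx"
    [0.0, 0.8, 0.6, 0.1],   # toy centroid dominated by "china" and "tariff"
]
top = top_terms_per_cluster(centers, terms, n_top=2)
print(top)
```

With the fitted model, `centers` would correspond to `modelKmeans.cluster_centers_` and `terms` to the vectorizer's vocabulary.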


In [152]:
"""
Transposed dataframe
"""
dfMap = dfclustermap.transpose()
dfMap.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, Cluster: 0 to Cluster: 11
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   cluster  12 non-null     object
dtypes: object(1)
memory usage: 512.0+ bytes


In [153]:
dfMap

Unnamed: 0,cluster
Cluster: 0,"[amd, xilinx, deal, center, ai, fpgas, intel, ..."
Cluster: 1,"[tariff, china, trade, trump, administration, ..."
Cluster: 2,"[said, technology, semiconductor, chip, year, ..."
Cluster: 3,"[try, dario, gil, conference, dac, everybody, ..."
Cluster: 4,"[fibre, emulex, channel, card, hoogenboom, cre..."
Cluster: 5,"[rowen, cadence, underserved, embedded, compan..."
Cluster: 6,"[uber, patel, ized, design, bangalore, service..."
Cluster: 7,"[aae, safety, arm, cortex, mandyam, lock, auto..."
Cluster: 8,"[parameter, tool, design, ml, flow, correlatio..."
Cluster: 9,"[cartoon, caption, contest, eeweb, competition..."


In [154]:
# Recombining each cluster's term list into one space-separated string
dfMap.cluster = dfMap.cluster.apply(combined)

In [155]:
dfMap.loc["Cluster: 1","cluster"]

'tariff china trade trump administration chinese war semiconductor product imposed intellectual billion property supply policy tech worth import neuffer sia said list chain uncertainty economy electronics industry morey worker negotiation qualcomm firm technology semi job group deadline component ultimately proposed market whittier wednesday hurting agreement ministry mofcom lewis practice supplier enacted administrative nxp damage manufacturing consumer global acquisition deficit chip thursday tuesday president unfair international economist huawei company good deal imposing tension country representative transfer equipment largest timmons protection action argued risk economic francisco month net outsourcing concern leading discriminatory expressed oppose aerospace year association mollenkopf government relation negative positive week believe american licensing commerce subject responded chipmakers approve export high related force long round analyst ban outbreak blunt computer remai

In [156]:
"""
Uncomment and run this cell to view all the clusters specifically.
"""
# def printcols(dfcolumns):
#     for v,i in enumerate(dfcolumns):
#         print("cluster: "+ str(v))
#         print(str(i))
# printcols(dfMap.cluster)

'\nUncomment and run this cell to view all the clusters specifically.\n'

In [157]:
"""Uncomment this cell if the Colab environment does not have the modules below installed"""

# !pip install transformers
# !pip install sentencepiece
# !pip install sentence-splitter

'Uncomment this cell if the Colab environment does not have the modules below installed'







### Generating titles for each cluster using the pretrained Pegasus model 'google/pegasus-xsum' from Hugging Face Transformers

In [106]:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
model_name = 'google/pegasus-xsum'
# model_name = 'google/pegasus-cnn_dailymail'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)



In [107]:
def get_response(dfColumns):
  # Generate and print one title per cluster from its space-separated top terms.
  for index, data in enumerate(dfColumns):
    print("cluster: "+ str(index)+" Title")
    input_text = [str(data)]
    # The input is truncated to its first 30 tokens; the generated title is capped at 28 tokens.
    batch = tokenizer(input_text,truncation=True,padding='longest',max_length=30, return_tensors="pt").to(torch_device)
    translated = model.generate(**batch,max_length=28)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    print(tgt_text[0])
  # return tgt_text

In [158]:
get_response(dfMap.cluster)

cluster: 0 Title
BBC News NI takes a look at some of the top stories from the past 24 hours.
cluster: 1 Title
US President Donald Trump and Chinese President Xi Jinping have agreed to "remove all remaining barriers" to trade between the two countries
cluster: 2 Title
.
cluster: 3 Title
The World Economic Forum (WEF) is holding its annual meeting in Davos, Switzerland.
cluster: 4 Title
fibre emulex channel card hoogenboom crehan market nvme ssds revenue stable storage somewhat
cluster: 5 Title
BBC News takes a look at some of the key technology stories of the week.
cluster: 6 Title
A look at some of the top technology stories of the week.
cluster: 7 Title
BBC News takes a look at how driverless cars are changing the way we drive.
cluster: 8 Title
A team of researchers at the University of California, Berkeley, has developed a new way to measure the performance of computer chips
cluster: 9 Title
cartoon caption contest eeweb punchline contest betsy com fun bob winner edn monthly adorned 
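
Some of the outputs above are not usable titles: cluster 2 produced only a period, and clusters 4 and 9 repeat their input terms. One possible way to flag such cases automatically is sketched below. The helper names and the 0.9 threshold are assumptions chosen for illustration, not part of the notebook.

```python
def echo_ratio(title, source_terms):
    """Fraction of the title's words that also appear in the cluster's input terms."""
    words = title.lower().split()
    if not words:
        return 1.0
    source = set(source_terms)
    return sum(w in source for w in words) / len(words)

def is_degenerate(title, source_terms, threshold=0.9):
    # A title is flagged if it is empty after stripping punctuation,
    # or if nearly all of its words were copied from the input terms.
    stripped = title.strip(" .")
    return not stripped or echo_ratio(stripped, source_terms) >= threshold

cluster_terms = ["fibre", "emulex", "channel", "card", "market"]
print(is_degenerate(".", cluster_terms))
print(is_degenerate("fibre emulex channel card", cluster_terms))
print(is_degenerate("BBC News takes a look at some of the key technology stories of the week.",
                    ["rowen", "cadence"]))
```

A check like this only detects empty or copied output. It says nothing about whether a fluent title actually describes the cluster.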

## Out of 12 clusters, titles were generated for 10 through abstractive text summarization with Pegasus. With targeted fine-tuning and further feature engineering, more titles, and more accurate ones, could be generated.
