
In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np
import json
import glob

def fetch_url(url):
    """Return the text of every <p> element on the page at `url`."""
    # Send an HTTP request; fail fast on timeouts and HTTP error statuses
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect the text of all paragraph tags on the page
    paragraphs = soup.find_all("p")
    return [p.get_text() for p in paragraphs]

# List of URLs to fetch data from
urls = [
    "https://thehackernews.com/2024/07/spanish-hackers-bundle-phishing-kits.html",
    "https://thehackernews.com/2024/07/french-authorities-launch-operation-to.html",
    "https://thehackernews.com/2024/07/how-searchable-encryption-changes-data.html",
    "https://thehackernews.com/2024/07/stargazer-goblin-creates-3000-fake.html"
]

# Fetch and combine text data from all URLs
all_text_data = []
for url in urls:
    all_text_data.extend(fetch_url(url))

# Create a Pandas DataFrame
df = pd.DataFrame({"Text": all_text_data})
df



Unnamed: 0,Text
0,A Spanish-speaking cybercrime group named GXC ...
1,"Singaporean cybersecurity company Group-IB, wh..."
2,The phishing kit is priced anywhere between $1...
3,Targets of the campaign include users of Spani...
4,Also part of the spectrum of services offered ...
...,...
101,"""The average user views the separation of priv..."
102,"""Unfortunately, [...] that is not always true...."
103,Watch as experts simulate real-world threats t...
104,Get actionable steps and tools to harness the ...


In [None]:
import nltk
import re
import string
nltk.download("stopwords")
from nltk.corpus import stopwords
print(stopwords)
nltk.download('punkt')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from heapq import nlargest

<WordListCorpusReader in '/root/nltk_data/corpora/stopwords'>


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
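# Basic preprocessing: lowercase, then strip punctuation and digits
data = df['Text'].str.lower()
# Replace selected punctuation with spaces and collapse whitespace. Hyphens are not
# in this class; the string.punctuation pass below deletes them without a space,
# so "spanish-speaking" becomes "spanishspeaking".
df["Text"] = data.apply(lambda s: ' '.join(re.sub(r"[.,!?:;<='*$^@#_/|&]", " ", s).split()))
# Drop digits
df["Text"] = df["Text"].str.replace(r'\d+', '', regex=True)
# Delete any punctuation that is still present
punctuations = string.punctuation
df["Text"] = df["Text"].apply(lambda x: x.translate(str.maketrans('', '', punctuations)))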
df.head()

Unnamed: 0,Text
0,a spanishspeaking cybercrime group named gxc t...
1,singaporean cybersecurity company groupib whic...
2,the phishing kit is priced anywhere between a...
3,targets of the campaign include users of spani...
4,also part of the spectrum of services offered ...
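The cleaned output above shows hyphenated words fused together ("spanishspeaking", "groupib"). The sketch below, which uses a made-up sample string and is not part of the original notebook, contrasts deleting punctuation (what `str.translate` with `string.punctuation` does here) with mapping punctuation to spaces. The second variant is only an illustrative alternative, under the assumption that keeping the word parts separate is preferable for later tokenization.

```python
import string

# Hypothetical sample, loosely modelled on the first rows of the DataFrame
sample = "a spanish-speaking cybercrime group, tracked by group-ib"

# Variant 1: delete every punctuation character (fuses hyphenated words)
deleted = sample.translate(str.maketrans("", "", string.punctuation))

# Variant 2 (illustrative alternative): map each punctuation character to a space,
# then collapse runs of whitespace
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(sample.translate(to_space).split())

print(deleted)  # a spanishspeaking cybercrime group tracked by groupib
print(spaced)   # a spanish speaking cybercrime group tracked by group ib
```

Which variant is better depends on the vocabulary you want: the first treats "groupib" as one token, the second produces "group" and "ib" as separate tokens.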


In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    words = nltk.word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
    return ' '.join(lemmatized_words)

Text = df['Text'].apply(preprocess_text)
Text.head()

Unnamed: 0,Text
0,spanishspeaking cybercrime group named gxc tea...
1,singaporean cybersecurity company groupib trac...
2,phishing kit priced anywhere month whereas bun...
3,target campaign include user spanish financial...
4,also part spectrum service offered sale stolen...


In [None]:
# Calculate word frequencies
all_words = ' '.join(Text.tolist()).split()
word_counts = Counter(all_words)

# Print the most common words
print(word_counts.most_common(10))

[('data', 73), ('encryption', 31), ('repository', 24), ('phishing', 21), ('malware', 21), ('account', 21), ('said', 17), ('threat', 17), ('malicious', 15), ('actor', 15)]


In [None]:
# Vectorize the preprocessed text data
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(Text)

# Calculate cosine similarity between all pairs of documents
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print the cosine similarity matrix
print(cosine_sim)


[[1.         0.         0.16420208 ... 0.         0.         0.        ]
 [0.         1.         0.         ... 0.         0.         0.        ]
 [0.16420208 0.         1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.08213055]
 [0.         0.         0.         ... 0.         1.         0.07695393]
 [0.         0.         0.         ... 0.08213055 0.07695393 1.        ]]
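To make the matrix above easier to read, here is a minimal standard-library sketch of cosine similarity on three toy documents. It uses raw word counts rather than TF-IDF weights, so it is a simplification of what `TfidfVectorizer` plus `cosine_similarity` compute; the documents and the helper name `cosine` are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical, already-preprocessed documents
docs = [
    "phishing kit sold to threat actor",
    "threat actor sold phishing kit bundle",
    "searchable encryption keeps data usable",
]
vectors = [Counter(d.split()) for d in docs]

# Pairwise similarity matrix: symmetric, with ones on the diagonal
sim = [[cosine(u, v) for v in vectors] for u in vectors]
for row in sim:
    print([round(x, 3) for x in row])
```

Documents that share no terms score exactly 0, which is why the notebook's matrix is mostly zeros: short paragraphs from different articles rarely share vocabulary.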


In [None]:
# Number of paragraphs to keep in the summary
n = 15

# Average cosine similarity of each paragraph to every paragraph (itself included)
average_similarities = np.mean(cosine_sim, axis=1)

# Indices of the n paragraphs with the highest average similarity
summary_sentences = nlargest(n, range(len(average_similarities)), key=average_similarities.__getitem__)

# Build the summary in document order from the raw paragraphs; df['Text'] now
# holds the lowercased, punctuation-stripped text
summary_tfidf = ' '.join(all_text_data[i] for i in sorted(summary_sentences))

print(summary_tfidf)

Watch as experts simulate real-world threats to demonstrate compelling advantages. Get actionable steps and tools to harness the full potential of GenAI while protecting your sensitive data. Get actionable steps and tools to harness the full potential of GenAI while protecting your sensitive data. Organizations know they must encrypt their most valuable, sensitive data to prevent data theft and breaches. They also understand that organizational data exists to be used. To be searched, viewed, and modified to keep businesses running. Unfortunately, our Network and Data Security Engineers were taught for decades that you just can't search or edit data while in an encrypted state. It's safe to conclude that the way we're securing that data just isn't working. It's critical that we evolve our thought and approach. It's time to encrypt all data at rest, in transit, and also IN USE. So, how do we effectively encrypt data that needs to be used? Because of this cycle of complexity, in many situ
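The selection step can be traced on a small hand-made example. The sketch below assumes a 4x4 similarity matrix and placeholder paragraph strings (both invented for illustration). It ranks items by mean similarity with `heapq.nlargest`, then sorts the chosen indices so the summary keeps the original reading order.

```python
from heapq import nlargest

# Hypothetical paragraphs and a symmetric similarity matrix for them
paragraphs = ["P0 intro", "P1 detail", "P2 aside", "P3 conclusion"]
sim = [
    [1.0, 0.4, 0.2, 0.5],
    [0.4, 1.0, 0.0, 0.6],
    [0.2, 0.0, 1.0, 0.3],
    [0.5, 0.6, 0.3, 1.0],
]

# Mean similarity of each paragraph to all paragraphs (diagonal included,
# which adds the same constant to every row and so does not change the ranking)
mean_sim = [sum(row) / len(row) for row in sim]

# Indices of the two most "central" paragraphs, best first
top = nlargest(2, range(len(mean_sim)), key=mean_sim.__getitem__)

# Restore document order before joining
ordered = sorted(top)
summary = " ".join(paragraphs[i] for i in ordered)

print(top)      # [3, 0]
print(summary)  # P0 intro P3 conclusion
```

A paragraph scores highly under this heuristic when it shares vocabulary with many others, so text repeated on every page, such as promotional footers, can be selected along with article content.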

In [None]:
summary = summary_tfidf
summary_sentences = summary.split('. ')
formatted_summary = '.\n'.join(summary_sentences)

print(formatted_summary)

Watch as experts simulate real-world threats to demonstrate compelling advantages.
Get actionable steps and tools to harness the full potential of GenAI while protecting your sensitive data.
Get actionable steps and tools to harness the full potential of GenAI while protecting your sensitive data.
Organizations know they must encrypt their most valuable, sensitive data to prevent data theft and breaches.
They also understand that organizational data exists to be used.
To be searched, viewed, and modified to keep businesses running.
Unfortunately, our Network and Data Security Engineers were taught for decades that you just can't search or edit data while in an encrypted state.
It's safe to conclude that the way we're securing that data just isn't working.
It's critical that we evolve our thought and approach.
It's time to encrypt all data at rest, in transit, and also IN USE.
So, how do we effectively encrypt data that needs to be used? Because of this cycle of complexity, in many situ

In [None]:
reference_summary="Experts are demonstrating the benefits of GenAI in securing sensitive data, highlighting the need for a modern, complete database encryption strategy that considers encryption of critical data in three states: at rest, in motion, and in use. This approach, known as Searchable Encryption, keeps data fully encrypted while it is still usable, eliminating the complexity and expense associated with the archaic encrypt, decrypt, use, re-encrypt process. Gartner emphasizes the importance of protecting data confidentiality and maintaining data utility for data analytics and privacy teams working with large amounts of data. Paperclip, a 30+ year-old data management company, has created a solution to achieve this, leveraging patented shredding technology and Searchable Symmetric Encryption. This solution removes the complexity, latency, and risk inherent with legacy data security and encryption strategies, focusing on the vast amounts of unencrypted, plaintext data used to support key operational activities. Organizations should take action to secure against Cross Fork Object Reference (CFOR) vulnerabilities, which allow sensitive data to be accessed from deleted forks, deleted repositories, and even private repositories on GitHub."

In [None]:
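# Evaluation
!pip install rouge
from rouge import Rouge

def evaluate_rouge(reference_text, summary_text):
    """Return the ROUGE-1 F1 score of `summary_text` against `reference_text`."""
    scorer = Rouge()
    # Rouge.get_scores expects the hypothesis first, then the reference
    scores = scorer.get_scores(summary_text, reference_text)
    return scores[0]['rouge-1']['f']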



In [None]:
# Evaluate the summary using ROUGE
rouge_score = evaluate_rouge(reference_summary, formatted_summary)

print(f"ROUGE score: {rouge_score}")

ROUGE score: 0.45248868366822964
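To show what a ROUGE-1 F1 figure measures, the following sketch computes unigram precision, recall and F1 by hand on two short made-up strings. It is a simplified variant based on whitespace tokens and clipped counts; the `rouge` package's own tokenization and n-gram counting may differ, so its numbers need not match this function's. The name `rouge1` and both example strings are assumptions.

```python
from collections import Counter

def rouge1(hypothesis: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 (simplified ROUGE-1)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((hyp & ref).values())
    p = overlap / max(sum(hyp.values()), 1)   # share of hypothesis words matched
    r = overlap / max(sum(ref.values()), 1)   # share of reference words recovered
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

reference = "encrypt data at rest in transit and in use"
hypothesis = "encrypt all data in use"

scores = rouge1(hypothesis, reference)
swapped = rouge1(reference, hypothesis)
print(scores)
print(swapped)
```

Swapping the two arguments exchanges precision and recall but leaves F1 unchanged, so an F1-only report is insensitive to argument order while the precision and recall values are not.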



ABSTRACTIVE SUMMARIZATION


In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [None]:
model=T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer=T5Tokenizer.from_pretrained('t5-base')


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# The model input is the same text as reference_summary defined above
input_text = reference_summary
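# Tokenize, truncating the input to 512 tokens. T5 checkpoints are normally
# prompted with a task prefix such as "summarize: "; this cell passes the text without one.
input_ids = tokenizer.encode(input_text, return_tensors='pt', max_length=512, truncation=True)

In [None]:
# Beam search with 7 beams; the generated summary is capped at 150 tokens
output = model.generate(input_ids, max_length=150, num_beams=7, early_stopping=True)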

In [None]:
output_text=tokenizer.decode(output[0],skip_special_tokens=True)
print(output_text)

data at rest, in motion, and in use. Paperclip has created a solution leveraging patented shredding technology and Searchable Symmetric Encryption. This solution removes the complexity, latency, and risk inherent with legacy data security and encryption strategies, focusing on the vast amounts of unencrypted, plaintext data used to support key operational activities.


In [None]:


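# Evaluate the abstractive summary using ROUGE.
# reference_summary is also the text the model summarized, so this score measures
# overlap with the model's own input and is not directly comparable to the extractive score.
rouge_score_abstract = evaluate_rouge(reference_summary, output_text)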

print(f"ROUGE score for abstractive summarization: {rouge_score_abstract}")


ROUGE score for abstractive summarization: 0.49999999619253654
