<a href="https://colab.research.google.com/github/RohithJ11/NLP_Privacy_Policies/blob/main/Hugfctrans_Prvcy_plcy_Mtd_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installing Packages

In [None]:
!pip install transformers pandas nltk




### Importing Modules

In [None]:
import pandas as pd
import torch
from transformers import pipeline
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

### Define Functions for Summarization and Keyword Extraction

In [None]:
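# Build the summarization pipeline once; creating it inside the function
# would reload the t5-small model on every call
summarizer = pipeline("summarization", model="t5-small")

# Function to summarize texts
def summarize_text(text):
    summary_text = summarizer(text, max_length=150, min_length=40, do_sample=False)
    return summary_text[0]['summary_text']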

# Function to extract keywords
def extract_keywords(text):
    words = word_tokenize(text)
    tagged = pos_tag(words)
    # I selected nouns and adjectives as they tend to be most informative
    keywords = [word for word, pos in tagged if pos in ['NN', 'NNS', 'JJ', 'JJR', 'JJS']]
    # Counting the frequency of each keyword
    freq_dist = Counter(keywords)
    # Selecting the top 10 keywords based on their frequency
    most_common_keywords = [word for word, freq in freq_dist.most_common(10)]
    return most_common_keywords
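
One detail of the counting step above is that it is case-sensitive, so "Data" and "data" are tallied as separate keywords. The following is a minimal sketch of a case-insensitive variant, not part of the original notebook. The name `top_keywords` and the hand-written tags are hypothetical; the input is assumed to be a list of `(word, pos)` pairs in the shape `pos_tag` returns.

```python
from collections import Counter

def top_keywords(tagged_tokens, k=10):
    """Return the k most frequent nouns/adjectives, counted case-insensitively.

    tagged_tokens is a list of (word, pos) pairs, the shape nltk.pos_tag returns.
    """
    keep = {'NN', 'NNS', 'JJ', 'JJR', 'JJS'}
    counts = Counter(word.lower() for word, pos in tagged_tokens if pos in keep)
    return [word for word, _ in counts.most_common(k)]

# Hand-written tags for illustration only; real tags would come from pos_tag
sample = [('Data', 'NN'), ('is', 'VBZ'), ('personal', 'JJ'),
          ('data', 'NN'), ('cookies', 'NNS')]
print(top_keywords(sample, k=2))  # ['data', 'personal']
```

Words with equal counts keep the order in which they first appear, because `Counter.most_common` preserves insertion order among ties.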


### Load Dataset and Process Each Privacy Policy

In [None]:
# Load the dataset
df = pd.read_csv('/content/Privacyplcy_DS_3.csv')

# Name of the column in the dataset that contains the privacy policy text
column_name = 'Summary_of_Content'  # Example column name

# Processing each privacy policy
results = []
for index, row in df.iterrows():
    try:
        summary = summarize_text(row[column_name])
        keywords = extract_keywords(summary)
        results.append({'summary': summary, 'keywords': keywords})
    except Exception as e:
        print(f"Error processing row {index}: {e}")
        results.append({'summary': None, 'keywords': None})

# Optionally, convert the results to a DataFrame and save it to a CSV file
results_df = pd.DataFrame(results)
results_df.to_csv('/content/Privacyplcy_DS_3 copy 2.csv', index=False)


Your max_length is set to 150, but your input_length is only 138. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=69)
Your max_length is set to 150, but your input_length is only 124. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=62)
Your max_length is set to 150, but your input_length is only 136. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=68)
Your max_length is set to 150, but your input_length is only 109. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=54)
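
The warnings above appear because every input is shorter than the fixed `max_length=150`. A possible way to avoid them is to derive the length limits from each input, following the halving rule the warning itself suggests (138 gives 69, 109 gives 54). The helper below is a sketch under that assumption; `summary_lengths`, `max_cap`, and `min_floor` are hypothetical names that do not appear in the notebook.

```python
def summary_lengths(input_length, max_cap=150, min_floor=40):
    """Pick (min_length, max_length) for an input of input_length tokens.

    Follows the warning's suggestion of max_length = input_length // 2,
    capped at max_cap, and keeps min_length strictly below max_length.
    """
    max_length = max(2, min(max_cap, input_length // 2))
    min_length = min(min_floor, max_length - 1)
    return min_length, max_length

# The four input lengths reported in the warnings above
for n in (138, 124, 136, 109):
    print(n, summary_lengths(n))
```

In the notebook, the token count could be approximated with `len(summarizer.tokenizer(text)["input_ids"])`, and the resulting pair passed as `min_length` and `max_length` to the summarizer call. The count the pipeline reports may differ slightly from this estimate, so treat the values as a starting point.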


### After running all the code cells above, the output is written to the "Privacyplcy_DS_3 copy 2.csv" file.

### Code to output the values in a downloadable .csv file

In [None]:
# Assumes the earlier cells have been run (imports and extract_keywords)

# Load the dataset
df = pd.read_csv('/content/Privacyplcy_DS_3.csv')

# Replace 'Summary_of_Content' with the name of the column in your dataset that contains the privacy policies
column_name = 'Summary_of_Content'  # Example column name

# Initialize the summarizer once, on the GPU if one is available
summarizer = pipeline("summarization", model="t5-small", device=0 if torch.cuda.is_available() else -1)

def summarize_and_extract_keywords(text):
    summary = summarizer(text, max_length=150, min_length=40, do_sample=False)[0]['summary_text']
    keywords = extract_keywords(summary)
    return summary, keywords

# Processing each privacy policy
processed_policies = []
for index, row in df.iterrows():
    try:
        summary, keywords = summarize_and_extract_keywords(row[column_name])
        processed_policies.append({'Summary': summary, 'Keywords': ', '.join(keywords)})
    except Exception as e:
        print(f"Error processing row {index}: {e}")
        processed_policies.append({'Summary': 'Error processing text', 'Keywords': ''})

# Convert the results to a DataFrame
results_df = pd.DataFrame(processed_policies)

# Save to CSV
output_csv_path = '/content/summarized_privacy_policies_and_keywords.csv'
results_df.to_csv(output_csv_path, index=False)
print(f"Output saved to {output_csv_path}")


Your max_length is set to 150, but your input_length is only 138. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=69)
Your max_length is set to 150, but your input_length is only 124. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=62)
Your max_length is set to 150, but your input_length is only 136. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=68)
Your max_length is set to 150, but your input_length is only 109. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=54)


Output saved to /content/summarized_privacy_policies_and_keywords.csv
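
Because the keywords are saved as a single `', '`-joined string, anyone reading the file back has to split that string to recover the list. The sketch below shows one way to do that with the standard `csv` module. The function name `read_keywords` and the sample row are made up for illustration; only the `Summary` and `Keywords` column names come from the cell above.

```python
import csv
import io

# Stand-in for the saved file; the row below is invented for illustration
sample_csv = (
    'Summary,Keywords\n'
    '"we collect data, such as email.","data, email, such"\n'
)

def read_keywords(csv_file):
    """Yield (summary, keyword_list) pairs from a file with Summary and Keywords columns."""
    for row in csv.DictReader(csv_file):
        # Undo the ', '.join(...) applied before saving; empty cells give an empty list
        keywords = row['Keywords'].split(', ') if row['Keywords'] else []
        yield row['Summary'], keywords

rows = list(read_keywords(io.StringIO(sample_csv)))
print(rows)
```

For the real file, the `io.StringIO` object would be replaced by `open(output_csv_path, newline='')`. Rows that failed during processing have an empty `Keywords` cell and come back as an empty list.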
