In [1]:
import os
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [2]:
import torch

if torch.cuda.is_available():
    print("GPU is available!")
    print(f"Using device: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU. Using CPU.")

GPU is available!
Using device: NVIDIA GeForce GTX 1650 Ti


# 1. Loading the Data

In [3]:
df = pd.read_csv(r"C:\Users\neupa\OneDrive\Desktop\data606_capstone_teamC\Data\fake_news_0\WELFake_Dataset.csv")

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


So, we have two main input features: title and text.

We could train on only the title or only the text, but the text is often very long.

If we wish to apply pretrained embeddings, we need to ensure that our input text does not exceed the maximum number of input tokens of the embedding model. Otherwise, the text will be truncated and everything past the cutoff is lost.

In [5]:
df.isnull().sum()

Unnamed: 0      0
title         558
text           39
label           0
dtype: int64

In [6]:
# we don't need rows where both 'title' and 'text' are NaN
df = df.dropna(subset=['title', 'text'], how='all')

In [7]:
df.isnull().sum()

Unnamed: 0      0
title         558
text           39
label           0
dtype: int64

The null counts are unchanged, so no row was missing both fields: either the title is null or the text is null, but never both. That's fine. Our approach will use both.
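Since a row can be missing one of the two fields, one way to use both is to join whatever is present. This is a minimal sketch, assuming missing values arrive as `None`, `NaN`, or empty strings; `is_missing` and `combine_fields` are hypothetical helpers and are not used elsewhere in this notebook.

```python
import math


def is_missing(value):
    """Return True for None, NaN, or whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def combine_fields(title, text, sep=". "):
    """Join the fields that are present, skipping the missing ones."""
    parts = [str(v).strip() for v in (title, text) if not is_missing(v)]
    return sep.join(parts)


# Both fields present
print(combine_fields("Some title", "Some body text"))
# Missing title (NaN), as in row 1 of the dataset
print(combine_fields(float("nan"), "Did they post their votes for Hillary already?"))
```

Skipping the missing field avoids a stray leading separator when the title is empty.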

In [8]:
random_titles = df['title'].sample(50, random_state=80).values
random_texts = df['text'].sample(50, random_state=80).values

In [9]:
random_titles[0]

"Trump say appeals court decision on travel ban was 'political'"

In [10]:
random_texts[0]

'WASHINGTON (Reuters) - President Donald Trump on Thursday called the appellate court ruling that upheld the suspension of his order restricting travel from seven Muslim-majority countries a “political decision,” and vowed his administration would ultimately prevail. “We’ll see them in court,” Trump told reporters who had gathered outside his press secretary’s office. “It’s a political decision.” Trump said he did not view the ruling as a major setback for his White House. “This is just a decision that came down, but we’re going to win the case,” he said. '

# 2. Tokenization 

We use the [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) embedding model from Hugging Face.

This embedding model is relatively lightweight and one of the most popular ones. It converts an input text (max 512 tokens) into a 384-dimensional dense vector embedding.


The produced embedding can be used for checking sentence similarity, clustering, information retrieval, or text classification by passing in the embeddings as input features.
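When the model is loaded through `AutoModel`, it returns one vector per token, so a pooling step is needed to get a single vector per text. The model card uses mean pooling over the non-padding tokens. The sketch below illustrates that idea and a cosine similarity check in plain Python with toy 2-dimensional vectors; `mean_pool` and `cosine_similarity` are illustrative helpers, not library functions.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy example: three token vectors, the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
print(cosine_similarity(sentence_vector, [2.0, 3.0]))
```

The real embeddings have 384 dimensions, but the arithmetic is the same.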

In [11]:
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel

In [12]:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

In [13]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 384, padding_idx=0)
    (position_embeddings): Embedding(512, 384)
    (token_type_embeddings): Embedding(2, 384)
    (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-5): 6 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)


In [14]:
tokenizer

BertTokenizerFast(name_or_path='sentence-transformers/all-MiniLM-L6-v2', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

We can see that the tokenizer's vocabulary has a little over 30k tokens (30,522), and the maximum input sequence length is 512 tokens. Shorter inputs can be padded up to that length; longer inputs are truncated to it only when truncation is enabled in the tokenizer call.
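To make padding and truncation concrete, here is a plain-Python sketch of what happens to a list of token ids. It mirrors the settings printed above (`[PAD]` = 0, `[SEP]` = 102, right-side padding and truncation). `fit_to_length` is a hypothetical helper written for illustration, not the tokenizer's actual implementation.

```python
def fit_to_length(token_ids, max_length, pad_id=0, sep_id=102):
    """Truncate or right-pad a sequence that starts with [CLS] and ends with [SEP].

    Returns (input_ids, attention_mask), both of length max_length.
    """
    if len(token_ids) > max_length:
        # Keep [CLS] plus the first max_length - 2 content tokens, then close with [SEP]
        token_ids = token_ids[: max_length - 1] + [sep_id]
    attention_mask = [1] * len(token_ids)
    pad = max_length - len(token_ids)
    # Padding positions get pad_id and a mask value of 0 so the model ignores them
    return token_ids + [pad_id] * pad, attention_mask + [0] * pad


ids = [101, 8398, 2360, 9023, 102]
print(fit_to_length(ids, 4))  # too long: content past the cutoff is dropped
print(fit_to_length(ids, 7))  # too short: padded on the right
```

The truncation branch is where information is lost: everything after the cutoff never reaches the model.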

In [15]:
random_titles[0]

"Trump say appeals court decision on travel ban was 'political'"

In [16]:
tokenizer.encode(random_titles[0], return_tensors='pt')

tensor([[ 101, 8398, 2360, 9023, 2457, 3247, 2006, 3604, 7221, 2001, 1005, 2576,
         1005,  102]])

The above are the ids (indices) of each token in the tokenizer's vocabulary. The tokenizer is uncased, so the text is lowercased before lookup. For instance, the text is tokenized as follows:

`['[CLS]', 'trump', 'say', 'appeals', 'court', 'decision', 'on', 'travel', 'ban', 'was', "'", 'political', "'", '[SEP]']`

In [17]:
ids = [101, 8398, 2360, 9023, 2457, 3247, 2006, 3604, 7221, 2001, 1005, 2576, 1005, 102]
decoded = tokenizer.decode(ids)
print(decoded)

[CLS] trump say appeals court decision on travel ban was ' political ' [SEP]
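Encoding and decoding are, at their core, lookups in a vocabulary table. The sketch below rebuilds a tiny slice of that table from the ids and decoded text shown above, assuming each word maps to a single WordPiece token (the matching lengths, 14 ids and 14 tokens, suggest this). `toy_encode` and `toy_decode` are illustrative only.

```python
# id -> token pairs read off the encode/decode outputs above
id_to_token = {
    101: "[CLS]", 8398: "trump", 2360: "say", 9023: "appeals", 2457: "court",
    3247: "decision", 2006: "on", 3604: "travel", 7221: "ban", 2001: "was",
    1005: "'", 2576: "political", 102: "[SEP]",
}
token_to_id = {token: idx for idx, token in id_to_token.items()}


def toy_encode(tokens):
    """Map tokens to vocabulary ids."""
    return [token_to_id[t] for t in tokens]


def toy_decode(ids):
    """Map vocabulary ids back to a space-joined string."""
    return " ".join(id_to_token[i] for i in ids)


ids = [101, 8398, 2360, 9023, 2457, 3247, 2006, 3604, 7221, 2001, 1005, 2576, 1005, 102]
print(toy_decode(ids))
print(toy_encode(["[CLS]", "travel", "ban", "[SEP]"]))
```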


In [18]:
from tqdm import tqdm

for current_title in tqdm(random_titles):
    encoded = tokenizer.encode(current_title, return_tensors='pt')
    print("Number of tokens for current title:", len(encoded[0]))

100%|██████████| 50/50 [00:00<00:00, 2851.33it/s]

Number of tokens for current title: 14
Number of tokens for current title: 24
Number of tokens for current title: 15
Number of tokens for current title: 14
Number of tokens for current title: 13
Number of tokens for current title: 22
Number of tokens for current title: 42
Number of tokens for current title: 24
Number of tokens for current title: 36
Number of tokens for current title: 21
Number of tokens for current title: 13
Number of tokens for current title: 15
Number of tokens for current title: 10
Number of tokens for current title: 19
Number of tokens for current title: 21
Number of tokens for current title: 23
Number of tokens for current title: 20
Number of tokens for current title: 21
Number of tokens for current title: 12
Number of tokens for current title: 24
Number of tokens for current title: 13
Number of tokens for current title: 18
Number of tokens for current title: 18
Number of tokens for current title: 17
Number of tokens for current title: 15
...

Titles are short, so we don't hit the max token limit. Let's look at the text.

In [19]:
all_title = df["title"].fillna("").values
all_text = df["text"].fillna("").values

In [20]:
total_tokens = []

for current_text in tqdm(all_text):
    encoded = tokenizer.encode(current_text, return_tensors='pt')
    total_tokens.append(len(encoded[0]))

  0%|          | 0/72134 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1200 > 512). Running this sequence through the model will result in indexing errors
100%|██████████| 72134/72134 [03:20<00:00, 360.59it/s]


In [21]:
for current_text in tqdm(random_texts):
    encoded = tokenizer.encode(current_text, return_tensors='pt')
    print("Number of tokens for current text:", len(encoded[0]))

100%|██████████| 50/50 [00:00<00:00, 331.71it/s]

Number of tokens for current text: 123
Number of tokens for current text: 700
Number of tokens for current text: 570
Number of tokens for current text: 946
Number of tokens for current text: 41
Number of tokens for current text: 1034
Number of tokens for current text: 27
Number of tokens for current text: 153
Number of tokens for current text: 588
Number of tokens for current text: 861
Number of tokens for current text: 117
Number of tokens for current text: 513
Number of tokens for current text: 972
Number of tokens for current text: 272
Number of tokens for current text: 910
Number of tokens for current text: 747
Number of tokens for current text: 315
Number of tokens for current text: 1998
Number of tokens for current text: 26
Number of tokens for current text: 666
Number of tokens for current text: 253
Number of tokens for current text: 10
Number of tokens for current text: 270
Number of tokens for current text: 408
Number of tokens for current text: 265
...

Looks like some of the texts are too long (they exceed the input token limit). I do not want to lose data by truncating. Let's see if I can use an LLM to summarize the text and create a new feature so that title + text stays within the 512-token limit.
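A quick way to quantify the problem is to count how many texts exceed the limit. The sketch below uses the 25 token counts visible in the output above; the variable names are my own.

```python
# Token counts copied from the sample output above
sample_token_counts = [
    123, 700, 570, 946, 41, 1034, 27, 153, 588, 861, 117, 513, 972,
    272, 910, 747, 315, 1998, 26, 666, 253, 10, 270, 408, 265,
]
LIMIT = 512

over = [n for n in sample_token_counts if n > LIMIT]
share = len(over) / len(sample_token_counts)
print(f"{len(over)} of {len(sample_token_counts)} texts exceed {LIMIT} tokens ({share:.0%})")
# prints: 12 of 25 texts exceed 512 tokens (48%)
```

Roughly half of this small sample is over the limit, so truncation would affect a large share of the data.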

I created an API key on groq.com and set it up in the current virtual environment. Groq is a platform that provides high-throughput endpoints for using and testing multiple open-source LLMs for free. The free account is rate limited, but the limits are sufficient for our use case.

In [22]:
df

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1
...,...,...,...,...
72129,72129,Russians steal research on Trump in hack of U....,WASHINGTON (Reuters) - Hackers believed to be ...,0
72130,72130,WATCH: Giuliani Demands That Democrats Apolog...,"You know, because in fantasyland Republicans n...",1
72131,72131,Migrants Refuse To Leave Train At Refugee Camp...,Migrants Refuse To Leave Train At Refugee Camp...,0
72132,72132,Trump tussle gives unpopular Mexican leader mu...,MEXICO CITY (Reuters) - Donald Trump’s combati...,0


In [23]:
titles_exceeding_512_token_limit = []
titles_token_count = []
texts_token_count = []

for i, (current_title, current_text) in enumerate(tqdm(zip(all_title, all_text))):
    encoded_title = tokenizer.encode(current_title, return_tensors='pt')
    encoded_text = tokenizer.encode(current_text, return_tensors='pt')

    titles_token_count.append(len(encoded_title[0]))
    texts_token_count.append(len(encoded_text[0]))

    if len(encoded_title[0]) > 512:
        titles_exceeding_512_token_limit.append(current_title)

72134it [03:51, 311.09it/s]


In [24]:
titles_exceeding_512_token_limit, min(titles_token_count), max(titles_token_count)

([], 2, 109)

Perfect. None of the titles exceed the input limit. The longest title has 109 tokens, and the minimum is 2 because empty titles are encoded as just the default start and end tokens (`[CLS]` and `[SEP]`).

I will keep the title as is, and I will calculate the maximum number of tokens available for each text so that title + text is <= 512.

If title + text is already <= 512, the text is left untouched, whereas a text that exceeds this limit will be passed to an LLM, with a text-specific prompt, to be summarized within the available budget.
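The budget rule can be written as a small function. This is a sketch of the arithmetic described above; `summary_budget` is a hypothetical name, and the 1,200-token text in the example is made up for illustration.

```python
def summary_budget(title_tokens, text_tokens, limit=512):
    """Return (needs_summary, tokens_available_for_text) for one article."""
    needs_summary = title_tokens + text_tokens > limit
    available_for_text = limit - title_tokens
    return needs_summary, available_for_text


# Longest title in the dataset (109 tokens) paired with a hypothetical 1,200-token text
print(summary_budget(109, 1200))
# Short title with a short text: nothing to summarize
print(summary_budget(11, 123))
```

Both counts include their own `[CLS]` and `[SEP]` tokens because title and text are tokenized separately, so the budget is slightly conservative.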

In [25]:
# available_tokens_for_text = 512 - np.array(titles_token_count)

In [26]:
available_tokens_for_text_summarization = []
summarize_text = []

for text_token_count, title_token_count in tqdm(zip(texts_token_count, titles_token_count)):
    if text_token_count + title_token_count <= 512:
        perform_summarization = False
    else:
        perform_summarization = True


    available_tokens_for_text = 512 - title_token_count
    available_tokens_for_text_summarization.append(available_tokens_for_text)
    summarize_text.append(perform_summarization)

72134it [00:00, 1070120.38it/s]


In [27]:
sum(summarize_text)

36879

In [28]:
true_indices = [idx for idx, val in enumerate(summarize_text) if val]
all_text_true = [all_text[idx] for idx in true_indices]
max_tokens = [available_tokens_for_text_summarization[idx] for idx in true_indices]

In [29]:
len(all_text_true)

36879

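Now we have all the long texts and the maximum number of tokens available to summarize each of them. Let's do that using Groq.

I will use llama-3.1-8b-instant from Groq to do the summarization. (The `run_groq_summary` helper defined below defaults to `llama3-70b-8192`, so pass `model` explicitly to use a different one.)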

In [30]:
def return_prompt(
    title: str,
    text: str,
    title_token_count: int,
    text_token_count: int,
    available_tokens_for_summarization: int,
    max_tokens: int = 512
):
    """
    Generates a prompt to summarize or rewrite the input news text to fit within
    the token limit of an embedding model, preserving key information.
    """
    prompt = f"""Input details:
    
    Title: {title}\n
    Title Token Count: {title_token_count}\n
    Original Text: {text}\n
    Original Text Token Count: {text_token_count}\n
    Max Allowed Total Tokens (Title + Text): {max_tokens}\n
    Available Tokens for Text Summary: {available_tokens_for_summarization}\n
    
    Now rewrite the article text only (not the title) within the available token budget. Your output should be clean, self-contained, and informative. 
    Do not add any information that is not present in the original news text or do not censor the text.
    Your output must be exactly text summary and nothing else. You are also not allowed to water-down or censor the text as the text may contain sensitive information related 
    to politics, race or religion. Your job is to just summarize the text and not censor/water-down the text in any way shape or form.
    The summary should be concise, informative, and maintain the original meaning.
    """
    return prompt


In [31]:
labels = df['label'].values

In [32]:
''' 
This loop is used to find the first text that requires summarization.
It will break once it finds a text that needs summarization.
'''
for i in range(len(summarize_text)):
    if summarize_text[i]:
        current_available_tokens = available_tokens_for_text_summarization[i]
        current_title = all_title[i]
        current_text = all_text[i]
        current_title_token_count = titles_token_count[i]
        current_text_token_count = texts_token_count[i]
        prompt = return_prompt(
            current_title,
            current_text,
            current_title_token_count,
            current_text_token_count,
            current_available_tokens,
        )
        
        break

In [33]:
print(labels[i])

1


In [34]:
prompts = []
maximum_allocated_tokens = []

for i in range(0, len(all_text)):
    if summarize_text[i]:
        current_available_tokens = available_tokens_for_text_summarization[i]
        current_title = all_title[i]
        current_text = all_text[i]
        current_title_token_count = titles_token_count[i]
        current_text_token_count = texts_token_count[i]
        prompts.append(
                return_prompt(
                current_title,
                current_text,
                current_title_token_count,
                current_text_token_count,
                current_available_tokens,
            )
        )
        maximum_allocated_tokens.append(current_available_tokens)

In [35]:
len(summarize_text)

72134

In [39]:
print(prompt)

Input details:

    Title: LAW ENFORCEMENT ON HIGH ALERT Following Threats Against Cops And Whites On 9-11By #BlackLivesMatter And #FYF911 Terrorists [VIDEO]

    Title Token Count: 35

    Original Text: No comment is expected from Barack Obama Members of the #FYF911 or #FukYoFlag and #BlackLivesMatter movements called for the lynching and hanging of white people and cops. They encouraged others on a radio show Tuesday night to  turn the tide  and kill white people and cops to send a message about the killing of black people in America.One of the F***YoFlag organizers is called  Sunshine.  She has a radio blog show hosted from Texas called,  Sunshine s F***ing Opinion Radio Show. A snapshot of her #FYF911 @LOLatWhiteFear Twitter page at 9:53 p.m. shows that she was urging supporters to  Call now!! #fyf911 tonight we continue to dismantle the illusion of white Below is a SNAPSHOT Twitter Radio Call Invite   #FYF911The radio show aired at 10:00 p.m. eastern standard time.During the show

In [40]:
len(tokenizer.encode(prompt, return_tensors='pt')[0])

1410

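Hmm, our prompt itself is ~1,400 tokens as counted by the embedding model's tokenizer (the LLM's own tokenizer will give a somewhat different count). Hopefully the model will not hallucinate and will give us a proper summary.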
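Before sending a prompt, it is worth checking that the prompt plus the requested completion fits in the LLM's context window. The sketch below assumes a window of 8,192 tokens, which is only inferred from the `8192` suffix in the model name used later. `fits_context` is a hypothetical helper.

```python
def fits_context(prompt_tokens, max_completion_tokens, context_window=8192):
    """Check that prompt + completion budget stays within the context window.

    context_window=8192 is an assumption; confirm it for the model actually used.
    """
    return prompt_tokens + max_completion_tokens <= context_window


# Prompt size measured above (1410) with a completion budget of 512 - 35 = 477
print(fits_context(1410, 477))
# A prompt close to the assumed window would not leave room for the completion
print(fits_context(8000, 477))
```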

In [41]:
from groq import Groq

llm_api_key = os.getenv("LLM_API_KEY")
client = Groq(api_key=llm_api_key)

In [102]:
system_prompt = '''You are an expert summarizer helping a user prepare input for a sentence embedding model with a strict 512-token input limit.
The user will always include the full **title** of the news article. Your job is to **rewrite or summarize the news article text only**, using no more than the available tokens provided. The goal is to preserve **all important meaning** from the article without exceeding the token budget.

You must:
- Leave the title unchanged
- Output **only the rewritten article text**
- Not include anything non-relevant stuff in your response

You will be provided Input details.
'''

In [None]:
def run_groq_summary(prompt_text, max_tokens, model="llama3-70b-8192"):
    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content":system_prompt},
            {"role": "user", "content": prompt_text}
        ],
        model=model,
        temperature=0.1,
        top_p=1,
        max_completion_tokens=max_tokens # this is the maximum number of tokens possible (title tokens + max_tokens = 512)
    )
    return response.choices[0].message.content
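Because the free tier is rate limited, a long batch of requests will probably hit occasional failures. One common remedy is to retry with exponential backoff. The sketch below is generic: `call_with_retry` is a hypothetical wrapper, it catches every `Exception` for simplicity (in practice this should be narrowed to the client's rate-limit error), and `flaky_summarizer` is a stand-in for the real API call so the example runs offline.

```python
import time


def call_with_retry(fn, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:  # assumption: narrow this to the rate-limit error in practice
            if attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... base_delay seconds between attempts
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"n": 0}


def flaky_summarizer(prompt_text, max_tokens):
    """Stand-in for the API call: fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return f"summary of {len(prompt_text)} chars within {max_tokens} tokens"


# sleep is replaced with a no-op so the demo finishes instantly
result = call_with_retry(flaky_summarizer, "some prompt", 477, sleep=lambda seconds: None)
print(result, "| attempts:", calls["n"])
```

With the real helper, the call would look like `call_with_retry(run_groq_summary, prompt, max_tokens)`.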

In [43]:
print(current_text)

Goldman Sachs Endorses Hillary Clinton For President For Goldman Sachs, was there really any other choice this cycle? | October 27, 2016 Be Sociable, Share! Lloyd Blankfein, CEO of Goldman Sachs, is greeted by Hillary Clinton at a panel discussion at the Clinton Global Initiative, Sept. 24, 2014 in New York. 
Published in partnership with Shadowproof . 
He’s with her. On Sunday, former Secretary of State Hillary Clinton earned the endorsement of Goldman Sachs CEO Lloyd Blankfein —an endorsement she had been working toward for years. 
As was revealed by Wikileaks, Hillary Clinton spent the run up to her presidential campaign giving speeches to Goldman Sachs and other Wall Street banks, where she praised their talents and explained her positions on financial regulation. 
On October 24, 2013, Clinton told Goldman Sachs that Dodd-Frank had to be done mostly for “political reasons” because Congress needed to look like it was doing something about the crisis. She said, “There’s nothing magic

In [44]:
print(prompt)

Input details:

    Title: LAW ENFORCEMENT ON HIGH ALERT Following Threats Against Cops And Whites On 9-11By #BlackLivesMatter And #FYF911 Terrorists [VIDEO]

    Title Token Count: 35

    Original Text: No comment is expected from Barack Obama Members of the #FYF911 or #FukYoFlag and #BlackLivesMatter movements called for the lynching and hanging of white people and cops. They encouraged others on a radio show Tuesday night to  turn the tide  and kill white people and cops to send a message about the killing of black people in America.One of the F***YoFlag organizers is called  Sunshine.  She has a radio blog show hosted from Texas called,  Sunshine s F***ing Opinion Radio Show. A snapshot of her #FYF911 @LOLatWhiteFear Twitter page at 9:53 p.m. shows that she was urging supporters to  Call now!! #fyf911 tonight we continue to dismantle the illusion of white Below is a SNAPSHOT Twitter Radio Call Invite   #FYF911The radio show aired at 10:00 p.m. eastern standard time.During the show

In [45]:
%%time

output = run_groq_summary(prompt, current_available_tokens)

CPU times: total: 15.6 ms
Wall time: 1.08 s


In [46]:
print(output)

Members of #FYF911 and #BlackLivesMatter called for the lynching and hanging of white people and cops on a radio show. They encouraged others to kill white people and cops to send a message about the killing of black people in America. One of the organizers, Sunshine, urged supporters to "call now" and "dismantle the illusion of white supremacy." During the show, callers clearly called for the lynching and killing of white people. One unidentified black man said they should target white people when they're alone and "f***ing hang them from a damn tree." He also suggested taking a picture and sending it to the "mother f**kers." Another black man spoke up, saying they needed to kill cops that are killing black people. The movement's supporters have also made threats against veterans who infiltrated their conference calls, including a threat to rape and gut a pregnant wife.


<b>Original: </b>

No comment is expected from Barack Obama Members of the #FYF911 or #FukYoFlag and #BlackLivesMatter movements called for the lynching and hanging of white people and cops. They encouraged others on a radio show Tuesday night to  turn the tide  and kill white people and cops to send a message about the killing of black people in America.One of the F***YoFlag organizers is called  Sunshine.  She has a radio blog show hosted from Texas called,  Sunshine s F***ing Opinion Radio Show. A snapshot of her #FYF911 @LOLatWhiteFear Twitter page at 9:53 p.m. shows that she was urging supporters to  Call now!! #fyf911 tonight we continue to dismantle the illusion of white Below is a SNAPSHOT Twitter Radio Call Invite   #FYF911The radio show aired at 10:00 p.m. eastern standard time.During the show, callers clearly call for  lynching  and  killing  of white people.A 2:39 minute clip from the radio show can be heard here. It was provided to Breitbart Texas by someone who would like to be referred to as  Hannibal.  He has already received death threats as a result of interrupting #FYF911 conference calls.An unidentified black man said  when those mother f**kers are by themselves, that s when when we should start f***ing them up. Like they do us, when a bunch of them ni**ers takin  one of us out, that s how we should roll up.  He said,  Cause we already roll up in gangs anyway. There should be six or seven black mother f**ckers, see that white person, and then lynch their ass. Let s turn the tables. They conspired that if  cops started losing people,  then  there will be a state of emergency. He speculated that one of two things would happen,  a big-ass [R s?????] war,  or  ni**ers, they are going to start backin  up. We are already getting killed out here so what the f**k we got to lose? Sunshine could be heard saying,  Yep, that s true. That s so f**king true. He said,  We need to turn the tables on them. Our kids are getting shot out here. 
Somebody needs to become a sacrifice on their side.He said,  Everybody ain t down for that s**t, or whatever, but like I say, everybody has a different position of war.  He continued,  Because they don t give a f**k anyway.  He said again,  We might as well utilized them for that s**t and turn the tables on these n**ers. He said, that way  we can start lookin  like we ain t havin  that many casualties, and there can be more causalities on their side instead of ours. They are out their killing black people, black lives don t matter, that s what those mother f**kers   so we got to make it matter to them. Find a mother f**ker that is alone. Snap his ass, and then f***in hang him from a damn tree. Take a picture of it and then send it to the mother f**kers. We  just need one example,  and  then people will start watchin .  This will turn the tables on s**t, he said. He said this will start  a trickle-down effect.  He said that when one white person is hung and then they are just  flat-hanging,  that will start the  trickle-down effect.  He continued,  Black people are good at starting trends. He said that was how  to get the upper-hand. Another black man spoke up saying they needed to kill  cops that are killing us. The first black male said,  That will be the best method right there. Breitbart Texas previously reported how Sunshine was upset when  racist white people  infiltrated and disrupted one of her conference calls. She subsequently released the phone number of one of the infiltrators. The veteran immediately started receiving threatening calls.One of the #F***YoFlag movement supporters allegedly told a veteran who infiltrated their publicly posted conference call,  We are going to rape and gut your pregnant wife, and your f***ing piece of sh*t unborn creature will be hung from a tree. Breitbart Texas previously encountered Sunshine at a Sandra Bland protest at the Waller County Jail in Texas, where she said all white people should be killed. 
She told journalists and photographers,  You see this nappy-ass hair on my head?   That means I am one of those more militant Negroes.  She said she was at the protest because  these redneck mother-f**kers murdered Sandra Bland because she had nappy hair like me. #FYF911 black radicals say they will be holding the  imperial powers  that are actually responsible for the terrorist attacks on September 11th accountable on that day, as reported by Breitbart Texas. There are several websites and Twitter handles for the movement. Palmetto Star  describes himself as one of the head organizers. He said in a YouTube video that supporters will be burning their symbols of  the illusion of their superiority,  their  false white supremacy,  like the American flag, the British flag, police uniforms, and Ku Klux Klan hoods.Sierra McGrone or  Nocturnus Libertus  posted,  you too can help a young Afrikan clean their a** with the rag of oppression.  She posted two photos, one that appears to be herself, and a photo of a black man, wiping their naked butts with the American flag.For entire story: Breitbart News.

<b>Summarized:</b>

Members of #FYF911 and #BlackLivesMatter movements called for the lynching and hanging of white people and cops on a radio show. They encouraged killing white people and cops to send a message about the killing of black people in America. One organizer, Sunshine, urged supporters to "call now" and "dismantle the illusion of white supremacy. During the show, callers clearly called for lynching and killing of white people. An unidentified black man said they should target white people when they're alone and hang them from a tree, saying "we might as well utilize them for that s**t and turn the tables on these n**ers." He also said that black people are good at starting trends and that would give them the upper-hand. Another black man spoke up saying they needed to kill cops that are killing us. Sunshine was previously upset when racist white people infiltrated and disrupted one of her conference calls, and she released the phone number of one of them, leading to threatening calls.


Not bad at all! So, let's see the token count of the summarized text.

In [47]:
print("The token count of original text of the news article: ", len(tokenizer.encode(current_text, return_tensors='pt')[0]))
print("The token count of new summarized text of the news article: ", len(tokenizer.encode(output, return_tensors='pt')[0]))

The token count of original text of the news article:  568
The token count of new summarized text of the news article:  197


In [48]:
print("The token count of original title of the news article: ", titles_token_count[-1])

The token count of original title of the news article:  11


In [49]:
titles_token_count[-1] + len(tokenizer.encode(output, return_tensors='pt')[0])

208

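<b>Which is less than 512. Perfect!</b>

One caveat: `current_text` and `titles_token_count[-1]` in the cells above do not belong to the article in `prompt` (they come from later rows), so the 568 and 11 describe different articles. For the summarized article, the prompt reports a title token count of 35, which gives 35 + 197 = 232 tokens, still well under 512.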
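The `max_completion_tokens` cap is counted with the LLM's tokenizer, not the embedding model's tokenizer, so it is safer to re-check each summary with the embedding tokenizer after generation. A minimal sketch of that check is below. `check_budget` is a hypothetical helper, and the inputs are the title count from the prompt (35) and the summary count measured above (197).

```python
def check_budget(title_tokens, summary_tokens, limit=512):
    """Report whether title + summary fits the embedding model's input limit."""
    total = title_tokens + summary_tokens
    return {
        "total": total,
        "within_limit": total <= limit,
        "headroom": limit - total,
    }


print(check_budget(35, 197))
# prints: {'total': 232, 'within_limit': True, 'headroom': 280}
```

Rows where `within_limit` is `False` could be re-summarized with a smaller budget or truncated as a last resort.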

In [None]:
# Use a separate name so the MiniLM `tokenizer` defined earlier is not overwritten
llm_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

def get_token_count_for_prompt(prompt_text):
    """
    Returns the token count for a given prompt text after the chat template is applied.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt_text},
    ]

    prompt = llm_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # The chat template already inserts <|begin_of_text|>, so do not add special tokens a second time
    encoded = llm_tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
    return len(encoded[0])

In [107]:
texts = []
labels_for_text_summarization = []
title_tokens= []
text_tokens = []
new_title = []
new_text = []
prompt_token_count = []

for i in tqdm(range(0, len(all_text))):
    current_title = all_title[i]
    current_text = all_text[i]
    current_label = labels[i]
    combined_text = f"{current_title}. {current_text}"

    if summarize_text[i]:
        current_available_tokens = available_tokens_for_text_summarization[i]
        current_title_token_count = titles_token_count[i]
        current_text_token_count = texts_token_count[i]
        current_prompt = return_prompt(
            current_title,
            current_text,
            current_title_token_count,
            current_text_token_count,
            current_available_tokens,
        )
        

        texts.append(current_prompt)
        new_title.append(current_title)
        new_text.append(current_text)
        title_tokens.append(current_title_token_count)
        text_tokens.append(current_text_token_count)
        prompt_token_count.append(get_token_count_for_prompt(current_prompt))        
        labels_for_text_summarization.append(current_label)

  0%|          | 0/72134 [00:00<?, ?it/s]

100%|██████████| 72134/72134 [03:15<00:00, 368.11it/s]


In [108]:
df_for_summarization = pd.DataFrame({
    'title': new_title, 
    'text': new_text,
    'label': labels_for_text_summarization,
    "title_tokens": title_tokens,
    "text_tokens": text_tokens,
    "prompt_token_count": prompt_token_count
})

In [None]:
# save the prompts for the texts that need summarization to a CSV file, which will be used later to do the summarization
pd.DataFrame({
    'text': texts,
    'label': labels_for_text_summarization
}).to_csv("Summary_prompts.csv", index=False)

<b>We will now proceed to the notebook llm_summarization.ipynb, which will do the summarization.</b>

In [109]:
summarized_df = pd.read_csv("llama3_2_3b_summary_results1.csv")

In [110]:
summarized_df.head()

Unnamed: 0,text,label,token_count,prompt,summary
0,Input details:\n\n Title: LAW ENFORCEMENT O...,1,1556,<|begin_of_text|><|start_header_id|>system<|en...,Members of the #FYF911 and #BlackLivesMatter m...
1,"Input details:\n\n Title: Bobby Jindal, rai...",0,1975,<|begin_of_text|><|start_header_id|>system<|en...,"Louisiana Gov. Bobby Jindal, a Hindu by birth,..."
2,Input details:\n\n Title: Latest Pipeline L...,1,1011,<|begin_of_text|><|start_header_id|>system<|en...,Underscoring the dangers of America’s unreliab...
3,Input details:\n\n Title: GOP Senator Just...,1,1552,<|begin_of_text|><|start_header_id|>system<|en...,Sen. Ben Sasse (R-Neb.) took aim at Alt-Right ...
4,Input details:\n\n Title: Schumer calls on ...,0,869,<|begin_of_text|><|start_header_id|>system<|en...,"Charles Schumer, the top Democrat in the U.S. ..."


In [111]:
summarized_df

Unnamed: 0,text,label,token_count,prompt,summary
0,Input details:\n\n Title: LAW ENFORCEMENT O...,1,1556,<|begin_of_text|><|start_header_id|>system<|en...,Members of the #FYF911 and #BlackLivesMatter m...
1,"Input details:\n\n Title: Bobby Jindal, rai...",0,1975,<|begin_of_text|><|start_header_id|>system<|en...,"Louisiana Gov. Bobby Jindal, a Hindu by birth,..."
2,Input details:\n\n Title: Latest Pipeline L...,1,1011,<|begin_of_text|><|start_header_id|>system<|en...,Underscoring the dangers of America’s unreliab...
3,Input details:\n\n Title: GOP Senator Just...,1,1552,<|begin_of_text|><|start_header_id|>system<|en...,Sen. Ben Sasse (R-Neb.) took aim at Alt-Right ...
4,Input details:\n\n Title: Schumer calls on ...,0,869,<|begin_of_text|><|start_header_id|>system<|en...,"Charles Schumer, the top Democrat in the U.S. ..."
...,...,...,...,...,...
36863,Input details:\n\n Title: Russians steal re...,0,1288,<|begin_of_text|><|start_header_id|>system<|en...,Hackers believed to be working for the Russian...
36864,Input details:\n\n Title: WATCH: Giuliani ...,1,1073,<|begin_of_text|><|start_header_id|>system<|en...,Giuliani demands that Democrats apologize for ...
36865,Input details:\n\n Title: Migrants Refuse T...,0,929,<|begin_of_text|><|start_header_id|>system<|en...,Thousands of migrants flooded into a train sta...
36866,Input details:\n\n Title: Trump tussle give...,0,1070,<|begin_of_text|><|start_header_id|>system<|en...,Donald Trump's combative style has buffeted Me...


In [128]:
new_title = []
new_text = []
new_label = []
summarized_text = []
title_tokens = []
text_tokens = []

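# Keep the articles that do not need summarization (summarize_text[i] is False).
# Their summary is left as NaN so the columns match the summarized frame later.
for i, (title, text, label) in enumerate(tqdm(zip(all_title, all_text, labels))):
    if not summarize_text[i]: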
        new_title.append(title)
        new_text.append(text)
        new_label.append(label)
        title_tokens.append(titles_token_count[i])
        text_tokens.append(texts_token_count[i])
        summarized_text.append(np.nan)

df1 = pd.DataFrame({
    'title': new_title, 
    'text': new_text,
    'label': new_label,
    "title_tokens": title_tokens,
    "text_tokens": text_tokens,
    'text_summary': summarized_text
})

72134it [00:00, 1007539.18it/s]


In [113]:
df1

Unnamed: 0,title,text,label,title_tokens,text_tokens,summarized_text
0,,Did they post their votes for Hillary already?,1,2,11,
1,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1,30,38,
2,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1,25,425,
3,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1,18,325,
4,DR BEN CARSON TARGETED BY THE IRS: “I never ha...,DR. BEN CARSON TELLS THE STORY OF WHAT HAPPENE...,1,26,19,
...,...,...,...,...,...,...
35250,LEFTY MEDIA DESPERATELY Tries To Bury Trump Bu...,CNN political hack Alison Camarota wants to na...,1,23,196,
35251,BILL O’REILLY Destroys Liberal Pundit On Trump...,,1,16,2,
35252,Swiss stop seizing income from asylum seekers ...,ZURICH (Reuters) - Switzerland will stop seizi...,0,15,320,
35253,WOW! JILL STEIN’S ‘FIRESIDE CHAT’ Exposes Her ...,,1,24,2,


In [114]:
# The longest prompt is far too long to summarize, so we drop every article
# whose prompt has 20,000 tokens or more.
print(df_for_summarization["prompt_token_count"].max())
df_for_summarization = df_for_summarization[df_for_summarization["prompt_token_count"] < 20000]

42427


In [115]:
df_for_summarization = df_for_summarization.reset_index(drop=True)
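Resetting the index matters for the summary assignment that follows: pandas aligns a Series to a DataFrame by index label, not by position, so a filtered frame that keeps gaps in its index would silently receive NaN or mismatched summaries. The sketch below illustrates this on toy data; the frames `prompts` and `summaries` and their values are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the real frames; all values are illustrative only.
prompts = pd.DataFrame({"prompt_token_count": [100, 25000, 300]})
summaries = pd.DataFrame({"summary": ["s0", "s2"]})  # one summary per kept row

# Filtering keeps the original labels, so the index is now [0, 2].
kept = prompts[prompts["prompt_token_count"] < 20000]

# Without a reset, the assignment aligns on labels 0 and 2.
# Label 2 does not exist in `summaries`, so that row gets NaN.
misaligned = kept.copy()
misaligned["text_summary"] = summaries["summary"]

# With a fresh 0..n-1 index, labels and positions coincide.
aligned = kept.reset_index(drop=True)
aligned["text_summary"] = summaries["summary"]

print(misaligned)
print(aligned)
```

In the toy case the second kept row loses its summary unless the index is reset first, which is the failure mode the `reset_index(drop=True)` call guards against.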

In [116]:
df_for_summarization.shape, summarized_df.shape

((36868, 6), (36868, 5))

In [117]:
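# Both frames have 36,868 rows and a fresh 0..n-1 index, so this index-aligned
# assignment pairs rows by position. It assumes the rows are in the same order.
df_for_summarization["text_summary"] = summarized_df['summary']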
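Matching shapes alone do not prove that the rows line up. Since both frames carry a `label` column, comparing labels gives a cheap check on row order before copying the summaries over. The helper below is a hypothetical sketch, not something the notebook defines; `attach_summaries` and its parameters are names introduced here for illustration.

```python
import pandas as pd

def attach_summaries(prompts_df, summaries_df, summary_col="summary"):
    """Hypothetical helper: copy summaries over by position after basic checks."""
    if len(prompts_df) != len(summaries_df):
        raise ValueError("row counts differ; the frames cannot be paired by position")

    # Labels exist in both frames, so they serve as a cheap row-order check.
    same_labels = prompts_df["label"].to_numpy() == summaries_df["label"].to_numpy()
    if not same_labels.all():
        raise ValueError("label columns disagree; the row order may have changed")

    out = prompts_df.copy()
    # to_numpy() drops the index, so the assignment is purely positional.
    out["text_summary"] = summaries_df[summary_col].to_numpy()
    return out
```

Under these assumptions the call would be `attach_summaries(df_for_summarization, summarized_df)`. Equal labels are a necessary condition, not a guarantee, so comparing a few titles or text prefixes as well would make the check stronger.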

In [123]:
print(df_for_summarization.sample(1, random_state=42)["text"].values[0])

Original:

In a brand-new expos  on Mother Jones, investigative journalist Ari Berman delves into an aspect of the 2016 election that has been largely ignored in the shadow of controversies over internal DNC politics, shifting blame, and what is now known to be concrete Russian interference: Voter suppression.On the heels of a report Berman summarized back in September, his article for the November-December issue of the liberal stalwart magazine continues where he left off. Anecdotal evidence of individual voters who were prevented from voting in Wisconsin during the presidential election all adds up to a picture that explains the numbers in the University of Wisconsin-Madison study that Berman cites in his summary.The study, completed by UW-Madison s professor of Political Science, Kenneth Mayer, shows a survey in two large counties in Wisconsin that asked whether respondents were either prevented or deterred from voting in 2016 by the new Voter ID law in effect for the first time in that election. The results of the survey were stunning: More than 11 percent of respondents said they were deterred, and that 6 percent   more than half of the nonvoters   were prevented entirely. If those sound like small numbers, bear with me.In the course of research, voters were asked why they didn t vote, how engaged they were in the election, what types of ID they possessed, how confident they were in the final vote count, and their general demographics. The survey never required any respondent to answer who they may have voted for.Possibly the most distressing part of the findings was how disproportionately the Voter ID law affected low-income and minority voters: More than 21 percent of households under $25,000 of income per year were deterred; that number dropped to just over 7 percent for those above that line. 
The demarcation between white and black voters was even more stark, with just of 8 percent of white voters reporting trouble and 27.5 percent of African-Americans reporting the same.What does all that add up to?As Berman notes in the new piece for MJ, Wisconsin showed the second-highest voter turnout in the United States in both 2008 and 2012, but just four years later, participation was the lowest it had been in 16 years.So let s get back to those numbers.In just the two counties that were surveyed, Dane and Milwaukee, the percentage of registered voters who were adversely affected by the new Voter ID law represents a low estimate of nearly 17,000 and an upper estimate of 23,252 voters. In fact, the study points out, the 11.2 percent nonvoter estimate is actually low in itself, as it doesn t take into account voters who were deterred from even registering by the new law.If every single one of those in the study who answered that they were deterred or prevented from voting had actually cast a vote that counted, voter turnout would have increased by 2.24 percent. And that seems like an incredibly small number, until you extrapolate that across the state. The Wisconsin Election Commission estimated that 3.1 million voters turned out for the 2016 presidential election. That means 69,440 voters   under a conservative estimate   were disenfranchised entirely by the new Voter ID law.Donald Trump s margin of victory in the state? Less than a third of that.Given that the vast majority of disenfranchised voters, although they were not asked their political affiliations, were demographically more likely to vote for Hillary Clinton than for Donald Trump, it hardly requires extrapolation to conclude that, at least in the state of Wisconsin, Trump was right.The vote was rigged.Featured image via Scott Olson/Getty Images


In [124]:
print(df_for_summarization.sample(1, random_state=42)["text_summary"].values[0])


Summary:

In a new report, investigative journalist Ari Berman examines voter suppression in the 2016 US presidential election, focusing on Wisconsin. A University of Wisconsin-Madison study found that 11% of respondents in two counties were deterred from voting due to the new Voter ID law, while 6% were prevented entirely. The study also showed that low-income and minority voters were disproportionately affected, with 21% of households under $25,000 in income deterred and 27.5% of African-Americans reporting trouble. The study estimates that nearly 17,000 to 23,252 voters were affected, which could have increased turnout by 2.24% if they had voted. This translates to 69,440 disenfranchised voters, with most being demographically likely to vote for Hillary Clinton. The report suggests that Donald Trump's margin of victory in Wisconsin was less than a third of the total disenfranchised voters, implying that the election may have been "rigged."


In [130]:
# Match df1's columns and their order; this drops prompt_token_count, which df1 does not have.
df_for_summarization = df_for_summarization[df1.columns]

In [131]:
final_df = pd.concat([df1, df_for_summarization], ignore_index=True)
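A quick sanity check after the concatenation can confirm that no rows were lost and that only the summarized articles carry a summary. The sketch below uses toy frames (`short_rows` and `long_rows` are illustrative stand-ins for `df1` and `df_for_summarization`, with made-up values) and mirrors the column selection from the previous step.

```python
import pandas as pd

# Toy stand-ins; the real frames are df1 and df_for_summarization.
short_rows = pd.DataFrame({
    "text": ["a short article"],
    "label": [1],
    "text_summary": [None],  # the notebook stores np.nan here
})
long_rows = pd.DataFrame({
    "text": ["a very long article ..."],
    "label": [0],
    "prompt_token_count": [1500],
    "text_summary": ["a summary"],
})

# Keep only the shared columns, in the same order (drops prompt_token_count).
long_rows = long_rows[short_rows.columns]
combined = pd.concat([short_rows, long_rows], ignore_index=True)

# Every input row should survive, and only the long rows should have a summary.
assert len(combined) == len(short_rows) + len(long_rows)
assert combined["text_summary"].notna().sum() == len(long_rows)
```

The same two assertions could be run against `final_df`, `df1`, and `df_for_summarization`, assuming every summarized article actually received a non-empty summary.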

In [133]:
final_df.to_csv("final_data_with_summarization.csv", index=False)