In [1]:
import sys

import pandas as pd

from utils import ChatBotSummarizer

In [2]:
df = pd.read_csv("news_articles_data_2023.csv")
df.shape

(1958, 7)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   news_id     1958 non-null   int64 
 1   title       1958 non-null   object
 2   body        1958 non-null   object
 3   region      1958 non-null   object
 4   category    1958 non-null   object
 5   image       1958 non-null   object
 6   added_date  1958 non-null   object
dtypes: int64(1), object(6)
memory usage: 107.2+ KB


In [4]:
df["body_length"] = df["body"].str.len()

In [5]:
df.head(2)

Unnamed: 0,news_id,title,body,region,category,image,added_date,body_length
0,173042,Women's World Cup 2023. Philippines win over N...,The Philippine national team revived its hopes...,العالم,كرة القدم,https://images.alkass.net/newsimages/large_202...,2023-07-25 11:36:00,637
1,173041,Muaither Club announces the signing of player ...,"Muaither Sports Club announced today, Monday, ...",قطر,كرة القدم,https://images.alkass.net/newsimages/large_202...,2023-07-25 10:35:00,801


In [6]:
# let's see the body of the first article
print(df.body[0])

The Philippine national team revived its hopes of qualifying for the second round (the price of the final) of the 2023 Women's World Cup, with its historic victory (1-0) over New Zealand, the owner of the land, today, Tuesday, in the second round of Group A matches in the tournament.The Philippine national team settled the match with a clean goal in the first half, scored by Sarina Bolden in the 24th minute.The Philippine team won its first 3 points in the group, and it is its first 3 points ever in the Women's World Cup, as it is playing the tournament for the first time in its history, New Zealand's balance stopped at 3 points.


In [7]:
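# Build a 100-article sample: the shortest and longest articles plus 98 random others
smallest, biggest = df["body_length"].min(), df["body_length"].max()

extremes = df.query("body_length == @smallest or body_length == @biggest")
# `and` is required here: with `or` the filter matches every row,
# so the random draw could include the extremes themselves
others = df.query("body_length != @smallest and body_length != @biggest").sample(98)
sample = pd.concat([others, extremes], ignore_index=True)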

In [8]:
sample.shape

(100, 8)

In [9]:
gpt_response, langchain_response = ChatBotSummarizer().chat_and_summarize(df.body[0])

print(gpt_response)
print()
print(langchain_response)

The Philippine national team won 1-0 against New Zealand in the second round of Group A matches in the 2023 Women's World Cup, giving them a chance to qualify for the second round. Sarina Bolden scored the only goal of the match in the 24th minute. It was the Philippines' first-ever victory and first three points in the Women's World Cup.

 The Philippine national team won their first-ever victory in the 2023 Women's World Cup, beating New Zealand 1-0 in the second round of Group A matches. Sarina Bolden scored the only goal of the match in the 24th minute. The win gives the Philippines a chance to qualify for the second round.<|im_end|>


In [12]:
# Select the longest article (news_id 170541) and pull out its body text
content = df.loc[df["news_id"] == 170541, "body"].iloc[0]

content

'State institutions participated in the activities of the twelfth edition of the Sports Day, which is held this year under the slogan "The choice is yours", in response to Emiri Resolution No. (80) of 2011, which stipulated that Tuesday of the second week of February of each year will be a sports day for the state, during which ministries and other government agencies, public authorities and institutions organize sports events in which workers and their families participate, to achieve awareness of the importance of sports and its role in the lives of individuals and societies, and make it a lifestyle that is practiced throughout Among the most prominent entities that participated today in these events are the Ministry of Justice, the Ministry of Social Development and Family, and the National Service Academy in joint activities organized at the headquarters of the National Service Academy "Camp Meqdem", in the presence of His Excellency Mr. Masoud bin Mohammed Al Ameri, Minister of Ju

In [13]:
gpt_response, langchain_response = ChatBotSummarizer().chat_and_summarize(content)

print(gpt_response)
print()
print(langchain_response)

InvalidRequestError: This model's maximum context length is 8192 tokens. However, your messages resulted in 9384 tokens. Please reduce the length of the messages.
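The error above says the prompt exceeded the model's 8192-token context window. One common workaround is to split the article into chunks, summarize each chunk, and then summarize the combined partial summaries. The sketch below is a minimal illustration under stated assumptions, not the method used in this notebook: `split_into_chunks` and `summarize_long_text` are hypothetical helpers, `summarize` is any callable that maps a string to a summary string, and the character budget relies on the rough heuristic of about 4 characters per English token rather than on a real tokenizer.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 12000) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    12000 characters is roughly 3000 tokens under the ~4 chars/token heuristic
    (an assumption; use a real tokenizer for exact counts).
    """
    # Split after sentence-ending punctuation; whitespace after it is optional
    # because the articles often omit the space after a period.
    sentences = [s for s in re.split(r"(?<=[.!?])\s*", text) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself over budget.
        pieces = [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            if current and len(current) + 1 + len(piece) > max_chars:
                chunks.append(current)
                current = piece
            else:
                current = f"{current} {piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks


def summarize_long_text(text: str, summarize: Callable[[str], str], max_chars: int = 12000) -> str:
    """Summarize each chunk, then summarize the joined partial summaries."""
    chunks = split_into_chunks(text, max_chars)
    if len(chunks) <= 1:
        return summarize(text)
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize(" ".join(partial_summaries))
```

In this notebook the callable could be something like `lambda body: ChatBotSummarizer().openai_chat_completion(body)`. For extremely long inputs the joined partial summaries could themselves exceed the limit, in which case the reduction step would need to be repeated.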

In [14]:
langchain_response = ChatBotSummarizer().langchain_summarize(content)
langchain_response

InvalidRequestError: Too many inputs. The max number of inputs is 1.  We hope to increase the number of inputs per request soon. Please contact us through an Azure support request at: https://go.microsoft.com/fwlink/?linkid=2213926 for further questions.

In [40]:
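# Summarize every article; store None for rows where the API call fails

summarization = []
for idx, row in df.iterrows():
    try:
        summarization.append(ChatBotSummarizer().openai_chat_completion(row.body).strip())
    except Exception as e:
        # Print the error message and keep going so one bad row does not abort the run
        print(e)
        summarization.append(None)

    # Overwrite a single progress line: "<rows done>/<total rows>"
    sys.stdout.write(f"\r{idx+1}/{df.shape[0]}")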

90/1958'content'
1474/1958This model's maximum context length is 8192 tokens. However, your messages resulted in 9441 tokens. Please reduce the length of the messages.
1958/1958
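In the output above the error messages are fused with the progress counter (`90/1958'content'`), and the exception type is lost. The bare `'content'` message looks like a `KeyError`, though the output alone does not confirm it. A minimal sketch of an alternative that records each failure with its type, keyed by row position, follows. `summarize_all` is a hypothetical helper and `summarize` is any callable that maps an article body to a summary string.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def summarize_all(
    texts: Iterable[str],
    summarize: Callable[[str], str],
) -> Tuple[List[Optional[str]], Dict[int, str]]:
    """Return (summaries, failures); failures maps row position to 'ErrorType: message'."""
    summaries: List[Optional[str]] = []
    failures: Dict[int, str] = {}
    for position, text in enumerate(texts):
        try:
            summaries.append(summarize(text).strip())
        except Exception as exc:  # broad on purpose: skip the row, keep the run alive
            summaries.append(None)
            failures[position] = f"{type(exc).__name__}: {exc}"
    return summaries, failures
```

Here it could be called as `summarize_all(df["body"], lambda body: ChatBotSummarizer().openai_chat_completion(body))`. Because `df` has a default `RangeIndex`, the positions in `failures` coincide with the index labels.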

In [41]:
df['summarization'] = summarization

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   news_id        1958 non-null   int64 
 1   title          1958 non-null   object
 2   body           1958 non-null   object
 3   region         1958 non-null   object
 4   category       1958 non-null   object
 5   image          1958 non-null   object
 6   added_date     1958 non-null   object
 7   summarization  1956 non-null   object
dtypes: int64(1), object(7)
memory usage: 122.5+ KB


In [43]:
df.to_csv("news_articles_data_2023_summarized.csv", index=False)

In [44]:
## How many articles are not summarized?

df["summarization"].isnull().sum()

2

In [12]:
# Print the full body of the article that exceeded the context limit
print(df.loc[df["news_id"] == 170541, "body"].iloc[0])

State institutions participated in the activities of the twelfth edition of the Sports Day, which is held this year under the slogan "The choice is yours", in response to Emiri Resolution No. (80) of 2011, which stipulated that Tuesday of the second week of February of each year will be a sports day for the state, during which ministries and other government agencies, public authorities and institutions organize sports events in which workers and their families participate, to achieve awareness of the importance of sports and its role in the lives of individuals and societies, and make it a lifestyle that is practiced throughout Among the most prominent entities that participated today in these events are the Ministry of Justice, the Ministry of Social Development and Family, and the National Service Academy in joint activities organized at the headquarters of the National Service Academy "Camp Meqdem", in the presence of His Excellency Mr. Masoud bin Mohammed Al Ameri, Minister of Jus

In [45]:
df[df["summarization"].isnull()]

Unnamed: 0,news_id,title,body,region,category,image,added_date,summarization
90,172950,"""Paris Panthers"" champion of the Champions Lea...","The ""Paris Panthers"" team achieved the title o...",قطر,العاب مختلفة,https://images.alkass.net/newsimages/large_202...,2023-07-15 12:24:00,
1474,170541,Distinguished participation of government agen...,State institutions participated in the activit...,قطر,العاب مختلفة,https://images.alkass.net/newsimages/large_202...,2023-02-14 20:01:00,


In [9]:
LangchainSummarization(df.body[0]).summarize()

                temperate was transferred to model_kwargs.
                Please confirm that temperate is what you intended.


AttributeError: 'str' object has no attribute 'page_content'
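The `AttributeError` suggests that the raw string reached code that expects document objects exposing a `page_content` attribute, as LangChain's `Document` class does. How `LangchainSummarization` handles its input internally is not shown here, so the following is only a sketch of the failure mode and the usual remedy of wrapping each string first. `Doc`, `to_docs`, and `join_contents` are hypothetical stand-ins so the example runs without LangChain installed; in real use LangChain's own `Document` class would likely take the place of `Doc`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class Doc:
    """Stand-in for a document object; it only mirrors the `page_content` attribute."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_docs(texts: Iterable[str]) -> List[Doc]:
    """Wrap each raw string in a document object."""
    return [Doc(page_content=text) for text in texts]


def join_contents(docs: Iterable[Any]) -> str:
    """Mimic code that reads `.page_content` from every item it receives."""
    return "\n\n".join(doc.page_content for doc in docs)


# Wrapped strings work; a bare string fails because iterating over it yields
# one-character `str` objects, which have no `page_content` attribute.
wrapped = to_docs(["first article", "second article"])
combined = join_contents(wrapped)
```

Passing a plain string to `join_contents` raises the same `'str' object has no attribute 'page_content'` message seen above.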