Based on: https://github.com/tomasonjo/blogs/blob/master/youtube/video2graph.ipynb

Uses newspaper3k: https://pythonrepo.com/repo/codelucas-newspaper-python-web-crawling

In [1]:
import pandas as pd
import newspaper
import openai
import tiktoken
from secret_credentials import OPENAI_API_KEY


In [2]:
def num_tokens_from_messages(messages, model="gpt-3.5-turbo"):
  """Returns the number of tokens used by a list of messages."""
  try:
      encoding = tiktoken.encoding_for_model(model)
  except KeyError:
      encoding = tiktoken.get_encoding("cl100k_base")
  if model == "gpt-3.5-turbo":  # note: future models may deviate from this
      num_tokens = 0
      for message in messages:
          num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
          for key, value in message.items():
              num_tokens += len(encoding.encode(value))
              if key == "name":  # if there's a name, the role is omitted
                  num_tokens += -1  # role is always required and always 1 token
      num_tokens += 2  # every reply is primed with <im_start>assistant
      return num_tokens
  else:
      raise NotImplementedError(f"""num_tokens_from_messages() is not presently implemented for model {model}.
  See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")

In [33]:
# set openai system prompt
openai.api_key = OPENAI_API_KEY
prompt_system = '''You are an expert financial, economic, and political analyst helping to read news articles and extract relevant information (entities and relationships) into a knowledge and current events graph. As input, you will accept the text of a news article. The first line of the input will always be the article headline. You will generate output that contains three sections:
entity - All relevant entities (related to finance, the economy, or politics), labeled with an appropriate descriptive category. Each entity is written on its own line as “LABEL {Entity Name}”
relationship - All direct relationships between the extracted entities. Each relationship is written on its own line as: “{Head Entity Name} RELATIONSHIP {Tail Entity Name}”
current_event - All news items (actions or events described in the article that involve one or more of the extracted entities but are not simple direct relationships), along with the associated entities (only reference entities which you have previously defined in the first section). Each event and associated entities is written on its own line as “NEWS_ITEM {Entity 1}, {Entity 2}, {…}, {Entity n}”

To help you understand the requirements, here are 2 examples:
EXAMPLE 1:
INPUT
DeSantis threatens Disney with legal retaliation
Florida Governor Ron DeSantis escalated the state's ongoing legal battle with Disney for control over their special district in Orlando, Florida.

OUTPUT
entity:
PLACE {Florida} 
PLACE {Orlando}
PERSON {Ron DeSantis}
COMPANY {Disney}
PLACE {Disney Special District}

relationship:
{Ron DeSantis} GOVERNOR_OF {Florida}
{Disney} OWNS {Disney Special District}
{Disney Special District} IN {Orlando}
{Orlando} IN {Florida} 

current_event:
ONGOING_LEGAL_BATTLE {Florida}, {Disney}, {Ron DeSantis}, {Disney Special District}

EXAMPLE 2:
INPUT
Samsung to cut chip production after posting lowest profit in 14 years
Seoul Reuters —
Samsung Electronics said on Friday it would make a “meaningful” cut to chip production after flagging a worse-than-expected 96% plunge in quarterly operating profit, as a sharp downturn in the global semiconductor market worsens.
Shares in the world’s largest memory chip and TV maker rose 3% in early trading, while rival SK Hynix shares surged 5% as investors welcomed plans to cut production to help preserve pricing power.
Samsung (SSNLF) estimated its operating profit fell to 600 billion won ($455.5 million) in January-March, from 14.12 trillion won a year earlier, in a short preliminary earnings statement. It was the lowest profit for any quarter in 14 years.
“Memory demand dropped sharply … due to the macroeconomic situation and slowing customer purchasing sentiment, as many customers continue to adjust their inventories for financial purposes,” it said in the statement.
“We are lowering the production of memory chips by a meaningful level, especially that of products with supply secured,” it added, in a reference to those with sufficient inventories.
The production cut signal is unusually strong for Samsung, which previously said it would make small adjustments like pauses for refurbishing production lines but not a full-blown cut.
It did not disclose the size of the planned cut.
The first-quarter profit fell short of a 873 billion won Refinitiv SmartEstimate, weighted toward analysts who are more consistently accurate. Multiple estimates were revised down earlier this week.
It was the lowest since a 590 billion won profit in the first quarter of 2009, according to company data.
With consumer demand for tech devices sluggish due to rising inflation, semiconductor buyers including data center operators and smartphone and personal computer makers are refraining from new chip purchases and using up inventories.
Analysts estimated the chip division sustained quarterly losses of more than 4 trillion won ($3.03 billion) as memory chip prices fell and its inventory values were slashed.
This would be the chip business’ first quarterly loss since the first quarter of 2009, a major divergence for what is normally a cash cow that generates about half of Samsung’s profits in better years.
Revenue likely fell 19% from the same period a year earlier to 63 trillion won, Samsung said.
The company is due to release detailed earnings, including divisional breakdowns, later this month.

OUTPUT
entity:
COMPANY {Samsung Electronics}
COMPANY {SK Hynix}
INDUSTRY {semiconductors}
PRODUCT {memory chips}
PLACE {Seoul}
GROUP {semiconductor buyers}
GROUP {data center operators}
GROUP {smartphone and personal computer makers}

relationship:
{Samsung Electronics} IN {Seoul}
{Samsung Electronics} PRODUCES {memory chips}
{memory chips} ARE {semiconductors}
{SK Hynix} PRODUCES {memory chips}
{SK Hynix} IS_RIVAL_OF {Samsung Electronics}
{data center operators} ARE {semiconductor buyers}
{smartphone and personal computer makers} ARE {semiconductor buyers}

current_event:
CHIP_PRODUCTION_CUT {Samsung Electronics, SK Hynix}
PROFITS_DECLINED {Samsung Electronics, SK Hynix, semiconductor}
DECLINE_IN_MEMORY_CHIP_DEMAND {data center operators, smartphone and personal computer makers}
'''

In [37]:
# download article
#cnn_paper = newspaper.build('http://cnn.com')
#articles = [article for article in cnn_paper.articles] # if "business" in article.url] # and article.url.endswith("index.html")
article = newspaper.Article("https://www.cnn.com/2023/04/03/tech/china-micron-probe-us-chip-war-intl-hnk/index.html")
article.download()
article.parse()

prompt_input = article.title + "\r\n" + article.text

In [38]:
prompt_messages=[
        {"role": "system", "content": prompt_system},
        {"role": "user", "content": prompt_input}
    ]
num_tokens_from_messages(prompt_messages, "gpt-3.5-turbo")

1683

In [39]:
response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=prompt_messages,
  temperature=0 #most deterministic
)

In [40]:
restext = response["choices"][0]["message"]["content"]

In [41]:
for line in restext.split('\n'):
    print(line)

entity:
COMPANY {Micron Technology}
COUNTRY {China}
ORGANIZATION {Cyberspace Administration of China}
COUNTRY {Japan}
COUNTRY {United States}
COUNTRY {Netherlands}
INDUSTRY {semiconductor}
PRODUCT {memory chips}
COMPANY {Mintz Group}
COMPANY {Deloitte}

relationship:
{Micron Technology} PRODUCES {memory chips}
{Micron Technology} DERIVES_REVENUE_FROM {China}
{Cyberspace Administration of China} CONDUCTS_CYBERSECURITY_PROBE_OF {Micron Technology}
{Japan} RESTRICTS_EXPORT_OF {advanced chip manufacturing equipment}
{United States} BANS_CHINESE_COMPANIES_FROM_BUYING {advanced chips and chipmaking equipment}
{Netherlands} RESTRICTS_OVERSEAS_SALES_OF {semiconductor technology}
{Mintz Group} HAS_BEIJING_OFFICE_CLOSED_BY {Chinese authorities}
{Deloitte} HAS_OPERATIONS_SUSPENDED_BY {Chinese authorities}

current_event:
CYBERSECURITY_PROBE {Micron Technology, China}
RESTRICTIONS_ON_TECH_EXPORTS {China, Japan, United States, Netherlands}
OFFICE_CLOSURE_AND_SUSPENSION {Mintz Group, Deloitte, China

In [32]:
for line in prompt_input.split('\n'):
    print(line)

Samsung to cut chip production after posting lowest profit in 14 years
Seoul Reuters —

Samsung Electronics said on Friday it would make a “meaningful” cut to chip production after flagging a worse-than-expected 96% plunge in quarterly operating profit, as a sharp downturn in the global semiconductor market worsens.

Shares in the world’s largest memory chip and TV maker rose 3% in early trading, while rival SK Hynix shares surged 5% as investors welcomed plans to cut production to help preserve pricing power.

Samsung (SSNLF) estimated its operating profit fell to 600 billion won ($455.5 million) in January-March, from 14.12 trillion won a year earlier, in a short preliminary earnings statement. It was the lowest profit for any quarter in 14 years.

“Memory demand dropped sharply … due to the macroeconomic situation and slowing customer purchasing sentiment, as many customers continue to adjust their inventories for financial purposes,” it said in the statement.

“We are lowering the 