<a href="https://colab.research.google.com/github/TurkuNLP/textual-data-analysis-course/blob/main/openai_api_ex6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Data
!wget http://dl.turkunlp.org/TKO_8964_2023/news-en-2021.jsonl

--2025-02-04 11:07:47--  http://dl.turkunlp.org/TKO_8964_2023/news-en-2021.jsonl
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3385882 (3.2M) [application/octet-stream]
Saving to: ‘news-en-2021.jsonl’


2025-02-04 11:07:49 (3.06 MB/s) - ‘news-en-2021.jsonl’ saved [3385882/3385882]



In [5]:
import json

# Collect the first 10 articles that are at most 300 whitespace-separated tokens long
news_articles = []
with open("news-en-2021.jsonl", "rt", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        if len(data["text"].split()) > 300:  # skip long documents
            continue
        news_articles.append(data["text"])
        if len(news_articles) == 10:
            break

for a in news_articles:
    print(a, "\n\n\n")

Rental fees for non-subsidised apartments rose across most of Finland during April to June, compared to the same period a year ago, according to data from Statistics Finland. 
 On average, rents rose by 0.9 percent during that period across the country. 
 Timo Metsola , board chair of rental agency Vuokraturva, attributed the increase to growing demand, saying that competition clearly intensified for the most desirable properties. 
 The sharpest rise in apartment rents during the April-June period was seen in the city of Turku, where costs rose by 1.6 percent, with the city of Tampere seeing an increase of 1.4 percent. 
 Meanwhile in the Greater Helsinki area, rents ticked up by 0.9 percent. 
 Among the country's municipal centres, the town of Mikkeli was the only area which saw rental fees decline. 
 Still priciest in Helsinki area 
 The number-crunching agency reported that the median rent for a studio apartment in central Helsinki was 809 euros per month, while they stood at 583 eur

In [20]:
# Let's write a prompt
# Note: this prompt is experimental and has not been evaluated!
# For NER prompting strategies, see e.g. https://arxiv.org/pdf/2305.15444 or https://arxiv.org/abs/2304.10428

prompt = """You are an expert in Named Entity Recognition (NER), specializing in the CoNLL-2003 dataset for English named entities. Your task is to extract named entities from a given news article using the CoNLL-2003 definitions and your expertise.

Format your response exactly as follows:

## Named Entity Extraction
PER: [Comma-separated list of extracted PERSON names]
ORG: [Comma-separated list of extracted ORGANIZATION names]
LOC: [Comma-separated list of extracted LOCATION names]
MISC: [Comma-separated list of extracted MISCELLANEOUS names]

Guidelines:

* Do not include any additional text beyond the extracted entities.
* Maintain the exact formatting above.
* If an entity type has no mentions in the article, omit that category.

News article (may include newlines):
"""
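
The response format requested in the prompt (one `LABEL: name, name, ...` line per entity type) can be turned back into a Python dictionary for later processing. The sketch below is not part of the original notebook: `parse_ner_response` and `LABELS` are hypothetical names, and the example string is hand-written in the requested format, not actual model output.

```python
import re

# Hypothetical helper (an assumption, not from the original notebook):
# parse a reply in the "LABEL: name, name" format requested by the prompt.
LABELS = ("PER", "ORG", "LOC", "MISC")

def parse_ner_response(text):
    """Return a dict mapping each label to the list of names found in the reply."""
    entities = {label: [] for label in LABELS}
    for line in text.splitlines():
        match = re.match(r"^\s*(PER|ORG|LOC|MISC)\s*:\s*(.*)$", line)
        if not match:
            continue  # skip the heading and any unexpected extra text
        label, names = match.groups()
        entities[label].extend(n.strip() for n in names.split(",") if n.strip())
    return entities

# Hand-written example in the requested format (NOT real model output)
example = """## Named Entity Extraction
PER: Timo Metsola
ORG: Statistics Finland, Vuokraturva
LOC: Finland, Turku, Tampere
"""
parsed = parse_ner_response(example)
print(parsed)
```

Categories omitted from the reply end up as empty lists. Note that splitting on commas would break an entity name that itself contains a comma, so this is only a starting point.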

In [28]:
from openai import OpenAI
from google.colab import userdata

# Read the API key from Colab's secret storage (saved under the name 'openai_token')
my_token = userdata.get('openai_token')

client = OpenAI(api_key=my_token)

results = []

print("Number of news articles:", len(news_articles))

# Send each article, appended to the prompt, as a single user message
for i, news in enumerate(news_articles):
    print(i)  # progress indicator
    chat_completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt + news}],
    )
    results.append(chat_completion)

Number of news articles: 10
0
1
2
3
4
5
6
7
8
9
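
The request loop above can be wrapped in a small helper that returns only the reply text. This is a minimal sketch, not the notebook's own code: `extract_entities` is a hypothetical name, and setting `temperature=0` is an assumption made here to reduce run-to-run variation. The stand-in client below only mimics the shape of the OpenAI client so the helper can be tried without an API key.

```python
from types import SimpleNamespace

def extract_entities(client, prompt, article, model="gpt-4o-mini"):
    """Send one article to the chat completions endpoint and return the reply text."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # assumption: less randomness makes runs easier to compare
        messages=[{"role": "user", "content": prompt + article}],
    )
    return completion.choices[0].message.content

# Offline stand-in (hypothetical) that echoes the user message back
def fake_create(**kwargs):
    reply = SimpleNamespace(content="echo: " + kwargs["messages"][0]["content"])
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])

fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=fake_create))
)

reply_text = extract_entities(fake_client, "PROMPT\n", "ARTICLE")
print(reply_text)
```

With a real `OpenAI` client in place of `fake_client`, the same function could be called once per article, for example `[extract_entities(client, prompt, news) for news in news_articles]`.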


In [31]:
for news, chat in zip(news_articles, results):
  print(news, "\n")
  print(chat.choices[0].message.content, "\n\n\n")

Rental fees for non-subsidised apartments rose across most of Finland during April to June, compared to the same period a year ago, according to data from Statistics Finland. 
 On average, rents rose by 0.9 percent during that period across the country. 
 Timo Metsola , board chair of rental agency Vuokraturva, attributed the increase to growing demand, saying that competition clearly intensified for the most desirable properties. 
 The sharpest rise in apartment rents during the April-June period was seen in the city of Turku, where costs rose by 1.6 percent, with the city of Tampere seeing an increase of 1.4 percent. 
 Meanwhile in the Greater Helsinki area, rents ticked up by 0.9 percent. 
 Among the country's municipal centres, the town of Mikkeli was the only area which saw rental fees decline. 
 Still priciest in Helsinki area 
 The number-crunching agency reported that the median rent for a studio apartment in central Helsinki was 809 euros per month, while they stood at 583 eur

In [37]:
# For comparison, run a token-classification pipeline with dslim/bert-base-NER
import transformers

pipeline = transformers.pipeline('token-classification', model='dslim/bert-base-NER', tokenizer='dslim/bert-base-NER')

pipeline_results = pipeline(news_articles, aggregation_strategy="max")

for news, lm_r, pipeline_r in zip(news_articles, results, pipeline_results):
  print(news, "\n")
  print(lm_r.choices[0].message.content, "\n")
  print("Pipeline:")
  for ne in pipeline_r:
    print(ne["entity_group"], ne["word"])
  print("\n\n\n")


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Rental fees for non-subsidised apartments rose across most of Finland during April to June, compared to the same period a year ago, according to data from Statistics Finland. 
 On average, rents rose by 0.9 percent during that period across the country. 
 Timo Metsola , board chair of rental agency Vuokraturva, attributed the increase to growing demand, saying that competition clearly intensified for the most desirable properties. 
 The sharpest rise in apartment rents during the April-June period was seen in the city of Turku, where costs rose by 1.6 percent, with the city of Tampere seeing an increase of 1.4 percent. 
 Meanwhile in the Greater Helsinki area, rents ticked up by 0.9 percent. 
 Among the country's municipal centres, the town of Mikkeli was the only area which saw rental fees decline. 
 Still priciest in Helsinki area 
 The number-crunching agency reported that the median rent for a studio apartment in central Helsinki was 809 euros per month, while they stood at 583 eur
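
Reading the two outputs side by side gets tedious, so one option is to group the pipeline results by entity type and compare them as sets with the names listed in the model's reply. The following is a rough sketch under stated assumptions: `group_pipeline_entities` and `compare_entities` are hypothetical helpers, the LLM side is assumed to be already available as a dictionary from label to list of names, and the toy inputs are hand-written rather than taken from an actual run.

```python
from collections import defaultdict

def group_pipeline_entities(pipeline_output):
    """Group token-classification results by entity type, dropping duplicates."""
    grouped = defaultdict(set)
    for ne in pipeline_output:
        grouped[ne["entity_group"]].add(ne["word"])
    return dict(grouped)

def compare_entities(llm_entities, pipeline_entities):
    """For each label, list names found by both systems or by only one of them."""
    report = {}
    for label in sorted(set(llm_entities) | set(pipeline_entities)):
        llm = set(llm_entities.get(label, []))
        pipe = set(pipeline_entities.get(label, []))
        report[label] = {
            "both": sorted(llm & pipe),
            "llm_only": sorted(llm - pipe),
            "pipeline_only": sorted(pipe - llm),
        }
    return report

# Hand-written toy inputs (NOT results from an actual run)
toy_pipeline_output = [
    {"entity_group": "PER", "word": "Timo Metsola"},
    {"entity_group": "LOC", "word": "Turku"},
    {"entity_group": "LOC", "word": "Turku"},
]
toy_llm_entities = {"PER": ["Timo Metsola"], "LOC": ["Turku", "Tampere"]}

report = compare_entities(toy_llm_entities, group_pipeline_entities(toy_pipeline_output))
for label, diff in report.items():
    print(label, diff)
```

Exact string matching is strict: differences in casing, whitespace, or span boundaries count as disagreements, so the report is a quick diagnostic rather than a proper evaluation.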