```
Proof of Concept:

Twitter Post (1 row):

  (ideal)
  - Keyword filtering    (already applied in the dataset)
  - Person extraction    (by PoS or @ mentions)
  - Aspect extraction    (by nouns from NER / PoS / KBBI)
  - Sentiment generation (per detected aspect) -> translate to English (BERT pair classification)

  (mvp)
  - Use OpenAI for everything. :D

Expected Result (tabular format)

source data:
| name | tweets | re-tweets | ... |

enriched result:
| name | tweets | re-tweets | Person / Organization (NER) | Aspect - Sentiment (ABSA) | Topic - (input by user) |
```
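
The "ideal" pipeline lists person extraction by `@` as one step. A minimal sketch of that step is shown here, assuming that a handle is 1 to 15 word characters following `@`. The helper name `extract_mentions` and the pattern are hypothetical and not part of the notebook; this would only cover explicitly tagged accounts, not names written in plain text.

```python
import re
from typing import List

# Assumption: a handle is "@" followed by 1-15 letters, digits, or underscores,
# and the "@" is not preceded by a word character (which skips e-mail addresses).
MENTION_PATTERN = re.compile(r"(?<!\w)@(\w{1,15})\b")


def extract_mentions(tweet: str) -> List[str]:
    """Return the handles mentioned in a tweet, in order, without duplicates."""
    handles: List[str] = []
    for handle in MENTION_PATTERN.findall(tweet):
        if handle not in handles:
            handles.append(handle)
    return handles


if __name__ == "__main__":
    print(extract_mentions("Diremehkan, Citra Pak @prabowo menjadi yang teratas"))
```

Names that appear without a handle (for example "Puan Maharani") would still need the PoS or NER route named in the plan.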

In [36]:
# Load libraries 
import os
import re
import time
import openai 
import pandas as pd 
from tqdm import tqdm
from typing import Tuple
from dotenv import load_dotenv


load_dotenv()
pd.set_option("display.max_columns", None)

In [11]:
# Setting credentials
OPENAI_KEY = os.getenv("OPENAI_API_KEY", default = None) 
openai.api_key = OPENAI_KEY

In [37]:
# Load dataset
data = pd.read_csv("../dataset/data_twitter_pemilu_2024.csv")
data.head()

Unnamed: 0,name,text,rt,id
0,prabowo,"Megawati Soekarnoputri, diyakini akan menjadik...",0,1552261054964461568
1,prabowo,"Diremehkan, Citra Pak @prabowo menjadi terting...",3,1551415694738313216
2,prabowo,Dulu Tuhan disuruh menangin Prabowo atau kagak...,0,1551415694738313216
3,prabowo,@SantorinisSun Loh miss valak masih menyembah ...,0,1551415694738313216
4,prabowo,Yth bapak Presiden republik Indonesia Ir Haji ...,39,1552234605419237376


In [38]:
# Check for duplicate rows (same text, id, and retweet count)
data.duplicated(subset = ['text', 'id', 'rt']).value_counts()

False    625
True     160
Name: count, dtype: int64

In [39]:
# Preview the duplicated rows
data[data.duplicated(subset = ['text', 'id', 'rt'])].head(10)

Unnamed: 0,name,text,rt,id
10,prabowo,Yth bapak Presiden republik Indonesia Ir Haji ...,39,1552234605419237376
15,prabowo,Kapolri Jenderal Listyo Sigit Prabowo mengatak...,1,1552476373855244289
21,prabowo,Ngopi daring tayang siang ini di Youtube @kemh...,2,1551476092749447168
22,prabowo,Yth bapak Presiden republik Indonesia Ir Haji ...,39,1552234605419237376
23,prabowo,Yth bapak Presiden republik Indonesia Ir Haji ...,39,1552234605419237376
33,prabowo,Yth bapak Presiden republik Indonesia Ir Haji ...,39,1552234605419237376
37,prabowo,Yth bapak Presiden republik Indonesia Ir Haji ...,39,1552234605419237376
42,prabowo,Yth bapak Presiden republik Indonesia Ir Haji ...,39,1552234605419237376
45,prabowo,"Catat nih, Pak Prabowo menduduki tempat dipunc...",2,1551415694880677891
49,prabowo,Yth bapak Presiden republik Indonesia Ir Haji ...,39,1552234605419237376


In [40]:
# Drop the duplicate rows
data = data.drop_duplicates(subset = ['text', 'id', 'rt'])

In [41]:
# Verify that no duplicates remain
data.duplicated(subset = ['text', 'id', 'rt']).value_counts()

False    625
Name: count, dtype: int64

In [42]:
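# Define the prompt and the ingestion function
def prompt_enrichment(tweet_comment: str) -> str:
    # The prompt is kept in Indonesian to match the tweets. In English, it asks the
    # model to answer in the given format only, without extra explanation:
    #   named_entity_recognition: entities labeled "PERSON" or "ORGANIZATION" only
    #   aspect_sentiment: verb / noun-phrase aspects found via part-of-speech, each
    #                     with its sentiment, formatted as <aspect (sentiment)>
    prompt = \
    f"""
    Ekstraksi informasi yang dibutuhkan berdasarkan komentar twitter dibawah, dengan response cukup sesuai yang di definisikan tanpa penjelasan tambahan.

    komentar_twitter: "{tweet_comment}"

    Untuk response cukup isi dengan format dibawah.
    named_entity_recognition: [Jawaban anda: cakupan NER sesuai label "PERSON" atau "ORGANIZATION" saja]
    aspect_sentiment: [Identifikasi verb / noun-phrase hasil dari part-of-speech di dalam komentar, disertai dengan nilai sentiment masing-masing aspect dengan format <aspect (sentiment)>]
    """
    return prompt

def ingest_openai(tweet_comment: str, model_base: str = "gpt-3.5-turbo") -> Tuple[str, int]:
    # Returns (response text, total tokens used).
    # Note: openai.ChatCompletion is the pre-1.0 interface of the openai package.
    token_usage = 0
    response_extraction = ""
    try:
        response = openai.ChatCompletion.create(
            model = model_base, 
            messages = [{"role" : "user", "content" : prompt_enrichment(tweet_comment)}], 
            temperature = 0.1, max_tokens = 512, top_p = 1.0, 
            frequency_penalty = 0.0, presence_penalty = 0.0
        )
        response_extraction = response["choices"][0]["message"]["content"]
        token_usage = response["usage"]["total_tokens"]
    except Exception as E:
        print(f"[ERROR] - {E}")
        print("Retry with Recursive Func")
        time.sleep(5)
        # Return the retry's result so a failed request does not yield an empty row.
        # The retry keeps the same model; the number of retries is unbounded.
        return ingest_openai(tweet_comment = tweet_comment, model_base = model_base)
    return response_extraction, token_usage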

In [45]:
# Test ingestion
comment = data['text'].sample(1).values[0]
extraction, token_usage = ingest_openai(tweet_comment = comment)
print(f"[COMMENT]\n{comment}\n[RESULT - Token Usage: {token_usage}]\n{extraction}")

[COMMENT]
Puan tak masalah bahkan Ganjar jadi salah satu bacapres. Waktunya Puan Maharani
[RESULT - Token Usage: 216]
named_entity_recognition: [Puan, Ganjar, Puan Maharani]
aspect_sentiment: [Puan (positive), Ganjar (positive), bacapres (positive), Waktunya Puan Maharani (neutral)]
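
The response above is still free text, while the target is a tabular result with separate NER and aspect-sentiment columns. The sketch below shows one way it might be parsed, assuming the model always answers with the two `field: [...]` lines seen in this sample. The names `parse_extraction` and `_bracket_body` are hypothetical helpers, not part of the notebook, and responses that deviate from the format would simply produce empty lists.

```python
import re
from typing import Dict, List, Tuple


def _bracket_body(response: str, field: str) -> str:
    """Return the text inside `field: [...]`, or "" if the field is missing."""
    pattern = rf"{re.escape(field)}\s*:\s*\[(.*?)\]\s*$"
    match = re.search(pattern, response, flags=re.MULTILINE)
    return match.group(1).strip() if match else ""


def parse_extraction(response: str) -> Dict[str, list]:
    """Split a model response into an entity list and (aspect, sentiment) pairs."""
    ner_body = _bracket_body(response, "named_entity_recognition")
    entities: List[str] = [e.strip() for e in ner_body.split(",") if e.strip()]

    # Each aspect is expected to look like "<aspect> (<sentiment>)".
    absa_body = _bracket_body(response, "aspect_sentiment")
    pairs: List[Tuple[str, str]] = re.findall(r"\s*([^,()]+?)\s*\(([^()]+)\)", absa_body)
    aspects = [(aspect.strip(), sentiment.strip().lower()) for aspect, sentiment in pairs]
    return {"entities": entities, "aspects": aspects}


if __name__ == "__main__":
    sample = (
        "named_entity_recognition: [Puan, Ganjar, Puan Maharani]\n"
        "aspect_sentiment: [Puan (positive), Ganjar (positive), "
        "bacapres (positive), Waktunya Puan Maharani (neutral)]"
    )
    print(parse_extraction(sample))
```

Keeping the raw response next to the parsed columns would make it possible to re-parse later if the format assumption turns out to be too strict.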


In [46]:
# Apply on entire dataset
final_result_extraction, final_token_usage = [], []

In [48]:
# Iterate over the tweets and collect the results in the lists
for comment in tqdm(data["text"], desc = "Ingestion Start"):
    result, token = ingest_openai(tweet_comment = comment)
    final_result_extraction.append(result)
    final_token_usage.append(token)

Ingestion Start:  13%|█▎        | 84/625 [10:42<46:56,  5.21s/it]  

[ERROR] - Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)
Retry with Recursive Func


Ingestion Start:  25%|██▌       | 158/625 [28:13<33:08,  4.26s/it]   

[ERROR] - The server is overloaded or not ready yet.
Retry with Recursive Func


Ingestion Start:  45%|████▌     | 284/625 [40:18<22:14,  3.91s/it]  

[ERROR] - Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)
Retry with Recursive Func


Ingestion Start:  61%|██████    | 379/625 [59:05<35:12,  8.59s/it]    

[ERROR] - Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)
Retry with Recursive Func


Ingestion Start:  61%|██████    | 380/625 [1:09:17<12:53:12, 189.36s/it]

[ERROR] - HTTP code 502 from API (<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>cloudflare</center>
</body>
</html>
)
Retry with Recursive Func


Ingestion Start: 100%|██████████| 625/625 [1:38:27<00:00,  9.45s/it]    
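
The log above shows timeouts, an overloaded server, and a 502 during the run, each handled by a fixed 5-second wait. A bounded retry with exponential backoff is one possible alternative; the sketch below is an illustration under assumptions, not the notebook's method. `call_with_retry`, its parameters, and the default values are all hypothetical, and the `sleep` argument exists only so the delay can be stubbed out.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    fallback: T,
    max_retries: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on any exception with exponential backoff.

    Returns `fallback` once `max_retries` retries have also failed.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as error:
            print(f"[ERROR] attempt {attempt + 1}: {error}")
            if attempt < max_retries:
                # Wait 5s, 10s, 20s, ... before the next attempt.
                sleep(base_delay * 2 ** attempt)
    return fallback
```

To be useful, a helper like this would wrap the API call itself rather than a function that already catches its own exceptions, with `("", 0)` as a natural fallback for the `(response, token usage)` pair.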


In [49]:
# Assign the results to the dataframe
data['result extraction'] = final_result_extraction
data['token usage'] = final_token_usage

In [50]:
# Save the enriched dataframe to CSV
data.to_csv("../dataset/data_twitter_pemilu_2024_enrich.csv", index = False)