Uses of embeddings beyond search
- ABC News topic modeling
    - Clustering
    - Measuring information diversity
    - Outlier detection

=> Reviewing and preprocessing content before it is stored in a VectorDB

---

In [1]:
import pandas as pd
import os
import json
import openai
from openai import OpenAI
import numpy as np
from tqdm.notebook import tqdm, trange
from sklearn.cluster import KMeans
from utils import create_embeddings
from dotenv import load_dotenv
load_dotenv()

# initialize openai
openai.api_key = os.environ["OPENAI_API_KEY"]

# How To (ABC News)

## 1. Clustering
- What topics did the news cover in 2020?
##### => __Exploring the topic of each document / grouping similar documents__

In [2]:
df = pd.read_csv("../data/abcnews_2020.csv")

(Note: this step incurs API costs) Embed the headlines batch by batch
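
The `create_embeddings` helper imported from `utils` is not shown in this notebook. A minimal sketch of what such a helper could look like follows, assuming the v1 interface of the OpenAI Python client and the `text-embedding-3-small` model. The name `create_embeddings_sketch`, the `client` parameter, and the model choice are assumptions rather than the notebook's actual implementation. A fake client stands in for the real one so the sketch runs offline and at no cost.

```python
from types import SimpleNamespace
from typing import List, Sequence


def create_embeddings_sketch(
    texts: Sequence[str],
    client,
    model: str = "text-embedding-3-small",  # assumed model, not confirmed by the notebook
) -> List[List[float]]:
    """Hypothetical stand-in for utils.create_embeddings: one vector per input text."""
    response = client.embeddings.create(model=model, input=list(texts))
    # Sort by the index field so the output order matches the input order.
    items = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in items]


class FakeEmbeddings:
    """Offline fake that mimics the shape of the embeddings response."""

    def create(self, model, input):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text)), 0.0])
            for i, text in enumerate(input)
        ]
        return SimpleNamespace(data=data)


fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
vectors = create_embeddings_sketch(["ab", "cde"], fake_client)
print(vectors)
```

With a real client, `OpenAI()` (which reads `OPENAI_API_KEY` from the environment) would be passed in place of `fake_client`, and each call would be billed.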

In [3]:
batch_size = 2000
headline_emb = list()

headline = df['headline_text'].tolist()

for i in trange(0, len(headline), batch_size):
    i_end = min(len(headline), i+batch_size)
    data_batch = headline[i:i_end]

    tmp_emb = create_embeddings(data_batch)
    headline_emb.extend(tmp_emb)


In [4]:
df['headline_emb'] = headline_emb

In [5]:
df.head()

Unnamed: 0,publish_date,headline_text,headline_emb
0,20200101,a new type of resolution for the new year,"[-0.0299206729978323, 0.02757796086370945, 0.0..."
1,20200101,adelaide records driest year in more than a de...,"[0.02336890995502472, 0.02421138435602188, 0.0..."
2,20200101,adelaide riverbank catches alight after new ye...,"[0.008516565896570683, -0.006767496466636658, ..."
3,20200101,adelaides 9pm fireworks spark blaze on riverbank,"[0.03186402469873428, 3.975793561039609e-07, 0..."
4,20200101,archaic legislation governing nt women propert...,"[0.05418519303202629, 0.06181729584932327, 0.0..."


In [6]:
df.to_csv("../data/abcnews_2020_emb.csv", index=False)

Create clusters for the main topics using k-means

<img src="https://static.javatpoint.com/tutorial/machine-learning/images/k-means-clustering-algorithm-in-machine-learning.png" width="500" height="300"/>
<br>
Source: https://static.javatpoint.com/tutorial/machine-learning/images/k-means-clustering-algorithm-in-machine-learning.png
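
k-means needs the number of clusters up front, and the notebook fixes it at 15. One common way to sanity-check such a choice is to compare silhouette scores over a range of `k`. The sketch below does this on synthetic data standing in for the headline embeddings; the data, the range of `k`, and the use of the default Euclidean metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings: three well-separated groups in 8 dimensions.
centers = np.zeros((3, 8))
centers[0, 0] = centers[1, 1] = centers[2, 2] = 10.0
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 8)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # The silhouette score lies in [-1, 1]; higher means tighter, better separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real headline embeddings the curve is usually much flatter than on this toy data, so the score is better read as a rough guide than as a definitive answer.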

In [7]:
df = pd.read_csv("../data/abcnews_2020_emb.csv")

In [8]:
df.head()

Unnamed: 0,publish_date,headline_text,headline_emb
0,20200101,a new type of resolution for the new year,"[-0.0299206729978323, 0.02757796086370945, 0.0..."
1,20200101,adelaide records driest year in more than a de...,"[0.02336890995502472, 0.02421138435602188, 0.0..."
2,20200101,adelaide riverbank catches alight after new ye...,"[0.008516565896570683, -0.006767496466636658, ..."
3,20200101,adelaides 9pm fireworks spark blaze on riverbank,"[0.03186402469873428, 3.975793561039609e-07, 0..."
4,20200101,archaic legislation governing nt women propert...,"[0.05418519303202629, 0.06181729584932327, 0.0..."


In [9]:
type(df.loc[0, 'headline_emb'])

str

In [10]:
df['headline_emb'] = df['headline_emb'].apply(json.loads)  # parse each JSON string back into a list of floats

In [11]:
type(df.loc[0, 'headline_emb'])

list
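
The `str` type seen above arises because CSV has no list type, so pandas writes each embedding as its text representation. The round trip can be reproduced on a toy frame; the column names `text` and `emb` are illustrative only.

```python
import io
import json

import pandas as pd

original = pd.DataFrame({"text": ["a", "b"], "emb": [[0.1, 0.2], [0.3, 0.4]]})

# Write to an in-memory CSV and read it back, as the notebook does with a file.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

type_after_read = type(restored.loc[0, "emb"])  # the list came back as a string
restored["emb"] = restored["emb"].apply(json.loads)  # parse it into a list again
type_after_parse = type(restored.loc[0, "emb"])
print(type_after_read, type_after_parse)
```

This works because a Python list of floats prints as valid JSON. A NumPy array prints without commas, so a column of arrays would not survive the same round trip with `json.loads`.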

In [12]:
df.head(2)

Unnamed: 0,publish_date,headline_text,headline_emb
0,20200101,a new type of resolution for the new year,"[-0.0299206729978323, 0.02757796086370945, 0.0..."
1,20200101,adelaide records driest year in more than a de...,"[0.02336890995502472, 0.02421138435602188, 0.0..."


In [13]:
clusters = KMeans(n_clusters=15, random_state=0).fit_predict(df['headline_emb'].tolist())
df['cluster'] = clusters

In [14]:
df.head(2)

Unnamed: 0,publish_date,headline_text,headline_emb,cluster
0,20200101,a new type of resolution for the new year,"[-0.0299206729978323, 0.02757796086370945, 0.0...",2
1,20200101,adelaide records driest year in more than a de...,"[0.02336890995502472, 0.02421138435602188, 0.0...",11


In [15]:
df.loc[df['cluster']==1]

Unnamed: 0,publish_date,headline_text,headline_emb,cluster
162,20200103,nick kyrgios kicks off australias atp cup chal...,"[-0.055879849940538406, 0.018472496420145035, ...",1
201,20200104,bushfire help sparked by ashleigh barty pink a...,"[-0.006274708081036806, -0.03248249366879463, ...",1
249,20200104,wrong anthem played for moldova at atp cup,"[-0.03378334268927574, 0.012585321441292763, 0...",1
297,20200105,sasha zhoya turns back on australia athletics ...,"[0.026827052235603333, 0.032450079917907715, 0...",1
302,20200105,stars marcus stoinis fined personal abuse rene...,"[-0.029239336028695107, 0.06348463147878647, 0...",1
...,...,...,...,...
2314,20200130,rafael nadal agitated by chair umpire after gi...,"[-0.051277462393045425, 0.030257854610681534, ...",1
2315,20200130,rafael nadal loses to dominic thiem australian...,"[-0.018174681812524796, 0.014082658104598522, ...",1
2354,20200131,australian open dominic thiem beats alexander ...,"[-0.01039914321154356, 0.03773451969027519, 0....",1
2355,20200131,australian open has delivered more that we cou...,"[-0.01898774690926075, 0.051637910306453705, 0...",1


## 2. Measuring information diversity

- How similar is the information carried by the news items within each cluster?

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_diversity(df, column_name):
    """
    Computes the pairwise cosine similarities between the embeddings in a DataFrame column.

    :param df: DataFrame holding one embedding (list of floats) per row
    :param column_name: name of the column that contains the embeddings
    :return: (similarity matrix with NaN on the diagonal, mean pairwise cosine similarity);
             a higher mean means more similar, i.e. less diverse, content
    """
    # Compute the cosine similarity between every pair of embeddings
    embeddings = np.vstack(df[column_name])
    cosine_sim = cosine_similarity(embeddings)

    # Exclude self-comparisons (diagonal elements) before averaging
    np.fill_diagonal(cosine_sim, np.nan)
    avg_similarity = np.nanmean(cosine_sim)

    return cosine_sim, avg_similarity
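
Because the function averages cosine similarities, a higher value indicates a more homogeneous set, and a diversity score can be obtained as one minus that mean. The small worked example below checks the arithmetic on three vectors chosen for illustration, averaging over each unordered pair once.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# cos(a, b) = 0, cos(a, c) = cos(b, c) = 1 / sqrt(2)
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sim = cosine_similarity(vectors)

# Take the strict upper triangle so every pair (i < j) is counted exactly once.
upper = np.triu_indices_from(sim, k=1)
mean_similarity = sim[upper].mean()   # (0 + 2 / sqrt(2)) / 3 = sqrt(2) / 3
diversity = 1.0 - mean_similarity     # mean cosine distance

print(round(mean_similarity, 4), round(diversity, 4))
```

Averaging the full matrix with the diagonal masked gives the same mean, since the matrix is symmetric and each pair simply appears twice.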


In [17]:
dist, avg = calculate_diversity(df, 'headline_emb')

In [28]:
dist

array([[       nan, 0.1746668 , 0.30672911, ..., 0.05149311, 0.21791545,
        0.09140314],
       [0.1746668 ,        nan, 0.51810496, ..., 0.03842442, 0.19517199,
        0.07771924],
       [0.30672911, 0.51810496,        nan, ..., 0.07020558, 0.12080507,
        0.09901438],
       ...,
       [0.05149311, 0.03842442, 0.07020558, ...,        nan, 0.05112324,
        0.31272533],
       [0.21791545, 0.19517199, 0.12080507, ..., 0.05112324,        nan,
        0.0762298 ],
       [0.09140314, 0.07771924, 0.09901438, ..., 0.31272533, 0.0762298 ,
               nan]])

In [18]:
avg

0.19836681801768138

In [19]:
diversity_score = {k:calculate_diversity(df.loc[df['cluster']==k], 'headline_emb')[1] for k in range(0, 15)}

In [20]:
diversity_score

{0: 0.2680357512495172,
 1: 0.42701174844124884,
 2: 0.15113842472548009,
 3: 0.3878776156825337,
 4: 0.2909252194955938,
 5: 0.5789271380584485,
 6: 0.47415081854318053,
 7: 0.28207174145432246,
 8: 0.2937214030248429,
 9: 0.4680506879504668,
 10: 0.19553639050745047,
 11: 0.405714894175974,
 12: 0.3850829531453164,
 13: 0.5048776279133564,
 14: 0.28626770531104967}
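
The values above are mean cosine similarities, so the clusters with the lowest values are the most diverse. A short sketch that ranks the clusters accordingly follows, using the scores printed above rounded to four decimals; the conversion `1 - similarity` is one possible definition of diversity, not the only one.

```python
# Mean pairwise cosine similarity per cluster, copied from the output above (rounded).
mean_similarity = {
    0: 0.2680, 1: 0.4270, 2: 0.1511, 3: 0.3879, 4: 0.2909,
    5: 0.5789, 6: 0.4742, 7: 0.2821, 8: 0.2937, 9: 0.4681,
    10: 0.1955, 11: 0.4057, 12: 0.3851, 13: 0.5049, 14: 0.2863,
}

# Diversity as mean cosine distance; sort from most to least diverse.
diversity = {k: 1.0 - s for k, s in mean_similarity.items()}
ranked = sorted(diversity.items(), key=lambda kv: kv[1], reverse=True)

for cluster_id, score in ranked:
    print(f"cluster {cluster_id:2d}: diversity {score:.4f}")
```

Read this way, clusters 2 and 10 hold the most varied headlines, while clusters 5 and 13 are the most homogeneous.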

In [27]:
df.loc[df['cluster']==13]

Unnamed: 0,publish_date,headline_text,headline_emb,cluster
616,20200109,harry and meghan royal family in uncharted ter...,"[0.008956545032560825, 0.013932404108345509, 0...",13
644,20200109,prince harry and meghan markle step back as se...,"[0.023914147168397903, 0.0278632715344429, 0.0...",13
645,20200109,prince harry and meghan markle step back uk me...,"[0.005646861158311367, 0.03033471666276455, -0...",13
646,20200109,prince harry and meghan markle to step back as...,"[0.022870700806379318, 0.03542037680745125, 0....",13
647,20200109,prince harry and meghan to step back from royal,"[0.01944713294506073, 0.03662598878145218, 0.0...",13
668,20200109,your prince harry and meghan markle questions ...,"[0.01220667827874422, 0.0202767513692379, 0.02...",13
702,20200111,prince harry meghan markle conference call fut...,"[0.00656588189303875, 0.007069929502904415, 0....",13
703,20200111,prince harry meghan markle exodus is monarchy ...,"[-0.01837480440735817, 0.045023079961538315, 0...",13
745,20200112,prince harry meghan markle do they have their ...,"[0.047689106315374374, 0.02747255377471447, 0....",13
747,20200112,queen elizabeth ii calls prince harry for cris...,"[-0.03957179933786392, 0.01446374412626028, 0....",13


## 3. Outlier detection
- Are there items that do not really belong to the cluster they were assigned to?

<img src="https://miro.medium.com/v2/resize:fit:725/1*y3wXEId0poYUIzCD3HBh4w.png"/>
<br>
Source: https://miro.medium.com/v2/resize:fit:725/1*y3wXEId0poYUIzCD3HBh4w.png

In [22]:
from sklearn.ensemble import IsolationForest

In [23]:
cluster = df.loc[df['cluster']==10]

In [29]:
iso_forest = IsolationForest(contamination=0.05)  # Adjust contamination as needed
anomalies = iso_forest.fit_predict(cluster['headline_emb'].tolist())

anomalous_headlines = np.array(cluster['headline_text'].tolist())[anomalies == -1]
print("Anomalous Headlines:", anomalous_headlines)

Anomalous Headlines: ['more blood donors needed' 'a pocket guide to climate change'
 'reserve bank braces for record number of damaged bank notes'
 'chemical plant explosion in spain'
 'where did the drop bear myth originate'
 'hundreds central american migrants wade across river into mexico'
 'hours in dinosaurs day physics astrophysics'
 'streamlined fire assistance still too hard for many farmers'
 'unemployment numbers fail to paint full picture of jobs market'
 'honey bees insect colony collapse varroa mite deformed wing'
 'how you can send your child to school outside catchment zone']


In [25]:
anomalies

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,
        1,  1,  1, -1,  1,  1,  1,  1,  1])

In [26]:
anomalous_headlines

array(['meteorologist describes why forecast conditions are so dangerous',
       'nols moldova 0401',
       'lithium sulfur battery greener cheaper and more efficient',
       'rain gardens green roofs solutions for stormwater problem',
       'genetically engineered mosquitoes immune to all dengue strains',
       'where did the drop bear myth originate',
       'why experts say it is easy to leave children in cars',
       'consumers urged to eat more pineapple to combat glut',
       'indonesia plans to build ten more balis for tourists',
       'how does car measure temperature in a heatwave',
       'how you can send your child to school outside catchment zone'],
      dtype='<U64')
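
The two `anomalous_headlines` outputs above are not identical, most likely because the cell was re-run and `IsolationForest` is randomized when no `random_state` is given. The sketch below fixes the seed and ranks items by anomaly score with `score_samples`, instead of only reading the -1/1 labels. It uses synthetic data with one planted outlier, and the seed value is an arbitrary assumption.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 200 ordinary points plus one planted outlier far from the rest (row 200).
inliers = rng.normal(size=(200, 8))
outlier = np.full((1, 8), 10.0)
X = np.vstack([inliers, outlier])

# A fixed random_state makes repeated runs return the same result.
forest = IsolationForest(contamination=0.05, random_state=0).fit(X)
labels = forest.predict(X)          # -1 = anomaly, 1 = normal
scores = forest.score_samples(X)    # lower score = more anomalous
ranking = np.argsort(scores)        # most anomalous rows first

print("most anomalous row:", ranking[0])
print("flagged rows:", np.flatnonzero(labels == -1))
```

Note that `contamination=0.05` always flags roughly 5% of the rows, whether or not they are true outliers, so the ranked scores are often more informative than the labels alone.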

Beyond simply embedding text, <br>
embeddings can also be used to preprocess and explore the content itself, for example by grouping texts by their characteristics or excluding texts judged to be unrelated.
