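## Hands-on Exercises and Dataset Overview

1. Use a simple word2vec model and learn how to compute with embeddings
2. Apply several embedding models in practice and check retrieval performance on the current data
3. Introduce ways to use embeddings beyond search
4. Build a simple embedding-based service (search optimization)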
---
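
As a preview of the embedding arithmetic in item 1, cosine similarity is a common way to compare two embedding vectors. The following is a minimal pure-Python sketch; `cosine_similarity` is a hypothetical helper and the vectors are made-up toy values, not output from any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values for illustration only)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
print(same_direction, orthogonal)
```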

# 1. The Simpsons dataset

#### Dataset: dialogue lines spoken by characters in The Simpsons
#### Purpose: analyze the dialogue to uncover relationships between words

- Download: https://www.kaggle.com/datasets/pierremegret/dialogue-lines-of-the-simpsons?resource=download

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/simpsons_dataset.csv')
df.shape

(158314, 2)

In [3]:
df.head()

Unnamed: 0,raw_character_text,spoken_words
0,Miss Hoover,"No, actually, it was a little of both. Sometim..."
1,Lisa Simpson,Where's Mr. Bergstrom?
2,Miss Hoover,I don't know. Although I'd sure like to talk t...
3,Lisa Simpson,That life is worth living.
4,Edna Krabappel-Flanders,The polls will be open from now until the end ...


In [4]:
from collections import Counter

In [5]:
counts = Counter(df['raw_character_text'])

In [6]:
counts.most_common()

[('Homer Simpson', 29782),
 (nan, 17814),
 ('Marge Simpson', 14141),
 ('Bart Simpson', 13759),
 ('Lisa Simpson', 11489),
 ('C. Montgomery Burns', 3162),
 ('Moe Szyslak', 2862),
 ('Seymour Skinner', 2438),
 ('Ned Flanders', 2144),
 ('Grampa Simpson', 1880),
 ('Milhouse Van Houten', 1862),
 ('Chief Wiggum', 1830),
 ('Krusty the Clown', 1768),
 ('Nelson Muntz', 1172),
 ('Lenny Leonard', 1166),
 ('Apu Nahasapeemapetilon', 1006),
 ('Waylon Smithers', 996),
 ('Kent Brockman', 891),
 ('Carl Carlson', 883),
 ('Edna Krabappel-Flanders', 739),
 ('Dr. Julius Hibbert', 691),
 ('Barney Gumble', 611),
 ('Selma Bouvier', 611),
 ('Sideshow Bob', 576),
 ('Rev. Timothy Lovejoy', 558),
 ('Crowd', 540),
 ('Groundskeeper Willie', 534),
 ('Gary Chalmers', 523),
 ('Ralph Wiggum', 507),
 ('Mayor Joe Quimby', 503),
 ('Patty Bouvier', 479),
 ('Comic Book Guy', 478),
 ('Otto Mann', 423),
 ('Martin Prince', 409),
 ('Announcer', 387),
 ('Kids', 365),
 ('Jimbo Jones', 357),
 ('Sideshow Mel', 352),
 ('Lou', 350),
 (
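
The counts above include 17,814 rows whose speaker is `nan`. Before analyzing word relationships, one option is to drop rows with a missing speaker or line and split each remaining line into lowercase tokens. This is a minimal sketch on a toy frame; `clean_dialogue` and the regex tokenizer are assumptions for illustration, not the notebook's own method.

```python
import re
import pandas as pd

def clean_dialogue(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with a missing speaker or line, then add a lowercase token list."""
    out = df.dropna(subset=["raw_character_text", "spoken_words"]).copy()
    # Hypothetical tokenizer: keep runs of letters and apostrophes
    out["tokens"] = out["spoken_words"].str.lower().apply(
        lambda s: re.findall(r"[a-z']+", s)
    )
    # Discard lines that produced no tokens at all
    return out[out["tokens"].map(len) > 0].reset_index(drop=True)

# Toy rows that mimic the dataset's two columns
sample = pd.DataFrame({
    "raw_character_text": ["Lisa Simpson", None, "Miss Hoover"],
    "spoken_words": ["Where's Mr. Bergstrom?", "(no speaker)", None],
})
cleaned = clean_dialogue(sample)
print(cleaned)
```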

# 2. Quora dataset

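#### Dataset: pairs of similar questions collected from Quora, a Q&A platform with a purpose similar to Naver's Knowledge iN (지식iN)
#### Purpose: hands-on practice finding similar questions using embeddings

- Loaded with the `datasets` package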

In [7]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("quora")

In [8]:
dataset.keys()

dict_keys(['train'])

In [9]:
raw_df = dataset["train"].to_pandas()

In [10]:
raw_df.head()

Unnamed: 0,questions,is_duplicate
0,"{'id': [1, 2], 'text': ['What is the step by s...",False
1,"{'id': [3, 4], 'text': ['What is the story of ...",False
2,"{'id': [5, 6], 'text': ['How can I increase th...",False
3,"{'id': [7, 8], 'text': ['Why am I mentally ver...",False
4,"{'id': [9, 10], 'text': ['Which one dissolve i...",False


In [11]:
raw_df.loc[0]['questions']

{'id': array([1, 2]),
 'text': array(['What is the step by step guide to invest in share market in india?',
        'What is the step by step guide to invest in share market?'],
       dtype=object)}

Keep only the question pairs flagged as duplicates.

In [12]:
raw_df = raw_df.loc[raw_df['is_duplicate']].reset_index(drop=True)

In [13]:
raw_df.loc[0, 'questions']

{'id': array([11, 12]),
 'text': array(['Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?',
        "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?"],
       dtype=object)}

In [14]:
# Split each pair's question texts and ids into separate columns
raw_df["q1"] = raw_df["questions"].apply(lambda q: q["text"][0])
raw_df["q2"] = raw_df["questions"].apply(lambda q: q["text"][1])
raw_df["id1"] = raw_df["questions"].apply(lambda q: q["id"][0])
raw_df["id2"] = raw_df["questions"].apply(lambda q: q["id"][1])

q1_to_q2 = raw_df.copy().rename(columns={"q1": "text", "id1": "id", "id2": "dq_id"}).drop(columns=["questions", "q2"])
q2_to_q1 = raw_df.copy().rename(columns={"q2": "text", "id2": "id", "id1": "dq_id"}).drop(columns=["questions", "q1"])
flat_df = pd.concat([q1_to_q2, q2_to_q1])

flat_df = flat_df.sort_values(by=['id']).reset_index(drop=True)

In [15]:
flat_df.head()

Unnamed: 0,is_duplicate,text,id,dq_id
0,True,Astrology: I am a Capricorn Sun Cap moon and c...,11,12
1,True,"I'm a triple Capricorn (Sun, Moon and ascendan...",12,11
2,True,How can I be a good geologist?,15,16
3,True,What should I do to be a great geologist?,16,15
4,True,How do I read and find my YouTube comments?,23,24


In [16]:
flat_df.loc[flat_df['id']==568]

Unnamed: 0,is_duplicate,text,id,dq_id
1263,True,How can I make money online with free of cost?,568,569
1264,True,How can I make money online with free of cost?,568,214223
1265,True,How can I make money online with free of cost?,568,8268
1266,True,How can I make money online with free of cost?,568,5511
1267,True,How can I make money online with free of cost?,568,36751
1268,True,How can I make money online with free of cost?,568,101583
1269,True,How can I make money online with free of cost?,568,92406


Use only a small sample of the full dataset (ids up to 15,000).

In [17]:
flat_df = flat_df.loc[((flat_df['id'] <= 15000) & (flat_df['dq_id'] <= 15000))]

In [18]:
flat_df.shape

(12574, 4)

In [19]:
flat_df.head()

Unnamed: 0,is_duplicate,text,id,dq_id
0,True,Astrology: I am a Capricorn Sun Cap moon and c...,11,12
1,True,"I'm a triple Capricorn (Sun, Moon and ascendan...",12,11
2,True,How can I be a good geologist?,15,16
3,True,What should I do to be a great geologist?,16,15
4,True,How do I read and find my YouTube comments?,23,24


In [20]:
# For each question, store the ids of its duplicate questions as a list
df = flat_df.drop_duplicates("id").copy()  # explicit copy avoids SettingWithCopyWarning
df["duplicated_questions"] = df["id"].apply(lambda qid: flat_df[flat_df["id"] == qid]["dq_id"].tolist())
df = df.drop(columns=["dq_id", "is_duplicate"])
df["length"] = df["duplicated_questions"].apply(len)

df.head()


Unnamed: 0,text,id,duplicated_questions,length
0,Astrology: I am a Capricorn Sun Cap moon and c...,11,[12],1
1,"I'm a triple Capricorn (Sun, Moon and ascendan...",12,[11],1
2,How can I be a good geologist?,15,[16],1
3,What should I do to be a great geologist?,16,[15],1
4,How do I read and find my YouTube comments?,23,[24],1
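
The `apply` in cell 20 rescans `flat_df` once for every question. A single `groupby` could build the same lists in one pass. The sketch below shows the idea on a made-up frame; `toy_flat` and its values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for flat_df: question 1 has two duplicates (2 and 3)
toy_flat = pd.DataFrame({
    "text": ["a", "a", "b", "c"],
    "id": [1, 1, 2, 3],
    "dq_id": [2, 3, 1, 1],
})

# One pass: collect every dq_id per question id, keeping row order
dup_lists = toy_flat.groupby("id")["dq_id"].agg(list)

toy_df = toy_flat.drop_duplicates("id").drop(columns="dq_id").copy()
toy_df["duplicated_questions"] = toy_df["id"].map(dup_lists)
toy_df["length"] = toy_df["duplicated_questions"].map(len)
print(toy_df)
```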


In [21]:
df.loc[df['length'] > 2]

Unnamed: 0,text,id,duplicated_questions,length
14,What would a Trump presidency mean for current...,31,"[6937, 12544, 11435, 32, 1101]",5
24,How will a Trump presidency affect the student...,32,"[2067, 1100, 6937, 12544, 31, 1101, 2066, 1143...",10
46,Why are so many Quora users posting questions ...,37,"[12639, 1358, 4951, 1357, 6551, 38]",6
63,Why do people ask Quora questions which can be...,38,"[4950, 4407, 4408, 6552, 6551, 12638, 5041, 12...",14
126,What is best way to make money online?,57,"[6800, 12851, 13144, 6099, 4038, 8037, 6799, 1...",23
...,...,...,...,...
33034,Where can you find out what needs to be improv...,14958,"[14288, 967, 966, 2929]",4
33047,Which is the best way to learn hacking just as...,14962,"[8240, 14963, 14168, 14167, 8401, 8400, 8241]",7
33080,What are the safety precautions on handling sh...,14966,"[1596, 10671, 12719, 8119, 5903, 5434, 1595, 1...",10
33112,How did the 2016 US election polls get it so w...,14976,"[14977, 10435, 10434]",3
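
If these lists are later used as ground truth for retrieval, one possible metric is recall@k: the share of a question's known duplicates that appear among the top k retrieved ids. This is a minimal sketch; `recall_at_k` and the retrieved ids are hypothetical, while the relevant ids are copied from the row for id 31 above.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids found in the first k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)

relevant = [6937, 12544, 11435, 32, 1101]   # duplicates listed for id 31
retrieved = [32, 999, 1101, 6937]           # made-up search result order
score = recall_at_k(retrieved, relevant, k=3)
print(score)
```

With `k=3` only ids 32 and 1101 fall inside the cut-off, so two of the five known duplicates are recovered.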


In [24]:
df.to_csv("../data/quora_dataset.csv", index=False)

In [25]:
df.shape

(5539, 4)
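
Note that `to_csv` writes the list column as text such as `"[12, 13]"`, so reading the file back yields strings unless the column is parsed. The following is a minimal sketch of one way to restore the lists, using an in-memory buffer and a toy frame (both assumptions for illustration) instead of the real file.

```python
import ast
import io
import pandas as pd

# Toy frame with a list-valued column, like duplicated_questions
toy = pd.DataFrame({"id": [11, 12], "duplicated_questions": [[12, 13], [11]]})

buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)

# ast.literal_eval safely turns the text "[12, 13]" back into a Python list
restored = pd.read_csv(buf, converters={"duplicated_questions": ast.literal_eval})
print(restored.loc[0, "duplicated_questions"])
```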

# 3. ABC News dataset

- Download: https://www.kaggle.com/datasets/therohk/million-headlines

#### Dataset: publish dates and headlines from ABC News
#### Purpose: show how to combine embeddings with various machine learning models to process information

In [26]:
df = pd.read_csv("../data/abcnews.csv")

In [27]:
df.shape

(1244184, 2)

In [28]:
df.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [29]:
df.publish_date.max(), df.publish_date.min()

(20211231, 20030219)

In [30]:
# Despite the name, news_2020 keeps only headlines published in January 2020
news_2020 = df.loc[(df['publish_date']>=20200101) & (df['publish_date']<20200201)].reset_index(drop=True)

In [31]:
news_2020.head()

Unnamed: 0,publish_date,headline_text
0,20200101,a new type of resolution for the new year
1,20200101,adelaide records driest year in more than a de...
2,20200101,adelaide riverbank catches alight after new ye...
3,20200101,adelaides 9pm fireworks spark blaze on riverbank
4,20200101,archaic legislation governing nt women propert...


In [32]:
news_2020.tail()

Unnamed: 0,publish_date,headline_text
2442,20200131,who coronavirus global emergency
2443,20200131,who declares coronavirus outbreak as global he...
2444,20200131,will travel insurance cover trip cancelled ove...
2445,20200131,world youngest leader 33 years old offers hope...
2446,20200131,wuhan evacuation form


In [33]:
news_2020.to_csv("../data/abcnews_2020.csv", index=False)
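
The filter above relies on `publish_date` being stored as an integer in `YYYYMMDD` form. An alternative is to parse the column into real dates, which makes month or weekday filters easier to express. This is a minimal sketch on made-up rows (`toy_news` is hypothetical):

```python
import pandas as pd

toy_news = pd.DataFrame({
    "publish_date": [20191231, 20200101, 20200131, 20200201],
    "headline_text": ["a", "b", "c", "d"],
})

# Convert the YYYYMMDD integers to datetimes
dates = pd.to_datetime(toy_news["publish_date"].astype(str), format="%Y%m%d")

# Same selection as the integer comparison: January 2020 only
jan_2020 = toy_news.loc[(dates.dt.year == 2020) & (dates.dt.month == 1)].reset_index(drop=True)
print(jan_2020)
```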

# 4. Resume data

- Download: https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset/data

#### Dataset: example resumes taken from livecareer.com
#### Purpose: build a simple service via search optimization that finds candidates with the skills and experience we require

In [34]:
resume = pd.read_csv("../data/Resume.csv")

In [35]:
resume.head()

Unnamed: 0,ID,Resume_str,Resume_html,Category
0,16852973,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,"<div class=""fontsize fontface vmargins hmargin...",HR
1,22323967,"HR SPECIALIST, US HR OPERATIONS ...","<div class=""fontsize fontface vmargins hmargin...",HR
2,33176873,HR DIRECTOR Summary Over 2...,"<div class=""fontsize fontface vmargins hmargin...",HR
3,27018550,HR SPECIALIST Summary Dedica...,"<div class=""fontsize fontface vmargins hmargin...",HR
4,17812897,HR MANAGER Skill Highlights ...,"<div class=""fontsize fontface vmargins hmargin...",HR


In [36]:
resume.Category.unique()

array(['HR', 'DESIGNER', 'INFORMATION-TECHNOLOGY', 'TEACHER', 'ADVOCATE',
       'BUSINESS-DEVELOPMENT', 'HEALTHCARE', 'FITNESS', 'AGRICULTURE',
       'BPO', 'SALES', 'CONSULTANT', 'DIGITAL-MEDIA', 'AUTOMOBILE',
       'CHEF', 'FINANCE', 'APPAREL', 'ENGINEERING', 'ACCOUNTANT',
       'CONSTRUCTION', 'PUBLIC-RELATIONS', 'BANKING', 'ARTS', 'AVIATION'],
      dtype=object)
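
The `Resume_str` previews above contain long runs of spaces. Before embedding the text for search, collapsing whitespace is one simple cleanup option. The following is a minimal sketch on made-up rows; `toy_resume`, its strings, and the `clean_text` column name are hypothetical.

```python
import pandas as pd

toy_resume = pd.DataFrame({
    "ID": [1, 2],
    "Resume_str": ["HR MANAGER       Skill Highlights   ", "  CHEF\n\nSummary  Line cook "],
    "Category": ["HR", "CHEF"],
})

# Split on any whitespace, then rejoin with single spaces
toy_resume["clean_text"] = toy_resume["Resume_str"].str.split().str.join(" ")
print(toy_resume["clean_text"].tolist())
```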