<a href="https://colab.research.google.com/github/CamelGoong/DataScienceLab/blob/main/word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Konlpy

In [1]:
!pip install konlpy

Collecting konlpy
  Downloading konlpy-0.5.2-py2.py3-none-any.whl (19.4 MB)
[K     |████████████████████████████████| 19.4 MB 1.4 MB/s 
Collecting beautifulsoup4==4.6.0
  Downloading beautifulsoup4-4.6.0-py3-none-any.whl (86 kB)
[K     |████████████████████████████████| 86 kB 6.8 MB/s 
[?25hCollecting JPype1>=0.7.0
  Downloading JPype1-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (448 kB)
[K     |████████████████████████████████| 448 kB 82.5 MB/s 
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: JPype1, colorama, beautifulsoup4, konlpy
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.6.3
    Uninstalling beautifulsoup4-4.6.3:
      Successfully uninstalled beautifulsoup4-4.6.3
Successfully installed JPype1-1.3.0 beautifulsoup4-4.6.0 colorama-0.4.4 konlpy-0.5.2


In [2]:
%%bash
sudo apt-get install curl git

Reading package lists...
Building dependency tree...
Reading state information...
curl is already the newest version (7.58.0-2ubuntu3.15).
git is already the newest version (1:2.17.1-1ubuntu0.9).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


In [3]:
%%bash
bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)

Installing automake (A dependency for mecab-ko)
Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:3 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:6 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ Packages [67.4 kB]
Get:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Get:9 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:10 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:13 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:14 http://security.ubuntu.

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
from konlpy.tag import Mecab
import matplotlib.pyplot as plt
mecab = Mecab()

## (1) Count Vectorizing

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
corpus = ["저는 라쿤 입니다",
          "너구리는 라쿤이 아닙니다",
          "따라서 저는 너구리가 아닙니다"]

In [10]:
tagged = [' '.join(mecab.morphs(i)) for i in corpus] # 공백단위로 join해서 리스트로 만들기

In [11]:
tagged

['저 는 라쿤 입니다', '너구리 는 라쿤 이 아닙니다', '따라서 저 는 너구리 가 아닙니다']

In [12]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tagged) # 공백 단위로 join해줘야 이런식으로 세는것이 가능한듯.

In [13]:
df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names())
df

Unnamed: 0,너구리,따라서,라쿤,아닙니다,입니다
0,0,0,1,0,1
1,1,0,1,1,0
2,1,1,0,1,0


## (2) TF-IDF

###$$w_{i,j} = tf_{i,j} \times log(\frac{N}{df_i})$$

<center>

$tf_{i,j}$ = number of occurrences of i in j

$df_{i}$ = number of documents containing i

$N$ = total number of documents
</center>

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tagged) # 위에서 리스트 형태로 모아준 morphs들을 input으로 함.

In [16]:
df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names())
df

Unnamed: 0,너구리,따라서,라쿤,아닙니다,입니다
0,0.0,0.0,0.605349,0.0,0.795961
1,0.57735,0.0,0.57735,0.57735,0.0
2,0.517856,0.680919,0.0,0.517856,0.0


## (3) Latent Semantic Analysis

![](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FsqIMh%2FbtqB1mGuVjr%2FHJPw2IUtRxj1Pv7GBmiwYK%2Fimg.png)

Truncated SVD는 $\Sigma$ 행렬의 대각원소(특이값) 중 상위 일부 데이터만 추출해 분해

$$A = U\Sigma V^T$$

<center>
$U$ : $m \times m$ 직교행렬

$V$ : $n \times n$ 직교행렬

$\Sigma$ : $m \times n$ 직사각 대각행렬
</center>

설명력이 낮은 정보를 삭제하고 설명력이 높은 정보를 남김.

In [17]:
from sklearn.decomposition import TruncatedSVD # sklearn.decomposition에서 SVD를 불러옴.

In [18]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tagged)

In [19]:
X.toarray().shape
print(X)

  (0, 4)	0.7959605415681652
  (0, 2)	0.6053485081062916
  (1, 3)	0.5773502691896257
  (1, 0)	0.5773502691896257
  (1, 2)	0.5773502691896257
  (2, 1)	0.680918560398684
  (2, 3)	0.5178561161676974
  (2, 0)	0.5178561161676974


In [20]:
svd = TruncatedSVD(n_components=2) # 2차원의 형태로만 축소하여 남김.
svd.fit_transform(X)

array([[ 4.64212732e-01,  8.63349379e-01],
       [ 9.19949729e-01,  7.06367511e-16],
       [ 7.94238027e-01, -5.04606629e-01]])

In [21]:
df = pd.DataFrame(svd.fit_transform(X), columns=[f'component_{i}' for i in range(1,3)])
df

Unnamed: 0,component_1,component_2
0,0.464213,0.8633494
1,0.91995,2.928332e-16
2,0.794238,-0.5046066


## (4) Word2Vec


In [22]:
from gensim.models import Word2Vec # Gensi은 현대적인 통계 머신러닝을 사용하는 비지도 주제 모델링 및 자연어 처리를 위한 오픈 소스 라이브러리

In [23]:
model = Word2Vec(size=3, window=1, min_count=1, sg=0) # size: 단어당 만들어질 벡터의 크기(지금의 경우는 3차원으로 하겠다는 것.). / min_count: 총 사용빈도가 min_count이하인 단어들은 무시. / sg: 적용할 훈련 알고리즘. (CBOW = 0, skip-gram = 1)

In [24]:
token = [i.split(' ') for i in tagged] # 공백을 기준으로 나누기

In [25]:
token

[['저', '는', '라쿤', '입니다'],
 ['너구리', '는', '라쿤', '이', '아닙니다'],
 ['따라서', '저', '는', '너구리', '가', '아닙니다']]

In [26]:
model.build_vocab(token) # vocab 만들기

In [27]:
model.train(token, 
            total_examples=model.corpus_count, 
            epochs=100, 
            report_delay=1)

(162, 1500)

In [28]:
model.wv['라쿤'] # 라쿤이라는 단어의 3차원상 벡터값

array([-0.0179496 ,  0.0900695 ,  0.03303715], dtype=float32)

In [29]:
model.wv['너구리']

array([ 0.00092892, -0.06185791, -0.00451376], dtype=float32)

In [30]:
corpus

['저는 라쿤 입니다', '너구리는 라쿤이 아닙니다', '따라서 저는 너구리가 아닙니다']

In [32]:
model.wv.vocab # vocabulary들 출력

{'가': <gensim.models.keyedvectors.Vocab at 0x7f0e66deeb10>,
 '너구리': <gensim.models.keyedvectors.Vocab at 0x7f0e66dee290>,
 '는': <gensim.models.keyedvectors.Vocab at 0x7f0e66deeb90>,
 '따라서': <gensim.models.keyedvectors.Vocab at 0x7f0e66df8850>,
 '라쿤': <gensim.models.keyedvectors.Vocab at 0x7f0e66dee410>,
 '아닙니다': <gensim.models.keyedvectors.Vocab at 0x7f0e66df8f50>,
 '이': <gensim.models.keyedvectors.Vocab at 0x7f0e66df8ed0>,
 '입니다': <gensim.models.keyedvectors.Vocab at 0x7f0e66deebd0>,
 '저': <gensim.models.keyedvectors.Vocab at 0x7f0e66dee510>}

In [33]:
model_result = model.wv.most_similar("라쿤") # 유사도가 가장 높은 단어 출력
print(model_result)

[('는', 0.5998097062110901), ('아닙니다', 0.5652697086334229), ('저', 0.4620060920715332), ('입니다', 0.17930394411087036), ('이', 0.17023025453090668), ('따라서', -0.22000238299369812), ('가', -0.7962130904197693), ('너구리', -0.9476590156555176)]


## (5) Doc2Vec

In [34]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [35]:
!unzip "/content/gdrive/My Drive/DSL/20210916_Embedding/movies.zip" -d "/content" # 영화 1000개와 그에 대한 정보가 들어있는 파일.

Archive:  /content/gdrive/My Drive/DSL/20210916_Embedding/movies.zip
  inflating: /content/imdb_top_1000.csv  


In [36]:
mydf = pd.read_csv('/content/imdb_top_1000.csv') # dataframe으로 읽어오기

In [38]:
mydf.head(3)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444


In [39]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument # 이번에는 word2vec이 아니고 Do2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [45]:
idx_df = [(a['Series_Title'], a['Overview']) for i,a in mydf.loc[:,["Series_Title","Overview"]].iterrows()] # 판다스에서 iterrows()로 행을 반복적으로 돌면서 처리 가능.
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in idx_df] # nltk에서 불러온 word_tokenize 사용. 여기서 i = 제목, _d = 내용 
# 밑에 출력된 것을 보면 알겠지만, words로 들어가는 인자값들이 tokenize 대상이 되는 거고, tags는 영화제목이 되는 것.

In [44]:
idx_df[0:5]

[('The Shawshank Redemption',
  'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.'),
 ('The Godfather',
  "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son."),
 ('The Dark Knight',
  'When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.'),
 ('The Godfather: Part II',
  'The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands and tightens his grip on the family crime syndicate.'),
 ('12 Angry Men',
  'A jury holdout attempts to prevent a miscarriage of justice by forcing his colleagues to reconsider the evidence.')]

In [46]:
tagged_data[0:5]

[TaggedDocument(words=['two', 'imprisoned', 'men', 'bond', 'over', 'a', 'number', 'of', 'years', ',', 'finding', 'solace', 'and', 'eventual', 'redemption', 'through', 'acts', 'of', 'common', 'decency', '.'], tags=['The Shawshank Redemption']),
 TaggedDocument(words=['an', 'organized', 'crime', 'dynasty', "'s", 'aging', 'patriarch', 'transfers', 'control', 'of', 'his', 'clandestine', 'empire', 'to', 'his', 'reluctant', 'son', '.'], tags=['The Godfather']),
 TaggedDocument(words=['when', 'the', 'menace', 'known', 'as', 'the', 'joker', 'wreaks', 'havoc', 'and', 'chaos', 'on', 'the', 'people', 'of', 'gotham', ',', 'batman', 'must', 'accept', 'one', 'of', 'the', 'greatest', 'psychological', 'and', 'physical', 'tests', 'of', 'his', 'ability', 'to', 'fight', 'injustice', '.'], tags=['The Dark Knight']),
 TaggedDocument(words=['the', 'early', 'life', 'and', 'career', 'of', 'vito', 'corleone', 'in', '1920s', 'new', 'york', 'city', 'is', 'portrayed', ',', 'while', 'his', 'son', ',', 'michael', '

In [47]:
max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,  # 초기 학습률
                min_alpha=0.00025, 
                min_count=1,
                dm =1) # 어떤 알고리즘을 선택할 것인지, 0: DBOW 1: DM(Distributed Memory)
  
model.build_vocab(tagged_data)



In [None]:
mydf[5:10]

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
5,https://m.media-amazon.com/images/M/MV5BNzA5ZD...,The Lord of the Rings: The Return of the King,2003,U,201 min,"Action, Adventure, Drama",8.9,Gandalf and Aragorn lead the World of Men agai...,94.0,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905
6,https://m.media-amazon.com/images/M/MV5BNGNhMD...,Pulp Fiction,1994,A,154 min,"Crime, Drama",8.9,"The lives of two mob hitmen, a boxer, a gangst...",94.0,Quentin Tarantino,John Travolta,Uma Thurman,Samuel L. Jackson,Bruce Willis,1826188,107928762
7,https://m.media-amazon.com/images/M/MV5BNDE4OT...,Schindler's List,1993,A,195 min,"Biography, Drama, History",8.9,"In German-occupied Poland during World War II,...",94.0,Steven Spielberg,Liam Neeson,Ralph Fiennes,Ben Kingsley,Caroline Goodall,1213505,96898818
8,https://m.media-amazon.com/images/M/MV5BMjAxMz...,Inception,2010,UA,148 min,"Action, Adventure, Sci-Fi",8.8,A thief who steals corporate secrets through t...,74.0,Christopher Nolan,Leonardo DiCaprio,Joseph Gordon-Levitt,Elliot Page,Ken Watanabe,2067042,292576195
9,https://m.media-amazon.com/images/M/MV5BMmEzNT...,Fight Club,1999,A,139 min,Drama,8.8,An insomniac office worker and a devil-may-car...,66.0,David Fincher,Brad Pitt,Edward Norton,Meat Loaf,Zach Grenier,1854740,37030102


In [48]:
model.docvecs.most_similar("Knives Out")

[('The Visitor', 0.6318782567977905),
 ('The Fighter', 0.6307891607284546),
 ('The Lady Vanishes', 0.5823297500610352),
 ('C.R.A.Z.Y.', 0.5703487396240234),
 ('Leviafan', 0.569624125957489),
 ('Dà hóng denglong gaogao guà', 0.5211832523345947),
 ('Celda 211', 0.519517719745636),
 ('In America', 0.4988778829574585),
 ('Se7en', 0.49472561478614807),
 ("God's Own Country", 0.47965243458747864)]

# 세션과제: Superhero dataset에서 superman과 가장 비슷한 히어로 10명 찾기

데이터셋 불러오기 Data Downloaed from [kaggle](https://www.kaggle.com/jonathanbesomi/superheroes-nlp-dataset)



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import pandas as pd

!unzip "/content/drive/MyDrive/DSL/20210916_Embedding/superhero.zip" -d "/content"

Archive:  /content/drive/MyDrive/DSL/20210916_Embedding/superhero.zip
  inflating: /content/superheroes_nlp_dataset.csv  


In [7]:
superhero_df = pd.read_csv("superheroes_nlp_dataset.csv")

In [8]:
superhero_df.head()

Unnamed: 0,name,real_name,full_name,overall_score,history_text,powers_text,intelligence_score,strength_score,speed_score,durability_score,power_score,combat_score,superpowers,alter_egos,aliases,place_of_birth,first_appearance,creator,alignment,occupation,base,teams,relatives,gender,type_race,height,weight,eye_color,hair_color,skin_color,img,has_electrokinesis,has_energy_constructs,has_mind_control_resistance,has_matter_manipulation,has_telepathy_resistance,has_mind_control,has_enhanced_hearing,has_dimensional_travel,has_element_control,...,has_fire_resistance,has_fire_control,has_dexterity,has_reality_warping,has_illusions,has_energy_beams,has_peak_human_condition,has_shapeshifting,has_heat_resistance,has_jump,has_self-sustenance,has_energy_absorption,has_cold_resistance,has_magic,has_telekinesis,has_toxin_and_disease_resistance,has_telepathy,has_regeneration,has_immortality,has_teleportation,has_force_fields,has_energy_manipulation,has_endurance,has_longevity,has_weapon-based_powers,has_energy_blasts,has_enhanced_senses,has_invulnerability,has_stealth,has_marksmanship,has_flight,has_accelerated_healing,has_weapons_master,has_intelligence,has_reflexes,has_super_speed,has_durability,has_stamina,has_agility,has_super_strength
0,3-D Man,"Delroy Garrett, Jr.","Delroy Garrett, Jr.",6,"Delroy Garrett, Jr. grew up to become a track ...",,85,30,60,60,40,70,"['Super Speed', 'Super Strength']",[],[''],,,Marvel Comics,Good,,,"['Annihilators', 'Asgardians', 'Avengers', 'Ne...",,Male,Human,-,-,,,,/pictures2/portraits/11/050/10038.jpg?v=156096...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,514A (Gotham),Bruce Wayne,,10,He was one of the many prisoners of Indian Hil...,,100,20,30,50,35,100,"['Durability', 'Reflexes', 'Super Strength']","['Batgod', 'Batman', 'Batman (1966)', 'Batman ...","['Subject 514A', 'Bruce Wayne', 'Bruce 2']",,,DC Comics,,,,[],Bruce Wayne (genetic template),,,-,-,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,A-Bomb,Richard Milhouse Jones,Richard Milhouse Jones,20,"Richard ""Rick"" Jones was orphaned at a young ...","On rare occasions, and through unusual circu...",80,100,80,100,100,80,"['Accelerated Healing', 'Agility', 'Berserk Mo...",[],['Rick Jones'],"Scarsdale, Arizona","Hulk Vol 2 #2 (April, 2008) (as A-Bomb)",Marvel Comics,Good,"Musician, adventurer, author; formerly talk sh...",,"['Teen Brigade', 'Ultimate Fantastic Four', 'U...",Marlo Chandler-Jones (wife); Polly (aunt); Mrs...,Male,Human,6'8 • 203 cm,980 lb • 441 kg,Yellow,No Hair,,/pictures2/portraits/10/050/10060.jpg?v=158233...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
3,Aa,Aa,,12,Aa is one of the more passive members of the P...,,80,50,55,45,100,55,"['Energy Absorption', 'Energy Armor', 'Energy ...",[],[''],Stoneworld,Green Lantern Vol 3 #21,DC Comics,Good,,,"['Blue Lantern Corps', 'Green Lantern Corps', ...",,Male,Human,-,-,,,,/pictures2/portraits/10/050/1410.jpg?v=1581168103,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Aaron Cash,Aaron Cash,Aaron Cash,5,Aaron Cash is the head of security at Arkham A...,,80,10,25,40,30,50,"['Weapon-based Powers', 'Weapons Master']",[],[''],Gotham City,,DC Comics,Good,,,[],,Male,Human,-,-,,,,/pictures2/portraits/11/050/11650.jpg?v=156173...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
superhero_df[superhero_df.name == "Superman (2006)"]

Unnamed: 0,name,real_name,full_name,overall_score,history_text,powers_text,intelligence_score,strength_score,speed_score,durability_score,power_score,combat_score,superpowers,alter_egos,aliases,place_of_birth,first_appearance,creator,alignment,occupation,base,teams,relatives,gender,type_race,height,weight,eye_color,hair_color,skin_color,img,has_electrokinesis,has_energy_constructs,has_mind_control_resistance,has_matter_manipulation,has_telepathy_resistance,has_mind_control,has_enhanced_hearing,has_dimensional_travel,has_element_control,...,has_fire_resistance,has_fire_control,has_dexterity,has_reality_warping,has_illusions,has_energy_beams,has_peak_human_condition,has_shapeshifting,has_heat_resistance,has_jump,has_self-sustenance,has_energy_absorption,has_cold_resistance,has_magic,has_telekinesis,has_toxin_and_disease_resistance,has_telepathy,has_regeneration,has_immortality,has_teleportation,has_force_fields,has_energy_manipulation,has_endurance,has_longevity,has_weapon-based_powers,has_energy_blasts,has_enhanced_senses,has_invulnerability,has_stealth,has_marksmanship,has_flight,has_accelerated_healing,has_weapons_master,has_intelligence,has_reflexes,has_super_speed,has_durability,has_stamina,has_agility,has_super_strength
1253,Superman (2006),Kal-El,Kal-El,17,"As far as is known, the early history of Kal-E...","Like all Kryptonians, Superman develops superh...",85,100,100,100,100,75,"['Absorption', 'Cryokinesis', 'Durability', 'E...","['Parallax', 'Strange Visitor Superman', 'Supe...",['Clark kent'],Krypton,Superman returns,DC Comics,Good,Reporter,Metropolis,[],"Jor-El (biological father), Lara (biological m...",Male,Kryptonian,5'10 • 178 cm,225 lb • 101 kg,Blue,Black,,/pictures2/portraits/11/050/14788.jpg?v=155370...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0


필요한 전처리 라이브러리 Import

In [28]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [41]:
# superpowers 추출
idx_df2 = [(a['name'], a['superpowers']) for i,a in superhero_df.loc[:,["name","superpowers"]].iterrows()]

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in idx_df2]

In [42]:
tagged_data[0:5]

[TaggedDocument(words=['[', "'super", 'speed', "'", ',', "'super", 'strength', "'", ']'], tags=['3-D Man']),
 TaggedDocument(words=['[', "'durability", "'", ',', "'reflexes", "'", ',', "'super", 'strength', "'", ']'], tags=['514A (Gotham)']),
 TaggedDocument(words=['[', "'accelerated", 'healing', "'", ',', "'agility", "'", ',', "'berserk", 'mode', "'", ',', "'bloodlust", "'", ',', "'camouflage", "'", ',', "'cloaking", "'", ',', "'cold", 'resistance', "'", ',', "'durability", "'", ',', "'emotional", 'power', 'up', "'", ',', "'endurance", "'", ',', "'energy", 'resistance', "'", ',', "'enhanced", 'senses', "'", ',', "'fire", 'resistance', "'", ',', "'gamma", 'mutant', 'physiology', "'", ',', "'heat", 'resistance', "'", ',', "'indestructible", 'digestion', "'", ',', "'invulnerability", "'", ',', "'jump", "'", ',', "'longevity", "'", ',', "'natural", 'armor', "'", ',', "'natural", 'weapons', "'", ',', "'power", 'augmentation', "'", ',', "'radiation", 'absorption', "'", ',', "'radiation", 'i

Doc2Vec 모델 생성 및 훈련

In [43]:
max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,  # 초기 학습률
                min_alpha=0.00025, 
                min_count=1,
                dm =0) # 어떤 알고리즘을 선택할 것인지, 0: DBOW 1: DM(Distributed Memory)
  
model.build_vocab(tagged_data)



'superpowers' 측면에서 Superman과 가장 유사한 Top10

In [50]:
model.docvecs.most_similar('Superman (2006)')

[('Battle-Suit Batman (DCEU)', 0.5962851047515869),
 ('Azrael (Gotham)', 0.5772967338562012),
 ('Master Chief', 0.5746957063674927),
 ('Jane Foster (MCU)', 0.5671156644821167),
 ('Hulk (Stark Gauntlet) (MCU)', 0.5663480758666992),
 ('The One Below All', 0.5660735368728638),
 ('Batman (1966)', 0.562859833240509),
 ('Steel', 0.5502662658691406),
 ('Portal', 0.5345695614814758),
 ('Anti-Spawn', 0.5309906005859375)]