데이터 분석용 전처리.
Word2Vec을 활용한 문자열 데이터 임베딩 전처리.

In [57]:
# Data Analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
plt.style.use('seaborn')
sns.set_style("whitegrid")

# Modelling
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
from scipy import stats

# Additional
import os
import math
import random
import itertools
import multiprocessing
from tqdm import tqdm
from time import time
import logging
import pickle

작업 경로 설정

In [58]:
os.chdir(r'D:\Github 작업용\Github_cowork_practice')

다른 데이터의 전처리는 간단하게 요약이 되어져 있다.
분석 직전단계에, scaling만 사용하면 될것이지만, 문자열 단어의 경우에는 매핑이 필요할것으로 예상된다.

train, test셋에 있는 모든 데이터를 활용해 메뉴 데이터 얻기.

In [59]:
data1 = pd.read_csv('train.csv')
data2 = pd.read_csv('test.csv')
full_data = pd.concat([data1, data2],axis=0)

In [60]:
data1.shape

(1205, 12)

In [61]:
data2.shape

(50, 10)

In [62]:
full_data.shape
del data1, data2

데이터 형태 체크

In [63]:
full_data.head

<bound method NDFrame.head of             일자 요일  본사정원수  본사휴가자수  본사출장자수  본사시간외근무명령서승인건수  현본사소속재택근무자수  \
0   2016-02-01  월   2601      50     150             238          0.0   
1   2016-02-02  화   2601      50     173             319          0.0   
2   2016-02-03  수   2601      56     180             111          0.0   
3   2016-02-04  목   2601     104     220             355          0.0   
4   2016-02-05  금   2601     278     181              34          0.0   
..         ... ..    ...     ...     ...             ...          ...   
45  2021-04-05  월   2973     125     174             704        331.0   
46  2021-04-06  화   2973      76     170             636        364.0   
47  2021-04-07  수   2973      96     214               1        334.0   
48  2021-04-08  목   2973     105     238             509        324.0   
49  2021-04-09  금   2973     259     268               0        229.0   

                                                 조식메뉴  \
0   모닝롤/찐빵  우유/두유/주스 계란후라이  호두죽/쌀밥 (

In [64]:
full_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1255 entries, 0 to 49
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   일자              1255 non-null   object 
 1   요일              1255 non-null   object 
 2   본사정원수           1255 non-null   int64  
 3   본사휴가자수          1255 non-null   int64  
 4   본사출장자수          1255 non-null   int64  
 5   본사시간외근무명령서승인건수  1255 non-null   int64  
 6   현본사소속재택근무자수     1255 non-null   float64
 7   조식메뉴            1255 non-null   object 
 8   중식메뉴            1255 non-null   object 
 9   석식메뉴            1255 non-null   object 
 10  중식계             1205 non-null   float64
 11  석식계             1205 non-null   float64
dtypes: float64(3), int64(4), object(5)
memory usage: 127.5+ KB


In [65]:
full_data.shape

(1255, 12)

우리가 전처리 하고 싶은 열은 다음과 같다. ->

In [66]:
full_data.조식메뉴

0     모닝롤/찐빵  우유/두유/주스 계란후라이  호두죽/쌀밥 (쌀:국내산) 된장찌개  쥐...
1     모닝롤/단호박샌드  우유/두유/주스 계란후라이  팥죽/쌀밥 (쌀:국내산) 호박젓국찌...
2     모닝롤/베이글  우유/두유/주스 계란후라이  표고버섯죽/쌀밥 (쌀:국내산) 콩나물국...
3     모닝롤/토마토샌드  우유/두유/주스 계란후라이  닭죽/쌀밥 (쌀,닭:국내산) 근대국...
4     모닝롤/와플  우유/두유/주스 계란후라이  쇠고기죽/쌀밥 (쌀:국내산) 재첩국  방...
                            ...                        
45    모닝롤/커피콩빵 우유/주스 계란후라이 누룽지탕/흑미밥 청양콩나물국 스팸구이 양상추샐...
46    모닝롤/모닝샌드 우유/주스 계란후라이 고구마스프/흑미밥 아욱국 참치채소볶음 양상추샐...
47    모닝롤/호떡맥모닝 우유/주스 계란후라이 팥죽/흑미밥 닭살해장국 우엉채조림 양상추샐러...
48    모닝롤/크로크무슈 우유/주스 계란후라이 누룽지탕/흑미밥 감자국 두부양념조림 양상추샐...
49    모닝롤/토마토샌드 우유/주스 계란후라이 채소죽/흑미밥 대구지리 애호박나물볶음 양상추...
Name: 조식메뉴, Length: 1255, dtype: object

In [67]:
full_data.중식메뉴

0     쌀밥/잡곡밥 (쌀,현미흑미:국내산) 오징어찌개  쇠불고기 (쇠고기:호주산) 계란찜 ...
1     쌀밥/잡곡밥 (쌀,현미흑미:국내산) 김치찌개  가자미튀김  모둠소세지구이  마늘쫑무...
2     카레덮밥 (쌀,현미흑미:국내산) 팽이장국  치킨핑거 (닭고기:국내산) 쫄면야채무침 ...
3     쌀밥/잡곡밥 (쌀,현미흑미:국내산) 쇠고기무국  주꾸미볶음  부추전  시금치나물  ...
4     쌀밥/잡곡밥 (쌀,현미흑미:국내산) 떡국  돈육씨앗강정 (돼지고기:국내산) 우엉잡채...
                            ...                        
45    쌀밥/흑미밥/찰현미밥 쇠고기미역국 춘천닭갈비 오지치즈후라이 가지두반장볶음 포기김치 ...
46    쌀밥/귀리밥/찰현미밥 순두부백탕 매콤소갈비찜 깻잎완자전 돌나물초장무침 포기김치 시리...
47    쌀밥/흑미밥/찰현미밥 냉이국 돈육간장불고기 비빔냉면 오이나물볶음 겉절이김치 양상추샐...
48    쌀밥/옥수수밥/찰현미밥 맑은떡국 (New)로제찜닭 가자미구이*장 유채나물무침 포기김...
49    쌀밥/흑미밥/찰현미밥 사골우거지국 해물누룽지탕 청포묵*양념간장 비름나물고추장무침 석...
Name: 중식메뉴, Length: 1255, dtype: object

In [68]:
full_data.석식메뉴

0     쌀밥/잡곡밥 (쌀,현미흑미:국내산) 육개장  자반고등어구이  두부조림  건파래무침 ...
1     콩나물밥*양념장 (쌀,현미흑미:국내산) 어묵국  유산슬 (쇠고기:호주산) 아삭고추무...
2     쌀밥/잡곡밥 (쌀,현미흑미:국내산) 청국장찌개  황태양념구이 (황태:러시아산) 고기...
3     미니김밥*겨자장 (쌀,현미흑미:국내산) 우동  멕시칸샐러드  군고구마  무피클  포...
4     쌀밥/잡곡밥 (쌀,현미흑미:국내산) 차돌박이찌개 (쇠고기:호주산) 닭갈비 (닭고기:...
                            ...                        
45                      흑미밥 돈육고추장찌개 갈치구이 김치전 취나물무침 깍두기 
46             추가밥 짬뽕*생면 수제찹쌀꿔바로우 메추리알곤약장조림 단무지무침 포기김치 
47             단호박카레라이스 시금치된장국 소떡소떡 파프리카해초무침 감귤쥬스 포기김치 
48                  흑미밥 어묵매운탕 쇠고기숙주볶음 채소계란찜 쑥갓생무침 김치볶음 
49               흑미밥 맑은버섯국 매운사태조림 춘권*타르타르D 열무나물무침 포기김치 
Name: 석식메뉴, Length: 1255, dtype: object

우리는 이 데이터를 훈련에 사용할게 아니라 단순히 밥을 뭐 먹었는지 명부를 얻고자 한다.
공백 두번과 / 문자열로 구분자를 사용하고 있으며 (), *과 같이 묶여있는 경우 구분자가 아니라 연결하는 의미에서 중요하다고 판단하였음.
따라서 조식, 중식, 석식메뉴에 고유하게 등장한 메뉴값을 구한다.

In [69]:
temp = []
for i in full_data.조식메뉴:
    line = i.split(' ')
    temp.extend(line)

for i in full_data.중식메뉴:
    line = i.split(' ')
    temp.extend(line)

for i in full_data.석식메뉴:
    line = i.split(' ')
    temp.extend(line)


띄어쓰기 두번을 기준으로 구분을 한 데이터는 다음과 같이 분류되었음.

In [70]:
temp

['모닝롤/찐빵',
 '',
 '우유/두유/주스',
 '계란후라이',
 '',
 '호두죽/쌀밥',
 '(쌀:국내산)',
 '된장찌개',
 '',
 '쥐어채무침',
 '',
 '포기김치',
 '(배추,고추가루:국내산)',
 '',
 '모닝롤/단호박샌드',
 '',
 '우유/두유/주스',
 '계란후라이',
 '',
 '팥죽/쌀밥',
 '(쌀:국내산)',
 '호박젓국찌개',
 '',
 '시래기조림',
 '',
 '포기김치',
 '(배추,고추가루:국내산)',
 '',
 '모닝롤/베이글',
 '',
 '우유/두유/주스',
 '계란후라이',
 '',
 '표고버섯죽/쌀밥',
 '(쌀:국내산)',
 '콩나물국',
 '',
 '느타리호박볶음',
 '',
 '포기김치',
 '(배추,고추가루:국내산)',
 '',
 '모닝롤/토마토샌드',
 '',
 '우유/두유/주스',
 '계란후라이',
 '',
 '닭죽/쌀밥',
 '(쌀,닭:국내산)',
 '근대국',
 '',
 '멸치볶음',
 '',
 '포기김치',
 '(배추,고추가루:국내산)',
 '',
 '모닝롤/와플',
 '',
 '우유/두유/주스',
 '계란후라이',
 '',
 '쇠고기죽/쌀밥',
 '(쌀:국내산)',
 '재첩국',
 '',
 '방풍나물',
 '',
 '포기김치',
 '(배추,고추가루:국내산)',
 '',
 '팬케익/찐빵',
 '',
 '우유/두유/주스',
 '',
 '계란후라이',
 '',
 '견과류죽/쌀밥',
 '(쌀:국내산)',
 '감자찌개',
 '',
 '명엽채무침',
 '포기김치',
 '(김치:국내산)',
 '',
 '모닝롤/야채샌드',
 '',
 '우유/두유/주스',
 '',
 '계란후라이',
 '',
 '고구마죽/쌀밥',
 '(쌀:국내산)',
 '봄동된장국',
 '',
 '숙주나물',
 '포기김치',
 '(김치:국내산)',
 '',
 '모닝롤/치즈프레즐',
 '',
 '우유/두유/주스',
 '',
 '계란후라이',
 '',
 '잣죽/쌀밥',
 '(쌀:국내산)',
 '민물새우찌개',
 '',
 '콩조림',
 '

/ 문자열을 기준으로 더 나눠준다.

In [71]:
cat = []
for i in temp:
    line = i.split('/')
    cat.extend(line)


처리한 형태는 더욱 깔끔해졌다.

In [72]:
cat

['모닝롤',
 '찐빵',
 '',
 '우유',
 '두유',
 '주스',
 '계란후라이',
 '',
 '호두죽',
 '쌀밥',
 '(쌀:국내산)',
 '된장찌개',
 '',
 '쥐어채무침',
 '',
 '포기김치',
 '(배추,고추가루:국내산)',
 '',
 '모닝롤',
 '단호박샌드',
 '',
 '우유',
 '두유',
 '주스',
 '계란후라이',
 '',
 '팥죽',
 '쌀밥',
 '(쌀:국내산)',
 '호박젓국찌개',
 '',
 '시래기조림',
 '',
 '포기김치',
 '(배추,고추가루:국내산)',
 '',
 '모닝롤',
 '베이글',
 '',
 '우유',
 '두유',
 '주스',
 '계란후라이',
 '',
 '표고버섯죽',
 '쌀밥',
 '(쌀:국내산)',
 '콩나물국',
 '',
 '느타리호박볶음',
 '',
 '포기김치',
 '(배추,고추가루:국내산)',
 '',
 '모닝롤',
 '토마토샌드',
 '',
 '우유',
 '두유',
 '주스',
 '계란후라이',
 '',
 '닭죽',
 '쌀밥',
 '(쌀,닭:국내산)',
 '근대국',
 '',
 '멸치볶음',
 '',
 '포기김치',
 '(배추,고추가루:국내산)',
 '',
 '모닝롤',
 '와플',
 '',
 '우유',
 '두유',
 '주스',
 '계란후라이',
 '',
 '쇠고기죽',
 '쌀밥',
 '(쌀:국내산)',
 '재첩국',
 '',
 '방풍나물',
 '',
 '포기김치',
 '(배추,고추가루:국내산)',
 '',
 '팬케익',
 '찐빵',
 '',
 '우유',
 '두유',
 '주스',
 '',
 '계란후라이',
 '',
 '견과류죽',
 '쌀밥',
 '(쌀:국내산)',
 '감자찌개',
 '',
 '명엽채무침',
 '포기김치',
 '(김치:국내산)',
 '',
 '모닝롤',
 '야채샌드',
 '',
 '우유',
 '두유',
 '주스',
 '',
 '계란후라이',
 '',
 '고구마죽',
 '쌀밥',
 '(쌀:국내산)',
 '봄동된장국',
 '',
 '숙주나물',
 '포기김치',
 '(김치:

공백값을 가지고 있는 리스트는 무의미한 데이터이다.
이 값을 제거하였다.

In [73]:
while '' in cat:
    cat.remove('')

공백값을 지운 리스트의 형태이다.
이를 중복값을 제거해 데이터 인덱싱 값으로 전환해 주겠다.

In [74]:
catdict = {val : idx for idx, val in enumerate(set(cat))}
catdict

{'사골파국': 0,
 '깻잎': 1,
 '오렌지쥬스': 2,
 '꽃게된장찌개': 3,
 '물파래초무침': 4,
 '군고구마': 5,
 '아쿠아돈까스': 6,
 '토마토샐러드*발사믹드레싱': 7,
 '순살파닭': 8,
 '고들빼기무침': 9,
 '도라지초무침': 10,
 '해물수제비국': 11,
 '머위나물무침': 12,
 '상추&마늘': 13,
 '맑은버섯육개장': 14,
 '참나물두부무침': 15,
 '달래오이생채': 16,
 '소보루빵': 17,
 '쇠고기우엉볶음': 18,
 '상추튀김(모둠튀김*양념장)': 19,
 '닭감자조림': 20,
 '(쌀,돈육:국내산)': 21,
 '홍합살무국': 22,
 '모듬야채쌈': 23,
 '칼국수': 24,
 '양배추쌈&쌈장': 25,
 '두부커틀릿': 26,
 '(쌀:국내산,돈육:국내)': 27,
 '콩나물두루치기': 28,
 '표고버섯죽': 29,
 '연두부샐러드*소스': 30,
 '(돈등갈비:국내산)': 31,
 '명엽채무침': 32,
 '(무,고추가루:국내산)': 33,
 '연두부찌개': 34,
 '닭다리바베큐오븐구이': 35,
 '시저샐러드*시저드레싱': 36,
 '브로컬리무침': 37,
 '꽁치와사비구이': 38,
 '돈육피망볶음': 39,
 '치커리단감무침': 40,
 '주꾸미초무침': 41,
 '매콤미니함박': 42,
 '청경채김치': 43,
 '열무나물무침': 44,
 '핫도그&케찹': 45,
 '메밀전병만두': 46,
 '깐풍두부': 47,
 '찰현미밥': 48,
 '미트볼케찹조림': 49,
 '양상추샐러드*살구D': 50,
 '탕수육': 51,
 '쌀밥': 52,
 '모둠버섯초무침': 53,
 '표고버섯탕수육': 54,
 '마늘쫑메추리알조림': 55,
 '돈육잡채': 56,
 '(훈제오리:국내산)': 57,
 '치커리유자생채': 58,
 '와사비무쌈*쌈장': 59,
 '청경채사과생채': 60,
 '새송이버섯곤약장조림': 61,
 '오이스틱*쌈장': 62,
 '닭윙강정튀김': 63,
 '언양식바싹불고기

사전 제작에 사용한 변수는 지워준다

In [75]:
del full_data, cat, line, temp

각 열에 조식, 점심, 저녁 메뉴에 있어 전처리를 진행한다.
치환을 해야하므로 각 열에 변수를 적용할 필요가 있다.

각 열에 같은 전처리를 하는 함수를 제작한다.

In [76]:
def split_menu(x):
    temp = x.split(' ')
    cat = []
    for j in temp:
        line = j.split('/')
        cat.extend(line)
    while '' in cat:
        cat.remove('')
    return cat

다시 train과 test 데이터셋을 불러온다.

In [77]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

train과 test셋의 메뉴를 리스트화 시켜준다.

In [78]:
train_data['조식메뉴'] = train_data['조식메뉴'].transform(split_menu)
train_data['중식메뉴'] = train_data['중식메뉴'].transform(split_menu)
train_data['석식메뉴'] = train_data['석식메뉴'].transform(split_menu)
test_data['조식메뉴'] = test_data['조식메뉴'].transform(split_menu)
test_data['중식메뉴'] = test_data['중식메뉴'].transform(split_menu)
test_data['석식메뉴'] = test_data['석식메뉴'].transform(split_menu)

아, 굳이 라벨인코딩 할 필요가 없었구나
이제 이 식사 데이터를 다차원으로 펼처준다.
메뉴는 16개를 초과하지 않으므로 연습삼아 50개 벡터로 임베딩 해 보겠다.
마찬가지로, 조식, 중식, 석식이므로 이 작업은 세번에 나누어서 걸처 진행하였다.

-> 원래라면 데이터셋의 식사가 없는 값에대해선 word2vec을 하면 안되는데 우리는 이 값 예측이 아니라 임베딩 목적이므로, 식사가 없는 "-" 하나도 매핑하는 식으로 진행한다.

In [79]:
# 콜백을 설정한다.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

class Callback(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 1
        self.training_loss = []

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        if self.epoch == 1:
            current_loss = loss
        else:
            current_loss = loss - self.loss_previous_step
        print(f"Loss after epoch {self.epoch}: {current_loss}")
        self.training_loss.append(current_loss)
        self.epoch += 1
        self.loss_previous_step = loss

고유한 메뉴의 개수는 3282개. 256차원에 임베딩한다.
임베딩 결과로 나오는 출력 벡터는 임의의 50차원에 할당시키는 식으로 코딩을 진행하였다.

In [80]:
len(catdict)

3282

같이 임베딩을 하는게 옳다고도 생각했지만, 아침 점심 저녁마다 의미가 다르다고 생각하여 따로 임베딩을 실시한다.

In [81]:
model_1 = Word2Vec(window = 10, min_count = 1, sg = 0 ,negative = 20, workers = multiprocessing.cpu_count()-1, vector_size = 100)
print(model_1)

logging.disable(logging.NOTSET) # enable logging
t = time()

X = list(train_data['조식메뉴'])
model_1.build_vocab(X)

logging.disable(logging.INFO) # disable logging
callback = Callback() # instead, print out loss for each epoch
t = time()

model_1.train(X, total_examples = model_1.corpus_count, epochs = 1000, compute_loss = True, callbacks = [callback])

model_1.save("조식메뉴.model")

2022-08-28 15:43:03,898 : INFO : collecting all words and their counts
2022-08-28 15:43:03,899 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-08-28 15:43:03,902 : INFO : collected 822 word types from a corpus of 14416 raw words and 1205 sentences
2022-08-28 15:43:03,902 : INFO : Creating a fresh vocabulary
2022-08-28 15:43:03,905 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 822 unique words (100.00% of original 822, drops 0)', 'datetime': '2022-08-28T15:43:03.905336', 'gensim': '4.2.0', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'prepare_vocab'}
2022-08-28 15:43:03,905 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 14416 word corpus (100.00% of original 14416, drops 0)', 'datetime': '2022-08-28T15:43:03.905336', 'gensim': '4.2.0', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platf

Word2Vec<vocab=0, vector_size=100, alpha=0.025>
Loss after epoch 1: 29276.55078125
Loss after epoch 2: 18489.38671875
Loss after epoch 3: 17423.05859375
Loss after epoch 4: 16892.95703125
Loss after epoch 5: 16719.2265625
Loss after epoch 6: 16775.7109375
Loss after epoch 7: 16526.53125
Loss after epoch 8: 16450.984375
Loss after epoch 9: 16384.203125
Loss after epoch 10: 16451.390625
Loss after epoch 11: 16223.578125
Loss after epoch 12: 16055.578125
Loss after epoch 13: 16090.109375
Loss after epoch 14: 16350.0
Loss after epoch 15: 16010.828125
Loss after epoch 16: 16104.15625
Loss after epoch 17: 15609.625
Loss after epoch 18: 15633.15625
Loss after epoch 19: 15503.625
Loss after epoch 20: 15344.375
Loss after epoch 21: 15335.5625
Loss after epoch 22: 14685.125
Loss after epoch 23: 14773.65625
Loss after epoch 24: 14620.46875
Loss after epoch 25: 14369.78125
Loss after epoch 26: 13701.5
Loss after epoch 27: 13791.21875
Loss after epoch 28: 13624.21875
Loss after epoch 29: 13582.9062

In [82]:
model_1.corpus_count

1205

In [83]:
model_2 = Word2Vec(vector_size = 100, window = 10, min_count = 1,sg = 0,negative = 20, workers = multiprocessing.cpu_count()-1)
print(model_2)

logging.disable(logging.NOTSET) # enable logging
t = time()

X = list(train_data['중식메뉴'])
model_2.build_vocab(X)

logging.disable(logging.INFO) # disable logging
callback = Callback() # instead, print out loss for each epoch
t = time()

model_2.train(X, total_examples = model_2.corpus_count, epochs = 1000, compute_loss = True, callbacks = [callback])

model_2.save("중식메뉴.model")

2022-08-28 15:43:20,146 : INFO : collecting all words and their counts
2022-08-28 15:43:20,147 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-08-28 15:43:20,149 : INFO : collected 1763 word types from a corpus of 11155 raw words and 1205 sentences
2022-08-28 15:43:20,149 : INFO : Creating a fresh vocabulary
2022-08-28 15:43:20,155 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 1763 unique words (100.00% of original 1763, drops 0)', 'datetime': '2022-08-28T15:43:20.155989', 'gensim': '4.2.0', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'prepare_vocab'}
2022-08-28 15:43:20,155 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 11155 word corpus (100.00% of original 11155, drops 0)', 'datetime': '2022-08-28T15:43:20.155989', 'gensim': '4.2.0', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'pl

Word2Vec<vocab=0, vector_size=100, alpha=0.025>
Loss after epoch 1: 63992.109375
Loss after epoch 2: 31387.0078125
Loss after epoch 3: 26244.7734375
Loss after epoch 4: 24795.390625
Loss after epoch 5: 23994.59375
Loss after epoch 6: 23809.40625
Loss after epoch 7: 23884.359375
Loss after epoch 8: 23740.171875
Loss after epoch 9: 23862.84375
Loss after epoch 10: 24077.125
Loss after epoch 11: 23790.78125
Loss after epoch 12: 23657.125
Loss after epoch 13: 23600.59375
Loss after epoch 14: 23319.5625
Loss after epoch 15: 23131.0625
Loss after epoch 16: 22977.15625
Loss after epoch 17: 22626.78125
Loss after epoch 18: 22433.90625
Loss after epoch 19: 22343.96875
Loss after epoch 20: 22116.5
Loss after epoch 21: 22045.03125
Loss after epoch 22: 22029.375
Loss after epoch 23: 21962.25
Loss after epoch 24: 21561.75
Loss after epoch 25: 21276.6875
Loss after epoch 26: 21158.5625
Loss after epoch 27: 20760.5
Loss after epoch 28: 20598.375
Loss after epoch 29: 20062.1875
Loss after epoch 30: 20

In [84]:
model_3 = Word2Vec(vector_size = 100, window = 10, min_count = 1,sg = 0,negative = 20, workers = multiprocessing.cpu_count()-1)
print(model_3)

logging.disable(logging.NOTSET) # enable logging
t = time()

X = list(train_data['석식메뉴'])
model_3.build_vocab(X)

logging.disable(logging.INFO) # disable logging
callback = Callback() # instead, print out loss for each epoch
t = time()

model_1.train(X, total_examples = model_3.corpus_count,epochs = 1000, compute_loss = True, callbacks = [callback])

model_3.save("석식메뉴.model")

2022-08-28 15:43:37,255 : INFO : collecting all words and their counts
2022-08-28 15:43:37,255 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-08-28 15:43:37,258 : INFO : collected 1668 word types from a corpus of 10143 raw words and 1205 sentences
2022-08-28 15:43:37,258 : INFO : Creating a fresh vocabulary
2022-08-28 15:43:37,264 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 1668 unique words (100.00% of original 1668, drops 0)', 'datetime': '2022-08-28T15:43:37.264835', 'gensim': '4.2.0', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'prepare_vocab'}
2022-08-28 15:43:37,264 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 10143 word corpus (100.00% of original 10143, drops 0)', 'datetime': '2022-08-28T15:43:37.264835', 'gensim': '4.2.0', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'pl

Word2Vec<vocab=0, vector_size=100, alpha=0.025>


2022-08-28 15:43:37,285 : INFO : resetting layer weights
2022-08-28 15:43:37,286 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2022-08-28T15:43:37.286840', 'gensim': '4.2.0', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'build_vocab'}


Loss after epoch 1: 8052.9404296875
Loss after epoch 2: 7211.369140625
Loss after epoch 3: 7346.8623046875
Loss after epoch 4: 7299.96875
Loss after epoch 5: 7245.546875
Loss after epoch 6: 6538.05859375
Loss after epoch 7: 6781.67578125
Loss after epoch 8: 6665.58203125
Loss after epoch 9: 6690.71875
Loss after epoch 10: 6421.91796875
Loss after epoch 11: 6865.3359375
Loss after epoch 12: 6751.90625
Loss after epoch 13: 7192.78125
Loss after epoch 14: 6920.5625
Loss after epoch 15: 6848.4453125
Loss after epoch 16: 6642.8125
Loss after epoch 17: 6966.8125
Loss after epoch 18: 6658.9453125
Loss after epoch 19: 6517.0390625
Loss after epoch 20: 6550.78125
Loss after epoch 21: 6432.8125
Loss after epoch 22: 6493.296875
Loss after epoch 23: 6766.609375
Loss after epoch 24: 6725.015625
Loss after epoch 25: 6699.234375
Loss after epoch 26: 6732.4375
Loss after epoch 27: 6486.40625
Loss after epoch 28: 6712.78125
Loss after epoch 29: 6729.6875
Loss after epoch 30: 6763.984375
Loss after epoc

In [85]:
embedding_matrix1 = model_1.wv.get_normed_vectors()

In [86]:
embedding_matrix1

array([[-0.00450715,  0.05964039,  0.08381372, ...,  0.07196336,
        -0.04054832,  0.03271627],
       [-0.11065961,  0.11728547, -0.09917068, ...,  0.11167339,
        -0.07896709,  0.02503748],
       [-0.14760242,  0.05595193, -0.15697204, ...,  0.12606807,
        -0.06119975,  0.01155684],
       ...,
       [-0.06872609, -0.07395434, -0.02730837, ..., -0.13011783,
        -0.08392897, -0.05470637],
       [-0.07968951, -0.11945372, -0.02567193, ..., -0.15110694,
        -0.10402954, -0.03691002],
       [-0.24551778, -0.00439571, -0.0558849 , ...,  0.02195042,
        -0.08145563,  0.03673066]], dtype=float32)

In [87]:
print(model_1)

Word2Vec<vocab=822, vector_size=100, alpha=0.025>


In [88]:
train_data['조식메뉴'].shape

(1205,)

In [89]:
sum(train_data['조식메뉴'].apply(lambda x: len(x)) <= 1)

0

이제 이렇게 훈련한 모델의 값을 벡터화 해서 풀어서 저장해줘야한다.
train셋에 기준으로 word2vec 작업을 실시하였고
word2vec모델을 입력해 입력한 변수 기준, 데이터 벡터화를 실시하였음.
test셋에 대해서는 이 모델만을 이용해 벡터화를 실시함.


벡터로 푸는 임베딩은 이 글보고 참조함.

https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/

In [90]:
def vectorize(list_of_docs, model):
    """Generate vectors for list of documents using a Word Embedding

    Args:
        list_of_docs: List of documents
        model: Gensim's Word Embedding

    Returns:
        List of document vectors
    """
    features = []

    for tokens in list_of_docs:
        zero_vector = np.zeros(model.vector_size)
        vectors = []
        for token in tokens:
            if token in model.wv:
                try:
                    vectors.append(model.wv[token])
                except KeyError:
                    continue
        if vectors:
            vectors = np.asarray(vectors)
            avg_vec = vectors.mean(axis=0)
            features.append(avg_vec)
        else:
            features.append(zero_vector)
    return features

In [91]:
tokenized_docs = list(train_data['조식메뉴'])
vectorized_docs_1 = vectorize(tokenized_docs, model=model_1)
len(vectorized_docs_1), len(vectorized_docs_1[0])

(1205, 100)

In [92]:
vectorized_docs_1

[array([ 0.12730446,  0.25760743,  0.5476122 ,  0.27350518, -0.30784732,
        -0.22400402,  0.03812586, -0.71547496,  0.32812732, -0.2744891 ,
        -0.25558513, -0.22910485,  0.4222252 , -0.8238965 , -0.17046025,
        -0.28243273, -1.7058022 , -0.10034464,  1.0404278 ,  0.3807275 ,
         0.41237986, -1.0187796 , -0.5386382 ,  0.51378214, -0.15252647,
        -0.35782152,  0.12676103,  0.03611669,  0.73582906, -0.11230759,
         0.21584266, -0.10311688, -0.24805325,  0.3214749 , -0.02654655,
        -0.09728014,  0.62089574,  0.08804606,  0.06313432,  0.31773296,
         0.15265688, -1.1944187 ,  0.36669394, -0.9995063 , -0.48781255,
         0.22823074,  0.03094223, -0.63793623, -0.248995  , -0.24155456,
         1.0661561 , -0.09106787, -0.13843694,  0.31253883,  0.7229596 ,
         0.72501636, -0.11753401,  0.01016598,  0.02216391,  0.32996213,
        -0.34470958,  0.41660374,  0.4358969 ,  0.08216914, -0.19978791,
         0.07065348, -0.5126928 , -0.74678904,  0.1

In [93]:
tokenized_docs = list(train_data['중식메뉴'])
vectorized_docs_2 = vectorize(tokenized_docs, model=model_1)
len(vectorized_docs_2), len(vectorized_docs_2[0])

(1205, 100)

In [94]:
tokenized_docs = list(train_data['석식메뉴'])
vectorized_docs_3 = vectorize(tokenized_docs, model=model_1)
len(vectorized_docs_3), len(vectorized_docs_3[0])

(1205, 100)

In [95]:
tokenized_docs = list(test_data['조식메뉴'])
test_vectorized_docs_1 = vectorize(tokenized_docs, model=model_1)
len(test_vectorized_docs_1), len(test_vectorized_docs_1[0])

(50, 100)

In [96]:
tokenized_docs = list(test_data['중식메뉴'])
test_vectorized_docs_2 = vectorize(tokenized_docs, model=model_1)
len(test_vectorized_docs_2), len(test_vectorized_docs_2[0])

(50, 100)

In [97]:
tokenized_docs = list(test_data['석식메뉴'])
test_vectorized_docs_3 = vectorize(tokenized_docs, model=model_1)
len(test_vectorized_docs_3), len(test_vectorized_docs_3[0])

(50, 100)

In [98]:
train1 = pd.DataFrame(vectorized_docs_1, columns = ['조식'+str(i) for i in range(1,101)])
train2 = pd.DataFrame(vectorized_docs_2, columns = ['중식'+str(i) for i in range(1,101)])
train3 = pd.DataFrame(vectorized_docs_3, columns = ['석식'+str(i) for i in range(1,101)])

test1 = pd.DataFrame(test_vectorized_docs_1, columns = ['조식'+str(i) for i in range(1,101)])
test2 = pd.DataFrame(test_vectorized_docs_2, columns = ['중식'+str(i) for i in range(1,101)])
test3 = pd.DataFrame(test_vectorized_docs_3, columns = ['석식'+str(i) for i in range(1,101)])

In [99]:
trains = pd.concat([train1,train2,train3],axis = 1)
trains

Unnamed: 0,조식1,조식2,조식3,조식4,조식5,조식6,조식7,조식8,조식9,조식10,...,석식91,석식92,석식93,석식94,석식95,석식96,석식97,석식98,석식99,석식100
0,0.127304,0.257607,0.547612,0.273505,-0.307847,-0.224004,0.038126,-0.715475,0.328127,-0.274489,...,-0.801821,-0.380168,-0.619642,0.123337,-0.909220,0.419394,0.231539,0.452549,0.167723,0.397080
1,0.403453,0.024484,0.247226,0.065089,-0.136926,-0.215884,-0.308194,-0.865672,0.690861,0.358931,...,-0.390211,-0.679858,-0.367707,0.231150,-0.119020,-0.072185,0.328293,0.571315,0.633374,-0.939822
2,0.216448,0.308116,0.311472,0.331251,-0.315856,0.012068,-0.184462,-0.103463,0.321294,0.440288,...,-0.010766,0.487713,-0.625857,0.548922,-0.326612,-1.426111,-0.141372,-0.274090,-0.276920,-0.127859
3,-0.192419,0.578234,-0.261870,-0.017245,-0.648358,-0.442273,-0.837201,-0.838712,0.742965,0.282547,...,0.316081,-0.904823,-1.632920,-0.194064,-1.377587,-0.649722,-0.021026,0.662347,0.552243,-0.332059
4,0.064057,0.565668,0.058803,0.173223,-0.129037,-0.107467,-0.298549,-0.311675,0.204569,0.219797,...,0.487017,-0.463689,0.164248,0.137904,-0.841847,-0.223817,-0.516131,0.308813,-0.242846,-0.034809
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1200,-2.034448,-0.353786,-1.054533,-0.883202,-0.304654,-0.983137,-0.240777,0.435433,0.045229,0.481136,...,0.417278,0.237436,-0.223325,-0.819458,-1.515428,-0.034094,0.965614,1.173566,-0.050917,0.169004
1201,-1.324221,-0.470396,-1.089208,-0.444025,-0.521843,-0.640553,0.039632,0.625747,0.125422,0.446766,...,-0.052708,-0.074582,-0.349942,-0.702761,-0.519537,0.337701,0.702355,-0.465351,-0.837733,0.407873
1202,-1.732236,-0.079323,-0.451985,-0.808451,-0.311322,-1.003150,0.249375,0.518736,0.117867,0.092310,...,0.252223,0.648180,-0.553343,-0.444173,-0.387879,-0.892189,0.174453,0.919008,0.633327,-1.756728
1203,-1.805069,-0.131774,-0.760006,-0.589073,-0.238986,-0.836542,0.193178,0.489698,0.206380,0.117974,...,-0.874287,-0.033421,-0.612048,-0.211504,-0.143802,0.533571,0.611764,-0.347412,-0.916020,-0.891743


In [100]:
tests = pd.concat([test1,test2,test3],axis = 1)
tests

Unnamed: 0,조식1,조식2,조식3,조식4,조식5,조식6,조식7,조식8,조식9,조식10,...,석식91,석식92,석식93,석식94,석식95,석식96,석식97,석식98,석식99,석식100
0,-1.821336,0.237686,-0.973366,-0.736827,0.063663,-0.808185,-0.392122,0.463136,0.20493,0.095725,...,-0.374542,-0.287505,-0.520426,-0.530235,-0.941172,0.993017,1.466026,-0.602551,-0.505692,-0.735012
1,-1.209769,-0.392556,-0.580457,-0.756506,-0.56945,-0.534412,0.069655,0.100766,0.186502,0.287415,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-1.423517,-0.344229,-0.590078,-0.400957,-0.343349,-0.42696,-0.196231,0.877243,-0.157392,-0.174624,...,-0.515465,-0.70552,-1.054055,-0.545785,-0.650279,0.397171,0.790346,-0.395544,-0.427766,-0.994062
3,-1.581092,0.041783,-0.683824,-0.687377,-0.343271,-0.593146,-0.272732,0.436274,0.06815,0.225451,...,-0.985826,-1.253553,-0.463552,-0.282734,-0.021294,0.000165,1.331887,-0.949383,-1.051228,-0.583881
4,-1.662282,-0.023853,-1.222977,-0.85743,-0.513943,-0.258975,-0.846427,0.737781,0.521505,0.228275,...,0.773249,-0.749277,-0.948147,-1.684721,0.261702,0.82228,1.130417,-2.061396,0.060347,-1.256325
5,-1.717155,-0.307514,-1.028005,-0.769036,-0.035112,-0.293,-0.149801,0.645808,-0.152813,0.612263,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,-1.646292,-0.386,-0.508134,-0.276603,-0.35951,-0.572129,0.071174,0.301029,0.071117,0.630818,...,-0.515465,-0.70552,-1.054055,-0.545785,-0.650279,0.397171,0.790346,-0.395544,-0.427766,-0.994062
7,-1.550466,0.079913,-0.925992,-1.202348,0.303168,-0.979067,-0.33805,0.553885,0.087253,-0.122708,...,-0.333308,-1.475864,-0.714185,1.48396,-0.850368,-0.469901,1.027584,0.69606,-0.891219,-0.302799
8,-1.484573,-0.453224,-0.475515,-0.466826,-0.467202,-0.449676,0.276644,0.847753,-0.060879,0.000668,...,-0.09033,-0.149433,-0.026438,-0.077751,0.42991,0.108399,0.931176,-0.918079,-0.036475,-0.168863
9,-1.787411,-0.689851,-0.867506,-1.135271,0.05941,-0.905612,0.042552,0.296691,-0.068681,0.281444,...,0.16389,1.03707,-0.483276,-0.610491,1.054497,0.64757,0.047814,-0.446598,0.053169,-0.361961


이제 조식, 중식, 석식메뉴에 결합이 완료되었으니
원본 테이블에 연결하여 준다.


In [101]:
train_data = pd.concat([train_data, trains], axis = 1)
test_data = pd.concat([test_data, tests], axis = 1)

train_data.head()

Unnamed: 0,일자,요일,본사정원수,본사휴가자수,본사출장자수,본사시간외근무명령서승인건수,현본사소속재택근무자수,조식메뉴,중식메뉴,석식메뉴,...,석식91,석식92,석식93,석식94,석식95,석식96,석식97,석식98,석식99,석식100
0,2016-02-01,월,2601,50,150,238,0.0,"[모닝롤, 찐빵, 우유, 두유, 주스, 계란후라이, 호두죽, 쌀밥, (쌀:국내산),...","[쌀밥, 잡곡밥, (쌀,현미흑미:국내산), 오징어찌개, 쇠불고기, (쇠고기:호주산)...","[쌀밥, 잡곡밥, (쌀,현미흑미:국내산), 육개장, 자반고등어구이, 두부조림, 건파...",...,-0.801821,-0.380168,-0.619642,0.123337,-0.90922,0.419394,0.231539,0.452549,0.167723,0.39708
1,2016-02-02,화,2601,50,173,319,0.0,"[모닝롤, 단호박샌드, 우유, 두유, 주스, 계란후라이, 팥죽, 쌀밥, (쌀:국내산...","[쌀밥, 잡곡밥, (쌀,현미흑미:국내산), 김치찌개, 가자미튀김, 모둠소세지구이, ...","[콩나물밥*양념장, (쌀,현미흑미:국내산), 어묵국, 유산슬, (쇠고기:호주산), ...",...,-0.390211,-0.679858,-0.367707,0.23115,-0.11902,-0.072185,0.328293,0.571315,0.633374,-0.939822
2,2016-02-03,수,2601,56,180,111,0.0,"[모닝롤, 베이글, 우유, 두유, 주스, 계란후라이, 표고버섯죽, 쌀밥, (쌀:국내...","[카레덮밥, (쌀,현미흑미:국내산), 팽이장국, 치킨핑거, (닭고기:국내산), 쫄면...","[쌀밥, 잡곡밥, (쌀,현미흑미:국내산), 청국장찌개, 황태양념구이, (황태:러시아...",...,-0.010766,0.487713,-0.625857,0.548922,-0.326612,-1.426111,-0.141372,-0.27409,-0.27692,-0.127859
3,2016-02-04,목,2601,104,220,355,0.0,"[모닝롤, 토마토샌드, 우유, 두유, 주스, 계란후라이, 닭죽, 쌀밥, (쌀,닭:국...","[쌀밥, 잡곡밥, (쌀,현미흑미:국내산), 쇠고기무국, 주꾸미볶음, 부추전, 시금치...","[미니김밥*겨자장, (쌀,현미흑미:국내산), 우동, 멕시칸샐러드, 군고구마, 무피클...",...,0.316081,-0.904823,-1.63292,-0.194064,-1.377587,-0.649722,-0.021026,0.662347,0.552243,-0.332059
4,2016-02-05,금,2601,278,181,34,0.0,"[모닝롤, 와플, 우유, 두유, 주스, 계란후라이, 쇠고기죽, 쌀밥, (쌀:국내산)...","[쌀밥, 잡곡밥, (쌀,현미흑미:국내산), 떡국, 돈육씨앗강정, (돼지고기:국내산)...","[쌀밥, 잡곡밥, (쌀,현미흑미:국내산), 차돌박이찌개, (쇠고기:호주산), 닭갈비...",...,0.487017,-0.463689,0.164248,0.137904,-0.841847,-0.223817,-0.516131,0.308813,-0.242846,-0.034809


길이 맞는지 확인.

In [102]:
train_data.shape

(1205, 312)

In [103]:
test_data.shape

(50, 310)

In [104]:
train_data.to_csv("Embedding_train_data.csv", index = False, encoding = 'utf-8-sig')
test_data.to_csv("Embedding_test_data.csv", index = False, encoding = 'utf-8-sig')

뭔가 불안해서, 체크해봤는데 값 존재함.

In [105]:
train_data.중식계

0       1039.0
1        867.0
2       1017.0
3        978.0
4        925.0
         ...  
1200    1093.0
1201     832.0
1202     579.0
1203    1145.0
1204    1015.0
Name: 중식계, Length: 1205, dtype: float64

In [106]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1205 entries, 0 to 1204
Columns: 312 entries, 일자 to 석식100
dtypes: float32(100), float64(203), int64(4), object(5)
memory usage: 2.4+ MB
