Spark Transform (ETL)

Data layout
    
    The label and feature column names can be configured through the 'labelCol' and 'featuresCol' parameters.
    
    When the default column names are used:
        Type	Columns
        DataFrame	'label' (DoubleType), 'features' (sparse or dense vectors)
        RDD	LabeledPoint

-----------------------------

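DataFrame transformation

    RDDs are processed with map-reduce style operations; DataFrames are not.
        DataFrames are instead processed with Transformers and Estimators,
        and several of these stages can be chained in a Pipeline to produce the result.
        
    Estimator : holds the model parameters and applies them to the data.
        Returns a Transformer; called through fit().
    Transformer : selects and transforms columns.
        Returns a DataFrame; called through transform().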
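
The fit()/transform() contract can be illustrated without Spark. The following is a minimal plain-Python 3 sketch, not Spark's implementation: ToyIndexer and ToyIndexerModel are hypothetical names that stand in for an Estimator and the Transformer it returns.

```python
# Toy illustration of the Estimator / Transformer contract (not Spark code).
# ToyIndexer and ToyIndexerModel are hypothetical names used only here.


class ToyIndexerModel(object):
    """Plays the role of a Transformer: maps rows to rows with an added column."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, rows):
        # Return new rows; the input list is left unchanged.
        return [(value, self.mapping[value]) for value in rows]


class ToyIndexer(object):
    """Plays the role of an Estimator: learns from data and returns a model."""

    def fit(self, rows):
        # "Training" here just assigns an index to each distinct value.
        distinct = sorted(set(rows))
        return ToyIndexerModel({v: float(i) for i, v in enumerate(distinct)})


rows = ["b", "a", "b", "c"]
model = ToyIndexer().fit(rows)      # Estimator.fit() -> Transformer
indexed = model.transform(rows)     # Transformer.transform() -> new data
print(indexed)
```

The point of the split is that everything learned from the data lives in the object returned by fit(), so the same fitted model can later be applied to other data.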

------------------------------------------------------------------------

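Feature extraction -> Feature vectors
Classification -> Class or Label (the two terms mean the same thing)

Text is represented as a 'bag of words'
    => word order carries no meaning!
    
    corpus: a collection of documents
    document: a record (a sentence can be treated as a document)
    vocabulary: the set of distinct words
    word vector: presence/absence, term frequency, and so on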
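
As a small illustration of these terms, the plain-Python 3 sketch below builds a vocabulary from a three-document toy corpus and turns each document into a count vector and a presence vector. The corpus and the function names are made up for this example.

```python
# Bag-of-words sketch: word order is discarded, only per-word counts remain.
from collections import Counter

corpus = ["let it be", "be it let", "let it be let it be"]

# Vocabulary: the distinct words of the corpus, in a fixed (sorted) order.
vocabulary = sorted({word for document in corpus for word in document.split()})


def count_vector(document):
    """Term frequency of every vocabulary word in one document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


def presence_vector(document):
    """1 if the vocabulary word occurs in the document, else 0."""
    return [1 if c > 0 else 0 for c in count_vector(document)]


count_vectors = [count_vector(d) for d in corpus]
presence_vectors = [presence_vector(d) for d in corpus]
print(vocabulary)
print(count_vectors)
print(presence_vectors)
```

The first two documents contain the same words in a different order and therefore get identical vectors.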
  

In [1]:
import os
import sys

os.environ["PYLIB"]=os.path.join(os.environ["SPARK_HOME"],'python','lib')
sys.path.insert(0,os.path.join(os.environ["PYLIB"],'py4j-0.10.1-src.zip'))
sys.path.insert(0,os.path.join(os.environ["PYLIB"],'pyspark.zip'))

import pyspark
myConf=pyspark.SparkConf()
spark = pyspark.sql.SparkSession.builder\
    .master("local")\
    .appName("myApp")\
    .config("spark.sql.warehouse.dir", "C:/Users/G312")\
    .getOrCreate()

In [2]:
doc=[
    "When I find myself in times of trouble",
    "Mother Mary comes to me",
    "Speaking words of wisdom, let it be",
    "And in my hour of darkness",
    "She is standing right in front of me",
    "Speaking words of wisdom, let it be",
    "Let it be",
    "Let it be",
    "Let it be",
    "Let it be",
    "Whisper words of wisdom, let it be"
]


In [4]:
# Count word frequencies in plain Python.
d={}
for sentence in doc:
    words=sentence.split()
    for word in words:
        if word in d:
            d[word]+=1
        else:
            d[word]=1
            
for k,v in d.iteritems():
    print k,v

right 1
be 7
is 1
When 1
it 7
in 3
Mary 1
Speaking 2
standing 1
darkness 1
find 1
wisdom, 3
to 1
Let 4
And 1
I 1
let 3
She 1
words 3
Mother 1
front 1
trouble 1
me 2
myself 1
hour 1
of 6
times 1
Whisper 1
my 1
comes 1


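TfidfTransformer computes TF-IDF (Term Frequency-Inverse Document Frequency).
    
    Step 1: Split sentences into words with Tokenizer.
    Step 2: Compute the term frequency (tf) with CountVectorizer.
    Step 3: Compute the 'word vector' with HashingTF. HashingTF derives each word's index
    from a hash function, so different words can collide on the same index. The number of
    indices is fixed by numFeatures, and a larger value lowers the chance of collisions.
    Step 4: Compute the IDF.
    Step 5: Compute the TF-IDF.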

In [5]:
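# TF-IDF is a measure used in machine learning to express how 'important' a word is.

# TfidfTransformer computes TF-IDF.
# TF (term frequency)
    # how often a word occurs, with stopwords excluded.
# df (document frequency)
    # the number of documents (sentences) in which a word appears.
# N (number of documents)
    # the total number of documents.
# idf (inverse document frequency)
    # the inverse of the fraction of documents that contain the word.
    # 1 is added to avoid division by zero.
    
# Compute TF-IDF for the word 'wisdom' in the documents above (only the idf term is printed here).
import math
tf=1./4 # the sentence has 4 words left, each occurring once, so the tf of 'wisdom' is 1./4
df=3.
N=11.
idf=math.log((N+1)/(df+1))+1 # log is the natural logarithm.
print idf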

2.09861228867
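
The same formula can be applied to every word of the lyrics. The sketch below is plain Python 3 and calls neither sklearn nor Spark; the tokenizer regex is an assumption meant to approximate sklearn's default (lowercase, tokens of two or more word characters), and smoothed_idf is a name chosen for this example.

```python
import math
import re

doc = [
    "When I find myself in times of trouble",
    "Mother Mary comes to me",
    "Speaking words of wisdom, let it be",
    "And in my hour of darkness",
    "She is standing right in front of me",
    "Speaking words of wisdom, let it be",
    "Let it be",
    "Let it be",
    "Let it be",
    "Let it be",
    "Whisper words of wisdom, let it be",
]


def tokenize(sentence):
    # Lowercase and keep runs of two or more word characters (assumed pattern).
    return re.findall(r"\b\w\w+\b", sentence.lower())


def smoothed_idf(documents):
    """idf(t) = ln((N + 1) / (df(t) + 1)) + 1 for every term t."""
    n = len(documents)
    df = {}
    for sentence in documents:
        for word in set(tokenize(sentence)):  # count each document once
            df[word] = df.get(word, 0) + 1
    return {w: math.log((n + 1.0) / (c + 1.0)) + 1.0 for w, c in df.items()}


idf = smoothed_idf(doc)
print(idf["wisdom"])  # df = 3
print(idf["let"])     # df = 7
```

A word that appears in many documents, such as 'let', receives a smaller idf than a rarer word such as 'wisdom'.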


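Spark computes TF-IDF in much the same way as 'sklearn'. One difference to keep in mind: sklearn's smoothed idf is ln((N+1)/(df+1)) + 1, whereas Spark's IDF is log((N+1)/(df+1)) without the trailing + 1. CountVectorizer produces a document x term table of counts. TF-IDF can then be computed, and its results are printed per (document id, term id) pair.

The sklearn approach is shown below.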

In [6]:
# Use CountVectorizer to produce a document x term table of counts.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000) 
vectorizer = CountVectorizer() # replaces the configuration above with the defaults

In [7]:
print vectorizer.fit_transform(doc).todense() # print the word vectors

[[0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 1 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1]
 [1 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1]
 [0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1]]


In [8]:
print vectorizer.vocabulary_ # index of each word (assigned automatically)

{u'and': 0, u'be': 1, u'right': 17, u'whisper': 25, u'is': 8, u'it': 9, u'wisdom': 26, u'me': 12, u'let': 10, u'words': 27, u'in': 7, u'front': 5, u'trouble': 23, u'find': 4, u'standing': 20, u'comes': 2, u'myself': 15, u'darkness': 3, u'hour': 6, u'of': 16, u'when': 24, u'times': 21, u'to': 22, u'she': 18, u'mother': 13, u'my': 14, u'mary': 11, u'speaking': 19}


In [13]:
# Compute TF-IDF with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english',norm = None)

In [10]:
print vectorizer.fit_transform(doc) 
# Output format: (document index, term index)       TF-IDF value
# (2, 12) -> the TF-IDF of 'wisdom' (term index 12) in the third document (index 2).

  (0, 9)	2.791759469228055
  (0, 10)	2.791759469228055
  (1, 5)	2.791759469228055
  (1, 4)	2.791759469228055
  (1, 0)	2.791759469228055
  (2, 7)	2.386294361119891
  (2, 13)	2.09861228866811
  (2, 12)	2.09861228866811
  (2, 3)	1.4054651081081644
  (3, 2)	2.791759469228055
  (3, 1)	2.791759469228055
  (4, 8)	2.791759469228055
  (4, 6)	2.791759469228055
  (5, 7)	2.386294361119891
  (5, 13)	2.09861228866811
  (5, 12)	2.09861228866811
  (5, 3)	1.4054651081081644
  (6, 3)	1.4054651081081644
  (7, 3)	1.4054651081081644
  (8, 3)	1.4054651081081644
  (9, 3)	1.4054651081081644
  (10, 13)	2.09861228866811
  (10, 12)	2.09861228866811
  (10, 3)	1.4054651081081644
  (10, 11)	2.791759469228055


In [11]:
print vectorizer.vocabulary_

{u'standing': 8, u'right': 6, u'darkness': 1, u'hour': 2, u'whisper': 11, u'times': 9, u'let': 3, u'speaking': 7, u'words': 13, u'mother': 5, u'trouble': 10, u'wisdom': 12, u'mary': 4, u'comes': 0}


In [12]:
print vectorizer.idf_

[2.79175947 2.79175947 2.79175947 1.40546511 2.79175947 2.79175947
 2.79175947 2.38629436 2.79175947 2.79175947 2.79175947 2.79175947
 2.09861229 2.09861229]


Sentiment : can be positive or negative
Sentiment analysis : classifying text by the sentiment it expresses

In [14]:
doc=[
    ["When I find myself in times of trouble"],
    ["Mother Mary comes to me"],
    ["Speaking words of wisdom, let it be"],
    ["And in my hour of darkness"],
    ["She is standing right in front of me"],
    ["Speaking words of wisdom, let it be"],
    [u"우리 Let it be"],
    [u"나 Let it be"],
    [u"너 Let it be"],
    ["Let it be"],
    ["Whisper words of wisdom, let it be"]
]

myDf=spark.createDataFrame(doc,['sent'])
myDf.show(truncate=False)

+--------------------------------------+
|sent                                  |
+--------------------------------------+
|When I find myself in times of trouble|
|Mother Mary comes to me               |
|Speaking words of wisdom, let it be   |
|And in my hour of darkness            |
|She is standing right in front of me  |
|Speaking words of wisdom, let it be   |
|우리 Let it be                          |
|나 Let it be                           |
|너 Let it be                           |
|Let it be                             |
|Whisper words of wisdom, let it be    |
+--------------------------------------+



# StringIndexer

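Types of variables

        Type        Description                                    Example

        nominal     names or categories                            lion, tiger, human

        ordinal     like nominal, but the values have an order     height: low, med, high

        interval    values fall into evenly spaced ranges          150-165, 165-180, 180-195

StringIndexer stores strings as index values (doubles).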
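
By default StringIndexer orders labels by frequency, so the most frequent string gets index 0.0. The plain-Python 3 sketch below mimics that behaviour on a toy list; fit_string_indexer is a hypothetical helper, and the alphabetical tie-breaking is an assumption made only to keep the example deterministic.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct string to a double index, most frequent first."""
    counts = Counter(values)
    # Ties are broken alphabetically here (an assumption of this sketch).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}


values = ["cat", "dog", "cat", "bird", "cat", "dog"]
mapping = fit_string_indexer(values)
indexed = [mapping[v] for v in values]
print(mapping)
print(indexed)
```

This is also why, in the Spark output of the next cell, the only sentence that occurs twice is the one indexed 0.0.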

In [16]:
from pyspark.ml.feature import StringIndexer
labelIndexer = StringIndexer(inputCol="sent", outputCol="sentLabel")
model=labelIndexer.fit(myDf)
siDf=model.transform(myDf)
siDf.show()

+--------------------+---------+
|                sent|sentLabel|
+--------------------+---------+
|When I find mysel...|      9.0|
|Mother Mary comes...|      8.0|
|Speaking words of...|      0.0|
|And in my hour of...|      5.0|
|She is standing r...|      4.0|
|Speaking words of...|      0.0|
|        우리 Let it be|      6.0|
|         나 Let it be|      1.0|
|         너 Let it be|      2.0|
|           Let it be|      7.0|
|Whisper words of ...|      3.0|
+--------------------+---------+



# Tokenizer

Splits a sentence into tokens such as words.

The words are returned as an array whose elements are strings.
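
As a rough plain-Python 3 equivalent, the sketch below lowercases a sentence and splits it on whitespace. It only approximates Tokenizer (for instance, Python's split() collapses repeated spaces), and simple_tokenize is a name invented for this example.

```python
def simple_tokenize(sentence):
    # Lowercase, then split on whitespace; punctuation stays attached to words.
    return sentence.lower().split()


tokens = simple_tokenize("Speaking words of wisdom, let it be")
print(tokens)
```

Note that the comma stays glued to 'wisdom,' because only whitespace separates tokens.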

In [17]:
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol="sent", outputCol="words")
tokDf = tokenizer.transform(myDf)
for r in tokDf.select("sent", "words").take(3):
    print r

Row(sent=u'When I find myself in times of trouble', words=[u'when', u'i', u'find', u'myself', u'in', u'times', u'of', u'trouble'])
Row(sent=u'Mother Mary comes to me', words=[u'mother', u'mary', u'comes', u'to', u'me'])
Row(sent=u'Speaking words of wisdom, let it be', words=[u'speaking', u'words', u'of', u'wisdom,', u'let', u'it', u'be'])


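# RegexTokenizer

Regular Expression Tokenizer

A pattern can be supplied to control how words are split.

The \w pattern does not match Hangul (Korean) characters.

Use the whitespace pattern \s instead; it matches spaces, tabs, carriage returns, newlines, and so on.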
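
The difference between the two patterns can be reproduced with Python 3's re module. This is only a sketch: Python's \w matches Hangul by default, so the re.ASCII flag is used here as an assumed stand-in for the ASCII-only behaviour described above, and regex_tokenize is a hypothetical helper.

```python
import re


def regex_tokenize(sentence, pattern, flags=0):
    # Lowercase, split on the delimiter pattern, and drop empty tokens.
    return [t for t in re.split(pattern, sentence.lower(), flags=flags) if t]


sentence = u"우리 Let it be"

# Splitting on whitespace keeps the Hangul word.
by_space = regex_tokenize(sentence, r"\s+")

# Splitting on ASCII non-word characters treats Hangul as a delimiter and loses it.
by_non_word = regex_tokenize(sentence, r"\W+", flags=re.ASCII)

print(by_space)
print(by_non_word)
```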

In [18]:
from pyspark.ml.feature import RegexTokenizer
re = RegexTokenizer(inputCol="sent", outputCol="wordsReg", pattern="\\s+")
reDf=re.transform(myDf)
reDf.show()

+--------------------+--------------------+
|                sent|            wordsReg|
+--------------------+--------------------+
|When I find mysel...|[when, i, find, m...|
|Mother Mary comes...|[mother, mary, co...|
|Speaking words of...|[speaking, words,...|
|And in my hour of...|[and, in, my, hou...|
|She is standing r...|[she, is, standin...|
|Speaking words of...|[speaking, words,...|
|        우리 Let it be|   [우리, let, it, be]|
|         나 Let it be|    [나, let, it, be]|
|         너 Let it be|    [너, let, it, be]|
|           Let it be|       [let, it, be]|
|Whisper words of ...|[whisper, words, ...|
+--------------------+--------------------+



# Stopwords
Stopwords are words to be excluded, such as very short, common words.

StopWordsRemover must be applied after tokenization.
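
Conceptually, stopword removal is a filter over an already tokenized list. The plain-Python 3 sketch below shows the idea with a deliberately tiny English list extended by custom Korean entries; the list contents and remove_stopwords are illustrative, not Spark's defaults.

```python
def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword list (case-insensitive)."""
    stopword_set = set(w.lower() for w in stopwords)
    return [t for t in tokens if t.lower() not in stopword_set]


english_stopwords = ["it", "be", "of", "in"]        # small sample only
custom_stopwords = [u"나", u"너", u"우리"]          # user-supplied additions

filtered = remove_stopwords(
    [u"우리", "let", "it", "be"], english_stopwords + custom_stopwords
)
print(filtered)
```

Because the filter compares whole tokens, it only works once the sentence has been split, which is why tokenization has to come first.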

In [19]:
from pyspark.ml.feature import StopWordsRemover
stop = StopWordsRemover(inputCol="wordsReg", outputCol="nostops")

In [23]:
stopwords=list()
_stopwords=stop.getStopWords()
for e in _stopwords:
    stopwords.append(e)

_mystopwords=[u"나",u"너", u"우리"]
for e in _mystopwords:
    stopwords.append(e)
stop.setStopWords(stopwords)

StopWordsRemover_4ce7a7977fb99c486253

In [24]:
for e in stop.getStopWords():
    print e,

i me my myself we our ours ourselves you your yours yourself yourselves he him his himself she her hers herself it its itself they them their theirs themselves what which who whom this that these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don should now d ll m o re ve y ain aren couldn didn doesn hadn hasn haven isn ma mightn mustn needn shan shouldn wasn weren won wouldn 나 너 우리


In [25]:
stopDf=stop.transform(reDf)
stopDf.show()

+--------------------+--------------------+--------------------+
|                sent|            wordsReg|             nostops|
+--------------------+--------------------+--------------------+
|When I find mysel...|[when, i, find, m...|[find, times, tro...|
|Mother Mary comes...|[mother, mary, co...|[mother, mary, co...|
|Speaking words of...|[speaking, words,...|[speaking, words,...|
|And in my hour of...|[and, in, my, hou...|    [hour, darkness]|
|She is standing r...|[she, is, standin...|[standing, right,...|
|Speaking words of...|[speaking, words,...|[speaking, words,...|
|        우리 Let it be|   [우리, let, it, be]|               [let]|
|         나 Let it be|    [나, let, it, be]|               [let]|
|         너 Let it be|    [너, let, it, be]|               [let]|
|           Let it be|       [let, it, be]|               [let]|
|Whisper words of ...|[whisper, words, ...|[whisper, words, ...|
+--------------------+--------------------+--------------------+



In [27]:
stopDf.show(truncate=False)

+--------------------------------------+-----------------------------------------------+-------------------------------+
|sent                                  |wordsReg                                       |nostops                        |
+--------------------------------------+-----------------------------------------------+-------------------------------+
|When I find myself in times of trouble|[when, i, find, myself, in, times, of, trouble]|[find, times, trouble]         |
|Mother Mary comes to me               |[mother, mary, comes, to, me]                  |[mother, mary, comes]          |
|Speaking words of wisdom, let it be   |[speaking, words, of, wisdom,, let, it, be]    |[speaking, words, wisdom,, let]|
|And in my hour of darkness            |[and, in, my, hour, of, darkness]              |[hour, darkness]               |
|She is standing right in front of me  |[she, is, standing, right, in, front, of, me]  |[standing, right, front]       |
|Speaking words of wisdom, let i

# CountVectorizer
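Input: a collection of text documents

Output: word vectors (sparse), vocabulary x TF

Use after tokenizing (stopwords should also be removed first).

minDF

    A fractional value is a ratio: documents containing the term / total documents.
    An integer value is a document count: the minimum number of documents a term must appear in.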
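
The two readings of minDF, and the sparse (size, indices, values) form of the output, can be sketched in plain Python 3. The helpers fit_vocabulary and to_sparse are hypothetical, the toy documents are made up, and the alphabetical tie-breaking is an assumption of this sketch rather than Spark's behaviour.

```python
from collections import Counter


def fit_vocabulary(token_lists, min_df=1.0, vocab_size=30):
    n_docs = len(token_lists)
    # Below 1.0 the value is read as a fraction of documents, otherwise as a count.
    threshold = min_df if min_df >= 1.0 else min_df * n_docs
    term_counts = Counter()
    doc_counts = Counter()
    for tokens in token_lists:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))   # each document counts once per term
    kept = [t for t in term_counts if doc_counts[t] >= threshold]
    # Most frequent terms first; ties broken alphabetically (assumption).
    kept.sort(key=lambda t: (-term_counts[t], t))
    return kept[:vocab_size]


def to_sparse(tokens, vocabulary):
    """Return (vector size, sorted term indices, counts at those indices)."""
    index = {t: i for i, t in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    indices = sorted(counts)
    return (len(vocabulary), indices, [float(counts[i]) for i in indices])


docs = [["let"], ["let", "words"], ["words", "wisdom", "let"]]
vocab_all = fit_vocabulary(docs, min_df=1.0)    # keep terms in >= 1 document
vocab_half = fit_vocabulary(docs, min_df=0.5)   # keep terms in >= 50% of documents
sparse = to_sparse(docs[2], vocab_all)
print(vocab_all, vocab_half, sparse)
```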

In [28]:
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol="nostops", outputCol="cv",
    vocabSize=30,minDF=1.0) # the cv column is represented as sparse vectors
cvModel = cv.fit(stopDf)
cvDf = cvModel.transform(stopDf)

cvDf.collect()
cvDf.select('sent','nostops','cv').show()
for v in cvModel.vocabulary:
    print v,

+--------------------+--------------------+--------------------+
|                sent|             nostops|                  cv|
+--------------------+--------------------+--------------------+
|When I find mysel...|[find, times, tro...|(16,[5,6,8],[1.0,...|
|Mother Mary comes...|[mother, mary, co...|(16,[10,13,14],[1...|
|Speaking words of...|[speaking, words,...|(16,[0,1,2,3],[1....|
|And in my hour of...|    [hour, darkness]|(16,[7,9],[1.0,1.0])|
|She is standing r...|[standing, right,...|(16,[4,12,15],[1....|
|Speaking words of...|[speaking, words,...|(16,[0,1,2,3],[1....|
|        우리 Let it be|               [let]|      (16,[0],[1.0])|
|         나 Let it be|               [let]|      (16,[0],[1.0])|
|         너 Let it be|               [let]|      (16,[0],[1.0])|
|           Let it be|               [let]|      (16,[0],[1.0])|
|Whisper words of ...|[whisper, words, ...|(16,[0,1,2,11],[1...|
+--------------------+--------------------+--------------------+

let wisdom, words speaki

In [29]:
cvDf.select('sent','nostops','cv').show(truncate=False)

+--------------------------------------+-------------------------------+---------------------------------+
|sent                                  |nostops                        |cv                               |
+--------------------------------------+-------------------------------+---------------------------------+
|When I find myself in times of trouble|[find, times, trouble]         |(16,[5,6,8],[1.0,1.0,1.0])       |
|Mother Mary comes to me               |[mother, mary, comes]          |(16,[10,13,14],[1.0,1.0,1.0])    |
|Speaking words of wisdom, let it be   |[speaking, words, wisdom,, let]|(16,[0,1,2,3],[1.0,1.0,1.0,1.0]) |
|And in my hour of darkness            |[hour, darkness]               |(16,[7,9],[1.0,1.0])             |
|She is standing right in front of me  |[standing, right, front]       |(16,[4,12,15],[1.0,1.0,1.0])     |
|Speaking words of wisdom, let it be   |[speaking, words, wisdom,, let]|(16,[0,1,2,3],[1.0,1.0,1.0,1.0]) |
|우리 Let it be                        

# TF-IDF
Term frequency-inverse document frequency (TF-IDF)

Must be used after tokenizing.

HashingTF : fixed-length word vectors.
 
IDF : rescales the term-frequency vectors by inverse document frequency.
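
The hashing idea behind fixed-length word vectors can be sketched in plain Python 3. This is not Spark's implementation: zlib.crc32 is used here only as a convenient, deterministic stand-in hash, so the indices will not match Spark's, and hashing_tf is a name chosen for this example.

```python
import zlib


def hashing_tf(tokens, num_features):
    """Map tokens to a sparse {index: count} vector of fixed length num_features."""
    vector = {}
    for token in tokens:
        # Stand-in hash (assumption); Spark uses a different hash function.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


tokens = ["speaking", "words", "wisdom,", "let"]
wide = hashing_tf(tokens, 50)    # many buckets: collisions are unlikely
narrow = hashing_tf(tokens, 2)   # two buckets: four words must share them
print(wide)
print(narrow)
```

No vocabulary has to be built or stored, but with too few buckets unrelated words end up sharing an index and their counts are added together.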

In [30]:
from pyspark.ml.feature import HashingTF, IDF

hashTF = HashingTF(inputCol="nostops", outputCol="hash", numFeatures=50)
hashDf = hashTF.transform(stopDf)
idf = IDF(inputCol="hash", outputCol="idf")
idfModel = idf.fit(hashDf)
idfDf = idfModel.transform(hashDf)
for e in idfDf.select("nostops","hash").take(10):
    print(e)

Row(nostops=[u'find', u'times', u'trouble'], hash=SparseVector(50, {10: 1.0, 24: 1.0, 43: 1.0}))
Row(nostops=[u'mother', u'mary', u'comes'], hash=SparseVector(50, {1: 1.0, 21: 1.0, 24: 1.0}))
Row(nostops=[u'speaking', u'words', u'wisdom,', u'let'], hash=SparseVector(50, {9: 1.0, 12: 1.0, 14: 1.0, 41: 1.0}))
Row(nostops=[u'hour', u'darkness'], hash=SparseVector(50, {23: 1.0, 27: 1.0}))
Row(nostops=[u'standing', u'right', u'front'], hash=SparseVector(50, {24: 1.0, 43: 1.0, 46: 1.0}))
Row(nostops=[u'speaking', u'words', u'wisdom,', u'let'], hash=SparseVector(50, {9: 1.0, 12: 1.0, 14: 1.0, 41: 1.0}))
Row(nostops=[u'let'], hash=SparseVector(50, {14: 1.0}))
Row(nostops=[u'let'], hash=SparseVector(50, {14: 1.0}))
Row(nostops=[u'let'], hash=SparseVector(50, {14: 1.0}))
Row(nostops=[u'let'], hash=SparseVector(50, {14: 1.0}))


# NGram
A unigram consists of one word, a bigram of two consecutive words.
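
A minimal plain-Python 3 sketch of the sliding window, assuming tokens are joined with a single space; the function name ngrams is chosen for this example.

```python
def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams(["mother", "mary", "comes", "to", "me"], n=2)
unigrams = ngrams(["let", "it", "be"], n=1)
too_short = ngrams(["let"], n=2)   # fewer than n tokens yields no n-grams
print(bigrams)
print(unigrams)
print(too_short)
```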

In [32]:
from pyspark.ml.feature import NGram
ngram = NGram(n=2, inputCol="words", outputCol="ngrams") # n=2 builds bigrams
ngramDf = ngram.transform(tokDf)
ngramDf.show(truncate=False)
for e in ngramDf.select("words","ngrams").take(3):
    print e
    
# Choose n to match the span of text that expresses the sentiment or opinion.
# e.g. n=1: '좋다' (good), '나쁘다' (bad); n=2: '영화가 좋다' (the movie is good), '영화가 나쁘다' (the movie is bad)

+--------------------------------------+-----------------------------------------------+--------------------------------------------------------------------------+
|sent                                  |words                                          |ngrams                                                                    |
+--------------------------------------+-----------------------------------------------+--------------------------------------------------------------------------+
|When I find myself in times of trouble|[when, i, find, myself, in, times, of, trouble]|[when i, i find, find myself, myself in, in times, times of, of trouble]  |
|Mother Mary comes to me               |[mother, mary, comes, to, me]                  |[mother mary, mary comes, comes to, to me]                                |
|Speaking words of wisdom, let it be   |[speaking, words, of, wisdom,, let, it, be]    |[speaking words, words of, of wisdom,, wisdom, let, let it, it be]        |
|And in my hour 

# Transforming continuous data

In [33]:
from pyspark.sql.types import *
rdd=spark.sparkContext\
    .textFile(os.path.join('data','ds_spark_heightweight.txt'))

myRdd=rdd.map(lambda line:[float(x) for x in line.split('\t')])
myDf=spark.createDataFrame(myRdd,["id","weight","height"])

In [34]:
from pyspark.ml.feature import Binarizer
binarizer = Binarizer(threshold=68.0, inputCol="weight", outputCol="weight2")
binDf = binarizer.transform(myDf)
binDf.show(10)
# Binarizer: splits the data into 0/1 values according to the threshold.

+----+------+------+-------+
|  id|weight|height|weight2|
+----+------+------+-------+
| 1.0| 65.78|112.99|    0.0|
| 2.0| 71.52|136.49|    1.0|
| 3.0|  69.4|153.03|    1.0|
| 4.0| 68.22|142.34|    1.0|
| 5.0| 67.79| 144.3|    0.0|
| 6.0|  68.7| 123.3|    1.0|
| 7.0|  69.8|141.49|    1.0|
| 8.0| 70.01|136.46|    1.0|
| 9.0|  67.9|112.37|    0.0|
|10.0| 66.78|120.67|    0.0|
+----+------+------+-------+
only showing top 10 rows
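
The thresholding rule is simple enough to restate in plain Python 3: values strictly greater than the threshold become 1.0, everything else (including a value equal to the threshold) becomes 0.0. The function binarize and the last sample value are made up for this sketch.

```python
def binarize(values, threshold):
    # Strictly greater than the threshold -> 1.0, otherwise 0.0.
    return [1.0 if v > threshold else 0.0 for v in values]


weights = [65.78, 71.52, 69.4, 68.22, 67.79, 68.0]
flags = binarize(weights, threshold=68.0)
print(flags)
```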



In [36]:
from pyspark.ml.feature import QuantileDiscretizer

discretizer = QuantileDiscretizer(numBuckets=3, inputCol="height", outputCol="height3")
qdDf = discretizer.fit(binDf).transform(binDf)
qdDf.show(10)

+----+------+------+-------+-------+
|  id|weight|height|weight2|height3|
+----+------+------+-------+-------+
| 1.0| 65.78|112.99|    0.0|    0.0|
| 2.0| 71.52|136.49|    1.0|    1.0|
| 3.0|  69.4|153.03|    1.0|    2.0|
| 4.0| 68.22|142.34|    1.0|    2.0|
| 5.0| 67.79| 144.3|    0.0|    2.0|
| 6.0|  68.7| 123.3|    1.0|    0.0|
| 7.0|  69.8|141.49|    1.0|    2.0|
| 8.0| 70.01|136.46|    1.0|    1.0|
| 9.0|  67.9|112.37|    0.0|    0.0|
|10.0| 66.78|120.67|    0.0|    0.0|
+----+------+------+-------+-------+
only showing top 10 rows



# VectorAssembler
Combines several columns into a single Vector column.

String columns cannot be assembled.

Use pyspark.ml.linalg.Vectors. (Note: do not use pyspark.mllib.linalg.Vectors.)
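
Conceptually, assembling is concatenating the numeric values of the chosen columns row by row. The plain-Python 3 sketch below also picks between a sparse and a dense form; the size rule in it is an assumption chosen because it reproduces the features column shown in the output of the next cell (an all-zero row printed as (2,[],[])), and assemble is a hypothetical helper.

```python
def assemble(row, input_cols):
    """Concatenate numeric columns of one row into a dense or sparse vector."""
    values = [float(row[c]) for c in input_cols]   # strings would fail here
    nonzero = [(i, v) for i, v in enumerate(values) if v != 0.0]
    size = len(values)
    # Assumed heuristic: use the sparse form only when it is clearly smaller.
    if 1.5 * (len(nonzero) + 1.0) < size:
        return ("sparse", size, [i for i, _ in nonzero], [v for _, v in nonzero])
    return ("dense", values)


rows = [
    {"weight2": 0.0, "height3": 0.0},
    {"weight2": 1.0, "height3": 2.0},
    {"weight2": 0.0, "height3": 2.0},
]
features = [assemble(r, ["weight2", "height3"]) for r in rows]
print(features)
```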

In [37]:
# Columns needed for machine learning: label, features
# Build the features column with VectorAssembler.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

va = VectorAssembler(inputCols=["weight2","height3"],outputCol="features") 
vaDf = va.transform(qdDf)
vaDf.printSchema()
vaDf.show(5)

root
 |-- id: double (nullable = true)
 |-- weight: double (nullable = true)
 |-- height: double (nullable = true)
 |-- weight2: double (nullable = true)
 |-- height3: double (nullable = true)
 |-- features: vector (nullable = true)

+---+------+------+-------+-------+---------+
| id|weight|height|weight2|height3| features|
+---+------+------+-------+-------+---------+
|1.0| 65.78|112.99|    0.0|    0.0|(2,[],[])|
|2.0| 71.52|136.49|    1.0|    1.0|[1.0,1.0]|
|3.0|  69.4|153.03|    1.0|    2.0|[1.0,2.0]|
|4.0| 68.22|142.34|    1.0|    2.0|[1.0,2.0]|
|5.0| 67.79| 144.3|    0.0|    2.0|[0.0,2.0]|
+---+------+------+-------+-------+---------+
only showing top 5 rows



+-------------------------------------------------------------------------------------------------------------------------------------+
|sent                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------+
|존경하는 국민 여러분, 경찰관 여러분, 일흔네 돌 ‘경찰의 날’입니다.                                                                                              |
|                                                                                                                                     |
|국민의 안전을 위해 밤낮없이 애쓰시는 전국의 15만 경찰관 여러분께 먼저 감사를 드립니다. 전몰·순직 경찰관들의 고귀한 희생에 경의를 표합니다. 유가족 여러분께 위로의 마음을 전합니다.                              |
|                                                                                                                                     |
|오늘 홍조근정훈장을 받으신 중앙경찰학교장 이은정 치안감님, 근정포장을 받으신 광주남부