<img align="right" src="https://ds-cs-images.s3.ap-northeast-2.amazonaws.com/Codestates_Fulllogo_Color.png" width=100>

## *AIB / SECTION 4 / SPRINT 2 / NOTE 1*

# 📝 Assignment

---


# Count-based_Representation

This assignment uses job descriptions scraped from indeed.com with the search keyword "Data Scientist".

[Data_Scientist.csv](https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/indeed/Data_Scientist.csv) contains about 1,300 Data Scientist job descriptions.

## 1. Text preprocessing

In [2]:
import re
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 0) Preprocess the data before text analysis.

- After loading the file, keep only the title, company, and description columns.
- Remove duplicate rows.

In [6]:
df = pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/indeed/Data_Scientist.csv')


# title, company, description column
pre_df = df[['title', 'company', 'description']]

# delete duplicates
print(pre_df.duplicated())
pre_df = pre_df.drop_duplicates()

print(pre_df.shape)



0       False
1        True
2       False
3        True
4       False
        ...  
1295    False
1296     True
1297    False
1298     True
1299    False
Length: 1300, dtype: bool
(757, 3)


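### 1) Refine the tokens.

- Convert all characters to lowercase
- Remove information that is irrelevant to the analysis
- For this assignment, load `"en_core_web_sm"` from `spacy`.

- **Question 1) Enter the function that converts uppercase letters to lowercase.**
- **Question 2) Using the re library, write a regular expression that keeps only lowercase letters and digits.**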

In [7]:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = Tokenizer(nlp.vocab)

def refine(sentence):
    
    # lowercase
    sentence = sentence.lower()
    
    # apply regex
    sentence = re.sub(r"[^a-z0-9 ]", "", sentence)

    return sentence

last_desc = pre_df['description'][-1:].values[0]
#print(last_desc[0])

tokens = tokenizer(last_desc)
#print(tokens)
print(tokens[0])

doc = nlp(last_desc)
#print(doc)
print(doc[0])

last_desc = refine(doc[0].text)
print(last_desc)

tl;dr
tl;dr
tldr
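
The output above shows `tl;dr` becoming `tldr`: `refine` deletes every disallowed character, so words joined only by punctuation are merged. Below is a minimal sketch of an alternative, assuming punctuation should act as a word boundary instead. `refine_keep_boundaries` and the sample string are hypothetical and not part of the assignment.

```python
import re

def refine_delete(sentence):
    """Same rule as `refine` above: lowercase, then delete disallowed characters."""
    return re.sub(r"[^a-z0-9 ]", "", sentence.lower())

def refine_keep_boundaries(sentence):
    """Hypothetical variant: replace runs of disallowed characters with one space."""
    sentence = re.sub(r"[^a-z0-9]+", " ", sentence.lower())
    return sentence.strip()

sample = "Machine-learning/AI, tl;dr"   # made-up input
deleted = refine_delete(sample)
spaced = refine_keep_boundaries(sample)
print(deleted)
print(spaced)
```

Which behavior is preferable depends on the task; Question 2 only asks for the character filter itself.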


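### 2) Visualize the refined tokens.

- Print the top 10 tokens.
- Compute the token count, frequency rank, number of documents containing each token, ratios, and similar statistics.
- Plot the cumulative percentage distribution by token rank.

- **Question 3) Enter the 10 top-ranked tokens.**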

In [None]:
def resub(desc):
    desc = desc.lower()
#    desc = desc.replace("\n", " ")
    desc = re.sub(r"\n", " ", desc)
    desc = re.sub(r"[^a-z0-9 ]", "", desc)

    tokens = desc.split()
    return tokens

#pre_df['tokens'] = pre_df['description'].apply(tokenizer)
pre_df['tokens'] = pre_df['description'].apply(resub)
#pre_df['tokens'] = pre_df['tokens'].apply(tokenizer)   # why doesn't this work? (likely because the column now holds lists, not strings)
#pre_df['tokens'] = pre_df['tokens'].apply(list)

#pre_df['description'].head()
pre_df['tokens'].head()

#pre_df.head()

In [10]:
from collections import Counter

word_counts = Counter()

pre_df['tokens'].apply(lambda x: word_counts.update(x))  # TODO: get more comfortable with lambda

## Top 10 tokens
word_counts.most_common(10)

[('and', 21863),
 ('to', 12694),
 ('the', 10538),
 ('of', 8839),
 ('data', 7425),
 ('in', 6769),
 ('a', 6436),
 ('with', 5727),
 ('for', 4132),
 ('or', 3812)]
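
`Series.apply` is used above only for its side effect of updating the counter. A minimal equivalent in plain Python is sketched here, with a made-up `token_lists` standing in for `pre_df['tokens']`.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for pre_df['tokens']: one token list per document.
token_lists = [
    ["data", "and", "models"],
    ["data", "and", "data"],
]

# Flatten all documents into one token stream and count it in a single pass.
word_counts = Counter(chain.from_iterable(token_lists))

top2 = word_counts.most_common(2)
print(top2)
```

As in the real output, function words dominate raw counts, which is why the later steps remove stop words.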

In [11]:
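## Compute token counts, frequency rank, document frequency, and ratios
def word_count(docs):
    w_counts = Counter()    # total occurrences of each token
    w_in_docs = Counter()   # number of documents that contain each token
    total_docs = len(docs)

    for doc in docs:
        w_counts.update(doc)
        w_in_docs.update(set(doc))
    
    temp = zip(w_counts.keys(), w_counts.values())   # TODO: study zip further
    wc = pd.DataFrame(temp, columns=['word', 'count'])

    # Rank by frequency, then derive per-token and cumulative shares
    wc['rank'] = wc['count'].rank(method='first', ascending=False)
    wc['percent'] = wc['count'] / wc['count'].sum()
    wc = wc.sort_values(by='rank')
    wc['cul_percent'] = wc['percent'].cumsum()

    # Share of documents in which each token appears
    wc['word_in_docs'] = wc['word'].map(w_in_docs.get)
    wc['word_in_docs_percent'] = wc['word_in_docs'] / total_docs

    return wc

#wc = word_count(pre_df['tokens'])
#wc.head()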

In [12]:
## Plot the cumulative percentage distribution by token rank
import seaborn as sns
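
The cell above stops at the import. As a minimal sketch of the quantity that the cumulative distribution graph would show, the following computes the running share of all tokens by rank in plain Python. The `counts` values are made up; in the notebook they would come from `word_counts`.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical counts for illustration only.
counts = Counter({"and": 5, "data": 3, "team": 2})

total = sum(counts.values())
ranked = counts.most_common()                 # (word, count), most frequent first
percents = [c / total for _, c in ranked]     # share of all tokens per word
cumulative = list(accumulate(percents))       # running share by rank

for rank, ((word, _), cum) in enumerate(zip(ranked, cumulative), start=1):
    print(rank, word, round(cum, 2))
```

Plotting `cumulative` against rank, for example with `sns.lineplot`, would give the curve the task describes.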

### 4) Refine the tokens with an extended stop-word list.


- **Question 4) Use code that adds two words (`"data", "work"`) to the default stop-word list.**
- **Question 5) Enter the top 10 tokens after removing stop words.**

In [13]:


#nlp.Defaults.stop_words

## add "data", "work" to stop_words
Stop_Words = nlp.Defaults.stop_words.union(['data', 'work'])

## Top 10 tokens

tok = []

for doc in tokenizer.pipe(pre_df['description']):   # TODO: study Tokenizer.pipe
    doc_tok = []

    for token in doc:
        if (token.text.lower() not in Stop_Words) and (token.text.lower() not in ['\n', '\n\n']):
        # if (token.is_stop == False) & (token.is_punct == False):
            doc_tok.append(token.text.lower())
    
    tok.append(doc_tok)

pre_df['added stop word tokens'] = tok
pre_df['added stop word tokens'].head()

sw_counts = Counter()

pre_df['added stop word tokens'].apply(lambda x: sw_counts.update(x))  # TODO: get more comfortable with lambda

## Top 10 tokens
sw_counts.most_common(10)


[('experience', 3055),
 ('business', 1885),
 ('team', 1323),
 ('learning', 1193),
 ('machine', 1140),
 ('science', 1048),
 ('ability', 958),
 ('analysis', 896),
 ('statistical', 890),
 ('skills', 886)]
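
One detail of the cell above is that `set.union` returns a new set rather than modifying the original. The sketch below illustrates this with a hypothetical miniature stop-word set standing in for `nlp.Defaults.stop_words`.

```python
# Hypothetical miniature stop-word set; the real one is nlp.Defaults.stop_words.
default_stop_words = {"and", "to", "the"}

# union() builds a new set and leaves default_stop_words unchanged.
extended_stop_words = default_stop_words.union(["data", "work"])

tokens = ["the", "data", "team", "and", "work", "experience"]
kept = [t for t in tokens if t not in extended_stop_words]
print(kept)
```

Because the extended set is a separate object, a check that relies on the defaults, such as the commented-out `token.is_stop` line, would presumably not treat `data` and `work` as stop words. Filtering against `Stop_Words` directly, as the cell does, avoids that.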

### 5) Analyze the effect of lemmatization.



- **Question 6) Enter the top 10 words after lemmatization.**

In [14]:


## Lemmatization (reduce each token to its dictionary form)

def get_lemmas(text):
    lemmas = []
    l_doc = nlp(text)

    for token in l_doc:
        if (token.text.lower() not in Stop_Words
            ) and (not token.is_punct
            ) and (token.text.lower() not in ['\n', '\n\n']
            ) and (token.pos_ != 'PRON'):  # Part-Of-Speech tag
            lemmas.append(token.lemma_)   # whether .lower() is applied here changes the top-10 result!
    return lemmas

# Use the de-duplicated frame so that only the remaining rows are processed
pre_df['lemmas'] = pre_df['description'].apply(get_lemmas)
pre_df['lemmas'].head()

lm_counts = Counter()

pre_df['lemmas'].apply(lambda x: lm_counts.update(x))  # TODO: get more comfortable with lambda

## Top 10 tokens
lm_counts.most_common(10)

## Top 10 when .lower() is applied to the lemmas
#  [('experience', 3681),
#  ('team', 2338),
#  ('business', 2194),
#  ('science', 1714),
#  ('analysis', 1606),
#  ('model', 1464),
#  ('learning', 1330),
#  ('product', 1298),
#  ('machine', 1214),
#  ('skill', 1212)]



[('experience', 3578),
 ('team', 2251),
 ('business', 2029),
 ('analysis', 1531),
 ('model', 1446),
 ('skill', 1205),
 ('product', 1196),
 ('include', 1178),
 ('develop', 1163),
 ('analytic', 1153)]

## 2. Finding similar documents

### 1) Vectorize each document with `TfidfVectorizer`, build a KNN model, <br/> then query it with a `job description` of your choice and retrieve and analyze the nearest results.

- **Question 9) Enter the indices of the 5 `job description`s most similar to the `job description` at index 88.**
    - The answer includes index 88.
    - Set `max_features = 3000`.
    - Enter the answer in the form [88, 90, 91, 93, 94].

In [15]:


from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

pre_df['description'].iloc[88]
pre_df.iloc[88]

tfidf = TfidfVectorizer(stop_words='english', max_features=3000)

dtm_tfidf = tfidf.fit_transform(pre_df['description'])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtm_tfidf = pd.DataFrame(dtm_tfidf.todense(), columns=tfidf.get_feature_names_out())

dtm_tfidf.head()





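| | 00 | 000 | 10 | 100 | 11 | 12 | 14 | 15 | 18 | 19 | ... | written | www | year | years | yelp | yes | york | yrs | zillow | zulily |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.05272 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.023379 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.018485 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.05851 | 0.0 | 0.0 | 0.017488 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.033337 | 0.0 | 0.035082 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.060084 | 0.013012 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.047268 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |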
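
As a reminder of what each cell in this matrix holds: with its default settings, `TfidfVectorizer` multiplies the raw term count by a smoothed inverse document frequency and then scales each row to unit L2 length. Below is a minimal pure-Python sketch of that weighting on a made-up, pre-tokenized corpus. It does not model the vectorizer's tokenization rules, stop-word removal, or the `max_features` cut-off.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [["data", "science", "data"], ["data", "team"], ["team", "lead"]]

n_docs = len(docs)
vocab = sorted({t for doc in docs for t in doc})
doc_freq = Counter(t for doc in docs for t in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF row for one tokenized document."""
    tf = Counter(doc)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_vector(doc) for doc in docs]
for row in matrix:
    print([round(v, 3) for v in row])
```

Terms that occur in fewer documents receive a larger `idf`, so rare words weigh more than words found in almost every description.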


In [16]:
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm_tfidf)

# Passing a one-row DataFrame keeps the feature names the model was fitted with
nn.kneighbors(dtm_tfidf.iloc[[88]])

(array([[0.        , 1.1283426 , 1.18893646, 1.19442548, 1.19937307]]),
 array([[ 88,  40, 121,  68, 680]]))
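
The first neighbor is always the query itself at distance 0, which is why the answer includes index 88. Below is a minimal brute-force sketch of the same lookup, assuming Euclidean distance and using made-up unit-length vectors in place of the rows of `dtm_tfidf`. The helper names are hypothetical.

```python
import math

# Hypothetical unit-length vectors standing in for rows of dtm_tfidf.
vectors = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.6, 0.8],
    [0.8, 0.0, 0.6],
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(query_idx, k):
    """Brute-force search: (distance, index) pairs sorted by distance."""
    query = vectors[query_idx]
    scored = [(euclidean(query, v), i) for i, v in enumerate(vectors)]
    return sorted(scored)[:k]

neighbors = k_nearest(0, 3)
print(neighbors)

# For unit-length vectors, squared distance = 2 - 2 * cosine similarity.
cosine = 1 - neighbors[1][0] ** 2 / 2
print(round(cosine, 3))
```

Applied to the result above, the closest other description at distance 1.128 corresponds to a cosine similarity of roughly 0.36, assuming the rows are unit-length as they are with the vectorizer's default `norm='l2'`.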

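## 3. Text classification with TF-IDF

Once sentences or documents have been vectorized with TF-IDF, the resulting vectors can be used for a document classification task.

The dataset has no labels, so we classify whether a posting is a Senior role based on whether the title column contains "Senior".

### 1) Create a new column named "Senior" that is 1 if the title column contains the string "Senior" and 0 otherwise.

Question 7) How many rows have the value 1 (Senior) in the new Senior column?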

In [17]:


pre_df['senior'] = pre_df['title'].apply(lambda x: 1 if 'Senior' in x else 0)
pre_df.head()

pre_df['senior'].value_counts()
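
The label rule above is an exact, case-sensitive substring match. The sketch below applies the same rule to a few made-up titles to show which spellings it does and does not catch.

```python
# Hypothetical job titles for illustration only.
titles = [
    "Senior Data Scientist",
    "Data Scientist",
    "senior analyst",
    "Sr. Data Scientist",
]

# Same rule as above: 1 if the exact string "Senior" appears in the title.
senior_flags = [1 if "Senior" in title else 0 for title in titles]
print(senior_flags)
print(sum(senior_flags))
```

Lowercase spellings and abbreviations such as "Sr." are labeled 0 under this rule, which matches the task as stated.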



from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

x = dtm_tfidf
x_train, x_val, y_train, y_val = train_test_split(x, pre_df['senior'],test_size = 0.1, train_size=0.9, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(x_train, y_train)
y_pred = model.predict(x_val)

print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.84      0.88        68
           1       0.27      0.50      0.35         8

    accuracy                           0.80        76
   macro avg       0.60      0.67      0.62        76
weighted avg       0.86      0.80      0.83        76



Question 8) Split the data into train and validation sets with sklearn's `train_test_split`, then classify with `sklearn`'s `DecisionTreeClassifier`. 

Use the dtm_tfidf fitted above as x without modification. Fix random_state to 42 for both train_test_split and DecisionTreeClassifier, and set test_size to 0.1.

After training, predict on the test data and report the precision and recall for label 1.

In [18]:


from sklearn.metrics import precision_recall_fscore_support

precision_recall_fscore_support(y_val, y_pred)



(array([0.93442623, 0.26666667]),
 array([0.83823529, 0.5       ]),
 array([0.88372093, 0.34782609]),
 array([68,  8]))
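
To make the label 1 figures concrete, the sketch below recomputes them from confusion-matrix counts. The counts are inferred from the reported support (68 and 8) and the printed metrics. They are a reconstruction, not values printed by the notebook.

```python
# Counts inferred from the report above; a reconstruction, not notebook output.
tp, fn = 4, 4      # label 1: 8 senior postings, 4 of them predicted as senior
fp, tn = 11, 57    # label 0: 68 other postings, 11 of them predicted as senior

precision = tp / (tp + fp)     # of the rows predicted 1, the share that is truly 1
recall = tp / (tp + fn)        # of the rows that are truly 1, the share that was found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 8), recall, round(f1, 8), round(accuracy, 2))
```

With only 8 positive rows in the validation set, each of these label 1 figures changes noticeably when a single prediction changes.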