
# Feature Extraction

Feature extraction is typically used when the raw data is in a form that a model cannot consume directly, so it has to be converted into something usable first.

For example, if the original data are images, you might extract a redness value or a description of the shape of an object in each image. The conversion is lossy, but it gives you features you can actually work with.

Feature engineering is the deliberate transformation of data into more meaningful features, even when the original data could have been used as-is.

For example, instead of using the variables x, y, and z directly, you use log(x) - sqrt(y) * z because your domain knowledge says this derived quantity is more relevant to the problem, and it yields better results than the raw variables do.
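As a minimal sketch of the derived-feature idea above, assuming a small hypothetical table with positive columns `x`, `y`, and `z` (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
df = pd.DataFrame({
    "x": [1.0, 10.0, 100.0],
    "y": [4.0, 9.0, 16.0],
    "z": [0.5, 1.0, 2.0],
})

# Derived feature from the text: log(x) - sqrt(y) * z.
# np.log is the natural logarithm; it needs x > 0, and np.sqrt needs y >= 0.
df["derived"] = np.log(df["x"]) - np.sqrt(df["y"]) * df["z"]

print(df)
```

Whether such a feature actually helps depends on the problem; the point is only that the transformation is a one-line, vectorized operation once the formula is known.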

In [0]:
from google.colab import drive
drive.mount('/content/drive')


## Cosine Similarity: Example

In [0]:
import numpy as np
from numpy.linalg import norm

In [0]:
def cos_sim(x, y):
  '''
  Compute cosine similarity.
  Computes the similarity between two documents given as vectors.
  similarity = (X1 . X2) / ( |X1| * |X2| )
  '''
  return ( np.dot(x,y) ) / ( norm(x) * norm(y) )
  
  

In [0]:
doc1 = np.array([1,1,0,1])
doc2 = np.array([1,0,1,1])
doc3 = np.array([2,0,2,2])

print(cos_sim(doc1, doc2))
print(cos_sim(doc1, doc3))
print(cos_sim(doc2, doc3))

0.6666666666666667
0.6666666666666667
1.0000000000000002
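The output above illustrates two things: `doc3` is just `2 * doc2`, so their similarity is 1 (cosine similarity ignores vector length), and floating-point rounding can push the result slightly above 1. The sketch below is one possible guarded variant. The name `safe_cos_sim` and the choice to return 0 for an all-zero vector are assumptions made here, not part of the notebook.

```python
import numpy as np
from numpy.linalg import norm

def safe_cos_sim(x, y):
    """Cosine similarity with two guards (hypothetical helper, not from the notebook)."""
    denom = norm(x) * norm(y)
    if denom == 0:
        # Convention chosen here: similarity with an all-zero vector is 0.
        return 0.0
    # Clip to absorb floating-point rounding such as 1.0000000000000002.
    return float(np.clip(np.dot(x, y) / denom, -1.0, 1.0))

doc2 = np.array([1, 0, 1, 1])
doc3 = np.array([2, 0, 2, 2])   # doc3 is 2 * doc2
empty = np.zeros(4)

print(safe_cos_sim(doc2, doc3))   # never exceeds 1.0
print(safe_cos_sim(doc2, empty))  # 0.0 by the convention above
```

The zero-vector guard is relevant whenever a document contributes no terms at all, for example an empty text.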


## Cosine Similarity Example with Movie Data
Objective: find similar movies based on the words that appear in each movie's overview.


### 1.2.1 Data Loading

In [0]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer

import json

In [0]:
movies = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Data_Analysis/data/the-movies-dataset/movies_metadata.csv')


In [0]:
# movies.head() # Vis
data = movies

### 1.2.2 Handling Missing Values

- If you drop the rows instead, call `reset_index` afterwards.

In [0]:
data[['overview']].info()
# result: 954 null values (45466 entries, 44512 non-null)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 1 columns):
overview    44512 non-null object
dtypes: object(1)
memory usage: 355.3+ KB


In [0]:
data['overview'] = data['overview'].fillna('')
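The cell above fills missing overviews with an empty string. The alternative mentioned in the bullet, dropping the rows and resetting the index, might look like the sketch below. The toy frame is hypothetical. Resetting matters because later steps use row positions to index into the similarity matrix, so the labels should run 0..n-1 without gaps.

```python
import pandas as pd

# Hypothetical toy data standing in for the movie table.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "overview": ["a spy story", None, "a space story", None],
})

# Drop rows with a missing overview, then rebuild a contiguous 0..n-1 index.
# drop=True discards the old index instead of keeping it as a column.
cleaned = toy.dropna(subset=["overview"]).reset_index(drop=True)

print(cleaned)
```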


### 1.2.3 Building the TF-IDF Matrix

In [0]:
tfidf = TfidfVectorizer(stop_words = 'english')

In [0]:
input_size = int(input('size : '))
tfidf_mat = tfidf.fit_transform(data['overview'].head(input_size))
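To see what `fit_transform` returns without loading the movie data, here is a small sketch on a hypothetical three-sentence corpus. It uses the same `TfidfVectorizer(stop_words='english')` setup as above; the corpus and the `toy_` names are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the movie overviews.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_mat = toy_tfidf.fit_transform(corpus)   # sparse matrix: documents x terms

print(toy_mat.shape)
print(sorted(toy_tfidf.vocabulary_))        # learned terms; stop words are excluded

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.linalg.norm(toy_mat.toarray(), axis=1)
print(row_norms)
```

Each row is one document and each column one term, so the number of columns grows with the vocabulary of the texts passed in.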

### 1.2.4 Computing Cosine Similarity

In [0]:
from sklearn.metrics.pairwise import linear_kernel

In [0]:
cosine_sim = linear_kernel(tfidf_mat, tfidf_mat)
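`linear_kernel` computes plain dot products, yet the result is used as a cosine similarity. That works because `TfidfVectorizer` L2-normalizes each row by default, and the dot product of unit vectors is their cosine. A NumPy-only sketch of this equivalence, using hypothetical random data rather than the TF-IDF matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.random((4, 6))   # hypothetical non-negative feature rows

# L2-normalize each row, as TfidfVectorizer does by default.
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# A plain dot product of unit rows (what a linear kernel computes) ...
dot_sim = unit @ unit.T

# ... matches the explicit cosine formula applied to the unnormalized rows.
norms = np.linalg.norm(raw, axis=1)
cos = (raw @ raw.T) / np.outer(norms, norms)

print(np.allclose(dot_sim, cos))
```

Note that the result is a dense n x n array. For all 45466 rows in float64 that would be roughly 16 GB, which is presumably why the notebook limits the number of rows through `input_size`.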

In [0]:
cosine_sim

### 1.2.5 Finding Movies Similar to a Given Movie

#### 1.2.5.1 Extracting [Movie Title and Index (Key)]

In [0]:
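# Use the row index as the values and the title column as the index.
# The slice length must match the number of rows passed to fit_transform.
temp = pd.Series(data.index[:input_size], index = data['title'].head(input_size))

# Remove duplicates based on the index (the titles).
# This can also be done on a specific key.
# temp = temp[~temp.index.duplicated(keep='first')]

idx = temp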
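One caveat with a title-to-index lookup: if a title occurs more than once, indexing by that title returns a Series of positions rather than a single integer. Also, `drop_duplicates()` on a Series removes duplicate values, not duplicate index labels. The sketch below shows one way to deduplicate by label, using hypothetical titles.

```python
import pandas as pd

# Hypothetical titles; "Hamlet" appears twice.
titles = pd.Series(["Heat", "Hamlet", "Casino", "Hamlet"])
lookup = pd.Series(titles.index, index=titles)

print(type(lookup["Hamlet"]))   # a Series, because the label is not unique

# Keep only the first occurrence of each title so every lookup returns one integer.
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]
print(unique_lookup["Hamlet"])
```

Keeping the first occurrence is an arbitrary choice made here; a real pipeline might instead disambiguate titles by release year.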

In [0]:
len(cosine_sim)

#### 1.2.5.2 Combining Similarity Scores with the Index (Key)

In [0]:
search_title = input('title : ')

idx_title = idx[search_title]
print('index', idx_title)

cosine_sim[idx_title]

# Use the actual matrix size rather than a hard-coded 20000 so any input size works
scores = pd.Series(np.arange(cosine_sim.shape[1]), index = cosine_sim[idx_title])

In [0]:
cosine_sim[idx_title].argsort()[::-1][0:10]

#### 1.2.5.3 Sorting by Similarity

In [0]:
sorted_scores = scores.sort_index(ascending = False)
sorted_scores

#### 1.2.5.4 Extracting Movie Indices from the Sorted Similarities

In [0]:
# Position 0 is the query movie itself, so skip it and keep the next nine
target_idxs = list(sorted_scores.iloc[1:10])

#### 1.2.5.5 Looking Up Movies by Index

In [0]:
for title_idx in target_idxs:
  print(data['overview'].iloc[title_idx])
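The steps in 1.2.5.2 through 1.2.5.5 can be folded into a single helper. The sketch below is one possible version on a hypothetical 4 x 4 similarity matrix. The name `top_similar` and its parameters are assumptions, not part of the notebook.

```python
import numpy as np

def top_similar(sim_matrix, query_idx, n=10):
    """Return row positions of the n most similar items, excluding the query.

    Hypothetical helper, not part of the original notebook.
    """
    scores = np.asarray(sim_matrix[query_idx], dtype=float)
    order = np.argsort(scores)[::-1]     # positions sorted by descending score
    order = order[order != query_idx]    # drop the query item explicitly
    return order[:n]

# Hypothetical symmetric similarity matrix with ones on the diagonal.
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

print(top_similar(toy_sim, query_idx=0, n=2))
```

Filtering out `query_idx` explicitly, instead of skipping the first sorted position, avoids returning the query itself when another item ties with it at a score of 1.0, for example a movie with an identical overview.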

### 1.2.6

In [0]:
data['overview']