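**pandas practice**  
Assignment 2.
Build a wine recommendation system based on cosine similarity
- Load the provided csv file
- Extract only the first 5000 wine records and store them in a variable
- Extract the description column
- Remove stop words (short words, a stop-word dictionary, words that occur 3 times or fewer across the 5000 documents, etc.)
- Build the DTM (if this takes too long, work with only the first 1000 records)
- Build the TF-IDF matrix (5000 * number of words)
- Compute cosine similarity (5000 * 5000)  
ex) Recommend the 10 wines most similar to wine number 50  
  Extract the 10 largest cosine similarity values -> look up the names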
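The steps listed above can be sketched end to end on a toy corpus. This is a minimal illustration, not the notebook's actual pipeline: the four descriptions, the variable names, and the `top_k` helper are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for wine descriptions (hypothetical, not from the dataset).
toy_docs = [
    "ripe cherry and plum with firm tannins",
    "crisp lemon and green apple with brisk acidity",
    "dark cherry plum and smooth tannins",
    "lemon lime and tart green apple",
]

tfidf_toy = TfidfVectorizer(stop_words="english")
X = tfidf_toy.fit_transform(toy_docs)   # shape: (4, vocabulary size)
sim = cosine_similarity(X)              # shape: (4, 4)

def top_k(query, k=2):
    """Return the k most similar document indices, excluding the query."""
    order = sim[query].argsort()[::-1]  # highest similarity first
    return [int(j) for j in order if j != query][:k]

print(top_k(0, k=1))
```

With the real data, `toy_docs` would be the 5000 cleaned descriptions and the returned indices would be mapped back to wine titles.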


In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/data/winemag-data-130k-v2.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [3]:
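# .copy() makes the subset independent, guarding against SettingWithCopyWarning
# when columns are reassigned in later cells.
df = df.head(5000).copy()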
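An alternative to loading all rows and then slicing is to limit the read itself. The sketch below uses an in-memory CSV with made-up columns. For the real file, the assumption is that `nrows=5000` would parse only the first 5000 records and `index_col=0` would reuse the first column as the index rather than producing an `Unnamed: 0` column.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file (hypothetical data).
csv_text = "id,title,description\n0,A,first\n1,B,second\n2,C,third\n"

# nrows limits parsing to the first N data rows;
# index_col=0 uses the first column as the index.
small = pd.read_csv(io.StringIO(csv_text), nrows=2, index_col=0)
print(small.shape)
```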

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             5000 non-null   int64  
 1   country                4997 non-null   object 
 2   description            5000 non-null   object 
 3   designation            3523 non-null   object 
 4   points                 5000 non-null   int64  
 5   price                  4657 non-null   float64
 6   province               4997 non-null   object 
 7   region_1               4208 non-null   object 
 8   region_2               1956 non-null   object 
 9   taster_name            3969 non-null   object 
 10  taster_twitter_handle  3794 non-null   object 
 11  title                  5000 non-null   object 
 12  variety                5000 non-null   object 
 13  winery                 5000 non-null   object 
dtypes: float64(1), int64(2), object(11)
memory usage: 547.0+ KB

In [5]:
df.describe()

Unnamed: 0.1,Unnamed: 0,points,price
count,5000.0,5000.0,4657.0
mean,2499.5,88.1546,34.737599
std,1443.520003,2.925818,52.321656
min,0.0,80.0,4.0
25%,1249.75,86.0,16.0
50%,2499.5,88.0,25.0
75%,3749.25,90.0,40.0
max,4999.0,100.0,1900.0


In [6]:
df.isnull().sum(axis=0)

Unnamed: 0                  0
country                     3
description                 0
designation              1477
points                      0
price                     343
province                    3
region_1                  792
region_2                 3044
taster_name              1031
taster_twitter_handle    1206
title                       0
variety                     0
winery                      0
dtype: int64

In [7]:
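import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models required by word_tokenize

def tokenize_sentence(sentence):
    # Split a description string into word and punctuation tokens.
    return word_tokenize(sentence)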

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
df['description'] = df['description'].apply(tokenize_sentence)

In [9]:
# Drop tokens of length 2 or less (punctuation and short words such as "is", "a").
df['description'] = df['description'].apply(lambda x: [word for word in x if len(word) > 2])
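The assignment also mentions a stop-word dictionary. In this notebook that step is left to `TfidfVectorizer(stop_words='english')` later on, but it could also be applied at the token level. A minimal sketch, assuming scikit-learn's built-in `ENGLISH_STOP_WORDS` list is an acceptable dictionary; `clean_tokens` is a hypothetical helper, not part of the notebook.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_tokens(tokens, min_len=3):
    """Lowercase tokens, then drop short tokens and English stop words."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if len(t) >= min_len and t not in ENGLISH_STOP_WORDS]

sample = ["This", "is", "ripe", "and", "fruity", ",", "a", "wine", "that", "is", "smooth"]
print(clean_tokens(sample))  # ['ripe', 'fruity', 'wine', 'smooth']
```

Lowercasing before the lookup matters because the stop-word list contains only lowercase entries.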

In [10]:
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"[Aromas, include, tropical, fruit, broom, brim...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"[This, ripe, and, fruity, wine, that, smooth, ...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"[Tart, and, snappy, the, flavors, lime, flesh,...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"[Pineapple, rind, lemon, pith, and, orange, bl...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"[Much, like, the, regular, bottling, from, 201...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [11]:
from collections import Counter

word_freq = Counter(word for tokens_list in df['description'] for word in tokens_list)

In [12]:
word_freq

Counter({'Aromas': 171,
         'include': 31,
         'tropical': 125,
         'fruit': 1692,
         'broom': 6,
         'brimstone': 3,
         'and': 13180,
         'dried': 301,
         'herb': 249,
         'The': 1984,
         'palate': 1385,
         "n't": 136,
         'overly': 20,
         'expressive': 25,
         'offering': 96,
         'unripened': 1,
         'apple': 447,
         'citrus': 400,
         'sage': 55,
         'alongside': 234,
         'brisk': 92,
         'acidity': 1298,
         'This': 1561,
         'ripe': 928,
         'fruity': 369,
         'wine': 3050,
         'that': 1432,
         'smooth': 249,
         'while': 362,
         'still': 210,
         'structured': 165,
         'Firm': 26,
         'tannins': 1135,
         'are': 960,
         'filled': 10,
         'out': 258,
         'with': 4351,
         'juicy': 353,
         'red': 708,
         'berry': 616,
         'fruits': 537,
         'freshened': 3,
         'alr

In [13]:
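min_freq = 4  # keep words that occur at least 4 times, i.e. drop counts of 3 or fewer
# A set gives O(1) membership tests in the filtering step.
filtered_words = {word for word, freq in word_freq.items() if freq >= min_freq}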

In [14]:
df['description'] = df['description'].apply(lambda tokens_list: [word for word in tokens_list if word in filtered_words])
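The frequency count above is case-sensitive (`'The'` and `'the'` are separate keys), while `CountVectorizer` and `TfidfVectorizer` lowercase text by default. This means a word can fall below the threshold only because its occurrences are split by capitalization. A minimal sketch of a case-insensitive variant on made-up token lists; `docs`, `min_count`, and `keep` are illustrative names.

```python
from collections import Counter

# Hypothetical tokenized documents standing in for df['description'].
docs = [
    ["The", "wine", "is", "ripe"],
    ["the", "wine", "shows", "ripe", "fruit"],
    ["Ripe", "fruit", "and", "the", "oak"],
]

min_count = 3  # keep words seen at least this many times in the toy corpus

# Count lowercase forms so "The" and "the" share one entry.
counts = Counter(tok.lower() for doc in docs for tok in doc)
keep = {word for word, n in counts.items() if n >= min_count}

filtered = [[tok for tok in doc if tok.lower() in keep] for doc in docs]
print(filtered)
```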

In [15]:
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"[Aromas, include, tropical, fruit, broom, and,...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"[This, ripe, and, fruity, wine, that, smooth, ...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"[Tart, and, snappy, the, flavors, lime, flesh,...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"[Pineapple, rind, lemon, pith, and, orange, bl...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"[Much, like, the, regular, bottling, from, 201...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [16]:
df['description'] = df['description'].apply(lambda tokens_list: ' '.join(tokens_list))

In [17]:
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,Aromas include tropical fruit broom and dried ...,Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,This ripe and fruity wine that smooth while st...,Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,Tart and snappy the flavors lime flesh and rin...,,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,Pineapple rind lemon pith and orange blossom s...,Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,Much like the regular bottling from 2012 this ...,Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [18]:
v = CountVectorizer()
# fit_transform returns a sparse matrix; .toarray() densifies it for display only.
v.fit_transform(df['description']).toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [29]:
v.vocabulary_

{'aromas': 160,
 'include': 1256,
 'tropical': 2601,
 'fruit': 1048,
 'broom': 358,
 'and': 125,
 'dried': 764,
 'herb': 1189,
 'the': 2525,
 'palate': 1746,
 'overly': 1725,
 'expressive': 891,
 'offering': 1686,
 'apple': 144,
 'citrus': 496,
 'sage': 2114,
 'alongside': 107,
 'brisk': 353,
 'acidity': 55,
 'this': 2542,
 'ripe': 2061,
 'fruity': 1052,
 'wine': 2760,
 'that': 2524,
 'smooth': 2279,
 'while': 2746,
 'still': 2383,
 'structured': 2405,
 'firm': 967,
 'tannins': 2482,
 'are': 156,
 'filled': 953,
 'out': 1719,
 'with': 2768,
 'juicy': 1314,
 'red': 1996,
 'berry': 257,
 'fruits': 1051,
 'already': 109,
 'drinkable': 767,
 'although': 111,
 'will': 2758,
 'certainly': 438,
 'better': 260,
 'from': 1045,
 '2016': 19,
 'tart': 2486,
 'snappy': 2284,
 'flavors': 984,
 'lime': 1419,
 'flesh': 986,
 'rind': 2057,
 'dominate': 748,
 'some': 2299,
 'green': 1129,
 'pineapple': 1817,
 'through': 2549,
 'crisp': 633,
 'was': 2718,
 'all': 95,
 'fermented': 944,
 'lemon': 1384,
 '

In [19]:
tfidf = TfidfVectorizer(stop_words='english')

In [20]:
tfidf_m = tfidf.fit_transform(df['description'])
tfidf_m.shape

(5000, 2635)

In [21]:
tfidf_m

<5000x2635 sparse matrix of type '<class 'numpy.float64'>'
	with 103044 stored elements in Compressed Sparse Row format>
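For reference, with its default settings `TfidfVectorizer` weights each raw term count by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each row. The sketch below rebuilds those values by hand on a toy corpus and compares them with the library output. The documents and variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["ripe cherry ripe plum", "crisp lemon", "cherry lemon"]

counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                # documents containing each term

idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed idf (sklearn default)
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

reference = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, reference))
```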

In [22]:
cos_sim = cosine_similarity(tfidf_m,tfidf_m)

In [23]:
cos_sim.shape

(5000, 5000)
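Because `TfidfVectorizer` L2-normalizes rows by default, the cosine similarity of two rows equals their dot product, so `linear_kernel` should give the same matrix without the extra normalization pass. A small check on made-up documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

toy_docs = ["ripe cherry and plum", "crisp lemon and apple", "dark cherry and plum"]
X = TfidfVectorizer().fit_transform(toy_docs)   # rows are L2-normalized by default

cos = cosine_similarity(X)
dot = linear_kernel(X)                          # plain dot products, no re-normalization

print(np.allclose(cos, dot))
```

Either way the result is dense: a 5000 x 5000 float64 matrix takes roughly 200 MB, which is one reason the assignment suggests falling back to fewer rows if needed.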

In [24]:
df['title']

0                       Nicosia 2013 Vulkà Bianco  (Etna)
1           Quinta dos Avidagos 2011 Avidagos Red (Douro)
2           Rainstorm 2013 Pinot Gris (Willamette Valley)
3       St. Julian 2013 Reserve Late Harvest Riesling ...
4       Sweet Cheeks 2012 Vintner's Reserve Wild Child...
                              ...                        
4995    Mud House 2007 Swan Sauvignon Blanc (Marlborough)
4996     Fattoria Alois 2006 Cunto Pallagrello (Campania)
4997                      Florio NV Fine Sweet  (Marsala)
4998    Vice Versa 2005 Le Petit Vice Cabernet Sauvign...
4999    Viña Mar de Casablanca 2008 Reserva Especial S...
Name: title, Length: 5000, dtype: object

In [25]:
df.index

RangeIndex(start=0, stop=5000, step=1)

In [26]:
data = dict(zip(df['title'], df.index))
i = data['Nicosia 2013 Vulkà Bianco  (Etna)']
print(i)

0
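One caveat with `dict(zip(...))`: if the same title appears more than once, later rows silently overwrite earlier ones. Whether this dataset contains duplicate titles is not checked in the notebook. The sketch below, on made-up titles, shows how to count duplicates and keep the first index instead.

```python
import pandas as pd

# Hypothetical titles; "Wine A" appears twice on purpose.
titles = pd.Series(["Wine A", "Wine B", "Wine A", "Wine C"])

n_dupes = int(titles.duplicated().sum())   # rows whose title was seen earlier

# Keep the first row index for each title instead of the last.
first_index = {}
for idx, title in titles.items():
    first_index.setdefault(title, idx)

print(n_dupes, first_index["Wine A"])
```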


In [27]:
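def recommend(title, cosine_sim=cos_sim):
    idx = data[title]  # row index of the query wine
    # Pair every wine index with its similarity to the query wine.
    sim_scores = list(enumerate(cosine_sim[idx]))
    print(sim_scores)  # debug output
    # Sort by similarity, highest first.
    ss = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (assumed to be the query wine itself, similarity 1.0) and keep the next 10.
    ss = ss[1:11]
    print(ss)  # debug output
    res = [i[0] for i in ss]
    print(res)  # debug output
    return df['title'].iloc[res]

result = recommend('Nicosia 2013 Vulkà Bianco  (Etna)')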

[(0, 1.0), (1, 0.015526023929750946), (2, 0.016009256486030056), (3, 0.021547949670277036), (4, 0.0), (5, 0.03581718935908812), (6, 0.09674922753428318), (7, 0.02315817925089834), (8, 0.15143858870773652), (9, 0.06197077382406388), (10, 0.0), (11, 0.036504956093721834), (12, 0.0), (13, 0.11335965444089169), (14, 0.020124911669669218), (15, 0.0381157357626879), (16, 0.02889469116555737), (17, 0.01043239662639437), (18, 0.028060905596679467), (19, 0.028919466868221777), (20, 0.07801689030192276), (21, 0.0318691329756252), (22, 0.13729927118036214), (23, 0.021201255902895337), (24, 0.025162120734095423), (25, 0.00985980927741594), (26, 0.12182898267428811), (27, 0.0658674822966382), (28, 0.06381509630212212), (29, 0.008749249796605867), (30, 0.01568517392041236), (31, 0.009992126625666694), (32, 0.08135592531694671), (33, 0.0), (34, 0.15683498986632796), (35, 0.04329014188341659), (36, 0.03853582220407609), (37, 0.10076367304681223), (38, 0.04342294429321018), (39, 0.02632088992094489), (

In [28]:
result

2000    Feudi del Pisciotto 2013 Baglio del Sole Inzol...
1036                  Vivera 2010 Salisire Bianco  (Etna)
4033    Chateau Ste. Michelle 2015 Horse Heaven Vineya...
4053    Cecilia Beretta 2011 Terre di Cariano Riserva ...
908                     Cascina Bruciata 2013  Barbaresco
1181              Bel Colle 2013 Montersino  (Barbaresco)
677     Baracchi Riccardo 2011 Smeriglio Riserva Syrah...
3908                     La Rajade 2014 Friulano (Collio)
2734            Viña Maipo 2015 Vitral Chardonnay (Chile)
3414        Lechthaler 2014 Drago Pinot Grigio (Trentino)
Name: title, dtype: object
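The `recommend` function relies on the query wine sorting into position 0. If another wine ties at similarity 1.0 (for example, a duplicated description), that assumption may not hold. A minimal alternative sketch that excludes the query index explicitly and returns the scores alongside the titles; `top_similar` and the toy matrix are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_similar(sim_matrix, titles, idx, k=10):
    """Return the k most similar titles to row idx, with their scores."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf                   # exclude the query wine explicitly
    top = np.argsort(scores)[::-1][:k]      # indices of the k highest scores
    return pd.DataFrame(
        {"title": titles.iloc[top].to_numpy(), "score": scores[top]},
        index=top,
    )

# Hypothetical 4x4 similarity matrix and titles for illustration.
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])
toy_titles = pd.Series(["Wine A", "Wine B", "Wine C", "Wine D"])

print(top_similar(toy_sim, toy_titles, idx=0, k=2))
```

With the notebook's objects this would be called as `top_similar(cos_sim, df['title'], data[title])`, assuming `df` keeps its default `RangeIndex`.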