# How to Start With a Raw Dataset?

This demo shows how to convert raw review datasets into the lexicon format required by sentiment-related models such as EFM and MTER.

# Case 1: Use a Cornac Built-in Dataset

In [1]:
import warnings
warnings.filterwarnings('ignore') 
from cornac.data.lexicon import SentimentAnalysis

In [2]:
import cornac
import pandas as pd
from cornac.datasets import amazon_digital_music

In [3]:
ratings = amazon_digital_music.load_feedback()
reviews = amazon_digital_music.load_review()
columns = ['user', 'item', 'rating', 'review_text']
ratings_df = pd.DataFrame(ratings, columns=columns[:3])
reviews_df = pd.DataFrame(reviews, columns=[columns[0], columns[1], columns[3]])

In [4]:
input_df = pd.merge(ratings_df, reviews_df, on=['user', 'item'])
input_df.head()

| | user | item | rating | review_text |
|---|---|---|---|---|
| 0 | A3EBHHCZO6V2A4 | 5555991584 | 5.0 | "It's hard to believe ""Memory of Trees"" came... |
| 1 | AZPWAXJG9OJXV | 5555991584 | 5.0 | A clasically-styled and introverted album, Mem... |
| 2 | A38IRL0X2T4DPF | 5555991584 | 5.0 | I never thought Enya would reach the sublime h... |
| 3 | A22IK3I6U76GX0 | 5555991584 | 5.0 | This is the third review of an irish album I w... |
| 4 | A1AISPOIIHTHXX | 5555991584 | 4.0 | "Enya, despite being a successful recording ar... |


In [6]:
output_lexicon = './dataset/lexicon.txt'
output_rating = './dataset/rating.txt'
# usecols is required and must match the column names of the input DataFrame
SA = SentimentAnalysis(input_df, sep='', usecols=['user', 'item', 'rating', 'review_text'])
df = SA.build_lexicons()
SA.save_to_file(output_lexicon, output_rating)
df.head()

100%|██████████| 64705/64705 [39:14<00:00, 27.48it/s]  


number of users: 5541
number of items: 3568
3521 rows have no lexicon
61184 rows after dropping users having less than 1 reviews


| | user | item | rating | review_text | lexicon |
|---|---|---|---|---|---|
| 0 | A3EBHHCZO6V2A4 | 5555991584 | 5.0 | "It's hard to believe ""Memory of Trees"" came... | album:last:1,album:great:1,spark:creative:1,sp... |
| 1 | AZPWAXJG9OJXV | 5555991584 | 5.0 | A clasically-styled and introverted album, Mem... | shyness:endearing:1,piano:soft:1,voice:lovely:... |
| 2 | A38IRL0X2T4DPF | 5555991584 | 5.0 | I never thought Enya would reach the sublime h... | heights:sublime:1 |
| 3 | A22IK3I6U76GX0 | 5555991584 | 5.0 | This is the third review of an irish album I w... | review:third:1,album:irish:1,music:best:1,musi... |
| 4 | A1AISPOIIHTHXX | 5555991584 | 4.0 | "Enya, despite being a successful recording ar... | artist:successful:1,appeal:broad:-1,station:po... |
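
Each value in the `lexicon` column above appears to be a comma-separated list of `aspect:opinion:sentiment` triples, with the last field being `1` or `-1`. The sketch below assumes that layout, which is inferred from the output shown here rather than from Cornac's documentation. `parse_lexicon` is a hypothetical helper for illustration and is not part of Cornac.

```python
def parse_lexicon(lexicon, item_sep=",", field_sep=":"):
    """Split 'aspect:opinion:sentiment,...' into (aspect, opinion, int) tuples.

    Hypothetical helper; the layout is an assumption inferred from the
    lexicon column printed in this notebook.
    """
    triples = []
    for entry in lexicon.split(item_sep):
        if not entry:
            continue  # skip empty entries, e.g. for an empty lexicon string
        aspect, opinion, sentiment = entry.split(field_sep)
        triples.append((aspect, opinion, int(sentiment)))
    return triples


example = "artist:successful:1,appeal:broad:-1"
print(parse_lexicon(example))
# [('artist', 'successful', 1), ('appeal', 'broad', -1)]
```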


# Case 2: Use Your Own Data

### Step 1: Prepare the raw data file
**Make sure the first line of the file contains the column names.**

The file must include at least the four columns [user_id, item_id, rating, review_text]; see the example ./dataset/data_demo.csv.


In [66]:
raw_file = './dataset/data_demo.csv'
sep = '\t'
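
Before generating the lexicon, it may be worth confirming that the header row contains the four required columns. The following is a minimal sketch using only the standard library. `missing_columns` is a hypothetical helper, and the in-memory sample (with made-up values) stands in for the real file.

```python
import csv
import io

REQUIRED = ["user_id", "item_id", "rating", "review_text"]


def missing_columns(handle, sep="\t", required=REQUIRED):
    """Return the required column names that are absent from the header row."""
    header = next(csv.reader(handle, delimiter=sep), [])
    return [name for name in required if name not in header]


# In-memory stand-in for a tab-separated raw file (made-up values).
sample = io.StringIO(
    "user_id\titem_id\trating\treview_text\n"
    "u1\ti1\t4.0\tGreat plot and lovely characters.\n"
)
print(missing_columns(sample))  # an empty list means the header is complete
```

For a real file, the same helper could be called with `open(raw_file, newline='')` in place of the `StringIO` object.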

### Step 2: Generate Lexicon
**Use ```cornac.data.lexicon.SentimentAnalysis```**

- Input: path to the raw data file
- Output: 
  - rating.txt: [user_id, item_id, rating]
  - lexicon.txt: [user_id, item_id, lexicon]

*Note: The user-item pairs in the two output files are consistent with each other.*
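
One way to sanity-check this consistency is to compare the (user, item) pairs found in each file. The sketch below assumes comma-separated lines whose first two fields are the user and item IDs, as implied by the `sep=','` used when these files are read back in the training section. `pairs` is a hypothetical helper, and the in-memory lines are made up.

```python
def pairs(lines, sep=","):
    """Collect the (user, item) pair from the first two fields of each line."""
    result = set()
    for line in lines:
        line = line.strip()
        if line:
            user, item = line.split(sep)[:2]
            result.add((user, item))
    return result


# Made-up stand-ins for the contents of rating.txt and lexicon.txt.
rating_lines = ["u1,i1,4.0", "u2,i1,3.0"]
lexicon_lines = ["u1,i1,plot:great:1", "u2,i1,voice:flat:-1,ending:weak:-1"]

print(pairs(rating_lines) == pairs(lexicon_lines))  # True
```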

In [67]:
import warnings
warnings.filterwarnings('ignore') 
from cornac.data.lexicon import SentimentAnalysis

In [68]:
output_lexicon = './dataset/lexicon.txt'
output_rating = './dataset/rating.txt'
# usecols is required and must match the column names in the first line of the raw file
SA = SentimentAnalysis(raw_file, sep=sep, usecols=['user_id', 'item_id', 'rating', 'review_text'])
df = SA.build_lexicons()
SA.save_to_file(output_lexicon, output_rating)
df.head()

100%|██████████| 60/60 [00:00<00:00, 101.53it/s]

number of users: 12
number of items: 10
total60
15 rows have no lexicon
45 rows after dropping users having less than 1 reviews





| | user_id | item_id | rating | review_text | lexicon |
|---|---|---|---|---|---|
| 0 | 713cc3505e77532b97d0a69812320fa7 | 4303163 | 2.0 | Complex and captivating, DARKLY showcases Pess... | twists:unexpected:1,page:final:1,delight:absol... |
| 1 | 5f0d7ea4515a98abebea35cec77f864c | 192805 | 4.0 | I was absolutely captivated by the synopsis fo... | thing:kinda:1,job:amazing:1 |
| 5 | 9003d274774f4c47e62f77600b08ac1d | 23167683 | 3.0 | A failure on nearly every level. The character... | voice:preposterous:1,concern:main:1,stumps:pro... |
| 6 | 9131e02af6b7d8d2dd23472b264971af | 23167683 | 4.0 | I believe this book will be in my top 3 I have... | duties:everyday:1 |
| 7 | ba2455719e99ae6e0771877da9e81474 | 48100 | 3.0 | An interesting twist on the black ops/espionag... | twist:interesting:1,genres:black:1,point:polit... |


# Now Train Recommendation Models

In [69]:

from cornac.data import Reader
from cornac.experiment.experiment import Experiment
from cornac.models import MTER, EFM
from cornac.eval_methods import RatioSplit
from cornac.data import SentimentModality

In [70]:
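reader = Reader()
# Ratings are read as (user, item, rating) triples
ratings = reader.read(output_rating, fmt='UIR', sep=',')
# Each lexicon line holds a user, an item, and ':'-separated tuples
lexicon = reader.read(output_lexicon, fmt='UITup', sep=',', tup_sep=':')
sentiment = SentimentModality(data=lexicon)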

In [71]:
rs = RatioSplit(data=ratings,
                sentiment=sentiment,
                test_size=0.2,
                rating_threshold=4.0,
                seed=123)
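
For intuition about `test_size` and `seed`, the following pure-Python sketch shuffles rating tuples with a fixed seed and holds out a fraction of them for testing. It illustrates the general idea only and is not Cornac's implementation. `ratio_split` is a hypothetical name, and the toy data is made up.

```python
import random


def ratio_split(data, test_size=0.2, seed=123):
    """Shuffle a copy of `data` reproducibly and split it into (train, test)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up (user, item, rating) tuples.
toy = [("u%d" % (i % 3), "i%d" % i, float(i % 5 + 1)) for i in range(10)]
train, test = ratio_split(toy)
print(len(train), len(test))  # 8 2
```

Because the shuffle is seeded, calling the function again with the same arguments yields the same split, which is the role `seed=123` plays in the cell above.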

In [72]:
efm = EFM()
efm.fit(rs.train_set)

<cornac.models.efm.recom_efm.EFM at 0x2889f04c0>

In [73]:
efm.recommend("713cc3505e77532b97d0a69812320fa7")

['342923',
 '48100',
 '192805',
 '23167683',
 '1171422',
 '4303163',
 '27423576',
 '7932435']