## Data Preprocessing - LibraryThing Dataset

***This notebook describes the procedure for preprocessing the LibraryThing dataset. Running the notebook end to end generates two files, "Lthing_rating.txt" and "Lthing_trust.txt", which can be used as inputs to trust-based book recommender systems.***

#### Import required libraries

In [93]:
import pandas as pd
import numpy as np
import json
import os

In [94]:
# printing the current working directory
print(os.getcwd())

C:\Users\Owner\Desktop\Amruta\CMPE256Project


#### Reading the reviews metadata into a dataframe

In [95]:
# taking reviews.json metadata path into a variable
json_metadata_path = r"C:\Users\Owner\Desktop\Amruta\CMPE256Project\LThingData\reviews.json"

In [138]:
import ast
# Read all lines into a list
with open(json_metadata_path) as f:
    json_content = f.readlines()

# Convert each list item to a dict
json_content = [ast.literal_eval(line) for line in json_content]

json_content

[{'work': '3206242',
  'flags': [],
  'unixtime': 1194393600,
  'stars': 5.0,
  'nhelpful': 0,
  'time': 'Nov 7, 2007',
  'comment': 'This a great book for young readers to be introduced to the world of Middle Earth. ',
  'user': 'van_stef'},
 {'work': '12198649',
  'flags': [],
  'unixtime': 1333756800,
  'stars': 5.0,
  'nhelpful': 0,
  'time': 'Apr 7, 2012',
  'comment': 'Help Wanted: Tales of On The Job Terror from Evil Jester Press is a fun and scary read. This book is edited by Peter Giglio and has short stories by Joe McKinney, Gary Brandner, Henry Snider and many more. As if work wasnt already scary enough, this book gives you more reasons to be scared. Help Wanted is an excellent anthology that includes some great stories by some master storytellers.\nOne of the stories includes Agnes: A Love Story by David C. Hayes, which tells the tale of a lawyer named Jack who feels unappreciated at work and by his wife so he starts a relationship with a photocopier. They get along well un

In [139]:
# converting the contents of the json file into a dataframe
reviews_df = pd.DataFrame(json_content)
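
The parsing step above uses `ast.literal_eval`, which suggests that each line of the file is a Python dict literal (single-quoted keys) rather than strict JSON. The following is a minimal sketch of that idea on two made-up lines; `sample_lines` and `parse_line` are hypothetical names introduced only for illustration, and the fallback order (strict JSON first, Python literal second) is an assumption, not part of the original notebook.

```python
import ast
import json

# Hypothetical lines in the same style as the records shown above:
# Python dict literals with single quotes, which json.loads rejects.
sample_lines = [
    "{'work': '3206242', 'stars': 5.0, 'user': 'van_stef'}\n",
    "{'work': '12533765', 'user': 'edspicer'}\n",
]


def parse_line(line):
    """Parse one line: try strict JSON, then fall back to a Python literal."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # literal_eval only evaluates literals (dicts, lists, strings, numbers),
        # so it does not execute arbitrary code from the file.
        return ast.literal_eval(line)


# Skip blank lines so a trailing empty line does not raise an error.
records = [parse_line(line) for line in sample_lines if line.strip()]

print(records[0]["user"])
print("stars" in records[1])  # the second record has no rating
```

A record without a `stars` key is what later shows up as `NaN` in the dataframe.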

In [140]:
# printing the dataframe
reviews_df

Unnamed: 0,work,flags,unixtime,stars,nhelpful,time,comment,user
0,3206242,[],1.194394e+09,5.0,0,"Nov 7, 2007",This a great book for young readers to be intr...,van_stef
1,12198649,[],1.333757e+09,5.0,0,"Apr 7, 2012",Help Wanted: Tales of On The Job Terror from E...,dwatson2
2,12533765,[],1.352938e+09,,0,"Nov 15, 2012","Magoon, K. (2012). Fire in the streets. New Yo...",edspicer
3,12981302,[],1.364515e+09,4.0,0,"Mar 29, 2013","Well, I definitely liked this book better than...",amdrane2
4,5231009,[],1.270944e+09,3.0,0,"Apr 11, 2010",It's a nice science-fiction thriller with some...,Lila_Gustavus
...,...,...,...,...,...,...,...,...
1707065,5377722,[],1.376438e+09,3.0,0,"Aug 14, 2013","Yea, so, this is borderline my type of thing. ...",Jellyn
1707066,13302111,[],1.356739e+09,,0,"Dec 29, 2012",solito pacco editoriale natalizio di monologhi...,ShanaPat
1707067,452711,[],1.220227e+09,4.0,0,"Sep 1, 2008",In The Last Dive: A Father and Sons Fatal Desc...,koeniel
1707068,3109878,[],1.195690e+09,,2,"Nov 22, 2007",The Age of Turbulence by Alan Greenspan\nA Rev...,Ductor


In [141]:
# replace the missing (NaN) star ratings with 0; note that fillna(0) applies to every column of the dataframe
reviews_df = reviews_df.fillna(0)
reviews_df

Unnamed: 0,work,flags,unixtime,stars,nhelpful,time,comment,user
0,3206242,[],1.194394e+09,5.0,0,"Nov 7, 2007",This a great book for young readers to be intr...,van_stef
1,12198649,[],1.333757e+09,5.0,0,"Apr 7, 2012",Help Wanted: Tales of On The Job Terror from E...,dwatson2
2,12533765,[],1.352938e+09,0.0,0,"Nov 15, 2012","Magoon, K. (2012). Fire in the streets. New Yo...",edspicer
3,12981302,[],1.364515e+09,4.0,0,"Mar 29, 2013","Well, I definitely liked this book better than...",amdrane2
4,5231009,[],1.270944e+09,3.0,0,"Apr 11, 2010",It's a nice science-fiction thriller with some...,Lila_Gustavus
...,...,...,...,...,...,...,...,...
1707065,5377722,[],1.376438e+09,3.0,0,"Aug 14, 2013","Yea, so, this is borderline my type of thing. ...",Jellyn
1707066,13302111,[],1.356739e+09,0.0,0,"Dec 29, 2012",solito pacco editoriale natalizio di monologhi...,ShanaPat
1707067,452711,[],1.220227e+09,4.0,0,"Sep 1, 2008",In The Last Dive: A Father and Sons Fatal Desc...,koeniel
1707068,3109878,[],1.195690e+09,0.0,2,"Nov 22, 2007",The Age of Turbulence by Alan Greenspan\nA Rev...,Ductor
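
Calling `fillna(0)` on the whole dataframe fills missing values in every column, and a filled 0 can no longer be told apart from a missing rating. The sketch below, on a made-up three-row frame (`toy` is a hypothetical name), shows two narrower alternatives: filling only the `stars` column, or dropping the unrated rows. Which one suits the downstream recommender is a design decision this sketch does not settle.

```python
import numpy as np
import pandas as pd

# Toy data standing in for reviews_df; values are made up.
toy = pd.DataFrame({
    "work": ["1", "2", "3"],
    "stars": [5.0, np.nan, 4.0],
    "user": ["a", "b", "c"],
})

# Option 1: fill only the rating column, leaving other columns untouched.
toy_filled = toy.copy()
toy_filled["stars"] = toy_filled["stars"].fillna(0)

# Option 2: keep only the rows that carry an explicit rating.
toy_rated = toy.dropna(subset=["stars"])

print(toy_filled)
print(toy_rated)
```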


In [142]:
# drop the columns that are not needed for the rating file
# the result is stored in a new dataframe so that the original dataframe stays as is
new_reviews_df = reviews_df.drop(columns=['flags','unixtime','nhelpful','time','comment'])

new_reviews_df

Unnamed: 0,work,stars,user
0,3206242,5.0,van_stef
1,12198649,5.0,dwatson2
2,12533765,0.0,edspicer
3,12981302,4.0,amdrane2
4,5231009,3.0,Lila_Gustavus
...,...,...,...
1707065,5377722,3.0,Jellyn
1707066,13302111,0.0,ShanaPat
1707067,452711,4.0,koeniel
1707068,3109878,0.0,Ductor


In [143]:
# sorting the dataframe by user column
new_reviews_df = new_reviews_df.sort_values(by=['user']) # we see that there are empty cells in the user column

In [144]:
# Reviews with no user are not helpful, so we remove those rows from the dataframe
new_reviews_df = new_reviews_df[new_reviews_df['user'].str.strip().astype(bool)]

new_reviews_df

Unnamed: 0,work,stars,user
833714,30040,5.0,%C3%90ark-Angel
1217988,1637340,3.0,---fan
91800,733009,3.0,-AlyssaE-
125986,2729214,3.0,-AlyssaE-
1608276,3032251,4.0,-AlyssaE-
...,...,...,...
800841,11490342,4.0,zzshupinga
1154272,12462078,5.0,zzshupinga
1590461,10904215,5.0,zzshupinga
554152,14062110,4.0,zzshupinga
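
The filter above relies on an empty string being falsy once whitespace is stripped. A more explicit way to express the same intent is to compare the stripped value with `''`; the sketch below does that on made-up data (`toy` and `has_user` are hypothetical names), and additionally treats missing values as empty, which is an assumption beyond what the original cell does.

```python
import pandas as pd

# Toy data: two real users, one empty string, one whitespace-only string.
toy = pd.DataFrame({
    "user": ["alice", "", "   ", "bob"],
    "stars": [5.0, 3.0, 4.0, 2.0],
})

# True where the user name still has characters after stripping whitespace.
# fillna("") makes missing names count as empty as well.
has_user = toy["user"].fillna("").str.strip() != ""

kept = toy[has_user]
print(kept)
```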


In [145]:
# check if work column has any NaN values
new_reviews_df['work'].isna().any()

False

In [146]:
# check if there are duplicated values
duplicate = new_reviews_df[new_reviews_df.duplicated()]

In [147]:
duplicate # no duplicates found

Unnamed: 0,work,stars,user


**Using LabelEncoder from the sklearn library to encode the string values as integers, since the recommender algorithms expect numeric IDs rather than strings**

In [124]:
# import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# list of column names to be encoded
columns_to_be_encoded = ['work','user']

# instantiate one independent encoder per column
encoders = {column: LabelEncoder() for column in columns_to_be_encoded}

for column in columns_to_be_encoded:
    new_reviews_df[column] = encoders[column].fit_transform(new_reviews_df[column])

In [125]:
# sorting the encoded dataframe by user and storing it in a new dataframe
le_rating_df = new_reviews_df.sort_values(by=['user'])

In [126]:
le_rating_df

Unnamed: 0,work,stars,user
833714,266338,5.0,0
1217988,189305,3.0,1
463942,44118,4.0,2
991816,307829,4.0,2
673621,188812,5.0,2
...,...,...,...
1427042,154240,5.0,83192
654139,152760,4.0,83192
573252,30617,3.0,83192
1004932,459414,5.0,83192
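
Once the names are replaced by integers, the only link back to the original user names and work IDs is the fitted encoder. The sketch below, on made-up names, shows how the `classes_` attribute and `inverse_transform` of `LabelEncoder` can recover that mapping; the `lookup` table is a hypothetical addition, not something the original notebook saves.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy user column; names are made up.
toy = pd.DataFrame({"user": ["zoe", "adam", "zoe", "mia"]})

encoder = LabelEncoder()
toy["user_id"] = encoder.fit_transform(toy["user"])

# classes_ holds the original labels in sorted order;
# the position of a label in classes_ is its integer code.
lookup = pd.DataFrame({
    "user_id": range(len(encoder.classes_)),
    "user": encoder.classes_,
})

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(toy["user_id"].to_numpy())

print(toy)
print(lookup)
```

Writing `lookup` to disk next to the rating file would make it possible to translate recommendations back into readable names later.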


In [127]:
# rearranging the columns as required by the algorithm - user bookid rating
le_rating_df = le_rating_df[["user", "work", "stars"]]

le_rating_df

Unnamed: 0,user,work,stars
833714,0,266338,5.0
1217988,1,189305,3.0
463942,2,44118,4.0
991816,2,307829,4.0
673621,2,188812,5.0
...,...,...,...
1427042,83192,154240,5.0
654139,83192,152760,4.0
573252,83192,30617,3.0
1004932,83192,459414,5.0


In [128]:
# write the dataframe to a space-separated text file (fmt='%d' writes every value as an integer)
np.savetxt(r'C:\Users\Owner\Desktop\Amruta\CMPE256Project\Lthing_rating.txt', le_rating_df.values, fmt='%d')
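
Because `fmt='%d'` formats every column as an integer, any fractional rating would be truncated when written. Whether the dataset actually contains such ratings is not shown in the output above, so the following is only a sketch under that assumption: it writes a made-up array twice, once with `'%d'` and once with a per-column format list, to show the difference. File names and values are hypothetical.

```python
import os
import tempfile

import numpy as np

# Toy rows in "user work stars" order; 4.5 is a made-up fractional rating.
toy = np.array([[0, 10, 4.5],
                [1, 11, 3.0]])

with tempfile.TemporaryDirectory() as tmp:
    int_path = os.path.join(tmp, "ratings_int.txt")
    mixed_path = os.path.join(tmp, "ratings_mixed.txt")

    # One format for all columns: the rating 4.5 is written as 4.
    np.savetxt(int_path, toy, fmt="%d")

    # One format per column: integer IDs, rating with one decimal place.
    np.savetxt(mixed_path, toy, fmt=["%d", "%d", "%.1f"])

    with open(int_path) as f:
        int_text = f.read()
    with open(mixed_path) as f:
        mixed_text = f.read()

print(int_text)
print(mixed_text)
```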

#### Reading the edges metadata into a dataframe

In [161]:
# reading the path of edges.txt into a variable
edges_metadata_path = r"C:\Users\Owner\Desktop\Amruta\CMPE256Project\LThingData\edges.txt"

In [162]:
# reading the data of the txt file into a dataframe and naming columns
edges_df = pd.read_csv(edges_metadata_path, header = None, sep = ' ', names = ["user", "trusted"])

In [163]:
# printing the dataframe
edges_df

Unnamed: 0,user,trusted
0,Rodo,anehan
1,Rodo,sevilemar
2,Rodo,dingsi
3,Rodo,slash
4,RelaxedReader,AnnRig
...,...,...
219785,Capfox,lampbane
219786,Capfox,maberry
219787,Capfox,raphinou
219788,Capfox,library1359


In [164]:
# check whether the trusted column contains empty strings

edges_df[edges_df['trusted'] == ''].index # no empty strings in the trusted column

Int64Index([], dtype='int64')

In [165]:
# check whether the user column contains empty strings

edges_df[edges_df['user'] == ''].index # no empty strings in the user column

Int64Index([], dtype='int64')
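
With its default settings, `pd.read_csv` converts an empty field to `NaN` rather than to an empty string, so a comparison with `''` would come back empty even if a name were missing. The sketch below illustrates this on a made-up three-line input (`raw` and `toy_edges` are hypothetical names) and shows `isna()` as a complementary check.

```python
import io

import pandas as pd

# Made-up edge list; the second line has an empty "trusted" field.
raw = "alice bob\ncarol \ndave erin\n"

toy_edges = pd.read_csv(io.StringIO(raw), header=None, sep=" ",
                        names=["user", "trusted"])

# Comparing with '' finds nothing, because the empty field was read as NaN.
empty_string_rows = toy_edges[toy_edges["trusted"] == ""]

# isna() does find the row with the missing name.
missing_rows = toy_edges[toy_edges["trusted"].isna()]

print(len(empty_string_rows), len(missing_rows))
```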

In [166]:
# add a trust column with value 1, meaning that the person in the user column trusts the person in the trusted column
edges_df.loc[:,'trust'] = 1

In [167]:
edges_df = edges_df.sort_values('user')

edges_df

Unnamed: 0,user,trusted,trust
84099,-AlyssaE-,othersam,1
84097,-AlyssaE-,othersam,1
84096,-AlyssaE-,ENCPress,1
84095,-AlyssaE-,CarlaR,1
84098,-AlyssaE-,Bookmarque,1
...,...,...,...
42194,zzshupinga,joshuamneff,1
42195,zzshupinga,joslintepper,1
42196,zzshupinga,JustinTheLibrarian,1
42210,zzshupinga,Jenica26,1


In [168]:
# check if there are duplicated values
duplicate = edges_df[edges_df.duplicated()]

In [169]:
duplicate # there are 17612 duplicate rows

Unnamed: 0,user,trusted,trust
84097,-AlyssaE-,othersam,1
133752,-HarryH-,kennylu,1
133748,-HarryH-,alanpan,1
78446,-Nieves-,aleteo,1
136503,-Trin-,fastia,1
...,...,...,...
124571,zolasdisciple,bjbookman,1
218769,ztutz,astutz,1
52448,zwelbast,DavidBronkhorst,1
52466,zwelbast,DavidBronkhorst,1


In [170]:
# remove the duplicate rows, keeping the first occurrence
edges_df.drop_duplicates(subset=['user', 'trusted', 'trust'], keep='first', inplace=True)

In [171]:
edges_df

Unnamed: 0,user,trusted,trust
84099,-AlyssaE-,othersam,1
84096,-AlyssaE-,ENCPress,1
84095,-AlyssaE-,CarlaR,1
84098,-AlyssaE-,Bookmarque,1
133763,-HarryH-,Angela_C,1
...,...,...,...
42194,zzshupinga,joshuamneff,1
42195,zzshupinga,joslintepper,1
42196,zzshupinga,JustinTheLibrarian,1
42210,zzshupinga,Jenica26,1
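
The duplicate handling above can be sketched on a small made-up edge list: `duplicated()` flags every repeat after the first occurrence, and `drop_duplicates(keep='first')` removes exactly those flagged rows. The names `toy_edges`, `n_dupes`, and `deduped` are hypothetical.

```python
import pandas as pd

# Made-up trust edges; the pair (a, x) appears twice.
toy_edges = pd.DataFrame({
    "user": ["a", "a", "a", "b"],
    "trusted": ["x", "x", "y", "x"],
})

# duplicated() marks the second and later occurrences of a row.
n_dupes = int(toy_edges.duplicated().sum())

# Returning a new frame (instead of inplace=True) keeps the original for comparison.
deduped = toy_edges.drop_duplicates(keep="first").reset_index(drop=True)

print(n_dupes)
print(deduped)
```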


**Using LabelEncoder from the sklearn library to encode the string values as integers, since the recommender algorithms expect numeric IDs rather than strings**

In [172]:
# import LabelEncoder from the sklearn library
from sklearn.preprocessing import LabelEncoder

# list of column names to be encoded
columns_to_be_encoded = ['user','trusted']

# instantiate one independent encoder per column
# (the same username can therefore get different codes in 'user' and 'trusted')
encoders = {column: LabelEncoder() for column in columns_to_be_encoded}

for column in columns_to_be_encoded:
    edges_df[column] = encoders[column].fit_transform(edges_df[column])

In [173]:
# sorting the encoded dataframe by user and storing it in a new dataframe
le_edges_df = edges_df.sort_values(by=['user'])

In [174]:
le_edges_df

Unnamed: 0,user,trusted,trust
84099,0,49216,1
84096,0,6077,1
84095,0,3674,1
84098,0,2901,1
133750,1,38338,1
...,...,...,...
42204,25609,50361,1
42206,25609,51908,1
42205,25609,51577,1
42209,25609,58054,1


In [176]:
# Saving the dataframe into a text file - Lthing_trust.txt
np.savetxt(r'C:\Users\Owner\Desktop\Amruta\CMPE256Project\Lthing_trust.txt', le_edges_df.values, fmt='%d')
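
Each `LabelEncoder` in this notebook is fit on a single column, so a given username can receive one integer in the rating file and different integers in the `user` and `trusted` columns of the trust file. If the recommender needs user IDs to agree across both files, one option is to fit a single encoder on the union of all usernames. The following is a minimal sketch of that option on made-up data; `ratings`, `edges`, and `user_encoder` are hypothetical names, and this is an alternative to the per-column encoding used above, not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-ins for the rating and trust dataframes.
ratings = pd.DataFrame({"user": ["bob", "alice"],
                        "work": ["w1", "w2"],
                        "stars": [4.0, 5.0]})
edges = pd.DataFrame({"user": ["alice", "bob"],
                      "trusted": ["carol", "alice"]})

# Collect every username that appears in any of the three user-like columns.
all_users = pd.concat([ratings["user"], edges["user"], edges["trusted"]]).unique()

# Fit one encoder on the union so a name maps to the same ID everywhere.
user_encoder = LabelEncoder().fit(all_users)

ratings["user"] = user_encoder.transform(ratings["user"])
edges["user"] = user_encoder.transform(edges["user"])
edges["trusted"] = user_encoder.transform(edges["trusted"])

print(ratings)
print(edges)
```

Here "alice" is encoded as 0 in all three columns, so a trust edge can be joined directly against the ratings.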