# Data Understanding on Original Datasets

## 1. Data Import (items.csv, transactions.csv)

In [3]:
import pandas as pd
transactions_dataset = pd.read_csv('../Original_dataset/transactions.csv',sep='|')
items_dataset = pd.read_csv('../Original_dataset/items.csv',sep='|')

## 2. Introduction to Transaction Dataset

### 2.1 Data Description

In [4]:
#see the first 5 rows for transaction data
transactions_dataset.head()

| | sessionID | itemID | click | basket | order |
|---|---|---|---|---|---|
| 0 | 0 | 21310 | 1 | 0 | 0 |
| 1 | 1 | 73018 | 1 | 0 | 0 |
| 2 | 2 | 19194 | 1 | 0 | 0 |
| 3 | 3 | 40250 | 1 | 0 | 0 |
| 4 | 4 | 46107 | 1 | 0 | 0 |


In [9]:
print('1. total number of transactions is:', transactions_dataset.sessionID.count())
print ('2. number of distinct sessions is:',  len(pd.unique(transactions_dataset['sessionID'])))

#we can label each instance by true whenever it is duplicated
transactions_dataset['duplicated'] = transactions_dataset.duplicated(subset='sessionID', keep=False)
duplicated_transactions = transactions_dataset[transactions_dataset['duplicated']!=False]
print('3. number of rows with repeated session id is:',duplicated_transactions.sessionID.count())
print ('3. Repeated sessions include',  len(pd.unique(duplicated_transactions['sessionID'])),'distinct session id')

print('4. maximum number of repetition for a session id is:', duplicated_transactions['sessionID'].value_counts().max())
print('5. number of distinct itemID that exists in transaction data is:', len(pd.unique(transactions_dataset['itemID'])))
print('6. number of distinct items that we have information about them from more than one session id is: ', len(pd.unique(duplicated_transactions['itemID'])))

1. total number of transactions is: 365143
2. number of distinct sessions is: 271983
3. number of rows with repeated session id is: 129501
3. Repeated sessions include 36341 distinct session id
4. maximum number of repetition for a session id is: 213
5. number of distinct itemID that exists in transaction data is: 24909
6. number of distinct items that we have information about them from more than one session id is:  14471


### **From the above result, we have the following data understanding:**
1. The transactions data set contains 365,143 rows. We assume that different session ids refer to different users.
2. There are 271,983 distinct sessions (users).
3. 235,642 sessions (86.6% of all sessions) have exactly one record; these rows make up 64.5% of the data. The remaining 36,341 sessions (13.4%) account for 129,501 records (35.5% of all rows), which is about 3.6 records per session on average.
4. The maximum number of records for a single session is 213. The most important point here is that we can only rely on these 129,501 records to find associations.
5. As shown in Section 3.1, the item data set lists 78,030 books. The transaction history covers only 24,909 of them (32%), so for the remaining 68% we have no information at all (not even one click).
6. Further, associations can only be mined from sessions that contain more than one record. The items that appear in such sessions are limited to 14,471 (only 18.5% of books).
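
The shares in points 3 and 4 can be derived from a per-session row count. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not part of the original notebook); the same steps should apply to `transactions_dataset`.

```python
import pandas as pd

# Toy stand-in for transactions_dataset (values are made up for illustration).
toy = pd.DataFrame({
    'sessionID': [0, 1, 1, 2, 3, 3, 3],
    'itemID':    [10, 11, 12, 10, 13, 11, 14],
})

# Number of rows recorded for each session.
session_sizes = toy.groupby('sessionID').size()

single = session_sizes[session_sizes == 1]   # sessions with exactly one record
multi = session_sizes[session_sizes > 1]     # sessions with several records

share_single_sessions = len(single) / len(session_sizes)  # share of sessions
share_single_rows = single.sum() / len(toy)                # share of rows
avg_rows_multi = multi.mean()                              # rows per multi-record session

print(share_single_sessions, share_single_rows, avg_rows_multi)
```

Keeping the two denominators apart (sessions versus rows) avoids mixing up the share of sessions with the share of records.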

**(Note that our recommender system cannot work the way collaborative-filtering recommenders usually do, simply because we have no user id at the moment we recommend the 5 most similar items. Put differently, we recommend regardless of the identity of the user who searches our DB. Nevertheless, we can still rely on the wisdom of the crowd to some extent: the transaction dataset can be used to find out which items occurred together in a session (associations: click, basket, order). It also lets us distinguish between the head and the long tail.)**
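
As a rough illustration of "which items have been together", the sketch below counts how often two items share a session and returns the most frequent companions of an item. It is only one simple way to use co-occurrence, not the method of this notebook; the `rows` data and the helper `top_co_occurring` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (sessionID, itemID) rows, made up for illustration.
rows = [(1, 'A'), (1, 'B'), (1, 'C'), (2, 'A'), (2, 'B'), (3, 'D')]

# Collect the set of items seen in each session.
items_by_session = defaultdict(set)
for session_id, item_id in rows:
    items_by_session[session_id].add(item_id)

# Count every unordered item pair once per session in which both occur.
pair_counts = Counter()
for items in items_by_session.values():
    for a, b in combinations(sorted(items), 2):
        pair_counts[(a, b)] += 1

def top_co_occurring(item, k=5):
    """Return up to k items that most often share a session with `item`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return scores.most_common(k)

print(top_co_occurring('A'))
```

Items that only ever appear in single-record sessions (such as `'D'` here) get no companions, which is why the multi-record sessions matter.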


In [10]:
#delete the duplicated column
del transactions_dataset['duplicated']

### 2.2 Data Distribution of Different Attributes

In [12]:
#the data distribution of transaction dataset
transactions_dataset[['click', 'basket','order']].describe()

| | click | basket | order |
|---|---|---|---|
| count | 365143.0 | 365143.0 | 365143.0 |
| mean | 1.23318 | 0.141202 | 0.048403 |
| std | 1.069996 | 1.107574 | 0.268717 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.0 | 0.0 |
| 50% | 1.0 | 0.0 | 0.0 |
| 75% | 1.0 | 0.0 | 0.0 |
| max | 118.0 | 293.0 | 28.0 |


At the 75th percentile we still find only one click and zero basket and order events. This means that for at least three quarters of the transactions the recorded information is limited to a single click. The means tell us more about the customer journey, i.e. how likely a click is to convert into a basket event and then into an order. Finally, the transaction data set can give us information about bestsellers.
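
The means above can be turned into rough funnel ratios. Because all three columns are averaged over the same 365,143 rows, the ratio of two means equals the ratio of the column totals. This is only a coarse indicator: the data does not say which clicks led to which basket or order events. The variable names are illustrative.

```python
# Column means copied from the describe() output above.
mean_click = 1.23318
mean_basket = 0.141202
mean_order = 0.048403

# Ratios of totals, read as "events of one kind per event of another kind".
basket_per_click = mean_basket / mean_click
order_per_basket = mean_order / mean_basket
order_per_click = mean_order / mean_click

print(f'basket events per click:  {basket_per_click:.3f}')
print(f'orders per basket event:  {order_per_basket:.3f}')
print(f'orders per click:         {order_per_click:.3f}')
```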

In [46]:
with_click_transactions = transactions_dataset [transactions_dataset['click'] > 0]
with_basket_transactions = transactions_dataset [transactions_dataset['basket'] > 0]
with_order_transactions = transactions_dataset [transactions_dataset['order'] > 0]
print('number of distinct items with at least one click in our transaction history: ', len(pd.unique(with_click_transactions['itemID'])))
print('number of distinct items with at least one basket in our transaction history: ', len(pd.unique(with_basket_transactions['itemID'])))
print('number of distinct items with at least one order in our transaction history: ', len(pd.unique(with_order_transactions['itemID'])))

number of distinct items with at least one click in our transaction history:  24620
number of distinct items with at least one basket in our transaction history:  8746
number of distinct items with at least one order in our transaction history:  5298


In our transaction data set only 5,298 distinct items (6.8% of all 78,030 books) have ever been ordered, only 8,746 (11.2%) have ever been put in a basket, and 24,620 (31.6%) have ever been clicked.

### 2.3 Information about Transactional Association

In [47]:
# now we focus on items that can give us information about associations.
dup_click_trans = duplicated_transactions [duplicated_transactions['click'] > 0]
dup_basket_trans = duplicated_transactions [duplicated_transactions['basket'] > 0]
dup_order_trans = duplicated_transactions [duplicated_transactions['order'] > 0]
print('number of distinct books that have been clicked by more than one session id:', len(pd.unique(dup_click_trans['itemID'])))
print('number of distinct books that have been put in the basket by more than one session id: ', len(pd.unique(dup_basket_trans['itemID'])))
print('number of distinct books that have been ordered by more than one session id: ', len(pd.unique(dup_order_trans['itemID'])))

number of distinct books that have been clicked by more than one session id: 14076
number of distinct books that have been put in the basket by more than one session id:  7149
number of distinct books that have been ordered by more than one session id:  3115


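Note that `duplicated_transactions` keeps the rows whose session id occurs more than once, so these counts describe items inside multi-record sessions. Only 3,115 books (4% of all books) have been ordered in such a session, and only 14,076 (18% of all books) have been clicked in one.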
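
The counts above come from rows whose session id is repeated. A related but different question is how many items occur in at least two distinct sessions. A minimal sketch of that count, on a made-up toy frame (names and values are assumptions for illustration), could look like this:

```python
import pandas as pd

# Toy stand-in for transactions_dataset (values are made up for illustration).
toy = pd.DataFrame({
    'sessionID': [0, 1, 1, 2, 3],
    'itemID':    [10, 11, 12, 10, 13],
    'order':     [1, 0, 1, 1, 0],
})

# Number of distinct sessions in which each item appears.
sessions_per_item = toy.groupby('itemID')['sessionID'].nunique()
items_in_multiple_sessions = sessions_per_item[sessions_per_item > 1].index.tolist()

# Same idea, restricted to rows with at least one order.
ordered = toy[toy['order'] > 0]
order_sessions_per_item = ordered.groupby('itemID')['sessionID'].nunique()
items_ordered_in_multiple_sessions = (
    order_sessions_per_item[order_sessions_per_item > 1].index.tolist()
)

print(items_in_multiple_sessions, items_ordered_in_multiple_sessions)
```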

## 3. Introduction to Item Dataset
### 3.1 Data Description

In [13]:
items_dataset.head()

| | itemID | title | author | publisher | main topic | subtopics |
|---|---|---|---|---|---|---|
| 0 | 21310 | Princess Poppy: The Big Mix Up | Janey Louise Jones | Penguin Random House Children's UK | YFB | [5AH] |
| 1 | 73018 | Einfach zeichnen! Step by Step | Wiebke Krabbe | Schwager und Steinlein | AGZ | [5AJ,AGZ,WFA,YBG,YBL,YNA,YPA] |
| 2 | 19194 | Red Queen 1 | Victoria Aveyard | Orion Publishing Group | YFH | [5AP,FBA] |
| 3 | 40250 | Meine Kindergarten-Freunde (Pirat) | | Ars Edition GmbH | YB | [5AC,5AD,YBG,YBL,YF] |
| 4 | 46107 | Mein großes Schablonen-Buch - Wilde Tiere | Elizabeth Golding | Edition Michael Fischer | WFTM | [WD,WFTM,YBG,YBL,YBLD,YBLN1] |


In [14]:
print('total number of items is', items_dataset.itemID.count())
print('there are ', len(pd.unique(items_dataset['title'])),'distinct titles')
print('there are ', len(pd.unique(items_dataset['author'])),'distinct authors')
print('there are ', len(pd.unique(items_dataset['publisher'])),'distinct publisher')

total number of items is 78030
there are  72128 distinct titles
there are  35970 distinct authors
there are  7073 distinct publisher


In [33]:
items_dataset['dup-title'] = items_dataset.duplicated(subset=['title'], keep=False)
items_dataset['dup-title-author'] = items_dataset.duplicated(subset=['title','author'], keep=False)
items_dataset['dup-title-author-pub'] = items_dataset.duplicated(subset=['title','author','publisher'], keep=False)
items_dataset['dup-title-author-pub-top'] = items_dataset.duplicated(subset=['title','author','publisher','main topic'], keep=False)
items_dataset['dup-title-author-pub-top-sub'] = items_dataset.duplicated(subset=['title','author','publisher','main topic','subtopics'], keep=False)
duplicated_itmes1 = items_dataset[items_dataset['dup-title']!=False]
duplicated_itmes2 = items_dataset[items_dataset['dup-title-author']!=False]
duplicated_itmes3 = items_dataset[items_dataset['dup-title-author-pub']!=False]
duplicated_itmes4 = items_dataset[items_dataset['dup-title-author-pub-top']!=False]
duplicated_itmes5 = items_dataset[items_dataset['dup-title-author-pub-top-sub']!=False]
print('1. there are {} books with same title, but different ids'.format(duplicated_itmes1.itemID.count()))
print('2. there are {} books with same title & author, but different ids among them, \n   there are {} distinct titles'.format(duplicated_itmes2.itemID.count(), len(pd.unique(duplicated_itmes2['title']))))
print('3. there are {} books that are duplicated at least one time. i.e. with same title, \n   author & publisher, but different ids among them, there are {} distinct titles'.format(duplicated_itmes3.itemID.count(), len(pd.unique(duplicated_itmes3['title']))))
print('4. there are {} books that are duplicated at least one time. i.e. with same title, \n   author, publisher & topic, but different ids among them, there are {} distinct title'.format(duplicated_itmes4.itemID.count(), len(pd.unique(duplicated_itmes4['title']))))
print('5. there are {} books that are duplicated at least one time. i.e. with same title, \n   author, publisher, topic & sub-topic, but different ids among them, \n   there are {} distinct title'.format(duplicated_itmes5.itemID.count(), len(pd.unique(duplicated_itmes5['title']))))



1. there are 10095 books with same title, but different ids
2. there are 7411 books with same title & author, but different ids among them, 
   there are 3182 distinct titles
3. there are 4496 books that are duplicated at least one time. i.e. with same title, 
   author & publisher, but different ids among them, there are 1999 distinct titles
4. there are 3830 books that are duplicated at least one time. i.e. with same title, 
   author, publisher & topic, but different ids among them, there are 1725 distinct title
5. there are 3390 books that are duplicated at least one time. i.e. with same title, 
   author, publisher, topic & sub-topic, but different ids among them, 
   there are 1527 distinct title


We need a way to deal with repeated books in our dataset:
1. 4,496 records (5.8% of all records) can be considered duplicated: for each of them there is at least one other record with exactly the same title, author, and publisher, but a different id. (**The most plausible explanation is that the dataset contains more than one edition of a book. Since we have no information about editions, we should decide how we want to deal with these records.**)
   - 3,390 of these records are complete duplicates: there is at least one other record with exactly the same title, author, publisher, topic, and sub-topic, but a different id.
   - 440 (=3830-3390) records are duplicated with regard to all features except sub-topic. It makes little sense for the same book to have the same topic but different sub-topics. One simple remedy is to substitute the sub-topic field of each repeated record with the union of the sub-topics of all its twins.
   - 666 (=4496-3830) records are duplicated with regard to title, author, and publisher but have different topics and sub-topics. This should not happen in the real world, so we can use the same remedy for both the topic and sub-topic attributes.
2. There are also 2,915 (=7411-4496) records (3.7% of all records) that are (**duplicated with regard to title and author, but published by different publishers.**) It is very likely that our model will recommend these books when one of their twins is searched.
3. Finally, there are 2,684 (=10095-7411) records (3.4% of all records) with a duplicated title but different authors. This can happen in the real world, but we should consider one possibility: (**There may be errors in the author data entry, or one entry may list only a subset of the authors.**) So we should take this into account in our data preprocessing.
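
The "union of sub-topics" remedy from point 1 could be sketched as follows. This is a minimal illustration on a made-up toy frame, assuming that `subtopics` is stored as a string such as `[5AH,YBG]`; the helpers `parse_codes` and `union_codes` and the column `subtopics_merged` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for items_dataset (values are made up for illustration).
toy_items = pd.DataFrame({
    'itemID':    [1, 2, 3],
    'title':     ['Book A', 'Book A', 'Book B'],
    'author':    ['X', 'X', 'Y'],
    'publisher': ['P', 'P', 'Q'],
    'subtopics': ['[5AH,YBG]', '[YBG,YBL]', '[FBA]'],
})

def parse_codes(raw):
    """Turn a string such as '[5AH,YBG]' into a set of codes."""
    if pd.isna(raw):
        return set()
    return {code for code in str(raw).strip('[]').split(',') if code}

def union_codes(series):
    """Merge the codes of all records in one group into a single string."""
    merged = set()
    for raw in series:
        merged |= parse_codes(raw)
    return '[' + ','.join(sorted(merged)) + ']'

# Records that share title, author and publisher are treated as twins.
# Note: by default groupby leaves out rows with a missing key value.
keys = ['title', 'author', 'publisher']
toy_items['subtopics_merged'] = (
    toy_items.groupby(keys)['subtopics'].transform(union_codes)
)

print(toy_items[['itemID', 'subtopics_merged']])
```

Writing the result to a new column keeps the original `subtopics` values available for comparison.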

In [34]:
# remove all helper columns that were added for the duplicate analysis
items_dataset.drop(['dup-title','dup-title-author','dup-title-author-pub','dup-title-author-pub-top','dup-title-author-pub-top-sub'], axis=1, inplace=True)

One promising idea is to annotate each item in the item dataset with its language, inferred from the title. This idea relies on the assumption that readers are interested in books written in one specific language, or at least that a reader who is interested in a book written in a given language will probably also be interested in other books written in that language.
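
A very crude way to try out this idea is to count language-specific stopwords in the title. The sketch below is only an illustration under strong assumptions: the two stopword lists are tiny hand-picked examples, the function `guess_language` is hypothetical, and short titles will often come out as `'unknown'`. A dedicated language-detection tool would likely do better.

```python
import re

# Tiny, hand-picked stopword lists (an assumption made for illustration only).
STOPWORDS = {
    'de': {'der', 'die', 'das', 'und', 'ein', 'eine', 'mein', 'meine', 'für', 'mit'},
    'en': {'the', 'and', 'of', 'a', 'an', 'my', 'for', 'with', 'in', 'to'},
}

def guess_language(title):
    """Return 'de', 'en' or 'unknown' based on stopword overlap with the title."""
    # Lowercased words made of letters only (digits and punctuation are dropped).
    words = set(re.findall(r'[^\W\d_]+', str(title).lower()))
    scores = {lang: len(words & vocab) for lang, vocab in STOPWORDS.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, runner_up) = ranked[0], ranked[1]
    # No evidence at all, or a tie between languages: do not guess.
    if best_score == 0 or best_score == runner_up:
        return 'unknown'
    return best

for title in ['Princess Poppy: The Big Mix Up',
              'Meine Kindergarten-Freunde (Pirat)',
              'Red Queen 1']:
    print(title, '->', guess_language(title))
```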

### 3.2 Understand the Main Topic Attribute

In [35]:
print ('number of distinct main topics is:',  len(pd.unique(items_dataset['main topic'])))
items_filtered = items_dataset.dropna(subset=['main topic'])
print('total number of books with something as main topic is:', items_filtered.itemID.count())

number of distinct main topics is: 700
total number of books with something as main topic is: 77772


The item dataset has 700 distinct values in the main topic column. Note that `pd.unique` counts the missing value as one of them, so there are 699 actual topic codes. Also note that 258 item ids have no main topic at all (i.e. 78,030 minus 77,772), so we have to handle the missing values for these entries.
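
When counting distinct values it matters whether the missing value is included. The toy series below (made up for illustration) shows the difference between `pd.unique`, which keeps the missing value as one entry, and `Series.nunique`, which drops it by default.

```python
import pandas as pd

# Toy stand-in for items_dataset['main topic'] with two missing entries.
topics = pd.Series(['YFB', 'AGZ', None, 'YFB', None], name='main topic')

n_unique_with_missing = len(pd.unique(topics))  # missing value counted once
n_unique_codes = topics.nunique()               # dropna=True by default
n_missing = int(topics.isna().sum())            # rows without a main topic

print(n_unique_with_missing, n_unique_codes, n_missing)
```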

In [36]:
# length (in characters) of every main topic code
topic_lengths = items_filtered['main topic'].str.len()
print ('The maximum length of main topic is:', topic_lengths.max())
print ('The minimum length of main topic is:', topic_lengths.min())

The maximum length of main topic is: 10
The minimum length of main topic is: 1


In [37]:
# count how many main topics have each length from 1 to 10
length_counts = (items_filtered['main topic'].str.len()
                 .value_counts()
                 .reindex(range(1, 11), fill_value=0))
print ('distribution of main topic length is {min to max}:', *length_counts.tolist())

distribution of main topic length is {min to max}: 291 17689 44551 13395 1105 740 0 0 0 1


Clearly, we can omit the single data point whose main topic has length 10; the remaining lengths then lie between 1 and 6. Furthermore, the most common lengths are 2, 3, and 4, which together cover 97% of all books.

In [45]:
topicage = '5A'
# count main topics that contain '5A'; missing values never match
j = items_dataset['main topic'].str.contains(topicage, regex=False, na=False).sum()
print ('number of main topic with 5A (Interest Age) category:', j)

number of main topic with 5A (Interest Age) category: 0


### 3.3 Understand the Subtopics Attribute

In [44]:
j = 0
sub = '5A'
for i in range (len(items_dataset)):
    sub_topic = str(items_dataset.subtopics.loc[i])
    if (sub in sub_topic): j = j + 1
print ('number of subtopic with 5A (Interest Age) category:', j)

number of subtopic with 5A (Interest Age) category: 12548
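
The check above is a plain substring test on the whole field. A slightly stricter variant is to split the field into individual codes and test each code for the `5A` prefix. The sketch below assumes that `subtopics` is stored as a string such as `[5AJ,AGZ,WFA]`; the helpers `parse_codes` and `has_interest_age` are hypothetical names.

```python
def parse_codes(raw):
    """Split a string such as '[5AJ,AGZ]' into a list of codes."""
    text = str(raw).strip()
    if not (text.startswith('[') and text.endswith(']')):
        return []  # missing or malformed entry
    return [code.strip() for code in text[1:-1].split(',') if code.strip()]

def has_interest_age(raw, prefix='5A'):
    """True if at least one code in the field starts with the given prefix."""
    return any(code.startswith(prefix) for code in parse_codes(raw))

# Sample field values; 'nan' mimics str() of a missing entry.
samples = ['[5AH]', '[5AJ,AGZ,WFA]', '[WD,WFTM,YBG]', '[]', 'nan']
flags = [has_interest_age(s) for s in samples]
print(flags)
```

On the real data this could be applied with `items_dataset['subtopics'].apply(has_interest_age).sum()`.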
