# **Data Preparation for Habitech Initial Dataset**

This notebook preprocesses the data for the initial cluster dataset. The preprocessed data is used to build the initial clusters in the Habitech matchmaking system.

*   Dataset for User Preference: https://www.kaggle.com/ruchi798/bookcrossing-dataset
*   Dataset for Book Database: https://www.kaggle.com/lukaanicin/book-covers-dataset

## Import Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

## Unzip Data

In [2]:
!unzip Preprocessed_data.zip

Archive:  Preprocessed_data.zip
  inflating: Preprocessed_data.csv   


## Read & See Data Summary

In [3]:
df = pd.read_csv('./Preprocessed_data.csv')
print(len(df))
df.head(5)

1031175


Unnamed: 0.1,Unnamed: 0,user_id,location,age,isbn,rating,book_title,book_author,year_of_publication,publisher,img_s,img_m,img_l,Summary,Language,Category,city,state,country
0,0,2,"stockton, california, usa",18.0,195153448,0,Classical Mythology,Mark P. O. Morford,2002.0,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,Provides an introduction to classical myths pl...,en,['Social Science'],stockton,california,usa
1,1,8,"timmins, ontario, canada",34.7439,2005018,5,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],timmins,ontario,canada
2,2,11400,"ottawa, ontario, canada",49.0,2005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],ottawa,ontario,canada
3,3,11676,"n/a, n/a, n/a",34.7439,2005018,8,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],,,
4,4,41385,"sudbury, ontario, canada",34.7439,2005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],sudbury,ontario,canada


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1031175 entries, 0 to 1031174
Data columns (total 19 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   Unnamed: 0           1031175 non-null  int64  
 1   user_id              1031175 non-null  int64  
 2   location             1031175 non-null  object 
 3   age                  1031175 non-null  float64
 4   isbn                 1031175 non-null  object 
 5   rating               1031175 non-null  int64  
 6   book_title           1031175 non-null  object 
 7   book_author          1031175 non-null  object 
 8   year_of_publication  1031175 non-null  float64
 9   publisher            1031175 non-null  object 
 10  img_s                1031175 non-null  object 
 11  img_m                1031175 non-null  object 
 12  img_l                1031175 non-null  object 
 13  Summary              1031175 non-null  object 
 14  Language             1031175 non-null  object 
 15

## Choose Desired Column

In [5]:
new_df = pd.DataFrame(
    data = {
      'user_id': df['user_id'],
      'age': df['age'],
      'rating': df['rating'],
      'book_title': df['book_title'],
      'book_author': df['book_author'],
      'category': df['Category']      
    } 
)
print(len(new_df))
new_df.head(5)

1031175


Unnamed: 0,user_id,age,rating,book_title,book_author,category
0,2,18.0,0,Classical Mythology,Mark P. O. Morford,['Social Science']
1,8,34.7439,5,Clara Callan,Richard Bruce Wright,['Actresses']
2,11400,49.0,0,Clara Callan,Richard Bruce Wright,['Actresses']
3,11676,34.7439,8,Clara Callan,Richard Bruce Wright,['Actresses']
4,41385,34.7439,0,Clara Callan,Richard Bruce Wright,['Actresses']
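
The same selection can also be written as a column subset followed by a rename. The following is a minimal sketch on a tiny stand-in frame (the rows are copied from the preview above; the `wanted` list is an illustrative name, not part of the notebook):

```python
import pandas as pd

# Tiny stand-in for the full ratings frame loaded from Preprocessed_data.csv
df = pd.DataFrame({
    "user_id": [2, 8],
    "age": [18.0, 34.7439],
    "rating": [0, 5],
    "book_title": ["Classical Mythology", "Clara Callan"],
    "book_author": ["Mark P. O. Morford", "Richard Bruce Wright"],
    "Category": ["['Social Science']", "['Actresses']"],
    "publisher": ["Oxford University Press", "HarperFlamingo Canada"],
})

wanted = ["user_id", "age", "rating", "book_title", "book_author", "Category"]

# Keep only the wanted columns and lowercase "Category" to match the other names
new_df = df[wanted].rename(columns={"Category": "category"})
print(new_df.columns.tolist())
```

Either form produces the same six columns; the subset form avoids repeating each column name twice.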


## Drop Null Values

In [6]:
new_df = new_df.dropna()
print(len(new_df))
new_df.head(5)

1031175


Unnamed: 0,user_id,age,rating,book_title,book_author,category
0,2,18.0,0,Classical Mythology,Mark P. O. Morford,['Social Science']
1,8,34.7439,5,Clara Callan,Richard Bruce Wright,['Actresses']
2,11400,49.0,0,Clara Callan,Richard Bruce Wright,['Actresses']
3,11676,34.7439,8,Clara Callan,Richard Bruce Wright,['Actresses']
4,41385,34.7439,0,Clara Callan,Richard Bruce Wright,['Actresses']


## Change Age Data Type into Integer

In [7]:
new_df['age'] = new_df['age'].astype('int')
new_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031175 entries, 0 to 1031174
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   user_id      1031175 non-null  int64 
 1   age          1031175 non-null  int64 
 2   rating       1031175 non-null  int64 
 3   book_title   1031175 non-null  object
 4   book_author  1031175 non-null  object
 5   category     1031175 non-null  object
dtypes: int64(3), object(3)
memory usage: 55.1+ MB
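
Note that `astype('int')` truncates rather than rounds, which is why the non-integer age 34.7439 in the earlier preview shows up as 34 in later cells. A small sketch of the difference, on made-up values taken from that preview:

```python
import pandas as pd

ages = pd.Series([18.0, 34.7439, 49.0])

# astype drops the fractional part
truncated = ages.astype("int")

# Rounding first moves each value to the nearest integer before the cast
rounded = ages.round().astype("int")

print(truncated.tolist(), rounded.tolist())
```

Which behaviour is preferable depends on how the fractional ages were produced; the notebook itself uses truncation.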


## Load Main Dataset

In [8]:
!unzip main_dataset.csv.zip

Archive:  main_dataset.csv.zip
  inflating: main_dataset.csv        


In [9]:
main_dataset_df = pd.read_csv('./main_dataset.csv')
main_dataset_df.head(5)

Unnamed: 0,image,name,author,format,book_depository_stars,price,currency,old_price,isbn,category,img_paths
0,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,This is Going to Hurt,Adam Kay,Paperback,4.5,7.6,$,11.4,9781509858637,Medical,dataset/Medical/0000001.jpg
1,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,"Thinking, Fast and Slow",Daniel Kahneman,Paperback,4.0,11.5,$,15.0,9780141033570,Medical,dataset/Medical/0000002.jpg
2,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,When Breath Becomes Air,Paul Kalanithi,Paperback,4.5,9.05,$,11.5,9781784701994,Medical,dataset/Medical/0000003.jpg
3,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,The Happiness Trap,Russ Harris,Paperback,4.0,8.34,$,13.9,9781845298258,Medical,dataset/Medical/0000004.jpg
4,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,Man's Search For Meaning,Viktor E. Frankl,Paperback,4.5,9.66,$,,9781846041242,Medical,dataset/Medical/0000005.jpg


In [10]:
main_dataset_df = pd.DataFrame(
  data = {
      'title': main_dataset_df['name'],
      'author': main_dataset_df['author'],
      'category': main_dataset_df['category'],
  }
)
main_dataset_df = main_dataset_df.dropna()
main_dataset_df

Unnamed: 0,title,author,category
0,This is Going to Hurt,Adam Kay,Medical
1,"Thinking, Fast and Slow",Daniel Kahneman,Medical
2,When Breath Becomes Air,Paul Kalanithi,Medical
3,The Happiness Trap,Russ Harris,Medical
4,Man's Search For Meaning,Viktor E. Frankl,Medical
...,...,...,...
32576,Elementary Korean Workbook,Insun Lee,Travel-Holiday-Guides
32577,Lonely Planet Best of Peru,Lonely Planet,Travel-Holiday-Guides
32578,Complete Finnish Beginner to Intermediate Cour...,Terttu Leney,Travel-Holiday-Guides
32579,Simple Thai Food,Leela Punyaratabandhu,Travel-Holiday-Guides


## Keep Only Books That Appear in main_dataset, Matched by (title, author) Tuple

In [11]:
main_dataset_df_no_cat = pd.DataFrame(
  data = {
      'title': main_dataset_df['title'],
      'author': main_dataset_df['author'],
  }
)
main_dataset_df_no_cat.head(3)

Unnamed: 0,title,author
0,This is Going to Hurt,Adam Kay
1,"Thinking, Fast and Slow",Daniel Kahneman
2,When Breath Becomes Air,Paul Kalanithi


In [12]:
# List of book tuple (title, author) in main dataset
main_book_list = list(main_dataset_df_no_cat.itertuples(index=False, name=None))
len(main_book_list)

32383

In [13]:
def book_check(row):
  current_book = (row['book_title'], row['book_author'])
  return current_book in main_book_list

df_is_book_in_main = new_df.apply(book_check, axis=1)
new_df = new_df[df_is_book_in_main] # boolean indexing
new_df 

Unnamed: 0,user_id,age,rating,book_title,book_author,category
4219,26,34,10,To Kill a Mockingbird,Harper Lee,['Fiction']
4220,1032,34,10,To Kill a Mockingbird,Harper Lee,['Fiction']
4221,2766,42,10,To Kill a Mockingbird,Harper Lee,['Fiction']
4222,3542,34,10,To Kill a Mockingbird,Harper Lee,['Fiction']
4223,4017,48,10,To Kill a Mockingbird,Harper Lee,['Fiction']
...,...,...,...,...,...,...
1029090,275631,34,8,It Works,R.H. Jarrett,9
1029110,275737,54,0,Freeing the Natural Voice,Kristin Linklater,['Performing Arts']
1029224,275970,46,0,The Shadow of the Sun,Ryszard Kapuscinski,['Africa']
1029323,275970,46,9,Me Talk Pretty One Day,David Sedaris,9
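
Checking membership in a Python list scans the list for every row, which can be slow with about a million ratings. A possible speed-up, sketched here on toy data with illustrative names (`ratings`, `main_books`, `main_book_set`), is to store the tuples in a set:

```python
import pandas as pd

# Toy stand-ins for new_df and main_dataset_df_no_cat
ratings = pd.DataFrame({
    "user_id": [26, 8, 275970],
    "book_title": ["To Kill a Mockingbird", "Clara Callan", "Me Talk Pretty One Day"],
    "book_author": ["Harper Lee", "Richard Bruce Wright", "David Sedaris"],
})
main_books = pd.DataFrame({
    "title": ["To Kill a Mockingbird", "To Kill a Mockingbird", "Me Talk Pretty One Day"],
    "author": ["Harper Lee", "Harper Lee", "David Sedaris"],
})

# A set offers average constant-time lookups and also removes duplicate pairs
main_book_set = set(main_books.itertuples(index=False, name=None))

# Build one (title, author) tuple per rating row and test it against the set
pairs = zip(ratings["book_title"], ratings["book_author"])
mask = pd.Series([pair in main_book_set for pair in pairs], index=ratings.index)

filtered = ratings[mask]  # boolean indexing, as in the cell above
print(filtered)
```

Because the set drops duplicates, its length can be smaller than the length of the list built in the previous cell; the filtering result is the same.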


## Convert Category based on Main Dataset

In [14]:
import random

def get_main_category(row):
  current_title = row['book_title']
  current_author = row['book_author']

  data_row = main_dataset_df.loc[(main_dataset_df['title'] == current_title) & (main_dataset_df['author'] == current_author)]
  main_category_list = data_row['category'].unique()

  # Randomize if book has more than 1 category
  return random.choice(main_category_list)
  
new_df['category'] = new_df.apply(get_main_category, axis=1)
new_df

Unnamed: 0,user_id,age,rating,book_title,book_author,category
4219,26,34,10,To Kill a Mockingbird,Harper Lee,Childrens-Books
4220,1032,34,10,To Kill a Mockingbird,Harper Lee,Crime-Thriller
4221,2766,42,10,To Kill a Mockingbird,Harper Lee,Crime-Thriller
4222,3542,34,10,To Kill a Mockingbird,Harper Lee,Crime-Thriller
4223,4017,48,10,To Kill a Mockingbird,Harper Lee,Childrens-Books
...,...,...,...,...,...,...
1029090,275631,34,8,It Works,R.H. Jarrett,Mind-Body-Spirit
1029110,275737,54,0,Freeing the Natural Voice,Kristin Linklater,Art-Photography
1029224,275970,46,0,The Shadow of the Sun,Ryszard Kapuscinski,History-Archaeology
1029323,275970,46,9,Me Talk Pretty One Day,David Sedaris,Entertainment
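
Because `random.choice` runs once per row, the same book can receive different categories for different users, as the To Kill a Mockingbird rows above show. If one fixed category per book is wanted instead, one option is to draw once per (title, author) pair with a seeded generator and merge the result back. This is only a sketch under that assumption; the seed, `book_category`, and `result` are illustrative names:

```python
import random
import pandas as pd

# Toy stand-ins for main_dataset_df and new_df
main_dataset_df = pd.DataFrame({
    "title": ["To Kill a Mockingbird", "To Kill a Mockingbird", "It Works"],
    "author": ["Harper Lee", "Harper Lee", "R.H. Jarrett"],
    "category": ["Childrens-Books", "Crime-Thriller", "Mind-Body-Spirit"],
})
new_df = pd.DataFrame({
    "user_id": [26, 1032, 275631],
    "book_title": ["To Kill a Mockingbird", "To Kill a Mockingbird", "It Works"],
    "book_author": ["Harper Lee", "Harper Lee", "R.H. Jarrett"],
})

rng = random.Random(42)  # seeded generator (the seed value is arbitrary)

# Draw a single category for each (title, author) pair
book_category = (
    main_dataset_df.groupby(["title", "author"])["category"]
    .agg(lambda s: rng.choice(sorted(s.unique())))
    .reset_index()
)

# Left merge keeps every rating row; note that merge returns a fresh 0..n-1 index
result = new_df.merge(
    book_category,
    left_on=["book_title", "book_author"],
    right_on=["title", "author"],
    how="left",
).drop(columns=["title", "author"])
print(result)
```

The merge also replaces the per-row `.loc` lookup, which may matter on larger inputs.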


## Group Data by User ID (Combine Row with Same User ID)

In [15]:
# Join with '<,>' as a temporary separator; the next cell splits on it to build lists
new_df_grouped = new_df.groupby(by='user_id', as_index=False).agg({
    'age': 'first',
    'book_title': '<,>'.join, 
    'book_author': '<,>'.join,
    'category': '<,>'.join
})
new_df_grouped.head(5)

Unnamed: 0,user_id,age,book_title,book_author,category
0,26,34,To Kill a Mockingbird,Harper Lee,Childrens-Books
1,32,34,Pride and Prejudice,Jane Austen,Romance
2,75,37,"The Tao of Pooh<,>The Prince","Benjamin Hoff<,>Niccolo Machiavelli","Mind-Body-Spirit<,>Poetry-Drama"
3,77,34,Starship Troopers,Robert A. Heinlein,Science-Fiction-Fantasy-Horror
4,99,42,The Pillars of the Earth,Ken Follett,Romance


In [16]:
new_df_grouped['book_title'] = new_df_grouped['book_title'].apply(lambda x: x.split('<,>'))
new_df_grouped['book_author'] = new_df_grouped['book_author'].apply(lambda x: x.split('<,>'))
new_df_grouped['category'] = new_df_grouped['category'].apply(lambda x: x.split('<,>'))
new_df_grouped.head(5)

Unnamed: 0,user_id,age,book_title,book_author,category
0,26,34,[To Kill a Mockingbird],[Harper Lee],[Childrens-Books]
1,32,34,[Pride and Prejudice],[Jane Austen],[Romance]
2,75,37,"[The Tao of Pooh, The Prince]","[Benjamin Hoff, Niccolo Machiavelli]","[Mind-Body-Spirit, Poetry-Drama]"
3,77,34,[Starship Troopers],[Robert A. Heinlein],[Science-Fiction-Fantasy-Horror]
4,99,42,[The Pillars of the Earth],[Ken Follett],[Romance]
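
The join-then-split round trip can be avoided by aggregating straight into lists, which also removes any dependence on the separator never appearing in a title. A minimal sketch on toy rows:

```python
import pandas as pd

# Toy stand-in for the filtered new_df
new_df = pd.DataFrame({
    "user_id": [26, 75, 75],
    "age": [34, 37, 37],
    "book_title": ["To Kill a Mockingbird", "The Tao of Pooh", "The Prince"],
    "book_author": ["Harper Lee", "Benjamin Hoff", "Niccolo Machiavelli"],
    "category": ["Childrens-Books", "Mind-Body-Spirit", "Poetry-Drama"],
})

# Passing the built-in list as the aggregator collects each group's values in order
grouped = new_df.groupby("user_id", as_index=False).agg({
    "age": "first",
    "book_title": list,
    "book_author": list,
    "category": list,
})
print(grouped)
```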


## Add Education Column based on Age

In [17]:
def get_education(row):
  '''
  Get user education level from age
  Params:
  - row: DataFrame row with an integer 'age' field
  '''
  age = row['age']

  if age < 6:
    return 'Tidak Ada'
  elif age < 12:
    return 'SD'
  elif age < 15:
    return 'SMP'
  elif age < 17:
    return 'SMA'
  elif age < 22:
    return 'Diploma'
  else:
    high_edu = ['S1', 'S2', 'S3', 'Profesi']
    return random.choice(high_edu)

new_df_grouped['education'] = new_df_grouped.apply(get_education, axis=1)
new_df_grouped

Unnamed: 0,user_id,age,book_title,book_author,category,education
0,26,34,[To Kill a Mockingbird],[Harper Lee],[Childrens-Books],S1
1,32,34,[Pride and Prejudice],[Jane Austen],[Romance],Profesi
2,75,37,"[The Tao of Pooh, The Prince]","[Benjamin Hoff, Niccolo Machiavelli]","[Mind-Body-Spirit, Poetry-Drama]",S2
3,77,34,[Starship Troopers],[Robert A. Heinlein],[Science-Fiction-Fantasy-Horror],S2
4,99,42,[The Pillars of the Earth],[Ken Follett],[Romance],S2
...,...,...,...,...,...,...
13305,278641,34,[Danny the Champion of the World],[Roald Dahl],[Teen-Young-Adult],S3
13306,278672,43,[A Year in Provence],[Peter Mayle],[Travel-Holiday-Guides],S2
13307,278796,34,[1984],[George Orwell],[Science-Fiction-Fantasy-Horror],Profesi
13308,278846,23,[Brave New World],[Aldous Huxley],[Childrens-Books],Profesi
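
The age thresholds can also be expressed as bins, with a seeded generator for the random higher-education levels so that reruns give the same column. This is a sketch rather than the notebook's method; the `"higher"` placeholder label and the seed are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

ages = pd.Series([5, 10, 13, 16, 20, 34])

# Left-closed bins mirror the `<` comparisons: [.., 6), [6, 12), [12, 15), [15, 17), [17, 22), [22, ..)
bins = [-np.inf, 6, 12, 15, 17, 22, np.inf]
labels = ["Tidak Ada", "SD", "SMP", "SMA", "Diploma", "higher"]
level = pd.cut(ages, bins=bins, labels=labels, right=False).astype(object)

# Replace the placeholder with a random higher-education level, reproducibly
rng = np.random.default_rng(0)
high_edu = np.array(["S1", "S2", "S3", "Profesi"])
is_higher = (level == "higher").to_numpy()
education = level.to_numpy().copy()
education[is_higher] = rng.choice(high_edu, size=is_higher.sum())
print(education)
```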


## Write Data to CSV

In [18]:
new_df_grouped.to_csv('./final_initial_dataset.csv', index=False)

# **Preprocessing Final Initial Dataset**

## Select Desired Column for Clustering

In [19]:
final_df = pd.read_csv('./final_initial_dataset.csv')
print(len(final_df))
final_df.head(5)

13310


Unnamed: 0,user_id,age,book_title,book_author,category,education
0,26,34,['To Kill a Mockingbird'],['Harper Lee'],['Childrens-Books'],S1
1,32,34,['Pride and Prejudice'],['Jane Austen'],['Romance'],Profesi
2,75,37,"['The Tao of Pooh', 'The Prince']","['Benjamin Hoff', 'Niccolo Machiavelli']","['Mind-Body-Spirit', 'Poetry-Drama']",S2
3,77,34,['Starship Troopers'],['Robert A. Heinlein'],['Science-Fiction-Fantasy-Horror'],S2
4,99,42,['The Pillars of the Earth'],['Ken Follett'],['Romance'],S2
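
As the preview shows, the list columns come back from CSV as strings such as `"['Romance']"`. One way to restore real lists while reading is a `converters` entry with `ast.literal_eval`, which only parses Python literals. A sketch on an in-memory CSV standing in for `final_initial_dataset.csv`:

```python
import ast
import io
import pandas as pd

# In-memory stand-in for the CSV written in the previous section
csv_text = """user_id,category
26,['Childrens-Books']
75,"['Mind-Body-Spirit', 'Poetry-Drama']"
"""

# The converter is applied to every cell of the named column during parsing
final_df = pd.read_csv(
    io.StringIO(csv_text),
    converters={"category": ast.literal_eval},
)
print(final_df["category"].tolist())
```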


In [20]:
final_df_cluster = pd.DataFrame(
    data={
        'age': final_df['age'],
        'category': final_df['category'],
        'education': final_df['education'],
    }
)
print(len(final_df_cluster))
final_df_cluster.head(5)

13310


Unnamed: 0,age,category,education
0,34,['Childrens-Books'],S1
1,34,['Romance'],Profesi
2,37,"['Mind-Body-Spirit', 'Poetry-Drama']",S2
3,34,['Science-Fiction-Fantasy-Horror'],S2
4,42,['Romance'],S2


## One Hot Encode Education Column

In [21]:
import sklearn.preprocessing as prep

In [22]:
education_list = ['SD', 'SMP', 'SMA', 'Diploma', 'S1', 'S2', 'S3', 'Profesi', 'Tidak Ada']
enc_edu = prep.OneHotEncoder(categories=[education_list])

enc_edu_fit = enc_edu.fit_transform(final_df_cluster[['education']])
enc_edu_df = pd.DataFrame(enc_edu_fit.toarray())
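# get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# newer versions provide get_feature_names_out() instead
enc_edu_df.columns = enc_edu.get_feature_names()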

enc_edu_df

Unnamed: 0,x0_SD,x0_SMP,x0_SMA,x0_Diploma,x0_S1,x0_S2,x0_S3,x0_Profesi,x0_Tidak Ada
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
13305,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
13306,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
13307,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
13308,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
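
A pandas-only alternative, sketched below on three sample values, is to declare the full list of education levels as a categorical dtype and call `pd.get_dummies`. Columns are then created for every level, including ones that do not occur in the data. Column names here are the bare level names rather than the `x0_` form above:

```python
import pandas as pd

education_list = ['SD', 'SMP', 'SMA', 'Diploma', 'S1', 'S2', 'S3', 'Profesi', 'Tidak Ada']
education = pd.Series(["S1", "Profesi", "S2"])

# A categorical dtype with explicit categories fixes both the set and the order of columns
edu_dtype = pd.CategoricalDtype(categories=education_list)
enc_edu_df = pd.get_dummies(education.astype(edu_dtype), dtype=float)

# Use plain string column labels
enc_edu_df.columns = [str(c) for c in enc_edu_df.columns]
print(enc_edu_df)
```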


## One Hot Encode Category Column

In [23]:
category_list = main_dataset_df['category'].unique()
len(category_list)

33

In [24]:
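import ast

enc_cat = prep.OneHotEncoder(categories=[category_list])

# Parse the stringified lists read back from CSV; literal_eval only accepts Python literals, unlike eval
category_as_list_df = final_df_cluster['category'].apply(ast.literal_eval)
enc_cat_df = pd.DataFrame()

# Loop variable renamed so it no longer shadows category_list from the previous cell
for user_categories in category_as_list_df:
  encoded_category_list_df = pd.DataFrame()
  for idx in range(len(user_categories)):
    # One-hot encode a single category of the current user
    enc_cat_fit = enc_cat.fit_transform([[user_categories[idx]]])
    
    current_encoded_data = enc_cat_fit.toarray()
    if (len(encoded_category_list_df) == 0):
      encoded_category_list_df = pd.DataFrame(current_encoded_data)
    else:
      row_to_append =  {}
      for i in range(len(encoded_category_list_df.columns)):
        row_to_append[encoded_category_list_df.columns[i]] = current_encoded_data[0][i]
      
      encoded_category_list_df = encoded_category_list_df.append(row_to_append, ignore_index=True)
  # Column-wise max merges the user's one-hot rows into a single multi-hot row
  enc_cat_df = enc_cat_df.append(encoded_category_list_df.agg(['max']))

# get_feature_names() and DataFrame.append() exist only in older releases (removed in scikit-learn 1.2 and pandas 2.0)
enc_cat_df.columns = enc_cat.get_feature_names()

enc_cat_df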

Unnamed: 0,x0_Medical,x0_Science-Geography,x0_Art-Photography,x0_Biography,x0_Business-Finance-Law,x0_Childrens-Books,x0_Computing,x0_Crafts-Hobbies,x0_Crime-Thriller,x0_Dictionaries-Languages,x0_Entertainment,x0_Food-Drink,x0_Graphic-Novels-Anime-Manga,x0_Health,x0_History-Archaeology,x0_Home-Garden,x0_Humour,x0_Mind-Body-Spirit,x0_Natural-History,x0_Personal-Development,x0_Poetry-Drama,x0_Reference,x0_Religion,x0_Romance,x0_Science-Fiction-Fantasy-Horror,x0_Society-Social-Sciences,x0_Sport,x0_Stationery,x0_Teaching-Resources-Education,x0_Technology-Engineering,x0_Teen-Young-Adult,x0_Transport,x0_Travel-Holiday-Guides
max,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
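
The nested loop above builds one multi-hot row per user by taking the column-wise maximum over that user's one-hot rows. The same idea can be sketched with pandas operations only, which avoids row-by-row appends. The four-category list and the sample users below are shortened stand-ins, and the column names omit the `x0_` prefix:

```python
import pandas as pd

category_list = ["Childrens-Books", "Mind-Body-Spirit", "Poetry-Drama", "Romance"]
user_categories = pd.Series([
    ["Childrens-Books"],
    ["Romance"],
    ["Mind-Body-Spirit", "Poetry-Drama"],
])

# One row per (user, category); the user's index label is repeated
exploded = user_categories.explode()

# One-hot encode each exploded row
one_hot = pd.get_dummies(exploded, dtype=float)

# Max per user collapses the rows into a multi-hot vector;
# reindex fixes the column order and fills categories no user has with 0
multi_hot = (
    one_hot.groupby(level=0).max()
    .reindex(columns=category_list, fill_value=0.0)
)
print(multi_hot)
```

The result keeps a regular 0..n-1 index instead of the repeated `max` label.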


## Concatenate All Encoded Data

In [25]:
final_df_cluster_age_only = final_df_cluster['age']
final_df_cluster_age_only

0        34
1        34
2        37
3        34
4        42
         ..
13305    34
13306    43
13307    34
13308    23
13309    33
Name: age, Length: 13310, dtype: int64

In [26]:
df_age = final_df_cluster_age_only.reset_index(drop=True)
df_edu = enc_edu_df.reset_index(drop=True)
df_cat = enc_cat_df.reset_index(drop=True)

final_encoded_dataset = pd.concat([df_age, df_edu, df_cat], axis=1)
final_encoded_dataset.head(5)

Unnamed: 0,age,x0_SD,x0_SMP,x0_SMA,x0_Diploma,x0_S1,x0_S2,x0_S3,x0_Profesi,x0_Tidak Ada,x0_Medical,x0_Science-Geography,x0_Art-Photography,x0_Biography,x0_Business-Finance-Law,x0_Childrens-Books,x0_Computing,x0_Crafts-Hobbies,x0_Crime-Thriller,x0_Dictionaries-Languages,x0_Entertainment,x0_Food-Drink,x0_Graphic-Novels-Anime-Manga,x0_Health,x0_History-Archaeology,x0_Home-Garden,x0_Humour,x0_Mind-Body-Spirit,x0_Natural-History,x0_Personal-Development,x0_Poetry-Drama,x0_Reference,x0_Religion,x0_Romance,x0_Science-Fiction-Fantasy-Horror,x0_Society-Social-Sciences,x0_Sport,x0_Stationery,x0_Teaching-Resources-Education,x0_Technology-Engineering,x0_Teen-Young-Adult,x0_Transport,x0_Travel-Holiday-Guides
0,34,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,37,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,34,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,42,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
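
`pd.concat(..., axis=1)` aligns its inputs on index labels, so resetting each part to a plain 0..n-1 index first is what makes the rows line up positionally (the category frame above carries the repeated label `max`). A small sketch with made-up two-row parts, including a row-count check before concatenating:

```python
import pandas as pd

age = pd.Series([34, 37], name="age", index=[10, 11])                  # non-default index
edu = pd.DataFrame({"x0_S1": [1.0, 0.0], "x0_S2": [0.0, 1.0]})          # default index
cat = pd.DataFrame({"x0_Romance": [0.0, 1.0]}, index=["max", "max"])   # repeated label

# Drop the old labels so every part is indexed 0..n-1
parts = [part.reset_index(drop=True) for part in (age, edu, cat)]

# All parts must describe the same number of users
assert len({len(part) for part in parts}) == 1

combined = pd.concat(parts, axis=1)
print(combined)
```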


## Write Data to CSV & Download CSV

In [27]:
final_encoded_dataset.to_csv('./final_encoded_dataset.csv', index=False)

## (Optional) Auto-Download CSV File if Using Colab

In [28]:
from google.colab import files

files.download('./final_encoded_dataset.csv')
