In this ETL project, we build a collection of New York Times (NYT) fiction best sellers from 2008 to 2018. We assemble a dataset that combines a Kaggle dataset with additional data scraped from Amazon.com, and we save it to MongoDB.
This collection could then be used to build an app that helps book fans select books by keyword or find current deals on NYT best sellers, or to analyze trends in the best seller list, among other potential uses.

The team members in this project are Joyce Jiang and Sandrine Poissonnet.

# EXTRACT

### STEP 1: Extract data from kaggle's file 'nyt2.json'

Our first challenge came with the data file 'nyt2.json', which we downloaded from Kaggle (https://www.kaggle.com/cmenca/new-york-times-hardcover-fiction-best-sellers). We expected a traditional JSON file (a list of dictionaries or a dictionary of dictionaries), but the file instead contains one JSON object per line. We wrote the code below to transform it into a single valid JSON file, which we named 'output.json'.

In [1]:
# Import dependencies needed

import json
import pandas as pd
from pprint import pprint

In [2]:
# Load 'nyt2.json' file into dataframe:

raw_nyt = pd.read_json('Resources/nyt2.json', lines=True, orient='columns')
raw_nyt.head()

Unnamed: 0,_id,amazon_product_url,author,bestsellers_date,description,price,published_date,publisher,rank,rank_last_week,title,weeks_on_list
0,{'$oid': '5b4aa4ead3089013507db18b'},http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Dean R Koontz,{'$date': {'$numberLong': '1211587200000'}},"Odd Thomas, who can communicate with the dead,...",{'$numberInt': '27'},{'$date': {'$numberLong': '1212883200000'}},Bantam,{'$numberInt': '1'},{'$numberInt': '0'},ODD HOURS,{'$numberInt': '1'}
1,{'$oid': '5b4aa4ead3089013507db18c'},http://www.amazon.com/The-Host-Novel-Stephenie...,Stephenie Meyer,{'$date': {'$numberLong': '1211587200000'}},Aliens have taken control of the minds and bod...,{'$numberDouble': '25.99'},{'$date': {'$numberLong': '1212883200000'}},"Little, Brown",{'$numberInt': '2'},{'$numberInt': '1'},THE HOST,{'$numberInt': '3'}
2,{'$oid': '5b4aa4ead3089013507db18d'},http://www.amazon.com/Love-Youre-With-Emily-Gi...,Emily Giffin,{'$date': {'$numberLong': '1211587200000'}},A woman's happy marriage is shaken when she en...,{'$numberDouble': '24.95'},{'$date': {'$numberLong': '1212883200000'}},St. Martin's,{'$numberInt': '3'},{'$numberInt': '2'},LOVE THE ONE YOU'RE WITH,{'$numberInt': '2'}
3,{'$oid': '5b4aa4ead3089013507db18e'},http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,{'$date': {'$numberLong': '1211587200000'}},A Massachusetts state investigator and his tea...,{'$numberDouble': '22.95'},{'$date': {'$numberLong': '1212883200000'}},Putnam,{'$numberInt': '4'},{'$numberInt': '0'},THE FRONT,{'$numberInt': '1'}
4,{'$oid': '5b4aa4ead3089013507db18f'},http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,Chuck Palahniuk,{'$date': {'$numberLong': '1211587200000'}},An aging porn queens aims to cap her career by...,{'$numberDouble': '24.95'},{'$date': {'$numberLong': '1212883200000'}},Doubleday,{'$numberInt': '5'},{'$numberInt': '0'},SNUFF,{'$numberInt': '1'}


In [3]:
# Save DataFrame 'raw_nyt' into a json file ('output.json') and load it as 'data' we can now work with:

raw_nyt.to_json(path_or_buf='Output/output.json', orient = "records")
with open('Output/output.json') as file:
    data = json.load(file)
#pprint(data)
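
For reference, a file with one JSON object per line (often called JSON Lines) can also be parsed with the standard `json` module alone. The sketch below is only an illustration, not the code used in this notebook; the helper name `read_json_lines` and the two sample lines are made up for the example.

```python
import io
import json

def read_json_lines(handle):
    """Parse a file-like object that holds one JSON object per line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records

# Two made-up lines in the same shape as 'nyt2.json'
sample = io.StringIO(
    '{"title": "ODD HOURS", "rank": {"$numberInt": "1"}}\n'
    '{"title": "THE HOST", "rank": {"$numberInt": "2"}}\n'
)
records = read_json_lines(sample)
```

With a real file, `handle` would be the object returned by `open('Resources/nyt2.json')`, and `records` would play the role of `data` above.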

In [4]:
# Set up lists to hold the extracted fields:

nyt_ids = []
urls = []
authors = []
bestsellers_dates = []
descriptions = []
prices = []
published_dates = []
publishers = []
ranks = []
ranks_last_week = []
titles = []
weeks_on_lists = []

# Populate the lists:

for item in data:
    nyt_ids.append(item['_id']['$oid'])
    urls.append(item['amazon_product_url'])
    authors.append(item['author'])
    bestsellers_dates.append(item['bestsellers_date']['$date']['$numberLong'])
    descriptions.append(item['description'])
    published_dates.append(item['published_date']['$date']['$numberLong'])
    publishers.append(item['publisher'])
    ranks.append(item['rank']['$numberInt'])
    ranks_last_week.append(item['rank_last_week']['$numberInt'])
    titles.append(item['title'])
    weeks_on_lists.append(item['weeks_on_list']['$numberInt'])
    # The price is stored under either '$numberInt' or '$numberDouble',
    # so we check the key name before extracting the price string:
    price_key, = item['price'].keys()
    if price_key in ('$numberInt', '$numberDouble'):
        prices.append(item['price'][price_key])
    else:
        # Keep all lists the same length if an unexpected key ever appears
        prices.append(None)
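
The nested wrappers seen above (`{'$numberInt': '27'}`, `{'$date': {'$numberLong': '...'}}`) could also be peeled off by one generic helper instead of field-by-field lookups. This is a minimal sketch under the assumption that every wrapper is a single-key dictionary whose key starts with `$`; the name `unwrap` is hypothetical.

```python
def unwrap(value):
    """Return the plain value inside nested single-key '$...' wrappers.

    Anything that is not a single-key dict with a '$' key is returned unchanged.
    """
    while isinstance(value, dict) and len(value) == 1:
        key, inner = next(iter(value.items()))
        if not key.startswith('$'):
            break
        value = inner
    return value

# One record in the shape shown in the raw_nyt output above (abridged)
item = {
    '_id': {'$oid': '5b4aa4ead3089013507db18c'},
    'title': 'THE HOST',
    'price': {'$numberDouble': '25.99'},
    'bestsellers_date': {'$date': {'$numberLong': '1211587200000'}},
    'rank': {'$numberInt': '2'},
}
flat = {key: unwrap(val) for key, val in item.items()}
```

Because the helper does not care whether the price sits under `$numberInt` or `$numberDouble`, the key check in the loop above would no longer be needed.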

In [5]:
# Create a DataFrame from the lists

bestsellers_dict = {
    "nyt_id": nyt_ids,
    "title": titles,
    "author": authors,
    "url": urls,
    "publisher": publishers,
    "description": descriptions,
    "list_price": prices,
    "published_date": published_dates,
    "bestseller_date": bestsellers_dates,
    "rank": ranks,
    "rank_last_week": ranks_last_week,
    "weeks_on_list": weeks_on_lists
}
bestsellers_data = pd.DataFrame(bestsellers_dict)
bestsellers_data.head()

Unnamed: 0,nyt_id,title,author,url,publisher,description,list_price,published_date,bestseller_date,rank,rank_last_week,weeks_on_list
0,5b4aa4ead3089013507db18b,ODD HOURS,Dean R Koontz,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Bantam,"Odd Thomas, who can communicate with the dead,...",27.0,1212883200000,1211587200000,1,0,1
1,5b4aa4ead3089013507db18c,THE HOST,Stephenie Meyer,http://www.amazon.com/The-Host-Novel-Stephenie...,"Little, Brown",Aliens have taken control of the minds and bod...,25.99,1212883200000,1211587200000,2,1,3
2,5b4aa4ead3089013507db18d,LOVE THE ONE YOU'RE WITH,Emily Giffin,http://www.amazon.com/Love-Youre-With-Emily-Gi...,St. Martin's,A woman's happy marriage is shaken when she en...,24.95,1212883200000,1211587200000,3,2,2
3,5b4aa4ead3089013507db18e,THE FRONT,Patricia Cornwell,http://www.amazon.com/The-Front-Garano-Patrici...,Putnam,A Massachusetts state investigator and his tea...,22.95,1212883200000,1211587200000,4,0,1
4,5b4aa4ead3089013507db18f,SNUFF,Chuck Palahniuk,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,Doubleday,An aging porn queens aims to cap her career by...,24.95,1212883200000,1211587200000,5,0,1


In [6]:
bestsellers_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10195 entries, 0 to 10194
Data columns (total 12 columns):
nyt_id             10195 non-null object
title              10195 non-null object
author             10195 non-null object
url                10195 non-null object
publisher          10195 non-null object
description        10195 non-null object
list_price         10195 non-null object
published_date     10195 non-null object
bestseller_date    10195 non-null object
rank               10195 non-null object
rank_last_week     10195 non-null object
weeks_on_list      10195 non-null object
dtypes: object(12)
memory usage: 955.9+ KB


### STEP 2: Scrape info from amazon.com

In this step, we identify the 2329 unique Amazon URLs in our bestsellers_data DataFrame and visit each one to scrape the Amazon offer price, the number of customer reviews, the average star rating (out of 5), and the thumbnail picture of the book.
This is where we met our second challenge: we had to iterate through our list of Amazon URLs several times, repeatedly requesting the info we were seeking. We each scraped Amazon separately and then combined our respective results (see TRANSFORM, Step 2, further below). Please see files 'Web_Scraping_Amazon.py' and 'More_Scraping_Amazon.ipynb' to examine the code used for this scraping.

We believe we had trouble scraping all of the info from Amazon partly because the site identified us as robots, and partly because the Amazon page relies on a lot of JavaScript that takes time to load. We later learned that the requests library does not execute JavaScript, so it might have been preferable to use Splinter to deal with that problem, but we felt we already had an interesting dataset for the purpose of this project.
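
The idea of going over the URL list several times and retrying only what failed can be sketched as follows. This is not the code in 'Web_Scraping_Amazon.py' or 'More_Scraping_Amazon.ipynb'; `scrape_in_passes`, `fetch` and `fake_fetch` are hypothetical names, and the fetcher is faked so the example runs without any network access.

```python
import time

def scrape_in_passes(urls, fetch, max_passes=3, pause_seconds=0.0):
    """Try every URL, then retry only the failures, for up to max_passes passes.

    `fetch` is a hypothetical callable: it takes a URL and returns a dict of
    scraped fields, or None when the page could not be parsed.
    """
    results = {}
    pending = list(urls)
    for _ in range(max_passes):
        still_pending = []
        for url in pending:
            scraped = fetch(url)
            if scraped is None:
                still_pending.append(url)
            else:
                results[url] = scraped
            time.sleep(pause_seconds)  # pause between requests
        pending = still_pending
        if not pending:
            break
    return results, pending

# Fake fetcher: 'flaky' succeeds on its second attempt, 'blocked' never does
attempts = {}

def fake_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url == "blocked" or (url == "flaky" and attempts[url] < 2):
        return None
    return {"price": "$9.99"}

results, failed = scrape_in_passes(["ok", "flaky", "blocked"], fake_fetch)
```

In a real run, `fetch` would wrap a `requests.get(url, headers=headers)` call plus the BeautifulSoup parsing, and `failed` would list the URLs still missing after the last pass.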

In [7]:
# Import dependencies needed

from bs4 import BeautifulSoup
import requests
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"}


In [8]:
# Get list of unique amazon urls

amazon_urls = list(dict.fromkeys(urls))
len(amazon_urls)

2329

#### Please see file 'Web_Scraping_Amazon.py' to examine the code one team member developed for scraping the Amazon pages. 

#### Please see jupyter notebook 'More_Scraping_Amazon.ipynb' to examine the code used by the other team member for scraping the Amazon pages.

# TRANSFORM

### STEP 1: Transform data from 'bestsellers_data' DataFrame

In this step we group our data by URL and convert some columns to their appropriate data type (integer, float, or date types where needed). 

In [9]:
# Import dependencies needed
import numpy as np
from datetime import datetime

In [10]:
# Group bestsellers_data by url

grouped = bestsellers_data.groupby('url')
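# Select several columns with a list (double brackets); tuple-style selection is no longer accepted by recent pandas versions
bestsellers_grouped = grouped[['url', 'title', 'author', 'publisher', 'description', 'list_price', 'weeks_on_list']].max()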

# Calculate additional columns

bestsellers_grouped['first_date_listed'] = grouped[['published_date']].min()
bestsellers_grouped['last_date_listed'] = grouped[['published_date']].max()
bestsellers_grouped['worst_rank'] = grouped[['rank']].max()
bestsellers_grouped['best_rank'] = grouped[['rank']].min()
bestsellers_grouped['times_listed'] = grouped[['url']].count()
bestsellers_grouped.count()

url                  2329
title                2329
author               2329
publisher            2329
description          2329
list_price           2329
weeks_on_list        2329
first_date_listed    2329
last_date_listed     2329
worst_rank           2329
best_rank            2329
times_listed         2329
dtype: int64
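
One thing to watch in this step: 'rank', 'weeks_on_list' and 'list_price' are still strings when min() and max() are applied, so they are compared character by character ('9' sorts after '16'). That would be consistent with rows in the head() output further below where best_rank is larger than worst_rank. A minimal sketch of converting before aggregating, on made-up toy data, assuming a pandas version that supports named aggregation:

```python
import pandas as pd

# Toy stand-in for bestsellers_data: ranks are strings, as after the extract step
toy = pd.DataFrame({
    "url": ["a", "a", "b"],
    "rank": ["9", "16", "3"],
})

# On strings the comparison is lexicographic, so '9' is the "max" of '9' and '16'
string_worst = toy.groupby("url")["rank"].max()

# Converting first makes min/max numeric
toy["rank"] = pd.to_numeric(toy["rank"])
summary = toy.groupby("url").agg(
    best_rank=("rank", "min"),
    worst_rank=("rank", "max"),
    times_listed=("rank", "count"),
)
```

Here `summary` holds best_rank 9 and worst_rank 16 for URL 'a', whereas the string comparison reports '9' as the worst rank.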

In [11]:
bestsellers_grouped.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2329 entries, http://www.amazon.com/10th-Anniversary-Womens-Murder-Club/dp/1455511463?tag=NYTBS-20 to https://www.amazon.com/You-Will-Pay-Lisa-Jackson/dp/1617734667?tag=NYTBS-20
Data columns (total 12 columns):
url                  2329 non-null object
title                2329 non-null object
author               2329 non-null object
publisher            2329 non-null object
description          2329 non-null object
list_price           2329 non-null object
weeks_on_list        2329 non-null object
first_date_listed    2329 non-null object
last_date_listed     2329 non-null object
worst_rank           2329 non-null object
best_rank            2329 non-null object
times_listed         2329 non-null int64
dtypes: int64(1), object(11)
memory usage: 236.5+ KB


In [12]:
# Convert 'list_price' to float:
bestsellers_grouped['list_price'] = bestsellers_grouped['list_price'].apply(lambda x : float(x))

# Convert date columns from unix time stamp to date format:
bestsellers_grouped['first_date_listed'] = bestsellers_grouped['first_date_listed'].apply(lambda x : datetime.utcfromtimestamp(int(x[:10])).strftime('%Y-%m-%d'))
bestsellers_grouped['last_date_listed'] = bestsellers_grouped['last_date_listed'].apply(lambda x : datetime.utcfromtimestamp(int(x[:10])).strftime('%Y-%m-%d'))

# Convert 'weeks_on_list', 'worst_rank' and 'best_rank' to integer:
bestsellers_grouped['weeks_on_list'] = bestsellers_grouped['weeks_on_list'].apply(lambda x : int(x))
bestsellers_grouped['worst_rank'] = bestsellers_grouped['worst_rank'].apply(lambda x : int(x))
bestsellers_grouped['best_rank'] = bestsellers_grouped['best_rank'].apply(lambda x : int(x))
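
The date conversion above keeps the first 10 characters of the timestamp, which assumes every value is a 13-digit millisecond timestamp. It also relies on `datetime.utcfromtimestamp`, which is deprecated as of Python 3.12. A possible alternative, shown as a sketch with the hypothetical helper name `ms_to_date`, divides by 1000 and uses a timezone-aware call:

```python
from datetime import datetime, timezone

def ms_to_date(ms_string):
    """Convert a millisecond Unix timestamp (given as a string) to 'YYYY-MM-DD' in UTC."""
    seconds = int(ms_string) / 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime('%Y-%m-%d')

# Timestamps taken from the first rows of the dataset
first = ms_to_date('1211587200000')   # '2008-05-24'
second = ms_to_date('1212883200000')  # '2008-06-08'
```

It could be applied in the same way as the lambdas above, for example `bestsellers_grouped['first_date_listed'].apply(ms_to_date)`.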

In [13]:
bestsellers_grouped.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2329 entries, http://www.amazon.com/10th-Anniversary-Womens-Murder-Club/dp/1455511463?tag=NYTBS-20 to https://www.amazon.com/You-Will-Pay-Lisa-Jackson/dp/1617734667?tag=NYTBS-20
Data columns (total 12 columns):
url                  2329 non-null object
title                2329 non-null object
author               2329 non-null object
publisher            2329 non-null object
description          2329 non-null object
list_price           2329 non-null float64
weeks_on_list        2329 non-null int64
first_date_listed    2329 non-null object
last_date_listed     2329 non-null object
worst_rank           2329 non-null int64
best_rank            2329 non-null int64
times_listed         2329 non-null int64
dtypes: float64(1), int64(4), object(7)
memory usage: 236.5+ KB


In [14]:
bestsellers_grouped.head()

url (index),url,title,author,publisher,description,list_price,weeks_on_list,first_date_listed,last_date_listed,worst_rank,best_rank,times_listed
http://www.amazon.com/10th-Anniversary-Womens-Murder-Club/dp/1455511463?tag=NYTBS-20,http://www.amazon.com/10th-Anniversary-Womens-...,10TH ANNIVERSARY,James Patterson and Maxine Paetro,"Little, Brown",Detective Lindsay Boxer’s long-awaited wedding...,27.99,8,2011-05-22,2011-07-24,9,16,10
http://www.amazon.com/11-22-63-A-Novel/dp/1451627297?tag=NYTBS-20,http://www.amazon.com/11-22-63-A-Novel/dp/1451...,11/22/63,Stephen King,Scribner,An English teacher travels back to 1958 by way...,35.0,9,2011-11-27,2012-05-13,6,1,23
http://www.amazon.com/11th-Hour-Womens-Murder-Club/dp/0446571830?tag=NYTBS-20,http://www.amazon.com/11th-Hour-Womens-Murder-...,11TH HOUR,James Patterson and Maxine Paetro,"Little, Brown","When a millionaire is gunned down, Detective L...",27.99,8,2012-05-27,2012-07-29,6,1,10
http://www.amazon.com/1225-Christmas-Tree-Lane-Cedar/dp/0778312690?tag=NYTBS-20,http://www.amazon.com/1225-Christmas-Tree-Lane...,1225 CHRISTMAS TREE LANE,Debbie Macomber,Mira,Puppies that need good homes and an ex-husband...,16.95,4,2011-10-16,2011-11-27,8,10,7
http://www.amazon.com/12th-Never-Womens-Murder-Club/dp/1455515795?tag=NYTBS-20,http://www.amazon.com/12th-Never-Womens-Murder...,12TH OF NEVER,James Patterson and Maxine Paetro,"Little, Brown","One week after the birth of her baby, Detectiv...",0.0,6,2013-05-19,2013-06-30,8,1,7


### STEP 2: Combine output files of Amazon data

In this step, we merge our scraped data. Some of the scraped results are duplicated, so we clean them up with merge and drop_duplicates. We end up with a DataFrame of 2194 unique URLs.

In [15]:
file1=pd.read_csv("Output/output.csv")
file2=pd.read_csv("Output/output1.csv")
file3=pd.read_csv("Output/output2.csv")
file4=pd.read_csv("Output/output3.csv")
file5=pd.read_csv("Output/output4.csv")
file6=pd.read_csv("Output/more_scraped_data.csv")

In [16]:
file6=file6.rename(columns={"nb_reviews":"reviews","nb_stars":"rating","amazon_price":"price"})
file6['reviews']=file6['reviews'].apply(str)
file6['rating']=file6['rating'].apply(str)

In [17]:
join1=pd.merge(file1,file2,on=['url','reviews','rating','price'],how='outer')
join2=pd.merge(join1,file3,on=['url','reviews','rating','price'],how='outer')
join3=pd.merge(join2,file4,on=['url','reviews','rating','price'],how='outer')
join4=pd.merge(join3,file5,on=['url','reviews','rating','price'],how='outer')
join5=pd.merge(join4,file6,on=['url','reviews','rating','price'],how='outer')

In [18]:
clean_data=join5.drop_duplicates(subset='url', keep='first', inplace=False)
clean_data.head()

Unnamed: 0,Unnamed: 0_x,url,reviews,rating,price,Unnamed: 0_y,Unnamed: 0_x.1,Unnamed: 0_y.1,Unnamed: 0_x.2,Unnamed: 0_y.2,img_url
0,0.0,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,237 customer reviews,3.5 out of 5 stars,$9.99,4.0,4.0,,,,
20,1.0,http://www.amazon.com/The-Whole-Truth-David-Ba...,"1,064 customer reviews",4.4 out of 5 stars,$9.99,9.0,9.0,,,,
24,2.0,http://www.amazon.com/Story-Edgar-Sawtelle-Dav...,"2,573 customer reviews",3.8 out of 5 stars,$8.99,65.0,73.0,13.0,1.0,,
288,3.0,http://www.amazon.com/The-Quilters-Kitchen-Qui...,218 customer reviews,3.6 out of 5 stars,$10.99,,,,,,
289,4.0,http://www.amazon.com/Testimony-A-Novel-Anita-...,408 customer reviews,3.9 out of 5 stars,$7.99,,,,28.0,,
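
The chained outer merges leave several leftover index columns ('Unnamed: 0_x', 'Unnamed: 0_y', ...) in the result, as seen above. A possible alternative is to stack the files and keep one row per URL. This is a sketch on made-up data, assuming every file has the 'url', 'reviews', 'rating' and 'price' columns; it keeps the first row found for each URL in file order, which may not select exactly the same rows as the merges above.

```python
import pandas as pd

cols = ["url", "reviews", "rating", "price"]

# Made-up stand-ins for two of the scraped output files
part1 = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "url": ["u1", "u2"],
    "reviews": ["237 customer reviews", "408 customer reviews"],
    "rating": ["3.5 out of 5 stars", "3.9 out of 5 stars"],
    "price": ["$9.99", "$7.99"],
})
part2 = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "url": ["u2", "u3"],
    "reviews": ["408 customer reviews", "218 customer reviews"],
    "rating": ["3.9 out of 5 stars", "3.6 out of 5 stars"],
    "price": ["$7.99", "$10.99"],
})

# Stack the files, keep only the shared columns, then keep one row per URL
stacked = pd.concat([part1[cols], part2[cols]], ignore_index=True)
combined = stacked.drop_duplicates(subset="url", keep="first").reset_index(drop=True)
```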


In [19]:
final_data=clean_data[['url','reviews','rating','price']]
final_data.head()

Unnamed: 0,url,reviews,rating,price
0,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,237 customer reviews,3.5 out of 5 stars,$9.99
20,http://www.amazon.com/The-Whole-Truth-David-Ba...,"1,064 customer reviews",4.4 out of 5 stars,$9.99
24,http://www.amazon.com/Story-Edgar-Sawtelle-Dav...,"2,573 customer reviews",3.8 out of 5 stars,$8.99
288,http://www.amazon.com/The-Quilters-Kitchen-Qui...,218 customer reviews,3.6 out of 5 stars,$10.99
289,http://www.amazon.com/Testimony-A-Novel-Anita-...,408 customer reviews,3.9 out of 5 stars,$7.99


In [20]:
final_data.count()

url        2194
reviews    2194
rating     2194
price      1809
dtype: int64

In [21]:
final_data.to_csv('Output/join_data.csv')

### STEP 3: Merge Amazon and NYT data
In this step, we merge our combined scraped Amazon data with our NYT dataset. We do some additional data type conversions and other clean-up.

In [22]:
# Load our combined scraped Amazon data ('join_data.csv')

Amazon_data = pd.read_csv("Output/join_data.csv")
Amazon_data.count()

Unnamed: 0    2194
url           2194
reviews       1990
rating        1975
price         1809
dtype: int64

In [23]:
# Merge Amazon dataframe with NYT dataframe

df = pd.merge(bestsellers_grouped, Amazon_data, on='url',how='left')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2329 entries, 0 to 2328
Data columns (total 16 columns):
url                  2329 non-null object
title                2329 non-null object
author               2329 non-null object
publisher            2329 non-null object
description          2329 non-null object
list_price           2329 non-null float64
weeks_on_list        2329 non-null int64
first_date_listed    2329 non-null object
last_date_listed     2329 non-null object
worst_rank           2329 non-null int64
best_rank            2329 non-null int64
times_listed         2329 non-null int64
Unnamed: 0           2194 non-null float64
reviews              1990 non-null object
rating               1975 non-null object
price                1809 non-null object
dtypes: float64(2), int64(4), object(10)
memory usage: 309.3+ KB


Defaulting to column, but this will raise an ambiguity error in a future version
  exec(code_obj, self.user_global_ns, self.user_ns)
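
The truncated warning above appears to come from pandas noticing that 'url' is both the index (created by the groupby) and a regular column of bestsellers_grouped. A sketch of one way to avoid the ambiguity, on toy data with made-up values: drop the index before merging, and use `indicator=True` to count the books with no scraped Amazon data.

```python
import pandas as pd

# Toy frame where 'url' is both the index name and a column, as after the groupby
nyt = pd.DataFrame({"url": ["u1", "u2", "u3"], "title": ["A", "B", "C"]})
nyt.index = pd.Index(nyt["url"], name="url")

amazon = pd.DataFrame({"url": ["u1", "u3"], "price": ["$9.99", "$7.99"]})

# Dropping the index first leaves 'url' as an ordinary column only
nyt_flat = nyt.reset_index(drop=True)
merged = pd.merge(nyt_flat, amazon, on="url", how="left", indicator=True)

# '_merge' is 'both' for matched rows and 'left_only' for books without scraped data
unmatched = int((merged["_merge"] == "left_only").sum())
```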


In [24]:
# Drop columns 'Unnamed: 0'
df.drop('Unnamed: 0', axis=1, inplace=True)

In [25]:
# Rename some columns
df = df.rename(columns={"list_price":"NYT_price", "price":"Amazon_price", 'url':'Amazon_url', 'reviews': 'Amazon_reviews', 'rating': 'Amazon_rating'})


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2329 entries, 0 to 2328
Data columns (total 15 columns):
Amazon_url           2329 non-null object
title                2329 non-null object
author               2329 non-null object
publisher            2329 non-null object
description          2329 non-null object
NYT_price            2329 non-null float64
weeks_on_list        2329 non-null int64
first_date_listed    2329 non-null object
last_date_listed     2329 non-null object
worst_rank           2329 non-null int64
best_rank            2329 non-null int64
times_listed         2329 non-null int64
Amazon_reviews       1990 non-null object
Amazon_rating        1975 non-null object
Amazon_price         1809 non-null object
dtypes: float64(1), int64(4), object(10)
memory usage: 291.1+ KB


In [27]:
# Convert data types where appropriate

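# Use a raw string for the regular expression so that '\$' is not treated as an invalid escape sequence
df['Amazon_price']=df['Amazon_price'].replace(r'[\$,]', '', regex=True).astype(float)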
df['Amazon_reviews']=df['Amazon_reviews'].str.replace('customer reviews', '')
df['Amazon_rating']=df['Amazon_rating'].str.replace('out of 5 stars', '')
df["Amazon_price_as%_NYT_price"]=(df["Amazon_price"]/df["NYT_price"]*100).round(2)

df['Amazon_reviews']=df['Amazon_reviews'].str.replace(',', '')
df['Amazon_reviews']=pd.to_numeric(df['Amazon_reviews'], errors='coerce')
df['Amazon_rating']=pd.to_numeric(df['Amazon_rating'], errors='coerce')

df.dtypes

Amazon_url                     object
title                          object
author                         object
publisher                      object
description                    object
NYT_price                     float64
weeks_on_list                   int64
first_date_listed              object
last_date_listed               object
worst_rank                      int64
best_rank                       int64
times_listed                    int64
Amazon_reviews                float64
Amazon_rating                 float64
Amazon_price                  float64
Amazon_price_as%_NYT_price    float64
dtype: object

In [28]:
output_table = df[['title','author','publisher','description','Amazon_url','times_listed','first_date_listed','last_date_listed','weeks_on_list','worst_rank','best_rank','Amazon_reviews','Amazon_rating','NYT_price','Amazon_price','Amazon_price_as%_NYT_price']].reset_index(drop=True)
output_table.head()


Unnamed: 0,title,author,publisher,description,Amazon_url,times_listed,first_date_listed,last_date_listed,weeks_on_list,worst_rank,best_rank,Amazon_reviews,Amazon_rating,NYT_price,Amazon_price,Amazon_price_as%_NYT_price
0,10TH ANNIVERSARY,James Patterson and Maxine Paetro,"Little, Brown",Detective Lindsay Boxer’s long-awaited wedding...,http://www.amazon.com/10th-Anniversary-Womens-...,10,2011-05-22,2011-07-24,8,9,16,1365.0,4.6,27.99,9.99,35.69
1,11/22/63,Stephen King,Scribner,An English teacher travels back to 1958 by way...,http://www.amazon.com/11-22-63-A-Novel/dp/1451...,23,2011-11-27,2012-05-13,9,6,1,26427.0,4.5,35.0,10.99,31.4
2,11TH HOUR,James Patterson and Maxine Paetro,"Little, Brown","When a millionaire is gunned down, Detective L...",http://www.amazon.com/11th-Hour-Womens-Murder-...,10,2012-05-27,2012-07-29,8,6,1,1858.0,4.6,27.99,9.99,35.69
3,1225 CHRISTMAS TREE LANE,Debbie Macomber,Mira,Puppies that need good homes and an ex-husband...,http://www.amazon.com/1225-Christmas-Tree-Lane...,7,2011-10-16,2011-11-27,4,8,10,741.0,,16.95,5.98,35.28
4,12TH OF NEVER,James Patterson and Maxine Paetro,"Little, Brown","One week after the birth of her baby, Detectiv...",http://www.amazon.com/12th-Never-Womens-Murder...,7,2013-05-19,2013-06-30,6,8,1,3789.0,4.5,0.0,9.99,inf
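
The last row above shows a NYT_price of 0.0, which makes the price percentage come out as inf. If a list price of 0 is taken to mean "price unknown" (an assumption on our part, not something the dataset documents), it could be treated as missing before dividing. A minimal sketch using the first and last rows shown above:

```python
import pandas as pd

prices = pd.DataFrame({
    "NYT_price": [27.99, 0.0],
    "Amazon_price": [9.99, 9.99],
})

# Treat a list price of 0 as missing so the ratio becomes NaN instead of inf
list_price = prices["NYT_price"].replace(0, float("nan"))
prices["Amazon_price_as%_NYT_price"] = (prices["Amazon_price"] / list_price * 100).round(2)
```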


In [29]:
output_table.count()

title                         2329
author                        2329
publisher                     2329
description                   2329
Amazon_url                    2329
times_listed                  2329
first_date_listed             2329
last_date_listed              2329
weeks_on_list                 2329
worst_rank                    2329
best_rank                     2329
Amazon_reviews                1989
Amazon_rating                 1975
NYT_price                     2329
Amazon_price                  1809
Amazon_price_as%_NYT_price    1808
dtype: int64

In [30]:
output_table.to_csv('Output/output_table_final_version.csv')

# LOAD into MongoDB

### Why MongoDB?

The original Kaggle dataset contains over 2300 unique Amazon URLs, yet we could not scrape all of the info we sought from Amazon for every one of them. Each team member scraped on her own using requests, but iterated in different ways (see the scraping files listed under EXTRACT, Step 2). We then merged our scraped data and ended up with a DataFrame with a number of NaN values in certain fields. Since we wanted to keep all of the info for future app development, and one field is a long string (the book description), we thought MongoDB would offer the best flexibility to store our data.
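
Because documents in a MongoDB collection do not all need the same fields, one option for the NaN values is to leave those keys out of a document instead of storing NaN. The sketch below shows the idea on a single record in plain Python; `drop_missing` is a hypothetical helper, not part of pymongo, and the record is abridged from our data.

```python
import math

def drop_missing(record):
    """Return a copy of the record without keys whose value is None or NaN."""
    return {
        key: value
        for key, value in record.items()
        if value is not None
        and not (isinstance(value, float) and math.isnan(value))
    }

# A book for which the rating could not be scraped
record = {
    "title": "1225 CHRISTMAS TREE LANE",
    "Amazon_reviews": 741.0,
    "Amazon_rating": float("nan"),
}
clean = drop_missing(record)
```

It could be applied to each dictionary in a list of records before they are inserted into the collection.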

In [31]:
# Import dependencies needed
import pymongo

In [32]:
# Initialize PyMongo to work with MongoDB

conn = 'mongodb://localhost:27017'
client = pymongo.MongoClient(conn)

# Check if database exists already and drop it if so

dblist = client.list_database_names()
if "nyt_bestsellers" in dblist:
    client.drop_database("nyt_bestsellers")

In [33]:
# Define database and collection

db = client.nyt_bestsellers
collection = db.bestsellers

In [34]:
bestsellers_df = pd.read_csv("Output/output_table_final_version.csv")
bestsellers_df.head()

Unnamed: 0.1,Unnamed: 0,title,author,publisher,description,Amazon_url,times_listed,first_date_listed,last_date_listed,weeks_on_list,worst_rank,best_rank,Amazon_reviews,Amazon_rating,NYT_price,Amazon_price,Amazon_price_as%_NYT_price
0,0,10TH ANNIVERSARY,James Patterson and Maxine Paetro,"Little, Brown",Detective Lindsay Boxer’s long-awaited wedding...,http://www.amazon.com/10th-Anniversary-Womens-...,10,2011-05-22,2011-07-24,8,9,16,1365.0,4.6,27.99,9.99,35.69
1,1,11/22/63,Stephen King,Scribner,An English teacher travels back to 1958 by way...,http://www.amazon.com/11-22-63-A-Novel/dp/1451...,23,2011-11-27,2012-05-13,9,6,1,26427.0,4.5,35.0,10.99,31.4
2,2,11TH HOUR,James Patterson and Maxine Paetro,"Little, Brown","When a millionaire is gunned down, Detective L...",http://www.amazon.com/11th-Hour-Womens-Murder-...,10,2012-05-27,2012-07-29,8,6,1,1858.0,4.6,27.99,9.99,35.69
3,3,1225 CHRISTMAS TREE LANE,Debbie Macomber,Mira,Puppies that need good homes and an ex-husband...,http://www.amazon.com/1225-Christmas-Tree-Lane...,7,2011-10-16,2011-11-27,4,8,10,741.0,,16.95,5.98,35.28
4,4,12TH OF NEVER,James Patterson and Maxine Paetro,"Little, Brown","One week after the birth of her baby, Detectiv...",http://www.amazon.com/12th-Never-Womens-Murder...,7,2013-05-19,2013-06-30,6,8,1,3789.0,4.5,0.0,9.99,inf


In [35]:
# Create a list of dictionaries that hold the mongoDB documents to be inserted

post = bestsellers_df.to_dict(orient='records')
print(f"Posting {len(post)} documents into collection of bestsellers inside nyt_bestsellers mongo database...")

# Insert the list of documents into the database

collection.insert_many(post)

Posting 2329 documents into collection of bestsellers inside nyt_bestsellers mongo database...


<pymongo.results.InsertManyResult at 0x1217b1988>

In [36]:
# Verify results

results = collection.find()
for result in results:
    print(result)

{'_id': ObjectId('5c85a280bd7b6a02852b8e6e'), 'Unnamed: 0': 0, 'title': '10TH ANNIVERSARY', 'author': 'James Patterson and Maxine Paetro', 'publisher': 'Little, Brown', 'description': 'Detective Lindsay Boxer’s long-awaited wedding celebration becomes a distant memory when the Women’s Murder Club is called in to find a missing baby.', 'Amazon_url': 'http://www.amazon.com/10th-Anniversary-Womens-Murder-Club/dp/1455511463?tag=NYTBS-20', 'times_listed': 10, 'first_date_listed': '2011-05-22', 'last_date_listed': '2011-07-24', 'weeks_on_list': 8, 'worst_rank': 9, 'best_rank': 16, 'Amazon_reviews': 1365.0, 'Amazon_rating': 4.6, 'NYT_price': 27.99, 'Amazon_price': 9.99, 'Amazon_price_as%_NYT_price': 35.69}
{'_id': ObjectId('5c85a280bd7b6a02852b8e6f'), 'Unnamed: 0': 1, 'title': '11/22/63', 'author': 'Stephen King', 'publisher': 'Scribner', 'description': 'An English teacher travels back to 1958 by way of a time portal in a Maine diner. His assignment: Stop Lee Harvey Oswald.', 'Amazon_url': 'h

{'_id': ObjectId('5c85a280bd7b6a02852b9544'), 'Unnamed: 0': 1750, 'title': 'THE SNOWMAN', 'author': 'Jo Nesbo', 'publisher': 'Knopf', 'description': 'The Oslo detective Harry Hole searches for a serial killer who builds snowmen outside the homes of his victims.', 'Amazon_url': 'http://www.amazon.com/The-Snowman-Harry-Hole-Novel-ebook/dp/B004G5ZY7E?tag=NYTBS-20', 'times_listed': 5, 'first_date_listed': '2011-05-29', 'last_date_listed': '2011-06-26', 'weeks_on_list': 4, 'worst_rank': 9, 'best_rank': 10, 'Amazon_reviews': 1305.0, 'Amazon_rating': 4.1, 'NYT_price': 25.95, 'Amazon_price': 9.99, 'Amazon_price_as%_NYT_price': 38.5}
{'_id': ObjectId('5c85a280bd7b6a02852b9545'), 'Unnamed: 0': 1751, 'title': 'THE SON', 'author': 'Philipp Meyer', 'publisher': 'Ecco/HarperCollins', 'description': 'More than 150 years in a Texas family, from Comanche raids to the present, and its rise to money and power in the cattle and oil industries.', 'Amazon_url': 'http://www.amazon.com/The-Son-Philipp-Meyer/d

## What can we do with this data?

We could use this data along two axes:

##### ANALYSIS AND VISUALIZATION:
We can look for statistically significant correlations between some of our variables. For example, we could build visualizations that examine the relationship between the number of reviews and ratings, the number of reviews and Amazon discounts, weeks on the list and price discount, etc.

##### BUILD AN APPLICATION FOR BOOKWORMS(!):
- This data could be used to build a Flask app. For example, we could render the data sorting it by best deals, or number of reviews, or blockbusters (the bestsellers that lasted the longest on the list). 
- We could also build queries to search for books by keywords in the description, and render the results via Flask.
- We could also render the results from our previous analysis.
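
The keyword search and "best deals" ideas above can be sketched in plain Python on a list of records. In a Flask app these would probably become MongoDB queries instead, so this only illustrates the logic; the function names are hypothetical and the two records are abridged from our collection.

```python
import math

def search_by_keyword(records, keyword):
    """Return records whose description contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [r for r in records if needle in r.get("description", "").lower()]

def best_deals(records, limit=10):
    """Sort by Amazon price as a percentage of NYT price, lowest first.

    Records with a missing, NaN or infinite percentage are left out.
    """
    field = "Amazon_price_as%_NYT_price"
    priced = [
        r for r in records
        if r.get(field) is not None and math.isfinite(r[field])
    ]
    return sorted(priced, key=lambda r: r[field])[:limit]

books = [
    {"title": "THE SNOWMAN",
     "description": "The Oslo detective Harry Hole searches for a serial killer "
                    "who builds snowmen outside the homes of his victims.",
     "Amazon_price_as%_NYT_price": 38.5},
    {"title": "11/22/63",
     "description": "An English teacher travels back to 1958 by way of a time "
                    "portal in a Maine diner. His assignment: Stop Lee Harvey Oswald.",
     "Amazon_price_as%_NYT_price": 31.4},
]

detective_titles = [b["title"] for b in search_by_keyword(books, "detective")]
deal_titles = [b["title"] for b in best_deals(books)]
```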