# EXTRACT

### STEP 1: Extract data from Kaggle's file 'nyt2.json'

Our first challenge started with the data file 'nyt2.json', which we downloaded from Kaggle. We expected a traditional JSON file (a list of dictionaries or a dictionary of dictionaries), but it is instead in JSON Lines format: one self-contained JSON object per line. We wrote the code below to transform it into a single valid JSON file, which we named 'output.json'.
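To make the one-object-per-line layout concrete, here is a minimal sketch that parses such a file with only the standard library. The helper name `read_json_lines` and the two sample lines are illustrative assumptions, not part of our notebook, which relies on `pd.read_json(..., lines=True)` instead.

```python
import io
import json

def read_json_lines(handle):
    """Parse a JSON Lines stream: one JSON object per non-empty line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records

# Two made-up lines shaped like the records in 'nyt2.json' (illustrative only).
sample = io.StringIO(
    '{"title": "ODD HOURS", "rank": {"$numberInt": "1"}}\n'
    '{"title": "THE HOST", "rank": {"$numberInt": "2"}}\n'
)
records = read_json_lines(sample)

# Serializing the list yields a single valid JSON document (a JSON array).
as_json_array = json.dumps(records)
```

Each line parses on its own, which is why `json.load` on the whole file fails while a line-by-line read succeeds.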

In [26]:
# Import dependencies needed

import json
import pandas as pd
from pprint import pprint

In [27]:
# Load 'nyt2.json' file into dataframe:

raw_nyt = pd.read_json('Resources/nyt2.json', lines=True, orient='columns')
raw_nyt.head()

Unnamed: 0,_id,amazon_product_url,author,bestsellers_date,description,price,published_date,publisher,rank,rank_last_week,title,weeks_on_list
0,{'$oid': '5b4aa4ead3089013507db18b'},http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Dean R Koontz,{'$date': {'$numberLong': '1211587200000'}},"Odd Thomas, who can communicate with the dead,...",{'$numberInt': '27'},{'$date': {'$numberLong': '1212883200000'}},Bantam,{'$numberInt': '1'},{'$numberInt': '0'},ODD HOURS,{'$numberInt': '1'}
1,{'$oid': '5b4aa4ead3089013507db18c'},http://www.amazon.com/The-Host-Novel-Stephenie...,Stephenie Meyer,{'$date': {'$numberLong': '1211587200000'}},Aliens have taken control of the minds and bod...,{'$numberDouble': '25.99'},{'$date': {'$numberLong': '1212883200000'}},"Little, Brown",{'$numberInt': '2'},{'$numberInt': '1'},THE HOST,{'$numberInt': '3'}
2,{'$oid': '5b4aa4ead3089013507db18d'},http://www.amazon.com/Love-Youre-With-Emily-Gi...,Emily Giffin,{'$date': {'$numberLong': '1211587200000'}},A woman's happy marriage is shaken when she en...,{'$numberDouble': '24.95'},{'$date': {'$numberLong': '1212883200000'}},St. Martin's,{'$numberInt': '3'},{'$numberInt': '2'},LOVE THE ONE YOU'RE WITH,{'$numberInt': '2'}
3,{'$oid': '5b4aa4ead3089013507db18e'},http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,{'$date': {'$numberLong': '1211587200000'}},A Massachusetts state investigator and his tea...,{'$numberDouble': '22.95'},{'$date': {'$numberLong': '1212883200000'}},Putnam,{'$numberInt': '4'},{'$numberInt': '0'},THE FRONT,{'$numberInt': '1'}
4,{'$oid': '5b4aa4ead3089013507db18f'},http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,Chuck Palahniuk,{'$date': {'$numberLong': '1211587200000'}},An aging porn queens aims to cap her career by...,{'$numberDouble': '24.95'},{'$date': {'$numberLong': '1212883200000'}},Doubleday,{'$numberInt': '5'},{'$numberInt': '0'},SNUFF,{'$numberInt': '1'}


In [28]:
# Save DataFrame 'raw_nyt' into a json file ('output.json') and load it as 'data' we can now work with:

raw_nyt.to_json(path_or_buf='output.json', orient = "records")
with open('output.json') as file:
    data = json.load(file)
#pprint(data)

In [29]:
# Set up lists to hold the extracted fields:

nyt_ids = []
urls = []
authors = []
bestsellers_dates = []
descriptions = []
prices = []
published_dates = []
publishers = []
ranks = []
ranks_last_week = []
titles = []
weeks_on_lists = []

# Populate the lists:

for item in data:
    nyt_ids.append(item['_id']['$oid'])
    urls.append(item['amazon_product_url'])
    authors.append(item['author'])
    bestsellers_dates.append(item['bestsellers_date']['$date']['$numberLong'])
    descriptions.append(item['description'])
    published_dates.append(item['published_date']['$date']['$numberLong'])
    publishers.append(item['publisher'])
    ranks.append(item['rank']['$numberInt'])
    ranks_last_week.append(item['rank_last_week']['$numberInt'])
    titles.append(item['title'])
    weeks_on_lists.append(item['weeks_on_list']['$numberInt'])
    # The price is stored under either '$numberInt' or '$numberDouble',
    # so check for the correct key name before extracting the price string:
    price_key, = item['price'].keys()
    if price_key in ('$numberInt', '$numberDouble'):
        prices.append(item['price'][price_key])
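As an alternative to indexing each wrapper key by hand, the nested type wrappers could be stripped generically. This is only a sketch: the helper `unwrap` is hypothetical, and it assumes every single-key dictionary whose key starts with `$` is a type wrapper, which holds for the rows shown in the preview above but was not checked against the whole file.

```python
def unwrap(value):
    """Recursively strip type wrappers such as {'$numberInt': '1'}."""
    if isinstance(value, dict):
        if len(value) == 1:
            (key, inner), = value.items()
            if key.startswith('$'):
                return unwrap(inner)
        return {k: unwrap(v) for k, v in value.items()}
    return value

# Record copied from the first row of the preview above (long strings omitted).
record = {
    '_id': {'$oid': '5b4aa4ead3089013507db18b'},
    'bestsellers_date': {'$date': {'$numberLong': '1211587200000'}},
    'price': {'$numberInt': '27'},
    'rank': {'$numberInt': '1'},
    'title': 'ODD HOURS',
}
flat = unwrap(record)
```

The unwrapped values are still strings, so the type conversions in the TRANSFORM section would be needed either way.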

In [30]:
# Create a DataFrame from the lists

bestsellers_dict = {
    "nyt_id": nyt_ids,
    "title": titles,
    "author": authors,
    "url": urls,
    "publisher": publishers,
    "description": descriptions,
    "price": prices,  # named 'price' because the TRANSFORM cells reference bestsellers_data['price']
    "published_date": published_dates,
    "bestseller_date": bestsellers_dates,
    "rank": ranks,
    "rank_last_week": ranks_last_week,
    "weeks_on_list": weeks_on_lists
}
bestsellers_data = pd.DataFrame(bestsellers_dict)
bestsellers_data.head()

Unnamed: 0,author,bestseller_date,description,price,nyt_id,published_date,publisher,rank,rank_last_week,title,url,weeks_on_list
0,Dean R Koontz,1211587200000,"Odd Thomas, who can communicate with the dead,...",27.0,5b4aa4ead3089013507db18b,1212883200000,Bantam,1,0,ODD HOURS,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,1
1,Stephenie Meyer,1211587200000,Aliens have taken control of the minds and bod...,25.99,5b4aa4ead3089013507db18c,1212883200000,"Little, Brown",2,1,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,3
2,Emily Giffin,1211587200000,A woman's happy marriage is shaken when she en...,24.95,5b4aa4ead3089013507db18d,1212883200000,St. Martin's,3,2,LOVE THE ONE YOU'RE WITH,http://www.amazon.com/Love-Youre-With-Emily-Gi...,2
3,Patricia Cornwell,1211587200000,A Massachusetts state investigator and his tea...,22.95,5b4aa4ead3089013507db18e,1212883200000,Putnam,4,0,THE FRONT,http://www.amazon.com/The-Front-Garano-Patrici...,1
4,Chuck Palahniuk,1211587200000,An aging porn queens aims to cap her career by...,24.95,5b4aa4ead3089013507db18f,1212883200000,Doubleday,5,0,SNUFF,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,1


In [25]:
bestsellers_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10195 entries, 0 to 10194
Data columns (total 12 columns):
author             10195 non-null object
bestseller_date    10195 non-null object
description        10195 non-null object
nyt_id             10195 non-null object
price              10195 non-null float64
published_date     10195 non-null object
publisher          10195 non-null object
rank               10195 non-null int64
rank_last_week     10195 non-null int64
title              10195 non-null object
url                10195 non-null object
weeks_on_list      10195 non-null int64
dtypes: float64(1), int64(3), object(8)
memory usage: 955.9+ KB


### STEP 2: Scrape info from amazon.com

In this step, we identify the unique Amazon URLs in our bestsellers_data DataFrame and visit each one to scrape the Amazon offer price, the number of customer reviews, and the average star rating (out of 5).
We encountered our second challenge here, as we had to iterate through our list of Amazon URLs multiple times, repeatedly requesting the info we were seeking. Eventually we were able to scrape info for 1211 of the 2329 URLs identified in our Kaggle dataset.
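The repeated passes over the URL list can be sketched as below. This is not the scraping code we ran: `scrape_in_passes`, `fake_fetch`, and the `example.com` URLs are hypothetical stand-ins. In practice `fetch` would wrap `requests.get(url, headers=headers)` plus BeautifulSoup parsing and return `None` whenever the expected elements are missing.

```python
def scrape_in_passes(urls, fetch, max_passes=3):
    """Retry failed URLs over several passes.

    `fetch` is any callable that returns a dict of scraped fields,
    or None when the page could not be scraped.
    """
    results = {}
    pending = list(urls)
    for _ in range(max_passes):
        still_pending = []
        for url in pending:
            info = fetch(url)
            if info is None:
                still_pending.append(url)  # try again on the next pass
            else:
                results[url] = info
        pending = still_pending
        if not pending:
            break
    return results, pending

# Stand-in for a real request: one URL succeeds on its second attempt,
# the other never succeeds.
attempts = {}

def fake_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url == 'http://example.com/never':
        return None
    return {'price': '$7.99'} if attempts[url] >= 2 else None

scraped, failed = scrape_in_passes(
    ['http://example.com/a', 'http://example.com/never'], fake_fetch
)
```

Keeping the list of URLs that still failed makes it easy to report coverage, such as the 1211 of 2329 figure above.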

In [31]:
# Import dependencies needed

import requests
from bs4 import BeautifulSoup
#from splinter import Browser
#from splinter.exceptions import ElementDoesNotExist
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"}


In [32]:
# Get list of unique amazon urls

amazon_urls = list(dict.fromkeys(urls))
len(amazon_urls)

2329

# TRANSFORM

### STEP 1: Transform data from 'bestsellers_data' DataFrame

All data extracted from the JSON file and read into our bestsellers_data DataFrame has the type 'string'. We now need to transform some of it into the correct data type. Specifically:
- 'price' should have the float type
- 'published_date' and 'bestseller_date' should be formatted as dates
- 'rank', 'rank_last_week', and 'weeks_on_list' should have the integer type
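The conversions listed above can be illustrated on a single record using only the standard library. The helper `ms_to_date` is a hypothetical name, and it assumes the timestamps are Unix times in milliseconds interpreted as UTC, which matches the 13-digit values in the raw data.

```python
from datetime import datetime, timezone

def ms_to_date(ms_string):
    """Convert a Unix timestamp in milliseconds (as a string) to 'YYYY-MM-DD' in UTC."""
    seconds = int(ms_string) / 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime('%Y-%m-%d')

# String values taken from the second row of the raw data shown earlier.
raw = {
    'price': '25.99',
    'published_date': '1212883200000',
    'bestseller_date': '1211587200000',
    'rank': '2',
}

converted = {
    'price': float(raw['price']),
    'published_date': ms_to_date(raw['published_date']),
    'bestseller_date': ms_to_date(raw['bestseller_date']),
    'rank': int(raw['rank']),
}
```

Here `1211587200000` ms is exactly 14023 days after 1970-01-01, which is 2008-05-24.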

In [16]:
# Convert data type of 'price' from string to float:

bestsellers_data['price'] = bestsellers_data['price'].astype(float)

In [17]:
# Convert 'published_date' and 'bestseller_date' from a Unix timestamp (in milliseconds) to a date string:

from datetime import datetime

# x[:10] keeps the first 10 digits, i.e. the timestamp in whole seconds
bestsellers_data['published_date'] = bestsellers_data['published_date'].apply(lambda x : datetime.utcfromtimestamp(int(x[:10])).strftime('%Y-%m-%d'))
bestsellers_data['bestseller_date'] = bestsellers_data['bestseller_date'].apply(lambda x : datetime.utcfromtimestamp(int(x[:10])).strftime('%Y-%m-%d'))


In [18]:
# Convert data type of 'rank', 'rank_last_week', and 'weeks_on_list' to integer:

bestsellers_data['rank'] = bestsellers_data['rank'].astype(int)
bestsellers_data['rank_last_week'] = bestsellers_data['rank_last_week'].astype(int)
bestsellers_data['weeks_on_list'] = bestsellers_data['weeks_on_list'].astype(int)


In [19]:
bestsellers_data.head()

Unnamed: 0,author,bestseller_date,description,nyt_id,price,published_date,publisher,rank,rank_last_week,title,url,weeks_on_list
0,Dean R Koontz,2008-05-24,"Odd Thomas, who can communicate with the dead,...",5b4aa4ead3089013507db18b,27.0,2008-06-08,Bantam,1,0,ODD HOURS,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,1
1,Stephenie Meyer,2008-05-24,Aliens have taken control of the minds and bod...,5b4aa4ead3089013507db18c,25.99,2008-06-08,"Little, Brown",2,1,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,3
2,Emily Giffin,2008-05-24,A woman's happy marriage is shaken when she en...,5b4aa4ead3089013507db18d,24.95,2008-06-08,St. Martin's,3,2,LOVE THE ONE YOU'RE WITH,http://www.amazon.com/Love-Youre-With-Emily-Gi...,2
3,Patricia Cornwell,2008-05-24,A Massachusetts state investigator and his tea...,5b4aa4ead3089013507db18e,22.95,2008-06-08,Putnam,4,0,THE FRONT,http://www.amazon.com/The-Front-Garano-Patrici...,1
4,Chuck Palahniuk,2008-05-24,An aging porn queens aims to cap her career by...,5b4aa4ead3089013507db18f,24.95,2008-06-08,Doubleday,5,0,SNUFF,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,1


In [20]:
# Store published_date range and bestseller_date range
start_published_date = bestsellers_data.published_date.min()
end_published_date = bestsellers_data.published_date.max()
start_bestseller_date = bestsellers_data.bestseller_date.min()
end_bestseller_date = bestsellers_data.bestseller_date.max()
print(f"published_date range: {start_published_date} to {end_published_date}")
print(f"bestseller_date range: {start_bestseller_date} to {end_bestseller_date}")

published_date range: 2008-06-08 to 2018-07-22
bestseller_date range: 2008-05-24 to 2018-07-07


In [21]:
koontz = bestsellers_data[bestsellers_data.author == 'Dean R Koontz']
koontz

Unnamed: 0,author,bestseller_date,description,nyt_id,price,published_date,publisher,rank,rank_last_week,title,url,weeks_on_list
0,Dean R Koontz,2008-05-24,"Odd Thomas, who can communicate with the dead,...",5b4aa4ead3089013507db18b,27.0,2008-06-08,Bantam,1,0,ODD HOURS,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,1
22,Dean R Koontz,2008-05-31,"Odd Thomas, who can communicate with the dead,...",5b4aa4ead3089013507db1a1,27.0,2008-06-15,Bantam,3,1,ODD HOURS,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,2
46,Dean R Koontz,2008-06-07,"Odd Thomas, who can communicate with the dead,...",5b4aa4ead3089013507db1b9,27.0,2008-06-22,Bantam,7,3,ODD HOURS,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,3
67,Dean R Koontz,2008-06-14,"Odd Thomas, who can communicate with the dead,...",5b4aa4ead3089013507db1ce,27.0,2008-06-29,Bantam,8,7,ODD HOURS,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,4
91,Dean R Koontz,2008-06-21,"Odd Thomas, who can communicate with the dead,...",5b4aa4ead3089013507db1e6,27.0,2008-07-06,Bantam,12,8,ODD HOURS,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,5
113,Dean R Koontz,2008-06-28,"Odd Thomas, who can communicate with the dead,...",5b4aa4ead3089013507db1fc,27.0,2008-07-13,Bantam,14,12,ODD HOURS,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,6
139,Dean R Koontz,2008-07-05,"Odd Thomas, who can communicate with the dead,...",5b4aa4ead3089013507db216,0.0,2008-07-20,Bantam,20,0,ODD HOURS,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,0


In [22]:
meyer = bestsellers_data[bestsellers_data.author == 'Stephenie Meyer']
meyer

Unnamed: 0,author,bestseller_date,description,nyt_id,price,published_date,publisher,rank,rank_last_week,title,url,weeks_on_list
1,Stephenie Meyer,2008-05-24,Aliens have taken control of the minds and bod...,5b4aa4ead3089013507db18c,25.99,2008-06-08,"Little, Brown",2,1,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,3
21,Stephenie Meyer,2008-05-31,Aliens have taken control of the minds and bod...,5b4aa4ead3089013507db1a0,25.99,2008-06-15,"Little, Brown",2,2,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,4
41,Stephenie Meyer,2008-06-07,Aliens have taken control of the minds and bod...,5b4aa4ead3089013507db1b4,25.99,2008-06-22,"Little, Brown",2,2,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,5
62,Stephenie Meyer,2008-06-14,Aliens have taken control of the minds and bod...,5b4aa4ead3089013507db1c9,25.99,2008-06-29,"Little, Brown",3,2,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,6
82,Stephenie Meyer,2008-06-21,Aliens have taken control of the minds and bod...,5b4aa4ead3089013507db1dd,25.99,2008-07-06,"Little, Brown",3,3,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,7
104,Stephenie Meyer,2008-06-28,Aliens have taken control of the minds and bod...,5b4aa4ead3089013507db1f3,25.99,2008-07-13,"Little, Brown",5,3,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,8
125,Stephenie Meyer,2008-07-05,Aliens have taken control of the minds and bod...,5b4aa4ead3089013507db208,25.99,2008-07-20,"Little, Brown",6,5,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,9
146,Stephenie Meyer,2008-07-12,Aliens have taken control of the minds and bod...,5b4aa4ead3089013507db21d,25.99,2008-07-27,"Little, Brown",7,6,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,10
163,Stephenie Meyer,2008-07-19,Aliens have taken control of the minds and bod...,5b4aa4ead3089013507db22e,25.99,2008-08-03,"Little, Brown",4,7,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,11
182,Stephenie Meyer,2008-07-26,Aliens have taken control of the minds and bod...,5b4aa4ead3089013507db241,25.99,2008-08-10,"Little, Brown",3,4,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,12


In [13]:
# Load the scraped Amazon data (url, reviews, rating, price) from CSV:
Amazon_data=pd.read_csv("Output/join_data.csv")
Amazon_data.head()

Unnamed: 0.1,Unnamed: 0,url,reviews,rating,price
0,0,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,237 customer reviews,3.5 out of 5 stars,$9.99
1,20,http://www.amazon.com/The-Whole-Truth-David-Ba...,"1,064 customer reviews",4.4 out of 5 stars,$9.99
2,24,http://www.amazon.com/Story-Edgar-Sawtelle-Dav...,"2,573 customer reviews",3.8 out of 5 stars,$8.99
3,288,http://www.amazon.com/The-Quilters-Kitchen-Qui...,218 customer reviews,3.6 out of 5 stars,$10.99
4,289,http://www.amazon.com/Testimony-A-Novel-Anita-...,408 customer reviews,3.9 out of 5 stars,$7.99


In [14]:
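# Inner-join the NYT bestseller rows with the scraped Amazon data on 'url'.
# Both frames have a 'price' column, so pandas suffixes them as 'price_x' (NYT) and 'price_y' (Amazon):
join=pd.merge(bestsellers_data, Amazon_data,on='url',how='inner')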
join

Unnamed: 0.1,nyt_id,title,author,url,publisher,description,price_x,published_date,bestseller_date,rank,rank_last_week,weeks_on_list,Unnamed: 0,reviews,rating,price_y
0,5b4aa4ead3089013507db18b,ODD HOURS,Dean R Koontz,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Bantam,"Odd Thomas, who can communicate with the dead,...",27.00,2008-06-08,2008-05-24,1,0,1,8319,920 customer reviews,4.4 out of 5 stars,$7.99
1,5b4aa4ead3089013507db1a1,ODD HOURS,Dean R Koontz,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Bantam,"Odd Thomas, who can communicate with the dead,...",27.00,2008-06-15,2008-05-31,3,1,2,8319,920 customer reviews,4.4 out of 5 stars,$7.99
2,5b4aa4ead3089013507db1b9,ODD HOURS,Dean R Koontz,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Bantam,"Odd Thomas, who can communicate with the dead,...",27.00,2008-06-22,2008-06-07,7,3,3,8319,920 customer reviews,4.4 out of 5 stars,$7.99
3,5b4aa4ead3089013507db1ce,ODD HOURS,Dean R Koontz,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Bantam,"Odd Thomas, who can communicate with the dead,...",27.00,2008-06-29,2008-06-14,8,7,4,8319,920 customer reviews,4.4 out of 5 stars,$7.99
4,5b4aa4ead3089013507db1e6,ODD HOURS,Dean R Koontz,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Bantam,"Odd Thomas, who can communicate with the dead,...",27.00,2008-07-06,2008-06-21,12,8,5,8319,920 customer reviews,4.4 out of 5 stars,$7.99
5,5b4aa4ead3089013507db1fc,ODD HOURS,Dean R Koontz,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Bantam,"Odd Thomas, who can communicate with the dead,...",27.00,2008-07-13,2008-06-28,14,12,6,8319,920 customer reviews,4.4 out of 5 stars,$7.99
6,5b4aa4ead3089013507db216,ODD HOURS,Dean R Koontz,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Bantam,"Odd Thomas, who can communicate with the dead,...",0.00,2008-07-20,2008-07-05,20,0,0,8319,920 customer reviews,4.4 out of 5 stars,$7.99
7,5b4aa4ead3089013507db18c,THE HOST,Stephenie Meyer,http://www.amazon.com/The-Host-Novel-Stephenie...,"Little, Brown",Aliens have taken control of the minds and bod...,25.99,2008-06-08,2008-05-24,2,1,3,309,"6,109 customer reviews",4.5 out of 5 stars,$7.99
8,5b4aa4ead3089013507db1a0,THE HOST,Stephenie Meyer,http://www.amazon.com/The-Host-Novel-Stephenie...,"Little, Brown",Aliens have taken control of the minds and bod...,25.99,2008-06-15,2008-05-31,2,2,4,309,"6,109 customer reviews",4.5 out of 5 stars,$7.99
9,5b4aa4ead3089013507db1b4,THE HOST,Stephenie Meyer,http://www.amazon.com/The-Host-Novel-Stephenie...,"Little, Brown",Aliens have taken control of the minds and bod...,25.99,2008-06-22,2008-06-07,2,2,5,309,"6,109 customer reviews",4.5 out of 5 stars,$7.99


In [15]:
# Keep only the first row for each URL, so each Amazon page appears once:
clean_data=join.drop_duplicates(subset='url', keep='first', inplace=False)
clean_data

Unnamed: 0.1,nyt_id,title,author,url,publisher,description,price_x,published_date,bestseller_date,rank,rank_last_week,weeks_on_list,Unnamed: 0,reviews,rating,price_y
0,5b4aa4ead3089013507db18b,ODD HOURS,Dean R Koontz,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Bantam,"Odd Thomas, who can communicate with the dead,...",27.00,2008-06-08,2008-05-24,1,0,1,8319,920 customer reviews,4.4 out of 5 stars,$7.99
7,5b4aa4ead3089013507db18c,THE HOST,Stephenie Meyer,http://www.amazon.com/The-Host-Novel-Stephenie...,"Little, Brown",Aliens have taken control of the minds and bod...,25.99,2008-06-08,2008-05-24,2,1,3,309,"6,109 customer reviews",4.5 out of 5 stars,$7.99
75,5b4aa4ead3089013507db18d,LOVE THE ONE YOU'RE WITH,Emily Giffin,http://www.amazon.com/Love-Youre-With-Emily-Gi...,St. Martin's,A woman's happy marriage is shaken when she en...,24.95,2008-06-08,2008-05-24,3,2,2,8347,702 customer reviews,4.0 out of 5 stars,$8.99
89,5b4aa4ead3089013507db18e,THE FRONT,Patricia Cornwell,http://www.amazon.com/The-Front-Garano-Patrici...,Putnam,A Massachusetts state investigator and his tea...,22.95,2008-06-08,2008-05-24,4,0,1,8383,323 customer reviews,3.0 out of 5 stars,$7.99
92,5b4aa4ead3089013507db18f,SNUFF,Chuck Palahniuk,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,Doubleday,An aging porn queens aims to cap her career by...,24.95,2008-06-08,2008-05-24,5,0,1,0,237 customer reviews,3.5 out of 5 stars,$9.99
97,5b4aa4ead3089013507db190,SUNDAYS AT TIFFANY’S,James Patterson and Gabrielle Charbonnet,http://www.amazon.com/Sundays-at-Tiffanys-Jame...,"Little, Brown",A woman finds an unexpected love,24.99,2008-06-08,2008-05-24,6,3,4,8389,478 customer reviews,4.0 out of 5 stars,$6.99
103,5b4aa4ead3089013507db191,PHANTOM PREY,John Sandford,http://www.amazon.com/Phantom-Prey-John-Sandfo...,Putnam,The Minneapolis detective Lucas Davenport inve...,26.95,2008-06-08,2008-05-24,7,4,3,8419,435 customer reviews,4.2 out of 5 stars,$9.99
107,5b4aa4ead3089013507db192,SWINE NOT?,Jimmy Buffett,http://www.amazon.com/From-Worse-Southern-Vamp...,"Little, Brown",A Southern family tries to hide its pet pig at...,21.99,2008-06-08,2008-05-24,8,6,2,8431,685 customer reviews,4.4 out of 5 stars,$7.99
116,5b4aa4ead3089013507db193,CARELESS IN RED,Elizabeth George,http://www.amazon.com/Where-Are-You-Now-Novel/...,Harper,"In Cornwall, trying to recover from his wife's...",27.95,2008-06-08,2008-05-24,9,8,3,8435,621 customer reviews,4.4 out of 5 stars,$8.99
119,5b4aa4ead3089013507db194,THE WHOLE TRUTH,David Baldacci,http://www.amazon.com/The-Whole-Truth-David-Ba...,Grand Central,An intelligence agent and a journalist team up...,26.99,2008-06-08,2008-05-24,10,7,5,20,"1,064 customer reviews",4.4 out of 5 stars,$9.99


In [16]:
# Give the two price columns and the URL column descriptive names:
clean_data=clean_data.rename(columns={"price_x":"NYT_Price","price_y":"Amazon_Price",'url':'Amazon_url'})
clean_data.head()

Unnamed: 0.1,nyt_id,title,author,Amazon_url,publisher,description,NYT_Price,published_date,bestseller_date,rank,rank_last_week,weeks_on_list,Unnamed: 0,reviews,rating,Amazon_Price
0,5b4aa4ead3089013507db18b,ODD HOURS,Dean R Koontz,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Bantam,"Odd Thomas, who can communicate with the dead,...",27.0,2008-06-08,2008-05-24,1,0,1,8319,920 customer reviews,4.4 out of 5 stars,$7.99
7,5b4aa4ead3089013507db18c,THE HOST,Stephenie Meyer,http://www.amazon.com/The-Host-Novel-Stephenie...,"Little, Brown",Aliens have taken control of the minds and bod...,25.99,2008-06-08,2008-05-24,2,1,3,309,"6,109 customer reviews",4.5 out of 5 stars,$7.99
75,5b4aa4ead3089013507db18d,LOVE THE ONE YOU'RE WITH,Emily Giffin,http://www.amazon.com/Love-Youre-With-Emily-Gi...,St. Martin's,A woman's happy marriage is shaken when she en...,24.95,2008-06-08,2008-05-24,3,2,2,8347,702 customer reviews,4.0 out of 5 stars,$8.99
89,5b4aa4ead3089013507db18e,THE FRONT,Patricia Cornwell,http://www.amazon.com/The-Front-Garano-Patrici...,Putnam,A Massachusetts state investigator and his tea...,22.95,2008-06-08,2008-05-24,4,0,1,8383,323 customer reviews,3.0 out of 5 stars,$7.99
92,5b4aa4ead3089013507db18f,SNUFF,Chuck Palahniuk,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,Doubleday,An aging porn queens aims to cap her career by...,24.95,2008-06-08,2008-05-24,5,0,1,0,237 customer reviews,3.5 out of 5 stars,$9.99


In [17]:
# Strip the trailing text so that only the number remains (still a string at this point):
clean_data['Amazon_reviews']=clean_data['reviews'].str.replace('customer reviews', '', regex=True)
clean_data['Amazon_rating']=clean_data['rating'].str.replace('out of 5 stars', '', regex=True)

In [18]:
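#clean_data["Amazon_Price"] = pd.to_numeric(clean_data["Amazon_Price"])
# Remove '$' and ',' from 'Amazon_Price' and convert it to float (raw string avoids an invalid-escape warning):
clean_data['Amazon_Price']=clean_data['Amazon_Price'].replace(r'[\$,]', '', regex=True).astype(float)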
clean_data.dtypes

nyt_id              object
title               object
author              object
Amazon_url          object
publisher           object
description         object
NYT_Price          float64
published_date      object
bestseller_date     object
rank                 int64
rank_last_week       int64
weeks_on_list        int64
Unnamed: 0           int64
reviews             object
rating              object
Amazon_Price       float64
Amazon_reviews      object
Amazon_rating       object
dtype: object

In [19]:
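# Note: despite its name, this column holds the Amazon price as a percentage of the NYT price
# (e.g. 29.59 means the Amazon price is 29.59% of the NYT price), not the percentage saved:
clean_data["Amazon_Discount%"]=(clean_data["Amazon_Price"]/clean_data["NYT_Price"]*100).round(2)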
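As a worked example of this arithmetic, the sketch below computes both the price ratio and the corresponding discount for the first row of the data (NYT price 27.00, Amazon price 7.99). The function names are hypothetical, and the zero-price guard is an added assumption: in pandas, dividing by a NYT price of 0 yields `inf` instead of raising.

```python
def price_ratio_pct(amazon_price, list_price):
    """Amazon price as a percentage of the list price."""
    if not list_price:  # a list price of 0 (or None) cannot be used as a divisor
        return None
    return round(amazon_price / list_price * 100, 2)

def discount_pct(amazon_price, list_price):
    """Percentage saved relative to the list price."""
    ratio = price_ratio_pct(amazon_price, list_price)
    return None if ratio is None else round(100 - ratio, 2)

ratio = price_ratio_pct(7.99, 27.00)
saved = discount_pct(7.99, 27.00)
missing = price_ratio_pct(7.99, 0.0)
```

For this row the ratio is 29.59, so the actual discount is about 70.41 percent.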

In [20]:
# Remove thousands separators, then convert to a number (unparseable values become NaN):
clean_data['Amazon_reviews']=clean_data['Amazon_reviews'].str.replace(',', '', regex=True)
clean_data['Amazon_reviews']=pd.to_numeric(clean_data['Amazon_reviews'], errors='coerce')

In [21]:
# Convert the rating to a number (unparseable values become NaN):
clean_data['Amazon_rating']=pd.to_numeric(clean_data['Amazon_rating'], errors='coerce')
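The string clean-up in the preceding cells can also be expressed per value with regular expressions. This is only a sketch with hypothetical helper names. It assumes the scraped strings always follow the patterns seen in the sample rows ('1,064 customer reviews', '4.4 out of 5 stars', '$9.99').

```python
import re

def parse_reviews(text):
    """'1,064 customer reviews' -> 1064, or None if no number is found."""
    match = re.search(r'\d[\d,]*', text)
    return int(match.group().replace(',', '')) if match else None

def parse_rating(text):
    """'4.4 out of 5 stars' -> 4.4, or None if the pattern does not match."""
    match = re.match(r'\s*(\d+(?:\.\d+)?)\s+out of 5 stars', text)
    return float(match.group(1)) if match else None

def parse_price(text):
    """'$9.99' -> 9.99"""
    return float(text.replace('$', '').replace(',', ''))

reviews = parse_reviews('1,064 customer reviews')
rating = parse_rating('4.4 out of 5 stars')
price = parse_price('$9.99')
```

Returning `None` for unmatched strings plays the same role as `errors='coerce'` in the pandas version, where unparseable values become NaN.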

In [22]:
clean_data.dtypes

nyt_id               object
title                object
author               object
Amazon_url           object
publisher            object
description          object
NYT_Price           float64
published_date       object
bestseller_date      object
rank                  int64
rank_last_week        int64
weeks_on_list         int64
Unnamed: 0            int64
reviews              object
rating               object
Amazon_Price        float64
Amazon_reviews      float64
Amazon_rating       float64
Amazon_Discount%    float64
dtype: object

In [23]:
# Select and order the final columns, dropping the raw 'reviews'/'rating' strings and the leftover 'Unnamed: 0' column:
output_table=clean_data[['nyt_id','title','Amazon_url','author','publisher','description','published_date','bestseller_date','rank','rank_last_week','weeks_on_list','NYT_Price','Amazon_Price','Amazon_reviews','Amazon_rating','Amazon_Discount%']].reset_index(drop=True)
output_table

Unnamed: 0,nyt_id,title,Amazon_url,author,publisher,description,published_date,bestseller_date,rank,rank_last_week,weeks_on_list,NYT_Price,Amazon_Price,Amazon_reviews,Amazon_rating,Amazon_Discount%
0,5b4aa4ead3089013507db18b,ODD HOURS,http://www.amazon.com/Odd-Hours-Dean-Koontz/dp...,Dean R Koontz,Bantam,"Odd Thomas, who can communicate with the dead,...",2008-06-08,2008-05-24,1,0,1,27.00,7.99,920.0,4.4,29.590000
1,5b4aa4ead3089013507db18c,THE HOST,http://www.amazon.com/The-Host-Novel-Stephenie...,Stephenie Meyer,"Little, Brown",Aliens have taken control of the minds and bod...,2008-06-08,2008-05-24,2,1,3,25.99,7.99,6109.0,4.5,30.740000
2,5b4aa4ead3089013507db18d,LOVE THE ONE YOU'RE WITH,http://www.amazon.com/Love-Youre-With-Emily-Gi...,Emily Giffin,St. Martin's,A woman's happy marriage is shaken when she en...,2008-06-08,2008-05-24,3,2,2,24.95,8.99,702.0,4.0,36.030000
3,5b4aa4ead3089013507db18e,THE FRONT,http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,Putnam,A Massachusetts state investigator and his tea...,2008-06-08,2008-05-24,4,0,1,22.95,7.99,323.0,3.0,34.810000
4,5b4aa4ead3089013507db18f,SNUFF,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,Chuck Palahniuk,Doubleday,An aging porn queens aims to cap her career by...,2008-06-08,2008-05-24,5,0,1,24.95,9.99,237.0,3.5,40.040000
5,5b4aa4ead3089013507db190,SUNDAYS AT TIFFANY’S,http://www.amazon.com/Sundays-at-Tiffanys-Jame...,James Patterson and Gabrielle Charbonnet,"Little, Brown",A woman finds an unexpected love,2008-06-08,2008-05-24,6,3,4,24.99,6.99,478.0,4.0,27.970000
6,5b4aa4ead3089013507db191,PHANTOM PREY,http://www.amazon.com/Phantom-Prey-John-Sandfo...,John Sandford,Putnam,The Minneapolis detective Lucas Davenport inve...,2008-06-08,2008-05-24,7,4,3,26.95,9.99,435.0,4.2,37.070000
7,5b4aa4ead3089013507db192,SWINE NOT?,http://www.amazon.com/From-Worse-Southern-Vamp...,Jimmy Buffett,"Little, Brown",A Southern family tries to hide its pet pig at...,2008-06-08,2008-05-24,8,6,2,21.99,7.99,685.0,4.4,36.330000
8,5b4aa4ead3089013507db193,CARELESS IN RED,http://www.amazon.com/Where-Are-You-Now-Novel/...,Elizabeth George,Harper,"In Cornwall, trying to recover from his wife's...",2008-06-08,2008-05-24,9,8,3,27.95,8.99,621.0,4.4,32.160000
9,5b4aa4ead3089013507db194,THE WHOLE TRUTH,http://www.amazon.com/The-Whole-Truth-David-Ba...,David Baldacci,Grand Central,An intelligence agent and a journalist team up...,2008-06-08,2008-05-24,10,7,5,26.99,9.99,1064.0,4.4,37.010000


In [24]:
output_table.to_csv('output_table_final_version.csv')