# Milestone 2

In [1]:
from ada import data
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Global data description

We chose to use only the Amazon datasets relevant to our project rather than the full dump. For simplicity, we also selected the files from which duplicate item reviews have been removed. We used the following files:
* Reviews and metadata for Books
* Reviews and metadata for Movies and TV

We decided not to use the reviews for Amazon Instant Video because a large majority of its products are not linked to a title. We noticed that the Amazon Product API could fill this gap, but we think the first two categories give us enough data.

Our code handles both plain and compressed JSON files (`.json` and `.json.gz`), so the archives do not have to be extracted if storage is limited.
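The `data.read_data` helper from our `ada` module is not reproduced in this notebook. As a rough illustration only, a reader that accepts both file types could look like the sketch below. The name `iter_json_lines` and its `limit` parameter are hypothetical, and the sketch assumes each line of the file is one valid JSON object.

```python
import gzip
import json
from itertools import islice


def iter_json_lines(path, limit=None):
    """Yield one parsed record per line from a .json or .json.gz file.

    Hypothetical stand-in for data.read_data; assumes strict JSON lines.
    """
    # gzip.open in text mode decompresses on the fly, so the archive
    # never has to be extracted to disk.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        # islice stops after `limit` lines (or reads everything if None),
        # so only one record is held in memory at a time.
        for line in islice(handle, limit):
            yield json.loads(line)
```

Because the function is a generator, the same loop can preview the first few records or stream through the whole file.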

We worked on our own PCs because the data is small enough (about 8 GB for the compressed files and 25 GB for the plain files).

### Books reviews

* Number of lines : `wc -l reviews_Books.json` gives `22507155 reviews_Books.json`, which is consistent with the number given on the source website

In [2]:
books_reviews_lines = data.read_data("reviews_Books", 5)
df = {}
i=0
for book_line in books_reviews_lines:
    df[i]=book_line
    i+=1
pd.DataFrame.from_dict(df,orient='index')

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,AH2L9G3DQHHAJ,116,chris,"[5, 5]",Interesting Grisham tale of a lawyer that take...,4.0,Show me the money!,1019865600,"04 27, 2002"
1,A2IIIDRK3PRRZY,116,Helene,"[0, 0]",The thumbnail is a shirt. The product shown i...,1.0,Listing is all screwed up,1395619200,"03 24, 2014"
2,A1TADCM7YWPQ8M,868,Joel@AWS,"[10, 10]",I'll be honest. I work for a large online reta...,4.0,Not a Bad Translation,1031702400,"09 11, 2002"
3,AWGH7V0BDOJKB,13714,Barbara Marshall,"[0, 0]",It had all the songs I wanted but I had ordere...,4.0,Not the large print,1383177600,"10 31, 2013"
4,A3UTQPQPM4TQO0,13714,betty burnett,"[0, 0]","We have many of the old, old issue. But the nu...",5.0,I was disappointed that you would only allow m...,1374883200,"07 27, 2013"


Basic description of features : 
* reviewerID : ID of the reviewer
* asin : ID of the product, will be used to match metadata
* reviewerName : name of the reviewer
* helpful : helpfulness rating of the review as [helpful votes, total votes], for example 5/5 in the first row
* reviewText : Content of the review
* overall : rating of the product
* summary : summary of the review
* unixReviewTime : Unix time of the review
* reviewTime : date of the review in "month day, year" format
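To make the `helpful` field concrete, here is a minimal sketch of how a pair such as `[5, 5]` can be turned into a ratio. The helper name `helpful_ratio` is ours for illustration; it returns `None` for reviews that nobody voted on (`[0, 0]`) so that they can be counted separately.

```python
def helpful_ratio(helpful):
    """Return the share of helpful votes, or None when nobody voted."""
    positive, total = helpful
    if total == 0:
        # A 0/0 review has no helpfulness rating at all.
        return None
    return positive / total
```

For the previewed rows, `[5, 5]` and `[10, 10]` both give 1.0, while `[0, 0]` gives `None`.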


In [10]:
#Warning Long operation
books_reviews_lines = data.read_data("reviews_Books", 22507155)
distinct_reviewers = set()
distinct_asin = set()
totalHelpful = 0
helpfulCount = 0
helpfulZero = 0
overallTotal = 0
overallCount = 0
minTime = 2000000000
maxTime = 0
for book_line in books_reviews_lines:
    distinct_reviewers.add(book_line["reviewerID"])
    distinct_asin.add(book_line["asin"])
    helpful = book_line["helpful"]
    if helpful[0] == 0 and helpful[1] == 0:
        helpfulZero += 1
    else:
        totalHelpful += helpful[0] / helpful[1]
        helpfulCount += 1

    overallTotal += book_line["overall"]
    overallCount += 1
    unixTime = book_line["unixReviewTime"]
    maxTime = unixTime if unixTime > maxTime else maxTime
    minTime = unixTime if unixTime < minTime else minTime


KeyboardInterrupt: 

Results : 
* Number of distinct reviewers : 8026324 (35.66% of total) : about 2.8 reviews per reviewer on average
* Number of distinct products : 2330066 (10.35% of total) : about 9.66 reviews per product on average
* Helpful mean (without 0/0) : 0.7288443590389628
* Unrated helpfulness (0/0) : 10473154 (46.53% of total)
* Overall mean : 4.29575892643917
* Max Unix time : 1406073600 : Wednesday, July 23, 2014
* Min Unix time : 832550400 : Monday, May 20, 1996
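The derived figures in this list follow directly from the raw counts. The short sketch below recomputes the per-reviewer and per-product averages and converts the Unix timestamps to calendar dates. The helper `unix_to_date` is a name introduced here for illustration, and it assumes the timestamps are expressed in UTC.

```python
from datetime import datetime, timezone

total_reviews = 22507155
distinct_reviewers = 8026324
distinct_products = 2330066

# Average number of reviews written per reviewer and received per product.
reviews_per_reviewer = total_reviews / distinct_reviewers
reviews_per_product = total_reviews / distinct_products


def unix_to_date(timestamp):
    """Convert a Unix timestamp (seconds) to a calendar date, assuming UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).date()


latest_review = unix_to_date(1406073600)
earliest_review = unix_to_date(832550400)
```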

We can already see that the mean overall rating is high (about 4.3 out of 5).

### Books metadata

* Number of lines : `wc -l meta_Books.json` gives `2370585 meta_Books.json`

In [8]:
books_meta_lines = data.read_data("meta_Books", 6)
df = {}
i=0
for book_line in books_meta_lines:
    df[i]=book_line
    i+=1
pd.DataFrame.from_dict(df,orient='index')

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related
0,1048791,{'Books': 6334800},http://ecx.images-amazon.com/images/I/51MKP0T4...,[[Books]],"The Crucible: Performed by Stuart Pankin, Jero...",,,
1,1048775,{'Books': 13243226},http://ecx.images-amazon.com/images/I/5166EBHD...,[[Books]],Measure for Measure: Complete &amp; Unabridged,William Shakespeare is widely regarded as the ...,,
2,1048236,{'Books': 8973864},http://ecx.images-amazon.com/images/I/51DH145C...,[[Books]],The Sherlock Holmes Audio Collection,"&#34;One thing is certain, Sherlockians, put a...",9.26,"{'also_viewed': ['1442300191', '9626349786', '..."
3,401048,{'Books': 6448843},http://ecx.images-amazon.com/images/I/41bchvIf...,[[Books]],The rogue of publishers' row;: Confessions of ...,,,{'also_viewed': ['068240103X']}
4,1019880,{'Books': 9589258},http://ecx.images-amazon.com/images/I/61LcHUdv...,[[Books]],Classic Soul Winner's New Testament Bible,,5.39,"{'also_viewed': ['B003HMB5FC', '0834004593'], ..."
5,1048813,,http://ecx.images-amazon.com/images/I/41k5u0lr...,[[Books]],Archer Christmas 4 Tape Pack,,,


Basic description of features : 
* asin : ID of the product
* salesRank : sales rank information
* imUrl : url of the product image
* categories : list of categories the product belongs to
* title : name of the product
* price : price in US dollars
* description : product description
* related : related products, for example products that were also bought or also viewed
* brand : brand name


This time the file is small enough to load into memory with pandas, but because of the format of the input it is faster to write a custom read.

For example, we cannot use the `unique` function to list the categories, because they are stored as lists, which are unhashable.
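The hashing problem can be reproduced without pandas. The sketch below uses two category paths in the shape of the `categories` field (the second path is a made-up example) and shows two possible workarounds: flattening to individual names, or converting each path to a tuple.

```python
# Two category paths in the shape of the `categories` field; the second
# path is a made-up example.
categories = [["Books"], ["Books", "Literature & Fiction"]]

try:
    set(categories)  # lists are unhashable, so this raises TypeError
    lists_are_hashable = True
except TypeError:
    lists_are_hashable = False

# Flattening yields plain strings, which a set accepts.
distinct_names = {name for path in categories for name in path}

# Converting each path to a tuple keeps whole paths distinct instead.
distinct_paths = {tuple(path) for path in categories}
```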

In [2]:
books_meta_lines = data.read_data("meta_Books", 2370585)
distinct_categories = set()
sales_rank_min = 100000000
sales_rank_max = 0
sales_rank_total = 0
sales_rank_count = 0
title_count = 0
description_count = 0
price_count = 0
price_total = 0
related_count = 0
for book_line in books_meta_lines:
    for c in book_line["categories"]:
        for v in c:
            distinct_categories.add(v)
    if "salesRank" in book_line:
        sales_rank_count+=1
        s_rank = book_line["salesRank"]["Books"]
        sales_rank_total += s_rank
        sales_rank_max = s_rank if s_rank > sales_rank_max else sales_rank_max
        sales_rank_min = s_rank if s_rank < sales_rank_min else sales_rank_min
    title_count += 1 if "title" in book_line else 0
    description_count += 1 if "description" in book_line else 0
    if "price" in book_line:
        price_count += 1
        price_total += book_line["price"]
    related_count += 1 if "related" in book_line else 0



KeyboardInterrupt: 

We then get the following results. We ran the computations at different times, without rebuilding the dataframe each time, so only a summary is shown here:

In [10]:
meta_books_headers = ["Name","Mean","Count","Count (%)","Min","Max"]
meta_books_results = [["Price",17.62630714920411,1679410,70.84369469983147],
                      ["Sales rank",2837317.7674769713,1891017,79.77005675814198,1,14690268],
                      ["Title",None,1938767,81.78432749722116],
                      ["Description",None,1121358,47.303007485494085],
                       ["Related",None,1620429,68.3556590461848]
                     ]
pd.DataFrame(meta_books_results,columns=meta_books_headers)

Unnamed: 0,Name,Mean,Count,Count (%),Min,Max
0,Price,17.62631,1679410,70.843695,,
1,Sales rank,2837318.0,1891017,79.770057,1.0,14690268.0
2,Title,,1938767,81.784327,,
3,Description,,1121358,47.303007,,
4,Related,,1620429,68.355659,,


There are also `3935` distinct categories, obviously including `Books`.

Most entries have a title (about 82%), so our project remains viable for now. The minimum sales rank is 1, so the best-selling book is in the dataset; it probably has many reviews and should be interesting to analyse.


### Movies reviews

* Number of lines : `wc -l reviews_Movies_and_TV.json` gives `4607047 reviews_Movies_and_TV.json`
* The features are the same as those of the Books reviews; see the description there

In [6]:
movies_reviews_lines = data.read_data("reviews_Movies_and_TV", 5)
df = {}
i=0
for movie_line in movies_reviews_lines:
    df[i]=movie_line
    i+=1
pd.DataFrame.from_dict(df,orient='index')

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A3R5OBKS7OM2IR,143502,Rebecca L. Johnson,"[0, 0]",This has some great tips as always and is help...,5.0,Alton... nough said,1358380800,"01 17, 2013"
1,A3R5OBKS7OM2IR,143529,Rebecca L. Johnson,"[0, 0]",This is a great pastry guide. I love how Alto...,5.0,Ah Alton...,1380672000,"10 2, 2013"
2,AH3QC2PC1VTGP,143561,Great Home Cook,"[2, 4]",I have to admit that I am a fan of Giada's coo...,2.0,Don't waste your money,1216252800,"07 17, 2008"
3,A3LKP6WPMP9UKX,143588,Anna V. Carroll,"[9, 9]",I bought these two volumes new and spent over ...,5.0,VOLUME 1 & VOLUME 2-BETTER THAN THERAPY,1236902400,"03 13, 2009"
4,AVIY68KEPQ5ZD,143588,Rebecca Millington,"[1, 4]",I am very pleased with the dvd only wish i cou...,5.0,Barefoot Contesst Vol 2,1232236800,"01 18, 2009"


We take the same approach as for the book files. Here we could have loaded the whole file into memory and worked with pandas, but this is not needed for a simple description.

Results : 
* Number of distinct reviewers : 2088620 (45.33% of total) : about 2.2 reviews per reviewer on average
* Number of distinct products : 200941 (4.36% of total) : about 22.9 reviews per product on average
* Helpful mean (without 0/0) : 0.6132033945128144
* Unrated helpfulness (0/0) : 1967621 (42.71% of total)
* Overall mean : 4.18688001229421
* Max unix Time : 1406073600 : Wednesday, July 23, 2014 
* Min unix Time : 871084800 : Saturday, August 9, 1997

The share of distinct products is clearly lower than for books (4.36% versus 10.35%), which means more reviews per product (22.9 versus 9.66). The mean overall rating is also high, but slightly lower than for books.

### Movies metadata

* Number of lines : `wc -l meta_Movies_and_TV.json` gives `208321 meta_Movies_and_TV.json`
* The features are the same as those of the Books metadata; see the description there

In [7]:
movies_meta_lines = data.read_data("meta_Movies_and_TV", 5)
df = {}
i=0
for movie_line in movies_meta_lines:
    df[i]=movie_line
    i+=1
pd.DataFrame.from_dict(df,orient='index')

Unnamed: 0,asin,categories,description,title,price,salesRank,imUrl,related
0,0000143561,"[[Movies & TV, Movies]]","3Pack DVD set - Italian Classics, Parties and ...","Everyday Italian (with Giada de Laurentiis), V...",12.99,{'Movies & TV': 376041},http://g-ecx.images-amazon.com/images/G/01/x-s...,"{'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '..."
1,0000589012,"[[Movies & TV, Movies]]",,Why Don't They Just Quit? DVD Roundtable Discu...,15.95,{'Movies & TV': 1084845},http://ecx.images-amazon.com/images/I/519%2B1k...,"{'also_bought': ['B000Z3N1HQ', '0578045427', '..."
2,0000695009,"[[Movies & TV, Movies]]",,Understanding Seizures and Epilepsy DVD,,{'Movies & TV': 1022732},http://g-ecx.images-amazon.com/images/G/01/x-s...,
3,000107461X,"[[Movies & TV, Movies]]",,Live in Houston [VHS],,{'Movies & TV': 954116},http://ecx.images-amazon.com/images/I/41WY47gL...,
4,0000143529,"[[Movies & TV, Movies]]",Disc 1: Flour Power (Scones; Shortcakes; South...,My Fair Pastry (Good Eats Vol. 9),19.99,{'Movies & TV': 463562},http://ecx.images-amazon.com/images/I/51QY79CD...,"{'also_bought': ['B000NR4CRM', 'B0019BK3KQ', '..."


In [11]:
# Summary of the movies metadata (values computed in earlier runs)
meta_movies_headers = ["Name","Mean","Count","Count (%)","Min","Max"]
meta_movies_results = [["Price",23.48689486774547,155623,74.70346244497674],
                       ["Sales rank",391833.30351553153,204777,98.29877928773384,11,1149966],
                       ["Title",None,107671,51.685139760273806],
                       ["Description",None,178086,85.48634079137484],
                       ["Related",None,154859,74.33672073386745]
                      ]
pd.DataFrame(meta_movies_results,columns=meta_movies_headers)

Unnamed: 0,Name,Mean,Count,Count (%),Min,Max
0,Price,23.486895,155623,74.703462,,
1,Sales rank,391833.303516,204777,98.298779,11.0,1149966.0
2,Title,,107671,51.68514,,
3,Description,,178086,85.486341,,
4,Related,,154859,74.336721,,


Far fewer movies have a title (about 52%), which might be restrictive if too many of the titled products are not connected to books.
It is a bit odd to have a product description but not its title.
Once again there are many categories (786).

## Matching books and movies

We will match books and movies using the metadata titles. Once we have the products of interest, we will merge the metadata and the reviews for the matching products (once for books, once for movies).
Finally, we will be able to carry out our comparative analysis.
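Matching on raw titles is fragile, because the metadata titles contain HTML entities (for example `&amp;`) and inconsistent punctuation. One possible approach, shown here only as a sketch, is to normalize titles before comparing them. The functions `normalize_title` and `match_titles` are hypothetical helpers, not part of our current code, and exact matching after normalization will still miss titles with extra text such as `[VHS]`.

```python
import html
import re


def normalize_title(title):
    """Decode HTML entities, lowercase, and replace punctuation by spaces."""
    text = html.unescape(title).lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def match_titles(product_titles, wanted_titles):
    """Map each wanted title to the asins whose title normalizes identically.

    product_titles: dict of asin -> title taken from the metadata.
    wanted_titles: iterable of known book or movie titles.
    """
    wanted = {normalize_title(title): title for title in wanted_titles}
    matches = {}
    for asin, title in product_titles.items():
        key = normalize_title(title)
        if key in wanted:
            matches.setdefault(wanted[key], []).append(asin)
    return matches
```

The resulting asins could then be used to filter both the metadata and the review files before merging them.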

For this milestone, we decided to focus on known book-movie associations. For the final project, we might consider other techniques for matching products from the two categories.

To obtain a list of book-movie associations, we decided to parse the [Wikipedia pages](https://en.wikipedia.org/wiki/Lists_of_fiction_works_made_into_feature_films) on this subject.
The list is split across 4 pages, depending on the first letter of the book title.

In [8]:
def get_dict_titles(urlList):
    """Map each book title to the list of film titles adapted from it."""
    titles = {}
    for url in urlList:
        r = requests.get(url)
        soup = BeautifulSoup(r.text,"html5lib")
        tables = soup.findAll("table", {"class": "wikitable"})
        for table in tables:
            for row in table.findAll("tr"):
                cells = row.findAll("td")
                # Keep only rows with exactly two cells: book, then film(s)
                if len(cells) == 2:
                    movie_data = cells[1].findAll(text=True)
                    # The first text node is the first film title
                    movie_titles = [movie_data[0]]
                    # Other films in the same cell follow a line break
                    movie_titles_mult = [value for index, value in enumerate(
                        movie_data) if movie_data[(index - 1) % len(movie_data)] == '\n']
                    # The set removes duplicates between the two lists
                    titles[cells[0].find(text=True)] = list(
                        set(movie_titles + movie_titles_mult))
    return titles

In [10]:
urlList = ["https://en.wikipedia.org/wiki/List_of_fiction_works_made_into_feature_films_(0%E2%80%939_and_A%E2%80%93C)",
           "https://en.wikipedia.org/wiki/List_of_fiction_works_made_into_feature_films_(D%E2%80%93J)",
           "https://en.wikipedia.org/wiki/List_of_fiction_works_made_into_feature_films_(K%E2%80%93R)",
           "https://en.wikipedia.org/wiki/List_of_fiction_works_made_into_feature_films_(S%E2%80%93Z)"]
dict_titles = get_dict_titles(urlList)

In [12]:
print(f"Number of books : {len(dict_titles.keys())}")
movies_titles = [item for sublist in dict_titles.values()
                 for item in sublist]
print(f"Number of movies : {len(movies_titles)}")

Number of books : 1635
Number of movies : 2253
