# Sections
* [1. Introduction](#introduction)
* [2. Data gathering](#data)
* * [2.1 Amazon dataset](#amazon)
* * [2.2 Scraping Wikipedia](#wikipedia)
* * [2.3 Amazon Product API](#amazonapi)
* [3. What's next for Milestone #3](#milestone3)

In [1]:
#essential imports
import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize

#scraping imports
import requests
from bs4 import BeautifulSoup

#plotting imports
%matplotlib inline
import matplotlib.pyplot as plt

#String matching
import re

#date
import datetime as dt

<a id='introduction'></a>
# 1. Introduction

## Abstract

It is often said, ironically, that Van Gogh never sold a painting in his lifetime, even though he is one of the most famous painters in history.
Is this an isolated case? Is it possible that society takes a greater interest in the works of deceased personalities than in those of their contemporaries? And if so, is this interest stronger while the news is still fresh?
This project aims to analyze the effect of artists' and authors' deaths on the sales of their work.
It starts from the hypothesis that a real societal phenomenon exists, which we will call "post-mortem worship": people take more interest in the works of artists and authors shortly after their death.
The second assumption is that this phenomenon can be restricted to the artistic and literary communities, which are the ones concerned with mass celebrity. The last assumption is that current means of communication allow the whole society concerned, in this case American society, to learn the news shortly after the event, especially when it concerns well-known people.

By working on data from Amazon, the giant of online commerce, and Wikipedia, the best-known encyclopedia on the web, it is possible to test the post-mortem worship effect.
The first part of this project consists of extracting the data of interest from Amazon and Wikipedia. This requires filtering the Amazon data down to the relevant categories, cleaning it and storing it in a convenient format for later use.
In parallel, the list of authors who died in the time interval covered by the Amazon data was scraped from Wikipedia and stored in a compact and easy-to-use format.  
The second part of the research will be based on the extraction of quantifiable features (interest measured as the number of reviews, appraisal score of the reviews, temporal dimension...) in order to allow a mathematical analysis of the data.
The final conclusions will be drawn from the mathematical results and hypothesis testing.


## Research questions
* When an author/artist dies, what popularity trend do their related products show on Amazon? (For an author, their books; for an actor, their movies; etc.)
* How does this impact depend on the type of work the author/artist produced (music, books, films...)?


## Dataset
We want to use the Amazon datasets provided in the course, both the review and the metadata datasets (so at most 20 + 3.1 GB of JSON).
However, we will use only the categories related to works created by an author/artist (music, books, films...).
Since we are very interested in the number of reviews as a measure of interest, we will restrict our data to the 5-core dataset, so as to have at least a few reviews per product.
Interest as a function of time will be computed from the review content and the review dates (text analysis).
To correlate this interest with deaths, we will need the artists'/authors' death dates and their corresponding works. For this we will use Wikipedia and scrape the data needed for our project.
One hard part will be matching the works of an artist/author to the corresponding products on Amazon.
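
As a rough illustration of how interest over time could be measured, the sketch below counts reviews and averages ratings per product and per month. It assumes the review files expose the fields `asin`, `overall` and `unixReviewTime`; the rows themselves are made up for the example and this is not the project's final feature definition.

```python
import pandas as pd

# Toy reviews using assumed field names of the Amazon review files
# (asin, overall, unixReviewTime); the values are invented for illustration.
reviews = pd.DataFrame({
    "asin": ["B001", "B001", "B001", "B002"],
    "overall": [5.0, 4.0, 3.0, 5.0],
    "unixReviewTime": [1388534400, 1389744000, 1391212800, 1388534400],
})

# Convert the Unix timestamp (seconds) to a datetime, then to a monthly period.
reviews["date"] = pd.to_datetime(reviews["unixReviewTime"], unit="s")
reviews["month"] = reviews["date"].dt.to_period("M")

# Number of reviews and mean rating per product and per month.
monthly = (reviews.groupby(["asin", "month"])["overall"]
                  .agg(n_reviews="count", mean_rating="mean")
                  .reset_index())
print(monthly)
```

The monthly review count can then serve as a simple proxy for interest, to be compared around each death date.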


## A list of internal milestones up until project milestone 2
* Define the useful features across all the datasets
* Select the Amazon product categories containing works of authors/artists (Amazon has 24 item categories)
* Scrape the deaths of artists/authors over the last N years, match each artist with all their works, then match these with all the corresponding Amazon products.
* Clean the data
* Think about how to present the project in terms of data visualization

---

<a id='data'></a>
# 2. Data gathering

<a id='amazon'></a>
## 2.1 Amazon Dataset

Amazon has a lot of categories:

Books, Electronics, Movies and TV, CDs and Vinyl, Clothing (Shoes and Jewelry), Home and Kitchen, Kindle Store, Sports and Outdoors, Cell Phones and Accessories, Health and Personal Care, Toys and Games, Video Games, Tools and Home Improvement, Beauty, Apps for Android, Office Products, Pet Supplies, Automotive, Grocery and Gourmet Food, Patio (Lawn and Garden), Baby, Digital Music, Musical Instruments, Amazon Instant Video


### For our project we consider the useful categories as:
* Books
* Movies and TV
* CDs and Vinyl
* Kindle Store
* Digital Music
* Amazon Instant Video

#### Potentially useful
* Toys and Games
* Video Games

---
We can access the Amazon review dataset [here](http://jmcauley.ucsd.edu/data/amazon/links.html).

##### Authors
* [Books](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books_5.json.gz)
* [Kindle Store](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Kindle_Store_5.json.gz)

##### Actors
* [Movies and TV](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Movies_and_TV_5.json.gz)
* [Amazon Instant Video](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Amazon_Instant_Video_5.json.gz)

##### Musicians
* [CDs and Vinyl](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_CDs_and_Vinyl_5.json.gz)
* [Digital Music](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Digital_Music_5.json.gz)


#### Metadata (list of all the products with their descriptions)
* [Books](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Books.json.gz)
* [Kindle Store](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Kindle_Store.json.gz)
* [Movies and TV](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Movies_and_TV.json.gz)
* [Amazon Instant Video](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Amazon_Instant_Video.json.gz)
* [CDs and Vinyl](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_CDs_and_Vinyl.json.gz)
* [Digital Music](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Digital_Music.json.gz)
---

In [2]:
#data_path = 'DATA/'
#data = pd.read_json(data_path+'Kindle_Store_5.json', lines=True)

Right now we do not use the Amazon review dataset at all; for milestone #2 we only gather the data from Wikipedia and the Amazon Product API, so as to have the tools for our analysis.

Note that each file is at most 2 GB, and we can process the files separately for our analysis. Given these sizes, we will not need to use Spark.
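
One possible way to keep memory use low without Spark is to read a file in chunks. The sketch below is only an illustration under assumptions: it builds a tiny gzipped JSON-lines file (the path and records are invented) and assumes the real review files can be read the same way with `pd.read_json(..., lines=True, chunksize=...)`.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Build a tiny gzipped JSON-lines file standing in for a review file
# (the real files are far larger; path and records here are made up).
records = [{"asin": "B00%d" % (i % 3), "overall": float(i % 5 + 1)} for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), "reviews_sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the file a few lines at a time and aggregate as we go,
# so that the whole file never has to fit in memory at once.
counts = pd.Series(dtype="int64")
for chunk in pd.read_json(path, lines=True, chunksize=4, compression="gzip"):
    counts = counts.add(chunk["asin"].value_counts(), fill_value=0)
counts = counts.astype(int).sort_index()
print(counts)
```

With real data the chunk size would be much larger (tens of thousands of lines), and only the columns needed for the analysis would be kept.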

<a id='wikipedia'></a>
## 2.2 Scraping Wikipedia

We want to scrape all the musicians, actors and authors whose deaths are listed on Wikipedia. Since we want the most notable ones, we scrape [the page summarizing the year](https://en.wikipedia.org/w/index.php?title=2000#Deaths), and not the specific [page for the deaths](https://en.wikipedia.org/wiki/Deaths_in_2000).

We iterate over the page of each year of interest and scrape each celebrity's name, birth date, description and death date.

In [3]:
# Function that matches a line of the wikitext and returns a list of tuples:
# (person_death_date, person_name, person_description, person_birth_date)
# debug=True prints the intermediate matches
def matchLine(line, debug=False):
    if debug:
        print(line)
    #match when a line contains only 1 celebrity
    match_1_line = re.match( r'.*\[\[(.*?)\]\] (&ndash;|-|–).*?\[\[(.*?)\]\](.*)\(b\..*\[\[(.*?)\]\]\)', line)
    if match_1_line:
        if debug:
            print(match_1_line)
        return [(match_1_line.group(1),match_1_line.group(3),match_1_line.group(4),match_1_line.group(5))]
    #didn't find a match for 1 celebrity
    else:
        result = []
        #consider it's a line with multiple celebrities dead on the same day, separated by \n**
        s = line.split("\n**")
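        #if the split didn't work, we report a matching error and return no result
        #(note: year is read from the enclosing scraping loop)
        if len(s)==1:
            print("No match found for: "+ str(year) +" "+ line)
            return result
        #the first split contains only the death date
        match_date = re.match( r'\[\[(.*?)\]\]',s[0])
        match_date = match_date.group(1) if match_date else None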
        #iterate for each celebrities
        for i in range(1,len(s)):
            match_3_param = re.match( r'.*\[\[(.*?)\]\](.*)\(b\. ?\[\[(.*?)\]\]\)', s[i])
            if debug:
                print(match_3_param)
            #if the match is succesful, add the celebrity to the array
            if match_3_param and match_date:
                result.append((match_date,match_3_param.group(1),match_3_param.group(2),match_3_param.group(3)))
            #otherwise return a matching error
            else:
                print("No match found for: "+ str(year) +" "+ s[i])
        return result
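
To make the single-celebrity pattern easier to follow, here is a small worked example. The pattern is copied from `matchLine`; the sample wikitext line is an assumption modeled on the parsed output, not a line taken verbatim from Wikipedia.

```python
import re

# Same single-entry pattern as in matchLine.
PATTERN = r'.*\[\[(.*?)\]\] (&ndash;|-|–).*?\[\[(.*?)\]\](.*)\(b\..*\[\[(.*?)\]\]\)'

# Sample wikitext line (assumed format, modeled on the parsed output).
line = "[[January 5]] &ndash; [[Lincoln Kirstein]], American writer and impresario (b. [[1907]])"

m = re.match(PATTERN, line)
# Group 2 is the dash separator, so the useful groups are 1, 3, 4 and 5.
death_day, name, description, birth_year = m.group(1), m.group(3), m.group(4), m.group(5)
print(death_day, "|", name, "|", description.strip(), "|", birth_year)
```

Note that the description group keeps its leading comma and surrounding whitespace, which is why the scraped descriptions start with `, `.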

In [4]:
month_list = ['January', 'February', 'March', 'April', 'May', 
              'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Function that returns a standardized date string in the format %Y-%m-%d
def computeDate(year, md):
    a = md.split(" ")
    if a[0] in month_list:
        month = month_list.index(a[0])+1
        day = a[1]
        return str(year)+"-"+str(month).zfill(2)+"-"+str(day).zfill(2)
    #report a date that we couldn't convert
    else:
        print("Date error: "+md+" "+str(year))
        return np.nan
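
As an alternative sketch, the same conversion can be written with `datetime.strptime`. The function name `compute_date_strptime` is hypothetical, and the behaviour differs slightly from `computeDate`: it is stricter and rejects any trailing text after the day. It also assumes English month names, since `%B` depends on the locale.

```python
import datetime as dt

def compute_date_strptime(year, md):
    """Return 'YYYY-MM-DD' for inputs like (1996, 'January 5'), or None if parsing fails."""
    try:
        parsed = dt.datetime.strptime("{} {}".format(md, year), "%B %d %Y")
    except ValueError:
        return None
    return parsed.strftime("%Y-%m-%d")

print(compute_date_strptime(1996, "January 5"))
print(compute_date_strptime(2013, "April 30 (death announced on this date)"))
```

A stricter parser like this one surfaces unusual entries early, at the cost of having to handle them by hand.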

In [5]:
#create an empty dataFrame
columns = ['Death Date', 'Name', 'Description','Birth Date']
df = pd.DataFrame(columns=columns) 

#the year of the interval of the amazon dataset
starting_year = 1996
ending_year = 2014

#for every year we scrape the corresponding wikipedia page
for year in range(starting_year,ending_year+1):
    #we retrieve only the wikitext using the parameter action=raw
    #note that the api documentation advises doing it this way if we are only interested in the wikitext
    year_url = "https://en.wikipedia.org/w/index.php?action=raw&title={}&maxlag=5".format(year)
    r = requests.get(year_url)
    page_text = r.text
    #we split the text to retrieve only the part about celebrities' deaths
    if("== Deaths ==" in page_text):
        a = page_text.split("== Deaths ==")[1]
    elif("==Deaths==" in page_text):
        a = page_text.split("==Deaths==")[1]
    else:
        #no Deaths section found: skip this year instead of failing on None
        print("No Deaths section found for: "+str(year))
        continue
    #we set our starting text at the first month (i.e. January) of the Deaths section
    if("=== January ===" in a):
        a = a.split("=== January ===")[1]
    elif("===January===" in a):
        a = a.split("===January===")[1]
    else:
        print("No January subsection found for: "+str(year))
        continue
    #all the months are separated by \n\n, so we split and iterate over the 12
    deaths = a.split("\n\n")
    for i in range(12):
        #We have 2 cases: 1 celebrity per date, or multiple celebrities per date, so we split accordingly,
        #then we match the line into groups using pattern recognition.
        s = deaths[i].split('* ',1)[1]
        lines = s.split("\n* ")
        for line in lines:
            res = matchLine(line)
            for person in res:
                    #when we finished matching, we format the date, then add the celebrity into the dataFrame
                    date = computeDate(year,person[0])
                    df.loc[len(df)]=[date,person[1]," "+person[2].lower(),person[3]]
print(df.shape)
df.head(10)

No match found for: 1996  Victims of [[TWA Flight 800]]
No match found for: 2001  2,996 people (2,977 victims and 19 hijackers) who died in the [[September 11 attacks]]
No match found for: 2013 [[April 30]] (death announced on this date) &ndash; [[Deanna Durbin]], Canadian-born singer and actress (b. [[1921]])
(3312, 4)


| | Death Date | Name | Description | Birth Date |
|---|---|---|---|---|
| 0 | 1996-01-01 | Malladihalli Sri Raghavendra Swamiji | , indian yogi | 1890 |
| 1 | 1996-01-01 | Moshe Aryeh Freund | , israeli rabbi | 1894 |
| 2 | 1996-01-01 | Arleigh Burke | , american naval officer | 1901 |
| 3 | 1996-01-01 | Arthur Rudolph | , german rocket engineer | 1906 |
| 4 | 1996-01-02 | Karl Targownik | , hungarian psychiatrist and holocaust survivor | 1915 |
| 5 | 1996-01-05 | Yahya Ayyash | , palestinian shaheed | 1966 |
| 6 | 1996-01-05 | Lincoln Kirstein | , american writer and impresario | 1907 |
| 7 | 1996-01-05 | Richard Versalle | , american operatic tenor | 1932 |
| 8 | 1996-01-07 | Prime Minister of Hungary | | 1930 |
| 9 | 1996-01-07 | Tarō Okamoto | , japanese artist | 1911 |


The matcher fails on only 3 lines, while 3312 entries are parsed successfully, which is pretty good.
Most of these errors come from a change of format in the list, where celebrities are listed under specific tragedies.
We add these cases manually.

In [6]:
#we add only the wanted celebrities (singer, actor,...)
#TWA Flight 800
df.loc[len(df)]=["1996-07-17","Marcel Dadi",", French guitarist".lower(),"1951"]
df.loc[len(df)]=["1996-07-17","David Hogan",", American composer".lower(),"1949"]
#Deanna Durbin
df.loc[len(df)]=["2013-04-30","Deanna Durbin",", Canadian-born singer and actress".lower(),"1921"]

We set a multi-index on name and birth date.
Then we sort the entries by death date.

In [7]:
df2 = df.set_index(['Name', 'Birth Date'])
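#note: df.index is the default integer index; df2.index.is_unique would test the (Name, Birth Date) pairs
print("Index is unique: "+str(df.index.is_unique))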
df2.sort_values('Death Date',inplace = True)
df2.head()

Index is unique: True


| Name | Birth Date | Death Date | Description |
|---|---|---|---|
| Malladihalli Sri Raghavendra Swamiji | 1890 | 1996-01-01 | , indian yogi |
| Moshe Aryeh Freund | 1894 | 1996-01-01 | , israeli rabbi |
| Arleigh Burke | 1901 | 1996-01-01 | , american naval officer |
| Arthur Rudolph | 1906 | 1996-01-01 | , german rocket engineer |
| Karl Targownik | 1915 | 1996-01-02 | , hungarian psychiatrist and holocaust survivor |
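
For reference, a minimal sketch of how uniqueness of the (Name, Birth Date) pairs could be checked and duplicates removed. The toy rows are invented for the example and the variable names are hypothetical.

```python
import pandas as pd

# Toy data: two different people sharing a name, plus one exact duplicate entry.
toy = pd.DataFrame({
    "Name": ["John Smith", "John Smith", "Jane Doe", "Jane Doe"],
    "Birth Date": ["1920", "1935", "1940", "1940"],
    "Death Date": ["2000-01-01", "2001-02-02", "2003-03-03", "2003-03-03"],
})
indexed = toy.set_index(["Name", "Birth Date"])

# is_unique is evaluated on the (Name, Birth Date) pairs themselves.
print(indexed.index.is_unique)

# duplicated() flags the repeated pairs so they can be inspected or dropped.
deduplicated = indexed[~indexed.index.duplicated(keep="first")]
print(deduplicated)
```

Two people with the same name but different birth years remain distinct, which is the reason for indexing on both fields.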


Then we remove the celebrities who died outside the interval covered by the Amazon dataset (May 1996 - July 2014).

In [8]:
df2 = df2[(df2['Death Date'] >= '1996-05-01') & (df2['Death Date'] <= '2014-07-31')]
df2.tail(5)

| Name | Birth Date | Death Date | Description |
|---|---|---|---|
| David Easton | 1917 | 2014-07-19 | , canadian-american political scientist |
| James Garner | 1928 | 2014-07-19 | , american actor |
| Carlo Bergonzi | 1924 | 2014-07-25 | , italian tenor and actor |
| Francesco Marchisano | 1929 | 2014-07-27 | , italian cardinal |
| Julio Grondona | 1931 | 2014-07-30 | , argentinian football authority |


Now we want to keep only the celebrities **who are musicians, actors or authors.**
Since we have a short description of each celebrity, we can expect it to contain their profession when that is what they are famous for.
_Note that some celebrities do not have a description; this is the case when their name contains it: president, prince, king, etc._

We classify a celebrity as a musician, actor or author if their description contains a specific keyword, for example: _"actor"_.

In [9]:
df3 = df2

#return true if the description contains one of these keywords
jobMusician = ["dj","baritone","bard","pianist","singer","tenor ","soprano", "composer","trumpeter","saxophonist","lyricist", "drummer", "musician", "rapper","guitarist","violinist","violist","bassist"]
def isMusician(s):
    for job in jobMusician:
        if job in s:
            return True
    return False

#return true if the description contains one of these keywords
jobActor = ["actor", "actress","filmmaker","cinematographer","film director", "film producer"]
def isActor(s):
    for job in jobActor:
        if job in s:
            return True
    return False

#return true if the description contains one of these keywords
jobAuthor = ["autor", "author", "writer", "poet", "novelist","cartoonist","comic strip artist","manga artist"]
def isAuthor(s):
    for job in jobAuthor:
        if job in s:
            return True
    return False

#add 3 columns to the data frame: Actor, Author and Musician. Someone can belong to several categories, so every row gets these 3 boolean columns.
df3 = df3.merge(df3.Description.apply(lambda s: pd.Series({'Musician':isMusician(s), 'Actor':isActor(s), 'Author':isAuthor(s)})), 
    left_index=True, right_index=True) 

#filter only actor author or musician
df_artists = df3[(df3['Actor'] == True) | (df3['Author'] == True) | (df3['Musician'] == True)]
df_artists = df_artists.sort_values('Death Date')

print("Number of celebrities: "+str(df.shape[0]))
print("Number of useful celebrities: "+str(df_artists.shape[0]))
print("Number of unwanted celebrities: "+str(df.shape[0]-df_artists.shape[0]))

#print only the wanted celebrities
df_artists.head(5)

Number of celebrities: 3315
Number of useful celebrities: 1315
Number of unwanted celebrities: 2000


| Name | Birth Date | Death Date | Description | Actor | Author | Musician |
|---|---|---|---|---|---|---|
| Jack Weston | 1924 | 1996-05-03 | , american actor | True | False | False |
| John Beradino | 1917 | 1996-05-19 | , american baseball player and actor | True | False | False |
| Jon Pertwee | 1919 | 1996-05-20 | , british actor | True | False | False |
| Paul Delph | 1957 | 1996-05-21 | , american musician and producer | False | False | True |
| Lash LaRue | 1917 | 1996-05-21 | , american actor | True | False | False |
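
Plain substring tests can produce false positives: for example, the description "argentinian football authority" contains "author". A possible refinement, sketched here with shortened keyword lists and a hypothetical helper `make_matcher`, is to match keywords as whole words with a regular expression. This is a suggestion, not the method used to produce the counts reported in this notebook.

```python
import re

# Keyword lists shortened for illustration; the full lists are in the cell above.
job_actor = ["actor", "actress", "film director"]
job_author = ["author", "writer", "poet", "novelist"]

def make_matcher(keywords):
    """Build a predicate that matches any keyword as a whole word."""
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b")
    return lambda description: bool(pattern.search(description))

is_actor = make_matcher(job_actor)
is_author = make_matcher(job_author)

# "authority" no longer counts as "author"; "contractor" no longer counts as "actor".
print(is_author(", argentinian football authority"))
print(is_author(", american writer and impresario"))
print(is_actor(", american building contractor"))
```

The trade-off is that whole-word matching misses plural or compound forms unless they are listed explicitly.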


We save this DataFrame, since Wikipedia pages are quite volatile (vandalism...).

In [10]:
df_artists.to_csv('DATA/deaths.csv')
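
When the saved file is loaded again, the multi-index and the date type have to be restored explicitly. The sketch below shows one way to do it on a toy frame written to a temporary path; the path is made up and only two rows from the output are reproduced.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same index layout as df_artists (two rows copied from the output).
artists = pd.DataFrame({
    "Name": ["Jack Weston", "Jon Pertwee"],
    "Birth Date": ["1924", "1919"],
    "Death Date": ["1996-05-03", "1996-05-20"],
    "Description": [", american actor", ", british actor"],
}).set_index(["Name", "Birth Date"])

path = os.path.join(tempfile.mkdtemp(), "deaths.csv")
artists.to_csv(path)

# Read the birth year as a string (as scraped), parse the death dates,
# then rebuild the (Name, Birth Date) multi-index.
restored = pd.read_csv(path, dtype={"Birth Date": str}, parse_dates=["Death Date"])
restored = restored.set_index(["Name", "Birth Date"])
print(restored)
```

Parsing `Death Date` as a datetime at load time makes later comparisons with review dates straightforward.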

<a id='amazonapi'></a>
## 2.3 Amazon Product API

Now we want to link each celebrity to all their works.

Our first idea was to scrape their works from their personal Wikipedia page. For example, for an actor, we would scrape the "Filmography" section and read the tables in its subsections to find the link between the actor and their films.

However, there is an [Amazon Product API](http://docs.aws.amazon.com/AWSECommerceService/latest/DG/becomingDev.html) that does the hard work for us. Given a product ID (ASIN), the API returns a wrapper object that contains field values for **directors, actors, authors, creators**. This is exactly what we need: it lets us keep only the products related to our list of artists.
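
The filtering step could look roughly like the sketch below. The product records here are hypothetical dictionaries that only mimic the kind of attributes described above; they are not the actual API response format, and the ASINs and the helper names `normalize` and `linked_artists` are invented for the example.

```python
# Hypothetical product records: the field names only mimic the attributes
# described in the text and are NOT the actual API response format.
products = [
    {"asin": "X1", "actors": ["James Garner", "Someone Else"], "authors": []},
    {"asin": "X2", "actors": [], "authors": ["Unknown Writer"]},
]
artist_names = {"James Garner", "Lincoln Kirstein"}

def normalize(name):
    """Lowercase and collapse whitespace so small formatting differences do not block a match."""
    return " ".join(name.lower().split())

normalized_artists = {normalize(n) for n in artist_names}

def linked_artists(product, fields=("directors", "actors", "authors", "creators")):
    """Return the people credited on a product who are in our artist list."""
    credited = [p for f in fields for p in product.get(f, [])]
    return [p for p in credited if normalize(p) in normalized_artists]

links = {p["asin"]: linked_artists(p) for p in products}
print(links)
```

Exact matching on normalized names is the simplest option; homonyms and spelling variants would need extra handling.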

---

<a id='milestone3'></a>
# 3. What's next for Milestone #3

Currently we have all the reviews for each product, the list of celebrities and the list of products linked to celebrity names.
We will combine everything, analyse the review patterns (date, rating, etc.) for each artist and draw conclusions from them.

We still want to answer these questions:
* When an author/artist dies, what popularity trend do their related products show on Amazon? (For an author, their books; for an actor, their movies; etc.)
* How does this impact depend on the type of work the author/artist produced (music, books, films...)?
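
A first, very simple way to approach these questions is to compare review counts in equal-length windows before and after a death date. The sketch below uses invented review dates and an arbitrary 30-day window; both are assumptions for illustration, not results.

```python
import pandas as pd

# Toy review dates for one artist's products (made-up values).
review_dates = pd.to_datetime([
    "2014-06-01", "2014-07-01", "2014-07-20", "2014-07-25", "2014-08-10", "2014-12-01",
])
death_date = pd.Timestamp("2014-07-19")
window = pd.Timedelta(days=30)

# Count reviews in equal-length windows just before and just after the death.
before = ((review_dates >= death_date - window) & (review_dates < death_date)).sum()
after = ((review_dates >= death_date) & (review_dates < death_date + window)).sum()
print(before, after)
```

Repeating this per artist and per category would give paired counts that can feed a hypothesis test.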