<a href="https://colab.research.google.com/github/BStricks/music_information_retrieval/blob/master/music_webscrape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web scraping the album review corpus

The primary purpose of this script is to crawl the pitchfork.com website for all album reviews and download the review text into a dataframe along with the artist and album name attributes. This methodology can be extended to include multiple other domains e.g. amazon reviews, rolling stone etc.

The secondary purpose was to trial a document matching algorithm on the newly created corpus; using a range of matching techniques the aim is to match a user's natuaral language query with the most appropriate album. 



# Section 1: web scraping

In [0]:
###mount drive
from google.colab import drive
drive.mount('/content/gdrive')

###change directory
%cd gdrive/My Drive/Colab Notebooks/album_reviews

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
[Errno 2] No such file or directory: 'gdrive/My Drive/Colab Notebooks/album_reviews'
/content/gdrive/My Drive/Colab Notebooks/album_reviews


In [0]:
###libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd 
import numpy as np
import pickle
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

## Pitchfork scrape

In [0]:
###webpages to scrape
pagelist = []
for i in range(1, 50):
  pagelist.append('https://pitchfork.com/reviews/albums/?page='+str(i))

###create table for hyperlinks
master_table_pitchfork = pd.DataFrame(columns=['href', 'artist', 'album'])

###function to scrape hyperlinks and extract artist/album tags
for i in pagelist:

  page = requests.get(i)
  soup = BeautifulSoup(page.text, 'html.parser').find_all('div', attrs={"class":"review"})

  for div in soup:
    href = ['https://pitchfork.com/'+div.find('a',attrs={"class":"review__link"})['href']]
    artist = [div.find('li').text]
    album = [div.find('h2').text]

    new_table = pd.DataFrame(
        {'href': href,
        'artist': artist,
        'album': album
        })

    master_table_pitchfork = master_table_pitchfork.append(new_table)

In [0]:
###scrape webpage for album review text
review_text = []

for i in range(0,588):
  
  href = master_table_pitchfork.iloc[i][0]
  page = requests.get(href)

  if not page:
    review_text.append("NULL")

  else: 
    soup = BeautifulSoup(page.text, 'html.parser').find_all('div', attrs={"class":"contents"})
  
    for div in soup:
    
      if div.text:
        review_text.append(div.text)

In [0]:
master_table_pitchfork = master_table_pitchfork.assign(review_text=review_text)

## NME scrape

In [0]:
###webpages to scrape
pagelist = []
for i in range(1, 2):
  pagelist.append('https://www.nme.com/reviews/album/page/'+str(i))

###create table for hyperlinks
master_table_nme = pd.DataFrame(columns=['href', 'artist','album'])

###function to scrape hyperlinks and extract artist/album tags
for i in pagelist:

  page = requests.get(i)
  soup = BeautifulSoup(page.text, 'html.parser').find_all('li', attrs={"class":"listing-item"})
  href = []
  artist = []
  album = []

  for i in soup:
    
    for a in i.find_all('a'):
      href.append(a['href'])

    for header in i.find_all("h3"):
      header_1 = header.text.strip()
      artist1 = header_1.split(' –')[0]
      artist.append(artist1)
      try: 
        album1 = header_1.split('\'')[1]
        album2 = album1.split('\'')[0]
        album.append(album2)
      except:
        album1 = header_1.split('‘')[1]
        album2 = album1.split('’')[0]
        album.append(album2)

new_table = pd.DataFrame({'href': href,'artist': artist,'album': album})

master_table_nme = master_table_nme.append(new_table)

In [0]:
print(len(master_table_nme))

31


In [0]:
###scrape webpage for album review text
review_text = []

for i in range(0,31):
  
  href = master_table_nme.iloc[i][0]
  page = requests.get(href)

  if not page:
    review_text.append("NULL")

  else: 
    soup = BeautifulSoup(page.text, 'html.parser').find_all('p')   
    sentences = []
    for p in soup:
        if p.text:
          para = str(p.text.strip())
          if para.startswith("window"):
            pass
          elif para.startswith("Release"):
            pass
          elif para.startswith("Record"):
            pass
          else:
            sentences.append(para)
  
  review_text.append(' '.join(sentences))

In [0]:
print(len(review_text))

31


In [0]:
master_table_nme = master_table_nme.assign(review_text=review_text)

In [0]:
#combine
master_table = master_table_pitchfork.append(master_table_nme)

#pickle
outfile = open('album_corpus','wb')
pickle.dump(master_table,outfile)
outfile.close()

# Section 2: Sentence matching

## Approach 1 - sentence level matching
The first approach taken is to split each review into sentence level data and then perform matching on a 'user -> review sentence' basis. It is hypothesised that given the complex nature of reviews that this will perform better than creating a document level match.

Features:
*   Vector space model - count vectors (1-4gram)
*   Vector space model - tfidf vectors (1-4gram)
*   Topic model - LSA (Latent Semantic Analysis)
*   Topic model - LDA (Latent Dirichlet Allocation)
*   Embeddings - pre-trained
*   Embeddings - corpus-trained

Distance measures:
*   Standard distance measures e.g. cosine etc.
*   Non-standard distance measures e.g. Word Movers Distance




In [0]:
album_corpus = pickle.load( open( "master_table.pkl", "rb" ) )

## Standard distance measures

In [0]:
from math import*
from decimal import Decimal
 
class Similarity():
  
  def euclidean_distance(self,x,y):
          return sqrt(sum(pow(a-b,2) for a, b in zip(x, y)))

  def manhattan_distance(self,x,y):
          return sum(abs(a-b) for a,b in zip(x,y))

  def minkowski_distance(self,x,y,p_value):
          return self.nth_root(sum(pow(abs(a-b),p_value) for a,b in zip(x, y)),
             p_value)

  def cosine_similarity(self,x,y):
          numerator = sum(a*b for a,b in zip(x,y))
          denominator = self.square_rooted(x)*self.square_rooted(y)
          return round(numerator/float(denominator),3)

  def jaccard_similarity(self,x,y):
          intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
          union_cardinality = len(set.union(*[set(x), set(y)]))
          return intersection_cardinality/float(union_cardinality)
        
  def nth_root(self,value, n_root):
          root_value = 1/float(n_root)
          return round (Decimal(value) ** Decimal(root_value),3)
      
  def square_rooted(self,x): 
          return round(sqrt(sum([a*a for a in x])),3)


## Vector space features and standard distance measures

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
measures = Similarity()
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords', quiet=True, raise_on_error=True)
stop_words = set(nltk.corpus.stopwords.words('english'))
tokenized_stop_words = nltk.word_tokenize(' '.join(nltk.corpus.stopwords.words('english')))
 
class LemmaTokenizer(object):
  
    def __init__(self):
        nltk.download('punkt', quiet=True, raise_on_error=True)
        self.stemmer = nltk.stem.PorterStemmer()
        
    def _stem(self, token):
        if (token in stop_words):
            return token  # Solves error "UserWarning: Your stop_words may be inconsistent with your preprocessing."
        return self.stemmer.stem(token)
        
    def __call__(self, line):
        tokens = nltk.word_tokenize(line)
        tokens = (self._stem(token) for token in tokens)  # Stemming
        return list(tokens)


def distance_vectors(corpus, stringlist, vect=TfidfVectorizer, dist=measures.euclidean_distance):
  
  ###vectorizer
  t_vectorizer = vect(tokenizer=LemmaTokenizer(),
                      strip_accents='unicode',
                      stop_words=tokenized_stop_words,
                      lowercase=True,
                      ngram_range=(1,4),
                      analyzer='word')

  X_t = t_vectorizer.fit_transform(corpus)
  test_t = t_vectorizer.transform(stringlist)
  
  ###similarity calculation
  scores = []
  for i in range(0,len(corpus)):    
    scores.append(dist(test_t.toarray()[0],X_t[i].toarray()[0]))

  ###print top 3 most similar
  indices = np.array(scores).argsort()[0:3]
  for i in indices:
    values = album_corpus.iloc[i][1:3]
    print(values.values)


In [0]:
###vector space test - working well! returning logical results
distance_vectors(corpus=album_corpus['review_text'],stringlist=['a heavy drum section followed by uplifting chorus, with rock and roll influences'])


['Darkthrone' 'Old Star']
['T. Rex' 'The Slider']
['Bob Dylan' 'Bob Dylan: Rolling Thunder Revue: The 1975 Live Recordings']


## Embedding features and standard distance measures

This will be an average of the individual word embeddings for each sentence

## Embedding features and non-standard distance measures

WMD is an embedding specific distance measure; it assesses the "distance" between two documents in a meaningful way, even when they have no words in common, by using word2vec vector embeddings of words.




In [0]:
import gensim
from gensim.models import Word2Vec
from gensim.models import Phrases
from nltk.corpus import stopwords
from nltk import download
download('stopwords')
stop_words = stopwords.words('english')

def distance_embeddings(corpus, stringlist,trained=False):
  
  ###pre-processing
  def pre_processor(list):
    pp_corpus=[]
    for i in list:
      i = i.lower().split()
      i = [w for w in i if w not in stop_words]
      pp_corpus.append(i)
    return pp_corpus 

  pp_corpus = pre_processor(corpus)


  ###word embeddings
  #bigram_transformer = Phrases(album_corpus['review_text'])
  if trained == False:
    word_model = Word2Vec(pp_corpus, min_count=2, size=100, window=5, iter=100)
  else:
    word_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
  ###similarity calculation
  scores = []
  for i in range(0,len(corpus)):    
    scores.append(word_model.wmdistance(stringlist[0],pp_corpus[i]))


  ###print top 3 most similar
  indices = np.array(scores).argsort()[0:3]
  for i in indices:
    values = album_corpus.iloc[i][1:3]
    print(values.values)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
###this is not working well...
distance_embeddings(corpus=album_corpus['review_text'],stringlist=['a heavy drum section followed by uplifting chorus, with rock and roll influences'])




['Slowthai' 'Nothing Great About Britain']
['Megan Thee Stallion' 'Fever']
['Don Cherry' 'Brown Rice']


In [0]:
###this is not working well...
distance_embeddings(corpus=album_corpus['review_text'],stringlist=['a heavy drum section followed by uplifting chorus, with rock and roll influences'],trained=True)


  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


['Káryyn' 'The Quanta Series']
['Lil Nas X' '7 EP']
['Big K.R.I.T.' 'K.R.I.T. IZ HERE']


# Next steps

lookup list for UMG artists \\
alternative websites - any decent music.com \\
lyric search \\
artist cluster \\