<h2>Metadata generation for music reviews using Google’s Natural Language API</h2>

---

This workshop will introduce you to accessing Google Cloud resources through the command line and the Python client libraries.

First, you will load some review data from a Cloud Storage bucket and explore what it looks like. Then you will connect to the Cloud Natural Language API through the Python client library and see how each of its features responds.

Finally, apply the API to the reviews, and examine the added value of the results you get. If you like data exploration, spend some time cross-correlating these results with the other structured data found in the dataset. Or think about how the API has opened up new possibilities for further enriching this dataset.

<h3>Setting up access</h3>

In [0]:
from google.colab import auth

auth.authenticate_user()

!gsutil cp gs://datatonic-pitchfork-data/datatonic-external-training-24407c914082.json .

<h3>Importing Libraries</h3>

In [0]:
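# install the client library; pinned below 2.0 because the `enums` and `types`
# modules imported in the next cell were removed in google-cloud-language 2.0
!pip install --upgrade "google-cloud-language<2.0"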

In [0]:
# import libraries
import pandas as pd
import glob
import functools
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')
colours = ["#1637b5", "#79808c", "#1dbbf8", "#9e51da", "#f4454f"]
# sns.set() resets the palette to its default, so call it before set_palette()
sns.set(style="whitegrid", font="Arial")
sns.set_palette(colours)
%matplotlib inline

# instantiate a client
language_client = language.LanguageServiceClient.from_service_account_file("./datatonic-external-training-24407c914082.json")

<h3>Helper functions</h3>

In [0]:
def show_sentiments(annotations, n):
    score = annotations.document_sentiment.score
    magnitude = annotations.document_sentiment.magnitude

    for index, sentence in enumerate(annotations.sentences[:n]):
        sentence_sentiment = sentence.sentiment.score
        print('Sentence {} has a sentiment score of {:.3f}'.format(
            index + 1, sentence_sentiment))
        print('Sentence text:\n{}\n'.format(sentence.text.content))

    print('Overall Sentiment: score of {:.3f} with magnitude of {:.3f}'.format(
        score, magnitude))


def show_categories(categories, n):
    for category in categories.categories[:n]:
        print('=' * 20)
        print('name: {}'.format(category.name))
        print('confidence: {:.3f}'.format(category.confidence))
        
        
def show_sentences(syntax, n):
    for sentence in syntax.sentences[:n]:
        print('=' * 20)
        print('{}'.format(sentence.text.content))
        

def show_tokens(syntax, n):
    for token in syntax.tokens[:n]:
        print('=' * 20)
        print('text:  {}'.format(token.text.content))
        print('part_of_speech:\n{}'.format('  ' + str(token.part_of_speech).replace('\n', '\n  ')))
        print('lemma: {}'.format(token.lemma))
        

def show_entities(entities, n):
    for entity in entities.entities[:n]:
        print('=' * 20)
        print('         name: {0}'.format(entity.name))
        print('         type: {0}'.format(entity.Type.Name(entity.type)))
        print('     metadata: {0}'.format(dict(entity.metadata)))
        print('     salience: {:.4f}'.format(entity.salience))
        
        
def show_entity_sentiments(entity_sentiments, n):
    for entity in entity_sentiments.entities[:n]:
        print('=' * 20)
        print('         name: {0}'.format(entity.name))
        print('         type: {0}'.format(entity.Type.Name(entity.type)))
        print('        score: {:.3f}'.format(entity.sentiment.score))
        print('    magnitude: {:.3f}'.format(entity.sentiment.magnitude))

<h3>Downloading Dataset</h3>

Download the Pitchfork reviews dataset from Google Cloud Storage.

In [0]:
!gsutil -m cp -r gs://datatonic-pitchfork-data/*.csv .

In [0]:
# check that the files are present in current directory
!ls *.csv

Read the CSV files into pandas DataFrames.

In [0]:
content = pd.read_csv('content.csv', header=0, delimiter='|')
genres = pd.read_csv('genres.csv', header=0, delimiter='|')
labels = pd.read_csv('labels.csv', header=0, delimiter='|')
reviews = pd.read_csv('reviews.csv', header=0, delimiter='|')

<h3>Exploring the dataset</h3>

In [0]:
print("Content has {} rows and {} columns.".format(*content.shape))
print("Genres has {} rows and {} columns.".format(*genres.shape))
print("Labels has {} rows and {} columns.".format(*labels.shape))
print("Reviews has {} rows and {} columns.".format(*reviews.shape))

In [0]:
reviews.head()

Genres dataset's first 5 rows:

In [0]:
genres.head()

Labels dataset's first 5 rows:

In [0]:
labels.head()

Content dataset's first 5 rows:

In [0]:
content.head()

We might want to join the datasets on <code>reviewid</code> later to get more context for each review. For now, let's use only the content dataset, pass the free review text as is to Google's Natural Language API, and see what insights we get immediately.

Now, let's pick a review text and explore the Natural Language API. I've picked a review of The Beatles' Let It Be album. You are welcome to search for your favourite band or artist on https://pitchfork.com and copy the review ID from the URL. For example, for The Beatles, the URL is https://pitchfork.com/reviews/albums/13430-let-it-be/ and the review ID is 13430.
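If you would rather not copy the ID by hand, the sketch below pulls it out of a review URL. The helper name <code>review_id_from_url</code> is hypothetical, and the URL pattern is an assumption based only on the single example above, so other Pitchfork URLs may need a different pattern.

```python
import re


def review_id_from_url(url):
    """Return the numeric review ID embedded in a Pitchfork album review URL."""
    # Assumed pattern: the ID is the run of digits that starts the last
    # path segment, as in ".../reviews/albums/13430-let-it-be/".
    match = re.search(r"/reviews/albums/(\d+)-", url)
    if match is None:
        raise ValueError("No review ID found in URL: {}".format(url))
    return int(match.group(1))


print(review_id_from_url("https://pitchfork.com/reviews/albums/13430-let-it-be/"))
```

The returned integer can be assigned to <code>review_id</code> in the next cell.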

In [0]:
review_id = 13430
# select the text of the chosen review (the stray `row` assignment was unused)
review_text = content.loc[content.reviewid == review_id, 'content'].values[0]
print('{}...'.format(review_text[:205]))

In [0]:
document = types.Document(content=review_text, type=enums.Document.Type.PLAIN_TEXT)

<h3>Content classification</h3>

<p>The Natural Language API can be used for analyzing documents and obtaining lists of content categories that apply to the text in the document.</p>

In [0]:
categories = language_client.classify_text(document=document)

show_categories(categories, 5)

<br />
<p>The API managed to pick up the broad categories that the review falls under. For The Beatles' album, the categories are "Arts & Entertainment", "Music & Audio" and "Rock Music". Depending on which review you picked, you may see different categories. If you want to explore the API's capabilities further, there is a <a href="https://cloud.google.com/natural-language/">free demo</a> (scroll down a bit to "Try the API") that you can try. See whether the API can classify your text correctly if you describe an animal or an object without using the word for it! Below is the text I experimented with.</p>

<blockquote>
<p>These animals are domesticated felines of the genus felidae who are smaller than the majority of their wild cousins who share the genus. Efficient hunters with speed, sharp teeth, retractable claws, superior hearing, eyes which can function in near-complete darkness and a wide variety of vocalizations. Primarily short-haired, but with long-haired variations resulting from selective breeding. Markings can vary from a single solid color to various stripes, swirls, spots and differently-colored extremities, primarily in tones of brown, black, orange, red, white and grey.</p>

<p>They are popular as pets and generally considered sleek, beautiful, intelligent, fastidious and aloof, but are also known for their independence,  playfulness and the purr, a vocalization that can indicate contentedness, security and affection, as well as distress and self-comfort. Humans find the purr comforting and studies have shown that the purr reduces stress in humans.</p>
</blockquote>

<p>See whether you can find out which words contribute the most to the API's confidence that the category "/Pets & Animals/Pets/Cats" is a category it should associate with the text.</p>

<h3>Syntactic analysis</h3>
<p>The Natural Language API provides a powerful set of tools for analyzing and parsing text through syntactic analysis. Syntactic analysis consists of <b>sentence extraction</b> (breaks up the stream of text into a series of sentences) and <b>tokenization</b> (breaks up the stream of text into a series of tokens where each token corresponds to a single word).</p>

In [0]:
syntax = language_client.analyze_syntax(document=document)

<b>First 5 sentences</b>

In [0]:
show_sentences(syntax, 5)

<b>First 5 tokens</b>

In [0]:
show_tokens(syntax, 5)

<p>The <code>text</code> field contains the text data associated with the token. The <code>part_of_speech</code> field provides grammatical information about the token, including <a href="https://en.wikipedia.org/wiki/Morphology_(linguistics)">morphological information</a> such as its tense, person, number and gender (for more information, refer to the <a href="https://cloud.google.com/natural-language/docs/morphology">documentation</a>). The <code>lemma</code> field contains the "root" word on which the token is based, which allows you to canonicalize word usage within your text. For example, the words "write", "writing", "wrote" and "written" all share the lemma "write". Plural and singular forms share a lemma too: "house" and "houses" both map to "house".</p>
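To make the canonicalization idea concrete, here is a minimal sketch that counts words by lemma instead of by surface form. The (word, lemma) pairs are hand-written stand-ins taken from the example above; in the notebook they would come from <code>token.text.content</code> and <code>token.lemma</code>.

```python
from collections import Counter

# Hand-written (word, lemma) pairs standing in for API tokens.
tokens = [
    ("write", "write"), ("writing", "write"), ("wrote", "write"),
    ("written", "write"), ("house", "house"), ("houses", "house"),
]

# Counting surface forms treats every inflection as a separate word...
surface_counts = Counter(word for word, _ in tokens)
# ...while counting lemmas folds the inflections together.
lemma_counts = Counter(lemma for _, lemma in tokens)

print("distinct surface forms:", len(surface_counts))
print("counts by lemma:", lemma_counts.most_common())
```

Counting by lemma is usually the more useful view when you want word frequencies that ignore tense and number.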

<p>There are also other more advanced fields available. For more information, refer to the <a href="https://cloud.google.com/natural-language/docs/">Natural Language API documentation</a>.</p>

<h3>Sentiment analysis</h3>

In [0]:
sentiments = language_client.analyze_sentiment(document=document)

The sentiments for the first 5 sentences are:

In [0]:
show_sentiments(sentiments, 5)

<p>We can see that it is really easy to get started and obtain meaningful results very quickly. The language API picked up the text language automatically and preprocessed it without us having to do any cleaning.</p>

<h4>Guide to interpreting sentiment analysis values</h4>

<p>The <i>score</i> of a document's sentiment indicates the overall emotion of a document. The <i>magnitude</i> of a document's sentiment indicates how much emotional content is present within the document, and this value is often proportional to the length of the document.</p>

<p>It is important to note that the Natural Language API indicates differences between positive and negative emotion in a document, but does not identify specific positive and negative emotions. For example, "angry" and "sad" are both considered negative emotions. However, when the Natural Language API analyzes text that is considered "angry", or text that is considered "sad", the response only indicates that the sentiment in the text is negative, not "sad" or "angry".</p>

<p>A document with a neutral score (around <code>0.0</code>) may indicate a low-emotion document, or may indicate mixed emotions, with both high positive and negative values which cancel each other out. Generally, you can use <code>magnitude</code> values to disambiguate these cases, as truly neutral documents will have a low <code>magnitude</code> value, while mixed documents will have higher magnitude values.</p>

<p>When comparing documents to each other (especially documents of different length), make sure to use the <code>magnitude</code> values to calibrate your scores, as they can help you gauge the relevant amount of emotional content.</p>

<p>The chart below shows some sample values and how to interpret them:</p>

| Sentiment      | Sample Values |
| -------------- | ------------- |
| Clearly Positive      | "<b>score</b>": 0.8, "<b>magnitude</b>": 3.0       |
| Clearly Negative | "<b>score</b>": -0.6, "<b>magnitude</b>": 4.0 |
| Neutral | "<b>score</b>": 0.1, "<b>magnitude</b>": 0.0 |
| Mixed | "<b>score</b>": 0.0, "<b>magnitude</b>": 4.0 |
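
The interpretation rules above can be sketched as a small function. The function name and both cutoff values are illustrative assumptions, not thresholds defined by the API; you would tune them for your own documents.

```python
def interpret_sentiment(score, magnitude, score_cutoff=0.25, magnitude_cutoff=1.0):
    """Map a (score, magnitude) pair to a coarse label.

    Both cutoffs are illustrative assumptions, not values defined by the API.
    """
    if score >= score_cutoff:
        return "positive"
    if score <= -score_cutoff:
        return "negative"
    # Score is near zero: magnitude separates flat text from mixed text.
    return "mixed" if magnitude >= magnitude_cutoff else "neutral"


# The sample rows from the table above.
for score, magnitude in [(0.8, 3.0), (-0.6, 4.0), (0.1, 0.0), (0.0, 4.0)]:
    print(score, magnitude, interpret_sentiment(score, magnitude))
```

With these cutoffs, the four sample rows map to positive, negative, neutral and mixed, in that order.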

<h3>Entity analysis</h3>
<p>We can also perform entity analysis on our review text. Entity analysis provides information about entities in the text, which generally refer to named "things" such as famous individuals, landmarks, common objects, etc.</p>

In [0]:
entities = language_client.analyze_entities(document=document)

The first 5 entities with their names, types, metadata and salience are shown below.

In [0]:
show_entities(entities, 5)

<p>The <code>type</code> field indicates the type of the entity (e.g. person, location, event, etc.). This information helps distinguish and/or disambiguate entities, and can be used for writing patterns or extracting information. For example, a <code>type</code> value can help distinguish similarly named entities such as "Lawrence of Arabia", tagged as a <code>WORK_OF_ART</code> (film), from "T.E. Lawrence", tagged as a <code>PERSON</code>.</p>

<p>The <code>metadata</code> field contains source information about the entity's knowledge repository and may contain <code>wikipedia_url</code> and <code>mid</code> (machine-generated identifier corresponding to the entity's <a href="https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html">Google Knowledge Graph</a> entry which remains unique across languages and can be used to tie entities together from different languages).</p>

<p>The <code>salience</code> field indicates the importance or relevance of this entity to the entire document text. This score can assist information retrieval and summarization by prioritizing salient entities. Scores closer to <code>0.0</code> are less important, while scores closer to <code>1.0</code> are highly important.</p>
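
As a rough illustration of prioritizing salient entities, the sketch below filters and ranks entities by salience. The <code>Entity</code> namedtuple is a stand-in for the API's entity objects, the sample names and salience values are made up, and the <code>min_salience</code> threshold is an arbitrary assumption.

```python
from collections import namedtuple

# Stand-in for the API's entity objects; names and salience values are made up.
Entity = namedtuple("Entity", ["name", "type", "salience"])

sample_entities = [
    Entity("album", "WORK_OF_ART", 0.12),
    Entity("The Beatles", "ORGANIZATION", 0.41),
    Entity("London", "LOCATION", 0.03),
    Entity("Let It Be", "WORK_OF_ART", 0.27),
]


def top_entities(entities, n=3, min_salience=0.05):
    """Return the n most salient entities at or above min_salience."""
    kept = [e for e in entities if e.salience >= min_salience]
    return sorted(kept, key=lambda e: e.salience, reverse=True)[:n]


for entity in top_entities(sample_entities):
    print("{:<12} {:<13} {:.2f}".format(entity.name, entity.type, entity.salience))
```

The same filter could be applied to <code>entities.entities</code> from the API response, since each entity there also exposes <code>name</code> and <code>salience</code>.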

<h3>Entity sentiment analysis</h3>
<p>We can also combine named entities and sentiment analysis and obtain sentiments for each named entity.</p>

In [0]:
entity_sentiments = language_client.analyze_entity_sentiment(document=document)

The first 5 entity sentiments with their names, types, scores and magnitudes are shown below.

In [0]:
show_entity_sentiments(entity_sentiments, 5)

<h3>Extracting structured information</h3>

<p>Now that we've explored the Natural Language API, we can start thinking about how to extract more structured information from the text. We may need it to conform to a schema or we may need to create additional features for a machine learning model. The set of features of interest will vary depending on the use case, but for the purposes of this exercise, let's assume we want to be able to predict the review score. The overall sentiment of the text and the categories it falls under could make good features for this task. Let's join up the data and add the additional features.</p>

<p>First, let's merge <code>content</code> with <code>reviews</code> because they are one-to-one.</p>

In [0]:
df = pd.merge(content.dropna(), reviews.dropna(), how='inner', on='reviewid')

Next, let's group the labels per review and list them as comma separated values, so that we can join them to the dataframe from the previous step.

In [0]:
review_labels = labels.dropna().groupby(['reviewid'])['label'].apply(', '.join).reset_index()
df = pd.merge(df, review_labels, how='inner', on='reviewid')

Finally, let's do the same for <code>genres</code>.

In [0]:
review_genres = genres.dropna().groupby(['reviewid'])['genre'].apply(', '.join).reset_index()
df = pd.merge(df, review_genres, how='inner', on='reviewid')

Great, we have all the review information in one dataframe now. Let's have a peek.

In [0]:
df.head()

<p>Now, let's augment the dataframe above with some additional features. In particular, the sentiment of the text and the categories the text falls under would make useful features. We can also add a comma-separated list of important entities in the text as a feature. Let's get started.</p>

<p>Running classification on the entire dataframe might take a while and we might run into API rate limiting issues, so selecting a small subset of the data is a good way to demonstrate the process without spending too long on waiting for requests to complete.</p>
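
If you do process a larger subset and hit rate limits, one common pattern is to retry failed calls with exponentially growing pauses. The sketch below is a generic illustration: <code>call_with_backoff</code> and <code>flaky_request</code> are hypothetical names, the delays are arbitrary, and catching every <code>Exception</code> is a simplification that you would narrow to the rate-limit error your client actually raises.

```python
import time


def call_with_backoff(func, *args, max_attempts=4, base_delay=1.0, **kwargs):
    """Call func, retrying with exponentially growing pauses on failure."""
    for attempt in range(max_attempts):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_request(text):
    """Toy stand-in for an API call that fails on its first two invocations."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return len(text)


print(call_with_backoff(flaky_request, "some review text", base_delay=0.01))
print("attempts made:", calls["count"])
```

In the notebook, a helper like this could wrap the API-calling functions defined in the following cells, for example <code>mini_df.content.apply(lambda text: call_with_backoff(get_category, text))</code>.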

<p>Let's create a mini dataframe with just the first 10 rows.</p>

In [0]:
# copy the slice so that adding columns below does not modify a view of df
mini_df = df[:10].copy()

Then, let's run classification on the text for each one and add it to <code>mini_df</code> as a column.

In [0]:
def get_category(content):
    document = types.Document(content=content, type=enums.Document.Type.PLAIN_TEXT)
    categories = language_client.classify_text(document=document)
    return ', '.join([c.name for c in categories.categories])

mini_df_categories = mini_df.content.apply(get_category)
mini_df['categories'] = mini_df_categories

In [0]:
mini_df.head()

Next, let's obtain the sentiment scores and magnitudes for each review and add that as a column too.

In [0]:
def get_sentiment_scores(content):
    document = types.Document(content=content, type=enums.Document.Type.PLAIN_TEXT)
    sentiments = language_client.analyze_sentiment(document=document)
    score = sentiments.document_sentiment.score
    magnitude = sentiments.document_sentiment.magnitude
    return pd.Series([score, magnitude])

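# apply() returns a two-column DataFrame because the function returns a Series
mini_df_sentiment_scores = mini_df.content.apply(get_sentiment_scores)
# prefix the new columns so they cannot be confused with the review score already in the dataframe
mini_df_sentiment_scores.columns = ['sentiment_score', 'sentiment_magnitude']
mini_df = pd.concat([mini_df, mini_df_sentiment_scores], axis=1)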

Finally, let's add an <code>entities</code> column which includes the 5 entities with the highest salience scores.

In [0]:
def get_top_entities(content):
    document = types.Document(content=content, type=enums.Document.Type.PLAIN_TEXT)
    entities = language_client.analyze_entities(document=document)
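    # sort explicitly by salience rather than relying on the order of the response
    ranked = sorted(entities.entities, key=lambda e: e.salience, reverse=True)
    return ', '.join([e.name for e in ranked[:5]])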

top_entities = mini_df.content.apply(get_top_entities)
mini_df['entities'] = top_entities

In [0]:
mini_df.head()

<p>That's it! We have added a few additional features which a machine learning model trying to predict the review score can use. These features are much easier to encode and feed into a model than free-form text is. The Natural Language API has done most of the heavy lifting for us.</p>

<p>Feel free to experiment and see what other features can be extracted using the APIs!</p> 
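
To show what "easier to encode" can look like in practice, here is a minimal sketch that turns a comma-separated column into multi-hot (0/1) vectors using only the standard library. The function name <code>multi_hot</code> and the sample genre values are made up for illustration; the same idea applies to the <code>categories</code> and <code>entities</code> columns built above.

```python
def multi_hot(column, separator=", "):
    """Turn comma-separated strings into a vocabulary and 0/1 rows."""
    split_rows = [set(value.split(separator)) if value else set() for value in column]
    # The vocabulary is every distinct term seen in the column, in sorted order.
    vocabulary = sorted(set().union(*split_rows))
    rows = [[int(term in row) for term in vocabulary] for row in split_rows]
    return vocabulary, rows


# Made-up values shaped like the comma-separated genre column built earlier.
genre_column = ["rock, electronic", "rap", "rock"]
vocabulary, rows = multi_hot(genre_column)

print(vocabulary)
for row in rows:
    print(row)
```

Each row now has one numeric slot per term, which is a form most machine learning libraries accept directly.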

<h3>Using spaCy</h3>

<p>For natural language processing, spaCy is a well-regarded free Python library which supports similar linguistic features to the Natural Language API. You can read more about it <a href="https://spacy.io/usage/linguistic-features">here</a>. In this tutorial, we won't explore spaCy in detail, but it is good to know that an open source alternative is available.</p>

<p><b>spaCy supports:</b></p>
<ol>
    <li><a href="https://spacy.io/usage/linguistic-features#section-pos-tagging">Part-Of-Speech Tagging</a></li>
    <li><a href="https://spacy.io/usage/linguistic-features#section-dependency-parse">Dependency Parse</a></li>
    <li><a href="https://spacy.io/usage/linguistic-features#section-named-entities">Named Entities</a></li>
    <li><a href="https://spacy.io/usage/linguistic-features#section-tokenization">Tokenization</a></li>
    <li><a href="https://spacy.io/usage/linguistic-features#section-sbd">Sentence Segmentation</a></li>
    <li><a href="https://spacy.io/usage/linguistic-features#section-rule-based-matching">Rule-based Matching</a></li>
</ol>

<h3>Data exploration</h3>

<p>For the remainder of the tutorial, we will explore the merged dataset in more detail.</p>

<h4>Exploratory Analysis</h4>

Let us answer some of the questions below to gain a better understanding of the dataset, especially now that it is merged:

* How many distinct artists/albums/genres/record labels do we have in our dataset?
* What is the trend of reviews posted over time? When were the most reviews posted, and what are the most common times?
* Who are the most popular artists that are reviewed?
* What are the most popular albums/genres/record labels to write about?
* Who are the authors that have contributed the most?
* Which artists/albums/genres/record labels have the highest/lowest average scores?
* Are some authors more biased to giving high/low scores?
* How does the author type influence the score?

In [0]:
print("The time range of the dataset spans {} to {}".format(df["pub_date"].min(), df["pub_date"].max()))

In [0]:
# how many unique values across categories of interest?
print("Number of unique authors: {}".format(df["author"].nunique()))
print("Number of unique artists: {}".format(df["artist"].nunique()))
print("Number of unique albums: {}".format(df["title"].nunique()))
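# df holds comma-joined genres and labels per review, so count distinct values in the source tables instead
print("Number of unique genres: {}".format(genres["genre"].nunique()))
print("Number of unique record labels: {}".format(labels["label"].nunique()))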

In [0]:
def plot(df, group_col, kind="bar", xlab=None, ylab=None, title=None, fig_size=(8, 4)):
    # count rows per value of group_col
    trend = df.groupby([group_col]).size().reset_index()
    trend.columns = [group_col, "count"]
    plt.figure(figsize=fig_size)

    if kind == "ts":
        # sns.tsplot was deprecated and later removed; lineplot draws the same trend line
        ax = sns.lineplot(x=trend[group_col], y=trend["count"], color=colours[0])
    elif kind == "bar":
        trend = trend.sort_values(by="count", ascending=False)
        ax = sns.barplot(y=trend[group_col], x=trend["count"], color=colours[1])
    else:
        raise ValueError("kind must be 'ts' or 'bar', got {!r}".format(kind))
    ax.set(xlabel=xlab, ylabel=ylab, title=title)
    plt.show()

plot(df, "pub_year", "ts", "Publish Year", "Number of Reviews", "Yearly Trend of Reviews")
plot(df, "pub_month", "ts", "Publish Month", "Number of Reviews", "Monthly Trend of Reviews")
plot(df, "pub_weekday", "ts", "Publish Weekday", "Number of Reviews", "Day of Week Trend of Reviews")
print("Note that 0 corresponds to Monday, and 6 corresponds to Sunday.")

* The number of reviews was the greatest in <b>2016</b>. In general it has been increasing year on year, so it is plausible (though this dataset cannot confirm it) that the number of reviews in 2017 exceeded 2016. The biggest jump in absolute volume (of 467 reviews) was from 2001 to 2002.
* Most reviews are published in <b>October</b> (across all years), with a total of 1763 reviews. December drops significantly to 819, which is expected around Christmas, when there are probably also fewer album releases.
* Most reviews are published on <b>Tuesday</b>, which makes sense: until July 2015, new music was released on Tuesdays in the US, after which the release day moved to Fridays. Note that Pitchfork is an American website.

In [0]:
# total reviews per genre
plot(genres, "genre", "bar","Number of Reviews", "Genre", "Total Reviews by Genre", (10, 6))

<p>The volume of reviews for the <b>rock genre</b> is at least twice that of any other genre.</p>

<p>Let's see which artists were the most popular in each year. Where there is a tie, at most three artists are listed.</p>

In [0]:
artist_reviews_per_year = df.groupby(['pub_year','artist']).size().reset_index()
artist_reviews_per_year.columns = ['pub_year', 'artist', 'num_reviews']
artist_reviews_per_year = artist_reviews_per_year.sort_values(by=['pub_year', 'num_reviews'], ascending=False)

most_popular_artists_per_year = []

for year, group in artist_reviews_per_year.groupby('pub_year'):
    # highest number of reviews any artist received that year
    max_count = group['num_reviews'].max()
    # artists tied for that count, capped at 3; every year gets exactly one row
    top_artists = group.loc[group['num_reviews'] == max_count, 'artist'].head(3)
    most_popular_artists_per_year.append([year, ', '.join(top_artists), max_count])

most_popular_artists_per_year_df = pd.DataFrame(most_popular_artists_per_year)
most_popular_artists_per_year_df.columns = ['pub_year', 'artist', 'num_reviews']
most_popular_artists_per_year_df.sort_values(by=['pub_year'], ascending=False)

See whether you can answer the remaining questions posed at the start of this section by exploring the data further!