# Analysing Facebook (Pages & Groups) Data with IBM Watson

This notebook shows how we use a standard Facebook Analytics export that contains the text of posts, articles, and thumbnails, along with standard performance metrics such as likes, shares, and impressions.

**1st** we'll use the Natural Language Understanding and (optionally) Visual Recognition services from IBM Watson to enrich the Facebook posts, thumbnails, and articles by pulling out `Sentiment`, `Emotion`, `Entities`, `Keywords`, and `Images`. 

**2nd** we'll set up multiple pandas DataFrames that contain the values and metrics needed to find insights.

**3rd** we'll use charts to visualize the features that we discovered during enrichment and show how they correlate with student impressions.

We'll need these packages:
 - [Watson APIs Python SDK](https://github.com/watson-developer-cloud/python-sdk): a client library for Watson services.
 - <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank" rel="noopener noreferrer">Beautiful Soup</a>: a library to parse data from HTML for enriching the Facebook data.
 - <a href="https://ibm-cds-labs.github.io/pixiedust/" target="_blank" rel="noopener noreferrer">PixieDust</a>: a library to visualize the data. 

Install the [Watson Python SDK](https://pypi.org/project/ibm-watson/) package:

In [1]:
!pip -q install --user --no-warn-script-location ibm-watson==4.7.1

Install the [Beautiful Soup](https://pypi.org/project/beautifulsoup4) package:

In [2]:
!pip -q install --user beautifulsoup4==4.9.2

Install the [PixieDust](https://pypi.org/project/pixiedust/) library:

In [3]:
!pip -q install --user --no-warn-script-location --upgrade pixiedust==1.1.18

Importing all Packages and Libraries

In [4]:
import json
import sys

from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson import VisualRecognitionV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import Features, EntitiesOptions, KeywordsOptions, EmotionOptions, SentimentOptions

import operator
from functools import reduce
from io import StringIO
import numpy as np
from bs4 import BeautifulSoup as bs
from operator import itemgetter
from os.path import join, dirname
import pandas as pd
import requests

# Suppress some pandas warnings
pd.options.mode.chained_assignment = None  # default='warn'
# Suppress SSL warnings
requests.packages.urllib3.disable_warnings()

Service Credentials From IBM Cloud for Watson Services

* Watson Natural Language Understanding service 
    * Service for [Natural Language Understanding (NLU)](https://cloud.ibm.com/catalog/services/natural-language-understanding). 
* Watson Visual Recognition service 
    * Service for [Visual Recognition](https://cloud.ibm.com/catalog/services/visual-recognition).

In [5]:
#@hidden_cell

# Watson Natural Language Understanding (NLU)
NATURAL_LANGUAGE_UNDERSTANDING_API_KEY = '7MGwCGJ9qF4zahtrQl3eN6vn2RxBqiQ5YhWZqi4Xee3E'
NATURAL_LANGUAGE_UNDERSTANDING_URL = 'https://api.us-south.natural-language-understanding.watson.cloud.ibm.com/instances/68ba0285-8044-46d0-a3cb-6a9b334e1879'

# Watson Visual Recognition
VISUAL_RECOGNITION_API_KEY = 'pfY63VzhP5NqpR9-5AZMFxmeajkJdjxL9mBDNre_cOh-'
VISUAL_RECOGNITION_URL = 'https://api.us-south.visual-recognition.watson.cloud.ibm.com/instances/4f4bc535-5e22-4923-816a-f4c1afb058d1'


In [6]:
# Create the Watson clients

nlu_auth = IAMAuthenticator(NATURAL_LANGUAGE_UNDERSTANDING_API_KEY)
nlu = NaturalLanguageUnderstandingV1(version='2020-08-01',
                                     authenticator=nlu_auth)
nlu.set_service_url(NATURAL_LANGUAGE_UNDERSTANDING_URL)

visual_recognition = False  # Making visrec optional.
if VISUAL_RECOGNITION_API_KEY and VISUAL_RECOGNITION_URL:
    vr_auth = IAMAuthenticator(VISUAL_RECOGNITION_API_KEY)
    visual_recognition = VisualRecognitionV3(version='2020-08-01',
                                             authenticator=vr_auth)
    visual_recognition.set_service_url(VISUAL_RECOGNITION_URL)
else:
    print("Skipping Visual Recognition")
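The credentials above are written directly into the notebook. As an alternative, here is a minimal sketch that reads them from environment variables. The `load_credentials` helper and the `<PREFIX>_API_KEY` / `<PREFIX>_URL` variable names are assumptions made for illustration; they are not part of the Watson SDK.

```python
import os


def load_credentials(prefix, env=os.environ):
    """Return (api_key, url) for a service, or (None, None) if either is unset.

    The variable names <PREFIX>_API_KEY and <PREFIX>_URL are hypothetical.
    """
    api_key = env.get(prefix + "_API_KEY")
    url = env.get(prefix + "_URL")
    if api_key and url:
        return api_key, url
    return None, None


# Stand-in environment with placeholder values (not real keys)
fake_env = {
    "NATURAL_LANGUAGE_UNDERSTANDING_API_KEY": "placeholder-key",
    "NATURAL_LANGUAGE_UNDERSTANDING_URL": "https://example.invalid/nlu",
}

nlu_key, nlu_url = load_credentials("NATURAL_LANGUAGE_UNDERSTANDING", fake_env)
vr_key, vr_url = load_credentials("VISUAL_RECOGNITION", fake_env)

# Visual Recognition stays optional: both values are None when it is not configured
print("NLU configured:", nlu_key is not None)
print("Visual Recognition configured:", vr_key is not None)
```

With real environment variables set, the returned values could be passed to `IAMAuthenticator` and `set_service_url` in the same way as the constants above.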


### 2nd Loading Data
The data you'll be analyzing is a sample of a standard export of the Facebook Insights post information from the "ETA 16 Dolores Hidalgo C.I.N". Engagement metrics such as clicks, impressions, and so on have been altered and do not reflect actual post performance data.

In [7]:
import types
import pandas as pd
df_data_1 = pd.read_csv('./data/facebook_data.csv')
df_data_1.head()

Unnamed: 0,Post ID,Permalink,Post Message,Type,Countries,Languages,Posted,Audience Targeting,Lifetime Post Total Reach,Lifetime Post organic reach,...,Lifetime Matched Audience Targeting Consumptions by Type - other clicks,Lifetime Matched Audience Targeting Consumptions by Type - link clicks,Lifetime Matched Audience Targeting Consumptions by Type - photo view,Lifetime Matched Audience Targeting Consumptions by Type - video play,Lifetime Negative Feedback from Users by Type - hide_all_clicks,Lifetime Negative Feedback from Users by Type - hide_clicks,Lifetime Negative Feedback from Users by Type - report_spam_clicks,Lifetime Negative Feedback by Type - hide_all_clicks,Lifetime Negative Feedback by Type - hide_clicks,Lifetime Negative Feedback by Type - report_spam_clicks
0,,,,,,,,,Lifetime: The number of people who had your Pa...,Lifetime: The number of people who had your Pa...,...,Lifetime: The number of clicks anywhere in the...,,,,Lifetime: The number of times people have give...,,,Lifetime: The number of people who have given ...,,
1,187446750783_10156099244995784,https://www.facebook.com/ibmwatson/posts/10156...,"In 2018, IBM has made significant strides towa...",Link,,,12/26/18 9:22,,13560,13560,...,53,84.0,,,3,4.0,,3,4.0,
2,187446750783_10156082794855784,https://www.facebook.com/ibmwatson/posts/10156...,"Available now on IBM Cloud and Cloud Private, ...",Photo,,,12/18/18 7:15,,7837,7837,...,25,17.0,47.0,,4,,,4,,
3,187446750783_10156079731075784,https://www.facebook.com/ibmwatson/posts/10156...,A year in review in AI: https://ibm.co/2QAlo29,Link,,,12/16/18 14:54,,10727,10727,...,39,129.0,,,3,4.0,,3,4.0,
4,187446750783_10156067568775784,https://www.facebook.com/ibmwatson/posts/10156...,IBM Watson,SharedVideo,,,12/10/18 14:06,,7937,7937,...,191,80.0,,,1,4.0,,1,4.0,


Variables:
 - The name of the DataFrame
 - Credentials for the source file
 - A file name for the enriched DataFrame

In [8]:
# Make sure this uses the variable above. The number will vary in the inserted code.
try:
    df = df_data_1
except NameError as e:
    print('Error: Setup is incorrect or incomplete.\n')
    raise

Using our credentials

In [9]:
credentials_1 = {
    'IAM_SERVICE_ID': 'iam-ServiceId-abff04b7-b6e9-4b78-8ea5-76efe48912de',
    'IBM_API_KEY_ID': 'gYRYR_gfr-GnwdSKMXUYQHjn4u6p9cGNIfsMYcJA_K9z',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.cloud.ibm.com/oidc/token',
    'BUCKET': 'ivyhacks-donotdelete-pr-myhkmp8kqkr9uk',
    'FILE': 'example_facebook_data.csv'
}

In [10]:
try:
    credentials = credentials_1
except NameError as e:
    print('Error: Setup is incorrect or incomplete.\n')
    raise

### Preparing Data
We'll prepare the data by cleansing it and extracting the URLs. 
Many of the posts contain both text and a URL. 
First we have to separate URLs from the text so that they can be analyzed separately. 
After that we have to get thumbnails for the photos and links, and convert any shortened URLs to full URLs.

Data Cleansing with Python

To cleanse the data, we'll rename a column, remove noticeable noise, and pull the URLs that were embedded in each Facebook post into a new column so they can be run through NLU.

Rename the `Post Message` column to `Text`:

In [11]:
df.rename(columns={'Post Message': 'Text'}, inplace=True)

In [12]:
# Drop the rows that have no value for the text.
df.dropna(subset=['Text'], inplace=True)

Next, we use the `str.partition` function to split each `Text` value at the first occurrence of "http" and of "www", saving the results in new DataFrames. Everything from the first match onward is treated as the web address and added to a new `Link` column in the original DataFrame. This captures addresses that start with https, http, or www.

In [13]:
df_http = df["Text"].str.partition("http")
df_www = df["Text"].str.partition("www")

# Combine delimiters with actual links
df_http["Link"] = df_http[1].map(str) + df_http[2]
df_www["Link1"] = df_www[1].map(str) + df_www[2]

# Include only Link columns
df_http.drop(df_http.columns[0:3], axis=1, inplace = True)
df_www.drop(df_www.columns[0:3], axis=1, inplace = True)

# Merge http and www DataFrames
dfmerge = pd.concat([df_http, df_www], axis=1)

# The following steps will allow you to merge data columns from the left to the right
dfmerge = dfmerge.apply(lambda x: x.str.strip()).replace('', np.nan)

# Use fillna to fill any blanks with the Link1 column
# (assign the result; calling fillna in place on a selected column is unreliable)
dfmerge["Link"] = dfmerge["Link"].fillna(dfmerge["Link1"])

# Delete Link1 (www column)
dfmerge.drop("Link1", axis=1, inplace = True)

# Combine Link data frame
df = pd.concat([dfmerge,df], axis = 1)

# Make sure text column is a string
df["Text"] = df["Text"].astype("str")

# Strip links from Text column
df['Text'] = df['Text'].apply(lambda x: x.split('http')[0])
df['Text'] = df['Text'].apply(lambda x: x.split('www')[0])
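To see what `str.partition` does in the cell above, here is a small sketch on three made-up posts. The sample strings and variable names are illustrative only. It also shows one limitation of this approach: any words that follow the link end up inside the `Link` value.

```python
import numpy as np
import pandas as pd

# Made-up posts: one https link, one www link followed by text, one without a link
posts = pd.Series([
    "Read our recap: https://example.com/recap",
    "Visit www.example.org for details",
    "No link in this post",
])

# partition returns three columns: text before, the separator, text after
parts_http = posts.str.partition("http")
parts_www = posts.str.partition("www")

# Re-attach the separator to the remainder; no match leaves an empty string
http_links = (parts_http[1] + parts_http[2]).str.strip().replace("", np.nan)
www_links = (parts_www[1] + parts_www[2]).str.strip().replace("", np.nan)

# Prefer the http match and fall back to the www match
links = http_links.fillna(www_links)

# Keep only the text that precedes the first link
text_only = posts.str.split("http").str[0].str.split("www").str[0].str.strip()

print(links.tolist())
print(text_only.tolist())
```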

Extracting thumbnails and extended links

A standard Facebook export does not provide the thumbnail that usually summarizes the link or photo associated with each post. We use the Beautiful Soup library to parse the HTML behind each post's link and extract the thumbnail image URL and description text:

In [14]:
# Links are already strings (or NaN when a post has no link), so no conversion is needed.

piclinks = []
description = []
for url in df["Link"]:
    if pd.isnull(url):
        piclinks.append("")
        description.append("")
        continue
        
    try:
        # Skipping certificate check with verify=False.
        page3 = requests.get(url, verify=False)
        if page3.status_code != requests.codes.ok:
            piclinks.append("")
            description.append("")
            continue
    except Exception as e:
        print("Skipping url %s: %s" % (url, e))
        piclinks.append("")
        description.append("")
        continue
        
    soup3 = bs(page3.text,"lxml")
    
    pic = soup3.find('meta', property ="og:image")
    if pic:
        piclinks.append(pic["content"])
    else: 
        piclinks.append("")
    
    content = None
    desc = soup3.find(attrs={'name':'Description'})
    if desc:
        content = desc['content']
    if not content or content == 'null':
        # Try again with lowercase description
        desc = soup3.find(attrs={'name':'description'})
        if desc:
            content = desc['content']
    if not content or content == 'null':
        description.append("")
    else:
        description.append(content)
            
# Saving thumbnail descriptions to df in a column titled 'Thumbnails'
df["Thumbnails"] = description
# Saving image links to df in a column titled 'Image'
df["Image"] = piclinks

Skipping url https://ibm.co/2zZaWrE: HTTPConnectionPool(host='softwaredownloads-prod.mrs-prod-7d4bdc08e7ddc90fa89b373d95c240eb-0000.us-south.containers.appdomain.cloud', port=80): Max retries exceeded with url: /mrs/https:/www.ibm.com/account/reg/us-en/signup?formid=urx-20732&cm_mmc=OSocial_Facebook-_-Watson%20Core_Watson%20Core%20-%20Conversation-_-WW_WW-_-Virtual%20Summit%20Recorded%20Session%20Sign%20Up%20Nov%2022&cm_mmca2=10004432&cm_mmca1=000027BD (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000204A0F64D48>: Failed to establish a new connection: [WinError 10060] Se produjo un error durante el intento de conexión ya que la parte conectada no respondió adecuadamente tras un periodo de tiempo, o bien se produjo un error en la conexión establecida ya que el host conectado no ha podido responder'))
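The thumbnail lookup in the cell above can be tried offline on a hard-coded page. The sketch below is dependency-free: it swaps Beautiful Soup for the standard library's `html.parser`, so it illustrates the same idea (read the `og:image` and description `<meta>` tags) rather than the notebook's exact code. The `MetaExtractor` class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the og:image URL and the description from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.image = ""
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        content = attrs.get("content") or ""
        # Thumbnail image, published as an Open Graph property
        if attrs.get("property") == "og:image" and not self.image:
            self.image = content
        # Accept both "Description" and "description"; ignore the literal "null"
        name = (attrs.get("name") or "").lower()
        if name == "description" and not self.description and content != "null":
            self.description = content


# Made-up page standing in for the HTML returned by requests.get(url).text
sample_html = """
<html><head>
<meta property="og:image" content="https://example.com/thumb.png">
<meta name="Description" content="A short summary of the article.">
</head><body></body></html>
"""

parser = MetaExtractor()
parser.feed(sample_html)
print(parser.image)
print(parser.description)
```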


Next, we convert shortened links to full links by following their redirects with the requests module. This is only necessary if the Facebook page uses different links than the articles themselves.

In [None]:
shortlink = df["Link"]
extendedlink = []

for link in shortlink:
    if isinstance(link, float):  # Float is not a URL, probably NaN.
        extendedlink.append('')
    else:
        try:
            extended_link = requests.Session().head(link, allow_redirects=True).url
            extendedlink.append(extended_link)
        except Exception as e:
            print("Skipping link %s: %s" % (link, e))
            extendedlink.append('')

df["Extended Links"] = extendedlink

#### NLU for the Post Text

The following script is an example of how to use Natural Language Understanding to iterate through each post and extract enrichment features for future analysis.

For this example, we are looking at the `Text` column in our DataFrame, which contains the text of each post. NLU can also iterate through a column of URLs, or other freeform text. There's a list within a list for the Keywords and Entities features to allow gathering multiple entities and keywords from each piece of text.

Each extracted feature is appended to the DataFrame in a new column that's defined at the end of the script. To run this same script for another column, set the loop iterable to that column's name and update the new column names as necessary. If the column contains URLs, change the `text=text` parameter to `url=text`.

In [None]:
# Define the list of features to get enrichment values for entities, keywords, emotion and sentiment
features = Features(entities=EntitiesOptions(), keywords=KeywordsOptions(), emotion=EmotionOptions(), sentiment=SentimentOptions())

overallSentimentScore = []
overallSentimentType = []
highestEmotion = []
highestEmotionScore = []
kywords = []
entities = []

# Go through every response and enrich the text using NLU.
for text in df['Text']:
  if not text:
    # print("Text is empty")
    overallSentimentScore.append('0')
    overallSentimentType.append('0')
    highestEmotion.append("")
    highestEmotionScore.append("")
    kywords.append("")
    entities.append("")
    continue
  else:
    # We are assuming English to avoid errors when the language cannot be detected.
    enriched_json = nlu.analyze(text=text, features=features, language='en').get_result()

    # Get the SENTIMENT score and type
    if 'sentiment' in enriched_json:
        if('score' in enriched_json['sentiment']["document"]):
            overallSentimentScore.append(enriched_json["sentiment"]["document"]["score"])
        else:
            overallSentimentScore.append('0')

        if('label' in enriched_json['sentiment']["document"]):
            overallSentimentType.append(enriched_json["sentiment"]["document"]["label"])
        else:
            overallSentimentType.append('0')
    else:
        overallSentimentScore.append('0')
        overallSentimentType.append('0')

    # Read the EMOTIONS into a dict and get the key (emotion) with maximum value
    if 'emotion' in enriched_json:
        me = max(enriched_json["emotion"]["document"]["emotion"].items(), key=operator.itemgetter(1))[0]
        highestEmotion.append(me)
        highestEmotionScore.append(enriched_json["emotion"]["document"]["emotion"][me])
    else:
        highestEmotion.append("")
        highestEmotionScore.append("")

    # Iterate and keep KEYWORDS with a relevance of at least 0.7
    if 'keywords' in enriched_json:
        tmpkw = []
        for kw in enriched_json['keywords']:
            if(float(kw["relevance"]) >= 0.7):
                tmpkw.append(kw["text"])
        # Convert multiple keywords in a list to a string and append the string
        kywords.append(', '.join(tmpkw))
    else:
        kywords.append("")
            
    # Iterate and keep the types of Entities with a relevance of at least 0.3
    if 'entities' in enriched_json:
        tmpent = []
        for ent in enriched_json['entities']: 
            if(float(ent["relevance"]) >= 0.3):
                tmpent.append(ent["type"])
 
        # Convert multiple entities in a list to a string and append the string
        entities.append(', '.join(tmpent))
    else:
        entities.append("")
    
# Create columns from the list and append to the DataFrame
if highestEmotion:
    df['TextHighestEmotion'] = highestEmotion
if highestEmotionScore:
    df['TextHighestEmotionScore'] = highestEmotionScore

if overallSentimentType:
    df['TextOverallSentimentType'] = overallSentimentType
if overallSentimentScore:
    df['TextOverallSentimentScore'] = overallSentimentScore

df['TextKeywords'] = kywords
df['TextEntities'] = entities
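The per-row logic in the loop above can also be written as one small function, which makes it easier to reuse for the thumbnail and article columns. This is a sketch under assumptions: `summarize_enrichment` is a hypothetical helper, and `sample` is a hand-written dictionary that only mimics the fields the loop reads (`sentiment`, `emotion`, `keywords`, `entities`); it is not real service output.

```python
from operator import itemgetter


def summarize_enrichment(enriched, keyword_min=0.7, entity_min=0.3):
    """Flatten one NLU-style result dict into the six values stored per row."""
    sentiment = enriched.get("sentiment", {}).get("document", {})
    emotions = enriched.get("emotion", {}).get("document", {}).get("emotion", {})

    # Pick the emotion with the highest score, if any were returned
    if emotions:
        top_emotion, top_score = max(emotions.items(), key=itemgetter(1))
    else:
        top_emotion, top_score = "", ""

    # Keep keyword texts and entity types that meet the relevance thresholds
    keywords = [kw["text"] for kw in enriched.get("keywords", [])
                if float(kw["relevance"]) >= keyword_min]
    entity_types = [ent["type"] for ent in enriched.get("entities", [])
                    if float(ent["relevance"]) >= entity_min]

    return {
        "SentimentScore": sentiment.get("score", ""),
        "SentimentType": sentiment.get("label", ""),
        "HighestEmotion": top_emotion,
        "HighestEmotionScore": top_score,
        "Keywords": ", ".join(keywords),
        "Entities": ", ".join(entity_types),
    }


# Hand-written stand-in for a response; values are invented
sample = {
    "sentiment": {"document": {"score": 0.82, "label": "positive"}},
    "emotion": {"document": {"emotion": {
        "joy": 0.61, "sadness": 0.05, "anger": 0.02, "fear": 0.03, "disgust": 0.01}}},
    "keywords": [{"text": "machine learning", "relevance": 0.93},
                 {"text": "event", "relevance": 0.41}],
    "entities": [{"type": "Company", "text": "IBM", "relevance": 0.88},
                 {"type": "Location", "text": "Austin", "relevance": 0.12}],
}

row = summarize_enrichment(sample)
print(row)
```

Because every key is always present in the returned dictionary, each call contributes exactly one value per column, which keeps the result lists the same length as the DataFrame.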

After we extract all of the keywords and entities from each post, we have columns with multiple keywords and entities separated by commas. For our analysis in Part II, we also want the top keyword and entity for each post, so we add two new columns, `MaxTextKeywords` and `MaxTextEntity`, that hold the first item from each list.

In [None]:
# Choose first of Keywords and Entities
df["MaxTextKeywords"] = df["TextKeywords"].apply(lambda x: x.split(',')[0])
df["MaxTextEntity"] = df["TextEntities"].apply(lambda x: x.split(',')[0])

#### NLU for Thumbnail Text

We will repeat the same process for Thumbnails and Article Text.

In [None]:
# Define the list of features to get enrichment values for entities, keywords, emotion and sentiment
features = Features(entities=EntitiesOptions(), keywords=KeywordsOptions(), emotion=EmotionOptions(), sentiment=SentimentOptions())

overallSentimentScore = []
overallSentimentType = []
highestEmotion = []
highestEmotionScore = []
kywords = []
entities = []

# Go through every response and enrich the text using NLU.
for text in df['Thumbnails']:
    if not text:
        overallSentimentScore.append(' ')
        overallSentimentType.append(' ')
        highestEmotion.append(' ')
        highestEmotionScore.append(' ')
        kywords.append(' ')
        entities.append(' ')
        continue

    enriched_json = nlu.analyze(text=text, features=features, language='en').get_result()

    # Get the SENTIMENT score and type
    if 'sentiment' in enriched_json:
        if('score' in enriched_json['sentiment']["document"]):
            overallSentimentScore.append(enriched_json["sentiment"]["document"]["score"])
        else:
            overallSentimentScore.append("")

        if('label' in enriched_json['sentiment']["document"]):
            overallSentimentType.append(enriched_json["sentiment"]["document"]["label"])
        else:
            overallSentimentType.append("")
    else:
        # Append placeholders so the lists stay the same length as the DataFrame
        overallSentimentScore.append("")
        overallSentimentType.append("")

    # Read the EMOTIONS into a dict and get the key (emotion) with maximum value
    if 'emotion' in enriched_json:
        me = max(enriched_json["emotion"]["document"]["emotion"].items(), key=operator.itemgetter(1))[0]
        highestEmotion.append(me)
        highestEmotionScore.append(enriched_json["emotion"]["document"]["emotion"][me])

    else:
        highestEmotion.append("")
        highestEmotionScore.append("")

    # Iterate and keep KEYWORDS with a relevance of at least 0.7
    if 'keywords' in enriched_json:
        tmpkw = []
        for kw in enriched_json['keywords']:
            if(float(kw["relevance"]) >= 0.7):
                tmpkw.append(kw["text"])
        # Convert multiple keywords in a list to a string and append the string
        kywords.append(', '.join(tmpkw))
    else:
        # Append a placeholder so the list stays the same length as the DataFrame
        kywords.append("")
     
    # Iterate and keep the types of Entities with a relevance of at least 0.3
    if 'entities' in enriched_json:
        tmpent = []
        for ent in enriched_json['entities']:              
            if(float(ent["relevance"]) >= 0.3):
                tmpent.append(ent["type"])
        # Convert multiple entities in a list to a string and append the string
        entities.append(', '.join(tmpent))
    else:
        entities.append("")     
  
# Create columns from the list and append to the DataFrame
if highestEmotion:
    df['ThumbnailHighestEmotion'] = highestEmotion
if highestEmotionScore:
    df['ThumbnailHighestEmotionScore'] = highestEmotionScore

if overallSentimentType:
    df['ThumbnailOverallSentimentType'] = overallSentimentType
if overallSentimentScore:
    df['ThumbnailOverallSentimentScore'] = overallSentimentScore

df['ThumbnailKeywords'] = kywords
df['ThumbnailEntities'] = entities

Add two new columns to capture the `MaxThumbnailKeywords` and `MaxThumbnailEntity`:

In [None]:
# Set 'Max' to first one from keywords and entities lists
df["MaxThumbnailKeywords"] = df["ThumbnailKeywords"].apply(lambda x: x.split(',')[0])
df["MaxThumbnailEntity"] = df["ThumbnailEntities"].apply(lambda x: x.split(',')[0])

#### NLU for Article Text

In [None]:
# Define the list of features to get enrichment values for entities, keywords, emotion and sentiment
features = Features(entities=EntitiesOptions(), keywords=KeywordsOptions(), emotion=EmotionOptions(), sentiment=SentimentOptions())

overallSentimentScore = []
overallSentimentType = []
highestEmotion = []
highestEmotionScore = []
kywords = []
entities = []
article_text = []
        
# Go through every response and enrich the article using NLU
for url in df['Extended Links']:
    if not url:
        overallSentimentScore.append(' ')
        overallSentimentType.append(' ')
        highestEmotion.append(' ')
        highestEmotionScore.append(' ')
        kywords.append(' ')
        entities.append(' ')
        article_text.append(' ')
        continue

    # Run links through NLU to get entities, keywords, emotion and sentiment.
    # Use return_analyzed_text to extract text for Tone Analyzer to use.
    
    try:
        enriched_json = nlu.analyze(url=url,
                                features=features,
                                language='en',
                                return_analyzed_text=True).get_result()
        article_text.append(enriched_json["analyzed_text"])
    except Exception as e:
        print("Skipping url %s: %s" % (url, e))
        overallSentimentScore.append(' ')
        overallSentimentType.append(' ')
        highestEmotion.append(' ')
        highestEmotionScore.append(' ')
        kywords.append(' ')
        entities.append(' ')
        article_text.append(' ')
        continue
      
    
    

    # Get the SENTIMENT score and type
    if 'sentiment' in enriched_json:
        if('score' in enriched_json['sentiment']["document"]):
            overallSentimentScore.append(enriched_json["sentiment"]["document"]["score"])
        else:
            overallSentimentScore.append('')

        if('label' in enriched_json['sentiment']["document"]):
            overallSentimentType.append(enriched_json["sentiment"]["document"]["label"])
        else:
            overallSentimentType.append('')
    else:
        # Append placeholders so the lists stay the same length as the DataFrame
        overallSentimentScore.append('')
        overallSentimentType.append('')

    # Read the EMOTIONS into a dict and get the key (emotion) with maximum value
    if 'emotion' in enriched_json:
        me = max(enriched_json["emotion"]["document"]["emotion"].items(), key=operator.itemgetter(1))[0]
        highestEmotion.append(me)
        highestEmotionScore.append(enriched_json["emotion"]["document"]["emotion"][me])

    else:
        highestEmotion.append('')
        highestEmotionScore.append('')

    # Iterate and keep KEYWORDS with a relevance of at least 0.7
    if 'keywords' in enriched_json:
        tmpkw = []
        for kw in enriched_json['keywords']:
            if(float(kw["relevance"]) >= 0.7):
                tmpkw.append(kw["text"])
        # Convert multiple keywords in a list to a string and append the string
        kywords.append(', '.join(tmpkw))
    else: 
        kywords.append("")
            
    # Iterate and keep the types of Entities with a relevance of at least 0.3
    if 'entities' in enriched_json:
        tmpent = []
        for ent in enriched_json['entities']:               
            if(float(ent["relevance"]) >= 0.3):
                tmpent.append(ent["type"])
        # Convert multiple entities in a list to a string and append the string
        entities.append(', '.join(tmpent))
    else:
        entities.append("")
    
# Create columns from the list and append to the DataFrame
if highestEmotion:
    df['LinkHighestEmotion'] = highestEmotion
if highestEmotionScore:
    df['LinkHighestEmotionScore'] = highestEmotionScore

if overallSentimentType:
    df['LinkOverallSentimentType'] = overallSentimentType
if overallSentimentScore:
    df['LinkOverallSentimentScore'] = overallSentimentScore

df['LinkKeywords'] = kywords
df['LinkEntities'] = entities
df['Article Text'] = article_text

Add two new columns to capture the `MaxLinkKeywords` and `MaxLinkEntity`:

In [None]:
# Set 'Max' to first one from keywords and entities lists
df["MaxLinkKeywords"] = df["LinkKeywords"].apply(lambda x: x.split(',')[0])
df["MaxLinkEntity"] = df["LinkEntities"].apply(lambda x: x.split(',')[0])

Enrichment is now COMPLETE  
Save a copy of the enriched DataFrame as a file in Cloud Object Storage. 

In [None]:
# ibm_boto3 and Config come from the IBM Cloud Object Storage SDK (ibm-cos-sdk)
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=credentials['IBM_API_KEY_ID'],
    ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
    ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
    config=Config(signature_version='oauth'),
    endpoint_url=credentials['ENDPOINT'])

In [None]:
# Build the enriched file name from the original filename.
localfilename = 'enriched_' + credentials['FILE']

# Write a CSV file from the enriched pandas DataFrame.
df.to_csv(localfilename, index=False)

# Use the client's upload_file method with the credentials above to put the file in Object Storage.
cos.upload_file(localfilename, Bucket=credentials['BUCKET'],Key=localfilename)

In [None]:
# If you want to use the enriched local file, you can read it back in.
# This might be handy if you already enriched and just want to re-run
# from this cell and below. Uncomment the following line.

# df = pd.read_csv(localfilename)

Part II - Data Preparation

Prepare Multiple DataFrames for Visualizations
Before we can create the separate tables for each Watson feature, we need to organize and reformat the data. First, we need to determine which data points are tied to metrics. Second, we need to make sure each metric is numeric. 

In [None]:
# Put the lifetime metrics in a list
metrics = [metric for metric in df.columns.values.tolist() if 'Lifetime' in metric]
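To illustrate the two steps described above, here is a small sketch on a made-up frame. It assumes, based on the `head()` output earlier, that the first row of the export holds column descriptions, which would cause pandas to read the metric columns as strings. The toy values and the shortened description text are invented for illustration.

```python
import pandas as pd

# Toy export: row 0 mimics the description row, the rest are posts
toy = pd.DataFrame({
    "Text": ["", "Post A", "Post B"],
    "Lifetime Post Total Reach": ["Lifetime: description text", "13560", "7837"],
    "Type": [None, "Link", "Photo"],
})

# Step 1: the metric columns are the ones whose name contains "Lifetime"
metrics = [column for column in toy.columns if "Lifetime" in column]

# Step 2: coerce each metric to a number; text that cannot be parsed becomes NaN
for column in metrics:
    toy[column] = pd.to_numeric(toy[column], errors="coerce")

print(metrics)
print(toy["Lifetime Post Total Reach"].tolist())
```

Using `errors="coerce"` trades strictness for convenience: unparseable cells silently become `NaN`, so it can be worth checking how many values were lost before charting.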

Create a Consolidated Sentiment and Emotion DataFrame
We'll create a DataFrame for the sentiment and emotion of the post text and a DataFrame for the sentiment and emotion of the article text. Then we'll combine them into one DataFrame.


#### Post Sentiment and Emotion DataFrame

In [None]:
# Create a list with only Post sentiment and emotion values
post_tones = ["Text","TextHighestEmotion", "TextHighestEmotionScore", "TextOverallSentimentType", "TextOverallSentimentScore"]

# Append DataFrame with these metrics
post_tones.extend(metrics)

# Create a new DataFrame with metrics and sentiment and emotion
df_post_tones = df[post_tones]

# Determine which tone values are supposed to be numeric and ensure they are numeric. 
post_numeric_values = ["TextHighestEmotionScore", "TextOverallSentimentScore"]
for i in post_numeric_values:
    df_post_tones[i] = pd.to_numeric(df_post_tones[i], errors='coerce')

# Make all metrics numeric
for i in metrics:
    df_post_tones[i] = pd.to_numeric(df_post_tones[i], errors='coerce')

# Add a column that records which part of the post was enriched
df_post_tones["Type"] = "Post"

#### Article Sentiment and Emotion DataFrame

In [None]:
# Create a list with only Article sentiment and emotion values
article_tones = ["Text", "LinkHighestEmotion", "LinkHighestEmotionScore", "LinkOverallSentimentType", "LinkOverallSentimentScore"]

# Append DataFrame with these metrics
article_tones.extend(metrics)

# Create a new DataFrame with metrics and sentiment and emotion
df_article_tones = df[article_tones]

# Determine which values are supposed to be numeric and ensure they are numeric. 
art_numeric_values = ["LinkHighestEmotionScore", "LinkOverallSentimentScore"]
for i in art_numeric_values:
    df_article_tones[i] = pd.to_numeric(df_article_tones[i], errors='coerce')
    
# Make all metrics numeric
for i in metrics:
    df_article_tones[i] = pd.to_numeric(df_article_tones[i], errors='coerce')

# Add a column that records which part of the post was enriched
df_article_tones["Type"] = "Article"

#### Combine Post and Article DataFrames to Make DataFrame with Sentiment and Emotion

In [None]:
# First make the Column Headers the same
df_post_tones.rename(columns={"TextHighestEmotion":"Emotion",
                              "TextHighestEmotionScore":"Emotion Score",
                              "TextOverallSentimentType": "Sentiment",
                              "TextOverallSentimentScore": "Sentiment Score"
                             },
                     inplace=True)

df_article_tones.rename(columns={"LinkHighestEmotion":"Emotion",
                                 "LinkHighestEmotionScore":"Emotion Score",
                                 "LinkOverallSentimentType": "Sentiment",
                                 "LinkOverallSentimentScore": "Sentiment Score"
                                },
                        inplace=True)

# Combine into one data frame
df_tones = pd.concat([df_post_tones, df_article_tones])

In [None]:
# Only keep the positive, neutral, and negative sentiments. The others are empty or unusable.
df_tones = df_tones[df_tones.Sentiment.isin(['positive', 'neutral', 'negative'])]
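The rename, combine, and filter steps above can be checked on a tiny example. The two frames below are made up and carry only the sentiment columns; the `'0'` and `' '` values stand in for the placeholders the enrichment loops write for empty text. Unlike the cell above, this sketch passes `ignore_index=True` to `pd.concat` so the combined frame gets a fresh index.

```python
import pandas as pd

# Made-up post and article enrichment results for the same two posts
post = pd.DataFrame({
    "Text": ["Post A", "Post B"],
    "TextOverallSentimentType": ["positive", "0"],
    "TextOverallSentimentScore": [0.8, 0.0],
})
article = pd.DataFrame({
    "Text": ["Post A", "Post B"],
    "LinkOverallSentimentType": ["negative", " "],
    "LinkOverallSentimentScore": [-0.4, float("nan")],
})

# Give both frames the same column names and tag their origin
post = post.rename(columns={"TextOverallSentimentType": "Sentiment",
                            "TextOverallSentimentScore": "Sentiment Score"})
post["Type"] = "Post"
article = article.rename(columns={"LinkOverallSentimentType": "Sentiment",
                                  "LinkOverallSentimentScore": "Sentiment Score"})
article["Type"] = "Article"

# Stack the rows, then keep only rows with a usable sentiment label
tones = pd.concat([post, article], ignore_index=True)
tones = tones[tones["Sentiment"].isin(["positive", "neutral", "negative"])]

print(tones)
```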

Create a Consolidated Keyword DataFrame
You'll create DataFrames for the keywords of the article text, the thumbnail text, and the post text. Then you'll combine them into one DataFrame.

#### Article Keyword DataFrame

In [None]:
# Create a list with only Article Keywords
article_keywords = ["Text", "MaxLinkKeywords"]

# Append DataFrame with these metrics
article_keywords.extend(metrics)

# Create a new DataFrame with keywords and metrics
df_article_keywords = df[article_keywords]

# Make all metrics numeric
for i in metrics:
    df_article_keywords[i] = pd.to_numeric(df_article_keywords[i], errors='coerce')

# Drop NA Values in Keywords Column
df_article_keywords['MaxLinkKeywords'] = df_article_keywords['MaxLinkKeywords'].replace(' ', np.nan)
df_article_keywords.dropna(subset=['MaxLinkKeywords'], inplace=True)

# Add a column that records which part of the post was enriched
df_article_keywords["Type"] = "Article"

#### Thumbnail Keyword DataFrame 

In [None]:
# Create a list with only Thumbnail Keywords
thumbnail_keywords = ["Text", "MaxThumbnailKeywords"]

# Append DataFrame with these metrics
thumbnail_keywords.extend(metrics)

# Create a new DataFrame with keywords and metrics
df_thumbnail_keywords = df[thumbnail_keywords]

# Make all metrics numeric
for i in metrics:
    df_thumbnail_keywords[i] = pd.to_numeric(df_thumbnail_keywords[i], errors='coerce')
    
# Drop NA Values in Keywords Column
df_thumbnail_keywords['MaxThumbnailKeywords'] = df_thumbnail_keywords['MaxThumbnailKeywords'].replace(' ', np.nan)
df_thumbnail_keywords.dropna(subset=['MaxThumbnailKeywords'], inplace=True)

# Add a column that records which part of the post was enriched
df_thumbnail_keywords["Type"] = "Thumbnails"

#### Post Keyword DataFrame 

In [None]:
# Create a list with only Post Keywords
post_keywords = ["Text", "MaxTextKeywords"]

# Append DataFrame with these metrics
post_keywords.extend(metrics)

# Create a new DataFrame with keywords and metrics
df_post_keywords = df[post_keywords]

# Make all metrics numeric
for i in metrics:
    df_post_keywords[i] = pd.to_numeric(df_post_keywords[i], errors='coerce')
    
# Drop NA Values in Keywords Column
df_post_keywords['MaxTextKeywords'] = df_post_keywords['MaxTextKeywords'].replace(' ', np.nan)
df_post_keywords.dropna(subset=['MaxTextKeywords'], inplace=True)

# Add a column that records which part of the post was enriched
df_post_keywords["Type"] = "Posts"

#### Combine Post, Thumbnail, and Article DataFrames to Make One Keywords DataFrame

In [None]:
# First make the column headers the same
df_post_keywords.rename(columns={"MaxTextKeywords": "Keywords"}, inplace=True)
df_thumbnail_keywords.rename(columns={"MaxThumbnailKeywords":"Keywords"}, inplace=True)
df_article_keywords.rename(columns={"MaxLinkKeywords":"Keywords"}, inplace=True)

# Combine into one data frame
df_keywords = pd.concat([df_post_keywords, df_thumbnail_keywords, df_article_keywords])

# Discard posts with lower total reach to make charting easier
df_keywords = df_keywords[df_keywords["Lifetime Post Total Reach"] > 20000]
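
The three keyword cells above repeat the same steps: select the columns, coerce the metrics, drop blank rows, and label the source. A minimal sketch of how those steps could be folded into one helper follows. The function name `build_enrichment_frame` and the toy `sample_df` are illustrative assumptions and not part of the original notebook; only the column names are taken from the cells above.

```python
import pandas as pd


def build_enrichment_frame(source, column, metric_columns, label, new_name):
    """Hypothetical helper: one enrichment column plus metrics, cleaned and labelled."""
    # Work on a copy so the source DataFrame is left untouched
    frame = source[["Text", column] + list(metric_columns)].copy()

    # Make all metrics numeric; values that cannot be parsed become NaN
    for metric in metric_columns:
        frame[metric] = pd.to_numeric(frame[metric], errors="coerce")

    # Drop rows whose value is missing, empty, or whitespace-only
    is_blank = frame[column].isna() | frame[column].astype(str).str.strip().eq("")
    frame = frame[~is_blank].copy()

    # Record which part of the post the value came from
    frame["Type"] = label
    return frame.rename(columns={column: new_name})


# Toy data standing in for the enriched Facebook export (assumed values)
sample_metrics = ["Lifetime Post Total Reach"]
sample_df = pd.DataFrame({
    "Text": ["post a", "post b"],
    "MaxTextKeywords": ["machine learning", " "],
    "MaxLinkKeywords": ["", "cloud"],
    "Lifetime Post Total Reach": ["25000", "30000"],
})

sample_keywords = pd.concat([
    build_enrichment_frame(sample_df, "MaxTextKeywords", sample_metrics, "Posts", "Keywords"),
    build_enrichment_frame(sample_df, "MaxLinkKeywords", sample_metrics, "Article", "Keywords"),
])
print(sample_keywords[["Keywords", "Type"]])
```

The helper returns a copy, so later assignments such as adding the `Type` column do not touch `df` or trigger chained-assignment warnings.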


Create a Consolidated Entity DataFrame
You'll create DataFrames for the entities of the article text, the thumbnail text, and the post text. Then you'll combine them into one DataFrame.

#### Article Entity DataFrame 

In [None]:
# Create a list with the text and article entity columns
article_entities = ["Text", "MaxLinkEntity"]

# Extend the list with the metric columns
article_entities.extend(metrics)

# Create a new DataFrame with entities and metrics (copy so df itself is not modified)
df_article_entities = df[article_entities].copy()

# Make all metrics numeric
for i in metrics:
    df_article_entities[i] = pd.to_numeric(df_article_entities[i], errors='coerce')

# Drop rows with blank entities. The anchored pattern matches only empty or
# whitespace-only values; an unanchored r'\s+' would also blank out multi-word entities.
df_article_entities['MaxLinkEntity'] = df_article_entities['MaxLinkEntity'].replace(r'^\s*$', np.nan, regex=True)
df_article_entities.dropna(subset=['MaxLinkEntity'], inplace=True)

# Add a column recording which part of the post was enriched
df_article_entities["Type"] = "Article"

#### Thumbnail Entity DataFrame

In [None]:
# Create a list with the text and thumbnail entity columns
thumbnail_entities = ["Text", "MaxThumbnailEntity"]

# Extend the list with the metric columns
thumbnail_entities.extend(metrics)

# Create a new DataFrame with entities and metrics (copy so df itself is not modified)
df_thumbnail_entities = df[thumbnail_entities].copy()

# Make all metrics numeric
for i in metrics:
    df_thumbnail_entities[i] = pd.to_numeric(df_thumbnail_entities[i], errors='coerce')

# Drop rows with blank entities. The anchored pattern matches only empty or
# whitespace-only values; an unanchored r'\s+' would also blank out multi-word entities.
df_thumbnail_entities['MaxThumbnailEntity'] = df_thumbnail_entities['MaxThumbnailEntity'].replace(r'^\s*$', np.nan, regex=True)
df_thumbnail_entities.dropna(subset=['MaxThumbnailEntity'], inplace=True)

# Add a column recording which part of the post was enriched
df_thumbnail_entities["Type"] = "Thumbnails"

#### Post Entity DataFrame

In [None]:
# Create a list with the text and post entity columns
post_entities = ["Text", "MaxTextEntity"]

# Extend the list with the metric columns
post_entities.extend(metrics)

# Create a new DataFrame with entities and metrics (copy so df itself is not modified)
df_post_entities = df[post_entities].copy()

# Make all metrics numeric
for i in metrics:
    df_post_entities[i] = pd.to_numeric(df_post_entities[i], errors='coerce')

# Drop rows with blank entities. The anchored pattern matches only empty or
# whitespace-only values; an unanchored r'\s+' would also blank out multi-word entities.
df_post_entities['MaxTextEntity'] = df_post_entities['MaxTextEntity'].replace(r'^\s*$', np.nan, regex=True)
df_post_entities.dropna(subset=['MaxTextEntity'], inplace=True)

# Add a column recording which part of the post was enriched
df_post_entities["Type"] = "Posts"

#### Combine Post, Thumbnail, and Article DataFrames to Make One Entity DataFrame

In [None]:
# First make the column headers the same
df_post_entities.rename(columns={"MaxTextEntity": "Entities"}, inplace=True)

df_thumbnail_entities.rename(columns={"MaxThumbnailEntity":"Entities"}, inplace=True)

df_article_entities.rename(columns={"MaxLinkEntity":"Entities"}, inplace=True)

# Combine into one data frame
df_entities = pd.concat([df_post_entities, df_thumbnail_entities, df_article_entities])

df_entities["Entities"] = df_entities["Entities"].replace('', np.nan)
df_entities.dropna(subset=["Entities"], inplace=True)
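
One detail of `Series.replace` with `regex=True` is worth checking when cleaning entity columns: when the replacement is not a string (for example `np.nan`), pandas replaces the whole cell if the pattern matches anywhere in it. The sketch below compares an unanchored `r'\s+'` with an anchored `r'^\s*$'`. The sample entity values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up entity values: a multi-word entity, a blank cell, a single-word entity
sample_entities = pd.Series(["New York", " ", "IBM"], dtype=object)

# Unanchored pattern: any cell that contains whitespace becomes NaN
unanchored = sample_entities.replace(r"\s+", np.nan, regex=True)

# Anchored pattern: only empty or whitespace-only cells become NaN
anchored = sample_entities.replace(r"^\s*$", np.nan, regex=True)

print("unanchored keeps:", unanchored.dropna().tolist())
print("anchored keeps:  ", anchored.dropna().tolist())
```

With the anchored pattern, multi-word entities such as "New York" survive the cleanup, which is usually what an entity chart needs.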

Create a Consolidated Image DataFrame

#### Combine Metrics with Type Hierarchy, Class and Color to Make One Image DataFrame

In [None]:
if visual_recognition:
    # Create a list with only Visual Recognition columns
    pic_keywords = ['Image Type', 'Image Subtype', 'Image Subtype2', 'Image Class', 'Image Color']

    # Extend the list with the metric columns
    pic_keywords.extend(metrics)

    # Create a new DataFrame with image features and metrics (copy so df itself is not modified)
    df_pic_keywords = df[pic_keywords].copy()

    # Make all metrics numeric
    for i in metrics:
        df_pic_keywords[i] = pd.to_numeric(df_pic_keywords[i], errors='coerce')

    # Discard posts with lower total reach to make charting easier
    df_pic_keywords = df_pic_keywords[df_pic_keywords["Lifetime Post Total Reach"] > 15000]

In [None]:
if visual_recognition:
    images = df_pic_keywords[df_pic_keywords['Image Type'] != ' ']

3rd Step

Setup
Assign Variables
Assign new DataFrames to variables. 

In [None]:
entities = df_entities
tones = df_tones
keywords = df_keywords

Visualize Data
Run the PixieDust Visualization Library with the display() API
PixieDust lets you visualize your data in just a few clicks using the display() API. You can find more info at https://pixiedust.github.io/pixiedust/displayapi.html.

#### We can use a pie chart to identify how lifetime engagement was broken up by sentiment. 

Click on the `Options` button to change the chart.  Here are some things to try:
* Add *Type* to make the breakdown show *Post* or *Article*.
* Show *Emotion* instead of *Sentiment* (or both).
* Try a different metric.

In [None]:
import pixiedust

In [None]:
display(tones)

#### Now let's look at the same statistics as a bar chart.

It is the same line of code. Use the `Edit Metadata` button to see how PixieDust knows to show us a bar chart. If you don't have a button, use the menu and select `View > Cell Toolbar > Edit Metadata`.

A bar chart can show more information at once. We added `Cluster By: Type`, so we already see separate numbers for posts and articles. Notice what the chart tells you: most of our articles and posts are `positive`. But which sentiment really engages more users? Click on `Options` and try this:

* Change the aggregation to `AVG`.

What sentiment leads to higher average engagement?


In [None]:
display(tones)
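
The `AVG` aggregation suggested above can also be reproduced directly in pandas, which is a handy way to check what the chart shows. This is a minimal sketch on made-up data. The column names `Sentiment`, `Type`, and `Lifetime Engaged Users` are assumptions and should be replaced with the columns actually present in `tones`.

```python
import pandas as pd

# Made-up rows standing in for the tones DataFrame (assumed column names)
sample_tones = pd.DataFrame({
    "Sentiment": ["positive", "positive", "negative", "negative"],
    "Type": ["Posts", "Article", "Posts", "Article"],
    "Lifetime Engaged Users": [100, 300, 400, 600],
})

# Average engagement per sentiment, highest first
avg_engagement = (
    sample_tones
    .groupby("Sentiment")["Lifetime Engaged Users"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_engagement)

# The same average, split by Type (the pandas counterpart of "Cluster By: Type")
avg_by_type = sample_tones.pivot_table(
    index="Sentiment",
    columns="Type",
    values="Lifetime Engaged Users",
    aggfunc="mean",
)
print(avg_by_type)
```

Comparing the mean with the count per sentiment shows whether the most frequent sentiment is also the most engaging one.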

#### Now let's look at the entities that were detected by Natural Language Understanding.

The following bar chart shows the entities that were detected. This time we are stacking negative feedback and "likes" to get a picture of the kind of feedback the entities were getting. We chose a horizontal, stacked bar chart with descending values for a little variety.

* Try a different renderer and see what you get.

In [None]:
display(entities)
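
The stacked chart described above amounts to summing two metrics per entity and ordering the entities by their combined total. A rough pandas equivalent on made-up data is sketched below. `Likes` and `Negative Feedback` are placeholder column names (assumptions); substitute the corresponding metric columns from `entities`.

```python
import pandas as pd

# Made-up rows standing in for the entities DataFrame (placeholder metric names)
sample_feedback = pd.DataFrame({
    "Entities": ["IBM", "Watson", "IBM", "Cloud"],
    "Likes": [120, 80, 30, 10],
    "Negative Feedback": [5, 2, 1, 0],
})

# Sum both metrics per entity
feedback = sample_feedback.groupby("Entities")[["Likes", "Negative Feedback"]].sum()

# Order entities by their combined total, largest first
order = feedback.sum(axis=1).sort_values(ascending=False).index
feedback = feedback.loc[order]
print(feedback)
```

If matplotlib is installed, `feedback.plot.barh(stacked=True)` draws a comparable horizontal stacked bar chart outside PixieDust.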

#### Next we look at the keywords detected by Natural Language Understanding


In [None]:
display(keywords)