# Scrapalyze: Web Scraper and Sentiment Analyzer
Authored by Chris Cotton, 07/17/2017

chris.j.cotton@me.com


## Purpose:

This notebook uses the Goose and TextBlob modules for Python to scrape, parse, and score the sentiment and subjectivity of natural-language text on the internet. Goose scrapes and cleans raw web data; TextBlob builds on the NLTK module for Python to perform sentiment and subjectivity analysis on that data.
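
To give a feel for what the two scores mean, here is a toy, lexicon-based scorer. It is not TextBlob's implementation, only a sketch of the general idea as I understand it: known words carry a polarity (roughly -1 to 1) and a subjectivity (0 to 1), and a text's score is the average over the words that are found. The lexicon `TOY_LEXICON` and the function `toy_sentiment` are hypothetical; the values for "best" and "free" were chosen to echo scores that appear in the results table further down, and the rest are invented.

```python
import re

# Hypothetical mini-lexicon: word -> (polarity, subjectivity).
# These numbers are illustrative assumptions, not TextBlob's real lexicon.
TOY_LEXICON = {
    "best": (1.0, 0.3),
    "free": (0.4, 0.8),
    "love": (0.5, 0.6),
    "worst": (-1.0, 1.0),
}


def toy_sentiment(text):
    """Average the polarity and subjectivity of every lexicon word in text."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not hits:
        # No known words (e.g. non-English text): both scores fall back to 0.
        return 0.0, 0.0
    n = float(len(hits))
    polarity = sum(p for p, _ in hits) / n
    subjectivity = sum(s for _, s in hits) / n
    return polarity, subjectivity


print(toy_sentiment("Sign in for the best experience"))
print(toy_sentiment("Log In or Sign Up"))
```

A scorer of this kind returns zeros whenever none of the words are in its lexicon, which is one plausible reason so many rows in the results below are all zeros (short titles, non-English pages, empty text).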

A few bubble charts are included for fun; they plot pairs of scores (for example, sentiment vs. subjectivity) for various parts of the scraped data (title, metadata, text, etc.). The heatmaps display all six attributes of the analyzed text vs. all observations (websites) in the data set. A bottom-up, hierarchical, agglomerative clustering algorithm (Ward clustering) is applied to the rows; at each step it merges the pair of clusters whose union gives the smallest increase in the within-cluster error sum of squares. The resulting taxonomy can be pruned at any level to find clades of websites with similar sentiments. The same algorithm is applied column-wise to find the attributes that are, holistically, most similar to one another *across* the websites.
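
The clustering itself is done in R (via `hclust` with `method = "ward.D2"` inside `d3heatmap`), but the idea can be sketched in a few lines of plain Python. This is a naive illustration under my own assumptions, not what R runs internally: the helper names `centroid`, `ward_cost`, and `ward_cut` are made up for this example, and the sketch works directly from the rows rather than from a distance matrix. The merge cost used is the standard Ward criterion, `n_a * n_b / (n_a + n_b)` times the squared distance between the two cluster centroids.

```python
from itertools import combinations


def centroid(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    n = float(len(rows))
    return [sum(col) / n for col in zip(*rows)]


def ward_cost(rows_a, rows_b):
    """Increase in within-cluster error sum of squares if a and b merge."""
    ca, cb = centroid(rows_a), centroid(rows_b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    na, nb = len(rows_a), len(rows_b)
    return na * nb / float(na + nb) * sq_dist


def ward_cut(labelled_rows, k):
    """Merge the cheapest pair of clusters until only k clusters remain."""
    clusters = [([label], [row]) for label, row in labelled_rows]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ward_cost(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(labels) for labels, _ in clusters]


# Rows copied from the df_heatmap.head() output shown later in this notebook.
sample = [
    ("google.com",    [0.0, 0.401786, 0.0, 0.0, 0.455357, 0.0]),
    ("facebook.com",  [0.0, -0.125,   0.0, 0.0, 0.375,    0.0]),
    ("wikipedia.org", [0.0, 0.4,      0.0, 0.0, 0.8,      0.0]),
    ("amazon.com",    [0.5, 0.0,      1.0, 0.5, 0.3,      0.3]),
]

print(ward_cut(sample, 3))  # prune the tree at three clusters
print(ward_cut(sample, 2))  # prune the tree at two clusters
```

With these four rows, google.com and wikipedia.org are merged first, facebook.com joins them next, and amazon.com stays on its own branch the longest. Choosing `k` here plays the same role as `k_row` and `k_col` in the heatmap calls: it picks the level at which the taxonomy is pruned.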


## Inputs:

The user defines a list of domains, and the program tries to scrape and analyze every domain in the list that it can.
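
If you bring your own list, entries need to be bare domains such as `google.com`, because the notebook prepends `http://www.` to each one. The helper below is a hypothetical convenience (the name `to_urls` and its cleaning rules are my assumptions, not part of the notebook) that tidies a hand-typed list into that form and drops blanks and duplicates.

```python
def to_urls(domains, prefix="http://www."):
    """Normalize hand-typed domains and prepend the prefix the notebook expects."""
    seen = set()
    urls = []
    for raw in domains:
        domain = raw.strip().lower()
        # Strip a scheme and a leading "www." if the user included them.
        for scheme in ("http://", "https://"):
            if domain.startswith(scheme):
                domain = domain[len(scheme):]
        if domain.startswith("www."):
            domain = domain[len("www."):]
        domain = domain.rstrip("/")
        # Skip blanks and repeats while preserving the original order.
        if domain and domain not in seen:
            seen.add(domain)
            urls.append(prefix + domain)
    return urls


print(to_urls([" Google.com ", "https://www.google.com/", "wikipedia.org", ""]))
```

The result could be used in place of the `url_list` comprehension below; keeping the `http://www.` prefix matters because `parse_and_analyze` strips exactly that string to recover the domain name.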


## Execution:

After defining your list, click the "Cell" menu, and click "Run All."

In [1]:
from __future__ import division
import os
import sys
import re
import numpy as np
import pandas as pd
import scipy

import nltk, re, pprint
from nltk import word_tokenize
import urllib2 as ul2
from goose import Goose
from textblob import TextBlob

import plotly.plotly as py 
from plotly.graph_objs import *
import plotly.graph_objs as go
from plotly import __version__
import plotly.offline as offline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

import rpy2 as r
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
from rpy2.robjects import conversion
from rpy2.robjects import pandas2ri
from IPython.display import display, HTML, IFrame





In [2]:
%load_ext rpy2.ipython

import warnings
warnings.filterwarnings('ignore')

In [3]:
%R require(ggplot2); require(tidyr); require(plotly); require(d3heatmap)

array([1], dtype=int32)

In [4]:
R = ro.r
pandas2ri.activate()
# plotly = importr("plotly")
# d3heatmap = importr("d3heatmap")
# forcats = importr("forcats")
# anomaly = importr("AnomalyDetection")

root_dir = os.getcwd()
output_dir = root_dir

In [5]:
g = Goose()

In [6]:
domain_list = ["google.com","youtube.com","facebook.com","baidu.com","wikipedia.org","yahoo.com","reddit.com","google.co.in","qq.com","amazon.com","taobao.com","twitter.com","tmall.com","google.co.jp","vk.com","live.com","sohu.com","instagram.com","sina.com.cn","jd.com","weibo.com","360.cn","google.de","google.co.uk","google.com.br","list.tmall.com","linkedin.com","google.fr","google.ru","yandex.ru","netflix.com","google.com.hk","yahoo.co.jp","google.it","ebay.com","t.co","pornhub.com","google.es","imgur.com","bing.com","twitch.tv","msn.com","onclkds.com","gmw.cn","tumblr.com","google.com.mx","google.ca","alipay.com","xvideos.com","livejasmin.com","mail.ru","ok.ru","microsoft.com","aliexpress.com","wordpress.com","hao123.com","stackoverflow.com","imdb.com","amazon.co.jp","github.com","blogspot.com","csdn.net","wikia.com","pinterest.com","apple.com","google.com.tr","popads.net","youth.cn","bongacams.com","office.com","paypal.com","google.com.tw","google.com.au","whatsapp.com","microsoftonline.com","google.pl","xhamster.com","detail.tmall.com","diply.com","google.co.id","adobe.com","nicovideo.jp","craigslist.org","amazon.de","txxx.com","amazon.in","google.com.ar","porn555.com","coccoc.com","dropbox.com","booking.com","thepiratebay.org","google.com.pk","googleusercontent.com","google.co.th","pixnet.net","china.com","google.com.eg","soso.com","bbc.co.uk","tianya.cn","google.com.sa","amazon.co.uk","savefrom.net","fc2.com","bbc.com","rakuten.co.jp","uptodown.com","so.com","soundcloud.com","google.com.ua","mozilla.org","xnxx.com","cnn.com","amazonaws.com","quora.com","ask.com","google.nl","ettoday.net","nytimes.com","naver.com","adf.ly","dailymotion.com","clicksgear.com","google.co.za","steamcommunity.com","onlinesbi.com","google.co.ve","espn.com","google.co.kr","salesforce.com","chase.com","fbcdn.net","blogger.com","stackexchange.com","ebay.de","vice.com","vimeo.com","theguardian.com","chaturbate.com","steampowered.com","blastingnews.com","ebay.co.uk","mediafire.com","tribunnews.com","indeed.com","buzzfeed.com","openload.co","google.gr","avito.ru"]

In [7]:
url_list = ["http://www." + i for i in domain_list]

In [8]:
def parse_and_analyze(url):
    domain = url.replace("http://www.","")
    parsed = g.extract(url)

    title = parsed.title
    meta = parsed.meta_description
    text = parsed.cleaned_text
    overall = title + " " + meta + " " + text

    title_blob = TextBlob(title)
    meta_blob = TextBlob(meta)
    text_blob = TextBlob(text)
    overall_blob = TextBlob(overall)

    title_sentiment = title_blob.sentiment.polarity
    meta_sentiment = meta_blob.sentiment.polarity
    text_sentiment = text_blob.sentiment.polarity
    overall_sentiment = overall_blob.sentiment.polarity

    title_subjectivity = title_blob.sentiment.subjectivity
    meta_subjectivity = meta_blob.sentiment.subjectivity
    text_subjectivity = text_blob.sentiment.subjectivity
    overall_subjectivity = overall_blob.sentiment.subjectivity

    results_list = [
    domain, title, meta, text,
    title_sentiment, meta_sentiment, text_sentiment, overall_sentiment,
    title_subjectivity, meta_subjectivity, text_subjectivity, overall_subjectivity
    ]

    return results_list



def merge(url_list):
    """Scrape every URL; return the results plus the URLs that succeeded and failed."""
    results_list = []
    successful = []
    failed = []

    for url in url_list:
        try:
            results_list.append(parse_and_analyze(url))
            successful.append(url)
            print url + " successful!"
        except Exception:
            # Best-effort scraping: record the failure and move on.
            # Catching Exception (not a bare except) keeps Ctrl-C / kernel interrupt working.
            failed.append(url)
            print url + " failed..."

        # Running totals are derived from the two lists, so no separate counters are needed.
        print "{} urls processed so far; {} successful; {} failed.".format(
            len(successful) + len(failed), len(successful), len(failed))

    df = pd.DataFrame(results_list)

    return df, successful, failed

In [9]:
df, successful, failed = merge(url_list)

http://www.google.com successful!
1 urls processed so far; 1 successful; 0 failed.
http://www.youtube.com successful!
2 urls processed so far; 2 successful; 0 failed.
http://www.facebook.com successful!
3 urls processed so far; 3 successful; 0 failed.
http://www.baidu.com successful!
4 urls processed so far; 4 successful; 0 failed.
http://www.wikipedia.org successful!
5 urls processed so far; 5 successful; 0 failed.
http://www.yahoo.com failed...
6 urls processed so far; 5 successful; 1 failed.
http://www.reddit.com successful!
7 urls processed so far; 6 successful; 1 failed.
http://www.google.co.in successful!
8 urls processed so far; 7 successful; 1 failed.
http://www.qq.com successful!
9 urls processed so far; 8 successful; 1 failed.
http://www.amazon.com successful!
10 urls processed so far; 9 successful; 1 failed.
http://www.taobao.com successful!
11 urls processed so far; 10 successful; 1 failed.
http://www.twitter.com successful!
12 urls processed so far; 11 successful; 1 failed.

In [10]:
df.columns = [
"Domain", "Title", "Meta", "Text",
"Title Sentiment", "Meta Sentiment", "Text Sentiment", "Overall Sentiment",
"Title Subjectivity", "Meta Subjectivity", "Text Subjectivity", "Overall Subjectivity",
]

In [11]:
# Write next to the notebook (output_dir is set in cell 4) instead of a user-specific absolute path.
df.to_csv(os.path.join(output_dir, "cloudzero_results.txt"), sep = "\t", encoding = "utf-8")

In [12]:
df

,Domain,Title,Meta,Text,Title Sentiment,Meta Sentiment,Text Sentiment,Overall Sentiment,Title Subjectivity,Meta Subjectivity,Text Subjectivity,Overall Subjectivity
0,google.com,Google,"Search the world's information, including webp...",,0.000000,0.401786,0.000000,0.401786,0.000000,0.455357,0.000000,0.455357
1,youtube.com,YouTube,"Enjoy the videos and music you love, upload or...",All the clips that are burning up the morning ...,0.000000,0.425000,-0.133333,0.145833,0.000000,0.616667,0.300000,0.458333
2,facebook.com,Log In or Sign Up,Create an account or log into Facebook. Connec...,"By clicking Create Account, you agree to our T...",0.000000,-0.125000,0.000000,-0.125000,0.000000,0.375000,0.000000,0.375000
3,baidu.com,百度一下，你就知道,,,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,wikipedia.org,Wikipedia,"Wikipedia is a free online encyclopedia, creat...",,0.000000,0.400000,0.000000,0.400000,0.000000,0.800000,0.000000,0.800000
5,reddit.com,reddit: the front page of the internet,reddit: the front page of the internet,Want to join? Log in or sign up in seconds.,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,google.co.in,Google,,,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7,qq.com,腾讯首页,腾讯网(www.QQ.com)是中国浏览量最大的中文门户网站，是腾讯公司推出的集新闻信息、互...,,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
8,amazon.com,"Amazon.com: Online Shopping for Electronics, A...",Online shopping from the earth's biggest selec...,Sign in for the best experience,0.500000,0.000000,1.000000,0.500000,0.500000,0.300000,0.300000,0.366667
9,taobao.com,淘宝网,,,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [13]:
df.describe()

,Title Sentiment,Meta Sentiment,Text Sentiment,Overall Sentiment,Title Subjectivity,Meta Subjectivity,Text Subjectivity,Overall Subjectivity
count,133.0,133.0,133.0,133.0,133.0,133.0,133.0,133.0
mean,0.068307,0.115469,0.07928,0.139452,0.09787,0.180445,0.127392,0.225333
std,0.169876,0.202842,0.210016,0.203553,0.251001,0.293305,0.234539,0.287854
min,0.0,-0.125,-0.4,-0.125,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.2,0.0,0.3,0.0,0.375,0.16,0.5
max,0.8,0.8,1.0,0.8,1.0,1.0,1.0,0.9


In [14]:
df[df.sum(axis = 1) > 0].describe()

,Title Sentiment,Meta Sentiment,Text Sentiment,Overall Sentiment,Title Subjectivity,Meta Subjectivity,Text Subjectivity,Overall Subjectivity
count,57.0,57.0,57.0,57.0,57.0,57.0,57.0,57.0
mean,0.159383,0.269427,0.184986,0.325387,0.228363,0.421038,0.297249,0.525777
std,0.230753,0.234015,0.289936,0.189959,0.343761,0.315708,0.279772,0.185811
min,0.0,-0.125,-0.4,-0.125,0.0,0.0,0.0,0.1
25%,0.0,0.0,0.0,0.24375,0.0,0.1,0.0,0.418182
50%,0.0,0.283333,0.033333,0.333333,0.0,0.455357,0.3,0.5
75%,0.4,0.5,0.3125,0.452778,0.5,0.616667,0.5375,0.65
max,0.8,0.8,1.0,0.8,1.0,1.0,1.0,0.9


In [15]:
hover_text = []

for index, row in df.iterrows():
    hover_text.append(
        ('Domain: {domain}<br>'+
        'Title Sentiment: {title}<br>'+
        'Text Sentiment: {text}<br>').format(
                                        domain = row["Domain"],
                                        title = row["Title Sentiment"],
                                        text = row["Text Sentiment"]
                                        )
                     )


df["Hover Text"] = hover_text


trace0 = go.Scatter(
    x = df["Title Sentiment"],
    y = df["Text Sentiment"],
    text = df["Hover Text"],
    mode = "markers",
    marker = dict(
        size=[40] * len(df),
    )
)


layout = go.Layout(
    title = "Text vs. Title Sentiment for Top Alexa Sites",
    xaxis = dict(
        title = "Title Sentiment Score",
        gridcolor = "rgb(255, 255, 255)",
        zerolinewidth = 1,
        ticklen = 5,
        gridwidth = 2,
    ),
    yaxis=dict(
        title = "Text Sentiment Score",
        gridcolor = "rgb(255, 255, 255)",
        zerolinewidth = 1,
        ticklen = 5,
        gridwidth = 2,
    ),
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
)


data = [trace0]

fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename = "bubblechart-size")

In [16]:
hover_text = []

for index, row in df.iterrows():
    hover_text.append(
        ('Domain: {domain}<br>'+
        'Subjectivity: {subjectivity}<br>'+
        'Sentiment: {sentiment}<br>').format(
                                        domain = row["Domain"],
                                        subjectivity = row["Text Subjectivity"],
                                        sentiment = row["Text Sentiment"]
                                        )
                     )


df["Hover Text2"] = hover_text


trace0 = go.Scatter(
    x = df["Text Subjectivity"],
    y = df["Text Sentiment"],
    text = df["Hover Text2"],
    mode = "markers",
    marker = dict(
        size=[40] * len(df),
    )
)


layout = go.Layout(
    title = "Text Sentiment vs. Subjectivity for Top Alexa Sites",
    xaxis = dict(
        title = "Text Subjectivity Score",
        gridcolor = "rgb(255, 255, 255)",
        zerolinewidth = 1,
        ticklen = 5,
        gridwidth = 2,
    ),
    yaxis=dict(
        title = "Text Sentiment Score",
        gridcolor = "rgb(255, 255, 255)",
        zerolinewidth = 1,
        ticklen = 5,
        gridwidth = 2,
    ),
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
)


data = [trace0]

fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename = "bubblechart-size")

In [17]:
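# Select the six score columns and index by domain in one step; calling
# set_index(..., inplace=True) on a column slice can trigger SettingWithCopyWarning.
df_heatmap = df[["Domain","Title Sentiment","Meta Sentiment","Text Sentiment",
                "Title Subjectivity","Meta Subjectivity","Text Subjectivity"]].set_index("Domain")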
df_heatmap = df_heatmap[df_heatmap.sum(axis = 1) > 0]
df_row_normalized = df_heatmap.div(df_heatmap.sum(axis = 1), axis = 0)
df_col_normalized = df_heatmap.div(df_heatmap.sum(axis = 0), axis = 1)

In [18]:
df_heatmap.head()

Domain,Title Sentiment,Meta Sentiment,Text Sentiment,Title Subjectivity,Meta Subjectivity,Text Subjectivity
google.com,0.0,0.401786,0.0,0.0,0.455357,0.0
youtube.com,0.0,0.425,-0.133333,0.0,0.616667,0.3
facebook.com,0.0,-0.125,0.0,0.0,0.375,0.0
wikipedia.org,0.0,0.4,0.0,0.0,0.8,0.0
amazon.com,0.5,0.0,1.0,0.5,0.3,0.3


In [19]:
df_row_normalized.head()

Domain,Title Sentiment,Meta Sentiment,Text Sentiment,Title Subjectivity,Meta Subjectivity,Text Subjectivity
google.com,0.0,0.46875,0.0,0.0,0.53125,0.0
youtube.com,0.0,0.351724,-0.110345,0.0,0.510345,0.248276
facebook.com,0.0,-0.5,0.0,0.0,1.5,0.0
wikipedia.org,0.0,0.333333,0.0,0.0,0.666667,0.0
amazon.com,0.192308,0.0,0.384615,0.192308,0.115385,0.115385


In [20]:
df_col_normalized.head()

Domain,Title Sentiment,Meta Sentiment,Text Sentiment,Title Subjectivity,Meta Subjectivity,Text Subjectivity
google.com,0.0,0.026162,0.0,0.0,0.018974,0.0
youtube.com,0.0,0.027674,-0.012645,0.0,0.025695,0.017706
facebook.com,0.0,-0.008139,0.0,0.0,0.015626,0.0
wikipedia.org,0.0,0.026046,0.0,0.0,0.033335,0.0
amazon.com,0.055037,0.0,0.094839,0.038412,0.0125,0.017706


In [21]:
%%R -i df_heatmap
p <- d3heatmap(df_heatmap, colors = "YlGnBu", theme = "dark", height = 800, width = 800,
               k_row = 18, k_col = 6, scale = "none", symm = TRUE,
               hclustfun = function(x) hclust(x, method = "ward.D2"),
               na.rm = TRUE, xaxis_font_size = 12, yaxis_font_size = 11)
htmlwidgets::saveWidget(as.widget(p), "/users/chrcotto/df_heatmap.html", selfcontained = TRUE)

In [22]:
IFrame("df_heatmap.html", width = 900, height = 900)

In [23]:
%%R -i df_row_normalized
p2 <- d3heatmap(df_row_normalized, colors = "YlGnBu", theme = "dark", height = 800, width = 800,
                k_row = 18, k_col = 6, scale = "row", symm = TRUE,
                hclustfun = function(x) hclust(x, method = "ward.D2"),
                na.rm = TRUE, xaxis_font_size = 12, yaxis_font_size = 11)
htmlwidgets::saveWidget(as.widget(p2), "/users/chrcotto/df_heatmap_row_normal.html", selfcontained = TRUE)

In [24]:
IFrame("df_heatmap_row_normal.html", width = 900, height = 900)

In [25]:
%%R -i df_col_normalized
p3 <- d3heatmap(df_col_normalized, colors = "YlGnBu", theme = "dark", height = 800, width = 800,
                k_row = 18, k_col = 6, scale = "col", symm = TRUE,
                hclustfun = function(x) hclust(x, method = "ward.D2"),
                na.rm = TRUE, xaxis_font_size = 12, yaxis_font_size = 11)
htmlwidgets::saveWidget(as.widget(p3), "/users/chrcotto/df_heatmap_col_normal.html", selfcontained = TRUE)

In [26]:
IFrame("df_heatmap_col_normal.html", width = 900, height = 900)