# Real-Time Tweets: Sentiment Analysis with IBM Watson and Bluemix Technology

#### In this lab I used the Twitter API to stream real-time tweets from random users and perform basic sentiment analysis on them.

#### I installed the Python Twitter API library and the Watson Developer Cloud SDK:
The Twitter API is documented at https://dev.twitter.com/rest/public
Watson Developer Cloud gives me access to Watson Tone Analyzer for preprocessing and to Watson Personality Insights for modeling and exploring the users' personality profiles in terms of adjectives and emotions.

In [1]:
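# python-twitter provides the twitter.Api class used below; the separate "twitter" package installs a conflicting module of the same name
!pip install --user python-twitter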
!pip install --user watson-developer-cloud
!pip install --upgrade --user pixiedust

Requirement already up-to-date: pixiedust in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s0e2-2229468d893d94-ac4cd83694d3/.local/lib/python2.7/site-packages
Requirement already up-to-date: mpld3 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s0e2-2229468d893d94-ac4cd83694d3/.local/lib/python2.7/site-packages (from pixiedust)
Requirement already up-to-date: lxml in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s0e2-2229468d893d94-ac4cd83694d3/.local/lib/python2.7/site-packages (from pixiedust)


#### I then installed the streaming Twitter jar into the notebook from the GitHub repo

PixieDust is especially helpful here because it lets us display() visualizations with very little code.

In [1]:
import pixiedust
jarPath = "https://github.com/ibm-cds-labs/spark.samples/raw/master/dist/streaming-twitter-assembly-1.6.jar"
pixiedust.installPackage(jarPath)
print("done")

Pixiedust database opened successfully


Package already installed: https://github.com/ibm-cds-labs/spark.samples/raw/master/dist/streaming-twitter-assembly-1.6.jar
done


#### I obtained credentials from my Twitter and IBM Watson accounts in order to access the data (tweets) and the analytics tools (Watson Tone Analyzer and Personality Insights)

In [3]:
import pixiedust
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Set up the Twitter credentials; they are used in both the Scala and Python cells below.
# Replace each placeholder with your own value and never publish real secrets in a shared notebook.
consumerKey = "YOUR_TWITTER_CONSUMER_KEY"
consumerSecret = "YOUR_TWITTER_CONSUMER_SECRET"
accessToken = "YOUR_TWITTER_ACCESS_TOKEN"
accessTokenSecret = "YOUR_TWITTER_ACCESS_TOKEN_SECRET"

# Set up the Watson Personality Insights credentials
piUserName = "YOUR_PERSONALITY_INSIGHTS_USERNAME"
piPassword = "YOUR_PERSONALITY_INSIGHTS_PASSWORD"

# Set up the Watson Tone Analyzer credentials
taUserName = "YOUR_TONE_ANALYZER_USERNAME"
taPassword = "YOUR_TONE_ANALYZER_PASSWORD"
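
Keeping keys out of the notebook itself avoids leaking them when the notebook is shared. The following is a minimal sketch, assuming the secrets were exported as environment variables before the kernel started. The variable names (`TWITTER_CONSUMER_KEY`, `WATSON_PI_PASSWORD`) and the helper `read_credential` are hypothetical; neither the Twitter nor the Watson libraries require them.

```python
import os


def read_credential(name, environ=None):
    """Return the value of environment variable `name`, or raise if it is unset.

    `environ` defaults to os.environ; passing a plain dict makes the helper testable.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise KeyError("Missing credential: set the %s environment variable" % name)
    return value


# Hypothetical variable names; adapt them to your own setup.
# consumerKey = read_credential("TWITTER_CONSUMER_KEY")
# piPassword = read_credential("WATSON_PI_PASSWORD")
```

Failing loudly on a missing variable is a deliberate choice here: an empty key would otherwise surface much later as an opaque authentication error from the API.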

#### I used the Scala bridge to run the command-line version of the app

In [4]:
%%scala
val demo = com.ibm.cds.spark.samples.StreamingTwitter
demo.setConfig("twitter4j.oauth.consumerKey",consumerKey)
demo.setConfig("twitter4j.oauth.consumerSecret",consumerSecret)
demo.setConfig("twitter4j.oauth.accessToken",accessToken)
demo.setConfig("twitter4j.oauth.accessTokenSecret",accessTokenSecret)
demo.setConfig("watson.tone.url","https://gateway.watsonplatform.net/tone-analyzer/api")
demo.setConfig("watson.tone.password",taPassword)
demo.setConfig("watson.tone.username",taUserName)

import org.apache.spark.streaming._
demo.startTwitterStreaming(sc, Seconds(30))  //Run the application for a limited time

Starting twitter stream
Twitter stream started
Tweets are collected real-time and analyzed
To stop the streaming and start interacting with the data use: StreamingTwitter.stopTwitterStreaming
Receiver Started: TwitterReceiver-0
Batch started with 62 records
Batch completed with 62 records
Batch started with 165 records
Batch completed with 165 records
Batch started with 172 records
Batch completed with 172 records
Batch started with 200 records


#### I created a tweets DataFrame from the data fetched above and transferred it to Python. No need for pandas.


In [5]:
%%scala
val demo = com.ibm.cds.spark.samples.StreamingTwitter
val (__sqlContext, __df) = demo.createTwitterDataFrames(sc)

Stopping Twitter stream. Please wait this may take a while
Receiver Stopped: TwitterReceiver-0
Reason:  : Stopped by driver
Batch completed with 200 records
Twitter stream stopped
You can now create a sqlContext and DataFrame with 72 Tweets created. Sample usage: 
val (sqlContext, df) = com.ibm.cds.spark.samples.StreamingTwitter.createTwitterDataFrames(sc)
df.printSchema
sqlContext.sql("select author, text from tweets").show
A new table named tweets with 72 records has been correctly created and can be accessed through the SQLContext variable
Here's the schema for tweets
root
 |-- author: string (nullable = true)
 |-- userid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- text: string (nullable = true)
 |-- lat: double (nullable = true)
 |-- long: double (nullable = true)
 |-- Anger: double (nullable = true)
 |-- Disgust: double (nullable = true)
 |-- Fear: double (nullable = true)
 |-- Joy: double (nullable = true)
 |-- Sadness: d

#### I grouped the tweets by author and userid and averaged the Anger and Disgust scores for each user

In [6]:
import pyspark.sql.functions as F
usersDF = __df.groupby("author", "userid").agg(F.avg("Anger").alias("Anger"), F.avg("Disgust").alias("Disgust"))
usersDF.show()

+---------------+---------------+-----------------+-----------------+
|         author|         userid|            Anger|          Disgust|
+---------------+---------------+-----------------+-----------------+
|         KYLER.|       YEOGISEO|              5.0|             27.0|
|     Nicholarse|NickMargerrison|             13.0|              2.0|
|  Coleen Rooney|  ColeenRooney1|              1.0|              3.0|
|   Curated News| UZoneIndonesia|             10.0|              1.0|
|     Webcam Men|  WebcamMencouk|             10.0|              6.0|
|   Edvard Busic|    EdvardBusic|              4.0|              5.0|
|      don miles|       MilesDon|              3.0|             17.0|
|      Big Mitch|      mitch3835|             25.0|              9.0|
|  TroubleMannnn|    DCorbin1994|              8.0|              2.0|
|   Vigaroo News|    VigarooNews|             11.0|             12.0|
|     Big Moolah|     Big_Moolah|              9.0|              9.0|
| Yes Lord Radio|   
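
For readers new to Spark, the group-and-average step can be pictured in plain Python. This is only an illustrative sketch on made-up rows, not how Spark executes the aggregation; the helper `average_by_user` and the sample values are hypothetical.

```python
from collections import defaultdict


def average_by_user(rows, columns=("Anger", "Disgust")):
    """Group row dicts by (author, userid) and average the given score columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["author"], row["userid"])].append(row)

    result = {}
    for key, members in groups.items():
        result[key] = {
            col: sum(m[col] for m in members) / float(len(members))
            for col in columns
        }
    return result


# Made-up rows, for illustration only.
sample = [
    {"author": "A", "userid": "a1", "Anger": 10.0, "Disgust": 2.0},
    {"author": "A", "userid": "a1", "Anger": 20.0, "Disgust": 4.0},
    {"author": "B", "userid": "b1", "Anger": 5.0, "Disgust": 1.0},
]
averages = average_by_user(sample)
```

Users with a single tweet in the sample simply keep that tweet's scores as their average.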

#### Here I set up the Twitter API client from the python-twitter module

In [11]:
import twitter
api = twitter.Api(consumer_key=consumerKey,
                  consumer_secret=consumerSecret,
                  access_token_key=accessToken,
                  access_token_secret=accessTokenSecret)

#print(api.VerifyCredentials())

#### I fetched up to 200 of the most recent tweets for each author, excluding retweets and replies

In [12]:
def getTweets(screenName):
    statuses = api.GetUserTimeline(screen_name=screenName,
                        since_id=None,
                        max_id=None,
                        count=200,
                        include_rts=False,
                        trim_user=False,
                        exclude_replies=True)
    return statuses

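# Go through .rdd so the call also works on Spark 2.x, where DataFrame no longer exposes flatMap.
# Each output record is (screen_name, tweet text with non-ASCII characters dropped).
usersWithTweetsRDD = usersDF.rdd.flatMap(
    lambda row: [(status.user.screen_name, status.text.encode('ascii', 'ignore'))
                 for status in getTweets(row['userid'])])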
print(usersWithTweetsRDD.count())

7529


#### I concatenated all the tweets for each user, keeping only users with more than 100 words, so that there is enough text to send to Watson Personality Insights

In [13]:
import re
usersWithTweetsRDD2 = usersWithTweetsRDD.map(lambda s: (s[0], s[1])).reduceByKey(lambda s,t: s + '\n' + t)\
    .filter(lambda s: len(re.findall(r'\w+', s[1])) > 100 )
print(usersWithTweetsRDD2.count())
#usersWithTweetsRDD2.take(2)

62
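
The same combine-then-filter logic can be tried locally without a Spark cluster. The sketch below is a plain-Python stand-in for `reduceByKey` and `filter`; the helper names and sample data are hypothetical. It joins each user's tweets with newlines and keeps only users with more than `min_words` words, mirroring the threshold of 100 used above.

```python
import re
from collections import OrderedDict


def concatenate_by_user(pairs):
    """Join all tweet texts per user with newlines (local analogue of reduceByKey)."""
    combined = OrderedDict()
    for user, text in pairs:
        if user in combined:
            combined[user] = combined[user] + "\n" + text
        else:
            combined[user] = text
    return combined


def word_count(text):
    """Count word tokens with the same regular expression the notebook uses."""
    return len(re.findall(r"\w+", text))


def users_with_enough_text(pairs, min_words=100):
    """Keep only users whose concatenated text has more than `min_words` words."""
    combined = concatenate_by_user(pairs)
    return {u: t for u, t in combined.items() if word_count(t) > min_words}


# Tiny made-up sample with a low threshold so the effect is visible.
pairs = [("alice", "hello world"), ("bob", "one"), ("alice", "three more words")]
kept = users_with_enough_text(pairs, min_words=3)
```

In this sample only `alice` survives the filter, because her two tweets together contain five words while `bob` has one.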


#### I called Watson Personality Insights on the text for each author

In [14]:
from pyspark.sql.types import *
from watson_developer_cloud import PersonalityInsightsV3
broadCastPIUsername = sc.broadcast(piUserName)
broadCastPIPassword = sc.broadcast(piPassword)
def getPersonalityInsight(text, schema=False):
    personality_insights = PersonalityInsightsV3(
          version='2016-10-20',
          username=broadCastPIUsername.value,
          password=broadCastPIPassword.value)
    try:
        p = personality_insights.profile(
            text, content_type='text/plain',
            raw_scores=True, consumption_preferences=True)

        if schema:
            return \
                [StructField(t['name'], FloatType()) for t in p["needs"]] + \
                [StructField(t['name'], FloatType()) for t in p["values"]] + \
                [StructField(t['name'], FloatType()) for t in p['personality' ]]
        else:
            return \
                [t['raw_score'] for t in p["needs"]] + \
                [t['raw_score'] for t in p["values"]] + \
                [t['raw_score'] for t in p['personality']]   
    except Exception:
        # Best effort: on any API error return an empty list so the user is filtered out below
        return []

usersWithPIRDD = usersWithTweetsRDD2.map(lambda s: [s[0]] + getPersonalityInsight(s[1])).filter(lambda s: len(s)>1)
print(usersWithPIRDD.count())
#usersWithPIRDD.take(2)

58
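
To make the flattening step concrete, here is a small sketch that runs without calling the Watson service. It assumes a profile shaped like the one the function above reads, that is, `needs`, `values` and `personality` lists whose items carry `name` and `raw_score`. The helper `flatten_profile` and the numbers in `mock_profile` are invented for illustration.

```python
def flatten_profile(profile, sections=("needs", "values", "personality")):
    """Return (names, scores) for all traits, in the order the sections are listed."""
    names, scores = [], []
    for section in sections:
        for trait in profile.get(section, []):
            names.append(trait["name"])
            scores.append(trait["raw_score"])
    return names, scores


# Invented values; a real response contains many more traits per section.
mock_profile = {
    "needs": [{"name": "Challenge", "raw_score": 0.75}],
    "values": [{"name": "Hedonism", "raw_score": 0.6}],
    "personality": [{"name": "Openness", "raw_score": 0.8}],
}
names, scores = flatten_profile(mock_profile)
row = ["some_user"] + scores  # one record: user id followed by its raw scores
```

Because names and scores are collected in the same pass, the column order of the schema and the value order of each row cannot drift apart.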


#### I converted the Resilient Distributed Dataset (RDD) back to a DataFrame and called PixieDust display() to visualize the results

In [15]:
#convert to dataframe
schema = StructType(
    [StructField('userid',StringType())] + getPersonalityInsight(usersWithTweetsRDD2.take(1)[0][1], schema=True)
)

usersWithPIDF = sqlContext.createDataFrame(
    usersWithPIRDD, schema
)

usersWithPIDF.cache()
display(usersWithPIDF)

userid,Challenge,Closeness,Curiosity,Excitement,Harmony,Ideal,Liberty,Love,Practicality,Self-expression,Stability,Structure,Conservation,Openness to change,Hedonism,Self-enhancement,Self-transcendence,Openness,Conscientiousness,Extraversion,Agreeableness,Emotional range
CfzJb,0.75979077816,0.818263471127,0.855106532574,0.759127140045,0.850096285343,0.752394139767,0.793208181858,0.796184062958,0.750640392303,0.694531321526,0.780816137791,0.730340301991,0.695895135403,0.836819827557,0.74822473526,0.730241656303,0.846361875534,0.769867002964,0.643722832203,0.551995515823,0.726228058338,0.51096367836
marawan851980,0.685935139656,0.772022247314,0.823568224907,0.49182215333,0.818374574184,0.644522130489,0.720162808895,0.699367880821,0.706054806709,0.643603682518,0.758623301983,0.700499594212,0.630316734314,0.733722627163,0.566103696823,0.600021123886,0.83362030983,0.733953356743,0.654639422894,0.404366225004,0.720687925816,0.49599018693
WaypointGuild,0.722791075706,0.729157269001,0.800780534744,0.651420176029,0.779681622982,0.656615018845,0.718829870224,0.70237237215,0.734766304493,0.636616706848,0.693978190422,0.698047637939,0.625422656536,0.760382175446,0.70645827055,0.696341633797,0.820253908634,0.766944587231,0.61107814312,0.506716787815,0.663859605789,0.474881708622
FKittlerbot,0.644196391106,0.686391294003,0.836667537689,0.530090630054,0.735473692417,0.592344939709,0.674027979374,0.68900257349,0.699688196182,0.623851060867,0.656437933445,0.675355255604,0.533590555191,0.785522997379,0.595792591572,0.633947134018,0.817435979843,0.862760066986,0.616503119469,0.501579940319,0.630539119244,0.490947753191
JazzyyQuintero,0.770479857922,0.852511286736,0.866814494133,0.870203197002,0.845277428627,0.810212016106,0.850557863712,0.836387217045,0.762621819973,0.765399694443,0.78122985363,0.706307709217,0.707459568977,0.850067734718,0.833122551441,0.766277730465,0.873659074306,0.740658581257,0.619791924953,0.596415996552,0.706162035465,0.430294901133
GermanJuice___,0.682165145874,0.879219889641,0.829231798649,0.753856599331,0.859576940536,0.715872883797,0.756660103798,0.82734811306,0.739323139191,0.698077738285,0.756651222706,0.688493609428,0.652733922005,0.812431871891,0.801630437374,0.687346339226,0.848086416721,0.738107264042,0.534128606319,0.469097673893,0.725022375584,0.393051087856
Theyre_AFs,0.640417277813,0.685050010681,0.766379237175,0.55161857605,0.693106949329,0.578539133072,0.667595088482,0.596425533295,0.681704223156,0.654808759689,0.642943561077,0.639998793602,0.564257800579,0.779699206352,0.661161065102,0.655634224415,0.799363732338,0.836124241352,0.623574256897,0.525433540344,0.739377200603,0.451999276876
Tsebeleke,0.762700974941,0.847661077976,0.830913305283,0.759373664856,0.848404705524,0.725499272346,0.76967382431,0.793786287308,0.752219676971,0.685307383537,0.772594809532,0.708025515079,0.722093343735,0.809534549713,0.782219231129,0.716865897179,0.830991566181,0.7231118083,0.638504564762,0.543078958988,0.771303534508,0.458847552538
pragniks,0.673166334629,0.696778774261,0.794139504433,0.5866766572,0.761285007,0.650122046471,0.709991574287,0.723803818226,0.728226542473,0.65439581871,0.694537639618,0.658301651478,0.601528525352,0.766190230846,0.665008604527,0.698632955551,0.80671197176,0.75634843111,0.625524938107,0.580187916756,0.689650118351,0.511692047119
Big_Moolah,0.752878785133,0.737123548985,0.794979512691,0.668868899345,0.77105307579,0.680730223656,0.755392074585,0.631422758102,0.790079951286,0.71421200037,0.745378732681,0.74302226305,0.682443082333,0.744373500347,0.699754297733,0.741559147835,0.825237393379,0.791546702385,0.693256437778,0.582934975624,0.724512577057,0.599052369595


#### I computed the sentiment distribution by counting, for each sentiment, the tweets with a score greater than 60%, and used a matplotlib chart to visualize it.

In [None]:
tweets=__df
tweets.count()
display(tweets)

#create an array that will hold the count for each sentiment
sentimentDistribution=[0] * 13
#For each sentiment, run a sql query that counts the number of tweets for which the sentiment score is greater than 60%
#Store the data in the array
for i, sentiment in enumerate(tweets.columns[-13:]):
    sentimentDistribution[i]=__sqlContext.sql("SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60")\
        .collect()[0].sentCount

author,userid,date,lang,text,lat,long,Anger,Disgust,Fear,Joy,Sadness,Analytical,Confident,Tentative,Openness,Conscientiousness,Extraversion,Agreeableness,EmotionalRange
あべる,PlainAbel,Fri Feb 17 02:44:59 CST 2017,en,RT @Ieafsprodigy: Every niggas first catfish https://t.co/xOjYVeXGuf,0.0,0.0,1.0,2.0,4.0,63.0,34.0,0.0,94.0,0.0,47.0,38.0,49.0,46.0,0.0
Kittlerbot,FKittlerbot,Fri Feb 17 02:45:00 CST 2017,en,On the basis of this experiment the mechanic Kruesi was given the calvary.,0.0,0.0,13.0,24.0,15.0,41.0,18.0,74.0,0.0,0.0,89.0,34.0,31.0,5.0,0.0
Edvard Busic,EdvardBusic,Fri Feb 17 02:45:00 CST 2017,en,@Polonca60 @mislavbusic Isto tako i porez na dobit i dohodak. Sve je krivo.,0.0,0.0,4.0,5.0,16.0,7.0,74.0,0.0,0.0,0.0,2.0,11.0,34.0,56.0,0.0
Anastasia Isadora,BitchyAnastasia,Fri Feb 17 02:45:00 CST 2017,en,https://t.co/31Z7AWoQe3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
James,WileyRova,Fri Feb 17 02:45:00 CST 2017,en,Media will say no Trump is trying to install a Autocracy. This is wrong. He is merely tool puppet cover of Bannon's design.,0.0,0.0,29.0,29.0,12.0,1.0,43.0,43.0,0.0,86.0,92.0,37.0,39.0,15.0,0.0
Jayline,pumpkinpielrh,Fri Feb 17 02:45:00 CST 2017,en,#5SOSFam #BestFanArmy #iHeartAwards . https://t.co/FxJH8ecy3l,0.0,0.0,15.0,2.0,11.0,65.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Curated News,UZoneIndonesia,Fri Feb 17 02:45:00 CST 2017,en,"Sakit Pinggang, Jangan Anggap Sepele https://t.co/o3kAN8drbE",0.0,0.0,10.0,1.0,7.0,76.0,11.0,0.0,0.0,0.0,25.0,27.0,54.0,60.0,0.0
iven,ailven28,Fri Feb 17 02:45:00 CST 2017,en,https://t.co/qUYti5lWjY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jay Bee,CfzJb,Fri Feb 17 02:45:00 CST 2017,en,RT @CfzJb: Received my first offer #taborcollege #taborcollegefootball https://t.co/3rOet4MxiE,0.0,0.0,7.0,2.0,4.0,87.0,4.0,0.0,0.0,0.0,8.0,65.0,23.0,70.0,0.0
Big Mitch,mitch3835,Fri Feb 17 02:45:00 CST 2017,en,Lemmie just take a guess and assume this guy has a thing for basic math https://t.co/fIYUD3awXx,0.0,0.0,25.0,9.0,6.0,37.0,31.0,0.0,0.0,90.0,59.0,65.0,40.0,1.0,0.0
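
The plotting cell itself is not shown here, so the following is only a sketch of what a matplotlib bar chart of the distribution might look like. It assumes the per-sentiment counts are available as a plain list; the function names and sample rows are hypothetical. The counting is reproduced in plain Python so the example runs without Spark, and matplotlib is imported lazily inside the plotting function.

```python
def count_above_threshold(rows, sentiments, threshold=60):
    """Count, for each sentiment, the rows whose score exceeds `threshold`."""
    return [sum(1 for r in rows if r.get(s, 0) > threshold) for s in sentiments]


def plot_distribution(sentiments, counts, threshold=60):
    """Draw a bar chart of the counts; requires matplotlib to be installed."""
    import matplotlib.pyplot as plt

    positions = list(range(len(sentiments)))
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(positions, counts)
    ax.set_xticks(positions)
    ax.set_xticklabels(sentiments, rotation=45, ha="right")
    ax.set_ylabel("Number of tweets")
    ax.set_title("Tweets with score > %d per sentiment" % threshold)
    fig.tight_layout()
    plt.show()


# Invented rows using three of the thirteen sentiment columns.
sentiments = ["Anger", "Joy", "Analytical"]
rows = [
    {"Anger": 1.0, "Joy": 63.0, "Analytical": 94.0},
    {"Anger": 25.0, "Joy": 37.0, "Analytical": 90.0},
    {"Anger": 70.0, "Joy": 10.0, "Analytical": 0.0},
]
counts = count_above_threshold(rows, sentiments)
# plot_distribution(sentiments, counts)  # uncomment inside the notebook
```

In the notebook, `sentimentDistribution` and `tweets.columns[-13:]` could be passed to `plot_distribution` in place of the sample lists.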
