Collect text data using Twitter APIs.
--------------------------------------------------

Many free APIs let us collect data and use it to solve problems. Here we focus on the Twitter API, because its data feeds many NLP applications such as product-review mining and sentiment analysis.

Problem
------------
You want to collect text data using Twitter APIs.

Solution
------------
Twitter holds a huge amount of data with a lot of value in it, and social
media marketers make their living from it. An enormous number of tweets
are posted every day, and every tweet has some story to tell. When this
data is collected and analyzed, it gives a business valuable insight into
its company, products, services, etc.

How It Works
-------------------
Log in to the Twitter developer portal (https://developer.twitter.com/)

Create your own app in the Twitter developer portal, and get the keys
mentioned below. Once you have these credentials, you can start pulling
data. Keys needed are:

> • consumer key: Key that identifies the application to the service (Twitter, Facebook, etc.).

> • consumer secret: Password the application uses to authenticate with the authentication server (Twitter, Facebook, etc.).

> • access token: Key given to the client after successful authentication with the above keys.

> • access token secret: Password for the access token.
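
Because these four values act like passwords, one option is to keep them out of the notebook and read them from environment variables. The sketch below is only an illustration: the variable names and the `load_credentials` helper are assumptions, not part of Tweepy or of Twitter's tooling.

```python
import os

# Hypothetical variable names; choose whatever fits your setup.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_VARS}


# Demonstration with a plain dict standing in for the real environment.
demo_env = {name: "placeholder-{}".format(i) for i, name in enumerate(CREDENTIAL_VARS)}
creds = load_credentials(demo_env)
print(sorted(creds))
```

The returned values can then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of string literals.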

Useful links :
-----------------
https://iag.me/socialmedia/how-to-create-a-twitter-app-in-8-easy-steps/

https://developer.twitter.com/en/docs/tweets/sample-realtime/overview/GET_statuse_sample

In [2]:
!pip install tweepy



In [5]:
# Once all the credentials are in place, use the code below to fetch the data.

# Install tweepy
# !pip install tweepy

# Import the libraries
import numpy as np
import tweepy
import json
import pandas as pd
from tweepy import OAuthHandler

# Credentials --> put your own credentials here.
# Never publish real keys; placeholders are shown below.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# calling API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Provide the query you want to pull data for. For example,
# pulling data for "republic day"
query = "republic day"

# Fetching tweets
# (Tweepy 4.x renamed api.search to api.search_tweets.)
Tweets = api.search(query, count=10, lang='en', exclude='retweets', tweet_mode='extended')

for tweet in Tweets:
    print(tweet)
    print("----------------------")

# The query above pulls up to 10 tweets matching the term "republic day".
# The API returns English tweets because the language given is 'en',
# and exclude='retweets' asks it to leave out retweets.

Status(_api=<tweepy.api.API object at 0x0000000A038EC390>, _json={'created_at': 'Sun Jan 26 05:47:26 +0000 2020', 'id': 1221308643506708480, 'id_str': '1221308643506708480', 'full_text': '@SakshiMalik Happy republic day', 'truncated': False, 'display_text_range': [13, 31], 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'SakshiMalik', 'name': 'Sakshi Malik', 'id': 728173090027520000, 'id_str': '728173090027520000', 'indices': [0, 12]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': 1221303407538917376, 'in_reply_to_status_id_str': '1221303407538917376', 'in_reply_to_user_id': 728173090027520000, 'in_reply_to_user_id_str': '728173090027520000', 'in_reply_to_screen_name': 'SakshiMalik', 'user': {'id': 1147806560288043009, 'id_str': '1147806560288043009', 'name': '_sujit_RISER', 'screen_name': 'RiserSujit'
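
The printed Status object is hard to read, but it shows that the raw response is kept as a dictionary under `_json`, and that with tweet_mode='extended' the text lives under 'full_text'. As a small sketch, the snippet below pulls a few fields out of a trimmed copy of that dictionary; the `summarize` helper is hypothetical and most keys are left out.

```python
# A trimmed copy of the raw payload printed above (most keys omitted).
raw = {
    'created_at': 'Sun Jan 26 05:47:26 +0000 2020',
    'id_str': '1221308643506708480',
    'full_text': '@SakshiMalik Happy republic day',
    'display_text_range': [13, 31],
    'user': {'name': '_sujit_RISER', 'screen_name': 'RiserSujit'},
}


def summarize(payload):
    """Pick out the fields we usually care about from one tweet payload."""
    start, end = payload['display_text_range']
    return {
        'id': payload['id_str'],
        'author': payload['user']['screen_name'],
        'text': payload['full_text'],
        # The display range skips the leading @mention of a reply.
        'display_text': payload['full_text'][start:end],
    }


summary = summarize(raw)
print(summary['author'])        # RiserSujit
print(summary['display_text'])  # Happy republic day
```

With a real result, the same dictionary is available as `tweet._json`.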

Getting the Tweets + Some Attributes
------------------------------------
In this section, we will get some tweets plus some of their related attributes and store them in a structured format.

A single api.search call returns at most 100 tweets. To collect more, which is what we want here, we wrap the call in tweepy.Cursor. The Cursor handles pagination for us: it requests a page of up to 100 tweets, hands them to us one by one, and then requests the next page.

For our purpose, the end result is that it keeps fetching tweets until we stop it by breaking out of the loop.
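
To make the idea of paging concrete, here is a minimal sketch that uses no network access. `fetch_page` is a stand-in for one search request and `iterate_items` plays the role of the Cursor; both names are made up for this illustration, and the real Cursor tracks its position in the results differently.

```python
from itertools import islice

PAGE_SIZE = 100


def fetch_page(page_number):
    """Stand-in for one search request: returns up to PAGE_SIZE fake tweets."""
    total_available = 250  # pretend the search has 250 results in total
    start = page_number * PAGE_SIZE
    stop = min(start + PAGE_SIZE, total_available)
    return ["tweet {}".format(n) for n in range(start, stop)]


def iterate_items():
    """Yield items one by one, requesting a new page whenever one runs out."""
    page_number = 0
    while True:
        page = fetch_page(page_number)
        if not page:
            return  # an empty page means there are no more results
        for item in page:
            yield item
        page_number += 1


first_120 = list(islice(iterate_items(), 120))  # stop early, like a break
everything = list(iterate_items())              # run until results are exhausted
print(len(first_120), len(everything))          # 120 250
```

The caller never sees the page boundaries, which is why a plain for loop over the items is enough.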

In [8]:
# start by creating an empty DataFrame with the columns we'll need
df = pd.DataFrame(columns = ['Tweets', 'User', 'User_statuses_count', 
                             'user_followers', 'User_location', 'User_verified',
                             'fav_count', 'rt_count', 'tweet_date'])

In [9]:
# Next, let's define a function as follows.
def stream(data, file_name):
    i = 0
    for tweet in tweepy.Cursor(api.search, q=data, count=100, lang='en').items():
        # '\r' returns the cursor to the start of the line, so each counter
        # value overwrites the previous one instead of printing on a new line.
        print(i, end='\r')
        df.loc[i, 'Tweets'] = tweet.text
        df.loc[i, 'User'] = tweet.user.name
        df.loc[i, 'User_statuses_count'] = tweet.user.statuses_count  # number of tweets the user has posted
        df.loc[i, 'user_followers'] = tweet.user.followers_count
        df.loc[i, 'User_location'] = tweet.user.location
        df.loc[i, 'User_verified'] = tweet.user.verified
        df.loc[i, 'fav_count'] = tweet.favorite_count  # number of times this tweet was liked
        df.loc[i, 'rt_count'] = tweet.retweet_count
        df.loc[i, 'tweet_date'] = tweet.created_at
        i += 1
        if i == 1000:
            break
    # Save once, after the loop, instead of rewriting the file on every tweet.
    df.to_excel('{}.xlsx'.format(file_name))

Let's look at this function from the inside out:
------------------------------------------------
>First, we follow the same approach of reading each tweet in a for loop, but this time from tweepy.Cursor.

>Inside tweepy.Cursor, we pass api.search and the parameters we want: q = data: data is whatever text we pass into the stream function for api.search to search for, just as we passed "republic day" as the query in the previous example.

count = 100: this sets the number of tweets returned per api.search request to 100, which is the maximum possible.

lang = 'en': this filters the results to tweets in English only.

Next, I fill my DataFrame with the attributes I am interested in, writing one row per iteration with the Pandas .loc method and my i counter.

The attributes I pass into each column are self-explanatory; the Twitter API documentation lists the other available attributes, which you can experiment with.

>Finally, I save the result to an Excel file using "df.to_excel". I use a {} placeholder instead of hard-coding the file name inside the function because I want to name the file myself when I run the function.

Now I can call my function as follows, searching for tweets about "JP Morgan India" and naming my file "my_tweets."

Since we put api.search into tweepy.Cursor, it does not stop at the first 100 tweets. It keeps requesting pages until the search results run out (or a rate limit is hit), which is why we use i as a counter to stop the loop after 1000 iterations.
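
An alternative to filling the DataFrame cell by cell is to collect one dictionary per tweet and build the table at the end. The sketch below shows only that collection step, using fake tweet objects so that it runs without credentials; `tweet_to_row`, `collect_rows` and the fake data are assumptions made for this illustration.

```python
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Map one tweet object to a flat dict using the column names from above."""
    return {
        'Tweets': tweet.text,
        'User': tweet.user.name,
        'User_statuses_count': tweet.user.statuses_count,
        'user_followers': tweet.user.followers_count,
        'User_location': tweet.user.location,
        'User_verified': tweet.user.verified,
        'fav_count': tweet.favorite_count,
        'rt_count': tweet.retweet_count,
        'tweet_date': tweet.created_at,
    }


def collect_rows(tweets, limit):
    """Collect at most `limit` rows from any iterable of tweets."""
    rows = []
    for tweet in tweets:
        if len(rows) >= limit:
            break
        rows.append(tweet_to_row(tweet))
    return rows


# Fake tweets standing in for the items a Cursor would yield.
fake_user = SimpleNamespace(name='someone', statuses_count=10,
                            followers_count=3, location='', verified=False)
fake_tweets = [
    SimpleNamespace(text='hello {}'.format(n), user=fake_user,
                    favorite_count=0, retweet_count=n, created_at='2019-11-03')
    for n in range(5)
]

rows = collect_rows(fake_tweets, limit=3)
print(len(rows), rows[0]['Tweets'])  # 3 hello 0
```

Passing the resulting list to `pd.DataFrame(rows)` builds the whole table in one step. Tweepy's `Cursor.items` also accepts an optional limit, so `.items(1000)` is another way to cap the loop.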

In [11]:
!pip install openpyxl

Collecting openpyxl
  Downloading https://files.pythonhosted.org/packages/6f/af/88ff9eef0b8f665aee1111ac6cede5ad12190c5bd726242bd2b26fc21b32/openpyxl-3.0.0.tar.gz (172kB)
Collecting jdcal
  Using cached https://files.pythonhosted.org/packages/f0/da/572cbc0bc582390480bbd7c4e93d14dc46079778ed915b505dc494b37c57/jdcal-1.4.1-py2.py3-none-any.whl
Collecting et_xmlfile
  Using cached https://files.pythonhosted.org/packages/22/28/a99c42aea746e18382ad9fb36f64c1c1f04216f41797f2f0fa567da11388/et_xmlfile-1.0.1.tar.gz
Installing collected packages: jdcal, et-xmlfile, openpyxl
    Running setup.py install for et-xmlfile: started
    Running setup.py install for et-xmlfile: finished with status 'done'
    Running setup.py install for openpyxl: started
    Running setup.py install for openpyxl: finished with status 'done'
Successfully installed et-xmlfile-1.0.1 jdcal-1.4.1 openpyxl-3.0.0


In [12]:
# calling the above function 
stream(data = ['JP Morgan India'], file_name = 'my_tweets')

113

In [13]:
# view first 5 records
df.head()

Unnamed: 0,Tweets,User,User_statuses_count,user_followers,User_location,User_verified,fav_count,rt_count,tweet_date
0,RT @narendramodi: Very good interaction with t...,.,99130,761,"Mumbai, India",False,0,2577,2019-11-03 03:00:04
1,RT @narendramodi: Very good interaction with t...,Aman Srivastava,37255,211,"Gyanpur, India",False,0,2577,2019-11-02 07:24:23
2,Omg guys there is a whole protest tmrw over CE...,vishan,9941,341,,False,5,0,2019-11-02 05:46:30
3,RT @narendramodi: Very good interaction with t...,BJP Mahila Morcha Chandrapur,851,190,,False,0,2577,2019-11-02 03:23:45
4,'The JP Morgan International Council which was...,Dawn. A. Alderson,51151,15,"Wales, United Kingdom",False,0,0,2019-11-01 21:58:04



Let's Analyze Some Tweets
--------------------------


In [15]:
!pip install TextBlob

Collecting TextBlob
  Downloading https://files.pythonhosted.org/packages/60/f0/1d9bfcc8ee6b83472ec571406bd0dd51c0e6330ff1a51b2d29861d389e85/textblob-0.15.3-py2.py3-none-any.whl (636kB)
Installing collected packages: TextBlob
Successfully installed TextBlob-0.15.3


In [16]:
# Importing TextBlob. It has a built-in sentiment property.
from textblob import TextBlob

# The sentiment property returns a named tuple of the form 
# Sentiment(polarity,subjectivity). The polarity score is a float 
# within the range [-1.0, 1.0]. 
# The subjectivity is a float within the range [0.0, 1.0] 
# where 0.0 is very objective and 1.0 is very subjective.

>I would like to add an extra column to this DataFrame that indicates the sentiment of a tweet.

>We will also need to add another column with the tweets stripped of useless symbols, then run the sentiment analyzer on those cleaned up tweets to be more effective.

In [17]:
# Let's start by writing our tweet-cleaning function:
import re
def clean_tweet(tweet):
    return ' '.join(re.sub(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)', ' ', tweet).split())

# The pattern replaces three things with a space: @mentions, any character
# that is not a letter, digit, space, or tab, and URLs.
# \w matches a word character and \W a non-word character;
# \s matches whitespace and \S non-whitespace.
# split() followed by ' '.join() collapses the runs of blanks left behind
# by the substitutions into single spaces.
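
To see what the pattern does, it helps to run it on a couple of short strings. The sketch below repeats the same pattern and compares it with a hypothetical variant that uses `@\w+` for mentions; the variant and the sample strings are assumptions added for illustration. Because `[A-Za-z0-9]` does not match underscores, the original pattern leaves part of a handle such as @dominos_india behind.

```python
import re

# The pattern used by clean_tweet.
ORIGINAL = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)'
# Hypothetical variant: \w also matches underscores inside handles.
VARIANT = r'(@\w+)|(\w+://\S+)|([^0-9A-Za-z \t])'


def clean(pattern, tweet):
    """Replace every match with a space, then collapse repeated blanks."""
    return ' '.join(re.sub(pattern, ' ', tweet).split())


sample = "@user Great news! Read more: https://t.co/abc123 #win"
handle = "@dominos_india I have been trying to order pizza"

print(clean(ORIGINAL, sample))  # Great news Read more win
print(clean(ORIGINAL, handle))  # india I have been trying to order pizza
print(clean(VARIANT, handle))   # I have been trying to order pizza
```

Note that the hashtag symbol is removed but the hashtag word itself is kept.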

In [18]:
# Let's also write our sentiment analyzer function:
def analyze_sentiment(tweet):
    analysis = TextBlob(tweet)
    if analysis.sentiment.polarity > 0:
        return 'Positive'
    elif analysis.sentiment.polarity ==0:
        return 'Neutral'
    else:
        return 'Negative'
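
The labelling step is independent of TextBlob: it only maps a polarity score in [-1.0, 1.0] to one of three labels. The sketch below isolates that mapping so it can be tried without installing anything. The `tolerance` parameter is an assumption that is not in the function above; with its default of 0.0 the behaviour matches analyze_sentiment, while a small positive value treats scores close to zero as neutral.

```python
def label_polarity(polarity, tolerance=0.0):
    """Map a polarity score to 'Positive', 'Neutral' or 'Negative'."""
    if polarity > tolerance:
        return 'Positive'
    if polarity < -tolerance:
        return 'Negative'
    return 'Neutral'


print(label_polarity(0.5))                   # Positive
print(label_polarity(0.0))                   # Neutral
print(label_polarity(-0.2))                  # Negative
print(label_polarity(0.05, tolerance=0.1))   # Neutral
```

In the notebook, the score to pass in would be `TextBlob(tweet).sentiment.polarity`.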

In [19]:
# Now let's create our new columns:
df['clean_tweet'] = df['Tweets'].apply(lambda x: clean_tweet(x))
df['Sentiment'] = df['clean_tweet'].apply(lambda x: analyze_sentiment(x))

In [20]:
# Let's look at some random rows to make sure our functions worked correctly.

# Example (row at index 100):
n=100
print('Original tweet:\n'+ df['Tweets'][n])
print()
print('Clean tweet:\n'+df['clean_tweet'][n])
print()
print('Sentiment:\n'+df['Sentiment'][n])

Original tweet:
RT @narendramodi: Very good interaction with the JP Morgan International Council, an illustrious gathering of top policy makers, thinkers,…

Clean tweet:
RT Very good interaction with the JP Morgan International Council an illustrious gathering of top policy makers thinkers

Sentiment:
Positive


In [28]:
# Let's look at another row.

# Example (row at index 25):
n=25
print('Original tweet:\n'+ df['Tweets'][n])
print()
print('Clean tweet:\n'+df['clean_tweet'][n])
print()
print('Sentiment:\n'+df['Sentiment'][n])


Original tweet:
After 12 years JP Morgan International Council meet held in India, PM Modi meets powerful leaders… https://t.co/F736oJRC2O

Clean tweet:
After 12 years JP Morgan International Council meet held in India PM Modi meets powerful leaders

Sentiment:
Positive


Unnamed: 0,Tweets,User,User_statuses_count,user_followers,User_location,User_verified,fav_count,rt_count,tweet_date,clean_tweet,Sentiment
26,@dominos_india I have been trying to order piz...,Rajguru,596,19,,False,0,0,2019-10-29 15:59:25,india I have been trying to order pizza since ...,Negative
44,India Has Unbelievable Skills But Needs to Tra...,Wadhwani Foundation,8738,6480,,False,0,0,2019-10-28 06:58:53,India Has Unbelievable Skills But Needs to Tra...,Negative


In [29]:
df[df.Sentiment == 'Negative'].shape[0]

2

In [30]:
df[df.Sentiment == 'Positive'].shape[0]

107

In [31]:
df[df.Sentiment == 'Neutral'].shape[0]

5
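
The three counts above can also be produced in one pass and turned into percentages. As a sketch that needs no DataFrame, the snippet below rebuilds a list of labels from the counts reported above (107 positive, 5 neutral, 2 negative) and tallies them with `collections.Counter`; the list is reconstructed for illustration only.

```python
from collections import Counter

# Labels rebuilt from the counts reported above, for illustration only.
labels = ['Positive'] * 107 + ['Neutral'] * 5 + ['Negative'] * 2

counts = Counter(labels)
total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)   # 114
print(shares)  # {'Positive': 93.9, 'Neutral': 4.4, 'Negative': 1.8}
```

On the DataFrame itself, `df['Sentiment'].value_counts()` gives the same tally in a single call.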

In [32]:
# Once all the credentials are in place, use the code below to fetch the data.

# Install tweepy
# !pip install tweepy

# Import the libraries
import numpy as np
import tweepy
import json
import pandas as pd
from tweepy import OAuthHandler

# Credentials --> put your own credentials here.
# Never publish real keys; placeholders are shown below.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# calling API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Provide the query you want to pull data for. For example,
# pulling data for "BJP Shiv Sena Alliance"
query = "BJP Shiv Sena Alliance"

# Fetching tweets
# (Tweepy 4.x renamed api.search to api.search_tweets.)
Tweets = api.search(query, count=10, lang='en', exclude='retweets', tweet_mode='extended')

for tweet in Tweets:
    print(tweet)
    print("----------------------")

# The query above pulls up to 10 tweets matching the term
# "BJP Shiv Sena Alliance". The API returns English tweets because the
# language given is 'en', and exclude='retweets' asks it to leave out retweets.

Status(_api=<tweepy.api.API object at 0x0000000012CFC5F8>, _json={'created_at': 'Sun Nov 03 05:04:33 +0000 2019', 'id': 1190857272647282689, 'id_str': '1190857272647282689', 'full_text': 'Fate Of Maharashtra Assembly Elections  BJP Alliance Shiv Sena  The Winner Astrologer Anil Aggarwala https://t.co/V2IE50Exny', 'truncated': False, 'display_text_range': [0, 124], 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/V2IE50Exny', 'expanded_url': 'https://www.astrodocanil.com/2019/11/fate-of-maharashtra-assembly-elections-bjp-alliance-shiv-sena-the-winner-astrologer-anil-aggarwala/', 'display_url': 'astrodocanil.com/2019/11/fate-o…', 'indices': [101, 124]}]}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_scre

In [33]:
df = pd.DataFrame(columns = ['Tweets', 'User', 'User_statuses_count', 
                             'user_followers', 'User_location', 'User_verified',
                             'fav_count', 'rt_count', 'tweet_date'])

In [34]:
def stream(data, file_name):
    i = 0
    for tweet in tweepy.Cursor(api.search, q=data, count=100, lang='en').items():
        print(i, end='\r')
        df.loc[i, 'Tweets'] = tweet.text
        df.loc[i, 'User'] = tweet.user.name
        df.loc[i, 'User_statuses_count'] = tweet.user.statuses_count  # number of tweets the user has posted
        df.loc[i, 'user_followers'] = tweet.user.followers_count
        df.loc[i, 'User_location'] = tweet.user.location
        df.loc[i, 'User_verified'] = tweet.user.verified
        df.loc[i, 'fav_count'] = tweet.favorite_count
        df.loc[i, 'rt_count'] = tweet.retweet_count
        df.loc[i, 'tweet_date'] = tweet.created_at
        i += 1
        if i == 500:
            break
    # Save once, after the loop, instead of rewriting the file on every tweet.
    df.to_excel('{}.xlsx'.format(file_name))

In [35]:
stream(data = ['BJP Shiv Sena Alliance'], file_name = 'myBJP_tweets')

499

In [36]:
df.head()

Unnamed: 0,Tweets,User,User_statuses_count,user_followers,User_location,User_verified,fav_count,rt_count,tweet_date
0,Fate Of Maharashtra Assembly Elections BJP Al...,Astrologer-astrodocanil-www.astrodocanil.com,1832,194,"New Delhi, India",False,0,0,2019-11-03 05:04:33
1,RT @NATRAJSHETTY: The only solution for stable...,Rajendra Seth,161,1,,False,0,7,2019-11-03 05:03:53
2,Fate Of Maharashtra Assembly Elections BJP All...,Astrologer-astrodocanil-www.astrodocanil.com,1832,194,"New Delhi, India",False,0,0,2019-11-03 05:03:28
3,RT @Panipat_1761: Hello..Shiv Sena? Gentle rem...,Haruko the Encouraging,75542,282,World of Slam Dunk,False,0,4,2019-11-03 04:58:04
4,RT @Panipat_1761: Hello..Shiv Sena? Gentle rem...,suchittayashaH,39685,6330,India,False,0,4,2019-11-03 04:57:18


In [37]:
from textblob import TextBlob

In [38]:
import re
def clean_tweet(tweet):
    return ' '.join(re.sub(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)', ' ', tweet).split())

In [39]:
def analyze_sentiment(tweet):
    analysis = TextBlob(tweet)
    if analysis.sentiment.polarity > 0:
        return 'Positive'
    elif analysis.sentiment.polarity ==0:
        return 'Neutral'
    else:
        return 'Negative'

In [40]:
df['clean_tweet'] = df['Tweets'].apply(lambda x: clean_tweet(x))
df['Sentiment'] = df['clean_tweet'].apply(lambda x: analyze_sentiment(x))

In [41]:
n=300
print('Original tweet:\n'+ df['Tweets'][n])
print()
print('Clean tweet:\n'+df['clean_tweet'][n])
print()
print('Sentiment:\n'+df['Sentiment'][n])


Original tweet:
RT @sri9011: BJP  made a  huge mistake by trusting Shiv Sena and fighting state elections in alliance.

BJP should have gone it alone as th…

Clean tweet:
RT BJP made a huge mistake by trusting Shiv Sena and fighting state elections in alliance BJP should have gone it alone as th

Sentiment:
Positive


In [42]:
df[df.Sentiment == 'Positive'].shape[0]

221

In [43]:
df[df.Sentiment == 'Negative'].shape[0]

67

In [44]:
df[df.Sentiment == 'Neutral'].shape[0]

212