In [1]:
import tweepy
import re
from textblob import TextBlob
from pandas import DataFrame, Series

<b>To access the Twitter API through Python, you need a developer account, which provides the following credentials for retrieving Twitter data.</b>

In [6]:
consumer_key = ''           # Enter your consumer key
consumer_secret = ''        # Enter your consumer secret
access_token = ''           # Enter your access token
access_token_secret = ''    # Enter your access token secret

# Authenticate with the user credentials and create an 'api' object for querying Twitter
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

<b>The Twitter API returns at most 100 of the latest tweets per request, so the count argument is used to specify how many are required.</b>

<b>The reason for setting the count to 50 is explained in the End Notes, though users can set it to any value up to 100.</b>

In [7]:
# 'trending' is the string used as the search keyword when pulling data from Twitter
trending = 'Cristiano Ronaldo'
public_tweets = api.search(trending,count = 50) # Twitter does not provide historical data older than 1 week
all_tt=[]
for tweet in public_tweets:
    all_tt.append(tweet.text)

print 'Number of Tweets pulled:',len(all_tt)

Number of Tweets pulled: 50


<h2> Due to the API restrictions noted above, we pull the desired number of tweets (if available) in several batches and store them as a list of tweets.</h2>

<b>The number of tweets depends on the 'counter' variable and the count argument in the API call, so by changing these two values the user can get the desired number of tweets.</b>

<b>The tweets are appended to the same list to keep later operations simple.</b>

In [8]:
counter = 5

for i in range(counter):

    lst = []
    for txt in all_tt:
        a = re.findall(r"@(\w+)", txt)    # extract the handles mentioned in the tweets pulled so far
        if len(a) > 0:
            lst.extend(a)
    # Use the last handle found as the marker for the next batch, continuing the cycle.
    # Note: the search API documents max_id as a numeric tweet ID, not a handle.
    nxt = '@' + str(lst[-1])
    print nxt
    public_tweets2 = api.search(trending, count=50, max_id=nxt)

    for tweet in public_tweets2:
        all_tt.append(tweet.text)

print 'The grand total of Tweets that could be pulled:', len(all_tt)

@Todo_atleti
@LaLiga_aldia
@LaLiga_aldia
@LaLiga_aldia
@TrollFootball
The grand total of Tweets that could be pulled: 300
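
Since `max_id` is documented as a tweet ID, the batching idea above can also be expressed by paging backwards on the numeric IDs of the tweets already received. The following is a minimal sketch, not the code that produced the output above. The name `pull_tweets` is hypothetical, and `search` is assumed to be any callable that accepts `q`, `count` and `max_id` keyword arguments and returns objects with `.id` and `.text` attributes (with tweepy 3.x, `api.search` should fit that shape).

```python
def pull_tweets(search, query, batches=5, count=50):
    """Collect tweet texts over several requests, paging backwards by tweet ID.

    `search` is an assumed callable: search(q=..., count=..., max_id=...)
    returning a sequence of objects that expose `.id` and `.text`.
    """
    texts = []
    max_id = None                      # no upper bound on the first request
    for _ in range(batches):
        kwargs = {'q': query, 'count': count}
        if max_id is not None:
            kwargs['max_id'] = max_id
        batch = search(**kwargs)
        if not batch:                  # nothing older is available
            break
        texts.extend(tweet.text for tweet in batch)
        # Ask next time only for tweets strictly older than the oldest one seen
        max_id = min(tweet.id for tweet in batch) - 1
    return texts
```

Under those assumptions, a call such as `pull_tweets(api.search, trending, batches=5, count=50)` would play the role of the loop above without requesting the same tweet twice.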


<h1>Removing Duplicates</h1>
Retweets are another major issue, as they make up a large percentage of the total tweets. These duplicates have to be removed; otherwise they only add redundant data and provide no additional information.

In [9]:
all_tt = list(set(all_tt))   # drop exact duplicate tweets
clean = []
for x in all_tt:
    # Replace @mentions, non-alphanumeric characters and URLs with spaces, then collapse whitespace
    test = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", x).split())
    test = test.replace('RT', '')  # Remove the 'RT' (retweet) marker
    clean.append(test)
print 'Total number of unique tweets:', len(clean)

Total number of unique tweets: 97
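
To see what the cleaning expression does to a single tweet, the step can be wrapped in a small function. This is a sketch of a variant, not the cell above: the names `clean_tweet` and `unique_clean` are hypothetical, the 'RT' marker is removed only when it stands alone as a word (so a word such as 'SMART' is left intact), and duplicates are dropped after cleaning rather than before.

```python
import re

# Same three alternatives as in the cell above: @mentions,
# non-alphanumeric characters, and URLs.
NOISE = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def clean_tweet(text):
    """Strip mentions, punctuation, URLs and a standalone 'RT' marker."""
    no_noise = re.sub(NOISE, ' ', text)
    no_rt = re.sub(r'\bRT\b', ' ', no_noise)
    return ' '.join(no_rt.split())     # collapse runs of whitespace

def unique_clean(tweets):
    """Clean every tweet and keep the first occurrence of each result."""
    seen = set()
    result = []
    for tweet in tweets:
        cleaned = clean_tweet(tweet)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result
```

Deduplicating after cleaning means that two retweets of the same text that differ only in the mentioned handle or the shortened URL collapse into one entry.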


<h2> Pre-processing the data (tweets)</h2>

In [10]:
with open ('stopwords.txt','r') as fn:
    stop_words=fn.read()
stop_words=stop_words.split()

In [11]:
# Characters to strip from the tweets: the digits 0-9 plus punctuation
nos = [str(x) for x in range(10)]

l = r'!"#$%&()*+,-/:;<=>?@[\]^_`{|}~'
ls = list(l)

sp = nos + ls

In [12]:
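from nltk.tokenize import word_tokenize

def rmv(lst):
    """Strip every digit and punctuation character listed in `sp` from each string."""
    test = []
    for txt in lst:
        for letter in txt:
            if letter in sp:
                txt = txt.replace(letter, '')
        test.append(txt)
    return test

def rm_spch(s):
    """Remove stop words, then digits and punctuation, from each tweet in `s`."""
    one = []

    for num in s:
        x = str(num)
        test = word_tokenize(x)

        # Keep only the tokens that are not stop words
        res = [xn for xn in test if xn not in stop_words]

        sen = ' '.join(res)
        one.append(sen)

    # Strip the unwanted characters once, after all tweets have been processed
    one = rmv(one)
    return one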

Performing the cleaning operation

In [13]:
clean = rm_spch(clean)

In case any unwanted data is still left, the user can specify the minimum tweet length required and remove the rest.

In [14]:
clean = [x for x in clean if len(x)>2]
print len(clean)

97


<b>Even after the above cleaning process, some duplicates may remain; this is just a precautionary cleaning step.</b>

In [15]:
clean = set(clean)
clean = list(clean)
print 'The final number of tweets:', len(clean)

The final number of tweets: 93


<h2> Tweet segregation using the 'textblob' package

In [16]:
pol = []
sub = []
for i in clean:
    first = TextBlob(i)
    pol.append(first.sentiment[0])    # Polarity of the respective tweet
    sub.append(first.sentiment[1])    # Subjectivity of the respective tweet

<b> Using a 'Counter' object to make sure that the tweets obtained are unique</b>

In [17]:
from collections import Counter as cnt
cnt(clean)

Counter({' Buen momento para recordar a Cristiano Ronaldo lo destap la prensa y la Ag Tributaria entre presiones y c': 1,
         'Aos  anos de idade Cristiano Ronaldo foi contratado pelo Nacional de Portugal por  bolas de futebol e  pares de chu': 1,
         'Arsenal paid m Andrey Arshavin Let remind club paid  m Cristiano Ronaldo': 1,
         'Bless women drew Cristiano Ronaldo feet What a great talent': 1,
         'CASPA A Declara CRISTIANO RONALDO pero ponen im genes de Messi Grandes altavoces blancos nacionales al servicio blanco h': 1,
         'Cadena SER Cristiano told judge today I m I m Cristiano Ronaldo If a mic drop m': 1,
         'Coche Cristiano Ronaldo entrando a declarar': 1,
         'Como cuando CR se queda fuera de su habitaci n de hotel y en calzoncillo': 1,
         'Comunicado de Cristiano Ronaldo hubo intenci n de evadir impuestos': 1,
         'Cristiano Ronaldo Appears In Court On Tax Charges': 1,
         'Cristiano Ronaldo Best Dribbling Skills  Full HD 

<h2>OPTIONAL

<b> Users can store/output the above data as a .csv file.

In [18]:
di = {'Polarity':pol,'Subjectivity':sub, 'Text':clean}
df = DataFrame(di)
df.to_csv('tweets.csv', index = False)

<b> Users can also segregate the tweets by polarity according to their own criteria or thresholds. The code below is just an example.</b>

In [19]:
classification = []
for cl in pol:
    if cl > 0.03:
        clf = 'positive'
    elif cl < 0:
        clf = 'negative'
    else:
        clf = 'neutral'
    classification.append(clf)
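
Since the thresholds are meant to be adjusted, the same rule can be written as a function with the cut-offs as parameters. This is a minimal sketch: the name `classify` and the parameter names are hypothetical, and the defaults are the 0.03 and 0 values used above.

```python
def classify(polarity, pos_threshold=0.03, neg_threshold=0.0):
    """Label a polarity score as 'positive', 'negative' or 'neutral'."""
    if polarity > pos_threshold:
        return 'positive'
    if polarity < neg_threshold:
        return 'negative'
    return 'neutral'                   # scores between the two thresholds

def classify_all(polarities, pos_threshold=0.03, neg_threshold=0.0):
    """Apply `classify` to a list of polarity scores."""
    return [classify(p, pos_threshold, neg_threshold) for p in polarities]
```

With the defaults, `classify_all(pol)` would give the same labels as the loop above; widening the gap between the two thresholds moves more tweets into the neutral class.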

<h2> To check the segregation categories and their respective counts.

In [20]:
classification
Series(classification).value_counts()

neutral     83
positive     8
negative     2
dtype: int64

<h2>OPTIONAL

<b> Users can store/output the above data as a .csv file.

In [21]:
di_clf = {'Class':classification, 'Text':clean}
df_clf = DataFrame(di_clf)
df_clf.to_csv('tweets_classified.csv', index = False)

<h1> End Notes

<li>As mentioned above, the Twitter API only returns data up to 1 week old, so always try to fetch the data while the topic is trending.
<li>As the number of tweets requested increases, the fetching time also increases, so adjust the parameters accordingly or be patient.
<li>We have kept the 'count' argument in the API call at 50, as in our experience this gives the best results. Users can experiment with that number, but a value below 80 is strongly recommended.
<li>To collect enough data, the user must keep the 'counter' variable reasonably high, since only about 10% of the data is unique and the rest is retweeted.
<li>Our own trials have shown that a lower 'count' value results in a higher percentage of unique tweets, though this does not always hold.
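
When experimenting with the 'count' and 'counter' values, it may help to measure the share of unique tweets in each pull rather than estimate it. The helper below is a small sketch with a hypothetical name, `unique_share`; it compares exact tweet texts only, so near-duplicates that differ by a handle or URL count as distinct.

```python
def unique_share(tweets):
    """Return the fraction of tweets whose text is not an exact repeat."""
    if not tweets:
        return 0.0
    # float() keeps the division exact under Python 2 as well
    return len(set(tweets)) / float(len(tweets))
```

Calling it on the raw list of pulled tweets, before any duplicates are removed, gives one number per run that can be compared across different settings.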
