# Collect Tweets from Twitter accounts

This notebook builds a small tool that extracts tweets through the Twitter API and processes them.

In [None]:
import tweepy
import pandas as pd
import re

## Extracting tweets into a dataframe

First, authenticate with the Twitter API using your credentials:

In [None]:
# Variables that contain the credentials to access the Twitter API
ACCESS_TOKEN = 'XXXXXX'
ACCESS_SECRET = 'XXXXXX'
CONSUMER_KEY = 'XXXXXX'
CONSUMER_SECRET = 'XXXXXX'


# Setup access to API
def connect_to_twitter_OAuth():
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    api = tweepy.API(auth)
    return api


# Create API object
api = connect_to_twitter_OAuth()
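
Hard-coded credentials are easy to leak when a notebook is shared. A minimal sketch, assuming the four values are stored in environment variables instead; the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_credentials` are hypothetical and not part of tweepy:

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
CREDENTIAL_VARS = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_secret': 'TWITTER_ACCESS_SECRET',
}


def load_twitter_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the constants defined above.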

Function to extract the desired information from the tweets and add it to a dataframe. Retweets are excluded.

In [None]:
# function to extract data from tweet objects
def extract_tweet_attributes(tweet_object):
    # create empty list
    tweet_list = []
    # loop through tweet objects
    for tweet in tweet_object:
        # skip retweets
        if (not tweet.retweeted) and ('RT @' not in tweet.full_text):
            # attributes taken from the tweet object
            text = tweet.full_text  # utf-8 text of tweet

            # append attributes to list
            tweet_list.append({'text': text})

    # create dataframe
    df = pd.DataFrame(tweet_list, columns=['text'])

    return df
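
The retweet filter can be tried out without calling the API. A minimal sketch, assuming stand-in objects built with `types.SimpleNamespace` that carry only the two attributes the filter reads (`retweeted` and `full_text`); the helper name `keep_original_tweets` and the sample texts are made up for illustration:

```python
from types import SimpleNamespace


def keep_original_tweets(tweets):
    """Return the texts of tweets that pass the retweet filter used above."""
    return [
        tweet.full_text
        for tweet in tweets
        if not tweet.retweeted and 'RT @' not in tweet.full_text
    ]


# Stand-ins for tweepy Status objects (hypothetical sample data).
sample = [
    SimpleNamespace(retweeted=False, full_text='Make yourself a priority.'),
    SimpleNamespace(retweeted=False, full_text='RT @someone: a retweeted quote'),
    SimpleNamespace(retweeted=True, full_text='Flagged as retweeted'),
]

print(keep_original_tweets(sample))  # ['Make yourself a priority.']
```

Note that the substring test also drops an original tweet that merely contains "RT @" somewhere in its text.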

Only the last 200 tweets of each account are extracted, to check whether their format complies with the customer's requirements.

Three Twitter accounts are defined:

* @GreatestQuotes
* @FeelingsQuote
* @quotepage


### @GreatestQuotes

In [None]:
gq_tweets = []

tweets = api.user_timeline(user_id=22256645, count=200, tweet_mode='extended')

for tweet in tweets:
  gq_tweets.append(tweet)

dfgq = extract_tweet_attributes(gq_tweets)
dfgq.head(3)

Unnamed: 0,text
0,"Change is the constant, the signal for rebirth..."
1,To weep is to make less the depth of grief. - ...
2,"As long as one keeps searching, the answers co..."


In [None]:
dfgq.iloc[0][0]

'Change is the constant, the signal for rebirth, the egg of the phoenix. - Christina Baldwin'

### @_FeelingsQuote_



In [None]:
fq_tweets = []

tweets = api.user_timeline(user_id=152856447, count=200, tweet_mode='extended')

for tweet in tweets:
  fq_tweets.append(tweet)

dffq = extract_tweet_attributes(fq_tweets)
dffq.head(3)

Unnamed: 0,text
0,Keep your circle small and be careful who you ...
1,Make yourself a priority.
2,"Stop stressing over it, just let it be, everyt..."


In [None]:
dffq.iloc[2][0]

'Stop stressing over it, just let it be, everything will be ok.'

### @quotepage

In [None]:
qp_tweets = []

tweets = api.user_timeline(user_id=23245396, count=200, tweet_mode='extended')

for tweet in tweets:
  qp_tweets.append(tweet)

dfqp = extract_tweet_attributes(qp_tweets)
dfqp.head(3)

Unnamed: 0,text
0,"""We should devote ourselves to being self-suff..."
1,"""Fear is a reaction. Courage is a decision."" -..."
2,"""Even when disagreeing with someone, choose go..."


In [None]:
dfqp.iloc[2][0]

'As long as one keeps searching, the answers come. - Joan Baez'

## Data preparation

### @GreatestQuotes

In [None]:
# Count the hashtags and the rows that contain at least one hashtag
list_hashtags = []
counthash = 0
countrow = 0
for ind in dfgq.index:
  for word in dfgq['text'][ind].split():
    if (word[:1] == '#'):
      counthash = counthash + 1
      list_hashtags.append(word)
  for word in dfgq['text'][ind].split():
    if (word[:1] == '#'):
      countrow = countrow + 1
      break

#list_hashtags
#counthash
countrow

0

In [None]:
# Count the mentions and the rows that contain at least one mention
list_mentions = []
countment = 0
countrow = 0
for ind in dfgq.index:
  for word in dfgq['text'][ind].split():
    if (word[:1] == '@'):
      countment = countment + 1
      list_mentions.append(word)
  for word in dfgq['text'][ind].split():
    if (word[:1] == '@'):
      countrow = countrow + 1
      break

#list_mentions
#countment
countrow

0
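
The hashtag and mention counts follow the same pattern, so both can be expressed with one helper. A minimal sketch on a plain list of strings; the function name `count_prefixed` and the sample texts are hypothetical:

```python
def count_prefixed(texts, prefix):
    """Return (matching words, rows with at least one matching word)."""
    total_words = 0
    rows_with_match = 0
    for text in texts:
        matches = [word for word in text.split() if word.startswith(prefix)]
        total_words += len(matches)
        if matches:
            rows_with_match += 1
    return total_words, rows_with_match


# Hypothetical sample data.
sample_texts = [
    'Stay curious. #quotes #life',
    'Thanks @someone for the reminder',
    'Make yourself a priority.',
]
print(count_prefixed(sample_texts, '#'))  # (2, 1)
print(count_prefixed(sample_texts, '@'))  # (1, 1)
```

With the dataframe above, `count_prefixed(dfgq['text'], '#')` would return both numbers in one call.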

In [None]:
#Count number of links present in the dataset
countrow = 0
substring = 'http'
for ind in dfgq.index:
  for word in dfgq['text'][ind].split():
    if word.count(substring):
      countrow = countrow + 1

countrow

0
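
The loop above counts every word that contains `http`, so a tweet with two links adds two to the total. A sketch that reports both the number of links and the number of texts containing at least one, assuming a link starts with `http://` or `https://` and runs until the next whitespace; `count_links` and the sample texts are hypothetical:

```python
import re

# Assumed link shape: scheme followed by any run of non-whitespace characters.
LINK_PATTERN = re.compile(r'https?://\S+')


def count_links(texts):
    """Return (number of links, number of texts containing a link)."""
    links_per_text = [len(LINK_PATTERN.findall(text)) for text in texts]
    total_links = sum(links_per_text)
    texts_with_links = sum(1 for n in links_per_text if n > 0)
    return total_links, texts_with_links


# Hypothetical sample data.
sample_texts = [
    'Read more: https://example.com/a and https://example.com/b',
    'No link in this one.',
]
print(count_links(sample_texts))  # (2, 1)
```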

In [None]:
#Add column length
dfgq['Length'] = dfgq.text.str.len()

In [None]:
#Number of quotes longer than requirement
len(dfgq[dfgq['Length']>280])

0

In [None]:
#Number of quotes shorter than requirement
len(dfgq[dfgq['Length']<21])

0
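
The two length checks can be combined into a single helper. A sketch assuming the bounds used in the cells above (texts shorter than 21 or longer than 280 characters are rejected); `length_violations` and the sample strings are hypothetical:

```python
MIN_LENGTH = 21   # texts shorter than this are counted as too short
MAX_LENGTH = 280  # texts longer than this are counted as too long


def length_violations(texts):
    """Return (too_short, too_long) counts for an iterable of strings."""
    texts = list(texts)
    too_short = sum(1 for text in texts if len(text) < MIN_LENGTH)
    too_long = sum(1 for text in texts if len(text) > MAX_LENGTH)
    return too_short, too_long


# Hypothetical sample data: one short, one acceptable, one long text.
sample_texts = ['Too short.', 'Make yourself a priority.', 'x' * 300]
print(length_violations(sample_texts))  # (1, 1)
```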

A first look shows that none of the sampled tweets from this account has an unacceptable format.

Additionally, the author must be removed from these tweets.

In [None]:
#example
dfgq.iloc[2][0]

'As long as one keeps searching, the answers come. - Joan Baez'

In [None]:
# This gives the positions of the desired character
example = list(dfgq.iloc[2][0])
c = '-'
print([pos for pos, char in enumerate(example) if char == c])

[50]


In [None]:
# Find the positions of all hyphens in each text
# (assigning the whole column at once also creates 'Total -' if it does not exist yet)
dfgq['Total -'] = dfgq['text'].apply(
    lambda text: [pos for pos, char in enumerate(text) if char == '-'])

In [None]:
# Check whether there are quotes with more than one hyphen
manyslash = dfgq.loc[dfgq['Total -'].str.len() > 1]
manyslash
# So only the text after the last hyphen must be removed

Unnamed: 0,text,Total -
17,It's better to look ahead and prepare than to ...,"[68, 82]"
47,All great achievements have one thing in commo...,"[48, 84]"
64,"Hell, there are no rules here - we're trying t...","[30, 70]"
69,Nothing builds self-esteem and self-confidence...,"[19, 35, 68]"
196,Self-trust is the first secret of success. - R...,"[4, 43]"
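
Since only the last hyphen separates the quote from its author, the text can be split once from the right. A minimal sketch using `str.rpartition`; the helper name `strip_author` is hypothetical, and the second sample string is made up to show a hyphen inside the quote itself:

```python
def strip_author(text):
    """Drop everything from the last '-' onwards; leave text without '-' as is."""
    quote, separator, _author = text.rpartition('-')
    if not separator:  # no hyphen found, nothing to remove
        return text
    return quote.rstrip()


print(strip_author('As long as one keeps searching, the answers come. - Joan Baez'))
# As long as one keeps searching, the answers come.
print(strip_author('Self-made plans rarely fail. - Some Author'))
# Self-made plans rarely fail.
```

A quote that contains a hyphen but has no author suffix would be truncated by this rule, so it only suits accounts that always append the author.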


### @_FeelingsQuote_

In [None]:
# Count the hashtags and the rows that contain at least one hashtag
list_hashtags = []
counthash = 0
countrow = 0
for ind in dffq.index:
  for word in dffq['text'][ind].split():
    if (word[:1] == '#'):
      counthash = counthash + 1
      list_hashtags.append(word)
  for word in dffq['text'][ind].split():
    if (word[:1] == '#'):
      countrow = countrow + 1
      break

#list_hashtags
#counthash
countrow

0

In [None]:
# Count the mentions and the rows that contain at least one mention
list_mentions = []
countment = 0
countrow = 0
for ind in dffq.index:
  for word in dffq['text'][ind].split():
    if (word[:1] == '@'):
      countment = countment + 1
      list_mentions.append(word)
  for word in dffq['text'][ind].split():
    if (word[:1] == '@'):
      countrow = countrow + 1
      break

#list_mentions
#countment
countrow

1

In [None]:
#Count number of links present in the dataset
countrow = 0
substring = 'http'
for ind in dffq.index:
  for word in dffq['text'][ind].split():
    if word.count(substring):
      countrow = countrow + 1

countrow

7

In [None]:
#Add column length
dffq['Length'] = dffq.text.str.len()

In [None]:
#Number of quotes longer than requirement
len(dffq[dffq['Length']>280])

0

In [None]:
#Number of quotes shorter than requirement
len(dffq[dffq['Length']<21])

0

It can be seen that the tweets of this account must be cleaned up before being processed, since some contain links and one contains a mention. The tweets do not carry the author at the end, so nothing needs to be removed for that.

In [None]:
dffq.iloc[120][0]

'I was toxic to some and a blessing to others, I’ll admit I’m not perfect.'

### @quotepage

In [None]:
# Count the hashtags and the rows that contain at least one hashtag
list_hashtags = []
counthash = 0
countrow = 0
for ind in dfqp.index:
  for word in dfqp['text'][ind].split():
    if (word[:1] == '#'):
      counthash = counthash + 1
      list_hashtags.append(word)
  for word in dfqp['text'][ind].split():
    if (word[:1] == '#'):
      countrow = countrow + 1
      break

#list_hashtags
#counthash
countrow

0

In [None]:
# Count the mentions and the rows that contain at least one mention
list_mentions = []
countment = 0
countrow = 0
for ind in dfqp.index:
  for word in dfqp['text'][ind].split():
    if (word[:1] == '@'):
      countment = countment + 1
      list_mentions.append(word)
  for word in dfqp['text'][ind].split():
    if (word[:1] == '@'):
      countrow = countrow + 1
      break

#list_mentions
#countment
countrow

0

In [None]:
#Count number of links present in the dataset
countrow = 0
substring = 'http'
for ind in dfqp.index:
  for word in dfqp['text'][ind].split():
    if word.count(substring):
      countrow = countrow + 1

countrow

43

In [None]:
#Add column length
dfqp['Length'] = dfqp.text.str.len()

In [None]:
#Number of quotes longer than requirement
len(dfqp[dfqp['Length']>280])

0

In [None]:
#Number of quotes shorter than requirement
len(dfqp[dfqp['Length']<21])

0

Here the tweets are fairly clean but contain many links. The author must be removed. Let's check whether more than one hyphen can be found.

In [None]:
#example
dfqp.iloc[3][0]
dfqp['Total -'] = 'empty'

In [None]:
# Find the positions of all hyphens in each text
dfqp['Total -'] = dfqp['text'].apply(
    lambda text: [pos for pos, char in enumerate(text) if char == '-'])

In [None]:
# Check whether there are quotes with more than one hyphen
manyslash = dfqp.loc[dfqp['Total -'].str.len() > 1]
manyslash
# So only the text after the last hyphen must be removed

Unnamed: 0,text,Total,Total -
0,"""We should devote ourselves to being self-suff...","[41, 129]","[41, 129]"
9,"""Not money, or success, or position or travel ...","[71, 97]","[71, 97]"
26,"""There is no such thing as gratitude unexpress...","[88, 113]","[88, 113]"
138,"""I have self-doubt. I have insecurity. I have ...","[12, 82]","[12, 82]"


## Software to clean tweets live

Next comes the function that cleans the tweets as they are streamed.

In [None]:
def checktweet(tweet):
  # remove hyperlinks
  tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)

  # remove hashtags with space before it
  tweet = re.sub(r'(\s)#\w+', r'', tweet)

  # remove hashtags without space before it
  tweet = re.sub(r'#\w+', r'', tweet)
 
  # remove mentions (only those preceded by whitespace)
  tweet = re.sub(r'(\s)@\w+', '', tweet)

  # remove author: drop everything from the last hyphen onwards
  last_hyphen = tweet.rfind('-')
  if last_hyphen != -1:
    tweet = tweet[:last_hyphen]

  if (len(tweet) > 280) or (len(tweet) < 20):
    return 0
  else:  
    return tweet
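
To see what each step contributes, the intermediate results can be recorded one by one. A sketch that mirrors the cleaning steps of `checktweet` (without the final length check) and returns the text after each step; the helper name `cleaning_steps` and the sample tweet are hypothetical:

```python
import re


def cleaning_steps(tweet):
    """Return the text after each cleaning step, in order."""
    steps = []
    # 1. links: everything from 'http' to the end of the line is dropped
    tweet = re.sub(r'https?://.*[\r\n]*', '', tweet)
    steps.append(tweet)
    # 2. hashtags, first those preceded by whitespace, then the rest
    tweet = re.sub(r'\s#\w+', '', tweet)
    tweet = re.sub(r'#\w+', '', tweet)
    steps.append(tweet)
    # 3. mentions preceded by whitespace
    tweet = re.sub(r'\s@\w+', '', tweet)
    steps.append(tweet)
    # 4. author: drop everything from the last hyphen onwards
    last_hyphen = tweet.rfind('-')
    if last_hyphen != -1:
        tweet = tweet[:last_hyphen]
    steps.append(tweet)
    return steps


# Hypothetical sample tweet.
sample = 'Be kind to yourself. - Jane Doe #quotes https://t.co/abc123'
for step in cleaning_steps(sample):
    print(repr(step))
# 'Be kind to yourself. - Jane Doe #quotes '
# 'Be kind to yourself. - Jane Doe '
# 'Be kind to yourself. - Jane Doe '
# 'Be kind to yourself. '
```

The final result keeps a trailing space because the text is cut right before the hyphen; calling `.strip()` on it would remove that space.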

In [None]:
checktweet(dfqp.iloc[22][0])

'"Your really can change the world if you care enough." '

In [None]:
len(tweet)

157