<a href="https://colab.research.google.com/github/SeyiAgboola/Extract-Tweets-7-days/blob/master/How_to_Scrape_Quote_Tweets_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Scrape Quote Tweets with Python

This notebook will show you how to track quote tweets of an original tweet of your choice. Quote tweeting is now just as common as replying to tweets, and quote tweets hold important information about the sentiment of the response to the original tweet.

Main steps within this Notebook are to:

* Authenticate your Twitter API Access
* Search for the Quote Tweets based on Tweet ID
* Store resulting Quote Tweets in a CSV file

The main use cases with this notebook are to:

* Return tweets from the last 7 days based on a search term
* Identify the most popular topic per hashtag
* Return Quote Tweets of a specific tweet
* Sentiment analysis of the response to a tweet

If you're looking for a new platform to learn on, you might want to consider DataCamp, which has courses for everything to do with Python, SQL and other data-related programming languages and software tools. For example:

* [Web Scraping in Python](https://datacamp.pxf.io/3PraYy)
* [Introduction to Data Science in Python](https://datacamp.pxf.io/NK3JXP)
* [Introduction to Deep Learning in Python](https://datacamp.pxf.io/Yg3YLK)
* And practice solving real-world problems [with guided and unguided projects](https://datacamp.pxf.io/qngna5)




In [1]:
import tweepy
import pandas as pd
import numpy as np
import re

# Authenticate your Twitter API Access

You will need Developer API access credentials to scrape tweets from Twitter via Python. I won't go into detail on that process, since there are plenty of tutorials that explain it, but here are some links to help.

* [Developer API site](https://developer.twitter.com/en)
* [Tutorial on getting Twitter API access](https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api)

Once you have access, I recommend you create a file to hold your credentials locally on your laptop or PC. I have used a CSV file since I am familiar with manipulating pandas DataFrames, including pulling specific data out.

In this case, I created a column for each access detail needed including:
* Client ID
* Client Secret
* API Key ([Also known as Consumer Key](https://developer.twitter.com/en/docs/authentication/oauth-1-0a/api-key-and-secret))
* API Key Secret ([Also known as Consumer Secret](https://developer.twitter.com/en/docs/authentication/oauth-1-0a/api-key-and-secret))
* Bearer Token
* Access Token ([Known as Token credentials to grant access and can be revoked](https://developer.twitter.com/en/docs/authentication/oauth-1-0a/obtaining-user-access-tokens))
* Access Token Secret

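I would be lying to you if I said I knew why we need all of these, but the tweepy code in this notebook only uses four of them: the API Key, API Key Secret, Access Token and Access Token Secret, which are the OAuth 1.0a credentials passed to `tweepy.OAuthHandler`.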

In [2]:
#Store credentials in a DataFrame to be accessed directly via pandas
df = pd.read_csv("/content/twicreds - Sheet1.csv")

client_id = df['Client ID'][0]
client_secret = df['Client Secret'][0]
#The API Key and Secret (also known as Consumer Key and Secret)
consumer_key = df['API Key'][0]
consumer_secret = df['API Key Secret'][0]
bearer_token = df['Bearer Token'][0]
access_token = df['Access Token'][0]
access_secret = df['Access Token Secret'][0]

In [3]:
#Create auth object
authenticate = tweepy.OAuthHandler(consumer_key, consumer_secret)
#Set the access token and access token secret
authenticate.set_access_token(access_token, access_secret)
#Create the API object
api = tweepy.API(authenticate, wait_on_rate_limit=True)
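
A typo in a column name of the credentials file only surfaces later as a confusing authentication error, so it can help to validate the file up front. The sketch below is a minimal illustration, not the notebook's own method: it swaps pandas for the standard library `csv` module, and `load_credentials` and `REQUIRED_COLUMNS` are hypothetical names. The column names match the ones listed above and the values are placeholders.

```python
import csv
import io

# Columns that the OAuth 1.0a calls in this notebook rely on
REQUIRED_COLUMNS = ("API Key", "API Key Secret", "Access Token", "Access Token Secret")

def load_credentials(csv_file):
    """Read the first data row of a credentials CSV and return the required values."""
    reader = csv.DictReader(csv_file)
    row = next(reader, None)
    if row is None:
        raise ValueError("Credentials file has no data rows")
    # Flag columns that are absent or left empty
    missing = [col for col in REQUIRED_COLUMNS if not row.get(col)]
    if missing:
        raise KeyError(f"Missing or empty credential columns: {missing}")
    return {col: row[col].strip() for col in REQUIRED_COLUMNS}

# Placeholder values only; a real file would be opened with open(path, newline="")
sample = io.StringIO(
    "API Key,API Key Secret,Access Token,Access Token Secret\n"
    "key123,secret123,token123,tokensecret123\n"
)
credentials = load_credentials(sample)
print(credentials["API Key"])
```

The returned dictionary can then feed the same `OAuthHandler` and `set_access_token` calls used in the cell above.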

# Things to know when working with Twitter's API

There are a few limitations depending on the level of developer access you have. In this case I'm going to assume that you have Essential or Elevated access, which are both free and have minimal requirements. These are the things to be aware of:

* Based on your access level, you have a monthly tweet cap. For Essential it's 500K and for Elevated it's [2 million Tweets per month](https://developer.twitter.com/en/docs/twitter-api/tweet-caps)
* When searching for tweets you will get more results than you need, so you might want to adjust the result_type parameter. Although the tweepy documentation does not list result_type in its parameters, it is an available parameter in the Twitter Dev Docs. Here is the definition of [popular result_type](https://docs.tweepy.org/en/stable/api.html).
* Watch out for guides and tutorials that refer to different versions of the Twitter API. This one, for example, [Comprehensive Guide on Using Twitter API V2](https://dev.to/twitterdev/a-comprehensive-guide-for-using-the-twitter-api-v2-using-tweepy-in-python-15d9), is based on tweepy.Client, which targets Twitter API v2 and is only available from Tweepy v4.0 onwards; older Tweepy releases do not include it.
* Beware that Twitter’s standard search API only “searches against a sampling of recent Tweets [published in the past 7 days.](https://docs.tweepy.org/en/stable/api.html#search-tweets)” This means you cannot take your search results as gospel: valuable results will be missing, so treat them as a sample of a bigger population.
* As mentioned above, you will only receive Tweets from the last 7 days, so searching for older tweets will not return accurate results. If you have access to the [Academic Research Product Track](https://developer.twitter.com/en/products/twitter-api/academic-research), you can get Tweets older than 7 days
* Direct link to the [Search API tweepy documentation](https://docs.tweepy.org/en/stable/client.html#search-tweets)
* Twitter official documentation on [building search queries](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query)
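* One common query clause in Twitter's query-building documentation is -is:retweet, which will not match on Retweets, thus matching only on original Tweets, Quote Tweets, and replies. That operator belongs to the API v2 query language; the v1.1 standard search used in this notebook expects -filter:retweets instead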
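
To make the query rules above concrete, here is a small sketch of a query builder. It assumes the v1.1 standard search operators, where the retweet filter is written `-filter:retweets`, and the 500-character limit mentioned in this notebook's search cell. `build_query` and `MAX_QUERY_LENGTH` are hypothetical names introduced for illustration.

```python
MAX_QUERY_LENGTH = 500  # character limit noted in this notebook's search cell

def build_query(term, exclude_retweets=True, retweet_filter="-filter:retweets"):
    """Combine a search term with an optional retweet filter."""
    parts = [term.strip()]
    if exclude_retweets:
        # Pass retweet_filter="-is:retweet" when targeting the v2 query language
        parts.append(retweet_filter)
    query = " ".join(parts)
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(f"Query is {len(query)} characters; the limit is {MAX_QUERY_LENGTH}")
    return query

print(build_query("#PS5Share"))  # prints: #PS5Share -filter:retweets
```

Keeping the operator as a parameter makes it easy to switch between the two API versions without touching the rest of the code.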

Errors to watch out for:

* TweepError: [{'message': 'You currently have Essential access which includes access to Twitter API v2 endpoints only. 
If you need access to this endpoint, you’ll need to apply for Elevated access via the Developer Portal. You can learn more here: https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api#v2-access-leve', 'code': 453}]
* TweepError: Read-only application cannot POST.
* AttributeError: module 'tweepy' has no attribute 'Client'
* TweepError: Twitter error response: status code = 403 
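
The `AttributeError` about `tweepy.Client` usually means the installed Tweepy release predates the v2 client. A quick way to check before following a v2 tutorial is to test for the attribute. This is only a sketch: `module_has_attribute` is a hypothetical helper, and it reports `False` both when the module is missing and when the attribute is absent.

```python
import importlib

def module_has_attribute(module_name, attribute):
    """Return True if the module imports and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attribute)

# True only when an installed Tweepy release ships the v2 Client class
print(module_has_attribute("tweepy", "Client"))
```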

# Search for Tweets with PS5Share hashtag

Here we run our search query and store the results in a DataFrame, which we then save locally as a CSV file.

In [5]:
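#Create search term
#Note: -is:retweet is the API v2 form of the retweet filter; the v1.1 standard search called below expects -filter:retweets, which is why retweets still appear in the output
search_term = '#PS5Share -is:retweet' #Refer to https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query to build advanced queries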
#Create the cursor object
tweets = tweepy.Cursor(api.search, #Refer to https://docs.tweepy.org/en/stable/api.html - API.search_tweets
                       q = search_term, #The search query string of 500 characters maximum
                       lang='en', #Restricts tweets to the given language
                       since='2022-05-01',#Beware of the 7 day limit
                       tweet_mode = 'extended' 
                       ).items(2000) #limit max - important for not going over tweet caps limit
#Store the tweets
all_tweets = [tweet.full_text for tweet in tweets]
#Show all tweets
print(len(all_tweets))
print(all_tweets[:5])

2000
["Matching Michael Schumacher's Winning record🙌\nIf you know you know 😎\n#PS5Share, #GranTurismo7 https://t.co/DNwRpkfxyn", '#PS5Share, #Fortnite First win in the new season https://t.co/RokvAPqvmP', 'RT @kingforever008: SHOOT &amp; RIDE | Day 5 |#GG30SILHOUETTE \n#Cyberpunk2077 #CyberSunday #NPCSunday  #PS5Share #VirtualPhotography #WIGVP #WV…', 'RT @yesmynameissumo: #PS5Share, #ControlUltimateEdition loving the Alan Wake missions... https://t.co/O0G4IdVkSP', 'RT @ZivLisker: Another repost for: \n #GG30SILHOUETTE ❤️😇📸\n#AssassinsCreedOrigins @ubisoft\n #PS5Share #VGPUnite #VirtualPhotography #ZarnGam…']
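
Because the standard search only covers roughly the last 7 days, a start date earlier than that adds nothing. The sketch below clamps a requested start date to the search window. `clamp_since` and `SEARCH_WINDOW_DAYS` are hypothetical names, and the fixed `today` value is only there to make the example reproducible.

```python
from datetime import date, timedelta

SEARCH_WINDOW_DAYS = 7  # window of the standard search API described above

def clamp_since(requested, today=None):
    """Return the later of the requested date and the start of the search window."""
    today = today or date.today()
    earliest = today - timedelta(days=SEARCH_WINDOW_DAYS)
    return max(requested, earliest)

# A request for 1 May made on 4 June is moved forward to 28 May
print(clamp_since(date(2022, 5, 1), today=date(2022, 6, 4)).isoformat())  # prints: 2022-05-28
```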


In [6]:
#Dataframe to store tweets
df = pd.DataFrame(all_tweets, columns=['Tweets'])
df.head()

Unnamed: 0,Tweets
0,Matching Michael Schumacher's Winning record🙌\...
1,"#PS5Share, #Fortnite First win in the new seas..."
2,RT @kingforever008: SHOOT &amp; RIDE | Day 5 |...
3,"RT @yesmynameissumo: #PS5Share, #ControlUltima..."
4,RT @ZivLisker: Another repost for: \n #GG30SIL...


In [7]:
#Function to clean tweets
def cleanTweets(tweet):
  tweet = re.sub(r'\bRT\b', '', tweet) #Remove the standalone retweet marker, not 'RT' inside other words
  #tweet = re.sub(r'#[A-Za-z0-9]+', '', tweet) #Remove hashtags
  tweet = re.sub(r'\n', '', tweet) #Remove newlines
  #tweet = re.sub(r'https?://\S+', '', tweet) #Remove hyperlinks
  tweet = re.sub(r'@\S*', '', tweet) #Remove mentions
  tweet = re.sub(r'^\s+|\s+$', '', tweet) #Remove leading and trailing whitespace
  return tweet
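
Tweet text from the API arrives HTML-escaped (note the `&amp;` in the output above), and newlines removed without a replacement glue words together. The variant below is a sketch of one way to handle both, using `html.unescape` and whitespace collapsing. `clean_tweet_text` is a hypothetical name, and the sample string and its `t.co` link are made up for the demonstration.

```python
import html
import re

def clean_tweet_text(text):
    """Normalise a tweet: unescape entities, drop links and mentions, tidy whitespace."""
    text = html.unescape(text)                    # &amp; becomes &
    text = re.sub(r"https?://\S+", "", text)      # remove hyperlinks
    text = re.sub(r"^RT\s+@\w+:\s*", "", text)    # remove a leading "RT @user: "
    text = re.sub(r"@\w+", "", text)              # remove remaining mentions
    text = re.sub(r"\s+", " ", text)              # collapse newlines and repeated spaces
    return text.strip()

# Made-up sample modelled on the retweets shown above
sample = "RT @someuser: SHOOT &amp; RIDE | Day 5 \n#PS5Share https://t.co/abc123"
print(clean_tweet_text(sample))  # prints: SHOOT & RIDE | Day 5 #PS5Share
```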


In [8]:
df['Clean_tweets'] = df['Tweets'].apply(cleanTweets)

In [9]:
df.head(10)

Unnamed: 0,Tweets,Clean_tweets
0,Matching Michael Schumacher's Winning record🙌\...,Matching Michael Schumacher's Winning record🙌I...
1,"#PS5Share, #Fortnite First win in the new seas...","#PS5Share, #Fortnite First win in the new seas..."
2,RT @kingforever008: SHOOT &amp; RIDE | Day 5 |...,SHOOT &amp; RIDE | Day 5 |#GG30SILHOUETTE #Cyb...
3,"RT @yesmynameissumo: #PS5Share, #ControlUltima...","#PS5Share, #ControlUltimateEdition loving the ..."
4,RT @ZivLisker: Another repost for: \n #GG30SIL...,Another repost for: #GG30SILHOUETTE ❤️😇📸#Assa...
5,"RT @Travellingpaddy: The Moon Is A Friend , Fo...","The Moon Is A Friend , For The Lonesome To Tal..."
6,Confronting fear is the destiny of a Jedi.\n#P...,Confronting fear is the destiny of a Jedi.#PS5...
7,RT @NuttyRoyale: Feel like I'm always in this ...,"Feel like I'm always in this domain #PS5Share,..."
8,RT @hatchiedave: Queue in dramatic walkout! An...,Queue in dramatic walkout! And scene! 🎬Tap for...
9,"#PS5Share, #HorizonForbiddenWest Such a cool g...","#PS5Share, #HorizonForbiddenWest Such a cool g..."


In [10]:
#DataFrame rows and columns before
print(df.shape)
#Drop duplicates inplace
df.drop_duplicates(inplace=True)
#DataFrame rows and columns after
print(df.shape)

(2000, 2)
(1597, 2)


In [11]:
#Reset the index to run from 0 to len(df) - 1 after dropping duplicates
df = df.reset_index(drop=True)
df.head(10)

In [13]:
filename = 'PS5Share_Since_1stMay_2000_MAX.csv'
#Save CSV to Files section
df.to_csv(filename)
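
One of the use cases listed at the top is finding the most popular topics alongside a hashtag. A minimal sketch of that idea is to count the hashtags that co-occur in the collected tweets. `count_hashtags` is a hypothetical helper and the sample tweets are made up; with the DataFrame above you could pass `df['Clean_tweets']` instead.

```python
import re
from collections import Counter

def count_hashtags(tweets):
    """Count how many tweets mention each hashtag (case-insensitive)."""
    counts = Counter()
    for text in tweets:
        # A set ensures a hashtag repeated within one tweet is counted once
        counts.update({tag.lower() for tag in re.findall(r"#\w+", text)})
    return counts

# Made-up sample tweets
sample_tweets = [
    "#PS5Share #Fortnite first win",
    "#PS5Share #GranTurismo7 new record #GranTurismo7",
    "Nice screenshot #ps5share",
]
hashtag_counts = count_hashtags(sample_tweets)
print(hashtag_counts.most_common(1))  # prints: [('#ps5share', 3)]
```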

# Things to Know Before Scraping Quote Tweets

It took me a while to work out how to get quote tweets, so I have written down some things to consider if this is something you require.

* Quote tweets, also known as [Retweets with comments, are just regular tweets](https://github.com/tweepy/tweepy/issues/1291), with a permalink to another tweet at the end. To find Quote Tweets of a specific tweet, you can use the Search API to search for the tweet based on its ID, which looks something like this: 1193899515901829120
* Enterprise users (paying customers) have it easier and can get Quote Tweets (by user or of user) with much less work. See [full documentation here](https://developer.twitter.com/en/docs/twitter-api/enterprise/account-activity-api/overview)
* If my code for Quote Tweets doesn't work, [here's another article](https://blog.f-secure.com/processing-quote-tweets-with-twitter-api/) that might help. I didn't use it in the end because of status errors I couldn't untangle.


# Scraping Quote Tweets

This piece of code is a combination of:
* [Hackernoon's Tweet Replies Guide](https://hackernoon.com/scraping-tweet-replies-with-python-and-tweepy-twitter-api-a-step-by-step-guide-z11x3yr8)
* [Processing Quote Tweet Guide](https://blog.f-secure.com/processing-quote-tweets-with-twitter-api/)
* [GeeksforGeeks Status Object Guide](https://www.geeksforgeeks.org/python-status-object-in-tweepy/)

The main steps are:

* Identify the tweet you want quote tweets for
* Build your search query to pull only tweets with that URL
* Cross check tweets are quote tweets for that specific tweet
* Store resulting tweets into a list
* Store resulting list into a CSV file and save locally
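
The first two steps can be sketched without touching the API: take the Tweet ID from the URL rather than typing it twice, and build the `url:` search query from the same URL. `tweet_id_from_url` and `quote_search_query` are hypothetical helpers, and the path check assumes the `<account>/status/<id>` URL shape used in this notebook.

```python
from urllib.parse import urlparse

def tweet_id_from_url(tweet_url):
    """Extract the numeric Tweet ID from a tweet permalink."""
    path_parts = urlparse(tweet_url).path.strip("/").split("/")
    # Expected path shape: <screen_name>/status/<tweet_id>
    if len(path_parts) < 3 or path_parts[1] != "status" or not path_parts[2].isdigit():
        raise ValueError(f"Not a tweet URL: {tweet_url}")
    return path_parts[2]

def quote_search_query(tweet_url):
    """Build the search query that matches tweets linking to the given tweet."""
    return "url:" + tweet_url

url = "https://twitter.com/PlayStation/status/1532491343116656640"
print(tweet_id_from_url(url))   # prints: 1532491343116656640
print(quote_search_query(url))  # prints: url:https://twitter.com/PlayStation/status/1532491343116656640
```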

In [4]:
import csv
import tweepy
# import ssl

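#The commented-out ssl lines disable HTTPS certificate verification for code that uses Python's default HTTPS context; leave them commented out unless you hit certificate errors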
# ssl._create_default_https_context = ssl._create_unverified_context

# Authentication with Twitter
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

#Insert your Tweet details here including the Account Name + Tweet ID + Tweet URL
name = 'PlayStation' #The name of the twitter account that tweeted the original tweet
tweet_id = '1532491343116656640' #That long number at the end of the Tweet URL is the Tweet ID
tweet_url = "https://twitter.com/PlayStation/status/1532491343116656640" #Full Tweet URL

replies = [] #replies to the original tweet
quotes = [] #tweets whose is_quote_status is True
matching_quotes = [] #text of quote tweets that quote the original tweet
quote_ids = [] #ID of the tweet each quote tweet is quoting
objects = [] #full tweet objects (metadata) for the matching quote tweets
all_tweets = [] #all tweets returned by the search
for tweet in tweepy.Cursor(api.search, q="url:"+tweet_url, result_type='recent', timeout=999999).items(3000):
    all_tweets.append(tweet)
    #Check if the tweet is a reply to the specific tweet
    if hasattr(tweet, 'in_reply_to_status_id_str'):
        if tweet.in_reply_to_status_id_str == tweet_id:
            replies.append(tweet)
    #Check if the tweet is a quote tweet
    if tweet.is_quote_status:
        quotes.append(tweet) #Full Tweet objects with all the metadata
        quote_ids.append(tweet.quoted_status_id) #These should all be the exact same
        #Cross-check that the quoted tweet is the original tweet
        if str(tweet_id) == str(tweet.quoted_status_id):
            objects.append(tweet)
            matching_quotes.append(tweet.text)


In [5]:
print("Lengths")
print("Total tweets: " + str(len(all_tweets)))
print("Total quotes tweets: " + str(len(quotes)))
print("Total matching quote texts: " + str(len(matching_quotes)))
print("Total quote ids: " + str(len(quote_ids)))
print("Total matching quote objects: " + str(len(objects)))
#This is an extra check to make sure we've only pulled quote tweets that are related to the original tweet

Lengths
Total tweets: 100
Total quotes tweets: 16
Total matching quote texts: 16
Total quote ids: 16
Total matching quote objects: 16
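
The cross-check in the loop can also be isolated into a small function, which makes it possible to test the logic without calling the API. This is a sketch under that assumption: `split_quotes` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `is_quote_status` and `quoted_status_id` attributes of a tweepy Status.

```python
from types import SimpleNamespace

def split_quotes(tweets, tweet_id):
    """Return (all quote tweets, quote tweets of the given original tweet)."""
    all_quotes, matching = [], []
    for tweet in tweets:
        if not getattr(tweet, "is_quote_status", False):
            continue
        all_quotes.append(tweet)
        # Compare as strings because the ID may arrive as int or str
        if str(getattr(tweet, "quoted_status_id", "")) == str(tweet_id):
            matching.append(tweet)
    return all_quotes, matching

# Stand-in objects, not real API results
fake_tweets = [
    SimpleNamespace(text="quote of target", is_quote_status=True, quoted_status_id=1532491343116656640),
    SimpleNamespace(text="quote of another tweet", is_quote_status=True, quoted_status_id=111),
    SimpleNamespace(text="plain tweet", is_quote_status=False),
]
all_quotes, matching = split_quotes(fake_tweets, "1532491343116656640")
print(len(all_quotes), len(matching))  # prints: 2 1
```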


In [16]:
filename = name + "_" + tweet_id + "_" + 'QTs.csv'
#Create the output file (newline='' avoids blank rows on Windows; utf-8 keeps emoji intact)
with open(filename, 'w', newline='', encoding='utf-8') as f:
    #Create CSV file with 2 columns
    csv_writer = csv.DictWriter(f, fieldnames=('Tweet Text', 'Metadata'))
    #Add a header
    csv_writer.writeheader()
    #For each tweet object in objects
    for tweet in objects:
        #Create row of current tweet and insert these values
        row = {'Tweet Text': tweet.text, 'Metadata': tweet}
        #Add row to current CSV
        csv_writer.writerow(row)
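
Writing the Status object itself stores its Python repr, which is hard to parse back later. One alternative, sketched here, is to store the raw `_json` dictionary (visible in the object output further down) as a JSON string. `write_quotes` is a hypothetical helper, the stand-in object only mimics the `text` and `_json` attributes, and an in-memory buffer replaces the real file for the demonstration.

```python
import csv
import io
import json
from types import SimpleNamespace

def write_quotes(tweets, csv_file):
    """Write tweet text plus JSON-encoded metadata to an open CSV file."""
    writer = csv.DictWriter(csv_file, fieldnames=("Tweet Text", "Metadata"))
    writer.writeheader()
    for tweet in tweets:
        writer.writerow({
            "Tweet Text": tweet.text,
            # ensure_ascii=False keeps emoji and accents readable
            "Metadata": json.dumps(tweet._json, ensure_ascii=False),
        })

# Stand-in for a tweepy Status object
fake = SimpleNamespace(text="Hello", _json={"id": 1, "text": "Hello"})
buffer = io.StringIO()
write_quotes([fake], buffer)

# Read the CSV back and decode the metadata column
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(json.loads(rows[0]["Metadata"])["id"])  # prints: 1
```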

# Debugging and Troubleshooting Tweepy

In [17]:
#Check the first Quote Tweet Text
matching_quotes[0]

'Siii que ganitas dios!!!! pal verano que viene ya sabéis lo que tocará en los directos 😏😏😏😏 https://t.co/hdDht206q6'

In [18]:
#Check first object in list
objects[0]

Status(_api=<tweepy.api.API object at 0x7f609ad555d0>, _json={'created_at': 'Sat Jun 04 14:56:44 +0000 2022', 'id': 1533100445736087555, 'id_str': '1533100445736087555', 'text': 'Siii que ganitas dios!!!! pal verano que viene ya sabéis lo que tocará en los directos 😏😏😏😏 https://t.co/hdDht206q6', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/hdDht206q6', 'expanded_url': 'https://twitter.com/PlayStation/status/1532491343116656640', 'display_url': 'twitter.com/PlayStation/st…', 'indices': [92, 115]}]}, 'metadata': {'iso_language_code': 'es', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1678128758, 'id_str': '1678128758', 'name': 'KRN Lexone23', 'screen_name': 'lexone23'

In [None]:

  # full_tweet = tweet.text
  # original_url = parse_qTweet(full_tweet)
  # url_splits = original_url.split("/")
  # if url_splits[-1] == tweet_id:
    # quotes.append(tweet.text)



* How to use requests to [expand shortened t.co links](https://stackoverflow.com/questions/8872232/how-can-i-unwrap-t-co-links-with-python), which can point to a tweet, video or photo

In [43]:
#How to unwrap a t.co link
import requests
from re import search
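def parse_qTweet(text):
  words = text.split()
  print(len(words))
  for word in words:
    word = str(word)
    # if 'https://t.co' in word:
    #   print("True")
    #   full_url = requests.get(word).url
    #   return full_url
    if search('https://t.co', word):
      full_url = requests.get(word).url
      return full_url
    if word.find('https://t.co') != -1:
      print("Found!")
    else:
      #Bug: this returns on the first word that is not a t.co link, so the loop never reaches a link later in the text
      #That is why the call below gives 'Invalid'; returning "Invalid" only after the loop finishes would fix it
      return "Invalid"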

yo = parse_qTweet(matching_quotes[0])
example = matching_quotes[0]
splits = example.split()
splits
for word in splits:
    if 'https://t.co' in word:
        try:
            #Follow the redirect to get the full URL
            full_url = requests.get(word).url
            print(full_url)
        except requests.RequestException:
            print('Nah')
full_url
yo

18
https://twitter.com/PlayStation/status/1532491343116656640


'Invalid'
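
The `'Invalid'` result comes from the early return inside the loop of `parse_qTweet`. A sketch of a version that checks every link first is shown below. `expand_tco_links` and `resolve_with_requests` are hypothetical names, and the resolver is passed in as a parameter so the example can run offline with a lookup table; the short and expanded URLs are the pair shown in the Status output above.

```python
import re

TCO_PATTERN = re.compile(r"https://t\.co/\w+")

def expand_tco_links(text, resolve):
    """Return the expanded URL for every t.co link found in the text.

    `resolve` is any callable that maps a short URL to its final URL.
    """
    return [resolve(short) for short in TCO_PATTERN.findall(text)]

def resolve_with_requests(short_url):
    """Follow redirects over the network (needs the requests package)."""
    import requests  # imported lazily so the offline demo below has no dependency
    return requests.get(short_url, timeout=10).url

# Offline demo: a lookup table stands in for the network call
known_redirects = {
    "https://t.co/hdDht206q6": "https://twitter.com/PlayStation/status/1532491343116656640",
}
text = "Siii que ganitas dios!!!! https://t.co/hdDht206q6"
expanded = expand_tco_links(text, known_redirects.get)
print(expanded)  # prints: ['https://twitter.com/PlayStation/status/1532491343116656640']
```

With network access, `expand_tco_links(text, resolve_with_requests)` would follow the redirects for real, and an empty list signals that the text had no t.co link.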

In [None]:
#See all the attributes in the tweet object
dir(all_tweets[0])

In [31]:
# fetching the status
example_id = "1531228577445728257"
status = api.get_status(example_id)

In [29]:
#Check quote id for a quote tweet matches original tweet
quotes[0].quoted_status_id

1532491343116656640

In [32]:
# printing the information from a tweet object
print("The status was created at : " + str(status.created_at))
print("The id is : " + str(status.id))
print("The id_str is : " + status.id_str)
print("The text is : " + status.text)
print("The source_url is : " + status.source_url)
  
  
print("The in_reply_to_status_id is : " + str(status.in_reply_to_status_id))
print("The in_reply_to_user_id is : " + str(status.in_reply_to_user_id))
print("The in_reply_to_screen_name is : " + str(status.in_reply_to_screen_name))
  
  
print("The poster's screen name is : " + status.user.screen_name)
print("The is_quote_status is : " + str(status.is_quote_status))
  
print("Has the authenticated user favourited the status? : " + str(status.favorited))
print("Has the authenticated user retweeted the status? " + str(status.retweeted))

The status was created at : 2022-05-30 10:58:36
The id is : 1531228577445728257
The id_str is : 1531228577445728257
The text is : Need some more combos https://t.co/Kw2zKAsy5j
The source_url is : http://twitter.com/download/android
The in_reply_to_status_id is : None
The in_reply_to_user_id is : None
The in_reply_to_screen_name is : None
The poster's screen name is : LiteraryDecay
The is_quote_status is : True
Has the authenticated user favourited the status? : False
Has the authenticated user retweeted the status? False
