## Project: Scrape Adidas Twitter data for a given timeframe, with replies

In this project, I used **snscrape** to collect the tweets posted by the **@adidas** Twitter account in a given timeframe, together with the replies to each tweet. Twitter offers limited options for data practitioners, and the other scraping tools and packages (Tweepy, GetOldTweets, Twint, etc.) have limitations of their own: caps on how much can be scraped, no access to historical data, and features that break when Twitter changes or removes the endpoints they rely on.

Using snscrape, we can scrape **millions of tweets** based on various criteria (specific keywords, timeframe, etc.).
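
As a rough illustration of how such criteria combine, the sketch below assembles a search string from an account name and a date range. `build_query` is a hypothetical helper, not part of snscrape; the `from:`, `to:`, `since:` and `until:` operators are the ones used in the cells that follow. To my understanding, Twitter treats `until:` as exclusive, so tweets posted on the end date itself would not be returned.

```python
def build_query(account, start_date, end_date=None, direction='from'):
    """Assemble a Twitter search string (hypothetical helper, not part of snscrape)."""
    if direction not in ('from', 'to'):
        raise ValueError("direction must be 'from' or 'to'")
    # Each criterion is one space-separated search operator
    parts = [f'{direction}:{account}', f'since:{start_date}']
    if end_date is not None:
        parts.append(f'until:{end_date}')
    return ' '.join(parts)

query_from = build_query('@adidas', '2022-02-01', '2022-02-28')
query_to = build_query('@adidas', '2022-02-01', direction='to')
print(query_from)
print(query_to)
```

The resulting string is what would be passed to `sntwitter.TwitterSearchScraper(...)`.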


In [None]:
# Run the pip install command below if you don't already have the library
# !pip install snscrape

In [1]:
# import required packages
import snscrape.modules.twitter as sntwitter
import pandas as pd
import datetime as dt

In [2]:
# Setting variables to be used below
start_date = '2022-02-01'
end_date = '2022-02-28'
from_count=0

# Count the tweets posted from the adidas Twitter account between start_date and end_date
for tweet1 in sntwitter.TwitterSearchScraper(f'from:@adidas since:{start_date} until:{end_date}').get_items():
    from_count += 1
print(f'Number of tweets posted by Adidas from {start_date} to {end_date} is: {from_count}')

Number of tweets posted by Adidas from 2022-02-01 to 2022-02-28 is: 37


In [3]:
to_count=0

# Count the tweets posted to the adidas Twitter account from start_date until now
for tweet1 in sntwitter.TwitterSearchScraper(f'to:@adidas since:{start_date}').get_items():
    to_count += 1
print(f'Number of tweets posted to Adidas from {start_date} to now is: {to_count}')

Number of tweets posted to Adidas from 2022-02-01 to now is: 10829


### Process to extract the tweets with replies:
- Fetch the tweets posted by **@adidas** with snscrape's `TwitterSearchScraper` (`from:@adidas`) for the given timeframe (`since:{start_date} until:{end_date}`).
- Record the ID of each tweet whose replies need to be fetched (`tweet_id = tweet.id`).
- Use `TwitterSearchScraper` again to retrieve all tweets matching `to:@adidas` since the start date.
- Results whose `inReplyToTweetId` equals `tweet_id` are the replies to that tweet.
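
The matching step in the last bullet can be sketched without touching the network. The snippet below uses made-up stand-in records (`FakeTweet` and the sample IDs are illustrative, not real snscrape objects or data) and groups replies by `inReplyToTweetId` in a single pass, so that each original tweet's replies can be looked up by its ID. This is one possible way to do the matching, not necessarily the approach taken in the notebook's own cell.

```python
from collections import defaultdict, namedtuple

# Stand-in for a scraped tweet; only the fields needed for matching are modelled
FakeTweet = namedtuple('FakeTweet', ['id', 'content', 'inReplyToTweetId'])

posts = [FakeTweet(101, 'New shoe launch', None), FakeTweet(102, 'Restock', None)]
replies = [
    FakeTweet(201, '@adidas love it', 101),
    FakeTweet(202, '@adidas when?', 102),
    FakeTweet(203, '@adidas price?', 101),
    FakeTweet(204, '@adidas hello', None),  # a mention that is not a reply
]

# Index every reply under the ID of the tweet it answers
replies_by_parent = defaultdict(list)
for reply in replies:
    if reply.inReplyToTweetId is not None:
        replies_by_parent[reply.inReplyToTweetId].append(reply.content)

# Look up the replies for each original post; posts without replies get an empty list
matched = {post.id: replies_by_parent.get(post.id, []) for post in posts}
print(matched)
```

Because the replies are indexed once, the `to:@adidas` results would only need to be scraped a single time rather than once per original tweet.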

In [4]:
# Counter used for the progress messages
count = 0

# List that collects one row per @adidas tweet
tweets_list = []

# Scrape each @adidas tweet in the timeframe, then collect its replies
for tweet in sntwitter.TwitterSearchScraper(f'from:@adidas since:{start_date} until:{end_date}').get_items():
    count += 1
    print(f'extracting tweet {count} at {dt.datetime.now()}')
    reply_texts = []
    # Note: this inner search re-scrapes every tweet sent to @adidas for each outer tweet,
    # which is why each iteration takes several minutes (see the timestamps in the output)
    for tweet2 in sntwitter.TwitterSearchScraper(f'to:@adidas since:{start_date}').get_items():
        # A reply carries the ID of the tweet it answers in inReplyToTweetId
        if hasattr(tweet2, 'inReplyToTweetId') and str(tweet2.inReplyToTweetId) == str(tweet.id):
            reply_texts.append(tweet2.content)
    tweets_list.append([tweet.date, tweet.id, tweet.content, tweet.user.username, reply_texts, tweet.url,
                        tweet.likeCount, tweet.retweetCount])

extracting tweet 1 at 2022-03-02 14:06:53.749261
extracting tweet 2 at 2022-03-02 14:13:21.503263
extracting tweet 3 at 2022-03-02 14:19:49.564604
extracting tweet 4 at 2022-03-02 14:26:18.825692
extracting tweet 5 at 2022-03-02 14:32:36.869678
extracting tweet 6 at 2022-03-02 14:38:55.247959
extracting tweet 7 at 2022-03-02 14:45:14.763720
extracting tweet 8 at 2022-03-02 14:51:47.248644
extracting tweet 9 at 2022-03-02 14:58:04.369878
extracting tweet 10 at 2022-03-02 15:04:20.537379
extracting tweet 11 at 2022-03-02 15:10:56.194043
extracting tweet 12 at 2022-03-02 15:17:29.037282
extracting tweet 13 at 2022-03-02 15:24:06.666086
extracting tweet 14 at 2022-03-02 15:30:24.267604
extracting tweet 15 at 2022-03-02 15:36:58.189633
extracting tweet 16 at 2022-03-02 15:43:34.620566
extracting tweet 17 at 2022-03-02 15:50:11.284539
extracting tweet 18 at 2022-03-02 15:56:47.888549
extracting tweet 19 at 2022-03-02 16:03:06.360990
extracting tweet 20 at 2022-03-02 16:09:41.683954
extractin

In [5]:
# Creating a dataframe from the tweets list above
tweets_df = pd.DataFrame(tweets_list, columns=['datetime', 'tweet_Id', 'tweet_content', 'username', 'reply_content', 
                                               'tweet_url',  'like_count', 'retweet_count'])

# Display first 5 entries from dataframe
tweets_df.head()    

|   | datetime | tweet_Id | tweet_content | username | reply_content | tweet_url | like_count | retweet_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-02-22 13:38:03+00:00 | 1496117082068500480 | A limited number of the new adidas TERREX HS1 ... | adidas | [@adidas Why are you still supporting the #Wor... | https://twitter.com/adidas/status/149611708206... | 35 | 4 |
| 1 | 2022-02-22 13:38:02+00:00 | 1496117077752664065 | Introducing: the adidas TERREX HS1\n \nCompose... | adidas | [A limited number of the new adidas TERREX HS1... | https://twitter.com/adidas/status/149611707775... | 71 | 9 |
| 2 | 2022-02-22 13:38:00+00:00 | 1496117070504906752 | ICYMI: Last year we partnered with Finnish tex... | adidas | [@adidas @SpinnovaPlc Looks interesting 🤔, Int... | https://twitter.com/adidas/status/149611707050... | 18 | 1 |
| 3 | 2022-02-22 13:38:00+00:00 | 1496117068864831495 | Calling all adventurers, hikers, and nature lo... | adidas | [@adidas adidas, stop being the official brand... | https://twitter.com/adidas/status/149611706886... | 107 | 12 |
| 4 | 2022-02-14 22:40:22+00:00 | 1493354457308016651 | @Vishy_vish Unfortunately, we're not able to a... | adidas | [] | https://twitter.com/adidas/status/149335445730... | 0 | 0 |


In [6]:
# Let's check the dataframe size
tweets_df.shape

(37, 8)

In [8]:
tweets_df.to_csv(path_or_buf = 'data/snscrape_adidas.csv', encoding='utf-8', index=False)
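
One thing to keep in mind when loading this file later: `reply_content` holds Python lists, and a CSV cell can only hold text. The sketch below uses the standard-library `csv` module and an in-memory buffer (both chosen for illustration, with made-up values, not taken from the notebook) to show a list being written as its string form and parsed back with `ast.literal_eval`. My understanding is that pandas stringifies list cells the same way, but that is an assumption worth checking against the saved file.

```python
import ast
import csv
import io

# Made-up row standing in for one line of the dataframe
row = {'tweet_Id': 1, 'reply_content': ['@adidas nice', '@adidas when?']}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['tweet_Id', 'reply_content'])
writer.writeheader()
writer.writerow(row)  # non-string values are written via str()

buffer.seek(0)
loaded = next(csv.DictReader(buffer))
raw_cell = loaded['reply_content']     # comes back as text, not a list
restored = ast.literal_eval(raw_cell)  # parse the text back into a Python list
print(type(raw_cell).__name__, restored)
```

With pandas, the `converters` parameter of `pd.read_csv` accepts a function per column, so passing `ast.literal_eval` for `reply_content` should achieve the same thing.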

**References**
- https://medium.com/dataseries/how-to-scrape-millions-of-tweets-using-snscrape-195ee3594721
- https://github.com/MartinBeckUT/TwitterScraper/