# Scraping Twitter data with snscrape

## 0. Set up
**snscrape** is a scraper for social networking services. It supports several services, including Twitter, Facebook, and Instagram. It scrapes things like user profiles, hashtags, or searches and returns the discovered items, e.g. the relevant posts. More information about snscrape is available in its [GitHub](https://github.com/JustAnotherArchivist/snscrape) repository.

**Note: you may need to connect to the Duke VPN (portal.duke.edu) to finish this lab.**

In [1]:
# install snscrape using pip
# !pip3 install snscrape

In [13]:
# import libraries
import snscrape.modules.twitter as sntwitter

import pandas as pd
import numpy as np

## 1. Overview

In today's lab, we are going to use snscrape to scrape data from Twitter. You will learn to scrape tweets from:
1. text search query
2. a specific user

The tweet objects returned by snscrape expose many useful attributes [(Beck, 2020)](https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af). You might need some of the attributes below when scraping data for your group projects.
* url: Permalink pointing to tweet location
* date: The date on which the tweet was created
* content: The text content of the tweet
* id: The id of the tweet
* user: User object containing the following data: username, displayname, id, description, descriptionUrls, verified, created, followersCount, friendsCount, statusesCount, favouritesCount, listedCount, location, protected, linkUrl, profileImageUrl, profileBannerUrl
* replyCount: The count of replies
* retweetCount: The count of retweets
* likeCount: The count of likes
* quoteCount: The count of users that quoted the tweet and replied
* conversationId: Appears to be the same as tweet id
* lang: Machine generated, assumed language of the tweet
* source: Where the tweet was posted from (e.g., iPhone, Android, etc.)
* retweetedTweet: The id of the original tweet (if it is a retweet)
* mentionedUsers: The user objects of any mentioned user in the tweet
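
As a minimal sketch of how the attributes above can be flattened into one row per tweet, the snippet below uses two small stand-in classes so that it runs without network access. `FakeUser`, `FakeTweet`, and `tweet_to_row` are hypothetical names introduced for illustration and are not part of snscrape; only a handful of the listed attributes are modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


# Hypothetical stand-ins for snscrape's tweet and user objects, so the
# example runs offline. Only a few of the attributes listed above are modeled.
@dataclass
class FakeUser:
    username: str
    location: Optional[str] = None


@dataclass
class FakeTweet:
    date: datetime
    content: str
    likeCount: int
    user: FakeUser


def tweet_to_row(tweet):
    """Flatten a tweet-like object into a dict of plain values."""
    return {
        "date": tweet.date.isoformat(),
        "content": tweet.content,
        "likes": tweet.likeCount,
        # nested attributes live on the user object
        "username": tweet.user.username,
        # a profile may have no location, so fall back to an empty string
        "location": tweet.user.location or "",
    }


sample = FakeTweet(
    date=datetime(2021, 11, 12, 23, 59, 21, tzinfo=timezone.utc),
    content="Let's go Duke",
    likeCount=3,
    user=FakeUser(username="example_user"),
)
row = tweet_to_row(sample)
print(row)
```

A list of such dictionaries can be passed directly to `pd.DataFrame`, which uses the keys as column names.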

## 2. Scrape tweets from a text search query

The queried text could be certain keywords and hashtags.
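
The search examples in this lab pass operators such as `since:`, `until:`, `min_faves:`, and `from:` inside the query string. The helper below is a minimal sketch of assembling such a string from separate arguments; `build_query` and its parameter names are hypothetical and not part of snscrape, which simply receives the finished string.

```python
from datetime import date
from typing import Optional


def build_query(text: str = "",
                since: Optional[date] = None,
                until: Optional[date] = None,
                min_faves: Optional[int] = None,
                from_user: Optional[str] = None) -> str:
    """Join search text and optional operators into one query string."""
    parts = [text] if text else []
    if from_user is not None:
        parts.append(f"from:{from_user}")
    if since is not None:
        parts.append(f"since:{since.isoformat()}")
    if until is not None:
        parts.append(f"until:{until.isoformat()}")
    # compare against None so that min_faves=0 is still written out
    if min_faves is not None:
        parts.append(f"min_faves:{min_faves}")
    return " ".join(parts)


keyword_query = build_query("let's go duke",
                            since=date(2021, 11, 12),
                            until=date(2021, 11, 13))
user_query = build_query(from_user="DukeKunshan")
print(keyword_query)
print(user_query)
```

The resulting string is what would be passed to `sntwitter.TwitterSearchScraper(...)`.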

### Example 1
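In this example, we scrape up to 3,000 tweets that match the search text **Chinese virus** and were posted between 2022-02-15 and 2022-03-16. The `min_faves` operator sets the minimum number of likes a tweet must have: `min_faves:0` keeps every tweet, while a higher value such as `min_faves:3` keeps only tweets with at least three likes. We save each tweet's posted date `tweet.date`, text content `tweet.content`, poster's location `tweet.user.location`, and number of received likes `tweet.likeCount` into a dataframe.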

In [40]:
# create a list to append twitter data to
tweets_list1 = []

# use TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('Chinese virus since:2022-02-15 until:2022-03-16 min_faves:0').get_items()):
    if i >= 3000:
        break
    tweets_list1.append([tweet.date,  tweet.content, tweet.user.location, tweet.likeCount])

In [41]:
# create a dataframe to view the result
tweets_df1 = pd.DataFrame(tweets_list1, columns=['Date', 'Content', 'User\'s location', '# of likes'])
tweets_df1

Unnamed: 0,Date,Content,User's location,# of likes
0,2022-03-15 23:37:26+00:00,"Chinese virus cases climb, raise threat of tra...",United States,2
1,2022-03-15 23:29:36+00:00,"Chinese virus cases climb, raise threat of tra...",,0
2,2022-03-15 23:18:26+00:00,VP Snickers hubby has the chinese virus too......,them thar hills Arkansas,1
3,2022-03-15 23:12:37+00:00,@alexotaola @Hola_Otaola @Cubanoselmundo 👉 ¡RA...,Continente Americano,1
4,2022-03-15 23:10:06+00:00,@TheView @hulu Help her out how? She should ...,,2
...,...,...,...,...
2721,2022-02-15 00:53:17+00:00,"Hey Tweeter, when are you going to apologize f...","St. Charles, MO",10
2722,2022-02-15 00:34:44+00:00,@tanay_mandowara @Si_lv_er I completely believ...,"Punyabhoomi, aka Bharat",1
2723,2022-02-15 00:24:20+00:00,VIOLENT REVOLUTION NOW INEVITABLE #Sad \n🇨🇦🇨🇦🇨...,MALAGA Spain🇪🇸,1
2724,2022-02-15 00:18:35+00:00,@POTUS no more rules on the Chinese virus?-it’...,,0


In [52]:
# len(tweets_df1)
tweets_df1.sort_values(by='# of likes', ascending=False).head(30)

Unnamed: 0,Date,Content,User's location,# of likes
772,2022-03-08 23:24:50+00:00,"the lesson of ""Chinese Virus"" and people rando...",Unceded Coast Salish Territory,9694
680,2022-03-09 18:02:43+00:00,There is a reason why USA never raised concern...,India,878
2423,2022-02-19 01:46:40+00:00,"Tons of love, respect &amp; regards for your b...",India,746
1387,2022-03-01 14:01:44+00:00,"West was busy sleeping, when China has been s...",India,741
180,2022-03-15 03:06:13+00:00,Most Indians (especially the educated ones) ha...,,621
2171,2022-02-22 04:42:11+00:00,Rana is trying her best to defame India but in...,"चुरु, भारत",621
231,2022-03-14 17:03:55+00:00,Today I had parked my car near an auto stand f...,Tamil Nadu . India,427
2337,2022-02-20 12:02:31+00:00,Has to do this to save his leg. \n2 month's to...,India,381
1613,2022-02-27 06:31:18+00:00,"Why is it OK to denounce all ""Russians"" becaus...","Washington, DC",308
1656,2022-02-26 22:26:37+00:00,#Russophobia\n\nI'm old enough to remember tha...,"Glasgow, Scotland",291


In [59]:
# tweets_df1.loc[np.random.randint(len(tweets_df1)),'Content']
tweets_df1.loc[994,'Content']

'Remember: Don\'t call it "the Chinese virus." https://t.co/FPWtqgIheo'

In [9]:
## Output example1 csv
tweets_df1.to_csv('eg1.csv')

### Example 2

In this example, we scrape tweets that contain the keywords in the phrase **let's go duke** around the date of a basketball match. We save each tweet's posted date `tweet.date`, text content `tweet.content`, and the user's location `tweet.user.location` into a dataframe.
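
The query in this example covers a single day by pairing `since:` with an `until:` date one day later. As a small sketch under that assumption (that `until:` names the day after the last day wanted), the window can be computed rather than typed by hand. `one_day_window` is a hypothetical helper, not an snscrape function.

```python
from datetime import date, timedelta


def one_day_window(day: date) -> str:
    """Return 'since:... until:...' operators covering a single calendar day."""
    # assumption: until: is the day after the last day we want to include
    next_day = day + timedelta(days=1)
    return f"since:{day.isoformat()} until:{next_day.isoformat()}"


window = one_day_window(date(2021, 11, 12))
print(window)
```

Using `timedelta` keeps the arithmetic correct across month and year boundaries.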

In [5]:
# create a list to append twitter data to
tweets_list2 = []

# use TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('let\'s go duke since:2021-11-12 until:2021-11-13').get_items()):
    tweets_list2.append([tweet.date, tweet.content, tweet.user.location])

In [6]:
# create a dataframe to view the result
tweets_df2 = pd.DataFrame(tweets_list2, columns=['Date', 'Content', 'User\'s location'])
tweets_df2

Unnamed: 0,Date,Content,User's location
0,2021-11-12 23:59:21+00:00,Let's go Duke 🤘,My Own Personal Hell
1,2021-11-12 23:58:46+00:00,Roach\nKeels\nMoore\nBanchero\nWilliams\n\nGra...,
2,2021-11-12 23:57:29+00:00,👏🏻 Let’s 👏🏻 Go ✊🏻 Duke! 🔵😈 https://t.co/H589Xp...,"Germantown, MD"
3,2021-11-12 23:56:40+00:00,Let’s go Duke!! #BeatArmy,#SI6HTS
4,2021-11-12 23:55:37+00:00,LET’S GO DUKE!!! One last ride against where t...,"Raleigh, NC"
5,2021-11-12 23:54:42+00:00,Let’s Go Duke👿 https://t.co/beTdrGHzXW,
6,2021-11-12 23:53:25+00:00,Let’s go Duke!!! 💙😈🏀,
7,2021-11-12 23:51:58+00:00,LET’S GO DUKE 😈,252
8,2021-11-12 23:51:13+00:00,Gametime baby let's go Duke https://t.co/lmXSV...,"San Antonio, TX"
9,2021-11-12 23:48:41+00:00,@DukeMBB Let's Go Duke!!! #HereComesDuke,


In [10]:
## Output example2 csv
tweets_df2.to_csv('eg2.csv')

## 3. Scrape tweets from a specific user

### Example 3

In this example, we scrape the latest 100 tweets posted by the Twitter user **@DukeKunshan**. We save each tweet's posted date `tweet.date`, text content `tweet.content`, and username `tweet.user.username` into a dataframe.
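
Capping the number of results, as this example does, can also be written with `itertools.islice` from the standard library instead of a counter and `break`. The sketch below demonstrates the idea on a stand-in generator so that it runs offline; `fake_items` and `take` are hypothetical names, and with snscrape the generator would presumably be the one returned by `get_items()`.

```python
from itertools import islice


def fake_items():
    """Hypothetical stand-in for a scraper's endless stream of results."""
    n = 0
    while True:
        yield f"tweet {n}"
        n += 1


def take(items, limit):
    """Collect at most `limit` items from an iterator, then stop reading."""
    return list(islice(items, limit))


first_100 = take(fake_items(), 100)
print(len(first_100), first_100[0], first_100[-1])
```

Because `islice` stops pulling from the generator once the limit is reached, no extra items are requested beyond the cap.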

In [7]:
# create a list to append twitter data to
tweets_list3 = []

# use TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:DukeKunshan').get_items()):
    if i >= 100:
        break
    tweets_list3.append([tweet.date, tweet.content, tweet.user.username])

In [8]:
# create a dataframe to view the result
tweets_df3 = pd.DataFrame(tweets_list3, columns=['Date', 'Content', 'Username'])
tweets_df3

Unnamed: 0,Date,Content,Username
0,2022-03-28 10:00:05+00:00,Many graduating students from DKU’s Master of ...,DukeKunshan
1,2022-03-24 01:00:05+00:00,📢 Starting tomorrow: Discrete Math Seminar\n\n...,DukeKunshan
2,2022-03-23 02:30:02+00:00,📢 Starting tomorrow: Duke Kunshan University ...,DukeKunshan
3,2022-03-16 02:00:05+00:00,📢 Save the date: Duke Kunshan University 2022...,DukeKunshan
4,2022-03-15 15:30:12+00:00,Check out these photos of construction progres...,DukeKunshan
...,...,...,...
95,2021-09-22 13:00:43+00:00,There’s nothing better than a good roommate at...,DukeKunshan
96,2021-09-20 03:30:02+00:00,A new art book co-edited by Duke Kunshan profe...,DukeKunshan
97,2021-09-19 01:00:10+00:00,Wishing everyone a happy Mid-Autumn Festival!\...,DukeKunshan
98,2021-09-16 07:00:05+00:00,📢 Starting tomorrow: DKU Career Fair | Fall 2...,DukeKunshan


In [11]:
tweets_df3.to_csv('eg3.csv')

## Exercise

## Scrape all tweets posted from 2021-10-10 to 2021-11-30 that mention 'covid modeling'

In [24]:
tweets_list_exc = []

# use TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('Covid modeling since:2021-10-10 until:2021-11-30').get_items()):
    tweets_list_exc.append([tweet.date, tweet.content, tweet.user.location])

In [25]:
# the third field collected above is tweet.user.location, so label it accordingly
tweets_df_ex = pd.DataFrame(tweets_list_exc, columns=['Date', 'Content', 'User\'s location'])
tweets_df_ex

Unnamed: 0,Date,Content,User's location
0,2021-11-29 23:42:12+00:00,New SARS-CoV-2 modeling suggests virus will re...,
1,2021-11-29 23:42:02+00:00,@medrxivpreprint New SARS-CoV-2 modeling sugge...,
2,2021-11-29 23:04:11+00:00,"Join us on Friday, Dec.3 at 2pm HST for a semi...",University of Hawai‘i
3,2021-11-29 22:50:29+00:00,@ZaleskiLuke @GOP Trump failed abysmally w/ hi...,USA
4,2021-11-29 22:48:18+00:00,@Siobhan23738909 @JCWI_UK @JolyonMaugham @Bori...,the future the past never now
...,...,...,...
963,2021-10-10 16:15:54+00:00,The COVID vaccine clinic is open after he Farm...,"West Fork, AR"
964,2021-10-10 15:31:21+00:00,@BernieHancock69 @DavidStaplesYEG The same poo...,I live on Freehold Land
965,2021-10-10 05:51:59+00:00,how rude it is that i have to do a project mod...,scorpio ☼ sag ☽ cap ➶
966,2021-10-10 05:15:48+00:00,Covid modeling update...\n\nWe now go live to ...,"Taranaki Region, New Zealand"


In [None]:
tweets_df_ex.to_csv('ex_covid_modeling.csv')

# Optional: Scraping Weibo data

If your project focuses on Chinese society, you may need to scrape data from [Weibo](http://www.weibo.com/) instead of Twitter. A simple, easy-to-use tool can be found [here](https://github.com/dataabc/weibo-search). If you cannot sign up for an account but need a cookie to use this tool, you can e-mail me at rb347@duke.edu.