# Scraping Twitter data with snscrape

## 0. Set up
**snscrape** is a scraper for social networking services. It supports services such as Twitter, Facebook, and Instagram. It scrapes things like user profiles, hashtags, or searches and returns the discovered items, e.g. the relevant posts. More information about snscrape is available in its [GitHub repository](https://github.com/JustAnotherArchivist/snscrape).

**Note: you may need to connect to the Duke VPN (portal.duke.edu) to finish this lab.**

In [None]:
# Install snscrape using pip.
# Important: snscrape requires Python 3.8 or later.
# If your Python is older than 3.8, try "conda install python=[target version]", or create a new conda env
# with "conda create -n [name you like] python=[target version]" on the command line.
!pip3 install snscrape

# Or install the development version directly from GitHub:
# !pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git

# For more details, see the GitHub page: https://github.com/JustAnotherArchivist/snscrape
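
Because snscrape needs Python 3.8 or later, it can help to check the interpreter version before installing. The sketch below is one way to do that; the helper name `meets_minimum` is made up for this lab and is not part of snscrape.

```python
import sys

MINIMUM_VERSION = (3, 8)  # minimum Python version stated in the install note above


def meets_minimum(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor) version is at least `minimum`.

    Hypothetical helper written for this lab, not an snscrape function.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if meets_minimum():
    print("Python version is new enough for snscrape.")
else:
    print("Python is older than 3.8; update it or create a new conda env.")
```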




In [None]:
# import libraries
import snscrape.modules.twitter as sntwitter

import pandas as pd

## 1. Overview

In today's lab, we will use snscrape to scrape data from Twitter. You will learn to scrape tweets from:
1. a text search query
2. a specific user

The snscrape tweet object exposes many useful attributes [(Beck, 2020)](https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af). You might need some of the attributes below when scraping data for your group projects.
* url: Permalink pointing to tweet location
* date: The date on which the tweet was created
* content: The text content of the tweet
* id: The id of the tweet
* user: User object containing the following data: username, displayname, id, description, descriptionUrls, verified, created, followersCount, friendsCount, statusesCount, favouritesCount, listedCount, location, protected, linkUrl, profileImageUrl, profileBannerUrl
* replyCount: The count of replies
* retweetCount: The count of retweets
* likeCount: The count of likes
* quoteCount: The count of users that quoted the tweet and replied
* conversationId: The id of the conversation the tweet belongs to (for a tweet that is not a reply, this appears to be the same as the tweet id)
* lang: Machine generated, assumed language of the tweet
* source: Where the tweet was posted from (e.g., iPhone, Android, etc.)
* retweetedTweet: The id of the original tweet (if it is a retweet)
* mentionedUsers: The user objects of any mentioned user in the tweet
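
As a rough illustration of how these attributes are read, the sketch below flattens one tweet-like object into a plain dictionary. To keep it runnable offline, it uses `types.SimpleNamespace` stand-ins with made-up values instead of real snscrape objects. The helper `tweet_to_row` and the sample data are assumptions for this lab; only the attribute names come from the list above.

```python
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Flatten a tweet-like object into a dict of selected fields.

    Hypothetical helper: it only assumes the attribute names listed above
    and falls back to None when an attribute is missing.
    """
    user = getattr(tweet, "user", None)
    return {
        "url": getattr(tweet, "url", None),
        "date": getattr(tweet, "date", None),
        "content": getattr(tweet, "content", None),
        "likeCount": getattr(tweet, "likeCount", None),
        # Nested attributes live on the user object, e.g. tweet.user.location.
        "username": getattr(user, "username", None),
        "location": getattr(user, "location", None),
    }


# Stand-in object with made-up values (not real scraped data).
sample_tweet = SimpleNamespace(
    url="https://twitter.com/example_user/status/1",
    date="2021-03-15 12:00:00+00:00",
    content="Hello from the lab!",
    likeCount=3,
    user=SimpleNamespace(username="example_user", location="Durham, NC"),
)

row = tweet_to_row(sample_tweet)
print(row["username"], row["location"], row["likeCount"])
```

A list of such dictionaries can be passed directly to `pd.DataFrame(...)`, which uses the dictionary keys as column names.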

## 2. Scrape tweets from a text search query

The query text can contain keywords and hashtags.
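
A query string can also combine keywords with search operators such as `since:`, `until:`, `min_faves:`, and `from:`, all of which appear in the examples of this lab. The sketch below assembles such a string. `build_query` is a hypothetical helper written for illustration, not a snscrape function; its result is simply the text you would pass to `TwitterSearchScraper`.

```python
def build_query(text="", since=None, until=None, min_faves=None, from_user=None):
    """Join keywords and optional search operators into one query string.

    Hypothetical helper for this lab; dates are 'YYYY-MM-DD' strings.
    """
    parts = []
    if text:
        parts.append(text)
    if from_user:
        parts.append(f"from:{from_user}")
    if since:
        parts.append(f"since:{since}")
    if until:
        parts.append(f"until:{until}")
    if min_faves is not None:
        parts.append(f"min_faves:{min_faves}")
    return " ".join(parts)


query = build_query("#stopasianhate", since="2021-03-15", until="2021-03-16", min_faves=3)
print(query)  # prints: #stopasianhate since:2021-03-15 until:2021-03-16 min_faves:3
```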

### Example 1
In this example, we scrape 10 tweets containing the **#stopasianhate** hashtag that were posted on 2021-03-15 (`since:2021-03-15 until:2021-03-16`) and have at least three likes (`min_faves:3`). For each tweet, we save its posting date `tweet.date`, text content `tweet.content`, the poster's location `tweet.user.location`, and the number of likes received `tweet.likeCount` into a dataframe.

In [None]:
# create a list to append twitter data to
tweets_list1 = []

# use TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('#stopasianhate since:2021-03-15 until:2021-03-16 min_faves:3').get_items()):
    if i >= 10:
        break
    tweets_list1.append([tweet.date, tweet.content, tweet.user.location, tweet.likeCount])

In [None]:
# create a dataframe to view the result
tweets_df1 = pd.DataFrame(tweets_list1, columns=['Date', 'Content', 'User\'s location', '# of likes'])
tweets_df1

| | Date | Content | User's location | # of likes |
|---|---|---|---|---|
| 0 | 2021-03-15 23:40:23+00:00 | Nobody can stop @JoeBiden @FLOTUS ! #vaccine n... | 台北市, 台灣 | 4 |
| 1 | 2021-03-15 23:36:27+00:00 | A big thank you to the entire community for su... | San Francisco, CA | 9 |
| 2 | 2021-03-15 23:28:45+00:00 | @BigBosSebas @rayvolpe nice we talking about #... | Los Angeles, CA | 5 |
| 3 | 2021-03-15 23:08:56+00:00 | @CeFaanKim @Syissle @JoshHartmann Thank you @... | Los Angeles & New York | 12 |
| 4 | 2021-03-15 23:06:24+00:00 | Remember their faces. Remember their names. #S... | | 20 |
| 5 | 2021-03-15 23:04:32+00:00 | 1st step to solving a problem is acknowledging... | San Antonio, TX | 4 |
| 6 | 2021-03-15 22:48:26+00:00 | Grateful for all my #YeunBuns, but let’s keep ... | Los Angeles, CA | 1626 |
| 7 | 2021-03-15 22:35:14+00:00 | We condemn the hate crimes and discrimination ... | Worldwide | 6 |
| 8 | 2021-03-15 22:28:09+00:00 | When you have a moment please read this thread... | Atlanta, GA | 26 |
| 9 | 2021-03-15 22:24:12+00:00 | “Go back to f****** communist China you b****!... | New York City | 1142 |


### Example 2

In this example, we scrape tweets that contain the keywords in the phrase **let's go duke**, posted on the day of a basketball game (`since:2021-11-12 until:2021-11-13`). For each tweet, we save its posting date `tweet.date`, text content `tweet.content`, and the user's location `tweet.user.location` into a dataframe.

In [None]:
# create a list to append twitter data to
tweets_list2 = []

# use TwitterSearchScraper to scrape data and append tweets to list
for tweet in sntwitter.TwitterSearchScraper("let's go duke since:2021-11-12 until:2021-11-13").get_items():
    tweets_list2.append([tweet.date, tweet.content, tweet.user.location])

In [None]:
# create a dataframe to view the result
tweets_df2 = pd.DataFrame(tweets_list2, columns=['Date', 'Content', 'User\'s location'])
tweets_df2

| | Date | Content | User's location |
|---|---|---|---|
| 0 | 2021-11-12 23:59:21+00:00 | Let's go Duke 🤘 | My Own Personal Hell |
| 1 | 2021-11-12 23:58:46+00:00 | Roach\nKeels\nMoore\nBanchero\nWilliams\n\nGra... | |
| 2 | 2021-11-12 23:57:29+00:00 | 👏🏻 Let’s 👏🏻 Go ✊🏻 Duke! 🔵😈 https://t.co/H589Xp... | Germantown, MD |
| 3 | 2021-11-12 23:56:40+00:00 | Let’s go Duke!! #BeatArmy | #SI6HTS |
| 4 | 2021-11-12 23:55:37+00:00 | LET’S GO DUKE!!! One last ride against where t... | Raleigh, NC |
| 5 | 2021-11-12 23:54:42+00:00 | Let’s Go Duke👿 https://t.co/beTdrGHzXW | |
| 6 | 2021-11-12 23:53:25+00:00 | Let’s go Duke!!! 💙😈🏀 | |
| 7 | 2021-11-12 23:51:58+00:00 | LET’S GO DUKE 😈 | 252 |
| 8 | 2021-11-12 23:51:13+00:00 | Gametime baby let's go Duke https://t.co/lmXSV... | San Antonio, TX |
| 9 | 2021-11-12 23:48:41+00:00 | @DukeMBB Let's Go Duke!!! #HereComesDuke | |
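
The loop in Example 2 has no cap, so it keeps going until the scraper runs out of matching tweets, whereas Examples 1 and 3 stop early with `enumerate` and `break`. `itertools.islice` offers another way to take only the first few items from a generator. The sketch below shows the pattern on a stand-in generator, `fake_items`, which is hypothetical and merely plays the role of `get_items()` so that the code runs without network access.

```python
from itertools import islice


def fake_items():
    """Hypothetical stand-in for a scraper's get_items(): yields items lazily."""
    n = 0
    while True:  # endless on purpose, like a search with very many matches
        yield f"tweet {n}"
        n += 1


def take_first(items, limit):
    """Return a list with at most `limit` items taken from the front of an iterable."""
    return list(islice(items, limit))


first_ten = take_first(fake_items(), 10)
print(len(first_ten), first_ten[0], first_ten[-1])
```

With snscrape, the same helper could presumably be applied to the generator returned by `get_items()`, since `islice` works on any iterable.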


## 3. Scrape tweets from a specific user

### Example 3

In this example, we scrape the latest 100 tweets posted by the Twitter user **@DukeKunshan**. For each tweet, we save its posting date `tweet.date`, text content `tweet.content`, and username `tweet.user.username` into a dataframe.

In [None]:
# create a list to append twitter data to
tweets_list3 = []

# use TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:DukeKunshan').get_items()):
    if i >= 100:
        break
    tweets_list3.append([tweet.date, tweet.content, tweet.user.username])

In [None]:
# create a dataframe to view the result
tweets_df3 = pd.DataFrame(tweets_list3, columns=['Date', 'Content', 'Username'])
tweets_df3

| | Date | Content | Username |
|---|---|---|---|
| 0 | 2022-03-24 01:00:05+00:00 | 📢 Starting tomorrow: Discrete Math Seminar\n\n... | DukeKunshan |
| 1 | 2022-03-23 02:30:02+00:00 | 📢 Starting tomorrow: Duke Kunshan University ... | DukeKunshan |
| 2 | 2022-03-16 02:00:05+00:00 | 📢 Save the date: Duke Kunshan University 2022... | DukeKunshan |
| 3 | 2022-03-15 15:30:12+00:00 | Check out these photos of construction progres... | DukeKunshan |
| 4 | 2022-03-14 11:00:10+00:00 | 📢 GHYLP Global Health Seminar Series 2022\n\n🗓... | DukeKunshan |
| ... | ... | ... | ... |
| 95 | 2021-09-20 03:30:02+00:00 | A new art book co-edited by Duke Kunshan profe... | DukeKunshan |
| 96 | 2021-09-19 01:00:10+00:00 | Wishing everyone a happy Mid-Autumn Festival!\... | DukeKunshan |
| 97 | 2021-09-16 07:00:05+00:00 | 📢 Starting tomorrow: DKU Career Fair \| Fall 2... | DukeKunshan |
| 98 | 2021-09-13 03:30:01+00:00 | After Covid-19 made international travel almos... | DukeKunshan |


# Optional: Scraping Weibo data

If your project's topic concerns Chinese society, you may need to scrape data from [Weibo](http://www.weibo.com/) instead of Twitter. A simple, easy-to-use tool can be found [here](https://github.com/dataabc/weibo-search).