# Article Notebook for Scraping Twitter Using snscrape's CLI Commands With Python
<br>Package GitHub: https://github.com/JustAnotherArchivist/snscrape
<br>This notebook uses the development version of snscrape.

Article Read-Along: https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af

### Notebook Author: Martin Beck
<b>Information current as of November 26th, 2020</b><br>

This notebook contains materials for scraping tweets from Twitter by calling snscrape's CLI commands from Python.

<b>Dependencies: </b> 
- Your <b>Python</b> version must be <b>3.8</b> or higher. The development version of snscrape will not work with Python 3.7 or lower. You can download the latest Python version [here](https://www.python.org/downloads/).
- <b>Development version of snscrape</b>. If you don't already have it, uncomment the pip install line in the cell below to install it from within the notebook.
- <b>Pandas</b>. Its dataframes allow easy manipulation and indexing of data. This is more of a preference, but it is what I follow in this notebook.

In [1]:
# Run the pip install command below if you don't already have the library
# !pip install git+https://github.com/JustAnotherArchivist/snscrape.git

# Run the below command if you don't already have Pandas
# !pip install pandas

# Imports
import os
import pandas as pd

# Query by Username
The username cells at the end of this notebook scrape 100 tweets from a single username, then export them to a CSV file with Pandas.

# Query by Text Search
The code below scrapes up to 500 tweets matching a text search, using the `since_date` and `until_date` values set in the cell (2019-03-01 and 2020-03-01), then exports them to a CSV file with Pandas.

In [4]:
# Setting variables to be used in format string command below
tweet_count = 500
text_query = "#tattleware OR #Bossware"
since_date = "2019-03-01"
until_date = "2020-03-01"

# Using OS library to call CLI commands in Python
os.system('snscrape --jsonl --max-results {} --since {} twitter-search "{} until:{}"> paper3_2019.json'.format(tweet_count, since_date, text_query, until_date))

0
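
The `0` above is the exit status returned by `os.system`, which means the shell command finished without reporting an error. Building the command by string formatting makes quoting easy to get wrong, so here is a minimal sketch of an alternative that passes the same arguments as a list to `subprocess.run` and redirects stdout to a file. It assumes the same snscrape flags used in the cell above; the helper names `build_search_command` and `run_to_file` are hypothetical and not part of snscrape.

```python
import subprocess


def build_search_command(tweet_count, since_date, text_query, until_date):
    """Build the snscrape call from the cell above as an argument list."""
    return [
        "snscrape",
        "--jsonl",
        "--max-results", str(tweet_count),
        "--since", since_date,
        "twitter-search",
        "{} until:{}".format(text_query, until_date),
    ]


def run_to_file(command, output_path):
    """Run the command, write its stdout to output_path, return the exit code."""
    with open(output_path, "w", encoding="utf-8") as out_file:
        completed = subprocess.run(command, stdout=out_file, check=False)
    return completed.returncode


command = build_search_command(500, "2019-03-01", "#tattleware OR #Bossware", "2020-03-01")
print(command)

# To actually scrape (requires snscrape to be installed):
# exit_code = run_to_file(command, "paper3_2019.json")
```

Because the arguments are passed as a list, no shell is involved, so the `#` and the spaces in the query do not need extra quoting.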

In [7]:
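# Reads a JSON Lines file (one JSON object per line) and creates a pandas dataframe
# Note: the CLI command above writes to paper3_2019.json, while this cell reads tattle3.json;
# point read_json at whichever file you actually generated
tweets_df2 = pd.read_json('tattle3.json', lines=True)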

# Displays first 5 entries from dataframe
tweets_df2.head()

Unnamed: 0,entities,public_metrics,author_id,id,created_at,reply_settings,source,lang,referenced_tweets,text,conversation_id,possibly_sensitive,author,__twarc,context_annotations,attachments,in_reply_to_user_id,in_reply_to_user,geo
0,"{'mentions': [{'start': 3, 'end': 12, 'usernam...","{'retweet_count': 1, 'reply_count': 0, 'like_c...",876437232000827392,1398711486797778944,2021-05-29 18:42:59+00:00,everyone,Twitter for Android,en,"[{'type': 'retweeted', 'id': '1387636063812849...",RT @wplawmag: wp Legal Briefcase No. 184 | 29 ...,1398711486797778944,False,"{'url': '', 'public_metrics': {'followers_coun...",{'url': 'https://api.twitter.com/2/tweets/sear...,,,,,
1,"{'mentions': [{'start': 3, 'end': 18, 'usernam...","{'retweet_count': 79, 'reply_count': 0, 'like_...",1433348587,1394787036331057152,2021-05-18 22:48:38+00:00,everyone,Twitter Web App,en,"[{'type': 'retweeted', 'id': '1381581282916896...","RT @commieleejones: “The app, called Blip, gen...",1394787036331057152,False,"{'url': 'https://t.co/RTOVjOKNAM', 'public_met...",{'url': 'https://api.twitter.com/2/tweets/sear...,"[{'domain': {'id': '67', 'name': 'Interests an...",,,,
2,"{'mentions': [{'start': 3, 'end': 17, 'usernam...","{'retweet_count': 1, 'reply_count': 0, 'like_c...",703220155560505344,1393919130139598848,2021-05-16 13:19:53+00:00,everyone,FLSA Today2,en,"[{'type': 'retweeted', 'id': '1393914343184547...",RT @SpringLawFirm: “Tattleware” or software th...,1393919130139598848,False,"{'url': 'https://t.co/8bJBrk7sI5', 'public_met...",{'url': 'https://api.twitter.com/2/tweets/sear...,"[{'domain': {'id': '123', 'name': 'Ongoing New...",,,,
3,"{'mentions': [{'start': 203, 'end': 218, 'user...","{'retweet_count': 1, 'reply_count': 0, 'like_c...",846475793429614592,1393914343184547840,2021-05-16 13:00:51+00:00,everyone,Hootsuite Inc.,en,,“Tattleware” or software that somehow tracks e...,1393914343184547840,False,"{'url': 'https://t.co/z7IEpeAdya', 'public_met...",{'url': 'https://api.twitter.com/2/tweets/sear...,"[{'domain': {'id': '123', 'name': 'Ongoing New...",,,,
4,"{'mentions': [{'start': 3, 'end': 12, 'usernam...","{'retweet_count': 1, 'reply_count': 0, 'like_c...",876437232000827392,1392239564639584256,2021-05-11 22:05:53+00:00,everyone,Twitter for Android,en,"[{'type': 'retweeted', 'id': '1391682547218796...",RT @CDHLegal: Podcast | #EmploymentLaw Directo...,1392239564639584256,False,"{'url': '', 'public_metrics': {'followers_coun...",{'url': 'https://api.twitter.com/2/tweets/sear...,"[{'domain': {'id': '30', 'name': 'Entities [En...",,,,
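
Passing `lines=True` tells `pd.read_json` to treat every line of the file as a separate JSON object. Nested objects then land in a single column as dictionaries, which is why cells such as `public_metrics` above show up as truncated `{...}` values. Here is a minimal sketch of expanding such a column with `pd.json_normalize`, using two made-up records (the values are illustrative and not taken from the scraped file):

```python
import io

import pandas as pd

# Two made-up JSON Lines records; values are illustrative only
sample_jsonl = io.StringIO(
    '{"id": 1, "text": "first", "public_metrics": {"retweet_count": 1, "reply_count": 0}}\n'
    '{"id": 2, "text": "second", "public_metrics": {"retweet_count": 79, "reply_count": 0}}\n'
)

sample_df = pd.read_json(sample_jsonl, lines=True)

# Expand the dictionary column into one column per key
metrics_df = pd.json_normalize(sample_df["public_metrics"].tolist())

# Put the expanded columns next to the remaining top-level columns
flat_df = pd.concat([sample_df.drop(columns=["public_metrics"]), metrics_df], axis=1)
print(flat_df.columns.tolist())
```

Flattening before export keeps each metric in its own CSV column instead of a single stringified dictionary.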


In [10]:
# Export dataframe into a CSV
tweets_df2.to_csv('text-query-tweets.csv', sep=',', index=False)

In [None]:
# Setting variables to be used in format string command below
tweet_count = 100
username = "jack"

# Using OS library to call CLI commands in Python
os.system("snscrape --jsonl --max-results {} twitter-search 'from:{}'> user-tweets.json".format(tweet_count, username))

In [6]:
# Reads the json generated from the CLI command above and creates a pandas dataframe
tweets_df1 = pd.read_json('user-tweets.json', lines=True)

# Displays first 5 entries from dataframe
tweets_df1.head()

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,likeCount,quoteCount,conversationId,lang,source,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/jack/status/13324354308016...,2020-11-27 21:25:36+00:00,@JesseDorogusker @Square ❤️,@JesseDorogusker @Square ❤️,1332435430801690624,"{'username': 'jack', 'displayname': 'jack', 'i...",[],[],54,8,226,1,1332428871891775488,und,"<a href=""http://twitter.com/download/iphone"" r...",,,,"[{'username': 'JesseDorogusker', 'displayname'..."
1,https://twitter.com/jack/status/13291496370060...,2020-11-18 19:49:02+00:00,@NeerajKA Welcome!,@NeerajKA Welcome!,1329149637006041088,"{'username': 'jack', 'displayname': 'jack', 'i...",[],[],72,14,800,8,1329140522565439490,en,"<a href=""http://twitter.com/download/iphone"" r...",,,,"[{'username': 'NeerajKA', 'displayname': 'Neer..."
2,https://twitter.com/jack/status/13291372550263...,2020-11-18 18:59:50+00:00,Join @CashApp! #Bitcoin https://t.co/SbYANIZyix,Join @CashApp! #Bitcoin twitter.com/owenbjenni...,1329137255026311168,"{'username': 'jack', 'displayname': 'jack', 'i...",[https://twitter.com/owenbjennings/status/1329...,[https://t.co/SbYANIZyix],585,277,2507,132,1329137255026311168,en,"<a href=""http://twitter.com/download/iphone"" r...",,,{'url': 'https://twitter.com/owenbjennings/sta...,"[{'username': 'CashApp', 'displayname': 'Cash ..."
3,https://twitter.com/jack/status/13291366656847...,2020-11-18 18:57:29+00:00,@kateconger @sarahintampa Nah,@kateconger @sarahintampa Nah,1329136665684705280,"{'username': 'jack', 'displayname': 'jack', 'i...",[],[],38,5,176,10,1329126492731699203,und,"<a href=""http://twitter.com/download/iphone"" r...",,,,"[{'username': 'kateconger', 'displayname': 'o...."
4,https://twitter.com/jack/status/13291358061921...,2020-11-18 18:54:05+00:00,@mmasnick Terrible idea! And terribly false.,@mmasnick Terrible idea! And terribly false.,1329135806192107521,"{'username': 'jack', 'displayname': 'jack', 'i...",[],[],51,13,222,16,1329128773845860352,en,"<a href=""http://twitter.com/download/iphone"" r...",,,,"[{'username': 'mmasnick', 'displayname': 'Mike..."
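
The `user` and `mentionedUsers` columns above hold nested records instead of plain values. Here is a minimal sketch of pulling one field out of such a column, assuming each `user` cell is a dictionary with a `username` key, as the truncated output suggests. The sample dataframe is made up for illustration.

```python
import pandas as pd

# Made-up frame shaped like the output above; values are illustrative only
sample_df = pd.DataFrame(
    {
        "content": ["hello", "world"],
        "user": [
            {"username": "jack", "displayname": "jack"},
            {"username": "jack", "displayname": "jack"},
        ],
        "likeCount": [226, 800],
    }
)

# Pull a single key out of each dictionary; non-dictionary cells become None
sample_df["username"] = sample_df["user"].apply(
    lambda user: user.get("username") if isinstance(user, dict) else None
)

# Keep only the flat columns
flat_df = sample_df[["username", "content", "likeCount"]]
print(flat_df)
```

The same pattern could be applied to `tweets_df1` before calling `to_csv`, so the exported file carries a plain `username` column.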


In [7]:
# Export dataframe into a CSV
tweets_df1.to_csv('user-tweets.csv', sep=',', index=False)