# Cleaning raw JSON tweets data scraped using snscrape library

## Introduction

In this notebook, I discuss how to clean and pre-process raw JSON tweet data scraped using the Python library snscrape. The nested JSON data is flattened into DataFrames that can then be saved as CSV files for easier analysis.

As an example, I have scraped tweets that contain the hashtag "#FarmersProtest" using snscrape which gives a JSON file about the relevant tweets.

snscrape is a Python library that lets you scrape tweets without going through the official Twitter API, so it needs no API credentials and is not subject to the API's request limits. I will not be focusing on how to scrape tweets and get the raw JSON tweets data. For an easy-to-follow tutorial on how to use snscrape to scrape tweets, check out [this Medium blog by Martin Beck](https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af).

## Importing required libraries

Let's start by importing the required libraries. We will need pandas to load and work with the JSON data, as well as its json_normalize() function to flatten nested JSON fields into columns.

In [1]:
# Importing required libraries

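import pandas as pd
# json_normalize has lived in the top-level pandas namespace since pandas 1.0;
# the older pandas.io.json import path is deprecated.
from pandas import json_normalize
import warnings
warnings.filterwarnings("ignore")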

## Read raw JSON tweets data

Next, we load the raw JSON tweets data using the read_json() function in pandas. The argument lines=True tells pandas to treat each line of the file as a separate JSON object. Since we are interested in performing analysis using techniques such as NLP, I have retained only the tweets that are in English. Let's then take a look at the shape and the first 5 records of the raw JSON data.

In [2]:
# Read JSON file containing tweets data and remove tweets not in English

raw_tweets = pd.read_json(r'../input/farmers-protest-tweets-dataset-raw-json/farmers-protest-tweets-2021-2-4.json', lines=True)
raw_tweets = raw_tweets[raw_tweets['lang']=='en']
print("Shape: ", raw_tweets.shape)
raw_tweets.head(5)

Shape:  (48429, 21)


Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,quoteCount,conversationId,lang,source,sourceUrl,sourceLabel,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/ArjunSinghPanam/status/136...,2021-02-24 09:23:35+00:00,The world progresses while the Indian police a...,The world progresses while the Indian police a...,1364506249291784198,"{'username': 'ArjunSinghPanam', 'displayname':...",[https://twitter.com/ravisinghka/status/136415...,[https://t.co/es3kn0IQAF],0,0,...,0,1364506249291784198,en,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,{'url': 'https://twitter.com/RaviSinghKA/statu...,"[{'username': 'narendramodi', 'displayname': '..."
1,https://twitter.com/PrdeepNain/status/13645062...,2021-02-24 09:23:32+00:00,#FarmersProtest \n#ModiIgnoringFarmersDeaths \...,#FarmersProtest \n#ModiIgnoringFarmersDeaths \...,1364506237451313155,"{'username': 'PrdeepNain', 'displayname': 'Pra...",[],[],0,0,...,0,1364506237451313155,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'Kisanektamorcha', 'displayname'..."
3,https://twitter.com/anmoldhaliwal/status/13645...,2021-02-24 09:23:16+00:00,@ReallySwara @rohini_sgh watch full video here...,@ReallySwara @rohini_sgh watch full video here...,1364506167226032128,"{'username': 'anmoldhaliwal', 'displayname': '...",[https://youtu.be/-bUKumwq-J8],[https://t.co/wBPNdJdB0n],0,0,...,0,1364350947099484160,en,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'ReallySwara', 'displayname': 'S..."
8,https://twitter.com/anmoldhaliwal/status/13645...,2021-02-24 09:22:34+00:00,@mandeeppunia1 watch full video here https://t...,@mandeeppunia1 watch full video here youtu.be/...,1364505991887347714,"{'username': 'anmoldhaliwal', 'displayname': '...",[https://youtu.be/-bUKumwq-J8],[https://t.co/wBPNdJdB0n],0,0,...,0,1364428985074032646,en,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'mandeeppunia1', 'displayname': ..."
11,https://twitter.com/anmoldhaliwal/status/13645...,2021-02-24 09:21:51+00:00,@mandeeppunia1 watch full video here https://t...,@mandeeppunia1 watch full video here youtu.be/...,1364505813834989568,"{'username': 'anmoldhaliwal', 'displayname': '...",[https://youtu.be/-bUKumwq-J8],[https://t.co/wBPNdJdB0n],0,0,...,0,1364480983995584515,en,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'mandeeppunia1', 'displayname': ..."
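
To make the loading step easier to follow, here is a minimal sketch on made-up data. The three records and their values are hypothetical and only mimic a few of the fields seen above. It shows that lines=True reads one JSON object per line, that nested objects such as 'user' arrive as Python dicts, and that filtering on 'lang' leaves gaps in the index (which is why the table above is numbered 0, 1, 3, 8, 11).

```python
import io
import pandas as pd

# Hypothetical records in JSON Lines form: one JSON object per line.
sample_jsonl = io.StringIO(
    '{"id": 1, "lang": "en", "content": "hello", "user": {"id": 10, "username": "a"}}\n'
    '{"id": 2, "lang": "hi", "content": "namaste", "user": {"id": 11, "username": "b"}}\n'
    '{"id": 3, "lang": "en", "content": "bye", "user": {"id": 10, "username": "a"}}\n'
)

# lines=True parses each line as its own record (row).
sample = pd.read_json(sample_jsonl, lines=True)

# Boolean filtering keeps the original row labels, so the index now has a gap.
english_only = sample[sample["lang"] == "en"]
print(english_only.index.tolist())
print(type(english_only.loc[0, "user"]))
```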


## Normalize 'user' field in raw_tweets

We see that 'raw_tweets' has a nested JSON field named 'user'. This field can be flattened for easier analysis using the json_normalize() function in pandas. Essentially, semi-structured JSON data is "normalized" into a flat table with one column per key.

For more info on how to use json_normalize(), check out [the documentation page for pandas.io.json.json_normalize()](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.io.json.json_normalize.html).

I have also renamed the fields 'id' to 'userId' and 'url' to 'profileUrl' so that they are not confused with the tweet-level 'id' and 'url' fields. The fields 'description' and 'linkTcourl' are not needed here and have been dropped.

Let's take a look at the first 5 records.

In [3]:
# Normalize 'user' field

users = json_normalize(raw_tweets['user'])
users.drop(['description', 'linkTcourl'], axis=1, inplace=True)
users.rename(columns={'id':'userId', 'url':'profileUrl'}, inplace=True)
users.head(5)

Unnamed: 0,username,displayname,userId,rawDescription,descriptionUrls,verified,created,followersCount,friendsCount,statusesCount,favouritesCount,listedCount,mediaCount,location,protected,linkUrl,profileImageUrl,profileBannerUrl,profileUrl
0,ArjunSinghPanam,Arjun Singh Panam,45091142,"Global Citizen, Actor, Director: Sky is the ro...",[],False,2009-06-06T07:50:57+00:00,603,311,17534,4269,23,1211,,False,https://www.cosmosmovieofficial.com,https://pbs.twimg.com/profile_images/121554174...,https://pbs.twimg.com/profile_banners/45091142...,https://twitter.com/ArjunSinghPanam
1,PrdeepNain,Pradeep Nain,1355092620662329349,Live in the sunshine where you belong,[],False,2021-01-29T09:58:06+00:00,14,134,160,240,0,102,,False,,https://pbs.twimg.com/profile_images/136417063...,https://pbs.twimg.com/profile_banners/13550926...,https://twitter.com/PrdeepNain
2,anmoldhaliwal,Anmol,137908912,coming soon,[],False,2010-04-28T03:12:18+00:00,51,27,228,77,0,12,"Brampton, On",False,,https://pbs.twimg.com/profile_images/156497514...,,https://twitter.com/anmoldhaliwal
3,anmoldhaliwal,Anmol,137908912,coming soon,[],False,2010-04-28T03:12:18+00:00,51,27,228,77,0,12,"Brampton, On",False,,https://pbs.twimg.com/profile_images/156497514...,,https://twitter.com/anmoldhaliwal
4,anmoldhaliwal,Anmol,137908912,coming soon,[],False,2010-04-28T03:12:18+00:00,51,27,228,77,0,12,"Brampton, On",False,,https://pbs.twimg.com/profile_images/156497514...,,https://twitter.com/anmoldhaliwal
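
As a rough illustration of what the normalization step does, the sketch below flattens a tiny, made-up Series of user dicts. The values and the index labels are hypothetical, and the Series is converted to a plain list first, which is an assumption made here to keep the behaviour the same across pandas versions. Note that the result gets a fresh 0, 1, 2, ... index, which matches the renumbered rows in the table above.

```python
import pandas as pd

# Hypothetical nested 'user' values with a gapped index, as left by a filter.
nested_users = pd.Series(
    [
        {"id": 10, "username": "a", "url": "https://twitter.com/a"},
        {"id": 11, "username": "b", "url": "https://twitter.com/b"},
    ],
    index=[0, 3],
)

# Each dict key becomes a column; passing a list yields a fresh RangeIndex.
flat_users = pd.json_normalize(nested_users.tolist())

# Rename the generic keys so they cannot be mistaken for tweet-level fields.
flat_users = flat_users.rename(columns={"id": "userId", "url": "profileUrl"})
print(flat_users)
```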


## Create users DF

Next, let's create the final DataFrame for Twitter users who tweeted using the hashtag "#FarmersProtest". Because a user appears once for every tweet they posted, I have dropped duplicate records from the DataFrame based on the field 'userId', since each user has a unique user ID.

Let's take a look at the shape and first 5 records for the final DataFrame for the Twitter users.

In [4]:
# Create DataFrame and remove duplicates

users = pd.DataFrame(users)
users.drop_duplicates(subset=['userId'], inplace=True)
print("Shape: ", users.shape)
users.head(5)

Shape:  (12407, 19)


Unnamed: 0,username,displayname,userId,rawDescription,descriptionUrls,verified,created,followersCount,friendsCount,statusesCount,favouritesCount,listedCount,mediaCount,location,protected,linkUrl,profileImageUrl,profileBannerUrl,profileUrl
0,ArjunSinghPanam,Arjun Singh Panam,45091142,"Global Citizen, Actor, Director: Sky is the ro...",[],False,2009-06-06T07:50:57+00:00,603,311,17534,4269,23,1211,,False,https://www.cosmosmovieofficial.com,https://pbs.twimg.com/profile_images/121554174...,https://pbs.twimg.com/profile_banners/45091142...,https://twitter.com/ArjunSinghPanam
1,PrdeepNain,Pradeep Nain,1355092620662329349,Live in the sunshine where you belong,[],False,2021-01-29T09:58:06+00:00,14,134,160,240,0,102,,False,,https://pbs.twimg.com/profile_images/136417063...,https://pbs.twimg.com/profile_banners/13550926...,https://twitter.com/PrdeepNain
2,anmoldhaliwal,Anmol,137908912,coming soon,[],False,2010-04-28T03:12:18+00:00,51,27,228,77,0,12,"Brampton, On",False,,https://pbs.twimg.com/profile_images/156497514...,,https://twitter.com/anmoldhaliwal
5,ShariaActivist,Sharia Ali Siddique,1362487487747121152,Little Climate & Environmental Activist | Foun...,[],False,2021-02-18T19:41:57+00:00,46,106,112,60,0,53,she/they,False,,https://pbs.twimg.com/profile_images/136428288...,https://pbs.twimg.com/profile_banners/13624874...,https://twitter.com/ShariaActivist
6,KaurDosanjh1979,Red 💚,538638801,,[],False,2012-03-27T23:14:32+00:00,427,1005,29803,18962,0,30,,False,,https://pbs.twimg.com/profile_images/135582023...,https://pbs.twimg.com/profile_banners/53863880...,https://twitter.com/KaurDosanjh1979
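
The de-duplication step can be sketched on a small, made-up table as follows (the user IDs and names are hypothetical). By default drop_duplicates() keeps the first occurrence of each 'userId' and preserves the original row labels, which is why the table above jumps from index 2 to index 5.

```python
import pandas as pd

# Hypothetical users table in which one user appears three times.
demo_users = pd.DataFrame(
    {
        "userId": [10, 11, 11, 11, 12],
        "username": ["a", "b", "b", "b", "c"],
    }
)

# Keep the first row seen for each userId; later repeats are discarded.
unique_users = demo_users.drop_duplicates(subset=["userId"])
print(unique_users.index.tolist())
```

If a clean 0, 1, 2, ... index is preferred, reset_index(drop=True) can be chained after drop_duplicates().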


## Create tweets DF

Next, we will transform the 'raw_tweets' DataFrame to obtain a DataFrame for tweets that contain the hashtag "#FarmersProtest". A new field, 'userId', is added, which holds the unique ID of the user who posted each tweet.

I have then retained only the important fields and renamed 'id' to 'tweetId' and 'url' to 'tweetUrl' so that they are clearly distinguished from the user-level fields.

Let's take a look at the first 5 records of this DataFrame.

In [5]:
# Transform 'raw_tweets' DataFrame

# Add column for 'userId' by pulling 'id' out of each nested 'user' dict
raw_tweets['userId'] = [user['id'] for user in raw_tweets['user']]

# Keep only the important columns
cols = ['url', 'date', 'renderedContent', 'id', 'userId', 'replyCount', 'retweetCount', 'likeCount', 'quoteCount', 'source', 'media', 'retweetedTweet', 'quotedTweet', 'mentionedUsers']

# rename() returns a new DataFrame, so we avoid modifying a slice of 'raw_tweets' in place
tweets = raw_tweets[cols].rename(columns={'id':'tweetId', 'url':'tweetUrl'})
tweets.head(5)

Unnamed: 0,tweetUrl,date,renderedContent,tweetId,userId,replyCount,retweetCount,likeCount,quoteCount,source,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/ArjunSinghPanam/status/136...,2021-02-24 09:23:35+00:00,The world progresses while the Indian police a...,1364506249291784198,45091142,0,0,0,0,"<a href=""http://twitter.com/download/iphone"" r...",,,{'url': 'https://twitter.com/RaviSinghKA/statu...,"[{'username': 'narendramodi', 'displayname': '..."
1,https://twitter.com/PrdeepNain/status/13645062...,2021-02-24 09:23:32+00:00,#FarmersProtest \n#ModiIgnoringFarmersDeaths \...,1364506237451313155,1355092620662329349,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'Kisanektamorcha', 'displayname'..."
3,https://twitter.com/anmoldhaliwal/status/13645...,2021-02-24 09:23:16+00:00,@ReallySwara @rohini_sgh watch full video here...,1364506167226032128,137908912,0,0,0,0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'ReallySwara', 'displayname': 'S..."
8,https://twitter.com/anmoldhaliwal/status/13645...,2021-02-24 09:22:34+00:00,@mandeeppunia1 watch full video here youtu.be/...,1364505991887347714,137908912,0,0,0,0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'mandeeppunia1', 'displayname': ..."
11,https://twitter.com/anmoldhaliwal/status/13645...,2021-02-24 09:21:51+00:00,@mandeeppunia1 watch full video here youtu.be/...,1364505813834989568,137908912,0,0,0,0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'mandeeppunia1', 'displayname': ..."


Finally, I create the final DataFrame for tweets that contain the hashtag "#FarmersProtest". Duplicate records are dropped from the DataFrame based on the unique ID of each tweet (the field 'tweetId'). The row count stays at 48,429, so this dataset did not contain any duplicate tweet IDs.

Let's take a look at the shape and first 5 records of the final tweets DataFrame.

In [6]:
# Convert to DataFrame and remove duplicates (non-English tweets were already filtered out when loading)

tweets = pd.DataFrame(tweets)
tweets.drop_duplicates(subset=['tweetId'], inplace=True)
print("Shape: ", tweets.shape)
tweets.head(5)

Shape:  (48429, 14)


Unnamed: 0,tweetUrl,date,renderedContent,tweetId,userId,replyCount,retweetCount,likeCount,quoteCount,source,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/ArjunSinghPanam/status/136...,2021-02-24 09:23:35+00:00,The world progresses while the Indian police a...,1364506249291784198,45091142,0,0,0,0,"<a href=""http://twitter.com/download/iphone"" r...",,,{'url': 'https://twitter.com/RaviSinghKA/statu...,"[{'username': 'narendramodi', 'displayname': '..."
1,https://twitter.com/PrdeepNain/status/13645062...,2021-02-24 09:23:32+00:00,#FarmersProtest \n#ModiIgnoringFarmersDeaths \...,1364506237451313155,1355092620662329349,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'Kisanektamorcha', 'displayname'..."
3,https://twitter.com/anmoldhaliwal/status/13645...,2021-02-24 09:23:16+00:00,@ReallySwara @rohini_sgh watch full video here...,1364506167226032128,137908912,0,0,0,0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'ReallySwara', 'displayname': 'S..."
8,https://twitter.com/anmoldhaliwal/status/13645...,2021-02-24 09:22:34+00:00,@mandeeppunia1 watch full video here youtu.be/...,1364505991887347714,137908912,0,0,0,0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'mandeeppunia1', 'displayname': ..."
11,https://twitter.com/anmoldhaliwal/status/13645...,2021-02-24 09:21:51+00:00,@mandeeppunia1 watch full video here youtu.be/...,1364505813834989568,137908912,0,0,0,0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'mandeeppunia1', 'displayname': ..."


## Conclusion

In this notebook, we have seen how to perform some transformations that convert the raw JSON data about tweets scraped using snscrape into a more usable flat table form. The single JSON file is now split into two easier-to-use DataFrames, 'tweets' and 'users', which hold data about the tweets and about the users who posted them, respectively. The two DataFrames can be joined on the 'userId' field.
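
A join on 'userId' could look roughly like the sketch below. The two small tables and their values are hypothetical stand-ins for 'tweets' and 'users', and the choice of a left join with validate="many_to_one" is an assumption about what is wanted here: every tweet is kept, and pandas raises an error if a user ID appears more than once on the users side.

```python
import pandas as pd

# Hypothetical stand-ins for the 'tweets' and 'users' DataFrames.
demo_tweets = pd.DataFrame(
    {
        "tweetId": [100, 101, 102],
        "userId": [10, 11, 10],
        "renderedContent": ["first", "second", "third"],
    }
)
demo_users = pd.DataFrame(
    {
        "userId": [10, 11],
        "username": ["a", "b"],
        "followersCount": [5, 7],
    }
)

# Many tweets map to one user, so each tweet row gains that user's columns.
tweets_with_users = demo_tweets.merge(
    demo_users, on="userId", how="left", validate="many_to_one"
)
print(tweets_with_users)
```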

From here, we can save the 'tweets' and 'users' DataFrames as CSV files or continue the analysis using the DFs.
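
Saving could be sketched as follows. The example writes a made-up one-row table to an in-memory buffer instead of a file so that it runs anywhere; with real data, a file name of your choice (for instance a hypothetical 'tweets.csv') would take the buffer's place. Passing index=False leaves out the row labels, which avoids an extra unnamed column when the file is read back.

```python
import io
import pandas as pd

# Hypothetical one-row stand-in for the 'tweets' DataFrame.
demo = pd.DataFrame(
    {
        "tweetId": [1364506249291784198],
        "renderedContent": ["line one\nline two"],
    }
)

# Write without the index, then read the CSV back to check the round trip.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.dtypes)
```

One thing to keep in mind is that columns holding dicts or lists, such as 'media' or 'mentionedUsers', are written to CSV as plain strings, so they would need to be parsed again after loading if their contents are required.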