# Tweets data analysis

## Import libraries

I will need to import the following libraries:
* os: to load environment variables
* re: to use regular expressions to extract the hashtags
* pandas: to handle data
* emoji: to extract the emojis

In [2]:
import os
import re
import pandas as pd
import emoji

Load environment variables and set other variables

In [4]:
# Set the SAMPLE environment variable before running to load only a sample of the data
SAMPLE = os.environ.get("SAMPLE", None)
show_columns = ["url", "content", "retweetCount"]
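One thing to keep in mind with the SAMPLE variable: any non-empty string is truthy in Python, so even SAMPLE=0 or SAMPLE=false would select the sample file. The sketch below shows one possible stricter interpretation. The helper name `env_flag` and the accepted values are assumptions made for illustration; they are not part of the original notebook.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Hypothetical helper: interpret an environment variable as a boolean."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Assumed convention: only these values count as "enabled"
    return raw.strip().lower() in {"1", "true", "yes", "on"}


os.environ["SAMPLE"] = "0"
print(bool(os.environ.get("SAMPLE")))  # True: any non-empty string is truthy
print(env_flag("SAMPLE"))              # False: "0" is not an accepted value
```

With a helper like this, `SAMPLE = env_flag("SAMPLE")` would only select the sample file when the variable is explicitly enabled.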

In [6]:
# Set the path to load the data from based on the value of the SAMPLE env variable
data_path = "data/farmers-protest-tweets-sample.json" if SAMPLE else "data/farmers-protest-tweets-2021-2-4.json"
# Load the data from the selected path; note lines=True, since this is a newline-delimited JSON file
data_df = pd.read_json(data_path, lines=True)
# Show a small sample of the data for reference
data_df.head()

| | url | date | content | renderedContent | id | user | outlinks | tcooutlinks | replyCount | retweetCount | ... | quoteCount | conversationId | lang | source | sourceUrl | sourceLabel | media | retweetedTweet | quotedTweet | mentionedUsers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://twitter.com/ArjunSinghPanam/status/136... | 2021-02-24 09:23:35+00:00 | The world progresses while the Indian police a... | The world progresses while the Indian police a... | 1364506249291784198 | {'username': 'ArjunSinghPanam', 'displayname':... | [https://twitter.com/ravisinghka/status/136415... | [https://t.co/es3kn0IQAF] | 0 | 0 | ... | 0 | 1364506249291784198 | en | `<a href="http://twitter.com/download/iphone" r...` | http://twitter.com/download/iphone | Twitter for iPhone | | | {'url': 'https://twitter.com/RaviSinghKA/statu... | [{'username': 'narendramodi', 'displayname': '... |
| 1 | https://twitter.com/PrdeepNain/status/13645062... | 2021-02-24 09:23:32+00:00 | #FarmersProtest \n#ModiIgnoringFarmersDeaths \... | #FarmersProtest \n#ModiIgnoringFarmersDeaths \... | 1364506237451313155 | {'username': 'PrdeepNain', 'displayname': 'Pra... | [] | [] | 0 | 0 | ... | 0 | 1364506237451313155 | en | `<a href="http://twitter.com/download/android" ...` | http://twitter.com/download/android | Twitter for Android | [{'thumbnailUrl': 'https://pbs.twimg.com/ext_t... | | | [{'username': 'Kisanektamorcha', 'displayname'... |
| 2 | https://twitter.com/parmarmaninder/status/1364... | 2021-02-24 09:23:22+00:00 | ਪੈਟਰੋਲ ਦੀਆਂ ਕੀਮਤਾਂ ਨੂੰ ਮੱਦੇਨਜ਼ਰ ਰੱਖਦੇ ਹੋਏ \nਮੇ... | ਪੈਟਰੋਲ ਦੀਆਂ ਕੀਮਤਾਂ ਨੂੰ ਮੱਦੇਨਜ਼ਰ ਰੱਖਦੇ ਹੋਏ \nਮੇ... | 1364506195453767680 | {'username': 'parmarmaninder', 'displayname': ... | [] | [] | 0 | 0 | ... | 0 | 1364506195453767680 | pa | `<a href="http://twitter.com/download/android" ...` | http://twitter.com/download/android | Twitter for Android | | | | |
| 3 | https://twitter.com/anmoldhaliwal/status/13645... | 2021-02-24 09:23:16+00:00 | @ReallySwara @rohini_sgh watch full video here... | @ReallySwara @rohini_sgh watch full video here... | 1364506167226032128 | {'username': 'anmoldhaliwal', 'displayname': '... | [https://youtu.be/-bUKumwq-J8] | [https://t.co/wBPNdJdB0n] | 0 | 0 | ... | 0 | 1364350947099484160 | en | `<a href="https://mobile.twitter.com" rel="nofo...` | https://mobile.twitter.com | Twitter Web App | [{'thumbnailUrl': 'https://pbs.twimg.com/ext_t... | | | [{'username': 'ReallySwara', 'displayname': 'S... |
| 4 | https://twitter.com/KotiaPreet/status/13645061... | 2021-02-24 09:23:10+00:00 | #KisanEktaMorcha #FarmersProtest #NoFarmersNoF... | #KisanEktaMorcha #FarmersProtest #NoFarmersNoF... | 1364506144002088963 | {'username': 'KotiaPreet', 'displayname': 'Pre... | [] | [] | 0 | 0 | ... | 0 | 1364506144002088963 | und | `<a href="http://twitter.com/download/iphone" r...` | http://twitter.com/download/iphone | Twitter for iPhone | [{'previewUrl': 'https://pbs.twimg.com/media/E... | | | |
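To make the newline-delimited JSON format concrete: each line of the file is an independent JSON object, which is why `lines=True` is needed when loading it. The following is a minimal standard-library sketch of the same idea; the two records are made-up sample data, not rows from the real file.

```python
import io
import json

# Made-up NDJSON content: one complete JSON object per line
ndjson = io.StringIO(
    '{"id": 1, "content": "first tweet", "retweetCount": 3}\n'
    '{"id": 2, "content": "second tweet", "retweetCount": 0}\n'
)

# Parse every non-empty line on its own
records = [json.loads(line) for line in ndjson if line.strip()]

print(len(records))                # 2
print(records[0]["retweetCount"])  # 3
```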


## Top retweets

To obtain the most retweeted tweets I just need to sort the dataframe by the "retweetCount" column.
As an optimization, I first filter out the records that have no retweets.

In [7]:
data_df[data_df["retweetCount"] > 0][show_columns].sort_values("retweetCount", ascending=False).head(10)

| | url | content | retweetCount |
|---|---|---|---|
| 111329 | https://twitter.com/RakeshTikaitBKU/status/136... | मध्यप्रदेश में निजी व्यापारी 200 करोड़ का धान ... | 7723 |
| 7645 | https://twitter.com/dhruv_rathee/status/136414... | There's a #FarmersProtest happening in Germany... | 6164 |
| 89780 | https://twitter.com/rupikaur_/status/136088206... | disha ravi, a 21-year-old climate activist, ha... | 4673 |
| 88911 | https://twitter.com/amaanbali/status/136090860... | Disha Ravi broke down in court room and told j... | 3742 |
| 111556 | https://twitter.com/jedijasmin_/status/1360162... | Farmers are so sweet. Y’all have to see this @... | 3332 |
| 64492 | https://twitter.com/rupikaur_/status/136179092... | india is targeting young women to silence diss... | 3230 |
| 108072 | https://twitter.com/RaviSinghKA/status/1360260... | Bollywood has betrayed Panjab &amp; the farmer... | 3182 |
| 60721 | https://twitter.com/sherryontopp/status/136189... | लहरों को ख़ामोश देख कर ये ना समझना कि समंदर मे... | 3057 |
| 29510 | https://twitter.com/sherryontopp/status/136309... | हाँ मैं जानता हूँ कि मैं शायर नहीं, और ज़ुल्म ... | 3040 |
| 24160 | https://twitter.com/sherryontopp/status/136337... | कलियुग है साहब , यहाँ झूठे को स्वीकार किया जा... | 2622 |
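As a possible alternative to filtering and then sorting the whole dataframe, pandas offers `DataFrame.nlargest`, which returns only the top rows for a column. The sketch below uses a small made-up dataframe (`toy_df` is hypothetical) rather than the real dataset.

```python
import pandas as pd

# Made-up data standing in for the tweets dataframe
toy_df = pd.DataFrame({
    "url": ["u1", "u2", "u3", "u4"],
    "content": ["a", "b", "c", "d"],
    "retweetCount": [5, 0, 12, 7],
})
show_columns = ["url", "content", "retweetCount"]

# Keep the 2 rows with the highest retweetCount, already in descending order
top_retweets = toy_df.nlargest(2, "retweetCount")[show_columns]
print(top_retweets)
```

On ties the row order may differ from the `sort_values` approach, so the two are not guaranteed to produce identical listings.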


## Top users

To get the top tweeting users, I first need to extract the username (or another unique user id) from the structure in the "user" column. I then group by the new "username" column, count the tweet ids in each group, sort the counts in descending order, and show the top 10.

In [8]:
data_df["username"] = data_df["user"].map(lambda x: x["username"])
data_df.groupby("username")["id"].count().sort_values(ascending=False).head(10)
# TODO: Improve how to show the list

username
jot__b             1019
rebelpacifist       850
MaanDee08215437     830
Gurpreetd86         636
GurmVicky           597
shells_n_petals     576
preetysaini321      573
ish_kayy            515
KaurDosanjh1979     512
DigitalKisanBot     490
Name: id, dtype: int64
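Regarding the TODO about improving how the list is shown, one option is to turn the counts into a two-column dataframe with explicit column names. This is only a sketch on made-up data; `toy_df` and the column name "tweets" are assumptions chosen for the example.

```python
import pandas as pd

# Made-up data with the same nested "user" structure as the dataset
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "user": [
        {"username": "alice"},
        {"username": "bob"},
        {"username": "alice"},
        {"username": "alice"},
    ],
})
toy_df["username"] = toy_df["user"].map(lambda u: u["username"])

# value_counts() groups, counts, and sorts in descending order in one step
top_users = (
    toy_df["username"]
    .value_counts()
    .head(10)
    .rename_axis("username")
    .reset_index(name="tweets")
)
print(top_users)
```

The result is a regular dataframe with "username" and "tweets" columns, which renders as a table in a notebook instead of a plain Series.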

## Top dates

To get the dates with the most tweets, I first create a new column called "date_dd" holding only the date part of the "date" column. I then group by the new column, count the tweet ids in each group, sort the counts in descending order, and show the top 10.

In [11]:
data_df["date_dd"] = data_df["date"].dt.date
data_df.groupby("date_dd")["id"].count().sort_values(ascending=False).head(10)
# TODO: Improve how to show the list

date_dd
2021-02-12    12347
2021-02-13    11296
2021-02-17    11087
2021-02-16    10443
2021-02-14    10249
2021-02-18     9625
2021-02-15     9197
2021-02-20     8502
2021-02-23     8417
2021-02-19     8204
Name: id, dtype: int64
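Note that the timestamps in the "date" column are in UTC (the +00:00 offset in the sample above), so the daily buckets follow UTC days. If local days were of interest instead, the timestamp would have to be shifted before taking the date. The sketch below assumes Indian Standard Time (UTC+5:30) as the local zone; that choice and the helper name `local_date` are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Assumed local zone for the example: IST, a fixed UTC+5:30 offset
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def local_date(utc_timestamp: str, tz: timezone = IST):
    """Hypothetical helper: date of a UTC ISO timestamp as seen in another zone."""
    return datetime.fromisoformat(utc_timestamp).astimezone(tz).date()


utc_ts = "2021-02-12 19:00:00+00:00"
utc_day = datetime.fromisoformat(utc_ts).date()
ist_day = local_date(utc_ts)

print(utc_day)  # 2021-02-12
print(ist_day)  # 2021-02-13 (19:00 UTC is 00:30 the next day in IST)
```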

## Top hashtags

Hashtags are easily extracted by using a regex to find "words" that start with the # symbol followed by at least one alphanumeric character.

According to [Twitter (now X)](https://business.twitter.com/en/blog/how-to-create-and-use-hashtags.html#:~:text=Keep%20in%20mind%2C%20hashtags%20are%20not%20case%2Dsensitive%2C%20but%20adding%20capital%20letters%20does%20make%20them%20easier%20to%20read%3A%20%23MakeAWish%20vs.%20%23makeawish.), hashtags are case-insensitive, so it makes sense to normalize them by searching the lowercase version of the tweet.

First, I define a regex for the hashtags. Next, I create a new column with the list of hashtags found in each tweet. Then I create an auxiliary dataframe in which the hashtag lists are exploded (de-normalized) into one row per hashtag. Finally, I group this dataframe by the "hashtags" column, count the tweet ids in each group, sort the counts in descending order, and show the top 10.

In [12]:
# Extract and normalize hashtags into list with regex
hashtag_regex = r"#[0-z]+"
data_df["hashtags"] = data_df["content"].map(lambda x: re.findall(hashtag_regex, x.lower()))
# Select hashtag column and id and de-normalize the hashtag data in an aux dataframe
hashtag_df = data_df[["hashtags","id"]].explode("hashtags")
# Group by hashtag and count
hashtag_df.groupby("hashtags")["id"].count().sort_values(ascending=False).head(10)
# TODO: Improve how to show the list

hashtags
#farmersprotest             119297
#releasedetainedfarmers       6002
#farmersmakeindia             5358
#mahapanchayatrevolution      4869
#repealonlywayahead           4629
#indiabeingsilenced           4519
#farmersprotests              3750
#standwithfarmers             3678
#pagdi_sambhal_jatta          3551
#disharavi                    3170
Name: id, dtype: int64
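A caveat on the pattern `#[0-z]+` used in this notebook: the character class `[0-z]` covers the whole ASCII range from "0" to "z", which includes not only digits, letters, and the underscore but also punctuation such as `:`, `?`, and `@`. It also excludes non-ASCII letters. The sketch below compares it with a `\w`-based pattern on a made-up sentence; the alternative pattern is a suggestion, not what produced the counts above.

```python
import re

ORIGINAL_PATTERN = r"#[0-z]+"  # pattern used in the notebook
WORD_PATTERN = r"#\w+"         # assumed alternative: letters, digits, underscore

text = "join #farmersprotest? tag #a@b and #pagdi_sambhal_jatta"

print(re.findall(ORIGINAL_PATTERN, text))
# ['#farmersprotest?', '#a@b', '#pagdi_sambhal_jatta']

print(re.findall(WORD_PATTERN, text))
# ['#farmersprotest', '#a', '#pagdi_sambhal_jatta']
```

With the original pattern, a hashtag followed directly by a question mark is counted as a different hashtag, so switching patterns would likely change the counts slightly.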

## Top emojis

To extract the emojis from each tweet I use the [emoji Python library](https://pypi.org/project/emoji/). Its .analyze function yields one token for each emoji found in a string; collecting the .chars attribute of those tokens inside a .map gives a new column "emojis" with a list of emojis for each tweet. Next, I create an auxiliary dataframe in which the "emojis" column is exploded. Finally, I group this dataframe by the "emojis" column, count the tweet ids in each group, sort the counts in descending order, and show the top 10.

In [13]:
data_df["emojis"] = data_df["content"].map(lambda x: [e.chars for e in emoji.analyze(x)])
# Select the emojis column and id, and de-normalize the emoji lists in an aux dataframe
emoji_df = data_df[["emojis","id"]].explode("emojis")
# Group by emoji and count
emoji_df.groupby("emojis")["id"].count().sort_values(ascending=False).head(10)
# TODO: Improve how to show the list

emojis
🙏     5049
😂     3072
🚜     2972
🌾     2182
🇮🇳    2086
🤣     1668
✊     1651
❤️    1382
🙏🏻    1317
💚     1040
Name: id, dtype: int64
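The list above counts 🙏 and 🙏🏻 separately because the second one carries a skin-tone modifier. If such variants should be merged, one option is to strip the Fitzpatrick modifier code points (U+1F3FB to U+1F3FF) before counting. This is a sketch of that idea on made-up input using only the standard library; the helper name `strip_skin_tone` is hypothetical.

```python
from collections import Counter

# Translation table that deletes the five skin-tone modifiers U+1F3FB..U+1F3FF
SKIN_TONE_TABLE = {code_point: None for code_point in range(0x1F3FB, 0x1F400)}


def strip_skin_tone(emoji_chars: str) -> str:
    """Hypothetical helper: remove skin-tone modifiers from an emoji string."""
    return emoji_chars.translate(SKIN_TONE_TABLE)


FOLDED_HANDS = "\U0001F64F"  # folded hands without modifier
TRACTOR = "\U0001F69C"

# Made-up list: plain, light, and medium skin-tone folded hands, plus a tractor
extracted = [
    FOLDED_HANDS,
    FOLDED_HANDS + "\U0001F3FB",
    FOLDED_HANDS + "\U0001F3FD",
    TRACTOR,
]

counts = Counter(strip_skin_tone(e) for e in extracted)
print(counts[FOLDED_HANDS])  # 3
print(counts[TRACTOR])       # 1
```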

## Top influencers

To get the top influencers I create an auxiliary dataframe with two columns, the sum of retweets per user and the number of tweets per user, by grouping on the "username" column created earlier. I also compute the ratio between the sum of retweets and the number of tweets for each user; this average number of retweets per tweet highlights users who were very influential with fewer tweets. Finally, I sort once by the "retweetCount" column and once by the "retweetRatio" column, showing the top 10 for each.

In [19]:
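# Aggregate per user: total retweets ("retweetCount") and number of tweets ("id")
users_df = pd.concat([data_df.groupby("username")["retweetCount"].sum(),
                      data_df.groupby("username")["id"].count()], axis=1)
# Average retweets per tweet, computed column-wise instead of with a row-wise apply
users_df["retweetRatio"] = users_df["retweetCount"] / users_df["id"]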
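The same per-user aggregation could also be written with pandas named aggregation, which runs a single groupby and gives the tweet count a more descriptive column name than "id". The sketch below uses made-up data; `toy_df` and the column name "tweets" are assumptions for the example.

```python
import pandas as pd

# Made-up data: two tweets by "alice", one by "bob"
toy_df = pd.DataFrame({
    "username": ["alice", "alice", "bob"],
    "id": [1, 2, 3],
    "retweetCount": [10, 20, 90],
})

# One groupby producing both aggregates with explicit output names
users = toy_df.groupby("username").agg(
    retweetCount=("retweetCount", "sum"),
    tweets=("id", "count"),
)
# Average retweets per tweet
users["retweetRatio"] = users["retweetCount"] / users["tweets"]

print(users.sort_values("retweetRatio", ascending=False))
```

In this toy data "bob" ranks first by ratio with a single tweet, which mirrors how single-tweet accounts can dominate a ratio-based ranking.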

Top 10 Influencers based on the retweet count of all their tweets

In [20]:
users_df.sort_values("retweetCount", ascending=False).head(10)

| username | retweetCount | id | retweetRatio |
|---|---|---|---|
| amaanbali | 26354 | 97 | 271.690722 |
| saahilmenghani | 23288 | 81 | 287.506173 |
| RaviSinghKA | 22974 | 22 | 1044.272727 |
| sherryontopp | 19175 | 12 | 1597.916667 |
| RakeshTikaitBKU | 12001 | 4 | 3000.25 |
| rupikaur_ | 11420 | 10 | 1142.0 |
| news24tvchannel | 10960 | 85 | 128.941176 |
| iMani_KaurRai | 10636 | 165 | 64.460606 |
| Monica_Gill1 | 8593 | 175 | 49.102857 |
| bhupenderc19 | 7360 | 58 | 126.896552 |


Top 10 Influencers based on the retweet ratio

In [22]:
users_df.sort_values("retweetRatio", ascending=False).head(10)

| username | retweetCount | id | retweetRatio |
|---|---|---|---|
| dhruv_rathee | 6164 | 1 | 6164.0 |
| RakeshTikaitBKU | 12001 | 4 | 3000.25 |
| sushant_says | 6678 | 3 | 2226.0 |
| avinashkalla | 2208 | 1 | 2208.0 |
| keithellison | 1805 | 1 | 1805.0 |
| jedijasmin_ | 3340 | 2 | 1670.0 |
| sherryontopp | 19175 | 12 | 1597.916667 |
| Kisanektamorcha | 1554 | 1 | 1554.0 |
| ChitraSarwara | 1463 | 1 | 1463.0 |
| AcharyaPramodk | 2469 | 2 | 1234.5 |
