# Predicting the popularity of tweets

In [21]:
# Import all libraries needed through the whole notebook
import pandas as pd

# Exploratory data analysis

During the exploratory data analysis, we will look at the data to understand its structure and the relationships between the variables. We have two main goals during this phase:
1. Choose which variable will be our target variable
2. Identify the most important variables that will help us predict the target variable

## Load the data

The first step is to load the data and check that the column information and the data types are correct. One thing to keep in mind is that we have two datasets: one with the tweets and another with the users. We will need to join them to have all the information in one place.

In [22]:
# Load both datasets
tweets = pd.read_csv('./data/tweets.csv', low_memory=False)
users = pd.read_csv('./data/users.csv')

In [23]:
# Join the datasets
users.set_index('id', inplace=True)
tweets_users = tweets.join(users, on='user_id', lsuffix='_tweet', rsuffix='_user', how='left')
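A left join adds rows whenever a key is repeated on the right-hand side. The outputs later in this notebook show 40,867 rows but only 40,581 unique tweet `id` values, which would be consistent with some user ids appearing more than once in `users`. The sketch below shows one way to guard against that. It uses small made-up frames (`tweets_toy`, `users_toy`) rather than the real files, and it swaps `join` for `merge` so that the `validate` argument can be used.

```python
import pandas as pd

# Toy stand-ins for the real tables; the values are made up for illustration.
tweets_toy = pd.DataFrame({"id": [10, 11, 12], "user_id": [1, 1, 2]})
users_toy = pd.DataFrame({"id": [1, 2, 2], "followers": [100, 200, 200]})

# Count user ids that appear more than once on the right-hand side.
dup_user_ids = users_toy["id"].duplicated().sum()
print(f"duplicated user ids: {dup_user_ids}")

# Keep one row per user so that each tweet matches at most one user.
users_unique = users_toy.drop_duplicates(subset="id")

joined = tweets_toy.merge(
    users_unique,
    left_on="user_id",
    right_on="id",
    how="left",
    suffixes=("_tweet", "_user"),
    validate="many_to_one",  # raises pandas.errors.MergeError if user ids repeat
)

# The join should not change the number of tweets.
print(len(joined) == len(tweets_toy))
```

Whether dropping duplicates is the right fix depends on why the ids repeat. If the repeated rows differ, it may be better to inspect them before choosing which one to keep.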

### Check data has been loaded correctly

In [24]:
tweets_users

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,tweets,location,following,followers,likes,media,private,verified,avatar,background_image
0,1425590913959612419,1425590913959612419,1.628722e+12,2021-08-12 00:52:14,200,,RT @girlsalliance: We're so proud of the four ...,en,[],[],...,1770,"Washington, DC",16,20854298,184,461,False,True,https://pbs.twimg.com/profile_images/136674780...,https://pbs.twimg.com/profile_banners/40948655...
1,1427736867739299841,1427736867739299841,1.629234e+12,2021-08-17 22:59:29,200,,Some casual suggestions to 😏SLIDE😏 into when u...,en,"['shoesdaytuesday', 'afterskewlslide']",[],...,11420,,235,108819032,7995,2170,False,True,https://pbs.twimg.com/profile_images/139246535...,https://pbs.twimg.com/profile_banners/21447363...
2,1427667300488937476,1427667300488937476,1.629217e+12,2021-08-17 18:23:03,200,,RT @ValaAfshar: You are not your job.,en,[],[],...,11420,,235,108819032,7995,2170,False,True,https://pbs.twimg.com/profile_images/139246535...,https://pbs.twimg.com/profile_banners/21447363...
3,1427667012105371652,1427667012105371652,1.629217e+12,2021-08-17 18:21:55,200,,What have we become 😔😂 Toddler Cites Freedom ...,en,[],[],...,11420,,235,108819032,7995,2170,False,True,https://pbs.twimg.com/profile_images/139246535...,https://pbs.twimg.com/profile_banners/21447363...
4,1427497703596990467,1427497703596990467,1.629177e+12,2021-08-17 07:09:08,200,,The tech giants that refuse to massively addre...,en,[],[],...,11420,,235,108819032,7995,2170,False,True,https://pbs.twimg.com/profile_images/139246535...,https://pbs.twimg.com/profile_banners/21447363...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40576,1427593440469061634,1427593440469061634,1.629200e+12,2021-08-17 13:29:34,200,,Commencement of works. #Agenda111 https://t.c...,en,['agenda111'],[],...,7744,Ghana,352,2003463,338,1499,False,True,https://pbs.twimg.com/profile_images/817691975...,https://pbs.twimg.com/profile_banners/24721710...
40577,1427592955272089642,1427592930722820096,1.629200e+12,2021-08-17 13:27:38,200,,Commencement of works. #Agenda111 https://t.c...,en,['agenda111'],[],...,7744,Ghana,352,2003463,338,1499,False,True,https://pbs.twimg.com/profile_images/817691975...,https://pbs.twimg.com/profile_banners/24721710...
40578,1427592942441598980,1427592930722820096,1.629200e+12,2021-08-17 13:27:35,200,,Commencement of works. #Agenda111 https://t.c...,en,['agenda111'],[],...,7744,Ghana,352,2003463,338,1499,False,True,https://pbs.twimg.com/profile_images/817691975...,https://pbs.twimg.com/profile_banners/24721710...
40579,1427592930722820096,1427592930722820096,1.629200e+12,2021-08-17 13:27:32,200,,Commencement of works. #Agenda111 https://t.c...,en,['agenda111'],[],...,7744,Ghana,352,2003463,338,1499,False,True,https://pbs.twimg.com/profile_images/817691975...,https://pbs.twimg.com/profile_banners/24721710...


At first glance, we see that some columns do not have the expected data types. We will convert them to the correct types before doing any analysis, so that type issues do not cause problems later.

In [25]:
# Convert columns to the correct data types
tweets_users['id'] = tweets_users['id'].astype('str')
tweets_users['conversation_id'] = tweets_users['conversation_id'].astype('str')
tweets_users['created_at'] = pd.to_datetime(tweets_users['created_at'], unit='ms')
tweets_users['date'] = pd.to_datetime(tweets_users['date'])
tweets_users['user_id'] = tweets_users['user_id'].astype('str')
tweets_users['user_id_str'] = tweets_users['user_id_str'].astype('str')
tweets_users['video'] = tweets_users['video'].astype('bool')
tweets_users['user_rt_id'] = tweets_users['user_rt_id'].astype('str')
tweets_users['retweet_id'] = tweets_users['retweet_id'].astype('str')
# Remove the ' CEST' timezone suffix before parsing
tweets_users['retweet_date'] = pd.to_datetime(tweets_users['retweet_date'].str.replace(' CEST', '', regex=False))
tweets_users['join_datetime'] = pd.to_datetime(tweets_users['join_datetime'])
tweets_users['join_date'] = pd.to_datetime(tweets_users['join_date'])
tweets_users['join_time'] = pd.to_datetime(tweets_users['join_time'], format='%H:%M:%S %Z').dt.time
tweets_users['private'] = tweets_users['private'].astype('bool')
tweets_users['verified'] = tweets_users['verified'].astype('bool')
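One side effect of casting to `'str'` is worth checking: on the pandas 2.x line, missing values become the literal text `'nan'`, so they no longer count as missing. This may be why the null counts further down report 0 for `user_rt_id` and `retweet_id` while `user_rt` has 34,685 missing values. The following is a minimal sketch of an alternative, using a made-up series (`retweet_id_toy`) and the nullable `Int64` and `string` dtypes. It is an illustration, not the conversion used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy column: ids read as floats because one value is missing (made-up data).
retweet_id_toy = pd.Series([123.0, np.nan, 456.0])

# Plain string cast, as in the cell above. On the pandas 2.x line this
# turns NaN into the text 'nan'.
as_plain_str = retweet_id_toy.astype("str")

# Going through the nullable integer dtype first keeps the gap as <NA>
# and avoids the trailing '.0' in the text.
as_nullable = retweet_id_toy.astype("Int64").astype("string")

print(as_plain_str.tolist())
print(as_nullable.tolist())
print(as_nullable.isna().sum())
```

Note that real tweet ids are larger than 2^53, so they cannot be represented exactly once they have been read as `float64`. If exact ids matter, it is safer to request a string dtype for those columns when calling `pd.read_csv`.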

In [26]:
tweets_users.shape

(40867, 55)

In [27]:
tweets_users.info()

<class 'pandas.core.frame.DataFrame'>
Index: 40867 entries, 0 to 40580
Data columns (total 55 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   id                40867 non-null  object             
 1   conversation_id   40867 non-null  object             
 2   created_at        40867 non-null  datetime64[ns]     
 3   date              40867 non-null  datetime64[ns]     
 4   timezone          40867 non-null  int64              
 5   place             34 non-null     object             
 6   tweet             40867 non-null  object             
 7   language          40867 non-null  object             
 8   hashtags          40867 non-null  object             
 9   cashtags          40867 non-null  object             
 10  user_id           40867 non-null  object             
 11  user_id_str       40867 non-null  object             
 12  username_tweet    40867 non-null  object             
 13  name_t

In [28]:
tweets_users.describe()

Unnamed: 0,created_at,date,timezone,day,hour,nlikes,nreplies,nretweets,search,near,...,retweet_date,translate,trans_src,trans_dest,join_date,tweets,following,followers,likes,media
count,40867,40867,40867.0,40867.0,40867.0,40867.0,40867.0,40867.0,0.0,0.0,...,6182,0.0,0.0,0.0,40867,40867.0,40867.0,40867.0,40867.0,40867.0
mean,2021-08-15 13:00:53.727041280,2021-08-15 15:00:53.727041280,200.0,3.816013,12.99956,1417.17,78.194118,288.468569,,,...,2021-08-12 23:04:28.001941248,,,,2009-01-17 07:22:25.584456960,177398.772971,11529.55,10107600.0,13963.126728,50492.90819
min,2021-08-11 22:00:00,2021-08-12 00:00:00,200.0,1.0,0.0,0.0,0.0,0.0,,,...,2015-09-01 12:10:02,,,,2006-03-21 00:00:00,5.0,0.0,107.0,0.0,5.0
25%,2021-08-13 14:56:50,2021-08-13 16:56:50,200.0,2.0,6.0,6.0,0.0,4.0,,,...,2021-08-13 05:30:01.249999872,,,,2007-05-12 00:00:00,40113.0,418.0,3039465.0,353.0,3402.0
50%,2021-08-15 15:30:01,2021-08-15 17:30:01,200.0,4.0,15.0,39.0,3.0,16.0,,,...,2021-08-15 00:38:46,,,,2008-10-09 00:00:00,132248.0,862.0,4878881.0,2600.0,22586.0
75%,2021-08-17 13:45:32,2021-08-17 15:45:32,200.0,5.0,19.0,206.0,24.0,74.0,,,...,2021-08-16 23:37:16,,,,2009-10-22 00:00:00,306127.0,2265.0,10255470.0,8759.0,75519.0
max,2021-08-18 22:00:00,2021-08-19 00:00:00,200.0,7.0,23.0,1920242.0,88035.0,541964.0,,,...,2021-08-18 23:51:40,,,,2021-06-21 00:00:00,508811.0,4200793.0,129909300.0,492144.0,236024.0
std,,,0.0,1.887138,7.328175,18246.48,983.854499,4644.698916,,,...,,,,,,148999.823601,162886.2,13221460.0,51451.295821,62376.057012


In [29]:
tweets_users.nunique()

id                  40581
conversation_id     37273
created_at          36756
date                36756
timezone                1
place                  22
tweet               40026
language               40
hashtags             2561
cashtags                4
user_id               847
user_id_str           847
username_tweet        847
name_tweet            847
day                     7
hour                   24
link                40581
urls                19576
photos               5344
video                   2
thumbnail            8661
retweet                 2
nlikes               4573
nreplies             1258
nretweets            2281
quote_url            2446
search                  0
near                    0
geo                     0
source                  0
user_rt_id           3560
user_rt              6046
retweet_id           6048
reply_to             3576
retweet_date         5972
translate               0
trans_src               0
trans_dest              0
name_user   

In [30]:
tweets_users.isnull().sum()

id                      0
conversation_id         0
created_at              0
date                    0
timezone                0
place               40833
tweet                   0
language                0
hashtags                0
cashtags                0
user_id                 0
user_id_str             0
username_tweet          0
name_tweet              0
day                     0
hour                    0
link                    0
urls                    0
photos                  0
video                   0
thumbnail           31644
retweet                 0
nlikes                  0
nreplies                0
nretweets               0
quote_url               0
search              40867
near                40867
geo                 40867
source              40867
user_rt_id              0
user_rt             34685
retweet_id              0
reply_to                0
retweet_date        34685
translate           40867
trans_src           40867
trans_dest          40867
name_user   

The dataset has been loaded and the data is consistent with what we would expect from each variable. One detail to keep in mind is that the joined frame has 40,867 rows but only 40,581 unique `id` values, so some tweets appear more than once after the join.
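The null counts above show several columns (`search`, `near`, `geo`, `source`, `translate`, `trans_src`, `trans_dest`) with no values at all. A short sketch of how such columns could be detected and dropped is given below. It runs on a made-up frame (`df_toy`), and the names `null_share`, `empty_cols` and `df_trimmed` are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the pattern above: one full column, one sparse, one empty.
df_toy = pd.DataFrame({
    "tweet": ["a", "b", "c", "d"],
    "place": [None, "Accra", None, None],
    "geo": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, highest first.
null_share = df_toy.isna().mean().sort_values(ascending=False)
print(null_share)

# Columns without a single value carry no information and can be dropped.
empty_cols = null_share[null_share == 1.0].index.tolist()
df_trimmed = df_toy.drop(columns=empty_cols)
print(df_trimmed.columns.tolist())
```

Working with shares instead of raw counts makes it easier to compare columns once rows have been filtered out.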



### Data exploration

After loading the dataset correctly and checking the data types, we will do an exploratory data analysis. This will help us understand the data, find out which variables are most relevant for our project, and see how they can help us solve the problem.

In [31]:
tweets_users.head(10)

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,tweets,location,following,followers,likes,media,private,verified,avatar,background_image
0,1425590913959612419,1425590913959612419,2021-08-11 22:52:14,2021-08-12 00:52:14,200,,RT @girlsalliance: We're so proud of the four ...,en,[],[],...,1770,"Washington, DC",16,20854298,184,461,False,True,https://pbs.twimg.com/profile_images/136674780...,https://pbs.twimg.com/profile_banners/40948655...
1,1427736867739299841,1427736867739299841,2021-08-17 20:59:29,2021-08-17 22:59:29,200,,Some casual suggestions to 😏SLIDE😏 into when u...,en,"['shoesdaytuesday', 'afterskewlslide']",[],...,11420,,235,108819032,7995,2170,False,True,https://pbs.twimg.com/profile_images/139246535...,https://pbs.twimg.com/profile_banners/21447363...
2,1427667300488937476,1427667300488937476,2021-08-17 16:23:03,2021-08-17 18:23:03,200,,RT @ValaAfshar: You are not your job.,en,[],[],...,11420,,235,108819032,7995,2170,False,True,https://pbs.twimg.com/profile_images/139246535...,https://pbs.twimg.com/profile_banners/21447363...
3,1427667012105371652,1427667012105371652,2021-08-17 16:21:55,2021-08-17 18:21:55,200,,What have we become 😔😂 Toddler Cites Freedom ...,en,[],[],...,11420,,235,108819032,7995,2170,False,True,https://pbs.twimg.com/profile_images/139246535...,https://pbs.twimg.com/profile_banners/21447363...
4,1427497703596990467,1427497703596990467,2021-08-17 05:09:08,2021-08-17 07:09:08,200,,The tech giants that refuse to massively addre...,en,[],[],...,11420,,235,108819032,7995,2170,False,True,https://pbs.twimg.com/profile_images/139246535...,https://pbs.twimg.com/profile_banners/21447363...
5,1426598917471305735,1426598917471305735,2021-08-14 17:37:41,2021-08-14 19:37:41,200,,RT @peterdaou: An Italian town hit 124 degrees...,en,['climateemergency'],[],...,11420,,235,108819032,7995,2170,False,True,https://pbs.twimg.com/profile_images/139246535...,https://pbs.twimg.com/profile_banners/21447363...
6,1425588921233133572,1425588921233133572,2021-08-11 22:44:19,2021-08-12 00:44:19,200,,Thank you @MTV @vmas! ⚔️💓 https://t.co/iyo2KW...,en,[],[],...,9519,,119314,83675119,2310,1795,False,True,https://pbs.twimg.com/profile_images/142258922...,https://pbs.twimg.com/profile_banners/14230524...
7,1427927723356430337,1427904450136510464,2021-08-18 09:37:53,2021-08-18 11:37:53,200,,@Jefflez @LeroyAhBen Love you !!!,en,[],[],...,10252,Citizen of the World Dahhhling,188,21612341,4346,2504,False,True,https://pbs.twimg.com/profile_images/134656973...,https://pbs.twimg.com/profile_banners/19248106...
8,1427806400919580672,1427806400919580672,2021-08-18 01:35:47,2021-08-18 03:35:47,200,,Why would anyone be shocked that I’m drinking ...,en,[],[],...,10252,Citizen of the World Dahhhling,188,21612341,4346,2504,False,True,https://pbs.twimg.com/profile_images/134656973...,https://pbs.twimg.com/profile_banners/19248106...
9,1427758371873202177,1427758371873202177,2021-08-17 22:24:56,2021-08-18 00:24:56,200,,RT @mrtimchan: Wrote about @MariahCarey's new ...,en,[],[],...,10252,Citizen of the World Dahhhling,188,21612341,4346,2504,False,True,https://pbs.twimg.com/profile_images/134656973...,https://pbs.twimg.com/profile_banners/19248106...


**The dataset contains more than 40k records and 55 columns.** Each row represents a tweet and includes additional information about the user who posted it. The data was collected from 2021/08/11 to 2021/08/18.

Some of the data is not relevant for our problem: **the dataset contains rows that are not original tweets but retweets.** For this reason, we will filter these records out and keep only the original tweets.

To find out which tweets are the most relevant, we have to define a target variable that measures the popularity of a tweet. Since we are creating content for social media, we are looking for users who are willing to interact with the tweet, read more about its content and, therefore, read the article we share. **For this reason, we will use the number of comments (the `nreplies` column) as the target variable**, as it represents the users who read the tweet and wanted to share their thoughts and discuss it.

The dataset contains a large number of features, and some of them are either not relevant for our problem or redundant. After a brief analysis, we consider the following columns relevant for our problem and worth analyzing further:
- day: The day of the week (1 to 7) when the tweet was posted
- hour: The hour of the day (0 to 23) when the tweet was posted
- followers: The number of followers the user has
- conversation_id (vs id): If conversation_id is different from id, the tweet is a reply to another tweet. The most straightforward approach would be to delete the rows that are replies, since they are unlikely to be the most relevant tweets. However, it can also be interesting to keep them, as a popular reply to a tweet could mean high engagement.
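The comparison between `conversation_id` and `id` described in the list can be turned into an explicit flag. The sketch below does this on a made-up frame (`df_toy`). The column name `is_reply` is a hypothetical helper that does not exist in the dataset.

```python
import pandas as pd

# Toy rows (made-up ids): the second tweet belongs to a conversation it did not start.
df_toy = pd.DataFrame({
    "id": ["100", "101", "102"],
    "conversation_id": ["100", "100", "102"],
    "nreplies": [5, 1, 0],
})

# Hypothetical helper column: True when the tweet is a reply to another tweet.
df_toy["is_reply"] = df_toy["conversation_id"] != df_toy["id"]

# Compare the typical number of replies between the two groups.
print(df_toy.groupby("is_reply")["nreplies"].median())
```

Keeping the flag as a feature, instead of deleting the replies, leaves both options open: the rows can still be filtered later with `df_toy[~df_toy["is_reply"]]`.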
    
TODO: To further analyze the data, we will plot some graphs to understand the distribution of the data and the relationships between the variables.

## Data cleansing

After defining the target variable and the most relevant features, we will clean the data to remove any irrelevant columns and rows as we have defined in the previous section.

In [34]:
# Drop tweets that are retweets
tweets_users = tweets_users[tweets_users['retweet'] == False]

The variable `followers` has a large number of outliers. Even though the data was extracted from popular Twitter accounts, some accounts still have a massive number of followers compared with the rest. We will clip the values to limit the effect of these outliers and make the analysis easier.

In [33]:
tweets_users['followers'].describe()

count    3.468500e+04
mean     1.053939e+07
std      1.362222e+07
min      1.070000e+02
25%      3.039465e+06
50%      5.027648e+06
75%      1.167850e+07
max      1.299093e+08
Name: followers, dtype: float64
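The clipping step mentioned above could look like the following sketch. It uses a made-up series (`followers_toy`), and the 75th percentile is an arbitrary threshold chosen only for illustration. The appropriate cap for the real data would need to be decided from its distribution.

```python
import pandas as pd

# Toy follower counts (made up); the last account dwarfs the others.
followers_toy = pd.Series([1_000, 2_000, 3_000, 4_000, 1_000_000], name="followers")

# Assumption: cap at the 75th percentile, purely as an example threshold.
upper = followers_toy.quantile(0.75)

# Values above the cap are replaced by the cap; no rows are removed.
followers_clipped = followers_toy.clip(upper=upper)

print(upper)
print(followers_clipped.tolist())
```

Clipping keeps every tweet in the dataset and only limits the extreme values, which is different from dropping the rows of the largest accounts.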