# Exploring Tweets of Top European Football (Soccer) Clubs

Football is huge, and Europeans take their Football seriously. Twitter is all about what is happening now, and sports is an area that reflects this whenever there is action.  
This is a quick exploration of a dataset containing the last 3,200 tweets of the top seven European clubs (as well as their sub accounts in different languages), for a total of forty-two accounts and around 131k tweets.  

## Outline

* [Identifying the Clubs](#identify)
* [Getting the Data](#get_data)
* [Tweet Activity](#tweet_activity)
* [Hashtags](#hashtags)
* [Mentions](#mentions)
* [Emoji](#emoji)
* [Currency Symbols](#currency)
* [Intense Words](#intense)
* [Questions](#questions)
* [Exclamations](#exclamations)
* [Words' Effect on Engagement](#words_effect)


In [None]:
!pip install advertools==0.7.4 plotly==4.0.0

In [None]:
from IPython.display import clear_output
clear_output()
import advertools as adv
import pandas as pd
pd.options.display.max_columns = None
import plotly.graph_objects as go
import plotly

print('Package     Version')
print('=' * 20)
for pack in [pd, adv, plotly]:
    print(f'{pack.__name__:<10}', ':', pack.__version__)

<a id='identify'></a>
## Identifying the Clubs  
[Wikipedia](https://en.wikipedia.org/wiki/List_of_UEFA_club_competition_winners) provides a list of the top clubs, based on the number of tournaments that they have won. We can easily download the table with the `pandas.read_html` function, and then save it to CSV. This table was imported in July 2019.

In [None]:
# clubs = pd.read_html('https://en.wikipedia.org/wiki/List_of_UEFA_club_competition_winners')[1]
# clubs.to_csv('clubs.csv', index=False)

In [None]:
clubs = pd.read_csv('../input/clubs.csv')
clubs.head(7)

I manually got the Twitter handles of each club by searching for them on Twitter. Here are the top seven. 

In [None]:
handles = [
    'realmadrid',
    'acmilan',
    'FCBarcelona',
    'LFC',
    'juventusfc',
    'FCBayern',
    'AFCAjax',
]

Later, we will need to map accounts to the club names, so I've created this dictionary, mapping Twitter handles to the club name. 

In [None]:
handles_clubs = dict(zip(handles, clubs['Club'][:7]))
handles_clubs

<a id='get_data'></a>
## Getting the Data

To be able to send queries and receive responses from the Twitter API, you will need to do the following:

* [Apply for access as a developer:](https://developer.twitter.com/en/apply-for-access) Once approved, you will then need to get your account credentials.  
* [Create an app:](https://developer.twitter.com/en/apps) So you will be able to get the app's credentials.  
* Get credentials by clicking on "Details" and then "Keys and tokens": You should see your keys where they are clearly labeled: API key; API secret key; Access token; and Access token secret.

Then, with the following dictionary that holds the credentials, we can set our auth parameters as follows:

In [None]:
auth_params = {
    'app_key': 'YOUR_APP_KEY',
    'app_secret': 'YOUR_APP_SECRET',
    'oauth_token': 'YOUR_OAUTH_TOKEN',
    'oauth_token_secret': 'YOUR_OAUTH_TOKEN_SECRET',
}

adv.twitter.set_auth_params(**auth_params)

The first step is to request the users from Twitter. The `advertools.twitter.lookup_user` function returns a `user` object that contains metadata about that particular user: number of followers, number of tweets, creation date, etc.  
This is a simple loop that gets all requested users, concatenates them into one DataFrame, and saves them to a CSV file.

In [None]:
# user_dfs = []

# for club in handles:
#     df = adv.twitter.lookup_user(screen_name=club, tweet_mode='extended')
#     user_dfs.append(df)

# user_dfs = pd.concat(user_dfs, sort=False)
# user_dfs.to_csv('user_dfs.csv', index=False)

In [None]:
user_dfs = pd.read_csv('../input/user_dfs.csv')
user_dfs.head(3)

In [None]:
(user_dfs
 .sort_values('followers_count', ascending=False)
 [['created_at', 'screen_name', 'followers_count', 'statuses_count']]
 .style.format({'followers_count': '{:,}','statuses_count':  '{:,}'}))

It's clear that these are massive accounts. Each one of them can be considered as a major media outlet. 

## Getting the Sub-accounts
Usually on Twitter profiles, users list other accounts that they are affiliated with. In this case the clubs have links (as mentions) to all the other accounts they have. Sub-accounts are typically for other languages, but not all of them.  
`advertools.extract_mentions` can get us the mentions in any text list, together with some summary statistics about those mentions.  
Here we extract the mentions, and then put them together with the main accounts in one list. 

In [None]:
club_handles = user_dfs['screen_name'].tolist()
mentioned_handles = adv.extract_mentions(user_dfs['description'])['mentions_flat']
mentioned_handles = [m.replace('@', '') for m in mentioned_handles]
all_handles = sorted(club_handles + mentioned_handles)
print('number of accounts:', len(all_handles))
print('sample:')
all_handles[:5] + all_handles[-5:]
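To make the extraction step more concrete, here is a minimal sketch using only the standard library. The regular expression is a simplified assumption (a handle is `@` followed by up to 15 letters, digits, or underscores), not the pattern that `advertools` uses, and the descriptions are made up.

```python
import re

# Simplified assumption: a handle is "@" followed by 1-15 word characters.
MENTION = re.compile(r'@(\w{1,15})')

# Hypothetical profile descriptions, for illustration only.
descriptions = [
    'Official account. English: @example_en Arabic: @example_ar',
    'Welcome! Follow @sample_es for Spanish updates.',
    'No other accounts listed here.',
]

# One list of handles per description, mirroring a per-row result.
mentions = [MENTION.findall(text) for text in descriptions]

# A single flat list; the capture group already drops the "@".
mentions_flat = [handle for row in mentions for handle in row]
print(mentions_flat)
```

Keeping both the per-row lists and the flat list is handy: the first tells you which main account listed which sub-accounts, and the second is what you loop over when requesting timelines.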

Then for each account we simply request the latest tweets. Twitter provides at most the latest 3,200 tweets per account. All you have to do with the `advertools.twitter.get_user_timeline` function is provide the screen name, the number of tweets requested, and the tweet mode (make sure it is set to "extended"; otherwise, you will get only up to 140 characters of each tweet).

In [None]:
# clubs_tweets_dfs = []
# for acct in all_handles:
#     df = adv.twitter.get_user_timeline(screen_name=acct, count=3500, tweet_mode='extended')
#     clubs_tweets_dfs.append(df)
# (pd.concat(clubs_tweets_dfs, sort=False, ignore_index=True)
#  .to_csv('clubs_tweets.csv', index=False))

In [None]:
club_tweets = pd.read_csv('../input/clubs_tweets.csv', parse_dates=['tweet_created_at', 'user_created_at'],
                          low_memory=False)
print(club_tweets.shape)
club_tweets.head(2)

The original size of this DataFrame is around 66MB, and there are many duplicated values. A good strategy is to convert the columns that have a few values repeated many times to the `category` data type. This way they are represented as integers, and take up much less memory.  

The following code identifies the columns where the number of unique values is less than 250 (out of 131k), which means there are many repetitions.  
After that, the names in `category_cols` become keys in a dictionary whose values are all 'category'.  
Once this is done, we can simply change the data types by running `astype` on the DataFrame, and passing our dictionary as the argument.  
This reduced the size of the DataFrame to almost half its original size.

In [None]:
category_cols = club_tweets.columns[club_tweets.nunique().lt(250)] 
dtypes = dict.fromkeys(category_cols, 'category')  # every selected column maps to 'category'
club_tweets = club_tweets.astype(dtypes)
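The idea behind the `category` type can be sketched in plain Python: store each distinct value once, and keep only a small integer code per row. This illustrates the concept with made-up values; it is not a description of how pandas implements it internally.

```python
# Hypothetical column with few distinct values repeated many times.
langs = ['en', 'es', 'en', 'ar', 'en', 'es'] * 1000

# Each distinct value is stored once, in first-seen order.
categories = list(dict.fromkeys(langs))
code_of = {value: code for code, value in enumerate(categories)}

# The column itself becomes a sequence of small integers.
codes = [code_of[value] for value in langs]

# Decoding restores the original column exactly.
decoded = [categories[code] for code in codes]
print(categories, codes[:6])
```

The saving comes from the repetition: the more rows share the same few values, the more a short lookup table plus integer codes pays off.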

Because we have several accounts that belong to the same club, I think it would be useful to map the handles to their respective clubs, and add them as a column to our DataFrame.   
We first create a dictionary with the keys being the `screen_name`s of our `user_dfs` DataFrame, and the values as the extracted mentions from their descriptions. As a result we can see a list of sub-accounts corresponding to each main account.


In [None]:
main_sub_accts = dict(zip(user_dfs['screen_name'].values, 
                          adv.extract_mentions(user_dfs['description'])['mentions'])) 
for acct, subacct in main_sub_accts.items():
    print(acct, ':',  *subacct, sep=' ')
    print()

This mapping shows the main user account (not club name), mapped to its sub-accounts. What we need is to invert this mapping (have the handles as keys, and club names as the values). 
We will also need to use a dictionary that maps the main Twitter handles to their respective club names to do the mapping.  
The `defaultdict` data structure is perfect for this kind of task, and the following code achieves this. 

In [None]:
from collections import defaultdict
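dd = defaultdict()  # no default factory given, so missing keys raise KeyError as in a plain dict

for k, v in main_sub_accts.items():
    # map the main account itself, even if it lists no sub-accounts
    dd[k.lower()] = handles_clubs[k]
    for val in v:
        dd[val.replace('@', '').lower()] = handles_clubs[k]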
dd
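The inversion can be sketched on its own with made-up data. All handles and club names below are hypothetical; the point is only to show the shape of the two input dictionaries and of the result.

```python
# Hypothetical handles and club names, for illustration only.
main_to_subs = {
    'ClubOne': ['@ClubOne_en', '@ClubOne_es'],
    'ClubTwo': ['@clubtwo_ar'],
    'ClubThree': [],  # a main account that lists no sub-accounts
}
main_to_club = {'ClubOne': 'Club One FC', 'ClubTwo': 'Club Two FC',
                'ClubThree': 'Club Three FC'}

handle_to_club = {}
for main, subs in main_to_subs.items():
    club = main_to_club[main]
    # Map the main account first so it is included even with no sub-accounts.
    handle_to_club[main.lower()] = club
    for sub in subs:
        handle_to_club[sub.lstrip('@').lower()] = club

print(handle_to_club)
```

Lower-casing every key means the lookup works regardless of how a handle is capitalized in the tweets data.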

Now we can simply use the `map` method on `club_tweets['user_screen_name']` to achieve this. Note that I'm using the names in lower case to avoid mismatches caused by different capitalization.

In [None]:
club_tweets['club_name'] = club_tweets['user_screen_name'].str.lower().map(dd)

As a quick sanity check we can see if the values are correctly mapped. The following line can be run several times, to make sure things look OK. 

In [None]:
club_tweets[['user_screen_name', 'club_name']].sample(10)

<a id='tweet_activity'></a>
## Tweet Activity
The following table shows us how frequently each club tweets, by showing the minimum and maximum values of the `tweet_created_at` column, that is, the earliest and latest tweets that we have. The `date_range` column shows the time difference to give us an idea of how long a period we are dealing with. The top accounts tweeted more than three thousand times in less than four months, while others' tweets go back to 2017. 

In [None]:
(club_tweets
 [['user_screen_name', 'tweet_created_at', 'club_name', 'user_statuses_count']]
 .groupby(['club_name', 'user_screen_name'])
 .agg({'tweet_created_at': ['min', 'max'], 'user_statuses_count': 'count'})
 .assign(date_range=lambda df: df[('tweet_created_at', 'max')] - 
         df[('tweet_created_at', 'min')])
 .sort_values('date_range'))

Let's dig a little deeper, and see how many tweets have been tweeted every week for the whole dataset.  
When you set a datetime column as an index with `set_index` the DataFrame becomes a time series, and you can then run special functions for that. Here we run `resample` and set the frequency as 'W' to get a weekly frequency. Then we get the count for `tweet_full_text`.
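The weekly binning described above can be sketched with the standard library alone. This sketch assumes weeks that end on Sunday, which is what I understand the `'W'` frequency to mean in pandas; the timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical tweet timestamps, for illustration only.
timestamps = [
    datetime(2019, 7, 1, 9, 30),   # Monday
    datetime(2019, 7, 3, 18, 0),   # Wednesday
    datetime(2019, 7, 7, 21, 15),  # Sunday, still the same week
    datetime(2019, 7, 8, 8, 0),    # Monday of the following week
]

def week_ending(ts):
    """Return the date of the Sunday that closes the week containing ts."""
    # weekday() is 0 for Monday and 6 for Sunday.
    return ts.date() + timedelta(days=6 - ts.weekday())

weekly_counts = Counter(week_ending(ts) for ts in timestamps)
for week, count in sorted(weekly_counts.items()):
    print(week, count)
```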

In [None]:
weekly_count = (club_tweets
                .set_index('tweet_created_at')
                .resample('W')['tweet_full_text']
                .count())
weekly_count.head()

In [None]:
fig = go.Figure()
fig.add_bar(x=weekly_count.index, y=weekly_count.values)
fig.layout.title = 'Number of Weekly Tweets for Top European Football Clubs\' Twitter Accounts'
fig.layout.paper_bgcolor = '#E5ECF6'
fig.show()

Did those clubs just discover Twitter?!  
Or is it the case that some of them tweeted a lot lately, and some tweet less often to get us this distribution? Keep in mind that they almost all have the same number of tweets.  
Splitting by club might give us a better picture. Remember that club does not mean Twitter account, as they each have an average of 6-7 accounts.

In [None]:
from plotly.subplots import make_subplots
club_names = list(club_tweets['club_name'].unique())
fig = make_subplots(rows=7, cols=1, x_title='Week', shared_xaxes=True,
                    y_title='Number of Tweets', subplot_titles=club_names)
for i, club in enumerate(club_names):
    weekly = (club_tweets[club_tweets['club_name']==club]
              .set_index('tweet_created_at').resample('W')['tweet_full_text'].count())
    fig.add_bar(x=weekly.index, y=weekly.values,
                showlegend=False,
                marker={'line': {'color': '#000000'}},
                row=i+1, col=1)
fig.layout.title = 'Number of Weekly Tweets by Club (last 3,200 tweets)'
fig.layout.paper_bgcolor = '#E5ECF6'
fig.layout.height = 750
fig

This gives a better view of what is happening, and I think it would be better to dig deeper into each account and see if there are other patterns.  
It could be that the use of the other accounts started recently? This can easily be explored by changing some parameters in the previous visualization.  
Let's see what languages are used the most. The `tweet_lang` column shows this, but it is not 100% accurate, and in some cases you have 'und' for undefined values. 

In [None]:
(club_tweets['tweet_lang']
 .value_counts()
 .to_frame()
 .assign(perc=lambda df: df['tweet_lang'].div(df['tweet_lang'].sum()),
         cum_perc=lambda df: df['perc'].cumsum())
 .head(10)
 .style.format({'tweet_lang': '{:,}', 'perc': '{:.1%}', 'cum_perc': '{:.1%}'}))
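The share and cumulative share columns in the table above are simple to compute by hand. A minimal sketch with made-up language codes, using only the standard library:

```python
from collections import Counter
from itertools import accumulate

# Hypothetical language codes, one per tweet.
langs = ['en'] * 5 + ['ar'] * 3 + ['es'] * 2

counts = Counter(langs).most_common()  # sorted from most to least frequent
total = sum(count for _, count in counts)

shares = [count / total for _, count in counts]
cumulative = list(accumulate(shares))  # running total of the shares

for (lang, count), share, cum in zip(counts, shares, cumulative):
    print(f'{lang}: {count} {share:.1%} {cum:.1%}')
```

The cumulative column answers questions like "how many languages cover 90% of the tweets?" at a glance.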

English is not surprising, but it's really interesting that Arabic is the second most used language.  
It would be interesting to figure out why. It might be because the Arabic script is unambiguous, whereas the Latin script is shared by many languages, so it's not always easy for Twitter's algorithms to detect whether a tweet is in Spanish or Italian, for example.  
As an example, what language is this tweet in?

>15. GOOOOOOOOOOOAAAAAAAAAAAAAAAAALLLLLL!!!!!  
DONNNNNNNNNNN!  #UCL #totaja https://t.co/B6N0Xx7iDk


Twitter is also blocked in China.

## Tweets by Weekday
Is there a pattern of tweeting more or fewer tweets on weekends vs weekdays? 

In [None]:
# .dt.day_name() replaces .dt.weekday_name, which was removed in pandas 1.0
tweets_by_wkday = (club_tweets
                   .groupby(club_tweets['tweet_created_at']
                            .dt.day_name())['tweet_full_text']
                   .count().to_frame().sort_values('tweet_full_text'))
fig = go.Figure()
fig.add_bar(y=tweets_by_wkday.index, x=tweets_by_wkday['tweet_full_text'], 
            orientation='h')
fig.layout.title = 'Number of Tweets per Day of Week'
fig.layout.xaxis.title = 'Number of Tweets'
fig.layout.paper_bgcolor = '#E5ECF6'
fig

Saturday and Sunday are clearly the big days. This makes sense as most matches happen on weekends.  

Time for some `plotly.express` magic!  
It's interesting to explore for each period which are the most engaging tweets, and how many there were.  

For example, here we can filter the tweets from the account `realmadrid`, then create a `month` column, and then explore on a monthly basis the retweet activity.  
We can also segment by language, and read the actual text of the tweet while mousing over the circles. All in one function call. Thank you Plotly! 

When moving from month to month, you will need to double click on the chart to get the circles displayed (this might be a bug). 

In [None]:
import plotly.express as px
real_madrid = club_tweets[club_tweets['user_screen_name']=='realmadrid'].copy()
real_madrid['month'] = [pd.Period(x, freq='M') for x in real_madrid['tweet_created_at']]
real_madrid.loc[:,'month'] = real_madrid['month'].astype('str')

fig = px.scatter(real_madrid[::-1], x='tweet_created_at',
                 y='tweet_retweet_count', 
                 title='@realmadrid Tweets - Monthly',
                 color='tweet_lang', opacity=0.6,
                 template='plotly_white',
                 animation_frame='month',
                 hover_data=['tweet_full_text'])
fig.layout.yaxis.title = 'Tweet Retweet Count'
fig.layout.xaxis.title = 'Tweet Creation Date'
fig.show()

<a id='hashtags'></a>
## Top Hashtags

In [None]:
hashtag_summary = adv.extract_hashtags(club_tweets['tweet_full_text'])
hashtag_summary['overview']

In [None]:
hashtag_summary['top_hashtags'][:15]

In [None]:
fig = go.FigureWidget()
fig.add_bar(x=[h[1] for h in hashtag_summary['top_hashtags'][:20][::-1]],
            y=[h[0] for h in hashtag_summary['top_hashtags'][:20][::-1]], orientation='h')
fig.layout.height = 800
fig.layout.title = 'Top Hashtags Used By All Clubs'
fig.layout.paper_bgcolor = '#E5ECF6'
fig.layout.xaxis.title = 'Number of times the hashtag was used'
fig.show()


Of course it's better to explore the top hashtags by account, and we will do this next.  
Here are the accounts ranked by the number of followers. 

In [None]:
(club_tweets
 .drop_duplicates('user_screen_name')
 .sort_values('user_followers_count', ascending=False)
 [['user_screen_name', 'user_followers_count']]
 .head(15)
 .reset_index(drop=True)
 .style.format({'user_followers_count': '{:,}'}))


And here the top five are extracted and assigned to `top_5`. 

In [None]:
top_5 = (club_tweets
         .drop_duplicates('user_screen_name')
         .sort_values('user_followers_count', ascending=False)
         ['user_screen_name']
         .head(5).tolist())
top_5

This code creates titles for the subplots that we will create to visualize the top hashtags.  
The hashtags will be counted twice for each account. Once using a normal count 'Absolute Freq' and once using a weighted count 'Wtd. Freq'. The weighted frequency takes into consideration the number of retweets a tweet got when it contained that particular hashtag. This is usually more interesting because some hashtags get used a lot but don't generate a lot of retweets and vice versa. The best is to get both counts and see how they compare. 

In [None]:
titles = []
for title in [['@' + club + ' Wtd. Freq', '@' + club + ' Absolute Freq'] for club in top_5]:
    titles.append(title[0])
    titles.append(title[1])
titles

Feel free to modify this visualization with more/fewer hashtags, other clubs or weighting by other criteria. 

In [None]:
fig = make_subplots(rows=5, cols=2, subplot_titles=titles)

for i, club in enumerate(top_5):
    df = club_tweets[club_tweets['user_screen_name']==club]
    hashtag_df = adv.word_frequency(df['tweet_full_text'], df['tweet_retweet_count'], 
                                    regex=adv.regex.HASHTAG_RAW)
    fig.add_bar(y=hashtag_df['word'][:7][::-1],
                x=hashtag_df['wtd_freq'][:7][::-1], orientation='h',
                row=i+1, col=1, showlegend=False,
                marker={'color': plotly.colors.DEFAULT_PLOTLY_COLORS[i]})
    fig.add_bar(y=hashtag_df.sort_values('abs_freq', ascending=False)['word'][:7][::-1], 
                x=hashtag_df.sort_values('abs_freq', ascending=False)['abs_freq'][:7][::-1],
                orientation='h', 
                row=i+1, col=2, showlegend=False,
                marker={'color': plotly.colors.DEFAULT_PLOTLY_COLORS[i]})

fig.layout.height = 1200
fig.layout.paper_bgcolor = '#E5ECF6'
fig.layout.title = ('<i>Top Hashtags by Twitter Account - Weighted by Number of Retweets</i><br>' +
                    '<b>Wtd. Freq:</b> number of hashtags times total retweets of tweets containing the hashtag<br>' +
                    '<b>Absolute Freq:</b> Simple count showing the number of times a hashtag was used<br>')
fig.layout.margin = {'t': 180, 'r': 10}
fig

As you can see above, the account @FCBarcelona used the hashtag #messi 349 times and #forçabarça 681 times (on the Absolute Freq side). However, looking at the weighted frequency, you will see that tweets containing #messi generated a total of 1.55M retweets.
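The difference between the two counts can be sketched with a few made-up tweets. This is a simplified assumption about the weighting (each use of a hashtag adds the tweet's retweet count), not a copy of what `advertools.word_frequency` does internally.

```python
import re
from collections import Counter

# Hypothetical (tweet text, retweet count) pairs, for illustration only.
tweets = [
    ('Match day! #matchday', 10),
    ('Training session #matchday', 20),
    ('What a goal! #legend', 500),
    ('Tickets on sale #matchday', 5),
]

abs_freq = Counter()  # how many times each hashtag was used
wtd_freq = Counter()  # retweets summed over the tweets that used it

for text, retweets in tweets:
    for tag in re.findall(r'#\w+', text.lower()):
        abs_freq[tag] += 1
        wtd_freq[tag] += retweets

print(abs_freq.most_common())
print(wtd_freq.most_common())
```

Here the hashtag used most often is not the one that collects the most retweets, which is exactly the kind of gap the two charts are meant to expose.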

<a id='mentions'></a>
## Top Mentions
We can do the same and see who are the accounts that are mentioned the most, and which accounts when mentioned generate the most retweets. Again you can weight by something else like the number of favorites for example. 

In [None]:
fig = make_subplots(rows=5, cols=2, subplot_titles=titles)
for i, club in enumerate(top_5):
    df = club_tweets[club_tweets['user_screen_name']==club]
    mention_df = adv.word_frequency(df['tweet_full_text'], df['tweet_retweet_count'], 
                                    regex=adv.regex.MENTION_RAW)
    fig.add_bar(y=mention_df['word'][:7][::-1],
                x=mention_df['wtd_freq'][:7][::-1], orientation='h',
                row=i+1, col=1, showlegend=False,
                marker={'color': plotly.colors.DEFAULT_PLOTLY_COLORS[i+5]})
    fig.add_bar(y=mention_df.sort_values('abs_freq', ascending=False)['word'][:7][::-1], 
                x=mention_df.sort_values('abs_freq', ascending=False)['abs_freq'][:7][::-1],
                orientation='h', 
                row=i+1, col=2, showlegend=False,
                marker={'color': plotly.colors.DEFAULT_PLOTLY_COLORS[i+5]})

fig.layout.height = 1200
fig.layout.paper_bgcolor = '#E5ECF6'
fig.layout.title = ('<i>Top Mentions by Twitter Account - Weighted by Number of Retweets</i><br>' +
                    '<b>Wtd. Freq:</b> number of mentions times total retweets of tweets containing the mention<br>' +
                    '<b>Absolute Freq:</b> Simple count showing the number of times a mention was used<br>')
fig.layout.margin = {'t': 180, 'r': 10}
fig

<a id='emoji'></a>
## Top Emoji

In [None]:
emoji_summary = adv.extract_emoji(club_tweets['tweet_full_text'])
emoji_summary['overview']

As you might expect these tweets are going to be rich with emoji. We have 2.55 emoji per tweet on average, with 1,509 unique emoji. That's almost half of all existing emoji! 

In [None]:
emoji_summary.keys()

<a id='currency'></a>
## Currencies in Tweets
Do any of the clubs use prices, money amounts, or anything related to money in their tweets?

In [None]:
currency_summary = adv.extract_currency(club_tweets['tweet_full_text'])
print(currency_summary.keys())
print()
currency_summary['overview']

Only 115 tweets contain anything related to money. We can take a look at the context in which they appeared with the `surrounding_text` key. It shows all tweets that contained any currency symbol, together with the twenty characters before and after it to get a good view of the context in which it was mentioned. 

In [None]:
[t for t in currency_summary['surrounding_text'] if t][:20]
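The "twenty characters before and after" idea can be sketched as follows. The symbol list is a deliberately small assumption (a handful of common currency symbols), and `surrounding_text` here is a hypothetical helper, not the library function.

```python
import re

# Assumption: a handful of common currency symbols, not the full Unicode set.
CURRENCY = re.compile(r'[$€£¥]')

def surrounding_text(text, width=20):
    """Return each currency symbol with up to `width` characters on either side."""
    snippets = []
    for match in CURRENCY.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end])
    return snippets

tweet = 'Season tickets are now available from €250 for all home matches'
print(surrounding_text(tweet))
print(surrounding_text('No prices in this tweet'))
```

Clamping `start` and `end` keeps the slice valid when a symbol appears near the beginning or end of a tweet.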

In [None]:
from collections import Counter
Counter(currency_summary['currency_symbols_flat'])


<a id='intense'></a>
## Intense Words
These are words that contain a character repeated three or more times (you can modify this constraint if you want). You know how people on social media sometimes say how much they "looooooooove" something? Let's see how intensely those clubs are trying to communicate. Note that intensity does not mean positive or negative. It's simply intense. 
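The "repeated three or more times" rule can be expressed as a regular expression with a back-reference. This is a minimal sketch of the idea under that assumption; the pattern the library uses may differ.

```python
import re

# Assumption: a word is "intense" if any character repeats 3+ times in a row.
# (\S) captures one character and \1{2,} requires at least two more copies of it.
INTENSE = re.compile(r'\S*(\S)\1{2,}\S*')

text = 'We looooove this team!!! What a goooal by the captain'
intense_words = [match.group(0) for match in INTENSE.finditer(text)]
print(intense_words)
```

Changing `{2,}` to `{3,}` would raise the threshold to four repeated characters, which is one way to modify the constraint.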

In [None]:
intensity_summary = adv.extract_intense_words(club_tweets['tweet_full_text'])
print(intensity_summary.keys())
print()
intensity_summary['overview']

More than 16% of the tweets have intense words. That looks like a lot of shouting. Remember, these are the official accounts of the clubs. Let's explore more. 

In [None]:
intensity_summary['top_intense_words'][:30]

So the majority are either emoji or shouting "GOOOOOOL" when a goal is scored. Let's quickly check how many "goooal" tweets we have. 

In [None]:
# summing the boolean mask counts the matching tweets
club_tweets['tweet_full_text'].str.contains('go+a?l', case=False).sum()

Lots of goals scored! This is 9,755 ÷ 131,234 = 7.4% of tweets. It might be interesting to explore these further and find out more. 
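One caveat about this count: the pattern `go+a?l` is loose. It also matches inside unrelated words such as "golden", and it misses celebrations with a stretched "a", like the "GOOOO...AAAA...LLLL" tweet quoted earlier. Here is a minimal sketch comparing it with a stricter variant; the stricter pattern is my own assumption about what should count as a goal shout, not a definitive rule.

```python
import re

loose = re.compile(r'go+a?l', re.IGNORECASE)
# Stricter: whole word only, any number of repeated o, a, and l.
strict = re.compile(r'\bgo+a*l+\b', re.IGNORECASE)

samples = ['GOOOOOL!', 'Goal for the home side', 'GOOOAAAALLL!!!',
           'A golden chance missed']

for text in samples:
    print(f'{text!r}: loose={bool(loose.search(text))}, '
          f'strict={bool(strict.search(text))}')
```

The stricter pattern has its own blind spots (it would skip a word like "golazo", for example), so the count is best read as an estimate either way.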

<a id='questions'></a>
## Questions in Tweets
Let's see if the clubs ask questions, and explore some statistics.

In [None]:
question_summary = adv.extract_questions(club_tweets['tweet_full_text'])
print(question_summary.keys())
print()
question_summary['overview']

Around 8% of tweets contain questions. Not massive, but not negligible either. 

In [None]:
[q for q in question_summary['question_text'] if q][:20]

You can further explore the engagement rates of those questions, and which questions got the most engagement/response. 

<a id='exclamations'></a>
## Exclamations!
Similar to intense words, it's interesting to see how many tweets end with an exclamation mark, and check some statistics for them.

In [None]:
exclamation_summary = adv.extract_exclamations(club_tweets['tweet_full_text'])
print(exclamation_summary.keys())
print()
exclamation_summary['overview']

Lots of emoji, lots of intense words, and 64% of tweets contain exclamation marks.

In [None]:
[x for x in exclamation_summary['exclamation_text'] if x][:10]

<a id='words_effect'></a>
## Analyzing the Effect of Words on Engagement

In [None]:
emoji_freq =  adv.word_frequency(club_tweets['tweet_full_text'], 
                                 club_tweets['user_followers_count'],
                                 regex=adv.emoji.EMOJI_RAW)
emoji_freq.head(20).style.format({'abs_freq': '{:,}', 'wtd_freq': '{:,}', 'rel_value': '{:,.0f}'})

The above table shows the top used emoji and their frequencies. Obviously, the same exercise can be done for other entities, or simply by checking the top words used.  
Nothing surprising about the top emoji here. You would expect a football, a trophy, and muscles to be in the top. But what are the red and blue dots? Are they useful? What do they mean?  
First, let's see if using those dots increases engagement.  
We can `describe` the DataFrame containing only the favorites and retweet counts to see the effect of the red dot.  
This code does this for tweets containing it, and the following code does the same for tweets _not_ containing it. 

In [None]:
print('tweets containing 🔴:')
(club_tweets
 [club_tweets['tweet_full_text'].str.contains('🔴')]
 .filter(regex='tweet_favorite_count|tweet_retweet_count')
 .describe()
 .style.format('{:,.2f}'))

In [None]:
print('tweets NOT containing 🔴:')
(club_tweets
 [~club_tweets['tweet_full_text'].str.contains('🔴')]
 .filter(regex='tweet_favorite_count|tweet_retweet_count')
 .describe()
 .style.format('{:,.2f}'))

The mean is much higher for tweets not containing the dot, but the median is slightly lower on retweets, and much lower for favorites (24 vs 25 and 103 vs 191 respectively). The standard deviation is 1,666 vs 944 for retweets, so we can see a big difference in the variance of both.  
Obviously, looking at all tweets is probably hiding something. Let's see which clubs use the red dot the most.

In [None]:
(club_tweets
 [club_tweets['tweet_full_text'].str.contains('🔴')]
 ['club_name'].value_counts()
 .reset_index().style.format({'club_name': '{:,}'}))

In [None]:
(club_tweets
 [club_tweets['tweet_full_text'].str.contains('🔴')]
 ['user_screen_name'].value_counts()
 .head(8)
 .reset_index().style.format({'user_screen_name': '{:,}'}))

So Barcelona is the one using them the most (looking at clubs and looking at `user_screen_name`s).  
So let's see if there is an effect for Barcelona tweets. 

In [None]:
print('Barcelona tweets containing 🔴:\n')
(club_tweets
 [club_tweets['tweet_full_text'].str.contains('🔴') & 
  (club_tweets['club_name'] == 'Barcelona')]
 .filter(regex='tweet_favorite_count|tweet_retweet_count')
 .describe()
 .style.format('{:,.2f}'))

In [None]:
print('Barcelona tweets NOT containing 🔴:\n')
(club_tweets
 [~club_tweets['tweet_full_text'].str.contains('🔴') & 
  (club_tweets['club_name'] == 'Barcelona')]
 .filter(regex='tweet_favorite_count|tweet_retweet_count')
 .describe()
 .style.format('{:,.2f}'))

Engagement values are higher for the tweets without the red dot on all measures, and for both favorites and retweets. 
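When reading `describe` tables like the ones above, it helps to look at both the mean and the median, because a single viral tweet can pull the mean far away from the typical value. A minimal sketch with made-up numbers:

```python
from statistics import mean, median

# Hypothetical (tweet text, retweet count) pairs, for illustration only.
tweets = [
    ('🔴 Line-up announced', 20),
    ('🔴 Half-time', 30),
    ('🔴 Full-time', 40),
    ('Training photos', 10),
    ('New signing confirmed', 50),
    ('Title-winning goal', 3000),  # one viral tweet
]

with_dot = [rt for text, rt in tweets if '🔴' in text]
without_dot = [rt for text, rt in tweets if '🔴' not in text]

for label, values in [('with', with_dot), ('without', without_dot)]:
    print(label, 'mean:', mean(values), 'median:', median(values))
```

In this toy data the two groups have similar medians while the means differ by a wide margin, which is why a comparison based on the mean alone can mislead.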



>Dear Barcelona Social Media Manager,  
Please note that using the red dot 14,982 times in your latest tweets has not caused the effect we think you are trying to achieve. It seems engagement is higher on the tweets that don't contain 🔴 than on those that do. Please refer to the tables above.    
On a subjective level, going through some of those tweets, you seem to be trying to use the dot together with the blue one to show the jersey colors of Barcelona?  
They are not the same colors... Up to you! 

In [None]:
pd.options.display.max_colwidth = 280

for x in (club_tweets[(club_tweets['club_name']=='Barcelona') & 
                      club_tweets['tweet_full_text'].str.contains('🔴')]
          [['tweet_full_text']][:10].values):
    print(*x)
    print('='*30, '\n')

Of course, it doesn't always make sense to simply look at engagement rates and decide whether using a certain emoji or word is good or not. For example, the word "goal" would almost certainly outperform any other word on engagement, simply because this is what people are waiting for, and it is by definition the highlight of the match.  
In this case however, because the dots are essentially meaningless, and if I understood correctly, they are mainly for attracting people's attention, then I think it makes sense to look at the numbers in this way. 


## Further Information/Resources
- [Documentation for the `advertools.twitter` module](https://www.kaggle.com/eliasdabbas/twitter-in-a-dataframe)
- [A tutorial on some concepts in analyzing text, especially for social media](https://www.semrush.com/blog/text-analysis-for-online-marketers/)
- [Extracting entities from social media posts](https://www.kaggle.com/eliasdabbas/extract-entities-from-social-media-posts)