<h1> Analysing Elon Musk's tweeting style 2015-2020

I created this notebook purely as an exercise, and I hope you find it interesting and learn something about Elon Musk's tweets from 2015 to 2020. I recently started the Data Science Nanodegree online course with Udacity, and this post is part of an exercise that I need to complete to pass the first module. The project consists of two parts: 
- exploratory analysis of a dataset of your choice
- Medium post with the results from the above (to be found here: https://medium.com/p/176a8279cefb)

Enjoy the journey of exploring Elon's tweets, using the Kaggle dataset (https://www.kaggle.com/vidyapb/elon-musk-tweets-2015-to-2020) I chose for the exercise.
I will try to answer these three questions:
* What's Elon Musk's tweeting style?
* What tweeting pattern yields the highest engagement?
* What are the most popular words used in his tweets?

<h2> Import all necessary libraries

In [None]:
import os
import re
import nltk
import altair as alt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from datetime import date, datetime
from pandas_profiling import ProfileReport
from tabulate import tabulate
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
from scipy import stats
from nltk.corpus import stopwords
from collections import Counter

In [None]:
#List all the libraries used in the project
#!pip3 freeze

<h2> Create the file name and path references

In [None]:
file_path_name = '../input/elon-musk-tweets-2015-to-2020/elonmusk.csv'

In [None]:
def open_file(file_path_name):
    """
    This function opens the csv file and creates the dataframe
    :param file_path_name: Name of the path and the input file
    :return: The pandas dataframe
    """
    return pd.read_csv(file_path_name, index_col=[0])

<h3> Check the file content

In [None]:
print(open_file(file_path_name).head())

<h2> Create a dataframe profile report in HTML (Optional)

In [None]:
# def profile_dataframe(file_path_name):
#     """
#     This function is taking the pandas dataframe and creating the profile report in the html format.
#     :param file_path_name: Name of the path and the input file
#     :return: The dataframe profile in the html format
#     """
#     today = date.today()
#     df = open_file(file_path_name)
#     profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
#     profile.to_file("report_{0}.html".format(today))


# profile_dataframe(file_path_name)

<h2> Prepare the code to perform the dataframe cleaning

In [None]:
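def drop_columns_with_constant_values(df):
    """Return the dataframe without the columns that hold a single constant value."""
    return df.drop(columns=list(df.columns[df.nunique() <= 1]))


def check_the_head_of_constant_columns(df):
    """Print the first 20 rows of the columns that hold a single constant value."""
    constant_columns = df.columns[df.nunique() <= 1]
    print(tabulate(df[constant_columns].head(20), headers='keys', tablefmt='psql'))


columns_to_drop = ['hashtags', 'cashtags', 'link', 'quote_url', 'urls', 'created_at']


def drop_redundant_columns(df, columns_to_drop):
    """Return the dataframe without the columns listed in columns_to_drop."""
    return df.drop(columns=columns_to_drop)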

<h2> Add new count columns

In [None]:
def add_reply_to_count(df):
    """
    This function takes the pandas dataframe and adds the new column with the count of replies.
    :param df: The input dataframe.
    :return: The dataframe with added new reply to count column.
    """
    reply_to_count_values = []
    for i, content in df['reply_to'].items():
        reply_to_count_values.append((int(content.count("{")) - 1))
    df['reply_to_count'] = reply_to_count_values
    return df


def add_mentions_count(df):
    """
    This function takes the pandas dataframe and adds the new column with the count of mentions.
    :param df: The input dataframe.
    :return: The dataframe with added new mentions count column.
    """
    new_values = []
    for i, content in df['mentions'].items():
        new_values.append(int(content.count("'") / 2))
    df['mentions_count'] = new_values
    return df


def add_photos_count(df):
    """
    This function takes the pandas dataframe and adds the new column with the count of photos.
    :param df: The input dataframe.
    :return: The dataframe with added new photos count column.
    """
    new_values = []
    for i, content in df['photos'].items():
        new_values.append(int(content.count("https")))
    df['photos_count'] = new_values
    return df


def add_weekday(df):
    """
    This function takes the pandas dataframe, extracts the weekday and adds it to the new column.
    :param df: The input dataframe.
    :return: The dataframe with added new weekday column.
    """
    weekday = []
    for i, content in df['date'].items():
        year, month, day = map(int, content.split('-'))
        d = date(year, month, day)
        weekday.append(d.weekday())
    df['weekday'] = weekday
    return df


def convert_to_datetime(df):
    """
    This function takes the pandas dataframe, joins the date and time columns into one string and adds it to the new column.
    :param df: The input dataframe.
    :return: The dataframe with added new datetime column (stored as a string and parsed later).
    """
    df['datetime'] = (df['date'] + " " + df['time']).astype('string')
    return df
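
The count helpers above infer the counts from punctuation (quotes, braces, or the `https` prefix). That works as long as the cell text contains no stray characters of the same kind. As a hedged alternative, and assuming the `mentions` and `photos` cells are stringified Python lists (an assumption about the CSV, not something checked here), the text can be parsed before counting. `count_items` is a hypothetical helper name, and the sample cells are made up.

```python
import ast


def count_items(cell):
    """Return the number of elements in a list stored as text, e.g. "['a', 'b']"."""
    # literal_eval only parses Python literals; it never executes code.
    return len(ast.literal_eval(cell))


# Made-up sample cells in the assumed format, not rows from the dataset.
mentions_cell = "['tesla', 'spacex']"
photos_cell = "['https://example.com/a.jpg']"
empty_cell = "[]"

print(count_items(mentions_cell))  # 2
print(count_items(photos_cell))    # 1
print(count_items(empty_cell))     # 0
```

If a cell turns out not to be a valid Python literal, `ast.literal_eval` raises an error instead of silently returning a wrong count, which makes format problems visible early.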

<h2> Add new date and time related columns

In [None]:
def extract_hour_minute(df):
    """
    This function takes the pandas dataframe, extracts the time related values and adds it to the new columns.
    :param df: The input dataframe.
    :return: The dataframe with added new year, month, hour, and minute columns.
    """
    year_col = []
    month_col = []
    hour_col = []
    minute_col = []
    for i, content in df['datetime'].items():
        t1 = datetime.strptime(content, '%Y-%m-%d %H:%M:%S')
        year_col.append(t1.year)
        month_col.append(t1.month)
        hour_col.append(t1.hour)
        minute_col.append(t1.minute)
    df['year'] = year_col
    df['month'] = month_col
    df['hour'] = hour_col
    df['minute'] = minute_col
    return df
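
The `weekday` column created earlier holds integers, which is easy to misread in the charts further down. The sketch below uses only the standard library to show how one timestamp string splits into the parts used in this notebook, and which day each weekday number stands for. `describe_timestamp` and `WEEKDAY_NAMES` are hypothetical names, and the timestamp is made up.

```python
from datetime import datetime

# Python's weekday() numbers days from Monday = 0 to Sunday = 6.
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def describe_timestamp(text):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into the parts used in this notebook."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return {
        "year": parsed.year,
        "month": parsed.month,
        "hour": parsed.hour,
        "minute": parsed.minute,
        "weekday": parsed.weekday(),
        "weekday_name": WEEKDAY_NAMES[parsed.weekday()],
    }


# A made-up timestamp, not taken from the dataset.
print(describe_timestamp("2020-03-14 15:09:26"))
```

In pandas, the same parts should be available without a Python loop through `pd.to_datetime` and the `.dt` accessor (for example `.dt.hour` and `.dt.weekday`).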

<h3> Bring it all together

In [None]:
def set_index(df, index_column):
    """
    This function takes the pandas dataframe, the index column name and creates a new index.
    :param df: The input dataframe.
    :param index_column: The list of the name of an index column.
    :return: The dataframe with a new index.
    """
    return df.set_index(index_column, drop=True, inplace=False, verify_integrity=True)


def clean_dataframe(df, columns_to_drop):
    """
    This function takes the pandas dataframe, a list of columns to drop and removes them.
    :param df: The input dataframe.
    :param columns_to_drop: The list of the columns to drop.
    :return: The dataframe without dropped columns.
    """
    df = drop_redundant_columns(df, columns_to_drop)
    return df


def transform_dataframe(df):
    """
    This function takes the pandas dataframe, and performs the above operations.
    :param df: The input dataframe.
    :return: The transformed dataframe.
    """
    df = drop_columns_with_constant_values(df)
    add_mentions_count(df)
    add_weekday(df)
    add_reply_to_count(df)
    add_photos_count(df)
    convert_to_datetime(df)
    extract_hour_minute(df)
    df = drop_redundant_columns(df, ['photos', 'date', 'mentions', 'reply_to', 'reply_to_count'])
    return df

<h3> Open the file and check the datatypes

In [None]:
df = open_file(file_path_name)
new_df = clean_dataframe(df, columns_to_drop)
new_df = transform_dataframe(new_df)

print("\n----------------- DATA TYPES -------------------")
print(new_df.dtypes)
#Use the code below to pretty print the dataframe
# print(tabulate(new_df.loc[new_df['photos_count'] == 2], headers='keys')) #, tablefmt='psql'))

<h1> Text preprocessing

In [None]:
#Count the number of tweets
print(new_df.count())

#Normalize the tweets to be lowercase (new_df is the dataframe used in the rest of the analysis)
new_df['tweet'] = new_df['tweet'].str.lower()

In [None]:
#Drop duplicate tweets
new_df.drop_duplicates(subset=['tweet'], keep='first', inplace=True)
print(new_df.shape)
print(new_df.count())

In [None]:
#Remove the @users
def remove_users(tweet, pattern1, pattern2):
    """
    This function takes the tweet text and removes the text matching the regex patterns.
    :param tweet: The text to be searched for the patterns.
    :param pattern1: The first pattern to be removed.
    :param pattern2: The second pattern to be removed.
    :return: The text without the matched patterns.
    """
    tweet = re.sub(pattern1, '', tweet)
    tweet = re.sub(pattern2, '', tweet)
    return tweet

#Raw strings keep the \w escape intact for the regex engine
new_df['tidy_tweet'] = np.vectorize(remove_users)(new_df['tweet'], r"@ [\w]*", r"@[\w]*")
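
To see what the mention removal does to a single tweet, here is a small standalone sketch. It is a variant rather than a copy of the cell above: it uses one combined pattern and also collapses the leftover whitespace. `strip_mentions` and `MENTION_PATTERN` are hypothetical names, and the tweet is made up.

```python
import re

# "@", an optional whitespace character, then the word characters of the user name.
MENTION_PATTERN = re.compile(r"@\s?\w+")


def strip_mentions(tweet):
    """Remove @user mentions and collapse the whitespace they leave behind."""
    without_mentions = MENTION_PATTERN.sub("", tweet)
    return " ".join(without_mentions.split())


# A made-up tweet, not taken from the dataset.
print(strip_mentions("@nasa great launch with @ spacex today"))
# great launch with today
```

Because this pattern requires at least one word character after the `@`, a lone `@` sign stays in the text, whereas the patterns in the cell above would remove it.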

<h1> Descriptive statistics

Once we've got the dataframe cleaned and ready, we can compute some basic statistics and start looking for answers to the questions posed.

In [None]:
#Count the number of words in each tweet
count = new_df['tweet'].str.split().str.len()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)


def word_count(df):
    """
    This function takes the dataframe and adds a new column with the number of words.
    :param df: The dataframe to be transformed.
    :return: The transformed dataframe.
    """
    words_count = []
    for i, content in df['tweet'].items():
        words_count.append(len(content.split()))
    df['word_count'] = words_count
    return df

new_df = word_count(new_df)

print("Total number of words: ", count.sum(), "words")

In [None]:
print("Average number of words per tweet: ", round(count.mean(),2), "words")
print("Max number of words per tweet: ", count.max(), "words")
print("Min number of words per tweet: ", count.min(), "words")

In [None]:
new_df['tweet_length'] = new_df['tweet'].str.len()

print("Total length of a dataset: ", new_df.tweet_length.sum(), "characters")
print("Average length of a tweet: ", round(new_df.tweet_length.mean(),0), "characters")

In [None]:
plt.subplots(figsize=(10,8))
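#Correlate only the numeric and boolean columns; recent pandas versions raise an error on text columns
sns.heatmap(new_df.select_dtypes(include=['number', 'bool']).corr(), annot=True, linewidths=1.5, fmt=".2f");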

<b> From the correlation matrix above we can observe a few interesting correlations. The replies count and the retweets count are strongly positively correlated with the likes count (coefficients of 0.73 and 0.91 respectively). The mentions count, on the other hand, is weakly to moderately negatively correlated with the likes count (-0.23): the more mentions, the fewer likes. Tweets with photos or videos also tend to collect more likes (0.29 and 0.10 respectively), although a correlation alone does not show that attaching media causes the increase. The length of a tweet is slightly negatively correlated with the likes count.
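
For readers who want to see what the numbers in the heatmap measure: `DataFrame.corr` computes the Pearson correlation coefficient by default. The sketch below works it out by hand for two pairs of columns. The numbers are made up for illustration and are not values from the dataset, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from the means.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up numbers for illustration only, not values from the dataset.
retweets = [10, 20, 30, 40]
likes = [100, 210, 290, 400]
mentions = [3, 2, 1, 0]

print(round(pearson_r(retweets, likes), 2))  # close to +1: they rise together
print(round(pearson_r(mentions, likes), 2))  # close to -1: one falls as the other rises
```

The coefficient always lies between -1 and +1, so a value such as 0.91 signals a near-linear relationship, while -0.23 signals a weak tendency in the opposite direction.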

In [None]:
# print(new_df.columns)
X = new_df[['replies_count', 'retweets_count', 'mentions_count', 'weekday', 
            'photos_count', 'year', 'hour', 'minute', 'tweet_length','word_count'
           ]]
y = new_df[['likes_count']]

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

<b> The table above shows the P>|t| value for each feature, which indicates whether there is evidence that the feature is statistically significant. When the value is below roughly 0.05, we can conclude that the feature's coefficient differs from zero in a statistically significant way. Most of our features, except weekday, minute, and word count, are statistically significant predictors of the target feature (likes count) in this model. Read more here: https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics and here: https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression

<h1> The Analysis

<h1> What's Elon Musk's tweeting style?

In [None]:
alt.data_transformers.disable_max_rows()

alt.Chart(new_df, width=800, height=400
         ).mark_point(filled=True
                     ).encode(
    alt.X('hour:N', scale=alt.Scale(zero=True)),
    alt.Y('month:N', scale=alt.Scale(zero=True)),
    alt.Size('likes_count:Q'),
    #alt.Color('photos_count:N'),
    alt.OpacityValue(0.5),
    tooltip=['tweet:N', 'likes_count:Q', 'photos_count:N', 'year:N']
)

In [None]:
bars = alt.Chart(new_df).mark_bar().encode(
    y='month:O',
    x='count_tweets:Q',
    tooltip=['count_tweets:Q']
).transform_aggregate(
    # mean(count_tweets) cannot be computed in the same aggregate that creates count_tweets
    count_tweets='count(id)',
    #sum_likes='sum(likes_count)',
    groupby=["month"]
)

text = bars.mark_text(
    align='left',
    baseline='middle',
    dx=3  # Nudges text to right so it doesn't appear on top of the bar
).encode(
    text='count_tweets:Q'
)

(bars + text).properties(height=300)

In [None]:
bars = alt.Chart(new_df).mark_bar().encode(
    y='weekday:O',
    x='count_tweets:Q',
    tooltip=['count_tweets:Q']
).transform_aggregate(
    # mean(count_tweets) cannot be computed in the same aggregate that creates count_tweets
    count_tweets='count(id)',
    #sum_likes='sum(likes_count)',
    groupby=["weekday"]
)

text = bars.mark_text(
    align='left',
    baseline='middle',
    dx=3  # Nudges text to right so it doesn't appear on top of the bar
).encode(
    text='count_tweets:Q'
)

(bars + text).properties(height=300)

In [None]:
bars = alt.Chart(new_df).mark_bar().encode(
    y='hour:O',
    x='count_tweets:Q',
    tooltip=['count_tweets:Q']
).transform_aggregate(
    # mean(count_tweets) cannot be computed in the same aggregate that creates count_tweets
    count_tweets='count(id)',
    #sum_likes='sum(likes_count)',
    groupby=["hour"]
)

text = bars.mark_text(
    align='left',
    baseline='middle',
    dx=3  # Nudges text to right so it doesn't appear on top of the bar
).encode(
    text='count_tweets:Q'
)

(bars + text).properties(height=300)

<h1> What tweeting pattern yields the highest engagement?

In [None]:
# sns.set_theme(color_codes=True)
g = sns.lmplot(x='hour', y='likes_count', data=new_df,  # col='photos_count',
               aspect=2.5, robust=False,
               scatter_kws=dict(s=50, linewidths=.1, edgecolors='black'),
               order=2, ci=None
               )
plt.show()

<h3> The graph above helps us understand which posting times are most likely to yield the highest likes count. Tweets posted between 6-9 am and 7-9 pm yield the highest results. 

In [None]:
g = sns.FacetGrid(new_df, col="weekday", height=6, aspect=.5)
g.map(sns.barplot, "photos_count", "likes_count", order=[0, 1, 2])

<h3> We can see an interesting pattern in the figure above. Tweets with two photos collect more likes when posted on Monday or Tuesday than on other days. Tweets with one photo attached appear to gain more likes when posted from Wednesday to Sunday. Tweets with no photos collect the fewest likes on average.
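
By default, the height of each bar drawn by `sns.barplot` is the mean of `likes_count` within one group, here one combination of weekday and photos count. The sketch below reproduces that calculation with the standard library on a few made-up rows (not values from the dataset). `mean_likes_by_group` is a hypothetical helper name.

```python
from collections import defaultdict


def mean_likes_by_group(rows):
    """Average likes for every (weekday, photos_count) pair in rows."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of likes, number of tweets]
    for weekday, photos_count, likes in rows:
        totals[(weekday, photos_count)][0] += likes
        totals[(weekday, photos_count)][1] += 1
    return {group: likes_sum / n for group, (likes_sum, n) in totals.items()}


# Made-up rows (weekday, photos_count, likes_count), not values from the dataset.
rows = [
    (0, 0, 100), (0, 0, 300),
    (0, 2, 900),
    (3, 1, 400), (3, 1, 600),
]

for group, mean_likes in sorted(mean_likes_by_group(rows).items()):
    print(group, mean_likes)
```

A group with only a handful of tweets, like the single two-photo tweet in this toy data, can produce a tall bar from one popular post, so it is worth checking the group sizes before reading too much into the pattern.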

In [None]:
import altair as alt
alt.data_transformers.disable_max_rows()

base = alt.Chart(new_df, width=600, height=400).mark_point(filled=True).encode(
    x=alt.X('month:O'), y='likes_count:Q', tooltip=['tweet:N', 'likes_count:Q', 'photos_count:N', 'time:N']
)

# A slider filter
year_slider = alt.binding_range(min=2015, max=2020, step=1)
slider_selection = alt.selection_single(bind=year_slider, fields=['year'], name="Tweet")

rating_color_condition = alt.condition(slider_selection,
                      alt.Color('photos_count:N'),
                      alt.value('lightgray'))

filter_year = base.add_selection(
    slider_selection
).encode(
    color=rating_color_condition
).transform_filter(
    slider_selection
).properties(title="Slider Filtering")

filter_year

<h1> What are the most popular words used in his tweets?

<h3> Prepare the text blob to extract the most popular words.

In [None]:
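def create_text_blob(df, text_column):
    """
    This function takes the dataframe and collects all the lowercased words from the text column.
    :param df: The input dataframe.
    :param text_column: The name of the column with the text.
    :return: The list of all the words.
    """
    blob_text = []
    for _, content in df[text_column].items():
        for word in content.split():
            blob_text.append(word.lower())
    return blob_text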

In [None]:
blob_text = create_text_blob(new_df, 'tidy_tweet')
print(blob_text[0:100])

In [None]:
counts = Counter(blob_text)

In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
#Keep only the words that are not English stopwords
filtered_sentence = [w for w in blob_text if w not in stop_words]

print(filtered_sentence[0:200])

In [None]:
counts = Counter(filtered_sentence)
#Use the code below for the row-by-row print.
# for i, n in counts.items():
#     print(i,":", n)

In [None]:
import plotly.express as px

top_20_words = {}

for (key, value) in counts.items():
    # Check if the word occurs more than 200 times and add it to the new dictionary
    if value > 200:
        top_20_words[key] = value

sorted_top_20_words = dict(sorted(top_20_words.items(), key=lambda item: item[1], reverse=False))

word = list(sorted_top_20_words.keys())
count = list(sorted_top_20_words.values())


fig = px.bar(y=word, x=count, text = count)
fig.update_traces(texttemplate='%{text:}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

<h3> As we can see in the graph above, there are a few insignificant tokens like '&' and '...', leftover stopwords like "it's" and "would", and numbers. We will get rid of them to get a clearer picture of what Elon Musk is tweeting about.

In [None]:
top_20_words_clean = {}

for (key, value) in counts.items():
    # Check if the word is longer than 2 characters and occurs more than 150 times, then add it to the new dictionary
    if len(key) > 2 and value > 150:
        top_20_words_clean[key] = value

sorted_top_20_words_clean = dict(sorted(top_20_words_clean.items(), key=lambda item: item[1], reverse=False))

word = list(sorted_top_20_words_clean.keys())
count = list(sorted_top_20_words_clean.values())

fig = px.bar(y=word, x=count, text = count)
fig.update_traces(texttemplate='%{text:}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

<h3> It looks like Elon Musk was mostly tweeting about the Tesla Model 3, and often used words such as like, good, and great. SpaceX also occurred quite often.
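
The cells above select words with fixed thresholds (more than 200 or 150 occurrences), so the number of bars depends on the data. As a hedged alternative, `Counter.most_common` returns exactly the N most frequent items. The sketch below combines it with a simple token filter; `top_words` is a hypothetical helper name and the tokens are made up.

```python
from collections import Counter


def top_words(tokens, n=20, min_length=3):
    """Return the n most frequent alphabetic tokens of at least min_length characters."""
    kept = [t for t in tokens if len(t) >= min_length and t.isalpha()]
    return Counter(kept).most_common(n)


# Made-up tokens, not taken from the dataset.
tokens = ["tesla", "model", "3", "tesla", "&", "spacex", "model", "tesla", "is"]
print(top_words(tokens, n=2))
# [('tesla', 3), ('model', 2)]
```

Note that the `isalpha` filter is stricter than the length check used above: it also drops numbers such as the "3" in "Model 3" and contractions such as "it's", which may or may not be desirable for this analysis.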

The Medium article about the above analysis: https://medium.com/@lukasz.aszyk/this-is-how-5-years-of-elon-musks-tweets-look-like-part-1-176a8279cefb

The Github repo with the project above: https://github.com/asheone/Data-Science-Project-1