<br>

# Elongate: Elon Musk's Twitter Takeover

## Using Natural Language Processing and Graph Network Analysis to Examine the Tweeted Conversations about Elon Musk's Acquisition of Twitter.

### CUNY MSDS Data620 Final Project

Group (lucky) #7: Bonnie Cooper, George Cruz Deschamps, Rob Hodde

<br><br>

## Project Proposal

For our Final Project, Group (lucky) #7 would like to perform a social listening analysis on a corpus of tweets that reflect conversations related to Elon Musk. Specifically, we are interested in analyzing content that reflects the timeline of Elon Musk's purchase of Twitter. Our goal is to quantify aspects of the conversations and relate the findings to an established and documented timeline of events. Can we find data-driven insights from Twitter activity that inform us about the public's reaction(s) to the news of Twitter's acquisition by Musk?

Twitter accepted Elon Musk's acquisition offer on April 25th 2022. However, the weeks leading up to the agreement were dramatic and tumultuous. After all, Musk had only just disclosed a 9.2% stake in Twitter on April 4th, making him the company's largest shareholder; this, and the progression of events that culminated in the acquisition, took the tech world by storm. For our project we would like to analyze aspects of the public reaction to these events. Can we find patterns in the volume, sentiment and semantics of Twitter data that inform us about the public's reception of the Twitter takeover?

<br><br>

## The Data

We will collect a corpus of Twitter data to study Elon Musk's Twitter acquisition.  

To follow the timeline of events, we will collect tweets from April 1st to May 5th, 2022. Tweets will be scraped by keyword search with the open-source Python library [`snscrape`](https://github.com/MartinBeckUT/TwitterScraper), using the library's CLI commands.

We will briefly discuss collection of a preliminary data set below:  


### Example scraping code

```python
# import libraries (the snscrape CLI must be installed and on the PATH)
import os

# define query variables
text_query = "elon musk"
since_date = "2022-04-01"
until_date = "2022-05-05"

# Use the os library to call the snscrape CLI from Python,
# redirecting the JSON Lines output to a file
os.system(
    f'snscrape --jsonl --since {since_date} twitter-search '
    f'"{text_query} until:{until_date}" > text-query-tweets.json'
)
```

Running this code from the command line returned 478,321 tweets. Each tweet comes with extensive metadata. The following code shows a glimpse of the raw scraped data:

```python
import pandas as pd

# Reads the gzipped json generated from the CLI command above and creates a pandas dataframe
tweets_df = pd.read_json('elongate0000.gz', lines=True, compression='gzip')

# Displays information about the fields of metadata
tweets_df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25670 entries, 0 to 25669
Data columns (total 29 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   _type             25670 non-null  object             
 1   url               25670 non-null  object             
 2   date              25670 non-null  datetime64[ns, UTC]
 3   content           25670 non-null  object             
 4   renderedContent   25670 non-null  object             
 5   id                25670 non-null  int64              
 6   user              25670 non-null  object             
 7   replyCount        25670 non-null  int64              
 8   retweetCount      25670 non-null  int64              
 9   likeCount         25670 non-null  int64              
 10  quoteCount        25670 non-null  int64              
 11  conversationId    25670 non-null  int64              
 12  lang              25670 non-null  object             
 13  s
```

<br>

The raw data contains 29 features. However, not all are necessary for our analysis. Therefore, we reduced the data to just the columns of interest:

* **date** - time and date of tweet
* **id** - unique identifying number for a tweet
* **content** - raw tweet content (text, emojis, #s, @s, etc)
* **lang** - language classification
* **replyCount** - number of replies to this tweet
* **retweetCount** - number of retweets of this tweet
* **likeCount** - number of likes this tweet received
* **inReplyToTweetId** - tweet this current tweet was in reply to
* **inReplyToUser** - user this current tweet was in reply to
* **mentionedUsers** - users mentioned (@) by this current tweet
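
The reduction to these columns can be sketched as follows. This is a minimal illustration, assuming the raw data is already loaded into a pandas DataFrame that uses the column names listed above; the helper name `reduce_columns` and the toy rows are hypothetical and only stand in for the real scraped data.

```python
import pandas as pd

# Columns of interest, as listed above
COLUMNS_OF_INTEREST = [
    "date", "id", "content", "lang", "replyCount", "retweetCount",
    "likeCount", "inReplyToTweetId", "inReplyToUser", "mentionedUsers",
]


def reduce_columns(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of raw_df restricted to the columns of interest.

    Hypothetical helper: raises KeyError if any expected column is missing.
    """
    return raw_df.loc[:, COLUMNS_OF_INTEREST].copy()


# Toy stand-in for the raw scraped data (all values are made up)
raw_df = pd.DataFrame({
    "_type": ["tweet", "tweet"],
    "url": ["https://example.com/1", "https://example.com/2"],
    "date": ["2022-04-04 13:05:00+00:00", "2022-04-25 19:30:00+00:00"],
    "content": ["first toy tweet", "second toy tweet"],
    "id": [1, 2],
    "replyCount": [0, 3],
    "retweetCount": [1, 5],
    "likeCount": [2, 8],
    "lang": ["en", "en"],
    "inReplyToTweetId": [None, 1],
    "inReplyToUser": [None, "toy_user"],
    "mentionedUsers": [None, "toy_user"],
})

reduced_df = reduce_columns(raw_df)
print(reduced_df.columns.tolist())
```

Selecting with an explicit list fixes the column order and fails loudly if a field is missing from the scrape, which seems preferable to silently dropping it.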

```python
# information about preliminary tweet data
df = pd.read_csv('elongate_tweets.csv')
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478321 entries, 0 to 478320
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   date              478321 non-null  object 
 1   id                478321 non-null  int64  
 2   content           478321 non-null  object 
 3   lang              478321 non-null  object 
 4   replyCount        478321 non-null  int64  
 5   retweetCount      478321 non-null  int64  
 6   likeCount         478321 non-null  int64  
 7   inReplyToTweetId  373152 non-null  float64
 8   inReplyToUser     373152 non-null  object 
 9   mentionedUsers    435145 non-null  object 
dtypes: float64(1), int64(4), object(5)
memory usage: 36.5+ MB
```
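
In the summary above, `date` is stored as a generic `object` column after the round trip through CSV. Before looking at tweet volume along the timeline, it would need to be parsed back into timestamps. A minimal sketch of that step, assuming the dates are ISO-like strings with a UTC offset; the helper name `daily_tweet_counts` and the toy rows are hypothetical.

```python
import pandas as pd


def daily_tweet_counts(df: pd.DataFrame) -> pd.Series:
    """Count tweets per calendar day (UTC), sorted chronologically.

    Hypothetical helper: expects a 'date' column of parseable date strings.
    """
    dates = pd.to_datetime(df["date"], utc=True)
    # Truncate each timestamp to midnight, then count occurrences per day
    return dates.dt.floor("D").value_counts().sort_index()


# Toy stand-in for the reduced tweet data (values are made up)
toy_df = pd.DataFrame({
    "date": [
        "2022-04-01 09:15:00+00:00",
        "2022-04-01 18:40:00+00:00",
        "2022-04-04 13:05:00+00:00",
    ],
    "content": ["a", "b", "c"],
})

counts = daily_tweet_counts(toy_df)
print(counts)
```

Days without any tweets do not appear in the result; if a continuous daily axis is needed, the series could be reindexed over the full date range.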


<br><br>

## Proposed Methods

From the scraped Twitter data we will focus our analysis on three main directions:

* Sentiment Analysis: Can we observe trends in sentiment and/or emotion classification across the timeframe of our data collection? Furthermore, can we relate shifts in sentiment to events along the timeline?
* Topic Cluster Analysis: For the preliminary data set, we collected tweets using a very broad search query term: 'elon musk'. This returns tweets containing the words 'elon' and 'musk'. Can we use topic cluster analysis to isolate tweets more directly related to Elon Musk's Twitter acquisition? Can we learn about other interesting topics being discussed concurrently with the Twitter acquisition? How does the volume of tweets compare across topics?
* Information Diffusion: Can we apply analyses from previously published approaches (De Domenico 2013) to learn about patterns in the propagation of news about the acquisition?
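
As a starting point for the information diffusion question, the `id` and `inReplyToTweetId` columns already describe reply chains. The sketch below groups tweets into reply cascades and counts the size of each one. It is a simple stand-in for illustration, not the method of De Domenico et al.; the helper name `reply_cascade_sizes` and the toy IDs are hypothetical.

```python
from collections import Counter

import pandas as pd


def reply_cascade_sizes(df: pd.DataFrame) -> Counter:
    """Map each cascade root tweet ID to the number of collected tweets under it.

    Hypothetical helper: a tweet with no parent is its own root. If a parent
    tweet was not collected, that missing parent ID is treated as the root.
    """
    # Build a child -> parent lookup; None marks a tweet that is not a reply
    parent = {}
    for tweet_id, parent_id in zip(df["id"], df["inReplyToTweetId"]):
        parent[int(tweet_id)] = None if pd.isna(parent_id) else int(parent_id)

    def find_root(tweet_id: int) -> int:
        seen = set()  # guards against malformed data containing cycles
        while parent.get(tweet_id) is not None and tweet_id not in seen:
            seen.add(tweet_id)
            tweet_id = parent[tweet_id]
        return tweet_id

    return Counter(find_root(tweet_id) for tweet_id in parent)


# Toy reply structure (IDs are made up): 2 replies to 1, 3 replies to 2,
# 4 is a standalone tweet, 5 replies to a tweet (99) that was not collected
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "inReplyToTweetId": [None, 1, 2, None, 99],
})

sizes = reply_cascade_sizes(toy_df)
print(sizes.most_common())
```

One caveat: in the preliminary data `inReplyToTweetId` is stored as `float64`, which cannot represent every integer above 2^53 exactly. Tweet IDs from 2022 exceed that, so parent IDs may need to be kept as integers or strings for this matching to be reliable on the real data.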

<br><br>

## References

* De Domenico, Manlio, et al. ["The anatomy of a scientific rumor."](https://www.nature.com/articles/srep02980?message-global=remove&WT_ec_id=SREP-20131022) Nature Scientific Reports 3.1 (2013): 1-9.
* [Scraping Tweets from Twitter](https://github.com/MartinBeckUT/TwitterScraper)
