<a href="https://colab.research.google.com/github/PaulAlvarez13/ISYS2001/blob/main/twitter_data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Twitter

In this notebook we are going to extract some Twitter posts.

# snscrape

This software is a scraper for social networking services (SNS). It scrapes things such as user profiles and hashtags. The following services are currently supported:

* Facebook: user profiles, groups, and communities (aka visitor posts)
* Instagram: user profiles, hashtags, and locations
* Mastodon: user profiles and toots (single or thread)
* Reddit: users, subreddits, and searches (via Pushshift)
* Telegram: channels
* **Twitter: users, user profiles, hashtags, searches, tweets (single or surrounding thread), list posts, and trends**
* VKontakte: user profiles
* Weibo (Sina Weibo): user profiles

We are going to use [snscrape](https://github.com/JustAnotherArchivist/snscrape) to get Twitter data. Some of the benefits of using this package are:

- Can fetch almost all tweets;
- Fast initial setup;
- Can be used anonymously, without signing up to Twitter;
- No rate limits (but don't abuse it).

> This is not a tutorial on snscrape. We will use the tool in a simple way. You can search for documentation and examples on the internet if you want to take it further.


There are alternatives. Some of the more common options include:

* [**Tweepy**](https://www.tweepy.org/). Use your developer account credentials to start scraping. Tweepy has a scraping limit of 3,200 tweets and a seven-day history.
> I am tired of having to create an account, so although Tweepy is simple to use, I have decided to try a different package.

* [**Twitter’s Firehose API**](https://developer.twitter.com/en/docs/twitter-api/enterprise/compliance-firehose-api/overview). You can upgrade your developer account and get nearly unlimited access to Twitter’s data stream via one of the various Twitter data provider partners.

* [**GetOldTweets3**](https://pypi.org/project/GetOldTweets3/). Twitter has removed the endpoint that GetOldTweets3 uses, which makes GOT no longer useful. It is included here because it may come up in conversation.

* [**TWINT**](https://github.com/twintproject/twint). Twint (Twitter Intelligence Tool) is an advanced tool that is currently difficult to install. The author of the library recommends using a Dockerfile, which is difficult on Colab.

* [**Octoparse**](https://www.octoparse.com/). Octoparse is paid software with a limited free option.


* **Use existing data**
  * [Top 25 Twitter Datasets for Natural Language Processing and Machine Learning](https://imerit.net/blog/top-25-twitter-datasets-for-natural-language-processing-and-machine-learning-all-pbm/)
  * [Free Twitter Datasets Mega Compilation](https://www.trackmyhashtag.com/blog/free-twitter-datasets/)
  * [Awesome Twitter Data](https://awesomeopensource.com/project/shaypal5/awesome-twitter-data)




In [None]:
# Run the pip install command below if you don't already have the library
!pip install snscrape

## Get the data

We will follow our typical pattern:
* import the libraries
* set up the variable(s)
* do the calculation
* output the results

We will use a common strategy of storing the data in a pandas DataFrame.


## The Query

As the purpose of the notebook is to visualise data, we will keep our query simple:

*How often is `Python` mentioned in tweets? We will limit our scrape to a modest 50 tweets (the `maxTweets` setting in the code below) posted between `January 1, 2022` and `May 1, 2022`.*
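The search text passed to the scraper is simply the search term followed by `since:` and `until:` date filters. As a minimal sketch, the hypothetical helper `build_query` below (not part of snscrape) assembles that string from a term and two dates, which makes it easier to change the date range later.

```python
from datetime import date


def build_query(term, since, until):
    """Hypothetical helper: join a search term with since:/until: date filters."""
    return f"{term} since:{since.isoformat()} until:{until.isoformat()}"


# Build the same search text that this notebook uses.
query_text = build_query("python", date(2022, 1, 1), date(2022, 5, 1))
print(query_text)  # python since:2022-01-01 until:2022-05-01
```

The resulting string can be passed to `TwitterSearchScraper` in place of the hard-coded text.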




In [None]:
# Imports
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Setting variables to be used below
maxTweets = 50

# Creating list to append tweet data to
tweets = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('python since:2022-01-01 until:2022-05-01').get_items()):
    if i >= maxTweets:
        break
    tweets.append([tweet.date, tweet.id, tweet.content])

# Creating a dataframe from the tweets list above
tweets_df = pd.DataFrame(tweets, columns=['Datetime', 'Tweet Id', 'Text'])

# Display first 5 entries from dataframe
tweets_df.head()

The text of the tweets contains non-English words, emojis, and strange punctuation. Depending on your business need, these may be important. For this exercise we will focus on the English words.

> For the curious, here is a more Pythonic way using a list comprehension. We haven't discussed list comprehensions, so we have followed our pattern of creating an empty list and then appending to it.
```python
from itertools import islice

query = sntwitter.TwitterSearchScraper('python since:2022-01-01 until:2022-05-01').get_items()
# islice stops after maxTweets items, or earlier if the search runs out of results
tweets = [[tweet.date, tweet.id, tweet.content] for tweet in islice(query, maxTweets)]
```
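To see that the append-in-a-loop pattern and a list comprehension build the same list, here is a small sketch that runs without any scraping. The generator `fake_tweets` is a made-up stand-in for the scraper's `get_items()`, and `limit` plays the role of `maxTweets`. It is deliberately set higher than the number of items available.

```python
from itertools import islice


def fake_tweets():
    """Hypothetical stand-in for get_items(): yields only three items."""
    for n in range(3):
        yield f"tweet {n}"


limit = 5  # deliberately larger than the number of items available

# Pattern used in this notebook: start with an empty list, append in a loop.
looped = []
for i, item in enumerate(fake_tweets()):
    if i >= limit:
        break
    looped.append(item.upper())

# Same result with a list comprehension. islice stops after `limit` items
# or when the generator is exhausted, whichever comes first.
comprehended = [item.upper() for item in islice(fake_tweets(), limit)]

print(looped == comprehended)
```

Both versions stop cleanly when fewer items than `limit` are available.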




# Clean the data

Real-world data is *messy*. In the next cell we run some Python code to *clean* the data. In our case this means extracting the English words.

We are going to use a [Regular Expression](https://en.wikipedia.org/wiki/Regular_expression). A regular expression is a way to use a sequence of characters as a search pattern. We can extract substrings that match the pattern. This is not 'perfect' but will meet the needs of the exercise.
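As a rough illustration of what the pattern in the next cell does, the sketch below splits it into its three alternatives and applies them to an invented sample tweet. The names `mention`, `other`, `url`, and `sample` are illustrative only.

```python
import re

# Invented sample tweet; \U0001F40D is the snake emoji.
sample = "@pythonista Loving #Python 3.10! \U0001F40D https://example.com/post"

mention = r"@[A-Za-z0-9]+"   # @usernames
other = r"[^0-9A-Za-z \t]"   # any character that is not a letter, digit, space or tab
url = r"\w+:\/\/\S+"         # anything shaped like scheme://...

# Each alternative can be inspected on its own.
print(re.findall(mention, sample))  # ['@pythonista']
print(re.findall(url, sample))      # ['https://example.com/post']

# Combined, every match is replaced with a space, then repeated spaces are collapsed.
combined = f"({mention})|({other})|({url})"
cleaned = " ".join(re.sub(combined, " ", sample).split())
print(cleaned)  # Loving Python 3 10
```

Note that the full stop in `3.10` is also removed, which is one way in which the approach is not perfect.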

In [None]:
import re

def clean(text):
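  ''' Uses a regular expression to remove @mentions, URLs, and any character that is not an English letter, digit, space, or tab. '''
  # Raw string (r"...") so that backslashes reach the regex engine unchanged
  regExp = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"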
  return ' '.join(re.sub(regExp, " ", text).split())

tweets_df['Clean Text'] = tweets_df['Text'].apply(clean)
tweets_df.head()

# Save the data

For this exercise we only want the date and the cleaned text. Once extracted, we save them to a CSV file.
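The sketch below walks through this step on a tiny made-up DataFrame, `sample_df`, which stands in for `tweets_df`. It writes to an in-memory buffer rather than a file so that the CSV text can be inspected directly. It also passes `index=False`, an assumption made here only to keep the output short. Without it, pandas writes the row index as an extra first column.

```python
import io

import pandas as pd

# Hypothetical stand-in for tweets_df; the rows are invented for illustration.
sample_df = pd.DataFrame({
    "Datetime": ["2022-01-05 10:30:00+00:00", "2022-03-17 22:15:42+00:00"],
    "Clean Text": ["learning python today", "python 3 11 is fast"],
})

# Keep only the calendar date, dropping the time of day and the UTC offset.
sample_df["Date"] = pd.to_datetime(sample_df["Datetime"]).dt.date

# Write the two columns of interest to an in-memory text buffer.
buffer = io.StringIO()
sample_df[["Date", "Clean Text"]].to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```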

In [None]:
tweets_df['Date'] = pd.to_datetime(tweets_df['Datetime']).dt.date
tweets_df[['Date','Clean Text']].to_csv("tweets.csv")