In this lab, we will use the 'twarc' Python package to interact with the Twitter API v2.

We will use the Recent search endpoint (Search Tweets --> Recent search).

This API endpoint:
- (By default) Retrieves 10 tweets per request, drawn from the last 7 days
- (By setting max_results to 100) Can retrieve up to 100 tweets per API request
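
Because each request returns at most 100 tweets, the number of requests needed for a given sample size is a ceiling division. The sketch below illustrates that arithmetic; `requests_needed` is a hypothetical helper and not part of twarc.

```python
import math

def requests_needed(target_tweets, per_request=100):
    """Number of API requests needed to collect target_tweets tweets."""
    # The recent search endpoint accepts max_results between 10 and 100
    if not 10 <= per_request <= 100:
        raise ValueError("per_request must be between 10 and 100")
    return math.ceil(target_tweets / per_request)

print(requests_needed(250))      # 3 requests at 100 tweets each
print(requests_needed(250, 10))  # 25 requests at the default page size
```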

Ref: https://twarc-project.readthedocs.io/en/latest/api/client2/#twarc.client2.Twarc2.search_recent

Ref: https://www.youtube.com/watch?v=guHH51GDDI0

Ref: https://catriscode.com/2021/05/01/tweets-cleaning-with-python/

In [None]:
import plotly.graph_objects as go
from twarc import Twarc2

In [None]:
# Paste in your own bearer token below
client = Twarc2(
    bearer_token = ''
)

In [None]:
# The language operator uses a colon: lang:en restricts results to English tweets
query = 'covid lang:en'

In [None]:
results = client.search_recent(
    query=query,
    max_results=100,
    tweet_fields="author_id,context_annotations"
)

In [None]:
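data = []
for page in results:
    print("======================")
    print(page)
    # A page with no matches may have no 'data' key
    data.extend(page.get('data', []))
    # Stop after the first page (up to 100 tweets) to limit API usage
    break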
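
If more than one page is wanted, the pages can be merged into a single list. The sketch below uses made-up pages in place of real API responses; `collect_tweets` and `fake_pages` are hypothetical names introduced only for illustration.

```python
def collect_tweets(pages, max_pages=None):
    """Merge the 'data' lists from an iterable of response pages."""
    tweets = []
    for page_number, page in enumerate(pages, start=1):
        # A page with no matches may lack the 'data' key entirely
        tweets.extend(page.get("data", []))
        if max_pages is not None and page_number >= max_pages:
            break
    return tweets

# Made-up stand-ins for the dicts that search_recent yields
fake_pages = [
    {"data": [{"id": "1", "text": "first"}, {"id": "2", "text": "second"}]},
    {"data": [{"id": "3", "text": "third"}]},
    {"meta": {"result_count": 0}},
]

print(len(collect_tweets(fake_pages)))               # 3
print(len(collect_tweets(fake_pages, max_pages=1)))  # 2
```

Passing `max_pages` keeps the number of requests bounded, which matters for a broad query.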

In [None]:
len(data)

In [None]:
# Convert the list of JSON tweet records into a pandas DataFrame

import pandas as pd

df = pd.json_normalize(data)

df

In [None]:
# Let's retrieve just the text from each tweet
# Store each tweet text as a List item

tweet_text_list = df['text'].tolist()

tweet_text_list

In [None]:
'''
1) Lowercasing all the letters

This step ensures that the same word is always written the same way, regardless of capitalization.

temp = tweet.lower()
temp
'''
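
To see why lowercasing matters, the sketch below counts distinct spellings of one word before and after the step. The word list is made up for illustration.

```python
# Three spellings of the same word, as they might appear in different tweets
words = ["COVID", "Covid", "covid"]

distinct_before = len(set(words))
distinct_after = len({w.lower() for w in words})

print(distinct_before)  # 3 distinct strings
print(distinct_after)   # 1 after lowercasing
```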

In [None]:
'''
2) Removing hashtags and mentions

Hashtags and mentions are common in tweets.
There are cases where you want to remove them so you only get the 
clean content of a tweet without all these elements.

You can remove these hashtags and mentions using regex.

import re

temp = re.sub("@[A-Za-z0-9_]+","", temp)
temp = re.sub("#[A-Za-z0-9_]+","", temp)
temp
'''
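
A runnable version of this step on a made-up, already lowercased tweet might look like the following. Removing a mention or hashtag leaves a double space behind, so the sketch normalizes whitespace before printing.

```python
import re

temp = "don't miss our @reddit ama with #everest climber mark synnott"

temp = re.sub(r"@[A-Za-z0-9_]+", "", temp)  # drop @mentions
temp = re.sub(r"#[A-Za-z0-9_]+", "", temp)  # drop #hashtags

# Collapse the gaps left behind by the removed elements
cleaned = " ".join(temp.split())
print(cleaned)  # don't miss our ama with climber mark synnott
```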

In [None]:
'''
3) Removing links

Links are usually not necessary for text processing, so it’s
better to remove them from your text.

temp = re.sub(r"http\S+", "", temp)
temp = re.sub(r"www\.\S+", "", temp)
temp
'''
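
The sketch below applies both link patterns to a made-up tweet. The dot after `www` is escaped so that it matches a literal period rather than any character; `www.example.com` is a placeholder address.

```python
import re

temp = "coral in the shallows https://on.natgeo.com/3gkgq4Z and www.example.com today"

temp = re.sub(r"http\S+", "", temp)   # links starting with http or https
temp = re.sub(r"www\.\S+", "", temp)  # links starting with www.

tokens = temp.split()
print(tokens)
```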

In [None]:
'''
4) Removing punctuation

Depending on your needs, you may not need punctuation such as
periods, commas, exclamation marks, question marks, etc.

temp = re.sub(r'[()!?]', ' ', temp)
temp = re.sub(r'\[.*?\]', ' ', temp)
temp
'''
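
As a small illustration on a made-up string, the first pattern replaces parentheses, exclamation marks, and question marks with spaces, and the second removes any text enclosed in square brackets.

```python
import re

temp = "ready for earth day! (live) [photo credit] are you?"

temp = re.sub(r"[()!?]", " ", temp)   # ( ) ! ? become spaces
temp = re.sub(r"\[.*?\]", " ", temp)  # bracketed text is dropped, shortest match first

tokens = temp.split()
print(tokens)
```

Note that periods and commas are not covered by these two patterns; they are handled by the non-alphanumeric filter in the next step.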

In [None]:
'''
5) Filtering non-alphanumeric characters

The previous step removed some punctuation, but other non-alphanumeric
characters may remain. To be sure, we can replace every character other than
lowercase letters (a-z) and digits (0-9) with a space. The ^ at the start of
the brackets negates the character class, meaning "anything except".

temp = re.sub("[^a-z0-9]"," ", temp)
temp
'''
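
The sketch below runs this filter on a made-up string. It also shows a side effect worth knowing: accented letters such as é fall outside a-z, so they are replaced as well.

```python
import re

temp = "join us on 4/21: music, café & more…"

# Every character that is not a lowercase ASCII letter or a digit becomes a space
temp = re.sub(r"[^a-z0-9]", " ", temp)

tokens = temp.split()
print(tokens)  # "café" is reduced to "caf"
```

If the tweets contain non-English text, a broader character class may be needed; that choice is outside what this lab covers.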

In [None]:
'''
6) Tokenization

Tokenization splits your text into smaller units called tokens,
for example a paragraph into a list of sentences, or a sentence into a list of words.

Libraries such as nltk provide functions such as word_tokenize() and sent_tokenize()
to help you with this.

However, if you just want a simple step that splits your text
into a list of words, the following code is enough.

The result will give you a list of words from your text.

temp = temp.split()
temp
'''
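
The sketch below shows why calling `split()` with no argument suits cleaned tweets: it treats any run of whitespace as a single separator, whereas `split(" ")` produces empty strings wherever earlier steps left double spaces. The input string is made up.

```python
temp = "join us   on 4 21\tmusic and celebration "

# No argument: split on any run of whitespace and ignore leading/trailing gaps
tokens = temp.split()
print(tokens)

# Explicit single space: empty strings appear between consecutive spaces
naive_tokens = temp.split(" ")
print(naive_tokens)
```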

In [None]:
'''
7) Stop words removal

Stop words are words that carry little meaning on their own.
They may seem important to us humans, but for a machine they mostly
add noise to the processing steps.

It’s also important to keep in mind that stop words are largely language-dependent.
In English, you have stop words such as for, to, and, or, in, out, etc.

Here I first define a list of stop words in English.
Then I check each token against that list.

If a token isn’t found in the list of stop words, it is kept;
otherwise it is dropped. In the end, you join the remaining words into one text again.


stopwords = ["for", "on", "an", "a", "of", "and", "in", "the", "to", "from"]
 
temp = [w for w in temp if w not in stopwords]
temp = " ".join(temp)
'''
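
A runnable sketch of this step on a made-up token list follows. It stores the stop words in a set rather than a list, which is a design choice for faster membership checks and not something the original code requires.

```python
# Same ten stop words as above, stored as a set
stopwords = {"for", "on", "an", "a", "of", "and", "in", "the", "to", "from"}

tokens = ["get", "ready", "for", "an", "evening", "of", "music", "and", "celebration"]

# Keep only the tokens that are not stop words
kept = [w for w in tokens if w not in stopwords]

text = " ".join(kept)
print(text)  # get ready evening music celebration
```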

In [None]:
'''
Text Preprocessing: From Start to Finish

I hope you understand the steps I have explained above.
Now we can combine all those lines of code into one function that we can
call and pass an argument to.

The function then returns a clean text that is ready for you to work with.

Keep in mind that the order of the steps here is not fixed.
You can rearrange them depending on your text and your needs.
The code below is what I found to be the most effective on the data I
usually work with, but if your data follows a different pattern,
you can always order the steps differently.
'''
import re

stopwords = ["for", "on", "an", "a", "of", "and", "in", "the", "to", "from"]

def clean_tweet(tweet):
    # Missing values (e.g. NaN in a pandas column) are not strings
    if not isinstance(tweet, str):
        return ""
    temp = tweet.lower()
    temp = re.sub("'", "", temp)  # drop apostrophes so contractions stay one token ("don't" -> "dont")
    temp = re.sub(r"@[A-Za-z0-9_]+", "", temp)  # mentions
    temp = re.sub(r"#[A-Za-z0-9_]+", "", temp)  # hashtags
    temp = re.sub(r"http\S+", "", temp)  # links
    temp = re.sub(r"[()!?]", " ", temp)
    temp = re.sub(r"\[.*?\]", " ", temp)  # bracketed text
    temp = re.sub(r"[^a-z0-9]", " ", temp)
    temp = temp.split()
    temp = [w for w in temp if w not in stopwords]
    return " ".join(temp)
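
Although the order of the steps is flexible, some steps depend on others. The sketch below compares two orderings on one of the sample tweets: removing links before the alphanumeric filter, and the reverse. The variable names `links_first` and `filter_first` are introduced only for this illustration.

```python
import re

tweet = "Coral in Aitutaki Lagoon https://on.natgeo.com/3gkgq4Z"

# Links first, then the alphanumeric filter: the URL disappears cleanly
links_first = re.sub(r"http\S+", "", tweet.lower())
links_first = re.sub(r"[^a-z0-9]", " ", links_first).split()

# Filter first: the URL is broken into fragments, and the link pattern
# then removes only the leading "https" piece
filter_first = re.sub(r"[^a-z0-9]", " ", tweet.lower())
filter_first = re.sub(r"http\S+", "", filter_first).split()

print(links_first)
print(filter_first)
```

The second ordering leaves URL fragments such as `natgeo` and `com` in the output, which would then show up in any word counts.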

In [None]:
tweets = ["Get ready for #NatGeoEarthDay! Join us on 4/21 for an evening of music and celebration, exploration and inspiration https://on.natgeo.com/3t0wzQy.",
"Coral in the shallows of Aitutaki Lagoon, Cook Islands, Polynesia https://on.natgeo.com/3gkgq4Z",
"Don't miss our @reddit AMA with author and climber Mark Synnott who will be answering your questions about his historic journey to the North Face of Everest TODAY at 12:00pm ET! Start submitting your questions here: https://on.natgeo.com/3ddSkHk @DuttonBooks"]
 
results = [clean_tweet(tw) for tw in tweets]
results

In [None]:
# Let's go and apply this to our own tweets
my_results = [clean_tweet(tw) for tw in tweet_text_list]
my_results

In [None]:
# importing all necessary modules

# Note that the wordcloud package also provides a stop word list (STOPWORDS)
# WordCloud applies it by default when no stopwords argument is passed - please do explore!!!
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Join the cleaned tweets from our own search into one string
# (use `results` instead to plot the three sample tweets)
comment_words = " ".join(my_results)
    
# Let's make a word cloud
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                #stopwords = stopwords,
                min_font_size = 10).generate(comment_words)
 
# plot the WordCloud image                      
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
 
plt.show()
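
A word cloud sizes each word according to how often it occurs, so it can help to inspect the underlying counts directly. The sketch below does this with `collections.Counter` on made-up cleaned tweets; in the lab, the list of cleaned tweets would take the place of `cleaned`.

```python
from collections import Counter

# Made-up examples of cleaned tweet text
cleaned = [
    "covid vaccine rollout continues",
    "new covid cases fall",
    "vaccine clinic opens",
]

# Count every token across all tweets
counts = Counter(" ".join(cleaned).split())

# "covid" and "vaccine" each appear twice; every other word appears once
for word, n in counts.most_common(5):
    print(word, n)
```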