# Basic Operations and Manipulating Tweet Data

We'll be using `pandas` to manipulate our saved data as dataframes. If you're familiar with R dataframes, the structure is similar! 

Dataframes let us operate on entire rows and columns at once, so we rarely need to write explicit for loops.

If you're looking for more resources on learning pandas basics, check out the resources list in `README.md`.

## Visualizing Tweet Frequency Across Time

As linguists, we might be interested in studying the temporal distribution of different lexical items. For example, you might study the rise and fall of slang terms over months and years. You could also study spelling variants that may reflect dialect differences, and how their representation on Twitter changes over time. You might also be interested in how certain lexical items are distributed with respect to important events or times of day.

Let's visualize the distribution of our 10,000 tweets containing "pillow" in `saved_searches/pillow_tweets.csv` in order to test the hypothesis: 

- **H1: Tweets containing the word "pillow" will be more frequent leading up to people's bedtimes**

In [None]:
import pandas as pd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import datetime
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
# download nltk resources if you haven't before (uncomment to do so)
#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('punkt_tab')  # newer NLTK releases may ask for this instead of 'punkt'

# set matplot preference
%matplotlib inline

# read in data
pillow = pd.read_csv('saved_searches/pillow_tweets.csv')

# create new column that only includes time information
pillow['created_at'] = pd.to_datetime(pillow['created_at'], format = '%Y-%m-%d %H:%M:%S')
pillow['time'] = pillow['created_at'].dt.time

# view first and last 5 rows
print(pillow.head())
print(pillow.tail())




In [None]:
# Tweets from the API are timestamped in UTC. These tweets were recorded during Daylight Saving Time, so EDT is UTC minus 4 hours.
# We shift the timestamps by subtracting a datetime.timedelta() of 4 hours.
fig, ax = plt.subplots(1,1)
ax.hist(pillow['created_at'] - datetime.timedelta(hours = 4), bins=16, histtype='step')
ax.xaxis.set_major_formatter(formatter = mdates.DateFormatter('%H:%M'))
plt.xticks(rotation = 45)
plt.show()
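
The time shift above can also be done without pandas. Here is a minimal sketch using only the standard library, assuming timestamps in the same format as `created_at`. The timestamps and the `utc_to_edt` helper are made up for illustration, and the fixed UTC-4 offset ignores the switch back to EST in winter.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical UTC timestamps in the same format as the 'created_at' column
utc_strings = [
    "2020-06-01 01:15:00",
    "2020-06-01 02:45:00",
    "2020-06-01 03:05:00",
    "2020-06-01 03:50:00",
]

# Fixed offset for EDT (UTC-4); this does not account for EST in winter
EDT = timezone(timedelta(hours=-4), name="EDT")

def utc_to_edt(stamp):
    """Parse a UTC timestamp string and convert it to EDT (hypothetical helper)."""
    utc_time = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return utc_time.astimezone(EDT)

# Count tweets per local hour of day
tweets_per_hour = Counter(utc_to_edt(s).hour for s in utc_strings)
for hour in sorted(tweets_per_hour):
    print(f"{hour:02d}:00  {tweets_per_hour[hour]}")
```

Note that converting to local time can move a tweet to the previous calendar day, which is why we only look at the hour here.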

## Interpreting the Graph

This graph is suggestive, but a few issues prevent us from directly addressing our hypothesis.

**Can you think of some? Try brainstorming for a second before looking at some problems below...**

<br />
<br />
<br />
<br />
<br />
<br />
<br />

- These are just counts! They don't tell us if the relative frequency of "pillow" changes following our hypothesis.
- The search didn't restrict *where* tweets came from, so it's basically assured that not all of these tweets are from users in the EDT time zone.
- A search for "pillow" can return a wider set of senses than "sleeping pillow", and these different senses probably have different distributions.
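
To make the first point concrete, here is a rough sketch of how raw counts could be turned into relative frequencies if we also had a baseline count of all tweets per hour. We don't have that baseline in our data, so every number below is invented, and `relative_frequency` is a hypothetical helper.

```python
# Hypothetical hourly counts (hour of day -> number of tweets); not real data
pillow_counts = {20: 120, 21: 150, 22: 180}
total_counts = {20: 60000, 21: 50000, 22: 45000}

def relative_frequency(target, total):
    """Return the target count per 1,000 tweets for each hour with a nonzero total."""
    return {
        hour: 1000 * target.get(hour, 0) / total[hour]
        for hour in total
        if total[hour] > 0
    }

rates = relative_frequency(pillow_counts, total_counts)
for hour in sorted(rates):
    print(f"{hour:02d}:00  {rates[hour]:.1f} per 1,000 tweets")
```

A rate like this separates "people tweet more about pillows at night" from "people just tweet more at night".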

# Text Processing

Some of the questions we might ask about a tweet's text are easier to answer if we first remove punctuation, emoji, or function words.



In [None]:
## Demonstration of method to tokenize and clean sentences

# word_tokenize() breaks up a large string into a list containing individual words it has identified
s = 'This is an example sentence, with a TON..... of punctuation!!!!And :) a smiley'
s_tokenized = nltk.tokenize.word_tokenize(s)
print(s_tokenized)

# let's convert all of the words to lower case
s_lower = [w.lower() for w in s_tokenized]
print(s_lower)

# now let's remove non-words
s_words = [w for w in s_lower if w.isalpha()]
print(s_words)

## Stopwords

A common method to improve analyses of content words in NLP is to remove **stopwords**. These roughly map onto pronouns and function words, and are often ignored for the purpose of entity tagging, sentiment analysis, and other NLP methods.

If you're interested in syntactic phenomena or the spelling/frequency of function words, you'll obviously want to keep them. But otherwise, you might consider removing stopwords like we do below:

In [None]:
print(s_words)

# define the stopwords list
stopwords = nltk.corpus.stopwords.words('english')

# only include words that aren't in the stopwords list
s_final = [w for w in s_words if w not in stopwords]
print(s_final)
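
One optional tweak: checking membership in a list gets slow when you filter thousands of tweets, while a `set` lookup is fast regardless of size. The sketch below uses a tiny made-up stopword list and a made-up token list as stand-ins, so it runs without NLTK.

```python
# A small hypothetical stopword list standing in for nltk's English list
stopword_list = ["this", "is", "an", "with", "a", "of", "and"]

# Converting to a set makes each "in" check fast
stopword_set = set(stopword_list)

# Hypothetical cleaned tokens, similar in spirit to s_words above
tokens = ["this", "is", "an", "example", "sentence", "with", "a",
          "ton", "of", "punctuation", "and", "a", "smiley"]

content_words = [w for w in tokens if w not in stopword_set]
print(content_words)
```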

## Text Cleaning Function 

We want to apply this routine (tokenize --> lower case --> remove non-alpha --> remove stopwords) to all of the text of the tweets we have. Let's define a function to do this!

In [None]:
def clean_sentence(s, stopwords):
    """
    Take as input a sentence as a str and a list of stopwords
    Then output a tokenized, lower case list of all words which are not stopwords.
    You could make this function more efficient by vectorizing it and then applying it to a pandas column simultaneously.
    """
    s_tokenized = nltk.tokenize.word_tokenize(s)
    s_lower = [w.lower() for w in s_tokenized]
    s_words = [w for w in s_lower if w.isalpha()]
    s_final = [w for w in s_words if w not in stopwords]
    return s_final

print(clean_sentence('this!!!is a test OF THE FUNCTION,!;', stopwords))

In [None]:
print(pillow['text'][0])
print(clean_sentence(pillow['text'][0], stopwords))

Notice that this process isn't perfect for tweets. 

It splits hyperlinks, leaving "https" in our word list, and it also keeps "amp" from the HTML entity "&amp;".

For our rough purposes though, this will be sufficient!
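
If you do want to handle those two cases, one possible approach is to pre-process the raw text before tokenizing. This is only a sketch: `strip_tweet_noise` is a hypothetical helper that is not part of this workshop's code, the example tweet is made up, and the URL pattern is deliberately simple.

```python
import html
import re

# Simple pattern for http/https links; real tweets may need something more robust
URL_PATTERN = re.compile(r"https?://\S+")

def strip_tweet_noise(text):
    """Decode HTML entities and drop URLs before tokenizing (hypothetical helper)."""
    text = html.unescape(text)         # "&amp;" becomes "&"
    text = URL_PATTERN.sub(" ", text)  # remove hyperlinks
    return " ".join(text.split())      # collapse leftover whitespace

example = "New pillow &amp; blanket day!! https://t.co/abc123"
print(strip_tweet_noise(example))
```

The output of a function like this could then be passed to `clean_sentence()` in place of the raw tweet text.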

In [None]:
# using .apply() and lambda allows us to run our new function on every value in the 'text' column
# and save the output as a new column
pillow['cleaned_text'] = pillow['text'].apply(lambda text: clean_sentence(text, stopwords))
pillow.head()

# Bi-Grams

A useful representation of ordered sequence data (like sentences/tweets) is the N-gram: a sequence of N adjacent elements.

In our case, let's consider the bigrams (sequences of two elements) in the following sentence:

**"I fed the dog before I hugged the dog."**

- I fed
- fed the
- the dog
- dog before
- before I
- I hugged
- hugged the
- the dog

Notice that some bigrams in this sentence are more common than others ("the dog" occurs twice), while some bigrams are unattested (\*"dog the")!
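
The list above can be reproduced in a few lines of plain Python. This sketch splits on whitespace rather than using NLTK's tokenizer, so the final period is left out of the sentence.

```python
from collections import Counter

sentence = "I fed the dog before I hugged the dog"
tokens = sentence.split()

# Pair each token with the one that follows it
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

for bigram, count in bigram_counts.most_common():
    print(bigram, count)
```

Looking up an unattested bigram such as `bigram_counts[('dog', 'the')]` simply gives 0.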

<br />

Let's consider if the bigrams including "pillow" give us any insight into the types of conversations people are having.

In [None]:
# convert all rows of cleaned_text column into a single list
all_cleaned_text = pillow['cleaned_text'].explode().dropna().to_list()

# construct the nltk bigram finder
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(all_cleaned_text)

# the filter removes every bigram for which the function returns True,
# so this keeps only the bigrams that contain "pillow"
pillow_filter = lambda *w: 'pillow' not in w
finder.apply_ngram_filter(pillow_filter)

# print out the 10 bigrams with the highest chi-square association scores
for bigram in finder.nbest(bigram_measures.chi_sq, 10):
    print(bigram)
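
One caveat about the cell above: flattening every tweet into a single list means the last word of one tweet and the first word of the next are counted as a bigram. The sketch below shows the effect with two made-up tweets and plain Python. NLTK's `BigramCollocationFinder.from_documents()` appears designed for input that is a list of token lists, so it may be worth checking its documentation if this matters for your analysis.

```python
from collections import Counter

# Two hypothetical cleaned tweets (lists of tokens)
cleaned_tweets = [["love", "new", "pillow"], ["cat", "stole", "pillow"]]

# Flattening first creates the bigram ('pillow', 'cat'), which spans two tweets
flat = [w for tweet in cleaned_tweets for w in tweet]
flat_bigrams = Counter(zip(flat, flat[1:]))

# Counting within each tweet keeps bigrams from crossing tweet boundaries
per_tweet_bigrams = Counter()
for tweet in cleaned_tweets:
    per_tweet_bigrams.update(zip(tweet, tweet[1:]))

print(("pillow", "cat") in flat_bigrams)
print(("pillow", "cat") in per_tweet_bigrams)
```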

Beyond identifying collocations (strongly associated bigrams), n-grams are the basis of many computational language models.

We won't explore n-grams any further, but check out the resources on NLTK and NLP/N-Grams in the `README.md`.