<a href="https://colab.research.google.com/github/BrockDSL/BRB_Harvesting_Social_Media/blob/main/Havesting_Social_Media_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![dsl_logo.png](https://raw.githubusercontent.com/BrockDSL/BRB_Harvesting_Social_Media/main/dsl_logo.png)

# Harvesting Social Media Data
## Building Better Research Workshop Series

This workshop introduces the basics of harvesting social media data: what you can collect and how to collect it.


## How this notebook works

This webpage is a Google Colab notebook made up of different *cells*. Some are code cells that run Python snippets. To work through these cells, simply click the triangle _run_ button in each cell.

## Save a copy 

To save a copy of this notebook so you can return to it later, go to **File > Save Copy in Drive**.

In [None]:
# This code cell will load up all the required pieces to run our notebook.
# Once you click into this cell you'll see a triangle 'play' button appear
# Click on that to start your session

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_columns', None)

print("Ready to proceed!")

## Twitter Information via the API

As we discussed during our introductory talk, we can use an application programming interface (API) to request data generated on Twitter.

**Our Dataset**: I searched for "_COVID_" on Twitter on January 10, 2022 and harvested 3 seconds' worth of tweets. We are going to take a look at all of the information that we have.


## Loading the Data

The API returns data in a format called _JSON_. The exact details of this file type are pretty extensive, but for our purposes the best way to think of it is as a very fancy _CSV_ file. We are going to use a Python data analysis tool called [Pandas](https://pandas.pydata.org/docs/user_guide/10min.html) to load it.
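
To make the "fancy CSV" idea concrete, here is a minimal sketch of how pandas turns JSON into a table. The two records below are made up for illustration and are much smaller than real tweets; the only thing taken from this notebook is the `lines=True` option, which tells pandas that each line of the input is one record.

```python
import io
import pandas as pd

# Two made-up records, one JSON object per line (not real tweets)
sample_json_lines = (
    '{"id": 1, "full_text": "first example tweet", "retweet_count": 3}\n'
    '{"id": 2, "full_text": "second example tweet", "retweet_count": 0}\n'
)

# lines=True parses each line as a separate record (one row per line)
sample_df = pd.read_json(io.StringIO(sample_json_lines), lines=True)

# Each JSON key becomes a column, each line becomes a row
print(sample_df)
```

Each key in a record becomes a column and each line becomes a row, which is why the result can be read much like a spreadsheet.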

In [None]:
twitter_data = pd.read_json("https://raw.githubusercontent.com/BrockDSL/BRB_Harvesting_Social_Media/main/covid.json",lines=True)
print("Data Loaded!")

## How many records?

How many records did our few seconds of harvesting produce?


## Question 1: ## 
How many records do you think are in the dataset?

In [None]:
len(twitter_data)

## What does a record look like?

Let's look at a random entry to see what data fields are associated with it. We will do this by randomly **sampling** one record.


## Question 2: ## 
How many data fields do you think are included in one tweet?

In [None]:
random_entry = twitter_data.sample(1)
random_entry
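
If scrolling across a single row makes the fields hard to count, the table's shape and column names give the same information directly. This is a small sketch on a made-up three-column table called `toy_tweets` (a hypothetical name, not the workshop dataset); the same calls should work on any pandas DataFrame.

```python
import pandas as pd

# A made-up stand-in for the real dataset, with only three fields
toy_tweets = pd.DataFrame({
    "id": [1, 2, 3],
    "full_text": ["tweet a", "tweet b", "tweet c"],
    "retweet_count": [5, 0, 2],
})

# random_state fixes the random draw so the same row comes back on every run
one_row = toy_tweets.sample(1, random_state=0)
print(one_row)

# shape is (number of rows, number of columns); columns lists the field names
n_rows, n_fields = toy_tweets.shape
print(n_fields)
print(list(toy_tweets.columns))
```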

## Question 3: ##

What types of questions can you ask and answer with this type of data?


---

## Tweet Metrics

Let's look at some very specific Twitter information from this list of tweets.

## Retweets



## Question 4: ## 
What do you think is the top number of retweets?

In [None]:
most_retweets = twitter_data["retweet_count"].max()
most_retweets

Let's look at the most-retweeted tweet:

In [None]:
twitter_data[twitter_data["retweet_count"] == twitter_data["retweet_count"].max()]
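
Filtering on the maximum shows only the single top tweet (or several, if they tie). When you would rather see a short ranking, pandas' `nlargest` returns the top few rows by a column. The sketch below uses a made-up table named `toy_tweets`; only the column name `retweet_count` comes from this notebook.

```python
import pandas as pd

# Made-up tweets with made-up retweet counts
toy_tweets = pd.DataFrame({
    "full_text": ["tweet a", "tweet b", "tweet c", "tweet d"],
    "retweet_count": [12, 250, 7, 90],
})

# Keep the two rows with the highest retweet_count, highest first
top_two = toy_tweets.nlargest(2, "retweet_count")
print(top_two)
```

The same pattern should work for any numeric column, such as `favorite_count`.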

## Favourites



## Question 5: ##
What do you think is the top number of favourites in this list of tweets?

In [None]:
most_favorite_count = twitter_data["favorite_count"].max()
most_favorite_count

Let's look at the most-favourited tweet:

In [None]:
twitter_data[twitter_data["favorite_count"] == twitter_data["favorite_count"].max()]



---

## Whole Dataset Analysis

Let's take a look at some characteristics of the whole body of tweets.

## Languages

## Question 6: ##

Besides _English_, what do you think is the top language in the dataset?


Let's create a pie chart of the languages in the dataset so we can visualize them.

In [None]:
#The language is stored inside one of the columns in the dataset, called metadata
#Let's look at the random tweet we sampled earlier and see what language it is.

#Grab the metadata column contents
tweet_metadata = random_entry["metadata"]

#Print all the metadata items for this tweet
for tw in tweet_metadata:
  print(tw)


In [None]:
language_count = dict()

# Go through each row of the data and see what two letter language code
# is in the iso_language_code metadata field

for row in twitter_data.itertuples(index=False):
  language_entry = row.metadata['iso_language_code']
  #Create a lookup 'dictionary' of codes
  if language_entry in language_count:
    language_count[language_entry] += 1
  else:
    language_count[language_entry] = 1
    

plt.pie(list(language_count.values()),labels=list(language_count.keys()))
plt.title("Languages in the Tweets")
plt.show()


In [None]:
#Numerical language data b/c those wedges of pie are getting small
language_count
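
The counting loop above can also be written with Python's built-in `collections.Counter`, which handles the "have I seen this code before?" check for you. This is a sketch on a made-up table (`toy_tweets` and its language codes are invented for illustration); it assumes, as in the cells above, that each `metadata` value is a dictionary containing an `iso_language_code` key.

```python
from collections import Counter
import pandas as pd

# Made-up metadata entries shaped like the ones in the dataset
toy_tweets = pd.DataFrame({
    "metadata": [
        {"iso_language_code": "en"},
        {"iso_language_code": "es"},
        {"iso_language_code": "en"},
        {"iso_language_code": "fr"},
        {"iso_language_code": "en"},
    ]
})

# Counter tallies how many times each language code appears
toy_language_count = Counter(
    entry["iso_language_code"] for entry in toy_tweets["metadata"]
)

# most_common() lists (code, count) pairs from most to least frequent
print(toy_language_count.most_common())
```

Sorting by frequency makes the small pie wedges easier to compare than reading them off the chart.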

## Searching

You can search the full text of the tweets to see whether particular words show up. Try searching for _covid deaths_.

## Question 7: ##

What interesting things can you find by searching in the tweets? (Run the next cell and click the _Magic Wand_ to load up the interactive data viewer.)

In [None]:
#Run this cell then click on the 'Magic Wand' icon
ft_twitter_data = twitter_data.filter(['full_text'],axis = 1)
ft_twitter_data
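
The same kind of search can be done in code rather than in the interactive viewer. The sketch below filters a made-up table (`toy_tweets`, with invented text) using pandas' `str.contains`; only the column name `full_text` and the search phrase come from this notebook.

```python
import pandas as pd

# Made-up tweet text for illustration
toy_tweets = pd.DataFrame({
    "full_text": [
        "COVID deaths are reported weekly",
        "Got my covid booster today",
        "Nothing to do with the pandemic",
    ]
})

# case=False ignores upper/lower case; regex=False treats the phrase as plain text
mask = toy_tweets["full_text"].str.contains("covid deaths", case=False, regex=False)

# Keep only the rows whose text contains the phrase
matches = toy_tweets[mask]
print(matches)
```

Because the result is an ordinary table, you can count the matches with `len(matches)` or combine the search with the metrics explored earlier.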



---

## Conclusion

When analyzing social media data, you often get lots of metadata in addition to the full text of the posts you have identified. This extra data can be analyzed in many different ways.

If you're interested in exploring social media data for a research project or class, please contact **dsl @ brocku.ca** or check out the [DSL webpage](https://brocku.ca/library/dsl) for more details on how the Digital Scholarship Lab can help your research.
