# Project: Viral Tweet Data about Johnson & Johnson-Covid 19 Vaccine








## INTRODUCTION

This project builds and cleans a dataset around Covid-19 and the Johnson & Johnson vaccine, focusing on one week before and after the CDC and other health agencies temporarily pausing the use of the vaccine to examine the likelihood of a rare but severe side effect, and looks at simple (and future) opportunties to analyze the dataset.


## PROJECT MATERIALS

- Jupyter Notebooks: Project Dcoumentation (including workload distribution and distribution plan); Tweet Collection,  Pandas dataframe
- Dataset: Tweets.txt, JnJ_dataframe.csv
- Presentation
- Readme File

## CODE DOCUMENTATION



### PULLING DATA USING TWITTER API

*   Import the required libraries and set desired options.

In [1]:
import requests
import os
import json
import pandas as pd
import time
from collections import defaultdict

pd.set_option('display.max_colwidth', None)

#Create a new folder to store/access all the data files.
os.makedirs("tweetdata", exist_ok=True) 

*   To establish a connection with 'Twitter Search API', we did authentication with a single Bearer token.
*   To complete this step, put your 'BEARER_TOKEN' in the place of 'XXXX'.

In [2]:
# Set environment variables
os.environ['BEARER_TOKEN'] = 'XXXX'

# Get environment variables
bearer_token = os.environ.get("BEARER_TOKEN")
search_url = "https://api.twitter.com/2/tweets/search/all"

-    Created a search query to filter tweets around Johnson & Johnson Covid19-vaccination topic. 
-  Here we have used message keywords (JNJ, johnsonandjohnson, covid19) and  hashtags(#Vaxed, #COVID19) combined into a query with boolean logic and parentheses to help refine the queries matching behavior.

-  Here we provide a short description of search parameters used below that are useful for the purpose of this project. For the full description and the other types of data that can be pulled through API, please refer to https://developer.twitter.com/en/docs/twitter-api/fields  
  
  -   "tweet.fields":
    - author_id: unique identifier for Twitter user 
    - created_at: creation time of the tweet.
    - public_metrics: public engagement metrics for the Tweet at the time the data was pulled, including likes, retweet, quote-retweet and reply counts.  
  
  - "user.fields":
    - description: The text of this user's profile description (also known as bio)
    - location: The location specified in the user's profile, if the user provided one.
    - pinned_tweet_id: Unique identifier of this user's pinned Tweet.

  - "place.fields": (note often time this information is missing)
    - country: The full-length name of the country this place belongs to.
    - geo: contains geo location info (e.g., latititude) in GeoJSON format.




In [3]:
query_params = {"query": "(JNJ (#Vaxed OR #COVID19 OR corona OR COVID OR covid19 OR vaccination)) OR (johnsonandjohnson (#Vaxed OR #COVID19 OR corona OR COVID OR covid19 OR vaccination)) -is:retweet lang:en", "max_results":"500",
                "start_time": "2021-04-07T00:00:00Z",
                "end_time": "2021-04-20T23:00:00Z", 
                "expansions": "author_id",
                "tweet.fields": "author_id,created_at,lang,possibly_sensitive,public_metrics,text",
                "user.fields": "created_at,description,location,name,pinned_tweet_id,public_metrics,username",
                "media.fields": "url,public_metrics",
                "place.fields": "contained_within,country,country_code,full_name,geo,id,name,place_type"}

*   Sending API Requests, to retrieve matching tweets between 2021-04-07 to 2021-04-20. This will likely generate more data than can be returned in a single response, hence, we have used pagination to get complete result.
*   As there are subsequent requests to twitter API, this processing will take approximately 20sec to complete. In the end it will display the total number of matching tweets retrieved as a result.
*   We are storing the API response in ‚Äòtweets.txt‚Äô file. In the next step, we will use this file to apply transformation and make pandas dataframes.


In [4]:
def main():
    
    FILE = 'Tweets.txt'

    def create_headers(bearer_token):
        headers = {"Authorization": "Bearer {}".format(bearer_token)}
        return headers


    def connect_to_endpoint(url, headers, params):
        response = requests.request("GET", search_url, headers=headers, params=params)
#        print(response.status_code)
        if response.status_code != 200:
            raise Exception(response.status_code, response.text)
        return response.json()

    headers = create_headers(bearer_token)

    json_response = connect_to_endpoint(search_url, headers, query_params)

    count=0
    total_tweets= 0
    count_current = 0

    with open('tweetdata/' + FILE, mode='w') as json_file:
        json_file.write(json.dumps(json_response))
        json_file.write("\n") 
        count=count+ count_current

    while json_response["meta"]["next_token"] :
        if 'result_count' in json_response["meta"]:
            count_current = json_response["meta"]["result_count"]
    
        time.sleep(1)
        
        query_params["next_token"] = json_response["meta"]["next_token"]
        json_response = connect_to_endpoint(search_url, headers, query_params)
        count=count + count_current
        
        with open('tweetdata/' + FILE, mode='a') as json_file:
            json_file.write(json.dumps(json_response))
            json_file.write("\n")
        
        if 'next_token' in json_response["meta"]:
            continue
        else:
            total_tweets = count
            print('Total number of collected tweets: ', total_tweets)
            break

*   Driver code the pull request.

In [5]:
if __name__ == "__main__":
    main()

Total number of collected tweets:  10236


*   Identify the fields we want to extract for the two datasets that will later be combined to form our final dataset.
*   The dictionary within the list is to help us handle the nested elements of the json dictionary.

In [6]:
tweet_data_fields = ['author_id', 'created_at', 'id', {'public_metrics': ['retweet_count', 'reply_count', 'like_count', 'quote_count']}, 'text'] 
user_data_fields = ['id','created_at', 'username', 'pinned_tweet_id', 'name', 'description', 'location', {'public_metrics': ['followers_count', 'following_count', 'tweet_count', 'listed_count']}]

### Identify Fields and Creating Panda Dataframe

*   Loops to go through the JSON dictionary responses.

In [7]:
def json_to_tweet_dict(dic, cols, json_dic):
    for tweet in json_dic:
        if tweet['lang'] != 'en':
            continue
        for col in cols:
            if type(col) is dict:
                for item in col:
                    for key in tweet[item]:
                        dic[key].append(tweet[item][key])
            elif col == 'created_at':
                dic['tweet_created_at'].append(tweet[col])
            else:
                dic[col].append(tweet[col]) 
                
def json_to_user_dict(dic, cols, json_dic):
    for user in json_dic:
        for col in cols:
            if type(col) is dict:
                for item in col:
                    for key in user[item]:
                        dic[key].append(user[item][key])
            elif col == 'location':
                dic[col].append(user.get(col, 'NA'))
            elif col == 'pinned_tweet_id':
                dic[col].append(user.get(col, 'NA'))
            elif col == 'id':
                dic['author_id'].append(user[col])
            elif col == 'created_at':
                dic['account_created_at'].append(user[col])
            else:
                dic[col].append(user[col])

### Parse text file

In [8]:
with open('tweetdata/Tweets.txt', 'r', encoding = 'utf8') as f:
    lines = f.readlines()
    tweet_dict = defaultdict(list)
    user_dict =defaultdict(list)
    tweet_count = 0
    for line in lines:
        response = json.dumps(json.loads(line), indent = 4)
        json_dic = json.loads(response)
        for datatype in json_dic:
            if datatype == 'data':
                #use datacols
                json_to_tweet_dict(tweet_dict, tweet_data_fields, json_dic[datatype])
                tweet_count += len(json_dic[datatype])
            elif datatype == 'includes':
                json_to_user_dict(user_dict, user_data_fields, json_dic[datatype]['users'])
            else:
                continue



### Create Panda Dataframes and inital cleaning

In [None]:
tweets = pd.DataFrame.from_dict(tweet_dict)
users = pd.DataFrame.from_dict(user_dict)

*   Merge and clean duplicates.

In [None]:
jnj_df = tweets.merge(users, how = 'inner', on = 'author_id').drop_duplicates(subset = 'text', ignore_index = True)

*   Drop columns rename remaining.

In [None]:
jnj_df = jnj_df.drop(columns = ['pinned_tweet_id', 'listed_count'])
jnj_df = jnj_df.rename(columns = {'author_id': 'author id', 'tweet_created_at': 'tweet created at', 'id' : 'tweet id', 'retweet_count' : 'retweet count', 'reply_count': 'reply count', 'like_count': 'like count', 'quote_count' : 'quote count', 'text' : 'tweet text', 'account_created_at': 'account created at', 'followers_count': 'followers count', 'following_count':'following count', 'tweet_count': 'tweet count'} )

*   Order cols in a more logical way.

In [None]:
ordered_cols = ['author id', 'tweet id', 'tweet created at', 'tweet text', 'retweet count', 'reply count', 'like count', 'quote count', 'name', 'username', 'location', 'description', 'account created at', 'followers count', 'following count', 'tweet count']
jnj_df = jnj_df[ordered_cols]

*   Pretty date time.

In [None]:
jnj_df['tweet created at'] = pd.to_datetime(jnj_df['tweet created at']).dt.tz_convert(None)
jnj_df['account created at'] = pd.to_datetime(jnj_df['account created at']).dt.tz_convert(None)

*   Store the final dataframe as a CSV file placed inside the 'tweetdata' folder.

In [None]:
jnj_df.to_csv (r'tweetdata/JnJ_dataframe.csv', index = False, header=True)

## HOW THE DATASET CAN BE USED

### Search tweets for specific words or hashtags

In [None]:
#search tweets for specific words or hashtags
jnj_df[jnj_df['tweet text'].str.contains("#grateful")]

Unnamed: 0,author id,tweet id,tweet created at,tweet text,retweet count,reply count,like count,quote count,name,username,location,description,account created at,followers count,following count,tweet count
45,27323975,1379635409571958787,2021-04-07 03:21:28,One &amp; Done. #grateful #vaccinated #JohnsonandJohnson #COVID19 #vaccines #stopCOVID https://t.co/04aJwVW8sD,0,0,5,0,Angela Jacobs WFTV,AngelaJacobsTV,"Orlando, FL",2x Emmy Winner @WFTV Reporterüé§ Sports Anchor alum ‚öæÔ∏è üèà Cancer Survivor üíó RTs üö´ endorsements / Story idea?angela.jacobs@wftv.com @Insta @AngelaJacobsWFTV,2009-03-28 22:33:26,2006,802,9460


### Filter data with users with certain amount of following


In [None]:
#only look at data from users with a large following (set to 1000+ followers)
large_following = jnj_df['followers count'] >= 1000
jnj_df[large_following]

Unnamed: 0,author id,tweet id,tweet created at,tweet text,retweet count,reply count,like count,quote count,name,username,location,description,account created at,followers count,following count,tweet count
3,1353222852,1379880773491232770,2021-04-07 19:36:28,#Hialeah secures partnership to distribute #Moderna #CovidVaccine as #FEMA sites move to only offer #JohnsonandJohnson @ADelgadoT51 @Telemundo51 #vaccinenews https://t.co/2Ol9hBzXqz https://t.co/dlgHOlgjAq,0,0,0,0,JRodriguez,JRodzMIA,MIAFL,Senior Assignment Editor @Telemundo51 ... RTs are not endorsements,2013-04-15 02:11:21,1268,738,12418
5,484882506,1379877798039457795,2021-04-07 19:24:38,"Today, the Corpus Christi - Nueces County Public Health District will administer 1,000 #JohnsonAndJohnson COVID-19 vaccines beginning at 9 AM at the Richard M. Borchard Regional Fairgrounds. 1st &amp; 2nd doses of #Moderna vaccines are also available 9 AM-5 PM. Drop-ins are welcome. https://t.co/bvH1Xb2BlK",0,0,0,0,CC Public Works,PublicWorksCC,"Corpus Christi, TX",The official page for the City of Corpus Christi's Engineering Services and Street Operations departments.,2012-02-06 16:02:31,1651,56,2401
10,32135940,1379872327492640771,2021-04-07 19:02:54,2:00 PM Update: There are still Johnson &amp; Johnson COVID-19 vaccines available for those 18 and up. Drop-ins are welcome. No appointment needed. #JohnsonandJohnson #COVID19Vaccine #OneAndDone https://t.co/iDvtcBhjdh,5,0,9,0,City of Corpus Christi,cityofcc,"Corpus Christi, TX",Official Twitter page for the City of Corpus Christi.,2009-04-16 20:15:21,20220,268,16770
11,32135940,1379843720254734338,2021-04-07 17:09:13,"12:00 PM Update: There are still #JohnsonAndJohnson COVID-19 vaccines available. Drop-ins are welcome. No appointment needed. #OneAndDone\n\nWe also still have the #Moderna COVID-19 vaccine available. If it's been 28 days since your 1st dose, head on over! https://t.co/iDvtcBhjdh",4,0,4,0,City of Corpus Christi,cityofcc,"Corpus Christi, TX",Official Twitter page for the City of Corpus Christi.,2009-04-16 20:15:21,20220,268,16770
12,32135940,1379790923782033411,2021-04-07 13:39:26,"Today, the Corpus Christi - Nueces County Public Health District will administer 1,000 #JohnsonAndJohnson COVID-19 vaccines beginning at 9 AM at the Richard M. Borchard Regional Fairgrounds. 1st &amp; 2nd doses of #Moderna vaccines are also available 9 AM-5 PM. Drop-ins are welcome. https://t.co/5xM2kLImbW",9,0,5,2,City of Corpus Christi,cityofcc,"Corpus Christi, TX",Official Twitter page for the City of Corpus Christi.,2009-04-16 20:15:21,20220,268,16770
15,317624806,1379859669691285518,2021-04-07 18:12:36,"I didn't throw away my shot. One and done. For those like me in Group 5 who became eligible today, the @DukeHealth system has lots of #JohnsonandJohnson #vaccine appointments this week and beyond.\n\n#GetVaccinated #COVID19 https://t.co/zh9H6abGYc",0,0,1,0,Dustin Ingalls,punstiningalls,"Raleigh, NC",UNC '07. Native upstate NYer living in Raleigh over 26 years. Puns. Politics. Yankees ‚öæ. KISS. Musicals. Bloody marys. @nclcv comms.,2011-06-15 06:37:05,1017,951,14787
16,49873766,1379857664461639685,2021-04-07 18:04:38,One and done. üíâ \n\n#vaccine #covid #covid19 #jnj #jj #janssen #Î∞±Ïã† #CVS #USA https://t.co/gqpf8LKOIB,0,0,0,0,Richard Ward,zeampzpvy,"ÌîåÎ°úÎ¶¨Îã§ (Florida, USA)","Performing magic in the cloud as executive editor @ ReadLeft. Author, scientist, hacker. I'm big in Korea ZEÏä§Ïï∞ÌîÑ.",2009-06-23 03:39:59,1526,958,11811
18,20457806,1379856610068205577,2021-04-07 18:00:26,"""I'm a Johns Hopkins-trained epidemiologist, and I couldn't even navigate the system.""‚ÄîDr. Debra Furr-Holden on her personal experience with racial discrepancies during #COVID19. See how #JNJ is mobilizing to address inequities like this: https://t.co/XoqIcj5qzR #WorldHealthDay https://t.co/3AYXF0UWLT",7,9,29,2,Johnson & Johnson,JNJNews,,"At Johnson & Johnson, we blend heart, science and ingenuity to profoundly change the trajectory of health for humanity. Follow us to learn more and connect.",2009-02-09 19:12:13,232043,2467,14738
21,119084466,1379837273953472517,2021-04-07 16:43:36,"How do the Pfizer, Moderna and J&amp;J vaccines compare on efficacy and side effects? https://t.co/aLKjvaJxTr\n\n#vaccine #covidvaccine #covid19 #pfizer #moderna #jandj #johnsonandjohnson #sideeffects #efficacy #vaccinesideeffects",0,0,1,0,ConsumerLab.com,ConsumerLab,New York,http://t.co/1baEDHXNkR reports on the quality of health and nutrition products.,2010-03-02 16:36:29,2880,79,2370
23,336229829,1379833520370167809,2021-04-07 16:28:41,"ALERT! Glendale Baptist Church in #SouthDade SW 117th Ave is scheduling appointments today to administer the #JohnsonandJohnson #COVID19 vaccine tomorrow, April 8, 2021.¬† You will need to call the church today, to schedule an appointment.¬† Call 305-233-6435.",4,0,7,1,Marlon A. Hill,MarlonAHill,"Miami, Florida",Business/Govt Lawyer @wsh_law | @FSU Alum | KAPsi‚ô¶Ô∏è| @miamifoundation Fellow |üôãüèΩ‚Äç‚ôÇÔ∏èAdvocate | ü§îStrategist | #WSHCBLaw üáØüá≤üá∫üá∏,2011-07-15 23:26:10,7424,5658,36952


In [None]:
#only look at data from users with a large following (1,000+ followers) and whose tweet received high engagement (10+ likes)
jnj_df_large_high = jnj_df[(jnj_df['like count'] >= 10) & (jnj_df['followers count'] >= 1000)] 
jnj_df_large_high

Unnamed: 0,author id,tweet id,tweet created at,tweet text,retweet count,reply count,like count,quote count,name,username,location,description,account created at,followers count,following count,tweet count
18,20457806,1379856610068205577,2021-04-07 18:00:26,"""I'm a Johns Hopkins-trained epidemiologist, and I couldn't even navigate the system.""‚ÄîDr. Debra Furr-Holden on her personal experience with racial discrepancies during #COVID19. See how #JNJ is mobilizing to address inequities like this: https://t.co/XoqIcj5qzR #WorldHealthDay https://t.co/3AYXF0UWLT",7,9,29,2,Johnson & Johnson,JNJNews,,"At Johnson & Johnson, we blend heart, science and ingenuity to profoundly change the trajectory of health for humanity. Follow us to learn more and connect.",2009-02-09 19:12:13,232043,2467,14738
44,31419286,1379637474515771392,2021-04-07 03:29:41,I‚Äôm pfeeling pfantastic! #yayvaccines #COVID #pfizer #moderna #JohnsonandJohnson #AstraZeneca,0,0,12,0,"Kelly Rawlings, MPH",KellyRawlings,"Vancouver, BC / Iowa",1st-person worder ‚Ä¢ public health ‚Ä¢ digital therapeutics ‚Ä¢ health comms ‚Ä¢ diabetes ‚Ä¢ #Bl√ºntLancet ‚Ä¢ gardener ‚Ä¢ #NarrowAcres ‚Ä¢ she/her ‚Ä¢ posts=my own,2009-04-15 14:28:00,15281,12489,40465
48,1296251323,1379616784081244163,2021-04-07 02:07:28,"Alamedans 65+/disability/medical condition needing a #vaccine we will have a vaccination clinic Saturday, April 17 at Mastick Senior Center. It‚Äôs by appointment only - please call and reserve your spot by this Saturday by calling 510-747-7512 #alamtg #COVID19 #JohnsonandJohnson",8,1,18,0,Malia Vella,Malia_Vella,"Alameda, CA","Candidate for CA Assembly, mom of 2üë∂üèª, @CityofAlameda Vice Mayor, @alameda_homes Director, @Wellesley, @Teamsters ‚úäüèΩlawyer, Educator, & Pragmatic Optimist.",2013-03-24 18:44:59,1185,475,2728


### Find where users are located


In [None]:
#find where users are located
jnj_df['location'].unique()

array(['Kansas City', 'Los Alamos, NM', 'Inside your PC monitor', 'MIAFL',
       'NA', 'Corpus Christi, TX', 'Charlotte, NC', 'Flint, Michigan',
       'Nueces County, TX', 'Santa Maria, CA', 'Raleigh, NC',
       'ÌîåÎ°úÎ¶¨Îã§ (Florida, USA)', 'S√£o Jos√© dos Campos - SP',
       'MA/NH/VT üá∫üá∏also üá¨üáßüá®üá≠', 'Hong Kong', 'New York', 'Tiffin, OH',
       'Miami, Florida', 'Fort Worth, TX', 'Ohio, USA',
       'Hoffman Estates, IL', 'Poughkeepsie, NY', 'Ravenna , Italy',
       'Colorado, USA', 'TN, KY, MS, LA and IN üìç', 'Nigeria',
       'Winter Haven, FL', ' New Delhi !India', 'India', 'South Africa',
       'Florida, USA', 'Vancouver, BC / Iowa', 'Orlando, FL', 'Singapore',
       'stony brook, ny', 'Alameda, CA', 'Long Beach, CA',
       'Michigan, USA', 'Riverdale Park, MD ', 'Lewiston, ME'],
      dtype=object)

## FUTURE OPPORTUNITIES TO EXPLORE THE DATASET

-  Sentiment analysis
  -  Sentiment analysis utilizes text analysis to systematically identify affective states of languages. The unique context of our dataset (i.e., including both pre- and post- of the Johnson and Johnson vaccine pause in the US) would allow the user to explore the differences in attitudes toward the vaccine between those two distinct periods. 

- Social Network Analysis
  - Social Network Analysis is a technique to study social structures in terms of "actors" in the network and the "links" between the actors. In the context of our dataset, an actor could be considered as the users who posted the tweets, and the link between the actors could be conceptualized in a number of ways. For example, a link could be operationalized as a retweet. If actor B retweet a tweet from actor A, then there would be a link between the two. Users who are interested in how vaccine information is spread before/after the J&J vaccine pause may consider pursuing social network analysis.    




## CHALLENGES: EXPERIENCED AND UPCOMING

Originally, we set out to study tweets that spread misinformation about Covid-19 vaccines and the potential societal implications this can have. We wanted to identify characteristics of tweets that were reaching large audiences and causing hesitation and uncertainty surrounding the vaccine. We also wanted to identify any factors that could play into vaccine hesitancy for example maybe a specific age group is more likely to spread misinformation than others. Or perhaps a broadcasted news event sparked an influx of anti-vaccine related posts.

However, due to the complexity of our target data, we decided to shift our focus. It is extremely difficult to parse text with all the possible language nuances that could interfere such as typos and slang. The most difficult hiderence to detect is probably sarcasm, which is a very common dialect on the Twitter platform. Many people tweeting about vaccine hesitancy are doing so as a joke for their audiences. These obstructions could greatly impact the validity of our dataset. 

As a result, we narrowed our focus to only look at data focused on the Johnson & Johnson vaccine and, specifically, only during the two weeks leading up to the temporary pause of distribution of this version of the vaccine. Secondly, our data aquisition was based on buzzwords and hashtags users feature in their tweets about Covid-19 and the Johnson & Johnson vaccine.

Other challenges:

*   Location data is self-identfied and optional for Twitter users. This makes it hard to accurately identify where users are located and limit the dataset to our target base of US residents to understand unique cultural / social implications. 
*   Identifying buzzwords and hashtags to track conversations about the Johnson & Johnson vaccine was a challenge. The list we came up with is not exhaustive and could be honed in the future.
*   Originally, we had some difficulties pulling a large number of tweets on but, through trial and error, finetuned the code to retrieve around 10,000 tweets.






