## Filter DataFrames for Claims by URL
This notebook consists of several methods and helper methods that will filter dataframes of any size. Filtering process includes sorting out non-english tweets, tweets that don't reference a URL, and tweets that aren't associated with a place. There are also methods to identify key topics from a tweet based off of defined keywords related to cures and prevention methods for COVID-19. Finally, once a dataframe is filtered, the top URLs from a dataframe can be identified along with the specific tweet that referenced that URL. 

This script is meant to be run from start to finish. If you run the main function and realize you need to change a functionality in one of the methods, you will have to rerun the processing data files function.

In [63]:
import pandas as pd
from pathlib import Path
from langdetect import detect, lang_detect_exception, DetectorFactory
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import defaultdict
from nltk.corpus import wordnet as wn

### Processing data files into DataFrames
If you have several JSON files in a folder, run lines 5 and 6 below and comment out line 2 and 3. You should change the path to match where you have your data files located. If you have data files in another format other than JSON, change line 6 to match your file type.

In [76]:
#data path to folder with JSON files
data_dir = Path('../..') / 'data_samples/json_files/may_sample'
data_files = data_dir.glob('*.json')

### Methods to Filter DataFrames
detectLang method was inspired by Pysmap library (source: https://github.com/SMAPPNYU/pysmap#detect_tweet_language)

lemmatization method was used from https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34

In [65]:
#Determine language by langDetect 
def detectLang(tweet):
    DetectorFactory.seed = 0
    detected_lang = None
    try: 
        detected_lang = detect(tweet['full_text'])             
    except:
        pass

    return detected_lang

In [66]:
#Filter out if a tweet is associated with a location
def hasPlace(tweet):
    if tweet['place'] == None:
        return False
    else: 
        return True

In [67]:
#Filter out if a tweet is associated with a url
def hasURL(tweet):
    if len(tweet['entities']['urls']) == 0:
        return False
    else: 
        return True

In [68]:
#Helper method to modify columns; scrape fields from a tweet to generate a quick link to be clicked to view the tweet object on twitter
def tweetOnTwitter(df):
    for i in range(len(df)):  
        begin = 'https://twitter.com/'
        username = df.loc[i]['user']['screen_name']
        breakin = '/'
        status = 'status' 
        tweetid = str(df.loc[i]['id'])
        urlstr = begin+username+breakin+status+breakin+tweetid
        df.at[i, 'tweet_on_twitter'] = urlstr
    
    return df

In [69]:
#Add two columns in the first half of the dataframe to easily view the link referenced in the tweet and the 
#tweet direct link to view on twitter
def modifyColumns(df):
    #Create a quick an easy URL Link column to view the link that is being referenced.
    try:
        df.insert(3, 'url_link', '')
    except:
        pass
    
    for i in range(len(df)):
        if len(df.loc[i]['entities']['urls']) > 0:
            urlstr = df.loc[i]['entities']['urls'][0]['expanded_url']
            df.at[i, 'URL_link'] = urlstr
    #create a quick and easy column to view the tweet on twitter
    try:
        df.insert(4, 'tweet_on_twitter','')
    except:
        pass
    return tweetOnTwitter(df)

In [70]:
#Filters the dataframe by getting rid of tweets that don't reference a URL, don't have a place, and is only in english
def filterDF(df):
    dropList = []
    for i in range(len(df)):
        #fix the language
        df.at[i, 'lang'] = detectLang(df.loc[i])

        #drop records that don't have a url or a place
        if not hasPlace(df.loc[i]):
            dropList.append(i)
        if not hasURL(df.loc[i]):
            if not i in dropList:
                dropList.append(i)

    #at the end remove all the records from the droplist
    df = df.drop(dropList, axis=0)
    df = df[df['lang'] == 'en']
    df = df.reset_index()
    return df

In [71]:
def lemmatization(df):
    #Twitter specific stopwords
    tweetWords = ['rt', 'co', 'https']
    #Change all the text to lower case
    df['full_text'] = [entry.lower() for entry in df['full_text']]
    #Tokenize each entry
    tokens = [word_tokenize(entry) for entry in df['full_text']]
    #Remove Stop words, Non-Numeric and perfom Word Stemming/Lemmenting.
    #WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun
    tag_map = defaultdict(lambda : wn.NOUN)
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB
    tag_map['R'] = wn.ADV
    for index,entry in enumerate(tokens):
        #Declaring Empty List to store the words that follow the rules for this step
        Final_words = []
        #Initializing WordNetLemmatizer()
        word_Lemmatized = WordNetLemmatizer()
        #pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
        for word, tag in pos_tag(entry):
            #Below condition is to check for Stop words and consider only alphabets
            if word not in stopwords.words('english') and word not in tweetWords and word.isalpha():
                word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
                Final_words.append(word_Final)
        #The final processed set of words for each iteration will be stored in 'text_final'
        df.loc[index,'text_final'] = str(Final_words)
        
    return df

In [72]:
def topics(df):
    #keywords relating to COVID-19 cures, preventions, spread, and origin
    keywords = ['cure', 'oil', 'remedy', 'medicine', 'tradition','traditional', 'natural', 'tea', 'whiskey', 'honey', 'mask', 'n95',
               'garlic', 'oregano','sesame','prevent','help', 'pee', 'poop', 'dung', 'cow', 'rare', 'drink', 'drugs',
               'urine', 'anti-HIV', 'HIV', 'drug', 'wear', 'try', 'eat', 'use', 'health', 'origin', 'bioweapon', 'canada',
               'spread', 'snake', 'bat', 'market', 'wuhan']
    df['topics'] = None #Created a topics column in the dataframe to identify tweets matching with keywords/topics
    for i in range(len(df)):
        topicWords = []
        text_final = df.loc[i]['text_final']
        for word in keywords:
            if word in text_final:
                topicWords.append(word)
        df.at[i, 'topics'] = topicWords
    
    return df

In [73]:
def topURLs(df):
    #create lang dictionary to get a count of the number of times a url is tweeted
    tweet_URLs = {}
    url_to_tweet_map = {}

    for i in range(len(df)):
        if hasURL(df.loc[i]):
            expURL = df.loc[i]['entities']['urls'][0]['expanded_url']
            if expURL.startswith('https://twitter.com/'):
                continue
            else: 
                if expURL in tweet_URLs:
                    tweet_URLs[expURL] += 1
                else:
                    tweet_URLs[expURL] = 1

                if expURL in url_to_tweet_map:
                    url_to_tweet_map[expURL].append(i)
                else:
                    url_to_tweet_map[expURL] = []
                    url_to_tweet_map[expURL].append(i)
        else:
            continue;
            
    return tweet_URLs, url_to_tweet_map

### Main function
If you have multiple data files in one folder uncomment the code in the cell directly below and comment the code in the following cell make sure to change numb_files to the number of files you have. This is used for a print statement to let you know the progress of the filtering. Also change the name of what the dataframe is going to be saved as and the file type.

In [74]:
#For Multiple Files in a directory
result_dfs = []
numb_files = 3
file_count = 0.0
count = 0
for file in data_files:
    df = pd.read_json(str(file), lines=True)
    df = modifyColumns(df)
    result_dfs.append(df)
    file_count += 1
    completed = file_count / numb_files
    print("Completed: {:.2%}".format(completed))
    
final_df = pd.concat(result_dfs)
final_df = final_df.reset_index()
data_dir = Path('../..') / 'data_samples/'
final_df.to_csv(str(data_dir) + '/file_filtered.csv')
final_df.head(1)

Completed: 33.33%
Completed: 66.67%
Completed: 100.00%


Unnamed: 0,index,created_at,id,id_str,url_link,tweet_on_twitter,full_text,truncated,display_text_range,entities,...,favorited,retweeted,possibly_sensitive,lang,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status,retweeted_status,URL_link
0,0,2020-05-01 02:04:02+00:00,1256041661152612352,1256041661152612352,,https://twitter.com/PlSmith57/status/125604166...,Received this email. Apparently there is a “Na...,False,"[0, 216]","{'hashtags': [], 'symbols': [], 'user_mentions...",...,False,False,0.0,en,,,,,,


In [77]:
tweet_urls, urls_to_tweet_map = topURLs(final_df)

In [78]:
tweet_urls

{'https://urmedium.com/c/presstv/14102?utm_source=dlvr.it&utm_medium=twitter': 1,
 'http://www.lanacion.com.ar/2360167': 1,
 'http://chronlaw.com/texas-still-wont-say-which-nursing-homes-have-covid-19-cases-families-are-demanding-answers/': 1,
 'https://on.ft.com/2VQeabd': 1,
 'https://www.sports-tokyo-info.metro.tokyo.lg.jp/stayhome_enjoysports.html': 1,
 'http://dlvr.it/RVnkRc': 1,
 'https://www.entregrillosychapulines.com/wp-content/uploads/2020/04/osorio-chong.jpg': 1,
 'http://www.safetyhealthnews.com/as-covid-19-shutters-practices-virtual-doc-patient-activity-soars/': 1,
 'https://mile.io/3d4TVMF': 1,
 'https://trib.al/rHC1Ous': 1,
 'https://www.elindependiente.com/politica/2020/04/30/plasticos-en-los-juzgados-para-protegerse-del-coronavirus-no-hay-presupuesto/': 1,
 'https://www.suara.com/lifestyle/2020/05/01/085855/pria-india-pamit-pergi-belanja-saat-lockdown-pulang-malah-bawa-pengantin?utm_source=twitter.dlvrit&utm_medium=twitter&utm_campaign=suaradotcom': 1,
 'http://dld.bz/j

In [79]:
urls_to_tweet_map

{'https://urmedium.com/c/presstv/14102?utm_source=dlvr.it&utm_medium=twitter': [6],
 'http://www.lanacion.com.ar/2360167': [12],
 'http://chronlaw.com/texas-still-wont-say-which-nursing-homes-have-covid-19-cases-families-are-demanding-answers/': [22],
 'https://on.ft.com/2VQeabd': [30],
 'https://www.sports-tokyo-info.metro.tokyo.lg.jp/stayhome_enjoysports.html': [41],
 'http://dlvr.it/RVnkRc': [42],
 'https://www.entregrillosychapulines.com/wp-content/uploads/2020/04/osorio-chong.jpg': [43],
 'http://www.safetyhealthnews.com/as-covid-19-shutters-practices-virtual-doc-patient-activity-soars/': [45],
 'https://mile.io/3d4TVMF': [49],
 'https://trib.al/rHC1Ous': [54],
 'https://www.elindependiente.com/politica/2020/04/30/plasticos-en-los-juzgados-para-protegerse-del-coronavirus-no-hay-presupuesto/': [55],
 'https://www.suara.com/lifestyle/2020/05/01/085855/pria-india-pamit-pergi-belanja-saat-lockdown-pulang-malah-bawa-pengantin?utm_source=twitter.dlvrit&utm_medium=twitter&utm_campaign=su