<h4>The purpose of this notebook is to perform all necessary cleaning, (pre-)processing, and enrichment of data collected from external OSIF sources, such as TweetBeaver or the YouTube Data API.</h4>

<h5><strong>Tweet Processing: @PLNewsToday timeline</strong></h5>

The file `PLnewstoday timeline.csv` has the columns `Tweet author`, `Tweet ID`, `Date posted`, `Tweet text`, `URL`, `Retweets`, `Favorited`, and `Source`. Unlike other tweet files such as `thedcpatriot-lancaster-twitter1.json` from Twint, this timeline file (which I downloaded from TweetBeaver) has no columns for hashtags, user mentions, hyperlinks, geo-tagged location, or several other useful tweet properties. Let's extract those properties from the tweets in the timeline ourselves and save the output to a new file, `PLnewstoday timeline processed.csv`.
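
As a rough illustration of what this extraction involves, here is a minimal sketch using only the standard library. The three patterns are simplified assumptions of mine, not the rules that `twitter-text-python` applies, and the `t.co` link in the sample text is made up for the example.

```python
import re

# Simplified, hypothetical patterns; real tweet parsing has many more edge cases.
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})")
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
URL_RE = re.compile(r"https?://\S+")


def extract_entities(text):
    """Return the mentions, hashtags, and URLs found in a tweet's text."""
    return {
        "mentions": MENTION_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
        "urls": URL_RE.findall(text),
    }


# The link below is an invented placeholder, not a real tweet URL.
sample = "RT @PLnewstoday: Inside Azovstal #RussiaUkraineWar https://t.co/abc123"
print(extract_entities(sample))
```

A dedicated parser is still the better choice for the real data, since it handles punctuation, Unicode hashtags, and URL edge cases that these patterns ignore.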

Beyond the standard tweet properties of interest (mentions, URLs, and hashtags), some aspects of these tweets are of special value to us because of their source: a journalist reporting from a war zone, several of whose stories outside experts have deemed false or deceptive. I will therefore run a few APIs over these tweets for additional feature extraction: the GATE Journalist Safety Analyzer, the GATE Rumour veracity classifier, and the GATE TwitIE Named Entity Recognizer for Tweets.

In [1]:
!pip install twitter-text-python





In [2]:
import json
import pandas as pd
from ttp import ttp


In [3]:
df = pd.read_csv("data/PLnewstoday timeline.csv")
print(df.head(3))
    


  Tweet author                Tweet ID                     Date posted  \
0  PLnewstoday  ID 1522888466773721088  Sat May 07 10:37:59 +0000 2022   
1  PLnewstoday  ID 1522851834431516672  Sat May 07 08:12:25 +0000 2022   
2  PLnewstoday  ID 1522477719707172864  Fri May 06 07:25:49 +0000 2022   

                                          Tweet text  \
0   RT @PLnewstoday: ⚡️📣Inside Azovstal Territory...   
1   ⚡️📣Inside Azovstal Territory: First Western J...   
2   ⚡️📣Ukraine Snipers Killed Civilians In Mariup...   

                                                 URL  Retweets  Favorited  \
0  https://twitter.com/PLnewstoday/statuses/15228...       209          0   
1  https://twitter.com/PLnewstoday/statuses/15228...       209        409   
2  https://twitter.com/PLnewstoday/statuses/15224...       296        558   

            Source  
0  Twitter Web App  
1  Twitter Web App  
2  Twitter Web App  
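
The `Date posted` values in the preview above are plain strings. If they are needed as real timestamps later (for sorting or filtering), a sketch like the following could convert them. The format string is my reading of the layout shown in the output, and `parse_tweet_date` is a hypothetical helper name rather than anything from TweetBeaver.

```python
from datetime import datetime, timezone

# Layout of the `Date posted` strings, e.g. "Sat May 07 10:37:59 +0000 2022".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweet_date(value):
    """Parse a timestamp string in the layout above into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


posted = parse_tweet_date("Sat May 07 10:37:59 +0000 2022")
print(posted.isoformat())
```

On the DataFrame itself, `pd.to_datetime(df['Date posted'], format=TWITTER_DATE_FORMAT)` should give the same result for the whole column, assuming every row follows this layout.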


In [4]:
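# Strip the literal "ID " prefix that TweetBeaver adds, leaving the numeric ID as a string.
df['Tweet ID'] = df['Tweet ID'].str.slice_replace(stop=3)
# Placeholder columns; each row gets its own empty list, filled in by the next cell.
df['Mentions'] = [[] for _ in range(df.shape[0])]
df['Hashtags'] = [[] for _ in range(df.shape[0])]
df['URLs'] = [[] for _ in range(df.shape[0])]  # careful with this name: easy to mix up with the URL column from TweetBeaver
# Rename TweetBeaver's 'URL' (the link to the tweet itself) to 'Link' to avoid that mix-up.
df.rename(columns={"Tweet text": "Text", "URL": "Link"}, inplace=True)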


In [5]:
parser = ttp.Parser()
# Parse each tweet once, then pull the entity lists out of the parse result.
df['parsed_results'] = df['Text'].apply(parser.parse)
df['Mentions'] = df['parsed_results'].apply(lambda r: r.users)
df['Hashtags'] = df['parsed_results'].apply(lambda r: r.tags)
df['URLs'] = df['parsed_results'].apply(lambda r: r.urls)
# Drop the intermediate parse results; they are not needed in the output file.
del df['parsed_results']

In [6]:
print(df.head(3))

  Tweet author             Tweet ID                     Date posted  \
0  PLnewstoday  1522888466773721088  Sat May 07 10:37:59 +0000 2022   
1  PLnewstoday  1522851834431516672  Sat May 07 08:12:25 +0000 2022   
2  PLnewstoday  1522477719707172864  Fri May 06 07:25:49 +0000 2022   

                                                Text  \
0   RT @PLnewstoday: ⚡️📣Inside Azovstal Territory...   
1   ⚡️📣Inside Azovstal Territory: First Western J...   
2   ⚡️📣Ukraine Snipers Killed Civilians In Mariup...   

                                                Link  Retweets  Favorited  \
0  https://twitter.com/PLnewstoday/statuses/15228...       209          0   
1  https://twitter.com/PLnewstoday/statuses/15228...       209        409   
2  https://twitter.com/PLnewstoday/statuses/15224...       296        558   

            Source       Mentions            Hashtags  \
0  Twitter Web App  [PLnewstoday]                  []   
1  Twitter Web App             []  [RussiaUkraineWar]   
2  Twitter

<h4>Facebook Profile Processing</h4>

In [59]:
#!pip install spacy-entity-linker
#!pip3 install geopy
!pip3 install spacy-stanza

Collecting spacy-stanza
  Downloading spacy_stanza-1.0.2-py3-none-any.whl (9.7 kB)
Collecting stanza<1.5.0,>=1.2.0
  Downloading stanza-1.4.0-py3-none-any.whl (574 kB)
Collecting emoji
  Downloading emoji-1.7.0.tar.gz (175 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting torch>=1.3.0
  Downloading torch-1.11.0-cp39-cp39-win_amd64.whl (157.9 MB)
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py): started
  Building wheel for emoji (setup.py): finished with status 'done'
  Created wheel for emoji: filename=emoji-1.7.0-py3-none-any.whl size=171030 sha256=8255b00b432a9d27ee08fe23b1498b450f21d019a575b87e6625e5e807f72187
  Stored in directory: c:\users\dell\appdata\local\pip\cache\wheels\fa\7a\e9\22dd0515e1bad255e51663ee513a2fa839c95934c5fc301090
Successfully built emoji
Installing collected packages: torch, emoji, stanza, spacy-stanza
Successfully installed emoji-1.7.0 spacy-stanza-1.0.2 sta



In [60]:
#!python -m spacy_entity_linker "download_knowledge_base"
#!python -m spacy download en_core_web_lg
#!python -m spacy download "xx_ent_wiki_sm"
# !python -m spacy download es_core_news_md
# !python -m spacy download pt_core_news_md

import spacy
from spacy import displacy
import stanza
import spacy_stanza

stanza.download("ar")
stanza.download("vi")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 21.8MB/s]                    
2022-06-14 21:42:53 INFO: Downloading default packages for language: ar (Arabic)...
Downloading https://huggingface.co/stanfordnlp/stanza-ar/resolve/v1.4.0/models/default.zip: 100%|██████████| 397M/397M [02:49<00:00, 2.35MB/s] 
2022-06-14 21:45:49 INFO: Finished downloading models and saved to C:\Users\Dell\stanza_resources.
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 17.1MB/s]                    
2022-06-14 21:45:49 INFO: Downloading default packages for language: vi (Vietnamese)...
Downloading https://huggingface.co/stanfordnlp/stanza-vi/resolve/v1.4.0/models/default.zip: 100%|██████████| 385M/385M [03:04<00:00, 2.09MB/s] 
2022-06-14 21:49:00 INFO: Finished downloading models and saved to C:\Users\Dell\stanza_resources.


In [64]:


nlp_en = spacy.load("en_core_web_lg")
nlp_es = spacy.load('es_core_news_md')
nlp_pt = spacy.load("pt_core_news_md")
nlp_xx = spacy.load("xx_ent_wiki_sm")

nlp_ar = spacy_stanza.load_pipeline("ar", processors='tokenize, ner')
nlp_vi = spacy_stanza.load_pipeline("vi", processors='tokenize, ner')
for name in nlp_ar.pipe_names:
    print(name)

#nlp = stanza.Pipeline(lang='ar', processors='tokenize,ner')
nlp_en.add_pipe("ner", name="ner_es", source=nlp_es)
nlp_en.add_pipe("ner", name="ner_pt", source=nlp_pt)
nlp_en.add_pipe("ner", name="ner_xx", source=nlp_xx)


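# As far as I can tell, the spacy-stanza pipelines expose no separate "ner" component
# (the loop above prints no pipe names), so there is nothing to source from them, and the
# default name "ner" is already taken in nlp_en. Run nlp_ar / nlp_vi directly on Arabic or
# Vietnamese text instead.
# nlp_en.add_pipe("ner", source=nlp_ar)
# nlp_en.add_pipe("ner", source=nlp_vi)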

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 15.4MB/s]                    
2022-06-14 22:43:23 INFO: Loading these models for language: ar (Arabic):
| Processor | Package |
-----------------------
| tokenize  | padt    |
| mwt       | padt    |
| ner       | aqmar   |

2022-06-14 22:43:23 INFO: Use device: cpu
2022-06-14 22:43:23 INFO: Loading: tokenize
2022-06-14 22:43:23 INFO: Loading: mwt
2022-06-14 22:43:23 INFO: Loading: ner


In [10]:
with open("network.json", "r", encoding="utf-8") as f:
    content = json.load(f)
fb_df = pd.DataFrame(content.get("nodes"))

print(fb_df.head(3))
#fb_df["location"] = [None for _ in range(fb_df.shape[0])]

                       name                                             link  \
0              Wes W Parker         https://www.facebook.com/wes.parker.9277   
1       Kristen Fitzpatrick  https://www.facebook.com/kristen.fitzpatrick.12   
2  Jorene Monares Dela Cruz  https://www.facebook.com/jorene.monaresdelacruz   

                                         profile_pic             role  \
0  https://scontent.ffcm1-2.fna.fbcdn.net/v/t1.18...  admin/moderator   
1  https://scontent.ffcm1-1.fna.fbcdn.net/v/t39.3...  admin/moderator   
2  https://scontent.ffcm1-2.fna.fbcdn.net/v/t1.64...              NaN   

                       id         type         labels                joined  \
0  9arFVr8dkfuuDextavLKFH  UserAccount  [UserAccount]                   NaN   
1  mjHRdrVWvSJ2onFjyR65wy  UserAccount  [UserAccount]                   NaN   
2  K9vNtK9UcY6bmkNtgkG93u  UserAccount  [UserAccount]  Joined last Thursday   

             description members  
0                    NaN     NaN  

In [53]:
import regex  # Important note: I'm using this library instead of the standard re module because some of the description values that follow the
# "Works at" pattern have 2 or more whitespace characters between the 'at' and the next word. Dealing with this requires a variable-length
# positive lookbehind, which the standard re engine does not support. Hence the different regex library.
from numpy import nan

#works_at_pattern = [{"LOWER": "at"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
regex_pattern = r"\b(?<=\bat\s{1,2}?\X{0,1})\p{L}.*$"
#matcher.add("WorksAt", [works_at_pattern])

def apply_and_concat(dataframe, field, func, column_names):
    return pd.concat((
        dataframe,
        dataframe[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
    
def process_doc_deps(doc: spacy.tokens.Doc):
    # Missing descriptions (NaN/None) are passed through untouched.
    if not isinstance(doc, spacy.tokens.Doc):
        return doc
    # An empty Doc has no first token to inspect.
    if len(doc) == 0:
        return doc.text
    first = doc[0]
    # Original text with only the first token lower-cased and the spacing preserved.
    body = ''.join((token.text.lower() if token.i == 0 else token.text) + token.whitespace_ for token in doc)
    if first.pos_ == 'VERB':
        # "Works at X" -> "The subject works at X"; other verb forms get "does".
        prefix = 'The subject ' if first.tag_ == 'VBZ' else 'The subject does '
    elif first.pos_ in ('NOUN', 'ADJ'):
        # "Driver at X" -> "The subject is driver at X".
        prefix = 'The subject is '
    else:
        return doc.text
    return prefix + body

def get_entities(descr: str):
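    # Missing descriptions carry no entities; skip the pipeline rather than parsing the string "nan".
    if pd.isna(descr):
        return (nan, nan, nan)
    spacy_results = nlp_en(str(descr), disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])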
    location_results = []
    occupation_results = []
    misc_results = []
    # works_at_rule_results = regex.search(regex_pattern, spacy_results.text, flags=regex.UNICODE)
    # if (works_at_rule_results is not None):
    #     start, end = works_at_rule_results.span()
    #     span = spacy_results.char_span(start, end)
    #     # This is a Span object or None if match doesn't map to valid token sequence
    #     if span is not None:
    #         location_results.append(span.text)
    #     elif bool('\u200E' in spacy_results.text):
    #         location_results.append(works_at_rule_results.group())
    for ent in spacy_results.ents:
        if ent.label_ in ['LOC', 'ORG', 'GPE'] and ent.text not in location_results:
            location_results.append(ent.text)
        elif ent.label_ in ['FAC', 'PRODUCT'] and ent.text not in occupation_results:
            occupation_results.append(ent.text)
        elif ent.label_ in ['PERSON', 'PER', 'MISC']:
            misc_results.append(ent.text)
    # Join each list into a comma-separated string, or use NaN when nothing was found.
    results_tuple = (
        ', '.join(location_results) if location_results else nan,
        ', '.join(occupation_results) if occupation_results else nan,
        ', '.join(misc_results) if misc_results else nan,
    )
    return results_tuple
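
A note on the lookbehind issue mentioned in the comment at the top of this cell: the sketch below shows the standard `re` module rejecting a variable-width lookbehind, along with one possible workaround that uses a capture group instead. This is an alternative I am illustrating, not the `regex`-based pattern defined above, and it is simpler: it does not skip an invisible character such as U+200E or require the match to start on a letter. `text_after_at` is a hypothetical helper name.

```python
import re

# The standard-library engine rejects a look-behind whose width can vary.
try:
    re.compile(r"(?<=\bat\s{1,2})\S.*$")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Alternative: consume "at" plus any amount of whitespace, then capture what follows.
WORKS_AT_RE = re.compile(r"\bat\s+(\S.*)$")


def text_after_at(description):
    """Return the text following the first standalone "at", or None if there is none."""
    match = WORKS_AT_RE.search(description)
    return match.group(1) if match else None


print(lookbehind_supported)
print(text_after_at("Works at  Facebook App"))
print(text_after_at("Driver at Valley Golf Cainta, Rizal"))
```

The capture group avoids the extra dependency, at the cost of having to read `match.group(1)` rather than the whole match.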


In [46]:
from spacy import displacy

test_result = nlp_en(nan)
new_result = 'Subject does ' + ''.join([(token.text.lower() if token == test_result[0] else token.text) + token.whitespace_ for token in test_result])

ValueError: [E866] Expected a string or 'Doc' as input, but got: <class 'float'>.

In [54]:
# pd.notna covers both NaN and None without relying on object identity with numpy's nan.
descr_doc = fb_df['description'].apply(lambda s: nlp_en(s, disable=['ner', 'lemmatizer', 'senter']) if pd.notna(s) else s)
fb_df['description_normalized'] = descr_doc.apply(process_doc_deps)

In [51]:
print(fb_df[:5]['description'])
print(type(fb_df.loc[3, 'description']))
del fb_df['description_doc']

0                                    NaN
1                                    NaN
2                  Works at Facebook App
3                                   None
4    Driver at Valley Golf Cainta, Rizal
Name: description, dtype: object
<class 'NoneType'>


In [55]:
del fb_df['location']
del fb_df['occupation']
del fb_df['misc_ents']

test2 = nlp_en('Works at Electrical and electronics engineering')
spacy.explain('VBN')
for token in test2:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)
    token.se

Works work VERB VBZ ROOT Xxxxx True False


AttributeError: 'spacy.tokens.token.Token' object has no attribute 'se'

In [56]:
fb_df = apply_and_concat(fb_df, 'description_normalized', get_entities, ['location', 'occupation', 'misc_ents']) 

In [57]:
locations_list = fb_df['location'].value_counts()
occupations_list = fb_df['occupation'].value_counts()
misc_list = fb_df['misc_ents'].value_counts()
uncaptured_vals_list = fb_df[fb_df['location'].isna() & fb_df['description'].notna()]
print(uncaptured_vals_list)


                       name  \
8              Rambo Buteng   
9            Nguyen Le Ngan   
11             Ta Hoang Son   
14                Bocboc Jl   
21      Missy Deleon Sabado   
...                     ...   
11801         Alice Stevens   
11806           Mike Lucker   
11808         Eldred Halsey   
11817  Debbie Lemons Newton   
11819       Richard O'Neill   

                                                    link  \
8                  https://www.facebook.com/rambo.boteng   
9      https://www.facebook.com/profile.php?id=100077...   
11     https://www.facebook.com/profile.php?id=100077...   
14                    https://www.facebook.com/bocboc.jl   
21                    https://www.facebook.com/isayyysbd   
...                                                  ...   
11801          https://www.facebook.com/alice.stevens.33   
11806          https://www.facebook.com/mike.lucker.3344   
11808  https://www.facebook.com/profile.php?id=100018...   
11817                 https