<h4>The purpose of this notebook is to perform all necessary cleaning, (pre-)processing, and enrichment of data collected from external OSIF sources, such as TweetBeaver or the YouTube Data API.</h4>

<h5><strong>Tweet Processing: @PLNewsToday timeline</strong></h5>

The file `PLnewstoday timeline.csv` has the columns `Tweet author`, `Tweet ID`, `Date posted`, `Tweet text`, `URL`, `Retweets`, `Favorited`, and `Source`. Unlike other tweet files such as `thedcpatriot-lancaster-twitter1.json` from Twint, this timeline file (which I downloaded from TweetBeaver) has no columns for hashtags, user mentions, hyperlinks, geo-tagged location, or several other useful tweet properties. Let's extract those properties from the tweets in the timeline ourselves and save the output to a new file, `PLnewstoday timeline processed.csv`.
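
As a rough illustration of what this extraction involves, the sketch below pulls mentions, hashtags, and URLs out of a tweet using only the standard-library `re` module. The patterns, the `extract_entities` name, and the sample tweet (including its shortened link) are simplified assumptions for illustration, not the rules that `twitter-text-python` applies.

```python
import re

# Simplified, assumed patterns; real tweet-entity rules have more edge cases.
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})")
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
URL_RE = re.compile(r"https?://\S+")

def extract_entities(text):
    """Return mentions, hashtags, and URLs found in a tweet's text."""
    urls = URL_RE.findall(text)
    # Remove URLs first so that a '#fragment' inside a link is not counted as a hashtag.
    stripped = URL_RE.sub(" ", text)
    return {
        "mentions": MENTION_RE.findall(stripped),
        "hashtags": HASHTAG_RE.findall(stripped),
        "urls": urls,
    }

# Hypothetical sample text modelled on the timeline's retweets.
sample = "RT @PLnewstoday: Inside Azovstal #RussiaUkraineWar https://t.co/abc123"
print(extract_entities(sample))
```

Requiring that `@` and `#` are not preceded by a word character keeps e-mail addresses and mid-word symbols from being picked up as entities.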

Beyond standard tweet properties like mentions, URLs, and hashtags, some aspects of these tweets are of special value to us because of their source: a journalist reporting from a war zone, several of whose stories outside experts have deemed false or deceptive. I will therefore run a few APIs over these tweets for additional feature extraction: the GATE Journalist Safety Analyzer, the GATE Rumour Veracity Classifier, and the GATE TwitIE Named Entity Recognizer for Tweets.

In [10]:
%pip install twitter-text-python

Note: you may need to restart the kernel to use updated packages.




In [11]:
import json
import pandas as pd
from ttp import ttp


In [12]:
df = pd.read_csv("data/PLnewstoday timeline.csv")
print(df.head(3))
    


  Tweet author                Tweet ID                     Date posted  \
0  PLnewstoday  ID 1522888466773721088  Sat May 07 10:37:59 +0000 2022   
1  PLnewstoday  ID 1522851834431516672  Sat May 07 08:12:25 +0000 2022   
2  PLnewstoday  ID 1522477719707172864  Fri May 06 07:25:49 +0000 2022   

                                          Tweet text  \
0   RT @PLnewstoday: ⚡️📣Inside Azovstal Territory...   
1   ⚡️📣Inside Azovstal Territory: First Western J...   
2   ⚡️📣Ukraine Snipers Killed Civilians In Mariup...   

                                                 URL  Retweets  Favorited  \
0  https://twitter.com/PLnewstoday/statuses/15228...       209          0   
1  https://twitter.com/PLnewstoday/statuses/15228...       209        409   
2  https://twitter.com/PLnewstoday/statuses/15224...       296        558   

            Source  
0  Twitter Web App  
1  Twitter Web App  
2  Twitter Web App  


In [13]:
# Drop the leading "ID " (the first three characters) from each tweet ID
df['Tweet ID'] = df['Tweet ID'].str.slice_replace(stop=3)
# Placeholder columns; the parser in the next cell fills them
df['Mentions'] = [[] for _ in range(df.shape[0])]
df['Hashtags'] = [[] for _ in range(df.shape[0])]
df['URLs'] = [[] for _ in range(df.shape[0])]  # careful with this name: easy to mix up with TweetBeaver's URL column, renamed to Link below
df.rename(columns={"Tweet text": "Text", "URL": "Link"}, inplace=True)
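
The `Tweet ID` values arrive as strings such as `ID 1522888466773721088`, and `Date posted` uses a fixed timestamp layout. The sketch below shows both clean-up steps on single values with the standard library only. `strip_id_prefix` and `parse_tweet_date` are hypothetical helper names, and parsing the date is an optional step that this notebook does not perform.

```python
from datetime import datetime

def strip_id_prefix(raw_id):
    """Drop the leading 'ID ' that appears in front of each tweet ID in this export."""
    # Keep the ID as a string: it is an identifier, not a quantity to do arithmetic on.
    return raw_id[3:] if raw_id.startswith("ID ") else raw_id

def parse_tweet_date(raw_date):
    """Parse a timestamp such as 'Sat May 07 10:37:59 +0000 2022' into an aware datetime."""
    return datetime.strptime(raw_date, "%a %b %d %H:%M:%S %z %Y")

print(strip_id_prefix("ID 1522888466773721088"))
print(parse_tweet_date("Sat May 07 10:37:59 +0000 2022").isoformat())
```

Checking for the prefix before slicing makes the helper safe to run twice, which matters in a notebook where cells are often re-executed out of order.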


In [14]:
parser = ttp.Parser()
# Parse each tweet once, then copy the fields of interest into their own columns
df['parsed_results'] = df['Text'].apply(parser.parse)
df['Mentions'] = df['parsed_results'].apply(lambda x: x.users)
df['Hashtags'] = df['parsed_results'].apply(lambda y: y.tags)
df['URLs'] = df['parsed_results'].apply(lambda z: z.urls)
# The intermediate parse results are no longer needed
del df['parsed_results']

In [15]:
print(df.head(3))

  Tweet author             Tweet ID                     Date posted  \
0  PLnewstoday  1522888466773721088  Sat May 07 10:37:59 +0000 2022   
1  PLnewstoday  1522851834431516672  Sat May 07 08:12:25 +0000 2022   
2  PLnewstoday  1522477719707172864  Fri May 06 07:25:49 +0000 2022   

                                                Text  \
0   RT @PLnewstoday: ⚡️📣Inside Azovstal Territory...   
1   ⚡️📣Inside Azovstal Territory: First Western J...   
2   ⚡️📣Ukraine Snipers Killed Civilians In Mariup...   

                                                Link  Retweets  Favorited  \
0  https://twitter.com/PLnewstoday/statuses/15228...       209          0   
1  https://twitter.com/PLnewstoday/statuses/15228...       209        409   
2  https://twitter.com/PLnewstoday/statuses/15224...       296        558   

            Source       Mentions            Hashtags  \
0  Twitter Web App  [PLnewstoday]                  []   
1  Twitter Web App             []  [RussiaUkraineWar]   
2  Twitter

<h4>Facebook Profile Processing</h4>

In [16]:
#!pip install spacy-entity-linker
#!pip3 install geopy
#!pip3 install spacy-stanza


In [17]:
#!python -m spacy_entity_linker "download_knowledge_base"
#!python -m spacy download en_core_web_lg
#!python -m spacy download "xx_ent_wiki_sm"
# !python -m spacy download es_core_news_md
# !python -m spacy download pt_core_news_md
# !python -m spacy download ko_core_news_sm
# !python -m spacy download zh_core_web_sm

import spacy
from spacy import displacy
import stanza
import spacy_stanza
from spacy import tokens
from stanza.pipeline.multilingual import MultilingualPipeline

# stanza.download("ar")
# stanza.download("vi")
#stanza.download("multilingual")
# stanza.download("bg")
# stanza.download("tr")
# stanza.download("be")
# stanza.download("he")
# stanza.download("id")
# stanza.download("ko")
# stanza.download("es")
# stanza.download("pt")
# stanza.download("en")


In [18]:


nlp_en = spacy.load("en_core_web_lg")
nlp_es = spacy.load('es_core_news_md')
nlp_pt = spacy.load("pt_core_news_md")
nlp_xx = spacy.load("xx_ent_wiki_sm")
nlp_zh = spacy.load("zh_core_web_sm")
nlp_ko = spacy.load("ko_core_news_sm")

nlp_ar = spacy_stanza.load_pipeline("ar", processors='tokenize, ner')
nlp_vi = spacy_stanza.load_pipeline("vi", processors='tokenize, ner')
nlp_bg = spacy_stanza.load_pipeline("bg", processors='tokenize, ner')
nlp_tr = spacy_stanza.load_pipeline("tr", processors='tokenize, ner')
nlp_multi = MultilingualPipeline(lang_id_config={
    "langid_clean_text": False, 
    "langid_lang_subset": ["en","ar", "es", "pt", "be", "bg", "ko", "id", "he", "ru", "th", "tr", "vi" ],
    }, 
    lang_configs={
        "en": {"processors": 'tokenize, pos, ner', "download_method": None},
        "ar": {"processors": 'tokenize, ner', "download_method": None},
        "es": {"processors": 'tokenize, pos, ner', "download_method": None},
        "pt": {"processors": 'tokenize, pos, ner', "download_method": None},
        "be": {"processors": 'tokenize, ner', "download_method": None},
        "bg": {"processors": 'tokenize, ner', "download_method": None},
        "he": {"processors": 'tokenize, ner', "download_method": None},
        "id": {"processors": 'tokenize, ner', "download_method": None},
        "ko": {"processors": 'tokenize, ner', "download_method": None},
        "th": {"processors": 'tokenize, ner', "download_method": None},
        "tr": {"processors": 'tokenize, ner', "download_method": None},
        "vi": {"processors": 'tokenize, ner', "download_method": None}
    }, max_cache_size=15 )
#nlp = stanza.Pipeline(lang='ar', processors='tokenize,ner')
nlp_en.add_pipe("ner", name="ner_es", source=nlp_es)
nlp_en.add_pipe("ner", name="ner_pt", source=nlp_pt)
nlp_en.add_pipe("ner", name="ner_xx", source=nlp_xx)
nlp_en.add_pipe("ner", name="ner_zh", source=nlp_zh)
nlp_en.add_pipe("ner", name="ner_ko", source=nlp_ko)

lang_detector = stanza.Pipeline(lang="multilingual", processors="langid", langid_lang_subset=["en","ar", "es", "pt", "be", "bg", "ko", "id", "he", "ru", "tr", "vi" ])


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 6.42MB/s]                    
2022-07-24 14:19:47 INFO: Loading these models for language: ar (Arabic):
| Processor | Package |
-----------------------
| tokenize  | padt    |
| mwt       | padt    |
| ner       | aqmar   |

2022-07-24 14:19:47 INFO: Use device: cpu
2022-07-24 14:19:47 INFO: Loading: tokenize
2022-07-24 14:19:47 INFO: Loading: mwt
2022-07-24 14:19:47 INFO: Loading: ner
2022-07-24 14:19:49 INFO: Done loading processors!
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 19.3MB/s]                    
2022-07-24 14:19:53 INFO: Loading these models for language: vi (Vietnamese):
| Processor | Package |
-----------------------
| tokenize  | vtb     |
| ner       | vlsp    |

2022-07-24 14:19:53 INFO: Use device: cpu
2022-07-24 14:19:53 INFO: Loading: tokenize
2022-07-24 14:19:53 INFO: Loading: ne

In [19]:
with open("network.json", "r", encoding="utf-8") as f:
    content = json.loads(f.read())
    fb_df = pd.DataFrame(content.get("nodes"))

print(fb_df.head(3))
#fb_df["location"] = [None for _ in range(fb_df.shape[0])]

                       name                                             link  \
0              Wes W Parker         https://www.facebook.com/wes.parker.9277   
1       Kristen Fitzpatrick  https://www.facebook.com/kristen.fitzpatrick.12   
2  Jorene Monares Dela Cruz  https://www.facebook.com/jorene.monaresdelacruz   

                                         profile_pic             role  \
0  https://scontent.ffcm1-2.fna.fbcdn.net/v/t1.18...  admin/moderator   
1  https://scontent.ffcm1-1.fna.fbcdn.net/v/t39.3...  admin/moderator   
2  https://scontent.ffcm1-2.fna.fbcdn.net/v/t1.64...              NaN   

                       id         type         labels                joined  \
0  9arFVr8dkfuuDextavLKFH  UserAccount  [UserAccount]                   NaN   
1  mjHRdrVWvSJ2onFjyR65wy  UserAccount  [UserAccount]                   NaN   
2  K9vNtK9UcY6bmkNtgkG93u  UserAccount  [UserAccount]  Joined last Thursday   

             description members  
0                    NaN     NaN  

In [20]:
import regex
# Important note: I'm choosing this library instead of the standard re library because some of the description values that follow the
# "Works at" pattern have 2 or more whitespace characters between the 'at' and the next word. Dealing with this requires a
# variable-length lookbehind, which the default Python regex engine does not support. Hence the different engine library choice.
from numpy import nan

#works_at_pattern = [{"LOWER": "at"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
regex_pattern = r"\b(?<=\bat\s{1,2}?\X{0,1})\p{L}.*$"
#matcher.add("WorksAt", [works_at_pattern])

def apply_and_concat(dataframe, field, func, column_names):
    return pd.concat((
        dataframe,
        dataframe[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
    
def apply_and_concat_external(target_dataframe, source_series, func, column_names):
    return pd.concat((
        target_dataframe,
        source_series.apply(lambda cell: pd.Series(func(cell), index=column_names))
    ), axis=1)

def get_lang(descr: str):
    # pd.isna catches None and any NaN float, not only the numpy.nan singleton
    if pd.isna(descr):
        return nan
    doc = lang_detector(descr)
    return doc.lang

def process_doc_deps(doc: tokens.Doc):
    # Missing descriptions arrive as NaN/None rather than Doc objects; pass them through unchanged
    if not isinstance(doc, tokens.Doc):
        return doc
    # An empty Doc has no first token to inspect
    if len(doc) == 0:
        return doc.text
    first = doc[0]
    if first.pos_ not in ('NOUN', 'ADJ', 'VERB'):
        return doc.text
    # Lower-case the first token so it reads naturally after the prepended subject
    body = first.text.lower() + first.whitespace_ + ''.join(token.text_with_ws for token in doc[1:])
    if first.pos_ != 'VERB':
        return 'The subject is ' + body
    if first.tag_ == 'VBZ':
        return 'The subject ' + body
    return 'The subject does ' + body

def get_entities(descr: str):
    spacy_results = nlp_en(str(descr), disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])
    location_results = []
    occupation_results = []
    misc_results = []
    # works_at_rule_results = regex.search(regex_pattern, spacy_results.text, flags=regex.UNICODE)
    # if (works_at_rule_results is not None):
    #     start, end = works_at_rule_results.span()
    #     span = spacy_results.char_span(start, end)
    #     # This is a Span object or None if match doesn't map to valid token sequence
    #     if span is not None:
    #         location_results.append(span.text)
    #     elif bool('\u200E' in spacy_results.text):
    #         location_results.append(works_at_rule_results.group())
    for ent in spacy_results.ents:
        if ent.label_ in ['LOC', 'ORG', 'GPE', 'LC', 'OG'] and ent.text not in location_results:
            location_results.append(ent.text)
        elif ent.label_ in ['FAC', 'PRODUCT'] and ent.text not in occupation_results:
            occupation_results.append(ent.text)
        elif ent.label_ in ['PERSON', 'PER', 'PS', 'MISC']:
            misc_results.append(ent.text)
    results_tuple = ((', '.join(location_results) if len(location_results) != 0 else nan), (', '.join(occupation_results) if len(occupation_results) != 0 else nan), (', '.join(misc_results) if len(misc_results) != 0 else nan))
    return results_tuple


def get_lang_and_entities(doc: stanza.models.common.doc.Document):
    location_results = []
    occupation_results = []
    misc_results = []
    for ent in doc.ents:
        if ent.type in ['LOC', 'ORG', 'GPE', 'LC', 'OG']:
            location_results.append(ent.text)
        elif ent.type in ['FAC', 'PRODUCT']:
            occupation_results.append(ent.text)
        elif ent.type in ['PERSON', 'PER', 'PS', 'MISC']:
            misc_results.append(ent.text)
    results_tuple = ((doc.lang), (', '.join(location_results) if len(location_results) != 0 else nan), (', '.join(occupation_results) if len(occupation_results) != 0 else nan), (', '.join(misc_results) if len(misc_results) != 0 else nan))
    return results_tuple
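
The comment at the top of this cell notes that the standard `re` engine does not support the variable-length pattern that the "Works at" rule needs. The sketch below demonstrates that limitation and one possible workaround with a capturing group, using only the standard library. The `employer_after_at` helper and its pattern are assumptions for illustration; the pattern is deliberately simpler than `regex_pattern` and does not cover the `\X` grapheme case.

```python
import re

# The stdlib engine requires fixed-width lookbehinds, so this pattern fails to compile.
try:
    re.compile(r"(?<=\bat\s{1,2})\w.*$")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Workaround: match the prefix normally and capture what follows it.
AT_RE = re.compile(r"\bat\s{1,2}(\w.*)$")

def employer_after_at(description):
    """Return the text following 'at' (one or two spaces later), or None."""
    match = AT_RE.search(description)
    return match.group(1) if match else None

print(lookbehind_supported)
print(employer_after_at("Works at  Facebook App"))
```

The capturing-group form trades the zero-width match for an extra `group(1)` call, which may be acceptable when only the captured text is needed and not its exact span.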


In [21]:
from spacy import displacy
spacy.explain("DEV")
# test_result = nlp_en(nan)
# new_result = 'Subject does ' + ''.join([(token.text.lower() if token == test_result[0] else token.text) + token.whitespace_ for token in test_result])

'地 before VP'

In [22]:
docs_series = fb_df['description'][fb_df['description'].notna()]

In [23]:
print(not isinstance(fb_df['description'], list))

docs_list = docs_series.to_list()


langed_docs = nlp_multi(docs_list)

True


2022-07-24 14:20:45 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| pos       | combined  |
| ner       | ontonotes |

2022-07-24 14:20:45 INFO: Use device: cpu
2022-07-24 14:20:45 INFO: Loading: tokenize
2022-07-24 14:20:45 INFO: Loading: pos
2022-07-24 14:20:48 INFO: Loading: ner
2022-07-24 14:20:53 INFO: Done loading processors!
2022-07-24 14:28:10 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

2022-07-24 14:28:10 INFO: Use device: cpu
2022-07-24 14:28:10 INFO: Loading: tokenize
2022-07-24 14:28:11 INFO: Loading: ner
2022-07-24 14:28:15 INFO: Done loading processors!
2022-07-24 14:28:21 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| pos       | combined  |
| ner       | ontonotes |

2022-07-24 14:28:

In [24]:
print(type(langed_docs[0]))
for doc in langed_docs:
    print("---")
    print(f"text: {doc.text}")
    print(f"lang: {doc.lang}")
    print(f"ents: {doc.ents}")
    print(f"{doc.sentences[0].dependencies_string()}")
    


<class 'stanza.models.common.doc.Document'>
---
text: Works at Facebook App
lang: en
ents: [{
  "text": "Facebook App",
  "type": "ORG",
  "start_char": 9,
  "end_char": 21
}]

---
text: Driver at Valley Golf Cainta, Rizal
lang: en
ents: [{
  "text": "Valley Golf Cainta",
  "type": "FAC",
  "start_char": 10,
  "end_char": 28
}, {
  "text": "Rizal",
  "type": "GPE",
  "start_char": 30,
  "end_char": 35
}]

---
text: Middle Georgia Technical College
lang: en
ents: [{
  "text": "Middle Georgia Technical College",
  "type": "ORG",
  "start_char": 0,
  "end_char": 32
}]

---
text: Works at Self-Employed
lang: en
ents: []

---
text: Trường Đại Học Trà Vinh
lang: vi
ents: [{
  "text": "Học Trà Vinh",
  "type": "PERSON",
  "start_char": 11,
  "end_char": 23
}]

---
text: Works at J. Riley Williams, PLC
lang: en
ents: [{
  "text": "J. Riley Williams",
  "type": "PERSON",
  "start_char": 9,
  "end_char": 26
}, {
  "text": "PLC",
  "type": "ORG",
  "start_char": 28,
  "end_char": 31
}]

---
text:

In [25]:
langed_series = pd.Series(langed_docs, index=docs_series.index)

In [26]:

docs_series = apply_and_concat_external(docs_series, langed_series, get_lang_and_entities, ['lang', 'location', 'occupation', 'misc_ents']) 
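
To make the bucketing in `get_lang_and_entities` concrete, the sketch below applies the same label groups to plain dictionaries shaped like the entity output printed above. It is a standalone approximation: `bucket_entities` is a hypothetical name, the function reads dictionary keys where the notebook reads the `.text` and `.type` attributes of Stanza entities, and empty buckets become `None` here instead of `nan`.

```python
LOCATION_LABELS = {"LOC", "ORG", "GPE", "LC", "OG"}
OCCUPATION_LABELS = {"FAC", "PRODUCT"}
MISC_LABELS = {"PERSON", "PER", "PS", "MISC"}

def bucket_entities(lang, ents):
    """Group entity texts by label family; empty buckets become None."""
    buckets = {"location": [], "occupation": [], "misc_ents": []}
    for ent in ents:
        if ent["type"] in LOCATION_LABELS:
            buckets["location"].append(ent["text"])
        elif ent["type"] in OCCUPATION_LABELS:
            buckets["occupation"].append(ent["text"])
        elif ent["type"] in MISC_LABELS:
            buckets["misc_ents"].append(ent["text"])
    row = {"lang": lang}
    for key, values in buckets.items():
        row[key] = ", ".join(values) if values else None
    return row

# Entities copied from the "Driver at Valley Golf Cainta, Rizal" output above.
ents = [
    {"text": "Valley Golf Cainta", "type": "FAC"},
    {"text": "Rizal", "type": "GPE"},
]
print(bucket_entities("en", ents))
```

Note that `ORG` lands in the location bucket and `FAC` in the occupation bucket, mirroring the notebook's grouping rather than the labels' usual meaning.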

In [27]:
res = nlp_multi("คือสงสัยมานานแล้วว่าที่เค้าตั้งกันในโปรไฟล์เฟซว่า ทำงานที่ พ่อกับแม่จำกัดมหาชน คือมันมีบริษัทนี้จริงๆ")
print(res.lang)

be


In [28]:
fb_df = fb_df.join(docs_series.loc[:, ['lang', 'location', 'occupation', 'misc_ents']], how="outer")

In [29]:
fb_df.to_csv("data-temp/fb_data_langid_semiprocessed.csv", sep='\t')

In [40]:
fb_df_ar = fb_df[(fb_df['lang'] == "ar") & fb_df['description'].notna()]
fb_df_vi = fb_df[(fb_df['lang'] == "vi") & fb_df['description'].notna()]
fb_df_bg = fb_df[(fb_df['lang'] == "bg") & fb_df['description'].notna()]
fb_df_tr = fb_df[(fb_df['lang'] == "tr") & fb_df['description'].notna()]
fb_df_etc = fb_df[~(fb_df['lang'].isin(["ar", "vi", "bg", "tr"])) & ~(fb_df['description'].isna())]

descr_doc_ar = list(nlp_ar.pipe(fb_df_ar['description']))
#descr_doc_vi = list(nlp_vi.pipe(fb_df_vi))
#descr_doc_bg = list(nlp_bg.pipe(fb_df_bg))
#descr_doc_tr = list(nlp_tr.pipe(fb_df_tr))
descr_doc_etc = list(nlp_en.pipe(fb_df_etc['description'], disable=['ner', 'lemmatizer', 'senter']))

ents_ar = [ent for doc in descr_doc_ar for ent in doc.ents  if ent.label_ in ['LOC', 'ORG', 'GPE', 'LC', 'OG']]
# ents_vi = [ent for doc in descr_doc_vi for ent in doc.ents if ent.label_ in ['LOCATION', 'ORGANIZATION', 'GPE', 'LC', 'OG']]
# ents_bg = [ent for doc in descr_doc_bg for ent in doc.ents if ent.label_ in ['LOC', 'ORG', 'GPE', 'LC', 'OG']]
#ents_tr = [ent for doc in descr_doc_tr for ent in doc.ents if ent.label_ in ['LOCATION', 'ORGANIZATION', 'GPE', 'LC', 'OG']]

  docs = (self._ensure_doc(text) for text in texts)
Words: ['<UNK>', 'مدرسة', 'عبد', 'الفتاح', 'حمود', 'الثانوية', '\u200e']
Entities: [('عبد الفتاح حمود', 'PER', 7, 22)]
  docs = (self._ensure_doc(text) for text in texts)
Words: ['<UNK>', 'إعدادية', 'المثنى', 'ل', 'البنين', '\u200e']
Entities: [('المثنى', 'LOC', 9, 15)]
  docs = (self._ensure_doc(text) for text in texts)
Words: ['<UNK>', 'ما', 'زال', 'مالكت', 'والو\u200e', 'at', 'Ta', 'r', 'oudant']
Entities: [('\u200eمازال مالكت', 'PER', 0, 12)]
  docs = (self._ensure_doc(text) for text in texts)
Words: ['Works', 'at', '\u200eلا', 'إل', 'ه', 'الا', 'الله', 'محمد', 'رسول', 'الله\u200e']
Entities: [('محمد رسول', 'PER', 26, 35)]
  docs = (self._ensure_doc(text) for text in texts)
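
The per-language split in this cell amounts to routing each description to a dedicated pipeline when one exists and to a fallback otherwise. A minimal sketch of that routing with plain tuples follows; `route_by_language` and the sample rows are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict

# Languages that have their own Stanza-backed pipeline in this notebook.
DEDICATED_LANGS = {"ar", "vi", "bg", "tr"}

def route_by_language(rows):
    """Group (lang, description) pairs by pipeline key, skipping missing descriptions."""
    groups = defaultdict(list)
    for lang, description in rows:
        if description is None:
            continue
        key = lang if lang in DEDICATED_LANGS else "etc"
        groups[key].append(description)
    return dict(groups)

# Hypothetical sample rows; the last description is made up for illustration.
rows = [
    ("en", "Works at Facebook App"),
    ("vi", "Trường Đại Học Trà Vinh"),
    ("en", None),
    ("es", "Trabaja en casa"),
]
print(route_by_language(rows))
```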


In [54]:
# Parse each non-missing description; missing values (NaN/None) are passed through unchanged
descr_doc = fb_df['description'].apply(lambda s: nlp_en(s, disable=['ner', 'lemmatizer', 'senter']) if pd.notna(s) else s)
fb_df['description_normalized'] = descr_doc.apply(process_doc_deps)
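
The normalization applied by `process_doc_deps` prepends a subject to fragments that start with a noun, adjective, or verb. The sketch below reproduces only the string logic, taking the first token's part of speech and tag as plain arguments instead of reading them from a spaCy `Doc`. `normalize_fragment` is a hypothetical name, and treating the first whitespace-delimited word as the first token is a simplifying assumption.

```python
def normalize_fragment(text, first_pos, first_tag):
    """Prepend a subject to a fragment based on its first token's POS and tag."""
    if first_pos not in ("NOUN", "ADJ", "VERB"):
        return text
    # Lower-case only the first word by splitting it off from the rest.
    head, sep, tail = text.partition(" ")
    body = head.lower() + sep + tail
    if first_pos != "VERB":
        return "The subject is " + body
    if first_tag == "VBZ":
        return "The subject " + body
    return "The subject does " + body

# 'Works' is tagged VERB/VBZ in the token printout of cell In [55].
print(normalize_fragment("Works at Facebook App", "VERB", "VBZ"))
# The NOUN/NN tags for 'Driver' are assumed here, not taken from notebook output.
print(normalize_fragment("Driver at Valley Golf Cainta, Rizal", "NOUN", "NN"))
```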

In [55]:

test2 = nlp_en('Works at Electrical and electronics engineering')
spacy.explain('VBN')
for token in test2:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)


Works work VERB VBZ ROOT Xxxxx True False


AttributeError: 'spacy.tokens.token.Token' object has no attribute 'se'

In [56]:
fb_df = apply_and_concat(fb_df, 'description_normalized', get_entities, ['location', 'occupation', 'misc_ents']) 

In [57]:
locations_list = fb_df['location'].value_counts()
occupations_list = fb_df['occupation'].value_counts()
misc_list = fb_df['misc_ents'].value_counts()
uncaptured_vals_list = fb_df[fb_df['location'].isna() & fb_df['description'].notna()]
print(uncaptured_vals_list)


                       name  \
8              Rambo Buteng   
9            Nguyen Le Ngan   
11             Ta Hoang Son   
14                Bocboc Jl   
21      Missy Deleon Sabado   
...                     ...   
11801         Alice Stevens   
11806           Mike Lucker   
11808         Eldred Halsey   
11817  Debbie Lemons Newton   
11819       Richard O'Neill   

                                                    link  \
8                  https://www.facebook.com/rambo.boteng   
9      https://www.facebook.com/profile.php?id=100077...   
11     https://www.facebook.com/profile.php?id=100077...   
14                    https://www.facebook.com/bocboc.jl   
21                    https://www.facebook.com/isayyysbd   
...                                                  ...   
11801          https://www.facebook.com/alice.stevens.33   
11806          https://www.facebook.com/mike.lucker.3344   
11808  https://www.facebook.com/profile.php?id=100018...   
11817                 https