<h4>This notebook cleans, (pre-)processes, and enriches data collected from external OSINT sources such as TweetBeaver and Facebook.</h4>

<h5><strong>Tweet Processing: @PLNewsToday timeline</strong></h5>

The file `PLnewstoday timeline.csv` has the columns `Tweet author`, `Tweet ID`, `Date posted`, `Tweet text`, `URL`, `Retweets`, `Favorited`, and `Source`. Unlike other tweet files such as `thedcpatriot-lancaster-twitter1.json` from Twint, this timeline file (which I downloaded from TweetBeaver) has no columns for hashtags, user mentions, hyperlinks, geo-tagged location, or several other useful tweet properties. Let's extract those properties from the tweets in the timeline ourselves and save the output to a new file, which we will call `PLnewstoday timeline processed.csv`.
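
As a rough illustration of what these missing properties look like in raw tweet text, here is a minimal sketch using only the standard library's `re` module. The patterns, the `extract_entities` helper, and the sample tweet (including its shortened URL) are my own simplifications for illustration; they will miss edge cases that a dedicated tweet parser handles.

```python
import re

# Simplified, assumed patterns: real tweet parsing has many more edge cases.
MENTION_RE = re.compile(r"(?<![\w@])@(\w{1,15})")  # @handle, up to 15 word characters
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")          # #tag not glued to a preceding word
URL_RE = re.compile(r"https?://\S+")               # http(s) link up to the next whitespace


def extract_entities(text):
    """Return the mentions, hashtags, and URLs found in one tweet's text."""
    return {
        "mentions": MENTION_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
        "urls": URL_RE.findall(text),
    }


# Hypothetical sample text, loosely modelled on the timeline rows.
sample = "RT @PLnewstoday: Inside Azovstal #RussiaUkraineWar https://t.co/abc123"
entities = extract_entities(sample)
print(entities)
```

Each value is a list, so a tweet with no hashtags simply yields an empty list for that key.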

In addition to the standard tweet properties of interest such as mentions, URLs, and hashtags, some aspects of these tweets are of special value to us because of their source: a journalist reporting from a war zone, several of whose stories outside experts have deemed false or deceptive. I will therefore run a few APIs over these tweets for additional feature extraction: the GATE Journalist Safety Analyzer, the GATE Rumour veracity classifier, and the GATE TwitIE Named Entity Recognizer for Tweets.

In [1]:
#%pip install twitter-text-python

In [2]:
import json
import pandas as pd
from ttp import ttp


In [3]:
df = pd.read_csv("data/PLnewstoday timeline.csv")
print(df.head(3))
    


  Tweet author                Tweet ID                     Date posted  \
0  PLnewstoday  ID 1522888466773721088  Sat May 07 10:37:59 +0000 2022   
1  PLnewstoday  ID 1522851834431516672  Sat May 07 08:12:25 +0000 2022   
2  PLnewstoday  ID 1522477719707172864  Fri May 06 07:25:49 +0000 2022   

                                          Tweet text  \
0   RT @PLnewstoday: ⚡️📣Inside Azovstal Territory...   
1   ⚡️📣Inside Azovstal Territory: First Western J...   
2   ⚡️📣Ukraine Snipers Killed Civilians In Mariup...   

                                                 URL  Retweets  Favorited  \
0  https://twitter.com/PLnewstoday/statuses/15228...       209          0   
1  https://twitter.com/PLnewstoday/statuses/15228...       209        409   
2  https://twitter.com/PLnewstoday/statuses/15224...       296        558   

            Source  
0  Twitter Web App  
1  Twitter Web App  
2  Twitter Web App  


In [4]:
df['Tweet ID'] = df['Tweet ID'].str.slice_replace(stop=3)
df['Mentions'] = [[] for _ in range(df.shape[0])]
df['Hashtags'] = [[] for _ in range(df.shape[0])]
df['URLs'] = [[] for _ in range(df.shape[0])] #will have to be careful about this name - easy to mix up with the URL column from TweetBeaver
df.rename(columns={"Tweet text": "Text", "URL": "Link"}, inplace=True)


In [5]:
parser = ttp.Parser()
#df['Mentions'] = df.eval(parse_result.[0] if (parse_result:= parser.parse(df['Tweet text'])) else None
df['parsed_results'] = df['Text'].apply(parser.parse)
df['Mentions'] = df['parsed_results'].apply(lambda x: x.users)
df['Hashtags'] = df['parsed_results'].apply(lambda y: y.tags)
df['URLs'] = df['parsed_results'].apply(lambda z: z.urls)
del df['parsed_results']

In [6]:
print(df.head(3))

  Tweet author             Tweet ID                     Date posted  \
0  PLnewstoday  1522888466773721088  Sat May 07 10:37:59 +0000 2022   
1  PLnewstoday  1522851834431516672  Sat May 07 08:12:25 +0000 2022   
2  PLnewstoday  1522477719707172864  Fri May 06 07:25:49 +0000 2022   

                                                Text  \
0   RT @PLnewstoday: ⚡️📣Inside Azovstal Territory...   
1   ⚡️📣Inside Azovstal Territory: First Western J...   
2   ⚡️📣Ukraine Snipers Killed Civilians In Mariup...   

                                                Link  Retweets  Favorited  \
0  https://twitter.com/PLnewstoday/statuses/15228...       209          0   
1  https://twitter.com/PLnewstoday/statuses/15228...       209        409   
2  https://twitter.com/PLnewstoday/statuses/15224...       296        558   

            Source       Mentions            Hashtags  \
0  Twitter Web App  [PLnewstoday]                  []   
1  Twitter Web App             []  [RussiaUkraineWar]   
2  Twitter

<h4>Facebook Profile Processing</h4>

In [8]:
import stanza
from stanza.pipeline.multilingual import MultilingualPipeline

stanza.download("ar")
stanza.download("vi")
stanza.download("multilingual")
stanza.download("bg")
stanza.download("tr")
stanza.download("be")
stanza.download("he")
stanza.download("id")
stanza.download("ko")
stanza.download("es")
stanza.download("pt")
stanza.download("en")
stanza.download("en", processors="ner")

In [9]:


# nlp_en = spacy.load("en_core_web_lg")
# nlp_es = spacy.load('es_core_news_md')
# nlp_pt = spacy.load("pt_core_news_md")
# nlp_xx = spacy.load("xx_ent_wiki_sm")
# nlp_zh = spacy.load("zh_core_web_sm")
# nlp_ko = spacy.load("ko_core_news_sm")

# nlp_ar = spacy_stanza.load_pipeline("ar", processors='tokenize, ner')
# nlp_vi = spacy_stanza.load_pipeline("vi", processors='tokenize, ner')
# nlp_bg = spacy_stanza.load_pipeline("bg", processors='tokenize, ner')
# nlp_tr = spacy_stanza.load_pipeline("tr", processors='tokenize, ner')

# #nlp = stanza.Pipeline(lang='ar', processors='tokenize,ner')
# nlp_en.add_pipe("ner", name="ner_es", source=nlp_es)
# nlp_en.add_pipe("ner", name="ner_pt", source=nlp_pt)
# nlp_en.add_pipe("ner", name="ner_xx", source=nlp_xx)
# nlp_en.add_pipe("ner", name="ner_zh", source=nlp_zh)
# nlp_en.add_pipe("ner", name="ner_ko", source=nlp_ko)

nlp_multi = MultilingualPipeline(lang_id_config={
    "langid_clean_text": False, 
    "langid_lang_subset": ["en","ar", "es", "pt", "be", "bg", "ko", "id", "he", "ru", "th", "tr", "vi" ],
    }, 
    lang_configs={
        "en": {"processors": 'tokenize, pos, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "ar": {"processors": 'tokenize, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "es": {"processors": 'tokenize, pos, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "pt": {"processors": 'tokenize, pos', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "be": {"processors": 'tokenize', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "bg": {"processors": 'tokenize, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "he": {"processors": 'tokenize', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "id": {"processors": 'tokenize', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "ko": {"processors": 'tokenize', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "th": {"processors": 'tokenize', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "tr": {"processors": 'tokenize, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "vi": {"processors": 'tokenize, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES}
    }, 
    ld_batch_size=85,
    max_cache_size=15 )

lang_detector = stanza.Pipeline(lang="multilingual", processors="langid", langid_lang_subset=["en","ar", "es", "pt", "be", "bg", "ko", "id", "he", "ru", "tr", "vi" ])


2023-01-01 17:26:09 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES



2023-01-01 17:26:10 INFO: Loading these models for language: multilingual ():
| Processor | Package |
-----------------------
| langid    | ud      |

2023-01-01 17:26:10 INFO: Use device: cpu
2023-01-01 17:26:10 INFO: Loading: langid
2023-01-01 17:26:11 INFO: Done loading processors!
2023-01-01 17:26:11 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES



2023-01-01 17:26:11 INFO: Loading these models for language: multilingual ():
| Processor | Package |
-----------------------
| langid    | ud      |

2023-01-01 17:26:11 INFO: Use device: cpu
2023-01-01 17:26:11 INFO: Loading: langid
2023-01-01 17:26:11 INFO: Done loading processors!


In [None]:
with open("data-temp/network.json", "r", encoding="utf-8") as f:
    content = json.loads(f.read())
    fb_df = pd.DataFrame(content.get("nodes"))
    
fb_df.drop(columns=["members"], inplace=True)
print(fb_df.head(3))
#fb_df["location"] = [None for _ in range(fb_df.shape[0])]

In [11]:
# Important note: I'm using the third-party `regex` library instead of the standard `re` module because some
# description values that follow the "Works at" pattern have two or more whitespace characters between the
# 'at' and the next word. Handling this requires a variable-length lookbehind, which `re` does not support.
import regex
from numpy import nan

#works_at_pattern = [{"LOWER": "at"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
regex_pattern = r"\b(?<=\bat\s{1,2}?\X{0,1})\p{L}.*$"
#matcher.add("WorksAt", [works_at_pattern])

def apply_and_concat(dataframe, field, func, column_names):
    return pd.concat((
        dataframe,
        dataframe[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
    
def apply_and_concat_external(target_dataframe, source_series, func, column_names):
    return pd.concat((
        target_dataframe,
        source_series.apply(lambda cell: pd.Series(func(cell), index=column_names))
    ), axis=1)

def get_lang(descr: str):
    if pd.isna(descr):  # catches None and any NaN value, not only the numpy.nan singleton
        return nan
    doc = lang_detector(descr)
    return doc.lang
    
# def process_doc_deps(doc: tokens.Doc):
#     if doc is nan or doc is None:
#         return doc
#     if (doc[0] is nan or doc[0] is None):
#         return doc[0]
#     result = ''
#     if (doc[0].pos_ in ['NOUN', 'ADJ', 'VERB']):
#         if doc[0].pos_ == 'VERB':
#             if doc[0].tag_ == 'VBZ':              
#                 result = 'The subject ' + ''.join([(token.text.lower() if token == doc[0] else token.text) + token.whitespace_ for token in doc])
#             else:
#                 result = 'The subject does ' + ''.join([(token.text.lower() if token == doc[0] else token.text) + token.whitespace_ for token in doc])
#         else:
#             result = 'The subject is ' + ''.join([(token.text.lower() if token == doc[0] else token.text) + token.whitespace_ for token in doc])
#     else:
#         result = doc.text
#     return result

# def get_entities(descr: str):
#     spacy_results = nlp_en(str(descr), disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])
#     location_results = []
#     occupation_results = []
#     misc_results = []
#     # works_at_rule_results = regex.search(regex_pattern, spacy_results.text, flags=regex.UNICODE)
#     # if (works_at_rule_results is not None):
#     #     start, end = works_at_rule_results.span()
#     #     span = spacy_results.char_span(start, end)
#     #     # This is a Span object or None if match doesn't map to valid token sequence
#     #     if span is not None:
#     #         location_results.append(span.text)
#     #     elif bool('\u200E' in spacy_results.text):
#     #         location_results.append(works_at_rule_results.group())
#     for ent in spacy_results.ents:
#         if ent.label_ in ['LOC', 'ORG', 'GPE', 'LC', 'OG'] and ent.text not in location_results:
#             location_results.append(ent.text)
#         elif ent.label_ in ['FAC', 'PRODUCT'] and ent.text not in occupation_results:
#             occupation_results.append(ent.text)
#         elif ent.label_ in ['PERSON', 'PER', 'PS', 'MISC']:
#             misc_results.append(ent.text)
#     results_tuple = ((', '.join(location_results) if len(location_results) != 0 else nan), (', '.join(occupation_results) if len(occupation_results) != 0 else nan), (', '.join(misc_results) if len(misc_results) != 0 else nan))
#     return results_tuple


def get_lang_and_entities(doc: stanza.models.common.doc.Document):
    location_results = []
    occupation_results = []
    misc_results = []
    other_results = []
    untagged_results = []
    if len(doc.ents) > 0:
        for ent in doc.ents:
            if ent.type in ['LOC', 'ORG', 'GPE', 'LC', 'OG', 'LOCATION', 'ORGANIZATION']:
                location_results.append(ent.text)
            elif ent.type in ['FAC', 'PRODUCT']:
                occupation_results.append(ent.text)
            elif ent.type in ['PERSON', 'PER', 'PS', 'MISC', 'MISCELLANEOUS']:
                misc_results.append(ent.text)
            else:
                other_results.append(ent.text)
    else:
        works_at_rule_results = regex.search(regex_pattern, doc.text, flags=regex.UNICODE)
        if (works_at_rule_results is not None):
            start, end = works_at_rule_results.span()
            span = doc.text[start:end]
            untagged_results.append(span)
    results_tuple = (
                     (doc.lang), 
                     (', '.join(location_results) if len(location_results) != 0 else nan), 
                     (', '.join(occupation_results) if len(occupation_results) != 0 else nan), 
                     (', '.join(misc_results) if len(misc_results) != 0 else nan),
                     (', '.join(other_results) if len(other_results) != 0 else nan),
                     (', '.join(untagged_results) if len(untagged_results) != 0 else nan)
                    )
    return results_tuple
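
For comparison, a similar "works at" rule can be approximated with the standard `re` module by capturing the employer text in a group instead of asserting the prefix in a lookbehind. This is only a sketch under my own assumptions: `WORKS_AT_RE` and `employer_from_description` are hypothetical names, `[^\W\d_]` stands in for `\p{L}`, and only an optional left-to-right mark (U+200E) is tolerated before the first letter, so it is not equivalent to the `regex` pattern above.

```python
import re

# The word "at", one or more whitespace characters, an optional invisible
# left-to-right mark (U+200E), then the captured text starting at a letter.
# A capture group replaces the lookbehind, which `re` requires to be fixed-width.
WORKS_AT_RE = re.compile(r"\bat\s+\u200e?([^\W\d_].*)$")


def employer_from_description(description):
    """Return the text following 'at', or None when the pattern is absent."""
    match = WORKS_AT_RE.search(description)
    return match.group(1) if match else None


print(employer_from_description("Works at  World Bank"))
```

With this sketch, `employer_from_description("Works at  World Bank")` returns `"World Bank"` even though two spaces follow "at", and a description without the word "at" returns `None`.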


In [13]:
docs_series = fb_df['description'][fb_df['description'].notna()]

In [14]:
print(not isinstance(fb_df['description'], list))

docs_list = docs_series.to_list()


langed_docs = nlp_multi(docs_list)

True


2023-01-01 17:26:54 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| pos       | combined  |
| ner       | ontonotes |

2023-01-01 17:26:54 INFO: Use device: cpu
2023-01-01 17:26:54 INFO: Loading: tokenize
2023-01-01 17:26:54 INFO: Loading: pos
2023-01-01 17:26:55 INFO: Loading: ner
2023-01-01 17:26:57 INFO: Done loading processors!
2023-01-01 17:34:32 INFO: Loading these models for language: vi (Vietnamese):
| Processor | Package |
-----------------------
| tokenize  | vtb     |
| ner       | vlsp    |

2023-01-01 17:34:32 INFO: Use device: cpu
2023-01-01 17:34:32 INFO: Loading: tokenize
2023-01-01 17:34:32 INFO: Loading: ner
2023-01-01 17:34:34 INFO: Done loading processors!
2023-01-01 17:34:44 INFO: Loading these models for language: pt (Portuguese):
| Processor | Package |
-----------------------
| tokenize  | bosque  |
| mwt       | bosque  |
| pos       | bosque  |

2023-01-01 17:34:44 INFO: Use


2023-01-01 17:36:11 INFO: Loading these models for language: ru (Russian):
| Processor | Package   |
-------------------------
| tokenize  | syntagrus |
| pos       | syntagrus |
| lemma     | syntagrus |
| depparse  | syntagrus |
| ner       | wikiner   |

2023-01-01 17:36:11 INFO: Use device: cpu
2023-01-01 17:36:11 INFO: Loading: tokenize
2023-01-01 17:36:11 INFO: Loading: pos
2023-01-01 17:36:12 INFO: Loading: lemma
2023-01-01 17:36:12 INFO: Loading: depparse
2023-01-01 17:36:13 INFO: Loading: ner
2023-01-01 17:36:16 INFO: Done loading processors!
2023-01-01 17:36:23 INFO: Loading these models for language: bg (Bulgarian):
| Processor | Package |
-----------------------
| tokenize  | btb     |
| ner       | bsnlp19 |

2023-01-01 17:36:23 INFO: Use device: cpu
2023-01-01 17:36:23 INFO: Loading: tokenize
2023-01-01 17:36:23 INFO: Loading: ner
2023-01-01 17:36:25 INFO: Done loading processors!
2023-01-01 17:36:31 INFO: Loading these models for language: tr (Turkish):
| Processor | Pac

In [None]:
print(type(langed_docs[0]))
for doc in langed_docs:
    print("---")
    print(f"text: {doc.text}")
    print(f"lang: {doc.lang}")
    print(f"ents: {doc.ents}")
    print(f"{doc.sentences[0].dependencies_string()}")
    


In [16]:
langed_series = pd.Series(langed_docs, index=docs_series.index)

In [17]:

docs_series = apply_and_concat_external(docs_series, langed_series, get_lang_and_entities, ['lang', 'location', 'occupation', 'misc_ents', 'other_ents', 'untagged']) 
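
The label routing inside `get_lang_and_entities` can also be written as a lookup table, which keeps each label in exactly one bucket. The following is a sketch rather than a drop-in replacement: `Entity` is a hypothetical stand-in for a Stanza entity span (anything with `.text` and `.type`), the sample entities are made up, and empty buckets become `None` here instead of `nan`.

```python
from collections import namedtuple

# Hypothetical stand-in for an entity span exposing .text and .type.
Entity = namedtuple("Entity", ["text", "type"])

# Same label groups as in get_lang_and_entities, expressed as one mapping.
LABEL_TO_BUCKET = {}
for label in ("LOC", "ORG", "GPE", "LC", "OG", "LOCATION", "ORGANIZATION"):
    LABEL_TO_BUCKET[label] = "location"
for label in ("FAC", "PRODUCT"):
    LABEL_TO_BUCKET[label] = "occupation"
for label in ("PERSON", "PER", "PS", "MISC", "MISCELLANEOUS"):
    LABEL_TO_BUCKET[label] = "misc_ents"


def bucket_entities(entities):
    """Group entity texts by bucket; unknown labels fall into 'other_ents'."""
    buckets = {"location": [], "occupation": [], "misc_ents": [], "other_ents": []}
    for ent in entities:
        buckets[LABEL_TO_BUCKET.get(ent.type, "other_ents")].append(ent.text)
    # Join each bucket into one string, using None for empty buckets.
    return {name: ", ".join(texts) if texts else None for name, texts in buckets.items()}


# Made-up entities for illustration only.
sample_ents = [
    Entity("Kyiv", "GPE"),
    Entity("World Bank", "ORG"),
    Entity("Anna", "PERSON"),
    Entity("2022", "DATE"),
]
bucketed = bucket_entities(sample_ents)
print(bucketed)
```

Adding a new label then means adding one entry to `LABEL_TO_BUCKET` rather than editing an `if`/`elif` chain.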

In [18]:
res = nlp_multi("คือสงสัยมานานแล้วว่าที่เค้าตั้งกันในโปรไฟล์เฟซว่า ทำงานที่ พ่อกับแม่จำกัดมหาชน คือมันมีบริษัทนี้จริงๆ")
print(res.lang)

be


In [19]:
fb_df = fb_df.join(docs_series.loc[:, ['lang', 'location', 'occupation', 'misc_ents', 'other_ents', 'untagged']], how="outer")

In [47]:
fb_df.to_csv("data-temp/fb_data_langid_semiprocessed.csv", sep='\t', encoding='utf-8-sig')

In [21]:
locations_list = fb_df['location'].value_counts()
occupations_list = fb_df['occupation'].value_counts()
misc_list = fb_df['misc_ents'].value_counts()
uncaptured_vals_list = fb_df[(fb_df['location'].isna() | fb_df['occupation'].isna() | fb_df['misc_ents'].isna()) & fb_df['lang'].notna()]
print(uncaptured_vals_list.count())


name           6921
link           6921
profile_pic    6921
role              0
id             6921
type           6921
labels         6921
joined         6921
description    6921
members           0
lang           6921
location       3359
occupation      194
misc_ents       507
dtype: int64
