# Data Collection
## Web Scraping
The first step of the analysis was to build a corpus from the Community Guidelines of YouTube, Facebook/Meta, and TikTok, restricted to the sections on online harassment. Text was collected from the bullying/harassment subsection of each policy document so that the corpus stayed relevant and the analysis remained internally valid. Because these subsections often link to other parts of the guidelines, the URLs on each page were extracted as well. The extracted URLs were saved to one CSV file per platform, and the pages listed in each file were then scraped to produce one text file per platform for comparison in the analysis.

### Meta/Facebook 

In [5]:
# The following script is adapted from the GeeksforGeeks “BeautifulSoup – Scraping Link from HTML” tutorial: https://www.geeksforgeeks.org/beautifulsoup-scraping-link-from-html/
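
The cells below call `getHTMLdocument`, but this export never shows its definition. A minimal sketch of what the helper might look like, assuming it simply downloads the page with `requests` and returns the body as text. The `timeout` value and the status check are added assumptions, not part of the original notebook.

```python
import requests


def getHTMLdocument(url):
    """Fetch a URL and return the response body as a string."""
    # timeout and raise_for_status() are assumptions added for safety
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text
```

The later cells also rely on `from bs4 import BeautifulSoup` and `import re` having been run.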

In [136]:
url_to_scrape = "https://transparency.fb.com/policies/community-standards/bullying-harassment/"

In [137]:
# call the function to retrieve HTML document
html_document = getHTMLdocument(url_to_scrape)

In [138]:
# use BeautifulSoup to parse the HTML document
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_document, 'html.parser')

In [139]:
# find all the anchor tags whose "href" attribute starts with "https://"
import re

for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    # display the actual URLs
    print(link.get('href'))

https://www.facebook.com/safety/tools
https://www.facebook.com/safety/bullying
https://about.fb.com/news/2018/10/protecting-people-from-bullying/
https://transparency.fb.com/policies/community-standards/dangerous-individuals-organizations/
https://transparency.fb.com/policies/community-standards/privacy-violations-image-privacy-rights/
https://www.facebook.com/help/263149623790594?ref=tc
https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=US&media_type=all
https://www.crowdtangle.com/
https://fort.fb.com/
https://www.facebook.com/about/privacy/
https://www.facebook.com/legal/terms/
https://www.facebook.com/policies/cookies/
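
The notebook does not show how the links printed above became the CSV file that is read in the next step. One possible way to do it, as a sketch: deduplicate the links and write them with `Platform` and `URL` columns, which are the column names visible in the DataFrame further down. The function names and the output file name are hypothetical.

```python
import csv


def links_to_rows(links, platform):
    """Return one row per unique link, sorted so the output is reproducible."""
    return [{"Platform": platform, "URL": url} for url in sorted(set(links))]


def write_url_csv(rows, path):
    """Write the rows to a CSV file with Platform and URL columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Platform", "URL"])
        writer.writeheader()
        writer.writerows(rows)


# Two of the links printed above, with one repeated to show deduplication
links = [
    "https://www.facebook.com/safety/tools",
    "https://fort.fb.com/",
    "https://www.facebook.com/safety/tools",
]
rows = links_to_rows(links, "Meta/Facebook")
write_url_csv(rows, "meta_urls_example.csv")  # hypothetical file name
```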


In [8]:
# The following script is adapted from (Krisel, 2023) https://github.com/intro-to-text-analysis-SIPA-S23/syllabus

In [140]:
# read the CSV file containing the URLs to scrape
import pandas as pd

data_df = pd.read_csv("Facebook Community Guidelines URLs - Sheet1.csv", delimiter=',', encoding='utf-8')

In [141]:
# work on a copy so the DataFrame read from the CSV stays unchanged
Meta_urls = data_df.copy()

In [142]:
# define a function that downloads a page and returns its raw HTML
import requests

def scrape_article(URL):
    response = requests.get(URL, timeout=30)  # time out rather than hang on a slow server
    response.encoding = 'utf-8'
    html_string = response.text
    return html_string

In [143]:
# apply the 'scrape_article' function to each URL in the DataFrame and store the resulting HTML in a new column 'text'
Meta_urls['text'] = Meta_urls['URL'].apply(scrape_article)
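
Applying `scrape_article` directly means that a single failed request stops the whole run, and the requests are sent back to back. A hedged sketch of a more forgiving loop follows. `scrape_all`, its `delay` parameter, and the choice to store `None` for failed pages are all assumptions, not part of the original notebook. The fetch function is passed in as an argument, so `scrape_article` could be used unchanged.

```python
import time


def scrape_all(urls, fetch, delay=1.0):
    """Fetch each URL in turn; return a list of HTML strings (None on failure)."""
    pages = []
    for url in urls:
        try:
            pages.append(fetch(url))
        except Exception as error:  # e.g. a network error raised by the fetch function
            print(f"Skipping {url}: {error}")
            pages.append(None)
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return pages
```

With the notebook's objects, the call might look like `Meta_urls['text'] = scrape_all(Meta_urls['URL'], scrape_article)`.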

In [144]:
# display the DataFrame with the scraped HTML
Meta_urls

Unnamed: 0,Platform,URL,text
0,Meta/Facebook,https://about.fb.com/news/2018/10/protecting-p...,<!DOCTYPE html>\n<!--[if lt IE 8]> <html ...
1,Meta/Facebook,https://fort.fb.com/,"<!DOCTYPE html>\n<html lang=""en"" id=""facebook""..."
2,Meta/Facebook,https://transparency.fb.com/policies/community...,"<!DOCTYPE html>\n<html lang=""en"" id=""facebook""..."
3,Meta/Facebook,https://transparency.fb.com/policies/community...,"<!DOCTYPE html>\n<html lang=""en"" id=""facebook""..."
4,Meta/Facebook,https://transparency.fb.com/policies/community...,"<!DOCTYPE html>\n<html lang=""en"" id=""facebook""..."
5,Meta/Facebook,https://www.crowdtangle.com/,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\..."
6,Meta/Facebook,https://www.facebook.com/about/privacy/,"<!DOCTYPE html><html id=""facebook"" class=""_9dl..."
7,Meta/Facebook,https://www.facebook.com/ads/library/?active_s...,"<!DOCTYPE html>\n<html lang=""en"" id=""facebook""..."
8,Meta/Facebook,https://www.facebook.com/help/263149623790594?...,"<!DOCTYPE html><html id=""facebook"" class=""_9dl..."
9,Meta/Facebook,https://www.facebook.com/legal/terms/,"<!DOCTYPE html>\n<html lang=""en"" id=""facebook""..."


In [145]:
# print the text of each article
for text in Meta_urls['text']:
    print(text)

<!DOCTYPE html>
<!--[if lt IE 8]>      <html class="lt-ie10 lt-ie9 lt-ie8" lang="en-US"> <![endif]-->
<!--[if IE 8]>         <html class="lt-ie10 lt-ie9" lang="en-US"> <![endif]-->
<!--[if IE 9]>         <html class="lt-ie10" lang="en-US"> <![endif]-->
<!--[if gt IE 9]><!--> <html xmlns="http://www.w3.org/1999/xhtml" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#" lang="en-US"> <!--<![endif]-->
<html lang="en-US">
<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1">
	<link rel="profile" href="https://gmpg.org/xfn/11">
	<link rel="pingback" href="https://about.fb.com/xmlrpc.php">
	<meta name='robots' content='max-image-preview:large' />
<!-- Feed RSS/XML -->
<link rel="alternate" type="application/rss+xml" title="Meta Feed" href="https://about.fb.com/feed/">
<!-- /end Feed RSS/XML -->

<!-- Favicon/Icons -->
<link rel="icon" href="https://about.fb.com/wp-content/uploads/2021/10/meta-favicon.png?fit=16%2C16" sizes="32x32" />
<link r

In [146]:
# print the text of each article after removing HTML tags
for text in Meta_urls['text']:
    soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly to avoid a warning
    article = soup.get_text()
    print(article)

Protecting People from Bullying and Harassment | Meta

Skip to content

Viewing this site in

English
Portugese
German
French
Japanese
Korean
Spanish (LTAM)
Spanish (ES)

Você está visualizando este site em

Inglês
Português
Alemão
Francês
Japonês
Coreano
Espanhol (LTAM)
Espanhol (ES)

Diese Seite anzeigen auf

Englisch
Portugiesisch
Deutsch
Französisch
Japanisch
Koreanisch
Spanisch (LTAM)
Spanisch (ES)

Vous consultez ce site en

Anglais
Portugais
Allemand
Français
Japonais
Coréen
Espagnol (LTAM)
Espagnol (ES)

このサイトを次の言語で表示

英語
ポルトガル語
ドイツ語
フランス語
日本語
韓国語
スペイン語 (LTAM)
スペイン語 (ES)

다음 언어로 표시 중

영어
포르투갈어
독일어
프랑스어
일본어
한국어
스페인어 (LTAM)
스페인어 (ES)

Este sitio se está viendo en

Inglés
Portugués
Alemán
Francés
Japonés
Coreano
Español (LTAM)
Español (ES)

Este sitio se está viendo en

Inglés
Portugués
Alemán
Francés
Japonés
Coreano
Español (LTAM)

Report Something | Facebook Help Center

Facebook

FacebookEmail or phonePasswordForgot account?Sign UpPlain Text Terms of Service Facebook Ads Controls  Privacy Basics  Cookies Policy  Data Policy  More Resources  View a printable version of the Terms of Service The Facebook company is now Meta. We’ve updated our Terms of Use, Data Policy, and Cookies Policy to reflect the new name on January 4, 2022. While our company name has changed, we are continuing to offer the same products, including the Facebook app from Meta. Our Data Policy and Terms of Service remain in effect, and this name change does not affect how we use or share data. Learn more about Meta and our vision for the metaverse.Terms of Service Meta builds technologies and services that enable people to connect with each other, build communities, and grow businesses. These Terms govern your use of Facebook, Messenger, and the other products, fea
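
As the output above shows, `get_text()` leaves long runs of blank lines where tags used to be. One way to tidy this before saving is to strip each line and drop the empty ones. This is a sketch using only the standard library; `collapse_whitespace` is a hypothetical helper, not part of the original notebook.

```python
import re


def collapse_whitespace(text):
    """Strip each line, squeeze inner whitespace, and drop empty lines."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = "\n\n\nProtecting People from Bullying and Harassment | Meta\n\n\n\nSkip   to content\n \n"
print(collapse_whitespace(sample))
```

BeautifulSoup can produce a similar result at extraction time with `soup.get_text(separator="\n", strip=True)`.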

In [147]:
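# save the text of each article to a single text file
with open("all_articles_Meta.txt", "w", encoding="utf-8") as file:
    for text in Meta_urls['text']:
        soup = BeautifulSoup(text, 'html.parser')
        article = soup.get_text()
        file.write(article + "\n")  # newline keeps consecutive articles from running together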

In [148]:
# create a directory to store the cleaned text files (-p: no error if it already exists)
! mkdir -p files_meta


In [149]:
# iterate over each article, remove HTML tags, and save the cleaned text to a separate file in the 'files_meta' directory
import os

os.makedirs("files_meta", exist_ok=True)
# number the files from 1; 'article_id' avoids shadowing the built-in id()
for article_id, text in enumerate(Meta_urls['text'], start=1):
    soup = BeautifulSoup(text, 'html.parser')
    article = soup.get_text()

    file_path = f"files_meta/article_{article_id}.txt"
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(article)

In [150]:
# list the saved files ("! cd" would not persist, because each "!" command runs in its own subshell)
! ls files_meta

article_1.txt  article_12.txt article_3.txt  article_6.txt  article_9.txt
article_10.txt article_13.txt article_4.txt  article_7.txt
article_11.txt article_2.txt  article_5.txt  article_8.txt


In [151]:
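# no directory change is needed here: "! cd" runs in a subshell and does not affect the notebook's working directory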

In [152]:
# import the tools used to clean the data and load the English stopword list
# (requires the NLTK 'stopwords' corpus, e.g. via nltk.download('stopwords'))
from nltk.corpus import stopwords as nltk_stopwords
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt

stops = nltk_stopwords.words('english')

In [189]:
# Create a function to derive the part of speech (POS) of a given word.
def get_wordnet_pos(word):
    """Map a word's POS tag to the constant that WordNetLemmatizer.lemmatize() accepts."""
    # pos_tag is imported above via "from nltk import pos_tag"
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)
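
The stopword list and the POS-aware lemmatizer are set up above, but this section does not show them being applied. A minimal sketch of how they might be combined is below. `clean_tokens` is a hypothetical helper; the stopword collection and the lemmatizing function are passed in as arguments so that the example runs without NLTK data, and both the sentence and the small stopword set are made up for illustration.

```python
import re


def clean_tokens(text, stopwords, lemmatize=lambda word: word):
    """Lowercase, keep alphabetic tokens, drop stopwords, then lemmatize."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(token) for token in tokens if token not in stopwords]


example_stops = {"we", "do", "not", "the", "of"}  # illustrative, not the NLTK list
print(clean_tokens("We do not tolerate the bullying of others.", example_stops))
# prints ['tolerate', 'bullying', 'others']
```

With the notebook's objects, and assuming `lemmatizer = WordNetLemmatizer()`, the call might look like `clean_tokens(text, set(stops), lambda w: lemmatizer.lemmatize(w, get_wordnet_pos(w)))`.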

In [190]:
# combine the per-article text files into one file
import glob

# sort the paths so the articles are combined in a reproducible order
files = sorted(glob.glob("files_meta/*.txt"))
combined_text = ""

for file in files:
    with open(file, encoding='utf-8') as f:
        text = f.read()
        combined_text += text + "\n"

with open("CommunityGuidelines_Meta.txt", "w", encoding='utf-8') as f:
    f.write(combined_text)