<p style="text-align:right"> 
Federico Motta
<br/> 
20-01-1998
<br/>
fm.federicomotta@gmail.com

</p>

<style>
h1,h2{
    color:#D37171
}
h3,p,li {
    
    color: #717070;
    line-height:150%
}

.cont {
  width:100%;
  text-align:center;
  display: inline-grid;
  justify-content:center;
flex-direction: column;
}

.t_cont {
    width:400px
}

.fun {
    color:blue
}
.str {
    color: #AB312A
}

.fun,.str{
    font-weight:400
}
.appendix{
    font-size:12px;
}

.img {
    width:50%;
    margin-top:50px;
    margin-bottom:50px;
    
}

tt{
    font-weight:800
}


</style>

<h1 style="text-align:center"> D3I - Research Engineer </h1>

<h2> Introduction </h2>
<h3> Research Background </h3>
<p>
Scraped social media data allow computational social scientists to understand how users engage with big media events such as social movements and general trends (e.g., Hong & Kim, 2016; Ince et al., 2017; Gleason, 2013). However, if we want to investigate how the event changes and shapes the user behavior across environments, it is possible that analyzing only public data could not be enough. What we usually show publicly on social media reflects only some shards of what our “true self” is (Coi et al., 2020). For instance, we can see that a user supports a social movement hashtag (e.g  <span style="color:#1DA1F2">#BlackLivesMatter</span>) but we don’t know why they support it and in which other ways apart from twitting about it. It is rather hard to interpret beliefs and motivations through short tweets or very limited social interactions in the comment sections. Even if it's hard to find deep personal information about users' beliefs, the Internet is full of places where users can develop and build social interaction in a more intimate and genuine way, which leads eventually to the expression of self-disclosure, social support across users, and a more intimate component of their selves (Lomanowska & Guitton, 2016). One of them could be found in instant messaging environments. For their long-term development and the peer-to-peer structure, we suppose that analyzing private conversations such as the one in Instagram Direct Messages could lead to an enrichment of the general knowledge about users' engagement to media events.
</p>

<h3> Research Question </h3>
<h4> The research question investigates how social media users engage in big-media events inside private environments, such as Instagram Direct Messages. </h4>
<p> In particular, we can divide the research question in two more practical: </p>
    <ul>
    <li> <b>RQ1:</b> can private conversations give quantitative support if treated as anonymized scraping data? In other words, can we predict users engagement to big-media events merging several users' conversations? </li>
    <li> <b>RQ2:</b> can private conversations give quantitative support by analyzing several users' conversations, without merging them? That is, analysing every user's conversations separately, and eventually performing Analysis of Variance (ANOVA) through them.</li>
    </ul>


<h3> Inventing a scenario: the <span style="color:#1DA1F2">#fuckarticles</span>  movement</h3>
<p>The internet has done it again. During the day of November 15, 2020 a vast amount of prominent personalities  began posting tweets insinuating that "grammatical articles" were completely useless and even "lame."(Figure 1). It is still unclear what started this media phenomenom, but a lot of Twitter users started posting <span style="color:#1DA1F2"> #fuckarticles</span> ungrammatical tweets. Several computational social scientists are now trying to understand the weight and the impact of the phenomenom across different social media platforms. </p> 

<figure class="img">
    <img class="image" src="https://i.imgur.com/Usz88m1.png"> 
    <figcaption><p><b>Figure 1.</b> <i> On the left: influencers from various worlds twitting on how much they hate articles; on the right: the famous show "The Office", which changed the name on the same day.</i></p> </figcaption>
    </figure>


<h3> Providing an answer to the research question in the fake scenario</h3>
<p> To find a research answer to the <span style="color:#1DA1F2"> #fuckarticles</span> movement, we need to:</p>
    <ol>
        <li> access users' instagram private messages and assess their use of articles </li>
        <li> collect all the messages containing grammatical articles before the 15/10 (period1) </li>
        <li> collect all the messages containing grammatical articles after the 15/10 (period2) </li>
        <li> confront period1 and period2 and see if there are significant differences. </li>
     </ol>
<p> We will use the Instagram data provided by the D3I infrastracture. In particular, the DDP <tt class="str">'Instagram_data_zenodo'</tt>. </p>
        
        



<h3> Creating a script to provide an answer to the research question </h3>
<p> The instagram data inside the DDP are standard instagram data that can be downloaded by any user at <a>https://www.instagram.com/accounts/privacy_and_security/</a> in a couple of days. In every directory, we have different json files which contains different information about the user. To find the answer to the research question, we are intereseted in just two of them, namely <tt> messages.json</tt> (where all the messages are stored) and <tt> profile.json</tt> (exclusively for RQ2, where the basic user data such as the username are stored).
We will now develop a script that will guide us through the analysis of these 2 files.
<br/>
</p>

<h2> The script </h2>
<p> Given any <tt> messages.json </tt>, <tt>profile.json</tt> from Instagram. The script creates a scalable strcucture able to:</p>
<ol>
    <li> <a href ="#read">Read the json structure</a></li>
    <li> <a href ="#clean">Clean and anonymize the data </a> </li>
    <li> <a href ="#count">Find word matches in a given period of time </a> </li>
    <li> <a href ="#find">Find the json owner (for RQ2) </a>
    <li> <a href ="#both"> Summarise the process for both RQ1 AND RQ2</a> </li>
    
</ol>

<h3 id="read">Reading the json structure</h3>
<p>We start with understanding the json structure of a single <tt>message.json</tt> in the DDP. In order to do it properly, we create a function, <tt class="fun">find_keys()</tt>, which allows us to see the data type (<tt>dict, list</tt>) and all its unique keys. Let's upload a message.json from a random user: </p>

In [1]:
import json
with open("materials/100billionfaces_20201021/messages.json",'r') as jfile:
    messages = json.load(jfile) ##loads the json in python

def find_keys(jload):
    if len(jload) == 1: ## when the json is just a single element, the unique keys are the ones of that single element.
        return type(jload), jload.keys()
    
    keys = []
    for m in jload:
        try: m.keys()
        except AttributeError: return None ## when there are no keys left
        else: [keys.append(k) for k in m.keys() if not k in keys] ## appends in keys's list only the unique keys (no duplicates)
    return type(jload),keys

fk = find_keys(messages)

print(fk)


(<class 'list'>, ['participants', 'conversation'])


<p> We now know that our json is a list of dict, and every element looks like a chat, as it probably has both <tt class="str">'participants'</tt> and <tt class="str">'conversation'</tt> keys. For the rerearch question we are interested in finding where the corpus, the sender, and the date of the message is located. So I run again <tt class="fun"> find_keys()</tt>  </i> in order to get the <tt class="str">'conversation'</tt> keys (most likely the text is there) in a random element of the list (aka a single chat): </p> 
<br></br>

In [2]:
random_chat = messages[0]['conversation']
fk2 = find_keys(random_chat)

print(fk2)

(<class 'list'>, ['sender', 'created_at', 'story_share', 'text', 'link', 'likes', 'media'])


<p> it seems that we have found all the elements we need for our research answer! namely: the corpus of the message (<tt class="str">'text'</tt>), the date of the message(<tt class="str">'created_at'</tt>) and the sender of the message (<tt class="str">'sender'</tt>).
<br/>
    Let's see if they actually contain the information we need. Let's print all of them from a random message from the random chat: </p>

In [3]:
random_message = random_chat[1]
print(random_message["sender"][0:5]) ##respect user's privacy :D
print(random_message["text"])
print(random_message["created_at"])


ilike
You can check it on https://www.lovedance234.com
2020-10-20T08:03:58.405275+00:00


<h3 id="clean"> Clean and anonymize the data </h3>

<p> Now that we know where to find the date, corpus, and senders, it is time to filter the corpus we want for this particular study in the whole JSON. As we are treating personal data, we also need to anonymize any sensitive information that could be present, such as emails, phone numbers, senders' and sender friends' usernames. </p>
<p> let's proceed by defining a function, <tt class="fun"> clean_data()</tt> that will:
    <ol>
  <li>Select every message which contains proper corpus</li>
  <li>Filter only messages written in a given language (it'll be English) </li>
  <li>Censure every sensible information </li>
  <li>Anonymize the sender of any message</li>
  <li>Select only messages sent by the json's owner ( optional, for RQ2)
</ol>

<br>


In [4]:
import pycld2 as cld2 ## fast language detector 
import uuid ## A universally unique identifier, in order to anonymise user Id
import re  ## regular expression matching operations similar to those found in Perl 

def clean_data(data,      ##the messages.json 
               language,  ## the language we want to filter
               sender = None,  ##the sender we want to consider, it's None for RQ1 and the json's owner for RQ2
               divide_users = False, ## for RQ1 we don't divide users (False) while for RQ2 we do (True) 
               ):
    
    def anonymize_corpus(string): 
        string = string.lower()                                    ##transform all the text to lower case (it helps with matching regex)
        
        re_email = re.compile(r'[^@\s]+@[^@\s]+\.[a-zA-Z0-9]+$')   ##regular expresssion for emails
        re_email2 = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b') ##regular expresssion for emails n.2 (emails are difficult to anonymise :/)
        re_pn = re.compile(r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})') ##regular expression for phone numbers
        re_tags = re.compile(r'\B@\w+') ##regular expresssion for words that starts with "@", meaning all ig usernames
        
        operations = [(re_email,"**email**"),
                      (re_email2,"**email**"),
                      (re_pn, "**num**"),
                      (re_tags,"@***")]
        
        
        for regex,new_string in operations: 
            string = re.sub(regex, new_string, string)
        

        return string
    
    cleaned = []
    for chat in data:                              ## we start by iterating through every chat present in data;
        conversation = chat["conversation"]        ## we iterate through every chat's conversation;
        for message in conversation:               ## and we iterate through the messages in that conversation.
            if message.get("text"):                 ## We try to find a message with text, 
                text = message["text"]                                         ## when we find the corpus 
                message["sender"] = str(uuid.uuid5(uuid.NAMESPACE_URL,         ## we anonymize the sender.
                                    message["sender"]))

                if not text:                                                ## we ignore when text=None.
                    continue
                    
                isReliable, textBytesFound, details = cld2.detect(text)     ## we then pass the text into a language detector
                optional_con = (message["sender"] 
                                    == sender) if divide_users else True    ## if we want to select only the owner's messages

                if not(isReliable and                                       ## and we consider only the text 
                details[0][1] == language and                               ## which is accurate from the given language
                optional_con):
                    continue                                 
                                                                                
                ele = {}                                            ## we create a new dict
                ele["date"] = message["created_at"]                 ## with date,
                ele["text"] = anonymize_corpus(text)                ## the anonymized corpus
                ele["sender"] = message["sender"]                   ## and the anonymized sender.
                cleaned.append(ele)                                 ## We then append every valid dict in a list
                
    return(cleaned)                                                 ## that we'll return


<p> let's try it on some our random messages to see if it works: </p>

In [5]:
test_cl_data = clean_data(data=messages,
                          language="en")[:4]
for ele in test_cl_data:
    print("-",ele["text"])

- you can check it on https://www.lovedance234.com
- of course dummie!
- ok! alyssa an text me back at 06-**num**2
- did anyone of you already see some smurfen or other gnomes @*** @*** @*** @*** ? please send your pics to **email**


<h3 id="count">Find word matches in a given period of time</h3>

<p> Finally, it is time to define <tt class="fun">find_occurencies()</tt> which will be able to:
<ol>
    <li> filter data between a given date span </li>
    <li> select all the matching results given a list of strings </li>
</ol>
<p> We put <tt class="fun">find_occurencies()</tt> inside a <tt>class</tt>, <tt class="fun">FindMatch()</tt>, so that we can then decide to call different methods with different information about our final results, such as: 
    <ol>
    <li> A list of all the filtered data </li>
    <li> A list of all the matching data </li>
    <li> A list of all the matching data's corpus, date, and senders </li>
        <li> A list of all the unique matching data's corpuses </li>
        <li> A ratio of matching data on the filtered data </li>
        <li> An overview of the result </li>
</ol>
    

In [6]:
from collections import Counter ## container that keeps track of how many times equivalent values are added.
import dateutil.parser as dparser ## a date/time string parser (because our 'created_at' is type string, not type date)


class FindMatch():
    def __init__(self, 
                 cl_data,  ##the data obtained from clean_data()
                 wordlist, ## a list of words we want to find in the messages (e.g., articles)
                 period,   ## the time span we want to filter, e.g., a list of two datetime.datetime
                 sender=None): ##the json owner (RQ2)
        
        def find_occurencies(cl_data,
                             wordlist,
                             period):
            
            flt_list  = []         ##it will contain all the filtered elements from cl_data
            matches_all = []       ## it will contain all the matching elements from flt_list
            
            for message in cl_data:                                        ## we iterate through messages in the cleaned dataset 
                date = dparser.parse(message["date"],fuzzy=True)           ## we transform date class str into date class datetime.datetime

                if period[0] <= date <= period[1]:                         ## if the message's date is in the given date span
                    flt_list.append(message)                               ## we append it in our filtered_data list
                    
                    if not any(w in message["text"] for w in wordlist):
                        continue                                       ## if the message contains at least one match with our word list
                    matches_all.append(message)                        ## we append it in our matches_all list
                    
                        

            return {"filtered_data" : flt_list,
                    "matches_all": matches_all }
        
        res = find_occurencies(cl_data, wordlist, period)
        matches_all = res["matches_all"]
        filtered_data = res["filtered_data"]
        
        ## it returns a list of all the elements between a given date span
        self.filtered_data = filtered_data
        
        ## it returns a list of matching elements between a given date span
        self.matches_all = matches_all
        
        ## it returns a list of corpus of all the matching elements
        matches_corpus = []
        [matches_corpus.append(k["text"]) for k in matches_all]
        self.matches_corpus = matches_corpus
        
        ## it returns a list of date of all the matching elements
        matches_date = []
        [matches_date.append(k["date"]) for k in matches_all]
        self.matches_date = matches_date
         
        ## it returns a list of senders of all the matching elements    
        matches_senders = []
        [matches_senders.append(k["sender"]) for k in matches_all]
        self.matches_senders = matches_senders
        
        ## it returns the json owner (useful if divide_users=True)
        self.sender = sender if sender else "Multiple senders, please run FindMatch().senders"

        
        ##it returns a list of corpus of all the unique corpus value (to check presence of bot messages)
        self.matches_corpus_unique = list(Counter(matches_corpus))
        
        ## it returns the ratio of matching elements on the elements between a given date span
        self.ratio =  len(matches_all)/len(filtered_data) if (len(matches_all)+len(filtered_data) != 0) else None
        
        ## it returns an overview of the data
        self.overview = {"sender":sender if sender else "multiple senders",
                         "filtered_data": len(filtered_data),
                         "matches": len(matches_all),
                         "ratio":self.ratio }
        
            



<h3 id="find">Finding the json owner (for RQ2)</h3>

<p> This last function is just a way of understanding who is the owner of the json. This will help when we will analyse every single user (RQ2) instead of merging all the data together(RQ1). It will also be helpful for privacy reasons (e.g., we may not have the legal permission to look to other users' messages). </p>

<p> We therefore define <tt class="fun">find_sender()</tt> which will look into the directory to find the username in another json file (<tt class="str">"profile.json"</tt>). In case there are no <tt>profile.json</tt> files, the function will count the most common username across <tt>messages.json</tt> participants.
    

In [7]:
def find_sender(data,       ##the messages.json file
                directory): ##the directory in which the json files are
    sender = "anon"
    try: 
        open(directory+"/profile.json",'r') ##try to find the profile.json file into directory
    except FileNotFoundError:               ## in case you don't find profile.json
        user_list = []
        for chat in messages:                                 ## iterate through chat's participants 
            user_list = user_list + chat['participants']      
            sndr = Counter(user_list).most_common()[0][0]   ## find the most common participant
        
        
    else: 
        with open(directory+"/profile.json",'r') as jfile:    ## if you find profile.json
            profile = json.load(jfile) 
            sndr = profile["username"]                      ## you can find the username here
    anonymized_sender = str(uuid.uuid5(uuid.NAMESPACE_URL, sndr))   ##let's anonymized it!
    return(anonymized_sender)

<h3 id="both"> Summarise the process for both RQ1 AND RQ2</h3>

<p> We have all the elements we need in order to launch a script with our <tt>messages.json</tt> files! Let's define <tt class="fun">start_match()</tt>, a function that will launch all the functions we previously defined. Given a <tt>data</tt> (multiple messages.json), a <tt>wordlist</tt> (e.g., grammatical articles), and a <tt>period</tt> (our date span) the function will be able to return a clean result on merged json data (RQ1, <tt> divide_users = False </tt>).
<br/>
In addition, as we also want to investigate on single users occurencies (RQ2, <tt> divide_users = True </tt>), we can also provide a given directory in order to find the sender's name. </p>

In [8]:
def start_match(data,                   ##the messages.json
                wordlist,               ## a list of words we want to find in the messages (e.g., articles)
                period,                 ##the time span we want to filter, e.g., a list of two datetime.datetime
                directory = None,         ##the directory in which the json files are
                language = "en",
                divide_users = False,):   ## False for RQ1 and True for RQ2  
                
    ## if we want to investigate on single users (RQ2)
    if divide_users:         
        
        ##we first need to find our sender 
        sender = find_sender(data = data, 
                             directory = directory) 
        
        ##we then clean our data
        cl_data = clean_data(data = data,     
                             language = "en",
                             divide_users = True,
                             sender = sender
                            )
        
        ## and we find word matches 
        occur = FindMatch(cl_data = cl_data, 
                          wordlist = wordlist, 
                          period = period,
                         sender=sender)
        return occur
    #######
        
    ## if we want to investigate on single users (RQ1)
    else:
        
        ## we clean our dataa
        cl_data = clean_data(data = data,
                             language = language,)
        
        ## and we find word matches
        occur = FindMatch(cl_data = cl_data, 
                          wordlist = wordlist, 
                          period = period)
        return occur
        

<h2 id="merge"> RQ1: Merging the data </h2>
<h3> Importing the data </h3>

<p> We first consider the case in which we merge all the <tt>messages.json</tt> (<tt> divide_users = False </tt>)
<p> Let's import our <tt>messages.json</tt> and merge them together in order to  obtain the argument <tt>data</tt>:</p>

In [9]:
import os ## way of using operating system dependent functionality.

materials = os.getcwd()+"/materials" ## the directory 'materials' in our current working directory
subdir = next(os.walk(materials))[1] ## every subdirectory in a given directory

messages = []
for d in subdir:
        s_directory = materials+"/"+d
        try: open(s_directory+"/messages.json",'r')
        except FileNotFoundError: pass 
        else:
            with open(s_directory+"/messages.json",'r') as jfile:
                messages.extend(json.load(jfile)) ##append every json in result




find_keys(messages) ##let's check if the structure is the same for all the json files.

(list, ['participants', 'conversation'])

<h3> Defining the wordlist and the period </h3>
<p> Let's then define our argument <tt>wordlist</tt>, which is going to be <tt>articles</tt>, and <tt>period1</tt>, <tt>period2</tt> (we will launch the function twice), which are going to be our argument <tt>period</tt>.

In [10]:
import datetime ## python datetime library
import pytz ## accurate and cross platform timezone calculations

##wordlist
articles = [" a "," the "," an "]    

##period
period1 = [datetime.datetime(year=2020, month=10, day=7, tzinfo=pytz.UTC), 
           datetime.datetime(year=2020, month=10, day=14, tzinfo=pytz.UTC)]

##period
period2 = [datetime.datetime(year=2020, month=10, day=16, tzinfo=pytz.UTC),
           datetime.datetime(year=2020, month=10, day=23, tzinfo=pytz.UTC)]

<h3> Running the script </h3>

In [11]:
result_1_1 = start_match(
    data = messages,
    wordlist = articles,
    period = period1)


result_1_2 = start_match(
    data = messages,
    wordlist = articles,
    period = period2)


print("***** OVERVIEW PERIOD 1 *****")
print(json.dumps(result_1_1.overview,indent=3))
print("***** SOME CORPUS PERIOD 1 *****")
print(json.dumps(result_1_1.matches_all[:3],indent=3))
print("-------------")
print("***** OVERVIEW PERIOD 2 *****")
print(json.dumps(result_1_2.overview,indent=3))
print("***** SOME CORPUS PERIOD 2 *****")
print(json.dumps(result_1_2.matches_all[:3],indent=3))

***** OVERVIEW PERIOD 1 *****
{
   "sender": "multiple senders",
   "filtered_data": 22,
   "matches": 7,
   "ratio": 0.3181818181818182
}
***** SOME CORPUS PERIOD 1 *****
[
   {
      "date": "2020-10-13T09:55:12.968106+00:00",
      "text": "say yes to the dress",
      "sender": "adfb70ad-69bc-5967-aeed-0289e7d859d7"
   },
   {
      "date": "2020-10-12T20:15:43.753243+00:00",
      "text": "you can look who follows me an add them, they are all participants",
      "sender": "c4ee6575-6f58-5a54-9c6d-1093133ce5cc"
   },
   {
      "date": "2020-10-12T20:15:59.416476+00:00",
      "text": "you can look who follows me an add them, they are all participants",
      "sender": "c4ee6575-6f58-5a54-9c6d-1093133ce5cc"
   }
]
-------------
***** OVERVIEW PERIOD 2 *****
{
   "sender": "multiple senders",
   "filtered_data": 135,
   "matches": 24,
   "ratio": 0.17777777777777778
}
***** SOME CORPUS PERIOD 2 *****
[
   {
      "date": "2020-10-22T15:49:43.838170+00:00",
      "text": "or maybe i

<h3> Results and Discussion </h3><span><p>calculated in R </p></span>
<p> We can see that the ratio during period1 ( <i>r</i>=.32 ) is higher than the ratio in period2  ( <i>r</i>=.17 ). However, we see that the <tt>filtered_data</tt> from both period contains very small data, which is in contrast with the general computational analysis, which usually works with big databases. In other words, to investigate on significant changes in the two periods, we need to analyse a bigger sample of <tt class="str">'messages.json'</tt>.<p> 

<h2 id="merge"> RQ2: Analysing every single user </h2>

<p> Let's consider RQ2 (<tt> divide_users = True </tt>). As we don't need to merge the json, let's launch <tt class="fun"> start_match()</tt> while we iterate trough files. </p>
<p> We already have our <tt>period</tt> and <tt>wordlist</tt> arguments. We only need to change argument <tt>data</tt> and argument <tt>directory</tt> while looping.

In [12]:
import json
import os
import tabulate


result_2_1 = []

for d in subdir:
    s_directory = materials+"/"+d  # directory
    try: open(s_directory+"/messages.json",'r')
    except FileNotFoundError: pass
    else:
        with open(s_directory+"/messages.json",'r') as jfile:
            s_messages = json.load(jfile)  #data
            
            s_result_1 = start_match(
                    data = s_messages,
                    wordlist = articles,
                    period = period1,
                    directory = s_directory,
                    divide_users=True)
            
            result_2_1.append(s_result_1)

result_2_2 = []

for d in subdir:
    s_directory = materials+"/"+d
    try: open(s_directory+"/messages.json",'r')
    except FileNotFoundError: pass
    else:
        with open(s_directory+"/messages.json",'r') as jfile:
            s_messages = json.load(jfile)   
            
            s_result_2 = start_match(
                    data = s_messages,
                    wordlist = articles,
                    period = period2,
                    directory = s_directory,
                    divide_users=True)
            result_2_2.append(s_result_2)
            
            

ratio1 = []        
[ratio1.append(res1.overview) for res1 in result_2_1 ]

ratio2 = []        
[ratio2.append(res2.overview) for res2 in result_2_2 ]


header1 = ratio1[0].keys()
rows1 =  [x.values() for x in ratio1]
header2 = ratio2[0].keys()
rows2 =  [x.values() for x in ratio2]

print("******** OVERVIEW PERIOD 1 ********")
print(tabulate.tabulate(rows1, header1,tablefmt='rst'))
print("")
print("******** OVERVIEW PERIOD 2 ********")
print(tabulate.tabulate(rows2, header2,tablefmt='rst'))






******** OVERVIEW PERIOD 1 ********
sender                                  filtered_data    matches    ratio
b449fd02-660b-57b3-8cb4-9ab5883dbbd2                0          0
90a7952f-8cb3-5e45-b276-6472f3b02cbf                0          0
774a1aa2-85a8-5be8-bb89-d590e655c02b                1          0        0
077e7fde-9e3f-57e9-baeb-b418b9a0ee06                0          0
c4ee6575-6f58-5a54-9c6d-1093133ce5cc                2          2        1
dab501a0-9984-58f2-bf5a-6aed28efebcc                1          1        1
af5eedbb-6e9d-57a2-811d-4f369f34469c                1          0        0
c97ab426-aab5-5850-b5ce-8c749f9cdaef                1          0        0
1b49d9fb-5747-55f3-b9cd-0b53d3f8b35c                0          0
c4ee6575-6f58-5a54-9c6d-1093133ce5cc                0          0

******** OVERVIEW PERIOD 2 ********
sender                                  filtered_data    matches     ratio
b449fd02-660b-57b3-8cb4-9ab5883dbbd2                9          2  0.222222
90a7952f

<h3> Results and Discussion </h3><span><p>calculated in R </p></span>
<p> We perform a within-subject ANOVA to investigate differences ratios between the two periods. We find no significant differences, <i>F(1,12)</i>=1.1, <i>p</i>=.31 .</p>
<p>
For the reasons we highlighted during RQ1 discussion, we acknowledge that the sample size is too little to expect a significant difference between the two periods. </p>

<h2> Conclusion </h2>
<p>We could not find significant evidence for the invented scenario through our answers to both RQ1 and RQ2. 
<br/>
However, we can now apply the script to the study of every other wordlist and every other Instagram dataset. In other words, we can perform the same analysis on a sample of millions of users potentially. By developing the script, we wanted to make sure that it was scalable and functional when faced with other, more realistic scenarios. These include sentiment analysis to prevent the Werther effect's suicides (Yip & Pinkney, 2022), detecting political pulse (Salgado & Sanz, 2022), and all major social movements and media events that had and will engage users in the future years. In addition, if we can ask the users to donate us their Instagram private data, we will also be able to categorize every user with additional parameters such as gender, age, and location. This would lead to performing studies in a more targeted way. 
<br/>
Future works may focus on finding the best condition to develope a project such as the D3I. In this way, we will be able to finally analyse a large sample of personal data such as the one used in this script and extract all the information we cannot reach from users' public data. This will allow us to finally get closer to the general understanding of human behavior in digital environments.</p>

<h2> References </h2>
<ul>
<li>
Choi, S., Williams, D., & Kim, H. (2020). A snap of your true self: How self-presentation and temporal affordance influence self-concept on social media. New Media & Society. <a>https://doi.org/10.1177/1461444820977199</a> </li>
<br/>
<li>Folgado, M.G., Sanz, V. (2022)Exploring the political pulse of a country using data science tools. J Comput Soc Sc. <a>https://doi.org/10.1007/s42001-021-00157-1</a></li>
<br/>    
    <li>Gleason, B. (2013). #Occupy Wall Street: Exploring Informal Learning About a Social Movement on Twitter. American Behavioral Scientist, 57(7), 966–982. <a>https://doi.org/10.1177/0002764213479372</a>
    </li>  
    <br/>
<li>    
Hong, S., & Kim, S. H. (2016). Political polarization on Twitter: Implications for the use of social media in digital governments. Government Information Quarterly, 33(4), 777–782. <a>https://doi.org/10.1016/j.giq.2016.04.007.E13 </a>
    </li>
    
<br/>

<li>
Jelani Ince, Fabio Rojas & Clayton A. Davis (2017) The social media response to Black Lives Matter: how Twitter users interact with Black Lives Matter through hashtag use, Ethnic and Racial Studies, 40:11, 1814-1830, DOI: <a>https://doi.org/10.1080/01419870.2017.1334931</a>
    </li>
    <br/>
<li>
Lomanowska, A. M., & Guitton, M. J. (2016). Online intimacy and well-being in the digital age. Internet Interventions, 4, 138–144. doi: <a>https://doi.org/10.1016/j.invent.2016.06.005 </a>
    </li>
    <br/> 
<li>
Yip, P.S.F., Pinkney, E. (2022) Social media and suicide in social movements: a case study in Hong Kong. J Comput Soc Sc. <a>https://doi.org/10.1007/s42001-022-00159-7</a>
    </li>

</p>


In [13]:
%%html
<style>
h1,h2{
    color:#D37171
}
h3,p,li {
    
    color: #717070;
    line-height:150%
}

.cont {
  width:100%;
  text-align:center;
  display: inline-grid;
  justify-content:center;
flex-direction: column;
}

.t_cont {
    width:400px
}

.fun {
    color:blue
}
.str {
    color: #AB312A
}

.fun,.str{
    font-weight:400
}
.appendix{
    font-size:12px;
}

.img {
    width:50%;
    margin-top:50px;
    margin-bottom:50px;
    
}

tt{
    font-weight:800
}


</style>