# DA4. Some considerations on privacy

The data we are using was collected with the Twitter streaming API and contains only public tweets. That does not mean we can forgo privacy considerations here (alongside the other normative and ethical implications, which we should of course also consider carefully).


## A "dilemma"

* Open Science and replicability
* Privacy protection

## Some considerations

* Data minimization
* Data anonymization
* Data pseudonymization
* Profiling


To run this code, I am also using a Twitter text parser to detect usernames in the text (https://github.com/edmondburnett/twitter-text-python).

To install it:
```pip install twitter-text-python```

In [1]:
import pandas as pd
from ttp import ttp
p = ttp.Parser()

In [3]:
df = pd.read_pickle('/Users/theo/Documents/AssignmentsDA/3_DA3/TheoAraujo_export.pkl.zip')


In [3]:
len(df)

89858

In [4]:
df.columns

Index(['id', 'created_at', 'from_user_name', 'from_user_id', 'from_user_lang',
       'from_user_tweetcount', 'from_user_followercount',
       'from_user_friendcount', 'from_user_listed', 'from_user_realname',
       'from_user_utcoffset', 'from_user_timezone', 'from_user_description',
       'from_user_url', 'from_user_verified', 'from_user_profile_image_url',
       'from_user_created_at', 'from_user_withheld_scope',
       'from_user_favourites_count', 'source', 'location', 'geo_lat',
       'geo_lng', 'text', 'retweet_id', 'retweet_count', 'favorite_count',
       'to_user_id', 'to_user_name', 'in_reply_to_status_id', 'filter_level',
       'lang', 'possibly_sensitive', 'quoted_status_id', 'withheld_copyright',
       'withheld_scope'],
      dtype='object')

## Data minimization

* What is my RQ?
* Which variables are required to answer my RQ?

Anything else that is not needed, especially if it contains personally identifiable data, I should **delete**.

In [5]:
df = df[['id', 'created_at', 'from_user_name', 'from_user_lang', 'from_user_followercount',
       'from_user_friendcount', 'from_user_verified', 'from_user_description', 'text', 'lang']]

## Data pseudonymisation

Usernames may be relevant here, at least to know that two tweets come from the same user. That said, for almost all research I do not (in principle) need to know who the users are. In some contexts, however, I may need to know whether a public figure or organization is tweeting, and who they are.
* For those accounts (here: verified accounts), keep the user name.

**Note:** there are more elegant ways to pseudonymise the data (e.g., encryption), but here I am using some alternatives that also get the job done.
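The note above mentions encryption as a more elegant route. A closely related option, sketched below, is keyed hashing (HMAC): strictly speaking this is not encryption, but it gives every username a stable pseudonym that cannot be recomputed without the secret key. The names `SECRET_KEY` and `pseudonymise_name` are hypothetical and not part of the notebook. The sketch also assumes the key is stored separately from any data that gets shared.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key: generate it once and keep it outside the shared dataset.
SECRET_KEY = secrets.token_bytes(32)

def pseudonymise_name(username, key=SECRET_KEY, length=16):
    """Return a stable keyed pseudonym for a username (case-insensitive)."""
    normalised = str(username).lower().encode("utf-8")
    digest = hmac.new(key, normalised, hashlib.sha256).hexdigest()
    # Keep only the first `length` hex characters to get a shorter ID.
    return digest[:length]

example_a = pseudonymise_name("SomeUser")
example_b = pseudonymise_name("someuser")  # same user, different capitalisation
```

Both calls return the same pseudonym, so tweets by the same user can still be linked. Truncating the digest shortens the IDs at the cost of a (small) collision risk, so `length` should not be set too low for large datasets.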

### Creating a general dictionary for all users in the dataset

First, the users in the `from_user_name` column.



In [6]:
users = df[['from_user_name', 'from_user_verified']].drop_duplicates()

In [7]:
users.head()

Unnamed: 0,from_user_name,from_user_verified
0,icrowdnewswire,0
1,sandipk47733795,0
2,syabm97,0
3,JRuizAlt,0
4,AndyTheGuttler,0


Using the index to create a unique id

In [8]:
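# drop the old row labels first, so the new IDs run from 0 to len(users) - 1 and cannot overlap with IDs assigned later
users = users.reset_index(drop=True).reset_index()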

In [9]:
users.head()

Unnamed: 0,index,from_user_name,from_user_verified
0,0,icrowdnewswire,0
1,1,sandipk47733795,0
2,2,syabm97,0
3,3,JRuizAlt,0
4,4,AndyTheGuttler,0


In [10]:
users = users.rename(columns={'index': 'pseudID'})

In [11]:
users.head()

Unnamed: 0,pseudID,from_user_name,from_user_verified
0,0,icrowdnewswire,0
1,1,sandipk47733795,0
2,2,syabm97,0
3,3,JRuizAlt,0
4,4,AndyTheGuttler,0


In [12]:
def correct_verified(row):
    if row['from_user_verified'] == 1:
        row['pseudID'] = row['from_user_name']
    return row

In [13]:
users = users.apply(correct_verified, axis=1)

In [14]:
users[users['from_user_verified']==0].head()

Unnamed: 0,pseudID,from_user_name,from_user_verified
0,0,icrowdnewswire,0
1,1,sandipk47733795,0
2,2,syabm97,0
3,3,JRuizAlt,0
4,4,AndyTheGuttler,0


In [15]:
users[users['from_user_verified']==1].head()

Unnamed: 0,pseudID,from_user_name,from_user_verified
219,abo1fares,abo1fares,1
354,Braden_Keith,Braden_Keith,1
440,getcerebral,getcerebral,1
484,sree,sree,1
519,NewsCEO,NewsCEO,1


Transforming this dataframe into a dictionary:

In [16]:
userids = {}
for from_user_name, pseudID in users[['from_user_name', 'pseudID']].values.tolist():
    from_user_name = str(from_user_name).lower()
    userids[from_user_name] = str(pseudID).lower()

In [17]:
userids['sree']

'sree'

In [18]:
userids['braden_keith']

'braden_keith'

In [19]:
userids['sandipk47733795']

'1'

Getting all mentions in the text, and adding the users to the dictionary

In [20]:
counter = len(users)

In [21]:
counter

74877

In [22]:
for text in df['text'].unique().tolist():
    text_parsed = p.parse(text)
    users_in_text = text_parsed.users
    users_in_text = [str(user_in_text).lower() for user_in_text in users_in_text]
    for user_in_text in users_in_text:
        # only mentioned users that are not yet in the dictionary get a new ID
        if user_in_text not in userids:
            # store the ID as a string, like the IDs already in the dictionary
            userids[user_in_text] = str(counter)
            counter += 1

In [23]:
len(userids)

93541
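Because the dictionary is filled in two passes (authors first, then mentioned users), it may be worth checking that no pseudID ended up assigned to two different users. The following is a minimal sketch of such a check; `find_shared_pseudonyms` and the toy mapping are hypothetical, and in the notebook the function would be called on `userids`.

```python
from collections import defaultdict

def find_shared_pseudonyms(mapping):
    """Return {pseudID: [usernames]} for every pseudID used by more than one name."""
    names_by_id = defaultdict(list)
    for name, pseud_id in mapping.items():
        # Compare IDs as strings so that 1 and '1' count as the same ID.
        names_by_id[str(pseud_id)].append(name)
    return {pid: names for pid, names in names_by_id.items() if len(names) > 1}

# Toy mapping with one deliberate collision between "bob" and "carol".
toy_mapping = {"alice": "0", "bob": 1, "carol": "1"}
collisions = find_shared_pseudonyms(toy_mapping)
```

An empty result means every pseudID points to exactly one username.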

Editing the text to replace mentions with pseudIDs

In [24]:
def pseudonimise_text(text):
    import re
    text_parsed = p.parse(text)
    users_in_text = text_parsed.users
    users_in_text = [str(user_in_text).lower() for user_in_text in users_in_text]
    for user_in_text in users_in_text:
        # match only the complete @mention, so ordinary words and longer usernames are left untouched
        replacer = re.compile('@' + re.escape(user_in_text) + r'(?!\w)', re.IGNORECASE)
        text = replacer.sub('@' + str(userids[user_in_text]), text)
    return text
            
    

In [25]:
df['text_p'] = df['text'].apply(pseudonimise_text)

In [26]:
df[['text', 'text_p']].head()

Unnamed: 0,text,text_p
0,How To Understand the @Spotify Algorithm and G...,How To Understand the @74877 Algorithm and Get...
1,"RT @Tejasvi_Surya: Shri RL Kashyap, a noted ma...","RT @74878: Shri RL Kashyap, a noted mathematic..."
2,RT @okedkama: Sebab facebook algorithm memang ...,RT @74879: Sebab facebook algorithm memang blo...
3,RT @KyleMorgenstein: what’s wild about this sh...,RT @74880: what’s wild about this shit is how ...
4,"@happYord It doesn't ""mean"" anything lol, it d...","@74881 It doesn't ""mean"" anything lol, it does..."
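To see the replacement logic in isolation, here is a toy version that does not need the parser or the dataset. It rests on a simplifying assumption: a mention is an `@` that does not follow a word character, followed by letters, digits or underscores. The `ttp` parser may apply different rules. `MENTION` and `replace_mentions` are hypothetical names.

```python
import itertools
import re

# Simplified assumption about what counts as a mention (the real parser may differ).
MENTION = re.compile(r"(?<!\w)@(\w+)", re.ASCII)

def replace_mentions(text, mapping, counter):
    """Replace every @mention with '@<pseudID>', adding unseen users to mapping."""
    def _sub(match):
        name = match.group(1).lower()
        if name not in mapping:
            mapping[name] = str(next(counter))
        return "@" + mapping[name]
    return MENTION.sub(_sub, text)

toy_ids = {}
toy_counter = itertools.count()  # yields 0, 1, 2, ...
toy_text = "RT @Alice: thanks @bob and @ALICE, mail me at x@example.org"
toy_result = replace_mentions(toy_text, toy_ids, toy_counter)
```

Both spellings of `@Alice` receive the same ID, and the e-mail address is left alone because its `@` follows a letter.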


In [27]:
del df['text']

## Now pseudonymizing the from_user_name column

In [28]:
def pseudonimise_user(row):
    row['pseudID'] = userids[str(row['from_user_name']).lower()]
    if row['from_user_verified'] == 1:
        return row
    else:
        row['from_user_description'] = 'removed'
        return row
        

In [29]:
df = df.apply(pseudonimise_user, axis=1)

In [30]:
df.head()

Unnamed: 0,id,created_at,from_user_name,from_user_lang,from_user_followercount,from_user_friendcount,from_user_verified,from_user_description,lang,text_p,pseudID
0,1354112097425235969,2021-01-26 17:00:56,icrowdnewswire,,1277,4031,0,removed,en,How To Understand the @74877 Algorithm and Get...,0
1,1354112111723462657,2021-01-26 17:01:00,sandipk47733795,,5,25,0,removed,en,"RT @74878: Shri RL Kashyap, a noted mathematic...",1
2,1354112119994638336,2021-01-26 17:01:02,syabm97,,370,454,0,removed,in,RT @74879: Sebab facebook algorithm memang blo...,2
3,1354112134708228096,2021-01-26 17:01:05,JRuizAlt,,268,1006,0,removed,en,RT @74880: what’s wild about this shit is how ...,3
4,1354112251968499713,2021-01-26 17:01:33,AndyTheGuttler,,67,57,0,removed,en,"@74881 It doesn't ""mean"" anything lol, it does...",4


In [31]:
df[df['from_user_verified']==1].head()

Unnamed: 0,id,created_at,from_user_name,from_user_lang,from_user_followercount,from_user_friendcount,from_user_verified,from_user_description,lang,text_p,pseudID
224,1354118594523770882,2021-01-26 17:26:45,abo1fares,,149675,602,1,الأمين العام لـ #حزب_التجمع_الوطني\nفريق يدير ...,en,"RT @74887: In Saudi Arabia, a hashtag about un...",abo1fares
365,1354121632600752130,2021-01-26 17:38:50,Braden_Keith,,3162,1265,1,"Co-Founder, Editor-in-Chief of @swimswamnews.",en,50% of TikTok content is about how to game the...,braden_keith
451,1354123559921180674,2021-01-26 17:46:29,getcerebral,,4724,1459,1,@m00nphysics / artist mgmt / words / they/them...,en,RT @124: If you are an Algorithm please have y...,getcerebral
495,1354124868690849793,2021-01-26 17:51:41,sree,,84352,9906,1,Your virtual events + my talk shows (#SreeShow...,en,"RT @75021: A few months ago, TikTok execs, at ...",sree
530,1354126074729426946,2021-01-26 17:56:29,NewsCEO,,4038,700,1,President & CEO of News Media Alliance (@NewsA...,en,"RT @75021: A few months ago, TikTok execs, at ...",newsceo


In [32]:
del df['from_user_name']

In [33]:
df.head()

Unnamed: 0,id,created_at,from_user_lang,from_user_followercount,from_user_friendcount,from_user_verified,from_user_description,lang,text_p,pseudID
0,1354112097425235969,2021-01-26 17:00:56,,1277,4031,0,removed,en,How To Understand the @74877 Algorithm and Get...,0
1,1354112111723462657,2021-01-26 17:01:00,,5,25,0,removed,en,"RT @74878: Shri RL Kashyap, a noted mathematic...",1
2,1354112119994638336,2021-01-26 17:01:02,,370,454,0,removed,in,RT @74879: Sebab facebook algorithm memang blo...,2
3,1354112134708228096,2021-01-26 17:01:05,,268,1006,0,removed,en,RT @74880: what’s wild about this shit is how ...,3
4,1354112251968499713,2021-01-26 17:01:33,,67,57,0,removed,en,"@74881 It doesn't ""mean"" anything lol, it does...",4


# Questions you cannot solve with code

* Can we share the data openly, even if minimized? In some cases, only tweet IDs are shared.
 * In general: share only the tweet IDs.
* Can we anonymize public data if we still share the text (or some version of it)?
 * No: the original tweet, and with it the author, can usually be found by searching for the text.
* Are we doing user profiling?
 * Be careful when the analysis focuses on specific users.
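
The first point above (sharing only tweet IDs) is the one part that code can help with. A minimal sketch, using only the standard library: `export_tweet_ids` is a hypothetical helper, the two IDs are taken from the first rows shown earlier, and the output path is a placeholder.

```python
import tempfile
from pathlib import Path

def export_tweet_ids(tweet_ids, path):
    """Write the unique tweet IDs, one per line, and return how many were written."""
    unique_ids = sorted({str(tweet_id) for tweet_id in tweet_ids})
    Path(path).write_text("\n".join(unique_ids) + "\n", encoding="utf-8")
    return len(unique_ids)

# The first ID appears twice on purpose to show that duplicates are dropped.
sample_ids = [1354112097425235969, 1354112111723462657, 1354112097425235969]
out_path = Path(tempfile.gettempdir()) / "tweet_ids.txt"  # placeholder location
n_written = export_tweet_ids(sample_ids, out_path)
```

With the dataframe used in this notebook, the input would be `df['id'].tolist()`. Others can then retrieve the tweets themselves from the IDs, as long as those tweets are still public.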
