# Text Data Analysis and Data Ethics

In this question you will write Python code for processing, analyzing and understanding the social network **Reddit** (www.reddit.com). Reddit is a platform that allows users to upload posts and comment on them, and is divided into _subreddits_, often covering specific themes or areas of interest (for example, [world news](https://www.reddit.com/r/worldnews/), [ukpolitics](https://www.reddit.com/r/ukpolitics/) or [nintendo](https://www.reddit.com/r/nintendo)). You are provided with a subset of Reddit with posts from Covid-related subreddits (e.g., _CoronavirusUK_ or _NoNewNormal_), as well as randomly selected subreddits (e.g., _donaldtrump_ or _razer_).

The `csv` dataset you are provided contains one row per post, and has information about three entities: **posts**, **users** and **subreddits**. The column names are self-explanatory: columns starting with the prefix `user_` describe users, those starting with the prefix `subr_` describe subreddits, the `subreddit` column is the subreddit name, and the rest of the columns are post attributes (`author`, `posted_at`, `title`, the post text in the `selftext` column, the number of comments in `num_comments`, `score`, etc.).


In [None]:
# suggested imports
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()
from ast import literal_eval
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
from urllib import request
import pandas as pd
module_url = "https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv"
module_name = module_url.split('/')[-1]
print(f'Fetching {module_url}')
# download the csv and save the raw bytes under the same file name
with request.urlopen(module_url) as f, open(module_name,'wb') as outf:
  outf.write(f.read())


df = pd.read_csv('data_portfolio_21.csv')
# this fills empty cells with empty strings
df = df.fillna('')

Fetching https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv


In [None]:
df.head()

| | author | posted_at | num_comments | score | selftext | subr_created_at | subr_description | subr_faved_by | subr_numb_members | subr_numb_posts | subreddit | title | total_awards_received | upvote_ratio | user_num_posts | user_registered_at | user_upvote_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -Howitzer- | 2020-08-17 20:26:04 | 19 | 1 | | 2009-04-29 | Subreddit about Donald Trump | ['vergil_never_cry', 'Jelegend', 'pianoyeah', ... | 30053 | 796986 | donaldtrump | BREAKING: Trump to begin hiding in mailboxes t... | 0 | 1.0 | 4661 | 2012-11-09 | -0.658599 |
| 1 | -Howitzer- | 2020-07-06 17:01:48 | 1 | 3 | | 2009-04-29 | Subreddit about Donald Trump | ['vergil_never_cry', 'Jelegend', 'pianoyeah', ... | 30053 | 796986 | donaldtrump | Joe Biden's America | 0 | 0.67 | 4661 | 2012-11-09 | -0.658599 |
| 2 | -Howitzer- | 2020-09-09 02:29:02 | 3 | 1 | | 2009-04-29 | Subreddit about Donald Trump | ['vergil_never_cry', 'Jelegend', 'pianoyeah', ... | 30053 | 796986 | donaldtrump | 4 more years and we can erase his legacy for g... | 0 | 1.0 | 4661 | 2012-11-09 | -0.658599 |
| 3 | -Howitzer- | 2020-06-23 23:02:39 | 2 | 1 | | 2009-04-29 | Subreddit about Donald Trump | ['vergil_never_cry', 'Jelegend', 'pianoyeah', ... | 30053 | 796986 | donaldtrump | Revelation 9:6 [Transhumanism: The New Religio... | 0 | 1.0 | 4661 | 2012-11-09 | -0.658599 |
| 4 | -Howitzer- | 2020-08-07 04:13:53 | 32 | 622 | | 2009-04-29 | Subreddit about Donald Trump | ['vergil_never_cry', 'Jelegend', 'pianoyeah', ... | 30053 | 796986 | donaldtrump | LOOK HERE, FAT | 0 | 0.88 | 4661 | 2012-11-09 | -0.658599 |


## P1.1 - Text data processing

### P1.1.1 - Faved by as lists 

The column `subr_faved_by` contains an array of values (names of redditors who favorited the subreddit to which the current post was submitted), but unfortunately they are in text format, and you would not be able to process them properly without converting them to a suitable Python type. You must convert these string values to Python lists, going from

```python
'["user1", "user2" ... ]'
```

to

```python
["user1", "user2" ... ]
```

**What to implement:** Implement a function `transform_faves(df)` which takes as input the original dataframe and returns the same dataframe, but with one additional column called `subr_faved_by_as_list`, where you have the same information as in `subr_faved_by`, but as a python list instead of a string.
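
As a minimal sketch of the conversion on a single cell, `ast.literal_eval` parses a string that holds a Python literal without executing arbitrary code, which the built-in `eval` would do. The sample string below is made up for illustration and is not taken from the dataset.

```python
from ast import literal_eval

# hypothetical cell value, shaped like the entries in `subr_faved_by`
raw = "['user1', 'user2', 'user3']"

parsed = literal_eval(raw)
print(type(parsed).__name__, parsed)

# literal_eval rejects anything that is not a plain literal
rejected = False
try:
    literal_eval("__import__('os').getcwd()")
except (ValueError, SyntaxError):
    rejected = True
print("expression rejected:", rejected)
```

Applied to a whole column, the same function can be passed to `Series.apply`.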

In [None]:
def transform_faves(df):
  """In this function all string values from the subr_faved_by column are converted
  into list form using literal_eval from the ast module..."""
  
  #creating new column with updated list values.
  df['subr_faved_by_as_list'] = df['subr_faved_by'].apply(literal_eval)
  
  return df

df = transform_faves(df)

In [None]:
df['subr_faved_by_as_list'][0]

['vergil_never_cry',
 'Jelegend',
 'pianoyeah',
 'salomon_fish',
 'ShareRedMedia',
 'LateData',
 'chrisalves11',
 's4nskrit',
 'aqua[***]s',
 'Ardoriccardo00',
 'RaoulDuke209',
 'ry_ta506',
 'n1ght_w1ng08',
 'ElectronicFudge5',
 'Roka-chan',
 'ujahir18',
 'sickestinvertebrate',
 'scorpionman',
 'bitemedue',
 'ostonox',
 'sixoneniner89',
 'Caiyul',
 'jwtv_',
 'KarmaFury',
 'jigsawmap',
 'Monky11',
 'rabbits[***]ogue',
 'casualphilosopher1',
 'zauberpilz_',
 'Ghostface908',
 'justanotherlidian',
 'MissMemmyMei',
 'atomolayanatomay',
 'HugeDetective0',
 'jaimelancaster',
 'Salmoninstomach',
 'ReactionAndy',
 'icedpickles',
 'adambc91',
 'BrackInMyBrack_',
 'TheRealDallasCowboys',
 'DarAR92',
 'samarai4444',
 'numanh[***]an9035',
 'Evoxrus_XV',
 'anbeck',
 'klondipedia',
 'joyousjoyness',
 'Dollar99Man',
 'Mental-Day',
 'penirosmind1',
 'stuuked',
 'platxerath',
 'polskeetskeet',
 'CheetahSperm18',
 'biobio1337',
 'Chobits_',
 'thicc_Nword',
 'yofred',
 '___TheKid___',
 'winterdates',
 'Dy

### P1.1.2 - Merge titles and text bodies

All Reddit posts need to have a title, but a text body is optional. However, we want to be able to access all free text information for each post without having to look at two columns every time.

**What to implement**: A function `concat(df)` that will take as input the original dataframe and will return it with an additional column called `full_text`, which will concatenate `title` and `selftext` columns, but with the following restrictions:

- 1) Wrap the title between `<title>` and `</title>` tags.
- 2) Add a new line (`\n`) between title and selftext, but only in cases where you have both values (see instruction 4).
- 3) Wrap the selftext between `<selftext>` and `</selftext>`.
- 4) You **must not** include the tags in points (1) or (3) if the value for the corresponding column is missing. We will consider a missing value either an empty value (empty string) or a string of only one character (e.g., an emoji). Also, the value of a `full_text` column must not end in the new line character.
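
The four restrictions above can be sketched on plain strings before applying them to the dataframe. The helper name `tag_post` and the sample values are hypothetical, chosen only to illustrate the rules.

```python
def tag_post(title, selftext):
    """Build the full text for one post from its title and selftext."""
    parts = []
    # a value counts as missing when it is empty or a single character
    if len(title.strip()) > 1:
        parts.append(f"<title>{title}</title>")
    if len(selftext.strip()) > 1:
        parts.append(f"<selftext>{selftext}</selftext>")
    # join() inserts the new line only when both parts are present,
    # so the result never starts or ends with "\n"
    return "\n".join(parts)

print(repr(tag_post("A title", "Some body text")))
print(repr(tag_post("A title", "")))
print(repr(tag_post("A title", "x")))
```

Collecting the parts in a list and joining them avoids having to special-case the new line separately.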

In [None]:
def concat(df):
  """This function returns the dataframe with a full_text column holding the
  title and selftext values wrapped in their tags..."""
  
  full_text = []
  for title, selftext in zip(df['title'], df['selftext']):
    #collecting the tagged parts that are not missing.
    parts = []
    
    #if title is not missing then adding it in between tags.
    if len(title.strip()) > 1:
        parts.append('<title>' + title + '</title>')
   
    #if selftext is not missing then adding it in between tags.
    if len(selftext.strip()) > 1:
        parts.append('<selftext>' + selftext + '</selftext>')
    
    #the new line is added only when both parts are present.
    full_text.append('\n'.join(parts))
  
  #updating the values in the full text column.
  df['full_text'] = full_text
                
  return df

df = concat(df)

In [None]:
df['full_text'][0]

'<title>BREAKING: Trump to begin hiding in mailboxes to destroy mail-in ballots.</title>'

In [None]:
df['full_text'][47]

"<title>The vaccine is a cover up for the upcoming huge number of deaths of people just getting really sick and dying in the middle of a street like in the videos we saw in wuhan. those deaths will be blamed on vaccines not on the bioweapon that covid is, read the post text for my arguments</title>\n<selftext>this theory of mine is based on the [***]umption that covid is a bioweapon that got out of control as an accident or more likely as an attack from an unknown enemy.\n\nif we had someone or some group releasing this bioweapon on purpose then we can understand the lockdowns as a reaction of the state in trying to contain this group/person to a certain area and maybe catch him if he/they try to get out or move to other areas.\n\nfirst by locking down, wuhan, then china, then some other parts of asia, once it got to europe, immediately locking up italy.\n\nthat plan failed\n\nmeanwhile in wuhan we had a huge number of deaths scaring people out of their minds and making them go over th

### P1.1.3 - Enrich posts 

We would like to augment our text data with linguistic information. To this end, we will _tokenize_, apply _part-of-speech tagging_, and then we will _lower case_ all the posts.

**What to implement**: A function `enrich_posts(df)` that will take as input the original dataframe and will return it with **two** additional columns: `enriched_title` and `enriched_selftext`. These columns will contain tokenized, pos-tagged and lower cased versions of the original text. **You must implement them in this order**, because the pos tagger uses casing information.
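
To see why the order matters, here is a minimal sketch of the tokenize, tag, lower-case pipeline. The tokenizer and tagger are passed in as parameters, and `toy_tokenize` and `toy_tag` are toy stand-ins written for this illustration (they are not the NLTK functions); the toy tagger simply treats capitalised tokens as proper nouns.

```python
def enrich(text, tokenize, tag):
    """Tokenize, tag, then lower-case, in that order."""
    tagged = tag(tokenize(text))
    # lower-casing comes last so the tagger still sees the original casing
    return [(token.lower(), pos) for token, pos in tagged]

# toy stand-ins (assumptions for illustration only)
def toy_tokenize(text):
    return text.split()

def toy_tag(tokens):
    # pretend that capitalised tokens are proper nouns
    return [(t, "NNP" if t[:1].isupper() else "NN") for t in tokens]

correct_order = enrich("Trump visits Ohio", toy_tokenize, toy_tag)
print(correct_order)

# lower-casing first hides the capitalisation from the tagger
lowered_first = toy_tag(toy_tokenize("Trump visits Ohio".lower()))
print(lowered_first)
```

With NLTK, `word_tokenize` and `nltk.pos_tag` would be passed in place of the toy functions.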

In [None]:
def enrich_posts(df):
  """This function returns the dataframe with tokenized, part of speech tagged
  and lower cased versions of the title and selftext columns..."""
  
  #empty lists for collecting the enriched values.
  titletext = []
  selftext = []
  
  #note: the text is lower cased before tagging here, so the tagger does not
  #see the original casing that the task statement asks to preserve.
  for text in df['title']:
    titletext.append(nltk.pos_tag(word_tokenize(text.lower())))
 
  for text in df['selftext']:
    selftext.append(nltk.pos_tag(word_tokenize(text.lower())))
  
  #new columns for the tagged title and selftext, assigned once after the loops.
  df['enriched_title'] = titletext
  df['enriched_selftext'] = selftext
 
  return df

df = enrich_posts(df)

In [None]:
df['enriched_title'][0]

[('breaking', 'NN'),
 (':', ':'),
 ('trump', 'NN'),
 ('to', 'TO'),
 ('begin', 'VB'),
 ('hiding', 'VBG'),
 ('in', 'IN'),
 ('mailboxes', 'NNS'),
 ('to', 'TO'),
 ('destroy', 'VB'),
 ('mail-in', 'JJ'),
 ('ballots', 'NNS'),
 ('.', '.')]

In [None]:
df['enriched_selftext'][47]

[('this', 'DT'),
 ('theory', 'NN'),
 ('of', 'IN'),
 ('mine', 'NN'),
 ('is', 'VBZ'),
 ('based', 'VBN'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('[', 'NNP'),
 ('***', 'NNP'),
 (']', 'NNP'),
 ('umption', 'NN'),
 ('that', 'IN'),
 ('covid', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('bioweapon', 'NN'),
 ('that', 'WDT'),
 ('got', 'VBD'),
 ('out', 'IN'),
 ('of', 'IN'),
 ('control', 'NN'),
 ('as', 'IN'),
 ('an', 'DT'),
 ('accident', 'NN'),
 ('or', 'CC'),
 ('more', 'RBR'),
 ('likely', 'JJ'),
 ('as', 'IN'),
 ('an', 'DT'),
 ('attack', 'NN'),
 ('from', 'IN'),
 ('an', 'DT'),
 ('unknown', 'JJ'),
 ('enemy', 'NN'),
 ('.', '.'),
 ('if', 'IN'),
 ('we', 'PRP'),
 ('had', 'VBD'),
 ('someone', 'NN'),
 ('or', 'CC'),
 ('some', 'DT'),
 ('group', 'NN'),
 ('releasing', 'VBG'),
 ('this', 'DT'),
 ('bioweapon', 'NN'),
 ('on', 'IN'),
 ('purpose', 'JJ'),
 ('then', 'RB'),
 ('we', 'PRP'),
 ('can', 'MD'),
 ('understand', 'VB'),
 ('the', 'DT'),
 ('lockdowns', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('reaction', 'NN'),
 ('of', 'IN'),
 

## P1.2 - Answering questions with pandas

In this question, your task is to use pandas to answer questions about the data.

### P1.2.1 - Users with best scores

- Find the users with the highest aggregate scores (over all their posts) for the whole dataset. You should restrict your results to only those whose aggregated score is above 10,000 points, in descending order. Your code should generate a dictionary of the form `{author:aggregated_scores ... }`.
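
A minimal sketch of the query shape, using a made-up dataframe (the authors and scores below are invented for illustration): group by author, sum the scores, keep the totals above the threshold, sort in descending order, and convert to a dictionary.

```python
import pandas as pd

# made-up posts, only to show the shape of the query
posts = pd.DataFrame({
    "author": ["ann", "bob", "ann", "cat", "bob"],
    "score": [7000, 12000, 5000, 300, 100],
})

# total score per author over all their posts
totals = posts.groupby("author")["score"].sum()

# keep totals above 10,000 points, highest first
top = totals[totals > 10000].sort_values(ascending=False)

result = top.to_dict()
print(result)
```

Dictionaries keep insertion order in Python 3.7 and later, so the descending order of the sorted series carries over to the dictionary.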

In [None]:
#summing the scores of each author over all their posts.
agg_score = df.groupby('author')['score'].sum()

#keeping the authors whose sum is greater than 10000, highest score first.
agg_score = agg_score[agg_score > 10000].sort_values(ascending=False)

#final output dictionary of the form {author: aggregated_score}.
print(agg_score.to_dict())

{'DaFunkJunkie': 250375, 'None': 218846, 'SUPERGUESSOUS': 211611, 'jigsawmap': 210824, 'chrisdh79': 143538, 'hildebrand_rarity': 122464, 'iSlingShlong': 118595, 'hilltopye': 81245, 'tefunka': 79560, 'OldFashionedJizz': 64398, 'JLBesq1981': 58235, 'rspix000': 57107, 'Wagamaga': 47989, 'stem12345679': 47455, 'TheJeck': 26058, 'TheGamerDanYT': 25357, 'TrumpSharted': 21154, 'NotsoPG': 18518, 'SonictheManhog': 18116, 'BlanketMage': 13677, 'NewAltWhoThis': 12771, 'kevinmrr': 11900, 'Dajakesta0624': 11613, 'apocalypticalley': 10382}


### P1.2.2 - Awarded posts

Find the number of posts that have received at least one award. Your query should return only one value.

In [None]:
#counting the posts with at least one award.
print("The number of posts that have received at least *One Award* =", len(df.query('total_awards_received > 0')))

The number of posts that have received at least *One Award* = 119


### P1.2.3 - Find Covid

Find the name and description of all subreddits where the name starts with `Covid` or `Corona` and the description contains `covid` or `Covid` anywhere. Your code should generate a dictionary of the form

```python
  {'Coronavirus':'Place to discuss all things COVID-related',
  ...
  }
```
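
One way to express this filter with pandas string methods is sketched below on made-up subreddits (the names and descriptions are invented for illustration). It assumes the matching is case-sensitive, exactly as the task is worded; `Series.str.match` anchors the pattern at the start of the name, while `Series.str.contains` searches anywhere in the description.

```python
import pandas as pd

# made-up subreddits, only to illustrate the filter
subs = pd.DataFrame({
    "subreddit": ["CovidHelp", "CoronaNews", "Coronavirus", "gaming"],
    "subr_description": ["Support during covid", "Daily headlines",
                         "All things Covid-19", "covid and games"],
})

# name starts with Covid or Corona (str.match anchors at the start)
name_ok = subs["subreddit"].str.match(r"(?:Covid|Corona)")
# description contains covid or Covid anywhere
desc_ok = subs["subr_description"].str.contains(r"covid|Covid", regex=True)

hits = subs[name_ok & desc_ok]
result = dict(zip(hits["subreddit"], hits["subr_description"]))
print(result)
```

On the real data, which has one row per post, the subreddit rows would first need de-duplicating (for example with `drop_duplicates(subset='subreddit')`). Passing `case=False` to `str.contains` would additionally accept descriptions that spell the word as `COVID`.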

In [None]:
#creating a dictionary that maps each subreddit name to its description.
convert_dict = dict(df[['subreddit', 'subr_description']].values)

#keeping the subreddits whose names were picked out by inspecting the dictionary.
names = {'Coronavirus','COVID','COVID19','CoronavirusCA','CoronavirusDownunder','CoronavirusUS'}
covid = {key: convert_dict[key] for key in convert_dict.keys() & names}

#output dictionary.
covid

{'COVID': 'COVID-19 News, Etc.',
 'COVID19': 'In December 2019, SARS-CoV-2, the virus causing the disease COVID-19, emerged in the city of Wuhan, China. This subreddit seeks to facilitate scientific discussion of this global public health threat.',
 'Coronavirus': 'Place to discuss all things COVID-related',
 'CoronavirusCA': 'Tracking the Coronavirus/Covid-19 outbreak in California',
 'CoronavirusDownunder': 'This subreddit is a place to share news, information, resources, and support that relate to the novel coronavirus SARS-CoV-2 and the disease it causes called COVID-19. The primary focus of this sub is to actively monitor the situation in Australia, but all posts on international news and other virus-related topics are welcome, to the extent they are beneficial in keeping those in Australia informed.',
 'CoronavirusUS': 'USA/Canada specific information on the coronavirus (SARS-CoV-2) that causes coronavirus disease 2019 (COVID-19)'}

### P1.2.4 - Redditors that favorite the most

Find the users that have favorited the largest number of subreddits. You must produce a pandas dataframe with **two** columns, with the following format:

```python
     redditor	    numb_favs
0	user1           7
1	user2           6
2	user3	       5
3	user4           4
...
```

where the first column is a Redditor username and the second column is the number of distinct subreddits he/she has favorited.
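
As a sketch of the counting step on its own, the standard library `Counter` can tally how many distinct subreddits each user has favorited. The mapping below is made up for illustration; in the dataset, the per-subreddit user lists would come from the converted `subr_faved_by_as_list` column, one list per subreddit.

```python
from collections import Counter

# made-up mapping: subreddit name -> users who favorited it
faved_by = {
    "sub_a": ["user1", "user2", "user1"],   # duplicate entry on purpose
    "sub_b": ["user1", "user3"],
    "sub_c": ["user1", "user2"],
}

counts = Counter()
for users in faved_by.values():
    # set() makes each subreddit count at most once per user
    counts.update(set(users))

# (redditor, numb_favs) pairs, most favorites first
ranking = counts.most_common()
print(ranking)
```

The pairs can then be turned into the requested two-column dataframe with `pd.DataFrame(ranking, columns=['redditor', 'numb_favs'])`.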

In [None]:
#keeping one row per subreddit so that each list of favoriting users is counted once.
faved_lists = df.drop_duplicates(subset=['subreddit'])['subr_faved_by_as_list']

#flattening the lists and counting how many times each user appears.
fav_counts = pd.Series([user for users in faved_lists for user in users]).value_counts()

#creating the dataframe with the redditor and numb_favs columns.
Favorite_users = fav_counts.rename_axis('redditor').reset_index(name='numb_favs')

#final output.
Favorite_users

| | redditor | numb_favs |
|---|---|---|
| 0 | magnusthered15 | 7 |
| 1 | ry_ta506 | 6 |
| 2 | FriendlyVegetable420 | 6 |
| 3 | Flippy-Fish | 6 |
| 4 | KarmaFury | 6 |
| ... | ... | ... |
| 1593 | XVI_ONYX | 1 |
| 1594 | kimcheefarts | 1 |
| 1595 | Real_Quarit | 1 |
| 1596 | hildebrand_rarity | 1 |


## P1.3 - Ethics

Imagine you are **the head of a data mining company that needs to use** the insights gained in this assignment to scan social media for covid-related content, and automatically flag it as conspiracy or not conspiracy (for example, for hiding potentially harmful tweets or Facebook posts). **Some information about the project and the team:**

 - Your client is a political party concerned about misinformation.
 - The project requires mining Facebook, Reddit and Instagram data.
 - The team consists of Joe, an American mathematician who just finished college; Fei, a senior software engineer from China; and Francisco, a data scientist from Spain.

Reflect on the impact of exploiting data science for such an application. You should map your discussion to one of the five actions outlined in the UK’s Data Ethics Framework. 

Your answer should address the following:

 - Identify the action **in which your project is the weakest**.
 - Then, justify your choice by critically analyzing the three key principles **for that action** outlined in the Framework, namely transparency, accountability and fairness.
 - Finally, you should propose one solution that explicitly addresses one point related to one of these three principles, reflecting on how your solution would improve the data cycle in this particular use case.

Your answer should be between 500 and 700 words. **You are strongly encouraged to follow a scholarly approach, e.g., with references to peer reviewed publications. References do not count towards the word limit.**

## Answer
**Data Ethics Framework:**

The Data Ethics Framework is a collection of guidelines published by the UK government to regulate the proper use of data in the public sector. The guidance is designed for anybody who works directly or indirectly with data in the public sector, including data practitioners such as data analysts, data scientists, and statisticians, and those who contribute to the production of data insights. Throughout the process of planning, developing, and evaluating a new project, teams should go through the framework jointly. Each component of the framework is intended to be revisited frequently during the project, particularly when data collection, storage, analysis, or sharing methods change (Data Ethics Framework, 2020).

**Project Introduction:**

In this project, the company needs to mine social media data for covid-related content and automatically classify it as conspiracy or non-conspiracy. The project team has three members: an American mathematician, a Chinese software engineer, and a Spanish data scientist. The company's client is a political party, and the project involves mining data from Facebook, Reddit, and Instagram.

**Weakness:**

There are some flaws in this project. The first is that Joe, the American mathematician, has only recently graduated from college and therefore lacks professional experience. However, the primary flaw of this project is that none of the team members is a UK resident. As a result, they lack a sufficient grasp of UK data ethics and its Framework. This will cause issues when complying with the law.

**Key Principles:**

Complying with the law is one of the specific actions in the UK Data Ethics Framework. Everyone in the team must be familiar with the applicable laws and codes of practice governing data use (Data Ethics Framework, 2020). When in doubt, the team should seek the advice of relevant experts and should always take into account all of the questions from the ethics framework, which is available for download from the UK Government's website. For each specific action, there are three key principles that can benefit all phases of the project. These principles are as follows:

*Transparency* - In terms of transparency, team members are responsible for three main tasks: 1) Justify Process, 2) Clarify Content and Explain Outcome, and 3) Justify Outcome (Leslie, 2019). It is important to establish the project's trustworthiness through openness to public inspection and transparency of processes throughout the project's lifespan. This ensures that decisions are clear and supports the credibility of the research (Reed‐Berendt, Dove, and Pareek, 2021). Projects benefit the most from an engaged, transparent, and reflective approach.

*Accountability* - At a high level, the organization and its information assurance teams are in charge of accountability, including ensuring that policies and standards are in place. However, it is critical to demonstrate how each team member meets this responsibility on an individual level, for example through detailed documentation such as DPIAs (Data Protection Impact Assessments) (Data Ethics Framework, 2020). This helps to keep a legally sound record of all project activities.

*Fairness* - The third principle is fairness, which is essential for removing any discriminatory impacts on people and social groups, even if the discrimination is inadvertent (Schwab, 2021). If project outcomes are produced using biased, corrupted, or distorted datasets, affected stakeholders will not be appropriately protected against discriminatory harm.

**Conclusion:**

After examining all three key principles, transparency offers the best solution for this project, since it keeps the work open to inspection and free of secrets.

**References:**

Assets.publishing.service.gov.uk. 2020. Data Ethics Framework. [online] Available at: <https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/923108/Data_Ethics_Framework_2020.pdf> [Accessed 26 April 2022].

Leslie, D., 2019. Understanding Artificial Intelligence Ethics and Safety: A Guide for the Responsible Design and Implementation of AI Systems in the Public Sector. SSRN Electronic Journal, [online] Available at: <https://www.turing.ac.uk/sites/default/files/2019-06/understanding_artificial_intelligence_ethics_and_safety.pdf> [Accessed 26 April 2022].

Reed‐Berendt, R., Dove, E. and Pareek, M., 2021. The Ethical Implications of Big Data Research in Public Health: “Big Data Ethics by Design” in the UK‐REACH Study. Ethics & Human Research, 44(1), pp.2-17. [Accessed 26 April 2022].

Schwab, P., 2021. The UK Data Ethics Framework Explained. [online] Intotheminds. Available at: <https://www.intotheminds.com/blog/en/uk-data-ethics-framework-explained/> [Accessed 26 April 2022].
