# Data Science Portfolio - Part I (30 marks)

In this question you will write Python code for processing, analyzing and understanding the social network **Reddit** (www.reddit.com). Reddit is a platform that allows users to upload posts and comment on them, and is divided in _subreddits_, often covering specific themes or areas of interest (for example, [world news](https://www.reddit.com/r/worldnews/), [ukpolitics](https://www.reddit.com/r/ukpolitics/) or [nintendo](https://www.reddit.com/r/nintendo)). You are provided with a subset of Reddit with posts from Covid-related subreddits (e.g., _CoronavirusUK_ or _NoNewNormal_), as well as randomly selected subreddits (e.g., _donaldtrump_ or _razer_).

The `csv` dataset you are provided contains one row per post, and has information about three entities: **posts**, **users** and **subreddits**. The column names are self-explanatory: columns starting with the prefix `user_` describe users, those starting with the prefix `subr_` describe subreddits, the `subreddit` column is the subreddit name, and the rest of the columns are post attributes (`author`, `posted_at`, `title` and post text - the `selftext` column-, number of comments - `num_comments`, `score`, etc.).

In this exercise, you are asked to perform a number of operations to gain insights from the data.

In [2]:
# suggested imports
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()
from ast import literal_eval
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [3]:
from urllib import request
import pandas as pd
module_url = f"https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv"
module_name = module_url.split('/')[-1]
print(f'Fetching {module_url}')
#with open("file_1.txt") as f1, open("file_2.txt") as f2
with request.urlopen(module_url) as f, open(module_name,'w') as outf:
  a = f.read()
  outf.write(a.decode('utf-8'))


df = pd.read_csv('data_portfolio_21.csv')
# this fills empty cells with empty strings
df = df.fillna('')

Fetching https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv


In [4]:
df.head()

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.0,4661,2012-11-09,-0.658599
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.0,4661,2012-11-09,-0.658599
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.0,4661,2012-11-09,-0.658599
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599


## P1.1 - Text data processing (10 marks)

### P1.1.1 - Faved by as lists (3 marks)

The column `subr_faved_by` contains an array of values (names of redditors who added the subreddit to which the current post was submitted), but unfortunately they are in text format, and you would not be able to process them properly without converting them to a suitable python type. You must convert these string values to Python lists, going from

```python
'["user1", "user2" ... ]'
```

to

```python
["user1", "user2" ... ]
```

**What to implement:** Implement a function `transform_faves(df)` which takes as input the original dataframe and returns the same dataframe, but with one additional column called `subr_faved_by_as_list`, where you have the same information as in `subr_faved_by`, but as a python list instead of a string.

In [5]:
def transform_faves(df):
  '''
  Function that takes the DataFrame, and adds a column called 'subr_faved_by
  as_list' which is the list form of the column 'subr_faved_by'.

  Arguments: df - A DataFrame object with column called 'subr_faved_by'.
  The column must have string object elements.

  Returns: The original DataFrame, but with subr_faved_by_as_list column added.
  '''
  
  # Form temporary list to hold values
  temp_list = []

  # Loop through subr_faved_by (the strings)
  for i in df['subr_faved_by']:
    
    # Add to temp_list the strings which are stripped of leading and trailing square brackets and remove whitespace and ' characters
    new_str = i.replace('[', '').replace(']', '').replace("'", '')
    temp_list.append(new_str.split(','))

  # Make new column in dataframe with temp_list
  df['subr_faved_by_as_list'] = temp_list

  # Return new dataframe
  return df
  
df = transform_faves(df)

### P1.1.2 - Merge titles and text bodies (4 marks)

All Reddit posts need to have a title, but a text body is optional. However, we want to be able to access all free text information for each post without having to look at two columns every time.

**What to implement**: A function `concat(df)` that will take as input the original dataframe and will return it with an additional column called `full_text`, which will concatenate `title` and `selftext` columns, but with the following restrictions:

- 1) Wrap the title between `<title>` and `</title>` tags.
- 2) Add a new line (`\n`) between title and selftext, but only in cases where you have both values (see instruction 4).
- 3) Wrap the selftext between `<selftext>` and `</selftext>`.
- 4) You **must not** include the tags in points (1) or (3) if the values for these columns is missing. We will consider a missing value either an empty value (empty string) or a string of only one character (e.g., an emoji). Also, the value of a `full_text` column must not end in the new line character.

In [6]:
def concat(df):
  '''
  Function that takes a data frame, and merges the title and selftext columns.
  The function also adds html <title> and <selftext> tags provided the length
  of title or selftext is greater than one.

  Arguments: df: DataFrame containing 'title' and 'selftext' columns.

  Returns: df: Original DataFrame with 'full_text' column added. This full_text
  is the merging of title and selftext, with html tags added.
  '''

  # Define empty list to holder title and selftext
  temp_list = []

  # Loop through both title and selftext columns, adding tags if necessary
  for i in range(len(df['title'])):
    if len(df['title'][i]) > 1 and len(df['selftext'][i]) > 1: # Both title and selftext has characters
      title_selftext = '<title>' + df['title'][i] + '</title>' + '\n' + '<selftext>' + df['selftext'][i] + '</selftext>'
      temp_list.append(title_selftext)
    elif len(df['title'][i]) > 1 and len(df['selftext'][i]) <= 1: # Only title has characters
      title_only = '<title>' + df['title'][i] + '</title>' 
      temp_list.append(title_only)
    elif len(df['title'][i]) <=1 and len(df['selftext'][i]) > 1: # Only selftext has characters
      selftext_only = '<selftext>' + df['selftext'][i] + '</selftext>'
      temp_list.append(selftext_only)
    else: # In the event there is only a single character in the title or selftext
      temp_list.append('')

  # Make new column to store temp_list entries
  df['full_text'] = temp_list

  # Return df
  return df

df = concat(df)

### P1.1.3 - Enrich posts (3 marks)

We would like to augment our text data with linguistic information. To this end, we will _tokenize_, apply _part-of-speech tagging_, and then we will _lower case_ all the posts.

**What to implement**: A function `enrich_posts(df)` that will take as input the original dataframe and will return it with **two** additional columns: `enriched_title` and `enriched_selftext`. These columns will contain tokenized, pos-tagged and lower cased versions of the original text. **You must implement them in this order**, because the pos tagger uses casing information.

In [7]:
def enrich_posts(df):
  '''
  Function that tokenizes, POS_tags, and lower cases words in that order. 
  Designed to be used specifically on 'title' and 'selftext' columns of 
  dataframe.

  Arguments: df: DataFrame containing 'title' and 'selftext' columns.

  Returns: df: Original DataFrame but with two columns added, namely 'enriched
  _title' and 'enriched_title'

  '''

  # Write  helper function to lower case first entries in a list of tuples
  def lower_case(list_tuples):
    '''
    Helper function designed to be used in conjuction with enrich_posts() function.
    Takes a list of tuples, and lower cases every first entry in the tuples.

    Arguments: list_tuples: A list of tuples.

    Returns: holder: A list of tuples, but with the first entry of every tuple
    in lower case.
    '''

    # Define new list to hold entries
    holder = []

    # Convert first entry to lowercase
    for i in list_tuples:
      temp = list(i)
      temp[0] = temp[0].lower()
      holder.append(tuple(temp))

    # Return holder array
    return holder
  
  # Tokenize and POS tag title
  title = map(nltk.word_tokenize, list(df['title']))
  title = map(nltk.pos_tag, list(title))
  title = map(lower_case, list(title))
  title = list(title)

  # Tokenize and POS tag selftext
  selftext = map(nltk.word_tokenize, list(df['selftext']))
  selftext = map(nltk.pos_tag, list(selftext))
  selftext = map(lower_case, list(selftext))
  selftext = list(selftext)

  # Add enriched_title and enriched_selftext columns to dataframe
  df['enriched_title'] = title
  df['enriched_selftext'] = selftext

  return df

df = enrich_posts(df)

## P1.2 - Answering questions with pandas (12 marks)

In this question, your task is to use pandas to answer questions about the data.

### P1.2.1 - Users with best scores (3 marks)

- Find the users with the highest aggregate scores (over all their posts) for the whole dataset. You should restrict your results to only those whose aggregated score is above 10,000 points, in descending order. Your code should generate a dictionary of the form `{author:aggregated_scores ... }`.

In [12]:
# Group by author and sum scores
authors = df.groupby('author')
scores = authors['score'].sum()

# Make new list of scores above 10,000
scores_aggregated = scores[scores > 10000]

# Sort values
scores_aggregated = scores_aggregated.sort_values(ascending = False)

# Form dictionary
aggregated_scores_authors = dict(scores_aggregated)

# Show dictionary
print(aggregated_scores_authors)

{'DaFunkJunkie': 250375, 'None': 218846, 'SUPERGUESSOUS': 211611, 'jigsawmap': 210824, 'chrisdh79': 143538, 'hildebrand_rarity': 122464, 'iSlingShlong': 118595, 'hilltopye': 81245, 'tefunka': 79560, 'OldFashionedJizz': 64398, 'JLBesq1981': 58235, 'rspix000': 57107, 'Wagamaga': 47989, 'stem12345679': 47455, 'TheJeck': 26058, 'TheGamerDanYT': 25357, 'TrumpSharted': 21154, 'NotsoPG': 18518, 'SonictheManhog': 18116, 'BlanketMage': 13677, 'NewAltWhoThis': 12771, 'kevinmrr': 11900, 'Dajakesta0624': 11613, 'apocalypticalley': 10382}


### P1.2.2 - Awarded posts (3 marks)

Find the number of posts that have received at least one award. Your query should return only one value.

In [9]:
# Find awarded posts
amount_of_awards = 0
for i in df['total_awards_received']:
  if i > 0:
    amount_of_awards += 1

print(amount_of_awards)

119


### P1.2.3 Find Covid (3 marks)

Find the name and description of all subreddits where the name starts with `Covid` or `Corona` and the description contains `covid` or `Covid` anywhere. Your code should generate a dictionary of the form#

```python
  {'Coronavirus':'Place to discuss all things COVID-related',
  ...
  }
```

In [10]:
# Create unique subreddits and description names
subreddits_unique = pd.unique(df['subreddit'])
descriptions_unique = pd.unique(df['subr_description'])

# Make new list to hold coronavirus related subreddit entries
corona_subreddits = {}

# Loop through subreddits and descriptions identifying covid/corona, and enter into list of tuples
for i in range(len(subreddits_unique)):
  if (subreddits_unique[i].lower().count('covid') > 0 or subreddits_unique[i].lower().count('corona')) > 0 and descriptions_unique[i].lower().count('covid') > 0:
    corona_subreddits.update({subreddits_unique[i] : descriptions_unique[i]})

# Show corona_subreddits
corona_subreddits

{'COVID': 'COVID-19 News, Etc.',
 'COVID19': 'In December 2019, SARS-CoV-2, the virus causing the disease COVID-19, emerged in the city of Wuhan, China. This subreddit seeks to facilitate scientific discussion of this global public health threat.',
 'Coronavirus': 'Place to discuss all things COVID-related',
 'CoronavirusCA': 'Tracking the Coronavirus/Covid-19 outbreak in California',
 'CoronavirusDownunder': 'This subreddit is a place to share news, information, resources, and support that relate to the novel coronavirus SARS-CoV-2 and the disease it causes called COVID-19. The primary focus of this sub is to actively monitor the situation in Australia, but all posts on international news and other virus-related topics are welcome, to the extent they are beneficial in keeping those in Australia informed.',
 'CoronavirusUS': 'USA/Canada specific information on the coronavirus (SARS-CoV-2) that causes coronavirus disease 2019 (COVID-19)'}

### P1.2.4 - Redditors that favorite the most

Find the users that have favorited the largest number of subreddits. You must produce a pandas dataframe with **two** columns, with the following format:

```python
     redditor	    numb_favs
0	user1           7
1	user2           6
2	user3	       5
3	user4           4
...
```

where the first column is a Redditor username and the second column is the number of distinct subreddits he/she has favorited.

In [11]:
# Get subreddits and remove extra values
subreddits_faved_by = pd.Series(df['subr_faved_by_as_list'])
subreddits_faved_by = list(subreddits_faved_by.drop_duplicates())

# Flatten list
subreddits_faved_flat = [item for sublist in subreddits_faved_by for item in sublist]

# Form unique users list
usernames_unique = list(pd.unique(subreddits_faved_flat))
print(len(usernames_unique))
# Compare usernames_unique against subreddits_faved_flat
favourites = []
for i in usernames_unique:
  favourites.append(subreddits_faved_flat.count(i))

# Form dictionary
users = dict(zip(usernames_unique, favourites))

# Sort users
users_sorted = {k: b for k, b in sorted(users.items(),
                key = lambda i: i[1], reverse = True)}

# Form dataframe
df_redditors = pd.DataFrame(list(users_sorted.items()), columns = ['redditor', 'numb_faves'])

# Show dataframe
df_redditors

1650


Unnamed: 0,redditor,numb_faves
0,magnusthered15,7
1,KarmaFury,6
2,FriendlyVegetable420,6
3,OmniusQubus,6
4,hmhmhm2,6
...,...,...
1645,certifiedloverboy69,1
1646,diveonfire,1
1647,mouthofreason,1
1648,Alexify,1


## P1.3 Ethics (8 marks)

**(updated on 16/03/2022)**

Imagine you are **the head of a data mining company that needs to use** the insights gained in this assignment to scan social media for covid-related content, and automatically flag it as conspiracy or not conspiracy (for example, for hiding potentially harmful tweets or Facebook posts). **Some information about the project and the team:**

 - Your client is a political party concerned about misinformation.
 - The project requires mining Facebook, Reddit and Instagram data.
 - The team consists of Joe, an American mathematician who just finished college; Fei, a senior software engineer from China; and Francisco, a data scientist from Spain.

Reflect on the impact of exploiting data science for such an application. You should map your discussion to one of the five actions outlined in the UK’s Data Ethics Framework. 

Your answer should address the following:

 - Identify the action **in which your project is the weakest**.
 - Then, justify your choice by critically analyzing the three key principles **for that action** outlined in the Framework, namely transparency, accountability and fairness.
 - Finally, you should propose one solution that explicitly addresses one point related to one of these three principles, reflecting on how your solution would improve the data cycle in this particular use case.

Your answer should be between 500 and 700 words. **You are strongly encouraged to follow a scholarly approach, e.g., with references to peer reviewed publications. References do not count towards the word limit.**

---

The specific actions of the UK Data Ethics Framework include ‘Define and understand public benefit and user need’, ‘Involve diverse expertise’, ‘Comply with the law’, ‘Review the quality and limitations of the data’, and ‘Evaluate and consider wider policy implications’ [1]. 


The project aims to flag false news about the COVID pandemic, which will allow users to be able to avoid false information, the public benefit being that events such as phone mast vandalism [2] and other bizarre conspiracies will occur less if false information is not spread online. The team consists of three different nationalities, all with varying skillsets and experience, showing that the team has diverse expertise, which is important for the project reaching its full potential [3]. Legally, there is not a great involvement; the team does not have a legal advisor nor has the team received any legal advice. In terms of the quality of the data, there are assessments to assess quality of data [4], however this has not been used in this case – we can say that the data for the project is representative of most of the public, with a median age range of 25 – 34 on Facebook and Instragram [5,6], and a median range of 30 – 49 on Reddit [7]. Due to these ranges covering a fair amount of the adult population, it is a fair assessment to say that the data is representative of most of the public. In terms of evaluation and policy considerations, as the project is yet to be completed there is little to evaluate. However, we can say that there is plan for future actions, namely the flagging of COVID conspiracy content, but that is all at this current point in time.


Clearly, the project’s legal aspect is the weakest element of it, as generally little consideration has been given to this aspect. In terms of accountability, the team would do well to ensure they have a technically minded legal advisor, as the personal data must comply with the EU General Data Protection Regulation (GDPR) and the Data Protection Act 2018 (DPA) [1]. The GDPR is extremely crucial for protecting the freedoms of the individual and restricting the free movement of this data [8]. Given that the team has no legal advisors nor external legal support, it will be extremely difficult to comply with any such laws. This is an example of the team being inadequate in the accountability portion of the legal aspect of things. 


The team should also aim to publish the DPIA as a part of transparency, and as it ‘is a legal requirement for any type of processing’ [9] and could lead to hefty fines if this is ignored. The team must also publish other related documents to the DPIA; the team would require legal advice to do so. At this point in time, no reference has been made to the DPIA, showing a lack of transparency within the legal aspect of the team. 


In terms of fairness, the project must comply with the Equality Act 2010, such that ‘Data analysis or automated decision making must not result in outcomes that lead to discrimination’ [1]. There is no equality officer within the team, and ‘public bodies are responsible for ensuring that any third parties which exercise functions on their behalf are capable of complying with the Equality Duty’ [10], so this could lead to issues for the political party commissioning the project.


To solve the accountability of the legal aspect of this project (and perhaps indirectly aid transparency and fairness) the team should seek to obtain the services of a legal advisor, who could then aid the team in complying with the GDPR. Not only would this prevent the team (and political party) from repercussions from not following the law, but it would also add to the accountability such that the public could have confidence that their data was being used appropriately. This would lead to more public support for the project, as if the project did not have this accountability, the public would not know how their data was being used, and could lead to poor support for the project, or even to conspiracy theorists becoming further entrenched in their views. 



References
[1] - UK Data Ethics Framework, British Government. Available at: https://www.gov.uk/government/publications/data-ethics-framework/data-ethics-framework-2020 . [Accessed 04/05/2022]


[2] – Mast fire probe amid 5G coronavirus claims, BBC News. Available at: https://www.bbc.co.uk/news/uk-england-52164358 . [Accessed 04/05/2022]


[3] – 4 Lessons for Building Diverse Teams, Naomi Wheeles, Harvard Business Review. Available at: https://hbr.org/2021/05/4-lessons-for-building-diverse-teams . [Accessed 04/05/2022].


[4] - Cai, L. and Zhu, Y., 2015. The Challenges of Data Quality and Data Quality Assessment in the Big Data Era. Data Science Journal, 14, p.2.


[5] – Distribution of Facebook users worldwide as of January 2022, by age and gender. Available at: https://www.statista.com/statistics/376128/facebook-global-user-age-distribution/ . [Accessed 04/05/2022].


[6] – Distribution of Instagram users worldwide as of January 2022, by age and gender. Available at: https://www.statista.com/statistics/248769/age-distribution-of-worldwide-instagram-users/ . [Accessed 04/05/2022].


[7] – The Demographics of Reddit: Who Uses The Site? William Sattelberg, Alphr. Available at: https://www.alphr.com/demographics-reddit/ . [Accessed 04/05/2022]. 


[8] – G. Chassang, The impact of the EU general data protection regulation on scientific research, Ecancer medical Science, 2017.


[9] – What is a DPIA, Information Commissioner’s Office. Available at: https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/data-protection-impact-assessments-dpias/what-is-a-dpia/ . [Accessed 04/05/2022].


[10] – Public Sector Equality Duty – A Guide To Meeting The Duty And Undertaking Equality Analysis, PSED Toolkit Appendix 3, City of London.
