# **CMT309 - Computational Data Science - Data Science Portfolio**

# Part 1 - Text Data Analysis (45 marks)

In this question you will write Python code for processing, analyzing and understanding the social network **Reddit** (www.reddit.com). Reddit is a platform that allows users to upload posts and comment on them, and is divided into _subreddits_, often covering specific themes or areas of interest (for example, [world news](https://www.reddit.com/r/worldnews/), [ukpolitics](https://www.reddit.com/r/ukpolitics/) or [nintendo](https://www.reddit.com/r/nintendo)). You are provided with a subset of Reddit with posts from Covid-related subreddits (e.g., _CoronavirusUK_ or _NoNewNormal_), as well as randomly selected subreddits (e.g., _donaldtrump_ or _razer_).

The `csv` dataset you are provided contains one row per post, and has information about three entities: **posts**, **users** and **subreddits**. The column names are self-explanatory: columns starting with the prefix `user_` describe users, those starting with the prefix `subr_` describe subreddits, the `subreddit` column is the subreddit name, and the rest of the columns are post attributes (`author`, `posted_at`, `title`, the post text in the `selftext` column, the number of comments in `num_comments`, `score`, etc.).
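
As a quick illustration of this naming convention, the sketch below separates columns by prefix. It uses a tiny made-up frame (the name `toy` and all values are invented for illustration), not the real dataset.

```python
import pandas as pd

# Toy frame with invented values; only the column naming mirrors the dataset
toy = pd.DataFrame({
    'author': ['a1'],
    'title': ['t1'],
    'subreddit': ['s1'],
    'subr_numb_members': [10],
    'user_num_posts': [3],
})

# Columns whose names start with a given prefix
user_cols = [c for c in toy.columns if c.startswith('user_')]
subr_cols = [c for c in toy.columns if c.startswith('subr_')]

# Everything else (apart from the subreddit name) is a post attribute
post_cols = [c for c in toy.columns
             if c not in user_cols + subr_cols + ['subreddit']]
```

The same list comprehensions should work unchanged on the full dataframe once it is loaded.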

In this exercise, you are asked to perform a number of operations to gain insights from the data.

## P1.0) Suggested/Required Imports

In [None]:
# suggested imports
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()
from ast import literal_eval
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
from urllib import request
import pandas as pd
module_url = "https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_22.csv"
module_name = module_url.split('/')[-1]
print(f'Fetching {module_url}')
# download the csv and save it locally under the same file name
with request.urlopen(module_url) as f, open(module_name, 'w', encoding='utf-8') as outf:
  a = f.read()
  outf.write(a.decode('utf-8'))
df = pd.read_csv(module_name)
# this fills empty cells with empty strings
df = df.fillna('')

Fetching https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_22.csv


In [None]:
df.shape

(19940, 17)

In [None]:
df.head()

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.0,4661,2012-11-09,-0.658599
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.0,4661,2012-11-09,-0.658599
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.0,4661,2012-11-09,-0.658599
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599


## P1.1 - Text data processing (20 marks)

### P1.1.1 - Offensive authors per subreddit (5 marks)

As you will see, the dataset contains a lot of strings of the form `[***]`. These have been used to mask (or remove) swearwords to make it less offensive. We are interested in finding those users that have posted at least one swearword in each subreddit. We do this by counting occurrences of the `[***]` string in the `selftext` column (we can assume that an occurrence of `[***]` equals a swearword in the original dataset).

**What to implement:** A function `offensive_authors(df)` that takes as input the original dataframe and returns a dataframe of the form below, where each row contains authors that posted at least one swearword in the corresponding subreddit.

```
subreddit	author
0	40kLore	Cross_Ange
1	40kLore	DaRandomGitty2
2	40kLore	EMB1981
3	40kLore	Evoxrus_XV
4	40kLore	Grtrshop
...
```
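
One possible way to detect the mask is a literal substring test, which avoids regular-expression escaping altogether. The sketch below is only an illustration on invented data (the frame `toy` and its values are made up) and is not the required solution.

```python
import pandas as pd

# Toy posts with invented values, standing in for the real dataframe
toy = pd.DataFrame({
    'subreddit': ['b', 'a', 'a', 'a'],
    'author':    ['u3', 'u1', 'u2', 'u1'],
    'selftext':  ['what the [***]', 'clean text', 'a[***]b', '[***] and [***]'],
})

# regex=False treats '[***]' as a literal string rather than a character class
mask = toy['selftext'].str.contains('[***]', regex=False)

# One row per (subreddit, author) pair with at least one masked swearword
pairs = (toy.loc[mask, ['subreddit', 'author']]
            .drop_duplicates()
            .sort_values(['subreddit', 'author'])
            .reset_index(drop=True))
```

Here `pairs` holds three rows: `u1` and `u2` for subreddit `a`, and `u3` for subreddit `b`.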

In [None]:
def offensive_authors(df):
  """
  Identify authors with offensive posts across different subreddits.
    
  This function takes a DataFrame containing posts data (with 'author', 'subreddit', and 'selftext' columns) 
  and returns a DataFrame with the authors who have posted at least one offensive post in each subreddit.
    
  Input:
  df (pd.DataFrame): A DataFrame containing posts data with 'author', 'subreddit', and 'selftext' columns.

  Output:
  result (pd.DataFrame): A DataFrame with columns 'subreddits' and 'author', listing the authors who have posted
                            at least one offensive post in each subreddit, sorted by subreddit.
  """

  df1 = df.copy() 
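  # Replace every masked swearword with a unique placeholder string that won't occur in any selftext, as [***] can be inside words
  # regex=True is passed explicitly: pandas 2.0 and later treat the pattern as a literal string by default
  df1['selftext'] = df1['selftext'].str.replace(r'\[.*?\*\*\*.*?\]', '###OFFENSIVE###', regex=True)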


  # Create a dictionary of {author: set of subreddits with offensive posts}
  author_subreddits = {}
  for author, subreddit, selftext in zip(df1['author'], df1['subreddit'], df1['selftext']):
      if '###OFFENSIVE###' in selftext:
          if author not in author_subreddits:
              author_subreddits[author] = set()
          author_subreddits[author].add(subreddit)
    

    
  # Find authors with at least one offensive post in each subreddit
  result = []
  for author, subreddits in author_subreddits.items():
      if len(subreddits) > 0:
          result.append((subreddits, author))
         

  result = pd.DataFrame(result, columns=['subreddits', 'author'])
  result = result.explode('subreddits')                                 #expand multiple subreddits per author into individual rows
  result = result.sort_values(by='subreddits').reset_index(drop = True)   
                     
    
  return result

In [None]:
offensive_authors(df)

Unnamed: 0,subreddits,author
0,40kLore,EMB1981
1,40kLore,SlobBarker
2,40kLore,spirtomb1831
3,40kLore,ThePoarter
4,40kLore,ThereGoesJoe
...,...,...
490,worldbuilding,Drake[***]zilla
491,worldbuilding,HorsesPlease
492,worldbuilding,BeardedJho
493,xqcow,marvi444


### P1.1.2 - Most common trigrams per subreddit (15 marks)

We are interested in learning about _the ten most frequent trigrams_ (a [trigram](https://en.wikipedia.org/wiki/Trigram) is a sequence of three consecutive words) in each subreddit's content. You must compute these trigrams on both the `selftext` and `title` columns. Your task is to generate a Python dictionary of the form:

```
{subreddit1: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)],
subreddit2: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)],
...
subreddit63: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)],}
```

That is, for each subreddit, the 10 most frequent trigrams and their frequency, stored in a list of tuples. Each trigram will be stored also as a tuple containing 3 strings.
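
To make the target structure concrete, here is a minimal sketch of trigram counting on a hand-made token list, using only the standard library. The token list is invented for illustration; `nltk.trigrams` should yield the same 3-tuples.

```python
from collections import Counter

tokens = ['wear', 'a', 'mask', 'please', 'wear', 'a', 'mask']

# Slide a window of width 3 over the tokens; zip stops at the shortest slice
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))

# Each key is a tuple of 3 strings, each value is its frequency
counts = Counter(trigrams)

# The single most frequent trigram together with its count
top = counts.most_common(1)
```

With seven tokens there are five trigrams, and `('wear', 'a', 'mask')` is the only one that occurs twice.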

**What to implement**: A function `get_tris(df, stopwords_list, punctuation_list)` that will take as input the original dataframe, a list of stopwords and a list of punctuation signs (e.g., `?` or `!`), and will return a python dictionary with the above format. Your function must implement the following steps in order:

- (**1 mark**) Create a new dataframe called `newdf` with only `subreddit`, `title` and `selftext` columns.
- (**1 mark**) Add a new column to `newdf` called `full_text`, which will contain `title` and `selftext` concatenated with the string `.` (a full stop) followed by a space. That is, `A simple title` and `This is a text body` would be `A simple title. This is a text body`.
- (**1 mark**) Remove all occurrences of the following strings from `full_text`. You must do this without creating a new column:
  - `[***]`
  - `&amp;`
  - `&gt;`
  - `https`
- (**1 mark**) You must also remove all occurrences of at least three consecutive hyphens, for example, you should remove strings like `---`, `----`, `-----`, etc., but not `--` and not `-`.
- (**1 mark**) Tokenize the contents of the `full_text` column after lower casing (removing all capitalization). You should use the `word_tokenize` function in `nltk`. Add the results to a new column called `full_text_tokenized`.
- (**2 marks**) Remove all tokens that are either stopwords or punctuation from `full_text_tokenized` and store the results in a new column called `full_text_tokenized_clean`. _See Note 1_.
- (**2 marks**) Create a new dataframe called `adf` (which will stand for _aggregated dataframe_), which will have one row per subreddit (i.e., 63 rows), and will have two columns: `subreddit` (the subreddit name), and `all_words`, which will be a big list with all the words that belong to that subreddit as extracted from the `full_text_tokenized_clean`.
- (**3 marks**) Obtain trigram counts, which will be stored in a dictionary where each `key` will be a trigram (a `tuple` containing 3 consecutive tokens), and each `value` will be its overall frequency in that subreddit. You are encouraged to use functions from the `nltk` package, although you can choose any approach to solve this part.
- (**3 marks**) Finally, use the information you have in `adf` for generating the desired dictionary, and return it. _See Note 2_.
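
The two removal steps in the list above can be sketched on a single string as follows. This is only an illustration with an invented example string; in the assignment the same operations are applied to the `full_text` column.

```python
import re

text = 'Title [***] here. a--b ---- c &amp; d &gt; https://example.com'

# Remove the literal substrings first; str.replace needs no escaping
for junk in ['[***]', '&amp;', '&gt;', 'https']:
    text = text.replace(junk, '')

# Then remove runs of three or more hyphens; '--' and '-' are left alone
text = re.sub(r'-{3,}', '', text)

# Splitting on whitespace shows which pieces survive the cleaning
pieces = text.split()
```

The quantifier `{3,}` means "three or more", which is why `a--b` is kept while `----` disappears.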

Note 1. You can obtain stopwords and punctuation as follows.
- Stopwords: 
```
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
```
- Punctuation:
```
import string
punctuation = list(string.punctuation)
```

Note 2. You do not have to apply an additional ordering when there are several trigrams with the same frequency.

In [None]:
# necessary imports here for extra clarity
from nltk.corpus import stopwords as sw
import string
import warnings

def get_tris(df, stopwords_list, punctuation_list):
  """
  Extract the top 10 most frequent trigrams from each subreddit in a DataFrame.
    
  This function takes a DataFrame containing posts data (with 'subreddit', 'title', and 'selftext' columns),
  a list of stopwords, and a list of punctuation marks. It processes the text and returns a dictionary with the
  top 10 most frequent trigrams in each subreddit.
    
  Input:
  df (pd.DataFrame): A DataFrame containing posts data with 'subreddit', 'title', and 'selftext' columns.
  stopwords_list (list): A list of stopwords to exclude from the text processing.
  punctuation_list (list): A list of punctuation marks to exclude from the text processing.
    
  Output:
  res (dict): A dictionary where keys are subreddit names and values are lists of the top 10 most frequent trigrams
                in each subreddit, sorted in descending order of frequency.
  """

  # 1 MARK - create new df with only relevant columns
  newdf = df.loc[:,['subreddit','title','selftext']].copy()  # explicit copy, so adding columns below does not trigger SettingWithCopyWarning
  
  # 1 MARK - concatenate title and selftext
  newdf['full_text'] = newdf['title']+'. '+newdf['selftext']
  
  # 1 MARK for string replacement: remove the strings "[***]", "&amp;", "&gt;" and "https"
  # the brackets and asterisks are escaped so that only the literal mask "[***]" is removed, not every bracketed span
  newdf['full_text'] = newdf['full_text'].replace(r'\[\*\*\*\]', '', regex=True)
  newdf['full_text'] = newdf['full_text'].replace('&amp;', '', regex=True)
  newdf['full_text'] = newdf['full_text'].replace('&gt;', '', regex=True)
  newdf['full_text'] = newdf['full_text'].replace('https', '', regex=True)
  

  # 1 MARK for regex replacement: at least three consecutive dashes
  newdf['full_text'] = newdf['full_text'].replace(r'[-]{3,}', '', regex=True)
  
  # 1 MARK - lower case, tokenize, and add result to full_text_tokenized
  newdf['full_text_tokenized'] = newdf['full_text'].str.lower()
  newdf['full_text_tokenized'] = newdf['full_text_tokenized'].apply(nltk.word_tokenize)
  
  # 2 MARKS - clean the full_text_tokenized column by iterating over each word and discarding if it's either a stopword or punctuation

  # sets give constant-time membership tests, which matters when filtering every token
  sw = set(stopwords_list)
  p = set(punctuation_list)

  # apply function to full_text_tokenized column
  newdf['full_text_tokenized_clean'] = newdf['full_text_tokenized'].apply(lambda x: [token for token in x if token not in sw and token not in p])

  # 2 MARKS - create new aggregated dataframe by concatenating all full_text_tokenized_clean values - rename columns as requested

  adf = newdf.groupby('subreddit')['full_text_tokenized_clean'].apply(lambda x: [item for sublist in x for item in sublist]).reset_index()
  adf.columns = ['subreddit', 'all_words']
  
  # 3 MARKS - create new Series object by piping nltk's FreqDist and trigrams functions into all_words

  tri_counts = adf['all_words'].apply(lambda x: nltk.FreqDist(nltk.trigrams(x)))
  
  # 3 MARKS - create output dictionary by zipping subreddit column from adf and tri_counts into a list of tuples, then passing dict()
  
  # the top 10 most frequent ngrams are obtained by calling sorted() on tri_counts and keeping only the top 10 elements

  res = dict(zip(adf['subreddit'], [sorted(counts.items(), key=lambda x: x[1], reverse=True)[:10] for counts in tri_counts]))
  return res
  

In [None]:
# get stopwords as list (a separate name keeps the imported `sw` alias intact if this cell is re-run)
stopwords_list = sw.words('english')
# get punctuation as list
p = list(string.punctuation)
# optional: silence warnings such as pandas' SettingWithCopyWarning
warnings.filterwarnings('ignore')
get_tris(df, stopwords_list, p)

## P1.2 - Answering questions with pandas (15 marks)

In this question, your task is to use pandas to answer questions about the data.

### P1.2.1 - Authors that post highly commented posts (3 marks)

Find the top 1000 most commented posts. Then, obtain the names of the authors that have at least 3 posts among these posts.

**What to implement:** Implement a function `find_popular_authors(df)` that takes as input the original dataframe and returns a list of strings, where each string is the name of an author that satisfies the above criteria.
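
One way to approach this with pandas is to combine `nlargest` with `value_counts`. The sketch below is a hedged illustration on an invented six-row frame (`toy`), using a top-5 cut-off in place of the top 1000.

```python
import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame({
    'author':       ['ann', 'bob', 'ann', 'cat', 'ann', 'bob'],
    'num_comments': [50,    40,    30,    20,    10,    5],
})

# Keep the most commented posts (the assignment uses 1000 here, not 5)
top = toy.nlargest(5, 'num_comments')

# Number of posts each author has among those top posts
posts_per_author = top['author'].value_counts()

# Authors with at least 3 posts among the top posts
popular = sorted(posts_per_author[posts_per_author >= 3].index)
```

In this toy data only `ann` has three posts among the five most commented ones.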

In [None]:
def find_popular_authors(df):
  """
  Identify authors with at least 3 posts in the top 1000 most commented posts in a DataFrame.

  This function takes a DataFrame containing post data (with 'author' and 'num_comments' columns) and
  finds authors who have at least 3 posts among the top 1000 most commented posts. The resulting
  list of author names is sorted alphabetically in ascending order.

  Input:
  df (pd.DataFrame): A DataFrame containing post data with 'author' and 'num_comments' columns.

  Output:
  names (list): A sorted list of author names with at least 3 posts in the top 1000 most commented posts.
  """

  top = df.nlargest(1000, 'num_comments') #finds top 1000 most commented posts
  
  
  count = {}                              #maps each author to their number of posts among the top 1000
  for _, author in top['author'].items(): #Series.items() replaces iteritems(), which was removed in pandas 2.0
    count[author] = count.get(author, 0) + 1

  names = []                              #keeps authors with at least 3 posts among the top 1000
  for author in count:
    if count[author] >= 3:
      names.append(author)
  
  names.sort(key=str.lower)
  return names


In [None]:
find_popular_authors(df)

['2020c[***]er[***]',
 '[***]reader',
 'akarim5847',
 'allicat83',
 'AllisonGator',
 'Allstarhit',
 'Antiliani',
 'apocalypticalley',
 'AristonD',
 'AutoModerator',
 'BanDerUh',
 'BebeFanMasterJ',
 'bemani4u',
 'bgny',
 'blacked_lover',
 'boomerpro',
 'CantStopPoppin',
 'cdillon42',
 'chakalakasp',
 'CLO_Junkie',
 'Cross_Ange',
 'DaFunkJunkie',
 'Defie-LOH-Gic',
 'dsbwayne',
 'dukey',
 'dunphish64',
 'elt0p0',
 'epiphanyx99',
 'faab64',
 'foodforthinks',
 'Fr1sk3r',
 'Frocharocha',
 'FunPeach0',
 'Gambit08',
 'habichuelacondulce',
 'harushiga',
 'hildebrand_rarity',
 'hilltopye',
 'HippolasCage',
 'imagepoem',
 'into_the_[***]e',
 'invertedparado[***]',
 'iSlingShlong',
 'itsreallyreallytrue',
 'Jellyrollrider',
 'jigsawmap',
 'johnruby',
 'jollygreenscott91',
 'KatieAllTheTime',
 'kevinmrr',
 'Kinmuan',
 'kogeliz',
 'lanqian',
 'le_br1t',
 'Leg_holes',
 'lilmcfuggin',
 'Lost_Distribution546',
 'Lshim',
 'Madd-Nigrulo',
 'madman320',
 'Mahomeboy_',
 'Majnum',
 'MakeItRainSheckels',
 'M

### P1.2.2 - Distribution of posts per weekday (5 marks)

Find the percentage of posts that were posted on each weekday (Monday, Tuesday, etc.). You can use an external calendar or you can use any functionality for dealing with dates available in pandas. 

**What to implement:** A function `get_weekday_post_distribution(df)` that takes as input the original dataframe and returns a dictionary of the form (the values are made up):

```
{'Monday': '14%',
'Tuesday': '23%', 
...
}
```

Note that you must only return two decimals, and you must include the percentage sign in the output dictionary. 

Note that in dictionaries order is not preserved, so the order in which it gets printed will not matter. 
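
As a rough sketch of the date handling and the two-decimal formatting, the example below works on four invented timestamps rather than the real `posted_at` column. The variable names are illustrative only.

```python
import pandas as pd

# Invented timestamps in the same format as the posted_at column
posted_at = pd.Series(['2020-08-17 20:26:04',   # a Monday
                       '2020-08-18 09:00:00',   # a Tuesday
                       '2020-08-24 12:00:00',   # a Monday
                       '2020-08-25 08:30:00'])  # a Tuesday

# Parse the strings and map each timestamp to its weekday name
days = pd.to_datetime(posted_at).dt.day_name()

# Share of posts per weekday, as a percentage
shares = days.value_counts(normalize=True) * 100

# The f-string format keeps trailing zeros, unlike round() followed by str()
dist = {day: f'{share:.2f}%' for day, share in shares.items()}
```

Formatting with `:.2f` yields `'50.00%'` here, whereas rounding and then converting to a string would give `'50.0%'`.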

In [None]:
def get_weekday_post_distribution(df):
  """
  Function to compute the distribution of posts across weekdays in a given DataFrame.

  Input:
  df (pd.DataFrame): A DataFrame containing a 'posted_at' column with timestamps.

  Output:
  res (dict): A dictionary with keys as the day of the week and values as the percentage of posts on that day.
  """

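  # work on a local Series so that no helper columns are added to the caller's DataFrame
  day_of_week = pd.to_datetime(df['posted_at']).dt.day_name()

  # share of posts per weekday, as a percentage
  percentage_per_day = day_of_week.value_counts(normalize=True) * 100

  # format with exactly two decimals (an f-string keeps trailing zeros, e.g. '14.50%')
  res = {day: f'{share:.2f}%' for day, share in percentage_per_day.items()}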

  return res

In [None]:
get_weekday_post_distribution(df)

{'Wednesday': '14.89%',
 'Friday': '14.79%',
 'Thursday': '14.75%',
 'Tuesday': '14.54%',
 'Monday': '14.31%',
 'Saturday': '13.76%',
 'Sunday': '12.96%'}

### P1.2.3 - The 100 most passionate redditors (7 marks)

We would like to know which 100 redditors (`author` column) are the most passionate. We will measure this by checking, for each redditor, the ratio at which they use adjectives. This ratio will be computed by dividing the number of adjectives by the total number of words each redditor used. The analysis will only consider redditors that have written at least 1000 words.

**What to implement:** A function called `get_passionate_redditors(df)` that takes as input the original dataframe and returns a list of the top 100 redditors (authors) by the ratio at which they use adjectives considering both the `title` and `selftext` columns. The returned list should be a list of tuples, where each inner tuple has two elements: the redditor (author) name, and the ratio of adjectives they used. The returned list should be sorted by adjective ratio in descending order (highest first). Only redditors that wrote more than 1000 words should be considered. You should use `nltk`'s `word_tokenize` and `pos_tag` functions to tokenize and find adjectives. You do not need to do any preprocessing like stopword removal, lemmatization or stemming.
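
The ratio itself can be sketched without running a tagger. In the example below the `(token, tag)` pairs are written by hand to stand in for `pos_tag` output, and `MIN_WORDS` is a hypothetical small threshold replacing the 1000-word limit, so this is an illustration of the arithmetic only.

```python
from collections import defaultdict

# Hand-written (token, tag) pairs standing in for pos_tag output
tagged_posts = [
    ('ann', [('great', 'JJ'), ('news', 'NN'), ('today', 'NN'), ('!', '.')]),
    ('ann', [('better', 'JJR'), ('than', 'IN'), ('expected', 'VBN'), ('.', '.')]),
    ('bob', [('read', 'VB'), ('this', 'DT')]),
]
MIN_WORDS = 5   # the assignment's threshold is 1000

words = defaultdict(int)
adjectives = defaultdict(int)
for author, tagged in tagged_posts:
    # Accumulate across posts, since one author may have many posts
    words[author] += len(tagged)
    # Penn Treebank adjective tags are JJ, JJR and JJS
    adjectives[author] += sum(1 for _, tag in tagged if tag.startswith('JJ'))

# (author, ratio) tuples for authors above the threshold, highest ratio first
ratios = sorted(
    ((a, adjectives[a] / words[a]) for a in words if words[a] >= MIN_WORDS),
    key=lambda pair: pair[1], reverse=True)
```

Here `ann` has 2 adjectives in 8 tokens, giving a ratio of 0.25, while `bob` is excluded for falling below the threshold.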

In [None]:
def get_passionate_redditors(df):
  """
  Function to find the top 100 passionate Redditors based on their adjective usage ratio.

  Checks the ratio of adjectives used by each Redditor who has written at least 1000 words and outputs top 100 of them in 
  descending order along with their corresponding ratio.

  Input:
  df (pd.DataFrame): A DataFrame containing 'title', 'selftext', and 'author' columns.

  Output:
  top_100_list (list): A list of tuples containing the author's name and their adjective usage ratio.
  """

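  df = df.copy()  # work on a copy so the helper columns below are not added to the caller's DataFrame
  df['text'] = df['title'] + ' ' + df['selftext']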
  df['tokens'] = df['text'].apply(lambda x: word_tokenize(x)) #splits the sentences into words entry by entry
    
  # Compute word count and adjective count for each post
  df['word_count'] = df['tokens'].apply(lambda x: len(x))
  df['adj_count'] = df['tokens'].apply(lambda x: sum(1 for word, pos in pos_tag(x) if pos.startswith('JJ'))) ##JJ denotes adjectives
    
  # Group posts by author and compute total word count and adjective count for each author
  grouped = df.groupby('author').agg({'word_count': 'sum', 'adj_count': 'sum'})    #need this to account for authors who wrote more than 1000 words across different posts
    
  # Filter out authors with less than 1000 words
  grouped = grouped[grouped['word_count'] >= 1000]
    
  # Compute adjective ratio for each author
  grouped['adj_ratio'] = grouped['adj_count'] / grouped['word_count']     
    
  # Sort authors by adjective ratio in descending order and displays top 100
  sorted_grouped = grouped.sort_values(by='adj_ratio', ascending=False).round(3) 
  top_100 = sorted_grouped.head(100)
    
  return [(author, ratio) for author, ratio in top_100['adj_ratio'].items()]

In [None]:
get_passionate_redditors(df)

[('OhanianIsTheBest', 0.148),
 ('healrstreettalk', 0.131),
 ('FreedomBoners', 0.125),
 ('factfind', 0.113),
 ('Travis-Cole', 0.106),
 ('fullbloodedwhitemale', 0.105),
 ('SecretAgentIceBat', 0.103),
 ('Tripmooney', 0.097),
 ('backpackwayne', 0.096),
 ('GeAlltidUpp', 0.096),
 ('mission_improbables', 0.095),
 ('EMB1981', 0.095),
 ('nyello-2000', 0.094),
 ('greyuniwave', 0.092),
 ('Venus230', 0.09),
 ('theinfinitelight', 0.088),
 ('th3allyK4t', 0.088),
 ('kent_k', 0.088),
 ('35quai', 0.088),
 ('120inn[***]', 0.088),
 ('notinferno', 0.087),
 ('Ninten-Doh', 0.086),
 ('kay278', 0.085),
 ('rrixham', 0.085),
 ('secretymology', 0.085),
 ('society0', 0.085),
 ('Zendexor', 0.084),
 ('spirtomb1831', 0.084),
 ('III_lll', 0.084),
 ('___TheKid___', 0.083),
 ('AnakinWayneII', 0.083),
 ('allofusahab', 0.083),
 ('sbpotdbot', 0.083),
 ('pooheygirl', 0.082),
 ('BornOnADifCloud', 0.082),
 ('anon7935678', 0.082),
 ('snorken123', 0.082),
 ('The_In-Betweener', 0.081),
 ('XDitto', 0.081),
 ('CommonEmployment2',

## P1.3 Ethics (10 marks)

Imagine you are **the head of a data mining company** that needs to use the insights gained in this assignment to scan social media for covid-related content, and automatically flag it as conspiracy or not conspiracy (for example, for hiding potentially harmful tweets or Facebook posts). Some information about the project and the team:

- Your client is a political party concerned about misinformation.
- The project requires mining Facebook, Reddit and Instagram data.
- The team consists of Joe, an American mathematician who just finished college; Fei, a senior software engineer from China; and Francisco, a data scientist from Spain.

Reflect on the impact of exploiting data science for such an application. You should map your discussion to one of the five actions outlined in the UK’s Data Ethics Framework.

Your answer should address the following:
- Identify the action in which your project is the weakest.
- Then, justify your choice by critically analyzing the three key principles for that action outlined in the Framework, namely transparency, accountability and fairness.
- Finally, you should propose one solution that explicitly addresses one point related to one of these three principles, reflecting on how your solution would improve the data cycle in this particular use case.

Your answer should be between 500 and 700 words. **You are strongly encouraged to follow a scholarly approach, e.g., with references to peer reviewed publications. References do not count towards the word limit**.

# **Answer:**

The action in which this project is weakest is reviewing the quality and limitations of the data. To justify this choice, three key principles must be examined: transparency, accountability, and fairness.

Firstly, transparency is limited in this project, as it is a commercial endeavor conducted for a political party. As such, the models developed will likely not be fully public. 

Moreover, the project focuses on the sensitive topic of COVID-related conspiracy theories, which may upset some members of the public. Explainability of the models may also be challenging, as state-of-the-art natural language processing predominantly employs neural networks, whose inner workings are relatively opaque and difficult to interpret. 

Additionally, the diverse data sources may make it challenging to publish input data transparently because of the different legal requirements imposed by the providers of the data.

Secondly, accountability is another concern. The determination of what constitutes a conspiracy theory may be subjective, making it more difficult for external parties to validate the project's outcomes. 

The task of analyzing and categorizing COVID-related content on social media is inherently complex due to the intricate nature of language processing, which poses challenges for the assessment of accountability. Furthermore, the reproducibility of the procedure might be difficult to achieve given the proprietary nature of the input data and the substantial computational overhead typically associated with developing complex natural language processing models.

The robustness of the models, referring to their consistency and accuracy, could be negatively impacted by the diverse nature of the input data. User-generated content on platforms like Reddit and Instagram tends to vary in format, with subtle differences in language use. 

In terms of the choice of output, a simple binary classification might not adequately capture the nuances in sentiment associated with conspiracy theories, thus limiting the algorithm's effectiveness. However, a more detailed sentiment analysis could be too complex and impenetrable for the end-users, including the political party client and potential external reviewers. 

Thirdly, fairness is a crucial issue in this project. The potential for bias is significant, as the political affiliation of the client may influence the results. For instance, they could display confirmation bias, where data that confirms their existing beliefs about what constitutes a "conspiracy" is favoured. Moreover, since the client is a political party, there might be a tendency, intentional or not, to flag content as a conspiracy if it goes against the party's ideology or to overlook potential conspiracies that align with the party's views. 

However, the ethnically diverse team is a positive factor in mitigating cultural biases.

Another area of concern is that utilising data from sources like Facebook, which does not allow pseudonymity, may be problematic, as it can reveal personal information. 

To enhance the fairness aspect of the project, the team should implement a comprehensive bias assessment and mitigation pipeline. This would include the following steps:

1. Identify potential sources of bias in the data, such as underrepresentation of certain demographic groups or communities.

2. Develop and apply techniques to mitigate these biases, such as re-sampling, re-weighting, or adjusting the model's training process. Possible discriminatory proxy variables need to be identified and eliminated.

3. Continuously monitor and evaluate the model's performance concerning fairness and adjust the mitigation strategies as needed. 

In conclusion, addressing the quality and limitations of the data is crucial to ensure the ethical integrity of this project. By focusing on improving transparency, accountability, and fairness, the project team can create more reliable and trustworthy outcomes while mitigating potential biases.

