# Data Science Portfolio - Part I (30 marks)

In this question you will write Python code for processing, analyzing and understanding the social network **Reddit** (www.reddit.com). Reddit is a platform that allows users to upload posts and comment on them, and is divided into _subreddits_, often covering specific themes or areas of interest (for example, [world news](https://www.reddit.com/r/worldnews/), [ukpolitics](https://www.reddit.com/r/ukpolitics/) or [nintendo](https://www.reddit.com/r/nintendo)). You are provided with a subset of Reddit with posts from Covid-related subreddits (e.g., _CoronavirusUK_ or _NoNewNormal_), as well as randomly selected subreddits (e.g., _donaldtrump_ or _razer_).

The `csv` dataset you are provided contains one row per post, and has information about three entities: **posts**, **users** and **subreddits**. The column names are self-explanatory: columns starting with the prefix `user_` describe users, those starting with the prefix `subr_` describe subreddits, the `subreddit` column is the subreddit name, and the rest of the columns are post attributes (`author`, `posted_at`, `title`, the post text in the `selftext` column, the number of comments in `num_comments`, `score`, etc.).

In this exercise, you are asked to perform a number of operations to gain insights from the data.

In [None]:
# suggested imports
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()
from ast import literal_eval
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


**Student ID -21102662**

In [None]:
from urllib import request
import pandas as pd
module_url = "https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv"
module_name = module_url.split('/')[-1]
print(f'Fetching {module_url}')
# download the csv and save it locally under its original file name
with request.urlopen(module_url) as f, open(module_name, 'w', encoding='utf-8') as outf:
  a = f.read()
  outf.write(a.decode('utf-8'))


df = pd.read_csv('data_portfolio_21.csv')
# this fills empty cells with empty strings
df = df.fillna('')

Fetching https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv


In [None]:
df

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.00,4661,2012-11-09,-0.658599
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.00,4661,2012-11-09,-0.658599
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.00,4661,2012-11-09,-0.658599
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19935,zqrwiel,2020-07-23 16:39:15,11,246,,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,carti why,0,1.00,1883,2014-02-12,0.861626
19936,zqrwiel,2020-12-15 11:25:07,39,1,"Then I think we might get 18 songs, outro usua...",2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,If uzi on track 3 and 16,0,1.00,1883,2014-02-12,0.861626
19937,zqrwiel,2020-12-27 13:57:49,15,1,He has 25songs to perform plus the additional ...,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,Man carti’s concerts are gonna be long af,0,1.00,1883,2014-02-12,0.861626
19938,zqrwiel,2020-12-29 12:07:10,6,1,I got goose[***]ps just by thinking about it 😬,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,Can’t wait to see Carti going full rage mode o...,0,1.00,1883,2014-02-12,0.861626


## P1.1 - Text data processing (10 marks)

### P1.1.1 - Faved by as lists (3 marks)

The column `subr_faved_by` contains an array of values (names of redditors who added the subreddit to which the current post was submitted), but unfortunately they are in text format, and you would not be able to process them properly without converting them to a suitable python type. You must convert these string values to Python lists, going from

```python
'["user1", "user2" ... ]'
```

to

```python
["user1", "user2" ... ]
```

**What to implement:** Implement a function `transform_faves(df)` which takes as input the original dataframe and returns the same dataframe, but with one additional column called `subr_faved_by_as_list`, where you have the same information as in `subr_faved_by`, but as a python list instead of a string.
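As a minimal sketch of the conversion described above, `ast.literal_eval` parses a string that holds a Python literal and returns the corresponding object. The value of `raw` below is a made-up example shaped like a `subr_faved_by` cell, not a row from the dataset.

```python
from ast import literal_eval

# Hypothetical cell value, shaped like the strings stored in `subr_faved_by`
raw = "['user1', 'user2', 'user3']"

# literal_eval only accepts Python literals (lists, strings, numbers, ...),
# so it is a safer choice than eval() for parsing data read from a file
parsed = literal_eval(raw)

print(type(raw).__name__, '->', type(parsed).__name__)
print(parsed[0], len(parsed))
```

Once the value is a list, it can be indexed and iterated like any other Python list.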

In [None]:
def transform_faves(df):
    # parse each string such as "['user1', 'user2']" into a real Python list
    # (literal_eval is imported from ast in the first code cell)
    df['subr_faved_by_as_list'] = df['subr_faved_by'].apply(literal_eval)
    return df

df = transform_faves(df)        #default in question
df.head()                       #printed the dataframe

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio,subr_faved_by_as_list
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo..."
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo..."
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo..."
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo..."
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo..."


### P1.1.2 - Merge titles and text bodies (4 marks)

All Reddit posts need to have a title, but a text body is optional. However, we want to be able to access all free text information for each post without having to look at two columns every time.

**What to implement**: A function `concat(df)` that will take as input the original dataframe and will return it with an additional column called `full_text`, which will concatenate `title` and `selftext` columns, but with the following restrictions:

- 1) Wrap the title between `<title>` and `</title>` tags.
- 2) Add a new line (`\n`) between title and selftext, but only in cases where you have both values (see instruction 4).
- 3) Wrap the selftext between `<selftext>` and `</selftext>`.
- 4) You **must not** include the tags in points (1) or (3) if the values for these columns are missing. We will consider a missing value either an empty value (empty string) or a string of only one character (e.g., an emoji). Also, the value of a `full_text` column must not end in the new line character.
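The four restrictions can be illustrated on a single post before applying them to a whole dataframe. The helper `build_full_text` below is a hypothetical name used only for this sketch, and the example titles and bodies are invented.

```python
def build_full_text(title, selftext):
    """Combine one post's title and body following the four rules above.

    Hypothetical helper for illustration; a value counts as missing
    when it is empty or only one character long.
    """
    parts = []
    if len(title) > 1:
        parts.append('<title>' + title + '</title>')
    if len(selftext) > 1:
        parts.append('<selftext>' + selftext + '</selftext>')
    # join() only inserts the newline when both parts are present,
    # so the result never ends in '\n'
    return '\n'.join(parts)


examples = [
    ('Hello world', 'Some body text'),  # both present
    ('Hello world', ''),                # empty body
    ('Hello world', '😬'),              # one-character body counts as missing
    ('', ''),                           # nothing present
]
results = [build_full_text(t, s) for t, s in examples]
for r in results:
    print(repr(r))
```

Building a list of parts and joining it keeps the newline rule and the "must not end in a new line" rule in one place.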

In [None]:
def concat(df):
    # a value counts as present only if it has more than one character (restriction 4)
    has_title = df['title'].str.len() > 1
    has_text = df['selftext'].str.len() > 1
    title = np.where(has_title, '<title>' + df['title'] + '</title>', '')
    newline = np.where(has_title & has_text, '\n', '')   # only when both values are present
    text = np.where(has_text, '<selftext>' + df['selftext'] + '</selftext>', '')
    df['full_text'] = title + newline + text
    return df

df = concat(df)            #default in question
df.head()                  #printed the dataframe

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio,subr_faved_by_as_list,full_text
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>BREAKING: Trump to begin hiding in mail...
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>Joe Biden's America</title>
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>4 more years and we can erase his legac...
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>Revelation 9:6 [Transhumanism: The New ...
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...","<title>LOOK HERE, FAT</title>"


### P1.1.3 - Enrich posts (3 marks)

We would like to augment our text data with linguistic information. To this end, we will _tokenize_, apply _part-of-speech tagging_, and then we will _lower case_ all the posts.

**What to implement**: A function `enrich_posts(df)` that will take as input the original dataframe and will return it with **two** additional columns: `enriched_title` and `enriched_selftext`. These columns will contain tokenized, pos-tagged and lower cased versions of the original text. **You must implement them in this order**, because the pos tagger uses casing information.
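The last step, lower-casing after tagging, can be sketched without running the tagger. The `(token, tag)` pairs below are written by hand to stand in for `pos_tag` output, and `lowercase_tokens` is a hypothetical helper name.

```python
# Hand-written (token, tag) pairs standing in for pos_tag output;
# the tags are illustrative, not produced by running the tagger here
tagged = [("Joe", "NNP"), ("Biden", "NNP"), ("'s", "POS"), ("America", "NNP")]


def lowercase_tokens(tagged_tokens):
    """Lower-case the token in each (token, tag) pair and keep the tag unchanged."""
    return [(token.lower(), tag) for token, tag in tagged_tokens]


enriched = lowercase_tokens(tagged)
print(enriched)
```

Because the tags are assigned before lower-casing, capitalised tokens can still be tagged as proper nouns even though the stored token is lower case.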

In [None]:
def enrich_posts(df):
  # tokenize and pos-tag first (the tagger uses casing), then lower-case the tokens
  df['enriched_title'] = df['title'].apply(word_tokenize).apply(pos_tag)
  df['enriched_title'] = [[(j.lower(), k) for (j, k) in i] for i in df['enriched_title']]

  # same steps for the post body
  df['enriched_selftext'] = df['selftext'].apply(word_tokenize).apply(pos_tag)
  df['enriched_selftext'] = [[(j.lower(), k) for (j, k) in i] for i in df['enriched_selftext']]

  return df

df = enrich_posts(df)
df.head()

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,...,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio,subr_faved_by_as_list,full_text,enriched_title,enriched_selftext
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,...,BREAKING: Trump to begin hiding in mailboxes t...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>BREAKING: Trump to begin hiding in mail...,"[(breaking, NN), (:, :), (trump, NN), (to, TO)...",[]
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,...,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>Joe Biden's America</title>,"[(joe, NNP), (biden, NNP), ('s, POS), (america...",[]
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,...,4 more years and we can erase his legacy for g...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>4 more years and we can erase his legac...,"[(4, CD), (more, JJR), (years, NNS), (and, CC)...",[]
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,...,Revelation 9:6 [Transhumanism: The New Religio...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>Revelation 9:6 [Transhumanism: The New ...,"[(revelation, NN), (9:6, CD), ([, JJ), (transh...",[]
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,...,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...","<title>LOOK HERE, FAT</title>","[(look, NNP), (here, NNP), (,, ,), (fat, NNP)]",[]


## P1.2 - Answering questions with pandas (12 marks)

In this question, your task is to use pandas to answer questions about the data.

### P1.2.1 - Users with best scores (3 marks)

- Find the users with the highest aggregate scores (over all their posts) for the whole dataset. You should restrict your results to only those whose aggregated score is above 10,000 points, in descending order. Your code should generate a dictionary of the form `{author:aggregated_scores ... }`.
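A minimal sketch of the aggregation on toy data may help: group by author, sum the scores, filter by the threshold and sort. The authors and scores below are invented, and the example assumes only that pandas is installed.

```python
import pandas as pd

# Toy data with hypothetical authors and scores
toy = pd.DataFrame({
    'author': ['ann', 'bob', 'ann', 'cat', 'bob'],
    'score':  [7000, 12000, 6000, 500, 100],
})

totals = toy.groupby('author')['score'].sum()               # one total per author
top = totals[totals > 10000].sort_values(ascending=False)   # filter, then sort
result = {author: int(total) for author, total in top.items()}
print(result)
```

Python dictionaries preserve insertion order, so building the dictionary from the sorted Series keeps the descending order.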

In [None]:
totals = df.groupby('author')['score'].sum()                    # aggregate score per author
totals = totals[totals > 10000].sort_values(ascending=False)    # keep totals above 10,000, highest first
x = totals.to_dict()                                            # {author: aggregated_score}
print(x)                                                        # printed x

{'DaFunkJunkie': 250375, 'None': 218846, 'SUPERGUESSOUS': 211611, 'jigsawmap': 210824, 'chrisdh79': 143538, 'hildebrand_rarity': 122464, 'iSlingShlong': 118595, 'hilltopye': 81245, 'tefunka': 79560, 'OldFashionedJizz': 64398, 'JLBesq1981': 58235, 'rspix000': 57107, 'Wagamaga': 47989, 'stem12345679': 47455, 'TheJeck': 26058, 'TheGamerDanYT': 25357, 'TrumpSharted': 21154, 'NotsoPG': 18518, 'SonictheManhog': 18116, 'BlanketMage': 13677, 'NewAltWhoThis': 12771, 'kevinmrr': 11900, 'Dajakesta0624': 11613, 'apocalypticalley': 10382}


### P1.2.2 - Awarded posts (3 marks)

Find the number of posts that have received at least one award. Your query should return only one value.
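Since the dataset has one row per post, counting posts that meet a condition amounts to counting rows, not summing another column. A minimal sketch on invented toy data, assuming pandas is installed:

```python
import pandas as pd

# Toy data: one row per post, with hypothetical award counts
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'total_awards_received': [0, 2, 0, 1],
})

awarded = toy['total_awards_received'] > 0   # boolean mask, True for awarded posts
n_awarded = int(awarded.sum())               # True counts as 1, so the sum is a row count
print(n_awarded)
```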

In [None]:
awarded = df['total_awards_received'] > 0   # True for posts with at least one award
print(awarded.sum())                        # each row is one post, so this counts awarded posts


### P1.2.3 Find Covid (3 marks)

Find the name and description of all subreddits where the name starts with `Covid` or `Corona` and the description contains `covid` or `Covid` anywhere. Your code should generate a dictionary of the form

```python
  {'Coronavirus':'Place to discuss all things COVID-related',
  ...
  }
```
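The two conditions can be sketched with the standard `re` module on a small dictionary. The subreddit names and descriptions below are invented, and this sketch reads the conditions literally (case-sensitive); passing `re.IGNORECASE` to `re.compile` would give a looser match.

```python
import re

# Hypothetical subreddit names and descriptions
subreddits = {
    'CovidHypothetical': 'News about covid in a made-up town',
    'CoronaExample': 'Discussion of lockdown rules',
    'razer': 'Covid-themed keyboards',
}

name_pattern = re.compile(r'^(?:Covid|Corona)')  # name starts with Covid or Corona
desc_pattern = re.compile(r'[cC]ovid')           # description contains covid or Covid

matches = {
    name: desc
    for name, desc in subreddits.items()
    if name_pattern.search(name) and desc_pattern.search(desc)
}
print(matches)
```

Only the first entry satisfies both conditions: the second has no mention of covid in its description, and the third has a name that does not start with either prefix.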

In [None]:
new_df = df[['subreddit', 'subr_description']]   # the two columns needed

# name starts with Covid/Corona and description mentions covid (both matched case-insensitively)
name_ok = new_df.subreddit.str.contains(r'^(?:Covid|Corona)', flags=re.IGNORECASE, regex=True, na=False)
desc_ok = new_df.subr_description.str.contains('covid', flags=re.IGNORECASE, regex=True, na=False)
new_df = new_df[name_ok & desc_ok]                                  # keep rows meeting both conditions
new_df_dict = dict(zip(new_df.subreddit, new_df.subr_description))  # subreddit name -> description
new_df_dict                                                         # displayed

{'COVID': 'COVID-19 News, Etc.',
 'COVID19': 'In December 2019, SARS-CoV-2, the virus causing the disease COVID-19, emerged in the city of Wuhan, China. This subreddit seeks to facilitate scientific discussion of this global public health threat.',
 'Coronavirus': 'Place to discuss all things COVID-related',
 'CoronavirusCA': 'Tracking the Coronavirus/Covid-19 outbreak in California',
 'CoronavirusDownunder': 'This subreddit is a place to share news, information, resources, and support that relate to the novel coronavirus SARS-CoV-2 and the disease it causes called COVID-19. The primary focus of this sub is to actively monitor the situation in Australia, but all posts on international news and other virus-related topics are welcome, to the extent they are beneficial in keeping those in Australia informed.',
 'CoronavirusUS': 'USA/Canada specific information on the coronavirus (SARS-CoV-2) that causes coronavirus disease 2019 (COVID-19)'}

### P1.2.4 - Redditors that favorite the most

Find the users that have favorited the largest number of subreddits. You must produce a pandas dataframe with **two** columns, with the following format:

```python
     redditor    numb_favs
0    user1       7
1    user2       6
2    user3       5
3    user4       4
...
```

where the first column is a Redditor username and the second column is the number of distinct subreddits they have favorited.
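The counting step can be sketched with `collections.Counter` on a small mapping from subreddit to the users who favorited it. The subreddit and user names are invented for this sketch.

```python
from collections import Counter

# Hypothetical mapping: subreddit -> users who favorited it
faved_by = {
    'subA': ['user1', 'user2', 'user3'],
    'subB': ['user1', 'user2'],
    'subC': ['user1'],
}

# set(users) ensures a user is counted at most once per subreddit
counts = Counter(user for users in faved_by.values() for user in set(users))

# most_common() returns (user, count) pairs sorted by count, highest first
ranking = counts.most_common()
print(ranking)
```

The list of pairs can then be passed to `pd.DataFrame(ranking, columns=['redditor', 'numb_favs'])` to obtain the requested two-column format.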

In [None]:

# one row per subreddit, with the string listing the users who favorited it
dfnew = df[['subreddit', 'subr_faved_by']].drop_duplicates().copy()
dfnew['subr_faved_by'] = dfnew['subr_faved_by'].apply(literal_eval)       # converted from string to list
userlist = Counter(u for r in dfnew['subr_faved_by'] for u in r)          # number of subreddits per user
f = pd.DataFrame(userlist.most_common(), columns=['redditor', 'numb_favs'])  # sorted by count, highest first
f

Unnamed: 0,redditor,numb_favs
0,magnusthered15,7
1,ry_ta506,6
2,KarmaFury,6
3,FriendlyVegetable420,6
4,OmniusQubus,6
...,...,...
1593,certifiedloverboy69,1
1594,diveonfire,1
1595,mouthofreason,1
1596,Alexify,1


## P1.3 Ethics (8 marks)

**(updated on 16/03/2022)**

Imagine you are **the head of a data mining company that needs to use** the insights gained in this assignment to scan social media for covid-related content, and automatically flag it as conspiracy or not conspiracy (for example, for hiding potentially harmful tweets or Facebook posts). **Some information about the project and the team:**

 - Your client is a political party concerned about misinformation.
 - The project requires mining Facebook, Reddit and Instagram data.
 - The team consists of Joe, an American mathematician who just finished college; Fei, a senior software engineer from China; and Francisco, a data scientist from Spain.

Reflect on the impact of exploiting data science for such an application. You should map your discussion to one of the five actions outlined in the UK’s Data Ethics Framework. 

Your answer should address the following:

 - Identify the action **in which your project is the weakest**.
 - Then, justify your choice by critically analyzing the three key principles **for that action** outlined in the Framework, namely transparency, accountability and fairness.
 - Finally, you should propose one solution that explicitly addresses one point related to one of these three principles, reflecting on how your solution would improve the data cycle in this particular use case.

Your answer should be between 500 and 700 words. **You are strongly encouraged to follow a scholarly approach, e.g., with references to peer reviewed publications. References do not count towards the word limit.**




As the head of the data mining company, I will meet the client, a political party, together with my company's core team and try to understand its concerns as fully as I can. Concerns raised by a political client are likely to have a wide impact on a particular geographical area. I will explain my company's privacy rules and policies to the client and ask it to work with us on this problem within the law, because failing to do so could harm both the company and the public.

Considering the intensity and seriousness of the project, the principle of Transparency should be the top priority. As the project requires mining social media data, I will divide the work among my team. I will assign Fei and Francisco the work of scanning and mining posts made by prominent personalities who have many followers. Because such people influence the public, there is little value in scanning posts by individuals with very few contacts on social media. Fei and Francisco will also identify the keywords that contradict, or are irrelevant to, accurate information about covid, so that these can later be used in an algorithm or model. Joe, as a mathematician, will be responsible for the graph and pattern analysis of the harmful or irrelevant keywords, which draws on advanced counting techniques, logic, relations, graph theory and the analysis of algorithms, and will coordinate with Fei and Francisco for deeper analysis.

Actions:

The first action - Define and understand public benefit and user need:
This can be understood by talking with the client and considering the short-term and long-term effects on the public.

The second action - Involve diverse expertise:
This can be done with the help of socially active influencers, top medical experts, research scientists, public relations experts, etc. This will give a wider view of the problem and its impact on social media users.

The third action - Comply with the law:
There are federal laws on the use of user data and its limits, and all companies have to comply with them. Where the current law does not provide for this kind of project or is unclear, special permission should be sought from the government by showing that the project can help society gain accurate knowledge rather than believe anything that is posted on social media.

The fourth action - Review the quality and limitations of the data:
As mentioned in the plan, only content posted by prominent personalities will be considered. This limits the data to be scanned, and its quality can be considered adequate because such personalities have thousands of connections. Both Fei and Francisco are responsible for this work.


Action 4 - Review the quality and limitations of the data - is the most important action for the efficient application of this project. The analysis of the principles for this action is as follows:


Data Ethics Principles:
Transparency - This principle can be followed by publicly presenting the information given by the top medical authorities and organisations. The purpose and progress of the project should also be publicised, subject to the law and the company's privacy policies. This transparency should also be coordinated with the government so that it can take proper measures to control the misinformation.
	
Accountability - The governing body is central to this principle. As the client is from the government (a political party), the decision to control misinformation rests with the government. The government should be made a minor stakeholder in this project, as it can devise ways to spread more accurate information efficiently. However, as the head of the company, I should make sure that the government or political party has no direct control over the information the company makes public.

Fairness - As the client is a political party, the project's aims and objectives should be made fair by all legal means, so that there is no bias or discrimination in its objectives or results. The law is equal for everyone, and the legal consequences of manipulating the results of the project should be made clear to the client.



References:

[1] Erin Simpson and Adam Conner, 18 August 2020. Fighting Coronavirus Misinformation and Disinformation. Available at: https://www.americanprogress.org/article/fighting-coronavirus-misinformation-disinformation/ [Accessed on: 5 May 2022]

[2] European Commission. Identifying conspiracy theories. Available at: https://ec.europa.eu/info/live-work-travel-eu/coronavirus-response/fighting-disinformation/identifying-conspiracy-theories_en [Accessed on: 5 May 2022]

[3] Into Math, 1 November 2019. Math and Social Media: Activities. Available at: https://intomath.org/post/math-and-social-media/ [Accessed on: 30 April 2022]

[4] Phillip Othen, 21 March 202. Data Ethics and Marketing. Available at: https://pressgazette.co.uk/data-ethics-marketing/ [Accessed on: 6 May 2022]

