# Data Science Portfolio - Part I (30 marks)

In this question you will write Python code for processing, analyzing and understanding the social network **Reddit** (www.reddit.com). Reddit is a platform that allows users to upload posts and comment on them, and is divided into _subreddits_, often covering specific themes or areas of interest (for example, [world news](https://www.reddit.com/r/worldnews/), [ukpolitics](https://www.reddit.com/r/ukpolitics/) or [nintendo](https://www.reddit.com/r/nintendo)). You are provided with a subset of Reddit with posts from Covid-related subreddits (e.g., _CoronavirusUK_ or _NoNewNormal_), as well as randomly selected subreddits (e.g., _donaldtrump_ or _razer_).

The `csv` dataset you are provided contains one row per post, and has information about three entities: **posts**, **users** and **subreddits**. The column names are self-explanatory: columns starting with the prefix `user_` describe users, those starting with the prefix `subr_` describe subreddits, the `subreddit` column is the subreddit name, and the rest of the columns are post attributes (`author`, `posted_at`, `title`, the post text in the `selftext` column, the number of comments in `num_comments`, `score`, etc.).

In this exercise, you are asked to perform a number of operations to gain insights from the data.

In [6]:
# suggested imports
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()
from ast import literal_eval
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

import warnings 
warnings.filterwarnings("ignore") # to ignore warnings

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/joydipb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/joydipb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/joydipb/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [7]:
from urllib import request
import pandas as pd
module_url = "https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv"
module_name = module_url.split('/')[-1] # file name is the last part of the URL
print(f'Fetching {module_url}')
# download the csv and save it locally, writing it explicitly as UTF-8
with request.urlopen(module_url) as f, open(module_name, 'w', encoding='utf-8') as outf:
    outf.write(f.read().decode('utf-8'))


df = pd.read_csv('data_portfolio_21.csv')
# this fills empty cells with empty strings
df = df.fillna('')

Fetching https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv


In [8]:
df.head() # check the first few rows

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.0,4661,2012-11-09,-0.658599
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.0,4661,2012-11-09,-0.658599
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.0,4661,2012-11-09,-0.658599
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599


## P1.1 - Text data processing (10 marks)

### P1.1.1 - Faved by as lists (3 marks)

The column `subr_faved_by` contains an array of values (names of redditors who added the subreddit to which the current post was submitted), but unfortunately they are in text format, and you would not be able to process them properly without converting them to a suitable python type. You must convert these string values to Python lists, going from

```python
'["user1", "user2" ... ]'
```

to

```python
["user1", "user2" ... ]
```

**What to implement:** Implement a function `transform_faves(df)` which takes as input the original dataframe and returns the same dataframe, but with one additional column called `subr_faved_by_as_list`, where you have the same information as in `subr_faved_by`, but as a python list instead of a string.

In [9]:
def transform_faves(df):
    df['subr_faved_by_as_list'] = [literal_eval(x) for x in df['subr_faved_by']] # this is a list of lists
   
    return df   # return the dataframe

df = transform_faves(df) # run the function on the dataframe
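
Because empty cells were replaced with empty strings when the data was loaded, `literal_eval` would raise an error on any row whose `subr_faved_by` value is blank. A minimal defensive sketch follows, assuming that blank or malformed values should be treated as "no favorites"; the helper name `parse_faves` is hypothetical and not part of the assignment.

```python
from ast import literal_eval

def parse_faves(raw):
    """Convert a stringified list of usernames into a Python list.

    Assumption: blank or malformed values map to an empty list.
    """
    if not isinstance(raw, str) or not raw.strip():
        return []
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError):
        # literal_eval raises one of these when the text is not a valid Python literal
        return []
    # Only accept list-like results; anything else counts as malformed
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

print(parse_faves("['user1', 'user2']"))
print(parse_faves(""))
```

If this behaviour is wanted, the helper could be applied column-wise with `df['subr_faved_by'].apply(parse_faves)`.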


### P1.1.2 - Merge titles and text bodies (4 marks)

All Reddit posts need to have a title, but a text body is optional. However, we want to be able to access all free text information for each post without having to look at two columns every time.

**What to implement**: A function `concat(df)` that will take as input the original dataframe and will return it with an additional column called `full_text`, which will concatenate `title` and `selftext` columns, but with the following restrictions:

- 1) Wrap the title between `<title>` and `</title>` tags.
- 2) Add a new line (`\n`) between title and selftext, but only in cases where you have both values (see instruction 4).
- 3) Wrap the selftext between `<selftext>` and `</selftext>`.
- 4) You **must not** include the tags in points (1) or (3) if the value for the corresponding column is missing. We will consider a missing value either an empty value (empty string) or a string of only one character (e.g., an emoji). Also, the value of a `full_text` column must not end in the new line character.
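
The four rules can be sketched for a single post as follows. This is an illustrative helper only (the names `is_missing` and `build_full_text` are hypothetical), assuming that "missing" means a string of length zero or one, as stated in rule 4.

```python
def is_missing(value):
    # Empty strings and single-character strings count as missing (rule 4)
    return len(value) <= 1

def build_full_text(title, selftext):
    parts = []
    if not is_missing(title):
        parts.append(f"<title>{title}</title>")           # rule 1
    if not is_missing(selftext):
        parts.append(f"<selftext>{selftext}</selftext>")  # rule 3
    # Joining inserts "\n" only between two present parts (rule 2),
    # so the result can never end in a new line
    return "\n".join(parts)

print(build_full_text("Hello", "World"))
print(build_full_text("Hello", ""))
```

Building a list of present parts and joining them keeps the new line rule in one place, instead of handling each combination of missing values with a separate branch.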

In [10]:
def concat(df):
    # a value is missing if it is empty or has only one character (e.g., an emoji)
    has_title = df['title'].str.len() > 1
    has_text = df['selftext'].str.len() > 1
    # wrap each value in its tags, or use an empty string when the value is missing
    title = ("<title>" + df['title'] + "</title>").where(has_title, "")
    text = ("<selftext>" + df['selftext'] + "</selftext>").where(has_text, "")
    # add the new line only when both title and selftext are present
    newline = pd.Series("\n", index=df.index).where(has_title & has_text, "")
    df['full_text'] = title + newline + text # create the new column
    return df # return the dataframe

df = concat(df) # run the function on the dataframe

### P1.1.3 - Enrich posts (3 marks)

We would like to augment our text data with linguistic information. To this end, we will _tokenize_, apply _part-of-speech tagging_, and then we will _lower case_ all the posts.

**What to implement**: A function `enrich_posts(df)` that will take as input the original dataframe and will return it with **two** additional columns: `enriched_title` and `enriched_selftext`. These columns will contain tokenized, pos-tagged and lower cased versions of the original text. **You must implement them in this order**, because the pos tagger uses casing information.

In [11]:
def enrich_posts(df):
    def enrich(text):
        tokens = word_tokenize(text) # 1) tokenize
        tagged = pos_tag(tokens) # 2) pos-tag while the original casing is still intact
        return [(token.lower(), tag) for token, tag in tagged] # 3) lower case the tokens

    # assign whole columns at once instead of writing cell by cell with chained indexing
    df['enriched_title'] = df['title'].apply(enrich)
    df['enriched_selftext'] = df['selftext'].apply(enrich)
    return df # return the dataframe

df = enrich_posts(df) # run the function on the dataframe

## P1.2 - Answering questions with pandas (12 marks)

In this question, your task is to use pandas to answer questions about the data.

### P1.2.1 - Users with best scores (3 marks)

- Find the users with the highest aggregate scores (over all their posts) for the whole dataset. You should restrict your results to only those whose aggregated score is above 10,000 points, in descending order. Your code should generate a dictionary of the form `{author:aggregated_scores ... }`.

In [12]:
# aggregate the score over all posts of each author
author_scores = df.groupby('author')['score'].sum()
# keep only the authors whose aggregated score is above 10,000, in descending order
author_scores = author_scores[author_scores > 10000].sort_values(ascending=False)
best_score_dict = author_scores.to_dict() # dictionary of the form {author: aggregated_score}

print(list(best_score_dict.items())) # print the dictionary as (author, score) pairs



    
    

[('DaFunkJunkie', 250375), ('None', 218846), ('SUPERGUESSOUS', 211611), ('jigsawmap', 210784), ('chrisdh79', 143538), ('hildebrand_rarity', 122464), ('iSlingShlong', 118595), ('hilltopye', 81245), ('tefunka', 79560), ('OldFashionedJizz', 64398), ('JLBesq1981', 58235), ('rspix000', 57107), ('Wagamaga', 47989), ('stem12345679', 47455), ('TheJeck', 26058), ('TheGamerDanYT', 25357), ('TrumpSharted', 21154), ('NotsoPG', 18518), ('SonictheManhog', 18116), ('BlanketMage', 13677), ('NewAltWhoThis', 12771), ('kevinmrr', 11900), ('Dajakesta0624', 11613), ('apocalypticalley', 10382)]


### P1.2.2 - Awarded posts (3 marks)

Find the number of posts that have received at least one award. Your query should return only one value.

In [13]:
# count the posts whose award count is at least 1
award_count = int((df['total_awards_received'] >= 1).sum())
print("Number of posts that received at least one award: ", award_count) # print the counter


Number of posts that received at least one award:  119


### P1.2.3 Find Covid (3 marks)

Find the name and description of all subreddits where the name starts with `Covid` or `Corona` and the description contains `covid` or `Covid` anywhere. Your code should generate a dictionary of the form

```python
  {'Coronavirus':'Place to discuss all things COVID-related',
  ...
  }
```
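
The two conditions differ in where the pattern may match: the name must match at the start, while the description may match anywhere. A minimal sketch with Python's `re` module is shown here on made-up subreddit data; the names and descriptions in `examples` and the helper `is_covid_subreddit` are illustrative assumptions, not taken from the dataset.

```python
import re

NAME_PATTERN = re.compile(r"Covid|Corona")  # used with .match(): anchored at the start
DESC_PATTERN = re.compile(r"[cC]ovid")      # used with .search(): anywhere in the text

def is_covid_subreddit(name, description):
    # .match() only succeeds at position 0, while .search() scans the whole string.
    # Both return a match object or None, so the result is converted with bool().
    return bool(NAME_PATTERN.match(name)) and bool(DESC_PATTERN.search(description))

# Hypothetical data for illustration only
examples = {
    "CovidSupport": "A place to talk about covid recovery",  # both conditions hold
    "CoronaNews": "Daily pandemic headlines",                 # description lacks covid/Covid
    "AskCovid": "Questions about Covid-19",                   # name does not start with Covid
}
covid_dict = {
    name: desc for name, desc in examples.items() if is_covid_subreddit(name, desc)
}
print(covid_dict)
```

Note that the match is case-sensitive, as the task specifies, so a description that only contains `COVID` would not qualify. Flags such as `re.IGNORECASE` only take effect when passed to `re.compile`, `re.match` or `re.search` as an argument.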

In [14]:
# subreddit name and description repeat on every post, so keep one row per subreddit
subreddits = df[['subreddit', 'subr_description']].drop_duplicates(subset='subreddit')
# the name must start with Covid or Corona (str.match anchors the pattern at the start)
name_ok = subreddits['subreddit'].str.match(r'Covid|Corona')
# the description must contain covid or Covid anywhere (case-sensitive, as specified)
desc_ok = subreddits['subr_description'].str.contains(r'[cC]ovid')
matches = subreddits[name_ok & desc_ok] # both conditions must hold
covid_dict = dict(zip(matches['subreddit'], matches['subr_description'])) # {name: description}

print(covid_dict) # print the dictionary

### P1.2.4 - Redditors that favorite the most

Find the users that have favorited the largest number of subreddits. You must produce a pandas dataframe with **two** columns, with the following format:

```python
     redditor	    numb_favs
0	user1           7
1	user2           6
2	user3           5
3	user4           4
...
```

where the first column is a Redditor username and the second column is the number of distinct subreddits he/she has favorited.

In [15]:
# One row per subreddit: the list of redditors who favorited it repeats on every post
faves = df.drop_duplicates(subset='subreddit')[['subreddit', 'subr_faved_by_as_list']]
# One row per (subreddit, redditor) pair; subreddits nobody favorited give NaN and are dropped
faves = faves.explode('subr_faved_by_as_list').dropna(subset=['subr_faved_by_as_list'])
# Count the distinct subreddits per redditor and sort in descending order
df_numb_favs = (
    faves.groupby('subr_faved_by_as_list')['subreddit']
    .nunique()
    .sort_values(ascending=False)
    .rename_axis('redditor')
    .reset_index(name='numb_favs')
)
# Print the DataFrame
print(df_numb_favs)


## P1.3 Ethics (8 marks)

**(updated on 16/03/2022)**

Imagine you are **the head of a data mining company that needs to use** the insights gained in this assignment to scan social media for covid-related content, and automatically flag it as conspiracy or not conspiracy (for example, for hiding potentially harmful tweets or Facebook posts). **Some information about the project and the team:**

 - Your client is a political party concerned about misinformation.
 - The project requires mining Facebook, Reddit and Instagram data.
 - The team consists of Joe, an American mathematician who just finished college; Fei, a senior software engineer from China; and Francisco, a data scientist from Spain.

Reflect on the impact of exploiting data science for such an application. You should map your discussion to one of the five actions outlined in the UK’s Data Ethics Framework. 

Your answer should address the following:

 - Identify the action **in which your project is the weakest**.
 - Then, justify your choice by critically analyzing the three key principles **for that action** outlined in the Framework, namely transparency, accountability and fairness.
 - Finally, you should propose one solution that explicitly addresses one point related to one of these three principles, reflecting on how your solution would improve the data cycle in this particular use case.

Your answer should be between 500 and 700 words. **You are strongly encouraged to follow a scholarly approach, e.g., with references to peer reviewed publications. References do not count towards the word limit.**

---

URLs from sources classified as mainstream or alternative were examined for stories that supported conspiracy theories. Further investigation revealed that such stories were either removed or labelled as conspiracy content. Alternative news sources produced more stories that supported conspiracy theories than mainstream news sources did, but similar articles from mainstream sources reached a much larger audience. Stories promoting conspiracy ideas were more viral than stories denying them. The spread of conspiracy ideas was significantly slowed by content moderation on Facebook, Reddit, 4chan and Twitter.

1. In my opinion, the action in which the project is weakest is the evaluation and consideration of wider policy implications, for the following reasons. Conspiracy theories appear primarily on four platforms: Facebook, Twitter, Reddit, and 4chan, whose "politically incorrect" (/pol/) board is a popular site for conspiracy theorists. Unlike on other occasions, 4chan and Reddit are not the only places where conspiracy ideas can be found. Stories promoting conspiracy theories went viral faster than stories debunking or neutralising them. The majority of stories bolstering conspiracy theories came from alternative sources, personal blogs, and social media posts, and they attracted many Facebook and Twitter likes.

2. Justification based on the three key principles:

a. Transparency: the data published, the social media posts made, and the information available should all have a valid and confirmed source. Twitter and YouTube removed stories that supported conspiracy theories, while Reddit and Facebook either removed or flagged them because they were primarily unverified. On Reddit, whether content was removed or flagged was determined by the rules of each sub-community; on Facebook, it depended on whether the company reviewed the stories itself (deleted) or relied on third-party fact-checkers (flagged).

b. Accountability: this refers to the presence of effective governance and oversight procedures, and of control over decisions and actions. Content moderation presented different challenges on each platform. On Twitter, for example, it was less effective than on other sites. This is most likely explained by how content is removed, as disinformation spreads quickly on Twitter in the first few hours after it appears. YouTube had trouble with timing as well. For example, a video claiming that the epidemic is a staged hoax received millions of views in just a few days, with copies of the video being constantly re-uploaded after it was taken down. Facebook censored the fewest stories that supported conspiracy theories, while Reddit appeared to have no moderation of older content.

c. Fairness: because it is critical to avoid unintended discriminatory effects on individuals or social groups, all biases that may affect the final outcomes should be addressed. These outcomes should respect the dignity of individuals, be non-discriminatory, and serve the public good. Platform owners should pay more attention to what they censor and why they filter it, and should explain their decisions to users explicitly. According to studies, more openness and thoughtfulness in content removal makes users more aware of the type of information they are consuming, changes how they engage with it, and builds trust between them and the services. Mainstream sources should be mindful that the information they generate during the reporting process could be used to support and reinforce conspiracy theories.


Reference: The spread of COVID-19 conspiracy theories on social media and the .... https://misinforeview.hks.harvard.edu/article/the-spread-of-covid-19-conspiracy-theories-on-social-media-and-the-effect-of-content-moderation/
