
# Data Science Portfolio - Part I 

In this question you will write Python code for processing, analyzing and understanding the social network **Reddit** (www.reddit.com). Reddit is a platform that allows users to upload posts and comment on them, and is divided into _subreddits_, often covering specific themes or areas of interest (for example, [world news](https://www.reddit.com/r/worldnews/), [ukpolitics](https://www.reddit.com/r/ukpolitics/) or [nintendo](https://www.reddit.com/r/nintendo)). You are provided with a subset of Reddit with posts from Covid-related subreddits (e.g., _CoronavirusUK_ or _NoNewNormal_), as well as randomly selected subreddits (e.g., _donaldtrump_ or _razer_).

The `csv` dataset you are provided contains one row per post, and has information about three entities: **posts**, **users** and **subreddits**. The column names are self-explanatory: columns starting with the prefix `user_` describe users, those starting with the prefix `subr_` describe subreddits, the `subreddit` column is the subreddit name, and the rest of the columns are post attributes (`author`, `posted_at`, `title`, the post text in the `selftext` column, the number of comments in `num_comments`, `score`, etc.).

In this exercise, you are asked to perform a number of operations to gain insights from the data.

In [None]:
# suggested imports
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()
from ast import literal_eval
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
from urllib import request
import pandas as pd
module_url = f"https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv"
module_name = module_url.split('/')[-1]
print(f'Fetching {module_url}')
#with open("file_1.txt") as f1, open("file_2.txt") as f2
with request.urlopen(module_url) as f, open(module_name,'w') as outf:
  a = f.read()
  outf.write(a.decode('utf-8'))


df = pd.read_csv('data_portfolio_21.csv')
# this fills empty cells with empty strings
df = df.fillna('')

Fetching https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv


In [None]:
df.head()

## P1.1 - Text data processing 

### P1.1.1 - Faved by as lists 

The column `subr_faved_by` contains an array of values (names of redditors who added the subreddit to which the current post was submitted), but unfortunately they are in text format, and you would not be able to process them properly without converting them to a suitable python type. You must convert these string values to Python lists, going from

```python
'["user1", "user2" ... ]'
```

to

```python
["user1", "user2" ... ]
```

**What to implement:** Implement a function `transform_faves(df)` which takes as input the original dataframe and returns the same dataframe, but with one additional column called `subr_faved_by_as_list`, where you have the same information as in `subr_faved_by`, but as a python list instead of a string.
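
The conversion described above can be sketched on a single string before applying it to the whole column. This is a minimal illustration, not the required solution: `parse_list_string` is a hypothetical helper name, the sample value is made up, and mapping empty strings to an empty list is an assumption based on the notebook filling missing cells with `''`.

```python
from ast import literal_eval

def parse_list_string(value):
    """Turn a string such as '["user1", "user2"]' into a real Python list.

    Assumption: an empty string (a filled-in missing cell) becomes [].
    """
    return literal_eval(value) if value else []

raw = '["user1", "user2", "user3"]'   # made-up example value
parsed = parse_list_string(raw)

print(parsed)          # the usernames, without leftover quotes or spaces
print(type(parsed))    # a list, not a str
```

Compared with stripping brackets and splitting on commas, `literal_eval` also removes the quotes and surrounding whitespace from each username.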

In [None]:
def transform_faves(df):
    '''Function used to form a new column that contains the users data as a list'''
    # your code here
    df['subr_faved_by_as_list'] = [literal_eval(x) if x else [] for x in df['subr_faved_by']]   #Using literal_eval with list comprehension to parse each string into a real list (no leftover quotes or spaces); empty cells become empty lists
    return df

df = transform_faves(df)

### P1.1.2 - Merge titles and text bodies 

All Reddit posts need to have a title, but a text body is optional. However, we want to be able to access all free text information for each post without having to look at two columns every time.

**What to implement**: A function `concat(df)` that will take as input the original dataframe and will return it with an additional column called `full_text`, which will concatenate `title` and `selftext` columns, but with the following restrictions:

- 1) Wrap the title between `<title>` and `</title>` tags.
- 2) Add a new line (`\n`) between title and selftext, but only in cases where you have both values (see instruction 4).
- 3) Wrap the selftext between `<selftext>` and `</selftext>`.
- 4) You **must not** include the tags in points (1) or (3) if the value for the corresponding column is missing. We will consider a missing value either an empty value (empty string) or a string of only one character (e.g., an emoji). Also, the value of a `full_text` column must not end in the new line character.
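
The four rules can be checked on plain strings before touching the dataframe. The sketch below is only an illustration: `build_full_text` is a hypothetical helper name, and ignoring surrounding whitespace when deciding whether a value is missing is an assumption, not something the task states.

```python
def build_full_text(title, selftext):
    """Combine title and selftext following rules (1) to (4).

    Assumption: a value counts as missing when, after stripping
    whitespace, it has at most one character.
    """
    parts = []
    if len(title.strip()) > 1:
        parts.append(f"<title>{title}</title>")
    if len(selftext.strip()) > 1:
        parts.append(f"<selftext>{selftext}</selftext>")
    # join() only puts a newline between parts, so the result never ends in "\n"
    return "\n".join(parts)

both = build_full_text("Hello world", "Some body text")
title_only = build_full_text("Hello world", "")
emoji_body = build_full_text("Hello world", "😀")   # one character: treated as missing
body_only = build_full_text("", "Some body text")
print(repr(both))
```

Collecting the parts in a list and joining them avoids special-casing the newline for each combination of present and missing values.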

In [None]:
def concat(df):
    # your code here
    '''Function used to wrap the title and the selftext columns if they are present'''
    def build(row):
      parts = []                                                                #collecting the wrapped pieces for the current row only
      if len(row['title'].strip()) > 1:                                         #condition to check if a title exists (more than one character)
        parts.append("<title>" + row['title'] + "</title>")                     #if a title exists, wrap it between '<title>' and '</title>'
      if len(row['selftext'].strip()) > 1:                                      #condition to check if selftext exists (more than one character)
        parts.append("<selftext>" + row['selftext'] + "</selftext>")            #if selftext exists, wrap it between '<selftext>' and '</selftext>'
      return "\n".join(parts)                                                   #the new line is only added when both parts are present
    df['full_text'] = df.apply(build, axis = 1)                                 #applying the helper row by row to build the full_text column
    return df

df = concat(df)

### P1.1.3 - Enrich posts 
We would like to augment our text data with linguistic information. To this end, we will _tokenize_, apply _part-of-speech tagging_, and then we will _lower case_ all the posts.

**What to implement**: A function `enrich_posts(df)` that will take as input the original dataframe and will return it with **two** additional columns: `enriched_title` and `enriched_selftext`. These columns will contain tokenized, pos-tagged and lower cased versions of the original text. **You must implement them in this order**, because the pos tagger uses casing information.

In [None]:
def enrich_posts(df):
    # your code here
    '''Function used to tokenize, pos_tag and lower case the title and selftext columns'''
    def enrich(text):
        tagged = pos_tag(word_tokenize(text))                                   #tokenising first, then pos_tagging while the original casing is still available
        return [(token.lower(), tag) for token, tag in tagged]                  #lower casing the tokens only after tagging
    df['enriched_title'] = df['title'].apply(enrich)                            #applying the three steps to the title column
    df['enriched_selftext'] = df['selftext'].apply(enrich)                      #applying the three steps to the selftext column
    return df

df = enrich_posts(df)

## P1.2 - Answering questions with pandas (12 marks)

In this question, your task is to use pandas to answer questions about the data.

### P1.2.1 - Users with best scores (3 marks)

- Find the users with the highest aggregate scores (over all their posts) for the whole dataset. You should restrict your results to only those whose aggregated score is above 10,000 points, in descending order. Your code should generate a dictionary of the form `{author:aggregated_scores ... }`.
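
One way to read this requirement is: sum the scores per author, keep the totals above the threshold, sort them, and only then convert to a dictionary, so that each author stays paired with their own total. The sketch below uses a made-up toy dataframe; only the `author` and `score` column names come from the dataset description.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "author": ["alice", "bob", "alice", "carol", "bob"],
    "score":  [9000, 4000, 3000, 20000, 5000],
})

totals = toy.groupby("author")["score"].sum()              # one total per author
top = totals[totals > 10000].sort_values(ascending=False)  # filter, then sort the Series itself
result = top.to_dict()                                     # keys and values stay aligned

print(result)   # carol (20000) comes first, then alice (12000); bob (9000) is dropped
```

Sorting the Series (index and values together) rather than sorting the values separately is what keeps each author attached to the right score.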

In [None]:
# your code here
'''A query used to find the highest aggregate scores in descending order'''
a = df.groupby(by = 'author')['score'].sum()                                    #Using groupby to group by author and sum their scores (a Series indexed by author)
a1 = a[a > 10000].sort_values(ascending = False)                                #Keeping only users with more than 10000 points and sorting authors together with their scores in descending order
a12_dict = a1.to_dict()                                                         #Converting the sorted Series to a dictionary so each author stays paired with their own score
a12_dict

{'BlanketMage': 250375,
 'DaFunkJunkie': 218846,
 'Dajakesta0624': 211611,
 'JLBesq1981': 210824,
 'NewAltWhoThis': 143538,
 'None': 122464,
 'NotsoPG': 118595,
 'OldFashionedJizz': 81245,
 'SUPERGUESSOUS': 79560,
 'SonictheManhog': 64398,
 'TheGamerDanYT': 58235,
 'TheJeck': 57107,
 'TrumpSharted': 47989,
 'Wagamaga': 47455,
 'apocalypticalley': 26058,
 'chrisdh79': 25357,
 'hildebrand_rarity': 21154,
 'hilltopye': 18518,
 'iSlingShlong': 18116,
 'jigsawmap': 13677,
 'kevinmrr': 12771,
 'rspix000': 11900,
 'stem12345679': 11613,
 'tefunka': 10382}

### P1.2.2 - Awarded posts 

Find the number of posts that have received at least one award. Your query should return only one value.

In [None]:
# your code here
'''A query used to return the number of posts that have won at least one award'''
df.total_awards_received[df.total_awards_received>0].count()      #Counting the number of posts that have won at least one award

119

### P1.2.3 Find Covid 
Find the name and description of all subreddits where the name starts with `Covid` or `Corona` and the description contains `covid` or `Covid` anywhere. Your code should generate a dictionary of the form

```python
  {'Coronavirus':'Place to discuss all things COVID-related',
  ...
  }
```
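
A possible approach, shown on made-up data, is to build one boolean mask for the name and one for the description and combine them. Note the assumptions: the description column is called `subr_description` here, which is a guess based on the `subr_` prefix convention and should be checked against the actual columns, and the match follows the literal wording of the task (`covid` or `Covid`), so an all-caps `COVID` would not match.

```python
import pandas as pd

# Toy data, invented for illustration; 'subr_description' is an assumed column name
toy = pd.DataFrame({
    "subreddit": ["Coronavirus", "CovidUK", "Coronation", "razer", "Coronavirus"],
    "subr_description": [
        "News about covid and the pandemic",
        "Covid discussion for the UK",
        "All about royal ceremonies",
        "Razer hardware, not covid",
        "News about covid and the pandemic",
    ],
})

subs = toy.drop_duplicates(subset="subreddit")                 # one row per subreddit
name_ok = subs["subreddit"].str.match(r"(?:Covid|Corona)")     # match() anchors at the start of the name
desc_ok = subs["subr_description"].str.contains(r"[cC]ovid")   # 'covid' or 'Covid' anywhere
hits = subs[name_ok & desc_ok]

covid_subs = dict(zip(hits["subreddit"], hits["subr_description"]))
print(covid_subs)
```

If a case-insensitive match were intended instead, `str.contains("covid", case=False)` would be the usual alternative.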

In [None]:
# your code here

### P1.2.4 - Redditors that favorite the most

Find the users that have favorited the largest number of subreddits. You must produce a pandas dataframe with **two** columns, with the following format:

```python
     redditor	    numb_favs
0	user1           7
1	user2           6
2	user3	       5
3	user4           4
...
```

where the first column is a Redditor username and the second column is the number of distinct subreddits he/she has favorited.
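
As a rough sketch of the counting logic, the example below keeps one row per subreddit, expands each list into one row per (subreddit, redditor) pair, and counts distinct subreddits per redditor. The data is invented; only the `subreddit` and `subr_faved_by_as_list` column names come from earlier in the notebook.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "subreddit": ["A", "A", "B", "C"],
    "subr_faved_by_as_list": [["u1", "u2"], ["u1", "u2"], ["u1"], ["u1", "u3"]],
})

per_sub = toy.drop_duplicates(subset="subreddit")     # the fave list repeats on every post of a subreddit
pairs = per_sub.explode("subr_faved_by_as_list")      # one row per (subreddit, redditor) pair

favs = (
    pairs.groupby("subr_faved_by_as_list")["subreddit"]
    .nunique()                                        # distinct subreddits per redditor
    .sort_values(ascending=False)
    .rename_axis("redditor")
    .reset_index(name="numb_favs")                    # back to a two-column dataframe
)
print(favs)
```

Using `reset_index(name=...)` and keeping the returned object is what yields a dataframe with the two requested column names rather than a Series.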

In [None]:
# your code here
x = df[[ 'subreddit', 'subr_faved_by_as_list']].drop_duplicates(subset = ['subreddit'])['subr_faved_by_as_list'] # Taking only the subreddit and subr_faved_by_as_list columns separately, while dropping all the duplicate subreddits
user_likes = pd.Series([p for q in x for p in q]).value_counts() #Using list comprehension to flatten the lists and count how many subreddits each user has favourited
user_likes = user_likes.rename_axis('redditor').reset_index(name = 'numb_favs')   #Making a dataframe out of it (keeping the result) with the required column names
user_likes.head()

 'magnusthered15'    7
 'KarmaFury'         6
 'Flippy-Fish'       6
 'OmniusQubus'       6
 'hmhmhm2'           6
dtype: int64

## P1.3 Ethics 

**(updated on 16/03/2022)**

Imagine you are **the head of a data mining company that needs to use** the insights gained in this assignment to scan social media for covid-related content, and automatically flag it as conspiracy or not conspiracy (for example, for hiding potentially harmful tweets or Facebook posts). **Some information about the project and the team:**

 - Your client is a political party concerned about misinformation.
 - The project requires mining Facebook, Reddit and Instagram data.
 - The team consists of Joe, an American mathematician who just finished college; Fei, a senior software engineer from China; and Francisco, a data scientist from Spain.

Reflect on the impact of exploiting data science for such an application. You should map your discussion to one of the five actions outlined in the UK’s Data Ethics Framework. 

Your answer should address the following:

 - Identify the action **in which your project is the weakest**.
 - Then, justify your choice by critically analyzing the three key principles **for that action** outlined in the Framework, namely transparency, accountability and fairness.
 - Finally, you should propose one solution that explicitly addresses one point related to one of these three principles, reflecting on how your solution would improve the data cycle in this particular use case.

Your answer should be between 500 and 700 words. **You are strongly encouraged to follow a scholarly approach, e.g., with references to peer reviewed publications. References do not count towards the word limit.**

---
The five actions stated in the UK Data Ethics Framework are: defining and understanding public benefit and user need; involving diverse expertise; complying with the law; reviewing the quality and limitations of the data; and evaluating and considering wider policy implications. In my opinion, the actions in which this project appears to be weakest are reviewing the quality and limitations of the data and involving diverse expertise.

1) Reviewing the quality of data:

The data sources considered for this project are subreddits from Reddit. Although some relate to the main aim of the project (Covid-19), others were randomly selected and have very little connection to it, and hence violate [Article 5(1)(c)](https://gdpr-info.eu/art-5-gdpr/) of the GDPR, which requires personal data to be adequate, relevant and limited to what is necessary for the purpose.

Fairness:

According to the ethics framework, ensuring fairness requires assessing the data for bias, but the project description does not mention any such checks. The collected data could therefore be biased towards or against a certain demographic.

Accountability:

As the team consists mainly of people from a data science background, there is no external scrutiny to make sure that the algorithm produces correct output. However, the procedure we have followed can be easily reproduced if the dataset is in exactly the same format, as it is built from functions that can be called wherever needed.

Transparency:

No Data Sharing Agreements are mentioned anywhere, and I do not know whether the organisations from which this data is extracted are even aware that I am using their data to scan social media for misinformation. I could publish the data and the model on public websites, but as the data comes from social media and is sensitive, people could easily be targeted by simply searching for them on the respective sites.

My Solution:

As the data is sensitive, I would first anonymise it to make sure that people cannot be targeted. I would then sign a Data Sharing Agreement and assign a digital object identifier (DOI), which would allow me to share the data openly without repercussions for the people in it.

2) Involving Diverse Expertise:

From the question we can see that the team consists of three people from different countries and cultural backgrounds, but their expertise all falls under data science. According to the UK Data Ethics Framework, we should involve people beyond data scientists, such as policy experts, practitioners and subject matter experts, to help ensure that bias is minimised.

Fairness:

Although the people in my team come from different cultural backgrounds, which may seem diverse enough, their technical backgrounds all fall under the data science umbrella. This makes the team homogeneous and may produce bias.

Accountability:

No external domain experts are involved in the project, and there is no mention of consulting relevant civil society groups either. We could consider consulting the target audience or the users of this project.

Transparency:

We can make an effort to publish information on expert consultations, but as there is no information about whether we are going to use any, we should first engage expert consultants and then publish information about the consultation.

My solution:

If all three members of the team are of the same gender, I would look for people of other genders to work on this project to reduce bias as much as I can. I would also consult people beyond data scientists, such as ethicists, researchers and subject matter experts. Finally, I would publish information on GitHub crediting everyone who has worked on this project.

