# **Text Data Analysis on social network Reddit**

# Part 1

Reddit is a platform that allows users to upload posts and comment on them. It is divided into _subreddits_, which often cover specific themes or areas of interest (for example, [world news](https://www.reddit.com/r/worldnews/), [ukpolitics](https://www.reddit.com/r/ukpolitics/) or [nintendo](https://www.reddit.com/r/nintendo)).

The `csv` dataset contains one row per post and holds information about three entities: **posts**, **users** and **subreddits**. The column names are self-explanatory: columns starting with the prefix `user_` describe users, those starting with the prefix `subr_` describe subreddits, the `subreddit` column holds the subreddit name, and the remaining columns are post attributes (`author`, `posted_at`, `title`, the post text in `selftext`, the number of comments in `num_comments`, `score`, etc.).
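
As a quick illustration of this naming convention, the sketch below splits the column names into the three entities by prefix. The `columns` list is copied by hand from the dataset header, and `group_columns` is a hypothetical helper introduced only for this example.

```python
# Column names copied from the dataset header (the unnamed index column is omitted)
columns = [
    "author", "posted_at", "num_comments", "score", "selftext",
    "subr_created_at", "subr_description", "subr_faved_by",
    "subr_numb_members", "subr_numb_posts", "subreddit", "title",
    "total_awards_received", "upvote_ratio", "user_num_posts",
    "user_registered_at", "user_upvote_ratio",
]

def group_columns(names):
    """Split column names into user, subreddit and post attributes by prefix."""
    groups = {"user": [], "subreddit": [], "post": []}
    for name in names:
        if name.startswith("user_"):
            groups["user"].append(name)
        elif name.startswith("subr_") or name == "subreddit":
            groups["subreddit"].append(name)
        else:
            groups["post"].append(name)
    return groups

groups = group_columns(columns)
print(groups["user"])
```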

We will perform a number of operations to gain insights from the data.

## P1.0) Suggested/Required Imports

In [None]:
# suggested imports
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()
from ast import literal_eval
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
from urllib import request
import pandas as pd
module_url = "https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_22.csv"
module_name = module_url.split('/')[-1]
print(f'Fetching {module_url}')
# download the csv and save a local copy
with request.urlopen(module_url) as f, open(module_name, 'w', encoding='utf-8') as outf:
  a = f.read()
  outf.write(a.decode('utf-8'))
df = pd.read_csv(module_name)
# this fills empty cells with empty strings
df = df.fillna('')

Fetching https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_22.csv


In [None]:
df.head()

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.0,4661,2012-11-09,-0.658599
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.0,4661,2012-11-09,-0.658599
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.0,4661,2012-11-09,-0.658599
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599


## P1.1 - Text data processing

### P1.1.1 - Offensive authors per subreddit 

As you will see, the dataset contains many strings of the form `[***]`. These mask (or replace) swearwords to make the text less offensive. We are interested in finding, for each subreddit, the users that have posted at least one swearword there. We do this by looking for occurrences of the `[***]` string in the `selftext` column (we can assume that each occurrence of `[***]` corresponds to a swearword in the original dataset).

**What to implement:** A function `offensive_authors(df)` that takes as input the original dataframe and returns a dataframe of the form below, where each row contains authors that posted at least one swearword in the corresponding subreddit.

```
subreddit	author
0	40kLore	Cross_Ange
1	40kLore	DaRandomGitty2
2	40kLore	EMB1981
3	40kLore	Evoxrus_XV
4	40kLore	Grtrshop
...
```
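
To make the counting step concrete, here is a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not rows from the dataset). It counts `[***]` occurrences per post, sums them per (subreddit, author) pair, and keeps the pairs with at least one occurrence. Since `[`, `*` and `]` are regex metacharacters, the marker is escaped with `re.escape`.

```python
import re
import pandas as pd

# Toy data (made up for illustration, not taken from the dataset)
toy = pd.DataFrame({
    "subreddit": ["a", "a", "a", "b"],
    "author": ["ann", "ann", "bob", "ann"],
    "selftext": ["what the [***]", "clean post", "nothing here", "[***] and [***]"],
})

# Series.str.count expects a regex, so the literal marker must be escaped
marker = re.escape("[***]")
per_post = toy["selftext"].str.count(marker)

# Total number of masked swearwords per (subreddit, author) pair
swear_counts = per_post.groupby([toy["subreddit"], toy["author"]]).sum()

# Keep only the pairs with at least one masked swearword
pairs = swear_counts[swear_counts > 0].reset_index()[["subreddit", "author"]]
print(pairs)
```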

In [None]:
def offensive_authors(df):

    """
    Returns a DataFrame with one row per (subreddit, author) pair in which the author posted at least one swearword in that subreddit.

    Args:
    - df (pandas.DataFrame): A DataFrame containing at least the columns 'subreddit', 'author', and 'selftext'.
    
    Returns:
    - pandas.DataFrame: A DataFrame with the columns 'subreddit' and 'author', sorted by subreddit and then by author.
    """

    # Flag the posts whose text contains the masked swearword marker;
    # regex=False treats '[***]' as a literal string
    has_swearword = df['selftext'].str.contains('[***]', regex=False, na=False)

    # Keep one row per (subreddit, author) pair with at least one flagged post
    output_df = (df.loc[has_swearword, ['subreddit', 'author']]
                   .drop_duplicates()
                   .sort_values(['subreddit', 'author'])
                   .reset_index(drop=True))

    return output_df

In [None]:
offensive_authors(df)

Unnamed: 0,subreddit,author
0,40kLore,Cross_Ange
1,40kLore,DaRandomGitty2
2,40kLore,EMB1981
3,40kLore,Evoxrus_XV
4,40kLore,Grtrshop
...,...,...
490,worldbuilding,nyello-2000
491,worldbuilding,spirtomb1831
492,worldbuilding,storywriter109
493,xqcow,Stubka_The_Russian


### P1.1.2 - Most common trigrams per subreddit

We are interested in learning about _the ten most frequent trigrams_ (a [trigram](https://en.wikipedia.org/wiki/Trigram) is a sequence of three consecutive words) in each subreddit's content. You must compute these trigrams on both the `selftext` and `title` columns. Your task is to generate a Python dictionary of the form:

```
{subreddit1: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)],
subreddit2: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)],
...
subreddit63: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)],}
```

That is, for each subreddit, the 10 most frequent trigrams and their frequency, stored in a list of tuples. Each trigram will be stored also as a tuple containing 3 strings.

**What to implement**: A function `get_tris(df, stopwords_list, punctuation_list)` that will take as input the original dataframe, a list of stopwords and a list of punctuation signs (e.g., `?` or `!`), and will return a python dictionary with the above format. Your function must implement the following steps in order:

- Create a new dataframe called `newdf` with only the `subreddit`, `title` and `selftext` columns.
- Add a new column to `newdf` called `full_text`, which will contain `title` and `selftext` concatenated with the string `.` (a full stop) followed by a space. That is, `A simple title` and `This is a text body` would become `A simple title. This is a text body`.
- Remove all occurrences of the following strings from `full_text`. You must do this without creating a new column:
  - `[***]`
  - `&amp;`
  - `&gt;`
  - `https`
- You must also remove all occurrences of at least three consecutive hyphens. For example, you should remove strings like `---`, `----`, `-----`, etc., but not `--` and not `-`.
- Tokenize the contents of the `full_text` column after lower casing (removing all capitalization). You should use the `word_tokenize` function in `nltk`. Add the results to a new column called `full_text_tokenized`.
- Remove all tokens that are either stopwords or punctuation from `full_text_tokenized` and store the results in a new column called `full_text_tokenized_clean`. _See Note 1_.
- Create a new dataframe called `adf` (which stands for _aggregated dataframe_), which will have one row per subreddit (i.e., 63 rows) and two columns: `subreddit` (the subreddit name) and `all_words`, a single list with all the words that belong to that subreddit, as extracted from `full_text_tokenized_clean`.
- Obtain trigram counts, which will be stored in a dictionary where each `key` will be a trigram (a `tuple` containing 3 consecutive tokens), and each `value` will be its overall frequency in that subreddit. You are encouraged to use functions from the `nltk` package, although you can choose any approach to solve this part.
- Finally, use the information you have in `adf` to generate the desired dictionary, and return it. _See Note 2_.
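
The two removal steps above can be sketched on a single string as follows. This is only an illustration using the standard `re` module; `clean_text` and the two pattern names are hypothetical, and in the actual task the same patterns would be applied to the `full_text` column rather than to one string.

```python
import re

# Literal strings to strip; re.escape stops '[***]' being read as a character class
REMOVE = re.compile("|".join(re.escape(s) for s in ["[***]", "&amp;", "&gt;", "https"]))
# Three or more consecutive hyphens
HYPHENS = re.compile(r"-{3,}")

def clean_text(text):
    """Remove the listed strings and long hyphen runs from one string."""
    text = REMOVE.sub("", text)
    return HYPHENS.sub("", text)

print(clean_text("A simple title. Tom &amp; Jerry ---- well-known -- [***]"))
```

Note that `--` and the single hyphen in `well-known` survive, because the quantifier `{3,}` only matches runs of three or more.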

Note 1. You can obtain stopwords and punctuation as follows.
- Stopwords: 
```
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
```
- Punctuation:
```
import string
punctuation = list(string.punctuation)
```

Note 2. You do not have to apply an additional ordering when there are several trigrams with the same frequency.
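
As a small illustration of the trigram-counting step, the sketch below uses only the standard library on a made-up token list. `top_trigrams` is a hypothetical helper; `nltk.ngrams(tokens, 3)` yields the same tuples as the `zip` call used here.

```python
from collections import Counter

def top_trigrams(tokens, n=10):
    """Return the n most frequent trigrams of a token list as (trigram, freq) pairs."""
    # zip over three offset views yields every run of three consecutive tokens
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    return Counter(trigrams).most_common(n)

# Made-up sentence, already lower-cased and split into tokens
tokens = "space marine chapter fights space marine chapter again".split()
print(top_trigrams(tokens, n=1))
```

Each trigram is a tuple of three strings, and each entry in the returned list is a `(trigram, frequency)` tuple, which matches the required output format.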

In [None]:
# necessary imports here for extra clarity
from nltk.corpus import stopwords as sw
import string
import warnings
from nltk import ngrams
from nltk.probability import FreqDist
def get_tris(df, stopwords_list, punctuation_list):

  """Returns a dictionary of the top 10 trigrams for each subreddit in the input DataFrame, after preprocessing the text data.

    Args:
    - df (pandas.DataFrame): A DataFrame containing at least the columns 'subreddit', 'title', and 'selftext'.
    - stopwords_list (list): A list of stopwords to be removed from the text data.
    - punctuation_list (list): A list of punctuation symbols to be removed from the text data.
    
    Returns:
    - dict: A dictionary where the keys are the subreddits in the input DataFrame, and the values are lists of (trigram, frequency) tuples with the 10 most frequent trigrams of each subreddit.
  """

  # create new df with only relevant columns (copy so the original df is left untouched)
  newdf = df[['subreddit', 'title', 'selftext']].copy()

  # concatenate title and selftext
  newdf['full_text'] = newdf['title'] + '. ' + newdf['selftext']

  # remove the strings "[***]", "&amp;", "&gt;" and "https" without creating a new column
  removal_pattern = r'\[\*\*\*\]|&amp;|&gt;|https'
  newdf['full_text'] = newdf['full_text'].str.replace(removal_pattern, '', regex=True)

  # remove runs of at least three consecutive hyphens
  newdf['full_text'] = newdf['full_text'].str.replace(r'-{3,}', '', regex=True)

  # lower case, tokenize, and add result to full_text_tokenized
  newdf['full_text_tokenized'] = newdf['full_text'].apply(lambda x: word_tokenize(x.lower()))

  # discard every token that is either a stopword or punctuation (a set gives fast lookups)
  excluded = set(stopwords_list) | set(punctuation_list)
  newdf['full_text_tokenized_clean'] = newdf['full_text_tokenized'].apply(
      lambda tokens: [token for token in tokens if token not in excluded])

  # aggregated dataframe: one row per subreddit, with all of its cleaned tokens in 'all_words'
  adf = (newdf.groupby('subreddit')['full_text_tokenized_clean']
              .sum()
              .to_frame()
              .reset_index()
              .rename(columns={'full_text_tokenized_clean': 'all_words'}))

  # count the trigrams of each subreddit with nltk's ngrams and FreqDist
  tri_counts = adf['all_words'].apply(lambda words: FreqDist(ngrams(words, 3)))

  # keep the 10 most frequent trigrams per subreddit and map them to the subreddit name
  output_dict = dict(zip(adf['subreddit'], tri_counts.apply(lambda fd: fd.most_common(10))))
  return output_dict

  

In [None]:
# get stopwords as list (use a new name so the `sw` corpus alias is not overwritten)
stopwords_list = sw.words('english')
# get punctuation as list
punctuation_list = list(string.punctuation)
# optional: silence warnings such as pandas' SettingWithCopyWarning
warnings.filterwarnings('ignore')
get_tris(df, stopwords_list, punctuation_list)

{'40kLore': [(('whose', 'bolter', 'anyway'), 8),
  (('started', 'us', 'examples'), 8),
  (('space', 'marine', 'chapter'), 7),
  (('kabal', 'black', 'heart'), 7),
  (('lo', '’', 'tos'), 7),
  (('die', 'paragon', 'knights'), 6),
  (('dark', 'age', 'technology'), 4),
  (('let', "'s", 'say'), 4),
  (('``', 'star', 'claimers'), 4),
  (('star', 'claimers', "''"), 4)],
 'AMD_Stock': [(('created', 'subreddit', 'reddit'), 10),
  (('subreddit', 'reddit', 'posts'), 10),
  (('reddit', 'posts', 'r/radeongpus'), 10),
  (('open', 'redditors', 'posts'), 10),
  (('redditors', 'posts', 'well'), 10),
  (('posts', 'well', 'please'), 10),
  (('well', 'please', 'consider'), 10),
  (('please', 'consider', 'subscribing'), 10),
  (('consider', 'subscribing', 'find'), 10),
  (('subscribing', 'find', 'posts'), 10)],
 'Anki': [(('``', 'conditional', "''"), 7),
  (('``', 'field2', "''"), 5),
  (('\\^conditional', 'field1', '/conditional'), 5),
  (('conditional', "''", 'filled'), 4),
  (('font-family', 'simplified'

## P1.2 - Answering questions with pandas

Your task is to use pandas to answer questions about the data.

### P1.2.1 - Authors that post highly commented posts

Find the top 1000 most commented posts. Then, obtain the names of the authors that have at least 3 posts among these posts.

**What to implement:** A function `find_popular_authors(df)` that takes as input the original dataframe and returns a list of strings, where each string is the name of an author that satisfies the above criteria.
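
The two-step logic (take the most commented posts, then count posts per author) can be illustrated on made-up data with scaled-down thresholds. The `toy` frame, `top_n` and `min_posts` are assumptions for this sketch; the task itself uses 1000 and 3.

```python
import pandas as pd

# Toy data (made up): six posts by three authors
toy = pd.DataFrame({
    "author": ["ann", "ann", "bob", "ann", "cat", "bob"],
    "num_comments": [50, 40, 30, 20, 10, 5],
})

top_n, min_posts = 4, 2  # scaled-down stand-ins for 1000 and 3

# nlargest keeps the top_n rows with the highest comment counts
top_posts = toy.nlargest(top_n, "num_comments")

# Count how many of those rows belong to each author, then apply the threshold
counts = top_posts["author"].value_counts()
popular = counts[counts >= min_posts].index.tolist()
print(popular)
```

Here only `ann` qualifies: she wrote three of the four most commented posts, while `bob` wrote just one of them.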

In [None]:
def find_popular_authors(df):

  """
    This function takes a DataFrame with 'author' and 'num_comments' columns, selects the 1000 posts with the 
    highest number of comments, and returns a list of the authors who have at least 3 posts among them.
    
    Parameters:
    df (pandas.DataFrame): A DataFrame containing 'author' and 'num_comments' columns.
    
    Returns:
    list: A list of the authors who have at least 3 posts among the 1000 most commented posts.
  """
  
  df_filtered=df[['author','num_comments']]
  df_sorted=df_filtered.sort_values(by='num_comments', ascending=False)
  df_sorted_1000=df_sorted.head(1000)
  author_counts=df_sorted_1000['author'].value_counts()
  popular_authors=author_counts[author_counts>=3]
  list_popular_authors=popular_authors.index.tolist()
  return list_popular_authors

In [None]:
find_popular_authors(df)

['AutoModerator',
 'r[***]og',
 'jigsawmap',
 'Salramm01',
 'HippolasCage',
 'FunPeach0',
 'iSlingShlong',
 'Stoaticor',
 'kevinmrr',
 'ratioetlogicae',
 'None',
 'harushiga',
 'tefunka',
 'SlobBarker',
 'stargem5',
 'AristonD',
 'werdmouf',
 'Cross_Ange',
 'samzz41',
 'itsreallyreallytrue',
 'SUPERGUESSOUS',
 'Frocharocha',
 'habichuelacondulce',
 'CantStopPoppin',
 'Allstarhit',
 'theitguyforever',
 'rebooted_life_42',
 'Zhana-Aul',
 'Not4Reel',
 'Jellyrollrider',
 'NYLaw',
 'MakeItRainSheckels',
 'TurtleFacts72',
 'Defie-LOH-Gic',
 'Typoqueen00',
 'imagepoem',
 'nycsellit4me',
 'madman320',
 'mythrowawaybabies',
 'kogeliz',
 'strngerdngermaus',
 'Kinmuan',
 'AllisonGator',
 'Antiliani',
 'vizard673',
 'notpreposterous',
 'BanDerUh',
 'dukey',
 'BebeFanMasterJ',
 'Fr1sk3r',
 'Gambit08',
 'XDitto',
 'elt0p0',
 'twistedlogicx',
 'TAKEitTOrCIRCLEJERK',
 'Ramy_',
 'tacolben',
 'Morihando',
 '2020c[***]er[***]',
 'dunphish64',
 'apocalypticalley',
 'dsbwayne',
 'schuey_08',
 'blacked_love

### P1.2.2 - Distribution of posts per weekday

Find the percentage of posts that were posted on each weekday (Monday, Tuesday, etc.). You can use an external calendar or any of the functionality for dealing with dates available in pandas.

**What to implement:** A function `get_weekday_post_distribution(df)` that takes as input the original dataframe and returns a dictionary of the form (the values are made up):

```
{'Monday': '14%',
'Tuesday': '23%', 
...
}
```

Note that each percentage must be given with two decimals, and you must include the percentage sign in the output dictionary.

The order of the keys in the returned dictionary does not matter, so the order in which it gets printed will not matter either.
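
A minimal sketch of the date handling and formatting, on made-up timestamps (the `posted_at` values below are assumptions that follow the dataset's timestamp format). It relies on `Series.dt.day_name()` for the weekday names and on fixed-point formatting to always get two decimals.

```python
import pandas as pd

# Toy timestamps (made up); 2020-08-17 was a Monday
posted_at = pd.Series(["2020-08-17 20:26:04", "2020-08-18 09:00:00",
                       "2020-08-24 12:00:00", "2020-08-19 23:59:59"])

weekdays = pd.to_datetime(posted_at).dt.day_name()

# normalize=True returns fractions that sum to 1
shares = weekdays.value_counts(normalize=True) * 100

# '.2f' always yields exactly two decimals, e.g. 25 becomes '25.00'
distribution = {day: f"{pct:.2f}%" for day, pct in shares.items()}
print(distribution)
```

Formatting with `.2f` is a design choice here: rounding first and then converting to a string would drop trailing zeros (for example `14.8` instead of `14.80`).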

In [None]:
def get_weekday_post_distribution(df):

    """
    This function takes a pandas DataFrame with a 'posted_at' column that contains dates and times of posts, derives the weekday of each post, and calculates the percentage of posts that were made on each day of the week. The input DataFrame is not modified.

    Args:
    - df: a pandas DataFrame with a 'posted_at' column containing dates and times of posts

    Returns:
    - A dictionary where the keys are the names of the days of the week and the values are the percentages of posts made on each day, as strings with two decimals (e.g. '14.89%')
    """

    # Convert 'posted_at' to datetime and get the weekday name of each post
    weekdays = pd.to_datetime(df['posted_at']).dt.day_name()

    # Calculate the percentage of posts for each weekday
    percentages = weekdays.value_counts(normalize=True) * 100

    # Format each value with exactly two decimals and a percentage sign
    result = {day: f'{pct:.2f}%' for day, pct in percentages.items()}

    return result

In [None]:
get_weekday_post_distribution(df)

{'Wednesday': '14.89%',
 'Friday': '14.79%',
 'Thursday': '14.75%',
 'Tuesday': '14.54%',
 'Monday': '14.31%',
 'Saturday': '13.76%',
 'Sunday': '12.96%'}

### P1.2.3 - The 100 most passionate redditors

We would like to know which 100 redditors (`author` column) are the most passionate. We will measure this by checking, for each redditor, the ratio at which they use adjectives, computed by dividing the number of adjectives by the total number of words the redditor used. The analysis will only consider redditors that have written more than 1000 words.

**What to implement:** A function called `get_passionate_redditors(df)` that takes as input the original dataframe and returns a list of the top 100 redditors (authors) by the ratio at which they use adjectives considering both the `title` and `selftext` columns. The returned list should be a list of tuples, where each inner tuple has two elements: the redditor (author) name, and the ratio of adjectives they used. The returned list should be sorted by adjective ratio in descending order (highest first). Only redditors that wrote more than 1000 words should be considered. You should use `nltk`'s `word_tokenize` and `pos_tag` functions to tokenize and find adjectives. You do not need to do any preprocessing like stopword removal, lemmatization or stemming.
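
To illustrate the ratio itself, the sketch below works on hand-written `(token, tag)` pairs in the format returned by `nltk.pos_tag`. The tags are illustrative assumptions rather than actual tagger output, and `adjective_ratio` is a hypothetical helper. In the Penn Treebank tag set used by `pos_tag`, adjectives are tagged `JJ`, `JJR` (comparative) or `JJS` (superlative), so checking the `JJ` prefix covers all three.

```python
# Hand-written (token, tag) pairs; the tags are assumptions for illustration
tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("is", "VBZ"), ("faster", "JJR"), ("than", "IN"), ("the", "DT"),
          ("laziest", "JJS"), ("dog", "NN")]

def adjective_ratio(tagged_tokens):
    """Share of tokens whose tag is an adjective tag (JJ, JJR or JJS)."""
    if not tagged_tokens:
        return 0.0
    adjectives = sum(1 for _, tag in tagged_tokens if tag.startswith("JJ"))
    return adjectives / len(tagged_tokens)

print(adjective_ratio(tagged))
```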

In [None]:
def get_passionate_redditors(df):

    """
    Returns a list of top 100 redditors based on their ratio of adjectives to total words used in their posts.
    
    Args:
    df (pandas.DataFrame): Dataframe containing Reddit post data with columns 'author', 'title', and 'selftext'.
    
    Returns:
    list: A list of tuples containing the top 100 redditors and their corresponding adjective ratios in descending order.
    Each tuple is of the form (author_name, adjective_ratio).
    """

    redditor_ratios = {}
    for author, group in df.groupby('author'):
        # Combine all titles and post bodies of this redditor and tokenize them
        text = ' '.join(group['title']) + ' ' + ' '.join(group['selftext'])
        tokens = word_tokenize(text)

        # Only consider redditors who have written more than 1000 words (tokens)
        if len(tokens) <= 1000:
            continue

        # Count adjectives: the Penn Treebank adjective tags are JJ, JJR and JJS
        adj_count = sum(1 for _, tag in pos_tag(tokens) if tag.startswith('JJ'))

        # Adjective ratio = number of adjectives / total number of words
        redditor_ratios[author] = adj_count / len(tokens)

    # Sort the redditors by their adjective ratio in descending order and return the top 100
    sorted_redditors = sorted(redditor_ratios.items(), key=lambda x: x[1], reverse=True)
    top_100_redditors = sorted_redditors[:100]

    return top_100_redditors

In [None]:
get_passionate_redditors(df)

[('OhanianIsTheBest', 0.1473768522226672),
 ('srvnmdomdotnet', 0.13358778625954199),
 ('madman320', 0.13114754098360656),
 ('healrstreettalk', 0.12574341546304163),
 ('wezafabregas', 0.12299465240641712),
 ('FreedomBoners', 0.12229976736457294),
 ('inspiration_capsule', 0.11650485436893204),
 ('bradipaurbana', 0.11538461538461539),
 ('Joe_Tazuna', 0.11428571428571428),
 ('Jumido730', 0.11323328785811733),
 ('factfind', 0.11072380561820791),
 ('mubukugrappa', 0.11038961038961038),
 ('DogMeatTalk', 0.10987261146496816),
 ('Playaguy', 0.10897435897435898),
 ('LisaMck041', 0.10880829015544041),
 ('TheAtheistArab87', 0.10759493670886076),
 ('superegz', 0.10588235294117647),
 ('asad1ali2', 0.1054421768707483),
 ('mushroomsarefriends', 0.10404624277456648),
 ('Travis-Cole', 0.10388580491673276),
 ('LanJiaoDuaKee', 0.10344827586206896),
 ('spiritofcom', 0.102880658436214),
 ('clme', 0.10222222222222223),
 ('ComradePetri', 0.10218978102189781),
 ('abdouh15', 0.1),
 ('goodnewsies', 0.1),
 ('seam