# Design and construction of a tripartite network from Reddit

Datasets of multipartite complex networks with three or more levels (tripartite, quadripartite, etc.) are very scarce, unlike those with only two levels, better known as bipartite graphs, which are quite common.

I designed and began to construct a tripartite network for my Ph.D. thesis, using the website [Reddit](https://www.reddit.com). According to their own description, "*Reddit is a network of communities where people can dive into their interests, hobbies and passions. There's a community for whatever you're interested in on Reddit*". In this context, I use the term *groups* instead of *communities* for technical reasons and to avoid misunderstandings.

The tripartite network I defined is composed of:
1. **Users** (usernames)
2. **Groups** (subreddits)
3. **Keywords** (words)
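
As a rough illustration of this structure, the sketch below stores the three node sets and the weighted user-group and user-keyword links in plain dictionaries. The class name `TripartiteNetwork` and its methods are hypothetical, chosen only for this example, and the sample users and weights are made up.

```python
from collections import defaultdict


class TripartiteNetwork:
    """Hypothetical container for users, groups and keywords (illustration only)."""

    def __init__(self):
        # user -> {group: weight} and user -> {keyword: weight}
        self.user_groups = defaultdict(dict)
        self.user_keywords = defaultdict(dict)

    def add_user(self, user, groups, keywords):
        """Register one user together with its weighted groups and keywords."""
        self.user_groups[user].update(groups)
        self.user_keywords[user].update(keywords)

    def node_sets(self):
        """Return the three disjoint node sets of the network."""
        users = set(self.user_groups) | set(self.user_keywords)
        groups = {g for links in self.user_groups.values() for g in links}
        keywords = {k for links in self.user_keywords.values() for k in links}
        return users, groups, keywords


# Made-up sample data, not taken from Reddit
net = TripartiteNetwork()
net.add_user("user_a", {"berlin": 3, "Music": 1}, {"jazz": 2})
net.add_user("user_b", {"berlin": 1}, {"jazz": 1, "rock": 4})
users, groups, keywords = net.node_sets()
print(sorted(users), sorted(groups), sorted(keywords))
```

Keeping the two link types in separate dictionaries mirrors the two bipartite projections (user-groups and user-keywords) that share the same set of users.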

My main interest is the analysis of tripartite networks in two important topics:
* **Link prediction**. This can be used in recommendation systems, for example: we could recommend to a user certain groups that they might find interesting, based on our analysis.
* **Community detection**. Also called clustering in (slightly) different contexts, it can be used, for instance, to detect clusters of users based on the groups they frequent and the keywords they use.
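
To give a flavour of the link prediction task, here is a minimal sketch of a generic common-neighbours baseline on the user-groups layer. It is not one of the algorithms developed for the thesis; the function name `recommend_groups` and the toy data are assumptions made only for this example.

```python
from collections import Counter


def recommend_groups(target, user_groups, top_n=3):
    """Score groups unseen by `target` using the overlap with other users' groups."""
    seen = set(user_groups[target])
    scores = Counter()
    for other, groups in user_groups.items():
        if other == target:
            continue
        shared = len(seen & set(groups))  # number of groups in common
        if shared == 0:
            continue
        for group in set(groups) - seen:
            scores[group] += shared  # each shared group counts as one vote
    return [group for group, _ in scores.most_common(top_n)]


# Made-up toy data
user_groups = {
    "user_a": ["berlin", "Music"],
    "user_b": ["berlin", "Music", "Jazz"],
    "user_c": ["berlin", "chile"],
    "user_d": ["funny"],
}
print(recommend_groups("user_a", user_groups))
```

Users who share more groups with the target contribute more votes, so groups visited by similar users rank first.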

I have already developed many algorithms for **link prediction** and **community detection** in multipartite networks, but I lacked datasets to test them.

In [1]:
import requests

from collections import Counter

import nltk

from textblob import TextBlob

In [2]:
# To use the Reddit API you should have first a Reddit account and
# sign up for an OAUTH Client ID in https://www.reddit.com/prefs/apps
# and at the page bottom click on: "are you a developer? create an app..."
# https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c

my_username = 'tripartitenetwork' #account created only for this purpose
my_password = '987654321reddit123456789'

personal_use_script = 'jVFLZzCvn9H82rRg_M_O1w'
secret = 'djzraeUgBxE5U-BKirzY7OG9RQm7_w'

In [3]:
def headers_connection_request():
    # note that CLIENT_ID refers to 'personal use script' and SECRET_TOKEN to 'token'
    auth = requests.auth.HTTPBasicAuth(personal_use_script, secret)

    # here we pass our login method (password), username, and password
    data = {'grant_type': 'password',
            'username': my_username,
            'password': my_password}

    # setup our header info, which gives reddit a brief description of our app
    headers = {'User-Agent': 'MyBot/0.0.1'}

    # send our request for an OAuth token
    res = requests.post('https://www.reddit.com/api/v1/access_token',
                        auth=auth, data=data, headers=headers)

    # convert response to JSON and pull access_token value
    TOKEN = res.json()['access_token']

    # add authorization to our headers dictionary
    headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

    return headers

In [4]:
# initialize the headers once; if a request fails later, they are requested again within the while loops
my_headers = headers_connection_request()
my_headers

{'User-Agent': 'MyBot/0.0.1',
 'Authorization': 'bearer 1206362233968-CE0VIO-ExQy8R80EVzxua6vyVP4uYQ'}

In [5]:
# in case the first call doesn't work within the while loops, we need to add this extra while loop
# (it is used in the groups_keywords_dict function at the bottom)
'''my_headers = None
while my_headers is None:
    print("test")
    try:
        # connect
        my_headers = headers_connection_request()
    except:
         pass'''


## The starting point is any Reddit username; it is the only input we need.

In [6]:
username = 'urbannomadberlin' #'GovSchwarzenegger'
my_limit = 100

## (A) We start by extracting all the words used by our specific user and, simultaneously, the groups where they were posted

We describe every text that a certain **user** writes (publicly) as a *post*. Calling the Reddit API, we identify two main types of *posts*, one of which has subtypes:

1. `comment`


2. `submitted`

    i. `title`
    
    ii. `selftext` (optional)

### (i) We extract the keywords from comments and the subreddits where they were posted.

We extract the **keywords** from every `comment` *post*, from every `title` of a `submitted` *post* and, if present, from the `selftext` of a `submitted` *post*. We then save all of them in a common string, `posts_full_text`.
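
The extraction step can be sketched without network access by running it on hand-written dictionaries that imitate the shape of the API responses. The helper `collect_posts` and the fake listings are hypothetical; only the field names `body`, `title`, `selftext` and `subreddit` come from this notebook.

```python
def collect_posts(listing, text_fields):
    """Return (full_text, groups) from a Reddit-style listing dictionary."""
    parts, groups = [], []
    for child in listing["data"]["children"]:
        data = child["data"]
        groups.append(data["subreddit"])
        for field in text_fields:
            text = data.get(field)
            if text:  # skip missing or empty fields such as an empty selftext
                parts.append(text)
    return " ".join(parts), groups


# Hand-written stand-ins for the API responses
fake_comments = {"data": {"children": [
    {"data": {"body": "I love jazz", "subreddit": "Jazz"}},
]}}
fake_submitted = {"data": {"children": [
    {"data": {"title": "Looking for bands", "selftext": "", "subreddit": "Music"}},
    {"data": {"title": "Berlin gigs", "selftext": "Any tips?", "subreddit": "berlin"}},
]}}

text_c, groups_c = collect_posts(fake_comments, ["body"])
text_s, groups_s = collect_posts(fake_submitted, ["title", "selftext"])
posts_full_text = text_c + " " + text_s
groups_list = groups_c + groups_s
print(posts_full_text)
print(groups_list)
```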

In [7]:
posts_full_text = ""
groups_list = []

In [8]:
while True:
    try:
        res_comments = requests.get("https://oauth.reddit.com" + "/user" + "/" + username + "/comments",
                                    headers = my_headers,
                                    params = {'limit': my_limit})
        break
    except requests.ConnectionError:
        print("ConnectionError, trying again...")
        my_headers = headers_connection_request()

In [9]:
for post in res_comments.json()['data']['children']:
    posts_full_text += " " + post['data']['body']
    groups_list.append(post['data']['subreddit'])

### (ii) Extracting keywords from the submitted title and, if any, from the submitted selftext, together with the subreddits where they were posted.

At the same time, we append the subreddits, i.e. the **groups** to which every *post* belongs, to a list called `groups_list`.

In [10]:
while True:
    try:
        res_submitted = requests.get("https://oauth.reddit.com" + "/user" + "/" + username + "/submitted",
                                     headers = my_headers,
                                     params = {'limit': my_limit})
        break
    except requests.ConnectionError:
        print("ConnectionError, trying again...")
        my_headers = headers_connection_request()

In [11]:
for post in res_submitted.json()['data']['children']:
    posts_full_text += " " + post['data']['title']
    groups_list.append(post['data']['subreddit'])
    if post['data']['selftext']:
        posts_full_text += " " + post['data']['selftext']

#### Having all the groups where a user posted, we make a very simple analysis of them.

We count how many times each **group** is repeated and save the counts in a Python dictionary, `groups_dict`. This will help us later to associate every **group** with its respective **user**: the associated value corresponds to the link weight in the newly defined bipartite **user-groups** network.
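
As a small illustration of how these counts become link weights, the sketch below turns a list of subreddit names into `(user, group, weight)` tuples. The helper name `weighted_edges` and the sample list are assumptions for this example only.

```python
from collections import Counter


def weighted_edges(user, groups_list):
    """Return (user, group, weight) tuples, heaviest link first."""
    return [(user, group, count)
            for group, count in Counter(groups_list).most_common()]


# Made-up list of subreddits, one entry per post
sample_groups = ["chile", "berlin", "chile", "Music", "chile", "berlin"]
edges = weighted_edges("some_user", sample_groups)
print(edges)
```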

In [12]:
groups_dict = {group: count for group, count in Counter(groups_list).most_common()}
#groups_dict

#### After retrieving all the keywords from the user's posts, we start analyzing them with the simplest approach: the [bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model).

The intention is to improve this analysis later with methods such as n-grams, or more sophisticated ones from the field of natural language processing.
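
For reference, counting n-grams needs only a few lines. The sketch below is a possible starting point, not part of the current pipeline; it assumes simple whitespace tokenization instead of the TextBlob tokenizer, and the name `ngram_counts` is hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n=2):
    """Count contiguous n-grams in a list of tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))  # sliding window of length n
    return Counter(" ".join(gram) for gram in grams)


tokens = "i love jazz and i love rock".split()
bigrams = ngram_counts(tokens, n=2)
print(bigrams.most_common(3))
```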

In [13]:
corpus_text = posts_full_text.lower()
#corpus_text

In [14]:
#nltk.download('stopwords') #download if necessary!

stopwords_e = nltk.corpus.stopwords.words('english')
stopwords_g = nltk.corpus.stopwords.words('german')
stopwords_s = nltk.corpus.stopwords.words('spanish') #add languages if needed
stopwords = stopwords_e + stopwords_g + stopwords_s

mystopwords = ["also", "b", "best", "cannot", "can't", "cant"] #complete with words to exclude if necessary

stopwords += mystopwords

def common_words(text):
    # optionally add the word.isalpha() condition to keep only words made of letters
    return [word for word in TextBlob(text).words if word not in stopwords]# and word.isalpha()]

In [15]:
#common_words(corpus_text)
#set(common_words(corpus_text))

Saving the most common words in a Python dictionary, `keywords_dict`, will help us later to associate every **keyword** with its respective **user**: the associated value corresponds to the link weight in the newly defined **user-keywords** network.
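
Raw posts often contain links and HTML entities, which end up as tokens of little value. One possible clean-up, sketched here under the assumption that only alphabetic tokens are wanted, uses the standard `html` and `re` modules instead of TextBlob; the function name `clean_keywords` and the sample sentence are made up for this example.

```python
import html
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def clean_keywords(text, stopwords):
    """Count alphabetic, non-stopword tokens after removing URLs and entities."""
    text = html.unescape(text)         # decode HTML entities such as "&amp;"
    text = URL_PATTERN.sub(" ", text)  # drop links entirely
    # runs of letters only (accented letters included, digits and "_" excluded)
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords and len(t) > 1)


sample = "Check https://www.youtube.com/watch?v=abc &amp; listen to más jazz, JAZZ 100 times"
counts = clean_keywords(sample, stopwords={"to"})
print(counts.most_common())
```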

In [16]:
keywords_dict = {word: count for word, count in Counter(common_words(corpus_text)).most_common()}
#keywords_dict

## (B) We continue by extracting, for our specific input user, all the associated users.

In principle, this step is not strictly necessary. Since we already have the basic code to extract all the **groups** and **keywords** for any specific **user**, we could apply the same procedure to any arbitrary list of Reddit usernames. However, it makes more sense to search for **users** connected in some way to our input **user**, and we find them with an approach similar to the previous one, by retrieving our input **user**'s information. Once we obtain all the **users** associated with our input **user**, we apply to them the full procedure described in **(A)** to obtain their respective **groups** and **keywords**; with this, we have all the information needed to construct our tripartite network. Other Reddit usernames can also be added manually at any point to expand the network even more.

### (i) For any given input user, we extract from its submitted posts the users who wrote direct replies (first children) to any of them.

We save all the associated **users** in the `associated_users` list.

In [17]:
associated_users = []

In [18]:
for post in res_submitted.json()['data']['children']:
    name = post['data']['name']
    while True:
        try:
            res_name = requests.get("https://oauth.reddit.com" + "/comments" + "/" + name[3:] + "/api"
                                    + "/morechildren",
                                    headers = my_headers)
            break
        except requests.ConnectionError:
            print("ConnectionError, trying again...")
            my_headers = headers_connection_request()
    for comment in res_name.json()[1]['data']['children']:
        if 'author' in comment['data']: #some entries have no author (odd Reddit API behaviour when retrieving long posts)
            associated_users.append(comment['data']['author'])

ConnectionError, trying again...


In [19]:
#associated_users

### (ii) For the same input user, we extract from its comments the authors of the preceding comment (the parent comment or, for top-level comments, the link author).

We append all these new associated **users** to the `associated_users` list.

In [20]:
for i, post in enumerate(res_comments.json()['data']['children']): #up to 100 comments
    #print(i, "https://www.reddit.com" + post['data']['permalink'])#, post['data']['body'][:50])
    link = post['data']['link_id']
    parent = post['data']['parent_id']
    if link != parent: #if parent is not the main post
        while True:
            try:
                res_parent = requests.get("https://oauth.reddit.com" + post['data']['permalink'][:-8]
                                          + parent[3:],
                                          headers = my_headers)
                break
            except requests.ConnectionError:
                print("ConnectionError, trying again...")
                my_headers = headers_connection_request()
        for j, comment in enumerate(res_parent.json()[1]['data']['children']):
            if 'author' in comment['data']: #some entries have no author (odd Reddit API behaviour when retrieving long posts)
                #print(j, comment['data']['author'])
                associated_users.append(comment['data']['author'])
    else: #parent is the main post
        #print(post['data']['link_author'])
        associated_users.append(post['data']['link_author'])
    #print()

In [21]:
#associated_users

### (iii) For the same input user, we extract from its comments the users from all the following comments (first children).

This is tricky to do given the structure of the retrieved information: we need to define a recursive function that acts directly on the appropriate part of the retrieved JSON and returns a list of **users**. We first do it for a single comment, then for all of them. We append all these new associated **users** to the `associated_users` list.
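
To make the recursion concrete, here is a minimal sketch on a hand-built reply tree. It assumes that every entry keeps its answers under `replies` as another nested listing, and that `replies` is empty when there are none. The names `reply_authors`, `make_node` and `make_listing` are hypothetical and exist only for this example.

```python
def reply_authors(listing, depth=0, depth_limit=1):
    """Collect authors in a nested reply listing, down to `depth_limit` levels."""
    authors = []
    if depth > depth_limit:
        return authors
    for child in listing["data"]["children"]:
        data = child["data"]
        if "author" not in data:  # skip entries that carry no author
            continue
        authors.append(data["author"])
        replies = data.get("replies")
        if replies:  # assumed to be empty when there are no replies
            authors.extend(reply_authors(replies, depth + 1, depth_limit))
    return authors


def make_node(author, replies=""):
    """Build one fake comment entry."""
    return {"data": {"author": author, "replies": replies}}


def make_listing(*children):
    """Build one fake listing holding the given entries."""
    return {"data": {"children": list(children)}}


# Toy tree: one comment, two direct replies, one deeper reply
tree = make_listing(
    make_node("commenter", make_listing(
        make_node("reply_1", make_listing(make_node("deep_reply"))),
        make_node("reply_2"),
    )),
)
print(reply_authors(tree, depth_limit=1))
```

Returning a fresh list from every call avoids sharing state between calls; dropping the first element of the result removes the author of the starting comment.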

In [22]:
def recursive_in_json(subjson, i=0, depth_limit=1, lst=None): #depth_limit=1 will show only direct children from a comment
    if lst is None: #avoid sharing one mutable default list between calls
        lst = []
    for post in subjson['data']['children']:
        if i <= depth_limit:
            if 'replies' in post['data']:
                #print(post['data']['author'])
                lst.append(post['data']['author'])
                #print(i, "Name:", post['data']['name'], "Depth:", post['data']['depth'], "Body:", post['data']['body'][:50])
                #print("********************")
                if post['data']['replies']:
                    recursive_in_json(post['data']['replies'], i+1, depth_limit=depth_limit, lst=lst)
                #if 'parent_id' in post['data']:
                #    print('Parent_id:', post['data']['parent_id'])
            #print("________________________________________________________________________________________________")
            #print()
    return lst[1:]

In [23]:
# THIS EXAMPLE IS MEANT TO SHOW THE PARENTS OF A COMMENT
# IT CAN USE "comment" INSTEAD OF POST NAME AT THE 2ND TO LAST PLACE

##res_test = requests.get("https://oauth.reddit.com"
                        #+ "/r/counting/comments/plti3p/4482k_counting_thread/hcdbx5n"
                        #+ "/r/books/comments/q1sq8m/comment/hfh0glo/?utm_source=share&utm_medium=web2x&context=3"
##                        + "/r/berlin/comments/pzeryw/why_is_getting_an_anmeldung_so_hard/hf3285d/",
                        #+ "/?context=8",
##                        headers = my_headers)

In [24]:
#userstest = recursive_in_json(res_test.json()[1], lst=[]) #lst=[] is needed to call the function correctly
#userstest

In [25]:
userstestlist = []
for i, post in enumerate(res_comments.json()['data']['children']): #up to 100 comments
    #print(i, "https://www.reddit.com" + post['data']['permalink'])
    while True:
        try:
            res_test = requests.get("https://oauth.reddit.com" + post['data']['permalink'],
                                    headers = my_headers)
            break
        except requests.ConnectionError:
            print("ConnectionError, trying again...")
            my_headers = headers_connection_request()
    utl = recursive_in_json(res_test.json()[1], lst=[])
    #print(utl)
    #print()
    if utl:
        userstestlist.extend(utl)

ConnectionError, trying again...


In [26]:
#userstestlist

In [27]:
associated_users.extend(userstestlist)

We clean this list by removing repeated entries with a Python set and by deleting the input **user** and the `'[deleted]'` entries (profiles that no longer exist). Finally, we create the list `users_list` to store the result.
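
A variant of this clean-up, sketched below, keeps the number of times each associated user appeared, which could later serve as a user-user weight. This goes slightly beyond what the notebook does; the helper `clean_users` and the sample names are assumptions for illustration.

```python
from collections import Counter


def clean_users(associated_users, input_user, excluded=("[deleted]",)):
    """Return the unique associated users and how often each one appeared."""
    counts = Counter(associated_users)
    for name in (input_user, *excluded):
        counts.pop(name, None)  # no error if the name never appeared
    return list(counts), counts


raw = ["user_b", "[deleted]", "user_c", "user_b", "user_a"]
users_list, interaction_counts = clean_users(raw, "user_a")
print(users_list, dict(interaction_counts))
```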

In [28]:
users_set = set(associated_users)
users_set.discard(username) #discard() does not fail if the entry is absent
users_set.discard('[deleted]')
users_list = list(users_set)
#users_list

## The final step for the input user is to find the associated groups, keywords and users

In [29]:
username

'urbannomadberlin'

In [30]:
groups_dict

{'chile': 32,
 'berlin': 22,
 'ifyoulikeblank': 21,
 'berlinsocialclub': 16,
 'musicsuggestions': 10,
 'InternetIsBeautiful': 2,
 'worldnews': 2,
 'LesPaul': 2,
 'MusicCritique': 2,
 'mildlyinteresting': 1,
 'europe': 1,
 'Music': 1,
 'dataisbeautiful': 1,
 'movies': 1,
 'funny': 1,
 'AskReddit': 1,
 'HeadphoneAdvice': 1,
 'CryptoCurrency': 1,
 'KrakenSupport': 1}

In [31]:
keywords_dict

{'https': 31,
 'amp': 18,
 "n't": 16,
 'music': 16,
 'www.youtube.com/watch': 14,
 "'s": 13,
 'jajaja': 12,
 'one': 12,
 'rock': 12,
 'please': 11,
 'thanks': 11,
 'like': 11,
 'x200b': 11,
 'hi': 9,
 "'m": 9,
 'chile': 8,
 'si': 8,
 'looking': 8,
 '3': 7,
 'well': 7,
 'know': 7,
 'want': 7,
 'make': 7,
 'way': 7,
 'songs': 7,
 'could': 6,
 'pa': 6,
 'listen': 6,
 'bass': 6,
 'course': 6,
 'really': 6,
 'think': 6,
 'comida': 5,
 'wrote': 5,
 'love': 5,
 'berlin': 5,
 'soon': 5,
 'bands': 5,
 'days': 5,
 'jazz': 5,
 'album': 4,
 'going': 4,
 'possible': 4,
 'find': 4,
 'say': 4,
 '2': 4,
 'would': 4,
 'let': 4,
 'even': 4,
 'hahaha': 4,
 'something': 4,
 'first': 4,
 '1m5': 4,
 '1m1': 4,
 '2m2': 4,
 'different': 4,
 'sense': 4,
 'live': 4,
 'good': 4,
 'new': 4,
 'look': 4,
 'canciones': 4,
 'metal': 4,
 'unfortunately': 4,
 'bring': 4,
 'march': 4,
 'contract': 4,
 'money': 4,
 'chat': 3,
 '100': 3,
 'sólo': 3,
 'procesada': 3,
 'hace': 3,
 'alguna': 3,
 'c3': 3,
 'chucha': 3,
 'aún':

In [32]:
users_list

['Maur0',
 'puntastic_name',
 'navtaq',
 'alexb599',
 'og_han',
 'cYzzie',
 'PM-me-ur-kittenz',
 'h-u-g-o',
 'satalana',
 'kachol',
 'magallanes2010',
 'OsoGuti',
 'FunLovinMonotreme',
 'rainman_104',
 'vu67',
 'upka',
 'ancdefghi',
 'Melonemelo123',
 'Kawtcho',
 'Javiercdx',
 'Irresponsible_Tune',
 'Deneb0la',
 'AntonioZamorano58',
 'Tierrrez',
 '93WhiteStrat',
 'chicosapo',
 'musicatito',
 'PhilipJay99',
 'ivan_xd',
 'badwives',
 'flashcatcher',
 'JMG_99',
 'its_mango_time',
 'NNorAl',
 'CVirus',
 'butchYbutch_',
 'magezt',
 'DirtyProtest',
 'JaLogoJa',
 'longanizas',
 'TheRealWeedAtman',
 'Tiger_Mann',
 'ky2k',
 'raverbashing',
 'gramoun-kal',
 'rebelrebel2013',
 'Phrodo_00',
 'frenchliner',
 'Mugen_1212',
 'AutoModerator',
 'weaweonaaweonao',
 'ohravenyouneverlearn',
 'flrianjst',
 'NuQ',
 'vectorpropio',
 'Just-me-fmCR',
 'markzlz',
 'sandiaazucar',
 'maialen09',
 'Schtiglitz',
 'restoreprivacydotcom',
 'lilo910',
 'jsnaomi6',
 'patiperro_v3',
 'saproxilico',
 'CashmereShiv',
 'Co

In [33]:
len(users_list)

153

## Finding all groups and keywords for the associated users

We now automate the previous procedure to obtain the **groups** and **keywords** of every **user** in `users_list`, and save them in a Python dictionary of dictionaries.
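
Once such a dictionary of dictionaries exists, it can be flattened into the two weighted edge lists of the tripartite network. The sketch below assumes the nesting `{user: {'groups': {...}, 'keywords': {...}}}` used in this notebook; the function name `tripartite_edges` and the sample data are hypothetical.

```python
def tripartite_edges(users_dict):
    """Flatten {user: {'groups': {...}, 'keywords': {...}}} into two edge lists."""
    user_group, user_keyword = [], []
    for user, links in users_dict.items():
        user_group.extend((user, g, w) for g, w in links["groups"].items())
        user_keyword.extend((user, k, w) for k, w in links["keywords"].items())
    return user_group, user_keyword


# Made-up example with the assumed nesting
sample = {
    "user_a": {"groups": {"berlin": 2}, "keywords": {"jazz": 3, "rock": 1}},
    "user_b": {"groups": {"chile": 5, "berlin": 1}, "keywords": {}},
}
ug_edges, uk_edges = tripartite_edges(sample)
print(ug_edges)
print(uk_edges)
```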

In [34]:
def groups_keywords_dict(user, the_headers):

    posts_full_text = ""
    groups_list = []

    while True:
        try:
            res_comments = requests.get("https://oauth.reddit.com" + "/user" + "/" + user + "/comments",
                                        headers = the_headers,
                                        params = {'limit': my_limit})
            break
        except requests.ConnectionError:
            print("ConnectionError, trying again...")
            # refresh the headers actually used by the request above
            the_headers = None
            while the_headers is None:
                try:
                    # connect
                    the_headers = headers_connection_request()
                except Exception:
                    pass
    try:
        for post in res_comments.json()['data']['children']:
            posts_full_text += " " + post['data']['body']
            groups_list.append(post['data']['subreddit'])
    except:
        pass

    while True:
        try:
            res_submitted = requests.get("https://oauth.reddit.com" + "/user" + "/" + user + "/submitted",
                                         headers = the_headers,
                                         params = {'limit': my_limit})
            break
        except requests.ConnectionError:
            print("ConnectionError, trying again...")
            # refresh the headers actually used by the request above
            the_headers = None
            while the_headers is None:
                try:
                    # connect
                    the_headers = headers_connection_request()
                except Exception:
                    pass
    try:
        for post in res_submitted.json()['data']['children']:
            posts_full_text += " " + post['data']['title']
            groups_list.append(post['data']['subreddit'])
            if post['data']['selftext']:
                posts_full_text += " " + post['data']['selftext']
    except:
        pass

    groups_dict = {group: count for group, count in Counter(groups_list).most_common()}

    corpus_text = posts_full_text.lower()
    keywords_dict = {word: count for word, count in Counter(common_words(corpus_text)).most_common()}

    return groups_dict, keywords_dict

In [35]:
#my_headers = headers_connection_request()

In [36]:
final_users_groups_keywords_dict = {}
for i, user in enumerate(users_list):
    print(i, user)
    gkd = groups_keywords_dict(user, my_headers)
    final_users_groups_keywords_dict[user] = {}
    final_users_groups_keywords_dict[user]['groups'] = gkd[0]
    final_users_groups_keywords_dict[user]['keywords'] = gkd[1]

0 Maur0
1 puntastic_name
2 navtaq
ConnectionError, trying again...
3 alexb599
4 og_han
5 cYzzie
6 PM-me-ur-kittenz
7 h-u-g-o
8 satalana
9 kachol
10 magallanes2010
11 OsoGuti
12 FunLovinMonotreme
13 rainman_104
14 vu67
15 upka
16 ancdefghi
17 Melonemelo123
18 Kawtcho
19 Javiercdx
20 Irresponsible_Tune
21 Deneb0la
22 AntonioZamorano58
23 Tierrrez
24 93WhiteStrat
25 chicosapo
26 musicatito
27 PhilipJay99
28 ivan_xd
ConnectionError, trying again...
29 badwives
30 flashcatcher
31 JMG_99
32 its_mango_time
33 NNorAl
34 CVirus
35 butchYbutch_
36 magezt
37 DirtyProtest
38 JaLogoJa
39 longanizas
40 TheRealWeedAtman
41 Tiger_Mann
42 ky2k
43 raverbashing
44 gramoun-kal
45 rebelrebel2013
46 Phrodo_00
47 frenchliner
48 Mugen_1212
49 AutoModerator
50 weaweonaaweonao
51 ohravenyouneverlearn
52 flrianjst
53 NuQ
54 vectorpropio
55 Just-me-fmCR
56 markzlz
57 sandiaazucar
58 maialen09
59 Schtiglitz
60 restoreprivacydotcom
61 lilo910
62 jsnaomi6
63 patiperro_v3
64 saproxilico
65 CashmereShiv
66 Comfortably

In [39]:
final_users_groups_keywords_dict['pdonoso']

{'groups': {'chile': 102,
  'asklatinamerica': 10,
  'Music': 7,
  'trees': 4,
  'Jazz': 3,
  'bizarrebuildings': 2,
  'UserExperienceDesign': 2,
  'Showerthoughts': 2,
  'askscience': 2,
  'AskReddit': 2,
  'MusicaEnEspanol': 2,
  'WTFMusicVideos': 2,
  'woof_irl': 1,
  'Santiago': 1,
  'randomactsofmusic': 1,
  'RedditForGrownups': 1,
  'icecreamery': 1,
  'delusionalartists': 1,
  'LatinoPeopleTwitter': 1,
  'MovieDetails': 1,
  'coolguides': 1,
  'GardeningIndoors': 1,
  'toptalent': 1,
  'worldnews': 1,
  'LatinAmerica': 1,
  'firewater': 1,
  'AskHistorians': 1,
  'beatles': 1,
  'Jimi_Hendrix': 1,
  'helpmewin': 1,
  'retiredgif': 1,
  'HomeworkHelp': 1,
  'shittyaskscience': 1,
  'Lollapalooza': 1,
  'drums': 1,
  'woahdudemusic': 1,
  'gaming': 1,
  'Marijuana': 1},
 'keywords': {'si': 15,
  'really': 14,
  'gin': 10,
  'ser': 10,
  'mundo': 10,
  '’': 9,
  'love': 8,
  'creo': 8,
  'hacer': 8,
  'help': 8,
  'like': 7,
  'weon': 7,
  'good': 7,
  'get': 7,
  'cosas': 7,
  'ga