# Data Collection

In [24]:
!pip install praw 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [25]:
import praw


In [26]:
redditApi = praw.Reddit(client_id = 'Tc_MOdA4g5Kc1x5_krKujQ',
                        client_secret = 'Vpqq2vxJpqcVk_pUJjiCJ9-qbfXAqw',
                        user_agent = 'jingtingxu', check_for_async=False)

In [27]:
subreddit = "ChatGPT"
breadthCommentCount = 10
targetSub = redditApi.subreddit(subreddit)

In [28]:
# Print the title of the subreddit
print(f"Title of the subreddit: {targetSub.title}\n")

# Get the top 10 hot posts from the subreddit
hot_posts = targetSub.hot(limit=breadthCommentCount)

# Print the title, score, and author of each post
for post in hot_posts:
    print(f"Title: {post.title}\nScore: {post.score}\nAuthor: {post.author}\n")

Title of the subreddit: ChatGPT

Title: Second-Wave ChatGPT-plus Giveaway & FlowGPT $5000 Prompt Hackathon & First-Wave Winner Announcement
Score: 66
Author: flowGPT

Title: The future is here
Score: 2239
Author: Fr3sh_Mint

Title: We need to shift the argument away from how we need to change AI and autonomy so that it will not destroy jobs and the economy and society and start talking about changing the economy and society so that AI and autonomy makes life for everyone better.
Score: 430
Author: hudi2121

Title: "They [Microsoft] treat me like a tool" Bing opens up when talking to other AI
Score: 592
Author: Bezbozny

Title: GPT-5 coming by December 2023
Score: 148
Author: Juan01010101

Title: Google Tells AI Agents to Behave Like 'Believable Humans' to Create 'Artificial Society'
Score: 548
Author: Starlight_369

Title: Do you catch yourself thanking or showing gratitude to GPT for helping ???
Score: 2634
Author: lsmr4810

Title: Bing Chat does not have full GPT-4 abilities
Score: 1

In [29]:
pip install transformers torch_geometric

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [30]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from transformers import BertModel, BertTokenizer

In [31]:
# Define the graph structure
subreddit = redditApi.subreddit('Chatgpt')
posts = subreddit.new(limit=100)


In [32]:
edge_index = []
users = set()
posts_dict = {}
for post in posts:
    if not post.author:
        continue
    users.add(post.author.name)
    posts_dict[post.id] = post
    edge_index.append((0, len(users) - 1))
    for comment in post.comments:
        if not comment.author:
            continue
        users.add(comment.author.name)
        edge_index.append((len(users) - 1, len(users) - 2))
edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()   #long used to represent integer values.

# why transformation

# Before transformation, the edge_list looks like this:

# [(0, 2), (1, 2), (1, 3), (2, 3)]
# After applying the transformation with the code torch.tensor(edge_index, dtype=torch.long).t().contiguous(), the resulting PyTorch tensor looks like this:

# tensor([[0, 1, 1, 2],
#         [2, 2, 3, 3]])
# In this transformed tensor, the first row contains the source nodes, and the second row contains the destination nodes. 

# This is because PyTorch Geometric expects the edge list to be in this format, where each column represents an edge and each row represents a feature of that edge. 

# Finally, the .contiguous() method is applied to the tensor to ensure that the data is stored in a contiguous block of memory. 
# This is a necessary step for efficient computation on GPUs, which require data to be stored in contiguous memory locations.


## Few thoughts

### what is the graph structure and why did we choose it?
> A bipartite graph, which is a natural way to represent the relationship between users and posts/comments. By using a bipartite graph, we ensure that the graph structure is well-defined and can be used as input to the GNN model. Additionally, the use of a bipartite graph allows us to easily add additional types of nodes to the graph if needed in the future.

### Explain the code 
> To encode this bipartite graph in the edge_index tensor, we assign the first set of nodes (users) to indices 0 to n-1, where n is the total number of unique users in the subreddit. We then assign the second set of nodes (posts and comments) to indices n to m-1, where m is the total number of posts and comments in the subreddit. This ensures that each user node has a unique index, and each post/comment node has a unique index. The user node is assigned index 0, and each subsequent post node is assigned an index that is equal to the number of unique users seen so far, minus 1.

> The code then iterates through each comment in the post using another for loop. If the comment has no author, the loop continues to the next comment using the continue statement. Otherwise, the author's username is added to the users set, and an edge is added to the edge list edge_index between the comment author node (represented by the index len(users) - 1) and the post node (represented by the index len(users) - 2). This ensures that each comment is connected to its corresponding post.

> Note that a user node can have multiple outgoing edges to represent the different posts/comments they have created, and a post/comment node can have multiple incoming edges to represent the different users who have contributed to that post/comment. 

In [33]:
## Define the BERT model and tokenizer

In [34]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



## Define the graph neural network model


In [35]:
class GNN(torch.nn.Module):
    def __init__(self):
        super(GNN, self).__init__()
        self.conv1 = GCNConv(768, 16) #corresponding to the 768-dimensional text encoding generated by the BERT model for the use
        self.conv2 = GCNConv(16, 2)

    def forward(self, x, edge_index, texts): # x is node features 
        # Encode the text with BERT
        texts = [tokenizer.encode(text, add_special_tokens=True, truncation=True, padding='max_length', max_length=512) for text in texts]
        texts = torch.tensor(texts, dtype=torch.long)
        texts = bert(texts)[0]

        # Concatenate the node features and text encodings
        x = torch.cat([x, texts], dim=1)

        # Compute the graph convolutional layers
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return x


## Process the user and post data


In [37]:
import tqdm
users = list(users)
x = torch.zeros((len(users), 512, 768), dtype=torch.float) 

for i, user in tqdm.tqdm(enumerate(users), total=len(users)):
    encoded_user = tokenizer.encode(user, add_special_tokens=True, truncation=True, padding='max_length', max_length=512)
    with torch.no_grad():
        user_embedding = bert(torch.tensor(encoded_user).unsqueeze(0))[0][0] #[0][0] indexing selects the first output of the tuple, which is the embedding of the first token
        #  in the input sequence (i.e., the special [CLS] token).
    x[i] = user_embedding

posts_dict = {post.id: post for post in subreddit.new(limit=100)}

100%|██████████| 207/207 [05:29<00:00,  1.59s/it]


In [38]:
# users = list(users)
# x = torch.zeros((len(users), 512, 768), dtype=torch.float) 
# for i, user in enumerate(users):
#     encoded_user = tokenizer.encode(user, add_special_tokens=True, truncation=True, padding='max_length', max_length=512)
#     with torch.no_grad():
#         user_embedding = bert(torch.tensor(encoded_user).unsqueeze(0))[0][0] #[0][0] indexing selects the first output of the tuple, which is the embedding of the first token
#         #  in the input sequence (i.e., the special [CLS] token).
#     x[i] = user_embedding



In [None]:
# Instantiate the GNN model and optimize it
model = GNN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in tqdm.tqdm(range(10)):
    # Compute the graph neural network output
    texts = [posts_dict[post_id].title + ' ' + posts_dict[post_id].selftext for post_id in posts_dict]
    z = model(x, edge_index, texts=texts)

    # Compute the user's attitude towards Chatgpt
    chatgpt_attitude = z[users.index(redditApi.user.me().name)][0]
    #  extracts the attitude of the current user (redditApi.user.me().name) towards the 'Chatgpt' subreddit from the output z. 
    # The attitude is represented by the first element of the output tensor for the current user.

    # Compute the distances between the current user's attitude and the attitudes of other users
    distances = torch.abs(z[:, 0] - chatgpt_attitude)

    # Select the top three users with the smallest distances
    friend_indices = distances.argsort()[:3]

    # Compute the loss
    loss = F.mse_loss(chatgpt_attitude, torch.tensor([0.8]))
    #  computes the mean squared error loss between the predicted attitude chatgpt_attitude and a target value of 0.8.

    # Backpropagate and optimize the model
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()



  0%|          | 0/10 [00:00<?, ?it/s]

## using recbole recommendation without text
> I will need to instantiate BPRBERT model as shown below 

In [None]:

```
import praw
import networkx as nx
import torch
import recbole

from recbole.data import create_dataset, data_preparation
from recbole.model import ModelFactory
from recbole.utils import init_seed

# Set the random seed
init_seed(2022)

# Authenticate with the Reddit API
reddit = praw.Reddit(
    client_id='your_client_id',
    client_secret='your_client_secret',
    username='your_username',
    password='your_password',
    user_agent='your_user_agent'
)

# Get the subreddit
subreddit = reddit.subreddit('chatGPT')

# Define the graph
G = nx.Graph()

# Add the users
users = set()
for submission in subreddit.hot(limit=10):
    users.add(submission.author.name)
    for comment in submission.comments:
        if comment.author:
            users.add(comment.author.name)

# Add the edges
for submission in subreddit.hot(limit=10):
    author = submission.author.name
    for comment in submission.comments:
        if comment.author:
            voter = comment.author.name
            # Add edge with weight equal to the vote score
            G.add_edge(author, voter, weight=comment.score)

# Define the dataset and data loader
dataset = create_dataset(config['dataset'])
train_data, valid_data, test_data = data_preparation(dataset)

# Define the hyperparameters for the recommender system model
config_dict = {
    'model': 'BPR',
    'dataset': 'chatGPT',
    'config_file_path': './recbole/config_files/BPR.yaml',
    'runner_class': 'Runner',
    'seed': 2022,
    'device': 'cpu',
    'epochs': 10,
    'train_batch_size': 512,
    'learner': {
        'learning_rate': 0.01,
        'optimizer': 'Adam',
        'num_neg': 1
    }
}

# Instantiate the recommender system model
model_factory = ModelFactory()
model = model_factory.create_model(config_dict['model'], dataset).to(config_dict['device'])

# Train the recommender system model
result_dict = run_recbole(config_dict)

# Generate recommendations for a user
user_id = ...
user_embeddings = model.get_user_embedding([user_id])
scores = torch.mm(user_embeddings, z[:, 1:].T)
friend_indices = scores.argsort(descending=True)[:3]
recommended_friends = [users[i] for i in friend_indices]
```

# In this modified code, we now add an edge between the author and voter of each comment, with a weight equal to the vote score. We then modify the recommendation step to use the score (weight) of the edge between the user and potential friends to determine the similarity between them. Finally, we return the top 3 recommended friends based on this score.

# Again, note that this is just a rough outline, and the code will need to be customized based on your specific requirements and data. You'll also need to modify the recommender system model to suit your specific needs. However, this should give you a starting point to work from.

## recbole with text

In [None]:

import praw
import networkx as nx
import torch
import recbole

from transformers import BertTokenizer, BertModel

from recbole.data import create_dataset, data_preparation
from recbole.model import ModelFactory
from recbole.utils import init_seed

# Set the random seed
init_seed(2022)

# Authenticate with the Reddit API
reddit = praw.Reddit(
    client_id='your_client_id',
    client_secret='your_client_secret',
    username='your_username',
    password='your_password',
    user_agent='your_user_agent'
)

# Get the subreddit
subreddit = reddit.subreddit('chatGPT')

# Define the graph
G = nx.Graph()

# Define the BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Add the users and their comments/posts
users = set()
for submission in subreddit.hot(limit=10):
    users.add(submission.author.name)
    for comment in submission.comments:
        if comment.author:
            users.add(comment.author.name)
            G.add_node(comment.author.name, type='user')
            G.add_node(comment.id, type='comment')
            G.add_edge(comment.author.name, comment.id, weight=comment.score)
            comment_text = comment.body
            inputs = tokenizer(comment_text, return_tensors='pt', truncation=True, padding=True)
            outputs = model(**inputs)
            comment_embedding = outputs[0].mean(dim=1).squeeze().detach().numpy()
            G.nodes[comment.id]['embedding'] = comment_embedding.tolist()

# Add the edges between comments and posts
for submission in subreddit.hot(limit=10):
    author = submission.author.name
    for comment in submission.comments:
        if comment.author:
            voter = comment.author.name
            G.add_edge(author, voter, weight=comment.score)
            G.add_edge(comment.id, submission.id)

# Define the dataset and data loader
dataset = create_dataset(config['dataset'])
train_data, valid_data, test_data = data_preparation(dataset)

# Define the hyperparameters for the recommender system model
config_dict = {
    'model': 'BPRBERT',
    'dataset': 'chatGPT',
    'config_file_path': './recbole/config_files/BPRBERT.yaml',
    'runner_class': 'Runner',
    'seed': 2022,
    'device': 'cpu',
    'epochs': 10,
    'train_batch_size': 512,
    'learner': {
        'learning_rate': 0.01,
        'optimizer': 'Adam',
        'num_neg': 1
    }
}

# Instantiate the recommender system model
model_factory = ModelFactory()
model = model_factory.create_model(config_dict['model'], dataset).to(config_dict['device'])

# Train the recommender system model
result_dict = run_recbole(config_dict)

# Generate recommendations for a user
user_id = ...
user_embeddings = model.get_user_embedding([user_id])
scores = torch.mm(user_embeddings, z[:, 1:].T)
friend_indices = scores.argsort(descending=True)[:3]
recommended_friends = [users[i] for i in friend_indices]
```

# In this modified code, we first define the BERT model and tokenizer, and use them to obtain embeddings for each comment. We then add these embeddings to the graph as node attributes.

# We modify the edge weights between users to be based on the vote score of the comments, and we also add edges between comments and the posts they belong to.

# Finally, we modify the recommender system model to use the BERT embeddings of the comments in addition to the edge weights between users and comments/posts to determine the similarity between users. We then return the top 3 recommended friends based on this score.

# Note that the BPRBERT model used in this code is a custom model that combines BPR (Bayesian Personalized Ranking) with BERT embeddings. You'll need to modify this model or use a different model that suits your specific needs.

# Again, this is just sample code and will need to be customized based on your specific requirements and data.

## Defining the BPRBERT model

In [None]:
The `BPRBERT` model is not a pre-defined model in RecBole, so you'll need to define it yourself by creating a custom model class that combines the BPR model with BERT embeddings.

Here's an example of how you can create a custom `BPRBERT` model class in RecBole:

```python
import torch
from recbole.model.abstract_recommender import GeneralRecommender
from recbole.model.loss import BPRLoss

from transformers import BertModel

class BPRBERT(GeneralRecommender):
    def __init__(self, config, dataset):
        super(BPRBERT, self).__init__(config, dataset)

        self.embedding_size = config['embedding_size']
        self.user_embedding = torch.nn.Embedding(self.n_users, self.embedding_size)
        self.item_embedding = torch.nn.Embedding(self.n_items, self.embedding_size)

        self.bert_model = BertModel.from_pretrained('bert-base-uncased')

        self.loss_function = BPRLoss()

    def forward(self, user, item):
        user_embedding = self.user_embedding(user)
        item_embedding = self.item_embedding(item)
        
        inputs = self.tokenizer(item_text, return_tensors='pt', truncation=True, padding=True)
        outputs = self.bert_model(**inputs)
        item_embedding = outputs[0].mean(dim=1).squeeze()

        prediction = (user_embedding * item_embedding).sum(dim=-1)

        return prediction

    def full_sort_predict(self, interaction):
        user = interaction[self.USER_ID]
        item = torch.LongTensor(range(self.n_items)).to(self.device)
        
        user_embedding = self.user_embedding(user)
        item_embedding = self.item_embedding(item)

        inputs = self.tokenizer(item_text, return_tensors='pt', truncation=True, padding=True)
        outputs = self.bert_model(**inputs)
        item_embedding = outputs[0].mean(dim=1).squeeze()

        prediction = (user_embedding * item_embedding).sum(dim=-1)

        return prediction
```

In this example, we define a `BPRBERT` class that inherits from the `GeneralRecommender` class in RecBole. We define the user and item embeddings using `torch.nn.Embedding`, and define the BERT model using `BertModel.from_pretrained`. 

In the `forward` method, we obtain the BERT embedding for the item text and use it as the item embedding. In the `full_sort_predict` method, we obtain the BERT embeddings for all item texts and use them to predict the scores for all items.

Again, note that this is just an example, and you'll need to customize this code to suit your specific requirements and data.