# Week 3. Day 1. Exercises from Chapter 7 of FSStDS. 
## Fundamentals of Social Data Science. MT 2022

Within your NEW study pod discuss the following questions. Please submit an individual assignment by 12:30pm Tuesday, October 25, 2022 on Canvas. 

# Exercise 1. How busy is Twitter now?

Using the `counts` endpoint, compare the chatter around three British politicians: Penny Mordaunt, Rishi Sunak, and Boris Johnson. Use the name of each politician in a separate query:
1. Which politician had the most mentions overall?
2. For each politician, what was the hour of peak mentions? Was it the same for each person? 

In [2]:
import requests
import json
import pandas as pd

def read_json(filepath: str) -> dict:
    with open(filepath, 'r') as f:
        return json.load(f)

# authentication
config = read_json("../twitter_config.json")
headers = {"Authorization": f"Bearer {config['bearer_token']}"}

In [1]:
# Answer 1 below here


POLITICIAN_DICT = {
    "Boris Johnson": "@BorisJohnson",
    "Penny Mordaunt": "@PennyMordaunt",
    "Rishi Sunak": "@RishiSunak",
}

count_url = "https://api.twitter.com/2/tweets/counts/recent"


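def create_politician_query(pol_name: str) -> str:
    # A space acts as AND in Twitter queries, so this matches tweets containing both the handle and the name
    return f"{POLITICIAN_DICT[pol_name]} {pol_name}"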

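def get_tweet_count(pol_name: str) -> dict:
    """Return the full JSON payload from the counts endpoint (hourly buckets plus totals)."""
    query = create_politician_query(pol_name)
    params = {"query": query}
    response = requests.get(count_url, headers=headers, params=params)
    response.raise_for_status()  # fail loudly on auth or rate-limit errors
    return response.json()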

pol_results = {pol_name: get_tweet_count(pol_name) for pol_name in POLITICIAN_DICT}


# Find the one with most mentions overall
max_pol = max(pol_results, key=lambda x: pol_results[x]["meta"]["total_tweet_count"])
print(f"{max_pol} has the most mentions overall with {pol_results[max_pol]['meta']['total_tweet_count']} tweets")
# Find hour of peak mentions #
# Normalize data into dataframe
df = pd.concat([pd.json_normalize(pol_results[pol_name]["data"]).assign(name=pol_name) for pol_name in pol_results])
df["end"] = pd.to_datetime(df["end"])
df["start"] = pd.to_datetime(df["start"])
df["hour"] = df["start"].dt.hour
df["tweet_count"] = df["tweet_count"].astype(int)
# Top three hours of the day per politician, ranked by mean hourly tweet count
print(df.groupby(["name", "hour"])["tweet_count"].mean().sort_values(ascending=False).groupby("name").head(3))

# Answer 1 above here

Rishi Sunak has the most mentions overall with 39671 tweets
name            hour
Rishi Sunak     13      687.571429
                14      645.000000
                15      483.714286
Boris Johnson   20      146.857143
                18      127.857143
                19      124.000000
Penny Mordaunt  15       39.571429
                16       32.571429
                11       30.571429
Name: tweet_count, dtype: float64
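
The table above ranks hours of the day by their mean count. If the question is read as asking for the single busiest hour (date and time) per politician, `groupby(...).idxmax()` is one way to get it. The sketch below uses a small made-up frame (`toy`, with placeholder names "A" and "B") in the same shape as `df`; the counts are invented for illustration. Note that `df` as built with `pd.concat` has repeated index labels, so it would need `reset_index(drop=True)` before the same lookup works on it.

```python
import pandas as pd

# Hypothetical hourly counts in the same shape as df above (name, start, tweet_count)
toy = pd.DataFrame(
    {
        "name": ["A", "A", "A", "B", "B", "B"],
        "start": pd.to_datetime(
            [
                "2022-10-20T13:00:00Z",
                "2022-10-20T14:00:00Z",
                "2022-10-21T13:00:00Z",
                "2022-10-20T19:00:00Z",
                "2022-10-20T20:00:00Z",
                "2022-10-21T20:00:00Z",
            ]
        ),
        "tweet_count": [700, 500, 300, 100, 150, 90],
    }
)

# idxmax gives, for each name, the index label of the row with the largest count
peak_rows = toy.loc[toy.groupby("name")["tweet_count"].idxmax()]

# Series mapping each name to the start time of its busiest hour
peak_hours = peak_rows.set_index("name")["start"]
print(peak_hours)
```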


# Exercise 2. Tweets across the world

Select a query to download 100 tweets. Include the location entities. Can you find a query where no single country makes up more than 50% of the locations? How can you automate this using a bank of queries? 

**UPDATE**: WUHUUU, I got academic access so now everything works! Shout out to Khyati for figuring out the magic passwords 

**NOTE**: This was a total pain to do, and I have wept many tears over not getting this to work. Summa summarum: it is possible to filter with the `has:geo` query operator on the premium API. For us mere mortals waiting for recognition from the Twitter gods, there is not much to do, as very few (~1 in 100) of the tweets actually have a geo-tag. However, if you masters of SDS figure something out for the free tier, I'm all ears!

In [3]:
# Answer 2 below here 
search_url = "https://api.twitter.com/2/tweets/search/recent"
international_query = "football has:geo"
params = {
    "query": international_query,
    "tweet.fields": "text,author_id,created_at,geo",
    "expansions": "geo.place_id",
    "place.fields": "id,country_code",
    "max_results": 100,
}
response = requests.get(search_url, headers=headers, params=params)


# Answer 2 above here

In [6]:
locations = pd.json_normalize(response.json()["includes"]["places"])
tweets = pd.json_normalize(response.json()["data"])
country_counts = pd.merge(tweets, locations, left_on="geo.place_id", right_on="id")[["text", "full_name", "country_code"]].value_counts(subset=["country_code"])
country_frequencies = country_counts / country_counts.sum()
country_frequencies.head()

country_code
US              0.46
GB              0.17
SA              0.12
IN              0.03
BR              0.03
dtype: float64
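
For the last part of the question (automating the search over a bank of queries), here is a minimal sketch of one possible approach. The names `country_shares`, `first_balanced_query` and `fake_fetch` are made up for illustration, and the canned country codes are invented, not real API results. In practice `fetch_codes` would wrap the `requests.get` call above and return the `country_code` of each geo-tagged tweet.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional


def country_shares(country_codes: List[str]) -> Dict[str, float]:
    """Share of tweets per country code."""
    counts = Counter(country_codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def first_balanced_query(
    query_bank: List[str],
    fetch_codes: Callable[[str], List[str]],
    max_share: float = 0.5,
) -> Optional[str]:
    """Return the first query whose largest country share is at most max_share."""
    for query in query_bank:
        codes = fetch_codes(query)
        if not codes:
            continue  # no geo-tagged tweets came back for this query
        if max(country_shares(codes).values()) <= max_share:
            return query
    return None


def fake_fetch(query: str) -> List[str]:
    """Stand-in for the API call (assumption: returns one country code per tweet)."""
    canned = {
        "nfl has:geo": ["US"] * 8 + ["GB"] * 2,
        "football has:geo": ["US"] * 4 + ["GB"] * 3 + ["SA"] * 3,
    }
    return canned.get(query, [])


print(first_balanced_query(["nfl has:geo", "football has:geo"], fake_fetch))
```

Passing the fetch step in as a function keeps the 50% check separate from the network call, so the check can be tried out without spending API quota.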

# Exercise 3. Comparing `praw` and `psaw`

Reddit mods have considerable power over their subreddits. In /r/Science they are notoriously aggressive in their deletion of non-scientific comments. 

PushShift is an archive of Reddit data. You can access this archive using `psaw`, which is otherwise very similar to `praw`. Go to /r/science and select a story that is over two days old. Compare the comments that you receive from `praw` and `psaw`. Do they have similar: 
- counts of comments?
- counts of comments marked deleted?
- upvote scores on the comments? 

Posit what might account for the difference. Reflect on how this could intervene in making claims about activity on Reddit.

In [4]:
# Exercise 3 code part below here
import praw
import psaw

post_id = "y2439s"
reddit_config = read_json("../reddit_config.json")
reddit = praw.Reddit(
    client_id=reddit_config["client_id"],
    client_secret=reddit_config["client_secret"],
    user_agent="SDS Scraper Jonathan",
)

# get number of comments (PRAW)
submission = reddit.submission(id=post_id)
print(f"Number of comments (PRAW): {submission.num_comments}")

# get number of comments (PSAW)
api = psaw.PushshiftAPI()
# search_submissions yields submissions, so take the first (and only) match
psaw_submission = next(api.search_submissions(ids=[post_id]))
num_comments = psaw_submission.num_comments
print(f"Number of comments (PSAW): {num_comments}")

# Exercise 3 code part above here

Number of comments (PRAW): 2796
Number of comments (PSAW): 1


## Exercise 3 reflections below here 
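PSAW seems to be way, way, WAY out of date: it reports 1 comment where PRAW reports 2796, presumably because PushShift archived the submission shortly after it was posted and never refreshed it. I honestly don't know why anyone would ever use it for any dynamic data (upvotes, comments etc.). D- from here!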

## Exercise 3 reflections above here
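
The exercise also asks about deleted comments and scores, which the code above does not reach. A rough sketch of how the two sources could be compared once the comments are downloaded: the helper `summarise_comments` is hypothetical, and it assumes each comment has been turned into a dict with `body` and `score` keys and that deleted or removed comments carry the body `[deleted]` or `[removed]`. The toy comments are invented.

```python
from statistics import mean
from typing import Dict, List


def summarise_comments(comments: List[dict]) -> Dict[str, float]:
    """Count comments, count deleted/removed ones, and average the scores."""
    deleted_markers = {"[deleted]", "[removed]"}  # assumption about how deletions appear
    scores = [c["score"] for c in comments]
    return {
        "n_comments": len(comments),
        "n_deleted": sum(c["body"] in deleted_markers for c in comments),
        "mean_score": mean(scores) if scores else float("nan"),
    }


# Made-up comments standing in for what each library might return
praw_like = [
    {"body": "[removed]", "score": 1},
    {"body": "Nice study", "score": 10},
    {"body": "[deleted]", "score": -2},
]
psaw_like = [
    {"body": "Off-topic joke", "score": 1},
    {"body": "Nice study", "score": 1},
]

print(summarise_comments(praw_like))
print(summarise_comments(psaw_like))
```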

# Exercise 4 - Comparing comments 
(in case PushShift isn't working for you)

People can submit the same link to multiple subreddits. You can find out where else a link was submitted on Reddit by xx. 

Select a story that has been submitted to at least two subreddits. Compare the comments for each of these:
- Which subreddit has more comments? Would this have been unexpected? Why?
- Is there any overlap in the users who comment on these stories across the multiple subreddits?
- Summarise the scores of the comments versus the story (`data.ups`). Is the ratio of the top comment score to the ups the same? 
- Which story has more comments that have a score below zero? As a percentage? 


In [5]:
# Exercise 4 below here
link1 = "y6rsz2"
link2 = "y6rubo"

# which has more comments?
submission1 = reddit.submission(id=link1)
submission2 = reddit.submission(id=link2)
print(f"Number of comments for {link1} (subreddit: {submission1.subreddit.display_name}): {submission1.num_comments}")
print(f"Number of comments for {link2} (subreddit: {submission2.subreddit.display_name}): {submission2.num_comments}")

# Exercise 4 above here

Number of comments for y6rsz2 (subreddit: FoodPorn): 219
Number of comments for y6rubo (subreddit: food): 682


In [6]:
# Drop MoreComments placeholders in both threads so only real top-level comments remain
submission1.comments.replace_more(limit=0)
submission2.comments.replace_more(limit=0)
# Comments from deleted accounts have author == None, so skip them in both sets
sub1_commentors = {comment.author.name for comment in submission1.comments if comment.author is not None}
sub2_commentors = {comment.author.name for comment in submission2.comments if comment.author is not None}

comment_intersection = sub1_commentors.intersection(sub2_commentors)
intersection_size = len(comment_intersection)
print(f"Number of commentors in both subreddits: {intersection_size}")
print(f"Commentors in both subreddits: {comment_intersection}")
print(f"OP of both posts: {submission1.author.name}")


Number of commentors in both subreddits: 1
Commentors in both subreddits: {'AlbionReturns'}
OP of both posts: Jackpot09


In [7]:
def get_top_comment_score(submission: praw.models.Submission) -> int:
    return max(submission.comments, key=lambda x: x.score).score

top_comment_sub1 = get_top_comment_score(submission1)
top_comment_sub2 = get_top_comment_score(submission2)
top_ratio_sub1 = top_comment_sub1 / submission1.score
top_ratio_sub2 = top_comment_sub2 / submission2.score
print(f"Top comment score for {link1} (subreddit: {submission1.subreddit.display_name}): {top_comment_sub1} ({top_ratio_sub1:.2%})")
print(f"Top comment score for {link2} (subreddit: {submission2.subreddit.display_name}): {top_comment_sub2} ({top_ratio_sub2:.2%})")




Top comment score for y6rsz2 (subreddit: FoodPorn): 289 (3.78%)
Top comment score for y6rubo (subreddit: food): 1818 (13.77%)


In [8]:
## BELOW ZERO ##
from typing import List
def get_negative_score_comments(submission: praw.models.Submission) -> List[praw.models.Comment]:
    return [comment for comment in submission.comments if comment.score < 0]

num_neg_comments_sub1 = len(get_negative_score_comments(submission1))
num_neg_comments_sub2 = len(get_negative_score_comments(submission2))
print(f"Number of negative score comments for {link1} (subreddit: {submission1.subreddit.display_name}): {num_neg_comments_sub1}")
print(f"Number of negative score comments for {link2} (subreddit: {submission2.subreddit.display_name}): {num_neg_comments_sub2}")

Number of negative score comments for y6rsz2 (subreddit: FoodPorn): 13
Number of negative score comments for y6rubo (subreddit: food): 26
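
The question also asks for the negative comments as a percentage. A minimal sketch of that step, assuming the scores have already been pulled into a plain list (for example `[c.score for c in submission.comments]`); `negative_share` is a made-up helper and `toy_scores` is invented data, not taken from either thread.

```python
from typing import List


def negative_share(scores: List[int]) -> float:
    """Fraction of comments whose score is below zero."""
    if not scores:
        return 0.0  # avoid dividing by zero for an empty thread
    return sum(score < 0 for score in scores) / len(scores)


toy_scores = [5, -1, 0, 12, -3, 2, 1, 7]
print(f"{negative_share(toy_scores):.1%}")  # prints 25.0%
```

Dividing by the number of comments that were actually iterated over (top-level only in the code above) rather than by `num_comments` keeps numerator and denominator consistent.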


### General notes
It seems that the more popular submission (the one in r/food) has more extreme values across the board. This makes sense from a statistical perspective, as the maximum and minimum of a larger sample tend to be more extreme. It also shows the advantages of x-posting when farming karma :))
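
The statistical point can be checked with a quick simulation. This is only a sketch: it assumes scores behave like independent draws from a standard normal distribution, which real comment scores certainly do not, and `average_maximum` is a made-up helper.

```python
import random
from statistics import mean


def average_maximum(sample_size: int, n_trials: int = 500, seed: int = 0) -> float:
    """Average, over n_trials samples, of the largest of sample_size normal draws."""
    rng = random.Random(seed)
    return mean(
        max(rng.gauss(0, 1) for _ in range(sample_size))
        for _ in range(n_trials)
    )


small = average_maximum(sample_size=20)
large = average_maximum(sample_size=200)
print(f"Average maximum of 20 draws:  {small:.2f}")
print(f"Average maximum of 200 draws: {large:.2f}")
```

The larger samples should show the higher average maximum, even though every draw comes from the same distribution.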