# Week 2. Day 2. Exercises from Chapter 5 of FSStDS. 
## Fundamentals of Social Data Science. MT 2022

Within your study pod, discuss the following questions. Please submit an individual assignment on Canvas by 12:30pm Wednesday, October 18, 2022.

In [1]:
import pandas as pd 
import json
from pathlib import Path
from typing import Optional, List, Dict

# Exercise 1. Twitter merging 

I have provided two tables: `dalle2_oct18_2022_tweets.json` and `dalle2_oct18_2022_users.json`. You can see how these tweets were collected in the Appendix to this assignment. It's a simple pull of only 100 tweets; continuing the pull would require paging (a topic for another day). For now, let's focus on merging. Please merge these two tables.

Some tips: 
- Ensure that you keep all the tweets.
- Ensure that column names which might overlap (hint: `id`) are given descriptive suffixes.
- Your resulting df should still have 100 rows. 
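
As an illustration of these tips, here is a minimal sketch on made-up toy tables (the values and the names `tweets`, `users`, and `merged` are assumptions, not the assignment data). It uses a left join so every tweet is kept, adds suffixes for the overlapping `id` column, and asks pandas to check that each user appears only once via `validate`.

```python
import pandas as pd

# Toy stand-ins for the tweet and user tables (all values are made up).
tweets = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "author_id": ["u1", "u1", "u2"],
    "text": ["first", "second", "third"],
})
users = pd.DataFrame({
    "id": ["u1", "u2"],
    "username": ["alice", "bob"],
})

merged = pd.merge(
    tweets,
    users,
    left_on="author_id",
    right_on="id",
    how="left",                    # keep every tweet, matched or not
    suffixes=("_tweet", "_user"),  # disambiguate the two `id` columns
    validate="many_to_one",        # raise if a user id is duplicated in `users`
)

# A left join against unique user ids must preserve the number of tweets.
assert len(merged) == len(tweets)
print(merged.columns.tolist())
```

The `validate` argument is optional, but it turns a silent row explosion (duplicate keys on the user side) into an immediate error.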

In [2]:
# Exercise 1 below here 

def read_tweets(file_path: Path) -> pd.DataFrame:
    """Reads a json file to pandas dataframe"""
    with open(file_path, "r") as f:
        data = json.load(f)
    return pd.json_normalize(data)


tweet_df = read_tweets(next(Path("../data").glob("*tweets.json")))
users_df = read_tweets(next(Path("../data").glob("*users.json")))
merge_df = pd.merge(tweet_df, users_df, left_on="author_id", right_on="id", how="left", suffixes=("_tweet", "_user"))

print(len(tweet_df),len(users_df),len(merge_df))
# Should be 100 79 100

100 79 100


# Exercise 2. Twitter analytics 

Split the data into two groups and compare them:
- Separate the tweets by whether the author has more than 1000 followers or not.
- Which group has more tweets, and which has _proportionately_ more @mentions in its tweets?

> Note: Getting the @mentions can be done the cheap and easy way (search for the @ symbol) or the more robust but slightly harder way (look in the `entities.mentions` column and wrangle the dictionary).
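
To make the difference between the two approaches concrete, here is a small sketch on made-up rows. The shape of the `entities.mentions` values (a list of dictionaries with a `username` key, or a missing value when there are no mentions) is an assumption for illustration.

```python
import pandas as pd

# Toy rows; the text and the mention dictionaries are made up.
df = pd.DataFrame({
    "text": ["hello @alice and @bob", "no mentions here", "mail me at a@b.com"],
    "entities.mentions": [
        [{"username": "alice"}, {"username": "bob"}],
        None,
        None,
    ],
})

# Cheap: any "@" in the text counts, so the e-mail address is a false positive.
df["has_at"] = df["text"].str.contains("@", regex=False)

# More robust: count the dictionaries in entities.mentions;
# anything that is not a list (None/NaN) counts as zero mentions.
df["n_mentions"] = df["entities.mentions"].map(
    lambda m: len(m) if isinstance(m, list) else 0
)
df["has_mention"] = df["n_mentions"] > 0

print(df[["text", "has_at", "n_mentions", "has_mention"]])
```

The third row shows why the cheap approach can overcount: it contains an `@` but no actual mention.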

In [3]:
# Exercise 2. Answer below here
over1k = merge_df["public_metrics.followers_count"] > 1000
over1k_pctmention = merge_df.loc[over1k, "entities.mentions"].notna().mean()
under1k_pctmention = merge_df.loc[~over1k, "entities.mentions"].notna().mean()

print(f"The percentage of tweets from those with over 1k followers that have mentions is {over1k_pctmention:0.1%} "
      f"The percentage of tweets from those with under 1k followers that have mentions is {under1k_pctmention:0.1%}")

# Should be 29 for over1k and 71 for under1k
# And therefore should be 24.1% and 11.3% respectively.

The percentage of tweets from those with over 1k followers that have mentions is 24.1% The percentage of tweets from those with under 1k followers that have mentions is 11.3%


# Exercise 3. Grouping the data

Group the data by Author and build a table that reports the max, min, and average for both  `public_metrics.retweet_count` and `public_metrics.like_count`. 

In [4]:
# Exercise 3. Answer below here
merge_df.groupby("author_id")[["public_metrics.retweet_count", "public_metrics.like_count"]].agg(["max", "min", "mean"])
# Should be one line

                     public_metrics.retweet_count     public_metrics.like_count
author_id                 max       min      mean       max       min      mean
1075759307659063296         1         1       1.0         4         4       4.0
11085112                    0         0       0.0         0         0       0.0
1246388877713178624         1         1       1.0         1         1       1.0
1256547797567930370         4         4       4.0        10        10      10.0
1318265150059708416         0         0       0.0         1         1       1.0
...                       ...       ...       ...       ...       ...       ...
780746497017032704          0         0       0.0         2         0       1.0
782162190518263808          0         0       0.0         5         5       5.0
814009956261203968          1         1       1.0         4         4       4.0
924431378                   0         0       0.0         3         2       2.5
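
The table above has two header levels (metric, then statistic), which can be awkward to index or export. One possible way to flatten them into single names is sketched below on made-up data; the naming scheme (`like_count_max` and so on) is just an illustrative choice.

```python
import pandas as pd

# Toy data with the same column names as the merged table (values are made up).
toy = pd.DataFrame({
    "author_id": ["a", "a", "b"],
    "public_metrics.retweet_count": [0, 2, 1],
    "public_metrics.like_count": [3, 5, 4],
})

summary = toy.groupby("author_id")[
    ["public_metrics.retweet_count", "public_metrics.like_count"]
].agg(["max", "min", "mean"])

# Join the two header levels into one name,
# e.g. ("public_metrics.like_count", "max") becomes "like_count_max".
summary.columns = [
    f"{metric.split('.')[-1]}_{stat}" for metric, stat in summary.columns
]

print(summary)
```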


# Exercise 4. Twitter Reshaping

Create a long `DataFrame` of tweet_ids, author_ids, and hash_tags. That is, one row per hashtag rather than one per tweet. Report the length of this `DataFrame` and the `value_counts()` of the top 10 hashtags.

In [6]:


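def get_taglist(tag_list_full: Optional[List[Dict[str, str]]]) -> List[str]:
    """Return the hashtag strings from one tweet's entities.hashtags value."""
    # Tweets without hashtags hold a missing value rather than a list
    if not isinstance(tag_list_full, list):
        return []
    return [x['tag'] for x in tag_list_full]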

new_df = merge_df.copy()
new_df['tags'] = new_df['entities.hashtags'].map(get_taglist)

In [7]:
# Keep author_id as the exercise asks; username is kept for readability
long_df = new_df[['id_tweet', 'author_id', 'username', 'tags']].explode('tags')

print(f"The length of the exploded data by hashtag is {len(long_df)}")
print("The top ten hashtags are as follows:",
       long_df['tags'].value_counts()[:10],
      sep="\n")

The length of the exploded data by hashtag is 608
The top ten hashtags are as follows:
dalle2             77
aiart              25
ai                 25
dalle              23
stablediffusion    22
midjourney         22
digitalart         20
AIart              14
aiartist           13
aiartcommunity     12
Name: tags, dtype: int64
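
In the counts above, `aiart` and `AIart` are tallied separately because `value_counts()` is case-sensitive. If case should not matter for the analysis, one option is to lowercase the tags before counting. A minimal sketch on made-up tags:

```python
import pandas as pd

# Toy hashtag column with mixed casing (values are made up).
tags = pd.Series(["aiart", "AIart", "dalle2", "DALLE2", "dalle2"], name="tags")

# Lowercase first so that differently cased spellings fall into one bucket.
counts = tags.str.lower().value_counts()

print(counts)
```

Whether to normalise case is a judgment call: it merges spelling variants, but it also discards any signal carried by the capitalisation itself.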


# Appendix: How I pre-processed the data (See Chapter 7) 




In [8]:
import os
import requests
import dotenv

ENV_PATH = f"..{os.sep}.env"
dotenv.load_dotenv(ENV_PATH) # This will refresh the environment variables
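token = os.environ.get('TWITTER_BEARER_TOKEN')
# Print only the token's length (never the token itself) to confirm it loaded
print(len(token) if token else "TWITTER_BEARER_TOKEN is not set; check the .env file")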

In [None]:
URL = "https://api.twitter.com/2/tweets/search/all"

BEARER = os.environ["TWITTER_BEARER_TOKEN"]
headers = {"Authorization": f"Bearer {BEARER}"}

QUERY = "(dalle2) -is:retweet"
MAX_RESULTS = 100 

params={"query": QUERY,
        "max_results":MAX_RESULTS}

params['expansions'] = "author_id,geo.place_id"
params['tweet.fields'] = "entities,public_metrics"
params['user.fields'] = "id,username,name,description,public_metrics"
params['place.fields'] = "id,country,country_code,full_name"

response = requests.get(URL, headers=headers, params=params)

assert response.status_code == 200, \
    f"Code {response.status_code}. See error: {response.json()}"

tweets = response.json()
print(tweets.keys())

dict_keys(['data', 'includes', 'meta'])
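
The `meta` key in the response is where paging information would live. The sketch below shows one way paging could work, assuming the response carries a `meta.next_token` value and that the endpoint accepts it back as a `next_token` parameter (my understanding of the v2 search endpoints; please check the API documentation). The helper `iter_pages` and the stand-in `fake_fetch` are hypothetical names, and no network call is made here.

```python
from typing import Callable, Dict, Iterator


def iter_pages(
    fetch: Callable[[Dict], Dict], params: Dict, max_pages: int = 5
) -> Iterator[Dict]:
    """Yield successive response payloads by following meta.next_token."""
    page_params = dict(params)  # copy so the caller's dict is not mutated
    for _ in range(max_pages):
        payload = fetch(page_params)
        yield payload
        token = payload.get("meta", {}).get("next_token")
        if token is None:
            break  # no further pages
        page_params["next_token"] = token  # assumed parameter name


def fake_fetch(p: Dict) -> Dict:
    """Stand-in for the API: two made-up pages keyed by the incoming token."""
    pages = {
        None: {"data": [1, 2], "meta": {"next_token": "abc"}},
        "abc": {"data": [3], "meta": {}},
    }
    return pages[p.get("next_token")]


all_rows = [
    row
    for page in iter_pages(fake_fetch, {"query": "(dalle2) -is:retweet"})
    for row in page["data"]
]
print(all_rows)
```

With the real API, `fetch` could be a small wrapper such as `lambda p: requests.get(URL, headers=headers, params=p).json()`. The `max_pages` cap is there to stay clear of rate limits while experimenting.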


In [None]:
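import json

# Use context managers so each file is flushed and closed after writing
with open("dalle2_oct18_2022_tweets.json", 'w') as f:
    json.dump(tweets['data'], f)

with open("dalle2_oct18_2022_users.json", 'w') as f:
    json.dump(tweets['includes']['users'], f)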