# Community Archive Analysis Quickstart

[Community Archive](https://www.community-archive.org/) is an open database of tweets volunteered by users. In this notebook we'll look at how to:

1. Fetch all tweets by a specific user
2. Run some basic analysis, like finding the most liked tweets, and most common words & phrases
3. Plot a graph of their account's growth (likes and retweets over time)

## Step 1 - fetch the data

The [API docs](https://github.com/TheExGenesis/community-archive/blob/main/docs/api-doc.md) describe two ways to get the data:

1. Querying the database through the Supabase API
2. Downloading individual user data as JSON from object storage

For simplicity, we'll fetch an individual user's JSON file.

In [3]:
import requests
import json
from tqdm.notebook import tqdm
import os

username = ''
url = ''  # URL of this user's archive JSON in object storage (see the API docs)

# Helper function to download the data and display a progress bar
def downloadUserData(username):
  output_filename = f'{username}.json'
  if (os.path.exists(output_filename)):
    print(f"{output_filename} already exists, skipping")
    return

  print("Downloading tweet data for:", username)
  response = requests.get(url, stream=True)
  response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error page
  total_size = int(response.headers.get('content-length', 0))
  progress_bar = tqdm(total=total_size, unit='B', unit_scale=True, desc="Downloading JSON")

  # Download and save the file
  json_data = bytearray()
  for chunk in response.iter_content(1024):  # Download in chunks of 1 KB
      if chunk:
          json_data.extend(chunk)  # Accumulate each binary chunk
          progress_bar.update(len(chunk))  # Update the progress bar

  # Close the progress bar
  progress_bar.close()
  data = json.loads(json_data.decode('utf-8', errors='ignore'))  # Decode with error handling

  print(f"Writing to file: {username}.json")
  with open(f'{username}.json', 'w') as f:
      json.dump(data, f)

# downloadUserData(username)

The downloaded archive contains all tweets, as well as metadata about the user, like their profile picture, account bio, list of people they follow, and list of followers.

Let's print the attributes that are inside the archive.

In [4]:
username = 'defenderofbasic'
with open(f'../frontend/public/archives/{username}.json', 'r') as f:
    data = json.load(f)

print('All attributes:\n', list(data.keys())) # print all available attributes
tweet = data['tweets'][0]['tweet']
print('Tweet attributes:\n', list(tweet.keys()))

profile_data = data['profile'][0]['profile']
print('bio:\n', profile_data['description']['bio'])

All attributes:
 ['upload-options', 'profile', 'account', 'tweets', 'community-tweet', 'follower', 'following', 'note-tweet', 'like']
Tweet attributes:
 ['edit_info', 'retweeted', 'source', 'entities', 'display_text_range', 'favorite_count', 'in_reply_to_status_id_str', 'id_str', 'in_reply_to_user_id', 'truncated', 'retweet_count', 'id', 'in_reply_to_status_id', 'possibly_sensitive', 'created_at', 'favorited', 'full_text', 'lang', 'in_reply_to_screen_name', 'in_reply_to_user_id_str']
bio:
 Self funding open source memetic research. See research statement & support my work: https://t.co/zDlct7w2nZ
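
The two lookups above, `data['tweets'][0]['tweet']` and `data['profile'][0]['profile']`, suggest a pattern: each top-level attribute holds a list, and each list item wraps its payload in a single-key dict. Here is a minimal sketch, assuming that pattern also holds for the other attributes (not verified here), that counts the items under each attribute. `summarize_archive` and `sample_archive` are hypothetical names used only for illustration; with a real archive you would pass in `data` instead.

```python
def summarize_archive(archive):
    """Return {attribute: item count} for every list-valued top-level key."""
    return {
        key: len(value)
        for key, value in archive.items()
        if isinstance(value, list)
    }

# Hypothetical miniature archive that mimics the nesting seen above.
sample_archive = {
    'profile': [{'profile': {'description': {'bio': 'example bio'}}}],
    'tweets': [
        {'tweet': {'id': '1', 'full_text': 'hello'}},
        {'tweet': {'id': '2', 'full_text': 'world'}},
    ],
    'follower': [],
}

summary = summarize_archive(sample_archive)
for key, count in summary.items():
    print(f"{key}: {count} item(s)")
```

Running this on a freshly downloaded archive is a quick sanity check that the file loaded completely before doing any analysis.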


## Step 2 - find the top tweets & most common phrases

Let's find the tweets with the highest retweet/like count.




In [5]:
import textwrap
tweets = data['tweets']

sorted_tweets = sorted(
    tweets,
    key=lambda tweet: int(tweet['tweet']['retweet_count']) + int(tweet['tweet']['favorite_count']),
    reverse=True
)

for i in range(0, 3):
    tweet = sorted_tweets[i]['tweet']
    total = int(tweet['retweet_count']) + int(tweet['favorite_count'])
    url = f"https://x.com/{username}/status/{tweet['id']}"
    print(textwrap.fill(tweet['full_text'], width = 50))
    print("\n❤️ + ♻️:", "{:,}".format(total))
    print(f"url: {url}")
    print("------")

hell yeah, I called it 4 months ago.
https://t.co/ElLIHfe7L4 https://t.co/iKJBAdUKj0

❤️ + ♻️: 193,298
url: https://x.com/defenderofbasic/status/1846182079433670951
------
I think about this lock a lot  it's not going to
be replaced with something higher tech. Because
this lock serves an important function: it is
security that is easily verifiable by the user.
It's not enough that it's secure, it needs to be
clear why it's secure to the user
https://t.co/73jOOZwi6I

❤️ + ♻️: 12,972
url: https://x.com/defenderofbasic/status/1801079099546304983
------
the most damaging thing that school, homework, and
the 9-5 does to the human soul is making you feel
like there's an amount of work you can do after
which you'll be "done"

❤️ + ♻️: 12,686
url: https://x.com/defenderofbasic/status/1829517412460540379
------
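
Since only the top few tweets are printed, a full sort isn't strictly needed. As an alternative sketch (not what the cell above does), `heapq.nlargest` from the standard library picks the top `n` items directly. The `engagement` and `top_tweets` helpers and the `sample_tweets` list are hypothetical; the counts are wrapped in `int()` just like in the cell above.

```python
import heapq

def engagement(item):
    """Retweets + likes for one archive item."""
    tweet = item['tweet']
    return int(tweet['retweet_count']) + int(tweet['favorite_count'])

def top_tweets(items, n=3):
    """Return the n items with the highest engagement, best first."""
    return heapq.nlargest(n, items, key=engagement)

# Hypothetical sample data in the same shape as data['tweets'].
sample_tweets = [
    {'tweet': {'id': '1', 'retweet_count': '2', 'favorite_count': '10'}},
    {'tweet': {'id': '2', 'retweet_count': '0', 'favorite_count': '1'}},
    {'tweet': {'id': '3', 'retweet_count': '5', 'favorite_count': '40'}},
]

top = top_tweets(sample_tweets, n=2)
print([item['tweet']['id'] for item in top])  # ['3', '1']
```

Unlike `range(0, 3)` over a sorted list, this also works unchanged when the archive holds fewer than `n` tweets.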


Here we compute the most common 2-word and 3-word phrases (bigrams and trigrams in NLP terms). We remove replies & retweets so it's only counting original tweets & threads.

In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
from collections import Counter
import string

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
translator = str.maketrans('', '', string.punctuation)

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(translator)  # Remove punctuation
    words = text.split()
    return [word for word in words if word not in stop_words]

processed_corpus = []

def removeRepliesAndRetweets():
  accountId = data['account'][0]['account']['accountId']
  new_tweets = []
  for item in data['tweets']:
      tweet = item['tweet']
      full_text = tweet['full_text']
      if (full_text.startswith('RT @')):  # retweets start with "RT @user:"; a bare 'RT' would also match words like "RTX"
        continue
      if ('in_reply_to_user_id' in tweet):
        if (tweet['in_reply_to_user_id'] != accountId):
          # this is a reply, and NOT to self, so we ignore
          continue
      new_tweets.append(tweet)
  return new_tweets

filtered_tweets = removeRepliesAndRetweets()

for tweet in filtered_tweets:
    processed_corpus.append(preprocess_text(tweet['full_text']))

def generate_ngrams(corpus, n=2):
    ngram_list = []
    for tweet in corpus:
        ngram_list.extend(list(ngrams(tweet, n)))  # Create n-grams for each tweet
    return ngram_list

bigrams = generate_ngrams(processed_corpus, n=2)
trigrams = generate_ngrams(processed_corpus, n=3)

bigram_counts = Counter(bigrams)
trigram_counts = Counter(trigrams)

# Get the most common bigrams and trigrams
most_common_bigrams = bigram_counts.most_common(10)  # Top 10 bigrams
most_common_trigrams = trigram_counts.most_common(10)  # Top 10 trigrams

print("Most Common Bigrams:")
for item in most_common_bigrams:
  print(item)

print("\nMost Common Trigrams:")
for item in most_common_trigrams:
  print(item)

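
To make the n-gram step concrete, here is a small dependency-free sketch of the same idea using only the standard library. It assumes each tweet has already been tokenized into a list of words, as `preprocess_text` does; `make_ngrams` and `sample_corpus` are hypothetical names, not part of the notebook or of `nltk`.

```python
from collections import Counter

def make_ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    # zip() stops at the shortest slice, so inputs shorter than n yield [].
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical pre-tokenized corpus: one list of words per tweet.
sample_corpus = [
    ['open', 'source', 'research'],
    ['open', 'source', 'tools'],
]

sample_counts = Counter(
    gram for tokens in sample_corpus for gram in make_ngrams(tokens, 2)
)
print(sample_counts.most_common(1))  # [(('open', 'source'), 2)]
```

Building n-grams per tweet, rather than over one concatenated word list, keeps phrases from spanning the boundary between two unrelated tweets.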

## Step 3 - Plot the account's growth over time

Sort the tweets by date, add up each tweet's retweets and likes, and plot the result over time.

In [None]:
import matplotlib.pyplot as plt
import plotly.express as px
import pandas as pd
from datetime import datetime

# Convert the data into lists of values
dates = []
engagement_counts = []

for tweet in tweets:
    # Parse the creation date
    created_at = tweet['tweet']['created_at']
    date = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')
    dates.append(date)

    # Calculate total engagement (retweets + favorites)
    retweets = int(tweet['tweet']['retweet_count'])
    favorites = int(tweet['tweet']['favorite_count'])
    total_engagement = retweets + favorites
    engagement_counts.append(total_engagement)


# Create a DataFrame for easier plotting
# (named `df` so it doesn't overwrite the `data` archive loaded earlier)
df = pd.DataFrame({
    'Date': dates,
    'Engagement': engagement_counts
})
df.sort_values(by='Date', inplace=True)  # Ensure data is sorted by date

# Plot the data
fig = px.line(df, x='Date', y='Engagement', title="Total Engagement (Likes + Retweets) Over Time")
fig.update_layout(
    xaxis_title="Date",
    yaxis_title="Total Engagement",
    xaxis=dict(rangeslider=dict(visible=True))  # Adds a range slider for easy zooming
)
fig.update_yaxes(fixedrange=False)
fig.show()
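
The plot above shows engagement per tweet, so a single viral tweet dominates the chart. If you'd rather see something closer to cumulative growth, one option is to total engagement per month and keep a running sum. This is only a sketch under assumptions: `cumulative_engagement_by_month` and `sample_tweets` are hypothetical, and it reuses the `created_at` format string from the cell above.

```python
from collections import defaultdict
from datetime import datetime
from itertools import accumulate

def cumulative_engagement_by_month(items):
    """Return [('YYYY-MM', running total of retweets + likes)], oldest first."""
    monthly = defaultdict(int)
    for item in items:
        tweet = item['tweet']
        created = datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S %z %Y')
        monthly[created.strftime('%Y-%m')] += (
            int(tweet['retweet_count']) + int(tweet['favorite_count'])
        )
    months = sorted(monthly)  # 'YYYY-MM' strings sort chronologically
    running_totals = accumulate(monthly[m] for m in months)
    return list(zip(months, running_totals))

# Hypothetical sample data in the same shape as data['tweets'].
sample_tweets = [
    {'tweet': {'created_at': 'Mon Aug 07 12:00:00 +0000 2023',
               'retweet_count': '1', 'favorite_count': '4'}},
    {'tweet': {'created_at': 'Tue Aug 08 09:30:00 +0000 2023',
               'retweet_count': '0', 'favorite_count': '2'}},
    {'tweet': {'created_at': 'Fri Sep 01 18:15:00 +0000 2023',
               'retweet_count': '3', 'favorite_count': '10'}},
]

growth = cumulative_engagement_by_month(sample_tweets)
print(growth)  # [('2023-08', 7), ('2023-09', 20)]
```

The resulting pairs can be passed to the same `px.line` call in place of the per-tweet values.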


Finally, I was curious to look at the tweets from specific time periods where I had spikes. Given a `search_date`, the function below returns all tweets within a day of that date.

In [None]:
search_date = "Aug 7, 2023"  # Replace with the date you want to check

from datetime import datetime, timedelta
from dateutil import parser

# Example search function
def find_tweets_by_date(target_date_str, time_range_days=1):
    # Convert target date string to a datetime object (naive: no timezone yet)
    target_date = parser.parse(target_date_str)
    # Extract the timezone from the first tweet
    first_tweet_time = parser.parse(tweets[0]['tweet']['created_at'])
    timezone = first_tweet_time.tzinfo  # Extract timezone info

    # Interpret target_date in the tweets' timezone. Calling astimezone() on a
    # naive datetime would assume the machine's local timezone and shift the window.
    target_date = target_date.replace(tzinfo=timezone)
    start_date = target_date - timedelta(days=time_range_days)
    end_date = target_date + timedelta(days=time_range_days)

    # Find tweets within the specified range
    matching_tweets = [
        tweet for tweet in tweets
        if start_date <= parser.parse(tweet['tweet']['created_at']).astimezone(timezone) <= end_date
    ]

    return matching_tweets

# Example usage:
# Enter the date to search for using the format "Sep 17, 2023"
found_tweets = find_tweets_by_date(search_date)
sorted_tweets = sorted(
    found_tweets,
    key=lambda tweet: int(tweet['tweet']['retweet_count']) + int(tweet['tweet']['favorite_count']),
    reverse=True
)
sorted_tweets = sorted_tweets[:5]

# Display results
for idx, item in enumerate(sorted_tweets):
    tweet = item['tweet']
    total = int(tweet['retweet_count']) + int(tweet['favorite_count'])
    url = f"https://x.com/{username}/status/{tweet['id']}"
    print(textwrap.fill(tweet['full_text'], width = 50))
    print("\n❤️ + ♻️:", "{:,}".format(total))
    print(f"url: {url}")
    print("------")

I used to think it'd be exhausting to be the
person who always fixes common spaces: picking up
a piece of trash, running the dishwasher, doing
maintenance.   But I think it's actually LESS
work.

❤️ + ♻️: 162
url: https://x.com/defenderofbasic/status/1688545517419057152
------
I used to spend so much energy being upset that
things suck &amp; no one else cares &amp; I have
to fix things that other people benefit from, just
because I want those nice things.   But it feels
actually easier to just...fix things I care about
&amp; move on. I feel good &amp; have nice things

❤️ + ♻️: 129
url: https://x.com/defenderofbasic/status/1688545518618574854
------
@visakanv I think about this a lot too, but with
like art. Expressing what you genuinely like about
an unknown author's work is very significant for
them &amp; the world in a way expressing it for
extremely popular artists isn't.  Just
redistributing it a tiny little bit is worth it

❤️ + ♻️: 52
url: https://x.com/defenderofbasic/status/168