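This is some rough code I used to check whether all the JSON files have the same structure.

The code shows that, except for midterm-2018_processed_user_objects, all the JSON files share the same structure; only two 'user' keys (profile_banner_url and withheld_in_countries) are present for some accounts and absent for others.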

In [1]:
'''
Each json file should be a list of account entries. Each entry should be a
dictionary, and one of its keys should be 'user'.
'user' should map to another dictionary with many keys.
Mostly this script checks that every 'user' dictionary has the same keys.
'''
known_inconsistent_keys = set()


In [10]:
import json # for reading json files
import os # for reading directory contents
from pathlib import Path
from functools import reduce

In [3]:
# list the paths of all json files in the data directory
data_path = Path('data') # this is just a path TO the data directory
data_files_to_compare = [data_path / filename for filename in os.listdir(data_path) if filename.endswith('.json')]

for x in data_files_to_compare:
    print(x)

json_contents = dict()
for path in data_files_to_compare:
    with open(path, 'r') as file:
        json_contents[path] = json.load(file)

data\botometer-feedback-2019_tweets.json
data\botwiki-2019_tweets.json
data\celebrity-2019_tweets.json
data\cresci-rtbust-2019_tweets.json
data\cresci-stock-2018_tweets.json
data\gilani-2017_tweets.json
data\midterm-2018_processed_user_objects.json
data\political-bots-2019_tweets.json
data\pronbots-2019_tweets.json
data\vendor-purchased-2019_tweets.json
data\verified-2019_tweets.json


In [4]:
type(json_contents.keys())

dict_keys

In [5]:
json_archetype = json_contents[list(json_contents.keys())[0]]
key_set_archetype = set(json_archetype[0]['user'].keys())

Generally speaking, each json file is a list of many accounts.
Each account is a dictionary with two keys: 'created_at' and 'user'.
The value of the 'user' entry of that dictionary is itself a dictionary with many keys.

In the following code, we check to make sure that all the json files have this structure.
We also examine the keys of the 'user' dictionaries.
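
As a minimal sketch of the structure just described (not the check used in this notebook, which follows in the next cell), a single account entry could be validated like this. The helper name `check_entry` and both sample records are hypothetical, invented only for illustration.

```python
def check_entry(entry):
    """Return a list of problems found in one account entry (empty if none)."""
    if not isinstance(entry, dict):
        return [f'entry is {type(entry).__name__}, expected dict']
    problems = []
    # An entry should have exactly the keys 'created_at' and 'user'
    if set(entry.keys()) != {'created_at', 'user'}:
        problems.append(f'unexpected keys: {sorted(entry.keys())}')
    # 'created_at' should be a string and 'user' should be a dictionary
    if not isinstance(entry.get('created_at'), str):
        problems.append("'created_at' is not a string")
    if not isinstance(entry.get('user'), dict):
        problems.append("'user' is not a dictionary")
    return problems

# Hypothetical records, invented for illustration only
good = {'created_at': 'Tue Nov 06 20:35:08 2018', 'user': {'id': 1, 'name': 'a'}}
bad = {'probe_timestamp': 'Tue Nov 06 20:35:08 2018', 'user_id': 1}

print(check_entry(good))
print(check_entry(bad))
```

Returning a list of problems instead of asserting makes it possible to report every mismatch in an entry at once, at the cost of a little more code.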

In [34]:
weird_datasets = list()
for file_path in json_contents.keys():
    json_obj = json_contents[file_path]
    # file_path is the filepath of the json file
    # json_obj is the contents of the json file
    
    assert(type(json_obj) == list) # the json object should be a list of accounts
    try:
        # make sure each account is structured in the same basic way
        for entry in json_obj:
            assert(type(entry) == dict), f'entry is wrong type: {type(entry)}'
            assert(len(entry) == 2), f'entry = {entry}'
            assert(set(entry.keys()) == {'created_at', 'user'}), f'entry keys {set(entry.keys())}'
            assert(type(entry['created_at']) == str), type(entry['created_at'])
            assert(type(entry['user']) == dict), type(entry['user'])
    except Exception as e:
        weird_datasets.append(file_path)
        print('something\'s weird about this particular json dataset')
        print(file_path) # print the filepath of this weird dataset
        print(e)
        print()
        continue
    
    # If the try/except block above caught an error, the loop has already moved on
    # to the next json file and none of the code below runs for this one. Otherwise,
    # print something to let us know the checks passed.
    print('this json dataset was formatted as expected')
    print(file_path)
    print()
    
    # What if the account data for different accounts includes different keys?
    # Let's collect only the keys that are the same for every user
    agreed_keys = reduce( lambda x,y : x & y , [entry['user'].keys() for entry in json_obj] )
    
    # Now let's figure out which keys are NOT the same for every user
    all_keys = reduce( lambda x,y : x | y , [entry['user'].keys() for entry in json_obj] )
    disagreed_keys = all_keys - agreed_keys

    # Let's keep a running collection of all the disagreed keys across all json files
    known_inconsistent_keys = known_inconsistent_keys | disagreed_keys
    
    # Let's also include any differences between key_set_archetype and all_keys
    known_inconsistent_keys = known_inconsistent_keys | (all_keys ^ key_set_archetype)

this json dataset was formatted as expected
data\botometer-feedback-2019_tweets.json

this json dataset was formatted as expected
data\botwiki-2019_tweets.json

this json dataset was formatted as expected
data\celebrity-2019_tweets.json

this json dataset was formatted as expected
data\cresci-rtbust-2019_tweets.json

this json dataset was formatted as expected
data\cresci-stock-2018_tweets.json

this json dataset was formatted as expected
data\gilani-2017_tweets.json

something's weird about this particular json dataset
data\midterm-2018_processed_user_objects.json
entry = {'probe_timestamp': 'Tue Nov 06 20:35:08 2018', 'user_id': 4107317134, 'screen_name': 'danitheduck21', 'name': 'Dani🏳️\u200d🌈', 'description': 'Dani 💜 She/Her 💜 Randomness all over. Expect lots of politics, Soap talk, LGBT content, and lots of TV shows 💜 #AmberRiley #Klaine #JaSam #Clana', 'user_created_at': 'Tue Nov 03 21:16:13 2015', 'url': None, 'lang': 'en', 'protected': False, 'verified': False, 'geo_enabled': F
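
The key bookkeeping in the cell above relies on three set operations: intersection (`&`) for keys every user has, union (`|`) for keys any user has, and symmetric difference (`^`) for keys that differ from the archetype. Here is a toy sketch of that logic; the `users` list and `archetype` set are hypothetical values invented for illustration, not taken from the datasets.

```python
from functools import reduce

# Hypothetical 'user' dictionaries, invented for illustration only
users = [
    {'id': 1, 'name': 'a', 'profile_banner_url': 'x'},
    {'id': 2, 'name': 'b'},
    {'id': 3, 'name': 'c', 'withheld_in_countries': []},
]
key_sets = [set(u.keys()) for u in users]

# Keys present for every user
agreed = reduce(lambda x, y: x & y, key_sets)
# Keys present for at least one user
all_keys = reduce(lambda x, y: x | y, key_sets)
# Keys present for some users but not all
disagreed = all_keys - agreed

# Keys that appear on only one side when compared with a reference key set
archetype = {'id', 'name', 'location'}
versus_archetype = all_keys ^ archetype

print(sorted(agreed))
print(sorted(disagreed))
print(sorted(versus_archetype))
```

Note that `reduce` without an initial value raises a `TypeError` on an empty list, so this approach assumes every dataset contains at least one account.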

This is what we find when comparing the structures of all the json files:

In [39]:
print('These are the keys that are present for some accounts and datasets, but absent for others:')
for x in known_inconsistent_keys: print(x)
print()
print('Except for these datasets...')
for x in weird_datasets: print(str(x))
print('...all datasets have these keys')
for x in (key_set_archetype - known_inconsistent_keys): print(x)

These are the keys that are present for some accounts and datasets, but absent for others:
profile_banner_url
withheld_in_countries

Except for these datasets...
data\midterm-2018_processed_user_objects.json
...all datasets have these keys
profile_sidebar_border_color
name
profile_background_image_url
listed_count
favourites_count
created_at
followers_count
friends_count
profile_background_color
id_str
is_translation_enabled
translator_type
profile_use_background_image
notifications
is_translator
profile_sidebar_fill_color
utc_offset
follow_request_sent
default_profile
id
verified
default_profile_image
profile_text_color
entities
profile_background_tile
protected
profile_background_image_url_https
location
profile_image_url
following
lang
geo_enabled
time_zone
profile_image_url_https
profile_link_color
url
statuses_count
screen_name
has_extended_profile
description
contributors_enabled


This is the weird dataset:

In [33]:
# build the path from data_path to avoid the invalid '\m' escape in a string literal
midterm_path = data_path / 'midterm-2018_processed_user_objects.json'
display(json_contents[midterm_path][0])

for x in json_contents[midterm_path][0].keys(): print(x)

{'probe_timestamp': 'Tue Nov 06 20:35:08 2018',
 'user_id': 4107317134,
 'screen_name': 'danitheduck21',
 'name': 'Dani🏳️\u200d🌈',
 'description': 'Dani 💜 She/Her 💜 Randomness all over. Expect lots of politics, Soap talk, LGBT content, and lots of TV shows 💜 #AmberRiley #Klaine #JaSam #Clana',
 'user_created_at': 'Tue Nov 03 21:16:13 2015',
 'url': None,
 'lang': 'en',
 'protected': False,
 'verified': False,
 'geo_enabled': False,
 'profile_use_background_image': False,
 'default_profile': False,
 'followers_count': 481,
 'friends_count': 870,
 'listed_count': 26,
 'favourites_count': 6542,
 'statuses_count': 67025,
 'tid': 1059907055421509632}

probe_timestamp
user_id
screen_name
name
description
user_created_at
url
lang
protected
verified
geo_enabled
profile_use_background_image
default_profile
followers_count
friends_count
listed_count
favourites_count
statuses_count
tid
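
To see how far the weird dataset actually diverges, the two key lists printed above can be compared directly. This is a rough sketch: both sets are copied by hand from the outputs in this notebook, and the variable names are my own.

```python
# Keys shared by every 'user' dictionary in the other datasets (copied from the output above)
common_user_keys = {
    'profile_sidebar_border_color', 'name', 'profile_background_image_url',
    'listed_count', 'favourites_count', 'created_at', 'followers_count',
    'friends_count', 'profile_background_color', 'id_str',
    'is_translation_enabled', 'translator_type', 'profile_use_background_image',
    'notifications', 'is_translator', 'profile_sidebar_fill_color', 'utc_offset',
    'follow_request_sent', 'default_profile', 'id', 'verified',
    'default_profile_image', 'profile_text_color', 'entities',
    'profile_background_tile', 'protected', 'profile_background_image_url_https',
    'location', 'profile_image_url', 'following', 'lang', 'geo_enabled',
    'time_zone', 'profile_image_url_https', 'profile_link_color', 'url',
    'statuses_count', 'screen_name', 'has_extended_profile', 'description',
    'contributors_enabled',
}

# Keys of the first midterm-2018 record (copied from the output above)
midterm_keys = {
    'probe_timestamp', 'user_id', 'screen_name', 'name', 'description',
    'user_created_at', 'url', 'lang', 'protected', 'verified', 'geo_enabled',
    'profile_use_background_image', 'default_profile', 'followers_count',
    'friends_count', 'listed_count', 'favourites_count', 'statuses_count', 'tid',
}

shared = midterm_keys & common_user_keys        # keys the midterm record has in common
midterm_only = midterm_keys - common_user_keys  # keys unique to the midterm record

print('shared:', sorted(shared))
print('midterm only:', sorted(midterm_only))
```

Some of the midterm-only keys look like renamed counterparts of the common keys (for example user_created_at next to created_at), but nothing in this notebook verifies that.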
