# Data structure

In this notebook, we analyze the structure of the data obtained from the Twitter API. The structure of the data obtained by web scraping the *Metacritic* website is already known.

## Imports

First, we need the basic imports: json (to load the documents) and pandas (to store that information in a DataFrame).

In [1]:
import json
import pandas as pd

## Functions

We define functions that let us visualize both the structure of the data and the data type contained by each key.

In [7]:
# Recursive function that prints every key of a (possibly nested) dictionary
def get_keys(d, depth=0):
    tab = "   " * depth  # Indentation reflects the nesting level
    for key in d.keys():
        print(tab+key)#, ': ', type(d[key])) <- To also obtain data type
        if isinstance(d[key], dict): # The element is another dictionary, we have to get its keys
            get_keys(d[key],depth+1)
        elif isinstance(d[key], list) and d[key]: # Non-empty list, which may contain dictionaries
            # Only the first element is inspected, so keys that appear only in later elements are not printed
            if isinstance(d[key][0], dict):
                get_keys(d[key][0],depth+1)

In [3]:
# Recursive function that returns the keys of a dictionary as a nested dictionary mapping each key to its data type
def get_keys_as_dict(d, d_keys=None):
    if d_keys is None: # Avoid a mutable default argument, which would be shared between calls
        d_keys = {}
    for key in d.keys():
        if isinstance(d[key], dict): # The element is another dictionary, we have to get its keys
            d_keys[key] = get_keys_as_dict(d[key], {})
        elif isinstance(d[key], list) and d[key] and isinstance(d[key][0], dict): # Non-empty list of dictionaries
            d_keys[key] = get_keys_as_dict(d[key][0], {})
            for i in range(1, len(d[key])):
                for list_key in d[key][i]:
                    if list_key not in d_keys[key]:
                        # Keys missing from the first element are tagged with the type of the enclosing list, not of their own value
                        d_keys[key][list_key] = type(d[key])
        else: # Element is neither a dictionary nor a list of dictionaries
            d_keys[key] = type(d[key])
    return d_keys
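
Both functions rely mostly on the first element of each list, so it can help to check whether all elements of a list really share the same keys. The following is a minimal sketch of such a check. The helper name `key_coverage` and the toy records are assumptions made for illustration and are not part of the original notebook.

```python
from collections import Counter

def key_coverage(records):
    """Count, for each key, in how many dictionaries of `records` it appears."""
    counts = Counter()
    for record in records:
        if isinstance(record, dict):  # Ignore elements that are not dictionaries
            counts.update(record.keys())
    return counts

# Toy records (made-up values) imitating tweets where some fields are optional
sample = [
    {"id": "1", "text": "a"},
    {"id": "2", "text": "b", "attachments": {"media_keys": ["3_1"]}},
    {"id": "3", "text": "c", "withheld": {}},
]

coverage = key_coverage(sample)
# Keys that are absent from at least one record
optional = sorted(k for k, n in coverage.items() if n < len(sample))
print(optional)
```

With real data, the same call could be applied to `data['data']` once a timeline or query file has been loaded, assuming that entry is a list of tweet dictionaries.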

## Users

We will analyze the data obtained from the companies' profiles.

In [4]:
path = 'Videojocs/users/uNintendo.json'
#path = 'Videojocs/users/uPlaystation.json'
#path = 'Videojocs/users/uXbox.json'

In [5]:
# Load data
with open(path) as json_file:
    data = json.load(json_file)

In [6]:
get_keys(data)

data :  <class 'dict'>
   username :  <class 'str'>
   name :  <class 'str'>
   created_at :  <class 'str'>
   verified :  <class 'bool'>
   id :  <class 'str'>
   description :  <class 'str'>
   public_metrics :  <class 'dict'>
      followers_count :  <class 'int'>
      following_count :  <class 'int'>
      tweet_count :  <class 'int'>
      listed_count :  <class 'int'>


In [108]:
get_keys_as_dict(data, {})

{'data': {'username': str,
  'name': str,
  'created_at': str,
  'verified': bool,
  'id': str,
  'description': str,
  'public_metrics': {'followers_count': int,
   'following_count': int,
   'tweet_count': int,
   'listed_count': int}}}

In [112]:
# Check data structure by creating a dataframe from it
user = pd.DataFrame(data).T
user

Unnamed: 0,created_at,description,id,name,public_metrics,username,verified
data,2007-04-18T22:43:15.000Z,Welcome to the official Nintendo profile for g...,5162861,Nintendo of America,"{'followers_count': 12634405, 'following_count...",NintendoAmerica,True
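
In the dataframe above, `public_metrics` still holds a whole dictionary in a single cell. One possible way to expand it into separate columns is `pd.json_normalize`. The sketch below uses a made-up record that only imitates the structure of the user file; the values and the variable names `sample` and `flat` are assumptions.

```python
import pandas as pd

# Toy record imitating the structure of the user file (values are made up)
sample = {
    "data": {
        "username": "example_user",
        "verified": True,
        "public_metrics": {
            "followers_count": 10,
            "following_count": 2,
            "tweet_count": 5,
            "listed_count": 1,
        },
    }
}

# Nested dictionaries become columns named with a dotted path
flat = pd.json_normalize(sample["data"])
print(flat.columns.tolist())
```

Each nested key ends up in its own column, for example `public_metrics.followers_count`, which is usually easier to filter and aggregate than a column of dictionaries.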


## Users' timelines

Next, we will look at the data obtained from each company's timeline.

In [100]:
path = 'Videojocs/usersTL/uTlNintendo.json'
#path = 'Videojocs/usersTL/uTlPlaystation.json'
#path = 'Videojocs/usersTL/uTlXbox.json'

In [101]:
with open(path) as json_file:
    data = json.load(json_file)

In [102]:
get_keys(data)

data
   entities
      urls
         start
         end
         url
         expanded_url
         display_url
         images
            url
            width
            height
         status
         title
         description
         unwound_url
      hashtags
         start
         end
         tag
      annotations
         start
         end
         probability
         type
         normalized_text
   id
   edit_history_tweet_ids
   reply_settings
   created_at
   lang
   public_metrics
      retweet_count
      reply_count
      like_count
      quote_count
      impression_count
   text
   attachments
      media_keys
   context_annotations
      domain
         id
         name
         description
      entity
         id
         name
   possibly_sensitive
   conversation_id
   author_id
meta
   result_count
   newest_id
   oldest_id
   next_token


In [103]:
get_keys_as_dict(data, {})

{'data': {'entities': {'urls': {'start': int,
    'end': int,
    'url': str,
    'expanded_url': str,
    'display_url': str,
    'images': {'url': str, 'width': int, 'height': int},
    'status': int,
    'title': str,
    'description': str,
    'unwound_url': str,
    'media_key': list},
   'hashtags': {'start': int, 'end': int, 'tag': str},
   'annotations': {'start': int,
    'end': int,
    'probability': float,
    'type': str,
    'normalized_text': str}},
  'id': str,
  'edit_history_tweet_ids': list,
  'reply_settings': str,
  'created_at': str,
  'lang': str,
  'public_metrics': {'retweet_count': int,
   'reply_count': int,
   'like_count': int,
   'quote_count': int,
   'impression_count': int},
  'text': str,
  'attachments': {'media_keys': list},
  'context_annotations': {'domain': {'id': str,
    'name': str,
    'description': str},
   'entity': {'id': str, 'name': str}},
  'possibly_sensitive': bool,
  'conversation_id': str,
  'author_id': str,
  'in_reply_to_user_id

In [104]:
# Check data structure by creating a dataframe from it
userTl = pd.DataFrame(data['data'])
userTl

Unnamed: 0,entities,id,edit_history_tweet_ids,reply_settings,created_at,lang,public_metrics,text,attachments,context_annotations,possibly_sensitive,conversation_id,author_id,in_reply_to_user_id,referenced_tweets
0,"{'urls': [{'start': 215, 'end': 238, 'url': 'h...",1622958367374458882,[1622958367374458882],everyone,2023-02-07T14:00:02.000Z,en,"{'retweet_count': 26806, 'reply_count': 5088, ...","Tune in at 2 p.m. PST tomorrow, Feb. 8, for a ...",{'media_keys': ['3_1622958361330479105']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",False,1622958367374458882,5162861,,
1,"{'urls': [{'start': 90, 'end': 113, 'url': 'ht...",1622762067986882560,[1622762067986882560],everyone,2023-02-07T01:00:01.000Z,en,"{'retweet_count': 337, 'reply_count': 117, 'li...",Turn it up! 🔊\n\nWatch this performance of We ...,{'media_keys': ['3_1622762064778256384']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",False,1622762067986882560,5162861,,
2,"{'urls': [{'start': 161, 'end': 184, 'url': 'h...",1622740959783006212,[1622740959783006212],everyone,2023-02-06T23:36:08.000Z,en,"{'retweet_count': 286, 'reply_count': 199, 'li...",The critics are engaging with #FireEmblem Enga...,{'media_keys': ['13_1622740688268845068']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",False,1622740959783006212,5162861,,
3,"{'urls': [{'start': 218, 'end': 241, 'url': 'h...",1622648824287068164,[1622648824287068164],everyone,2023-02-06T17:30:01.000Z,en,"{'retweet_count': 453, 'reply_count': 194, 'li...","Under the glow of a full moon, a young witch-i...",{'media_keys': ['3_1622648820235370503']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",False,1622648824287068164,5162861,,
4,"{'urls': [{'start': 143, 'end': 166, 'url': 'h...",1621931825554755586,[1621931825554755586],everyone,2023-02-04T18:00:55.000Z,en,"{'retweet_count': 356, 'reply_count': 131, 'li...",Engage with iconic Fire Emblem heroes as Alear...,{'media_keys': ['13_1621931598424727553']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",False,1621931825554755586,5162861,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2295,"{'urls': [{'start': 111, 'end': 134, 'url': 'h...",1422973469923463168,[1422973469923463168],everyone,2021-08-04T17:31:27.000Z,en,"{'retweet_count': 165, 'reply_count': 106, 'li...","Here are the one, two, three, fore… ten ways #...",{'media_keys': ['13_1422972813825163271']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",False,1422973469923463168,5162861,,
2296,"{'urls': [{'start': 233, 'end': 256, 'url': 'h...",1422732447821803525,[1422732447821803525],everyone,2021-08-04T01:33:43.000Z,en,"{'retweet_count': 647, 'reply_count': 115, 'li...",The #NewPokemonSnap free content update is now...,"{'media_keys': ['3_1422732424711127040', '3_14...","[{'domain': {'id': '45', 'name': 'Brand Vertic...",False,1422732447821803525,5162861,,
2297,"{'urls': [{'start': 72, 'end': 95, 'url': 'htt...",1422591214377918466,[1422591214377918466],everyone,2021-08-03T16:12:30.000Z,en,"{'retweet_count': 496, 'reply_count': 152, 'li...",“Do my words sting? Let them.” \n\n#Zelda #Sky...,{'media_keys': ['13_1422591128415649800']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",False,1422591214377918466,5162861,,
2298,"{'urls': [{'start': 274, 'end': 297, 'url': 'h...",1422361934863769605,[1422361934863769605],everyone,2021-08-03T01:01:26.000Z,en,"{'retweet_count': 543, 'reply_count': 96, 'lik...",The #Tetris99 23rd MAXIMUS CUP event will run ...,{'media_keys': ['13_1422361651987288066']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",False,1422361934863769605,5162861,,


## Tweets

Finally, we check whether there is any difference between the data obtained from the timeline and the data retrieved through a query (since both are tweets).

In [73]:
path = 'Videojocs/tweets/Nintendo.json'

In [74]:
with open(path) as json_file:
    data = json.load(json_file)

In [75]:
get_keys(data)

data
   author_id
   entities
      annotations
         start
         end
         probability
         type
         normalized_text
      urls
         start
         end
         url
         expanded_url
         display_url
         images
            url
            width
            height
         status
         title
         description
         unwound_url
   context_annotations
      domain
         id
         name
         description
      entity
         id
         name
   public_metrics
      retweet_count
      reply_count
      like_count
      quote_count
      impression_count
   lang
   id
   edit_history_tweet_ids
   created_at
   reply_settings
   conversation_id
   possibly_sensitive
   text
meta
   newest_id
   oldest_id
   result_count
   next_token


In [76]:
get_keys_as_dict(data, {})

{'data': {'author_id': str,
  'entities': {'annotations': {'start': int,
    'end': int,
    'probability': float,
    'type': str,
    'normalized_text': str},
   'urls': {'start': int,
    'end': int,
    'url': str,
    'expanded_url': str,
    'display_url': str,
    'images': {'url': str, 'width': int, 'height': int},
    'status': int,
    'title': str,
    'description': str,
    'unwound_url': str}},
  'context_annotations': {'domain': {'id': str,
    'name': str,
    'description': str},
   'entity': {'id': str, 'name': str}},
  'public_metrics': {'retweet_count': int,
   'reply_count': int,
   'like_count': int,
   'quote_count': int,
   'impression_count': int},
  'lang': str,
  'id': str,
  'edit_history_tweet_ids': list,
  'created_at': str,
  'reply_settings': str,
  'conversation_id': str,
  'possibly_sensitive': bool,
  'text': str,
  'attachments': list,
  'withheld': list},
 'meta': {'newest_id': str,
  'oldest_id': str,
  'result_count': int,
  'next_token': str}}
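
Comparing two structures by eye is error-prone, so the comparison can also be sketched in code by turning each nested dictionary of keys into a set of dotted paths and taking set differences. The helper `key_paths` is hypothetical, and the two dictionaries below are abridged, hand-written subsets of the outputs shown in this notebook, so the result only illustrates the method.

```python
def key_paths(structure, prefix=""):
    """Return the set of dotted key paths found in a nested dictionary."""
    paths = set()
    for key, value in structure.items():
        path = f"{prefix}.{key}" if prefix else key
        paths.add(path)
        if isinstance(value, dict):  # Recurse into nested dictionaries
            paths |= key_paths(value, path)
    return paths

# Abridged subsets of the timeline and query structures (illustrative only)
timeline = {"data": {"entities": {"urls": {"url": str}, "hashtags": {"tag": str}},
                     "attachments": {"media_keys": list},
                     "text": str}}
query = {"data": {"entities": {"urls": {"url": str}},
                  "attachments": list,
                  "withheld": list,
                  "text": str}}

only_timeline = sorted(key_paths(timeline) - key_paths(query))
only_query = sorted(key_paths(query) - key_paths(timeline))
print(only_timeline)
print(only_query)
```

In practice, the two arguments would be the results of `get_keys_as_dict` on the timeline file and on the query file, stored in separate variables before `data` is overwritten.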

In [99]:
# Check data structure by creating a dataframe from it
tweets = pd.DataFrame(data['data'])
tweets

Unnamed: 0,author_id,entities,context_annotations,public_metrics,lang,id,edit_history_tweet_ids,created_at,reply_settings,conversation_id,possibly_sensitive,text,attachments,withheld
0,66283522,"{'annotations': [{'start': 9, 'end': 16, 'prob...","[{'domain': {'id': '45', 'name': 'Brand Vertic...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",en,1623094239805468673,[1623094239805468673],2023-02-07T22:59:57.000Z,everyone,1623094239805468673,False,Good guy Nintendo. Gotta give credit where it'...,,
1,720055890074804224,"{'annotations': [{'start': 33, 'end': 47, 'pro...","[{'domain': {'id': '45', 'name': 'Brand Vertic...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",en,1623094225230303233,[1623094225230303233],2023-02-07T22:59:53.000Z,everyone,1623094225230303233,False,Finally caved &amp; got myself a Nintendo Swit...,,
2,1588365348100866048,"{'urls': [{'start': 139, 'end': 162, 'url': 'h...","[{'domain': {'id': '45', 'name': 'Brand Vertic...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",en,1623094174278119424,[1623094174278119424],2023-02-07T22:59:41.000Z,everyone,1623094174278119424,False,PXN V3 Pro Gaming Racing Wheel Volante PC Stee...,{'media_keys': ['3_1623094141084381184']},
3,23367384,"{'annotations': [{'start': 0, 'end': 7, 'proba...","[{'domain': {'id': '45', 'name': 'Brand Vertic...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",en,1623094126710325251,[1623094126710325251],2023-02-07T22:59:30.000Z,everyone,1623094126710325251,False,Nintendo Will Pay Its Workers 10% More https:/...,,
4,1005597131166683138,"{'annotations': [{'start': 0, 'end': 7, 'proba...","[{'domain': {'id': '45', 'name': 'Brand Vertic...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",en,1623094107844247552,[1623094107844247552],2023-02-07T22:59:25.000Z,everyone,1623094107844247552,False,Nintendo seeing Tomodachi Life trending on twi...,{'media_keys': ['7_1623094004765151232']},
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36831,1471880946366488579,"{'urls': [{'start': 39, 'end': 62, 'url': 'htt...","[{'domain': {'id': '45', 'name': 'Brand Vertic...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",en,1620592419355115520,[1620592419355115520],2023-02-01T01:18:36.000Z,everyone,1620592419355115520,False,#NintendoNetwork #NintendoNetworkDown\n https:...,,
36832,1187259559506825216,"{'urls': [{'start': 74, 'end': 97, 'url': 'htt...","[{'domain': {'id': '45', 'name': 'Brand Vertic...","{'retweet_count': 2, 'reply_count': 0, 'like_c...",en,1620592414040760321,[1620592414040760321],2023-02-01T01:18:35.000Z,everyone,1620592414040760321,False,Mega Man X2 Super Nintendo SNES Video Game Lot...,"{'media_keys': ['3_1620585320314970112', '3_16...",
36833,1140968821,"{'urls': [{'start': 207, 'end': 230, 'url': 'h...","[{'domain': {'id': '45', 'name': 'Brand Vertic...","{'retweet_count': 102, 'reply_count': 39, 'lik...",en,1620592328296767488,[1620592328296767488],2023-02-01T01:18:14.000Z,everyone,1620592328296767488,False,February means we are definitely on official N...,{'media_keys': ['3_1620589259395563522']},
36834,1449547113231126531,"{'urls': [{'start': 74, 'end': 97, 'url': 'htt...","[{'domain': {'id': '45', 'name': 'Brand Vertic...","{'retweet_count': 0, 'reply_count': 1, 'like_c...",en,1620592205974097920,[1620592205974097920],2023-02-01T01:17:45.000Z,everyone,1620592205974097920,False,Get pre order bonus for Kirby returns to dream...,{'media_keys': ['3_1620592200542490624']},


## Next notebook

Now we have a much clearer understanding of the data obtained from the Twitter API, so we can perform [data cleaning](./04DataCleaning.ipynb) on both this information and the data obtained through web scraping.