<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import" data-toc-modified-id="Import-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import</a></span></li><li><span><a href="#Feature-selection" data-toc-modified-id="Feature-selection-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Feature selection</a></span></li><li><span><a href="#Tidy-up-messy-dates" data-toc-modified-id="Tidy-up-messy-dates-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Tidy up messy dates</a></span></li><li><span><a href="#Export-to-table" data-toc-modified-id="Export-to-table-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Export to table</a></span></li></ul></div>

# User data import

In [1]:
import numpy as np
import pandas as pd
import datetime as dt
pd.set_option('max_colwidth', None)

## Import

In [2]:
user_path = 'data/users.json'
types = {'created_at': np.int64}  # otherwise pandas infers this column as datetime64[ns]
df = pd.read_json(user_path, dtype=types, lines=True)

## Feature selection

In [3]:
# Some keys are deprecated by Twitter - let's ignore them
deprecated = {'utc_offset', 'time_zone', 'lang', 'geo_enabled', 'following',
              'follow_request_sent', 'has_extended_profile', 'notifications',
              'profile_location', 'contributors_enabled', 'profile_image_url',
              'profile_background_color', 'profile_background_image_url',
              'profile_background_image_url_https', 'profile_background_tile',
              'profile_link_color', 'profile_sidebar_border_color',
              'profile_sidebar_fill_color', 'profile_text_color',
              'profile_use_background_image', 'is_translator',
              'is_translation_enabled', 'translator_type'}
columns = [x for x in df.columns if x not in deprecated]
df = df[columns]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 548 entries, 0 to 547
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   created_at               548 non-null    int64 
 1   default_profile          548 non-null    bool  
 2   default_profile_image    548 non-null    bool  
 3   description              548 non-null    object
 4   entities                 545 non-null    object
 5   favourites_count         548 non-null    int64 
 6   followers_count          548 non-null    int64 
 7   friends_count            548 non-null    int64 
 8   id                       548 non-null    int64 
 9   id_str                   548 non-null    int64 
 10  listed_count             548 non-null    int64 
 11  location                 548 non-null    object
 12  name                     548 non-null    object
 13  profile_banner_url       513 non-null    object
 14  profile_image_url_https  548 non-null    o

See the [Twitter user object data dictionary](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/user-object) for the meaning of each field.

* The `default_profile` and `default_profile_image` fields might have been useful for deciding whether users are 'serious' twitterati. However, some well-known twitterers have `default_profile` set to `True`, and `default_profile_image` is almost always `False`, so I have decided to drop both.
* `entities` seems to contain URLs for politicians. Useful, but not for this work.
* `created_at` looks mostly like a seconds offset from the Unix epoch, although a few values appear as nanosecond offsets. I'll convert it to a date in case I need it.
* I am going to drop `favourites_count` and `listed_count` but keep `friends_count` and `followers_count` as measures of influence.
* Keep `statuses_count` just in case.
* Drop `location`, and also `id` and `id_str`, because `screen_name` is already unique.
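The claims in the list above (a flag that is almost always `False`, a unique `screen_name`) can be checked directly before dropping anything. The following is a minimal sketch on a made-up toy frame; the values and the names `share_default_image` and `names_unique` are assumptions for illustration, not results from the real data.

```python
import pandas as pd

# Toy stand-in for the user table (values are invented for illustration)
toy = pd.DataFrame({
    "screen_name": ["user_a", "user_b", "user_c", "user_d"],
    "default_profile": [True, False, True, False],
    "default_profile_image": [False, False, False, True],
})

# Fraction of accounts still using the default avatar;
# the mean of a boolean column is the share of True values
share_default_image = toy["default_profile_image"].mean()

# True only if no screen_name is repeated, which would justify dropping id and id_str
names_unique = toy["screen_name"].is_unique

print(share_default_image, names_unique)
```

Running the same two checks on `df` would show whether the flags carry any signal and whether `screen_name` is safe to use as the key.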

In [4]:
droplist = ['default_profile', 'default_profile_image', 'entities',
            'favourites_count', 'listed_count', 'location', 'id', 'id_str',
            'profile_banner_url', 'profile_image_url_https', 'protected',
            'url', 'verified']
columns = [x for x in columns if x not in droplist]
df = df[columns]
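The list-comprehension filter used here could also be written with `DataFrame.drop`. This is only a sketch of that alternative on a toy frame (the frame and the `unwanted` list are hypothetical); `errors="ignore"` lets the list mention columns that are not present.

```python
import pandas as pd

# Toy frame with one column we want to discard
toy = pd.DataFrame({
    "screen_name": ["user_a", "user_b"],
    "lang": ["en", "fr"],
    "followers_count": [10, 20],
})

# "utc_offset" is deliberately absent from the toy frame
unwanted = ["lang", "utc_offset"]

# errors="ignore" skips labels that do not exist instead of raising KeyError
trimmed = toy.drop(columns=unwanted, errors="ignore")

print(list(trimmed.columns))
```

Because `drop` returns a new frame, the original is left untouched, much like reassigning `df = df[columns]`.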

In [5]:
# Rename one column to make it more meaningful.
# Assigning the result instead of using inplace=True avoids the warning
# pandas can raise when a column subset is modified in place, so no
# warning filters are needed.
df = df.rename(columns={'statuses_count': 'total_tweets'})

## Tidy up messy dates

In [6]:
# Three created_at values are nanosecond offsets from the Unix epoch; the
# rest are second offsets. Scale the large ones down from ns to s, then
# cast everything to dates.
epoch = dt.date(1970, 1, 1)
created_times = []
for d in df['created_at']:
    if d >= 2000000000:  # Assume all Twitter accounts were created before 2033 ;)
        d = d // 10**9   # too large to be seconds, so treat as nanoseconds
    created_times.append(epoch + dt.timedelta(seconds=d))
df['created_at'] = created_times
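The same clean-up can be done without a Python loop. The sketch below is a vectorised variant under the same assumption (any value of 2,000,000,000 or more is nanoseconds); `normalise_created_at` and `threshold` are hypothetical names introduced here, and the two toy values are invented.

```python
import pandas as pd

def normalise_created_at(values, threshold=2_000_000_000):
    """Return dates from a mix of second and nanosecond Unix offsets.

    Hypothetical helper: values at or above `threshold` are assumed
    to be nanoseconds, everything below it seconds.
    """
    raw = pd.Series(values, dtype="int64")
    # Keep values below the threshold; replace the rest with whole seconds
    seconds = raw.where(raw < threshold, raw // 10**9)
    # Interpret the result as seconds since the Unix epoch, keep only the date
    return pd.to_datetime(seconds, unit="s").dt.date

# The same instant expressed once in seconds and once in nanoseconds
toy = [1230768000, 1230768000 * 10**9]
dates = normalise_created_at(toy)
print(list(dates))
```

Both toy values map to the same calendar date, which is the behaviour the loop version aims for.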


## Export to table

In [7]:
df.to_csv('output_data/user.csv', index=False)  # do not write the row index
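When the table is read back, `created_at` comes in as plain text unless the reader is told to parse it. Here is a minimal round-trip sketch using an in-memory buffer rather than the real file; the toy row is invented, and `buffer` and `restored` are names chosen for this example.

```python
import io
import datetime as dt
import pandas as pd

# Toy frame shaped like the exported table (values are invented)
toy = pd.DataFrame({
    "screen_name": ["example_user"],
    "total_tweets": [42],
    "created_at": [dt.date(2009, 1, 1)],
})

# Write without the row index, as in the export step
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates asks read_csv to turn the text column back into datetimes
restored = pd.read_csv(buffer, parse_dates=["created_at"])
print(restored.dtypes)
```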