## NEW CONTINUING: Annotated dataset

- In this notebook, I explore the annotated dataset, which is stored as a JSON file.
- The text files containing code-switching instances were annotated with the text annotation feature of [the Labelbox platform](https://labelbox.com/product/annotate/text/).
    - More detailed information on the **annotation scheme** can be found [here](https://github.com/Data-Science-for-Linguists-2023/Kazakh-Russian-Code-Switching-Analysis/blob/main/annotated-data-samples/annotation_scheme_1st_draft.md)
- First, I will load and experiment with test sample 1, which contains two objects (annotated files).
- Second, I will load and explore test sample 2, which contains 14 objects (annotated files).
- Finally, I will carry out a preliminary analysis of the annotated data:
   - How many utterances are there?
   - How many code-switching instances are there within each utterance?
   - Which type of code-switching (cs) is prevalent: inter-sent, intra-sent, or intra-word?
       - The prediction is that the intra-sent and intra-word cs types will be prevalent.
   - For each type of cs, which linguistic unit is most common: phrases, discourse markers, particular parts of speech (n, adj, adv, pronoun), etc.?

In [1]:
# import labelbox as lb
# import labelbox.types as lb_types
import json
import pandas as pd
from flatten_json import flatten
from collections import defaultdict
import seaborn as sns

## Load annotated files: test sample 1

- Open a JSON file exported from the Labelbox annotation project.
    - Structure of the JSON file: a list of nested dictionaries, whose values are often lists of further dictionaries.
- Extract data from the nested dictionaries.
    - A tutorial can be found [here](https://towardsdatascience.com/how-do-i-extract-nested-data-in-python-4e7bed37566a)

In [2]:
with open('/Users/aidyn/Downloads/export-2023-03-16T21_01_57.705Z.json') as f:
   data = json.load(f)

# print(data[:10])

In [3]:
type(data)

list

In [4]:
len(data)
type(data[0])

dict

In [5]:
label_dict = data[0]['Label']
label_dict.keys()

dict_keys(['objects', 'classifications', 'relationships'])

In [6]:
type(data[0]['Label'])

dict

In [7]:
data[0]['Label']['objects'][0]

{'featureId': 'clfbk9dbf0002396jfl1k6gpk',
 'schemaId': 'clfbcpdz013ua07zn9k319jn3',
 'color': '#1CE6FF',
 'title': 'uttr',
 'value': 'uttr',
 'version': 1,
 'format': 'text.location',
 'data': {'location': {'start': 151, 'end': 211}},
 'classifications': [{'featureId': 'clfbk9hrj0004396jaxr9fvb7',
   'schemaId': 'clfbcpdz013ub07zn8cw44wpj',
   'title': 'lang',
   'value': 'lang',
   'position': 0,
   'answer': {'featureId': 'clfbk9hrj0003396jlk95obai',
    'schemaId': 'clfbcpdz013uc07zn8vjl7q45',
    'title': 'kz',
    'value': 'kz',
    'position': 0}}]}

In [8]:
type(data[0]['Label']['objects'])

list

In [9]:
len(data[0]['Label']['objects'])

49

In [10]:
data[0]['Label']['objects'][0]

{'featureId': 'clfbk9dbf0002396jfl1k6gpk',
 'schemaId': 'clfbcpdz013ua07zn9k319jn3',
 'color': '#1CE6FF',
 'title': 'uttr',
 'value': 'uttr',
 'version': 1,
 'format': 'text.location',
 'data': {'location': {'start': 151, 'end': 211}},
 'classifications': [{'featureId': 'clfbk9hrj0004396jaxr9fvb7',
   'schemaId': 'clfbcpdz013ub07zn8cw44wpj',
   'title': 'lang',
   'value': 'lang',
   'position': 0,
   'answer': {'featureId': 'clfbk9hrj0003396jlk95obai',
    'schemaId': 'clfbcpdz013uc07zn8vjl7q45',
    'title': 'kz',
    'value': 'kz',
    'position': 0}}]}

In [11]:
data[0]['Label']['objects'][0]['featureId']

'clfbk9dbf0002396jfl1k6gpk'

In [12]:
data[0]['Label']['objects'][0]['classifications']

[{'featureId': 'clfbk9hrj0004396jaxr9fvb7',
  'schemaId': 'clfbcpdz013ub07zn8cw44wpj',
  'title': 'lang',
  'value': 'lang',
  'position': 0,
  'answer': {'featureId': 'clfbk9hrj0003396jlk95obai',
   'schemaId': 'clfbcpdz013uc07zn8vjl7q45',
   'title': 'kz',
   'value': 'kz',
   'position': 0}}]

In [13]:
type(data[0]['Label']['objects'][0]['classifications'])

list

In [14]:
data[0]['Label']['objects'][0]['classifications'][0]['answer']

{'featureId': 'clfbk9hrj0003396jlk95obai',
 'schemaId': 'clfbcpdz013uc07zn8vjl7q45',
 'title': 'kz',
 'value': 'kz',
 'position': 0}

In [15]:
type(data[0]['Label']['objects'][0]['classifications'][0]['answer'])

dict

In [16]:
for item in data:
    print(item['Label']['objects'][0]['classifications'])

[{'featureId': 'clfbk9hrj0004396jaxr9fvb7', 'schemaId': 'clfbcpdz013ub07zn8cw44wpj', 'title': 'lang', 'value': 'lang', 'position': 0, 'answer': {'featureId': 'clfbk9hrj0003396jlk95obai', 'schemaId': 'clfbcpdz013uc07zn8vjl7q45', 'title': 'kz', 'value': 'kz', 'position': 0}}]
[{'featureId': 'clfbkxt8o0004396j79uc1v16', 'schemaId': 'clfbcpdz013ub07zn8cw44wpj', 'title': 'lang', 'value': 'lang', 'position': 0, 'answer': {'featureId': 'clfbkxt8o0003396jgcabgizk', 'schemaId': 'clfbcpdz113ue07zn3mzi80nz', 'title': 'rs', 'value': 'rs', 'position': 1}}]


In [17]:
for item in data:
    print(item['Label']['objects'][0])

{'featureId': 'clfbk9dbf0002396jfl1k6gpk', 'schemaId': 'clfbcpdz013ua07zn9k319jn3', 'color': '#1CE6FF', 'title': 'uttr', 'value': 'uttr', 'version': 1, 'format': 'text.location', 'data': {'location': {'start': 151, 'end': 211}}, 'classifications': [{'featureId': 'clfbk9hrj0004396jaxr9fvb7', 'schemaId': 'clfbcpdz013ub07zn8cw44wpj', 'title': 'lang', 'value': 'lang', 'position': 0, 'answer': {'featureId': 'clfbk9hrj0003396jlk95obai', 'schemaId': 'clfbcpdz013uc07zn8vjl7q45', 'title': 'kz', 'value': 'kz', 'position': 0}}]}
{'featureId': 'clfbkxma70002396jwrrwp26n', 'schemaId': 'clfbcpdz013ua07zn9k319jn3', 'color': '#1CE6FF', 'title': 'uttr', 'value': 'uttr', 'version': 1, 'format': 'text.location', 'data': {'location': {'start': 26, 'end': 42}}, 'classifications': [{'featureId': 'clfbkxt8o0004396j79uc1v16', 'schemaId': 'clfbcpdz013ub07zn8cw44wpj', 'title': 'lang', 'value': 'lang', 'position': 0, 'answer': {'featureId': 'clfbkxt8o0003396jgcabgizk', 'schemaId': 'clfbcpdz113ue07zn3mzi80nz', 't

In [18]:
for item in data:
    print(item['Label']['objects'][0]['classifications'][0]['answer'])

{'featureId': 'clfbk9hrj0003396jlk95obai', 'schemaId': 'clfbcpdz013uc07zn8vjl7q45', 'title': 'kz', 'value': 'kz', 'position': 0}
{'featureId': 'clfbkxt8o0003396jgcabgizk', 'schemaId': 'clfbcpdz113ue07zn3mzi80nz', 'title': 'rs', 'value': 'rs', 'position': 1}


**NOTE:**

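- The loaded JSON structure is deeply nested: the top level is a list of dictionaries (one per annotated file); each file's 'Label' dictionary holds a list of 'objects'; and each object holds a list of 'classifications', each with a nested 'answer' dictionary.
- The goal is to extract the data related to text annotation.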
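
To make the nesting concrete, here is a minimal sketch that walks the structure and yields one flat record per annotated object. The `sample_export` data and the `iter_annotations` helper are hypothetical stand-ins that mirror the keys printed above; only the first classification of each object is read, which is an assumption that holds for the objects shown so far.

```python
from typing import Iterator

# Hypothetical miniature of the export, mirroring the keys seen above.
sample_export = [
    {
        "External ID": "example_file.txt",  # illustrative name, not from the real export
        "Label": {
            "objects": [
                {
                    "title": "uttr",
                    "data": {"location": {"start": 151, "end": 211}},
                    "classifications": [
                        {"value": "lang", "answer": {"value": "kz"}}
                    ],
                }
            ],
            "classifications": [],
            "relationships": [],
        },
    }
]


def iter_annotations(export: list) -> Iterator[dict]:
    """Yield one flat record per annotated object in the export."""
    for file_entry in export:
        for obj in file_entry["Label"]["objects"]:
            location = obj["data"]["location"]
            subtags = obj.get("classifications", [])
            # Assumption: at most the first classification matters here.
            first = subtags[0] if subtags else None
            yield {
                "file": file_entry.get("External ID"),
                "tag": obj["title"],
                "start": location["start"],
                "end": location["end"],
                "subtag_group": first["value"] if first else None,
                "subtag": first["answer"]["value"] if first else None,
            }


records = list(iter_annotations(sample_export))
print(records)
```

With the real export, the same generator could be called on `data`, provided every object follows the structure shown above.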


## Build pd dataframe

- Convert the JSON data to a pandas DataFrame:
    - in the DataFrame, top-level keys become column names and each object (with its nested dictionaries) becomes a row;
    - we need to parse the nested dictionaries to understand their structure and to see which key-value pairs to extract;
    - the simplest elements are the ID and DataRow ID keys and their values;
    - the most complex structure is the **Label** key and its nested value;
        - all **annotated data** is stored here.
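
As an alternative to one row per file, `pd.json_normalize` can produce one row per annotated object directly. The sketch below is only an illustration on hypothetical data (`sample_export` and its file names are made up); it uses `record_path` to descend into `Label` and then `objects`, and `meta` to carry the file's `External ID` onto every row.

```python
import pandas as pd

# Hypothetical two-file export with the same key layout as the real one.
sample_export = [
    {
        "External ID": "file_a.txt",
        "Label": {
            "objects": [
                {"title": "uttr", "data": {"location": {"start": 26, "end": 42}}},
                {"title": "cs", "data": {"location": {"start": 30, "end": 35}}},
            ]
        },
    },
    {
        "External ID": "file_b.txt",
        "Label": {
            "objects": [
                {"title": "uttr", "data": {"location": {"start": 151, "end": 211}}},
            ]
        },
    },
]

# One row per object; the file name is repeated on each of its rows.
objects_df = pd.json_normalize(
    sample_export,
    record_path=["Label", "objects"],
    meta=["External ID"],
)
print(objects_df)
```

Keeping the file name on every row matters later, because the start and end offsets are only meaningful within a single file.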
    

In [19]:
# reading JSON file
# df = pd.read_json('/Users/aidyn/Downloads/export-2023-03-16T21_01_57.705Z.json')
df = pd.DataFrame(data)
# displaying sample output
df.head()

Unnamed: 0,ID,DataRow ID,Labeled Data,Label,Created By,Project Name,Created At,Updated At,Seconds to Label,Seconds to Review,...,Agreement,Is Benchmark,Benchmark Agreement,Benchmark ID,Dataset Name,Reviews,View Label,Has Open Issues,Skipped,DataRow Workflow Info
0,clfbk8guf04g5071jbrqo26bq,clfbcqidc3xmk07zvgxw373vk,https://storage.labelbox.com/clfagslqv28u407zn...,{'objects': [{'featureId': 'clfbk9dbf0002396jf...,mob75@pitt.edu,annotation-text1,2023-03-16T20:36:16.000Z,2023-03-16T20:36:16.000Z,691.644,32.957,...,-1,0,-1,,cs-annotation1,[],https://editor.labelbox.com?project=clfbk7q941...,0,False,"{'taskName': 'Done', 'Workflow History': [{'ac..."
1,clfbkxgaq0bes07012isv5rq9,clfbkqg468bwf07zc7n31dl3j,https://storage.labelbox.com/clfagslqv28u407zn...,{'objects': [{'featureId': 'clfbkxma70002396jw...,mob75@pitt.edu,annotation-text1,2023-03-16T20:59:51.000Z,2023-03-16T20:59:51.000Z,926.531,0.0,...,-1,0,-1,,text2,[],https://editor.labelbox.com?project=clfbk7q941...,0,False,"{'taskName': 'Done', 'Workflow History': [{'ac..."


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     2 non-null      object 
 1   DataRow ID             2 non-null      object 
 2   Labeled Data           2 non-null      object 
 3   Label                  2 non-null      object 
 4   Created By             2 non-null      object 
 5   Project Name           2 non-null      object 
 6   Created At             2 non-null      object 
 7   Updated At             2 non-null      object 
 8   Seconds to Label       2 non-null      float64
 9   Seconds to Review      2 non-null      float64
 10  Seconds to Create      2 non-null      float64
 11  External ID            2 non-null      object 
 12  Global Key             0 non-null      object 
 13  Agreement              2 non-null      int64  
 14  Is Benchmark           2 non-null      int64  
 15  Benchmark 

In [21]:
# Check View Label Column
for i in df['View Label']:
    print(i)
  

https://editor.labelbox.com?project=clfbk7q941djz07ymacjs9pqs&label=clfbk8guf04g5071jbrqo26bq
https://editor.labelbox.com?project=clfbk7q941djz07ymacjs9pqs&label=clfbkxgaq0bes07012isv5rq9


In [22]:
with open('/Users/aidyn/Downloads/export-2023-03-16T21_01_57.705Z.json') as user_file:
  file_contents = user_file.read()
  
# print(file_contents)

parsed_json = json.loads(file_contents)
parsed_json[0]['Label']['objects'][0]

{'featureId': 'clfbk9dbf0002396jfl1k6gpk',
 'schemaId': 'clfbcpdz013ua07zn9k319jn3',
 'color': '#1CE6FF',
 'title': 'uttr',
 'value': 'uttr',
 'version': 1,
 'format': 'text.location',
 'data': {'location': {'start': 151, 'end': 211}},
 'classifications': [{'featureId': 'clfbk9hrj0004396jaxr9fvb7',
   'schemaId': 'clfbcpdz013ub07zn8cw44wpj',
   'title': 'lang',
   'value': 'lang',
   'position': 0,
   'answer': {'featureId': 'clfbk9hrj0003396jlk95obai',
    'schemaId': 'clfbcpdz013uc07zn8vjl7q45',
    'title': 'kz',
    'value': 'kz',
    'position': 0}}]}

**NOTE:**

- The main tags **uttr** and **cs** can be found via the 'title' key;
- Subtags can be found in the 'classifications' list of each object:
    - **lang: kz | rs**  
    - **inter_sent: uttr**
    - **intra_sent: disc | phr | vp | cl**
    - **intra_word: n | adj | adv | pn | conj | interj | morph**
- Both the main tags and the subtags sit under the 'objects' root.
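
Given this layout, the questions about prevalent cs types and linguistic units reduce to counting subtag values. The following is a rough sketch on hypothetical objects (`sample_objects` is invented for illustration and is not taken from the export); it tallies the cs type and the (type, unit) pair for every object whose title is `cs`.

```python
from collections import Counter

# Hypothetical objects shaped like the entries under 'objects'.
sample_objects = [
    {"title": "uttr", "classifications": [{"value": "lang", "answer": {"value": "kz"}}]},
    {"title": "uttr", "classifications": [{"value": "lang", "answer": {"value": "rs"}}]},
    {"title": "cs", "classifications": [{"value": "intra_word", "answer": {"value": "n"}}]},
    {"title": "cs", "classifications": [{"value": "intra_word", "answer": {"value": "morph"}}]},
    {"title": "cs", "classifications": [{"value": "intra_sent", "answer": {"value": "disc"}}]},
]

cs_type_counts = Counter()   # e.g. intra_word, intra_sent
unit_counts = Counter()      # e.g. (intra_word, n)

for obj in sample_objects:
    if obj["title"] != "cs":
        continue  # utterance objects carry the language subtag instead
    for clf in obj.get("classifications", []):
        cs_type_counts[clf["value"]] += 1
        unit_counts[(clf["value"], clf["answer"]["value"])] += 1

print(cs_type_counts.most_common())
print(unit_counts.most_common())
```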

In [23]:
type(parsed_json)

list

In [24]:
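# Might be helpful to extract keys and values from a nested dict JSON object
def dict_get(x, key, here=None):
    """Collect values stored under `key` in a nested dict, searching depth-first."""
    if here is None:
        here = []
    if x.get(key):
        # Key found with a truthy value at this level: record it and stop descending.
        here.append(x.get(key))
    else:
        for i, j in x.items():
            # For lists, only the first element is searched; empty lists are skipped.
            if isinstance(x[i], list) and x[i]: dict_get(x[i][0], key, here)
            if isinstance(x[i], dict): dict_get(x[i], key, here)
    return here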

## Parse nested JSON content

- We want to extract the Label key and its nested dictionary values because the annotated text information is stored there;
- The nested JSON was parsed following [this tutorial](https://pybit.es/articles/case-study-how-to-parse-nested-json/).

In [25]:
entries = df['Label'][0]['objects']
# entries

In [26]:
entries[0]

{'featureId': 'clfbk9dbf0002396jfl1k6gpk',
 'schemaId': 'clfbcpdz013ua07zn9k319jn3',
 'color': '#1CE6FF',
 'title': 'uttr',
 'value': 'uttr',
 'version': 1,
 'format': 'text.location',
 'data': {'location': {'start': 151, 'end': 211}},
 'classifications': [{'featureId': 'clfbk9hrj0004396jaxr9fvb7',
   'schemaId': 'clfbcpdz013ub07zn8cw44wpj',
   'title': 'lang',
   'value': 'lang',
   'position': 0,
   'answer': {'featureId': 'clfbk9hrj0003396jlk95obai',
    'schemaId': 'clfbcpdz013uc07zn8vjl7q45',
    'title': 'kz',
    'value': 'kz',
    'position': 0}}]}

In [27]:
len(entries)

49

In [28]:
entries2 = df['Label'][1]['objects']
# entries2

In [29]:
label_df1 = pd.DataFrame(entries)
label_df2 = pd.DataFrame(entries2)
label_df = pd.concat([label_df1, label_df2])
label_df.sample(5)

Unnamed: 0,featureId,schemaId,color,title,value,version,format,data,classifications
36,clfblaofn0061396jqommhdci,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 4718, 'end': 4815}}","[{'featureId': 'clfblaqka0063396jnqtse6jy', 's..."
42,clfbkm4nk006k396jiwsmp0ri,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 4559, 'end': 4593}}","[{'featureId': 'clfbkm5q0006m396j8ktf2ick', 's..."
39,clfbkll0s0065396jy77v4i7y,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 4360, 'end': 4378}}","[{'featureId': 'clfbklnu70067396jioinjnv1', 's..."
27,clfbkirzp004c396jzkplaa79,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 3269, 'end': 3305}}","[{'featureId': 'clfbkitmh004e396j2o79em9l', 's..."
17,clfbl409u002u396ja7cb53rr,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 1788, 'end': 1889}}","[{'featureId': 'clfbl41qh002w396jy1pu7664', 's..."


In [30]:
pip install flatten_json

Note: you may need to restart the kernel to use updated packages.


In [31]:
# df_final = pd.DataFrame([flatten_json(data[key]) for key in data])
# df_final

In [32]:
type(label_df)

pandas.core.frame.DataFrame

In [33]:
label_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 111 entries, 0 to 61
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   featureId        111 non-null    object
 1   schemaId         111 non-null    object
 2   color            111 non-null    object
 3   title            111 non-null    object
 4   value            111 non-null    object
 5   version          111 non-null    int64 
 6   format           111 non-null    object
 7   data             111 non-null    object
 8   classifications  111 non-null    object
dtypes: int64(1), object(8)
memory usage: 8.7+ KB


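- So, there are 111 annotated objects.
- The classifications column contains the nested subtags for each annotated object, as noted above.
- We can extract tags and subtags using the **dict_get function**, as shown below.
    - Caveat: the calls below return only 13 values, not 111. The likely cause is that `pd.concat` kept the original row labels, so labels 0-48 occur twice and `x[i]` returns a Series (neither a list nor a dict) for them; only the 13 uniquely labelled rows (49-61) are visited.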

In [34]:
dict_get(label_df['classifications'],'title')

['lang',
 'lang',
 'intra-word',
 'intra-word',
 'intra-word',
 'intra-word',
 'intra-word',
 'intra-word',
 'intra-word',
 'intra-word',
 'intra-sent',
 'intra-sent',
 'intra-sent']

In [35]:
len(dict_get(label_df['classifications'],'title'))

13
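
The count above is 13 even though the dataframe has 111 rows, so a collector that does not depend on row labels may be safer. Below is a minimal sketch of one, assuming the column is first converted to a plain list; `collect_values` and `classification_lists` are hypothetical names, and like `dict_get` the helper stops descending once the key is found at a given level.

```python
def collect_values(node, key):
    """Yield the values stored under `key` in nested dicts and lists.

    Every list element is visited. When a dict contains `key`, its value
    is yielded and that dict is not searched any deeper.
    """
    if isinstance(node, dict):
        if key in node:
            yield node[key]
        else:
            for value in node.values():
                yield from collect_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from collect_values(item, key)


# Hypothetical stand-in for the classifications column as a list of lists.
classification_lists = [
    [{"title": "lang", "answer": {"title": "kz"}}],
    [{"title": "lang", "answer": {"title": "rs"}}],
    [{"title": "intra-word", "answer": {"title": "n"}}],
]

titles = list(collect_values(classification_lists, "title"))
answers = [a["title"] for a in collect_values(classification_lists, "answer")]
print(titles)
print(answers)
```

On the real data, the input could be `label_df['classifications'].tolist()`, which drops the index labels entirely.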

In [36]:
dict_get(label_df['classifications'],'answer')

[{'featureId': 'clfblh28r008v396julvazgil',
  'schemaId': 'clfbcpdz013uc07zn8vjl7q45',
  'title': 'kz',
  'value': 'kz',
  'position': 0},
 {'featureId': 'clfblh7yh008z396j0c63pt6x',
  'schemaId': 'clfbcpdz013uc07zn8vjl7q45',
  'title': 'kz',
  'value': 'kz',
  'position': 0},
 {'featureId': 'clfbkyhm4000i396j1qjv48vy',
  'schemaId': 'clfbcpdz113uy07zn3wo55wm0',
  'title': 'n',
  'value': 'n',
  'position': 0},
 {'featureId': 'clfbl1nds001r396jp53z541c',
  'schemaId': 'clfbcpdz113v207znepk25gqw',
  'title': 'adv',
  'value': 'adv',
  'position': 2},
 {'featureId': 'clfbl34oh002g396jgg7ra10o',
  'schemaId': 'clfbcpdz113v407znbtfvd85f',
  'title': 'pn',
  'value': 'pn',
  'position': 3},
 {'featureId': 'clfbl4bs30030396j5szz9zr4',
  'schemaId': 'clfbcpdz113v807zn2q2y9ck4',
  'title': 'morph',
  'value': 'morph',
  'position': 5},
 {'featureId': 'clfbl8ccj004t396j7lhkdmid',
  'schemaId': 'clfbcpdz113v807zn2q2y9ck4',
  'title': 'morph',
  'value': 'morph',
  'position': 5},
 {'featureId': 

In [37]:
len(dict_get(label_df['classifications'],'answer'))

13

In [38]:
dict_get(label_df['classifications'],'schemaId')

['clfbcpdz013ub07zn8cw44wpj',
 'clfbcpdz013ub07zn8cw44wpj',
 'clfbcpdz113ux07znfaxvb5r2',
 'clfbcpdz113ux07znfaxvb5r2',
 'clfbcpdz113ux07znfaxvb5r2',
 'clfbcpdz113ux07znfaxvb5r2',
 'clfbcpdz113ux07znfaxvb5r2',
 'clfbcpdz113ux07znfaxvb5r2',
 'clfbcpdz113ux07znfaxvb5r2',
 'clfbcpdz113ux07znfaxvb5r2',
 'clfbcpdz113un07zn0qvugfbt',
 'clfbcpdz113un07zn0qvugfbt',
 'clfbcpdz113un07zn0qvugfbt']

In [39]:
dict_get(label_df['classifications'],'featureId')

['clfblh0n9008u396jtqksj611',
 'clfblh7yh0090396j5b73x4w5',
 'clfbkyhm4000j396jt97cpx6z',
 'clfbl1nds001s396j4bvyqisf',
 'clfbl34oh002h396jmuydz5wv',
 'clfbl4bs30031396jii4jnq80',
 'clfbl8ccj004u396jhv7y644g',
 'clfbl98z50059396jdefdw5wc',
 'clfbla2xm005o396jt8bsgdgv',
 'clfblblsu006i396j44s1550t',
 'clfblc1js006n396jb8lr8kea',
 'clfble333007e396juhak2ekj',
 'clfbleuco007q396j2maybd1e']

In [40]:
len(dict_get(label_df['classifications'],'featureId'))

13

In [41]:
dict_get(label_df['classifications'],'value')

['lang',
 'lang',
 'intra_word',
 'intra_word',
 'intra_word',
 'intra_word',
 'intra_word',
 'intra_word',
 'intra_word',
 'intra_word',
 'intra_sent',
 'intra_sent',
 'intra_sent']

In [42]:
label_df.groupby('schemaId')['title'].value_counts()

schemaId                   title
clfbcpdz013ua07zn9k319jn3  uttr     95
clfbcpdz113ui07znc6r6b37s  cs       16
Name: title, dtype: int64

In [43]:
label_df['title'].value_counts()

uttr    95
cs      16
Name: title, dtype: int64

**OBSERVATION:**

- There are 95 utterance (**uttr**) objects overall in the two annotated text files;
- There are 16 code-switching (**cs**) objects, annotated as separate spans.
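
To relate the two counts, each cs span can be assigned to the utterance whose character range contains it. This is a sketch under assumptions: the offsets below are invented, all spans are taken to come from the same file, and a cs span is assumed to lie fully inside one utterance. The names `spans` and `count_cs_inside` are hypothetical.

```python
import pandas as pd

# Hypothetical spans from a single file (offsets are only comparable within one file).
spans = pd.DataFrame(
    {
        "title": ["uttr", "uttr", "cs", "cs", "cs"],
        "start": [26, 106, 30, 110, 118],
        "end": [42, 124, 35, 113, 121],
    }
)

utterances = spans[spans["title"] == "uttr"].reset_index(drop=True)
cs_spans = spans[spans["title"] == "cs"]


def count_cs_inside(utt_row) -> int:
    """Count cs spans whose range falls within the utterance's range."""
    inside = (cs_spans["start"] >= utt_row["start"]) & (cs_spans["end"] <= utt_row["end"])
    return int(inside.sum())


utterances["n_cs"] = utterances.apply(count_cs_inside, axis=1)
print(utterances)
```

With several files, the same count would need to be done per file, for example by grouping on the file name first.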

In [44]:
df = pd.json_normalize(data, sep='.')
display(df)



Unnamed: 0,ID,DataRow ID,Labeled Data,Created By,Project Name,Created At,Updated At,Seconds to Label,Seconds to Review,Seconds to Create,...,Dataset Name,Reviews,View Label,Has Open Issues,Skipped,Label.objects,Label.classifications,Label.relationships,DataRow Workflow Info.taskName,DataRow Workflow Info.Workflow History
0,clfbk8guf04g5071jbrqo26bq,clfbcqidc3xmk07zvgxw373vk,https://storage.labelbox.com/clfagslqv28u407zn...,mob75@pitt.edu,annotation-text1,2023-03-16T20:36:16.000Z,2023-03-16T20:36:16.000Z,691.644,32.957,658.687,...,cs-annotation1,[],https://editor.labelbox.com?project=clfbk7q941...,0,False,"[{'featureId': 'clfbk9dbf0002396jfl1k6gpk', 's...",[],[],Done,"[{'actorId': 'clfagslum28u607zn20th1y45', 'act..."
1,clfbkxgaq0bes07012isv5rq9,clfbkqg468bwf07zc7n31dl3j,https://storage.labelbox.com/clfagslqv28u407zn...,mob75@pitt.edu,annotation-text1,2023-03-16T20:59:51.000Z,2023-03-16T20:59:51.000Z,926.531,0.0,926.531,...,text2,[],https://editor.labelbox.com?project=clfbk7q941...,0,False,"[{'featureId': 'clfbkxma70002396jwrrwp26n', 's...",[],[],Done,"[{'actorId': 'clfagslum28u607zn20th1y45', 'act..."


In [45]:
# Concat two dict-entries
entries = entries + entries2
entries[0]

{'featureId': 'clfbk9dbf0002396jfl1k6gpk',
 'schemaId': 'clfbcpdz013ua07zn9k319jn3',
 'color': '#1CE6FF',
 'title': 'uttr',
 'value': 'uttr',
 'version': 1,
 'format': 'text.location',
 'data': {'location': {'start': 151, 'end': 211}},
 'classifications': [{'featureId': 'clfbk9hrj0004396jaxr9fvb7',
   'schemaId': 'clfbcpdz013ub07zn8cw44wpj',
   'title': 'lang',
   'value': 'lang',
   'position': 0,
   'answer': {'featureId': 'clfbk9hrj0003396jlk95obai',
    'schemaId': 'clfbcpdz013uc07zn8vjl7q45',
    'title': 'kz',
    'value': 'kz',
    'position': 0}}]}

In [46]:
type(entries)

list

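- The **recursive_parser function** below flattens nested dictionaries into one column per leaf key (e.g. data_location_start); list values such as classifications are kept as they are.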

In [47]:
# parsed_data = defaultdict(list)

def recursive_parser(entry: dict, data_dict: dict, col_name: str = "") -> None:
    """Recursive parser for a list of nested JSON objects
    
    Args:
        entry (dict): A dictionary representing a single entry (row) of the final data frame.
        data_dict (dict): Accumulator holding the current parsed data.
        col_name (str): Accumulator holding the current column name. Defaults to empty string.
    """
    for key, val in entry.items():
        extended_col_name = f"{col_name}_{key}" if col_name else key
        if isinstance(val, dict):
            recursive_parser(entry[key], data_dict, extended_col_name)
        else:
            data_dict[extended_col_name].append(val)

parsed_data = defaultdict(list)

for entry in entries:
    recursive_parser(entry, parsed_data, "")

df = pd.DataFrame(parsed_data)

In [48]:
df

Unnamed: 0,featureId,schemaId,color,title,value,version,format,data_location_start,data_location_end,classifications
0,clfbk9dbf0002396jfl1k6gpk,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,151,211,"[{'featureId': 'clfbk9hrj0004396jaxr9fvb7', 's..."
1,clfbk9q660007396ji9fkn5ya,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,228,296,"[{'featureId': 'clfbk9sie0009396jvshpmff7', 's..."
2,clfbkajp7000h396jticb6tb7,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,313,387,"[{'featureId': 'clfbkal14000j396jo0ry3eb8', 's..."
3,clfbkaq3k000m396jjwbnfb8n,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,398,413,"[{'featureId': 'clfbkarhn000o396j0lao0sgk', 's..."
4,clfbkb19x000r396jmixhcf23,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,429,458,"[{'featureId': 'clfbkb2up000t396jvhwtdk5a', 's..."
...,...,...,...,...,...,...,...,...,...,...
106,clfbla0ed005m396jhwxzv2ui,clfbcpdz113ui07znc6r6b37s,#FF34FF,cs,cs,1,text.location,4529,4541,"[{'featureId': 'clfbla2xm005o396jt8bsgdgv', 's..."
107,clfblbjlu006g396jgex1r7gu,clfbcpdz113ui07znc6r6b37s,#FF34FF,cs,cs,1,text.location,4976,4981,"[{'featureId': 'clfblblsu006i396j44s1550t', 's..."
108,clfblbx1q006l396jc40wocvd,clfbcpdz113ui07znc6r6b37s,#FF34FF,cs,cs,1,text.location,5033,5049,"[{'featureId': 'clfblc1js006n396jb8lr8kea', 's..."
109,clfble0yu007c396ja0be4j2m,clfbcpdz113ui07znc6r6b37s,#FF34FF,cs,cs,1,text.location,5620,5624,"[{'featureId': 'clfble333007e396juhak2ekj', 's..."
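
The classifications column still holds lists of dictionaries at this point. One possible way to flatten it, sketched here on a hypothetical two-row frame (`flat` is invented and only mimics the columns above), is to explode the lists into rows and normalize the resulting dictionaries into prefixed columns. The sketch assumes every row has at least one classification; empty lists would need to be handled separately.

```python
import pandas as pd

# Hypothetical frame with the same column layout as the parsed output.
flat = pd.DataFrame(
    {
        "title": ["uttr", "cs"],
        "data_location_start": [151, 4529],
        "data_location_end": [211, 4541],
        "classifications": [
            [{"title": "lang", "value": "lang", "answer": {"title": "kz", "value": "kz"}}],
            [{"title": "intra-word", "value": "intra_word", "answer": {"title": "n", "value": "n"}}],
        ],
    }
)

# One row per classification, with a fresh 0..n-1 index.
exploded = flat.explode("classifications", ignore_index=True)

# Turn each classification dict into columns such as clf_value and clf_answer_value.
subtags = pd.json_normalize(exploded["classifications"].tolist(), sep="_").add_prefix("clf_")

wide = pd.concat([exploded.drop(columns="classifications"), subtags], axis=1)
print(wide[["title", "clf_value", "clf_answer_value"]])
```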


In [49]:
for i in df['classifications'][0]:
    print(i)

{'featureId': 'clfbk9hrj0004396jaxr9fvb7', 'schemaId': 'clfbcpdz013ub07zn8cw44wpj', 'title': 'lang', 'value': 'lang', 'position': 0, 'answer': {'featureId': 'clfbk9hrj0003396jlk95obai', 'schemaId': 'clfbcpdz013uc07zn8vjl7q45', 'title': 'kz', 'value': 'kz', 'position': 0}}


In [50]:
for i in df['classifications'][110]:
    print(i)

{'featureId': 'clfbleuco007q396j2maybd1e', 'schemaId': 'clfbcpdz113un07zn0qvugfbt', 'title': 'intra-sent', 'value': 'intra_sent', 'position': 1, 'answer': {'featureId': 'clfbleuco007p396jhonhjzf6', 'schemaId': 'clfbcpdz113uo07znh96wdg4o', 'title': 'disc', 'value': 'disc', 'position': 0}}


In [51]:
dic = df['classifications'].to_dict()
dic[0]

[{'featureId': 'clfbk9hrj0004396jaxr9fvb7',
  'schemaId': 'clfbcpdz013ub07zn8cw44wpj',
  'title': 'lang',
  'value': 'lang',
  'position': 0,
  'answer': {'featureId': 'clfbk9hrj0003396jlk95obai',
   'schemaId': 'clfbcpdz013uc07zn8vjl7q45',
   'title': 'kz',
   'value': 'kz',
   'position': 0}}]

In [52]:
type(dic)

dict

In [53]:
flattened_clf = flatten(dic)

In [54]:
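# NOTE: this local function has the same name as the flatten_json package,
# so the package is not imported here to avoid shadowing it.
def flatten_json(nested_json, exclude=None, sep: str = '_') -> dict:
    """
    Flatten a nested structure of dicts and lists into a single-level dict.
    """
    exclude = [''] if exclude is None else exclude
    out = dict()

    def flatten(x, name: str = ''):
        if type(x) is dict:
            for a in x:
                if a not in exclude:
                    flatten(x[a], f'{name}{a}{sep}')
        elif type(x) is list:
            for i, a in enumerate(x):
                flatten(a, f'{name}{i}{sep}')
        else:
            # Leaf value: store it under the key without its last character (the separator).
            out[name[:-1]] = x

    flatten(nested_json)
    return out

# label_df is a DataFrame (neither a dict nor a list), so it is treated as a
# single leaf and stored whole under the '' key, as the output below shows.
flatten_labeldf = flatten_json(label_df)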

In [55]:
type(flatten_labeldf)

dict

In [56]:
flatten_labeldf.items()

dict_items([('',                     featureId                   schemaId    color title value  \
0   clfbk9dbf0002396jfl1k6gpk  clfbcpdz013ua07zn9k319jn3  #1CE6FF  uttr  uttr   
1   clfbk9q660007396ji9fkn5ya  clfbcpdz013ua07zn9k319jn3  #1CE6FF  uttr  uttr   
2   clfbkajp7000h396jticb6tb7  clfbcpdz013ua07zn9k319jn3  #1CE6FF  uttr  uttr   
3   clfbkaq3k000m396jjwbnfb8n  clfbcpdz013ua07zn9k319jn3  #1CE6FF  uttr  uttr   
4   clfbkb19x000r396jmixhcf23  clfbcpdz013ua07zn9k319jn3  #1CE6FF  uttr  uttr   
..                        ...                        ...      ...   ...   ...   
57  clfbla0ed005m396jhwxzv2ui  clfbcpdz113ui07znc6r6b37s  #FF34FF    cs    cs   
58  clfblbjlu006g396jgex1r7gu  clfbcpdz113ui07znc6r6b37s  #FF34FF    cs    cs   
59  clfblbx1q006l396jc40wocvd  clfbcpdz113ui07znc6r6b37s  #FF34FF    cs    cs   
60  clfble0yu007c396ja0be4j2m  clfbcpdz113ui07znc6r6b37s  #FF34FF    cs    cs   
61  clfblepxv007o396jjnd6u8d0  clfbcpdz113ui07znc6r6b37s  #FF34FF    cs    cs   

    versio

In [57]:
# entry_dict = flatten_labeldf['classifications'][0]['answer']
# entry_dict
# flatten_labeldf.keys()

#### Setting global tags

In [58]:
# reading JSON file
df_pilot5 = pd.read_json('/Users/aidyn/Downloads/export-2023-03-18T18_44_16.835Z.json')

# displaying sample output
df_pilot5.head()

Unnamed: 0,ID,DataRow ID,Labeled Data,Label,Created By,Project Name,Created At,Updated At,Seconds to Label,Seconds to Review,...,Agreement,Is Benchmark,Benchmark Agreement,Benchmark ID,Dataset Name,Reviews,View Label,Has Open Issues,Skipped,DataRow Workflow Info
0,clfeba39t08r707y49iyjb6rd,clfbkqg468bwf07zc7n31dl3j,https://storage.labelbox.com/clfagslqv28u407zn...,{'objects': [{'featureId': 'clfebao350002396jb...,mob75@pitt.edu,annotation-pilot-5,2023-03-18T18:43:26.000Z,2023-03-18T18:43:26.000Z,333.261,0,...,-1,0,-1,,text2,[],https://editor.labelbox.com?project=clfeazj2x0...,0,False,{'taskId': 'db06a4f6-ad4d-4ff0-b90e-4d0c0a701b...


In [59]:
labels = df_pilot5['Label'][0]['objects']
labels

[{'featureId': 'clfebao350002396jb23w7a83',
  'schemaId': 'clfeayddt0q3e07346y4fcus3',
  'color': '#1CE6FF',
  'title': 'uttr',
  'value': 'uttr',
  'version': 1,
  'format': 'text.location',
  'data': {'location': {'start': 26, 'end': 42}}},
 {'featureId': 'clfebdyk2000d396jx2mihrxi',
  'schemaId': 'clfeayddt0q3e07346y4fcus3',
  'color': '#1CE6FF',
  'title': 'uttr',
  'value': 'uttr',
  'version': 1,
  'format': 'text.location',
  'data': {'location': {'start': 106, 'end': 124}}},
 {'featureId': 'clfebeb8m000g396j2vg1hddw',
  'schemaId': 'clfeayddt0q3e07346y4fcus3',
  'color': '#1CE6FF',
  'title': 'uttr',
  'value': 'uttr',
  'version': 1,
  'format': 'text.location',
  'data': {'location': {'start': 168, 'end': 259}}},
 {'featureId': 'clfebfl84000n396jcbjyjln3',
  'schemaId': 'clfeayddt0q3e07346y4fcus3',
  'color': '#1CE6FF',
  'title': 'uttr',
  'value': 'uttr',
  'version': 1,
  'format': 'text.location',
  'data': {'location': {'start': 423, 'end': 437}}},
 {'featureId': 'clfebg

In [60]:
pilot_df = pd.DataFrame(labels)
pilot_df

Unnamed: 0,featureId,schemaId,color,title,value,version,format,data
0,clfebao350002396jb23w7a83,clfeayddt0q3e07346y4fcus3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 26, 'end': 42}}"
1,clfebdyk2000d396jx2mihrxi,clfeayddt0q3e07346y4fcus3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 106, 'end': 124}}"
2,clfebeb8m000g396j2vg1hddw,clfeayddt0q3e07346y4fcus3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 168, 'end': 259}}"
3,clfebfl84000n396jcbjyjln3,clfeayddt0q3e07346y4fcus3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 423, 'end': 437}}"
4,clfebg02r000r396jdkxwm5e5,clfeayddt0q3e07346y4fcus3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 561, 'end': 585}}"
5,clfebga26000s396j3lrycm8n,clfeayddt0q3e07346y4fcus3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 679, 'end': 705}}"
6,clfebghe7000t396ji7mharwj,clfeayddt0q3e07346y4fcus3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 793, 'end': 821}}"
7,clfebgoo2000u396jjdtn9h4c,clfeayddt0q3e07346y4fcus3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 874, 'end': 902}}"
8,clfebgv7m000v396jjnr7q3am,clfeayddt0q3e07346y4fcus3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 934, 'end': 1008}}"


- Note: using global tags overrides the code-switching sub-classifications; the exported objects above have no 'classifications' key.

- Summary: We explored the annotated test sample, stored as a JSON file, using the different methods shown above. Most importantly, we now know where the main tags and subtags are located, and we can build an extended dataframe based on these observations. Each object in the JSON file represents one annotated text file and contains multilevel dictionaries. The target is the **Label** column, which contains all the annotated data.

## Test sample 2

In [61]:
# reading JSON file
df = pd.read_json('/Users/aidyn/Documents/Data_Science/Kazakh-Russian-Code-Switching-Analysis/annotated-data-samples/export-2023-03-19T20_19_44.671Z.json')

# displaying sample output
df.head()


Unnamed: 0,ID,DataRow ID,Labeled Data,Label,Created By,Project Name,Created At,Updated At,Seconds to Label,Seconds to Review,...,Agreement,Is Benchmark,Benchmark Agreement,Benchmark ID,Dataset Name,Reviews,View Label,Has Open Issues,Skipped,DataRow Workflow Info
0,clfbk8guf04g5071jbrqo26bq,clfbcqidc3xmk07zvgxw373vk,https://storage.labelbox.com/clfagslqv28u407zn...,{'objects': [{'featureId': 'clfbk9dbf0002396jf...,mob75@pitt.edu,cs-annotation-project,2023-03-16T20:36:16.000Z,2023-03-16T20:36:16.000Z,749.159,72.967,...,-1,0,-1,,cs-annotation1,[],https://editor.labelbox.com?project=clfbk7q941...,0,False,"{'taskName': 'Done', 'Workflow History': [{'ac..."
1,clfbkxgaq0bes07012isv5rq9,clfbkqg468bwf07zc7n31dl3j,https://storage.labelbox.com/clfagslqv28u407zn...,{'objects': [{'featureId': 'clfbkxma70002396jw...,mob75@pitt.edu,cs-annotation-project,2023-03-16T20:59:51.000Z,2023-03-16T20:59:51.000Z,972.239,0.0,...,-1,0,-1,,text2,[],https://editor.labelbox.com?project=clfbk7q941...,0,False,"{'taskName': 'Done', 'Workflow History': [{'ac..."
2,clfef6xx80t7207yzh33ohlgx,clfeb1kdn1yhb07895fdzagon,https://storage.labelbox.com/clfagslqv28u407zn...,{'objects': [{'featureId': 'clfef75jw0002396jb...,mob75@pitt.edu,cs-annotation-project,2023-03-18T20:54:54.000Z,2023-03-18T20:54:54.000Z,1664.856,0.0,...,-1,0,-1,,text5,[],https://editor.labelbox.com?project=clfbk7q941...,0,False,{'taskId': '84a4bf5c-a482-464a-8623-da3757c0f5...
3,clfeh5eam0l8g07zr7djaedpq,clfeb7o4k17e7079p5eo5hd9f,https://storage.labelbox.com/clfagslqv28u407zn...,{'objects': [{'featureId': 'clfeh5naw0002396jf...,mob75@pitt.edu,cs-annotation-project,2023-03-18T21:33:53.000Z,2023-03-19T14:32:23.000Z,1196.039,0.0,...,-1,0,-1,,text6,[],https://editor.labelbox.com?project=clfbk7q941...,0,False,"{'taskName': 'Done', 'Workflow History': [{'ac..."
4,clffjz71f17oh07yxdfyo65zo,clffjq06201n907a87jni6dom,https://storage.labelbox.com/clfagslqv28u407zn...,{'objects': [{'featureId': 'clffk6vxv0002396jh...,mob75@pitt.edu,cs-annotation-project,2023-03-19T15:48:36.000Z,2023-03-19T15:48:36.000Z,840.414,0.0,...,-1,0,-1,,text11,[],https://editor.labelbox.com?project=clfbk7q941...,0,False,"{'taskName': 'Done', 'Workflow History': [{'ac..."


In [62]:
for i in df['Labeled Data'][:3]:
    print(i)

https://storage.labelbox.com/clfagslqv28u407zn55sthytz%2F4669a93f-90a9-e7bc-869e-6674927e3870-BABEL_OP2_302_96842_20140131_154710_inLine.txt?Expires=1680466786135&KeyName=labelbox-assets-key-3&Signature=u3gipVbP58L93HUfcgCFcd3EtqU
https://storage.labelbox.com/clfagslqv28u407zn55sthytz%2Fb1f1e43f-d07a-f08e-8ba1-317c48254edb-BABEL_OP2_302_95583_20131112_203137_outLine.txt?Expires=1680466786139&KeyName=labelbox-assets-key-3&Signature=kU7HEEcR6jNPul5lv48-Fg4Tl6k
https://storage.labelbox.com/clfagslqv28u407zn55sthytz%2F140ef9d3-6f66-97aa-86fe-7c1064aefd68-BABEL_OP2_302_93475_20131115_203137_outLine.txt?Expires=1680466786140&KeyName=labelbox-assets-key-3&Signature=k-eN8XWROzIZ7VMxtXvXunIwAYU


- Cyrillic characters are not recognized!

In [63]:
for i in df['View Label'][:3]:
    print(i)

https://editor.labelbox.com?project=clfbk7q941djz07ymacjs9pqs&label=clfbk8guf04g5071jbrqo26bq
https://editor.labelbox.com?project=clfbk7q941djz07ymacjs9pqs&label=clfbkxgaq0bes07012isv5rq9
https://editor.labelbox.com?project=clfbk7q941djz07ymacjs9pqs&label=clfef6xx80t7207yzh33ohlgx


- I can view each annotated file and show examples.

In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     14 non-null     object 
 1   DataRow ID             14 non-null     object 
 2   Labeled Data           14 non-null     object 
 3   Label                  14 non-null     object 
 4   Created By             14 non-null     object 
 5   Project Name           14 non-null     object 
 6   Created At             14 non-null     object 
 7   Updated At             14 non-null     object 
 8   Seconds to Label       14 non-null     float64
 9   Seconds to Review      14 non-null     float64
 10  Seconds to Create      14 non-null     float64
 11  External ID            14 non-null     object 
 12  Global Key             0 non-null      float64
 13  Agreement              14 non-null     int64  
 14  Is Benchmark           14 non-null     int64  
 15  Benchmar

In [65]:
# Extract and rename only necessary columns 
df_test2_old = df[['Label', 'Seconds to Label', 'External ID','View Label']]
df_test2 = df_test2_old[['External ID','View Label', 'Seconds to Label', 'Label']].rename(columns={'View Label':'View Tag', 'Seconds to Label':'Seconds to Tag','Label':'Tags'}) 
df_test2.sample(5)

Unnamed: 0,External ID,View Tag,Seconds to Tag,Tags
10,BABEL_OP2_302_86557_20131121_000022_inLine.txt,https://editor.labelbox.com?project=clfbk7q941...,965.589,{'objects': [{'featureId': 'clffrealm0002396jg...
1,BABEL_OP2_302_95583_20131112_203137_outLine.txt,https://editor.labelbox.com?project=clfbk7q941...,972.239,{'objects': [{'featureId': 'clfbkxma70002396jw...
8,BABEL_OP2_302_90080_20140120_230635_inLine.txt,https://editor.labelbox.com?project=clfbk7q941...,698.394,{'objects': [{'featureId': 'clffqb8la0002396jl...
11,BABEL_OP2_302_86557_20131121_000022_outLine.txt,https://editor.labelbox.com?project=clfbk7q941...,1615.877,{'objects': [{'featureId': 'clffrwdo900ai396jf...
12,BABEL_OP2_302_87889_20140119_163150_inLine.txt,https://editor.labelbox.com?project=clfbk7q941...,1091.038,{'objects': [{'featureId': 'clfftdij700r5396j6...


In [66]:
df_test2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   External ID     14 non-null     object 
 1   View Tag        14 non-null     object 
 2   Seconds to Tag  14 non-null     float64
 3   Tags            14 non-null     object 
dtypes: float64(1), object(3)
memory usage: 576.0+ bytes


In [67]:
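# NOTE: this maps over the objects of the first file only, so every value is the
# number of keys in an annotation object (9), not the number of annotations per file.
df_test2['Tagged Rows'] = pd.Series(df_test2['Tags'][0]['objects']).map(len)
df_test2.sample(5)
# len(df_test2['Tags'][0]['objects'])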

Unnamed: 0,External ID,View Tag,Seconds to Tag,Tags,Tagged Rows
8,BABEL_OP2_302_90080_20140120_230635_inLine.txt,https://editor.labelbox.com?project=clfbk7q941...,698.394,{'objects': [{'featureId': 'clffqb8la0002396jl...,9
3,BABEL_OP2_302_93320_20140218_173001_outLine.txt,https://editor.labelbox.com?project=clfbk7q941...,1196.039,{'objects': [{'featureId': 'clfeh5naw0002396jf...,9
11,BABEL_OP2_302_86557_20131121_000022_outLine.txt,https://editor.labelbox.com?project=clfbk7q941...,1615.877,{'objects': [{'featureId': 'clffrwdo900ai396jf...,9
4,BABEL_OP2_302_92509_20131114_030809_outLine.txt,https://editor.labelbox.com?project=clfbk7q941...,840.414,{'objects': [{'featureId': 'clffk6vxv0002396jh...,9
9,BABEL_OP2_302_90080_20140120_230635_outLine.txt,https://editor.labelbox.com?project=clfbk7q941...,535.854,{'objects': [{'featureId': 'clffqq2zj00bd396jx...,9


In [68]:
df_test2['Tagged Rows'][0]

9
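
- To count annotations per file rather than keys per object, one option is to map over each row's own `Tags` dict. The sketch below is a minimal illustration, assuming every `Tags` value is a dict with an `objects` list as in the export; the two-row `demo` frame is made-up stand-in data, not the real `df_test2`.

```python
import pandas as pd

# Hypothetical stand-in for df_test2: two files with 2 and 1 annotation objects.
demo = pd.DataFrame({
    "External ID": ["file_a.txt", "file_b.txt"],
    "Tags": [
        {"objects": [{"title": "uttr"}, {"title": "cs"}],
         "classifications": [], "relationships": []},
        {"objects": [{"title": "uttr"}],
         "classifications": [], "relationships": []},
    ],
})

# Count the annotation objects stored in each row's own Tags dict.
demo["Tagged Rows"] = demo["Tags"].map(lambda tags: len(tags["objects"]))
print(demo[["External ID", "Tagged Rows"]])
```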

In [69]:
type(pd.Series(df_test2['Tags'][0]['objects']))

pandas.core.series.Series

In [70]:
df_test2['Tags'][0]['objects'][0].items()

dict_items([('featureId', 'clfbk9dbf0002396jfl1k6gpk'), ('schemaId', 'clfbcpdz013ua07zn9k319jn3'), ('color', '#1CE6FF'), ('title', 'uttr'), ('value', 'uttr'), ('version', 1), ('format', 'text.location'), ('data', {'location': {'start': 151, 'end': 211}}), ('classifications', [{'featureId': 'clfbk9hrj0004396jaxr9fvb7', 'schemaId': 'clfbcpdz013ub07zn8cw44wpj', 'title': 'lang', 'value': 'lang', 'position': 0, 'answer': {'featureId': 'clfbk9hrj0003396jlk95obai', 'schemaId': 'clfbcpdz013uc07zn8vjl7q45', 'title': 'kz', 'value': 'kz', 'position': 0}}])])
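
- Each annotation object nests its span under `data.location` and its label under `classifications[0].answer`. The following is a minimal sketch of flattening one object into a flat record; `flatten_object` is a hypothetical helper, and the sample dict is trimmed from the output above to the fields used here.

```python
# Sample object trimmed from the dict_items output above.
obj = {
    "featureId": "clfbk9dbf0002396jfl1k6gpk",
    "title": "uttr",
    "data": {"location": {"start": 151, "end": 211}},
    "classifications": [
        {"title": "lang", "value": "lang",
         "answer": {"title": "kz", "value": "kz"}}
    ],
}


def flatten_object(obj):
    """Return one flat record per annotation object (hypothetical helper)."""
    loc = obj["data"]["location"]
    # classifications may be missing or empty, so fall back to None
    cls = (obj.get("classifications") or [None])[0]
    return {
        "tag": obj["title"],
        "start": loc["start"],
        "end": loc["end"],
        "class": cls["title"] if cls else None,
        "answer": cls["answer"]["title"] if cls else None,
    }


record = flatten_object(obj)
print(record)
```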

In [71]:
df_test2['Tags'][0]['objects'][0]['featureId']

'clfbk9dbf0002396jfl1k6gpk'

In [72]:
len(df_test2['Tags'][0])

3

### Parse Tags column

In [73]:
for i in df_test2['Tags'][0]:
    print(i)

objects
classifications
relationships


In [74]:
# Collect the annotation objects of all annotated files into one flat list
tags_list = [obj for tags in df_test2['Tags'] for obj in tags['objects']]

tags_df = pd.DataFrame(tags_list)
tags_df.sample(5)

Unnamed: 0,featureId,schemaId,color,title,value,version,format,data,classifications
994,clfftyk8s0158396jeflao6ha,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 2091, 'end': 2115}}","[{'featureId': 'clfftylhh015a396jqub9hiwu', 's..."
850,clffsy5g600of396j09l45kip,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 5830, 'end': 5836}}","[{'featureId': 'clffsy6pa00oh396j9fh7go3k', 's..."
770,clffrppn1007l396j0m3e6o5e,clfbcpdz113ui07znc6r6b37s,#FF34FF,cs,cs,1,text.location,"{'location': {'start': 4483, 'end': 4495}}","[{'featureId': 'clffrpsvb007n396jfhe37oid', 's..."
52,clfbkyw7m000m396j1cx7oec8,clfbcpdz013ua07zn9k319jn3,#1CE6FF,uttr,uttr,1,text.location,"{'location': {'start': 423, 'end': 437}}","[{'featureId': 'clfbkyyuk000o396jfdxhwxl1', 's..."
197,clfefqimx007v396jr5n3f630,clfbcpdz113ui07znc6r6b37s,#FF34FF,cs,cs,1,text.location,"{'location': {'start': 4641, 'end': 4643}}","[{'featureId': 'clfefqlna007x396j29hphnxy', 's..."


In [75]:
len(tags_df)

1052

In [76]:
tags_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1052 entries, 0 to 1051
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   featureId        1052 non-null   object
 1   schemaId         1052 non-null   object
 2   color            1052 non-null   object
 3   title            1052 non-null   object
 4   value            1052 non-null   object
 5   version          1052 non-null   int64 
 6   format           1052 non-null   object
 7   data             1052 non-null   object
 8   classifications  1051 non-null   object
dtypes: int64(1), object(8)
memory usage: 74.1+ KB


In [77]:
tag_dic = df_test2['Tags'].to_dict()
len(tag_dic[0])

3

In [78]:
len(tag_dic)

14

In [79]:
# tag_dic['Label']['objects']['classifications']

In [80]:
tags_df['title'].value_counts()

uttr    905
cs      147
Name: title, dtype: int64

In [81]:
answer_list = dict_get(tags_df['classifications'],'answer')

In [82]:
answer_df = pd.DataFrame(answer_list)
answer_df 

Unnamed: 0,featureId,schemaId,title,value,position
0,clfbk9hrj0003396jlk95obai,clfbcpdz013uc07zn8vjl7q45,kz,kz,0
1,clfbk9sie0008396jckkidvdl,clfbcpdz013uc07zn8vjl7q45,kz,kz,0
2,clfbkal14000i396jlhysk84v,clfbcpdz013uc07zn8vjl7q45,kz,kz,0
3,clfbkarhn000n396j20edd2if,clfbcpdz013uc07zn8vjl7q45,kz,kz,0
4,clfbkb2up000s396j4qiaybx0,clfbcpdz013uc07zn8vjl7q45,kz,kz,0
...,...,...,...,...,...
1046,clffu9eh401cw396jmex3ex42,clfbcpdz013uc07zn8vjl7q45,kz,kz,0
1047,clffu9lz101d1396j5y4ygjft,clfbcpdz013uc07zn8vjl7q45,kz,kz,0
1048,clffu9uu201d6396j4f8zy178,clfbcpdz013uc07zn8vjl7q45,kz,kz,0
1049,clffu744c01bn396jtuxtxlqp,clfbcpdz113v807zn2q2y9ck4,morph,morph,5


In [83]:
answer_df['title'].value_counts()

kz        896
morph      43
adv        35
n          19
interj     13
disc       11
rs          9
phr         8
vp          5
adj         5
conj        4
pn          2
uttr        1
Name: title, dtype: int64

In [84]:
title_list = dict_get(tags_df['classifications'],'title')

In [85]:
len(title_list)

1051

In [86]:
title_df = pd.DataFrame(title_list)
title_df.value_counts()

lang          905
intra-word    121
intra-sent     24
inter-sent      1
dtype: int64

In [87]:
# title_df

In [88]:
value_list = dict_get(tags_df['classifications'],'value')
value_df = pd.DataFrame(value_list)
value_df.value_counts()

lang          905
intra_word    121
intra_sent     24
inter_sent      1
dtype: int64

### Retrieving Annotated Text

In [89]:
import labelbox as lb

In [90]:
import requests

In [91]:
from labelbox import Client

In [92]:
lb.__version__


'3.41.0'

In [93]:
# if __name__ == '__main__':
   #  API_KEY = "<YOUR_LABELBOX_API_KEY>"
   #  client = lb.Client(API_KEY)

In [94]:
import os

# Read the key from the environment instead of hardcoding it
# (the variable name LABELBOX_API_KEY is an assumption).
API_KEY = os.environ["LABELBOX_API_KEY"]
PROJECT_ID = "clfbk7q941djz07ymacjs9pqs"

In [172]:
# Connection to the Labelbox platform
client = Client(API_KEY)

# Connection to the project
project = client.get_project(PROJECT_ID)

# Export all the labels of the project
#labels = project.export_labels()
labels = project.label_generator()

In [126]:
labels

<labelbox.data.annotation_types.collection.LabelGenerator at 0x7f836dd8b430>

In [107]:
# labels = project.export_labels(download=True)
# labels
#labels = list(labels)
 #label = next(labels)
# label.annotations

In [179]:
import os
import urllib.request

# Read the key from the environment instead of hardcoding it
# (the variable name LABELBOX_API_KEY is an assumption).
API_KEY = os.environ["LABELBOX_API_KEY"]
PROJECT_ID = "clfbk7q941djz07ymacjs9pqs"

# Connection to the Labelbox platform
client = lb.Client(API_KEY)

# Connection to the project
project = client.get_project(PROJECT_ID)

# Export all the labels of the project
labels = project.label_generator()

for label in labels:
    # Retrieve the raw text of the annotated file
    with urllib.request.urlopen(label.data.url) as response:
        text = response.read().decode("utf-8")

    # For each code-switching annotation, print the text fragment
    # (the end offset is inclusive, hence end + 1)
    for annotation in label.annotations:
        if annotation.name == 'cs':
            start, end = annotation.value.start, annotation.value.end
            print(f"type: {annotation.name} - start: {start} - end: {end} - text: {text[start:end + 1]}")

type: cs - start: 231 - end: 235 - text: общим
type: cs - start: 429 - end: 439 - text: обязательно
type: cs - start: 619 - end: 628 - text: отказ етпе
type: cs - start: 3946 - end: 3956 - text: давай давай
type: cs - start: 4566 - end: 4574 - text: машинамен
type: cs - start: 207 - end: 210 - text: сеть
type: cs - start: 981 - end: 988 - text: примерно
type: cs - start: 1428 - end: 1430 - text: это
type: cs - start: 1875 - end: 1886 - text: викториналар
type: cs - start: 3909 - end: 3914 - text: столда
type: cs - start: 4280 - end: 4287 - text: дублёнка
type: cs - start: 4529 - end: 4541 - text: кредит кредит
type: cs - start: 4976 - end: 4981 - text: просто
type: cs - start: 5033 - end: 5049 - text: декретный отпуск 
type: cs - start: 5620 - end: 5624 - text: вроде
type: cs - start: 5643 - end: 5646 - text: так 
type: cs - start: 28 - end: 31 - text: алло
type: cs - start: 61 - end: 64 - text: алло
type: cs - start: 162 - end: 170 - text: нормально
type: cs - start: 348 - end: 353 - 
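
- The loop above slices up to `end + 1` because the exported `end` offset is inclusive, and some fragments carry a trailing space. Below is a minimal sketch of the same slicing on a local string, with whitespace stripped; `extract_span` is a hypothetical helper and the sentence is made up from words in the output above.

```python
def extract_span(text, start, end_inclusive):
    """Return the annotated fragment, treating the end offset as inclusive."""
    return text[start:end_inclusive + 1].strip()


# Made-up sentence; offsets are derived with str.index rather than taken from the export.
text = "иә обязательно келемін"
start = text.index("обязательно")
end_inclusive = start + len("обязательно") - 1

print(extract_span(text, start, end_inclusive))
```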

**OBSERVATION:**

- The 14 annotated text files contain 905 utterances in total.
- 147 code-switching instances occurred within those utterances.
- So far, intra-word code-switching (121) is prevalent compared to the intra-sentential (24) and inter-sentential (1) types.
- The most common linguistic units within intra-word code-switching are morphemes (43, Russian stems followed by Kazakh suffixes), adverbs (35), and nouns (19). These fragments have Kazakh equivalents, and speakers sometimes use both versions in one sentence, for example:
    - иә      обязательно  міндетті түрде иә      келемін келемін.
    - Yes.Kz  for sure.Rs  for sure.Kz    yes.Kz  will come.Kz
    - Yes, I will come for sure.
- The most common feature of intra-sentential code-switching is the use of discourse markers, as in the example below:
    - а күтсеңдерші [в общем]
    -   wait.Kz     generally.Rs
    - Wait, generally. 
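
- The relative share of each cs type can be derived from the counts reported by `title_df.value_counts()`. The three types sum to 146 rather than 147, which is consistent with the single null `classifications` entry shown by `tags_df.info()`. A minimal sketch, with the counts typed in by hand:

```python
import pandas as pd

# Counts copied from the title_df.value_counts() output (the 'lang' rows are excluded).
cs_counts = pd.Series({"intra-word": 121, "intra-sent": 24, "inter-sent": 1})

# Percentage of classified cs instances per type, rounded to one decimal place.
shares = (cs_counts / cs_counts.sum() * 100).round(1)
print(shares)
```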