# **Lecture 2** : [1/16]

## Working With Different File Formats : JSON STRING 

---

## Defining a JSON String with Example

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Definition:** JSON (JavaScript Object Notation) is a language-independent text format for representing structured data in a way that is both human-readable and machine-readable. Python reads and writes it through the built-in `json` module.

**Data Format:** JSON uses a specific syntax to represent data as a collection of key-value pairs.

**Key-Value Pairs:** Each pair consists of a key (a string) and a value.

**Data Types:** JSON supports several data types:

* **Strings:** Enclosed in double quotes (e.g., "hello", "world")
* **Numbers:** Integers (e.g., 10, -5) or floating-point numbers (e.g., 3.14)
* **Booleans:** `true` or `false`
* **Arrays:** Ordered lists of values enclosed in square brackets (e.g., [1, 2, 3])
* **Objects:** Unordered collections of key-value pairs enclosed in curly braces (e.g., {"name": "John", "age": 30})
* **Null:** Represents the absence of a value (e.g., null)
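
As a rough illustration of how these types look once they are parsed in Python, the sketch below runs the standard-library `json.loads()` on a small made-up string. The keys (`"s"`, `"i"`, and so on) are placeholders chosen for this example and are not part of the lecture data.

```python
import json

# A made-up JSON string containing one value of each JSON data type.
sample = (
    '{"s": "hello", "i": 10, "f": 3.14, "b": true, '
    '"a": [1, 2, 3], "o": {"name": "John"}, "n": null}'
)

parsed = json.loads(sample)

# Map each key to the name of the Python type that json.loads() produced.
type_names = {key: type(value).__name__ for key, value in parsed.items()}
print(type_names)
```

JSON `true`/`false` become Python `True`/`False`, `null` becomes `None`, arrays become lists, and objects become dictionaries.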



<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**This code below defines a JSON string representing a person's information.**

**Structure**

* The data is organized as a JSON object, enclosed within curly braces `{}`.
* The object contains key-value pairs, where keys are strings (e.g., `"firstName"`, `"lastName"`, `"age"`) and values can be strings, numbers, booleans, arrays, nested objects, or `null`.

**Key-Value Examples:**

* `"firstName": "John"`: The key `"firstName"` is associated with the string value `"John"`.
* `"address"`: This key holds a nested object containing address details like street, city, state, and postal code.
* `"phoneNumbers"`: This key holds an array of objects, each representing a phone number with its type and number.
* `"children"`: This key holds an empty array (`[]`), indicating that the person has no children.
* `"spouse"`: This key has the value `null`, indicating that the person is not married.


In [62]:
json_str = """{ 
  "firstName": "John", 
  "lastName": "Smith", 
  "isAlive": true, 
  "age": 27, 
  "address": { 
    "streetAddress": "21 2nd Street", 
    "city": "New York", 
    "state": "NY", 
    "postalCode": "10021-3100" 
  }, 
  "phoneNumbers": [ 
    { 
      "type": "home", 
      "number": "212 555-1234" 
    }, 
    {
      "type": "office", 
      "number": "646 555-4567" 
    } 
  ], 
  "children": [], 
  "spouse": null 
}
"""

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

* `json_str = ...`: This line defines a multi-line string variable named `json_str`.
* **The string:** This string contains a valid JSON object.
    * It represents a person with information like their name, age, address, phone numbers, marital status, and whether they have children.
    * The JSON object uses key-value pairs to store different aspects of the person's information. For example, `"firstName": "John"` stores the person's first name.

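* `json_str`: Evaluating the variable name as the last line of a notebook cell displays the string's representation (its `repr`), so line breaks appear as `\n` inside a single quoted string.
* `print(json_str)`: Prints the contents of the variable, which is the JSON string, with the line breaks rendered.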
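
The difference between evaluating a string in a notebook cell and passing it to `print()` can be sketched with a much shorter string. `short_json` is a made-up stand-in for `json_str`, and `repr()` is used here to imitate what a notebook shows for a bare variable name.

```python
# A short, made-up multi-line string standing in for json_str.
short_json = '{\n  "firstName": "John"\n}'

# Roughly what a notebook displays when the variable is the last expression
# in a cell: the newline characters are shown as the two characters \ and n.
as_displayed = repr(short_json)
print(as_displayed)

# What print() shows: the newline characters are rendered as line breaks.
print(short_json)
```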

In [64]:
json_str

'{ \n  "firstName": "John", \n  "lastName": "Smith", \n  "isAlive": true, \n  "age": 27, \n  "address": { \n    "streetAddress": "21 2nd Street", \n    "city": "New York", \n    "state": "NY", \n    "postalCode": "10021-3100" \n  }, \n  "phoneNumbers": [ \n    { \n      "type": "home", \n      "number": "212 555-1234" \n    }, \n    {\n      "type": "office", \n      "number": "646 555-4567" \n    } \n  ], \n  "children": [], \n  "spouse": null \n}\n'

In [65]:
print(json_str)

{ 
  "firstName": "John", 
  "lastName": "Smith", 
  "isAlive": true, 
  "age": 27, 
  "address": { 
    "streetAddress": "21 2nd Street", 
    "city": "New York", 
    "state": "NY", 
    "postalCode": "10021-3100" 
  }, 
  "phoneNumbers": [ 
    { 
      "type": "home", 
      "number": "212 555-1234" 
    }, 
    {
      "type": "office", 
      "number": "646 555-4567" 
    } 
  ], 
  "children": [], 
  "spouse": null 
}




## Importing JSON 

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

**4 Main Built-In Functions for Reading or Writing JSON**

* **`json.load`:** Reads JSON from a file object and returns a Python object.
* **`json.dump`:** Converts a Python object to JSON and writes it to a file object.
* **`json.loads`:** Reads JSON from a string and returns a Python object.
* **`json.dumps`:** Converts a Python object to JSON and returns it as a string.
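
A minimal sketch of the four functions side by side follows. The `record` dictionary is made up for this example, and `io.StringIO` stands in for a real file so that the snippet does not touch the disk.

```python
import io
import json

record = {"firstName": "John", "age": 27, "spouse": None}

# dumps / loads work with strings.
text = json.dumps(record)
from_string = json.loads(text)

# dump / load work with file objects; io.StringIO stands in for a real file.
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)  # rewind so that load() reads from the start
from_file = json.load(buffer)

print(text)
```

The trailing `s` in `loads` and `dumps` is a handy reminder that those two work with strings.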



<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">
The code below demonstrates how to work with JSON data in Python. It shows how to:

* Import the necessary library (`json`)
* Load a JSON string into a Python object
* Store the resulting object
* Check the type of the resulting object

**Breakdown of the Code:**

* `import json`: This line imports the `json` library, which provides functions for working with JSON data in Python.
* `json.loads(json_str)`: This line uses the `loads()` function from the `json` library to convert `json_str` (a string containing a valid JSON object) into a Python object.
    * The result is a Python data structure; in this case, a dictionary.
* The output below shows the Python object that results from loading the JSON string. It's a dictionary with keys like `'firstName'`, `'lastName'`, `'age'`, `'address'`, etc., and corresponding values.
* `json_obj = json.loads(json_str)`: This line does the same thing as the previous cell: it loads the JSON string into a Python object, and this time it stores the result in a variable named `json_obj`.
* `type(json_obj)`: This line uses the `type()` function to check the type of the `json_obj` variable.
* The output `dict` indicates that the `json_obj` is a Python dictionary object.

In [68]:
import json

In [None]:
json.loads(json_str)

{'firstName': 'John',
 'lastName': 'Smith',
 'isAlive': True,
 'age': 27,
 'address': {'streetAddress': '21 2nd Street',
  'city': 'New York',
  'state': 'NY',
  'postalCode': '10021-3100'},
 'phoneNumbers': [{'type': 'home', 'number': '212 555-1234'},
  {'type': 'office', 'number': '646 555-4567'}],
 'children': [],
 'spouse': None}

In [70]:
json_obj = json.loads(json_str) 

In [14]:
type(json_obj) 

dict

## Accessing Values by Key within JSON

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

* **Accessing values by key:** The code uses the JSON object (`json_obj`) and accesses values associated with specific keys. For example, `json_obj["lastName"]` retrieves the value associated with the key "lastName".
* **Accessing nested values:** When dealing with nested dictionaries (like the "address" key within the main JSON object), the code uses chaining to access nested values. For example, `json_obj["address"]["city"]` retrieves the value of "city" within the nested "address" dictionary.
* **Using built-in functions:** The code demonstrates using the `len()` function to determine the number of elements in a list (in this case, the number of children).
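
The access patterns above raise a `KeyError` when a key is missing. The sketch below, on a cut-down made-up object, shows the same patterns together with `dict.get()`, which returns a default instead. `"middleName"` and `"postalCode"` are used here as examples of keys that are absent from this particular object.

```python
import json

# A cut-down, made-up object with the same shape as the lecture example.
person = json.loads(
    '{"firstName": "John", "address": {"city": "New York"}, "children": []}'
)

city = person["address"]["city"]        # chained access into the nested dict
child_count = len(person["children"])   # number of items in the list

# Keys that are absent from this object: get() avoids a KeyError.
middle_name = person.get("middleName")  # None when the key is missing
postal_code = person.get("address", {}).get("postalCode", "unknown")

print(city, child_count, middle_name, postal_code)
```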

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">
The cells below retrieve the values from `json_obj` that answer the following questions:

* "What is the person's first name"
* "What is the person's last name"
* "What is the age of the person, John"
* "What is the city"
* "What is the state" 
* "How many children"

In [71]:
json_obj["firstName"]

'John'

In [72]:
json_obj["lastName"]

'Smith'

In [73]:
json_obj["age"]

27

In [74]:
json_obj["address"]["city"]

'New York'

In [75]:
json_obj["address"]["state"]

'NY'

In [76]:
len(json_obj["children"])

0

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

**Extracting Phone Numbers from a JSON Object**

`json_obj["phoneNumbers"]`
* This accesses the "phoneNumbers" key within the `json_obj` dictionary, which returns a list of dictionaries, each representing a phone number with its "type" and "number".

In [31]:
json_obj["phoneNumbers"]

[{'type': 'home', 'number': '212 555-1234'},
 {'type': 'office', 'number': '646 555-4567'}]

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`json_obj["phoneNumbers"][0]["number"]`
* This accesses the "number" of the first phone number in the list; the output displays the first phone number.


In [78]:
json_obj["phoneNumbers"][0]["number"] 

'212 555-1234'

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`json_obj["phoneNumbers"][1]["number"]`
* This accesses the "number" of the second phone number in the list; the output displays the second phone number.

In [80]:
json_obj["phoneNumbers"][1]["number"]

'646 555-4567'

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`[number["number"] for number in json_obj["phoneNumbers"]]`
* This uses a list comprehension to extract the "number" from each phone number dictionary in the list; the output is a list containing all the phone numbers.
* The cell after this one repeats the comprehension and stores the phone numbers in a variable.

In [81]:
[number["number"] for number in json_obj["phoneNumbers"]]

['212 555-1234', '646 555-4567']

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`phone_numbers = [phone["number"] for phone in json_obj["phoneNumbers"]]`
* This line uses a list comprehension to create a list of phone numbers and stores it in the `phone_numbers` variable.

In [82]:
phone_numbers = [phone["number"] for phone in json_obj["phoneNumbers"]]
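
For comparison, the same list can be built with an explicit `for` loop. The sketch below uses a cut-down stand-in for `json_obj` that contains only the `"phoneNumbers"` key, and `phone_numbers_loop` is a name introduced just for this example.

```python
# A cut-down stand-in for json_obj with the same "phoneNumbers" structure.
json_obj = {
    "phoneNumbers": [
        {"type": "home", "number": "212 555-1234"},
        {"type": "office", "number": "646 555-4567"},
    ]
}

# Explicit loop: build the list one element at a time.
phone_numbers_loop = []
for phone in json_obj["phoneNumbers"]:
    phone_numbers_loop.append(phone["number"])

# List comprehension: the same result in a single expression.
phone_numbers = [phone["number"] for phone in json_obj["phoneNumbers"]]

print(phone_numbers_loop == phone_numbers)
```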

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

**Example:**

* **Dealing with real-world data:** In real-world scenarios, you might have multiple JSON objects, often stored in separate files.
* **Handling multiple files:** You would need to process each file individually to extract the desired information.

## Real World Data with Several JSON Objects 

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

* **File format:** Each line contains a single JSON object.
* **Directory separator:** Directories are separated by slashes (/).
* **Filepath example:** `data/public/twitter/sample/data18062209` 
    * This file is located in the `sample` directory, which is under the `twitter` directory, which is under the `public` directory, and so on.
* **Root file system:** The file paths are relative to a single root directory represented by a single slash (/).
* **Purpose:** This is the filepath of the data we are reading. 

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

**Working with Files on the Terminal**

**Filepath**
* `/mnt/data/public/twitter/sample/data-18062209.json.bz2`
    



## Reading a Filepath on the Terminal 

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`less /[filepath]`
* opens the text file and displays its contents one screen at a time
* press `q` to exit
* the file will not show up in the file browser: for security reasons, you cannot navigate above your home directory there



<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`ln -s /[filepath]`
* creates a soft link (symbolic link) to the data so that it shows up in your folder on jojie

* **tab completion:** press `tab` after typing `/mnt` to continue the filepath name; if it completes, the shell was able to match a directory called `/mnt`
* typing `pu` and pressing `tab` completes it to `public`

* pressing `tab` twice at "1806" does not complete the name, which means there are many matches; the shell will show `Display all possibilities (y/n)`
    * if no, it returns to the prompt
    * if yes, it shows the list

* press `enter` and the data file will appear in your folder on jojie as a link (the data is not copied, which saves space)

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`ln -s [/mnt/data/public]`
* creates a link to a directory
* here, it creates a link to `public`

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`head [filename]`
* displays the first 10 lines by default

`head -n 1 [filename]`
* displays only the first line
* because the file is bzip2-compressed (`.bz2`), the output is not readable text

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`bzcat /[filename]`
* `bz` refers to bzip2, a compressed file format
* `cat` displays the contents of a file, so `bzcat` decompresses the file and displays it

`bzcat /[filename] | head -n 1`
* displays the first line of the decompressed file
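
The same idea, decompressing on the fly and stopping after the first line, can be sketched in Python with the standard-library `bz2` module. The file created below is a small made-up stand-in written to a temporary directory, not the lecture's data file.

```python
import bz2
import os
import tempfile

# Create a small bzip2 file to work with; the path and contents are made up.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.json.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write('{"id": 1}\n{"id": 2}\n')

# Rough equivalent of `bzcat sample.json.bz2 | head -n 1`:
# decompress on the fly and stop after the first line.
with bz2.open(path, "rt", encoding="utf-8") as f:
    first_line = f.readline()

print(first_line)
```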

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`rm [filename]`
* command to delete a file

`rm -rf [name of directory]`
* command to delete a directory
* `r` stands for recursive: it goes inside the folder and deletes everything in it
* `f` stands for force: it deletes without asking for confirmation

`rm public`
* deletes `public`

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`ls -l /[filename]`
* shows details of the file, including its size
* the number after the group name (`staff` in this case) is the size of the file in bytes

`ls -lh /[filename]`
* `h` makes the sizes human-readable
* it shows the size in larger units such as megabytes

## Use Python to Read the Filepath

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`import bz2`
* this line imports the `bz2` library which is necessary for working with bzip2 files

`with bz2.open(...) as f:`
* This uses the `bz2.open()` function to open the bzip2 file in read mode; with no mode argument, that is binary read mode (`'rb'`)

`open(...)`
* opens the file specified by the path.

`as f:`
* assigns the opened file object to the variable f for further use.

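`contents = f.read()`
* reads the entire decompressed contents of the bzip2 file and stores it in the variable `contents`

`f.read()`
* When called on a file object `f`, it reads the entire content of the file in one go. Because `bz2.open()` opens files in binary mode by default, the result here is a `bytes` object rather than a `str`.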
                                                  

In [96]:
import bz2 
with bz2.open("/mnt/data/public/twitter/sample/data-18062209.json.bz2") as f:
    contents = f.read()

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

**Read First 100 Characters**

`contents[:100]`
* This line slices the `contents` variable, which is a `bytes` object holding the data read from the bzip2 file.

`[:100]`
* extracts the first 100 bytes and displays them. This is useful for quickly inspecting the beginning of the data.

**Output**
* The output shows the first 100 bytes of `contents`. The `b'...'` prefix marks a `bytes` literal, and the data starts with `{"created_at": ...`, which is the first key-value pair of a JSON object.

In [93]:
contents[:100]

b'{"created_at": "Fri Jun 22 09:32:01 +0000 2018", "id": 1010093038536757248, "id_str": "1010093038536'
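
The `b'...'` prefix appears because the file was opened in binary mode. As a sketch of the alternatives, the snippet below reads a small made-up file once in binary mode and once in text mode (`"rt"`). The temporary path and its contents are placeholders, not the lecture's data file.

```python
import bz2
import os
import tempfile

# A small stand-in file with made-up contents.
path = os.path.join(tempfile.mkdtemp(), "sample.json.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write('{"text": "hello"}')

with bz2.open(path) as f:  # default mode is binary ("rb")
    raw = f.read()

with bz2.open(path, "rt", encoding="utf-8") as f:  # text mode
    text = f.read()

print(type(raw).__name__, type(text).__name__)
print(raw.decode("utf-8") == text)
```

Either form works with `json.loads()`, which accepts both `bytes` and `str`.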

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`import bz2`
* this line imports the `bz2` library which is necessary for working with bzip2 files

`with bz2.open(...) as f:`
* This uses the `bz2.open()` function to open the bzip2 file in read mode 

`open(...)`
* opens the file specified by the path.

`as f:`
* assigns the opened file object to the variable f for further use.

`contents = f.readlines()`
* This line reads the entire contents of the bzip2 file using the `f.readlines()` method
* `readlines()` reads all the lines in the file and returns them as a list; in binary mode, each line is a `bytes` object
* The list of lines is stored in the variable `contents`

In [100]:
import bz2 
with bz2.open("/mnt/data/public/twitter/sample/data-18062209.json.bz2") as f:
    contents = f.readlines()

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

**Read First Line**

`contents[0]`
* This line indexes the `contents` variable, which is now a list of the lines read from the bzip2 file.

`[0]`
* extracts the first line from the list and displays it. This is useful for quickly inspecting the beginning of the data.

In [101]:
contents[0]

b'{"created_at": "Fri Jun 22 09:32:01 +0000 2018", "id": 1010093038536757248, "id_str": "1010093038536757248", "text": "\\u5f97\\u70b9\\u958b\\u793a\\u3057\\u3066\\u3082\\u3089\\u304a\\u3001\\u305d\\u308c\\u3067\\u3082\\u3057\\u30c8\\u30c3\\u30d73\\u5165\\u3063\\u3066\\u305f\\u3089\\u7d76\\u5bfe\\u8a34\\u3048\\u308b", "source": "<a href=\\"http://twitter.com/download/iphone\\" rel=\\"nofollow\\">Twitter for iPhone</a>", "truncated": false, "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 828743842237001728, "id_str": "828743842237001728", "name": "\\u3058\\u3046\\ud83d\\udc36", "screen_name": "jj_iiu", "location": "Precious \\u4e16\\u754c \\u753a", "url": null, "description": "@nanjolno * @saito_nagisa * @yanaginagi * @maaya_taso* NEXT\\u21d26/17 \\u30a4\\u30b3\\u30e9\\u30d6\\u30c0\\u30f3\\u30b9\\u30ec\\u30c3\\u30b9\\u30f3\\u30016/24 NBC\\u5f8c\\u591c\\u796d\\u

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

**Converting to Python Object**

`json.loads(contents[0])`
* This line takes the first line of the file (`contents[0]`), which holds the bytes of one JSON object, and converts it into a Python dictionary using the `json.loads()` function, which accepts `bytes` as well as `str`.
    
**Output**
* The output displays the resulting Python dictionary.
* The dictionary contains key-value pairs describing a tweet, such as its creation time, ID, and text, along with a nested `user` dictionary holding the screen name, location, number of followers, etc.

In [102]:
json.loads(contents[0])

{'created_at': 'Fri Jun 22 09:32:01 +0000 2018',
 'id': 1010093038536757248,
 'id_str': '1010093038536757248',
 'text': '得点開示してもらお、それでもしトップ3入ってたら絶対訴える',
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'truncated': False,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 828743842237001728,
  'id_str': '828743842237001728',
  'name': 'じう🐶',
  'screen_name': 'jj_iiu',
  'location': 'Precious 世界 町',
  'url': None,
  'description': '@nanjolno * @saito_nagisa * @yanaginagi * @maaya_taso* NEXT⇒6/17 イコラブダンスレッスン、6/24 NBC後夜祭、7/8 南條さんBD',
  'translator_type': 'none',
  'protected': False,
  'verified': False,
  'followers_count': 192,
  'friends_count': 219,
  'listed_count': 13,
  'favourites_count': 10089,
  'statuses_count': 27986,
  'created_at': 'Mon Feb 06 23:15:04 +0000 2017',
  'utc_offset': None,
  'time_zon

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`for line in contents`
* This part iterates through each element `(line)` in the `contents` list. Recall that `contents` was previously populated with the lines read from the bzip2 file.

`json.loads(line)`
* Inside the loop, `json.loads(line)` takes each `line` (the bytes of one JSON object) and converts it into a Python dictionary using the `json.loads()` function.

`twitter_data = [...]`:
* The entire expression (the loop and the `json.loads()` call) is wrapped in square brackets (`[]`), which makes it a list comprehension. It creates a new list called `twitter_data`.
* Each iteration of the loop adds the converted dictionary (the result of `json.loads(line)`) to the `twitter_data` list.

**Summary**
* This line of code processes the list of lines read from the file. It iterates through each line, converts the JSON text to a Python dictionary using `json.loads()`, and stores the resulting dictionaries in a new list called `twitter_data`. This list now contains a collection of Python dictionaries, each representing one record from the file (here, a tweet together with its user's information).

##  Accessing & Extracting Data from the Converted List of Dictionaries 

In [103]:
twitter_data = [json.loads(line) for line in contents]

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`twitter_data[0]`
* is accessing the first element (index 0) of the twitter_data list.

**Summary**
* Since `twitter_data` is a list of dictionaries (each dictionary representing one record from the file, here a tweet), `twitter_data[0]` returns the first tweet's information as a Python dictionary.

In [105]:
twitter_data[0]

{'created_at': 'Fri Jun 22 09:32:01 +0000 2018',
 'id': 1010093038536757248,
 'id_str': '1010093038536757248',
 'text': '得点開示してもらお、それでもしトップ3入ってたら絶対訴える',
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'truncated': False,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 828743842237001728,
  'id_str': '828743842237001728',
  'name': 'じう🐶',
  'screen_name': 'jj_iiu',
  'location': 'Precious 世界 町',
  'url': None,
  'description': '@nanjolno * @saito_nagisa * @yanaginagi * @maaya_taso* NEXT⇒6/17 イコラブダンスレッスン、6/24 NBC後夜祭、7/8 南條さんBD',
  'translator_type': 'none',
  'protected': False,
  'verified': False,
  'followers_count': 192,
  'friends_count': 219,
  'listed_count': 13,
  'favourites_count': 10089,
  'statuses_count': 27986,
  'created_at': 'Mon Feb 06 23:15:04 +0000 2017',
  'utc_offset': None,
  'time_zon

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`twitter_data[0]['created_at']`
* This line accesses the `created_at` key within the first dictionary (`twitter_data[0]`) in the `twitter_data` list.
* The output shows the value associated with the `created_at` key, which is the timestamp of when the tweet was created: "Fri Jun 22 09:32:01 +0000 2018".

In [107]:
twitter_data[0]['created_at']

'Fri Jun 22 09:32:01 +0000 2018'

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`twitter_data[0]['user']['screen_name']`

* This line accesses the `screen_name` of the user associated with the first tweet.
* It first accesses the `user` key within the first dictionary (`twitter_data[0]`), which itself is a dictionary containing user information.
* Then, it accesses the `screen_name` key within the `user` dictionary.
* The output shows the screen name of the user associated with the first tweet: "jj_iiu".

In [109]:
twitter_data[0]['user']['screen_name']

'jj_iiu'

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`twitter_data[0].keys()`
* This line retrieves all the keys (attribute names) present in the first dictionary (`twitter_data[0]`).
* The output displays the keys (as a `dict_keys` view), representing the different attributes available for the first tweet in the dataset.

In [110]:
twitter_data[0].keys()

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])
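
Records in the same file do not necessarily share the same keys; the column list that pandas produces for this file includes a `delete` column, for instance. A minimal sketch of checking for a key before using it, on made-up records (the IDs, texts, and screen names below are invented):

```python
import json

# Made-up JSON Lines content: two tweets and one delete notice.
lines = [
    '{"id": 1, "text": "first tweet", "user": {"screen_name": "alice"}}',
    '{"delete": {"status": {"id": 99}}}',
    '{"id": 2, "text": "second tweet", "user": {"screen_name": "bob"}}',
]
records = [json.loads(line) for line in lines]

# Keep only the records that carry a "text" key, i.e. actual tweets.
tweets = [record for record in records if "text" in record]
screen_names = [tweet["user"]["screen_name"] for tweet in tweets]

print(len(records), len(tweets), screen_names)
```

Without the `"text" in record` check, `record["user"]` would raise a `KeyError` on the delete notice.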

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`twitter_data[0]['id']`
* This line accesses the `id` key within the first dictionary (`twitter_data[0]`) and retrieves the unique identifier of the tweet.

In [111]:
twitter_data[0]['id']

1010093038536757248

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`twitter_data[0]['filter_level']`
* This line accesses the `filter_level` key within the first dictionary (`twitter_data[0]`) and retrieves the filter level associated with the tweet.

In [112]:
twitter_data[0]['filter_level']

'low'

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`twitter_data[0]['user'].keys()`
* This line retrieves all the keys (attribute names) present in the `user` dictionary within the first tweet.
* The output displays the keys (as a `dict_keys` view), representing the different attributes available for the user associated with the first tweet.

In [113]:
twitter_data[0]['user'].keys()

dict_keys(['id', 'id_str', 'name', 'screen_name', 'location', 'url', 'description', 'translator_type', 'protected', 'verified', 'followers_count', 'friends_count', 'listed_count', 'favourites_count', 'statuses_count', 'created_at', 'utc_offset', 'time_zone', 'geo_enabled', 'lang', 'contributors_enabled', 'is_translator', 'profile_background_color', 'profile_background_image_url', 'profile_background_image_url_https', 'profile_background_tile', 'profile_link_color', 'profile_sidebar_border_color', 'profile_sidebar_fill_color', 'profile_text_color', 'profile_use_background_image', 'profile_image_url', 'profile_image_url_https', 'profile_banner_url', 'default_profile', 'default_profile_image', 'following', 'follow_request_sent', 'notifications'])

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`twitter_data[0]['user']['screen_name']`
* **Action** : This line accesses the screen_name of the user associated with the first tweet.
    * It first accesses the `user` key within the first dictionary (`twitter_data[0]`), which itself is a dictionary containing user information.
    * Then, it accesses the `screen_name` key within the `user` dictionary.
* **Output** : The output shows the screen name of the user associated with the first tweet: "jj_iiu".

In [114]:
twitter_data[0]['user']['screen_name']

'jj_iiu'

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">
 
`twitter_data[0]['user']['friends_count']`
* **Action**: This line accesses the `friends_count` of the user associated with the first tweet.
    * Similar to the previous example, it first accesses the `user` dictionary and then the `friends_count` key within that dictionary.
* **Output**: The output shows the number of friends the user has, which is 219 in this case.

In [115]:
twitter_data[0]['user']['friends_count']

219

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`twitter_data[0]['user']['description']`

* **Action** : This line accesses the `description` of the user associated with the first tweet.
    * Again, it first accesses the user dictionary and then the description key within that dictionary.
* **Output** : The output shows the user's description, which appears to be in Japanese.

In [116]:
twitter_data[0]['user']['description']

'@nanjolno * @saito_nagisa * @yanaginagi * @maaya_taso* NEXT⇒6/17 イコラブダンスレッスン、6/24 NBC後夜祭、7/8 南條さんBD'

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`with bz2.open(...) as f:`
* This line opens the bzip2 compressed file located at `/mnt/data/public/twitter/sample/data-18062209.json.bz2` in read mode using the `bz2.open()` function.
* The `with` statement ensures proper closing of the file even if errors occur within the indented block.
* The file object is assigned to the variable `f` for further use.

`twitter_data = [json.loads(line) for line in f]`
* This line reads the entire file content and creates a list of Python dictionaries.
* `f` acts as an iterator over the lines in the file.
* For each line:
    * `json.loads(line)` parses the JSON string (line) into a Python dictionary.
* The resulting dictionaries are stored in the `twitter_data` list.

`twitter_data[0]`
* This line accesses the first element (index 0) of the `twitter_data` list.
* Since `twitter_data` contains dictionaries, `twitter_data[0]` retrieves the first dictionary, representing the first tweet in the dataset.

Output:
* The output displays the content of the first dictionary in the `twitter_data` list.
* This dictionary holds key-value pairs containing information about the first tweet, such as text, creation time, and user details.


In [117]:
with bz2.open('/mnt/data/public/twitter/sample/data-18062209.json.bz2') as f:
    twitter_data = [json.loads(line) for line in f]
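
Iterating over the file object reads one line at a time, so the same pattern can be wrapped in a generator when only part of the data is needed. The sketch below is one way to do that. `iter_json_lines` is a hypothetical helper name introduced for this example, and the file is a small made-up stand-in for the lecture's data file.

```python
import bz2
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed object per non-empty line of a bzip2-compressed file."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)


# Stand-in file with made-up contents, including one blank line.
path = os.path.join(tempfile.mkdtemp(), "sample.json.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write('{"id": 1}\n\n{"id": 2}\n')

ids = [record["id"] for record in iter_json_lines(path)]
print(ids)
```

The `with` block inside the generator closes the file once the last line has been consumed.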

In [None]:
twitter_data[0]

{'created_at': 'Fri Jun 22 09:32:01 +0000 2018',
 'id': 1010093038536757248,
 'id_str': '1010093038536757248',
 'text': '得点開示してもらお、それでもしトップ3入ってたら絶対訴える',
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'truncated': False,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 828743842237001728,
  'id_str': '828743842237001728',
  'name': 'じう🐶',
  'screen_name': 'jj_iiu',
  'location': 'Precious 世界 町',
  'url': None,
  'description': '@nanjolno * @saito_nagisa * @yanaginagi * @maaya_taso* NEXT⇒6/17 イコラブダンスレッスン、6/24 NBC後夜祭、7/8 南條さんBD',
  'translator_type': 'none',
  'protected': False,
  'verified': False,
  'followers_count': 192,
  'friends_count': 219,
  'listed_count': 13,
  'favourites_count': 10089,
  'statuses_count': 27986,
  'created_at': 'Mon Feb 06 23:15:04 +0000 2017',
  'utc_offset': None,
  'time_zon

## Using the Pandas Library to Read JSON file


<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`import pandas as pd`
* This line imports the `pandas` library, which is a powerful and widely used library for data manipulation and analysis in Python.
* It imports the library and gives it a shorter alias, `pd`, for easier use throughout the code.

In [119]:
import pandas as pd 

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`df_twitter = pd.read_json('/mnt/data/public/twitter/sample/data-18062209.json.bz2', lines=True)`
* This line reads a JSON file from the specified path (`'/mnt/data/public/twitter/sample/data-18062209.json.bz2'`) and creates a pandas DataFrame called `df_twitter`
* `pd.read_json()` is a function from the pandas library that reads data from a JSON file.
* The `lines=True` argument tells pandas that the JSON file contains multiple JSON objects, one per line. This allows pandas to efficiently read and parse the data.

In [120]:
df_twitter = pd.read_json(
    '/mnt/data/public/twitter/sample/data-18062209.json.bz2',
    lines=True
)
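
To see what `lines=True` does on a small scale, the sketch below passes `pd.read_json()` a made-up three-line string through `io.StringIO` instead of a file path. The records are invented for illustration; one of them has different keys from the others.

```python
import io

import pandas as pd

# Made-up JSON Lines text: one JSON object per line.
json_lines = (
    '{"id": 1, "text": "first"}\n'
    '{"id": 2, "text": "second"}\n'
    '{"delete": {"status": {"id": 1}}}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object.
df = pd.read_json(io.StringIO(json_lines), lines=True)

print(df.shape)
print(df["id"].isna().sum())
```

Each line becomes one row, each distinct key becomes a column, and a key that is missing from a record shows up as a missing value in that row.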



<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`df_twitter`
* This line simply displays the contents of the `df_twitter` DataFrame.
* The output shows the first and last few rows of the DataFrame, including columns like `created_at`, `id`, and `id_str`.
* You can see that the `created_at` column contains datetime values, and that some rows have missing values (indicated by `NaN`, or `NaT` for datetimes).

In [122]:
df_twitter

Unnamed: 0,created_at,id,id_str,text,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,...,possibly_sensitive,delete,display_text_range,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_entities,extended_tweet,withheld_in_countries
0,2018-06-22 09:32:01+00:00,1.010093e+18,1.010093e+18,得点開示してもらお、それでもしトップ3入ってたら絶対訴える,"<a href=""http://twitter.com/download/iphone"" r...",0.0,,,,,...,,,,,,,,,,
1,2018-06-22 09:32:01+00:00,1.010093e+18,1.010093e+18,RT @tatanakan: ？？？？？？？？？？？？？？？？？？？？？？？？？？？？？？？...,"<a href=""http://twitter.com/download/android"" ...",0.0,,,,,...,,,,,,,,,,
2,2018-06-22 09:32:01+00:00,1.010093e+18,1.010093e+18,ふなっしー　チャーム付ボールペン 青【ふなっしーグッズ/文房具/筆記具/ボールペン/文具/可...,"<a href=""https://twitter.com/funassyi_cafe"" re...",0.0,,,,,...,0.0,,,,,,,,,
3,NaT,,,,,,,,,,...,,"{'status': {'id': 1010092958819840005, 'id_str...",,,,,,,,
4,2018-06-22 09:32:01+00:00,1.010093e+18,1.010093e+18,RT @taejinsus: all the BTS outros deserve bett...,"<a href=""http://twitter.com/download/android"" ...",0.0,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72145,NaT,,,,,,,,,,...,,"{'status': {'id': 1010100013668581377, 'id_str...",,,,,,,,
72146,2018-06-22 09:59:27+00:00,1.010100e+18,1.010100e+18,RT @kirakira555star: １５ｇ　２００個＋α　３・４・６・８・１０ｍｍ　コ...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",0.0,,,,,...,0.0,,,,,,,,,
72147,2018-06-22 09:59:27+00:00,1.010100e+18,1.010100e+18,@GRND_MAiRU まいるさんだから許した😡(ちょろいオタク)\n地獄少女の曲良すぎない...,"<a href=""http://twitter.com/download/iphone"" r...",0.0,1.010100e+18,1.010100e+18,7.519702e+17,7.519702e+17,...,,,"[12, 84]",,,,,,,
72148,2018-06-22 09:59:38+00:00,1.010100e+18,1.010100e+18,なんで人間は性行為の人数を自慢するんだ？失敗してきた数だろ？それか尻軽だと思われるべきだと思...,"<a href=""http://makebot.sh"" rel=""nofollow"">ナナシ...",0.0,,,,,...,,,,,,,,,,


<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`df_twitter.columns`
* This line accesses the column names of the `df_twitter` DataFrame.
* The output displays a list of all the column names present in the DataFrame, which represent the different attributes or features associated with each tweet in the dataset.

In [123]:
df_twitter.columns

Index(['created_at', 'id', 'id_str', 'text', 'source', 'truncated',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'retweeted_status',
       'possibly_sensitive', 'delete', 'display_text_range',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status',
       'quoted_status_permalink', 'extended_entities', 'extended_tweet',
       'withheld_in_countries'],
      dtype='object')

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`df_twitter['user']`
* This line selects the entire `user` column from the DataFrame.
* The output shows a Series (a one-dimensional array-like object in pandas) containing the user information for each tweet.
* Each element in this Series is a dictionary, representing the user object associated with that particular tweet.

In [125]:
df_twitter['user']

0        {'id': 828743842237001728, 'id_str': '82874384...
1        {'id': 783267905001562113, 'id_str': '78326790...
2        {'id': 2791090622, 'id_str': '2791090622', 'na...
3                                                      NaN
4        {'id': 762235935631237120, 'id_str': '76223593...
                               ...                        
72145                                                  NaN
72146    {'id': 728074227106996227, 'id_str': '72807422...
72147    {'id': 930610682134802432, 'id_str': '93061068...
72148    {'id': 1364190668, 'id_str': '1364190668', 'na...
72149                                                  NaN
Name: user, Length: 72150, dtype: object

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;">

`df_twitter['user']`
* This selects the entire 'user' column from the DataFrame. As the previous output shows, each non-missing element is a dictionary with information about the user who posted that tweet.

`.str`
* This accessor applies string-style operations to each element of the Series. Its indexing form, `.str[key]`, also works element-wise on dictionaries, which is why it can be used here to look up a key inside each user dictionary.

`['id']`
* This accesses the 'id' key within each user dictionary in the Series. Rows without a user (`NaN`) stay `NaN`, which is why the result has dtype `float64`.

In [126]:
df_twitter['user'].str['id']

0        8.287438e+17
1        7.832679e+17
2        2.791091e+09
3                 NaN
4        7.622359e+17
             ...     
72145             NaN
72146    7.280742e+17
72147    9.306107e+17
72148    1.364191e+09
72149             NaN
Name: user, Length: 72150, dtype: float64
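
Because the result is `float64`, IDs this large can no longer be represented exactly. One possible workaround, sketched below on a made-up stand-in for the `user` column, is to read `id_str` instead, or to flatten the non-missing dictionaries with `pd.json_normalize()`. The screen name `example_user` is invented for this example.

```python
import pandas as pd

# Made-up stand-in for the 'user' column: two user dicts and one missing value.
users = pd.Series([
    {"id": 828743842237001728, "id_str": "828743842237001728",
     "screen_name": "jj_iiu"},
    float("nan"),
    {"id": 2791090622, "id_str": "2791090622",
     "screen_name": "example_user"},
])

ids = users.str["id"]              # element-wise lookup; missing rows stay NaN
id_strings = users.str["id_str"]   # strings keep every digit of the ID

# Flatten the non-missing user dicts into their own DataFrame.
user_table = pd.json_normalize(users.dropna().tolist())

print(id_strings.tolist())
print(user_table.columns.tolist())
```

Dropping the missing rows first matters here, since `pd.json_normalize()` expects dictionaries rather than `NaN` values.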