 # The JSON Format

The data set we'll use in this mission is in a format called [JavaScript Object Notation](https://www.json.org/) (**JSON**). As the name indicates, JSON originated from the JavaScript language, but has now become a language-independent format.

From a Python perspective, JSON can be thought of as a collection of Python objects nested inside each other. Go to [LAMBDA FUNCTION](#lambda_fn)

![image.png](attachment:image.png)

The JSON above is a list, where each element in the list is a dictionary. Each of the dictionaries has the same keys, and one of the values of each dictionary is itself a list.<br>The Python `json` [module](https://docs.python.org/3.7/library/json.html#json.loads) contains a number of functions to make working with JSON objects easier. We can use the **json.loads()** function to convert JSON data contained in a string to the equivalent set of Python objects:

In [1]:
json_string = """
[
  {
    "name": "Sabine",
    "age": 36,
    "favorite_foods": ["Pumpkin", "Oatmeal"]
  },
  {
    "name": "Zoe",
    "age": 40,
    "favorite_foods": ["Chicken", "Pizza", "Chocolate"]
  },
  {
    "name": "Heidi",
    "age": 40,
    "favorite_foods": ["Caesar Salad"]
  }
]
"""
import json
json_obj = json.loads(json_string)
print(json_obj)
type(json_obj)

[{'name': 'Sabine', 'age': 36, 'favorite_foods': ['Pumpkin', 'Oatmeal']}, {'name': 'Zoe', 'age': 40, 'favorite_foods': ['Chicken', 'Pizza', 'Chocolate']}, {'name': 'Heidi', 'age': 40, 'favorite_foods': ['Caesar Salad']}]


list

We can observe a few things:

   - The formatting from our original string is gone. This is because printing Python lists and dictionaries uses a simple, single-line format.
   - The order of the keys is unchanged in the output above. In Python versions before 3.7, however, dictionaries did not guarantee key order, so the keys could appear in a different order than in the original string.
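
As a small illustration of how `json.loads()` maps JSON types to Python types, here is a minimal sketch. The `sample` string and the variable names are made up for this example and are not part of the mission's data:

```python
import json

# Hypothetical JSON string used only for illustration
sample = '{"name": "Sabine", "age": 36, "vegetarian": true, "nickname": null, "favorite_foods": ["Pumpkin", "Oatmeal"]}'

parsed = json.loads(sample)

# JSON objects become dicts, arrays become lists,
# true/false become True/False, and null becomes None
print(type(parsed))                    # <class 'dict'>
print(type(parsed["favorite_foods"]))  # <class 'list'>
print(parsed["vegetarian"], parsed["nickname"])  # True None

# Nested values are reached by chaining keys and indexes
first_food = parsed["favorite_foods"][0]
print(first_food)  # Pumpkin
```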

Let's practice using `json.loads()` to convert JSON data from a string to Python objects!

In [2]:
world_cup_str = """
[
    {
        "team_1": "France",
        "team_2": "Croatia",
        "game_type": "Final",
        "score" : [4, 2]
    },
    {
        "team_1": "Belgium",
        "team_2": "England",
        "game_type": "3rd/4th Playoff",
        "score" : [2, 0]
    }
]
"""
import json
world_cup_obj = json.loads(world_cup_str)
world_cup_obj

[{'team_1': 'France',
  'team_2': 'Croatia',
  'game_type': 'Final',
  'score': [4, 2]},
 {'team_1': 'Belgium',
  'team_2': 'England',
  'game_type': '3rd/4th Playoff',
  'score': [2, 0]}]

# Reading a JSON file

One of the places where the JSON format is commonly used is in the results returned by an [application programming interface](https://en.wikipedia.org/wiki/Application_programming_interface) (**API**). APIs are interfaces that can be used to send and receive data between different computer systems. We'll learn about how to work with APIs in a later course.

The data set for this mission, `hn_2014.json`, was downloaded from the Hacker News API. It's a different set of data from the CSV we've been using in the previous two missions, and it contains data about stories from Hacker News in 2014.

To read a JSON file, we use the `json.load()` function. Note that the function is `json.load()` without an "s" at the end. The `json.loads()` function is used for loading JSON data from a string ("loads" is short for "load string"), whereas the `json.load()` function is used to load from a file object. Let's look at how we would read in our data:

In [3]:
import json
import pandas as pd

hn_df = pd.read_json('C:\\Users\\MY PC\\Desktop\\DATASETS\\hn_2014.json')
hn_df

# file = open('hn_2014.json')
# hn = json.load(file)

Unnamed: 0,author,numComments,points,url,storyText,createdAt,tags,createdAtI,title,objectId
0,dragongraphics,0,2,http://ashleynolan.co.uk/blog/are-we-getting-t...,,2014-05-29T08:07:50Z,"[story, author_dragongraphics, story_7815238]",1401350870,Are we getting too Sassy? Weighing up micro-op...,7815238
1,jcr,0,1,http://spectrum.ieee.org/automaton/robotics/ho...,,2014-05-29T08:05:58Z,"[story, author_jcr, story_7815234]",1401350758,Telemba Turns Your Old Roomba and Tablet Into ...,7815234
2,callum85,0,1,http://online.wsj.com/articles/apple-to-buy-be...,,2014-05-29T08:05:06Z,"[story, author_callum85, story_7815230]",1401350706,Apple Agrees to Buy Beats for $3 Billion,7815230
3,d3v3r0,0,1,http://alexsblog.org/2014/05/29/dont-wait-for-...,,2014-05-29T08:00:08Z,"[story, author_d3v3r0, story_7815222]",1401350408,Don’t wait for inspiration,7815222
4,timmipetit,0,1,http://techcrunch.com/2014/05/28/hackerone-get...,,2014-05-29T07:46:19Z,"[story, author_timmipetit, story_7815191]",1401349579,HackerOne Get $9M In Series A Funding To Build...,7815191
...,...,...,...,...,...,...,...,...,...,...
35801,lispython,0,3,https://medium.com/p/ff5f4c9b16bd,,2014-01-01T00:33:42Z,"[story, author_lispython, story_6993601]",1388536422,Engelbart and Kay,6993601
35802,co_pl_te,0,3,http://allthingsd.com/20131231/you-say-goodbye...,,2014-01-01T00:19:47Z,"[story, author_co_pl_te, story_6993568]",1388535587,You Say Goodbye and We Say Hello,6993568
35803,maurorm,0,1,http://ghiraldelli.pro.br/jesus-e-eu/,,2014-01-01T00:11:06Z,"[story, author_maurorm, story_6993544]",1388535066,Jesus e eu,6993544
35804,yeukhon,0,1,,https:&#x2F;&#x2F;fundraising.mozilla.org&#x2F;,2014-01-01T00:06:59Z,"[story, author_yeukhon, story_6993536]",1388534819,Mozilla end-of-year fundraising jumps from $75...,6993536


In [4]:
with open('C:\\Users\\MY PC\\Desktop\\DATASETS\\hn_2014.json') as file:
    hn = json.load(file)
len(hn)
list(hn[0])


['author',
 'numComments',
 'points',
 'url',
 'storyText',
 'createdAt',
 'tags',
 'createdAtI',
 'title',
 'objectId']
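
To make the difference between `json.load()` and `json.loads()` concrete, the sketch below writes a tiny, made-up list of records to a temporary file and reads it back both ways. The `records` data and the temporary file are assumptions for illustration only; they stand in for a real file such as the Hacker News data:

```python
import json
import os
import tempfile

# Hypothetical records standing in for a JSON file's contents
records = [{"title": "Example story", "points": 3}]

# Write the records to a temporary file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(records, tmp)
    tmp_path = tmp.name

# json.load() reads from an open file object
with open(tmp_path) as f:
    from_file = json.load(f)

# json.loads() parses a string that is already in memory
with open(tmp_path) as f:
    from_string = json.loads(f.read())

os.remove(tmp_path)

print(from_file == from_string)  # True
```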

Our hn variable is a list. Let's find out how many objects are in the list, and the type of the first object (which will almost always be the type of every object in the list in JSON data):

In [5]:
print(len(hn))
print(type(hn[0]))

35806
<class 'dict'>


Our data set contains 35,806 dictionary objects, each representing a Hacker News story. In order to understand the format of our data set, we'll print the keys of the first dictionary:

In [6]:
print(hn[0].keys())

dict_keys(['author', 'numComments', 'points', 'url', 'storyText', 'createdAt', 'tags', 'createdAtI', 'title', 'objectId'])


If we recall the data set we used in the previous two missions (on regular expressions), we can see some similarities. There are keys representing the title, URL, points, number of comments, and date, as well as some others that are less familiar to us. Here is a summary of the keys and the data that they contain:

   - `author`: The username of the person who submitted the story.
   - `createdAt`: The date and time at which the story was created.
   - `createdAtI`: An integer value representing the date and time at which the story was created.
   - `numComments`: The number of comments that were made on the story.
   - `objectId`: The unique identifier from Hacker News for the story.
   - `points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes.
   - `storyText`: The text of the story (if the story contains text).
   - `tags`: A list of tags associated with the story.
   - `title`: The title of the story.
   - `url`: The URL that the story links to (if the story links to a URL).


 # Deleting Dictionary Keys

Let's look at the first dictionary in full. To make it easier to understand, we're going to create a function that prints a JSON object with formatting that makes it easier to read.<br>The function will use the `json.dumps()` function ("dump string"), which does the opposite of the `json.loads()` function: it takes a Python object and returns a JSON-formatted string version of it. The `json.dumps()` function accepts arguments that can specify formatting for the string, which we'll use to make things easier to read:

In [7]:
def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys= True, indent=4)
    print(text)
first_story = hn[0]
jprint(first_story)

{
    "author": "dragongraphics",
    "createdAt": "2014-05-29T08:07:50Z",
    "createdAtI": 1401350870,
    "numComments": 0,
    "objectId": "7815238",
    "points": 2,
    "storyText": "",
    "tags": [
        "story",
        "author_dragongraphics",
        "story_7815238"
    ],
    "title": "Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability",
    "url": "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy"
}
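
To see how `json.dumps()` and `json.loads()` relate, the following sketch round-trips a small dictionary. The `story` data here is hypothetical and is only used to keep the example self-contained:

```python
import json

# Hypothetical dictionary used only for illustration
story = {"title": "Example", "points": 2, "tags": ["story", "demo"]}

compact = json.dumps(story)                            # single-line string
pretty = json.dumps(story, sort_keys=True, indent=4)   # multi-line, keys sorted

print(type(compact))  # <class 'str'>
print(pretty)

# The formatted string parses back to an equal Python object
restored = json.loads(pretty)
print(restored == story)  # True
```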


You may notice that the `createdAt` and `createdAtI` keys both have the date and time data in two different formats. Because the format of `createdAt` is much easier to understand, let's do some data cleaning by deleting the `createdAtI` key from every dictionary.

To delete a key from a dictionary, we can use the `del` statement. Let's learn the syntax by looking at a simple example:


In [8]:
d = {'a': 1, 'b': 2, 'c': 3}
del d['a']  # del is a statement; del(d['a']) also works
d

{'b': 2, 'c': 3}

We can create a function using `del` that will return a copy of our dictionary with the key removed:

In [9]:
def del_key(dict_, key):
    # create a copy so we don't
    # modify the original dict
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict
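
As a quick check that this approach leaves the original dictionary alone, here is a minimal sketch. The helper is repeated so the block runs on its own, and the `original` dictionary is made up for illustration. Note that `dict.copy()` makes a shallow copy, so nested objects such as lists are shared between the two dictionaries:

```python
def del_key(dict_, key):
    # same helper as above, repeated so this block is self-contained
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

# Hypothetical dictionary used only for illustration
original = {"a": 1, "b": 2, "tags": ["x"]}
trimmed = del_key(original, "a")

print(original)  # {'a': 1, 'b': 2, 'tags': ['x']}
print(trimmed)   # {'b': 2, 'tags': ['x']}

# The copy is shallow: the nested list is the same object in both dicts
print(trimmed["tags"] is original["tags"])  # True
```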

Let's use this function to delete the `createdAtI` key from first_story:

In [10]:
first_story = del_key(first_story, 'createdAtI')
jprint(first_story)

{
    "author": "dragongraphics",
    "createdAt": "2014-05-29T08:07:50Z",
    "numComments": 0,
    "objectId": "7815238",
    "points": 2,
    "storyText": "",
    "tags": [
        "story",
        "author_dragongraphics",
        "story_7815238"
    ],
    "title": "Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability",
    "url": "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy"
}


The dictionary returned by the function no longer includes the `createdAtI` key.

Let's use a loop and the `del_key()` function to remove the `createdAtI` key from every story in our Hacker News data set:

In [11]:
def del_key(dict_,key):
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict
hn_clean = []
for dict_ in hn:
    new_d = del_key(dict_,'createdAtI')
    hn_clean.append(new_d)
    

In [12]:
hn_clean

[{'author': 'dragongraphics',
  'numComments': 0,
  'points': 2,
  'url': 'http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
  'storyText': '',
  'createdAt': '2014-05-29T08:07:50Z',
  'tags': ['story', 'author_dragongraphics', 'story_7815238'],
  'title': 'Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability',
  'objectId': '7815238'},
 {'author': 'jcr',
  'numComments': 0,
  'points': 1,
  'url': 'http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot',
  'storyText': '',
  'createdAt': '2014-05-29T08:05:58Z',
  'tags': ['story', 'author_jcr', 'story_7815234'],
  'title': 'Telemba Turns Your Old Roomba and Tablet Into a Telepresence Robot',
  'objectId': '7815234'},
 {'author': 'callum85',
  'numComments': 0,
  'points': 1,
  'url': 'http://online.wsj.com/articles/apple-to-buy-beats-1401308971',
  'storyText': '',
  'createdAt': '2014-05-29T08:05:06Z',
  'tags': ['story', 'author_callum85', 'story_7815230'],
  'title': 'Apple

 # Writing List Comprehensions

The task we performed is an extremely common one. Specifically, we:
  - Iterated over values in a list.
  - Performed a transformation on those values.
  - Assigned the result to a new list.

Python includes a special syntax shortcut for tasks that meet these criteria: **List Comprehensions**. A list comprehension provides a concise way of creating lists in a single line of code.

List comprehensions can look complex at first, but we are simply reordering the elements of our for loop code. To keep things simple, we'll start with a basic example, where we want to add 1 to each item in a list of integers.


In [13]:
ints = [1, 2, 3, 4]
plus_one = []
for i in ints:
    
    plus_one.append(i+1)
plus_one

[2, 3, 4, 5]

Let's start by labeling the three main parts of our loop:



![image.png](attachment:image.png)

To transform this structure into a list comprehension, we do the following within brackets:

   - Start with the code that transforms each item.
   - Continue with our for statement (without a colon)

We can then assign the list comprehension to a variable name. The image below shows how we convert the manual loop version to a list comprehension.



![image.png](attachment:image.png)
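
In case the image does not render, here is a sketch of the same conversion in code. The name `plus_one_comp` is ours, chosen so both versions can be compared side by side:

```python
ints = [1, 2, 3, 4]

# Loop version
plus_one = []
for i in ints:
    plus_one.append(i + 1)

# List comprehension: the transformation comes first, then the for clause
plus_one_comp = [i + 1 for i in ints]

print(plus_one_comp)  # [2, 3, 4, 5]
```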

Let's look at a second example, where we want to multiply each item in the list by 10:


In [14]:
times_ten = []
for i in ints:
    times_ten.append(i*10)
times_ten

[10, 20, 30, 40]

To convert this to a list comprehension, we follow the same pattern:

![image.png](attachment:image.png)
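
Written out in code, the pattern would look roughly like this (a sketch; `ints` is redefined so the block stands alone):

```python
ints = [1, 2, 3, 4]

# Transformation (i * 10) first, then the for clause without a colon
times_ten = [i * 10 for i in ints]

print(times_ten)  # [10, 20, 30, 40]
```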


The "transformation" step of our list comprehension can be anything, including a function or method. In the example below, we are applying a function to a list of floats to round them to integers:

In [15]:
floats = [2.1, 8.7, 4.2, 8.9]

rounded = []
for f in floats:
    rounded.append(round(f))

print(rounded)

[2, 9, 4, 9]


To convert to a list comprehension, we simply rearrange the components:



![image.png](attachment:image.png)
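
A code version of this rearrangement might look like the sketch below, where the transformation is a call to the built-in `round()` function:

```python
floats = [2.1, 8.7, 4.2, 8.9]

# The transformation can be any expression, including a function call
rounded = [round(f) for f in floats]

print(rounded)  # [2, 9, 4, 9]
```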

Just like in a normal loop, we can use any name for our iterator variable. Here, we have used f.

For the last example, we'll apply a method to each string in a list to capitalize it. We won't color the different components, so we can get used to how that looks.

In [16]:
letters = ['a', 'b', 'c', 'd']
caps = []
for l in letters:
    caps.append(l.upper())
caps

['A', 'B', 'C', 'D']

In [17]:
# or
letters = ['a', 'b', 'c', 'd']
print([l.upper() for l in letters])


['A', 'B', 'C', 'D']


Let's recap what we have learned so far. A list comprehension can be used where we:

   - Iterate over values in a list.
   - Perform a transformation on those values.
   - Assign the result to a new list.

To transform a loop to a list comprehension, in brackets we:

   - Start with the code that transforms each item.
   - Continue with our for statement (without a colon).

We are going to write a list comprehension version of the code from the previous screen. To help, we've provided a copy of the code with the components labeled.

![image.png](attachment:image.png)

In [18]:
# Loop version
hn_clean = []

for d in hn:
    new_d = del_key(d, 'createdAtI')
    hn_clean.append(new_d)

# List comprehension version
hn_clean = [del_key(d, 'createdAtI') for d in hn]

List comprehensions can be used for many different things. Three common applications are:

   1. Transforming a list
   2. Creating a new list
   3. Reducing a list

On this screen, we're going to look at the first two of these applications.

The first application, transforming a list, is the category that all the examples you've seen so far fit under. You are taking an existing list, applying a transformation to every value, and assigning it to a variable.

The example below uses a list comprehension to transform a list of square numbers into their "square roots":

![image.png](attachment:image.png)

In [19]:
squares = [1,4,9,16,25,36]
sqroots = [int(sq**(1/2)) for sq in squares]
sqroots

[1, 2, 3, 4, 5, 6]

The second application, creating a new list, is useful for creating test data or data that is based on a set of numbers.

As an example, let's create a list of generic column names that we could use to create a dataframe. We'll use the **range()** function to generate the numbers and the **str.format()** method to combine each number with a prefix:

![image.png](attachment:image.png)

In [20]:
# using for loop
cols = []
for i in range(1,5):
    cols.append('col_{}'.format(i))
cols

['col_1', 'col_2', 'col_3', 'col_4']

In [21]:
# using list comprehension
print(['col_{}'.format(i) for i in range(1,5)])

['col_1', 'col_2', 'col_3', 'col_4']


We can then use this technique to create a dataframe of zeros with labeled columns:

In [22]:
import numpy as np
cols = ['cols_{}'.format(i) for i in range(1,6)]
data = np.zeros((5,5))
df = pd.DataFrame(data, columns = cols)
df

Unnamed: 0,cols_1,cols_2,cols_3,cols_4,cols_5
0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0


Use a list comprehension to extract the url value from each dictionary in hn_clean. Assign the result to urls.

In [24]:
urls = [d['url'] for d in hn_clean]
urls

['http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
 'http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot',
 'http://online.wsj.com/articles/apple-to-buy-beats-1401308971',
 'http://alexsblog.org/2014/05/29/dont-wait-for-inspiration/',
 'http://techcrunch.com/2014/05/28/hackerone-get-9m-in-series-a-funding-to-build-bug-tracking-bounty-programs/',
 'http://www.teslamotors.com/en_AU/models/design',
 'http://gearapp.challengepost.com/',
 'https://gigaom.com/2014/05/28/skype-will-soon-get-real-time-speech-translation-based-on-deep-learning/',
 'http://www.nbcnews.com/feature/edward-snowden-interview/watch-primetime-special-inside-mind-edward-snowden-n117126',
 'http://snippetrepo.com/snippets/linear-equation-solver-in-3-lines-of-python',
 'http://www.quora.com/Websites/What-is-the-difference-between-a-privacy-policy-and-terms-and-conditions',
 'http://techcrunch.com/gallery/five-super-successful-tech-pivots/',
 'http://andrewgelman.com/2014/05/27/whole

 # Using List Comprehensions to Reduce a List

The last common application of list comprehensions is reducing a list. Let's say we had a list of integers and we wanted to remove any integers that were smaller than 50. We could do this by adding an if statement to our loop:



![image.png](attachment:image.png)

In [25]:
ints = [25, 14, 13, 84, 43, 6, 77 , 56]
big_ints = []
for i in ints:
    if i >= 50:
        big_ints.append(i)
print(big_ints)


[84, 77, 56]
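
In a list comprehension, the `if` condition goes after the `for` clause. A sketch of the comprehension equivalent of the loop above (with `ints` redefined so the block is self-contained):

```python
ints = [25, 14, 13, 84, 43, 6, 77, 56]

# Keep only the items for which the condition is true
big_ints = [i for i in ints if i >= 50]

print(big_ints)  # [84, 77, 56]
```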


We can use this technique to quickly and easily filter our data set using an if statement. Let's look at how we could use this to quickly count the number of stories that have comments. We'll start with a version using a loop:

In [26]:
# using for loops
has_comments = []
for d in hn:
    if d['numComments'] > 0:
        has_comments.append(d)
len(has_comments)

9279

In [27]:
# using list comprehension
has_comments = [d for d in hn_clean if d['numComments'] > 0]
len(has_comments)

9279

In [28]:
# more examples: keep only the even numbers
num = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
even_num = [i for i in num if i % 2 == 0]
even_num

[2, 4, 6, 8, 10]

Let's use a list comprehension to count how many stories have more than 1,000 points.

In [30]:
num_thousand_points = [d for d in hn_clean if d['points'] > 1000]
len(num_thousand_points)

8

 # Passing Functions as Arguments

In previous missions, we learned to use Python's built-in functions to analyze data in lists, like `min()`, `max()`, and `sorted()`.

What if we wanted to use these functions to work with data in JSON form? Let's use our demo JSON object to try and see what happens. First, we'll quickly remind ourselves of the data:


In [31]:
jprint(json_obj)

[
    {
        "age": 36,
        "favorite_foods": [
            "Pumpkin",
            "Oatmeal"
        ],
        "name": "Sabine"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Chicken",
            "Pizza",
            "Chocolate"
        ],
        "name": "Zoe"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Caesar Salad"
        ],
        "name": "Heidi"
    }
]


Let's try and use Python to return the dictionary of the person with the lowest age:

In [32]:
min(json_obj)

TypeError: '<' not supported between instances of 'dict' and 'dict'

We get an error because Python doesn't have any way to tell whether one dictionary object is "greater" than another.

There is a way we can actually tell functions like `min()`, `max()`, and `sorted()` how to sort complex objects like dictionaries and lists of lists. We do this by using the optional key argument. The official Python documentation contains the following excerpts that describe how the argument works:

    The key argument specifies a one-argument ordering function like that used for list.sort().

    key specifies a function of one argument that is used to extract a comparison key from each list element. The key corresponding to each item in the list is calculated once and then used for the entire sorting process.


These excerpts tell us we need to specify a function as an argument to control the comparison between values. Up until now, we've only passed variables or values as arguments, but not functions!

We'll learn the specifics of this particular application in a moment, but for now, we're going to explore how to pass a function as an argument.

Let's define a very simple function as an example:

In [33]:
def greet():
    return 'Hello world'
greet()

'Hello world'

If we try to examine the type of our function, we are unsuccessful:

In [34]:
type(greet())

str

What happens is that `greet()` is executed first; it returns the string "Hello world", and then the `type()` function tells us the type of that string:

![image.png](attachment:image.png)

We need to find a way to look at the function itself, rather than the result of the function. The key to this is the parentheses: ().

The parentheses are what tell Python to execute the function, so if we *omit the parentheses we can treat a function like a variable,* rather than working with the output of the function:

In [35]:
t = type(greet)
t

function

There are other variable-like behaviors we can also use when we omit the parentheses from a function. For instance, we can assign a function to a new variable name:

In [36]:
greet_2 = greet

greet_2()

'Hello world'

Now that we understand how to treat a function as a variable, let's look at how we can run a function inside another function by passing it as an argument:



In [37]:
def run_func(func):
    print("RUNNING FUNCTION: {}".format(func))
    return func()

run_func(greet)

RUNNING FUNCTION: <function greet at 0x0000002183379670>


'Hello world'

![image.png](attachment:image.png)

Now that we have some intuition on how to pass functions as arguments, let's see how we use a function to control the behavior of functions like `min()` and `sorted()`:

![image.png](attachment:image.png)

Let's look at the same thing in code form:

In [38]:
def get_age(json_dict):
    return json_dict['age']

youngest = min(json_obj, key=get_age)
jprint(youngest)

{
    "age": 36,
    "favorite_foods": [
        "Pumpkin",
        "Oatmeal"
    ],
    "name": "Sabine"
}
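
The same key function works with `sorted()` and `max()`. The sketch below uses a trimmed-down copy of the demo data (the `people` list is our own shortened version, without the `favorite_foods` key) to show how ties are handled:

```python
def get_age(json_dict):
    return json_dict['age']

# Shortened, hypothetical copy of the demo data
people = [
    {"name": "Sabine", "age": 36},
    {"name": "Zoe", "age": 40},
    {"name": "Heidi", "age": 40},
]

# Sort from oldest to youngest; items with equal ages keep their original order
by_age_desc = sorted(people, key=get_age, reverse=True)
print([p["name"] for p in by_age_desc])  # ['Zoe', 'Heidi', 'Sabine']

# max() returns the first item it encounters with the largest key
oldest = max(people, key=get_age)
print(oldest["name"])  # Zoe
```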


Let's use this technique to find the story that has the greatest number of comments.

In [39]:
def get_num_comments(story):
    return story['numComments']

most_comments = max(hn_clean, key=get_num_comments)

<a id='lambda_fn'></a>

# Lambda Functions 
 

Usually, we create functions when we want to perform the same task many times. In the previous exercise, we created a function to use just once — as an argument to `max()`.

Python provides a special syntax to create temporary functions for situations like these. These functions are called **lambda functions**. Lambda functions can be defined in a single line, which allows you to define a function you want to pass as an argument at the time you need it.

While it's unusual to assign a lambda function to a variable name, we'll do that while we learn lambda functions through some simple examples. We'll start with a function that returns a single argument without modifying it:


In [40]:
def unchanged(x):
    return x
    

Let's give each component of the function a name so we can more easily talk about it:

![image.png](attachment:image.png)

We're calling the returned element "transformation," even though there is no transformation. This will make sense as we introduce more complex examples.

To create a lambda function equivalent of this function, we:

   - Use the lambda keyword, followed by
   - The parameter and a colon, and then
   - The transformation we wish to perform on our argument

We can then assign that to the function name:




![image.png](attachment:image.png)
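
In code, the steps listed above would look roughly like this. The name `unchanged_lambda` is ours, used so the regular function and the lambda can be compared:

```python
# Regular function
def unchanged(x):
    return x

# Lambda equivalent: the lambda keyword, the parameter and a colon,
# then the expression whose value is returned
unchanged_lambda = lambda x: x

print(unchanged(5), unchanged_lambda(5))  # 5 5
```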

Now let's look at a second example, where we add a simple transformation to the argument:

![image.png](attachment:image.png)

In [41]:
plus_one = lambda x : x + 1

If we want to create a lambda that has multiple parameters, we follow exactly the same steps:

In [42]:
def add(x, y):
    return x + y

# lambda equivalent
add = lambda x, y: x + y

If a function is particularly complex, it may be a better choice to define a regular function rather than create a lambda, even if it will only be used once. For instance, consider the function below, which extracts digits from a string and then adds one to the resultant integer:


In [43]:
import re

def extract_and_increment(string):
    # find the first run of digits and convert it to an integer
    digits = int(re.search(r'\d+', string).group())
    incremented = digits + 1
    return incremented


It becomes tough to understand in its lambda form:

In [44]:
extract_and_increment = lambda string:int(re.search(r'\d+',string).group())+1

Let's practice creating a lambda function version of a simple function:

In [45]:
# normal function
def multiply(a,b):
    return a*b
# lambda function
multiply = lambda a,b : a*b

# Using Lambda Functions to Analyze JSON data

As we mentioned briefly on the previous screen, assigning a lambda to a variable so it can be called by name is a pretty uncommon pattern. The primary use of lambda functions is to define a function in place, like when we are providing a function as an argument.

So we have a more precise understanding of how a lambda function works, let's look at how our solution from the previous screen is executed:

![image.png](attachment:image.png)

Let's look at how this works in common usage with `min()`, `max()`, and `sorted()`. We'll use the JSON object from the previous few screens so it's easier to observe what is happening:

In [46]:
jprint(json_obj)

[
    {
        "age": 36,
        "favorite_foods": [
            "Pumpkin",
            "Oatmeal"
        ],
        "name": "Sabine"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Chicken",
            "Pizza",
            "Chocolate"
        ],
        "name": "Zoe"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Caesar Salad"
        ],
        "name": "Heidi"
    }
]


Let's start by using a lambda function with sorted() to sort the items in our JSON list alphabetically by name:

![image.png](attachment:image.png)
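
As a sketch of what this looks like in code, the block below rebuilds the demo data as a Python literal (so it runs on its own) and sorts it by the `name` key. The variable `sorted_by_name` is our own name:

```python
# Demo data, rebuilt here so the block is self-contained
json_obj = [
    {"name": "Sabine", "age": 36, "favorite_foods": ["Pumpkin", "Oatmeal"]},
    {"name": "Zoe", "age": 40, "favorite_foods": ["Chicken", "Pizza", "Chocolate"]},
    {"name": "Heidi", "age": 40, "favorite_foods": ["Caesar Salad"]},
]

# The lambda receives one dictionary at a time and returns the value to sort by
sorted_by_name = sorted(json_obj, key=lambda d: d["name"])

print([d["name"] for d in sorted_by_name])  # ['Heidi', 'Sabine', 'Zoe']
```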

Next, we'll use a lambda function with min() to calculate the item in our JSON list with the smallest age:

![image.png](attachment:image.png)
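
A code sketch of this step, again using a literal copy of the demo data so the block stands alone (`youngest` is our own variable name):

```python
# Demo data, rebuilt here so the block is self-contained
json_obj = [
    {"name": "Sabine", "age": 36, "favorite_foods": ["Pumpkin", "Oatmeal"]},
    {"name": "Zoe", "age": 40, "favorite_foods": ["Chicken", "Pizza", "Chocolate"]},
    {"name": "Heidi", "age": 40, "favorite_foods": ["Caesar Salad"]},
]

# min() compares the values returned by the lambda, not the dictionaries themselves
youngest = min(json_obj, key=lambda d: d["age"])

print(youngest["name"], youngest["age"])  # Sabine 36
```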

Lastly, we'll use a lambda function with `max()` to calculate the item in our JSON list with the largest number of favorite foods:



In [47]:
max(json_obj, key=lambda d:len(d['favorite_foods']) )

{'name': 'Zoe', 'age': 40, 'favorite_foods': ['Chicken', 'Pizza', 'Chocolate']}

![image.png](attachment:image.png)

Over the past three screens, we have:

   - Learned that functions can be passed as arguments.
   - Created functions and used them to calculate minimums and maximums and to sort lists of dictionaries.
   - Learned about lambda functions and how to create them.
   - Learned how to use a lambda function to define a key function in place when calculating minimums and maximums and sorting lists of dictionaries.

We can now apply all of this new knowledge to our Hacker News data to calculate the posts that had the most points in 2014!

In [48]:
hn_sorted_points = sorted(hn_clean, key=lambda p: p['points'] , reverse=True)
top_5_titles=[d['title'] for d in hn_sorted_points[:5]]

# Reading JSON files into pandas

So far, we've worked with our JSON data using pure Python. One other option available to us is to convert the JSON to a pandas dataframe and then use pandas methods to manipulate it.

Pandas has the `pandas.read_json()` function, which is designed to read JSON from either a file or a JSON string. In our case, our JSON exists as Python objects already, so we don't need to use this function.

Because the structure of JSON objects can vary a lot, sometimes you will need to prepare your data in order to be able to convert it to a tabular form. In our case, our data is a list of dictionaries, which pandas is easily able to convert to a dataframe.

Let's look at our example JSON again:

In [49]:
jprint(json_obj)

[
    {
        "age": 36,
        "favorite_foods": [
            "Pumpkin",
            "Oatmeal"
        ],
        "name": "Sabine"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Chicken",
            "Pizza",
            "Chocolate"
        ],
        "name": "Zoe"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Caesar Salad"
        ],
        "name": "Heidi"
    }
]


Each of the dictionaries will become a row in the dataframe, with each key corresponding to a column name.

![image.png](attachment:image.png)

We can use the `pandas.DataFrame()` constructor and pass the list of dictionaries directly to it to convert the JSON to a dataframe:

In [50]:
json_df = pd.DataFrame(json_obj)
print(json_df)

     name  age               favorite_foods
0  Sabine   36           [Pumpkin, Oatmeal]
1     Zoe   40  [Chicken, Pizza, Chocolate]
2   Heidi   40               [Caesar Salad]


In this case, the `favorite_foods` column contains the list from the JSON. We'll see a similar thing with the `tags` column for our Hacker News data. We'll learn how to correct that on the next screen, but for now, let's convert our data to a pandas dataframe.

In [51]:
hn_df = pd.DataFrame(hn_clean)
hn_df

Unnamed: 0,author,numComments,points,url,storyText,createdAt,tags,title,objectId
0,dragongraphics,0,2,http://ashleynolan.co.uk/blog/are-we-getting-t...,,2014-05-29T08:07:50Z,"[story, author_dragongraphics, story_7815238]",Are we getting too Sassy? Weighing up micro-op...,7815238
1,jcr,0,1,http://spectrum.ieee.org/automaton/robotics/ho...,,2014-05-29T08:05:58Z,"[story, author_jcr, story_7815234]",Telemba Turns Your Old Roomba and Tablet Into ...,7815234
2,callum85,0,1,http://online.wsj.com/articles/apple-to-buy-be...,,2014-05-29T08:05:06Z,"[story, author_callum85, story_7815230]",Apple Agrees to Buy Beats for $3 Billion,7815230
3,d3v3r0,0,1,http://alexsblog.org/2014/05/29/dont-wait-for-...,,2014-05-29T08:00:08Z,"[story, author_d3v3r0, story_7815222]",Don’t wait for inspiration,7815222
4,timmipetit,0,1,http://techcrunch.com/2014/05/28/hackerone-get...,,2014-05-29T07:46:19Z,"[story, author_timmipetit, story_7815191]",HackerOne Get $9M In Series A Funding To Build...,7815191
...,...,...,...,...,...,...,...,...,...
35801,lispython,0,3,https://medium.com/p/ff5f4c9b16bd,,2014-01-01T00:33:42Z,"[story, author_lispython, story_6993601]",Engelbart and Kay,6993601
35802,co_pl_te,0,3,http://allthingsd.com/20131231/you-say-goodbye...,,2014-01-01T00:19:47Z,"[story, author_co_pl_te, story_6993568]",You Say Goodbye and We Say Hello,6993568
35803,maurorm,0,1,http://ghiraldelli.pro.br/jesus-e-eu/,,2014-01-01T00:11:06Z,"[story, author_maurorm, story_6993544]",Jesus e eu,6993544
35804,yeukhon,0,1,,https:&#x2F;&#x2F;fundraising.mozilla.org&#x2F;,2014-01-01T00:06:59Z,"[story, author_yeukhon, story_6993536]",Mozilla end-of-year fundraising jumps from $75...,6993536


 # Exploring Tags Using the Apply Function

Let's look at the first few rows of our new `hn_df` dataframe:

In [52]:
hn_df.head()

Unnamed: 0,author,numComments,points,url,storyText,createdAt,tags,title,objectId
0,dragongraphics,0,2,http://ashleynolan.co.uk/blog/are-we-getting-t...,,2014-05-29T08:07:50Z,"[story, author_dragongraphics, story_7815238]",Are we getting too Sassy? Weighing up micro-op...,7815238
1,jcr,0,1,http://spectrum.ieee.org/automaton/robotics/ho...,,2014-05-29T08:05:58Z,"[story, author_jcr, story_7815234]",Telemba Turns Your Old Roomba and Tablet Into ...,7815234
2,callum85,0,1,http://online.wsj.com/articles/apple-to-buy-be...,,2014-05-29T08:05:06Z,"[story, author_callum85, story_7815230]",Apple Agrees to Buy Beats for $3 Billion,7815230
3,d3v3r0,0,1,http://alexsblog.org/2014/05/29/dont-wait-for-...,,2014-05-29T08:00:08Z,"[story, author_d3v3r0, story_7815222]",Don’t wait for inspiration,7815222
4,timmipetit,0,1,http://techcrunch.com/2014/05/28/hackerone-get...,,2014-05-29T07:46:19Z,"[story, author_timmipetit, story_7815191]",HackerOne Get $9M In Series A Funding To Build...,7815191


Just like the `favorite_foods` column in our example data on the previous screen, each item in the `tags` column contains a list from our original JSON.

At first glance, it looks like each list in this column contains three items:

   1. The string `story`
   2. The name of the author
   3. The story ID

If that's the case, then the column doesn't contain any unique data, and we can remove it. We're going to analyze this column to make sure that's the case.

Let's start by exploring how pandas is storing that data. First, we'll extract the column as a series, and check its type:

In [53]:
tags=hn_df['tags']
print(tags.dtype)

object


The tags column is stored as an object type. Whenever pandas uses the object type, each item in the series uses a Python object to store the data. Most commonly we see this type used for string data.
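
A small sketch makes the difference visible. The two series below are invented for illustration: a series of integers gets a numeric dtype, while a series of lists falls back to `object`.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])
lists = pd.Series([["story", "ask_hn"], ["story"]])

print(numbers.dtype)  # a numeric (integer) dtype
print(lists.dtype)    # object, because each item is a Python list
```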

We previously learned that we could use the `Series.apply()` method to apply a function to every item in a series. Let's look at what we get when we pass the `type()` function to `Series.apply()` on the `tags` column:


In [54]:
tags_type = hn_df['tags'].apply(type)
type_counts = tags_type.value_counts(dropna = False)
type_counts

<class 'list'>    35806
Name: tags, dtype: int64

All 35,806 items in the column are Python lists.

Next, let's use `Series.apply()` to check the length of each of those lists. If our hypothesis from earlier is correct, every row will have a list containing three items:


In [55]:
tag_lengths = tags.apply(len)
length_counts = tag_lengths.value_counts(dropna=False)
length_counts

3    33459
4     2347
Name: tags, dtype: int64

While most of the items have three values in the list, 2,347 of them contain four. Let's use a boolean mask to look at the items where the list has four values:

In [56]:
four_tags = tags[tags.apply(len) == 4]
four_tags
# Equivalent version from Dataquest, with the boolean mask named explicitly:
# tags = hn_df['tags']
# has_four_tags = tags.apply(len) == 4
# four_tags = tags[has_four_tags]

43       [story, author_alamgir_mand, story_7813869, sh...
86         [story, author_cweagans, story_7812404, ask_hn]
104      [story, author_nightstrike789, story_7812099, ...
107      [story, author_ISeemToBeAVerb, story_7812048, ...
109         [story, author_Swizec, story_7812018, show_hn]
                               ...                        
35747      [story, author_rpm4321, story_6994970, show_hn]
35759            [story, author_ct, story_6994828, ask_hn]
35778    [story, author_ChrisNorstrom, story_6994370, a...
35787    [story, author_benjamincburns, story_6994163, ...
35792      [story, author_randall, story_6993981, show_hn]
Name: tags, Length: 2347, dtype: object

Let's look at the first few items in the `four_tags` series we just created:

In [57]:
four_tags.head()

43     [story, author_alamgir_mand, story_7813869, sh...
86       [story, author_cweagans, story_7812404, ask_hn]
104    [story, author_nightstrike789, story_7812099, ...
107    [story, author_ISeemToBeAVerb, story_7812048, ...
109       [story, author_Swizec, story_7812018, show_hn]
Name: tags, dtype: object

It looks like whenever there are four tags, the extra tag is the last of the four. In this final exercise of the mission, we're going to use a lambda function to extract this fourth value in cases where there is one. To do this for any single list, we'll need to:

   1. Check the length of the list.
   2. If the length of the list is equal to four, return the last value.
   3. If the length of the list isn't equal to four, return a null value.

This is how we could create this as a standard function:


In [58]:
def extract_tag(l):
    if len(l)==4:
        return l[-1]
    else:
        return None
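
Before applying the function to the whole column, we can check it against a couple of hand-written lists. The sample lists below are modeled on the output above but are typed in for illustration only. We would expect `None` for the three-item list and `ask_hn` for the four-item list.

```python
def extract_tag(l):
    if len(l) == 4:
        return l[-1]
    else:
        return None

# Illustrative inputs, one with three items and one with four
three_items = ["story", "author_jcr", "story_7815234"]
four_items = ["story", "author_cweagans", "story_7812404", "ask_hn"]

print(extract_tag(three_items))
print(extract_tag(four_items))
```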

We could use `Series.apply()` to apply this function as is, but to practice working with lambda functions, let's look at how we can complete this operation in a single line.

To achieve this, we'll have to use a special version of an if statement known as a **ternary operator** (the Python documentation calls it a conditional expression). You can use the ternary operator whenever you need to return one of two values depending on a boolean expression. The syntax is as follows:


[on_true] if [expression] else [on_false]
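
As a quick illustration that is separate from our data, the sketch below uses the same pattern to label a number as even or odd. The `describe()` name and the sample values are made up.

```python
def describe(n):
    # [on_true] if [expression] else [on_false]
    return "even" if n % 2 == 0 else "odd"

print(describe(4))
print(describe(7))
```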

The diagram below shows our function using an if statement and its ternary operator equivalent:

![image.png](attachment:image.png)

Let's finish by creating a lambda function version of this function and using apply to extract the tags.

In [59]:
cleaned_tags = tags.apply(lambda l: l[-1] if len(l) == 4 else None)
hn_df['tags'] = cleaned_tags
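
To see what the lambda produces, here is a minimal sketch on a three-row stand-in for the `tags` column. The `sample_tags` series is invented for illustration; on the real data we could run the same `value_counts()` call on `cleaned_tags`.

```python
import pandas as pd

# Illustrative stand-in for the tags column: one three-item list, two four-item lists
sample_tags = pd.Series([
    ["story", "author_jcr", "story_7815234"],
    ["story", "author_cweagans", "story_7812404", "ask_hn"],
    ["story", "author_Swizec", "story_7812018", "show_hn"],
])

sample_cleaned = sample_tags.apply(lambda l: l[-1] if len(l) == 4 else None)

# dropna=False keeps the rows without a fourth tag in the counts
print(sample_cleaned.value_counts(dropna=False))
```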
