<a href="https://colab.research.google.com/github/TomazFilgueira/Dq_Datascientist/blob/main/04_Data_Cleaning/04_2_Advanced_data_cleaning/04_2_3_List_Comprehensions_Lambda_Functions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1) JSON Format

The data set we'll use in this lesson is in a format called JavaScript Object Notation (**JSON**). As the name indicates, JSON originated from the JavaScript language, but has now become a language-independent format.

From a Python perspective, JSON can be thought of as a collection of Python objects nested inside each other.

![image.png](attachment:image.png)

The JSON above is a list, where each element in the list is a dictionary. Each of the dictionaries has the same keys, and one of the values of each dictionary is itself a list.

The Python `json` module contains a number of functions to make working with JSON objects easier. We can use the `json.loads()` function to convert JSON data contained in a string to the equivalent set of Python objects:

In [4]:
json_string = """
[
  {
    "name": "Sabine",
    "age": 36,
    "favorite_foods": ["Pumpkin", "Oatmeal"]
  },
  {
    "name": "Zoe",
    "age": 40,
    "favorite_foods": ["Chicken", "Pizza", "Chocolate"]
  },
  {
    "name": "Heidi",
    "age": 40,
    "favorite_foods": ["Caesar Salad"]
  }
]
"""

import json
json_obj = json.loads(json_string)
print(type(json_obj))

print(json_obj)

<class 'list'>
[{'name': 'Sabine', 'age': 36, 'favorite_foods': ['Pumpkin', 'Oatmeal']}, {'name': 'Zoe', 'age': 40, 'favorite_foods': ['Chicken', 'Pizza', 'Chocolate']}, {'name': 'Heidi', 'age': 40, 'favorite_foods': ['Caesar Salad']}]


We can observe a few things:

* The formatting from our original string is gone. This is because printing Python lists and dictionaries has a simple formatting structure.
* The order of the keys in the dictionary may appear different. While Python dictionaries maintain insertion order in modern versions, the JSON specification does not guarantee key order preservation. This means that when working with JSON data, the order of keys might not always be consistent.
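
Because the parsed result is made of ordinary lists and dictionaries, nested values can be reached by chaining index and key lookups. The sketch below re-creates a shortened version of the example string; the names `people`, `first_person`, and `second_food` are only illustrative.

```python
import json

# shortened copy of the example data from this section
json_string = """
[
  {"name": "Sabine", "age": 36, "favorite_foods": ["Pumpkin", "Oatmeal"]},
  {"name": "Zoe", "age": 40, "favorite_foods": ["Chicken", "Pizza", "Chocolate"]}
]
"""

people = json.loads(json_string)

# people is a list, so we index it to get a dictionary
first_person = people[0]

# the "favorite_foods" value is itself a list, so we can index again
second_food = first_person["favorite_foods"][1]

print(first_person["name"])  # Sabine
print(second_food)           # Oatmeal
```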

# Instructions

Let's practice using `json.loads()` to convert JSON data from a string to Python objects!

We have created a JSON string, `world_cup_str`, which contains data about games from the 2018 Football World Cup.

1. Import the `json` module.

1. Use `json.loads()` to convert `world_cup_str` to a Python object. Assign the result to `world_cup_obj`.

In [None]:
world_cup_str = """
[
    {
        "team_1": "France",
        "team_2": "Croatia",
        "game_type": "Final",
        "score" : [4, 2]
    },
    {
        "team_1": "Belgium",
        "team_2": "England",
        "game_type": "3rd/4th Playoff",
        "score" : [2, 0]
    }
]
"""

import json

world_cup_obj = json.loads(world_cup_str)

# 2) Reading a JSON file

One of the places where the JSON format is commonly used is in the results returned by an application programming interface (**API**). APIs are interfaces that can be used to send and receive data between different computer systems.

The data set from this lesson — `hn_2014.json` — was downloaded from the Hacker News API. It's a different set of data from the CSV we've been using in the previous two lessons, and it contains data about stories from Hacker News in 2014.

To read a JSON file, we use the `json.load()` function. Note that the function is `json.load()`, without an "s" at the end. The `json.loads()` function is used for **loading JSON data from a string** ("loads" is short for "load string"), whereas the `json.load()` function is used to **load from a file object**. Let's look at how we would read in our data:

*Note that we're using `with` to open the file, which is a better practice than just using `open()`, because the file is closed automatically when the block ends.*

```
import json
with open("hn_2014.json", "r") as file:
    hn = json.load(file)

print(type(hn))
```

Our hn variable is a **list**. Let's find out **how many objects** are in the list, and the type of the first object (which will almost always be the type of every object in the list in **JSON data**):

```
print(len(hn))
print(type(hn[0]))
```

Our data set contains **35,806 dictionary objects**, each representing a Hacker News story. In order to understand the format of our data set, we'll print the keys of the first dictionary:

```
print(hn[0].keys())

dict_keys(['author', 'numComments', 'points', 'url', 'storyText', 'createdAt', 'tags', 'createdAtI', 'title', 'objectId'])
```

If we recall the data set we used in the previous two lessons, we can see some similarities. There are keys representing the title, **URL**, **points**, **number of comments**, and **date**, as well as some **others** that are less familiar to us. Here is a summary of the keys and the data that they contain:

`author`: The username of the person who submitted the story.

`createdAt`: The date and time at which the story was created.

`createdAtI`: An integer value representing the date and time at which the story was created.

`numComments`: The number of comments that were made on the story.

`objectId`: The unique identifier from Hacker News for the story.

`points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes.

`storyText`: The text of the story (if the story contains text).

`tags`: A list of tags associated with the story.

`title`: The title of the story.

`url`: The URL that the story links to (if the story links to a URL).

## Instructions

1. Use the `open()` function to open the `hn_2014.json` file as a file object.

1. Use the `json.load()` function to parse the file object and assign the result to `hn`.

In [5]:
import json
#Open via google colab
with open("/content/hn_2014.json", "r") as file:
    hn = json.load(file)
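
To see the difference between `json.load()` and `json.loads()` without needing the data file, here is a minimal sketch. It uses `io.StringIO` from the standard library as a stand-in for an open file, and the `raw` string is made-up data, not part of `hn_2014.json`.

```python
import io
import json

# made-up JSON text, used only for illustration
raw = '[{"title": "Example story", "points": 3}]'

# json.loads() parses a string directly
from_string = json.loads(raw)

# json.load() reads from a file-like object;
# StringIO stands in for a file opened with open()
from_file = json.load(io.StringIO(raw))

print(from_string == from_file)  # True
```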

# 3) Deleting Dictionary Keys

Let's look at the first dictionary in full. To make it easier to read, we'll create a function that prints a JSON object with indentation and sorted keys.

The function will use the `json.dumps()` function ("**dump string**"), which does the opposite of the `json.loads()` function: it takes a **Python object and returns a JSON-formatted string version** of it.

The `json.dumps()` function accepts arguments that can specify formatting for the string, which we'll use to make things easier to read:

In [22]:
def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

first_story = hn[0]
jprint(first_story)

{
    "author": "dragongraphics",
    "createdAt": "2014-05-29T08:07:50Z",
    "createdAtI": 1401350870,
    "numComments": 0,
    "objectId": "7815238",
    "points": 2,
    "storyText": "",
    "tags": [
        "story",
        "author_dragongraphics",
        "story_7815238"
    ],
    "title": "Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability",
    "url": "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy"
}


You may notice that the `createdAt` and `createdAtI` keys both have the date and time data in two different formats. Because the format of `createdAt` is much easier to understand, let's do some data cleaning by deleting the `createdAtI` key from every dictionary.

To **delete** a key from a dictionary, we can use the `del` statement. Let's learn the syntax by looking at a simple example:

In [9]:
d = {'a': 1, 'b': 2, 'c': 3}
del d['a']
print(d)

{'b': 2, 'c': 3}


We can create a function using `del` that will return a copy of our dictionary with the key removed.

The dictionary returned by the function no longer includes the `createdAtI` key.

In [11]:
def del_key(dict_, key):
    # create a copy so we don't
    # modify the original dict
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

#Let's use this function to delete the createdAtI key from first_story:

first_story = del_key(first_story, 'createdAtI')
jprint(first_story)

{
    "author": "dragongraphics",
    "createdAt": "2014-05-29T08:07:50Z",
    "numComments": 0,
    "objectId": "7815238",
    "points": 2,
    "storyText": "",
    "tags": [
        "story",
        "author_dragongraphics",
        "story_7815238"
    ],
    "title": "Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability",
    "url": "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy"
}
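
Two details of `del_key()` are worth keeping in mind. First, `dict.copy()` makes a shallow copy, so nested values such as the `tags` list are still shared with the original dictionary. Second, `del` raises a `KeyError` when the key is not present. The sketch below shows both behaviors with a made-up `story` dictionary; the variable names are only illustrative.

```python
def del_key(dict_, key):
    # same function as above: copy, delete, return
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

# made-up story, used only for illustration
story = {"title": "Example", "createdAtI": 1401350870, "tags": ["story"]}

cleaned = del_key(story, "createdAtI")
print("createdAtI" in story)    # True: the original is unchanged
print("createdAtI" in cleaned)  # False

# shallow copy: the nested list is the very same object in both dicts
print(cleaned["tags"] is story["tags"])  # True

# deleting a key that is already gone raises KeyError
missing = None
try:
    del_key(cleaned, "createdAtI")
except KeyError as err:
    missing = err.args[0]
    print(f"KeyError: {missing}")
```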


## Instructions

Let's use a loop and the `del_key()` function to remove the `createdAtI` key from every story in our Hacker News data set:

1. Create an empty list, `hn_clean`, to store the cleaned data set.

1. Loop over the dictionaries in the `hn` list. In each iteration:

  * Use the `del_key()` function to delete the `createdAtI` key from the dictionary.

  * Append the cleaned dictionary to `hn_clean`.

In [6]:
def del_key(dict_, key):
    # create a copy so we don't
    # modify the original dict
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

hn_clean = []

for story in hn:
    # each item in the hn list is a dictionary; get a copy without createdAtI
    cleaned_story = del_key(story, 'createdAtI')

    # append the cleaned dictionary to hn_clean
    hn_clean.append(cleaned_story)

# 4) Writing List Comprehensions

The task we performed previously is an extremely common one. Specifically, we:

* Iterated over values in a list.
* Performed a transformation on those values.
* Assigned the result to a new list.

Python includes a special syntax shortcut for tasks that meet these criteria: **List Comprehensions**. A list comprehension provides a concise way of creating lists in a single line of code.

List comprehensions can look complex at first, but we are simply reordering the elements of our for loop code. To keep things simple, we'll start with a basic example, where we want to add 1 to each item in a list of integers.

![Alt text](https://s3.amazonaws.com/dq-content/355/loop_components.svg)

In [7]:
ints = [1, 2, 3, 4]

plus_one = []
for i in ints:
    plus_one.append(i + 1)

print(plus_one)

[2, 3, 4, 5]


To transform this structure into a list comprehension, we do the following within brackets:

1. Start with the code that transforms each item.
1. Continue with our for statement (without a colon).

We can then assign the list comprehension to a variable name.

Let's look at another example, where we want to multiply each item in the list by **10**:

```python
times_ten = []
for i in ints:
    times_ten.append(i * 10)

print(times_ten)
```

To convert this to a list comprehension, we follow the same pattern:

![Alt text](https://s3.amazonaws.com/dq-content/355/loop_vs_lc_2.svg)
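
In text form, the comprehension equivalents of the two loops above would look roughly like this. It is a sketch derived from the loops shown in this section; the diagrams remain the reference.

```python
ints = [1, 2, 3, 4]

# transformation first (i + 1), then the for clause without a colon
plus_one = [i + 1 for i in ints]
print(plus_one)   # [2, 3, 4, 5]

# same pattern, different transformation
times_ten = [i * 10 for i in ints]
print(times_ten)  # [10, 20, 30, 40]
```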

In the example below, we are applying a function to a list of floats to round them to integers:

```python
floats = [2.1, 8.7, 4.2, 8.9]

rounded = []
for f in floats:
    rounded.append(round(f))

print(rounded)
```

To convert to a list comprehension, we simply rearrange the components:

![Alt text](https://s3.amazonaws.com/dq-content/355/loop_vs_lc_3.svg)
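
Written out as code, the rearranged version of the rounding loop might look like this (a sketch based on the loop above):

```python
floats = [2.1, 8.7, 4.2, 8.9]

# the transformation is a function call applied to each item
rounded = [round(f) for f in floats]
print(rounded)  # [2, 9, 4, 9]
```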



## Instructions

We are going to write a list comprehension version of the code from the previous section.

![Alt text](https://s3.amazonaws.com/dq-content/355/loop_components_hn.svg)

1. Create a list comprehension representation of the loop from the previous section:

* Call the `del_key()` function to remove the `createdAtI` value from each dictionary in the `hn` list.
* Assign the results to a new list, `hn_clean`.

In [13]:
hn_clean = [del_key(d, 'createdAtI') for d in hn]

# 5) Using List Comprehensions to Transform and Create Lists

List comprehensions can be used for many different things. Three common applications are:

1. Transforming a list
1. Creating a new list
1. Reducing a list

The first application, **transforming a list**, is the category that all the examples you've seen so far fit under. You are taking an existing list, applying a transformation to every value, and assigning it to a variable.

The second application, **creating a new list**, is useful for creating test data or data that is based on a set of numbers.

As an example, let's create a list of generic column names that we could use to create a dataframe, using the `range()` function and f-strings to combine numbers and text:

![Alt text](https://s3.amazonaws.com/dq-content/355/lc_application_2_2_v2.svg)

We can then use this to create an empty dataframe with labels:

In [11]:
import numpy as np
import pandas as pd

cols = [f"col_{i}" for i in range(1,5)]
data = np.zeros((4,4))

df = pd.DataFrame(data, columns=cols)
print(df)

   col_1  col_2  col_3  col_4
0    0.0    0.0    0.0    0.0
1    0.0    0.0    0.0    0.0
2    0.0    0.0    0.0    0.0
3    0.0    0.0    0.0    0.0


## Instructions

Let's use a list comprehension to create a new list containing just the URLs from each story.

Use a list comprehension to extract the **url** value from each dictionary in `hn_clean`. Assign the result to `urls`.

In [17]:

#1) extract url from dict key ['url']
#2) loop over hn_clean
urls = [new_url['url'] for new_url in hn_clean]
print(urls[:5])

['http://ashleynolan.co.uk/blog/are-we-getting-too-sassy', 'http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot', 'http://online.wsj.com/articles/apple-to-buy-beats-1401308971', 'http://alexsblog.org/2014/05/29/dont-wait-for-inspiration/', 'http://techcrunch.com/2014/05/28/hackerone-get-9m-in-series-a-funding-to-build-bug-tracking-bounty-programs/'] 



# 6) Using List Comprehensions to Reduce a List

The last common application of list comprehensions is **reducing a list**. Let's say we had a list of integers and we wanted to remove any integers that were smaller than 50. We could do this by adding an if statement to our loop:

![Alt text](https://s3.amazonaws.com/dq-content/355/lc_application_3_1.svg)

Our loop has one new component — the if statement, which we've colored yellow. Notice that instead of a transformation, we have just the list item itself `(i)` in red. Both if statements and transformations are **optional** in list comprehensions, but we must include some value to populate the elements in the new list.

To include an if statement in a list comprehension, we include it at the very end, before the closing bracket:

![Alt text](https://s3.amazonaws.com/dq-content/355/lc_application_3_2.svg)
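
As a text version of the idea in the diagrams, here is a minimal sketch of the loop and its list comprehension equivalent. The values in `ints` and the names `big_ints_loop` and `big_ints` are made up for illustration and may differ from those in the diagrams.

```python
ints = [25, 14, 13, 84, 43, 6, 77, 56]  # made-up sample values

# loop version: keep only the integers that are not smaller than 50
big_ints_loop = []
for i in ints:
    if i >= 50:
        big_ints_loop.append(i)

# comprehension version: the if clause goes at the very end
big_ints = [i for i in ints if i >= 50]

print(big_ints)  # [84, 77, 56]
```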

Let's look at how we could use this to quickly **count the number of stories that have comments**. We'll start with a version using a loop:

In [18]:
has_comments = []

for d in hn_clean:
    if d['numComments'] > 0:
        has_comments.append(d)

num_comments = len(has_comments)
print(num_comments)

9279


In [19]:
#Now, let's use a list comprehension to perform the same calculation:

has_comments = [d for d in hn_clean if d['numComments'] > 0]

num_comments = len(has_comments)
print(num_comments)

9279
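
If only the count is needed, the comprehension can be passed straight to `len()` without an intermediate variable. The sketch below uses a small made-up `stories` list rather than `hn_clean`, so it can be run on its own.

```python
# made-up stories, used only for illustration
stories = [
    {"title": "A", "numComments": 0},
    {"title": "B", "numComments": 12},
    {"title": "C", "numComments": 3},
]

# filter and count in one expression
num_with_comments = len([s for s in stories if s["numComments"] > 0])
print(num_with_comments)  # 2
```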


## Instructions

Let's use a list comprehension to count how many stories have **more than 1,000 points.**

1. Use a list comprehension to create a new list, `thousand_points`:

  * The list should contain values from `hn_clean` where the `points` key has a value greater than `1000`.

1. Count the number of values in `thousand_points` and assign the result to `num_thousand_points`.

In [20]:
thousand_points = [d for d in hn_clean if d['points'] > 1000]

num_thousand_points = len(thousand_points)
print(num_thousand_points)

8


# 7) Passing Functions as Arguments

What if we wanted to use built-in functions to work with data in JSON form?

Let's use our demo JSON object to try and see what happens. First, we'll quickly remind ourselves of the data:

In [23]:
jprint(json_obj)

[
    {
        "age": 36,
        "favorite_foods": [
            "Pumpkin",
            "Oatmeal"
        ],
        "name": "Sabine"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Chicken",
            "Pizza",
            "Chocolate"
        ],
        "name": "Zoe"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Caesar Salad"
        ],
        "name": "Heidi"
    }
]


In [24]:
# Let's try and use Python to return the dictionary of the person with the lowest age:
min(json_obj)

TypeError: '<' not supported between instances of 'dict' and 'dict'

We get an error because Python doesn't have any way to tell whether one dictionary object is "greater" than another.

There is a way we can actually tell functions like `min()`, `max()`, and `sorted()` how to sort complex objects like dictionaries and lists of lists. We do this by using the optional `key` argument. The official Python documentation contains the following excerpt that describes how the argument works:

  *`key` specifies a function of one argument that is used to extract a **comparison** key from each list element. The key corresponding to each item in the list is calculated once and then used for the entire sorting process*

This excerpt tells us that we need to pass a function as an argument to control the comparison between values.

## Example



In [25]:
def greet():
    return "hello"

greet()

'hello'

If we try to examine the type of our function, we are unsuccessful:

In [26]:
t = type(greet())
print(t)

<class 'str'>


What happens is that `greet()` is executed first; it returns the string "hello", and then the `type()` function tells us the type of that string:

![Alt text](https://s3.amazonaws.com/dq-content/355/type_of_func_1.svg)

The parentheses are what tell Python to execute the function, so if we **omit** the parentheses we can treat a function like a **variable**, rather than working with the output of the function:

In [27]:
t = type(greet)
print(t)

<class 'function'>


There are other variable-like behaviors we can also use when we omit the parentheses from a function. For instance, we can assign a function to a new variable name:

In [28]:
greet_2 = greet

greet_2()

'hello'

Now that we understand how to treat a function as a variable, let's look at how we can run a function inside another function by passing it as an argument:

![Alt text](https://s3.amazonaws.com/dq-content/355/func_as_arg.svg)
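
In code, passing a function as an argument could look like the sketch below. The helper name `run_func` is chosen for illustration and may not match the name used in the diagram.

```python
def greet():
    return "hello"

def run_func(func):
    # func holds a function object; the parentheses here execute it
    return func()

# pass greet itself (no parentheses), not the result of calling it
result = run_func(greet)
print(result)  # hello
```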

Now that we have some intuition on how to pass functions as arguments, let's see how we use a function to control the behavior of the `sorted()` function:

![Alt text](https://s3.amazonaws.com/dq-content/355/sorted_key.svg)

Let's look at the same thing in code form:

In [29]:
def get_age(json_dict):
    return json_dict['age']

youngest = min(json_obj, key=get_age)
jprint(youngest)

{
    "age": 36,
    "favorite_foods": [
        "Pumpkin",
        "Oatmeal"
    ],
    "name": "Sabine"
}
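
The same key function also works with `sorted()` and `max()`. The sketch below uses a shortened, self-contained copy of the demo data (the `people` list and the other variable names are illustrative). Note that two people share the age 40: `sorted()` keeps tied items in their original order, and `max()` returns the first of the tied items it encounters.

```python
# shortened copy of the demo data, without the favorite_foods key
people = [
    {"name": "Sabine", "age": 36},
    {"name": "Zoe", "age": 40},
    {"name": "Heidi", "age": 40},
]

def get_age(person):
    return person["age"]

# sort from oldest to youngest using the key function
by_age_desc = sorted(people, key=get_age, reverse=True)
names = [p["name"] for p in by_age_desc]
print(names)  # ['Zoe', 'Heidi', 'Sabine']

# max() uses the key function in the same way as min()
oldest = max(people, key=get_age)
print(oldest["name"])  # Zoe
```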


## Instructions

Let's use this technique to find the story that has the **greatest** number of comments.

1. Create a "key function" that accepts a single dictionary and returns the value from the `numComments` key.
1. Use the `max()` function with the "key function" you just created to find the value from the `hn_clean` list with the most comments:

  * Assign the result to the variable `most_comments`.

In [30]:
def key_function(json_dict):
    return json_dict['numComments']

most_comments = max(hn_clean, key=key_function)
print(most_comments)

{'author': 'platz', 'numComments': 1208, 'points': 889, 'url': 'https://blog.mozilla.org/blog/2014/04/03/brendan-eich-steps-down-as-mozilla-ceo/', 'storyText': None, 'createdAt': '2014-04-03T19:02:53Z', 'tags': ['story', 'author_platz', 'story_7525198'], 'title': 'Brendan Eich Steps Down as Mozilla CEO', 'objectId': '7525198'}
