## Python Basics - Challenge



- The file `guardian_articles_corona.json` contains UTF-8 encoded articles for the search term *coronavirus* in the year 2020 from [The Guardian API](http://open-platform.theguardian.com/) (retrieved 13/05/2020)
- The objective is to simplify the data structure such that analyses can be run afterwards
- Make use of the exercises and notebooks we have discussed previously
- The challenge is much more comprehensive than the other tasks. It's OK if the solution takes more time. You might also want to tackle the challenge in your groups.

### 1.

Download the `JSON` file, read it into Python and familiarise yourself with the data structure. How many articles does the file contain?
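
A first look at the structure could be sketched as follows. The two-record string `sample` is made up for illustration and only mimics the shape of the real file (a JSON list of dictionaries); with the real data, the same calls would be applied to the object returned by `json.load`.

```python
import json

# Made-up stand-in for the file contents (assumption: a JSON list of dicts)
sample = '[{"id": "a/1", "fields": {"charCount": "3"}}, {"id": "b/2", "fields": {"charCount": "5"}}]'

data = json.loads(sample)

print(type(data))               # type of the top-level container
print(len(data))                # number of records
print(sorted(data[0].keys()))   # keys of the first record
print(type(data[0]["fields"]))  # nested values can be inspected the same way
```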

### 2. 

Write a function to process the list of articles. Simplify the data structure according to the following input / output example:

**Input:**

```
{
    'id': 'world/2020/may/08/coronavirus-the-week-explained',
    'type': 'article',
    'sectionId': 'world',
    'sectionName': 'World news',
    'webPublicationDate': '2020-05-08T10:54:45Z',
    'webTitle': 'Coronavirus: the week explained',
    'webUrl': 'https://www.theguardian.com/world/2020/may/08/coronavirus-the-week-explained',
    'apiUrl': 'https://content.guardianapis.com/world/2020/may/08/coronavirus-the-week-explained',
    'fields': {
      'bodyText': 'Welcome to our weekly roundup of developments in the coronavirus pandemic, which continues ...',
      'charCount': '6139'},     
   'tags': 
   [{'id': 'world/coronavirus-outbreak',
   'type': 'keyword',
   'sectionId': 'world',
   'sectionName': 'World news',
   'webTitle': 'Coronavirus outbreak',
   'webUrl': 'https://www.theguardian.com/world/coronavirus-outbreak',
   'apiUrl': 'https://content.guardianapis.com/world/coronavirus-outbreak',
   'references': []},
  {'id': 'science/science',
   'type': 'keyword',
   'sectionId': 'science',
   'sectionName': 'Science',
   'webTitle': 'Science',
   'webUrl': 'https://www.theguardian.com/science/science',
   'apiUrl': 'https://content.guardianapis.com/science/science',
   'references': []}]
   ...

```

**Output:**

```
{'chars': 6139,
 'id': 'world/2020/may/08/coronavirus-the-week-explained',
 'section': 'World news',
 'tags': 'world/coronavirus-outbreak, science/science',
 'text': 'Welcome to our weekly roundup of developments in the coronavirus pandemic, which continues ...',
 'title': 'Coronavirus: the week explained',
 'url': 'https://www.theguardian.com/world/2020/may/08/coronavirus-the-week-explained',
 'month': 5}
```
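
The `month` value in the output is an integer derived from `webPublicationDate`. One possible way to obtain it is sketched below; the helper name `month_from_timestamp` is hypothetical, and the format string assumes that every timestamp looks like the one in the example.

```python
from datetime import datetime

def month_from_timestamp(timestamp):
    """Return the month (1-12) of a timestamp such as '2020-05-08T10:54:45Z'."""
    # Assumption: all timestamps follow the pattern shown in the input example
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.month

print(month_from_timestamp("2020-05-08T10:54:45Z"))  # 5
```

Slicing the string (`timestamp[5:7]`) and converting the result with `int()` gives the same number; parsing additionally raises an error if a timestamp does not have the expected format.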

### 3.
The variable `chars` in your processed articles contains the number of characters in the text. Use a sample article to check whether this value is correct.
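
Beyond a single sample, the same comparison can be run over all processed articles. The following is a minimal sketch on made-up records; the helper name `chars_mismatches` is hypothetical.

```python
def chars_mismatches(articles):
    """Return the ids of articles whose 'chars' value differs from len(text)."""
    return [a["id"] for a in articles if a["chars"] != len(a["text"])]

# Made-up records in the processed format
toy = [
    {"id": "ok", "chars": 5, "text": "hello"},
    {"id": "off", "chars": 4, "text": "hello"},
]

print(chars_mismatches(toy))  # ['off']
```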
      
### 4.
Find out in which month most articles were published.

### 5.
Find the three most frequently used tags from all articles.

### 6.
Return the titles of the five longest articles (length measured by number of characters).

### 7.
Store the processed articles in a `JSON` file. Be careful to specify the text encoding as `utf-8`.

In [1]:
# Code for Python challenge

### Task 1
import json
from pprint import pprint

with open("guardian_articles_corona.json", 'r', encoding = 'utf-8') as json_file:
    articles = json.load(json_file)

print("Number of articles: " + str(len(articles)))

pprint(articles[0], indent = 2)

Number of articles: 10801
{ 'apiUrl': 'https://content.guardianapis.com/world/2020/may/08/coronavirus-the-week-explained',
  'fields': { 'bodyText': 'Welcome to our weekly roundup of developments in '
                          'the coronavirus pandemic, which continues to pose '
                          'new political, scientific and personal challenges '
                          'around the world. As the UK is among several '
                          'countries moving towards the lifting of some '
                          'restrictions, it remains under pressure to deliver '
                          'enough tests – and the role of scientific advisers '
                          'has come under renewed scrutiny. Public trust in '
                          'science grows during the pandemic, as top scientist '
                          'quits Sage It’s been a rocky week for science '
                          'advisers. The prominent disease modelling expert '
                     

In [2]:
### Task 2
def short_guardian(article):
    """Reduce one raw Guardian article to the flat structure from task 2."""
    D = {'chars': int(article['fields']['charCount']),
         'id': article['id'],
         'section':  article['sectionName'],
         'tags': ', '.join(tag['id'] for tag in article['tags']),
         'title': article['fields']['headline'],
         'text': article['fields']['bodyText'],
         'url': article['webUrl'],
         # indices 5 and 6 of the timestamp hold the month; int() yields 5 instead of '05'
         'month': int(article['webPublicationDate'][5:7])}
    return D

articles_short = []

for article in articles:
    D = short_guardian(article)
    articles_short.append(D)

In [3]:
### Task 3
articles_short[10]['chars'] == len(articles_short[10]['text'])

True

In [4]:
### Task 4
from collections import Counter

months = []

for article in articles_short:
    months.append(article['month'])

max_month = Counter(months)
print(max_month)

Counter({4: 4307, 3: 4076, 5: 1479, 2: 744, 1: 195})
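
The printed counter lists every month; to return only the month with the most articles, `Counter.most_common(1)` can be used. The following is a minimal sketch on made-up month values, with the hypothetical helper name `busiest_month`.

```python
from collections import Counter

def busiest_month(months):
    """Return (month, count) for the most frequent entry in `months`."""
    return Counter(months).most_common(1)[0]

# Made-up month values for illustration
toy_months = [3, 4, 4, 5, 4, 3]

month, count = busiest_month(toy_months)
print(month, count)  # 4 3
```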


In [5]:
### Task 5
from collections import Counter

tags = []
for article in articles_short:
    tags.append(article['tags'])
    
tags_list = ', '.join(tags).split(', ')

tags_c = Counter(tags_list)
tags_c.most_common(3)


[('world/coronavirus-outbreak', 8238), ('uk/uk', 3742), ('world/world', 3289)]
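
Joining all tag strings and splitting them again works, but an article without any tags would contribute an empty string that is then counted as a tag. Whether the real data contains such articles is not checked here. A sketch that counts per article and skips empty entries, on made-up records and with the hypothetical helper name `count_tags`:

```python
from collections import Counter

def count_tags(articles):
    """Count tag ids over all articles, where 'tags' is a ', '-joined string."""
    counts = Counter()
    for article in articles:
        # Skip empty strings so that untagged articles do not add a '' tag
        counts.update(tag for tag in article["tags"].split(", ") if tag)
    return counts

# Made-up records in the processed format
toy = [
    {"tags": "world/coronavirus-outbreak, science/science"},
    {"tags": "world/coronavirus-outbreak"},
    {"tags": ""},
]

print(count_tags(toy).most_common(1))  # [('world/coronavirus-outbreak', 2)]
```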

In [6]:
### Task 6
articles_sorted = sorted(articles_short, key = lambda k: k['chars'], reverse = True)

# the task asks for the five longest articles
for article in articles_sorted[0:5]:
    print(article['title'])

Canada closes borders to foreigners – as it happened
Covid-19 outbreak like a nuclear explosion, says archbishop of Canterbury – as it happened
Covid-19 outbreak like a nuclear explosion, says archbishop of Canterbury – as it happened
...
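
Sorting the full list is one option. As an alternative sketch, `heapq.nlargest` from the standard library returns the top entries without sorting everything. The records below are made up, and the helper name `longest_titles` is hypothetical.

```python
import heapq

def longest_titles(articles, n=5):
    """Return the titles of the n articles with the largest 'chars' value."""
    top = heapq.nlargest(n, articles, key=lambda a: a["chars"])
    return [a["title"] for a in top]

# Made-up records in the processed format
toy = [
    {"title": "short", "chars": 10},
    {"title": "long", "chars": 300},
    {"title": "medium", "chars": 120},
]

print(longest_titles(toy, n=2))  # ['long', 'medium']
```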


In [7]:
### Task 7
with open('guardian_articles_short.json', 'w', encoding = 'utf-8') as outfile:
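    # ensure_ascii = False writes non-ASCII characters as UTF-8 instead of \uXXXX escapes
    json.dump(articles_short, outfile, ensure_ascii = False)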
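
To check that the stored file can be read back unchanged, a round trip can be sketched as follows. The record and the temporary file path are made up for illustration. `ensure_ascii=False` is used here so that non-ASCII characters are written as such rather than as `\uXXXX` escapes.

```python
import json
import os
import tempfile

# Made-up record containing a non-ASCII character
toy = [{"title": "Café au lait", "chars": 12}]

# Temporary location so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), "toy_articles.json")

with open(path, "w", encoding="utf-8") as outfile:
    json.dump(toy, outfile, ensure_ascii=False)

with open(path, "r", encoding="utf-8") as infile:
    restored = json.load(infile)

print(restored == toy)  # True
```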

<br>
<br>


___

                
**Contact: Gerome Wolf** (Email: wolfgerome@gmail.com)