# Homework (deadline 18.11.2022 13:44:59)
Write solutions for the homework exercises in this notebook. Once the work is done, download the notebook file (`File > Download .ipynb`), rename it so that it follows the template `HW1_<SURNAME>_<NAME>.ipynb`, and send the file to me. My email address is as follows:

* <m.biesaga@uw.edu.pl>

Remember that you can contact me via email if you have any problems. You can also visit me at the ISS on the fourth floor (room 415). I am usually there from around 11, but please let me know in advance if you are coming because I might be busy.


## Task 1 (5 points)

Read about the `pageviews` method (`prop=pageviews`) of the `query` endpoint ([docpage](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bpageviews)). Use this method to extract page view data for the last 60 days for the pages from the exercise we did in class (if you want, you can sample 10 new pages with the `list=random` method). The results are broken down by day, so you have to aggregate (sum) them to get the total page view count for the entire 60-day period. Remember that to select pages by page id you pass `pageids=<id 1>|<id 2>|...|<id n>`. We did a very similar thing when we extracted article content through the `cirrusdoc` method of the Wikipedia API. Your final output should be a `dict` object that maps page ids to page views (the total number of page views over 60 days). It should look something like this:

```python
results = {
    # page_id: pageviews
    153253: 10204,
    423423: 101,
    11012:  12,
    42435:  546,
    # and so on
}
```
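
The aggregation step can be tried out without calling the API. The sketch below is only an illustration: `total_pageviews` and `sample_pages` are hypothetical names, and the sample data is made up to mimic the shape of `response.json()['query']['pages']`. It assumes that days without data arrive as `None` and that a page may lack the `pageviews` key entirely.

```python
def total_pageviews(pages):
    """Map each page id (int) to the sum of its daily view counts."""
    totals = {}
    for page in pages.values():
        # A page may come back without a 'pageviews' key; treat it as empty.
        daily = page.get("pageviews") or {}
        # Days without data are None; count them as zero.
        totals[page["pageid"]] = sum(views or 0 for views in daily.values())
    return totals


# Made-up sample shaped like response.json()['query']['pages'].
sample_pages = {
    "101": {
        "pageid": 101,
        "title": "Page A",
        "pageviews": {"2022-11-01": 5, "2022-11-02": None, "2022-11-03": 7},
    },
    "202": {"pageid": 202, "title": "Page B"},  # no 'pageviews' key at all
}

results = total_pageviews(sample_pages)
print(results)  # {101: 12, 202: 0}
```

Keying the result by `page['pageid']` rather than by `page['title']` gives the page id to page views mapping that the task asks for.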

In [None]:
## Import module requests
import requests

## Some page ids
page_ids = [
    19969580,
    39982842,
    25699035,
    52642931,
    53055349,
    24133565,
    1164662,
    40656459,
    12533026,
    47110862
]

## API URL
BASE_URL = 'https://en.wikipedia.org/w/api.php'

In [None]:
## Define the payload
params = {
    'action': 'query',
    'prop': 'pageviews',
    'pageids': '|'.join(str(pid) for pid in page_ids),
    'pvipdays': 60,
    'format': 'json'
}
## Send the request
response = requests.get(BASE_URL, params=params)
## Extract a dictionary with pages
data = response.json()['query']['pages']
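
The MediaWiki API may split a reply into batches and signal this with a top-level `continue` object whose entries have to be added to the next request. When that happens, some pages can arrive without their `pageviews` in the first reply. The sketch below shows one way to follow the continuation and merge the batches. It is a rough illustration rather than the required approach: `query_pages` is a hypothetical helper, and merging pages with `dict.update` is an assumption about how the batches fit together.

```python
def query_pages(get, url, params):
    """Repeat a query until the API stops sending 'continue'; merge the pages.

    `get` is any callable with the same interface as `requests.get`.
    """
    merged = {}
    extra = {}
    while True:
        reply = get(url, params={**params, **extra}).json()
        pages = reply.get("query", {}).get("pages", {})
        for key, page in pages.items():
            # Fold every batch for the same page into one dictionary.
            merged.setdefault(key, {}).update(page)
        if "continue" not in reply:
            return merged
        # Send the continuation values back with the next request.
        extra = reply["continue"]
```

With the names used in this notebook, a call could look like `data = query_pages(requests.get, BASE_URL, params)`.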


In [None]:
## My friend's solution, using the filter function and the get method
PVS = { v['title']: sum(filter(None, v.get('pageviews', {}).values())) for v in data.values() }
## My solution, using a list comprehension inside a dict comprehension
PVM = { v['title'] : sum([ item for item in v['pageviews'].values() if item is not None ]) for v in data.values() }

In [None]:
## Create a new dictionary
PV = {}

## Iterate over all dictionaries
for item in data.values():
    ## Keep only the truthy daily counts (this drops None and 0,
    ## which does not change the sum)
    temp_list = filter(None, item.get('pageviews', {}).values())
    ## Add a new key - value pair
    PV[item['title']] = sum(temp_list)

PV

In [None]:
PVM

In [None]:
PVS

## Task 2 (5 + 2 points)
In this task, you can score either 5 or 7 points. The only difference is in the pages you download. To score 5 points, you just need to download the content of 20 random pages from Wikipedia (please review [N1](https://github.com/MikoBie/ids/blob/main/notebooks/N1.ipynb), in which we downloaded the content of 10 random pages). To have a chance of scoring 7 points, you need to download the content of 10 pages that have `Olivia` in the title and 10 pages that have `Noah` in the title (those were the most popular names in the UK in 2021).

**Hint for 7 points**: you might find the `list=prefixsearch` method and its [`pssearch`](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bprefixsearch) parameter interesting.

Once you have the content of these 20 articles, compute for each of them the distribution of the following possessive pronouns: `her`, `his`, and `their`. In other words, you should end up with a list that looks more or less like this one (but with 20 elements):

```python
[{'his': 79, 'her': 212, 'their': 14},
 {'his': 36, 'her': 147, 'their': 20},
 {'his': 17, 'her': 80, 'their': 6},
 {'his': 8, 'her': 80, 'their': 9},
 {'his': 14, 'her': 66, 'their': 2},
 {'his': 12, 'her': 188, 'their': 16},
 {'his': 3, 'her': 156, 'their': 13},
 {'his': 33, 'her': 126, 'their': 33},
 {'his': 10, 'her': 113, 'their': 8},
 {'his': 21, 'her': 4, 'their': 33}]
```
**Hint**: Remember that pronouns in articles sometimes start with a capital letter (ignore forms like `hers` and `theirs`). Moreover, review [notebook number 5](https://github.com/MikoBie/ppss/blob/main/notebooks/N5.ipynb) about lists from May 19th, 2022.
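
One possible way to handle capital letters and punctuation is to lowercase the text and match whole words with a regular expression. The sketch below is only an illustration of that idea, not the expected solution: `count_possessives` is a hypothetical name and the sample sentence is made up. Because it matches whole words only, `hers`, `theirs`, and `here` are left uncounted.

```python
import re
from collections import Counter


def count_possessives(text, words=("her", "his", "their")):
    """Count whole-word, case-insensitive occurrences of each word."""
    # \b marks a word boundary, so 'hers' or 'here' does not match 'her'.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    found = Counter(re.findall(pattern, text.lower()))
    return {w: found[w] for w in words}


sample = "Her book is his. Theirs is here, and their dog likes HER; hers too."
print(count_possessives(sample))  # {'her': 2, 'his': 1, 'their': 1}
```

Unlike splitting on whitespace, this also counts a pronoun that is directly followed by a comma or a full stop.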

In [None]:
## Import module requests
import requests as rq
URL = 'https://en.wikipedia.org/w/api.php'

In [None]:
def page_ids(keywords, pslimit = 10, form = 'json'):
    """
    Searches for pages whose titles start with the given keywords. It returns
    a list of page ids.

    Args:
        keywords (str): a string with keywords to search. Keywords should be
        separated by commas.
        pslimit (int): maximum number of ids returned for each keyword. By
        default, it takes the value of 10.
        form (str): a string that indicates the format in which the response
        should be returned. By default, it takes the value of json.

    Returns:
        (list) : a list with page ids.
    """
    BASE_URL = 'https://en.wikipedia.org/w/api.php'
    list_keywords = keywords.split(',')
    output = []
    for keyword in list_keywords:
        keyword = keyword.strip()
        payload = { 'action' : 'query',
                    'list' : 'prefixsearch',
                    'pssearch' : keyword,
                    'format' : form,
                    'pslimit' : pslimit
        }
        response = rq.get(BASE_URL, params=payload)
        data = response.json()
        output.extend( [ item['pageid'] for item in data['query']['prefixsearch'] ] )
    return output
        
def count_pronouns(s, l = ('her', 'his', 'their')):
    """
    Takes a string and counts the possessive pronouns that have been passed in l.
    It returns a dictionary with the frequency of each given pronoun.

    Args:
        s (str): a string
        l (tuple): words that should be counted. By default, these are the
        possessive pronouns her, his, and their.

    Returns:
        (dict): a dictionary with the frequency of the words listed in l.
    """
    ## Lowercase and split on whitespace; a token with punctuation attached
    ## (for example 'her,') does not match and is not counted
    s = s.lower().split()
    return { item : s.count(item) for item in l }
        

In [None]:
## Get the list of page ids for names Olivia and Noah
page_ids_string = page_ids(keywords = 'Olivia, Noah', pslimit=10)
## Convert the list into a string suitable for passing as an
## argument for Wikipedia API call
page_ids_string = '|'.join(str(item) for item in page_ids_string)
## Print out the string
page_ids_string

In [None]:
## Define the payload
payload = { 'action' : 'query',
            'prop' : 'cirrusdoc',
            'pageids' : page_ids_string,
            'format' : 'json'
}
## Send the request
response = rq.get(URL, payload)
## Extract the dictionary from the response
data = response.json()
## Extract a dictionary with the content of pages
pages = data['query']['pages']

In [None]:
## Create a dictionary in which the keys denote the title of the page and
## the values its content
articles = { p['title'] : p['cirrusdoc'][0]['source']['text'] for p in pages.values() }

## Create a list of dictionaries with frequencies of possessive pronouns
pronouns_list = [ count_pronouns(item) for item in articles.values() ]

## Create a dictionary with frequencies for each page
pronouns_dict = { key : count_pronouns(value) for key, value in articles.items() }

In [None]:
## Print out the list
pronouns_list

In [None]:
## Print out the dictionary
pronouns_dict