## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory; omitting it incurs a 2-point deduction on the assignment._ Only add plain text in the designated areas, i.e., replace the relevant 'NA's. You must fill out all group member names and Drexel email addresses in the markdown list below, under the header __Module submission group__. You must also describe any tutoring support received in completing this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may also list other optional comments pertaining to the submission at the bottom. _Any disruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: NA
    - Email: NA
- Group member 2
    - Name: NA
    - Email: NA
- Group member 3
    - Name: NA
    - Email: NA
- Group member 4
    - Name: NA
    - Email: NA

### Additional submission comments
- Tutoring support received: NA
- Other (other): NA

# Assignment Group 3
## Module C _(45 points)_

In this assignment, we'll build and expand on an exercise that focuses on a targeted scrape of Wikipedia, extracting content pertaining to UK cities.

Conveniently, the individual articles for the cities of the UK are collected in a list on English Wikipedia:
- https://en.wikipedia.org/wiki/List_of_cities_in_the_United_Kingdom

__C1.__ _(5 pts)_ First, consult the Wikipedia REST API's [documentation here](https://en.wikipedia.org/api/rest_v1/#/) to determine an acceptable endpoint for the HTML content. Then, complete the function below so that it uses the `requests` module to access the provided list page. The function should return the parsed HTML as `page_soup` (i.e., parsed using `BeautifulSoup`), as well as the entire `page_response` from the `requests` module.
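
As a minimal sketch of the URL-building step only (not the full function), the snippet below joins a base path, an endpoint, and a percent-encoded page title using the standard library. The helper name `build_request_url` and the `some/endpoint` placeholder are hypothetical; the real endpoint still has to be chosen from the API documentation.

```python
from urllib.parse import quote

def build_request_url(api_path, endpoint, page_title):
    """Join a base path, an endpoint, and a percent-encoded title (hypothetical helper)."""
    # safe="" also encodes "/" so the title always stays a single path segment
    encoded_title = quote(page_title.replace(" ", "_"), safe="")
    return f"{api_path.rstrip('/')}/{endpoint.strip('/')}/{encoded_title}"

# "some/endpoint" is a placeholder, not a real Wikipedia endpoint
url = build_request_url("https://en.wikipedia.org/api/rest_v1/",
                        "some/endpoint", "Bath, Somerset")
print(url)
```

Encoding the title this way keeps characters such as commas from being misread as part of the URL structure.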

In [None]:
# C1:Function(5/5)

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

def get_page_and_parse(page_title):

    API_path = "https://en.wikipedia.org/api/rest_v1/"
    
    #---your code starts here---
    
    #---your code stops here---

    return page_soup, page_response

For reference, your output should be:
```
'List of cities in the United Kingdom'    
```

In [None]:
# C1:SanityCheck

cities_page_soup, cities_page_response = get_page_and_parse("List_of_cities_in_the_United_Kingdom")
cities_page_soup.find('head').text

__C2.__ _(7 pts)_ Next, your job is to extract the links for the different cities by locating the first object in the HTML tagged as `'table'`. Looping over the table's (non-header) rows, obtain the first `'a'`-tag (hyperlink) from the first column of each row, and append these links to `city_links`, using the form `[show_text, hyperlink]`.
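
The row-by-row logic can be sketched on a toy snippet. This example deliberately swaps in the standard library's `xml.etree.ElementTree` on a small, well-formed, made-up table rather than `BeautifulSoup` on the real page, so it only illustrates the traversal pattern; the snippet contents and the name `toy_links` are assumptions.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed stand-in for the list page
toy_html = """
<div>
  <table>
    <tr><th>City</th><th>Nation</th></tr>
    <tr><td><a href="./Aberdeen">Aberdeen</a></td><td><a href="./Scotland">Scotland</a></td></tr>
    <tr><td><a href="./Bangor,_Gwynedd">Bangor</a></td><td><a href="./Wales">Wales</a></td></tr>
  </table>
  <table><tr><td><a href="./Ignored">Ignored</a></td></tr></table>
</div>
"""

root = ET.fromstring(toy_html)
first_table = root.find(".//table")    # first table in document order
toy_links = []
for row in first_table.findall("tr"):
    cells = row.findall("td")
    if not cells:                      # header rows here contain only <th> cells
        continue
    anchor = cells[0].find(".//a")     # first hyperlink in the first column
    if anchor is not None:
        toy_links.append([anchor.text, anchor.get("href")])

print(toy_links)
```

Real Wikipedia tables may mark cells differently from this toy, so the header-row test is an assumption to check against the actual page.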

In [None]:
# C2:Function(7/7)

def extract_city_links(cities_page_soup):
    city_links = []
    
    #---your code starts here---
        
    #---your code stops here---
    
    return city_links

For reference, your output should be:
```
[['Aberdeen', './Aberdeen'],
 ['Armagh', './Armagh'],
 ['Bangor', './Bangor,_Gwynedd'],
 ['Bath', './Bath,_Somerset'],
 ['Belfast', './Belfast'],
 ['Birmingham', './Birmingham'],
 ['Bradford', './City_of_Bradford'],
 ['Brighton & Hove', './Brighton_and_Hove'],
 ['Bristol', './Bristol'],
 ['Cambridge', './Cambridge']]
```

In [None]:
# C2:SanityCheck

city_links = extract_city_links(cities_page_soup)
city_links[:10]

__C3.__ _(7 pts)_ Next, complete the initial data collection task by filling the `city_data` object in the function below. Each link in `city_links` should produce an entry that follows the schema:
```
city_data = {
    page_id: {
        "name": <name of page>,
        "text": <full html string for page>
    }
}
```
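where `page_id` corresponds to the `id` of the page used when accessing the API, which in __Part C2__'s output corresponds to the hyperlink of each entry in `city_links` (e.g., `'./Armagh'`), matching the pre-filled `"./List_of_cities_in_the_United_Kingdom"` key in the function below.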
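
To make the schema concrete, here is a small offline sketch. It assumes the page id is the hyperlink from `city_links` (as the pre-filled list-page key suggests) and the name is the show text; `build_page_records` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real HTTP request.

```python
def build_page_records(links, fetch_html):
    """Map each hyperlink to a {'name', 'text'} record (hypothetical helper)."""
    records = {}
    for show_text, hyperlink in links:
        records[hyperlink] = {"name": show_text, "text": fetch_html(hyperlink)}
    return records

def fake_fetch(hyperlink):
    # Stand-in for a real request, so the sketch runs without network access
    return f"<html><title>{hyperlink}</title></html>"

toy_links = [["Armagh", "./Armagh"], ["Bangor", "./Bangor,_Gwynedd"]]
toy_data = build_page_records(toy_links, fake_fetch)
print(toy_data["./Armagh"]["name"])
```

Passing the fetcher in as an argument is just a convenience for testing the dictionary-building logic separately from the network calls.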

In [None]:
# C3:Function(7/7)

def get_city_data(city_links, cities_page_response):
    city_data = {"./List_of_cities_in_the_United_Kingdom": {
                     "name": "List of cities in the United Kingdom",
                     "text": cities_page_response.text}}
    API_path = "https://en.wikipedia.org/api/rest_v1/"
        
    #---your code starts here---
    
    #---your code stops here---
    
    return city_data

For reference, your output should be:
```
('Armagh',
 '<!DOCTYPE html>\n<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wikipedia.org/wiki/Special:Redirect/revision/1048196172"><head prefix="mwr: https://en.wikipedia.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="1df726d0-2b99-11ec-856e-55091143e768"/><meta charset="utf-8"/><meta property="mw:pageId" content="473800"/><meta property="mw:pageNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/1047848140"/><meta property="mw:revisionSHA1" content="e4e8a119952ba8c9225de83cb9e025a14f3608d2"/><meta property="dc:modified" content="2021-10-04T19:12:15.000Z"/><meta property="mw:htmlVersion" content="2.3.0"/><meta property="mw:html:version" content="2.3.0"/><link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/Armagh"/><title>Armagh</title><base href="//en.wikipedia.org/wiki/"/><meta property="mw:styleModules" content="mediawiki.page.gallery.styles|ext.cite.style|ext.cite.styles"/><link rel="stylesheet" href="/w/lo')
```

In [None]:
# C3:SanityCheck
import json, os

# gather the data, reusing the cached copy on disk if one exists
if os.path.exists("./data/city_data.json"):
    with open("./data/city_data.json") as f:
        city_data = json.load(f)
else:
    city_data = get_city_data(city_links, cities_page_response)
    # save the data to disk, creating the data directory if needed
    os.makedirs("./data", exist_ok=True)
    with open("./data/city_data.json", "w") as f:
        json.dump(city_data, f)

# generate some output
(lambda x: (x['name'], x['text'][:1000]))(city_data[list(city_data.keys())[2]])

__C4.__ _(6 pts)_ Now that we've accessed the primary data, let's review the two ways we could spread out to gather more content relating to the cities. On Wikipedia, pages are hyper-linked into a network. This means pages both link to each city (in-links) and out from each city (out-links).

To see a list of in-links, i.e., what pages link to a given page, we can use the time-honored 'what links here' endpoint:
- `'https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere'`

Using this, any page can be checked for its in-links, i.e., pages that link to it. 

Here, your job is to complete the function below, which should use the specified `params` to ensure that 500 pages are returned per request and that all pages returned have `namespace=0`, i.e., are core content articles on Wikipedia. Note: the `params` must be concatenated onto the `base_URL`, and then the page's searchable name, `page_id` (using underscores rather than spaces), must be concatenated onto that.

Once you've parsed the page's `inlinks_soup` object, use `BeautifulSoup` commands to collect the hyperlinks within __all `'li'`-tagged objects that have a `'span'`-tagged object within them, but which _don't_ have a `'class'` attribute.__ This restriction will help us avoid the 'edit' links. For output, each link should be appended as a `(show_text, URL)`-tuple into the `links` list.
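
The concatenation order described above can be checked without making a request. This is a minimal sketch: `build_inlinks_url` is a hypothetical helper name, and parsing the query string with `urllib.parse` is only a way to confirm that each parameter landed where expected.

```python
from urllib.parse import urlsplit, parse_qs

base_URL = 'https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere'
params = '&limit=500&namespace=0&target='

def build_inlinks_url(page_id):
    """Concatenate base_URL, then params, then the underscored page name."""
    return base_URL + params + page_id.replace(' ', '_')

url = build_inlinks_url('City of London')
query = parse_qs(urlsplit(url).query)   # split the query string into its parameters
print(url)
print(query)
```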

In [None]:
# C4:Function(6/6)

def index_inlinks(page_id):
    base_URL = 'https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere'
    params = '&limit=500&namespace=0&target='
    
    links = [] # collect the in-links within this list object
    
    #---your code starts here---
    
    #--- your code stops here---
    
    return links, inlinks_soup

For reference, your output should be:
```
(500,
 [('Foreign relations of Angola', '/wiki/Foreign_relations_of_Angola'),
  ('American Revolutionary War', '/wiki/American_Revolutionary_War'),
  ('Ankara', '/wiki/Ankara'),
  ('Alfred Hitchcock', '/wiki/Alfred_Hitchcock'),
  ('Amsterdam', '/wiki/Amsterdam'),
  ('Albert Speer', '/wiki/Albert_Speer'),
  ('Foreign relations of Azerbaijan', '/wiki/Foreign_relations_of_Azerbaijan'),
  ('Afro Celt Sound System', '/wiki/Afro_Celt_Sound_System'),
  ('Alternate history', '/wiki/Alternate_history'),
  ('Athens', '/wiki/Athens')])
```

Note: `'/City_of_London'` is actually our entry of interest for the original cities list, potentially making `'./London'` a poor unit test. Nevertheless, all pages should produce the same output structure, since the inlinks pages are constructed automatically.

In [None]:
# C4:SanityCheck

links, inlinks_soup = index_inlinks('London')
len(links), links[:10]

__C5.__ _(4 pts)_ Finally, make sure to collect the `'next 500'` pagination link and store it as `next_page_URL` by completing the function below. We'll be able to use this URL to come back for the remainder of the page's list if/when we're ready. [Hint: Use the string `'next 500'` to identify the pagination link in a given page's search results.]
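
Matching a link by its visible text can be sketched on a made-up fragment. As an assumption for running offline, this uses the standard library's `xml.etree.ElementTree` instead of `BeautifulSoup`, and the fragment, the offset value, and the helper name `find_link_by_text` are all illustrative.

```python
import xml.etree.ElementTree as ET

# Made-up pagination fragment; "&" must be written "&amp;" in well-formed markup
toy_results = (
    '<p>View (previous 500 | '
    '<a href="/w/index.php?title=Special:WhatLinksHere/Example'
    '&amp;limit=500&amp;offset=123">next 500</a>)</p>'
)

def find_link_by_text(root, wanted_text):
    """Return the href of the first link whose text matches, else ''."""
    for anchor in root.iter("a"):
        if anchor.text == wanted_text:
            return anchor.get("href", "")
    return ""

root = ET.fromstring(toy_results)
toy_next = find_link_by_text(root, "next 500")
toy_missing = find_link_by_text(root, "next 5000")   # no such link, so ''
print(toy_next)
```

Returning an empty string when no match exists mirrors the `next_page_URL = ''` default in the function template.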

In [None]:
# C5:Function(4/4)

def get_next_page_URL(inlinks_soup):
    
    next_page_URL = '' # collect the next page of results' URL, 
                       # provided more results exist
    
    #---your code starts here---
    
    #---your code stops here---
    
    return next_page_URL


For reference, your output should be:
```
'/w/index.php?title=Special:WhatLinksHere/London&namespace=0&limit=500&dir=next&offset=17792'
```

In [None]:
# C5:SanityCheck

next_page_URL = get_next_page_URL(inlinks_soup)
next_page_URL

__C6.__ _(5 pts)_ As we can see, it will take multiple calls to the function from part __C5__ to complete the collection of _all_ of each page's in-links. This means that knowing how many pages/in-links exist won't _exactly_ be clear from the start. So as a first pass, let's collect each page's first batch of 500 (or fewer) in-links. In theory, then we could check to see just how many of the city-pages have at least a second page (more than 500 in-links).

Specifically, your job in this part of the problem is to store up to 500 of each page's in-links in the `links_index` object, whose schema should conform to the following pattern:
```
links_index[page_id] = {'name': page_name, 'links': links, 
                                'next_page': next_page_URL}
```

In [None]:
# C6:Function(5/5)

def collect_first_inlinks_pages(city_links):
    links_index = {}
    
    #---your code starts here---
    
    #---your code stops here---
    
    return links_index

For reference, your output could be:
```
507
```
Clearly, some extra links are coming through in the answer key. So even though we are requesting a single batch of data as a structured response, the response we've received when unit testing on `'./London'` is sufficiently different in structure from that of `'./Armagh'` that _more_ hyperlinks can be collected than expected!

As it turns out, this difference is the result of in-links that point _through_ 're-direct' links, e.g., `'./City_of_Armagh'`, which ultimately point back to `'./Armagh'`'s page. Can you spot the redirects on `'./Armagh'`'s [inlinks page](https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Armagh&namespace=0&limit=500)?

In [None]:
# C6:SanityCheck

inlinks_index = collect_first_inlinks_pages(city_links)
len(inlinks_index['./Armagh']['links'])

__C7.__ _(3 pts)_ Now let's take another look at our data and see just how few in-links some cities' pages have. Using the output `inlinks_index`, determine which city had the smallest number of in-links, and record this and its id to complete the print statement in the `Inline()` cell, below.
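
One possible way to find the entry with the fewest links is `min` with a key function. The sketch below runs on a made-up index, not the real `inlinks_index`, and the variable names are illustrative.

```python
# Made-up index: each record holds a list of (show_text, URL) tuples
toy_index = {
    "./A": {"name": "A", "links": [("x", "/wiki/x"), ("y", "/wiki/y")]},
    "./B": {"name": "B", "links": [("x", "/wiki/x")]},
    "./C": {"name": "C", "links": [("x", "/wiki/x"), ("y", "/wiki/y"), ("z", "/wiki/z")]},
}

# min() over the keys, compared by the length of each record's link list
smallest_id = min(toy_index, key=lambda pid: len(toy_index[pid]["links"]))
smallest_count = len(toy_index[smallest_id]["links"])
print(smallest_id, smallest_count)
```

An explicit loop that updates a running minimum is equivalent and fits the variables initialized in the cell below.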

In [None]:
# C7:Inline(3/3)

smallest_num_links = float('Inf')
smallest_link_city_id = ''

#---your code starts here---
        
#---your code stops here---

print('The least in-linked city was ', smallest_link_city_id, 
      ' with ', smallest_num_links, ' links.')

__C8.__ _(6 pts)_ Now that we know how big/small the in-link sets can be, let's see if we can gather all of the out-links. This means we'd like to do the same as in part __C7__, but by gathering all hyperlinks within the city wiki pages that point to other primary encyclopedia wiki content. 

Ideally, we'd want only those links on the pages to other articles on Wikipedia with `namespace = 0`. However, to start, your job is to complete the function below using `BeautifulSoup` operations on each city's HTML response in `city_data` to collect _all_ hyper-links for each page in `city_data` that have a `'title'` attribute. This attribute should refine the links to those inside of Wikipedia (although not necessarily with `namespace = 0`). To complete this filtering of the links, use the argument `{'title': True}` to filter your `.find_all()` for the hyper-links.

Note: by this point your `BeautifulSoup`-parseable data should be accessible via `city_data[page_id]['text']`, and for output your function should return a `links_index` object again, except now there will be no `'next_page'`, as these lists of links will be complete!
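
The effect of filtering on the `'title'` attribute can be seen on a toy page. This sketch substitutes the standard library's `xml.etree.ElementTree` for `BeautifulSoup` so it runs offline; the snippet is made up, and pairing the `title` attribute with the `href` in each tuple is an assumption about the output form.

```python
import xml.etree.ElementTree as ET

# Made-up page body: only two of the four links carry a title attribute
toy_page = """
<body>
  <p>
    <a href="./Northern_Ireland" title="Northern Ireland">Northern Ireland</a>
    <a href="#cite_note-1">[1]</a>
    <a href="./Belfast" title="Belfast">Belfast</a>
    <a href="https://example.org/">external site</a>
  </p>
</body>
"""

root = ET.fromstring(toy_page)
# Keep only hyperlinks that have a title attribute
toy_outlinks = [(a.get("title"), a.get("href"))
                for a in root.iter("a") if a.get("title") is not None]
print(toy_outlinks)
```

In this toy, the footnote anchor and the external link are dropped because neither has a `title`, which is the refinement the prompt describes.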

In [None]:
# C8:Function(6/6)

def collect_all_page_outlinks(city_data):
    links_index = {}
    
    #---your code starts here---
            
    #---your code stops here---
    
    return links_index 
        

For reference, your output should be:
```
(543,
 [('Armagh (disambiguation)', './Armagh_(disambiguation)'),
  ('Scots language', './Scots_language'),
  ('Irish language', './Irish_language'),
  ("St. Patrick's Cathedral, Armagh (Roman Catholic)",
   "./St._Patrick's_Cathedral,_Armagh_(Roman_Catholic)"),
  ('Northern Ireland', './Northern_Ireland'),
  ('United Kingdom census, 2011', './United_Kingdom_census,_2011'),
  ('Irish grid reference system', './Irish_grid_reference_system'),
  ('Belfast', './Belfast'),
  ('Local government in Northern Ireland',
   './Local_government_in_Northern_Ireland'),
  ('Armagh City, Banbridge and Craigavon District Council',
   './Armagh_City,_Banbridge_and_Craigavon_District_Council')])
```

In [None]:
# C8:SanityCheck

outlinks_index = collect_all_page_outlinks(city_data)
len(outlinks_index['./Armagh']['links']), outlinks_index['./Armagh']['links'][:10]

__C9.__ _(2 pts)_ Finally, using the `outlinks_index` from __C8__, determine which city had the _largest_ number of out-links and print the result using the `Inline()` cell, below.

In [None]:
# C9:Inline(2/2)

largest_num_links = 0
largest_link_city_id = ''

#---your code starts here---
        
#---your code stops here---

print('The most out-linked city was ', largest_link_city_id, 
      ' with ', largest_num_links, ' links.')