<a href="https://colab.research.google.com/github/Margarita89/CityQuest/blob/master/CityQuest_Data_collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **City Quest Data Collection from Wikipedia**

### **Install wikipedia module**
The comments below also record useful references and the Wikipedia pages used for testing.

In [None]:
!pip install wikipedia
# Wikipedia API: https://github.com/goldsmith/Wikipedia/blob/master/wikipedia/wikipedia.py 
# NLTK: https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da 
# New York City: https://en.wikipedia.org/wiki/New_York_City
# San Francisco: https://en.wikipedia.org/wiki/San_Francisco
# Washington, D.C.: https://en.wikipedia.org/wiki/Washington,_D.C.
# Las Vegas: https://en.wikipedia.org/wiki/Las_Vegas
# Seattle: https://en.wikipedia.org/wiki/Seattle
# Miami: https://en.wikipedia.org/wiki/Miami
# Los Angeles: https://en.wikipedia.org/wiki/Los_Angeles
# Philadelphia: https://en.wikipedia.org/wiki/Philadelphia
# Chicago
# San Diego
# Boston
# Minneapolis
# Atlanta
# Houston
# Phoenix

### **Some imports and testing wikipedia module**

In [None]:
from collections import defaultdict
from typing import Set, List, Dict, DefaultDict
import wikipedia

In [127]:
ny = wikipedia.page("New York City")
ny.title
# 'New York City'

'New York City'

In [128]:
ny.url
# u'https://en.wikipedia.org/wiki/New_York_City'

'https://en.wikipedia.org/wiki/New_York_City'

In [None]:
ny.summary

In [129]:
ny.content
# u'New York City, often called simply New York and abbreviated as NYC, is the most populous city'...

'New York City, often called simply New York and abbreviated as NYC, is the most populous city in the United States. With an estimated 2019 population of 8,336,817 distributed over about 302.6 square miles (784 km2), New York City is also the most densely populated major city in the United States. Located at the southern tip of the U.S. state of New York, the city is the center of the New York metropolitan area, the largest metropolitan area in the world by urban landmass. With almost 20 million people in its metropolitan statistical area and approximately 23 million in its combined statistical area, it is one of the world\'s most populous megacities. New York City has been described as the cultural, financial, and media capital of the world, significantly influencing commerce, entertainment, research, technology, education, politics, tourism, art, fashion, and sports. Home to the headquarters of the United Nations, New York is an important center for international diplomacy.Situated o

In [6]:
'History of New York City' in ny.links

True

In [7]:
len(ny.links)

2486

In [130]:
ny_history = wikipedia.page('History of New York City')
ny_history.content

'The written history of New York City began with the first European explorer the Italian Giovanni da Verrazzano in 1524. European settlement began with the Dutch in 1608.\nThe "Sons of Liberty" destroyed British authority in New York City, and the Stamp Act Congress of representatives from throughout the Thirteen Colonies met in the city in 1765 to organize resistance to British policies. The city\'s strategic location and status as a major seaport made it the prime target for British seizure in 1776. General George Washington lost a series of battles from which he narrowly escaped (with the notable exception of the Battle of Harlem Heights, his first victory of the war), and the British Army controlled New York City and made it their base on the continent until late 1783, attracting Loyalist refugees. The city served as the national capital under the Articles of Confederation from 1785–1789, and briefly served as the new nation\'s capital in 1789–90 under the United States Constitutio

In [9]:
ny.links[943]

'History of New York City'

In [72]:
ny.section('Architecture')

'New York has architecturally noteworthy buildings in a wide range of styles and from distinct time periods, from the Dutch Colonial Pieter Claesen Wyckoff House in Brooklyn, the oldest section of which dates to 1656, to the modern One World Trade Center, the skyscraper at Ground Zero in Lower Manhattan and the most expensive office tower in the world by construction cost.Manhattan\'s skyline, with its many skyscrapers, is universally recognized, and the city has been home to several of the tallest buildings in the world. As of 2019, New York City had 6,455 high-rise buildings, the third most in world after Hong Kong and Seoul. Of these, as of 2011, 550 completed structures were at least 330 feet (100 m) high, the second most in the world after Hong Kong, with more than fifty completed skyscrapers taller than 656 feet (200 m). These include the Woolworth Building, an early example of Gothic Revival architecture in skyscraper design, built with massively scaled Gothic detailing; complet

In [100]:
sections = get_sections('Van Cortlandt Park')
sections[:len(sections)//2]

['History',
 'Settlement and colonization',
 'Planning',
 'Creation',
 'Early years',
 'Decline',
 'Improvements',
 'Geography',
 'Geology',
 'Watercourses',
 'Wildlife',
 'Landmarks and structures',
 'Trails',
 'Landmarks',
 'Other structures']

### **Collect data from a page by city name**
Filter the page data down to "cultural" content by searching through sections, links, and deeper links (links found on linked pages).

In [87]:
def get_sections(page_name: str) -> List[str]:
  """Returns sections' names from Wikipedia page by page_name
  """
  query_params = {
          'action': 'parse',
          'prop': 'sections',
        }
  query_params['page']= page_name
  request = wikipedia.wikipedia._wiki_request(query_params)
  #print(request)
  sections = [a['line'] for a in request['parse']['sections']]
  return sections
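The `parse` response carries more than the section names, which may help separate top-level sections from subsections (the Van Cortlandt Park output above mixes both). A minimal sketch, assuming each entry includes a `toclevel` field, as MediaWiki's `action=parse` with `prop=sections` normally returns. The `sample_response` is hand-written for illustration, not real API output, and `get_top_level_sections` is a hypothetical helper that is not part of the notebook:

```python
from typing import Any, Dict, List


def get_top_level_sections(response: Dict[str, Any]) -> List[str]:
  """Returns names of sections whose table-of-contents level is 1."""
  return [s['line'] for s in response['parse']['sections']
          if s.get('toclevel') == 1]


# Hand-written sample that mimics the response shape (illustrative only)
sample_response = {
    'parse': {
        'sections': [
            {'toclevel': 1, 'line': 'History'},
            {'toclevel': 2, 'line': 'Planning'},
            {'toclevel': 1, 'line': 'Geography'},
        ]
    }
}
top_level = get_top_level_sections(sample_response)
print(top_level)
```

In the notebook, the `request` dictionary returned inside `get_sections` could be passed to such a helper in place of the sample.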

In [81]:
sections = get_sections('New York City')
sections[0]

'Etymology'

In [61]:
# keywords for sections and links that are interesting to check
# stop words to skip
# can be updated later!
SECTION_KEYWORDS = {'architecture', 'park', 'art', 'history', 'cultural', 
                    'museum', 'etymology', 'landmark'}
LINK_KEYWORDS = {'architecture', 'park', 'art', 'history', 'cultural', 'museum', 
                 'monument', 'landmark', 'statue', 'castle', 'zoo', 'arch', 
                 'cathedral', 'church', 'square', 'opera', 'house'}
LINK_CONTENT_KEYWORDS = {'architecture', 'attractions','history', 'background'}
DEEP_LINK_KEYWORDS = {'List of museums'}  # to go to the next link
DEEP_CONTENT_KEYWORDS = {'museum', 'cathedral', 'statue', 'monument', 'house',
                           'church', 'castle', 'zoo', 'park', 'garden'}
# all stop words are lowercase because they are compared with lowercased names
STOP_WORDS = {'police', 'article', 'burial', 'party', 'inc', 'zoopraxiscope', 
              'representatives', 'archdiocese', 'commission', 'history.com',
              'internet', 'archive', 'operations', 'parker', 'parkway', 
              'archdeacon', 'references', 'see', 'also', 'external', 'links', 
              'demographics', 'faculty', 'alumni', 'accreditation', 'lawsuits',
              'notable', 'residents', 'curriculum', 'school', 'health', 'mental',
              'online', 'canada', 'education', 'college', 'universit', 
              'relocation', 'expansion', 'bibliography', 'reading', 'citations'} 
              # 'Beaux-Arts'
LINK_STOP_WORDS = set()  # {} would create an empty dict, not a set

In [62]:
def get_interesting_sections(page_name: str) -> List[str]:
  """Returns a list of section names that contains interesting keywords
  """
  sections = get_sections(page_name)
  interesting_sections = []
  for section in sections:
    section_low = section.lower()
    for key in SECTION_KEYWORDS:
      if key in section_low:
        interesting_sections.append(section)
        break  # add each section only once, even if several keywords match
  return interesting_sections

In [63]:
def get_interesting_links(page_name: str) -> List[str]:
  """Returns a list of link names that contain interesting keywords
  """
  links = wikipedia.page(page_name).links
  interesting_links = []
  for link in links:
    link_lower = link.lower()
    link_low_splitted = link_lower.split(' ')
    # skip links that contain any stop word
    if any(stop_word in link_lower for stop_word in STOP_WORDS):
      continue
    for key in LINK_KEYWORDS:
      if any(l.startswith(key) for l in link_low_splitted):
        interesting_links.append(link)
        break  # add each link only once, even if several keywords match
  return interesting_links
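The matching rule in `get_interesting_links` (some word of the title starts with a keyword, and no stop word occurs anywhere in the title) can be tried offline. A small sketch with a hypothetical helper `filter_titles` and made-up sample titles, so it does not need a Wikipedia request:

```python
from typing import Iterable, List, Set


def filter_titles(titles: Iterable[str], keywords: Set[str],
                  stop_words: Set[str]) -> List[str]:
  """Keeps titles where a word starts with a keyword and no stop word occurs."""
  kept = []
  for title in titles:
    title_lower = title.lower()
    # a stop word anywhere in the title rejects it
    if any(stop in title_lower for stop in stop_words):
      continue
    words = title_lower.split(' ')
    if any(word.startswith(key) for word in words for key in keywords):
      kept.append(title)
  return kept


# Hypothetical titles, chosen to exercise each rule
sample_titles = ['Central Park', 'Park Avenue Armory',
                 'New York City Police Department',
                 'Henry Hudson Parkway', 'Broadway theatre']
kept_titles = filter_titles(sample_titles, {'park', 'museum'},
                            {'police', 'parkway'})
print(kept_titles)
```

Because keywords are matched as word prefixes, `'park'` would also accept `'Parkway'`, which is why `'parkway'` appears among the stop words.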

In [64]:
def get_page_data_by_section(page_name: str) -> Dict[str, str]:
  """Returns a dictionary of section: content as key: value pairs
  """
  section_names = get_interesting_sections(page_name=page_name)
  # use the page_name argument (not a global city_name) and fetch the page once
  city = wikipedia.page(page_name)
  section_data = {'Summary': city.summary}
  for section_name in section_names:
    section_content = city.section(section_name)
    section_data[section_name] = section_content
  return section_data

In [65]:
def get_page_data_by_section_by_link(page_name: str) -> Dict[str, str]:
  """Returns a dictionary of 'page.section': content pairs, skipping sections
  whose names contain stop words
  """
  sections = get_sections(page_name)
  interesting_sections = []
  for section in sections:
    section_lower = section.lower()
    if all(stop_word not in section_lower for stop_word in STOP_WORDS):
      interesting_sections.append(section)
  
  page_content_by_link = {}
  page_summary = page_name + '.' + 'Summary'
  page_from_link = wikipedia.page(page_name)
  page_content_by_link[page_summary] = page_from_link.summary
  for section in interesting_sections:
    page_section = page_name + '.' + section
    page_content_by_link[page_section] = page_from_link.section(section)
  return page_content_by_link

In [66]:
def get_city_data_by_link(page_name: str) -> Dict[str, str]:
  """Returns a dictionary of link: content as key: value pairs
  """
  link_names = get_interesting_links(page_name=page_name)
  links_data = {}
  for link_name in link_names:
    link_data = get_page_data_by_section_by_link(page_name=link_name)
    links_data.update(link_data)
  return links_data
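The functions above may request the same page several times, since each call to `wikipedia.page` is a separate fetch. One option is to memoize the fetch. A minimal sketch using `functools.lru_cache`; `fetch_summary` is a hypothetical stand-in that fakes the network call so the example runs offline, and in the notebook the cached function would wrap `wikipedia.page` instead:

```python
from functools import lru_cache

calls = []  # records every time the (fake) fetch actually runs


@lru_cache(maxsize=None)
def fetch_summary(page_name: str) -> str:
  """Hypothetical stand-in for a network fetch such as wikipedia.page(...)."""
  calls.append(page_name)
  return f'Summary of {page_name}'


first = fetch_summary('Central Park')
second = fetch_summary('Central Park')  # served from the cache
print(len(calls))
```

The cache lives only as long as the Python process, so restarting the runtime would fetch everything again.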

In [93]:
def get_deep_links_by_link(page_name: str) -> List[str]:
  """Returns filtered deep links of links 
  """
  deeper_links_names = []
  links = wikipedia.page(page_name).links

  for link in links:
    for key in DEEP_LINK_KEYWORDS:
      if key not in link:
        continue
      
      deep_page = wikipedia.page(link)
      for deep_page_link in deep_page.links:
        deep_page_link_lower = deep_page_link.lower()
        for key_content in DEEP_CONTENT_KEYWORDS:
          if key_content in deep_page_link_lower:
            deeper_links_names.append(deep_page_link)
            break  # add each deep link only once
  return deeper_links_names

In [94]:
def get_deeper_links_content(page_name: str) -> Dict[str, str]:
  """Returns deeper links content filtered as first half of sections
  """
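  # keep the link names separate from the result dictionary; reusing one name
  # for both replaced the links with an empty dict before the loop ran
  deeper_links = get_deep_links_by_link(page_name=page_name)
  
  deeper_links_data = {}
  for link in deeper_links: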
    page_summary = link + '.' + 'Summary'
    page_from_link = wikipedia.page(link)
    deeper_links_data[page_summary] = page_from_link.summary
    
    link_sections = get_sections(link)
    for i in range(len(link_sections) // 2):
      section = link_sections[i]
      page_section = link + '.' + section
      deeper_links_data[page_section] = page_from_link.section(section)
  return deeper_links_data

In [95]:
def get_city_data(city_name: str) -> Dict[str, str]:
  """Gets two dictionaries of section: content and link: content
     Returns one concatenated dictionary
  """
  section_data = get_page_data_by_section(page_name=city_name)
  link_data = get_city_data_by_link(page_name=city_name)
  deeper_links_data = get_deeper_links_content(page_name=city_name)
  return {**section_data, **link_data, **deeper_links_data}
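When dictionaries are merged with `{**a, **b}`, a key in a later dictionary silently replaces the same key from an earlier one. The link keys are prefixed with the page name (`'Page.Section'`), which makes collisions with plain section names unlikely. A short sketch with made-up values:

```python
# Hypothetical miniature dictionaries, shaped like the ones merged above
section_data = {'Summary': 'city summary', 'History': 'city history'}
link_data = {'Central Park.Summary': 'park summary',
             'Central Park.History': 'park history'}
colliding = {'History': 'other history'}

merged = {**section_data, **link_data}       # no shared keys: all four survive
overwritten = {**section_data, **colliding}  # the later 'History' wins
print(len(merged))
print(overwritten['History'])
```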

In [None]:
# collects all interesting content and link data for chosen cities into one  
# dictionary
city_names = ['New York City', 'San Francisco', 'Washington, D.C.', 'Chicago', 
              'Philadelphia', 'Las Vegas', 'Seattle', 'Miami', 'Los Angeles',
              'San Diego', 'Boston', 'Minneapolis', 'Atlanta', 'Houston', 
              'Phoenix']
city_names_test = ['New York City', 'San Francisco']
city_data = {}
for city_name in city_names_test:
  city_data[city_name] = get_city_data(city_name=city_name)

### **Do some testing**
To be continued.

In [None]:
for city in city_data:
  for key in city_data[city]:
    print(key)

In [None]:
for heading in city_data['San Francisco']:
  print(heading)

In [None]:
for city in city_data:
  city_info = []
  for heading in city_data[city]:
    city_info.append(heading)
    city_info.append(city_data[city][heading])
  file_name = city + ' data.txt'
  with open(file_name, "w") as text_file:
    try:
      print('\n'.join(map(str, city_info)), file=text_file)
    except Exception:  # avoid a bare except, which also catches KeyboardInterrupt
      print(city_info, file=text_file)

In [None]:
# save city_data in json format
import json 
with open('city_data.json', 'w') as fp:
    json.dump(city_data, fp)
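To check that the saved file can be read back, a round trip is a quick test. A sketch under stated assumptions: `sample_city_data` is a hypothetical miniature of `city_data`, and the `None` value stands for a section whose text could not be retrieved (as far as I know, `section()` returns `None` in that case):

```python
import json
import os
import tempfile

# Hypothetical miniature version of city_data
sample_city_data = {
    'San Francisco': {'History': 'Mission San Francisco de Asís',
                      'Museums': None}
}

path = os.path.join(tempfile.mkdtemp(), 'city_data.json')
with open(path, 'w', encoding='utf-8') as fp:
  # ensure_ascii=False keeps characters such as 'í' readable in the file
  json.dump(sample_city_data, fp, ensure_ascii=False, indent=2)

with open(path, encoding='utf-8') as fp:
  loaded = json.load(fp)
print(loaded == sample_city_data)
```

`None` is written as JSON `null` and comes back as `None`, so the comparison holds for this sample.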

In [None]:
import pickle

# pickle produces bytes, so the file must be opened in binary mode
with open('file.txt', 'wb') as file:
     pickle.dump(city_data, file)

In [None]:
for section in city_data['New York City']:
  print(section)

Etymology
History
Early history
Modern history
Architecture
Parks
National parks
State parks
City parks
Arts
Performing arts
Visual arts


In [None]:
city_data['New York City']['Arts']

"New York City has more than 2,000 arts and cultural organizations and more than 500 art galleries. The city government funds the arts with a larger annual budget than the National Endowment for the Arts. Wealthy business magnates in the 19th century built a network of major cultural institutions, such as the famed Carnegie Hall and the Metropolitan Museum of Art, that would become internationally established. The advent of electric lighting led to elaborate theater productions, and in the 1880s, New York City theaters on Broadway and along 42nd Street began featuring a new stage form that became known as the Broadway musical. Strongly influenced by the city's immigrants, productions such as those of Harrigan and Hart, George M. Cohan, and others used song in narratives that often reflected themes of hope and ambition. New York City itself is the subject or background of many plays and musicals."

In [None]:
for section in city_data['San Francisco']:
  print(section)

History
Performing arts
Museums
Beaches and parks


In [None]:
city_data['San Francisco']['Beaches and parks']

"Several of San Francisco's parks and nearly all of its beaches form part of the regional Golden Gate National Recreation Area, one of the most visited units of the National Park system in the United States with over 13 million visitors a year. Among the GGNRA's attractions within the city are Ocean Beach, which runs along the Pacific Ocean shoreline and is frequented by a vibrant surfing community, and Baker Beach, which is located in a cove west of the Golden Gate and part of the Presidio, a former military base. Also within the Presidio is Crissy Field, a former airfield that was restored to its natural salt marsh ecosystem. The GGNRA also administers Fort Funston, Lands End, Fort Mason, and Alcatraz. The National Park Service separately administers the San Francisco Maritime National Historical Park – a fleet of historic ships and waterfront property around Aquatic Park.\n\nThere are more than 220 parks maintained by the San Francisco Recreation & Parks Department. The largest and 

### **Use tokenization from nltk**
To be continued.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

In [None]:
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

In [None]:
preprocess(city_data['San Francisco']['Beaches and parks'])

In [None]:
def ie_preprocess(document):
  sentences = nltk.sent_tokenize(document)
  sentences = [nltk.word_tokenize(sent) for sent in sentences] 
  sentences = [nltk.pos_tag(sent) for sent in sentences]
  return sentences

In [None]:
ie_preprocess(city_data['San Francisco']['History'])

### **Extract some years**
A first step toward preparing data for question generation.
To be continued.

In [None]:
import re
import datetime
now = datetime.datetime.now()
y = city_data['San Francisco']['History']
print(f'Text:\n{y}')
years = re.findall(r'\b\d{4}\b', y)
years = [year for year in years if int(year) <= now.year]
print(f'\nYears:\n{years}')

Text:
The earliest archaeological evidence of human habitation of the territory of the city of San Francisco dates to 3000 BC. The Yelamu group of the Ohlone people resided in a few small villages when an overland Spanish exploration party, led by Don Gaspar de Portolá, arrived on November 2, 1769, the first documented European visit to San Francisco Bay. Seven years later, on March 28, 1776, the Spanish established the Presidio of San Francisco, followed by a mission, Mission San Francisco de Asís (Mission Dolores), established by the Spanish explorer Juan Bautista de Anza.

Upon independence from Spain in 1821, the area became part of Mexico. Under Mexican rule, the mission system gradually ended, and its lands became privatized. In 1835, William Richardson, a naturalized Mexican citizen of English birth, erected the first independent homestead, near a boat anchorage around what is today Portsmouth Square. Together with Alcalde Francisco de Haro, he laid out a street plan for the exp
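For question generation, it may help to keep the sentence in which each year occurs. A minimal sketch, assuming a naive sentence split on end punctuation is good enough; `years_with_context` is a hypothetical helper, and the sample text is adapted from the output above:

```python
import re
from typing import List, Tuple


def years_with_context(text: str, max_year: int) -> List[Tuple[str, str]]:
  """Returns (year, sentence) pairs for four-digit numbers up to max_year."""
  # naive sentence split: '.', '!' or '?' followed by whitespace
  sentences = re.split(r'(?<=[.!?])\s+', text)
  pairs = []
  for sentence in sentences:
    for year in re.findall(r'\b\d{4}\b', sentence):
      if int(year) <= max_year:
        pairs.append((year, sentence))
  return pairs


sample_text = ('Upon independence from Spain in 1821, the area became part '
               'of Mexico. The earliest evidence of habitation dates to 3000 BC.')
pairs = years_with_context(sample_text, max_year=2020)
print(pairs)
```

As in the cell above, "3000 BC" is dropped only because 3000 exceeds the cutoff; a BC year below the cutoff would still be picked up as if it were AD.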