# Get Cities of The World Quality of Life Data

In this exercise, we will try to get scoring information about the quality of life for several cities around the world. 🌍

For this exercise, we will be using the following API:

- <a href="https://developers.teleport.org/api/getting_started/" target="_blank">Teleport</a>

We will also need a website called RandomList.com, which will give us random cities around the world to score.

Then we will store the data we collected in an S3 bucket!

Quite a project, right? 🥵

🥰🥰 You'll learn a lot during this exercise 🥰🥰 

So let's go 💪💪💪

## Part 1: Get data for 1 City 

To simplify this exercise, let's start by trying to scrape data for only 1 city: Paris. In another part, we'll try to get scores for 100 different cities.

- Import the library called `requests`:

In [1]:
import requests

* Check Teleport's API to find a way to search for information on Paris. In particular, we need its `geonameid`

  * Here is the link for the documentation 👉👉👉 [Teleport API](https://developers.teleport.org/api/getting_started/)

In [2]:
cityParis = requests.get('https://api.teleport.org/api/cities/?search=Paris')

ℹ️ℹ️ Parse the response as JSON to get the search results as a Python dictionary ℹ️ℹ️

In [3]:
cityParis = cityParis.json()

* Now that you have a list of search results, try to isolate Paris' `geonameid`

In [4]:
cityParis["_embedded"]["city:search-results"][0]["_links"]["city:item"]["href"]

'https://api.teleport.org/api/cities/geonameid:2988507/'
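
The `href` returned above ends with the `geonameid`. As a minimal sketch, assuming the URL always has the `.../geonameid:<digits>/` shape shown in that output, the numeric ID can be pulled out with a regular expression. The helper name `extract_geonameid` is hypothetical and not part of the Teleport API.

```python
import re


def extract_geonameid(href):
    """Return the numeric geonameid embedded in a Teleport city URL, or None."""
    match = re.search(r"geonameid:(\d+)", href)
    return int(match.group(1)) if match else None


paris_href = "https://api.teleport.org/api/cities/geonameid:2988507/"
print(extract_geonameid(paris_href))  # 2988507
```

In the exercise itself the full `href` is all you need for the next request, so this step is only useful if you want to store the ID on its own.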

* Use `requests` to get information about Paris 

In [5]:
cityParisID = requests.get('https://api.teleport.org/api/cities/geonameid:2988507/')
cityParisID = cityParisID.json()

* You should now be able to get Paris' quality of life scores 

In [6]:
cityParisQualityOfLife = requests.get(cityParisID["_links"]["city:urban_area"]["href"]+"scores/")
cityParisQualityOfLife = cityParisQualityOfLife.json()
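
The chained lookup in the cell above raises a `KeyError` if a city record has no `city:urban_area` link. Here is a small sketch of a more defensive version, assuming the response keeps the `_links` layout used above. `scores_url` is a hypothetical helper name, and the sample dictionaries are made up for illustration.

```python
def scores_url(city_info):
    """Return the URL of the urban-area scores, or None if the link is absent."""
    urban_area = city_info.get("_links", {}).get("city:urban_area")
    if urban_area is None:
        return None
    return urban_area["href"] + "scores/"


# Illustrative inputs only: the href is a placeholder, not a real API response.
with_area = {"_links": {"city:urban_area": {"href": "https://example.org/urban_areas/paris/"}}}
without_area = {"_links": {}}

print(scores_url(with_area))
print(scores_url(without_area))
```

Returning `None` lets the caller decide whether to skip the city or report it, instead of stopping on an exception.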

* Use `Pandas` to create a DataFrame where you'll get all the scores for Paris 

In [7]:
import pandas as pd

# Each entry of "categories" is a dict with the keys color, name and score_out_of_10.
# Build the DataFrame in one call: DataFrame.append was removed in pandas 2.0.
lst_categories = cityParisQualityOfLife["categories"]
df = pd.DataFrame(lst_categories)

df.head()

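|   | color   | name                | score_out_of_10 |
|---|---------|---------------------|-----------------|
| 0 | #f3c32c | Housing             | 3.5835          |
| 1 | #f3d630 | Cost of Living      | 3.664           |
| 2 | #f4eb33 | Startups            | 9.2765          |
| 3 | #d2ed31 | Venture Capital     | 7.513           |
| 4 | #7adc29 | Travel Connectivity | 10.0            |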
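
Once the scores are in a DataFrame, they can be ranked like any other column. A short sketch using the five rows shown above (the values are copied from that output, and nothing is assumed about the remaining categories):

```python
import pandas as pd

# The five categories displayed by df.head() for Paris
categories = [
    {"color": "#f3c32c", "name": "Housing", "score_out_of_10": 3.5835},
    {"color": "#f3d630", "name": "Cost of Living", "score_out_of_10": 3.664},
    {"color": "#f4eb33", "name": "Startups", "score_out_of_10": 9.2765},
    {"color": "#d2ed31", "name": "Venture Capital", "score_out_of_10": 7.513},
    {"color": "#7adc29", "name": "Travel Connectivity", "score_out_of_10": 10.0},
]

scores = pd.DataFrame(categories)

# Sort from best to worst score and renumber the rows
ranked = scores.sort_values("score_out_of_10", ascending=False).reset_index(drop=True)
print(ranked[["name", "score_out_of_10"]])
```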


* We now need to upload this DataFrame to S3. Let's first create a Boto3 session 
  * For the next steps, refer to the Boto3 documentation 👉👉👉 [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html)

In [8]:
import boto3

session = boto3.Session()

* Now create an S3 resource from the session 

In [9]:
s3 = session.resource("s3")

* Create a bucket named `scoring-cities-in-the-world`. Bucket names must be globally unique, so add a prefix of your own (the cell below uses `jedha-bq-`)

In [10]:
bucket = s3.create_bucket(Bucket="jedha-bq-scoring-cities-in-the-world", CreateBucketConfiguration={'LocationConstraint': 'eu-west-3'})
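
S3 rejects bucket names that break its naming rules, so it can help to check a candidate name before calling `create_bucket`. The sketch below is a simplified check covering only the main rules (3 to 63 characters, lowercase letters, digits, hyphens and dots, starting and ending with a letter or digit). The full rules are in the AWS documentation, and `looks_like_valid_bucket_name` is a hypothetical helper, not a Boto3 function.

```python
import re

# Simplified pattern: first and last characters are a lowercase letter or digit,
# with 1 to 61 lowercase letters, digits, dots or hyphens in between (3 to 63 total).
BUCKET_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if the name passes the simplified S3 naming check."""
    return BUCKET_NAME_PATTERN.fullmatch(name) is not None


print(looks_like_valid_bucket_name("jedha-bq-scoring-cities-in-the-world"))
print(looks_like_valid_bucket_name("Scoring_Cities"))
```

Passing this check does not guarantee that the name is still available; a name already taken by another account will still make `create_bucket` fail.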

* Use `Pandas` to export your DataFrame as a csv file

In [11]:
csv = df.to_csv()
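
Called without a path, `to_csv()` returns the CSV as a string instead of writing a file. It also writes the index as an unnamed first column, which pandas labels `Unnamed: 0` when the CSV is read back. A sketch of both behaviours, using a two-row stand-in DataFrame:

```python
import io

import pandas as pd

sample = pd.DataFrame(
    {"name": ["Housing", "Startups"], "score_out_of_10": [3.5835, 9.2765]}
)

with_index = sample.to_csv()                # index written as the first column
without_index = sample.to_csv(index=False)  # index left out

print(type(with_index).__name__)  # str

# Read both strings back to compare the resulting columns
columns_with_index = list(pd.read_csv(io.StringIO(with_index)).columns)
columns_without_index = list(pd.read_csv(io.StringIO(without_index)).columns)
print(columns_with_index)     # ['Unnamed: 0', 'name', 'score_out_of_10']
print(columns_without_index)  # ['name', 'score_out_of_10']
```

If you do not need the row numbers in the stored file, `index=False` keeps the CSV cleaner.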

* Use the bucket's `put_object()` method to create an object in the bucket you just created 

In [12]:
put_object = bucket.put_object(Key="test.csv", Body=csv)

## Part 2: Get Data for Several Cities 

😉 Congrats! 😉 You made it to the second part of the exercise. We now need more data so that we can compare cities later. Let's find a way to get data for many more cities 

* Go to [this Wikipedia page](https://en.wikipedia.org/wiki/List_of_largest_cities). There you'll find a list of the world's largest cities.
  * Use `scrapy` to scrape the city names directly from this page 😎

In [13]:
import os
import logging

import scrapy
from scrapy.crawler import CrawlerProcess

In [14]:
class WikipediaSpider(scrapy.Spider):

    # Name of your spider
    name = "wikipedia"

    # Url to start your spider from 
    start_urls = [
        'https://en.wikipedia.org/wiki/List_of_largest_cities',
    ]

    # Callback that collects the id attribute of every table row (the ids hold the city names)
    def parse(self, response):
        city = response.css('tbody')
        
        for i in city:
        
            yield {
                # 'text': i.css('a::text').getall()
                'text': i.css('tr::attr(id)').getall()
            }

In [15]:
# Name of the file where the results will be saved
filename = "wikipedia-city-name.json"

# If the file already exists, delete it before crawling (otherwise Scrapy appends the new results to the old ones)
if filename in os.listdir():
    os.remove(filename)

# Declare a new CrawlerProcess with some settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        filename : {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(WikipediaSpider)
process.start()

2022-04-02 13:59:11 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-02 13:59:11 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.8.12 (default, Oct 12 2021, 03:01:40) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.19043-SP0
2022-04-02 13:59:11 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2022-04-02 13:59:11 [scrapy.extensions.telnet] INFO: Telnet Password: 7a436e4acd98e934
2022-04-02 13:59:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-04-02 13:59:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth

* Read the JSON file containing the crawl results:

In [16]:
import pandas as pd
wikipedia =  pd.read_json("wikipedia-city-name.json")

In [17]:
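# Each scraped item holds the row ids of one <tbody>; the second item (index 1) contains the city names
city_list = wikipedia["text"].iloc[1]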
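
The entries of `city_list` are Wikipedia row ids, so they use underscores instead of spaces and may contain accents or apostrophes (`São_Paulo`, `Xi'an`). A sketch of building the search URL with explicit percent-encoding from the standard library; `search_url` is a hypothetical helper name.

```python
from urllib.parse import quote


def search_url(city_id):
    """Turn a Wikipedia row id such as 'New_York_City' into a Teleport search URL."""
    query = quote(city_id.replace("_", " "))  # spaces become %20, accents are UTF-8 encoded
    return "https://api.teleport.org/api/cities/?search=" + query


print(search_url("New_York_City"))
print(search_url("São_Paulo"))
```

An alternative is to pass `params={"search": name}` to `requests.get`, which lets `requests` do the encoding for you.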

* Finally, create a loop that will go through each city, search for information and store it to your S3 bucket 
  * You will probably get errors for some cities, so wrap the body of the loop in a `try` / `except` block 
  * (It's totally fine if you couldn't get info for all cities) 😌😌
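
One way to structure such a loop, sketched with a stand-in `fetch_scores` function so that it runs without network access: catch only the exceptions a missing result would raise, and keep a list of the cities that failed. `collect_scores`, `fetch_scores` and the fake data are hypothetical names used for illustration only.

```python
def collect_scores(cities, fetch_scores):
    """Call fetch_scores for each city and return (results, failed)."""
    results, failed = {}, []
    for city in cities:
        try:
            results[city] = fetch_scores(city)
        except (KeyError, IndexError) as error:
            # A missing key or an empty result list means no scores for this city
            failed.append(city)
            print(f"Couldn't find results for {city} ({type(error).__name__})")
        else:
            print(f"{city} done")
    return results, failed


# Fake data standing in for the API: only Paris has an entry here
fake_api = {"Paris": [{"name": "Housing", "score_out_of_10": 3.5835}]}

results, failed = collect_scores(["Paris", "Dhaka"], lambda city: fake_api[city])
print(failed)
```

Keeping the failed cities in a list makes it easy to retry them later or to check their spelling by hand.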

In [18]:
for i in city_list:
    try:
        # Search for the city (the Wikipedia ids use "_" in place of spaces)
        current_city = requests.get('https://api.teleport.org/api/cities/?search='+i.replace("_", "%20"))
        current_city = current_city.json()
        # Raises IndexError or KeyError when the search returns nothing or a link is missing
        current_city_ID = current_city["_embedded"]["city:search-results"][0]["_links"]["city:item"]["href"]
        current_city_info = requests.get(current_city_ID)
        current_city_info = current_city_info.json()
        currentCityQualityOfLife = requests.get(current_city_info["_links"]["city:urban_area"]["href"]+"scores/")
        currentCityQualityOfLife = currentCityQualityOfLife.json()
        # Build the DataFrame in one call: DataFrame.append was removed in pandas 2.0
        df = pd.DataFrame(currentCityQualityOfLife["categories"])

        # Name the category column after the city, then upload one CSV per city
        df = df.rename(columns={'name': i})
        csv = df.to_csv()
        put_object = bucket.put_object(Key=i+".csv", Body=csv)
        print(i+" done")
    except Exception:
        # Catch Exception rather than using a bare except, so that Ctrl+C still stops the loop
        print("Couldn't find results for "+i)

Tokyo done
Delhi done
Shanghai done
São_Paulo done
Mexico_City done
Cairo done
Mumbai done
Beijing done
Couldn't find results for Dhaka
Osaka done
New_York_City done
Couldn't find results for Karachi
Buenos_Aires done
Couldn't find results for Chongqing
Istanbul done
Couldn't find results for Kolkata
Manila done
Lagos done
Rio_de_Janeiro done
Couldn't find results for Tianjin
Couldn't find results for Kinshasa
Guangzhou done
Los_Angeles done
Moscow done
Shenzhen done
Couldn't find results for Lahore
Bangalore done
Paris done
Bogotá done
Jakarta done
Chennai done
Lima done
Bangkok done
Seoul done
Couldn't find results for Nagoya
Hyderabad done
London done
Tehran done
Chicago done
Couldn't find results for Chengdu
Couldn't find results for Nanjing
Couldn't find results for Wuhan
Ho_Chi_Minh_City done
Couldn't find results for Luanda
Couldn't find results for Ahmedabad
Kuala_Lumpur done
Couldn't find results for Xi'an
Hong_Kong done
Dongguan done
Hangzhou done
Foshan done
Couldn't find re

🎊🎊🎊 Congratulations, you made it to the end of this exercise!! 🎊🎊🎊🎊