# World cities exercise

The goal of this exercise is to practice a common workflow for analyzing large data sets by combining Pandas and SQLAlchemy. 

This exercise uses an edited version of the world cities dataset from MaxMind:

    https://www.maxmind.com/en/free-world-cities-database

For simplicity, the original data has been cleaned by removing all rows
with missing values.


In [8]:
# Useful imports.
import pandas as pd
from sqlalchemy import create_engine, func, select, Column, Integer, MetaData, String, Table

In [2]:
# Remove any existing database created in previous runs of this notebook.
import os

if os.path.isfile('world_cities.db'):
    os.remove('world_cities.db')

**Question 1**

Using SQLAlchemy, create a new table that holds the population, country and city.
Use "sqlite:///world_cities.db" as the URI to create an SQLite database in a local file.

The table should be called "cities" and have 3 columns: "country" of type `String`, "city" of type `String`, and "population" of type `Integer`.

In [3]:
# Your code goes here.
# Engine for an SQLite database stored in a local file.
db = create_engine("sqlite:///world_cities.db")

metadata = MetaData()

cities = Table('cities', metadata,
               Column('country', String),
               Column('city', String),
               Column('population', Integer))
metadata.create_all(db)
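
One way to confirm that the table was created with the expected columns is to ask the database for its schema. The following is a minimal sketch, not part of the original exercise: it uses an in-memory SQLite database (the URI `sqlite://`) so that it leaves `world_cities.db` untouched, and the variable names `engine` and `column_names` are chosen here for illustration.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, inspect

# In-memory SQLite database, so this check does not touch world_cities.db.
engine = create_engine("sqlite://")
metadata = MetaData()
Table(
    "cities", metadata,
    Column("country", String),
    Column("city", String),
    Column("population", Integer),
)
metadata.create_all(engine)

# Ask the database itself which columns the table has.
column_names = [col["name"] for col in inspect(engine).get_columns("cities")]
print(column_names)
```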

**Question 2**

Import the data from the file `world_cities.csv` into the newly created table.

To make things interesting, let's pretend that the CSV is so large that it does not fit into memory:

1. Use Pandas' `read_csv` function to read 10000 lines at a time from the CSV file (using the `chunksize` argument). Column 0 in the file corresponds to the index.
2. Iterate over all file chunks.
3. Store each resulting data frame in the table using the `to_sql` method. (Hint: Use `if_exists='append'` to append to the DB table, and `index=False` to prevent Pandas from trying to store the index, for which we did not create a column.)
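
The steps above can be tried on a small scale first. The sketch below is only an illustration under stated assumptions: the CSV content is made up (it is not taken from `world_cities.csv`), the chunk size is 2 instead of 10000, and the database lives in memory rather than in `world_cities.db`.

```python
import io

import pandas as pd
from sqlalchemy import create_engine

# Made-up stand-in for world_cities.csv; column 0 is the index, as in the exercise.
csv_text = """,country,city,population
0,aa,alpha,100
1,aa,beta,200
2,bb,gamma,300
3,bb,delta,400
4,cc,epsilon,500
"""

engine = create_engine("sqlite://")  # in-memory database

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one large DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2, index_col=0)

chunk_sizes = []
for chunk in reader:
    chunk_sizes.append(len(chunk))
    # Append each chunk; to_sql creates the table on the first call.
    chunk.to_sql("cities", engine, if_exists="append", index=False)

total_rows = len(pd.read_sql("SELECT * FROM cities", engine))
print(chunk_sizes, total_rows)
```

Only one chunk is held in memory at a time, which is what makes the approach workable for files larger than the available RAM.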

In [4]:
# Your code goes here.

# Read the CSV lazily, 10000 rows per chunk; column 0 holds the index.
reader = pd.read_csv('world_cities.csv', chunksize=10000, index_col=0)

for chunk in reader:
    chunk.to_sql('cities', db, if_exists='append', index=False)

**Question 3**

Using SQLAlchemy only, count how many rows are stored in the cities table. (There should be 47979 of them.)

In [10]:
# Your code goes here.
# COUNT(*) over the cities table.
s = select([func.count()]).select_from(cities)
with db.connect() as conn:
    print('Number of rows: {}'.format(conn.execute(s).scalar()))

Number of rows: 47979


**Question 4**

Which cities have a population above 5 million people?

Using SQLAlchemy, create an appropriate SQL query. Execute the query and create a DataFrame using Pandas' `read_sql` function.

Bonus: Change your query to get a dataframe sorted by the population, ideally in descending order.

In [None]:
# Your code goes here.
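pop = select([cities]).where(cities.c.population > 5e6).order_by(cities.c.population.desc())
# Run the query and load the result into a DataFrame.
big_cities = pd.read_sql(pop, db)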
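
To see what a filtered, sorted query looks like before running it, the sketch below compiles one to SQL text. It is an illustration rather than the exercise's reference solution: it redefines the `cities` table locally so that it runs on its own, and it passes the table to `select()` positionally, which assumes SQLAlchemy 1.4 or newer (the cells in this notebook use the older list form).

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, select

metadata = MetaData()
cities = Table(
    "cities", metadata,
    Column("country", String),
    Column("city", String),
    Column("population", Integer),
)

# Positional arguments to select() assume SQLAlchemy 1.4 or newer;
# older releases expect a list, e.g. select([cities]).
query = (
    select(cities)
    .where(cities.c.population > 5000000)
    .order_by(cities.c.population.desc())
)

# str() compiles the statement with a generic dialect; no database is needed.
sql_text = str(query)
print(sql_text)
```

Printing the statement is a cheap way to check that the `WHERE` and `ORDER BY` clauses refer to the intended column before handing the query to `read_sql`.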

**Question 5**

Compute the number of cities per country: create a "group by" query using SQLAlchemy, and import the result in Pandas.

Bonus: time the execution of that approach and compare with loading all data into a pandas `DataFrame` and doing the same operation there.

In [None]:
# Your code goes here.

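s = select([cities.c.country, func.count(cities.c.city).label('n_cities')]).group_by(cities.c.country)
# Import the grouped result into Pandas.
per_country = pd.read_sql(s, db)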
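
For the bonus part, the two approaches can be timed side by side. The sketch below is a minimal illustration under stated assumptions: it uses a three-row, made-up table in an in-memory database instead of the real data, the names `sql_counts` and `pandas_counts` are chosen here, and `select()` is called with positional columns, which assumes SQLAlchemy 1.4 or newer.

```python
from timeit import default_timer as timer

import pandas as pd
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, func, select

engine = create_engine("sqlite://")  # in-memory database
metadata = MetaData()
cities = Table(
    "cities", metadata,
    Column("country", String),
    Column("city", String),
    Column("population", Integer),
)
metadata.create_all(engine)

# Made-up rows standing in for the real dataset.
toy = pd.DataFrame({
    "country": ["aa", "aa", "bb"],
    "city": ["alpha", "beta", "gamma"],
    "population": [100, 200, 300],
})
toy.to_sql("cities", engine, if_exists="append", index=False)

# Approach 1: let the database do the grouping.
query = (
    select(cities.c.country, func.count(cities.c.city).label("n_cities"))
    .group_by(cities.c.country)
    .order_by(cities.c.country)
)
start = timer()
sql_counts = pd.read_sql(query, engine)
sql_seconds = timer() - start

# Approach 2: load every row, then group in pandas.
start = timer()
all_rows = pd.read_sql(select(cities), engine)
pandas_counts = all_rows.groupby("country")["city"].count()
pandas_seconds = timer() - start

print("SQL group by:    {:.6f} s".format(sql_seconds))
print("pandas group by: {:.6f} s".format(pandas_seconds))
```

On a table this small the timings are dominated by overhead, so the comparison only becomes meaningful when run against the full `cities` table.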