# Refugees in the Age of Global Warming
### Data Engineering Capstone Project

#### Project Summary
This project focuses on monitoring refugee and population information around the world based on temperature changes over time.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import configparser
import os
from datetime import datetime

import pandas as pd
import psycopg2
from sqlalchemy.engine import create_engine

config = configparser.ConfigParser()

In [59]:
# Read in parameters needed for Redshift cluster
# config = configparser.ConfigParser()
# config.read('dwh.cfg')

# # Connect to the Redshift cluster and get a cursor to it
# conn = psycopg2.connect("""host={} 
#                            dbname={} 
#                            user={} 
#                            password={} 
#                            port={}""".format(*config['CLUSTER'].values()))
# cur = conn.cursor()
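
The commented-out connection above depends on `config['CLUSTER'].values()` arriving in exactly the order host, dbname, user, password, port. A minimal sketch of a name-based alternative follows. The contents of `dwh.cfg` are not shown in this notebook, so the key names (`host`, `db_name`, `db_user`, `db_password`, `db_port`) and all values below are hypothetical.

```python
import configparser

# Hypothetical stand-in for dwh.cfg; the real key names and values are not shown in this notebook.
SAMPLE_CFG = """
[CLUSTER]
host = example-cluster.abc123.us-west-2.redshift.amazonaws.com
db_name = dev
db_user = analyst
db_password = not-a-real-password
db_port = 5439
"""

def build_dsn(cfg_text: str) -> str:
    """Build a connection string by looking up each setting by name, not by position."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    cluster = config["CLUSTER"]
    return "host={} dbname={} user={} password={} port={}".format(
        cluster["host"],
        cluster["db_name"],
        cluster["db_user"],
        cluster["db_password"],
        cluster["db_port"],
    )

dsn = build_dsn(SAMPLE_CFG)
# Mask the password before printing anything to the notebook output.
print(dsn.replace("not-a-real-password", "***"))
```

The resulting string can be passed to `psycopg2.connect(dsn)`. Reordering the keys in the config file then no longer changes which value lands in which slot.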

### Step 1: Scope the Project and Gather Data

#### Scope 
What does your end solution look like? What tools did you use?

The plan is to build a data warehouse for analytical processes, so analysts can design recurring and ad hoc reports over time using SQL. There is a strong emphasis on ensuring the warehouse is easy to interpret, performant, and quality assured.
 
#### Data Sources and Content

There are four source datasets:
 1. City_temperature.csv
     - Summary: average daily temperature for all major cities in the world from 1995 - 2020
     - Source: University of Dayton - separate txt files available for each city [here](https://academic.udayton.edu/kissock/http/Weather/default.htm). The data is available for research and non-commercial purposes only. Refer to [this page](https://academic.udayton.edu/kissock/http/Weather/default.htm) for license.
     - Secondary source: SRK via Kaggle - [link](https://www.kaggle.com/sudalairajkumar/daily-temperature-of-major-cities)
 2. Country_population_total_long.csv
     - Summary: annual population counts by country from 1960 - 2017
     - Source: The World Bank - [link](https://data.worldbank.org/indicator/SP.POP.TOTL)
     - Secondary source: Devakumar kp via Kaggle - [link](https://www.kaggle.com/imdevskp/world-population-19602018?select=population_total_long.csv)
 3. UNdata_City_Population_20210315.csv
     - Summary: annual population counts by city from 1970 - 2020 (contains gaps in the 1970s)
     - Source: UN Data - [link](https://data.un.org/Data.aspx?d=POP&f=tableCode%3A240)
 4. UNdata_Refugees_20210217.csv
     - Summary: annual refugee counts from 1975 - 2016 by country of residence and country of origin
     - Source: UN Data - [link](http://data.un.org/Data.aspx?d=UNHCR&f=indID%3aType-Ref)
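
The four sources cover different year ranges, so any analysis that joins all of them is limited to the years they share. A small sketch of that arithmetic, using only the ranges stated in the summaries above (the dictionary keys and the function name are illustrative):

```python
# Year coverage per dataset, taken from the summaries listed above.
coverage = {
    "city_temperature": (1995, 2020),
    "country_population": (1960, 2017),
    "city_population": (1970, 2020),
    "refugees": (1975, 2016),
}

def common_window(ranges):
    """Return the (start, end) years covered by every dataset, or None if they do not overlap."""
    start = max(first for first, _ in ranges.values())
    end = min(last for _, last in ranges.values())
    return (start, end) if start <= end else None

window = common_window(coverage)
print(window)  # (1995, 2016)
```

The temperature data sets the lower bound and the refugee data sets the upper bound.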

### Read in Each Dataset

#### Temperature Data

In [2]:
temp_df = pd.read_csv('Data/temperature_data/city_temperature.csv', engine = 'python')
temp_df.head()

Unnamed: 0,Region,Country,State,City,Month,Day,Year,AvgTemperature
0,Africa,Algeria,,Algiers,1,1,1995,64.2
1,Africa,Algeria,,Algiers,1,2,1995,49.4
2,Africa,Algeria,,Algiers,1,3,1995,48.8
3,Africa,Algeria,,Algiers,1,4,1995,46.4
4,Africa,Algeria,,Algiers,1,5,1995,47.9


#### Population Counts by Country and Year

In [91]:
country_pop_df = pd.read_csv('Data/country_population_data/country_population_total_long.csv', engine = 'python')
country_pop_df.head()

Unnamed: 0,Country Name,Year,Count
0,Aruba,1960,54211
1,Afghanistan,1960,8996973
2,Angola,1960,5454933
3,Albania,1960,1608800
4,Andorra,1960,13411


#### Population Counts by City and Year

In [4]:
city_pop_df = pd.read_csv('Data/city_population_data/UNdata_City_Population_20210315.csv', engine = 'python')
city_pop_df.head()

Unnamed: 0,Country or Area,Year,Area,Sex,City,City type,Record Type,Reliability,Source Year,Value,Value Footnotes
0,Åland Islands,2019,Total,Both Sexes,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2020,11711.0,1
1,Åland Islands,2019,Total,Male,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2020,5606.0,1
2,Åland Islands,2019,Total,Female,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2020,6105.0,1
3,Åland Islands,2018,Total,Both Sexes,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2019,11709.0,1
4,Åland Islands,2018,Total,Male,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2019,5620.5,1


#### Refugee Counts by Year, Country of Residence, and Country of Origin

In [5]:
refugee_df = pd.read_csv('Data/refugee_data/UNdata_Refugees_20210317.csv', engine = 'python')
refugee_df.head()

Unnamed: 0,Country or territory of asylum or residence,Country or territory of origin,Year,Refugees,Refugees assisted by UNHCR,Total refugees and people in refugee-like situations,Total refugees and people in refugee-like situations assisted by UNHCR
0,Afghanistan,Iraq,2016,1.0,1.0,1.0,1.0
1,Afghanistan,Islamic Rep. of Iran,2016,33.0,33.0,33.0,33.0
2,Afghanistan,Pakistan,2016,59737.0,59737.0,59737.0,59737.0
3,Albania,China,2016,11.0,11.0,11.0,11.0
4,Albania,Dem. Rep. of the Congo,2016,3.0,3.0,3.0,3.0


### Extract code - remove before submitting

In [8]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.enableHiveSupport().getOrCreate()
df_spark =spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

In [11]:
#write to parquet
df_spark.write.parquet("sas_data")
df_spark=spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

#### Temperature Data

In [39]:
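# Add a date field. Note: this is text in M/D/YYYY form, so min/max on it sort lexicographically, not chronologically.
temp_df['Date'] = temp_df['Month'].map(str) + '/' + temp_df['Day'].map(str) + '/' + temp_df['Year'].map(str)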

# Summary stats by region. Note that there are seven regions. 
temp_df[['Region', 'Date', 'AvgTemperature']].groupby(['Region']).agg(['min', 'max', 'nunique'])

Region,Date min,Date max,Date nunique,AvgTemperature min,AvgTemperature max,AvgTemperature nunique
Africa,1/1/1995,9/9/2019,9265,-99.0,102.8,654
Asia,1/1/1995,9/9/2019,9265,-99.0,103.7,1334
Australia/South Pacific,1/1/1995,9/9/2019,9265,-99.0,96.8,600
Europe,1/1/1995,9/9/2019,9265,-99.0,102.5,1079
Middle East,1/1/1995,9/9/2019,9265,-99.0,110.0,996
North America,1/1/1995,9/9/2019,9265,-99.0,107.7,1474
South/Central America & Carribean,1/1/1995,9/9/2019,9265,-99.0,97.4,609
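
The `Date` column is built as text, so `min` and `max` compare it character by character. That is likely why every region reports `9/9/2019` as its latest date even though the data runs into 2020. A minimal sketch, on made-up rows, of building a real datetime column instead. The column names `DateText` and `Date` and the use of `errors="coerce"` are illustrative choices, not the notebook's method.

```python
import pandas as pd

# Made-up rows; the last one has Day == 0, like the invalid records noted in this notebook.
sample = pd.DataFrame({
    "Month": [1, 9, 12, 3],
    "Day": [1, 9, 31, 0],
    "Year": [1995, 2019, 2019, 2005],
})

# Text date, built the same way as in the cell above.
sample["DateText"] = (
    sample["Month"].map(str) + "/" + sample["Day"].map(str) + "/" + sample["Year"].map(str)
)

# Real datetime; pd.to_datetime assembles it from year/month/day columns.
# errors="coerce" turns impossible dates (such as day 0) into NaT instead of raising.
parts = sample[["Year", "Month", "Day"]].rename(columns=str.lower)
sample["Date"] = pd.to_datetime(parts, errors="coerce")

text_max = sample["DateText"].max()   # string comparison
date_max = sample["Date"].max()       # chronological comparison, NaT is skipped
print(text_max, date_max.date())
```

With a datetime column, the invalid rows can be dropped with `sample.dropna(subset=["Date"])` rather than by separate filters on day and year.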


In [88]:
# Note: there are records with a day value of 0. Exclude.
temp_df = temp_df[temp_df.Day > 0]

# Note: there are records with a year value of 200 and 201. Exclude.
temp_df = temp_df[temp_df.Year >= 1995]

# Note: missing temps are represented as -99. Exclude.
temp_df = temp_df[temp_df.AvgTemperature != -99]

# Drop duplicates
temp_df = temp_df.drop_duplicates()

# Verify there are no duplicate records at the most granular level, which is state, city, and date
# 94 duplicates remain. Need to research further.
duplicate = temp_df[temp_df.duplicated(['State','City', 'Date'], keep=False)].sort_values(by = ['City', 'Date'])
duplicate.head()

Unnamed: 0,Region,Country,State,City,Month,Day,Year,AvgTemperature,Date
744576,Europe,Germany,,Hamburg,1,25,2011,38.5,1/25/2011
744970,Europe,Germany,,Hamburg,1,25,2011,38.6,1/25/2011
743935,Europe,Germany,,Hamburg,5,25,2010,51.7,5/25/2010
744330,Europe,Germany,,Hamburg,5,25,2010,52.0,5/25/2010
743962,Europe,Germany,,Hamburg,6,21,2010,57.4,6/21/2010
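
The remaining duplicates shown above differ only slightly in temperature. One possible way to resolve them, offered as an assumption and not as the notebook's decision, is to average the readings per state, city, and date. A minimal sketch using two of the Hamburg pairs from the output (the function name is hypothetical):

```python
import pandas as pd

# Two of the duplicate pairs shown above (Hamburg, same date, slightly different readings).
dupes = pd.DataFrame({
    "State": [None, None, None, None],
    "City": ["Hamburg"] * 4,
    "Date": ["1/25/2011", "1/25/2011", "5/25/2010", "5/25/2010"],
    "AvgTemperature": [38.5, 38.6, 51.7, 52.0],
})

def average_duplicates(df):
    """Collapse rows sharing State, City and Date into one row holding their mean temperature."""
    # groupby drops rows whose key is NaN by default, so fill missing states first.
    keyed = df.assign(State=df["State"].fillna(""))
    return keyed.groupby(["State", "City", "Date"], as_index=False)["AvgTemperature"].mean()

resolved = average_duplicates(dupes)
print(resolved)
```

Keeping the first or last reading would be an equally simple rule. Which one is appropriate depends on why the source contains two readings for the same day.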


In [37]:
# Summary stats by city, country, or state for a region of interest.
# Note: there are some time gaps in cities.
def summary_stats(group_by, region):
    # Keep only the rows for the requested region, then aggregate per group
    in_region = temp_df.loc[temp_df.Region == region, [group_by, 'Date', 'AvgTemperature']]
    return in_region.groupby(group_by).agg(['min', 'max', 'nunique'])

# group_by input options: City, Country, State
# region input options: Africa, Asia, Australia/South Pacific, Europe, Middle East, North America, South/Central America & Carribean
summary_stats(group_by = 'City', region = 'South/Central America & Carribean')

City,Date min,Date max,Date nunique,AvgTemperature min,AvgTemperature max,AvgTemperature nunique
Belize City,1/1/1996,9/9/2019,9262,-99.0,92.9,234
Bogota,1/1/1995,9/9/2019,9265,-99.0,66.7,155
Brasilia,1/1/1995,9/9/2019,9265,-99.0,87.7,245
Bridgetown,1/1/1995,9/9/2017,8541,-99.0,88.0,121
Buenos Aires,1/1/1995,9/9/2019,9265,-99.0,90.9,508
Caracas,1/1/1995,9/9/2019,9264,-99.0,89.9,163
Georgetown,1/1/1995,9/9/2011,5064,-99.0,90.6,136
Guatemala City,1/1/1995,9/9/2019,9265,-99.0,79.8,221
Guayaquil,1/1/1995,9/9/2019,9265,-99.0,90.0,189
Hamilton,1/1/1995,9/9/2009,5713,-99.0,85.4,308
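
The differing `nunique` counts above point to time gaps for some cities. A minimal sketch of one way to quantify them: compare the number of observed days with the number of days expected between a city's first and last reading. It assumes `Date` has already been converted to a datetime column. The function name and the sample rows are made up.

```python
import pandas as pd

def coverage_by_city(df):
    """Compare observed days with the days expected between each city's first and last reading."""
    stats = df.groupby("City")["Date"].agg(first="min", last="max", observed="nunique")
    stats["expected"] = (stats["last"] - stats["first"]).dt.days + 1
    stats["missing"] = stats["expected"] - stats["observed"]
    return stats

# Made-up readings: "A" is complete for five days, "B" skips two days in the same span.
readings = pd.DataFrame({
    "City": ["A"] * 5 + ["B"] * 3,
    "Date": pd.to_datetime([
        "2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04", "2019-01-05",
        "2019-01-01", "2019-01-02", "2019-01-05",
    ]),
})

gaps = coverage_by_city(readings)
print(gaps[["observed", "expected", "missing"]])
```

Cities whose `missing` count is large relative to `expected` are candidates for exclusion or for a note in the data dictionary.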


In [30]:
# Verify location naming conventions are consistent to avoid separating temperatures that belong together
def unique_values(region):
    # Keep only the rows for the requested region, then list its distinct locations
    in_region = temp_df.loc[temp_df.Region == region, ['Region', 'Country', 'State', 'City']]
    return in_region.drop_duplicates().sort_values(by = ['Country', 'State', 'City'])

# region input options: Africa, Asia, Australia/South Pacific, Europe, Middle East, North America, South/Central America & Carribean
unique_values(region = 'South/Central America & Carribean')

Unnamed: 0,Region,Country,State,City
1231460,South/Central America & Carribean,Argentina,,Buenos Aires
1240726,South/Central America & Carribean,Bahamas,,Nassau
1274233,South/Central America & Carribean,Barbados,,Bridgetown
1255704,South/Central America & Carribean,Belize,,Belize City
1249991,South/Central America & Carribean,Bermuda,,Hamilton
1264967,South/Central America & Carribean,Bolivia,,La Paz
1282775,South/Central America & Carribean,Brazil,,Brasilia
1292041,South/Central America & Carribean,Brazil,,Rio de Janeiro
1301306,South/Central America & Carribean,Brazil,,Sao Paulo
1310572,South/Central America & Carribean,Colombia,,Bogota


In [126]:
# Check if columns contain null values. Only State has nulls, which is to be expected.
temp_df.loc[temp_df[['Region', 'Country', 'City', 'Date', 'AvgTemperature']].isnull().any(axis=1), :]

Unnamed: 0,Region,Country,State,City,Month,Day,Year,AvgTemperature,Date


#### Population Counts by Country and Year

In [131]:
# Rename columns to be more descriptive
country_pop_df.columns = ['Country', 'Year', 'Country_Population']

# Summary stats by country, limited to country names that sort after 'C'.
country_pop_df.where(country_pop_df.Country > 'C').groupby(['Country']).agg(['min', 'max', 'nunique'])
# country_pop_df.agg(['min', 'max', 'nunique'])
# country_pop_df.where(country_pop_df.Country == 'China')

Country,Year min,Year max,Year nunique,Country_Population min,Country_Population max,Country_Population nunique
Cabo Verde,1960.0,2017.0,58,201765.0,5.374970e+05,58
Cambodia,1960.0,2017.0,58,5722370.0,1.600941e+07,58
Cameroon,1960.0,2017.0,58,5176918.0,2.456604e+07,58
Canada,1960.0,2017.0,58,17909009.0,3.654027e+07,58
Caribbean small states,1960.0,2017.0,58,4194710.0,7.314990e+06,58
Cayman Islands,1960.0,2017.0,58,7865.0,6.338200e+04,58
Central African Republic,1960.0,2017.0,58,1501668.0,4.596028e+06,58
Chad,1960.0,2017.0,58,3001609.0,1.501677e+07,58
Channel Islands,1960.0,2017.0,58,109420.0,1.686650e+05,58
Chile,1960.0,2017.0,58,8132990.0,1.847044e+07,58
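
`DataFrame.where` keeps every row and replaces the non-matching ones with `NaN`, which forces integer columns to float. That is why `Year` appears as `1960.0` in the output above. A small sketch on illustrative rows, contrasting it with boolean indexing:

```python
import pandas as pd

# Illustrative rows shaped like country_pop_df after the rename above.
pop = pd.DataFrame({
    "Country": ["Aruba", "Canada", "Chile"],
    "Year": [1960, 1960, 1960],
    "Country_Population": [54211, 17909009, 8132990],
})

mask = pop["Country"] > "C"

masked = pop.where(mask)   # same shape; non-matching rows become NaN, ints become floats
filtered = pop[mask]       # non-matching rows are dropped; dtypes are preserved

print(masked["Year"].dtype, filtered["Year"].dtype)
```

Both forms give the same groups after `groupby`, since rows with a `NaN` key are dropped, but boolean indexing keeps years and counts as integers.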


In [107]:
# Verify there are no duplicate records at the most granular level, which is country and year
# No duplicates.
duplicate = country_pop_df[country_pop_df.duplicated(['Country', 'Year'], keep=False)].sort_values(by = ['Country', 'Year'])
duplicate.head()

Unnamed: 0,Country,Year,Country_Population


In [120]:
# Review country names for inconsistencies. May have to adjust names to align with other sources.
country_pop_df[['Country']].drop_duplicates().sort_values(by = ['Country'])

Unnamed: 0,Country
1,Afghanistan
3,Albania
56,Algeria
8,American Samoa
4,Andorra
2,Angola
9,Antigua and Barbuda
6,Argentina
7,Armenia
0,Aruba
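
Country names are spelled differently across the sources. The refugee data shown earlier, for example, uses `Islamic Rep. of Iran` and `Dem. Rep. of the Congo`. A minimal sketch of one way to surface names that still need alignment, assuming a hand-maintained mapping. The function name is hypothetical, and the right-hand spellings in `NAME_MAP` are assumptions that should be verified against `country_pop_df`.

```python
# Hypothetical mapping from refugee-data spellings to the spellings used in the population data.
# The right-hand values are assumptions and should be checked against country_pop_df.
NAME_MAP = {
    "Islamic Rep. of Iran": "Iran, Islamic Rep.",
    "Dem. Rep. of the Congo": "Congo, Dem. Rep.",
}

def unmatched_names(source_names, reference_names, name_map):
    """Return source names that still have no counterpart in the reference after mapping."""
    reference = set(reference_names)
    return sorted(
        name for name in set(source_names)
        if name_map.get(name, name) not in reference
    )

# Small illustrative name lists, not the full datasets.
refugee_names = ["Afghanistan", "Islamic Rep. of Iran", "Dem. Rep. of the Congo", "Pakistan"]
population_names = ["Afghanistan", "Iran, Islamic Rep.", "Congo, Dem. Rep.", "Albania"]

print(unmatched_names(refugee_names, population_names, NAME_MAP))  # ['Pakistan']
```

Running this against the real columns gives a worklist of names to add to the mapping before the sources are joined.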


In [132]:
# Check if columns contain null values. No null values detected.
country_pop_df.loc[country_pop_df[['Country', 'Year', 'Country_Population']].isnull().any(axis=1), :]

Unnamed: 0,Country,Year,Country_Population


#### Population Counts by City and Year
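
The `city_pop_df` preview earlier shows separate rows for `Both Sexes`, `Male`, and `Female`, where the 2019 male and female values sum to the combined one (5606 + 6105 = 11711). A minimal sketch of one possible first cleaning step, keeping only the combined rows so that each city and year is counted once. The function name and the renamed columns are illustrative assumptions.

```python
import pandas as pd

# Rows copied from the city_pop_df preview earlier in the notebook (a subset of columns).
city_rows = pd.DataFrame({
    "Country or Area": ["Åland Islands"] * 5,
    "Year": [2019, 2019, 2019, 2018, 2018],
    "Sex": ["Both Sexes", "Male", "Female", "Both Sexes", "Male"],
    "City": ["MARIEHAMN"] * 5,
    "City type": ["City proper"] * 5,
    "Value": [11711.0, 5606.0, 6105.0, 11709.0, 5620.5],
})

def total_city_population(df):
    """Keep one total per city and year by selecting only the 'Both Sexes' rows."""
    totals = df[df["Sex"] == "Both Sexes"]
    return (
        totals[["Country or Area", "City", "Year", "Value"]]
        .rename(columns={"Country or Area": "Country", "Value": "City_Population"})
        .reset_index(drop=True)
    )

city_totals = total_city_population(city_rows)
print(city_totals)
```

If the full file contains more than one `City type` per city, a similar filter on that column would be needed to keep the key of city and year unique.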

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
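
Two of the checks listed above, a unique-key constraint and a source/count comparison, can be sketched in pandas as follows. The function names and the sample table are made up for illustration. The real checks would run against the warehouse tables.

```python
import pandas as pd

def check_unique_key(df, key_columns):
    """Return True when no two rows share the same key."""
    return not df.duplicated(subset=key_columns).any()

def check_row_count(source_count, loaded_count):
    """Return True when every source row arrived in the target."""
    return source_count == loaded_count

# Made-up table standing in for a loaded country population table.
loaded = pd.DataFrame({
    "Country": ["Aruba", "Aruba", "Chile"],
    "Year": [1960, 1961, 1960],
    "Country_Population": [100, 110, 200],
})

results = {
    "unique_key": check_unique_key(loaded, ["Country", "Year"]),
    "row_count": check_row_count(source_count=3, loaded_count=len(loaded)),
}
failed = [name for name, passed in results.items() if not passed]
print(failed)
```

An empty `failed` list means both checks passed. In a pipeline, a non-empty list would typically raise an error so that the run stops before reports are built on incomplete data.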
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.