# Refugees in the Age of Global Warming
### Data Engineering Capstone Project

#### Project Summary
This project combines refugee counts, population counts, and temperature readings from around the world so that refugee and population trends can be analyzed against temperature changes over time.

The project proceeds through the following steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [2]:
# Do all imports and installs here
import pandas as pd
import os
from datetime import datetime
import psycopg2
import configparser
from sqlalchemy.engine import create_engine

config = configparser.ConfigParser()

In [3]:
# Read in parameters needed for Redshift cluster
# config = configparser.ConfigParser()
# config.read('dwh.cfg')

# # Connect to Redshift cluster and gets cursor to it
# conn = psycopg2.connect("""host={} 
#                            dbname={} 
#                            user={} 
#                            password={} 
#                            port={}""".format(*config['CLUSTER'].values()))
# cur = conn.cursor()

### Step 1: Scope the Project and Gather Data

#### Scope 
What does your end solution look like? What tools did you use?

The plan is to build a data warehouse for analytical processes, so analysts can design recurring and ad hoc reports over time using SQL. There is a strong emphasis on ensuring the warehouse is easy to interpret, performant, and quality-assured.
 
#### Data Sources and Content

There are four source datasets:
 1. City_temperature.csv
     - Summary: average daily temperature for all major cities in the world from 1995 - 2020
     - Source: University of Dayton - separate txt files available for each city [here](https://academic.udayton.edu/kissock/http/Weather/default.htm). The data is available for research and non-commercial purposes only. Refer to [this page](https://academic.udayton.edu/kissock/http/Weather/default.htm) for license.
     - Secondary source: SRK via Kaggle - [link](https://www.kaggle.com/sudalairajkumar/daily-temperature-of-major-cities)
 2. Country_population_total_long.csv
     - Summary: annual population counts by country from 1960 - 2017
     - Source: The World Bank - [link](https://data.worldbank.org/indicator/SP.POP.TOTL)
     - Secondary source: Devakumar kp via Kaggle - [link](https://www.kaggle.com/imdevskp/world-population-19602018?select=population_total_long.csv)
 3. UNdata_City_Population_20210315.csv
     - Summary: annual population counts by city from 1970 - 2020 (contains gaps in the 1970s)
     - Source: UN Data - [link](https://data.un.org/Data.aspx?d=POP&f=tableCode%3A240)
 4. UNdata_Refugees_20210217.csv
     - Summary: annual refugee counts from 1975 - 2016 by country of residence and country of origin
     - Source: UN Data - [link](http://data.un.org/Data.aspx?d=UNHCR&f=indID%3aType-Ref)

### Read in Each Dataset

#### Temperature Data

In [3]:
temp_df = pd.read_csv('Data/temperature_data/city_temperature.csv', engine = 'python')
temp_df.head()

Unnamed: 0,Region,Country,State,City,Month,Day,Year,AvgTemperature
0,Africa,Algeria,,Algiers,1,1,1995,64.2
1,Africa,Algeria,,Algiers,1,2,1995,49.4
2,Africa,Algeria,,Algiers,1,3,1995,48.8
3,Africa,Algeria,,Algiers,1,4,1995,46.4
4,Africa,Algeria,,Algiers,1,5,1995,47.9


#### Population Counts by Country and Year

In [4]:
country_pop_df = pd.read_csv('Data/country_population_data/country_population_total_long.csv', engine = 'python')
country_pop_df.head()

Unnamed: 0,Country Name,Year,Count
0,Aruba,1960,54211
1,Afghanistan,1960,8996973
2,Angola,1960,5454933
3,Albania,1960,1608800
4,Andorra,1960,13411


#### Population Counts by City and Year

In [5]:
city_pop_df = pd.read_csv('Data/city_population_data/UNdata_City_Population_20210315.csv', engine = 'python')
city_pop_df.head()

Unnamed: 0,Country or Area,Year,Area,Sex,City,City type,Record Type,Reliability,Source Year,Value,Value Footnotes
0,Åland Islands,2019,Total,Both Sexes,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2020,11711.0,1
1,Åland Islands,2019,Total,Male,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2020,5606.0,1
2,Åland Islands,2019,Total,Female,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2020,6105.0,1
3,Åland Islands,2018,Total,Both Sexes,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2019,11709.0,1
4,Åland Islands,2018,Total,Male,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2019,5620.5,1


#### Refugee Counts by Year, Country of Residence, and Country of Origin

In [6]:
refugee_df = pd.read_csv('Data/refugee_data/UNdata_Refugees_20210317.csv', engine = 'python')
refugee_df.head()

Unnamed: 0,Country or territory of asylum or residence,Country or territory of origin,Year,Refugees,Refugees assisted by UNHCR,Total refugees and people in refugee-like situations,Total refugees and people in refugee-like situations assisted by UNHCR
0,Afghanistan,Iraq,2016,1.0,1.0,1.0,1.0
1,Afghanistan,Islamic Rep. of Iran,2016,33.0,33.0,33.0,33.0
2,Afghanistan,Pakistan,2016,59737.0,59737.0,59737.0,59737.0
3,Albania,China,2016,11.0,11.0,11.0,11.0
4,Albania,Dem. Rep. of the Congo,2016,3.0,3.0,3.0,3.0


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

#### Temperature Data

In [7]:
# Add a date field
temp_df['Date'] = temp_df['Month'].map(str) + '/' + \
                  temp_df['Day'].map(str) + '/' +  \
                  temp_df['Year'].map(str)

# Summary stats by region
temp_df[['Region', 'Date', 'AvgTemperature']] \
    .groupby(['Region']) \
    .agg(['min', 'max', 'nunique'])
# Note that there are seven regions

Unnamed: 0_level_0,Date,Date,Date,AvgTemperature,AvgTemperature,AvgTemperature
Unnamed: 0_level_1,min,max,nunique,min,max,nunique
Region,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Africa,1/1/1995,9/9/2019,9297,-99.0,102.8,654
Asia,1/1/1995,9/9/2019,9265,-99.0,103.7,1334
Australia/South Pacific,1/1/1995,9/9/2019,9265,-99.0,96.8,600
Europe,1/1/1995,9/9/2019,9356,-99.0,102.5,1079
Middle East,1/1/1995,9/9/2019,9265,-99.0,110.0,996
North America,1/1/1995,9/9/2019,9295,-99.0,107.7,1474
South/Central America & Carribean,1/1/1995,9/9/2019,9266,-99.0,97.4,609
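
Because `Date` is built as a string, `min` and `max` in the summary above compare text rather than calendar dates, which is why every region reports `9/9/2019` as its latest date. A minimal sketch of one way to get true date ordering, assuming the `Month`, `Day`, and `Year` columns hold valid values (the `sample` frame and its rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative rows only; the real frame is temp_df
sample = pd.DataFrame({'Month': [1, 9, 12],
                       'Day': [1, 9, 31],
                       'Year': [1995, 2019, 2020]})

# String date, built the same way as in the notebook
sample['DateStr'] = sample['Month'].map(str) + '/' + \
                    sample['Day'].map(str) + '/' + \
                    sample['Year'].map(str)

# Datetime date, assembled from the year/month/day columns
parts = sample[['Year', 'Month', 'Day']].rename(columns=str.lower)
sample['Date'] = pd.to_datetime(parts)

string_max = sample['DateStr'].max()  # text comparison
true_max = sample['Date'].max()       # calendar comparison
print(string_max, true_max)
```

Invalid values such as a day of 0 would need to be filtered out first, as the cleaning cell further down does.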


In [8]:
# Check how consistently cities are populated over time
temp_df.groupby(['Year'])[['City']].nunique()
# Very consistent going back to 1995, apart from stray records with years 200 and 201

Unnamed: 0_level_0,City
Year,Unnamed: 1_level_1
200,2
201,7
1995,319
1996,319
1997,320
1998,321
1999,321
2000,321
2001,321
2002,321


In [9]:
# Note: there are records with a day value of 0. Exclude.
temp_df = temp_df[temp_df.Day > 0]

# Note: there are records with a year value of 200 and 201. Exclude.
temp_df = temp_df[temp_df.Year >= 1995]

# Note: missing temps are represented as -99. Exclude.
temp_df = temp_df[temp_df.AvgTemperature != -99]

# Drop duplicates
temp_df = temp_df.drop_duplicates()

# Verify there are no duplicate records at the most granular level, which is date and city
# 94 duplicates remain. Need to research further.
duplicate = temp_df[temp_df.duplicated(['State', 'City', 'Date'], keep=False)] \
                        .sort_values(by = ['City', 'Date'])
duplicate.head(20)

Unnamed: 0,Region,Country,State,City,Month,Day,Year,AvgTemperature,Date
744576,Europe,Germany,,Hamburg,1,25,2011,38.5,1/25/2011
744970,Europe,Germany,,Hamburg,1,25,2011,38.6,1/25/2011
743935,Europe,Germany,,Hamburg,5,25,2010,51.7,5/25/2010
744330,Europe,Germany,,Hamburg,5,25,2010,52.0,5/25/2010
743962,Europe,Germany,,Hamburg,6,21,2010,57.4,6/21/2010
744357,Europe,Germany,,Hamburg,6,21,2010,58.6,6/21/2010
754882,Europe,Germany,,Munich,1,15,2019,35.0,1/15/2019
755249,Europe,Germany,,Munich,1,15,2019,35.1,1/15/2019
754151,Europe,Germany,,Munich,1,16,2018,41.5,1/16/2018
754518,Europe,Germany,,Munich,1,16,2018,40.9,1/16/2018
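
The remaining duplicates above share a city and date but differ slightly in temperature. One possible way to resolve them, pending further research, is to average the readings per city and date. The sketch below assumes averaging is acceptable for the analysis; the `sample` frame reuses a few rows from the output above.

```python
import pandas as pd

# A few rows from the duplicate listing, plus one non-duplicated row
sample = pd.DataFrame({
    'Country': ['Germany', 'Germany', 'Germany'],
    'State': [None, None, None],
    'City': ['Hamburg', 'Hamburg', 'Munich'],
    'Date': ['1/25/2011', '1/25/2011', '1/15/2019'],
    'AvgTemperature': [38.5, 38.6, 35.0],
})

keys = ['Country', 'State', 'City', 'Date']

# Fill missing states so groupby does not drop those rows,
# then keep one averaged reading per key
deduped = (sample.assign(State=sample['State'].fillna(''))
                 .groupby(keys, as_index=False)['AvgTemperature']
                 .mean())
print(deduped)
```

Keeping the first or the last reading instead would be an equally simple alternative if averaging is not appropriate.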


In [11]:
# Summary stats by city or country for a region of interest.
# Note: there are some time gaps in cities.
def summary_stats(GroupBy, Region):
    return temp_df.loc[temp_df.Region == Region, [GroupBy, 'Date', 'AvgTemperature']] \
               .groupby([GroupBy]) \
               .agg(['min', 'max', 'nunique'])

# GroupBy input options: City, Country, State
# Region input options: Africa, Asia, Australia/South Pacific, Europe, Middle East, North America, South/Central America & Carribean
summary_stats(GroupBy = 'City', Region = 'South/Central America & Carribean')

Unnamed: 0_level_0,Date,Date,Date,AvgTemperature,AvgTemperature,AvgTemperature
Unnamed: 0_level_1,min,max,nunique,min,max,nunique
City,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Belize City,1/1/1996,9/9/2019,8867,64.6,92.9,233
Bogota,1/1/1995,9/9/2019,9196,46.7,66.7,154
Brasilia,1/1/1995,9/9/2019,9206,56.1,87.7,244
Bridgetown,1/1/1995,9/9/2016,8349,74.2,88.0,120
Buenos Aires,1/1/1995,9/9/2019,9231,35.3,90.9,507
Caracas,1/1/1995,9/9/2019,9121,71.5,89.9,162
Georgetown,1/1/1999,9/9/2006,2136,67.0,90.6,135
Guatemala City,1/1/1997,9/9/2019,8710,51.2,79.8,220
Guayaquil,1/1/1995,9/9/2019,9036,67.2,90.0,188
Hamilton,1/1/1995,9/9/2009,5558,51.1,85.4,307


In [12]:
# Verify location content naming conventions are consistent to avoid separating temperatures that belong together
def unique_values(Region):
    # Boolean indexing keeps only matching rows; .where() would leave an all-NaN row behind
    return temp_df.loc[temp_df.Region == Region, ['Region', 'Country', 'State', 'City']] \
               .drop_duplicates() \
               .sort_values(by = ['Country', 'State', 'City'])

# Region input options: Africa, Asia, Australia/South Pacific, Europe, Middle East, North America, South/Central America & Carribean
unique_values(Region = 'South/Central America & Carribean')

Unnamed: 0,Region,Country,State,City
1231460,South/Central America & Carribean,Argentina,,Buenos Aires
1240726,South/Central America & Carribean,Bahamas,,Nassau
1274233,South/Central America & Carribean,Barbados,,Bridgetown
1255704,South/Central America & Carribean,Belize,,Belize City
1249991,South/Central America & Carribean,Bermuda,,Hamilton
1264967,South/Central America & Carribean,Bolivia,,La Paz
1282775,South/Central America & Carribean,Brazil,,Brasilia
1292041,South/Central America & Carribean,Brazil,,Rio de Janeiro
1301306,South/Central America & Carribean,Brazil,,Sao Paulo
1310572,South/Central America & Carribean,Colombia,,Bogota


In [13]:
# Check if columns contain null values. Only state has nulls, which is to be expected.
temp_df.loc[pd.isnull(temp_df[['Region', 'Country', 'City', 'Date', 'AvgTemperature']]).any(axis=1),:]

Unnamed: 0,Region,Country,State,City,Month,Day,Year,AvgTemperature,Date


#### Population Counts by Country and Year

In [14]:
# Rename columns to be more descriptive
country_pop_df.columns = ['Country', 'Year', 'Country_Population']

# Check how consistent information is populated by country over time
country_pop_df.groupby(['Year'])[['Country']].nunique()
# Very consistent going back to 1960

Unnamed: 0_level_0,Country
Year,Unnamed: 1_level_1
1960,216
1961,216
1962,216
1963,216
1964,216
1965,216
1966,216
1967,216
1968,216
1969,216


In [16]:
# Summary stats by country. 
country_pop_df.where(country_pop_df.Country > 'C') \
    .groupby(['Country']) \
    .agg(['min', 'max', 'nunique'])
# country_pop_df.agg(['min', 'max', 'nunique'])
# country_pop_df.where(country_pop_df.Country == 'China')

Unnamed: 0_level_0,Year,Year,Year,Country_Population,Country_Population,Country_Population
Unnamed: 0_level_1,min,max,nunique,min,max,nunique
Country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Cabo Verde,1960.0,2017.0,58,201765.0,5.374970e+05,58
Cambodia,1960.0,2017.0,58,5722370.0,1.600941e+07,58
Cameroon,1960.0,2017.0,58,5176918.0,2.456604e+07,58
Canada,1960.0,2017.0,58,17909009.0,3.654027e+07,58
Caribbean small states,1960.0,2017.0,58,4194710.0,7.314990e+06,58
Cayman Islands,1960.0,2017.0,58,7865.0,6.338200e+04,58
Central African Republic,1960.0,2017.0,58,1501668.0,4.596028e+06,58
Chad,1960.0,2017.0,58,3001609.0,1.501677e+07,58
Channel Islands,1960.0,2017.0,58,109420.0,1.686650e+05,58
Chile,1960.0,2017.0,58,8132990.0,1.847044e+07,58


In [17]:
# Verify there are no duplicate records at the most granular level, which is country and year
# No duplicates.
duplicate = country_pop_df[country_pop_df.duplicated(['Country', 'Year'], keep=False)] \
                .sort_values(by = ['Country', 'Year'])
duplicate.head()

Unnamed: 0,Country,Year,Country_Population


In [18]:
# Review country names for inconsistencies. May have to adjust names to align with other sources.
country_pop_df[['Country']].drop_duplicates() \
    .sort_values(by = ['Country'])

Unnamed: 0,Country
1,Afghanistan
3,Albania
56,Algeria
8,American Samoa
4,Andorra
2,Angola
9,Antigua and Barbuda
6,Argentina
7,Armenia
0,Aruba


In [19]:
# Check if columns contain null values. No null values detected.
country_pop_df.loc[pd.isnull(country_pop_df[['Country', 'Year', 'Country_Population']]).any(axis=1),:]

Unnamed: 0,Country,Year,Country_Population


#### Population Counts by City and Year

In [20]:
# Filter on sex to only include both sexes since other sources don't include this breakdown.
# Note: I verified there are 4,751 distinct cities and every city has a 'Both Sexes' row
city_pop_df[['City']].nunique()
city_pop_df[['Sex','City']].groupby(['Sex']).nunique()
city_pop_df = city_pop_df[city_pop_df.Sex == 'Both Sexes']

# By removing the sex breakdown, the field can be dropped
city_pop_df = city_pop_df.drop(columns=['Sex'])
# city_pop_df.head()

# Check how many distinct values are in the Area column
city_pop_df[['Area']].drop_duplicates()

# Remove the Area column since it contains only one value
city_pop_df = city_pop_df.drop(columns=['Area'])

# Rename columns for naming consistencies
city_pop_df.columns = ['Country_or_Area',
                       'Year',
                       'City',
                       'City_Type',
                       'Record_Type',
                       'Reliability',
                       'Source_Year',
                       'City_Population',
                       'Population_Notes']

# Summary stats by country/area. There are gaps in time.
city_pop_df[['Country_or_Area', 'Year', 'City_Population']] \
    .groupby(['Country_or_Area']) \
    .agg(['min', 'max', 'nunique'])

Unnamed: 0_level_0,Year,Year,Year,City_Population,City_Population,City_Population
Unnamed: 0_level_1,min,max,nunique,min,max,nunique
Country_or_Area,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Albania,2003,2011,2,113249.0,4.184950e+05,3
Algeria,1998,2008,2,102151.0,2.712944e+06,78
American Samoa,2000,2010,2,3656.0,4.278000e+03,2
Andorra,2003,2011,8,21245.0,2.477900e+04,8
Anguilla,2001,2011,2,2812.0,4.904000e+03,2
Antigua and Barbuda,1991,1991,1,22342.0,2.234200e+04,1
Argentina,1991,2020,8,109882.0,1.541673e+07,258
Armenia,2001,2018,11,78848.0,1.116648e+06,32
Aruba,1991,2010,2,20045.0,2.829500e+04,2
Australia,2001,2018,17,417.0,5.230330e+06,451


In [21]:
# Drop duplicates
city_pop_df = city_pop_df.drop_duplicates()

# Verify there are no duplicate records at the most granular level
# There are duplicates. Need to prioritize most recent source year, preference on record type (census over estimate), and maybe reliability.
duplicate = city_pop_df[city_pop_df.duplicated(['City', 'Year', 'City_Type'], keep=False)] \
                .sort_values(by = ['City', 'Year'])
duplicate

Unnamed: 0,Country_or_Area,Year,City,City_Type,Record_Type,Reliability,Source_Year,City_Population,Population_Notes
53989,Spain,2001,A Coruña,City proper,Census - de facto - complete tabulation,"Final figure, complete",2009,236379.0,
53990,Spain,2001,A Coruña,City proper,Estimate - de jure,"Final figure, complete",2002,235847.0,170
52883,Spain,2011,A Coruña,City proper,Census - de jure - complete tabulation,"Final figure, complete",2014,245055.0,74
52884,Spain,2011,A Coruña,City proper,Estimate - de jure,"Final figure, complete",2013,246087.0,74
32822,Kazakhstan,2012,ASTANA,City proper,Estimate - de facto,"Final figure, complete",2018,760506.0,18118
32823,Kazakhstan,2012,ASTANA,City proper,Estimate - de facto,"Final figure, complete",2015,778198.0,
32630,Kazakhstan,2013,ASTANA,City proper,Estimate - de facto,"Final figure, complete",2018,796282.0,18118
32631,Kazakhstan,2013,ASTANA,City proper,Estimate - de facto,"Final figure, complete",2015,814435.0,
18183,Germany,2011,Aachen,City proper,Census - de jure - complete tabulation,"Final figure, complete",2014,236420.0,
18184,Germany,2011,Aachen,City proper,Estimate - de jure,"Final figure, complete",2011,258664.0,
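
The prioritization noted above could be sketched as follows. This assumes census records take precedence over estimates and that the most recent `Source_Year` breaks remaining ties; that ordering is an assumption, and reliability is left out. The `Is_Census` helper column is hypothetical, and the `sample` rows are copied from the duplicate listing.

```python
import pandas as pd

# Rows copied from the duplicate listing above
sample = pd.DataFrame({
    'Country_or_Area': ['Spain', 'Spain', 'Kazakhstan', 'Kazakhstan'],
    'City': ['A Coruña', 'A Coruña', 'ASTANA', 'ASTANA'],
    'Year': [2001, 2001, 2012, 2012],
    'City_Type': ['City proper'] * 4,
    'Record_Type': ['Census - de facto - complete tabulation',
                    'Estimate - de jure',
                    'Estimate - de facto',
                    'Estimate - de facto'],
    'Source_Year': [2009, 2002, 2018, 2015],
    'City_Population': [236379.0, 235847.0, 760506.0, 778198.0],
})

keys = ['Country_or_Area', 'City', 'Year', 'City_Type']

# Hypothetical helper flag: True for census records, False for estimates
sample['Is_Census'] = sample['Record_Type'].str.startswith('Census')

# Sort so the preferred record comes first within each key, then keep it
best = (sample.sort_values(['Is_Census', 'Source_Year'], ascending=False)
              .drop_duplicates(subset=keys, keep='first')
              .drop(columns='Is_Census'))
print(best)
```

Including `Country_or_Area` in the key avoids merging cities that share a name across countries.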


In [22]:
# Check how consistent information is populated by city over time
city_pop_df.groupby(['Year'])[['City']].nunique()
# By year 2000, data collection is much more complete across cities

Unnamed: 0_level_0,City
Year,Unnamed: 1_level_1
1970,2
1976,9
1980,3
1981,1
1983,9
1984,4
1985,2
1986,2
1987,28
1988,16


In [23]:
# Review country and city names for inconsistencies. 
city_pop_df[['Country_or_Area', 'City']] \
    .drop_duplicates() \
    .sort_values(by = ['Country_or_Area'])
# Will have to adjust names to align with other sources.

Unnamed: 0,Country_or_Area,City
42,Albania,Durrës
43,Albania,TIRANA
86,Algeria,Souq Ahras
78,Algeria,Mouaskar (Mascara)
79,Algeria,M'Sila
81,Algeria,Oum El Bouaghi
82,Algeria,Qacentina (Constantine)
83,Algeria,Saïda
84,Algeria,Sidi-bel-Abbès
85,Algeria,Skikda
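
City names in this source mix upper case, accents, and alternative spellings, so they are unlikely to match the temperature data directly. A minimal sketch of one normalization step, assuming that case and accents are not meaningful for matching (`normalize_name` is a hypothetical helper, not part of the notebook):

```python
import unicodedata

def normalize_name(name):
    """Lower-case a place name and strip accents so sources can be compared."""
    decomposed = unicodedata.normalize('NFKD', name)
    without_accents = ''.join(ch for ch in decomposed
                              if not unicodedata.combining(ch))
    return without_accents.casefold().strip()

for city in ['Durrës', 'TIRANA', 'Sidi-bel-Abbès']:
    print(city, '->', normalize_name(city))
```

Names with parenthetical alternatives, such as `Qacentina (Constantine)`, would still need a manual mapping.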


In [24]:
# Review record types and reliability types to become familiar with them.
city_pop_df[['Record_Type', 'Reliability']] \
    .drop_duplicates() \
    .sort_values(by = ['Record_Type'])
# May have to prioritize when there are multiple sources for the same year and city

Unnamed: 0,Record_Type,Reliability
157,Census - de facto - complete tabulation,"Final figure, complete"
2678,Census - de facto - complete tabulation,Provisional figure
39,Census - de jure - complete tabulation,"Final figure, complete"
34434,Census - de jure - complete tabulation,Provisional figure
15932,Census - de jure - sample tabulation,"Final figure, complete"
48,Estimate - de facto,"Final figure, complete"
2567,Estimate - de facto,Provisional figure
0,Estimate - de jure,"Final figure, complete"
764,Estimate - de jure,Provisional figure
34484,Estimate - de jure,Other estimate


In [25]:
# Check if columns contain null values. 
city_pop_df.loc[pd.isnull(city_pop_df[['Country_or_Area',
                                       'Year',
                                       'City',
                                       'City_Type',
                                       'Record_Type', 
                                       'Reliability',
                                       'Source_Year', 
                                       'City_Population']]).any(axis=1),:]
# No unexpected nulls.

Unnamed: 0,Country_or_Area,Year,City,City_Type,Record_Type,Reliability,Source_Year,City_Population,Population_Notes


#### Refugee Counts by Year, Country of Residence, and Country of Origin

In [26]:
# Rename columns for naming consistencies
refugee_df.columns = ['Asylum_Country_or_Territory',
                      'Origin_Country_or_Territory',
                      'Year',
                      'Refugees',
                      'Refugees_Assisted_by_UNHCR',
                      'Refugee-like_Population',
                      'Refugee-like_Population_Assisted_by_UNHCR']

# Drop duplicates
refugee_df = refugee_df.drop_duplicates()

# Get an understanding of the differences between population counts based on summary stats
def pop_stat_comparison(pop1, pop2):
    return refugee_df[['Asylum_Country_or_Territory', pop1, pop2]] \
               .groupby(['Asylum_Country_or_Territory']) \
               .agg(['min', 'max', 'nunique'])

pop_stat_comparison('Refugees', 'Refugees_Assisted_by_UNHCR')
# pop_stat_comparison('Refugee-like_Population', 'Refugee-like_Population_Assisted_by_UNHCR')
# pop_stat_comparison('Refugee-like_Population', 'Refugees')

# Main takeaways:
# UNHCR related counts are populated substantially less
# Refugees vs. refugee-like: they differ slightly

Unnamed: 0_level_0,Refugees,Refugees,Refugees,Refugees_Assisted_by_UNHCR,Refugees_Assisted_by_UNHCR,Refugees_Assisted_by_UNHCR
Unnamed: 0_level_1,min,max,nunique,min,max,nunique
Asylum_Country_or_Territory,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Afghanistan,1.0,280229.0,29,1.0,280229.0,16
Albania,1.0,22324.0,37,1.0,3918.0,27
Algeria,1.0,165000.0,82,1.0,155430.0,33
Angola,1.0,225000.0,156,1.0,13007.0,59
Anguilla,1.0,1.0,1,1.0,1.0,1
Antigua and Barbuda,4.0,15.0,2,4.0,15.0,2
Argentina,1.0,38320.0,170,1.0,365.0,57
Armenia,1.0,328000.0,69,1.0,50001.0,50
Aruba,1.0,1.0,1,1.0,1.0,1
Australia,1.0,317000.0,588,,,0
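
The takeaway that UNHCR-related counts are populated less often can be quantified as the share of non-null values per column. A small sketch, using made-up rows in place of `refugee_df`:

```python
import pandas as pd

# Illustrative values only; the real frame is refugee_df
sample = pd.DataFrame({
    'Refugees': [1.0, 33.0, 59737.0, None],
    'Refugees_Assisted_by_UNHCR': [1.0, None, None, None],
})

# Fraction of rows with a value present, per column
fill_rate = sample.notna().mean()
print(fill_rate)
```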


In [28]:
# Summary stats by country of asylum.
refugee_df[['Asylum_Country_or_Territory', 'Year']] \
    .groupby(['Asylum_Country_or_Territory']) \
    .agg(['min', 'max', 'nunique'])
# There are gaps in time.

Unnamed: 0_level_0,Year,Year,Year
Unnamed: 0_level_1,min,max,nunique
Asylum_Country_or_Territory,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Afghanistan,1990,2016,24
Albania,1992,2016,25
Algeria,1975,2016,42
Angola,1976,2016,41
Anguilla,2015,2016,2
Antigua and Barbuda,2015,2016,2
Argentina,1975,2016,42
Armenia,1992,2016,25
Aruba,2013,2016,3
Australia,1978,2016,39
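
The gaps in time show up wherever `nunique` is smaller than the span between the first and last year (Afghanistan, for example, has 24 distinct years between 1990 and 2016). A sketch of how the missing years could be listed per country; `missing_years` is a hypothetical helper and the sample rows are illustrative:

```python
import pandas as pd

def missing_years(df, group_col, year_col='Year'):
    """Map each group to the years missing between its first and last year."""
    gaps = {}
    for name, years in df.groupby(group_col)[year_col]:
        present = {int(y) for y in years}
        expected = set(range(min(present), max(present) + 1))
        missing = sorted(expected - present)
        if missing:
            gaps[name] = missing
    return gaps

# Illustrative rows only
sample = pd.DataFrame({
    'Asylum_Country_or_Territory': ['Afghanistan'] * 3 + ['Albania'] * 2,
    'Year': [1990, 1991, 1994, 1992, 1993],
})
gaps = missing_years(sample, 'Asylum_Country_or_Territory')
print(gaps)
```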


In [29]:
# Verify there are no duplicate records at the most granular level
duplicate = refugee_df[refugee_df.duplicated(['Asylum_Country_or_Territory',
                                              'Origin_Country_or_Territory',
                                              'Year'], keep=False)] \
                           .sort_values(by = ['Asylum_Country_or_Territory',
                                              'Origin_Country_or_Territory',
                                              'Year'])
duplicate
# No duplicates

Unnamed: 0,Asylum_Country_or_Territory,Origin_Country_or_Territory,Year,Refugees,Refugees_Assisted_by_UNHCR,Refugee-like_Population,Refugee-like_Population_Assisted_by_UNHCR


In [30]:
# Check how consistently asylum countries are populated over time
refugee_df.groupby(['Year'])[['Asylum_Country_or_Territory']].nunique()
# Seems consistent by year, especially considering new countries forming over time

Unnamed: 0_level_0,Asylum_Country_or_Territory
Year,Unnamed: 1_level_1
1975,50
1976,53
1977,72
1978,82
1979,88
1980,90
1981,92
1982,94
1983,97
1984,96


In [31]:
# Review asylum and origin country names for inconsistencies. 
refugee_df[['Asylum_Country_or_Territory', 'Origin_Country_or_Territory']] \
    .drop_duplicates() \
    .sort_values(by = ['Asylum_Country_or_Territory'])
# May have to adjust names to align with other sources.

Unnamed: 0,Asylum_Country_or_Territory,Origin_Country_or_Territory
0,Afghanistan,Iraq
35389,Afghanistan,Eritrea
10777,Afghanistan,State of Palestine
83439,Afghanistan,Tajikistan
21052,Afghanistan,Syrian Arab Rep.
90464,Afghanistan,Various
2,Afghanistan,Pakistan
1,Afghanistan,Islamic Rep. of Iran
83444,Albania,Sri Lanka
83440,Albania,Bosnia and Herzegovina
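
To see which country names need adjusting, the names in one source can be compared against another with a left merge and the `indicator` flag. The sketch below uses two small illustrative name lists rather than the full frames:

```python
import pandas as pd

# Illustrative name lists; the real ones come from country_pop_df and refugee_df
pop_names = pd.DataFrame({'Country': ['Afghanistan', 'Albania', 'Algeria']})
ref_names = pd.DataFrame({'Country': ['Afghanistan', 'Pakistan',
                                      'Islamic Rep. of Iran']})

# '_merge' marks rows found only in the refugee list as 'left_only'
compared = ref_names.merge(pop_names, on='Country', how='left', indicator=True)
unmatched = sorted(compared.loc[compared['_merge'] == 'left_only', 'Country'])
print(unmatched)
```

The unmatched list then becomes the starting point for a manual name mapping between sources.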


In [33]:
# Check if columns contain null values. 
refugee_df.loc[pd.isnull(refugee_df[['Asylum_Country_or_Territory',
                                     'Origin_Country_or_Territory',
                                     'Year',
                                     'Refugees',
                                     'Refugee-like_Population']]).any(axis=1),:]
# No unexpected nulls. Not all refugee counts populate, which is okay.

Unnamed: 0,Asylum_Country_or_Territory,Origin_Country_or_Territory,Year,Refugees,Refugees_Assisted_by_UNHCR,Refugee-like_Population,Refugee-like_Population_Assisted_by_UNHCR
2453,Israel,Dem. Rep. of the Congo,2016,,,208.0,50.0
2941,Malaysia,Kenya,2016,,,1.0,1.0
2953,Malaysia,Rep. of Moldova,2016,,,1.0,1.0
2966,Malaysia,United States,2016,,,1.0,1.0
2967,Malaysia,Viet Nam,2016,,,1.0,1.0
4043,Saudi Arabia,Liberia,2016,,,7.0,7.0
4091,Serbia (and Kosovo: S/RES/1244 (1999)),Various,2016,,,1150.0,1150.0
7889,Israel,Dem. Rep. of the Congo,2015,,,208.0,50.0
8361,Malaysia,Kenya,2015,,,1.0,1.0
9441,Saudi Arabia,Liberia,2015,,,7.0,7.0


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
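
Two of the checks listed above, a unique-key check and a row-count check, could be sketched in pandas as follows. The function names are hypothetical, and the sample frame contains a deliberately repeated row for illustration:

```python
import pandas as pd

def count_key_violations(df, key_columns):
    """Return the number of rows that repeat an earlier row's key."""
    return int(df.duplicated(subset=key_columns).sum())

def has_minimum_rows(df, minimum=1):
    """Return True when the frame holds at least `minimum` rows."""
    return len(df) >= minimum

# Illustrative frame; the second Aruba row is an intentional duplicate
sample = pd.DataFrame({
    'Country': ['Aruba', 'Aruba', 'Angola'],
    'Year': [1960, 1960, 1960],
    'Country_Population': [54211, 54211, 5454933],
})

violations = count_key_violations(sample, ['Country', 'Year'])
enough_rows = has_minimum_rows(sample, minimum=1)
print(violations, enough_rows)
```

In a pipeline, a non-zero violation count or a failed row-count check would typically raise an error so the load stops before bad data reaches the warehouse.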
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.