# Analysis

Template for Jupyter notebooks running Python.

Version 0.1.0 \| First Created July 12, 2023 \| Updated August 01, 2023

## Jupyter Notebook

This is a Jupyter Notebook document. For more details on using a Jupyter Notebook, see <https://docs.jupyter.org/en/latest/>.



# Reproduction of Hurricane Harvey Flooding GEOG120 Lab Problem

### Authors

- Colman Bashore\*, cbashore@middlebury.edu, @colman-bashore, Middlebury College

\* Corresponding author and creator



### Abstract

Write a brief abstract about your research project.

If the project is a reproduction or replication study, include a declaration of the study type with a full reference to the original study.
For example:

This study is a *replication* of:

> citation to prior study

A graphical abstract of the study could also be included as an image here.



### Study metadata

- `Key words`: Comma-separated list of keywords (tags) for searchability. Geographers often use one or two keywords each for: theory, geographic context, and methods.
- `Subject`: select from the [BePress Taxonomy](http://digitalcommons.bepress.com/cgi/viewcontent.cgi?article=1008&context=reference)
- `Date created`: date when project was started
- `Date modified`: date of most recent revision
- `Spatial Coverage`: Specify the geographic extent of your study. This may be a place name and link to a feature in a gazetteer like GeoNames or OpenStreetMap, or a well known text (WKT) representation of a bounding box.
- `Spatial Resolution`: Specify the spatial resolution as a scale factor, a description of the level of detail of each unit of observation (including the administrative level of administrative areas), and/or the cell size of a raster grid
- `Spatial Reference System`: Specify the geographic or projected coordinate system for the study, e.g. EPSG:4326
- `Temporal Coverage`: Specify the temporal extent of your study---i.e. the range of time represented by the data observations.
- `Temporal Resolution`: Specify the temporal resolution of your study---i.e. the duration of time that each observation represents or the revisit period for repeated observations
- `Funding Name`: name of funding for the project
- `Funding Title`: title of project grant
- `Award info URI`: web address for award information
- `Award number`: award number
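
For the `Spatial Coverage` item, a bounding box can be written as a WKT polygon by listing its corners and repeating the first corner to close the ring. A minimal sketch in plain Python; the function name `bbox_to_wkt` and the coordinates are hypothetical placeholders, not the extent of this study:

```python
def bbox_to_wkt(min_x, min_y, max_x, max_y):
    """Return a WKT POLYGON for a bounding box (x = longitude, y = latitude)."""
    if min_x > max_x or min_y > max_y:
        raise ValueError("minimum coordinates must not exceed maximum coordinates")
    corners = [
        (min_x, min_y),
        (max_x, min_y),
        (max_x, max_y),
        (min_x, max_y),
        (min_x, min_y),  # repeat the first corner to close the ring
    ]
    ring = ", ".join(f"{x} {y}" for x, y in corners)
    return f"POLYGON (({ring}))"


# Placeholder coordinates in EPSG:4326 (assumed for illustration only)
example_wkt = bbox_to_wkt(-96.0, 29.0, -95.0, 30.0)
print(example_wkt)
```

The resulting string can be pasted directly into the `Spatial Coverage` entry alongside the coordinate system named under `Spatial Reference System`.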

#### Original study spatio-temporal metadata

- `Spatial Coverage`: extent of original study
- `Spatial Resolution`: resolution of original study
- `Spatial Reference System`: spatial reference system of original study
- `Temporal Coverage`: temporal extent of original study
- `Temporal Resolution`: temporal resolution of original study



## Study design

Describe how the study relates to prior literature, e.g. is it an **original study**, **meta-analysis study**, **reproduction study**, **reanalysis study**, or **replication study**?

Also describe the original study archetype, e.g. is it **observational**, **experimental**, **quasi-experimental**, or **exploratory**?

Enumerate specific **hypotheses** to be tested or **research questions** to be investigated here, and specify the type of method, statistical test or model to be used on the hypothesis or question.

## Materials and procedure

### Computational environment

Maintaining a reproducible computational environment requires some conscious choices in package management.

Please refer to `00-Python-environment-setup.ipynb` for details.
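
One way to document the environment alongside the results is to record the installed version of each imported package. A minimal sketch using only the standard library (`importlib.metadata`); the helper name `package_versions` is hypothetical and is not part of `00-Python-environment-setup.ipynb`:

```python
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the import cell of this notebook
report = package_versions(["pyhere", "pandas", "geopandas", "folium", "matplotlib"])
for name, version in report.items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

Saving this output with the results makes it easier to tell later whether a difference in results comes from the data or from a changed package version.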



In [164]:
# Import modules, define directories

from pyhere import here
import pandas as pd
import geopandas as gpd
import folium
import matplotlib


# You can define your own shortcuts for file paths:
path = {
    "dscr": here("data", "scratch"),
    "drpub": here("data", "raw", "public"),
    "drpriv": here("data", "raw", "private"),
    "ddpub": here("data", "derived", "public"),
    "ddpriv": here("data", "derived", "private"),
    "rfig": here("results", "figures"),
    "roth": here("results", "other"),
    "rtab": here("results", "tables"),
    "dmet": here("data", "metadata")
}

### Data and variables

Describe the **data sources** and **variables** to be used.
Data sources may include plans for observing and recording **primary data** or descriptions of **secondary data**.
For secondary data sources with numerous variables, the analysis plan authors may focus on documenting only the variables intended for use in the study.

Primary data sources for the study are to include ... .
Secondary data sources for the study are to include ... .

Each of the next subsections describes one data source.



#### Primary data source1 name

**Standard Metadata**

- `Abstract`: Brief description of the data source
- `Spatial Coverage`: Specify the geographic extent of your study. This may be a place name and link to a feature in a gazetteer like GeoNames or OpenStreetMap, or a well known text (WKT) representation of a bounding box.
- `Spatial Resolution`: Specify the spatial resolution as a scale factor, a description of the level of detail of each unit of observation (including the administrative level of administrative areas), and/or the cell size of a raster grid
- `Spatial Reference System`: Specify the geographic or projected coordinate system for the study
- `Temporal Coverage`: Specify the temporal extent of your study---i.e. the range of time represented by the data observations.
- `Temporal Resolution`: Specify the temporal resolution of your study---i.e. the duration of time that each observation represents or the revisit period for repeated observations
- `Lineage`: Describe and/or cite data sources and/or methodological steps planned to create this data source.
  - sampling scheme, including spatial sampling
  - target sample size and method for determining sample size
  - stopping criteria for data collection and sampling (e.g. sample size, time elapsed)
  - de-identification / anonymization
  - experimental manipulation
- `Distribution`: Describe who will make the data available and how.
- `Constraints`: Legal constraints for *access* or *use* to protect *privacy* or *intellectual property rights*
- `Data Quality`: State any planned quality assessment
- `Variables`: For each variable, enter the following information. If you have two or more variables per data source, you may want to present this information in table form (shown below)
  - `Label`: variable name as used in the data or code
  - `Alias`: intuitive natural language name
  - `Definition`: Short description or definition of the variable. Include measurement units in description.
  - `Type`: data type, e.g. character string, integer, real
  - `Accuracy`: e.g. uncertainty of measurements
  - `Domain`: Expected range (maximum and minimum) of numerical data, or codes or categories of nominal data, or reference to a standard codebook
  - `Missing Data Value(s)`: Values used to represent missing data
  - `Missing Data Frequency`: Frequency of missing data observations: not yet known for data to be collected

| Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| variable1 | ... | ... | ... | ... | ... | ... | ... |
| variable2 | ... | ... | ... | ... | ... | ... | ... |
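
When a data source has many variables, the table can be generated from a list of records instead of typed by hand. A minimal sketch in plain Python; `variables_table` is a hypothetical helper, and the single entry is a placeholder rather than a real variable from this study:

```python
COLUMNS = [
    "Label", "Alias", "Definition", "Type",
    "Accuracy", "Domain", "Missing Data Value(s)", "Missing Data Frequency",
]


def variables_table(variables):
    """Render a list of dicts as a Markdown table; missing entries become '...'."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "| " + " | ".join(":--:" for _ in COLUMNS) + " |",
    ]
    for var in variables:
        cells = [str(var.get(col, "...")) for col in COLUMNS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Hypothetical entry for illustration only
table = variables_table([{"Label": "variable1", "Type": "integer"}])
print(table)
```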



#### Primary data source2 name

... same form as above...



#### Secondary data source1 name

**Standard Metadata**

- `Abstract`: Brief description of the data source
- `Spatial Coverage`: Specify the geographic extent of your study. This may be a place name and link to a feature in a gazetteer like GeoNames or OpenStreetMap, or a well known text (WKT) representation of a bounding box.
- `Spatial Resolution`: Specify the spatial resolution as a scale factor, a description of the level of detail of each unit of observation (including the administrative level of administrative areas), and/or the cell size of a raster grid
- `Spatial Reference System`: Specify the geographic or projected coordinate system for the study
- `Temporal Coverage`: Specify the temporal extent of your study---i.e. the range of time represented by the data observations.
- `Temporal Resolution`: Specify the temporal resolution of your study---i.e. the duration of time that each observation represents or the revisit period for repeated observations
- `Lineage`: Describe and/or cite data sources and/or methodological steps used to create this data source
- `Distribution`: Describe how the data is distributed, including any persistent identifier (e.g. DOI) or URL for data access
- `Constraints`: Legal constraints for *access* or *use* to protect *privacy* or *intellectual property rights*
- `Data Quality`: State result of quality assessment or state "Quality unknown"
- `Variables`: For each variable, enter the following information. If you have two or more variables per data source, you may want to present this information in table form (shown below)
  - `Label`: variable name as used in the data or code
  - `Alias`: intuitive natural language name
  - `Definition`: Short description or definition of the variable. Include measurement units in description.
  - `Type`: data type, e.g. character string, integer, real
  - `Accuracy`: e.g. uncertainty of measurements
  - `Domain`: Range (Maximum and Minimum) of numerical data, or codes or categories of nominal data, or reference to a standard codebook
  - `Missing Data Value(s)`: Values used to represent missing data
  - `Missing Data Frequency`: Frequency of missing data observations

| Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| variable1 | ... | ... | ... | ... | ... | ... | ... |
| variable2 | ... | ... | ... | ... | ... | ... | ... |



#### Secondary data source2 name

... same form as above...



### Prior observations  

Prior experience with the study area, prior data collection, or prior observation of the data can compromise the validity of a study, e.g. through p-hacking.
Therefore, disclose any prior experience or observations at the time of study pre-registration here, with example text below:

At the time of this study pre-registration, the authors had _____ prior knowledge of the geography of the study region with regard to the ____ phenomena to be studied.
This study is related to ____ prior studies by the authors.

For each primary data source, declare the extent to which authors had already engaged with the data:

- [ ] no data collection has started
- [ ] pilot test data has been collected
- [ ] data collection is in progress and data has not been observed
- [ ] data collection is in progress and __% of data has been observed
- [ ] data collection is complete and data has been observed. Explain how authors have already manipulated / explored the data.

For each secondary source, declare the extent to which authors had already engaged with the data:

- [ ] data is not available yet
- [ ] data is available, but only metadata has been observed
- [ ] metadata and descriptive statistics have been observed
- [ ] metadata and a pilot test subset or sample of the full dataset have been observed
- [ ] the full dataset has been observed. Explain how authors have already manipulated / explored the data.

If pilot test data has been collected or acquired, describe how the researchers observed and analyzed the pilot test, and the extent to which the pilot test influenced the research design.



### Import Data

#### blockgroups.shp
- Block groups of Harris County, Texas
- One block group contains only 9 people, but you may include all block groups in your analysis.
- Source: United States Census API <https://www.census.gov/developers/> via R tidycensus <https://walkerke.github.io/tidycensus>

In [126]:
# gpd.read_file already returns a GeoDataFrame, so no further conversion is needed
blockgroups = gpd.read_file(here(path["drpub"], "blockgroups.shp"))

blockgroups.head()



| | STATEFP | COUNTYFP | TRACTCE | BLKGRPCE | AFFGEOID | GEOID | LSAD | ALAND | AWATER | GEONAME | geometry |
| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| 0 | 48 | 201 | 311000 | 1 | 1500000US482013110001 | 482013110001 | BG | 616969.0 | 47009.0 | Block Group 1, Census Tract 3110, Harris Count... | POLYGON ((958404.137 4217699.295, 958413.144 4... |
| 1 | 48 | 201 | 311000 | 4 | 1500000US482013110004 | 482013110004 | BG | 408595.0 | 25333.0 | Block Group 4, Census Tract 3110, Harris Count... | POLYGON ((957048.814 4217692.784, 957496.427 4... |
| 2 | 48 | 201 | 311100 | 1 | 1500000US482013111001 | 482013111001 | BG | 1018525.0 | 213804.0 | Block Group 1, Census Tract 3111, Harris Count... | POLYGON ((958975.179 4217311.881, 958892.738 4... |
| 3 | 48 | 201 | 311100 | 3 | 1500000US482013111003 | 482013111003 | BG | 484061.0 | 36045.0 | Block Group 3, Census Tract 3111, Harris Count... | POLYGON ((958773.280 4216120.376, 959779.865 4... |
| 4 | 48 | 201 | 311100 | 4 | 1500000US482013111004 | 482013111004 | BG | 547376.0 | 0.0 | Block Group 4, Census Tract 3111, Harris Count... | POLYGON ((958756.220 4216618.402, 959659.603 4... |
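
The identifier columns in the output above are related: a block group `GEOID` is the 12-character concatenation of the state FIPS code (2 digits), county FIPS code (3), tract code (6), and block group code (1), which is why it is safer to handle it as text than as a number. A minimal sketch; `split_geoid` is a hypothetical helper:

```python
def split_geoid(geoid):
    """Split a 12-character block group GEOID into its component codes."""
    geoid = str(geoid)
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError(f"expected 12 digits, got {geoid!r}")
    return {
        "STATEFP": geoid[0:2],
        "COUNTYFP": geoid[2:5],
        "TRACTCE": geoid[5:11],
        "BLKGRPCE": geoid[11],
    }


# First GEOID from the output above
parts = split_geoid("482013110001")
print(parts)
```

A check like this can flag identifiers that lost a leading zero or were converted to floats before they are used as a join key.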


#### blockgroup_demographic_data.csv
- Data table of block group demographics.
- Explanations of the attribute variable codes for these 5-year American Community Survey 2012-2017 data are found in block group metadata.xls
- Source: United States Census API <https://www.census.gov/developers/> via R tidycensus <https://walkerke.github.io/tidycensus>

In [186]:
blockgroup_demographic_data = pd.read_csv(here(path["drpub"],'blockgroup_demographic_data.csv'), dtype=str,encoding='latin-1')


blockgroup_demographic_data.head()

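| | GEOID | B03002_001 | B03002_002 | B03002_003 | B03002_004 | B03002_005 | B03002_006 | B03002_007 | B03002_008 | B03002_009 | ... | B03002_012 | B03002_013 | B03002_014 | B03002_015 | B03002_016 | B03002_017 | B03002_018 | B03002_019 | B03002_020 | B03002_021 |
| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| 0 | 482013110001 | 583.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 583.0 | 573.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 |
| 1 | 482013110004 | 1869.0 | 22.0 | 22.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1847.0 | 1818.0 | 0.0 | 0.0 | 0.0 | 0.0 | 29.0 | 0.0 | 0.0 | 0.0 |
| 2 | 482013111001 | 1046.0 | 11.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1035.0 | 895.0 | 4.0 | 0.0 | 0.0 | 0.0 | 136.0 | 0.0 | 0.0 | 0.0 |
| 3 | 482013111003 | 1639.0 | 112.0 | 112.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1527.0 | 1192.0 | 0.0 | 0.0 | 0.0 | 0.0 | 315.0 | 20.0 | 0.0 | 20.0 |
| 4 | 482013111004 | 1759.0 | 48.0 | 0.0 | 16.0 | 0.0 | 32.0 | 0.0 | 0.0 | 0.0 | ... | 1711.0 | 1476.0 | 11.0 | 7.0 | 0.0 | 0.0 | 173.0 | 44.0 | 44.0 | 0.0 |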
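
Reading every column with `dtype=str` keeps `GEOID` intact but also leaves the population counts as text. An alternative sketch, assuming the goal is a text key with numeric counts, restricts the string type to `GEOID`. The inline CSV is a small stand-in built from three rows and three count columns of the output above, not the real file:

```python
import io

import pandas as pd

# Stand-in for blockgroup_demographic_data.csv (subset of columns, first rows)
sample_csv = io.StringIO(
    "GEOID,B03002_001,B03002_003,B03002_012\n"
    "482013110001,583.0,0.0,583.0\n"
    "482013110004,1869.0,22.0,1847.0\n"
    "482013111001,1046.0,11.0,1035.0\n"
)

# Only GEOID is forced to text; the count columns are parsed as floats
demo = pd.read_csv(sample_csv, dtype={"GEOID": str})
print(demo.dtypes)
```

With numeric counts, sums and ratios work directly, while the text `GEOID` still matches the key in the block group layer.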


#### predicted_flood.shp
- This vector layer contains FEMA’s 100-yr flood zones for Harris County. A 100-year flood zone indicates a 1% chance of flooding in any given year.
- Source: FEMA’s National Flood Hazard Layer (NFHL) Viewer <https://hazards-fema.maps.arcgis.com/apps/webappviewer/index.html?id=8b0adb51996444d4879338b5529aa9cd>

In [34]:
predicted_flood = gpd.read_file(here(path["drpub"],'predicted_flood_FEMA.shp'))

predicted_flood.head()

| | featCount | DFIRM_ID | geometry |
| :--: | :--: | :--: | :--: |
| 0 | 4273.0 | 48201C | MULTIPOLYGON (((909695.289 4219147.673, 909688... |
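
The 1% annual chance that defines a 100-year flood zone accumulates over longer periods. As a worked example, assuming each year is independent, the chance of at least one flood in `n` years is `1 - (1 - 0.01)^n`. The 30-year horizon below is an illustrative choice, not a value from the lab, and `prob_at_least_one_flood` is a hypothetical helper:

```python
def prob_at_least_one_flood(annual_probability, years):
    """Probability of one or more floods in `years`, assuming independent years."""
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    return 1.0 - (1.0 - annual_probability) ** years


# 30 years is an illustrative horizon, not a value from the lab
p_30 = prob_at_least_one_flood(0.01, 30)
print(f"{p_30:.1%}")
```

Under that independence assumption the result is roughly 26%, which shows that a "100-year" zone is far from a once-in-a-lifetime risk.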


#### actual_flood.shp
- This vector layer contains FEMA’s 100-yr flood zones for Harris County. A 100-year flood zone indicates a 1% chance of flooding in any given year.
- Source: FEMA’s National Flood Hazard Layer (NFHL) Viewer <https://hazards-fema.maps.arcgis.com/apps/webappviewer/index.html?id=8b0adb51996444d4879338b5529aa9cd>

In [35]:
actual_flood = gpd.read_file(here(path["drpub"],'actual_flood.shp'))

actual_flood.head()

| | featCount | VALUE | geometry |
| :--: | :--: | :--: | :--: |
| 0 | 50363.0 | 1.0 | MULTIPOLYGON (((901523.471 4235737.879, 901523... |


### Bias and threats to validity

Given the research design and primary data to be collected and/or secondary data to be used, discuss common threats to validity and the approach to mitigating those threats, with an emphasis on geographic threats to validity.

These include:
  - uneven primary data collection due to geographic inaccessibility or other constraints
  - multiple hypothesis testing
  - edge or boundary effects
  - the modifiable areal unit problem
  - nonstationarity
  - spatial dependence or autocorrelation
  - temporal dependence or autocorrelation
  - spatial scale dependency
  - spatial anisotropies
  - confusion of spatial and a-spatial causation
  - ecological fallacy
  - uncertainty e.g. from spatial disaggregation, anonymization, differential privacy
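
Spatial dependence can be checked before modeling with a global statistic such as Moran's I, which is positive when neighboring units hold similar values and negative when they hold dissimilar ones. A minimal pure-Python sketch with binary adjacency weights; the six-unit layout and values are toy data, and `morans_i` is a hypothetical helper rather than a library function:

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary weights.

    `neighbors` maps each index to the indices it is adjacent to.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = sum(dev[i] * dev[j] for i, js in neighbors.items() for j in js)
    weight_sum = sum(len(js) for js in neighbors.values())
    squared = sum(d * d for d in dev)
    return (n / weight_sum) * (cross / squared)


# Six units in a row; each unit neighbors the units directly beside it
line = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

clustered = morans_i([1, 1, 1, 0, 0, 0], line)    # similar values adjacent
alternating = morans_i([1, 0, 1, 0, 1, 0], line)  # dissimilar values adjacent
print(clustered, alternating)
```

The clustered arrangement gives about 0.6 and the alternating one about -1.0. A real analysis would also need a significance test, which this sketch omits.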



### Data transformations

Describe all data transformations planned to prepare data sources for analysis.
This section should explain with the fullest detail possible how to transform data from the **raw** state at the time of acquisition or observation, to the pre-processed **derived** state ready for the main analysis.
Include steps to check and mitigate sources of **bias** and **threats to validity**.
The method may anticipate **contingencies**, e.g. tests for normality and alternative decisions to make based on the results of the test.
More specifically, describe all the **geographic** and **variable** transformations required to prepare input data as described in the data and variables section above to match the study's spatio-temporal characteristics as described in the study metadata and study design sections.
Visual workflow diagrams may help communicate the methodology in this section.

Examples of **geographic** transformations include coordinate system transformations, aggregation, disaggregation, spatial interpolation, distance calculations, zonal statistics, etc.
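
Disaggregation often relies on areal weighting: a zone's count is split in proportion to the share of its area that falls inside another layer. A minimal sketch in plain Python, assuming the count is spread evenly within each zone (a strong assumption); the function name and numbers are hypothetical:

```python
def areal_weighted_count(count, zone_area, overlap_area):
    """Estimate the part of `count` inside the overlap, assuming uniform density."""
    if zone_area <= 0:
        raise ValueError("zone_area must be positive")
    if not 0 <= overlap_area <= zone_area:
        raise ValueError("overlap_area must lie between 0 and zone_area")
    return count * (overlap_area / zone_area)


# Illustrative numbers: 1,000 residents, 2.0 sq km zone, 0.5 sq km inside the overlay
estimate = areal_weighted_count(1000, 2.0, 0.5)
print(estimate)  # 250.0
```

Both areas must be measured in the same projected coordinate system for the ratio to be meaningful.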

Examples of **variable** transformations include standardization, normalization, constructed variables, imputation, classification, etc.

Be sure to include any steps planned to **exclude** observations with *missing* or *outlier* data, to **group** observations by *attribute* or *geographic* criteria, or to **impute** missing data or apply spatial or temporal **interpolation**.
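
As an example of a constructed variable with an explicit rule for missing data, the sketch below converts counts to percentages and leaves the result missing where the denominator is zero. The column names `Total` and `White` follow the renaming used later in this notebook; the identifiers and values are invented for illustration, including the zero-total row that exercises the rule:

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "GEOID": ["A", "B", "C"],  # placeholder identifiers
        "Total": [200.0, 50.0, 0.0],
        "White": [50.0, 10.0, 0.0],
    }
)

# Treat a zero denominator as missing so the division yields NaN, not an error or inf
denominator = demo["Total"].where(demo["Total"] > 0)
demo["pct_white"] = 100 * demo["White"] / denominator

# Exclude observations whose derived value is missing, and report how many
clean = demo.dropna(subset=["pct_white"])
print(f"dropped {len(demo) - len(clean)} of {len(demo)} rows")
print(clean[["GEOID", "pct_white"]])
```

Reporting the number of dropped rows keeps the exclusion step visible in the record of the analysis.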



#### Goal 1: Load census data into block groups 


In [187]:
## Join blockgroups to blockgroup_demographic_data


bg_merged = blockgroups.merge(blockgroup_demographic_data, on = "GEOID", how = "left")

bgData = gpd.GeoDataFrame(bg_merged)

bgData.rename(
  columns={
    'B03002_001': 'Total',
    'B03002_003' : 'White',
    'B03002_004' : 'Black',
    'B03002_006' : 'Asian',
    'B03002_012' : 'Latinx',
  },
  inplace=True
)
# bgData.astype(
#     {'GEOID': 'int','White': 'float'}).dtypes

bgData.dtypes
#bgData.columns




STATEFP         object
COUNTYFP        object
TRACTCE         object
BLKGRPCE        object
AFFGEOID        object
GEOID           object
LSAD            object
ALAND          float64
AWATER         float64
GEONAME         object
geometry      geometry
Total           object
B03002_002      object
White           object
Black           object
B03002_005      object
Asian           object
B03002_007      object
B03002_008      object
B03002_009      object
B03002_010      object
B03002_011      object
Latinx          object
B03002_013      object
B03002_014      object
B03002_015      object
B03002_016      object
B03002_017      object
B03002_018      object
B03002_019      object
B03002_020      object
B03002_021      object
dtype: object
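
A left join silently keeps block groups that have no demographic match, so it can be worth checking the match before analysis. A hedged sketch using the `validate` and `indicator` parameters of pandas `merge`, shown on small stand-in tables; the second GEOID is deliberately left out of the right-hand table to produce an unmatched key:

```python
import pandas as pd

# Stand-ins for the two inputs; GEOIDs are the first rows shown in this notebook
left = pd.DataFrame({"GEOID": ["482013110001", "482013110004"]})
right = pd.DataFrame({"GEOID": ["482013110001"], "B03002_001": ["583.0"]})

checked = left.merge(
    right,
    on="GEOID",
    how="left",
    validate="one_to_one",  # raises MergeError if GEOID repeats on either side
    indicator=True,         # adds a _merge column: both / left_only / right_only
)

unmatched = checked.loc[checked["_merge"] == "left_only", "GEOID"].tolist()
print(unmatched)

# Counts read as text can then be converted; unparseable entries become NaN
checked["B03002_001"] = pd.to_numeric(checked["B03002_001"], errors="coerce")
print(checked.dtypes)
```

The same conversion with `pd.to_numeric` applies to the renamed count columns listed as `object` in the output above.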

In [181]:
# Visualize join

# folium compares the data values numerically, so convert the text counts first
bgData["White"] = pd.to_numeric(bgData["White"], errors="coerce")

# Name the map `m` so it does not shadow the Python built-in `map`
m = folium.Map(location=[29.85, -95.4], tiles="CartoDB Positron", zoom_start=10)

folium.Choropleth(
    geo_data=blockgroups,
    data=bgData,
    columns=["GEOID", "White"],
    fill_color="YlGn",
    key_on="feature.properties.GEOID",  # GeoJSON attributes live under "properties"
    fill_opacity=0.7,
    line_opacity=0.2,
).add_to(m)
m

### Analysis

Describe the methods of analysis that will directly test the hypotheses or provide results to answer the research questions.
This section should explicitly define any spatial / statistical *models* and their *parameters*, including *grouping* criteria, *weighting* criteria, and *significance thresholds*.
Also explain any follow-up analyses or validations.



## Results

Describe how results are to be presented.



## Discussion

Describe how the results are to be interpreted *vis a vis* each hypothesis or research question.



## Integrity Statement

Include an integrity statement, for example: The authors of this preregistration state that they completed this preregistration to the best of their knowledge and that no other preregistration exists pertaining to the same hypotheses and research.
If a prior registration *does* exist, explain the rationale for revising the registration here.



# Acknowledgements

- `Funding Name`: name of funding for the project
- `Funding Title`: title of project grant
- `Award info URI`: web address for award information
- `Award number`: award number

This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI:[10.17605/OSF.IO/W29MQ](https://doi.org/10.17605/OSF.IO/W29MQ).

## References