# Week 1 Assignment

In this assignment you will be exposed to some fundamental geospatial data wrangling techniques, learn how to load basic Python libraries, import and read your csv dataset, and perform some basic operations.

__DATA__: You will work with raw raster data and with structured csv files derived from those rasters. We will build machine learning models to solve three problems: impervious fractional cover estimation, impervious surface classification, and landcover classification.

__Raster Files__: 

    - Landsat.tif (remotely sensed data in the form of surface reflectance, will be used as the input to our models) [values (0 - 10000)]
    - Landcover.tif (NLCD landcover map, will be used as our "ground truth" in training some of our models)     
        "landcover-legend": {
            'water': 11, 'snow': 12, 'developed-open': 21, 'developed-low': 22, 'developed-med': 23, 'developed-high': 24, 'barren': 31, 'dforest': 41, 'eforest': 42,
            'mforest': 43, 'shrub': 52, 'grassland': 71, 'hay': 81, 'crops': 82, 'wwetlands': 90,'ewetlands': 95
        }
    - Impervious.tif (NLCD fractional impervious map, will be used as our "ground truth" in training some of our models) [values (0 - 100)]
    - Dem.tif (ancillary data in the form of elevation data) [values (0 - 10000)]
    - Aspect.tif (ancillary data in the form of downslope direction) [values (0 - 8)]
    - Posidex.tif (ancillary data in the form of positional index) [values (0 - 100)]
    - Wetlands.tif (ancillary data in the form of wetlands information) [values (0 - 17)]
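
The landcover legend above can be turned into a lookup table for labelling class codes. This is a minimal sketch, assuming the codes arrive as a pandas Series (for example the `landcover_1` column of the csv used later). The name `code_to_name` and the sample codes are illustrative and do not come from the assignment.

```python
import pandas as pd

# Legend copied from the Landcover.tif description above (class name -> NLCD code).
landcover_legend = {
    'water': 11, 'snow': 12, 'developed-open': 21, 'developed-low': 22,
    'developed-med': 23, 'developed-high': 24, 'barren': 31, 'dforest': 41,
    'eforest': 42, 'mforest': 43, 'shrub': 52, 'grassland': 71, 'hay': 81,
    'crops': 82, 'wwetlands': 90, 'ewetlands': 95,
}

# Invert the legend so that a numeric code can be looked up to get its class name.
code_to_name = {code: name for name, code in landcover_legend.items()}

# Illustrative sample of landcover codes (not read from the real dataset).
codes = pd.Series([42, 41, 81, 21, 0])

# Codes that are missing from the legend (here 0) become NaN instead of raising an error.
names = codes.map(code_to_name)
print(names.tolist())
```

Using `Series.map` keeps unknown codes visible as missing values, which makes it easy to spot pixels that fall outside the legend.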

Modern machine learning has offered advancements in the automated analysis of data and we plan to employ some of those techniques here.



## 1. Import the required libraries:
> NumPy and pandas are two of the most commonly used libraries for data analysis.
NumPy is a Python package for scientific computing that provides support for large multi-dimensional arrays and matrices. pandas is a Python library for data manipulation and analysis.

> __HINT__: Check how to do this in the instructions file
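
As a quick illustration of the two libraries, the sketch below builds a small NumPy array and wraps it in a pandas DataFrame. The values and the column names `band_a` and `band_b` are made up for the example.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array: 3 rows (pixels) by 2 columns (bands), illustrative values only.
arr = np.array([[164, 373],
                [271, 418],
                [454, 832]])

print(arr.shape)         # dimensions of the array as (rows, columns)
print(arr.mean(axis=0))  # column-wise mean

# The same values as a labelled pandas DataFrame.
df = pd.DataFrame(arr, columns=['band_a', 'band_b'])
print(df.describe())     # summary statistics per column
```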

In [2]:
pip install rasterio

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting rasterio
  Downloading rasterio-1.3.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.3 MB)
Collecting affine (from rasterio)
  Downloading affine-2.4.0-py3-none-any.whl (15 kB)
Collecting cligj>=0.5 (from rasterio)
  Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)
Collecting snuggs>=1.4.1 (from rasterio)
  Downloading snuggs-1.4.7-py3-none-any.whl (5.4 kB)
Collecting click-plugins (from rasterio)
  Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
Installing collected packages: snuggs, cligj, click-plugins, affine, rasterio
Successfully installed affine-2.4.0 click-plugins-1.1.1 cligj-0.7.2 rasterio-1.3.8 snuggs-1.4.7
Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install geopandas

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting geopandas
  Downloading geopandas-0.13.2-py3-none-any.whl (1.1 MB)
Collecting fiona>=1.8.19 (from geopandas)
  Downloading Fiona-1.9.4.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.4 MB)
Collecting pyproj>=3.0.1 (from geopandas)
  Downloading pyproj-3.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.9 MB)
Collecting shapely>=1.7.1 (from geopandas)
  Downloading shapely-2.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.

In [5]:
import os
import random
import numpy as np
import pandas as pd
import rasterio as rio
import geopandas as gpd

## 2. Read the raster files

> __HINT__: You can find out how to do this in the instructions file

In [9]:
paths = [
    's3://geokarma-testing/geoKARMA_h24v13_landcover_2019.tif',
    's3://geokarma-testing/geoKARMA_h24v13_landcover_2019.tif',
    's3://geokarma-testing/geoKARMA_h24v13_dem_2019.tif',
    's3://geokarma-testing/geoKARMA_h24v13_aspect_2019.tif',
    's3://geokarma-testing/geoKARMA_h24v13_posidex_2019.tif',
    's3://geokarma-testing/geoKARMA_h24v13_wetlands_2019.tif'
]

# Open every raster and keep the dataset handles so they can be reused
datasets = [rio.open(path) for path in paths]
    

## 3 - 8. Print the metadata and bounds for each raster file
> It is a good idea to examine the metadata and all the information tied to the raster data

In [11]:
for path in paths:
    print(rio.open(path).meta)

{'driver': 'GTiff', 'dtype': 'uint8', 'nodata': 255.0, 'width': 5000, 'height': 5000, 'count': 1, 'crs': CRS.from_wkt('PROJCS["Albers_Conical_Equal_Area",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Albers_Conic_Equal_Area"],PARAMETER["latitude_of_center",23],PARAMETER["longitude_of_center",-96],PARAMETER["standard_parallel_1",29.5],PARAMETER["standard_parallel_2",45.5],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]'), 'transform': Affine(30.0, 0.0, 1034415.0000701696,
       0.0, -30.0, 1364804.9997692876)}
{'driver': 'GTiff', 'dtype': 'uint8', 'nodata': 255.0, 'width': 5000, 'height': 5000, 'count': 1, 'crs': CRS.from_wkt('PROJCS["Albers_Conical_Equal_Area",GEOGCS["WGS 84",DATUM["WGS_1984",SP
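
The heading for this step also asks for the bounds. rasterio exposes them as the `bounds` attribute of an opened dataset, and they can also be derived by hand from the metadata. The sketch below works through that arithmetic using the transform, width, and height from the first metadata dictionary printed above. The variable names are illustrative.

```python
# Values copied from the first metadata dictionary printed above.
pixel_width = 30.0            # Affine a: metres per pixel in x
pixel_height = -30.0          # Affine e: negative because rows run from north to south
left = 1034415.0000701696     # Affine c: x coordinate of the upper-left corner
top = 1364804.9997692876      # Affine f: y coordinate of the upper-left corner
width, height = 5000, 5000    # raster size in pixels

# The opposite corner is the upper-left corner plus the raster extent.
right = left + width * pixel_width
bottom = top + height * pixel_height

# Same ordering as rasterio's BoundingBox: (left, bottom, right, top).
print((left, bottom, right, top))
```

With 5000 pixels of 30 m on each side, the raster covers 150 km in both directions.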

## 9. Read the csv dataset

> __HINT__: You can find out how to do this in the instructions file

In [13]:
csv = pd.read_csv('s3://geokarma-testing/geoKARMA_h24v13_pixelbased_dataset.csv')

## 10. Print the first 5 observations of the dataset
> It is a good idea to examine the first and last observations to get an idea of the dataset

In [14]:
csv.head()

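| | landsat_1 | landsat_2 | landsat_3 | landsat_4 | landsat_5 | landsat_6 | dem_1 | aspect_1 | posidex_1 | wetlands_1 | landcover_1 | impervious_1 | urban_count_7 | urban_count_5 | urban_count_3 | xgeo | ygeo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 164 | 373 | 233 | 2592 | 1096 | 429 | 254 | 2 | 47 | 0 | 42 | 0 | 14 | 2 | 0 | 1051155.0 | 1247055.0 |
| 1 | 271 | 418 | 292 | 2782 | 1439 | 635 | 257 | 15 | 28 | 0 | 41 | 0 | 17 | 10 | 4 | 1056225.0 | 1241565.0 |
| 2 | 454 | 832 | 850 | 3860 | 2671 | 1476 | 277 | 1 | 45 | -1 | 81 | 0 | 0 | 0 | 0 | 1128015.0 | 1313925.0 |
| 3 | 187 | 345 | 198 | 2469 | 1117 | 441 | 242 | 8 | 27 | -1 | 42 | 0 | 0 | 0 | 0 | 1124175.0 | 1282395.0 |
| 4 | 481 | 715 | 731 | 3519 | 2286 | 1386 | 239 | 1 | 91 | 0 | 21 | 13 | 16 | 11 | 5 | 1156905.0 | 1320645.0 |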


## 11. Print the last 5 observations of the dataset

In [15]:
csv.tail()

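| | landsat_1 | landsat_2 | landsat_3 | landsat_4 | landsat_5 | landsat_6 | dem_1 | aspect_1 | posidex_1 | wetlands_1 | landcover_1 | impervious_1 | urban_count_7 | urban_count_5 | urban_count_3 | xgeo | ygeo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 499995 | 257 | 504 | 474 | 2794 | 1555 | 733 | 340 | 12 | 38 | 0 | 21 | 16 | 35 | 18 | 7 | 1065225.0 | 1290975.0 |
| 499996 | 202 | 386 | 207 | 3367 | 1275 | 476 | 325 | 10 | 47 | 0 | 42 | 0 | 0 | 0 | 0 | 1074765.0 | 1298895.0 |
| 499997 | 158 | 352 | 210 | 3074 | 1433 | 500 | 258 | 16 | 54 | 0 | 41 | 0 | 3 | 0 | 0 | 1043805.0 | 1245045.0 |
| 499998 | 143 | 378 | 233 | 3098 | 1217 | 503 | 274 | 9 | -1 | 0 | 21 | 1 | 34 | 18 | 6 | 1055205.0 | 1231785.0 |
| 499999 | 461 | 861 | 812 | 3726 | 2853 | 1489 | 422 | 6 | 19 | 0 | 0 | 0 | 1 | 0 | 0 | 1113525.0 | 1345005.0 |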


## 12. Print the column names

In [16]:
csv.columns

Index(['landsat_1', 'landsat_2', 'landsat_3', 'landsat_4', 'landsat_5',
       'landsat_6', 'dem_1', 'aspect_1', 'posidex_1', 'wetlands_1',
       'landcover_1', 'impervious_1', 'urban_count_7', 'urban_count_5',
       'urban_count_3', 'xgeo', 'ygeo'],
      dtype='object')

## 13.  Print dimensions of the dataset
> You will get this in the form __(rows,columns)__

In [17]:
csv.shape

(500000, 17)

## 14. Print a summary of the dataset by using the info function

In [18]:
csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 17 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   landsat_1      500000 non-null  int64  
 1   landsat_2      500000 non-null  int64  
 2   landsat_3      500000 non-null  int64  
 3   landsat_4      500000 non-null  int64  
 4   landsat_5      500000 non-null  int64  
 5   landsat_6      500000 non-null  int64  
 6   dem_1          500000 non-null  int64  
 7   aspect_1       500000 non-null  int64  
 8   posidex_1      500000 non-null  int64  
 9   wetlands_1     500000 non-null  int64  
 10  landcover_1    500000 non-null  int64  
 11  impervious_1   500000 non-null  int64  
 12  urban_count_7  500000 non-null  int64  
 13  urban_count_5  500000 non-null  int64  
 14  urban_count_3  500000 non-null  int64  
 15  xgeo           500000 non-null  float64
 16  ygeo           500000 non-null  float64
dtypes: float64(2), int64(15)
memo
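
`info()` reports no null values, yet the `head()` and `tail()` outputs above contain `-1` in `wetlands_1` and `posidex_1`. Assuming `-1` acts as a nodata sentinel (the assignment does not say so), the sketch below counts such values per column. It uses three rows copied from the `head()` output, restricted to three columns.

```python
import pandas as pd

# Rows 0, 2 and 3 of the csv.head() output, restricted to three columns.
sample = pd.DataFrame({
    'posidex_1': [47, 45, 27],
    'wetlands_1': [0, -1, -1],
    'landcover_1': [42, 81, 42],
})

# ASSUMPTION: -1 marks missing data; this is not stated in the assignment.
sentinel = -1

# Compare every cell with the sentinel and count the matches in each column.
sentinel_counts = (sample == sentinel).sum()
print(sentinel_counts)
```

On the full dataset the same check would read `(csv == -1).sum()`.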

# OPTIONAL 
> Create shapefiles to visualize the points in QGIS: drop all columns except `xgeo` and `ygeo`, and use the CRS from the Landcover raster.
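
One possible approach to the optional task is sketched below, assuming geopandas is installed with a working file writer. The two sample points are copied from the `csv.head()` output, the WKT string is copied from the Landcover metadata printed in steps 3 - 8, and the output file name `sample_points.shp` is hypothetical.

```python
import os
import tempfile

import geopandas as gpd
import pandas as pd

# Two sample rows copied from the csv.head() output; only xgeo and ygeo are kept.
df = pd.DataFrame({
    'xgeo': [1051155.0, 1056225.0],
    'ygeo': [1247055.0, 1241565.0],
})

# CRS copied from the Landcover raster metadata printed in steps 3 - 8.
# With the raster open, rio.open(path).crs.to_wkt() would supply the same string.
albers_wkt = (
    'PROJCS["Albers_Conical_Equal_Area",GEOGCS["WGS 84",DATUM["WGS_1984",'
    'SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],'
    'AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],'
    'UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],'
    'AUTHORITY["EPSG","4326"]],PROJECTION["Albers_Conic_Equal_Area"],'
    'PARAMETER["latitude_of_center",23],PARAMETER["longitude_of_center",-96],'
    'PARAMETER["standard_parallel_1",29.5],PARAMETER["standard_parallel_2",45.5],'
    'PARAMETER["false_easting",0],PARAMETER["false_northing",0],'
    'UNIT["metre",1,AUTHORITY["EPSG","9001"]],'
    'AXIS["Easting",EAST],AXIS["Northing",NORTH]]'
)

# Build point geometries from the coordinate columns and attach the CRS.
points = gpd.GeoDataFrame(
    df[['xgeo', 'ygeo']],
    geometry=gpd.points_from_xy(df['xgeo'], df['ygeo']),
    crs=albers_wkt,
)

# Write to a temporary folder; the file name is a placeholder for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), 'sample_points.shp')
points.to_file(out_path)
print(out_path)
```

The resulting `.shp` file can be added to QGIS as a vector layer. On the full dataset, `df` would be replaced by `csv[['xgeo', 'ygeo']]`.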