# Standardize parthenium survey data for phase 3
**Note:** This code results in `../output/survey.sqlite` which should be moved to `$BIODIVERSITY_DATA/survey/` for dependent modules to work.

This notebook processes the parthenium data resent by SM on 2021-04-01.
In the survey data, the old phase values for parthenium are replaced with new ones (1->11, 2->12, 3->13), for example:

```
update survey set phase=11 where species='Parthenium hysterophorus' and phase=1;
```
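The same remapping for all three phases could also be run from Python. This is only a minimal sketch: `remap_phases` and `PHASE_MAP` are hypothetical names, the table layout is assumed to match the `survey` table used later in this notebook, and the demo runs on an in-memory database with made-up rows rather than on the real file.

```python
import sqlite3

# Old phase -> new phase for parthenium, as listed above.
PHASE_MAP = {1: 11, 2: 12, 3: 13}

def remap_phases(conn, species="Parthenium hysterophorus", phase_map=PHASE_MAP):
    """Rewrite old phase values for one species; return the number of rows changed."""
    changed = 0
    for old, new in phase_map.items():
        cur = conn.execute(
            "UPDATE survey SET phase=? WHERE species=? AND phase=?",
            (new, species, old),
        )
        changed += cur.rowcount
    conn.commit()
    return changed

# Demo on an in-memory database (assumed schema, illustrative rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE survey (species TEXT, latitude REAL, longitude REAL, "
    "phase INTEGER, presence INTEGER, magnitude TEXT, "
    "PRIMARY KEY(species, latitude, longitude, phase))"
)
conn.executemany(
    "INSERT INTO survey VALUES (?,?,?,?,?,?)",
    [
        ("Parthenium hysterophorus", 27.0, 84.0, 1, 1, "0"),
        ("Parthenium hysterophorus", 27.1, 84.1, 3, 0, "0"),
        ("Other species", 27.0, 84.0, 1, 1, "0"),  # must stay untouched
    ],
)
n_changed = remap_phases(conn)
phases = [r[0] for r in conn.execute("SELECT phase FROM survey ORDER BY phase")]
conn.close()
```

Binding the values as parameters avoids quoting issues with the species name. Rows of other species keep their original phase.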

In [1]:
import sqlite3
import pandas as pd
import numpy as np
from pandas import ExcelFile

SURVEY_DATA_COLUMNS=['latitude','longitude','magnitude','phase','presence','species']

## Input and output

In [2]:
#DB='/Users/abhijin/data/Biodiversity/survey/survey.sqlite'
DB='../output/survey.sqlite'
DATA='/Users/abhijin/github/USAID_IPMIL_Biodiversity/data/field_survey/Parthenium_patchwiseGPS_2021-03-04.xlsx'

## Initiating dataframe
Mandatory column names for database (more can be added): `index,species,latitude,longitude,presence`
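A quick check for the mandatory columns might look like the sketch below. `missing_columns` and `MANDATORY_COLUMNS` are hypothetical names, and `index` is assumed here to be the dataframe index rather than a regular column, so only the other four names are checked.

```python
import pandas as pd

# Mandatory columns named above; 'index' is assumed to be the DataFrame index.
MANDATORY_COLUMNS = ["species", "latitude", "longitude", "presence"]

def missing_columns(frame, required=MANDATORY_COLUMNS):
    """Return the required column names that are absent from the frame."""
    return [c for c in required if c not in frame.columns]

# Illustrative frame that still lacks the 'species' column.
demo = pd.DataFrame({"latitude": [28.25], "longitude": [85.36], "presence": [1]})
missing = missing_columns(demo)
```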

## Reading excel sheet
The sheets are processed one at a time. Each cell below takes one sheet's dataframe, renames its columns to the standard names, and keeps only `latitude`, `longitude`, and `presence`.
Note that there are a number of duplicate entries.

In [3]:
SHEETS=['Rasuwa','Tanahu_new','Makwanpur','Tanahu_old','Chitwan','chitwan_current2']
# One dataframe per sheet, in the order listed above
df=[pd.read_excel(DATA,sheet_name=s) for s in SHEETS]
# Placeholders for the standardized version of each sheet
x=[None]*len(SHEETS)

In [4]:
x

[None, None, None, None, None, None]

In [5]:
x[0]=df[0].rename(columns={'Observatio': 'presence'})
x[0]=x[0][['latitude','longitude','presence']]
x[0].head(5)

Unnamed: 0,latitude,longitude,presence
0,28.255208,85.366606,Present
1,28.246433,85.363674,Absent
2,28.244754,85.361251,Absent
3,28.250513,85.366208,Absent
4,28.255208,85.366606,Present


In [6]:
x[1]=df[1].rename(columns={'LONGITUDE': 'longitude', 'LATITUDE': 'latitude','pres_abs': 'presence'})
x[1]=x[1][['latitude','longitude','presence']]
x[1].head(5)

Unnamed: 0,latitude,longitude,presence
0,27.97743,84.366231,Present
1,27.971386,84.367327,Absent
2,27.969235,84.363981,Absent
3,27.9687,84.361858,Absent
4,27.959625,84.358741,Absent


In [7]:
x[2]=df[2].rename(columns={'Observatio': 'presence'})
x[2]=x[2][['latitude','longitude','presence']]
x[2].head(5)

Unnamed: 0,latitude,longitude,presence
0,27.4137,85.0285,Present
1,27.4114,85.0296,Present
2,27.4114,85.03066,Present
3,27.41402,85.02682,Present
4,27.41215,85.03005,Present


In [8]:
x[3]=df[3].rename(columns={'presence/absence': 'presence'})
x[3]=x[3][['latitude','longitude','presence']]
x[3].head(5)

Unnamed: 0,latitude,longitude,presence
0,28.02875,84.08552,Presence
1,28.02715,84.08625,Presence
2,28.02702,84.08612,Presence
3,28.00838,84.08858,Presence
4,28.00857,84.08859,Presence


In [9]:
x[4]=df[4].rename(columns={'Observatio': 'presence'})
x[4]=x[4][['latitude','longitude','presence']]
x[4].head(5)

Unnamed: 0,latitude,longitude,presence
0,27.58898,84.34625,Present
1,27.58978,84.34734,Present
2,27.58972,84.34749,Present
3,27.58948,84.34765,Present
4,27.5893,84.347,Present


In [10]:
x[5]=df[5].rename(columns={'Observatio': 'presence'})
x[5]=x[5][['latitude','longitude','presence']]
x[5].head(5)

Unnamed: 0,latitude,longitude,presence
0,27.877316,84.587634,Present
1,27.873375,84.600521,Present
2,27.85869,84.56905,Present
3,27.85893,84.56963,Present
4,27.863355,84.57701,Present
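
The per-sheet cells above repeat the same two steps, so they could be folded into one helper. This is a sketch only: `standardize_sheet` and `RENAME_MAP` are hypothetical names, the mapping is collected from the column names seen in the cells above, and it assumes no sheet contains two variants of the same column.

```python
import pandas as pd

# Source column -> standard column, collected from the cells above.
RENAME_MAP = {
    "Observatio": "presence",
    "pres_abs": "presence",
    "presence/absence": "presence",
    "LATITUDE": "latitude",
    "LONGITUDE": "longitude",
}

def standardize_sheet(sheet):
    """Rename known column variants and keep only the three standard columns."""
    renamed = sheet.rename(columns=RENAME_MAP)  # unknown keys are ignored
    return renamed[["latitude", "longitude", "presence"]]

# Illustrative sheet using the 'Tanahu_new' column names plus an extra column.
demo = pd.DataFrame({
    "LATITUDE": [27.97743],
    "LONGITUDE": [84.366231],
    "pres_abs": ["Present"],
    "notes": ["extra column that gets dropped"],
})
out = standardize_sheet(demo)
```

With such a helper, the six cells would reduce to something like `x=[standardize_sheet(d) for d in df]`.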


## Append all and create the table

In [11]:
surveyData=pd.concat(x)
# .size counts cells (rows x columns): 1299 cells / 3 columns = 433 rows
surveyData.size

1299

In [12]:
# Check for duplicates: keep the first row per (latitude, longitude) pair. There are lots of them.
y=surveyData.groupby(['latitude','longitude']).head(1)
y.shape

(383, 3)
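
For reference, the `groupby(...).head(1)` idiom above keeps the first row of each coordinate pair, which is the same thing `drop_duplicates` does with `keep="first"`. The sketch below compares the two on a small made-up frame; the values are illustrative only.

```python
import pandas as pd

# Small illustrative frame: the third row repeats the first coordinate pair.
demo = pd.DataFrame({
    "latitude": [28.255208, 28.246433, 28.255208],
    "longitude": [85.366606, 85.363674, 85.366606],
    "presence": ["Present", "Absent", "Present"],
})

keys = ["latitude", "longitude"]
via_groupby = demo.groupby(keys).head(1)                 # idiom used above
via_drop = demo.drop_duplicates(subset=keys, keep="first")
n_dupes = int(demo.duplicated(subset=keys).sum())        # rows that get dropped
```

`duplicated(...).sum()` gives the number of discarded rows directly, which is handy when reporting how many duplicates a resent file contained.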

In [13]:
surveyData=y.copy()  # copy to avoid pandas' SettingWithCopyWarning on the assignments below
surveyData['species']='Parthenium hysterophorus'
surveyData['phase']=3
surveyData['magnitude']=0
surveyData.loc[surveyData.presence=='Present','presence']=1
surveyData.loc[surveyData.presence=='Presence','presence']=1
surveyData.loc[surveyData.presence=='Absent','presence']=0
surveyData.loc[surveyData.presence=='Absence','presence']=0
surveyData.presence=surveyData.presence.astype(int)

In [14]:
surveyData.dtypes

latitude     float64
longitude    float64
presence       int64
species       object
phase          int64
magnitude      int64
dtype: object
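
The label-to-integer conversion above could also be written as a single mapping that fails loudly on unexpected labels. This is a sketch under the assumption that only the four labels seen in the sheets are valid; `encode_presence` and `PRESENCE_CODES` are hypothetical names.

```python
import pandas as pd

# The four labels that occur in the sheets, mapped to 1 (present) / 0 (absent).
PRESENCE_CODES = {"Present": 1, "Presence": 1, "Absent": 0, "Absence": 0}

def encode_presence(labels):
    """Map presence labels to 0/1, raising on missing or unknown labels."""
    if labels.isna().any():
        raise ValueError("presence column contains missing values")
    unknown = sorted(set(labels) - set(PRESENCE_CODES), key=str)
    if unknown:
        raise ValueError(f"unexpected presence labels: {unknown}")
    return labels.map(PRESENCE_CODES).astype(int)

demo = pd.Series(["Present", "Absent", "Presence", "Absence"])
codes = encode_presence(demo)
```

The explicit check matters because a misspelled label would otherwise only surface as an error in `astype(int)`, without saying which label caused it.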

## Push to database
Assumes that the database has the following table:
```
CREATE TABLE "survey" (
	"species"	TEXT,
	"latitude"	REAL,
	"longitude"	REAL,
	"phase"	INTEGER,
	"presence"	INTEGER,
	"magnitude"	TEXT,
	PRIMARY KEY("species","latitude","longitude","phase")
);
```

In [15]:
conn = sqlite3.connect(DB)
cur=conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS "survey" (
    "species"    TEXT,
    "latitude"    REAL,
    "longitude"    REAL,
    "phase"    INTEGER,
    "presence"    INTEGER,
    "magnitude"    TEXT,
    PRIMARY KEY("species","latitude","longitude","phase")
);''')
surveyData[SURVEY_DATA_COLUMNS].to_sql('temporary_table',conn,if_exists='replace')

In [16]:
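cols=','.join(SURVEY_DATA_COLUMNS)
# Name the target columns explicitly: SURVEY_DATA_COLUMNS is not in the table's column order,
# and INSERT ... SELECT without a column list matches values by position.
cur.execute('INSERT OR REPLACE INTO survey (' + cols + ') SELECT ' + cols + ' FROM temporary_table;')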
cur.execute('DROP TABLE temporary_table;')
conn.commit()
conn.close()
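
To see how `INSERT OR REPLACE` interacts with the table's primary key, the sketch below inserts the same site twice into an in-memory database. The schema is assumed to match the `survey` table above and the row values are illustrative. The column list is spelled out in the `INSERT` so values are matched to columns by name rather than by position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE survey (
        species TEXT, latitude REAL, longitude REAL,
        phase INTEGER, presence INTEGER, magnitude TEXT,
        PRIMARY KEY(species, latitude, longitude, phase)
    )"""
)

INSERT_SQL = (
    "INSERT OR REPLACE INTO survey "
    "(species, latitude, longitude, phase, presence, magnitude) "
    "VALUES (?,?,?,?,?,?)"
)
key = ("Parthenium hysterophorus", 27.58898, 84.34625, 3)

conn.execute(INSERT_SQL, key + (0, "0"))  # first version of the record
conn.execute(INSERT_SQL, key + (1, "0"))  # same primary key: replaces the row
rows = conn.execute("SELECT species, presence FROM survey").fetchall()
conn.close()
```

Because the second insert shares the primary key with the first, the table ends up with one row carrying the newer `presence` value. Rerunning the notebook on a resent file therefore overwrites earlier records for the same site and phase instead of duplicating them.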