# Table of Contents
1. [Things to Do](#Things-to-Do)
2. [Introduction](#Introduction)
3. [Set up Python](#Set-up-Python)
4. [Functions](#Functions)
5. [Getting Data](#Get-Data)
6. [Analyze Data](#Analyze-Data)
7. [Export Files](#Export-Files)

# Things to Do


- [Resume Here](#Resume-Here)

## Introduction

This notebook contains code and output of descriptive analyses for the 2000-2017 CC dataset after cleaning.

The objectives of this notebook are to:

The metrics we examine are: .




##  Set up Python

First we need to set up the Python environment by importing the necessary packages and setting the display options.

[Top](#Table-of-Contents)

In [2]:
import pandas as pd
import numpy as np
import os, glob, logging
from summary_functions import *
from scipy import stats
from monthlit import *
from prettyprint import *


import plotly
import chart_studio.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)
# plotly.tools.set_config_file(world_readable=True)


# increase print limit
pd.options.display.max_rows = 99999
pd.options.display.max_columns = 50

### Setting File Locations

In [3]:
deviceDict = {'dataBig':{'source':'S:/Chris/TailDemography/TailDemography/weather data files'
                         ,'log':'S:/Chris/TailDemography/TailDemography/weather data files/logs'
                         ,'output':'S:/Chris/TailDemography/TailDemography/weather data files/outputFiles/'},
              'silverSurfer':{'source':'C:\\Users\\craga_eowcrpe\\Google Drive\\TailDemography\\weather data files/outputFiles'
                              ,'log':'C:\\Users\\craga_eowcrpe\\Google Drive\\TailDemography\\weather data files/logs'
                              ,'output':'C:\\Users\\craga_eowcrpe\\Google Drive\\TailDemography\\weather data files/outputFiles'}
              ,'dataPers':{'source':'C:/Users/Christopher/Google Drive/TailDemography/weather data files'
                           ,'log': 'C:\\Users\\craga_eowcrpe\\Google Drive\\TailDemography\\weather data files/logs'
                           ,'output':'C:/Users/Christopher/Google Drive/TailDemography/weather data files/outputFiles'}
             ,'gandolf':{'source':'C:/Users/craga/Google Drive/TailDemography/weather data files'
                           ,'log': 'C:/Users/craga/Google Drive/TailDemography/weather data files/logs'
                           ,'output':'C:/Users/craga/Google Drive/TailDemography/weather data files/outputFiles'}}

### Choose Device

In [4]:
device = deviceDict['gandolf']
device

{'source': 'C:/Users/craga/Google Drive/TailDemography/weather data files',
 'log': 'C:/Users/craga/Google Drive/TailDemography/weather data files/logs',
 'output': 'C:/Users/craga/Google Drive/TailDemography/weather data files/outputFiles'}

# Source Data


### Logging

In [5]:
# os.path.join inserts the path separator that plain string concatenation left out
logging.basicConfig(filename=os.path.join(device['log'], 'Descriptive Analyses.log'),
                    filemode='a',
                    format='%(funcName)s - %(levelname)s - %(message)s - %(asctime)s', level=logging.DEBUG)

## Functions

This section contains functions that were created for this notebook.

- [distribution](#distribution) (to be deleted; we will use `scipy.stats.describe` instead)
- [monthlit](#monthlit)
- [description](#description)
- [vocab_run](#vocab_run)

### distribution
[Back to Top](#Table-of-Contents)

[Back to Functions](#Functions)

*distribution* takes a series or list of numeric objects, *x*, and returns descriptive statistics of *x*, including n, minimum, maximum, median, sIQR, mean, and stdev.
    
Here are a few examples of how *distribution* works.

In [6]:
foo = [0,1,2,'r']
distribution(foo)

In [7]:
bar = [0,1,2]
distribution(bar)

Unnamed: 0,n,minimum,maximum,median,siqr,mean,stdev
0,3,0,2,1.0,0.5,1.0,1.0
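
The real *distribution* lives in `summary_functions`, whose source is not shown here. As a rough illustration only, the sketch below shows one way such a summary could be computed with pandas. The name `distribution_sketch` and the choice to raise on non-numeric input are assumptions, not the author's implementation. For `[0,1,2]` it reproduces the values in the output above.

```python
import pandas as pd


def distribution_sketch(x):
    """Hypothetical stand-in for distribution(): one-row summary of a numeric list or series."""
    # Casting to float raises ValueError for non-numeric entries such as 'r'
    s = pd.Series(x, dtype='float64')
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return pd.DataFrame([{
        'n': int(s.count()),
        'minimum': s.min(),
        'maximum': s.max(),
        'median': s.median(),
        'siqr': (q3 - q1) / 2,   # semi-interquartile range
        'mean': s.mean(),
        'stdev': s.std(),        # sample standard deviation (ddof=1)
    }])


row = distribution_sketch([0, 1, 2]).iloc[0]
print(row)
```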


[Back to Functions](#Functions)

### monthlit
[Back to Top](#Table-of-Contents)

[Back to Functions](#Functions)

Here are a few examples of how _monthlit_ works.

In [8]:
dates = pd.DataFrame(data={'dates':['2018-12-9','2019-8-5', '2017/7/4',np.nan,None]})
dates.dates = pd.to_datetime(dates.dates)
dates

Unnamed: 0,dates
0,2018-12-09
1,2019-08-05
2,2017-07-04
3,NaT
4,NaT


In [None]:
np.isnan(np.nan)

In [None]:
monthlit(dates.dates.dt.month[0])

In [None]:
dates.dates.dt.month.apply(monthlit)
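
*monthlit* is imported from its own module, so its code does not appear in this notebook. The sketch below is a guess at its behavior: it maps a month number to an abbreviation, using the same spellings as the keys of the `seasons` dictionary used later (for example 'Sept'). The name `monthlit_sketch` and the decision to return `None` for missing months are assumptions.

```python
import math

# Abbreviations chosen to match the keys of the seasons dictionary in this notebook
MONTH_ABBREVIATIONS = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
                       7: 'Jul', 8: 'Aug', 9: 'Sept', 10: 'Oct', 11: 'Nov', 12: 'Dec'}


def monthlit_sketch(month):
    """Hypothetical stand-in for monthlit(): month number -> abbreviation, None if missing."""
    # dt.month yields a float NaN where the date is NaT
    if month is None or (isinstance(month, float) and math.isnan(month)):
        return None
    return MONTH_ABBREVIATIONS[int(month)]


print([monthlit_sketch(m) for m in (12, 8.0, float('nan'))])
```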

[Back to Functions](#Functions)

### description
[Back to Top](#Table-of-Contents)

[Back to Functions](#Functions)

In [None]:
def description(x,variable,percentage=False):
    """Summarize x[variable]; when percentage is True, rescale the summary statistics by 100."""
    res = x[variable].describe()
    if percentage:
        stat_cols = ['mean','std','min','25%','50%','75%','max']
        res[stat_cols] = res[stat_cols]*100
    # semi-interquartile range
    res['siqr'] = (res['75%']-res['25%'])/2
    # TODO: add a CI calculation to this function
    res['meanCI'] = 'not calculated'
    return res
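
The function above leaves the confidence interval for the mean uncalculated. One common option is a t-based interval, sketched below. The helper name `mean_ci` and the 95% default are assumptions, and a t interval may or may not be the method the authors have in mind.

```python
import numpy as np
from scipy import stats


def mean_ci(values, confidence=0.95):
    """Hypothetical helper: t-based confidence interval for the mean, ignoring NaNs."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]
    n = arr.size
    mean = arr.mean()
    sem = stats.sem(arr)  # standard error of the mean (ddof=1)
    half_width = stats.t.ppf((1 + confidence) / 2, df=n - 1) * sem
    return mean - half_width, mean + half_width


low, high = mean_ci([1, 2, 3, 4, 5, np.nan])
print(low, high)
```

The returned tuple could be stored in `res['meanCI']` in place of the 'not calculated' placeholder.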

### vocab_run
[Back to Top](#Table-of-Contents)

[Back to Functions](#Functions)

*vocab_run* takes a list, joins its leading elements with one separator, places a different separator between the penultimate and final elements, and returns the result as a string.

- *x*: a list of strings to be concatenated
- *connector_dict*: a dictionary whose keys describe the size of the list and whose values give the connectors that separate the list elements
    
Here are a few examples of how *vocab_run* works.

In [None]:
print("Could you bring some {} please?".format(vocab_run(['foo','bar','stuffkins'])))

In [None]:
print("You can either have {}.  You'll have to make a choice."\
      .format(vocab_run(['foo','bar','stuffkins'],connector_dict={1: None, 2: ' or ', 'run': ', '})))
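
The source of *vocab_run* is not included in this notebook. The sketch below is one plausible reading of the description above, using the same `connector_dict` layout as the second example. The name `vocab_run_sketch` and the default connectors (`' and '` and `', '`) are assumptions.

```python
def vocab_run_sketch(x, connector_dict=None):
    """Hypothetical stand-in for vocab_run(): join items, with a special final connector."""
    if connector_dict is None:
        # Assumed defaults; the real function's defaults are not shown in this notebook
        connector_dict = {1: None, 2: ' and ', 'run': ', '}
    items = [str(item) for item in x]
    if not items:
        return ''
    if len(items) == 1:
        return items[0]
    # 'run' separates the leading elements; key 2 joins the last two
    head = connector_dict['run'].join(items[:-1])
    return head + connector_dict[2] + items[-1]


print(vocab_run_sketch(['foo', 'bar', 'stuffkins']))
print(vocab_run_sketch(['foo', 'bar', 'stuffkins'],
                       connector_dict={1: None, 2: ' or ', 'run': ', '}))
```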

[Back to Functions](#Functions)

## Get Data
[Top](#Table-of-Contents)

Here we set the locations from which we get data and to which we export it.

We'll list all files in the source folder whose names match the pattern `*_weather*.csv`. The file names will be saved in a variable, _mysourcefiles_.

In [None]:
os.chdir(device['source'])
mysourcefiles = glob.glob('*_weather*.csv')
mysourcefiles

In [None]:
def getweatherdata(afile,sourcename):
    tmp = pd.read_csv(afile)
    tmp['source'] = sourcename
    return tmp

Get weather data

In [None]:
df = pd.concat([getweatherdata(afile,afile.split('_')[0]) for afile in mysourcefiles]).drop(columns = 'Unnamed: 0')
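
The loading steps above depend on the working directory and on file names of the form `<source>_weather*.csv`. The sketch below shows the same idea without `os.chdir`, using two throwaway CSV files in a temporary folder. The helper name `load_weather_files`, the file names, and the toy columns are all made up for illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def load_weather_files(folder, pattern='*_weather*.csv'):
    """Hypothetical helper: read every matching CSV and tag rows with the file-name prefix."""
    frames = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        tmp = pd.read_csv(path)
        # 'portal_weather.csv' -> source 'portal'
        tmp['source'] = os.path.basename(path).split('_')[0]
        frames.append(tmp)
    return pd.concat(frames, ignore_index=True)


with tempfile.TemporaryDirectory() as folder:
    for name in ('portal', 'demo'):  # toy file prefixes
        toy = pd.DataFrame({'year': [2000, 2001], 'PRCP': [0.0, 1.5]})
        toy.to_csv(os.path.join(folder, name + '_weather.csv'), index=False)
    demo = load_weather_files(folder)

print(demo)
```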

Get population data.

In [None]:
df_pop = pd.read_csv('C:/Users/craga/Google Drive/TailDemography/outputFiles/Descriptive/population size.csv')
df_pop.head()

# Analyze Data
[Top](#Table-of-Contents)

We will first examine the range and distribution of a number of variables in our data set:


In [None]:
# Weather for a year includes weather since the last collection date of the previous calendar year 
seasons={'Dec':'winter','Jan':'winter','Feb':'winter',
         'Mar':'spring','Apr':'spring','May':'spring',
         'Jun':'summer','Jul':'summer','Aug':'summer',
         'Sept':'fall','Oct':'fall','Nov':'fall'}

Split analysis up:
- Analysis 1
    - weather in the previous 365 days relative to the first date of collection/sighting for the current calendar year
    - additional factor would be population for previous calendar year (year -1)
- Analysis 2 (Skip this for now)
    - weather in the previous 365 days relative to the first date of collection/sighting for the current calendar year
    - additional factor would be population in the current calendar year (year 0)
    - dv: population in (year 1 through year x)
- Analysis 3,4,5
    - IV
        - population in year -1
        - onset of monsoon in year 0
        - precipitation in summer
        - interaction ?
    - DV
        - population in year 1
        - age/size structure in year 1 (looking for 45mm to 65mm)
        - sex ratio in year 1
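
Analyses 1 and 2 both rely on "weather in the previous 365 days relative to the first date of collection". A minimal sketch of that window is shown below, assuming a daily weather table with a datetime column named `date`. The column name, the helper name `weather_window`, and the toy data are assumptions.

```python
import pandas as pd


def weather_window(daily, first_date, days=365):
    """Hypothetical helper: rows from the `days` days immediately before first_date."""
    first_date = pd.Timestamp(first_date)
    start = first_date - pd.Timedelta(days=days)
    # Include the start day, exclude the collection day itself
    mask = (daily['date'] >= start) & (daily['date'] < first_date)
    return daily.loc[mask]


# Toy daily record: 1.0 units of precipitation every day
daily = pd.DataFrame({'date': pd.date_range('2016-01-01', '2017-12-31', freq='D')})
daily['PRCP'] = 1.0

window = weather_window(daily, '2017-05-01')
print(len(window), window['date'].min(), window['date'].max(), window['PRCP'].sum())
```

Whether the collection day itself belongs in the window is a design choice; here it is excluded.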

In [None]:
# This could be used to generate season-level weather data (use season dates) - Chris
# This could also be used to approximate the start of the monsoon season
# Check historical data in May and June to identify in the notes when the first juveniles were spotted - (George and Chris)
## Look for correlates in the data
# Use SWRS data to identify the start of the monsoons (George to get SWRS data)
# Which other precipitation and temperature variables in the NOAA data set have been used for this? (George and Chris to check the lit)
df['month'] = df.month.apply(monthlit)
df['season'] = df.month.apply(lambda x: seasons[x])
# Select the weather columns with a list; selecting with a bare tuple fails in recent pandas
df_season = pd.DataFrame(df.groupby(['source','year','season'])[['PRCP','SNOW','TMAX','TMIN','TAVG']].describe())[1:-1]
df_season.columns = [' '.join(col).strip() for col in df_season.columns.values]
df_season = df_season.reset_index()
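
The group, describe, and flatten steps in the cell above can be hard to picture. The toy example below, with made-up values, shows how the two-level columns produced by `describe()` become single names such as 'PRCP mean'.

```python
import pandas as pd

# Made-up data: one source, two years, two observations per year
toy = pd.DataFrame({
    'source': ['a'] * 4,
    'year': [2000, 2000, 2001, 2001],
    'PRCP': [0.0, 2.0, 1.0, 3.0],
    'TMAX': [10.0, 20.0, 30.0, 40.0],
})

summary = toy.groupby(['source', 'year'])[['PRCP', 'TMAX']].describe()
# Columns are tuples such as ('PRCP', 'mean'); join them into 'PRCP mean'
summary.columns = [' '.join(col).strip() for col in summary.columns.values]
summary = summary.reset_index()

print(summary[['source', 'year', 'PRCP mean', 'TMAX max']])
```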

In [None]:
df_season['year-season'] = df_season.year.astype(str) + '-' + df_season.season
df_season

In [None]:
df_annual = pd.DataFrame(df.groupby(['source','year'])[['PRCP','SNOW','TMAX','TMIN','TAVG']].describe())[1:-1]
df_annual.columns = [' '.join(col).strip() for col in df_annual.columns.values]
df_annual = df_annual.reset_index().sort_values('year')
df_annual

## Population Size

Can we predict the change in population size using the previous year's weather?
First let's make a new data set that will allow us to visualize the potential relationship between precipitation and population size.

In [None]:
df_reg_annual = df_annual.merge(df_pop.loc[df_pop.sex=='f'].drop(columns=['propMale','sex','liznumber']),on = ['year'],how='left')
df_reg_annual.head()

In [None]:
df_reg_season = df_season.merge(df_pop.loc[df_pop.sex=='f'].drop(columns=['propMale','sex','liznumber']),on = ['year'],how='left')
df_reg_season.head()

In [None]:
#Drop paradise
df_reg_annual['popinYearless1'] = df_reg_annual.groupby('source').liznumberYear.shift(-1)
df_reg_annual['popinYearless2'] = df_reg_annual.groupby('source').liznumberYear.shift(-2)
df_reg_annual['popinYearless3'] = df_reg_annual.groupby('source').liznumberYear.shift(-3)
df_reg_annual['popinYearless4'] = df_reg_annual.groupby('source').liznumberYear.shift(-4)
df_reg_annual['popinYearless5'] = df_reg_annual.groupby('source').liznumberYear.shift(-5)
df_reg_annual

In [None]:
#Drop paradise
# shift(-k) looks k rows ahead. The seasonal table has one row per source, year, and season,
# so we group by source and season to make each step move one year ahead within a season.
for k in range(1, 6):
    df_reg_season['popinYearless{}'.format(k)] = df_reg_season.groupby(['source','season']).liznumberYear.shift(-k)
df_reg_season
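
Since the column names contain "less", it is worth checking which direction `shift` moves. The toy example below, with made-up numbers, shows that a negative shift within each group pulls the value from the following row. If rows are sorted by year, that is the next year's population.

```python
import pandas as pd

# Made-up populations for two sources
toy = pd.DataFrame({
    'source': ['a', 'a', 'a', 'b', 'b'],
    'year': [2000, 2001, 2002, 2000, 2001],
    'liznumberYear': [10, 20, 30, 5, 6],
}).sort_values(['source', 'year'])

# shift(-1) within each source: the value from the next row (the following year)
toy['next_year_pop'] = toy.groupby('source').liznumberYear.shift(-1)
print(toy)
```

The last year of each source has no following row, so it receives NaN.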

## Correlations

In [None]:
def candidate(m,dv,placement=(1,1)):
    assert(dv in m.columns)
    return m[dv].sort_values().reset_index().iloc[placement[0]:placement[1]+1,:]

In [None]:
from functools import reduce

In [None]:
def topcorr(corrdf,lowestrank,dvs):
    candidates = [candidate(corrdf,dv,(1,lowestrank)) for dv in dvs]
    merger =  reduce(lambda x, y: pd.merge(x, y, on = 'index', how = 'outer'), candidates).fillna('--')
    return merger
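
*candidate* sorts correlations in ascending order, so the rows it returns depend on the sign of the correlations. As an alternative for comparison, and not the method used in this notebook, the sketch below ranks variables by the absolute value of their correlation with a dependent variable and leaves out the variable itself. The helper name `strongest_correlates` and the toy data are assumptions.

```python
import pandas as pd


def strongest_correlates(corrdf, dv, n=3):
    """Hypothetical helper: the n variables with the largest |r| against dv, excluding dv."""
    ranked = corrdf[dv].drop(labels=dv).abs().sort_values(ascending=False)
    # Return the signed correlations in order of strength
    return corrdf.loc[ranked.index[:n], dv]


toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 4, 6, 9],     # strongly positive with a
    'c': [4, 3, 2, 1],     # perfectly negative with a
    'd': [1, -1, 1, -1],   # weakly related to a
})
top = strongest_correlates(toy.corr(), 'a', n=2)
print(top)
```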

In [None]:
#Dropping proportion of Females, but will put it back once I can order the y-axis
corrPortal_annual = df_reg_annual.loc[(df_reg_annual.source=='portal')]\
.drop(columns=['PRCP count', 'SNOW count', 'TMAX count', 'TMIN count', 'TAVG count','propFemale',
              'SNOW min', 'SNOW 25%', 'SNOW 50%',]).corr()
testx = corrPortal_annual.columns
testy = corrPortal_annual.index
testz = corrPortal_annual.values
test = go.Figure(go.Heatmap(x=testx,y=testy,z=testz))
plot(test, filename = 'portal annual correlation matrix.html')
iplot(test, filename = 'portal annual correlation matrix.html')

In [None]:
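# mydvs must exist before its first use here; the same list is defined again further down
mydvs = ['liznumberYear', 'popinYearless1', 'popinYearless2', 'popinYearless3', 'popinYearless4', 'popinYearless5']
annual = topcorr(corrPortal_annual,3,mydvs)
annual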

## To-Do
- Run MV correlation
    - IV should be pop at year 0
    - DV should be pop at year 1- year X plus abiotic factors
    - Which abiotic factors?

## Season

### Spring

In [None]:
corrPortal_spring = df_reg_season.loc[(df_reg_season.source=='portal')&(df_reg_season.season.isin(['spring']))]\
.drop(columns=['PRCP count', 'SNOW count', 'TMAX count', 'TMIN count', 'TAVG count','propFemale',
              'SNOW min', 'SNOW 25%', 'SNOW 50%',]).corr()
testx = corrPortal_spring.columns
testy = corrPortal_spring.index
testz = corrPortal_spring.values
test = go.Figure(go.Heatmap(x=testx,y=testy,z=testz))
plot(test, filename = 'portal spring correlation matrix.html')
iplot(test, filename = 'portal spring correlation matrix.html')

In [None]:
spring= topcorr(corrPortal_spring,3,mydvs) 
spring
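
The spring, summer, fall, and winter cells in this section repeat the same filter, drop, and correlate steps. A helper along the lines below could factor that out. The name `season_corr` is hypothetical and the toy data are made up. Restricting to numeric columns before calling `corr()` is an added safeguard, because recent pandas versions raise an error when text columns such as `source` are present.

```python
import pandas as pd


def season_corr(df, source, season, drop_cols=()):
    """Hypothetical helper: correlation matrix for one source and one season."""
    rows = df.loc[(df['source'] == source) & (df['season'] == season)]
    numeric = (rows.drop(columns=list(drop_cols), errors='ignore')
                   .select_dtypes(include='number'))
    return numeric.corr()


toy = pd.DataFrame({
    'source': ['portal'] * 6,
    'season': ['spring', 'spring', 'spring', 'fall', 'fall', 'fall'],
    'PRCP mean': [1.0, 2.0, 3.0, 0.5, 0.7, 0.9],
    'PRCP count': [90, 91, 92, 90, 91, 91],
    'liznumberYear': [10, 14, 18, 7, 9, 8],
})
spring_corr = season_corr(toy, 'portal', 'spring', drop_cols=['PRCP count'])
print(spring_corr)
```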

### Summer

In [None]:
corrPortal_summer = df_reg_season.loc[(df_reg_season.source=='portal')&(df_reg_season.season.isin(['summer']))]\
.drop(columns=['PRCP count', 'SNOW count', 'TMAX count', 'TMIN count', 'TAVG count','propFemale',
              'SNOW min', 'SNOW 25%', 'SNOW 50%',]).corr()
testx = corrPortal_summer.columns
testy = corrPortal_summer.index
testz = corrPortal_summer.values
test = go.Figure(go.Heatmap(x=testx,y=testy,z=testz))
plot(test, filename = 'portal summer correlation matrix.html')
iplot(test, filename = 'portal summer correlation matrix.html')

In [None]:
summer= topcorr(corrPortal_summer,3,mydvs) 
summer

### Fall

In [None]:
corrPortal_fall = df_reg_season.loc[(df_reg_season.source=='portal')&(df_reg_season.season.isin(['fall']))]\
.drop(columns=['PRCP count', 'SNOW count', 'TMAX count', 'TMIN count', 'TAVG count','propFemale',
              'SNOW min', 'SNOW 25%', 'SNOW 50%',]).corr()
testx = corrPortal_fall.columns
testy = corrPortal_fall.index
testz = corrPortal_fall.values
test = go.Figure(go.Heatmap(x=testx,y=testy,z=testz))
plot(test, filename = 'portal fall correlation matrix.html')
iplot(test, filename = 'portal fall correlation matrix.html')

In [None]:
fall= topcorr(corrPortal_fall,3,mydvs) 
fall

In [None]:
fall['index'].tolist()

In [None]:
popvars = ['popinYearless1','popinYearless2','popinYearless3','popinYearless4']
weathvars = fall['index'].tolist()
fall_rows = df_reg_season.loc[(df_reg_season.source=='portal')&(df_reg_season.season.isin(['fall']))]
for popvar in popvars:
    for weathvar in weathvars:
        # Drop rows where either value is missing so the two series stay paired;
        # calling dropna() on each series separately can leave them with different lengths
        pair = pd.concat([fall_rows[popvar], fall_rows[weathvar]], axis=1, keys=['pop','weather']).dropna()
        r,p = stats.pearsonr(pair['pop'], pair['weather'])
        print('{} vs {}: r={}; p={}'.format(popvar,weathvar,r,p))

### Winter

In [None]:
corrPortal_winter = df_reg_season.loc[(df_reg_season.source=='portal')&(df_reg_season.season.isin(['winter']))]\
.drop(columns=['PRCP count', 'SNOW count', 'TMAX count', 'TMIN count', 'TAVG count','propFemale',
              'SNOW min', 'SNOW 25%', 'SNOW 50%',]).corr()
testx = corrPortal_winter.columns
testy = corrPortal_winter.index
testz = corrPortal_winter.values
test = go.Figure(go.Heatmap(x=testx,y=testy,z=testz))
plot(test, filename = 'portal winter correlation matrix.html')
iplot(test, filename = 'portal winter correlation matrix.html')

In [None]:
winter= topcorr(corrPortal_winter,3,mydvs) 
winter

In [None]:
popvars = ['popinYearless1','popinYearless2','popinYearless3','popinYearless4']
winter_rows = df_reg_season.loc[(df_reg_season.source=='portal')&(df_reg_season.season.isin(['winter']))]
# precipitation metrics first, then the minimum-temperature metric
for weathvars in (['PRCP mean','PRCP 75%','PRCP max'], ['TMIN 75%']):
    for popvar in popvars:
        for weathvar in weathvars:
            # drop rows where either value is missing; the shifted population columns contain NaN
            pair = pd.concat([winter_rows[popvar], winter_rows[weathvar]], axis=1, keys=['pop','weather']).dropna()
            r,p = stats.pearsonr(pair['pop'], pair['weather'])
            print('{} vs {}: r={}; p={}'.format(popvar,weathvar,r,p))
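
The shifted population columns end in missing values, and `stats.pearsonr` needs two series of equal length without NaN. The sketch below shows pairwise deletion on made-up data: only rows where both variables are present are kept. The helper name `paired_pearson` and the toy columns are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats


def paired_pearson(df, x, y):
    """Hypothetical helper: Pearson r on rows where both x and y are present."""
    pair = df[[x, y]].dropna()
    r, p = stats.pearsonr(pair[x], pair[y])
    return r, p, len(pair)


toy = pd.DataFrame({
    'pop': [10, 12, np.nan, 15, 18],
    'prcp': [1.0, 1.5, 2.0, np.nan, 3.0],
})
r, p, n = paired_pearson(toy, 'pop', 'prcp')
print('r={}; p={}; n={}'.format(r, p, n))
```

Reporting `n` alongside `r` and `p` makes it clear how many year pairs each correlation rests on.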

Unlike the other seasons, winter precipitation in year 0 is correlated with population size in subsequent years.

# Resume Here

[Back to TOC](#Table-of-Contents)

Need to model this with regression:
- two predictors: pop in year, weather in year
- DV: pop in year 2

Train a model using captured juveniles to predict age class; include weather.
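
As a starting point for the regression described above, the sketch below fits an ordinary least squares model with two predictors using NumPy. The function name, the variable names, and the synthetic numbers are assumptions for illustration only. No real data or results are implied.

```python
import numpy as np


def fit_two_predictor_ols(pop_now, weather, pop_later):
    """Hypothetical helper: OLS fit of pop_later ~ intercept + pop_now + weather."""
    X = np.column_stack([np.ones(len(pop_now)), pop_now, weather])
    coef, *_ = np.linalg.lstsq(X, np.asarray(pop_later, dtype=float), rcond=None)
    return dict(zip(['intercept', 'pop_now', 'weather'], coef))


# Synthetic data generated from known coefficients (2, 0.5, 3), with no noise
pop_now = np.array([10.0, 12.0, 15.0, 11.0, 18.0])
weather = np.array([1.0, 0.5, 2.0, 1.5, 0.2])
pop_later = 2 + 0.5 * pop_now + 3 * weather

coefs = fit_two_predictor_ols(pop_now, weather, pop_later)
print(coefs)
```

A statistics package would add standard errors and p-values; this sketch only recovers the coefficients.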

In [None]:
import numpy as np
import pingouin as pg
from scipy import stats

In [None]:
mydvs = ['liznumberYear', 'popinYearless1', 'popinYearless2', 'popinYearless3', 'popinYearless4', 'popinYearless5']

In [None]:
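# Assumption: df_reg was never defined in this notebook; the portal rows of df_reg_annual are used here
df_reg = df_reg_annual.loc[df_reg_annual.source=='portal']
weathermetrics = ['']  # TODO: list the weather metrics to test (left blank in the original)
dv = 'popinYearless1'
for var in weathermetrics:
    if not var:
        continue  # skip the empty placeholder
    pair = df_reg[['liznumberYear', var]].dropna()
    r,p = stats.pearsonr(pair['liznumberYear'], pair[var])
    print('{}: r={}; p={}'.format(var,r,p))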

In [29]:
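var = 'TMIN mean'
# Assumption: df_reg was never defined (this cell originally raised "NameError: name 'df_reg' is not defined");
# the portal rows of df_reg_annual are used instead
pair = df_reg_annual.loc[df_reg_annual.source=='portal', ['liznumberYear', var]].dropna()
slope, intercept, r_value, p_value, std_err = stats.linregress(pair['liznumberYear'], pair[var])
print("slope: {}    intercept: {}".format(slope, intercept))
# slope: 1.944864    intercept: 0.268578
print("R-squared: {}".format(r_value**2))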

In [None]:
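# NOTE: the data frame passed here must contain both the dv column and the between column;
# the weather table df has neither 'liznumberYear' nor 'group', so these are still placeholders
aov = pg.anova(data=df, dv='liznumberYear', between='group', detailed=True)
print(aov)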

## Growth

## Sex Ratio