Example Workflow with Sina and Pandas
=================================

This is a sample Jupyter notebook that introduces users to a workflow using Sina with pandas. It uses the Fukushima data set found in the Sina examples folder. The get_pd() function itself is not tied to any particular data set, and lends itself well to a variety of data needs.

Note: The typical Sina dependencies do not include all of the libraries required to run this notebook. If you typically use the LC Sina virtual environment, follow either the standard or manual set up instructions from the Readme in the sina/python/ folder. Once your virtual environment is set up, you can add the additional libraries by using pip.

#### Additional Required Packages:

pandas

scikit-learn (imported as sklearn)

#### Example:
pip install pandas scikit-learn


## Connect to Sina
Connect to Sina as you usually do. Consult Sina documentation for more details.

In [None]:
from sina.datastore import create_datastore
import sina.utils

import matplotlib.pyplot as plt

# Initialization

# Access the data
database = sina.utils.get_example_path('fukushima/data.sqlite')
print('Using database {}'.format(database))
ds = create_datastore(database)


# Make a phantom call to plt.show() to work around a known Jupyter issue with displaying graphs
plt.show()

print("Connection to database made. Ready to proceed")

## Declare Pandas Conversion Function

This function is the main interface between pandas and Sina. Feel free to copy or modify it for use within your own workflow. Note that this function drops units without converting any values.

In [None]:
import pandas as pd

# makes printing a little bit prettier, shows fewer rows
pd.options.display.max_rows = 10


def get_pd(ds, ids, fields=None):
    '''
    Get a pandas dataframe for the given IDs.

    Uses the records interface of the given datastore to interact with Sina.
    ...

    :param ds: the datastore the records are coming from
    :param ids: the list of IDs of the records to include in the dataframe
    :param fields: list of data elements, i.e. column names for the dataframe

    :raises Exception: Error with list of ids

    :returns: dataframe with rows corresponding to IDs and columns corresponding to fields
    '''

    # ensure that there is a list of IDs
    try:
        ids = list(ids)

    except Exception:
        raise Exception('Something went wrong with IDs')

    # if not specified, get all data field names
    if not fields:
        fields = ds.records.get(ids[0]).data.keys()
        
        
    # get the full record objects for all ids
    records = ds.records.get(ids)
    
    # turn list of records into a list of lists containing the data values
    recs = []
    for record in records:
        entry = []
        for field in fields:
            entry.append(record.data[field]['value'])
        recs.append(entry)
         
    return pd.DataFrame(data=recs, columns=fields)

print('Pandas Function Declared')
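
To see what the conversion does without a database, the sketch below runs the same flattening logic on plain dictionaries. The `fake_records` list is made-up data shaped like the `data` attribute that the function reads (each field maps to a dict with a `'value'` and, optionally, `'units'`). It stands in for real Sina records and is not part of the Sina API.

```python
import pandas as pd

# Made-up stand-ins for record.data; the field names mirror the Fukushima set,
# but the numbers and units are illustrative only.
fake_records = [
    {'alt_hae': {'value': 250.0, 'units': 'm'}, 'gcnorm': {'value': 1200.5}},
    {'alt_hae': {'value': 310.0, 'units': 'm'}, 'gcnorm': {'value': 980.0}},
]

fields = list(fake_records[0].keys())

# Same idea as get_pd: keep only the 'value' entry of each field, so units are dropped.
rows = [[data[field]['value'] for field in fields] for data in fake_records]
flat_df = pd.DataFrame(data=rows, columns=fields)

print(flat_df)
```

If units matter for your analysis, one option is to record them separately before flattening, since nothing in the resulting dataframe preserves them.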


## Testing our Pandas Conversion Function
After running this cell, you will find that all records of type 'obs' have been loaded into a data frame.

In [None]:
# get ids for all observations
ids = ds.records.find_with_type("obs", ids_only=True)

# convert to pandas data frame
df = get_pd(ds,ids)

print(df)

## Demonstrating a query before pandas conversion 

### Selecting Specific Records
You can use Sina to query for records whose data matches a given value, then turn all of the resulting records into a dataframe. The cell below builds a dataframe with all records where the date is 4/18/2011.

In [None]:
# run query and get ids
ids = ds.records.find_with_data(date='4/18/2011')

# convert to pandas data frame
df = get_pd(ds,ids)

# print and review
print(df)

### Filtering by Data Values
You can use Sina to query for data that falls within certain values, then turn all of the resulting records into a dataframe. In this example, we will use the fields option so that our data frame only contains the data fields alt_hae (altitude), longitude, latitude and gcnorm. The cell below builds a dataframe with all records where alt_hae is between 250 and 300. This uses one of Sina's special query helpers, DataRange; see the query documentation for more details.

In [None]:
# required for DataRange function
from sina.utils import DataRange

# run query and get ids
ids = ds.records.find_with_data(alt_hae=DataRange(250,300))

# specify which fields you want
fields = ['alt_hae', 'gcnorm', 'latitude', 'longitude']

# convert to pandas data frame
df = get_pd(ds,ids, fields)

# print and review
print(df)
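
Once records are already in a dataframe, the same kind of range filter can be applied on the pandas side. The sketch below uses made-up values and a half-open interval (lower bound included, upper bound excluded). Whether that matches DataRange's default endpoint handling is an assumption to check against the Sina query documentation.

```python
import pandas as pd

# Made-up altitude and gcnorm values, for illustration only.
sample_df = pd.DataFrame({
    'alt_hae': [240.0, 250.0, 275.5, 300.0, 320.0],
    'gcnorm': [900.0, 950.0, 1010.0, 1100.0, 870.0],
})

# Boolean mask: True where 250 <= alt_hae < 300.
in_range = (sample_df['alt_hae'] >= 250) & (sample_df['alt_hae'] < 300)
range_df = sample_df[in_range]

print(range_df)
```

Filtering in the Sina query, as in the cell above, is likely the better choice when the unfiltered dataframe would be large, since fewer records need to be loaded.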

## Accessing Data with Pandas
This is a quick demonstration of how to access your data once it is in a data frame. In general, you access columns of data rather than individual records.

### Getting a Single Column
A single column from a dataframe is a pandas Series. Note that the printed output has no column header; the column name appears in the footer instead.

In [None]:
import random 

# get ids for all observations
ids = list(ds.records.find_with_type("obs", ids_only=True))

# we will use a random subset of records, no need to load them all for these examples
k = 1000
ids = random.sample(ids, k)

# convert to pandas data frame
df = get_pd(ds,ids)

# getting a single column
altitude = df['alt_hae']
print(altitude)
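
The difference between a Series and a one-column dataframe is easy to check directly. The sketch below uses a small made-up dataframe (the values are illustrative only) to show what each indexing style returns.

```python
import pandas as pd

# Made-up values, for illustration only.
demo_df = pd.DataFrame({'alt_hae': [250.0, 310.0], 'gcnorm': [1200.5, 980.0]})

single = demo_df['alt_hae']      # single brackets: a Series
as_frame = demo_df[['alt_hae']]  # list of one label: a one-column DataFrame

# The Series keeps the column label in its .name attribute.
print(type(single).__name__, single.name)
print(type(as_frame).__name__, list(as_frame.columns))
```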

### Getting a Subset of Columns

In [None]:
# getting a subset of columns
cols = ['latitude', 'longitude']
coordinates = df[cols]
print(coordinates)

### Modifying an Existing Column
You can shift all values in an existing column by some constant using this syntax. See the pandas documentation for more details.

In [None]:
# modifying an existing column
new_sea_level=5
df['alt_hae'] = df['alt_hae'] - new_sea_level
print(df['alt_hae'])

### Getting a Subset of Records
You can filter down records based on their values for specific columns. See Pandas documentation for more details. 

In [None]:
# getting a subset of records
new_df = df[df['date']=='4/5/2011']
print(new_df)
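
Filters can also be combined. The following sketch, on a made-up dataframe, shows two common patterns: joining conditions with `&` and matching several values with `isin()`.

```python
import pandas as pd

# Made-up values, for illustration only.
demo_df = pd.DataFrame({
    'date': ['4/5/2011', '4/5/2011', '4/18/2011', '5/9/2011'],
    'alt_hae': [250.0, 410.0, 275.0, 300.0],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
low_on_first_day = demo_df[(demo_df['date'] == '4/5/2011') & (demo_df['alt_hae'] < 300)]

# isin() keeps rows whose value matches any entry in the list.
two_days = demo_df[demo_df['date'].isin(['4/5/2011', '5/9/2011'])]

print(low_on_first_day)
print(two_days)
```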

### Getting a Random Sample of Records
Example for getting random samples

In [None]:
# get a random subset of records, in this case 5
random_df = df.sample(n=5)
print(random_df)
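
If you need the same sample every time the notebook runs, `DataFrame.sample` accepts a `random_state` seed, and `frac` can be used in place of `n` to sample a proportion of rows. The sketch below uses a made-up dataframe; the seed value is arbitrary.

```python
import pandas as pd

# Made-up values, for illustration only.
demo_df = pd.DataFrame({'gcnorm': [float(v) for v in range(100)]})

# The same seed gives the same rows on every run.
first = demo_df.sample(n=5, random_state=42)
second = demo_df.sample(n=5, random_state=42)

# frac samples a proportion of the rows instead of a fixed count.
tenth = demo_df.sample(frac=0.1, random_state=42)

print(first.equals(second))
print(len(tenth))
```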

### Get Max Values
Example of getting the largest values from a pandas dataframe

In [None]:
# get k largest gcnorm values
k = 5
klarge = df.nlargest(k, 'gcnorm')
print("Here are the {} largest values for gcnorm:\n".format(k))
print(klarge)

### Get Min Values
Example of getting the smallest values from a pandas dataframe

In [None]:
# get k smallest altitude values
k = 5
ksmall = df.nsmallest(k, 'alt_hae')
print("Here are the {} smallest values for altitude:\n".format(k))
print(ksmall)
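
For intuition, `nlargest` and `nsmallest` select the same rows you would get by sorting on the column and taking the first k. The sketch below checks that on a made-up column with distinct values.

```python
import pandas as pd

# Made-up values, for illustration only.
demo_df = pd.DataFrame({'gcnorm': [5.0, 1.0, 9.0, 3.0, 7.0]})

top2 = demo_df.nlargest(2, 'gcnorm')
top2_sorted = demo_df.sort_values('gcnorm', ascending=False).head(2)

bottom2 = demo_df.nsmallest(2, 'gcnorm')

print(top2)
print(bottom2)
```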

## Data Manipulation and Plotting
### Finding Distance
Here we will use longitude and latitude coordinates to find the distance from the reactor, then plot the gcnorm against that distance. The haversine function below is used to demonstrate vectorized operations with pandas. In general, avoid using a for loop over rows to modify or create data; operate on whole columns instead. For more information, review the pandas documentation.

In [None]:
import numpy as np

# note that we use the numpy library, this allows us to vectorize our code. 
def haversine(lat1, lon1, lat2, lon2):
    '''
    Get distance (km) between two points on the surface of a sphere (Earth).
    ...
    
    :param lat1: the latitude value of the first point
    :param lon1: the longitude value of the first point
    :param lat2: the latitude value of the second point
    :param lon2: the longitude value of the second point
    
    :returns: distance (km) between the two points
    '''
    
    Radius_Earth_KM = 6371
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    total_km = Radius_Earth_KM * c
    return total_km

# get ids for all observations
ids = list(ds.records.find_with_type("obs", ids_only=True))

# convert to pandas data frame
df = get_pd(ds,ids)

# making a new column using existing columns.
# reactor is [latitude, longitude]; haversine expects (lat1, lon1, lat2, lon2)
reactor = [37.4227, 141.0327]
df['distance'] = haversine(df['latitude'], df['longitude'], reactor[0], reactor[1])


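# Now that we have the distance, we want to plot by date.
# Sort by distance so the line is drawn left to right instead of zigzagging.
dates = df['date'].unique()
for date in dates:
    plot_df = df[df['date']==date].sort_values('distance')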
    x = plot_df['distance']
    y = plot_df['gcnorm']
    fig = plt.figure()
    ax = plt.axes()
    ax.plot(x,y)
    ax.set_xlabel("Distance from reactor (km)")
    ax.set_ylabel("Normalized Gross Counts per Second")
    ax.set_title('GCNorm over Distance for date {}'.format(date))
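
To make the vectorization point concrete, the sketch below compares a NumPy version of the haversine formula, applied to whole columns at once, with a scalar version called once per row. The function names `haversine_np` and `haversine_scalar` and the coordinates are made up for this illustration. Both versions take arguments in the order latitude, longitude.

```python
import math

import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371


def haversine_np(lat1, lon1, lat2, lon2):
    """Vectorized great-circle distance in km; accepts scalars, arrays, or Series."""
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return EARTH_RADIUS_KM * 2 * np.arcsin(np.sqrt(a))


def haversine_scalar(lat1, lon1, lat2, lon2):
    """Same formula with the math module; handles one pair of points per call."""
    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return EARTH_RADIUS_KM * 2 * math.asin(math.sqrt(a))


# Made-up coordinates near the reactor, for illustration only.
points = pd.DataFrame({'latitude': [37.50, 37.60, 37.80],
                       'longitude': [141.00, 140.90, 140.50]})
reactor_lat, reactor_lon = 37.4227, 141.0327

# Vectorized: one call operates on whole columns.
vectorized = haversine_np(points['latitude'], points['longitude'], reactor_lat, reactor_lon)

# Loop: one call per row, shown only for comparison.
looped = [haversine_scalar(lat, lon, reactor_lat, reactor_lon)
          for lat, lon in zip(points['latitude'], points['longitude'])]

print(np.allclose(vectorized, looped))
```

Both approaches give the same distances. The vectorized form avoids a Python-level call per row, which usually matters once the dataframe holds many records.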


### 3D Plots with Pandas
Here, we will demonstrate how to produce a 3D plot using data from a DataFrame and matplotlib. Note that this plot interpolates the surface by creating triangles with adjacent points.

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

# get ids for a single date
ids = ds.records.find_with_data(date='4/18/2011')


# convert to pandas data frame
df = get_pd(ds,ids)

fig = plt.figure()
ax = plt.axes(projection='3d')


x = df['longitude']
y = df['latitude']
z = df['gcnorm']
_ = ax.plot_trisurf(x, y, z, cmap='inferno', edgecolor='none')
_ = ax.set_title('Heat Map by Latitude and longitude')

### Combining Concepts
By combining concepts from the last two cells, we can create a heat map that includes both distance from the reactor and altitude. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# get ids for all observations
ids = list(ds.records.find_with_type("obs", ids_only=True))

# we will use a random subset of records, no need to load them all for these examples
k = 15000
ids = random.sample(ids, k)

# convert to pandas data frame
df = get_pd(ds,ids)

# making a new column using existing columns.
# reactor is [latitude, longitude]
reactor = [37.4227, 141.0327]

# haversine function declared above; it expects (lat1, lon1, lat2, lon2)
df['distance'] = haversine(df['latitude'], df['longitude'], reactor[0], reactor[1])


# Now that we have the distance, we want to plot by date.
dates = df['date'].unique()
for date in dates:
    plot_df = df[df['date']==date]
    x = plot_df['distance']
    z = plot_df['gcnorm']
    y = plot_df['alt_hae']
    
    fig = plt.figure()
    ax = plt.axes(projection='3d')
    
    _ = ax.plot_trisurf(x, y, z, cmap='inferno', edgecolor='none')
    _ = ax.set_xlabel("Distance from reactor (km)")
    _ = ax.set_ylabel("Altitude HAE")
    _ = ax.set_zlabel("Normalized Gross Counts per Second")
    _ = ax.set_title('GCNorm over Distance and Altitude for date {}'.format(date))


## Example with Outlier Detection and Removal
A basic example of outlier detection and removal using the zscore method. 

In [None]:
from scipy.stats import zscore
import numpy as np

def remove_outliers(features, target):
    '''
    Uses Z-score to identify outliers.
    ...
    :param features: DataFrame with only numeric values
    :param target: DataFrame with remaining columns
    
    :returns: Dataframe without outliers
    '''
    
    z = np.abs(zscore(features))
    
    df = pd.concat([features, target], axis = 1)
    
    return df[(z <3).all(axis=1)]


# get ids for all observations from a single date
ids = list(ds.records.find_with_data(date='4/5/2011'))

# convert to pandas data frame
df = get_pd(ds,ids)
# only select numeric types as your features
features = df.select_dtypes(exclude='object')

# get non-numeric types
targets = df.select_dtypes(include='object')

new_df = remove_outliers(features, targets)

# plot results; may be more useful with other data sets
fig = plt.figure()
ax = plt.axes(projection='3d')

x = df['longitude']
y = df['latitude']
z = df['gcnorm']
_ = ax.plot_trisurf(x, y, z, cmap='inferno', edgecolor='none')
_ = ax.set_title('Heat Map by Latitude and longitude (Original)')

fig = plt.figure()
ax = plt.axes(projection='3d')

x = new_df['longitude']
y = new_df['latitude']
z = new_df['gcnorm']
_ = ax.plot_trisurf(x, y, z, cmap='inferno', edgecolor='none')
_ = ax.set_title('Heat Map by Latitude and longitude (Outliers Removed)')
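
The z-score filter can also be written with pandas alone, which makes the calculation easier to inspect. The sketch below uses made-up readings and the population standard deviation (`ddof=0`), which is the default used by `scipy.stats.zscore`. The threshold of 3 follows the cell above.

```python
import pandas as pd

# Made-up readings: 19 ordinary values and one extreme value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
            10, 9, 11, 10, 12, 9, 10, 11, 10, 100]
demo_df = pd.DataFrame({'gcnorm': [float(v) for v in readings]})

# z-score per column: distance from the column mean in units of standard deviation.
z = (demo_df - demo_df.mean()) / demo_df.std(ddof=0)

# Keep rows where every column is within 3 standard deviations of its mean.
kept = demo_df[(z.abs() < 3).all(axis=1)]

print(len(demo_df), len(kept))
```

The example uses 20 rows on purpose. With the population standard deviation, a single value among n points cannot have a z-score larger than sqrt(n - 1), so a threshold of 3 can never flag anything in a sample of 10 or fewer rows.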


## Example with Linear Regression
This example shows one use of linear regression. Here, we do some feature engineering, train a model, and view the results. The goal for the model is to predict gcnorm based on distance from the reactor, altitude, and days since the event. This model does not perform particularly well, but it does serve as an example workflow with pandas and sklearn.
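
One of the engineered features is the number of days since the event. The cell below hard-codes that mapping in a dictionary; as an alternative, the sketch here derives it from the date strings. The reference date of March 12, 2011 is an assumption, chosen because it reproduces the dictionary's values (24, 37, 58).

```python
import pandas as pd

# The three flight dates that appear in the examples above.
demo_df = pd.DataFrame({'date': ['4/5/2011', '4/18/2011', '5/9/2011']})

# Assumed reference date; adjust to whatever "day zero" means for your analysis.
reference = pd.Timestamp('2011-03-12')

# Parse month/day/year strings, subtract the reference, and keep whole days.
parsed = pd.to_datetime(demo_df['date'], format='%m/%d/%Y')
demo_df['days_since'] = (parsed - reference).dt.days

print(demo_df)
```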

In [None]:
from sklearn.linear_model import LinearRegression

# load data from sina

# get ids for all observations
ids = list(ds.records.find_with_type("obs", ids_only=True))

# we will use a random subset of records, no need to load them all for this example
k = 5000
ids = random.sample(ids, k)

# convert to pandas data frame
df = get_pd(ds,ids)

# feature engineering

# use haversine function declared in above cells to get distance measure
# reactor is [latitude, longitude]; haversine expects (lat1, lon1, lat2, lon2)
reactor = [37.4227, 141.0327]
df['distance'] = haversine(df['latitude'], df['longitude'], reactor[0], reactor[1])

# calculate days since reactor disaster
# use a dictionary to map date strings to ints, using the Series.map() method
dates = {'4/5/2011':24, '4/18/2011':37, '5/9/2011':58}
df['days_since'] = df['date'].map(dates)

# train the model

# our 3 features are altitude, distance, days
feature_labels = ['alt_hae', 'distance', 'days_since']

# get features vector, and concatenate x, x^2, x^3
# each power is taken from the original features, so no unintended terms (e.g. x^6) appear
base = df[feature_labels].values
X = np.concatenate([base**i for i in range(1,4)], axis=1)
    
# get labels    
Y = df['gcnorm'].values


# fit the regression
reg = LinearRegression().fit(X, Y)


# view the results 
zprime = reg.predict(X)

y = df['days_since']
x = df['distance']
z = df['gcnorm']

fig = plt.figure()
ax = plt.axes(projection='3d')
_ = ax.plot_trisurf(x, y, z, cmap='inferno', edgecolor='none')
_ = ax.set_title('Recorded Heat Map')
_ = ax.set_ylabel('Days since Event')
_ = ax.set_xlabel('Distance from reactor')
_ = ax.set_zlabel('GCNorm')

print('\n\n')

fig = plt.figure()
ax = plt.axes(projection='3d')
_ = ax.plot_trisurf(x, y, zprime, cmap='inferno', edgecolor='none')
_ = ax.set_title('Predicted Heat Map')
_ = ax.set_ylabel('Days since Event')
_ = ax.set_xlabel('Distance from reactor')
_ = ax.set_zlabel('GCNorm')
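
The polynomial feature step is easy to get subtly wrong, so here is a small, self-contained check of the idea on made-up data. It stacks x, x^2 and x^3 column-wise and fits them with `np.linalg.lstsq`, used here as a stand-in for `LinearRegression` so that the sketch needs only NumPy. The data and coefficients are invented for illustration.

```python
import numpy as np

# Made-up single feature and a target that is an exact cubic in x.
x = np.linspace(-2.0, 2.0, 21).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + 0.25 * x[:, 0] ** 3

# Build [x, x^2, x^3] from the original column each time.
# Appending to X inside a loop and then raising X to a power would also raise
# the columns added earlier, producing unintended terms such as x^6.
X = np.concatenate([x ** i for i in range(1, 4)], axis=1)

# Add a column of ones for the intercept and solve the least-squares problem.
design = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)

print(np.round(coeffs, 3))
```

Because the target here is an exact cubic, the fit recovers the coefficients used to generate it. Real data such as gcnorm will not be this clean, which is consistent with the modest model quality noted above.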


In [None]:
# close the datastore opened at the top of the notebook
ds.close()