[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SmilodonCub/DS4VS/blob/master/Week6/DS4VS_week6_EDA.ipynb)

<br>

# Week 6: EDA

<br>

### A Brief Recap:

* Hello, how are you?
* Some changes to the Syllabus: PsychoPy class will become a second day for Visualization
* Today:
    * look over some code to import example 'real world' data files
    * an example of using a function to iterate over data files
    * In the next notebook: data missingness & data cleaning

## EDA

Today we're all about **Exploratory Data Analysis**  

We'll start by looking at **data wrangling** - ways to import and give structure to our data files


### Environment Dependencies:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import seaborn as sns

### Example Files

* EEG.txt
* ERG.txt
* pupil.txt
* a .mat mystery


### EEG.txt

In [None]:
EEG_url = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/Week6/EEG.txt'
#EEG_df = pd.read_csv( EEG_url )
#EEG_df = pd.read_csv( EEG_url, encoding= 'unicode_escape' )
#EEG_df = pd.read_csv( EEG_url, sep = '\t', encoding= 'unicode_escape' )
EEG_df = pd.read_csv( EEG_url, skiprows = range( 0, 38 ), sep = '\t', encoding= 'unicode_escape'  )
EEG_df.head()

In [None]:
EEG_df.tail()

#### Further formatting necessary

`usecols` specifies which tab-delimited ('\t') columns of the .txt file to read

In [None]:
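# keep the same encoding as the first import so the header bytes still decode
EEG_df = pd.read_csv( EEG_url, skiprows = range( 0, 38 ), sep = '\t', usecols = range( 6, 69 ), encoding= 'unicode_escape' )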
EEG_df.head()
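
To see what `skiprows` and `usecols` do without downloading anything, here is a minimal sketch on a made-up, in-memory file. The preamble lines, column names, and values are assumptions for illustration only and are not taken from EEG.txt.

```python
import io
import pandas as pd

# Hypothetical stand-in for a recording export: two preamble lines,
# then a tab-delimited table whose first column we do not need.
raw_text = (
    "Recorder v1.0\n"
    "Subject: demo\n"
    "index\tTime (ms)\tTrial (nV)\n"
    "0\t0.0\t12\n"
    "1\t0.5\t15\n"
    "2\t1.0\t11\n"
)

demo_df = pd.read_csv(
    io.StringIO(raw_text),
    sep="\t",
    skiprows=range(0, 2),  # skip the two preamble lines
    usecols=range(1, 3),   # keep only the 2nd and 3rd columns
)
print(demo_df)
```

The same two arguments are doing the work in the EEG import above, just with a longer preamble and more columns.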

In [None]:
EEG_df.info()

#### Quick & dirty visualization

Just take a quick peek at one of the channels...  
...Is this what we expect to see?

In [None]:
EEG_df.plot(  x = 'Time (ms)', y = 'Trial (nV).10' ) 
plt.show()

### ERG.txt

In [None]:
ERG_url = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/Week6/ERG.txt'
ERG_df = pd.read_csv( ERG_url, sep = '\t', 
                     usecols = range( 44, 77 ), 
                     encoding= 'unicode_escape',
                     header = None)
print( ERG_df.columns )
ERG_df.head()

In [None]:
chan_names = [ 'chan_{}'.format( x ) for x in range( 0, 33 ) ]
ERG_df.columns = chan_names
ERG_df.head()

In [None]:
print( ERG_df.shape )
ERG_df.plot( x = 'chan_0', y = 'chan_1' ) 
plt.show()

#### A Pattern Emerges...

Glancing at the data we can see a pattern in the columns: time, dataX, dataY  
I'm giving the data columns two arbitrary names here; we'll need info from our domain expert.  

Let's rename the columns...

In [None]:
colnames = [ ['time_{}'.format(x), 'chan_X_{}'.format(x), 'chan_Y_{}'.format(x)] for x in range( 0, int( ERG_df.shape[1] /3 ) ) ]
colnames = [item for sublist in colnames for item in sublist]
colnames

In [None]:
#change the column names
ERG_df.columns = colnames
ERG_df.head()

Alternatively, we could check whether the time columns are identical and, if so, drop the duplicates.  

In [None]:
ERG_df['time_0'].equals( ERG_df['time_1'] )
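
If the check comes back `True` for every time column, one possible way to drop the repeats is sketched below. The small DataFrame is a made-up miniature of the time/X/Y layout, not the real ERG data, and `tidy_df` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical miniature of the time/X/Y column pattern (values are made up).
demo_df = pd.DataFrame({
    "time_0": [0.0, 0.5, 1.0], "chan_X_0": [1, 2, 3], "chan_Y_0": [4, 5, 6],
    "time_1": [0.0, 0.5, 1.0], "chan_X_1": [7, 8, 9], "chan_Y_1": [1, 1, 2],
})

time_cols = [col for col in demo_df.columns if col.startswith("time_")]
reference = demo_df[time_cols[0]]

# Only drop the repeats if every time column matches the first one.
if all(demo_df[col].equals(reference) for col in time_cols[1:]):
    tidy_df = demo_df.drop(columns=time_cols[1:]).rename(
        columns={time_cols[0]: "time"}
    )
else:
    tidy_df = demo_df.copy()

print(tidy_df.columns.tolist())
```

Checking before dropping matters: if even one time column differs, the duplicates carry real information and should stay.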

#### A Quick and dirty visualization

In [None]:
plt.subplots(figsize=(5, 7))
for idx in range(1,12):
    time_str = 'time_{}'.format(idx-1)
    X_str = 'chan_X_{}'.format(idx-1)
    Y_str = 'chan_Y_{}'.format(idx-1)
    plt.subplot(11,1,idx)
    sns.lineplot( data = ERG_df, x= time_str, y= X_str )
    plt.ylim( -50000, 100000 )
    
sns.despine(left=True, bottom=True, right=True)
plt.show()

### pupil.txt

In [None]:
pupil_url = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/Week6/pupil.txt'
pupil_df = pd.read_csv( pupil_url, sep = '\t' )
pupil_df.head() 

In [None]:
pupil_df.tail()

In [None]:
print( pupil_df.shape )
pupil_df.info()
pupil_df.describe()

#### A Quick & dirty visualization

In [None]:
plt.subplots(figsize=(5, 5))
plt.subplot(2,1,1)
sns.lineplot( data = pupil_df, x= 'time', y= 'pupilArea' )
plt.subplot(2,1,2)
sns.lineplot( data = pupil_df, x= 'time', y= 'pupilDiam' )
plt.show()

### MATLAB struct


In [None]:
path = '/home/bonzilla/Documents/ScienceLife/DS4VS/Week6/sampledata.mat'

mat_dat = scipy.io.loadmat(path)
print( type( mat_dat ) )
print( mat_dat.keys() )

Let's start by investigating the keys...

In [None]:
print( mat_dat['__header__'] )
print( mat_dat['__version__'] )
print( mat_dat['__globals__'] )
EEGmat = mat_dat['sampledata']
EEGmat.shape

In [None]:
EEGmat

It looks like the dunders are all file metadata. Not relevant? (let's ask our expert)  
We are mainly concerned with formatting the data in `sampledata`  

`mat_dat['sampledata']` is a deeply nested structure... 

In [None]:
mat_dat['sampledata'].shape
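
Much of the nesting comes from how `scipy.io.loadmat` wraps MATLAB structs in 2-D arrays by default. As a hedged sketch, the `simplify_cells=True` option (available in SciPy 1.5 and later) returns structs as plain dictionaries instead. The struct below is made up and written to an in-memory buffer, so it does not reflect the real contents of sampledata.mat.

```python
import io
import numpy as np
import scipy.io

# Build a small, made-up struct and save it to an in-memory .mat file.
demo_struct = {
    "filename": "demo.dat",
    "n_channels": 4,
    "psth_range": np.arange(5.0),
}
buffer = io.BytesIO()
scipy.io.savemat(buffer, {"sampledata": demo_struct})

# Default load: the struct comes back wrapped in a 2-D array.
buffer.seek(0)
nested = scipy.io.loadmat(buffer)
print(nested["sampledata"].shape)

# simplify_cells=True: the struct comes back as a plain dict.
buffer.seek(0)
simple = scipy.io.loadmat(buffer, simplify_cells=True)
print(simple["sampledata"].keys())
print(simple["sampledata"]["n_channels"])
```

Whether this is appropriate for the real file depends on its contents, so it is worth comparing both views before committing to one.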

In [None]:
# sampledata.channels
channels = mat_dat['sampledata'][0][0][0][0]
channel_list = [ chan_num[0][0] for chan_num in channels ]
channel_list

In [None]:
# sampledata.filename
filename = mat_dat['sampledata'][0][0][1][0]
filename

In [None]:
# sampledata.maps
mat_dat['sampledata'][0][0][2][0,2].shape

In [None]:
# sampledata.psth_range
psth_range = mat_dat['sampledata'][0][0][3]
print( psth_range.shape )
print( type( psth_range ) )
psth_range

In [None]:
plt.plot( psth_range[0] ) 
plt.show()

In [None]:
# sampledata.n_channels
n_channels = mat_dat['sampledata'][0][0][4][0][0]
n_channels
# QUESTION FOR FARZANEH: why is n_channels 20, but the 'channels' field only lists 16?

## Data Wrangling

<img src="data_preprocessing.jpg" width="30%" style="margin-left:auto; margin-right:auto">


## Example Data Import

1. create code to import the data in a manageable format
2. functionalize the import code
3. apply the function to multiple files

### 1. Create code to import/format data

We just saw several examples of this above; here's one more!  
The following code will bring in select values from a .txt file.  
This .txt file is one of many for a psychophysical experiment

In [None]:
addy = '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/gratvernier/mj gratvernier 49.txt'
# read the file line by line; the with-block closes the file when we are done
lines = []
with open( addy ) as fp:
    for line in fp:
        lines.append( line )
        print( line )

#print( lines )

### what to do with a list of lines?

We now need to write code to pull out information based on known patterns in the .txt file format

In [None]:
# length of our list of lines
print( len( lines ) )
print( lines[0:32] )

### information dictionary

* subject and stimulus information is given in the extensive header of this file.  
* lines 1 to ~32 are the same across all files, so we can use that to our advantage to pull the information we need  
* We will store fields of interest in a Python dictionary...

In [None]:
# initialize a python dictionary
verniergrat_dict = {}

# add a few fields: each value is a whitespace-delimited token from a known header line
verniergrat_dict['date'] = lines[2].split()[0]
verniergrat_dict['subject'] = lines[3][0:2]
verniergrat_dict['spac_freq'] = lines[5].split()[3]
verniergrat_dict['drift vel'] = lines[6].split()[2] 
verniergrat_dict['grat_gap'] = lines[7].split()[4]
# NOTE: 'eccentricity' reads the same token as 'spac_freq'; check this index against the file header
verniergrat_dict['eccentricity'] = lines[5].split()[3]

print( verniergrat_dict )
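
Positional lookups like `lines[5].split()[3]` raise an `IndexError` as soon as a file is shorter than expected or a line has fewer tokens. Below is a minimal sketch of a defensive wrapper. `get_token` is a hypothetical helper introduced here, and the header lines are invented, so the indices will not match the real gratvernier files.

```python
def get_token(lines, line_idx, token_idx):
    """Return one whitespace-delimited token, or None if it is not there."""
    try:
        return lines[line_idx].split()[token_idx]
    except IndexError:
        # either the line or the token is missing
        return None

# Hypothetical header lines (not copied from a real data file).
demo_lines = [
    "header line\n",
    "\n",
    "03/14/2021 10:15\n",
    "mj session notes\n",
]

info = {
    "date": get_token(demo_lines, 2, 0),
    "subject": demo_lines[3][0:2],
    "missing": get_token(demo_lines, 10, 0),  # line 10 does not exist
}
print(info)
```

Returning `None` keeps a loop over many files running, and the gaps show up later as missing values that can be inspected in one place.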

### 2. Functionalize the import/format code

The dictionary doesn't hold all the information we'd need for analysis, but it's a good start.  
Let's functionalize the code:

In [None]:
def gratvernier_txt2dict( path, filename ):
    """
    helper function to extract data and task metadata from a BBL .txt file
    path: folder that holds the file (str, ending with a path separator)
    filename: name of the .txt file (str)
    """
    # read lines into python environment; the with-block closes the file for us
    file_add = path + filename
    with open( file_add ) as fp:
        lines = fp.readlines()
    
    # pull info of interest and store as a python dictionary
    verniergrat_dict = {}
    
    verniergrat_dict['filename'] = filename
    verniergrat_dict['date'] = lines[2].split()[0]
    verniergrat_dict['subject'] = lines[3][0:2]
    verniergrat_dict['spac_freq'] = lines[5].split()[3]
    verniergrat_dict['drift vel'] = lines[6].split()[2]
    verniergrat_dict['grat_gap'] = lines[7].split()[4]
    # NOTE: 'eccentricity' reads the same token as 'spac_freq'; check this index against the file header
    verniergrat_dict['eccentricity'] = lines[5].split()[3]
    
    # return the data dictionary
    return verniergrat_dict

### take this function for a test drive: 

In [None]:
path = '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/gratvernier/'
filename = 'mj gratvernier 49.txt'

exp1 = gratvernier_txt2dict( path, filename )
print( exp1 )
print( type( exp1 ) )
print( exp1.keys() )

## Now you try.....

* functionalize some of the code from a previous example
* write a function that will read a file and return some structured data

a basic outline:  

    def basic_dataread( str_path ):
        """
        write a blurb to describe what this function will do
        """
        path = "......." # provide the path where your file lives
        
        # use the appropriate function to read the file
        
        # write code to format/select data of interest
        
        return <data_object>

### iterating over many files

We would like to use our function to iterate over many files and extract information

In [None]:
# how many files are in the folder? 
folder = '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/gratvernier/'

import os
files = os.listdir( folder ) 
print( len( files ) )
print( files )

### iterating over many files

There are 59 .txt files in the same directory as our working example.  
It wouldn't be impossible to manually work through them with cut/paste, but our time is more valuable than that!!!  

Let's have Python do the work for us...

In [None]:
for file in files: 
    file_path = '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/gratvernier/'
    res = gratvernier_txt2dict( file_path, file )
    print( file )
    print( res )
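
`os.listdir()` returns everything in the folder, in no guaranteed order. As an alternative sketch, `pathlib` can select files by pattern and `sorted()` makes the order repeatable. The temporary folder and file names below are made up so the example runs anywhere.

```python
import tempfile
from pathlib import Path

# Create a throwaway folder with a few made-up files to iterate over.
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    for name in ["b run 2.txt", "a run 1.txt", "notes.md"]:
        (demo_folder / name).write_text("placeholder\n")

    # glob() keeps only the .txt files; sorted() fixes the order.
    txt_names = sorted(p.name for p in demo_folder.glob("*.txt"))

print(txt_names)
```

Filtering by pattern up front means a stray non-data file in the folder never reaches the parsing function.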

### 3. Iterate over multiple data files

### Consolidate the outcomes as a pandas DataFrame

Printing the results is not very useful to us.  
Let's write another function that will consolidate the outcomes from `gratvernier_txt2dict()`:

In [None]:
def gratvernier_text2df( path, extension ):
    """
    given a folder 'path' (str) and file 'extension' (str),
    gratvernier_text2df() returns a pandas DataFrame that consolidates
    the dict fields from gratvernier_txt2dict()
    """
    # a list of data files to iterate over (use the 'path' argument, not a global variable)
    files = os.listdir( path )
    files = [file for file in files if file.endswith( extension )]
    data_fields = ['date', 'subject', 
                   'spac_freq', 'drift vel', 
                   'grat_gap', 'eccentricity']
    # one dictionary (one future row) per file
    dict_list = []
    for file in files:
        row = gratvernier_txt2dict( path, file )
        dict_list.append( row )
        
    # fix the column order: filename first, then the data fields
    data_df = pd.DataFrame( dict_list, columns = ['filename'] + data_fields )
    return data_df

### Let's take this function out for a test drive

In [None]:
folder = '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/gratvernier/'

res = gratvernier_text2df( folder, '.txt' )
res

### Are we happy with this result?

Let's get a better view of the DataFrame:

In [None]:
res.info()

### Dtype `object`

Take a moment and look at the DataFrame further...  
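
Values pulled out of text files arrive as strings, which pandas reports as non-numeric columns. A minimal sketch of one way to convert them is below, using `pd.to_numeric` with `errors="coerce"`. The DataFrame is invented for illustration, and whether coercing bad entries to `NaN` is the right choice for the real data is an open question.

```python
import pandas as pd

# Hypothetical parsed values: every entry is still a string.
demo_df = pd.DataFrame({
    "subject": ["mj", "mj", "kb"],
    "spac_freq": ["2", "4", "n/a"],
    "grat_gap": ["0.5", "1.0", "1.5"],
})
print(demo_df.dtypes)

# Convert the columns that should be numeric; entries that cannot be
# parsed (like 'n/a') become NaN instead of raising an error.
numeric_cols = ["spac_freq", "grat_gap"]
converted = demo_df.copy()
converted[numeric_cols] = converted[numeric_cols].apply(
    pd.to_numeric, errors="coerce"
)
print(converted.dtypes)
```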


<img src="https://internationalnewsagency.org/wp-content/uploads/2020/11/face-with-a-raised-eyebrow-emoji-780x470.jpg" width="60%" style="margin-left:auto; margin-right:auto">

## Let's take a break. When we come back we will talk about cleaning and evaluating data
<img src="https://content.techgig.com/photo/80071467/pros-and-cons-of-python-programming-language-that-every-learner-must-know.jpg?132269" width="100%" style="margin-left:auto; margin-right:auto">