<br><br>

<img src="https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/Homework/wranglingdata.png" width="80%" style="margin-left:auto; margin-right:auto">
<img src="https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/Homework/withpandas.png" width="50%" style="margin-left:auto; margin-right:auto">




<h1><center>Homework 2: bringing data into the Python environment with `pandas` </center></h1>

**Data Wrangling** - the process of transforming and/or remapping raw data to a more concise and consolidated format

***

During Week 6's class we worked on building functions to wrangle some .txt files into a `pandas` DataFrame. The .txt files are real-world psychophysical results from human subjects performing a vernier hyperacuity task. During that lecture, we only extracted some metadata about how the data was collected and a few stimulus parameters. For this homework assignment, we will complete the exercise and collect all the relevant data.

For this assignment, we will work with some new text data. The data has a familiar format, but it was collected from a different task. For now, our goal is to import all the data needed for a downstream exploratory data analysis and store it as a `pandas` DataFrame. We will follow the same basic approach that was given in class.

The resulting DataFrame should have the following fields:  

|    |  **Column Name** | **Element Type** |                                                         **Example**                                                         |
|:--:|:----------------:|:----------------:|:---------------------------------------------------------------------------------------------------------------------------:|
| 1. |     file_name    |      string      |                                                       're SCSF 1.txt'                                                       |
| 2. |   observer_name  |      string      |                                                             're'                                                            |
| 3. |   type_grating   |      string      |                                                       'Equiluminance'                                                       |
| 4. | spatial_frequency |       float      |                                                            0.251                                                            |
| 5. |  contrast_level  |       float      |                                                             100.                                                            |
| 6. |      rgb_min     |       list       |                                                         [0, 107, 0]                                                         |
| 7. |      rgb_max     |       list       |                                                         [705, 0, 0]                                                         |
| 8. | scase1_reversals |       list       | [ 6.30957, 12.5893, 1.58489, 2.23872,      0.794328, 1.58489, 0.794328, 1.12202,    0.794328, 1.12202, 0.562341, 0.794328 ] |
| 9. | scase2_reversals |       list       |  [ 3.16228, 6.30957, 1.58489, 2.23872,   0.562341, 0.794328, 0.562341, 0.794328,    0.562341, 1.12202, 0.794328, 1.12202 ]  |


<br>

## Dependencies

In [27]:
# Let's set up with these libraries handy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

<br>

## Importing Text Data and Parsing Relevant Fields

These text files are not friendly for use with pandas methods such as `read_csv()`. We will have to bring the data in the old-fashioned way: we will open and read the file with built-in Python functions.

In [3]:
# SUBSTITUTE YOUR PATH HERE 
addy = '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/SCSF_dat/mj SCSF 87.txt'
lines = []
with open( addy ) as fp:     # the with-block closes the file for us when it ends
    for line in fp:
        lines.append( line )
        #print( line )  # UNCOMMENT IF YOU WOULD LIKE TO SEE THE FULL .txt FILE

<br>

The text file is not very large. Nonetheless, we don't need all of the information or labels. Here are a few examples of how to use some simple string methods to parse the data and metadata we are interested in.

<br>

In [4]:
# print out the first 20 lines
print( len( lines ) )
print( lines[0:20] )

168
['EXPERIMENT PARAMETERS:\n', '\n', '6/23/2011\t9:47:56 AM to 9:52:43 AM\n', 'Observer Name: mj\n', 'Eye Tested: Left\n', 'Type of grating: Compound\n', 'Spatial Frequency(cpd): 0.631\n', 'DriftVelocity: 0.5\n', 'Stimuli Eccentricity(deg): 0\n', 'Viewing Distance(mm): 480\n', 'Temporal Envelope Type: RaisedCosine\n', '   Attack/Decay Time (msec): 350\n', 'Trial Duration (msec): 300\n', 'Maximum Number of Reversals: 12\n', 'Number of Staircases: 2\n', 'Maximal Red, Green, Blue Gun settings are 1.0, 1.0,1.0\n', '\n', 'STIMULUS :\n', '   Contrast Level: 100\n', '   Color Min: Red: 0  Green: 0  Blue: 0\n']


In [12]:
# An example parsing metadata from the text lines

print( lines[3] )             # print the 4th line
print( lines[3].split() )     # split the 4th line on whitespace and print the result
print( lines[3].split()[2] )  # split the 4th line on whitespace, select the 3rd element and print it
print( type( lines[3].split()[2] ) )

Observer Name: mj

['Observer', 'Name:', 'mj']
mj
<class 'str'>


In [11]:
# a numeric example

print( lines[6] )
print( lines[6].split() )
res = float( lines[6].split()[2] )
print( res )
print( type( res ) )

Spatial Frequency(cpd): 0.631

['Spatial', 'Frequency(cpd):', '0.631']
0.631
<class 'float'>


<br>

2 down, 7 more fields to go...
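
One of the remaining fields, `rgb_min`, holds several numbers on a single line. A minimal sketch of one way to handle it, using the `Color Min` line exactly as it appears in the printout above and keeping only the tokens made of digits (the variable names here are illustrative, not part of the assignment):

```python
# The 'Color Min' line as it appears in the printout of lines[0:20]
color_line = '   Color Min: Red: 0  Green: 0  Blue: 0\n'

# split on whitespace, keep only the all-digit tokens, and convert them to int
rgb_min = [ int( tok ) for tok in color_line.split() if tok.isdigit() ]

print( color_line.split() )
print( rgb_min )
```

Note that `str.isdigit()` is `False` for tokens such as `'0.5'` or `'-1'`, so this filter silently skips decimals and negative numbers. If a parsed list comes back shorter than expected, that is the first thing to check.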

### Organize the data as a Python `dict`

Now to store the parsed fields as key:value pairs in a Python dictionary:

In [23]:
# initialize a python dictionary
SCSF_dict = {}

# add a few fields

SCSF_dict['file_name'] = 'mj SCSF 87.txt'
# Easy: parse the split element of interest
SCSF_dict['observer_name'] = lines[3].split()[2]
SCSF_dict['type_grating'] = 
SCSF_dict['spatial_frequency'] = float( lines[6].split()[2] )
SCSF_dict['contrast_level'] = 

# Moderate: split the string. 
# add elements of the split result to a list if the element is a digit.
SCSF_dict['rgb_min'] = 
SCSF_dict['rgb_max'] = 

# Hard: We need to parse out the values for each staircase
# after the line 'Staircase completed after XX trials with 12 reversals at :'

# staircase1 starts and ends at the same line numbers in every text file.
# However, this is not the case for staircase2,
# so fixed line numbers will not work for it.

# extract the 12 reversal values for each staircase as lists of floats

SCSF_dict['scase1_reversals'] = 
SCSF_dict['scase2_reversals'] = 

print( SCSF_dict )

{'file_name': 'mj SCSF 87.txt', 'observer_name': 'mj', 'type_grating': 'Compound', 'spatial_frequency': 0.631, 'contrast_level': 100.0, 'rgb_min': [0, 0, 0], 'rgb_max': [700, 211, 0], 'scase1_reversals': [3.1622, 6.3095, 3.1622, 4.4668, 3.1622, 4.4668, 3.1622, 6.3095, 4.4668, 6.3095, 4.4668, 6.3095], 'scase2_reversals': [12.589, 25.118, 6.3095, 8.9125, 4.4668, 12.589, 4.4668, 12.589, 8.9125, 12.589, 8.9125, 12.589]}
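
For the "Hard" fields, one possible approach is to search for the marker line instead of relying on fixed line numbers. The sketch below is only an illustration under stated assumptions: it assumes the reversal values follow the `Staircase completed ...` line as whitespace-separated numbers, the helper name `reversals_after_marker` is hypothetical, and the sample lines are made up (with 4 reversals instead of 12 to keep it short). Check the layout of your own files before adapting it.

```python
def reversals_after_marker( lines, n_reversals=12 ):
    """
    Hypothetical helper: return one list of floats per staircase, each
    holding the first n_reversals numbers found after a marker line.
    """
    staircases = []
    current = None                       # list being filled; None before the first marker
    for line in lines:
        if line.strip().startswith( 'Staircase completed' ):
            current = []                 # a marker line starts a new staircase
            staircases.append( current )
            continue                     # do not parse the numbers in the marker itself
        if current is None:
            continue                     # still in the header, nothing to collect
        for tok in line.split():
            if len( current ) == n_reversals:
                break                    # this staircase is complete
            try:
                current.append( float( tok ) )
            except ValueError:
                continue                 # skip labels and other non-numeric tokens
    return staircases

# MADE-UP sample lines; the real file layout is an assumption
sample_lines = [
    'Observer Name: mj\n',
    'Staircase completed after 20 trials with 4 reversals at :\n',
    '3.1622\t6.3095\n',
    '3.1622\t4.4668\n',
    '\n',
    'Staircase completed after 25 trials with 4 reversals at :\n',
    '12.589 25.118 6.3095 8.9125\n',
]

scase1, scase2 = reversals_after_marker( sample_lines, n_reversals=4 )
print( scase1 )
print( scase2 )
```

Because the helper keys on the marker text, it does not matter that staircase2 starts on a different line from file to file. The trade-off is that any stray number between the marker and the reversal values would be collected too.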


<br>

## Functionalize your code

Write a function that uses the code you wrote above to generate the same result:

In [24]:
# I suggest a function that takes two arguments: path & filename
# where
# path = file location. ex: '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/SCSF_dat/'
# filename = name of the text file. ex: 'mj SCSF 87.txt'
# however, you can change this

def SCSF_txt2dict( path, filename ):
    """
    helper function to extract data and task metadata from a BBL .txt file
    """
    # read lines into python environment
    file_address = path + filename
    
    SCSF_dict = {}
    # pull info of interest and store as a python dictionary
    
    # return the data dictionary
    return SCSF_dict
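
One detail worth noting in the skeleton: `path + filename` only produces a valid address when `path` ends with a separator. A small sketch of the read step using `os.path.join`, which adds the separator when needed. The helper name `read_lines` is hypothetical, and the temporary file only stands in for a real data file so the snippet runs anywhere:

```python
import os
import tempfile

def read_lines( path, filename ):
    """Hypothetical helper: return the lines of path/filename as a list of strings."""
    file_address = os.path.join( path, filename )   # works with or without a trailing '/'
    with open( file_address ) as fp:
        return fp.readlines()

# demo on a throwaway file
with tempfile.TemporaryDirectory() as tmp_dir:
    with open( os.path.join( tmp_dir, 'demo SCSF 1.txt' ), 'w' ) as fp:
        fp.write( 'EXPERIMENT PARAMETERS:\n\nObserver Name: mj\n' )
    demo_lines = read_lines( tmp_dir, 'demo SCSF 1.txt' )

print( demo_lines )
print( demo_lines[2].split()[2] )
```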

In [26]:
# Test the function out
path = # ex: '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/SCSF_dat/'
filename = # ex: 'mj SCSF 87.txt'
res = SCSF_txt2dict( path, filename ) # update if your function takes other parameters

print( type( res ) )
print( res.keys() )
print( res )

<class 'dict'>
dict_keys(['file_name', 'observer_name', 'type_grating', 'spatial_frequency', 'contrast_level', 'rgb_min', 'rgb_max', 'scase1_reversals', 'scase2_reversals'])
{'file_name': 'mj SCSF 87.txt', 'observer_name': 'mj', 'type_grating': 'Compound', 'spatial_frequency': 0.631, 'contrast_level': 100.0, 'rgb_min': [0, 0, 0], 'rgb_max': [700, 211, 0], 'scase1_reversals': [3.1622, 6.3095, 3.1622, 4.4668, 3.1622, 4.4668, 3.1622, 6.3095, 4.4668, 6.3095, 4.4668, 6.3095], 'scase2_reversals': [12.589, 25.118, 6.3095, 8.9125, 4.4668, 12.589, 4.4668, 12.589, 8.9125, 12.589, 8.9125, 12.589]}


<br>

## So many files, so little time

Let's generate a list to hold the names of all the text files in our folder. The length of this list will tell us how much data we have to deal with.


In [29]:
file_names = os.listdir( path )
print( len( file_names ) )
# print( file_names ) if you dare

391


Wow! Okay, that's quite a few. Thankfully we have functionalized our code, so we can have Python do all the work of extracting our data.

Write a `for` loop to iterate over all the files in `file_names`.
Print the result of each iteration

In [None]:
# for each file in file_names:
    # apply our new function SCSF_txt2dict()
    # print the result
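
The loop pattern can be sketched on a throwaway folder. This is an illustration only: `toy_parser` is a hypothetical stand-in for `SCSF_txt2dict()`, and the file names are made up. Since `os.listdir()` returns every entry in the folder in arbitrary order, the sketch sorts the names and skips anything that is not a `.txt` file:

```python
import os
import tempfile

def toy_parser( path, filename ):
    """Hypothetical stand-in for SCSF_txt2dict(): returns a tiny dict per file."""
    with open( os.path.join( path, filename ) ) as fp:
        n_lines = len( fp.readlines() )
    return { 'file_name': filename, 'n_lines': n_lines }

results = []
with tempfile.TemporaryDirectory() as tmp_dir:
    # build a throwaway folder: two data files plus one file we want to skip
    for name in [ 'mj SCSF 1.txt', 're SCSF 2.txt', 'notes.md' ]:
        with open( os.path.join( tmp_dir, name ), 'w' ) as fp:
            fp.write( 'EXPERIMENT PARAMETERS:\n\n' )

    # sort the names so the output is repeatable
    for file in sorted( os.listdir( tmp_dir ) ):
        if not file.endswith( '.txt' ):
            continue                      # ignore anything that is not a data file
        res = toy_parser( tmp_dir, file )
        results.append( res )
        print( res )
```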

<br>

## Functionalize the iteration process and return a `pandas` DataFrame

It doesn't help us much to have hundreds of individual dictionaries floating around.  
That's very disorganized!  

Next we will write a function that handles iterating over all the files in our data folder and consolidates the outcomes as rows in a DataFrame.

In [35]:
def SCSF_text2df( path, extension ):
    """
    given a folder 'path' (str) and file 'extension' (str),
    SCSF_text2df() returns a pandas DataFrame that consolidates
    the dict fields from SCSF_txt2dict()
    """
    # a list of data files to iterate over
    files = 
    
    # iterate over the files
    dict_list = []
    for file in files:
        # generate a dictionary result for this file
        # append the result to dict_list
    
    # pass your list of dictionaries to pd.DataFrame to create your dataframe
    
    # return the result
    return data_df
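
The final step, passing a list of dictionaries to `pd.DataFrame`, can be tried on its own first. A minimal sketch with two toy dictionaries (values borrowed from the examples earlier in this assignment, trimmed to four keys): each dictionary becomes a row, each key becomes a column, and list values are stored whole in a single cell.

```python
import pandas as pd

# toy stand-ins for the dictionaries SCSF_txt2dict() would return
dict_list = [
    { 'file_name': 'mj SCSF 87.txt', 'observer_name': 'mj',
      'spatial_frequency': 0.631, 'rgb_min': [0, 0, 0] },
    { 'file_name': 're SCSF 1.txt', 'observer_name': 're',
      'spatial_frequency': 0.251, 'rgb_min': [0, 107, 0] },
]

# one row per dictionary, one column per key
toy_df = pd.DataFrame( dict_list )

print( toy_df.shape )
print( list( toy_df.columns ) )
print( toy_df.loc[ 1, 'rgb_min' ] )   # a list stored whole in one cell
```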

In [38]:
# Try your function out!
# Print basic descriptions of the data

res = SCSF_text2df( path, '.txt' )

print( type( res ) )
print( res.shape )
print( res.info() )
res.head()

<class 'pandas.core.frame.DataFrame'>
(391, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 391 entries, 0 to 390
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   file_name          391 non-null    object 
 1   observer_name      391 non-null    object 
 2   type_grating       391 non-null    object 
 3   spatial_frequency  391 non-null    float64
 4   contrast_level     391 non-null    float64
 5   rgb_min            391 non-null    object 
 6   rgb_max            391 non-null    object 
 7   scase1_reversals   391 non-null    object 
 8   scase2_reversals   391 non-null    object 
dtypes: float64(2), object(7)
memory usage: 27.6+ KB
None


|    | file_name | observer_name | type_grating | spatial_frequency | contrast_level | rgb_min | rgb_max | scase1_reversals | scase2_reversals |
|:--:|:---------:|:-------------:|:------------:|:-----------------:|:--------------:|:-------:|:-------:|:----------------:|:----------------:|
| 0 | mj SCSF 17.txt | mj | Equiluminance | 0.631 | 100.0 | [0, 211, 0] | [700, 0, 0] | [12.589, 50.118, 1.5848, 2.2387, 1.122, 2.2387... | [0.79432, 3.1622, 1.122, 1.5848, 1.122, 1.5848... |
| 1 | kh SCSF 67.txt | kh | Equiluminance | 1.585 | 100.0 | [0, 0] | [705, 0, 0] | [50.118, 12.589, 25.118, 17.782, 35.481, 25.11... | [50.118, 12.589, 35.481, 17.782, 35.481, 25.11... |
| 2 | mj SCSF 73.txt | mj | Compound | 0.398 | 100.0 | [0, 0, 0] | [700, 211, 0] | [6.3095, 25.118, 1.5848, 2.2387, 0.56234, 0.79... | [3.1622, 12.589, 6.3095, 8.9125, 0.56234, 0.79... |
| 3 | re SCSF 198.txt | re | Compound | 0.631 | 100.0 | [0, 0, 0] | [1000, 152, 0] | [1.5848, 3.1622, 1.122, 2.2387, 1.5848, 2.2387... | [1.5848, 12.589, 1.5848, 2.2387, 1.122, 1.5848... |
| 4 | re SCSF 153.txt | re | Compound | 1.585 | 100.0 | [0, 0, 0] | [1000, 152, 0] | [0.79432, 3.1622, 2.2387, 4.4668, 3.1622, 8.91... | [1.5848, 6.3095, 3.1622, 4.4668, 3.1622, 6.309... |


<br>

## Et voilà!

You took hundreds of text files, parsed the information you were after, and organized the data as a `pandas` DataFrame.  

The next step would be to apply the same principles to your own data.  

<br><br>