# Preprocessing of the data

## Pairs Description.

* Pairs from **0** to **4** are RR fixed ($\frac{1}{30}$) - RI random ($\frac{1}{7.5}$, $\frac{1}{15}$, $\frac{1}{30}$, $\frac{1}{60}$, $\frac{1}{120}$).

* Pairs drom **5** to **9** are RR random ($\frac{1}{15}$, $\frac{1}{30}$, $\frac{1}{45}$, $\frac{1}{60}$, $\frac{1}{120}$) - RI fixed ($\frac{1}{60}$).

## Event Dictionary

* Response RI = 24
* Response RR = 44
* Reward RI = 33
* Reward RR=46
* Start session with R1 (p=0.5) = 12
* Start session with RR (p=0.5) = 13
* Switch from RR to RI = 14
* Switch from RI to RR = 15
* Show pairs = 10
* Response on TecCen to start session = 11

### Importing libraries

In [1]:
import pandas as pd
import numpy as np

### Importing data, creating dataframe, extracting subjects, sessions, and pairs.

In [2]:
df = pd.read_csv("full_data_2016.csv")

In [3]:
df["session"].unique()  

array(['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10', 'S11',
       'S12', 'S13', 'S14', 'S15', 'S16', 'S17', 'S18', 'S19', 'S20',
       'S21', 'S22', 'S23', 'S24', 'S25', 'S26', 'S27', 'S28', 'S29',
       'S30', 'S31', 'S32', 'S33', 'S34', 'S35', 'S36', 'S37', 'S38',
       'S39', 'S40', 'S41', 'S42', 'S43', 'S44', 'S45', 'S46', 'S47',
       'S48', 'S49', 'S50', 'S51', 'S52', 'S53', 'S54', 'S55', 'S56',
       'S57', 'S58', 'S59', 'S60', 'S61', 'S62', 'S63', 'S64', 'S65',
       'S66', 'S67', 'S68', 'S69', 'S70', 'S71', 'S72', 'S73', 'S74',
       'S75', 'S76', 'S77', 'S78', 'S79', 'S80', 'S81', 'S82', 'S83',
       'S84', 'S85', 'S86', 'S87', 'S88', 'S89', 'S90', 'S91', 'S92',
       'S93', 'S94', 'S95', 'S96', 'S97', 'S98', 'S99', 'S100', 'S101',
       'S102', 'S103', 'S104', 'S105', 'S106', 'S107', 'S108', 'S109',
       'S110'], dtype=object)

In [4]:
def get_data_info(df):
     """
     Returns the number of sessions, birds, pairs and events in the dataset.
     """

     sessions = df.session.unique()
     birds = df.bird.unique()
     pairs = np.sort(df.pair.unique())
     events = np.sort(df.event.unique())

     print(f"Sessions: {sessions}","\n", "*"*60, "\n", 
          f"Birds:{birds}", "\n", "*"*60, "\n", 
          f"Pairs: {pairs}", "\n", "*"*60, "\n",
          f"Events: {events}")
     
     return sessions, birds, pairs, events

In [5]:
sessions, birds, pairs, events = get_data_info(df)

Sessions: ['S1' 'S2' 'S3' 'S4' 'S5' 'S6' 'S7' 'S8' 'S9' 'S10' 'S11' 'S12' 'S13'
 'S14' 'S15' 'S16' 'S17' 'S18' 'S19' 'S20' 'S21' 'S22' 'S23' 'S24' 'S25'
 'S26' 'S27' 'S28' 'S29' 'S30' 'S31' 'S32' 'S33' 'S34' 'S35' 'S36' 'S37'
 'S38' 'S39' 'S40' 'S41' 'S42' 'S43' 'S44' 'S45' 'S46' 'S47' 'S48' 'S49'
 'S50' 'S51' 'S52' 'S53' 'S54' 'S55' 'S56' 'S57' 'S58' 'S59' 'S60' 'S61'
 'S62' 'S63' 'S64' 'S65' 'S66' 'S67' 'S68' 'S69' 'S70' 'S71' 'S72' 'S73'
 'S74' 'S75' 'S76' 'S77' 'S78' 'S79' 'S80' 'S81' 'S82' 'S83' 'S84' 'S85'
 'S86' 'S87' 'S88' 'S89' 'S90' 'S91' 'S92' 'S93' 'S94' 'S95' 'S96' 'S97'
 'S98' 'S99' 'S100' 'S101' 'S102' 'S103' 'S104' 'S105' 'S106' 'S107'
 'S108' 'S109' 'S110'] 
 ************************************************************ 
 Birds:['P168' 'P423' 'P498' 'P787' 'P796' 'P875'] 
 ************************************************************ 
 Pairs: [0 1 2 3 4 5] 
 ************************************************************ 
 Events: [10 11 12 13 14 15 16 17 20 21 22 24 27 28 

In [6]:
df_last_sessions = df[df["session"].isin(sessions[49:])]

In [7]:
df_last_sessions.head(50)

Unnamed: 0,box,bird,session,pair,time,event,archive_name
490272,4.0,P168,S50,3,0.01,10,P168_S50.xls
490273,4.0,P168,S50,3,1.04,11,P168_S50.xls
490274,4.0,P168,S50,3,1.04,12,P168_S50.xls
490275,4.0,P168,S50,3,1.97,22,P168_S50.xls
490276,4.0,P168,S50,3,1.97,24,P168_S50.xls
490277,4.0,P168,S50,3,1.97,27,P168_S50.xls
490278,4.0,P168,S50,3,1.97,28,P168_S50.xls
490279,4.0,P168,S50,3,2.04,21,P168_S50.xls
490280,4.0,P168,S50,3,2.46,22,P168_S50.xls
490281,4.0,P168,S50,3,2.46,24,P168_S50.xls


In [8]:
df_last_sessions.drop(['box', 'archive_name'], axis=1, inplace=True)
df_last_sessions.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_last_sessions.drop(['box', 'archive_name'], axis=1, inplace=True)


Unnamed: 0,bird,session,pair,time,event
490272,P168,S50,3,0.01,10
490273,P168,S50,3,1.04,11
490274,P168,S50,3,1.04,12
490275,P168,S50,3,1.97,22
490276,P168,S50,3,1.97,24


In [9]:
df_last_sessions.reset_index(inplace=True, drop=True)
df_last_sessions.head()

Unnamed: 0,bird,session,pair,time,event
0,P168,S50,3,0.01,10
1,P168,S50,3,1.04,11
2,P168,S50,3,1.04,12
3,P168,S50,3,1.97,22
4,P168,S50,3,1.97,24


In [10]:
sessions, birds, pairs, events = get_data_info(df_last_sessions)

Sessions: ['S50' 'S51' 'S52' 'S53' 'S54' 'S55' 'S56' 'S57' 'S58' 'S59' 'S60' 'S61'
 'S62' 'S63' 'S64' 'S65' 'S66' 'S67' 'S68' 'S69' 'S70' 'S71' 'S72' 'S73'
 'S74' 'S75' 'S76' 'S77' 'S78' 'S79' 'S80' 'S81' 'S82' 'S83' 'S84' 'S85'
 'S86' 'S87' 'S88' 'S89' 'S90' 'S91' 'S92' 'S93' 'S94' 'S95' 'S96' 'S97'
 'S98' 'S99' 'S100' 'S101' 'S102' 'S103' 'S104' 'S105' 'S106' 'S107'
 'S108' 'S109' 'S110'] 
 ************************************************************ 
 Birds:['P168' 'P423' 'P498' 'P787' 'P796' 'P875'] 
 ************************************************************ 
 Pairs: [0 1 2 3 4 5] 
 ************************************************************ 
 Events: [10 11 12 13 14 15 16 17 20 21 22 24 27 28 29 30 31 32 33 34 42 43 44 46
 47]


## Functions to preprocess the data

In [11]:
def transform_data(data: pd.DataFrame()):
    # Description of the function
    """
    The function  `transform_data(data, birds, sessions, pairs)`, transform the data to be analyzed in the following way:
    1. Extract the following variables:
         Responses
         Responses before reward
         Visit lenght responses
         Time
         Time difference
         Visit lenght time
         Reward
         Time since last reward
    2. Create a new dataframe with the variables extracted.
    3. Return the new dataframe.

    Input: data   (pandas dataframe)  - Data to be transformed
    Output: data  (pandas dataframe) - Transformed data
    """
    import pandas as pd

    # Number of observations
    len_data = data.shape[0]

    # Bird
    bird = np.array([data.bird.unique()[0]])
    bird = np.repeat(bird, len_data)

    # Session
    session = np.array([data.session.unique()[0]])
    session = np.repeat(session, len_data)

    # Pair
    pair = np.array([data.pair.unique()])
    pair = np.repeat(pair, len_data)

    # Responses
    resp_RI = [0]
    resp_RR = [0]

    # Responses before reward
    responses_before_reward_RI = [0]
    responses_before_reward_RR = [0]

    # Visit lenght responses
    visit_lenght_response_RI = [0]
    visit_lenght_response_RR = [0]

    # Time
    time = [0]
    # Time difference
    time_diff = [0]

    # Visit lenght time
    visit_lenght_time_RI = [0]
    visit_lenght_time_RR = [0]

    # Reward
    reward_RI = [0]
    reward_RR = [0]

    # Time since last reward
    time_since_last_reward_RI = [0]
    time_since_last_reward_RR = [0]

    # Counter for the number of responses before reward
    counter_ri = 0
    counter_rr = 0

    # Counter for the time since last reward
    time_counter_rr = 0
    time_counter_ri = 0

    # Counter for the visit length
    counter_visit_length_ri = 0
    counter_visit_length_rr = 0

    # Counter for the visit time
    counter_visit_time_ri = 0
    counter_visit_time_rr = 0

#################################################################################
    # Loop over the data to extract the variables for the analysis.
    for ii in range(1, len_data):
#################################################################################
        if data.iloc[ii]["event"] == 33:
            # time
            time.append(data.iloc[ii]["time"])
            # time difference
            time_diff.append(time[ii] - time[ii-2])

            # Responses before reward
            counter_ri = 0
            responses_before_reward_RI.append(counter_ri)
            responses_before_reward_RR.append(counter_rr)

            # Time since last reward
            time_counter_ri = 0
            time_counter_rr += time_diff[ii]
            time_since_last_reward_RI.append(time_counter_ri)
            time_since_last_reward_RR.append(time_counter_rr)

            # Responses
            resp_RI.append(1)
            resp_RR.append(0)

            # Reward
            reward_RI.append(1)
            reward_RR.append(0)

            # Lenght visit response
            counter_visit_length_rr = 0
            counter_visit_length_ri += 1
            visit_lenght_response_RI.append(counter_visit_length_ri)
            visit_lenght_response_RR.append(counter_visit_length_rr)

            # Lenght visit time
            counter_visit_time_ri = time_diff[ii]
            counter_visit_time_rr = 0
            visit_lenght_time_RI.append(counter_visit_time_ri)
            visit_lenght_time_RR.append(counter_visit_time_rr)

    #################################################################################
        if data.iloc[ii]["event"] == 46:
            # time
            time.append(data.iloc[ii][ "time"])
            # time difference
            time_diff.append(time[ii] - time[ii-2])

            # Responses before reward
            counter_rr = 0
            responses_before_reward_RR.append(counter_rr)
            responses_before_reward_RI.append(counter_ri)

            # Time since last reward
            time_counter_ri += time_diff[ii]
            time_counter_rr = 0
            time_since_last_reward_RI.append(time_counter_ri)
            time_since_last_reward_RR.append(time_counter_rr)

            # Responses
            resp_RI.append(0)
            resp_RR.append(1)

            # Reward
            reward_RI.append(0)
            reward_RR.append(1)

            # Lenght visit response
            counter_visit_length_rr += 1
            counter_visit_length_ri = 0
            visit_lenght_response_RI.append(counter_visit_length_ri)
            visit_lenght_response_RR.append(counter_visit_length_rr)

            # Lenght visit time
            counter_visit_time_rr = time_diff[ii]
            counter_visit_time_ri = 0
            visit_lenght_time_RI.append(counter_visit_time_ri)
            visit_lenght_time_RR.append(counter_visit_time_rr)


    #################################################################################
        if data.iloc[ii]["event"] == 24:
            # time
            time.append(data.iloc[ii][ "time"])
            # time difference
            time_diff.append(time[ii] - time[ii-1])

            # Responses before reward
            counter_ri += 1
            responses_before_reward_RI.append(counter_ri)
            responses_before_reward_RR.append(counter_rr)

            # Time since last reward
            time_counter_ri += time_diff[ii]
            time_counter_rr += time_diff[ii]
            time_since_last_reward_RI.append(time_counter_ri)
            time_since_last_reward_RR.append(time_counter_rr)

            # Responses
            resp_RI.append(1)
            resp_RR.append(0)

            # Reward
            reward_RI.append(0)
            reward_RR.append(0)


            if data.iloc[ii-1]["event"] == 24:
                # Lenght visit response
                counter_visit_length_rr = 0
                counter_visit_length_ri += 1
                visit_lenght_response_RI.append(counter_visit_length_ri)
                visit_lenght_response_RR.append(counter_visit_length_rr)

                # Lenght visit time
                counter_visit_time_ri += time_diff[ii]
                counter_visit_time_rr = 0
                visit_lenght_time_RI.append(counter_visit_time_ri)
                visit_lenght_time_RR.append(counter_visit_time_rr)

            elif data.iloc[ii-1]["event"] == 46 or data.iloc[ii-1]["event"] == 33 or data.iloc[ii-1]["event"] == 12 or data.iloc[ii-1]["event"] == 13:
                # Lenght visit response
                counter_visit_length_rr = 0
                counter_visit_length_ri = 0
                visit_lenght_response_RI.append(counter_visit_length_ri)
                visit_lenght_response_RR.append(counter_visit_length_rr)

                # Lenght visit time
                counter_visit_time_rr = 0
                counter_visit_time_ri = 0
                visit_lenght_time_RI.append(counter_visit_time_ri)
                visit_lenght_time_RR.append(counter_visit_time_rr)

            elif data.iloc[ii-1]["event"] == 44:
                # Lenght visit response
                counter_visit_length_rr = 0
                counter_visit_length_ri = 0
                visit_lenght_response_RI.append(counter_visit_length_ri)
                visit_lenght_response_RR.append(counter_visit_length_rr)

                # Lenght visit time
                counter_visit_time_rr = 0
                counter_visit_time_ri = 0
                visit_lenght_time_RI.append(counter_visit_time_ri)
                visit_lenght_time_RR.append(counter_visit_time_rr)

    #################################################################################
        if data.iloc[ii]["event"] == 44:
            time.append(data.iloc[ii]["time"])
            time_diff.append(time[ii] - time[ii-1])

            counter_rr +=1
            responses_before_reward_RI.append(counter_ri)
            responses_before_reward_RR.append(counter_rr)

            time_counter_ri += time_diff[ii]
            time_counter_rr += time_diff[ii]
            time_since_last_reward_RI.append(time_counter_ri)
            time_since_last_reward_RR.append(time_counter_rr)

            resp_RI.append(0)
            resp_RR.append(1)

            reward_RI.append(0)
            reward_RR.append(0)

            if data.iloc[ii-1]["event"] == 44:
                counter_visit_length_rr += 1
                counter_visit_length_ri = 0
                visit_lenght_response_RI.append(counter_visit_length_ri)
                visit_lenght_response_RR.append(counter_visit_length_rr)

                counter_visit_time_rr += time_diff[ii]
                counter_visit_time_ri = 0
                visit_lenght_time_RI.append(counter_visit_time_ri)
                visit_lenght_time_RR.append(counter_visit_time_rr)

            elif data.iloc[ii-1]["event"] == 46 or data.iloc[ii-1]["event"] == 33 or data.iloc[ii-1]["event"] == 12 or data.iloc[ii-1]["event"] == 13:
                counter_visit_length_rr = 0
                counter_visit_length_ri = 0
                visit_lenght_response_RI.append(counter_visit_length_ri)
                visit_lenght_response_RR.append(counter_visit_length_rr)

                counter_visit_time_rr = 0
                counter_visit_time_ri = 0
                visit_lenght_time_RI.append(counter_visit_time_ri)
                visit_lenght_time_RR.append(counter_visit_time_rr)

            elif data.iloc[ii-1]["event"] == 24:
                counter_visit_length_rr = 0
                counter_visit_length_ri = 0
                visit_lenght_response_RI.append(counter_visit_length_ri)
                visit_lenght_response_RR.append(counter_visit_length_rr)

                counter_visit_time_rr = 0
                counter_visit_time_ri = 0
                visit_lenght_time_RI.append(counter_visit_time_ri)
                visit_lenght_time_RR.append(counter_visit_time_rr)

    #################################################################################

    data_filtred = pd.DataFrame({"bird": bird, 
    "session": session,
    "pair": pair, 
    "time": time, 
    "time_diff": time_diff, 
    "resp_RI": resp_RI, 
    "resp_RR": resp_RR, 
    "reward_RI": reward_RI, 
    "reward_RR": reward_RR, 
    "time_since_last_reward_RI": time_since_last_reward_RI, 
    "time_since_last_reward_RR": time_since_last_reward_RR, 
    "visit_lenght_response_RI": visit_lenght_response_RI, 
    "visit_lenght_response_RR": visit_lenght_response_RR, 
    "responses_before_reward_RI": responses_before_reward_RI, 
    "responses_before_reward_RR": responses_before_reward_RR, 
    "visit_lenght_time_RI": visit_lenght_time_RI, 
    "visit_lenght_time_RR": visit_lenght_time_RR})

    return data_filtred

In [12]:
def extract_data(df: pd.DataFrame(), dir: str, birds: list, pairs: list, sessions: list):
    """The function `extract_data` extracts data from data frame for given birds, pairs and sessions, and saves into a directory, with the format: bird_pair_session.csv. If there are some problems with saving data, the function returns a list of birds, pairs and sessions that were not saved.

    Input: data: pd.DataFrame() - data frame with data ,
            dir: str - directory to save data,
            birds: list - list of birds to extract data from,
            pairs: list - list of pairs to extract data from,
            sessions: list - list of sessions to extract data from 
    Output: not_saved: list - list of birds, pairs and sessions that were not saved.
    """

    not_saved = []
    global data_frame
    for bird in birds:
        for session in sessions:
            for pair in pairs:
                print("Saving data for bird {}, pair {}, session {}".format(bird, pair, session))
                data_frame = df.loc[((df.event == 33) | (df.event == 44) 
                           | (df.event == 24) | (df.event == 46)
                             | (df.event == 12) | (df.event == 13)) 
                             & (df.session == session) & (df.bird == bird)
                             & (df.pair == pair)]

                data_frame.reset_index(inplace=True, drop=True)

                try:
                    assert ([13, 24, 33, 44, 46] in data_frame.event.unique()) or ([12, 13, 24, 33, 44, 46] in data_frame.event.unique()) or ([12, 24, 44, 33] in data_frame.event.unique()) or ([12, 24, 44, 46] in data_frame.event.unique()) or ([13, 24, 44, 33] in data_frame.event.unique()) or ([13, 24, 44, 46] in data_frame.event.unique()), Exception("There are events missing in the data dataframe") 
                    
                    transfor_df = transform_data(data_frame)
                    transfor_df.to_csv(dir + "{}_{}_{}.csv".format(bird, pair, session), index=False)
                
                except OSError:
                    raise OSError(f"The directory '{dir}' does not exist")

                except:
                    print("There are events missing in the data dataframe, the data for bird {}, pair {}, session {} was not saved".format(bird, pair, session))
                    not_saved.append([bird, pair, session])
                    continue
    return not_saved

In [13]:
#!mkdir data_2016
dir = "/Users/christianbadillo/Desktop/data_2016/"

not_saved = extract_data(df = df_last_sessions, dir=dir, birds = birds, pairs= pairs, sessions = sessions)

Saving data for bird P168, pair 0, session S50
There are events missing in the data dataframe, the data for bird P168, pair 0, session S50 was not saved
Saving data for bird P168, pair 1, session S50
Saving data for bird P168, pair 2, session S50
Saving data for bird P168, pair 3, session S50
Saving data for bird P168, pair 4, session S50
Saving data for bird P168, pair 5, session S50
Saving data for bird P168, pair 0, session S51
Saving data for bird P168, pair 1, session S51
Saving data for bird P168, pair 2, session S51
Saving data for bird P168, pair 3, session S51
Saving data for bird P168, pair 4, session S51
Saving data for bird P168, pair 5, session S51
There are events missing in the data dataframe, the data for bird P168, pair 5, session S51 was not saved
Saving data for bird P168, pair 0, session S52
Saving data for bird P168, pair 1, session S52
Saving data for bird P168, pair 2, session S52
Saving data for bird P168, pair 3, session S52
Saving data for bird P168, pair 4, s

In [15]:
# Data that was not saved because there are events missing in the data dataframe
# Or because does not exist data for that combination of bird, pair and session.

print(not_saved)

[['P168', 0, 'S50'], ['P168', 5, 'S51'], ['P168', 0, 'S53'], ['P168', 0, 'S55'], ['P168', 0, 'S59'], ['P168', 5, 'S62'], ['P168', 0, 'S64'], ['P168', 4, 'S64'], ['P168', 5, 'S65'], ['P168', 0, 'S66'], ['P168', 5, 'S67'], ['P168', 0, 'S70'], ['P168', 0, 'S71'], ['P168', 0, 'S72'], ['P168', 0, 'S74'], ['P168', 0, 'S75'], ['P168', 0, 'S76'], ['P168', 0, 'S78'], ['P168', 5, 'S80'], ['P168', 5, 'S81'], ['P168', 0, 'S82'], ['P168', 0, 'S85'], ['P168', 1, 'S85'], ['P168', 0, 'S86'], ['P168', 1, 'S88'], ['P168', 5, 'S89'], ['P168', 0, 'S91'], ['P168', 5, 'S91'], ['P168', 0, 'S92'], ['P168', 4, 'S92'], ['P168', 5, 'S92'], ['P168', 4, 'S93'], ['P168', 0, 'S97'], ['P168', 0, 'S98'], ['P168', 0, 'S101'], ['P168', 0, 'S102'], ['P168', 0, 'S103'], ['P168', 4, 'S103'], ['P168', 3, 'S105'], ['P168', 0, 'S108'], ['P168', 1, 'S109'], ['P423', 1, 'S50'], ['P423', 4, 'S50'], ['P423', 0, 'S51'], ['P423', 1, 'S51'], ['P423', 4, 'S51'], ['P423', 1, 'S52'], ['P423', 3, 'S52'], ['P423', 4, 'S52'], ['P423', 5, 