# Working With Multiple Data Files Using Copilot

In this lesson we will revisit some of the material covered at the end of the introductory chapter on Python. Specifically, we will read multiple data files, manipulate the data using pandas, and derive some basic information from the data.

The instructions are deliberately high-level: we want you to practice writing Copilot prompts yourself, rather than follow step-by-step guidance from us. The version of this lesson in the online textbook, however, shows the worked example with prompts and Copilot-generated code.

## Instructions

Your task is to perform the operations below, using Python code generated by a combination of your own writing and GitHub Copilot's suggestions. For your own learning, you are best off starting each step by trying to write the code yourself. If it works, great: you've learned something in Python! Congratulations. But the tasks below intentionally ask you to do things that we haven't taught in this course yet. That's because we want you to learn how to use Copilot to help you out when you don't know how to do something.

If you try writing your own code and it doesn't work, then you can try again, or use Copilot to help you out. Copilot is meant to speed up your coding, so in practice it's preferable to use it when you get stuck, rather than spend too much time trying to figure out how to do something yourself. On the other hand, once you have the solution from Copilot, it's important to try to understand what it is doing. You may have to look up some of the functions it's using, if you're not familiar with them.

If you're ignorant of what the code you generated is doing, you risk errors. Indeed, in the tutorial below we will see an example of Copilot's generating erroneous code. Even if the code works as desired, it is important for you to be able to explain correctly what you did. Also, look for ways to check that the results you get are correct (using code). 



### The Data and What to Do With It

There are three data files in the `data` directory: `s1.csv`, `s2.csv`, and `s3.csv`. Each file contains the reaction times (RTs) from 10 trials of a relatively simple task in which participants had to indicate which direction a briefly-presented arrow was pointing. The RTs are in seconds (s). Each file contains the RTs from a different participant. In each file there are three columns. You can determine what the columns are by looking at the first row (header) of each file.

Generate code that reads in the three files and combines them in a pandas DataFrame. Include error-checking code that verifies you loaded the correct number of trials (30) and the correct number of columns (3). If the number of trials or columns is incorrect, print an error message.

Save the dataframe to a file called `all_data.csv`, in the `data` folder.

Once you have the DataFrame, calculate:
- the mean RT for each participant
- the mean RT across all participants
- the 95% confidence intervals for the mean RT for each participant
- the 95% confidence intervals for the mean RT across all participants

Print all of the above results to the screen, rounded to 2 decimal places and with accompanying text that indicates what each result is. Be sure that it is clear that RT is in seconds.

As well, generate a plot of the mean RT for each participant, with error bars showing the 95% confidence intervals.

### How to Approach This Exercise

A critical skill to develop in coding is **problem decomposition**: breaking down a task into smaller and smaller components, so that you can write the code to perform each step in a logical sequence. The instructions above are written in a sequential way, so you should be able to identify each individual step that you need to take.

It is good practice, when working in Jupyter notebooks, to write the code for each step in a separate cell. This makes it easier to test each step, and to go back and change things if you need to. It also makes it easier to see what you've done, and understand what the code is doing. As well, it's sometimes helpful to put Markdown cells in between code cells, to provide longer explanations than might be appropriate for a comment in the code itself. You can also make notes about things you might want to change later, or interpretations of the output of the code.

### Using the Copilot Assistant

As noted above, we encourage you to try writing the code for each step of the instructions above. Coding is a *procedural* skill that you only learn by doing, and the more you learn, the better you will be at solving the bugs that Copilot-generated code will inevitably have. But if you get stuck, or encounter an instruction that you haven't yet learned how to perform in Python, then by all means use Copilot. Just like writing your own code, it is best if you use a separate notebook cell for each Copilot prompt. That way, if you need to edit the code, you can do so without having to re-run all of the code that Copilot generated. It's also easier to debug, because you can see the output of each cell and know exactly which line generated the error.

## Copilot-Generated Code for the Above

The cells below all reflect prompts written on the basis of the instructions above, and code generated entirely by Copilot. As you will see, it takes some trial and error – and critical thinking – to generate prompts that produce the code you want. But, it's a good way to learn how to use Copilot effectively. And if you do it right, you will learn a lot about Python along the way.

In [1]:
# read in three files from the data folder, whose names start with "s" and end in "csv"
# concatenate them into one dataframe

import pandas as pd
import glob


# get a list of all the files in the data folder that start with "s" and end with "csv"
file_list = glob.glob('data/s*.csv')


# read in the files and concatenate them into one dataframe
df = pd.concat([pd.read_csv(f) for f in file_list])


In [21]:
# print the first 5 rows of the dataframe
df.head()

|   | participantID | trial | RT |
|---|---|---|---|
| 0 | s2 | 1 | 0.433094 |
| 1 | s2 | 2 | 0.392526 |
| 2 | s2 | 3 | 0.396831 |
| 3 | s2 | 4 | 0.417988 |
| 4 | s2 | 5 | 0.371810 |
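
Notice that the first rows come from `s2`, not `s1`: `glob.glob()` returns files in whatever order the file system lists them. Also, because each file brings its own row index, the combined DataFrame's index repeats 0 to 9 three times. The following is a minimal sketch of one way to make the load order and the index predictable. It assumes the same `s*.csv` naming as the lesson, but uses a temporary folder with tiny made-up files (hypothetical values) so that it runs on its own:

```python
import glob
import os
import tempfile

import pandas as pd

# Stand-in for the lesson's data folder: three tiny made-up files (hypothetical values)
data_dir = tempfile.mkdtemp()
for pid in ["s2", "s3", "s1"]:
    demo = pd.DataFrame({"participantID": [pid] * 2, "trial": [1, 2], "RT": [0.40, 0.42]})
    demo.to_csv(os.path.join(data_dir, f"{pid}.csv"), index=False)

# sorted() fixes the order in which the files are read
file_list = sorted(glob.glob(os.path.join(data_dir, "s*.csv")))

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's own index
df_sorted = pd.concat([pd.read_csv(f) for f in file_list], ignore_index=True)

print(df_sorted["participantID"].unique())
print(df_sorted.index.is_unique)
```

With the lesson's real files, the pattern would be `'data/s*.csv'`, as in the cell above.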


In [3]:
# Check that the dataframe has 30 rows and 3 columns
df.shape

(30, 3)

In [4]:
# raise an error if the dataframe does not have 30 rows and 3 columns
assert df.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"
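
The instructions asked for a printed error message. `assert` instead stops execution with an `AssertionError`, which is a reasonable choice, but a print-based alternative might look like the sketch below. The function name `check_shape` and the small stand-in DataFrame are illustrative assumptions, not part of the lesson's code:

```python
import pandas as pd


def check_shape(frame, expected_rows=30, expected_cols=3):
    """Print a message for each mismatch; return True only if the shape is as expected."""
    n_rows, n_cols = frame.shape
    ok = True
    if n_rows != expected_rows:
        print(f"Error: expected {expected_rows} trials but found {n_rows}")
        ok = False
    if n_cols != expected_cols:
        print(f"Error: expected {expected_cols} columns but found {n_cols}")
        ok = False
    return ok


# Stand-in DataFrame with too few rows (made-up values), to trigger the message
demo = pd.DataFrame({"participantID": ["s1"] * 4, "trial": [1, 2, 3, 4], "RT": [0.4] * 4})
result = check_shape(demo)
```

With the lesson's data, the call would simply be `check_shape(df)`.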

In [5]:
# save the dataframe as a csv file named "all_subjects.csv" in the data folder
df.to_csv('data/all_subjects.csv')
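
By default, `to_csv()` also writes the DataFrame's index as an extra, unnamed first column. When the file is read back, that column appears as `Unnamed: 0`, so the reloaded data would no longer pass the three-column check. Here is a minimal sketch of the difference, using a temporary folder and a made-up two-row DataFrame in place of the real data:

```python
import os
import tempfile

import pandas as pd

# Made-up values standing in for the lesson's DataFrame
demo = pd.DataFrame({"participantID": ["s1", "s1"], "trial": [1, 2], "RT": [0.40, 0.42]})
out_dir = tempfile.mkdtemp()

# Default behaviour: the index is written as an extra first column
with_index = os.path.join(out_dir, "with_index.csv")
demo.to_csv(with_index)
print(pd.read_csv(with_index).columns.tolist())

# index=False: only the real columns are written
without_index = os.path.join(out_dir, "without_index.csv")
demo.to_csv(without_index, index=False)
print(pd.read_csv(without_index).columns.tolist())
```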

In [6]:
# calculate mean rt for each subject
df.groupby('subject').mean()

KeyError: 'subject'

## Debugging Copilot-Generated Code

Typically, when you get a long, scary error message like the one above, you can ignore a lot of what is in the middle. The most important parts are the last line, which tells you what the error is, and the first lines, which usually indicate which line of the code you tried to run caused the error.

In this case, we see a `KeyError: 'subject'` at the bottom of the error message. Recall that Python dictionaries are sets of **key**-value pairs. A `KeyError` means that the code asked for a key that does not exist.

You can think of a pandas DataFrame as a dictionary in which the column names are keys, and the values in that column are its values. So the error message above indicates that the code tried to access a column called `subject` in the DataFrame, but that column doesn't exist. We can see at the top of the error message that the code that caused the error is: `---> 2 df.groupby('subject').mean()`. This means it's line 2 of the cell above that caused the error. 
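
The dictionary analogy can be made concrete. In this sketch, a plain dictionary and a small stand-in DataFrame (made-up values) both raise `KeyError` when asked for a name they don't contain, and the `in` operator offers a way to check before asking:

```python
import pandas as pd

# Made-up values for illustration
rt_dict = {"participantID": ["s1", "s1"], "RT": [0.40, 0.42]}
demo = pd.DataFrame(rt_dict)

# A dictionary raises KeyError for a missing key...
try:
    rt_dict["subject"]
except KeyError as err:
    print("dict KeyError:", err)

# ...and so does a DataFrame when asked to group by a missing column
try:
    demo.groupby("subject")
except KeyError as err:
    print("DataFrame KeyError:", err)

# Checking membership first avoids the error
print("subject" in demo.columns)
print("participantID" in demo.columns)
```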

As a novice coder, however, you might not know or remember that pandas DataFrames represent column names and data as dictionary key-value pairs. If you don't understand what an error means, you can use Copilot to help you!


Copilot has a few features that can help you diagnose and fix your code. If you right-click in a code cell, you will see `Copilot` within the menu that pops up. Under that heading there are a number of options, including `Explain This` and `Fix This`.



### Copilot `Explain This`

A great place to start in diagnosing errors is `Explain This`. Intuitively, you might go straight for `Fix This`. However, it's probably instructive to see what Copilot thinks your code is doing, and if that is what *you* think your code is doing. Copilot's explanation might give you insight into the error. Below is the output from selecting `Explain This` for the line of code that caused the error above:

> This code is designed to calculate the mean reaction time (RT) for each subject in a dataframe. The code uses the groupby() method to group the data by the 'subject' column and then calculates the mean for each group using the mean() method.
>
>The resulting dataframe contains the mean RT for each subject. The index of the dataframe is the unique values in the 'subject' column, and the columns are the other numerical columns in the original dataframe.
>
> This code is useful when working with data that has multiple subjects and you want to calculate summary statistics for each subject. By using the groupby() method, you can easily calculate summary statistics for each group without having to manually subset the data.
>
> One possible way to improve this code would be to specify the column to calculate the mean for, rather than calculating the mean for all numerical columns. This would make the code more specific and reduce the risk of errors. Additionally, the code could be modified to calculate other summary statistics, such as the standard deviation or median, by using the appropriate method instead of the mean() method.

First off, this shows how Copilot can be a great learning tool. It's giving us a clear explanation of what the code is doing, and teaching us a bit about how the `.groupby()` method works. It also suggests some improvements to the code, which we can take or leave.

None of this explanation tells us directly why the code generates an error, but at least we have an explanation of what the code is actually doing, which we can use to help us figure out what is going wrong. 

The first paragraph of Copilot's explanation notes that the code is grouping the data by the `subject` column. This could be a clue to us, since the error message also references `subject`. 

### Copilot `Fix This`

If you select `Fix This` from the Copilot menu, you will see what's called a **diff**, as shown below. A diff is a way of showing the difference between two pieces of code. The code on the left (with a pink background) is the code you wrote, and the code on the right (with a green background) is the code that Copilot suggests. In this case, the suggested "fix" would not fix the problem. 

![](images/copilot_fix_this.png)


The suggested code is doing two things differently from the original code. Firstly, it's suggesting that you assign the output of the command to a variable, and then print the variable. Assigning output to a variable and then printing it is in general good coding style, because we may want to use that variable again later in our code. But that won't fix the error.

The other difference, which is subtle to a novice coder, is that the new code is referencing the column `Subject` rather than `subject`. Recall that Python is **case-sensitive**, meaning that `Subject` and `subject` are not the same thing. Copilot seems to be guessing that the error might be caused by a common mistake, which is to refer to a variable using the wrong case.


You can click the `Discard` button to reject Copilot's suggestion, and press the `Escape` key to exit the diff view.


<div class="alert alert-info">
Copilot does not generate the same code every time you type the same prompt. This is the nature of the type of generative AI that Copilot uses. So, when working through this lesson, you may get different outputs than what is shown in the lesson. Try to use the processes explained here for interpreting error messages and Copilot's descriptions and suggested code. It's much more important to get familiar with the process of debugging, through the thought processes explained here, than to worry about getting different suggestions from Copilot.
</div>


### Copilot `Chat` 

![](images/chat_icon.png) Another really useful option is `Chat`, which you can open by clicking the `Chat` icon in the Activity Bar (left side of VS Code). This allows you to have a conversation with Copilot about what you want to do. If you've used ChatGPT before, this will be familiar to you.

There are different approaches to using Copilot Chat to help fix an error. One is to click in the cell that generates the error, then click in the Copilot Chat input box and type: `the selected cell generates an error. how can i fix this error?` In response to this prompt for the present error, I got the following response:

>If you are getting an error with the selected cell, there could be a few reasons why. Here are some possible solutions:
>
>1. Check that the 'subject' column is present in the dataframe. You can do this by printing out the dataframe using the `print()` function and checking that the 'subject' column is present.
>
>2. Check that the 'subject' column is spelled correctly. If the column is misspelled, you will get a "KeyError" when trying to group by that column.
>
>3. Check that the 'subject' column is of the correct data type. If the 'subject' column is not a string or categorical data type, you may need to convert it using the `astype()` method.
>
>4. Check that the dataframe contains numerical columns. If the dataframe only contains non-numerical columns, you will get a "DataError" when trying to calculate the mean.
>
>5. Check that the dataframe does not contain missing values. If the dataframe contains missing values, you may need to remove them using the `dropna()` method before calculating the mean.

### Finding the error

These are actually really good responses, and provided in a logical order to go through in debugging. The first thing to do is to check that the column exists, and is spelled correctly. So let's start with that. Do you remember how to check the names of the columns in a pandas DataFrame? If not, you can write a Copilot prompt to help you out. In the cell below, write a prompt that will print the names of the columns in the DataFrame. Then run the cell, and see what happens.

In [7]:
# print the column names of the dataframe
df.columns

Index(['participantID', 'trial', 'RT'], dtype='object')

We can also look at the first few rows of the DataFrame, which includes the column names:

In [8]:
# print the first few rows of the dataframe
df.head()

|   | participantID | trial | RT |
|---|---|---|---|
| 0 | s2 | 1 | 0.433094 |
| 1 | s2 | 2 | 0.392526 |
| 2 | s2 | 3 | 0.396831 |
| 3 | s2 | 4 | 0.417988 |
| 4 | s2 | 5 | 0.371810 |


Another option, when using Jupyter notebooks with VS Code, is to click on the `Variables` button in the toolbar at the top of the notebook window. This will pop up a variable viewer in a sub-window below your notebook. You can click on the variable names to see their values. For DataFrames, it actually shows a list of the columns in the window, and you can double-click on the variable name to see the contents of the DataFrame in another window, the Data Viewer. This view is similar to a spreadsheet. In fact, you can directly edit values in the Data Viewer. *You should never directly edit values like this*, however. Any steps you do manually are not documented in your code, and are not reproducible.

The screenshot below shows the variables and Data Viewer for the current context.

![](images/inspectors.png)

If we look at the column names, we see that they are `participantID`, `trial`, and `RT`. The code that generated the error was trying to access a column called `subject`, which doesn't exist. It should be `participantID`. So we need to change the code to access the correct column name:

In [9]:
# calculate mean rt for each subject
df.groupby('participantID').mean()

| participantID | trial | RT |
|---|---|---|
| s1 | 5.5 | 0.389548 |
| s2 | 5.5 | 0.444785 |
| s3 | 5.5 | 0.446009 |


This looks good; however, the code is providing means for both numeric columns in the DataFrame, not just for `RT`. We can add to our prompt to tell it not to include `trial` in the output:

In [11]:
# calculate mean rt for each subject. Do not show the mean for trial 
df.groupby('participantID').mean().drop('trial', axis=1)

| participantID | RT |
|---|---|
| s1 | 0.389548 |
| s2 | 0.444785 |
| s3 | 0.446009 |


The above generated code does what we want. However, from the perspectives of coding style and efficiency, it's not optimal. It seems unnecessary to compute the mean for `trial` and then drop it. We can modify the prompt in a way that generates more efficient code:

In [13]:
# calculate mean for each subject using the RT column
df.groupby('participantID')['RT'].mean()

participantID
s1    0.389548
s2    0.444785
s3    0.446009
Name: RT, dtype: float64

<div class="alert alert-info">
One thing you may notice is that the above code prints its result in a less "pretty" format than the one before. This is not really important here, but it's worth understanding why the difference occurs. When you display a pandas DataFrame in a notebook, it is rendered as a nicely formatted table. A pandas Series, however, is displayed as plain text. Calling `.mean()` after selecting a single column (`RT`) from the `.groupby()` result produces a Series, whereas calling `.mean()` on all of the columns (`trial` and `RT`) produces a DataFrame. In the earlier case, the code created a DataFrame and then dropped the `trial` column, so the result remained a DataFrame.
<p><p>
We'll worry about the formatting later, but it's good to understand why it happens.
</div>
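
The distinction can be checked directly with `type()`. Here is a minimal sketch with made-up values; selecting with double brackets (`[['RT']]`) is one way to keep the result as a DataFrame:

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({
    "participantID": ["s1", "s1", "s2", "s2"],
    "trial": [1, 2, 1, 2],
    "RT": [0.40, 0.42, 0.44, 0.46],
})

# Single brackets select one column, so the result is a Series
as_series = demo.groupby("participantID")["RT"].mean()
print(type(as_series))

# Double brackets select a list of columns, so the result stays a DataFrame
as_frame = demo.groupby("participantID")[["RT"]].mean()
print(type(as_frame))
```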

## Continue Coding

Having fixed that error, we can now move on with the steps in the instructions. The next one was to calculate the mean RT across subjects. Do that below:

In [14]:
# calculate the mean rt across all subjects
df['rt'].mean()

KeyError: 'rt'

When I tried this, I got another `KeyError`, this time for `rt`. This is because the code is trying to access a column called `rt`, but the column name is actually `RT`. Again, let's start by trying to fix this ourselves, and then use Copilot if we get stuck.

The code above selects the column in `df` called `rt` and computes its mean. However, if you look back to the listing of column names that we generated earlier, you'll see that the column name is actually `RT`. Python is **case sensitive**, meaning that capitalization matters: `RT` and `rt` are not the same. So we need to change the code to access the correct column name:

In [15]:
df['RT'].mean()

0.4267806816333334
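
Because capitalization mistakes like this are so common, it can help to look a column up without regard to case before using it. The helper below is a hypothetical convenience (the name `find_column` is not a pandas function), sketched on a made-up DataFrame:

```python
import pandas as pd


def find_column(frame, name):
    """Return the actual column name matching `name`, ignoring case (hypothetical helper)."""
    matches = [col for col in frame.columns if col.lower() == name.lower()]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one column matching {name!r}, found {matches}")
    return matches[0]


# Made-up values for illustration
demo = pd.DataFrame({"participantID": ["s1", "s2"], "trial": [1, 1], "RT": [0.40, 0.44]})

rt_col = find_column(demo, "rt")
print(rt_col)
print(demo[rt_col].mean())
```

This is only a debugging aid; once you know the correct name, it is clearer to write `df['RT']` directly.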

Having fixed that error, let's move on to the next step, which is to calculate the 95% confidence intervals for the mean RT for each participant:

In [16]:
# calculate 95% confidence intervals for each subject
df.groupby('subject').mean().apply(lambda x: x.sem() * 1.96, axis=1)

KeyError: 'subject'

Again we get `KeyError: 'subject'`, which is the same as the first error above. So again, Copilot is labelling the column incorrectly. We know that we can fix this by replacing `subject` with `participantID`, but it's a bit annoying that Copilot keeps making this error. On the other hand, many errors are actually human errors. We are prompting Copilot to do something for each "subject", which then requires Copilot to know that by "subject" we really mean the column called `participantID`. Maybe we could engineer our prompt to be better instead, by referring to the column name directly. Let's try that:

In [35]:
# calculate 95% confidence intervals for each participantid
df.groupby('participantID').mean().apply(lambda x: x.sem() * 1.96, axis=1)


participantID
s1    5.008243
s2    4.954111
s3    4.952911
dtype: float64

## Pink Elephants

This is a frightening example of a Copilot **hallucination**. The above code generates what you might think is a plausible result. However, there are a couple of problems. Firstly, confidence intervals (CIs) are typically reported as a pair of values. The 95% CI is usually interpreted as the range within which we can be 95% confident that the true mean lies, so we need both a lower and an upper bound. These are typically the same distance below and above the mean. So we need to modify the code to generate the upper and lower values of the CIs.

Secondly, if you look back at the mean RT for each participant, they were in the range of 0.39 to 0.45 s. Confidence intervals reflect the range of possible true values for the mean, given the sample data. So for our confidence intervals to have values around 5 s seems implausible, given the mean RTs are < 500 ms. We would expect the CIs to span perhaps 100-200 ms. So it seems like Copilot is actually calculating the CIs incorrectly.

This highlights the critical importance of understanding not only what your code is doing, but what you are asking it to do. You should always check the output of your code to make sure it makes sense, and also make sure you understand what you are supposed to be doing with the code.
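
It is worth tracing where a value near 5 comes from. The generated line first takes the mean of *every* numeric column per participant (for `s1`, 5.5 for `trial` and about 0.39 for `RT`). Then, because of `axis=1`, it computes the standard error across those two unrelated numbers in each row. This small sketch reproduces the `s1` value from the means shown earlier:

```python
import pandas as pd

# Per-participant means for s1, copied from the groupby output earlier in the lesson
s1_means = pd.Series({"trial": 5.5, "RT": 0.389548})

# What the lambda computed for the s1 row: the SEM of [5.5, 0.389548], times 1.96
bogus_ci = s1_means.sem() * 1.96
print(round(bogus_ci, 6))
```

The result agrees with the 5.008243 reported for `s1`, which confirms that the number describes the spread between the `trial` and `RT` means rather than the variability of the reaction times.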

## Copilot Chat to the Rescue

We can use Copilot Chat to help us figure out what is going on. If we click on the `Chat` icon in the Activity Bar, we can type in a question to Copilot. In this case, I asked Copilot the following:

> the highlighted code is supposed to compute the 95% confidence intervals of mean RT. However, the output values are incorrect. Can you please suggest code that correctly calculates the 95% confidence intervals, and provides both the upper and lower CIs in a table

Copilot provided the answer below:

>> To correctly calculate the 95% confidence intervals for the mean RT and provide both the upper and lower CIs in a table, you can modify the existing code as follows:

```python
import numpy as np

# calculate mean and standard error of the mean for each participant
mean_rt = df.groupby('participantID')['RT'].mean()
sem_rt = df.groupby('participantID')['RT'].sem()

# calculate 95% confidence intervals for the mean RT
ci_rt = sem_rt * 1.96

# calculate upper and lower CIs
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt

# create a table of the upper and lower CIs
ci_table = pd.DataFrame({'Upper CI': upper_ci, 'Lower CI': lower_ci})

# print the table of CIs
print(ci_table)
```

>> In this code, the `groupby()` method is used to group the data by the 'participantID' column, and the `mean()` and `sem()` methods are used to calculate the mean and standard error of the mean for the 'RT' column for each group. The resulting series contains the mean and SEM for each unique value in the 'participantID' column.
>> 
>> The `ci_rt` variable calculates the 95% confidence intervals for the mean RT using the formula `SEM * 1.96`, where `SEM` is the standard error of the mean.
>> 
>> The `upper_ci` and `lower_ci` variables calculate the upper and lower CIs by adding and subtracting the `ci_rt` value from the `mean_rt` value, respectively.
>> 
>> The `ci_table` variable creates a table of the upper and lower CIs using the `pd.DataFrame()` function.
>> 
>> Finally, the `print()` function is used to print the table of CIs.
>> 
>> This code should give you the desired output of a table containing the upper and lower CIs for the mean RT for each participant.

If you move your cursor over the code that Copilot generated, you will see a `Copy` button appear. Clicking this will copy the code to your clipboard, so you can paste it into your notebook. We'll run this code below:

In [59]:
import numpy as np

# calculate mean and standard error of the mean for each participant
mean_rt = df.groupby('participantID')['RT'].mean()
sem_rt = df.groupby('participantID')['RT'].sem()

# calculate 95% confidence intervals for the mean RT
ci_rt = sem_rt * 1.96

# calculate upper and lower CIs
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt

# create a table of the upper and lower CIs
ci_table = pd.DataFrame({'Upper CI': upper_ci, 'Lower CI': lower_ci})

# print the table of CIs
print(ci_table)

               Upper CI  Lower CI
participantID                    
s1             0.430626  0.348470
s2             0.496197  0.393373
s3             0.515907  0.376111


OK, this is great progress. We now have our output in a table with both upper and lower CIs, and we can see that the CIs are in the range we would expect. It wouldn't hurt to cross-check the formula that Copilot is using against a textbook or online resource.

### Cross-Check Copilot's Code with Other Sources

Consistent results from a number of top hits on a Web search for `confidence interval formula` indicate that the formula for calculating the 95% CI for the mean is:

```python
ci_upper = mean + 1.96 * (std / sqrt(n))
ci_lower = mean - 1.96 * (std / sqrt(n))
```

where `mean` is the mean of the data, `std` is the standard deviation of the data, and `n` is the number of observations.

Copilot's code instead multiplied 1.96 by `sem_rt`, which is the standard error of the mean. If you look up the formula for standard error, you will find that it is the standard deviation of the data divided by the square root of the number of observations, which is exactly the `std / sqrt(n)` term above. So Copilot's code is correct: it expresses the same formula in a different form from the one I found online. Now we can be confident that the code is doing what we want it to do.
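
The relationship between the two forms of the formula can also be checked numerically. This is a minimal sketch using ten made-up reaction times (not the lesson's data), comparing pandas' `.sem()` with `std / sqrt(n)` computed by hand:

```python
import numpy as np
import pandas as pd

# Made-up reaction times (s), for illustration only
rt = pd.Series([0.41, 0.39, 0.44, 0.37, 0.42, 0.40, 0.45, 0.38, 0.43, 0.41])

sem_pandas = rt.sem()                      # pandas' built-in standard error of the mean
sem_manual = rt.std() / np.sqrt(len(rt))   # std / sqrt(n), as in the formula above

# The two should agree (both use the sample standard deviation by default)
print(np.isclose(sem_pandas, sem_manual))

ci_lower = rt.mean() - 1.96 * sem_manual
ci_upper = rt.mean() + 1.96 * sem_manual
print(round(ci_lower, 3), round(ci_upper, 3))
```

Note that the 1.96 multiplier comes from the normal distribution. With only 10 trials per participant, a *t*-based multiplier would give somewhat wider intervals; the sketch keeps 1.96 to match the formula used in the lesson.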

## Carrying On

Let's use the same approach for the next step, calculating 95% CIs across all participants:

In [18]:
# calculate 95% confidence intervals across participantid
df['rt'].mean() + df['rt'].sem() * 1.96


KeyError: 'rt'

Now we're back to the `KeyError` for `rt` that we saw before, which was due to the case-sensitivity of Python. We can fix it manually, or think about how to make our prompts more precise, to reduce the chance of Copilot making this error. Let's try the latter:

In [19]:
# calculate 95% confidence intervals for RT across participantid
df['RT'].mean() + df['RT'].sem() * 1.96

0.45911120518780957

So yes, by explicitly providing the correctly-capitalized name of the column we want, we can avoid the error. But we can also see that Copilot is not very good at understanding our prompts in the specific context of the notebook file or active kernel that we're working in. It would be nice if Copilot were more sensitive to the names of variables that are actually defined in our kernel, and that will probably come in the future. But for now, we need to be careful about how we write our prompts, to be as explicit as we can about what we actually want. Often this requires you to think a bit more deeply about the structure of your data, and what you want to do with it, and be more *precise* in the phrasing of your prompts. Over time, developing this skill leads to better critical and analytical thinking, and makes you a better data scientist.
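
Note that the generated line adds the margin to the mean, so it returns only the upper bound of the interval. Below is a sketch of one way to get both bounds, using the same `SEM * 1.96` formula as before. The function name `ci95` and the sample values are illustrative assumptions:

```python
import pandas as pd


def ci95(values):
    """Return (lower, upper) bounds of the 95% CI of the mean, as mean ± 1.96 * SEM."""
    mean = values.mean()
    margin = values.sem() * 1.96
    return mean - margin, mean + margin


# Made-up reaction times (s) standing in for df['RT']
rt = pd.Series([0.41, 0.39, 0.44, 0.37, 0.42, 0.40, 0.45, 0.38, 0.43, 0.41])

lower, upper = ci95(rt)
print(f"95% CI for mean RT: {lower:.2f} s to {upper:.2f} s")
```

With the lesson's data, the call would be `ci95(df['RT'])`. From the values printed above, the margin is about 0.032 s (0.459 minus 0.427), which would put the lower bound near 0.39 s.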

## Formatting the output nicely

The next instruction is to "Print all of the above results to the screen, rounded to 2 decimal places and with accompanying text that indicates what each result is. Be sure that it is clear that RT is in seconds.". In a typical workflow, we might want to get each of the above calculations working correctly, and then modify the code in those cells further to get nice-looking output. However, for explanatory purposes we will keep the above ones as they are, and copy the working prompts from above so that we can add instructions for formatting. 

In [20]:
# calculate mean RT for each participantid rounded to 2 decimal places
# and provide accompanying text that indicates what each result is,
# and that RT is in seconds
df.groupby('participantID').mean().round(2).apply(lambda x: str(x) + ' seconds', axis=1)

participantID
s1    trial    5.50\nRT       0.39\nName: s1, dtype:...
s2    trial    5.50\nRT       0.44\nName: s2, dtype:...
s3    trial    5.50\nRT       0.45\nName: s3, dtype:...
dtype: object

This seems almost right, but there are weird extra characters in there: `\n`. As well, it is outputting the means for both `RT` and `trial`, even though we only asked for `RT`. 

At this point, we could try to fix this ourselves, which would involve some Web searching to first understand what this line of code is actually doing (if you're not already familiar with the extended chain of methods and functions in this complex command), and then figuring out how to generate better output. But again, we can use Copilot to expedite this process. 

![](images/chat_icon.png) Click again on the Copilot Chat icon in the Activity Bar. Let's first ask for an explanation of the code. This is a complex line of code with four operations chained together, including an embedded `lambda` function, which we haven't learned about yet.

To get an explanation of the code, I first selected the entire line of code with my cursor. Then, in the Copilot Chat window, I typed `explain the highlighted line of code`. In response, I got the following explanation:

>The highlighted line of code calculates the mean of all numerical columns in the dataframe `df` for each unique value in the 'participantID' column. The resulting dataframe contains the mean for each numerical column, with the index being the unique values in the 'participantID' column.
>
>The `groupby()` method is used to group the data by the 'participantID' column, and the `mean()` method is used to calculate the mean for each group. The resulting dataframe contains the mean for each numerical column, with the index being the unique values in the 'participantID' column.
>
>The `round()` method is used to round the mean values to two decimal places, and the `apply()` method is used to add the string ' seconds' to each value in the dataframe. The `axis=1` parameter specifies that the function should be applied to each row of the dataframe.
>
>This code is useful when working with data that has multiple participants and you want to calculate summary statistics for each participant. By using the `groupby()` method, you can easily calculate summary statistics for each group without having to manually subset the data.
>
>The resulting dataframe can be used to create a summary table or plot of the mean values for each participant. The `apply()` method can be modified to add different strings or formatting to the mean values, depending on the desired output.

This is a good explanation of what the code is doing. But it doesn't provide a lot of insight into why the formatting is off, or why `trial` is included. So let's try asking Copilot a more specific question about the things we don't like about the code. I tried this prompt: `the highlighted code prints meants for both trial and RT, but I want it to only print the mean RT for each participantid. suggest code to do what I want`. In response, I got the following:

>To print only the mean RT for each participantID, you can modify the existing code to select only the 'RT' column before calling the `groupby()` and `mean()` methods. Here's an example:
>

```python
df.groupby('participantID')['RT'].mean().round(2).apply(lambda x: str(x) + ' seconds')
```

>
>In this code, the `groupby()` method is used to group the data by the 'participantID' column, and the `mean()` method is used to calculate the mean of the 'RT' column for each group. The resulting dataframe contains the mean RT for each unique value in the 'participantID' column.
>
>The `round()` method is used to round the mean values to two decimal places, and the `apply()` method is used to add the string ' seconds' to each value in the dataframe. The resulting series contains the mean RT for each unique value in the 'participantID' column, with the index being the unique values in the 'participantID' column.
>
>This code should give you the desired output of only the mean RT for each participantID, rounded to two decimal places and with the string ' seconds' added to each value.

Let's try that suggested code and see what we get:

In [23]:
df.groupby('participantID')['RT'].mean().round(2).apply(lambda x: str(x) + ' seconds')

participantID
s1    0.39 seconds
s2    0.44 seconds
s3    0.45 seconds
Name: RT, dtype: object

This is great - now we have exactly what we want, and the code is even conveniently telling us that the units are seconds.
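If you're curious why selecting `['RT']` fixed both problems, the following sketch may help. It uses a small made-up DataFrame (`df_demo`, with invented values) rather than the lesson's data, so treat it as an illustration under those assumptions rather than a record of what Copilot did internally.

```python
import pandas as pd

# Hypothetical toy data standing in for the lesson's `df`
df_demo = pd.DataFrame({
    'participantID': ['s1', 's1', 's2', 's2'],
    'trial': [1, 2, 1, 2],
    'RT': [0.40, 0.50, 0.30, 0.50],
})

# Without selecting a column, mean() is computed for every numeric column,
# which is why `trial` showed up alongside `RT`
all_means = df_demo.groupby('participantID').mean().round(2)
print(list(all_means.columns))

# With axis=1, the lambda receives a whole row (a Series). Converting a
# Series to a string gives a multi-line printout, hence the "\n" characters
row_strings = all_means.apply(lambda x: str(x) + ' seconds', axis=1)
print(repr(row_strings['s1']))

# Selecting 'RT' first yields a Series, so the lambda receives one number
# at a time and str() gives a clean, single-line result
rt_strings = (
    df_demo.groupby('participantID')['RT']
    .mean()
    .round(2)
    .apply(lambda x: str(x) + ' seconds')
)
print(rt_strings)
```

Printing with `repr()` is a handy way to make hidden characters such as `\n` visible when output looks odd.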

Next we want to print the mean across all participants, formatted as above. 

In [27]:
# calculate mean RT across participantid rounded to 2 decimal places
# and provide accompanying text that indicates what each result is,
# and that RT is in seconds
df['RT'].mean().round(2).apply(lambda x: str(x) + ' seconds')

AttributeError: 'numpy.float64' object has no attribute 'apply'

I tried right-clicking, going to Copilot, and then selecting `Fix This`. This generated the following code:

```python
# calculate mean RT across participantid rounded to 2 decimal places
# and provide accompanying text that indicates what each result is,
# and that RT is in seconds
mean_rt = df.groupby('participantid')['RT'].mean().mean().round(2)
result_text = str(mean_rt) + ' seconds is the mean RT across participantid'
result_text
result_text = str(mean_rt) + ' seconds is the mean RT across participantid'
result_text
```

This reflects a phenomenon that seems to happen regularly with Copilot, which is that it repeats itself. Note that the last two lines of code are identical to the two lines before them. This is a bit annoying, but it's not a big deal. We can just delete the last two lines of code, and we're good to go.

Also, look closely at the code: there's another bug! This time, Copilot has made a mistake in the code it generated. It's trying to group by `participantid`, but the column name is actually `participantID`. Again, you need to be paying close attention to the code, and details like capitalization. In this case, I was being sloppy with capitalization in my prompts. Even though I already realized that I had to use `RT` and not `rt`, Copilot initially let me get away with lazily typing `participantid` and not `participantID`. But now it's come back to bite me.
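A quick way to catch this kind of mismatch is to look at the exact column names before running generated code. Here is a minimal sketch, assuming a made-up DataFrame (`df_demo`) with the same column names as the lesson's data:

```python
import pandas as pd

# Hypothetical stand-in for the lesson's `df`
df_demo = pd.DataFrame({
    'participantID': ['s1', 's2'],
    'RT': [0.39, 0.44],
})

# Column names are case-sensitive, so check the exact spelling
print(list(df_demo.columns))
print('participantid' in df_demo.columns)   # lowercase "id"
print('participantID' in df_demo.columns)   # uppercase "ID"

# Grouping by a name that doesn't exist raises a KeyError
try:
    df_demo.groupby('participantid')['RT'].mean()
    raised_key_error = False
except KeyError as err:
    raised_key_error = True
    print('KeyError:', err)
```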

Since I think I understand where the error is coming from, it's more expedient to fix it myself, so I copy and paste the code from above, and then edit it to capitalize `ID`.

In [33]:
# calculate mean RT across participantid rounded to 2 decimal places
# and provide accompanying text that indicates what each result is,
# and that RT is in seconds
mean_rt = df.groupby('participantID')['RT'].mean().mean().round(2)
result_text = str(mean_rt) + ' seconds is the mean RT across participantID'
result_text

'0.43 seconds is the mean RT across participantID'
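Two side notes on this step, sketched with made-up data (`df_demo` and its values are hypothetical). First, the earlier error occurred because `df['RT'].mean()` returns a single number rather than a Series, so there is nothing to `.apply()` over; an f-string can format a single number directly. Second, the fixed code averages the participant means, which matches the mean over all trials only when every participant contributes the same number of trials.

```python
import pandas as pd

# Hypothetical data in which s1 has three trials and s2 has only one
df_demo = pd.DataFrame({
    'participantID': ['s1', 's1', 's1', 's2'],
    'RT': [0.2, 0.2, 0.2, 0.6],
})

# Mean over all trials: a single number, not a Series
grand_mean = df_demo['RT'].mean()

# Mean of the per-participant means: each participant counts equally
mean_of_means = df_demo.groupby('participantID')['RT'].mean().mean()

# f-strings round and format a single number in one step
grand_text = f'{grand_mean:.2f} seconds is the mean RT across all trials'
means_text = f'{mean_of_means:.2f} seconds is the mean RT across participantID'
print(grand_text)
print(means_text)
```

Which of the two is appropriate depends on the question being asked; the point is only that they are different calculations.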

Now that that is fixed, let's move on to the next step, which is to calculate the 95% confidence intervals for the mean RT for each participant:

In [34]:
# calculate 95% confidence intervals for each participantid rounded to 2 decimal places
# and provide accompanying text that indicates what each result is,
# and that RT is in seconds
df.groupby('participantID').mean().apply(lambda x: x.sem() * 1.96, axis=1).round(2).apply(lambda x: str(x) + ' seconds')


participantID
s1    5.01 seconds
s2    4.95 seconds
s3    4.95 seconds
dtype: object
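It's worth checking this output against what we already know: the mean RTs were about 0.4 seconds, so intervals of about 5 seconds look implausible. One reading of the command above is that, because `axis=1` is applied after `mean()`, `sem()` is computed across the `trial` and `RT` means within each row, rather than across each participant's trials. Below is a minimal sketch of the difference, using made-up data (`df_demo`) and the common 1.96 normal approximation (an assumption; with few trials, a *t*-based multiplier would give wider intervals).

```python
import pandas as pd

# Hypothetical data: two participants, four trials each
df_demo = pd.DataFrame({
    'participantID': ['s1'] * 4 + ['s2'] * 4,
    'trial': [1, 2, 3, 4] * 2,
    'RT': [0.38, 0.40, 0.42, 0.40, 0.50, 0.46, 0.44, 0.48],
})

# Same pattern as the command above: SEM across the two column means
# (trial and RT) in each row, which mixes unrelated quantities
across_columns = (
    df_demo.groupby('participantID')
    .mean()
    .apply(lambda x: x.sem() * 1.96, axis=1)
)

# SEM of RT across each participant's own trials, times 1.96,
# gives an approximate 95% CI half-width per participant
ci_half_width = df_demo.groupby('participantID')['RT'].sem() * 1.96

print(across_columns.round(2))
print(ci_half_width.round(3))
```

In this toy example the first result is dominated by the trial numbers, while the second is on the same scale as the RTs themselves.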

## There's a flow error here...

## What Copilot Did


To fix the code to only show the mean RT, you need to modify the code to only select the 'rt' column before calculating the mean. You can do this by changing the first line of code to:



In [None]:
df['rt'].groupby(df['participantID']).mean()



This code selects only the 'rt' column using the indexing operator and then groups the data by 'participantID' before calculating the mean. This will give you the mean RT for each participantID.

You can then modify the rest of the code to use this new dataframe. Here is the modified code:



In [None]:
# calculate mean rt for each participantID
mean_rt = df['rt'].groupby(df['participantID']).mean()

# calculate mean rt across participantIDs
mean_rt_all = mean_rt.mean()

# calculate 95% confidence intervals for each participantID
ci_rt = mean_rt.sem() * 1.96

# calculate 95% confidence intervals across participantIDs
ci_rt_all = mean_rt_all.sem() * 1.96

# print all of the above results in a nice format
print('Mean RT for each participantID:')
print(mean_rt)
print('\nMean RT across participantIDs:')
print(mean_rt_all)
print('\n95% confidence intervals for each participantID:')
print(ci_rt)
print('\n95% confidence intervals across participantIDs:')
print(ci_rt_all)

# plots mean RT across subjects, with 95% confidence intervals
mean_rt.plot(yerr=ci_rt)



This code will now only calculate the mean RT and will print and plot the results accordingly.

In [None]:
df.head()

The code above does most of what we want, but it does generate another message – this time a warning, not an error. Warnings are worth looking at as well, because they can indicate that something is not quite right. Sometimes they are things like "deprecation notices", which occur when a package has changed, and the old way of doing something is no longer supported. However, this warning reads, `UserWarning: Warning: converting a masked element to nan.` What does this mean? 

