# Counting Survey Responses

Data comes in all shapes and sizes. Extracting useful insights from data often involves preparation and *cleaning*. Usually, data is not directly available in our Python code; instead, it is stored in separate files or databases. One common file type for storing data is **JSON** (JavaScript Object Notation). Python comes with a built-in `json` library for reading these `.json` files, which we can use to further analyse the data therein.

Let's say a colleague recently undertook a survey to determine which Generative AI chatbot was most used by students in their department. The survey software they used asked the following question and invited responses via the multi-checkbox seen in the image below:

<center><img src="../Resources/survey_checkbox.png" style="height:300px" /></center>

The survey software captured fifty responses in total, and your colleague exported the survey data to a JSON file. A short snippet of the JSON output can be seen in the image below.

<center><img src="../Resources/survey_responses.png" style="height:300px" /></center>

Each response is captured as a list, so:

- The first respondent selected **ChatGPT and Claude**
- The second selected only **ChatGPT**
- The third selected **Claude, Gemini and ChatGPT**
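
As a rough illustration, the three responses above could be held in Python as a list of lists. The variable name `example_responses` and the exact spelling of the tool names are assumptions made for this sketch; the real file may wrap the responses in additional structure.

```python
# Hypothetical in-memory version of the first three responses described above.
# The real JSON file may nest these lists differently.
example_responses = [
    ["ChatGPT", "Claude"],
    ["ChatGPT"],
    ["Claude", "Gemini", "ChatGPT"],
]

# Each inner list holds the tools ticked by one respondent.
for number, response in enumerate(example_responses, start=1):
    print(f"Respondent {number} selected {len(response)} tool(s): {response}")
```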

It could be argued that this was not the best way to capture data for further analysis, but unfortunately, we have to work with what we have. Your colleague would like to determine which tool was selected the most times, and how many times that was.

In this exercise, you will write code to load and clean the data, and answer your colleague's questions.

**The data for this exercise is `survey_responses.json`, located in the `Data` directory within the `1_Monday` directory.**

## Breaking the task down to steps

When planning to tackle a task, large or small, it is a good idea to outline the structure of the code first, before any coding is involved. In this exercise, we build the code from the bottom up.

This notebook divides the development process into three tasks. Each task description includes the necessary information to proceed with the coding. This is an example approach to tackling this problem.

**Task 1** - Load the data from the JSON file into a Python data structure.

**Task 2** - Clean the data to prepare it for analysis.

**Task 3** - Analyse the cleaned data to determine the most selected tool.

## Task 1 - Data Loading

The first step in loading data is to determine where our data is located. In this exercise, we were told above:

> **The data for this exercise is `survey_responses.json`, located in the `Data` directory within the `1_Monday` directory.**

Using the *Explorer* in the left sidebar, see if you can find the `survey_responses.json` file now.

Did you find it? Notice that the `Data` directory is located in the same directory as the Notebook we are currently working in. Because of this, we say that, *relative* to this Notebook, the file we want to load is located in a directory called `Data`. To indicate that we are looking into a directory, we use a forward slash `/`. This applies in the Codespace and on macOS, but note that Windows machines usually use a backslash `\`.

Knowing this, we can write out the *relative path* to the file we are interested in. We might create a constant to store the path - by convention, we use all uppercase lettering to define constants.

```python
PATH_TO_FILE = "Data/survey_responses.json"
```
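
If the same notebook might also be run on Windows, one option is to build the path with the standard-library `pathlib` module, which chooses the separator for the current operating system. This is only a sketch of an alternative; the name `PATH_AS_OBJECT` is made up for illustration.

```python
from pathlib import Path

# Join the directory and the file name; pathlib picks the separator for us.
PATH_AS_OBJECT = Path("Data") / "survey_responses.json"

print(PATH_AS_OBJECT.name)    # the file name on its own
print(PATH_AS_OBJECT.parent)  # the directory part of the path
```

A `Path` object can be passed to `open` in the same way as a plain string.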

Now that we have the path to our file, we can look at loading the file into Python. In Python, we load files into objects. The most common way to do this is to combine the `with` statement with the `open` function. The file object returned by `open` acts as a *context manager*, so the file is closed automatically at the end of the indented block. The `as` clause of the `with` statement stores the file object in a variable (here called `file`), which should only be used inside that block.

```python
with open(PATH_TO_FILE) as file:
    # do something with the file
    pass  # placeholder so the block is valid Python until real code is added
```
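
Inside the block, the file object can be handed to `json.load` from Python's built-in `json` module, which converts the JSON text into Python lists and dictionaries. The sketch below first writes a tiny stand-in file to a temporary directory so that it runs anywhere; the stand-in contents and the names `sample_data`, `sample_path` and `responses` are assumptions, not the real survey data.

```python
import json
import tempfile
from pathlib import Path

# Stand-in data for illustration only; the real survey file may look different.
sample_data = [["ChatGPT", "Claude"], ["ChatGPT"]]

with tempfile.TemporaryDirectory() as temp_dir:
    sample_path = Path(temp_dir) / "sample_responses.json"
    sample_path.write_text(json.dumps(sample_data))

    # Open the file and let json.load parse its contents.
    with open(sample_path) as file:
        responses = json.load(file)

# The file is closed here, but the loaded data is still available.
print(type(responses), len(responses))
```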

Now that we've loaded


### Coding

Using the equations for $a$, $b$, and $c$ above, write a function in the code cell below named ```get_quadratic_coefficients``` which receives 6 arguments. These arguments should be, in order, $x_{1}$, $y_{1}$, $x_{2}$, $y_{2}$, $x_{3}$, and $y_{3}$. The function should return a list with three entries, containing the values of $a$, $b$, and $c$. There are a few calls to the function in the cell below, which you can use to test your function.

```python
# Write your function here


# Test Cases
print(get_quadratic_coefficients(-1, 0, 0, 0, 1, 0)) # Should return [0.0, 0.0, 0.0]
print(get_quadratic_coefficients(4.2, -10.1, 4.4, -9.5, 4.6, -8.9)) # Should return [0.0, 3.0, -22.7]
print(get_quadratic_coefficients(-10, 200, 0, 0, 10, 200)) # Should return [2.0, 0.0, 0.0]
print(get_quadratic_coefficients(4, 10, 5, 20, 6, 18)) # Should return [-6.0, 64.0, -150.0]
```

For each of the exercises in this notebook, sample solutions can be found in [```Sample Solutions/Sample Solutions 2 - Peak Finding.ipynb```](Sample%20Solutions/Sample%20Solutions%202%20-%20Peak%20Finding.ipynb).

## Task 2 - Finding the Peak

Once we have the coefficients of the quadratic function, we can find the peak by finding the $x$ value at which the function reaches its maximum. To do this, we can find the value of $x$ at which the gradient of the function is equal to zero. The gradient of the function is given by:

$$ g = 2ax + b $$

and is equal to zero when:

$$ 0 = 2a x_{flat} + b $$

$$ x_{flat} = -\frac{b}{2a} $$

Considering the general properties of quadratic functions, if the value of $a$ is positive, then $x_{flat}$ corresponds to a minimum, and if the value of $a$ is negative, then $x_{flat}$ corresponds to a maximum. If the value of $a$ is zero, then the function is a straight line and has no peak.
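
As a quick numerical check of this reasoning, the sketch below computes $x_{flat}$ for a few sets of coefficients and reports whether each one is a maximum or a minimum. The function name `describe_flat_point` and the example coefficients are chosen purely for illustration.

```python
def describe_flat_point(a, b):
    """Return (x_flat, kind) for y = a*x**2 + b*x + c, or None if a is zero."""
    if a == 0:
        # A straight line has a constant gradient, so there is no turning point.
        return None
    x_flat = -b / (2 * a)
    kind = "maximum" if a < 0 else "minimum"
    return x_flat, kind


print(describe_flat_point(-1.5, 10.5))  # a < 0, so a maximum at x = 3.5
print(describe_flat_point(1.0, -2.0))   # a > 0, so a minimum at x = 1.0
print(describe_flat_point(0.0, 3.0))    # straight line, so None
```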

### Coding

Write a function named ```get_peak_in_region``` which finds the peak of the function approximated by three data points. This function should receive 6 arguments. These arguments should be, in order, $x_{1}$, $y_{1}$, $x_{2}$, $y_{2}$, $x_{3}$, and $y_{3}$, where $(x_{1}, y_{1})$, $(x_{2}, y_{2})$ and $(x_{3}, y_{3})$ are the Cartesian coordinates of the point on the left boundary, the central point, and the point on the right boundary of the region we're approximating.

This function should use your function ```get_quadratic_coefficients``` to find the coefficients of the quadratic function which approximates the data. It should then use these coefficients to find the $x$ value at which the gradient of the function is equal to zero. 

If this value of $x$ is between $x_{1}$ and $x_{3}$ inclusive (i.e. it's in the region we're approximating) AND the value of $a$ is negative (i.e. our approximation contains a peak), then the function should return a list containing the value of $x$ at this peak and the value of $y$ at this point (i.e. the Cartesian coordinates of the peak). To get the value of $y$, you will need to evaluate the quadratic function. You may want to consider writing a separate function for this.

If this value of $x$ is not between $x_{1}$ and $x_{3}$, or if the approximation is flat ($a$ is zero) or contains a minimum ($a$ is positive), then the function should return ```None```.

When writing this function, you will use ```get_quadratic_coefficients``` (and possibly more functions). Using multiple functions, each with a clear purpose, not only improves the readability of your code but also enables you to test each function separately (unit testing). Testing frameworks are essential for developing large and complex code.
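
For instance, a small helper that evaluates the quadratic can be checked on its own before it is used anywhere else. The sketch below assumes a helper called `evaluate_quadratic` (a hypothetical name) and uses `math.isclose`, because floating-point results are rarely exactly equal to the expected value.

```python
import math


def evaluate_quadratic(a, b, c, x):
    """Return the value of a*x**2 + b*x + c at the given x."""
    return a * x**2 + b * x + c


# Simple checks with values worked out by hand.
assert math.isclose(evaluate_quadratic(2.0, 0.0, 0.0, 10), 200.0)
assert math.isclose(evaluate_quadratic(-6.0, 64.0, -150.0, 4), 10.0)
assert math.isclose(evaluate_quadratic(0.0, 3.0, -22.7, 4.2), -10.1)
print("All checks passed")
```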

```python
# Write your function here


# Test Cases
print(get_peak_in_region(0.8, 0, 1, 1, 1.2, 0)) # Should return [1.0, 1.0]
print(get_peak_in_region(3, -9, 4, -9, 5, -12)) # Should return [3.5, -8.625]
print(get_peak_in_region(5, 1.2, 5.1, 1.3, 5.2, 1.4)) # Should return None (the approximation should be straight)
print(get_peak_in_region(1, 1, 2, 0, 3, -2)) # Should return None (the peak is before the region we're approximating)
print(get_peak_in_region(-10, 0, -5, 10, 0, 15)) # Should return None (the peak is after the region we're approximating)
print(get_peak_in_region(0, 2, 1, 1, 2, 2)) # Should return None (the flat section of the approximation is a trough, not a peak)
```

## Task 3 - Examining a Whole Dataset

Now that we have a function which can determine if there is a peak in a region of three data points and where that peak is, we can use this function to examine a whole dataset and find which is the highest peak.

To do this, we will consider each region of three data points in turn, and use our function ```get_peak_in_region``` to determine if there is a peak in each region. If there is, we will check whether it is higher than the highest peak we have found so far. If it is, we will store the coordinates of this peak.

For example, if we have data points $(x_{1}, y_{1}), (x_{2}, y_{2}), ... , (x_{10}, y_{10})$, we will first examine the region between $x_{1}$ and $x_{3}$, then the region between $x_{2}$ and $x_{4}$, then the region between $x_{3}$ and $x_{5}$, and so on.
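
The overlapping regions described above can be generated with a simple index loop. The sketch below only lists the regions and does not look for peaks; the function name `three_point_regions` is an assumption made for illustration.

```python
def three_point_regions(x_values, y_values):
    """Return every run of three consecutive points as (xs, ys) pairs."""
    regions = []
    # The last region starts two places before the end of the list.
    for i in range(len(x_values) - 2):
        regions.append((x_values[i:i + 3], y_values[i:i + 3]))
    return regions


x = [0, 1, 2, 3, 4]
y = [0, 1, 0, 2, 0]
for xs, ys in three_point_regions(x, y):
    print(xs, ys)
```

A dataset of $n$ points produces $n - 2$ regions, so the five points here give three regions.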

### Coding

Write a function named ```get_peak_dataset```. This should receive two arguments. The first is a list containing the x values of the data set, and the second is a list containing the y values of the data set. The function should return a list containing the Cartesian coordinates of the highest peak in the data set. If the data set does not contain any peaks, the function should return ```None```.

You may assume that the two lists provided as arguments will be the same length, will contain only numerical values, and that the data set will contain at least three points.

Write your function in the cell below. There are some calls to the function in the cell which you can use to test your function.

```python
# Write your function here


# Test Cases
x = [0, 1, 2]
y = [0, 1, 0]
print(get_peak_dataset(x, y)) # Should return the value [1.0, 1.0]

x = [0, 1, 2, 3, 4]
y = [0, 1, 0, 2, 0]
print(get_peak_dataset(x, y)) # Should return the value [3.0, 2.0]

x = [0, 1, 2, 3, 4]
y = [1, 3.5, 1, 3, 1]
print(get_peak_dataset(x, y)) # Should return the value [1.0, 3.5]

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
y = [0, 1, 2, 5, 10, 20, 18, 14, 12, 11, 10, 10, 11, 13, 15, 17, 19, 20.5, 19, 17, 14]
print(get_peak_dataset(x, y)) # Should return the value [5.33..., 20.66...]

x = [0, 1, 2, 3, 4]
y = [0, -1, -2, -1, 0]
print(get_peak_dataset(x, y)) # Should return None as there is no peak
```