# Case Study: Detecting Data Quality Issues in Inflammation Records

This project analyzes a series of CSV files containing inflammation data to detect potential data quality problems.
We use basic per-patient summary statistics and a small helper function to flag suspicious records, such as patients whose mean inflammation is zero.

---


## 1. Reading and Displaying Data from the First File

Given the list of `inflammation_xx.csv` file paths below, the following program opens the first file in the list and prints its contents line by line.



In [None]:
all_paths = [
  "python/05_src/data/assignment_2_data/inflammation_01.csv",
  "python/05_src/data/assignment_2_data/inflammation_02.csv",
  "python/05_src/data/assignment_2_data/inflammation_03.csv",
  "python/05_src/data/assignment_2_data/inflammation_04.csv",
  "python/05_src/data/assignment_2_data/inflammation_05.csv",
  "python/05_src/data/assignment_2_data/inflammation_06.csv",
  "python/05_src/data/assignment_2_data/inflammation_07.csv",
  "python/05_src/data/assignment_2_data/inflammation_08.csv",
  "python/05_src/data/assignment_2_data/inflammation_09.csv",
  "python/05_src/data/assignment_2_data/inflammation_10.csv",
  "python/05_src/data/assignment_2_data/inflammation_11.csv",
  "python/05_src/data/assignment_2_data/inflammation_12.csv"
]

with open(all_paths[0], 'r') as f:
    lines = f.readlines()  # Step 1: Read all lines
    for line in lines:     # Step 2: Loop through each line
        print(line.strip())  # Print without newline character
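
Because the twelve file names differ only in their two-digit index, the list could also be generated rather than typed out. The sketch below assumes the same directory as in `all_paths`; `data_dir` and `generated_paths` are illustrative names, not part of the original assignment.

```python
# Hypothetical alternative: build the same twelve paths from a pattern.
data_dir = "python/05_src/data/assignment_2_data"  # same directory as in all_paths

# {i:02d} zero-pads the index to two digits: 01, 02, ..., 12
generated_paths = [f"{data_dir}/inflammation_{i:02d}.csv" for i in range(1, 13)]

print(generated_paths[0])    # first path in the list
print(len(generated_paths))  # number of paths generated
```

Generating the list avoids copy-paste typos, at the cost of assuming every file in the range actually exists.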


## 2. Data Summarization Function: `patient_summary`

This function computes summary statistics (mean, max, or min) for each patient based on their 40-day inflammation data.

### Function Overview:
- **Input**: Path to a CSV file containing 60 rows (patients) and 40 columns (days), and the type of operation ('mean', 'max', or 'min').
- **Process**: Reads the data and applies the requested summary statistic across each patient's 40-day period.
- **Output**: A NumPy array with one summary value per patient (60 values total).

This approach helps identify trends or anomalies in patient inflammation patterns.
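
To make "one summary value per patient" concrete, here is a minimal sketch on a made-up array (2 patients, 3 days) showing how the `axis` argument controls the direction of the reduction. The numbers are illustrative only and do not come from the inflammation files.

```python
import numpy as np

# Toy data (made up for illustration): 2 patients (rows) x 3 days (columns)
toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 6.0, 8.0]])

per_patient_mean = np.mean(toy, axis=1)  # collapses the columns: one value per row
per_day_mean = np.mean(toy, axis=0)      # collapses the rows: one value per column

print(per_patient_mean.shape)  # one entry per patient
print(per_day_mean.shape)      # one entry per day
```

With `axis=1` the result has as many entries as there are rows, which is why the real data yields 60 values.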


In [14]:
import numpy as np  # Import the NumPy library and give it the alias 'np'

def patient_summary(file_path, operation):
    data = np.loadtxt(fname=file_path, delimiter=',')  # Load the data from the file
    ax = 1  # axis=1 applies the operation across columns, giving one result per row (patient)
    # Apply the summary statistic selected by the 'operation' argument
    if operation == 'mean':
        # Mean (average) number of flare-ups for each patient
        summary_values = np.mean(data, axis=ax)

    elif operation == 'max':
        # Maximum number of flare-ups experienced by each patient
        summary_values = np.max(data, axis=ax)

    elif operation == 'min':
        # Minimum number of flare-ups experienced by each patient
        summary_values = np.min(data, axis=ax)

    else:
        # If the operation is not one of the expected values, raise an error
        raise ValueError("Invalid operation. Please choose 'mean', 'max', or 'min'.")

    return summary_values

result_mean = patient_summary(all_paths[0], "mean")  # per-patient means, file 01
result_max = patient_summary(all_paths[1], "max")    # per-patient maxima, file 02

print(result_mean[:5])
print(result_mean[:2])
print(result_max[:7])
print(result_max[:11])

[5.45  5.425 6.1   5.9   5.55 ]
[5.45  5.425]
[15. 15. 18. 18. 19. 18. 17.]
[15. 15. 18. 18. 19. 18. 17. 14. 14. 18. 15.]


In [15]:
# Test it out on the data file we read in and make sure the size is what we expect i.e., 60
# Your output for the first file should be 60
data_min = patient_summary(all_paths[0], 'min')
print(len(data_min))



60
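
Checking the length confirms the number of patients but not the number of days. A slightly stricter check compares the full array shape. The sketch below is an assumption-laden illustration: `has_expected_shape` is a hypothetical helper, and the CSV text is synthetic data held in memory rather than one of the real files.

```python
import io
import numpy as np

def has_expected_shape(source, n_patients=60, n_days=40):
    """Hypothetical helper: True if the CSV has n_patients rows and n_days columns."""
    # ndmin=2 keeps the result two-dimensional even for a single-row file
    data = np.loadtxt(source, delimiter=',', ndmin=2)
    return data.shape == (n_patients, n_days)

# Synthetic stand-in for a data file: 3 patients x 4 days
fake_csv = "0,1,2,1\n0,2,3,0\n0,1,1,0\n"

matches_small = has_expected_shape(io.StringIO(fake_csv), n_patients=3, n_days=4)
matches_default = has_expected_shape(io.StringIO(fake_csv))  # compared against 60 x 40

print(matches_small)
print(matches_default)
```

`np.loadtxt` accepts a file path as well as a file-like object, so the same helper could be called with `all_paths[0]`.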


## 3. Checking for Errors: `check_zeros` and `detect_problems`

### Understanding the `check_zeros(x)` Helper Function

The `check_zeros(x)` function is provided as a helper for the data analysis. We do not need to modify it or fully understand its internals, but it is important to understand its input, its output, and what the output signifies:

1. **Input**:
   - **Parameter `x`**: An array of numbers. In this analysis, the array holds one summary value per patient, such as the mean inflammation scores.

2. **Output**:
   - The function returns a boolean value: either `True` or `False`.

3. **Interpreting the Output**:
   - **Output is `True`**: The array `x` contains at least one zero. In this analysis, that means at least one patient has a mean inflammation score of 0, which signals a potential problem in the data.
   - **Output is `False`**: The array `x` contains no zeros. No patient has a mean inflammation score of 0, so no anomaly of this type was detected.

**Usage in the Analysis**:
The `detect_problems()` function combines `check_zeros(x)` with `patient_summary()` to check whether any patient in a file has a mean inflammation score of 0.
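
The input/output contract described above can be illustrated with a small stand-in. Note that `has_zero` is a hypothetical equivalent written with `np.any`, not the provided `check_zeros` helper, and the arrays are made-up values.

```python
import numpy as np

def has_zero(x):
    """Illustrative equivalent of check_zeros: True if any element of x equals 0."""
    return bool(np.any(np.asarray(x) == 0))

means_with_zero = np.array([5.45, 0.0, 6.1])       # second "patient" has mean 0
means_without_zero = np.array([5.45, 5.425, 6.1])  # no zeros

flag_a = has_zero(means_with_zero)
flag_b = has_zero(means_without_zero)

print(flag_a)
print(flag_b)
```

Both formulations answer the same yes/no question; `np.any` simply skips the intermediate array of indices.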

In [None]:
# Run this cell so we can use this helper function

def check_zeros(x):
    '''
    Given an array, x, check whether any values in x equal 0.
    Return True if any values found, else returns False.
    '''
    # np.where() checks every value in x against the condition (x == 0) and returns a tuple of indices where it was True (i.e. x was 0)
    flag = np.where(x == 0)[0]

    # flag holds the indices of the zeros; it is non-empty only if at least one zero was found.
    # Return True if any zero was found, otherwise False.
    return len(flag) > 0

In [None]:
def detect_problems(file_path):
    # Use patient_summary() to get the per-patient means, then check_zeros() to look for zeros among them
    patient_means = patient_summary(file_path, 'mean')  # Get mean inflammation values
    has_problem = check_zeros(patient_means)            # Check if any mean is zero
    return has_problem


In [None]:
# Test out code here
# output for the first file should be False
print(detect_problems(all_paths[0]))

False
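
The same check can be run over several files in a loop. The sketch below is self-contained so it can run without the real data: `mean_has_zero` is a compact stand-in for `detect_problems()`, and the two CSV files are synthetic, written to a temporary directory purely for illustration.

```python
import os
import tempfile
import numpy as np

def mean_has_zero(file_path):
    """Compact stand-in for detect_problems(): True if any patient's mean is 0."""
    data = np.loadtxt(fname=file_path, delimiter=',', ndmin=2)
    return bool(np.any(np.mean(data, axis=1) == 0))

# Synthetic files (made up for illustration, not the real inflammation data)
samples = {
    "clean.csv": "0,1,2\n0,3,1\n",
    "flat_patient.csv": "0,1,2\n0,0,0\n",  # second patient is all zeros
}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, text in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write(text)
        results[name] = mean_has_zero(path)

for name, flagged in results.items():
    print(f"{name}: {'problem detected' if flagged else 'ok'}")
```

With the real data, the equivalent loop would call `detect_problems(path)` for each `path` in `all_paths`. Assuming the readings are non-negative counts, a mean of 0 implies that every recorded value for that patient is 0.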


| Criteria                     | Complete Criteria                                                                                                                                                                 | Incomplete Criteria                                                                                                         |
|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| **General Criteria**         |                                                                                                                                                                               |                                                                                                                       |
| Code Execution               | All code cells execute without errors.                                                                                                                                        | Any code cell produces an error upon execution.                                                                      |
| Code Quality                 | Code is well-organized, concise, and includes necessary comments for clarity.                                                                                                 | Code is unorganized, verbose, or lacks necessary comments.                                                            |
| Data Handling                | Data files are correctly handled and processed.                                                                                                                               | Data files are not handled or processed correctly.                                                                    |
| Adherence to Instructions    | Follows all instructions and requirements as per the assignment.                                                                                                              | Misses or incorrectly implements one or more of the assignment requirements.                                         |
| **Specific Criteria**        |                                                                                                                                                                               |                                                                                                                       |
| 1. Reading in our files | Correctly prints out information from the first file.                                                  | Fails to print out information from the first file.                              |
| 2. Summarizing our data | Correctly defines `patient_summary()` function. Function processes data as per `operation` and outputs correctly shaped data (60 entries).                                   | Incomplete or incorrect definition of `patient_summary()`. Incorrect implementation of operation or wrong output shape.|
| 3. Checking for Errors  | Correctly defines `detect_problems()` function. Function uses `patient_summary()` and `check_zeros()` to identify mean inflammation of 0 accurately.                        | Incorrect definition or implementation of `detect_problems()` function. Fails to accurately identify mean inflammation of 0.|
| **Overall Assessment**       | Meets all the general and specific criteria, indicating a strong understanding of the assignment objectives.                                                                  | Fails to meet one or more of the general or specific criteria, indicating a need for further learning or clarification.|


## References

### Data Sources
- Software Carpentry. _Python Novice Inflammation Data_. http://swcarpentry.github.io/python-novice-inflammation/data/python-novice-inflammation-data.zip
