<a target="_blank" rel="noopener noreferrer" href="https://colab.research.google.com/github/center-for-computational-psychiatry/course_spice/blob/master/modules/module-04_data-processing.ipynb">![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)</a>

# Data Processing
This tutorial was inspired by and adapted from Shawn A. Rhoads' [PSYC 347 Course](https://shawnrhoads.github.io/gu-psyc-347/) [[CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/)].

## Learning objectives

This notebook is intended to teach you basic Python syntax for:

1. Reading data from a file
2. Filtering data
3. Summarizing data
4. Writing data to a file

In this notebook, we are going to work with open data from Sarah's 2022 paper: 

<blockquote>Banker, S. M., Na, S., Beltrán, J., Koenigsberg, H. W., Foss-Feig, J. H., Gu, X., &amp; Schiller, D. (2022). Disrupted computations of social control in individuals with obsessive-compulsive and misophonia symptoms. <em>iScience</em>, 25(7), 104617.</blockquote>

The data is available on [OSF](https://osf.io/ad7np/). I already cleaned it up a little bit. We will be using the `Banker_et_al_2022_QuestionnaireData.csv` file.

In [1]:
filename = './resources/data/Banker_et_al_2022_QuestionnaireData.csv'

To read this file, we will use the `pandas` package. `pandas` is a Python library for data manipulation and analysis. It provides data structures for efficiently storing and manipulating large datasets, as well as tools for reading data from a variety of sources such as CSV files, SQL databases, and Excel spreadsheets. `pandas` is commonly used in data science and machine learning, as well as in finance, social science, and other fields where data analysis is important.

A CSV file is a comma-separated values file, which allows data to be saved in a tabular format. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. CSV files can be opened in spreadsheet applications like Microsoft Excel or Google Sheets.
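
To make the record-and-field structure concrete, here is a minimal sketch that parses a tiny CSV with Python's built-in `csv` module. The contents of `csv_text` are made up for illustration and are not the study data.

```python
import csv
import io

# A tiny, made-up CSV held in memory (not the study data).
csv_text = "participant,age,score\nP01,25,37\nP02,31,53\n"

# csv.reader splits each line (record) into its comma-separated fields.
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0]    # the first record holds the column names
records = rows[1:]  # the remaining records hold the data

print(header)
print(records)
```

Note that the `csv` module returns every field as a string; `pandas`, which we use in the rest of this notebook, also infers column types such as integers and floats.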

In [2]:
# we will import pandas using the alias pd:
import pandas as pd 

Above, we imported `pandas` using the `import` statement, which makes the package available in our notebook. The `as pd` part gives it a shorter alias, so we can write `pd` instead of `pandas`. We will use `import` to load other packages throughout this course, and you will do the same in your own research.
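
As a small sketch of the common forms of the `import` statement, the example below uses only modules from Python's standard library. The variable names are chosen here for illustration.

```python
# Import a whole module and refer to it by its full name.
import math

# Import a module under a shorter alias, as in `import pandas as pd`.
import statistics as stats

# Import a single name from a module.
from pathlib import Path

root_of_16 = math.sqrt(16)             # square root, via the full module name
mean_score = stats.mean([2, 4, 6])     # mean, via the alias
file_suffix = Path("data.csv").suffix  # file extension, via the imported name

print(root_of_16, mean_score, file_suffix)
```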

We can now use the functions and data structures provided by `pandas` in our code. We will use the `pandas` package to read in the data from the CSV file.

Let's start by reading in the data. We will use the `read_csv()` function from `pandas` to do this. Its only required argument is the path to the CSV file; it also accepts many optional arguments that we do not need here. The function returns a `DataFrame` object, which is the data structure `pandas` provides for storing tabular data. We will assign the `DataFrame` returned by `read_csv()` to a variable called `covid_data`.

In [3]:
covid_data = pd.read_csv(filename)
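
If you want to try `read_csv()` without the data file, the sketch below reads a made-up CSV from memory instead of from disk. `read_csv()` accepts file-like objects as well as paths; `csv_text` and `toy_data` are illustrative names, not part of the study data.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (not the study data).
csv_text = "participant,age,score\nP01,25,37\nP02,31,53\nP03,19,56\n"

# The first line becomes the column names; the other lines become rows.
toy_data = pd.read_csv(io.StringIO(csv_text))

print(type(toy_data).__name__)  # the class name of the returned object
print(toy_data.shape)           # (number of rows, number of columns)
print(list(toy_data.columns))
```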

In [5]:
covid_data.head()

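| Unnamed: 0 | prolific_id | attention_sum | Trait Anxiety | Loneliness | Depression | Obsessive Compulsion | Subjective Happiness | Stress | Misophonia | Autism Spectrum | ... | Avoidant Personality Disorder | Borderline Personality Disorder | Apathy | Eating Disorder | Alchohol Use Disorder | Age | Sex | Gender | Income | Education |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 546ec14dfdf99b2bc7ebd032 | 0 | 37 | 18 | 28 | 18 | 11 | 7 | 8 | 4.527778 | ... | 30 | 2 | 14 | 2 | 0 | 55 | 1 | 1 | 5 | 6 |
| 1 | 548491acfdf99b0379939cc0 | 0 | 53 | 19 | 46 | 18 | 12 | 14 | 7 | 3.472222 | ... | 18 | 3 | 11 | 11 | 0 | 28 | 2 | 2 | 8 | 5 |
| 2 | 54924b8efdf99b77ccedc1d5 | 0 | 56 | 20 | 51 | 20 | 11 | 11 | 2 | 3.055556 | ... | 25 | 3 | 19 | 26 | 0 | 19 | 1 | 1 | 5 | 4 |
| 3 | 5563984afdf99b672b5749b6 | 0 | 23 | 11 | 24 | 3 | 28 | 6 | 6 | 3.5 | ... | 14 | 1 | 5 | 16 | 0 | 39 | 1 | 1 | 11 | 5 |
| 4 | 5588d358fdf99b304ee5674f | 0 | 59 | 24 | 57 | 22 | 8 | 15 | 5 | 3.888889 | ... | 23 | 5 | 33 | 10 | 3 | 27 | 2 | 2 | 4 | 6 |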
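
The `head()` method returns the first rows of a `DataFrame` (five by default). A column named `Unnamed: 0`, like the one in the output above, typically appears when the first field of the CSV header is empty, for example because a table was saved together with its row numbers. The sketch below reproduces this with made-up data and shows one optional way to handle it, the `index_col` argument of `read_csv()`. That the study file was produced this way is an assumption.

```python
import io
import pandas as pd

# Made-up CSV whose header starts with an empty field, as happens when a
# table is saved together with its row numbers (an assumption about how
# the study file was produced).
csv_text = ",participant,age\n0,P01,25\n1,P02,31\n2,P03,19\n"

# By default, the unnamed first column is read as a regular column.
default_read = pd.read_csv(io.StringIO(csv_text))
print(list(default_read.columns))

# index_col=0 uses the first column as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed_read.columns))

# head(n) returns the first n rows (5 when n is omitted).
first_two = indexed_read.head(2)
print(first_two)
```

Either way of reading the file works for the rest of this notebook; the extra column is harmless as long as you do not treat it as data.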
