# Data Cleaning and Manipulation

In [None]:
import pandas as pd

# Dataset I
*Utilitarian moral judgments in relational contexts*
## Reading

In [None]:
df = pd.read_csv("./Data/relational_morality_utilitarian.csv", encoding = "utf-8-sig", lineterminator = "\n")
df.head()

## Exploring
Macro-level. Does the data look okay?

In [None]:
df = df[2:] #Indexing: keep rows from position 2 onward (drops the first two rows)
df.head()

Explore the factors/variables/columns

In [None]:
df.columns #An example of an 'attribute' (contrast with a 'method')

In [None]:
#List comprehension
[col for col in df.columns] #Note the \r, escape characters and encoding
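
One way to deal with the stray `\r` noted above is to strip whitespace from the column names. This is a minimal sketch on a toy frame (`toy` is a hypothetical stand-in, not the real dataset):

```python
import pandas as pd

# Toy frame whose last column name carries a trailing carriage return,
# as can happen when a file with "\r\n" line endings is split on "\n" only.
toy = pd.DataFrame({"StartDate": ["2021-01-01"], "Random ID\r": ["abc"]})

# Strip leading/trailing whitespace (including "\r") from every column name
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

If the names are stripped this way, any later reference to a column must use the stripped form (for example `"Random ID"` rather than `"Random ID\r"`).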

Data Types:<br>*String, Integer, Floating Point, None*

In [None]:
df.dtypes #Incorrect data types

## Cleaning
### Removing unwanted data (optional)
Our behavioral tasks often output data that are not useful for our analyses, or that contain identifying information which must be removed before the dataset is made public.<br><br>Leaving these columns in the dataset does not interfere with analyses (we simply won't use them), but removing them reduces file size and visual clutter:

In [None]:
del_cols = ["StartDate", "EndDate", "Status", "Progress", "Finished", "RecordedDate",
            "DistributionChannel", "Q2", "intro.1", "CheckNo1", "CheckNo2", "RoomHouse.intro", "sib.intro",
            "FatherChild.Intro", "MotherChild.Intro", "Romance.Intro", "Friends.Intro", "EmployBoss.Intro",
            "Team.Intro", "FWB.Intro", "Strange.Intro", "Q90", "Q211", "GollwitzerMeasure_1",
            "GollwitzerMeasure_2", "GollwitzerMeasure_3", "GollwitzerMeasure_4", "GollwitzerMeasure_5",
            "GollwitzerMeasure_6", "Covid.Responsibility", "Covid.Threat.Persona", "Covid.Threat.Comm",
            "Bot.CheckFriday", "Q16", "Q17", "Q67", "Random ID\r"]

In [None]:
#Drop columns
df = df.drop(columns=del_cols) #drop() returns a new DataFrame, so reassign it
df.head()

### Renaming variables
This is an important step. It helps shorten our code and, as you'll see later, it can help identify columns when using automated processes.

In [None]:
rename_cols = {"Duration..in.seconds.":"Duration",
               "Q366":"gender",
               "Q367":"age",
               "Q368":"race",
               "Q372":"education",
               "Q373":"income"}

In [None]:
#Rename columns
df = df.rename(columns=rename_cols) #Keys are old names, values are new names

### Data types
Identify the columns on which we will need to perform mathematical computations. These must be encoded as numeric, not as strings.

In [None]:
blame = ["cry", "cellphon", "nohelpbo"]
praise = ["grocerie", "medicine", "findgift"]

In [None]:
praise_cols = [col for col in df.columns if any(viol.lower() in col.lower() for viol in praise) and "L." not in col]
blame_cols = [col for col in df.columns if any(viol.lower() in col.lower() for viol in blame) and "L." not in col]

In [None]:
df[praise_cols] = df[praise_cols].apply(pd.to_numeric, errors='coerce')
df[blame_cols] = df[blame_cols].apply(pd.to_numeric, errors='coerce')

In [None]:
df[praise_cols[0]].dtypes
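
To see what `errors='coerce'` does, here is a small sketch on a made-up Series (`raw` is hypothetical). Values that cannot be parsed as numbers become `NaN` instead of raising an error:

```python
import pandas as pd

# Made-up string responses, including a blank and a non-numeric entry
raw = pd.Series(["3", "7", "", "n/a"])

# Unparseable entries are coerced to NaN rather than raising an error
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)         # numeric dtype instead of object
print(numeric.isna().sum())  # number of values that could not be parsed
```

Because coercion silently produces missing values, it is worth counting them afterwards, as in the last line.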

### Exclusions
Apply the pre-registered exclusion criteria to remove ineligible participants from the dataset.

In [None]:
#Experimenter runs
#Attention checks
#Bot checks
#Duration
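
The four exclusions listed above could be combined as boolean masks. The sketch below runs on toy data. The column names (`is_preview`, `attn_check`, `bot_check`), the pass values, and the duration cutoff are all assumptions for illustration, not the pre-registered criteria:

```python
import pandas as pd

# Toy data; column names and pass values are hypothetical stand-ins
toy = pd.DataFrame({
    "is_preview": [False, True, False, False, False],  # True = experimenter run
    "attn_check": [1, 1, 0, 1, 1],                      # 1 = passed
    "bot_check":  [1, 1, 1, 0, 1],                      # 1 = passed
    "Duration":   [420, 390, 510, 600, 45],             # seconds
})

MIN_DURATION = 120  # assumed cutoff in seconds, for illustration only

# A participant is kept only if every criterion is satisfied
keep = (
    ~toy["is_preview"]
    & (toy["attn_check"] == 1)
    & (toy["bot_check"] == 1)
    & (toy["Duration"] >= MIN_DURATION)
)
included = toy[keep]

print(f"Kept {len(included)} of {len(toy)} participants")
```

Building one mask per criterion also makes it easy to report how many participants each exclusion removed.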

### Recoding
Oftentimes data will be exported with values that are uninterpretable. When we wish to update those values, we do so through recoding. This is most commonly done with demographic data.

In [None]:
#Income
#Gender
#Race
#Age
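
Recoding can be sketched with a dictionary and `Series.map`. The codes and labels below are hypothetical; the real mapping has to come from the survey's codebook:

```python
import pandas as pd

# Toy data with string codes, as exported values often are
toy = pd.DataFrame({"gender": ["1", "2", "1", "3"]})

# Hypothetical code -> label mapping (check the codebook for the real one)
gender_labels = {"1": "Male", "2": "Female", "3": "Other"}

# Replace each code with its label
toy["gender"] = toy["gender"].map(gender_labels)

print(toy["gender"].tolist())
```

Note that `map` turns any value missing from the dictionary into `NaN`, whereas `Series.replace` would leave such values unchanged.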

### Missing values
Missing values are common in behavioral datasets. For instance, the lab's IRB does not allow us to force responses for certain participants. This means people are free to skip questions without answering them.<br><br>We have three choices when it comes to handling missing values in our data:
- Remove them
- Estimate them using central tendency
- Predict them using machine learning

We'll attempt the first two here as they are the most appropriate:

In [None]:
#Remove
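
A minimal sketch of the removal option, using `dropna` on toy data (the column names are hypothetical). Restricting `subset` to the columns used in the analysis avoids discarding participants over missing values that do not matter:

```python
import numpy as np
import pandas as pd

# Toy data with missing ratings; column names are hypothetical
toy = pd.DataFrame({
    "friend_cry":      [1.0, np.nan, 3.0, 4.0],
    "friend_medicine": [5.0, 6.0, np.nan, 8.0],
    "age":             [25, 31, 40, 22],
})
rating_cols = ["friend_cry", "friend_medicine"]

# Drop any row with a missing value in the rating columns
complete = toy.dropna(subset=rating_cols)

print(f"{len(toy) - len(complete)} rows removed")
```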

In [None]:
#Replace using central tendency
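
A sketch of the replacement option, filling each column's missing values with that column's median. The median is an assumed choice here (the mean works the same way via `.mean()`), and the column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data with missing ratings; column names are hypothetical
toy = pd.DataFrame({
    "friend_cry":      [1.0, np.nan, 3.0, 8.0],
    "friend_medicine": [2.0, 4.0, np.nan, np.nan],
})

# Column-wise medians, computed while ignoring missing values
medians = toy.median()

# fillna aligns the Series of medians to the columns by name
filled = toy.fillna(medians)

print(filled)
```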

### Visualizations

In [None]:
import matplotlib.pyplot as plt

In [None]:
#Explore columns of interest
plt.figure()

plt.show()
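
As one possible way to explore a column of interest, the sketch below draws a histogram of a single rating column. The data and the column name `friend_cry` are made up for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ratings with one missing value; the column name is hypothetical
toy = pd.DataFrame({"friend_cry": [1, 2, 2, 3, 3, 3, 4, 4, 5, None]})

fig, ax = plt.subplots()

# Drop missing values before plotting so they do not affect the bins
ax.hist(toy["friend_cry"].dropna(), bins=5)
ax.set_xlabel("Rating")
ax.set_ylabel("Number of participants")
ax.set_title("friend_cry")

plt.show()
```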

### Outliers

In [None]:
#Identify and remove outliers
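
One common rule of thumb flags values lying more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies it to toy durations. The choice of rule and the 1.5 multiplier are assumptions; a pre-registration may specify something different:

```python
import pandas as pd

# Toy durations in seconds with one extreme value
toy = pd.DataFrame({"Duration": [300, 320, 340, 360, 380, 400, 5000]})

# Quartiles and interquartile range
q1 = toy["Duration"].quantile(0.25)
q3 = toy["Duration"].quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (assumed rule of thumb)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences, then keep the rest
is_outlier = ~toy["Duration"].between(lower, upper)
trimmed = toy[~is_outlier]

print(f"{is_outlier.sum()} outlier(s) removed")
```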

## Manipulation
### Composite columns
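
A composite column can be sketched as a row-wise average over a group of related columns. The column names and the new name `praise_mean` are hypothetical, and averaging is only one assumed way to combine the items:

```python
import numpy as np
import pandas as pd

# Toy praise ratings for one relationship context; names are hypothetical
toy = pd.DataFrame({
    "friend_grocerie": [4.0, 5.0, np.nan],
    "friend_medicine": [6.0, 5.0, 2.0],
    "friend_findgift": [5.0, 2.0, 4.0],
})
praise_cols = list(toy.columns)

# Row-wise mean across the praise items; missing items are skipped by default
toy["praise_mean"] = toy[praise_cols].mean(axis=1)

print(toy["praise_mean"].tolist())
```

Because missing items are skipped, a participant who answered only some items still receives a composite score; pass `skipna=False` to require complete responses.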

## Testing

### Assumptions