# Tutorial 4: Cleaning Survey Data
**Date**: Feb 2022

**Background**

In the last two tutorials you learned how to develop questionnaires for particular case studies: tolerance, happiness and climate change. You also learned how to reduce bias in question design, questionnaire design and questionnaire administration.

Now we will focus on a case study that measures tolerance and follow the steps needed to analyze a questionnaire in Python.


**Case-Study**

A group of researchers wants to explore tolerance, which they define as a value orientation towards difference. They developed a questionnaire focusing on the three different expressions of tolerance [1]:


<p align="center"><img src='./images/items.png' width="1000" /></p>

They used this questionnaire to assess tolerance in a sample of 150 university students. The responses to the questionnaire items were recorded in a comma-separated values (CSV) file. In addition, the researchers included questions about general socio-demographics and past experiences.

**Data**


Here is a description of the variables in the dataset (`tolerance_survey_data.csv` file):

|id|variable|description|
|---|---|---|
|1|id|Anonymized unique identifier per individual|
|2|age|Age of student|
|3|height|Height (scale in cm, e.g. 183)|
|4|country|Where do you come from? (Country)|
|5|language|How many languages do you speak at home with your family?|
|6|freq_travel|How many different countries have you lived in?|
|7|q1|Five-point Likert scale from 'strongly disagree' to 'strongly agree': [People should have the right to live how they wish]|
|8|q2|Five-point Likert scale from 'strongly disagree' to 'strongly agree': [It is important that people have the freedom to live their life as they choose]|
|9|q3|Five-point Likert scale from 'strongly disagree' to 'strongly agree': [It is okay for people to live as they wish as long as they do not harm other people]|
|10|q4|Five-point Likert scale from 'strongly disagree' to 'strongly agree': [I respect other people’s beliefs and opinions]|
|11|q5|Five-point Likert scale from 'strongly disagree' to 'strongly agree': [I respect other people’s opinions even when I do not agree]|
|12|q6|Five-point Likert scale from 'strongly disagree' to 'strongly agree': [I like to spend time with people who are different from me]|
|13|q7|Five-point Likert scale from 'strongly disagree' to 'strongly agree': [I like people who challenge me to think about the world in a different way]|
|14|q8|Five-point Likert scale from 'strongly disagree' to 'strongly agree': [Society benefits from a diversity of traditions and lifestyles]|


**Overview of the Next 3 Tutorials**

We will follow three major steps to analyze a questionnaire in Python:

1.	**Data Cleaning (Tutorial 4)**: Before processing the data recorded from the respondents' answers, it is imperative to understand the data types of the questionnaire items. It might be necessary to transform data from one measurement scale to another for quantitative processing. There might also be missing values in the data, and appropriate measures should be taken to handle them.


2.	**Reliability and Validity (Tutorial 5)**: Assessment of the internal consistency of the survey items. The coefficient of internal consistency provides an estimate of the reliability of the measurement and is based on the assumption that items measuring the same construct should correlate.


3.	**Factor Analysis (Tutorial 6)**: A multivariate statistical procedure that reduces a large number of observed variables into a smaller set of underlying variables (factors). These factors can explain the interrelationships among the observed variables.


<br>
<br>


**Today's Objectives**

**In today’s tutorial on Cleaning Survey Data, you will:**

**i)	Import the survey data into pandas**

**ii)	Analyze the datatypes**

**iii)	Transform the data for quantitative processing**

**iv)	Handle missing values**





[1] Hjerm, M., Eger, M. A., Bohman, A., & Fors Connolly, F. (2020). A new approach to the study of tolerance: Conceptualizing and measuring acceptance, respect, and appreciation of difference. Social Indicators Research, 147(3), 897-919.




## 1. Setup Library

Import the necessary libraries you will need to clean and pre-process the survey data.

In [3]:
import pandas as pd

## 2. Import Data 

For this tutorial, we will be using the **cleaned version** of the "tolerance survey dataset". 
The CSV file that we will be using, `tolerance_survey_data.csv`, is available at https://raw.githubusercontent.com/MaastrichtU-IDS/global-studies/main/semester4/tutorial4/inputs/tolerance_survey_data.csv 

Import this file in pandas using the `read_csv()` function.

In [4]:
#read the data into the dataframe and print the first 10 rows
url = 'https://raw.githubusercontent.com/MaastrichtU-IDS/global-studies/main/semester4/tutorial4/inputs/tolerance_survey_data.csv'
df = pd.read_csv(url)
df.head(10)

Unnamed: 0,id,age,height,country,language,freq_travel,q1,q2,q3,q4,q5,q6,q7,q8
0,1,34.0,186,Spain,1.0,5.0,Agree,Agree,Agree,Agree,Neutral,Disagree,Neutral,Neutral
1,2,34.0,157,BR,2.0,3.0,Strongly Agree,Agree,Neutral,Agree,Strongly Agree,Strongly Agree,Agree,Agree
2,3,27.0,191,RU,4.0,2.0,Agree,Agree,Strongly Agree,Agree,Strongly Agree,Agree,Strongly Agree,Agree
3,4,35.0,165,RU,3.0,5.0,Agree,Agree,Agree,Agree,Strongly Agree,Agree,Agree,Neutral
4,5,34.0,164,ID,5.0,5.0,Strongly Agree,Agree,Strongly Agree,Neutral,Neutral,Agree,Agree,Strongly Agree
5,6,,165,ID,3.0,3.0,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree
6,8,32.0,222,ID,3.0,5.0,Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Agree,Agree
7,9,39.0,173,CN,4.0,5.0,Agree,Disagree,Strongly Disagree,Neutral,Strongly Disagree,Neutral,Disagree,Agree
8,10,29.0,177,AL,2.0,1.0,,Agree,Neutral,Neutral,Strongly Agree,Strongly Agree,Strongly Agree,Neutral
9,11,19.0,159,NI,2.0,4.0,Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree


## 3. Basic data understanding

In [None]:
# check how many variables and observations are in the dataset

In [None]:
#print the age and height from the 2nd participant (hint: index should be 1)

In case you don't know much about the survey data that is being analyzed, you can always check the scale of all the columns by looking for the `min`, `max`, and `unique value counts`. This will let you know if you need to rescale the data or not.
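
As a minimal sketch of such a check, assume a small made-up DataFrame called `toy` (hypothetical, not the survey data). The minimum and maximum show the range of each numeric column, while `nunique()` and `value_counts()` summarise the distinct values:

```python
import pandas as pd

# Toy data for illustration only; the values are made up
toy = pd.DataFrame({
    "age": [34, 27, 19, 35],
    "height": [186, 157, 222, 165],
    "q1": ["Agree", "Strongly Agree", "Agree", "Neutral"],
})

# smallest and largest value of each numeric column
scale = toy[["age", "height"]].agg(["min", "max"])
print(scale)

# number of distinct values in every column
n_unique = toy.nunique()
print(n_unique)

# how often each answer occurs in one categorical column
q1_counts = toy["q1"].value_counts()
print(q1_counts)
```

A height far outside the other values, like the 222 in this toy column, is the kind of entry such a check brings to your attention.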

## 4. Identify data types

How would you identify the data types of variables in the survey?

- Are they categorical or numerical?
- How should we deal with different types of variables, for example `age`, `height` and `q1`?
- Why do the data types matter?

In [None]:
# Check the data type of each variable

> Thinking: CHECK Variables 
> - **Height**: int64?
> - **Age**: float64?
> - **q1**: object
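
One possible reason for seeing a mix of integer and float columns can be sketched with made-up data (the `toy` frame below is hypothetical): a numeric column that contains a missing value is stored as `float64`, because NaN is itself a floating-point value. Text answers are reported as `object` or, in some pandas versions, as a dedicated string type.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are made up
toy = pd.DataFrame({
    "height": [186, 157, 191],         # whole numbers, nothing missing
    "age": [34, np.nan, 27],           # one missing value
    "q1": ["Agree", "Neutral", None],  # text labels
})

# data type of every column
print(toy.dtypes)

# names of the columns pandas treats as numerical
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```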

## 5. Analyzing Likert Scale survey questions

In [None]:
# How many participants disagree with questions q1 and q2? Look up value_counts() in the pandas documentation: 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html

In [None]:
# convert the previous counts into percentages
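
The idea of counts and percentages can be sketched on a made-up Series of answers (the name `answers` and its values are assumptions for illustration). `value_counts()` ignores missing answers by default, and `normalize=True` returns proportions instead of counts:

```python
import pandas as pd

# Toy answers for illustration only; one answer is missing
answers = pd.Series(["Agree", "Disagree", "Agree", "Neutral", "Disagree", None])

# absolute number of respondents per answer (missing answers are excluded)
counts = answers.value_counts()
print(counts)

# share of each answer among the non-missing answers, in percent
shares = answers.value_counts(normalize=True) * 100
print(shares)

# number of respondents who chose one specific answer
n_disagree = (answers == "Disagree").sum()
print(n_disagree)
```

Note that the percentages here are relative to the answered questions only, so it is worth deciding whether missing answers should be part of the denominator.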

## 6. Visualizing all survey questions

In [None]:
# plot/visualise categorical variable such as q4
# you may refer to this old notebook you already solved in the first year: https://nbviewer.org/github/MaastrichtU-IDS/global-studies/blob/main/semester2/notebooks/4.1-data-visualization.ipynb

---
## 7. Transform/Prepare the data


Convert the existing scale of the questionnaire items into a numerical Likert scale.

> Why? 

We need to map the Likert scale options _(e.g. strongly agree)_ to numbers _(e.g. 5)_ as follows:

Strongly Agree ---> 5

Agree ---> 4

Neutral ---> 3

Disagree ---> 2

Strongly Disagree ---> 1

See the solution below. `df` is the original dataframe with categorical labels (Strongly Agree, Agree, etc.). `df_transformed` is the new dataframe, which contains the numerical values 1, 2, 3, 4 and 5 instead of the categorical labels.


In [None]:
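# replace each Likert label with its numerical score (5 = Strongly Agree ... 1 = Strongly Disagree)
df_transformed = df.replace(
    ['Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree'],
    [5, 4, 3, 2, 1])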

Print `df_transformed` to see whether its values have been converted to numbers. Double-check the data types with `dtypes`. 
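
If a question column still shows a non-numeric type after the replacement, one alternative is an explicit per-column mapping. The sketch below uses a made-up frame `toy` and a dictionary `likert_map` (both names are assumptions for illustration) and is not the tutorial's official solution:

```python
import pandas as pd

# mapping from Likert label to numerical score
likert_map = {
    "Strongly Agree": 5,
    "Agree": 4,
    "Neutral": 3,
    "Disagree": 2,
    "Strongly Disagree": 1,
}

# Toy data for illustration only; one answer is missing
toy = pd.DataFrame({
    "q1": ["Agree", None, "Strongly Agree"],
    "q2": ["Disagree", "Neutral", "Strongly Disagree"],
})

# Series.map looks each label up in the dictionary; missing answers become NaN
toy_num = toy.apply(lambda col: col.map(likert_map))
print(toy_num)
print(toy_num.dtypes)
```

A column with a missing answer ends up as `float64`, a complete column as `int64`. Any label that is misspelled in the data would also turn into NaN, which makes unexpected labels easy to spot.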

## 8. Missing Values in the Dataset

We explore the missing values with a heatmap. Look at the following code and explain what it does before running it.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize = (25,11))
sns.heatmap(df_transformed.isna().values, xticklabels=df_transformed.columns)
plt.title("Missing values in the dataset", size=20)

In [1]:
#Now try to identify the rows in the dataframe which contain missing values (NaN). Use the isna() or isnull() function
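
The general pattern can be sketched on a made-up frame `toy` (hypothetical data, not the survey): `isna()` returns a table of True/False values, which can be summed per column or combined per row with `any(axis=1)`:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; two answers are missing
toy = pd.DataFrame({
    "q1": [4, np.nan, 5, 3],
    "q2": [2, 3, np.nan, 1],
})

# number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# keep only the rows in which at least one value is missing
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```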

The main questionnaire items of interest are our survey items: q1, q2, ..., q8. So we will drop the rest of the columns.

In [None]:
#drop the columns: id, age, height, country, language, freq_travel

In [None]:
df_transformed = df_transformed.drop(['id','age','height','country','language','freq_travel'], axis=1)
df_transformed

In [None]:
#Replace the missing value (NaN) with the mean of that column.

In [None]:
#Compute the mean of each column
#Hint: use the mean() function on the dataframe

In [None]:
#Use the fillna() function to fill the missing (NaN) value with the mean

In [None]:
#Check the rows where the missing value has been replaced by the mean
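
The three steps above can be sketched on a made-up frame `toy` (hypothetical data, not the survey). Passing the Series of column means to `fillna()` fills each column with its own mean:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; two answers are missing
toy = pd.DataFrame({
    "q1": [4, np.nan, 5, 3],
    "q2": [2, 3, np.nan, 1],
})

# mean of each column; missing values are skipped by default
col_means = toy.mean()
print(col_means)

# fill every missing value with the mean of its own column
filled = toy.fillna(col_means)

# show only the rows that originally had a missing value
was_missing = toy.isna().any(axis=1)
print(filled[was_missing])
```

Keeping the original frame around, as `toy` is here, makes it possible to find the imputed rows afterwards.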

In [None]:
#Another way to address the missing values is to drop the rows with missing values from the DataFrame. 
#How will you do that?

In [None]:
#Check how many rows remain after dropping the rows with missing (NaN) values.
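
As a sketch of the alternative on made-up data (the frame `toy` is hypothetical), `dropna()` removes every row that contains at least one missing value, and comparing lengths shows how many respondents are lost:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; two answers are missing
toy = pd.DataFrame({
    "q1": [4, np.nan, 5, 3],
    "q2": [2, 3, np.nan, 1],
})

# drop every row with at least one missing value
dropped = toy.dropna()

# compare the number of rows before and after
rows_before = len(toy)
rows_after = len(dropped)
print(rows_before, rows_after)
```

In this toy frame half of the rows disappear even though only two single answers were missing, which is worth keeping in mind for the question that follows.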

Which of the following is the better strategy:

i) Drop the rows with missing values, or

ii) Replace the missing values with the mean or mode?

Justify your answer.