# Recap of Basic Python features

<!--
 Copyright (c) 2024 Paul Niklas RUth
 
 This Source Code Form is subject to the terms of the Mozilla Public
 License, v. 2.0. If a copy of the MPL was not distributed with this
 file, You can obtain one at https://mozilla.org/MPL/2.0/.
-->



# We can add functionality to Python by importing modules

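Some of the modules come with Python itself as part of the standard library; others, such as pandas, must be installed separately.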
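
As a minimal sketch of what importing looks like in practice, the cell below shows three common import styles. The modules used here (`math`, `collections`, `statistics`) are illustrative choices from the standard library, not ones the original slides name.

```python
# Three common import styles, shown with standard-library modules.
import math                      # import the whole module
from collections import Counter  # import a single name from a module
import statistics as stats       # import a module under a shorter alias

circle_area = math.pi * 2 ** 2        # area of a circle with radius 2
letter_counts = Counter("banana")     # count how often each letter occurs
average = stats.mean([1, 2, 3, 4])    # arithmetic mean of a list

print(round(circle_area, 2))
print(letter_counts["a"])
print(average)
```

The alias form (`import ... as ...`) is the same pattern used later for `import pandas as pd`.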


# Dictionaries are like a labelled drawer containing arbitrary data
Data is accessed by key. Keys can be of different types (here a string and an integer), and values can be any object.

In [1]:
my_dict = {'name': 'John', 1: [2, 4, 3]}
print("my_dict['name']:", my_dict['name'])
print("my_dict[1]:", my_dict[1])

my_dict['name']: John
my_dict[1]: [2, 4, 3]
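
Building on the cell above, here is a short sketch of a few other everyday dictionary operations. The `'age'` and `'email'` keys are made up for illustration.

```python
my_dict = {'name': 'John', 1: [2, 4, 3]}

my_dict['age'] = 30     # add a new key ('age' is a made-up example key)
my_dict[1].append(5)    # the stored list can be changed in place

# get() returns a default value instead of raising KeyError for a missing key
missing = my_dict.get('email', 'not set')

# items() yields (key, value) pairs in insertion order
for key, value in my_dict.items():
    print(key, '->', value)
print(missing)
```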


# Introduction to Pandas and Data Loading

Pandas is a powerful Python library for data manipulation and analysis.

Speaker Notes:
- Pandas is widely used in data science and is built on top of NumPy.
- It provides data structures like Series and DataFrame for handling structured data.

## Getting Started with Pandas

To begin, we need to import the pandas library:

In [2]:
import pandas as pd

Speaker Notes:

 - We import pandas using the alias pd for convenience.
 - This allows us to refer to pandas functions and objects using the shorthand pd.

## Loading Data with Pandas

Pandas provides various functions for loading data into DataFrames.

In [3]:
data = pd.read_csv('data/DAC_Study_4_PS.sav.csv')

Speaker Notes:

 - Here, we use the `read_csv()` function to load data from a CSV file into a DataFrame.
 - The resulting DataFrame is stored in the variable `data`.
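
To try `read_csv()` without the tutorial's data file, the sketch below reads CSV text from an in-memory buffer. The column names and values in `csv_text` are hypothetical and only stand in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """participant,cheated,score
1,0,9
2,0,8
3,1,
"""

# read_csv accepts a path or any file-like object;
# StringIO lets this sketch run without a file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # the empty 'score' field is read as a missing value
```

A column named `Unnamed: 0`, like the one in the tutorial's data, typically appears when a CSV was saved together with its index. Passing `index_col=0` to `read_csv()` would read that column as the index instead.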

### Introduction to the Dataset

The dataset used in this tutorial is taken from a study by Gino & Wiltermuth (2014) titled "Evil Genius? How Dishonesty Can Lead to Greater Creativity", published in Psychological Science. The study investigates the relationship between dishonesty and creativity.

#### Dataset Source:
[Link to the discussion we are reproducing](https://datacolada.org/110)

#### Description:
The dataset contains responses from participants who were presented with a virtual coin toss task followed by a creativity task involving generating uses for a newspaper. The study examines whether participants who cheated on the coin toss task exhibited greater creativity in the subsequent task compared to non-cheaters.



Speaker Notes:
- The dataset used in this tutorial is sourced from a study by Gino & Wiltermuth (2014) on dishonesty and creativity.
- Participants were first engaged in a coin toss task and then asked to generate creative uses for a newspaper.
- The dataset allows us to explore the relationship between dishonesty and creativity as measured by the number of creative uses generated.

# Section 2: Data Exploration and Basic Operations

In this section, we'll cover basic data exploration techniques and essential operations for understanding the dataset.


## Basic Data Exploration

Once we have loaded the data, we can explore its structure and contents.

In [4]:
# Display the first few rows of the DataFrame
print(data.head())

   Unnamed: 0            StartDate              EndDate  Cum_ID  filter  \
0           1  2012-11-17 23:54:00  2012-11-18 00:07:12     144       5   
1           2  2012-11-17 23:17:26  2012-11-17 23:41:13      91       5   
2           3  2012-11-17 23:44:36  2012-11-17 23:57:53     127       5   
3           4  2012-11-17 22:57:36  2012-11-17 23:11:29      24       5   
4           5  2012-11-18 00:00:06  2012-11-18 00:20:19     168       5   

   filter2  CF_headsheads1tails2  num_coin_tosses  instr  \
0        1                     1                3      1   
1        1                     1                3      1   
2        1                     2                3      1   
3        1                     2                3      1   
4        1                     1                3      1   

   reported_guessed_correctly  ...  ethnicity_5  ethnicity_6  student  code  \
0                           0  ...          NaN          1.0      1.0     1   
1                           0 

Speaker Notes:

 - The `head()` method displays the first rows of the DataFrame (five by default).
 - This allows us to quickly inspect the data and get an idea of its format.
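
A small sketch of related inspection calls, using a made-up ten-row frame (`toy`) rather than the tutorial's data:

```python
import pandas as pd

# Small made-up frame; the tutorial's data has 178 rows.
toy = pd.DataFrame({
    "trial": range(1, 11),
    "score": [9, 8, 9, 12, 0, 2, 8, 10, 6, 7],
})

first_three = toy.head(3)   # first three rows instead of the default five
last_two = toy.tail(2)      # tail() is the counterpart for the end of the frame

print(first_three)
print(last_two)
print(toy.shape)            # (number of rows, number of columns)
```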

## Understanding the Data
We can also get information about the DataFrame using the `info()` function.


In [5]:
# Display information about the DataFrame
# (info() prints its summary itself and returns None, so no print() is needed)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 79 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Unnamed: 0                  178 non-null    int64  
 1   StartDate                   178 non-null    object 
 2   EndDate                     178 non-null    object 
 3   Cum_ID                      178 non-null    int64  
 4   filter                      178 non-null    int64  
 5   filter2                     178 non-null    int64  
 6   CF_headsheads1tails2        178 non-null    int64  
 7   num_coin_tosses             178 non-null    int64  
 8   instr                       178 non-null    int64  
 9   reported_guessed_correctly  178 non-null    int64  
 10  ruleFollow1                 178 non-null    int64  
 11  ruleFollow2                 178 non-null    int64  
 12  ruleFollow3                 178 non-null    int64  
 13  Numberofresponses           178 non

Speaker Notes:
 - The `info()` method provides a summary of the DataFrame, including column names, data types, and non-null counts.
 - This helps us understand the structure of the data and identify any missing values.
 - After exploring the dataset, we noticed that some columns need to be converted to the appropriate datatype for further analysis.
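
If only the missing-value counts or the data types are of interest, they can also be obtained directly. This is a minimal sketch on a made-up frame: the column names echo the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names are borrowed from the dataset.
toy = pd.DataFrame({
    "ethnicity_5": [np.nan, np.nan, np.nan],
    "ethnicity_6": [1.0, 1.0, np.nan],
    "cheated": [0, 0, 1],
})

missing_per_column = toy.isna().sum()   # number of missing values in each column
print(missing_per_column)
print(toy.dtypes)                       # dtype of each column, as listed by info()
```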


## Converting Date Columns to Datetime
Pandas lets us convert date columns to datetime format with `pd.to_datetime()`, which makes them easier to manipulate. With `astype(str)` we can also convert the hashed ID column to strings, if we need that.

In [6]:
# Convert 'StartDate' and 'EndDate' columns to datetime
data['StartDate'] = pd.to_datetime(data['StartDate'])
data['EndDate'] = pd.to_datetime(data['EndDate'])

# Convert 'MTurkID_md5' column to string
data['MTurkID_md5'] = data['MTurkID_md5'].astype(str)

Speaker Notes:

 - We can use the pd.to_datetime() function to convert date columns to datetime format.
 - This enables us to perform datetime operations on these columns, such as filtering by date ranges.
 - We use the astype() method to convert the 'MTurkID_md5' column to a string datatype.
 - This can be useful for certain operations or when exporting the data to other formats.
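
As a sketch of the datetime operations mentioned above, the cell below uses two rows copied from the `head()` output earlier in this tutorial. The `duration` column name and the 23:30 cutoff are assumptions made for illustration.

```python
import pandas as pd

# Two rows copied from the head() output, still stored as text.
toy = pd.DataFrame({
    "StartDate": ["2012-11-17 23:54:00", "2012-11-17 23:17:26"],
    "EndDate": ["2012-11-18 00:07:12", "2012-11-17 23:41:13"],
})
toy["StartDate"] = pd.to_datetime(toy["StartDate"])
toy["EndDate"] = pd.to_datetime(toy["EndDate"])

# Subtracting two datetime columns yields a timedelta column.
toy["duration"] = toy["EndDate"] - toy["StartDate"]   # hypothetical column name

# Boolean mask: sessions that started before 23:30 on 17 November 2012.
early = toy[toy["StartDate"] < pd.Timestamp("2012-11-17 23:30:00")]

print(toy["duration"])
print(len(early))
```

Neither the subtraction nor the comparison would work as intended while the columns still held plain strings.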

## Summary Statistics
Pandas offers a convenient way to calculate summary statistics for numerical columns.

In [7]:
# Calculate summary statistics
print(data.describe())

       Unnamed: 0                      StartDate  \
count  178.000000                            178   
mean    89.500000  2012-11-17 23:27:10.505618176   
min      1.000000            2012-11-17 22:49:18   
25%     45.250000     2012-11-17 23:04:57.500000   
50%     89.500000     2012-11-17 23:25:27.500000   
75%    133.750000            2012-11-17 23:46:00   
max    178.000000            2012-11-18 01:07:15   
std     51.528309                            NaN   

                             EndDate      Cum_ID  filter  filter2  \
count                            178  178.000000   178.0    178.0   
mean   2012-11-17 23:42:33.724718848   92.769663     5.0      1.0   
min              2012-11-17 22:58:53    1.000000     5.0      1.0   
25%    2012-11-17 23:19:35.249999872   45.250000     5.0      1.0   
50%       2012-11-17 23:40:24.500000   93.500000     5.0      1.0   
75%              2012-11-18 00:02:42  138.750000     5.0      1.0   
max              2012-11-18 01:17:53  192.000000

Speaker Notes:

 - The `describe()` method generates descriptive statistics such as count, mean, standard deviation, min, max, and the quartiles (25%, 50%, 75%).
 - This gives us insights into the distribution of numerical data in the DataFrame.
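
`describe()` also works on a single column, and individual statistics can be picked out by label. A minimal sketch, using the five `Cum_ID` values shown in the `head()` output earlier:

```python
import pandas as pd

# The five Cum_ID values from the head() output.
toy = pd.DataFrame({"Cum_ID": [144, 91, 127, 24, 168]})

summary = toy["Cum_ID"].describe()   # a Series indexed by statistic name

print(summary)
print(summary["mean"])   # arithmetic mean
print(summary["50%"])    # median
```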

Speaker Notes for Summary of Section:

- Pandas is a versatile library for data manipulation and analysis.
- We can load data from various sources into DataFrames using pandas functions like `read_csv()`.
- Exploring the data's structure and contents using functions like `head()` and `info()` helps us understand the dataset.
- Summary statistics provided by `describe()` give us insights into the distribution of numerical data.

## Understanding Data Objects in Pandas
In this section, we'll delve into the two main data objects in pandas: `DataFrame` and `Series`.

## DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Jupyter renders a `pandas.DataFrame` as a formatted table when it is the last expression in a cell.


In [11]:
data

| Unnamed: 0.1 | Unnamed: 0 | StartDate | EndDate | Cum_ID | filter | filter2 | CF_headsheads1tails2 | num_coin_tosses | instr | reported_guessed_correctly | ... | ethnicity_5 | ethnicity_6 | student | code | care_about_rules | pos_affect | neg_affect | RAT_perf | cheated | MTurkID_md5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2012-11-17 23:54:00 | 2012-11-18 00:07:12 | 144 | 5 | 1 | 1 | 3 | 1 | 0 | ... |  | 1.0 | 1.0 | 1 | 6.000000 | 2.3 | 2.5 | 9 | 0 | 165d039a661a10e307754efba79a6110 |
| 1 | 2 | 2012-11-17 23:17:26 | 2012-11-17 23:41:13 | 91 | 5 | 1 | 1 | 3 | 1 | 0 | ... |  | 1.0 | 0.0 | 1 | 5.333333 | 2.5 | 1.4 | 8 | 0 | 7ff5028e3f50f083999a7e904694524d |
| 2 | 3 | 2012-11-17 23:44:36 | 2012-11-17 23:57:53 | 127 | 5 | 1 | 2 | 3 | 1 | 0 | ... |  | 1.0 | 1.0 | 1 | 6.000000 | 2.1 | 1.1 | 9 | 0 | 0795454a621e15bff1b89c8a14efa41e |
| 3 | 4 | 2012-11-17 22:57:36 | 2012-11-17 23:11:29 | 24 | 5 | 1 | 2 | 3 | 1 | 0 | ... |  | 1.0 | 0.0 | 1 | 6.333333 | 1.3 | 1.0 | 12 | 0 | c55d3da1c3ab1bcba89145f8bcb5a75e |
| 4 | 5 | 2012-11-18 00:00:06 | 2012-11-18 00:20:19 | 168 | 5 | 1 | 1 | 3 | 1 | 0 | ... |  |  | 0.0 | 1 | 5.333333 | 2.8 | 1.3 | 0 | 0 | ad13ddfad7d7753a38595810fdaa566a |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173 | 174 | 2012-11-17 23:46:08 | 2012-11-17 23:58:52 | 128 | 5 | 1 | 1 | 3 | 1 | 1 | ... |  | 1.0 | 0.0 | 1 | 6.666667 | 3.4 | 1.8 | 2 | 1 | 47976131cee69094caada8166ae4247b |
| 174 | 175 | 2012-11-17 23:37:16 | 2012-11-17 23:50:19 | 106 | 5 | 1 | 2 | 3 | 1 | 1 | ... |  | 1.0 | 0.0 | 1 | 3.666667 | 1.9 | 2.4 | 8 | 1 | 1c2b7752f3ac8bcc42cca8a43b22fd7f |
| 175 | 176 | 2012-11-17 23:06:56 | 2012-11-17 23:17:08 | 35 | 5 | 1 | 2 | 3 | 1 | 1 | ... |  | 1.0 | 0.0 | 1 | 4.333333 | 2.3 | 2.0 | 10 | 1 | fcd1983e2829cdc53eb78f750ac5d17e |
| 176 | 177 | 2012-11-17 23:06:38 | 2012-11-17 23:21:41 | 50 | 5 | 1 | 2 | 3 | 1 | 1 | ... |  | 1.0 | 0.0 | 1 | 1.666667 | 2.2 | 1.0 | 6 | 1 | d883244dc21033c732168ce773cfcb90 |



Speaker Notes:
- Think of a DataFrame as a table or spreadsheet with rows and columns.
- Each column can have a different datatype (e.g., integer, float, string).
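
The relationship between the two objects can be sketched with a tiny frame: selecting one column of a DataFrame gives a Series. The three rows below are copied from the output above; the variable names `toy` and `column` are illustrative.

```python
import pandas as pd

# Three rows copied from the output above, restricted to three columns.
toy = pd.DataFrame({
    "StartDate": ["2012-11-17 23:54:00", "2012-11-17 23:17:26", "2012-11-17 23:44:36"],
    "Cum_ID": [144, 91, 127],
    "pos_affect": [2.3, 2.5, 2.1],
})

column = toy["pos_affect"]     # selecting a single column returns a Series

print(type(toy).__name__)      # DataFrame
print(type(column).__name__)   # Series
print(toy.dtypes)              # each column keeps its own dtype
print(column.mean())
```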

## Cutting the data by column
The data analysis will be done using only three columns. To keep things tidy, let us extract these columns.

In [9]:
data

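
A minimal sketch of extracting columns by name follows. The source does not say which three columns the analysis uses, so the names in `columns_of_interest` are an assumption chosen from the columns visible in the output earlier in this section; the rows are copied from that output.

```python
import pandas as pd

# Rows copied from the output above; only a few columns are reproduced.
toy = pd.DataFrame({
    "Cum_ID": [144, 91, 127],
    "pos_affect": [2.3, 2.5, 2.1],
    "RAT_perf": [9, 8, 9],
    "cheated": [0, 0, 0],
})

# Assumption: illustrative choice, the source does not name the three columns.
columns_of_interest = ["Cum_ID", "RAT_perf", "cheated"]

# Indexing with a list of names returns a new DataFrame with just those columns.
# copy() makes the subset independent of the original frame.
subset = toy[columns_of_interest].copy()

print(subset)
```

Note the double brackets that result from passing a list: `toy[["cheated"]]` is a one-column DataFrame, while `toy["cheated"]` is a Series.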
