![University Logo](../Durham_University.svg)

# Dictionaries are like a labelled drawer containing arbitrary data
Data is accessed by key.

In [None]:
# Create a mixed dict with a list as content
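
One way the cell above could be filled in is sketched here. The name `drawer` and its contents are made up for illustration.

```python
# A dictionary whose values have different types, including a list
drawer = {"label": "socks", "count": 3, "colours": ["red", "blue", "green"]}

# Values are looked up by key, not by position
colours = drawer["colours"]
print(drawer["label"], colours[0])
```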


# Introduction to Pandas

## What is Pandas?
- Open-source library providing easy-to-use data structures and data analysis tools.
- Built on top of NumPy, making it a critical tool for data manipulation in Python.

## Why Use Pandas?
- Simplifies many complex data operations that are cumbersome or less intuitive in NumPy.
- Provides robust tools for working with structured data such as time series and tables.

# The Connection Between Pandas and NumPy

Pandas is built on NumPy and extends its capabilities by providing more flexible data structures.

## Pandas and NumPy
- **NumPy**: Provides the foundation with its array object, which is designed for efficient numeric computation.
- **Pandas**: Uses NumPy arrays to store data, so it benefits from NumPy's speed, and adds significant functionality for handling labelled data.

## Use Cases
- **NumPy**: Ideal for performing numerical computations. Focus is on the numerical transformation of data.
- **Pandas**: Best for more complex operations involving data cleaning, transformations, and analysis of tabular data.

The two libraries can easily be combined.

# Importing Pandas

In [None]:
# Importing pandas

# Display the version
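
A minimal sketch of the import cell, using the conventional alias `pd`:

```python
import pandas as pd

# The installed version is available as a plain string
print(pd.__version__)
```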


# Key Data Structures in Pandas

Pandas primarily uses two main data structures: `Series` and `DataFrame`.

## Series
- A one-dimensional array-like object capable of holding any data type.
- Each element has an index; the default one ranges from 0 to N, where N is the length of the series minus one.

## DataFrame
- A two-dimensional, size-mutable, potentially heterogeneous tabular data structure.
- Data is aligned in a tabular fashion in rows and columns.
- Think of it as a spreadsheet or SQL table.

## We can create Series from NumPy arrays, lists and dictionaries

In [None]:
# Import numpy
import numpy as np


In [None]:
# Create a series from a list (size 5)


In [None]:
# Create a series from a numpy array (size 5)


In [None]:
# Create a series from a dictionary
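
The three cells above could be filled in along these lines. All values and the names `s_list`, `s_array` and `s_dict` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# From a list: the index defaults to 0, 1, ..., 4
s_list = pd.Series([10, 20, 30, 40, 50])

# From a NumPy array of five evenly spaced floats
s_array = pd.Series(np.linspace(0.0, 1.0, 5))

# From a dictionary: the keys become the index labels
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})

print(s_list.index, s_dict.index)
```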


## We can create a DataFrame from a dictionary 
The keys of the dictionary become the column labels. The values can be any objects that can be converted into a `Series`, provided they all have the same length and, where they carry an index, compatible index labels.

In [None]:
# Create a dataframe from the list and the numpy array


In [None]:
# Create another dataframe from the series
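
A possible sketch for the two cells above. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

values_list = [1, 2, 3, 4, 5]
values_array = np.arange(5) * 0.5

# Dictionary keys become column labels; both inputs have length 5
df_demo = pd.DataFrame({"from_list": values_list, "from_array": values_array})

# Series are aligned on their shared index labels
first = pd.Series([1, 2, 3], index=["x", "y", "z"])
second = pd.Series([4.0, 5.0, 6.0], index=["x", "y", "z"])
df_series = pd.DataFrame({"first": first, "second": second})

print(df_demo)
print(df_series)
```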


## We can also create a NumPy array from a Series or DataFrame
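For that we use the `values` property. The `to_numpy()` method does the same job and is the option recommended in the pandas documentation.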

In [None]:
# Show the values property of a pd.Series


In [None]:
# Show the values property of a pd.DataFrame
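
As a sketch with made-up data, both objects hand back a plain NumPy array:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5])
df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# A Series gives a one-dimensional array, a DataFrame a two-dimensional one
series_values = s.values
frame_values = df_demo.values

print(type(series_values), frame_values.shape)
```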


# Loading Data with Pandas

In [None]:
data_path = '../Data/presentation/DAC_Study_4_PS.sav.csv'

# Load the dataset

# Display the first few rows of the dataframe
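
With the real file, the cell above would presumably call `pd.read_csv(data_path)` followed by `df.head()`. The sketch below reads a tiny in-memory CSV instead so that it runs without the dataset; the three rows are invented, and only the column names `cheated` and `Numberofresponses` are taken from this notebook.

```python
import io
import pandas as pd

# Stand-in for the CSV file: invented rows with two of the real column names
csv_text = "cheated,Numberofresponses\n0,5\n1,9\n0,7\n"

# read_csv accepts a path or any file-like object
df_mock = pd.read_csv(io.StringIO(csv_text))

# head() shows the first five rows by default
print(df_mock.head())
```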


# This is a real dataset
The data were originally published in [Gino & Wiltermuth (2014): 'Evil Genius?: How Dishonesty Can Lead to Greater Creativity'](https://doi.org/10.1177/0956797614520714).

However, we reproduce the analysis done in the blog post [Uri, Joe, & Leif (2023): '[111] Data Falsificada (Part 3): "The Cheaters Are Out of Order"'](https://datacolada.org/111)

This is one of the studies from a widely publicised case of academic misconduct within research on dishonesty.

Until 2023, Francesca Gino was a professor at Harvard Business School.

We will first introduce some pandas functions and then reproduce part of the original analysis and the analysis done by Uri et al.

# Basic DataFrame Operations

In [None]:
# Display the shape of the dataframe

# Get a concise summary of the dataframe
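
A sketch on a small made-up frame (`df_mock` and the column `age` are assumptions, not part of the dataset description):

```python
import pandas as pd

df_mock = pd.DataFrame({
    "cheated": [0, 1, 0, 1],
    "Numberofresponses": [5, 12, 8, 14],
    "age": [21, 34, 19, 28],  # hypothetical column
})

# shape is a (rows, columns) tuple
print(df_mock.shape)

# info() prints column names, non-null counts and dtypes
df_mock.info()
```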


# Accessing data values

We can access the `pd.Series` belonging to a column by name, just as we would look up a key in a dictionary. Passing a list of column names returns a new `pd.DataFrame` containing only those columns.

In [None]:
# Access the cheated and the Numberofresponses columns
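
A sketch of both access styles on made-up data (`df_mock` and `age` are assumptions):

```python
import pandas as pd

df_mock = pd.DataFrame({
    "cheated": [0, 1, 0, 1],
    "Numberofresponses": [5, 12, 8, 14],
    "age": [21, 34, 19, 28],  # hypothetical column
})

# A single name returns a Series
single_column = df_mock["cheated"]

# A list of names returns a DataFrame
two_columns = df_mock[["cheated", "Numberofresponses"]]

print(type(single_column).__name__, type(two_columns).__name__)
```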


## Data Selection with loc and iloc

In [None]:
# Select columns by name using loc

# Display the first five rows


In [None]:
# Select rows and columns by index using iloc

# Display the first five rows
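
The difference between the two accessors can be sketched on made-up data (`df_mock` and `age` are assumptions). Note that `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position.

```python
import pandas as pd

df_mock = pd.DataFrame({
    "cheated": [0, 1, 0, 1],
    "Numberofresponses": [5, 12, 8, 14],
    "age": [21, 34, 19, 28],  # hypothetical column
})

# Label-based: rows labelled 0 to 2 (inclusive), columns by name
by_label = df_mock.loc[0:2, ["cheated", "Numberofresponses"]]

# Position-based: first two rows and first two columns (end excluded)
by_position = df_mock.iloc[0:2, 0:2]

print(by_label.head())
print(by_position.head())
```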


## A word of warning about chained assignment
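Chained indexing such as `df[columns][rows]` performs two separate selections. The first selection may already return a copy, so an assignment to the result may never reach the original DataFrame!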

In [None]:
df[['StartDate', 'EndDate']][0:5] = df[['StartDate', 'EndDate']][0:5]
#  Here lies the problem          -   The access here is fine

Use this notation for access to data but not for assignment!
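
For assignment, a single `loc` call that names rows and columns together is the usual alternative. A sketch with invented placeholder strings in the two date columns:

```python
import pandas as pd

df_mock = pd.DataFrame({
    "StartDate": ["a", "b", "c"],
    "EndDate": ["d", "e", "f"],
})

# One indexing step: rows 0 to 1 (inclusive) and both columns at once
df_mock.loc[0:1, ["StartDate", "EndDate"]] = "changed"

print(df_mock)
```

Because there is only one indexing operation, the assignment acts on `df_mock` itself rather than on an intermediate object.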

## Boolean data selection
A boolean mask applied to a `pd.DataFrame` selects rows: only rows where the mask is `True` are kept.

In [None]:
# Using conditions to filter rows with more than ten responses

# Display the head of filtered data


In [None]:
# Using loc with conditions to get only the cheated and Numberofresponses columns for rows with a large number of responses

# Display the head of filtered data
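
Both cells above can be sketched on made-up data (`df_mock` and `age` are assumptions):

```python
import pandas as pd

df_mock = pd.DataFrame({
    "cheated": [0, 1, 0, 1],
    "Numberofresponses": [5, 12, 8, 14],
    "age": [21, 34, 19, 28],  # hypothetical column
})

# The comparison yields a boolean Series, one entry per row
mask = df_mock["Numberofresponses"] > 10

# Indexing with the mask keeps the rows where it is True
filtered = df_mock[mask]

# loc combines the row mask with a column selection
subset = df_mock.loc[mask, ["cheated", "Numberofresponses"]]

print(filtered.head())
print(subset.head())
```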


## Cleaning up: Converting Date Columns to Datetime
Pandas allows us to convert date columns to datetime format for easier manipulation.

In [None]:
# Convert 'StartDate' and 'EndDate' to datetime

# Display the types to confirm conversion
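
A sketch of the conversion with invented timestamps. The date format of the real file is not shown in this notebook, so the ISO-style strings below are an assumption.

```python
import pandas as pd

df_mock = pd.DataFrame({
    "StartDate": ["2014-01-05 10:00:00", "2014-01-05 11:30:00"],
    "EndDate": ["2014-01-05 10:12:00", "2014-01-05 11:45:00"],
})

# Convert each text column to a datetime column
for column in ["StartDate", "EndDate"]:
    df_mock[column] = pd.to_datetime(df_mock[column])

print(df_mock.dtypes)

# Datetime columns support arithmetic, for example a duration
duration = df_mock["EndDate"] - df_mock["StartDate"]
print(duration)
```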


## Cleaning up: Handling Missing Values

In [None]:
# Use isnull() to check for missing values
# Get sum to test for the number of missing values per column

# Output columns with missing values
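
A sketch on made-up data with a few gaps (`df_mock` and `age` are assumptions):

```python
import numpy as np
import pandas as pd

df_mock = pd.DataFrame({
    "cheated": [0, 1, 0, 1],
    "Numberofresponses": [5, np.nan, 7, np.nan],
    "age": [21, 34, np.nan, 28],  # hypothetical column
})

# isnull() marks missing entries; sum() counts them per column
missing_per_column = df_mock.isnull().sum()

# Keep only the columns that actually have missing values
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```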


## An example of how you might fill missing data
This is a technical demonstration. Whether and how to impute missing data depends strongly on your field, and imputation is rarely the optimal solution.

In [None]:
# Fill missing numeric values with the median
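
One way to do this, sketched on made-up data: passing a Series of per-column medians to `fillna` fills each column with its own median.

```python
import numpy as np
import pandas as pd

df_mock = pd.DataFrame({
    "cheated": [0, 1, 0, 1],
    "Numberofresponses": [5.0, np.nan, 7.0, 9.0],
})

# Median of every numeric column, indexed by column name
medians = df_mock.median(numeric_only=True)

# fillna matches the Series index against the column names
df_filled = df_mock.fillna(medians)
print(df_filled)
```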


## An alternative is to exclude the affected rows with `dropna`

In [None]:
# Demonstrate dropna on whole dataset


In [None]:
# Demonstrate dropna on a subset of columns
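
Both variants can be sketched on made-up data (`df_mock` and `age` are assumptions):

```python
import numpy as np
import pandas as pd

df_mock = pd.DataFrame({
    "cheated": [0, 1, 0, 1],
    "Numberofresponses": [5, np.nan, 7, 9],
    "age": [21, 34, np.nan, 28],  # hypothetical column
})

# Drop every row that has a missing value in any column
dropped_any = df_mock.dropna()

# Drop only rows where the listed columns are missing
dropped_subset = df_mock.dropna(subset=["Numberofresponses"])

print(len(df_mock), len(dropped_any), len(dropped_subset))
```

`dropna` returns a new DataFrame, so `df_mock` itself is left unchanged.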


# Using `describe` for a quick statistical analysis

In [None]:
# Descriptive statistics for numeric columns
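
A sketch on made-up data; `describe` returns a DataFrame whose rows are the statistics:

```python
import pandas as pd

df_mock = pd.DataFrame({
    "cheated": [0, 1, 0],
    "Numberofresponses": [5, 7, 9],
})

# count, mean, std, min, quartiles and max for each numeric column
summary = df_mock.describe()
print(summary)
```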


# Reproducing the original but refuted calculation 

[Gino & Wiltermuth (2014): 'Evil Genius?: How Dishonesty Can Lead to Greater Creativity'](https://doi.org/10.1177/0956797614520714)

 - The original hypothesis was that people willing to cheat are more creative
 - People participated in a virtual coin toss game where they had the opportunity and incentive to cheat
 - They were then asked to come up with ways to use a newspaper, which served as one of several measures of creativity

## Selecting Relevant Data for Analysis
We create a new work DataFrame that only contains the columns that indicate 
 - whether a participant cheated (`cheated`)
 - and the number of uses for the newspaper (`Numberofresponses`)

In [None]:
# Create a new DataFrame with only necessary columns

# Display the new DataFrame to confirm selection


## Preparing Data for Analysis
We create subsets of the data for the people who cheated and the people who did not.

In [None]:
# Select non-cheaters and display the first 7 rows


In [None]:
# Select cheaters and display the first 7 rows
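
A sketch of the selection and the split on made-up data. It assumes that `cheated` is coded as 1 for cheaters and 0 for non-cheaters, which this notebook does not state explicitly; `df_mock`, `df_work` and `age` are illustrative names.

```python
import pandas as pd

df_mock = pd.DataFrame({
    "cheated": [0, 1, 0, 1, 1],
    "Numberofresponses": [5, 12, 8, 14, 9],
    "age": [21, 34, 19, 28, 40],  # hypothetical column
})

# Working frame with only the two columns of interest
df_work = df_mock[["cheated", "Numberofresponses"]]

# Boolean masks split the participants into two groups
non_cheaters = df_work[df_work["cheated"] == 0]
cheaters = df_work[df_work["cheated"] == 1]

print(non_cheaters.head(7))
print(cheaters.head(7))
```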


## Conducting T-Tests
We can now reproduce the analysis of the original publication from 2014

In [None]:
from scipy.stats import ttest_ind
# Calculate the ttest value for the numberofresponses of cheaters and non-cheaters

# Calculate the mean and std of number of responses for cheaters and non-cheaters
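
A sketch of the test on invented numbers, again assuming `cheated` is coded 0/1. By default `ttest_ind` performs a two-sided Student's t-test with equal variances; whether this matches the exact test of the original publication is not confirmed here.

```python
import pandas as pd
from scipy.stats import ttest_ind

df_mock = pd.DataFrame({
    "cheated": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "Numberofresponses": [9, 10, 11, 12, 13, 5, 6, 7, 8, 9],
})

cheaters = df_mock.loc[df_mock["cheated"] == 1, "Numberofresponses"]
non_cheaters = df_mock.loc[df_mock["cheated"] == 0, "Numberofresponses"]

# Independent two-sample t-test
t_stat, p_value = ttest_ind(cheaters, non_cheaters)
print(t_stat, p_value)

# Group means and sample standard deviations
print(cheaters.mean(), cheaters.std())
print(non_cheaters.mean(), non_cheaters.std())
```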


## The sorting of the data indicates manipulation
as found by [Uri, Joe, & Leif (2023): [111] Data Falsificada (Part 3): "The Cheaters Are Out of Order"](https://datacolada.org/111)

We can have a look by using matplotlib

In [None]:
# Create a plot of index versus number of responses, color by the cheated column
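
A sketch of such a plot on invented numbers. The colour map `coolwarm` is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

df_mock = pd.DataFrame({
    "cheated": [0, 0, 0, 1, 1, 1],
    "Numberofresponses": [5, 6, 8, 7, 12, 9],
})

fig, ax = plt.subplots()

# Row index on the x-axis, responses on the y-axis, colour by group
points = ax.scatter(
    df_mock.index,
    df_mock["Numberofresponses"],
    c=df_mock["cheated"],
    cmap="coolwarm",
)
ax.set_xlabel("Row index")
ax.set_ylabel("Number of responses")
```

In a notebook the figure is displayed automatically at the end of the cell; in a script, `plt.show()` would be needed.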


# So what if the cheaters were originally in order as well?

We can implement a function that enforces monotonicity, meaning that no value may be higher than the values that follow it.

In [None]:
import numpy as np

# Implement a function by checking if value higher than the min of value + next 4
# Correct value to min if True

# Apply the function to the cheated values and create a new column ImputedResponses

# Fill the missing values for non-cheaters from the original data

# Display the original and imputed data
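
The steps in the cell above can be sketched as follows. The function name `enforce_local_order`, the parameter `lookahead` and all numbers are assumptions for illustration. Because each value is only compared with its next four neighbours, this is a local correction and does not guarantee a fully sorted result for arbitrary input.

```python
import numpy as np
import pandas as pd

def enforce_local_order(values, lookahead=4):
    """Lower each value to the minimum of itself and the next `lookahead` values."""
    values = np.asarray(values, dtype=float)
    corrected = values.copy()
    for i in range(len(values)):
        window_min = values[i:i + lookahead + 1].min()
        if values[i] > window_min:
            corrected[i] = window_min
    return corrected

df_mock = pd.DataFrame({
    "cheated": [0, 1, 1, 1, 1, 1, 1, 1],
    "Numberofresponses": [6, 3, 9, 4, 5, 12, 6, 7],
})

# Apply the correction to the cheaters only
cheater_mask = df_mock["cheated"] == 1
df_mock.loc[cheater_mask, "ImputedResponses"] = enforce_local_order(
    df_mock.loc[cheater_mask, "Numberofresponses"]
)

# Non-cheaters are still missing: fill them from the original column
df_mock["ImputedResponses"] = df_mock["ImputedResponses"].fillna(
    df_mock["Numberofresponses"]
)
print(df_mock[["cheated", "Numberofresponses", "ImputedResponses"]])
```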


# Visualizing Before and After Imputation
With more involved functions we should make sure that our calculation actually worked

In [None]:
# Create a new matplotlib figure to check that the function was successful

# Create DataFrame view with cheaters

# Plot Original Data

# Plot Imputed Data


## Rerun the Statistical Analysis
Although we have imputed on the high side (if in doubt, we took the higher number of responses), the significance has vanished.

In [None]:
# Recreate the analysis with the imputed data

# Calculate the mean and std of number of responses for cheaters and non-cheaters


## Grouping Data for Comparison

In [None]:
# 'groupby' cheated and calculate new means


In [None]:
# 'groupby' cheated and calculate new standard deviations
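
A sketch of both group statistics on invented numbers:

```python
import pandas as pd

df_mock = pd.DataFrame({
    "cheated": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "Numberofresponses": [9, 10, 11, 12, 13, 5, 6, 7, 8, 9],
})

# groupby splits the rows by the values of 'cheated'
grouped = df_mock.groupby("cheated")["Numberofresponses"]

# One aggregated value per group, indexed by the group label
group_means = grouped.mean()
group_stds = grouped.std()

print(group_means)
print(group_stds)
```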


# A final function: Using pd.merge to combine data of different dataframes
We will now demonstrate the Pandas merge function. This has nothing to do with the research question. We will however use the data as mock data.

Let us say we have a grading scheme for the number of responses, and want to give each participant a grade for their number of responses.

In [None]:
# Grading scheme
grades = pd.DataFrame({
    'Numberofresponses': [2, 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14],
    'Grade': ['F', 'F', 'F', 'D', 'D', 'C', 'C', 'C', 'B', 'B', 'B', 'A', 'A']
})

# Merge the grades with the data
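
A sketch of the merge with a shortened copy of the grading scheme and an invented `participants` frame. With `how="left"` every participant is kept, and a number of responses that does not appear in the scheme receives a missing grade.

```python
import pandas as pd

# Shortened copy of the grading scheme above
grades = pd.DataFrame({
    "Numberofresponses": [5, 9, 13],
    "Grade": ["D", "C", "A"],
})

# Hypothetical participants; 20 is not covered by the scheme
participants = pd.DataFrame({"Numberofresponses": [5, 13, 9, 20]})

# Match rows on the shared column, keeping all participants
merged = participants.merge(grades, on="Numberofresponses", how="left")
print(merged)
```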


# Key Takeaways

Today, we've covered a wide range of topics in Pandas, aimed at giving you the tools to perform basic data manipulation, analysis, and visualization:

1. **Data Loading and Cleaning**: How to load data from a CSV file and clean it for analysis.
2. **Data Selection and Manipulation**: Techniques for selecting, filtering, and adjusting data.
3. **Statistical Analysis**: Using t-tests to compare groups within your data.
4. **Data Visualization**: Creating plots to visualize data discrepancies and distributions.
5. **Grouping Data**: How to segment data for grouped analyses and comparisons.
6. **Merging Data**: How to combine data from two different sources.
