# Pandas cheat code

## Introduction
The objective of this knowledge post is to save time and increase code quality by offering an alternative to the Pandas documentation and Stack Overflow, in the form of a Jupyter Notebook with the following features:

- Easy to copy-paste code snippets for key Pandas data transformations
- Embedded data to run and test the snippets easily
- Alteryx nodes to Pandas functions mapping

## How to leverage it
- You can run and test any function with the embedded data, just make sure to load the data first ;)
- You can easily copy-paste a code snippet by **double clicking** on a cell, then right click and copy-paste


### 1. Getting started

The first time you use this notebook, run the pip install cell once (in case the libraries are not already installed in your Python environment).

#### 1.1. Setup libraries

In [24]:
!pip install pandas
!pip install openpyxl

In [25]:
import pandas as pd

#### 1.2. About pandas dataframes

A pandas dataframe contains an array, a list of column names (the column index) and a list of row names (the row index). In many cases, an entry in the column index is the name of the column and an entry in the row index is the row number. Some operations on dataframes affect the row index, so the name of a row is no longer its row number. We will see how to fix that in xx.xx.
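To make the two indexes concrete, here is a minimal sketch. The `demo` dataframe is hypothetical and not one of the notebook's datasets, and `reset_index(drop=True)` is shown as one common way to renumber rows, not necessarily the approach the notebook has in mind.

```python
import pandas as pd

# Hypothetical demo data, not part of the notebook's datasets
demo = pd.DataFrame({"city": ["Brussels", "Paris", "Lyon"],
                     "country": ["BE", "FR", "FR"]})

print(list(demo.columns))  # the column index
print(list(demo.index))    # the row index (row numbers by default)

# Filtering keeps the original row labels, so they no longer start at 0
french = demo[demo["country"] == "FR"]
print(list(french.index))

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
french_renumbered = french.reset_index(drop=True)
print(list(french_renumbered.index))
```

Without `drop=True`, the old row labels would be kept as an extra column named `index`.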


### 2. Read data (inputdata)
https://pandas.pydata.org/docs/user_guide/io.html

<div class="alert alert-block alert-danger">
Make sure to load the datasets before running any other snippets of code ;)
    </div>

#### 2.1. Load text file (.csv, .txt, ...)

Transfer the content of a file to a dataframe and get basic information about the loaded dataset. 

In [56]:
#
# read a comma separated file
#
transactions = pd.read_csv('data_input/transactions.csv', sep=',')

#
# read a tab separated file
#
customers_1 = pd.read_csv('data_input/customers.txt', sep='\t')

# note that the phone numbers are losing their leading zeroes:
print(customers_1)


In [57]:
#
# we need to prevent pandas from converting the loaded text to a numeric field type
#
# read a text file (here a tab separated file) and force all fields to string
#
customers = pd.read_csv('data_input/customers.txt', sep='\t', dtype='str')
print(customers)
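The effect can be reproduced without any file, as in the sketch below. The tab-separated text is hypothetical and merely stands in for `customers.txt`; `io.StringIO` is used only so the example is self-contained. The last call shows that `dtype` also accepts a dictionary, if you prefer to force only selected columns to text.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for customers.txt
raw_text = "customer_id\tphone_nbr\nC1\t0471123456\nC2\t0612345678\n"

# Default behaviour: pandas infers a numeric type and the leading zero is lost
as_inferred = pd.read_csv(io.StringIO(raw_text), sep="\t")
print(as_inferred["phone_nbr"].tolist())

# All fields forced to text: the phone numbers are kept as typed
as_text = pd.read_csv(io.StringIO(raw_text), sep="\t", dtype=str)
print(as_text["phone_nbr"].tolist())

# Only one column forced to text; pandas still infers the type of the others
partly_text = pd.read_csv(io.StringIO(raw_text), sep="\t",
                          dtype={"phone_nbr": str})
print(partly_text["phone_nbr"].tolist())
```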


#### 2.2. Load xlsx

Pandas uses openpyxl to load a dataset from a sheet. 

In [28]:
costs = pd.read_excel('data_input/costs.xlsx', sheet_name='Sheet1')

#### 2.3 Get basic information about the loaded dataset

Get the number of rows, number of columns and list of column names

In [29]:

print(f"Number of rows: {customers.shape[0]}")
print(f"Number of columns: {customers.shape[1]}")
print("Columns:", customers.columns)


### 3. Explore data (browse)
https://pandas.pydata.org/docs/user_guide/basics.html

#### 3.1. Remove column number display limitation

In [30]:
pd.set_option('display.max_columns', None)

#### 3.2. Display top n rows

In [31]:
transactions.head(n=5)

#### 3.3. Display all rows

In [32]:
transactions

#### 3.4. Quick statistics of numerical columns

In [33]:
transactions.describe()

#### 3.5. Distinct values of a column

In [34]:
#
# count number of distinct values in a column
#
print(transactions["customer_id"].nunique())

In [35]:
#
# list of distinct values in a column
#
print(transactions["customer_id"].unique())

In [36]:
#
# list of distinct values in a column with their frequency
#
print(transactions["customer_id"].value_counts())

### 4. Select data (select)
https://pandas.pydata.org/docs/user_guide/basics.html

#### 4.1. Display columns and types

In [37]:
transactions.dtypes

In [38]:
transactions.info()

#### 4.2. Display number of rows

In [39]:
transactions.shape[0]

#### 4.3. Keep only a subset of columns

In [40]:
transactions_subset = transactions.copy()
transactions_subset = transactions_subset[['customer_id', 'product_id']]
transactions_subset.dtypes

#### 4.4. Drop columns

In [41]:
transactions_drop = transactions.copy()
transactions_drop.drop(['row_id', 'geo', 'bu', 'customer_level_2', 'product_level_2'], axis=1, inplace=True)
transactions_drop.dtypes

#### 4.5. Rename columns

In [42]:
transactions_rename = transactions.copy()
transactions_rename.rename(columns={'customer_level_2':'customer_category', 'product_level_2':'product_category'}, inplace=True)
transactions_rename.dtypes

#### 4.6. Add columns

In [61]:
customers_add_columns = customers.copy()

# adding a new column to form a unique city name

customers_add_columns["unique_city"] = customers_add_columns["country"] + "-" + customers_add_columns["city"]

customers_add_columns


#### 4.7. Change data type

In [43]:
transactions_type = transactions.copy()

#To text
transactions_type['row_id'] = transactions_type['row_id'].astype(str)
transactions_type.dtypes

#To number
transactions_type['row_id'] = pd.to_numeric(transactions_type['row_id'], errors='coerce')
transactions_type.dtypes

#To date
transactions_type['transaction_date'] = pd.to_datetime(transactions_type['transaction_date'])
transactions_type.dtypes
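The `errors='coerce'` option used above deserves a closer look: values that cannot be converted become missing values instead of raising an error. The sketch below uses a hypothetical `raw` dataframe, and the explicit `format` argument is an assumption about how the dates are written.

```python
import pandas as pd

# Hypothetical values, including entries that cannot be parsed
raw = pd.DataFrame({"amount": ["10", "2.5", "n/a"],
                    "day": ["2019-01-15", "2019-02-30", "2019-03-01"]})

# errors='coerce' turns unparseable values into NaN (numbers) or NaT (dates)
raw["amount_num"] = pd.to_numeric(raw["amount"], errors="coerce")
raw["day_date"] = pd.to_datetime(raw["day"], format="%Y-%m-%d", errors="coerce")

print(raw)

# Count the values that could not be converted
print("Unparseable amounts:", raw["amount_num"].isna().sum())
print("Unparseable dates:", raw["day_date"].isna().sum())
```

Counting the missing values after a conversion is a cheap way to detect silently dropped data.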

### 5. Transpose (transpose)
https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-by-melt

Pivots the orientation of the data so that horizontal fields are moved to the vertical axis.

In [44]:
# the costs dataset contains one column per month with the product price in different geographies:
print(costs.info())
costs.head(n=5)

costs_transpose = pd.melt(costs, id_vars=['product_id','geo'], value_vars=['2019-01','2019-02','2019-03','2019-04','2019-05','2019-06','2019-07','2019-09','2019-10','2019-11','2019-12','2020-01'])
costs_transpose.rename(columns={'variable':'year_month'}, inplace=True)
costs_transpose.rename(columns={'value':'average_unit_cost'}, inplace=True)

print(costs_transpose.info())
costs_transpose.head(n=5)


In [45]:
#
# if you do not want to type the list of columns to pivot, you can build it from the definition of the dataframe:
#

# get the list of columns as a Python list
costs_columns_to_pivot = list(costs.columns)

# remove the columns we want to retain as is
costs_columns_to_pivot.remove("product_id")
costs_columns_to_pivot.remove("geo")
print(costs_columns_to_pivot)

# transpose
costs_transpose_2 = pd.melt(costs, id_vars=['product_id','geo'], value_vars=costs_columns_to_pivot)
costs_transpose_2.rename(columns={'variable':'year_month'}, inplace=True)
costs_transpose_2.rename(columns={'value':'average_unit_cost'}, inplace=True)

# check results
print(costs_transpose_2.info())
costs_transpose_2.head(n=5)
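As a possible shortcut, `melt()` can name the two new columns directly through `var_name` and `value_name`, and it unpivots every column not listed in `id_vars` when `value_vars` is omitted. The sketch below uses a hypothetical miniature of the costs layout, not the real `costs` dataset.

```python
import pandas as pd

# Hypothetical miniature of the costs layout: one column per month
costs_demo = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "geo": ["BE", "FR"],
    "2019-01": [1.0, 2.0],
    "2019-02": [1.5, 2.5],
})

# value_vars is omitted, so every column outside id_vars is unpivoted;
# var_name and value_name replace the two rename() calls
costs_demo_long = pd.melt(
    costs_demo,
    id_vars=["product_id", "geo"],
    var_name="year_month",
    value_name="average_unit_cost",
)

print(costs_demo_long)
```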



### 6. Merge datasets (join)
https://pandas.pydata.org/docs/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging

#### 6.1. Examples of joins supported by the pandas merge() function

The examples use two small dataframes (left_source_df and right_source_df). The join key is stored in a column named 'column_join' in both datasets, but identical names are not required. When a column that is not part of the join key is present in both datasets, the merge() function appends the given suffixes to its name. For left, right and outer joins, the merge() function sets missing values to NaN.

In [47]:
left_source_df = pd.DataFrame({"column_join": ["A","B","C"], "column_b": ["X1","Y1","W1"], "column_c":[111,222,333]})
right_source_df = pd.DataFrame({"column_join": ["A","B","D"], "column_b": ["X2","Y2","Z2"], "column_d":[444,555,666]})

print("\nLeft source")
print(left_source_df)

print("\nRight source")
print(right_source_df)

print("\nExample inner join")
example_inner = pd.merge(left_source_df, right_source_df, 
                         how='inner', 
                         left_on=['column_join'], 
                         right_on=['column_join'], 
                         suffixes=('_LEFT', '_RIGHT'))
print(example_inner)

print("\nExample left join")
example_left = pd.merge(left_source_df, right_source_df, 
                        how='left', left_on=['column_join'], 
                        right_on=['column_join'], 
                        suffixes=('_LEFT', '_RIGHT'))
print(example_left)

print("\nExample right join")
example_right = pd.merge(left_source_df, right_source_df, 
                         how='right', 
                         left_on=['column_join'], 
                         right_on=['column_join'], 
                         suffixes=('_LEFT', '_RIGHT'))
print(example_right)

print("\nExample outer join")
example_outer = pd.merge(left_source_df, right_source_df, 
                         how='outer', 
                         left_on=['column_join'], 
                         right_on=['column_join'], 
                         suffixes=('_LEFT', '_RIGHT'))
print(example_outer)

#### 6.2. Using information returned by merge() to run right_only and left_only joins

In [48]:
# the merge() function can add a column with the source of the data (left_only, right_only, both) 
# by default, this column is named "_merge"
# we can apply a filter on this column to retain only rows available in left data source, right data source or both.

print("\nExample outer join with indicator")
example_outer_with_indicator = pd.merge(left_source_df, right_source_df, indicator=True, how='outer', left_on=['column_join'], right_on=['column_join'], suffixes=('_LEFT', '_RIGHT'))
print(example_outer_with_indicator)

print("\nExample left only join")
example_left_only = pd.merge(left_source_df, right_source_df, indicator=True, how='left', left_on=['column_join'], right_on=['column_join'], suffixes=('_LEFT', '_RIGHT'))
example_left_only = example_left_only[example_left_only["_merge"] == "left_only"]
print(example_left_only)

print("\nExample right only join")
example_right_only = pd.merge(left_source_df, right_source_df, indicator=True, how='right', left_on=['column_join'], right_on=['column_join'], suffixes=('_LEFT', '_RIGHT'))
example_right_only = example_right_only[example_right_only["_merge"] == "right_only"]
print(example_right_only)


In [49]:
#
# we can, of course, extract the three subsets from the results of our outer join with indicator
#
example_outer_left_only = example_outer_with_indicator[example_outer_with_indicator["_merge"] == "left_only"]
example_outer_right_only = example_outer_with_indicator[example_outer_with_indicator["_merge"] == "right_only"]
example_outer_common_only = example_outer_with_indicator[example_outer_with_indicator["_merge"] == "both"]

print("\nExample left only from outer join")
print(example_outer_left_only)

print("\nExample right only from outer join")
print(example_outer_right_only)

print("\nExample common only from outer join")
print(example_outer_common_only)


#### 6.3. Data preparation for examples based on our transaction data

<div class="alert alert-block alert-danger">
For the merge code snippets to work, first run the transpose code snippet to generate the "costs_transpose" dataset, then run the snippet below to generate the "transactions_enriched" dataset.
</div>

In [50]:
transactions_enriched = transactions.copy()
transactions_enriched['transaction_date'] = transactions_enriched['transaction_date'].astype(str)
transactions_enriched['year_month'] = transactions_enriched['transaction_date'].str[:7]
transactions_enriched[['transaction_date', 'year_month']]


#### 6.4. Running an inner join on our transaction data

In [51]:
merge_inner = pd.merge(transactions_enriched, costs_transpose, how='inner', left_on=['product_id', 'geo', 'year_month'], right_on=['product_id', 'geo', 'year_month'], suffixes=('_LEFT', '_RIGHT'))
merge_inner.info()

print(f"Number of rows in transaction data: {transactions_enriched.shape[0]}")
print(f"Number of rows in merged dataset: {merge_inner.shape[0]}")


We are losing transaction records because the merge() function cannot find the corresponding cost information. What we need here is a left join, something similar to a VLookup() in Excel.

#### 6.5. Running a left join ( VLookup() ) on our transaction data

In [53]:
# Since the column names are identical in both dataframes, we can use a simplified version:

merge_left = pd.merge(transactions_enriched, costs_transpose,
                      indicator=True,
                      how='left', 
                      on=['product_id', 'geo', 'year_month'], 
                      suffixes=('_LEFT', '_RIGHT'))

merge_left.info()

print(f"Number of rows in transaction data: {transactions_enriched.shape[0]}")
print(f"Number of rows in merged dataset: {merge_left.shape[0]}")

transactions_without_costs = merge_left[merge_left["_merge"] == "left_only"]
print(f"Number of transaction rows without costs: {transactions_without_costs.shape[0]}")
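Unlike a VLookup(), a left join returns one row per match, so duplicate keys in the lookup table multiply the rows on the left. The sketch below shows how the `validate` parameter of `merge()` can guard against this. The `sales` and `lookup` dataframes are hypothetical, and whether such a check is useful depends on your data.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical data: the lookup table contains the key "P1" twice
sales = pd.DataFrame({"product_id": ["P1", "P2"], "quantity": [3, 5]})
lookup = pd.DataFrame({"product_id": ["P1", "P1", "P2"],
                       "unit_cost": [1.0, 1.1, 2.0]})

# Without a check, the left join returns more rows than there are sales
unchecked = pd.merge(sales, lookup, how="left", on="product_id")
print(f"Rows in sales: {len(sales)}, rows after merge: {len(unchecked)}")

# validate="many_to_one" asks pandas to verify that the right-hand keys are unique
try:
    pd.merge(sales, lookup, how="left", on="product_id", validate="many_to_one")
    duplicate_keys_found = False
except MergeError as error:
    duplicate_keys_found = True
    print("Merge rejected:", error)
```

Comparing the row counts before and after the merge, as done in the cell above, catches the same problem after the fact.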

#### 6.6. Running a right-only join to find products without any transactions

In [54]:
# All columns from the transaction dataset will be set to NaN, so we retain only the columns used as key

 
merge_right_only = pd.merge(transactions_enriched[['product_id', 'geo', 'year_month']], costs_transpose, 
                            indicator=True, 
                            how='right', 
                            on=['product_id', 'geo', 'year_month'], 
                            suffixes=('_LEFT', '_RIGHT'))

merge_right_only = merge_right_only[merge_right_only["_merge"] == "right_only"]

merge_right_only
 

### xx. Row by row transformations

In some cases, we need to run some code for each row. Here is a simple example.

In [55]:
customers_copy = customers.copy()
#
# for this example, we add the country code as a prefix to the phone number if the prefix is not yet there
# there are a lot of things that can go wrong and are not covered hereunder.
#

for row_number, row_data in customers_copy.iterrows():
   
    phone_nbr = row_data["phone_nbr"]
    country = row_data["country"]
    
    
    if country == "BE":
        prefix = "0032"
    elif country == "FR":
        prefix = "0033"
    else:
        prefix = ""
        
    # update the dataframe, as required

    if len(prefix) > 0 and phone_nbr[0:4] != prefix:
        new_phone_nbr = prefix + phone_nbr
        customers_copy.loc[row_number,"phone_nbr"] = new_phone_nbr

print(customers_copy)
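For larger datasets, the same rule can often be written without an explicit loop. The following is a sketch of a column-wise alternative, assuming the phone numbers were loaded as text. The `customers_demo` dataframe and the `prefix_by_country` mapping are hypothetical stand-ins for the notebook's data.

```python
import pandas as pd

# Hypothetical customers, with phone numbers already loaded as text
customers_demo = pd.DataFrame({
    "country": ["BE", "FR", "NL", "BE"],
    "phone_nbr": ["0471123456", "0033612345678", "0201234567", "0032471000000"],
})

# Look up the prefix of each row; countries missing from the mapping get ""
prefix_by_country = {"BE": "0032", "FR": "0033"}
prefix = customers_demo["country"].map(prefix_by_country).fillna("")

# True where a prefix exists and the number does not start with it yet
current_start = customers_demo["phone_nbr"].str[:4]
needs_prefix = (prefix != "") & (current_start != prefix)

# Update only the selected rows, exactly as the loop does
customers_demo.loc[needs_prefix, "phone_nbr"] = (
    prefix[needs_prefix] + customers_demo.loc[needs_prefix, "phone_nbr"]
)

print(customers_demo)
```

The loop remains easier to read and to extend with special cases; the column-wise version is mainly worth it when the number of rows makes `iterrows()` slow.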


### Write data (outputdata)
https://pandas.pydata.org/docs/user_guide/io.html

#### Write to csv

In [None]:
transactions.to_csv('data_output/transactions_output.csv', sep=',', header=True, index=False, encoding='utf-8')

#### Write to xlsx

In [None]:
merge_inner.to_excel('data_output/merge.xlsx', sheet_name='merge_inner', index=False)
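A quick way to check an export is to read the file back. The sketch below writes a hypothetical dataframe to a temporary folder (so it does not depend on the `data_output` directory) and reloads it as text so that leading zeroes survive the round trip.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data; the phone numbers are stored as text
to_save = pd.DataFrame({"customer_id": ["C1", "C2"],
                        "phone_nbr": ["0471123456", "0612345678"]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "customers_output.csv")

    # index=False keeps the row index out of the file
    to_save.to_csv(path, sep=",", header=True, index=False, encoding="utf-8")

    # read the file back as text so that the leading zeroes are preserved
    reloaded = pd.read_csv(path, sep=",", dtype=str)

print(reloaded)
```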