## Data Cleaning with pyjanitor
https://github.com/pyjanitor-devs/pyjanitor
- Cleaning column names (multi-indexes are possible!)
- Removing empty rows and columns
- Identifying duplicate entries
- Encoding columns as categorical
- Splitting your data into features and targets (for machine learning)
- Adding, removing, and renaming columns
- Coalescing multiple columns into a single column
- Date conversions (from MATLAB, Excel, Unix) to Python datetime format
- Expand a single column that has delimited, categorical values into dummy-encoded variables
- Concatenating and deconcatenating columns, based on a delimiter
- Syntactic sugar for filtering the dataframe based on queries on a column
- Experimental submodules for finance, biology, chemistry, engineering, and pyspark

- https://pyjanitor-devs.github.io/pyjanitor/api/functions/
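Two of the bullets above ("coalesce" and "expand a delimited column into dummy-encoded variables") are easier to picture with data. The following is a minimal pandas-only sketch of what those operations amount to, not pyjanitor's own implementation; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two partially filled phone columns and a delimited tag column.
df = pd.DataFrame(
    {
        "phone_home": ["111", np.nan, np.nan],
        "phone_work": [np.nan, "222", np.nan],
        "tags": ["a|b", "b", "a|c"],
    }
)

# Coalesce: keep phone_home where present, otherwise fall back to phone_work.
df["phone"] = df["phone_home"].combine_first(df["phone_work"])

# Expand: one 0/1 indicator column per distinct value in the delimited column.
dummies = df["tags"].str.get_dummies(sep="|")
expanded = pd.concat([df, dummies], axis=1)
print(expanded)
```

pyjanitor wraps operations like these as chainable DataFrame methods; see the API reference linked above for the exact names and parameters in the installed version.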

In [2]:
#!pip install pyjanitor

In [3]:
# libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [16]:
from pydataset import data
mtcars = data('mtcars')

In [18]:
mtcars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [19]:
import janitor

In [22]:
# Sample Data curated for this example
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}
company_sales

{'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
 'Company1': [150.0, 200.0, 300.0, 400.0],
 'Company2': [180.0, 250.0, nan, 500.0],
 'Company3': [400.0, 500.0, 600.0, 675.0]}

In [26]:
# The Pandas Way

# 1. Create a pandas DataFrame from the company_sales dictionary
df = pd.DataFrame.from_dict(company_sales)
df

Unnamed: 0,SalesMonth,Company1,Company2,Company3
0,Jan,150.0,180.0,400.0
1,Feb,200.0,250.0,500.0
2,Mar,300.0,,600.0
3,April,400.0,500.0,675.0


In [27]:
# 2. Delete a column from the DataFrame. Say 'Company1'
del df['Company1']
df

Unnamed: 0,SalesMonth,Company2,Company3
0,Jan,180.0,400.0
1,Feb,250.0,500.0
2,Mar,,600.0
3,April,500.0,675.0


In [28]:
# 3. Drop rows that have a missing value in 'Company2' or 'Company3'
df = df.dropna(subset=['Company2', 'Company3'])
df

Unnamed: 0,SalesMonth,Company2,Company3
0,Jan,180.0,400.0
1,Feb,250.0,500.0
3,April,500.0,675.0
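By default `dropna` removes a row when any of the `subset` columns is missing. A small sketch with made-up values (not the sales data above) contrasts that with `how="all"`, which only removes rows where every listed column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Feb is missing one value, Mar is missing both.
demo = pd.DataFrame(
    {
        "SalesMonth": ["Jan", "Feb", "Mar"],
        "Company2": [180.0, np.nan, np.nan],
        "Company3": [400.0, 500.0, np.nan],
    }
)

# how="any" (the default): drop a row if either column is missing.
any_missing = demo.dropna(subset=["Company2", "Company3"])

# how="all": drop a row only if both columns are missing.
all_missing = demo.dropna(subset=["Company2", "Company3"], how="all")

print(any_missing)
print(all_missing)
```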


In [29]:
# 4. Rename 'Company2' to 'Amazon' and 'Company3' to 'Facebook'
df = df.rename(
    {
        'Company2': 'Amazon',
        'Company3': 'Facebook',
    },
    axis=1,
)
df

Unnamed: 0,SalesMonth,Amazon,Facebook
0,Jan,180.0,400.0
1,Feb,250.0,500.0
3,April,500.0,675.0


In [30]:
# 5. Let's add some data for another company. Say 'Google'
df['Google'] = [450.0, 550.0, 800.0]
df

Unnamed: 0,SalesMonth,Amazon,Facebook,Google
0,Jan,180.0,400.0,450.0
1,Feb,250.0,500.0,550.0
3,April,500.0,675.0,800.0
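Note that the index is now 0, 1, 3 because row 2 was dropped. Assigning a plain list works because lists are matched by position. A `pd.Series` would instead be aligned on index labels, which is a common source of surprise NaNs. A minimal sketch, using a stand-in frame with the same index:

```python
import pandas as pd

# Stand-in for the frame above: same index gap (0, 1, 3) after dropping row 2.
demo = pd.DataFrame({"SalesMonth": ["Jan", "Feb", "April"]}, index=[0, 1, 3])

# A list is assigned by position, so the gap in the index does not matter.
demo["from_list"] = [450.0, 550.0, 800.0]

# A Series carries its own index (0, 1, 2) and is aligned on labels:
# label 3 finds no match and becomes NaN, label 2 is silently discarded.
demo["from_series"] = pd.Series([450.0, 550.0, 800.0])

print(demo)
```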


In [31]:
# The same five steps written as a single pandas method chain
df = (
    pd.DataFrame(company_sales)
    .drop(columns="Company1")
    .dropna(subset=["Company2", "Company3"])
    .rename(columns={"Company2": "Amazon", "Company3": "Facebook"})
    .assign(Google=[450.0, 550.0, 800.0])
)
df

Unnamed: 0,SalesMonth,Amazon,Facebook,Google
0,Jan,180.0,400.0,450.0
1,Feb,250.0,500.0,550.0
3,April,500.0,675.0,800.0


In [33]:
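# The pyjanitor way: the same steps with verb-style methods
# (these become available on DataFrames once janitor is imported)
df = (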
    pd.DataFrame.from_dict(company_sales)
    .remove_columns(["Company1"])
    .dropna(subset=["Company2", "Company3"])
    .rename_column("Company2", "Amazon")
    .rename_column("Company3", "Facebook")
    .add_column("Google", [450.0, 550.0, 800.0])
)
df

Unnamed: 0,SalesMonth,Amazon,Facebook,Google
0,Jan,180.0,400.0,450.0
1,Feb,250.0,500.0,550.0
3,April,500.0,675.0,800.0
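If adding a dependency is not an option, a similar verb-per-step style can be approximated in plain pandas with `.pipe`. This is only a sketch: `rename_column` below is a hypothetical local helper defined for this example, not the pyjanitor method of the same name.

```python
import numpy as np
import pandas as pd

company_sales = {
    "SalesMonth": ["Jan", "Feb", "Mar", "April"],
    "Company1": [150.0, 200.0, 300.0, 400.0],
    "Company2": [180.0, 250.0, np.nan, 500.0],
    "Company3": [400.0, 500.0, 600.0, 675.0],
}


def rename_column(frame, old, new):
    """Hypothetical helper: rename one column and return a new DataFrame."""
    return frame.rename(columns={old: new})


piped = (
    pd.DataFrame(company_sales)
    .drop(columns=["Company1"])
    .dropna(subset=["Company2", "Company3"])
    .pipe(rename_column, "Company2", "Amazon")
    .pipe(rename_column, "Company3", "Facebook")
    .assign(Google=[450.0, 550.0, 800.0])
)
print(piped)
```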


In [40]:
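# Note the execution count: this cell was run after the clean_names() cells below,
# which is why the output shows the lowercase names and still includes company1
df.dtypes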

salesmonth     object
company1      float64
company2      float64
company3      float64
dtype: object

In [35]:
# This cleans the column names and removes any completely empty rows and columns
# (the sample data has none, so only the column names change)
df = pd.DataFrame.from_dict(company_sales).clean_names().remove_empty()
df

Unnamed: 0,salesmonth,company1,company2,company3
0,Jan,150.0,180.0,400.0
1,Feb,200.0,250.0,500.0
2,Mar,300.0,,600.0
3,April,400.0,500.0,675.0
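Because the sample data has no fully empty rows or columns, the effect of `remove_empty` is not visible in the output above. The following pandas-only sketch, on made-up data, approximates what the two steps target; it is not pyjanitor's implementation, and `clean_names` handles more cases than the simple lowercase-and-underscore rule used here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with messy names, one all-NaN row and one all-NaN column.
messy = pd.DataFrame(
    {
        "Sales Month": ["Jan", np.nan, "Mar"],
        "Company 1": [150.0, np.nan, 300.0],
        "Empty Col": [np.nan, np.nan, np.nan],
    }
)

# Rough stand-in for clean_names: strip, lowercase, spaces to underscores.
messy.columns = (
    messy.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

# Rough stand-in for remove_empty: drop rows, then columns, that are entirely NaN.
cleaned = messy.dropna(axis=0, how="all").dropna(axis=1, how="all")
print(cleaned)
```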


In [37]:
from janitor import clean_names, remove_empty

df = pd.DataFrame.from_dict(company_sales)
df = clean_names(df)
df = remove_empty(df)
df

Unnamed: 0,salesmonth,company1,company2,company3
0,Jan,150.0,180.0,400.0
1,Feb,200.0,250.0,500.0
2,Mar,300.0,,600.0
3,April,400.0,500.0,675.0


In [38]:
help(janitor)

Help on package janitor:

NAME
    janitor - Top-level janitor API lives here.

DESCRIPTION
    Lazy loading used here to speed up imports.

PACKAGE CONTENTS
    accessors (package)
    biology
    chemistry
    engineering
    errors
    finance
    functions (package)
    io
    math
    ml
    spark (package)
    testing_utils (package)
    timeseries
    utils
    xarray (package)

SUBMODULES
    convert_date
    data_description
    fill
    filter
    jxr
    pivot
    select
    truncate_datetime

DATA
    Iterable = typing.Iterable
        A generic version of collections.abc.Iterable.
    
    TYPE_CHECKING = False
    Tuple = typing.Tuple
        Tuple type; Tuple[X, Y] is the cross-product type of X and Y.
        
        Example: Tuple[T1, T2] is a tuple of two elements corresponding
        to type variables T1 and T2.  Tuple[int, float, str] is a tuple
        of an int, a float and a string.
        
        To specify a variable-length tuple of homogeneous type, use Tu

In [20]:
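# The PyPI package is named feature-engine, but the importable module uses an underscore
# (a hyphen is not valid in a Python identifier, hence the earlier SyntaxError)
import feature_engine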

In [21]:
import cleanlab

ModuleNotFoundError: No module named 'cleanlab'