# Demo of the *skimpy* Python package

This notebook is a quick demo of how to use skimpy in practice. First, let's make sure it's installed in this Google Colab notebook.

In [1]:
!pip install skimpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting skimpy
  Downloading skimpy-0.0.8-py3-none-any.whl (14 kB)
Collecting ipykernel<7.0.0,>=6.7.0
  Downloading ipykernel-6.21.3-py3-none-any.whl (149 kB)
Collecting rich<13.0,>=10.9
  Downloading rich-12.6.0-py3-none-any.whl (237 kB)
Collecting jupyter<2.0.0,>=1.0.0
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting Pygments<3.0.0,>=2.10.0
  Downloading Pygments-2.14.0-py3-none-any.whl (1.1 MB)
Collecting typeguard<3.0.0,>=2.12.1
  Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)
Collecting m

If this is the first time you've run this notebook, you may need to restart the runtime before continuing. Click 'Runtime' in the menu at the top of the page, then 'Restart runtime'.

Now we can import and use the package. Let's grab the example data, which is included in the package, and the function that is going to summarise the data, *skim*.

Here are the imports and the dataframe:

In [2]:
from skimpy import skim, generate_test_data

df = generate_test_data()

df.head()

|   | length | width | depth | rnd | class | location | booly_col | text | date | date_no_freq |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.762796 | 1.468082 | 9 | -0.423534 | virtginica | UK | False | What weather! | 2018-01-31 | NaT |
| 1 | 0.031203 | 0.267769 | 10 | 2.10289 | virtginica | UK | False | How are you? | 2018-02-28 | 1992-01-05 |
| 2 | 0.044075 | 3.571043 | 12 | 0.147606 | setosa | UK | True | How are you? | 2018-03-31 | 2022-01-01 |
| 3 | 0.914088 | 2.838664 | 15 | -0.997567 | virtginica |  | True |  | 2018-04-30 | NaT |
| 4 | 0.555878 | 2.214629 | 5 | 0.329828 | setosa | UK | False | How are you? | 2018-05-31 | 2022-01-01 |


It's also worth noting that this data has datatypes set in advance, and you'll get more informative skims from dataframes that have the datatypes set first. Here are the datatypes in this dataframe:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   length        1000 non-null   float64       
 1   width         1000 non-null   float64       
 2   depth         1000 non-null   int64         
 3   rnd           882 non-null    float64       
 4   class         1000 non-null   category      
 5   location      999 non-null    category      
 6   booly_col     1000 non-null   bool          
 7   text          994 non-null    string        
 8   date          1000 non-null   datetime64[ns]
 9   date_no_freq  997 non-null    datetime64[ns]
dtypes: bool(1), category(2), datetime64[ns](2), float64(3), int64(1), string(1)
memory usage: 58.1 KB
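
If your own data arrives with every column as a generic object or string, one way to set the datatypes before skimming is plain pandas. This is a minimal sketch: the `raw` dataframe and its values are made up for illustration, and only standard pandas calls (`DataFrame.astype`, `pd.to_datetime`) are used.

```python
import pandas as pd

# Hypothetical raw data: every column starts out as plain Python strings.
raw = pd.DataFrame(
    {
        "depth": ["9", "10", "12"],
        "class": ["setosa", "setosa", "versicolor"],
        "text": ["What weather!", "How are you?", None],
        "date": ["2018-01-31", "2018-02-28", "2018-03-31"],
    }
)

# Cast each column to the dtype that best describes it.
typed = raw.astype({"depth": "int64", "class": "category", "text": "string"})

# Dates are parsed separately so that they become a datetime column.
typed["date"] = pd.to_datetime(typed["date"])

print(typed.dtypes)
```

A dataframe prepared like this can then be passed to *skim* in the same way as the example data.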


## Running skimpy

Okay, we're ready to run *skim* on our dataframe!


In [4]:
skim(df)

## Options

There are some limited options for customisation.

You can change the header styles of the first three tables (you can find more info on styles in the documentation of the [**rich** package](https://rich.readthedocs.io/en/stable/index.html), which **skimpy** builds on):

In [5]:
skim(df, header_style="italic green")
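
Style strings such as `"italic green"` follow the **rich** style syntax, so you can check one with **rich** itself before handing it to *skim*. The sketch below assumes that **skimpy** passes `header_style` through to **rich** unchanged, which this notebook does not confirm; `is_valid_style` is a hypothetical helper, not part of either package.

```python
from rich.errors import StyleSyntaxError
from rich.style import Style


def is_valid_style(definition: str) -> bool:
    """Hypothetical helper: return True if rich can parse the style string."""
    try:
        Style.parse(definition)
    except StyleSyntaxError:
        return False
    return True


# Inspect what rich makes of the style used in the cell above.
style = Style.parse("italic green")
print(style.italic, style.color.name)

# Try a few candidates before using them as a header style.
for candidate in ["italic green", "bold red on white", "definitely-not-a-colour"]:
    print(candidate, "->", is_valid_style(candidate))
```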

## Cleaning Column Names

**skimpy** also comes with a function to clean up column names. Here's an example of some messy column names:

In [6]:
import pandas as pd
from rich import print
from skimpy import clean_columns

columns = [
    "bs lncs;n edbn ",
    "Nín hǎo. Wǒ shì zhōng guó rén",
    "___This is a test___",
    "ÜBER Über German Umlaut",
]
messy_df = pd.DataFrame(columns=columns, index=[0], data=[range(len(columns))])
print("Column names:")
print(list(messy_df.columns))

Now we'll clean them up:

In [7]:
clean_df = clean_columns(messy_df)
print(list(clean_df.columns))
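
To give a feel for what this kind of cleaning involves, here is a rough standard-library sketch. It is not **skimpy**'s implementation and its results may differ from what `clean_columns` returns; `to_snake_case` is a hypothetical helper written only for this illustration.

```python
import re
import unicodedata


def to_snake_case(name: str) -> str:
    """Hypothetical helper: rough snake_case cleaning, not skimpy's algorithm."""
    # Split accented characters into base letter + combining mark,
    # then drop everything that is not ASCII (e.g. "Ü" becomes "U").
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace each run of non-alphanumeric characters with one underscore.
    collapsed = re.sub(r"[^0-9A-Za-z]+", "_", ascii_only)
    # Trim leading and trailing underscores and lower-case the result.
    return collapsed.strip("_").lower()


messy_names = [
    "bs lncs;n edbn ",
    "Nín hǎo. Wǒ shì zhōng guó rén",
    "___This is a test___",
    "ÜBER Über German Umlaut",
]

for messy in messy_names:
    print(repr(messy), "->", to_snake_case(messy))
```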

In [8]:
# Done