# Skimpy - A simple way to summarize your dataset

skimpy is a lightweight tool that prints summary statistics about the variables in a data frame to the console. Think of it as a super version of `df.describe()`.

```bash
pip install skimpy
```
https://pypi.org/project/skimpy/


### Table of Contents <a class="anchor" id="skimpy_toc"></a>

* [Table of Contents](#skimpy_toc)
    * [builtin seaborn datasets](#skimpy_datasets)
    * [about skim](#skimpy_skim)
    * [load external diabetes](#skimpy_diabetes)
    * [generate test data](#skimpy_generate_test_data)
    * [skimpy cli](#skimpy_cli)
    * [anagrams](#skimpy_anagrams)
    * [anscombe](#skimpy_anscombe)
    * [attention](#skimpy_attention)
    * [brain_networks](#skimpy_brain_networks)
    * [car_crashes](#skimpy_car_crashes)
    * [diamonds](#skimpy_diamonds)
    * [dots](#skimpy_dots)
    * [exercise](#skimpy_exercise)
    * [flights](#skimpy_flights)
    * [fmri](#skimpy_fmri)
    * [gammas](#skimpy_gammas)
    * [geyser](#skimpy_geyser)
    * [iris](#skimpy_iris)
    * [mpg](#skimpy_mpg)
    * [penguins](#skimpy_penguins)
    * [planets](#skimpy_planets)
    * [taxis](#skimpy_taxis)
    * [tips](#skimpy_tips)
    * [titanic](#skimpy_titanic)

In [492]:
# import the libraries needed to load the data and compute the summary statistics
import pandas as pd
from skimpy import skim, generate_test_data
import seaborn as sns

# list builtin seaborn datasets <a class="anchor" id="skimpy_datasets"></a>
[Back to Top](#skimpy_toc)

In [489]:
dataset_names = sns.get_dataset_names()

In [490]:
dataset_names

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'geyser',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'taxis',
 'tips',
 'titanic']

# about skim <a class="anchor" id="skimpy_skim"></a>
[Back to Top](#skimpy_toc)

In [491]:
skim?

Signature:
skim(
    df: pandas.core.frame.DataFrame,
    header_style: str = 'bold cyan',
    **colour_kwargs: str,
) -> None
Docstring:
Skim a data frame and return statistics.

skim is an alternative to pandas.DataFrame.summary(), quickly providing
an overview of a data frame. It produces a different set of summary
functions based on the types of columns in the dataframe. You may get
better results from ensuring that you set the datatypes in your dataframe
you want before running skim.
The colour_kwargs (str) are defined in dataframe_to_rich_table.

Args:
    df (pd.DataFrame): Dataframe to skim
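
The docstring's advice about setting datatypes before skimming can be illustrated with plain pandas. This is a sketch under assumptions: the frame `raw` and its column names are made up, and every column starts out as text, as can happen after a careless CSV import.

```python
import pandas as pd

# Hypothetical frame in which every column was read in as text.
raw = pd.DataFrame(
    {
        "group": ["a", "b", "a", "c"],
        "visits": ["3", "5", "2", "8"],
        "joined": ["2021-01-04", "2021-02-11", "2021-03-09", "2021-04-20"],
        "active": ["True", "False", "True", "True"],
    }
)

# Give each column the dtype that matches its meaning.
typed = raw.assign(
    group=raw["group"].astype("category"),
    visits=raw["visits"].astype("int64"),
    joined=pd.to_datetime(raw["joined"]),
    active=raw["active"].eq("True"),
)

print(typed.dtypes)
```

Passing `typed` rather than `raw` to `skim` should then let it pick summary functions per column type, as the docstring describes.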
  

# diabetes <a class="anchor" id="skimpy_diabetes"></a>
[Back to Top](#skimpy_toc)

In [493]:
# download the dataset from Kaggle (https://www.kaggle.com/saurabh00007/diabetescsv) and place it in the Data folder
# load it using a relative path
diabetes = pd.read_csv("../Data/Diabetes.csv")
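
Because the file is loaded through a relative path, the call depends on the notebook's working directory. A small sketch of a guard that reports the resolved path when the file is absent; `read_local_csv` is a hypothetical helper, and the temporary file below only stands in for the real dataset.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd


def read_local_csv(path: str) -> pd.DataFrame:
    """Read a CSV, failing early with the resolved path if the file is absent."""
    csv_path = Path(path).expanduser().resolve()
    if not csv_path.is_file():
        raise FileNotFoundError(f"expected a CSV file at {csv_path}")
    return pd.read_csv(csv_path)


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w") as handle:
        handle.write("a,b\n1,2\n3,4\n")
    demo = read_local_csv(sample_path)

print(demo.shape)
```

On a case-sensitive file system the file name must match exactly, including capitalisation.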

In [494]:
diabetes.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [495]:
skim(diabetes)

# generate test data <a class="anchor" id="skimpy_generate_test_data"></a>
[Back to Top](#skimpy_toc)

In [496]:
# generate test data with skimpy's built-in helper
test_data = generate_test_data()

In [497]:
generate_test_data?

Signature: generate_test_data() -> pandas.core.frame.DataFrame
Docstring:
Generate dataframe with several different datatypes.

For testing skimpy, it's convenient to have a dataset with many different
data types. This function creates that dataframe.

Returns:
    pd.DataFrame: dataframe with columns spanning several data types.
File:      /usr/local/lib/python3.9/site-packages/skimpy/__init__.py
Type:      function
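
To make the idea of a frame "spanning several data types" concrete, here is a hand-built stand-in using only pandas. It is not how `generate_test_data` is implemented; the values are invented and only some column names are borrowed from the output below. The sketch also counts columns per dtype and missing values per column, two things a skim report summarizes.

```python
import pandas as pd

# Hypothetical miniature of a mixed-type frame: one column per dtype.
mixed = pd.DataFrame(
    {
        "length": [0.5, 1.2, float("nan"), 3.4],
        "depth": [2, 8, 10, 12],
        "class": pd.Categorical(["virginica", "setosa", "setosa", "versicolor"]),
        "booly_col": [True, False, True, True],
        "text": pd.array(["How are you?", None, "Indeed", "ok"], dtype="string"),
        "date": pd.date_range("2018-01-31", periods=4, freq="D"),
    }
)

# Number of columns for each dtype.
dtype_counts = mixed.dtypes.astype(str).value_counts()

# Number of missing values in each column.
missing = mixed.isna().sum()

print(dtype_counts)
print(missing[missing > 0])
```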


In [498]:
test_data.describe()

| | length | width | depth | rnd |
|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 882.0 |
| mean | 0.501619 | 2.036549 | 10.024 | -0.019771 |
| std | 0.359707 | 1.92864 | 3.208382 | 1.001654 |
| min | 2e-06 | 0.002057 | 2.0 | -2.808934 |
| 25% | 0.134014 | 0.602987 | 8.0 | -0.735467 |
| 50% | 0.49757 | 1.467916 | 10.0 | -0.000774 |
| 75% | 0.860224 | 2.952881 | 12.0 | 0.663878 |
| max | 0.999999 | 13.908001 | 20.0 | 3.716621 |


In [562]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   length        1000 non-null   float64       
 1   width         1000 non-null   float64       
 2   depth         1000 non-null   int64         
 3   rnd           882 non-null    float64       
 4   class         1000 non-null   category      
 5   location      999 non-null    category      
 6   booly_col     1000 non-null   bool          
 7   text          994 non-null    string        
 8   date          1000 non-null   datetime64[ns]
 9   date_no_freq  997 non-null    datetime64[ns]
dtypes: bool(1), category(2), datetime64[ns](2), float64(3), int64(1), string(1)
memory usage: 58.1 KB


In [499]:
skim(test_data)

# command line (CLI) skimpy <a class="anchor" id="skimpy_cli"></a>
[Back to Top](#skimpy_toc)

In [500]:
# skimpy can also be run from the command line (CLI) on a CSV file
!skimpy ../Data/diabetes.csv

╭─────────────────────────────── skimpy summary ───────────────────────────────╮
│          Data Summary                Data Types                              │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                       │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                       │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                       │
│ │ Number of rows    │ 768    │ │ int64       │ 7     │                       │
│ │ Number of columns │ 9      │ │ float64     │ 2     │                       │
│ └───────────────────┴────────┘ └─────────────┴───────┘                       │
│                                   number                                     │
│ ┏━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━┳━━━━━━┓  │
│ ┃      ┃

# anagrams <a class="anchor" id="skimpy_anagrams"></a>
[Back to Top](#skimpy_toc)

In [501]:
anagrams = sns.load_dataset("anagrams")

In [502]:
skim(anagrams)

In [503]:
anagrams.head()

| | subidr | attnr | num1 | num2 | num3 |
|---|---|---|---|---|---|
| 0 | 1 | divided | 2 | 4.0 | 7 |
| 1 | 2 | divided | 3 | 4.0 | 5 |
| 2 | 3 | divided | 3 | 5.0 | 6 |
| 3 | 4 | divided | 5 | 7.0 | 5 |
| 4 | 5 | divided | 4 | 5.0 | 8 |


# anscombe <a class="anchor" id="skimpy_anscombe"></a>
[Back to Top](#skimpy_toc)

In [504]:
anscombe = sns.load_dataset("anscombe")

In [505]:
skim(anscombe)

In [506]:
anscombe.head()

| | dataset | x | y |
|---|---|---|---|
| 0 | I | 10.0 | 8.04 |
| 1 | I | 8.0 | 6.95 |
| 2 | I | 13.0 | 7.58 |
| 3 | I | 9.0 | 8.81 |
| 4 | I | 11.0 | 8.33 |


# attention <a class="anchor" id="skimpy_attention"></a>
[Back to Top](#skimpy_toc)

In [507]:
attention = sns.load_dataset("attention")

In [508]:
attention.describe()

| | Unnamed: 0 | subject | solutions | score |
|---|---|---|---|---|
| count | 60.0 | 60.0 | 60.0 | 60.0 |
| mean | 29.5 | 10.5 | 2.0 | 5.958333 |
| std | 17.464249 | 5.814943 | 0.823387 | 1.621601 |
| min | 0.0 | 1.0 | 1.0 | 2.0 |
| 25% | 14.75 | 5.75 | 1.0 | 5.0 |
| 50% | 29.5 | 10.5 | 2.0 | 6.0 |
| 75% | 44.25 | 15.25 | 3.0 | 7.0 |
| max | 59.0 | 20.0 | 3.0 | 9.0 |


In [509]:
skim(attention)

# brain_networks <a class="anchor" id="skimpy_brain_networks"></a>
[Back to Top](#skimpy_toc)

In [510]:
brain_networks = sns.load_dataset("brain_networks")

In [511]:
brain_networks.describe()

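| | network | 1 | 1.1 | 2 | 2.1 | 3 | 3.1 | 4 | 4.1 | 5 | ... | 16.5 | 16.6 | 16.7 | 17 | 17.1 | 17.2 | 17.3 | 17.4 | 17.5 | 17.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 | ... | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 |
| unique | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 | ... | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 | 922 |
| top | node | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 3 | 4 | 4 | 1 | 1 | 2 | 2 | 3 | 3 | 4 |
| freq | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |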


In [512]:
# skim raised an error on this frame, so the call is left commented out
# skim(brain_networks)
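
The `describe()` output above reports text-style statistics (`unique`, `top`, `freq`) with `node` as a value in the first column, which suggests that extra header rows were read as data. One plausible remedy, not confirmed by this notebook, is to read the file with three header rows, as seaborn's own gallery example does with `sns.load_dataset("brain_networks", header=[0, 1, 2], index_col=0)`. The sketch below reproduces the idea offline on a tiny made-up CSV with the same assumed layout, then flattens the column levels into plain strings.

```python
import io

import pandas as pd

# Made-up miniature with the assumed brain_networks layout:
# three header rows, one blank index-name row, then numeric data.
csv_text = """network,1,1,2
node,1,1,1
hemi,lh,rh,lh
,,,
0,56.05,92.03,3.39
1,55.54,43.69,-65.49
"""

# Default parsing treats the 2nd and 3rd header rows as data, so columns become text.
naive = pd.read_csv(io.StringIO(csv_text))

# Declaring all three header rows yields numeric columns with a 3-level header.
nested = pd.read_csv(io.StringIO(csv_text), header=[0, 1, 2], index_col=0)

# Flatten the column levels into single strings such as "1_1_lh".
flat = nested.copy()
flat.columns = ["_".join(map(str, levels)) for levels in nested.columns]

print(naive.dtypes)
print(flat.dtypes)
```

Whether `skim(flat)` then succeeds on the full dataset is untested here and may depend on the skimpy version.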

# car_crashes <a class="anchor" id="skimpy_car_crashes"></a>
[Back to Top](#skimpy_toc)

In [513]:
car_crashes = sns.load_dataset("car_crashes")

In [514]:
car_crashes.describe()

| | total | speeding | alcohol | not_distracted | no_previous | ins_premium | ins_losses |
|---|---|---|---|---|---|---|---|
| count | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 |
| mean | 15.790196 | 4.998196 | 4.886784 | 13.573176 | 14.004882 | 886.957647 | 134.493137 |
| std | 4.122002 | 2.017747 | 1.729133 | 4.508977 | 3.764672 | 178.296285 | 24.835922 |
| min | 5.9 | 1.792 | 1.593 | 1.76 | 5.9 | 641.96 | 82.75 |
| 25% | 12.75 | 3.7665 | 3.894 | 10.478 | 11.348 | 768.43 | 114.645 |
| 50% | 15.6 | 4.608 | 4.554 | 13.857 | 13.775 | 858.97 | 136.05 |
| 75% | 18.5 | 6.439 | 5.604 | 16.14 | 16.755 | 1007.945 | 151.87 |
| max | 23.9 | 9.45 | 10.038 | 23.661 | 21.28 | 1301.52 | 194.78 |


In [515]:
skim(car_crashes)

# diamonds <a class="anchor" id="skimpy_diamonds"></a>
[Back to Top](#skimpy_toc)

In [516]:
diamonds = sns.load_dataset("diamonds")

In [517]:
diamonds.describe()

| | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |


In [518]:
skim(diamonds)

# dots <a class="anchor" id="skimpy_dots"></a>
[Back to Top](#skimpy_toc)

In [519]:
dots = sns.load_dataset("dots")

In [520]:
dots.describe()

| | time | coherence | firing_rate |
|---|---|---|---|
| count | 848.0 | 848.0 | 848.0 |
| mean | 74.150943 | 12.898113 | 39.616662 |
| std | 284.596669 | 15.453506 | 12.232967 |
| min | -600.0 | 0.0 | 6.27572 |
| 25% | -100.0 | 3.2 | 32.620191 |
| 50% | 80.0 | 6.4 | 38.022005 |
| 75% | 260.0 | 12.8 | 47.383649 |
| max | 720.0 | 51.2 | 70.0489 |


In [521]:
skim(dots)

# exercise <a class="anchor" id="skimpy_exercise"></a>
[Back to Top](#skimpy_toc)

In [522]:
exercise = sns.load_dataset("exercise")

In [523]:
exercise.describe()

| | Unnamed: 0 | id | pulse |
|---|---|---|---|
| count | 90.0 | 90.0 | 90.0 |
| mean | 44.5 | 15.5 | 99.7 |
| std | 26.124701 | 8.703932 | 14.858471 |
| min | 0.0 | 1.0 | 80.0 |
| 25% | 22.25 | 8.0 | 90.25 |
| 50% | 44.5 | 15.5 | 96.0 |
| 75% | 66.75 | 23.0 | 103.0 |
| max | 89.0 | 30.0 | 150.0 |


In [524]:
skim(exercise)

# flights <a class="anchor" id="skimpy_flights"></a>
[Back to Top](#skimpy_toc)

In [525]:
flights = sns.load_dataset("flights")

In [526]:
flights.describe()

| | year | passengers |
|---|---|---|
| count | 144.0 | 144.0 |
| mean | 1954.5 | 280.298611 |
| std | 3.464102 | 119.966317 |
| min | 1949.0 | 104.0 |
| 25% | 1951.75 | 180.0 |
| 50% | 1954.5 | 265.5 |
| 75% | 1957.25 | 360.5 |
| max | 1960.0 | 622.0 |


In [527]:
skim(flights)

In [528]:
flights.head()

| | year | month | passengers |
|---|---|---|---|
| 0 | 1949 | Jan | 112 |
| 1 | 1949 | Feb | 118 |
| 2 | 1949 | Mar | 132 |
| 3 | 1949 | Apr | 129 |
| 4 | 1949 | May | 121 |


In [529]:
flights.tail()

| | year | month | passengers |
|---|---|---|---|
| 139 | 1960 | Aug | 606 |
| 140 | 1960 | Sep | 508 |
| 141 | 1960 | Oct | 461 |
| 142 | 1960 | Nov | 390 |
| 143 | 1960 | Dec | 432 |


# fmri <a class="anchor" id="skimpy_fmri"></a>
[Back to Top](#skimpy_toc)

In [530]:
fmri = sns.load_dataset("fmri")

In [531]:
fmri.describe()

| | timepoint | signal |
|---|---|---|
| count | 1064.0 | 1064.0 |
| mean | 9.0 | 0.00354 |
| std | 5.479801 | 0.09393 |
| min | 0.0 | -0.255486 |
| 25% | 4.0 | -0.04607 |
| 50% | 9.0 | -0.013653 |
| 75% | 14.0 | 0.024293 |
| max | 18.0 | 0.564985 |


In [532]:
skim(fmri)

# gammas <a class="anchor" id="skimpy_gammas"></a>
[Back to Top](#skimpy_toc)

In [533]:
gammas = sns.load_dataset("gammas")

In [534]:
gammas.describe()

| | timepoint | subject | BOLD signal |
|---|---|---|---|
| count | 6000.0 | 6000.0 | 6000.0 |
| mean | 5.0 | 9.5 | 0.814837 |
| std | 2.916008 | 5.766762 | 1.774536 |
| min | 0.0 | 0.0 | -3.611603 |
| 25% | 2.5 | 4.75 | -0.481188 |
| 50% | 5.0 | 9.5 | 0.928425 |
| 75% | 7.5 | 14.25 | 2.169299 |
| max | 10.0 | 19.0 | 4.829915 |


In [535]:
skim(gammas)

# geyser <a class="anchor" id="skimpy_geyser"></a>
[Back to Top](#skimpy_toc)

In [536]:
geyser = sns.load_dataset("geyser")

In [537]:
geyser.describe()

| | duration | waiting |
|---|---|---|
| count | 272.0 | 272.0 |
| mean | 3.487783 | 70.897059 |
| std | 1.141371 | 13.594974 |
| min | 1.6 | 43.0 |
| 25% | 2.16275 | 58.0 |
| 50% | 4.0 | 76.0 |
| 75% | 4.45425 | 82.0 |
| max | 5.1 | 96.0 |


In [538]:
skim(geyser)

# iris <a class="anchor" id="skimpy_iris"></a>
[Back to Top](#skimpy_toc)

In [539]:
iris = sns.load_dataset("iris")

In [540]:
iris.describe()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [541]:
skim(iris)

# mpg <a class="anchor" id="skimpy_mpg"></a>
[Back to Top](#skimpy_toc)

In [542]:
mpg = sns.load_dataset("mpg")

In [543]:
mpg.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year |
|---|---|---|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 392.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 5.454774 | 193.425879 | 104.469388 | 2970.424623 | 15.56809 | 76.01005 |
| std | 7.815984 | 1.701004 | 104.269838 | 38.49116 | 846.841774 | 2.757689 | 3.697627 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 |
| 25% | 17.5 | 4.0 | 104.25 | 75.0 | 2223.75 | 13.825 | 73.0 |
| 50% | 23.0 | 4.0 | 148.5 | 93.5 | 2803.5 | 15.5 | 76.0 |
| 75% | 29.0 | 8.0 | 262.0 | 126.0 | 3608.0 | 17.175 | 79.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 |


In [544]:
skim(mpg)

# penguins <a class="anchor" id="skimpy_penguins"></a>
[Back to Top](#skimpy_toc)

In [545]:
penguins = sns.load_dataset("penguins")

In [546]:
penguins.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |


In [547]:
skim(penguins)

# planets <a class="anchor" id="skimpy_planets"></a>
[Back to Top](#skimpy_toc)

In [548]:
planets = sns.load_dataset("planets")

In [549]:
planets.describe()

| | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|
| count | 1035.0 | 992.0 | 513.0 | 808.0 | 1035.0 |
| mean | 1.785507 | 2002.917596 | 2.638161 | 264.069282 | 2009.070531 |
| std | 1.240976 | 26014.728304 | 3.818617 | 733.116493 | 3.972567 |
| min | 1.0 | 0.090706 | 0.0036 | 1.35 | 1989.0 |
| 25% | 1.0 | 5.44254 | 0.229 | 32.56 | 2007.0 |
| 50% | 1.0 | 39.9795 | 1.26 | 55.25 | 2010.0 |
| 75% | 2.0 | 526.005 | 3.04 | 178.5 | 2012.0 |
| max | 7.0 | 730000.0 | 25.0 | 8500.0 | 2014.0 |


In [550]:
skim(planets)

# taxis <a class="anchor" id="skimpy_taxis"></a>
[Back to Top](#skimpy_toc)

In [551]:
taxis = sns.load_dataset("taxis")

In [552]:
taxis.describe()

| | passengers | distance | fare | tip | tolls | total |
|---|---|---|---|---|---|---|
| count | 6433.0 | 6433.0 | 6433.0 | 6433.0 | 6433.0 | 6433.0 |
| mean | 1.539251 | 3.024617 | 13.091073 | 1.97922 | 0.325273 | 18.517794 |
| std | 1.203768 | 3.827867 | 11.551804 | 2.44856 | 1.415267 | 13.81557 |
| min | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.3 |
| 25% | 1.0 | 0.98 | 6.5 | 0.0 | 0.0 | 10.8 |
| 50% | 1.0 | 1.64 | 9.5 | 1.7 | 0.0 | 14.16 |
| 75% | 2.0 | 3.21 | 15.0 | 2.8 | 0.0 | 20.3 |
| max | 6.0 | 36.7 | 150.0 | 33.2 | 24.02 | 174.82 |


In [553]:
skim(taxis)

# tips <a class="anchor" id="skimpy_tips"></a>
[Back to Top](#skimpy_toc)

In [554]:
tips = sns.load_dataset("tips")

In [555]:
tips.describe()

| | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |


In [556]:
skim(tips)

# titanic <a class="anchor" id="skimpy_titanic"></a>
[Back to Top](#skimpy_toc)

In [557]:
titanic = sns.load_dataset("titanic")

In [558]:
titanic.describe()

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [559]:
skim(titanic)