## Import Required Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from dataframe_stats import StatisticalDescription

## Let us load the Iris Dataset

In [2]:
iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'],
                       columns=iris['feature_names'])
iris_df.sample(10)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
141,6.9,3.1,5.1,2.3
121,5.6,2.8,4.9,2.0
134,6.1,2.6,5.6,1.4
89,5.5,2.5,4.0,1.3
144,6.7,3.3,5.7,2.5
0,5.1,3.5,1.4,0.2
37,4.9,3.6,1.4,0.1
10,5.4,3.7,1.5,0.2
93,5.0,2.3,3.3,1.0
88,5.6,3.0,4.1,1.3


## Let us test the Module

In [3]:
new_df = StatisticalDescription(iris_df.columns)
new_df.populate_df(iris_df)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
column type,float16,float16,float16,float16
count,150,150,150,150
min,4.3,2,1,0.1
max,7.9,4.4,6.9,2.5
mean,5.84333,3.05733,3.758,1.19933
median,5.8,3,4.35,1.3
mode,5,3,1.4,0.2
percent of zero (or nan) rows,0,0,0,0
variance,0.685694,0.189979,3.11628,0.581006
std,0.828066,0.435866,1.7653,0.762238


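The `column type` row does more than report the type of each column: the module also downcasts the column to a smaller type. The values in the Iris dataset were originally of type `float64`, and the module reduced them to `float16`. On larger datasets this lowers memory use and can shorten computation time. Note that `float16` keeps only about three significant decimal digits, which is enough for measurements recorded to one decimal place, as here, but may not be enough for higher-precision data.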
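As a rough illustration of this kind of downcasting, the sketch below casts a float column to the smallest float type that still reproduces its values within a tolerance. It is not the `dataframe_stats` implementation: the helper `downcast_float_column` and its `rtol` parameter are hypothetical names chosen for this example, and the sample values are taken from the Iris rows shown earlier.

```python
import numpy as np
import pandas as pd


def downcast_float_column(col: pd.Series, rtol: float = 1e-3) -> pd.Series:
    """Cast `col` to the smallest float dtype that keeps its values within `rtol`.

    Hypothetical helper for illustration; not part of dataframe_stats.
    """
    original = col.to_numpy(dtype=np.float64)
    for dtype in (np.float16, np.float32):
        # Values too large for the target type overflow to inf and fail the check.
        with np.errstate(over="ignore"):
            candidate = col.astype(dtype)
        restored = candidate.to_numpy(dtype=np.float64)
        if np.allclose(restored, original, rtol=rtol, atol=0.0, equal_nan=True):
            return candidate
    return col  # keep float64 when no smaller type is accurate enough


demo = pd.DataFrame({"sepal length (cm)": [6.9, 5.6, 6.1, 5.5]}, dtype="float64")
small = pd.DataFrame({name: downcast_float_column(demo[name]) for name in demo.columns})

print(small.dtypes)
print(demo.memory_usage(index=False).sum(), "bytes before")
print(small.memory_usage(index=False).sum(), "bytes after")
```

With the default tolerance, one-decimal measurements like these fit in `float16`, which stores each value in 2 bytes instead of 8. A column holding values beyond the `float16` range would fall back to `float32` or stay `float64`.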

The dataframe is saved in `csv` format with the following line:

In [6]:
new_df.save_dataframe(iris_df, 'iris_dataset_stats')
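How `save_dataframe` names and writes the file is not shown here, so the following is only a sketch of a CSV round trip with plain pandas, using an in-memory buffer in place of a file. The point it illustrates is that a statistics table keeps its row labels (`count`, `min`, ...) in the index, so the file should be read back with `index_col=0`; otherwise pandas loads the labels as an ordinary column named `Unnamed: 0`.

```python
import io

import pandas as pd

# A small statistics table; the values are copied from the Iris output above.
stats = pd.DataFrame(
    {"sepal length (cm)": [150, 4.3, 7.9]},
    index=["count", "min", "max"],
)

# Write to an in-memory buffer (stands in for a .csv file on disk).
buffer = io.StringIO()
stats.to_csv(buffer)

# Read it back, telling pandas that the first column holds the row labels.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```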

## Testing a Different Dataframe

Processing the Iris dataset is straightforward because it contains only `numerical` features. What happens when the input dataframe also contains `categorical` data? The following example uses the `titanic` dataset from Kaggle.

In [8]:
data_url = "titanic.csv"

titanic_df = pd.read_csv(data_url)
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Same Process

In [9]:
new_df = StatisticalDescription(titanic_df.columns)
new_df.populate_df(titanic_df)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
column type,int16,int8,int8,categorical data (str or mixed),categorical data (str or mixed),float16,int8,int8,categorical data (str or mixed),float16,categorical data (str or mixed),categorical data (str or mixed)
count,891,342,891,891,891,714,283,213,891,876,204,889
min,1,0,1,,,0.42,0,0,,0,,
max,891,1,3,,,80,8,6,,512.329,,
mean,446,1,2.30864,,,29.6991,1.64664,1.59624,,32.7556,,
median,446,0,3,,,28,0,0,,14.4542,,
mode,1,0,3,"0 Abbing, Mr. Anthony 1...",0 male dtype: object,24,0,0,0 1601 1 347082 2 CA. 2343 dtyp...,8.05,0 B96 B98 1 C23 C25 C27 2 ...,0 S dtype: object
percent of zero (or nan) rows,0,61.6162,0,100,100,19.8653,68.2379,76.0943,100,1.6835,22.8956,99.7755
variance,66231,0.236772,0.699015,,,169.052,1.21604,0.649728,,2469.44,,
std,257.354,0.486592,0.836071,,,13.002,1.10274,0.806057,,49.6934,,


For categorical columns with no numerical entries, no statistics can be calculated, so the corresponding cells are filled with `NaN` values.
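One way to get this behaviour is to check each column's dtype before computing anything. The sketch below is an assumption about how such a check could look, not the module's actual code: `describe_column` is a hypothetical helper, and the two sample columns reuse the first three Titanic rows shown above.

```python
import numpy as np
import pandas as pd

STAT_NAMES = ("min", "max", "mean", "std")


def describe_column(col: pd.Series) -> dict:
    """Return basic statistics for a numeric column, or NaN for anything else.

    Hypothetical helper for illustration; not part of dataframe_stats.
    """
    if pd.api.types.is_numeric_dtype(col):
        return {
            "min": col.min(),
            "max": col.max(),
            "mean": col.mean(),
            "std": col.std(),  # pandas default: sample standard deviation (ddof=1)
        }
    # Categorical (string or mixed) column: no numeric statistics exist.
    return {name: np.nan for name in STAT_NAMES}


sample = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Sex": ["male", "female", "female"]}
)
summary = pd.DataFrame({name: describe_column(sample[name]) for name in sample.columns})
print(summary)
```

Here the `Age` column gets real values while every entry for `Sex` is `NaN`, which is the same pattern as in the table above.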

In [10]:
new_df.save_dataframe(titanic_df, 'titanic_dataset_stats')