# **2. DATA PROFILING**

Import libraries:

In [None]:
!pip install sweetviz lux-api autoviz plotly matplotlib

In [1]:
import pandas as pd
import json
import seaborn as sns
import sweetviz as sv
import lux
from autoviz import AutoViz_Class
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline

Imported v0.1.905. Please call AutoViz in this sequence:
    AV = AutoViz_Class()
    %matplotlib inline
    dfte = AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=1, lowess=False,
               chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30, save_plot_dir=None)


**DATA PROFILING**

Data profiling is the set of activities and processes designed to determine the metadata of a given dataset.

Data profiling helps understand and prepare data for subsequent cleansing, integration, and analysis.

Import data:

In [2]:
BEERS = pd.read_csv('https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/BEERS.csv')

Basic profiling activities:

In [3]:
#look at data with the functions we have already seen in DQ ASSESSMENT
BEERS.columns

Index(['abv', 'ibu', 'id', 'name', 'style', 'brewery_id', 'ounces'], dtype='object')

In [4]:
BEERS.shape

(2419, 7)

In [5]:
BEERS.head()

| | abv | ibu | id | name | style | brewery_id | ounces |
|---|---|---|---|---|---|---|---|
| 0 | 0.05 | NaN | 1436 | Pub Beer | American Pale Lager | 408 | 12.0 |
| 1 | 66.0 | NaN | 2265 | Devil's Cup | American Pale Ale (APA) | 177 | 12.0 |
| 2 | 71.0 | NaN | 2264 | Rise of the Phoenix | American IPA | 177 | 12.0 |
| 3 | 0.09 | NaN | 2263 | Sinister | American Double / Imperial IPA | 177 | 12.0 |
| 4 | 75.0 | NaN | 2262 | Sex and Candy | American IPA | 177 | 12.0 |


In [6]:
BEERS.dtypes

abv           float64
ibu           float64
id              int64
name           object
style          object
brewery_id     object
ounces        float64
dtype: object


In [7]:
#display numeric columns
NUM = list(BEERS.select_dtypes(include=['int64','float64']).columns)
NUM

['abv', 'ibu', 'id', 'ounces']

In [8]:
#display categorical columns
CAT = list(BEERS.select_dtypes(include=['bool','object']).columns)
CAT

['name', 'style', 'brewery_id']

**SINGLE COLUMN ANALYSIS**

**Cardinalities**

Cardinalities are numbers that summarize simple metadata (*e.g.,* number of rows, attributes, null values, distinct values, Uniqueness and Distinctness).

*Cardinality* = count of the number of distinct actual values.

*Uniqueness* = percentage calculated as Cardinality divided by the total number of records.

*Actual* = count of the number of records with an actual value (*i.e.,* not-null).

*Distinctness* = percentage calculated as Cardinality divided by Actual.
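The four definitions above can be collected in one small helper. The sketch below is only an illustration: the toy `ibu` series and the function name `cardinality_profile` are assumptions made for this example and are not part of the notebook or the BEERS dataset.

```python
import pandas as pd

# Hypothetical toy column standing in for BEERS['ibu'] (values are made up).
ibu = pd.Series([35.0, 35.0, 60.0, None, 60.0, 20.0, None, 35.0], name="ibu")


def cardinality_profile(column: pd.Series) -> dict:
    """Return the cardinality metrics defined above for a single column."""
    rows = len(column)                    # total number of records, nulls included
    actual = int(column.count())          # records with a non-null value
    cardinality = int(column.nunique())   # distinct non-null values
    return {
        "rows": rows,
        "actual": actual,
        "cardinality": cardinality,
        # Guard against empty columns to avoid a division by zero.
        "uniqueness": cardinality / rows if rows else float("nan"),
        "distinctness": cardinality / actual if actual else float("nan"),
    }


profile = cardinality_profile(ibu)
print(profile)
```

On this toy column, Uniqueness (3/8) is lower than Distinctness (3/6) because the two null records count toward the total number of records but not toward Actual.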

In [9]:
#the len function counts the number of rows
IBU_ROWS = len(BEERS['ibu'])
IBU_ROWS

2419

In [10]:
#number of rows with the shape attribute
BEERS.shape[0]

2419

In [11]:
#number of columns with the shape attribute
BEERS.shape[1]

7

In [12]:
#number of cells with the shape attribute
CELLS = BEERS.shape[0]*BEERS.shape[1]
CELLS

16933

In [13]:
#number of non-null observations in a column (attribute "ibu")
COUNT =  BEERS["ibu"].count()
COUNT

np.int64(1412)

In [None]:
#value_counts is equivalent to a group-by count (attribute "ibu")
VALUE_COUNT =  BEERS["ibu"].value_counts()
VALUE_COUNT

In [18]:
#nunique returns the number of distinct values of an attribute (attribute "ibu")
DISTINCT = BEERS['ibu'].nunique()
DISTINCT

107

UNIQUENESS EVALUATION:

In [20]:
#for attribute "ibu"
UNIQUENESS = DISTINCT / IBU_ROWS
print("UNIQUENESS: ", UNIQUENESS)

UNIQUENESS:  0.04423315419594874


DISTINCTNESS EVALUATION:

In [None]:
#for attribute "ibu"
DISTINCTNESS = DISTINCT / COUNT
print("DISTINCTNESS: ", DISTINCTNESS)

**Value distributions**

Value distributions summarize the distribution of values within a column (*e.g.,* extremes and Constancy). A common representation of a value distribution is the histogram.

*Constancy* = frequency of the most frequent value divided by the total number of values. A high Constancy may reveal the presence of standard (default) values.
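As a rough illustration of the definition above, the sketch below computes the extremes, the mode, and the Constancy of a toy column. The data are hypothetical, and reading "total number of values" as the number of non-null values is an assumption of this example; dividing by `len(ibu)` instead would include the nulls.

```python
import pandas as pd

# Hypothetical toy column standing in for BEERS['ibu'] (values are made up).
ibu = pd.Series([35.0, 35.0, 60.0, None, 60.0, 20.0, None, 35.0], name="ibu")

# value_counts ignores nulls and sorts frequencies in descending order,
# so the first entry is the most frequent value.
counts = ibu.value_counts()
most_frequent_value = counts.index[0]
most_frequent_count = int(counts.iloc[0])

# Assumption: the denominator is the number of non-null values.
constancy = most_frequent_count / int(ibu.count())

print("MIN:", ibu.min())
print("MAX:", ibu.max())
print("MODE:", most_frequent_value)
print("CONSTANCY:", constancy)
```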

In [None]:
#extremes (attribute "abv")
#print ('MIN:', )
#print ('MAX:', )
#print ('MODE:', )

In [None]:
#extremes (all attributes)
#print ('MIN: ', )
#print("\n\n")
#print ('MAX: ', )

In [None]:
#other information: Mean and Standard deviation
#print('Average:', )
#print('Standard Deviation:', )

In [None]:
#find max of value counts (attribute "ibu")


CONSTANCY EVALUATION:

In [None]:
#for attribute "ibu"

#print("CONSTANCY: ", CONSTANCY)

**Histograms** are often used to inspect how the values of a column are distributed and to fit distributions to the data. Analysts can check whether the values of a column are (approximately) normally distributed and count how many outliers it contains.
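A minimal sketch of these checks, without plotting, is shown below. The values are hypothetical, and both the skewness check and the 1.5 x IQR outlier rule are choices made for this example; the notebook itself does not prescribe a specific normality test or outlier criterion.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for a numeric column such as BEERS['ibu'].
raw = pd.Series([10, 12, 14, 15, 15, 16, 18, 20, 22, 95, None], dtype="float64")
values = raw.dropna()  # histograms and quantiles are computed on non-null values

# Bin the values: these counts are the data behind a histogram plot.
bin_counts, bin_edges = np.histogram(values, bins=5)

# A skewness far from 0 hints that the column is not symmetric,
# and therefore not well described by a normal distribution.
skewness = values.skew()

# Assumed outlier rule: values more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print("Bin counts:", bin_counts)
print("Skewness:", skewness)
print("Number of outliers:", len(outliers))
```

In the notebook, the same column could be drawn with `values.hist(bins=5)`; the numeric version is used here only so that the result can be inspected without a figure.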


In [None]:
#plot the distribution of the attribute "ibu" with hist function


In [None]:
#the describe function returns summary statistics for all the numerical attributes of the table


In [None]:
#we can also display multiple histograms


In [None]:
#correlation evaluation based on the Pearson correlation coefficient


In [None]:
#correlation evaluation based on the Kendall correlation coefficient


In [None]:
#correlation evaluation based on the Spearman correlation coefficient


In [None]:
#correlation evaluation heatmap
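
One possible way to approach the correlation cells above is `DataFrame.corr`, which computes pairwise correlations and excludes null values pair by pair. The table below is a hypothetical toy dataset that only reuses some BEERS column names; it is a sketch, not the notebook's solution.

```python
import pandas as pd

# Hypothetical toy table reusing some numeric column names from BEERS.
df = pd.DataFrame({
    "abv": [0.045, 0.050, 0.060, 0.070, 0.090, None],
    "ibu": [15.0, 20.0, 40.0, 65.0, 100.0, 30.0],
    "ounces": [12.0, 12.0, 16.0, 12.0, 16.0, 12.0],
})

# Pearson measures linear association between two columns.
pearson = df.corr(method="pearson")

# Spearman measures monotonic association (correlation of the ranks).
spearman = df.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```

`method="kendall"` is accepted in the same way, although to my understanding pandas relies on SciPy for it. With the imports at the top of the notebook, a heatmap of the result could be drawn with `sns.heatmap(pearson, annot=True)`.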


**SWEETVIZ LIBRARY** (alternative library for data profiling)

In [None]:
#sweet_report = sv.analyze([BEERS,'Sweetviz Report'])
#sweet_report.show_notebook()

**LUX LIBRARY** (alternative library for correlation discovery)

In [None]:
#BEERS.default_display = "lux"

In [None]:
#BEERS

**MATPLOTLIB** (very useful library for data visualization)

In [None]:
#plt.figure(figsize=(8, 6))
#plt.scatter(BEERS['ibu'], BEERS['abv'], marker='o', color='r', label='Data')
#plt.xlabel('IBU')
#plt.ylabel('ABV')
#plt.title('SCATTER PLOT')
#plt.legend()
#plt.grid(False)
#plt.show()

**PLOTLY** (very useful library for **interactive** data visualization)

In [None]:
#df = px.data.iris()
#fig = px.scatter_matrix(df,
#    dimensions=["sepal_length", "sepal_width", "petal_length", "petal_width"],
#    color="species")
#fig.show()

In [None]:
#df = px.data.tips()
#fig = px.box(df, x="time", y="total_bill")
#fig.show()

In [None]:
#fig = px.box(df, x="time", y="total_bill", points="all")
#fig.show()

**Summary:**

*Basic profiling activities*
- pandas.read_csv()
- DataFrame.columns
- DataFrame.shape
- DataFrame.head()
- DataFrame.dtypes
- DataFrame.select_dtypes()

*Single column analysis*
- len()
- Series.count()
- Series.value_counts()
- Series.nunique()
- DataFrame.min(), DataFrame.max(), DataFrame.mean(), DataFrame.std(), DataFrame.mode()

*Histograms*
- DataFrame.describe()
- DataFrame.dropna()
- DataFrame.hist()
- DataFrame.corr()