## Data Exploration with SweetViz
<b>
    Author:<i> Dr. Anil P. Singh, saffron intelligence </i>
</b>

Reference : https://pypi.org/project/sweetviz/


#### Target analysis
 * Relationship of the target value (SalePrice) to the other features

#### Visualize and compare
 * Distinct datasets (e.g. training vs test data)
 * Intra-set characteristics (e.g. old construction vs new construction)

#### General Checklist
 * Type, unique values, missing values, duplicate rows, most frequent values
 
#### Univariate Analysis
 * min/max/range, quartiles, mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
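
Some of the less familiar statistics in this list can be computed by hand, which helps when reading the report. The following is a minimal sketch using only the Python standard library. It uses population moments and reports excess kurtosis; these are assumptions made here and may differ from the exact conventions SweetViz uses. The function names are illustrative only.

```python
import math
import statistics


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


def _central_moment(values, k):
    """k-th central moment (population version, divides by n)."""
    mean = statistics.mean(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Third standardized moment; 0 for a symmetric sample."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3.0
```

For a numeric column of the dataframe these could be called as, for example, `skewness(list(df['SalePrice']))`.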
 
 
#### Bivariate Analysis
 * numerical (Pearson's correlation)
 * categorical (uncertainty coefficient)
 * categorical-numerical (correlation ratio) 
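
To make the two less common measures concrete, here is a minimal sketch of the correlation ratio and the uncertainty coefficient (Theil's U) in plain Python. It follows the textbook definitions and is not taken from SweetViz's source, so details such as the handling of constant columns are assumptions; the function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def correlation_ratio(categories, values):
    """Eta: sqrt(between-group sum of squares / total sum of squares)."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # constant values: nothing to explain (convention chosen here)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return math.sqrt(ss_between / ss_total)


def entropy(x):
    """Shannon entropy H(x) in nats."""
    n = len(x)
    return -sum((c / n) * math.log(c / n) for c in Counter(x).values())


def conditional_entropy(x, y):
    """H(x|y) = -sum over (x, y) of p(x, y) * log(p(x, y) / p(y))."""
    n = len(x)
    y_counts = Counter(y)
    h = 0.0
    for (_, yi), c in Counter(zip(x, y)).items():
        h -= (c / n) * math.log(c / y_counts[yi])
    return h


def uncertainty_coefficient(x, y):
    """Theil's U(x|y): fraction of the uncertainty in x explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant (convention chosen here)
    return (h_x - conditional_entropy(x, y)) / h_x
```

Unlike Pearson's correlation, the uncertainty coefficient is asymmetric: knowing a unique identifier fully determines a category, while the category only partly narrows down the identifier.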

#### Type inference
 * Automatically detects numerical, categorical and text features, with optional manual overrides


In [22]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
plt.style.use('bmh')
df = pd.read_csv('train.csv')
df['Id'] = df['Id'].astype(str)
num_rows = df.shape[0]
num_features = df.shape[1]
print("Number of features: ",num_features)
print("Number of rows",num_rows)

Number of features:  81
Number of rows 1460


#### Create the exploration report

SweetViz offers three main routines:
 * <i>analyze(...)</i>: analyze a single dataframe
 * <i>compare(...)</i>: compare two dataframes (e.g. training vs test data)
 * <i>compare_intra(...)</i>: compare two subsets of the same dataframe
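
Since compare_intra() is not run later in this notebook, here is a minimal sketch of how it could be applied to the old vs new construction split mentioned earlier. It assumes the dataframe has a `YearBuilt` column (as in the House Prices data) and uses an arbitrary cutoff year of 2000; both helper names are hypothetical. `sv.compare_intra` takes the dataframe, a boolean Series, and two names, the first of which labels the rows where the condition is True.

```python
import pandas as pd


def new_construction_mask(df, year=2000):
    """Boolean Series: True for houses built in `year` or later (cutoff is an assumption)."""
    return df["YearBuilt"] >= year


def construction_report(df, year=2000):
    """Compare newer vs older houses within the same dataframe."""
    import sweetviz as sv  # imported here so the mask helper works without sweetviz

    mask = new_construction_mask(df, year)
    # The first name labels the rows where the mask is True
    return sv.compare_intra(df, mask, ["New construction", "Old construction"])
```

The returned report can be displayed like any other, for example with `construction_report(df).show_html('intra.html')`.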

In [2]:
!pip install sweetviz;

#### analyze()

The signature of analyze is:  
<code>
def analyze(source, target_feat=None, feat_cfg=None, pairwise_analysis='auto')
</code>

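* source: the DataFrame itself, or a pair containing the DataFrame and a name to show in the report
* target_feat: name of the target feature. Only BOOLEAN and NUMERICAL targets are supported
* feat_cfg: a FeatureConfig object that specifies
 * Features to be skipped
 * Features to be manually set to a data type
* pairwise_analysis: 'auto', 'on' or 'off'; controls whether the pairwise associations between features are computed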
 
This is how we create a feat_cfg object:
<code>
   feature_config = sv.FeatureConfig(skip="StrtName", force_text=["Id"])
</code>
Each argument can be either a single string or a list of strings. The parameters are skip, force_cat, force_num and force_text. The "force_" arguments override the built-in type detection.
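
A misspelled column name in these arguments is easy to miss. The following is a small sketch of a pre-check that normalizes the arguments to lists and verifies every name against the dataframe's columns before the FeatureConfig is built. The helper `feature_config_kwargs` is hypothetical and not part of SweetViz; how SweetViz itself treats unknown names is not covered here.

```python
def feature_config_kwargs(columns, skip=None, force_cat=None,
                          force_num=None, force_text=None):
    """Return keyword arguments for sv.FeatureConfig after validating the names."""

    def as_list(value):
        # Accept None, a single string, or any iterable of strings
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    kwargs = {
        "skip": as_list(skip),
        "force_cat": as_list(force_cat),
        "force_num": as_list(force_num),
        "force_text": as_list(force_text),
    }
    known = set(columns)
    unknown = sorted({n for names in kwargs.values() for n in names} - known)
    if unknown:
        raise ValueError(f"Unknown column(s): {unknown}")
    # Drop empty entries so only the options actually used are passed on
    return {key: names for key, names in kwargs.items() if names}
```

With the dataframe loaded above, this could be used as `sv.FeatureConfig(**feature_config_kwargs(df.columns, skip="Id", force_cat=["MSSubClass"]))`, assuming the integer-coded `MSSubClass` column of the House Prices data should be treated as categorical.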

#### Running analyze with default arguments

In [23]:
import sweetviz as sv
house_report = sv.analyze(df)
house_report.show_html()



Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


#### Running analyze with target feature

In [24]:
house_report = sv.analyze(df,target_feat='SalePrice')
house_report.show_html()



Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


#### Running analyze with pair-wise analysis (categorical features)

In [16]:
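# Collect the numeric columns; 'Id' was cast to str above, so add it explicitly
numeric_features = df.select_dtypes(include=['float64', 'int64']).columns
numeric_features = list(numeric_features)+['Id']
# Keep SalePrice in the report
numeric_features.remove('SalePrice')

# Skip the numeric features so that only categorical ones are analyzed
feat_config = sv.FeatureConfig(skip=numeric_features)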

house_report = sv.analyze(df,feat_cfg=feat_config,pairwise_analysis='on')
house_report.show_html()



Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


#### Running analyze with pair-wise analysis (numeric features)

In [18]:
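# Collect the categorical (object-typed) columns
cat_features = list(df.select_dtypes(include=['object']).columns)
# 'Id' was cast to str above; make sure it is skipped as well
if 'Id' not in cat_features:
    cat_features.append('Id')

# Skip the categorical features so that only numeric ones are analyzed
feat_config = sv.FeatureConfig(skip=cat_features)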

house_report = sv.analyze(df,feat_cfg=feat_config,pairwise_analysis='on')
house_report.show_html()



Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


#### Running analyze with full glory

In [19]:
# Skip only the Id column
feat_config = sv.FeatureConfig(skip='Id')

house_report = sv.analyze(df,target_feat='SalePrice',feat_cfg=feat_config,pairwise_analysis='on')
house_report.show_html()



Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


#### Running comparison between training and testing data

In [21]:
# Read the test set and compare it with the training set
df2 = pd.read_csv('test.csv')

compare_report = sv.compare([df, 'Train'], [df2, 'Test'])
compare_report.show_html('compare.html')



Report compare.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
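
Before comparing two files it can help to check which columns they share, since a feature present in only one of them has nothing to be compared against. A minimal sketch in plain Python follows; the helper name `column_diff` is hypothetical. In the House Prices data the test file is expected to lack the target column SalePrice, which is an assumption about that dataset rather than something shown in this notebook.

```python
def column_diff(train_columns, test_columns):
    """List the columns found in only one of the two datasets, in original order."""
    train_set, test_set = set(train_columns), set(test_columns)
    return {
        "only_in_train": [c for c in train_columns if c not in test_set],
        "only_in_test": [c for c in test_columns if c not in train_set],
    }
```

With the dataframes above this would be called as `column_diff(df.columns, df2.columns)`.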
