<a href="https://colab.research.google.com/github/LiorGazit/Useful_Analytics_Methods/blob/master/Demonstrating_dtale_and_pandas_profiling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook demonstrates two tools for reviewing data
1. `dtale` displays DataFrames in an interactive grid for review.  
 [Documentation](https://github.com/man-group/dtale)
1. `pandas-profiling` generates a quick exploratory report of the data.  
 [Documentation](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/index.html)

## Installations
### Install `dtale`

In [1]:
try:
  import dtale
  import dtale.app as dtale_app
except ImportError:
  # Install only if the package is missing, then retry the imports.
  !pip install dtale
  import dtale
  import dtale.app as dtale_app

### Install `pandas-profiling`

In [2]:
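try:
  import pandas as pd
  from pandas_profiling import ProfileReport
  # Smoke test: build a report on an empty DataFrame to check the installed version.
  dummy = ProfileReport(pd.DataFrame({}), title="Evaluating text annotations", explorative=True)
except Exception:
  # Package missing or smoke test failed: install the pinned release and re-import.
  !pip install pandas-profiling==2.8.0
  import pandas as pd
  from pandas_profiling import ProfileReport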

## Imports

In [3]:
import re
from sklearn.datasets import load_wine, fetch_20newsgroups, load_breast_cancer
import numpy as np
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Get the Data Set
The dataset is described in this [README](https://github.com/jayded/evidence-inference/blob/master/annotations/README.md).  
It consists of free-text clinical information with some additional attributes.  

In [4]:
dataset_raw = pd.read_csv("https://raw.githubusercontent.com/jayded/evidence-inference/master/annotations/annotations_merged.csv")
display(dataset_raw)

Unnamed: 0.1,Unnamed: 0,UserID,PromptID,PMCID,Valid Label,Valid Reasoning,Label,Annotations,Label Code,In Abstract,Evidence Start,Evidence End
0,0,0,213,2206488,True,True,no significant difference,IL-6r (ng/ml)\t\t\t\t\t\t\t Group A\t43.6 (1.7...,0,False,-1,-1
1,1,1,213,2206488,True,True,no significant difference,"There was no significant difference in IL 6, I...",0,True,1612,1708
2,2,3,213,2206488,True,True,no significant difference,"There was no significant difference in IL 6, I...",0,True,1612,1707
3,3,2,213,2206488,True,True,no significant difference,"There was no significant difference in IL 6, I...",0,True,1612,1707
4,4,0,98,2858204,True,True,significantly increased,"After two weeks of treatment, the reduction in...",1,True,18239,18338
...,...,...,...,...,...,...,...,...,...,...,...,...
24681,24681,3,13889,4105274,True,True,no significant difference,the knee flexion range of motion of the patien...,0,False,17738,18037
24682,24682,0,13890,4889182,True,True,no significant difference,"Examining 7‐day page views (Table&nbsp;3), the...",0,False,14636,14813
24683,24683,1,13890,4889182,True,True,no significant difference,"Examining 7‐day page views (Table 3), there wa...",0,False,14636,14812
24684,24684,0,13891,3398966,True,True,significantly decreased,"Overall, 86 (9.9%) intervention participants a...",-1,False,43929,44184
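
In the rows displayed above, each `Label` string is paired with a `Label Code` (0, 1, or -1). A minimal sketch of that pairing as a dictionary lookup follows. The helper name `label_to_code` is hypothetical and not part of the notebook, and the mapping is read off the displayed rows only, so other label values may exist in the full file.

```python
# Label strings and codes as they appear in the rows displayed above.
LABEL_TO_CODE = {
    "significantly decreased": -1,
    "no significant difference": 0,
    "significantly increased": 1,
}


def label_to_code(label):
    """Hypothetical helper: return the numeric code, failing loudly on unknown labels."""
    try:
        return LABEL_TO_CODE[label]
    except KeyError:
        raise ValueError(f"unexpected label: {label!r}") from None


codes = [
    label_to_code(label)
    for label in (
        "no significant difference",
        "significantly increased",
        "significantly decreased",
    )
]
print(codes)
```

Raising on an unknown label is a design choice of this sketch: it surfaces unexpected values instead of silently assigning them a code.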


### Creating Some Features from the Raw Data

In [5]:
# Keep the label and the free-text annotation as the base columns.
dataset = dataset_raw[["Label", "Annotations"]].copy()
# Encode the label: -1 = decreased, 0 = no difference, 1 = any other label.
dataset["label_numerical"] = [int(-1 if label == "significantly decreased" else 0 if label == "no significant difference" else 1) for label in dataset_raw["Label"]]
# Length features are bucketed: units of 100 characters and units of 20 tokens.
dataset["count_of_characters"] = [round(len(str(text))/100) for text in dataset_raw["Annotations"]]
dataset["count_of_words"] = [round(len(nltk.word_tokenize(str(text)))/20) for text in dataset_raw["Annotations"]]
# Case-insensitive substring counts of domain keywords.
dataset["count_'placebo'"] = [str(text).lower().count("placebo") for text in dataset_raw["Annotations"]]
dataset["count_'mg'"] = [str(text).lower().count("mg") for text in dataset_raw["Annotations"]]
dataset["count_'%'"] = [str(text).lower().count("%") for text in dataset_raw["Annotations"]]
dataset["count_'(p'"] = [str(text).lower().count("(p") for text in dataset_raw["Annotations"]]

# Regex counts of phrases that hint at the direction of the reported effect.
dataset["count_'no_significant'"] = [len(re.findall("no significant", str(text).lower())) for text in dataset_raw["Annotations"]]
dataset["count_'increase'"] = [len(re.findall("increase|greater|improve|larger|bigger", str(text).lower())) for text in dataset_raw["Annotations"]]
dataset["count_'decrease'"] = [len(re.findall("decrease|lower|worse", str(text).lower())) for text in dataset_raw["Annotations"]]

# Keep only the first 2000 rows for the demonstrations.
dataset = dataset.head(2000)

display(dataset)

Unnamed: 0,Label,Annotations,label_numerical,count_of_characters,count_of_words,count_'placebo',count_'mg',count_'%',count_'(p',count_'no_significant',count_'increase',count_'decrease'
0,no significant difference,IL-6r (ng/ml)\t\t\t\t\t\t\t Group A\t43.6 (1.7...,0,2,2,0,0,0,0,0,0,0
1,no significant difference,"There was no significant difference in IL 6, I...",0,1,1,0,0,0,0,1,0,0
2,no significant difference,"There was no significant difference in IL 6, I...",0,1,1,0,0,0,0,1,0,0
3,no significant difference,"There was no significant difference in IL 6, I...",0,1,1,0,0,0,0,1,0,0
4,significantly increased,"After two weeks of treatment, the reduction in...",1,1,1,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1995,no significant difference,No significant differences in protamine usage ...,0,2,2,0,1,0,0,1,0,0
1996,no significant difference,No significant differences in protamine usage ...,0,2,2,0,1,0,0,1,0,0
1997,no significant difference,he initial bolus heparin dosages required to p...,0,3,4,0,0,0,0,0,0,0
1998,no significant difference,The NOAC groups required significantly larger ...,0,1,1,0,0,0,0,0,1,0
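
To make the keyword columns above easier to check by hand, here is a small sketch that computes the same substring and regex counts for a single string, using only the standard library. The helper `keyword_features` and the sample sentence are hypothetical and not taken from the dataset. The character and word bucket features are left out, since the notebook relies on `nltk.word_tokenize` for the latter.

```python
import re

# Same alternations as the notebook's regex features.
INCREASE_PATTERN = re.compile(r"increase|greater|improve|larger|bigger")
DECREASE_PATTERN = re.compile(r"decrease|lower|worse")


def keyword_features(text):
    """Hypothetical helper: keyword counts for one annotation string."""
    lowered = str(text).lower()
    return {
        "count_'placebo'": lowered.count("placebo"),
        "count_'mg'": lowered.count("mg"),
        "count_'%'": lowered.count("%"),
        "count_'(p'": lowered.count("(p"),
        "count_'no_significant'": len(re.findall("no significant", lowered)),
        "count_'increase'": len(INCREASE_PATTERN.findall(lowered)),
        "count_'decrease'": len(DECREASE_PATTERN.findall(lowered)),
    }


# Made-up sentence for illustration only.
sample = "Pain scores were lower with 10 mg than with placebo (p < 0.05)."
features = keyword_features(sample)
print(features)
```

These are plain substring matches, so `lower` also counts inside `lowered`, and `mg` counts inside `mg/kg`. That is worth keeping in mind when reading the distributions of these columns in the tools below.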


## Demonstrating `dtale`

In [6]:
dtale_app.USE_COLAB = True
 
dtale.show(dataset)

2021-05-21 20:18:42,704 - INFO     - NumExpr defaulting to 2 threads.


https://5ifnudymob4-496ff2e9c6d22116-40000-colab.googleusercontent.com/dtale/main/1

## Demonstrating `pandas-profiling`

In [7]:
profile = ProfileReport(dataset, title="Evaluating text annotations", explorative=True)
profile.to_notebook_iframe()
