<a href="https://colab.research.google.com/github/JSchoonmaker/Google-Data-Studio/blob/main/Copy_of_04_GoogleNews_Cleaner_Splitter_Classification_Aggregator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Obsei Tutorial 04
## This example shows the following Obsei workflow
 1. Observe: Search for and fetch news articles via Google News
 2. Cleaner: Clean the article text
 3. Analyze: Classify the article text by splitting it into small chunks and then computing the final inference with the configured aggregation function

## Install Obsei from the latest code, then perform these steps
- Select a GPU runtime type for faster computation
- Restart the runtime after installation

In [4]:
!pip install git+https://github.com/lalitpagaria/obsei.git

Collecting git+https://github.com/lalitpagaria/obsei.git
  Cloning https://github.com/lalitpagaria/obsei.git to /tmp/pip-req-build-icyju_pu
  Running command git clone -q https://github.com/lalitpagaria/obsei.git /tmp/pip-req-build-icyju_pu


## Configure Google News Observer

In [5]:
from obsei.source.google_news_source import GoogleNewsConfig, GoogleNewsSource

source_config = GoogleNewsConfig(
    query="guns",
    max_results=100,
    fetch_article=True,
    lookup_period="7d",
)

source = GoogleNewsSource()

## Configure TextCleaner as a pre-processor to clean the article text
These cleaning functions run serially, in the order listed.
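
To illustrate what running cleaning functions serially means, here is a minimal sketch in plain Python. It does not use Obsei's classes: `to_lower_case`, `remove_punctuation`, `remove_empty_tokens`, and `run_serially` are hypothetical helpers written for this example, and the assumption that each function receives and returns a list of tokens is ours.

```python
import string
from typing import Callable, List

Tokens = List[str]


def to_lower_case(tokens: Tokens) -> Tokens:
    # Lowercase every token.
    return [token.lower() for token in tokens]


def remove_punctuation(tokens: Tokens) -> Tokens:
    # Strip ASCII punctuation; a token that was only punctuation becomes "".
    table = str.maketrans("", "", string.punctuation)
    return [token.translate(table) for token in tokens]


def remove_empty_tokens(tokens: Tokens) -> Tokens:
    # Drop tokens that are empty or whitespace-only.
    return [token.strip() for token in tokens if token.strip()]


def run_serially(text: str, functions: List[Callable[[Tokens], Tokens]]) -> str:
    # Hypothetical driver: each function consumes the previous one's output.
    tokens = text.split()
    for function in functions:
        tokens = function(tokens)
    return " ".join(tokens)


cleaned = run_serially(
    "Hello ,  World !",
    [to_lower_case, remove_punctuation, remove_empty_tokens],
)
print(cleaned)  # hello world
```

Because each step works on the previous step's output, order matters: removing punctuation can leave empty tokens behind, which is presumably why an empty-token removal step is worth repeating at the end of the list.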

In [6]:
from obsei.preprocessor.text_cleaner import TextCleaner, TextCleanerConfig
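from obsei.preprocessor.text_cleaning_function import (
    DecodeUnicode,
    RemovePunctuation,
    RemoveSpecialChars,
    RemoveStopWords,
    RemoveWhiteSpaceAndEmptyToken,
    ToLowerCase,
)

# Functions are applied one after another, in list order
text_cleaner_config = TextCleanerConfig(
    cleaning_functions=[
        ToLowerCase(),
        RemoveWhiteSpaceAndEmptyToken(),
        RemovePunctuation(),
        RemoveSpecialChars(),
        DecodeUnicode(),
        RemoveStopWords(),
        RemoveWhiteSpaceAndEmptyToken(),
    ]
)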

text_cleaner = TextCleaner()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Configure Classification Analyzer

- `labels`: the list of candidate categories
- `TextSplitterConfig`: set suitable `max_split_length` and `split_stride` values
- `InferenceAggregatorConfig`: requires an `aggregate_function`; two are currently supported (average and most frequent class)
- `ClassificationMaxCategories`: needs a `score_threshold`, the minimum probability a class must reach to be taken into consideration

**Note**: To try a different model, select one from https://huggingface.co/models?pipeline_tag=zero-shot-classification
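
The split-then-aggregate idea behind these settings can be sketched in plain Python. This is not Obsei's implementation: `split_text` and `aggregate_max_categories` are hypothetical helpers, the sketch assumes `split_stride` is the overlap between consecutive chunks measured in characters, and the output keys simply mirror the `category_count` and `max_scores` fields that appear in the printed result at the end of this notebook.

```python
from collections import defaultdict
from typing import Dict, List


def split_text(text: str, max_split_length: int, split_stride: int) -> List[str]:
    # Assumption: chunks are at most max_split_length characters long and
    # each chunk overlaps the previous one by split_stride characters.
    if max_split_length <= split_stride:
        raise ValueError("max_split_length must be larger than split_stride")
    step = max_split_length - split_stride
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_split_length])
        if start + max_split_length >= len(text):
            break
        start += step
    return chunks


def aggregate_max_categories(
    chunk_scores: List[Dict[str, float]], score_threshold: float
) -> Dict[str, dict]:
    # For each label, count the chunks whose score reaches the threshold
    # and remember the highest score seen among those chunks.
    category_count: Dict[str, int] = defaultdict(int)
    max_scores: Dict[str, float] = {}
    for scores in chunk_scores:
        for label, score in scores.items():
            if score >= score_threshold:
                category_count[label] += 1
                max_scores[label] = max(score, max_scores.get(label, 0.0))
    return {"category_count": dict(category_count), "max_scores": max_scores}


print(split_text("abcdefghij", max_split_length=4, split_stride=1))
# ['abcd', 'defg', 'ghij']

# Made-up per-chunk scores, standing in for the classifier's output.
example_scores = [
    {"going up": 0.8, "going down": 0.1},
    {"going up": 0.4, "going down": 0.35},
    {"going up": 0.2, "going down": 0.05},
]
print(aggregate_max_categories(example_scores, score_threshold=0.3))
# {'category_count': {'going up': 2, 'going down': 1}, 'max_scores': {'going up': 0.8, 'going down': 0.35}}
```

A higher `score_threshold` makes the counts stricter: a chunk only votes for a label when the classifier is reasonably confident about it.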

In [7]:
from obsei.analyzer.classification_analyzer import ClassificationAnalyzerConfig, ZeroShotClassificationAnalyzer
from obsei.postprocessor.inference_aggregator import InferenceAggregatorConfig
from obsei.postprocessor.inference_aggregator_function import ClassificationMaxCategories
from obsei.preprocessor.text_splitter import TextSplitterConfig

analyzer_config = ClassificationAnalyzerConfig(
    labels=["buy", "sell", "going up", "going down"],
    use_splitter_and_aggregator=True,
    splitter_config=TextSplitterConfig(
        max_split_length=300,
        split_stride=3
    ),
    aggregator_config=InferenceAggregatorConfig(
        aggregate_function=ClassificationMaxCategories(
            score_threshold=0.3
        )
    )
)

text_analyzer = ZeroShotClassificationAnalyzer(
    model_name_or_path="typeform/mobilebert-uncased-mnli",
    device="auto"
)



## Search for and fetch news articles

In [8]:
source_response_list = source.lookup(source_config)


## Preprocess the text to clean it

In [9]:
cleaner_response_list = text_cleaner.preprocess_input(
    input_list=source_response_list,
    config=text_cleaner_config
)

## Analyze the articles to perform classification
**Note**: This is a compute-heavy step

In [10]:
analyzer_response_list = text_analyzer.analyze_input(
    source_response_list=cleaner_response_list,
    analyzer_config=analyzer_config
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


## Print the results

In [11]:
for analyzer_response in analyzer_response_list:
    print(vars(analyzer_response))

{'segmented_data': {'aggregator_data': {'category_count': {'positive': 11, 'going up': 11, 'negative': 2, 'going down': 2}, 'max_scores': {'positive': 0.9728320837020874, 'going up': 0.8844072222709656, 'negative': 0.31063076853752136, 'going down': 0.6738869547843933}, 'aggregator_name': 'ClassificationMaxCategories'}}, 'meta': {'title': 'Guns N’ Roses at Wrigley Field: band revisits the past with plenty of gusto - Chicago Sun-Times', 'description': "Guns N’ Roses at Wrigley Field: band revisits the past with plenty of gusto  Chicago Sun-TimesWatch Guns N' Roses Members Soundcheck Rare Song 'Hard School'  Ultimate Classic RockSee Video of Guns N' Roses Rehearsing 'Hard School' Before Show  LoudwireEvery track on Guns N' Roses' Use Your Illusion I & II, ranked from worst to best  LouderWatch the pro-shot version of Guns N’ Roses and Dave Grohl’s plug-pulling Paradise City performance  Guitar WorldView Full Coverage on Google News", 'published date': 'Fri, 17 Sep 2021 14:04:26 GMT', 'ur