# Batch vs. Streaming Processing
*[Source](https://thenewstack.io/the-big-data-debate-batch-processing-vs-streaming-processing/)*
## Definitions
- A batch is a collection of data points that have been grouped together within a specific time interval. Another term often used for this is a window of data. 
- Streaming processing deals with continuous data and is key to turning big data into fast data. Both models are valuable and each can be used to address different use cases. 

The batch processing model requires a set of data collected over time, whereas streaming processing requires data to be fed into an analytics tool as it arrives, in real time. Batch processing is often used when dealing with large volumes of data or with data sources from legacy systems, where it is not feasible to deliver data in streams. By definition, batch processing also requires all the data in the batch to be loaded into some type of storage, such as a database or file system, before it is processed.

Data streams can also be involved in processing large quantities of data, but batch works best when you don’t need real-time analytics. 
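
The difference between the two models can be sketched with a toy calculation. This is only an illustration, not a production pattern: the function names `batch_mean` and `streaming_mean` and the sample readings are made up for this example. The batch version waits for the whole window, while the streaming version updates its answer as each data point arrives.

```python
from typing import Iterable, Iterator, List


def batch_mean(window: List[float]) -> float:
    """Batch style: the whole window is in memory before we compute."""
    return sum(window) / len(window)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the running mean as each data point arrives."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


readings = [2.0, 4.0, 6.0, 8.0]  # made-up data points

print(batch_mean(readings))            # one answer, after all data is collected
print(list(streaming_mean(readings)))  # one answer per incoming data point
```

The streaming version never needs the full data set, which is why it suits sources that never end, at the cost of only having a partial answer at any given moment.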


# Mining Newsfeed data
## 1. API setup
1. Create an account at: https://newsapi.org/register
2. Save the API key.
  - Optional: save it as an environment variable.
      - Linux/macOS: https://www.digitalocean.com/community/tutorials/how-to-read-and-set-environmental-and-shell-variables-on-linux
          - From the command line, add this line to the file `~/.bashrc`:
              - `export NEWS_API_KEY="<Your Key>"` (without `export`, the variable is not passed on to programs such as Python)
          - Run `source ~/.bashrc`
      - Windows: http://codevba.com/office/environ.htm#.YAkatZP0lhE
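
Once the variable is set, it can be read from Python. The snippet below is a minimal sketch: `load_api_key` is a hypothetical helper (it is not part of the News API or of any library), and the `env` parameter exists only so the function can be tried with a fake environment. It fails early with a readable message instead of silently sending an empty key.

```python
import os
from typing import Mapping, Optional


def load_api_key(name: str = "NEWS_API_KEY",
                 env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key stored under `name`, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return key


# Demonstration with a fake environment; call load_api_key() with no
# arguments to read the real one.
print(load_api_key(env={"NEWS_API_KEY": "demo-key"}))
```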

First, let's understand how the API works: https://newsapi.org/docs/get-started

Example response from the `sources` endpoint:
```JSON
{
    "status": "ok",
    "sources": [
        {
            "id": "abc-news",
            "name": "ABC News",
            "description": "Your trusted source for breaking news, analysis, exclusive interviews, headlines, and videos at ABCNews.com.",
            "url": "https://abcnews.go.com",
            "category": "general",
            "language": "en",
            "country": "us"
        },
        {
            "id": "abc-news-au",
            "name": "ABC News (AU)",
            "description": "Australia's most trusted source of local, national and world news. Comprehensive, independent, in-depth analysis, the latest business, sport, weather and more.",
            "url": "http://www.abc.net.au/news",
            "category": "general",
            "language": "en",
            "country": "au"
        },
        ...
    ]
}
```
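
To get a feel for this structure, here is a small sketch that walks a response of this shape and picks out source ids by country. The `sample_response` dictionary is a trimmed copy of the example above (descriptions and URLs left out), and `source_ids` is a hypothetical helper written for this note, not something the API provides.

```python
from typing import Any, Dict, List

# Trimmed copy of the example payload above.
sample_response: Dict[str, Any] = {
    "status": "ok",
    "sources": [
        {"id": "abc-news", "name": "ABC News", "category": "general",
         "language": "en", "country": "us"},
        {"id": "abc-news-au", "name": "ABC News (AU)", "category": "general",
         "language": "en", "country": "au"},
    ],
}


def source_ids(response: Dict[str, Any], country: str) -> List[str]:
    """Return the ids of all sources listed for the given country code."""
    if response.get("status") != "ok":
        return []
    return [s["id"] for s in response.get("sources", [])
            if s.get("country") == country]


print(source_ids(sample_response, "au"))
```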

---

## 2. Let's start defining our class
```Python
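class NewsConsumer:
    # Class-level constants, shared by every instance.
    NEWS_API_KEY_NAME = "NEWS_API_KEY"  # name of the environment variable holding the key
    BASE_URL = "https://newsapi.org/v2/everything?"

    def __init__(self):
        # Class attributes are reached through the class or the instance,
        # so no `global` statements are needed here.
        self.num_requests = 0  # requests made so far by this consumer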
```

---

### Questions to ask
- What is my API limit? Can I find it in the documentation?
- Should I establish a limit? How do I determine it? Can I pass the limit to the API?
- Is the data going to arrive as a flow, or in batches?

Let's also check: https://newsapi.org/pricing

Let's set up a limit of 50 pages.
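
One way to turn such a limit into code is sketched below. It assumes we know the total number of results for a query (the API reports a `totalResults` field in its responses) and want to cap the number of pages we ask for. The helper name `pages_to_fetch` and its `max_pages` parameter are made up for this example.

```python
import math


def pages_to_fetch(total_results: int, page_size: int = 100,
                   max_pages: int = 50) -> int:
    """Pages needed to cover the results, capped at max_pages."""
    if page_size <= 0 or max_pages <= 0:
        raise ValueError("page_size and max_pages must be positive")
    needed = math.ceil(max(total_results, 0) / page_size)
    return min(needed, max_pages)


print(pages_to_fetch(250))    # small result set: fewer pages than the cap
print(pages_to_fetch(73667))  # large result set: the cap applies
```

Keeping the cap in one place makes it easy to adjust once the plan's real limits are known.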

## 3. Make a request

```Python
import os
import urllib.parse as urlparse
from typing import Optional
from urllib.parse import urlencode

import requests


class NewsConsumer:

    NEWS_API_KEY_NAME = "NEWS_API_KEY"  # environment variable that holds the key
    BASE_URL = "https://newsapi.org/v2/everything?"
    REQUESTS_LIMIT = 100  # self-imposed cap on requests per consumer

    def __init__(self):
        self.num_requests = 0

    def makeRequest(self, q: str, page: int, language: str = "en",
                    page_size: int = 100) -> Optional[requests.Response]:
        # Stop once the self-imposed request budget is used up.
        if self.num_requests >= NewsConsumer.REQUESTS_LIMIT:
            return None
        assert page_size > 0, "page_size must be greater than 0"
        assert page > 0, "page must be greater than 0"
        # Split the base URL, merge our parameters into its query string, and rebuild it.
        url_parts = list(urlparse.urlparse(NewsConsumer.BASE_URL))
        query = dict(urlparse.parse_qsl(url_parts[4]))
        query.update({'q': q, 'language': language, 'pageSize': page_size,
                      'page': page, 'apiKey': os.getenv(NewsConsumer.NEWS_API_KEY_NAME)})
        url_parts[4] = urlencode(query)
        self.num_requests += 1  # count requests, not results
        return requests.get(urlparse.urlunparse(url_parts))
```

# Pickle
Python's `pickle` module serializes and de-serializes Python object structures. Most built-in Python objects (lists, dicts, and so on) can be pickled so that they can be saved to disk. Pickling converts the object into a byte stream before writing it to a file; that byte stream contains all the information necessary to reconstruct the object in another Python script.

[Examples](https://www.geeksforgeeks.org/understanding-python-pickling-example/)
```Python
import pickle

a = NewsConsumer()
test1 = a.makeRequest("vaccine", 1)
# "wb" opens the file for writing in binary mode, which pickle requires.
# More on modes: https://www.w3schools.com/python/ref_func_open.asp
with open("save.p", "wb") as f:
    pickle.dump(test1.json(), f)
```
```
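
Saving is only half of the story, so here is a minimal round-trip sketch that writes an object and reads it back. The `payload` dictionary is made-up data standing in for an API response, and the file goes into a temporary directory so the example does not overwrite anything.

```python
import os
import pickle
import tempfile

# Made-up data standing in for an API response.
payload = {"status": "ok", "totalResults": 1, "articles": [{"title": "example"}]}

path = os.path.join(tempfile.mkdtemp(), "save.p")

# Serialize: "wb" = write, binary mode.
with open(path, "wb") as fh:
    pickle.dump(payload, fh)

# De-serialize: "rb" = read, binary mode.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == payload)
```

Only load pickle files that come from a source you trust, because unpickling can execute arbitrary code.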

## 4. Schema Definition

- What information is relevant?
- Pre-processing vs. Raw Information
- What is the objective?
- What are the restrictions? Memory restrictions, Network restrictions, etc. 
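
As one possible answer to these questions, the sketch below keeps a handful of fields per article and drops the rest. The choice of fields, the `Article` class, and the `article_from_raw` helper are assumptions made for illustration; the right schema depends on the objective. The field names on the raw side (`source`, `author`, `title`, `description`, `url`, `publishedAt`) are the ones that appear in the API responses shown in these notes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Article:
    """A trimmed-down record: only the fields we decided are relevant."""
    source_name: str
    title: str
    url: str
    published_at: datetime
    author: Optional[str] = None
    description: Optional[str] = None


def article_from_raw(raw: Dict[str, Any]) -> Article:
    """Pre-process one raw article dict into the schema above."""
    # The trailing "Z" means UTC; rewrite it as an explicit offset so that
    # datetime.fromisoformat also accepts it on older Python versions.
    published = datetime.fromisoformat(raw["publishedAt"].replace("Z", "+00:00"))
    return Article(
        source_name=raw["source"]["name"],
        title=raw["title"],
        url=raw["url"],
        published_at=published,
        author=raw.get("author"),
        description=raw.get("description"),
    )


raw_example = {
    "source": {"id": None, "name": "Lifehacker.com"},
    "title": "What We Know About Allergic Reactions to the COVID Vaccines",
    "url": "https://vitals.lifehacker.com/what-we-know-about-allergic-reactions-to-the-covid-vacc-1845934680",
    "publishedAt": "2020-12-22T21:30:00Z",
}
print(article_from_raw(raw_example))
```

Storing the raw response as well (for example with pickle) keeps the option of changing the schema later without calling the API again.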

In [69]:
a = NewsConsumer()

In [75]:
test2 = a.makeRequest("vaccine", 2)

In [71]:
a.num_requests

1

In [72]:
test.json()

{'status': 'ok',
 'totalResults': 73667,
 'articles': [{'source': {'id': None, 'name': 'Lifehacker.com'},
   'author': 'Elizabeth Yuko on Vitals, shared by Elizabeth Yuko to Lifehacker',
   'title': 'What We Know About Allergic Reactions to the COVID Vaccines',
   'description': 'Amid last week’s jubilant news segments featuring frontline healthcare workers getting their first dose of Pfizer’s COVID-19 vaccine came reports that put a damper on the excitement: that some of the vaccine recipients experienced severe allergic reactions fo…',
   'url': 'https://vitals.lifehacker.com/what-we-know-about-allergic-reactions-to-the-covid-vacc-1845934680',
   'urlToImage': 'https://i.kinja-img.com/gawker-media/image/upload/c_fill,f_auto,fl_progressive,g_center,h_675,pg_1,q_80,w_1200/g9u5kdn7n5iacterquch.jpg',
   'publishedAt': '2020-12-22T21:30:00Z',
   'content': 'Amid last weeks jubilant news segments featuring frontline healthcare workers getting their first dose of Pfizers COVID-19 vaccine ca

In [76]:
test2.json()

{'status': 'error',
 'code': 'maximumResultsReached',
 'message': 'You have requested too many results. Developer accounts are limited to a max of 100 results. You are trying to request results 100 to 200. Please upgrade to a paid plan if you need more results.'}
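
The error response above still arrives as a normal JSON body, so the `status` field has to be checked before the articles are used. A minimal sketch of that check follows; `NewsApiError` and `extract_articles` are hypothetical names introduced for this example, and the error payload is a shortened copy of the one above.

```python
from typing import Any, Dict, List


class NewsApiError(Exception):
    """Raised when the payload reports status 'error' (hypothetical helper)."""


def extract_articles(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the article list, or raise with the API's own code and message."""
    if payload.get("status") != "ok":
        code = payload.get("code", "unknown")
        message = payload.get("message", "no message provided")
        raise NewsApiError(f"{code}: {message}")
    return payload.get("articles", [])


error_payload = {
    "status": "error",
    "code": "maximumResultsReached",
    "message": "You have requested too many results.",
}

try:
    extract_articles(error_payload)
except NewsApiError as exc:
    # A natural place to stop paginating instead of saving an error to disk.
    print(f"Stopping pagination: {exc}")
```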

In [77]:
import pickle
pickle.dump( test.json(), open( "save.p", "wb" ) )

In [5]:
import psycopg2

ModuleNotFoundError: No module named 'psycopg2'