## Chapter 9: Getting Data

- These notes are brief because scraping is mostly a matter of reading library docs rather than learning concepts

#### stdin and stdout
- Can pipe data through the command line by reading from sys.stdin and writing to sys.stdout
- Personal note: use regex to pull information out of files while processing them
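
A minimal sketch of the piping idea above: a filter that reads lines from one stream and writes the ones matching a regex to another. The function name `grep_lines` and the script name in the usage comment are made up for illustration.

```python
import re
import sys
from typing import Iterable, TextIO


def grep_lines(lines: Iterable[str], pattern: str, out: TextIO) -> int:
    """Write every line that matches `pattern` to `out`; return the match count."""
    regex = re.compile(pattern)
    count = 0
    for line in lines:
        if regex.search(line):
            out.write(line)
            count += 1
    return count


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: cat some_file.txt | python line_filter.py "[0-9]"
    grep_lines(sys.stdin, sys.argv[1], sys.stdout)
```

Because the function takes any iterable of lines and any writable stream, the same code works with sys.stdin/sys.stdout on the command line and with plain lists or in-memory buffers elsewhere.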

#### Reading Files
##### Text Files
- The built-in open function returns a file object for reading or writing
- Files opened this way must be closed explicitly when you are done with them
```python
file = open('ex.txt','r')
file.close()       
```
- 'r' (read) is the default mode, so it can be omitted; 'w' (write) creates the file and overwrites it if it already exists
- 'a' opens the file for appending and creates it if it does not exist yet
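
To make the three modes concrete, here is a small sketch that writes to a throwaway file (the temporary directory is an assumption for the demo, not part of the notes) and shows what survives each open call.

```python
import os
import tempfile

# Work in a throwaway directory so no real file is touched.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "ex.txt")

f = open(path, "w")   # 'w' creates the file, or truncates it if it exists
f.write("first\n")
f.close()

f = open(path, "w")   # opening with 'w' again discards "first"
f.write("second\n")
f.close()

f = open(path, "a")   # 'a' keeps the existing content and writes at the end
f.write("third\n")
f.close()

f = open(path)        # no mode given, so 'r' is assumed
contents = f.read()
f.close()

print(repr(contents))  # 'second\nthird\n'
```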

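- Alternative to closing by hand is a with block, which closes the file automatically when the block ends
```python
# filename and extract_function are placeholders
with open(filename) as f:
    data = extract_function(f)
# f is already closed here, so process data, not f
```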
- Can also use a for loop to go through a text file line by line
- Regex, defaultdict and dict also help to collect information about the data
- .strip() is useful for removing the trailing \n when reading text files
- Example: a .txt file full of email addresses, from which we want to build a histogram of the domains

```python
import os
from collections import Counter

def get_domain(mail: str) -> str:
    # split the string at "@" and take the last piece
    return mail.lower().split("@")[-1]

assert get_domain('banana@gmail.com') == 'gmail.com'

# build the path with os.path.join rather than a hard-coded separator
path = os.path.join(os.getcwd(), 'email_adresses.txt')

with open(path, 'r') as f:
    domain_counts = Counter(get_domain(line.strip())
                            for line in f
                            if "@" in line)

print(f"{domain_counts}")

# without .strip(), every line keeps its trailing "\n"
with open(path, 'r') as f:
    z = [l for l in f]
print(z)
```

```
Counter({'gmail.com': 1, 'buycat.com': 1, 'buydog.com': 1, 'pmail.com': 1})
['bob@gmail.com\n', 'chester@buycat.com\n', 'manny@buydog.com\n', 'bobby@pmail.com\n']
```


- Skipping delimited files because pandas handles them

#### Scraping The Web
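- Beautiful Soup for parsing HTML
- Requests library for HTTP requests
- Beautiful Soup takes the HTML as a string, not a link: fetch the page first (e.g. with requests), pass the text to BeautifulSoup, then query the parsed tree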

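```python
from bs4 import BeautifulSoup
import requests

url = "https://sekiroshadowsdietwice.wiki.fextralife.com/Items"
html = requests.get(url).text            # raw HTML of the page as a string
soup = BeautifulSoup(html, 'html5lib')   # the html5lib parser must be installed separately
```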
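- Use soup.find('tag_name') to get the first matching tag
- Extract its text: soup.find('tag_name').text
- Get attributes such as id or class: soup.p['id'] (raises KeyError if the attribute is missing, while soup.p.get('id') returns None)
- Find all matching tags: soup.find_all('tag_name'), or the shorthand soup('tag_name')
- Keep only tags that have a given attribute, e.g. paragraphs with an id: [p for p in soup('p') if p.get('id')]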

```python
# class in specific tags
important_paragraphs = soup('p', {'class' : 'important'})
important_paragraphs2 = soup('p', 'important')
important_paragraphs3 = [p for p in soup('p')
                         if 'important' in p.get('class', [])]

# tag in tag with nested for
spans_inside_divs = [span
    for div in soup('div') 
    for span in div('span')]
```
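
The calls above can be tried without any network access. The sketch below runs them on a small made-up HTML string. It uses the 'html.parser' backend from the standard library instead of 'html5lib' purely to avoid an extra install; that swap is an assumption of this example.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only
html = """
<html><body>
  <div><p id="intro" class="important">Hello <span>world</span></p></div>
  <p>Plain paragraph</p>
  <div><span>second span</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")               # first <p> in the document
first_p_text = first_p.text            # text of the tag and its children
first_p_id = soup.p["id"]              # attribute access on the first <p>
all_p = soup.find_all("p")             # same result as soup("p")
with_id = [p for p in soup("p") if p.get("id")]
important = soup("p", "important")     # <p> tags with class "important"
spans_in_divs = [span
                 for div in soup("div")
                 for span in div("span")]

print(first_p_text)                    # Hello world
print([s.text for s in spans_in_divs]) # ['world', 'second span']
```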



```python
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://sekiroshadowsdietwice.wiki.fextralife.com/Items"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')
table = soup.find("tbody")
# text of every table row, one string per <tr>
table_data = [row.text for row in table("tr")]

# we want to store data as a dict where keys are the items and the values are effects
def get_details(string):
    # keep only the lines that contain at least one letter or digit
    return [item.strip() for item in string.split("\n") if re.search('[0-9a-zA-Z]', item)]

final_data = {}
for obj in table_data:
    # assumes every row yields exactly two non-empty pieces: name and effect
    name, effect = get_details(obj)
    final_data[name] = effect

# the dict is awkward to read, so move it into a pandas DataFrame
structured_data = {"Item": list(final_data.keys()), "Description": list(final_data.values())}

df = pd.DataFrame(structured_data)
print(df.head())
```

```
               Item                                        Description
0  Ako's Spiritfall  Increases Vitality and Posture damage for a ti...
1       Ako's Sugar  Temporarily boosts Vitality and Posture damage...
2   Antidote Powder  Heals the status abnormality "Poison", reduces...
3        Bell Demon  Possessing this item increases enemy difficult...
4         Bite Down  The user dies instantly, but can resurrect if ...
```


<b>The Congress scraping example is very good</b>
- Bottom line: regex, list filtering and case-insensitive keyword matching (keyword.lower()) are your best friends here.
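
A rough sketch of those three tools working together. The URLs, the paragraphs, the regex and the helper name `mentions` are all made up for illustration and are not taken from these notes.

```python
import re
from typing import List

# Hypothetical scraped links
links: List[str] = [
    "https://example.house.gov",
    "http://example.house.gov/",
    "https://example.house.gov/press-releases",
    "https://www.example.com",
]

# Regex + list filtering: keep only top-level .house.gov pages
pattern = re.compile(r"^https?://.*\.house\.gov/?$")
good_links = [link for link in links if pattern.match(link)]


def mentions(text: str, keyword: str) -> bool:
    """Case-insensitive keyword match."""
    return keyword.lower() in text.lower()


# Hypothetical paragraph texts pulled from a page
paragraphs = ["New DATA privacy bill introduced", "Town hall scheduled for Friday"]
matching = [p for p in paragraphs if mentions(p, "data")]

print(good_links)  # ['https://example.house.gov', 'http://example.house.gov/']
print(matching)    # ['New DATA privacy bill introduced']
```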

#### API
- API: Application Programming Interface

##### JSON & XML

###### JSON
- HTTP is a protocol for transferring text, so requested data has to be serialized into a string format; JSON is a common format for this
- Example JSON (looks very similar to a Python dict)

```python
{ "title" : "Data Science Book",
"author" : "Joel Grus",
"publicationYear" : 2019,
"topics" : [ "data", "science", "data science"] }
```
- A JSON object maps easily onto a dict: we just need to deserialize the JSON string into a Python object

```python
import json
deserialized = json.loads(serialized)

# serialized is the JSON string, e.g. the text of an HTTP response
```
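
As a self-contained version of the step above, this sketch deserializes the example JSON from this section and then serializes it back with json.dumps. Here the string is written inline, whereas in practice it would come from a response.

```python
import json

# The example JSON from above, as a string
serialized = """{ "title" : "Data Science Book",
                  "author" : "Joel Grus",
                  "publicationYear" : 2019,
                  "topics" : [ "data", "science", "data science"] }"""

deserialized = json.loads(serialized)        # JSON string -> Python dict

print(type(deserialized).__name__)           # dict
print(deserialized["publicationYear"])       # 2019
print(deserialized["topics"][-1])            # data science

round_trip = json.dumps(deserialized)        # Python dict -> JSON string
```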

###### XML
- XML is structured like HTML, so Beautiful Soup can be used to get data out of it as well

#### No to API
- Each API's docs cover everything needed to use it