# Getting data

## *Processing pipelines* with `stdin` and `stdout`

See folder `Jupyter\stdin and stdout` for the files.

Some notes:
- This is the simplest way of transforming data: a script reads from a file (text, PDF, HTML), filters it, sorts it, or processes it in some other way, and writes the result out again without having to store the entire thing in a temporary file or in memory. It just reads line by line.
  
- After copying the two Python files into a folder, run `cd "Jupyter\stdin and stdout"` (quoting the path is the safe choice because it contains spaces).
- `type SomeFile.txt | python egrep.py "[0-9]" | python line_count.py`
  - This prints `SomeFile.txt` and pipes it to `egrep.py`, which outputs every line containing a digit; those lines are then piped to `line_count.py`.
- `type the_bible.txt | python most_common_words.py 10`
- add `> output.txt` to the end of the command to write the output to a file
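
The actual `egrep.py` and `line_count.py` live in the folder mentioned above and are not reproduced here. As a rough sketch of the idea only, a line filter and a line counter might look like the following. The function names `filter_lines` and `count_lines` are hypothetical, and `io.StringIO` stands in for `sys.stdin` so the example runs without a pipe.

```python
import io
import re
from typing import Iterable, Iterator


def filter_lines(lines: Iterable[str], regex: str) -> Iterator[str]:
    """Yield only the lines that contain a match for regex (hypothetical helper)."""
    pattern = re.compile(regex)
    for line in lines:  # one line at a time; nothing is buffered
        if pattern.search(line):
            yield line


def count_lines(lines: Iterable[str]) -> int:
    """Count lines without keeping them in memory (hypothetical helper)."""
    return sum(1 for _ in lines)


# io.StringIO stands in for sys.stdin; a real script would pass sys.stdin instead.
fake_stdin = io.StringIO("no digits here\nroom 101\n42 is the answer\n")
numbered = count_lines(filter_lines(fake_stdin, "[0-9]"))
print(numbered)  # 2
```

Because `filter_lines` is a generator, chaining it into `count_lines` mirrors the shell pipe: each line flows through both steps before the next one is read.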

## Reading and writing txt files

In [2]:
# Just stick some data there
with open('email_addresses.txt', 'w') as f:
    f.write("joelgrus@gmail.com\n")
    f.write("joel@m.datasciencester.com\n")
    f.write("joelgrus@m.datasciencester.com\n")

def get_domain(email_address: str) -> str:
    """Split on '@' and return the last piece"""
    return email_address.lower().split("@")[-1]

# a couple of tests
assert get_domain('joelgrus@gmail.com') == 'gmail.com'
assert get_domain('joel@m.datasciencester.com') == 'm.datasciencester.com'

# to check how many domains we now have
from collections import Counter

with open('email_addresses.txt', 'r') as f:
    domain_counts = Counter(get_domain(line.strip())
                            for line in f
                            if "@" in line)
domain_counts

Counter({'gmail.com': 1, 'm.datasciencester.com': 2})

In text files, you can use a for loop (`for line in f`) to read the file one line at a time and look for matches with `re.match()` or `re.search()`. `re.match()` only checks at the beginning of the line, while `re.search()` scans the whole line.

First compile the pattern with `pattern = re.compile(r"regex")`, then call `pattern.match(line)` or `pattern.search(line)` on each line.
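
A minimal sketch of that loop, with a made-up list of lines standing in for an open file (iterating over `f` yields lines in the same way). The sample lines and the date pattern are assumptions chosen for illustration; the point is that `match` only succeeds at the start of a line, while `search` finds the pattern anywhere in it.

```python
import re

# Hypothetical sample lines; in practice these would come from `for line in f`.
lines = [
    "2014-06-20 AAPL closed at 90.91\n",
    "AAPL opened on 2014-06-19\n",
    "no date on this line\n",
]

pattern = re.compile(r"\d{4}-\d{2}-\d{2}")  # an ISO-style date

# match: the date must be at the very start of the line
starts_with_date = [line for line in lines if pattern.match(line)]
# search: the date may appear anywhere in the line
contains_date = [line for line in lines if pattern.search(line)]

print(len(starts_with_date))  # 1
print(len(contains_date))     # 2
```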

## Metacharacters in regex

`regex` uses metacharacters to match patterns:
1. Dot `.` matches any character except a newline. `.{3}` matches any 3 such characters.

1. `^`: The caret symbol indicates that the pattern should match at the beginning of a string (or a line in multiline mode). For example, the regex `^hello` would match any string that starts with "hello".
   1. It's also worth noting that inside square brackets (`[]`), the caret symbol has a different meaning: placed first, it negates the character set. For example, `[^a-z]` matches any character that is NOT a lowercase letter.

2. `$`: The dollar symbol indicates that the pattern should match at the end of a string (or a line in multiline mode). For example, the regex `world$` would match any string that ends with "world".
   
3. `[]`: Square brackets are used to indicate a set of characters. In a regex pattern, a dash (`-`) inside a character set specifies a range of characters. For example, the pattern `[a-z]` matches any single lowercase letter, `[a-zA-Z]` any single letter, and `[0-9]` any single digit. You can also combine ranges and individual characters in a single character set, so `[a-z0-9_]` matches a single lowercase letter, digit, or underscore. `[a-z]{3}` matches 3 lowercase letters.

4. `\d`: The `\d` shorthand character class matches any single digit (equivalent to `[0-9]` for ASCII text). For example, the pattern `\d\d:\d\d:\d\d` will match a string of the form "09:30:00", but not "9:30:00" (missing leading zero). `\d{3}` matches 3 digits.
   

5. `\b`: The word boundary metacharacter is used to specify that the pattern should match at the beginning or end of a word. A word boundary occurs where a word character (usually an alphanumeric character or an underscore) is adjacent to a non-word character. For example, the regex `\bword\b` would match any string containing the word "word" surrounded by whitespace, punctuation, or the start/end of the string.

6. `?`: The question mark symbol is a quantifier that indicates that the preceding element (character, group, or character class) is optional, meaning it can appear either 0 or 1 times. For example, the regex pattern `colou?r` will match both "color" and "colour".

7. `*`: The asterisk symbol is also a quantifier, and it means that the preceding element can appear 0 or more times. For example, the regex pattern `ab*c` will match "ac", "abc", "abbc", "abbbc", and so on.

8. `\`: The backslash symbol is an escape character, which means the following character is treated literally rather than as a special regex symbol. For example, to match a period (.) in a text, use the pattern `\.` because a plain period has a special meaning in regex (it matches any single character).

9. `\.`: As noted above, this pattern matches a literal period (.) in the text. The backslash escapes the period, so it loses its special meaning (matching any character). For example, the regex pattern `a\.b` will match the string "a.b", but not "acb" or "a5b".

10. `\w+`: `\w` stands for "word character". It matches any alphanumeric character (letters and digits) and the underscore (_), which for ASCII text is equivalent to the class `[a-zA-Z0-9_]`. The `+` is a quantifier meaning "one or more" of the preceding element. So `\w+` matches a run of one or more word characters, which makes it a common pattern for picking out whole words, including those with digits or underscores in them (it stops at punctuation and spaces). For example, in the string "Hello, World_123!", `\w+` would match "Hello" and "World_123".
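
The metacharacters above can be checked directly with a few assertions. This is only a sketch: the test strings are made up for illustration, and `re.fullmatch` is used where the whole string should fit the pattern.

```python
import re

# Anchors: ^ at the start, $ at the end
assert re.search(r"^hello", "hello world")
assert not re.search(r"^hello", "say hello")
assert re.search(r"world$", "hello world")

# Character sets, negated sets, and counted repetition
assert re.findall(r"[a-z]{3}", "abc DEF ghi") == ["abc", "ghi"]
assert re.findall(r"[^a-z ]", "ab C1") == ["C", "1"]

# \d shorthand: every field needs exactly two digits
assert re.fullmatch(r"\d\d:\d\d:\d\d", "09:30:00")
assert not re.fullmatch(r"\d\d:\d\d:\d\d", "9:30:00")

# Word boundary, optional element, zero-or-more
assert re.search(r"\bword\b", "a word, here")
assert not re.search(r"\bword\b", "swordfish")
assert re.fullmatch(r"colou?r", "color") and re.fullmatch(r"colou?r", "colour")
assert re.fullmatch(r"ab*c", "ac") and re.fullmatch(r"ab*c", "abbbc")

# Escaped dot versus plain dot
assert re.fullmatch(r"a\.b", "a.b")
assert not re.fullmatch(r"a\.b", "acb")
assert re.fullmatch(r"a.b", "acb")

# \w+ picks out runs of word characters
words = re.findall(r"\w+", "Hello, World_123!")
print(words)  # ['Hello', 'World_123']
```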

Example:
```python
# http:// or https:// (the ? makes the s optional), then any characters, then .house.gov with an optional trailing /
regex = r"^https?://.*\.house\.gov/?$"
```

Example: 
```python
import re

regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
match = re.search(regex, llm_output)  # llm_output: the text to search, defined elsewhere
```
- `\s*` matches whitespace, 0 or more characters
- `\d*` matches digits, 0 or more
- `()` is a capturing group. This lets you retrieve the captured text with `match.group(1)`
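
To see the groups in action, here is a small sketch that applies the same regex to a made-up string. The value of `llm_output` is hypothetical, chosen only to fit the shape the pattern expects. Note that `group(1)` keeps the space after the colon, hence the `.strip()`.

```python
import re

regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"

# Hypothetical text in the shape the pattern expects
llm_output = "Action: Search\nAction Input: weather in Paris"

match = re.search(regex, llm_output)
if match:
    action = match.group(1).strip()        # text after "Action:"
    action_input = match.group(2).strip()  # text after "Action Input:"
    print(action, "|", action_input)       # Search | weather in Paris
```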

In [8]:
import re

re.search('^i', 'it is the music')

In [120]:
# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
regex = r"^https?://.*\.house\.gov/?$"

# Let's write some tests!
assert re.match(regex, "http://joel.house.gov")
assert re.match(regex, "https://joel.house.gov")
assert re.match(regex, "http://joel.house.gov/")
assert re.match(regex, "https://joel.house.gov/")
assert not re.match(regex, "joel.house.gov")
assert not re.match(regex, "http://joel.house.com")
assert not re.match(regex, "https://joel.house.gov/biography")

In [7]:
import re

pattern = r'\d{3}'  # Matches three digits
text = "There are 123 apples and 456 oranges."

srch = re.findall(pattern, text)

if srch:
    print("Match found:", srch)
else:
    print("No match found.")

Match found: ['123', '456']


In [8]:
import re

pattern = r'\d{3}'  # Matches three digits
text = "There are 123 apples and 456 oranges."

srch = re.search(pattern, text)

if srch:
    print("Match found:", srch.group())
else:
    print("No match found.")

Match found: 123


## CSV

Note `csv.reader(f)`, which yields each row as a list of strings.

In [13]:
with open('tab_delimited_stock_prices.txt', 'w') as f:
    f.write("""6/20/2014\tAAPL\t90.91
6/20/2014\tMSFT\t41.68
6/20/2014\tFB\t64.5
6/19/2014\tAAPL\t91.86
6/19/2014\tMSFT\t41.51
6/19/2014\tFB\t64.34
""")

def process(date: str, symbol: str, closing_price: float) -> None:
    # Imagine that this function actually does something.
    assert closing_price > 0.0


# this reads and "interprets" a file, allowing processing
import csv

with open('tab_delimited_stock_prices.txt') as f:
    tab_reader = csv.reader(f, delimiter='\t')
    for row in tab_reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        process(date, symbol, closing_price)
        print(date, symbol, closing_price)

# which is good for further processing in the current format

6/20/2014 AAPL 90.91
6/20/2014 MSFT 41.68
6/20/2014 FB 64.5
6/19/2014 AAPL 91.86
6/19/2014 MSFT 41.51
6/19/2014 FB 64.34


Note the `csv.DictReader()` class, which uses the header row as dict keys.

In [15]:
with open('colon_delimited_stock_prices.txt', 'w') as f:
    f.write("""date:symbol:closing_price
6/20/2014:AAPL:90.91
6/20/2014:MSFT:41.68
6/20/2014:FB:64.5
""")

# this organises it in a dict
with open('colon_delimited_stock_prices.txt') as f:
    colon_reader = csv.DictReader(f, delimiter=':')
    for dict_row in colon_reader:
        date = dict_row["date"]
        symbol = dict_row["symbol"]
        closing_price = float(dict_row["closing_price"])
        process(date, symbol, closing_price)
        print(dict_row)

{'date': '6/20/2014', 'symbol': 'AAPL', 'closing_price': '90.91'}
{'date': '6/20/2014', 'symbol': 'MSFT', 'closing_price': '41.68'}
{'date': '6/20/2014', 'symbol': 'FB', 'closing_price': '64.5'}


note `.writerow()` method

In [16]:
todays_prices = {'AAPL': 90.91, 'MSFT': 41.68, 'FB': 64.5 }

# this takes the dict and writes it to a new csv file
# (newline='' is recommended by the csv docs; without it, Windows can add blank rows)
with open('comma_delimited_stock_prices.txt', 'w', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',')
    for stock, price in todays_prices.items():
        csv_writer.writerow([stock, price])
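
As a quick sanity check, the same rows can be written to an in-memory buffer and read back with `csv.reader`. This is only a sketch: `io.StringIO` stands in for the file, and `buffer` and `read_back` are names introduced here for illustration. The prices come back as strings and have to be converted to `float` again.

```python
import csv
import io

todays_prices = {'AAPL': 90.91, 'MSFT': 41.68, 'FB': 64.5}

buffer = io.StringIO()  # in-memory stand-in for the output file
csv_writer = csv.writer(buffer, delimiter=',')
for stock, price in todays_prices.items():
    csv_writer.writerow([stock, price])

buffer.seek(0)  # rewind so the reader starts at the first row
read_back = {symbol: float(price) for symbol, price in csv.reader(buffer)}
print(read_back == todays_prices)  # True
```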

In [None]:
import string, random

length = 5

characters = string.ascii_letters + string.digits + string.punctuation

''.join(random.choice(characters) for _ in range(length))