In [64]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import gzip
import json
from nltk.corpus import words
from tqdm import tqdm
%matplotlib inline
import string

# DS198-003: Tech Foundations - Debugging

In this demo, we will explore useful ways to debug code using Python and Jupyter Notebooks! This demo was created by [Kevin Miao](mailto:kevinmiao@cs.berkeley.edu) for UC Berkeley's DS198-003 (Spring '22).

### Data Loading

In [44]:
!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Musical_Instruments_5.json.gz
    
with gzip.open("reviews_Musical_Instruments_5.json.gz", "rb") as f:
    # The file holds one JSON object per line, so parse it line by line
    data = [json.loads(line) for line in f]

--2022-03-14 21:18:37--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Musical_Instruments_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2460495 (2.3M) [application/x-gzip]
Saving to: ‘reviews_Musical_Instruments_5.json.gz.1’


2022-03-14 21:18:38 (2.18 MB/s) - ‘reviews_Musical_Instruments_5.json.gz.1’ saved [2460495/2460495]



Here we load the Amazon reviews from the dataset above. An example review is shown below. Each review includes the overall score, a summary, the review time, the review text, and the name of the reviewer.

In [45]:
data[0]

{'reviewerID': 'A2IBPI20UZIR0U',
 'asin': '1384719342',
 'reviewerName': 'cassandra tu "Yeah, well, that\'s just like, u...',
 'helpful': [0, 0],
 'reviewText': "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,",
 'overall': 5.0,
 'summary': 'good',
 'unixReviewTime': 1393545600,
 'reviewTime': '02 28, 2014'}

We also load a **dictionary**: a list of English words from NLTK's `words` corpus!

In [46]:
word_list = words.words()
word_list[:10]

['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron']

In **natural language processing**, we map each word of a text to its index in a dictionary that contains all words. For instance, 'hello world' would become `[index of hello, index of world]`. Let's write a function that does that! Below is also a list of punctuation marks that might come in handy later. ;)

In [66]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

### Demo

In [56]:
sample_sentence = "hello I love data science"
split_sentence = sample_sentence.split()
print(split_sentence)
tokens = []
for i, word in enumerate(split_sentence):
    token = word_list.index(word)
    tokens.append(token)
    print(f"{word} - {token}")

['hello', 'I', 'love', 'data', 'science']
hello - 83713
I - 90447
love - 108498
data - 48777
science - 175096


In [63]:
def tokenize(sentence, dictionary):
    """
    Arguments:
        - sentence (str): a sentence of words
        - dictionary List[str]: a list of all English words
    Returns:
        - tokenization List[int]: a list of all indices of the dictionary for the words in our sentence
    """
    split_sentence = sentence.split()
    tokens = []
    try:
        for i, word in enumerate(split_sentence):
            token = dictionary.index(word)  # use the argument, not the global word_list
            tokens.append(token)
            print(f"{word} - {token}")
        return tokens
    except:
        raise Exception('Code failed')

sample_sentence = data[0]['reviewText']
print(f"Example Sentence: {sample_sentence}")
tokenization = tokenize(sample_sentence, word_list)
print(f"Tokenization: {tokenization}")

Example Sentence: Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,


Exception: Code failed

### Print Statements
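
The simplest way to locate a failure like the one above is to print what the code is doing just before it breaks. The cell below is a minimal sketch, not the course's reference solution: `tokenize_verbose` and `toy_dictionary` are hypothetical names, and the tiny word list stands in for the full NLTK dictionary so the example runs on its own.

```python
def tokenize_verbose(sentence, dictionary):
    """Look up each word's index, printing progress so a failure is easy to locate."""
    tokens = []
    for position, word in enumerate(sentence.split()):
        print(f"[{position}] looking up {word!r}")
        if word not in dictionary:
            # Report the offending word instead of crashing with a vague error
            print(f"    -> {word!r} is NOT in the dictionary")
            tokens.append(None)
            continue
        tokens.append(dictionary.index(word))
    return tokens


# Hypothetical toy dictionary: lowercase words only, no punctuation
toy_dictionary = ["about", "here", "much", "not", "to", "write"]
result = tokenize_verbose("Not much to write about here,", toy_dictionary)
print(result)
```

In this toy setup the prints point at two suspects: the capitalized `'Not'` and `'here,'` with its trailing comma. Neither matches an entry exactly, which hints at why lowercasing and stripping `string.punctuation` might help.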

### Debugger `pdb`

The basics:
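- `pdb.set_trace()`: Set a breakpoint
- `n` (next): Execute the current line and move to the next one
- `c` (continue): Resume running the code until the next breakpoint
- `r` (return): Run until the current function returns
- `q` (quit): Exit the debugger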

In [67]:
import pdb
pdb.set_trace()
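
A breakpoint is usually most helpful right where things go wrong. The sketch below is one possible pattern, under the assumption that you only want to stop when a lookup fails: `find_index` and the `DEBUG` flag are hypothetical names introduced for illustration. With `DEBUG = False` the cell runs straight through; setting it to `True` opens the debugger at the failing word, where you can inspect `word` and step with the commands listed above.

```python
import pdb

DEBUG = False  # hypothetical switch: set to True to stop at the failing lookup


def find_index(word, dictionary):
    """Return the index of word in dictionary, or None if it is missing."""
    try:
        return dictionary.index(word)
    except ValueError:
        # list.index raises ValueError when the word is not present
        if DEBUG:
            pdb.set_trace()  # inspect `word` here, then type `c` or `q`
        return None


print(find_index("pop", ["filters", "pop"]))
print(find_index("here,", ["here"]))
```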

## Type Hinting

Write the `map` function. It takes in a list or an array, a tuple that indicates the range of indices to use, and a function to apply to that subsequence of the iterable. It returns a final number.

In [24]:
input_array = [1.7,4,56.33,7.2,7.0,8,9.12,10,-2, 5, 7]
input_range = (3, 6)
input_function = lambda x: abs(x**2 + x*2)

###### START #######

def map():
    """
    """
    raise NotImplementedError()

###### END ########
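
Type hints record what a function expects and what it returns, which can make mismatched arguments easier to spot while debugging. As a minimal sketch that leaves the exercise above unsolved, here is how the earlier tokenizer's signature might be annotated. `tokenize_typed` is a hypothetical name, and returning `None` for unknown words is an assumption made for this example. Python does not enforce hints at runtime; they serve readers, editors, and external type checkers.

```python
from typing import List, Optional


def tokenize_typed(sentence: str, dictionary: List[str]) -> List[Optional[int]]:
    """Map each word to its dictionary index, or None when the word is unknown."""
    tokens: List[Optional[int]] = []
    for word in sentence.split():
        # Assumption for this sketch: unknown words become None instead of raising
        tokens.append(dictionary.index(word) if word in dictionary else None)
    return tokens


print(tokenize_typed("i love data", ["data", "i", "love"]))
# The annotations are stored on the function and can be inspected directly
print(tokenize_typed.__annotations__)
```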