### Initial experimentation using Hacker News API and returned data

##### official docs -> https://github.com/HackerNews/API
##### .raml -> https://api.stoplight.io/v1/versions/DaBbQv9WoET786zHn/export/raml.yaml

In [12]:
```python
import pandas as pd
import requests

hnapi_base = 'https://hacker-news.firebaseio.com/v0/'
latest = hnapi_base + 'maxitem.json'

def hit(endpoint):
    """GET an endpoint and return the decoded JSON body."""
    return requests.get(endpoint).json()

limit = 10000

# walk back `limit` ids from the newest item id
max_id = hit(latest)
min_id = max_id - limit

batch = range(min_id, max_id + 1)
comments = []

for item in batch:
    response = hit(hnapi_base + f'item/{item}.json')

    # the API returns null for ids that do not resolve to an item
    if response is None:
        continue

    # keep comments only; skip every other item type
    if response.get('type') != 'comment':
        continue

    comments.append(response)

df = pd.DataFrame(comments)
```
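
The loop above keeps every item whose `type` is `comment`, including deleted ones. As a minimal sketch of a stricter filter, assuming deleted and dead comments are not wanted, the check can be pulled into a small function that runs without the network. The names `is_live_comment` and `collect_comments` and the `fake_items` data are hypothetical; `deleted` and `dead` are item fields described in the official API docs.

```python
def is_live_comment(item):
    """Return True for a comment that is neither deleted nor dead."""
    if not item:  # the API returns null (None) for ids with no item
        return False
    if item.get("type") != "comment":
        return False
    return not (item.get("deleted") or item.get("dead"))


def collect_comments(ids, fetch):
    """Fetch each id with `fetch` and keep only live comments."""
    return [item for item in map(fetch, ids) if is_live_comment(item)]


# Made-up items standing in for API responses so the sketch runs offline.
fake_items = {
    1: {"id": 1, "type": "comment", "by": "alice", "text": "hi"},
    2: {"id": 2, "type": "story", "by": "bob", "title": "a story"},
    3: {"id": 3, "type": "comment", "deleted": True},
    4: None,
}

live = collect_comments(fake_items, fake_items.get)
print([item["id"] for item in live])
```

In the notebook, `fetch` could be something like `lambda i: hit(hnapi_base + f'item/{i}.json')`, which keeps the filtering logic separate from the HTTP call.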

In [13]:
df = pd.read_csv('../data/raw/minimal.csv')
df.head(3)

Unnamed: 0,id,by,time,parent,text,kids,deleted
0,22445632,Apocryphon,1582918645,22443772,Andrew Walz&#x2F;Deez Nuts unity ticket 2020,,
1,22445633,Animats,1582918658,22440816,Those are called &quot;multi-chip modules&quot...,,
2,22445634,munk-a,1582918668,22445244,I also generally think that writing boilerplat...,,


In [14]:
df.text[:10]

0         Andrew Walz&#x2F;Deez Nuts unity ticket 2020
1    Those are called &quot;multi-chip modules&quot...
2    I also generally think that writing boilerplat...
3    Just wanted to say this is what I was looking ...
4    This is the article Day 4. For a bit of contex...
5    Reminds me of when Microsoft took over.  Many ...
6    &gt; Try the same thing on Mission, Howard, Fo...
7    I remember one looking through the rule book f...
8    As a LastPass user, my main defense against th...
9    I think my opinion is biased because I know Re...
Name: text, dtype: object

In [15]:
df.by.value_counts()

anonsivalley652    37
saagarjha          36
DoreenMichele      30
dang               28
kick               28
                   ..
yoaviram            1
saaaaaam            1
davidwihl           1
yashap              1
dna_polymerase      1
Name: by, Length: 3976, dtype: int64

In [16]:
# sanity-check that the unix timestamps convert to sensible dates

print(pd.to_datetime(df.time.min(), unit='s'))
print(pd.to_datetime(df.time.max(), unit='s'))

2020-02-28 19:37:25
2020-03-01 07:23:24
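
`pd.to_datetime(..., unit='s')` returns timezone-naive timestamps counted from the Unix epoch, so the two values printed above should be read as UTC. A quick cross-check with the standard library, using the `time` value of the first row shown earlier (the variable names are illustrative only):

```python
from datetime import datetime, timezone

first_row_time = 1582918645  # `time` of id 22445632 in df.head(3)

# Interpret the integer as seconds since the Unix epoch, in UTC.
as_utc = datetime.fromtimestamp(first_row_time, tz=timezone.utc)
print(as_utc.isoformat())  # 2020-02-28T19:37:25+00:00
```

This matches the minimum printed by the pandas call, which suggests the column needs no unit or offset correction.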


In [17]:
df.dtypes

id          int64
by         object
time        int64
parent      int64
text       object
kids       object
deleted    object
dtype: object

In [18]:
df.isnull().sum()

id            0
by          251
time          0
parent        0
text        251
kids       5109
deleted    8571
dtype: int64

### Deleted posts retain some metadata but nothing useful to us at this point

In [19]:
deleted = df.loc[df['deleted'].notnull(), ['deleted', 'by']]

In [20]:
# drop the flag column only if every deleted row has also lost its author
if deleted['by'].isnull().all():
    df.drop(columns='deleted', inplace=True)

In [21]:
df.loc[df['by'].isnull()]

Unnamed: 0,id,by,time,parent,text,kids
29,22445669,,1582918891,22445076,,
46,22445688,,1582919069,22440816,,
96,22445751,,1582919555,22443363,,
111,22445767,,1582919643,22444523,,
161,22445824,,1582920121,22443968,,
...,...,...,...,...,...,...
8467,22455230,,1583039390,22455128,,
8527,22455295,,1583040709,22455124,,
8529,22455297,,1583040766,22455216,,
8585,22455362,,1583042013,22446646,,


In [22]:
df.head(3)

Unnamed: 0,id,by,time,parent,text,kids
0,22445632,Apocryphon,1582918645,22443772,Andrew Walz&#x2F;Deez Nuts unity ticket 2020,
1,22445633,Animats,1582918658,22440816,Those are called &quot;multi-chip modules&quot...,
2,22445634,munk-a,1582918668,22445244,I also generally think that writing boilerplat...,


### Text

In [23]:
import re

In [24]:
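# TODO: some issue with the dtype of the text column; some comments come back empty after the regex
patt = {
    "unicode_patt":  r"&.{4}(?=;);",  # '&' + any four characters + ';', e.g. &#x2F;
    "line_break":    r"<p>",
    "href_patt":     r"<a.*</a>",     # greedy: spans from the first <a to the last </a>
    "quote":         r"&quot;",
    "html_footnote": r"\[.\]",        # single-character footnote markers such as [1]
}

r = '|'.join(patt.values())

def scrub(doc):
    # str() keeps NaN rows from raising, but turns them into the string 'nan'
    return re.sub(r, '', str(doc))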

In [25]:
r

'&.{4}(?=;);|<p>|<a.*</a>|&quot;|\\[.\\]'
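
The combined pattern deletes entities rather than decoding them (so `&#x2F;` disappears instead of becoming `/`), misses shorter entities such as `&gt;`, and the greedy `<a.*</a>` removes everything between the first `<a` and the last `</a>` on a line. A hedged alternative, assuming the goal is readable plain text, is to strip tags and then decode entities with the standard library's `html.unescape`. The names `scrub_decoded`, `TAG`, and `FOOTNOTE` and the sample string are made up for illustration. Unlike `scrub`, this sketch keeps link text and only matches numeric footnote markers.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")       # any HTML tag, e.g. <i> or <a href=...>
FOOTNOTE = re.compile(r"\[\d+\]")  # numeric footnote markers such as [1]


def scrub_decoded(doc):
    """Strip tags, decode entities, drop footnotes; '' for missing values."""
    if not isinstance(doc, str):        # NaN from pandas is a float, not a str
        return ""
    text = doc.replace("<p>", "\n")     # paragraph breaks become newlines
    text = TAG.sub("", text)            # remove tags before decoding entities
    text = html.unescape(text)          # &#x2F; -> /, &quot; -> ", &gt; -> >
    return FOOTNOTE.sub("", text).strip()


# Invented sample in the style of the API's `text` field.
sample = 'Andrew Walz&#x2F;Deez Nuts<p>see <a href="https:&#x2F;&#x2F;example.com">link</a> [1]'
print(scrub_decoded(sample))
```

Stripping tags before decoding means an escaped `&lt;tag&gt;` written by a commenter survives as literal text instead of being removed as markup.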

In [26]:
df['text'] =  df['text'].apply(scrub)
df

Unnamed: 0,id,by,time,parent,text,kids
0,22445632,Apocryphon,1582918645,22443772,Andrew WalzDeez Nuts unity ticket 2020,
1,22445633,Animats,1582918658,22440816,Those are called multi-chip modules. The Penti...,
2,22445634,munk-a,1582918668,22445244,I also generally think that writing boilerplat...,
3,22445636,fapi1974,1582918669,22443146,Just wanted to say this is what I was looking ...,
4,22445637,acqq,1582918676,22443536,This is the article Day 4. For a bit of contex...,[22445691]
...,...,...,...,...,...,...
8817,22455626,nl,1583047100,22455217,Go is used at a bunch of major companies outsi...,
8818,22455627,pjmlp,1583047213,22454235,"Strange, somehow that is exactly what we were ...",[22455654]
8819,22455629,lethisaputri,1583047378,22454333,ok,
8820,22455630,looping__lui,1583047403,22455017,This sounds like a dysfunctional company to me...,


In [27]:
# curiosity: is a precompiled pattern any faster?

r = '|'.join(patt.values())
t = re.compile(r)

In [28]:
df

Unnamed: 0,id,by,time,parent,text,kids
0,22445632,Apocryphon,1582918645,22443772,Andrew WalzDeez Nuts unity ticket 2020,
1,22445633,Animats,1582918658,22440816,Those are called multi-chip modules. The Penti...,
2,22445634,munk-a,1582918668,22445244,I also generally think that writing boilerplat...,
3,22445636,fapi1974,1582918669,22443146,Just wanted to say this is what I was looking ...,
4,22445637,acqq,1582918676,22443536,This is the article Day 4. For a bit of contex...,[22445691]
...,...,...,...,...,...,...
8817,22455626,nl,1583047100,22455217,Go is used at a bunch of major companies outsi...,
8818,22455627,pjmlp,1583047213,22454235,"Strange, somehow that is exactly what we were ...",[22455654]
8819,22455629,lethisaputri,1583047378,22454333,ok,
8820,22455630,looping__lui,1583047403,22455017,This sounds like a dysfunctional company to me...,


In [29]:
dfs = df.sample(500)

In [30]:
%%time

def drag(o):
    return re.sub(r, "", o)

dfs['text'].apply(drag)

CPU times: user 2.53 ms, sys: 0 ns, total: 2.53 ms
Wall time: 2.47 ms


5093    My fav part is he gave it and failed the candi...
3163    There are not “a lot” of reinfected. There are...
4920    &gt; executives want to believe in magic solut...
1063    I think a problem with this analogy is you are...
2021    Drawing things out so that youre not sick at t...
                              ...                        
1944    Based on what exactly?  This thing seems to su...
3548    A lot of these services get quite expensive ve...
5038      Is there a link which doesnt use a cookie wall?
7003    One of the biggest mistakes in IT ever, in my ...
4431    ... except its the same answer, only their res...
Name: text, Length: 500, dtype: object

In [31]:
%%time

def drag(o):
    return re.sub(t, "", str(o))

dfs['text'].apply(drag)

CPU times: user 3.56 ms, sys: 16 µs, total: 3.57 ms
Wall time: 4.11 ms


5093    My fav part is he gave it and failed the candi...
3163    There are not “a lot” of reinfected. There are...
4920    &gt; executives want to believe in magic solut...
1063    I think a problem with this analogy is you are...
2021    Drawing things out so that youre not sick at t...
                              ...                        
1944    Based on what exactly?  This thing seems to su...
3548    A lot of these services get quite expensive ve...
5038      Is there a link which doesnt use a cookie wall?
7003    One of the biggest mistakes in IT ever, in my ...
4431    ... except its the same answer, only their res...
Name: text, Length: 500, dtype: object
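
A single `%%time` run of a few milliseconds is noisy, and the `re` module caches recently compiled patterns, so a large gap between the two versions is not expected. A minimal sketch of a steadier comparison with `timeit`, using a made-up document repeated 500 times in place of `dfs['text']` (the helper names are illustrative):

```python
import re
import timeit

pattern = r'&.{4}(?=;);|<p>|<a.*</a>|&quot;|\[.\]'
compiled = re.compile(pattern)

# Invented sample text, repeated to mimic a 500-row sample.
docs = ['A &quot;quoted&quot; phrase<p>and a footnote [1]'] * 500


def with_string():
    """Pass the pattern string on every call (relies on re's internal cache)."""
    return [re.sub(pattern, '', d) for d in docs]


def with_compiled():
    """Reuse the precompiled pattern object."""
    return [compiled.sub('', d) for d in docs]


# Both variants must produce identical output before timing means anything.
assert with_string() == with_compiled()

for fn in (with_string, with_compiled):
    best = min(timeit.repeat(fn, number=20, repeat=5))
    print(f'{fn.__name__}: {best / 20 * 1e3:.3f} ms per pass')
```

Taking the minimum over several repeats reduces the effect of background load on the measurement.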