In [1]:
import sys, os
cwd = os.getcwd()
sys.path[0] = cwd[:cwd.rfind('/')]
from data import *

In [2]:
# Import necessary modules
import os
import numpy as np
import pandas as pd

# ---------------- Pandas settings --------------- #
# Raise the display limits so long or wide dataframes are not truncated with '...'
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

from sqlalchemy import create_engine
from google.cloud import bigquery
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## Create Connection to Google Cloud BigQuery

In [3]:
# Google Cloud credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '../saltyhackers-bigquery.json'

In [4]:
# Open bigquery client connection
client = bigquery.Client()

In [5]:
# Create bigquery dataset reference
hn_ref = client.dataset('hacker_news', project='bigquery-public-data')

In [6]:
# Get 'comments' table from bigquery
# Create dataframe with 50000 rows
# ElephantSQL limit is 20MB
comment_ref = hn_ref.table('comments')

comments = client.get_table(comment_ref)

comm_df = client.list_rows(comments, max_results=50000).to_dataframe()

### Sneak Peek at dataframe

In [7]:
comm_df.head()

Unnamed: 0,id,by,author,time,time_ts,text,parent,deleted,dead,ranking
0,2701393,5l,5l,1309184881,2011-06-27 14:28:01+00:00,And the glazier who fixed all the broken windo...,2701243,,,0
1,5811403,99,99,1370234048,2013-06-03 04:34:08+00:00,Does canada have the equivalent of H1B/Green c...,5804452,,,0
2,21623,AF,AF,1178992400,2007-05-12 17:53:20+00:00,"Speaking of Rails, there are other options in ...",21611,,,0
3,10159727,EA,EA,1441206574,2015-09-02 15:09:34+00:00,Humans and large livestock (and maybe even pet...,10159396,,,0
4,2988424,Iv,Iv,1315853580,2011-09-12 18:53:00+00:00,I must say I reacted in the same way when I re...,2988179,,,0


## Pre-processing 

### Remove HTML tags

In [15]:
import re
import html


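def cleanup_html(raw_html):
    """
    Strip HTML tags and URLs from a comment, then unescape HTML entities
    """
    # Guard in case a deleted or dead comment has no text
    if not isinstance(raw_html, str):
        return ''
    clean_html = re.sub(r'<.*?>', '', raw_html)
    clean_html_http = re.sub(r'http\S+([\.]{3})?', '', clean_html)
    # Unescape the URL-free text (the earlier version unescaped clean_html,
    # which silently discarded the URL removal)
    clean_txt = html.unescape(clean_html_http)
    return clean_txt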

# Apply the function
comm_df['text'] = comm_df['text'].apply(cleanup_html)

# Check results
comm_df.sample(10)

for row in comm_df['text'].sample(10):
    print(row)
    print()

But if the price is mainly based on that, it's nothing but a tulip bubble. I don't think we get sustainable value unless people are actually using it as currency to buy things.

My point, you missed it.

You are absolutely right. However, It could also be interpreted as digital salesperson. When I entered a store and a salesperson pitch to me about products based on my face and facial expressions.
We need to learn about how to deal with ourselves. They was and they will always be intrusive.

There was no outcry:http://sf.streetsblog.org/2014/03/20/contrary-to-ed-lee-reco...

> Why do you think that supporting break and continue in macros is difficult?In many languages, emulation of control loop do not support break and continue so I thought that this was the case also here, but apparently I'm wrong, sorry for my 'too quick' post and thanks for the correction.

Thanks, I'll do that.  For the record, my professional programming experience is about 2/3 real-time embedded signal processing
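
The cleanup steps above can be tried on a single string. This is a minimal sketch, not the notebook's exact function: `strip_markup` and the sample comment are made up for illustration, and turning `<p>` into a space and collapsing whitespace are extra choices assumed here so that paragraphs do not run together (as in "difficult?In many" above).

```python
import html
import re


def strip_markup(raw_html):
    """Remove tags and URLs, then decode HTML entities (illustrative helper)."""
    text = re.sub(r'<p>', ' ', raw_html)   # assumption: keep paragraphs apart
    text = re.sub(r'<.*?>', '', text)      # drop any remaining tags
    text = re.sub(r'http\S+', '', text)    # drop URLs
    text = html.unescape(text)             # decode entities last
    return ' '.join(text.split())          # collapse leftover whitespace


# Hypothetical comment text, not taken from the dataset
sample = ('I&#x27;m not sure.<p>See <a href="http://example.com/a">'
          'http://example.com/a</a> for more &gt; details')
print(strip_markup(sample))
```

Entities are decoded only after the tags are removed, so a decoded `>` or `<` in the comment text is never mistaken for markup.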

## Performing ML on our dataframe



According to Urban Dictionary, a salty person is someone who is bitter (kinda weird, since bitter and salty are completely different tastes, but the transformation of the English language is a topic for another day). Can we predict which users of Hacker News are the saltiest or most toxic based on the comments they post? Can we help users identify whose comments to ignore so that their time on the site is more enjoyable? And how will we determine what “salty” means?

For this sentiment analysis model, we will use Vader Sentiment for its simplicity and its ability to handle text typically found on social media (it accounts for slang, capitalization, emojis, and punctuation). To determine a user's "saltiness", we will use 3 of Vader's polarity scores: positive, compound, and negative. The positive and negative scores are self-explanatory; the compound score, however, is worth understanding further. According to the Vader Sentiment documentation:

    The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.

Furthermore, the documentation breaks down how sentiment is obtained:

    Typical threshold values (used in the literature cited on this page) are:

    positive sentiment: compound score >= 0.05
    neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
    negative sentiment: compound score <= -0.05

    The pos, neu, and neg scores are ratios for proportions of text that fall in each category (so these should all add up to be 1... or close to it with float operation). These are the most useful metrics if you want multidimensional measures of sentiment for a given sentence.
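
The thresholds quoted above can be written as a small helper. This is only a sketch: `label_sentiment` is a hypothetical name, and the compound values passed to it are made up rather than produced by Vader.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score to a label using the quoted threshold values."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Hypothetical compound scores for illustration
for value in (0.6, 0.05, 0.0, -0.05, -0.6):
    print(value, label_sentiment(value))
```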

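With this understanding, we can now derive a formula for the saltiness of our users' comments. The compound score already summarizes overall sentiment, but for our purposes we want to give a bit more weight to the positive and negative ratios, so we add the former and subtract the latter. Since positive sentiment raises the score and negative sentiment lowers it, the saltiest comments are the ones with the lowest scores. We define our score formula as follows: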

    **Saltiness Score** = *Positive Ratio* + *Compound Score* - *Negative Ratio*
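
As a quick check of the arithmetic, the formula can be applied to a score dictionary shaped like the one Vader returns. The two dictionaries below are hypothetical values chosen for illustration, not real Vader output, and `saltiness` is a name assumed for this sketch.

```python
def saltiness(scores):
    """Combine a Vader-style score dict into pos + compound - neg."""
    return scores['pos'] + scores['compound'] - scores['neg']


# Hypothetical polarity scores (made up, not Vader output)
friendly = {'neg': 0.0, 'neu': 0.6, 'pos': 0.4, 'compound': 0.8}
hostile = {'neg': 0.5, 'neu': 0.5, 'pos': 0.0, 'compound': -0.7}

print(round(saltiness(friendly), 4))  # 1.2
print(round(saltiness(hostile), 4))   # -1.2
```

Because the ratios lie in [0, 1] and the compound score in [-1, 1], the combined score stays within [-2, 2], which is why values above 1 can appear in the results.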


We only need to perform sentiment analysis on the users' comments, so we'll only focus on the 'text' column. The goal here is to perform an analysis on each comment, and append the comment's score to a corresponding 'score' column.


In [16]:
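# Create the sentiment analysis function

# Build the analyser once; constructing it reloads the lexicon,
# which is wasteful to repeat for every comment
analyser = SentimentIntensityAnalyzer()

def sentiment_score(comment):
    """
    Return pos + compound - neg for a single comment
    """
    score = analyser.polarity_scores(comment)
    return score['pos'] + score['compound'] - score['neg']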

In [17]:
# Apply the function to each sample in the 'text' column
# Store score in newly-created 'score' column
comm_df['score'] = comm_df['text'].apply(sentiment_score)

In [18]:
# Check results
comm_df.head()

Unnamed: 0,id,by,author,time,time_ts,text,parent,deleted,dead,ranking,score
0,2701393,5l,5l,1309184881,2011-06-27 14:28:01+00:00,And the glazier who fixed all the broken windo...,2701243,,,0,-0.0616
1,5811403,99,99,1370234048,2013-06-03 04:34:08+00:00,Does canada have the equivalent of H1B/Green c...,5804452,,,0,0.0
2,21623,AF,AF,1178992400,2007-05-12 17:53:20+00:00,"Speaking of Rails, there are other options in ...",21611,,,0,0.5215
3,10159727,EA,EA,1441206574,2015-09-02 15:09:34+00:00,Humans and large livestock (and maybe even pet...,10159396,,,0,1.0102
4,2988424,Iv,Iv,1315853580,2011-09-12 18:53:00+00:00,I must say I reacted in the same way when I re...,2988179,,,0,0.0
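
Per-comment scores still need to be rolled up per user to answer the original question. One possible way is to average each author's scores and sort ascending, so the lowest (saltiest) averages come first. This is a sketch under assumptions: `saltiest_authors`, its `min_comments` cutoff, and the author names and scores are all hypothetical.

```python
from collections import defaultdict


def saltiest_authors(rows, top_n=3, min_comments=1):
    """Return (author, mean score) pairs, lowest mean first."""
    totals = defaultdict(lambda: [0.0, 0])  # author -> [sum of scores, count]
    for author, score in rows:
        totals[author][0] += score
        totals[author][1] += 1
    averages = {
        author: total / count
        for author, (total, count) in totals.items()
        if count >= min_comments
    }
    return sorted(averages.items(), key=lambda pair: pair[1])[:top_n]


# Hypothetical (author, score) pairs for illustration
rows = [('alice', 0.9), ('bob', -1.2), ('alice', 0.3),
        ('bob', -0.4), ('carol', 0.0)]
print(saltiest_authors(rows))
```

On the dataframe itself, something like `comm_df.groupby('author')['score'].mean().sort_values()` should give the same ranking. A minimum comment count is worth considering, since one harsh comment can otherwise put an author at the top.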


## Pushing the dataframe to postgres

Now, all we have left to do is write our pandas dataframe to a table in our Postgres database. For this project, we chose ElephantSQL, a hosted PostgreSQL service, for its simple interface.

In [20]:
# Establish connection to database
# Use the 'postgresql://' scheme; SQLAlchemy 1.4+ no longer accepts 'postgres://'
engine = create_engine('postgresql://txtqhcho:mHEV5Or0MiRw_5oaIJF162BkmqapzanU@salt.db.elephantsql.com:5432/txtqhcho')
# Convert dataframe to SQL
comm_df.to_sql('saltyhackers', con=engine, index=False)
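
Since the connection string contains a password, it may be safer to keep it out of the notebook and read it from the environment. The sketch below assumes a variable named `DATABASE_URL` (a hypothetical name, as is the `database_url` helper). It also rewrites the older `postgres://` scheme to `postgresql://`, which newer SQLAlchemy versions require.

```python
import os


def database_url(env=os.environ, var='DATABASE_URL'):
    """Read a connection URL from the environment and normalise its scheme."""
    url = env.get(var)
    if url is None:
        raise RuntimeError(f'{var} is not set')
    if url.startswith('postgres://'):
        url = 'postgresql://' + url[len('postgres://'):]
    return url


# Demonstration with a made-up URL instead of real credentials
fake_env = {'DATABASE_URL': 'postgres://user:secret@host:5432/dbname'}
print(database_url(fake_env))
```

The engine could then be created with `create_engine(database_url())`. Note that `to_sql` raises an error by default if the table already exists; passing `if_exists='replace'` or `if_exists='append'` changes that behavior.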