# R/GA by Design Hackday - Beyond Rules with NLP

Thanks for choosing the Natural Language Processing project! This notebook will serve as your guide during your project. 


## Contents
* Goals
* Agenda
* Introductory concepts in NLP
* Data processing for supplied corpora
* Step-by-step guide to sentence classification (sentiment analysis)
* Preliminary examples for topic modelling using LDA

## Goals
* Learn how to create a problem statement and an execution plan.
* Familiarize yourself with the basic concepts, challenges, opportunities, and methods of NLP.
* Understand how NLP can be used in your work.
* Collectively spec a prototype.


## Agenda

#### What is NLP?  (60 minutes):

* Deck (20 min) - technical / landscape
* Examples (20 min) - play around with example applications.
* Reflect - What role does text play in your work and how can NLP support this? (10 min)
  * How do you use text in your work?
  * How could NLP be relevant to it?
  * How can this technological affordance be used for something innovative?
* Read through and discuss introductory material around NLP concepts. (10 min)
* Clone the repo or download it as a zip; alternatively, read the docs online.
  * If you do not have admin rights, find someone on your team who does or follow along with Michael.

#### Project Planning (30+30+45 minutes):

* Brainstorm (30 min) - Put together a bunch of ideas - from simple to moonshot.
* Present (30 min) - Present 3 best ideas to the group.
  * Discuss feasibility with the others.
  * Settle on 1 idea.
* Nail down a brief and an execution plan. (20 min)
  * Detail what you will and *won't* do.
* Gather inspirational material from personal archives or included resources list. (5 min)
* Dole out tasks, consult with Michael on what it will take to build this thing. (20 min)
* Things to consider:
  * Who do you need on your team?
  * Why is NLP an important ingredient in your solution?
  * What data do you need to get started?
  * How is the technology manifested in the deliverable?

#### Working (40+10+40 minutes):

* Working session (40 min)
* "Client" Review (10 min)
* Working session (40 min)

#### Wrap-up and presentation gathering (15 min)
* Prepare for 17:15 regroup with the larger group.


In [165]:
%matplotlib inline

import os
import pprint
import pandas as pd
from textblob import TextBlob, Word
from nltk.corpus import inaugural

pd.options.display.max_colwidth = 0
pp = pprint.PrettyPrinter(indent=4)

# Introductory Concepts in NLP

## Data
This dataset was created for the paper "From Group to Individual Labels using Deep Features" (Kotzias et al., KDD 2015).

It contains sentences labelled with positive or negative sentiment, extracted from reviews of products, movies, and restaurants.

**Format**:
* sentence \t score \n

**Details**:
* Score is either 1 (for positive) or 0 (for negative)
* The sentences come from three different websites/fields:
  * imdb.com
  * amazon.com
  * yelp.com

For each website, there are 500 positive and 500 negative sentences.
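
The format above can be read with nothing but the standard library. The snippet below is a minimal sketch, assuming Python 3; the in-memory `SAMPLE` and the helper name `parse_labelled_sentences` are made up for illustration, and the real files would be opened from disk instead.

```python
import csv
import io
from collections import Counter

# Hypothetical three-line sample in the "sentence \t score \n" format.
SAMPLE = (
    "They were golden-crispy and delicious.\t1\n"
    "Not much flavor to them, and very poorly constructed.\t0\n"
    "Rip off---- Over charge shipping.\t0\n"
)


def parse_labelled_sentences(handle):
    """Yield (sentence, label) pairs from a tab-separated file-like object."""
    # QUOTE_NONE keeps quotation marks inside a review from being
    # interpreted as CSV quoting.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        sentence, score = row
        yield sentence.strip(), int(score)


rows = list(parse_labelled_sentences(io.StringIO(SAMPLE)))

# Count how many sentences carry each label, as a quick balance check.
label_counts = Counter(label for _, label in rows)
print(label_counts)
```

Run against a full file, the same counter gives a quick way to check the 500/500 split described above.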

In [37]:
# collect one dataframe per file
frames = []
for dirpath, _, filenames in os.walk("../data/web-reviews/"):
    for filename in filenames:
        # os.path.join works whether or not dirpath ends with a separator
        path = os.path.join(dirpath, filename)
        data = pd.read_table(path, names=["review", "sentiment"])
        # the source is the part of the filename before the first underscore
        data["source"] = filename.split("_")[0]
        frames.append(data)

# combine the three datasets into a single dataframe
reviews_df = pd.concat(frames)

Let's see how our data looks. First, we print a random sample of the reviews table.

In [55]:
reviews_df.sample(5)

| | review | sentiment | source |
|---|---|---|---|
| 370 | I'll be drivng along, and my headset starts ringing for no reason. | 0.0 | amazon |
| 804 | They were golden-crispy and delicious. | 1.0 | yelp |
| 653 | Not much flavor to them, and very poorly constructed. | 0.0 | yelp |
| 26 | - They never brought a salad we asked for. | 0.0 | yelp |
| 329 | Rip off---- Over charge shipping. | 0.0 | amazon |


Next, let's look at some of the capabilities of **TextBlob**, a Python library for text processing.

First, we will see how we can use TextBlob to get simple information out of a single review.

In [162]:
# take a random review and turn it into a TextBlob
text = reviews_df.sample(1).iloc[0]["review"]
blob = TextBlob(text)

# print some information about the blob
pp.pprint(blob)
print("\n")
pp.pprint(blob.tags)
print("\n")
pp.pprint(blob.noun_phrases)
print("\n")
pp.pprint(blob.sentiment)

TextBlob("Bad characters, bad story and bad acting.  ")


[   ('Bad', u'NNP'),
    ('characters', u'NNS'),
    ('bad', u'JJ'),
    ('story', u'NN'),
    ('and', u'CC'),
    ('bad', u'JJ'),
    ('acting', u'NN')]


WordList([u'bad story'])


Sentiment(polarity=-0.5249999999999999, subjectivity=0.5)
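
The polarity score lies in the range -1 to 1, while the dataset labels are 0 or 1. One way to compare the two is to threshold the polarity. The snippet below is a minimal sketch under stated assumptions: the threshold of 0.0, the helper names `polarity_to_label` and `accuracy`, and the two extra polarity values and all gold labels are made up for illustration.

```python
def polarity_to_label(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to 1 (positive) or 0 (negative)."""
    # Assumption: anything at or below the threshold counts as negative.
    return 1 if polarity > threshold else 0


def accuracy(predicted, expected):
    """Return the fraction of predictions that match the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return float(matches) / len(expected)


# The first polarity is the one printed above (rounded); the other two
# values and all gold labels are invented for this example.
polarities = [-0.525, 0.8, 0.1]
gold_labels = [0, 1, 0]

predictions = [polarity_to_label(p) for p in polarities]
print(predictions, accuracy(predictions, gold_labels))
```

A mildly positive polarity such as 0.1 is labelled positive here, which shows why the threshold is worth tuning rather than taking as given.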


Now, let's look at some tools that will allow us to standardize sentences and words.

**Lemmas** 
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

In [163]:
# Show words and their lemmas.
for word in blob.words:
    w = Word(word)
    print("{} > {}".format(word, w.lemmatize()))

Bad > Bad
characters > character
bad > bad
story > story
and > and
bad > bad
acting > acting
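
To see why grouping inflected forms is useful, the snippet below counts words by lemma rather than by surface form. It is a toy sketch only: `TOY_LEMMAS` and `toy_lemmatize` are hand-written for this example and are not part of TextBlob, whose lemmatizer looks words up in WordNet instead of a small table.

```python
from collections import Counter

# Toy lemma table, hand-written for this example.
TOY_LEMMAS = {
    "characters": "character",
    "stories": "story",
    "acted": "act",
    "acting": "act",
    "acts": "act",
}


def toy_lemmatize(word):
    """Return the lemma for a word, falling back to the lowercased word."""
    lowered = word.lower()
    return TOY_LEMMAS.get(lowered, lowered)


tokens = ["Acting", "acted", "acts", "story", "stories", "characters"]

# Six distinct surface forms collapse into three lemmas.
lemma_counts = Counter(toy_lemmatize(t) for t in tokens)
print(lemma_counts)
```

Unlike the output above, this toy version lowercases its input and treats "acting" as a form of the verb "act"; which behaviour is preferable depends on the part of speech the word has in context.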


In [164]:
# Show words and their first WordNet synset.
for word in blob.words:
    w = Word(word)
    print("{} > {}".format(w, w.synsets[:1]))

Bad > [Synset('bad.n.01')]
characters > [Synset('fictional_character.n.01')]
bad > [Synset('bad.n.01')]
story > [Synset('narrative.n.01')]
and > []
bad > [Synset('bad.n.01')]
acting > [Synset('acting.n.01')]


## Data Processing

## Text Generation

## Topic Modelling Using LDA

## Sentiment Analysis

## Appendix