
In [0]:
pip install nltk



In [0]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

In [0]:
! pip install spacy




In [0]:
! python -m spacy download en


✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [0]:
from collections import Counter
import nltk
import spacy
import re
nlp = spacy.load('en',disable=['parser','ner'])

In [0]:
# Launch the installer to download Gutenberg corpus
nltk.download("gutenberg")

# Download the English models of SpaCy
#!python -m spacy download en

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [0]:
# import the data we just downloaded
from nltk.corpus import gutenberg

# grab and process the raw data
print(gutenberg.fileids())

persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# print the first 100 characters of Alice
print('\nRaw:\n', alice[0:100])

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Raw:
 [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was


###**Basic text cleaning**
Regular expressions are a common tool for modifying text data.

We'll use them here (**specifically re.sub(), short for "substitute"**) to identify and remove substrings we don't want.

Specifically, **we'll match those substrings with a regular expression and substitute an empty string for them.**

If you want more information, the Python Regular Expression HOWTO is an accessible starting point and reference, and RegExr is a useful tool for visualizing and tinkering with regular expressions.

We'll start our cleaning by removing the title.

**We'll match all text between square brackets and replace it with an empty string**.

In [0]:
# this pattern matches all text between square brackets
pattern = r"\[.*?\]"
persuasion = re.sub(pattern, "", persuasion)
alice = re.sub(pattern, "", alice)

# print the first 100 characters of Alice again
print("Title removed:", alice[0:100])

Title removed: 

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on
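
The `?` after `.*` in the pattern above makes the match non-greedy. The following is a minimal sketch of the difference, using a made-up sample string rather than the novels:

```python
import re

# Hypothetical sample text with two bracketed spans.
sample = "[Title by Author 1865] Some text [a note] more text"

# Non-greedy: each match stops at the first closing bracket it finds.
non_greedy = re.sub(r"\[.*?\]", "", sample)

# Greedy: a single match runs from the first '[' to the last ']'.
greedy = re.sub(r"\[.*\]", "", sample)

print(repr(non_greedy))
print(repr(greedy))
```

With the greedy version, the text between the two bracketed spans is lost as well, which is usually not what we want when cleaning a corpus.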


Next, we'll remove the chapter headings, such as CHAPTER I.
Note that the two novels use different styles of chapter headings, so we handle each novel separately.

This is **quite usual** when cleaning text data.

As we said before, **all texts have their own peculiarities, and cleaning them requires you to know those peculiarities**.

In [0]:
# now we'll match and remove chapter headings
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)

# ok, what's it look like now?
print('Chapter headings removed:', alice[0:100])

Chapter headings removed: 



Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothin
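
The order of the cleaning steps matters for a pattern like `CHAPTER .*`, because `.` does not match a newline by default. A small sketch on a made-up excerpt (the variable names are only for illustration):

```python
import re

# Hypothetical excerpt shaped like the start of the Alice text.
raw = "CHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired"

# Headings first: the match stops at the end of the heading line.
headings_first = " ".join(re.sub(r"CHAPTER .*", "", raw).split())

# Whitespace first: the text is now a single line, so the same
# pattern swallows everything that follows "CHAPTER".
whitespace_first = re.sub(r"CHAPTER .*", "", " ".join(raw.split()))

print(repr(headings_first))
print(repr(whitespace_first))
```

This is why the chapter headings are removed before the newlines are collapsed.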


If you were to read the two novels, you'd notice **a lot of newline characters and other extra whitespace.**

So, **we need to clean them up**:

In [0]:
# remove newlines and other extra whitespace by splitting and rejoining
persuasion = ' '.join(persuasion.split())
alice = ' '.join(alice.split())

# all done with cleanup? let's see how it looks.
print('Extra whitespace removed:', alice[0:100])

Extra whitespace removed: CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on th


Much of what you've seen so far was a demonstration of the kinds of problems you may encounter in a corpus, and there are many more. For example, in a social media corpus we would most likely encounter many emojis and abbreviations, and dealing with them would be a major part of the data cleaning phase.

**Hence, you should always be careful about what kind of corpus you have and what types of problems may occur in the text**.
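
As a rough illustration of the social media case, here is a sketch of a cleaning helper. The function name `clean_tweet` and the sample tweet are assumptions for this example, and dropping every non-ASCII character is a crude way to remove emojis that would also remove accented letters:

```python
import re

def clean_tweet(text):
    """Hypothetical helper: strip URLs, @mentions and non-ASCII symbols such as emojis."""
    text = re.sub(r"https?://\S+", "", text)                # drop links
    text = re.sub(r"@\w+", "", text)                        # drop user mentions
    text = text.encode("ascii", "ignore").decode("ascii")   # crude emoji removal
    return " ".join(text.split())                           # collapse leftover whitespace

cleaned = clean_tweet("@VirginAmerica thanks for the upgrade \U0001F600 http://example.com/x")
print(cleaned)
```

Whether mentions and emojis should be removed at all depends on the task; for sentiment analysis, for instance, emojis may carry useful signal.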

Now that our text looks reasonably clean, the next step is to tokenize it.


##**Tokenization**

As you recall from the previous checkpoint, **each individual meaningful piece of a text is called a token**, and **the process of breaking up the text into these pieces is called tokenization**.

Tokenization is an important step in text preprocessing because, most of the time, we generate the numerical representations of our texts from these tokens. Hence, breaking up the text into tokens correctly is crucial for the success of the later steps of any data science workflow.

**Tokens are generally words and punctuation**.

In some NLP applications, people remove punctuation from the text as if it were a stopword. There's no single correct way of handling punctuation, and it's usually a matter of experimentation to determine the best way.

**In the following, we'll keep punctuation in our documents, as we'll make use of it when separating our text into sentences**.

**However, when we analyze our data, we check for punctuation and don't include it in our analysis**, as you'll see shortly.
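
To make the idea of "words and punctuation" concrete, here is a minimal regex-based tokenizer. It is only a stand-in for illustration (the name `simple_tokenize` is hypothetical) and is far simpler than what spaCy does:

```python
import re

def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: a word (optionally with inner
    apostrophes) or a single punctuation mark becomes one token."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

tokens = simple_tokenize("Alice was tired; 'Oh dear!' she said.")

# Keep the punctuation in the token list, but skip it when analyzing words.
words_only = [t for t in tokens if t[0].isalnum()]

print(tokens)
print(words_only)
```

The punctuation tokens stay available for later steps, such as sentence splitting, while the word-level analysis ignores them.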

Let's go ahead and use spaCy to parse our novels into tokens.

 **When we call spaCy on the novel it will immediately and automatically parse it, tokenizing the string by breaking it into words and punctuation (and many other things we will explore)**:

In [0]:


# all the processing work is done below, so it may take a while
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

All our parsed documents are now stored in two variables we defined.

 SpaCy did a lot of good things when parsing the documents. 

Let's see what we have after the parsing happened:

In [0]:
# let's explore the objects we've built.
print("The alice_doc object is a {} object.".format(type(alice_doc)))
print("It is {} tokens long".format(len(alice_doc)))
print("The first three tokens are '{}'".format(alice_doc[:3]))
print("The type of each token is {}".format(type(alice_doc[0])))

The alice_doc object is a <class 'spacy.tokens.doc.Doc'> object.
It is 34495 tokens long
The first three tokens are 'CHAPTER I. Down'
The type of each token is <class 'spacy.tokens.token.Token'>


We see from introspecting the spaCy objects above that we're playing around with **doc and token objects.** 

**These are the types that are defined by SpaCy**.

##**Removing stopwords**

One of the important steps of text preprocessing is to remove the stopwords from the dataset.

Stopwords occur a lot in the text and, most of the time, convey little meaning. So removing them has two benefits:

**We get rid of the noisy data**.

The size of the text diminishes, and hence the computation time shortens.

Removing stopwords with spaCy is quite easy:

In [0]:
alice_without_stopwords = [token for token in alice_doc if not token.is_stop]
persuasion_without_stopwords = [token for token in persuasion_doc if not token.is_stop]

In [0]:
type(alice_without_stopwords)

list

As you can see, **we just iterated over the tokens that spaCy already made available** during the parsing of the documents and excluded every token that is a stopword.

**Now, we store our tokens in two lists that are free of stopwords**. 
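
The same filtering idea can be shown without spaCy. This sketch uses a tiny hand-written stopword set, which is an assumption for illustration; spaCy's English list is much longer:

```python
# Hypothetical, tiny stopword set; not spaCy's actual list.
STOPWORDS = {"the", "was", "to", "of", "by", "her", "on", "and"}

tokens = "Alice was beginning to get very tired of sitting by her sister on the bank".split()

# Keep a token only if its lowercase form is not a stopword.
kept = [t for t in tokens if t.lower() not in STOPWORDS]

print(len(tokens), "tokens before,", len(kept), "after")
print(kept)
```

Even on one sentence, roughly half of the tokens disappear, which is the size reduction described above.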

Let's pause text processing for a moment and look at how frequent each token is in our corpus:

In [0]:
# utility function to calculate how frequently words appear in the text
def word_frequencies(text):
    
    # build a list of words
    # strip out punctuation
    words = []
    for token in text:
        if not token.is_punct:
            words.append(token.text)
            
    # build and return a Counter object containing word counts
    return Counter(words)

# instantiate our list of most common words.
alice_word_freq = word_frequencies(alice_without_stopwords).most_common(10)
persuasion_word_freq = word_frequencies(persuasion_without_stopwords).most_common(10)
print('\nAlice:', alice_word_freq)
print('Persuasion:', persuasion_word_freq)


Alice: [('said', 453), ('Alice', 394), ('little', 124), ('like', 84), ('went', 83), ('know', 83), ('thought', 74), ('Queen', 74), ('time', 68), ('King', 61)]
Persuasion: [('Anne', 496), ('Captain', 297), ('Mrs', 291), ('Elliot', 288), ('Mr', 254), ('Wentworth', 217), ('Lady', 191), ('good', 181), ('little', 175), ('Charles', 166)]


Just take a moment and think about the 10 most common words in each novel.

Do you see some differences that make sense to you?

##**Lemmatization**

So far, we've tokenized our texts and looked at whether certain words are present and how frequently they appear.

**We can process these words further to remove a little more noise from our data**.

Consider the words "think", "thought", and "thinking". They're related: they all share the same root word, the verb "think". Most of the time, we want to focus on the fact that the act of thinking comes up a lot in the data, and not have that information split across all the different forms of "think".

To focus in like this, **we can reduce each word to its root, that is, to its lemma, and do our counts again**.

**This time, we're building a count of concepts rather than just words**:

In [0]:
# utility function to calculate how frequently lemmas appear in the text
def lemma_frequencies(text):
    
    # build a list of lemmas
    # strip out punctuation
    lemmas = []
    for token in text:
        if not token.is_punct:
            lemmas.append(token.lemma_)
            
    # build and return a Counter object containing lemma counts
    return Counter(lemmas)

# instantiate our list of most common lemmas
alice_lemma_freq = lemma_frequencies(alice_without_stopwords).most_common(10)
persuasion_lemma_freq = lemma_frequencies(persuasion_without_stopwords).most_common(10)
print('\nAlice:', alice_lemma_freq)
print('Persuasion:', persuasion_lemma_freq)


Alice: [('say', 476), ('Alice', 394), ('think', 130), ('go', 130), ('little', 125), ('look', 105), ('know', 103), ('come', 96), ('like', 92), ('begin', 91)]
Persuasion: [('Anne', 496), ('Captain', 297), ('Mrs', 291), ('Elliot', 288), ('think', 258), ('Mr', 254), ('know', 252), ('good', 222), ('Wentworth', 215), ('Lady', 191)]
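
Why the counts shift can be seen on a toy example. The lemma table below is hand-written for illustration and is an assumption, not how spaCy derives lemmas:

```python
from collections import Counter

# Hypothetical lemma table; a real lemmatizer covers far more forms.
LEMMAS = {"thought": "think", "thinking": "think", "thinks": "think",
          "went": "go", "going": "go"}

words = ["think", "thought", "thinking", "went", "going", "rabbit"]

word_counts = Counter(words)                              # every surface form counted separately
lemma_counts = Counter(LEMMAS.get(w, w) for w in words)   # forms merged into their lemma

print(word_counts.most_common(3))
print(lemma_counts.most_common(3))
```

Each surface form appears once, but after mapping to lemmas the counts for "think" and "go" add up, which is how "think" climbs into the top ten above.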


**As you can see, the top ten list changed**.

You can print more of the top lemmas to catch meaningful differences between the two novels.

**Alternatively, we can identify the lemmas common to one text but not the other**.

This may help us understand the differences between the two novels:

In [0]:
alice_lemma_common = [pair[0] for pair in alice_lemma_freq]
persuasion_lemma_common = [pair[0] for pair in persuasion_lemma_freq]
print('Unique to Alice:', set(alice_lemma_common) - set(persuasion_lemma_common))
print('Unique to Persuasion:', set(persuasion_lemma_common) - set(alice_lemma_common))

Unique to Alice: {'Alice', 'say', 'begin', 'go', 'little', 'like', 'come', 'look'}
Unique to Persuasion: {'Elliot', 'Mrs', 'good', 'Wentworth', 'Anne', 'Captain', 'Lady', 'Mr'}


These are examples of how you can do data exploration on text data. When it comes to text data, the sky is the limit! So, use your imagination and find more creative ways of analyzing the two novels based on their lemmas.

We won't go into the details, but some syntactic properties can also help in this analysis.
If you look closely, the most frequent lemmas include person names. For the purpose of our analysis, we may need to eliminate them from the lists.

To do this, we can use the named entities in the texts. spaCy derives named entities during parsing when its named entity recognizer is enabled; note that the spacy.load call at the top of this notebook disables it with disable=['parser','ner'].

If you like, you can go ahead and inspect the named entities.

Note: We lemmatized our tokens to treat different forms of the same word as if they were the same.

Apart from looking at lemmas, we could also perform a similar analysis by pulling out prefixes (token.prefix_) or suffixes (token.suffix_) from the tokens.
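
As a rough sketch of the suffix idea, the example below counts the last three characters of each word in plain Python. To my knowledge, spaCy's `token.suffix_` defaults to the last three characters and `token.prefix_` to the first character, but treat that as an assumption and check the documentation of your spaCy version:

```python
from collections import Counter

# Hypothetical word list for illustration.
words = ["thinking", "sitting", "beginning", "tired", "remarkable"]

# Count three-character endings, mimicking what a suffix attribute would give.
suffix_counts = Counter(w[-3:] for w in words)

print(suffix_counts.most_common())
```

Suffix counts like these give a coarse view of word forms (for example, how common "-ing" forms are) without any lemmatization.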

##**Sentences**

Before closing this checkpoint, we want to mention how to determine the sentences in a corpus.

Beyond individual words, text can also be considered at the level of sentences. Using punctuation cues, we can split up text into sentences.

Each sentence can then be summarized by, for example, using sentiment analysis to categorize sentences as having positive or negative sentiment.

 We may also be interested in how long sentences tend to be, and how many unique words make up a sentence.

 The sentence also provides context for the individual words, allowing us to draw even more information from each word.

We get a lot of automatic sentence-level information from spaCy. The **doc.sents** property will give us each sentence as a span object. 

Let's look at some of that:



In [0]:
# initial exploration of sentences

sentences = list(alice_doc.sents)
print("Alice in Wonderland has {} sentences.".format(len(sentences)))

example_sentence = sentences[2]
print("Here is an example: \n{}\n".format(example_sentence))

Alice in Wonderland has 1624 sentences.
Here is an example: 
There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!



**Let's look at some metrics around this sentence:**

In [0]:
example_words = [token for token in example_sentence if not token.is_punct]
unique_words = set([token.text for token in example_words])

print(("There are {} words in this sentence, and {} of them are"
       " unique.").format(len(example_words), len(unique_words)))

There are 56 words in this sentence, and 46 of them are unique.
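
The "punctuation cues" idea can be sketched with a naive rule: a sentence ends at `.`, `!` or `?` followed by whitespace. This rule is an assumption for illustration and is much cruder than spaCy's sentence segmentation:

```python
import re

# Hypothetical sample text.
text = "Alice was tired. She saw a rabbit! Did the rabbit see Alice?"

# Split on whitespace that directly follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Per-sentence metrics: number of words and number of unique words.
stats = []
for s in sentences:
    words = re.findall(r"\w+", s)
    stats.append((len(words), len({w.lower() for w in words})))

print(sentences)
print(stats)
```

A rule this simple breaks on abbreviations such as "Mrs." and on quoted exclamations, which is one reason to rely on a library for real corpora.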


As we can see, sentence-level analysis can also be helpful in the data exploration phase.

That's all on data cleaning and text preprocessing for now. It's your turn to complete the assignments.

##**Assignments**
In this assignment, you're required to clean up two datasets. You'll be using these datasets in later checkpoints of this module, so cleaning them up here will save you time when working with them.

The first dataset is a dialogue dataset called **Cornell Movie--Dialogs Corpus**. This corpus includes conversations between the characters of more than 600 movies.

The second dataset is the **Twitter US Airline Sentiment dataset** from Kaggle. This dataset contains tweets from travelers about some airlines in February 2015. It is usually used in sentiment analysis, but we'll use it for sentence generation later on.

Since the memory requirements of the datasets are relatively large, we recommend using Google Colaboratory.

In [0]:
# Apply the data preprocessing techniques you learned here to the Cornell Movie--Dialogs Corpus data. You'll be using this dataset when developing a chatbot in a later checkpoint.
# You should access the dataset from the Thinkful database using the following credentials:

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'cornell_movie_dialogs'

# The data is in the table called "dialogs".

# Apply the data preprocessing techniques you learned here to the Twitter US Airline Sentiment data. You'll be using this dataset when generating sentences in a later checkpoint.
# You should access the dataset from the Thinkful database using the following credentials:

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'twitter_sentiment'

# The data is in the table called "twitter".
Note: When parsing the data using SpaCy, you may run into some memory issues even in Google Colaboratory. If you're having memory issues, try parsing your text as follows:

nlp = spacy.load('en', disable=['parser', 'ner'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.max_length = 20000000
doc = nlp(the_dialogs_come_here)
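
Another way to keep memory in check, which is not part of the note above, is to parse the corpus in several smaller chunks instead of one huge string. The helper below is a hypothetical sketch (the name `chunk_texts` and the `max_chars` parameter are assumptions); each chunk it yields could then be passed to `nlp` separately:

```python
def chunk_texts(texts, max_chars):
    """Hypothetical helper: group texts into space-joined chunks of at most
    max_chars characters. A single text longer than max_chars becomes its
    own (oversized) chunk."""
    chunk, size = [], 0
    for text in texts:
        extra = len(text) + (1 if chunk else 0)  # +1 for the joining space
        if chunk and size + extra > max_chars:
            yield " ".join(chunk)                # current chunk is full, emit it
            chunk, size = [], 0
            extra = len(text)
        chunk.append(text)
        size += extra
    if chunk:
        yield " ".join(chunk)                    # emit whatever is left

lines = ["Forget it.", "Cameron.", "Seems like she could get a date easy enough..."]
chunks = list(chunk_texts(lines, max_chars=20))
print(chunks)
```

Chunks are cut only between dialog lines, so no line is split in the middle.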

In [0]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

import warnings
warnings.filterwarnings('ignore')

In [0]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'cornell_movie_dialogs'

In [0]:
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
dialog0 = pd.read_sql_query('select * FROM dialogs',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [0]:
dialog=dialog0.copy()

In [0]:
dialog.head(10)

Unnamed: 0,index,dialogs
0,0,Can we make this quick? Roxanne Korrine and A...
1,1,"Well, I thought we'd start with pronunciation,..."
2,2,Not the hacking and gagging and spitting part....
3,3,Okay... then how 'bout we try out some French ...
4,4,You're asking me out. That's so cute. What's ...
5,5,Forget it.
6,6,"No, no, it's my fault -- we didn't have a prop..."
7,7,Cameron.
8,8,"The thing is, Cameron -- I'm at the mercy of a..."
9,9,Seems like she could get a date easy enough...


In [0]:
dialog.shape

(304446, 2)

In [0]:
dialog['dialogs'][0]

'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.'

In [0]:
dialog2=dialog['dialogs']

In [0]:
dialog2[0]

'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.'

In [0]:
example_sentence = dialog.loc[2]
print("Here is an example: \n{}\n".format(example_sentence))

Here is an example: 
index                                                      2
dialogs    Not the hacking and gagging and spitting part....
Name: 2, dtype: object



In [0]:
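# note: example_sentence is a whole DataFrame row (a Series with the fields 'index' and 'dialogs'),
# so iterating over it yields the 2 field values, not the words of the dialog
example_words = [token for token in example_sentence]
unique_words = set(example_words)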

print(("There are {} words in this sentence, and {} of them are"
       " unique.").format(len(example_words), len(unique_words)))

There are 2 words in this sentence, and 2 of them are unique.
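
The count of 2 above comes from iterating over the row itself, which has two fields, rather than over the words of the dialog text. A minimal sketch of counting the words in the text of that row, using a simple regex instead of spaCy (the regex is an assumption about what counts as a word):

```python
import re

# Full text of the dialog line in row 2.
text = "Not the hacking and gagging and spitting part.  Please."

# Treat runs of letters and apostrophes as words.
words = re.findall(r"[A-Za-z']+", text)
unique_words = {w.lower() for w in words}

print("There are {} words in this sentence, and {} of them are unique.".format(
    len(words), len(unique_words)))
```

In the notebook itself, the equivalent would be to take the `dialogs` field of the row and tokenize that string.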


In [0]:
#df['new_col'] = df['text'].apply(lambda x: nlp(x))

In [0]:
pattern = r"\[.*?\]"

alice = re.sub(pattern, "", alice)

In [0]:
len(dialog)

304446

**ADD ? TO PATTERN**

In [0]:
pattern = r"\[\?.*?\]"

In [0]:
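# assign through a single .loc call; dialog.loc[0][1] = ... would modify a temporary copy of the row
dialog.loc[0, 'dialogs'] = re.sub(pattern, "", dialog.loc[0, 'dialogs'])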


In [0]:
test=re.sub(pattern,"",dialog.loc[0][1])

In [0]:
test

'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.'

In [0]:
pattern2=r'[\?."(,*)!]'

In [0]:
test2=re.sub(r'[,@\'?\.$%_]',"",dialog.loc[1][1],flags=re.I)

In [0]:
pattern3=r'[-,@\'?\.$%_]'

In [0]:
print(test2)

Well I thought wed start with pronunciation if thats okay with you


In [0]:
# A-Z (not A-z): the range A-z would also keep the characters [ \ ] ^ _ and `
test3=re.sub(r'[^a-zA-Z0-9\s]',"",dialog.loc[1][1],flags=re.I)

In [0]:
print(test3)

Well I thought wed start with pronunciation if thats okay with you


**XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX**

**THIS IS THE TICKET**

**SOMETHING I DID NOT KNOW**

**dialogs_doc = nlp(" ".join(dialogs_df.dialogs))**

**XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX**

**REPLACE CONTRACTIONS WITH FULL WORD-- DO THIS FIRST!**


---



**THIS CONTRACTION UTILITY ASSUMES EVERYTHING IS lowercase**



In [0]:
contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}

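def _get_contractions(contraction_dict):
    # try longer keys first so that "I'd've" is not consumed by the shorter "I'd",
    # and escape the keys so they are matched literally
    keys = sorted(contraction_dict, key=len, reverse=True)
    contraction_re = re.compile('(%s)' % '|'.join(re.escape(key) for key in keys))
    return contraction_dict, contraction_re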

contractions, contractions_re = _get_contractions(contraction_dict)

def replace_contractions(text):
    def replace(match):
        return contractions[match.group(0)]
    return contractions_re.sub(replace, text)

# Usage
replace_contractions("this's a text with  isn't  didn't contraction")

'this is a text with  is not  did not contraction'
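
Because the lookup is case-sensitive, capitalized contractions such as "You're" are not in the table and stay as they are. One possible variant, sketched here with a small subset of the table (the names `CONTRACTIONS` and `expand_contractions` are hypothetical), matches regardless of case and keeps a leading capital:

```python
import re

# Small hypothetical subset of the contraction table, all lowercase.
CONTRACTIONS = {"you're": "you are", "that's": "that is",
                "what's": "what is", "we'd": "we would"}

# Longest keys first, escaped, on word boundaries, ignoring case.
_keys = sorted(CONTRACTIONS, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b",
                      flags=re.IGNORECASE)

def expand_contractions(text):
    def replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        # keep a leading capital, e.g. "You're" -> "You are"
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _pattern.sub(replace, text)

expanded_text = expand_contractions("You're asking me out.  That's so cute. What's your name again?")
print(expanded_text)
```

The word-boundary anchors would need adjusting for keys that start with an apostrophe, such as "'cause", which this subset leaves out.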

**NEW VERSION**

In [0]:
what=(" ".join(dialog.dialogs))

In [0]:
type(what)

str

In [0]:
what[:400]

"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again. Well, I thought we'd start with pronunciation, if that's okay with you. Not the hacking and gagging and spitting part.  Please. Okay... then how 'bout we try out some French cuisine.  Saturday?  Night? You're asking me out.  That's so cute. What's your name again? F"

In [0]:
dialog_test =replace_contractions(" ".join(dialog.dialogs))

In [0]:
nlp = spacy.load('en', disable=['parser', 'ner'])
#necessary to avoid memory error of SpaCy
nlp.max_length = 20000000

dialogs_doc = nlp(" ".join(dialog.dialogs))


In [0]:
dialog_test[:400]

"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again. Well, I thought we would start with pronunciation, if that is okay with you. Not the hacking and gagging and spitting part.  Please. Okay... then how 'bout we try out some French cuisine.  Saturday?  Night? You're asking me out.  That's so cute. What's your name aga"

**NEXT APPLY REGEX**

In [0]:
pattern3=r'[-,@\'?\.$%_]'

In [0]:
# note: this starts again from the raw dialogs, so the contraction replacement above is not carried over
dialog_test=[re.sub(pattern3,"",text) for text in dialog['dialogs']]

In [0]:
for row in dialog_test[:5]:
  print(row)

Can we make this quick  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break up on the quad  Again
Well I thought wed start with pronunciation if thats okay with you
Not the hacking and gagging and spitting part  Please
Okay then how bout we try out some French cuisine  Saturday  Night
Youre asking me out  Thats so cute Whats your name again


In [0]:
dialog_test[0]

'Can we make this quick  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break up on the quad  Again'

In [0]:
type(dialog_test)

list

**GET RID OF WHITESPACE**

In [0]:
dialog_test=[' '.join(text.split()) for text in dialog_test]

In [0]:
for row in dialog_test[:5]:
  print(row)

Can we make this quick Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break up on the quad Again
Well I thought wed start with pronunciation if thats okay with you
Not the hacking and gagging and spitting part Please
Okay then how bout we try out some French cuisine Saturday Night
Youre asking me out Thats so cute Whats your name again


  **TOKENIZE**

In [0]:
nlp = spacy.load('en',disable=['parser','ner'])

# all the processing work is done below, so it may take a while
alice_doc = nlp(alice)

In [0]:
dialog_test_doc=[nlp(text) for text in dialog_test]

In [0]:
len(dialog_test_doc[0])

22

In [0]:
len(dialog_test_doc[1])

14

In [0]:
# let's explore the objects we've built.
print("The dialog_test_doc  object is a {} object.".format(type(dialog_test_doc)))
print("It is {} tokens long".format(len(dialog_test_doc)))
print("The first three tokens are '{}'".format(dialog_test_doc[:3]))
print("The type of each token is {}".format(type(dialog_test_doc[0])))

The dialog_test_doc  object is a <class 'list'> object.
It is 304446 tokens long
The first three tokens are '[Can we make this quick Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break up on the quad Again, Well I thought wed start with pronunciation if thats okay with you, Not the hacking and gagging and spitting part Please]'
The type of each token is <class 'spacy.tokens.doc.Doc'>


In [0]:
dialog_test_doc[0]

Can we make this quick Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break up on the quad Again

In [0]:
dialog_test_doc[1]

Well I thought wed start with pronunciation if thats okay with you

  **REMOVING STOPWORDS**

In [0]:
dialog_without_stopwords = [[token for token in doc if not token.is_stop] for doc in dialog_test_doc]

In [0]:
len(dialog_without_stopwords)

304446

In [0]:
tempo=[token for token in dialog_test_doc[0] if not token.is_stop]

In [0]:
tempo

[quick,
 Roxanne,
 Korrine,
 Andrew,
 Barrett,
 having,
 incredibly,
 horrendous,
 public,
 break,
 quad]

In [0]:
#dialog_without_stopwords

**I NEED TO MAKE ONE LIST OUT OF DIALOG_WITHOUT_STOPWORDS**

**THEN I CAN FEED LIST INTO WORD_FREQUENCIES()**

In [0]:
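# flatten the per-dialog token lists into one list
# (despite its name, this list holds the tokens left after stopword removal)
stop_words_big_list=[]
for tokens in dialog_without_stopwords:
    stop_words_big_list.extend(tokens)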

In [0]:
len(stop_words_big_list)

1458491
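
The same flattening can be written with `itertools.chain.from_iterable`. This is a sketch on plain strings standing in for the per-dialog token lists (the sample data is hypothetical):

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for dialog_without_stopwords: one list of words per dialog line.
per_line = [["quick", "public", "break"],
            ["thought", "start", "okay"],
            ["okay", "French", "cuisine"]]

# Flatten the list of lists into one list, then count.
flat = list(chain.from_iterable(per_line))
top = Counter(flat).most_common(1)

print(len(flat), top)
```

`Counter` also accepts the chained iterator directly, so the intermediate list can be skipped when only the counts are needed.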

In [0]:
stop_words_big_list[:20]

[quick,
 Roxanne,
 Korrine,
 Andrew,
 Barrett,
 having,
 incredibly,
 horrendous,
 public,
 break,
 quad,
 thought,
 d,
 start,
 pronunciation,
 s,
 okay,
 hacking,
 gagging,
 spitting]

In [0]:
dialog_word_freq=word_frequencies(stop_words_big_list).most_common(10)

In [0]:
dialog_word_freq

[('nt', 55255),
 ('s', 32046),
 ('m', 22404),
 ('know', 21378),
 ('like', 13691),
 ('got', 12653),
 ('want', 10791),
 ('ve', 10667),
 ('think', 10397),
 ('going', 8762)]

In [0]:
len(dialog_without_stopwords)

304446

In [0]:
# utility function to calculate how frequently words appear in the text
def word_frequencies(text):
    
    # build a list of words
    # strip out punctuation
    words = []
    for token in text:
        if not token.is_punct:
            words.append(token.text)
            
    # build and return a Counter object containing word counts
    return Counter(words)

# instantiate our list of most common words.
#alice_word_freq = word_frequencies(alice_without_stopwords).most_common(10)
#print('\nAlice:', alice_word_freq)


In [0]:
word_frequencies(dialog_without_stopwords[0])

Counter({'Andrew': 1,
         'Barrett': 1,
         'Korrine': 1,
         'Roxanne': 1,
         'break': 1,
         'having': 1,
         'horrendous': 1,
         'incredibly': 1,
         'public': 1,
         'quad': 1,
         'quick': 1})

In [0]:
dialog_word_frequency=[word_frequencies(tokens) for tokens in dialog_without_stopwords]

In [0]:
dialog_word_frequency[:5]

[Counter({'Andrew': 1,
          'Barrett': 1,
          'Korrine': 1,
          'Roxanne': 1,
          'break': 1,
          'having': 1,
          'horrendous': 1,
          'incredibly': 1,
          'public': 1,
          'quad': 1,
          'quick': 1}),
 Counter({'d': 1,
          'okay': 1,
          'pronunciation': 1,
          's': 1,
          'start': 1,
          'thought': 1}),
 Counter({'gagging': 1, 'hacking': 1, 'spitting': 1}),
 Counter({'French': 1,
          'Night': 1,
          'Okay': 1,
          'Saturday': 1,
          'bout': 1,
          'cuisine': 1,
          'try': 1}),
 Counter({'asking': 1, 'cute': 1, 's': 2})]

**//////////////////////////////////////////////////////////**

In [0]:
#list(nlp(dialog.loc[0][1]))

In [0]:
test_doc = nlp(dialog.loc[0][1])

In [0]:
test_doc

Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.

In [0]:
testy=list(test_doc.sents)

In [0]:
example_sentence=testy[2]

In [0]:
example_words = [token for token in example_sentence if not token.is_punct]
unique_words = set([token.text for token in example_words])

print(("There are {} words in this sentence, and {} of them are"
       " unique.").format(len(example_words), len(unique_words)))

There are 1 words in this sentence, and 1 of them are unique.


##**TWITTER**

In [0]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'twitter_sentiment'

##The data is in the table called "twitter"

In [0]:
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
twitter0 = pd.read_sql_query('select * FROM twitter',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [0]:
twit=twitter0.copy()

In [0]:
twit.shape

(14640, 16)

In [0]:
twit.head()

Unnamed: 0,index,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [0]:
list(twit)

['index',
 'tweet_id',
 'airline_sentiment',
 'airline_sentiment_confidence',
 'negativereason',
 'negativereason_confidence',
 'airline',
 'airline_sentiment_gold',
 'name',
 'negativereason_gold',
 'retweet_count',
 'text',
 'tweet_coord',
 'tweet_created',
 'tweet_location',
 'user_timezone']

In [0]:
nlp = spacy.load('en', disable=['parser', 'ner'])

# below is necessary to avoid memory error of SpaCy
nlp.max_length = 20000000

# all the processing work is done below, so it may take a while
twit_doc = nlp(" ".join(twit.text))

In [0]:
# let's explore the objects we've built.
print("The twit_doc object is a {} object.".format(type(twit_doc)))
print("It is {} tokens long".format(len(twit_doc)))
print("The first three tokens are '{}'".format(twit_doc[:3]))
print("The type of each token is {}".format(type(twit_doc[0])))

The twit_doc object is a <class 'spacy.tokens.doc.Doc'> object.
It is 307328 tokens long
The first three tokens are '@VirginAmerica What @dhepburn'
The type of each token is <class 'spacy.tokens.token.Token'>


In [0]:
# removing the stopwords
twit_docW = [token for token in twit_doc if not token.is_stop]

In [0]:
# let's explore the without stop words objects we've built.
print("The twit_doc object is a {} object.".format(type(twit_docW)))
print("It is {} tokens long".format(len(twit_docW)))
print("The first three tokens are '{}'".format(twit_docW[:3]))
print("The type of each token is {}".format(type(twit_docW[0])))

The twit_doc object is a <class 'list'> object.
It is 178303 tokens long
The first three tokens are '[@VirginAmerica, @dhepburn, said]'
The type of each token is <class 'spacy.tokens.token.Token'>
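
The first tokens above are user mentions, which would dominate any frequency list for this corpus. A minimal sketch of skipping them when counting, on plain strings (the token list below extends the output above with made-up extras):

```python
from collections import Counter

# First tokens from the output above, plus hypothetical extras for illustration.
tokens = ["@VirginAmerica", "@dhepburn", "said", "@VirginAmerica", "flight", "said"]

# Count only tokens that are not @mentions.
counts = Counter(t for t in tokens if not t.startswith("@"))

print(counts.most_common(2))
```

Whether to drop mentions depends on the goal; for sentence generation they are mostly noise, while for per-airline analysis they identify the airline being addressed.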


In [0]:
# lemmatization
lemmas = [token.lemma_ for token in twit_docW]

In [0]:
type(lemmas)

list

In [0]:
# let's explore the LEMMA objects we've built.
print("The LEMMAS list  {} object.".format(type(lemmas)))
print("It is {} tokens long".format(len(lemmas)))
print("The first three tokens are '{}'".format(lemmas[:3]))
print("The type of each token is {}".format(type(lemmas[0])))

The LEMMAS list  <class 'list'> object.
It is 178303 tokens long
The first three tokens are '['@VirginAmerica', '@dhepburn', 'say']'
The type of each token is <class 'str'>
