# Introduction to text analysis III #

## Tokenization part II ##

**working with text**

Consider the following text (shown first as raw notebook text):

In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

The CSV file format is not standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field. Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.

In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. These alternate delimiter-separated files are often even given a .csv extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character.

Store it in a single variable:

In [2]:
text = """In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

The CSV file format is not standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field. Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.

In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. These alternate delimiter-separated files are often even given a .csv extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character.

"""

How do we extract paragraphs, and then sentences? A first attempt is to split on the newline character:

In [3]:
paragraphs = text.split("\n")

print(paragraphs[0])

In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.


In [4]:
print(paragraphs[1])  # why is this empty? the blank line between two paragraphs becomes an empty string




In [5]:
paragraphs = text.split("\n\n")  # fix: split on the double newline that separates paragraphs

print(paragraphs[1])

The CSV file format is not standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field. Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.
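
The difference between the two splits is easier to see on a short string. The sketch below uses a made-up `sample` (not the notebook's `text`) with the same layout: paragraphs separated by a blank line, plus trailing newlines.

```python
# Hypothetical stand-in for `text`: two paragraphs separated by a blank line.
sample = "First paragraph.\n\nSecond paragraph.\n\n"

# Splitting on a single newline yields an empty string for every blank line.
by_single = sample.split("\n")
print(by_single)

# Splitting on the double newline keeps paragraphs together,
# but the trailing "\n\n" still leaves one empty string at the end.
by_double = sample.split("\n\n")
print(by_double)

# Dropping the empty pieces gives the paragraphs only.
paragraphs_clean = [p for p in by_double if p.strip()]
print(paragraphs_clean)
```

Note that even the double-newline split leaves an empty last element, so some filtering is usually needed.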


**a better solution**

In [5]:
import re

sp = re.split(r'\n+', text)  # split on runs of one or more newlines using a regular expression
print(sp)

['In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.', 'The CSV file format is not standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field. Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.', 'In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters. These include tab-separated values and space-separated values. A delimiter that is not present 
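
Why is the regular expression more robust? A small sketch, again on a made-up `sample`, where the number of blank lines between paragraphs is not constant. The variable names here are illustrative only.

```python
import re

# Hypothetical sample: paragraphs separated by two OR three newlines.
sample = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph.\n\n"

# A fixed "\n\n" separator leaves a stray newline in front of the third paragraph.
fixed = sample.split("\n\n")
print(fixed)

# "\n+" matches any run of newlines, so every separator is consumed whole.
parts = re.split(r"\n+", sample)
print(parts)  # still ends with an empty string because of the trailing newlines

# Stripping the text first (and filtering) removes the empty piece.
paragraphs_clean = [p for p in re.split(r"\n+", sample.strip()) if p]
print(paragraphs_clean)
```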

**splitting sentences the simple way**

In [6]:
for s in sp:
    sentences = s.split(".")
    print(sentences)
    break

['In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text', ' Each line of the file is a data record', ' Each record consists of one or more fields, separated by commas', ' The use of the comma as a field separator is the source of the name for this file format', '']
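
The simple split has visible problems: the full stops are lost, the pieces keep their leading space, and an empty string appears at the end. It also breaks on any dot that is not a sentence end, such as the one in `.csv`. The sketch below compares it with a slightly better regular expression that splits on whitespace following `.`, `!` or `?`. This is only an illustration on a made-up `paragraph`, and it would still fail on abbreviations followed by a space.

```python
import re

# Hypothetical paragraph containing a dot that does not end a sentence.
paragraph = ("These files are often given a .csv extension. "
             "This loose terminology can cause problems. Is that a surprise?")

# Naive approach: every dot is treated as a sentence boundary.
naive = paragraph.split(".")
print(naive)

# Lookbehind: split on whitespace only when it comes right after ., ! or ?
# The punctuation stays attached to its sentence.
better = re.split(r"(?<=[.!?])\s+", paragraph)
print(better)
```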


## Tokenization the professional way ##

**using NLTK (Natural Language Toolkit)**

A powerful and well-documented library for text analysis.

In [8]:
import nltk 
from nltk.tokenize import sent_tokenize

**to use the sentence tokenizer we need the pre-trained English "punkt" tokenizer models, downloaded once with nltk.download('punkt') (some recent NLTK releases ask for 'punkt_tab' instead)**

In [9]:
sentences = sent_tokenize(text)

In [10]:
for s in sentences:
    print(s)

In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text.
Each line of the file is a data record.
Each record consists of one or more fields, separated by commas.
The use of the comma as a field separator is the source of the name for this file format.
The CSV file format is not standardized.
The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks.
CSV implementations may not handle such field data, or they may use quotation marks to surround the field.
Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.
In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters.
These include tab-separated values and space-separated values.
A delimiter that is not present in the f

NLTK is smart enough to handle several complex cases.

* Dealing with acronyms

In [12]:
complex_case = "This is a sentence. In this one there is an acronym N.A.S.A to treat like a sentence."
sentences = sent_tokenize(complex_case)

In [13]:
for c in sentences:
    print(c)

This is a sentence.
In this one there is an acronym N.A.S.A to treat like a sentence.
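
For comparison, here is what the simple `split(".")` approach from earlier does with the same string. This is a minimal sketch; the variable `naive` is introduced only for this illustration.

```python
complex_case = "This is a sentence. In this one there is an acronym N.A.S.A to treat like a sentence."

# Split on every dot, then drop empty pieces and surrounding spaces.
naive = [piece.strip() for piece in complex_case.split(".") if piece.strip()]

for piece in naive:
    print(piece)

# The acronym is cut into separate fragments instead of staying in one sentence.
print(len(naive))
```

The naive split returns five fragments where the sentence tokenizer found two sentences.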


**tokenizing words**

In [14]:
from nltk.tokenize import word_tokenize

In [15]:
tokens = word_tokenize(text)

In [16]:
print(tokens)

['In', 'computing', ',', 'a', 'comma-separated', 'values', '(', 'CSV', ')', 'file', 'stores', 'tabular', 'data', '(', 'numbers', 'and', 'text', ')', 'in', 'plain', 'text', '.', 'Each', 'line', 'of', 'the', 'file', 'is', 'a', 'data', 'record', '.', 'Each', 'record', 'consists', 'of', 'one', 'or', 'more', 'fields', ',', 'separated', 'by', 'commas', '.', 'The', 'use', 'of', 'the', 'comma', 'as', 'a', 'field', 'separator', 'is', 'the', 'source', 'of', 'the', 'name', 'for', 'this', 'file', 'format', '.', 'The', 'CSV', 'file', 'format', 'is', 'not', 'standardized', '.', 'The', 'basic', 'idea', 'of', 'separating', 'fields', 'with', 'a', 'comma', 'is', 'clear', ',', 'but', 'that', 'idea', 'gets', 'complicated', 'when', 'the', 'field', 'data', 'may', 'also', 'contain', 'commas', 'or', 'even', 'embedded', 'line-breaks', '.', 'CSV', 'implementations', 'may', 'not', 'handle', 'such', 'field', 'data', ',', 'or', 'they', 'may', 'use', 'quotation', 'marks', 'to', 'surround', 'the', 'field', '.', 'Quo
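
To get a feeling for what a word tokenizer does, the sketch below approximates it with a single regular expression and then counts the word tokens. The pattern is a rough assumption made for this example, not the algorithm NLTK uses, and it will differ from `word_tokenize` on contractions, numbers and other special cases.

```python
import re
from collections import Counter

sample = ("In computing, a comma-separated values (CSV) file stores "
          "tabular data (numbers and text) in plain text.")

# Hypothetical pattern: a word (optionally hyphenated) OR one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

tokens = TOKEN_PATTERN.findall(sample)
print(tokens)

# Count word tokens only (skip punctuation), ignoring case.
counts = Counter(t.lower() for t in tokens if t[0].isalnum())
print(counts.most_common(3))
```

Once the text is a list of tokens, simple statistics such as word frequencies become one-liners.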

**a specialized tokenizer for tweets**

In [17]:
from nltk.tokenize import TweetTokenizer

s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"

model = TweetTokenizer()

tokens = model.tokenize(s0)

print(tokens)

['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']


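Notice how the tokenizer kept the hashtag and the emoticons (`:-)`, `:-P`, `<3`) as single tokens instead of splitting them into punctuation characters.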

### Exercises ###

* tokenize a dataset of tweets
* count and interpret the emoticons
* estimate a simple polarity (positive or negative) from the emoticons
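
A possible starting point for the last two exercises, as a sketch only. It reuses the token list printed for the example tweet above, so it runs without NLTK. The `EMOTICON_POLARITY` dictionary is a tiny, hypothetical lexicon invented for this example; a real analysis would need a much larger one.

```python
from collections import Counter

# Tokens as printed by TweetTokenizer for the example tweet above.
tokens = ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P',
          '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']

# Hypothetical lexicon: +1 for positive emoticons, -1 for negative ones.
EMOTICON_POLARITY = {':-)': 1, ':)': 1, ':-P': 1, '<3': 1, ':-(': -1, ':(': -1}

# Count only the tokens that appear in the lexicon.
emoticon_counts = Counter(t for t in tokens if t in EMOTICON_POLARITY)
print(emoticon_counts)

# Polarity score: sum of the polarities, weighted by how often each emoticon occurs.
score = sum(EMOTICON_POLARITY[e] * n for e, n in emoticon_counts.items())
print(score)
```

A positive score suggests a positive tweet and a negative score a negative one; a score of zero means the emoticons give no signal.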
    