## Text Preprocessing Exercise in Tokenisation

#### Introduction
Let's load a data set of news clippings written in English from a local CSV file.


In [103]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams
from nltk.corpus import stopwords
import pandas as pd
import re 
import string

In [104]:
news_corpus = pd.read_csv('./data/news_clipping.csv',encoding='utf-8')
news_corpus.head(15)

Unnamed: 0,article_nr,news_snippets
0,AP880224-0195,The Bechtel Group Inc. offered in 1985 to sell...
1,AP881017-0144,A gunman took a 74-year-old woman hostage afte...
2,AP881017-0219,"Today is Saturday, Oct. 29, the 303rd day of 1..."
3,AP900117-0022,Cupid has a new message for lovers this Valent...
4,AP880405-0167,The Reagan administration is weighing whether ...
5,AP880825-0239,"More than 120,000 skins of a protected species..."
6,AP880325-0232,There will be no organized union boost behind ...
7,AP880908-0056,Here is a summary of developments in forest an...
8,AP881105-0097,"Jean-Pierre Stirbois, the No. 2 man in the ext..."
9,AP880716-0112,"At least 15 people died and 25,000 residents o..."


#### Task 1

1. Tokenize each news clipping into sentences.
2. Save the list of sentences in a new column, sentence_tokens, within the same dataframe.

In [105]:
# your answers

# We can directly apply sent_tokenize to the text

news_corpus['sentence_tokens'] = news_corpus.news_snippets.map(sent_tokenize)



In [106]:
# this is to check that there are new entries made in the column sentence_tokens
news_corpus.head()

Unnamed: 0,article_nr,news_snippets,sentence_tokens
0,AP880224-0195,The Bechtel Group Inc. offered in 1985 to sell...,[The Bechtel Group Inc. offered in 1985 to sel...
1,AP881017-0144,A gunman took a 74-year-old woman hostage afte...,[A gunman took a 74-year-old woman hostage aft...
2,AP881017-0219,"Today is Saturday, Oct. 29, the 303rd day of 1...","[Today is Saturday, Oct. 29, the 303rd day of ..."
3,AP900117-0022,Cupid has a new message for lovers this Valent...,[Cupid has a new message for lovers this Valen...
4,AP880405-0167,The Reagan administration is weighing whether ...,[The Reagan administration is weighing whether...


In [107]:
# this is to check the contents of one row of the column sentence_tokens

my_sentences = news_corpus['sentence_tokens'][0]

print("number of sentences ", len(my_sentences))  # count the number of sentences
print(my_sentences)       # prints out the sentence tokens


number of sentences  28
['The Bechtel Group Inc. offered in 1985 to sell oil to Israel at a discount of at least $650 million for 10 years if it promised not to bomb a proposed Iraqi pipeline, a Foreign Ministry official said Wednesday.', "But then-Prime Minister Shimon Peres said the offer from Bruce Rappaport, a partner in the San Francisco-based construction and engineering company, was ``unimportant,'' the senior official told The Associated Press.", 'Peres, now foreign minister, never discussed the offer with other government ministers, said the official, who spoke on condition of anonymity.', "The comments marked the first time Israel has acknowledged any offer was made for assurances not to bomb the planned $1 billion pipeline, which was to have run near Israel's border with Jordan.", 'The pipeline was never built.', "In San Francisco, Tom Flynn, vice president for public relations for the Bechtel Group, said the company did not make any offer to Peres but that Rappaport, a Swis
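
The output above keeps "Inc. offered" inside a single sentence. To see why a trained sentence tokenizer is used here rather than a plain split on punctuation, the following minimal sketch may help. It uses only the standard library; `naive_sentence_split` is a hypothetical helper, not part of NLTK, and the sample text is a shortened version of the clipping above, made up for illustration.

```python
import re

def naive_sentence_split(text):
    """Hypothetical helper: split after '.', '!' or '?' when whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Shortened sample based on the first clipping (an assumption for illustration)
sample = "The Bechtel Group Inc. offered to sell oil. The pipeline was never built."

naive_sentences = naive_sentence_split(sample)
for s in naive_sentences:
    print(repr(s))
```

The naive rule treats the period in "Inc." as a sentence boundary, so the first sentence is cut in two. That kind of abbreviation is what makes a plain punctuation split unreliable on news text.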

#### Task 2
1. Tokenize each sentence further into unigrams (words).
2. Save the resulting lists of word tokens, one list per sentence, in a new column, sentence_with_words_tokens, within the same dataframe.

In [108]:
# Loop through the list of sentences and tokenize each one into words

def sentences_to_word_tokens(sentences):
    # input: sentences is a list of strings (one string per sentence)
    # return: a list of lists, one list of word tokens per sentence
    word_tokens = []
    for s in sentences:
        word_tokens.append(word_tokenize(s))
    return word_tokens

news_corpus['sentence_with_words_tokens'] = news_corpus.sentence_tokens.map(sentences_to_word_tokens)

In [109]:
news_corpus.head()

Unnamed: 0,article_nr,news_snippets,sentence_tokens,sentence_with_words_tokens
0,AP880224-0195,The Bechtel Group Inc. offered in 1985 to sell...,[The Bechtel Group Inc. offered in 1985 to sel...,"[[The, Bechtel, Group, Inc., offered, in, 1985..."
1,AP881017-0144,A gunman took a 74-year-old woman hostage afte...,[A gunman took a 74-year-old woman hostage aft...,"[[A, gunman, took, a, 74-year-old, woman, host..."
2,AP881017-0219,"Today is Saturday, Oct. 29, the 303rd day of 1...","[Today is Saturday, Oct. 29, the 303rd day of ...","[[Today, is, Saturday, ,, Oct., 29, ,, the, 30..."
3,AP900117-0022,Cupid has a new message for lovers this Valent...,[Cupid has a new message for lovers this Valen...,"[[Cupid, has, a, new, message, for, lovers, th..."
4,AP880405-0167,The Reagan administration is weighing whether ...,[The Reagan administration is weighing whether...,"[[The, Reagan, administration, is, weighing, w..."


In [110]:
# Let's compare sentence_tokens and sentence_with_words_tokens for the first row, and the first sentence

print("The full sentence is : \n" , news_corpus.sentence_tokens[0][0])
print("\nThe sentence is tokenized into : \n",  news_corpus.sentence_with_words_tokens[0][0])

The full sentence is : 
 The Bechtel Group Inc. offered in 1985 to sell oil to Israel at a discount of at least $650 million for 10 years if it promised not to bomb a proposed Iraqi pipeline, a Foreign Ministry official said Wednesday.

The sentence is tokenized into : 
 ['The', 'Bechtel', 'Group', 'Inc.', 'offered', 'in', '1985', 'to', 'sell', 'oil', 'to', 'Israel', 'at', 'a', 'discount', 'of', 'at', 'least', '$', '650', 'million', 'for', '10', 'years', 'if', 'it', 'promised', 'not', 'to', 'bomb', 'a', 'proposed', 'Iraqi', 'pipeline', ',', 'a', 'Foreign', 'Ministry', 'official', 'said', 'Wednesday', '.']
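
To get a feel for what word tokenization involves, here is a rough standard-library approximation. It is only a sketch: `simple_word_tokens` is a hypothetical helper, its regular expression is an assumption chosen for illustration, and it is not the algorithm behind `word_tokenize`. The sample sentence is a shortened version of the one above.

```python
import re

def simple_word_tokens(sentence):
    """Hypothetical helper: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# Shortened sample based on the sentence above (an assumption for illustration)
sample = "The Bechtel Group Inc. offered at least $650 million."

tokens = simple_word_tokens(sample)
print(tokens)
```

Like the output above, this separates '$' from '650' and splits off the final period. Unlike the output above, it also breaks 'Inc.' into 'Inc' and '.', which shows that abbreviations need more care than a single pattern provides.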


#### Task 3
1. Tokenize each sentence further into bigrams.
2. Save the resulting lists of bigrams, one list per sentence, in a new column, sentence_with_bigram, within the same dataframe.

In [111]:
# Loop through the list of sentences and build the bigrams of each one

def sentences_to_bigrams(sentences):
    # input: sentences is a list of strings (one string per sentence)
    # return: a list of lists, one list of bigram tuples per sentence
    bigrams = []
    for s in sentences:
        word_tokens = word_tokenize(s)              # split the sentence into words
        bigram_list = list(ngrams(word_tokens, 2))  # pair each token with its successor
        bigrams.append(bigram_list)
    return bigrams

# For a single string my_text, the equivalent is: list(ngrams(word_tokenize(my_text), 2))

news_corpus['sentence_with_bigram'] = news_corpus.sentence_tokens.map(sentences_to_bigrams)


In [112]:
# Let's compare the sentences, sentence tokens, word tokens and bigram tokens for the first row, and the first sentence

print("The full sentence is : \n" , news_corpus.sentence_tokens[0][0])
print("\nThe sentence is tokenized into words: \n",  news_corpus.sentence_with_words_tokens[0][0])
print("\nThe sentence is tokenized into bigrams : \n",  news_corpus.sentence_with_bigram[0][0])

The full sentence is : 
 The Bechtel Group Inc. offered in 1985 to sell oil to Israel at a discount of at least $650 million for 10 years if it promised not to bomb a proposed Iraqi pipeline, a Foreign Ministry official said Wednesday.

The sentence is tokenized into words: 
 ['The', 'Bechtel', 'Group', 'Inc.', 'offered', 'in', '1985', 'to', 'sell', 'oil', 'to', 'Israel', 'at', 'a', 'discount', 'of', 'at', 'least', '$', '650', 'million', 'for', '10', 'years', 'if', 'it', 'promised', 'not', 'to', 'bomb', 'a', 'proposed', 'Iraqi', 'pipeline', ',', 'a', 'Foreign', 'Ministry', 'official', 'said', 'Wednesday', '.']

The sentence is tokenized into bigrams : 
 [('The', 'Bechtel'), ('Bechtel', 'Group'), ('Group', 'Inc.'), ('Inc.', 'offered'), ('offered', 'in'), ('in', '1985'), ('1985', 'to'), ('to', 'sell'), ('sell', 'oil'), ('oil', 'to'), ('to', 'Israel'), ('Israel', 'at'), ('at', 'a'), ('a', 'discount'), ('discount', 'of'), ('of', 'at'), ('at', 'least'), ('least', '$'), ('$', '650'), ('650', '

In [113]:
# Let's compare the sentence, its word tokens and its bigram tokens for the 11th row and the third sentence

print("The full sentence is : \n" , news_corpus.sentence_tokens[10][2])
print("\nThe sentence is tokenized into words: \n",  news_corpus.sentence_with_words_tokens[10][2])
print("\nThe sentence is tokenized into bigrams : \n",  news_corpus.sentence_with_bigram[10][2])

The full sentence is : 
 He was the epitome of elegance and high standards.

The sentence is tokenized into words: 
 ['He', 'was', 'the', 'epitome', 'of', 'elegance', 'and', 'high', 'standards', '.']

The sentence is tokenized into bigrams : 
 [('He', 'was'), ('was', 'the'), ('the', 'epitome'), ('epitome', 'of'), ('of', 'elegance'), ('elegance', 'and'), ('and', 'high'), ('high', 'standards'), ('standards', '.')]
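
The relationship between the word tokens and the bigrams above can be reproduced without NLTK. The sketch below pairs each token with its successors using `zip`; `simple_ngrams` is a hypothetical helper written for illustration, not the implementation of `nltk.util.ngrams`.

```python
def simple_ngrams(tokens, n=2):
    """Hypothetical helper: return every run of n adjacent tokens as a tuple."""
    # tokens[0:], tokens[1:], ... are zipped together; zip stops at the shortest slice
    return list(zip(*(tokens[i:] for i in range(n))))

# Word tokens copied from the output above
tokens = ['He', 'was', 'the', 'epitome', 'of', 'elegance', 'and', 'high', 'standards', '.']

bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)

print(len(tokens), len(bigrams), len(trigrams))
print(bigrams[:3])
```

A sentence with k tokens yields k - 1 bigrams and k - 2 trigrams, which matches the 10 tokens and 9 bigrams shown above.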


#### Task 4 (optional)
1. As you examine the results of the tokenization, you may come across strange combinations of symbols, or characters that do not make sense or that complicate the tokenization effort. In such cases, you may wish to use regular expressions to substitute something else for these symbols or characters.

In [114]:
# your answer
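
One possible starting point, sketched under assumptions: the first clipping contains quotation marks written as two backticks and two apostrophes (as in ``unimportant,''). The hypothetical helper `normalize_quotes` below replaces both markers with a plain double quote using `re.sub`. The helper name, the choice of replacement character and the shortened sample sentence are all assumptions for illustration.

```python
import re

def normalize_quotes(text, replacement='"'):
    """Hypothetical helper: replace `` and '' quote markers with one character."""
    return re.sub(r"``|''", replacement, text)

# Shortened sample based on the first clipping (an assumption for illustration)
sample = "Peres said the offer was ``unimportant,'' the senior official told The Associated Press."

cleaned = normalize_quotes(sample)
print(cleaned)
```

In the notebook, a function like this could be applied to the raw text with `news_corpus.news_snippets.map(normalize_quotes)` before tokenizing. Whether the substitution actually helps depends on the tokenizer used afterwards, so it is worth re-checking the tokens after any clean-up step.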