---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Working With Text

In [1]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

76

In [2]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.

len(text2)

14

In [3]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']
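The empty string at the end of the list comes from the trailing space in `text1`: `split(' ')` splits on every single space, including the last one. A minimal sketch of the difference, using a shortened hypothetical sentence (`sample` is not part of the notebook):

```python
sample = "Ethics are built right "  # hypothetical sentence with a trailing space

# Splitting on a single space keeps an empty string after the final space
with_empty = sample.split(' ')
print(with_empty)     # ['Ethics', 'are', 'built', 'right', '']

# Splitting with no argument splits on runs of whitespace and drops empty strings
without_empty = sample.split()
print(without_empty)  # ['Ethics', 'are', 'built', 'right']
```

Calling `split()` with no argument is usually the safer choice when the text may contain leading, trailing, or repeated whitespace.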

<br>
List comprehension allows us to find specific words:

In [4]:
[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 letters

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [5]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [6]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']

<br>
We can find unique words using `set()`.

In [8]:
text3 = 'To be or not to be'
text4 = text3.split(' ')
uniquewords = set([w.lower() for w in text4])

In [9]:
len(uniquewords)

4

In [10]:
uniquewords

{'be', 'not', 'or', 'to'}

### Processing free-text

In [11]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

<br>
Finding hashtags:

In [12]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

<br>
Finding callouts:

In [13]:
[w for w in text6 if w.startswith('@')]

['@']

In [24]:
# text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
# #UNSG @NY Society for Ethical Culture bit.ly/2guVelr'
text7 = "@UN-@UN_Women-Some-body-cannot-change-things-@ NY Society "
text8 = text7.split('-')
text8

['@UN',
 '@UN_Women',
 'Some',
 'body',
 'cannot',
 'change',
 'things',
 '@ NY Society ']

<br>

We can use regular expressions to help us with more complex parsing. 

For example, `'@[A-Za-z0-9_]+'` matches any text that contains `'@'` followed by one or more of:
* capital letters (`'A-Z'`)
* lowercase letters (`'a-z'`)
* digits (`'0-9'`)
* underscores (`'_'`)

In [28]:
import re # import re - a module that provides support for regular expressions

# The extra \s also allows whitespace after '@', which is why '@ NY Society ' matches
[w for w in text8 if re.search(r'@[A-Za-z0-9_\s]+', w)]

['@UN', '@UN_Women', '@ NY Society ']
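Filtering with `re.search` requires the text to be split into tokens first. As a rough sketch, `re.findall` can pull callouts and hashtags straight out of the raw string instead. The `raw_tweet` string below is a hypothetical example, and the hashtag pattern is an assumption modelled on the callout pattern:

```python
import re

# Hypothetical tweet used only for illustration
raw_tweet = '@UN @UN_Women "Ethics are built right into the ideals" #UNSG @ NY Society'

# '@' must be followed by at least one letter, digit, or underscore,
# so the bare '@' before 'NY' is not reported
mentions = re.findall(r'@[A-Za-z0-9_]+', raw_tweet)
hashtags = re.findall(r'#[A-Za-z0-9_]+', raw_tweet)

print(mentions)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']
```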

### Working with Dates

In [57]:
dates = """23-10-2002
23-10-02
3/10/2002
09/23/2002
23 Oct 2002
23 October 2002
October 23, 2002
23 October, 2002
Oct 23, 2002
September. 15, 2011
6/1998 Primary Care Doctor:\n
1973"""

#dates = '23-10-2002\n23-10-02\n3/10/2002\n09/23/2002\n23 Oct 2002\n23 October 2002\nOctober 23, 2002\nOct 23, 2002'
dates

'23-10-2002\n23-10-02\n3/10/2002\n09/23/2002\n23 Oct 2002\n23 October 2002\nOctober 23, 2002\n23 October, 2002\nOct 23, 2002\nSeptember. 15, 2011\n6/1998 Primary Care Doctor:\n\n1973'

In [58]:
import re
regexFindDatesInNumber = r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}'
re.findall(regexFindDatesInNumber, dates)

['23-10-2002', '23-10-02', '3/10/2002', '09/23/2002']

In [61]:
# All twelve month abbreviations are listed (May and Aug were missing)
regexFindDatesInString = r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z,]* (?:\d{2}, )?\d{2,4}'
re.findall(regexFindDatesInString, dates)

['23 Oct 2002',
 '23 October 2002',
 'October 23, 2002',
 '23 October, 2002',
 'Oct 23, 2002']
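The pattern above does not match `September. 15, 2011`, because the period after the month name is not allowed for. One possible variant, offered as a sketch rather than a complete date grammar, accepts an optional `.` or `,` after the month and one- or two-digit days. The `sample_dates` string and the pattern name are hypothetical:

```python
import re

# Hypothetical sample, one date per line
sample_dates = "23 Oct 2002\nOctober 23, 2002\nSeptember. 15, 2011\n5 May 1999"

# Month name, optional trailing '.' or ',', optional day before or after, 4-digit year
month_date_pattern = (
    r'(?:\d{1,2} )?'
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[.,]? '
    r'(?:\d{1,2}, )?\d{4}'
)

found = re.findall(month_date_pattern, sample_dates)
print(found)
# ['23 Oct 2002', 'October 23, 2002', 'September. 15, 2011', '5 May 1999']
```

This variant requires a four-digit year, which is an assumption made to keep the example unambiguous.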

### Regex with Pandas

In [64]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

                                                text
0     Monday: The doctor's appointment is at 2:45pm.
1  Tuesday: The dentist's appointment is at 11:30...
2  Wednesday: At 7:00pm, there is a basketball game!
3  Thursday: Be back home by 11:15 pm at the latest.
4  Friday: Take the train at 08:10 am, arrive at ...


In [66]:
## Length of the string in each row
df['text'].str.len()

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

In [73]:
## Split each string into whitespace-separated tokens
df['text'].str.split()

0    [Monday:, The, doctor's, appointment, is, at, ...
1    [Tuesday:, The, dentist's, appointment, is, at...
2    [Wednesday:, At, 7:00pm,, there, is, a, basket...
3    [Thursday:, Be, back, home, by, 11:15, pm, at,...
4    [Friday:, Take, the, train, at, 08:10, am,, ar...
Name: text, dtype: object

In [74]:
## Number of tokens in each string
df['text'].str.split().str.len()

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

In [76]:
## Find a specific word
df['text'].str.contains("appointment")

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

In [77]:
## Count number of digits in the string
df['text'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

In [79]:
## Find all digits
df['text'].str.findall(r'\d')

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

In [80]:
## Find all hour:minute times
df['text'].str.findall(r'\d{1,2}:\d{1,2}')

0            [2:45]
1           [11:30]
2            [7:00]
3           [11:15]
4    [08:10, 09:00]
Name: text, dtype: object

In [82]:
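## Replace every weekday name with ???
## regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['text'].str.replace(r'[\w]+day', '???', regex=True)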

0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [83]:
## Replace each weekday name with its first 3 characters (e.g. Monday becomes Mon)
## A callable replacement only works with regex=True
df['text'].str.replace(r'([\w]+day\b)', lambda x: x.groups()[0][:3], regex=True)

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [86]:
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d{1,2}):(\d{1,2})')

    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10


In [92]:
df['text'].str.extractall(r'(\d{1,2}):(\d{1,2})(\s?[ap]m)+')

          0   1    2
  match
0 0       2  45   pm
1 0      11  30   am
2 0       7  00   pm
3 0      11  15   pm
4 0      08  10   am
  1      09  00   am


In [96]:
## Extract with named groups; the group names become the column names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d{1,2}):(?P<minute>\d{1,2})(?P<period>\s?[ap]m))')

             time hour minute period
  match
0 0        2:45pm    2     45     pm
1 0      11:30 am   11     30     am
2 0        7:00pm    7     00     pm
3 0      11:15 pm   11     15     pm
4 0      08:10 am   08     10     am
  1       09:00am   09     00     am
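Named groups work the same way in the standard `re` module, outside of pandas. The sketch below reads them with `finditer` and `groupdict()`. The pattern is a simplified assumption: it requires two-digit minutes and keeps the optional space outside the `period` group.

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# Assumed pattern: hour, two-digit minute, optional space, then am or pm
time_pattern = re.compile(r'(?P<hour>\d{1,2}):(?P<minute>\d{2})\s?(?P<period>[ap]m)')

# Each match exposes its named groups as a dictionary
matches = [m.groupdict() for m in time_pattern.finditer(sentence)]
print(matches)
# [{'hour': '08', 'minute': '10', 'period': 'am'}, {'hour': '09', 'minute': '00', 'period': 'am'}]
```

Note that the captured values are strings, so leading zeros such as `'08'` and `'00'` are preserved until they are explicitly converted to integers.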


In [97]:
text = ['This is dirty TEXT: A phone number +001234561234, moNey 3.333, some date like 09.08.2016 and weird Čárákterš.']
text

['This is dirty TEXT: A phone number +001234561234, moNey 3.333, some date like 09.08.2016 and weird Čárákterš.']

In [113]:
import string

def get_latin(line):
    # Keep only ASCII letters (A-Z, a-z), lowercase them,
    # and print them separated by single spaces
    letters = [c.lower() for c in line if c in string.ascii_letters]
    print(' '.join(letters))

for line in text:
    get_latin(line)

t h i s i s d i r t y t e x t a p h o n e n u m b e r m o n e y s o m e d a t e l i k e a n d w e i r d r k t e r


In [117]:
text = 'Its RainyDay and we are PlayingInCold'
' '.join(re.findall('[A-Z][^A-Z]*', text))

'Its  Rainy Day and we are  Playing In Cold'
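Joining the `findall` pieces with a space produces double spaces wherever a piece already ends in one. One alternative, sketched here as an assumption about the intended result, inserts a space only at each lowercase-to-uppercase boundary using lookarounds:

```python
import re

text = 'Its RainyDay and we are PlayingInCold'

# Zero-width match between a lowercase letter and the uppercase letter that follows it
spaced = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)
print(spaced)  # Its Rainy Day and we are Playing In Cold
```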

In [147]:
tweet = 'I love my <3 iphone & you are awesome apple. Display Is Awesome, sooo happppppy 🙂 http://www.apple.com bee-eater'
tweet

'I love my <3 iphone & you are awesome apple. Display Is Awesome, sooo happppppy 🙂 http://www.apple.com bee-eater'

In [148]:
import itertools
tweet_mod = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
tweet_mod

'I love my <3 iphone & you are awesome apple. Display Is Awesome, soo happy 🙂 http://ww.apple.com bee-eater'
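The same squeezing of repeated characters can be written with a backreference instead of `itertools.groupby`. This is a sketch on a shortened hypothetical sample, not a replacement for the cell above:

```python
import itertools
import re

sample_tweet = 'sooo happppppy, see http://www.apple.com'  # hypothetical sample

# groupby version: keep at most two characters from each run of identical characters
by_groupby = ''.join(''.join(run)[:2] for _, run in itertools.groupby(sample_tweet))

# Regex version: a character followed by two or more copies of itself
# is replaced by exactly two copies
by_regex = re.sub(r'(.)\1{2,}', r'\1\1', sample_tweet)

print(by_regex)  # soo happy, see http://ww.apple.com
```

Both versions also shorten `www` to `ww`, so URLs may need to be extracted or protected before this step.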

In [149]:
# Materialize each run of identical consecutive characters as its own list
t = [list(s) for _, s in itertools.groupby(tweet)]


In [150]:
t

[['I'],
 [' '],
 ['l'],
 ['o'],
 ['v'],
 ['e'],
 [' '],
 ['m'],
 ['y'],
 [' '],
 ['<'],
 ['3'],
 [' '],
 ['i'],
 ['p'],
 ['h'],
 ['o'],
 ['n'],
 ['e'],
 [' '],
 ['&'],
 [' '],
 ['y'],
 ['o'],
 ['u'],
 [' '],
 ['a'],
 ['r'],
 ['e'],
 [' '],
 ['a'],
 ['w'],
 ['e'],
 ['s'],
 ['o'],
 ['m'],
 ['e'],
 [' '],
 ['a'],
 ['p', 'p'],
 ['l'],
 ['e'],
 ['.'],
 [' '],
 ['D'],
 ['i'],
 ['s'],
 ['p'],
 ['l'],
 ['a'],
 ['y'],
 [' '],
 ['I'],
 ['s'],
 [' '],
 ['A'],
 ['w'],
 ['e'],
 ['s'],
 ['o'],
 ['m'],
 ['e'],
 [','],
 [' '],
 ['s'],
 ['o', 'o', 'o'],
 [' '],
 ['h'],
 ['a'],
 ['p', 'p', 'p', 'p', 'p', 'p'],
 ['y'],
 [' '],
 ['🙂'],
 [' '],
 ['h'],
 ['t', 't'],
 ['p'],
 [':'],
 ['/', '/'],
 ['w', 'w', 'w'],
 ['.'],
 ['a'],
 ['p', 'p'],
 ['l'],
 ['e'],
 ['.'],
 ['c'],
 ['o'],
 ['m'],
 [' '],
 ['b'],
 ['e', 'e'],
 ['-'],
 ['e'],
 ['a'],
 ['t'],
 ['e'],
 ['r']]

In [162]:
dateObj = "1 Janaury 1993"
splitValues = dateObj.split()
for index, val in enumerate(splitValues):
    print(index, val)
    if re.search('[A-Za-z]+', val):
        splitValues[index] = val[:3]
" ".join(splitValues)

0 1
1 Janaury
2 1993


'1 Jan 1993'

In [175]:
dictObj = {72: ['7/11/1977', '9/36/30', '01/01/10']}

In [182]:
from dateutil.parser import parse

def validdate(dateVal):
    """Return True if dateutil can parse dateVal as a date, otherwise False."""
    try:
        print(parse(dateVal))  # show the parsed datetime
    except ValueError:
        return False
    return True

# Keep only the dates that parse successfully
for key, val in dictObj.items():
    if len(val) > 1:
        print([d for d in val if validdate(d)])


1977-07-11 00:00:00
2010-01-01 00:00:00
['7/11/1977', '01/01/10']


In [183]:
from dateutil.parser import parse
parse('9/36/30')

ValueError: day is out of range for month
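When the expected format is known in advance, the standard library can perform a similar check without `dateutil`. The helper below is a hypothetical sketch that assumes month/day/year order and tries a four-digit year before a two-digit year:

```python
from datetime import datetime

def is_valid_mdy(date_string):
    """Return True if date_string is a real month/day/year date (hypothetical helper)."""
    # Try a four-digit year first, then a two-digit year
    for fmt in ('%m/%d/%Y', '%m/%d/%y'):
        try:
            datetime.strptime(date_string, fmt)
            return True
        except ValueError:
            continue
    return False

candidates = ['7/11/1977', '9/36/30', '01/01/10']
valid = [d for d in candidates if is_valid_mdy(d)]
print(valid)  # ['7/11/1977', '01/01/10']
```

Unlike `dateutil.parser.parse`, `strptime` never guesses the field order, so a string in an unexpected layout is rejected rather than reinterpreted.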