---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Working With Text

In [21]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1
text1.startswith("Eth")
text1.endswith("Na")
'Ethics' in text1
'Ethics'.isupper()
'Ethics'.islower()
'Ethics'.istitle()
'Ethics'.isalpha()
'Ethics'.isalnum()
'Ethi56cs'.isalnum()
'Ethics'.isdigit()
'56'.isdigit()
'Ethics'.upper()
'Ethics'.lower()
'fgf'.title()
text1.splitlines()
t = '**'.join("this is wonderful")
print(t)
text1.strip()
text1.rstrip()
text1.find('Eth')
text1.replace('e','E')
text1.rfind('United')
list('welcome')
[n for n in 'welcome']

t**h**i**s** **i**s** **w**o**n**d**e**r**f**u**l


['w', 'e', 'l', 'c', 'o', 'm', 'e']
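
Two of the calls above are easy to misread, so here is a minimal sketch of what they do. It assumes the same `text1` as above; the names `length`, `ends_raw`, `ends_stripped`, `by_char` and `by_word` are introduced only for illustration.

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# The trailing space is part of the string, so it counts toward the length
# and is what endswith() compares against.
length = len(text1)
ends_raw = text1.endswith("Nations")
ends_stripped = text1.rstrip().endswith("Nations")
print(length, ends_raw, ends_stripped)

# join() iterates over its argument: a string yields single characters,
# while a list of words yields whole words.
by_char = '**'.join("this is")
by_word = '**'.join("this is".split(' '))
print(by_char)
print(by_word)
```

The first `print` shows `76 False True`. Joining a plain string puts the separator between every character, which is why the output above reads `t**h**i**s**...`.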

In [None]:
text2 = text1.split(' ') # Return a list of the words in text1, separated by ' '.

len(text2)

In [None]:
text2
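
Because `text1` ends with a space, splitting on `' '` leaves an empty string at the end of the list. The sketch below compares `split(' ')` with `split()` called without arguments; the names `on_space` and `on_whitespace` are illustrative only.

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# split(' ') cuts at every single space, so the trailing space yields ''.
on_space = text1.split(' ')

# split() with no argument cuts on runs of whitespace and drops empty strings.
on_whitespace = text1.split()

print(len(on_space), repr(on_space[-1]))
print(len(on_whitespace), repr(on_whitespace[-1]))
```

Which form is preferable depends on whether empty tokens carry meaning in the data. For counting words, `split()` is usually the safer choice.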

<br>
List comprehension allows us to find specific words:

In [None]:
[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 characters

In [None]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

In [None]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

<br>
We can find unique words using `set()`.

In [None]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

In [None]:
len(set(text4))

In [None]:
set(text4)

In [None]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

In [None]:
set([w.lower() for w in text4])

### Processing free-text

In [None]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

<br>
Finding hashtags:

In [None]:
[w for w in text6 if w.startswith('#')]

<br>
Finding callouts:

In [None]:
[w for w in text6 if w.startswith('@')]

In [None]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

<br>

We can use regular expressions to help us with more complex parsing. 

For example, `'@[A-Za-z0-9_]+'` will return all words that contain an `'@'` followed by one or more of these characters:
* capital letters (`'A-Z'`)
* lowercase letters (`'a-z'`)
* digits (`'0-9'`)
* underscores (`'_'`)

In [None]:
import re # re is the standard-library module for regular expressions

[w for w in text8 if re.search(r'@[A-Za-z0-9_]+', w)] # raw string keeps backslashes literal
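
To see why the regular expression helps, the sketch below compares it with the `startswith('@')` filter on the same tweet text. It applies `re.findall` to the whole string rather than to the split words, which is an alternative to the list comprehension above, not a replacement for it. The names `naive`, `callouts` and `callouts_short` are illustrative.

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# startswith('@') also keeps the bare '@' token.
naive = [w for w in text7.split(' ') if w.startswith('@')]

# The pattern requires at least one allowed character after '@'.
callouts = re.findall(r'@[A-Za-z0-9_]+', text7)

# With the re.ASCII flag, \w is shorthand for [A-Za-z0-9_].
callouts_short = re.findall(r'@\w+', text7, flags=re.ASCII)

print(naive)
print(callouts)
print(callouts_short)
```

Note that `re.search` and `re.findall` match anywhere in the text, so a token such as an email address would also produce a match. Use `re.match` if the `'@'` must be the first character of the word.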

## Dealing with files

In [None]:
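f = open('filename', 'r') # 'r' opens the file in read mode
f.readline() # read the first line, including its newline
f.seek(0) # move back to the start of the file
text = f.read() # read the whole file into one string
len(text) # number of characters in the file
lines = text.splitlines() # split into a list of lines without line endings
len(lines) # number of lines
f.close()
f.closed # True once the file has been closed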
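
The same steps can be written with a `with` statement, which closes the file automatically. This is a minimal sketch: the file `sample.txt` and its contents are hypothetical and are created in a temporary directory only so that the example runs on its own.

```python
import os
import tempfile

# Hypothetical sample file, written here only to make the sketch runnable.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

# The with statement closes the file when the block exits, even on error.
with open(path, 'r') as f:
    first = f.readline()   # reads up to and including the first newline
    f.seek(0)              # rewind to the start of the file
    text = f.read()        # reads the whole file as one string

lines = text.splitlines()  # line endings are removed
print(repr(first), len(text), len(lines), f.closed)
```

With this form there is no `f.close()` call to forget, and `f.closed` is already `True` after the block.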

# Working with Text Data in pandas

In [None]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

In [None]:
# find the number of characters for each string in df['text']
df['text'].str.len()

In [None]:
# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()

In [None]:
# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')

In [None]:
# find how many times a digit occurs in each string
df['text'].str.count(r'\d')

In [None]:
# find all occurrences of the digits
df['text'].str.findall(r'\d')

In [None]:
# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

In [None]:
# replace weekdays with '???' (regex=True is needed in pandas 2.0+, where the default is regex=False)
df['text'].str.replace(r'\w+day\b', '???', regex=True)

In [None]:
# replace weekdays with 3-letter abbreviations (a callable replacement requires regex=True)
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)
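
The callable replacement works the same way as in the standard `re` module, so we can try it on a single string without pandas. In this sketch the second sentence (`other`) is made up to show a limitation of the pattern.

```python
import re

sentence = "Wednesday: At 7:00pm, there is a basketball game!"

# The callable receives each match object and returns the replacement text.
abbreviated = re.sub(r'(\w+day\b)', lambda m: m.group(1)[:3], sentence)
print(abbreviated)

# Caveat: the pattern matches any word ending in 'day', not only weekdays.
other = "Today is a holiday"
over_matched = re.sub(r'(\w+day\b)', lambda m: m.group(1)[:3], other)
print(over_matched)
```

If words such as "holiday" can appear in the data, listing the weekday names explicitly in the pattern is a safer choice.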

In [None]:
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')

In [None]:
# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

In [None]:
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')
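
To check what the named groups capture, we can run the same pattern on one sentence with the standard `re` module. This is a sketch outside pandas: `extractall` returns a DataFrame with one row per match, whereas here each match becomes a dictionary. The names `pattern`, `sentence` and `matches` are illustrative.

```python
import re

pattern = re.compile(
    r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))'
)

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# finditer yields one match object per time; groupdict maps group names to text.
matches = [m.groupdict() for m in pattern.finditer(sentence)]
for match in matches:
    print(match)
```

The optional space (` ?`) in the pattern is what lets both `08:10 am` and `09:00am` match.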

## Tips and tricks of the trade for cleaning text in Python
https://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/cleaningtext.html

https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/

http://ieva.rocks/2016/08/07/cleaning-text-for-nlp/

https://chrisalbon.com/python/cleaning_text.html