# 1) Introduction to Text Mining.

![text_data](text_data.png)
![tweeter](tweeter.png)
![what_can_be_done](what_can_be_done.png)


# 2) Handling Text in Python.

![text_1](text_1.png)
![text_2](text_2.png)
![text_3](text_3.png)
![text_4](text_4.png)
![text_5](text_5.png)
![text_6](text_6.png)
![text_7](text_7.png) 
![text_8](text_8.png)
![text_9](text_9.png)
![text_10](text_10.png)
![text_11](text_11.png)
![text_12](text_12.png)

# 3) Regular Expressions.

In [11]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'

In [12]:
text8 = text7.split(' ')
text8

['@UN',
 '@UN_Women',
 '"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

In [15]:
# get all the `Hashtags`.
[w for w in text8 if w.startswith('#')]

['#UNSG']

In [18]:
# get all the `callouts`.
print( [w for w in text8 if w.startswith('@')] )
# Here lies the devil: this also returns the bare '@' sign,
# which is not a callout at all. This suggests that we need a regex.

import re
# '@' followed by one or more letters, digits (0-9), or underscores.
[w for w in text8 if re.search(r'@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women', '@']


['@UN', '@UN_Women']
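
The same pattern can also be run on the original string directly, without splitting it into words first. This is a minimal sketch using `re.findall`; the hashtag pattern is an assumption modelled on the callout pattern and is not from the notes above.

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# '@' followed by one or more letters, digits, or underscores.
callouts = re.findall(r'@[A-Za-z0-9_]+', text7)
print(callouts)   # ['@UN', '@UN_Women']

# Assumed analogue for hashtags: '#' followed by the same character set.
hashtags = re.findall(r'#[A-Za-z0-9_]+', text7)
print(hashtags)   # ['#UNSG']
```

The bare `@` before `NY` is skipped because `+` requires at least one character after the `@`.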

![text_13](text_13.png)
![text_14](text_14.png)
![text_15](text_15.png)
![text_16](text_16.png)

In [20]:
# \w is shorthand for [A-Za-z0-9_]; the raw string keeps the backslash intact.
[w for w in text8 if re.search(r'@\w+', w)]

['@UN', '@UN_Women']

In [29]:
text12 = 'ouagadougou'
print( re.findall(r'[aeiou]',text12) )
re.findall(r'[^aeiou]',text12)

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']


['g', 'd', 'g']
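
One point worth checking by hand is that a negated class such as `[^aeiou]` matches every character that is not listed, not only consonants. The sketch below assumes a made-up string `'a b1'` to show this.

```python
import re

text12 = 'ouagadougou'
vowels = re.findall(r'[aeiou]', text12)
others = re.findall(r'[^aeiou]', text12)

# Every character falls into exactly one of the two classes.
print(len(vowels), len(others), len(text12))   # 8 3 11

# On a made-up string, the negated class also picks up the space and the digit.
mixed = re.findall(r'[^aeiou]', 'a b1')
print(mixed)   # [' ', 'b', '1']
```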

![text_17](text_17.png)
![text_18](text_18.png)
![text_19](text_19.png)
![text_20](text_20.png)
![text_21](text_21.png)
![text_22](text_22.png)
![text_23](text_23.png)

# 4) Demonstration: Regex with Pandas and Named Groups.

In [33]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]
df = pd.DataFrame(time_sentences, columns  = ['text'])
df

|   | text |
|---|------|
| 0 | Monday: The doctor's appointment is at 2:45pm. |
| 1 | Tuesday: The dentist's appointment is at 11:30... |
| 2 | Wednesday: At 7:00pm, there is a basketball game! |
| 3 | Thursday: Be back home by 11:15 pm at the latest. |
| 4 | Friday: Take the train at 08:10 am, arrive at ... |


In [34]:
# str.len() : to get the length of each text.
df['text'].str.len()

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

In [39]:
# str.split(): to split each text on whitespace.
split_words = df['text'].str.split()
# Number of tokens in each sentence.
print(split_words.str.len())
split_words

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64


0    [Monday:, The, doctor's, appointment, is, at, ...
1    [Tuesday:, The, dentist's, appointment, is, at...
2    [Wednesday:, At, 7:00pm,, there, is, a, basket...
3    [Thursday:, Be, back, home, by, 11:15, pm, at,...
4    [Friday:, Take, the, train, at, 08:10, am,, ar...
Name: text, dtype: object

In [40]:
# str.contains() : for checking presence of a certain word in a text.
df['text'].str.contains('appointment')

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

In [42]:
# str.count(): to count the occurrences of a regex in each text.
df['text'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

In [80]:
# str.findall() : to pull out a certain regex from a text.
df['text'].str.findall(r'\d')

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

In [82]:
# {0,1} is equivalent to ?: the first digit is optional.
df['text'].str.findall(r'(\d{0,1}\d):(\d\d)')

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object
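
The equivalence of `{0,1}` and `?`, and the effect of the parentheses on what `findall` returns, can be checked with plain `re` on a single sentence. This is a small sketch and not part of the original notebook.

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# {0,1} and ? both mean "zero or one occurrence".
with_braces = re.findall(r'(\d{0,1}\d):(\d\d)', sentence)
with_question = re.findall(r'(\d?\d):(\d\d)', sentence)
print(with_braces)                    # [('08', '10'), ('09', '00')]
print(with_braces == with_question)   # True

# Without groups, findall returns the whole match instead of tuples.
whole = re.findall(r'\d?\d:\d\d', sentence)
print(whole)                          # ['08:10', '09:00']
```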

In [53]:
# str.replace() : to replace a certain regex in a text with other text.
# regex=True is required in recent pandas, where the pattern is literal by default.
df['text'].str.replace(r'\w+day', 'xxx', regex=True)

0          xxx: The doctor's appointment is at 2:45pm.
1       xxx: The dentist's appointment is at 11:30 am.
2          xxx: At 7:00pm, there is a basketball game!
3         xxx: Be back home by 11:15 pm at the latest.
4    xxx: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [83]:
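# The replacement can be a function of the match; here it keeps the first four characters.
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:4], regex=True)
# \b is a word boundary; on this data it could be removed and the result would be the same.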

0         Mond: The doctor's appointment is at 2:45pm.
1      Tues: The dentist's appointment is at 11:30 am.
2         Wedn: At 7:00pm, there is a basketball game!
3        Thur: Be back home by 11:15 pm at the latest.
4    Frid: Take the train at 08:10 am, arrive at 09...
Name: text, dtype: object
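
The word boundary makes no difference on the sentences above, but it can on other input. The sketch below uses plain `re.sub` with a callable replacement; `shorten_day` and the sample sentence are hypothetical and only serve to illustrate the point.

```python
import re

def shorten_day(match):
    # Hypothetical helper: keep the first four characters of the matched day name.
    return match.group(1)[:4]

sample = 'Mondays are busy; Monday: gym'  # made-up sentence for illustration

with_boundary = re.sub(r'(\w+day\b)', shorten_day, sample)
without_boundary = re.sub(r'(\w+day)', shorten_day, sample)

print(with_boundary)     # Mondays are busy; Mond: gym
print(without_boundary)  # Monds are busy; Mond: gym
```

With `\b`, `Mondays` is left alone because `day` is followed by another word character; without it, the `Monday` inside `Mondays` is also replaced.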

In [69]:
# str.extract() : to extract the groups of a regex into a dataframe, one column per group.
df['text'].str.extract(r'(\d?\d):(\d\d)')
# ? is equivalent to {0,1}: the preceding digit may or may not be present.

|   | 0  | 1  |
|---|----|----|
| 0 | 2  | 45 |
| 1 | 11 | 30 |
| 2 | 7  | 00 |
| 3 | 11 | 15 |
| 4 | 08 | 10 |


Note that `str.extract` only extracts groups from the first match of the pattern, so it didn't extract `09:00am` from the last sentence. To get all matches, use `str.extractall`.
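
The first-match versus all-matches distinction has a close analogue in the standard `re` module: `search` stops at the first match, while `finditer` walks through all of them. This sketch uses `re` rather than pandas, so it illustrates the idea and not the pandas calls themselves.

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."
pattern = re.compile(r'(\d?\d):(\d\d)')

# search() returns only the first match, like str.extract.
first = pattern.search(sentence).groups()
print(first)         # ('08', '10')

# finditer() yields every match, like str.extractall.
all_matches = [m.groups() for m in pattern.finditer(sentence)]
print(all_matches)   # [('08', '10'), ('09', '00')]
```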

Let's try an example with extractall that uses a more complex pattern with more groups.

In [73]:
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')
# The outermost parentheses capture the whole match, which appears as column 0.

|   | match | 0        | 1  | 2  | 3  |
|---|-------|----------|----|----|----|
| 0 | 0     | 2:45pm   | 2  | 45 | pm |
| 1 | 0     | 11:30 am | 11 | 30 | am |
| 2 | 0     | 7:00pm   | 7  | 00 | pm |
| 3 | 0     | 11:15 pm | 11 | 15 | pm |
| 4 | 0     | 08:10 am | 08 | 10 | am |
| 4 | 1     | 09:00am  | 09 | 00 | am |


In [78]:
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

|   | match | time     | hour | minute | period |
|---|-------|----------|------|--------|--------|
| 0 | 0     | 2:45pm   | 2    | 45     | pm     |
| 1 | 0     | 11:30 am | 11   | 30     | am     |
| 2 | 0     | 7:00pm   | 7    | 00     | pm     |
| 3 | 0     | 11:15 pm | 11   | 15     | pm     |
| 4 | 0     | 08:10 am | 08   | 10     | am     |
| 4 | 1     | 09:00am  | 09   | 00     | am     |
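
Named groups are also convenient outside pandas, because each match can be read as a dictionary. The sketch below reuses the pattern above with plain `re`; the `to_24h` helper is hypothetical and only shows one way the named fields might be used.

```python
import re

TIME_RE = re.compile(
    r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))'
)

def to_24h(hour, minute, period):
    # Hypothetical helper: convert a 12-hour clock reading to 'HH:MM'.
    h = int(hour) % 12          # 12 am -> 0, 12 pm -> 0 (then +12 below)
    if period == 'pm':
        h += 12
    return f'{h:02d}:{minute}'

sentence = "Thursday: Be back home by 11:15 pm at the latest."
fields = TIME_RE.search(sentence).groupdict()
print(fields)
# {'time': '11:15 pm', 'hour': '11', 'minute': '15', 'period': 'pm'}

converted = to_24h(fields['hour'], fields['minute'], fields['period'])
print(converted)   # 23:15
```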


# 5) Internationalization and Issues with Non-ASCII Characters.

![text_24](text_24.png)
![text_25](text_25.png)
![text_26](text_26.png)
![text_27](text_27.png)
![text_28](text_28.png)
![text_29](text_29.png)
![text_30](text_30.png)
![text_31](text_31.png)
![text_32](text_32.png)

# Practice Quiz.
## 1.
Which of these options correspond to matching a pattern at least once?
- Answer: `+`

## 2.
Which of these options correspond to matching a pattern zero or more times?
- Answer: `*`

## 3.
Which of these options correspond to matching xyz at the beginning of the string?
- Answer: `^xyz`

## 4.
Which of these options correspond to matching xyz at the end of the string?
- Answer: `xyz$`
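
The four answers above can be verified quickly with the `re` module. This is a small sketch; the test strings are made up for illustration.

```python
import re

# '+' : one or more, so a lone 'a' does not satisfy 'ab+'.
plus_none = re.fullmatch(r'ab+', 'a') is None
plus_many = re.fullmatch(r'ab+', 'abbb') is not None

# '*' : zero or more, so a lone 'a' satisfies 'ab*'.
star_zero = re.fullmatch(r'ab*', 'a') is not None

# '^xyz' : xyz at the beginning of the string.
starts = re.search(r'^xyz', 'xyz123') is not None
not_starts = re.search(r'^xyz', '123xyz') is None

# 'xyz$' : xyz at the end of the string.
ends = re.search(r'xyz$', '123xyz') is not None

print(plus_none, plus_many, star_zero, starts, not_starts, ends)
# True True True True True True
```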


# Module 1 Quiz

## 1.
Which of these options correspond to matching a pattern at most once?
- Answer: `?`

## 2.
Which of these options correspond to matching a pattern at least twice?
- Answer: `{2,}`

## 3.
Which of these options correspond to matching a pattern at most thrice?
- Answer: `{,3}`

## 4.
Which of these options correspond to matching none of the characters x, y, z?
- Answer: `[^xyz]`

## 5.
Which of these options correspond to matching one of the characters x, y, z?
- Answer: `[xyz]`

## 6.
Which of these options correspond to matching the sequence xyz?
- Answer: `xyz`
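
As with the practice quiz, these answers can be checked with a few lines of `re`. The test strings below are made up for illustration.

```python
import re

# '?' : at most once.
optional_ok = [bool(re.fullmatch(r'colou?r', s)) for s in ('color', 'colour', 'colouur')]
print(optional_ok)    # [True, True, False]

# '{2,}' : at least twice.
at_least_two = [bool(re.fullmatch(r'a{2,}', s)) for s in ('a', 'aa', 'aaaa')]
print(at_least_two)   # [False, True, True]

# '{,3}' : at most three times (zero included).
at_most_three = [bool(re.fullmatch(r'a{,3}', s)) for s in ('', 'aaa', 'aaaa')]
print(at_most_three)  # [True, True, False]

# '[xyz]' matches one of x, y, z; '[^xyz]' matches any other character.
in_set = re.findall(r'[xyz]', 'axbycz')
not_in_set = re.findall(r'[^xyz]', 'axbycz')
print(in_set, not_in_set)   # ['x', 'y', 'z'] ['a', 'b', 'c']

# 'xyz' matches the literal sequence anywhere in the string.
has_sequence = re.search(r'xyz', 'wxyzw') is not None
print(has_sequence)   # True
```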


