#### Text Mining - Importance
Data is growing at a rapid pace, by some estimates about 20 exabytes per day, and roughly 80% of it is unstructured text data.  

#### Information is hidden in text data  
Take tweets as an example. What information could one extract from them?  
1. Author  
2. Location  
3. Time  
4. Topic or subject of tweet  
5. Likes / Dislikes
6. Shares
7. Sentiment, etc.

Common use cases with text data:
1. Parse text, i.e. read it and split it into words  
2. Identify and extract certain components from text (information extraction)  
3. Tag/Classify documents 
4. Search for relevant documents  
5. Sentiment analysis  
6. Topic Modeling

Primitive constructs in text:
- Documents 
- Sentences 
    - First person / third person
    - Tense: past, present, future
- Words / Tokens
    - Subject
    - Object
    - Noun
    - Adjective
    - Prepositions
    - Verb
    - Adverb etc
- Characters

#### Text Processing 
Text needs pre-processing. The steps *usually* required (needs may differ) are:
1. Read the text; each line/sentence is read as a character string  
2. Break sentences into words/tokens  
3. Change case to lower  
4. Remove whitespace from the start and end  
5. Remove commonly occurring words such as prepositions, often called 'stop word' removal
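The steps above can be sketched as a single function. This is a minimal illustration, not a standard implementation: the `preprocess` name and the tiny `STOP_WORDS` set are assumptions made up for this example (real stop-word lists are much longer), and whitespace is stripped before tokenizing because that is simpler on a single string.

```python
# Hypothetical, very small stop-word list used only for illustration.
STOP_WORDS = {"the", "of", "and", "are", "into", "a", "an", "in", "to"}

def preprocess(line):
    """Apply the usual pre-processing steps to one line of text."""
    stripped = line.strip()                  # remove leading/trailing whitespace
    tokens = stripped.split()                # break the line into tokens on whitespace
    lowered = [t.lower() for t in tokens]    # change case to lower
    # stop-word removal: keep only tokens that are not in the stop list
    return [t for t in lowered if t not in STOP_WORDS]

sample = "  Ethics are built right into the ideals and objectives of the united nations  "
clean_tokens = preprocess(sample)
print(clean_tokens)
# ['ethics', 'built', 'right', 'ideals', 'objectives', 'united', 'nations']
```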


In [16]:
text = 'Ethics are built right into the ideals and objectives of the united nations'
len(text)
tokens = text.split(' ')
#type(tokens) -  list

#### Examples
1. splitting a sentence into tokens  
2. case identification  
3. finding substrings
4. finding unique tokens

**Use of list comprehensions to achieve this**

In [9]:
# Words that are more than 3 characters long
wrds1 = [x for x in tokens if len(x) > 3]
wrds1

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'united',
 'nations']

In [12]:
# Words in title case (first letter uppercase, remaining letters lowercase)
wrds2 = [x for x in tokens if x.istitle()]
wrds2

['Ethics']

In [14]:
# Words that end with 's'
wrds3 = [x for x in tokens if x.endswith('s')]
wrds3

['Ethics', 'ideals', 'objectives', 'nations']

In [8]:
text2 = "To be or not to be"
tokens2 = text2.split(' ')

# Change case to lower
wrds4 = [x.lower() for x in tokens2 ]

# Unique words from a list using set()
set(wrds4)

{'be', 'not', 'or', 'to'}
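A set does not keep the original word order and discards how often each word occurred. As a hedged sketch of two alternatives from the standard library, `dict.fromkeys` keeps the first occurrence of each word in order, and `collections.Counter` counts occurrences. The variable names here are illustrative only.

```python
from collections import Counter

text2 = "To be or not to be"
words = [w.lower() for w in text2.split()]

# dict.fromkeys keeps each word once, in order of first appearance
unique_in_order = list(dict.fromkeys(words))
print(unique_in_order)   # ['to', 'be', 'or', 'not']

# Counter maps each word to the number of times it occurs
counts = Counter(words)
print(counts["to"], counts["be"], counts["or"])   # 2 2 1
```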

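#### String examining methods available on the string type
s.startswith(t) - checks whether the string starts with the prefix t  
s.endswith(t) - checks whether the string ends with the suffix t  
t in s - checks whether the substring t is part of the string  
s.isupper() - all cased characters are uppercase  
s.islower() - all cased characters are lowercase  
s.istitle() - title case, i.e. each word starts with a capital letter followed by lowercase letters  
s.isalpha() - contains only alphabetic characters  
s.isdigit() - contains only digits  
s.isalnum() - contains only letters and/or digits, i.e. alphanumeric (it does not need to contain both)  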
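A short sketch of these checks on a few sample strings. The sample values and the `describe` helper are made up for illustration; the string methods themselves are the built-in ones listed above.

```python
samples = ["NLP", "text", "Ethics", "2024", "abc123", "hello world"]

def describe(s):
    """Return the result of each examining method for the string s."""
    return {
        "isupper": s.isupper(),
        "islower": s.islower(),
        "istitle": s.istitle(),
        "isalpha": s.isalpha(),
        "isdigit": s.isdigit(),
        "isalnum": s.isalnum(),
    }

checks = {s: describe(s) for s in samples}
for s, result in checks.items():
    # list only the checks that returned True for this string
    print(repr(s), [name for name, ok in result.items() if ok])

# Prefix, suffix and substring tests
phrase = "united nations"
print(phrase.startswith("united"), phrase.endswith("s"), "nation" in phrase)
```

Note that `"hello world"` fails `isalpha()` and `isalnum()` because the space is neither a letter nor a digit, and `"text"` passes `isalnum()` even though it contains no digits.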

#### Functions for String operation
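s.lower(), s.upper() - convert to lower or upper case  
s.split(t) & t.join(list) - split a string on the separator t; join a list of strings with t placed between them  
s.splitlines() - split a string on line boundaries  
s.strip()  - strips whitespace from the front and back  
s.rstrip()  - strips whitespace only from the back  
s.find(t)  - returns the index of the first occurrence of t, searching from the start  
s.rfind(t)  - returns the index of the last occurrence of t, i.e. searching from the end  
s.replace(u,v)  - replaces every occurrence of u with v  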
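A small sketch of `splitlines`, `strip`, `rstrip`, `join` and case conversion working together. The input string `raw` is invented for this example.

```python
raw = "  first line \nsecond line  \n"

lines = raw.splitlines()                  # split on line boundaries
print(lines)                              # ['  first line ', 'second line  ']

stripped = [ln.strip() for ln in lines]   # whitespace removed from both ends
print(stripped)                           # ['first line', 'second line']

right_only = lines[0].rstrip()            # only trailing whitespace removed
print(repr(right_only))                   # '  first line'

rejoined = " | ".join(stripped)           # separator goes between the items
print(rejoined)                           # first line | second line

print(rejoined.upper())                   # FIRST LINE | SECOND LINE
```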

In [20]:
# split and join are inverse operations
text3 = "ouagodougou"
tokens = text3.split("ou")
print(tokens)

word4 = "ou"
word4.join(tokens)
# join, given a list as its argument, places the string between consecutive items of the list,
# which reverses a split on that same string


['', 'agod', 'g', '']


'ouagodougou'

In [2]:
# Breaking a string into characters
text4 = "I am learning NLP" 
# Method 1 : list comprehension over the string
chrs = [j for j in text4]
chrs

# Method 2 : coercion to a list
print(list(text4))

# What will not work
#text4.split("") # raises ValueError: split needs a non-empty separator, e.g. a space " "

['I', ' ', 'a', 'm', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'N', 'L', 'P']
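To make the failure mode concrete, the sketch below catches the error raised by an empty separator and contrasts it with calling `split()` with no argument, which splits on runs of whitespace. The `raised` flag is only there for illustration.

```python
text4 = "I am learning NLP"

raised = False
try:
    text4.split("")            # an empty separator is not allowed
except ValueError as err:
    raised = True
    print("split('') failed:", err)

# With no argument, split() breaks the string on any run of whitespace
words = text4.split()
print(words)                   # ['I', 'am', 'learning', 'NLP']

# Characters without the spaces, if only letters are wanted
letters = [ch for ch in text4 if not ch.isspace()]
print(len(letters))            # 14
```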


In [42]:
# Cleaning operations
#1. Whitespace removal: call 'strip' on the string itself, NOT on the list returned by split
text5 =  "   A quick brown fox jumped over a lazy dog"
tokens = text5.split(" ")
print(tokens)
text5_clean = text5.strip()
text5_clean

['', '', '', 'A', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']


'A quick brown fox jumped over a lazy dog'
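The empty strings in the token list come from splitting on a single space. As a sketch of two ways to avoid them without stripping first: call `split()` with no argument, or filter out empty tokens afterwards. The variable names are illustrative.

```python
text5 = "   A quick brown fox jumped over a lazy dog"

# split() with no argument ignores leading, trailing and repeated whitespace
tokens_default = text5.split()
print(tokens_default)

# Alternatively, keep split(" ") and drop the empty strings it produces
tokens_filtered = [t for t in text5.split(" ") if t]
print(tokens_filtered == tokens_default)   # True
```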

In [52]:
#2. Find and replace
# 'find' returns the index of the first occurrence of a substring, searching from the start;
# 'rfind' returns the index of the last occurrence, i.e. searching from the end
print([text5_clean.find("o"), text5_clean.rfind("o")])
#len(text5_clean)

# replace substitutes every occurrence of a substring in the entire string
text5_clean.replace("o", "O")


[10, 38]


'A quick brOwn fOx jumped Over a lazy dOg'
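A few related behaviours worth knowing, shown as a short sketch on the same sentence: `find` returns -1 when the substring is absent, `replace` accepts an optional count, and a list comprehension can collect every matching position.

```python
sentence = "A quick brown fox jumped over a lazy dog"

# find returns -1 (not an error) when the substring is not present
missing = sentence.find("cat")
print(missing)        # -1

# The optional third argument limits how many occurrences are replaced
first_two = sentence.replace("o", "O", 2)
print(first_two)      # A quick brOwn fOx jumped over a lazy dog

# Index of every 'o', not just the first and last
positions = [i for i, ch in enumerate(sentence) if ch == "o"]
print(positions)      # [10, 15, 25, 38]
```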