# Working With Text

In [None]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

76

In [None]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.

len(text2)

14

In [None]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']

<br>
List comprehension allows us to find specific words:

In [None]:
[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 letters

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [None]:
[w for w in text2 if len(w)>5]

['Ethics', 'ideals', 'objectives', 'United', 'Nations']

In [None]:
text2[2].istitle()

False

In [None]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [None]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']

<br>
We can find unique words using `set()`.

In [None]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [None]:
len(text3)

18

In [None]:
len(set(text4))

5

In [None]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [None]:
len(set([w.lower() for w in text4])) # .lower() converts the string to lowercase.

4

In [None]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}
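As a small extension of the cells above, the sketch below builds the same set with a set comprehension and then counts how often each lowercased word occurs. `Counter` comes from the standard library's `collections` module; it is not used in the original walkthrough, and the names `unique_words` and `word_counts` are only illustrative.

```python
from collections import Counter

text3 = 'To be or not to be'

# A set comprehension builds the set of lowercased words directly,
# without creating an intermediate list.
unique_words = {w.lower() for w in text3.split(' ')}
print(sorted(unique_words))  # sorted() gives a stable order for display

# Counter tallies how many times each lowercased word appears.
word_counts = Counter(w.lower() for w in text3.split(' '))
print(word_counts['to'], word_counts['be'], word_counts['or'])
```

Sets are unordered, so `sorted()` is used here only to make the printed result predictable.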

# Some operations on text

In [None]:
s = 'ouagadougou'
s1 = s.split('ou')
s1

['', 'agad', 'g', '']

In [None]:
'ou'.join(s1)

'ouagadougou'

In [None]:
s2 = s.upper()
s2

'OUAGADOUGOU'

In [None]:
s2.lower()

'ouagadougou'

In [None]:
# Splitting on an empty string is not allowed and raises an error
s3 = s.split('')
s3

ValueError: empty separator

In [None]:
# To get the individual characters, convert the string to a list
list(s)

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

In [None]:
# Or use a list comprehension
[c for c in s]

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

## Cleaning Text

In [None]:
text_1 = '     Braj kishore is studying in RJIT. '
text_2 = text_1.split(' ')
text_2

['', '', '', '', '', 'Braj', 'kishore', 'is', 'studying', 'in', 'RJIT.', '']

* Splitting on a single space leaves empty strings in the list wherever the text has leading, trailing, or repeated spaces. The following steps deal with this.

In [None]:
# strip() removes whitespace from both ends of the string
text_3 = text_1.strip()
print(text_3)
text_4 = text_3.split(' ')
text_4

Braj kishore is studying in RJIT.


['Braj', 'kishore', 'is', 'studying', 'in', 'RJIT.']
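A shorter route to the same result, sketched here as an alternative rather than as a replacement for the steps above: calling `split()` with no separator splits on runs of whitespace and ignores leading and trailing whitespace. Filtering out empty strings after `split(' ')` also works. The names `words` and `words_filtered` are illustrative.

```python
text_1 = '     Braj kishore is studying in RJIT. '

# split() with no separator splits on runs of whitespace and
# discards leading and trailing whitespace, so no empty strings appear.
words = text_1.split()
print(words)

# Alternative: keep split(' ') and drop the empty strings afterwards.
words_filtered = [w for w in text_1.split(' ') if w]
print(words_filtered)
```

Both approaches also cope with repeated spaces in the middle of the text, which `strip()` followed by `split(' ')` does not.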

**Changing Text**
1. Find 
2. Replace

In [None]:
text_5 = 'A quick brown fox jumped over the lazy dog.'
text_5.find('fox')  # find(t) returns the index of the first occurrence of t, or -1 if t is not found.

14

In [None]:
text_6 = text_5.replace('fox', 'lion')
text_6

'A quick brown lion jumped over the lazy dog.'
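Two details of `find()` and `replace()` are easy to miss, so here is a minimal sketch of them. The string `repeated` is a made-up example and the variable names are illustrative.

```python
sentence = 'A quick brown fox jumped over the lazy dog.'

# find() returns -1 when the substring does not occur.
print(sentence.find('cat'))

# The `in` operator is a simpler membership test when the index is not needed.
print('fox' in sentence)

# replace() changes every occurrence unless a count is passed as the third argument.
repeated = 'the fox saw the other fox'  # hypothetical example string
first_only = repeated.replace('fox', 'lion', 1)
every_one = repeated.replace('fox', 'lion')
print(first_only)
print(every_one)
```

Because strings are immutable, `replace()` always returns a new string and leaves the original unchanged.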

### Processing free-text

In [None]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

<br>
Finding hashtags:

In [None]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

<br>
Finding callouts:

In [None]:
[w for w in text6 if w.startswith('@')]

['@']

In [None]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

<br>

We can use regular expressions to help us with more complex parsing. 

For example, `'@[A-Za-z0-9_]+'` matches `'@'` followed by one or more characters, each of which is:
* a capital letter (`'A-Z'`)
* a lowercase letter (`'a-z'`)
* a digit (`'0-9'`)
* or an underscore (`'_'`)
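To see how each part of the pattern behaves, the sketch below tries it on a handful of made-up tokens (the `tokens` list is hypothetical, not taken from the tweet). The `+` is what rejects a bare `'@'`, since it requires at least one allowed character after it.

```python
import re

pattern = r'@[A-Za-z0-9_]+'  # raw string: a good habit for regex patterns

# Hypothetical sample tokens chosen to exercise each part of the pattern.
tokens = ['@UN', '@UN_Women', '@', 'NY', '@2024']

# re.search returns a match object (truthy) or None (falsy).
matches = [t for t in tokens if re.search(pattern, t)]
print(matches)
```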

In [None]:
[w for w in text8 if w.startswith('@')] # The third result, '@', is not a callout.

['@UN', '@UN_Women', '@']

The third element of the result, `'@'`, is not a callout.<br>
We can exclude it by requiring at least one valid character after the `'@'`, using the pattern `'@[A-Za-z0-9_]+'`.

In [None]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']
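One caveat is that `re.search` accepts a match anywhere inside a word, so a token such as an email-like string would also pass the filter. The sketch below contrasts it with `re.fullmatch`, which requires the whole word to fit the pattern, and shows `re.findall` applied to the unsplit string. The `words` list is made up for illustration.

```python
import re

pattern = r'@[A-Za-z0-9_]+'

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# re.findall scans the whole string and returns every non-overlapping match.
callouts = re.findall(pattern, text7)
print(callouts)

# re.search accepts a match anywhere in the word;
# re.fullmatch requires the entire word to fit the pattern.
words = ['@UN', 'someone@example', '@']  # hypothetical tokens
loose = [w for w in words if re.search(pattern, w)]
strict = [w for w in words if re.fullmatch(pattern, w)]
print(loose)
print(strict)
```

Which variant is appropriate depends on the data: `re.fullmatch` would reject a callout followed by punctuation, such as `'@UN,'`, while `re.findall` would still extract `'@UN'` from it.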