# Text Helper Functions

* [1. Text Helper Functions](#1)
    - [1.1 URL](#1.1)
    - [1.2 Emoticons](#1.2)
    - [1.3 Email](#1.3)
    - [1.4 Hash](#1.4)
    - [1.5 Mention](#1.5)
    - [1.6 Number](#1.6)
    - [1.7 Phone Number](#1.7)
    - [1.8 Year](#1.8)
    - [1.9 Non Alphanumeric](#1.9)
    - [1.10 Punctuations](#1.10)
    - [1.11 Stopwords](#1.11)
    - [1.12 N-grams](#1.12)
    - [1.13 Repetitive Character](#1.13)
    - [1.14 Dollar](#1.14)
    - [1.15 Number-Greater](#1.15)
    - [1.16 Number Lesser](#1.16)
    - [1.17 OR](#1.17)
    - [1.18 AND](#1.18)
    - [1.19 Dates](#1.19)
    - [1.20 Only Words](#1.20)
    - [1.21 Only Numbers](#1.21)
    - [1.22 Boundaries](#1.22)
    - [1.23 Search](#1.23)
    - [1.24 Pick Sentence](#1.24)
    - [1.25 Duplicate Sentence](#1.25)
    - [1.26 Caps Words](#1.26)
    - [1.27 Length of Words](#1.27)
    - [1.28 Length of Characters](#1.28)
    - [1.29 Get ID](#1.29)
    - [1.30 Specific String Rows](#1.30)
    - [1.31 Hex code to Color](#1.31)
    - [1.32 Tags](#1.32)
    - [1.33 IP Address](#1.33)
    - [1.34 Mac Address](#1.34)
    - [1.35 Subword](#1.35)
    - [1.36 Latitude & Longitude](#1.36)
    - [1.37 PAN](#1.37)
   

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
import re, emoji, webcolors, nltk

In [2]:
raw = nltk.corpus.gutenberg.sents('shakespeare-caesar.txt')
corpus = []
for sent in raw:
    corpus.append(' '.join(sent))

In [3]:
df = pd.DataFrame(corpus, columns=['Text'])
df.head()

| | Text |
|---|---|
| 0 | [ The Tragedie of Julius Caesar by William Sha... |
| 1 | Actus Primus . |
| 2 | Scoena Prima . |
| 3 | Enter Flauius , Murellus , and certaine Common... |
| 4 | Flauius . |


## URL
<a id="1.1"></a>

In [4]:
def find_url(string): 
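    # raw string so that \( and \) reach the regex engine without invalid-escape warnings
    text = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',string)
    return "".join(text) # list to string; several URLs would be concatenated without a separator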

In [5]:
sentence="I love spending time at https://www.kaggle.com/"
find_url(sentence)

'https://www.kaggle.com/'

## Emoticons 

<a id="1.2"></a>

### Find and convert emoji to text

In [6]:
def find_emoji(text):
    emo_text=emoji.demojize(text)
    line=re.findall(r'\:(.*?)\:',emo_text)
    return line

In [7]:
sentence="I love ⚽ very much 😁"
find_emoji(sentence)

# Emoji cheat sheet - https://www.webfx.com/tools/emoji-cheat-sheet/
# Unicode list of all emoji: https://unicode.org/emoji/charts/full-emoji-list.html

['soccer_ball', 'beaming_face_with_smiling_eyes']

### Remove Emoji from text

In [8]:
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)



In [9]:
sentence="Its all about \U0001F600 face"
print(sentence)
remove_emoji(sentence)

Its all about 😀 face


'Its all about  face'

## Email
<a id="1.3"></a>

Extract email addresses from text.

In [10]:
def find_email(text):
    line = re.findall(r'[\w\.-]+@[\w\.-]+',str(text))
    return ",".join(line)

In [11]:
sentence="My gmail is abc99@gmail.com"
find_email(sentence)

'abc99@gmail.com'

## Hash
<a id="1.4"></a>

Hashtags are used mainly to mark trending topics on Twitter.

In [12]:
def find_hash(text):
    line=re.findall(r'(?<=#)\w+',text)
    return " ".join(line)

In [13]:
sentence="#Corona is trending now in the world" 
find_hash(sentence)

'Corona'

## Mention
<a id="1.5"></a>

@ - Used to mention someone in tweets

In [14]:
def find_at(text):
    line=re.findall(r'(?<=@)\w+',text)
    return " ".join(line)

In [15]:
sentence="@David,can you help me out"
find_at(sentence)

'David'

## Number
<a id="1.6"></a>

Pick only the numbers from a sentence.

In [16]:
def find_number(text):
    line=re.findall(r'[0-9]+',text)
    return " ".join(line)

In [17]:
sentence="2833047 people are affected by corona now"
find_number(sentence)

'2833047'

## Phone Number
<a id="1.7"></a>

Indian mobile numbers have ten digits.

In [18]:
def find_phone_number(text):
    line=re.findall(r"\b\d{10}\b",text)
    return "".join(line)

In [19]:
find_phone_number("9998887776 is a phone number of Mark from 210,North Avenue")

'9998887776'
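
The `\b\d{10}\b` pattern accepts any ten-digit run. A slightly stricter sketch follows, assuming that mobile numbers start with 6 to 9 and may carry a `+91` or `0` prefix. The name `find_indian_mobile` and these prefix rules are illustrative assumptions, not part of the original helper.

```python
import re

def find_indian_mobile(text):
    # Assumption: optional "+91" (with optional space or hyphen) or "0" prefix,
    # followed by ten digits whose first digit is 6-9.
    pattern = r'(?<!\d)(?:\+91[\s-]?|0)?([6-9]\d{9})(?!\d)'
    # findall returns only the captured ten-digit part
    return re.findall(pattern, text)

print(find_indian_mobile("Call +91 9998887776 or 09876543210, not 12345"))
```

Returning a list instead of a joined string keeps multiple numbers separate.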

## Year
<a id="1.8"></a>

Extract years from 1940 to 2020.

In [20]:
def find_year(text):
    # 1940-1999, 2000-2019, or 2020
    line=re.findall(r"\b(19[4-9][0-9]|20[0-1][0-9]|2020)\b",text)
    return line

In [21]:
sentence="India got independence on 1947."
find_year(sentence)

['1947']

## Non Alphanumeric
<a id="1.9"></a>

In [22]:
def find_nonalp(text):
    line = re.findall("[^A-Za-z0-9 ]",text)
    return line

In [23]:
sentence="Twitter has lots of @ and # in posts.(general tweet)"
find_nonalp(sentence)

['@', '#', '.', '(', ')']

## Punctuations
<a id="1.10"></a>

Retrieve punctuation marks from a sentence.

In [24]:
def find_punct(text):
    line = re.findall(r'[!"\$%&\'()*+,\-.\/:;=#@?\[\\\]^_`{|}~]*', text)
    string="".join(line)
    return list(string)

In [25]:
example="Corona virus have kiled #24506 confirmed cases now.#Corona is un(tolerable)"
print(find_punct(example))

['#', '.', '#', '(', ')']


## Stopwords
<a id="1.11"></a>

In [26]:
def stop_word_fn(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    # keep only the tokens that are stopwords (case-sensitive, so 'This' is not matched)
    return [w for w in word_tokens if w in stop_words]

In [27]:
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_word_fn(example_sent)

['is', 'a', 'off', 'the']

## N-grams
<a id="1.12"></a>

In [28]:
def ngrams_top(corpus,ngram_range,n=None):
    """
    List the top n n-grams in a text corpus, ranked by number of occurrences.
    """
    vec = CountVectorizer(stop_words = 'english',ngram_range=ngram_range).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    total_list=words_freq[:n]
    df=pd.DataFrame(total_list,columns=['text','count'])
    return df

In [29]:
ngrams_top(corpus,(1,1),n=10)

| | text | count |
|---|---|---|
| 0 | caesar | 190 |
| 1 | brutus | 161 |
| 2 | bru | 153 |
| 3 | haue | 148 |
| 4 | shall | 125 |
| 5 | thou | 115 |
| 6 | cassi | 107 |
| 7 | cassius | 85 |
| 8 | did | 82 |
| 9 | antony | 75 |


In [30]:
ngrams_top(corpus,(2,2),n=10)

| | text | count |
|---|---|---|
| 0 | let vs | 16 |
| 1 | mark antony | 13 |
| 2 | st thou | 12 |
| 3 | marke antony | 12 |
| 4 | thou art | 11 |
| 5 | art thou | 9 |
| 6 | noble brutus | 9 |
| 7 | thou hast | 9 |
| 8 | good night | 9 |
| 9 | enter brutus | 9 |


In [31]:
ngrams_top(corpus,(3,3),n=10)

| | text | count |
|---|---|---|
| 0 | did st thou | 5 |
| 1 | beware ides march | 3 |
| 2 | thou sleep st | 3 |
| 3 | mark antony shall | 3 |
| 4 | thee thou st | 3 |
| 5 | let vs heare | 3 |
| 6 | brutus honourable man | 3 |
| 7 | brutus sayes ambitious | 3 |
| 8 | thou did st | 3 |
| 9 | enter brutus messala | 3 |
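
For a quick check without scikit-learn, the same counting idea can be sketched with `collections.Counter`. This is a simplified stand-in, assuming plain whitespace tokenization and no stop-word removal, so its counts will generally differ from `CountVectorizer`. The name `top_ngrams` and its parameters are hypothetical.

```python
from collections import Counter

def top_ngrams(sentences, n=2, k=10):
    counts = Counter()
    for sentence in sentences:
        # Assumption: lowercase and split on whitespace only
        tokens = sentence.lower().split()
        # slide a window of n tokens across the sentence
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # most_common returns (ngram, count) pairs, highest count first
    return counts.most_common(k)

print(top_ngrams(["thou art noble", "thou art kind"], n=2, k=1))
```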


## Repetitive Character
<a id="1.13"></a>

To keep up to n repeated characters instead of one, **change the return line in the *rep* function to `grp[0:n]`**.

In [32]:
def rep(text):
    grp = text.group(0)
    if len(grp) > 1:
        return grp[0:1] # can change the value here on repetition
def unique_char(rep,sentence):
    convert = re.sub(r'(\w)\1+', rep, sentence) 
    return convert

In [33]:
sentence="heyyy this is loong textttt sooon"
unique_char(rep,sentence)

'hey this is long text son'
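
The number of repeats to keep can also be passed as an argument rather than edited in the function body. A minimal sketch, assuming a hypothetical helper `limit_repeats` with a `max_repeats` parameter (neither name is in the original notebook):

```python
import re

def limit_repeats(text, max_repeats=1):
    # (\w)\1+ matches a word character followed by one or more copies of itself
    def shorten(match):
        run = match.group(0)
        # keep at most max_repeats copies of the repeated character
        return match.group(1) * min(len(run), max_repeats)
    return re.sub(r'(\w)\1+', shorten, text)

print(limit_repeats("heyyy this is loong textttt sooon"))
print(limit_repeats("heyyy sooon", max_repeats=2))
```

With `max_repeats=2`, legitimate double letters such as the "oo" in "soon" survive, which the single-character version collapses.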

## Dollar
<a id="1.14"></a>

In [34]:
def find_dollar(text):
    line=re.findall(r'\$\d+(?:\.\d+)?',text)
    return " ".join(line)

# \$ - dollar sign followed by
# \d+ one or more digits
# (?:\.\d+)? - decimal which is optional

In [35]:
sentence="this shirt costs $20.56"
find_dollar(sentence)

'$20.56'

## Number-Greater
<a id="1.15"></a>

In [36]:
# Numbers greater than or equal to 930 (930-999, or any number with four or more digits)
def num_great(text): 
    line=re.findall(r'9[3-9][0-9]|[1-9]\d{3,}',text)
    return " ".join(line)

In [37]:
sentence="It is expected to be more than 935 corona death and 29974 observation cases across 29 states in india"
num_great(sentence)

'935 29974'

## Number Lesser
<a id="1.16"></a>

In [38]:
# Numbers less than 930
def num_less(text):
    all_num=[]
    for i in text.split():
        # the whole token must be 900-929, 100-899, 10-99 or 0-9
        line=re.findall(r'^(9[0-2][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])$',i)
        all_num.extend(line)
    return " ".join(all_num)

In [39]:
sentence="There are some countries where less than 920 cases exist with 1100 observations"
num_less(sentence)

'920'
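
Digit-range patterns like these are hard to adapt to a different threshold. One hedged alternative is to extract whole numbers with a regex and compare them as integers. `numbers_below` and its `limit` parameter are hypothetical names for this sketch, and it assumes plain integers (a decimal such as `3.5` would be read as `3` and `5`).

```python
import re

def numbers_below(text, limit=930):
    # \b\d+\b picks standalone runs of digits; the comparison is numeric
    return [int(n) for n in re.findall(r'\b\d+\b', text) if int(n) < limit]

sentence = "There are some countries where less than 920 cases exist with 1100 observations"
print(numbers_below(sentence))
```

Flipping the comparison to `>=` gives the "greater" variant with the same threshold.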

## OR
<a id="1.17"></a>

In [40]:
def or_cond(text,key1,key2):
    line=re.findall(r"{}|{}".format(key1,key2), text) 
    return " ".join(line)

In [41]:
sentence="sad and sorrow displays emotions"
or_cond(sentence,'sad','sorrow')

'sad sorrow'
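
Because the keys are formatted straight into the pattern, characters such as `(` or `.` in a key are interpreted as regex syntax. A small sketch that escapes the keys and accepts any number of them is given here; `or_cond_many` is a hypothetical name, and treating keys as literal text is an assumption about the intended use.

```python
import re

def or_cond_many(text, keys):
    if not keys:
        return []  # an empty pattern would match at every position
    # re.escape makes each key match literally; "|" joins them as alternatives
    pattern = "|".join(re.escape(k) for k in keys)
    return re.findall(pattern, text)

print(or_cond_many("sad and sorrow displays emotions (really)", ["sad", "sorrow", "(really)"]))
```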

## AND
<a id="1.18"></a>

In [42]:
def and_cond(text):
    line=re.findall(r'(?=.*do)(?=.*die).*', text) 
    return " ".join(line)

In [43]:
print("Both string present:",and_cond("do or die is a motivating phrase"))
print("Only one string present :",and_cond('die word is other side of emotion'))

Both string present: do or die is a motivating phrase
Only one string present : 
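
The lookahead pattern has `do` and `die` hard-coded. The same technique can be sketched for an arbitrary list of keywords by building one lookahead per keyword. `and_cond_all` is a hypothetical helper, and it returns a boolean rather than the matched text.

```python
import re

def and_cond_all(text, keys):
    # One (?=.*key) lookahead per keyword, all evaluated from the start of the text.
    # re.escape treats each keyword as literal text.
    pattern = "^" + "".join("(?=.*{})".format(re.escape(k)) for k in keys)
    # DOTALL lets ".*" cross line breaks, so keywords may sit on different lines
    return bool(re.search(pattern, text, flags=re.DOTALL))

print(and_cond_all("do or die is a motivating phrase", ["do", "die"]))
print(and_cond_all("die word is other side of emotion", ["do", "die"]))
```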


## Dates
<a id="1.19"></a>

In [44]:
# mm/dd/yyyy format
def find_dates(text):
    line=re.findall(r'\b(1[0-2]|0[1-9])/(3[01]|[12][0-9]|0[1-9])/([0-9]{4})\b',text)
    return line

In [45]:
sentence="Todays date is 04/28/2020 for format mm/dd/yyyy, not 28/04/2020"
find_dates(sentence)

[('04', '28', '2020')]
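
The regex checks the shape of a date but not the calendar, so a string like `02/31/2020` still matches. A hedged sketch that hands each candidate to `datetime.strptime` for validation follows; `find_valid_dates` is a hypothetical name and the mm/dd/yyyy format is assumed throughout.

```python
import re
from datetime import datetime

def find_valid_dates(text):
    valid = []
    # loose shape check first: two digits / two digits / four digits
    for candidate in re.findall(r'\b\d{2}/\d{2}/\d{4}\b', text):
        try:
            # strptime raises ValueError for impossible dates such as 02/31/2020
            valid.append(datetime.strptime(candidate, '%m/%d/%Y').date())
        except ValueError:
            continue
    return valid

print(find_valid_dates("04/28/2020 is fine, 02/31/2020 and 28/04/2020 are not"))
```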

## Only Words
<a id="1.20"></a>

In [46]:
def only_words(text):
    line=re.findall(r'\b[^\d\W]+\b', text)
    return " ".join(line)


In [47]:
sentence="the world population has grown from 1650 million to 6000 million"
only_words(sentence)

'the world population has grown from million to million'

## Only Numbers
<a id="1.21"></a>

In [48]:
def only_numbers(text):
    line=re.findall(r'\b\d+\b', text)
    return " ".join(line)

In [49]:
sentence="the world population has grown from 1650 million to 6000 million"
only_numbers(sentence)

'1650 6000'

## Boundaries
Match a word only when it appears as a whole word, using word boundaries (`\b`).
<a id="1.22"></a>

In [50]:
# Extracting word with boundary
def boundary(text):
    line=re.findall(r'\bneutral\b', text)
    return " ".join(line)

In [51]:
sentence="Most tweets are neutral in twitter"
boundary(sentence)

'neutral'


## Search
Is the keyword present in the sentence?
<a id="1.23"></a>

In [52]:
def search_string(text,key):
    return bool(re.search(key, text)) # key is treated as a regex pattern

In [53]:
sentence="Happy Mothers day to all Moms"
search_string(sentence,'day')

True

## Pick Sentence
To get every sentence that contains a particular keyword, we can use the function below.
<a id="1.24"></a>

In [54]:
def pick_only_key_sentence(text,keyword):
    line=re.findall(r'([^.]*'+keyword+'[^.]*)', text)
    return line

In [55]:
sentence="People are fighting with covid these days.Economy has fallen down.How will we survice covid"
pick_only_key_sentence(sentence,'covid')

['People are fighting with covid these days', 'How will we survice covid']

## Duplicate Sentence
Most web-scraped data contains duplicated sentences. This function retrieves the unique ones, keeping the last occurrence of each repeated line.
<a id="1.25"></a>

In [56]:
def pick_unique_sentence(text):
    line=re.findall(r'(?sm)(^[^\r\n]+$)(?!.*^\1$)', text)
    return line

In [57]:
sentence="I thank doctors\nDoctors are working very hard in this pandemic situation\nI thank doctors"
pick_unique_sentence(sentence)

['Doctors are working very hard in this pandemic situation', 'I thank doctors']
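
The regex keeps the last occurrence of a repeated line. If keeping the first occurrence is preferred, a regex-free sketch can rely on dictionaries preserving insertion order; `unique_lines` is a hypothetical name, and it assumes one sentence per line.

```python
def unique_lines(text):
    # dict.fromkeys drops later duplicates while preserving first-seen order
    return list(dict.fromkeys(text.splitlines()))

sentence = "I thank doctors\nDoctors are working very hard in this pandemic situation\nI thank doctors"
print(unique_lines(sentence))
```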

## Caps Words
Extract words that start with a capital letter. Names, places, and other proper nouns usually start with a capital letter in text. The pattern needs at least two characters, so a single-letter word such as "I" is not matched.
<a id="1.26"></a>

In [58]:
def find_capital(text):
    line=re.findall(r'\b[A-Z]\w+', text)
    return line

In [59]:
sentence="World is affected by corona crisis.No one other than God can save us from it"
find_capital(sentence)

['World', 'No', 'God']

## Length of words
<a id="1.27"></a>
No regex here, just a one-liner that counts the words in each sentence.

In [60]:
text_length = list(map(lambda x: len(x.split()), corpus))
text_length[:5]

[11, 3, 3, 12, 2]

## Length of characters
<a id="1.28"></a>
No regex here, just a one-liner that counts the characters in each sentence, including spaces.

In [61]:
char_length = list(map(len, corpus))
char_length[:5]

[61, 14, 14, 66, 9]

<a id="1.29"></a>
## Get ID
Many datasets contain IDs with a prefix. To pick only the numeric part of the ID and leave the prefix out, we can apply the function below.

In [62]:
def find_id(text):
    line=re.findall(r'\bIND(\d+)', text)
    return line

In [63]:
sentence="My company id is IND50120.And I work under Asia region"
find_id(sentence)

['50120']

<a id="1.30"></a>
## Specific String Rows
Querying for a specific string can also be done by applying *"str.contains("XXXX")"* directly to a series/column of a dataframe.

In [64]:
my_string_rows = df[df['Text'].str.contains("good")]
my_string_rows[['Text']].sample(3)

| | Text |
|---|---|
| 1662 | I do not thinke it good |
| 979 | For your part , To you , our Swords haue leade... |
| 1725 | Beare with me good Boy , I am much forgetfull . |


<a id="1.31"></a>
## Hex code to Color
Convert hex color codes to color names. This requires installing and importing `webcolors`, and it only works for hex codes that have a CSS3 color name.

In [65]:
def find_color(string): 
    text = re.findall(r'#(?:[0-9a-fA-F]{3}){1,2}',string)
    conv_name=[]
    for i in text:
        conv_name.append(webcolors.hex_to_name(i))
    return conv_name

In [66]:
sentence="Find the color of #00FF00 and #FF4500"
find_color(sentence)

['lime', 'orangered']

Try more hex codes: https://www.rapidtables.com/web/css/css-color.html
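
`webcolors.hex_to_name` raises a `ValueError` for hex codes that have no CSS3 name. When a name is not required, converting the code to an RGB tuple works for any valid hex code. A standard-library sketch, with `hex_to_rgb` as a hypothetical helper name:

```python
def hex_to_rgb(code):
    digits = code.lstrip('#')
    if len(digits) == 3:
        # expand shorthand such as "0f0" to "00ff00"
        digits = ''.join(ch * 2 for ch in digits)
    if len(digits) != 6:
        raise ValueError("expected 3 or 6 hex digits, got {!r}".format(code))
    # each pair of hex digits is one channel: red, green, blue
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

print(hex_to_rgb('#00FF00'))
print(hex_to_rgb('#FF4500'))
```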

<a id="1.32"></a>
## Tags
Most web-scraped data contains HTML tags. They can be removed with the regex below.

In [67]:
def remove_tag(string):
    text=re.sub('<.*?>','',string)
    return text

In [68]:
sentence="Markdown sentences can use <br> for breaks and <i></i> for italics"
remove_tag(sentence)

'Markdown sentences can use  for breaks and  for italics'
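
The `<.*?>` pattern is fine for simple markup, but regexes can misfire on less regular HTML. As a hedged alternative, the standard library's `html.parser.HTMLParser` can collect only the text nodes. The class `_TextExtractor` and the function `strip_tags` are hypothetical names for this sketch.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # called for text between tags; tags themselves are skipped
        self.parts.append(data)

def strip_tags(html_text):
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

print(strip_tags("Markdown sentences can use <br> for breaks and <i>italics</i>"))
```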

<a id="1.33"></a>
## IP Address
Extract IP address from text.

In [69]:
def ip_add(string):
    text=re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}',string)
    return text

In [70]:
sentence="An example of ip address is 125.16.100.1"
ip_add(sentence)

['125.16.100.1']
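
The pattern accepts any group of one to three digits, so `999.1.1.1` would also be returned. A sketch that filters candidates through the standard-library `ipaddress` module follows; `find_valid_ipv4` is a hypothetical name, and only IPv4 is considered.

```python
import re
import ipaddress

def find_valid_ipv4(text):
    # dotted-quad candidates that are not glued to other digits
    candidates = re.findall(r'(?<!\d)\d{1,3}(?:\.\d{1,3}){3}(?!\d)', text)
    valid = []
    for candidate in candidates:
        try:
            # raises a ValueError subclass when an octet is out of range
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        valid.append(candidate)
    return valid

print(find_valid_ipv4("An example of ip address is 125.16.100.1, unlike 999.1.1.1"))
```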

<a id="1.34"></a>
## Mac Address
Extract MAC addresses from text.

In [71]:
def mac_add(string):
    text=re.findall('(?:[0-9a-fA-F]:?){12}',string)
    return text
#https://stackoverflow.com/questions/26891833/python-regex-extract-mac-addresses-from-string/26892371

In [72]:
sentence="MAC ADDRESSES of this laptop - 00:24:17:b1:cc:cc .Other details will be mentioned"
mac_add(sentence)

['00:24:17:b1:cc:cc']
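
Because the colon is optional in the pattern, any run of twelve hex characters (for example `deadbeefcafe`) also matches. A stricter sketch that requires six two-digit groups separated by `:` or `-` is shown here. `find_mac` is a hypothetical name, and accepting `-` as a separator is an assumption.

```python
import re

def find_mac(text):
    # five "XX:" or "XX-" groups followed by a final "XX", on word boundaries
    return re.findall(r'\b(?:[0-9a-fA-F]{2}[:-]){5}[0-9a-fA-F]{2}\b', text)

print(find_mac("MAC 00:24:17:b1:cc:cc and 00-24-17-B1-CC-CD, not deadbeefcafe"))
```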

<a id="1.35"></a>
## Subword
Count how many times a subword occurs in a sentence or word.

In [73]:
def subword(string,sub): 
    text=re.findall(sub,string)
    return len(text)

In [74]:
sentence = 'Fundamentalism and constructivism are important skills'
subword(sentence,'ism') # change subword and try for others

2
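
`re.findall` counts non-overlapping matches and treats the subword as a regex pattern. If overlapping occurrences of a literal subword should be counted, a zero-width lookahead is one option. `count_overlapping` is a hypothetical helper name.

```python
import re

def count_overlapping(string, sub):
    # (?=...) matches at every position where sub starts, without consuming it
    return len(re.findall('(?=' + re.escape(sub) + ')', string))

print(count_overlapping('Fundamentalism and constructivism are important skills', 'ism'))
print(count_overlapping('aaaa', 'aa'))
```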

<a id="1.36"></a>
## Latitude & Longitude
Check whether a string is a valid latitude and longitude pair.

In [75]:
def lat_lon(string):
    text=re.findall(r'^[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),\s*[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$',string)
    if text!=[]:
        print("[{}] is valid latitude & longitude".format(string))
    else:
        print("[{}] is not a valid latitude & longitude".format(string))

In [76]:
lat_lon('28.6466772,76.8130649')
lat_lon('2324.3244,3423.432423')

[28.6466772,76.8130649] is valid latitude & longitude
[2324.3244,3423.432423] is not a valid latitude & longitude
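
The range check can also be done numerically, which is easier to read than the digit-range regex and returns a value instead of printing. A minimal sketch, assuming the input is a `"lat,lon"` string in decimal degrees; `is_valid_lat_lon` is a hypothetical name.

```python
def is_valid_lat_lon(value):
    parts = value.split(',')
    if len(parts) != 2:
        return False
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return False  # one of the parts is not a number
    # latitude spans -90..90, longitude spans -180..180
    return -90 <= lat <= 90 and -180 <= lon <= 180

print(is_valid_lat_lon('28.6466772,76.8130649'))
print(is_valid_lat_lon('2324.3244,3423.432423'))
```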


<a id="1.37"></a>
## PAN

PAN Validation:

[First 5 letters in CAPS + 4 digits + last letter in CAPS]

In [77]:
def valid_pan(string):
    text=re.findall(r'^([A-Z]){5}([0-9]){4}([A-Z]){1}$',string)
    if text!=[]:
        print("{} is valid PAN number".format(string))
    else:
        print("{} is not a valid PAN number".format(string))

In [78]:
valid_pan("ABCSD0123K")
valid_pan("LEcGD012eg")

ABCSD0123K is valid PAN number
LEcGD012eg is not a valid PAN number
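
For use inside a pipeline (for example with `DataFrame.apply`), a version that returns a boolean is often more convenient than one that prints. A sketch using `re.fullmatch`, which, unlike `$`, does not accept a trailing newline; `PAN_PATTERN` and `is_valid_pan` are hypothetical names, and the check covers only the format described above.

```python
import re

# five capital letters, four digits, one capital letter
PAN_PATTERN = re.compile(r'[A-Z]{5}[0-9]{4}[A-Z]')

def is_valid_pan(value):
    # fullmatch requires the entire string to fit the pattern
    return PAN_PATTERN.fullmatch(value) is not None

print(is_valid_pan("ABCSD0123K"))
print(is_valid_pan("LEcGD012eg"))
```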
