# Intro

A popular opinion holds that the best way to use Machine Learning is to avoid it whenever possible. Machine Learning methods rely on AI, so why hand a task over to AI when we can easily solve it with human intelligence? It turns out that when we want to process text using patterns, Python's built-in re module is usually sufficient.

Note that I will not explain how regular expressions work (for that, see the documentation: https://docs.python.org/3/library/re.html). I will just show some examples.

Another important point I want to emphasize is that text is a complex data structure, so we need to be very careful when working with it. Otherwise, it is easy to make mistakes.

# 1 Extract phone numbers from text

Let's say that a phone number (without country code, 10 digits) can be represented as follows:

1) 1234567891

2) (123) 4567891

3) (123) 456-78-91

This is enough for the example. Now we need to write a regular expression that finds a phone number if it matches one of the three formats above, and returns nothing otherwise.

A useful website for building and testing a regular expression is https://regex101.com.

In [1]:
import re

def return_number(text):
    # raw string, so that backslashes such as \d reach the regex engine unchanged
    matches = re.findall(r'\d{10}|\(\d{3}\)\s*\d{7}|\(\d{3}\)\s*\d{3}\-\d{2}\-\d{2}', text)
    if len(matches) == 0:
        return 'Nothing found'
    return matches
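
For readability, the same three alternatives can also be written with the re.VERBOSE flag, which ignores whitespace inside the pattern and allows a comment next to each alternative. This is only a sketch: the names PHONE_PATTERN and find_numbers_verbose are hypothetical, and the function returns an empty list instead of 'Nothing found'.

```python
import re

# Same three alternatives as in return_number, one per line.
# With re.VERBOSE, whitespace in the pattern is ignored and '#' starts a comment.
PHONE_PATTERN = re.compile(r"""
      \d{10}                          # 1) 1234567891
    | \(\d{3}\)\s*\d{7}               # 2) (123) 4567891
    | \(\d{3}\)\s*\d{3}-\d{2}-\d{2}   # 3) (123) 456-78-91
""", re.VERBOSE)

def find_numbers_verbose(text):
    """Return every substring that matches one of the three formats."""
    return PHONE_PATTERN.findall(text)

print(find_numbers_verbose('aaaa 1234567891 aaaaa'))    # ['1234567891']
print(find_numbers_verbose('aaaaa(123) 456-78-91aaaa')) # ['(123) 456-78-91']
```

The behavior is meant to be identical to the one-line pattern; only the layout changes.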

### 1.1 Check if it works

Now let's give some correct examples to see that the algorithm works.

In [2]:
print(return_number('aaaa 1234567891 aaaaa')) # first case
print()
print(return_number('aaaaa (123) 4567891 aaaaaa')) # second case
print()
print(return_number('aaaaa(123) 456-78-91aaaa')) # third case

['1234567891']

['(123) 4567891']

['(123) 456-78-91']


Everything seems fine here.

### 1.2 A common mistake

People often look for confirmation of their hypotheses rather than trying to disprove them. In this case, we would stop at the last example and conclude that the code works correctly. Instead, I suggest going further and checking where the code might fail by trying it on strings that are not phone numbers.

In [3]:
print(return_number('aa123456789aa')) # less than 10 digits
print()
print(return_number('aaa12345678911aaa')) # more than 10 digits
print()
print(return_number('aaaa (1234) 567891aaaa')) # 4 digits in brackets
print()
print(return_number('aaaa (123) 456789 aaaaa')) # less than 7 digits after brackets
print()
print(return_number('a a a a (123) 456-789-1a a a a ')) # more than 2 digits after -

Nothing found

['1234567891']

Nothing found

Nothing found

Nothing found


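As we see, we made a mistake: the second example contains 11 digits, yet the pattern matched its first 10 digits and reported them as a phone number. Let's rewrite the code and check whether it is fine now.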
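
To see exactly which part of the string was matched, it can help to look at the match position as well as the matched text. A minimal sketch using the standard re.search and Match.span (the helper name show_match is hypothetical):

```python
import re

def show_match(pattern, text):
    """Return the matched substring and its (start, end) span, or None."""
    m = re.search(pattern, text)
    return (m.group(), m.span()) if m else None

# \d{10} is satisfied by the first 10 digits of the 11-digit run.
print(show_match(r'\d{10}', 'aaa12345678911aaa'))  # ('1234567891', (3, 13))

# A 9-digit run is too short, so nothing is found.
print(show_match(r'\d{10}', 'aa123456789aa'))      # None
```

The span shows that the match ends one character before the digits do, which is why nothing in the pattern prevents it.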

### 1.3 Correcting mistakes

In [4]:
import re

def return_number_fixed(text):
    
    matches = re.findall(r'\b\d{10}\b|\(\d{3}\)\s*\d{7}|\(\d{3}\)\s*\d{3}\-\d{2}\-\d{2}', text)
    
    if len(matches) == 0:
        return 'Nothing found'
    return matches

In [5]:
print(return_number_fixed('1234567891')) # first case
print()
print(return_number_fixed('(123) 4567891')) # second case
print()
print(return_number_fixed('(123) 456-78-91')) # third case

['1234567891']

['(123) 4567891']

['(123) 456-78-91']


In [6]:
print(return_number_fixed('123456789')) # less than 10 digits
print()
print(return_number_fixed('12345678911')) # more than 10 digits
print()
print(return_number_fixed('(1234) 567891')) # 4 digits in brackets
print()
print(return_number_fixed('(123) 456789')) # less than 7 digits after brackets
print()
print(return_number_fixed('(123) 456-789-1')) # more than 2 digits after -

print()
print(return_number_fixed(' 1234     ')) # only 4 digits, surrounded by spaces
print()
print(return_number_fixed('     1234567891    ')) # 10 digits between spaces
print()
print(return_number_fixed('     123456 7891    ')) # a space between digits
print()
print(return_number_fixed('     12345678911    ')) # more than 10 digits

Nothing found

Nothing found

Nothing found

Nothing found

Nothing found

Nothing found

['1234567891']

Nothing found

Nothing found


As we see, the previous mistake is corrected. A number with a space between its digits is still not recognized, but since that does not match any of our three patterns, I do not see a problem with it.
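
One gap remains that the tests above do not cover: \b guards only the first alternative, so a string such as '(123) 45678911' is still reported as '(123) 4567891'. One possible tightening (my assumption, not the only option) is to wrap all three alternatives in digit lookarounds. The names STRICT_PHONE and return_number_strict are hypothetical, and the function returns an empty list when nothing is found.

```python
import re

STRICT_PHONE = re.compile(
    r'(?<!\d)'                            # no digit immediately before the match
    r'(?:\d{10}'                          # 1) 1234567891
    r'|\(\d{3}\)\s*\d{7}'                 # 2) (123) 4567891
    r'|\(\d{3}\)\s*\d{3}-\d{2}-\d{2})'    # 3) (123) 456-78-91
    r'(?!\d)'                             # no digit immediately after the match
)

def return_number_strict(text):
    """Return the phone numbers that are not glued to additional digits."""
    return STRICT_PHONE.findall(text)

print(return_number_strict('(123) 45678911'))              # []
print(return_number_strict('(123) 456-78-911'))            # []
print(return_number_strict('12345678911'))                 # []
print(return_number_strict('call (123) 456-78-91 today'))  # ['(123) 456-78-91']
```

Note one design difference: unlike \b, the lookarounds only reject neighbouring digits, so a number glued to letters (as in 'abc1234567891') is accepted here. Whether that is desirable depends on the data.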

# 2 Split text into sentences

Suppose we have a large text and want to get a list of sentences. We will assume that a period, an exclamation mark, a question mark, or a line break marks the end of the current sentence and the beginning of a new one.

In [7]:
def split_text_to_sentences(text):
    if text == '':
        return 'There is no text'
    return re.split(r'[.!?\n]', text)

### 2.1 Sanity check

In [8]:
text = '''AAAAA. aaaaa! a? a
bbbbb'''
split_text_to_sentences(text)

['AAAAA', ' aaaaa', ' a', ' a', 'bbbbb']

### 2.2 Try to break the code

Suppose we place several exclamation marks (or other delimiters) in a row. In this case it probably makes sense to treat them as a single delimiter.

In [9]:
text = '''AAAAA. aaaaa!!!! a? a
bbbbb'''
split_text_to_sentences(text)
# the code broke: every extra '!' produces an empty string

['AAAAA', ' aaaaa', '', '', '', ' a', ' a', 'bbbbb']

Now it is time to improve the code.

### 2.3 Improvements

In [10]:
def split_text_to_sentences_modified(text):
    if text == '':
        return 'There is no text'
    return re.split(r'[.!?\n]+', text)

In [11]:
text = '''AAAAA. aaaaa! a? a
bbbbb'''
split_text_to_sentences_modified(text)

['AAAAA', ' aaaaa', ' a', ' a', 'bbbbb']

In [12]:
text = '''AAAAA... aaaaa!!!! a? a
bbbbb'''
split_text_to_sentences_modified(text)
# Now everything seems fine

['AAAAA', ' aaaaa', ' a', ' a', 'bbbbb']
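
Two details are still visible in this output: the fragments keep their leading spaces, and a text that ends with a delimiter would produce an empty string at the end of the list. A possible follow-up step, sketched here under the assumption that surrounding whitespace and empty fragments are not wanted (the name split_and_clean is hypothetical):

```python
import re

def split_and_clean(text):
    """Split on runs of . ! ? or line breaks, then strip and drop empty pieces."""
    parts = re.split(r'[.!?\n]+', text)
    return [part.strip() for part in parts if part.strip()]

text = '''AAAAA... aaaaa!!!! a? a
bbbbb.'''
print(split_and_clean(text))  # ['AAAAA', 'aaaaa', 'a', 'a', 'bbbbb']
print(split_and_clean(''))    # []
```

This version returns an empty list for empty input instead of the string 'There is no text', so the caller always receives a list.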

# 3 Information extraction from a Google form

Suppose people fill in a form that contains the question 'Enter your full name'. There are more questions, and the answers are separated by ';'. The idea is to write a function that returns all the text between 'Enter your full name: ' and ';' (the person's full name).

In [13]:
def search_full_name(text):
    try:
        # re.search returns None when nothing matches, so indexing it raises TypeError
        return re.search(r'Enter your full name:(.*?);', text)[1].strip()
    except TypeError:
        return 'Something went wrong'

### 3.1 Check if the solution works

In [14]:
search_full_name('Enter your full name: dfgsdfsgdfgdfsgsdfvd;')

'dfgsdfsgdfgdfsgsdfvd'

In [15]:
search_full_name('''Here is some weird text bla bla blaEnter your full name: my name is Aleksandr;and some another weird text''')
# fine

'my name is Aleksandr'

### 3.2 Try line wrapping

As we see, the text can be long. So, for convenience, I would like to split it across several lines. However, our code will not work in that case.

In [16]:
search_full_name('''Here is some weird text bla bla blaEnter your full 
name: name surname another part of name; 
                 and some another weird text''')

'Something went wrong'

### 3.3 Solution

Before passing the text to the search, we need to preprocess it. In our case, we will simply turn it into a single line by deleting the line separator '\n'.

In [17]:
def search_full_name_fix_bug(text):
    text = text.replace('\n', '')
    try:
        # re.search returns None when nothing matches, so indexing it raises TypeError
        return re.search(r'Enter your full name:(.*?);', text, re.DOTALL)[1].strip()
    except TypeError:
        return 'Something went wrong'

In [18]:
search_full_name_fix_bug('Enter your full name: dfgsdfsgdfgdfsgsdfvd;')

'dfgsdfsgdfgdfsgsdfvd'

In [19]:
search_full_name_fix_bug('''Here is some weird text bla bla blaEnter your full name: my name is Aleksandr;and some another weird text''')


'my name is Aleksandr'

In [20]:
search_full_name_fix_bug('''Here is some weird text bla bla blaEnter your full 
name: name surname another part of name; 
                 and some another weird text''')

'name surname another part of name'

Now it is fine.
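
One caveat: deleting '\n' works in the example above because each broken line happens to end with a space. If the line broke right after a word, 'full' and 'name' would be glued into 'fullname' and the search would fail again. A slightly more defensive sketch, assuming that any run of whitespace should count as a single space (the function name is hypothetical, the sample name is made up, and None is returned when nothing is found):

```python
import re

def search_full_name_normalized(text):
    """Collapse whitespace first, then extract the answer to the name question."""
    # Every run of spaces, tabs, or line breaks becomes one space.
    flat = re.sub(r'\s+', ' ', text)
    match = re.search(r'Enter your full name:(.*?);', flat)
    return match.group(1).strip() if match else None

# No space before either line break.
form = '''Enter your full
name: Ada
Lovelace; next question'''
print(search_full_name_normalized(form))               # Ada Lovelace
print(search_full_name_normalized('no question here')) # None
```

Replacing the line break with a space rather than deleting it also keeps the parts of a name apart when the answer itself is wrapped.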

# Conclusion

I want to draw two important conclusions from this notebook:

1) Although NLP is a very powerful set of tools and one of the main trends in modern Artificial Intelligence, it is not always needed. For some problems, simple regular expressions are a faster and more efficient way to build a good solution, not to mention that in some cases NLP methods are not applicable.

2) Always check the results. Very often, the question is not whether the solution works in principle, but when it might not work correctly. Asking it can save us a lot of time and effort.
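
One way to make the second conclusion a habit is to keep the positive and negative examples in a small table and re-run it after every change to the pattern. A minimal sketch using the pattern from section 1.3 and inputs taken from the cells above (the names PHONE, CASES, and failing_cases are hypothetical):

```python
import re

# Pattern copied from section 1.3.
PHONE = r'\b\d{10}\b|\(\d{3}\)\s*\d{7}|\(\d{3}\)\s*\d{3}\-\d{2}\-\d{2}'

# (input text, expected matches); an empty list means "must not match".
CASES = [
    ('1234567891', ['1234567891']),
    ('(123) 4567891', ['(123) 4567891']),
    ('(123) 456-78-91', ['(123) 456-78-91']),
    ('123456789', []),            # fewer than 10 digits
    ('12345678911', []),          # more than 10 digits
    ('(1234) 567891', []),        # 4 digits in brackets
    ('(123) 456789', []),         # fewer than 7 digits after brackets
    ('(123) 456-789-1', []),      # wrong grouping after the first hyphen
    ('     123456 7891    ', []), # a space between digits
]

def failing_cases(pattern=PHONE, cases=CASES):
    """Return (text, expected, actual) for every case that behaves unexpectedly."""
    failures = []
    for text, expected in cases:
        actual = re.findall(pattern, text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

print(failing_cases())  # an empty list means every case behaved as expected
```

Running the same table against the first version of the pattern from section 1 reports the 11-digit case as a failure, which is exactly the mistake found in section 1.2.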