# _Regular-Expressions_ 

A `RegexpTokenizer` splits a string into substrings using a regular expression. A pattern can, for example, form tokens out of alphabetic sequences and money expressions.
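As a rough illustration of that idea, the sketch below uses Python's standard `re` module rather than NLTK itself. The pattern is modelled on the kind of example found in the NLTK documentation linked below; the sample sentence and the variable names are assumptions made for illustration.

```python
import re

# Alternatives are tried left to right at each position:
#   \w+        a run of word characters
#   \$[\d.]+   a dollar amount such as $3.88
#   \S+        any other run of non-whitespace characters
pattern = re.compile(r"\w+|\$[\d.]+|\S+")

sample = "Good muffins cost $3.88 in New York."
money_tokens = pattern.findall(sample)
print(money_tokens)
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']
```

The order of the alternatives matters: the money expression is only reached when `\w+` cannot match at the current position, which is the case at the `$` sign.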

http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize 

http://www.nltk.org/howto/tokenize.html

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Regular-expressions" data-toc-modified-id="Regular-expressions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Regular expressions</a></span></li><li><span><a href="#NLTK-Regexp-tokenizer" data-toc-modified-id="NLTK-Regexp-tokenizer-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>NLTK Regexp tokenizer</a></span></li><li><span><a href="#Tokenizing" data-toc-modified-id="Tokenizing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Tokenizing</a></span></li><li><span><a href="#Converting-to-lower-case" data-toc-modified-id="Converting-to-lower-case-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Converting to lower case</a></span></li><li><span><a href="#Plotting-the-frequency-bar-chart" data-toc-modified-id="Plotting-the-frequency-bar-chart-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Plotting the frequency bar chart</a></span></li><li><span><a href="#Removing-stopwords" data-toc-modified-id="Removing-stopwords-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Removing stopwords</a></span></li></ul></div>

In [1]:
import re

In [2]:
with open('data/Audi.txt') as f:
    Audi = f.readlines()

In [3]:
Audi

['Audi AG is a German automobile manufacturer that designs, engineers, produces, markets and distributes luxury vehicles. Audi is a member of the Volkswagen Group and has its roots at Ingolstadt, Bavaria, Germany. Audi-branded vehicles are produced in nine production facilities worldwide. The origins of the company are complex, going back to the early 20th century and the initial enterprises (Horch and the Audiwerke) founded by engineer August Horch; and two other manufacturers (DKW and Wanderer), leading to the foundation of Auto Union in 1932. The modern era of Audi essentially began in the 1960s when Auto Union was acquired by Volkswagen from Daimler-Benz. After relaunching the Audi brand with the 1965 introduction of the Audi F103 series, Volkswagen merged Auto Union with NSU Motorenwerke in 1969, thus creating the present day form of the company. The company name is based on the Latin translation of the surname of the founder, August Horch. "Horch", meaning "listen" in German, bec

## Regular expressions

In [4]:
import re

example = 'Audi was established in the year 1960. Today is 09-03-2019.'

# Extract each word character individually
result = re.findall(r'\w', example)
print(result)

# Extract whole words
result_2 = re.findall(r'\w+', example)
print(result_2)

# Extract dates in the format dd-mm-yyyy or mm-dd-yyyy
result_3 = re.findall(r'\d+-\d+-\d{4}', example)
print(result_3)

['A', 'u', 'd', 'i', 'w', 'a', 's', 'e', 's', 't', 'a', 'b', 'l', 'i', 's', 'h', 'e', 'd', 'i', 'n', 't', 'h', 'e', 'y', 'e', 'a', 'r', '1', '9', '6', '0', 'T', 'o', 'd', 'a', 'y', 'i', 's', '0', '9', '0', '3', '2', '0', '1', '9']
['Audi', 'was', 'established', 'in', 'the', 'year', '1960', 'Today', 'is', '09', '03', '2019']
['09-03-2019']
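The date pattern above accepts any number of digits in the first two fields. A slightly stricter variant, sketched here under the assumption that dates are written with two-digit day and month fields in day-month-year order, uses capture groups and then hands the match to `datetime.strptime` for validation. The names `date_pattern` and `parsed` are illustrative only.

```python
import re
from datetime import datetime

text = "Today is 09-03-2019."

# Two digits, two digits, four digits; \b keeps the pattern from
# matching inside a longer run of digits.
date_pattern = re.compile(r"\b(\d{2})-(\d{2})-(\d{4})\b")

match = date_pattern.search(text)
first, second, year = match.groups()
print(first, second, year)   # 09 03 2019

# Assumption: day-month-year order. strptime raises ValueError for
# impossible dates such as 31-02-2019.
parsed = datetime.strptime(match.group(0), "%d-%m-%Y").date()
print(parsed)                # 2019-03-09
```

A regular expression alone cannot tell dd-mm-yyyy from mm-dd-yyyy when both fields are 12 or less, so the field order has to come from knowledge of the source text.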


## NLTK Regexp tokenizer
- NLTK has a built-in tokenizer, `RegexpTokenizer`, which we can use to extract runs of word characters matched by `\w`. For ASCII text, `\w` is equivalent to `[a-zA-Z0-9_]`.

In [5]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
print(tokenizer)

RegexpTokenizer(pattern='\\w+', gaps=False, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>)
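The equivalence between `\w` and `[a-zA-Z0-9_]` holds only for ASCII matching. The following sketch, which uses the standard `re` module and an assumed sample string, shows the difference between the default Unicode behaviour for `str` patterns and matching with `re.ASCII`.

```python
import re

text = "Zürich café 2019"

# Default for str patterns in Python 3: \w also matches accented letters.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)   # ['Zürich', 'café', '2019']

# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so accented
# letters act as separators and split the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(ascii_words)     # ['Z', 'rich', 'caf', '2019']
```

The tokenizer printed above reports the `UNICODE` flag, so it should keep accented letters inside tokens, as in the first result.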


## Tokenizing

In [6]:
tokens = tokenizer.tokenize(Audi[0])

In [7]:
tokens

['Audi',
 'AG',
 'is',
 'a',
 'German',
 'automobile',
 'manufacturer',
 'that',
 'designs',
 'engineers',
 'produces',
 'markets',
 'and',
 'distributes',
 'luxury',
 'vehicles',
 'Audi',
 'is',
 'a',
 'member',
 'of',
 'the',
 'Volkswagen',
 'Group',
 'and',
 'has',
 'its',
 'roots',
 'at',
 'Ingolstadt',
 'Bavaria',
 'Germany',
 'Audi',
 'branded',
 'vehicles',
 'are',
 'produced',
 'in',
 'nine',
 'production',
 'facilities',
 'worldwide',
 'The',
 'origins',
 'of',
 'the',
 'company',
 'are',
 'complex',
 'going',
 'back',
 'to',
 'the',
 'early',
 '20th',
 'century',
 'and',
 'the',
 'initial',
 'enterprises',
 'Horch',
 'and',
 'the',
 'Audiwerke',
 'founded',
 'by',
 'engineer',
 'August',
 'Horch',
 'and',
 'two',
 'other',
 'manufacturers',
 'DKW',
 'and',
 'Wanderer',
 'leading',
 'to',
 'the',
 'foundation',
 'of',
 'Auto',
 'Union',
 'in',
 '1932',
 'The',
 'modern',
 'era',
 'of',
 'Audi',
 'essentially',
 'began',
 'in',
 'the',
 '1960s',
 'when',
 'Auto',
 'Union',
 'wa

## Converting to lower case

In [8]:
tokens = [token.lower() for token in tokens]
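Lowercasing matters for the frequency counts that follow, because `The` and `the` would otherwise be counted as different tokens. A minimal sketch with a made-up token list and the standard library's `Counter`:

```python
from collections import Counter

# Hypothetical tokens with mixed capitalisation
sample_tokens = ["The", "company", "the", "Audi", "audi", "THE"]

before = Counter(sample_tokens)
after = Counter(token.lower() for token in sample_tokens)

print(before["the"], after["the"])    # 1 3
print(before["audi"], after["audi"])  # 1 2
```

The trade-off is that lowercasing also merges proper nouns with ordinary words of the same spelling, which may or may not be acceptable for a given analysis.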

## Plotting the frequency bar chart

In [9]:
from nltk.probability import FreqDist
from plotly.offline import iplot, init_notebook_mode
import plotly.graph_objs as go

# Enable Plotly rendering inside the notebook
init_notebook_mode(connected=True)

# Count token frequencies and keep the 20 most common
freq_dist = FreqDist(tokens)
top_20 = freq_dist.most_common(20)

# Split the (token, count) pairs into x and y values for the bar chart
x = [token for token, count in top_20]
y = [count for token, count in top_20]
data = [go.Bar(x=x, y=y)]
iplot(data)

Words such as `a`, `in`, and `the` appear in the top 20 because they are used so frequently in English; this sentence itself contains several of them. In text processing, such words are called `stopwords`.
These `stopwords` generally carry little information. We can remove them using NLTK's built-in stopword list.
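The counting step behind the chart can be reproduced without NLTK or Plotly. `FreqDist` is a subclass of `collections.Counter`, so the sketch below uses `Counter` directly on an assumed, already lowercased sample and unpacks the result into the two sequences a bar chart needs.

```python
from collections import Counter

# Hypothetical lowercased sample text
sample = ("the origins of the company are complex and the company "
          "name is based on the latin translation")
counts = Counter(sample.split())

# most_common returns (token, count) pairs, highest count first
top_2 = counts.most_common(2)
print(top_2)   # [('the', 4), ('company', 2)]

# zip(*pairs) separates the tokens from their counts
x, y = zip(*top_2)
print(x)       # ('the', 'company')
print(y)       # (4, 2)
```

Even in this short sample, the most frequent token is a function word rather than a content word.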

## Removing stopwords

If we are using text processing for information retrieval, we do not want these words taking up space in our database or consuming valuable processing time. We can remove them easily by keeping a list of the words we consider to be stop words. NLTK (Natural Language Toolkit) in Python has lists of stopwords stored for 16 different languages. You can find them in the `nltk_data` directory. To check the list of stopwords, you can run the following commands in the Python shell.

Note: You can even modify the list by adding words of your choice to the `english` file in the `stopwords` directory.

In [10]:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [11]:
tokens = [token for token in tokens if token.lower() not in stopwords]
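Membership tests against a list scan it element by element, so for longer texts it can help to convert the stop list to a `set` first. The sketch below uses a deliberately tiny, hypothetical stop list in place of NLTK's, and adds custom words in code instead of editing the data file; all names are illustrative.

```python
# Hypothetical, deliberately tiny stop list; NLTK's English list is much longer.
base_stop_words = {"a", "in", "the", "of", "and", "is"}

# Assumed domain-specific additions, made in code rather than on disk
custom_stop_words = {"ag"}
stop_set = base_stop_words | custom_stop_words

sample_tokens = ["audi", "ag", "is", "a", "german", "automobile", "manufacturer"]

# Set lookups take roughly constant time per token
filtered = [token for token in sample_tokens if token not in stop_set]
print(filtered)   # ['audi', 'german', 'automobile', 'manufacturer']
```

With the real list, the same approach would be `set(stopwords.words('english'))`, ideally stored under a name that does not shadow the imported `stopwords` module.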

In [12]:
freq_dist = FreqDist(tokens)
top_20 = freq_dist.most_common(20)

x  = [i[0] for i in top_20]
y  = [i[1] for i in top_20]
data = [go.Bar(
            x=x,
            y=y)]
layout = go.Layout(
    title='Frequency distribution',
)
fig = go.Figure(data=data, layout=layout)

iplot(fig)