# Regular Expression 

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are widely used in the UNIX world.

The Python module re provides full support for Perl-like regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.
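
As a small illustration of the exception mentioned above, here is a minimal sketch that compiles one valid and one invalid pattern. The helper name `try_compile` is made up for this example and is not part of the `re` module.

```python
import re

def try_compile(pattern):
    """Return a compiled pattern, or None if the pattern is invalid.

    try_compile is a hypothetical helper written for this example.
    """
    try:
        return re.compile(pattern)
    except re.error as exc:
        # re.error carries a message describing what went wrong.
        print(f"Invalid pattern {pattern!r}: {exc}")
        return None

valid = try_compile(r"[Pp]ython")     # well-formed set
invalid = try_compile(r"[Pp]ython(")  # unbalanced parenthesis
```

The second call prints an error message instead of stopping the notebook, because the exception is caught.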

We will cover the main functions used to handle regular expressions. But a small thing first: several characters have a special meaning when they are used in a regular expression, and the backslash is also Python's own string escape character. To avoid any confusion while dealing with regular expressions, we will write patterns as raw strings: **r'expression'**.
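
To see what the raw string prefix changes, here is a short sketch. The sample text is made up for illustration.

```python
import re

plain = "\\bcat\\b"  # ordinary string: each backslash has to be doubled
raw = r"\bcat\b"     # raw string: backslashes are kept exactly as written

print(plain == raw)           # both spell the same pattern
print(len("\n"), len(r"\n"))  # one newline character vs. backslash plus n

matches = re.findall(raw, "cat concat cat.")
print(matches)
```

Both spellings produce the same pattern, but the raw string is easier to read and harder to get wrong.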

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/hacker_news.csv


In [2]:
hn = pd.read_csv("/kaggle/input/hacker_news.csv")
hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48


In [3]:
title = hn["title"]
title.head(10)

0                            Interactive Dynamic Video
1    Florida DJs May Face Felony for April Fools' W...
2         Technology ventures: From Idea to Enterprise
3    Note by Note: The Making of Steinway L1037 (2007)
4    Title II kills investment? Comcast and other I...
5                       Nuts and Bolts Business Advice
6          Ask HN: How to improve my personal website?
7    Shims, Jigs and Other Woodworking Concepts to ...
8                               That self-appendectomy
9    Crate raises $4M seed round for its next-gen S...
Name: title, dtype: object

In [4]:
# Create a pattern for the regular expression
pattern = r"[Pp]ython" # [] is a set

In [5]:
title.str.contains(pattern).sum()

160

In [6]:
# Show the titles that mention Python or python.
title[title.str.contains(pattern)]

102                    From Python to Lua: Why We Switched
103              Ubuntu 16.04 LTS to Ship Without Python 2
144      Create a GUI Application Using Qt and Python i...
196      How I Solved GCHQ's Xmas Card with Python and ...
436      Unikernel Power Comes to Java, Node.js, Go, an...
                               ...                        
19597    David Beazley  Python Concurrency from the Gro...
19852      Ask HN: How to automate Python apps deployment?
19862                            Moving Away from Python 2
19980                        Python vs. Julia Observations
19998    Show HN: Decorating: Animated pulsed for your ...
Name: title, Length: 160, dtype: object

In [7]:
# RE using Python
import re

In [8]:
python = 0

for i in title:
    if re.search(pattern,i):
        python += 1

In [9]:
python 

160
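
The counting loop above can also be written with a precompiled pattern and `sum()`. This is only a sketch: `sample_titles` is a small hypothetical list standing in for the `title` Series, so the count here is not the 160 from the real data.

```python
import re

# Hypothetical sample titles, standing in for the `title` Series.
sample_titles = [
    "From Python to Lua: Why We Switched",
    "Moving Away from python 2",
    "Nuts and Bolts Business Advice",
]

# Compiling once avoids re-parsing the pattern on every iteration.
python_re = re.compile(r"[Pp]ython")

# Count the titles in which the pattern is found anywhere.
count = sum(1 for t in sample_titles if python_re.search(t))
print(count)
```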

In [10]:
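# Numbers from 1000 to 2999, written out one digit at a time
pattern = r"[12][0-9][0-9][0-9]"
# The same pattern written with a quantifier (this assignment replaces the one above)
pattern = r"[12][0-9]{3}"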

If we wanted to write a pattern that matches the numbers from 1000 to 2999 in text, we could write the regular expression

`[1-2][0-9][0-9][0-9]` or `[1-2][0-9]{3}`. The `{3}` syntax is called a `quantifier`; in this case, it's a numeric quantifier. The sets `[1-2]` and `[12]` are equivalent.

Quantifiers specify how many times the preceding character or set must repeat, which helps when we want to match substrings of specific lengths.

* `a{3}` -> the character `a` three times
* `a{3,5}` -> the character `a` three, four or five times
* `a{,3}` -> the character `a` zero, one, two or three times
* `a{8,}` -> the character `a` eight or more times
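
The numeric quantifiers listed above can be checked with a short sketch. The sample text and the helper `lengths_matched` are made up for this example.

```python
import re

# A 1 or 2 followed by exactly three digits.
year_re = re.compile(r"[12][0-9]{3}")

# Hypothetical sample text, loosely based on a title from the dataset.
text = "Steinway L1037 (2007), founded 1853, model 3999"
years = year_re.findall(text)
print(years)

def lengths_matched(pattern, max_len=9):
    """Return the run lengths of 'a' (0..max_len) that the whole pattern accepts."""
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, "a" * n)]

bounds = {p: lengths_matched(p) for p in [r"a{3}", r"a{3,5}", r"a{,3}", r"a{8,}"]}
for p, lengths in bounds.items():
    print(p, lengths)
```

Note that `1037` is picked out of `L1037`: the pattern only describes the four digits themselves and does not check what surrounds them.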

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that we're likely to use. A summary of them is below.

a* -> equivalent to a{0,} zero or more

a+ -> equivalent to a{1,} one or more

a? -> equivalent to a{0,1} zero or one (optional)
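
A quick way to compare the three shorthand quantifiers is to test each against an empty string, a single `a`, and a run of `a`s. This sketch uses `re.fullmatch` so the whole string has to fit the pattern.

```python
import re

samples = ["", "a", "aaa"]

# For each quantifier, record which samples match in full.
results = {
    pat: [bool(re.fullmatch(pat, s)) for s in samples]
    for pat in [r"a*", r"a+", r"a?"]
}

for pat, flags in results.items():
    print(pat, flags)
```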

In [11]:
# Match "email" or "e_mail" in the titles (the underscore is optional)
pattern = r"e_?mail"

In [12]:
title.str.contains(pattern).sum()

81

In [13]:
title[title.str.contains(pattern)]

119      Show HN: Send an email from your shell to your...
313          Disposable emails for safe spam free shopping
1361     Ask HN: Doing cold emails? helps us prove this...
1750     Protect yourself from spam, bots and phishing ...
2421                    Ashley Madison hack treating email
                               ...                        
18098    House panel looking into Reddit post about Cli...
18583    Mailgen  Generates clean, responsive HTML for ...
18847    Show HN: Crisp iOS keyboard for email and text...
19303    Ask HN: Why big email providers don't sign the...
19446    Tell HN: Secure email provider Riseup will run...
Name: title, Length: 81, dtype: object
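
It can help to check the pattern against a few spellings by hand. In the sketch below, `narrow` is the pattern used above, while `broad` is an assumed variant that is not part of the original notebook; it also accepts a hyphen and a capital E.

```python
import re

narrow = re.compile(r"e_?mail")       # pattern from the cell above
broad = re.compile(r"[Ee][-_]?mail")  # assumed variant, for illustration only

samples = ["email", "e_mail", "e-mail", "Email"]

narrow_hits = [s for s in samples if narrow.search(s)]
broad_hits = [s for s in samples if broad.search(s)]

print(narrow_hits)
print(broad_hits)
```

Whether the broader variant is what you want depends on the data, so the count of 81 above applies only to the original pattern.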

Summary of syntax for some of the regex character classes:

* set -> [fud] either f, u or d
* range -> [a-e] any of the characters a, b, c, d or e
* range -> [0-3] any of the characters 0, 1, 2 or 3
* range -> [A-Z] any uppercase letter
* set+range -> [A-Za-z] any uppercase or lowercase letter
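
Here is a small sketch of sets and ranges in action, using made-up sample strings. It also shows that a range has to run from a lower character to a higher one, so a range such as `[a-Z]` is rejected.

```python
import re

set_hits = re.findall(r"[fud]", "fund")               # 'n' is not in the set
range_hits = re.findall(r"[a-e]", "badge")            # 'g' is outside a-e
digit_hits = re.findall(r"[0-3]", "2016")             # '6' is outside 0-3
letter_runs = re.findall(r"[A-Za-z]+", "HN 2016: Ask")

print(set_hits, range_hits, digit_hits, letter_runs)

# 'a' comes after 'Z' in character order, so this range is invalid.
try:
    re.compile(r"[a-Z]")
    reversed_range_ok = True
except re.error as exc:
    reversed_range_ok = False
    print("invalid range:", exc)
```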

There are some other common character classes which we'll use a lot.

* Digit -> **`\d`** any digit character (equivalent to [0-9] for ASCII text)
* Word -> **`\w`** any letter, digit or underscore character (equivalent to [A-Za-z0-9_] for ASCII text). Does not include punctuation or other special characters
* Whitespace -> **`\s`** any whitespace character, such as a space, tab or line break
* Dot -> **`.`** any character, including special characters, except a newline
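
The sketch below tries each of these classes on a made-up string that contains a tab and an underscore.

```python
import re

# Hypothetical sample text with a tab and an underscore in it.
text = "Crate raises $4M\tseed_round 2016"

digits = re.findall(r"\d", text)   # single digit characters
words = re.findall(r"\w+", text)   # runs of word characters
spaces = re.findall(r"\s", text)   # spaces and the tab

# The dot matches any character except a newline.
dot_hits = re.findall(r"a.", "a1 a\n ab")

print(digits)
print(words)
print(spaces)
print(dot_hits)
```

The `$` is dropped from `words` because it is not a word character, while the underscore in `seed_round` is kept.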

In [14]:
# Find bracketed tags such as [pdf] or [video] in the titles
pattern = r"(\[\w+\])"

In [15]:
title[title.str.contains(pattern)].head()


66     Analysis of 114 propaganda sources from ISIS, ...
100    Munich Gunman Got Weapon from the Darknet [Ger...
159         File indexing and searching for Plan 9 [pdf]
162    Attack on Kunduz Trauma Centre, Afghanistan  I...
195               [Beta] Speedtest.net  HTML5 Speed Test
Name: title, dtype: object

In [16]:
title.str.extract(pattern).head()

Unnamed: 0,0
0,NaN
1,NaN
2,NaN
3,NaN
4,NaN


In [17]:
title.str.extract(pattern).iloc[100]

0    [German]
Name: 100, dtype: object
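
The parentheses in the pattern form a capture group, and that group is what gets extracted. As a rough per-string analogue using only the `re` module, here is a sketch; the helper `first_group` is hypothetical, and the titles are taken from the outputs above.

```python
import re

titles = [
    "File indexing and searching for Plan 9 [pdf]",
    "[Beta] Speedtest.net  HTML5 Speed Test",
    "Interactive Dynamic Video",
]

def first_group(pattern, text):
    """Return the text captured by group 1, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Group around the whole tag: the brackets are part of the result.
tags = [first_group(r"(\[\w+\])", t) for t in titles]

# Group around the word only: the brackets are matched but not captured.
names = [first_group(r"\[(\w+)\]", t) for t in titles]

print(tags)
print(names)
```

Moving the parentheses inside the escaped brackets is a small change that returns the tag name without the brackets.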

### Negative set

* Negative Set -> [^fud] any character except f, u or d
* Negative Set -> [^1-3Z\s] any character except 1, 2, 3, Z or a whitespace character
* Negative Digit -> \D any character except a digit character
* Negative Word -> \W any character except a word character
* Negative Whitespace -> \S any character except a whitespace character
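
The negative classes can be tried on a short string in the same way. The sample text below is made up from the tail of one of the titles.

```python
import re

text = "Plan 9 [pdf]"

not_fud = re.findall(r"[^fud]", "fund")  # everything except f, u and d
non_digits = re.findall(r"\D+", text)    # runs of non-digit characters
non_words = re.findall(r"\W", text)      # spaces and brackets
non_spaces = re.findall(r"\S+", text)    # runs separated by whitespace

print(not_fud)
print(non_digits)
print(non_words)
print(non_spaces)
```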

In [18]:
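# Suppose the text can contain "Javascript", "javaScript", "Java" or "java",
# and we only want the plain "Java" / "java" mentions.
# First attempt: require the character after "ava" to be anything except S or s.
# Note that [^Ss] still has to match one character, so "Java" at the very end
# of a string is not found.
pattern = r"[Jj]ava[^Ss]"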

In [19]:
"I am Java lover"
"I am Java lover and JavaScript"
"I am Javaprogramming lover"
"I am Java"

'I am Java'

In [20]:
if re.search(pattern,"I am Java"):
    print("I found")
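
The cell above prints nothing. The sketch below runs the same pattern over the four sentences and records what it matches, which shows why: `[^Ss]` needs one more character after `Java`, and in the last sentence there is none.

```python
import re

pattern = r"[Jj]ava[^Ss]"

sentences = [
    "I am Java lover",
    "I am Java lover and JavaScript",
    "I am Javaprogramming lover",
    "I am Java",
]

# Map each sentence to the matched text, or None when there is no match.
found = {}
for s in sentences:
    m = re.search(pattern, s)
    found[s] = m.group() if m else None
    print(repr(s), "->", repr(found[s]))
```

The matched text also includes the extra character (a space or a `p`), which is another reason to look for a different approach.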

In [21]:
pat = r"\b[Jj]ava\b" # \b is the word boundary anchor

In [22]:
if re.search(pat,"I am Java"):
    print("I found")

I found


In [23]:
pat = r"\b[Jj]ava\w*\b"

In [24]:
if re.search(pat,"I am Javaprogramming lover"):
    print("I found")

I found
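
To compare the two word boundary patterns side by side, here is a sketch that runs both over one made-up sentence.

```python
import re

# Hypothetical sentence containing several Java-like words.
text = "Java, JavaScript, javaprogramming and java"

# \b on both sides: only the standalone word is matched.
exact = re.findall(r"\b[Jj]ava\b", text)

# \w* lets the word continue, so longer words starting with Java also match.
prefixed = re.findall(r"\b[Jj]ava\w*\b", text)

print(exact)
print(prefixed)
```

Unlike `[^Ss]`, the `\b` anchor matches a position rather than a character, so `java` at the end of the string is still found.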


In [25]:
pattern = r"^\[\w+\]"

In [26]:
title[title.str.contains(pattern)].head()

195                [Beta] Speedtest.net  HTML5 Speed Test
398        [video] Google Self-Driving SUV Sideswipes Bus
3136                          [CSS] Yellow Fade Technique
5054    [React] proptypes-parser: Define React PropTyp...
9389    [Petition] Tell Microsoft to stop making browsers
Name: title, dtype: object

In [27]:
pattern = r"\[\w+\]$"

title[title.str.contains(pattern)].head()

66     Analysis of 114 propaganda sources from ISIS, ...
100    Munich Gunman Got Weapon from the Darknet [Ger...
159         File indexing and searching for Plan 9 [pdf]
162    Attack on Kunduz Trauma Centre, Afghanistan  I...
210    A plan to rescue western democracy from the ig...
Name: title, dtype: object
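
The last two cells use the anchors `^` (start of the string) and `$` (end of the string). A minimal sketch on three titles from the outputs above:

```python
import re

titles = [
    "[video] Google Self-Driving SUV Sideswipes Bus",
    "File indexing and searching for Plan 9 [pdf]",
    "Interactive Dynamic Video",
]

# ^ pins the tag to the beginning of the title.
starts = [t for t in titles if re.search(r"^\[\w+\]", t)]

# $ pins the tag to the end of the title.
ends = [t for t in titles if re.search(r"\[\w+\]$", t)]

print(starts)
print(ends)
```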

### Lookarounds

* Positive lookahead -> `zzz(?=abc)` match zzz only when it is followed by abc
* Negative lookahead -> `zzz(?!abc)` match zzz only when it is not followed by abc
* Positive lookbehind -> `(?<=abc)zzz` match zzz only when it is preceded by abc
* Negative lookbehind -> `(?<!abc)zzz` match zzz only when it is not preceded by abc
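
The four lookarounds can be checked on two short made-up strings. The helper `starts` below is hypothetical; it returns the position where each match begins.

```python
import re

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

ahead_text = "zzzabc zzzxyz"
behind_text = "abczzz xyzzz"

positions = {
    "positive lookahead": starts(r"zzz(?=abc)", ahead_text),
    "negative lookahead": starts(r"zzz(?!abc)", ahead_text),
    "positive lookbehind": starts(r"(?<=abc)zzz", behind_text),
    "negative lookbehind": starts(r"(?<!abc)zzz", behind_text),
}
for name, found_at in positions.items():
    print(name, found_at)

# The lookaround text is checked but is not part of the match itself.
matched_text = re.search(r"zzz(?=abc)", ahead_text).group()
print(matched_text)
```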

In [28]:
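# Standalone C or c, followed by any character other than +
# (a title that ends in "C" is missed, because [^+] must match a character)
pat = r"\b[Cc]\b[^+]"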

In [29]:
title[title.str.contains(pat)].head()

353    VW C.E.O. Personally Apologized to President O...
365                     The new C standards are worth it
444          Moz raises $10m Series C from Foundry Group
521         Fuchsia: Micro kernel written in C by Google
549    How to Become a C.E.O.? The Quickest Path Is a...
Name: title, dtype: object

In [30]:
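# Standalone C or c that is not preceded by "Series " (negative lookbehind).
# The [^+] part is gone, so C++ titles match again.
pat = r"\b(?<!Series\s)[Cc]\b"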

In [31]:
title[title.str.contains(pat)].head()

13                Custom Deleters for C++ Smart Pointers
220                       Lisp, C++: Sadness in my heart
353    VW C.E.O. Personally Apologized to President O...
365                     The new C standards are worth it
508    BDE 3.0 (Bloomberg's core C++ library): Open S...
Name: title, dtype: object
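
The output above shows that `Series C` is now excluded, but titles with `C++` and `C.E.O.` still match. One possible refinement, shown only as a sketch and not part of the original notebook, is to add a negative lookahead that rejects a following `+` or `.`. The titles are taken from the outputs above.

```python
import re

# Assumed variant: lookbehind excludes "Series C", lookahead excludes "C++" and "C.".
pat = re.compile(r"(?<!Series\s)\b[Cc]\b(?![+.])")

titles = [
    "Custom Deleters for C++ Smart Pointers",
    "Moz raises $10m Series C from Foundry Group",
    "The new C standards are worth it",
    "Fuchsia: Micro kernel written in C by Google",
]

kept = [t for t in titles if pat.search(t)]
for t in kept:
    print(t)
```

Because a lookahead does not consume a character, this variant would also match a title that ends in a standalone `C`.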

Hope you have learnt how the whole process of working with regular expressions goes.
