## Regular Expression

A regular expression, often called a pattern, is **an expression used to specify a set of strings** required for a particular purpose. 

- A simple way to specify a finite set of strings is to list its elements or members. <br>For example `{Doc1,Doc2,Doc3}`. 

The set `{Doc1,Doc2,Doc3}` can also be specified by the pattern `Doc(1|2|3)`. <br>We say that this pattern matches each of the three strings. [Let's check](https://regex101.com/)

> In most formalisms, if there exists at least one regular expression that matches a particular set then there exists an infinite number of other regular expressions that also match it, i.e. **the specification is not unique**.<br>
> For example, the string set `{Doc1,Doc2,Doc3}` can also be specified by the pattern `Doc[123]`. The shorter pattern `Doc\d` matches these three strings too, but it also matches `Doc0` and `Doc4` through `Doc9`, so it specifies a larger set.
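
One way to check that two patterns specify the same set is to run both against the same candidates. The sketch below assumes Python's `re` module; the `candidates` list is made up for illustration, and `re.fullmatch` is used so that the whole string has to match.

```python
import re

# Hypothetical candidates; only the first three belong to the set.
candidates = ["Doc1", "Doc2", "Doc3", "Doc4", "Doc", "Doc12"]

alternation = re.compile(r"Doc(1|2|3)")
char_class = re.compile(r"Doc[123]")

# fullmatch() succeeds only when the entire string matches the pattern.
matched_by_alternation = [s for s in candidates if alternation.fullmatch(s)]
matched_by_class = [s for s in candidates if char_class.fullmatch(s)]

print(matched_by_alternation)  # ['Doc1', 'Doc2', 'Doc3']
print(matched_by_class)        # ['Doc1', 'Doc2', 'Doc3']

# Doc\d is broader: it also accepts Doc4, which is not in the set.
print(bool(re.fullmatch(r"Doc\d", "Doc4")))  # True
```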



## Uses of Regular Expressions

**Some important usages of regular expressions are:**

- Check if an input honors a given pattern; for example, we can check whether a value entered in an HTML form is a valid e-mail address
> `Maniteja123@gmail.com`

- Look for occurrences of a pattern in a piece of text; for example, check if either the word "color" or the word "colour" appears in a document with just **one scan**
> `I like Red color and i am wearing a Red colour shirt`

- Extract specific portions of a text; for example, extract the postal code of an address
> `Mr John Smith. 132, My Street, Kingston, New York 12401.`

- Replace portions of text; for example, change every occurrence of "color" to "colour"
> `I like Red colour and i am wearing a Red colour shirt`

- Split a larger text into smaller pieces; for example, split a text at every dot, comma, or newline character
> `myself person1,you are person2`
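
The five uses listed above map onto a handful of functions in Python's `re` module. The following is a minimal sketch using the sample strings from the list; the e-mail pattern is a deliberately simplified assumption, not a complete address validator, and the postal-code pattern assumes a five-digit code at the end of the address.

```python
import re

# 1. Validate: simplified e-mail shape (assumption: letters/digits/dots, one @, one dot).
email_ok = re.fullmatch(r"[\w.]+@\w+\.\w+", "Maniteja123@gmail.com") is not None

# 2. Search: a single scan finds either spelling.
text = "I like Red color and i am wearing a Red colour shirt"
spellings = re.findall(r"colou?r", text)

# 3. Extract: five digits at the end of the address, before an optional dot.
address = "Mr John Smith. 132, My Street, Kingston, New York 12401."
postal_code = re.search(r"(\d{5})\.?$", address).group(1)

# 4. Replace: turn the word "color" into "colour".
replaced = re.sub(r"\bcolor\b", "colour", text)

# 5. Split: break the text at every dot, comma, or newline.
pieces = re.split(r"[.,\n]", "myself person1,you are person2")

print(email_ok)     # True
print(spellings)    # ['color', 'colour']
print(postal_code)  # 12401
print(replaced)     # I like Red colour and i am wearing a Red colour shirt
print(pieces)       # ['myself person1', 'you are person2']
```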

# Meta Characters

- All meta characters: `. ^ $ * + ? { } [ ] \ | ( )`

  1. `.` any character (except the newline character)
  2. `^` starts with, e.g. `^word`
  3. `$` ends with, e.g. `word$`
  4. `*` zero or more occurrences
  5. `+` one or more occurrences
  6. `?` zero or one occurrence
  7. `{}` exactly the specified number of occurrences, e.g. `M{2}`
  8. `[]` a set of characters, e.g. `[a-c]`
  9. `\` signals a special sequence (can also be used to escape special characters), e.g. `\d`
  10. `|` either or, e.g. `apple|iphone`
  11. `()` capture and group
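
Each meta character in the list can be tried on a short string. This is a minimal sketch with made-up inputs, one line per meta character; the variable names are illustrative only.

```python
import re

any_char = re.findall(r"c.t", "cat cot c\nt")      # . does not match the newline
starts = bool(re.search(r"^word", "word play"))    # ^ anchors at the start
ends = bool(re.search(r"word$", "password"))       # $ anchors at the end
zero_or_more = re.findall(r"ab*", "a ab abb")      # * allows zero b's
one_or_more = re.findall(r"ab+", "a ab abb")       # + needs at least one b
optional = re.findall(r"https?", "http https")     # ? makes the s optional
exactly_two = re.findall(r"M{2}", "M MM MMM")      # {2} needs exactly two M's
char_set = re.findall(r"[a-c]", "abcdef")          # [] matches one of a, b, c
escaped = re.findall(r"\.", "a.b.c")               # \. matches a literal dot
either = re.findall(r"apple|iphone", "apple makes the iphone")
groups = re.search(r"(\d+)-(\d+)", "pages 10-20").groups()  # () captures

print(any_char)      # ['cat', 'cot']
print(zero_or_more)  # ['a', 'ab', 'abb']
print(one_or_more)   # ['ab', 'abb']
print(optional)      # ['http', 'https']
print(exactly_two)   # ['MM', 'MM']
print(groups)        # ('10', '20')
```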

# Special Sequences
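- A special sequence is a `\` followed by one of the characters in the list below, and has a special meaning:

  1. `\d` : Matches any decimal digit; this is equivalent to the class `[0-9]`.
  2. `\D` : Matches any non-digit character; this is equivalent to the class `[^0-9]`.
  3. `\s` : Matches any whitespace character, such as a space, newline (`\n`), or tab (`\t`); this is equivalent to the class `[ \t\n\r\f\v]`.
  4. `\S` : Matches any non-whitespace character; this is equivalent to the class `[^ \t\n\r\f\v]`.
  5. `\w` : Matches any alphanumeric (word) character; this is equivalent to the class `[a-zA-Z0-9_]`.
  6. `\W` : Matches any non-alphanumeric character; this is equivalent to the class `[^a-zA-Z0-9_]`.

  For `str` patterns in Python 3 these equivalences are exact only with the `re.ASCII` flag; by default `\d`, `\s`, and `\w` also match Unicode digits, whitespace, and word characters.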
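
The special sequences can be compared side by side on one string. A minimal sketch, assuming a made-up `sample` string; the last two lines illustrate how the `re.ASCII` flag narrows `\w` to ASCII characters.

```python
import re

sample = "Room 42,\tFloor_3\n"

digits = re.findall(r"\d", sample)       # decimal digits
spaces = re.findall(r"\s", sample)       # space, tab, newline
words = re.findall(r"\w+", sample)       # runs of letters, digits, underscore
non_words = re.findall(r"\W", sample)    # everything \w does not match
non_spaces = re.findall(r"\S+", sample)  # runs of non-whitespace

print(digits)      # ['4', '2', '3']
print(spaces)      # [' ', '\t', '\n']
print(words)       # ['Room', '42', 'Floor_3']
print(non_words)   # [' ', ',', '\t', '\n']
print(non_spaces)  # ['Room', '42,', 'Floor_3']

# "caf\u00e9" ends in an accented e, which is a word character only in Unicode mode.
unicode_words = re.findall(r"\w+", "caf\u00e9")
ascii_words = re.findall(r"\w+", "caf\u00e9", flags=re.ASCII)
print(len(unicode_words[0]), ascii_words)  # 4 ['caf']
```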

#### Some examples of sets
 1. `[arn]` Returns a match where one of the specified characters (a, r, or n) is present
 2. `[a-n]` Returns a match for any lower case character, alphabetically between a and n
 3. `[^arn]` Returns a match for any character EXCEPT a, r, and n
 4. `[0123]` Returns a match where any of the specified digits (0, 1, 2, or 3) is present
 5. `[0-9]` Returns a match for any digit between 0 and 9
 6. `[0-5][0-9]` Returns a match for any two-digit number from 00 to 59
 7. `[a-zA-Z]` Returns a match for any character alphabetically between a and z, lower case OR upper case
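
A short sketch of how sets behave in practice, using made-up input strings. Each set matches exactly one character, so two sets in a row are needed for a two-digit number.

```python
import re

listed = re.findall(r"[arn]", "banana")               # only a, r, or n
negated = re.findall(r"[^arn]", "barn owl")           # anything except a, r, n
minutes = re.findall(r"[0-5][0-9]", "07:45:99")       # two digits, 00 to 59
letters = re.findall(r"[a-zA-Z]+", "Regex101 rocks")  # runs of ASCII letters

print(listed)   # ['a', 'n', 'a', 'n', 'a']
print(negated)  # ['b', ' ', 'o', 'w', 'l']
print(minutes)  # ['07', '45']
print(letters)  # ['Regex', 'rocks']
```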

# Getting started with RegEx in Python

The **[re](https://docs.python.org/3/howto/regex.html)** module provides an interface to the regular expression engine, allowing you to **compile regular expressions into objects and then perform matches with them**.

In [1]:
import re

## 1. Compiling Regular Expressions

Regular expressions are **compiled** into `Pattern` objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.


### `re.compile(pattern, flags=0)`

Compile a regular expression pattern, returning a pattern object.

- The regular expression is passed to `re.compile()` as a **string**. 

> Regular expressions are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. 

> Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C.

In [56]:
pattern = re.compile("python", flags=re.I)

In [57]:
pattern

re.compile(r'python', re.IGNORECASE|re.UNICODE)

In [60]:
pattern1="python"

In [61]:
a="i am learning python"

In [62]:
re.findall(pattern1,a)

['python']

- `re.compile()` also accepts an optional `flags` argument, used to enable various special features and syntax variations. [More about flags](http://xahlee.info/python/python_regex_flags.html)

<br>

For example, the flag `re.I` (short for `re.IGNORECASE`) makes the pattern ignore letter case.
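
A short sketch of what `re.I` changes, using a made-up sample string. It also shows that several flags can be combined with `|`; here `re.M` (`re.MULTILINE`) makes `^` match at the start of every line.

```python
import re

text = "Python, PYTHON and python"

case_sensitive = re.findall(r"python", text)
case_insensitive = re.findall(r"python", text, flags=re.I)

print(case_sensitive)    # ['python']
print(case_insensitive)  # ['Python', 'PYTHON', 'python']

# Flags are combined with the bitwise OR operator.
line_start = re.compile(r"^python", re.I | re.M)
print(line_start.findall("Python\nruby\nPYTHON"))  # ['Python', 'PYTHON']
```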

## 2. Performing Matches

So far, we have created a `Pattern` object representing a compiled regular expression with the `re.compile()` function.

Pattern objects have several methods and attributes.

Here is the list of different methods used for performing matches:


<table style="border: 1px solid black; font-size:15px;">
<thead>
<tr>
    <th>Method/Attribute</th>
    <th>Purpose</th>
</tr>
</thead>
    
<tbody>
<tr>
    <td>match()</td>
    <td>Determine if the RE matches at the beginning of the string.</td>
</tr>
    
<tr>
    <td>search()</td>
    <td>Scan through a string, looking for any location where this RE matches.</td>
</tr>

<tr>
    <td>findall()</td>
    <td>Find all substrings where the RE matches, and return them as a list.</td>
</tr>

<tr>
    <td>finditer()</td>
    <td>Find all substrings where the RE matches, and return them as an iterator.</td>
</tr>
</tbody>
</table>
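
The difference between these four methods is easiest to see on a single input. A minimal sketch with an illustrative pattern and string, neither of which is taken from the other examples in this notebook:

```python
import re

pattern = re.compile(r"\d+")
text = "abc 123 def 4567"

at_start = pattern.match(text)              # None: the string starts with letters
first = pattern.search(text)                # first match anywhere in the string
all_matches = pattern.findall(text)         # every match, as a list of strings
spans = [m.span() for m in pattern.finditer(text)]  # Match objects, one by one

print(at_start)       # None
print(first.group())  # 123
print(all_matches)    # ['123', '4567']
print(spans)          # [(4, 7), (12, 16)]
```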



# Functions


`findall()` Returns a list containing all matches

`sub()` Replaces one or many matches with a string

`search()` Returns a Match object if there is a match anywhere in the string

`compile()` Returns a Regex pattern object





In [2]:
import re

#### re.findall()
#### re.findall(pattern, string)
>  Returns a list of all non-overlapping matches of the pattern in the string. Unlike `match()`, it is not restricted to the start of the string.

In [63]:
a="abc1234def"

In [64]:
re.findall(r"\D",a)

['a', 'b', 'c', 'd', 'e', 'f']

In [72]:
for i in re.finditer(r"\D+",a):
    print(i)

<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(7, 10), match='def'>


In [7]:
colleges="""IIT Madras - Indian Institute of Technology
4.6(202)
Fees: ₹ 10.00 Lakh
Salary : ₹ 16.00 Lakh
Not Ranked
Times ' 22
3
The Week ' 21
1
Outlook ' 20
Admissions
Courses & Fees
Placements 0-5
IIT Madras - Indian Institute of Technology
4.1(202)
Fees: ₹ 10.00 Lakh
Salary : ₹ 16.00 Lakh
Not Ranked
Times ' 22
3
The Week ' 21
1
Outlook ' 20
Admissions
Courses & Fees
Placements 0-5
IIT Madras - Indian Institute of Technology
4.4(202)
Fees: ₹ 10.00 Lakh
Salary : ₹ 16.00 Lakh
Not Ranked
Times ' 22
3
The Week ' 21
1
Outlook ' 20
Admissions
Courses & Fees
Placements 0-5"""

In [8]:
#Extracting Salary of colleges

re.findall(r"Salary\s:\s(.*)",colleges)

['₹ 16.00 Lakh', '₹ 16.00 Lakh', '₹ 16.00 Lakh']

In [9]:
#Extracting rating of colleges

re.findall(r"(\d\.\d)\(",colleges)

['4.6', '4.1', '4.4']
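
Both calls above return only the text inside the parentheses rather than the whole match: when a pattern contains capture groups, `re.findall` returns the groups. A small sketch of that behaviour on a made-up one-line `sample` (illustrative, not taken from `colleges`):

```python
import re

sample = "Salary : 16.00 Lakh, rating 4.6(202)"

# No capture group: the whole match is returned.
whole = re.findall(r"Salary\s:\s\d+\.\d+", sample)

# One capture group: only the group's text is returned.
amount = re.findall(r"Salary\s:\s(\d+\.\d+)", sample)

# Two capture groups: each match becomes a tuple of groups.
parts = re.findall(r"(\d)\.(\d)\(", sample)

# (?:...) groups without capturing, so the whole match is returned again.
non_capturing = re.findall(r"(?:\d)\.(?:\d)\(", sample)

print(whole)          # ['Salary : 16.00']
print(amount)         # ['16.00']
print(parts)          # [('4', '6')]
print(non_capturing)  # ['4.6(']
```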

In [10]:
emails="""maniteja@gmail.com
mani_teja@gmail.com
mani.teja@ gmail.com
maniteja1234@gmail.com
maniteja@outlook.com
mani_teja@outlook.com
mani.teja@outlook.com
maniteja1234@outlook.com
maniteja@yahoo.org
mani_teja@yahoo.com
mani.teja@yahoo.in
maniteja1234@yahoo.com
"""

In [12]:
#Extracting email with space after @

re.findall(r"\w+\W\w+\W\s\w+\W\w+",emails)

['mani.teja@ gmail.com']

In [14]:
#Extracting username of all emails

re.findall("(.*)@",emails)

['maniteja',
 'mani_teja',
 'mani.teja',
 'maniteja1234',
 'maniteja',
 'mani_teja',
 'mani.teja',
 'maniteja1234',
 'maniteja',
 'mani_teja',
 'mani.teja',
 'maniteja1234']

In [16]:
#Extracting domain of all emails


re.findall("(gmail|outlook|yahoo)",emails)

['gmail',
 'gmail',
 'gmail',
 'gmail',
 'outlook',
 'outlook',
 'outlook',
 'outlook',
 'yahoo',
 'yahoo',
 'yahoo',
 'yahoo']

In [19]:
#Extracting the top-level domain (com, org, in) of all emails

re.findall(r"\.(com|org|in)",emails)

['com',
 'com',
 'com',
 'com',
 'com',
 'com',
 'com',
 'com',
 'org',
 'com',
 'in',
 'com']

In [20]:
#Extracting emails of outlook 

re.findall(r"(m\w+\W?\w+@outlook\.com)",emails)

['maniteja@outlook.com',
 'mani_teja@outlook.com',
 'mani.teja@outlook.com',
 'maniteja1234@outlook.com']

In [22]:
#Extracting all emails 

email_regex = r'[\w.]+@(?:gmail|yahoo|outlook)\.(?:com|in|org)'
valid_emails = re.findall(email_regex, emails)

In [23]:
valid_emails

['maniteja@gmail.com',
 'mani_teja@gmail.com',
 'maniteja1234@gmail.com',
 'maniteja@outlook.com',
 'mani_teja@outlook.com',
 'mani.teja@outlook.com',
 'maniteja1234@outlook.com',
 'maniteja@yahoo.org',
 'mani_teja@yahoo.com',
 'mani.teja@yahoo.in',
 'maniteja1234@yahoo.com']

In [24]:
phones="""SAMSUNG Galaxy F23 5G (Forest Green, 128 GB)
SAMSUNG Galaxy F04 (Jade Purple, 64 GB)
POCO M3 Pro 5G (Yellow, 128 GB)
MOTOROLA e40 (Carbon Gray, 64 GB)
APPLE iPhone 13 (Blue, 128 GB)
APPLE iPhone 14 (Starlight, 128 GB)
APPLE iPhone 14 (Blue, 128 GB)
MOTOROLA G62 5G (Midnight Gray, 128 GB)
REDMI 10 (Pacific Blue, 64 GB)
REDMI 10 (Caribbean Green, 64 GB)
REDMI Note 11 SE (Cosmic White, 64 GB)
MOTOROLA G32 (Mineral Gray, 64 GB)
MOTOROLA G62 5G (Frosted Blue, 128 GB)
POCO C31 (Royal Blue, 64 GB)
MOTOROLA e40 (Pink Clay, 64 GB)
REDMI 10 (Midnight Black, 64 GB)
SAMSUNG Galaxy F23 5G (Copper Blush, 128 GB)
REDMI Note 12 Pro+ 5G (Obsidian Black, 256 GB)
SAMSUNG Galaxy F04 (Opal Green, 64 GB)
MOTOROLA G32 (Satin Silver, 64 GB)
POCO M4 Pro (Cool Blue, 64 GB)
REDMI 9i Sport (Coral Green, 64 GB)
SAMSUNG Galaxy F23 5G (Forest Green, 128 GB)
POCO M4 Pro (Cool Blue, 128 GB)"""

In [28]:
#Extracting mobile Brands

names=re.findall("([A-Z]{3,})",phones)

In [29]:
names

['SAMSUNG',
 'SAMSUNG',
 'POCO',
 'MOTOROLA',
 'APPLE',
 'APPLE',
 'APPLE',
 'MOTOROLA',
 'REDMI',
 'REDMI',
 'REDMI',
 'MOTOROLA',
 'MOTOROLA',
 'POCO',
 'MOTOROLA',
 'REDMI',
 'SAMSUNG',
 'REDMI',
 'SAMSUNG',
 'MOTOROLA',
 'POCO',
 'REDMI',
 'SAMSUNG',
 'POCO']

In [30]:
len(names)

24

In [31]:
#Extracting storage of mobile phones

storage=re.findall(r"\s(\d+)\sGB",phones)

In [32]:
len(storage)

24

In [33]:
storage

['128',
 '64',
 '128',
 '64',
 '128',
 '128',
 '128',
 '128',
 '64',
 '64',
 '64',
 '64',
 '128',
 '64',
 '64',
 '64',
 '128',
 '256',
 '64',
 '64',
 '64',
 '64',
 '128',
 '128']

In [68]:
colors=re.findall(r"\s\((\w+\s?\w+),",phones)

In [69]:
len(colors)

24

In [70]:
colors

['Forest Green',
 'Jade Purple',
 'Yellow',
 'Carbon Gray',
 'Blue',
 'Starlight',
 'Blue',
 'Midnight Gray',
 'Pacific Blue',
 'Caribbean Green',
 'Cosmic White',
 'Mineral Gray',
 'Frosted Blue',
 'Royal Blue',
 'Pink Clay',
 'Midnight Black',
 'Copper Blush',
 'Obsidian Black',
 'Opal Green',
 'Satin Silver',
 'Cool Blue',
 'Coral Green',
 'Forest Green',
 'Cool Blue']

**re.finditer()**

- `finditer()` returns an iterator of `Match` objects rather than a list, so the matches are read by looping over it.

In [86]:
a=re.finditer(r"\s(\d+)\sGB",phones)

In [84]:
for i in a:
    print(i.group())

 128 GB
 64 GB
 128 GB
 64 GB
 128 GB
 128 GB
 128 GB
 128 GB
 64 GB
 64 GB
 64 GB
 64 GB
 128 GB
 64 GB
 64 GB
 64 GB
 128 GB
 256 GB
 64 GB
 64 GB
 64 GB
 64 GB
 128 GB
 128 GB


In [87]:
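# An iterator is exhausted after one pass, so create a fresh one before looping again
for i in re.finditer(r"\s(\d+)\sGB",phones):
    print(i.group(),i.start(),i.end())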

 128 GB 36 43
 64 GB 77 83
 128 GB 108 115
 64 GB 143 149
 128 GB 173 180
 128 GB 209 216
 128 GB 240 247
 128 GB 280 287
 64 GB 312 318
 64 GB 346 352
 64 GB 385 391
 64 GB 420 426
 128 GB 458 465
 64 GB 488 494
 64 GB 520 526
 64 GB 553 559
 128 GB 597 604
 256 GB 644 651
 64 GB 684 690
 64 GB 719 725
 64 GB 750 756
 64 GB 786 792
 128 GB 830 837
 128 GB 862 869


- `group()` --  returns the text matched by the pattern
- `start()` --  index in the string where the match starts
- `end()`   --  index just past the last matched character
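
These three methods, together with `group(1)` for the first capture group, can be checked on a single line. A minimal sketch using one entry from `phones` as a standalone string (the variable name `line` is only illustrative):

```python
import re

line = "POCO C31 (Royal Blue, 64 GB)"
m = re.search(r"\s(\d+)\sGB", line)

print(repr(m.group()))     # ' 64 GB'  (whole match, including the leading space)
print(m.group(1))          # 64        (text of the first capture group)
print(m.start(), m.end())  # 21 27
print(m.span(1))           # (22, 24)  (start and end of the first group)

# Slicing the original string with start() and end() gives back the match.
print(line[m.start():m.end()] == m.group())  # True
```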

**re.sub()  -- Substitute**

- Replaces every substring that matches the pattern with a replacement string and returns the new string.

- `syntax`:

    - `re.sub(pattern, replacement, string)`
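
Beyond the basic three-argument call, `re.sub` accepts a `count` argument and backreferences such as `\1` in the replacement, and `re.subn` also reports how many replacements were made. A minimal sketch on made-up strings:

```python
import re

masked = re.sub(r"\d", "#", "a1b2c3")                     # replace every digit
first_only = re.sub(r"\d", "#", "a1b2c3", count=1)        # replace only the first
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")   # reuse captured groups
cleaned, n = re.subn(r"\d", "", "a1b2c3")                 # result plus a count

print(masked)      # a#b#c#
print(first_only)  # a#b2c3
print(swapped)     # host@user
print(cleaned, n)  # abc 3
```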

In [89]:
sub= "Inno123@456matics"

In [91]:
#Replacing all digits and special characters with an empty string

re.sub("[^a-zA-Z]","",sub)

'Innomatics'

In [217]:
re.sub(r"\d","","a1b2c3")

'abc'

In [102]:
sub1="""innomatic@ Re!earch Lab#"""

In [104]:
#Replacing all special characters with 's'
re.sub(r"[^a-zA-Z\s]",'s',sub1)

'innomatics Research Labs'

In [110]:
sub2="""maniteja@gmail.com
mani_teja@gmail.com
mani.teja@ gmail.in
maniteja1234@gmail.com
maniteja@outlook.com
mani_teja@outlook.com
mani.teja@outlook.com
maniteja1234@outlook.com
maniteja@yahoo.org
mani_teja@yahoo.com
mani.teja@yahoo.in
maniteja1234@yahoo.com
"""

In [113]:
#Replacing the domain of every .in email with outlook

re.sub(r"\w+\.in","outlook.in",sub2).split("\n")

['maniteja@gmail.com',
 'mani_teja@gmail.com',
 'mani.teja@ outlook.in',
 'maniteja1234@gmail.com',
 'maniteja@outlook.com',
 'mani_teja@outlook.com',
 'mani.teja@outlook.com',
 'maniteja1234@outlook.com',
 'maniteja@yahoo.org',
 'mani_teja@yahoo.com',
 'mani.teja@outlook.in',
 'maniteja1234@yahoo.com',
 '']

**re.search()**

- `search()` scans through the string for the first location where the pattern matches.

- It returns a `Match` object for that first match only, or `None` if nothing matches.

In [97]:
a="a1b2c3"

In [98]:
aa=re.search(r"\d",a)

In [99]:
aa                     #output--  first single match

<re.Match object; span=(1, 2), match='1'>

In [100]:
aa.group()

'1'
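
Because `re.search` returns `None` when nothing matches, calling `.group()` directly on the result can raise an `AttributeError`. A minimal sketch of the usual guard; `first_digit` is a hypothetical helper written for this example, not part of the `re` module.

```python
import re

def first_digit(text):
    """Return the first digit in text, or None when there is none."""
    match = re.search(r"\d", text)
    # search() returns None on failure, so check before calling group().
    return match.group() if match else None

print(first_digit("a1b2c3"))          # 1
print(first_digit("no digits here"))  # None
```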

# Using Regular Expressions in Pandas (Data Cleaning)

In [114]:
import pandas as pd

In [120]:
df=pd.read_csv(r"C:\Users\HP\Downloads\Data Analysis @\text.csv",names=["Label","text"])

In [121]:
df

| | Label | text |
|---|---|---|
| 0 | 0 | im sooo sick today...so no emery concert tonight |
| 1 | 0 | @Freebies4Mom I haven't been able to access it... |
| 2 | 0 | won't losing you |
| 3 | 0 | @GeminiTwisted I can't eat egg whites like @Da... |
| 4 | 0 | At Home Writers Block Has Set In |
| 5 | 0 | Burnt myself on my old friend the mini spring ... |
| 6 | 0 | @tommcfly http://twitpic.com/64l1e - wow thats... |
| 7 | 0 | @seneca That's a crying shame. I never liked ... |
| 8 | 0 | @kreatture there are a few in Halifax... Ex-na... |
| 9 | 0 | Sstill happy but my feet REALLY hurt. Wish I h... |


In [122]:
df["text"]

0     im sooo sick today...so no emery concert tonight 
1     @Freebies4Mom I haven't been able to access it...
2                                     won't losing you 
3     @GeminiTwisted I can't eat egg whites like @Da...
4                     At Home Writers Block Has Set In 
5     Burnt myself on my old friend the mini spring ...
6     @tommcfly http://twitpic.com/64l1e - wow thats...
7     @seneca That's a crying shame.  I never liked ...
8     @kreatture there are a few in Halifax... Ex-na...
9     Sstill happy but my feet REALLY hurt. Wish I h...
10    We are a lil under 11 hrs away from OTH's seas...
11    I just got new pants today . . . I just spille...
12    @luke not getting enough sleep and letting emo...
13    @nnorafiza @Phee78 Just read your tweets about...
14    @JamieOber Now you know, that's not what I wan...
15    @hazelcullen it went awful my monologues went ...
16    said goodbye to the Baha'is tonight     One go...
17    @celebrian then you, vanessa, and chelsea 

In [124]:
#Data Cleaning using Regex

df["text"].replace("[^a-zA-Z]"," ",regex=True)

0     im sooo sick today   so no emery concert tonight 
1      Freebies Mom I haven t been able to access it...
2                                     won t losing you 
3      GeminiTwisted I can t eat egg whites like  Da...
4                     At Home Writers Block Has Set In 
5     Burnt myself on my old friend the mini spring ...
6      tommcfly http   twitpic com   l e   wow thats...
7      seneca That s a crying shame   I never liked ...
8      kreatture there are a few in Halifax    Ex na...
9     Sstill happy but my feet REALLY hurt  Wish I h...
10    We are a lil under    hrs away from OTH s seas...
11    I just got new pants today       I just spille...
12     luke not getting enough sleep and letting emo...
13     nnorafiza  Phee   Just read your tweets about...
14     JamieOber Now you know  that s not what I wan...
15     hazelcullen it went awful my monologues went ...
16    said goodbye to the Baha is tonight     One go...
17     celebrian then you  vanessa  and chelsea 
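
The replacement above leaves runs of spaces where digits and punctuation used to be. One possible follow-up, sketched here in plain Python on two made-up tweets so that it runs without the CSV file, is to collapse the whitespace afterwards; `clean` is a hypothetical helper, and the same two patterns could be applied to a column with `Series.replace(..., regex=True)`.

```python
import re

# Made-up sample tweets standing in for df["text"].
tweets = ["@seneca That's a crying shame.", "We are 11 hrs away!!"]

def clean(text):
    """Keep letters only, then collapse repeated whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return re.sub(r"\s+", " ", letters_only).strip()

cleaned = [clean(t) for t in tweets]
print(cleaned)  # ['seneca That s a crying shame', 'We are hrs away']
```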