# Python (Data type vs data structure)

#### Data Type
  - number (int, float)
  - text (***str***)
  - boolean (yes/no, true/false, 1/0)

#### Data Structure
  - list
  - set
  - dictionary
  - array

### Define a string

In [1]:
string = '1'
print (string)
print (type(string))
string = 1
print (string)
print (type(string))
string = 'abc123'
print (string)
print (type(string))

1
<class 'str'>
1
<class 'int'>
abc123
<class 'str'>


### Functions of string

In [3]:
print (string.capitalize())
print (string.find('b'))
print (string.find('A'))
print (string.endswith('3'))
print (string.endswith('4'))
print (string.capitalize())
print (string.isnumeric())
print (string.isalpha())
print (string.isalnum())
print (string.split(' '))
print ('hello world'.split())

Abc123
1
-1
True
False
Abc123
False
False
True
['abc123']
['hello', 'world']


### Bigger string

In [4]:
string = 'I bought10 packet of chips.\nI bought 5 packets of nachos.\nI bought 100 toffess.\nI bought 2 bottles of cold drinks.'
print (string)
print (string.split())

I bought10 packet of chips.
I bought 5 packets of nachos.
I bought 100 toffess.
I bought 2 bottles of cold drinks.
['I', 'bought10', 'packet', 'of', 'chips.', 'I', 'bought', '5', 'packets', 'of', 'nachos.', 'I', 'bought', '100', 'toffess.', 'I', 'bought', '2', 'bottles', 'of', 'cold', 'drinks.']


In [4]:
only_numbers = []
for element in string.split():
    if element.isnumeric():
        only_numbers.append(element)
print (only_numbers)

['5', '100', '2']


#### Let's define a function so that we can reuse it

In [5]:
def extract_num(string):
    only_numbers = []
    for element in string.split():
        if element.isnumeric():
            only_numbers.append(element)
    return (only_numbers)

In [6]:
extract_num(string)

['5', '100', '2']

# Now, this user-defined function can be replaced by a built-in Python library

# REGEX

***Regex (regular expressions), available in Python through the built-in `re` module, is a powerful tool for searching an input string for a pattern and manipulating strings effectively***

In [1]:
import re #regex python inbuilt library

## Syntax
### re.method_name(pattern, string)
### pattern ---> r''


### Understanding some metacharacters/reserved characters

#### Character Class
The first metacharacters we’ll look at are '[' and ']'. They’re used for specifying a character class, which is a set of characters that you wish to match.
e.g. [a-c] matches any single character from a to c.
This can also be written as [abc].
Similarly, a class of digits is defined using [0-9].

### Some sample customized character sets

#### [a-z]
#### [A-Z]
#### [A-Za-z]
#### [A-z] ---> [A-Z], [a-z] and the six ASCII characters between them ([ \ ] ^ _ `); use [A-Za-z] for letters only
#### [a-z0-9]
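
The range `[A-z]` follows ASCII order, so it covers more than the letters. A minimal sketch of the difference, using a made-up sample string:

```python
import re

sample = 'aZ_^[]'  # hypothetical input mixing letters and punctuation

# [A-z] spans ASCII codes 65 to 122, which includes [ \ ] ^ _ ` as well as letters
loose = re.findall(r'[A-z]', sample)

# [A-Za-z] matches letters only
strict = re.findall(r'[A-Za-z]', sample)

print(loose)   # ['a', 'Z', '_', '^', '[', ']']
print(strict)  # ['a', 'Z']
```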

# Method 1

#### re.findall(pattern, string)
#### returns a list of all the matches

In [5]:
import re

In [10]:
string = 'abcdef'
pattern = r'[a-c]'
re.findall(pattern, string)

['a', 'b', 'c']

In [11]:
string = 'a b c d e f'
pattern = r'[a-z]'
re.findall(pattern, string)

['a', 'b', 'c', 'd', 'e', 'f']

In [12]:
string = 'a b c d e f '
pattern = r'[A-Z]'
re.findall(pattern, string)

[]

In [13]:
string = 'a b c 1 2 3'
pattern = r'[0-2]'
re.findall(pattern, string)

['1', '2']

In [14]:
string = 'a b c 1 2 3'
pattern = r'[0-9]'
re.findall(pattern, string)

['1', '2', '3']

In [7]:
string = 'aab c 1 2 3'
pattern = r'[a-z][a-z]'
re.findall(pattern, string)

['aa']

In [12]:
string1 = 'abcd\n123'
string2 = r'abcd\n123'
pattern = '[a-z][a-z]'
#print(re.findall(pattern, string1))
#print(re.findall(pattern, string2))
print(string1, string2)

abcd
123 abcd\n123


In [16]:
string = 'abcd123'
pattern = '[a-z][a-z]'
re.findall(pattern, string)

['ab', 'cd']

In [19]:
string = 'abcd123'
pattern = r'\w{2}'
re.findall(pattern, string)

['ab', 'cd', '12']

In [18]:
string = 'abc123'
pattern = '[a-z][a-z][a-z]'
re.findall(pattern, string)

['abc']

### Now, we introduce some more reserved characters
  - \+ (one or more)
  - \* (zero or more)
  - \? (zero or one)

In [5]:
import re

In [7]:
string = 'a123 def456'
pattern = r'[a-z]+'
re.findall(pattern, string)

['a', 'def']

In [20]:
string = 'abc123456$ 44 '
pattern = r'\D+'
re.findall(pattern, string)

['abc', '$ ', ' ']

In [24]:
string = '123abc deF456 1'
pattern = '[a-z]?[0-9]+'
re.findall(pattern, string,re.IGNORECASE)

['123', 'F456', '1']

In [30]:
string = '123 def456 ghi789'
pattern = r'[A-Za-z]+[0-9]+'
re.findall(pattern, string)

['def456', 'ghi789']

In [43]:
print (re.search('ab*','abbb').group())
print (re.search('ab?','abbb').group())

abbb
ab
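
With the `+` quantifier, the earlier `extract_num` helper can be approximated with `re.findall`. This is a sketch on a shortened copy of the shopping text, not a drop-in replacement: a plain digit pattern also picks up the `10` glued to `bought10`, which the `split()` and `isnumeric()` version skips. Checking each token with `re.fullmatch` is one way to stay closer to the original behaviour.

```python
import re

# Shortened copy of the shopping text used earlier
text = 'I bought10 packet of chips.\nI bought 5 packets of nachos.'

# Every run of digits, including the 10 attached to "bought"
all_digit_runs = re.findall(r'[0-9]+', text)

# Only whitespace-separated tokens made up entirely of digits
whole_tokens = [tok for tok in text.split() if re.fullmatch(r'[0-9]+', tok)]

print(all_digit_runs)  # ['10', '5']
print(whole_tokens)    # ['5']
```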


#### \d ---> [0-9]   ---> digit set
#### \D ---> negation([0-9]) ---> non-digit set

#### \w ---> [a-zA-Z0-9_] -----> word-character set (alphanumerics plus underscore)
#### \W ---> negation([a-zA-Z0-9_]) ---> non-word-character set

#### \s ---> [ ,tab,\n,\r,\f,\v] -------> whitespace character set
#### \S ---> negation([ ,tab,\n,\r,\f,\v]) -------> non-whitespace character set

#### . -----> matches everything except new line character
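
The shorthand classes above can be tried side by side. A small sketch on a made-up string:

```python
import re

sample = 'id_7 \t ok\n'  # hypothetical string with a tab and a newline

digits = re.findall(r'\d', sample)      # digits only
word_chars = re.findall(r'\w', sample)  # letters, digits and underscore
spaces = re.findall(r'\s', sample)      # space, tab and newline
dot = re.findall(r'.', 'a\nb')          # the dot skips the newline

print(digits)      # ['7']
print(word_chars)  # ['i', 'd', '_', '7', 'o', 'k']
print(spaces)      # [' ', '\t', ' ', '\n']
print(dot)         # ['a', 'b']
```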

#### Combining character classes

In [44]:
string = 'a b c 1 2 3 _ $'
pattern = r'\W'
re.findall(pattern, string)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', '$']

In [76]:
re.findall(r'([a-z]{2})\d+_', 'abc1234575865_ajbkb')

['bc']

### Braces restrict how many times a character (or character set) has to repeat

##### {min_number,max_number} -------> at least min_number and at most max_number repetitions
##### {fixed number} -------> exactly this many repetitions
##### {min_number,} ------> at least min_number repetitions, with no upper limit
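
The three brace forms can be compared on one input. A minimal sketch with a made-up string:

```python
import re

sample = 'a1 b22 c333 d4444'  # hypothetical input

exactly_two = re.findall(r'\d{2}', sample)    # exactly two digits per match
one_or_two = re.findall(r'\d{1,2}', sample)   # one or two digits per match
three_plus = re.findall(r'\d{3,}', sample)    # three or more digits per match

print(exactly_two)  # ['22', '33', '44', '44']
print(one_or_two)   # ['1', '22', '33', '3', '44', '44']
print(three_plus)   # ['333', '4444']
```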

In [39]:
stri1 = '1st day of month February 5th of year 2017'
stri2 = '16th day of month February of year 2040'
pattern= r"(\d{1,2}[a-z]{2})[\w\s]+([A-Z][A-Za-z]+)[\w\s]+(\d{4})"

In [38]:
re.findall(r"(\d{1,2}[a-z]{2})",stri1)

['1st', '5th']

In [40]:
re.findall(pattern, stri1)

[('1st', 'February', '2017')]

In [41]:
for string in [stri1,stri2]:
    print(' '.join(re.findall(pattern,string)[0]))
# joins the three captured groups, e.g. '1st February 2017'

1st February 2017
16th February 2040


In [89]:
print ('zero or more b', re.search(r"(ab*)","ababbb").group())
print ('one or more b', re.search(r"(ab+)","abbbbbb").group())
print ('zero or one b', re.search(r"(ab?)","abbbabab").group())
print ('followed by three b', re.search(r"(ab{3})","xxabbbbyxab").group())
print ('followed by two to three b', re.search(r"(ab{2,3})","xyabbbabab").group())


zero or more b ab
one or more b abbbbbb
zero or one b ab
followed by three b abbb
followed by two to three b abbb


#####  ( ) ---> capture group: extracts the required information from the matched string

##### \ -------> Escape character

##### It changes a metacharacter in the regex back to its literal meaning.
##### The backslash can also be followed by various characters to signal special sequences (such as \d or \w). You can remove the special meaning of a metacharacter by preceding it with a '\'. For example, you can search for a literal "+" using "\\+".
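
A short sketch of escaping in practice, on a made-up expression string. `re.escape` is the standard-library helper that adds the backslashes for you when the text to search for is a plain literal.

```python
import re

expr = '2+2=4 and 2*2=4'  # hypothetical input

# Unescaped, + is a quantifier: '2+2' means "one or more 2s, then a 2"
as_quantifier = re.findall(r'2+2', expr)

# Escaped, \+ matches a literal plus sign
as_literal = re.findall(r'2\+2', expr)

# re.escape builds the escaped pattern for arbitrary literal text
escaped = re.escape('2*2')
star = re.findall(escaped, expr)

print(as_quantifier)  # []
print(as_literal)     # ['2+2']
print(star)           # ['2*2']
```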


In [90]:
mob_no1 = 'xyDz@@@mmm.co.in\n\n jhkjhljl'
mob_no2 = 'xyzmmm.co.in'
mob_no3 = 'uiadhkAZahks@@mmm.com is my mail id'

In [95]:
re.findall(r'\d+\+\d+=\d+','22+354=532315')

['22+354=532315']

In [97]:
re.findall('A','Aeroplane')

['A']

In [98]:
re.findall(r"(\w+@)@[A-Za-z]+\.com",mob_no3)

['uiadhkAZahks@']

In [43]:
amount = 'twuytyu has transferred rs.34000 in your Kotak Bank in the acc num 9997866675275. Updated balance is rs.50000'

In [45]:
re.findall(r'rs.(\d)+',amount)

['0', '0']

In [46]:
re.findall(r'rs.(\d)',amount)

['3', '5']

In [48]:
re.findall(r'rs.(\d+)',amount) ### actual required amount

['34000', '50000']
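
Two details explain the three results above. When a pattern contains a group, `findall` returns the group's content, and a repeated one-character group such as `(\d)+` keeps only its last repetition. Also, the unescaped `.` after `rs` matches any character, not just a dot. A sketch on a made-up message:

```python
import re

msg = 'paid rs.120 and rsx45'  # hypothetical message

# The group holds one digit and is repeated, so only the last digit survives
last_digit_only = re.findall(r'rs\.(\d)+', msg)

# The quantifier is inside the group, so the whole number is captured
full_amount = re.findall(r'rs\.(\d+)', msg)

# An unescaped dot also accepts the "x" in "rsx45"
any_char_dot = re.findall(r'rs.(\d+)', msg)

print(last_digit_only)  # ['0']
print(full_amount)      # ['120']
print(any_char_dot)     # ['120', '45']
```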

In [70]:
string3='24-01-2009'
re.findall(r'([\d]{1,2})-(\d{1,2})-([0-9]{2,4})',string3)

#The day is 24th, month is 01, year is 2009

[('24', '01', '2009')]

In [72]:
p=r'([\d]{1,2})-(\d{1,2})-([0-9]{2,4})'
res = re.match(p,string3)

In [73]:
res.group()

'24-01-2009'

In [76]:
res.group(1),res.group(2),res.group(3)

('24', '01', '2009')

In [108]:
input1 = ['Im in      Bangalore','Im in Pune','im in Chennai']
for i in input1:
    print (re.findall(r'im in\s+([a-zA-Z]+)',i,re.IGNORECASE))

['Bangalore']
['Pune']
['Chennai']


### Let's introduce some more reserved characters

- \^ (two functions: anchors the match at the beginning of the string, or negates a character class)
- \$ (one function: anchors the match at the end of the string)

beginning of a string ------>  ^ outside the character class ^[]
negation of class     ------>  ^ as the first character inside the character class [^]
end of a string       ------>  $ outside the character class []$

In [30]:
sam = 'alagammai_0491 is my id'
pattern = r'^[a-z]+_\d+'

re.findall(string=sam,pattern=pattern)

['alagammai_0491']

In [28]:
print(re.findall(r'([^aeiou])','Bngalore',re.IGNORECASE))

['B', 'n', 'g', 'l', 'r']


In [111]:
print(re.findall(r'([^aeiou]+)','Bngalore',re.IGNORECASE))

['Bng', 'l', 'r']


In [121]:
re.findall(r'^[a-z]+_\d+$','alagammai_0491')

['alagammai_0491']

In [116]:
print(re.findall(r'[aeiou]+$','ungalaejhskhuihiwioe',re.IGNORECASE))

['ioe']


##### finditer ---> returns an iterator of match objects, giving the span (start and end) of every match that findall would return

In [129]:
re.findall(r"\d{2}",'24-01-2009') #finds all the substrings that match the pattern

['24', '01', '20', '09']

In [130]:
#instead of returning the matched substrings like findall,
#finditer yields match objects, from which we read the start and end positions
for i in re.finditer(r"\d{2}",'24-01-2009'):
    print (i.start(),':',i.end())

0 : 2
3 : 5
6 : 8
8 : 10
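
Each item that `finditer` yields is a match object, so the matched text is available next to its position. A minimal sketch on the same date string:

```python
import re

# Collect the matched text together with its (start, end) span
pairs = [(m.group(), m.span()) for m in re.finditer(r'\d{2}', '24-01-2009')]

print(pairs)  # [('24', (0, 2)), ('01', (3, 5)), ('20', (6, 8)), ('09', (8, 10))]
```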


# Match

### re.match(pattern,string)

In [8]:

pattern = r"([a-z]+)([0-9]+)"
string1 = "asd1213 xyz678"
string2 = "Basd123angalore"

In [3]:
match1 = re.match(pattern,string1)
match1

<re.Match object; span=(0, 7), match='asd1213'>

In [7]:
print(re.match(pattern,string2))

None

In [24]:
type(match1)

re.Match

In [25]:
match1.group()

'asd1213'

In [19]:
print (match1.group(1))

asd


In [20]:
print (match1.group(2))

1213


In [5]:
match2 = re.match(pattern,string2)
print (match2.group()) 
#raises an error: match only checks the beginning of the string, where 'B' does not fit [a-z], so it returns None. So, we use search instead

AttributeError: 'NoneType' object has no attribute 'group'

In [33]:
type(match2)

NoneType
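
Because `re.match` returns `None` when nothing matches, calling `.group()` directly is fragile. One common guard is sketched below; the helper name `match_or_none` is made up for illustration.

```python
import re

def match_or_none(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    m = re.match(pattern, text)
    if m is None:          # no match at the beginning of the string
        return None
    return m.group()

pattern = r'([a-z]+)([0-9]+)'
found = match_or_none(pattern, 'asd1213 xyz678')
missing = match_or_none(pattern, 'Basd123angalore')

print(found)    # asd1213
print(missing)  # None
```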

# Search 

### re.search(pattern,string) 

In [10]:
string2 = 'AHSGJskjahdkadk23123 POIcvfsjdhfds21876437'

In [18]:
pattern = r"[a-z]+[0-9]+"

In [19]:
match2 = re.search(pattern,string2)

In [16]:
match_match = re.match(pattern,string2)

In [17]:
match_match.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [22]:
match2.group()

'skjahdkadk23123'

In [23]:
match3 = re.findall(pattern,string2)
match3

['skjahdkadk23123', 'cvfsjdhfds21876437']

In [59]:
pattern = r"([a-z]+)([0-9]+)"
match2 = re.search(pattern,string2)  #search again so that match2 carries the two groups

In [43]:
pattern

'([a-z]+)([0-9]+)'

In [36]:
match2

<re.Match object; span=(5, 20), match='skjahdkadk23123'>

In [37]:
type(match2)

re.Match

In [38]:
match2.span()[0],match2.span()[1]

(5, 20)

In [39]:
match2.group()

'skjahdkadk23123'

In [146]:
match2.group(0)

'skjahdkadk23123'

In [147]:
match2.group(1)

'skjahdkadk'

In [148]:
match2.group(2)

'23123'

In [156]:
string = '123 def456 ghi789'
pattern = r'([A-Za-z]+)([0-9]+)'
match = re.search(pattern, string)
match_2 = re.match(pattern, string)
print (type(match))
print (type(match_2))
print (match)
print (match_2)

<class 're.Match'>
<class 'NoneType'>
<re.Match object; span=(4, 10), match='def456'>
None


In [150]:
print (match.group(0))
print (match.group(1))
print (match.group(2))

def456
def
456


In [157]:
print(re.search(r'([^aeiou]+)','eiuBngalore',re.IGNORECASE))
print(re.match(r'([^aeiou]+)','eiuBngalore',re.IGNORECASE))
print(re.match(r'([^aeiou]+)','Bngalore',re.IGNORECASE))

<re.Match object; span=(3, 6), match='Bng'>
None
<re.Match object; span=(0, 3), match='Bng'>


In [158]:
pattern1 = r""
v1=re.search(pattern1,string3)
v1.group()  #an empty pattern matches the empty string at position 0

''

## Sub, Subn, Split

### re.sub(pattern,string_replacing_with,string_to_be_replaced) -----> substitutes every match of the pattern
### re.subn(pattern,string_replacing_with,string_to_be_replaced) -----> same substitution, returned along with the number of substitutions made
### re.split(pattern,string) -----> splits the string at every match of the pattern

In [162]:
re.sub('[A-Za-z]+', 'ALPHABETS', 'abc123 kjhkhk oipiopi 234asfdf', count=3)

'ALPHABETS123 ALPHABETS ALPHABETS 234asfdf'

In [163]:
re.subn('[A-Za-z]+', 'ALPHABETS', 'abc123 kjhkhk oipiopi 234asfdf', count=2)

('ALPHABETS123 ALPHABETS oipiopi 234asfdf', 2)

In [164]:
#To convert a date from DD:MM:YYYY to DD-MM-YYYY
year="Today's date is 30:09:2016"
pattern2=r"(:)"   
#searches for : everywhere and replaces it with -
re.findall(pattern2,year)

[':', ':']

In [165]:
re.sub(pattern2,"-",year)

"Today's date is 30-09-2016"

In [166]:
re.subn(pattern2,"-",year)

("Today's date is 30-09-2016", 2)

In [167]:
text = "He was carefully disguised but captured quickly by police."

pattern3=r"(\w+ly)" 
print ('the substituted sentence is ', re.sub(pattern3,"____",text))
print ('the adverbs in the sentence are', re.findall(pattern3,text))

the substituted sentence is  He was ____ disguised but captured ____ by police.
the adverbs in the sentence are ['carefully', 'quickly']


In [168]:
re.split(r'\d+','uygjgjg45456667hkhkjhk7665354390980nmvhjkhj')

['uygjgjg', 'hkhkjhk', 'nmvhjkhj']
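
The replacement string passed to `re.sub` can also refer back to captured groups with `\1`, `\2` and so on. A minimal sketch on a made-up `key=value` string:

```python
import re

pairs_text = 'a=1 b=2'  # hypothetical input

# \1 is the text captured by the first group, \2 by the second
swapped = re.sub(r'(\w+)=(\w+)', r'\2=\1', pairs_text)

# subn returns the same result plus the number of substitutions
swapped_n = re.subn(r'(\w+)=(\w+)', r'\2=\1', pairs_text)

print(swapped)    # 1=a 2=b
print(swapped_n)  # ('1=a 2=b', 2)
```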

# Practice Questions

###### Exercise 1 
    
Write a Python program to remove leading zeros from an IP address. 
Input: 216.0008.094.196 
Output : 216.8.94.196

###### Exercise 2

Write a Python program to convert a date of yyyy-mm-dd format to dd-mm-yyyy format.

Input: 2026-01-02 


Output : 02-01-2026

###### Exercise 3:

Zara has a text article and wants to know the important context of the text. Write a regex pattern that returns all the important words from the text.
Note: According to Zara, important words are the words enclosed in quotes (single or double).

E.g. Input : "Python", 'PHP', "Java" are the important languages that are used nowadays
Output : ['Python', 'PHP', 'Java']


###### Exercise 4:

Write a regular expression that finds all IP addresses listed in an input text and replaces each of them with a new string.

E.g. Input : text = 'IPs : 173.254.28.78 or 167.81.178.97 are the IPs listed'
             to_replace = '127.0.0.1'

Output : IPs : 127.0.0.1 or 127.0.0.1 are the IPs listed


###### Exercise 5 :

Find all the phone numbers in a given text using a single regex 

Inputs :                                  
Number  1 : 000-002-08-5678               
Numbers are : 78-7328                     
More numbers : +91-02-008-7892    

Outputs:
000-002-08-5678
78-7328
+91-02-008-7892


Assume that the codes cannot exceed 3 digits.

###### Exercise 6 :
   
Extract dates from the following text

On May 13, 1998, at 15:45 hours, India secretly conducted a series of underground nuclear tests with five bombs in Pokhran, Rajasthan. Although this was not the first time the country was testing its nuclear weapons (the first successful test took place in 1974 under the codename “Smiling Buddha”), this one was certainly the most memorable if one takes into consideration the sheer effect it had on its states and neighbouring countries. On May 15, 1998, shortly after the detonation of all five warheads, then Prime Minister Atal Bihari Vajpayee declared India a full-fledged nuclear state.
  
  Ans:
May 13, 1998 
May 15, 1998


###### Exercise 7:

Extract all the proper nouns from the text. Here, proper nouns are words that are either fully capitalised or have their first character capitalised.

The President of the United States (POTUS /ˈpoʊtəs/ POH-təs)[note 2] is the head of state and head of government of the United States of America. The president directs the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces.

Ans :
The
President
United
States
POTUS
POH
United
States
America
The
United
States
Armed
Forces
