## Agenda

*  Regular Expressions
*  Python PIP
*  Python JSON
*  Pickling

## Regular Expressions Regex

#### Why Regex?

In [160]:
text = "This is a string with some email addresses: johndoe@example.com, janedoenew@example.org"

In [47]:
import re

#### Introduction
Regular expressions (regex) are a powerful tool for finding patterns in text. They can be used to extract information from text, to validate data, and to perform a variety of other tasks.

In Python, regular expressions are implemented in the re module. The re module provides a number of functions and constants to work with regular expressions.

#### Basics of Regular Expressions
A regular expression is a sequence of characters that defines a pattern. The pattern can be used to match a specific string or a set of strings.

The basic components of a regular expression are:

*  Characters: Any character can be used in a regular expression.
*  Metacharacters: Metacharacters are special characters that have special meaning in regular expressions.
*  Quantifiers: Quantifiers are used to specify the number of times a character or group of characters can occur in a pattern.


##### Metacharacters
The following are some of the most common metacharacters:

*  .: Matches any character.
*  +: Matches one or more occurrences of the preceding character or group of characters.
*  ?: Matches zero or one occurrence of the preceding character or group of characters.
*  [: Matches any character in the specified set.
*  ]: Closes a set.
*  \: Escapes the next character, making it a literal character.


##### Quantifiers
The following are some of the most common quantifiers:

*  {n}: Matches exactly n occurrences of the preceding character or group of characters.
*  {n,m}: Matches n to m occurrences of the preceding character or group of characters.
*  {n,}: Matches n or more occurrences of the preceding character or group of characters.

##### Examples

Here are some examples of regular expressions:

*  .*: Matches any string.
*  [a-z]+: Matches any string of one or more lowercase letters.
*  \d: Matches any digit.
*  \w: Matches any alphanumeric character.
*  \s: Matches any whitespace character.
*  ^: Matches the beginning of a string.
*  $: Matches the end of a string.


In [3]:
import re

In [58]:
print("\nHello")


Hello


In [52]:
text = "This is a string with some numbers: 12345"

In [73]:
pattern = '\d'
matches = re.findall(pattern, text)
# print(type(matches))
print(matches)


['1', '2', '3', '4', '5']


In [76]:
# pattern2= 'is'
# matches = re.finditer(pattern2,text)

pattern = re.compile('\d+')
matches = pattern.finditer(text)

# print(matches)
# print(type(matches))

for match in matches:
    print(match)
    # print(type(match))
    # print(match.start(),match.end())

<re.Match object; span=(36, 41), match='12345'>


In [78]:
text = "This is a string with the word 'the' in it."

pattern = 'the'
matches = re.finditer(pattern, text)
# print(matches)

for match in matches:
  print(match)
  # print(match.start(), match.end())


<re.Match object; span=(22, 25), match='the'>
<re.Match object; span=(32, 35), match='the'>


In [79]:
text = "This is a string with some punctuation."
pattern = '.'
matches = re.findall(pattern, text)

print(matches)

# . Matches any character.


['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g', ' ', 'w', 'i', 't', 'h', ' ', 's', 'o', 'm', 'e', ' ', 'p', 'u', 'n', 'c', 't', 'u', 'a', 't', 'i', 'o', 'n', '.']


In [82]:
print(r'\n')

\n


In [92]:
text = "is is is a string with some punctuation."
pattern = r'\bis'
matches = re.findall(pattern, text)
print(matches)

['is', 'is', 'is']


In [96]:
text = "This is a string with some punctuations"
pattern = 's'
matches = re.findall(pattern, text)
print(matches)

['s', 's', 's', 's', 's']


In [None]:
text = "This is a string with some punctuation 123."
pattern = '[a-z,A-Z]+'
matches = re.findall(pattern, text)
print(matches)

['This', 'is', 'a', 'string', 'with', 'some', 'punctuation']


In [None]:
text = "This is a string with some punctuation 123."
pattern = '[a-z,A-Z,0-9]+'
matches = re.findall(pattern, text)
print(matches)

['This', 'is', 'a', 'string', 'with', 'some', 'punctuation', '123']


In [103]:
text = "This is a string with some punctuation."
pattern = '\.'
matches = re.findall(pattern, text)
print(matches)


pattern = '\.'
matches = re.finditer(pattern, text)

for match in matches:
  print(match)


['.']
<re.Match object; span=(38, 39), match='.'>


In [109]:
text = '''This is  an example of multi line text of datatype String
1st line
2nd line
3rd line
4th line
conclusion'''

# pattern = '^This'
# matches = re.findall(pattern, text)
# print(matches)

pattern = '^This'
matches = re.finditer(pattern, text)

for match in matches:
  print(match)

<re.Match object; span=(0, 4), match='This'>


#### Write regex code to fetch only those lines that is starting with a numeric value in the below multi line text string

text = '''This is  an example of multi line text of datatype String

1st line

2nd line

3rd line

4th line

conclusion'''

Expected Output: 

1st line

2nd line

3rd line

4th line


Hint: Use a loop to consider one line (of this multi-line text string) at a time

In [117]:
text = "This is a string with some punctuation."
# pattern = 'punctuation.$'
# matches = re.findall(pattern, text)
# print(matches)


pattern = 'punctuation.$'
matches = re.finditer(pattern, text)

for match in matches:
  print(match)

<re.Match object; span=(27, 39), match='punctuation.'>


In [119]:
# text = '''This is a string. with some punctuation.
# This is a string with some punctuation'''
text = 'This is. a string. with some punctuation.'
# pattern = '.*(?!.)$'

# pattern = ''
# matches = re.findall(pattern, text)
# print(matches)

pattern = ''
matches = re.finditer(pattern, text)

for match in matches:
  print(match)

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(1, 1), match=''>
<re.Match object; span=(2, 2), match=''>
<re.Match object; span=(3, 3), match=''>
<re.Match object; span=(4, 4), match=''>
<re.Match object; span=(5, 5), match=''>
<re.Match object; span=(6, 6), match=''>
<re.Match object; span=(7, 7), match=''>
<re.Match object; span=(8, 8), match=''>
<re.Match object; span=(9, 9), match=''>
<re.Match object; span=(10, 10), match=''>
<re.Match object; span=(11, 11), match=''>
<re.Match object; span=(12, 12), match=''>
<re.Match object; span=(13, 13), match=''>
<re.Match object; span=(14, 14), match=''>
<re.Match object; span=(15, 15), match=''>
<re.Match object; span=(16, 16), match=''>
<re.Match object; span=(17, 17), match=''>
<re.Match object; span=(18, 18), match=''>
<re.Match object; span=(19, 19), match=''>
<re.Match object; span=(20, 20), match=''>
<re.Match object; span=(21, 21), match=''>
<re.Match object; span=(22, 22), match=''>
<re.Match object; span=(23, 23)

In [137]:
text = '''some email addresses: johndoe@example.com, 
janedoenew@example.org, 
jp@gmail.com,
jp@gmail.net,
jp@gmailnew.com 
nice weather today'''

# pattern = '\w+@\w+\.\w+'
# emails = re.findall(pattern, text)

# gmails = re.findall(g_pattern, text)
# print(gmails)
# print(emails)

# Pattern: some 'alphanumeric/spl charater' + '@' + alphabets' + '.' + alphabets )2-3)

# pattern = '\w+@\w+.\w+'
# matches = re.finditer(pattern, text)

# for match in matches:
#   print(match)


gpattern = '\w+@gmail.+\w+'
matches = re.finditer(gpattern, text)

for match in matches:
  print(match)

<re.Match object; span=(69, 81), match='jp@gmail.com'>
<re.Match object; span=(83, 95), match='jp@gmail.net'>
<re.Match object; span=(97, 112), match='jp@gmailnew.com'>


In [156]:
text = '''my name is priya@gmail.com
kpriya@gmailnew.com
hello@gmailx.com
abc@gma.com'''

pattern = '\w+@gmail.+'
match = re.findall(pattern,text)
print (match)

['priya@gmail.com', 'kpriya@gmailnew.com', 'hello@gmailx.com']


In [141]:
text = '''some email address:
kowi.a@gmail.com,
google.g@gmail.com , john_8
@gmail.com'''

pattern = '\w+@\w+\.w+'
matches = re.finditer(pattern,text)
for match in matches:
    print(match)

In [209]:
text = '''555-444-2221
+91*9025228000
+91-90252280012
+91 9025228002'''

# pattern = '\+\d\d.\d\d\d\d\d\d\d\d\d\d'
# emails = re.findall(pattern, text)
# print(emails)

pattern = re.compile('\+\d+.\d{10}')
# pattern = '.{10}'
matches = re.finditer(pattern,text)
for match in matches:
    print(match)

# pattern = '\+\d\d.\d+'
# emails = re.findall(pattern, text)
# print(emails)

# pattern = '\+\d\d[\s-]\d+'
# emails = re.findall(pattern, text)
# print(emails)


<re.Match object; span=(13, 27), match='+91*9025228000'>
<re.Match object; span=(28, 42), match='+91-9025228001'>
<re.Match object; span=(44, 58), match='+91 9025228002'>


In [183]:
text = '''abc ABC
efg'''


pattern = '[a-z]'
matches = re.finditer(pattern,text,re.A)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(1, 2), match='b'>
<re.Match object; span=(2, 3), match='c'>
<re.Match object; span=(8, 9), match='e'>
<re.Match object; span=(9, 10), match='f'>
<re.Match object; span=(10, 11), match='g'>


## Python PIP

Pip is a package manager for Python. It is used to install, uninstall, and manage Python packages.



In [384]:
%pip --version

pip 23.2.1 from /Users/Z00CVY1/Library/Python/3.9/lib/python/site-packages/pip (python 3.9)
Note: you may need to restart the kernel to use updated packages.


In [212]:
%pip install -r

Package           Version
----------------- ------------
asttokens         2.4.0
backcall          0.2.0
colorama          0.4.6
comm              0.1.4
contourpy         1.1.1
cycler            0.12.1
debugpy           1.8.0
decorator         5.1.1
docutils          0.20.1
executing         1.2.0
fonttools         4.43.1
ipykernel         6.25.2
ipython           8.15.0
jedi              0.19.0
jupyter_client    8.3.1
jupyter_core      5.3.1
kiwisolver        1.4.5
matplotlib        3.8.0
matplotlib-inline 0.1.6
nest-asyncio      1.5.8
numpy             1.26.0
packaging         23.1
pandas            2.1.1
parso             0.8.3
patsy             0.5.3
pickleshare       0.7.5
Pillow            10.1.0
pip               23.2.1
platformdirs      3.10.0
prompt-toolkit    3.0.39
psutil            5.9.5
psycopg2          2.9.7
pure-eval         0.2.2
Pygments          2.16.1
pyparsing         3.1.1
python-dateutil   2.8.2
pytz              2023.3.post1
pywin32           306
pyzmq          


[notice] A new release of pip is available: 23.2.1 -> 23.3
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
%pip install --user scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [70]:
%pip install --upgrade pip

Defaulting to user installation because normal site-packages is not writeable
Collecting pip
  Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
[K     |████████████████████████████████| 2.1 MB 2.6 MB/s eta 0:00:01
[?25hInstalling collected packages: pip
Successfully installed pip-23.2.1
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.


In [33]:
%pip install package_name
%pip install sklearn


scikitlearn 

Defaulting to user installation because normal site-packages is not writeable
Collecting package_name
  Downloading package_name-0.1.tar.gz (782 bytes)
  Preparing metadata (setup.py) ... [?25ldone
[?25hBuilding wheels for collected packages: package_name
  Building wheel for package_name (setup.py) ... [?25ldone
[?25h  Created wheel for package_name: filename=package_name-0.1-py3-none-any.whl size=1254 sha256=a9f47ff4bdb4f3b513f4300459d7a85307b5ac1dbfb77be146e0b975d9a2fa9b
  Stored in directory: /Users/Z00CVY1/Library/Caches/pip/wheels/67/e6/c3/cbfcab244d830378592564f5e46da23a8aad979c4a958b401a
Successfully built package_name
Installing collected packages: package_name
Successfully installed package_name-0.1


In [4]:
%pip uninstall pandas

Found existing installation: pandas 2.0.3
Uninstalling pandas-2.0.3:
  Would remove:
    /Users/Z00CVY1/Library/Python/3.9/lib/python/site-packages/pandas-2.0.3.dist-info/*
    /Users/Z00CVY1/Library/Python/3.9/lib/python/site-packages/pandas/*
Proceed (Y/n)? ^C
[31mERROR: Operation cancelled by user[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.


## JSON in Python

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is often used to transmit data between a server and a client, or to store data in a file.

JSON is a text-based format, which means that it can be easily read and written by humans. It is also easy to parse and manipulate using Python.

##### Basics of JSON
A JSON object is a collection of key-value pairs. The keys are strings, and the values can be strings, numbers, arrays, or objects.

An array is a list of values. The values in an array can be of any type.

An object is a collection of key-value pairs. The keys and values in an object can be of any type.

Here is an example of a JSON object:




In [222]:
import json

In [221]:
people = '''{
  "name": "John Doe",
  "age": 30}
'''
print(type(people))

<class 'str'>


In [224]:
people = '''{
  "name": "John Doe",
  "age": 30,
  "address": {
    "street": "123 Main Street",
    "city": "Anytown",
    "state": "CA"
  }
}'''
print(type(people))

<class 'str'>


In [226]:
data = json.loads(people)

print(type(data))
print(data)


<class 'dict'>
{'name': 'John Doe', 'age': 30, 'address': {'street': '123 Main Street', 'city': 'Anytown', 'state': 'CA'}}


In [227]:
data['name']

'John Doe'

In [233]:
print(data['age'])
print(type(data['age']))

30
<class 'int'>


In [236]:
print(data['address'])
print(type(data['address']))

{'street': '123 Main Street', 'city': 'Anytown', 'state': 'CA'}
<class 'dict'>


In [None]:

people = '''{
  "name": "John Doe",
  "age": 30}
'''
print(type(people))

<class 'str'>


In [237]:
for i in data['address']:
    print(i,':', data['address'][i])

street : 123 Main Street
city : Anytown
state : CA


In [241]:
add = data['address']
print(type(add))
print(type(add['state']))

<class 'dict'>
<class 'str'>


In [242]:
# people = ''' {pc : [{},{},{} ]  } '''

In [243]:
people = '''
{
  "people_collection":[
  {
    "name": "John Doe",
    "age": 30,
    "email": null,
    "address": {
      "street": "123 Main Street",
      "city": "Anytown",
      "state": "CA"
    }
  },
  {
    "name": "Jack Snyder",
    "age": 43,
    "email": null,
    "address": {
      "street": "222 Main Street",
      "city": "Newyork",
      "state": "NY"
    }
  }  
]
}
'''
print(type(people))
people

<class 'str'>


'\n{\n  "people_collection":[\n  {\n    "name": "John Doe",\n    "age": 30,\n    "email": null,\n    "address": {\n      "street": "123 Main Street",\n      "city": "Anytown",\n      "state": "CA"\n    }\n  },\n  {\n    "name": "Jack Snyder",\n    "age": 43,\n    "email": null,\n    "address": {\n      "street": "222 Main Street",\n      "city": "Newyork",\n      "state": "NY"\n    }\n  }  \n]\n}\n'

In [244]:
data = json.loads(people)

print(type(data))
print(data)



<class 'dict'>
{'people_collection': [{'name': 'John Doe', 'age': 30, 'email': None, 'address': {'street': '123 Main Street', 'city': 'Anytown', 'state': 'CA'}}, {'name': 'Jack Snyder', 'age': 43, 'email': None, 'address': {'street': '222 Main Street', 'city': 'Newyork', 'state': 'NY'}}]}


In [247]:
print(type(data['people_collection']))
# data['people_collection']

data['people_collection'][0]['name']



<class 'list'>


'John Doe'

In [250]:
for person in data['people_collection']:
    for person_details in person:
        print(person_details, " : ", person[person_details])

name  :  John Doe
age  :  30
email  :  None
address  :  {'street': '123 Main Street', 'city': 'Anytown', 'state': 'CA'}
name  :  Jack Snyder
age  :  43
email  :  None
address  :  {'street': '222 Main Street', 'city': 'Newyork', 'state': 'NY'}


In [72]:
for person in data['people_collection']:
    print(person['name'], ' is from ', person['address']['city'])
    


John Doe  is from  Anytown
Jack Snyder  is from  Newyork


In [78]:
for person in data['people_collection']:
    print(person)

{'name': 'John Doe', 'age': 30, 'email': None, 'address': {'street': '123 Main Street', 'city': 'Anytown', 'state': 'CA'}}
{'name': 'Jack Snyder', 'age': 43, 'email': None, 'address': {'street': '222 Main Street', 'city': 'Newyork', 'state': 'NY'}}


In [79]:
for person in data['people_collection']:
    del person['email']

In [80]:
for person in data['people_collection']:
    print(person)

{'name': 'John Doe', 'age': 30, 'address': {'street': '123 Main Street', 'city': 'Anytown', 'state': 'CA'}}
{'name': 'Jack Snyder', 'age': 43, 'address': {'street': '222 Main Street', 'city': 'Newyork', 'state': 'NY'}}


In [255]:
jsbvskjbe = json.dumps(data,indent=4)

print(type(jsbvskjbe))
print(jsbvskjbe)


<class 'str'>
{
    "people_collection": [
        {
            "name": "John Doe",
            "age": 30,
            "email": null,
            "address": {
                "street": "123 Main Street",
                "city": "Anytown",
                "state": "CA"
            }
        },
        {
            "name": "Jack Snyder",
            "age": 43,
            "email": null,
            "address": {
                "street": "222 Main Street",
                "city": "Newyork",
                "state": "NY"
            }
        }
    ]
}


In [262]:
with open('demo files/states.txt', 'r') as f:
    data = json.load(f)

In [263]:
data

{'states': [{'state_name': 'Andhra Pradesh', 'capital': 'Hyderabad'},
  {'state_name': 'Arunachal Pradesh', 'capital': 'Itanagar'},
  {'state_name': 'Assam', 'capital': 'Dispur'},
  {'state_name': 'Bihar', 'capital': 'Patna'},
  {'state_name': 'Chhattisgarh', 'capital': 'Raipur'},
  {'state_name': 'Goa', 'capital': 'Panaji'},
  {'state_name': 'Gujarat', 'capital': 'Gandhinagar'},
  {'state_name': 'Haryana', 'capital': 'Chandigarh'},
  {'state_name': 'Himachal Pradesh', 'capital': 'Shimla'},
  {'state_name': 'Jammu & Kashmir',
   'capital': 'Srinagar(Summer)/Jammu(Winter)'},
  {'state_name': 'Jharkhand', 'capital': 'Ranchi'},
  {'state_name': 'Karnataka', 'capital': 'Bengaluru'},
  {'state_name': 'Kerala', 'capital': 'Thiruvananthapuram'},
  {'state_name': 'Madhya Pradesh', 'capital': 'Bhopal'},
  {'state_name': 'Maharashtra', 'capital': 'Mumbai'},
  {'state_name': 'Manipur', 'capital': 'Imphal'},
  {'state_name': 'Meghalaya', 'capital': 'Shillong'},
  {'state_name': 'Mizoram', 'capital

In [264]:
for state in data['states']:
    print(state['capital'], ' is the capital of ', state['state_name'])

Hyderabad  is the capital of  Andhra Pradesh
Itanagar  is the capital of  Arunachal Pradesh
Dispur  is the capital of  Assam
Patna  is the capital of  Bihar
Raipur  is the capital of  Chhattisgarh
Panaji  is the capital of  Goa
Gandhinagar  is the capital of  Gujarat
Chandigarh  is the capital of  Haryana
Shimla  is the capital of  Himachal Pradesh
Srinagar(Summer)/Jammu(Winter)  is the capital of  Jammu & Kashmir
Ranchi  is the capital of  Jharkhand
Bengaluru  is the capital of  Karnataka
Thiruvananthapuram  is the capital of  Kerala
Bhopal  is the capital of  Madhya Pradesh
Mumbai  is the capital of  Maharashtra
Imphal  is the capital of  Manipur
Shillong  is the capital of  Meghalaya
Aizawl  is the capital of  Mizoram
Kohima  is the capital of  Nagaland
Bhubaneswar  is the capital of  Odisha
Chandigarh  is the capital of  Punjab
Jaipur  is the capital of  Rajasthan
Gangtok  is the capital of  Sikkim
Chennai  is the capital of  Tamil Nadu
Hyderabad  is the capital of  Telangana
Agart

## Pickling in Python

Python pickle module is used for serializing and de-serializing a Python object structure. Any object in Python can be pickled so that it can be saved on disk. What Pickle does is it “serializes” the object first before writing it to a file. 

In [265]:
import pickle

example_dict = {1:"6", 2:"2", 3:"£"}

pickle_out = open ("dict.pickle", "wb")
pickle.dump(example_dict,pickle_out)
pickle_out.close()

In [96]:
with open ('dict2.pickle','wb') as pickle_out:
    pickle.dump(example_dict,pickle_out)


In [267]:
with open ('dict.pickle','rb') as pic:
    example_dict = pickle.load(pic)
print(example_dict)

{1: '6', 2: '2', 3: '£'}
