# Regular Expressions Tutorial: Personal Data with Python


with `Mr. Fugu Data Science`

# (◕‿◕✿)


[Github](https://github.com/MrFuguDataScience) | [Youtube](https://www.youtube.com/channel/UCbni-TDI-Ub8VlGaP8HLTNw?view_as=subscriber)

[re module documentation](https://docs.python.org/3/library/re.html)

# Useful Tables to Help:

| Sequence | Meaning | | Special Characters | Meaning |
|-----|---------------------------------------------|---|--------------------|-----------------------------------------------------------------|
| `\A` | matches only at the start of the string | | `.` | any character except newline (`\n`) |
| `\b` | word boundary | | `^` | start of string; in MULTILINE mode it also matches directly after each newline |
| `\B` | NOT a word boundary | | `$` | end of string or just before a trailing newline (`\n`) |
| `\d` | digit (0-9) | | `*` | 0 or more repetitions of the pattern before the `*` |
| `\D` | any character that is NOT a digit | | `+` | 1 or more repetitions of the pattern before the `+` |
| `\s` | whitespace (space, `\t`, `\n`, `\r`, `\f`, `\v`) | | `?` | 0 or 1 repetitions of the preceding pattern |
| `\S` | any character that is NOT whitespace | | `*?`, `+?`, `??` | non-greedy forms of `*`, `+`, `?` |
| `\w` | word character: letter, digit, or underscore | | `{m}` | exactly m repetitions |
| `\W` | any character that is NOT a word character | | `{m,n}` | from m to n repetitions, as many as possible |
| `\Z` | matches only at the end of the string | | `{m,n}?` | from m to n repetitions, as FEW as possible |
| `( )` | grouping | | `[ ]` | character set: matches any one character listed inside |
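
A few rows of the table are easier to see in action. The snippet below is a minimal sketch on a made-up string (the sample text and variable names are illustrative, not part of the dataset used later). It compares greedy and non-greedy matching, the `{m,n}` range, and the `\b` word boundary.

```python
import re

# Made-up sample text, for illustration only
text = "<b>bold</b> and <i>italic</i>, id 12345, cat concatenate"

# Greedy: .* grabs as much as possible, so a single long match comes back
greedy = re.findall(r"<.*>", text)

# Non-greedy: .*? stops at the first closing bracket it can
lazy = re.findall(r"<.*?>", text)

# {2,3} matches runs of 2 to 3 digits, preferring 3
digit_runs = re.findall(r"\d{2,3}", text)

# \b on both sides matches "cat" only as a whole word
whole_word = re.findall(r"\bcat\b", text)

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(lazy)        # ['<b>', '</b>', '<i>', '</i>']
print(digit_runs)  # ['123', '45']
print(whole_word)  # ['cat']
```

Note that `concatenate` contains "cat" but is skipped, because there is no word boundary on either side of those three letters.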


`---------------------------------------`

# `Other Tips:`

| Look ahead & Look behind | Meaning |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------|
| `(?:...)` | non-capturing group: match but don't capture it |
| `(?=...)` | positive look ahead: match only if the suffix is present, but exclude it from the result. ex.) `q(?=u)` finds "q" that IS followed by "u", BUT doesn't return "u" |
| `(?!...)` | negative look ahead: match only if the suffix is absent. ex.) `p(?!u)` finds "p" NOT followed by "u" |
| `(?<=...)` | positive look behind: match only if the prefix is present |
| `(?<!...)` | negative look behind: match only if the prefix is absent |
| `(?=(regex))` | capture group inside a look ahead: use this to store the text the look ahead saw |
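
Here is a small sketch of each row above, run on a made-up sentence (the text and variable names are only for illustration). The look-arounds check what is next to the match without including it in the result.

```python
import re

# Made-up sample text, for illustration only
text = "quit qatar, cost: $40 and 15 apples"

# Positive look ahead: positions of "q" that IS followed by "u"
q_then_u = [m.start() for m in re.finditer(r"q(?=u)", text)]

# Negative look ahead: positions of "q" NOT followed by "u"
q_not_u = [m.start() for m in re.finditer(r"q(?!u)", text)]

# Positive look behind: digits directly preceded by a dollar sign
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative look behind: whole numbers NOT preceded by a dollar sign
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

# Capture group inside a look ahead: overlapping two-character pairs
pairs = re.findall(r"(?=(\w\w))", "abcd")

print(q_then_u)          # [0]
print(q_not_u)           # [5]
print(after_dollar)      # ['40']
print(not_after_dollar)  # ['15']
print(pairs)             # ['ab', 'bc', 'cd']
```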

In [1]:
import re
import pandas as pd
import faker # create fake personal profiles
from datetime import datetime  # change formatting

In [2]:
# Create our fake Personal Profiles Dataset:

# Seed through the class method: recent Faker releases raise an error for fake_.seed(...)
faker.Faker.seed(413)
fake_=faker.Faker()
fake_personal_profile=[]
for i in range(500):
    fake_personal_profile.append(fake_.profile())

In [18]:
profiles=pd.DataFrame(fake_personal_profile)
profiles.columns
profiles.head()

Unnamed: 0,job,company,ssn,residence,current_location,blood_group,website,username,name,sex,address,mail,birthdate
0,Science writer,Mcdonald-Miller,724-95-4794,"74187 Laurie Parkways Apt. 344\nLake Maria, MN...","(-2.2591115, 88.092470)",O-,"[http://brown-salinas.com/, https://crawford-g...",markgray,Stephanie Craig,F,"7443 Jennifer Squares Suite 296\nJohnton, VT 3...",lbowers@yahoo.com,2001-11-10
1,Psychotherapist,"Roberts, Bennett and Briggs",869-25-5810,"4035 Perez Pass Apt. 421\nCalderonton, KS 89435","(82.7585265, 45.743752)",B+,"[http://www.benson.com/, http://www.thompson.c...",jose21,Brandon Wolfe,M,"08144 Todd Greens Suite 391\nMichaeltown, UT 9...",powellbrandy@yahoo.com,2018-09-15
2,Trade mark attorney,"Ramsey, Jordan and Hutchinson",236-16-8405,"0086 Cohen Trafficway\nNorth Joseph, WV 50669","(-76.6212615, 112.630239)",B-,[https://lee.biz/],gregory52,Melissa Summers,F,81538 Michael Tunnel Suite 123\nSouth Jeffreyh...,joshuadennis@yahoo.com,1939-01-23
3,Risk manager,Hinton-Bradley,122-87-1267,"4637 Mary Inlet Apt. 200\nNew Robert, ME 48035","(-80.2167735, 63.141605)",AB+,"[https://thomas.com/, https://roberson.info/, ...",tammy86,April Quinn,F,"8201 Garrison Forest Apt. 935\nCatherineton, N...",ghunter@hotmail.com,1960-10-24
4,Pathologist,Stephens Ltd,017-44-8291,"36411 Schwartz Shoals Suite 648\nEast Denise, ...","(47.668630, 88.304771)",A+,"[http://davis.org/, https://walker.biz/, http:...",yjones,Mike White,M,USNS Rodriguez\nFPO AP 85658,megan38@hotmail.com,1911-12-29


In [27]:
amended_profiles=profiles[['job','company','ssn','website','username',
                           'name','birthdate','mail']]

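# .copy() gives an independent dataframe, so adding a column later won't raise SettingWithCopyWarning
without_date=amended_profiles.iloc[:,[0,1,2,3,4,5,7]].copy()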


# amended_profiles
# without_date.head()

In [28]:
# Converting Birthday to American Date format:

oo=amended_profiles['birthdate']

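# strftime works here because Faker returns birthdate as a datetime.date object
dates_=[i.strftime('%m/%d/%Y') for i in oo]

oo  # show the original (ISO format) column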


0      2001-11-10
1      2018-09-15
2      1939-01-23
3      1960-10-24
4      1911-12-29
          ...    
495    1996-01-23
496    1969-06-16
497    1980-10-30
498    1937-01-05
499    1955-01-04
Name: birthdate, Length: 500, dtype: object
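
If the dates were already plain strings rather than date objects, `strftime` would not be available. As a hedged alternative, the sketch below does the same reformatting with `re.sub` and numbered groups. The three sample dates are copied from the output above, and the variable names are just for illustration.

```python
import re

# A few ISO dates (YYYY-MM-DD) as plain strings, taken from the output above
iso_dates = ["2001-11-10", "2018-09-15", "1939-01-23"]

# Three capture groups: (year)-(month)-(day)
iso_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# \2/\3/\1 reorders the groups into month/day/year
us_dates = [iso_pattern.sub(r"\2/\3/\1", d) for d in iso_dates]

print(us_dates)  # ['11/10/2001', '09/15/2018', '01/23/1939']
```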

In [34]:
# Add the new date column to the dataframe, then convert to a list of rows for regex parsing

without_date['birthday']=dates_

lst_version_df=without_date.values.tolist()
# without_date.head()
# lst_version_df

In [36]:
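# Convert each row to comma-separated strings so we can parse them with regex

flat_=[]
for i in lst_version_df:
    flat_.append(",".join(i[3]))   # websites (a list of URLs)
    flat_.append(",".join(i[:3]))  # job, company, ssn
    flat_.append(",".join(i[4:]))  # username, name, mail, birthday
# Sliding window over three chunks at a time: neighbouring strings overlap,
# so one string can mix pieces of two different profiles
lst_profiles=[]
for i in range(len(flat_)-2):
    lst_profiles.append(",".join([flat_[i],flat_[i+1],flat_[i+2]]))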

lst_profiles[:5]
# flat_

['http://brown-salinas.com/,https://crawford-gates.biz/,https://www.hoffman.net/,Science writer,Mcdonald-Miller,724-95-4794,markgray,Stephanie Craig,lbowers@yahoo.com,11/10/2001',
 'Science writer,Mcdonald-Miller,724-95-4794,markgray,Stephanie Craig,lbowers@yahoo.com,11/10/2001,http://www.benson.com/,http://www.thompson.com/,http://www.francis.com/,http://wright.com/',
 'markgray,Stephanie Craig,lbowers@yahoo.com,11/10/2001,http://www.benson.com/,http://www.thompson.com/,http://www.francis.com/,http://wright.com/,Psychotherapist,Roberts, Bennett and Briggs,869-25-5810',
 'http://www.benson.com/,http://www.thompson.com/,http://www.francis.com/,http://wright.com/,Psychotherapist,Roberts, Bennett and Briggs,869-25-5810,jose21,Brandon Wolfe,powellbrandy@yahoo.com,09/15/2018',
 'Psychotherapist,Roberts, Bennett and Briggs,869-25-5810,jose21,Brandon Wolfe,powellbrandy@yahoo.com,09/15/2018,https://lee.biz/']

In [8]:
# Example of our data as a string:
lst_profiles[0]

'http://brown-salinas.com/,https://crawford-gates.biz/,https://www.hoffman.net/,Science writer,Mcdonald-Miller,724-95-4794,markgray,Stephanie Craig,lbowers@yahoo.com,11/10/2001'

In [44]:
# Find webpages:
qq='http://brown-salinas.com/,https://crawford-gates.biz/,https://www.hoffman.net/,Science writer,Mcdonald-Miller,724-95-4794,markgray,Stephanie Craig,lbowers@yahoo.com,11/09/2001,http://ballzini.com/,john joyce,www.eg.com'
re.findall(r'https?://[^\s,]+',qq)  # "s?" makes the "s" optional; [^\s,]+ stops at a comma or whitespace

# Get http or https
# "s?" makes the "s" optional; the final [a-zA-Z] leaves off the trailing slash
http_https=re.findall(r'https?://[a-zA-Z0-9._-]+[a-zA-Z]',qq)
# http_https
# get (www.) but without http/s
www_=re.findall(r'www\.[a-zA-Z0-9._-]+[a-zA-Z]',qq)
# Combine both: (?: ) groups the two possible prefixes without capturing them
all_webpages_=re.findall(r'(?:https?://|www\.)[a-zA-Z0-9._-]+[a-zA-Z]',qq)
www_

['www.hoffman.net', 'www.eg.com']

In [45]:
all_webpages_

['http://brown-salinas.com',
 'https://crawford-gates.biz',
 'https://www.hoffman.net',
 'http://ballzini.com',
 'www.eg.com']

In [11]:
re.findall(r'(?:https?://|www\.)[a-zA-Z0-9._-]+[a-zA-Z]',lst_profiles[1])

['http://www.benson.com',
 'http://www.thompson.com',
 'http://www.francis.com',
 'http://wright.com']
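
Long patterns like the webpage one get hard to read. One option is `re.VERBOSE`, which lets you spread a pattern over several lines and comment each piece. This is a sketch of one possible URL pattern, not a complete URL validator. The sample string reuses values from this notebook, and `url_pattern` is an illustrative name.

```python
import re

# re.VERBOSE ignores whitespace and "#" comments inside the pattern
url_pattern = re.compile(r"""
    (?:https?://|www\.)   # scheme, or a bare www. prefix
    [A-Za-z0-9._-]+       # host characters
    [A-Za-z]              # must end on a letter, so a trailing slash is left off
""", re.VERBOSE)

sample = "http://brown-salinas.com/,Science writer,lbowers@yahoo.com,www.eg.com"
urls = url_pattern.findall(sample)

print(urls)  # ['http://brown-salinas.com', 'www.eg.com']
```

The email address is skipped because it has neither a scheme nor a `www.` prefix.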

In [46]:
# Find Social Security Numbers:
q='http://brown-salinas.com/,https://crawford-gates.biz/,https://www.hoffman.net/,Science writer,Mcdonald-Miller,724-95-4794,markgray,Stephanie Craig,lbowers@yahoo.com,john-deer@gmail.us,11/09/2001,555-11-3333'

# \b on each side keeps the pattern from matching inside a longer run of digits
pp=re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
pp.findall(q)

['724-95-4794', '555-11-3333']
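
`findall` only returns the matched text. When you also want to know where each match sits, or want to hide the sensitive part, `finditer` and `sub` can help. A minimal sketch follows. The record is a shortened version of the string above, and the masking format (`***-**-1234`) is an assumption chosen for illustration.

```python
import re

# Shortened version of the record used above
record = "Science writer,Mcdonald-Miller,724-95-4794,markgray,555-11-3333"

# Capture only the last four digits so they can be reused in the replacement
ssn_pattern = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")

# finditer yields match objects, which carry both the text and its position
found = []
for m in ssn_pattern.finditer(record):
    found.append(m.group(0))
    print(m.group(0), m.span())

# Replace everything except the last four digits (\1 is the captured group)
masked = ssn_pattern.sub(r"***-**-\1", record)
print(masked)  # Science writer,Mcdonald-Miller,***-**-4794,markgray,***-**-3333
```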

In [47]:
# Email Addresses:
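# local part, then "@", then a domain with at least one dot; "-" goes last in a set so it is literal
re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+',q)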

['lbowers@yahoo.com', 'john-deer@gmail.us']
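
If you need the pieces of each address rather than the whole thing, named groups are handy. The sketch below is one way to do it. The group names `user` and `domain` are my own choice, and the pattern is a simplification that does not cover every valid email address.

```python
import re

# Named groups: (?P<name>...) lets you fetch each part by name
email_pattern = re.compile(
    r"(?P<user>[A-Za-z0-9_.+-]+)@(?P<domain>[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
)

text = "Stephanie Craig,lbowers@yahoo.com,john-deer@gmail.us,11/09/2001"

parts = [(m.group("user"), m.group("domain")) for m in email_pattern.finditer(text)]

print(parts)  # [('lbowers', 'yahoo.com'), ('john-deer', 'gmail.us')]
```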

In [14]:
re.split(',',q)


# re.findall(r'http?://\S+')

['http://brown-salinas.com/',
 'https://crawford-gates.biz/',
 'https://www.hoffman.net/',
 'Science writer',
 'Mcdonald-Miller',
 '724-95-4794',
 'markgray',
 'Stephanie Craig',
 'lbowers@yahoo.com',
 'john-deer@gmail.us',
 '11/09/2001',
 '555-11-3333']

In [15]:
# Birthday: yippie o_0

re.findall(r'\b\d{1,2}/\d{1,2}/\d{4}\b',q)  # "/" needs no escaping in Python; {4} pins the year to 4 digits

['11/09/2001']
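
A regex can only check the *shape* of a date, so something like `99/99/2001` would still match. One possible follow-up, sketched here with made-up test values, is to pass each candidate through `datetime.strptime` and keep only the ones that are real calendar dates.

```python
import re
from datetime import datetime

# The second and third dates are deliberately invalid (made up for this example)
text = "lbowers@yahoo.com,11/09/2001,99/99/2001,02/30/2020"

# Step 1: the regex finds anything shaped like MM/DD/YYYY
candidates = re.findall(r"\b\d{1,2}/\d{1,2}/\d{4}\b", text)

# Step 2: strptime raises ValueError for impossible dates
valid = []
for c in candidates:
    try:
        valid.append(datetime.strptime(c, "%m/%d/%Y").date())
    except ValueError:
        pass

print(candidates)  # ['11/09/2001', '99/99/2001', '02/30/2020']
print(valid)       # [datetime.date(2001, 11, 9)]
```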

In [49]:
# finding Occupation, Company Name, user_name, Persons_name

# Compile each pattern once, outside the loop
ssn_re=re.compile(r'\d{3}-\d{2}-\d{4}')
email_re=re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+')
date_re=re.compile(r'\d{1,2}/\d{1,2}/\d+')
web_re=re.compile(r'(?:https?://|www\.)[a-zA-Z0-9._-]+[a-zA-Z]')

k=[]
for i in lst_profiles[0].split(','):
    # keep a field only if it is NOT an SSN, email, date, or webpage
    if not any(p.search(i) for p in (ssn_re,email_re,date_re,web_re)):
        k.append(i)
k

# lst_profiles[0]

['Science writer', 'Mcdonald-Miller', 'markgray', 'Stephanie Craig']
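
The same idea can be turned around: instead of skipping the fields we recognise, label them. Below is a rough sketch. `classify_field` and `PATTERNS` are hypothetical names made up for this example, and the record is a shortened profile string from above. It uses `fullmatch`, so a field only gets a label when the *entire* field fits the pattern.

```python
import re

# Hypothetical lookup table: label -> compiled pattern
PATTERNS = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "email": re.compile(r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    "birthday": re.compile(r"\d{1,2}/\d{1,2}/\d{4}"),
    "website": re.compile(r"(?:https?://|www\.)\S+"),
}

def classify_field(field):
    """Return the first label whose pattern matches the whole field, else 'other'."""
    for label, pattern in PATTERNS.items():
        if pattern.fullmatch(field):
            return label
    return "other"

record = ("https://www.hoffman.net/,Science writer,724-95-4794,markgray,"
          "Stephanie Craig,lbowers@yahoo.com,11/10/2001")

labels = [classify_field(f) for f in record.split(",")]
print(labels)
# ['website', 'other', 'ssn', 'other', 'other', 'email', 'birthday']
```

Everything labelled `other` (occupation, username, person's name) still needs its own rule, which is what the next cells attempt.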

In [51]:
# Compile the patterns once, then collect every match for each profile string
web_re=re.compile(r'(?:https?://|www\.)[a-zA-Z0-9._/-]+')
date_re=re.compile(r'\d{1,2}/\d{1,2}/\d+')
# digits are allowed in the local part, e.g. megan38@hotmail.com
email_re=re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+')
ssn_re=re.compile(r'\d{3}-\d{2}-\d{4}')
name_re=re.compile(r'([A-Z][a-z]+(?: [A-Z][a-z]\.)? [A-Z][a-z]+)')
# caveat: this also picks up company-name fragments such as 'Bennett and'
occupation_re=re.compile(r'([A-Z][a-z]+?(?: [A-Z][a-z]\.)? [a-z]+)')

webpages=[]
birthday=[]
emails=[]
social_security_num=[]
name_=[]
occupation=[]
for i in lst_profiles:
    webpages.append(web_re.findall(i))
    birthday.append(date_re.findall(i))
    emails.append(email_re.findall(i))
    social_security_num.append(ssn_re.findall(i))
    name_.append(name_re.findall(i))
    occupation.append(occupation_re.findall(i))

# split_df=[]
# for i in lst_profiles:
#     split_df.append(i.split(','))

# split_df[:2]
occupation[:10]

[['Science writer'],
 ['Science writer'],
 ['Bennett and'],
 ['Bennett and'],
 ['Bennett and'],
 ['Trade mark', 'Jordan and'],
 ['Trade mark', 'Jordan and'],
 ['Trade mark', 'Jordan and'],
 ['Risk manager'],
 ['Risk manager']]
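
Notice that the occupation results contain things like `'Bennett and'`, which comes from a company name, and that a one-word job such as "Psychotherapist" is missing. The short sketch below reruns the same occupation pattern on three shortened records from the data above to show why: the pattern asks for a capitalized word, a space, then a lowercase word, and nothing in it says "this must be the job field".

```python
import re

# Same occupation pattern as used in the loop above
occupation_pattern = re.compile(r'([A-Z][a-z]+?(?: [A-Z][a-z]\.)? [a-z]+)')

# Shortened records taken from the profiles shown earlier
samples = [
    "Science writer,Mcdonald-Miller,724-95-4794",
    "Psychotherapist,Roberts, Bennett and Briggs,869-25-5810",
    "Trade mark attorney,Ramsey, Jordan and Hutchinson,236-16-8405",
]

results = [occupation_pattern.findall(s) for s in samples]
for r in results:
    print(r)
# ['Science writer']
# ['Bennett and']
# ['Trade mark', 'Jordan and']
```

So the pattern works for two-word titles, but treat its output as a rough first pass rather than a clean occupation column.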

In [21]:
birthday[:6]

[['11/10/2001'],
 ['11/10/2001'],
 ['11/10/2001'],
 ['09/15/2018'],
 ['09/15/2018'],
 ['09/15/2018']]

In [22]:
webpages[:5]

[['http://brown-salinas.com/',
  'https://crawford-gates.biz/',
  'https://www.hoffman.net/'],
 ['http://www.benson.com/',
  'http://www.thompson.com/',
  'http://www.francis.com/',
  'http://wright.com/'],
 ['http://www.benson.com/',
  'http://www.thompson.com/',
  'http://www.francis.com/',
  'http://wright.com/'],
 ['http://www.benson.com/',
  'http://www.thompson.com/',
  'http://www.francis.com/',
  'http://wright.com/'],
 ['https://lee.biz/']]

In [23]:
emails[:5]

[['lbowers@yahoo.com'],
 ['lbowers@yahoo.com'],
 ['lbowers@yahoo.com'],
 ['powellbrandy@yahoo.com'],
 ['powellbrandy@yahoo.com']]

In [24]:
social_security_num[:6]

[['724-95-4794'],
 ['724-95-4794'],
 ['869-25-5810'],
 ['869-25-5810'],
 ['869-25-5810'],
 ['236-16-8405']]

In [29]:
name_[:10]

[['Stephanie Craig'],
 ['Stephanie Craig'],
 ['Stephanie Craig'],
 ['Brandon Wolfe'],
 ['Brandon Wolfe'],
 ['Brandon Wolfe'],
 ['Melissa Summers'],
 ['Melissa Summers'],
 ['Melissa Summers'],
 ['April Quinn']]

In [18]:
# io=re.compile(r'([A-Z][a-z]+?(?: [A-Z][a-z]\.)? [a-z]+)')
io=re.compile(r'^([A-Z][a-z]+)')
y=['http://brown-salinas.com/',
  'https://crawford-gates.biz/',
  'https://www.hoffman.net/',
  'Psychontherapist',
   'Scientist person',
  'Mcdonald-Miller',
  '724-95-4794',
  'markgray',
  'Stephanie Craig',
  'lbowers@yahoo.com',
  '11/09/2001']
for i in y:
    print(re.findall(io,i))
# lst_profiles[3]

[]
[]
[]
['Psychontherapist']
['Scientist']
['Mcdonald']
[]
[]
['Stephanie']
[]
[]
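
The `^` anchor above works per item because each list element is its own string. If the same fields were joined into one multi-line string, `^` would only match at the very start unless `re.MULTILINE` is set. A small sketch, using three of the values from the list above (variable names are illustrative):

```python
import re

fields = ["Scientist person", "markgray", "Stephanie Craig"]
block = "\n".join(fields)  # one string, one field per line

# Without MULTILINE, ^ matches only at the start of the whole string
first_only = re.findall(r"^[A-Z][a-z]+", block)

# With MULTILINE, ^ also matches directly after each newline
every_line = re.findall(r"^[A-Z][a-z]+", block, re.MULTILINE)

print(first_only)  # ['Scientist']
print(every_line)  # ['Scientist', 'Stephanie']
```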


# Citations:


https://www.regular-expressions.info/wordboundaries.html

https://www.regular-expressions.info/lookaround.html