This notebook contains the second part of exercises from:
https://www.w3resource.com/python-exercises/pandas/index.php
It covers `string` and `regular expression` operations.

In [2]:
import pandas as pd
import numpy as np

# Pandas String and Regular Expression [ 41 exercises with solution]

In [3]:
# 1 Write a Pandas program to convert all the string values to upper, 
# lower cases in a given pandas series. Also find the length of the string values.
sr1 = pd.Series(['annabel', 'april', 'elisa', 'maria', 'lenny', 'george'])
sr1.str.upper().str.lower().str.len()

0    7
1    5
2    5
3    5
4    5
5    6
dtype: int64
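
The chained call above returns only the lengths, because each method's result feeds straight into the next one. A minimal sketch that keeps the three results side by side (the column names `upper`, `lower`, and `length` are illustrative and not part of the exercise):

```python
import pandas as pd

sr1 = pd.Series(['annabel', 'april', 'elisa', 'maria', 'lenny', 'george'])

# Collect each transformation in its own column so none is discarded.
cases = pd.DataFrame({
    'upper': sr1.str.upper(),
    'lower': sr1.str.lower(),
    'length': sr1.str.len(),
})
print(cases)
```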

In [4]:
# 2 Write a Pandas program to remove whitespaces, left sided whitespaces 
# and right sided whitespaces of the string values of a given pandas series.
sr2 = pd.Series(['   annabel   ', '   april   ', '   elisa   ', '   maria   ', '   lenny   ', '   george   '])
sr2.str.strip().values

array(['annabel', 'april', 'elisa', 'maria', 'lenny', 'george'],
      dtype=object)
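
The exercise also asks for one-sided trimming, which `str.strip()` alone does not show. A short sketch using `str.lstrip()` and `str.rstrip()` on a shortened copy of the series (the variable names are illustrative):

```python
import pandas as pd

sr2 = pd.Series(['   annabel   ', '   april   '])

left_trimmed = sr2.str.lstrip()    # removes leading whitespace only
right_trimmed = sr2.str.rstrip()   # removes trailing whitespace only

print(left_trimmed.tolist())
print(right_trimmed.tolist())
```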

The `format()` method formats the specified value(s) and inserts them into the string's placeholders.
A placeholder is written with curly brackets: `{}`. The `format()` method returns the formatted string.

The placeholders can be identified using named indexes `{price}`, numbered indexes `{0}`, or even empty placeholders `{}`.

Inside the placeholders you can add a formatting type to format the result:
- `:<` Left aligns the result (within the available space)
- `:>` Right aligns the result (within the available space)
- `:^` Center aligns the result (within the available space)
- `:_` Uses an underscore as a thousands separator
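
The placeholder styles and formatting types listed above can be tried in plain Python. This is a small sketch with made-up values; the variable names are only for illustration:

```python
# Named, numbered, and empty placeholders.
named = 'Price: {price}'.format(price=49)
numbered = '{0} and {1}'.format('a', 'b')
empty = '{} and {}'.format('a', 'b')

# Alignment inside a field of width 8. A fill character may be placed
# before the alignment symbol, as in '{:0>8}'.
left = '{:<8}'.format('ab')
right = '{:>8}'.format('ab')
center = '{:^8}'.format('ab')
zero_padded = '{:0>8}'.format(42)

# Underscore as a thousands separator.
grouped = '{:_}'.format(1234567)

print(repr(left), repr(right), repr(center), zero_padded, grouped)
```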

In [5]:
# 3 Write a Pandas program to add leading zeros to the integer column 
# in a pandas series and makes the length of the field to 8 digit.
sr3 = pd.Series(list(range(10, 4000, 200)))
sr3.apply(lambda x: '{:0>8}'.format(x))

0     00000010
1     00000210
2     00000410
3     00000610
4     00000810
5     00001010
6     00001210
7     00001410
8     00001610
9     00001810
10    00002010
11    00002210
12    00002410
13    00002610
14    00002810
15    00003010
16    00003210
17    00003410
18    00003610
19    00003810
dtype: object
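
An alternative that avoids the Python-level `lambda` is to convert the integers to strings and reuse `str.zfill`. This is a sketch of the same padding, not a claim about which is faster:

```python
import pandas as pd

sr3 = pd.Series(range(10, 4000, 200))

# Cast to str first; zfill then left-pads each value with zeros to width 8.
padded = sr3.astype(str).str.zfill(8)
print(padded.head())
```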

In [6]:
# 4  Write a Pandas program to add leading zeros to the character column 
# in a pandas series and makes the length of the field to 8 digit.
sr4 = pd.Series(['am', 'ummm', 'shrink', 'looosee'])
sr4.str.zfill(8)

0    000000am
1    0000ummm
2    00shrink
3    0looosee
dtype: object

In [7]:
# 5 Write a Pandas program to capitalize all the string values of specified columns of a given DataFrame.
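# sr2 values start with spaces, so strip first; otherwise capitalize()
# acts on the leading space and the letters stay lower case.
sr2.str.strip().str.capitalize()

0    Annabel
1      April
2      Elisa
3      Maria
4      Lenny
5     George
dtype: object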

In [8]:
# 6 Write a Pandas program to count of occurrence of a specified substring in a DataFrame column.
sr2.str.contains('ri').value_counts()

False    4
True     2
dtype: int64
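
`str.contains(...).value_counts()` counts how many rows contain the substring, not how many times it occurs. When a row can hold several occurrences, `str.count` gives the per-row number. A minimal sketch on the unpadded names (note that `str.count` treats its argument as a regular expression, so literal text with special characters would need `re.escape`):

```python
import pandas as pd

names = pd.Series(['annabel', 'april', 'elisa', 'maria', 'lenny', 'george'])

per_row = names.str.count('n')   # occurrences of 'n' in each string
total = int(per_row.sum())       # occurrences across the whole series

print(per_row.tolist(), total)
```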

In [89]:
# 7 Write a Pandas program to find the index of a given substring of a DataFrame column.
df = pd.DataFrame(
    {
        'name_code': ['c001','c002','c022', 'c2002', 'c2222'],
        'date_of_birth': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
        'age': [18.5, 21.2, 22.5, 22, 23]
    }
)
list(df[df.name_code.str.contains('c2')]['name_code'].index)

[3, 4]

In [92]:
df['Index'] = df['name_code'].str.find('22')
df

Unnamed: 0,name_code,date_of_birth,age,Index
0,c001,12/05/2002,18.5,-1
1,c002,16/02/1999,21.2,-1
2,c022,25/09/1998,22.5,2
3,c2002,12/02/2022,22.0,-1
4,c2222,15/09/1997,23.0,1


In [93]:
# 8 Write a Pandas program to find the index of a substring of DataFrame with beginning and end position.
df['Index'] = df['name_code'].str.find('0', 0, 5)
df

Unnamed: 0,name_code,date_of_birth,age,Index
0,c001,12/05/2002,18.5,1
1,c002,16/02/1999,21.2,1
2,c022,25/09/1998,22.5,1
3,c2002,12/02/2022,22.0,2
4,c2222,15/09/1997,23.0,-1


In [94]:
# 9 Write a Pandas program to check whether alpha numeric values present in a given column of a DataFrame. 
df['new_name'] = ['Company','Company a001','Company 123', '1234', 'Company 12']
df.new_name.str.contains('[a-zA-Z0-9]')

0    True
1    True
2    True
3    True
4    True
Name: new_name, dtype: bool

In [95]:
df.new_name.str.isalnum()

0     True
1    False
2    False
3     True
4    False
Name: new_name, dtype: bool
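
The two cells above answer different questions: `str.contains('[a-zA-Z0-9]')` is true when at least one alphanumeric character is present, while `str.isalnum()` is true only when every character is alphanumeric. A sketch that makes the contrast explicit with regular expressions, assuming pandas 1.1 or later for `str.fullmatch` (the sample values, including `'---'`, are made up):

```python
import pandas as pd

values = pd.Series(['Company', 'Company a001', '1234', '---'])

any_alnum = values.str.contains(r'[a-zA-Z0-9]')     # at least one such character
all_alnum = values.str.fullmatch(r'[a-zA-Z0-9]+')   # every character qualifies

print(any_alnum.tolist())
print(all_alnum.tolist())
```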

In [96]:
# 10 Write a Pandas program to check whether alphabetic values present in a given column of a DataFrame.
df.new_name.str.isalpha()

0     True
1    False
2    False
3    False
4    False
Name: new_name, dtype: bool

In [97]:
# 11 Write a Pandas program to check whether only numeric values present in a given column of a DataFrame.
df['is_num'] = df.new_name.str.isnumeric()
df

Unnamed: 0,name_code,date_of_birth,age,Index,new_name,is_num
0,c001,12/05/2002,18.5,1,Company,False
1,c002,16/02/1999,21.2,1,Company a001,False
2,c022,25/09/1998,22.5,1,Company 123,False
3,c2002,12/02/2022,22.0,2,1234,True
4,c2222,15/09/1997,23.0,-1,Company 12,False


In [98]:
# 12 Write a Pandas program to check whether only lower case or upper case is present 
# in a given column of a DataFrame.
df.name_code.str.islower()

0    True
1    True
2    True
3    True
4    True
Name: name_code, dtype: bool

In [99]:
# 13 Write a Pandas program to check whether only proper case or title case 
# is present in a given column of a DataFrame.
df.new_name.str.istitle()

0     True
1    False
2     True
3    False
4     True
Name: new_name, dtype: bool

In [100]:
# 14 Write a Pandas program to check whether only space is present in a given column of a DataFrame.
df.new_name.str.contains(r'\s.')

0    False
1     True
2     True
3    False
4     True
Name: new_name, dtype: bool

In [101]:
df['space_name'] = ['Abcd','EFGF ', '  ', 'abcd', ' ']
df.space_name.str.isspace()

0    False
1    False
2     True
3    False
4     True
Name: space_name, dtype: bool

In [102]:
# 15 Write a Pandas program to get the length of the string present of a given column in a DataFrame.
df.space_name.str.len()

0    4
1    5
2    2
3    4
4    1
Name: space_name, dtype: int64

In [103]:
# 16 Write a Pandas program to get the length of the integer of a given column in a DataFrame.
df['sale_amount'] = [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
df.sale_amount.map(str).apply(len)

0    7
1    8
2    4
3    9
4    4
Name: sale_amount, dtype: int64

In [104]:
# 17 Write a Pandas program to check if a specified column starts with a specified string in a DataFrame.
df.new_name.str.startswith('C')

0     True
1     True
2     True
3    False
4     True
Name: new_name, dtype: bool

In [105]:
# 18 Write a Pandas program to swap the cases of a specified character column in a given DataFrame. 
df.new_name.str.swapcase()

0         cOMPANY
1    cOMPANY A001
2     cOMPANY 123
3            1234
4      cOMPANY 12
Name: new_name, dtype: object

In [84]:
# 21 Write a Pandas program to replace arbitrary values with other values in a given DataFrame. 
df.name_code.str.replace('c', 'b')

0     b001
1     b002
2     b022
3    b2002
4    b2222
Name: name_code, dtype: object

In [107]:
df['literal_name'] = ['A','B', 'C', 'D', 'A']
df.replace('A','C')

Unnamed: 0,name_code,date_of_birth,age,Index,new_name,is_num,space_name,sale_amount,literal_name
0,c001,12/05/2002,18.5,1,Company,False,Abcd,12348.5,C
1,c002,16/02/1999,21.2,1,Company a001,False,EFGF,233331.2,B
2,c022,25/09/1998,22.5,1,Company 123,False,,22.5,C
3,c2002,12/02/2022,22.0,2,1234,True,abcd,2566552.0,D
4,c2222,15/09/1997,23.0,-1,Company 12,False,,23.0,C


In [115]:
df.replace('c','Z', regex=True)

Unnamed: 0,name_code,date_of_birth,age,Index,new_name,is_num,space_name,sale_amount,literal_name
0,Z001,12/05/2002,18.5,1,Company,False,AbZd,12348.5,A
1,Z002,16/02/1999,21.2,1,Company a001,False,EFGF,233331.2,B
2,Z022,25/09/1998,22.5,1,Company 123,False,,22.5,C
3,Z2002,12/02/2022,22.0,2,1234,True,abZd,2566552.0,D
4,Z2222,15/09/1997,23.0,-1,Company 12,False,,23.0,A


In [118]:
# 22 Write a Pandas program to replace more than one value with other values in a given DataFrame.
df.replace(['c', 'o', 'y'],['Z', 'O', 'Y'], regex=True)

Unnamed: 0,name_code,date_of_birth,age,Index,new_name,is_num,space_name,sale_amount,literal_name
0,Z001,12/05/2002,18.5,1,COmpanY,False,AbZd,12348.5,A
1,Z002,16/02/1999,21.2,1,COmpanY a001,False,EFGF,233331.2,B
2,Z022,25/09/1998,22.5,1,COmpanY 123,False,,22.5,C
3,Z2002,12/02/2022,22.0,2,1234,True,abZd,2566552.0,D
4,Z2222,15/09/1997,23.0,-1,COmpanY 12,False,,23.0,A


In [139]:
# 23 Write a Pandas program to split a string of a column of a given DataFrame into multiple columns.
df['name'] = ['Alberto  Franco',
              'Gino Ann Mcneill',
              'Ryan  Parkes', 
              'Eesha Artur Hinton', 
              'Syed  Wharton']
df[['first', 'middle', 'last']] = df['name'].str.split(' ', expand=True)
df

Unnamed: 0,name_code,date_of_birth,age,Index,new_name,is_num,space_name,sale_amount,literal_name,name,first,middle,last
0,c001,12/05/2002,18.5,1,Company,False,Abcd,12348.5,A,Alberto Franco,Alberto,,Franco
1,c002,16/02/1999,21.2,1,Company a001,False,EFGF,233331.2,B,Gino Ann Mcneill,Gino,Ann,Mcneill
2,c022,25/09/1998,22.5,1,Company 123,False,,22.5,C,Ryan Parkes,Ryan,,Parkes
3,c2002,12/02/2022,22.0,2,1234,True,abcd,2566552.0,D,Eesha Artur Hinton,Eesha,Artur,Hinton
4,c2222,15/09/1997,23.0,-1,Company 12,False,,23.0,A,Syed Wharton,Syed,,Wharton
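
The split above yields three columns only because the two-part names contain a double space, which produces an empty middle piece. A sketch that does not depend on that spacing uses `str.extract` with named groups and an optional middle name (the pattern is an assumption about the name layout, not part of the original solution):

```python
import pandas as pd

names = pd.Series(['Alberto Franco', 'Gino Ann Mcneill', 'Ryan  Parkes'])

# Named groups become column names; the middle group is optional,
# so two-part names get NaN there regardless of how many spaces they use.
pattern = r'^(?P<first>\S+)\s+(?:(?P<middle>\S+)\s+)?(?P<last>\S+)$'
parts = names.str.extract(pattern)
print(parts)
```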


In [154]:
# 24 Write a Pandas program to extract email from a specified column of string type of a given DataFrame.
df['name_email'] = ['Alberto Franco af@gmail.com',
                    'Gino Mcneill gm@yahoo.com',
                    'Ryan Parkes rp@abc.io', 
                    'Eesha Hinton', 
                    'Gino Mcneill gm@github.com']
df['email'] = df.name_email.str.extract(r'.*\b(\w+@\w+\.\w+)', expand=True)
df

Unnamed: 0,name_code,date_of_birth,age,Index,new_name,is_num,space_name,sale_amount,literal_name,name,first,middle,last,name_email,email
0,c001,12/05/2002,18.5,1,Company,False,Abcd,12348.5,A,Alberto Franco,Alberto,,Franco,Alberto Franco af@gmail.com,af@gmail.com
1,c002,16/02/1999,21.2,1,Company a001,False,EFGF,233331.2,B,Gino Ann Mcneill,Gino,Ann,Mcneill,Gino Mcneill gm@yahoo.com,gm@yahoo.com
2,c022,25/09/1998,22.5,1,Company 123,False,,22.5,C,Ryan Parkes,Ryan,,Parkes,Ryan Parkes rp@abc.io,rp@abc.io
3,c2002,12/02/2022,22.0,2,1234,True,abcd,2566552.0,D,Eesha Artur Hinton,Eesha,Artur,Hinton,Eesha Hinton,
4,c2222,15/09/1997,23.0,-1,Company 12,False,,23.0,A,Syed Wharton,Syed,,Wharton,Gino Mcneill gm@github.com,gm@github.com


In [162]:
# 25 Write a Pandas program to extract hash attached word from twitter text 
# from the specified column of a given DataFrame.
df25 = pd.DataFrame({
    'tweets': ['#Obama says goodbye','Retweets for #cash','A political endorsement in #Indonesia', '1 dog = many #retweets', 'Just a simple #egg']
    })
df25['hash'] = df25.tweets.str.extract(r'.*(#.+?)\b', expand=True)
df25

Unnamed: 0,tweets,hash
0,#Obama says goodbye,#Obama
1,Retweets for #cash,#cash
2,A political endorsement in #Indonesia,#Indonesia
3,1 dog = many #retweets,#retweets
4,Just a simple #egg,#egg


In [171]:
# 26 Write a Pandas program to extract word mention someone in tweets 
# using @ from the specified column of a given DataFrame.
df26 = pd.DataFrame({
    'tweets': [
        '@Obama says goodbye',
        'Retweets for @cash',
        'A political endorsement in @Indonesia', 
        '1 dog = many #retweets', 
        'Just a simple #egg'
    ]
})
df26['mention'] = df26.tweets.str.extract(r'.*@(.+?)\b', expand=True)
df26

Unnamed: 0,tweets,mention
0,@Obama says goodbye,Obama
1,Retweets for @cash,cash
2,A political endorsement in @Indonesia,Indonesia
3,1 dog = many #retweets,
4,Just a simple #egg,


In [178]:
# 27 Write a Pandas program to extract only number from the specified column of a given DataFrame.
df27 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': [
        '7277 Surrey Ave.',
        '920 N. Bishop Ave.',
        '9910 Golden Star St.', 
        '25 Dunbar St.', 
        '17 West Livingston Court'
    ]
})
df27['address_num'] = df27.address.str.extract(r'(\d+)', expand=True)
df27

Unnamed: 0,company_code,address,address_num
0,c0001,7277 Surrey Ave.,7277
1,c0002,920 N. Bishop Ave.,920
2,c0003,9910 Golden Star St.,9910
3,c0003,25 Dunbar St.,25
4,c0004,17 West Livingston Court,17


In [183]:
# 28 Write a Pandas program to extract only phone number from the specified column of a given DataFrame.
df28 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'company_phone_no': [
        'Company1-Phone no. 4695168357',
        'Company2-Phone no. 8088729013',
        'Company3-Phone no. 6204658086', 
        'Company4-Phone no. 5159530096', 
        'Company5-Phone no. 9037952371']
    })
df28['phone'] = df28.company_phone_no.str.extract(r'no\.\s(\d+)', expand=True)
df28

Unnamed: 0,company_code,company_phone_no,phone
0,c0001,Company1-Phone no. 4695168357,4695168357
1,c0002,Company2-Phone no. 8088729013,8088729013
2,c0003,Company3-Phone no. 6204658086,6204658086
3,c0003,Company4-Phone no. 5159530096,5159530096
4,c0004,Company5-Phone no. 9037952371,9037952371


In [220]:
# 29 Write a Pandas program to extract year between 1800 to 2200 from the specified column of a given DataFrame.
df29 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200'
            ]
})
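# 18xx-19xx, 20xx-21xx, or exactly 2200; the earlier character-class pattern
# would also accept years such as 1200 or 2950.
df29['find_year'] = df29.year.str.extract(r'year\s(1[89]\d{2}|2[01]\d{2}|2200)\b')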
df29

Unnamed: 0,company_code,year,find_year
0,c0001,year 1800,1800
1,c0002,year 1700,
2,c0003,year 2300,
3,c0003,year 1900,1900
4,c0004,year 2200,2200
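
Encoding a numeric range in a regular expression is easy to get wrong. A sketch of an alternative: extract any four-digit number, convert it, and test the range numerically (the names `found` and `in_range` are illustrative):

```python
import pandas as pd

years = pd.Series(['year 1800', 'year 1700', 'year 2300', 'year 1900', 'year 2200'])

# Pull out a four-digit number, then apply the range test on real integers.
found = pd.to_numeric(years.str.extract(r'\b(\d{4})\b', expand=False))
in_range = found.where(found.between(1800, 2200))   # bounds are inclusive

print(in_range)
```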


In [211]:
# 30 Write a Pandas program to extract only non alphanumeric characters 
# from the specified column of a given DataFrame.
df30 = pd.DataFrame({
    'company_code': ['c0001#','c00@0^2','$c0003', 'c0003', '&c0004'],
    'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200']
    })
df30['non_an'] = df30.company_code.str.findall(r'(\W)')
df30

Unnamed: 0,company_code,year,non_an
0,c0001#,year 1800,[#]
1,c00@0^2,year 1700,"[@, ^]"
2,$c0003,year 2300,[$]
3,c0003,year 1900,[]
4,&c0004,year 2200,[&]


In [237]:
# 31 Write a Pandas program to extract only punctuations from the specified column of a given DataFrame.
df31 = pd.DataFrame({
    'company_code': ['c0001...','c000,2','c0003', 'c0003#()', '"c0004",'],
    'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200']
    })
df31['punct'] = df31.company_code.str.findall(r'([\.,;:\?\!\-\(\)"])')
df31

Unnamed: 0,company_code,year,punct
0,c0001...,year 1800,"[., ., .]"
1,"c000,2",year 1700,"[,]"
2,c0003,year 2300,[]
3,c0003#(),year 1900,"[(, )]"
4,"""c0004"",",year 2200,"["", "", ,]"


In [241]:
# 32 Write a Pandas program to remove repetitive characters from the specified column of a given DataFrame.
df32 = pd.DataFrame({
    'text_code': ['t0001.','t0002','t0003', 't0004'],
    'text_lang': ['She livedd a long life.', 'How oold is your father?', 'What is tthe problem?','TThhis desk is used by Tom.']
    })
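# Replace each run of a repeated character with a single copy (\1);
# an empty replacement would delete the character altogether.
df32['text_lang'] = df32.text_lang.str.replace(r'(\w)\1+', r'\1', regex=True)
df32

Unnamed: 0,text_code,text_lang
0,t0001.,She lived a long life.
1,t0002,How old is your father?
2,t0003,What is the problem?
3,t0004,This desk is used by Tom.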


In [244]:
import re

df32 = pd.DataFrame({
    'text_code': ['t0001.','t0002','t0003', 't0004'],
    'text_lang': ['She livedd a long life.', 'How oold is your father?', 'What is tthe problem?','TThhis desk is used by Tom.']
    })

def rep_char(match):
    # Called once per run of repeated characters; keep only the first one.
    # Change the return value here to handle repetition differently.
    return match.group(0)[0]

def unique_char(rep, sent_text):
    # re.sub accepts a function as the replacement and passes it each match.
    return re.sub(r'(\w)\1+', rep, sent_text)

df32['normal_text'] = df32['text_lang'].apply(lambda x: unique_char(rep_char, x))
df32

Unnamed: 0,text_code,text_lang,normal_text
0,t0001.,She livedd a long life.,She lived a long life.
1,t0002,How oold is your father?,How old is your father?
2,t0003,What is tthe problem?,What is the problem?
3,t0004,TThhis desk is used by Tom.,This desk is used by Tom.


In [261]:
# 33 Write a Pandas program to extract numbers greater than 940 from the specified column of a given DataFrame.
df33 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': ['7277 Surrey Ave.1111',
                '920 N. Bishop Ave.',
                '9910 Golden Star St.', 
                '1025 Dunbar St.', 
                '1700 West Livingston Court'
               ]
})
df33['number'] = df33.address.str.extract(r'(\d+)').apply(lambda x: pd.to_numeric(x, downcast='integer'))
df33['number'] = df33['number'][df33.number > 940]
df33

Unnamed: 0,company_code,address,number
0,c0001,7277 Surrey Ave.1111,7277.0
1,c0002,920 N. Bishop Ave.,
2,c0003,9910 Golden Star St.,9910.0
3,c0003,1025 Dunbar St.,1025.0
4,c0004,1700 West Livingston Court,1700.0


In [272]:
df33['number'] = df33.address.str.extract(r'(\d+)')
df33['number'] = df33.number[pd.to_numeric(df33.number) > 940]
df33

Unnamed: 0,company_code,address,number
0,c0001,7277 Surrey Ave.1111,7277
1,c0002,920 N. Bishop Ave.,
2,c0003,9910 Golden Star St.,9910
3,c0003,1025 Dunbar St.,1025
4,c0004,1700 West Livingston Court,1700


In [291]:
# 34 Write a Pandas program to extract numbers less than 100 from the specified column of a given DataFrame.
df34 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': ['72 Surrey Ave.11',
                '92 N. Bishop Ave.',
                '9910 Golden Star St.', 
                '102 Dunbar St.', 
                '17 West Livingston Court'
               ]
    })
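# One or two digits standing alone; the earlier pattern missed values ending
# in 0 (such as 10 or 20) and single digits, and captured a trailing space.
df34['number'] = df34.address.str.extract(r'\b(\d{1,2})\b')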
df34

Unnamed: 0,company_code,address,number
0,c0001,72 Surrey Ave.11,72
1,c0002,92 N. Bishop Ave.,92
2,c0003,9910 Golden Star St.,
3,c0003,102 Dunbar St.,
4,c0004,17 West Livingston Court,17


In [299]:
# 35 Write a Pandas program to check whether two given words present in a specified column of a given DataFrame. 
df35 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': ['9910 Surrey Ave.',
                '92 N. Bishop Ave.',
                '9910 Golden Star Ave.', 
                '102 Dunbar St.', 
                '17 West Livingston Court'
               ]
    })
df35['street'] = df35.address.str.extract(r'(.*\bWest.*|.*Ave\..*)')
df35

Unnamed: 0,company_code,address,street
0,c0001,9910 Surrey Ave.,9910 Surrey Ave.
1,c0002,92 N. Bishop Ave.,92 N. Bishop Ave.
2,c0003,9910 Golden Star Ave.,9910 Golden Star Ave.
3,c0003,102 Dunbar St.,
4,c0004,17 West Livingston Court,17 West Livingston Court
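
The alternation in the pattern above keeps a row when either word is present. If the exercise is read as requiring both words, two boolean masks from `str.contains` can be combined. A sketch with one hypothetical extra address that contains both words:

```python
import pandas as pd

addresses = pd.Series([
    '9910 Surrey Ave.',
    '92 N. Bishop Ave.',
    '17 West Livingston Court',
    '102 Dunbar St.',
    '5 West Ave.',   # hypothetical row containing both words
])

has_west = addresses.str.contains(r'\bWest\b')
has_ave = addresses.str.contains(r'\bAve\.')

either = has_west | has_ave   # same test as the alternation pattern
both = has_west & has_ave     # stricter reading: both words required

print(either.tolist())
print(both.tolist())
```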


In [319]:
# 36 Write a Pandas program to extract date (format: mm-dd-yyyy) from a given column of a given DataFrame.
df36 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})

def extract_date(string):
    # The source strings are dd/mm/yyyy; the group named 'date' holds the day.
    m = re.search(r'(?P<date>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})', string)
    # Reassemble the pieces as mm-dd-yyyy.
    return '{month}-{date}-{year}'.format(**m.groupdict())

df36['date'] = df36.date_of_sale.apply(extract_date)
df36

Unnamed: 0,company_code,date_of_sale,sale_amount,date
0,Abcd,12/05/2002,12348.5,05-12-2002
1,EFGF,16/02/1999,233331.2,02-16-1999
2,zefsalf,05/09/1998,22.5,09-05-1998
3,sdfslew,12/02/2022,2566552.0,02-12-2022
4,zekfsdf,15/09/1997,23.0,09-15-1997


In [399]:
# A more concise approach: swap day and month with regex backreferences.
df36.date_of_sale.replace(r'(\d+)/(\d+)/(\d+)', r'\2-\1-\3', regex=True)

0    05-12-2002
1    02-16-1999
2    09-05-1998
3    02-12-2022
4    09-15-1997
Name: date_of_sale, dtype: object

In [320]:
def find_valid_dates(dt):
    # Read the string as mm/dd/yyyy; match only if month is 01-12 and day is 01-31.
    result = re.findall(r'\b(1[0-2]|0[1-9])/(3[01]|[12][0-9]|0[1-9])/([0-9]{4})\b',dt)
    return result

df36['valid_dates'] = df36['date_of_sale'].apply(lambda dt : find_valid_dates(dt))
print("\nValid dates (format: mm-dd-yyyy):")
df36


Valid dates (format: mm-dd-yyyy):


Unnamed: 0,company_code,date_of_sale,sale_amount,date,valid_dates
0,Abcd,12/05/2002,12348.5,05-12-2002,"[(12, 05, 2002)]"
1,EFGF,16/02/1999,233331.2,02-16-1999,[]
2,zefsalf,05/09/1998,22.5,09-05-1998,"[(05, 09, 1998)]"
3,sdfslew,12/02/2022,2566552.0,02-12-2022,"[(12, 02, 2022)]"
4,zekfsdf,15/09/1997,23.0,09-15-1997,[]
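
The regular expressions above check only the shape of the string. A sketch of a parsing-based alternative with `pd.to_datetime`, assuming the source column is day/month/year (as the first solution does): it reformats the date and also rejects impossible ones.

```python
import pandas as pd

dates = pd.Series(['12/05/2002', '16/02/1999', '05/09/1998'])

# Parse as dd/mm/yyyy; errors='coerce' turns unparseable values into NaT.
parsed = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

# Render as mm-dd-yyyy.
us_style = parsed.dt.strftime('%m-%d-%Y')
print(us_style.tolist())
```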


In [358]:
# 37 Write a Pandas program to extract only words from a given column of a given DataFrame.
df37 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Ave.',
                '92 N_ Bishop Ave.',
                '9910 Golden Star Ave.', 
                '102 Dunbar St.', 
                '17 West Livingston Court'
               ]
})

df37['words'] = df37.address.str.findall(r'(\b[^\d\W]+\b)').str.join(' ')
df37

Unnamed: 0,company_code,date_of_sale,address,words
0,Abcd,12/05/2002,9910 Surrey Ave.,Surrey Ave
1,EFGF,16/02/1999,92 N_ Bishop Ave.,N_ Bishop Ave
2,zefsalf,05/09/1998,9910 Golden Star Ave.,Golden Star Ave
3,sdfslew,12/02/2022,102 Dunbar St.,Dunbar St
4,zekfsdf,15/09/1997,17 West Livingston Court,West Livingston Court


In [359]:
# 38 Write a Pandas program to extract the sentences 
# where a specific word is present in a given column of a given DataFrame.
df38 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Avenue',
                '92 N. Bishop Avenue',
                '9910 Golden Star Avenue', 
                '102 Dunbar St.', 
                '17 West Livingston Court'
               ]
})

df38['avenues'] = df38.address.str.extract(r'(.*Avenue)')
df38

Unnamed: 0,company_code,date_of_sale,address,avenues
0,Abcd,12/05/2002,9910 Surrey Avenue,9910 Surrey Avenue
1,EFGF,16/02/1999,92 N. Bishop Avenue,92 N. Bishop Avenue
2,zefsalf,05/09/1998,9910 Golden Star Avenue,9910 Golden Star Avenue
3,sdfslew,12/02/2022,102 Dunbar St.,
4,zekfsdf,15/09/1997,17 West Livingston Court,


In [367]:
# 39 Write a Pandas program to extract the unique sentences from a given column of a given DataFrame.
df39 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Avenue\n9910 Surrey Avenue',
                '92 N. Bishop Avenue',
                '9910 Golden Star Avenue', 
                '102 Dunbar St.\n102 Dunbar St.', 
                '17 West Livingston Court'
               ]
})
df39['address'] = df39.address.str.extract(r'(.*)')
df39

Unnamed: 0,company_code,date_of_sale,address
0,Abcd,12/05/2002,9910 Surrey Avenue
1,EFGF,16/02/1999,92 N. Bishop Avenue
2,zefsalf,05/09/1998,9910 Golden Star Avenue
3,sdfslew,12/02/2022,102 Dunbar St.
4,zekfsdf,15/09/1997,17 West Livingston Court
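
The pattern `(.*)` works here because `.` does not match a newline, so it simply keeps the first line of each cell. A sketch that removes duplicate lines explicitly, and so also handles cells whose lines differ (the third address is a hypothetical example, and `unique_lines` is an illustrative helper):

```python
import pandas as pd

addresses = pd.Series([
    '9910 Surrey Avenue\n9910 Surrey Avenue',
    '92 N. Bishop Avenue',
    '102 Dunbar St.\n17 West Court\n102 Dunbar St.',  # hypothetical mixed cell
])

def unique_lines(text):
    # dict.fromkeys keeps the first occurrence of each line, in order.
    return '\n'.join(dict.fromkeys(text.split('\n')))

deduped = addresses.apply(unique_lines)
print(deduped.tolist())
```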


In [373]:
# 40 Write a Pandas program to extract words starting with capital words from a given column of a given DataFrame.
df40 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Avenue great city',
                '92 N. Bishop Avenue fall in love',
                '9910 Golden Star Avenue yeah!', 
                '102 Dunbar St. hmm', 
                '17 West Livingston Court auch!'
               ]
})

df40['capitals'] = df40.address.str.findall(r'(\b[A-Z][^\d\W]+\b)').str.join(' ')
df40

Unnamed: 0,company_code,date_of_sale,address,capitals
0,Abcd,12/05/2002,9910 Surrey Avenue great city,Surrey Avenue
1,EFGF,16/02/1999,92 N. Bishop Avenue fall in love,Bishop Avenue
2,zefsalf,05/09/1998,9910 Golden Star Avenue yeah!,Golden Star Avenue
3,sdfslew,12/02/2022,102 Dunbar St. hmm,Dunbar St
4,zekfsdf,15/09/1997,17 West Livingston Court auch!,West Livingston Court


In [378]:
# 41 Write a Pandas program to remove the html tags within the specified column of a given DataFrame.
df41 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey <b>Avenue</b>',
                '92 N. Bishop Avenue',
                '9910 <br>Golden Star Avenue', 
                '102 Dunbar <i></i>St.', 
                '17 West Livingston Court'
               ]
})

df41['address'] = df41.address.str.replace(r'(<.*?>)', '', regex=True)
df41

Unnamed: 0,company_code,date_of_sale,address
0,Abcd,12/05/2002,9910 Surrey Avenue
1,EFGF,16/02/1999,92 N. Bishop Avenue
2,zefsalf,05/09/1998,9910 Golden Star Avenue
3,sdfslew,12/02/2022,102 Dunbar St.
4,zekfsdf,15/09/1997,17 West Livingston Court
