In [73]:
import pandas as pd
products = pd.read_csv('data/clean/products_cl.csv')

In [74]:
import re

# Introduction to Regex

## GOAL

Introduce the `re` (regular expression) library, show its main functions, and show how to filter text with regular expressions.

## DESCRIPTION

This workshop reviews the following functions of the `re` module:

* `findall()`
* `search()`
* `split()`
* `sub()`

and the following members of the Match object returned by `search()`:

* `.span()`
* `.string`
* `.group()`

Metacharacters: ` . ^ $ * + ? { } [ ] \ | ( )`

Special Sequences: `\A` `\b` `\d` `\s`

It also shows how to compile regular expressions so they can be reused.

More information is available in the [W3Schools Python RegEx tutorial](https://www.w3schools.com/python/python_regex.asp).

In [75]:
products.sample()

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,max_price_orderlines
4425,APP1580,"Apple MacBook Retina 12 ""Core m5 12GHz | 8GB R...",New MacBook Retina Display 12-inch Core 8GB RA...,1799.0,16.865.839,0,1282,


In [76]:
# extract a specific description
prod_descr = products.query('sku == "DLK0139"')['desc'].values[0]
prod_descr

'Full HD video surveillance camera with 180 degrees and night vision compatible HomeKit'

### `findall`

Returns a list containing all matches

In [77]:
# return all occurrences of the pattern in a string
re.findall('a', prod_descr)

['a', 'a', 'a', 'a', 'a']
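As a small complement, the sketch below shows two further cases of `findall`: a pattern longer than one character, and a pattern that never occurs. The description string is copied from the cell above; the pattern `"zoom"` is only an illustrative choice.

```python
import re

# Same description string as in the cells above.
prod_descr = ("Full HD video surveillance camera with 180 degrees "
              "and night vision compatible HomeKit")

# A multi-character pattern: every non-overlapping occurrence of "vi".
vi_matches = re.findall("vi", prod_descr)
print(vi_matches)    # ['vi', 'vi']

# A pattern that never occurs gives an empty list, not None.
no_matches = re.findall("zoom", prod_descr)
print(no_matches)    # []
```

Because the result is always a list, `len(...)` can be used directly to count matches.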

### `search`

Returns a Match object if there is a match anywhere in the string. If there is more than one match, only the first occurrence of the match will be returned.

Match objects have the following methods and attributes:
- `.span()` returns a tuple containing the start and end positions of the match.
- `.string` is an attribute holding the string passed into the function.
- `.group()` returns the part of the string where there was a match.

In [78]:
prod_descr

'Full HD video surveillance camera with 180 degrees and night vision compatible HomeKit'

In [79]:
match_obj = re.search('video', prod_descr)

In [80]:
match_obj.string

'Full HD video surveillance camera with 180 degrees and night vision compatible HomeKit'

In [81]:
match_obj.group()

'video'

In [82]:
match_obj.span()

(8, 13)
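When nothing matches, `re.search` returns `None` instead of a Match object, so calling `.span()` or `.group()` on the result would raise an `AttributeError`. A minimal sketch of a guard follows; the helper name `find_span` is hypothetical and not part of `re`.

```python
import re

prod_descr = ("Full HD video surveillance camera with 180 degrees "
              "and night vision compatible HomeKit")

def find_span(pattern, text):
    """Return (start, end) of the first match, or None when nothing matches."""
    match_obj = re.search(pattern, text)
    if match_obj is None:
        return None
    return match_obj.span()

print(find_span("video", prod_descr))   # (8, 13)
print(find_span("audio", prod_descr))   # None
```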

### `split`

Returns a list where the string has been split at each match

In [83]:
re.split(' and ', prod_descr) # ' and ' is removed from the list

['Full HD video surveillance camera with 180 degrees',
 'night vision compatible HomeKit']
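Unlike `str.split`, `re.split` accepts a regular expression as the separator. The sketch below splits on either of two connector words and then limits the number of splits with `maxsplit`. The choice of connector words is only illustrative; the group is written as non-capturing, `(?:...)`, because a capturing group would add the separators to the result.

```python
import re

prod_descr = ("Full HD video surveillance camera with 180 degrees "
              "and night vision compatible HomeKit")

# Split on "and" or "with" when surrounded by whitespace.
parts = re.split(r"\s(?:and|with)\s", prod_descr)
print(parts)
# ['Full HD video surveillance camera', '180 degrees',
#  'night vision compatible HomeKit']

# maxsplit=1 stops after the first separator.
first_only = re.split(r"\s(?:and|with)\s", prod_descr, maxsplit=1)
print(first_only)
# ['Full HD video surveillance camera',
#  '180 degrees and night vision compatible HomeKit']
```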

### `sub`

Replaces one or many matches with a string

In [84]:
dark_descr = re.sub("camera", "pool", prod_descr)
print(dark_descr)

Full HD video surveillance pool with 180 degrees and night vision compatible HomeKit
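Two optional features of `re.sub` are worth a quick sketch: the `count` argument, which limits how many matches are replaced, and passing a function instead of a replacement string. The replacement texts below are arbitrary examples.

```python
import re

prod_descr = ("Full HD video surveillance camera with 180 degrees "
              "and night vision compatible HomeKit")

# count=1 replaces only the first match.
first_e = re.sub("e", "E", prod_descr, count=1)
print(first_e)
# Full HD vidEo surveillance camera with 180 degrees and night vision compatible HomeKit

# A function receives each Match object and returns the replacement text.
marked = re.sub(r"\d+", lambda m: "<" + m.group() + ">", prod_descr)
print(marked)
# Full HD video surveillance camera with <180> degrees and night vision compatible HomeKit
```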


### METACHARACTERS


Some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning.

` . ^ $ * + ? { } [ ] \ | ( )`

 #### `[]` means set of characters:
 
 - `[abc]` will match any of the characters a, b, or c
 - `[a-c]` will do the same
 - `[a-z]` will match any lowercase letter

In [85]:
alphanumeric = "4298fsfsv012rvv21v9"

In [86]:
re.findall(r"[a-z]", alphanumeric)

['f', 's', 'f', 's', 'v', 'r', 'v', 'v', 'v']
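Sets can also be negated: a `^` as the first character inside the brackets matches any character that is NOT in the set. A short sketch on the same string:

```python
import re

alphanumeric = "4298fsfsv012rvv21v9"

# [0-9] matches any single digit.
digits = re.findall(r"[0-9]", alphanumeric)
print("".join(digits))        # 4298012219

# [^a-z] matches anything that is not a lowercase letter.
# In this string that happens to be exactly the digits.
non_letters = re.findall(r"[^a-z]", alphanumeric)
print(non_letters == digits)  # True
```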

`\` can help us escape special characters

In [87]:
alphanumeric_with_special = alphanumeric + "[a-z]"
print(alphanumeric_with_special)
# CHALLENGE: use \ to escape the square brackets
re.findall(r"\[a-z]", alphanumeric_with_special)

4298fsfsv012rvv21v9[a-z]


['[a-z]']
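When the literal text comes from a variable, escaping by hand is error-prone. `re.escape` adds the necessary backslashes. A minimal sketch, reusing the string from the cell above:

```python
import re

alphanumeric_with_special = "4298fsfsv012rvv21v9[a-z]"

# Text we want to find literally, including the square brackets.
literal = "[a-z]"

# re.escape returns a pattern that matches the text character by character.
escaped = re.escape(literal)
found = re.findall(escaped, alphanumeric_with_special)
print(found)    # ['[a-z]']
```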

#### Some special sequences:

- `\A` - Matches only at the beginning of the string
- `\b` - Matches at the beginning or at the end of a word
- `\d` - Matches any digit (0-9); `\D` matches any character that is NOT a digit
- `\s` - Matches any whitespace character; `\S` matches any character that is NOT whitespace

In [88]:
prod_descr

'Full HD video surveillance camera with 180 degrees and night vision compatible HomeKit'

In [89]:
# find all possible numbers
re.findall(r"\d", prod_descr)

['1', '8', '0']
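The other sequences from the list can be tried the same way. The sketch below uses raw strings (`r"..."`) throughout: in a normal Python string `"\b"` is a backspace character, so the raw prefix is needed for the pattern to reach `re` unchanged.

```python
import re

prod_descr = ("Full HD video surveillance camera with 180 degrees "
              "and night vision compatible HomeKit")

# \b: only the "v" at the start of a word (video, vision), not the one in "surveillance".
word_start_v = re.findall(r"\bv", prod_descr)
print(word_start_v)                 # ['v', 'v']

# \A: the pattern must sit at the very beginning of the string.
print(re.search(r"\AFull", prod_descr) is not None)   # True
print(re.search(r"\Avideo", prod_descr) is not None)  # False

# \s: one match per whitespace character.
spaces = re.findall(r"\s", prod_descr)
print(len(spaces))                  # 12
```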

### `.`	Any character (except newline character)

In [90]:
re.findall(r'..c', prod_descr)

['anc', 'e c', 'n c']

### `+` One or more occurrences

In [91]:
re.findall(r'e+', prod_descr)

['e', 'e', 'e', 'e', 'e', 'ee', 'e', 'e']

In [92]:
re.sub("e+", "__", prod_descr)

'Full HD vid__o surv__illanc__ cam__ra with 180 d__gr__s and night vision compatibl__ Hom__Kit'

### `{}`- Exactly the specified number of occurrences

In [93]:
re.findall(r"e{2}", prod_descr)

['ee']

In [94]:
re.sub(r"e{2}", "__", prod_descr)

'Full HD video surveillance camera with 180 degr__s and night vision compatible HomeKit'
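Braces also accept a range: `{m,n}` means between m and n occurrences, and `{m,}` means at least m. A small sketch on a made-up string of numbers (the string is hypothetical, not taken from the dataset):

```python
import re

numbers = "7 42 180 2024"

# Whole numbers with two or three digits; \b keeps "2024" from matching partially.
two_or_three = re.findall(r"\b\d{2,3}\b", numbers)
print(two_or_three)     # ['42', '180']

# Whole numbers with at least two digits.
two_or_more = re.findall(r"\b\d{2,}\b", numbers)
print(two_or_more)      # ['42', '180', '2024']
```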

### `^` Starts with

In [95]:
re.findall(r"^F", prod_descr)

['F']
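The counterpart of `^` is `$`, which anchors the pattern at the end of the string. A brief sketch comparing anchored and unanchored patterns:

```python
import re

prod_descr = ("Full HD video surveillance camera with 180 degrees "
              "and night vision compatible HomeKit")

# "$" requires the match to end where the string ends.
ends_with_kit = re.findall(r"Kit$", prod_descr)
print(ends_with_kit)        # ['Kit']

# "video" occurs in the string, but not at the start, so "^video" finds nothing.
starts_with_video = re.findall(r"^video", prod_descr)
print(starts_with_video)    # []
```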

#### How do we apply it to a whole DataFrame?

In [96]:
products.loc[products['name'].str.contains(r'^Fit')]

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,max_price_orderlines
123,FIT0009,Fitbit Aria scale smart white,smart scale with WiFi connection.,119.99,1.159.906,0,11905404,
124,FIT0010,Fitbit Aria scale smart black,smart scale with WiFi connection.,119.99,1.159.906,0,11905404,
270,FIT0013,Fitbit ZIP monitor green activity,Activity Monitor compact and lightweight.,59.99,548.977,0,11905404,
699,FIT0023,Fitbit Flex Bracelet navy activity monitor,Control activity bracelet with two interchange...,99.99,899.877,0,11905404,
1454,FIT0024,Fitbit Charge Bracelet Black Size L,Bracelet size L activity and sleep monitor wor...,129.95,1.198.989,0,11905404,
1510,FIT0026,Fitbit Charge HR Bracelet Black Size L,Bracelet sport and activity monitors sleep.,149.95,1.419.899,0,11905404,
2178,FIT0028,Fitbit Surge Figured Black Clock,Smartwatch with monitoring activity and sleep ...,249.95,229.9,0,11905404,
2181,FIT0029,Fitbit Surge Black Clock Small size,Smartwatch with monitoring activity and sleep ...,249.95,1.999.888,0,11905404,
9533,FIT0062,Fitbit Smartwatch Ionic Gray,Fitbit is the sports Smartwatch Ionic waterpro...,349.95,3.399.894,1,11905404,
9534,FIT0064,Fitbit Orange Blue Ionic Smartwatch,Fitbit is the sports Smartwatch Ionic waterpro...,349.95,3.399.894,1,11905404,


Learn more about using regular expressions with pandas:

* https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/
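As a rough sketch of two common pandas string methods that take regular expressions, the example below uses a tiny hypothetical DataFrame (`toy`, not the workshop data). `str.contains` builds a boolean mask, and `str.extract` pulls out the text captured by a group.

```python
import pandas as pd

# Hypothetical miniature of the products table, including one missing name.
toy = pd.DataFrame({
    "sku": ["FIT0009", "APP0001", "XXX0001", "FIT0013"],
    "name": ["Fitbit Aria scale smart white", "Apple fitness band",
             None, "Fitbit ZIP monitor"],
})

# na=False treats missing names as "no match", so the mask is purely boolean.
starts_with_fit = toy["name"].str.contains(r"^Fit", na=False)
fit_skus = toy.loc[starts_with_fit, "sku"].tolist()
print(fit_skus)        # ['FIT0009', 'FIT0013']

# str.extract returns what the capture group matched (NaN where nothing matches).
toy["sku_prefix"] = toy["sku"].str.extract(r"^([A-Z]+)", expand=False)
prefixes = toy["sku_prefix"].tolist()
print(prefixes)        # ['FIT', 'APP', 'XXX', 'FIT']
```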

### `*`	Zero or more occurrences

In [97]:
similar_words = ["hey", "hay", "how", "h i j k", "h", "ha", "oops"]

In [98]:
# "h.*" matches an "h" followed by zero or more of any character
for word in similar_words:
    print(re.findall("h.*", word))

['hey']
['hay']
['how']
['h i j k']
['h']
['ha']
[]


In [99]:
print(prod_descr)
re.findall(r"vi*\S", prod_descr)

Full HD video surveillance camera with 180 degrees and night vision compatible HomeKit


['vid', 've', 'vis']

In [100]:
# Another way to show it
re.findall(r"vi*\w+", prod_descr)
# \w: matches any word character
#    (letters a-z and A-Z, digits 0-9, and the underscore _ character)
# +: One or more occurrences

['video', 'veillance', 'vision']
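To make the difference between the quantifiers concrete, the sketch below applies `*` (zero or more), `+` (one or more) and `?` (zero or one) to the same made-up string:

```python
import re

text = "h ha haa"

# "a*": the "a" may be missing or repeated any number of times.
star = re.findall(r"ha*", text)
print(star)       # ['h', 'ha', 'haa']

# "a+": at least one "a" is required, so the lone "h" is skipped.
plus = re.findall(r"ha+", text)
print(plus)       # ['ha', 'haa']

# "a?": at most one "a" is consumed, so "haa" only yields "ha".
optional = re.findall(r"ha?", text)
print(optional)   # ['h', 'ha', 'ha']
```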

### Examples with DataFrames

In [101]:
# filter all the names that contain "body" or "Body"
# (no parentheses: a capture group inside str.contains triggers a pandas UserWarning)
(
products
    .loc[products['name'].str.contains(r'body|Body')]
    .sort_values('name').head(5))


Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,max_price_orderlines
9630,BOD0009,BodyGuardz TrainR Pro 8/7/6 iPhone Case with A...,Advanced holster included sports armband for i...,39.99,279.897,1,5405,
10293,BOD0007,BodyGuardz TrainR Pro X iPhone Case with Armba...,Advanced holster included sports armband for i...,39.99,279.897,1,11865403,
5225,GTE0075,G-Technology G-DOCK ev Body only USB3.0,Housing with connection USB3.0 compatible with...,107.99,851.901,0,11935397,
6017,LMP0023,"LMP battery MacBook Pro 17 ""Unibody Early / Mi...",replacement battery compatible with MacBook Pr...,129.99,1.299.903,1,13005399,
4621,NTE0104,NewerTech NuPower 95 W Battery for MacBook Pro...,internal battery MacBook Pro 17-inch Unibody 2011,131.99,1.090.004,1,10142,


In [102]:
# CHALLENGE: how can you shorten the previous regular expression?
# a character set avoids the capture group (and the pandas UserWarning it triggers)
(
products
    .loc[products['name'].str.contains(r'[bB]ody')]
    .sort_values('name').head(5))


Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,max_price_orderlines
9630,BOD0009,BodyGuardz TrainR Pro 8/7/6 iPhone Case with A...,Advanced holster included sports armband for i...,39.99,279.897,1,5405,
10293,BOD0007,BodyGuardz TrainR Pro X iPhone Case with Armba...,Advanced holster included sports armband for i...,39.99,279.897,1,11865403,
5225,GTE0075,G-Technology G-DOCK ev Body only USB3.0,Housing with connection USB3.0 compatible with...,107.99,851.901,0,11935397,
6017,LMP0023,"LMP battery MacBook Pro 17 ""Unibody Early / Mi...",replacement battery compatible with MacBook Pr...,129.99,1.299.903,1,13005399,
4621,NTE0104,NewerTech NuPower 95 W Battery for MacBook Pro...,internal battery MacBook Pro 17-inch Unibody 2011,131.99,1.090.004,1,10142,
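Another possible answer to the challenge is to ignore case altogether with the `re.IGNORECASE` flag. This is broader than `[bB]ody`, because it also accepts spellings such as "BODY". The sketch below compares both on a short hypothetical list of names, using plain `re` rather than pandas.

```python
import re

# Hypothetical product names, loosely modelled on the table above.
names = [
    "BodyGuardz TrainR Pro",
    "G-DOCK ev Body only",
    "MacBook Unibody battery",
    "BODY strap",
    "Fitbit Aria scale",
]

# Character set: accepts "body" and "Body" only.
with_set = [n for n in names if re.search(r"[bB]ody", n)]
print(with_set)
# ['BodyGuardz TrainR Pro', 'G-DOCK ev Body only', 'MacBook Unibody battery']

# IGNORECASE: accepts any mix of upper and lower case.
ignoring_case = [n for n in names if re.search(r"body", n, flags=re.IGNORECASE)]
print(ignoring_case)
# ['BodyGuardz TrainR Pro', 'G-DOCK ev Body only', 'MacBook Unibody battery', 'BODY strap']
```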


### Compile regular expressions

In [103]:
import numpy as np

# Order matters: a row keeps the first category whose pattern matches,
# so entries listed earlier take priority over later ones.
# Top-level alternations need no parentheses; capture groups would only
# trigger a pandas UserWarning in str.contains.
regexp_dict = {
    'ipod':'^.{0,7}apple ipod',
    'case':'case|funda|housing|casing|folder',
    'cable':'cable|connector|Lightning to USB|Wall socket|power strip',
    'battery':'battery',
    'headset':'headset|headphones',
    'mouse':'mouse|trackpad',
    'stand':'stand|support',
    'protect':'protect|cover|sleeve|Screensaver|shell',
    'watch':'^.{0,6}apple watch|smartwatch|smart watch',
    'camera':'camera',
    'refurbished':'refurbished|reconditioned|like new',
    'strap':'strap|armband|belt|bracelet'
}

temp = products.copy().assign(category = 'unknown')

for label, pattern in regexp_dict.items():
    # compile once per category; IGNORECASE makes the match case-insensitive
    regexp = re.compile(pattern, flags=re.IGNORECASE)
    temp = (
    temp
        .assign(
            category = lambda x: np.where(
                (x['desc'].str.contains(regexp, regex=True)) &
                (x['category'] == 'unknown'), label, x['category'])))

temp['category'].value_counts()


unknown        6439
case           1377
protect         808
cable           512
refurbished     398
battery         223
stand           221
strap           191
headset         140
watch           126
camera          104
mouse            40
Name: category, dtype: int64
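The same "first matching rule wins" idea can be sketched without pandas, which makes the role of `re.compile` easier to see: each pattern is compiled once and the compiled object is reused for every description. The shortened rule set, the helper `categorize` and the sample descriptions are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical, shortened rule set; dictionary order defines the priority.
rules = {
    "case": r"case|housing",
    "cable": r"cable|connector",
    "camera": r"camera",
}

# Compile every pattern once so it can be reused for each description.
compiled_rules = {
    label: re.compile(pattern, flags=re.IGNORECASE)
    for label, pattern in rules.items()
}

def categorize(desc):
    """Return the label of the first rule that matches, or 'unknown'."""
    for label, regexp in compiled_rules.items():
        if regexp.search(desc):
            return label
    return "unknown"

descriptions = [
    "Full HD video surveillance camera with 180 degrees",
    "Housing with connection USB3.0",
    "Lightning cable with camera connector",
    "smart scale with WiFi connection.",
]
categories = [categorize(d) for d in descriptions]
print(categories)   # ['camera', 'case', 'cable', 'unknown']
```

The third description matches both the `cable` and the `camera` rule; it is labelled `cable` because that rule comes first.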

In [124]:
temp.groupby('category')['price'].describe()

category,count,mean,std,min,25%,50%,75%,max
battery,213.0,116.001268,140.356193,3.99,49.95,84.95,119.99,1499.0
cable,506.0,118.612233,315.341661,3.99,24.95,34.99,59.9975,2725.95
camera,95.0,257.038,193.62207,12.9,134.99,229.0,299.0,1049.99
case,1330.0,78.262429,152.838093,3.49,34.95,49.9,74.4225,2660.99
headset,137.0,132.933796,85.531812,9.99,79.99,119.95,180.0,379.99
mouse,38.0,197.563947,480.055631,4.99,29.0,48.0,101.0,2189.0
protect,793.0,72.783871,167.574306,7.9,24.99,34.95,49.95,1453.95
refurbished,391.0,858.134834,765.773509,8.95,289.0,694.0,1199.0,3949.0
stand,219.0,98.70242,145.816559,9.95,35.945,50.0,79.995,1191.99
strap,189.0,64.48254,50.34649,14.99,39.95,59.0,59.95,369.0


In [105]:
temp.query('in_stock == 1')['price'].sum()

416392.9199999999

In [106]:
temp.query('in_stock == 0')['price'].sum()

6251730.99

In [107]:
pd.set_option('display.max_rows', 1000)
pd.set_option("display.max_colwidth", 100)


In [145]:
products[products['price'] > 1000]

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,max_price_orderlines
51,APP0344,"Apple Thunderbolt Display 27 ""Monitor Mac",Monitor Display 27-inch Apple Thunderbolt (MC914ZM / A).,1149.00,10.449.923,0,1296,
100,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM | 500GB HDD",MacBook Pro laptop 133 inches (MD101Y / A).,1199.00,11.455.917,0,1282,
101,PAC0508,Apple MacBook Pro 133 '' 25GHz | 16GB RAM | 1TB Fusion,Apple MacBook Pro Fusion Drive 16GB 2 internal and external SuperDrive disks.,1919.00,16.999.895,0,1282,
102,PAC0507,Apple MacBook Pro 133 '' 25Ghz | 16GB RAM | Fusion 740GB,Apple MacBook Pro Fusion Drive 16GB 2 internal and external SuperDrive disks.,1639.00,15.989.896,0,1282,
103,PAC0515,"Apple MacBook Pro 133 ""i7 29GHz | RAM 16GB | 500GB HDD | SSD 240GB",Apple MacBook Pro 133 inches (MD101Y / A) and SSD Processor RAM expansion.,2039.00,20.379.897,0,1282,
...,...,...,...,...,...,...,...,...
10421,APP2067-A,"Open - Apple MacBook Air 13 ""1.8GHz dual-core Intel Core i5 256GB",Reconditioned computer MacBook Air 13 inch i5 18GHz 8GB RAM and 256GB SSD,1355.59,11.342.684,0,"2,17E+11",
10447,DLL0053,"Dell UltraSharp UP2718Q Monitor 27 ""4K HDR",Monitor 27 inch 4K 4K and 6ms response height adjustable and pivotable.,1869.99,15.699.895,0,1296,
10450,PAC2510,"Apple iMac 27 ""Core i5 3.3GHz Retina 5K | 16GB RAM | 2TB Fusion | Certified Apple (CPO)",27-inch iMac 5K Retina refitted with 16GB of RAM and 2TB capacity with Apple Certification and g...,2869.00,20.990.045,0,"5,74E+15",
10451,AP20461,"Apple MacBook Pro 15 ""Core i7 Touch Bar 26GHz | RAM 16GB | 256GB PCIe SSD | Radeon Pro 450 2GB S...",Refurbished MacBook Pro and 15-inch Apple certified Touch Bar 26GHz Core i7 16GB RAM 256GB PCIe ...,2699.00,21.989.935,1,"1,02E+12",


In [144]:
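# na=False treats missing descriptions as non-matches (replaces the `== True` comparison).
# Note the trailing space in the pattern: '^Apple MacBook *' would make that space optional.
products[products['desc'].str.contains(r'^Apple MacBook ', na=False)]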

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,max_price_orderlines
101,PAC0508,Apple MacBook Pro 133 '' 25GHz | 16GB RAM | 1TB Fusion,Apple MacBook Pro Fusion Drive 16GB 2 internal and external SuperDrive disks.,1919.0,16.999.895,0,1282,
102,PAC0507,Apple MacBook Pro 133 '' 25Ghz | 16GB RAM | Fusion 740GB,Apple MacBook Pro Fusion Drive 16GB 2 internal and external SuperDrive disks.,1639.0,15.989.896,0,1282,
103,PAC0515,"Apple MacBook Pro 133 ""i7 29GHz | RAM 16GB | 500GB HDD | SSD 240GB",Apple MacBook Pro 133 inches (MD101Y / A) and SSD Processor RAM expansion.,2039.0,20.379.897,0,1282,
104,PAC0510,"Apple MacBook Pro 133 ""i7 29GHz | RAM 16GB | 740GB Fusion",Apple MacBook Pro Fusion Drive 16GB 2 internal and external SuperDrive disks.,2039.0,20.379.897,0,1282,
105,PAC0185,"Apple MacBook Pro 133 ""i5 25GHz | RAM 16GB | 275GB SSD",Apple MacBook Pro 133 inches (MD101Y / A) with 275GB SSD.,1639.0,1469,0,1282,
106,PAC0186,"Apple MacBook Pro 133 ""25GHz | RAM 16GB | 500GB HDD | SSD 500GB",Apple MacBook Pro 133 inches (MD101Y / A) with two internal and external disks SuperDrive.,1919.0,16.999.895,0,1282,
107,PAC0174,Apple MacBook Pro 133 '' 25GHz | 8GB RAM | Fusion 628GB,Apple MacBook Pro Fusion Drive 8GB internal and two external drives SuperDrive.,1613.99,15.389.905,0,1282,
108,PAC0178,Apple MacBook Pro 133 '' 25GHz | 16GB RAM | Fusion 628GB,Apple MacBook Pro Fusion Drive 16GB 2 internal and external SuperDrive disks.,1733.99,15.699.895,0,1282,
109,APP0574,Apple MacBook Pro 133 '' i7 29GHz | 4GB RAM | 500GB HDD,Apple MacBook Pro 133 inches (MD101Y / A) with extension Processor.,1379.0,13.855.843,0,1282,
110,PAC0318,"Apple MacBook Pro 133 ""i7 29GHz | RAM 16GB | 500GB HDD",Apple MacBook Pro 133 inches (MD101Y / A) with processor and RAM expansion.,1619.0,16.189.897,0,1282,
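A common pitfall with this kind of filter is treating `*` as a wildcard, as in a file-name pattern. In a regular expression, `*` only repeats the element before it, so `' *'` means "zero or more spaces". The sketch below illustrates the difference on a few hypothetical names.

```python
import re

# Hypothetical names, chosen to show the effect of the trailing " *".
names = ["Apple MacBook Pro 13", "Apple MacBookAir", "Apple MacBook"]

# " *" makes the space optional, so every name matches.
optional_space = [n for n in names if re.search(r"^Apple MacBook *", n)]
print(optional_space)
# ['Apple MacBook Pro 13', 'Apple MacBookAir', 'Apple MacBook']

# A plain trailing space is required, so only the first name matches.
required_space = [n for n in names if re.search(r"^Apple MacBook ", n)]
print(required_space)
# ['Apple MacBook Pro 13']
```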
