# Working With String Values

### Import library and data

In [1]:
import pandas as pd
products = pd.read_excel('AW-Sales-2016.xlsx', sheet_name='Products',
                         index_col='ProductKey').drop('ProductSubcategoryKey', axis=1)
products.head()

ProductKey,ProductSKU,ProductName,ModelName,ProductDescription,ProductColor,ProductSize,ProductStyle,ProductCost,ProductPrice
214,HL-U509-R,"Sport-100 Helmet, Red",Sport-100,"Universal fit, well-vented, lightweight , snap...",Red,0,0,13.0863,34.99
215,HL-U509,"Sport-100 Helmet, Black",Sport-100,"Universal fit, well-vented, lightweight , snap...",Black,0,0,12.0278,33.6442
218,SO-B909-M,"Mountain Bike Socks, M",Mountain Bike Socks,Combination of natural and synthetic fibers st...,White,M,U,3.3963,9.5
219,SO-B909-L,"Mountain Bike Socks, L",Mountain Bike Socks,Combination of natural and synthetic fibers st...,White,L,U,3.3963,9.5
220,HL-U509-B,"Sport-100 Helmet, Blue",Sport-100,"Universal fit, well-vented, lightweight , snap...",Blue,0,0,12.0278,33.6442


### Slicing String Values with .str
Using regular square brackets on the Series makes Pandas slice rows, not characters

In [31]:
products['ProductName'][1:5]

ProductKey
215    Sport-100 Helmet, Black
218     Mountain Bike Socks, M
219     Mountain Bike Socks, L
220     Sport-100 Helmet, Blue
Name: ProductName, dtype: object

We can use the ***str*** accessor to make clear that we mean the character positions within each string

In [33]:
products['ProductName'].str[1:5].head()

ProductKey
214    port
215    port
218    ount
219    ount
220    port
Name: ProductName, dtype: object
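
The difference between row slicing and character slicing can be reproduced without the Excel file. The sketch below assumes a small hand-built Series (`names`, a hypothetical stand-in for the `ProductName` column) and uses `.iloc` to make the positional row slice explicit.

```python
import pandas as pd

# Hypothetical sample mirroring a few rows of the ProductName column
names = pd.Series(
    ['Sport-100 Helmet, Red', 'Sport-100 Helmet, Black', 'Mountain Bike Socks, M'],
    index=pd.Index([214, 215, 218], name='ProductKey'),
)

row_slice = names.iloc[1:3]   # positional row slice: keeps rows 215 and 218
char_slice = names.str[1:5]   # character slice applied to every string
first_char = names.str[0]     # a single character per row
last_three = names.str[-3:]   # negative positions count from the end of each string

print(row_slice)
print(char_slice)
```

The same slice syntax therefore means two different things depending on whether it is applied to the Series itself or to its `str` accessor.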

## Using The str Sub-Library
Pandas shares many of its built-in string functions with Python's own string methods.<br> To access them, we go through the ***str*** accessor and then call the method we need.
### upper() / lower() / title()

In [29]:
products['ProductColor'].str.upper().head()

ProductKey
214      RED
215    BLACK
218    WHITE
219    WHITE
220     BLUE
Name: ProductColor, dtype: object

In [30]:
products['ProductSize'].str.lower().head()

ProductKey
214    NaN
215    NaN
218      m
219      l
220    NaN
Name: ProductSize, dtype: object
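
The `NaN` values above appear because those rows hold numbers rather than strings, and `str` methods return `NaN` for non-string entries. A minimal sketch, assuming a hypothetical mixed column (`sizes`) that resembles `ProductSize`:

```python
import pandas as pd

# Hypothetical ProductSize-like column mixing integers and strings
sizes = pd.Series([0, 0, 'M', 'L', 62], dtype=object)

lowered = sizes.str.lower()                  # non-string entries become NaN
lowered_all = sizes.astype(str).str.lower()  # convert to strings first to keep them

print(lowered)
print(lowered_all)
```

Whether converting with `astype(str)` is appropriate depends on whether the numeric entries are meaningful values or placeholders.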

### replace()
The ***replace*** method lets us replace any substring within the original string value with another set of characters

In [10]:
products['ProductSKU'].str.replace('U509', '***').head()

ProductKey
214     HL-***-R
215       HL-***
218    SO-B909-M
219    SO-B909-L
220     HL-***-B
Name: ProductSKU, dtype: object

We can use ***replace*** to remove parts of the string entirely

In [12]:
products['ProductSKU'].str.replace('-','').head()

ProductKey
214    HLU509R
215     HLU509
218    SOB909M
219    SOB909L
220    HLU509B
Name: ProductSKU, dtype: object
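
`str.replace` can also take a regular expression. The default value of its `regex` parameter has changed between pandas versions, so passing it explicitly is the safer habit. The sketch below uses a hypothetical `skus` Series built from the values shown above.

```python
import pandas as pd

# Hypothetical sample of ProductSKU values
skus = pd.Series(['HL-U509-R', 'HL-U509', 'SO-B909-M'])

# Literal replacement: '-' is treated as a plain character
literal = skus.str.replace('-', '', regex=False)

# Regex replacement: every run of digits is replaced with '#'
digits_masked = skus.str.replace(r'\d+', '#', regex=True)

print(literal.tolist())
print(digits_masked.tolist())
```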

### split()
The ***split*** method returns a list containing the parts of the original string after it is split by the specified delimiter

In [16]:
products['ProductSKU'].str.split('-').head()

ProductKey
214    [HL, U509, R]
215       [HL, U509]
218    [SO, B909, M]
219    [SO, B909, L]
220    [HL, U509, B]
Name: ProductSKU, dtype: object

When no delimiter is given, the string is split on whitespace

In [20]:
products['ProductName'].str.split().head()

ProductKey
214      [Sport-100, Helmet,, Red]
215    [Sport-100, Helmet,, Black]
218    [Mountain, Bike, Socks,, M]
219    [Mountain, Bike, Socks,, L]
220     [Sport-100, Helmet,, Blue]
Name: ProductName, dtype: object

We can access specific positions within each list using the ***get*** method

In [37]:
products['ProductName'].str.split().str.get(1).head()

ProductKey
214    Helmet,
215    Helmet,
218       Bike
219       Bike
220    Helmet,
Name: ProductName, dtype: object

We can achieve the same by indexing on the ***str*** accessor, as previously demonstrated

In [39]:
products['ProductName'].str.split().str[1].head()

ProductKey
214    Helmet,
215    Helmet,
218       Bike
219       Bike
220    Helmet,
Name: ProductName, dtype: object

We can use the ***expand*** parameter to create a dataframe from the split values

In [48]:
products['ProductName'].str.split(expand = True).head()

ProductKey,0,1,2,3,4,5
214,Sport-100,"Helmet,",Red,,,
215,Sport-100,"Helmet,",Black,,,
218,Mountain,Bike,"Socks,",M,,
219,Mountain,Bike,"Socks,",L,,
220,Sport-100,"Helmet,",Blue,,,


Some product names are longer, and because of that we get a lot of blank values. We can limit the number of columns we get back by specifying the maximum number of splits to perform with the ***n*** parameter

In [49]:
products['ProductName'].str.split(expand = True, n = 2).head()

ProductKey,0,1,2
214,Sport-100,"Helmet,",Red
215,Sport-100,"Helmet,",Black
218,Mountain,Bike,"Socks, M"
219,Mountain,Bike,"Socks, L"
220,Sport-100,"Helmet,",Blue
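
When the interesting part sits at the end of the string, splitting from the right can be more convenient. The following sketch uses `str.rsplit` on a hypothetical sample to separate the trailing variant (color or size) from the rest of the name; the column names `BaseName` and `Variant` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of ProductName values
names = pd.Series(['Sport-100 Helmet, Red', 'Mountain Bike Socks, M', 'AWC Logo Cap'])

# Split once, starting from the right, on the literal ', ' delimiter
parts = names.str.rsplit(', ', n=1, expand=True)
parts.columns = ['BaseName', 'Variant']  # illustrative column names

print(parts)
```

Names without the delimiter, such as `AWC Logo Cap`, keep their full text in the first column and get a missing value in the second.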


### rstrip() / lstrip() / strip()
These methods remove whitespace (or a given set of characters) from the right / left / both ends of each string, respectively.

In [55]:
products['ProductName'].str.strip().head()

ProductKey
214      Sport-100 Helmet, Red
215    Sport-100 Helmet, Black
218     Mountain Bike Socks, M
219     Mountain Bike Socks, L
220     Sport-100 Helmet, Blue
Name: ProductName, dtype: object
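
The product names above contain no surrounding whitespace, so the output looks unchanged. A small sketch with deliberately messy, hypothetical values makes the effect visible:

```python
import pandas as pd

# Hypothetical values with padding, a trailing newline and dashes
raw = pd.Series(['  Red ', 'Black\n', '--Blue--'])

stripped = raw.str.strip()       # removes whitespace (including newlines) on both ends
left_only = raw.str.lstrip()     # removes whitespace on the left end only
no_dashes = raw.str.strip('-')   # removes only the given characters, not whitespace

print(stripped.tolist())
print(left_only.tolist())
print(no_dashes.tolist())
```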

### len()
When used on a DataFrame or Series, the built-in ***len*** function returns the number of records in it. <br>
We can use the ***str.len*** method to get the number of characters in each string

In [76]:
print('Total rows: ', len(products['ProductName']))
products['ProductName'].to_frame().assign(String_Length = products['ProductName'].str.len()).head(3)

Total rows:  293


ProductKey,ProductName,String_Length
214,"Sport-100 Helmet, Red",21
215,"Sport-100 Helmet, Black",23
218,"Mountain Bike Socks, M",22


### contains()
To check whether a substring occurs within each string, we can use the ***contains*** method. <br> It returns a boolean Series (True if the pattern is found in the value, False otherwise)<br>
***contains*** treats the pattern as a regular expression by default

In [79]:
# The in operator does not help here: on a Series it checks the index labels, not the string values
'Helmet' in products['ProductName']

False
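
On a Series, `in` tests membership in the index labels, which is why the check above returns `False`. A short sketch on a hypothetical two-row sample shows the label check next to a value-based alternative:

```python
import pandas as pd

# Hypothetical sample indexed by ProductKey
names = pd.Series(['Sport-100 Helmet, Red', 'Mountain Bike Socks, M'], index=[214, 218])

label_check = 214 in names        # True: 214 is an index label
value_check = 'Helmet' in names   # False: 'Helmet' is not an index label

# To ask whether any value contains the substring, go through str.contains
any_helmet = bool(names.str.contains('Helmet').any())

print(label_check, value_check, any_helmet)
```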

In [96]:
products['ProductName'].to_frame().assign(Is_Helmet = products['ProductName'].str.contains('Helmet')).head()

ProductKey,ProductName,Is_Helmet
214,"Sport-100 Helmet, Red",True
215,"Sport-100 Helmet, Black",True
218,"Mountain Bike Socks, M",False
219,"Mountain Bike Socks, L",False
220,"Sport-100 Helmet, Blue",True


We can use ***contains***'s output to effectively filter our dataframe

In [98]:
products[products['ProductName'].str.contains('Helmet')]

ProductKey,ProductSKU,ProductName,ModelName,ProductDescription,ProductColor,ProductSize,ProductStyle,ProductCost,ProductPrice
214,HL-U509-R,"Sport-100 Helmet, Red",Sport-100,"Universal fit, well-vented, lightweight , snap...",Red,0,0,13.0863,34.99
215,HL-U509,"Sport-100 Helmet, Black",Sport-100,"Universal fit, well-vented, lightweight , snap...",Black,0,0,12.0278,33.6442
220,HL-U509-B,"Sport-100 Helmet, Blue",Sport-100,"Universal fit, well-vented, lightweight , snap...",Blue,0,0,12.0278,33.6442


Always keep in mind that Python is case-sensitive. Pandas is no different in this regard, so when in doubt about how values may appear, we can normalize the casing of the values before performing our checks

In [47]:
products[products['ProductName'].str.lower().str.contains('hl')].head()

ProductKey,ProductSKU,ProductName,ModelName,ProductDescription,ProductColor,ProductSize,ProductStyle,ProductCost,ProductPrice
238,FR-R92R-62,"HL Road Frame - Red, 62",HL Road Frame,Our lightest and best quality aluminum frame m...,Red,62,U,747.9682,1263.4598
241,FR-R92R-44,"HL Road Frame - Red, 44",HL Road Frame,Our lightest and best quality aluminum frame m...,Red,44,U,747.9682,1263.4598
244,FR-R92R-48,"HL Road Frame - Red, 48",HL Road Frame,Our lightest and best quality aluminum frame m...,Red,48,U,747.9682,1263.4598
247,FR-R92R-52,"HL Road Frame - Red, 52",HL Road Frame,Our lightest and best quality aluminum frame m...,Red,52,U,747.9682,1263.4598
250,FR-R92R-56,"HL Road Frame - Red, 56",HL Road Frame,Our lightest and best quality aluminum frame m...,Red,56,U,747.9682,1263.4598
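
As an alternative to lowering the values first, `str.contains` accepts a `case` parameter, and its `na` parameter controls what missing values produce, which matters when the result is used as a filter. The sketch below is one possible variant on a hypothetical sample that includes a missing name.

```python
import pandas as pd

# Hypothetical sample, including a missing value
names = pd.Series(['HL Road Frame - Red, 62', 'Sport-100 Helmet, Red', None, 'ML Road Frame'])

# Case-insensitive match; missing values count as "no match"
hl_mask = names.str.contains('hl', case=False, na=False)

# The pattern is a regular expression by default, so alternation works
frame_or_helmet = names.str.contains(r'Frame|Helmet', na=False)

print(names[hl_mask])
print(frame_or_helmet.tolist())
```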


### slice()
We can use ***slice*** to select part of a string by defining a start position, a stop position and even a step.<br>This is equivalent to traditional slicing in Python

In [60]:
# Get only the 4 characters that come after the 2-letter prefix and its dash
print(products['ProductSKU'].head(3))
print(products['ProductSKU'].str.slice(3,7).head(3))

ProductKey
214    HL-U509-R
215      HL-U509
218    SO-B909-M
Name: ProductSKU, dtype: object
ProductKey
214    U509
215    U509
218    B909
Name: ProductSKU, dtype: object
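
To illustrate the equivalence with bracket slicing and the optional step, here is a minimal sketch on a hypothetical `skus` sample:

```python
import pandas as pd

# Hypothetical sample of ProductSKU values
skus = pd.Series(['HL-U509-R', 'HL-U509', 'SO-B909-M'])

with_method = skus.str.slice(3, 7)         # start=3, stop=7
with_brackets = skus.str[3:7]              # same result using bracket slicing
every_other = skus.str.slice(0, None, 2)   # step=2: every second character

print(with_method.tolist())
print(every_other.tolist())
```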


### slice_replace()
***slice_replace*** is a combination of ***slice*** and ***replace***, allowing us to replace a segment of our strings with another string

In [48]:
products['ProductSKU'].str.slice_replace(3,7,'**').head()

ProductKey
214    HL-**-R
215      HL-**
218    SO-**-M
219    SO-**-L
220    HL-**-B
Name: ProductSKU, dtype: object

### count()
Use the ***count*** method to count the number of appearances of a specific string (or regex pattern) within each string

In [57]:
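# Note: head() keeps only the first 5 counts; assign() aligns on the index, so all other rows
# become NaN and the column is stored as float, which is why the counts display as 2.0, 1.0
new_col = products['ProductSKU'].str.count('-').head()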
products['ProductSKU'].to_frame().assign(number_of_dashes = new_col).head()

ProductKey,ProductSKU,number_of_dashes
214,HL-U509-R,2.0
215,HL-U509,1.0
218,SO-B909-M,2.0
219,SO-B909-L,2.0
220,HL-U509-B,2.0
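
Because `str.count` interprets its pattern as a regular expression, characters with a special regex meaning need escaping. The dash used above is not special, but a dot is. A sketch on hypothetical price-like strings:

```python
import re
import pandas as pd

# Hypothetical strings containing dots
prices = pd.Series(['34.99', '9.5', '1.263.45'])

any_char = prices.str.count('.')                 # '.' matches any character
literal_dot = prices.str.count(re.escape('.'))   # escaped: counts only real dots

print(any_char.tolist())
print(literal_dot.tolist())
```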


Again, case should be taken into account if we need to count both upper- and lower-case letters

In [65]:
new_col = products['ProductName'].str.lower().str.count('s')
products[4:8]['ProductName'].to_frame().assign(number_of_s = new_col).head()

ProductKey,ProductName,number_of_s
220,"Sport-100 Helmet, Blue",1
223,AWC Logo Cap,0
226,"Long-Sleeve Logo Jersey, S",3
229,"Long-Sleeve Logo Jersey, M",2
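
Another way to ignore case, without lowering the values first, is to pass a regex flag to `str.count`. A minimal sketch on a hypothetical sample taken from the names above:

```python
import re
import pandas as pd

# Hypothetical sample of ProductName values
names = pd.Series(['Sport-100 Helmet, Blue', 'AWC Logo Cap', 'Long-Sleeve Logo Jersey, S'])

case_sensitive = names.str.count('s')                         # lower-case 's' only
ignore_case = names.str.count('s', flags=re.IGNORECASE)       # counts 's' and 'S'

print(case_sensitive.tolist())
print(ignore_case.tolist())
```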


### startswith() / endswith()
We can use the traditional ***startswith*** and ***endswith*** to check for a match specifically at the beginning / end of a string, respectively.<br>In Pandas these return a boolean Series, which we can of course use for filtering.

In [66]:
products['ProductName'].str.startswith('Sport').head()

ProductKey
214     True
215     True
218    False
219    False
220     True
Name: ProductName, dtype: bool

In [68]:
products[products['ProductName'].str.startswith('Sport')]

ProductKey,ProductSKU,ProductName,ModelName,ProductDescription,ProductColor,ProductSize,ProductStyle,ProductCost,ProductPrice
214,HL-U509-R,"Sport-100 Helmet, Red",Sport-100,"Universal fit, well-vented, lightweight , snap...",Red,0,0,13.0863,34.99
215,HL-U509,"Sport-100 Helmet, Black",Sport-100,"Universal fit, well-vented, lightweight , snap...",Black,0,0,12.0278,33.6442
220,HL-U509-B,"Sport-100 Helmet, Blue",Sport-100,"Universal fit, well-vented, lightweight , snap...",Blue,0,0,12.0278,33.6442
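
The same pattern works for ***endswith***. The sketch below uses a hypothetical sample with a missing value; passing `na=False` keeps the mask purely boolean so it can be used directly as a filter.

```python
import pandas as pd

# Hypothetical sample, including a missing value
names = pd.Series(['Sport-100 Helmet, Red', 'Mountain Bike Socks, M', None])

# True only where the name ends with the size suffix ', M'
ends_with_m = names.str.endswith(', M', na=False)
medium_items = names[ends_with_m]

print(ends_with_m.tolist())
print(medium_items.tolist())
```

Unlike ***contains***, these two methods match the given text literally rather than as a regular expression.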


### find()
The ***find*** method is very similar to the ***index*** method: it returns the lowest index position (first occurrence) of the specified string inside another string.<br>Its advantage over ***index***, however, shows when we search for a non-existent value: instead of an error we get the value -1

In [70]:
new_col = products['ProductName'].str.find('Helmet')
products['ProductName'].to_frame().assign(Helmet_Position = new_col).head()

ProductKey,ProductName,Helmet_Position
214,"Sport-100 Helmet, Red",10
215,"Sport-100 Helmet, Black",10
218,"Mountain Bike Socks, M",-1
219,"Mountain Bike Socks, L",-1
220,"Sport-100 Helmet, Blue",10
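
The difference between ***find*** and ***index*** can be checked on a small hypothetical sample: `str.index` raises a `ValueError` as soon as one value lacks the substring, while `str.find` returns -1 for that row.

```python
import pandas as pd

# Hypothetical sample of ProductName values
names = pd.Series(['Sport-100 Helmet, Red', 'Mountain Bike Socks, M'])

positions = names.str.find('Helmet')   # -1 where the substring is missing

try:
    names.str.index('Helmet')          # the second value has no 'Helmet'
    index_raised = False
except ValueError:
    index_raised = True

# The -1 marker makes it easy to keep only the rows where the substring was found
found = names[positions >= 0]

print(positions.tolist(), index_raised)
print(found.tolist())
```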
