### Vectorized String Operations

Pandas is well known for the relative ease with which it handles string data. It provides a comprehensive set of vectorized string operations, which are essential when munging or cleaning real-world data.



In [3]:
# reintroducing array elements for arithmetic operations
import pandas as pd
import numpy as np

x = np.array([1,2,3,4,5])
x

array([1, 2, 3, 4, 5])

In the example above we do not need to worry about the shape of the array, only about the operation we want performed; that is what vectorization gives us. NumPy does not offer the same simple access for arrays of strings, so we are left with a more verbose loop syntax.
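
As a minimal illustration of what vectorization means here, the sketch below applies one arithmetic operation to the whole array at once. The name `doubled` is introduced only for this example.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# One expression applies the operation to every element; no explicit loop is needed.
doubled = x * 2
print(doubled)  # [ 2  4  6  8 10]
```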

In [7]:
# a plain Python list of strings

data = ['peter', 'john', 'mike', 'mary']
[s.capitalize() for s in data]  # sufficient for clean data, but breaks as soon as a missing value is encountered


['Peter', 'John', 'Mike', 'Mary']

In [8]:
data = ['peter', 'john','mike',  None, 'mary']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'

The error is an AttributeError: the NoneType object has no attribute called capitalize. This is where pandas becomes useful: through the str attribute of Series and Index objects that contain strings, it provides features that address both the need for vectorized string operations and the need to handle missing values correctly.

In [9]:
# converting the data variable above to pandas series object
names = pd.Series(data)
names

0    peter
1     john
2     mike
3     None
4     mary
dtype: object

In [10]:
# let us now call the feature that will handle the capitalization
# of every single entry while skipping the None missing value

names.str.capitalize()

0    Peter
1     John
2     Mike
3     None
4     Mary
dtype: object

### Pandas String Methods

In [19]:
# creating a pandas series

conte = pd.Series(['Graham Chapman', 'John Cena', 'Micheal Scofield', 'Idris Abubakar',
                  'Adeola Michael', 'Emeka Obi'])
conte

0      Graham Chapman
1           John Cena
2    Micheal Scofield
3      Idris Abubakar
4      Adeola Michael
5           Emeka Obi
dtype: object

Almost all of the pandas string methods mirror Python's built-in string methods. Some of them include:

len(), lower(), capitalize(), find(), rfind(), islower(), isupper(), upper(), ljust(), rjust(), startswith(), endswith(), strip(), lstrip(), rstrip(), etc.

In [13]:
# let us use the pandas vectorized string lower() method

conte.str.lower()  # lower() returns a Series of strings

0      graham chapman
1           john cena
2    micheal scofield
3      idris abubakar
4      adeola michael
5           emeka obi
dtype: object

In [14]:
# some methods return numbers

conte.str.len()  # note that the str accessor must come before the method name

0    14
1     9
2    16
3    14
4    14
5     9
dtype: int64

In [17]:
# some return Boolean values

conte.str.startswith('E')

0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

In [18]:
# others return a list or another compound value for each element

conte.str.split()

0      [Graham, Chapman]
1           [John, Cena]
2    [Micheal, Scofield]
3      [Idris, Abubakar]
4      [Adeola, Michael]
5           [Emeka, Obi]
dtype: object
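
Because each of these methods returns a new Series, the calls can be chained. The sketch below is only an illustration on made-up data (the Series `raw` and its contents are assumptions, not part of the dataset above); it also shows that a missing value passes through the chain without raising an error.

```python
import pandas as pd

# Hypothetical messy input: stray whitespace, mixed case, and a missing value.
raw = pd.Series(['  graham CHAPMAN ', 'john cena', None, 'EMEKA obi  '])

# Each .str call returns a new Series, so the calls can be chained.
clean = raw.str.strip().str.title()

# split() gives a list per element; get(0) then picks the first item of each list.
first_names = clean.str.split().str.get(0)

print(clean.tolist())
print(first_names.tolist())
```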

### Using Regular Expressions in Pandas

Regular expressions are useful when you want to examine the content of each string and manipulate it to get a desired output. The following methods accept regular expressions and follow many of the API conventions of Python's built-in re module:

match() equivalent to calling re.match() on each element, returning a Boolean

extract() returns the capture groups of the first match in each element as strings

findall() equivalent to re.findall()

replace() replaces occurrences of a pattern with some other string

contains() equivalent to re.search(), returning a Boolean

count() counts occurrences of a pattern

split() equivalent to str.split(), but accepts regexps

rsplit() equivalent to str.rsplit(), splitting from the right
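
A few of the methods listed above can be sketched on a small made-up Series (the data and variable names here are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical sample data for illustration.
s = pd.Series(['apple pie', 'banana split', 'cherry tart'])

starts_with_b = s.str.match(r'b')      # anchored at the start: False, True, False
has_rr = s.str.contains(r'rr')         # pattern may appear anywhere: False, False, True
a_count = s.str.count(r'a')            # non-overlapping matches: 1, 3, 1

# regex=True is passed explicitly so the pattern is treated as a regular expression.
no_vowels = s.str.replace(r'[aeiou]', '', regex=True)  # 'ppl p', 'bnn splt', 'chrry trt'
```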

In [37]:
# let us extract the first name from each entry (the first run of letters)

conte.str.extract('([A-Za-z]+)')

         0
0   Graham
1     John
2  Micheal
3    Idris
4   Adeola
5    Emeka
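
extract() returns one column per capture group. As a hedged sketch on a shortened copy of the names (the group names `first` and `last` are assumptions introduced for this example), named groups give labelled columns, and `expand=False` returns a Series when there is a single group:

```python
import pandas as pd

names = pd.Series(['Graham Chapman', 'John Cena', 'Emeka Obi'])

# Two named groups produce a DataFrame with one column per group.
parts = names.str.extract(r'(?P<first>[A-Za-z]+)\s+(?P<last>[A-Za-z]+)')

# With a single group, expand=False returns a Series instead of a DataFrame.
first_only = names.str.extract(r'([A-Za-z]+)', expand=False)

print(parts)
print(first_only.tolist())  # ['Graham', 'John', 'Emeka']
```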


In [43]:
# finding all names that start and end with a consonant
# (^ and $ anchor the pattern to the start and end of each string)

conte.str.findall(r'^[^AEIOU].*[^aeiou]$')


0      [Graham Chapman]
1                    []
2    [Micheal Scofield]
3                    []
4                    []
5                    []
dtype: object

Other methods include:

get() indexes each element

slice() slices each element

slice_replace() replaces a slice in each element with a passed value

cat() concatenates strings

repeat() repeats values

normalize() returns a Unicode normal form of each string

pad() adds whitespace to the left, right, or both sides of strings

wrap() splits long strings into lines with length less than a given width

join() joins strings in each element of the Series with a passed separator

get_dummies() extracts dummy variables as a DataFrame
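
Some of these can be tried on a tiny made-up Series. The values and variable names below are assumptions for illustration only:

```python
import pandas as pd

s = pd.Series(['ab', 'cd', 'ef'])

padded = s.str.pad(4, side='left', fillchar='*')   # '**ab', '**cd', '**ef'
repeated = s.str.repeat(2)                         # 'abab', 'cdcd', 'efef'
joined = s.str.cat(sep='-')                        # 'ab-cd-ef' (one single string)
swapped = s.str.slice_replace(0, 1, 'X')           # 'Xb', 'Xd', 'Xf'

# join() works on a Series whose elements are lists of strings.
lists = pd.Series([['a', 'b'], ['c', 'd', 'e']])
rejoined = lists.str.join('+')                     # 'a+b', 'c+d+e'
```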

In [44]:
# vectorized item access and slicing
# get() and slice() give access to the characters of each element
# here we slice the first three characters of each name

conte.str.slice(0, 3)

0    Gra
1    Joh
2    Mic
3    Idr
4    Ade
5    Eme
dtype: object

In [45]:
# this is also equivalent to the above operation

conte.str[0:3]

0    Gra
1    Joh
2    Mic
3    Idr
4    Ade
5    Eme
dtype: object

In [66]:
conte.str.get(3)

0    h
1    n
2    h
3    i
4    o
5    k
dtype: object

In [59]:
conte.str[:3]

0    Gra
1    Joh
2    Mic
3    Idr
4    Ade
5    Eme
dtype: object

In [71]:
conte.str.split().str.get(-1)

0     Chapman
1        Cena
2    Scofield
3    Abubakar
4     Michael
5         Obi
dtype: object

In [75]:
# get dummies method is useful when our data contains coded indicators

full_conte = pd.DataFrame({'name': conte,
                          'info':['B|C|A','A|B', 'A|C', 'B|A', 'B|C', 'B|C|A']})
full_conte

               name   info
0    Graham Chapman  B|C|A
1         John Cena    A|B
2  Micheal Scofield    A|C
3    Idris Abubakar    B|A
4    Adeola Michael    B|C
5         Emeka Obi  B|C|A


In [78]:
# get_dummies() quickly splits the indicator variables out into a DataFrame

full_conte['info'].str.get_dummies()

   A  B  C
0  1  1  1
1  1  1  0
2  1  0  1
3  1  1  0
4  0  1  1
5  1  1  1
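
The indicator columns are often more useful next to the original rows. A minimal sketch, assuming a small made-up frame `df` (not the `full_conte` frame above), passes the separator explicitly and joins the result back:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Graham', 'John', 'Emeka'],
                   'info': ['B|C|A', 'A|B', 'C']})

# sep defaults to '|'; it is passed explicitly here for clarity.
flags = df['info'].str.get_dummies(sep='|')

# Attach the indicator columns to the original rows (aligned on the index).
combined = df.join(flags)
print(combined)
```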


### Mini Project: Recipe Database

The vectorized string operations we saw above become extremely useful when we combine them to solve analytical problems on real-world data.

We will use an open recipe database compiled from various sources on the web. The aim is to parse the recipe data into ingredient lists, so that we can easily find a recipe based on some ingredients we have.

The scripts used to compile it can be found at https://github.com/fictivekin/openrecipes

The data can be downloaded with the following command; the resulting .gz archive must then be unzipped (for example with gunzip) before it is read in.

In [122]:


!curl -O https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz
    


In [7]:
# try reading in the data, which is in JSON format

file = r'C:\Users\Nnabugwu kevin\Desktop\DataQuest\New folder\20170107-061401-recipeitems.json'
try:
    recipes = pd.read_json(file)
except ValueError as e:
    print("ValueError: ", e)
    

ValueError:  Trailing data


In [8]:

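# check whether a single line of the file is valid JSON on its own
from io import StringIO

with open(file) as f:
    line = f.readline()
pd.read_json(StringIO(line)).shape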

(2, 12)

In [9]:
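# read the entire file and combine its lines into a single JSON array
from io import StringIO

with open(file, 'r', encoding="utf-8") as f:
    # strip surrounding whitespace from each line
    data = (line.strip() for line in f)
    # join the lines with commas and wrap them in brackets to form one JSON array
    data_json = "[{0}]".format(','.join(data))
# read the result as JSON (wrapped in StringIO, since newer pandas expects a file-like object)
recipes = pd.read_json(StringIO(data_json))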
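
An alternative to assembling the array by hand is the `lines=True` option of `pd.read_json`, which treats every line as a separate JSON document. The sketch below uses two made-up records in place of the recipe file, so the names and values are assumptions:

```python
from io import StringIO
import pandas as pd

# Two hypothetical records, one JSON object per line, standing in for the recipe file.
json_lines = (
    '{"name": "Toast", "ingredients": "bread\\nbutter"}\n'
    '{"name": "Tea", "ingredients": "water\\ntea leaves"}\n'
)

# lines=True tells read_json to parse each line as its own JSON document.
sample = pd.read_json(StringIO(json_lines), lines=True)
print(sample.shape)  # (2, 2)
```

For the real data, the same option should apply when the file path is passed directly, as in `pd.read_json(file, lines=True)`.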

In [10]:
recipes.shape

(173278, 17)

In [11]:
recipes.head(2)

Unnamed: 0,_id,name,ingredients,url,image,ts,cookTime,source,recipeYield,datePublished,prepTime,description,totalTime,creator,recipeCategory,dateModified,recipeInstructions
0,{'$oid': '5160756b96cc62079cc2db15'},Drop Biscuits and Sausage Gravy,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/dro...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276011104},PT30M,thepioneerwoman,12,2013-03-11,PT10M,"Late Saturday afternoon, after Marlboro Man ha...",,,,,
1,{'$oid': '5160756d96cc62079cc2db16'},Hot Roast Beef Sandwiches,12 whole Dinner Rolls Or Small Sandwich Buns (...,http://thepioneerwoman.com/cooking/2013/03/hot...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276013902},PT20M,thepioneerwoman,12,2013-03-13,PT20M,"When I was growing up, I participated in my Ep...",,,,,


In [12]:
recipes.iloc[0] # index location of the first row

_id                                {'$oid': '5160756b96cc62079cc2db15'}
name                                    Drop Biscuits and Sausage Gravy
ingredients           Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
url                   http://thepioneerwoman.com/cooking/2013/03/dro...
image                 http://static.thepioneerwoman.com/cooking/file...
ts                                             {'$date': 1365276011104}
cookTime                                                          PT30M
source                                                  thepioneerwoman
recipeYield                                                          12
datePublished                                                2013-03-11
prepTime                                                          PT10M
description           Late Saturday afternoon, after Marlboro Man ha...
totalTime                                                           NaN
creator                                                         

In [13]:
# the ingredients list is stored as a string, and this is the column we are interested in
# let us start by looking at the length of each ingredient string

recipes.ingredients.str.len().describe()

# the ingredient lists average about 245 characters, with a minimum of
# 0 and a maximum of just over 9,000


count    173278.000000
mean        244.617926
std         146.705285
min           0.000000
25%         147.000000
50%         221.000000
75%         314.000000
max        9067.000000
Name: ingredients, dtype: float64

In [15]:
# which recipe has the longest ingredient list?
import numpy as np
recipes.name[np.argmax(recipes.ingredients.str.len())]

'Carrot Pineapple Spice &amp; Brownie Layer Cake with Whipped Cream &amp; Cream Cheese Frosting and Marzipan Carrots'
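
`np.argmax` returns a position, which matches the index label only when the frame has the default 0, 1, 2, ... index. A label-based variant can be sketched with `idxmax()`; the frame `toy` below is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Toast', 'Stew', 'Tea'],
    'ingredients': ['bread\nbutter', 'beef\ncarrots\nonions\nstock', 'water'],
}, index=[10, 20, 30])  # deliberately not 0, 1, 2

lengths = toy['ingredients'].str.len()

# idxmax() returns the index label of the maximum; .loc looks the row up by that label.
longest = toy.loc[lengths.idxmax(), 'name']
print(longest)  # Stew
```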

In [16]:
# how many recipes mention breakfast in their ingredient list?
recipes.ingredients.str.contains('[Bb]reakfast').sum()

233

In [17]:
# how many recipes list cinnamon as an ingredient?

recipes.ingredients.str.contains('[Cc]innamon').sum()

10526

In [18]:
# let us see if any recipes misspell cinnamon as "cinamon"

recipes.ingredients.str.contains('[Cc]inamon').sum()

11
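
Both spellings, in any letter case, can be caught with one pattern. This is a sketch on made-up ingredient strings, not on the recipe data, so it makes no claim about the counts above:

```python
import pandas as pd

# Hypothetical ingredient strings, including a misspelling and a missing value.
ingredients = pd.Series(['1 tsp Cinnamon', '2 sticks cinamon',
                         'ground CINNAMON', 'nutmeg', None])

# 'n+' matches one or more n, covering both spellings; case=False ignores letter case.
# na=False counts missing entries as non-matches instead of propagating them.
mask = ingredients.str.contains(r'cin+amon', case=False, na=False)
print(mask.sum())  # 3
```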

### A Simple Recipe Recommender

In [19]:
# an imaginary list of spices we have on hand
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

In [21]:
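# creating a Boolean DataFrame indicating which of these ingredients each recipe contains
import pandas as pd
import re

# note: re.IGNORECASE is passed positionally here, so it fills the `case` parameter
# and the match stays case-sensitive; use flags=re.IGNORECASE to ignore case
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(
    spice, re.IGNORECASE)) for spice in spice_list))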

spice_df.head()

    salt  pepper  oregano   sage  parsley  rosemary  tarragon  thyme  paprika  cumin
0  False   False    False   True    False     False     False  False    False  False
1  False   False    False  False    False     False     False  False    False  False
2   True    True    False  False    False     False     False  False    False   True
3  False   False    False  False    False     False     False  False    False  False
4  False   False    False  False    False     False     False  False    False  False


In [28]:
# finding a recipe that uses rosemary, tarragon and sage
# this can be done very quickly using the query() method

selection = spice_df.query('rosemary & sage & tarragon')
len(selection)

8

In [30]:
# only 8 recipes use this combination
# let us get their names

recipes.name[selection.index]

68620     Roast Turkey with Mushroom Sauce Recipe
72271                        Gordon's rustic pâté
112987              Marica's Spaghetti Meat Sauce
136120                       Roast chicken recipe
142515                             Focaccia Bread
165679                  Easy Herb Crackers Recipe
167942    Roast Turkey with Mushroom Sauce Recipe
171269                              Goose risotto
Name: name, dtype: object

In [31]:
# finding selection of other ingredients
selection_2 = spice_df.query('parsley & oregano & salt')
len(selection_2)

522

In [37]:
recipes.name[selection_2.index]

33                           Cauliflower Pizza Crust Recipe
164           Rigatoni with Spicy Calabrese-Style Pork Ragù
421                                        Franks and Beans
424                                                Cioppino
1022      Grilled Pork Tenderloin with Chimichurri and S...
                                ...                        
167336                Bulgur with Zucchini and Herbs Recipe
169382                               Wild Mushroom Stuffing
171155    Spaghetti and Meatballs Recipe with Oven Roast...
171284    Spaghetti and Meatballs Recipe with Oven Roast...
172517                           Greek Chicken and Potatoes
Name: name, Length: 522, dtype: object
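
The whole recommender can be sketched end to end on a tiny made-up table. Everything here (the frame `toy`, the recipe names, the `ingredient_list` column) is an assumption for illustration; it also parses each ingredient string into a list and passes `flags=` by keyword so that the match ignores case:

```python
import re
import pandas as pd

# Hypothetical miniature recipe table; ingredients are newline-separated as in the real data.
toy = pd.DataFrame({
    'name': ['Herb Chicken', 'Plain Rice', 'Sage Butter Pasta'],
    'ingredients': ['1 chicken\nRosemary\nfresh sage\nSalt',
                    '1 cup rice\nwater',
                    'pasta\nbutter\nSage\nsalt'],
})

# Parse each ingredient string into a list, one entry per line.
toy['ingredient_list'] = toy['ingredients'].str.split('\n')

spices = ['salt', 'sage', 'rosemary']

# One Boolean column per spice; flags= is passed by keyword so the match ignores case.
spice_flags = pd.DataFrame({
    spice: toy['ingredients'].str.contains(spice, flags=re.IGNORECASE)
    for spice in spices
})

# Keep the recipes that use both sage and salt.
picked = toy.loc[spice_flags.query('sage & salt').index, 'name']
print(picked.tolist())  # ['Herb Chicken', 'Sage Butter Pasta']
```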