<a href="https://colab.research.google.com/github/Shuraimi/1.Python-AllBasics/blob/master/2.%20Data_manipulation_with_Pandas/11.%20Vectorised_String_operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vectorised String operations

One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a set of vectorised string operations that become an essential part of the munging required when dealing with real-world data.

## Introducing Pandas String operations

So far we have seen how NumPy and Pandas generalise arithmetic operations so that they can be performed easily on every element of an array at once, like this 👇

In [None]:
import numpy as np
x=np.array([2,4,7,9])
x*=2

In [None]:
x

array([ 4,  8, 14, 18])

Vectorising operations simplifies the syntax of working with arrays. We no longer need to worry about the size or shape of the array, only about which operation should be applied.
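
To illustrate that the shape of the array does not affect the syntax, here is a minimal sketch; the array names `row` and `grid` are made up for this example.

```python
import numpy as np

# The same expression works regardless of the array's size or shape.
row = np.array([2, 4, 7, 9])
grid = np.array([[1, 2], [3, 4]])

doubled_row = row * 2    # element-wise on a 1-D array
doubled_grid = grid * 2  # identical syntax on a 2-D array

print(doubled_row)
print(doubled_grid)
```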

NumPy offers no comparably simple syntax for arrays of strings, so in plain Python we fall back on a loop (here, a list comprehension).

In [None]:
s=['shuraim','john','kerry']
[i.capitalize() for i in s]

['Shuraim', 'John', 'Kerry']

But this approach raises an error as soon as the list contains a missing value:

In [None]:
k=['shuraim','john',None,'kerry']
[i.capitalize() for i in k]

AttributeError: 'NoneType' object has no attribute 'capitalize'
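
In plain Python, the missing value has to be guarded by hand. A minimal sketch of that workaround follows; the name `capitalized` is only illustrative.

```python
k = ['shuraim', 'john', None, 'kerry']

# Guard each element explicitly so that None passes through untouched.
capitalized = [name.capitalize() if name is not None else None for name in k]

print(capitalized)
```

This works, but the check has to be repeated for every string operation, which is the boilerplate that a vectorised approach avoids.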

Pandas addresses both needs, vectorised string operations and correct handling of missing values, through the *str* attribute of Pandas Series and Index objects that contain strings.

In [None]:
import pandas as pd
n=pd.Series(k)
n.str.capitalize()

0    Shuraim
1       John
2       None
3      Kerry
dtype: object

This vectorised string operation capitalizes every string and skips over any missing values.
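
The same pass-through behaviour applies to the other *str* methods. The sketch below rebuilds the Series so that it runs on its own; the names `lengths`, `shouted` and `labels` are chosen for this example only.

```python
import pandas as pd

n = pd.Series(['shuraim', 'john', None, 'kerry'])

# Missing values propagate: the length of a missing entry is itself missing.
lengths = n.str.len()
print(lengths)

# Calls can be chained, because each one returns a new Series.
shouted = n.str.upper().str.replace('H', '#', regex=False)
print(shouted)

# Index objects holding strings expose the same accessor.
labels = pd.Index(['alpha', 'beta']).str.upper()
print(labels)
```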

In an interactive session, press Tab after typing the str attribute (for example, n.str.) to list the available string methods.

## Tables of Pandas String methods

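If you are familiar with Python's string manipulation methods, most of the Pandas string syntax is intuitive enough that a table of the available methods is sufficient. The examples in this section use the following Series: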

In [None]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
 'Eric Idle', 'Terry Jones', 'Michael Palin'])

### Methods similar to Python's string methods

Nearly all of Python's built-in string methods are mirrored by a Pandas vectorised string method. The following Pandas str methods mirror Python string methods:

| | | | |
|---|---|---|---|
| len() | lower() | translate() | islower() |
| ljust() | upper() | startswith() | isupper() |
| rjust() | find() | endswith() | isnumeric() |
| center() | rfind() | isalnum() | isdecimal() |
| zfill() | index() | isalpha() | split() |
| strip() | rindex() | isdigit() | rsplit() |
| rstrip() | capitalize() | isspace() | partition() |
| lstrip() | swapcase() | istitle() | rpartition() |

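These methods have various return values. Some, like lower(), return a Series of strings; others, like len(), return numbers; others, like isdigit(), return Boolean values; and still others, like split(), return a list for each element.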

In [None]:
monte.str.len()
#returns a series of numbers

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [None]:
monte.str.isdigit()

0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool
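
To make the different return types concrete, here is a short sketch using three more methods from the table. The variable names are illustrative only.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

lowered = monte.str.lower()                 # Series of strings
starts_with_t = monte.str.startswith('T')   # Series of Booleans
space_position = monte.str.find(' ')        # Series of integers (index of the first space)

print(lowered)
print(starts_with_t)
print(space_position)
```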

### Methods using regular expressions

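In addition, several methods accept regular expressions to examine the contents of each string element, and some follow the API conventions of Python's built-in re module. For example, extract() returns the groups matched by a pattern; here it pulls the first name out of each entry by asking for a contiguous group of letters at the start of the string.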

In [None]:
monte.str.extract('([A-Za-z]+)')

         0
0   Graham
1     John
2    Terry
3     Eric
4    Terry
5  Michael


Or we can do something more complicated, like finding all names that start and end
with a consonant, making use of the start-of-string (^) and end-of-string ($) regular
expression characters:

In [None]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
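
A few other regex-aware methods work in the same spirit. The sketch below uses contains(), count() and replace(); the patterns and variable names are made up for illustration, and `regex=True` is passed explicitly to replace() so that the pattern is treated as a regular expression.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Boolean test: does the pattern match anywhere in the string?
starts_with_t = monte.str.contains(r'^T')

# Number of non-overlapping matches (here, lowercase vowels) per string.
vowel_counts = monte.str.count(r'[aeiou]')

# Swap first and last name using capture groups and backreferences.
swapped = monte.str.replace(r'^(\w+) (\w+)$', r'\2, \1', regex=True)

print(starts_with_t)
print(vowel_counts)
print(swapped)
```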

### Miscellaneous methods

Finally, some miscellaneous methods enable other convenient operations, such as item access, slicing, and expanding coded values into indicator columns.

### Vectorised item access and slicing

The get() and slice() methods enable vectorised element access from each array.

For example, we can take the first three characters of each string with str.slice(0, 3); the same behaviour is available through Python's normal slicing syntax, str[0:3].

In [None]:
monte.str.slice(0,3)

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

In [None]:
monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

As the two outputs show, monte.str[0:3] and monte.str.slice(0,3) give the same result. Single-element access works the same way with get():

In [None]:
monte.str.get(3)

0    h
1    n
2    r
3    c
4    r
5    h
dtype: object
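
One difference from ordinary Python indexing is worth noting: where `'ab'[3]` would raise an IndexError, get() returns a missing value for entries that are too short. A small sketch, using a made-up Series:

```python
import pandas as pd

short_and_long = pd.Series(['ab', 'abcd'])

# Position 3 does not exist in 'ab', so that entry comes back as missing.
fourth = short_and_long.str.get(3)

print(fourth)
```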

The get() and slice() methods also let you access elements of the lists returned by split():

In [None]:
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

To extract the last name of each entry, we can combine split() with get():

In [None]:
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
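
As an alternative sketch, split() can spread the pieces across columns with `expand=True`, which returns a DataFrame instead of a Series of lists. The column labels `first` and `last` are assigned here for readability and are not part of the original example.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# expand=True puts each piece of the split in its own column.
names = monte.str.split(expand=True)
names.columns = ['first', 'last']

print(names)
```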

### Indicator variables

Another useful method is get_dummies(), which helps when a column contains some sort of coded indicator, such as

A=“born in America” <br>B=“born in the United Kingdom” <br>C=“likes cheese” <br>D=“likes spam”

In [None]:
full_monte=pd.DataFrame({'name':monte,'info':['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C',
 'B|C|D']})

In [None]:
full_monte

             name   info
0  Graham Chapman  B|C|D
1     John Cleese    B|D
2   Terry Gilliam    A|C
3       Eric Idle    B|D
4     Terry Jones    B|C
5   Michael Palin  B|C|D


The get_dummies() method lets you quickly split these indicator variables out into a DataFrame:

In [None]:
full_monte['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
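
The indicator columns become more useful once they are joined back to the original rows. The following sketch rebuilds the DataFrame so that it runs on its own; `dummies`, `labelled` and `likes_cheese` are names chosen for this example.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D']})

# One 0/1 column per code, aligned on the same index as full_monte.
dummies = full_monte['info'].str.get_dummies('|')
labelled = full_monte.join(dummies)

# Select everyone whose info string contains code C ("likes cheese").
likes_cheese = labelled.loc[labelled['C'] == 1, 'name']

print(likes_cheese)
```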


With these operations as building blocks, you can construct an endless range of string-processing procedures for cleaning your data.

## Example: Recipe Database

Vectorised string operations become especially handy when cleaning messy, real-world data.

Our goal is to parse the recipe data into ingredient lists, so that we can quickly find a recipe based on the ingredients we have on hand.
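
As a rough sketch of where this is heading, the snippet below searches a tiny, hypothetical stand-in table for recipes that mention every requested ingredient. The column names, the sample rows and the helper `recipes_with` are assumptions made for illustration; the real dataset may be laid out differently.

```python
import pandas as pd

# Hypothetical stand-in data; the real dataset's columns may differ.
recipes_demo = pd.DataFrame({
    'name': ['Tomato soup', 'Garlic bread', 'Herb omelette'],
    'ingredients': ['tomato\nonion\ngarlic\nsalt',
                    'bread\nbutter\ngarlic\nparsley',
                    'eggs\nparsley\nbutter\nsalt'],
})


def recipes_with(df, wanted):
    """Return names of recipes whose ingredient text mentions every item in wanted."""
    mask = pd.Series(True, index=df.index)
    for item in wanted:
        # Plain substring test, ignoring case; no regular expression involved.
        mask &= df['ingredients'].str.contains(item, case=False, regex=False)
    return df.loc[mask, 'name'].tolist()


print(recipes_with(recipes_demo, ['garlic', 'butter']))
```

A substring test is deliberately simple: it would also match "garlic" inside "garlic powder", which may or may not be what you want.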

In [None]:
# Read the database, which is distributed as a gzipped JSON file
recipes=pd.read_json('https://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz')

ValueError: Expected object or value

In [None]:
try:
  recipes = pd.read_json('https://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz')
except ValueError as e:
  print("ValueError:", e)

ValueError: Expected object or value


In [None]:
import requests
import json
import gzip
import io

url = "https://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz"

# Download the gzipped JSON file
response = requests.get(url)
response.raise_for_status()

# Decompress the gzipped content
with gzip.GzipFile(fileobj=io.BytesIO(response.content), mode='rb') as file:
    # Load the JSON data
    data = json.load(file)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [None]:
pip install requests



In [None]:
!curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz
!gunzip recipeitems-latest.json.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    20  100    20    0     0    263      0 --:--:-- --:--:-- --:--:--   266

Only 20 bytes were transferred, which is the size of an empty gzip archive, so the file at this URL appears to contain no data. That is consistent with the parsing attempts in this section all failing with a message about a missing JSON value.


In [None]:
try:
  recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
  print("ValueError:", e)

ValueError: Expected object or value


In [None]:
with open('recipeitems-latest.json') as f:
  line = f.readline()
pd.read_json(line).shape

ValueError: ignored

In [None]:
import requests
import json
import gzip
import io

url = "https://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz"

# Download the gzipped JSON file
response = requests.get(url)
response.raise_for_status()

# Check if the response content is empty
if not response.content:
    print("The content is empty.")
else:
    try:
        # Decompress the gzipped content
        with gzip.GzipFile(fileobj=io.BytesIO(response.content), mode='rb') as file:
            # Load the JSON data
            data = json.load(file)
        # Now, 'data' contains the JSON content from the URL
    except json.JSONDecodeError as e:
        print("Error decoding JSON:", e)

Error decoding JSON: Expecting value: line 1 column 1 (char 0)
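
Because the download yields nothing to parse, the loading step can only be sketched against stand-in data. The helper below assumes the dataset stores one JSON object per line (an assumption that cannot be verified from the empty download) and returns None when the archive decompresses to nothing. The function name `load_json_lines` and the two sample records are hypothetical.

```python
import gzip
import io

import pandas as pd


def load_json_lines(gz_bytes):
    """Decompress gzipped bytes and parse them as one JSON object per line.

    Returns None when the archive decompresses to nothing.
    """
    text = gzip.decompress(gz_bytes).decode('utf-8')
    if not text.strip():
        return None
    # lines=True tells pandas to treat each line as a separate JSON record.
    return pd.read_json(io.StringIO(text), lines=True)


# Hypothetical two-record sample standing in for the real download.
sample = (b'{"name": "Tomato soup", "ingredients": "tomato\\nonion"}\n'
          b'{"name": "Garlic bread", "ingredients": "bread\\ngarlic"}\n')

demo = load_json_lines(gzip.compress(sample))
empty = load_json_lines(gzip.compress(b''))

print(demo.shape)
print(empty)
```

With a working copy of the data, the bytes in `response.content` could be passed to the same helper in place of the sample.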
