## 8.13 Intro to Data Science: Pandas, Regular Expressions and Data Munging

Preparing data for analysis is called data munging or data wrangling. The two terms are synonyms; from this point forward, we’ll say data munging.

Two of the most important steps in data munging are data cleaning and transforming data into the optimal formats for your database systems and analytics software. Some common data cleaning examples are:

- deleting observations with missing values,
- substituting reasonable values for missing values,
- deleting observations with bad values,
- substituting reasonable values for bad values,
- tossing outliers (although sometimes you’ll want to keep them),
- duplicate elimination (although sometimes duplicates are valid),
- dealing with inconsistent data,
- and more.
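
A few of the cleaning steps listed above can be sketched with pandas. This is a minimal illustration, not a prescribed workflow: the `readings` DataFrame, its column names and its values are hypothetical, and using the column mean as the "reasonable value" is only one possible choice.

```python
import pandas as pd

# Hypothetical temperature readings; None marks a missing value
readings = pd.DataFrame({
    'city': ['Boston', 'Miami', 'Miami', 'Denver'],
    'temp_f': [41.0, 78.0, 78.0, None],
})

# Duplicate elimination: drop the repeated Miami observation
deduped = readings.drop_duplicates()

# Option 1: delete observations with missing values
dropped = deduped.dropna()

# Option 2: substitute a reasonable value (here, the column mean)
filled = deduped.fillna({'temp_f': deduped['temp_f'].mean()})

print(dropped)
print(filled)
```

Whether to delete or substitute depends on the data: deleting shrinks the dataset, while substituting keeps every observation but introduces values that were never measured.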

### Data Validation

Let’s begin by creating a Series of five-digit ZIP Codes from a dictionary of city-name/five-digit-ZIP-Code key–value pairs. We intentionally entered an invalid ZIP Code for Miami:

## Zipcode work Samantha Cress

In [3]:
import pandas as pd

In [4]:
zips = pd.Series({'Boston': '02215', 'Miami': '3310'})

In [5]:
zips  # Though zips looks like a two-dimensional array, it’s actually one-dimensional.

Boston    02215
Miami      3310
dtype: object

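The str attribute of a Series provides string-processing and various regular expression methods. Let’s use the str attribute’s match method to check whether each ZIP Code is valid. Note that match applies the pattern from the start of each value but does not require the pattern to consume the whole value, so this check catches Miami’s four-digit entry but would also accept a value with extra trailing characters: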

In [6]:
zips.str.match(r'\d{5}')

Boston     True
Miami     False
dtype: bool
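
If a ZIP Code must consist of exactly five digits, a stricter check may be preferable. The sketch below assumes a recent pandas version that provides `Series.str.fullmatch`; the `candidates` Series and its `'Typo'` entry are hypothetical values added for illustration.

```python
import pandas as pd

# Hypothetical values: only the first is a valid five-digit ZIP Code
candidates = pd.Series({'Boston': '02215', 'Typo': '022159', 'Miami': '3310'})

# match anchors only at the start, so the six-digit value still passes
prefix_ok = candidates.str.match(r'\d{5}')

# fullmatch requires the entire string to match the pattern
exact_ok = candidates.str.fullmatch(r'\d{5}')

print(prefix_ok)
print(exact_ok)
```

Here `prefix_ok` is True for both `'02215'` and `'022159'`, while `exact_ok` is True only for `'02215'`.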

## Cities Name work Samantha Cress

Sometimes, rather than matching a pattern at the beginning of a value, you’ll want to know whether a value contains a substring that matches the pattern anywhere. In this case, use method contains instead of match. Let’s create a Series of strings, each containing a U.S. city, state and ZIP Code, then determine whether each string contains a substring matching the pattern ' [A-Z]{2} ' (a space, followed by two uppercase letters, followed by a space). Calling match with the same pattern (snippet [10]) returns False for both strings, because neither string begins with a space followed by two uppercase letters:

In [7]:
cities = pd.Series(['Boston, MA 02215', 'Miami, FL 33101'])

In [8]:
cities

0    Boston, MA 02215
1     Miami, FL 33101
dtype: object

In [9]:
cities.str.contains(r' [A-Z]{2} ')

0    True
1    True
dtype: bool

In [10]:
cities.str.match(r' [A-Z]{2} ')

0    False
1    False
dtype: bool
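
The difference between the two results above mirrors the difference between `re.search` and `re.match` in Python's standard library. The following sketch uses only the `re` module on a single string to show the same behavior; the variable names are illustrative.

```python
import re

address = 'Boston, MA 02215'
pattern = r' [A-Z]{2} '

# re.search scans the whole string, which is how contains behaves
found_anywhere = re.search(pattern, address) is not None

# re.match tries the pattern only at position 0, which is how match behaves
found_at_start = re.match(pattern, address) is not None

print(found_anywhere, found_at_start)  # True False
```

The string begins with `'B'`, not a space, so the pattern cannot match at the start even though `' MA '` appears later.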

## DataFrames, Reformatting work Samantha Cress

In [11]:
contacts = [['Samantha Cress', 'demo1@deitel.com', '5555555555'],
            ['Sue Brown', 'demo2@deitel.com', '5555551234']]

In [12]:
contactsdf = pd.DataFrame(contacts, columns=['Name', 'Email', 'Phone'])

In [13]:
contactsdf

             Name             Email       Phone
0  Samantha Cress  demo1@deitel.com  5555555555
1       Sue Brown  demo2@deitel.com  5555551234


In [14]:
import re

In [15]:
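def get_formatted_phone(value):
    """Return a 10-digit phone number as ###-###-####, or value unchanged."""
    # Capture the area code, exchange and line number as three groups
    result = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', value)
    # If value isn't exactly 10 digits, leave it as is
    return '-'.join(result.groups()) if result else value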

In [16]:
formatted_phone = contactsdf['Phone'].map(get_formatted_phone)

In [17]:
formatted_phone

0    555-555-5555
1    555-555-1234
Name: Phone, dtype: object

In [18]:
contactsdf['Phone'] = formatted_phone

In [19]:
contactsdf

             Name             Email         Phone
0  Samantha Cress  demo1@deitel.com  555-555-5555
1       Sue Brown  demo2@deitel.com  555-555-1234
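
The map-based reformatting above calls a Python function once per element. As a possible alternative, the same result might be obtained with the Series `str.replace` method and backreferences. This is a sketch under assumptions: the `phones` Series is hypothetical, and its third value is included only to show that non-matching strings pass through unchanged.

```python
import pandas as pd

# Hypothetical phone numbers; the last one is not 10 digits
phones = pd.Series(['5555555555', '5555551234', '555-1234'])

# \1, \2 and \3 refer to the three captured groups; ^ and $ require
# the whole value to be exactly 10 digits before any replacement occurs
formatted = phones.str.replace(
    r'^(\d{3})(\d{3})(\d{4})$', r'\1-\2-\3', regex=True)

print(formatted)
```

Passing `regex=True` explicitly avoids relying on the method's default, which has differed between pandas versions.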
