# Data Cleaning and Preparation

In [27]:
import numpy as np
import pandas as pd


## Data Transformation (continued)
So far in this lesson we’ve been concerned with rearranging data. Filtering, cleaning,
and other transformations are another class of important operations.

## String Manipulation
Python has long been a popular raw data manipulation language in part due to its
ease of use for string and text processing. Most text operations are made simple with
the string object’s **built-in methods**. 

For more complex pattern matching and text manipulations, **regular expressions** may be needed. 

**pandas** adds to the mix by letting you apply string methods and regular expressions concisely on whole arrays of data,
while also handling the annoyance of missing data.

### String Object Methods

In [1]:
# convert the string to a list using the ',' as separator
val = 'a,b,  guido'


In [3]:
# convert the string to a list using the ',' as separator and remove the extra spaces


In [22]:
# concatenate the parts back with separator '::' using the + operator (needs `pieces` from the previous cell)
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

In [4]:
# concatenate the parts back with separator '::' (try the str.join method)
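
One possible way to work through the split, strip, and join prompts above, as a minimal sketch. The name `raw_pieces` is introduced here only for illustration; `pieces` matches the name used in the `+` concatenation cell.

```python
val = 'a,b,  guido'

# split on ',' keeps any surrounding whitespace inside each piece
raw_pieces = val.split(',')

# strip() removes leading and trailing whitespace from every piece
pieces = [x.strip() for x in raw_pieces]

# join() concatenates any number of pieces with the given separator
joined = '::'.join(pieces)

print(raw_pieces)
print(pieces)
print(joined)
```

Compared with chaining `+`, `join` does not need to know in advance how many pieces there are.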


**check**: try the `in` operator and str methods `index`, `find`, `count` and `replace`
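
As a rough sketch of the methods named in the check above, applied to the same `val` string. The variable names on the left are illustrative only.

```python
val = 'a,b,  guido'

has_guido = 'guido' in val         # substring test, gives a bool
comma_pos = val.index(',')         # position of the first ','; raises ValueError if absent
colon_pos = val.find(':')          # like index, but returns -1 when nothing is found
n_commas = val.count(',')          # number of non-overlapping occurrences
swapped = val.replace(',', '::')   # returns a new string; val itself is unchanged

print(has_guido, comma_pos, colon_pos, n_commas, swapped)
```

The main practical difference between `index` and `find` is the failure mode: an exception versus a `-1` return value.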

![](assets/built-in-str-methods.png)

### Regular Expressions
Regular expressions provide a flexible way to search or match (often more complex)
string patterns in text. A single expression, commonly called a regex, is a string
formed according to the regular expression language.

The `re` module functions fall into three categories: **pattern matching**, **substitution**,
and **splitting**.

In [24]:
import re
text = "foo    bar\t baz  \tqux"
# https://pythex.org



Suppose we wanted to split a string on a variable number of whitespace characters
(tabs, spaces, and newlines). The regex describing one or more whitespace characters
is `\s+`:

In [5]:
# split the string on runs of whitespace



In [6]:
# find all runs of whitespace in the string
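
A minimal sketch of the two whitespace prompts above, assuming the same `text` string. The names `words` and `gaps` are illustrative.

```python
import re

text = "foo    bar\t baz  \tqux"

# split on every run of one or more whitespace characters
words = re.split(r'\s+', text)

# a compiled pattern can be reused; findall returns the matching runs themselves
regex = re.compile(r'\s+')
gaps = regex.findall(text)

print(words)
print(gaps)
```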


**Note:** Creating a regex object with `re.compile` is highly recommended if you intend to
apply the same expression to many strings; doing so saves CPU cycles.

In [62]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [7]:
# get a list of all emails in the text
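
One way to approach the prompt above, sketched with the compiled pattern from the previous cell. `findall` collects every match, while `search` and `match` return a single match object or `None`; the names `emails`, `first`, and `at_start` are illustrative.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

emails = regex.findall(text)    # list of every non-overlapping match
first = regex.search(text)      # match object for the first address only
at_start = regex.match(text)    # None here: the text does not start with an address

print(emails)
print(text[first.start():first.end()])
print(at_start)
```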


Relatedly, `sub` returns a new string in which every occurrence of the pattern is replaced by
a given replacement string.
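
As a small sketch of `sub`, using the same text and pattern as above. The replacement string `'REDACTED'` is an arbitrary choice for illustration.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

# every address is replaced; the original text is left untouched
redacted = regex.sub('REDACTED', text)
print(redacted)
```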


Suppose you wanted to find email addresses and simultaneously segment each
address into its three components: *username*, *domain name*, and *domain suffix*.

In [64]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

In [8]:
# find all emails in the text


In [9]:
# prefix each segment of the email with a suitable label
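
A possible sketch for the two prompts above. When the pattern contains groups, `findall` returns a list of tuples, and `sub` can refer to each group in the replacement with `\1`, `\2`, and so on. The label wording (`Username`, `Domain`, `Suffix`) is just one choice.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

# one (username, domain, suffix) tuple per address
segments = regex.findall(text)

# \1, \2, \3 stand for the three captured groups
labeled = regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)

print(segments)
print(labeled)
```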


![](assets/re-methods.png)

### Vectorized String Functions in pandas

In [33]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data


Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

Series has array-oriented methods for string operations that skip NA values. These are accessed through the Series's **str attribute**.

In [10]:
# check if the email is gmail
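
One way to answer the prompt above is `str.contains`, sketched here on the same Series. How the missing entry for `Wes` is reported by the plain call can depend on the pandas version and dtype, so the second call passes `na=False` to get a clean boolean mask; the names `is_gmail` and `gmail_mask` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# element-wise substring test through the str accessor
is_gmail = data.str.contains('gmail')

# na=False treats missing values as "no match", giving a mask usable for indexing
gmail_mask = data.str.contains('gmail', na=False)

print(is_gmail)
print(data[gmail_mask])
```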


Regular expressions can be used, too, along with any `re` options like `IGNORECASE`.

In [11]:
# using the pattern declared earlier, find all parts of each email


In [12]:
# use the match method to check if the field matches an email or not
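
A hedged sketch for the two regex prompts above, reusing the grouped email pattern. `str.findall` and `str.match` mirror their `re` counterparts element by element; `str.extract` is added here as an optional extra that spreads the captured groups into DataFrame columns.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# list of (username, domain, suffix) tuples for each non-missing entry
parts = data.str.findall(pattern, flags=re.IGNORECASE)

# does each entry match the pattern from its first character?
matches = data.str.match(pattern, flags=re.IGNORECASE)

# one column per captured group
extracted = data.str.extract(pattern, flags=re.IGNORECASE)

print(parts)
print(matches)
print(extracted)
```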


![](assets/series-str-methods.png)

# Data Aggregation and Group Operations

## GroupBy Mechanics

The term split-apply-combine is used to describe group operations.

- In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is **split** into groups based on one or more keys that you provide.
- Once this is done, a function is **applied** to each group, producing a new value. 
- Finally, the results of all those function applications are **combined** into a result object. 

The form of the resulting object will usually depend on what’s being done to the data.
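
The three stages can be sketched by hand and compared with a single `groupby` call. This is only an illustration on a small hypothetical DataFrame (`toy`, with made-up `key` and `value` columns), not a description of how pandas implements grouping internally.

```python
import pandas as pd

# hypothetical data, for illustration only
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})

# split: one sub-Series of values per distinct key
pieces = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# apply: reduce each piece to a single number
applied = {k: piece.sum() for k, piece in pieces.items()}

# combine: assemble the per-group results into one object
manual = pd.Series(applied)

# groupby performs the same three steps in one expression
grouped = toy.groupby('key')['value'].sum()

print(manual)
print(grouped)
```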

![](assets/group-aggregation.png)

In [78]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randint(0, 10, 7),
                   'data2' : np.random.randint(0, 10, 7)})
df

  key1 key2  data1  data2
0    a  one      5      1
1    a  two      5      7
2    b  one      7      9
3    b  two      3      2
4    a  one      2      7
5    b  two      5      1
6    a  one      6      4


In [13]:
# group data of column 'data1' by 'key1' then print the groups


In [14]:
# calculate the mean in each group
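
A minimal sketch of the two prompts above. To keep the numbers reproducible, the DataFrame is rebuilt with the values shown in the displayed output rather than with fresh random integers.

```python
import pandas as pd

# values copied from the displayed df so the results are deterministic
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'two', 'one'],
                   'data1': [5, 5, 7, 3, 2, 5, 6],
                   'data2': [1, 7, 9, 2, 7, 1, 4]})

# group the data1 column by the values in key1
grouped = df['data1'].groupby(df['key1'])

# groups maps each key to the row labels that belong to it
print(grouped.groups)

# one mean per group, indexed by the key
means = grouped.mean()
print(means)
```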


In [15]:
# group data of column 'data1' by 'key1' and 'key2' then print the groups


In [16]:
# calculate the mean in each group


In [17]:
# unstack the result Series


In [18]:
# try some selection on the result DataFrame
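
The four prompts above (two keys, mean, unstack, selection) could be sketched as follows, again using the values from the displayed `df`. The names `means`, `wide`, `row_a`, and `col_one` are illustrative.

```python
import pandas as pd

# values copied from the displayed df so the results are deterministic
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'two', 'one'],
                   'data1': [5, 5, 7, 3, 2, 5, 6],
                   'data2': [1, 7, 9, 2, 7, 1, 4]})

# grouping by two keys gives a Series with a two-level index
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
print(means)

# unstack moves the inner level (key2) into the columns
wide = means.unstack()
print(wide)

# ordinary label-based selection works on the result
row_a = wide.loc['a']      # all key2 values for key1 == 'a'
col_one = wide['one']      # all key1 values for key2 == 'one'
print(row_a)
print(col_one)
```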


### Iterating Over Groups
The GroupBy object supports iteration, generating a sequence of **2-tuples** containing
the **group name** along with the **chunk of data**.

In [19]:
# group data of DataFrame 'df' by 'key1' then print each group name and data



In [20]:
# group data of DataFrame 'df' by 'key1' and 'key2' then print each group name and data
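
A sketch of both iteration prompts above, using the DataFrame with the `key1` and `key2` columns. With a single key the group name is a scalar; with several keys it is a tuple of key values.

```python
import pandas as pd

# values copied from the displayed df so the results are deterministic
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'two', 'one'],
                   'data1': [5, 5, 7, 3, 2, 5, 6],
                   'data2': [1, 7, 9, 2, 7, 1, 4]})

# single key: name is the key value, group is the matching chunk of rows
names = []
for name, group in df.groupby('key1'):
    names.append(name)
    print(name)
    print(group)

# multiple keys: the name unpacks into one value per key
pair_names = []
for (k1, k2), group in df.groupby(['key1', 'key2']):
    pair_names.append((k1, k2))
    print((k1, k2))
    print(group)
```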


In [21]:
# group data of DataFrame 'df' by 'key1' then convert it to a dictionary of DataFrames
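
Because iteration yields (name, chunk) pairs, the pairs can be fed straight into `dict`. A minimal sketch, with `pieces` as an illustrative name:

```python
import pandas as pd

# values copied from the displayed df so the results are deterministic
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'two', 'one'],
                   'data1': [5, 5, 7, 3, 2, 5, 6],
                   'data2': [1, 7, 9, 2, 7, 1, 4]})

# each (name, chunk) pair becomes one key/value entry
pieces = dict(list(df.groupby('key1')))

print(pieces['b'])
```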



In [22]:
# group the columns of 'df' by their dtypes, then print the groups


In [23]:
# loop through the groups and print each one
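
One possible approach to the dtype prompts above. Older material often uses `df.groupby(df.dtypes, axis=1)`, but the `axis=1` form is deprecated in recent pandas releases, so this sketch instead groups the column names by the string form of their dtype and then selects the matching columns. That substitution is an assumption made here, not the only way to do it.

```python
import pandas as pd

# values copied from the displayed df so the results are deterministic
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'two', 'one'],
                   'data1': [5, 5, 7, 3, 2, 5, 6],
                   'data2': [1, 7, 9, 2, 7, 1, 4]})

# a Series of column names, grouped by each column's dtype name
col_groups = df.columns.to_series().groupby(df.dtypes.astype(str))

by_dtype = {}
for dtype_name, cols in col_groups:
    by_dtype[dtype_name] = cols.tolist()
    print(dtype_name)
    print(df[cols.tolist()])   # the sub-DataFrame holding columns of this dtype
```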


### Selecting a Column or Subset of Columns
Indexing a GroupBy object created from a DataFrame with a column name or array
of column names has the effect of column subsetting for aggregation. This means
that:
```python
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
```
are syntactic sugar for:
```python
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
```


In [24]:
# group data of DataFrame 'df' by 'key1' then calculate the mean of column 'data2'


In [25]:
# how do you get the result as a DataFrameGroupBy versus a SeriesGroupBy?
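
A short sketch of the difference: indexing the GroupBy with a single column name gives a grouped Series, while indexing with a list of names gives a grouped DataFrame, and the aggregated results follow suit. The variable names are illustrative.

```python
import pandas as pd

# values copied from the displayed df so the results are deterministic
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'two', 'one'],
                   'data1': [5, 5, 7, 3, 2, 5, 6],
                   'data2': [1, 7, 9, 2, 7, 1, 4]})

s_grouped = df.groupby('key1')['data2']     # scalar name -> SeriesGroupBy
f_grouped = df.groupby('key1')[['data2']]   # list of names -> DataFrameGroupBy

s_mean = s_grouped.mean()   # a Series indexed by key1
f_mean = f_grouped.mean()   # a one-column DataFrame indexed by key1

print(type(s_grouped).__name__, type(f_grouped).__name__)
print(s_mean)
print(f_mean)
```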


### Grouping with Dicts and Series

In [28]:
people = pd.DataFrame(np.random.randint(0, 10, (5,5)),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people

        a  b  c  d  e
Joe     0  3  0  8  1
Steve   2  9  4  2  5
Wes     7  2  1  6  5
Jim     6  8  9  4  9
Travis  3  5  3  3  4


In [112]:
mapping = {'Joe': 'red', 'Steve': 'red', 'Wes': 'blue',
           'Jim': 'blue', 'Travis': 'red', 'Elon' : 'orange'}

In [29]:
# group and sum the scores of teams red and blue


In [30]:
# convert the dict to a Series, then group and count the scores of teams red and blue
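
A sketch of both prompts above. Since `mapping` is keyed by the row labels of `people`, it can be passed directly as the grouping key; labels in the mapping that do not appear in the index (here `'Elon'`) are simply unused. The DataFrame is rebuilt from the displayed values so the totals are reproducible.

```python
import pandas as pd

# values copied from the displayed people table
people = pd.DataFrame([[0, 3, 0, 8, 1],
                       [2, 9, 4, 2, 5],
                       [7, 2, 1, 6, 5],
                       [6, 8, 9, 4, 9],
                       [3, 5, 3, 3, 4]],
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

mapping = {'Joe': 'red', 'Steve': 'red', 'Wes': 'blue',
           'Jim': 'blue', 'Travis': 'red', 'Elon': 'orange'}

# a dict maps each row label to its group
team_sum = people.groupby(mapping).sum()
print(team_sum)

# a Series works the same way; it is aligned with the index before grouping
map_series = pd.Series(mapping)
team_count = people.groupby(map_series).count()
print(team_count)
```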


### Grouping with Functions
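
A function passed as the group key is called once per index label, and its return values are used as the group names. As a minimal sketch, assuming the `people` DataFrame from the previous section, grouping by the built-in `len` groups the rows by the length of each name.

```python
import pandas as pd

# values copied from the displayed people table
people = pd.DataFrame([[0, 3, 0, 8, 1],
                       [2, 9, 4, 2, 5],
                       [7, 2, 1, 6, 5],
                       [6, 8, 9, 4, 9],
                       [3, 5, 3, 3, 4]],
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# len('Joe') == 3, len('Steve') == 5, len('Travis') == 6, and so on
by_len = people.groupby(len).sum()
print(by_len)
```

Joe, Wes, and Jim all have three-letter names, so they land in the same group.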

## Independent Practice
- read the dataset `tips.csv`
- create a new column 'tip_pct', which is tip / total_bill
- replace the short-day name with the full-day name and convert it to upper-case
- calculate the average tip percent for smokers and non-smokers
- calculate the max and average tip percent for each time
- calculate the average tip percent for each day and time
- create dummy variables for the day and time columns


