# Data Cleaning and Preparation

In [33]:
import numpy as np
import pandas as pd


## Data Transformation (continued)
So far in this lesson we’ve been concerned with rearranging data. Filtering, cleaning,
and other transformations are another class of important operations.

## String Manipulation
Python has long been a popular raw data manipulation language in part due to its
ease of use for string and text processing. Most text operations are made simple with
the string object’s **built-in methods**. 

For more complex pattern matching and text manipulations, **regular expressions** may be needed. 

**pandas** adds to the mix by enabling you to apply string methods and regular expressions concisely to whole arrays of data,
while also handling the annoyance of missing data.

### String Object Methods

In [34]:
# convert the string to a list using the ',' as separator
val = 'a,b,  guido'
val.split(",")

['a', 'b', '  guido']

In [35]:
# convert the string to a list using the ',' as separator and remove the extra spaces

pieces = [v.strip() for v in val.split(",")]
pieces

['a', 'b', 'guido']

In [40]:
# concatenate the parts back with separator '::'
first, second, third = pieces
first + "::" + second + "::" + third

'a::b::guido'

In [41]:
# concatenate the parts back with separator '::'

"::".join(pieces)

'a::b::guido'

**check**: try the `in` operator and str methods `index`, `find`, `count` and `replace`
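
As a starting point for the check above, here is a minimal sketch that reuses the `val` string from the earlier cells. The variable names (`has_guido`, `comma_at`, and so on) are only illustrative.

```python
val = "a,b,  guido"

# membership test with the `in` operator
has_guido = "guido" in val

# index and find both return the position of the first occurrence ...
comma_at = val.index(",")
# ... but find returns -1 when the substring is missing,
colon_at = val.find(":")
# while index raises ValueError instead
try:
    val.index(":")
    index_failed = False
except ValueError:
    index_failed = True

# count the occurrences of a substring
n_commas = val.count(",")

# replace every occurrence of one substring with another
replaced = val.replace(",", "::")

print(has_guido, comma_at, colon_at, index_failed, n_commas, replaced)
```

Note that `find` is the safer choice when the substring may be absent, because it never raises.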

![](assets/built-in-str-methods.png)

### Regular Expressions
Regular expressions provide a flexible way to search or match (often more complex)
string patterns in text. A single expression, commonly called a regex, is a string
formed according to the regular expression language.

The `re` module functions fall into three categories: **pattern matching**, **substitution**,
and **splitting**.

In [42]:
import re
text = "foo    bar\t baz  \tqux"
# tip: you can test regex patterns interactively at https://pythex.org
print(text)


foo    bar	 baz  	qux


Suppose we wanted to split a string with a variable number of whitespace characters
(tabs, spaces, and newlines). The regex describing one or more whitespace characters
is `\s+`:

In [45]:
# split the string on runs of whitespace
# (a raw string keeps Python from treating \s as a string escape)
# text.split(" ")
re.split(r"\s+", text)

['foo', 'bar', 'baz', 'qux']

In [46]:
# find all runs of word characters, i.e. everything between the whitespace
re.findall(r"\w+", text)

['foo', 'bar', 'baz', 'qux']

**Note:** Creating a regex object with `re.compile` is highly recommended if you intend to
apply the same expression to many strings; doing so will save CPU cycles.

In [47]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [48]:
# using the regex object, get a list of all emails in the text
regex.findall(text) 

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

Relatedly, `sub` returns a new string in which each occurrence of the pattern is replaced by
a new string:


In [50]:
# substitute each email in the text with the word SECRET
print(regex.sub("SECRET", text))

Dave SECRET
Steve SECRET
Rob SECRET
Ryan SECRET



Suppose you wanted to find email addresses and simultaneously segment each
address into its three components: *username*, *domain name*, and *domain suffix*.

In [51]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

In [52]:
# find all emails in the text; with capture groups, findall returns a list of tuples
regex.findall(text) 

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
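
The grouped pattern also works with `match` and `search`, which return a match object rather than a list. The sketch below assumes the same pattern as above; the address `wesm@bright.net` and the variable names are only illustrative.

```python
import re

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

# match only succeeds if the pattern matches at the start of the string
m = regex.match("wesm@bright.net")
parts = m.groups()  # tuple of the three captured components

line = "Dave dave@google.com\nSteve steve@gmail.com"

# the line starts with a name, not an email, so match returns None here
anchored = regex.match(line)

# search scans the whole string and returns only the first match
first = regex.search(line)
span = (first.start(), first.end())

print(parts, anchored, first.group(0), span)
```

Because `match` and `search` return `None` when nothing is found, check the result before calling `groups()` on real data.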

In [54]:
# prefix each segment of the email with a suitable label
print(regex.sub(r"user: \1, company: \2, ext: \3", text))

Dave user: dave, company: google, ext: com
Steve user: steve, company: gmail, ext: com
Rob user: rob, company: gmail, ext: com
Ryan user: ryan, company: yahoo, ext: com



![](assets/re-methods.png)

### Vectorized String Functions in pandas

In [55]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)

data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

Series has array-oriented methods for string operations that skip NA values. These are accessed through the Series's **str attribute**.

In [58]:
# check if the email is gmail; na=False treats missing values as non-matches
mask = data.str.contains("gmail", na=False)
data[mask]

Steve    steve@gmail.com
Rob        rob@gmail.com
dtype: object

Regular expressions can be used, too, along with any `re` options like `IGNORECASE`.

In [59]:
# using the pattern declared earlier, find all parts of each email
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [63]:
# use the match method to check if the field matches an email or not
data.str.match(pattern, flags=re.IGNORECASE)

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
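
When the goal is to pull the captured components into separate columns, `str.extract` may be more convenient than `findall`. The following is a minimal sketch that rebuilds the same Series; the column labels `user`, `domain`, and `suffix` are assumptions chosen for illustration.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                  "Rob": "rob@gmail.com", "Wes": np.nan})
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# str.extract returns a DataFrame with one column per capture group;
# rows with missing input stay NaN
parts = data.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ["user", "domain", "suffix"]  # illustrative labels

# vectorized slicing: the first five characters of each string
first5 = data.str[:5]

# vectorized element access: the part before the '@'
user = data.str.split("@").str.get(0)

print(parts)
print(first5)
print(user)
```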

![](assets/series-str-methods.png)

# Data Aggregation and Group Operations

## GroupBy Mechanics

The term **split-apply-combine** is used to describe group operations.

- In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is **split** into groups based on one or more keys that you provide.
- Once this is done, a function is **applied** to each group, producing a new value. 
- Finally, the results of all those function applications are **combined** into a result object. 

The form of the resulting object will usually depend on what’s being done to the data.
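
To make the three stages concrete, the sketch below performs them by hand on a small hypothetical DataFrame and compares the outcome with `groupby`. It is only an illustration of the idea, not how pandas implements grouping internally.

```python
import pandas as pd

# small hypothetical frame, used only for this illustration
df = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                   "data1": [4, 4, 7, 1, 2]})

# split: one sub-Series of data1 per distinct key
pieces = {key: df.loc[df["key1"] == key, "data1"]
          for key in df["key1"].unique()}

# apply: reduce each piece to a single value
means = {key: piece.mean() for key, piece in pieces.items()}

# combine: assemble the per-group values into one result object
manual = pd.Series(means, name="data1").rename_axis("key1")

# the same computation with groupby
built_in = df.groupby("key1")["data1"].mean()

print(manual)
print(built_in)
```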

![](assets/group-aggregation.png)

In [64]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randint(0, 10, 7),
                   'data2' : np.random.randint(0, 10, 7)})
df

  key1 key2  data1  data2
0    a  one      4      0
1    a  two      4      3
2    b  one      7      9
3    b  two      1      3
4    a  one      2      3
5    b  two      5      7
6    a  one      2      7


In [66]:
# group data of column 'data1' by 'key1' then print the groups
result = df["data1"].groupby(df["key1"])
result.groups

{'a': [0, 1, 4, 6], 'b': [2, 3, 5]}

In [67]:
# calculate the mean in each group
result.mean()

key1
a    3.000000
b    4.333333
Name: data1, dtype: float64

In [68]:
# group data of column 'data1' by 'key1' and 'key2' then print the groups
result = df["data1"].groupby([df["key1"], df["key2"]])
result.groups

{('a', 'one'): [0, 4, 6], ('a', 'two'): [1], ('b', 'one'): [2], ('b', 'two'): [3, 5]}

In [70]:
# calculate the mean in each group
temp = result.mean()
temp

key1  key2
a     one     2.666667
      two     4.000000
b     one     7.000000
      two     3.000000
Name: data1, dtype: float64

In [72]:
temp[("a", "one")]

2.6666666666666665

In [75]:
# unstack the result Series
result = temp.unstack()
result

key2       one  two
key1
a     2.666667  4.0
b     7.000000  3.0


In [74]:
temp.reset_index()

  key1 key2     data1
0    a  one  2.666667
1    a  two  4.000000
2    b  one  7.000000
3    b  two  3.000000


In [76]:
# try some selection on the result DataFrame
result.loc["a", "one"]

2.6666666666666665

### Iterating Over Groups
The GroupBy object supports iteration, generating a sequence of **2-tuples** containing
the **group name** along with the **chunk of data**.

In [77]:
# group data of DataFrame 'df' by 'key1' then print each group name and data
result = df.groupby(df["key1"])
for name, group in result:
    print(name)
    print(group)
    print("-" * 25)
    

a
  key1 key2  data1  data2
0    a  one      4      0
1    a  two      4      3
4    a  one      2      3
6    a  one      2      7
-------------------------
b
  key1 key2  data1  data2
2    b  one      7      9
3    b  two      1      3
5    b  two      5      7
-------------------------


In [78]:
# group data of DataFrame 'df' by 'key1' and 'key2' then print each group name and data
result = df.groupby([df["key1"], df["key2"]])
for name, group in result:
    print(name)
    print(group)
    print("-" * 25)

('a', 'one')
  key1 key2  data1  data2
0    a  one      4      0
4    a  one      2      3
6    a  one      2      7
-------------------------
('a', 'two')
  key1 key2  data1  data2
1    a  two      4      3
-------------------------
('b', 'one')
  key1 key2  data1  data2
2    b  one      7      9
-------------------------
('b', 'two')
  key1 key2  data1  data2
3    b  two      1      3
5    b  two      5      7
-------------------------


In [80]:
# group data of DataFrame 'df' by 'key1' then convert it to a dictionary of DataFrames

result = dict(list(df.groupby(df["key1"])))
result["b"]


  key1 key2  data1  data2
2    b  one      7      9
3    b  two      1      3
5    b  two      5      7


In [83]:
# group the columns by their data types, then print the groups
# note: axis=1 in groupby is deprecated in pandas 2.1 and later
result = df.groupby(df.dtypes, axis=1)
result.groups

{int32: ['data1', 'data2'], object: ['key1', 'key2']}

In [84]:
# loop through the groups and print each one
result = df.groupby(df.dtypes, axis=1)

for name, group in result:
    print(name)
    print(group)
    print("-" * 25)

int32
   data1  data2
0      4      0
1      4      3
2      7      9
3      1      3
4      2      3
5      5      7
6      2      7
-------------------------
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
5    b  two
6    a  one
-------------------------


### Selecting a Column or Subset of Columns
Indexing a GroupBy object created from a DataFrame with a column name or array
of column names has the effect of column subsetting for aggregation. This means
that:
```python
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
```
are syntactic sugar for:
```python
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
```
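
The equivalence can be checked directly. The sketch below rebuilds a DataFrame with the same values as the `df` shown earlier (fixed here rather than random, so the example is reproducible) and compares both spellings.

```python
import pandas as pd

df = pd.DataFrame({"key1": ["a", "a", "b", "b", "a", "b", "a"],
                   "data1": [4, 4, 7, 1, 2, 5, 2],
                   "data2": [0, 3, 9, 3, 3, 7, 7]})

# a single column name selects a Series, so the result is a Series
short = df.groupby("key1")["data1"].mean()
long = df["data1"].groupby(df["key1"]).mean()

# a list of column names selects a DataFrame, so the result is a DataFrame
short_frame = df.groupby("key1")[["data2"]].mean()
long_frame = df[["data2"]].groupby(df["key1"]).mean()

print(type(short).__name__, type(short_frame).__name__)
print(short.tolist() == long.tolist())
print(short_frame["data2"].tolist() == long_frame["data2"].tolist())
```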


In [86]:
# group data of DataFrame 'df' by 'key1' then calculate the mean of column 'data2'

# df["data2"].groupby(df["key1"]).mean()
df.groupby("key1")["data2"].mean()


key1
a    3.250000
b    6.333333
Name: data2, dtype: float64

In [87]:
# single brackets ["data2"] give a SeriesGroupBy (Series result);
# double brackets [["data2"]] give a DataFrameGroupBy (DataFrame result)
df.groupby("key1")[["data2"]].mean()

         data2
key1
a     3.250000
b     6.333333


### Grouping with Dicts and Series

In [88]:
people = pd.DataFrame(np.random.randint(0, 10, (5,5)),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people

        a  b  c  d  e
Joe     1  3  0  1  9
Steve   9  3  5  1  5
Wes     1  3  1  7  0
Jim     5  9  7  8  3
Travis  2  5  9  1  8


In [89]:
mapping = {'Joe': 'red', 'Steve': 'red', 'Wes': 'blue',
           'Jim': 'blue', 'Travis': 'red', 'Elon' : 'orange'}

In [90]:
# group and sum the scores of teams red and blue
people.groupby(mapping).sum()

       a   b   c   d   e
blue   6  12   8  15   3
red   12  11  14   3  22


In [92]:
# convert the dict to a Series, then group and sum the scores of teams red and blue
s = pd.Series(mapping)
people.groupby(s).sum()

       a   b   c   d   e
blue   6  12   8  15   3
red   12  11  14   3  22
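
Note that the mapping above contains a key (`'Elon'`) that does not appear in the index, and grouping still works. The sketch below, on a small hypothetical frame, illustrates both directions: unused mapping keys are ignored, and rows whose label is missing from the mapping are left out of the result under the default settings.

```python
import pandas as pd

# small hypothetical frame, used only for this illustration
people = pd.DataFrame({"a": [1, 9, 1], "b": [3, 3, 3]},
                      index=["Joe", "Steve", "Wes"])

# 'Elon' is not in the index; 'Wes' is deliberately left out of the mapping
mapping = {"Joe": "red", "Steve": "red", "Elon": "orange"}

totals = people.groupby(mapping).sum()
print(totals)
```

Only the `red` group appears: there is no row to place in `orange`, and `Wes` maps to a missing key, which `groupby` drops by default.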


### Grouping with Functions

In [93]:
people.groupby(len).sum()

   a   b  c   d   e
3  7  15  8  16  12
5  9   3  5   1   5
6  2   5  9   1   8


In [94]:
people.groupby(lambda x: x[0]).sum()

   a   b  c  d   e
J  6  12  7  9  12
S  9   3  5  1   5
T  2   5  9  1   8
W  1   3  1  7   0
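
A function passed to `groupby` is called once per index label, and its return values become the group names. Functions can also be mixed with lists, dicts, or Series in a single call. The sketch below uses a reduced, fixed version of `people` and a hypothetical `key_list`.

```python
import pandas as pd

# reduced, fixed version of the people frame (two columns only)
people = pd.DataFrame({"a": [1, 9, 1, 5, 2], "e": [9, 5, 0, 3, 8]},
                      index=["Joe", "Steve", "Wes", "Jim", "Travis"])

# hypothetical second key, one entry per row
key_list = ["one", "one", "one", "two", "two"]

# group by name length and by key_list at the same time
result = people.groupby([len, key_list]).min()
print(result)
```

The result is indexed by both keys, so `result.loc[(3, "one")]` selects the minimum values for the three-letter names labelled `"one"`.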


## Independent Practice
- read the dataset `tips.csv`
- create a new column 'tip_pct', which is tip / total_bill
- replace the short-day name with the full-day name and convert it to upper-case
- calculate the average tip percent for smokers and non-smokers
- calculate the max and average tip percent for each time
- calculate the average tip percent for each day and time
- create dummy variables for the day and time columns


