# Data Cleaning and Preparation

In [1]:
import numpy as np
import pandas as pd


## Data Transformation (continued)
So far in this lesson we’ve been concerned with rearranging data. Filtering, cleaning,
and other transformations are another class of important operations.

### Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations.
Consider a DataFrame with some normally distributed data:

In [2]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

               0          1          2          3
count     1000.0     1000.0     1000.0     1000.0
mean     0.00638  -0.076284    0.01418  -0.032409
std     1.007102   1.000653   0.990268   0.972114
min    -3.247819  -3.387276  -3.322339  -3.280373
25%    -0.694036  -0.778923  -0.669937   -0.63556
50%    -0.027542  -0.086337   0.030056  -0.025112
75%     0.730032   0.625217   0.701318   0.606538
max      3.49111      2.798   2.948473   3.138961


In [1]:
# find values in one of the columns exceeding 3 in absolute value


In [2]:
# select all rows containing a value greater than 3 or less than -3


In [3]:
# set outliers to 3 or -3 depending on their sign
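
One possible way to work through the three steps above is sketched here. It assumes a seeded generator (the seed value is arbitrary) so the result is reproducible, and it picks column `2` purely for illustration; the names `col_outliers`, `outlier_rows`, and `capped` are not from the lesson.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (the seed is an arbitrary choice)
rng = np.random.default_rng(12345)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Values in one column (column 2 here) exceeding 3 in absolute value
col = data[2]
col_outliers = col[col.abs() > 3]

# Rows in which at least one column exceeds 3 in absolute value
outlier_rows = data[(data.abs() > 3).any(axis=1)]

# Cap values outside [-3, 3]; np.sign gives -1, 0 or 1 for each element,
# so multiplying by 3 yields the bound with the matching sign
capped = data.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3
```

For this particular capping rule, `data.clip(-3, 3)` should give the same result; the boolean-mask form is shown because it generalizes to other replacement rules.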


### Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications
is converting a categorical variable into a “dummy” or “indicator” matrix.

If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame
with k columns containing all 1s and 0s.

pandas has a `get_dummies` function for doing this:

In [9]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5


In [4]:
# create dummy variables for column 'key'


In [5]:
# create dummy variables for column 'key' and store the resulting columns in a variable


In [6]:
# join the dummies to the DataFrame
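
A minimal sketch of these three steps follows. The `prefix='key'` argument and the variable names `dummies` and `df_with_dummy` are illustrative choices, not something the lesson prescribes.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# One indicator column per distinct value of 'key'
plain_dummies = pd.get_dummies(df['key'])

# Same, with a prefix so the new columns stay recognizable after joining
dummies = pd.get_dummies(df['key'], prefix='key')

# Keep 'data1' and attach the indicator columns
df_with_dummy = df[['data1']].join(dummies)
```

Depending on the pandas version, the indicator columns may come back as booleans rather than 1s and 0s; passing `dtype=int` to `get_dummies` is one way to request integers.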


In [13]:
# read the dataset 'movies.dat' and display the first 10 rows of it

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_csv('movies.dat', sep='::', engine="python",
                     header=None, names=mnames)
movies[:10]

   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller


Adding dummy variables for each genre requires a little bit of wrangling.

In [7]:
# make a list 'genres' containing all distinct genres


In [8]:
# create a DataFrame 'dummies' of size (#movies, #genres) and fill it with zeros


In [9]:
# get the genres from the first row of 'movies' DataFrame, then get their indices in the 'dummies' DataFrame
# hint: use 'get_indexer' method


In [10]:
# for each value of the 'genres' column in the 'movies' DataFrame, set the corresponding columns in 'dummies' to 1


In [11]:
# join the DataFrames 'movies' and 'dummies', then display the first row
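
The wrangling steps above could be sketched as follows. To stay self-contained, the sketch does not read `movies.dat`; it rebuilds a three-row stand-in from the output shown earlier. The `Genre_` prefix and the names `all_genres`, `first_idx`, and `movies_windic` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Three-row stand-in for 'movies.dat', copied from the displayed output
movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Collect the distinct genres in order of first appearance
all_genres = []
for x in movies['genres']:
    all_genres.extend(x.split('|'))
genres = list(dict.fromkeys(all_genres))

# All-zero indicator matrix of shape (#movies, #genres)
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)

# Column positions of the first movie's genres
first = movies['genres'][0]
first_idx = dummies.columns.get_indexer(first.split('|'))

# For every movie, set the columns of its genres to 1
for i, gen in enumerate(movies['genres']):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

# Join the indicator columns to the original data and look at the first row
movies_windic = movies.join(dummies.add_prefix('Genre_'))
first_row = movies_windic.iloc[0]
```

The row-by-row loop is easy to follow but can be slow on large datasets.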


## String Manipulation
Python has long been a popular raw data manipulation language in part due to its
ease of use for string and text processing. Most text operations are made simple with
the string object’s **built-in methods**. 

For more complex pattern matching and text manipulations, **regular expressions** may be needed. 

**pandas** adds to the mix by enabling you to apply string methods and regular expressions concisely to whole arrays of data,
while also handling the annoyance of missing data.

### String Object Methods

In [20]:
# convert the string to a list using the ',' as separator
val = 'a,b,  guido'
val.split(",")

['a', 'b', '  guido']

In [21]:
# convert the string to a list using the ',' as separator and remove the extra spaces
pieces = val.split(",")
pieces = [v.strip() for v in pieces]
pieces

['a', 'b', 'guido']

In [22]:
# concatenate the parts back with separator '::'
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

In [23]:
# concatenate the parts back with separator '::'
"::".join(pieces)

'a::b::guido'

**check**: try the `in` operator and str methods `index`, `find`, `count` and `replace`

![](assets/built-in-str-methods.png)
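
As a quick illustration of the methods mentioned in the **check** above, here is a small sketch that reuses the `val` string from the earlier cells; the result variable names are arbitrary.

```python
val = 'a,b,  guido'

has_guido = 'guido' in val        # substring test with the in operator
comma_pos = val.index(',')        # position of the first ','; raises ValueError if absent
colon_pos = val.find(':')         # like index, but returns -1 when nothing is found
n_commas = val.count(',')         # number of non-overlapping occurrences
swapped = val.replace(',', '::')  # substitute one substring for another
removed = val.replace(',', '')    # replacing with '' deletes the substring
```

The main difference between `index` and `find` is how a missing substring is reported: an exception versus `-1`.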

### Regular Expressions
Regular expressions provide a flexible way to search or match (often more complex)
string patterns in text. A single expression, commonly called a regex, is a string
formed according to the regular expression language.

The `re` module functions fall into three categories: **pattern matching**, **substitution**,
and **splitting**.

In [24]:
import re
text = "foo    bar\t baz  \tqux"
# https://pythex.org



Suppose we wanted to split a string with a variable number of whitespace characters
(tabs, spaces, and newlines). The regex describing one or more whitespace characters
is `\s+`:

In [12]:
# split the string on runs of whitespace



In [13]:
# find all substrings that match the whitespace pattern


**Note:** Creating a regex object with `re.compile` is highly recommended if you intend to
apply the same expression to many strings; doing so will save CPU cycles.
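
The splitting and matching steps, with and without a compiled regex object, might look like the sketch below. It reuses the whitespace-laden `text` from the cell above; `parts`, `parts_again`, and `gaps` are illustrative names.

```python
import re

text = "foo    bar\t baz  \tqux"

# One-off call: re.split compiles the pattern internally, then splits the text
parts = re.split(r'\s+', text)

# Compile once and reuse the resulting regex object
regex = re.compile(r'\s+')
parts_again = regex.split(text)

# findall returns every substring that matched the pattern,
# here the runs of whitespace between the words
gaps = regex.findall(text)
```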

In [62]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [14]:
# get a list of all emails in the text


Relatedly, `sub` returns a new string with each occurrence of the pattern replaced by
a new string:
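
A minimal sketch of `findall`, `search`, and `sub` on the email text defined above. The replacement string `'REDACTED'` is an arbitrary placeholder chosen for this example.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

# findall returns a list of every match in the text
emails = regex.findall(text)

# search returns a match object for the first match only
first = regex.search(text)
first_email = text[first.start():first.end()]

# sub returns a new string with every match replaced
redacted = regex.sub('REDACTED', text)
```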


Suppose you wanted to find email addresses and simultaneously segment each
address into its three components: *username*, *domain name*, and *domain suffix*.

In [64]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

In [15]:
# find all emails in the text


In [16]:
# prefix each segment of the email with a suitable label
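
One way to approach these two cells is sketched below. With capture groups in the pattern, `findall` returns a tuple per match, and `sub` can refer to the groups as `\1`, `\2`, and `\3`. The label wording (`Username:`, `Domain:`, `Suffix:`) is an assumption made for illustration.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

# With groups in the pattern, findall returns one tuple per match
parts = regex.findall(text)

# groups() on a match object gives the components of a single address
m = regex.match('dave@google.com')
user, domain, suffix = m.groups()

# \1, \2 and \3 in the replacement refer to the captured groups
labeled = regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)
```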


![](assets/re-methods.png)

### Vectorized String Functions in pandas

In [33]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data


Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

Series has array-oriented methods for string operations that skip NA values. These are accessed through the Series's **str attribute**:

In [17]:
# check if the email is gmail
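
A small sketch of this check using `str.contains`. How the missing entry is reported can depend on the pandas version and dtype, so the sketch also shows the `na` argument, which sets an explicit result for missing values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Element-wise substring test; the missing entry does not raise an error
is_gmail = data.str.contains('gmail')

# na=False treats a missing email as "not gmail"
is_gmail_filled = data.str.contains('gmail', na=False)
```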


Regular expressions can be used, too, along with any `re` options like `IGNORECASE`:

In [18]:
# using the pattern declared earlier, find all parts of each email


In [19]:
# use the match method to check if the field matches an email or not
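
The last two cells might be approached as in the sketch below, which assumes the grouped email pattern from the Regular Expressions section. `str.findall` returns a list of tuples for each element, while `str.match` reports whether each element matches from its start.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One list of (username, domain, suffix) tuples per non-missing element
parts = data.str.findall(pattern, flags=re.IGNORECASE)

# True where the string matches the pattern from its first character
matches = data.str.match(pattern, flags=re.IGNORECASE)
```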


![](assets/series-str-methods.png)