# Cleaning Data for Analysis

## Data Types

There are times when we want to convert a column from one data type to another.

**Categorical Data**

Columns that contain categorical data, such as Male / Female, can be converted to the 'category' dtype. Doing so:
* Can make the DataFrame smaller in memory
* Can allow other Python libraries to make use of the categories
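As a rough illustration of the memory point, the sketch below compares the same values stored as 'object' and as 'category'. The series `sex` is made-up sample data, not the tips dataset, and the exact byte counts will vary by pandas version.

```python
import pandas as pd

# Hypothetical column with only two distinct labels, repeated many times
sex = pd.Series(['Male', 'Female'] * 5000, dtype='object')

# deep=True also counts the bytes used by the Python strings themselves
object_bytes = sex.memory_usage(deep=True)
category_bytes = sex.astype('category').memory_usage(deep=True)

print(f'object:   {object_bytes} bytes')
print(f'category: {category_bytes} bytes')
```

The saving comes from storing each distinct label once and keeping a small integer code per row, so it is largest when a column has few unique values.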

In [1]:
import pandas as pd
df = pd.read_csv('https://assets.datacamp.com/production/repositories/666/datasets/b064fa9e0684a38ac15b0a19845367c29fde978d/tips.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB


In [2]:
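# Converting Data Types
# Compare against 'Yes' to get real booleans (assumes the column holds 'Yes' / 'No');
# astype('bool') would turn every non-empty string, including 'No', into True
df['smoker'] = df['smoker'] == 'Yes'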
df['sex'] = df['sex'].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null bool
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: bool(1), category(1), float64(2), int64(1), object(2)
memory usage: 10.2+ KB


### Converting Data Types
* Numeric data loaded as strings is usually a sign of bad data that needs to be cleaned

In [3]:
# Converting total_bill and tip into a numeric dtype
# errors='coerce' will set values that cannot be parsed to NaN
df['total_bill'] = pd.to_numeric(df['total_bill'], errors='coerce')
df['tip'] = pd.to_numeric(df['tip'], errors='coerce')
df.dtypes

total_bill     float64
tip            float64
sex           category
smoker            bool
day             object
time            object
size             int64
dtype: object
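
The columns above were already numeric, so nothing was coerced. To see what `errors='coerce'` does with bad data, here is a minimal sketch on a made-up series (`raw` is hypothetical and not part of the tips dataset):

```python
import pandas as pd

# Hypothetical bill amounts stored as text; 'missing' is not a number
raw = pd.Series(['16.99', '10.34', 'missing', '23.68'], dtype='object')

# errors='coerce' converts unparseable entries to NaN instead of raising
clean = pd.to_numeric(raw, errors='coerce')

print(clean.dtype)         # float64
print(clean.isna().sum())  # 1
```

Counting the NaN values afterwards is a quick way to find out how many entries failed to convert.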

## String Manipulation

* Much of data cleaning involves string manipulation
* Much of the world's data is unstructured text
* Python has many built-in and external libraries for working with strings
* The built-in 're' library handles regular expressions

### Regular Expression Match Example

**\d** - Matches any single digit

**\*** - Matches the preceding element zero or more times

**{2}** - Matches the preceding element exactly 2 times

**^** - The caret anchors the match to the beginning of the value

**\$** - The dollar sign anchors the match to the end of the value

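|Example Value 1|Example Value 2    |Regular Expression  |
|---------------|-------------------|--------------------|
|17             |12345678901        |`\d*`               |
|\$17           |\$12345678901      |`\$\d*`             |
|\$17.00        |\$12345678901.24   |`\$\d*\.\d*`        |
|\$17.89        |\$12345678901.24   |`\$\d*\.\d{2}`      |
|\$17.895       |\$12345678901.999  |`^\$\d*\.\d{2}$`    |

The first four patterns match both example values. The anchored pattern in the last row requires exactly two digits after the decimal point, so it rejects both \$17.895 and \$12345678901.999.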
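
To check how the anchors change the result, the following sketch tests a few sample values against the two-decimal pattern with and without `^` and `$`. The variable names are illustrative only.

```python
import re

# '$', any digits, '.', exactly two digits, and nothing before or after
anchored = re.compile(r'^\$\d*\.\d{2}$')
# Same pattern without the anchors
unanchored = re.compile(r'\$\d*\.\d{2}')

results = {}
for value in ['$17.89', '$17.895', '$17']:
    # search() looks for the pattern anywhere in the value
    results[value] = (bool(anchored.search(value)), bool(unanchored.search(value)))
    print(value, results[value])
```

Without the anchors, `$17.895` still counts as a hit because the pattern finds `$17.89` inside it. With the anchors, the whole value has to fit.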

#### Using Regular Expressions

* Compile the pattern
* Use the compiled pattern to match values
* This lets us reuse the pattern over and over again
* Useful since we want to match values down a column of values

In [4]:
import re

# RegEx Pattern - Match a Phone Number in the format of xxx-xxx-xxxx
# A raw string (r'...') keeps Python from interpreting the backslashes itself
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches, starting from the beginning of each string
result = pattern.match('123-456-7890')
result2 = pattern.match('1123-456-7890')

print(bool(result))
print(bool(result2))

True
False
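
Matching values down a column, as mentioned in the list above, could look like the sketch below. The `phones` series is hypothetical sample data. It uses `fullmatch()` instead of `match()` on the assumption that trailing characters should also be rejected, since `match()` only anchors at the start of the string.

```python
import re
import pandas as pd

# Hypothetical column of phone numbers; not part of the tips dataset
phones = pd.Series(['123-456-7890', '1123-456-7890', '555-000-1111'])

# Compile once, then reuse the pattern for every value
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

# fullmatch() requires the whole value to fit the pattern, not just its start
is_valid = phones.apply(lambda value: bool(pattern.fullmatch(value)))
print(is_valid.tolist())  # [True, False, True]
```

The resulting boolean series can be used as a mask, for example `phones[~is_valid]`, to pull out the rows that need cleaning.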


In [5]:
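# Find the numeric values in a string
# findall() needs both the pattern and the string to search (the sentence is a made-up sample)
# \d+ (one or more digits) avoids the empty matches that \d* would return
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
print(matches)

['10', '1']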