## Converting data types
In this exercise, you'll see how ensuring all categorical variables in a DataFrame are of type category reduces memory usage.

The tips dataset has been loaded into a DataFrame called tips. This data contains information about how much a customer tipped, whether the customer was male or female, a smoker or not, etc.

Look at the output of tips.info() in the IPython Shell. You'll note that two columns that should be categorical - sex and smoker - are instead of type object, which is pandas' way of storing arbitrary strings. Your job is to convert these two columns to type category and note the reduced memory usage.

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# Import tips.csv to a DataFrame
tips = pd.read_csv('D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/tips.csv')
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB


In [4]:
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
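
The `+` in the `info()` output means the reported figure is only a lower bound, because the strings held by `object` columns are not counted. As a rough sketch of how the saving could be measured more precisely, the block below builds a small made-up DataFrame (the name `demo` and its values are assumptions, not the tips data) and compares `memory_usage(deep=True)` before and after conversion.

```python
import pandas as pd

# Hypothetical stand-in for the tips data: two low-cardinality string columns.
demo = pd.DataFrame({
    "sex": ["Female", "Male"] * 500,
    "smoker": ["No", "Yes", "No", "No"] * 250,
})

# deep=True also counts the bytes held by the string values themselves,
# which the "+" in the info() output leaves out.
before = demo.memory_usage(deep=True).sum()

# Convert both columns in one call by passing a column -> dtype mapping.
converted = demo.astype({"sex": "category", "smoker": "category"})
after = converted.memory_usage(deep=True).sum()

print(f"string columns:   {before} bytes")
print(f"category columns: {after} bytes")
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it is largest when a column has few unique values.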


## Working with numeric data
If you expect the data type of a column to be numeric (int or float), but it is instead of type object, this typically means that the column contains at least one non-numeric value, which is a sign of bad data.

You can use the **`pd.to_numeric()`** function to convert a column into a numeric data type. If the function raises an error, you can be sure that there is a bad value within the column. 

You can either use the techniques you learned in Chapter 1 to do some exploratory data analysis and find the bad value, or you can choose to ignore or coerce the value into a missing value, NaN.
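
The two options described above might look like the following sketch. The Series `raw` and its values are made up for illustration; only `pd.to_numeric()` and its `errors` parameter come from the text.

```python
import pandas as pd

# Hypothetical column containing a dash placeholder among numeric strings.
raw = pd.Series(["16.99", "-", "21.01", "-", "24.59"])

# Option 1: the default errors='raise' stops at the first unparseable value.
try:
    pd.to_numeric(raw)
    raised = False
except ValueError as exc:
    raised = True
    print("to_numeric failed:", exc)

# Option 2: coerce bad values to NaN, then compare against the original
# to list exactly which entries could not be parsed.
coerced = pd.to_numeric(raw, errors="coerce")
bad_values = raw[coerced.isna() & raw.notna()]
print(bad_values.tolist())
```

Listing the rejected entries before discarding them is a cheap way to confirm that coercion only removed placeholders and not legitimate values in an unexpected format.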

A modified version of the tips dataset has been pre-loaded into a DataFrame called tips2. For instructional purposes, it has been pre-processed to introduce some 'bad' data for you to clean. Use the .info() method to explore this. You'll note that the total_bill and tip columns, which should be numeric, are instead of type object. Your job is to fix this.

In [6]:
# Import tips2.csv to a DataFrame
tips2 = pd.read_csv('D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/tips2.csv')
# Convert the sex column to type 'category'
tips2.sex = tips2.sex.astype('category')

# Convert the smoker column to type 'category'
tips2.smoker = tips2.smoker.astype('category')
tips2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null object
tip           244 non-null object
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), int64(1), object(4)
memory usage: 10.3+ KB


In [7]:
# rows 0 and 2 have dashes inserted
tips2.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,-,-,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,-,-,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [9]:
# Convert 'total_bill' to a numeric dtype
tips2['total_bill'] = pd.to_numeric(tips2['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips2['tip'] = pd.to_numeric(tips2['tip'], errors='coerce')

# Print the info of tips2
tips2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    242 non-null float64
tip           242 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB


In [10]:
# Dashes are changed to NaN
tips2.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,,,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,,,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


## String parsing with regular expressions
When working with data, it is sometimes necessary to write a regular expression to look for properly entered values. Phone numbers are a common field that needs to be checked for validity.

Your job in this exercise is to define a regular expression to match US phone numbers that fit the pattern of xxx-xxx-xxxx.

The regular expression module in Python is **re**. When performing pattern matching on data, since the pattern will be used for a match across multiple rows, it's better to compile the pattern first using **`re.compile()`**, and then use the compiled pattern to match values.

In [11]:
# Import the regular expression module
import re

# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))

# See if the pattern matches
result2 = prog.match('1123-456-7890')
print(bool(result2))

True
False
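
One caveat worth illustrating: `match()` only anchors the pattern at the start of the string, so trailing characters are not rejected. The sketch below uses made-up inputs and shows `fullmatch()` as one way to require that the whole string fits `xxx-xxx-xxxx`.

```python
import re

# Same phone-number pattern, written as a raw string.
prog = re.compile(r"\d{3}-\d{3}-\d{4}")

# match() checks only from the start, so an extra trailing digit slips through.
loose = bool(prog.match("123-456-78901"))

# fullmatch() requires the entire string to fit the pattern.
strict = bool(prog.fullmatch("123-456-78901"))
valid = bool(prog.fullmatch("123-456-7890"))

print(loose, strict, valid)  # True False True
```

Adding `$` to the end of the pattern and keeping `match()` would be a similar alternative.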


## Extracting numerical values from strings
Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: 'the recipe calls for 10 strawberries and 1 banana'.

It would be useful to extract the 10 and the 1 from this string to be saved for later use when comparing strawberry to banana ratios.

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the **`re.findall()`** function.

In [13]:
# \d is the pattern required to find digits.
# It should be followed by a + so that the previous element is matched one or more times.
# This ensures that 10 is viewed as one number and not as 1 and 0.

# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
matches

['10', '1']
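
Note that `re.findall()` returns strings, so the matches need converting before any arithmetic. A minimal sketch of the ratio mentioned above, with variable names chosen here for illustration:

```python
import re

text = "the recipe calls for 10 strawberries and 1 banana"

# findall() returns ['10', '1']; convert each match to int before computing.
strawberries, bananas = (int(n) for n in re.findall(r"\d+", text))

ratio = strawberries / bananas
print(strawberries, bananas, ratio)  # 10 1 10.0
```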

## Pattern matching
In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.

In [14]:
# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
pattern1

True

In [15]:
# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
pattern2

True

In [16]:
# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
pattern3

True

## Custom functions to clean data
The tips dataset has been loaded into a DataFrame called tips3. It has a **`'sex'`** column that contains the values **`'Male'`** or **`'Female'`**. Your job is to write a function that will recode **`'Female'`** to 0 and **`'Male'`** to 1, and return **`np.nan`** for all entries of 'sex' that are neither 'Female' nor 'Male'.

Recoding variables like this is a common data cleaning task. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code. This makes your code more readable and less error prone.

You can use the **`.apply()`** method to apply a function across entire rows or columns of DataFrames. 

However, note that each column of a DataFrame is a pandas Series. Functions can also be applied across Series. Here, you will apply your function over the **'sex'** column.

In [18]:
# Import tips3.csv
tips3 = pd.read_csv('D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/tips3.csv')
tips3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    202 non-null float64
tip           220 non-null float64
sex           234 non-null object
smoker        229 non-null object
day           243 non-null object
time          232 non-null object
size          231 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB


In [19]:
# Define recode_gender()
def recode_gender(gender):

    # Return 0 if gender is 'Female'
    if gender == 'Female':
        return 0
    
    # Return 1 if gender is 'Male'    
    elif gender == 'Male':
        return 1
    
    # Return np.nan    
    else:
        return np.nan

# Apply the function to the sex column
tips3['recode'] = tips3.sex.apply(recode_gender)

# Print the first five rows of tips3
tips3.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,recode
0,16.99,1.01,Female,No,Sun,Dinner,2.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3.0,1.0
2,21.01,3.5,Male,No,Sun,Dinner,3.0,1.0
3,23.68,,Male,No,Sun,Dinner,2.0,1.0
4,24.59,,Female,No,Sun,Dinner,4.0,0.0
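
For a simple value-to-value recode like this one, passing a dictionary to `Series.map()` is a possible alternative to a custom function. The sketch below runs on a made-up Series rather than tips3; any value missing from the dictionary, including NaN, comes back as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical 'sex' values, including a missing entry and an unexpected label.
sex = pd.Series(["Female", "Male", np.nan, "unknown"])

# Values not present in the mapping become NaN, mirroring the else branch.
recoded = sex.map({"Female": 0, "Male": 1})

print(recoded.tolist())
```

The result is a float column because NaN cannot be stored in a plain integer column, which is also why the `recode` column above shows 0.0 and 1.0.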


## Lambda functions
You'll now be introduced to a powerful Python feature that will help you clean your data more effectively: lambda functions. Instead of using the def syntax that you used in the previous exercise, lambda functions let you make simple, one-line functions.

For example, here's a function that squares a variable used in an .apply() method:

def my_square(x):
    return x ** 2

df.apply(my_square)

The equivalent code using a lambda function is:

df.apply(lambda x: x ** 2)

The lambda function takes one parameter - the variable x. The function itself just squares x and returns the result, which is whatever the one line of code evaluates to. In this way, lambda functions can make your code concise and Pythonic.

Your job is to clean the 'total_dollar' column of the tips4 DataFrame by removing the dollar sign. You'll do this using two different methods: with the .replace() method, and with regular expressions.

In [20]:
# Import tips4.csv
tips4 = pd.read_csv('D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/tips4.csv')
tips4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
total_bill      244 non-null float64
tip             244 non-null float64
sex             244 non-null object
smoker          244 non-null object
day             244 non-null object
time            244 non-null object
size            244 non-null int64
total_dollar    244 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 15.3+ KB


In [21]:
tips4.total_dollar.head()

0     $16.99 
1     $10.34 
2     $21.01 
3     $23.68 
4     $24.59 
Name: total_dollar, dtype: object

In [22]:
# Write the lambda function using replace
tips4['total_dollar_replace'] = tips4.total_dollar.apply(lambda x: x.replace('$', ''))

# Write the lambda function using regular expressions
tips4['total_dollar_re'] = tips4.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])

# Print the head of tips4
tips4.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,total_dollar,total_dollar_replace,total_dollar_re
0,16.99,1.01,Female,No,Sun,Dinner,2,$16.99,16.99,16.99
1,10.34,1.66,Male,No,Sun,Dinner,3,$10.34,10.34,10.34
2,21.01,3.5,Male,No,Sun,Dinner,3,$21.01,21.01,21.01
3,23.68,3.31,Male,No,Sun,Dinner,2,$23.68,23.68,23.68
4,24.59,3.61,Female,No,Sun,Dinner,4,$24.59,24.59,24.59
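
Both new columns still hold strings, not numbers. As a sketch of one way to finish the job, the block below uses the vectorized `.str.replace()` accessor on a made-up Series and then converts the result with `pd.to_numeric()`; this is an alternative shown for comparison, not part of the exercise.

```python
import pandas as pd

# Hypothetical sample of the 'total_dollar' values.
total_dollar = pd.Series(["$16.99", "$10.34", "$21.01"])

# regex=False treats '$' as a literal character rather than a regex anchor.
stripped = total_dollar.str.replace("$", "", regex=False)

# Convert the cleaned strings to floats so they can be used in calculations.
cleaned = pd.to_numeric(stripped)

print(cleaned.tolist())  # [16.99, 10.34, 21.01]
```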


## Filling missing data
Here, you'll return to the airquality dataset from Chapter 2. Explore airquality in the IPython Shell to check out which columns have missing values.

It's rare to have a (real-world) dataset without any missing values, and it's important to deal with them because certain calculations cannot handle missing values while some calculations will, by default, skip over any missing values.

Also, understanding how much missing data you have, and thinking about where it comes from is crucial to making unbiased interpretations of data.

In [23]:
# import DataFrame
airquality = pd.read_csv('D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/airquality.csv')
airquality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      116 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB


In [24]:
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

# Print the info of airquality
print(airquality.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      153 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB
None
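
A small self-contained sketch of mean imputation on a single column, using made-up values (the DataFrame `air` is an assumption, not the airquality data). Calling `.fillna()` on the column itself keeps the other columns untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the airquality data.
air = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan],
    "Solar.R": [190.0, 118.0, np.nan, 313.0],
})

# mean() skips NaN by default, so this is the mean of 41.0 and 12.0.
oz_mean = air["Ozone"].mean()

# Fill only the Ozone column; Solar.R keeps its missing value.
air["Ozone"] = air["Ozone"].fillna(oz_mean)

print(air)
```

Filling with the mean keeps the column average unchanged but shrinks its variance, so it is worth recording how many values were imputed.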


## Testing your data with asserts
Here, you'll practice writing assert statements using the Ebola dataset from previous chapters to programmatically check for missing values and to confirm that all values are positive. 

In the video, you saw Dan use the **`.all()`** method together with the **`.notnull()`** DataFrame method to check for missing values in a column. 

The **`.all()`** method returns True if all values are True. When used on a DataFrame, it returns a Series of Booleans - one for each column in the DataFrame. 

So if you are using it on a DataFrame, like in this exercise, you need to chain another .all() method so that you get back a single True or False value. When the condition in an assert statement is True, the statement produces no output; when it is False, an AssertionError is raised. This is how you can confirm that the data you are checking are valid.

Note: You can use pd.notnull(df) as an alternative to df.notnull().

In [25]:
# Import ebola as a DataFrame
ebola = pd.read_csv('D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/ebola.csv')
ebola.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):
Date                   122 non-null object
Day                    122 non-null int64
Cases_Guinea           93 non-null float64
Cases_Liberia          83 non-null float64
Cases_SierraLeone      87 non-null float64
Cases_Nigeria          38 non-null float64
Cases_Senegal          25 non-null float64
Cases_UnitedStates     18 non-null float64
Cases_Spain            16 non-null float64
Cases_Mali             12 non-null float64
Deaths_Guinea          92 non-null float64
Deaths_Liberia         81 non-null float64
Deaths_SierraLeone     87 non-null float64
Deaths_Nigeria         38 non-null float64
Deaths_Senegal         22 non-null float64
Deaths_UnitedStates    18 non-null float64
Deaths_Spain           16 non-null float64
Deaths_Mali            12 non-null float64
dtypes: float64(16), int64(1), object(1)
memory usage: 17.2+ KB


In [26]:
# Assert that there are no missing values
assert ebola.notnull().all().all()

AssertionError: 

In [28]:
# Assert that all values are >= 0
assert (ebola >= 0).all().all()

AssertionError: 

In [29]:
ebola1 = ebola.fillna(value=9)
ebola1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):
Date                   122 non-null object
Day                    122 non-null int64
Cases_Guinea           122 non-null float64
Cases_Liberia          122 non-null float64
Cases_SierraLeone      122 non-null float64
Cases_Nigeria          122 non-null float64
Cases_Senegal          122 non-null float64
Cases_UnitedStates     122 non-null float64
Cases_Spain            122 non-null float64
Cases_Mali             122 non-null float64
Deaths_Guinea          122 non-null float64
Deaths_Liberia         122 non-null float64
Deaths_SierraLeone     122 non-null float64
Deaths_Nigeria         122 non-null float64
Deaths_Senegal         122 non-null float64
Deaths_UnitedStates    122 non-null float64
Deaths_Spain           122 non-null float64
Deaths_Mali            122 non-null float64
dtypes: float64(16), int64(1), object(1)
memory usage: 17.2+ KB


In [31]:
# Assert that there are no missing values
assert ebola1.notnull().all().all()

# Assert that all values are >= 0
assert (ebola1 >= 0).all().all()
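
The chained `.all().all()` checks can be sketched on a small made-up DataFrame as follows. Restricting the comparison to numeric columns with `select_dtypes()` and filling with 0 are assumptions made for this illustration; comparing a string column such as `Date` against 0 may raise a TypeError in some pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Ebola data: one string column, two numeric.
df = pd.DataFrame({
    "Date": ["1/5/2015", "1/4/2015"],
    "Day": [289, 288],
    "Cases_Guinea": [2776.0, np.nan],
})

# Keep only numeric columns so the >= 0 comparison is well defined.
numeric = df.select_dtypes(include="number")

# The first .all() gives one Boolean per column, the second reduces to one value.
has_missing = not numeric.notnull().all().all()
print("missing values present:", has_missing)

# Fill value chosen for illustration only.
filled = numeric.fillna(0)

# Both assertions pass silently; the message is shown only on failure.
assert filled.notnull().all().all(), "missing values remain"
assert (filled >= 0).all().all(), "negative values found"
```

Adding a message after the comma makes a failing assertion easier to diagnose than a bare `AssertionError:`.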