# Lesson I

## Membership Constraints

In this chapter, we're going to take a look at common data problems with text and categorical data.

In this lesson, we'll focus on categorical variables. **Categorical data** represents variables that take values from a predefined, finite set of categories.

| **Type of Data** | **Example Values** | **Numeric Representation** |
| -------------|----------------|------------------------|
| Marriage status | ``unmarried``, ``married`` | ``0``, ``1`` |
| Household income Category | ``0-20K``, ``20-40K``, ... | ``0``, ``1``, ... |
| Loan Status | ``default``, ``payed``, ``no_loan`` | ``0``, ``1``, ``2`` |

To run machine learning models on categorical data, they are often coded as numbers. Since categorical data represent a predefined set of categories, they can't have values that go beyond these predefined categories.
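As a rough illustration of the numeric coding described above, the sketch below casts a column to a categorical dtype with a fixed category list. The toy values (including the out-of-set ``'unknown'``) are assumptions made up for this example; only the three loan status categories come from the table.

```python
import pandas as pd

# Allowed categories, taken from the "Loan Status" row of the table above
allowed = ['default', 'payed', 'no_loan']

# Hypothetical toy column; 'unknown' is deliberately outside the allowed set
loan_status = pd.Series(['payed', 'default', 'no_loan', 'payed', 'unknown'])

# Cast to a categorical dtype whose categories are fixed in advance
loan_cat = loan_status.astype(pd.CategoricalDtype(categories=allowed))

# Integer codes follow the order of `allowed`;
# values outside the category list become NaN and get the code -1
codes = loan_cat.cat.codes
print(codes.tolist())
```

A code of ``-1`` is therefore a quick signal that a value fell outside the predefined categories.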

### Why could we have these problems?

* Data Entry Errors
    - Free text
    - Dropdowns
* Parsing Errors

### How do we treat these problems?

* Dropping Data
* Remapping Categories
* Inferring Categories

### An Example

Here's a DataFrame named ``study_data`` containing a list of ``first names``, ``birth dates``, and ``blood types``. 

```python
# Read study data and print it
study_data = pd.read_csv('study.csv')
study_data
```

<img src='pictures/study.jpg' width=450 align=left />

Additionally, a DataFrame named ``categories``, containing the correct possible categories for the blood type column, has been created as well.

```python
# Correct possible blood types
categories
```

<img src='pictures/blood.jpg' width=150/>

Notice the inconsistency here? There's definitely no blood type named **Z+**. Luckily, the ``categories`` DataFrame will help us systematically spot all rows with these inconsistencies. 

It's always good practice to keep a log of all possible values of your categorical data, as it will make dealing with these types of inconsistencies way easier.

### A note on joins

Before moving on to dealing with these inconsistent values, let's have a brief reminder on joins. The two main types of joins we care about here are **anti joins** and **inner joins**.

#### Anti Join

**Anti joins** take in *two* DataFrames, A and B, and return the data from one DataFrame that is not contained in the other.

<img src='pictures/antijoin.jpg' />

In this example, we are performing a left anti join of A and B: it returns the rows of A whose values in the common column being joined on are found only in A, and not in B.

#### Inner Join

Inner joins return only the data that is contained in both DataFrames.

<img src='pictures/innerjoin.jpg' />

For example, an inner join of A and B would return columns from both DataFrames, for the values of the common column being joined on that are found in both A and B.

#### A left anti join on blood types

* What is in ``study_data`` only
    - Returns only rows containing Z+

#### An inner join on blood types

* What is in ``study_data`` and ``categories`` only
    - Returns all rows except those containing Z+, B+ and AB-
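The two joins can also be written out explicitly with ``merge()``. The sketch below is only an illustration: the toy rows and the ``name`` column are assumptions, and the lesson itself uses sets and ``isin()`` rather than ``merge()``.

```python
import pandas as pd

# Hypothetical toy versions of the two DataFrames
study_data = pd.DataFrame({
    'name': ['Beth', 'Ignatius', 'Paul', 'Jennifer'],
    'blood_type': ['B-', 'A-', 'O+', 'Z+'],
})
categories = pd.DataFrame(
    {'blood_type': ['O-', 'O+', 'A-', 'A+', 'B+', 'B-', 'AB+', 'AB-']}
)

# Left join with an indicator column recording where each row was found
merged = study_data.merge(categories, on='blood_type', how='left', indicator=True)

# Left anti join: rows of study_data with no matching blood_type in categories
anti = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Inner join: rows whose blood_type appears in both DataFrames
inner = study_data.merge(categories, on='blood_type', how='inner')

print(anti)
print(inner)
```

With this toy data, the anti join keeps only the ``Z+`` row and the inner join keeps the other three.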

### Finding inconsistent categories

Let's see how to do that in Python:

```python
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)

'''
{'Z+'} # Output
'''
```

We first get all inconsistent categories in the ``blood_type`` column of the ``study_data`` DataFrame. 

We do that by creating a *set* out of the ``blood_type`` column which stores its unique values, and use the ``difference()`` method which takes in as argument the ``blood_type`` column from the ``categories`` DataFrame. This returns all the categories in ``blood_type`` that are not in categories.

```python
# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
study_data[inconsistent_rows]

'''
5 Jennifer 2019-12-17   Z+  # Output Row
'''
```

We then find the inconsistent rows by checking which values of the ``blood_type`` column belong to the inconsistent categories, using the ``isin()`` method. This returns a Series of boolean values that are ``True`` for inconsistent rows and ``False`` for consistent ones. 

We then subset the ``study_data`` DataFrame based on these boolean values, and voila we have our inconsistent data.

### Dropping Inconsistent Categories

To drop the inconsistent rows and keep only the consistent ones, we use the tilde symbol (``~``) while subsetting, which returns everything except the inconsistent rows.

```python
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]

# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]
```
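A slightly shorter variant, shown here as a sketch on made-up toy data, skips the intermediate set and tests membership against the allowed values directly. The closing ``assert`` is an optional guard added for this example and is not part of the lesson's code.

```python
import pandas as pd

# Hypothetical toy data
study_data = pd.DataFrame({
    'name': ['Beth', 'Paul', 'Jennifer'],
    'blood_type': ['B-', 'O+', 'Z+'],
})
categories = pd.DataFrame(
    {'blood_type': ['O-', 'O+', 'A-', 'A+', 'B+', 'B-', 'AB+', 'AB-']}
)

# True where the value is NOT one of the allowed categories
is_inconsistent = ~study_data['blood_type'].isin(categories['blood_type'])

inconsistent_data = study_data[is_inconsistent]
consistent_data = study_data[~is_inconsistent]

# Guard: every remaining value must be an allowed category
assert consistent_data['blood_type'].isin(categories['blood_type']).all()
```

Both versions select the same rows; the set-based one has the advantage of printing the offending category names first.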

## Exercise

### Finding Consistency

In this exercise and throughout this chapter, you'll be working with the ``airlines`` DataFrame which contains survey responses on the San Francisco Airport from airline customers.

The DataFrame contains flight metadata such as the *airline*, the *destination*, *waiting times* as well as answers to key questions regarding *cleanliness*, *safety*, and *satisfaction*. Another DataFrame named ``categories`` was created, containing all correct possible values for the survey columns.

In this exercise, you will use both of these DataFrames to find survey answers with inconsistent values, and drop them, effectively performing an anti join and an inner join on both these DataFrames as seen in the video exercise.

```python
# Import packages
import pandas as pd

# Airlines dataset
airlines = pd.read_csv('datasets/airlines_final.csv')
# Categories dataset
categories = pd.read_csv('datasets/categories.csv')

# Print categories DataFrame
print(categories)

# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
print('Safety: ', airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(), "\n")
```

```
      cleanliness           safety          satisfaction
0           Clean          Neutral        Very satisfied
1         Average        Very safe               Neutral
2  Somewhat clean    Somewhat_safe    Somewhat_satisfied
3  Somewhat dirty      Very_unsafe  Somewhat_unsatisfied
4           Dirty  Somewhat_unsafe      Very_unsatisfied
Cleanliness:  ['Clean' 'Average' 'Somewhat clean' 'Somewhat dirty' 'Dirty'] 

Safety:  ['Neutral' 'Very safe' 'Somewhat safe' 'Very unsafe' 'Somewhat unsafe'] 

Satisfaction:  ['Very satisfied' 'Neutral' 'Somewhat satsified' 'Somewhat unsatisfied'
 'Very unsatisfied'] 
```



```python
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

# Print rows with consistent categories only
print(airlines[~cat_clean_rows])
```

```
Empty DataFrame
Columns: [Unnamed: 0, id, day, airline, destination, dest_region, dest_size, boarding_area, dept_time, wait_min, cleanliness, safety, satisfaction]
Index: []
      Unnamed: 0    id        day        airline        destination  \
0              0  1351    Tuesday    UNITED INTL             KANSAI   
1              1   373     Friday         ALASKA  SAN JOSE DEL CABO   
2              2  2820   Thursday          DELTA        LOS ANGELES   
3              3  1157    Tuesday      SOUTHWEST        LOS ANGELES   
4              4  2992  Wednesday       AMERICAN              MIAMI   
...          ...   ...        ...            ...                ...   
2472        2804  1475    Tuesday         ALASKA       NEW YORK-JFK   
2473        2805  2222   Thursday      SOUTHWEST            PHOENIX   
2474        2806  2684     Friday         UNITED            ORLANDO   
2475        2807  2549    Tuesday        JETBLUE         LONG BEACH   
2476        2808  2162   Saturday  CHINA EAST
```

# Lesson II

## Categorical Variables

We can have other types of problems that could affect categorical variables.

### What type of errors could we have?

- **Value inconsistency**
    * *Inconsistent fields:* ``'married'``, ``'Married'``, ``'UNMARRIED'``, ``'not_married'`` ...
    * *Trailing white spaces:* ``'married'``, ``'married '`` ...
- **Collapsing too many categories into few**
    * *Creating new groups:* ``0-20K``, ``20-40K`` categories ... from continuous household income data
    * *Mapping groups to new ones:* mapping household income categories to two groups, ``'rich'`` and ``'poor'``
- **Making sure data is of the right type**
    * ``category`` (seen in Chapter I)

### Value consistency

Let's start with making sure our categorical data is consistent. A common categorical data problem is having values that slightly differ because of **capitalization**.

*  ``'married'``, ``'Married'``, ``'UNMARRIED'``, ``'not_married'``

For example, let's assume we're working with a demographics dataset, and we have a marriage status column with inconsistent capitalization. 

```python
# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()
```
Here's what counting the number of married people in the ``marriage_status`` *Series* would look like. Note that ``.value_counts()`` is called on a *Series* here; ``DataFrame.value_counts()`` is only available in pandas 1.1 and later.

<img src='pictures/valueconsistency.jpg' />

For a DataFrame, we can group by the column and use the ``.count()`` method.

```python
# Get value counts on DataFrame
demographics.groupby('marriage_status').count()
```

To deal with this, we can either **capitalize** or **lowercase** the ``marriage_status`` column. This can be done with the ``.str.upper()`` or ``.str.lower()`` methods respectively.

```python
# Capitalize
demographics['marriage_status'] = demographics['marriage_status'].str.upper()

# Lowercase
demographics['marriage_status'] = demographics['marriage_status'].str.lower()
```
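To make the effect visible, here is a small self-contained sketch on invented toy values that compares the counts before and after lowercasing. Note that a value such as ``'not_married'`` differs by more than capitalization, so it would still need to be remapped separately.

```python
import pandas as pd

# Hypothetical toy column with inconsistent capitalization
demographics = pd.DataFrame({
    'marriage_status': ['married', 'Married', 'UNMARRIED', 'unmarried', 'MARRIED'],
})

# Before: every spelling is counted as its own category
before = demographics['marriage_status'].value_counts()

# Lowercase the column, then count again
demographics['marriage_status'] = demographics['marriage_status'].str.lower()
after = demographics['marriage_status'].value_counts()

print(len(before), 'categories before,', len(after), 'after')
print(after)
```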

Another common problem with categorical values is *leading* or *trailing* spaces. 

* ``'married'``, ``'married '``...

For example, imagine the same demographics DataFrame containing values with leading or trailing spaces. 

```python
# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()
```


Here's what the counts of married vs unmarried people would look like. Note that there is a married category with a trailing space on the right, which makes it hard to spot on the output, as opposed to unmarried.

<img src='pictures/trailing.jpg' />

To remove them, we can use the ``.str.strip()`` method which, when given no input, strips all leading and trailing white spaces.

```python
# Strip all spaces
demographics['marriage_status'] = demographics['marriage_status'].str.strip()
```
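Because stray spaces are hard to see in printed output, it can help to inspect the values with ``repr()`` before stripping. The sketch below uses made-up toy values to show this.

```python
import pandas as pd

# Hypothetical toy column with leading and trailing spaces
demographics = pd.DataFrame({
    'marriage_status': ['married', 'married ', ' unmarried', 'unmarried'],
})

# repr() wraps each value in quotes, which makes the spaces visible
print([repr(v) for v in demographics['marriage_status'].unique()])

# Strip whitespace and assign the result back to the same column
demographics['marriage_status'] = demographics['marriage_status'].str.strip()
print(demographics['marriage_status'].value_counts())
```

Assigning back to the column, rather than to ``demographics`` itself, keeps the rest of the DataFrame intact.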

### Collapsing data into categories

Sometimes, we may want to create categories out of our data, such as creating household income groups from income data. Let's use the example of creating an ``income_group`` column in the ``demographics`` DataFrame. We can do this in two ways. 

```python
# Using qcut()
import pandas as pd
group_names = ['0-200K', '200K-500K', '500K+']
demographics['income_group'] = pd.qcut(demographics['household_income'], q=3,
                                        labels=group_names)
```

The first method uses the ``qcut()`` function from ``pandas``, which divides our data into quantile-based groups of roughly equal size, with the number of groups set in the ``q`` argument. We created the category names in the ``group_names`` list and fed it to the ``labels`` argument, returning the following. 

```python
# Print income_group column
demographics[['income_group', 'household_income']]
```

<img src='pictures/collapsing.jpg' />

Notice that the label in the first row misrepresents the actual income of that row, as we didn't tell ``qcut()`` where our ranges actually lie.

```python
# Using cut() - create category ranges and names
import numpy as np
ranges = [0, 200000, 500000, np.inf]
group_names = ['0-200K', '200K-500K', '500K+']

# Create income group column
demographics['income_group'] = pd.cut(demographics['household_income'], bins=ranges, labels=group_names)
```

We can do this with the ``cut()`` function instead, which lets us define category cutoff ranges with the ``bins`` argument. It takes in a *list* of cutoff points for each category, with the final one being infinity, represented with ``np.inf``. 

```python
demographics[['income_group', 'household_income']]
```

<img src='pictures/cut.jpg' />

From the output, we can see that each label now matches the income range it describes.
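The difference between the two functions is easier to see side by side. The following sketch uses invented income values (an assumption for this example, not the data in the pictures) and stores both results in separate, hypothetically named columns.

```python
import numpy as np
import pandas as pd

# Hypothetical household incomes
demographics = pd.DataFrame({
    'household_income': [40_000, 90_000, 150_000, 180_000, 320_000, 750_000],
})
group_names = ['0-200K', '200K-500K', '500K+']

# qcut(): three equally sized groups, with edges taken from the data's quantiles
demographics['by_quantile'] = pd.qcut(demographics['household_income'], q=3,
                                      labels=group_names)

# cut(): edges fixed by us, so each label matches the range it names
ranges = [0, 200_000, 500_000, np.inf]
demographics['by_range'] = pd.cut(demographics['household_income'],
                                  bins=ranges, labels=group_names)

print(demographics)
```

With these values, ``qcut()`` places 150,000 and 180,000 in the ``'200K-500K'`` group because it only balances group sizes, while ``cut()`` keeps them in ``'0-200K'``.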

Sometimes, we may want to *reduce* the number of categories we have in our data. Let's move on to mapping categories to fewer ones.

* ``operating_system`` column is : ``'Microsoft'``, ``'MacOS'``, ``'IOS'``, ``'Android'``, ``'Linux'``
* ``operating_system`` column **should** become: ``'DesktopOS'``, ``'MobileOS'``

We can do this using the ``replace()`` method. It takes in a *dictionary* that maps each existing category to the category name you desire. 

```python
# Create mapping dictionary and replace
mapping = {'Microsoft':'DesktopOS', 'MacOS':'DesktopOS', 'IOS':'MobileOS', 'Android':'MobileOS', 'Linux':'DesktopOS'}
devices['operating_system'] = devices['operating_system'].replace(mapping)
```

Here, ``mapping`` is that dictionary. A quick print of the unique values of ``operating_system`` shows that the mapping is complete.
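Since ``replace()`` leaves any value that is missing from the dictionary untouched, it may be worth checking the dictionary's coverage before applying it. The sketch below assumes a toy ``devices`` DataFrame; the coverage check is an addition for this example, not part of the lesson.

```python
import pandas as pd

# Hypothetical toy data
devices = pd.DataFrame({
    'operating_system': ['Microsoft', 'MacOS', 'IOS', 'Android', 'Linux', 'IOS'],
})

mapping = {'Microsoft': 'DesktopOS', 'MacOS': 'DesktopOS', 'Linux': 'DesktopOS',
           'IOS': 'MobileOS', 'Android': 'MobileOS'}

# Categories present in the data but absent from the mapping would pass through unchanged
unmapped = set(devices['operating_system']) - set(mapping)
assert not unmapped, f"Unmapped categories: {unmapped}"

# Apply the mapping and verify the result
devices['operating_system'] = devices['operating_system'].replace(mapping)
print(devices['operating_system'].unique())
```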

## Exercise

### Inconsistent Categories

In this exercise, you'll be revisiting the ``airlines`` DataFrame from the previous lesson.

As a reminder, the DataFrame contains flight metadata such as the *airline*, the *destination*, *waiting times* as well as *answers* to key questions regarding *cleanliness*, *safety*, and *satisfaction* on the San Francisco Airport.

In this exercise, you will examine *two* categorical columns from this DataFrame, ``dest_region`` and ``dest_size`` respectively, assess how to address them and make sure that they are cleaned and ready for analysis.

```python
# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
```

```
['Asia' 'Canada/Mexico' 'West US' 'East US' 'Midwest US' 'EAST US'
 'Middle East' 'Europe' 'eur' 'Central/South America'
 'Australia/New Zealand' 'middle east']
['Hub' 'Small' '    Hub' 'Medium' 'Large' 'Hub     ' '    Small'
 'Medium     ' '    Medium' 'Small     ' '    Large' 'Large     ']
```


```python
# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})

# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()

# Verify changes have been effected
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
```

```
['asia' 'canada/mexico' 'west us' 'east us' 'midwest us' 'middle east'
 'europe' 'central/south america' 'australia/new zealand']
['Hub' 'Small' 'Medium' 'Large']
```


### Remapping categories

To better understand survey respondents from ``airlines``, you want to find out if there is a relationship between certain responses and the day of the week and wait time at the gate.

The ``airlines`` DataFrame contains the ``day`` and ``wait_min`` columns, which are categorical and numerical respectively. The ``day`` column contains the exact day a flight took place, and ``wait_min`` contains the number of minutes travelers waited at the gate. To make your analysis easier, you want to create two new categorical variables:

* ``wait_type``: ``'short'`` for 0-60 min, ``'medium'`` for 60-180 and ``'long'`` for 180+
* ``day_week``: ``'weekday'`` if day is in the weekday, ``'weekend'`` if day is in the weekend.


```python
# Import numpy
import numpy as np

# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']

# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges, 
                                labels = label_names)

# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday', 
            'Thursday': 'weekday', 'Friday': 'weekday', 
            'Saturday': 'weekend', 'Sunday': 'weekend'}

airlines['day_week'] = airlines['day'].replace(mappings)
```
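One detail of ``pd.cut()`` worth keeping in mind: by default each bin excludes its left edge, so a value exactly equal to the first cutoff falls into no bin. Whether ``wait_min`` actually contains a wait of exactly 0 is not stated here, so the sketch below uses made-up wait times purely to illustrate the ``include_lowest`` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical wait times in minutes, including the edge values
wait_min = pd.Series([0, 30, 60, 61, 180, 240])
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']

# Default bins are (0, 60], (60, 180], (180, inf): a wait of 0 gets no label
default_bins = pd.cut(wait_min, bins=label_ranges, labels=label_names)

# include_lowest=True closes the first bin on the left, so 0 counts as 'short'
with_lowest = pd.cut(wait_min, bins=label_ranges, labels=label_names,
                     include_lowest=True)

print(pd.DataFrame({'wait_min': wait_min,
                    'default': default_bins,
                    'include_lowest': with_lowest}))
```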

# Lesson III

## Cleaning Text Data

In this lesson we'll talk about text data and regular expressions.

### What is text data?

Text data is one of the most common types of data.

| **Type of data** | **Example values** |
|--------------|----------------|
| Names | ``Alex``, ``Sara`` ... |
| Phone Numbers | ``+96171679912`` ... |
| Emails | ``adel@datacamp.com`` |
| Passwords | ----- |

**Common text data problems:**

- *Data inconsistency:*
    * ``+96171679912`` or ``0096171679912``
- *Fixed length violations:*
    * Passwords need to be at least 8 characters
- *Typos:*
    * ``+961.71.679912``

### Example

Let's take a look at the following example. Here's a DataFrame named ``phones`` containing the full names and phone numbers of individuals. Both are string columns. Notice the ``Phone number`` column.

```python
phones = pd.read_csv('phones.csv')
print(phones)
```

<img src='pictures/phones.jpg' />

We can see that there are phone number values that begin with 00 or +. We also see that there is one entry where the phone number has only 4 digits, which cannot be a valid number. Furthermore, we can see that there are dashes across the phone number column. If we wanted to feed these phone numbers into an automated call system, or create a report discussing the distribution of users by area code, we couldn't really do so without uniform phone numbers.

#### Fixing the phone number column

Let's begin by replacing the plus sign with 00. To do this, we use the ``.str.replace()`` method, which takes in two values: the string being replaced, which in this case is the plus sign, and the string to replace it with, which in this case is 00. Because ``+`` is a special character in regular expressions, we pass ``regex=False`` so that it is treated as a literal string.

```python
# Replace "+" with "00"
phones['Phone number'] = phones['Phone number'].str.replace('+', '00', regex=False)
```

We can see that the column has been updated.

```python
phones
```

<img src='pictures/phones1.jpg' />

We use the same exact technique to remove the dashes, by replacing the dash symbol with an empty string.

```python
# Replace "-" with nothing
phones['Phone number'] = phones['Phone number'].str.replace("-", "")
```

Now finally we're going to replace all phone numbers with *fewer* than 10 digits with ``NaN``. We can do this by chaining the ``Phone number`` column with the ``.str.len()`` method, which returns the *string* length of each row in the column. 

We can then use the ``.loc`` indexer to select the rows where ``digits`` is below 10, and replace the value of ``Phone number`` with *numpy's* ``nan`` object, with numpy imported here as ``np``.

```python
# Replace phone numbers with fewer than 10 digits with NaN
import numpy as np
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, "Phone number"] = np.nan
```

We can also write ``assert`` statements to test whether the phone number column has a specific length, and whether it contains the symbols we removed. 

```python
# Find length of each row in Phone number column
sanity_check = phones['Phone number'].str.len()
```

The first assert statement tests that the minimum length of the strings in the phone number column, found through ``str.len()``, is *greater than or equal* to ``10``. 

```python
# Assert minimum phone number length is 10
assert sanity_check.min() >= 10
```

In the second assert statement, we use the ``.str.contains()`` method to test whether the phone number column contains a specific pattern. It returns a Series of booleans that are ``True`` for matches and ``False`` for non-matches. 
We set the pattern ``\+|-``: the *pipe* here is basically an **or** statement, so we're trying to find matches for either symbol, and the plus sign is escaped with a backslash because it is a special character in regular expressions. We chain it with the ``.any()`` method, which returns ``True`` if any element in the output of ``.str.contains()`` is ``True``, and test whether it returns ``False``.

```python
# Assert all numbers do not have "+" or "-"
# (na=False treats the NaN entries as non-matches)
assert phones['Phone number'].str.contains(r"\+|-", na=False).any() == False
```
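Putting the steps of this section together, the sketch below runs the whole cleaning sequence on a toy ``phones`` DataFrame. The names and numbers are invented for illustration; only the column names and the steps come from the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the three problems described above
phones = pd.DataFrame({
    'Full name': ['Noelani A. Gray', 'Myles Z. Gomez', 'Gil B. Silva'],
    'Phone number': ['+96171679912', '0096-1716-79912', '4138'],
})

# Step 1: replace "+" with "00", then drop dashes (literal matches, not regex)
number = phones['Phone number'].str.replace('+', '00', regex=False)
number = number.str.replace('-', '', regex=False)
phones['Phone number'] = number

# Step 2: set numbers with fewer than 10 digits to NaN
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, 'Phone number'] = np.nan

# Step 3: sanity checks; min() skips the NaN lengths of missing numbers
sanity_check = phones['Phone number'].str.len()
assert sanity_check.min() >= 10
assert not phones['Phone number'].str.contains(r'\+|-', na=False).any()

print(phones)
```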

#### Regular Expressions in action

But what about more complicated examples? How can we clean a phone number column that looks like this for example? 

<img src='pictures/re.jpg' />

Where phone numbers can contain a range of symbols, from *plus signs* and *dashes* to *parentheses* and maybe more. This is where *regular expressions* come in. *Regular expressions* give us the ability to search for any pattern in text data, like only digits for example. They are like Ctrl+F in your browser, but way more dynamic and robust.

Let's take a look at this example:

```python
# Replace non-digit characters with nothing
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '', regex=True)
```

Here we are attempting to extract only the digits from the phone number column. To do this, we use the ``.str.replace()`` method with the pattern we want to replace with an empty string. Notice the pattern fed into the method: ``\D+`` matches one or more characters that are not digits, so this is essentially us telling pandas to replace anything that is not a digit with nothing.
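As a quick check of what the pattern does, here is a sketch on a few made-up phone numbers (not the ones in the picture above).

```python
import pandas as pd

# Hypothetical phone numbers with mixed symbols
raw = pd.Series(['+(01706)-25891', '+0500-571437', '(0800) 1111'])

# \D+ matches one or more non-digit characters; regex=True enables pattern matching
digits_only = raw.str.replace(r'\D+', '', regex=True)
print(digits_only.tolist())

# Every cleaned value should now consist of digits only
assert digits_only.str.isdigit().all()
```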


## Exercise

### Removing titles and taking names

While collecting survey respondent metadata in the ``airlines`` DataFrame, the full name of respondents was saved in the ``full_name`` column. However upon closer inspection, you found that a lot of the different names are prefixed by honorifics such as ``"Dr."``, ``"Mr."``, ``"Ms."`` and ``"Miss"``.

Your ultimate objective is to create two new columns named ``first_name`` and ``last_name``, containing the first and last names of respondents respectively. Before doing so however, you need to remove honorifics.

```python
# THIS CODE WILL ONLY WORK ON DATACAMP WORKSPACE

# Replace "Dr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Dr.", "", regex=False)

# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.", "", regex=False)

# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss", "", regex=False)

# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.", "", regex=False)

# Assert that full_name has no honorifics (dots escaped so they match literally)
assert airlines['full_name'].str.contains(r'Ms\.|Mr\.|Miss|Dr\.').any() == False
```
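As an alternative to four separate replacements, the honorifics could be removed with one regular expression. This is only a sketch under assumptions: the names are invented, the pattern is one possible choice, and the split into ``first_name`` and ``last_name`` assumes every remaining name has the form "first last".

```python
import pandas as pd

# Hypothetical respondent names
airlines = pd.DataFrame({
    'full_name': ['Dr. Melodie Stuart', 'Mr. Dominic Shannon', 'Miss Aiko Gomez',
                  'Ms. Joan Lee', 'Quentin Kemp'],
})

# One pattern for all honorifics at the start of the name, plus the space after them
honorifics = r'^(?:Dr\.|Mr\.|Ms\.|Miss)\s+'
airlines['full_name'] = airlines['full_name'].str.replace(honorifics, '', regex=True)

# No honorific should remain
assert not airlines['full_name'].str.contains(r'Ms\.|Mr\.|Miss|Dr\.').any()

# Split on the first space only; assumes each name is "first last"
airlines[['first_name', 'last_name']] = (
    airlines['full_name'].str.split(' ', n=1, expand=True)
)
print(airlines)
```

Matching the whitespace after the honorific also avoids the leading space that a plain replacement of ``"Dr."`` would leave behind.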

### Keeping it descriptive

To further understand travelers' experiences in the San Francisco Airport, the quality assurance department sent out a qualitative questionnaire to all travelers who gave the airport the worst score on all possible categories. The objective behind this questionnaire is to identify common patterns in what travelers are saying about the airport.

Their response is stored in the ``survey_response`` column. Upon a closer look, you realized a few of the answers gave the shortest possible character amount without much substance. In this exercise, you will isolate the responses with a character count higher than **40**, and make sure your new DataFrame contains only responses with more than **40** characters using an ``assert`` statement.

```python
# THIS CODE WILL ONLY WORK ON DATACAMP WORKSPACE

# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()

# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]

# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40

# Print new survey_response column
print(airlines_survey['survey_response'])
```