# Lesson I

## Membership Constraints

In this chapter, we're going to take a look at common data problems with text and categorical data.

In this lesson, we'll focus on categorical variables. **Categorical data** represents variables that take their values from a predefined, finite set of categories.

| **Type of Data** | **Example Values** | **Numeric Representation** |
| -------------|----------------|------------------------|
| Marriage status | ``unmarried``, ``married`` | ``0``, ``1`` |
| Household income category | ``0-20K``, ``20-40K``, ... | ``0``, ``1``, ... |
| Loan status | ``default``, ``paid``, ``no_loan`` | ``0``, ``1``, ``2`` |

To run machine learning models on categorical data, the categories are often encoded as numbers. Since categorical data represents a predefined set of categories, it can't contain values outside that set.
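As a minimal sketch of this encoding step, pandas can map each category to an integer code. The values below are illustrative assumptions rather than data from the lesson, and `divorced` is deliberately outside the predefined set to show that it receives no valid code.

```python
import pandas as pd

# Illustrative values (an assumption, not data from the lesson)
marriage_status = ['unmarried', 'married', 'married', 'divorced']

# The predefined set of categories, listed in the order of their numeric codes
allowed = ['unmarried', 'married']

# Values outside the predefined categories are treated as missing (code -1)
encoded = pd.Categorical(marriage_status, categories=allowed)
codes = encoded.codes.tolist()
print(codes)  # [0, 1, 1, -1]
```

The `-1` code is how pandas flags a value that violates the membership constraint.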

### Why could we have these problems?

* Data Entry Errors
    - Free text
    - Dropdowns
* Parsing Errors

### How do we treat these problems?

* Dropping Data
* Remapping Categories
* Inferring Categories

### An Example

Here's a DataFrame named ``study_data`` containing a list of ``first names``, ``birth dates``, and ``blood types``. 

```python
import pandas as pd

# Read study data and print it
study_data = pd.read_csv('study.csv')
study_data
```

<img src='pictures/study.jpg' width=450 align='left' />

A second DataFrame named ``categories``, containing the valid categories for the blood type column, has been created as well.

```python
# Correct possible blood types
categories
```

<img src='pictures/blood.jpg' width=150/>

Notice the inconsistency here? There's definitely no blood type named **Z+**. Luckily, the ``categories`` DataFrame will help us systematically spot all rows with these inconsistencies. 

It's always good practice to keep a log of all possible values of your categorical data, as it will make dealing with these types of inconsistencies way easier.
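For readers following along without ``study.csv``, the two DataFrames can be approximated by hand. This is a hypothetical stand-in: only the ``Jennifer`` row and the ``blood_type`` column name come from the lesson, while the other names, dates, and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for study.csv; only the last row mirrors the lesson
study_data = pd.DataFrame({
    'name': ['Beth', 'Ignatius', 'Paul', 'Helen', 'Maria', 'Jennifer'],
    'birthday': ['2019-10-20', '2020-07-08', '2019-08-12',
                 '2019-03-17', '2019-12-01', '2019-12-17'],
    'blood_type': ['B-', 'A-', 'O+', 'O-', 'A+', 'Z+'],
})

# Log of every valid blood type, kept alongside the data
categories = pd.DataFrame({
    'blood_type': ['O-', 'O+', 'A-', 'A+', 'B+', 'B-', 'AB+', 'AB-'],
})

print(study_data)
print(categories)
```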

### A note on joins

Before moving on to dealing with these inconsistent values, let's have a brief reminder on joins. The two main types of joins we care about here are **anti joins** and **inner joins**.

#### Anti Join

**Anti joins** take in *two* DataFrames, A and B, and return the data from one DataFrame that is not contained in the other.

<img src='pictures/antijoin.jpg' />

In this example, we are performing a left anti join of A and B, which returns the rows of A whose values in the common column being joined on are not found in B.
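pandas has no dedicated anti join method, so one common way to sketch a left anti join is a left merge with ``indicator=True`` followed by a filter. The DataFrames ``A`` and ``B`` and their column names below are illustrative assumptions, and this version keeps only A's columns.

```python
import pandas as pd

# Two small illustrative DataFrames sharing the column 'key'
A = pd.DataFrame({'key': ['a', 'b', 'c'], 'a_val': [1, 2, 3]})
B = pd.DataFrame({'key': ['b', 'c', 'd'], 'b_val': [20, 30, 40]})

# Left merge against B's keys; '_merge' records where each row was found
merged = A.merge(B[['key']].drop_duplicates(), on='key',
                 how='left', indicator=True)

# Keep the rows found in A only, then drop the helper column
left_anti = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(left_anti)
```

Only the row with key ``a`` survives, since ``b`` and ``c`` also appear in B.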

#### Inner Join

Inner joins return only the data that is contained in both DataFrames.

<img src='pictures/innerjoin.jpg' />

For example, an inner join of A and B would return columns from both DataFrames, but only for the rows whose values in the common column being joined on are found in both A and B.
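An inner join can be sketched with ``DataFrame.merge`` and ``how='inner'``. As before, ``A``, ``B``, and their column names are illustrative assumptions.

```python
import pandas as pd

# Two small illustrative DataFrames sharing the column 'key'
A = pd.DataFrame({'key': ['a', 'b', 'c'], 'a_val': [1, 2, 3]})
B = pd.DataFrame({'key': ['b', 'c', 'd'], 'b_val': [20, 30, 40]})

# Keep only the keys present in both DataFrames, with columns from each
inner = A.merge(B, on='key', how='inner')
print(inner)
```

The keys ``a`` (only in A) and ``d`` (only in B) are both absent from the result.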

#### A left anti join on blood types

* What is in ``study_data`` only
    - Returns only rows containing Z+

#### An inner join on blood types

* What is in ``study_data`` and ``categories`` only
    - Returns all rows except those containing Z+, B+ and AB-

### Finding inconsistent categories

Let's see how to do that in Python:

```python
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)

'''
{'Z+'} # Output
'''
```

We first get all inconsistent categories in the ``blood_type`` column of the ``study_data`` DataFrame. 

We do that by creating a *set* out of the ``blood_type`` column, which stores its unique values, and calling the ``difference()`` method with the ``blood_type`` column of the ``categories`` DataFrame as its argument. This returns all the categories in ``study_data['blood_type']`` that are not in ``categories``.
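Note that ``difference()`` is one-directional. The sketch below uses plain Python lists with illustrative values (assumptions, not the lesson's data) to show that swapping the two sides answers a different question.

```python
# Illustrative values (assumptions, not the lesson's data)
observed = ['A+', 'O-', 'Z+', 'A+']
valid = ['O-', 'O+', 'A-', 'A+', 'B+', 'B-', 'AB+', 'AB-']

# Unique observed values that are missing from the valid list
not_valid = set(observed).difference(valid)
print(not_valid)  # {'Z+'}

# The reverse direction: valid categories that never occur in the data
never_observed = set(valid).difference(observed)
print(sorted(never_observed))
```

Only the first direction flags a membership violation; a valid category that simply never occurs in the data is not an error.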

```python
# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
study_data[inconsistent_rows]

'''
5 Jennifer 2019-12-17   Z+  # Output Row
'''
```

We then find the inconsistent rows with the ``isin()`` method, which checks each value of the ``blood_type`` column against the inconsistent categories. It returns a Series of boolean values that are ``True`` for inconsistent rows and ``False`` for consistent ones.

We then subset the ``study_data`` DataFrame based on these boolean values, and voila we have our inconsistent data.
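To make the boolean Series concrete, here is a small self-contained sketch. The column values are illustrative assumptions, not the lesson's data.

```python
import pandas as pd

# Illustrative column (an assumption, not the lesson's data)
blood_type = pd.Series(['A+', 'Z+', 'O-', 'Z+'])
inconsistent_categories = {'Z+'}

# True where the value is one of the inconsistent categories
mask = blood_type.isin(inconsistent_categories)
print(mask.tolist())  # [False, True, False, True]

# Subsetting with the mask keeps only the rows marked True
flagged = blood_type[mask]
print(flagged)
```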

### Dropping Inconsistent Categories

To drop the inconsistent rows and keep only the consistent ones, we use the tilde symbol (``~``) while subsetting. It negates the boolean Series, so the subset contains everything except the inconsistent rows.

```python
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]

# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]
```
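When the same check has to be repeated for several columns, the steps above could be wrapped in a small helper. The function name ``split_by_membership`` and the sample data are hypothetical and not part of the lesson; this is one possible sketch rather than a prescribed approach.

```python
import pandas as pd

def split_by_membership(df, column, valid_values):
    """Return (inconsistent_rows, consistent_rows) of df for one column."""
    # Categories present in the data but absent from the valid values
    inconsistent = set(df[column]).difference(valid_values)
    mask = df[column].isin(inconsistent)
    return df[mask], df[~mask]

# Illustrative data (assumptions, not the lesson's data)
study = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy'],
                      'blood_type': ['A+', 'Z+', 'O-']})
valid = pd.DataFrame({'blood_type': ['O-', 'O+', 'A-', 'A+',
                                     'B+', 'B-', 'AB+', 'AB-']})

bad, good = split_by_membership(study, 'blood_type', valid['blood_type'])
print(bad)
print(good)
```

Every row lands in exactly one of the two results, so their lengths add up to the length of the input.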

## Exercise

### Finding Consistency

In this exercise and throughout this chapter, you'll be working with the ``airlines`` DataFrame, which contains survey responses about the San Francisco Airport from airline customers.

The DataFrame contains flight metadata such as the *airline*, the *destination*, *waiting times* as well as answers to key questions regarding *cleanliness*, *safety*, and *satisfaction*. Another DataFrame named ``categories`` was created, containing all correct possible values for the survey columns.

In this exercise, you will use both of these DataFrames to find survey answers with inconsistent values and drop them, effectively performing an anti join and an inner join on these DataFrames, as seen in the video exercise.

```python
# Import packages
import pandas as pd

# Airlines dataset
airlines = pd.read_csv('datasets/airlines_final.csv')
# Categories dataset
categories = pd.read_csv('datasets/categories.csv')

# Print categories DataFrame
print(categories)

# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
print('Safety: ', airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(), "\n")
```

```
      cleanliness           safety          satisfaction
0           Clean          Neutral        Very satisfied
1         Average        Very safe               Neutral
2  Somewhat clean    Somewhat_safe    Somewhat_satisfied
3  Somewhat dirty      Very_unsafe  Somewhat_unsatisfied
4           Dirty  Somewhat_unsafe      Very_unsatisfied
Cleanliness:  ['Clean' 'Average' 'Somewhat clean' 'Somewhat dirty' 'Dirty'] 

Safety:  ['Neutral' 'Very safe' 'Somewhat safe' 'Very unsafe' 'Somewhat unsafe'] 

Satisfaction:  ['Very satisfied' 'Neutral' 'Somewhat satsified' 'Somewhat unsatisfied'
 'Very unsatisfied'] 
```



```python
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

# Print rows with consistent categories only
print(airlines[~cat_clean_rows])
```

Empty DataFrame
Columns: [Unnamed: 0, id, day, airline, destination, dest_region, dest_size, boarding_area, dept_time, wait_min, cleanliness, safety, satisfaction]
Index: []
      Unnamed: 0    id        day        airline        destination  \
0              0  1351    Tuesday    UNITED INTL             KANSAI   
1              1   373     Friday         ALASKA  SAN JOSE DEL CABO   
2              2  2820   Thursday          DELTA        LOS ANGELES   
3              3  1157    Tuesday      SOUTHWEST        LOS ANGELES   
4              4  2992  Wednesday       AMERICAN              MIAMI   
...          ...   ...        ...            ...                ...   
2472        2804  1475    Tuesday         ALASKA       NEW YORK-JFK   
2473        2805  2222   Thursday      SOUTHWEST            PHOENIX   
2474        2806  2684     Friday         UNITED            ORLANDO   
2475        2807  2549    Tuesday        JETBLUE         LONG BEACH   
2476        2808  2162   Saturday  CHINA EAST