**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
  - [Common types of problems that we might face with categorical data](#toc1_2_)    
    - [-> Membership constraints](#toc1_2_1_)    
    - [-> Value inconsistency](#toc1_2_2_)    
    - [-> Collapsing data into categories](#toc1_2_3_)    
  - [Common types of problems that we might face with text data](#toc1_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

### <a id='toc1_2_'></a>[Common types of problems that we might face with categorical data](#toc0_)

- Membership constraints
- Value inconsistencies
  - Inconsistent fields: "married", "Maried", "Unmarried", "Not_MARRIED"
  - Trailing spaces/other characters: "married ", "_married__"
- Too few categories that need to be expanded
- Too many categories that need to be collapsed
- Mapping existing categories to new ones 
- Last but not least, making sure that the data is of the right type
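
As a minimal sketch of the last point, the snippet below converts a text column to the pandas `category` dtype. The `marital_status` column and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical survey column; the values are illustrative only.
df = pd.DataFrame({"marital_status": ["married", "unmarried", "married", "unmarried"]})

# Convert the text column to the pandas categorical dtype.
df["marital_status"] = df["marital_status"].astype("category")

# The dtype is now categorical and the distinct categories are tracked explicitly.
print(df["marital_status"].dtype)
print(list(df["marital_status"].cat.categories))
```

Once a column is categorical, the `.cat` accessor methods used later in this section become available.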

#### <a id='toc1_2_1_'></a>[-> Membership constraints](#toc0_)

Sometimes we need to ensure that the values of a particular categorical column come from a known, pre-defined set of values. For example, we might want to ensure that the values of the "sex" column are either "male" or "female" (this is only a simple example; in practice we often have many more categories to check against).

<u>Useful Functions and Methods</u>:

- `ser.isin(iter)` - Check whether the series values are contained in the iterable
- `ser.unique()` - Return unique values of Series object
- `.cat.remove_categories(removals)` - Remove categories from categorical data (the removed category values will be replaced with NaNs).
- `df.dropna()` - Drop rows with NaN values
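
The methods above can be combined as in the following sketch. The DataFrame, the `allowed` set, and the "unknown" value are assumptions made for illustration; they are not part of the airlines dataset used later.

```python
import pandas as pd

# Hypothetical data: "unknown" is not part of the allowed set.
df = pd.DataFrame({"sex": ["male", "female", "unknown", "female"]})
allowed = {"male", "female"}

# Boolean mask marking rows whose value belongs to the allowed set.
is_valid = df["sex"].isin(allowed)

# Inspect which values violate the membership constraint.
invalid_values = df.loc[~is_valid, "sex"].unique()

# Option 1: keep only the valid rows.
df_clean = df[is_valid]

# Option 2: for categorical data, remove the invalid categories
# (their values become NaN) and then drop the resulting NaN rows.
sex_cat = df["sex"].astype("category").cat.remove_categories(list(invalid_values))
df_clean_cat = df.assign(sex=sex_cat).dropna(subset=["sex"])
```

Both options leave the same three rows; the second one also removes "unknown" from the column's list of categories.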

#### <a id='toc1_2_2_'></a>[-> Value inconsistency](#toc0_)

Sometimes categories are misspelled, padded with leading or trailing spaces (or other characters), or written with different capitalizations. We need to make sure that each category is indeed a correct and distinct category.

<u>Useful Functions and Methods</u>:

- `.str` accessors - Accessor objects for string methods
- `ser.unique()` - Return unique values of Series object
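
A minimal sketch of this kind of cleanup, using made-up values modeled on the examples listed earlier. Stripping both spaces and underscores is an assumption about which stray characters appear in the data.

```python
import pandas as pd

# Hypothetical values showing padding and capitalization inconsistencies.
ser = pd.Series(["married ", "_married__", "MARRIED", "unmarried", " Unmarried"])

# Strip leading/trailing spaces and underscores, then lowercase everything.
cleaned = ser.str.strip(" _").str.lower()

print(ser.unique())      # five distinct raw values
print(cleaned.unique())  # collapses to the two intended categories
```

Genuine misspellings such as "Maried" are not fixed by stripping or lowercasing; those need an explicit mapping, for example with `ser.replace(mapping_dict)`.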

#### <a id='toc1_2_3_'></a>[-> Collapsing data into categories](#toc0_)

Sometimes there are several categories of data that should actually be grouped under a single category. For example, we might want to group all the different types of "dog breeds" and "cat breeds" from a column into single categories "dog" and "cat".

<u>Useful Functions and Methods</u>:

- `ser.replace(mapping_dict)` - Replace values in Series object using the mapping dictionary (keys are the values to be replaced and values are the replacement values)
- `pd.cut(ser, bins, labels)` - Bin values into discrete intervals defined by the bin edges in `bins` and name each interval with the corresponding entry from `labels`
- `pd.qcut(ser, q, labels)` - Bin values into `q` quantile-based intervals and name each interval with the corresponding entry from `labels`. The bin edges are chosen so that each bin holds approximately the same number of records.
- `np.select(condlist, choicelist, default)` - Return an array drawn from elements in choicelist, depending on conditions
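
The sketch below illustrates `ser.replace` and `pd.qcut` on small made-up Series; `pd.cut` and `np.select` are demonstrated on the airlines data further down. The breed names, the mapping, and the wait times are hypothetical.

```python
import pandas as pd

# Collapse several breed names into two broader categories.
pets = pd.Series(["labrador", "siamese", "poodle", "persian"])
mapping = {"labrador": "dog", "poodle": "dog", "siamese": "cat", "persian": "cat"}
pet_type = pets.replace(mapping)

# Split numeric values into three bins of roughly equal size.
wait_min = pd.Series([10, 20, 30, 40, 50, 60])
wait_group = pd.qcut(wait_min, q=3, labels=["low", "mid", "high"])

print(pet_type.tolist())
print(wait_group.tolist())
```

Unlike `pd.cut`, where we choose the bin edges ourselves, `pd.qcut` derives the edges from the data, so the same labels can cover different ranges on different datasets.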

In [2]:
df_airlines = pd.read_csv("../datasets/airlines_final.csv", index_col=0)

In [3]:
df_airlines.head(3)

| | id | day | airline | destination | dest_region | dest_size | boarding_area | dept_time | wait_min | cleanliness | safety | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1351 | Tuesday | UNITED INTL | KANSAI | Asia | Hub | Gates 91-102 | 2018-12-31 | 115.0 | Clean | Neutral | Very satisfied |
| 1 | 373 | Friday | ALASKA | SAN JOSE DEL CABO | Canada/Mexico | Small | Gates 50-59 | 2018-12-31 | 135.0 | Clean | Very safe | Very satisfied |
| 2 | 2820 | Thursday | DELTA | LOS ANGELES | West US | Hub | Gates 40-48 | 2018-12-31 | 70.0 | Average | Somewhat safe | Neutral |


In [4]:
df_airlines.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2477 entries, 0 to 2808
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             2477 non-null   int64  
 1   day            2477 non-null   object 
 2   airline        2477 non-null   object 
 3   destination    2477 non-null   object 
 4   dest_region    2477 non-null   object 
 5   dest_size      2477 non-null   object 
 6   boarding_area  2477 non-null   object 
 7   dept_time      2477 non-null   object 
 8   wait_min       2477 non-null   float64
 9   cleanliness    2477 non-null   object 
 10  safety         2477 non-null   object 
 11  satisfaction   2477 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 251.6+ KB


In [5]:
# To better understand the airline survey respondents, we want to find out whether there is a
# relationship between certain responses and the day of the week or the wait time at the gate.

# The airlines DataFrame contains the day and wait_min columns. We want to derive two new columns:
# wait_type: 'short' for 0-60 min, 'medium' for 60-180 min, and 'long' for 180+ min
# day_week: 'weekday' if day is Monday to Friday, 'weekend' if day is Saturday or Sunday

In [6]:
# Create ranges for categories
bins = [0, 60, 180, np.inf]
bin_labels = ['short', 'medium', 'long']

# Create wait_type column
df_airlines['wait_type'] = pd.cut(df_airlines['wait_min'], bins = bins, 
                                labels = bin_labels)

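# Build the conditions and the matching labels for the day_week column
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
weekend_days = ['Saturday', 'Sunday']
cond_list = [df_airlines["day"].isin(weekdays), df_airlines["day"].isin(weekend_days)]

choice_list = ["weekday", "weekend"]

# Pass a string default: the implicit default of 0 is an integer, which newer
# NumPy versions may refuse to combine with string choices
df_airlines["day_week"] = np.select(condlist=cond_list, choicelist=choice_list, default="unknown")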

In [7]:
df_airlines.wait_type.value_counts()

wait_type
medium    1711
long       685
short       81
Name: count, dtype: int64

In [8]:
df_airlines.day_week.value_counts()

day_week
weekday    2000
weekend     477
Name: count, dtype: int64

### <a id='toc1_3_'></a>[Common types of problems that we might face with text data](#toc0_)

- Data inconsistency: "+9618181818", "009618181818" etc.
- Length constraint violations: for example, a password must be at least 8 characters long.
- Typos: "+961.818.1818" etc.

For most text data problems, including the ones listed above, the `.str` accessor methods are the most useful tools. Regular expressions (regex) are also very useful for dealing with text data problems.
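
As a minimal sketch, the snippet below normalizes the phone number formats listed above and flags passwords that are too short. Treating a leading "00" as equivalent to "+" is an assumption about the intended canonical format, and the password values are made up.

```python
import pandas as pd

# The three inconsistent phone formats mentioned above.
phones = pd.Series(["+9618181818", "009618181818", "+961.818.1818"])

# Assumption: a leading "00" means the same as "+".
# Then remove every character that is neither a digit nor "+".
normalized = (
    phones.str.replace(r"^00", "+", regex=True)
          .str.replace(r"[^\d+]", "", regex=True)
)

# Flag entries that violate a minimum length of 8 characters.
passwords = pd.Series(["hunter2", "correct horse"])
too_short = passwords.str.len() < 8

print(normalized.tolist())
print(too_short.tolist())
```

Passing `regex=True` explicitly matters here, because `Series.str.replace` treats the pattern as a literal string by default in recent pandas versions.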