In [1]:
import numpy as np
import pandas as pd

## Membership constraints

#### Chapter 2 - Text and categorical data problems

In [None]:
#### Categories and membership constraints
##### Predefined finite set of categories
Type of data                    Example values              Numeric representation 
Marriage Status                 unmarried, married          0,1
Household Income Category       0-20K, 20-40K, ...          0,1,..
Loan Status                     default, paid, no_loan      0,1,2

Marriage status can only be unmarried _or_ married

### An example

In [None]:
# Read study data and print it
study_data = pd.read_csv('study.csv')
study_data

# Correct possible blood types
categories

#### Finding inconsistent categories

In [None]:
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)

In [None]:
# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
study_data[inconsistent_rows]

#### Dropping inconsistent categories

In [None]:
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]
# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]
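
The cells above depend on `study.csv` and a `categories` DataFrame that are not shown here. The following is a minimal, self-contained sketch of the same steps; the names and blood types in it are made-up stand-ins, not the course data.

```python
import pandas as pd

# Hypothetical stand-ins for study.csv and the categories DataFrame
study_data = pd.DataFrame({
    'name': ['Beth', 'Ignatius', 'Paul', 'Helen'],
    'blood_type': ['B-', 'A-', 'Z+', 'O+'],
})
categories = pd.DataFrame({
    'blood_type': ['O-', 'O+', 'A-', 'A+', 'B+', 'B-', 'AB+', 'AB-'],
})

# Values present in the data but absent from the allowed set
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])

# Boolean mask: True where a row holds a disallowed value
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)

inconsistent_data = study_data[inconsistent_rows]   # rows to inspect or fix
consistent_data = study_data[~inconsistent_rows]    # rows that satisfy the constraint

print(inconsistent_categories)
print(consistent_data)
```

Keeping `inconsistent_data` around, rather than only dropping it, makes it possible to review what was removed before committing to the drop.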

In [8]:
airlines = pd.read_csv('airlines_final.csv',index_col=0)
airlines.head()

Unnamed: 0,id,day,airline,destination,dest_region,dest_size,boarding_area,dept_time,wait_min,cleanliness,safety,satisfaction
0,1351,Tuesday,UNITED INTL,KANSAI,Asia,Hub,Gates 91-102,2018-12-31,115.0,Clean,Neutral,Very satisfied
1,373,Friday,ALASKA,SAN JOSE DEL CABO,Canada/Mexico,Small,Gates 50-59,2018-12-31,135.0,Clean,Very safe,Very satisfied
2,2820,Thursday,DELTA,LOS ANGELES,West US,Hub,Gates 40-48,2018-12-31,70.0,Average,Somewhat safe,Neutral
3,1157,Tuesday,SOUTHWEST,LOS ANGELES,West US,Hub,Gates 20-39,2018-12-31,190.0,Clean,Very safe,Somewhat satsified
4,2992,Wednesday,AMERICAN,MIAMI,East US,Hub,Gates 50-59,2018-12-31,559.0,Somewhat clean,Very safe,Somewhat satsified


In [9]:
categories = pd.read_csv('categories.csv')
categories

Unnamed: 0,cleanliness,safety,satisfaction
0,Clean,Neutral,Very satisfied
1,Average,Very safe,Neutral
2,Somewhat clean,Somewhat safe,Somewhat satisfied
3,Somewhat dirty,Very unsafe,Somewhat unsatisfied
4,Dirty,Somewhat unsafe,Very unsatisfied


In [9]:
airlines.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2477 entries, 0 to 2808
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             2477 non-null   int64         
 1   day            2477 non-null   object        
 2   airline        2477 non-null   object        
 3   destination    2477 non-null   object        
 4   dest_region    2477 non-null   object        
 5   dest_size      2477 non-null   object        
 6   boarding_area  2477 non-null   object        
 7   dept_time      2477 non-null   datetime64[ns]
 8   wait_min       2477 non-null   float64       
 9   cleanliness    2477 non-null   object        
 10  safety         2477 non-null   object        
 11  satisfaction   2477 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(9)
memory usage: 251.6+ KB


In [5]:
airlines.columns

Index(['id', 'day', 'airline', 'destination', 'dest_region', 'dest_size',
       'boarding_area', 'dept_time', 'wait_min', 'cleanliness', 'safety',
       'satisfaction'],
      dtype='object')

In [11]:
airlines['dept_time'] = pd.to_datetime(airlines['dept_time'])

### Finding consistency
In this exercise and throughout this chapter, you'll be working with the airlines DataFrame, which contains survey responses about San Francisco Airport from airline customers.

The DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction. Another DataFrame named categories was created, containing all correct possible values for the survey columns.

In this exercise, you will use both of these DataFrames to find survey answers with inconsistent values and drop them, effectively performing an anti join and an inner join on these DataFrames, as seen in the video exercise. The pandas package has been imported as pd, and the airlines and categories DataFrames are in your environment.
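
The join view of this filtering can be sketched with `merge`. This is only an illustration under assumptions: `airlines_demo` and `categories_demo` are small made-up frames, and the exercise itself uses `set.difference()` and `.isin()` rather than `merge`.

```python
import pandas as pd

# Hypothetical miniature versions of the airlines and categories DataFrames
airlines_demo = pd.DataFrame({
    'id': [1, 2, 3],
    'cleanliness': ['Clean', 'Unacceptable', 'Dirty'],
})
categories_demo = pd.DataFrame({
    'cleanliness': ['Clean', 'Average', 'Somewhat clean', 'Somewhat dirty', 'Dirty'],
})

# Left join with an indicator column recording where each row was matched
merged = airlines_demo.merge(categories_demo, on='cleanliness',
                             how='left', indicator=True)

# Anti join: rows found only in airlines_demo (their category is not allowed)
anti_join = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Inner join: rows whose category appears in both DataFrames
inner_join = airlines_demo.merge(categories_demo, on='cleanliness', how='inner')

print(anti_join)
print(inner_join)
```

The anti join returns the inconsistent rows and the inner join returns the consistent ones, which matches what the `.isin()` mask and its negation produce.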

In [17]:
# Print categories DataFrame
display(categories)
print('\n')

# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
print('Safety: ', airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(), "\n")

Unnamed: 0,cleanliness,safety,satisfaction
0,Clean,Neutral,Very satisfied
1,Average,Very safe,Neutral
2,Somewhat clean,Somewhat safe,Somewhat satisfied
3,Somewhat dirty,Very unsafe,Somewhat unsatisfied
4,Dirty,Somewhat unsafe,Very unsatisfied




Cleanliness:  ['Clean' 'Average' 'Somewhat clean' 'Somewhat dirty' 'Dirty'] 

Safety:  ['Neutral' 'Very safe' 'Somewhat safe' 'Very unsafe' 'Somewhat unsafe'] 

Satisfaction:  ['Very satisfied' 'Neutral' 'Somewhat satsified' 'Somewhat unsatisfied'
 'Very unsatisfied'] 



In [23]:
airlines.head()

Unnamed: 0,id,day,airline,destination,dest_region,dest_size,boarding_area,dept_time,wait_min,cleanliness,safety,satisfaction
0,1351,Tuesday,UNITED INTL,KANSAI,Asia,Hub,Gates 91-102,2018-12-31,115.0,Clean,Neutral,Very satisfied
1,373,Friday,ALASKA,SAN JOSE DEL CABO,Canada/Mexico,Small,Gates 50-59,2018-12-31,135.0,Clean,Very safe,Very satisfied
2,2820,Thursday,DELTA,LOS ANGELES,West US,Hub,Gates 40-48,2018-12-31,70.0,Average,Somewhat safe,Neutral
3,1157,Tuesday,SOUTHWEST,LOS ANGELES,West US,Hub,Gates 20-39,2018-12-31,190.0,Clean,Very safe,Somewhat satsified
4,2992,Wednesday,AMERICAN,MIAMI,East US,Hub,Gates 50-59,2018-12-31,559.0,Somewhat clean,Very safe,Somewhat satsified


In [24]:
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

Empty DataFrame
Columns: [id, day, airline, destination, dest_region, dest_size, boarding_area, dept_time, wait_min, cleanliness, safety, satisfaction]
Index: []


In [26]:
# Print rows with consistent categories only
display(airlines[airlines['cleanliness'].isin(categories['cleanliness'])])

Unnamed: 0,id,day,airline,destination,dest_region,dest_size,boarding_area,dept_time,wait_min,cleanliness,safety,satisfaction
0,1351,Tuesday,UNITED INTL,KANSAI,Asia,Hub,Gates 91-102,2018-12-31,115.0,Clean,Neutral,Very satisfied
1,373,Friday,ALASKA,SAN JOSE DEL CABO,Canada/Mexico,Small,Gates 50-59,2018-12-31,135.0,Clean,Very safe,Very satisfied
2,2820,Thursday,DELTA,LOS ANGELES,West US,Hub,Gates 40-48,2018-12-31,70.0,Average,Somewhat safe,Neutral
3,1157,Tuesday,SOUTHWEST,LOS ANGELES,West US,Hub,Gates 20-39,2018-12-31,190.0,Clean,Very safe,Somewhat satsified
4,2992,Wednesday,AMERICAN,MIAMI,East US,Hub,Gates 50-59,2018-12-31,559.0,Somewhat clean,Very safe,Somewhat satsified
...,...,...,...,...,...,...,...,...,...,...,...,...
2804,1475,Tuesday,ALASKA,NEW YORK-JFK,East US,Hub,Gates 50-59,2018-12-31,280.0,Somewhat clean,Neutral,Somewhat satsified
2805,2222,Thursday,SOUTHWEST,PHOENIX,West US,Hub,Gates 20-39,2018-12-31,165.0,Clean,Very safe,Very satisfied
2806,2684,Friday,UNITED,ORLANDO,East US,Hub,Gates 70-90,2018-12-31,92.0,Clean,Very safe,Very satisfied
2807,2549,Tuesday,JETBLUE,LONG BEACH,West US,Small,Gates 1-12,2018-12-31,95.0,Clean,Somewhat safe,Very satisfied


- Great _consistent_ work! Keep it up! In the next lesson, we'll be looking at more in-depth solutions for dealing with dirty categorical data.
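
The exercise checks only the cleanliness column, yet the unique values printed earlier show 'Somewhat satsified' in satisfaction, a spelling that does not appear in categories. A hedged sketch of running the same check over every survey column follows; the two small frames are made-up stand-ins for the real data.

```python
import pandas as pd

# Hypothetical miniature airlines data; one satisfaction value is misspelled
airlines_demo = pd.DataFrame({
    'cleanliness': ['Clean', 'Average', 'Dirty'],
    'safety': ['Neutral', 'Very safe', 'Very unsafe'],
    'satisfaction': ['Very satisfied', 'Somewhat satsified', 'Neutral'],
})
categories_demo = pd.DataFrame({
    'cleanliness': ['Clean', 'Average', 'Somewhat clean', 'Somewhat dirty', 'Dirty'],
    'safety': ['Neutral', 'Very safe', 'Somewhat safe', 'Very unsafe', 'Somewhat unsafe'],
    'satisfaction': ['Very satisfied', 'Neutral', 'Somewhat satisfied',
                     'Somewhat unsatisfied', 'Very unsatisfied'],
})

# For each survey column, collect values that are not in the allowed set
unexpected = {
    col: set(airlines_demo[col]).difference(categories_demo[col])
    for col in categories_demo.columns
}

for col, values in unexpected.items():
    print(col, values)
```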

## Categorical variables

### What type of errors could we have?
##### I) Value inconsistency
- Inconsistent fields: _'married', 'Maried', 'UNMARRIED', 'not married'.._
- Trailing white spaces: _'married ', ' married '.._
##### II) Collapsing too many categories to few
- Creating new groups: 0-20K, 20-40K categories ... from continuous household income data
- Mapping groups to new ones: mapping household income categories to two groups, 'rich' and 'poor'
##### III) Making sure data is of type category (seen in Chapter 1)


#### Value consistency
- Capitalization: 'married', 'Married', 'UNMARRIED', 'unmarried'..

In [None]:
# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()

In [None]:
# Get value counts on DataFrame
demographics.groupby('marriage_status').count()

In [None]:
# Capitalize
demographics['marriage_status'] = demographics['marriage_status'].str.upper()
demographics['marriage_status'].value_counts()

In [None]:
# Lowercase
demographics['marriage_status'] = demographics['marriage_status'].str.lower()
demographics['marriage_status'].value_counts()

#### Value consistency
- Trailing spaces: 'married ', 'married', 'unmarried', ' unmarried'..

In [None]:
# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()

In [None]:
# Strip leading and trailing whitespace
demographics['marriage_status'] = demographics['marriage_status'].str.strip()
demographics['marriage_status'].value_counts()
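
The `demographics` DataFrame is not loaded in this notebook, so here is a small runnable sketch of both fixes applied together. The frame `demographics_demo` and its values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data mixing capitalization and whitespace problems
demographics_demo = pd.DataFrame({
    'marriage_status': ['married', 'Married ', ' UNMARRIED', 'unmarried', 'married'],
})

# Before cleaning, each spelling variant counts as its own category
before = demographics_demo['marriage_status'].value_counts()

# Strip surrounding whitespace, then normalize case
demographics_demo['marriage_status'] = (
    demographics_demo['marriage_status'].str.strip().str.lower()
)

after = demographics_demo['marriage_status'].value_counts()

print(before)
print(after)
```

Note that `.str.strip()` removes only leading and trailing whitespace; spaces inside a value such as 'not married' are left alone.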

### Collapsing data into categories
- Create categories out of data: income_group column from income column

In [None]:
# Using qcut()
import pandas as pd
group_names = ['0-200K', '200K-500K', '500K+']
demographics['income_group'] = pd.qcut(demographics['household_income'], q = 3, labels = group_names)

# Print income_group column
demographics[['income_group', 'household_income']]

### Collapsing data into categories
- Create categories out of data: income_group column from income column

In [None]:
# Using cut() - create category ranges and names
ranges = [0,200000,500000,np.inf]
group_names = ['0-200K', '200K-500K', '500K+']
# Create income group column
demographics['income_group'] = pd.cut(demographics['household_income'], bins=ranges, labels=group_names)
demographics[['income_group', 'household_income']]
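
`qcut()` and `cut()` can assign the same value to different groups, because `qcut()` splits by quantiles (roughly equal-sized groups) while `cut()` splits at the boundaries you supply. A small sketch on made-up incomes illustrates the difference; the series `income` is an assumption, not course data.

```python
import numpy as np
import pandas as pd

# Hypothetical household incomes, skewed toward the low end
income = pd.Series([20_000, 40_000, 60_000, 80_000, 100_000, 900_000])
group_names = ['0-200K', '200K-500K', '500K+']

# qcut: three groups of (roughly) equal size, regardless of the actual amounts
by_quantile = pd.qcut(income, q=3, labels=group_names)

# cut: groups defined by explicit income boundaries
by_range = pd.cut(income, bins=[0, 200_000, 500_000, np.inf], labels=group_names)

print(pd.DataFrame({'income': income,
                    'qcut': by_quantile,
                    'cut': by_range}))
```

With `qcut()`, an income of 100,000 ends up labeled '500K+' here, so range-style labels are only trustworthy with `cut()`; labels such as 'low', 'medium', 'high' are a safer fit for quantile groups.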

### Collapsing data into categories
- Map categories to fewer ones: reducing categories in categorical column.
- operating_system column is: 'Microsoft', 'MacOS', 'IOS', 'Android', 'Linux'
- operating_system column should become: 'DesktopOS', 'MobileOS'

In [None]:
# Create mapping dictionary and replace
mapping = {'Microsoft':'DesktopOS', 'MacOS':'DesktopOS', 'Linux':'DesktopOS','IOS':'MobileOS', 'Android':'MobileOS'}
devices['operating_system'] = devices['operating_system'].replace(mapping)
devices['operating_system'].unique()
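
One design point worth knowing: `.replace()` leaves values that are missing from the dictionary untouched, whereas `.map()` turns them into NaN. The sketch below uses a made-up `devices_demo` frame with one extra, unmapped value ('ChromeOS') to show the difference.

```python
import pandas as pd

# Hypothetical devices data; 'ChromeOS' is deliberately absent from the mapping
devices_demo = pd.DataFrame({
    'operating_system': ['Microsoft', 'MacOS', 'IOS', 'Android', 'Linux', 'ChromeOS'],
})
mapping = {'Microsoft': 'DesktopOS', 'MacOS': 'DesktopOS', 'Linux': 'DesktopOS',
           'IOS': 'MobileOS', 'Android': 'MobileOS'}

# replace(): unmapped values pass through unchanged
replaced = devices_demo['operating_system'].replace(mapping)

# map(): unmapped values become NaN, which makes gaps in the mapping visible
mapped = devices_demo['operating_system'].map(mapping)

print(replaced.unique())
print(mapped.isna().sum())
```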

## Inconsistent categories
In this exercise, you'll be revisiting the airlines DataFrame from the previous lesson.

As a reminder, the DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction at San Francisco Airport.

In this exercise, you will examine two categorical columns from this DataFrame, dest_region and dest_size respectively, assess how to address them and make sure that they are cleaned and ready for analysis. The pandas package has been imported as pd, and the airlines DataFrame is in your environment.

In [29]:
# Print unique values of both columns
display(airlines['dest_region'].unique())
print('\n')
display(airlines['dest_size'].unique())

array(['Asia', 'Canada/Mexico', 'West US', 'East US', 'Midwest US',
       'EAST US', 'Middle East', 'Europe', 'eur', 'Central/South America',
       'Australia/New Zealand', 'middle east'], dtype=object)





array(['Hub', 'Small', '    Hub', 'Medium', 'Large', 'Hub     ',
       '    Small', 'Medium     ', '    Medium', 'Small     ',
       '    Large', 'Large     '], dtype=object)

In [30]:
# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})

In [37]:
# Print unique values of both columns
display(airlines['dest_region'].unique())
print('\n')
display(airlines['dest_size'].unique())

array(['asia', 'canada/mexico', 'west us', 'east us', 'midwest us',
       'middle east', 'europe', 'central/south america',
       'australia/new zealand'], dtype=object)





array(['Hub', 'Small', '    Hub', 'Medium', 'Large', 'Hub     ',
       '    Small', 'Medium     ', '    Medium', 'Small     ',
       '    Large', 'Large     '], dtype=object)

In [44]:
# Remove leading and trailing white spaces from dest_size
airlines['dest_size'] = airlines['dest_size'].str.strip()

In [45]:
# Print unique values of both columns
display(airlines['dest_region'].unique())
print('\n')
display(airlines['dest_size'].unique())

array(['asia', 'canada/mexico', 'west us', 'east us', 'midwest us',
       'middle east', 'europe', 'central/south america',
       'australia/new zealand'], dtype=object)





array(['Hub', 'Small', 'Medium', 'Large'], dtype=object)

- Great work! Notice how all categories have been properly treated?

### Remapping categories
To better understand survey respondents from airlines, you want to find out if there is a relationship between certain responses and the day of the week and wait time at the gate.

The airlines DataFrame contains the day and wait_min columns, which are categorical and numerical respectively. The day column contains the exact day a flight took place, and wait_min contains the number of minutes travelers waited at the gate. To make your analysis easier, you want to create two new categorical variables:

- wait_type: 'short' for 0-60 min, 'medium' for 60-180 and 'long' for 180+
- day_week: 'weekday' if day is in the weekday, 'weekend' if day is in the weekend.

The pandas and numpy packages have been imported as pd and np. Let's create some new categorical data!

In [46]:
# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']

# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges, 
                                labels = label_names)

# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday', 
            'Thursday': 'weekday', 'Friday': 'weekday', 
            'Saturday': 'weekend', 'Sunday': 'weekend'}

airlines['day_week'] = airlines['day'].replace(mappings)

- Awesome work! You just created two new categorical variables, that when combined with other columns, could produce really interesting analysis. Don't forget, you can always use an assert statement to check your changes passed.
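
A hedged sketch of what such assert checks could look like, run on a small made-up frame (`airlines_demo` is an assumption, not the course data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature airlines data
airlines_demo = pd.DataFrame({
    'day': ['Monday', 'Saturday', 'Friday', 'Sunday'],
    'wait_min': [45.0, 60.0, 115.0, 559.0],
})

# Same binning as above: intervals are (0, 60], (60, 180], (180, inf)
airlines_demo['wait_type'] = pd.cut(airlines_demo['wait_min'],
                                    bins=[0, 60, 180, np.inf],
                                    labels=['short', 'medium', 'long'])

mappings = {'Monday': 'weekday', 'Tuesday': 'weekday', 'Wednesday': 'weekday',
            'Thursday': 'weekday', 'Friday': 'weekday',
            'Saturday': 'weekend', 'Sunday': 'weekend'}
airlines_demo['day_week'] = airlines_demo['day'].replace(mappings)

# Every wait time fell inside one of the bins
assert airlines_demo['wait_type'].notna().all()

# Every day was mapped to one of the two expected groups
assert set(airlines_demo['day_week']) <= {'weekday', 'weekend'}
```

Because the bins are closed on the right by default, a wait of exactly 0 minutes would fall outside `(0, 60]` and become NaN; passing `include_lowest=True` to `pd.cut()` is one way to include it.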

In [47]:
airlines

Unnamed: 0,id,day,airline,destination,dest_region,dest_size,boarding_area,dept_time,wait_min,cleanliness,safety,satisfaction,wait_type,day_week
0,1351,Tuesday,UNITED INTL,KANSAI,asia,Hub,Gates 91-102,2018-12-31,115.0,Clean,Neutral,Very satisfied,medium,weekday
1,373,Friday,ALASKA,SAN JOSE DEL CABO,canada/mexico,Small,Gates 50-59,2018-12-31,135.0,Clean,Very safe,Very satisfied,medium,weekday
2,2820,Thursday,DELTA,LOS ANGELES,west us,Hub,Gates 40-48,2018-12-31,70.0,Average,Somewhat safe,Neutral,medium,weekday
3,1157,Tuesday,SOUTHWEST,LOS ANGELES,west us,Hub,Gates 20-39,2018-12-31,190.0,Clean,Very safe,Somewhat satsified,long,weekday
4,2992,Wednesday,AMERICAN,MIAMI,east us,Hub,Gates 50-59,2018-12-31,559.0,Somewhat clean,Very safe,Somewhat satsified,long,weekday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2804,1475,Tuesday,ALASKA,NEW YORK-JFK,east us,Hub,Gates 50-59,2018-12-31,280.0,Somewhat clean,Neutral,Somewhat satsified,long,weekday
2805,2222,Thursday,SOUTHWEST,PHOENIX,west us,Hub,Gates 20-39,2018-12-31,165.0,Clean,Very safe,Very satisfied,medium,weekday
2806,2684,Friday,UNITED,ORLANDO,east us,Hub,Gates 70-90,2018-12-31,92.0,Clean,Very safe,Very satisfied,medium,weekday
2807,2549,Tuesday,JETBLUE,LONG BEACH,west us,Small,Gates 1-12,2018-12-31,95.0,Clean,Somewhat safe,Very satisfied,medium,weekday


## Cleaning text data

#### What is text data?

In [None]:
Type of data       Example values
Names              Alex, Sara...
Phone numbers      +96171679912 ...
Emails             `adel@datacamp.com`..
Passwords...

#### Common text data problems
1) Data inconsistency: \
+96171679912 or 0096171679912 or ..?
2) Fixed length violations:\
Passwords need to be at least 8 characters
3) Typos:\
+961.71.679912


In [4]:
phones = pd.read_csv('phones.csv')
display(phones)

Unnamed: 0,Full name,Phone number
0,Noelani A. Gray,001-702-397-5143
1,Myles Z. Gomez,001-329-485-0540
2,Gil B. Silva,001-195-492-2338
3,Prescott D. Hardin,+1-297-996-4904
4,Benedict G. Valdez,001-969-820-3536
5,Reece M. Andrews,4138
6,Hayfa E. Keith,001-536-175-8444
7,Hedley I. Logan,001-681-552-1823
8,Jack W. Carrillo,001-910-323-5265
9,Lionel M. Davis,001-143-119-9210


#### Fixing the phone number column

In [7]:
# Replace "+" with "00" (regex=False treats "+" as a literal character)
phones['Phone number'] = phones['Phone number'].str.replace('+', '00', regex=False)

In [8]:
phones

Unnamed: 0,Full name,Phone number
0,Noelani A. Gray,001-702-397-5143
1,Myles Z. Gomez,001-329-485-0540
2,Gil B. Silva,001-195-492-2338
3,Prescott D. Hardin,001-297-996-4904
4,Benedict G. Valdez,001-969-820-3536
5,Reece M. Andrews,4138
6,Hayfa E. Keith,001-536-175-8444
7,Hedley I. Logan,001-681-552-1823
8,Jack W. Carrillo,001-910-323-5265
9,Lionel M. Davis,001-143-119-9210


In [9]:
# Replace "-" with nothing
phones['Phone number'] = phones['Phone number'].str.replace('-', '', regex=False)

In [10]:
phones

Unnamed: 0,Full name,Phone number
0,Noelani A. Gray,0017023975143
1,Myles Z. Gomez,0013294850540
2,Gil B. Silva,0011954922338
3,Prescott D. Hardin,0012979964904
4,Benedict G. Valdez,0019698203536
5,Reece M. Andrews,4138
6,Hayfa E. Keith,0015361758444
7,Hedley I. Logan,0016815521823
8,Jack W. Carrillo,0019103235265
9,Lionel M. Davis,0011431199210


In [12]:
# Replace phone numbers with fewer than 10 digits with NaN
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, 'Phone number'] = np.nan

In [13]:
phones

Unnamed: 0,Full name,Phone number
0,Noelani A. Gray,0017023975143
1,Myles Z. Gomez,0013294850540
2,Gil B. Silva,0011954922338
3,Prescott D. Hardin,0012979964904
4,Benedict G. Valdez,0019698203536
5,Reece M. Andrews,
6,Hayfa E. Keith,0015361758444
7,Hedley I. Logan,0016815521823
8,Jack W. Carrillo,0019103235265
9,Lionel M. Davis,0011431199210


In [15]:
# Find length of each row in Phone number column
sanity_check = phones['Phone number'].str.len()

In [16]:
# Assert minimum phone number length is 10
assert sanity_check.min() >= 10

In [None]:
# Assert all numbers do not have "+" or "-" ("+" must be escaped in a regex)
assert not phones['Phone number'].str.contains(r"\+|-", na=False).any()
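
Put together, the steps in this section can be sketched end to end as follows. The three rows are copied from the phones table above; the frame name `phones_demo` and the use of `Series.mask()` in place of `.loc` assignment are choices made for this illustration.

```python
import pandas as pd

# Three rows taken from the phones table shown earlier
phones_demo = pd.DataFrame({
    'Full name': ['Noelani A. Gray', 'Prescott D. Hardin', 'Reece M. Andrews'],
    'Phone number': ['001-702-397-5143', '+1-297-996-4904', '4138'],
})

# Literal (non-regex) replacements: "+" becomes "00", dashes are removed
cleaned = (
    phones_demo['Phone number']
    .str.replace('+', '00', regex=False)
    .str.replace('-', '', regex=False)
)

# Numbers shorter than 10 characters are treated as invalid and set to NaN
phones_demo['Phone number'] = cleaned.mask(cleaned.str.len() < 10)

# Sanity checks on the remaining (non-missing) numbers
valid = phones_demo['Phone number'].dropna()
assert valid.str.len().min() >= 10
assert not valid.str.contains(r'\+|-').any()

print(phones_demo)
```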

### But what about more complicated examples?

In [2]:
phones1 = pd.read_csv('phones_1.csv')
phones1

Unnamed: 0,Full name,Phone number
0,Olga Robinson,+(01706)-258911
1,Justina Kim,+0500-5714372
2,Tamekah Henson,+0800-11113
3,Miranda Solis,+07058-8790634
4,Caldwell Gilliam,+(016977)-8424


### Regular expressions in action

In [4]:
# Look at the dtype
print(phones1['Phone number'].dtype)

object


In [5]:
# Change dtype
phones1['Phone number'] = phones1['Phone number'].astype(str)

In [6]:
# Now apply the replacement
phones1['Phone number'] = phones1['Phone number'].str.replace(r'\D+', '', regex=True)

print(phones1['Phone number'])
print(phones1['Phone number'].dtype) # Output will be object

0     01706258911
1     05005714372
2       080011113
3    070588790634
4      0169778424
Name: Phone number, dtype: object
object


In [7]:
phones1

Unnamed: 0,Full name,Phone number
0,Olga Robinson,01706258911
1,Justina Kim,05005714372
2,Tamekah Henson,080011113
3,Miranda Solis,070588790634
4,Caldwell Gilliam,0169778424
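
Phone numbers are identifiers rather than quantities, so it usually makes sense to keep them as strings after cleaning. The sketch below reuses three of the raw values above to show what `\D+` removes and what is lost if the result is cast to an integer; the variable names are illustrative only.

```python
import pandas as pd

# Three raw values from the phones_1 table above
raw = pd.Series(['+(01706)-258911', '+0500-5714372', '+0800-11113'])

# \D matches any non-digit character; + repeats it, so whole runs are removed
digits_only = raw.str.replace(r'\D+', '', regex=True)

# Casting to an integer type silently drops the leading zeros
as_int = digits_only.astype('int64')

print(digits_only.tolist())
print(as_int.tolist())
```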


### Removing titles and taking names
While collecting survey respondent metadata in the airlines DataFrame, the full name of respondents was saved in the full_name column. However, upon closer inspection, you found that many of the names are prefixed by honorifics such as "Dr.", "Mr.", "Ms." and "Miss".

Your ultimate objective is to create two new columns named first_name and last_name, containing the first and last names of respondents respectively. Before doing so, however, you need to remove honorifics.

The airlines DataFrame is in your environment, alongside pandas as pd.

In [10]:
airlines.columns

Index(['id', 'day', 'airline', 'destination', 'dest_region', 'dest_size',
       'boarding_area', 'dept_time', 'wait_min', 'cleanliness', 'safety',
       'satisfaction'],
      dtype='object')

In [None]:
# Replace "Dr." with empty string "" (regex=False keeps "." literal)
airlines['full_name'] = airlines['full_name'].str.replace("Dr.", "", regex=False)

# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.", "", regex=False)

# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss", "", regex=False)

# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.", "", regex=False)

# Assert that full_name has no honorifics (dots escaped so they match literally)
assert not airlines['full_name'].str.contains(r'Ms\.|Mr\.|Miss|Dr\.').any()

- Great work! By normalizing full names this way, you can now easily split them into first names and last names!
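
The full_name column is not part of the airlines DataFrame loaded in this notebook, so the split itself cannot be run here. As a rough sketch on made-up names, the honorifics can be removed with a single anchored regular expression and the result split on the first space; the names and the single-regex approach are assumptions, not the exercise's solution.

```python
import pandas as pd

# Hypothetical respondent names
names = pd.Series(['Dr. Melissa Ortiz', 'Mr. Alan Chen', 'Miss Priya Nair',
                   'Ms. Dana Lee', 'Jordan Smith'])

# Remove one leading honorific plus the whitespace that follows it
cleaned = names.str.replace(r'^(?:Dr\.|Mr\.|Ms\.|Miss)\s*', '', regex=True).str.strip()

# Split on the first space only: first name, then everything else
name_parts = cleaned.str.split(' ', n=1, expand=True)
name_parts.columns = ['first_name', 'last_name']

print(name_parts)
```

Replacing only the honorific text, as in the exercise, leaves a leading space behind (for example ' Melissa Ortiz'), so a `.str.strip()` before splitting avoids an empty first name.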

### Keeping it descriptive
To further understand travelers' experiences at San Francisco Airport, the quality assurance department sent out a qualitative questionnaire to all travelers who gave the airport the worst score on all possible categories. The objective behind this questionnaire is to identify common patterns in what travelers are saying about the airport.

Their responses are stored in the survey_response column. Upon a closer look, you realized a few of the answers gave the shortest possible character amount without much substance. In this exercise, you will isolate the responses with a character count higher than 40, and make sure your new DataFrame contains only responses with more than 40 characters using an assert statement.

The airlines DataFrame is in your environment, and pandas is imported as pd.

In [None]:
# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()

# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]

# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40

# Print new survey_response column
print(airlines_survey['survey_response'])

- Phenomenal work! This type of feedback is essential to improving any service. Coupled with some word count analysis, you can find common patterns across all survey responses in no time!
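
As a rough idea of what such a word count analysis might look like, here is a minimal sketch. The responses are invented for illustration, and the approach (lowercase, strip punctuation, split, count) is one simple option among many.

```python
import pandas as pd

# Hypothetical survey responses
responses = pd.Series([
    'The airport personnel forgot to alert us of delayed flights.',
    'It was terrible.',
    'The food in the airport was really expensive and the staff was rude.',
])

# Keep only responses longer than 40 characters
long_enough = responses[responses.str.len() > 40]

# Lowercase, drop punctuation, split into words, and count each word
word_counts = (
    long_enough.str.lower()
    .str.replace(r'[^a-z\s]', '', regex=True)
    .str.split()
    .explode()
    .value_counts()
)

print(word_counts.head())
```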