## `Section 02: Common data problems`
### 01-Finding consistency
* Print the `categories` DataFrame and take a close look at all possible correct categories of the survey columns.
* Print the unique values of the survey columns in `airlines` using the `.unique()` method.

In [13]:
import pandas as pd
import numpy as np

categories = pd.read_csv("datasets/categories.csv", index_col=0)
airlines = pd.read_csv("datasets/airlines.csv",index_col=0)
airlines

Unnamed: 0,id,day,airline,destination,dest_region,dest_size,boarding_area,dept_time,wait_min,cleanliness,safety,satisfaction
0,1351,Tuesday,UNITED INTL,KANSAI,Asia,Hub,Gates 91-102,12/31/2018,115,Clean,Neutral,Very satisfied
1,373,Friday,ALASKA,SAN JOSE DEL CABO,Canada/Mexico,Small,Gates 50-59,12/31/2018,135,Clean,Very safe,Very satisfied
2,2820,Thursday,DELTA,LOS ANGELES,West US,Hub,Gates 40-48,12/31/2018,70,Average,Somewhat safe,Neutral
3,1157,Tuesday,SOUTHWEST,LOS ANGELES,West US,Hub,Gates 20-39,12/31/2018,190,Clean,Very safe,Somewhat satisfied
4,2992,Wednesday,AMERICAN,MIAMI,East US,Hub,Gates 50-59,12/31/2018,559,Unacceptable,Very safe,Somewhat satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...
2804,1475,Tuesday,ALASKA,NEW YORK-JFK,East US,Hub,Gates 50-59,12/31/2018,280,Somewhat clean,Neutral,Somewhat satisfied
2805,2222,Thursday,SOUTHWEST,PHOENIX,West US,Hub,Gates 20-39,12/31/2018,165,Clean,Very safe,Very satisfied
2806,2684,Friday,UNITED,ORLANDO,East US,Hub,Gates 70-90,12/31/2018,92,Clean,Very safe,Very satisfied
2807,2549,Tuesday,JETBLUE,LONG BEACH,West US,Small,Gates 1-12,12/31/2018,95,Clean,Somewhat safe,Very satisfied


In [14]:
# Print categories DataFrame
print(categories)

# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
print('Safety: ',airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(), "\n")

      cleanliness           safety          satisfaction
0           Clean          Neutral        Very satisfied
1         Average        Very safe               Neutral
2  Somewhat clean    Somewhat safe    Somewhat satisfied
3  Somewhat dirty      Very unsafe  Somewhat unsatisfied
4           Dirty  Somewhat unsafe      Very unsatisfied
Cleanliness:  ['Clean' 'Average' 'Unacceptable' 'Somewhat clean' 'Somewhat dirty'
 'Dirty'] 

Safety:  ['Neutral' 'Very safe' 'Somewhat safe' 'Very unsafe' 'Somewhat unsafe'] 

Satisfaction:  ['Very satisfied' 'Neutral' 'Somewhat satisfied' 'Somewhat unsatisfied'
 'Very unsatisfied'] 



* As the output above shows, the `cleanliness` column is inconsistent: it contains an `Unacceptable` category that does not appear in `categories`.

* Create a set from the `cleanliness` column in `airlines` using `set()`, and find the inconsistent category by taking its difference with the `cleanliness` column of `categories`.
* Find rows of `airlines` with a `cleanliness` value not in `categories` and print the output.

In [15]:
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

       id        day           airline  destination  dest_region dest_size  \
4    2992  Wednesday          AMERICAN        MIAMI      East US       Hub   
18   2913     Friday  TURKISH AIRLINES     ISTANBUL  Middle East       Hub   
100  2321  Wednesday         SOUTHWEST  LOS ANGELES      West US       Hub   

    boarding_area   dept_time  wait_min   cleanliness         safety  \
4     Gates 50-59  12/31/2018       559  Unacceptable      Very safe   
18   Gates 91-102  12/31/2018       225  Unacceptable      Very safe   
100   Gates 20-39  12/31/2018       130  Unacceptable  Somewhat safe   

           satisfaction  
4    Somewhat satisfied  
18   Somewhat satisfied  
100  Somewhat satisfied  


In [4]:
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

# Print rows with consistent categories only
print(airlines[~cat_clean_rows])

       id        day           airline  destination  dest_region dest_size  \
4    2992  Wednesday          AMERICAN        MIAMI      East US       Hub   
18   2913     Friday  TURKISH AIRLINES     ISTANBUL  Middle East       Hub   
100  2321  Wednesday         SOUTHWEST  LOS ANGELES      West US       Hub   

    boarding_area   dept_time  wait_min   cleanliness         safety  \
4     Gates 50-59  12/31/2018       559  Unacceptable      Very safe   
18   Gates 91-102  12/31/2018       225  Unacceptable      Very safe   
100   Gates 20-39  12/31/2018       130  Unacceptable  Somewhat safe   

           satisfaction  
4    Somewhat satisfied  
18   Somewhat satisfied  
100  Somewhat satisfied  
          id       day        airline        destination    dest_region  \
0       1351   Tuesday    UNITED INTL             KANSAI           Asia   
1        373    Friday         ALASKA  SAN JOSE DEL CABO  Canada/Mexico   
2       2820  Thursday          DELTA        LOS ANGELES        West 
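
The same check can be run over every survey column at once rather than `cleanliness` alone. The sketch below is only an illustration: the two small DataFrames are hypothetical stand-ins for `categories` and `airlines`, not the real data, and the names `unexpected`, `bad_rows` and `consistent` are introduced here for the example.

```python
import pandas as pd

# Hypothetical stand-ins for `categories` and `airlines` (not the real data)
categories = pd.DataFrame({
    "cleanliness": ["Clean", "Average", "Dirty"],
    "safety": ["Very safe", "Neutral", "Very unsafe"],
})
airlines = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "cleanliness": ["Clean", "Unacceptable", "Dirty", "Average"],
    "safety": ["Neutral", "Very safe", "Very safe", "Very unsafe"],
})

# For each survey column, collect the values that are not valid categories
unexpected = {
    col: set(airlines[col]).difference(categories[col])
    for col in categories.columns
}

# Flag a row if any of its survey columns holds an unexpected value
bad_rows = pd.Series(False, index=airlines.index)
for col, values in unexpected.items():
    bad_rows |= airlines[col].isin(values)

# Keep only the rows whose survey answers are all valid categories
consistent = airlines[~bad_rows]
print(unexpected)
print(consistent)
```

Building the mask column by column keeps the logic identical to the single-column version, so it is easy to inspect which column caused a row to be flagged.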

### 02-Inconsistent categories
* Print the unique values in `dest_region` and `dest_size` respectively.


In [16]:
# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

['Asia' 'Canada/Mexico' 'West US' 'East US' 'Midwest US' 'EAST US'
 'Middle East' 'Europe' 'eur' 'Central/South America'
 'Australia/New Zealand' 'middle east']
['Hub' 'Small' 'Medium' 'Large']


* Change the capitalization of all values of `dest_region` to lowercase.
* Replace `'eur'` with `'europe'` in `dest_region` using the `.replace()` method.

In [17]:
# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})

['Asia' 'Canada/Mexico' 'West US' 'East US' 'Midwest US' 'EAST US'
 'Middle East' 'Europe' 'eur' 'Central/South America'
 'Australia/New Zealand' 'middle east']
['Hub' 'Small' 'Medium' 'Large']


* Strip leading and trailing whitespace from the `dest_size` column using the `.str.strip()` method.
* Verify that the changes have taken effect by printing the unique values of the columns using `.unique()`.

In [18]:
airlines = pd.read_csv("datasets/airlines.csv", index_col=0)

# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower() 
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})

# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()

# Verify changes have been effected
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

['Asia' 'Canada/Mexico' 'West US' 'East US' 'Midwest US' 'EAST US'
 'Middle East' 'Europe' 'eur' 'Central/South America'
 'Australia/New Zealand' 'middle east']
['Hub' 'Small' 'Medium' 'Large']
['asia' 'canada/mexico' 'west us' 'east us' 'midwest us' 'middle east'
 'europe' 'central/south america' 'australia/new zealand']
['Hub' 'Small' 'Medium' 'Large']
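
To see what each cleaning step does in isolation, here is a minimal sketch on made-up values. The DataFrame `df` and its contents are assumptions for illustration; in particular, the padded `dest_size` strings are hypothetical and do not appear in the output above.

```python
import pandas as pd

# Hypothetical values, including padded strings that the real data may not contain
df = pd.DataFrame({
    "dest_region": ["Europe", "eur", "EAST US", "East US", "middle east"],
    "dest_size": ["Hub", "  Hub", "Small  ", "Medium", " Large "],
})

# Lowercase first so that 'Europe'/'eur' and 'EAST US'/'East US' can collapse,
# then map the abbreviation onto the full name
df["dest_region"] = df["dest_region"].str.lower().replace({"eur": "europe"})

# Remove leading and trailing whitespace so '  Hub' and 'Hub' become one category
df["dest_size"] = df["dest_size"].str.strip()

print(df["dest_region"].unique())
print(df["dest_size"].unique())
```

Note that `.replace()` with a dictionary matches whole values only, so `'eur'` is replaced while `'europe'` is left untouched.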


### 03-Remapping categories
* Create the ranges and labels for the `wait_type` column mentioned in the description.
* Create the `wait_type` column from `wait_min` using `pd.cut()`, passing `label_ranges` and `label_names` to the correct arguments.
* Create the `mappings` dictionary mapping weekdays to `'weekday'` and weekend days to `'weekend'`.
* Create the `day_week` column by using `.replace()`.

In [21]:
# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']

# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges, 
                                labels = label_names)

# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday', 
            'Thursday': 'weekday', 'Friday': 'weekday', 
            'Saturday': 'weekend', 'Sunday': 'weekend'}

airlines['day_week'] = airlines['day'].replace(mappings)
airlines[['wait_min','wait_type','day','day_week']]

Unnamed: 0,wait_min,wait_type,day,day_week
0,115,medium,Tuesday,weekday
1,135,medium,Friday,weekday
2,70,medium,Thursday,weekday
3,190,long,Tuesday,weekday
4,559,long,Wednesday,weekday
...,...,...,...,...
2804,280,long,Tuesday,weekday
2805,165,medium,Thursday,weekday
2806,92,medium,Friday,weekday
2807,95,medium,Tuesday,weekday
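
Two details of this remapping are easy to miss, and the sketch below illustrates them on made-up values (the sample Series and the `'Holiday'` entry are assumptions, not part of the dataset). First, `pd.cut()` builds right-closed intervals by default, so a value equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed. Second, `.replace()` leaves unmapped values unchanged, whereas `.map()` turns them into `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical wait times sitting exactly on and around the bin edges
wait_min = pd.Series([0, 60, 61, 180, 181])
label_ranges = [0, 60, 180, np.inf]
label_names = ["short", "medium", "long"]

# Default bins are (0, 60], (60, 180], (180, inf]: the value 0 gets no label
default_bins = pd.cut(wait_min, bins=label_ranges, labels=label_names)

# include_lowest=True closes the first interval on the left, so 0 becomes 'short'
with_lowest = pd.cut(wait_min, bins=label_ranges, labels=label_names,
                     include_lowest=True)

# Hypothetical day values with one entry missing from the mapping
days = pd.Series(["Monday", "Saturday", "Holiday"])
mappings = {"Monday": "weekday", "Saturday": "weekend"}

replaced = days.replace(mappings)  # 'Holiday' is kept as-is
mapped = days.map(mappings)        # 'Holiday' becomes NaN

print(default_bins.tolist())
print(with_lowest.tolist())
print(replaced.tolist())
print(mapped.tolist())
```

Which behavior is preferable depends on the data: `.map()` makes unexpected values visible as missing, while `.replace()` preserves them for later inspection.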


### 04-Removing titles and taking names

* Remove `"Dr."`, `"Mr."`, `"Miss"` and `"Ms."` from `full_name` by replacing them with an empty string `""` in that order.
* Run the `assert` statement using `.str.contains()` that tests whether `full_name` still contains any of the honorifics. 

In [30]:
airlines = pd.read_csv("datasets/airlines_v2.csv", index_col=0)

In [32]:
# Replace "Dr." with empty string ""; regex=False treats "." as a literal period
airlines['full_name'] = airlines['full_name'].str.replace("Dr.", "", regex=False)

# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.", "", regex=False)

# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss", "", regex=False)

# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.", "", regex=False)

# Assert that full_name has no honorifics (periods escaped so they match literally)
assert not airlines['full_name'].str.contains(r'Ms\.|Mr\.|Miss|Dr\.').any()
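
When a pattern such as `'Dr.'` is interpreted as a regular expression, the `.` matches any character, so a name like "Drew" also counts as a match. The sketch below shows the effect on a few made-up names (the Series `names` and the single combined `pattern` are assumptions for illustration, not the exercise's prescribed solution) and removes all four honorifics in one pass with the periods escaped.

```python
import pandas as pd

# Hypothetical names; the last one has no honorific but starts with "Dr"
names = pd.Series([
    "Dr. Alice Stone",
    "Mr. Bob Reyes",
    "Miss Carla Diaz",
    "Ms. Dana Wu",
    "Drew Evans",
])

# One anchored pattern: a leading honorific with literal periods, plus trailing spaces
pattern = r"^(?:Dr\.|Mr\.|Miss|Ms\.)\s*"
cleaned = names.str.replace(pattern, "", regex=True)

# Unescaped: "." matches any character, so "Drew" is reported as containing "Dr."
unescaped_hit = cleaned.str.contains("Dr.")

# Escaped: only a literal "Dr." (or the other honorifics) would match
escaped_hit = cleaned.str.contains(r"Dr\.|Mr\.|Miss|Ms\.")

print(cleaned.tolist())
print(unescaped_hit.tolist())
print(escaped_hit.tolist())
```

Anchoring the pattern with `^` and consuming the following whitespace also avoids leaving a leading space in the cleaned names.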

### 05-Keeping it descriptive
* Using the `airlines` DataFrame, store the length of each entry in the `survey_response` column in `resp_length` by using `.str.len()`.
* Isolate the rows of `airlines` with `resp_length` greater than `40`.
* Assert that the smallest `survey_response` length in `airlines_survey` is now greater than `40`.

In [34]:
airlines = pd.read_csv("datasets/airlines_v3.csv",index_col=0)

In [35]:
# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()

# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]

# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40

# Print new survey_response column
print(airlines_survey['survey_response'])

18    The airport personnell forgot to alert us of d...
19    The food in the airport was really really expe...
20    One of the other travelers was really loud and...
21    I don't remember answering the survey with the...
22    The airport personnel kept ignoring my request...
23    The chair I sat in was extremely uncomfortable...
24    I wish you were more like other airports, the ...
25    I was really unsatisfied with the wait times b...
27    The flight was okay, but I didn't really like ...
28    We were really slowed down by security measure...
29    There was a spill on the aisle next to the bat...
30    I felt very unsatisfied by how long the flight...
Name: survey_response, dtype: object
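
One behavior worth knowing about this length filter: `.str.len()` returns a missing value for a missing response, and a missing length never satisfies `> 40`, so such rows are dropped along with the short ones. A minimal sketch on made-up responses (the Series `responses` is hypothetical):

```python
import pandas as pd

# Hypothetical responses: one short, one missing, one descriptive
responses = pd.Series(
    [
        "It was okay.",
        None,
        "The security line was far too slow and the staff seemed overwhelmed.",
    ],
    name="survey_response",
)

# Length per response; the missing entry has a missing length
resp_length = responses.str.len()

# Comparing a missing length with 40 is not True, so that row is filtered out too
long_enough = responses[resp_length > 40]

print(resp_length.isna().tolist())
print(long_enough.index.tolist())
```

If missing responses should be handled differently, for example counted or reported, they need to be dealt with before the filter is applied.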


==================================
### `The End`  
==================================