# ETL Pipeline Preparation
Follow the instructions below to help you create your ETL pipeline.
### 1. Import libraries and load datasets.
- Import Python libraries
- Load `messages.csv` into a dataframe and inspect the first few lines.
- Load `categories.csv` into a dataframe and inspect the first few lines.

In [1]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine

In [2]:
# load messages dataset
messages = pd.read_csv('messages.csv')
messages.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [3]:
# load categories dataset
categories = pd.read_csv('categories.csv')
categories.head()

Unnamed: 0,id,categories
0,2,related-1;request-0;offer-0;aid_related-0;medi...
1,7,related-1;request-0;offer-0;aid_related-1;medi...
2,8,related-1;request-0;offer-0;aid_related-0;medi...
3,9,related-1;request-1;offer-0;aid_related-1;medi...
4,12,related-1;request-0;offer-0;aid_related-0;medi...


### 2. Merge datasets.
- Merge the messages and categories datasets using the common id
- Assign this combined dataset to `df`, which will be cleaned in the following steps

In [4]:
# merge datasets
df = pd.merge(messages, categories, on='id')
df.head()

Unnamed: 0,id,message,original,genre,categories
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,related-1;request-0;offer-0;aid_related-0;medi...
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,related-1;request-0;offer-0;aid_related-1;medi...
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,related-1;request-0;offer-0;aid_related-0;medi...
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,related-1;request-1;offer-0;aid_related-1;medi...
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,related-1;request-0;offer-0;aid_related-0;medi...
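If an `id` occurs more than once in either file, `pd.merge` pairs every matching row on one side with every matching row on the other, which multiplies rows. A minimal sketch on made-up toy frames (the values below are illustrative, not taken from `messages.csv` or `categories.csv`) that makes this visible using the `validate` argument of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical values).
toy_messages = pd.DataFrame({"id": [2, 7, 7], "message": ["a", "b", "b"]})
toy_categories = pd.DataFrame(
    {"id": [2, 7, 7], "categories": ["related-1", "related-0", "related-0"]}
)

# A plain inner merge pairs each of the two id-7 rows with both counterparts.
merged = pd.merge(toy_messages, toy_categories, on="id")
print(len(merged))  # 1 row for id 2 plus 2 * 2 rows for id 7

# validate="one_to_one" raises MergeError when ids are not unique on both sides.
try:
    pd.merge(toy_messages, toy_categories, on="id", validate="one_to_one")
    ids_unique = True
except pd.errors.MergeError:
    ids_unique = False
print(ids_unique)
```

Whether the real files contain repeated ids is not established here; the check only shows how one might find out before merging.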


### 3. Split `categories` into separate category columns.
- Split the values in the `categories` column on the `;` character so that each value becomes a separate column. You'll find [this method](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.str.split.html) very helpful! Make sure to set `expand=True`.
- Use the first row of categories dataframe to create column names for the categories data.
- Rename columns of `categories` with new column names.

In [5]:
# create a dataframe of the 36 individual category columns
categories = df.categories.str.split(";", expand=True)
categories.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26,27,28,29,30,31,32,33,34,35
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


In [6]:
# select the first row of the categories dataframe
row = categories.iloc[0]

# use this row to extract a list of new column names for categories:
# slice off the last two characters (the "-0" or "-1" suffix) of each string
category_colnames = [item[:-2] for item in row]
print(category_colnames)

['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']


In [7]:
# rename the columns of `categories`
categories.columns = category_colnames
categories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


### 4. Convert category values to just numbers 0 or 1.
- Iterate through the category columns in df to keep only the last character of each string (the 1 or 0). For example, `related-0` becomes `0`, `related-1` becomes `1`. Convert the string to a numeric value.
- You can perform [normal string actions on Pandas Series](https://pandas.pydata.org/pandas-docs/stable/text.html#indexing-with-str), like indexing, by including `.str` after the Series. You may need to first convert the Series to be of type string, which you can do with `astype(str)`.

In [8]:
for column in categories:
    # set each value to be the last character of the string
    categories[column] = categories[column].apply(lambda x: x[-1])
    
    # convert column from string to numeric
    categories[column] = categories[column].astype(int)
categories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
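Steps 3 and 4 can also be written with vectorized string methods instead of per-value slicing. The sketch below is one possible variant on hypothetical toy strings: it splits each entry on its last `-`, so it does not rely on the value being exactly one character long (an assumption the slicing approach makes).

```python
import pandas as pd

# Toy `categories` strings in the same "name-value" format (hypothetical rows).
raw = pd.Series(["related-1;request-0;offer-0", "related-0;request-1;offer-0"])

# One column per category.
cats = raw.str.split(";", expand=True)

# Column names: everything before the last "-" in the first row.
cats.columns = cats.iloc[0].str.rsplit("-", n=1).str[0].tolist()

# Values: the part after the last "-", converted to integers.
cats = cats.apply(lambda col: col.str.rsplit("-", n=1).str[1].astype(int))
print(cats)
```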


### 5. Replace `categories` column in `df` with new category columns.
- Drop the categories column from the df dataframe since it is no longer needed.
- Concatenate df and categories data frames.

In [9]:
# drop the original categories column from `df`
df.drop('categories', axis=1, inplace=True)

df.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [10]:
# concatenate the original dataframe with the new `categories` dataframe
df = pd.concat([df, categories], axis=1)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 6. Remove duplicates.
- Check how many duplicates are in this dataset.
- Drop the duplicates.
- Confirm duplicates were removed.

In [11]:
# check number of duplicates
df.duplicated().sum()

170

In [12]:
# drop duplicates
df.drop_duplicates(inplace=True)

In [13]:
# check number of duplicates
df.duplicated().sum()

0
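`df.duplicated()` only flags rows that are identical in every column. A small sketch on hypothetical toy data shows how that differs from checking for repeated ids, which can remain after `drop_duplicates()` when two rows share an `id` but differ in a category value:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [2, 7, 7, 8, 8],
    "message": ["a", "b", "b", "c", "c"],
    "related": [1, 1, 1, 0, 1],
})

# Rows identical in every column (what `df.duplicated()` counts).
full_dupes = toy.duplicated().sum()

# Rows that repeat an id, even if other columns differ.
id_dupes = toy.duplicated(subset="id").sum()

# After dropping full-row duplicates, a repeated id can still be present.
deduped = toy.drop_duplicates()
remaining_id_dupes = deduped.duplicated(subset="id").sum()
print(full_dupes, id_dupes, remaining_id_dupes)
```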

### 7. Save the clean dataset into an sqlite database.
You can do this with pandas [`to_sql` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) combined with the SQLAlchemy library. Remember to import SQLAlchemy's `create_engine` in the first cell of this notebook to use it below.

In [14]:
engine = create_engine('sqlite:///categorized_messages.db')
df.to_sql('categorized_messages', engine, index=False, if_exists='replace')
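To confirm that the table was written, it can be read back with `pd.read_sql`. The sketch below uses a hypothetical toy frame and, as an assumption to keep it dependency-free, a standard-library `sqlite3` in-memory connection in place of the SQLAlchemy engine used above:

```python
import sqlite3
import pandas as pd

toy = pd.DataFrame({"message": ["a", "b"], "related": [1, 0]})

# In-memory database so the sketch leaves no file behind.
con = sqlite3.connect(":memory:")
toy.to_sql("categorized_messages", con, index=False, if_exists="replace")

# Read the table back and compare its shape with what was written.
loaded = pd.read_sql("SELECT * FROM categorized_messages", con)
con.close()
print(loaded.shape)
```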

### 8. Use this notebook to complete `etl_pipeline.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database based on new datasets specified by the user. Alternatively, you can complete `etl_pipeline.py` in the classroom on the `Project Workspace IDE` coming later.
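A minimal sketch of how the steps above might be arranged as functions. The template file is not shown here, so the function names `load_data`, `clean_data` and `save_data` are hypothetical, and a standard-library `sqlite3` connection stands in for the SQLAlchemy engine to keep the sketch self-contained.

```python
import sqlite3
import pandas as pd


def load_data(messages_path, categories_path):
    """Read both CSV files and merge them on the shared `id` column."""
    messages = pd.read_csv(messages_path)
    categories = pd.read_csv(categories_path)
    return pd.merge(messages, categories, on="id")


def clean_data(df):
    """Expand `categories` into integer columns and drop duplicate rows."""
    categories = df["categories"].str.split(";", expand=True)
    categories.columns = [item[:-2] for item in categories.iloc[0]]
    for column in categories:
        categories[column] = categories[column].str[-1].astype(int)
    df = pd.concat([df.drop(columns="categories"), categories], axis=1)
    return df.drop_duplicates()


def save_data(df, con, table="categorized_messages"):
    """Write the cleaned frame to a table, replacing any existing one."""
    df.to_sql(table, con, index=False, if_exists="replace")


# Usage on a small in-memory frame (hypothetical rows) instead of real files.
merged = pd.DataFrame({
    "id": [2, 7, 7],
    "message": ["a", "b", "b"],
    "categories": [
        "related-1;request-0",
        "related-1;request-1",
        "related-1;request-1",
    ],
})
cleaned = clean_data(merged)
con = sqlite3.connect(":memory:")
save_data(cleaned, con)
print(cleaned.columns.tolist(), len(cleaned))
```

In a script, `load_data` would be called with the file paths supplied by the user, and the connection would point at a database file rather than `:memory:`.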

### Further investigation and refinement

Wrap-up: What are the columns?

In [15]:
df.columns

Index(['id', 'message', 'original', 'genre', 'related', 'request', 'offer',
       'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
       'security', 'military', 'child_alone', 'water', 'food', 'shelter',
       'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')

In [16]:
df[df.isna().any(axis=1)].shape

(16046, 40)

In [17]:
df[df[['message'] + category_colnames].isna().any(axis=1)].shape

(0, 40)

Some rows have missing entries, but none of them fall in the relevant columns (`message` and the category columns).  
***-> Drop rows with missing values in relevant columns.***
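Dropping only on the relevant columns can be sketched with the `subset` argument of `dropna`. The frame and the `relevant` list below are hypothetical stand-ins for `df` and `['message'] + category_colnames`:

```python
import pandas as pd

toy = pd.DataFrame({
    "message": ["a", None, "c"],
    "original": [None, "x", None],
    "related": [1, 0, 1],
})
relevant = ["message", "related"]  # stand-in for ['message'] + category_colnames

# Only rows with a gap in a relevant column are removed;
# gaps in `original` are ignored.
kept = toy.dropna(subset=relevant)
print(kept.index.tolist())
```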

In [18]:
df.notna().sum()

id                        26216
message                   26216
original                  10170
genre                     26216
related                   26216
request                   26216
offer                     26216
aid_related               26216
medical_help              26216
medical_products          26216
search_and_rescue         26216
security                  26216
military                  26216
child_alone               26216
water                     26216
food                      26216
shelter                   26216
clothing                  26216
money                     26216
missing_people            26216
refugees                  26216
death                     26216
other_aid                 26216
infrastructure_related    26216
transport                 26216
buildings                 26216
electricity               26216
tools                     26216
hospitals                 26216
shops                     26216
aid_centers               26216
other_in

Since later only the categories are inferred from the messages (in English), 'id', 'original' and 'genre' can be dropped.

In [19]:
df.drop(['id', 'original', 'genre'], axis=1, inplace=True)

In [20]:
[df[category].unique() for category in category_colnames]

[array([1, 0, 2]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([0, 1])]

'child_alone' doesn't have varying entries (only 0).  
***-> Category columns with no varying entries could be dropped, but it is unclear whether future data will behave the same way.***
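Columns that never vary can be found programmatically rather than by reading the list of unique values. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "related": [1, 0, 1],
    "child_alone": [0, 0, 0],
    "water": [0, 1, 0],
})

# nunique() counts distinct values per column; 1 means the column never varies.
constant_cols = toy.columns[toy.nunique() == 1].tolist()
print(constant_cols)
```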

In [25]:
df.related.value_counts()

1    19906
0     6122
Name: related, dtype: int64

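'related' is the only category with three values (0, 1 and 2); the value 2 occurs only 188 times. (The counts above appear to have been captured after those rows were dropped in a later cell, which is why only 0 and 1 are listed.)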

In [22]:
pd.concat([df[(df.related == 0) | (df.related == 2)][category_colnames].min(), 
           df[(df.related == 0) | (df.related == 2)][category_colnames].max()], 
          axis=1)

Unnamed: 0,0,1
related,0,2
request,0,0
offer,0,0
aid_related,0,0
medical_help,0,0
medical_products,0,0
search_and_rescue,0,0
security,0,0
military,0,0
child_alone,0,0


'related' with 2 seems to behave like 'related' with 0, with all other categories being 0. So this is probably an error. However, as we can't decide on future behaviour:  
***-> Only accept 0 or 1 as entries in categories***

In [23]:
df.drop(df[(df[category_colnames] == 2).any(axis=1)].index, inplace=True)
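The cell above removes rows containing a 2. A stricter reading of "only accept 0 or 1" keeps just the rows whose category values are all 0 or 1, which would also catch any other unexpected value. A sketch with `isin` on a hypothetical toy frame (`cat_cols` stands in for `category_colnames`):

```python
import pandas as pd

toy = pd.DataFrame({"related": [1, 0, 2, 1], "request": [0, 0, 0, 1]})
cat_cols = ["related", "request"]  # stand-in for category_colnames

# True for rows whose category values are all 0 or 1.
is_binary = toy[cat_cols].isin([0, 1]).all(axis=1)
binary_only = toy[is_binary]
print(binary_only.index.tolist())
```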

In [24]:
df[category_colnames].sum()

related                   19906
request                    4474
offer                       118
aid_related               10860
medical_help               2084
medical_products           1313
search_and_rescue           724
security                    471
military                    860
child_alone                   0
water                      1672
food                       2923
shelter                    2314
clothing                    405
money                       604
missing_people              298
refugees                    875
death                      1194
other_aid                  3446
infrastructure_related     1705
transport                  1201
buildings                  1333
electricity                 532
tools                       159
hospitals                   283
shops                       120
aid_centers                 309
other_infrastructure       1151
weather_related            7297
floods                     2155
storm                      2443
fire    

In [None]:
df[df.related == 0][category_colnames].sum()

'related' seems to be a meta-category (*Is the message disaster related?*, see [figure8](https://www.figure-eight.com/dataset/combined-disaster-response-data/))

In [None]:
for category in category_colnames:
    try:
        min_other_cats_if_1 = df[category_colnames].groupby(category).min().T[1]
        print("{:<20}".format(category),
              'being 1 is always related to', 
              min_other_cats_if_1[min_other_cats_if_1 == 1].index.values, 
              'being 1')
    except KeyError:
        # no group exists for the value 1, so `.T[1]` is missing
        print(category, "doesn't have 1s.")

In [None]:
df[['request', 'offer']].drop_duplicates(keep='last')

'request' and 'offer' seem to be mutually exclusive choices. However, it is unclear whether future data will behave the same way.
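One way to check exclusivity directly is a cross-tabulation of the two flags. The sketch below uses hypothetical toy values, not the real counts:

```python
import pandas as pd

toy = pd.DataFrame({"request": [1, 0, 0, 1, 0], "offer": [0, 1, 0, 0, 0]})

# Rows: request value, columns: offer value, cells: number of messages.
table = pd.crosstab(toy["request"], toy["offer"])
print(table)

# The flags are exclusive if no message has both set.
both = int(((toy["request"] == 1) & (toy["offer"] == 1)).sum())
print(both)
```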

In [None]:
(pd.concat([
    (df[df[cat_col] == 1][category_colnames[3:-1]].sum() / df[df[cat_col] == 1].shape[0]) 
    for cat_col in category_colnames[3:-1]], 
    axis=1)
 .rename(columns={key: value for (key, value) in enumerate(category_colnames[3:-1])})
      .round(2)).T

In [None]:
related = ['related']

request_offer = ['request', 'offer']

aid_related_top = ['aid_related']
aid_related_sub = [
    'medical_help', 
    'medical_products', 
    'search_and_rescue', 
    'security',
    'military',
    'child_alone',
    'water',
    'food',
    'shelter',
    'clothing',
    'money',
    'missing_people',
    'refugees',
    'death',
    'other_aid'
]
aid_related = aid_related_top + aid_related_sub

infrastructure_related_top = ['infrastructure_related']
infrastructure_related_sub = [
    'hospitals', 
    'shops', 
    'aid_centers', 
    'other_infrastructure'
]
infrastructure_related = infrastructure_related_top + infrastructure_related_sub

ambigous_related_top = ['ambigous_related']
ambigous_related_sub = [
    'transport',
    'buildings',
    'electricity',
    'tools']
ambigous_related = ambigous_related_top + ambigous_related_sub

weather_related_top = ['weather_related']
weather_related_sub = [
    'floods', 
    'storm', 
    'fire', 
    'earthquake', 
    'cold', 
    'other_weather'
]
weather_related = weather_related_top + weather_related_sub

tops = aid_related_top + infrastructure_related_top + weather_related_top
subs = aid_related_sub + ambigous_related_sub + infrastructure_related_sub+ weather_related_sub

direct_report = ['direct_report']

category_colnames == related + request_offer + aid_related + infrastructure_related_top + ambigous_related_sub + infrastructure_related_sub + weather_related + direct_report

In [None]:
def coverage(cats_to_cover, cover_by_cats):
    coverage = (
        pd.concat([(df[df[cat_col] == 1][category_colnames].sum() / df[df[cat_col] == 1].shape[0]) 
                   for cat_col in category_colnames], 
                  axis=1)
        .rename(columns={key: value for (key, value) in enumerate(category_colnames)})
        .round(2).T
        .loc[cats_to_cover, cover_by_cats])
    return coverage
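The `coverage` function computes, for each category `a`, the share of rows with `a == 1` that also have `b == 1`. As a sketch under the assumption that all category values are 0 or 1, the same shares can be obtained from a co-occurrence matrix; the toy frame below is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "weather_related": [1, 1, 0, 1],
    "floods":          [1, 0, 0, 0],
    "storm":           [0, 1, 0, 0],
})

# cooc.loc[a, b] = number of rows with a == 1 and b == 1
# (the diagonal holds the number of rows with a == 1).
cooc = toy.T.dot(toy)

# Divide each row by its diagonal entry:
# share of a == 1 rows that also have b == 1.
counts = pd.Series(cooc.values.diagonal(), index=cooc.index)
share = cooc.div(counts, axis=0).round(2)
print(share.loc[["floods", "storm"], ["weather_related"]])
```

A category that is never 1 yields NaN in its row, because its diagonal count is zero.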

In [None]:
coverage(subs, related + tops)

In [None]:
coverage(related + tops, subs).T

In [None]:
coverage(ambigous_related_sub, request_offer + direct_report)

In [None]:
coverage(subs, request_offer + direct_report)

In [None]:
coverage(request_offer + direct_report, subs).T

In [None]:
df[df.aid_related == 1][aid_related_sub].max(axis=1).min()

In [None]:
df[df.infrastructure_related == 1][infrastructure_related_sub + ambigous_related_sub].max(axis=1).min()

In [None]:
df[df.weather_related == 1][weather_related_sub].max(axis=1).min()

- Every other category column being 1 implies 'related' being 1, but not the other way around: more than 5,000 rows have only 'related' set to 1.
- 'weather_related', 'infrastructure_related' and 'aid_related' seem to act as 'top' categories, in the sense that one of their 'subcategories' being 1 implies the top category being 1 (and the other way around).
- 'request' and 'offer', as well as 'direct_report', are categories of their own.
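Implications of this kind can be tested row by row. The sketch below checks both directions for one top category on a hypothetical toy frame, where `subs_toy` stands in for a list such as `weather_related_sub`:

```python
import pandas as pd

toy = pd.DataFrame({
    "weather_related": [1, 1, 0, 1],
    "floods":          [1, 0, 0, 0],
    "storm":           [0, 1, 0, 0],
})
subs_toy = ["floods", "storm"]  # stand-in for weather_related_sub

# 1 if at least one subcategory is set in the row, else 0.
any_sub = toy[subs_toy].max(axis=1)

# Sub implies top: no row has a sub flag set while the top flag is 0.
sub_implies_top = bool((any_sub <= toy["weather_related"]).all())

# Top implies sub: every row with the top flag has at least one sub flag.
top_implies_sub = bool((toy["weather_related"] <= any_sub).all())
print(sub_implies_top, top_implies_sub)
```

In this toy data the last row has the top flag set without any subcategory, so only the first direction holds.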

In [None]:
for message in df[df.direct_report == 1]['message'].sample(10):
    print(message)

In [None]:
for message in df[(df.direct_report == 0) & (df.related == 1)]['message'].sample(10):
    print(message)

In [None]:
for message in df[(df.infrastructure_related == 1) & (df.related == 1)]['message'].sample(10):
    print(message)