<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/032__Working_With_Strings_In_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 4/6: DATA CLEANING AND ANALYSIS

# MISSION 4: Working With Strings In Pandas

Learn how to work with strings in pandas.

## 1. Introduction

In the previous mission, we learned how to use the `apply()`, `map()`, and `applymap()` methods to apply a function to the elements of a series or dataframe. While we could certainly use these methods to clean strings in columns, pandas has many built-in vectorized string methods that can perform these tasks faster and with fewer keystrokes.

We introduced some of these methods already in the Pandas Fundamentals course when we learned the following data cleaning tasks:

- Cleaning column names
- Extracting values from the start of strings
- Extracting values from the end of strings

In this mission, we'll learn a few other string cleaning tasks, such as:

- Finding specific strings or substrings in columns
- Extracting substrings from unstructured data
- Removing strings or substrings from a series

As we learn these tasks, we'll also work to build intuition around how these string methods operate so that you can explore methods we haven't explicitly covered on your own.

We'll work with the 2015 World Happiness Report again, along with additional economic data from the World Bank. You can find the data set here: [World Happiness Report](https://www.kaggle.com/unsdsn/world-happiness). A preview of the data appears below, once the files are loaded.

In [1]:
# Import files directly using Google Colab
# Download the files from the links below:
# World_Happiness_2015.csv: https://drive.google.com/file/d/1iZ8_lHkMx7pI22s4ECfpNHKnOohyPfvU/view?usp=sharing

from google.colab import files
upload = files.upload()
upload = files.upload()

Saving World_Happiness_2015.csv to World_Happiness_2015.csv


Saving World_dev.csv to World_dev.csv


In [2]:
# Import pandas and numpy libraries
import pandas as pd
import numpy as np

In [3]:
# Read the csv files
happiness2015 = pd.read_csv("World_Happiness_2015.csv")
world_dev = pd.read_csv("World_dev.csv")

In [4]:
happiness2015.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176


Below are descriptions for the columns we'll be working with:

- `ShortName` - Name of the country
- `Region` - The region the country belongs to
- `IncomeGroup` - The income group the country belongs to, based on Gross National Income (GNI) per capita
- `CurrencyUnit` - Name of country's currency
- `SourceOfMostRecentIncomeAndExpenditureData` - The name of the survey used to collect the income and expenditure data
- `SpecialNotes` - Contains any miscellaneous notes about the data

To start, let's read the data sets into pandas and combine them.

**Instructions:**

We've already read `World_Happiness_2015.csv` into a dataframe called `happiness2015` and `World_dev.csv` into a dataframe called `world_dev`.

- Use the `pd.merge()` function to combine `happiness2015` and `world_dev`. Save the resulting dataframe to `merged`. As a reminder, you can use the following syntax to combine the dataframes: `pd.merge(left=df1, right=df2, how='left', left_on='left_df_Column_Name', right_on='right_df_Column_Name')`.
  - Set the `left_on` parameter to the `Country` column from `happiness2015` and the `right_on` parameter to the `ShortName` column from `world_dev`.
- Use the `DataFrame.rename()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) to rename the `SourceOfMostRecentIncomeAndExpenditureData` column in `merged` to `IESurvey` (because we don't want to keep typing that long name!).
  - We've already saved the mapping to a dictionary named `col_renaming`.
  - Make sure to set the `axis` parameter to 1.

In [5]:
world_dev = pd.read_csv("World_dev.csv")
col_renaming = {'SourceOfMostRecentIncomeAndExpenditureData': 'IESurvey'}
merged = pd.merge(left=happiness2015, right=world_dev, how='left', left_on='Country', right_on='ShortName')
merged = merged.rename(col_renaming, axis=1)

## 2. Using Apply to Transform Strings

In the last step, we combined `happiness2015` and `world_dev` and assigned the result to `merged`. Below are the first five rows of `merged`:

In [6]:
merged.head()

Unnamed: 0,Country,Region_x,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,CountryCode,ShortName,TableName,LongName,Alpha2Code,CurrencyUnit,SpecialNotes,Region_y,IncomeGroup,Wb2Code,NationalAccountsBaseYear,NationalAccountsReferenceYear,SnaPriceValuation,LendingCategory,OtherGroups,SystemOfNationalAccounts,AlternativeConversionFactor,PppSurveyYear,BalanceOfPaymentsManualInUse,ExternalDebtReportingStatus,SystemOfTrade,GovernmentAccountingConcept,ImfDataDisseminationStandard,LatestPopulationCensus,LatestHouseholdSurvey,IESurvey,VitalRegistrationComplete,LatestAgriculturalCensus,LatestIndustrialData,LatestTradeData,LatestWaterWithdrawalData
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,CHE,Switzerland,Switzerland,Switzerland,CH,Swiss franc,,Europe & Central Asia,High income: OECD,CH,Original chained constant price data are resca...,2010,Value added at basic prices (VAB),,,Country uses the 2008 System of National Accou...,,Rolling,"IMF Balance of Payments Manual, 6th edition.",,Special trade system,Consolidated central government,Special Data Dissemination Standard (SDDS),2010,,"Expenditure survey/budget survey (ES/BS), 2004",Yes,2008,2010.0,2013.0,2000.0
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,ISL,Iceland,Iceland,Republic of Iceland,IS,Iceland krona,,Europe & Central Asia,High income: OECD,IS,Original chained constant price data are resca...,2010,Value added at basic prices (VAB),,,Country uses the 2008 System of National Accou...,,Rolling,"IMF Balance of Payments Manual, 6th edition.",,General trade system,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Integrated household survey (IHS), 2010",Yes,2010,2005.0,2013.0,2005.0
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,DNK,Denmark,Denmark,Kingdom of Denmark,DK,Danish krone,,Europe & Central Asia,High income: OECD,DK,Original chained constant price data are resca...,2010,Value added at basic prices (VAB),,,Country uses the 2008 System of National Accou...,,Rolling,"IMF Balance of Payments Manual, 6th edition.",,Special trade system,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Income tax registers (ITR), 2010",Yes,2010,2010.0,2013.0,2009.0
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,NOR,Norway,Norway,Kingdom of Norway,NO,Norwegian krone,,Europe & Central Asia,High income: OECD,NO,Original chained constant price data are resca...,2010,Value added at basic prices (VAB),,,Country uses the 2008 System of National Accou...,,Rolling,"IMF Balance of Payments Manual, 6th edition.",,General trade system,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Income survey (IS), 2010",Yes,2010,2010.0,2013.0,2006.0
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,CAN,Canada,Canada,Canada,CA,Canadian dollar,Fiscal year end: March 31; reporting period fo...,North America,High income: OECD,CA,Original chained constant price data are resca...,2010,Value added at basic prices (VAB),,,Country uses the 2008 System of National Accou...,,2011,"IMF Balance of Payments Manual, 6th edition.",,General trade system,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Labor force survey (LFS), 2010",Yes,2011,2011.0,2013.0,1986.0


Let's work with the `CurrencyUnit` column first. Suppose we wanted to extract the unit of currency without the leading nationality. For example, instead of "Danish krone" or "Norwegian krone", we just needed "krone".

If we wanted to complete this task for just one of the strings, we could use Python's `string.split()` [method](https://docs.python.org/3/library/stdtypes.html):
```
words = 'Danish krone'

#Use the string.split() method to return the following list: ['Danish', 'krone']
listwords = words.split()

#Use the index -1 to return the last word of the list.
listwords[-1]
```
Now, to repeat this task for each element in the Series, let's return to a concept we learned in the previous mission - the `Series.apply()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html).


**Instructions:**

- Write a function called `extract_last_word` with the following criteria:
  - The function should accept one parameter called `element`.
  - Use the `string.split()` method to split the object into a list. First convert `element` to a string as follows: `str(element)`.
  - Return the last word of the list.
- Use the `Series.apply()` method to apply the function to the `CurrencyUnit` column. Save the result to `merged['Currency Apply']`.
- Use the `Series.head()` method to print the first five rows in `merged['Currency Apply']`.
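
One possible solution to the exercise above is sketched below. To keep it self-contained, it runs on a small stand-in dataframe named `sample` (a hypothetical name, with values copied from the preview of `merged`) rather than on the real data; with the real data you would use `merged` in its place.

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for merged, including one missing value
sample = pd.DataFrame(
    {'CurrencyUnit': ['Swiss franc', 'Danish krone', 'Canadian dollar', np.nan]}
)

def extract_last_word(element):
    # Convert to a string first so that non-string values don't raise an error
    words = str(element).split()
    # Index -1 selects the last word in the list
    return words[-1]

sample['Currency Apply'] = sample['CurrencyUnit'].apply(extract_last_word)
print(sample['Currency Apply'].head())
```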

## 3. Vectorized String Methods Overview

In the last exercise, we extracted the last word of each element in the `CurrencyUnit` column using the `Series.apply()` method. However, we also learned in the last mission that we should use built-in vectorized methods (if they exist) instead of the `Series.apply()` method for performance reasons.

Instead, we could've split each element in the `CurrencyUnit` column into a list of strings with the `Series.str.split()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html), the vectorized equivalent of Python's `string.split()` method:
![img](https://s3.amazonaws.com/dq-content/346/Split.png)

In fact, pandas has built in a number of vectorized methods that perform the same operations for strings in series as Python string methods.

Below are some common vectorized string methods, but you can find the full list [here](https://pandas.pydata.org/pandas-docs/stable/text.html#method-summary):

|Method|Description|
|---|---|
|`Series.str.split()`|Splits each element in the Series.|
|`Series.str.strip()`|Strips whitespace from each string in the Series.|
|`Series.str.lower()`|Converts strings in the Series to lowercase.|
|`Series.str.upper()`|Converts strings in the Series to uppercase.|
|`Series.str.get()`|Retrieves the ith element of each element in the Series.|
|`Series.str.replace()`|Replaces a regex or string in the Series with another string.|
|`Series.str.cat()`|Concatenates strings in a Series.|
|`Series.str.extract()`|Extracts substrings from the Series matching a regex pattern.|

We access these vectorized string methods by adding a `str` between the Series name and method name:
![img](https://s3.amazonaws.com/dq-content/346/Syntax.png)

The `str` attribute indicates that each object in the Series should be treated as a string, without us having to explicitly change the type to a string like we did when using the `apply` method.
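
To make the syntax concrete, here is a small sketch that applies a few of the methods from the table to a made-up Series. The name `s` and its values are assumptions for illustration and are not taken from the data set.

```python
import pandas as pd

# Hypothetical sample values; the first one has extra whitespace
s = pd.Series(['  Swiss franc ', 'Danish krone'])

# Remove leading and trailing whitespace from each string
stripped = s.str.strip()

# Convert each string to uppercase
upper = stripped.str.upper()

# Split each string on whitespace into a list of words
words = stripped.str.split()

# Replace a literal substring (regex=False treats the pattern as plain text)
replaced = stripped.str.replace('krone', 'crown', regex=False)

print(upper)
print(words)
print(replaced)
```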

Note that we can also slice each element in the Series to extract characters, but we'd still need to use the `str` attribute. For example, below we access the first five characters in each element of the `CurrencyUnit` column:



In [7]:
merged['CurrencyUnit'].str[0:5]

0      Swiss
1      Icela
2      Danis
3      Norwe
4      Canad
       ...  
153    Rwand
154    West 
155      NaN
156    Burun
157    West 
Name: CurrencyUnit, Length: 158, dtype: object

It's also good to know that vectorized string methods can be chained. For example, suppose we needed to convert each element in the `CurrencyUnit` column to uppercase using the `Series.str.upper()` method and then split it into a list of strings using the `Series.str.split()` method. You can use the following syntax to apply more than one method at once:



In [8]:
merged['CurrencyUnit'].str.upper().str.split()

0                   [SWISS, FRANC]
1                 [ICELAND, KRONA]
2                  [DANISH, KRONE]
3               [NORWEGIAN, KRONE]
4               [CANADIAN, DOLLAR]
                  ...             
153               [RWANDAN, FRANC]
154    [WEST, AFRICAN, CFA, FRANC]
155                            NaN
156               [BURUNDI, FRANC]
157    [WEST, AFRICAN, CFA, FRANC]
Name: CurrencyUnit, Length: 158, dtype: object

However, don't forget to include `str` before each method name, or you'll get an error!
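
The sketch below illustrates that error on a made-up Series (`s` is a hypothetical name). The first string method returns a Series, and a Series has no `split()` method of its own, so the second call fails unless it also goes through `str`.

```python
import pandas as pd

s = pd.Series(['Swiss franc', 'Danish krone'])  # hypothetical sample values

error_message = None
try:
    # Missing the second `.str`: split() is looked up on the Series itself
    s.str.upper().split()
except AttributeError as err:
    error_message = str(err)
print(error_message)

# Correct: repeat `.str` before each string method in the chain
result = s.str.upper().str.split()
print(result)
```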

**Instructions:**

- Use the `Series.str.split()` method to split the `CurrencyUnit` column into a list of words and then use the `Series.str.get()` method to select just the last word. Assign the result to `merged['Currency Vectorized']`.
- Use the `Series.head()` method to print the first five rows in `merged['Currency Vectorized']`.
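
One way to approach this exercise is sketched below on a small stand-in dataframe (`sample` is a hypothetical name; with the real data you would use `merged` instead). Passing `-1` to `Series.str.get()` selects the last item of each list, just as `listwords[-1]` did for a single string.

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for merged, including one missing value
sample = pd.DataFrame(
    {'CurrencyUnit': ['Swiss franc', 'West African CFA franc', np.nan]}
)

# Split each string into a list of words, then take the last word of each list
sample['Currency Vectorized'] = sample['CurrencyUnit'].str.split().str.get(-1)
print(sample['Currency Vectorized'].head())
```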

## 4. Exploring Missing Values with Vectorized String Methods

We learned that using vectorized string methods results in:

1. Better performance
2. Code that is easier to read and write

Let's explore another benefit of using vectorized string methods next. Suppose we wanted to compute the length of each string in the `CurrencyUnit` column. If we use the `Series.apply()` method, what happens to the missing values in the column?

First, let's use the `Series.isnull()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html) to confirm if there are any missing values in the column:



In [9]:
merged['CurrencyUnit'].isnull().sum()

13

So, we know that the `CurrencyUnit` column has 13 missing values.

Next, let's create a function to return the length of each currency unit and apply it to the `CurrencyUnit` column:



In [10]:
def compute_lengths(element):
    return len(str(element))
lengths_apply = merged['CurrencyUnit'].apply(compute_lengths)

Then, we can check the number of missing values in the result by setting the `dropna` parameter in the `Series.value_counts()` method to False:

In [11]:
lengths_apply.value_counts(dropna=False)

14    21
4     20
12    17
13    14
3     13
15    13
16    12
18     9
17     9
11     8
22     7
25     5
19     3
9      2
26     1
20     1
23     1
10     1
39     1
Name: CurrencyUnit, dtype: int64

Since the original column had 13 missing values and *`NaN` doesn't appear in the list of unique values above*, we know our function must have treated `NaN` as a string and returned a length of `3` for each `NaN` value. This doesn't make sense - missing values shouldn't be treated as strings. They should instead have been *excluded* from the calculation.

If we wanted to exclude missing values, we'd have to update our function to something like this:
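
A minimal sketch of such a function follows. It runs on a made-up Series (`sample` is a hypothetical name) rather than on the real column, and returns `NaN` for missing values instead of measuring the string `'nan'`. For comparison, the last lines use the vectorized `Series.str.len()` method, which leaves missing values as `NaN` without any extra check.

```python
import pandas as pd
import numpy as np

sample = pd.Series(['Swiss franc', np.nan, 'Danish krone'])  # hypothetical values

def compute_lengths(element):
    # Return NaN for missing values so they are excluded from the lengths
    if pd.isnull(element):
        return np.nan
    return len(str(element))

lengths_apply = sample.apply(compute_lengths)
print(lengths_apply.value_counts(dropna=False))

# Vectorized alternative: missing values stay NaN automatically
lengths_vectorized = sample.str.len()
print(lengths_vectorized.value_counts(dropna=False))
```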

## 5. Finding Specific Words in Strings

## 6. Finding Specific Words in Strings Continued

## 7. Extracting Substrings from a Series

## 8. Extracting Substrings from a Series Continued

## 9. Extracting All Matches of a Pattern from a Series

## 10. Extracting More Than One Group of Patterns from a Series

## 11. Challenge: Clean a String Column, Aggregate the Data, and Plot the Results