## String Methods in _pandas_

In this notebook, you'll learn about several string methods and how to use them with a _pandas_ DataFrame.

We'll be reading in and trying to combine two datasets.

In [6]:
import pandas as pd

Our first dataset contains unemployment data that was obtained from the Bureau of Labor Statistics.

In [9]:
unemployment = pd.read_csv('../data/tn_unemployment.csv')

unemployment.head()

Unnamed: 0,laus_code,State,County,Name,Period,LF,Employed,Unemployed,unemployment_rate
0,CN4700100000000,47,1,"Anderson County, TN",Mar-21,34704,33010,1694,4.9
1,CN4700300000000,47,3,"Bedford County, TN",Mar-21,20623,19550,1073,5.2
2,CN4700500000000,47,5,"Benton County, TN",Mar-21,6723,6305,418,6.2
3,CN4700700000000,47,7,"Bledsoe County, TN",Mar-21,4252,3947,305,7.2
4,CN4700900000000,47,9,"Blount County, TN",Mar-21,64098,61119,2979,4.6


Now, let's bring in our second DataFrame, which contains population data per county.

In [12]:
population = pd.read_csv('../data/tn_population.csv')

population.head()

Unnamed: 0,Name,Population
0,ANDERSON COUNTY,75129
1,BEDFORD COUNTY,45058
2,BENTON COUNTY,16489
3,BLEDSOE COUNTY,12876
4,BLOUNT COUNTY,123010


Our goal is to combine the unemployment and population data. In order to do this, _pandas_ needs a common column to join on. 

Notice that both DataFrames have a `Name` column. However, we can't merge on it yet, since the capitalization differs and one DataFrame's values include the state.

When working with text data in _pandas_, it is often useful to use the built-in string methods. To use these methods on a column, add `.str` after the column and before the desired method.
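
As a quick illustration of the `.str` pattern, here is a minimal sketch on a small, made-up Series (the values are hypothetical and are not read from the notebook's CSV files). It shows that a `.str` method is applied to each element, and that missing values pass through unchanged.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's CSV files.
names = pd.Series(['ANDERSON COUNTY', 'BEDFORD COUNTY', None])

# .str applies the string method to every element; the missing value stays missing.
lowered = names.str.lower()
print(lowered)

# .str methods that return booleans can be used to filter rows.
# na=False treats the missing value as "does not match".
starts_with_b = names.str.startswith('B', na=False)
print(names[starts_with_b])
```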

### Changing Case

For example, we can make a column entirely lowercase using the `lower()` method (`upper()` does the opposite).

In [18]:
unemployment['Name'].str.lower()

0       anderson county, tn
1        bedford county, tn
2         benton county, tn
3        bledsoe county, tn
4         blount county, tn
              ...          
90         wayne county, tn
91       weakley county, tn
92         white county, tn
93    williamson county, tn
94        wilson county, tn
Name: Name, Length: 95, dtype: object

Alternatively, we can capitalize the first letter of each word using the `title()` method.

In [21]:
population['Name'].str.title()

0       Anderson County
1        Bedford County
2         Benton County
3        Bledsoe County
4         Blount County
            ...        
90         Wayne County
91       Weakley County
92         White County
93    Williamson County
94        Wilson County
Name: Name, Length: 95, dtype: object

Let's use the second method, which gets our columns closer to where they need to be.

In [29]:
population['Name'] = population['Name'].str.title()
population 

Unnamed: 0,Name,Population
0,Anderson County,75129
1,Bedford County,45058
2,Benton County,16489
3,Bledsoe County,12876
4,Blount County,123010
...,...,...
90,Wayne County,17021
91,Weakley County,35021
92,White County,25841
93,Williamson County,202686
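
One caveat worth checking after title-casing: `title()` lowercases every letter that is not at the start of a word, so names with an internal capital will not match a source that preserves it. The sketch below uses hypothetical values chosen to show the edge case. Whether this affects the notebook's data depends on how the names are spelled in each file.

```python
import pandas as pd

# Hypothetical values chosen to show the edge case.
upper_names = pd.Series(['BLOUNT COUNTY', 'DEKALB COUNTY', 'MCMINN COUNTY'])
mixed_names = pd.Series(['Blount County', 'DeKalb County', 'McMinn County'])

titled = upper_names.str.title()
print(titled.tolist())
# ['Blount County', 'Dekalb County', 'Mcminn County']

# Only the first name matches the mixed-case spelling exactly.
exact_match = titled == mixed_names
print(exact_match.tolist())

# Comparing both sides in a single case avoids the problem.
case_insensitive_match = upper_names.str.lower() == mixed_names.str.lower()
print(case_insensitive_match.tolist())
```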


### Replace

Another often useful method is `replace()`. To use it, specify the pattern you want to replace and then the replacement text.

In [61]:
unemployment['Period'].str.replace('-21', ' 2021')

0     Mar 2021
1     Mar 2021
2     Mar 2021
3     Mar 2021
4     Mar 2021
        ...   
90    Mar 2021
91    Mar 2021
92    Mar 2021
93    Mar 2021
94    Mar 2021
Name: Period, Length: 95, dtype: object
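
The pattern passed to `.str.replace()` can be treated either as a literal string or as a regular expression, controlled by the `regex` argument. The default for `regex` has differed between pandas versions, so the sketch below passes it explicitly. The data is a made-up Series, and the regular expression is just one possible way to expand a two-digit year.

```python
import pandas as pd

# Hypothetical toy data.
periods = pd.Series(['Mar-21', 'Apr-21'])

# regex=False: the pattern is matched as a literal string.
literal = periods.str.replace('-21', ' 2021', regex=False)
print(literal.tolist())

# regex=True: the pattern is a regular expression. The capture group keeps
# the two-digit year, and the replacement puts " 20" in front of it.
with_regex = periods.str.replace(r'-(\d{2})$', r' 20\1', regex=True)
print(with_regex.tolist())
```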

**Try It Out** Use the `replace()` method to remove the ", TN" from the Name column of the unemployment DataFrame.

In [53]:
# Your Code Here
unemployment['Name'].str.replace(', TN', '')

0       Anderson County
1        Bedford County
2         Benton County
3        Bledsoe County
4         Blount County
            ...        
90         Wayne County
91       Weakley County
92         White County
93    Williamson County
94        Wilson County
Name: Name, Length: 95, dtype: object

### String Slicing

We can also slice strings using _pandas_ much like we can with regular strings.

In [55]:
unemployment['Period'].str[:3]

0     Mar
1     Mar
2     Mar
3     Mar
4     Mar
     ... 
90    Mar
91    Mar
92    Mar
93    Mar
94    Mar
Name: Period, Length: 95, dtype: object

**Try It Out** Use string slicing to remove the ", TN" from the Name column of the unemployment DataFrame.

In [73]:
# Your Code Here
unemployment['Name'].str[:-4]

0       Anderson County
1        Bedford County
2         Benton County
3        Bledsoe County
4         Blount County
            ...        
90         Wayne County
91       Weakley County
92         White County
93    Williamson County
94        Wilson County
Name: Name, Length: 95, dtype: object
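
Fixed-width slicing assumes every value ends with the same suffix. As a hedged sketch on made-up data (the last value deliberately has no ", TN"), the example below shows what happens when that assumption fails and one way to guard against it with `.str.endswith()`.

```python
import pandas as pd

# Hypothetical toy data: the last value has no state suffix.
names = pd.Series(['Anderson County, TN', 'Bedford County, TN', 'Benton County'])

# Slicing trims four characters from every value, wanted or not.
sliced = names.str[:-4]
print(sliced.tolist())
# ['Anderson County', 'Bedford County', 'Benton Co']

# Check the assumption first, then keep the original value where it fails.
has_suffix = names.str.endswith(', TN')
safe = names.where(~has_suffix, sliced)
print(safe.tolist())
```

If `has_suffix.all()` is `True`, as it appears to be for the column above, plain slicing is safe.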

### String Concatenation

Note that we can also use `+` to concatenate strings. For example, we could append ", TN" to the population `Name` column.

In [118]:
population['Name'] + ', TN'

0       Anderson County, TN
1        Bedford County, TN
2         Benton County, TN
3        Bledsoe County, TN
4         Blount County, TN
              ...          
90         Wayne County, TN
91       Weakley County, TN
92         White County, TN
93    Williamson County, TN
94        Wilson County, TN
Name: Name, Length: 95, dtype: object
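
Concatenation also works between two columns, not just a column and a fixed string. The sketch below uses a small made-up DataFrame (the column names `county` and `state` are hypothetical) to compare `+` with `.str.cat()`, and shows that non-string values need converting first.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({'county': ['Anderson County', 'Bedford County'],
                   'state': ['TN', 'TN']})

# + works element-wise between Series and with plain strings.
plus = df['county'] + ', ' + df['state']
print(plus.tolist())

# .str.cat does the same job with an explicit separator.
cat = df['county'].str.cat(df['state'], sep=', ')
print(cat.tolist())

# Numeric values must be converted to strings before concatenating.
codes = pd.Series([1, 3])
labeled = 'County ' + codes.astype(str)
print(labeled.tolist())
```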

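Slicing can trim further, too. Dropping the last 11 characters removes " County, TN" and leaves only the county name.

In [137]:
unemployment['Name'].str[:-11]
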
0       Anderson
1        Bedford
2         Benton
3        Bledsoe
4         Blount
         ...    
90         Wayne
91       Weakley
92         White
93    Williamson
94        Wilson
Name: Name, Length: 95, dtype: object

### Splitting Strings

Another useful string method is `.str.split()`, which allows us to divide a string into a list of parts by specifying what to split on. 

Notice that if we split on the comma, the first piece will match what is contained in the `Name` column of the population DataFrame.

In [122]:
unemployment['Name'].str.split(',')

0       [Anderson County,  TN]
1        [Bedford County,  TN]
2         [Benton County,  TN]
3        [Bledsoe County,  TN]
4         [Blount County,  TN]
                ...           
90         [Wayne County,  TN]
91       [Weakley County,  TN]
92         [White County,  TN]
93    [Williamson County,  TN]
94        [Wilson County,  TN]
Name: Name, Length: 95, dtype: object

By default, this method returns a Series in which each entry is a list. We can make it return a DataFrame, with one column per piece, by using the `expand` argument.

In [128]:
unemployment['Name'].str.split(',', expand = True)

Unnamed: 0,0,1
0,Anderson County,TN
1,Bedford County,TN
2,Benton County,TN
3,Bledsoe County,TN
4,Blount County,TN
...,...,...
90,Wayne County,TN
91,Weakley County,TN
92,White County,TN
93,Williamson County,TN


We only want the first column.

In [160]:
unemployment['Name'].str.split(',', expand = True)[0]

0       Anderson County
1        Bedford County
2         Benton County
3        Bledsoe County
4         Blount County
            ...        
90         Wayne County
91       Weakley County
92         White County
93    Williamson County
94        Wilson County
Name: 0, Length: 95, dtype: object
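
There is more than one way to pull out the first piece. As a sketch on made-up data, the example below indexes into the lists with `.str[0]` instead of expanding to a DataFrame, and uses `n=1` together with `.str.strip()` to get a clean second piece, since splitting on a bare comma leaves a leading space.

```python
import pandas as pd

# Hypothetical toy data.
names = pd.Series(['Anderson County, TN', 'Bedford County, TN'])

# .str[0] takes the first element of each list, so no DataFrame is needed.
first = names.str.split(',').str[0]
print(first.tolist())

# n=1 splits at the first comma only; expand=True gives one column per piece.
parts = names.str.split(',', n=1, expand=True)

# The second piece keeps the space that followed the comma; strip removes it.
state = parts[1].str.strip()
print(state.tolist())
```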

Finally, we can assign this back to the `Name` column.

In [162]:
unemployment['Name'] = unemployment['Name'].str.split(',', expand = True)[0]

In [164]:
unemployment.head()

Unnamed: 0,laus_code,State,County,Name,Period,LF,Employed,Unemployed,unemployment_rate
0,CN4700100000000,47,1,Anderson County,Mar-21,34704,33010,1694,4.9
1,CN4700300000000,47,3,Bedford County,Mar-21,20623,19550,1073,5.2
2,CN4700500000000,47,5,Benton County,Mar-21,6723,6305,418,6.2
3,CN4700700000000,47,7,Bledsoe County,Mar-21,4252,3947,305,7.2
4,CN4700900000000,47,9,Blount County,Mar-21,64098,61119,2979,4.6


Now we are ready to merge our DataFrames. With no `on` argument, `pd.merge` joins on every column name the two DataFrames share.

In [148]:
pd.merge(left = population, right = unemployment)

Unnamed: 0,Name,Population,laus_code,State,County,Period,LF,Employed,Unemployed,unemployment_rate
0,Anderson County,75129,CN4700100000000,47,1,Mar-21,34704,33010,1694,4.9
1,Bedford County,45058,CN4700300000000,47,3,Mar-21,20623,19550,1073,5.2
2,Benton County,16489,CN4700500000000,47,5,Mar-21,6723,6305,418,6.2
3,Bledsoe County,12876,CN4700700000000,47,7,Mar-21,4252,3947,305,7.2
4,Blount County,123010,CN4700900000000,47,9,Mar-21,64098,61119,2979,4.6
...,...,...,...,...,...,...,...,...,...,...
87,Wayne County,17021,CN4718100000000,47,181,Mar-21,6416,6074,342,5.3
88,Weakley County,35021,CN4718300000000,47,183,Mar-21,15494,14783,711,4.6
89,White County,25841,CN4718500000000,47,185,Mar-21,12085,11484,601,5.0
90,Williamson County,202686,CN4718700000000,47,187,Mar-21,129484,125213,4271,3.3
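
An inner merge silently drops rows whose keys do not match, so it is worth checking what was left out. The sketch below is one way to do that, using small made-up DataFrames (the values are hypothetical): it names the join column explicitly with `on`, keeps all rows with `how='outer'`, and uses `indicator=True` to label where each row came from.

```python
import pandas as pd

# Hypothetical toy frames; one name differs only in capitalization.
left = pd.DataFrame({'Name': ['Anderson County', 'Dekalb County'],
                     'Population': [100, 200]})
right = pd.DataFrame({'Name': ['Anderson County', 'DeKalb County'],
                      'unemployment_rate': [1.0, 2.0]})

# An outer merge keeps unmatched rows; indicator=True adds a _merge column
# with the values 'both', 'left_only', or 'right_only'.
merged = pd.merge(left, right, on='Name', how='outer', indicator=True)

# Rows that did not find a partner in the other DataFrame.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['Name', '_merge']])
```

Any rows listed in `unmatched` point to names that still need cleaning before the merge is complete.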
