### String methods in pandas

- In a pandas DataFrame we often have columns that contain string values. Pandas provides an easy way to apply string methods to a whole column, which is just a pandas Series object. We can treat the values of a Series (or column) as strings and apply string methods to them through the Series' `str` attribute. Let's look at some examples.
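
Before loading the real data, the idea can be sketched on a small hand-made Series. The two names below are illustrative stand-ins for a text column; the point is only that `.str` applies an ordinary Python string method to every element at once.

```python
import pandas as pd

# Toy data standing in for a text column (an assumption for illustration)
names = pd.Series(["Kelly, Mr. James", "Wirz, Mr. Albert"])

# A plain Python string method works on one value at a time
single = names[0].upper()

# The .str accessor applies the same method to every element of the Series
upper_names = names.str.upper()
print(upper_names.tolist())
```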



In [2]:
import pandas as pd
import warnings 
warnings.filterwarnings("ignore")

In [3]:
df=pd.read_csv(r"data\Titanic.csv")

In [4]:
df

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


- Here the `Name` column contains strings, which pandas stores with the `object` dtype.

**Applying String Methods to a Column**

- To apply a string method to a column, we use the `str` attribute of that Series object. The general format is:

`Series.str.<string_method>()`
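
Two behaviours of the `str` accessor are worth knowing before applying it to real columns: missing values stay missing, and the accessor is only available on text data. The sketch below uses made-up cabin codes and numbers rather than the Titanic file.

```python
import pandas as pd

# Illustrative values; None plays the role of a missing cabin
cabins = pd.Series(["C105", None, "C80"])

# String methods skip missing entries and leave them as NaN
lowered = cabins.str.lower()
print(lowered)

# On a numeric Series the accessor is not available at all
numbers = pd.Series([1, 2, 3])
try:
    numbers.str.upper()
    accessor_failed = False
except AttributeError:
    accessor_failed = True
print(accessor_failed)
```

If a numeric column needs string handling, converting it first with `astype(str)` is one option.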

In [7]:
df["Name"]

0                                  Kelly, Mr. James
1                  Wilkes, Mrs. James (Ellen Needs)
2                         Myles, Mr. Thomas Francis
3                                  Wirz, Mr. Albert
4      Hirvonen, Mrs. Alexander (Helga E Lindqvist)
                           ...                     
413                              Spector, Mr. Woolf
414                    Oliva y Ocana, Dona. Fermina
415                    Saether, Mr. Simon Sivertsen
416                             Ware, Mr. Frederick
417                        Peter, Master. Michael J
Name: Name, Length: 418, dtype: object

In [8]:
df["Name"].str.upper()         # Converts every name to uppercase

0                                  KELLY, MR. JAMES
1                  WILKES, MRS. JAMES (ELLEN NEEDS)
2                         MYLES, MR. THOMAS FRANCIS
3                                  WIRZ, MR. ALBERT
4      HIRVONEN, MRS. ALEXANDER (HELGA E LINDQVIST)
                           ...                     
413                              SPECTOR, MR. WOOLF
414                    OLIVA Y OCANA, DONA. FERMINA
415                    SAETHER, MR. SIMON SIVERTSEN
416                             WARE, MR. FREDERICK
417                        PETER, MASTER. MICHAEL J
Name: Name, Length: 418, dtype: object

In [9]:
df["Name"].str.capitalize()

0                                  Kelly, mr. james
1                  Wilkes, mrs. james (ellen needs)
2                         Myles, mr. thomas francis
3                                  Wirz, mr. albert
4      Hirvonen, mrs. alexander (helga e lindqvist)
                           ...                     
413                              Spector, mr. woolf
414                    Oliva y ocana, dona. fermina
415                    Saether, mr. simon sivertsen
416                             Ware, mr. frederick
417                        Peter, master. michael j
Name: Name, Length: 418, dtype: object
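
As the output shows, `capitalize` uppercases only the first character of each value and lowercases the rest. If the goal is to capitalize every word, `str.title()` may be the better fit. A minimal comparison on two illustrative names:

```python
import pandas as pd

# Illustrative names with inconsistent casing
names = pd.Series(["KELLY, MR. JAMES", "wilkes, mrs. james (ellen needs)"])

# capitalize: first character upper, everything else lower
capitalized = names.str.capitalize()

# title: first letter of every word upper, the rest lower
titled = names.str.title()

print(capitalized.tolist())
print(titled.tolist())
```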

**Replacing Characters in a Column**


- What if we want to replace every comma (`,`) in the `Name` column with an underscore? We can do this with the `replace` method: we specify what we want to replace, followed by what we want to replace it with.

`Series.str.replace(",","_")`

In [10]:
df["Name"].str.replace(",","_")      # All the commas in the Name column are replaced with underscores

0                                  Kelly_ Mr. James
1                  Wilkes_ Mrs. James (Ellen Needs)
2                         Myles_ Mr. Thomas Francis
3                                  Wirz_ Mr. Albert
4      Hirvonen_ Mrs. Alexander (Helga E Lindqvist)
                           ...                     
413                              Spector_ Mr. Woolf
414                    Oliva y Ocana_ Dona. Fermina
415                    Saether_ Mr. Simon Sivertsen
416                             Ware_ Mr. Frederick
417                        Peter_ Master. Michael J
Name: Name, Length: 418, dtype: object
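
`str.replace` also accepts a `regex` argument, and its default has changed between pandas versions, so passing it explicitly avoids surprises. The sketch below, on two names copied from the output above, shows a literal replacement and a regular-expression replacement that drops the parenthesised part of a name.

```python
import pandas as pd

names = pd.Series(["Wilkes, Mrs. James (Ellen Needs)", "Kelly, Mr. James"])

# Literal replacement: the pattern is taken character for character
underscored = names.str.replace(",", "_", regex=False)

# Regex replacement: remove optional whitespace plus anything in parentheses
no_parens = names.str.replace(r"\s*\(.*\)", "", regex=True)

print(underscored.tolist())
print(no_parens.tolist())
```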

**Chaining Methods**

- We can also chain string methods together. Remember that these methods return a pandas Series, so we can simply call another string method on the result through `.str`. For example, say we want to replace every comma (`,`) with an underscore and change all the names to lowercase. We can accomplish this with the following code:

In [11]:
df["Name"].str.replace(",","_").str.lower()

0                                  kelly_ mr. james
1                  wilkes_ mrs. james (ellen needs)
2                         myles_ mr. thomas francis
3                                  wirz_ mr. albert
4      hirvonen_ mrs. alexander (helga e lindqvist)
                           ...                     
413                              spector_ mr. woolf
414                    oliva y ocana_ dona. fermina
415                    saether_ mr. simon sivertsen
416                             ware_ mr. frederick
417                        peter_ master. michael j
Name: Name, Length: 418, dtype: object
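
Longer chains are often easier to read when wrapped in parentheses with one method per line. The following sketch assumes a hypothetical column with stray whitespace around the names; it strips the whitespace, swaps the commas, and lowercases the result.

```python
import pandas as pd

# Hypothetical raw values with leading/trailing spaces
raw = pd.Series(["  Kelly, Mr. James ", "Wirz, Mr. Albert"])

clean = (
    raw.str.strip()                         # drop surrounding whitespace
       .str.replace(",", "_", regex=False)  # literal comma -> underscore
       .str.lower()                         # lowercase everything
)
print(clean.tolist())
```

Each step needs its own `.str`, because every method returns a new Series rather than a string.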

**Creating a new column with the length of `Name` in the dataframe**

In [12]:
df["Name_Length"]=df["Name"].str.len()

In [14]:
df      # A new column with the length of each name is added as the last column of the dataframe

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name_Length
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,16
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S,32
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,25
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,16
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,44
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,18
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C,28
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,28
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,19
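
Once the length is stored as a regular numeric column, it can be used like any other column, for example to find the longest name or to filter on a threshold. A small sketch on three names taken from the table above (the threshold of 20 is an arbitrary choice for illustration):

```python
import pandas as pd

df_small = pd.DataFrame({
    "Name": [
        "Kelly, Mr. James",
        "Hirvonen, Mrs. Alexander (Helga E Lindqvist)",
        "Wirz, Mr. Albert",
    ]
})
df_small["Name_Length"] = df_small["Name"].str.len()

# Row label of the maximum length, then look up the name in that row
longest = df_small.loc[df_small["Name_Length"].idxmax(), "Name"]

# Keep only rows whose name is longer than 20 characters
long_names = df_small[df_small["Name_Length"] > 20]

print(longest)
print(len(long_names))
```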


**Extracting gender without using the `Sex` column**

- `str.contains` treats its pattern as a regular expression by default, so we need the escape character `\` to match a literal `.`. The returned boolean Series can then be used to filter the rows of the dataframe as follows.

In [17]:
df[df["Name"].str.contains(r"Mr\.")]     # Rows with the title "Mr.", found without the "Sex" column; other male titles such as "Master." are not matched

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name_Length
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,16
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,25
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,16
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.2250,,S,26
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0000,,S,28
...,...,...,...,...,...,...,...,...,...,...,...,...
406,1298,2,"Ware, Mr. William Jeffery",male,23.0,1,0,28666,10.5000,,S,25
407,1299,1,"Widener, Mr. George Dunton",male,50.0,1,1,113503,211.5000,C80,C,26
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,18
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,28
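
Instead of testing for one title at a time, the title itself can be pulled out with `str.extract` and a capture group. The pattern below is an assumption about the name layout seen in this data ("Surname, Title. Given names"); names that do not follow it would yield a missing value.

```python
import pandas as pd

names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Peter, Master. Michael J",
    "Oliva y Ocana, Dona. Fermina",
])

# Boolean mask: True only where the literal text "Mr." occurs
is_mr = names.str.contains(r"Mr\.")

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(is_mr.tolist())
print(titles.tolist())
```

Note that "Mrs." does not match `Mr\.`, because the character after "Mr" is "s" rather than a period.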


**Separating the surname from the rest of the `Name` column**

In [18]:
df["Name"]

0                                  Kelly, Mr. James
1                  Wilkes, Mrs. James (Ellen Needs)
2                         Myles, Mr. Thomas Francis
3                                  Wirz, Mr. Albert
4      Hirvonen, Mrs. Alexander (Helga E Lindqvist)
                           ...                     
413                              Spector, Mr. Woolf
414                    Oliva y Ocana, Dona. Fermina
415                    Saether, Mr. Simon Sivertsen
416                             Ware, Mr. Frederick
417                        Peter, Master. Michael J
Name: Name, Length: 418, dtype: object

In [22]:
df["Name"].str.split(",",expand=True)     # Column 0 holds the surname; column 1 holds the title and given names

,0,1
0,Kelly,Mr. James
1,Wilkes,Mrs. James (Ellen Needs)
2,Myles,Mr. Thomas Francis
3,Wirz,Mr. Albert
4,Hirvonen,Mrs. Alexander (Helga E Lindqvist)
...,...,...
413,Spector,Mr. Woolf
414,Oliva y Ocana,Dona. Fermina
415,Saether,Mr. Simon Sivertsen
416,Ware,Mr. Frederick
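
The split result has integer column labels, and the second part keeps the space that followed the comma. A possible follow-up, sketched on two names from the output above, limits the split to the first comma, strips the leftover whitespace, and gives the columns readable names. The names `Surname` and `Rest` are hypothetical labels chosen for this example.

```python
import pandas as pd

names = pd.Series(["Kelly, Mr. James", "Oliva y Ocana, Dona. Fermina"])

# n=1 splits at the first comma only, so there are always two columns
parts = names.str.split(",", n=1, expand=True)
parts.columns = ["Surname", "Rest"]

# Remove the space left over after the comma
parts["Rest"] = parts["Rest"].str.strip()

print(parts)
```

These columns could then be attached to the original dataframe, for example with `pd.concat([df, parts], axis=1)`.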
