# Transforming character (string) variables

## The most common operations on string variables

https://docs.python.org/3/library/stdtypes.html#string-methods
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
https://www.aboutdatablog.com/post/10-most-useful-string-functions-in-pandas


pandas provides a set of vectorized string methods that make it easy to operate on text data. Most importantly, these methods skip missing (NaN) values and pass them through to the result instead of raising an error.

Almost all of these methods mirror Python's built-in string methods (see https://docs.python.org/3/library/stdtypes.html#string-methods). To use them, go through the `.str` accessor of a Series or Index and then call the method.
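To make the NaN behaviour concrete, here is a minimal sketch (the three-element Series is made up for illustration). It compares the `.str` accessor with calling the plain Python method on every element:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', np.nan, 'John'])

# The .str accessor works element-wise and leaves the missing value as NaN.
upper = s.str.upper()

# Calling the plain Python method on every element fails on the missing value,
# because NaN is a float and has no .upper() method.
try:
    s.map(lambda x: x.upper())
    plain_python_failed = False
except AttributeError:
    plain_python_failed = True

print(upper)
print("Plain Python call failed:", plain_python_failed)
```

This is why the `.str` accessor is usually preferred over `map` or `apply` with a string method when a column may contain missing values.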

The most commonly used methods are listed below, followed by an example of each.


1. `lower()`: converts each string in the Series/Index to lower case.
2. `upper()`: converts each string in the Series/Index to upper case.
3. `len()`: computes the length of each string.
4. `strip()`: strips whitespace (including newlines) from both sides of each string in the Series/Index.
5. `split(' ')`: splits each string on the given separator.
6. `cat(sep=' ')`: concatenates the Series/Index elements into a single string, using the given separator.
7. `get_dummies()`: returns a DataFrame of one-hot encoded values.
8. `contains(pattern)`: returns True for each element that contains the pattern, and False otherwise.
9. `replace(a, b)`: replaces occurrences of `a` with `b` in each element.
10. `repeat(value)`: repeats each element the specified number of times.
11. `count(pattern)`: returns the number of occurrences of the pattern in each element.
12. `startswith(pattern)`: returns True if the element starts with the given prefix.
13. `endswith(pattern)`: returns True if the element ends with the given suffix.
14. `find(pattern)`: returns the position of the first occurrence of the substring, or -1 if it is not found.
15. `findall(pattern)`: returns a list of all occurrences of the pattern in each element.
16. `swapcase()`: swaps lower case and upper case.
17. `islower()`: checks whether all cased characters in each string are lower case. Returns a Boolean.
18. `isupper()`: checks whether all cased characters in each string are upper case. Returns a Boolean.
19. `isnumeric()`: checks whether all characters in each string are numeric. Returns a Boolean.

In [9]:
##########
#  Data  #
##########

import pandas as pd
import numpy as np



In [10]:
# Use the pandas .str accessor

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

s

# lower()

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

s.str.lower()

0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      stevesmith
dtype: object

In [11]:
# upper()

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

s.str.upper()

0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVESMITH
dtype: object

In [12]:
# len()

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
s.str.len()

0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64
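The lengths above come back as `float64` because the missing value has to be stored as NaN. If integer lengths are preferred, one option is pandas' nullable `Int64` dtype. The following is a minimal sketch with a made-up Series, not the only way to do it:

```python
import numpy as np
import pandas as pd

# dtype=object is set explicitly so the example behaves like the cells above.
s = pd.Series(['Tom', 'William Rick', np.nan, '1234'], dtype=object)

lengths = s.str.len()                   # float64, NaN for the missing value
lengths_int = lengths.astype('Int64')   # nullable integers, <NA> for the missing value

print(lengths_int)
```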

In [13]:
# strip()

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
s
print ("After Stripping:")
s.str.strip()

After Stripping:


0             Tom
1    William Rick
2            John
3         Alber@t
dtype: object

In [14]:
# split(pattern)

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
s
print("Split Pattern:")
s.str.split(' ')

Split Pattern:


0              [Tom, ]
1    [, William, Rick]
2               [John]
3            [Alber@t]
dtype: object
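The empty strings in the output above appear because the leading and trailing spaces are also treated as split points. A minimal sketch of two common alternatives: splitting on any whitespace by omitting the separator, and stripping first and then expanding the pieces into columns with `expand=True`:

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

# With no separator, split() splits on runs of whitespace and drops empty pieces.
whitespace_split = s.str.split()

# Strip first, then spread the pieces over DataFrame columns.
parts = s.str.strip().str.split(' ', expand=True)

print(whitespace_split)
print(parts)
```

Rows with fewer pieces than the widest row are padded with missing values.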

In [15]:
# cat(sep=pattern)

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

s.str.cat(sep='_')

'Tom _ William Rick_John_Alber@t'

In [16]:
# get_dummies()

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
s.str.get_dummies()

|   | William Rick | Alber@t | John | Tom |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 |
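`get_dummies()` treats each string as a set of labels separated by `sep` (the default is `'|'`), which is why the example above produces one column per distinct string. A minimal sketch with a made-up Series of pipe-delimited tags shows the multi-label case:

```python
import pandas as pd

# Hypothetical tags, several per row, separated by '|'.
tags = pd.Series(['a|b', 'b', 'a|c'])

dummies = tags.str.get_dummies(sep='|')

print(dummies)
```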


In [17]:
# contains(pattern)

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

s.str.contains(' ')

0     True
1     True
2    False
3    False
dtype: bool
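The Boolean result of `contains()` is typically used as a row filter. The sketch below assumes a made-up Series with a missing value. `na=False` makes the missing entry count as "no match", and `regex=False` treats the pattern as literal text:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', np.nan, 'Alber@t'], dtype=object)

# Literal match; missing values are reported as False instead of NaN.
mask = s.str.contains('@', regex=False, na=False)

# Keep only the rows where the mask is True.
filtered = s[mask]

print(filtered)
```

Without `na=False`, the mask would contain NaN for the missing row and could not be used directly for indexing.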

In [18]:
# replace(a,b)

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print("After replacing @ with $:")
# regex=False treats '@' as literal text rather than a regular expression
s.str.replace('@', '$', regex=False)

After replacing @ with $:


0             Tom 
1     William Rick
2             John
3          Alber$t
dtype: object
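`replace()` can also take a regular expression when `regex=True` is passed. Here is a minimal sketch on a made-up Series: a literal replacement, then a regex replacement that collapses runs of whitespace.

```python
import pandas as pd

s = pd.Series(['Tom ', ' William  Rick', 'Alber@t'])

# Literal replacement: only the '@' character is affected.
literal = s.str.replace('@', '$', regex=False)

# Regex replacement: trim the ends, then collapse repeated whitespace to one space.
collapsed = s.str.strip().str.replace(r'\s+', ' ', regex=True)

print(literal)
print(collapsed)
```

Stating `regex` explicitly avoids surprises, since its default value has changed between pandas versions.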

In [19]:
# repeat(value)

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

s.str.repeat(2)

0                      Tom Tom 
1     William Rick William Rick
2                      JohnJohn
3                Alber@tAlber@t
dtype: object

In [20]:
# count(pattern)

s = pd.Series(['Tommmmmmmm', ' William Rick', 'John', 'Alber@t'])

print("The number of 'm's in each string:")
s.str.count('m')

The number of 'm's in each string:


0    8
1    1
2    0
3    0
dtype: int64

In [21]:
# startswith(pattern) 

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print("Strings that start with 'T':")
s.str.startswith('T')

Strings that start with 'T':


0     True
1    False
2    False
3    False
dtype: bool

In [22]:
# endswith(pattern)

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("Strings that end with 't':")
s.str.endswith('t')

Strings that end with 't':


0    False
1    False
2    False
3     True
dtype: bool

In [23]:
# find(pattern)

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

s.str.find('e')

0   -1
1   -1
2   -1
3    3
dtype: int64

In [24]:
# findall(pattern)

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

s.str.findall('e')

0     []
1     []
2     []
3    [e]
dtype: object
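Note that `find()` searches for a literal substring, while `findall()` and `count()` interpret their argument as a regular expression. A minimal sketch with a made-up Series, extracting capital letters and counting lowercase vowels:

```python
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'SteveSmith'])

# All capital letters in each string, as a list per element.
capitals = s.str.findall(r'[A-Z]')

# Number of lowercase vowels in each string.
vowel_counts = s.str.count(r'[aeiou]')

print(capitals)
print(vowel_counts)
```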

In [25]:
# swapcase()

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
s.str.swapcase()

0             tOM
1    wILLIAM rICK
2            jOHN
3         aLBER@T
dtype: object

In [26]:
# islower()

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
s.str.islower()


0    False
1    False
2    False
3    False
dtype: bool

In [27]:
# isupper()

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

s.str.isupper()

0    False
1    False
2    False
3    False
dtype: bool

In [28]:


# isnumeric()

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

s.str.isnumeric()

0    False
1    False
2    False
3    False
dtype: bool
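Every check above returns False because the sample names mix upper and lower case and contain no digits. A minimal sketch with a made-up Series that covers each case makes the differences visible:

```python
import pandas as pd

s = pd.Series(['tom', 'JOHN', 'Alber@t', '1234', '12.5'])

is_lower = s.str.islower()      # True only for 'tom'
is_upper = s.str.isupper()      # True only for 'JOHN'
is_numeric = s.str.isnumeric()  # True only for '1234'; the '.' in '12.5' is not numeric

print(pd.DataFrame({'value': s, 'islower': is_lower,
                    'isupper': is_upper, 'isnumeric': is_numeric}))
```

Strings without any cased characters, such as `'1234'`, are neither lower nor upper case.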

In [29]:
# Import the data

import os
cwd = os.getcwd()
cwd
# Print the current working directory
print("Current working directory: {0}".format(cwd))


path = 'C:\\Users\\oscar\\Desktop\\GitHub\\Python\\Jupyter Notebook\\Data Manipulation\\Modificar variables String'
# Change the current working directory
os.chdir(path)
# Let's check the new directory 
print("Current working directory: {0}".format(os.getcwd()))

import pandas as pd

employee = pd.read_excel('testData.xlsx', sheet_name= 'employee')

employee.head(3)

Current working directory: c:\Users\oscar\Desktop\GitHub\Python\Jupyter Notebook\Data Manipulation\Modificar variables String
Current working directory: c:\Users\oscar\Desktop\GitHub\Python\Jupyter Notebook\Data Manipulation\Modificar variables String


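|   | employee_id | transport_expense | distance | age | education | sons | pet | gender | disciplinary_faults | transportation_method |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 235 | 11 | 37 | 3 | 1 | 1 | F | 3 | public_transportation |
| 1 | 2 | 235 | 29 | 48 | 1 | 1 | 5 | M | 4 | public_transportation |
| 2 | 3 | 179 | 51 | 38 | 1 | 0 | 0 | M | 5 | bicycle |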


In [32]:
employee.gender.str.lower().head(3)

0    f
1    m
2    m
Name: gender, dtype: object

The same approach works for any text column of a DataFrame.
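As a sketch of how this looks on DataFrame columns, the example below builds a small stand-in DataFrame by hand instead of reading `testData.xlsx`. It copies three columns from the rows shown above and writes the transformed strings back to the frame:

```python
import pandas as pd

# Stand-in for the employee sheet: only three columns, values taken from the preview above.
employee = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'gender': ['F', 'M', 'M'],
    'transportation_method': ['public_transportation',
                              'public_transportation',
                              'bicycle'],
})

# Overwrite the column with its lower-case version.
employee['gender'] = employee['gender'].str.lower()

# String methods can be chained: replace underscores, then capitalize each word.
employee['transportation_method'] = (
    employee['transportation_method']
    .str.replace('_', ' ', regex=False)
    .str.title()
)

print(employee)
```

Assigning the result back to the column is what makes the change persistent. Calling `.str.lower()` on its own only returns a new Series.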