## Data Wrangling: Clean, Transform, Merge, Reshape

In [1]:
import pandas as pd
import numpy as np

## Combining and merging data sets

### Database-style DataFrame merges

In [2]:
df1=pd.DataFrame({'data1':range(5,12),'key':list('bbacaab')})
df2=pd.DataFrame({'data2':range(56,59),'key':list('abd')})

By default, .merge() performs an [inner join](https://www.w3schools.com/sql/sql_join.asp) between the DataFrames, using the columns they have in common as keys. The how argument selects the type of join:
- RIGHT
- LEFT
- INNER
- OUTER
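As a rough illustration of the four join types, the sketch below rebuilds df1 and df2 from this notebook and counts the rows each value of how produces. The variable name rows_per_join is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df2 = pd.DataFrame({'data2': range(56, 59), 'key': list('abd')})

# Number of rows in the result for each join type.
rows_per_join = {
    how: len(df1.merge(df2, on='key', how=how))
    for how in ('inner', 'left', 'right', 'outer')
}
print(rows_per_join)
# {'inner': 6, 'left': 7, 'right': 7, 'outer': 8}
```

The inner join drops the keys 'c' and 'd', which appear on only one side. The left join keeps 'c', the right join keeps 'd', and the outer join keeps both.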

In [3]:
df1

Unnamed: 0,data1,key
0,5,b
1,6,b
2,7,a
3,8,c
4,9,a
5,10,a
6,11,b


In [4]:
df2

Unnamed: 0,data2,key
0,56,a
1,57,b
2,58,d


In [5]:
df1.merge(df2,how='outer') # with how='outer', values missing on either side are filled with NaN

Unnamed: 0,data1,key,data2
0,5.0,b,57.0
1,6.0,b,57.0
2,11.0,b,57.0
3,7.0,a,56.0
4,9.0,a,56.0
5,10.0,a,56.0
6,8.0,c,
7,,d,58.0


A merge returns the Cartesian product of the rows that share a key: if a key is duplicated on both sides, the result contains every possible combination of the matching rows:

In [6]:
df3=pd.DataFrame({'data2':range(56,61),'key':list('abdbd')})
df1.merge(df3) # with duplicated keys on both sides, ALL possible combinations are produced

Unnamed: 0,data1,key,data2
0,5,b,57
1,5,b,59
2,6,b,57
3,6,b,59
4,11,b,57
5,11,b,59
6,7,a,56
7,9,a,56
8,10,a,56
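The size of such a merge can be predicted before running it. The sketch below rebuilds df1 and df3 from this notebook, multiplies the per-key counts of both sides, and then uses the validate argument of merge to show one possible way of guarding against unexpected duplicates. The names expected_rows and unique_right_keys are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df3 = pd.DataFrame({'data2': range(56, 61), 'key': list('abdbd')})

# For every key present on both sides, the inner join yields
# (occurrences on the left) * (occurrences on the right) rows.
left_counts = df1['key'].value_counts()
right_counts = df3['key'].value_counts()
expected_rows = int(left_counts.mul(right_counts).dropna().sum())

merged = df1.merge(df3, on='key')
print(expected_rows, len(merged))

# validate='many_to_one' asks pandas to check that the right-hand keys
# are unique; 'b' and 'd' are repeated in df3, so the check fails.
try:
    df1.merge(df3, on='key', validate='many_to_one')
    unique_right_keys = True
except pd.errors.MergeError:
    unique_right_keys = False
print(unique_right_keys)
```

Here 'b' contributes 3 × 2 rows and 'a' contributes 3 × 1, which accounts for the 9 rows shown above.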


If the columns to join on don't have the same name, or we want to join on the index of the DataFrames, we'll need to specify that.

In [7]:
df4=pd.DataFrame({'data2':range(56,61),'rkey':list('abdbd')})
df1.merge(df4,left_on='key',right_on='rkey')

Unnamed: 0,data1,key,data2,rkey
0,5,b,57,b
1,5,b,59,b
2,6,b,57,b
3,6,b,59,b
4,11,b,57,b
5,11,b,59,b
6,7,a,56,a
7,9,a,56,a
8,10,a,56,a


If there are two columns with the same name that we do not join on, both will get transferred to the resulting DataFrame with a suffix. We can customize these suffixes.

In [8]:
df1.merge(df2,left_on='data1',right_on='data2',how='outer')

Unnamed: 0,data1,key_x,data2,key_y
0,5.0,b,,
1,6.0,b,,
2,7.0,a,,
3,8.0,c,,
4,9.0,a,,
5,10.0,a,,
6,11.0,b,,
7,,,56.0,a
8,,,57.0,b
9,,,58.0,d


In [9]:
df1.merge(df2,left_on='data1',right_on='data2',how='outer',suffixes=('_customer','_order'))

Unnamed: 0,data1,key_customer,data2,key_order
0,5.0,b,,
1,6.0,b,,
2,7.0,a,,
3,8.0,c,,
4,9.0,a,,
5,10.0,a,,
6,11.0,b,,
7,,,56.0,a
8,,,57.0,b
9,,,58.0,d


### Merging on index

In [10]:
df4.index=range(5,10)
df4

Unnamed: 0,data2,rkey
5,56,a
6,57,b
7,58,d
8,59,b
9,60,d


In [11]:
df1

Unnamed: 0,data1,key
0,5,b
1,6,b
2,7,a
3,8,c
4,9,a
5,10,a
6,11,b


In [12]:
df1.merge(df4,left_on='data1',right_index=True) # here we match a column of df1 against the index of df4

Unnamed: 0,data1,key,data2,rkey
0,5,b,56,a
1,6,b,57,b
2,7,a,58,d
3,8,c,59,b
4,9,a,60,d
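The same column-against-index match can probably be written more compactly with DataFrame.join. The sketch below rebuilds df1 and df4 (with the index assigned above) and compares both spellings. Note that join defaults to a left join, so how='inner' is passed explicitly to mirror the merge.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df4 = pd.DataFrame({'data2': range(56, 61), 'rkey': list('abdbd')},
                   index=range(5, 10))

# Match the values of df1['data1'] against the index of df4.
merged = df1.merge(df4, left_on='data1', right_index=True)

# Equivalent spelling with join; on= names the column of the caller.
joined = df1.join(df4, on='data1', how='inner')

same_values = merged.reset_index(drop=True).equals(
    joined.reset_index(drop=True))
print(same_values)
print(joined)
```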


### Concatenating along an axis

In [13]:
pd.concat([df1,df2]) # by default, concat stacks the rows (axis=0), aligning on column names


Unnamed: 0,data1,data2,key
0,5.0,,b
1,6.0,,b
2,7.0,,a
3,8.0,,c
4,9.0,,a
5,10.0,,a
6,11.0,,b
0,,56.0,a
1,,57.0,b
2,,58.0,d
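Note that the stacked result above keeps the original row labels, so 0, 1 and 2 appear twice. A minimal sketch of one way to get a fresh index, using the ignore_index argument of concat on the same df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df2 = pd.DataFrame({'data2': range(56, 59), 'key': list('abd')})

# Plain stacking keeps the labels of each input, so some repeat.
stacked = pd.concat([df1, df2], sort=False)
print(stacked.index.is_unique)      # False

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([df1, df2], ignore_index=True, sort=False)
print(renumbered.index.tolist())    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```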


To place them side by side instead, aligning the rows on the index, we have to specify the axis:

In [14]:
pd.concat([df1,df2],axis=1) # axis=1 places the DataFrames side by side, aligning on the index

Unnamed: 0,data1,key,data2,key.1
0,5,b,56.0,a
1,6,b,57.0,b
2,7,a,58.0,d
3,8,c,,
4,9,a,,
5,10,a,,
6,11,b,,


In [15]:
pd.concat([df1,df4],axis=1)

Unnamed: 0,data1,key,data2,rkey
0,5.0,b,,
1,6.0,b,,
2,7.0,a,,
3,8.0,c,,
4,9.0,a,,
5,10.0,a,56.0,a
6,11.0,b,57.0,b
7,,,58.0,d
8,,,59.0,b
9,,,60.0,d


In [16]:
pd.concat([df1['data1'],df2['data2'],df3['data2']],axis=1,keys=['x','y','z']) # keys lets us name the resulting columns as we like

Unnamed: 0,x,y,z
0,5,56.0,56.0
1,6,57.0,57.0
2,7,58.0,58.0
3,8,,59.0
4,9,,60.0
5,10,,
6,11,,
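The keys argument also works when stacking rows. In that case, as sketched below with the notebook's df1 and df2, the keys become an outer index level that records which input each row came from. The labels 'left' and 'right' are arbitrary choices for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df2 = pd.DataFrame({'data2': range(56, 59), 'key': list('abd')})

# With axis=0 (the default), keys build a two-level row index.
labelled = pd.concat([df1, df2], keys=['left', 'right'], sort=False)
print(labelled.index.nlevels)       # 2

# Selecting one key recovers the rows that came from that input.
from_right = labelled.loc['right']
print(from_right)
```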


#### Digression

Attention! Be careful not to assign to the name of an existing function or module attribute: you will overwrite it, and the original object will no longer be reachable under that name.

In [17]:
pd.concat=[df1,df2]

In [18]:
pd.concat([df1,df2])

TypeError: 'list' object is not callable

In [19]:
del(pd.concat)

In [20]:
pd.concat()

AttributeError: module 'pandas' has no attribute 'concat'

You can delete the overwritten name, but that does not bring back the original value. If it was an object or function from a module, you'll need to reload() the module, because importing a module that is already loaded does not execute it again. reload() is also useful when you are actively developing your own module and want to load the latest definition of a function into memory.

In [None]:
from importlib import reload

In [None]:
reload(pd)
pd.concat([df1,df2])
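The same sequence can be tried safely on a small pure-Python module. The sketch below uses the standard library's json module as a stand-in for pandas (an assumption made only to keep the example light). It overwrites json.dumps, deletes it, and restores it with reload().

```python
import json
from importlib import reload

# Overwrite a module attribute by mistake.
json.dumps = 'oops'
try:
    json.dumps([1, 2])
except TypeError as exc:
    print(exc)                      # 'str' object is not callable

# Deleting the name does not bring the function back.
del json.dumps
has_dumps_after_del = hasattr(json, 'dumps')
print(has_dumps_after_del)          # False

# Re-executing the module redefines the original function.
reload(json)
restored = json.dumps([1, 2])
print(restored)                     # [1, 2]
```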

## Data transformation

### Removing duplicates

In [50]:
df1.duplicated() # tells us whether each row repeats an earlier row

0    False
1    False
2    False
3    False
4    False
5    False
6    False
dtype: bool

In [51]:
df1['key'].duplicated()

0    False
1     True
2    False
3    False
4     True
5     True
6     True
Name: key, dtype: bool

In [52]:
df1['key'].drop_duplicates()

0    b
2    a
3    c
Name: key, dtype: object

In [54]:
df1.drop_duplicates(subset='key')

Unnamed: 0,data1,key
0,5,b
2,7,a
3,8,c


In [56]:
df1.drop_duplicates(subset='key',keep='last') # keeps the last occurrence of each duplicated key

Unnamed: 0,data1,key
3,8,c
5,10,a
6,11,b
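Both duplicated() and drop_duplicates() accept keep=False, which treats every member of a repeated group as a duplicate rather than sparing the first or last one. A minimal sketch on the notebook's df1, where the name in_repeated_group is illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})

# keep=False marks every row whose key occurs more than once.
in_repeated_group = df1['key'].duplicated(keep=False)

# Rows whose key is repeated somewhere in the column ('a' and 'b').
print(df1[in_repeated_group])

# Rows whose key is unique in the column (only 'c').
unique_rows = df1[~in_repeated_group]
print(unique_rows)
```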


### Renaming axis indexes

In [57]:
df1.index=list('bbkdsdf')
df1

Unnamed: 0,data1,key
b,5,b
b,6,b
k,7,a
d,8,c
s,9,a
d,10,a
f,11,b
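Assigning to .index replaces all labels at once and modifies df1 in place. As a possible alternative, rename() returns a new DataFrame and accepts either a function or a mapping. The sketch below assumes the relabelled df1 shown above, and the new names 'value' and 'kiwi' are invented for the example.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')},
                   index=list('bbkdsdf'))

# A function is applied to every label; a dict renames selected ones.
renamed = df1.rename(index=str.upper, columns={'data1': 'value'})
print(renamed)

# Only the label 'k' changes; labels missing from the dict are kept.
relabelled = df1.rename(index={'k': 'kiwi'})
print(relabelled.index.tolist())

# The original DataFrame is left untouched.
print(df1.index.tolist())
```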


In [None]:
pd.concat

### Discretization and binning

In [60]:
np.random.seed(42)
ages= pd.Series(np.random.randint(9,99,50))
ages

0     60
1     23
2     80
3     69
4     29
5     91
6     95
7     83
8     83
9     96
10    32
11    11
12    30
13    61
14    10
15    96
16    38
17    46
18    10
19    72
20    68
21    29
22    41
23    84
24    66
25    30
26    97
27    57
28    67
29    50
30    68
31    88
32    23
33    70
34    70
35    55
36    70
37    59
38    63
39    72
40    11
41    59
42    15
43    29
44    81
45    47
46    26
47    12
48    97
49    68
dtype: int64

In [68]:
limits=[14,18,35,50,65,100]

We can sort the values of the Series into the intervals defined by a list of bin edges (limits in this case). A value that does not fall into any interval becomes NaN.

In [74]:
categorized=pd.cut(ages,bins=limits,include_lowest=True) # include_lowest=True makes the first interval include its left edge
categorized

0       (50.0, 65.0]
1       (18.0, 35.0]
2      (65.0, 100.0]
3      (65.0, 100.0]
4       (18.0, 35.0]
5      (65.0, 100.0]
6      (65.0, 100.0]
7      (65.0, 100.0]
8      (65.0, 100.0]
9      (65.0, 100.0]
10      (18.0, 35.0]
11               NaN
12      (18.0, 35.0]
13      (50.0, 65.0]
14               NaN
15     (65.0, 100.0]
16      (35.0, 50.0]
17      (35.0, 50.0]
18               NaN
19     (65.0, 100.0]
20     (65.0, 100.0]
21      (18.0, 35.0]
22      (35.0, 50.0]
23     (65.0, 100.0]
24     (65.0, 100.0]
25      (18.0, 35.0]
26     (65.0, 100.0]
27      (50.0, 65.0]
28     (65.0, 100.0]
29      (35.0, 50.0]
30     (65.0, 100.0]
31     (65.0, 100.0]
32      (18.0, 35.0]
33     (65.0, 100.0]
34     (65.0, 100.0]
35      (50.0, 65.0]
36     (65.0, 100.0]
37      (50.0, 65.0]
38      (50.0, 65.0]
39     (65.0, 100.0]
40               NaN
41      (50.0, 65.0]
42    (13.999, 18.0]
43      (18.0, 35.0]
44     (65.0, 100.0]
45      (35.0, 50.0]
46      (18.0, 35.0]
47               NaN
48     (65.0, 100.0]
49     (65.0, 100.0]

In [73]:
ages[categorized.isna()]

11    11
14    10
18    10
40    11
47    12
dtype: int64

In [75]:
categorized.value_counts()

(65.0, 100.0]     23
(18.0, 35.0]       9
(50.0, 65.0]       7
(35.0, 50.0]       5
(13.999, 18.0]     1
dtype: int64
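To make the edge behaviour easier to see, the sketch below runs pd.cut on a handful of hand-picked ages that sit on or near the bin edges, using the same limits. The sample values and the label strings are assumptions made for illustration and are not part of the notebook's data.

```python
import pandas as pd

limits = [14, 18, 35, 50, 65, 100]
sample = pd.Series([10, 14, 18, 35, 36, 70])

# Intervals are open on the left by default, so 14 itself is not binned.
default = pd.cut(sample, bins=limits)
print(default.isna().tolist())

# include_lowest=True extends the first interval to contain 14.
with_lowest = pd.cut(sample, bins=limits, include_lowest=True)
print(with_lowest.isna().tolist())

# labels= replaces the interval notation with readable names.
names = ['14-18', '19-35', '36-50', '51-65', '66-100']
labelled = pd.cut(sample, bins=limits, labels=names, include_lowest=True)
print(labelled.tolist())
```

The value 10 lies below the first edge, so it stays NaN in all three results.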

## String manipulation

### String object methods

In [77]:
bichos=pd.Series(np.random.choice(['Mantis Shrimp','Naked Mole Rat','Star Nosed Mole'],15))
bichos

0      Naked Mole Rat
1      Naked Mole Rat
2      Naked Mole Rat
3      Naked Mole Rat
4      Naked Mole Rat
5      Naked Mole Rat
6       Mantis Shrimp
7     Star Nosed Mole
8      Naked Mole Rat
9      Naked Mole Rat
10     Naked Mole Rat
11     Naked Mole Rat
12     Naked Mole Rat
13     Naked Mole Rat
14    Star Nosed Mole
dtype: object

In [78]:
bichos.str.upper()

0      NAKED MOLE RAT
1      NAKED MOLE RAT
2      NAKED MOLE RAT
3      NAKED MOLE RAT
4      NAKED MOLE RAT
5      NAKED MOLE RAT
6       MANTIS SHRIMP
7     STAR NOSED MOLE
8      NAKED MOLE RAT
9      NAKED MOLE RAT
10     NAKED MOLE RAT
11     NAKED MOLE RAT
12     NAKED MOLE RAT
13     NAKED MOLE RAT
14    STAR NOSED MOLE
dtype: object

In [79]:
bichos.str.len()

0     14
1     14
2     14
3     14
4     14
5     14
6     13
7     15
8     14
9     14
10    14
11    14
12    14
13    14
14    15
dtype: int64

In [83]:
bichos.str.split()

0      [Naked, Mole, Rat]
1      [Naked, Mole, Rat]
2      [Naked, Mole, Rat]
3      [Naked, Mole, Rat]
4      [Naked, Mole, Rat]
5      [Naked, Mole, Rat]
6        [Mantis, Shrimp]
7     [Star, Nosed, Mole]
8      [Naked, Mole, Rat]
9      [Naked, Mole, Rat]
10     [Naked, Mole, Rat]
11     [Naked, Mole, Rat]
12     [Naked, Mole, Rat]
13     [Naked, Mole, Rat]
14    [Star, Nosed, Mole]
dtype: object

In [81]:
bichos.str[:6]

0     Naked 
1     Naked 
2     Naked 
3     Naked 
4     Naked 
5     Naked 
6     Mantis
7     Star N
8     Naked 
9     Naked 
10    Naked 
11    Naked 
12    Naked 
13    Naked 
14    Star N
dtype: object

In [89]:
bichos.str.split().str[-1]

0        Rat
1        Rat
2        Rat
3        Rat
4        Rat
5        Rat
6     Shrimp
7       Mole
8        Rat
9        Rat
10       Rat
11       Rat
12       Rat
13       Rat
14      Mole
dtype: object

### Vectorized string functions in pandas

[Vectorized string functions in pandas](https://pandas.pydata.org/pandas-docs/stable/text.html) are grouped within the .str attribute of Series and Indexes. They have the same names as the regular Python string functions, but work on Series of strings.
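A brief sketch of a few more .str methods follows, on a small Series that includes a missing value to show how missing entries are handled. The Series contents are chosen for illustration only.

```python
import pandas as pd

animals = pd.Series(['Mantis Shrimp', 'Naked Mole Rat', None])

# contains() tests each element; na=False treats missing values as False.
has_mole = animals.str.contains('Mole', na=False)
print(has_mole.tolist())            # [False, True, False]

# Methods can be chained; the missing value stays missing.
snake_case = animals.str.lower().str.replace(' ', '_', regex=False)
print(snake_case.tolist())

# expand=True spreads the split words over the columns of a DataFrame.
words = animals.str.split(expand=True)
print(words)
```

Passing regex=False to replace() requests a literal replacement, which avoids depending on the default of that argument.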