# Data Wrangling - Data Surgery

**Data wrangling**, sometimes called **data munging**, is the process of transforming and mapping data from a *raw* dataset into another format, with the aim of making it more appropriate and valuable for downstream purposes such as analysis. A **data wrangler** is a person who performs these transformations.

Downstream purposes may include further munging, data visualization, data aggregation, and training a statistical model, among many other potential uses. As a process, data wrangling generally follows a set of steps: extract the data in raw form from the data source, process the raw data with algorithms (for example, sorting) or parse it into predefined data structures, and finally deposit the resulting content in a storage system (or silo) for future use.
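
The three stages described above (extract, transform, store) can be sketched in a few lines. This is only a minimal illustration: the `raw_csv` string and its column names are hypothetical stand-ins for a real data source, and an in-memory buffer plays the role of the storage system.

```python
import io
import pandas as pd

# Hypothetical raw data standing in for a real source file.
raw_csv = "name,minutes\nana,12.5\nbob,\ncarl,30.0\n"

# 1. Extract: read the raw data as-is.
raw = pd.read_csv(io.StringIO(raw_csv))

# 2. Transform: drop incomplete rows and sort by minutes, largest first.
clean = raw.dropna(subset=["minutes"]).sort_values("minutes", ascending=False)

# 3. Store: write the cleaned result to a sink (here, an in-memory buffer).
sink = io.StringIO()
clean.to_csv(sink, index=False)
print(sink.getvalue())
```

In a real project the sink would more likely be a file or a database table, but the shape of the pipeline stays the same.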

In [1]:
%config IPCompleter.greedy=True

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("../datasets/customer-churn-model/Customer Churn Model.txt")

In [4]:
df.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


# Creating a subset of the data

In [5]:
account_length = df["Account Length"]

In [6]:
account_length.head()

0    128
1    107
2    137
3     84
4     75
Name: Account Length, dtype: int64

In [7]:
type(account_length)

pandas.core.series.Series

In [8]:
subset = df[["Account Length", "Phone", "Eve Charge", "Day Calls"]]

In [9]:
subset.head()

Unnamed: 0,Account Length,Phone,Eve Charge,Day Calls
0,128,382-4657,16.78,110
1,107,371-7191,16.62,123
2,137,358-1921,10.3,114
3,84,375-9999,5.26,71
4,75,330-6626,12.61,113


In [10]:
type(subset)

pandas.core.frame.DataFrame
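
The two `type(...)` results above differ because of the brackets: a single column label returns a `Series`, while a list of labels returns a `DataFrame`, even when the list holds only one label. A small sketch with a hypothetical two-row frame (`toy` is not part of the churn dataset):

```python
import pandas as pd

# Hypothetical toy frame reusing two column names from the dataset.
toy = pd.DataFrame({
    "Account Length": [128, 107],
    "Phone": ["382-4657", "371-7191"],
})

as_series = toy["Account Length"]    # single label -> Series
as_frame = toy[["Account Length"]]   # list of labels -> DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```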

In [11]:
desired_columns = ["Account Length", "Phone", "Eve Charge", "Night Calls"]
subset = df[desired_columns]
subset.head()

Unnamed: 0,Account Length,Phone,Eve Charge,Night Calls
0,128,382-4657,16.78,91
1,107,371-7191,16.62,103
2,137,358-1921,10.3,104
3,84,375-9999,5.26,89
4,75,330-6626,12.61,121


In [12]:
desired_columns = ["Account Length", "VMail Message", "Day Calls"]
desired_columns

['Account Length', 'VMail Message', 'Day Calls']

In [13]:
all_columns_list = df.columns.values.tolist()
all_columns_list

['State',
 'Account Length',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']

In [14]:
sublist = [x for x in all_columns_list if x not in desired_columns]
sublist

['State',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'Day Mins',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']

In [15]:
subset = df[sublist]
subset.shape

(3333, 18)

In [16]:
a = set(desired_columns)
b = set(all_columns_list)
sublist = b-a
sublist = list(sublist)
sublist.sort()

In [17]:
subset = df[sublist]
subset.shape

(3333, 18)
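
Both approaches above build the list of columns to keep by hand. As a possible alternative, pandas can exclude columns directly with `DataFrame.drop(columns=...)` or `Index.difference(...)`. The sketch below uses a hypothetical one-row frame; note that `drop` preserves the original column order, whereas `difference` returns the labels sorted, much like the `set` approach followed by `sort()`.

```python
import pandas as pd

# Hypothetical toy frame; the column names are placeholders.
toy = pd.DataFrame({"A": [1], "B": [2], "C": [3], "D": [4]})
unwanted = ["B", "D"]

# Drop the unwanted columns, keeping the original column order.
by_drop = toy.drop(columns=unwanted)

# Set difference on the column index; the result is sorted by label.
by_difference = toy[toy.columns.difference(unwanted)]

print(list(by_drop.columns))
print(list(by_difference.columns))
```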

In [18]:
df[0:25].shape

(25, 21)

In [19]:
df[10:35].shape

(25, 21)

In [20]:
df[:9].shape

(9, 21)

In [21]:
df[3319:].shape

(14, 21)

In [22]:
## Users with Day Mins > 300
df2 = df[df["Day Mins"]>300]
df2.shape

(43, 21)

In [23]:
## Users from New York (State = "NY")
df3 = df[df["State"] == "NY"]
df3.shape

(83, 21)

In [24]:
## Users who are from NY and whose day minutes exceed 300
df3 = df[(df["State"] == "NY") & (df["Day Mins"] > 300)]
df3.shape

(2, 21)

In [25]:
## Users who are from NY or whose day minutes exceed 300
df4 = df[(df["State"] == "NY") | (df["Day Mins"] > 300)]
df4.shape

(124, 21)
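
The filters above rely on boolean masks: each comparison yields a `Series` of `True`/`False` values, and `&`, `|` and `~` combine masks element by element. The parentheses around each comparison are needed because `&` and `|` bind more tightly than `==` and `>`. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical toy data reusing two column names from the dataset.
toy = pd.DataFrame({
    "State": ["NY", "NY", "OH", "KS"],
    "Day Mins": [310.0, 120.5, 305.2, 99.0],
})

is_ny = toy["State"] == "NY"     # boolean mask, one value per row
heavy = toy["Day Mins"] > 300

both = toy[is_ny & heavy]        # AND
either = toy[is_ny | heavy]      # OR
neither = toy[~(is_ny | heavy)]  # NOT

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
assert len(either) == is_ny.sum() + heavy.sum() - len(both)
```

The same identity is a handy sanity check on the counts shown above: 83 + 43 - 2 = 124.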

In [26]:
df5 = df[(df["Day Calls"] < df["Night Calls"])]
df5.shape

(1626, 21)

In [27]:
df5 = df[(df["Day Mins"] < df["Night Mins"])]
df5.shape

(2051, 21)

In [28]:
## Day minutes, night minutes and account length of the first 50 individuals
sub_first_50 = df[["Day Mins", "Night Mins", "Account Length"]][:50]
sub_first_50.head()

Unnamed: 0,Day Mins,Night Mins,Account Length
0,265.1,244.7,128
1,161.6,254.4,107
2,243.4,162.6,137
3,299.4,196.9,84
4,166.7,186.9,75


In [29]:
df.iloc[0:10, 0:6]

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan
0,KS,128,415,382-4657,no,yes
1,OH,107,415,371-7191,no,yes
2,NJ,137,415,358-1921,no,no
3,OH,84,408,375-9999,yes,no
4,OK,75,415,330-6626,yes,no
5,AL,118,510,391-8027,yes,no
6,MA,121,510,355-9993,no,yes
7,MO,147,415,329-9001,yes,no
8,LA,117,408,335-4719,no,no
9,WV,141,415,330-8173,yes,yes


In [30]:
df.iloc[:,3:6] ## All rows, columns 3 through 5
df.iloc[0:10,:] ## All columns, rows 0 through 9

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.
5,AL,118,510,391-8027,yes,no,0,223.4,98,37.98,...,101,18.75,203.9,118,9.18,6.3,6,1.7,0,False.
6,MA,121,510,355-9993,no,yes,24,218.2,88,37.09,...,108,29.62,212.6,118,9.57,7.5,7,2.03,3,False.
7,MO,147,415,329-9001,yes,no,0,157.0,79,26.69,...,94,8.76,211.8,96,9.53,7.1,6,1.92,0,False.
8,LA,117,408,335-4719,no,no,0,184.5,97,31.37,...,80,29.89,215.8,90,9.71,8.7,4,2.35,1,False.
9,WV,141,415,330-8173,yes,yes,37,258.6,84,43.96,...,111,18.87,326.4,97,14.69,11.2,5,3.02,0,False.


In [31]:
df.iloc[0:10, [2,5,7]]

Unnamed: 0,Area Code,VMail Plan,Day Mins
0,415,yes,265.1
1,415,yes,161.6
2,415,no,243.4
3,408,no,299.4
4,415,no,166.7
5,510,no,223.4
6,510,yes,218.2
7,415,no,157.0
8,408,no,184.5
9,415,yes,258.6


In [32]:
df.iloc[[1,2,3], [1,2]]

Unnamed: 0,Account Length,Area Code
1,107,415
2,137,415
3,84,408


In [33]:
df.loc[[1,5,8,36], ["Area Code", "VMail Plan"]]

Unnamed: 0,Area Code,VMail Plan
1,415,yes
5,510,no
8,408,no
36,408,yes
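
With the default integer index, `iloc` and `loc` can look interchangeable, but `iloc` selects by position and `loc` selects by label. The difference shows up as soon as the index labels are not 0, 1, 2, ... and in how slices treat the end point. A short sketch with a hypothetical frame whose labels start at 100:

```python
import pandas as pd

# Hypothetical frame with index labels that differ from the positions.
toy = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_position = toy.iloc[0:2]   # positions 0 and 1 (end excluded)
by_label = toy.loc[100:102]   # labels 100 through 102 (end included)

print(by_position["x"].tolist())  # [10, 20]
print(by_label["x"].tolist())     # [10, 20, 30]
```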


In [34]:
df["Total Mins"] = df["Day Mins"] + df["Night Mins"] + df["Eve Mins"]
df.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?,Total Mins
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,16.78,244.7,91,11.01,10.0,3,2.7,1,False.,707.2
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,16.62,254.4,103,11.45,13.7,3,3.7,1,False.,611.5
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,10.3,162.6,104,7.32,12.2,5,3.29,0,False.,527.2
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,5.26,196.9,89,8.86,6.6,7,1.78,2,False.,558.2
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,12.61,186.9,121,8.41,10.1,3,2.73,3,False.,501.9


In [35]:
df["Total Calls"] = df["Day Calls"] + df["Night Calls"] + df["Eve Calls"]
df.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?,Total Mins,Total Calls
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,244.7,91,11.01,10.0,3,2.7,1,False.,707.2,300
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,254.4,103,11.45,13.7,3,3.7,1,False.,611.5,329
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,162.6,104,7.32,12.2,5,3.29,0,False.,527.2,328
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,196.9,89,8.86,6.6,7,1.78,2,False.,558.2,248
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,186.9,121,8.41,10.1,3,2.73,3,False.,501.9,356


# Random number generation

In [36]:
import numpy as np

In [37]:
## Generate a random integer from 1 to 99 (the upper bound, 100, is excluded)
np.random.randint(1,100)

38

In [38]:
## The most classic way to generate a random number is a float in the interval [0, 1)
np.random.random()

0.7618256492608186

In [39]:
## Function that generates a list of n random integers in the half-open interval [a, b)
def randint_list(n, a, b):
    x = []
    for i in range(n):
        x.append(np.random.randint(a, b))
    return x

In [40]:
randint_list(25, 1, 50)

[3,
 10,
 3,
 24,
 5,
 16,
 7,
 36,
 34,
 40,
 48,
 1,
 26,
 36,
 41,
 16,
 31,
 46,
 31,
 4,
 43,
 21,
 43,
 1,
 6]
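
The loop in `randint_list` can likely be replaced by a single call, since `np.random.randint` accepts a `size` argument. The sketch below uses a hypothetical helper name, `randint_array`, and returns a NumPy array rather than a list; as with the original function, the upper bound `b` is excluded.

```python
import numpy as np

def randint_array(n, a, b):
    """Return n random integers drawn from the half-open interval [a, b)."""
    return np.random.randint(a, b, size=n)

sample = randint_array(25, 1, 50)
print(sample.min() >= 1, sample.max() <= 49)  # True True
```

If a plain Python list is needed, `sample.tolist()` converts the array.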

In [41]:
import random

In [42]:
for i in range(10):
    print(random.randrange(0, 100, 7))

63
98
98
77
0
21
70
63
98
77
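
Every value printed above is a multiple of 7 because `random.randrange(0, 100, 7)` draws from the same values that `range(0, 100, 7)` would produce. A brief sketch that makes the candidate set explicit (the variable names are illustrative only):

```python
import random

# The values randrange(0, 100, 7) can return: 0, 7, 14, ..., 98.
candidates = list(range(0, 100, 7))

draws = [random.randrange(0, 100, 7) for _ in range(10)]

print(len(candidates))                       # 15
print(all(d in candidates for d in draws))   # True
```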


# Shuffling

In [43]:
a = np.arange(100)
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [44]:
np.random.shuffle(a)
a

array([76, 54, 30, 57, 85, 31, 26, 80, 47, 71, 95, 38, 81,  4, 60,  8, 37,
       97, 35, 42, 46, 96, 83, 89, 92, 34, 86, 65, 67, 19, 59, 73, 72,  3,
       91,  5, 49, 18, 17, 63, 29, 32, 84, 23, 64, 82, 99, 36, 79, 44, 22,
       50, 39,  9, 94, 74, 11, 88, 25,  2, 69, 15, 58, 16, 52, 33, 41, 48,
       56, 61, 62, 28, 78, 12, 24,  0, 90, 51, 21, 45,  1, 68, 75, 10, 55,
       93, 13, 66, 87, 77,  7, 27, 14,  6, 70, 43, 20, 98, 53, 40])
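
Note that `np.random.shuffle` reorders the array in place and returns `None`. When the original order must be kept, `np.random.permutation` returns a shuffled copy instead. The same idea carries over to DataFrame rows with `sample(frac=1)`. A minimal sketch, using a hypothetical five-row frame:

```python
import numpy as np
import pandas as pd

original = np.arange(10)
shuffled_copy = np.random.permutation(original)  # original is left untouched

# Hypothetical toy frame; sample(frac=1) returns all rows in random order.
toy = pd.DataFrame({"id": range(5)})
shuffled_rows = toy.sample(frac=1).reset_index(drop=True)

print(original)
print(shuffled_copy)
print(shuffled_rows["id"].tolist())
```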

In [45]:
df.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?,Total Mins,Total Calls
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,244.7,91,11.01,10.0,3,2.7,1,False.,707.2,300
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,254.4,103,11.45,13.7,3,3.7,1,False.,611.5,329
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,162.6,104,7.32,12.2,5,3.29,0,False.,527.2,328
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,196.9,89,8.86,6.6,7,1.78,2,False.,558.2,248
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,186.9,121,8.41,10.1,3,2.73,3,False.,501.9,356


In [46]:
df.shape

(3333, 23)

In [47]:
column_list = df.columns.values.tolist()
column_list

['State',
 'Account Length',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?',
 'Total Mins',
 'Total Calls']

In [48]:
np.random.choice(column_list)

"Int'l Plan"

# Seed

In [49]:
np.random.seed(2018)

In [62]:
for i in range(5):
    print(np.random.random())


0.8823493117539459
0.10432773786047767
0.9070093335163405
0.3063988986063515
0.446408872427422
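
Setting the seed makes the sequence reproducible: resetting the same seed replays the same numbers. The sketch below checks this for the legacy `np.random.seed` interface used above and, as an aside, for the newer `np.random.default_rng` generator (available in NumPy 1.17 and later), which keeps its state in an object rather than globally.

```python
import numpy as np

# Legacy global-state interface, as used in the cell above.
np.random.seed(2018)
first_run = [np.random.random() for _ in range(3)]

np.random.seed(2018)
second_run = [np.random.random() for _ in range(3)]

print(first_run == second_run)  # True: same seed, same sequence

# Generator interface: two generators with the same seed agree as well.
rng_a = np.random.default_rng(2018)
rng_b = np.random.default_rng(2018)
same_draws = rng_a.random(3).tolist() == rng_b.random(3).tolist()
print(same_draws)  # True
```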
