# Data Wrangling - Data surgery

Data wrangling, sometimes called data munging, is the process of transforming and mapping data from a raw dataset into another format so that it is more appropriate and valuable for downstream purposes such as analysis. A data wrangler is a person who performs these transformation operations.

Those downstream purposes may include further munging, data visualization, data aggregation, training a statistical model, and many other potential uses. As a process, data wrangling typically follows a set of general steps: extracting the raw data from the data source, breaking down the raw data using algorithms (for example, sorting) or parsing it into predefined data structures, and finally depositing the resulting content in a storage system (or silo) for future use.
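The three steps described above (extract, transform, store) can be sketched in a few lines of pandas. This is only an illustration under assumptions: the `raw_text` sample, its column names, and the in-memory `sink` buffer are hypothetical and are not part of the churn dataset used in this notebook.

```python
import io
import pandas as pd

# Hypothetical raw input; in practice this would come from a file or a database.
raw_text = """name,minutes,plan
 alice ,120.5,yes
BOB,80,no
 carol ,,yes
"""

# 1. Extract: read the raw data as-is.
raw = pd.read_csv(io.StringIO(raw_text))

# 2. Transform: normalise text, fill missing values, map yes/no to booleans.
clean = raw.copy()
clean["name"] = clean["name"].str.strip().str.title()
clean["minutes"] = clean["minutes"].fillna(0.0)
clean["plan"] = clean["plan"].map({"yes": True, "no": False})

# 3. Store: write the cleaned table to a sink (an in-memory buffer here).
sink = io.StringIO()
clean.to_csv(sink, index=False)
```

In a real project the sink would more likely be a file or a database table; the buffer just keeps the sketch self-contained.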



In [2]:
import pandas as pd

In [6]:
data = pd.read_csv("../datasets/customer-churn-model/Customer Churn Model.txt")

In [7]:
data.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


## Creating a subset of the data

In [8]:
account_length = data["Account Length"]

In [10]:
account_length.head()

0    128
1    107
2    137
3     84
4     75
Name: Account Length, dtype: int64

In [17]:
desired_columns = ["Account Length", "Phone", "Eve Charge", "Night Calls"]

In [18]:
subset = data[desired_columns]

In [19]:
subset.head()

Unnamed: 0,Account Length,Phone,Eve Charge,Night Calls
0,128,382-4657,16.78,91
1,107,371-7191,16.62,103
2,137,358-1921,10.3,104
3,84,375-9999,5.26,89
4,75,330-6626,12.61,121
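The two selections above return different types: a single label gives a `Series`, while a list of labels gives a `DataFrame`. A minimal sketch of that difference, using a toy frame `df` whose values are copied from the first rows shown above (the variable names are illustrative):

```python
import pandas as pd

# Toy stand-in for the churn dataset (values taken from the rows shown above).
df = pd.DataFrame({
    "Account Length": [128, 107, 137],
    "Phone": ["382-4657", "371-7191", "358-1921"],
    "Night Calls": [91, 103, 104],
})

# A single label returns a one-dimensional Series.
one_column = df["Account Length"]

# A list of labels returns a DataFrame, even if the list has only one element.
one_column_frame = df[["Account Length"]]
two_columns = df[["Account Length", "Night Calls"]]
```

Wrapping a single label in a list is handy when later code expects a two-dimensional object.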


## Filtering the selected columns out of the dataset

In [78]:
desired_columns = ["Account Length", "VMail Message"]
desired_columns

['Account Length', 'VMail Message']

In [79]:
all_columns_list = data.columns.values.tolist()
all_columns_list

['State',
 'Account Length',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']

In [80]:
sublist = [x for x in all_columns_list if x not in desired_columns]

In [81]:
sublist

['State',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']

In [99]:
subset = data[sublist]
subset.head()

Unnamed: 0,Area Code,Intl Mins,Eve Mins,Night Calls,State,Day Calls,Day Charge,Phone,CustServ Calls,Churn?,Night Mins,VMail Plan,Eve Calls,Intl Charge,Eve Charge,Night Charge,Intl Calls,Day Mins,Int'l Plan
0,415,10.0,197.4,91,KS,110,45.07,382-4657,1,False.,244.7,yes,99,2.7,16.78,11.01,3,265.1,no
1,415,13.7,195.5,103,OH,123,27.47,371-7191,1,False.,254.4,yes,103,3.7,16.62,11.45,3,161.6,no
2,415,12.2,121.2,104,NJ,114,41.38,358-1921,0,False.,162.6,no,110,3.29,10.3,7.32,5,243.4,no
3,408,6.6,61.9,89,OH,71,50.9,375-9999,2,False.,196.9,no,88,1.78,5.26,8.86,7,299.4,yes
4,415,10.1,148.3,121,OK,113,28.34,330-6626,3,False.,186.9,no,122,2.73,12.61,8.41,3,166.7,yes
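The same result (every column except a chosen few) can also be obtained with `DataFrame.drop`, which keeps the remaining columns in their original order. The sketch below uses a small toy frame rather than the full dataset; `excluded` and `remaining` are illustrative names.

```python
import pandas as pd

# Toy frame with a few of the dataset's column names.
df = pd.DataFrame({
    "State": ["KS", "OH"],
    "Account Length": [128, 107],
    "Area Code": [415, 415],
    "VMail Message": [25, 26],
})

excluded = ["Account Length", "VMail Message"]

# drop returns a new DataFrame; the original df is left untouched.
remaining = df.drop(columns=excluded)
```

Because `drop` returns a new object, `df` still has all four columns afterwards.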


## Alternative filtering

In [83]:
a = set(desired_columns)
b = set(all_columns_list)
sublist1 = b-a
sublist = list(sublist1)

In [84]:
subset1 = data[sublist]  # index with the list: recent pandas versions reject a set as an indexer
subset1.head()

Unnamed: 0,Area Code,Intl Mins,Eve Mins,Night Calls,State,Day Calls,Day Charge,Phone,CustServ Calls,Churn?,Night Mins,VMail Plan,Eve Calls,Intl Charge,Eve Charge,Night Charge,Intl Calls,Day Mins,Int'l Plan
0,415,10.0,197.4,91,KS,110,45.07,382-4657,1,False.,244.7,yes,99,2.7,16.78,11.01,3,265.1,no
1,415,13.7,195.5,103,OH,123,27.47,371-7191,1,False.,254.4,yes,103,3.7,16.62,11.45,3,161.6,no
2,415,12.2,121.2,104,NJ,114,41.38,358-1921,0,False.,162.6,no,110,3.29,10.3,7.32,5,243.4,no
3,408,6.6,61.9,89,OH,71,50.9,375-9999,2,False.,196.9,no,88,1.78,5.26,8.86,7,299.4,yes
4,415,10.1,148.3,121,OK,113,28.34,330-6626,3,False.,186.9,no,122,2.73,12.61,8.41,3,166.7,yes
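Python sets are unordered, which is why the columns in the output above no longer follow the order of the original file. If the order matters, one option is to use the set only for membership tests and rebuild the list from the original column order. A minimal sketch with a shortened, illustrative column list:

```python
# Shortened column list, in the order it appears in the file.
all_columns = ["State", "Account Length", "Area Code", "Phone", "VMail Message"]
excluded = {"Account Length", "VMail Message"}

# The set difference is correct but carries no guaranteed order.
remaining = set(all_columns) - excluded

# Rebuild the list following the original column order.
ordered = [c for c in all_columns if c in remaining]

# Alternatively, sort alphabetically to get a reproducible order.
alphabetical = sorted(remaining)
```

Either list can then be passed to `data[...]` as usual.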


In [98]:
subset[10:20]

Unnamed: 0,Area Code,Intl Mins,Eve Mins,Night Calls,State,Day Calls,Day Charge,Phone,CustServ Calls,Churn?,Night Mins,VMail Plan,Eve Calls,Intl Charge,Eve Charge,Night Charge,Intl Calls,Day Mins,Int'l Plan
10,415,12.7,228.5,111,IN,137,21.95,329-6603,4,True.,208.8,no,83,3.43,19.42,9.4,6,129.1,no
11,415,9.1,163.4,94,RI,127,31.91,344-9403,0,False.,196.0,no,148,2.46,13.89,8.82,5,187.7,no
12,408,11.2,104.9,128,IA,96,21.9,363-1107,1,False.,141.1,no,71,3.02,8.92,6.35,2,128.8,no
13,510,12.3,247.6,115,MT,88,26.62,394-8006,3,False.,192.3,no,75,3.32,21.05,8.65,5,156.6,no
14,415,13.1,307.2,99,IA,70,20.52,366-9238,4,False.,203.0,no,76,3.54,26.11,9.14,6,120.7,no
15,415,5.4,317.8,128,NY,67,56.59,351-7269,4,True.,160.6,no,97,1.46,27.01,7.23,9,332.9,no
16,408,13.8,280.9,75,ID,139,33.39,350-8884,1,False.,89.3,yes,90,3.73,23.88,4.02,4,196.4,no
17,510,8.1,218.2,121,VT,114,32.42,386-2923,3,False.,129.6,no,111,2.19,18.55,5.83,3,190.7,no
18,510,10.0,212.8,108,VA,66,32.25,356-2992,1,False.,165.7,yes,65,2.7,18.09,7.46,5,189.7,no
19,415,13.0,159.5,74,TX,90,38.15,373-2782,1,False.,192.8,no,88,3.51,13.56,8.68,2,224.4,no


In [86]:
### Users with Day Mins > 330
subset_day_mins_330 = subset[subset["Day Mins"] > 330]
subset_day_mins_330

Unnamed: 0,State,Area Code,Phone,Int'l Plan,VMail Plan,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
15,NY,415,351-7269,no,no,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4,True.
156,OH,415,370-9116,no,no,337.4,120,57.36,227.4,116,19.33,153.9,114,6.93,15.8,7,4.27,0,True.
365,CO,415,343-5709,no,no,350.8,75,59.64,216.5,94,18.4,253.9,100,11.43,10.1,9,2.73,1,True.
605,MO,415,373-2053,no,no,335.5,77,57.04,212.5,109,18.06,265.0,132,11.93,12.7,8,3.43,2,True.
975,DE,510,332-6181,no,no,334.3,118,56.83,192.1,104,16.33,191.0,83,8.59,10.4,6,2.81,0,True.
985,NY,415,345-9140,yes,no,346.8,55,58.96,249.5,79,21.21,275.4,102,12.39,13.3,9,3.59,1,True.
2594,OH,510,348-1163,yes,no,345.3,81,58.7,203.4,106,17.29,217.5,107,9.79,11.8,8,3.19,1,True.


In [87]:
### Users from New York (State == "NY") AND with more than 300 daytime minutes
subset_live_ny = subset[(subset["State"] == "NY") & (subset["Day Mins"] > 300)]
subset_live_ny

Unnamed: 0,State,Area Code,Phone,Int'l Plan,VMail Plan,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
15,NY,415,351-7269,no,no,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4,True.
985,NY,415,345-9140,yes,no,346.8,55,58.96,249.5,79,21.21,275.4,102,12.39,13.3,9,3.59,1,True.


In [88]:
subset_live_ny.shape

(2, 19)
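The filters above rely on boolean masks: each comparison produces a `Series` of `True`/`False` values, and masks are combined element-wise with `&` (and), `|` (or) and `~` (not). The parentheses around each comparison are needed because these operators bind more tightly than `==` and `>`. A small sketch on a toy frame (the first three rows reuse values shown above; the last row is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["NY", "OH", "CO", "NY"],
    "Day Mins": [332.9, 337.4, 350.8, 120.0],  # last value is invented
})

# Each comparison yields a boolean Series aligned with df's index.
is_ny = df["State"] == "NY"
heavy = df["Day Mins"] > 300

ny_and_heavy = df[is_ny & heavy]   # both conditions hold
ny_or_heavy = df[is_ny | heavy]    # at least one condition holds
not_ny = df[~is_ny]                # negation of a condition
```

Naming the masks first tends to make longer filters easier to read and reuse.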

In [89]:
## Day minutes, night minutes and account length for rows 1 to 49 (the slice 1:50 skips row 0; use [:50] for the first 50 rows)
subset_columns = ["Day Mins", "Night Mins", "Account Length"]
subset_first_50 = data[subset_columns][1:50]

In [90]:
subset_first_50.head()

Unnamed: 0,Day Mins,Night Mins,Account Length
1,161.6,254.4,107
2,243.4,162.6,137
3,299.4,196.9,84
4,166.7,186.9,75
5,223.4,203.9,118
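Row slices follow Python's usual convention: the start is inclusive, the end is exclusive, and counting begins at 0. The sketch below, on a toy frame with 100 rows, contrasts the slice used above with two ways of taking exactly the first 50 rows.

```python
import pandas as pd

# Toy frame with a default integer index 0..99.
df = pd.DataFrame({"Day Mins": range(100)})

skips_first = df[1:50]        # rows 1..49, so 49 rows
first_50 = df[:50]            # rows 0..49, so 50 rows
same_with_head = df.head(50)  # equivalent to df[:50]
```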


In [91]:
data.iloc[1:10, 3:6]  ## Rows 1 to 9, columns at positions 3 to 5 (the end of each slice is exclusive)

Unnamed: 0,Phone,Int'l Plan,VMail Plan
1,371-7191,no,yes
2,358-1921,no,no
3,375-9999,yes,no
4,330-6626,yes,no
5,391-8027,yes,no
6,355-9993,no,yes
7,329-9001,yes,no
8,335-4719,no,no
9,330-8173,yes,yes
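`iloc` selects by integer position and excludes the end of each slice, whereas `loc` selects by label and includes it. A minimal sketch on a toy frame (column names `a` to `d` are placeholders):

```python
import pandas as pd

df = pd.DataFrame(
    {"a": range(10), "b": range(10, 20), "c": range(20, 30), "d": range(30, 40)}
)

# Position-based: rows 1, 2, 3 and the columns at positions 1 and 2.
by_position = df.iloc[1:4, 1:3]

# Label-based: rows labelled 1 through 4 (inclusive), columns "b" through "c".
by_label = df.loc[1:4, "b":"c"]
```

With the default integer index the row labels happen to coincide with the positions, so the only visible difference here is the extra row returned by `loc`.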


In [114]:
subset["Total Mins"] = subset["Day Mins"] + subset["Night Mins"] + subset["Eve Mins"]

In [115]:
subset["Total Mins"]

0       707.2
1       611.5
2       527.2
3       558.2
4       501.9
        ...  
3328    650.8
3329    575.8
3330    661.5
3331    512.6
3332    741.7
Name: Total Mins, Length: 3333, dtype: float64

In [117]:
subset["Total Calls"] = subset["Day Calls"] + subset["Night Calls"] + subset["Eve Calls"]

In [120]:
subset.head()

Unnamed: 0,Area Code,Intl Mins,Eve Mins,Night Calls,State,Day Calls,Day Charge,Phone,CustServ Calls,Churn?,...,VMail Plan,Eve Calls,Intl Charge,Eve Charge,Night Charge,Intl Calls,Day Mins,Int'l Plan,Total Calls,Total Mins
0,415,10.0,197.4,91,KS,110,45.07,382-4657,1,False.,...,yes,99,2.7,16.78,11.01,3,265.1,no,300,707.2
1,415,13.7,195.5,103,OH,123,27.47,371-7191,1,False.,...,yes,103,3.7,16.62,11.45,3,161.6,no,329,611.5
2,415,12.2,121.2,104,NJ,114,41.38,358-1921,0,False.,...,no,110,3.29,10.3,7.32,5,243.4,no,328,527.2
3,408,6.6,61.9,89,OH,71,50.9,375-9999,2,False.,...,no,88,1.78,5.26,8.86,7,299.4,yes,248,558.2
4,415,10.1,148.3,121,OK,113,28.34,330-6626,3,False.,...,no,122,2.73,12.61,8.41,3,166.7,yes,356,501.9


In [121]:
subset[["Total Calls", "Total Mins"]].head()

Unnamed: 0,Total Calls,Total Mins
0,300,707.2
1,329,611.5
2,328,527.2
3,248,558.2
4,356,501.9
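When a new column is added to a frame that was itself selected from another frame, pandas may emit a `SettingWithCopyWarning`, depending on the version and settings. One common way to make the intent explicit is to take a `.copy()` of the selection first. The sketch below assumes a toy `data` frame with values from the first two rows above; the `Other` column is a placeholder.

```python
import pandas as pd

data = pd.DataFrame({
    "Day Mins": [265.1, 161.6],
    "Eve Mins": [197.4, 195.5],
    "Night Mins": [244.7, 254.4],
    "Other": [1, 2],  # placeholder column, not part of the dataset
})

minute_columns = ["Day Mins", "Eve Mins", "Night Mins"]

# An explicit copy makes subset independent of data.
subset = data[minute_columns].copy()

# Row-wise sum across the selected columns.
subset["Total Mins"] = subset[minute_columns].sum(axis=1)
```

Summing with `sum(axis=1)` is equivalent to adding the three columns by hand here, and scales better when more columns are involved.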
