# Data Wrangling

With the analysis done, we start making decisions about the dataset: how to cut it, how to impute missing data, whether to keep all of the data or only part of it, and so on. \
**Data wrangling**, sometimes called **data munging**, is the process of transforming and mapping data from a **raw** dataset into another format, with the aim of making it more appropriate and valuable for a variety of downstream purposes such as analysis. A ***data wrangler*** is a person who performs these transformation operations. \
Those downstream purposes may include further munging, data visualization, data aggregation, and training a statistical model, among many other potential uses. As a process, data wrangling typically follows a set of general steps: extracting the data in raw form from the data source, processing the raw data with algorithms (for example, sorting) or parsing it into predefined data structures, and finally depositing the resulting content into a storage system (or data sink) for future use.
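
The three steps described above (extract, process, deposit) can be sketched in a few lines of pandas. This is only a minimal illustration: the inline CSV text, the column names `state` and `calls`, and the in-memory buffer used as the "sink" are all hypothetical stand-ins for a real source and a real storage system.

```python
import io

import pandas as pd

# Hypothetical raw data standing in for a real data source
raw_csv = io.StringIO("state,calls\nOH,3\nKS,1\nNJ,2\n")

# 1. Extract: read the data in raw form
raw = pd.read_csv(raw_csv)

# 2. Process: sort the rows by number of calls
clean = raw.sort_values("calls").reset_index(drop=True)

# 3. Deposit: write the result to a sink (an in-memory buffer here)
sink = io.StringIO()
clean.to_csv(sink, index=False)
print(sink.getvalue())
```

In a real project the sink would more likely be a file, a database table, or a data lake location rather than a `StringIO` buffer.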

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('../datasets/customer-churn-model/Customer Churn Model.txt') 
data.head()

,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


## Subconjuntos

From our data lake or our DataFrame, we take the columns, or variables, that we need.

In [3]:
account_length = data['Account Length']
account_length.head()

0    128
1    107
2    137
3     84
4     75
Name: Account Length, dtype: int64

In [4]:
type(account_length)

pandas.core.series.Series

In [5]:
data.columns.values

array(['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan",
       'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls',
       'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins',
       'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls',
       'Intl Charge', 'CustServ Calls', 'Churn?'], dtype=object)

In [6]:
subset = data[['Account Length', 'Phone', 'Eve Calls', 'Day Calls']]
subset.head()

,Account Length,Phone,Eve Calls,Day Calls
0,128,382-4657,99,110
1,107,371-7191,103,123
2,137,358-1921,110,114
3,84,375-9999,88,71
4,75,330-6626,122,113


In [7]:
type(subset)

pandas.core.frame.DataFrame
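
The two `type(...)` results above reflect how pandas treats the indexing key: a single label returns a `Series`, while a list of labels returns a `DataFrame`, even when the list has only one element. A minimal sketch on a hypothetical toy frame (`toy` is not part of the original dataset; it only reuses two of its column names):

```python
import pandas as pd

# Hypothetical toy frame reusing two column names from the dataset
toy = pd.DataFrame({"Account Length": [128, 107], "Day Calls": [110, 123]})

as_series = toy["Account Length"]    # single label -> Series
as_frame = toy[["Account Length"]]   # list of labels -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```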

In [8]:
desired_columns = ['Account Length', 'Phone', 'Eve Calls', 'Day Calls']
subset = data[desired_columns]
subset.head()

,Account Length,Phone,Eve Calls,Day Calls
0,128,382-4657,99,110
1,107,371-7191,103,123
2,137,358-1921,110,114
3,84,375-9999,88,71
4,75,330-6626,122,113


Another technique, when the dataset has many variables, is to remove the ones we do not want.

In [9]:
# Note: despite its name, this list holds the columns we want to leave out
desired_columns = ['Account Length', 'VMail Message', 'Day Calls']
desired_columns

['Account Length', 'VMail Message', 'Day Calls']

In [10]:
all_columns_list = data.columns.values.tolist()
all_columns_list

['State',
 'Account Length',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']

In [12]:
sublist = [ x for x in all_columns_list if x not in desired_columns ]
sublist

['State',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'Day Mins',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']

In [13]:
subset = data[sublist]
subset.head()

,State,Area Code,Phone,Int'l Plan,VMail Plan,Day Mins,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,415,382-4657,no,yes,265.1,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,415,371-7191,no,yes,161.6,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,415,358-1921,no,no,243.4,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,408,375-9999,yes,no,299.4,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,415,330-6626,yes,no,166.7,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.
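
As a possible shortcut for the list comprehension above, pandas also offers `DataFrame.drop`, which removes the named columns and keeps the rest in their original order. The sketch below uses a hypothetical toy frame (`toy`) and a hypothetical list name (`unwanted`) rather than the full churn dataset.

```python
import pandas as pd

# Hypothetical toy frame with a few of the dataset's column names
toy = pd.DataFrame({
    "State": ["KS", "OH"],
    "Account Length": [128, 107],
    "VMail Message": [25, 26],
    "Day Calls": [110, 123],
    "Churn?": ["False.", "False."],
})

# Columns we want to leave out
unwanted = ["Account Length", "VMail Message", "Day Calls"]

# drop returns a new DataFrame; the original is not modified
subset = toy.drop(columns=unwanted)
print(subset.columns.tolist())  # ['State', 'Churn?']
```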


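Another way is to use set difference. Because sets are unordered, the resulting column order may differ from the original.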

In [14]:
a = set(desired_columns)
b = set(all_columns_list)
sublist = b-a
sublist = list(sublist)

In [15]:
data[sublist]

,Phone,Day Mins,Churn?,Eve Calls,Intl Calls,CustServ Calls,Night Mins,Intl Charge,Intl Mins,Area Code,Night Charge,VMail Plan,Night Calls,State,Eve Charge,Int'l Plan,Day Charge,Eve Mins
0,382-4657,265.1,False.,99,3,1,244.7,2.70,10.0,415,11.01,yes,91,KS,16.78,no,45.07,197.4
1,371-7191,161.6,False.,103,3,1,254.4,3.70,13.7,415,11.45,yes,103,OH,16.62,no,27.47,195.5
2,358-1921,243.4,False.,110,5,0,162.6,3.29,12.2,415,7.32,no,104,NJ,10.30,no,41.38,121.2
3,375-9999,299.4,False.,88,7,2,196.9,1.78,6.6,408,8.86,no,89,OH,5.26,yes,50.90,61.9
4,330-6626,166.7,False.,122,3,3,186.9,2.73,10.1,415,8.41,no,121,OK,12.61,yes,28.34,148.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3328,414-4276,156.2,False.,126,6,2,279.1,2.67,9.9,415,12.56,yes,83,AZ,18.32,no,26.55,215.5
3329,370-3271,231.1,False.,55,4,3,191.3,2.59,9.6,415,8.61,no,123,WV,13.04,no,39.29,153.4
3330,328-8230,180.8,False.,58,6,2,191.9,3.81,14.1,510,8.64,no,91,RI,24.55,no,30.74,288.8
3331,364-6381,213.8,False.,84,10,2,139.2,1.35,5.0,510,6.26,no,137,CT,13.57,yes,36.35,159.6
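
The output above shows the columns in a shuffled order, which is a side effect of going through a `set`. If the original order matters, one option is to sort the set difference by each column's position in the original list. This is a sketch on a hypothetical empty frame (`toy`) with illustrative names (`excluded`, `ordered`), not the only way to do it.

```python
import pandas as pd

# Hypothetical empty frame that only carries column names
toy = pd.DataFrame(columns=["State", "Account Length", "Phone", "Day Calls", "Churn?"])

# Columns we want to leave out
excluded = {"Account Length", "Day Calls"}

all_columns = toy.columns.tolist()
remaining = set(all_columns) - excluded  # unordered

# Restore the original column order by sorting on original position
ordered = sorted(remaining, key=all_columns.index)
print(ordered)  # ['State', 'Phone', 'Churn?']

subset = toy[ordered]
```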
