<a href="https://colab.research.google.com/github/EstebanJM/BigData/blob/main/PROYECTO_1_BIG_DATA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**FIRST SEMESTER PROJECT - DATA SEGMENTATION AND ANALYSIS**

##**We import the libraries**

In [322]:
import numpy as np
import pandas as pd
import scipy.stats as st
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial import distance_matrix

from sklearn.preprocessing import StandardScaler, OrdinalEncoder, Normalizer
from sklearn.impute import KNNImputer

from datetime import datetime

## **Data cleaning**

In [323]:
P = '/content/datos_marketing.csv'
mark = pd.read_csv(P)

mark.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,Complain,AcceptedCmps
0,5524,1957,Graduation,Single,58138.0,0,0,04-09-2012,635,88,546,172,88,88,8,10,4,no,0
1,2174,1954,Graduation,Single,46344.0,1,1,08-03-2014,11,1,6,2,1,6,1,1,2,no,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,426,49,127,111,21,42,8,2,10,no,0
3,6182,1984,Graduation,Together,26646.0,1,0,10-02-2014,11,4,20,10,3,5,2,0,4,no,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,173,43,118,46,27,15,5,3,6,no,0


In [324]:
mark.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   MntWines             2240 non-null   int64  
 9   MntFruits            2240 non-null   int64  
 10  MntMeatProducts      2240 non-null   int64  
 11  MntFishProducts      2240 non-null   int64  
 12  MntSweetProducts     2240 non-null   int64  
 13  MntGoldProds         2240 non-null   int64  
 14  NumWebPurchases      2240 non-null   int64  
 15  NumCatalogPurchases  2240 non-null   i

In [325]:
mark.columns = mark.columns.str.lower()
mark.columns

Index(['id', 'year_birth', 'education', 'marital_status', 'income', 'kidhome',
       'teenhome', 'dt_customer', 'mntwines', 'mntfruits', 'mntmeatproducts',
       'mntfishproducts', 'mntsweetproducts', 'mntgoldprods',
       'numwebpurchases', 'numcatalogpurchases', 'numstorepurchases',
       'complain', 'acceptedcmps'],
      dtype='object')

**The column names are updated to follow Python naming conventions (snake_case)**


In [326]:
mark.rename(columns = {
    "kidhome": "kid_home",
    "teenhome": "teen_home",
    "mntwines": "mnt_wines",
    "mntfishproducts": "mnt_fish_products",
    "mntfruits": "mnt_fruits",
    "mntmeatproducts": "mnt_meat_products",
    "numwebpurchases": "num_web_purchases",
    "mntgoldprods": "mnt_gold_prods",
    "mntsweetproducts": "mnt_sweet_products",
    "numcatalogpurchases": "num_catalog_purchases",
    "numstorepurchases": "num_store_purchases",
    "acceptedcmps": "accepted_cmps"},
            inplace = True)
mark.columns

Index(['id', 'year_birth', 'education', 'marital_status', 'income', 'kid_home',
       'teen_home', 'dt_customer', 'mnt_wines', 'mnt_fruits',
       'mnt_meat_products', 'mnt_fish_products', 'mnt_sweet_products',
       'mnt_gold_prods', 'num_web_purchases', 'num_catalog_purchases',
       'num_store_purchases', 'complain', 'accepted_cmps'],
      dtype='object')

**We change the dtype of dt_customer to datetime**

In [327]:
mark['dt_customer'] = pd.to_datetime(mark['dt_customer'])
mark.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     2240 non-null   int64         
 1   year_birth             2240 non-null   int64         
 2   education              2240 non-null   object        
 3   marital_status         2240 non-null   object        
 4   income                 2216 non-null   float64       
 5   kid_home               2240 non-null   int64         
 6   teen_home              2240 non-null   int64         
 7   dt_customer            2240 non-null   datetime64[ns]
 8   mnt_wines              2240 non-null   int64         
 9   mnt_fruits             2240 non-null   int64         
 10  mnt_meat_products      2240 non-null   int64         
 11  mnt_fish_products      2240 non-null   int64         
 12  mnt_sweet_products     2240 non-null   int64         
 13  mnt

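The raw `Dt_Customer` values in the first `head()` output (for example `21-08-2013`) appear to be day-first, yet `04-09-2012` shows up as `2012-04-09` in later outputs, which suggests that ambiguous dates were read month-first. A minimal sketch, assuming every date follows `DD-MM-YYYY`, states the format explicitly. The `sample` frame is illustrative and only holds a few values copied from that output.

```python
import pandas as pd

# Illustrative sample copied from the first head() output (not the full dataset).
sample = pd.DataFrame({'dt_customer': ['04-09-2012', '08-03-2014', '21-08-2013']})

# Assumption: all dates are day-month-year, so the format is given explicitly
# instead of letting pandas infer it value by value.
sample['dt_customer'] = pd.to_datetime(sample['dt_customer'], format='%d-%m-%Y')

print(sample['dt_customer'].dt.strftime('%Y-%m-%d').tolist())
```

With an explicit format, a value that does not match raises an error instead of being silently reinterpreted.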


**We map complain to 1/0 and change its dtype to category**

In [328]:
d1 = {'yes': 1,
        'no': 0}

mark['complain'] = mark['complain'].map(d1)
mark['complain'] = mark['complain'].astype('category')
mark.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     2240 non-null   int64         
 1   year_birth             2240 non-null   int64         
 2   education              2240 non-null   object        
 3   marital_status         2240 non-null   object        
 4   income                 2216 non-null   float64       
 5   kid_home               2240 non-null   int64         
 6   teen_home              2240 non-null   int64         
 7   dt_customer            2240 non-null   datetime64[ns]
 8   mnt_wines              2240 non-null   int64         
 9   mnt_fruits             2240 non-null   int64         
 10  mnt_meat_products      2240 non-null   int64         
 11  mnt_fish_products      2240 non-null   int64         
 12  mnt_sweet_products     2240 non-null   int64         
 13  mnt

**We create the "Age" column, which represents each customer's age. It is computed by subtracting "year_birth" from the current year (2023 in the code)**

In [329]:
mark['Age'] = 2023 - mark['year_birth']
mark['Age'].describe()
mark.head()

Unnamed: 0,id,year_birth,education,marital_status,income,kid_home,teen_home,dt_customer,mnt_wines,mnt_fruits,mnt_meat_products,mnt_fish_products,mnt_sweet_products,mnt_gold_prods,num_web_purchases,num_catalog_purchases,num_store_purchases,complain,accepted_cmps,Age
0,5524,1957,Graduation,Single,58138.0,0,0,2012-04-09,635,88,546,172,88,88,8,10,4,0,0,66
1,2174,1954,Graduation,Single,46344.0,1,1,2014-08-03,11,1,6,2,1,6,1,1,2,0,0,69
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,426,49,127,111,21,42,8,2,10,0,0,58
3,6182,1984,Graduation,Together,26646.0,1,0,2014-10-02,11,4,20,10,3,5,2,0,4,0,0,39
4,5324,1981,PhD,Married,58293.0,1,0,2014-01-19,173,43,118,46,27,15,5,3,6,0,0,42


**Create the antiquity column (years as a customer)**

In [330]:
mark['antiquity'] = datetime.today() - mark['dt_customer']
mark['antiquity'] = mark['antiquity'].dt.days / 365
mark['antiquity'] = mark['antiquity'].astype('int64')
mark['antiquity']

0       11
1        9
2       10
3        8
4        9
        ..
2235    10
2236     8
2237     9
2238     9
2239    10
Name: antiquity, Length: 2240, dtype: int64
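Because `datetime.today()` changes every day, rerunning the cell later can yield antiquity values different from the ones printed here. A minimal sketch, assuming a fixed reference date is acceptable for the analysis, is shown below. The `reference_date` value and the two sample dates are hypothetical.

```python
import pandas as pd

# Hypothetical fixed reference date, chosen only for illustration.
reference_date = pd.Timestamp('2023-06-30')

# Illustrative enrollment dates (not the full dataset).
dt_customer = pd.Series(pd.to_datetime(['2012-09-04', '2014-03-08'], format='%Y-%m-%d'))

# Whole years as a customer, using the same 365-day approximation as the notebook.
antiquity = ((reference_date - dt_customer).dt.days // 365).astype('int64')

print(antiquity.tolist())
```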

In [331]:
mark['antiquity'].describe()

count    2240.000000
mean        9.641518
std         0.691019
min         8.000000
25%         9.000000
50%        10.000000
75%        10.000000
max        11.000000
Name: antiquity, dtype: float64

In [332]:
def function_antiguedad(row):
  antiquity = row['antiquity']
  if antiquity < 9:
    return 'nuevo'
  elif antiquity >= 9 and antiquity < 10:
    return 'antiguo'
  else:
    return 'muy antiguo'

mark['cat_antiquity'] = mark.apply(function_antiguedad, axis=1)
mark['cat_antiquity'] = mark['cat_antiquity'].astype('category')
mark['cat_antiquity'].describe()

count            2240
unique              3
top       muy antiguo
freq             1298
Name: cat_antiquity, dtype: object
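The row-wise `apply` above can also be written as a vectorized binning step. The following is a sketch with `pd.cut`, assuming the same thresholds as `function_antiguedad` (below 9, from 9 up to but not including 10, and 10 or more). The sample values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative antiquity values (years), not the full dataset.
antiquity = pd.Series([11, 9, 10, 8, 9], name='antiquity')

# right=False makes the bins left-closed: [-inf, 9), [9, 10), [10, inf),
# which mirrors the if/elif/else thresholds in function_antiguedad.
cat_antiquity = pd.cut(
    antiquity,
    bins=[-np.inf, 9, 10, np.inf],
    right=False,
    labels=['nuevo', 'antiguo', 'muy antiguo'],
)

print(cat_antiquity.tolist())
```

`pd.cut` returns an ordered categorical directly, so no separate `astype('category')` call is needed in this variant.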

**We change the dtype of education to category**

In [333]:
mark['education'] = mark['education'].astype('category')
mark.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     2240 non-null   int64         
 1   year_birth             2240 non-null   int64         
 2   education              2240 non-null   category      
 3   marital_status         2240 non-null   object        
 4   income                 2216 non-null   float64       
 5   kid_home               2240 non-null   int64         
 6   teen_home              2240 non-null   int64         
 7   dt_customer            2240 non-null   datetime64[ns]
 8   mnt_wines              2240 non-null   int64         
 9   mnt_fruits             2240 non-null   int64         
 10  mnt_meat_products      2240 non-null   int64         
 11  mnt_fish_products      2240 non-null   int64         
 12  mnt_sweet_products     2240 non-null   int64         
 13  mnt

**We convert the marital_status variable to the category dtype**

In [334]:
mark['marital_status'] = mark['marital_status'].astype('category')
mark.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     2240 non-null   int64         
 1   year_birth             2240 non-null   int64         
 2   education              2240 non-null   category      
 3   marital_status         2240 non-null   category      
 4   income                 2216 non-null   float64       
 5   kid_home               2240 non-null   int64         
 6   teen_home              2240 non-null   int64         
 7   dt_customer            2240 non-null   datetime64[ns]
 8   mnt_wines              2240 non-null   int64         
 9   mnt_fruits             2240 non-null   int64         
 10  mnt_meat_products      2240 non-null   int64         
 11  mnt_fish_products      2240 non-null   int64         
 12  mnt_sweet_products     2240 non-null   int64         
 13  mnt

**Remove outliers**

In [335]:
q1 = mark['income'].quantile(.25)
q3 = mark['income'].quantile(.75)
IQR = q3 - q1
lim_sup = q3 + 1.5 * IQR
lim_inf = q1 - 1.5 * IQR

mark = mark[(mark['income'] >= lim_inf) & (mark['income'] <= lim_sup)]
mark['income'].describe()

count      2208.000000
mean      51633.638134
std       20713.376530
min        1730.000000
25%       35196.000000
50%       51301.000000
75%       68289.750000
max      113734.000000
Name: income, dtype: float64

In [336]:
q1 = mark['Age'].quantile(.25)
q3 = mark['Age'].quantile(.75)
IQR = q3 - q1
lim_sup = q3 + 1.5 * IQR
lim_inf = q1 - 1.5 * IQR

mark = mark[(mark['Age'] >= lim_inf) & (mark['Age'] <= lim_sup)]
mark['Age'].describe()

count    2205.000000
mean       54.095692
std        11.705801
min        27.000000
25%        46.000000
50%        53.000000
75%        64.000000
max        83.000000
Name: Age, dtype: float64
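In the `income` filter above, the row count falls from 2240 to 2208 even though only 2216 incomes were non-null: comparisons with NaN evaluate to False, so rows with missing income are dropped together with the outliers. A minimal sketch of a reusable IQR mask that can optionally keep missing values follows. The helper name `iqr_mask` and its parameters are hypothetical, and the sample data is illustrative.

```python
import numpy as np
import pandas as pd

def iqr_mask(series, k=1.5, keep_na=True):
    """Return a boolean mask that is True for values inside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends, like the >= and <= used in the notebook.
    inside = series.between(q1 - k * iqr, q3 + k * iqr)
    # Comparisons with NaN are False, so missing values are kept explicitly if requested.
    return (inside | series.isna()) if keep_na else inside

# Illustrative values: one clear outlier and one missing value.
income = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0, np.nan])
kept = income[iqr_mask(income)]

print(len(kept), int(kept.isna().sum()))
```

Keeping the missing rows at this stage would let a later median imputation act on them; whether that is desirable depends on the analysis.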

**We identify the variables that contain null values**

In [337]:
mark.isna().sum()

id                       0
year_birth               0
education                0
marital_status           0
income                   0
kid_home                 0
teen_home                0
dt_customer              0
mnt_wines                0
mnt_fruits               0
mnt_meat_products        0
mnt_fish_products        0
mnt_sweet_products       0
mnt_gold_prods           0
num_web_purchases        0
num_catalog_purchases    0
num_store_purchases      0
complain                 0
accepted_cmps            0
Age                      0
antiquity                0
cat_antiquity            0
dtype: int64

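**We fill any missing values in the "income" column with the median and convert its dtype to integer. The outlier filter on "income" above already dropped the rows with missing income, because comparisons with NaN evaluate to False, so the fill leaves the data unchanged here**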

In [338]:
med = mark['income'].median()
mark['income'] = mark['income'].fillna(med)
mark.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2205 entries, 0 to 2239
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     2205 non-null   int64         
 1   year_birth             2205 non-null   int64         
 2   education              2205 non-null   category      
 3   marital_status         2205 non-null   category      
 4   income                 2205 non-null   float64       
 5   kid_home               2205 non-null   int64         
 6   teen_home              2205 non-null   int64         
 7   dt_customer            2205 non-null   datetime64[ns]
 8   mnt_wines              2205 non-null   int64         
 9   mnt_fruits             2205 non-null   int64         
 10  mnt_meat_products      2205 non-null   int64         
 11  mnt_fish_products      2205 non-null   int64         
 12  mnt_sweet_products     2205 non-null   int64         
 13  mnt

In [339]:
mark['income'] = mark['income'].astype('int')
mark.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2205 entries, 0 to 2239
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     2205 non-null   int64         
 1   year_birth             2205 non-null   int64         
 2   education              2205 non-null   category      
 3   marital_status         2205 non-null   category      
 4   income                 2205 non-null   int64         
 5   kid_home               2205 non-null   int64         
 6   teen_home              2205 non-null   int64         
 7   dt_customer            2205 non-null   datetime64[ns]
 8   mnt_wines              2205 non-null   int64         
 9   mnt_fruits             2205 non-null   int64         
 10  mnt_meat_products      2205 non-null   int64         
 11  mnt_fish_products      2205 non-null   int64         
 12  mnt_sweet_products     2205 non-null   int64         
 13  mnt

**We categorize the customers' marital status, replacing the values that were not meaningful labels**

In [340]:
d2 = {'YOLO' : 'Single',
      'Alone': 'Single',
      'Absurd':'Single',
      'Divorced':'Single',
      'Together':'Married',
      'Single' : 'Single',
      'Married' : 'Married',
      'Widow' : 'Widow'}

mark['marital_status'] = mark['marital_status'].map(d2)
mark.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2205 entries, 0 to 2239
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     2205 non-null   int64         
 1   year_birth             2205 non-null   int64         
 2   education              2205 non-null   category      
 3   marital_status         2205 non-null   object        
 4   income                 2205 non-null   int64         
 5   kid_home               2205 non-null   int64         
 6   teen_home              2205 non-null   int64         
 7   dt_customer            2205 non-null   datetime64[ns]
 8   mnt_wines              2205 non-null   int64         
 9   mnt_fruits             2205 non-null   int64         
 10  mnt_meat_products      2205 non-null   int64         
 11  mnt_fish_products      2205 non-null   int64         
 12  mnt_sweet_products     2205 non-null   int64         
 13  mnt
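As the `info()` output shows, `marital_status` is reported as `object` after `map`, and any label missing from the dictionary would become NaN without warning. A minimal sketch that checks for unmapped labels and restores the category dtype is shown below. The `'Unknown'` label is hypothetical and only included to trigger the check.

```python
import pandas as pd

status_map = {'YOLO': 'Single', 'Alone': 'Single', 'Absurd': 'Single',
              'Divorced': 'Single', 'Together': 'Married', 'Single': 'Single',
              'Married': 'Married', 'Widow': 'Widow'}

# Illustrative values; 'Unknown' is a hypothetical label absent from the mapping.
status = pd.Series(['YOLO', 'Together', 'Widow', 'Divorced', 'Unknown'])

mapped = status.map(status_map)

# Labels that were not covered by the dictionary end up as NaN after map().
unmapped = status[mapped.isna()].unique().tolist()

# Convert the mapped result back to the category dtype.
mapped = mapped.astype('category')

print(unmapped, mapped.dtype)
```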

**We categorize age as "adulto joven" (young adult), "adulto" (adult), and "adulto mayor" (older adult)**

In [341]:
def function_Edad(row):
  Age = row['Age']
  if Age <= 35:
    return 'adulto joven'
  elif Age <= 53:
    return 'adulto'
  else:
    return 'adulto mayor'

mark['cat_Age'] = mark.apply(function_Edad, axis=1)
mark['cat_Age'].describe()

count             2205
unique               3
top       adulto mayor
freq              1060
Name: cat_Age, dtype: object

**We convert the cat_Age variable to the category dtype**

In [342]:
mark['cat_Age'] = mark['cat_Age'].astype('category')
mark.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2205 entries, 0 to 2239
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     2205 non-null   int64         
 1   year_birth             2205 non-null   int64         
 2   education              2205 non-null   category      
 3   marital_status         2205 non-null   object        
 4   income                 2205 non-null   int64         
 5   kid_home               2205 non-null   int64         
 6   teen_home              2205 non-null   int64         
 7   dt_customer            2205 non-null   datetime64[ns]
 8   mnt_wines              2205 non-null   int64         
 9   mnt_fruits             2205 non-null   int64         
 10  mnt_meat_products      2205 non-null   int64         
 11  mnt_fish_products      2205 non-null   int64         
 12  mnt_sweet_products     2205 non-null   int64         
 13  mnt

**We use the quartiles of the "income" variable to categorize customers as "muy bajo" (very low), "Bajo" (low), "Medio" (medium), and "Alto" (high)**

In [343]:
mark['income'].describe()

count      2205.000000
mean      51622.094785
std       20713.063826
min        1730.000000
25%       35196.000000
50%       51287.000000
75%       68281.000000
max      113734.000000
Name: income, dtype: float64

In [344]:
def function_income(row):
  income = row['income']
  if income < 35196:
    return 'muy bajo'
  elif income >= 35196 and income < 51287:
    return 'Bajo'
  elif income >= 51287 and income < 68281:
    return 'Medio'
  else:
    return 'Alto'

mark['cat_income'] = mark.apply(function_income, axis=1)
mark['cat_income'].describe()

count     2205
unique       4
top       Alto
freq       552
Name: cat_income, dtype: object
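The thresholds in `function_income` are copied by hand from the `describe()` output, so they would go stale if the data changed. The sketch below computes the quartiles and bins in one step, assuming the same left-closed intervals as the function. The sample incomes are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative incomes (not the full dataset).
income = pd.Series([1730, 26646, 35196, 46344, 58138, 68281, 71613, 113734])

# Quartiles computed from the data instead of being typed in by hand.
q1, q2, q3 = income.quantile([0.25, 0.5, 0.75])

# right=False gives [-inf, q1), [q1, q2), [q2, q3), [q3, inf),
# matching the "<" and ">=" comparisons in function_income.
cat_income = pd.cut(
    income,
    bins=[-np.inf, q1, q2, q3, np.inf],
    right=False,
    labels=['muy bajo', 'Bajo', 'Medio', 'Alto'],
)

print(cat_income.tolist())
```

`pd.qcut` is a shorter alternative, but it uses right-closed intervals, so values that fall exactly on a quartile would land in the lower category.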

In [345]:
mark['cat_income'] = mark['cat_income'].astype('category')


**Remove fully duplicated rows**

In [346]:
mark = mark.drop_duplicates()
mark

Unnamed: 0,id,year_birth,education,marital_status,income,kid_home,teen_home,dt_customer,mnt_wines,mnt_fruits,...,num_web_purchases,num_catalog_purchases,num_store_purchases,complain,accepted_cmps,Age,antiquity,cat_antiquity,cat_Age,cat_income
0,5524,1957,Graduation,Single,58138,0,0,2012-04-09,635,88,...,8,10,4,0,0,66,11,muy antiguo,adulto mayor,Medio
1,2174,1954,Graduation,Single,46344,1,1,2014-08-03,11,1,...,1,1,2,0,0,69,9,antiguo,adulto mayor,Bajo
2,4141,1965,Graduation,Married,71613,0,0,2013-08-21,426,49,...,8,2,10,0,0,58,10,muy antiguo,adulto mayor,Alto
3,6182,1984,Graduation,Married,26646,1,0,2014-10-02,11,4,...,2,0,4,0,0,39,8,nuevo,adulto,muy bajo
4,5324,1981,PhD,Married,58293,1,0,2014-01-19,173,43,...,5,3,6,0,0,42,9,antiguo,adulto,Medio
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10870,1967,Graduation,Married,61223,0,1,2013-06-13,709,43,...,9,3,4,0,0,56,10,muy antiguo,adulto mayor,Medio
2236,4001,1946,PhD,Married,64014,2,1,2014-10-06,406,0,...,8,2,5,0,1,77,8,nuevo,adulto mayor,Medio
2237,7270,1981,Graduation,Single,56981,0,0,2014-01-25,908,48,...,2,3,13,0,1,42,9,antiguo,adulto,Medio
2238,8235,1956,Master,Married,69245,0,1,2014-01-24,428,30,...,6,5,10,0,0,67,9,antiguo,adulto mayor,Alto
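`drop_duplicates()` returns a new DataFrame and leaves the original untouched unless the result is assigned. A small sketch on illustrative data counts the duplicated rows first and then keeps the de-duplicated result. The frame `df` is hypothetical.

```python
import pandas as pd

# Illustrative frame with one fully duplicated row.
df = pd.DataFrame({'id': [1, 2, 2, 3], 'income': [100, 200, 200, 300]})

# duplicated() flags every repeat of an earlier row.
n_dups = int(df.duplicated().sum())

# Assign the result; reset_index is optional and only tidies the row labels.
df = df.drop_duplicates().reset_index(drop=True)

print(n_dups, len(df))
```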


In [347]:
mark.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2205 entries, 0 to 2239
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     2205 non-null   int64         
 1   year_birth             2205 non-null   int64         
 2   education              2205 non-null   category      
 3   marital_status         2205 non-null   object        
 4   income                 2205 non-null   int64         
 5   kid_home               2205 non-null   int64         
 6   teen_home              2205 non-null   int64         
 7   dt_customer            2205 non-null   datetime64[ns]
 8   mnt_wines              2205 non-null   int64         
 9   mnt_fruits             2205 non-null   int64         
 10  mnt_meat_products      2205 non-null   int64         
 11  mnt_fish_products      2205 non-null   int64         
 12  mnt_sweet_products     2205 non-null   int64         
 13  mnt

## Statistical exploration

**Statistical summary of the numeric variables**

In [348]:
mark.describe()

Unnamed: 0,id,year_birth,income,kid_home,teen_home,mnt_wines,mnt_fruits,mnt_meat_products,mnt_fish_products,mnt_sweet_products,mnt_gold_prods,num_web_purchases,num_catalog_purchases,num_store_purchases,accepted_cmps,Age,antiquity
count,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0
mean,5585.439456,1968.904308,51622.094785,0.442177,0.506576,306.164626,26.403175,165.312018,37.756463,27.128345,44.057143,4.10068,2.645351,5.823583,0.29932,54.095692,9.642177
std,3247.546423,11.705801,20713.063826,0.537132,0.54438,337.493839,39.784484,217.784507,54.824635,41.130468,51.736211,2.737424,2.798647,3.241796,0.68044,11.705801,0.691758
min,0.0,1940.0,1730.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.0,8.0
25%,2815.0,1959.0,35196.0,0.0,0.0,24.0,2.0,16.0,3.0,1.0,9.0,2.0,0.0,3.0,0.0,46.0,9.0
50%,5455.0,1970.0,51287.0,0.0,0.0,178.0,8.0,68.0,12.0,8.0,25.0,4.0,2.0,5.0,0.0,53.0,10.0
75%,8418.0,1977.0,68281.0,1.0,1.0,507.0,33.0,232.0,50.0,34.0,56.0,6.0,4.0,8.0,0.0,64.0,10.0
max,11191.0,1996.0,113734.0,2.0,2.0,1493.0,199.0,1725.0,259.0,262.0,321.0,27.0,28.0,13.0,4.0,83.0,11.0


**Statistical summary of the "category" variables**

In [None]:
mark.describe(include=['category'])

## Graphical exploration

**Bar charts of the categorical variables**

In [None]:
for col in mark.select_dtypes('category').columns:
  mark[col].value_counts().plot(kind='bar')
  plt.title(col)
  plt.show()

**Histograms of the integer variables**

In [None]:

for col in mark.select_dtypes('int64').columns:
  mark[col].hist()
  plt.title(col)
  plt.show()

**Correlogram**

In [None]:
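# numeric_only=True skips the category, object, and datetime columns (pandas >= 1.5)
sns.heatmap(mark.corr(numeric_only=True))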
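The frame mixes numeric, categorical, and datetime columns, and recent pandas versions raise an error when `corr()` meets non-numeric columns. A minimal sketch, assuming pandas 1.5 or later where the `numeric_only` parameter is available, restricts the correlation to numeric columns. The small frame below is illustrative.

```python
import pandas as pd

# Illustrative frame with two numeric columns and one categorical column.
df = pd.DataFrame({
    'income': [58138, 46344, 71613, 26646, 58293],
    'mnt_wines': [635, 11, 426, 11, 173],
    'education': pd.Categorical(['Graduation', 'Graduation', 'Graduation',
                                 'Graduation', 'PhD']),
})

# Only numeric columns enter the correlation matrix.
corr = df.corr(numeric_only=True)

print(corr.round(2))
# The matrix can then be passed to seaborn, for example:
# sns.heatmap(corr, annot=True, vmin=-1, vmax=1)
```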

**We segment people's income by marital status and income category**

In [360]:
from statistics import mode
mark.pivot_table(index=['marital_status', 'cat_income'], values='income', aggfunc=mode)

marital_status,cat_income,income
Married,Alto,80134
Married,Bajo,39922
Married,Medio,67445
Married,muy bajo,7500
Single,Alto,82800
Single,Bajo,35860
Single,Medio,63841
Single,muy bajo,7500
Widow,Alto,82032
Widow,Bajo,42429
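Income is close to a continuous variable, so within a group many values may occur only once and the mode can simply be one of several tied values. As a hedged sketch, the mode can be reported next to the median and the group size to judge how representative it is. The helper `first_mode` and the sample data are hypothetical.

```python
import pandas as pd

# Illustrative data (not the full dataset).
df = pd.DataFrame({
    'marital_status': ['Single', 'Single', 'Single', 'Married', 'Married'],
    'income': [7500, 7500, 35860, 39922, 80134],
})

def first_mode(s):
    # Series.mode() returns all tied modes sorted ascending; take the smallest.
    return s.mode().iloc[0]

# Named aggregation: mode, median, and group size side by side.
summary = df.groupby('marital_status')['income'].agg(
    mode=first_mode, median='median', n='count'
)

print(summary)
```

In the `Married` group of this sample every value occurs once, so the reported mode is just the smallest value, which the median and count make visible.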


**We build a segmentation showing the number of children and teenagers at home by education level and income category**

In [362]:
mark.pivot_table(index=['education','cat_income'], values=['teen_home', 'kid_home'], aggfunc=mode)

education,cat_income,kid_home,teen_home
2n Cycle,Alto,0,0
2n Cycle,Bajo,1,1
2n Cycle,Medio,0,1
2n Cycle,muy bajo,1,0
Basic,muy bajo,1,0
Graduation,Alto,0,0
Graduation,Bajo,1,1
Graduation,Medio,0,1
Graduation,muy bajo,1,0
Master,Alto,0,0


**We segment the number of purchases made by catalog, in store, or on the web by customer antiquity category and complaint status**


In [357]:
mark.pivot_table(index=['cat_antiquity','complain'], values=['num_catalog_purchases','num_web_purchases','num_store_purchases'], aggfunc=mode)

cat_antiquity,complain,num_catalog_purchases,num_store_purchases,num_web_purchases
antiguo,0,0,3,2
antiguo,1,0,3,3
muy antiguo,0,0,3,2
muy antiguo,1,0,3,2
nuevo,0,0,4,1


In [359]:
mark.pivot_table(index=['cat_Age', ], values=['mnt_wines','mnt_fruits','mnt_meat_products','mnt_fish_products','mnt_sweet_products','mnt_gold_prods'], aggfunc=mode)

cat_Age,mnt_fish_products,mnt_fruits,mnt_gold_prods,mnt_meat_products,mnt_sweet_products,mnt_wines
adulto,0,0,3,8,0,2
adulto joven,8,4,4,8,4,5
adulto mayor,0,0,4,3,0,6


## Conclusions

#Income segmentation by marital status and income category
In the high-income category, single people show a higher modal income (82800) than widowed (82032) and married people (80134).
In the low-income category, widowed people show the highest modal income (42429), followed by married people (39922) and single people (35860).
In the medium-income category, the modal income is highest for married people, followed by single and widowed people, respectively.
Finally, among people with very low incomes, widowed people tend to have the highest income.

#Segmentation of the number of children and teenagers at home by education level and income category
People with a secondary-level (`2n Cycle`) education and low income typically have both one child and one teenager at home. Those with medium income have one teenager.
Those with a primary-level (`Basic`) education and very low income have one child at home.
Among people with an undergraduate degree (`Graduation`), those with low income have one child and one teenager at home, those with medium income typically have one teenager, and those with very low income have one child.
For people with a Master or PhD, those with low income have one child and one teenager at home, followed by those with medium income, who have only a teenager, and those with very low income, who have one young child.

#Segmentation of the number of purchases made by catalog, in store, or on the web, by antiquity category and complaint status
Regarding the number of purchases by channel, new, old, and very old customers alike prefer in-store purchases, followed by web purchases, with catalog purchases the least preferred.

#Segmentation of spending by product type, by age category
When customers are grouped by age, older adults show the highest modal spending on wine, followed by young adults and adults. Young adults tend to spend more on sweets, fish, and fruit than adults and older adults.
Both adults and young adults show a greater tendency to buy meat products than older adults.
Finally, for luxury (gold) products, older adults and young adults show a stronger tendency to buy than adults.