Teen market segmentation using scikit-learn
===

**Juan David Velásquez Henao**  
jdvelasq@unal.edu.co   
Universidad Nacional de Colombia, Sede Medellín  
Facultad de Minas  
Medellín, Colombia

---

Click [here](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/08-kmeans-teen-market-segments.ipynb) to access the latest version online.

Click [here](http://nbviewer.jupyter.org/github/jdvelasq/IPython-for-predictive-analytics/blob/master/08-kmeans-teen-market-segments.ipynb) to view the latest version online in `nbviewer`.

---
[License](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/LICENSE)  
[Readme](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/readme.md)

In [1]:
## Read the file. pd.read_csv() returns a DataFrame with
## one row per record and one column per field of the
## original file.
import pandas as pd
teens = pd.read_csv('data/snsdata.csv')
teens.head()

Unnamed: 0,gradyear,gender,age,friends,basketball,football,soccer,softball,volleyball,swimming,...,blonde,mall,shopping,clothes,hollister,abercrombie,die,death,drunk,drugs
0,2006,M,18.982,7,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2006,F,18.801,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,2006,M,18.335,69,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,2006,F,18.875,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2006,,18.995,10,0,0,0,0,0,0,...,0,0,2,0,0,0,0,0,1,1


In [2]:
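## Equivalent of R's table() function
def table(x, y=None):
    if y is None:
        ## One-way table: count of each distinct value in x
        return {a: x.count(a) for a in set(x)}
    ## Two-way table: count of each (x, y) pair
    pairs = list(zip(x, y))
    return {a: {b: pairs.count((a, b)) for b in set(y)} for a in set(x)}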

In [3]:
## Number of records per gender (nan = missing value)
table(teens.gender.tolist())

{'F': 22054, 'M': 5222, nan: 2724}
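The counts above can also be obtained with pandas built-ins. The following is a minimal sketch, assuming a small hypothetical sample in place of `teens.gender` and `teens.gradyear`; `value_counts(dropna=False)` plays the role of the one-argument `table()` and `pd.crosstab` plays the role of a two-way table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns standing in for teens.gender and teens.gradyear
gender = pd.Series(['F', 'M', 'F', np.nan, 'F', np.nan], name='gender')
gradyear = pd.Series([2006, 2006, 2007, 2007, 2007, 2006], name='gradyear')

# dropna=False keeps missing values as their own category
counts = gender.value_counts(dropna=False)

# crosstab drops NaN by default, so missing values are labeled explicitly
cross = pd.crosstab(gradyear, gender.fillna('missing'))

print(counts)
print(cross)
```

Labeling missing values with a placeholder string is a design choice for this sketch only; it keeps them visible in the cross-tabulation.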

In [4]:
## The sample contains ages outside the range
## of the population of interest
teens.age.describe()

count    24914.000000
mean        17.993950
std          7.858054
min          3.086000
25%         16.312000
50%         17.287000
75%         18.259000
max        106.927000
Name: age, dtype: float64

In [5]:
## Number of missing values in the age column
teens.age.isnull().sum()

5086

In [6]:
## Keep only the ages in [13, 20); any other age is set to missing.
## teens$age <- ifelse(teens$age >= 13 & teens$age < 20, teens$age, NA)
import numpy as np
teens.loc[teens.age <  13, 'age'] = np.nan
teens.loc[teens.age >= 20, 'age'] = np.nan

In [7]:
## Check the age variable for the records of the
## people in the population of interest.
teens.age.describe()

count    24477.000000
mean        17.252429
std          1.157465
min         13.027000
25%         16.304000
50%         17.265000
75%         18.220000
max         19.995000
Name: age, dtype: float64
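The R expression `ifelse(age >= 13 & age < 20, age, NA)` quoted in cell 6 has a close pandas counterpart in `Series.where`. This is only a sketch on hypothetical ages, not the notebook's own code; it masks everything outside the interval in a single step.

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for teens.age
age = pd.Series([3.086, 13.027, 17.5, 19.995, 20.0, 106.927, np.nan])

# Keep values in [13, 20); everything else (including NaN) becomes NaN
age_clean = age.where((age >= 13) & (age < 20))

print(age_clean.describe())
```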

In [8]:
## Create new numeric variables from
## categorical information
teens['female']    = [1. if x == 'F' else 0.  for x in teens.gender]
teens['no_gender'] = [1. if pd.isnull(x) else 0.  for x in teens.gender]
# teens$female <- ifelse(teens$gender == "F" & !is.na(teens$gender), 1, 0)
# teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)

In [9]:
## Number of males and females in the sample.
table(teens.gender.tolist())

{'F': 22054, 'M': 5222, nan: 2724}

In [10]:
## Number of females in the sample
## 0 = male or unknown gender, 1 = female
table(teens.female.tolist())

{0.0: 7946, 1.0: 22054}

In [11]:
## Number of records for which the gender
## is known (0) and unknown (1)
table(teens.no_gender.tolist())

{0.0: 27276, 1.0: 2724}
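The indicator columns built in cell 8 with list comprehensions can also be written as vectorized comparisons. A minimal sketch, assuming a hypothetical five-row `gender` column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for teens.gender
gender = pd.Series(['M', 'F', 'M', 'F', np.nan])

# Comparing NaN with 'F' yields False, so missing genders map to 0.0
female = (gender == 'F').astype(float)

# 1.0 where the gender is missing, 0.0 otherwise
no_gender = gender.isnull().astype(float)

print(female.tolist(), no_gender.tolist())
```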

In [12]:
## Mean age, ignoring the missing values
teens.age.mean()

17.25242893328433

In [13]:
## Mean age by graduation year
teens.groupby("gradyear")['age'].mean()

gradyear
2006    18.655858
2007    17.706172
2008    16.767701
2009    15.819573
Name: age, dtype: float64

In [14]:
## Missing values are filled with the mean age
## of the corresponding graduation year
ave_age = teens.groupby("gradyear")['age'].mean()
teens.age = [x if not pd.isnull(x) else ave_age[y] for x, y in zip(teens.age, teens.gradyear)]
teens.age.describe()

count    30000.000000
mean        17.237326
std          1.141821
min         13.027000
25%         16.282000
50%         17.238000
75%         18.212000
max         19.995000
Name: age, dtype: float64
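The group-wise imputation in cell 14 can be sketched without an explicit loop by combining `groupby(...).transform('mean')` with `fillna`. The data frame below is hypothetical and only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing ages
df = pd.DataFrame({
    'gradyear': [2006, 2006, 2006, 2007, 2007],
    'age': [18.0, 19.0, np.nan, 17.0, np.nan],
})

# transform('mean') returns, for every row, the mean age of its gradyear
# (NaN values are skipped when computing the mean)
group_mean = df.groupby('gradyear')['age'].transform('mean')

# Replace only the missing ages with their group mean
df['age'] = df['age'].fillna(group_mean)

print(df)
```

Because `transform` returns a Series aligned with the original rows, no lookup by graduation year is needed.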

In [15]:
## Separate the interest-word counts from the rest of the
## information. Note: iloc[:, 4:] also picks up the female and
## no_gender columns created above (see the output below).
interests = teens.iloc[:,4:].copy()
interests.head()

Unnamed: 0,basketball,football,soccer,softball,volleyball,swimming,cheerleading,baseball,tennis,sports,...,shopping,clothes,hollister,abercrombie,die,death,drunk,drugs,female,no_gender
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,0.0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1.0,0.0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0.0,0.0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1.0,0.0
4,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,1,1,0.0,1.0


In [16]:
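## Rescale the columns so that differences in their ranges
## do not dominate the distance computation. MinMaxScaler maps
## each column to [0, 1]; the R code below uses scale() (z-scores).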
#
# interests <- teens[5:40]
# interests_z <- as.data.frame(lapply(interests, scale))
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
interests_z = scaler.fit_transform(interests)
interests_z

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.06666667, 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.06666667, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.06666667, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.08333333, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ]])
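The R comment in cell 16 calls `scale()`, which standardizes each column to z-scores, while the Python code uses `MinMaxScaler`. The sketch below contrasts the two on a small hypothetical matrix. Note that `StandardScaler` divides by the population standard deviation (ddof=0), whereas R's `scale()` uses the sample standard deviation (n-1), so the values would not match R exactly.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: 4 rows, 2 columns with different ranges
X = np.array([[0.0, 1.0],
              [1.0, 3.0],
              [2.0, 5.0],
              [5.0, 7.0]])

# Min-max scaling: each column is mapped to the interval [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each column gets mean 0 and (population) std 1
X_z = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_z)
```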

In [17]:
## Run k-means to find the centers of 5 groups
from sklearn.cluster import KMeans
teen_clusters = KMeans(n_clusters=5, random_state=0).fit(interests_z)
teen_clusters

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)

In [18]:
## Number of examples in each cluster
table(teen_clusters.predict(interests_z).tolist())

{0: 2724, 1: 4247, 2: 5222, 3: 17148, 4: 659}
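Cluster sizes can also be read directly from the fitted model's `labels_` attribute. The following is a sketch on hypothetical, well-separated synthetic data (two blobs of 20 and 30 points), not on the teen data set.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two tight, well-separated groups
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(30, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# labels_ holds the cluster index assigned to each training row;
# bincount counts how many rows fall in each cluster
sizes = np.bincount(model.labels_)

print(sizes)
```

The cluster indices themselves are arbitrary, so only the set of sizes is meaningful, not which index gets which size.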

In [19]:
## Scaled word frequencies at each cluster center
# teen_clusters$centers
teen_clusters.cluster_centers_

array([[ 7.20447871e-03,  9.83847283e-03,  6.56713983e-03,
         5.29066252e-03,  7.23725614e-03,  3.31580693e-03,
         9.87110458e-03,  3.78579295e-03,  5.33529124e-03,
         6.36319139e-03,  1.40316528e-02,  1.23013113e-03,
         5.91450481e-03,  7.92951542e-03,  2.73918446e-03,
         1.17229564e-02,  2.99247986e-03,  2.20264317e-03,
         8.48935389e-03,  7.60436333e-03,  4.66086730e-03,
         3.89634228e-03,  3.20606951e-03,  1.36830864e-03,
         8.08628011e-03,  8.89215206e-03,  1.82992272e-04,
         1.79882526e-02,  2.30943799e-02,  1.40877386e-02,
         7.05661609e-03,  5.00183554e-03,  7.22533707e-03,
         5.03461296e-03,  7.70925110e-03,  2.06497797e-03,
        -1.52100554e-14,  1.00000000e+00],
       [ 2.00835098e-02,  2.71151886e-02,  1.30775330e-02,
         1.91096905e-02,  2.10706724e-02,  8.87077803e-03,
         8.96521080e-03,  8.55681016e-03,  9.03316867e-03,
         1.92895789e-02,  4.47868475e-02,  4.36017119e-03,
         1.37
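A raw array of centers is hard to read. One possible way to inspect it, sketched here on hypothetical data with two made-up interest columns, is to wrap `cluster_centers_` in a DataFrame that carries the column names and then look for the dominant word in each center.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical scaled interest counts (rows: people, columns: words)
X = pd.DataFrame({
    'soccer':   [0.9, 0.8, 1.0, 0.0, 0.1, 0.0],
    'shopping': [0.0, 0.1, 0.0, 0.9, 1.0, 0.8],
})

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X.values)

# One row per cluster, one named column per word
centers = pd.DataFrame(model.cluster_centers_, columns=X.columns)

# Word with the highest center value in each cluster
top_word = centers.idxmax(axis=1)

print(centers)
print(top_word)
```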

In [20]:
## Assign to each example in the data
## the cluster it belongs to
teens["cluster"] = teen_clusters.predict(interests_z).tolist()

In [21]:
## Clusters assigned to the first five records
## teens[1:5, c("cluster", "gender", "age", "friends")]
teens.loc[0:4, ["cluster", "gender", "age", "friends"]]
# cluster gender    age friends
#       5      M 18.982       7
#       3      F 18.801       0
#       5      M 18.335      69
#       5      F 18.875       0
#       4     NA 18.995      10
# 

Unnamed: 0,cluster,gender,age,friends
0,2,M,18.982,7
1,1,F,18.801,0
2,2,M,18.335,69
3,3,F,18.875,0
4,0,,18.995,10


In [22]:
## Demographic characteristics of the clusters
## aggregate(data = teens, age ~ cluster, mean)
teens.groupby("cluster")['age'].mean()

cluster
0    17.269795
1    17.057756
2    17.432631
3    17.226636
4    16.990936
Name: age, dtype: float64

In [23]:
## aggregate(data = teens, female ~ cluster, mean)
#
# cluster	female
#       1 0.8381171
#       2 0.7250000
#       3 0.8378198
#       4 0.8027079
#       5 0.6994515
#
teens.groupby("cluster")['female'].mean()

cluster
0    0.0
1    1.0
2    0.0
3    1.0
4    1.0
Name: female, dtype: float64

In [24]:
# Mean number of friends per cluster
# aggregate(data = teens, friends ~ cluster, mean)
#
# cluster  friends
#       1 41.43054
#       2 32.57333
#       3 37.16185
#       4 30.50290
#       5 27.70052
#
teens.groupby("cluster")['friends'].mean()

cluster
0    26.139134
1    37.740758
2    24.926465
3    30.169874
4    40.025797
Name: friends, dtype: float64
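The three per-cluster summaries above (age, female, friends) can be gathered into a single profile table. This is a minimal sketch on a hypothetical five-row data frame; the column names mirror those used in the notebook.

```python
import pandas as pd

# Hypothetical records with an assigned cluster
df = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 1],
    'age': [17.0, 18.0, 16.0, 17.0, 18.0],
    'female': [0.0, 1.0, 1.0, 1.0, 1.0],
    'friends': [10, 30, 0, 20, 40],
})

# Mean of each demographic variable per cluster
profile = df.groupby('cluster')[['age', 'female', 'friends']].mean()

# Number of records per cluster, aligned on the cluster index
profile['size'] = df.groupby('cluster').size()

print(profile)
```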
