## Text Mining in Social Media - Measuring tweet length

*This notebook walks step by step through the whole process: extracting users' tweets, loading them into a DataFrame, and training, labelling and evaluating a model that predicts a user's sex and variety of Spanish from 100 of their tweets. The only features used here are statistics of the tweet lengths.*

**1- We build two dictionaries, ids_train and ids_test, that hold the user ids grouped by variety and sex. The function generar_dicc_ids does this.**

To do so, we open the file truth.txt, which stores these fields one user per line, CSV-style, with ':::' as the field separator.
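To make the format concrete, here is a minimal sketch of how one line splits into its three fields and gets filed by variety and sex. The sample line and the helper name `parse_truth_line` are made up for illustration; only the `:::` separator and the field order (id, sex, variety) are taken from generar_dicc_ids.

```python
def parse_truth_line(line):
    """Split one 'id:::sex:::variety' line into its three fields."""
    user_id, sex, variety = line.rstrip().split(':::')
    return user_id, sex, variety

# Hypothetical line, written in the same layout as truth.txt.
sample = 'abc123:::female:::spain\n'

user_id, sex, variety = parse_truth_line(sample)

# File the id under ids[variety][sex], as generar_dicc_ids does.
ids = {'spain': {'male': [], 'female': []}}
ids[variety][sex].append(user_id)

print(ids)
```

The nested layout means that `ids['spain']['female']` later returns every female user of the Spain variety in a single lookup.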

In [2]:
def generar_dicc_ids(particion):
    # particion: 'train' or 'test'
    if particion == 'train':
        ruta = './training/truth.txt'
    else:
        ruta = './test/truth.txt'

    contador = 0
    sexo = {'male': 0, 'female': 0}
    variedades = {'colombia': 0, 'argentina': 0, 'spain': 0, 'venezuela': 0, 'peru': 0, 'chile': 0, 'mexico': 0}

    # ids[variety][sex] -> list of user ids
    ids = {variedad: {'male': [], 'female': []} for variedad in variedades}

    # Each non-empty line has the form id:::sex:::variety.
    # The with-block closes the file once it has been read.
    with open(ruta) as f:
        for line in f:
            line = line.rstrip()
            if len(line) == 0:
                continue
            lista = line.split(':::')

            ids[lista[2]][lista[1]].append(lista[0])

            sexo[lista[1]] += 1
            variedades[lista[2]] += 1
            contador += 1

    # Summary counts: users per sex, users per variety, total users.
    print(sexo)
    print(variedades)
    print(contador)

    return ids

ids_train = generar_dicc_ids('train')
ids_test  = generar_dicc_ids('test')

{'male': 1400, 'female': 1400}
{'colombia': 400, 'argentina': 400, 'spain': 400, 'venezuela': 400, 'peru': 400, 'chile': 400, 'mexico': 400}
2800
{'male': 700, 'female': 700}
{'colombia': 200, 'argentina': 200, 'spain': 200, 'venezuela': 200, 'peru': 200, 'chile': 200, 'mexico': 200}
1400


**Example of the generated dictionaries**

In [3]:
ids_train['spain']['male'][0:10]

['ef3588c9462713023145ae3c12c85614',
 '17e06a4ef15eaa851242465edc5328bd',
 'cfa38327f7699d48daaaaf4278a1354a',
 'e85fd98dfc6743bce7b274fbcac69f24',
 '826e3b4e72bfb6f9cdfc1a6995be10e5',
 'bf149d41c5e685054a57fb38d964735d',
 '79e9e99239fe662190978d62ebc4c24d',
 '2d56280a969d9dbcc4b32b21bc0a9b02',
 'b56c7a73e3bb2eb5436a3642b1fb70ee',
 '17786b7a4a1a31775af8ae786b4e4711']

**2- We define the function leer_tuits_longitud:**

**leer_tuits_longitud:** takes the user's id and the partition it belongs to ('train' or 'test'). It returns a list with the lengths of the user's 100 tweets.


In [4]:
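def leer_tuits_longitud(user_id, particion):
    # Build the path to the user's XML file in the chosen partition.
    if particion == 'train':
        s = './training/' + user_id + '.xml'
    else:
        s = './test/' + user_id + '.xml'
    longitudes = []
    with open(s) as f:
        for line in f:
            line = line.rstrip()
            if line.find('<document><![CDATA[') != -1:
                # Drop the first 21 characters (the 19-character
                # '<document><![CDATA[' prefix plus, presumably, two
                # characters of indentation) and the 14-character
                # ']]></document>' suffix, leaving only the tweet text.
                line = line[21:-14]
                longitudes.append(len(line))
    return longitudes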
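
The fixed offsets in `line[21:-14]` are easier to follow on a concrete line. The sketch below uses a made-up line and assumes two characters of indentation before the tag, which is what the offset 21 implies; the real files may differ. The second variant, which strips whitespace and measures the tags instead of hard-coding the offsets, is an alternative and not what the notebook uses.

```python
PREFIX = '<document><![CDATA['   # 19 characters
SUFFIX = ']]></document>'        # 14 characters

# Hypothetical line: two tab characters of indentation are an assumption.
line = '\t\t' + PREFIX + 'Hola mundo' + SUFFIX

# Same slice as in leer_tuits_longitud.
tweet = line[21:-14]
print(tweet, len(tweet))

# Indentation-independent alternative: strip first, then cut by tag length.
stripped = line.strip()
tweet_robust = stripped[len(PREFIX):-len(SUFFIX)]
print(tweet_robust == tweet)
```

With a different amount of indentation the fixed slice would clip or pad the tweet, while the second variant would still return the same text.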


**3- We define the function generar_dataframe, which returns a pandas DataFrame with the columns tuits, sexo and variedad. Each row corresponds to one user, and its tuits column holds the list of the lengths of that user's tweets.**

In [5]:
import pandas as pd

def generar_dataframe(ids, particion):
    # ids: ids_train or ids_test

    filas = []  # one row per user: [tweet lengths, sex, variety]
    variedades_lista = ['argentina', 'chile', 'colombia', 'mexico', 'peru', 'spain', 'venezuela']
    sexo_lista = ['female', 'male']

    for variedad in variedades_lista:
        v = 0  # number of users of this variety
        print(variedad)

        for sexo in sexo_lista:
            s = 0  # number of users of this sex within the variety

            for user_id in ids[variedad][sexo]:
                t = leer_tuits_longitud(user_id, particion)
                filas.append([t, sexo, variedad])
                v += 1
                s += 1

            print("\t", sexo, s)

        print(v)

    # Build the DataFrame once; adding rows one by one with df.loc is much slower.
    return pd.DataFrame(filas, columns=['tuits', 'sexo', 'variedad'])


We build the train and test DataFrames.

In [6]:
train = generar_dataframe(ids_train, 'train')

argentina
	 female 200
	 male 200
400
chile
	 female 200
	 male 200
400
colombia
	 female 200
	 male 200
400
mexico
	 female 200
	 male 200
400
peru
	 female 200
	 male 200
400
spain
	 female 200
	 male 200
400
venezuela
	 female 200
	 male 200
400


In [7]:
test  = generar_dataframe(ids_test, 'test')

argentina
	 female 100
	 male 100
200
chile
	 female 100
	 male 100
200
colombia
	 female 100
	 male 100
200
mexico
	 female 100
	 male 100
200
peru
	 female 100
	 male 100
200
spain
	 female 100
	 male 100
200
venezuela
	 female 100
	 male 100
200


Example of the train DataFrame

In [8]:
train.sample(n=10)

| | tuits | sexo | variedad |
|---|---|---|---|
| 1507 | [140, 140, 134, 94, 139, 116, 140, 140, 140, 6... | male | mexico |
| 191 | [47, 35, 127, 64, 43, 31, 47, 96, 66, 86, 39, ... | female | argentina |
| 286 | [29, 40, 129, 55, 37, 126, 76, 69, 25, 39, 92,... | male | argentina |
| 767 | [123, 43, 38, 22, 70, 51, 139, 77, 60, 51, 70,... | male | chile |
| 1183 | [77, 137, 41, 33, 31, 36, 134, 56, 30, 55, 30,... | male | colombia |
| 1590 | [130, 136, 97, 89, 42, 111, 41, 56, 33, 140, 1... | male | mexico |
| 1345 | [65, 63, 107, 90, 139, 53, 107, 62, 67, 61, 12... | female | mexico |
| 1131 | [42, 59, 127, 138, 68, 131, 79, 38, 74, 16, 10... | male | colombia |
| 294 | [137, 114, 77, 124, 104, 126, 92, 122, 131, 11... | male | argentina |
| 1746 | [55, 68, 89, 133, 92, 119, 140, 89, 137, 112, ... | female | peru |


**4- We add four new columns to the DataFrame: mean, median, std and skewness, which hold the mean, median, standard deviation and skewness of each user's tweet lengths.**

In [9]:
import statistics as stats
from scipy import stats as sc

# Compute the per-user statistics for every row of both DataFrames,
# without hard-coding the number of users in each partition.
for df in (train, test):
    df['mean']     = df['tuits'].apply(stats.mean)
    df['median']   = df['tuits'].apply(stats.median)
    df['std']      = df['tuits'].apply(stats.stdev)  # sample standard deviation (n - 1)
    df['skewness'] = df['tuits'].apply(sc.skew)      # SciPy default: biased estimator
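
As a small worked example of these four statistics, the sketch below computes them for a made-up list of tweet lengths using only the standard library. The `skewness` helper is written here for illustration; it implements the biased coefficient g1 = m3 / m2^(3/2) built from central moments, which is what `scipy.stats.skew` returns with its default `bias=True`.

```python
import statistics as stats

def skewness(values):
    """Biased sample skewness g1 = m3 / m2**1.5 (central moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

lengths = [20, 30, 40, 50, 140]  # made-up tweet lengths

print(stats.mean(lengths))    # 56
print(stats.median(lengths))  # 40
print(stats.stdev(lengths))   # sample standard deviation (divides by n - 1)
print(skewness(lengths))      # positive: one long tweet stretches the right tail
```

A mean above the median together with a positive skewness describes a user who mostly writes short tweets with a few long ones.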

Example of the DataFrame with the new columns

In [10]:
train.head(10)

| | tuits | sexo | variedad | mean | median | std | skewness |
|---|---|---|---|---|---|---|---|
| 0 | [91, 32, 87, 94, 47, 56, 68, 76, 127, 40, 82, ... | female | argentina | 85.16 | 87.5 | 32.143843 | 0.106207 |
| 1 | [62, 40, 39, 66, 55, 41, 72, 59, 76, 103, 76, ... | female | argentina | 63.47 | 55.5 | 23.780331 | 0.709803 |
| 2 | [55, 61, 22, 28, 28, 71, 25, 24, 32, 70, 33, 6... | female | argentina | 38.1 | 32.0 | 23.456601 | 1.5564 |
| 3 | [58, 79, 75, 45, 18, 139, 54, 46, 73, 68, 90, ... | female | argentina | 60.56 | 56.0 | 29.102161 | 0.717457 |
| 4 | [39, 25, 78, 36, 54, 123, 114, 70, 85, 55, 89,... | female | argentina | 74.02 | 68.0 | 35.923876 | 0.385561 |
| 5 | [119, 66, 129, 133, 77, 84, 107, 52, 75, 83, 1... | female | argentina | 76.75 | 72.5 | 37.675482 | 0.221802 |
| 6 | [67, 126, 55, 43, 46, 54, 33, 24, 60, 63, 31, ... | female | argentina | 60.24 | 57.0 | 29.608257 | 0.842735 |
| 7 | [138, 101, 45, 62, 74, 84, 133, 110, 134, 140,... | female | argentina | 73.99 | 73.0 | 50.456201 | -0.037711 |
| 8 | [137, 67, 68, 76, 91, 107, 92, 35, 69, 133, 4,... | female | argentina | 75.97 | 69.0 | 36.419957 | 0.283163 |
| 9 | [131, 82, 62, 19, 52, 99, 27, 105, 61, 73, 80,... | female | argentina | 84.67 | 81.0 | 34.43162 | 0.111638 |


**5- We build the input matrices X for the classifiers from the four columns created above: mean, median, std and skewness.**

In [11]:
x_train = train.loc[:,['mean','median','std','skewness']]
x_test  = test.loc[:,['mean','median','std','skewness']]

x_train.head(10)


| | mean | median | std | skewness |
|---|---|---|---|---|
| 0 | 85.16 | 87.5 | 32.143843 | 0.106207 |
| 1 | 63.47 | 55.5 | 23.780331 | 0.709803 |
| 2 | 38.1 | 32.0 | 23.456601 | 1.5564 |
| 3 | 60.56 | 56.0 | 29.102161 | 0.717457 |
| 4 | 74.02 | 68.0 | 35.923876 | 0.385561 |
| 5 | 76.75 | 72.5 | 37.675482 | 0.221802 |
| 6 | 60.24 | 57.0 | 29.608257 | 0.842735 |
| 7 | 73.99 | 73.0 | 50.456201 | -0.037711 |
| 8 | 75.97 | 69.0 | 36.419957 | 0.283163 |
| 9 | 84.67 | 81.0 | 34.43162 | 0.111638 |


**6- SEX**

**Classifier**

In [12]:
from sklearn.ensemble import RandomForestClassifier

# Random forest with 500 trees, trained to predict the sexo column.
clf_gender = RandomForestClassifier(n_estimators=500).fit(x_train, train['sexo'])

**Prediction**

In [13]:
predicted_gender = clf_gender.predict(x_test)

Example of the predictions

In [14]:
predicted_gender[0:10]

array(['male', 'male', 'female', 'male', 'female', 'male', 'female',
       'male', 'female', 'female'], dtype=object)

**Evaluation**

In [15]:
from sklearn import metrics

print(metrics.classification_report(test.sexo, predicted_gender))

             precision    recall  f1-score   support

     female       0.54      0.53      0.54       700
       male       0.54      0.55      0.55       700

avg / total       0.54      0.54      0.54      1400



**7- VARIETY**

**Classifier**

In [16]:
from sklearn.ensemble import RandomForestClassifier

# Random forest with 500 trees, trained to predict the variedad column.
clf_variety = RandomForestClassifier(n_estimators=500).fit(x_train, train['variedad'])

**Prediction**

In [17]:
predicted_variety = clf_variety.predict(x_test)

Example of the predictions

In [18]:
predicted_variety[0:10]

array(['argentina', 'colombia', 'venezuela', 'chile', 'argentina',
       'chile', 'peru', 'colombia', 'spain', 'argentina'], dtype=object)

**Evaluation**

In [19]:
from sklearn import metrics

print(metrics.classification_report(test.variedad, predicted_variety))

             precision    recall  f1-score   support

  argentina       0.21      0.23      0.22       200
      chile       0.17      0.19      0.18       200
   colombia       0.16      0.14      0.15       200
     mexico       0.22      0.20      0.21       200
       peru       0.15      0.12      0.14       200
      spain       0.17      0.16      0.16       200
  venezuela       0.25      0.28      0.26       200

avg / total       0.19      0.19      0.19      1400
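
To put both reports in context, it can help to compare them with uniform random guessing. The sketch below is illustrative only: the helper `chance_accuracy` is made up, the class counts are copied from the support columns, and the reported figures are the recall values in the 'avg / total' rows, which are support-weighted averages and therefore equal the overall accuracy.

```python
def chance_accuracy(support):
    """Expected accuracy of guessing uniformly at random among the classes."""
    return 1 / len(support)

# Class counts taken from the support columns of the two reports.
gender_support = {'female': 700, 'male': 700}
variety_support = {name: 200 for name in
                   ['argentina', 'chile', 'colombia', 'mexico',
                    'peru', 'spain', 'venezuela']}

# Weighted-average recall from the two reports (equal to accuracy).
reported = {'gender': 0.54, 'variety': 0.19}

for task, support in [('gender', gender_support), ('variety', variety_support)]:
    baseline = chance_accuracy(support)
    print(f'{task}: reported {reported[task]:.2f}, chance {baseline:.3f}')
```

Both scores sit only a few points above chance (0.50 for two classes, about 0.143 for seven), which suggests that tweet-length statistics on their own carry little signal for either task.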

