# 02 *Hashtag* co-occurrence matrix
## Goal: Turn users' *tweets* into a network of *hashtags* based on how they were used together, apply some filters, then run OSLOM to obtain conversation topics
Starting from the *hashtags* used, we build the *hashtag* co-occurrence network. We first filter out the pairs that were used together fewer than three times. We then remove the *hashtags* whose distribution of uses across the communities of politicians' followers most closely matches the size distribution of those communities, that is, the ones with the lowest Kullback-Leibler divergence.

Finally, we export the *hashtag* co-occurrence graph as an undirected, weighted network, where each weight is the number of *tweets* in which the two *hashtags* were used together.

After running the OSLOM community detection algorithm, we export the network so it can be visualized with Gephi and LaNet-vi.

# Parameters for building the *hashtag* network
- USAR_RETWEETS: use hashtags from all tweets, or only from tweets that are not retweets
- UMBRAL: minimum number of tweets in which two hashtags must co-occur to be included in the topic search
- FILTRAR_ENTROPIA: apply the entropy filter, which removes the hashtags used most evenly (relative to group size) by the four groups of politicians' followers

In [1]:
import os
import re
import pandas as pd
import numpy as np
import scipy.sparse as sp
from scipy.sparse import csr_matrix
from scipy.sparse import coo_matrix
import itertools
import scipy.stats

In [2]:
USAR_RETWEETS = False
UMBRAL = 3
FILTRAR_ENTROPIA = True
REESCRIBIR_ARCHIVOS = True

GEPHI_UMBRAL = 20 # Hashtag co-occurrence threshold for the network we visualize in Gephi

### Start exporting the hashtag co-occurrence matrix

In [3]:
if (USAR_RETWEETS):
    print("TODOS")
    tweet_hashtags_df = pd.read_csv('csv_files/tweet_hashtags.csv', keep_default_na = False, na_values=["na"])
else:
    print("SIN RETWEETS")
    tweet_hashtags_df = pd.read_csv('csv_files/tweet_hashtags_sin_retweets.csv', keep_default_na = False, na_values=["na"])
tweet_hashtags_df.head()

SIN RETWEETS


Unnamed: 0,tweet_id,hashtag
0,56491df463d0c4504f501809,1MillonDeViviendasComoSea
1,5649901a63d0c42c0c4cebbb,TuitUtil
2,56e28bd4ed464b302f1e55a0,MisTardesExtrañanTuSonrisaGG
3,56e0dfe1ed464b072c3a9d6f,FestivalesEnTVP
4,563fb22e63d0c46d6cad4d77,MichaelJackson


In [4]:
# Look for hashtags that co-occur in the same tweet: group by tweet and count its hashtags
grouped = tweet_hashtags_df.groupby("tweet_id").count().reset_index()
grouped.columns = ["tweet_id", "cant"]

In [5]:
# Drop tweets with a single hashtag (it does not co-occur with any other)
grouped = grouped[grouped.cant != 1]

In [6]:
# Merge back to recover the (tweet_id, hashtag) rows of the remaining tweets
tweet_hashtags_df = tweet_hashtags_df.merge(grouped[["tweet_id"]])
tweet_hashtags_df.head()

Unnamed: 0,tweet_id,hashtag
0,563fb22e63d0c46d6cad4d77,MichaelJackson
1,563fb22e63d0c46d6cad4d77,July2002
2,564809d563d0c42b9283e9a6,AllTheNight
3,564809d563d0c42b9283e9a6,HoyPintaPara
4,56e485b2ed464b6baf2d13a5,EnVIVO


# UMBRAL (threshold) filter

In [7]:
hashtags = pd.read_csv('csv_files/hashtags.csv', keep_default_na = False, na_values=['_'])
inverse_hashtags = dict()
for index, row in hashtags.iterrows():
    inverse_hashtags[str(row["hashtag"])] = row["id"]

tweets = tweet_hashtags_df[["tweet_id"]].drop_duplicates().reset_index()
inverse_tweets = dict()
for index, row in tweets.iterrows():
    inverse_tweets[str(row["tweet_id"])] = row["index"]
tweet_hashtags_df.dropna(inplace=True)
tweet_hashtags_df.head()

Unnamed: 0,tweet_id,hashtag
0,563fb22e63d0c46d6cad4d77,MichaelJackson
1,563fb22e63d0c46d6cad4d77,July2002
2,564809d563d0c42b9283e9a6,AllTheNight
3,564809d563d0c42b9283e9a6,HoyPintaPara
4,56e485b2ed464b6baf2d13a5,EnVIVO


In [8]:
tweet_hashtags_df["hashtag"] = tweet_hashtags_df.hashtag.apply(lambda x : x if x != "NA" else "na")
tweet_hashtags_df.shape

(2605421, 2)

In [9]:
tweet_indexes = [inverse_tweets[t] for t in tweet_hashtags_df.tweet_id.values]
hashtag_indexes = [inverse_hashtags[h] for h in tweet_hashtags_df.hashtag.values]
values = [1 for i in range(len(tweet_indexes))]
mat = coo_matrix((values, (tweet_indexes, hashtag_indexes)),shape=(max(tweet_indexes)+1, len(hashtags.id.values)))
mat = mat.tocsr()
mat

<2605424x1272496 sparse matrix of type '<class 'numpy.int64'>'
	with 2605421 stored elements in Compressed Sparse Row format>

In [10]:
res = mat.T.dot(mat)
res = res.tocsr()
res

<1272496x1272496 sparse matrix of type '<class 'numpy.int64'>'
	with 3716101 stored elements in Compressed Sparse Row format>
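
The product `mat.T.dot(mat)` above turns the tweet x hashtag incidence matrix into a hashtag x hashtag co-occurrence matrix: entry (i, j) counts the tweets that contain both hashtags, and the diagonal counts the uses of each hashtag. A minimal sketch on made-up data (the `toy_*` names and values are hypothetical, not taken from the dataset):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy data (hypothetical): 3 tweets and 3 hashtags with ids 0, 1, 2.
# Tweet 0 uses hashtags 0 and 1; tweet 1 uses 0 and 1; tweet 2 uses 1 and 2.
toy_tweet_idx = [0, 0, 1, 1, 2, 2]
toy_hashtag_idx = [0, 1, 0, 1, 1, 2]
toy_values = np.ones(len(toy_tweet_idx), dtype=np.int64)

# Rows are tweets, columns are hashtags.
toy_mat = coo_matrix((toy_values, (toy_tweet_idx, toy_hashtag_idx)), shape=(3, 3)).tocsr()

# (hashtags x tweets) . (tweets x hashtags) = hashtags x hashtags
toy_res = toy_mat.T.dot(toy_mat)
toy_dense = toy_res.toarray()
print(toy_dense)
```

Hashtags 0 and 1 share two tweets, so the off-diagonal entry between them is 2. Note that converting a COO matrix to CSR sums duplicate entries, so a hashtag repeated within one tweet would be counted more than once.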

In [11]:
pairs = [(x,y,w) for (x,y,w) in zip(res.nonzero()[0], res.nonzero()[1], res[res.nonzero()].A1) if x != y and w >= UMBRAL]

In [12]:
keys = {}
for x,y,w in pairs:
    keys[x] = x
    keys[y] = y
filtro_hashtags = list(keys.keys())
len(filtro_hashtags)

46185
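
The threshold step above keeps the off-diagonal pairs whose weight is at least `UMBRAL` and then collects the hashtags that appear in them. As an alternative sketch, the same selection can be done with array operations on the strict upper triangle, which avoids the Python-level loop. The matrix and the `UMBRAL_TOY` name are hypothetical. Unlike the list comprehension above, this version keeps each undirected pair only once.

```python
import numpy as np
import scipy.sparse as sp

UMBRAL_TOY = 3  # hypothetical threshold, playing the role of UMBRAL

# Toy symmetric co-occurrence matrix (diagonal = uses of each hashtag).
toy_cooc = sp.csr_matrix(np.array([
    [5, 3, 1, 0, 0],
    [3, 4, 0, 0, 0],
    [1, 0, 6, 4, 2],
    [0, 0, 4, 6, 0],
    [0, 0, 2, 0, 2],
]))

# Strict upper triangle: drops the diagonal and one of each (x, y) / (y, x) pair.
upper = sp.triu(toy_cooc, k=1).tocoo()
mask = upper.data >= UMBRAL_TOY

toy_pairs = sorted(zip(upper.row[mask].tolist(),
                       upper.col[mask].tolist(),
                       upper.data[mask].tolist()))
toy_filtro = sorted(set(upper.row[mask].tolist()) | set(upper.col[mask].tolist()))
print(toy_pairs)   # pairs that reach the threshold
print(toy_filtro)  # hashtags that take part in at least one such pair
```

Hashtag 4 only reaches a weight of 2 with any neighbor, so it is left out of `toy_filtro`.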

In [13]:
len(np.unique(np.concatenate((res.nonzero()))))

463377

In [14]:
tweet_hashtags_indexed_df = tweet_hashtags_df.merge(hashtags)
tweet_hashtags_indexed_df.shape

(2605421, 3)

In [15]:
tweet_hashtags_indexed_df = tweet_hashtags_indexed_df[tweet_hashtags_indexed_df["id"].isin(filtro_hashtags)]
tweet_hashtags_indexed_df.shape

(1918986, 3)

After this filter, 1.9 million of the original 2.6 million *tweet*-*hashtag* rows remain

# Entropy filter

In [16]:
user_hashtags = pd.read_csv('csv_files/user_hashtags.csv')
user_hashtags.head()

Unnamed: 0,user_id,timestamp,hashtag
0,186068,2015-11-05 17:48:32,PasanteDeClarín
1,186068,2015-11-03 21:41:31,QueVuelvaElFav
2,186068,2015-10-25 15:03:35,NingunaSanta
3,186068,2015-10-24 05:49:57,Viernes
4,186068,2015-10-20 21:50:24,Velez


In [17]:
user_hashtags.shape

(15326649, 3)

In [18]:
user_hashtags_filtered = user_hashtags.merge(tweet_hashtags_indexed_df[["hashtag"]].drop_duplicates())
user_hashtags_filtered.shape

(10458830, 3)

Of the 15 million *hashtag* uses by users, we keep only those of the roughly 46 thousand *hashtags* that were used together with another one at least three times (about 10.5 million rows)

In [19]:
def extract_only_followers(follower_network, politician_id, except_politicians):
    """Return the users in follower_network who follow politician_id and do NOT follow
    any of the users listed in except_politicians."""
    politician_followers = follower_network[follower_network["followed_id"] == politician_id].groupby("follower_id").count().reset_index().follower_id.values
    # A set gives constant-time membership tests; scanning the array for every follower is quadratic
    other_fols = set(follower_network[follower_network["followed_id"].isin(except_politicians)].follower_id.values)
    filtered = [only for only in politician_followers if only not in other_fols]
    return filtered

In [20]:
macri_id = 137027
scioli_id = 188326
massa_id = 12218
stolbizer_id = 224325

In [21]:
follower_network = pd.read_csv('csv_files/followers.csv')

In [22]:
macri_followers = extract_only_followers(follower_network, macri_id, [scioli_id, massa_id, stolbizer_id])
scioli_followers = extract_only_followers(follower_network, scioli_id, [macri_id, massa_id, stolbizer_id])
massa_followers = extract_only_followers(follower_network, massa_id, [macri_id, scioli_id, stolbizer_id])
stolbizer_followers = extract_only_followers(follower_network, stolbizer_id, [macri_id, scioli_id, massa_id])
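
The four calls above build disjoint groups: each user ends up in at most one group because anyone who follows two or more of the politicians is discarded. A minimal sketch of the same idea on a made-up follower table (the ids and the `exclusive_followers` helper are hypothetical):

```python
import pandas as pd

# Toy follower network (hypothetical ids): the politicians are users 1 and 2.
toy_network = pd.DataFrame({
    "follower_id": [10, 10, 11, 12, 13],
    "followed_id": [1, 2, 1, 2, 1],
})

def exclusive_followers(network, politician_id, other_politicians):
    """Followers of politician_id who follow none of other_politicians."""
    followers = set(network.loc[network["followed_id"] == politician_id, "follower_id"])
    others = set(network.loc[network["followed_id"].isin(other_politicians), "follower_id"])
    return sorted(followers - others)

only_1 = exclusive_followers(toy_network, 1, [2])
only_2 = exclusive_followers(toy_network, 2, [1])
print(only_1, only_2)
```

User 10 follows both politicians and therefore appears in neither group.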

In [23]:
pol_folls = np.concatenate((stolbizer_followers, scioli_followers, massa_followers, macri_followers))
only_politician_user_hashtags_filtered = user_hashtags_filtered[user_hashtags_filtered["user_id"].isin(pol_folls)]
all_hashtags = user_hashtags_filtered[["hashtag"]].drop_duplicates().reset_index()[["hashtag"]]
only_pol_hashtags = only_politician_user_hashtags_filtered[["hashtag"]].drop_duplicates().reset_index()[["hashtag"]]
all_hash_set = set(all_hashtags.hashtag.values.tolist())
only_pol_set = set(only_pol_hashtags.hashtag.values.tolist())
excluded_hashtags = all_hash_set.difference(only_pol_set)
excluded_df = pd.DataFrame(data=[h2 for h2 in excluded_hashtags], columns=["hashtag"])

In [24]:
excluded_df.shape

(10097, 1)

In [25]:
macri_df = pd.DataFrame(data=[i for i in zip(macri_followers, np.repeat(0, len(macri_followers)))], columns=["user_id", "user_cluster"])
scioli_df = pd.DataFrame(data=[i for i in zip(scioli_followers, np.repeat(1, len(scioli_followers)))], columns=["user_id", "user_cluster"])
massa_df = pd.DataFrame(data=[i for i in zip(massa_followers, np.repeat(2, len(massa_followers)))], columns=["user_id", "user_cluster"])
stolbizer_df = pd.DataFrame(data=[i for i in zip(stolbizer_followers, np.repeat(3, len(stolbizer_followers)))], columns=["user_id", "user_cluster"])
user_communities_df = pd.concat((macri_df,scioli_df,massa_df,stolbizer_df))

In [26]:
user_hashtags_filtered = user_hashtags_filtered.merge(user_communities_df)
only_politician_user_hashtags_filtered = only_politician_user_hashtags_filtered.merge(user_communities_df)

In [27]:
def get_qk():
    """Return the size of each follower community, ordered by user_cluster.
    Raw counts are returned; scipy.stats.entropy normalizes qk so that it sums to 1."""
    clusters_size = user_communities_df.groupby("user_cluster").count().reset_index()
    qk = clusters_size.sort_values("user_cluster").user_id.values
    return qk
get_qk()

array([38211, 32087,  9089,  6361])

In [28]:
remaining_hashtags = user_hashtags_filtered[["hashtag"]].drop_duplicates()
remaining_hashtags.head(2)

Unnamed: 0,hashtag
0,QueVuelvaElFav
1,PrefieroAMacriPorque


In [29]:
cluster = 4
def get_hashtag_topic_combination():
    """Return every possible combination of the four user communities and the hashtags"""
    list1= range(cluster)
    list2= remaining_hashtags.hashtag.values
    d = []
    for i in list1:
        for j in list2:
            d.append([i,j,0])
    todos = pd.DataFrame(data=d)
    todos.columns=["user_cluster", "hashtag", "usos"]
    return todos

def agg_entropy(pk):
    """Auxiliary aggregation function that runs the entropy calculation on the DataFrame"""
    qk = get_qk()
    return scipy.stats.entropy(pk, qk)

def get_entropy():
    """Compute the Kullback-Leibler divergence of each hashtag's distribution of uses
    with respect to the sizes of the user communities"""
    c = user_hashtags_filtered.groupby(["hashtag","user_cluster"]).count().reset_index()
    todos = get_hashtag_topic_combination()    
    todos_set = set ([((i,j)) for i,j in zip(todos.user_cluster.values, todos.hashtag.values)])
    c_merged = c.merge(todos, how='left')
    existing_s = set ([((i,j)) for i,j in zip(c_merged.user_cluster.values, c_merged.hashtag.values)])
    diff = todos_set.difference(existing_s)    
    missing = pd.DataFrame(data=[(i,j,0) for i,j in diff], columns=["user_cluster", "hashtag", "user_id"])
    filled = pd.concat((c,missing))
    g = filled.groupby("hashtag").sum().reset_index()[["hashtag", "user_id"]]
    g.columns = ["hashtag", "total"]
    t = filled.merge(g)
    t.sort_values(["hashtag", "user_cluster"], inplace=True)
    t["proba"] = t.user_id / t.total
    ans_pk = t.groupby("hashtag").agg({'proba': scipy.stats.entropy}).reset_index()
    ans_pk_qk = t.groupby("hashtag").agg({'proba': agg_entropy}).reset_index()
    ans_pk_qk.columns=["hashtag", "proba_kull_lei"]
    ans = ans_pk.merge(ans_pk_qk)
    return ans

In [30]:
ans = get_entropy()
ans.head()

Unnamed: 0,hashtag,proba,proba_kull_lei
0,007SPECTRE,0.865993,0.102941
1,04Mar,0.562335,0.604969
2,05Nov,0.0,0.808289
3,06Mar,0.0,0.808289
4,06Nov,0.0,0.808289
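
In the table above, `proba` is the Shannon entropy of a hashtag's uses across the four communities and `proba_kull_lei` is its Kullback-Leibler divergence from the community sizes. The sketch below reproduces both quantities for two hypothetical hashtags, using the sizes printed by `get_qk()`. It relies on `scipy.stats.entropy` normalizing both arguments, which is why raw counts can be passed as `qk`.

```python
import numpy as np
from scipy.stats import entropy

# Community sizes printed by get_qk() above, ordered by user_cluster.
qk_counts = np.array([38211, 32087, 9089, 6361])

# Hypothetical hashtag used only by community 0.
kl_single = entropy([1, 0, 0, 0], qk_counts)

# Hypothetical hashtag whose uses mirror the community sizes exactly.
kl_proportional = entropy(qk_counts / qk_counts.sum(), qk_counts)

# With a single argument, entropy() returns the Shannon entropy (natural log).
h_single = entropy([1, 0, 0, 0])

print(kl_single, kl_proportional, h_single)
```

A hashtag used by a single community has zero entropy, and its divergence reduces to minus the logarithm of that community's share. For community 0 this is about 0.8083, which is consistent with rows such as `05Nov` in the table. A hashtag used in proportion to the community sizes has a divergence of (numerically) zero, and those are the ones the filter targets.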


In [31]:
perc_to_delete = 5
quantil = np.percentile(ans.proba_kull_lei.values, perc_to_delete)
quantil

0.031136842448633035

In [32]:
# The more closely a hashtag's usage matches the sizes of the user communities,
# the lower its divergence. These are the hashtags we want to remove
filtered_hashtags_entropy = ans[ans["proba_kull_lei"] <= quantil].sort_values("proba_kull_lei")
filtered_hashtags_entropy.head()

Unnamed: 0,hashtag,proba,proba_kull_lei
3072,Bentancur,1.157284,1.9e-05
9554,FelizDiaDeLaMusica,1.162503,8.8e-05
15950,MeDaMiedo,1.168219,0.000228
5829,ContraSanValentin,1.159907,0.000235
5906,CopaSudamericana,1.147672,0.000322


In [33]:
filtered_hashtags_entropy.shape

(1806, 3)

In [34]:
keep_hashtags = ans[ans["proba_kull_lei"] > quantil][["hashtag"]].reset_index()[["hashtag"]]
keep_hashtags.shape

(34282, 1)
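
The cut above is a percentile split: hashtags at or below the chosen percentile of divergence are dropped and the rest are kept. A small sketch with made-up divergences (the `toy_*` names and the 20th percentile are hypothetical; the notebook uses the 5th):

```python
import numpy as np
import pandas as pd

# Hypothetical divergences for ten hashtags.
toy_ans = pd.DataFrame({
    "hashtag": [f"h{i}" for i in range(10)],
    "proba_kull_lei": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
})

# np.percentile interpolates linearly between neighboring values by default.
toy_cut = np.percentile(toy_ans["proba_kull_lei"].values, 20)

toy_dropped = toy_ans[toy_ans["proba_kull_lei"] <= toy_cut]
toy_remaining = toy_ans[toy_ans["proba_kull_lei"] > toy_cut]
print(toy_cut, len(toy_dropped), len(toy_remaining))
```

The two masks are complementary, so every hashtag lands in exactly one of the two groups.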

In [35]:
excluded_df.head()

Unnamed: 0,hashtag
0,NeuquénConScioli
1,ChávezInvicto
2,Vespertino
3,CostaDíaz
4,PAris


In [36]:
excluded_df.shape

(10097, 1)

In [37]:
# Check that the hashtags never used by a follower of exactly one politician (excluded_df)
# do not overlap with the hashtags kept by the divergence filter (keep_hashtags)
excluded_df.merge(keep_hashtags)

Unnamed: 0,hashtag


In [38]:
filter_hashtags = pd.concat((excluded_df,keep_hashtags))
filter_hashtags.shape

(44379, 1)

In [39]:
tweet_hashtags_df.head()

Unnamed: 0,tweet_id,hashtag
0,563fb22e63d0c46d6cad4d77,MichaelJackson
1,563fb22e63d0c46d6cad4d77,July2002
2,564809d563d0c42b9283e9a6,AllTheNight
3,564809d563d0c42b9283e9a6,HoyPintaPara
4,56e485b2ed464b6baf2d13a5,EnVIVO


In [40]:
tweet_hashtags_df.shape

(2605421, 2)

In [41]:
# Keep only the hashtags that passed the filter, so the hashtag network is rebuilt without the
# uninformative ones before the topics (communities) are detected in the later step
if (FILTRAR_ENTROPIA):
    tweet_hashtags_df = tweet_hashtags_df.merge(filter_hashtags)
tweet_hashtags_df.shape

(1541151, 2)

Of the 2.6 million *tweet*-*hashtag* rows, 1.5 million remain after applying the entropy filter

In [42]:
hashtags = pd.read_csv('csv_files/hashtags.csv', keep_default_na = False, na_values=['_'])
inverse_hashtags = dict()
for index, row in hashtags.iterrows():
    inverse_hashtags[str(row["hashtag"])] = row["id"]

tweets = tweet_hashtags_df[["tweet_id"]].drop_duplicates().reset_index()
inverse_tweets = dict()
for index, row in tweets.iterrows():
    inverse_tweets[str(row["tweet_id"])] = row["index"]

In [43]:
tweet_hashtags_df.head()

Unnamed: 0,tweet_id,hashtag
0,563fb22e63d0c46d6cad4d77,MichaelJackson
1,563fb22e63d0c46d6cad4d6a,MichaelJackson
2,563fb22e63d0c46d6cad4db9,MichaelJackson
3,563fb22e63d0c46d6cad4dfa,MichaelJackson
4,56498c6763d0c42c0c4b2761,MichaelJackson


In [44]:
len(inverse_hashtags)

1272496

In [45]:
len(inverse_tweets)

795331

In [46]:
tweet_indexes = [inverse_tweets[t] for t in tweet_hashtags_df.tweet_id.values]
hashtag_indexes = [inverse_hashtags[h] for h in tweet_hashtags_df.hashtag.values]
values = [1 for i in range(len(tweet_indexes))]
mat = coo_matrix((values, (tweet_indexes, hashtag_indexes)),shape=(len(tweet_indexes), len(hashtags.id.values)))
mat = mat.tocsr()
mat

<1541151x1272496 sparse matrix of type '<class 'numpy.int64'>'
	with 1541151 stored elements in Compressed Sparse Row format>

In [47]:
# The matrix product gives the hashtag co-occurrence matrix
res = mat.T.dot(mat)
res = res.tocsr()
res

<1272496x1272496 sparse matrix of type '<class 'numpy.int64'>'
	with 825881 stored elements in Compressed Sparse Row format>

# Finally, export the hashtag co-occurrence matrix for the chosen UMBRAL

In [48]:
# Drop the main diagonal (each hashtag paired with itself)
pairs = [(x,y,w) for (x,y,w) in zip(res.nonzero()[0], res.nonzero()[1], res[res.nonzero()].A1) if x != y ]

In [49]:
len(np.unique(res.nonzero()))

44379

In [50]:
if (USAR_RETWEETS):
    sp.save_npz('npz_files/grafo_coocurrencia_umbral'+str(UMBRAL)+'_entropia'+str(FILTRAR_ENTROPIA), res, compressed=True)
    np.savetxt('npz_files/grafo_coocurrencia_umbral'+str(UMBRAL)+'_entropia'+str(FILTRAR_ENTROPIA)+'.txt', pairs, fmt="%d")
else:
    sp.save_npz('npz_files/grafo_coocurrencia_umbral'+str(UMBRAL)+'_entropia'+str(FILTRAR_ENTROPIA)+'_sin_retweets', res, compressed=True)
    np.savetxt('npz_files/grafo_coocurrencia_umbral'+str(UMBRAL)+'_entropia'+str(FILTRAR_ENTROPIA)+'_sin_retweets.txt', pairs, fmt="%d")

## Run OSLOM
```./oslom_undir -f grafo_coocurrencia...txt -w```

```cp tp clu_files/grafo_coocurrencia...txt```

In [51]:
def extract_oslom(filename):
    """Extract the communities detected by OSLOM that each vertex of the network belongs to.
    OSLOM writes a file of the form:
    #module <community-number>
    <vertex-id> <vertex-id> ... <vertex-id>
    """
    clusters = {}
    hashtag_clusters = {}
    data = []
    with open(filename) as f:
        cluster = ""
        for line in f:
            # Raw string so that \s is read as a regex escape, not a Python string escape
            m = re.search(r"^#module\s([0-9]+).*", line)
            if (m is not None):
                cluster = int(m.group(1))
            else:
                # split() without arguments ignores the trailing space and the newline
                l = list(map(int, line.split()))
                clusters[cluster] = l
                for i in l:
                    if i not in hashtag_clusters:
                        hashtag_clusters[i] = set()
                    hashtag_clusters[i].add(cluster)
                    data.append([i, cluster])
    return pd.DataFrame(data=data, columns=["id", "cluster"]), hashtag_clusters

# Generate a hashtag,cluster file with a single cluster per hashtag

In [52]:
if (USAR_RETWEETS):
    nombre_archivo = 'clu_files/grafo_coocurrencia_umbral'+str(UMBRAL)+'_entropia'+str(FILTRAR_ENTROPIA)+'.txt'
else:
    nombre_archivo = 'clu_files/grafo_coocurrencia_umbral'+str(UMBRAL)+'_entropia'+str(FILTRAR_ENTROPIA)+'_sin_retweets.txt'
# File format
with open(nombre_archivo, 'r') as f:
    for i in range(4):
        print(f.readline().replace('\n', ''))

#module 0 size: 4 bs: 8.08164e-05
226456 1171592 1171593 1171594 
#module 1 size: 3 bs: 7.11131e-07
51718 51719 51720 
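
The format shown above alternates a `#module` header with a line of vertex ids. As a small sketch of how that format can be read, the snippet below parses those same four lines from an in-memory string instead of a file. The variable names are hypothetical.

```python
import io
import re

# The four lines printed above, as an in-memory file.
sample = io.StringIO(
    "#module 0 size: 4 bs: 8.08164e-05\n"
    "226456 1171592 1171593 1171594 \n"
    "#module 1 size: 3 bs: 7.11131e-07\n"
    "51718 51719 51720 \n"
)

header = re.compile(r"^#module\s+(\d+)")
memberships = []  # (vertex_id, module) pairs
current = None
for line in sample:
    match = header.match(line)
    if match:
        current = int(match.group(1))
    elif line.strip():
        # split() with no argument tolerates the trailing space on each line
        memberships.extend((int(v), current) for v in line.split())

print(memberships)
```

Because OSLOM allows overlapping communities, the same vertex id can appear under more than one header, which is why the result is a list of pairs rather than a dictionary.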


In [53]:
# nombre_archivo already depends on USAR_RETWEETS, so both cases are parsed the same way
comunidades_df, hashtag_clusters = extract_oslom(nombre_archivo)
comunidades_df.shape

(47831, 2)

In [54]:
comunidades_df.head()

Unnamed: 0,id,cluster
0,226456,0
1,1171592,0
2,1171593,0
3,1171594,0
4,51718,1


In [55]:
comunidades_df[['id']].drop_duplicates().shape

(43844, 1)

In [56]:
if (USAR_RETWEETS):
    matriz_df_filename = 'npz_files/grafo_coocurrencia_umbral'+str(UMBRAL)+'_entropia'+str(FILTRAR_ENTROPIA)+'.txt'
else:
    matriz_df_filename = 'npz_files/grafo_coocurrencia_umbral'+str(UMBRAL)+'_entropia'+str(FILTRAR_ENTROPIA)+'_sin_retweets.txt'
matriz_df = pd.read_csv(matriz_df_filename, header=None, sep=' ', names=['source', 'target', 'weight'])
matriz_df.head()

Unnamed: 0,source,target,weight
0,10,11,1
1,10,233,1
2,10,297,1
3,10,319,4
4,10,534,1


In [57]:
matriz_df.shape

(781502, 3)

In [58]:
# Final DataFrame, which will hold a single community for each hashtag
comunidades_finales_df = pd.DataFrame(columns=['id', 'cluster'])
comunidades_finales_df

Unnamed: 0,id,cluster


In [59]:
agrupados = comunidades_df.groupby('id').count().reset_index()
agrupados.head(2)

Unnamed: 0,id,cluster
0,10,2
1,11,1


In [60]:
agrupados[agrupados["cluster"] != 1].shape

(3162, 2)

In [61]:
comunidades_finales_df = pd.concat((comunidades_finales_df,comunidades_df[comunidades_df['id'].isin(
    agrupados[agrupados["cluster"] == 1].id.values)]))
comunidades_finales_df.shape

(40682, 2)

In [62]:
comunidades_multiples_df = comunidades_df[comunidades_df['id'].isin(agrupados[agrupados["cluster"]!=1].id.values)].sort_values('id')
comunidades_multiples_df.head()

Unnamed: 0,id,cluster
5679,10,673
23959,10,1194
32099,30,1284
9827,30,857
18685,46,1091


In [63]:
todos_df = matriz_df.merge(comunidades_multiples_df, left_on='source', right_on='id').merge(comunidades_df, left_on='target', right_on='id')
todos_df.head()

Unnamed: 0,source,target,weight,id_x,cluster_x,id_y,cluster_y
0,10,11,1,10,673,11,1085
1,10,11,1,10,1194,11,1085
2,6115,11,2,6115,1368,11,1085
3,6115,11,2,6115,1372,11,1085
4,10,233,1,10,673,233,1314


In [64]:
todos_iguales_df = todos_df[todos_df["cluster_x"] == todos_df["cluster_y"]]
todos_iguales_df.head()

Unnamed: 0,source,target,weight,id_x,cluster_x,id_y,cluster_y
60,243,233,146,243,1314,233,1314
246,1441,233,17,1441,1314,233,1314
282,1733,233,84,1733,1314,233,1314
286,1821,233,56,1821,1314,233,1314
320,2369,233,2,2369,1314,233,1314


In [65]:
len(np.unique(todos_iguales_df.source.values))

3162

In [66]:
def agrupar_maximo(df):
    """For a hashtag, look at the neighboring hashtags it was used with and pick the topic
    that carries the most co-occurrence weight between the hashtag and its neighbors"""
    g = df.groupby(['cluster_x']).sum().reset_index()[['cluster_x', 'weight']].sort_values('weight', ascending=False)
    cluster = g['cluster_x'].values[0]
    suma = g['weight'].values[0]
    return pd.DataFrame(data=[[cluster,suma]], columns=['cluster_final', 'suma'])

In [67]:
hashtags_comunidades_unicos_df = todos_iguales_df.groupby('source').apply(agrupar_maximo).reset_index()
hashtags_comunidades_unicos_df.head()

Unnamed: 0,source,level_1,cluster_final,suma
0,10,0,673,77
1,30,0,857,11
2,46,0,1171,55
3,74,0,1057,1549
4,78,0,1379,432


In [68]:
hashtags_comunidades_unicos_df = hashtags_comunidades_unicos_df[['source', 'cluster_final']]
hashtags_comunidades_unicos_df.columns = ['id', 'cluster']
hashtags_comunidades_unicos_df.head()

Unnamed: 0,id,cluster
0,10,673
1,30,857
2,46,1171
3,74,1057
4,78,1379


In [69]:
hashtags_comunidades_unicos_df.shape

(3162, 2)

In [70]:
hashtags_comunidades_unicos_df.merge(comunidades_finales_df, left_on='id', right_on='id')

Unnamed: 0,id,cluster_x,cluster_y


In [71]:
comunidades_finales_df = pd.concat((comunidades_finales_df, hashtags_comunidades_unicos_df))
comunidades_finales_df.sort_values('id', inplace=True)
comunidades_finales_df[['id']].drop_duplicates().shape

(43844, 1)
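
The steps above resolve overlapping memberships: for a hashtag that OSLOM placed in several modules, the module with the largest total co-occurrence weight towards neighbors in that same module is kept. A compact sketch of that rule on made-up edges, using `idxmax` instead of a custom `apply` (the edge weights are hypothetical):

```python
import pandas as pd

# Hypothetical edges whose two endpoints share a module. The source hashtag belongs to
# several modules (cluster_x) and each row is one neighbor inside that module.
toy_edges = pd.DataFrame({
    "source":    [10, 10, 10, 30, 30],
    "cluster_x": [673, 673, 1194, 857, 1284],
    "weight":    [50, 27, 5, 11, 3],
})

# Total edge weight per (hashtag, module), then the heaviest module per hashtag.
totals = toy_edges.groupby(["source", "cluster_x"], as_index=False)["weight"].sum()
best = totals.loc[totals.groupby("source")["weight"].idxmax(),
                  ["source", "cluster_x", "weight"]].reset_index(drop=True)
print(best)
```

Ties are broken by row order here (`idxmax` returns the first maximum), which is an arbitrary choice for this sketch.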

# Export to Gephi and LaNet-vi
### Export the hashtag co-occurrence network with different parameters so it can be visualized

## Gephi

In [72]:
if (not 'res' in locals()):
    if (USAR_RETWEETS):
        res = sp.load_npz('npz_files/grafo_coocurrencia_umbral'+str(UMBRAL)+'_entropia'+str(FILTRAR_ENTROPIA)+'.npz')
    else:
        res = sp.load_npz('npz_files/grafo_coocurrencia_umbral'+str(UMBRAL)+'_entropia'+str(FILTRAR_ENTROPIA)+'_sin_retweets.npz')
res

<1272496x1272496 sparse matrix of type '<class 'numpy.int64'>'
	with 825881 stored elements in Compressed Sparse Row format>

In [73]:
hashtags_df = pd.read_csv('csv_files/hashtags.csv', keep_default_na = False, na_values=['_'])
hashtags = hashtags_df.to_dict()["hashtag"]

In [74]:
# Keep one orientation of each undirected edge (x > y) with weight at or above GEPHI_UMBRAL
rows, cols = res.nonzero()
aristas = [(x,y,w) for (x,y,w) in zip(rows, cols, res[rows, cols].A1) if x > y and w >= GEPHI_UMBRAL]
origen = [x for (x,y,w) in aristas]
destino = [y for (x,y,w) in aristas]
vertices = np.unique(np.concatenate((origen,destino)))

In [75]:
len(vertices)

4427

In [76]:
hashtag_clusters = {}
for index, row in comunidades_finales_df.iterrows():
    hashtag_clusters[row['id']] = row['cluster']
hashtag_clusters[10]

673

In [77]:
# Export edges
with open('gephi/gephi_aristas_'+str(GEPHI_UMBRAL)+'.csv', "w") as f:
    f.write('Source,Target,Weight,Color\n')
    for h1,h2,w in aristas:
        color = ''
        if (hashtag_clusters[h1] == hashtag_clusters[h2]):
            color = 'same'
        else:
            color = 'diff'
        f.write(str(h1) + ","+str(h2)+ ","+str(w)+","+color+"\n")
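
The edge file above labels each edge `same` or `diff` depending on whether both endpoints share a module. The same table can be sketched with pandas, writing to an in-memory buffer instead of the `gephi/` folder. The edges and module assignments below are hypothetical.

```python
import io
import pandas as pd

# Hypothetical edges (source, target, weight) and one module per hashtag id.
toy_aristas = [(243, 233, 146), (1441, 233, 17), (10, 233, 25)]
toy_clusters = {233: 1314, 243: 1314, 1441: 1314, 10: 673}

edges = pd.DataFrame(toy_aristas, columns=["Source", "Target", "Weight"])

# Compare the module of both endpoints row by row.
same = edges["Source"].map(toy_clusters) == edges["Target"].map(toy_clusters)
edges["Color"] = same.map({True: "same", False: "diff"})

buffer = io.StringIO()
edges.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

In this sketch a hashtag without a module maps to NaN, and since NaN never compares equal, its edges are labeled `diff` rather than raising a `KeyError`.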

In [78]:
comunidades_finales_df.shape

(43844, 2)

In [79]:
# Export vertices
filtered_hashtags = hashtags_df[hashtags_df["id"].isin(vertices)]
vertices_df = filtered_hashtags.merge(comunidades_finales_df)
vertices_df.columns = ["Id", "Label", "Cluster"]
vertices_df.to_csv('gephi/gephi_vertices_'+str(GEPHI_UMBRAL)+'.csv',index=False)

## Large Network Visualization

In [80]:
# Export vertices
hashtags_df.to_csv('lanet/lanet_vertices.csv',index=False, header=None, sep=" ")

In [81]:
# Export edges; for now the weights are not used
aristas = [(x,y, w) for (x,y,w) in zip(res.nonzero()[0], res.nonzero()[1], res[res.nonzero()].A1) if x > y ]
np.savetxt('lanet/lanet_red.csv', aristas, fmt="%d")

### Run LaNet-vi
```./lanet -logfile -logstdout -coresfile cores -W 4000 -H 3000 -render svg -input lanet_red.csv -decomp kdenses -edges 1.0 -fromlayer 0 -names lanet_vertices.csv```


```./lanet -logfile -logstdout -coresfile cores -W 4000 -H 3000 -render svg -input lanet_red.csv -decomp kcores -edges 1.0 -fromlayer 0 -names lanet_vertices.csv```