# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.
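
The same check can be scripted from Python. A minimal sketch, assuming `nvidia-smi` is on the PATH and accepts the `--query-gpu=name --format=csv,noheader` options; the helper names `gpu_names` and `is_supported_gpu` are illustrative and not part of any RAPIDS tooling:

```python
import shutil
import subprocess

# GPU models named in the check above.
SUPPORTED_GPUS = ("T4", "P4", "P100")

def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

def is_supported_gpu(name):
    """True if any token of the name (split on spaces and hyphens) is a supported model."""
    tokens = name.replace("-", " ").split()
    return any(token in SUPPORTED_GPUS for token in tokens)

for name in gpu_names():
    status = "supported" if is_supported_gpu(name) else "not in the list above"
    print(name, "->", status)
```

On a machine without `nvidia-smi` the loop simply prints nothing.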

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!nvidia-smi

Wed Oct 18 18:44:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Setup #
The setup script:
1. Updates gcc in Colab
1. Installs Conda
1. Installs the current stable version of the RAPIDS libraries, as well as some external libraries, including:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuSignal
  1. BlazingSQL
  1. xgboost
1. Copies the RAPIDS .so files into the current working directory, a necessary workaround for RAPIDS+Colab integration.


In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 390, done.
remote: Counting objects: 100% (121/121), done.
remote: Compressing objects: 100% (70/70), done.
remote: Total 390 (delta 89), reused 53 (delta 51), pack-reused 269
Receiving objects: 100% (390/390), 107.16 KiB | 26.79 MiB/s, done.
Resolving deltas: 100% (191/191), done.
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 1.6 MB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0
***********************************************************************
Woo! Your instance has the right kind of GPU, a Tesla T4!
We will now install RAPIDS via pip!  Please stand by, should be quick...
***********************************************************************



In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

Updating your Colab environment.  This will restart your kernel.  Don't Panic!
Found existing installation: cupy-cuda11x 11.0.0
Uninstalling cupy-cuda11x-11.0.0:
  Successfully uninstalled cupy-cuda11x-11.0.0
PPA publishes dbgsym, you may need to include 'main/debug' component
Repository: 'deb https://ppa.launchpadcontent.net/ubuntu-toolchain-r/test/ubuntu/ jammy main'
Description:
Toolchain test builds; see https://wiki.ubuntu.com/ToolChain

More info: https://launchpad.net/~ubuntu-toolchain-r/+archive/ubuntu/test
Adding repository.
Adding deb entry to /etc/apt/sources.list.d/ubuntu-toolchain-r-ubuntu-test-jammy.list
Adding disabled deb-src entry to /etc/apt/sources.list.d/ubuntu-toolchain-r-ubuntu-test-jammy.list
Adding key to /etc/apt/trusted.gpg.d/ubuntu-toolchain-r-ubuntu-test.gpg with fingerprint 60C317803A41BA51845E371A1E9377A2BA9EF27F
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/rep

In [None]:
#%%timeit -r 1 -n 1
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.1.0-1/Mambaforge-23.1.0-1-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:08
🔁 Restarting kernel...


In [None]:
#%%timeit -r 1 -n 1
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

‚ú®üç∞‚ú® Everything looks OK!


In [None]:
#%%timeit -r 1 -n 1
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'

Streaming output truncated to the last 5000 lines.



libcublas-dev-11.11. | 394.1 MB  | ####6      |  47%
py-xgboost-1.7.5dev. | 205 KB    | #######7   |  78%
pylibraft-23.04.01   | 1.5 MB    | #1         |  12%

In [None]:
import numpy as np
import pandas as pd
import networkx as nx
from networkx.algorithms import bipartite
from scipy import sparse
import cugraph as cnx

In [None]:
#!pip install xlrd

In [None]:
%%timeit -r 1 -n 1
#These files come from C:\Users\L00251678\Dropbox\Data Science\Luis de Marcos\2023\NetworkX\data_openEditors
!wget -O 0nodes_v5.0.xlsx https://www.dropbox.com/scl/fi/gr8nu2app9xgowge7ghlk/0nodes_v5.0.xlsx?rlkey=s25vzcuz1dzacjda97qbr71nx&dl=0
!wget -O 0edges_v5.0.csv https://www.dropbox.com/scl/fi/24n3kb3oemopi7qfej4bj/0edges_v5.0.csv?rlkey=ujj9hf7dc54itlp1lwf13wazg&dl=0
!wget -O 1EBMembers_v5.0_Final.xlsx https://www.dropbox.com/scl/fi/1g3ex2p9oq048omo2ma39/1EBMembers_v5.0_Final.xlsx?rlkey=rb864c9t5ukn2fwuqaefo9t5n&dl=0
!wget -O 2Journals_v5.0_crosslisted.xlsx https://www.dropbox.com/scl/fi/cpq23yhva8lvmr5o397lj/2Journals_v5.0_crosslisted.xlsx?rlkey=fjabm1vzg9hf0fkz5eguavyo1&dl=0

--2023-10-19 11:19:26--  https://www.dropbox.com/scl/fi/gr8nu2app9xgowge7ghlk/0nodes_v5.0.xlsx?rlkey=s25vzcuz1dzacjda97qbr71nx
Resolving www.dropbox.com (www.dropbox.com)... 162.125.85.18, 2620:100:6035:18::a27d:5512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.85.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /c/scl/fi/gr8nu2app9xgowge7ghlk/0nodes_v5.0.xlsx?rlkey=s25vzcuz1dzacjda97qbr71nx [following]
--2023-10-19 11:19:26--  https://www.dropbox.com/c/scl/fi/gr8nu2app9xgowge7ghlk/0nodes_v5.0.xlsx?rlkey=s25vzcuz1dzacjda97qbr71nx
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc248332086ab08e1ea6df6429a3.dl.dropboxusercontent.com/cd/0/get/CF5IQw2_VBS7eKf2zKPDL3HFwQ7ugrP2LHN0j-wEIz_jefUnmx4zso7L-zz8UGbHlf2bujbSADZsguvOh4KkdjGlqUcwzyz8-aVJ_MHgXJT55OfZvtSBnpm8RNEpMr9r15M/file# [following]
--2023-10-19 11:19:27--  https://uc248332086ab08e1ea6df6429a3.dl.dropboxusercontent.co
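
The download URLs contain `&`, which a shell treats as a control operator when the URL is unquoted; the log above shows the request going out with only the `rlkey` parameter. A minimal sketch of building a safely quoted command with the standard library's `shlex.quote`. The helper name `wget_command` is hypothetical, and the sketch does not check what Dropbox returns when `dl=0` is actually sent:

```python
import shlex
from urllib.parse import urlsplit, parse_qs

def wget_command(output_name, url):
    """Build a wget command line in which the URL is quoted for the shell."""
    return f"wget -O {shlex.quote(output_name)} {shlex.quote(url)}"

url = ("https://www.dropbox.com/scl/fi/24n3kb3oemopi7qfej4bj/"
       "0edges_v5.0.csv?rlkey=ujj9hf7dc54itlp1lwf13wazg&dl=0")

# Both query parameters are present in the URL itself.
params = parse_qs(urlsplit(url).query)
print(sorted(params))

# The quoted command keeps '&dl=0' as part of the URL argument.
command = wget_command("0edges_v5.0.csv", url)
print(command)
```

In a notebook cell the equivalent is to wrap the URL in single quotes after `!wget -O <name>`.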

In [None]:
#import the nodes into a dictionary
nodos = {}
i=0
#nodosF = pd.read_excel(f"{datapath}/0nodes_v5.0.xlsx")
#nodosF = pd.read_excel("/content/drive/MyDrive/Colab Notebooks/cuGraph/data_openEditors/0nodes_v5.0.xlsx", engine='openpyxl')
#nodosF = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cuGraph/data_openEditors/0nodes_v5.0.csv")
#nodosF = pd.read_excel("https://www.dropbox.com/scl/fi/gr8nu2app9xgowge7ghlk/0nodes_v5.0.xlsx?rlkey=s25vzcuz1dzacjda97qbr71nx&dl=0", engine='openpyxl')
nodosF = pd.read_excel("/content/0nodes_v5.0.xlsx")
for indice, fila in nodosF.iterrows():
    node_id = fila["Id"]  #avoid shadowing the built-in id()
    nodos[i] = node_id
    i=i+1
print('nodosF dimensions', nodosF.shape)

nodosF dimensions (312810, 1)


In [None]:
#%%timeit -r 1 -n 1
#import the adjacency matrix (MA) from file: a CSV of integers separated by ;
#MA = np.genfromtxt(f"{datapath}/0edges_v5.0.csv", delimiter=';', dtype='int', filling_values=0)
#MA = np.genfromtxt("https://www.dropbox.com/scl/fi/24n3kb3oemopi7qfej4bj/0edges_v5.0.csv?rlkey=ujj9hf7dc54itlp1lwf13wazg&dl=0", delimiter=';', dtype='int', filling_values=0)
MA = np.genfromtxt("/content/0edges_v5.0.csv", delimiter=';', dtype='int', filling_values=0)
print('Adjacency matrix dimensions:', MA.shape)
#print(nodos)
#print(MA)
#print(MA.T) #transpose of the matrix

Adjacency matrix dimensions: (309772, 3038)


In [None]:
#%%timeit -r 1 -n 1
#build the graph from the matrix
r,s = MA.shape
sMA = sparse.csr_matrix(MA)
#print(sMA)
G2 = nx.algorithms.bipartite.matrix.from_biadjacency_matrix(sMA)
#rename the nodes
G2 = nx.relabel_nodes(G2, nodos)
#print(G2.nodes)
#print(nx.is_connected(G2))
#print(nx.number_connected_components(G2))
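
To see why `nodos` can serve directly as the relabeling map, recall how `from_biadjacency_matrix` numbers nodes: rows become nodes `0 .. r-1` (with `bipartite=0`) and columns become nodes `r .. r+s-1` (with `bipartite=1`). The sketch below reproduces that numbering in plain Python on a made-up 3x2 matrix. The ids and the helper `biadjacency_edges` are illustrative and not taken from the dataset:

```python
def biadjacency_edges(matrix):
    """List (row_node, column_node) pairs for the nonzero entries of a biadjacency matrix.

    Row i maps to node i; column j maps to node r + j, where r is the row count.
    """
    r = len(matrix)
    return [
        (i, r + j)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]

# Toy data: 3 members (rows) x 2 journals (columns).
toy_matrix = [
    [1, 0],
    [1, 1],
    [0, 1],
]
# Position -> id, rows first and then columns, like the `nodos` dictionary.
toy_ids = {0: 101, 1: 102, 2: 103, 3: "J00001", 4: "J00002"}

edges = biadjacency_edges(toy_matrix)
labeled_edges = [(toy_ids[u], toy_ids[v]) for u, v in edges]
print(edges)
print(labeled_edges)
```

The relabeling is only correct if the node file lists the row ids first and the column ids after them, both in matrix order, which the code above appears to rely on.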

In [None]:
datapath = '/content/drive/MyDrive/Colab Notebooks/Luis de Marcos'

In [None]:
set1nodes = set(n for n,d in G2.nodes(data=True) if d['bipartite']==0)
set2nodes = set(G2) - set1nodes
print("Is connected: "+str(nx.is_connected(G2)))
print("# connected components: "+str(nx.number_connected_components(G2)))
print("Graph density: "+str(nx.density(G2)))
print("Bipartite nodes: "+str(len(G2.nodes)))
print("Bipartite set 1 nodes: "+str(len(set1nodes)))
print("Bipartite set 2 nodes: "+str(len(set2nodes)))
print("Bipartite edges: "+str(len(G2.edges)))
print("Bipartite density: "+str(nx.bipartite.density(G2, set1nodes)))
#Generate the subgraph containing only the main (largest) component
#COMMENT OUT THESE LINES TO WORK WITH THE FULL GRAPH
#https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.components.connected_components.html#networkx.algorithms.components.connected_components
#largest_cc = max(nx.connected_components(G2), key=len)
#SG2 = G2.subgraph(largest_cc).copy()
#print(nx.is_connected(SG2))
#print(nx.number_connected_components(SG2))
#G2 = SG2 #overwrite the original so that the remaining calculations use G2

Is connected: False
# connected components: 289
Graph density: 7.871759896669262e-06
Bipartite nodes: 312810
Bipartite set 1 nodes: 309772
Bipartite set 2 nodes: 3038
Bipartite edges: 385125
Bipartite density: 0.00040923406921714224
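
The two density figures in the output above can be reproduced by hand. For an undirected simple graph with `n` nodes and `m` edges, the graph density is `2m / (n(n-1))`, while the bipartite density is `m / (n1 * n2)`, the fraction of possible cross-partition edges that actually exist. A short check using the counts printed above:

```python
import math

# Counts taken from the output above.
n_top, n_bottom, m = 309772, 3038, 385125
n = n_top + n_bottom

# Density over all possible node pairs (ignores the bipartite structure).
graph_density = 2 * m / (n * (n - 1))

# Density over the pairs that a bipartite graph can actually connect.
bipartite_density = m / (n_top * n_bottom)

print(graph_density)
print(bipartite_density)
```

The bipartite figure is about fifty times larger because it excludes member-member and journal-journal pairs, which can never be edges in this graph.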


In [None]:
#load the node attributes from file. "Id" is used as the index
personasDf = pd.read_excel("/content/1EBMembers_v5.0_Final.xlsx", index_col=0)
print(personasDf.shape)
print(personasDf.dtypes)
#print(personasDf.head())
#print(personasDf.tail())
for indice, fila in personasDf.iterrows():
    if(indice in G2.nodes):
        G2.nodes[indice]["Name"]=fila["Name"]
        #G2.nodes[indice]["AffiliationName"]=fila["AffiliationName"]
        # G2.nodes[indice]["GenderMC"]=fila["GenderMC"]
        # G2.nodes[indice]["GenderAPI"]=fila["GenderAPI"]
        #G2.nodes[indice]["Country"]=fila["AffiliationCountry"]
        # G2.nodes[indice]["GeoAreaMC"]=fila["GeoAreaMC"]

#load the attribute nodes (journals) from file
revistasDf = pd.read_excel("/content/2Journals_v5.0_crosslisted.xlsx", index_col=0)
print(revistasDf.shape)
print(revistasDf.dtypes)
#print(revistasDf.head())
for indiceR, filaR in revistasDf.iterrows():
    if(indiceR in G2.nodes):
        G2.nodes[indiceR]["JName"]=filaR["JName"]
        #G2.nodes[indiceR]["Field"]=filaR["Field"]
        #G2.nodes[indiceR]["Quartile"]=filaR["Quartile"]
        # G2.nodes[indiceR]["NumMembersEB"]=filaR["NumMembersEB"]
        #G2.nodes[indiceR]["IF"]=filaR["IF"]
        #G2.nodes[indiceR]["Cross-listed"]=filaR["Cross-listed"]
        #G2.nodes[indiceR]["Cross-fields"]=filaR["Cross-fields"]
        #G2.nodes[indiceR]["Groups"]=filaR["Groups"]
        #G2.nodes[indiceR]["Cross-Groups"]=filaR["Cross-groups"]

(309772, 4)
Unnamed: 0             int64
Name                  object
AffiliationName       object
AffiliationCountry    object
dtype: object
(3038, 21)
JName                object
Journal              object
Publi                object
ISSN                 object
EditorialURL         object
JCR Abbreviation     object
Field                object
Groups               object
Cross-groups           bool
Index                object
TCits                object
IF                  float64
Quartile             object
CI                  float64
OAGold               object
MatchType            object
Cross-listed           bool
Cross-fields         object
Cross-index            bool
In-SCIE                bool
In-SSCI                bool
dtype: object


In [None]:
#THESE ARE loading TESTS. REMOVE
#Make sure the members go to bipartite=0 and the journals go to bipartite=1.

#Gender and GeoArea ARE READ AS FLOAT. They also have missing values
#print(G2.nodes[4504])
#print(G2.nodes[15084])
#print(G2.nodes['J00002'])
#print(G2.nodes['J00051'])
#print(G2.nodes['J00281'])

In [None]:
#get the nodes of the two partitions (RB_top and RB_bottom)
RB_top = {n for n, d in G2.nodes(data=True) if d['bipartite']==0}
RB_bottom = set(G2) - RB_top
#print(RB_top)
#print(RB_bottom)

In [None]:
#save the graph
#nx.write_gexf(G2, f"{datapath}/complete_bipartite_graph_Allfields.gexf")

In [None]:
#bipartite graph metrics
print("Is bipartite:", bipartite.is_bipartite(G2))
print("Top density:",bipartite.density(G2, RB_top))

Is bipartite: True
Top density: 0.00040923406921714224


In [None]:
##degree_centrality
dc_top = bipartite.degree_centrality(G2, RB_top)
dc_bottom = bipartite.degree_centrality(G2, RB_bottom)
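
For reference, the bipartite degree centrality of a node is its degree divided by the size of the opposite node set, so a member's value is the share of journals on whose board they sit. A plain-Python sketch on made-up data (the ids and the helper `bipartite_degree_centrality` are illustrative, and isolated nodes are ignored for brevity):

```python
def bipartite_degree_centrality(edges, top_nodes):
    """Degree divided by the size of the opposite node set, for every node with an edge."""
    top = set(top_nodes)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Assumption: every bottom node appears in at least one edge.
    bottom = set(degree) - top
    return {
        node: deg / (len(bottom) if node in top else len(top))
        for node, deg in degree.items()
    }

toy_edges = [(101, "J1"), (102, "J1"), (102, "J2"), (103, "J2")]
toy_top = {101, 102, 103}

centrality = bipartite_degree_centrality(toy_edges, toy_top)
print(centrality)
```

According to the NetworkX documentation, `bipartite.degree_centrality` likewise returns values for the nodes of both sets, whichever set is passed in, so `dc_top` and `dc_bottom` should hold the same entries.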

In [None]:
#katz_centrality (kept under the name cc_top). Takes a while. Trying it with cuGraph
cc_top = cnx.katz_centrality(G2)

In [None]:
#betweenness_centrality. Takes a while. Trying it with cuGraph
bc_top = cnx.betweenness_centrality(G2)

In [None]:
##clustering (according to the documentation it can be computed for the whole graph)
#clustering_coeff = bipartite.clustering(G2)
#print(clustering_coeff)

In [None]:
##append the metrics to the initial dataframes (personasDf, revistasDf)
##append degree centrality (dc_top)
dcTop_Df = pd.DataFrame.from_dict(dc_top, orient='index', columns=['degreeCentrality'])
resultPersonasDf = personasDf.join(dcTop_Df) #important: use a different DF the first time in case the cell is run more than once; otherwise the join fails when it tries to duplicate columns
resultRevistasDf = revistasDf.join(dcTop_Df) #important: use a different DF here too

##append katz centrality (cc_top)
ccTop_Df = pd.DataFrame.from_dict(cc_top, orient='index', columns=['katzCentrality'])
resultPersonasDf = resultPersonasDf.join(ccTop_Df)
resultRevistasDf = resultRevistasDf.join(ccTop_Df)

#append betweenness centrality (bc_top)
bcTop_Df = pd.DataFrame.from_dict(bc_top, orient='index', columns=['betweennessCentrality'])
resultPersonasDf = resultPersonasDf.join(bcTop_Df)
resultRevistasDf = resultRevistasDf.join(bcTop_Df)

##append clustering coefficient (clustering_coeff)
#clustering_coeff_Df = pd.DataFrame.from_dict(clustering_coeff, orient='index', columns=['clusteringCoefficient'])
#resultPersonasDf = resultPersonasDf.join(clustering_coeff_Df)
#resultRevistasDf = resultRevistasDf.join(clustering_coeff_Df)

#save Excel files with the complete datasets
resultPersonasDf.to_excel(f"{datapath}/RMetricsMembersDS.xlsx")
resultRevistasDf.to_excel(f"{datapath}/RMetricsJournalsDS.xlsx")
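
The join pattern used in this cell can be tried on a toy frame. A minimal sketch with made-up ids and values (nothing here comes from the dataset): `DataFrame.join` performs a left join on the index, so rows without a metric get `NaN`, ids absent from the frame are dropped, and joining the same column a second time raises a `ValueError`, which is the reason for writing to a separate result frame.

```python
import pandas as pd

# Toy frame indexed by node id, standing in for personasDf.
people = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"]}, index=[101, 102, 103])

# A metric dictionary keyed by node id; 103 is missing and 999 is not in the frame.
degree = {101: 0.5, 102: 1.0, 999: 0.25}

degree_df = pd.DataFrame.from_dict(degree, orient="index", columns=["degreeCentrality"])
result = people.join(degree_df)  # left join on the index
print(result)

# Joining the same column again fails because the column names overlap.
try:
    result.join(degree_df)
    overlap_error = False
except ValueError:
    overlap_error = True
print(overlap_error)
```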