50) If we say that a user knows a language when their babel level is greater than or equal to 1, obtain a DataFrame whose columns are log types, whose index is the number of languages a user knows, and whose cells hold the probability that those users generate that log type. (⭐⭐⭐)

In [1]:
import pandas as pd
import numpy as np

from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


We work with the first DataFrame.

**Preliminary analysis**

* Looking at the DataFrame's info, we see that the "babel_user" column is of type int, so all of its values are numeric and it has no NaN values. We also see that there are 35666 entries (rows) and that column 1 (babel_lang) has 4 null values.

* Looking at the unique values of the "babel_level" column, we see that one of the possible values is "N", which stands for "native speaker". We should map the whole column to numeric values (so that comparisons are possible), and numerically we can take N = 4.5.

* Looking at the unique values of the "babel_lang" column, we see NaN values (as we already knew). We do not know what those NaNs mean, so when counting the languages a user knows we ignore the rows whose "babel_lang" is NaN: there is no way to tell what they represent, and I assume they are due to an error.
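The checks listed above can be reproduced with a few pandas inspection calls. The following is a minimal sketch on a tiny invented frame (`muestra` and its rows are illustrative assumptions, not data from "languages.csv"):

```python
import numpy as np
import pandas as pd

# Toy stand-in for languages.csv; every row here is invented for illustration.
muestra = pd.DataFrame({
    "babel_user": [1, 1, 2, 3],
    "babel_lang": ["es", "en", np.nan, "fr"],
    "babel_level": ["N", "2", "1", "0"],
})

muestra.info()                               # dtypes and non-null counts per column
nulos = muestra.isna().sum()                 # number of NaNs in each column
niveles = muestra["babel_level"].unique()    # distinct level codes, including "N"

print(nulos)
print(niveles)
```

On the real file, the same three calls are what reveal the int dtype of "babel_user", the null entries in "babel_lang", and the non-numeric "N" level.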

In [2]:
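def mapear_a_entero(x):
  # Map a babel level code (a string) to a number; "N" (native) is placed at 4.5.
  # Despite the name, the result is a float because of the 4.5 entry.
  traduccion = {
      "0":0,
      "1":1,
      "2":2,
      "3":3,
      "4":4,
      "N":4.5,
      "5":5,
  }

  # Unknown codes become NaN, so they are discarded by the later "> 0" filter.
  return traduccion.get(x, np.nan)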

In [3]:
lenguajes = pd.read_csv("/content/drive/MyDrive/Organizacion de Datos/Colab Notebooks/TP1/languages.csv")

# Drop rows with no language, convert levels to numbers, keep levels >= 1.
lenguajes.dropna(subset=["babel_lang"], inplace=True)
lenguajes["babel_level"] = lenguajes["babel_level"].map(mapear_a_entero)
lenguajes = lenguajes[lenguajes["babel_level"] > 0]

# Count the distinct languages each user knows.
df_idiomas = lenguajes.groupby(["babel_user"]).agg({"babel_lang": ["nunique"]})
df_idiomas.columns = ["cantidad_de_idiomas"]
df_idiomas.reset_index(inplace=True)

# Downcast to smaller integer types to save memory.
df_idiomas["babel_user"] = df_idiomas["babel_user"].astype(np.int32)
df_idiomas["cantidad_de_idiomas"] = df_idiomas["cantidad_de_idiomas"].astype(np.int8)
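To see what this cell does to edge cases, here is a small sketch of the same steps on invented rows (the frame `muestra` and the dict `niveles` are assumptions for illustration; the dict mirrors the mapping in `mapear_a_entero`):

```python
import numpy as np
import pandas as pd

# Same level mapping as mapear_a_entero, written as a dict for Series.map.
niveles = {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "N": 4.5, "5": 5}

# Invented rows: user 1 has one level-0 language, user 2 has one row with a
# missing language, and user 3 only has a level-0 language.
muestra = pd.DataFrame({
    "babel_user": [1, 1, 1, 2, 2, 3],
    "babel_lang": ["es", "en", "fr", "es", np.nan, "de"],
    "babel_level": ["N", "2", "0", "1", "3", "0"],
})

muestra = muestra.dropna(subset=["babel_lang"])                # ignore unknown languages
muestra["babel_level"] = muestra["babel_level"].map(niveles)   # codes -> numbers
muestra = muestra[muestra["babel_level"] >= 1]                 # "knows" means level >= 1

# Distinct languages per user; users with no qualifying language disappear.
conteo = muestra.groupby("babel_user")["babel_lang"].nunique()
print(conteo.to_dict())
```

Note that user 3 is absent from the result rather than present with a count of 0, which is why the final table has no row for zero languages.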

We work with the second DataFrame.

**Preliminary analysis**

* Looking at the columns of the original df, the ones of interest are "contributor_id" (IDs only, since "languages.csv" contains IDs and no IPs) and "logtype", which lets us find the log types each user generated. We load the file with only these columns to skip the step of dropping the rest.

* Both columns of interest contain NaNs. There may be logtypes with no ID assigned, or vice versa. We do not know what those NaNs mean. To obtain the logtypes each user generated, we ignore IDs that are NaN: there is no way to tell what they represent, and I assume they are due to an error. We therefore drop the rows whose ID is NaN (the code also drops the rows whose logtype is NaN). Besides, for an ID that is NaN there is no way to know how many languages the user knows.

* From the number of unique values in the "logtype" column we already know that the final DataFrame will have that many columns. Also, as noted above, the unique values include NaN.

* We are using a lot of memory. This is one more reason to keep only the columns of interest and to consider changing their types to manage memory better.
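The memory point can be checked directly with `memory_usage(deep=True)`. The sketch below uses invented data, and converting "logtype" to `category` is an option shown for illustration only; the notebook itself only restricts columns and downcasts integers:

```python
import numpy as np
import pandas as pd

# Invented data standing in for the two columns read from logs.csv.
muestra = pd.DataFrame({
    "contributor_id": np.arange(1000, dtype=np.int64),
    "logtype": ["create", "delete"] * 500,
})

# deep=True also counts the memory held by the Python string objects.
antes = muestra.memory_usage(deep=True).sum()

muestra["contributor_id"] = muestra["contributor_id"].astype(np.int32)  # 8 -> 4 bytes per value
muestra["logtype"] = muestra["logtype"].astype("category")              # few distinct values

despues = muestra.memory_usage(deep=True).sum()
print(antes, despues)
```

Whether the downcast is safe depends on the actual range of the IDs, so it is worth checking the maximum value before choosing `np.int32`.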

In [6]:
registros = pd.read_csv("/content/drive/MyDrive/Organizacion de Datos/Colab Notebooks/TP1/logs.csv", usecols=["contributor_id", "logtype"])

# Drop rows where either the user ID or the log type is missing.
registros.dropna(subset=["contributor_id", "logtype"], inplace=True)

# Count how many logs of each type every user generated.
agrupados_logs = registros.groupby(["contributor_id", "logtype"]).agg({"logtype": ["count"]})
agrupados_logs.columns = ["tipos_de_logs"]

# Pivot to one row per user and one column per logtype; missing pairs become 0.
df_logtypes = agrupados_logs.unstack(fill_value = 0)
df_logtypes.columns = [x[1] for x in df_logtypes.columns]
df_logtypes.reset_index(inplace = True)

# Downcast the count columns to save memory.
logtypes = registros["logtype"].unique()
df_logtypes[logtypes] = df_logtypes[logtypes].astype(np.int32)
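The groupby-then-unstack pattern in this cell builds a user-by-logtype count table. As a hedged illustration on invented rows (`registros_demo` is an assumption), the sketch below shows the resulting shape and compares it with `pd.crosstab`, an alternative that the notebook does not use:

```python
import pandas as pd

# Invented log rows: user 1 created twice and deleted once, user 2 patrolled and created once.
registros_demo = pd.DataFrame({
    "contributor_id": [1, 1, 1, 2, 2],
    "logtype": ["create", "create", "delete", "patrol", "create"],
})

# Count rows per (user, logtype) pair, then pivot logtype into columns.
por_agrupacion = (
    registros_demo.groupby(["contributor_id", "logtype"])
    .size()
    .unstack(fill_value=0)
)

# crosstab produces the same count table in a single call.
por_crosstab = pd.crosstab(registros_demo["contributor_id"], registros_demo["logtype"])

print(por_agrupacion)
```

The `fill_value=0` argument is what turns "this user never generated this logtype" into an explicit zero instead of NaN.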

We work with both DataFrames.

**Preliminary analysis**

The two DataFrames do not have the same number of users, so a user may appear in one and not in the other. Therefore:
  * If a user appears in "languages.csv" and knows a language but does not appear in "logs.csv", we consider that they generated 0 logs of every type. These users are therefore included.
  * If a user appears in "logs.csv" but not in "languages.csv", we consider that they do not know any language well. These users are therefore excluded.
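Both rules correspond to a left merge from the languages table onto the logs table. A minimal sketch with invented frames (`idiomas_demo` and `logs_demo` are assumptions); the `indicator=True` flag is added here only to show which users had no logs:

```python
import pandas as pd

# Users 1, 2, 3 know at least one language; users 1, 3, 4 have logs.
idiomas_demo = pd.DataFrame({"id": [1, 2, 3], "cantidad_de_idiomas": [2, 1, 1]})
logs_demo = pd.DataFrame({"id": [1, 3, 4], "create": [5, 0, 7], "delete": [1, 2, 0]})

# how="left" keeps every user from idiomas_demo and discards user 4,
# who appears only in logs_demo.
unidos = pd.merge(idiomas_demo, logs_demo, on="id", how="left", indicator=True)

# "left_only" marks users with languages but no logs (user 2 here).
sin_logs = unidos.loc[unidos["_merge"] == "left_only", "id"].tolist()

# Their missing counts are treated as 0 logs of every type.
unidos = unidos.drop(columns="_merge").fillna(0)
print(unidos)
```

Passing `on="id"` explicitly avoids relying on pandas to infer the join key from the shared column names.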

In [8]:
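# Give the user ID column the same name in both frames.
df_logtypes = df_logtypes.rename(columns={
    "contributor_id": "id"
})

df_idiomas = df_idiomas.rename(columns={
    "babel_user": "id"
})

# Left merge: keep every user with at least one language; users without logs get 0 counts.
merge = pd.merge(df_idiomas, df_logtypes, on="id", how="left").fillna(0)
merge = merge.drop(columns=["id"])
merge = merge.astype(np.int32)

# Total logs per type for each language count, then normalize each row to sum to 1.
resultado = merge.groupby(["cantidad_de_idiomas"]).sum()
resultado = resultado.div(resultado.sum(axis=1), axis=0)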
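The last two lines turn counts into row-wise proportions. A small sketch on invented counts (`demo` is an assumption) makes the arithmetic explicit:

```python
import pandas as pd

# Invented per-user counts: two users know 1 language, one user knows 2.
demo = pd.DataFrame({
    "cantidad_de_idiomas": [1, 1, 2],
    "create": [3, 1, 1],
    "delete": [0, 4, 3],
})

# Sum counts within each language-count group.
totales = demo.groupby("cantidad_de_idiomas").sum()

# Divide each row by its own total so that every row sums to 1.
proporciones = totales.div(totales.sum(axis=1), axis=0)
print(proporciones)
```

Each cell is the share of a group's logs that fall in that logtype, so users who generate many logs weigh more than users who generate few. A group whose users generated no logs at all would have a row total of 0 and produce NaN in that row.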

Answer

In [10]:
resultado

cantidad_de_idiomas,block,campus,contentmodel,course,create,delete,eparticle,gblblock,gblrights,growthexperiments,...,online,patrol,protect,renameuser,rights,student,tag,thanks,upload,usermerge
1,0.041032,7.3e-05,0.0,0.000363,0.274579,0.041928,0.001865,0.0,0.0,0.0,...,0.000121,0.186557,0.016204,0.0,0.001066,0.000509,0.002907,0.126802,0.004069,0.0
2,0.031144,0.0,0.0,0.000107,0.260636,0.424955,0.000226,0.0,0.0,0.0,...,1.8e-05,0.081937,0.024627,0.000459,0.000846,4.7e-05,0.000233,0.055612,0.001134,0.0
3,0.090322,4.6e-05,0.0,0.000363,0.138445,0.474434,0.000724,0.0,0.0,0.0,...,4.6e-05,0.059043,0.022534,0.000194,0.000479,0.000155,0.000255,0.041426,0.001449,0.0
4,0.072103,0.0,1e-05,7.9e-05,0.090102,0.593083,0.000169,0.0,6e-06,0.0,...,8e-06,0.064252,0.020072,0.007942,0.001354,9.7e-05,0.000136,0.028909,0.001648,0.0
5,0.024211,6.5e-05,0.0,0.000352,0.167538,0.347231,0.000825,0.0,0.0,0.0,...,0.000102,0.083519,0.010285,0.000324,0.000185,0.000435,0.000361,0.078562,0.002446,0.0
6,0.038516,0.0,0.0,2e-05,0.17206,0.519417,9e-05,0.0,0.0,0.0,...,0.0,0.018137,0.02255,0.001125,0.002331,3e-05,6e-05,0.039372,0.001215,0.0
7,0.049885,0.0,0.0,0.0,0.06063,0.579574,1.6e-05,0.0,0.0,0.0,...,0.0,0.122479,0.020448,0.001902,0.001024,1.6e-05,8.1e-05,0.017327,0.001544,0.0
8,0.0,0.0,0.0,0.0,0.348402,0.017064,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000255,0.0,0.0,0.0,0.0,0.160194,0.00955,0.0
9,0.0,0.0,0.0,0.0,0.34715,0.008636,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07772,0.044905,0.0
10,0.0,0.0,0.0,0.0,0.677132,0.031176,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.5e-05,0.005472,0.000553,0.0
