# Sistema de Recomendación de Cursos
### Javier Hiruelo Pérez

Sistema de recomendación de cursos para el concurso de Hack4Good de 2024.

In [9]:
import pandas as pd
import numpy as np

In [10]:
encabezado = {"SpecializationName":"Especificación",
              "SpecializationInstructors":"Instructores",
              "ReviewScore":"Puntuación",
              "ReviewCount":"Reviews",
              "Skills":"Skills",
              "Details":"Detalles",
              "SpecializationURL":"URL",
              "SpecialziationImgURL":"URL Imagen",
              "SuggestedTime":"Tiempo",
              "SpecializationEnrolled":"Alumnos",
              "SpecializationDescription":"Descripción"}

courses_db = pd.read_csv(filepath_or_buffer="specializations.csv")
courses_db = courses_db.rename(columns=encabezado)
courses_db.head()

Unnamed: 0,Especificación,Instructores,Puntuación,Reviews,Skills,Detalles,URL,URL Imagen,Tiempo,Alumnos,Descripción
0,Machine Learning,"DeepLearning.AI, Stanford University",4.9,(8.2k reviews),"Machine Learning, Probability & Statistics, Ma...",Beginner · Specialization · 1-3 Months,/specializations/machine-learning-introduction,https://d3njjcbhbojbot.cloudfront.net/api/util...,Approximately 3 months to completeSuggested pa...,"145,425 already enrolled",The Machine Learning Specialization is a found...
1,Introduction to Data Science,IBM Skills Network,4.6,(77.3k reviews),"Data Science, Data Structures, SQL, Computer P...",Credit Eligible,/specializations/introduction-data-science,https://d3njjcbhbojbot.cloudfront.net/api/util...,Approximately 4 months to completeSuggested pa...,"75,229 already enrolled",Interested in learning more about data science...
2,Data Science Fundamentals with Python and SQL,IBM Skills Network,4.5,(52.8k reviews),"Data Structures, Python Programming, Data Anal...",Credit Eligible,/specializations/data-science-fundamentals-pyt...,https://d3njjcbhbojbot.cloudfront.net/api/util...,Approximately 7 months to completeSuggested pa...,"31,178 already enrolled",Data Science is one of the hottest professions...
3,Key Technologies for Business,IBM Skills Network,4.7,(73.7k reviews),"Data Science, Cloud Computing, Applied Machine...",Beginner · Specialization · 1-3 Months,/specializations/key-technologies-for-business,https://d3njjcbhbojbot.cloudfront.net/api/util...,Approximately 3 months to completeSuggested pa...,"14,231 already enrolled","In this Specialization, we will cover 3 key te..."
4,Deep Learning,DeepLearning.AI,4.8,(137.9k reviews),"Deep Learning, Machine Learning, Artificial Ne...",Credit Eligible,/specializations/deep-learning,https://d3njjcbhbojbot.cloudfront.net/api/util...,Approximately 5 months to completeSuggested pa...,"742,213 already enrolled",The Deep Learning Specialization is a foundati...


Para obtener el número de alumnos por curso tenemos que mapearlo porque vienen contenido en un *string* y algunos aleatoriamente son un *integer* y algunos valores son *NaN*.

In [11]:
def mapeo(alumno,replaced_char):
    if type(alumno) == str:
        return alumno.replace(replaced_char,"").split(" ")[0]
    else:
        return alumno

courses_alumns = courses_db["Alumnos"].fillna(0)
courses_alumns = courses_alumns.map(lambda x: int(mapeo(x,",")))
courses_alumns.head()

0    145425
1     75229
2     31178
3     14231
4    742213
Name: Alumnos, dtype: int64

La forma de ver si un curso es relevante o no debería ser una función en base a su puntuación media y el ratio de inscritos y reviews, porque podemos suponer que solo los alumnos inscritos que están satisfechos con el curso dejan una review positiva y lo mismo con las reviews negativas. Así que vamos a calcular un valor intermedio que tenga en cuenta esos factores:

In [12]:
#TODO
courses_score = courses_db["Puntuación"].fillna(0)
print(courses_score.head())

courses_review = courses_db["Reviews"].fillna(0)
courses_review = courses_review.map(lambda x: mapeo(x,"("))
courses_review = courses_review.map(lambda x: float(x.replace("k",""))*1000 if "k" in str(x) else float(x))
courses_review.head()

0    4.9
1    4.6
2    4.5
3    4.7
4    4.8
Name: Puntuación, dtype: float64


0      8200.0
1     77300.0
2     52800.0
3     73700.0
4    137900.0
Name: Reviews, dtype: float64

In [13]:
# Nombres de los cursos:
courses_names = courses_db["Especificación"]
# Recomponemos las URL de los cursos:
courses_URLs = "https://www.coursera.org" + courses_db["URL"]
# Las URL de las imágenes vienen dadas enteras, no hay que generarlas:
courses_IMGs_URLs = courses_db["URL Imagen"]
# Los instructores pueden ser varios y vienen separados por comas:
courses_instructors = courses_db["Instructores"]
# Descripciones de los cursos:
courses_descriptions = courses_db["Descripción"]

print("-----INFORMACIÓN SOBRE EL PRIMER CURSO DE LA BASE DE DATOS-----\n")
print(f"- Nombre de la especificación: \n\n \t{courses_names[0]} \n")
print(f"- URL del curso: \n\n \t{courses_URLs[0]} \n")
print(f"- Imagen de cabecera del curso: \n\n \t{courses_IMGs_URLs[0]} \n")
print(f"- Instructores del curso: \n\n \t{courses_instructors[0]} \n")
print(f"- Alumnos inscritos del curso: \n\n \t{courses_alumns[0]} \n")
print(f"- Reviews del curso: \n\n \t{courses_review[0]} \n")
print(f"- Puntuación del curso: \n\n \t{courses_score[0]} \n")
print(f"- Descripción del curso: \n\n \t{courses_descriptions[0]} \n")

-----INFORMACIÓN SOBRE EL PRIMER CURSO DE LA BASE DE DATOS-----

- Nombre de la especificación: 

 	Machine Learning 

- URL del curso: 

 	https://www.coursera.org/specializations/machine-learning-introduction 

- Imagen de cabecera del curso: 

 	https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/https://d15cw65ipctsrr.cloudfront.net/3a/9d2a7af297483a845340bcfbac6f1e/MLS.course-banners-01_Course-Logo-.png?auto=format%2Ccompress%2C%20enhance&dpr=1&w=600&h=216&fit=fill&q=50 

- Instructores del curso: 

 	DeepLearning.AI, Stanford University 

- Alumnos inscritos del curso: 

 	145425 

- Reviews del curso: 

 	8200.0 

- Puntuación del curso: 

 	4.9 

- Descripción del curso: 

 	The Machine Learning Specialization is a foundational online program created in collaboration between DeepLearning.AI and Stanford Online. This beginner-friendly program will teach you the fundamentals of machine learning and how to use these techniques to build real-world AI applications. Thi

In [14]:
data = {"Name":courses_names,"Instructors":courses_instructors,"URL":courses_URLs,"Imgs":courses_IMGs_URLs,"Reviews":courses_review,"Alumns":courses_alumns,"Description":courses_descriptions}
df = pd.DataFrame(data=data)
df.to_csv(path_or_buf="courses_data.csv",sep=",",encoding='utf-8',index=False)

In [15]:
courses_skills = courses_db["Skills"].fillna("").map(lambda x: str(x).replace(" ","").split(","))
# Vamos a ver todos los tipos distintos de skills que hay
skills = []
for skill in courses_skills:
    skills = skills + skill

print(list(set(skills))[:5])
print(len(set(skills)))

len(set(courses_names))

['', 'ManagementAccounting', 'GISSoftware', 'BusinessTransformation', 'ConflictManagement']
343


724

## Modelo de Red Neuronal para Recomendación de Cursos

Un usuario puede estar definido por las skils que tiene o que quiere obtener con los cursos. Podemos representar los usuarios con vectores de **343** elementos y la salida de la red neuronal será una capa de **728** donde aplicamos una **softmax** que nos dará la probabilidad de que el curso se ajuste a lo que busca el usuario.

In [20]:
skills_dim = len(set(skills))
paths_dim = len(set(courses_URLs))

print(f"Los usuarios pueden escoger entre {skills_dim} Skills diferentes.\n")
print(f"Existen {paths_dim} Cursos distintos en la base de datos.")

Los usuarios pueden escoger entre 343 Skills diferentes.

Existen 728 Cursos distintos en la base de datos.


### Creación de los Conjuntos de Entrenamiento y Test

Ahora hay que el conjunto de entrenamiento y test para entrenar el modelo.

In [21]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
skills_one_hot = ohe.fit_transform(pd.DataFrame(sorted(skills)))

skills_one_hot = skills_one_hot.toarray()

In [22]:
input = skills_one_hot
print(input.shape)
input

(14900, 343)


array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [23]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
output = pd.DataFrame(mlb.fit_transform(courses_skills),
                   columns=mlb.classes_,
                   index=courses_skills.index)

print(output.shape)
output = np.array(output)
output

(728, 343)


array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

### Creación del Modelo

In [24]:
import tensorflow as tf
from sklearn.model_selection import train_test_split

inputs = tf.keras.Input(shape=(343,), name="User_Skills")
x = tf.keras.layers.Dense(64, activation=tf.nn.relu, name="dense_1")(inputs)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.Dense(64, activation=tf.nn.relu, name="dense_2")(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(728, activation=tf.nn.softmax, name="Recomendations")(x)

model = tf.keras.Model(inputs = inputs, outputs = outputs)

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.summary()

2024-03-15 11:26:26.016198: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-15 11:26:26.021029: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-15 11:26:26.095855: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-15 11:26:27.824488: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-15 11:26:27.825307: W tensorflow/core/common_runtime/gpu/

In [25]:
X_train, X_test, y_train, y_test = train_test_split(input, output.T, test_size=0.3, random_state=1)

ValueError: Found input variables with inconsistent numbers of samples: [14900, 343]

In [800]:
print("Fit model on training data")
history = model.fit(
    X_train,
    y_train,
    epochs=100,
    batch_size=16,
    validation_data = (X_test,y_test)
)

Fit model on training data
Epoch 1/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 10ms/step - accuracy: 0.1072 - loss: 0.6929 - val_accuracy: 0.0000e+00 - val_loss: 0.6937
Epoch 2/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.0788 - loss: 0.6929 - val_accuracy: 0.0097 - val_loss: 0.6937
Epoch 3/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.1036 - loss: 0.6930 - val_accuracy: 0.0097 - val_loss: 0.6937
Epoch 4/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.1249 - loss: 0.6929 - val_accuracy: 0.0097 - val_loss: 0.6937
Epoch 5/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.0741 - loss: 0.6929 - val_accuracy: 0.0097 - val_loss: 0.6937
Epoch 6/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.1238 - loss: 0.6929 - val_accuracy: 0.0000e+00 - val_loss: 0.693

# TODO 
Hay que hacer más datos de train buenos que no solo sean one-hot sino que sean de varios hot