## Overview

Udemy is one of the most popular e-learning platforms in the world. According to its website, the platform has over 75,000 instructors, **150,000 courses**, **250 million enrollments** and **33 million minutes** of content. This notebook takes an in-depth look at course records from the MOOC platform.

## Importing libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import itertools
from wordcloud import WordCloud, STOPWORDS

In [None]:
PATH = "../input/udemy-courses/"

df = pd.read_csv(PATH + 'udemy_courses.csv')

## Head and Tail

In [None]:
print("There are {} rows and {} columns in the dataset.".format(df.shape[0], df.shape[1]))

In [None]:
df.head(10)

In [None]:
df.tail(10)

## Check for null values and data type

In [None]:
df.info()
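
`df.info()` prints dtypes and non-null counts as text, which is hard to sort or filter. Here is a minimal sketch of a tabular alternative. The `toy` frame and the `summarize_columns` helper are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and number of missing values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
    })

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "course_title": ["Intro to SQL", None, "Guitar 101"],
    "price": [20.0, np.nan, 0.0],
    "num_lectures": [12, 30, 8],
})
summary = summarize_columns(toy)
print(summary)
```

On the real data, `summarize_columns(df)` would give the same kind of table for every column.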

## Data Cleaning

Here we try to remove some errors in our data. There were a few misplaced values, wrong data types and unsatisfactory data formats.

EDIT: According to a more recent version of the data, a lot of the issues have been fixed. Hence, a lot of code present in this section in an earlier version has been removed.

In [None]:
df['course_id'] = df['course_id'].astype(str)

In [None]:
df['published_timestamp'] = pd.to_datetime(df['published_timestamp'])
df['date_published'] = df.loc[:, 'published_timestamp'].apply(lambda s: s.date())
df['year_published'] = df.loc[:, 'published_timestamp'].apply(lambda s: s.year)
df['month_published'] = df.loc[:, 'published_timestamp'].apply(lambda s: s.month_name())
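
The same date parts can also be derived with the vectorized `.dt` accessor instead of `apply`. Below is a small sketch on a made-up frame. The two timestamps are invented examples in the ISO format the column is assumed to use.

```python
import pandas as pd

# Hypothetical timestamps; the real column is assumed to look similar.
toy = pd.DataFrame({
    "published_timestamp": ["2017-01-18T20:58:58Z", "2016-12-13T14:57:18Z"]
})
toy["published_timestamp"] = pd.to_datetime(toy["published_timestamp"])

# The .dt accessor works on the whole column at once, no lambda needed.
toy["date_published"] = toy["published_timestamp"].dt.date
toy["year_published"] = toy["published_timestamp"].dt.year
toy["month_published"] = toy["published_timestamp"].dt.month_name()
print(toy)
```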

## Exploring subjects

In [None]:
subject = df['subject'].value_counts().reset_index()

In [None]:
subject.columns = ['subject', 'counts']

In [None]:
fig = px.bar(
        subject, 
        x = 'subject', 
        y='counts', 
        color='subject',
        title='Subject Counts')
fig.update_layout(showlegend=False, width=600)
fig.show()

We see that courses are dominated by Web Development and Business Finance. Not too many surprises there, as Information Technology and Business/Management are two of the most lucrative industries to work in.

## Time Series of growth of courses by subject

In [None]:
subjects = df['subject'].unique()

subset = df[['date_published','subject']]
subset = subset.sort_values('date_published')
time_series = subset['date_published'].value_counts().reset_index()
time_series.columns = ['Date', 'Counts']
time_series = time_series.sort_values('Date')
time_series['Cum Count'] = time_series['Counts'].cumsum()
dummies = pd.get_dummies(subset['subject'])

subset = subset.join(dummies)
subset['Cum Business'] = subset['Business Finance'].cumsum()
subset['Cum Software'] = subset['Web Development'].cumsum()
subset['Cum Music'] = subset['Musical Instruments'].cumsum()
subset['Cum Design'] = subset['Graphic Design'].cumsum()
subset_melt = subset.melt(id_vars='date_published', value_vars=['Cum Business', 'Cum Software', 'Cum Design', 'Cum Music'])

fig = make_subplots(
    rows=2, 
    cols=1,
    subplot_titles=("Time series plot of number of courses",
                    "Time series plot of number of courses by subject"))
df.sort_values('date_published', inplace=True)
fig.append_trace(go.Scatter(
    x=time_series['Date'],
    y=time_series['Cum Count'],
    name="All",
    mode='lines'),
    row=1, col=1)

fig.append_trace(go.Scatter(
    x=subset['date_published'], 
    y=subset['Cum Business'], 
    mode="lines",
    name="Business",
    line=dict(color="#617C58")
),
    row=2, col=1)
fig.append_trace(go.Scatter(
    x=subset['date_published'], 
    y=subset['Cum Software'], 
    mode="lines",
    name="Software",
    line=dict(color="#74597D", dash="longdashdot"),
),
    row=2, col=1)
fig.append_trace(go.Scatter(
    x=subset['date_published'], 
    y=subset['Cum Design'], 
    
    mode="lines",
    name="Design",
    line=dict(color="#C85A17", dash="dash")
),
    row=2, col=1)
fig.append_trace(go.Scatter(
    x=subset['date_published'], 
    y=subset['Cum Music'], 
    
    mode="lines",
    name="Music",
  
    line=dict(color="#1884C7", dash="dashdot")
),
    row=2, col=1)
fig.update_layout(width=700, height=800)
fig.show()

The onset of 2016 saw a rise in the number of software/programming courses, so much so that they overtook Business just before 2017. Overall, all four categories seem to have grown well after 2016.
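
The cumulative counts per subject can also be built in one step with `pd.crosstab` followed by `cumsum`, which gives one row per date rather than one row per course. This is a sketch on made-up data, not the code used for the plot above.

```python
import pandas as pd

# Hypothetical publication dates and subjects.
toy = pd.DataFrame({
    "date_published": pd.to_datetime(
        ["2016-01-01", "2016-01-01", "2016-02-01", "2016-03-01"]),
    "subject": ["Web Development", "Business Finance",
                "Web Development", "Web Development"],
})

# Rows are dates, columns are subjects, cells are courses published that day.
daily_counts = pd.crosstab(toy["date_published"], toy["subject"])

# A running total down the rows gives the cumulative course count per subject.
cumulative = daily_counts.cumsum()
print(cumulative)
```

Each column of `cumulative` could then be passed to a `go.Scatter` trace directly.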

## Plotting level vs number of lectures

Let's try to capture the relationship between the difficulty level of a course and the number of lectures. Ideally, beginner courses should have a slightly higher number of lectures to help develop intuition about the subject matter.

In [None]:
fig = px.box(
    df,
    x='level',
    y='num_lectures',
    color='level',
    title='Boxplot of Level vs Number of Lectures')
fig.update_yaxes(range=[0,200])
fig.update_layout(showlegend=False)
fig.show()

The medians of the groups seem to be very close to each other. However, expert level courses seem to have fewer lectures towards the higher quantiles.
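
Reading quantiles off a boxplot is a bit fuzzy, so it can help to compute them per level. Below is a minimal sketch on invented numbers. The level names are assumed to match the dataset's labels.

```python
import pandas as pd

# Hypothetical lecture counts for two levels.
toy = pd.DataFrame({
    "level": ["Beginner Level"] * 4 + ["Expert Level"] * 4,
    "num_lectures": [10, 20, 30, 200, 10, 20, 30, 40],
})

# One row per level, one column per requested quantile.
quantiles = (
    toy.groupby("level")["num_lectures"]
       .quantile([0.5, 0.75, 0.9])
       .unstack()
)
print(quantiles)
```

In this toy example the medians match while the upper quantiles differ, which is the pattern described above.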

## Plotting level vs course duration

Intuitively, course duration for expert level courses should be higher due to the difficulty of the course material.

In [None]:
fig = px.violin(
    df,
    x='level',
    y='content_duration',
    color='level',
    title='Violin plot of Level vs Course Duration')
fig.update_yaxes(range=[0,40])
fig.update_layout(showlegend=False)
fig.show()

Again, the median seems to be roughly the same across all levels. However, for expert level courses, there are again fewer data points towards the higher end of the violin plot compared to the others. This goes against our initial hypothesis. Maybe these were deemed expert level courses because of the difficulty of the material as well as the shorter duration of lectures?

## Free vs Paid split

In [None]:
pf_split = df['is_paid'].value_counts().reset_index()
pf_split.columns = ['Is Paid', 'Counts']
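# color_discrete_sequence sets the slice colors; hex codes need a leading '#'.
fig = px.pie(pf_split, names='Is Paid', values='Counts', color_discrete_sequence=['#009933', '#980000'], width=500)
fig.update_layout(title="Paid Vs Free Courses")
fig.show()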

There is no such thing as a free lunch. 8.43% of our data disagrees. Moving on to the free courses.
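
A percentage like this can be computed directly, since the mean of a boolean mask is the share of `True` values. Here is a tiny sketch on made-up data, assuming `is_paid` is a boolean column.

```python
import pandas as pd

# Hypothetical flags: three paid courses and one free course.
toy = pd.DataFrame({"is_paid": [True, True, True, False]})

# ~ flips the mask so True marks free courses; the mean is their share.
free_pct = (~toy["is_paid"]).mean() * 100
print(f"{free_pct:.2f}% of courses are free")
```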

## Best free courses

In [None]:
free_df = df[df['is_paid'] == False]

Since we do not have information regarding reviews, a decent measure of 'goodness' could be the number of people subscribed to a course. Let's have a look at the top free course per subject based on subscriber count.

In [None]:
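# Sort by subscriber count, keep the first (largest) row within each subject,
# then order the result by subject.
top_rated_free = free_df.sort_values('num_subscribers', ascending=False) \
.groupby('subject') \
.head(1) \
.sort_values('subject') \
.reset_index(drop=True)

# 'level' is included because the notes below refer to it.
top_rated_free = top_rated_free[['course_title',
                                 'level',
                                 'content_duration',
                                 'published_timestamp',
                                 'num_subscribers',
                                 'subject']]
top_rated_free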

Some cool things to note about the above table:
* All courses have a difficulty level of either beginner or all.
* All courses were published during the earlier days of the platform (Udemy Series B funding happened in 2012).
* All courses have relatively short content durations (HTML 5 is higher but still relatively low compared to other technical courses).
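
Another way to pick the top course per subject is `idxmax`, which returns the row label of the largest value in each group. This is a sketch on invented data; the course titles are placeholders.

```python
import pandas as pd

# Hypothetical free courses.
toy = pd.DataFrame({
    "course_title": ["A", "B", "C", "D"],
    "subject": ["Web Development", "Web Development",
                "Graphic Design", "Graphic Design"],
    "num_subscribers": [100, 500, 40, 10],
})

# Row label of the most-subscribed course within each subject.
top_idx = toy.groupby("subject")["num_subscribers"].idxmax()

# Look those rows up in the original frame.
top_per_subject = toy.loc[top_idx].set_index("subject")
print(top_per_subject)
```

Unlike sorting, this never reorders the full frame, though ties resolve to the first occurrence.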


## Popular/Engaging Courses

Let's try to look at the most popular and engaging courses over all the subject areas. We will use subscriber count as well as reviews as measurements to plot this.

No surprises that all 10 entries are programming courses. Programming is considered one of the most important skills to learn in the 21st century.

In [None]:
top_subs = df.sort_values(by='num_subscribers', ascending=False).head(5)
top_reviews = df.sort_values(by='num_reviews', ascending=False).head(5)

fig = make_subplots(
    rows=2, 
    cols=1,
    subplot_titles=("Top 5 courses by subscriber count","Top 5 courses by review count")
)
fig.append_trace(go.Bar(
    y=top_subs['course_title'].values,
    x=top_subs['num_subscribers'].values,
    texttemplate = "%{value:,s}",
    marker=dict(color=top_subs['num_subscribers'].values, coloraxis="coloraxis"),
    textposition = "inside",
    orientation='h'
), row=1, col=1)
fig.append_trace(go.Bar(
    x=top_reviews['course_title'].values,
    y=top_reviews['num_reviews'].values,
    # Color by the review counts of the same courses being plotted.
    marker=dict(color=top_reviews['num_reviews'].values, coloraxis="coloraxis"),
    texttemplate = "%{value:,s}",
    textposition = "outside",
), row=2, col=1)
fig.update_layout(coloraxis=dict(colorscale='emrld'),height=1200, width=900, showlegend=False)
fig.show()

## Distribution of numeric values

Let us look at the distribution of numeric values present in our data.

In [None]:
fig = make_subplots(
    rows=4, 
    cols=1,
    subplot_titles=("Price distribution (Skew: {:.2f})".format(df['price'].skew()),
                    "Subscriber distribution (Skew: {:.2f})".format(df['num_subscribers'].skew()),
                    "Lecture distribution (Skew: {:.2f})".format(df['num_lectures'].skew()),
                    "Reviews distribution (Skew: {:.2f})".format(df['num_reviews'].skew())
))

fig.append_trace(go.Histogram(
x=df['price'],
marker_color='#2B65EC',
opacity=0.75,
)
, row=1, col=1)

fig.append_trace(go.Histogram(
x=df['num_subscribers'],
marker_color='#1589FF',
opacity=0.75),row=2, col=1)

fig.append_trace(go.Histogram(
x=df['num_lectures'],
marker_color='#6698FF',
opacity=0.75),row=3, col=1)

fig.append_trace(go.Histogram(
x=df['num_reviews'],
marker_color='#38ACEC',
opacity=0.75),row=4, col=1)

fig.update_xaxes(title_text="Price", row=1, col=1)
fig.update_xaxes(title_text="Count", range=[0,30000], row=2, col=1)
fig.update_xaxes(title_text="Count", range=[0, 200], row=3, col=1)
fig.update_xaxes(title_text="Count", range=[0, 1000], row=4, col=1)

fig.update_yaxes(title_text="Frequency", row=1, col=1)
fig.update_yaxes(title_text="Frequency", row=2, col=1)
fig.update_yaxes(title_text="Frequency", row=3, col=1)
fig.update_yaxes(title_text="Frequency", row=4, col=1)

# The title belongs in the layout; fig.show() does not set it.
fig.update_layout(height=1000, width=1000, showlegend=False,
                  title='Distribution of numerical columns')

fig.show()




There is heavy positive skew for 3 out of the 4 numeric columns. Most courses sit near the low end, while a long right tail typically pulls the mean above the median. A likely reason is a small number of outliers with extreme values.
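
A common way to tame this kind of skew is a `log1p` transform. The sketch below uses simulated lognormal counts, not the real columns, just to show how the skew statistic changes. The distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed counts standing in for something like num_subscribers.
rng = np.random.default_rng(0)
subscribers = pd.Series(rng.lognormal(mean=6.0, sigma=1.5, size=5000)).round()

raw_skew = subscribers.skew()
# log1p(x) = log(1 + x), which is safe for zero counts.
log_skew = np.log1p(subscribers).skew()

print(f"raw skew: {raw_skew:.2f}, log1p skew: {log_skew:.2f}")
print(f"mean: {subscribers.mean():.0f}, median: {subscribers.median():.0f}")
```

Plotting histograms of the transformed columns would likely make the bulk of the data easier to see than clipping the x-axis.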

## Relationship between numeric columns

A few useful relationships to look at between numeric columns are price vs num_subscribers, num_reviews vs num_subscribers and num_subscribers vs content_duration.


In [None]:
fig = make_subplots(
    rows=3,
    cols=1,
    )

fig.append_trace(go.Scatter(
    x=df['price'],
    y=df['num_subscribers'],
    mode='markers',
    opacity=0.75,
    marker_color='#43BFC7',
), row=1, col=1)

fig.append_trace(go.Scatter(
    x=df['num_reviews'],
    y=df['num_subscribers'],
    mode='markers',
    opacity=0.75,
    marker_color='#C74C44',
), row=2, col=1)

fig.append_trace(go.Scatter(
    x=df['num_subscribers'],
    y=df['content_duration'],
    mode='markers',
    opacity=0.75,
    marker_color='#A8C744',
), row=3, col=1)

fig.update_xaxes(title_text="Price", row=1, col=1)
fig.update_xaxes(title_text="Reviews", row=2, col=1)
fig.update_xaxes(title_text="Subscribers", row=3, col=1)

fig.update_yaxes(title_text="Subscribers", row=1, col=1)
fig.update_yaxes(title_text="Subscribers", row=2, col=1)
fig.update_yaxes(title_text="Duration (hrs)", row=3, col=1)

fig.update_layout(width=800, height=800, title="Graphs plotting relationship between numerical variables", showlegend=False)
fig.show()

There seems to be a slight positive trend for reviews vs subscribers. One hypothesis to test is whether the number of reviews influences a prospective customer's decision to buy the course. This hypothesis could be sharpened if we also had data on ratings. Unfortunately, the dataset does not provide them, so we have to make do with what we have.
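
A trend seen in a scatter plot can be backed up with a correlation coefficient. With skewed counts, a rank-based (Spearman) correlation is often more robust than Pearson, and it can be computed as the Pearson correlation of the ranks. The data below is simulated purely for illustration.

```python
import numpy as np
import pandas as pd

# Simulated counts: reviews are a noisy fraction of subscribers.
rng = np.random.default_rng(1)
subs = rng.lognormal(6.0, 1.5, size=1000)
reviews = subs * rng.uniform(0.01, 0.1, size=1000)
toy = pd.DataFrame({"num_subscribers": subs, "num_reviews": reviews})

# Pearson measures linear association on the raw values.
pearson = toy["num_reviews"].corr(toy["num_subscribers"])

# Spearman is Pearson applied to the ranks, so outliers matter less.
spearman = toy["num_reviews"].rank().corr(toy["num_subscribers"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Note that a correlation, however strong, would not tell us whether reviews drive purchases or simply accumulate with them.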

## Wordcloud

In [None]:
stopwords = set(STOPWORDS)

# Lowercase every title and join them into one string for the word cloud.
comment_words = " ".join(str(title).lower() for title in df['course_title'])
wordcloud = WordCloud(width = 800, height = 800, 
            background_color ='black', 
            stopwords = stopwords, 
            min_font_size = 10).generate(comment_words)
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show() 

The most popular words seem to be trigger words like "learn", "beginner" and "complete", meant to get customers hooked onto courses. "trading", "javascript", "guitar" and "photoshop" seem to be a few more popular non-trigger words.
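
A word cloud is nice to look at, but exact counts are easier to compare. Here is a minimal sketch with `collections.Counter`. The titles and the tiny stop list are made up for the example.

```python
from collections import Counter

# Hypothetical course titles.
titles = [
    "Learn JavaScript for Beginners",
    "Learn Guitar: The Complete Course",
    "The Complete Photoshop Course",
]
stop_words = {"the", "for", "a", "to"}  # tiny illustrative stop list

# Split on whitespace, strip surrounding punctuation, and lowercase.
tokens = [w.strip(":,.!?").lower() for t in titles for w in t.split()]
counts = Counter(w for w in tokens if w and w not in stop_words)

print(counts.most_common(3))
```

On the real data, `titles` would be `df['course_title'].astype(str)` and the stop list could be wordcloud's `STOPWORDS`.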

## Predicting subscriber count

In this part of the notebook, we will try to predict subscriber count using the data we have.

### Importing libraries

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

### Choosing our columns of interest

In [None]:
num_cols = ['price', 'num_reviews', 'num_lectures', 'content_duration']
cat_cols = ['is_paid', 'level', 'subject']
X_data, y_data = df[num_cols].merge(pd.get_dummies(df[cat_cols]), left_index=True, right_index=True), df['num_subscribers']

We subset our data and derive dummy data for the categorical variables.

### Splitting into train and test data

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(
                                    X_data, y_data, test_size=0.2, random_state=42)
col_names = X_train.columns

### Applying standard scaler

In [None]:
scaler = StandardScaler()
scaler = scaler.fit(X_train) 
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

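Note that the scaler above is fit on every column of X_train, including the dummy variables, not only on the numerical features. Tree-based models such as random forests split on feature order, so this rescaling should have little effect on the model itself.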
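
To restrict scaling to the numeric columns, one option is scikit-learn's `ColumnTransformer` with `remainder="passthrough"`. This is a sketch on a made-up frame, not the preprocessing used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical features: two numeric columns and one dummy column.
toy = pd.DataFrame({
    "price": [20.0, 50.0, 200.0, 0.0],
    "num_lectures": [10, 25, 80, 5],
    "subject_Web Development": [1, 0, 1, 0],
})
numeric = ["price", "num_lectures"]

# Scale only the numeric columns; everything else is passed through untouched.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), numeric)],
    remainder="passthrough",
)

# Output columns: scaled numeric columns first, then the passthrough columns.
scaled = preprocess.fit_transform(toy)
print(scaled)
```

In a train/test setting, `fit_transform` would be called on the training data and `transform` on the test data.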

### Model Building

In [None]:
model = RandomForestRegressor(n_estimators=500, random_state=42)

We pick a random forest regressor of 500 trees to predict subscriber count.

In [None]:
model.fit(X_train, y_train)

In [None]:
y_train_preds = model.predict(X_train)

In [None]:
print("Mean Squared Error on training data is: {:.2f}".format(mean_squared_error(y_train, y_train_preds)))

In [None]:
y_pred = model.predict(X_test)

In [None]:
print("Mean Squared Error on testing data is: {:.2f}".format(mean_squared_error(y_test, y_pred)))

While our MSE is quite high, it seems that we haven't really overfit the data. It is likely that we will need better features in order to build predictive power.
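
MSE is in squared subscribers, which makes "quite high" hard to judge. Taking the square root (RMSE) puts the error back in subscriber units, and comparing against a baseline that always predicts the training mean gives it some context. The numbers below are invented, and `rmse` is a small helper written for this sketch.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets and predictions.
y_train_toy = np.array([100.0, 300.0, 500.0, 700.0])
y_test_toy = np.array([100.0, 200.0, 900.0])
model_preds = np.array([150.0, 250.0, 700.0])

# Baseline: always predict the mean of the training targets.
baseline_preds = np.full_like(y_test_toy, y_train_toy.mean())

model_rmse = rmse(y_test_toy, model_preds)
baseline_rmse = rmse(y_test_toy, baseline_preds)
print(f"model RMSE: {model_rmse:.1f}, baseline RMSE: {baseline_rmse:.1f}")
```

If the model's RMSE is not clearly below the baseline's, the features are adding little.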

## Feature Importance

In [None]:
imp_features = pd.Series(model.feature_importances_, index=col_names).nlargest(5)
px.bar(x=imp_features.index, y=imp_features.values,
       labels={'x':"Features", 'y':"Importance Criterion"},
       color=imp_features.index,
       color_discrete_sequence=px.colors.qualitative.T10,
       title="Feature Importance")

We see that the number of reviews plays a massive role in predicting subscriber count. Interestingly, price does not seem to be a big factor. One reason for this could be the fact that most courses on Udemy are quite cheap. 
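
The impurity-based values in `feature_importances_` are computed on the training data, so a common cross-check is permutation importance, which measures how much the score drops when one feature is shuffled. The sketch below uses simulated data where only the first feature matters; it is not run on the Udemy data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Simulated data: column 0 drives the target, column 1 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Shuffle each feature several times and record the average drop in score.
result = permutation_importance(forest, X, y, n_repeats=5, random_state=42)
print(result.importances_mean)
```

For this notebook, the call would ideally use the held-out `X_test` and `y_test` so that the importances reflect generalization rather than fit.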

I feel that information on course ratings could have been another major factor in predicting subscriber count. It would be interesting to see how much the model can be improved with it. I've left that as an exercise.

I hope all of you enjoyed this notebook. Do tell me if you are able to find out more data and are able to build better models. 😁