# Udemy development courses analysis

In this notebook I'll take a look at a large number of Udemy courses in the category "Development".  
There are courses from Finance, Accounting, Bookkeeping, Compliance, Cryptocurrency, Blockchain, Economics, Investing & Trading, Taxes and much more, each with multiple courses under its domain.  

I am interested in developing a Machine Learning model that tries to predict the rating for a course.

# 1. Data

The original data comes from [Kaggle](https://www.kaggle.com/jilkothari/finance-accounting-courses-udemy-13k-course).

# 2. Features

The dataset has 20 columns; the main ones are described below:

* `id`: The course ID of that particular course.
* `title`: Shows the unique names of the courses available under the development category on Udemy.
* `url`: Gives the URL of the course.
* `is_paid`: Returns a boolean value displaying true if the course is paid and false if otherwise.
* `num_subscribers`: Shows the number of people who have subscribed to that course.
* `avg_rating_recent`: Reflects the recent changes in the average rating.
* `num_reviews`: The number of reviews that a course has received.
* `num_published_lectures`: Shows the number of lectures the course offers.
* `num_published_practice_tests`: Gives an idea of the number of practice tests that a course offers.
* `created`: The time of creation of the course.
* `published_time`: Time of publishing the course.
* `discount_price_amount`: The discounted price at which a course is being offered.
* `discount_price_currency`: The currency of the discounted price.
* `price_detail_amount`: The original price of a particular course.
* `price_detail_currency`: The currency corresponding to the price detail amount for a course.

I will use `avg_rating` as the target feature.

# 3. Evaluation

Since I am trying to predict a numeric value, **this will be a regression problem**.  
For this reason, I will evaluate my model using the standard measures for a regression problem, most importantly `mean squared error (MSE)`.
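
As a quick reminder of what the metric measures, here is a minimal sketch that computes MSE (and its square root, RMSE) by hand. The ratings and predictions are made-up values used only for illustration; `sklearn.metrics.mean_squared_error` should give the same MSE for the same inputs.

```python
import numpy as np

# Hypothetical true ratings and model predictions (illustrative values only)
y_true = np.array([4.5, 4.0, 3.5, 5.0])
y_pred = np.array([4.0, 4.0, 4.0, 4.0])

# MSE: the mean of the squared differences between truth and prediction
mse = np.mean((y_true - y_pred) ** 2)
# RMSE is expressed in the same unit as the rating, so it is easier to read
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Because the errors are squared, a few badly predicted courses weigh more than many small misses, which is worth keeping in mind when reading the score.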

# 4. Exploratory analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

seed = 893145

In [2]:
df = pd.read_csv('./data/udemy-development-courses.csv')
df.head()

Unnamed: 0,id,title,url,is_paid,num_subscribers,avg_rating,avg_rating_recent,rating,num_reviews,is_wishlisted,num_published_lectures,num_published_practice_tests,created,published_time,discount_price_amount,discount_price_currency,discount_price_price_string,price_detail_amount,price_detail_currency,price_detail_price_string
0,762616,The Complete SQL Bootcamp 2020: Go from Zero t...,/course/the-complete-sql-bootcamp/,True,295509,4.66019,4.67874,4.67874,78006,False,84,0,2016-02-14T22:57:48Z,2016-04-06T05:16:11Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
1,937678,Tableau 2020 A-Z: Hands-On Tableau Training fo...,/course/tableau10/,True,209070,4.58956,4.60015,4.60015,54581,False,78,0,2016-08-22T12:10:18Z,2016-08-23T16:59:49Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
2,1361790,PMP Exam Prep Seminar - PMBOK Guide 6,/course/pmp-pmbok6-35-pdus/,True,155282,4.59491,4.59326,4.59326,52653,False,292,2,2017-09-26T16:32:48Z,2017-11-14T23:58:14Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
3,648826,The Complete Financial Analyst Course 2020,/course/the-complete-financial-analyst-course/,True,245860,4.54407,4.53772,4.53772,46447,False,338,0,2015-10-23T13:34:35Z,2016-01-21T01:38:48Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
4,637930,An Entire MBA in 1 Course:Award Winning Busine...,/course/an-entire-mba-in-1-courseaward-winning...,True,374836,4.4708,4.47173,4.47173,41630,False,83,0,2015-10-12T06:39:46Z,2016-01-11T21:39:33Z,455.0,INR,₹455,8640.0,INR,"₹8,640"


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13608 entries, 0 to 13607
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            13608 non-null  int64  
 1   title                         13608 non-null  object 
 2   url                           13608 non-null  object 
 3   is_paid                       13608 non-null  bool   
 4   num_subscribers               13608 non-null  int64  
 5   avg_rating                    13608 non-null  float64
 6   avg_rating_recent             13608 non-null  float64
 7   rating                        13608 non-null  float64
 8   num_reviews                   13608 non-null  int64  
 9   is_wishlisted                 13608 non-null  bool   
 10  num_published_lectures        13608 non-null  int64  
 11  num_published_practice_tests  13608 non-null  int64  
 12  created                       13608 non-null  object 
 13  p

We can see that there are some missing values. Let's get a more accurate view.

In [4]:
df.isna().sum()

id                                 0
title                              0
url                                0
is_paid                            0
num_subscribers                    0
avg_rating                         0
avg_rating_recent                  0
rating                             0
num_reviews                        0
is_wishlisted                      0
num_published_lectures             0
num_published_practice_tests       0
created                            0
published_time                     0
discount_price_amount           1403
discount_price_currency         1403
discount_price_price_string     1403
price_detail_amount              497
price_detail_currency            497
price_detail_price_string        497
dtype: int64
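
Absolute counts are easier to judge as shares of the whole dataset: here 1403 of 13608 rows is roughly 10.3%, and 497 of 13608 is roughly 3.7%. A small sketch of how to get those shares directly, using a toy frame (`toy`, with made-up values) in place of `df`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df` (values are made up for illustration)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "discount_price_amount": [455.0, np.nan, 455.0, np.nan],
    "price_detail_amount": [8640.0, 8640.0, np.nan, 1280.0],
})

# isna().mean() gives the fraction of missing entries per column
missing_share = toy.isna().mean()

# Show only the columns that actually have gaps, worst first
print(missing_share[missing_share > 0].sort_values(ascending=False))
```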

## Filling missing values

Scikit-learn offers simple imputers that fit neatly into a pipeline. I will use just that.  

First, however, we have to understand **which** value to impute each column with.

In [5]:
df.discount_price_currency.value_counts() / len(df)

INR    0.896899
Name: discount_price_currency, dtype: float64

In [6]:
df.price_detail_currency.value_counts()/ len(df)

INR    0.963477
Name: price_detail_currency, dtype: float64

Since `INR` is the only value present, we won't cause much trouble by imputing the missing values with it.  
Later in modeling, that will result in a useless (constant) feature, but for now let's not think about it.  
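
A slightly more explicit way to look at such a column, sketched on a toy series (`currency`, made-up values): `value_counts(normalize=True, dropna=False)` reports the missing entries as their own group, and `nunique()` confirms whether only one distinct value exists.

```python
import pandas as pd

# Toy column standing in for `df.discount_price_currency` (made-up values)
currency = pd.Series(["INR", "INR", None, "INR", None])

# normalize=True returns shares; dropna=False keeps missing entries as a group
shares = currency.value_counts(normalize=True, dropna=False)
print(shares)

# nunique() ignores missing values by default; 1 means the column
# becomes constant once the gaps are filled with that value
n_distinct = currency.nunique()
print(n_distinct)
```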

Now let's focus on `discount_price_price_string` and `price_detail_price_string`.  
I suspect that their values are just the string version of, respectively, `discount_price_amount` and `price_detail_amount`.  
If this guess is correct, these columns are redundant and can therefore be dropped.  
Let's check.

In [7]:
list(df.discount_price_amount.value_counts()) == list(df.discount_price_price_string.value_counts())

True

In [8]:
list(df.price_detail_amount.value_counts()) == list(df.price_detail_price_string.value_counts())

True

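Exactly.  
The frequency counts match, which supports the guess (strictly speaking, matching counts are a necessary condition rather than a proof).  
I will then `drop()` these two.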
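
For a stricter check than comparing frequency counts, one could parse the price strings and compare them row by row with the numeric column. The sketch below does this on a toy frame with made-up values; the helper `price_string_to_number` is hypothetical and assumes the strings look like `₹8,640` (a currency symbol plus thousands separators).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df` (values are made up for illustration)
toy = pd.DataFrame({
    "price_detail_amount": [8640.0, 1280.0, np.nan, 455.0],
    "price_detail_price_string": ["₹8,640", "₹1,280", None, "₹455"],
})

def price_string_to_number(strings: pd.Series) -> pd.Series:
    """Strip everything except digits and the decimal point, then parse."""
    cleaned = strings.str.replace(r"[^0-9.]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce").astype("float64")

parsed = price_string_to_number(toy["price_detail_price_string"])

# Series.equals treats NaN values in the same position as equal
is_redundant = parsed.equals(toy["price_detail_amount"])
print(is_redundant)
```

If this comparison holds on the real columns, the string columns carry no extra information and dropping them is safe.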

In [9]:
df.drop(['discount_price_price_string', 'price_detail_price_string'], axis=1, inplace=True)

We are then left with these four columns to impute:
* `discount_price_amount`
* `price_detail_amount`
* `discount_price_currency`
* `price_detail_currency`

As noted above, the `*_currency` columns will be filled with their most frequent value (`INR`), while the `*_amount` columns, whose `dtype` is numeric, will be filled with their mean.

In [10]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

cat_features = ['discount_price_currency', 'price_detail_currency']
num_features = ['discount_price_amount', 'price_detail_amount']

cat_imp = SimpleImputer(strategy="most_frequent")
num_imp = SimpleImputer(strategy="mean")

imputer = ColumnTransformer([
    ('cat_imputer', cat_imp, cat_features),
    ('num_imputer', num_imp, num_features)
])

imp_data = imputer.fit_transform(df)
imp_data.shape

(13608, 4)

In [11]:
filled_df = pd.DataFrame(imp_data, columns=cat_features+num_features)
filled_df

Unnamed: 0,discount_price_currency,price_detail_currency,discount_price_amount,price_detail_amount
0,INR,INR,455,8640
1,INR,INR,455,8640
2,INR,INR,455,8640
3,INR,INR,455,8640
4,INR,INR,455,8640
...,...,...,...,...
13603,INR,INR,493.944,4646.99
13604,INR,INR,493.944,4646.99
13605,INR,INR,493.944,4646.99
13606,INR,INR,493.944,4646.99


In [12]:
# drop preexisting columns (that still have missing values)
df.drop(cat_features+num_features, axis=1, inplace=True)
# join the original df with the new filled features
df = df.join(filled_df)
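
One side effect worth checking: `ColumnTransformer` stacks the imputed string and numeric columns into a single array, which is likely to end up with `dtype=object`, so the `*_amount` columns in `filled_df` may no longer be numeric. The sketch below reproduces this on a toy array (made-up values; `toy_filled` is a hypothetical name) and converts the column back with `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `imp_data`: strings and floats in one NumPy array
# force dtype=object (values are made up for illustration)
mixed = np.array([["INR", 455.0], ["INR", 493.944]], dtype=object)
toy_filled = pd.DataFrame(
    mixed, columns=["discount_price_currency", "discount_price_amount"]
)

# The amount column is stored as generic Python objects at this point
before = toy_filled["discount_price_amount"].dtype
print(before)

# Convert it back so that arithmetic and models treat it as a number
toy_filled["discount_price_amount"] = pd.to_numeric(
    toy_filled["discount_price_amount"]
)
print(toy_filled["discount_price_amount"].dtype)
```

On the real data, the same conversion could be applied to the columns listed in `num_features` after building `filled_df`.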

In [13]:
df.head()

Unnamed: 0,id,title,url,is_paid,num_subscribers,avg_rating,avg_rating_recent,rating,num_reviews,is_wishlisted,num_published_lectures,num_published_practice_tests,created,published_time,discount_price_currency,price_detail_currency,discount_price_amount,price_detail_amount
0,762616,The Complete SQL Bootcamp 2020: Go from Zero t...,/course/the-complete-sql-bootcamp/,True,295509,4.66019,4.67874,4.67874,78006,False,84,0,2016-02-14T22:57:48Z,2016-04-06T05:16:11Z,INR,INR,455,8640
1,937678,Tableau 2020 A-Z: Hands-On Tableau Training fo...,/course/tableau10/,True,209070,4.58956,4.60015,4.60015,54581,False,78,0,2016-08-22T12:10:18Z,2016-08-23T16:59:49Z,INR,INR,455,8640
2,1361790,PMP Exam Prep Seminar - PMBOK Guide 6,/course/pmp-pmbok6-35-pdus/,True,155282,4.59491,4.59326,4.59326,52653,False,292,2,2017-09-26T16:32:48Z,2017-11-14T23:58:14Z,INR,INR,455,8640
3,648826,The Complete Financial Analyst Course 2020,/course/the-complete-financial-analyst-course/,True,245860,4.54407,4.53772,4.53772,46447,False,338,0,2015-10-23T13:34:35Z,2016-01-21T01:38:48Z,INR,INR,455,8640
4,637930,An Entire MBA in 1 Course:Award Winning Busine...,/course/an-entire-mba-in-1-courseaward-winning...,True,374836,4.4708,4.47173,4.47173,41630,False,83,0,2015-10-12T06:39:46Z,2016-01-11T21:39:33Z,INR,INR,455,8640


In [14]:
df.isna().sum()

id                              0
title                           0
url                             0
is_paid                         0
num_subscribers                 0
avg_rating                      0
avg_rating_recent               0
rating                          0
num_reviews                     0
is_wishlisted                   0
num_published_lectures          0
num_published_practice_tests    0
created                         0
published_time                  0
discount_price_currency         0
price_detail_currency           0
discount_price_amount           0
price_detail_amount             0
dtype: int64

Ready to proceed!