<a href="https://colab.research.google.com/github/KJTechnologies/personal-website/blob/master/Udemy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Title: Predicting Udemy Course Popularity**

## Introduction:

In the era of online learning and digital education platforms, Udemy has emerged as a prominent destination for both instructors and learners. With thousands of courses available on a wide range of subjects, course creators and instructors need to understand what makes a course popular and appealing to potential learners. This project uses data analysis and visualization techniques to answer a set of questions about courses on the platform.

The popularity of a course on Udemy is often determined by the number of subscribers, indicating the level of interest and engagement among learners. Accurately predicting course popularity is not only of interest to course creators and instructors but also to Udemy itself, which can use such predictions to improve recommendations and understand market trends.


The dataset under examination contains 3,678 course records with 12 columns each, spanning a range of subjects, levels of expertise, and content durations. The analysis is guided by the following questions:


What is the distribution of course levels in the dataset?

What is the average price of paid courses in the dataset?

How does the number of subscribers correlate with course pricing?

Which course has the highest number of subscribers, and what are its title and price?

What is the average course duration in the dataset?

Can the longest course in terms of content duration be identified, and what is its title?

What subjects or categories dominate the dataset in terms of course distribution?

How many free courses (is_paid = False) are present, and which free courses are the most popular in terms of subscribers?

Is there a relationship between the number of reviews and the number of subscribers for courses?

How has the number of courses offered evolved over time, and is there a discernible trend?

Can courses with a high number of lectures be identified, and do they correlate with high numbers of subscribers?

Is there a correlation between the course level and the number of reviews received?

What are the characteristics of the most expensive course in the dataset?

What is the overall distribution of course prices in the dataset?

Can courses with a high number of subscribers relative to the number of reviews be identified, suggesting high engagement?



These questions guide the exploration and analysis of the dataset in the rest of this notebook.

In [10]:
# Necessary libraries for this project

# Data manipulation and analysis
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(0)


In [11]:
# load dataset
df = pd.read_csv("/content/udemy_courses.csv")

In [12]:
# first five rows
df.head()

,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,True,200,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,True,75,2792,923,274,All Levels,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,True,45,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,True,95,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,True,200,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
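
As a minimal sketch of how a few of the questions from the introduction might translate into pandas operations, the block below rebuilds only the five preview rows shown above (with a subset of their columns) as a small stand-in frame. The names `sample`, `level_counts`, `avg_paid_price`, `price_subs_corr`, and `top_course` are illustrative assumptions, not part of the notebook, and results computed on five rows say nothing about the full 3,678-row dataset.

```python
import pandas as pd

# Stand-in frame: the five rows printed by df.head(), restricted to the
# columns needed here. On the real data, use `df` in place of `sample`.
sample = pd.DataFrame({
    "course_id": [1070968, 1113822, 1006314, 1210588, 1011058],
    "is_paid": [True, True, True, True, True],
    "price": [200, 75, 45, 95, 200],
    "num_subscribers": [2147, 2792, 2174, 2451, 1276],
    "level": [
        "All Levels", "All Levels", "Intermediate Level",
        "All Levels", "Intermediate Level",
    ],
})

# Distribution of course levels: number of courses per level.
level_counts = sample["level"].value_counts()

# Average price of paid courses: filter on is_paid, then take the mean.
avg_paid_price = sample.loc[sample["is_paid"], "price"].mean()

# Pearson correlation between subscriber count and price.
price_subs_corr = sample["num_subscribers"].corr(sample["price"])

# Row of the course with the most subscribers.
top_course = sample.loc[sample["num_subscribers"].idxmax()]

print(level_counts)
print("Average paid price:", avg_paid_price)
print("Subscribers/price correlation:", round(price_subs_corr, 3))
print("Top course id:", top_course["course_id"])
```

The same expressions should work unchanged on the full frame by swapping `sample` for `df`, assuming the column names match those in the preview.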


In [None]:
# information about columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   course_id            3678 non-null   int64  
 1   course_title         3678 non-null   object 
 2   url                  3678 non-null   object 
 3   is_paid              3678 non-null   bool   
 4   price                3678 non-null   int64  
 5   num_subscribers      3678 non-null   int64  
 6   num_reviews          3678 non-null   int64  
 7   num_lectures         3678 non-null   int64  
 8   level                3678 non-null   object 
 9   content_duration     3678 non-null   float64
 10  published_timestamp  3678 non-null   object 
 11  subject              3678 non-null   object 
dtypes: bool(1), float64(1), int64(5), object(5)
memory usage: 319.8+ KB
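
The summary above reports `published_timestamp` with dtype `object`, meaning the timestamps are stored as plain strings. Questions about how the number of courses changes over time would likely need them parsed into datetimes first. The block below is a hedged sketch of one way to do that with `pd.to_datetime`, using the five timestamps from the preview rows; the names `stamps`, `published`, and `courses_per_year` are assumptions introduced for illustration.

```python
import pandas as pd

# The five timestamps from the df.head() preview, as ISO 8601 strings.
stamps = pd.Series(
    [
        "2017-01-18T20:58:58Z",
        "2017-03-09T16:34:20Z",
        "2016-12-19T19:26:30Z",
        "2017-05-30T20:07:24Z",
        "2016-12-13T14:57:18Z",
    ],
    name="published_timestamp",
)

# Parse the strings into timezone-aware datetimes (the trailing "Z" means UTC).
published = pd.to_datetime(stamps, utc=True)

# Count courses per publication year, ordered chronologically.
courses_per_year = published.dt.year.value_counts().sort_index()

print(published.dtype)
print(courses_per_year)
```

On the full dataset the equivalent step would be along the lines of `df["published_timestamp"] = pd.to_datetime(df["published_timestamp"], utc=True)`, assuming every value in the column follows the same format as the preview rows.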
