## Exploratory Data Analysis on Online Course Enrollment Data

### course_df :
### ratings_df :


Objectives
- Identify keywords in course titles using a WordCloud.
- Calculate summary statistics and create visualizations of the online course content dataset.
- Determine popular course genres.
- Calculate summary statistics and create visualizations of the online course enrollment dataset.
- Identify courses with the greatest number of enrolled students.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# only draw static images in the notebook
%matplotlib inline  

# set a random state
rs = 123

In [3]:
# dataset paths (relative to the notebook's working directory)
course_genre_path = 'Course_Recommendation/CourseRecommendation_dataset/course_genre.csv'
ratings_path = 'Course_Recommendation/CourseRecommendation_dataset/ratings.csv'

# load dataset
course_df = pd.read_csv(course_genre_path)
ratings_df = pd.read_csv(ratings_path)



FileNotFoundError: [Errno 2] No such file or directory: 'Course_Recommendation\\CourseRecommendation_dataset\\course_genre.csv'

FileNotFoundError: [Errno 2] No such file or directory: 'Course_Recommendation/CourseRecommendation_dataset/course_genre.csv'
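
The `FileNotFoundError` above indicates that the relative paths did not resolve from the notebook's working directory. A minimal sketch for checking this before calling `pd.read_csv`, assuming a hypothetical helper named `missing_files` that is not part of the original notebook:

```python
from pathlib import Path

def missing_files(paths, base_dir="."):
    """Return the entries of `paths` that are not files under base_dir (hypothetical helper)."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]

# The two relative paths used in the loading cell
paths = [
    "Course_Recommendation/CourseRecommendation_dataset/course_genre.csv",
    "Course_Recommendation/CourseRecommendation_dataset/ratings.csv",
]
print("Working directory:", Path.cwd())
print("Missing files:", missing_files(paths))
```

If either file is reported as missing, the paths likely need to be adjusted relative to the printed working directory.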

# Exploring the course genre dataset

In [None]:
course_df.shape 

In [None]:
# Number of unique courses
course_df['COURSE_ID'].nunique()

In [None]:
course_df.columns 

In [None]:
course_df.head()

In [None]:
course_df.dtypes

# COURSE_ID and TITLE are string columns, and
# all the course genre columns are binary integers.
# A value of 1 in a genre column means the course belongs to that genre,
# while 0 means it does not.

In [None]:
course_df.iloc[1]
# course 'accelerating deep learning with gpu' is associated with 
# genres Python, MachineLearning, and DataScience
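
Reading the genres off a printed row works for one course, but the same lookup can be done programmatically. The sketch below assumes the layout described above (two leading columns followed by binary genre columns); `toy_df` and `genres_of` are illustrative names, and the values are made up.

```python
import pandas as pd

# Toy frame with the same layout as course_df (values are illustrative only)
toy_df = pd.DataFrame({
    "COURSE_ID": ["C1", "C2"],
    "TITLE": ["course one", "course two"],
    "Python": [1, 0],
    "MachineLearning": [1, 0],
    "BigData": [0, 1],
})

def genres_of(df, row_idx):
    """Return the names of the genre columns equal to 1 for the row at position row_idx."""
    genre_flags = df.iloc[row_idx, 2:]  # skip COURSE_ID and TITLE
    return genre_flags[genre_flags == 1].index.tolist()

print(genres_of(toy_df, 0))
```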

# Creating wordcloud from course titles

In [None]:
# join all the title values into one string
titles = " ".join(title for title in course_df['TITLE'].astype(str))

# filter common stop words and some less meaningful words
stopwords = set(STOPWORDS)
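# stop words are matched against single tokens, so "getting started"
# is listed as two separate words
stopwords.update(["getting", "started", "using", "enabling", 
                  "template", "university", "end", 
                  "introduction", "basic"])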

# Create a WordCloud object and
# generate the word cloud from the titles.
wordcloud = WordCloud(stopwords=stopwords, background_color="white", width=800, height=400)
wordcloud.generate(titles)

# Visualize the generated wordcloud
# (create the figure first so that the axis and layout settings apply to it)
plt.figure(figsize=(40,20))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()


# Analyze course genres

In [None]:
# all courses with genre MachineLearning == 1
course_df.query('MachineLearning==1')

# all courses with genres MachineLearning == 1 and BigData == 1
course_df.query('MachineLearning==1 and BigData==1')
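
The `query` strings above are shorthand for boolean-mask filtering. As a small sketch on made-up data (`toy_df` is illustrative, not the real dataset), the two forms select the same rows:

```python
import pandas as pd

# Toy frame with two genre columns (values are illustrative only)
toy_df = pd.DataFrame({
    "COURSE_ID": ["C1", "C2", "C3"],
    "TITLE": ["a", "b", "c"],
    "MachineLearning": [1, 1, 0],
    "BigData": [0, 1, 1],
})

# Same filter written as a query string and as an explicit boolean mask
by_query = toy_df.query("MachineLearning == 1 and BigData == 1")
by_mask = toy_df[(toy_df["MachineLearning"] == 1) & (toy_df["BigData"] == 1)]

print(by_query.equals(by_mask))
```

The mask form can be easier to build when the genre names come from a variable rather than a fixed string.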

In [None]:
# generate a sorted course count per genre
genres = course_df.columns[2:]
genre_sums = course_df[genres].sum(axis=0)
genre_count = pd.DataFrame(genre_sums, columns = ['Count']).sort_values(by = "Count", ascending=False)
genre_count

In [None]:
# plot course genre counts using a barchart. 
# The x-axis is the course genre and 
# the y-axis is the course count per genre.
fig, ax = plt.subplots(figsize=(8,4))
# x is the genre index and y is the per-genre course count
sns.barplot(x=genre_count.index, y=genre_count['Count'], color='goldenrod', ax=ax, label="Course Genre Counts")
ax.set_xlabel("Course Genre")
plt.xticks(rotation=90)
ax.set_ylabel("Count")
ax.legend()
plt.show()

# Analyze course enrollment

This dataset contains three columns: `user`, a unique user id; `item`, a course id; and `rating`, the course enrollment mode.
In an online learning scenario, the users are learners or students who enroll in courses. Following the standard recommender-system naming convention, each learner is called a `user`, each course an `item`, and the enrollment mode or interaction a `rating`. That is why the columns are named `user`, `item`, and `rating` rather than `learner`, `course`, and `enrollment`.

In this project, these terms may be used interchangeably.


In [None]:
ratings_df.head()

In [None]:
# The rating column contains one of two values:
# `2` means the user audited the course without completing it and
# `3` means the user completed the course and earned a certificate.
# Two other possible values do not appear in this project's data:
# `0` or `NA` means the user has had no interaction with the course and
# `1` means the user only browsed the course.
ratings_df['rating'].unique()
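
Beyond listing the unique values, it can help to see how many enrollments fall into each mode. The sketch below uses made-up rows in the same `user`/`item`/`rating` layout; the label names in `mode_labels` are assumptions chosen to match the description above, not part of the dataset.

```python
import pandas as pd

# Toy ratings in the same user/item/rating layout (values are illustrative only)
toy_ratings = pd.DataFrame({
    "user": [1, 1, 2, 3],
    "item": ["C1", "C2", "C1", "C3"],
    "rating": [3, 2, 3, 2],
})

# Hypothetical readable labels for the two enrollment modes
mode_labels = {2: "audited", 3: "completed"}

# Count enrollments per mode
mode_counts = toy_ratings["rating"].map(mode_labels).value_counts()
print(mode_counts)
```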


In [None]:
ratings_df.shape

In [None]:
# Number of ratings per user
user_rating_count = ratings_df.groupby(['user']).agg({'rating':'count'}).rename(columns={'rating':'num_rating'}).sort_values(by='num_rating',ascending=False)
user_rating_count.head()

In [None]:
# summary statistics of the number of enrollments per user
user_rating_count.describe()

In [None]:
# Plot the histogram of user rating counts.
user_rating_count.hist()

# Find the Top-20 Most Popular Courses

In [None]:
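# count enrollments per course and keep the 20 most-enrolled courses
# ('item' is the group index, so only the 'rating' column needs renaming)
courses_rating_count = ratings_df.groupby(['item']).agg({'rating':'count'}).rename(columns={'rating':'num_rating'}).sort_values(by='num_rating',ascending=False).head(20)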
courses_rating_count

# join the course titles from course_df
# so that the most popular courses can be identified by name
top_courses = courses_rating_count.merge(course_df[['COURSE_ID','TITLE']], how='left',left_on='item', right_on='COURSE_ID')[['TITLE','num_rating']]
top_courses



In [None]:
# Get the percentage of the top-20 course enrollments
total = ratings_df.shape[0]
top_pct = (top_courses['num_rating'].sum()/total)*100
print(f"Percentage of the top course enrollments {round(top_pct, 2)}%")
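
The percentage above is computed for a fixed top 20. As a sketch of the same calculation for an arbitrary cutoff, the hypothetical helper `top_k_share` below works directly on the ratings table; the toy rows are made up for illustration.

```python
import pandas as pd

# Toy ratings in the same user/item/rating layout (values are illustrative only)
toy_ratings = pd.DataFrame({
    "user": [1, 2, 3, 1, 2, 1],
    "item": ["C1", "C1", "C1", "C2", "C2", "C3"],
    "rating": [3, 3, 2, 2, 3, 2],
})

def top_k_share(ratings, k):
    """Percentage of all enrollment rows that belong to the k most-enrolled items."""
    counts = ratings["item"].value_counts()  # sorted in descending order
    return counts.head(k).sum() / len(ratings) * 100

print(f"Top-1 share: {top_k_share(toy_ratings, 1):.2f}%")
```

Calling the helper for several values of `k` gives a quick view of how concentrated enrollments are in the most popular courses.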