# Initial Findings: Udemy Courses

In this first section, we analyze Udemy courses using datasets provided by Chase Willden on [data.world](https://data.world/chasewillden). The data covers four course categories:
- Web Development
- Graphic Design
- Music Instruments
- Business Finance
 
## Preprocessing

Importing basic packages:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [2]:
df_WebDev = pd.read_csv('../Data/raw/WebDevelopment.csv')
df_GraphDesig = pd.read_csv('../Data/raw/GraphicDesign.csv')
df_MusicInstr = pd.read_csv('../Data/raw/MusicInstraments.csv')
df_BussFinan = pd.read_csv('../Data/raw/BusinessFinance.csv')

Drop the columns that contain no values at all (`Unnamed: 11`, `Unnamed: 12`):

In [3]:
df_WebDev.dropna(axis=1, how='all', inplace=True)
df_GraphDesig.dropna(axis=1, how='all', inplace=True)
df_MusicInstr.dropna(axis=1, how='all', inplace=True)
df_BussFinan.dropna(axis=1, how='all', inplace=True)
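
To illustrate what `dropna(axis=1, how='all')` removes, here is a minimal sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical, not taken from the dataset). Only columns in which every value is missing are dropped; partially filled columns survive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from the Udemy data
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'Unnamed: 11': [np.nan, np.nan, np.nan],  # entirely empty
    'Total': [np.nan, 5.0, np.nan],           # partially filled
})

# how='all' drops a column only if every entry is NaN
cleaned = toy.dropna(axis=1, how='all')
print(list(cleaned.columns))
```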

Inserting `category` column:

In [4]:
df_WebDev['category'] = 'WebDevelopment'
df_GraphDesig['category'] = 'GraphicDesign'
df_MusicInstr['category'] = 'MusicInstrument'
df_BussFinan['category'] = 'BussinessFinance'

In [5]:
df_courses = pd.concat([df_WebDev, df_GraphDesig, df_MusicInstr, df_BussFinan], ignore_index=True, sort=False)
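
As a rough illustration of how `pd.concat` behaves when the input frames do not share all columns, the sketch below uses two made-up frames (`a`, `b`, and their values are hypothetical). Columns missing from one input are filled with `NaN`, which is one plausible way sparse columns can end up in a combined frame.

```python
import pandas as pd

# Hypothetical inputs: b has an extra column that a lacks
a = pd.DataFrame({'id': [1, 2], 'price': ['20', '50']})
b = pd.DataFrame({'id': [3], 'price': ['75'], 'Column1': ['x']})

# ignore_index=True renumbers the rows 0..n-1;
# sort=False keeps the columns in order of first appearance
combined = pd.concat([a, b], ignore_index=True, sort=False)

print(combined.index.tolist())
print(combined.columns.tolist())
print(int(combined['Column1'].isna().sum()))
```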

Quick inspection:

In [6]:
df_courses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 16 columns):
id                      3683 non-null int64
title                   3683 non-null object
url                     3683 non-null object
isPaid                  3683 non-null object
price                   3683 non-null object
numSubscribers          3683 non-null int64
numReviews              3683 non-null int64
numPublishedLectures    3683 non-null int64
instructionalLevel      3683 non-null object
contentInfo             3683 non-null object
publishedTime           3683 non-null object
Is Paid                 8 non-null object
Total                   8 non-null float64
Percent                 4 non-null object
category                3683 non-null object
Column1                 5 non-null object
dtypes: float64(1), int64(4), object(11)
memory usage: 460.5+ KB


Drop columns with any missing value:

In [7]:
df_courses.dropna(axis=1, how='any', inplace=True)
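
Because `how='any'` removes a column as soon as a single value is missing, it can help to list the affected columns before dropping them. A small sketch on a made-up frame (the `toy` frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from the Udemy data
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'Percent': [np.nan, '50%', np.nan],  # two missing values
})

# Count missing values per column and keep the names with at least one
missing = toy.isna().sum()
to_drop = missing[missing > 0].index.tolist()
print(to_drop)

# The same columns disappear with how='any'
remaining = toy.dropna(axis=1, how='any').columns.tolist()
print(remaining)
```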

In [8]:
df_courses.head()

|   | id | title | url | isPaid | price | numSubscribers | numReviews | numPublishedLectures | instructionalLevel | contentInfo | publishedTime | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28295 | Learn Web Designing & HTML5/CSS3 Essentials in... | https://www.udemy.com/build-beautiful-html5-we... | True | 75 | 43285 | 525 | 24 | All Levels | 4 hours | 2013-01-03T00:55:31Z | WebDevelopment |
| 1 | 19603 | Learning Dynamic Website Design - PHP MySQL an... | https://www.udemy.com/learning-dynamic-website... | True | 50 | 47886 | 285 | 125 | All Levels | 12.5 hours | 2012-06-18T16:52:34Z | WebDevelopment |
| 2 | 889438 | ChatBots: Messenger ChatBot with API.AI and No... | https://www.udemy.com/chatbots/ | True | 50 | 2577 | 529 | 64 | All Levels | 4.5 hours | 2016-06-30T16:57:08Z | WebDevelopment |
| 3 | 197836 | Projects in HTML5 | https://www.udemy.com/projects-in-html5/ | True | 60 | 8777 | 206 | 75 | Intermediate Level | 15.5 hours | 2014-06-17T05:43:50Z | WebDevelopment |
| 4 | 505208 | Programming Foundations: HTML5 + CSS3 for Entr... | https://www.udemy.com/html-css-more/ | True | 20 | 23764 | 490 | 58 | Beginner Level | 5.5 hours | 2015-10-17T04:52:25Z | WebDevelopment |


- Convert `price` to float and `contentInfo` to a duration in hours (new `timeSpent` column)
- Convert `publishedTime` to datetime

In [9]:
def price_to_float(price):
    # Prices that cannot be parsed as a number fall back to 0.0
    try:
        priceFloat = float(price)
    except (ValueError, TypeError):
        priceFloat = 0.0

    return priceFloat

In [10]:
df_courses['price'] = df_courses.price.apply(price_to_float)
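
An alternative to a row-wise `apply` is `pd.to_numeric` with `errors='coerce'`, which converts the whole column at once. This is only a sketch of that option, not the approach used in this notebook; the `prices` series and its non-numeric entry are made up for illustration.

```python
import pandas as pd

# Hypothetical price strings; 'Free' stands in for any non-numeric entry
prices = pd.Series(['75', '50', 'Free', '20'])

# Unparseable values become NaN, which is then replaced by 0.0
as_float = pd.to_numeric(prices, errors='coerce').fillna(0.0)
print(as_float.tolist())
```

Both approaches map unparseable entries to `0.0`, so they should agree on string inputs like these.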

In [11]:
def time_spent(contentInfo):
    # Return the course duration in hours; NaN when no unit is recognised
    if re.search('hour', contentInfo):
        hours = float(contentInfo.split()[0])
    elif re.search('minute', contentInfo):
        hours = float(contentInfo.split()[0]) / 60
    else:
        hours = np.nan
    return hours

In [12]:
df_courses['timeSpent'] = df_courses.contentInfo.apply(time_spent) 
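
The same parsing can be sketched in vectorized form with `Series.str.extract`. The sample strings below are hypothetical, and matching on `min` (so that both `mins` and `minutes` are caught) is an assumption about how durations might be spelled, not something confirmed by the data shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical duration strings, not taken from the dataset
content = pd.Series(['4 hours', '12.5 hours', '30 minutes', 'unknown'])

# Capture the leading number and the unit; unmatched rows yield NaN in both groups
parts = content.str.extract(r'(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>hour|min)')
value = pd.to_numeric(parts['value'])

# Keep hour values as they are, divide everything else (minutes) by 60
hours = value.where(parts['unit'] == 'hour', value / 60)
print(hours.tolist())
```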

In [13]:
df_courses.head()

|   | id | title | url | isPaid | price | numSubscribers | numReviews | numPublishedLectures | instructionalLevel | contentInfo | publishedTime | category | timeSpent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28295 | Learn Web Designing & HTML5/CSS3 Essentials in... | https://www.udemy.com/build-beautiful-html5-we... | True | 75.0 | 43285 | 525 | 24 | All Levels | 4 hours | 2013-01-03T00:55:31Z | WebDevelopment | 4.0 |
| 1 | 19603 | Learning Dynamic Website Design - PHP MySQL an... | https://www.udemy.com/learning-dynamic-website... | True | 50.0 | 47886 | 285 | 125 | All Levels | 12.5 hours | 2012-06-18T16:52:34Z | WebDevelopment | 12.5 |
| 2 | 889438 | ChatBots: Messenger ChatBot with API.AI and No... | https://www.udemy.com/chatbots/ | True | 50.0 | 2577 | 529 | 64 | All Levels | 4.5 hours | 2016-06-30T16:57:08Z | WebDevelopment | 4.5 |
| 3 | 197836 | Projects in HTML5 | https://www.udemy.com/projects-in-html5/ | True | 60.0 | 8777 | 206 | 75 | Intermediate Level | 15.5 hours | 2014-06-17T05:43:50Z | WebDevelopment | 15.5 |
| 4 | 505208 | Programming Foundations: HTML5 + CSS3 for Entr... | https://www.udemy.com/html-css-more/ | True | 20.0 | 23764 | 490 | 58 | Beginner Level | 5.5 hours | 2015-10-17T04:52:25Z | WebDevelopment | 5.5 |


In [17]:
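# Parse the ISO 8601 strings (e.g. '2013-01-03T00:55:31Z') into datetimes;
# values that cannot be parsed become NaT
df_courses['publishedTime'] = pd.to_datetime(df_courses['publishedTime'], errors='coerce')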

## Initial Findings

Number of courses per category:

In [10]:
df_courses.groupby('category')['id'].count()

category
BussinessFinance    1199
GraphicDesign        603
MusicInstrument      681
WebDevelopment      1200
Name: id, dtype: int64
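
As a quick sanity check, the per-category counts above should add up to the 3683 rows reported by `df_courses.info()`. The arithmetic, using the numbers copied from the output:

```python
# Counts copied from the groupby output above
counts = {
    'BussinessFinance': 1199,
    'GraphicDesign': 603,
    'MusicInstrument': 681,
    'WebDevelopment': 1200,
}

total = sum(counts.values())
print(total)  # 3683, matching the RangeIndex length
```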

Average course price in every category:

In [18]:
df_courses.groupby('category')['price'].mean()
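
A per-category average is a `groupby` on `category` alone, followed by a mean over the `price` column; including `price` among the grouping keys would instead create one group per distinct price. A minimal sketch on made-up rows (the `toy` frame and its prices are hypothetical):

```python
import pandas as pd

# Hypothetical rows, not taken from the Udemy data
toy = pd.DataFrame({
    'category': ['WebDevelopment', 'WebDevelopment', 'GraphicDesign'],
    'price': [75.0, 25.0, 40.0],
})

# Group by category only, then average the numeric price column
avg_price = toy.groupby('category')['price'].mean()
print(avg_price.to_dict())
```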