- Recently I got a lot of feedback from dear friends who have just changed, or are about to change, their careers towards Data Analysis / Data Science and Machine Learning. They pointed out the lack of material between the beginning of the analysis journey and the advanced techniques.

- They are looking for detailed but beginner friendly Exploratory Data Analysis examples that are not too complicated (no elaborate regression or normalization techniques, etc.) and that show them how to start and, most importantly, how to read the descriptive statistics and graphs.

- After getting this feedback, I decided to make a series of EDAs on different datasets, without making them too complicated for people at the first steps of their DS/ML journey.

### - This notebook is part of the 9 Beginner Friendly EDAs. If these EDAs are helpful to anyone, I will be more than happy.




#### **INTRO**



In this EDA, we will explore the courses offered by UDEMY.

- Let's import the required libraries

In [None]:
import pandas as pd
import numpy as np

import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [None]:
df = pd.read_csv('../input/udemy-courses/udemy_courses.csv')
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.describe()

Let's summarize what we have got from the dataset.

- Our dataset has information about the courses offered by UDEMY.
- The 'course_id' and 'url' columns are not necessary for our analysis, so we will drop them.
- The course published timestamp is stored as an object (string), so it needs to be converted to a datetime.
- There are no missing values, which makes the data preparation stage much easier.
- The 'level' column is a categorical variable; it would be good to see whether there are any notable differences among the levels.
- The numerical variables deserve special attention in the further analysis.

- Let's make the necessary adjustments before moving to the analysis part.

In [None]:
df['date'] = pd.to_datetime(df['published_timestamp'])

In [None]:
df = df.drop(['course_id','url','published_timestamp'], axis=1)
df.sample(2)

In [None]:
df.info()

- Seems OK.  Let's move on to the next step: **analysis part**.

### Analysis Part

In [None]:
df.describe()

Let's look at some of the information we can get from the above table:

- At first look, we can see that the numerical variables have a minimum of 0 and maximums in the hundreds or thousands (or more).
- Also, the mean and median values differ considerably from each other. All of the variables have a noticeably higher mean than median, which is a good sign of a highly skewed distribution, more specifically a right-skewed distribution with possible outliers on the maximum side. It is good to keep that in mind for the further analysis.

- For the aforementioned reasons, the following lines use the median to give some insights from the above table.

- The median price is 45.

- The median number of subscribers per course is around 912.

- The median number of reviews is 18.
- The median number of lectures is 25.
- The median content duration is 2 hours.
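To make the "mean higher than median means right skew" idea concrete, here is a minimal sketch on synthetic data. The sample below is generated with NumPy and is NOT the Udemy data; the name `toy_subscribers` and the lognormal parameters are assumptions chosen only to get a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed sample (not the Udemy data): most values are small,
# a few are very large, similar in shape to subscriber counts.
rng = np.random.default_rng(42)
sample = pd.Series(rng.lognormal(mean=6.0, sigma=1.5, size=3000),
                   name="toy_subscribers")

summary = pd.Series({
    "mean": sample.mean(),
    "median": sample.median(),
    "skew": sample.skew(),  # a value above 0 means a longer tail on the right
})
print(summary.round(2))

# In a right-skewed distribution the few huge values pull the mean above the median.
print("mean > median:", summary["mean"] > summary["median"])
```

The same three numbers can be computed for any numeric column of `df`, which is why the median is the safer "typical value" here.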


- OK, let's see this analysis in Plotly.

#### **Prices of UDEMY Courses**

In [None]:
fig = px.histogram(df, x= 'price', title='Prices of UDEMY Courses')

fig.show()

As seen in the histogram, UDEMY has 310 free courses, and 295 of its courses are priced at $200. As we expected, the distribution is highly right skewed.

#### **Number of Subscribers of UDEMY Courses**

In [None]:
fig = px.histogram(df, x= 'num_subscribers', title='Number of Subscribers of UDEMY Courses')

fig.show()

The number of subscribers ranges from 0 to 268,923, which gives a highly skewed distribution.
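When a histogram is dominated by one tall bar near zero, one common option is to look at the values on a log scale. The sketch below uses synthetic data (not the Udemy file) and a hypothetical helper column `log_subscribers`; it only illustrates the idea and is not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a skewed subscriber column (assumption, not real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {"num_subscribers": rng.lognormal(6.0, 1.5, size=2000).astype(int)}
)

# log1p(x) = log(1 + x), so courses with 0 subscribers map to 0 instead of -inf
toy["log_subscribers"] = np.log1p(toy["num_subscribers"])

# Compare how crowded the fullest of 10 equal-width bins is before and after
raw_counts, _ = np.histogram(toy["num_subscribers"], bins=10)
log_counts, _ = np.histogram(toy["log_subscribers"], bins=10)

print("share of rows in the fullest bin (raw):", raw_counts.max() / len(toy))
print("share of rows in the fullest bin (log):", log_counts.max() / len(toy))
```

The transformed column can be passed to `px.histogram` in the same way as the raw one; just remember that the x axis then shows log values, not subscriber counts.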

#### **Number of Reviews of UDEMY Courses**

In [None]:
fig = px.histogram(df, x= 'num_reviews', title='Number of Reviews of UDEMY Courses')

fig.show()

The number of reviews ranges from 0 to 27,445, again a highly skewed distribution.

#### **Number of Lectures of UDEMY Courses**

In [None]:
fig = px.histogram(df, x= 'num_lectures', title='Number of Lectures of UDEMY Courses')

fig.show()

From the histogram of the number of lectures, we can see that many courses fall in the 20-45 range. But as mentioned before, and as easily seen in the histogram, the data is highly skewed with outliers.

#### **Durations of UDEMY Courses**

In [None]:
fig = px.histogram(df, x= 'content_duration', title='Durations of UDEMY Courses')

fig.show()

From the histogram of course durations, we can see that many courses fall in the 0-3 hours range. But as mentioned before, and as easily seen in the histogram, the data is highly skewed with outliers.
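"Outlier" can be made a bit more precise with the common 1.5 x IQR rule of thumb. This is a minimal sketch on made-up lecture counts (the values below are invented for illustration); the rule is a convention, not a strict definition, and other thresholds are equally defensible.

```python
import pandas as pd

# Toy lecture counts (made up): most courses are short, two are extremely long
num_lectures = pd.Series([12, 15, 20, 22, 25, 25, 30, 35, 40, 45, 300, 780])

# Quartiles and interquartile range
q1, q3 = num_lectures.quantile([0.25, 0.75])
iqr = q3 - q1

# Values above this fence are flagged as possible outliers on the high side
upper_fence = q3 + 1.5 * iqr

outliers = num_lectures[num_lectures > upper_fence]
print(f"Q1={q1}, Q3={q3}, upper fence={upper_fence}")
print(outliers.tolist())
```

Flagging a value this way does not mean it is wrong; a very long course can be perfectly real, it just sits far from the bulk of the data.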

- Before moving on to the details, let's see the correlation matrix for our dataset.

In [None]:
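# Correlate only the numeric/boolean columns; text and datetime columns
# make df.corr() raise an error on recent pandas versions
df.select_dtypes(include=['number', 'bool']).corr()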

In [None]:
index_vals = df['level'].astype('category').cat.codes
fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='price',
                                 values=df['price']),
                            dict(label='num_subscribers',
                                 values=df['num_subscribers']),
                            dict(label='num_reviews',
                                 values=df['num_reviews']),
                            dict(label='num_lectures',
                                 values=df['num_lectures']),
                           dict(label='content_duration',
                                 values=df['content_duration'])],
                showupperhalf=False, 
                text=df['level'],
                marker=dict(color=index_vals,
                            showscale=False, # colors encode categorical variables
                            line_color='white', line_width=0.5)
                ))


fig.update_layout(
    title='UDEMY Courses',
    width=1000,
    height=1000,
)

fig.show()

Based on the results:
- There is a positive but not very strong relationship between the number of reviews and the number of subscribers.
- There is also a positive and fairly strong (.80) relationship between the number of lectures in a course and the duration of the course.
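To show how a single coefficient like the ones above is read out of a correlation matrix, here is a small sketch on synthetic data. The toy columns reuse the names `num_lectures` and `content_duration`, but the values and the assumed linear link between them are invented. The Spearman (rank based) version is added only as an optional cross-check that is often suggested for skewed data; it is not something the original analysis used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
num_lectures = rng.integers(5, 200, size=n)
# Assumption for the toy data: duration grows with lecture count, plus noise
content_duration = num_lectures * 0.1 + rng.normal(0, 3, size=n)

toy = pd.DataFrame({"num_lectures": num_lectures,
                    "content_duration": content_duration})

# Pearson is the default method of DataFrame.corr()
pearson = toy.corr().loc["num_lectures", "content_duration"]

# Spearman works on ranks, so it is less sensitive to extreme values
spearman = toy.corr(method="spearman").loc["num_lectures", "content_duration"]

print("pearson:", round(pearson, 2))
print("spearman:", round(spearman, 2))
```

A coefficient close to 1 means the two columns tend to rise together; it says nothing about which one drives the other.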

- After getting an overall picture of the data, we can go into more detail.

### UDEMY Courses Based on the **Subject**

- Let's see UDEMY courses by their subjects.

In [None]:
np.round(df['subject'].value_counts(normalize=True),2)

- Overall, 33% of the Udemy courses are in Web Development and 32% are in Business Finance. The other 34% is made up of Musical Instruments (18%) and Graphic Design (16%) courses.
- Business Finance and Web Development together cover almost 2 out of every 3 courses.
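For readers new to `value_counts`, this tiny sketch shows the difference between raw counts and shares. The subject labels and their counts are a toy example, not the real dataset.

```python
import pandas as pd

# Toy subject column: 4 + 3 + 2 + 1 = 10 courses (invented numbers)
subjects = pd.Series(
    ["Web"] * 4 + ["Business"] * 3 + ["Music"] * 2 + ["Design"] * 1,
    name="subject",
)

counts = subjects.value_counts()                # how many courses per subject
shares = subjects.value_counts(normalize=True)  # the same, as fractions of the total

print(counts)
print(shares.round(2))

# Shares are sorted from largest to smallest, so the first two rows
# answer "how much do the two biggest subjects cover together?"
top_two_share = shares.iloc[:2].sum()
print("top two subjects together:", round(top_two_share, 2))
```

Because each share is rounded separately, rounded percentages do not always add up to exactly 100%.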

In [None]:
fig = px.histogram(df, x="subject", title='Course Count by Subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### **UDEMY Courses By Subject in Each Year**

In [None]:
df['year']= df['date'].dt.year
# Count courses per (year, subject); naming the counts explicitly keeps the
# column name the same across pandas versions
subject_by_year = df.groupby('year')['subject'].value_counts().rename('subject count').reset_index(level=0)
subject_by_year

In [None]:
fig = px.line(subject_by_year, x='year', y='subject count', color= subject_by_year.index, title='UDEMY Courses By Subject in Each Year')
fig.show()

- From the line plot, we can see that the number of Udemy courses on Web Development and Business Finance increased considerably until 2015.
- The number of Business Finance courses stayed almost the same in 2016, but Web Development courses continued to increase considerably.
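If the line plot is hard to read, the same per-year counts can also be looked at as a plain table. The sketch below is one possible way to do that on a toy frame; the rows are invented and the short subject names are placeholders.

```python
import pandas as pd

# Toy data: one row per course (invented)
toy = pd.DataFrame({
    "year":    [2014, 2014, 2014, 2015, 2015, 2015, 2015, 2016, 2016],
    "subject": ["Web", "Business", "Web", "Web", "Web",
                "Business", "Music", "Web", "Business"],
})

# Long form: one row per (year, subject) pair, handy for px.line
long_counts = (
    toy.groupby(["year", "subject"])
       .size()
       .reset_index(name="subject count")
)

# Wide form: years as rows, subjects as columns, 0 where a pair never occurs
wide_counts = (
    long_counts.pivot(index="year", columns="subject", values="subject count")
               .fillna(0)
               .astype(int)
)
print(wide_counts)
```

Reading across a row shows the mix of subjects in one year; reading down a column shows how one subject changed over time.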

### **Based on the Level of the Courses**

- Let's see UDEMY courses by their levels.

In [None]:
np.round(df['level'].value_counts(normalize=True),2)

- Overall, 52% of the Udemy courses are aimed at learners of all levels.
- Beginner level courses make up 35% of all courses.
- 1 out of 10 courses offered by UDEMY is at the intermediate level.
- Only 2 out of 100 courses offered by UDEMY target advanced or expert level learners.

In [None]:
fig = px.histogram(df, x="level", title='Course Count by Level of Courses')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### **UDEMY Courses By Level in Each Year**

In [None]:
# Count courses per (year, level); naming the counts explicitly keeps the
# column name the same across pandas versions
level_by_year = df.groupby('year')['level'].value_counts().rename('level count').reset_index(level=0)
level_by_year

In [None]:
fig = px.line(level_by_year, x='year', y='level count', color= level_by_year.index, title='UDEMY Courses By Level in Each Year')
fig.show()

- From the line plot, we can see that the number of all-levels, beginner and intermediate courses increased considerably each year.

- On the other hand, the number of expert level courses offered by UDEMY is inconsistent from year to year.


### UDEMY Courses - Number of Subscribers, Reviews and Lectures by Year

In [None]:
df1 = df.groupby('year')[['num_subscribers','num_reviews','num_lectures']].sum().reset_index()
df1

In [None]:
fig = px.line(df1, x='year', y=['num_subscribers','num_reviews','num_lectures'])
fig.show()

- As seen in the line chart, the total number of subscribers (grouped by the year each course was published) increased steadily until 2015 and then fell by around half a million in 2016. Since the data does not cover all of 2017, we cannot draw any conclusion about that year.
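The remark about the last year being incomplete can be checked directly by looking at the first and last publication date inside each year. This is a rough sketch on invented dates; the column names `first_date`, `last_date` and `months_covered` are hypothetical helpers, not columns of the dataset.

```python
import pandas as pd

# Toy publication dates (invented): the final year stops in July
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-01-10", "2015-12-20",
        "2016-02-01", "2016-12-28",
        "2017-01-05", "2017-07-06",
    ]),
    "num_subscribers": [100, 250, 300, 50, 40, 10],
})
toy["year"] = toy["date"].dt.year

# Earliest and latest publication date observed in each year
coverage = toy.groupby("year")["date"].agg(first_date="min", last_date="max")

# Rough count of calendar months between the first and last date of the year
coverage["months_covered"] = (
    coverage["last_date"].dt.month - coverage["first_date"].dt.month + 1
)
print(coverage)
```

A year whose last date is far from December should be compared with the other years only with care, since its totals are based on fewer months.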

### Price & Courses

In [None]:
# Count paid/free courses per year; naming the counts explicitly keeps the
# column name the same across pandas versions
paid_by_year = df.groupby('year')['is_paid'].value_counts().rename('paid_free count').reset_index(level=0)
paid_by_year

In [None]:
fig = px.line(paid_by_year, x='year', y='paid_free count', color= paid_by_year.index)
fig.show()

- The number of both free and paid courses increased each year.
- Yep, agreed, not much of an increase in the free courses. It's a tough world.
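Raw counts grow for both groups, so it can also help to look at the share of paid courses within each year. The sketch below does that with `pd.crosstab` on a toy frame; the rows and the "paid"/"free" labels are assumptions made for the example.

```python
import pandas as pd

# Toy data (invented): 4 courses in 2015, 5 courses in 2016
toy = pd.DataFrame({
    "year":    [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016],
    "is_paid": [True, True, True, False, True, True, True, True, False],
})

# Readable labels instead of True/False
toy["kind"] = toy["is_paid"].map({True: "paid", False: "free"})

# normalize="index" turns each row (year) into fractions that sum to 1
paid_share = pd.crosstab(toy["year"], toy["kind"], normalize="index")
print(paid_share.round(2))
```

If the "paid" column rises from year to year, paid courses are growing faster than free ones, even when both counts go up.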

### Top Paid Courses

In [None]:
top_15_paid_courses = df[df['price']!=0][['course_title','year','subject','num_subscribers']].sort_values(by= 'num_subscribers',ascending=False).head(15)
top_15_paid_courses

In [None]:
fig = px.bar(top_15_paid_courses, y='num_subscribers', x='course_title', hover_data=['year', 'subject'], color='subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Almost all of the top 15 paid courses are from the Web Development area, except for one course in the Musical Instruments area.

### Top Free Courses

In [None]:
top_15_free_courses = df[df['price']==0][['course_title','year','subject','num_subscribers']].sort_values(by= 'num_subscribers',ascending=False).head(15)
top_15_free_courses

In [None]:
fig = px.bar(top_15_free_courses, y='num_subscribers', x='course_title', hover_data=['year', 'subject'], color='subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The top 15 free courses are mostly from the Web Development area, but they also include other subject areas.
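The "filter, sort, take the first N" pattern used for the top paid and free courses can also be written with `nlargest`. Here is a minimal sketch on a five-row toy frame (titles and numbers are invented) showing that both spellings pick the same rows.

```python
import pandas as pd

# Toy courses (invented)
toy = pd.DataFrame({
    "course_title": ["A", "B", "C", "D", "E"],
    "price": [0, 20, 0, 200, 50],
    "num_subscribers": [5000, 1200, 9000, 300, 4000],
})

free = toy[toy["price"] == 0]
paid = toy[toy["price"] != 0]

# Spelling 1: sort descending, then keep the first rows
top_paid_sorted = paid.sort_values(by="num_subscribers", ascending=False).head(2)

# Spelling 2: ask directly for the rows with the largest values
top_paid_nlargest = paid.nlargest(2, "num_subscribers")
top_free = free.nlargest(2, "num_subscribers")

print(top_paid_sorted["course_title"].tolist())
print(top_paid_nlargest["course_title"].tolist())
print(top_free["course_title"].tolist())
```

Which spelling to use is mostly a matter of taste; `sort_values(...).head(n)` is easier to extend when you need several sort columns.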

### Top 15 Reviewed Courses

In [None]:
top_15_reviewed = df[['course_title','year','subject','is_paid','num_reviews']].sort_values(by='num_reviews', ascending=False).head(15)

top_15_reviewed

In [None]:
fig = px.bar(top_15_reviewed, y='num_reviews', x='course_title', hover_data=['year', 'subject', 'is_paid'], color='subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- All but one of the top 15 most reviewed courses are from the Web Development area, and 11 of these 15 courses are paid courses.

### Top 15 Expensive Courses

In [None]:
top_15_price = df[['course_title','year','subject','num_subscribers', 'price']].sort_values(by=['price','num_subscribers'], ascending=False).head(15)

top_15_price

In [None]:
fig = px.bar(top_15_price, y='num_subscribers', x='course_title', hover_data=['price', 'year'], color='subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The most expensive courses cost $200, and every subject area appears in the top 15 most expensive courses list.
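Since many courses share the same maximum price, a "top 15 by price" list is really decided by the second sort key. The sketch below illustrates this on a toy frame with invented values: it counts how many rows sit at the maximum price and shows how the subscriber count breaks the tie.

```python
import pandas as pd

# Toy courses (invented): four of them share the maximum price
toy = pd.DataFrame({
    "course_title": ["A", "B", "C", "D", "E", "F"],
    "price": [200, 200, 200, 195, 200, 20],
    "num_subscribers": [10, 500, 40, 900, 70, 3000],
})

max_price = toy["price"].max()
n_at_max = (toy["price"] == max_price).sum()
print("courses at the maximum price:", n_at_max)

# Sort by price first; among equal prices, the higher subscriber count wins
top_3 = toy.sort_values(by=["price", "num_subscribers"], ascending=False).head(3)
print(top_3)
```

When the number of courses at the maximum price is larger than N, every row of the top N list has the same price, so the list mostly ranks subscribers among the most expensive courses.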

## This notebook is a part of the 9 Beginner Friendly EDAs
## If you like this one, you can also check out other notebooks in the Beginner Friendly EDAs series!

* [Data Analyst Jobs - EDA](https://www.kaggle.com/kaanboke/plotly-data-analyst-jobs)
* [Top Games on Google Play Store](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-games)
* [Hollywood Top Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-movies)
* [World Happiness Report - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-eda)
* [Countries Life Expectancy](https://www.kaggle.com/kaanboke/plotly-beginner-friendly)
* [Netflix Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-netflix)
* [Amazon Top 50 Bestselling Books EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-amazon)
* [London bike Sharing EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-london-bike)



- Thanks to the dataset contributor for this data. The only thing missing for me is something about the course ratings. We can make some assumptions based on the number of subscribers or the number of reviews, but that still does not give us enough confidence to judge the quality of the courses.

- It was quite a pleasure to share this detailed, beginner friendly EDA with you. Thanks for your time.

- All the best 