
# Udemy Courses Analysis Project

## Project Overview

In this project, we carry out a comprehensive analysis of a dataset containing information about Udemy courses. For an aspiring data analyst, it is an opportunity to draw useful insights from real data while practicing Python, SQL, Excel, and general data analysis skills.

## Project Objective

The primary objective of this project is to extract actionable insights from the Udemy dataset. Specifically, we aim to achieve the following:

1.  **Gain Insights into Course Popularity:** We will explore which courses attract the most subscribers and why. This analysis will involve examining factors such as course subject, pricing, and content duration.
    
2.  **Pricing Strategies:** We will investigate Udemy's pricing strategies by analyzing the distribution of course prices. This analysis will help us understand how price affects course enrollment.
    
3.  **Subject Analysis:** We will categorize courses by subject and examine the distribution of courses within each subject. Are certain subjects more popular than others?
    
4.  **Trends and Patterns:** Identifying trends and patterns in the data will be a key focus. We will look for correlations between variables like the number of subscribers, the number of reviews, and course difficulty level.
    
5.  **Data Visualization:** Data visualization will play a crucial role in presenting our findings. We will use various charts and graphs to make the analysis results more accessible and understandable.
    
6.  **Python Code Optimization:** Throughout the project, we will ensure that our Python code adheres to the PEP 8 style guide and stays reasonably efficient. Clean and efficient code is essential for effective data analysis.
    
7.  **Documentation:** Detailed documentation of the code and analysis steps will be maintained, including comments in Python code. This documentation will serve as a reference for future work and collaborations.
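
To make objective 4 concrete, here is a minimal sketch of a correlation check. It uses only the five rows visible in the `df.head(5)` preview later in this notebook, typed in by hand, so its numbers say nothing about the full dataset. The names `preview`, `corr`, and `level_counts` are illustrative and not part of the original analysis.

```python
import pandas as pd

# The five rows shown in the df.head(5) preview, entered by hand.
# This sample is far too small to draw conclusions from.
preview = pd.DataFrame({
    "num_subscribers": [3137, 1593, 482, 850, 940],
    "num_reviews": [18, 1, 1, 3, 3],
    "num_lectures": [68, 41, 47, 43, 32],
    "level": ["All Levels"] + ["Intermediate Level"] * 4,
})

# Pairwise Pearson correlation between the numeric columns.
corr = preview[["num_subscribers", "num_reviews", "num_lectures"]].corr()
print(corr.round(2))

# Number of courses per difficulty level.
level_counts = preview["level"].value_counts()
print(level_counts)
```

On the full dataset the same two calls would be applied to `df` directly.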
    

## Dataset Description

The dataset consists of the following columns:

1.  `course_id`: Unique identifier for each course.
2.  `course_title`: The title or name of the course.
3.  `is_paid`: Indicates whether the course is paid (True/False).
4.  `price`: The price of the course.
5.  `num_subscribers`: The number of subscribers to the course.
6.  `num_reviews`: The number of reviews for the course.
7.  `num_lectures`: The number of lectures in the course.
8.  `level`: The difficulty level of the course (e.g., beginner, intermediate, advanced).
9.  `content_duration`: The duration of the course content.
10.  `published_timestamp`: The timestamp when the course was published.
11.  `subject`: The subject or category of the course.
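
Before analysis, it can help to confirm that a loaded file really contains the columns listed above. The following is a small sketch of such a check; the helper `missing_columns` and the one-row `sample` frame are hypothetical and exist only for illustration.

```python
import pandas as pd

# Column names copied from the dataset description above.
EXPECTED_COLUMNS = [
    "course_id", "course_title", "is_paid", "price", "num_subscribers",
    "num_reviews", "num_lectures", "level", "content_duration",
    "published_timestamp", "subject",
]


def missing_columns(frame):
    """Return the expected columns that are absent from `frame`."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]


# Deliberately incomplete one-row frame (hypothetical, not the real dataset).
sample = pd.DataFrame({"course_id": [288942], "price": ["35"]})
print(missing_columns(sample))
```

An empty list means every expected column is present.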

In [10]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

Read the dataset with the pandas library

In [11]:
df = pd.read_csv("7. Udemy Courses.csv", encoding="UTF-8")

In [12]:
# Print the first 5 rows of the DataFrame.
print(df.head(5))


   course_id                                       course_title  is_paid  \
0     288942  #1 Piano Hand Coordination: Play 10th Ballad i...     True   
1    1170074  #10 Hand Coordination - Transfer Chord Ballad ...     True   
2    1193886  #12 Hand Coordination: Let your Hands dance wi...     True   
3    1116700  #4 Piano Hand Coordination: Fun Piano Runs in ...     True   
4    1120410  #5  Piano Hand Coordination:  Piano Runs in 2 ...     True   

  price  num_subscribers  num_reviews  num_lectures               level  \
0    35             3137           18            68          All Levels   
1    75             1593            1            41  Intermediate Level   
2    75              482            1            47  Intermediate Level   
3    75              850            3            43  Intermediate Level   
4    75              940            3            32  Intermediate Level   

  content_duration   published_timestamp              subject  
0        1.5 hours  2014-09-

#### Let's explore the dataset

In [13]:
df.describe()

       course_id  num_subscribers  num_reviews  num_lectures
count     3682.0           3682.0       3682.0        3682.0
mean    676612.1       3194.23031   156.093156     40.065182
std     343635.5      9499.378361   934.957204     50.373299
min       8324.0              0.0          0.0           0.0
25%     407843.0           110.25          4.0          15.0
50%     688558.0            911.5         18.0          25.0
75%     961751.5          2540.25         67.0          45.0
max    1282064.0         268923.0      27445.0         779.0
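
By default, `describe()` summarizes only numeric columns, which is why text columns such as `price` and `level` are missing from the output above. Passing `include="all"` adds count, unique, top, and freq rows for the non-numeric columns. The sketch below shows this on a hand-typed frame built from the five preview rows; `price` is kept as text to mirror how it is stored in this dataset.

```python
import pandas as pd

# Values taken from the five rows of the df.head(5) preview.
preview = pd.DataFrame({
    "price": ["35", "75", "75", "75", "75"],  # stored as text in this dataset
    "num_subscribers": [3137, 1593, 482, 850, 940],
    "level": ["All Levels"] + ["Intermediate Level"] * 4,
})

# include="all" summarizes numeric and non-numeric columns together;
# cells that do not apply to a column's type are filled with NaN.
summary = preview.describe(include="all")
print(summary)
```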


Check for null values in the dataset

In [14]:
df.isnull().sum()

course_id              0
course_title           0
is_paid                0
price                  0
num_subscribers        0
num_reviews            0
num_lectures           0
level                  0
content_duration       0
published_timestamp    0
subject                0
dtype: int64

So, there are no null values in our dataset, which is good: no rows need to be dropped or filled in. :)

Now let's check the data types of our data

In [15]:
df.dtypes

course_id               int64
course_title           object
is_paid                  bool
price                  object
num_subscribers         int64
num_reviews             int64
num_lectures            int64
level                  object
content_duration       object
published_timestamp    object
subject                object
dtype: object
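
Note that `price`, `content_duration`, and `published_timestamp` are stored as `object` (text), so they need converting before any numeric or date-based analysis. Below is a hedged sketch of one way to do that. The sample values are hypothetical: the `"Free"` label and the `"mins"` unit are assumptions about what the text columns may contain, and the timestamps are made up. Only the `"1.5 hours"` format is taken from the preview output.

```python
import pandas as pd

# Hypothetical raw values. "Free" and "39 mins" are assumptions about the
# column contents; the timestamps are invented for illustration.
raw = pd.DataFrame({
    "price": ["35", "75", "Free"],
    "content_duration": ["1.5 hours", "39 mins", "2 hours"],
    "published_timestamp": [
        "2020-01-15T10:30:00Z",
        "2020-06-01T08:00:00Z",
        "2021-03-20T17:45:00Z",
    ],
})

clean = pd.DataFrame(index=raw.index)

# Map the assumed "Free" label to 0; any other unparseable value becomes NaN.
clean["price"] = pd.to_numeric(raw["price"].replace("Free", "0"), errors="coerce")

# Split "1.5 hours" into a number and a unit, then express everything in hours.
parts = raw["content_duration"].str.extract(r"(?P<value>[\d.]+)\s*(?P<unit>[A-Za-z]+)")
value = pd.to_numeric(parts["value"], errors="coerce")
is_minutes = parts["unit"].str.lower().str.startswith("min", na=False)
clean["content_duration_hours"] = value.where(~is_minutes, value / 60)

# Parse ISO 8601 strings into timezone-aware datetimes.
clean["published_timestamp"] = pd.to_datetime(
    raw["published_timestamp"], utc=True, errors="coerce"
)

print(clean.dtypes)
```

Using `errors="coerce"` turns unexpected values into `NaN`/`NaT` instead of raising, so it is worth re-running the null check after converting.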

In [16]:
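# Outlier elimination (disabled). Note: df1, hue_C, and the "Subscribers" column
# are not defined in the cells above, and content_duration is still text (object),
# so it must be converted to a number before the comparison below can work.
# df1 = df1[df1["num_lectures"] < 400]
# df1 = df1[df1["content_duration"] < 40]

# Joint plots of subscribers against each feature (seaborn is imported as sb).
# to_plot = ["is_paid", "price", "num_reviews", "num_lectures", "content_duration"]
# for col in to_plot:
#     sb.jointplot(x=df1["num_subscribers"], y=df1[col], hue=df1["Subscribers"], palette=hue_C)
#     plt.show()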
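
The outlier filter in the disabled cell can be sketched in runnable form as follows. This is an illustration under stated assumptions, not the notebook's final method: the `courses` frame is hypothetical, `content_duration` is assumed to have already been converted to numeric hours, and only the two thresholds (400 lectures, 40 hours) come from the cell above.

```python
import pandas as pd

# Hypothetical courses; content_duration is assumed to be numeric hours already.
courses = pd.DataFrame({
    "num_lectures": [68, 41, 779, 30],
    "content_duration": [1.5, 3.0, 78.5, 45.0],
})

# Thresholds taken from the commented-out cell above.
MAX_LECTURES = 400
MAX_DURATION_HOURS = 40

# Keep only rows that fall below both thresholds.
keep = (courses["num_lectures"] < MAX_LECTURES) & (
    courses["content_duration"] < MAX_DURATION_HOURS
)
filtered = courses[keep].copy()

print(f"Dropped {len(courses) - len(filtered)} of {len(courses)} rows")
```

Combining both conditions in a single boolean mask avoids reassigning the frame twice and makes it easy to count how many rows were removed.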