## Import libraries

In [None]:
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore')

## Question 1: How important is price?

### Benefits of finding the answer: 
- Knowing the importance of price helps businesses optimize their pricing strategy. They can determine whether customers are highly sensitive to price changes and adjust their pricing models accordingly.
- Understanding the role of price allows businesses to position themselves competitively in the market. If price is a significant factor for customers, a company may choose to compete on price or differentiate itself through other means.
- Price elasticity insights help businesses set prices that maximize revenue. If customers are willing to pay higher prices for a product or service, companies can capture additional value.
- Understanding price sensitivity allows businesses to implement loyalty programs, discounts, or promotions strategically.
- Price elasticity insights contribute to more accurate demand forecasting. Businesses can predict how changes in price may affect demand, allowing for better inventory management and resource allocation.


## Preprocessing

- Replace the boolean values in `is_paid` with the labels 'Paid' and 'Free'. These labels are easier to read than 'True' and 'False' in visualizations and summaries.

In [None]:
df = udemy_df.copy()
df['is_paid'] = df['is_paid'].apply(lambda x: 'Free' if x == False else 'Paid')
df.head(5)

### Analyze data to answer the question

#### Percentage of paid courses and free courses

In [None]:
is_paid_distribution = df['is_paid'].value_counts()
plt.figure(figsize=(10, 7))
plt.title('Free vs Paid Courses', fontsize=20)
is_paid_distribution.plot.pie(autopct="%1.1f%%", startangle=140)
plt.ylabel('')
plt.legend(labels=is_paid_distribution.index, loc="upper left", bbox_to_anchor=(1, 1))
plt.show()

- The dataset consists predominantly of paid courses, which account for approximately 90% of all courses. Free courses make up the remaining 10%.
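
The shares shown in the pie chart can also be computed as plain numbers. The following is a minimal sketch on toy data (the values are illustrative and not taken from the dataset), assuming `is_paid` already holds the labels 'Paid' and 'Free':

```python
import pandas as pd

# Toy data for illustration only: 9 paid courses and 1 free course.
toy = pd.DataFrame({'is_paid': ['Paid'] * 9 + ['Free']})

# normalize=True returns shares instead of raw counts.
share_pct = toy['is_paid'].value_counts(normalize=True) * 100
print(share_pct.round(1))
```

Printing the numeric shares next to the chart makes the "approximately 90%" statement easy to check.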

#### User trends over the last 10 years

- Find the top 100 most popular courses (those with the most subscribers) for each year from 2012 -> 2022.
- Count how many free and paid courses are in each year's top 100.

In [None]:
df['year'] = df['published_time'].dt.year

# Filter data for the past 10 years
recent_years_df = df[df['year'] >= df['year'].max() - 10]

# Group by year and select the top 100 courses for each year
top_100_by_year = recent_years_df.groupby('year').apply(lambda group: group.sort_values('num_subscribers', ascending=False).head(100))
top_100_by_year = top_100_by_year.drop(['year'], axis=1)

# Count the course types for each year
values_count_by_year = top_100_by_year.groupby(['year', 'is_paid']).size().unstack(fill_value = 0)
values_count_by_year
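
The same top-N-per-year selection can be sketched without `apply`. This is a minimal example on toy data (the column names match the notebook, the values are made up, and `top_n` is a hypothetical parameter set to 2 instead of 100):

```python
import pandas as pd

# Toy data for illustration only (not from the Udemy dataset).
toy = pd.DataFrame({
    'year': [2021, 2021, 2021, 2022, 2022, 2022],
    'is_paid': ['Paid', 'Free', 'Paid', 'Paid', 'Paid', 'Free'],
    'num_subscribers': [500, 300, 100, 900, 800, 700],
})

top_n = 2  # hypothetical; the notebook uses 100

# Sort once, then keep the first top_n rows of every year.
top_by_year = (
    toy.sort_values('num_subscribers', ascending=False)
       .groupby('year')
       .head(top_n)
)

# Count free and paid courses among each year's top_n.
counts = top_by_year.groupby(['year', 'is_paid']).size().unstack(fill_value=0)
print(counts)
```

With `groupby(...).head(n)`, `year` stays an ordinary column, so nothing has to be dropped before the second `groupby`.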

- Visualize the number of free and paid courses among the top 100 most popular courses for each of the last 10 years.

In [None]:
x = np.arange(len(values_count_by_year.index))
width = 0.2
    
fig, ax = plt.subplots(figsize=(30, 15))
rects1 = ax.bar(x - width/2 - 0.03, values_count_by_year['Free'], width, label='Free')
rects2 = ax.bar(x + width/2 + 0.03, values_count_by_year['Paid'], width, label='Paid')
ax.bar_label(rects1, padding=3)
ax.bar_label(rects2, padding=3)
ax.set_title('Distribution of Top 100 Popular Courses by Course Type (2012 -> 2022)')
ax.set_xticks(x)
ax.set_xticklabels(values_count_by_year.index)
ax.legend()
plt.show()

- The number of free courses among the most popular courses is declining.
- The shift towards fewer free courses could indicate an emphasis on providing higher-quality or premium content.
- One possible explanation is that free courses do not offer a certificate of completion, so their number of subscribers is decreasing.

#### Correlation between numeric columns

In [None]:
paid_course = df[df['is_paid'] == 'Paid']
numeric_cols = ['price', 'num_subscribers', 'num_reviews', 'num_lectures']
paid_numeric_df = paid_course[numeric_cols]
corr_mat = paid_numeric_df.corr()

axis_corr = sns.heatmap(corr_mat, annot=True, vmin=-1, vmax=1, cmap=sns.diverging_palette(50, 500, n=500), square=True)
plt.show()

- The correlation between `price` and `num_subscribers` is 0.046, which is close to zero. There is essentially no linear relationship between the two, so a higher price is not associated with fewer subscribers.
- `num_subscribers` and `num_reviews` are positively correlated, as are `num_lectures` and `price`.
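
The heatmap reports Pearson correlation, which only measures linear association. If subscriber counts are heavily skewed (an assumption, not verified here), a rank-based correlation is a useful cross-check. A minimal sketch on made-up numbers, computing Spearman correlation as Pearson correlation of the ranks:

```python
import pandas as pd

# Toy data for illustration only: subscribers rise with price,
# but one course has a far larger count than the rest.
toy = pd.DataFrame({
    'price': [10, 20, 30, 40, 50],
    'num_subscribers': [100, 400, 900, 1600, 100000],
})

# Pearson: sensitive to the single extreme value.
pearson = toy.corr().loc['price', 'num_subscribers']

# Spearman: Pearson correlation computed on the ranks.
spearman = toy.rank().corr().loc['price', 'num_subscribers']

print(f'pearson={pearson:.3f}, spearman={spearman:.3f}')
```

In this toy case the relationship is perfectly monotonic, yet the Pearson value stays well below 1 because of the outlier. Comparing both measures on `paid_numeric_df` would show whether the near-zero value above is robust.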


In [None]:
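# Work on an explicit copy so that column assignments do not trigger SettingWithCopyWarning
paid_course = paid_course.copy()
paid_course['year'] = paid_course['published_time'].dt.year
# Price per minute for each year: total price divided by total content length in minutes
average_price_by_year = paid_course.groupby('year')['price'].sum() / paid_course.groupby('year')['content_length_min'].sum()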
plt.figure(figsize=(10, 6))
average_price_by_year.plot()
plt.ylabel('Price / minute')
plt.xlabel('Year')
plt.title('Price per Minute by Year')
plt.show()

- The price per minute of content trends upward over the years.
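
The price per minute above is a ratio of yearly sums, which weights long courses more heavily. The mean of per-course ratios is an alternative in which every course counts equally. The sketch below uses toy values (not from the dataset) to show how far the two can diverge:

```python
import pandas as pd

# Toy data for illustration only: each year has one short and one long course.
toy = pd.DataFrame({
    'year': [2020, 2020, 2021, 2021],
    'price': [10.0, 100.0, 20.0, 200.0],
    'content_length_min': [10, 1000, 10, 1000],
})

# Ratio of sums (the approach used in the cell above): long courses dominate.
by_year = toy.groupby('year')
ratio_of_sums = by_year['price'].sum() / by_year['content_length_min'].sum()

# Mean of per-course ratios: every course has the same weight.
toy['price_per_min'] = toy['price'] / toy['content_length_min']
mean_of_ratios = toy.groupby('year')['price_per_min'].mean()

print(pd.DataFrame({'ratio_of_sums': ratio_of_sums, 'mean_of_ratios': mean_of_ratios}))
```

Both series rise from 2020 to 2021 in this toy example, but their levels differ by a factor of about five, so the chosen definition should be stated alongside the chart.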

**Conclusion**:
- Learners today prioritize the practical value and application of knowledge over cost considerations. This trend poses challenges for education platforms to deliver high-quality courses that meet the real needs and desires of learners.
- Contemporary learners highly value the quality of course content. They want to ensure that information is accurately conveyed, up-to-date, and practically valuable.
- Modern learners often seek courses that align with their individual goals and career development. Learning must be directly related to their areas of interest.
- The certification and credibility of a course have become crucial factors. Learners prioritize courses that provide valuable and recognized certificates in their respective fields.
- The increase in the price per minute may indicate an effort by course providers to enhance the value of the content.
- Learners may be prioritizing the quality of courses over the direct cost. This may reflect their focus on investing in high-quality learning.
- The increase in price may reflect a trend where learners prioritize certificates and credits, considering them more important than the direct price of the course.

## Question 2: How profitable is teaching courses on Udemy?

### Benefits of finding the answer:
- Instructors can project potential earnings by understanding how Udemy's revenue-sharing model works. This helps in setting realistic financial goals.
- Knowledge of the profit model allows instructors to strategize their course pricing.
- Knowing how profits are earned encourages instructors to actively engage with the Udemy platform.
- Udemy provides instructors with valuable analytics and insights into learner behavior.
- Instructors learn how Udemy's growth has fluctuated and can decide whether or not to teach on the platform.
- Instructors can identify the categories that learners are enrolling in and invest in those categories.

### Analyze data to answer the question

#### Top Profitable Courses

In [None]:
# Estimated profit = price * number of subscribers (a rough estimate based on the listed price)
paid_course['profit'] = paid_course['num_subscribers'] * paid_course['price']
top_course = paid_course.groupby(['id', 'title'])['profit'].sum()
# Keep the 10 largest values, then sort ascending so the largest bar is drawn at the top of the barh chart
top_course = top_course.nlargest(10).sort_values(ascending=True)

plt.figure(figsize=(12, 8))
plt.barh(top_course.index.get_level_values('title'), top_course.values, color='skyblue')
plt.title('Top 10 Most Profitable Courses')
plt.xlabel('Total Profit')
plt.ylabel('Course Title')
plt.show()


- Most of the top 10 courses belong to the technology and IT category. The only exception is **The Complete Digital Marketing Course - 12 Courses in 1**, which belongs to the economy category, making it the most profitable economy course.
- IT is the dominant trend: **Data Science** courses earn a lot of money, and **Python** is the top-earning programming language to teach.

#### Who earns the most money from teaching on Udemy?

- Top 5 instructors who earn the most money in each year from 2013 -> 2022

In [None]:
# Instructors with the highest profit
filtered_data = paid_course[(paid_course['year'] >= 2013) & (paid_course['year'] <= 2022)]

instructor_profit_df = filtered_data.groupby(['instructor_name', 'instructor_url', 'year'])['profit'].sum().reset_index()
instructor_profit_df = instructor_profit_df.sort_values(by='profit', ascending=False)
top_instructors_by_year = instructor_profit_df.groupby('year').apply(lambda x: x.nlargest(5, 'profit')).reset_index(drop=True)
top_instructors_by_year
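
An explicit per-year rank can make the selection easier to inspect. This is a minimal sketch on toy data (instructor names and values are made up, and `top_k` is a hypothetical parameter), using the same price * subscribers estimate of profit as the notebook:

```python
import pandas as pd

# Toy data for illustration only (not from the Udemy dataset).
toy = pd.DataFrame({
    'instructor_name': ['A', 'B', 'C', 'A', 'B', 'C'],
    'year': [2021, 2021, 2021, 2022, 2022, 2022],
    'num_subscribers': [100, 50, 10, 20, 200, 30],
    'price': [10.0, 30.0, 20.0, 50.0, 10.0, 40.0],
})
toy['profit'] = toy['num_subscribers'] * toy['price']

# Total estimated profit per instructor and year.
per_instructor = toy.groupby(['year', 'instructor_name'], as_index=False)['profit'].sum()

# Rank instructors within each year; rank 1 is the highest profit.
per_instructor['rank'] = (
    per_instructor.groupby('year')['profit']
    .rank(method='first', ascending=False)
    .astype(int)
)

top_k = 2  # hypothetical; the notebook uses 5
top = per_instructor[per_instructor['rank'] <= top_k].sort_values(['year', 'rank'])
print(top)
```

Keeping the `rank` column also makes it straightforward to track how a given instructor's position changes from year to year.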

- Visualize the top 5 instructors for each year.

In [None]:
fig, axes = plt.subplots(2, 5, figsize=(20, 15), sharey=True)

for i, (year, data) in enumerate(top_instructors_by_year.groupby('year')):
    ax = axes[i // 5, i % 5]
    ax.bar(data['instructor_name'], data['profit'], color='skyblue')
    ax.set_title(f'{year}')
    ax.set_xlabel('Instructor Name')
    ax.set_ylabel('Profit')
    ax.set_xticklabels(data['instructor_name'], rotation=45, ha='right', fontsize=7) 

plt.tight_layout()
plt.show()

- According to the bar charts, instructor profits increased until 2020 and then decreased significantly in 2021 and 2022.
- The most lucrative period for earning money on Udemy is 2018 -> 2022. The highest profit for a single year is 600 million dollars, which is a very large amount.
- The instructors **Learn Tech Plus** and **Srinldhi Ranganathan** frequently appear among the top earners by year, with large profits.

**Conclusion:**
- In the previous years, courses related to IT and Technology on Udemy have generated significant profits. This could be attributed to the growing Information Technology industry, the demand for technical skills in the job market, and a high level of interest from learners.
- The data suggests a substantial decline in course profits on Udemy in 2021 and 2022. Several factors may contribute to this, including increased competition, the emergence of alternative online education platforms, or shifts in learners' preferences.
- External factors, such as economic conditions or global events, can impact the demand for specific skills. Economic downturns or shifts in the job market may influence the decision-making of individuals seeking courses, leading to changes in enrollment and, consequently, profitability.