<div style="text-align: center; background-color: #0A6EBD; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
 Final Project Programming for Data Science
</div>

<div style="text-align: center; background-color: #5A96E3; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Asking + Preprocessing + Analyzing data to answer each question</div>

## Import libraries

In [None]:
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore')

## Read data from csv file

In [None]:
udemy_df = pd.read_csv("./Data/udemy.csv", parse_dates = ['published_time', 'last_update_date'])
udemy_df.head(10)

## Question 1: How important is price?

### Benefits of finding the answer: 
- Knowing the importance of price helps businesses optimize their pricing strategy. They can determine whether customers are highly sensitive to price changes and adjust their pricing models accordingly.
- Understanding the role of price allows businesses to position themselves competitively in the market. If price is a significant factor for customers, a company may choose to compete on price or differentiate itself through other means.
- Price elasticity insights help businesses set prices that maximize revenue. If customers are willing to pay higher prices for a product or service, companies can capture additional value.
- Understanding price sensitivity allows businesses to implement loyalty programs, discounts, or promotions strategically.
- Price elasticity insights contribute to more accurate demand forecasting. Businesses can predict how changes in price may affect demand, allowing for better inventory management and resource allocation.


## Preprocessing

- We map the boolean values in `is_paid` to the labels 'Paid' and 'Free'. These labels are easier to read than 'True' and 'False' in visualizations and summaries, which makes the results simpler to communicate.

In [None]:
df = udemy_df.copy()
df['is_paid'] = df['is_paid'].apply(lambda x: 'Free' if x == False else 'Paid')
df.head(5)

### Analyze data to answer the question

#### Percentage of paid courses and free courses

In [None]:
is_paid_distribution = df['is_paid'].value_counts()
plt.figure(figsize=(10, 7))
plt.title('Free vs Paid Courses', fontsize=20)
is_paid_distribution.plot.pie(autopct="%1.1f%%", startangle=140)
plt.ylabel('')
plt.legend(labels=is_paid_distribution.index, loc="upper left", bbox_to_anchor=(1, 1))
plt.show()

- The dataset predominantly consists of paid courses, accounting for approximately 90% of the total courses. In contrast, free courses make up the remaining 10%. This indicates that the majority of courses in the dataset require payment, while a smaller portion is available for free.
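
The shares shown in the pie chart can also be read off as numbers. The following is a minimal sketch on a small made-up series (the toy values are illustrative and not taken from the dataset); it assumes the column has already been relabeled to 'Paid' and 'Free'.

```python
import pandas as pd

# Hypothetical toy data standing in for df['is_paid'] after relabeling
toy_is_paid = pd.Series(['Paid', 'Paid', 'Paid', 'Free'], name='is_paid')

# normalize=True returns fractions instead of raw counts
share_pct = toy_is_paid.value_counts(normalize=True) * 100
print(share_pct.round(1))
```

On the real data, the same call on `df['is_paid']` should give the percentages that the pie chart annotates.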

#### User trend over the last 10 years

- Find the top 100 most popular courses (those with the most subscribers) for each year from 2012 to 2022.
- Count how many of these top 100 courses are free and how many are paid.

In [None]:
df['year'] = df['published_time'].dt.year

# Filter data for the past 10 years
recent_years_df = df[df['year'] >= df['year'].max() - 10]

# Group by year and select the top 100 courses for each year
top_100_by_year = recent_years_df.groupby('year').apply(lambda group: group.sort_values('num_subscribers', ascending=False).head(100))
top_100_by_year = top_100_by_year.drop(['year'], axis=1)

# Count the course types for each year
values_count_by_year = top_100_by_year.groupby(['year', 'is_paid']).size().unstack(fill_value = 0)
values_count_by_year

- Visualize the number of free and paid courses among the top 100 most popular courses for each of the last 10 years.

In [None]:
x = np.arange(len(values_count_by_year.index))
width = 0.2
    
fig, ax = plt.subplots(figsize=(30, 15))
rects1 = ax.bar(x - width/2 - 0.03, values_count_by_year['Free'], width, label='Free')
rects2 = ax.bar(x + width/2 + 0.03, values_count_by_year['Paid'], width, label='Paid')
ax.bar_label(rects1, padding=3)
ax.bar_label(rects2, padding=3)
ax.set_title('Distribution of Top 100 Popular Courses by Course Type (2012 -> 2022)')
ax.set_xticks(x)
ax.set_xticklabels(values_count_by_year.index)
ax.legend()
plt.show()

- The number of free courses among the top 100 most popular courses decreases over the years.
- The shift towards fewer free courses could indicate an emphasis on providing higher-quality or premium content.
- One possible reason for this shift is that free courses do not come with a certificate of completion, so they attract fewer subscribers over time.
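
The "top N per year, then count by type" step used above can be tried on a small example. This is a sketch under stated assumptions: the toy rows are made up, N is 2 instead of 100, and it uses `sort_values` followed by `groupby(...).head(...)` as an alternative to `groupby(...).apply(...)`, not as the notebook's own method.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    'year': [2020, 2020, 2020, 2021, 2021, 2021],
    'is_paid': ['Paid', 'Free', 'Paid', 'Paid', 'Paid', 'Free'],
    'num_subscribers': [500, 300, 100, 900, 700, 50],
})

# Sort once, then keep the first N rows of every year (N=2 here, 100 in the notebook)
top_n = toy.sort_values('num_subscribers', ascending=False).groupby('year').head(2)

# Count course types per year; combinations that do not occur become 0
counts = top_n.groupby(['year', 'is_paid']).size().unstack(fill_value=0)
print(counts)
```

Because `head` keeps the original columns, `year` stays a regular column and no extra `drop` is needed before the second `groupby`.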

#### Correlation between numeric columns

In [None]:
paid_course = df[df['is_paid'] == 'Paid']
numeric_cols = ['price', 'num_subscribers', 'num_reviews', 'num_lectures']
paid_numeric_df = paid_course[numeric_cols]
corr_mat = paid_numeric_df.corr()

axis_corr = sns.heatmap(corr_mat,annot=True, vmin=-1, vmax=1,cmap=sns.diverging_palette(50, 500, n=500),square=True)
plt.show()

- The correlation between `price` and `num_subscribers` is 0.046, which is very close to zero. There is essentially no linear relationship between them, so a higher price is not associated with fewer subscribers.
- `num_subscribers` and `num_reviews` are positively correlated, as are `num_lectures` and `price`.
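
To make the correlation values easier to interpret, the sketch below computes the Pearson coefficient by hand and compares it with pandas, which uses Pearson by default in `corr()`. The toy numbers are hypothetical; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; only the column names come from the dataset
toy = pd.DataFrame({
    'price': [10.0, 20.0, 30.0, 40.0],
    'num_subscribers': [120.0, 80.0, 150.0, 90.0],
})

# Center both columns on their means
x = toy['price'] - toy['price'].mean()
y = toy['num_subscribers'] - toy['num_subscribers'].mean()

# Pearson r = sum(x*y) / sqrt(sum(x^2) * sum(y^2)) on mean-centered values
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy['price'].corr(toy['num_subscribers'])
print(r_manual, r_pandas)
```

A value near 0 means no linear trend, while values near 1 or -1 indicate a strong positive or negative linear relationship. Correlation alone does not show that one column causes the other.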


In [None]:
paid_course['year'] = paid_course['published_time'].dt.year
# Calculate the price / minute for each year
average_price_by_year = paid_course.groupby('year')['price'].sum() / paid_course.groupby('year')['content_length_min'].sum()
plt.figure(figsize=(10, 6))
average_price_by_year.plot()
plt.ylabel('Price / minute')
plt.xlabel('Year')
plt.title('Fluctuation Price / Minute for each year')


- The price per minute tends to increase over the years.
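
The price per minute above is a ratio of yearly totals (total price divided by total minutes). This is not the same as averaging each course's own price per minute. The sketch below shows the difference on hypothetical toy values; the helper column `price_per_min` is introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    'year': [2020, 2020, 2021, 2021],
    'price': [10.0, 90.0, 20.0, 60.0],
    'content_length_min': [100.0, 300.0, 100.0, 100.0],
})

grouped = toy.groupby('year')
# Ratio of sums: total price divided by total minutes (what the cell above computes)
ratio_of_sums = grouped['price'].sum() / grouped['content_length_min'].sum()

# Mean of ratios: average of each course's own price per minute
toy['price_per_min'] = toy['price'] / toy['content_length_min']
mean_of_ratios = toy.groupby('year')['price_per_min'].mean()

print(ratio_of_sums)
print(mean_of_ratios)
```

The ratio of sums gives long courses more weight, while the mean of ratios treats every course equally. Which one fits better depends on the question being asked.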

**Conclusion**:
- Learners today prioritize the practical value and application of knowledge over cost considerations. This trend poses challenges for education platforms to deliver high-quality courses that meet the real needs and desires of learners.
- Contemporary learners highly value the quality of course content. They want to ensure that information is accurately conveyed, up-to-date, and practically valuable.
- Modern learners often seek courses that align with their individual goals and career development. Learning must be directly related to their areas of interest.
- The certification and credibility of a course have become crucial factors. Learners prioritize courses that provide valuable and recognized certificates in their respective fields.
- The increase in the price per minute may indicate an effort by course providers to enhance the value of the content.
- Learners may be prioritizing the quality of courses over the direct cost. This may reflect their focus on investing in high-quality learning.
- The increase in price may reflect a trend where learners prioritize certificates and credits, considering them more important than the direct price of the course.

## Question 2: How profitable is teaching courses on Udemy?

### Benefits of finding the answer
- Instructors can project potential earnings by understanding how Udemy's revenue-sharing model works. This helps in setting realistic financial goals.
- Knowledge of the profit model allows instructors to strategize their course pricing.
- Knowing how profits are earned encourages instructors to actively engage with the Udemy platform.
- Udemy provides instructors with valuable analytics and insights into learner behavior.
- Instructors learn how Udemy's growth has fluctuated and can decide whether or not to teach on the platform.
- Instructors can identify the categories that learners tend to enroll in and invest in those categories.

### Analyze data to answer the question

#### Top Profitable Courses

In [None]:
# Estimated profit = price * number of subscribers
paid_course['profit'] = paid_course['num_subscribers'] * paid_course['price']
top_course = paid_course.groupby(['id', 'title'])['profit'].sum()
# Keep the 10 most profitable courses, then sort ascending so that
# the largest bar appears at the top of the horizontal bar chart
top_course = top_course.sort_values(ascending=False).head(10)
top_course = top_course.sort_values(ascending=True)

plt.figure(figsize=(12, 8))
plt.barh(top_course.index.get_level_values('title'), top_course.values, color='skyblue')
plt.title('Top 10 Most Profitable Courses')
plt.xlabel('Total Profit')
plt.ylabel('Course Title')
plt.show()


- Most of the top 10 courses belong to the technology and IT category. Only **The Complete Digital Marketing Course - 12 Courses in 1** belongs to the economics category, which makes it the most profitable economics course in this ranking.
- IT is a hot trend: **Data Science** courses in particular earned a lot of money, and **Python** is the top-earning programming language to teach.

#### Who earns the most money from teaching on Udemy?

- Top 5 instructors who earned the most money in each year from 2013 to 2022.

In [None]:
# Instructors with the highest profit
filtered_data = paid_course[(paid_course['year'] >= 2013) & (paid_course['year'] <= 2022)]

instructor_profit_df = filtered_data.groupby(['instructor_name', 'instructor_url', 'year'])['profit'].sum().reset_index()
instructor_profit_df = instructor_profit_df.sort_values(by='profit', ascending=False)
top_instructors_by_year = instructor_profit_df.groupby('year').apply(lambda x: x.nlargest(5, 'profit')).reset_index(drop=True)
top_instructors_by_year

- Visualize

In [None]:
fig, axes = plt.subplots(2, 5, figsize=(20, 15), sharey=True)

for i, (year, data) in enumerate(top_instructors_by_year.groupby('year')):
    ax = axes[i // 5, i % 5]
    ax.bar(data['instructor_name'], data['profit'], color='skyblue')
    ax.set_title(f'{year}')
    ax.set_xlabel('Instructor Name')
    ax.set_ylabel('Profit')
    ax.set_xticklabels(data['instructor_name'], rotation=45, ha='right', fontsize=7) 

plt.tight_layout()
plt.show()

- According to the bar charts, instructors' profits increased until 2020 and then decreased significantly in 2021 and 2022.
- The golden age for earning money on Udemy was 2018 to 2022. The highest profit reached about 600 million dollars in a single year, which is a very large amount.
- The instructors **Learn Tech Plus** and **Srinldhi Ranganathan** regularly appear among the top earners by year, with large profits.

**Conclusion:**
- In the previous years, courses related to IT and Technology on Udemy have generated significant profits. This could be attributed to the growing Information Technology industry, the demand for technical skills in the job market, and a high level of interest from learners.
- The data suggests a substantial decline in course profits on Udemy in 2021 and 2022. Several factors may contribute to this, including increased competition, the emergence of alternative online education platforms, or even shifts in learners' preferences.
- External factors, such as economic conditions or global events, can impact the demand for specific skills. Economic downturns or shifts in the job market may influence the decision-making of individuals seeking courses, leading to changes in enrollment and, consequently, profitability.

## Question 3: How has Udemy developed?

### Benefits of finding the answer

- Insights into Online Education Trends: Understanding how Udemy developed provides insights into the trends and dynamics of the online education industry. This knowledge can be valuable for individuals interested in the field of e-learning.
- Entrepreneurial Inspiration: Udemy's success story can serve as inspiration for entrepreneurs looking to create platforms that make education more accessible. It showcases the potential for innovation in the education sector.
- Learning and Teaching Opportunities: Individuals interested in learning new skills or sharing their expertise can benefit from Udemy's platform. By understanding its development, users can make informed decisions about participating in the Udemy community.
- Impact on Education Accessibility: Udemy has played a role in making education accessible to a global audience. Understanding its development can contribute to discussions about the democratization of education and the role of technology in expanding learning opportunities.

### Preprocessing

- To make it easier to evaluate Udemy's growth, we analyze it by year. We therefore create a `year` column from the publication date.

In [None]:
udemy_df['year'] = udemy_df['published_time'].dt.year

### Analyze data to answer the question

First, let's look at how subscriber numbers change over time.

In [None]:
num_sub_per_year = udemy_df.groupby('year')['num_subscribers'].sum()
display(num_sub_per_year)

In [None]:
plt.figure(figsize=(10, 6))
num_sub_per_year.plot(kind='bar', color='skyblue')
plt.title('Number of subscribers over year')
plt.xlabel('Year')
plt.ylabel('Number of subscribers')
plt.show()

- We see that the number of subscribers tends to increase each year. 
- In 2020, we saw a sudden increase in the number of subscribers, perhaps due to the COVID-19 pandemic.

Next, the number of courses over time.

In [None]:
num_course_per_year = udemy_df.groupby('year')['id'].size()
plt.figure(figsize=(10, 6))
num_course_per_year.plot(kind='bar', color='skyblue')
plt.title('Number of courses over year')
plt.xlabel('Year')
plt.ylabel('Number of courses')
plt.show()

The number of instructors over time.

In [None]:
num_instruc_per_year = udemy_df.groupby('year')['instructor_name'].unique()
num_instruc_per_year = num_instruc_per_year.apply(lambda x: len(x))
plt.figure(figsize=(10, 6))
num_instruc_per_year.plot(kind='bar', color='skyblue')
plt.title('The number of instructors over year')
plt.xlabel('Year')
plt.ylabel('Number of instructors')
plt.show()

- In 2020 the number of subscribers increased dramatically. In 2021 the number of instructors and the number of courses continued to increase, but the number of subscribers dropped quite sharply. This can help instructors judge when it is a good time to enter the teaching market.
- In general, the number of courses and instructors still tend to increase.
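
To quantify the rises and drops described above, a year-over-year percentage change can be computed from the yearly totals. The sketch below uses hypothetical numbers, not the dataset's values; on the real data the same call could be applied to `num_sub_per_year`.

```python
import pandas as pd

# Hypothetical yearly totals; the real values come from num_sub_per_year
toy_subs = pd.Series(
    [100, 150, 300, 240],
    index=[2018, 2019, 2020, 2021],
    name='num_subscribers',
)

# pct_change compares each year with the previous one;
# the first year has no predecessor, so its value is NaN
yoy_pct = toy_subs.pct_change() * 100
print(yoy_pct.round(1))
```

A positive value means growth compared with the previous year, and a negative value means a decline.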

Now, we'll look at another aspect of Udemy's growth: the average duration of a course over time.

In [None]:
average_duration_per_year = udemy_df.groupby('year')['content_length_min'].mean()
plt.figure(figsize=(10, 6))
average_duration_per_year.plot()
plt.title('The average duration of each course over year')
plt.xlabel('Year')
plt.ylabel('The average duration')
plt.show()

- We see that the average duration of the course tends to decrease. This helps instructors and teaching centers adjust course times to suit the market.

#### Conclusion:
- Over the years, Udemy has experienced substantial growth in terms of the number of subscribers, instructors, and courses. The platform's user base has likely expanded significantly as more learners around the world turn to online education.
- The growth in the number of instructors and courses on Udemy indicates a diverse range of content available on the platform. This diversity attracts learners with varied interests and learning objectives, contributing to Udemy's popularity.
- Udemy's success is likely attributed to its global appeal, with a broad and diverse user base from different countries and cultures. The platform's ability to attract instructors and learners globally demonstrates its effectiveness in providing accessible education.
- Udemy's focus on both instructors and learners has contributed to its growth. Instructors are attracted by the opportunity to reach a global audience, while learners benefit from a wide array of courses tailored to various skill levels and interests.

## Question 4: How diverse are the languages on Udemy, and how do they scale?

### Benefits of finding the answer

- It shows us how diverse the languages used in Udemy courses are and what share each language has. This can also help us predict which languages will be commonly used in the near future.
- Learners tend to engage more actively with content presented in their native language. Offering courses in multiple languages can lead to increased participation, comprehension, and retention of information, as learners feel more comfortable and connected to the material.
- Udemy can tap into new markets and demographics by offering courses in different languages. This expansion can lead to increased user base and revenue opportunities as the platform becomes more inclusive and diverse.

### Preprocessing

We count the number of languages used in courses each year.

In [None]:
language_per_year = udemy_df.groupby('year')['language'].unique()
language_per_year = language_per_year.apply(lambda x: len(x))

In [None]:
plt.figure(figsize=(10, 6))
language_per_year.plot(kind='bar', color='skyblue')
plt.title('Number of languages over year')
plt.xlabel('Year')
plt.ylabel('Number languages')
plt.show()

We see that languages diversify over time, with more and more languages serving learners.

Next, we look at how much each language is used. Because there are many languages, we keep the 5 most popular ones and group all the others under the label 'Other'.

In [None]:
list_language_top = udemy_df.groupby('language')['id'].count().nlargest(5).index
list_language_top = pd.Series(list_language_top)
language_top = udemy_df[udemy_df['language'].isin(list_language_top)]
df_language = language_top.groupby(['year', 'language']).size().unstack(fill_value=0)
df_language['Other'] = udemy_df.groupby('year')['language'].size()- df_language.sum(axis=1)
df_percentage_language = df_language.div(df_language.sum(axis=1), axis=0) * 100
df_percentage_language.plot(kind='area', stacked=True, title='Ratio between languages over years')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.xlabel('Year')
fig = plt.gcf()
fig.set_size_inches(10, 5)
plt.show()

- We can see that the use of languages such as 'Spanish' and 'Portuguese' is increasing.
- English accounts for a large proportion but is no longer as dominant as before.
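
The "top N languages plus 'Other'" grouping can also be sketched by relabeling the rare languages first and then counting. This is an alternative formulation on hypothetical toy data (N is 2 instead of 5, and the helper column `language_group` is introduced only for illustration), not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    'year': [2020, 2020, 2020, 2021, 2021, 2021, 2021],
    'language': ['English', 'English', 'Spanish', 'English', 'Spanish', 'Thai', 'Urdu'],
})

# Keep the N most frequent languages (N=2 here, 5 in the notebook)
top_languages = toy['language'].value_counts().nlargest(2).index

# Keep top languages as they are and replace every other language with 'Other'
toy['language_group'] = toy['language'].where(toy['language'].isin(top_languages), 'Other')

# Count per year and group, then convert each row to percentages
counts = toy.groupby(['year', 'language_group']).size().unstack(fill_value=0)
share_pct = counts.div(counts.sum(axis=1), axis=0) * 100
print(share_pct.round(1))
```

Each row of `share_pct` sums to 100, so it can be passed directly to a stacked area plot.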

#### Conclusion:
- Udemy can tailor its offerings to local markets by providing courses in languages specific to those regions. This adaptability can help the platform stay relevant and competitive in a globalized education landscape.
- Language diversity opens up opportunities for instructors proficient in specific languages to create and deliver content. This can attract skilled instructors from various linguistic backgrounds, enriching the platform with a diverse pool of expertise.