In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore')

In [None]:
udemy_df = pd.read_csv('Data/udemy.csv', parse_dates=['published_time', 'last_update_date'])
udemy_df

## Question 5: What is the development trend of categories on Udemy over the years?

### Benefits of finding the answer

- It helps learners on Udemy choose categories that match current trends.
- It helps instructors on Udemy keep up with current trends.
- It shows how Udemy has developed, as reflected in the number of courses.
- It captures the rise and fall of individual categories.
- Learning trends across Udemy categories partly reflect broader trends in society.
- Trends over the years may also hint at the economic conditions of each period.

### Preprocessing

- To simplify processing and visualization, I will keep only the columns `num_subscribers`, `category`, and `published_time`.

In [None]:
# Copy the subset so later column assignments do not write into a view of udemy_df.
category_df = udemy_df[['num_subscribers', 'category', 'published_time']].copy()
category_df

- I will convert the values in the `published_time` column to years.

In [None]:
# published_time was parsed as datetime by read_csv, so the .dt accessor gives the year directly.
category_df['published_time'] = category_df['published_time'].dt.year
category_df

### Analyze data to answer the question

#### Top 5 categories with the most courses in each year in the last 10 years

- Find top 5 categories with the most courses in each year in the last 10 years

In [None]:
# Keep courses published from 2012 onward.
category_df = category_df[category_df['published_time'] >= 2012]
years = np.sort(category_df['published_time'].unique())
top5_by_year = {}
for year in years:
    # Count courses per category for this year and keep the 5 largest.
    temp_df = category_df[category_df['published_time'] == year]
    top5_by_year[year] = temp_df['category'].value_counts().head(5).to_dict()

top5_by_year_df = pd.DataFrame(top5_by_year)
top5_by_year_df

- Visualize top 5 categories with the most courses in each year in the last 10 years

In [None]:
years = top5_by_year_df.columns
indexes = top5_by_year_df.index

plt.figure(figsize=(20, 10))

for index in indexes:
    plt.plot(years, top5_by_year_df.loc[index], label=index)

plt.xlabel('Year')
plt.xticks(years)
plt.ylabel('Number of courses')
plt.title('Number of courses per category over the years')
plt.legend()
plt.grid(True)
plt.show()

- The number of courses increased steadily over the years, peaked in 2021, and then began to decrease gradually.
- Since 2014, the top 5 categories with the most courses have not changed.
- `Development` and `Business` are the only categories that have been in the top 5 consistently for the last 10 years.
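
The observations above are read off the chart, so a small cross-check may help. The following is a minimal sketch, not the notebook's own method: it builds the per-year top-n table with `groupby` instead of a loop, then lists the categories that appear in the top n every year and the year with the most courses. The rows in `toy_df` are made up, and the helper names `top_n_counts` and `always_in_top` are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for category_df (one row per course).
rows = (
    [(2019, 'Development')] * 3 + [(2019, 'Business')] * 2 + [(2019, 'Design')]
    + [(2020, 'Design')] * 4 + [(2020, 'Development')] * 2 + [(2020, 'Business')]
)
toy_df = pd.DataFrame(rows, columns=['published_time', 'category'])


def top_n_counts(df, n=5):
    """Category x year table of course counts, keeping only the top n per year."""
    counts = (df.groupby(['published_time', 'category'])
                .size()
                .reset_index(name='n_courses'))
    # Sort so the largest counts come first within each year, then keep n rows per year.
    top = (counts.sort_values(['published_time', 'n_courses'], ascending=[True, False])
                 .groupby('published_time')
                 .head(n))
    # Categories outside a year's top n become NaN in that year's column.
    return top.pivot(index='category', columns='published_time', values='n_courses')


def always_in_top(top_table):
    """Categories that have a value in every year column."""
    return sorted(top_table.dropna().index)


top2 = top_n_counts(toy_df, n=2)
courses_per_year = toy_df['published_time'].value_counts().sort_index()
peak_year = courses_per_year.idxmax()

print(top2)
print('Always in the top 2:', always_in_top(top2))
print('Year with the most courses:', peak_year)
```

On the real data, calling `top_n_counts(category_df, n=5)` after the year conversion should give a table comparable to `top5_by_year_df`, assuming the column names match those used above.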

#### Top 5 categories with the most subscribers in each year in the last 10 years

- Find top 5 categories with the most subscribers in each year in the last 10 years

In [None]:
category_df = category_df[category_df['published_time'] >= 2012]
years = category_df['published_time'].unique()
years.sort()
top_5_num_sub_by_year = {}
for year in years:
    temp_df = category_df[category_df['published_time'] == year][['category', 'num_subscribers']]
    count_subscribers = temp_df.groupby('category').sum()
    count_subscribers = count_subscribers.sort_values('num_subscribers', ascending=False).head(5)
    top_5_num_sub_by_year[year] = count_subscribers.to_dict()['num_subscribers']
    
top_5_num_sub_by_year_df = pd.DataFrame(top_5_num_sub_by_year)
top_5_num_sub_by_year_df

- Visualize top 5 categories with the most subscribers in each year in the last 10 years

In [None]:
years = top_5_num_sub_by_year_df.columns
indexes = top_5_num_sub_by_year_df.index

plot_df = top_5_num_sub_by_year_df
max_in_year = plot_df.agg(['max'])

bar_width = 0.1
index = np.arange(len(plot_df.columns))
fig, ax = plt.subplots(figsize=(30,10))

count = -3
for i in indexes:
    data = plot_df.loc[i]
    ax.bar(index - bar_width / 2 + bar_width * count, data, bar_width, label=i)
    count += 1

ax.bar(index, max_in_year.loc['max'] + 1000000, bar_width * 9, alpha = 0.1, color='black')
    
ax.set_title('Total subscribers per category over the years')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(years)
ax.legend()
ax.grid(True)
plt.show()

- The number of subscribers increased continuously over the years, peaked in 2020, and then dropped very quickly.
- The top 5 categories with the most subscribers fluctuate quite a lot from year to year.
- `Development` and `Business` are the only categories that have been in the top 5 consistently for the last 10 years.
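
To put numbers behind these observations, it may help to look at each category's share of a year's subscribers rather than only the raw totals. The sketch below is one possible way to do that with `pivot_table`. It assumes one row per course with the same column names as `category_df`, and the values in `toy_df` are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for category_df (one row per course).
toy_df = pd.DataFrame({
    'published_time': [2019, 2019, 2019, 2020, 2020, 2020],
    'category': ['Development', 'Business', 'Development',
                 'Development', 'Business', 'Design'],
    'num_subscribers': [500, 200, 300, 1400, 400, 200],
})

# Total subscribers per category (rows) and publication year (columns).
totals = toy_df.pivot_table(index='category', columns='published_time',
                            values='num_subscribers', aggfunc='sum', fill_value=0)

# Share of each year's subscribers that belongs to each category.
shares = totals.div(totals.sum(axis=0), axis=1)

# Publication year whose courses gathered the most subscribers overall.
peak_year = totals.sum(axis=0).idxmax()

print(totals)
print(shares.round(2))
print('Peak year:', peak_year)
```

Note that the years here are publication years, so a total is the number of subscribers of courses published in that year, not the number of subscriptions made during that year.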

**Conclusion:**
- The two charts, showing the top 5 categories by number of courses and by number of subscribers, both include `Development` and `Business` in every year. Whatever the trend, these two categories keep an important position.
- The categories have the most subscribers from 2018 to 2020. This may be because the world was dealing with Covid-19 during this period: in-person schooling was limited, and many people chose an online learning platform like Udemy to continue their studies.
- `Development` does not have the most courses. From 2020 onward, it ranks only third by number of courses. However, it always ranks first by number of subscribers, with a very large gap to the other categories.
- Although `IT & Software` has only appeared since 2015, its number of subscribers has been second only to `Development` ever since.
- The world seems to be following the technology trend, so `Development` and `IT & Software` courses attract a great deal of interest.
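
The contrast between a category's rank by course count and its rank by subscribers can also be computed directly. The following is a minimal sketch under the same assumptions as the analysis above (one row per course, columns named as in `category_df`). The numbers in `toy_df` are made up, and the added column names are hypothetical.

```python
import pandas as pd

# Made-up rows for a single publication year (one row per course).
toy_df = pd.DataFrame({
    'published_time': [2020] * 6,
    'category': ['Business', 'Business', 'Business',
                 'Development', 'Development', 'Design'],
    'num_subscribers': [100, 50, 50, 900, 600, 20],
})

# Course count and subscriber total per (year, category).
summary = (toy_df.groupby(['published_time', 'category'])['num_subscribers']
                 .agg(n_courses='count', total_subscribers='sum')
                 .reset_index())

# Rank within each year; rank 1 is the largest value.
by_year = summary.groupby('published_time')
summary['course_rank'] = by_year['n_courses'].rank(ascending=False, method='min').astype(int)
summary['subscriber_rank'] = (by_year['total_subscribers']
                              .rank(ascending=False, method='min').astype(int))

# Average subscribers per course, a rough measure of demand relative to supply.
summary['subscribers_per_course'] = summary['total_subscribers'] / summary['n_courses']

print(summary)
```

A category with a low `course_rank` number but a high `subscriber_rank` number has many courses and comparatively few subscribers. The reverse pattern is what the conclusion describes for `Development`.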

In [None]:
a = udemy_df.groupby('category')['subcategory'].value_counts()
for k, v in a.items():
    print(k, v)

In [None]:
# Comparing with `== NaT` never matches, so missing dates need isna():
# udemy_df[udemy_df['last_update_date'].isna()]
a = udemy_df[['published_time', 'last_update_date', 'num_subscribers']].copy()
# Reduce both datetime columns to years; a missing last_update_date becomes NaN.
a['published_time'] = a['published_time'].dt.year
a['last_update_date'] = a['last_update_date'].dt.year
a.sort_values('num_subscribers')