Popular Data Science Questions

Data Science Stack Exchange is a question and answer site for data science professionals, machine learning specialists, and those interested in learning more about the field.

This site is all about getting answers. It's not a discussion forum. There's no chit-chat.

The site has the following guidelines:
1. Focus on questions about an actual problem you have faced. Include details about what you have tried and exactly what you are trying to do.

2. Avoid questions that are primarily opinion-based, or that are likely to generate discussion rather than answers.

The site is also subdivided into:

Teams - for collaboration and sharing organizational knowledge

Users - professionals willing to help and share answers

Companies - what it is like working for different companies




Contrary to other Stack Exchange sites, Data Science Stack Exchange (DSSE for short) specialises in Data Science.

DSSE also has a high percentage of unanswered questions: A complete list of Stack Exchange websites can be found here, sorted based on the proportion of answered questions. At the time of this writing, Data Science Stack Exchange (DSSE) is one of the bottom ten sites, having only 65% of its questions answered.

The data science specialisation and a high percentage of unanswered questions make DSSE the ideal candidate for our investigation.


Read more about the Stack Exchange Data Explorer (SEDE) in its help section and at https://data.stackexchange.com/tutorial

SEDE: The Stack Exchange Data Explorer

Stack Exchange provides a public database for each of its websites.  It is important to note that the database is designed to be queried with the Transact-SQL (Microsoft's SQL) dialect.

After exploring the database, we found a few tables that seem relevant to our analysis:

Posts: 
Contains comprehensive information about posts, including the creation date, tags, number of answers, views and upvotes among many more.

Tags: 
Holds information about different tags including the number of times they have been used on the site. However, it does not provide time-series information to help us identify if a tag was popular in the past or present.

PostTags: 
Contains information on posts and their tags alone. Similar to the Tags table, time series information is absent.

TagSynonyms: 
Provides information on tags and alternative names assigned to them by site administrators. Time series information is absent.

Given the absence of time-series information in the Tags, PostTags and TagSynonyms tables, and considering that the Posts table already contains the relevant details about tags, we will use the information in the Posts table alone.

The Posts Table

The Posts Table has 23 columns. We will focus only on those that seem relevant to our goal:

(1.) Id: An identification number for each post.
(2.) PostTypeId: An identification number for the type of post.
(3.) CreationDate: The date and time of creation of the post.
(4.) Score: The post's score.
(5.) ViewCount: How many times the post was viewed.
(6.) Tags: What tags were used.
(7.) AnswerCount: How many answers the question got (only applicable to question posts).
(8.) FavoriteCount: How many times the question was favorited.

We are primarily interested in posts that are questions; other post types are not relevant at the moment. Before proceeding, we can check how many posts on the site are questions, relative to other posts.

Note that, with the exception of the Tags column, the last few columns contain information about how popular the post is, which is the kind of information we're after.

There are eight different types of post. Before we try to figure out which of them are relevant to us, let's check how many posts of each type there are, using the query below:

SELECT PostTypeId, COUNT(*) as NrOfPosts
  FROM posts
 GROUP BY PostTypeId;
 
Results:

PostTypeId  NrOfPosts
1           21446
2           23673
4           236
5           236
6           11
7           1

Due to their low volume, posts that are neither questions (PostTypeId 1) nor answers (PostTypeId 2) are mostly inconsequential. Even if some of these posts happen to be immensely popular, they would just be outliers and not relevant to us. We'll therefore focus on the questions.

Since we're only interested in recent posts, we'll limit our analysis to the posts of 2019. (At the time of writing it is early 2022)
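The 2019 subset is read from a local CSV file further down, and the query that produced it is not shown here. As a rough sketch only, assuming a hypothetical export of all question posts with a parsed CreationDate column, the same restriction could be applied in pandas. The `posts` frame and its values below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a full export of question posts (made-up rows)
posts = pd.DataFrame({
    "Id": [101, 102, 103, 104],
    "CreationDate": pd.to_datetime([
        "2018-12-31 23:59:00",
        "2019-01-01 00:00:00",
        "2019-12-31 23:59:59",
        "2020-01-01 00:00:00",
    ]),
})

# Keep only the rows whose creation year is 2019
questions_2019 = posts[posts["CreationDate"].dt.year == 2019]
```

Filtering on the year of a parsed datetime column avoids string comparisons on the raw date text.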

IMPORTING LIBRARIES

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set default attributes for charts or graphs
%matplotlib
%config InlineBackend.figure_format = 'retina'
plt.style.use('default')
plt.rcParams.update({'font.family':'Arial'})

Using matplotlib backend: agg


In [12]:
questions_df = pd.read_csv("2019_questions.csv", parse_dates=['CreationDate'])

In [13]:
questions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    1407 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB


In [14]:
questions_df.sample(5)

Unnamed: 0,Id,CreationDate,Score,ViewCount,Tags,AnswerCount,FavoriteCount
8708,55699,2019-07-15 14:19:18,1,77,<machine-learning><python><regression><feature...,2,1.0
4616,51154,2019-04-30 04:18:11,0,50,<python><recommender-system><pyspark><metric>,0,0.0
124,44617,2019-01-26 19:36:50,1,150,<machine-learning><python><multilabel-classifi...,1,
7110,53814,2019-06-14 18:44:26,1,144,<nlp><pytorch><transformer>,0,
4631,61983,2019-10-20 09:56:39,0,5,<time-series><algorithms>,0,


NOTES:
  
The dataset presents some issues that we should resolve before analysis.

The FavoriteCount column has missing values and is stored with the wrong datatype.

The Tags column contains information about different tags at once, which makes the data untidy at the moment.

The wrong datatype assigned to the FavoriteCount column is probably because of the missing values. We can explore the column further to be sure:

In [15]:
questions_df.FavoriteCount.value_counts(dropna = False)

NaN      7432
 1.0      953
 2.0      205
 0.0      175
 3.0       43
 4.0       12
 5.0        8
 6.0        4
 7.0        4
 11.0       1
 8.0        1
 16.0       1
Name: FavoriteCount, dtype: int64

Most posts (7,432 of 8,839) have no recorded favorite count. Among the 1,407 posts that do, the large majority have a count of 1 or 2, and 175 have a count of zero. It is best to drop the FavoriteCount column, since it adds little information to our analysis.
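The link between missing values and the float dtype can be seen on a small, made-up series. This is only a sketch: NumPy-backed integer columns cannot hold NaN, so pandas falls back to float64, whereas the nullable `Int64` dtype can store integers alongside missing values. The `favorites` series is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up favorite counts with one missing value
favorites = pd.Series([1, 2, np.nan, 0])

# The NaN forces the default dtype to float64
default_dtype = favorites.dtype

# Option 1: treat missing as zero, then cast to a plain integer dtype
filled = favorites.fillna(0).astype("int64")

# Option 2: keep the gaps by using pandas' nullable integer dtype
nullable = favorites.astype("Int64")
```

Neither option is needed here, since the column is dropped, but the choice matters whenever "missing" and "zero" should stay distinct.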

Cleaning the Data

1. The FavoriteCount column adds no extra value, so we can drop it from our dataframe

In [16]:
questions_clean = questions_df.copy()


In [17]:
questions_clean.drop(columns = "FavoriteCount", inplace=True)

In [18]:
# Verify the removal of the favorite count column
assert "FavoriteCount" not in questions_clean.columns

In [19]:
questions_clean

Unnamed: 0,Id,CreationDate,Score,ViewCount,Tags,AnswerCount
0,44419,2019-01-23 09:21:13,1,21,<machine-learning><data-mining>,0
1,44420,2019-01-23 09:34:01,0,25,<machine-learning><regression><linear-regressi...,0
2,44423,2019-01-23 09:58:41,2,1651,<python><time-series><forecast><forecasting>,0
3,44427,2019-01-23 10:57:09,0,55,<machine-learning><scikit-learn><pca>,1
4,44428,2019-01-23 11:02:15,0,19,<dataset><bigdata><data><speech-to-text>,0
5,44430,2019-01-23 11:13:32,0,283,<fuzzy-logic>,1
6,44432,2019-01-23 11:17:46,1,214,<time-series><anomaly-detection><online-learning>,0
7,44436,2019-01-23 12:49:39,0,9,<matrix-factorisation>,0
8,44437,2019-01-23 13:04:11,0,7,<correlation><naive-bayes-classifier>,0
9,44438,2019-01-23 13:16:29,0,584,<machine-learning><python><deep-learning><kera...,1


2. The Tags Column

The values in the Tags column are strings that look like this:

"<machine-learning><regression><linear-regression><regularization>"

We'll want to transform this string into something more suitable for typical string methods. Our goal will be to transform strings like the one above into something like:

"machine-learning,regression,linear-regression,regularization"

In [21]:
# Split all the tags in each post into five columns
tag_columns = ['Tag1', 'Tag2', 'Tag3', 'Tag4', 'Tag5']
questions_clean[tag_columns] = (questions_clean.Tags.str.replace('<', '')
                                                        .str.rstrip('>')
                                                        .str.split('>', expand=True)
                               )
# Drop the Tags column
questions_clean.drop(columns = 'Tags', inplace=True)

# Condense all Tags into two columns
questions_clean = pd.melt(questions_clean, 
                          id_vars=['Id', 'CreationDate', 'Score', 'ViewCount', 'AnswerCount'],
                          value_vars = tag_columns, 
                          var_name= 'TagNumber', 
                          value_name= 'TagName'
                         )

In [22]:
# Preview the resulting dataframe
questions_clean.head()

Unnamed: 0,Id,CreationDate,Score,ViewCount,AnswerCount,TagNumber,TagName
0,44419,2019-01-23 09:21:13,1,21,0,Tag1,machine-learning
1,44420,2019-01-23 09:34:01,0,25,0,Tag1,machine-learning
2,44423,2019-01-23 09:58:41,2,1651,0,Tag1,python
3,44427,2019-01-23 10:57:09,0,55,1,Tag1,machine-learning
4,44428,2019-01-23 11:02:15,0,19,0,Tag1,dataset
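The reshaping above ends with one row per question and tag. As an illustrative alternative (not the approach used in this notebook), the same tidy layout can be sketched with `Series.str.split` followed by `DataFrame.explode`. The `toy` frame below is made up.

```python
import pandas as pd

# Made-up questions using the same "<tag><tag>" encoding as the Tags column
toy = pd.DataFrame({
    "Id": [1, 2],
    "Tags": ["<python><pandas>", "<nlp>"],
})

# Strip the outer brackets, split on the "><" separator,
# then expand each list of tags into one row per tag
tidy = (toy.assign(TagName=toy["Tags"].str.strip("<>").str.split("><"))
           .explode("TagName")
           .drop(columns="Tags")
           .reset_index(drop=True))
```

Unlike the split-and-melt version, this sketch does not record the tag position (Tag1 to Tag5) and does not create empty rows for questions with fewer than five tags.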


3. ANALYSIS

We now focus on determining the most popular tags. We'll do so by considering two different popularity proxies: for each tag we'll count how many times the tag was used, and how many times a question with that tag was viewed.

We could also take into account the score, or whether or not a question is part of someone's favorite questions. These are all reasonable options to investigate, but we'll limit the focus of our research to counts and views for now.

Here we want to find out how many times each individual tag was used.

In [23]:
# Compute the use frequency of each tag
tag_counts = questions_clean.groupby('TagName').size()



In [24]:
# -- Compute the total scores, views and answers by TagName

col_dict = {
    # Stores the columns of interest and their aggregate column names
    "cols": ['Score', 'ViewCount', 'AnswerCount'],
    "aggregate_names": ['total_score', 'total_views', 'total_answers']
}


In [25]:

# Compute relevant totals for each unique TagName
aggregate_df = questions_clean.groupby('TagName')[col_dict['cols']].sum()



In [26]:
# Rename resulting columns with their aggregate names
aggregate_df.columns = col_dict['aggregate_names']



In [27]:
# Update dataframe with the count of each tag
aggregate_df['count'] = tag_counts



In [28]:
# Preview results sorted by count
aggregate_df.sort_values(by='count', ascending=False).head(10)

TagName,total_score,total_views,total_answers,count
machine-learning,2515,388499,2313,2693
python,1475,537585,1507,1814
deep-learning,1127,233628,877,1220
neural-network,1021,185367,824,1055
keras,785,268608,654,935
classification,701,104457,651,685
tensorflow,417,121369,353,584
scikit-learn,507,128110,518,540
nlp,455,71382,369,493
cnn,452,70349,362,489


Notes:
Machine learning, Python and Deep learning seem like the most popular tags overall. They have the highest frequency of use, score and number of answers. The one exception among the tags shown is views, where keras (268,608) edges out deep-learning (233,628).

It seems that as the frequency of use (count) decreases, other measures tend to decrease as well. This suggests a possible positive correlation between the count column and all other measures.

To further investigate these observations, we can examine the relationship between the count column and other columns. 
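One simple way to put a number on this relationship is a Pearson correlation. The sketch below uses only the ten rows shown in the preview table above, so it illustrates the idea and does not measure the full dataset. In the notebook itself, the same call on `aggregate_df` would cover every tag.

```python
import pandas as pd

# The ten most-used tags, copied from the preview table above
top_ten = pd.DataFrame(
    {
        "total_score":   [2515, 1475, 1127, 1021, 785, 701, 417, 507, 455, 452],
        "total_views":   [388499, 537585, 233628, 185367, 268608,
                          104457, 121369, 128110, 71382, 70349],
        "total_answers": [2313, 1507, 877, 824, 654, 651, 353, 518, 369, 362],
        "count":         [2693, 1814, 1220, 1055, 935, 685, 584, 540, 493, 489],
    },
    index=["machine-learning", "python", "deep-learning", "neural-network",
           "keras", "classification", "tensorflow", "scikit-learn", "nlp", "cnn"],
)

# Pearson correlation of each measure with the use count
correlations = top_ten.corr()["count"].drop("count")
```

A value close to 1 indicates that a measure rises almost linearly with the use count.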

In addition, we will visualize the popularity of the leading tags (suspected at this point), relative to other tags used on the site

In [32]:
fig = plt.figure(figsize=(14, 4))

facets = col_dict['aggregate_names']
leading_tags = ['machine-learning', 'python', 'deep-learning']

# Generate count vs measure plots
for index, col in zip(range(3), facets):
    plt.subplot(1, 3, index + 1)
    sns.regplot(data=aggregate_df, x='count', y=col, 
                color='#444444', ci=False)  # Updated color value
    
    plt.scatter(data=aggregate_df.loc[leading_tags], 
                x='count', y=col, c='#0064AB')
    
    # Annotate the leading tags
    for tag in leading_tags:
        plt.text(x=aggregate_df.loc[tag, 'count'] - 60, 
                 y=aggregate_df.loc[tag, col] - aggregate_df[col].max() / 100,
                 s=tag.replace('-', ' ').title(), alpha=0.9,
                 fontsize=8, color='#0064AB', fontweight='bold', 
                 ha='right')
        
    y_ticks = [0, int(aggregate_df[col].max() / 2), aggregate_df[col].max()]
    x_ticks = range(0, 3500, 1000)
    plt.xticks(x_ticks, ["{}k".format(t // 1000) if t > 0 else t for t in x_ticks], 
               color='gray', fontsize=8)
    plt.yticks(y_ticks, ["{}k".format(t // 1000) if t > 0 else t for t in y_ticks], 
               color='gray', fontsize=8)
    plt.xlabel('count', fontsize='9', fontweight='bold', alpha=0.3)
    plt.ylabel(col, fontsize='9', fontweight='bold', alpha=0.3)
    plt.title('Popularity and {}'.format(col.split('_')[1].title()), weight='bold', alpha=0.4)
    
# Set attributes common to all plots
axes_list = fig.get_axes()
for ax in axes_list:
    for loc in ['left', 'bottom', 'top', 'right']:
        ax.spines[loc].set_color('#DDDDDD')  # Updated color value
    ax.tick_params(left=False, bottom=False)
    sns.despine()

# Plot title
axes_list[0].text(x=-100, y=2500, fontsize=20, fontweight='bold', alpha=0.7,
                  s='Most Popular DSSE Tags', color='#0064AB')

axes_list[0].text(x=3500, y=2500, fontsize=20, fontweight='bold', color="#444444",
                  s='vs Other Tags')

plt.subplots_adjust(hspace=0.3, wspace=0.3)



In [33]:
plt.show()

Notes:
Our initial observation is confirmed. Machine Learning, Python and Deep Learning are the three most popular tags used on the site. 

This is apparent based on their position in the upper right of each chart. Machine learning also shows a greater level of popularity than the other two tags.

In addition, there is a positive correlation between the frequency of use and every other measure. It is clear that the most popular topics get the highest number of views, upvotes and answers.

Relations Between Tags

The tags present in most_used and not present in most_viewed are:

machine-learning-model
statistics
predictive-modeling
r

And the tags present in most_viewed but not in most_used are:

csv
pytorch
dataframe

Some tags also stand out as being related. For example, python is related to pandas, as we can find both pythons and pandas in the same country; or better yet, because pandas is a Python library. So by writing about pandas, we can actually tackle two tags at once.



Earlier, we identified the three leading tags based on different measures of popularity. Determining how these popular tags relate to one another would be beneficial. Our question is: do people combine these tags in their posts? If yes, which popular tags are commonly used together?

To address these questions, we will consider the top ten tags based on use frequency or count:

In [34]:
# Identify the most popular tags by count
popular_tags = (aggregate_df.sort_values(by='count', ascending=False)
                            .head(10)
                            .index
               )

for position, tag in enumerate(popular_tags):
    print(position+1, tag)

1 machine-learning
2 python
3 deep-learning
4 neural-network
5 keras
6 classification
7 tensorflow
8 scikit-learn
9 nlp
10 cnn


Let's start by creating a dataframe that will contain the number of times that these tags were used together:

In [36]:
# Start from a table of zeros, with the popular tags as both rows and columns
combinations = pd.DataFrame(0, index= list(popular_tags), 
                            columns= list(popular_tags)
                           )
combinations.head()

Unnamed: 0,machine-learning,python,deep-learning,neural-network,keras,classification,tensorflow,scikit-learn,nlp,cnn
machine-learning,0,0,0,0,0,0,0,0,0,0
python,0,0,0,0,0,0,0,0,0,0
deep-learning,0,0,0,0,0,0,0,0,0,0
neural-network,0,0,0,0,0,0,0,0,0,0
keras,0,0,0,0,0,0,0,0,0,0


Next, we will populate the table with the number of times each tag is used in combination with another. We will not count instances where a tag is used in isolation since that would be redundant for our purposes.

In [38]:
import itertools

# Isolate records for the 10 popular tags
popular_tags_df = questions_clean.query('TagName in @popular_tags')

# Pivot the dataframe to view tag combinations
popular_tags_df = popular_tags_df.pivot(index='Id', 
                                        columns='TagNumber', 
                                        values = 'TagName'
                                       )

# Count every pair of popular tags used on the same question,
# whichever tag positions (Tag1 to Tag5) they occupy
for _, row in popular_tags_df.iterrows():
    tags_in_post = row.dropna().tolist()
    for first_tag, next_tag in itertools.combinations(tags_in_post, 2):
        combinations.loc[first_tag, next_tag] += 1
        combinations.loc[next_tag, first_tag] += 1
            
# Visualize results with a heatmap
plt.figure(figsize=(10,4))

ax = sns.heatmap(data=combinations, annot=True, fmt='0', cmap='Blues',
            cbar=False, linewidth=1)

ax.xaxis.set_ticks_position('top')
ax.tick_params(left=False, top=False)

plt.xticks(fontsize=7, fontweight='bold')
plt.yticks(fontsize=9, alpha=0.5, fontweight='bold')

plt.title('How the most popular tags are combined', alpha=0.5, fontsize=16, loc='left', fontweight='bold')




plt.text(x=0, y=-1.2, s='Each box represents the number of cases', 
         color = '#0064AB', fontweight='bold', alpha=0.7);
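As a cross-check on the counting loop, pairwise tag counts can also be derived from a matrix product. The sketch below is an alternative technique and not the method used in this notebook: it builds a question-by-tag indicator matrix with `pd.crosstab` and multiplies it by its own transpose. The `toy` data is made up.

```python
import numpy as np
import pandas as pd

# Made-up long-format data: one row per (question, tag) pair
toy = pd.DataFrame({
    "Id":      [1, 1, 1, 2, 2, 3],
    "TagName": ["python", "keras", "cnn", "python", "keras", "cnn"],
})

# Indicator matrix: rows are questions, columns are tags, 1 if the tag is used
indicator = pd.crosstab(toy["Id"], toy["TagName"]).clip(upper=1)

# X^T X counts, for each pair of tags, the questions that carry both
co_occurrence = indicator.T.dot(indicator)

# The diagonal holds each tag's own frequency; subtract it to keep only pairs
diagonal = np.diag(np.diag(co_occurrence.to_numpy()))
pairs = co_occurrence - diagonal
```

The result is symmetric by construction, which makes it a convenient sanity check for a hand-written loop.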

Notes:
The most popular tags (Machine Learning, Python and Deep Learning) appear to have the strongest associations. They are used together more often than any other combination of popular tags.

Machine learning is also used frequently with other tags like classification, neural network and NLP.

Our findings are becoming interesting at this point. We identified Machine learning as the most popular tag and then discovered that many other popular tags are used together with it. 

At this point, exploring external sources for information on all these tags would be helpful, especially how they relate to one another in practice.

Engaging Domain Knowledge
Collecting information from the DSSE website, as well as other external sources yields the following information on the most popular tags.

Machine Learning: A subfield of computer science that draws on elements from algorithmic analysis, computational statistics, mathematics, optimization, etc. It is mainly concerned with the use of data to construct models that have high predictive/forecasting ability (source).

Deep learning: A new area of Machine Learning research concerned with the technologies used for learning hierarchical representations of data, mainly done with deep neural networks (i.e. networks with two or more hidden layers), but also with some sort of Probabilistic Graphical Models (source).

Neural networks: Neural networks make up the backbone of deep learning algorithms. Specifically called artificial neural networks (ANNs), they are designed to mimic the human brain through a set of algorithms (source).

Natural Language Processing (NLP): Involves using machine learning, deep learning (in recent trends) algorithms and “narrow” artificial intelligence (AI) to understand the meaning of text documents (source).

Classification: A subset of machine learning (supervised learning) that identifies the category or categories to which a new instance in a dataset belongs (source).

TensorFlow: An open source library for machine learning, deep learning and machine intelligence (source).

Keras: A popular, open-source deep learning API for Python that is built on top of TensorFlow and is useful for fast implementation (source).

Scikit-learn: A popular machine learning package for Python that has simple and efficient tools for predictive data analysis (source).

Information on Python and Time series is intentionally omitted since they are quite generic.

Most of the top tags are associated with machine learning in one way or another. This explains our earlier observation, where most of the tags were combined with machine learning. Since machine learning is a broad field, and python could be generic to many applications, the interesting topic here appears to be the relatively narrower alternative - Deep learning.

In other words, we can say that the most popular topic among the 2019 DSSE questions we analyzed is deep learning.

Is Deep Learning Just A Fad?
Before we communicate our recommendation, it is important to ensure that our findings are backed with evidence. Ideally, we want the content we create to be relevant and useful for as long as possible. To ensure this, we need to identify whether people's interest in deep learning is increasing over time or slowing down.

To address this question, we will pull information on all questions posted on DSSE to date. The relevant information is the Id of the question, the CreationDate and the Tags used. The query below serves this purpose:

SELECT Id, 
       CreationDate, 
       Tags
  FROM posts
 WHERE PostTypeId = 1
 ORDER BY CreationDate;

The output of the query has been stored in a local file named all_questions.csv.

In this section, we will track the interest in deep learning across time. For each time period, we will compute:

The number of deep learning questions asked.
The total number of questions asked.
The proportion of deep learning questions relative to the total number of questions.

In [39]:
# Read and preview the local file
all_questions = pd.read_csv('./all_questions.csv', parse_dates=['CreationDate'])
all_questions.head()

Unnamed: 0,Id,CreationDate,Tags
0,45416,2019-02-12 00:36:29,<python><keras><tensorflow><cnn><probability>
1,45418,2019-02-12 00:50:39,<neural-network>
2,45422,2019-02-12 04:40:51,<python><ibm-watson><chatbot>
3,45426,2019-02-12 04:51:49,<keras>
4,45427,2019-02-12 05:08:24,<r><predictive-modeling><machine-learning-mode...


Let's expand the Tags column to make the data easier to work with:

In [40]:
# Expand tags into separate columns
all_questions[tag_columns] = (all_questions.Tags.str.replace('<', '')
                                                .str.rstrip('>')
                                                .str.split('>', expand=True)
                             )
# Drop the original Tags column
all_questions.drop(columns = 'Tags', inplace=True)
all_questions.head(3)

Unnamed: 0,Id,CreationDate,Tag1,Tag2,Tag3,Tag4,Tag5
0,45416,2019-02-12 00:36:29,python,keras,tensorflow,cnn,probability
1,45418,2019-02-12 00:50:39,neural-network,,,,
2,45422,2019-02-12 04:40:51,python,ibm-watson,chatbot,,


Our goal here is to identify tags that are related to deep learning. At the same time, we are also concerned about popularity. To combine our interests, we will follow a two-step process:

1. Identify the 20 most popular tags.
2. From the top 20, find tags that are related to deep learning.

1. Identify the 20 most popular tags

In [41]:
top_20 = aggregate_df.sort_values(by='count', ascending=False).head(20)
for num, tag in enumerate(top_20.index):
    print(num+1, tag)

1 machine-learning
2 python
3 deep-learning
4 neural-network
5 keras
6 classification
7 tensorflow
8 scikit-learn
9 nlp
10 cnn
11 time-series
12 lstm
13 pandas
14 regression
15 dataset
16 r
17 predictive-modeling
18 clustering
19 statistics
20 machine-learning-model


2. From the top 20, find tags that are related to deep learning
A quick online search on each of the top 20 tags reveals the tags listed below as related to deep learning. The last two, "convolutional-neural-network" and "pytorch", fall outside the top 20 shown above but are included in the list as well:

"lstm", "cnn", "scikit-learn", "tensorflow", "keras", "neural-network", "deep-learning", 
"convolutional-neural-network" and "pytorch"




Any tag that belongs to this list will be classified as a deep learning tag. We can reflect this classification in our dataframe:


In [42]:

dl_tags = ["lstm", "cnn", "scikit-learn", "tensorflow",
           "keras", "neural-network", "deep-learning",
           "convolutional-neural-network", "pytorch"]

In [43]:
# Assign True to deep learning posts and False otherwise 
all_questions['is_deeplearning'] = all_questions.iloc[:, 2:].isin(dl_tags).any(axis=1)
all_questions.sample(5)

Unnamed: 0,Id,CreationDate,Tag1,Tag2,Tag3,Tag4,Tag5,is_deeplearning
3219,46830,2019-03-07 05:07:47,regression,grid-search,,,,False
2697,1075,2014-09-04 18:13:57,apache-hadoop,,,,,False
3079,25771,2017-12-18 12:34:01,machine-learning,neural-network,convnet,,,True
7500,6398,2015-07-08 20:57:09,python,pandas,linear-regression,,,False
10072,8697,2015-11-03 06:26:49,machine-learning,data-mining,,,,False


Trends In Deep Learning Posts Over Time
We currently have data spanning from 2014 to date (about 8 years). Tracking trends monthly would give too many data points to consider, while tracking yearly would give too few. We will use the more appropriate alternative: tracking trends quarterly.

Note:

At the time of this writing (February, 2023), we do not have all the data for the first quarter of 2023.
For consistency, we will remove all records for Q1, 2023 from our dataframe.



In [44]:
# Extract the quarter from the creation date column
all_questions['quarter'] = pd.PeriodIndex(all_questions.CreationDate, freq='Q')
all_questions['quarter'] = all_questions['quarter'].astype(str)

# Remove records for Q1, 2023
all_questions = all_questions.query('quarter != "2023Q1"')
all_questions.quarter.unique()

array(['2019Q1', '2014Q2', '2018Q3', '2019Q3', '2017Q4', '2016Q4',
       '2014Q3', '2017Q1', '2014Q4', '2018Q1', '2015Q1', '2018Q4',
       '2015Q2', '2019Q2', '2017Q2', '2015Q3', '2015Q4', '2019Q4',
       '2018Q2', '2017Q3', '2016Q1', '2016Q2', '2016Q3', '2020Q1'],
      dtype=object)

In [45]:
all_questions.sample(5)

Unnamed: 0,Id,CreationDate,Tag1,Tag2,Tag3,Tag4,Tag5,is_deeplearning,quarter
14887,22796,2017-09-03 15:04:51,machine-learning,classification,deep-learning,image-classification,,True,2017Q3
4313,38868,2018-09-27 15:50:55,python,logistic-regression,accuracy,,,False,2018Q3
1509,15305,2016-11-23 09:46:50,xgboost,,,,,False,2016Q4
17940,64048,2019-12-01 00:32:11,keras,tensorflow,,,,True,2019Q4
14727,44416,2019-01-23 07:37:38,machine-learning,keras,,,,True,2019Q1


With our dataframe organized in this format, we can group questions by quarter, then compute the proportion of deep learning questions relative to all questions posted on DSSE. From the resulting dataframe, we can build visuals that communicate deep learning trends over time.

In [46]:
# Count the number of deep learning and total questions by quarter
by_quarter = all_questions.groupby('quarter').agg({'is_deeplearning': ['sum', 'size']})

# Assign names to the resulting columns
by_quarter.columns = ['dl_questions', 'all_questions']

# Compute the proportion of deep learning questions per quarter
by_quarter['dl_proportion'] = (by_quarter['dl_questions']/by_quarter['all_questions'])
by_quarter.reset_index(inplace=True)
by_quarter.head()

Unnamed: 0,quarter,dl_questions,all_questions,dl_proportion
0,2014Q2,9.0,157,0.057325
1,2014Q3,13.0,189,0.068783
2,2014Q4,21.0,216,0.097222
3,2015Q1,18.0,190,0.094737
4,2015Q2,28.0,284,0.098592


In [59]:
import matplotlib.pyplot as plt
import seaborn as sns

# Identify start and end deep learning proportion
end_proportion = by_quarter.iloc[-1, 3]
start_proportion = by_quarter.iloc[0, 3]

# Convert the 'quarter' labels (e.g. '2014Q2') to the timestamp at the start of each quarter
by_quarter['quarter'] = pd.PeriodIndex(by_quarter['quarter'], freq='Q').to_timestamp()

# Base figure
fig = plt.figure(figsize=(7, 4))
quarters = by_quarter['quarter']
plt.plot(quarters, by_quarter['dl_proportion'], color='#0064AB', linewidth=3)
plt.scatter(x=[quarters.iloc[0], quarters.iloc[-1]], y=[start_proportion, end_proportion], color='#0064AB')

# Ticks and labels
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.yticks(np.arange(0, 0.6, 0.1), ['0%', '10%', '20%', '30%', '40%', '50%'], color='#AAA')
plt.xlabel('Year', color='#AAA', fontsize=10, weight='bold')
plt.ylabel('Deep learning percentage', color='#AAA', fontsize=10, weight='bold')

# Plot title and dividers
plt.title('Deep learning: just a fad?', color='#555', size=18, loc='left')
plt.axhline(y=0.3, alpha=0.1, linewidth=1)
plt.axvline(x=pd.Timestamp('2016-01-01'), linewidth=1, color='#BABABA')

# Annotations
plt.text(x=pd.Timestamp('2015-07-01'), y=0.54, color='#888', size=12,
         s='Percentage of deep learning questions relative to the total questions on DSSE')
plt.text(x=pd.Timestamp('2016-04-01'), y=0.18, s='Since 2016, Deep learning\nhas accounted for over 30% '
         +'of\nthe questions posted on\nData Science Stack Exchange.', size=10,
        color='#0064AB', weight='bold')
for loc, prop in zip([quarters.iloc[0], quarters.iloc[-1]], [start_proportion, end_proportion]):
    plt.text(x=loc, y=prop+0.025, s='{}%'.format(int(prop*100)),
             color='#0064AB', weight='bold', size=14)

# Declutter plot
ax = fig.get_axes()[0]
for loc in ['left', 'bottom', 'top', 'right']:
    ax.spines[loc].set_color('#DDDDDD')  # Updated color value
ax.tick_params(bottom=False, left=False)
sns.despine()

plt.show()








Notes:
The percentage of deep learning questions grew sharply between 2014 (5%) and 2016 (over 30%). Although this growth appears to have plateaued, the proportion has stayed above 30% from 2016 to date (about six years). This suggests that deep learning is not just a fad: it sparked strong initial interest and is still actively explored on DSSE.

Conclusion
Throughout this project, we collected, explored and analyzed data from the Data Science Stack Exchange (DSSE) Database. Our goal was to identify the most popular Data science topic and then use the insight to develop content that our audience will engage with and love.

Analysis showed machine learning, Python and deep learning as the most popular data science topics. However, due to the broad nature of the first two, we decided to focus on the relatively narrower alternative - deep learning.

Interestingly, deep learning has grown in popularity on DSSE, rising from just over 5% of total DSSE questions in 2014 to over 30% in 2016. Deep learning still accounts for over 30% of questions posted on DSSE to date, signifying that the topic is not just a fad. Instead, it is a field of data science that people continue to engage with and explore over the long term.



Recommendations and Limitations
Based on our discovery, we advise tailoring our resources to address deep learning content since it promises a potential for audience interaction and engagement in the data science space.
Although our research has allowed us to conclude that deep learning is the most popular topic, our insight is from a single source. We could explore data from other data science sites to gain more confidence in our findings.
Considering non-data-science content to write about can also give us a potential for diversification since it enables us to appeal to a larger audience in the long run.
