Analyze the relative popularity of programming languages over time based on Stack Overflow data.

AnonymouNew/Rise-and-Fall-of-Programming-Languages

Rise-and-Fall-of-Programming-Languages

1. Data on tags over time

How can we tell what programming languages and technologies are used by the most people? How about what languages are growing and which are shrinking, so that we can tell which are most worth investing time in?

One excellent source of data is Stack Overflow, a programming question and answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, we can get an approximate sense of how many people are using it. We're going to use open data from the Stack Exchange Data Explorer to examine how the relative popularity of languages like R, Python, Java, and JavaScript has changed over time.

Each Stack Overflow question is labeled with tags that describe its topic or technology. For instance, there's a tag for languages like R or Python, and for packages like ggplot2 or pandas.

Stack Overflow tags

We'll be working with a dataset with one observation for each tag in each year. The dataset includes both the number of questions asked in that tag in that year, and the total number of questions asked in that year.

# Load libraries
library(readr)
library(dplyr)

# Load dataset
by_tag_year <- read_csv("datasets/by_tag_year.csv")

# Inspect the dataset
by_tag_year
Parsed with column specification:
cols(
  year = col_double(),
  tag = col_character(),
  number = col_double(),
  year_total = col_double()
)
A spec_tbl_df: 40518 x 4
year  tag                  number  year_total
<dbl> <chr>                 <dbl>       <dbl>
2008  .htaccess                54       58390
2008  .net                   5910       58390
2008  .net-2.0                289       58390
2008  .net-3.5                319       58390
2008  .net-4.0                  6       58390
2008  .net-assembly             3       58390
2008  .net-core                 1       58390
2008  2d                       42       58390
2008  32-bit                   19       58390
2008  32bit-64bit               4       58390
2008  3d                       73       58390
2008  64bit                   149       58390
2008  abap                     10       58390
2008  absolute                  1       58390
2008  abstract                  5       58390
2008  abstract-class           27       58390
2008  abstract-syntax-tree      6       58390
2008  accelerometer             3       58390
2008  access                    1       58390
2008  access-control           12       58390
2008  accessibility            26       58390
2008  access-vba               50       58390
2008  access-violation          4       58390
2008  accordion                 9       58390
2008  acl                      11       58390
2008  acrobat                  10       58390
2008  action                   10       58390
2008  actionlistener            4       58390
2008  actionmailer              3       58390
2008  actionscript            136       58390
...   ...                     ...         ...
2018  yaml                    648     1085170
2018  yarn                    357     1085170
2018  yeoman                   36     1085170
2018  yesod                    41     1085170
2018  yield                    69     1085170
2018  yii                     269     1085170
2018  yii2                   1181     1085170
2018  yii2-advanced-app       209     1085170
2018  yocto                   288     1085170
2018  youtube                 676     1085170
2018  youtube-api             473     1085170
2018  youtube-api-v3          223     1085170
2018  youtube-data-api        203     1085170
2018  yui                       5     1085170
2018  yum                      98     1085170
2018  z3                      124     1085170
2018  zend-db                  11     1085170
2018  zend-form                13     1085170
2018  zend-framework          188     1085170
2018  zend-framework2         108     1085170
2018  zeromq                  168     1085170
2018  z-index                 107     1085170
2018  zip                     410     1085170
2018  zipfile                 115     1085170
2018  zk                       35     1085170
2018  zlib                     89     1085170
2018  zoom                    196     1085170
2018  zsh                     175     1085170
2018  zurb-foundation         182     1085170
2018  zxing                    95     1085170
# These packages need to be loaded in the first `@tests` cell. 
library(testthat) 
library(IRkernel.testthat)

# Then follow one or more tests of the student's code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.
run_tests({
    test_that("the answer is correct", {
    expect_true("readr" %in% .packages(), info = "Did you load the readr package?")
    expect_true("dplyr" %in% .packages(), info = "Did you load the dplyr package?")
    expect_is(by_tag_year, "tbl_df", 
        info = "Did you read in by_tag_year with read_csv (not read.csv)?")
    expect_equal(nrow(by_tag_year), 40518, 
        info = "Did you read in by_tag_year with read_csv?")
    })
})
1/1 tests passed

2. Now in fraction format

This data has one observation for each pair of a tag and a year, showing the number of questions asked in that tag in that year and the total number of questions asked in that year. For instance, there were 54 questions asked about the .htaccess tag in 2008, out of a total of 58390 questions in that year.

Rather than just the counts, we're probably more interested in a proportion: the fraction of that year's questions that have that tag. So let's add that to the table.
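As a quick check of the arithmetic, the sketch below recomputes the fraction for two rows copied from the table printed above (`.htaccess` and `.net` in 2008). The variable names are illustrative only and are not part of the original analysis.

```r
# Question counts copied from the 2008 rows of the printed table
number <- c(".htaccess" = 54, ".net" = 5910)

# Total number of questions asked in 2008 (same for every 2008 row)
year_total <- 58390

# Fraction of that year's questions carrying each tag
fraction <- number / year_total

# The same values as percentages, rounded to two decimals:
# .htaccess is roughly 0.09% of 2008 questions, .net roughly 10.12%
percent <- round(100 * fraction, 2)
print(percent)
```

Dividing by `year_total` is what makes years comparable: the site had far fewer questions in 2008 than in later years, so raw counts alone would mostly reflect the growth of Stack Overflow itself.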

# Add fraction column
by_tag_year_fraction <- mutate(by_tag_year, fraction = number / year_total)

# Print the new table
by_tag_year_fraction
A spec_tbl_df: 40518 x 5
year  tag                  number  year_total      fraction
<dbl> <chr>                 <dbl>       <dbl>         <dbl>
2008  .htaccess                54       58390  9.248159e-04
2008  .net                   5910       58390  1.012160e-01
2008  .net-2.0                289       58390  4.949478e-03
2008  .net-3.5                319       58390  5.463264e-03
2008  .net-4.0                  6       58390  1.027573e-04
2008  .net-assembly             3       58390  5.137866e-05
2008  .net-core                 1       58390  1.712622e-05
2008  2d                       42       58390  7.193013e-04
2008  32-bit                   19       58390  3.253982e-04
2008  32bit-64bit               4       58390  6.850488e-05
2008  3d                       73       58390  1.250214e-03
2008  64bit                   149       58390  2.551807e-03
2008  abap                     10       58390  1.712622e-04
2008  absolute                  1       58390  1.712622e-05
2008  abstract                  5       58390  8.563110e-05
2008  abstract-class           27       58390  4.624079e-04
2008  abstract-syntax-tree      6       58390  1.027573e-04
2008  accelerometer             3       58390  5.137866e-05
2008  access                    1       58390  1.712622e-05
2008  access-control           12       58390  2.055146e-04
2008  accessibility            26       58390  4.452817e-04
2008  access-vba               50       58390  8.563110e-04
2008  access-violation          4       58390  6.850488e-05
2008  accordion                 9       58390  1.541360e-04
2008  acl                      11       58390  1.883884e-04
2008  acrobat                  10       58390  1.712622e-04
2008  action                   10       58390  1.712622e-04
2008  actionlistener            4       58390  6.850488e-05
2008  actionmailer              3       58390  5.137866e-05
2008  actionscript            136       58390  2.329166e-03
...   ...                     ...         ...           ...
2018  yaml                    648     1085170  5.971415e-04
2018  yarn                    357     1085170  3.289807e-04
2018  yeoman                   36     1085170  3.317453e-05
2018  yesod                    41     1085170  3.778210e-05
2018  yield                    69     1085170  6.358451e-05
2018  yii                     269     1085170  2.478874e-04
2018  yii2                   1181     1085170  1.088309e-03
2018  yii2-advanced-app       209     1085170  1.925966e-04
2018  yocto                   288     1085170  2.653962e-04
2018  youtube                 676     1085170  6.229439e-04
2018  youtube-api             473     1085170  4.358764e-04
2018  youtube-api-v3          223     1085170  2.054978e-04
2018  youtube-data-api        203     1085170  1.870675e-04
2018  yui                       5     1085170  4.607573e-06
2018  yum                      98     1085170  9.030843e-05
2018  z3                      124     1085170  1.142678e-04
2018  zend-db                  11     1085170  1.013666e-05
2018  zend-form                13     1085170  1.197969e-05
2018  zend-framework          188     1085170  1.732447e-04
2018  zend-framework2         108     1085170  9.952358e-05
2018  zeromq                  168     1085170  1.548145e-04
2018  z-index                 107     1085170  9.860206e-05
2018  zip                     410     1085170  3.778210e-04
2018  zipfile                 115     1085170  1.059742e-04
2018  zk                       35     1085170  3.225301e-05
2018  zlib                     89     1085170  8.201480e-05
2018  zoom                    196     1085170  1.806169e-04
2018  zsh                     175     1085170  1.612651e-04
2018  zurb-foundation         182     1085170  1.677157e-04
2018  zxing                    95     1085170  8.754389e-05
# One or more tests of the student's code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.
run_tests({
    test_that("the answer is correct", {
    expect_is(by_tag_year_fraction, "tbl_df", 
        info = "Did you create the by_tag_year_fraction object?")
    expect_true("fraction" %in% colnames(by_tag_year_fraction), 
        info = "Did you use mutate() to add a fraction column?")
    expect_equal(by_tag_year_fraction$fraction,
                 by_tag_year_fraction$number / by_tag_year_fraction$year_total,
        info = "Check how you computed the fraction column: is it the number divided by that year's total?")
    })
    # You can have more than one test
})
1/1 tests passed

3. Has R been growing or shrinking?

So far we've been learning and using the R programming language. Wouldn't we like to be sure it's a good investment for the future? Has it been keeping pace with other languages, or have people been switching out of it?

Let's look at whether the fraction of Stack Overflow questions that are about R has been increasing or decreasing over time.

# Filter for R tags
r_over_time <- filter(by_tag_year_fraction, tag=='r')

# Print the new table
r_over_time
A spec_tbl_df: 11 x 5
year  tag  number  year_total      fraction
<dbl> <chr> <dbl>       <dbl>         <dbl>
2008  r         8       58390  0.0001370098
2009  r       524      343868  0.0015238405
2010  r      2270      694391  0.0032690516
2011  r      5845     1200551  0.0048685978
2012  r     12221     1645404  0.0074273552
2013  r     22329     2060473  0.0108368321
2014  r     31011     2164701  0.0143257660
2015  r     40844     2219527  0.0184021190
2016  r     44611     2226072  0.0200402323
2017  r     54415     2305207  0.0236052554
2018  r     28938     1085170  0.0266667895
# One or more tests of the student's code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.
run_tests({
    test_that("the answer is correct", {
    expect_is(r_over_time, "tbl_df",
        info = "Did you create an r_over_time object with filter()?")
    expect_equal(nrow(r_over_time), 11,
        info = "Did you filter just for the rows with the 'r' tag?")
    expect_true(all(r_over_time$tag == "r"),
        info = "Did you filter just for the rows with the 'r' tag?")
    })
    # You can have more than one test
})
1/1 tests passed

4. Visualizing change over time

Rather than looking at the results in a table, we often want to create a visualization. Change over time is usually visualized with a line plot.

# Load ggplot2
library(ggplot2)

# Create a line plot of fraction over time
ggplot(r_over_time, aes(x=year, y=fraction)) + geom_line()

[Plot: line chart of the fraction of questions tagged r, by year]
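The fractions on the y-axis are small decimals, which can be hard to read at a glance. One optional refinement, not part of the original analysis, is to label the axis in percent. The sketch below assumes ggplot2 (and the scales package it depends on) is installed, and rebuilds `r_over_time` from the `year` and `fraction` values printed in the table earlier so that it runs on its own.

```r
library(ggplot2)

# year and fraction copied from the r_over_time table printed earlier
r_over_time <- data.frame(
  year = 2008:2018,
  fraction = c(0.0001370098, 0.0015238405, 0.0032690516, 0.0048685978,
               0.0074273552, 0.0108368321, 0.0143257660, 0.0184021190,
               0.0200402323, 0.0236052554, 0.0266667895)
)

# Same line plot, with y-axis labels shown as percentages
p <- ggplot(r_over_time, aes(x = year, y = fraction)) +
  geom_line() +
  scale_y_continuous(labels = scales::label_percent(accuracy = 0.1))

print(p)
```

Only the axis labels change; the underlying data and the shape of the line stay the same.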

# One or more tests of the student's code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.

get_aesthetics <- function(p) {
    unlist(c(list(p$mapping), purrr::map(p$layers, "mapping")))
}

run_tests({
    test_that("the answer is correct", {
        expect_true("ggplot2" %in% .packages(), info = "Did you load the ggplot2 package?")
        # expect_true("scales" %in% .packages(), info = "Did you load the scales package?")

        p <- last_plot()
        expect_is(p, "ggplot", info = "Did you create a ggplot figure?")
        expect_equal(length(p$layers), 1, info = "Did you create a plot with geom_line()?")
        expect_is(p$layers[[1]]$geom, "GeomLine", info = "Did you create a plot with geom_line()?")

        aesthetics <- get_aesthetics(p)
        expect_equal(rlang::quo_name(aesthetics$x), "year",
                     info = "Did you put year on the x-axis?")
        expect_equal(rlang::quo_name(aesthetics$y), "fraction",
                     info = "Did you put fraction on the y-axis?")
        
        # expect_equal(length(p$scales$scales), 1, info = "Did you add scale_y_continuous?")    
        # expect_equal(p$scales$scales[[1]]$labels(.03), "3.00%", info = "Did you make the y-axis a percentage?")
    })
})
1/1 tests passed

5. How about dplyr and ggplot2?

Based on that graph, it looks like R has been growing pretty fast in the last decade. Good thing we're practicing it now!

Besides R, two other interesting tags are dplyr and ggplot2, which we've already used in this analysis. They both also have Stack Overflow tags!

Instead of just looking at R, let's look at all three tags and their change over time. Is each of those tags increasing as a fraction of overall questions? Are any of them decreasing?

# A vector of selected tags
selected_tags <- c("r", "dplyr", "ggplot2")

# Filter for those tags
selected_tags_over_time <- filter(by_tag_year_fraction, tag %in% selected_tags)

# Plot tags over time on a line plot using color to represent tag
ggplot(selected_tags_over_time, aes(x=year, y=fraction, color=tag)) + geom_line()

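[Plot: line chart of the fraction of questions by year for the r, dplyr, and ggplot2 tags, one colored line per tag]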

# One or more tests of the student's code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.

get_aesthetics <- function(p) {
    unlist(c(list(p$mapping), purrr::map(p$layers, "mapping")))
}

run_tests({
    test_that("the answer is correct", {
        expect_true("ggplot2" %in% .packages(), info = "Did you load the ggplot2 package?")
        
        expect_is(selected_tags_over_time, "tbl_df",
                 info = "Did you create a selected_tags_over_time data frame?")

        expect_equal(nrow(selected_tags_over_time), 28,
                 info = "Did you filter for r, dplyr, and ggplot2 and save it to selected_tags_over_time?")

        expect_equal(sort(unique(selected_tags_over_time$tag)), c("dplyr", "ggplot2", "r"),
                 info = "Did you filter for r, dplyr, and ggplot2 and save it to selected_tags_over_time?")

        p <- last_plot()
        aesthetics <- get_aesthetics(p)
        expect_is(p, "ggplot", info = "Did you create a ggplot figure?")
        expect_equal(p$data, selected_tags_over_time, info = "Did you create your plot out of selected_tags_over_time?")
        
        expect_equal(length(p$layers), 1, info = "Did you create a plot with geom_line()?")
        expect_is(p$layers[[1]]$geom, "GeomLine", info = "Did you create a plot with geom_line()?")

        expect_true(!is.null(aesthetics$x), info = "Did you put year on the x-axis?")
        expect_equal(rlang::quo_name(aesthetics$x), "year",
                     info = "Did you put year on the x-axis?")

        expect_true(!is.null(aesthetics$y), info = "Did you put fraction on the y-axis?")
        expect_equal(rlang::quo_name(aesthetics$y), "fraction",
                     info = "Did you put fraction on the y-axis?")

        expect_true(!is.null(aesthetics$colour), info = "Did you map the tag to the color?")
        expect_equal(rlang::quo_name(aesthetics$colour), "tag",
                     info = "Did you map the tag to the color?")

        # expect_equal(length(p$scales$scales), 1, info = "Did you add scale_y_continuous?")    
        # expect_equal(p$scales$scales[[1]]$labels(.03), "3.00%", info = "Did you make the y-axis a percentage?")
    })
    # You can have more than one test
})
1/1 tests passed

6. What are the most asked-about tags?

It's sure been fun to visualize and compare tags over time. The dplyr and ggplot2 tags may not have as many questions as R, but we can tell they're both growing quickly as well.

We might like to know which tags have the most questions overall, not just within a particular year. Right now, we have several rows for every tag, but we'll be combining them into one. That means we want group_by() and summarize().
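To make the grouping step concrete, here is a minimal sketch on a small made-up table. The `toy` data and its tag names are hypothetical and not taken from the dataset. It assumes dplyr is installed, collapses several rows per tag into a single total, and sorts the result.

```r
library(dplyr)

# Hypothetical toy data: two tags, each observed in two years
toy <- tibble(
  year   = c(2008, 2009, 2008, 2009),
  tag    = c("tag_a", "tag_a", "tag_b", "tag_b"),
  number = c(10, 30, 5, 50)
)

toy_totals <- toy %>%
  group_by(tag) %>%                     # one group per tag
  summarize(tag_total = sum(number)) %>% # one row per group
  arrange(desc(tag_total))              # largest total first

print(toy_totals)
```

Here `tag_b` ends up first with a total of 55 and `tag_a` second with 40: the four input rows have become one row per tag.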

Let's look at tags that have the most questions in history.

# Find total number of questions for each tag
sorted_tags <- by_tag_year %>%
  group_by(tag) %>%
  summarize(tag_total = sum(number)) %>%
  arrange(desc(tag_total))

# Print the new table
sorted_tags
A tibble: 4080 x 2
tag                 tag_total
<chr>                   <dbl>
javascript            1632049
java                  1425961
c#                    1217450
php                   1204291
android               1110261
python                 970768
jquery                 915159
html                   755341
c++                    574263
ios                    566075
css                    539818
mysql                  522287
sql                    445419
asp.net                334479
ruby-on-rails          293432
objective-c            284451
c                      279915
.net                   269578
arrays                 266578
angularjs              252951
r                      243016
json                   236552
sql-server             234713
node.js                229843
iphone                 219161
swift                  196253
ruby                   195860
regex                  190061
ajax                   188184
xml                    173524
...                       ...
impala                   1011
box-api                  1010
drawrect                 1010
expo                     1010
package.json             1010
credit-card              1009
data-conversion          1009
omnet++                  1009
c-strings                1008
google-docs-api          1008
publishing               1008
jogl                     1007
node-red                 1007
postgresql-9.4           1007
uinavigationitem         1007
playframework-2.1        1006
cakephp-2.1              1005
device-driver            1005
jasperserver             1004
webdeploy                1004
cat                      1003
date-formatting          1003
java-2d                  1003
lattice                  1003
directory-structure      1002
relation                 1002
doctype                  1001
rvest                    1001
tableviewcell            1000
yahoo                    1000
# One or more tests of the student's code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.
run_tests({
    test_that("the answer is correct", {
        expect_is(sorted_tags, "tbl_df",
                 info = "Did you create a sorted_tags data frame?")

        expect_equal(colnames(sorted_tags), c("tag", "tag_total"),
                 info = "Did you group by tag and summarize to create a tag_total column?")

        expect_equal(nrow(sorted_tags), length(unique(by_tag_year$tag)),
                 info = "Did you group by tag and summarize to create a tag_total column?")

        expect_equal(sorted_tags$tag_total,
                     sort(sorted_tags$tag_total, decreasing = TRUE),
                     info = "Did you arrange in descending order of tag_total?")
    })
})
1/1 tests passed

7. How have large programming languages changed over time?

We've looked at selected tags like R, ggplot2, and dplyr, and seen that they're each growing. What tags might be shrinking? A good place to start is to plot the tags we just saw to be the most asked-about of all time, including JavaScript, Java, and C#.

# Get the six largest tags
highest_tags <- head(sorted_tags$tag)

# Filter for the six largest tags
by_tag_subset <- by_tag_year_fraction %>% filter(tag %in% highest_tags)

# Plot tags over time on a line plot using color to represent tag
ggplot(by_tag_subset, aes(x = year, y = fraction, color = tag)) + geom_line()

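[Plot: line chart of the fraction of questions by year for the six largest tags, one colored line per tag]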
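When several tags are selected at once, the choice between `==` and `%in%` matters. Comparing a column to a vector with `==` recycles the vector element by element, so a row matches only if it happens to line up with the vector's order, whereas `%in%` tests membership. The toy data below is hypothetical and only illustrates the difference in base R; it is a sketch, not the original analysis.

```r
# Hypothetical toy data: three tags observed in two years
toy <- data.frame(
  year = c(2008, 2008, 2008, 2009, 2009, 2009),
  tag  = c("android", "ios", "r", "android", "ios", "r")
)
wanted <- c("ios", "android")

# `==` recycles `wanted` down the column:
# row 1 is compared with "ios", row 2 with "android", row 3 with "ios", ...
eq_match <- toy$tag == wanted

# `%in%` checks, row by row, whether the tag appears anywhere in `wanted`
in_match <- toy$tag %in% wanted

sum(eq_match)  # 2 rows kept: the 2008 rows for both tags are silently lost
sum(in_match)  # 4 rows kept: every android and ios row

# Keep every row whose tag is one of the wanted tags
print(toy[in_match, ])
```

Because the column length here is a multiple of the vector length, R raises no recycling warning, so the dropped rows are easy to miss. The same recycling happens inside `filter()`, since the comparison is evaluated before the rows are selected.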

# One or more tests of the student's code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.
get_aesthetics <- function(p) {
    unlist(c(list(p$mapping), purrr::map(p$layers, "mapping")))
}

run_tests({
    test_that("the answer is correct", {
        expect_equal(sort(unique(by_tag_subset$tag)), sort(head(sorted_tags$tag, 6)),
                   info = "Did you filter by_tag_year_fraction for only the 6 most asked-about tags, and save it as by_tag_subset?")

        expect_equal(colnames(by_tag_subset), colnames(by_tag_year_fraction),
                   info = "Did you filter by_tag_year_fraction for only the 6 most asked-about tags, and save it as by_tag_subset?")

        p <- last_plot()
        expect_is(p, "ggplot", info = "Did you create a ggplot figure?")
        expect_equal(p$data, by_tag_subset, info = "Did you create your plot out of by_tag_subset?")
        
        expect_equal(length(p$layers), 1, info = "Did you create a plot with geom_line()?")
        expect_is(p$layers[[1]]$geom, "GeomLine", info = "Did you create a plot with geom_line()?")

        aesthetics <- get_aesthetics(p)
        expect_equal(rlang::quo_name(aesthetics$x), "year",
                     info = "Did you put year on the x-axis?")
        expect_equal(rlang::quo_name(aesthetics$y), "fraction",
                     info = "Did you put fraction on the y-axis?")
        expect_equal(rlang::quo_name(aesthetics$colour), "tag",
                     info = "Did you map the tag to the color?")

        # expect_equal(length(p$scales$scales), 1, info = "Did you add scale_y_continuous?")    
        # expect_equal(p$scales$scales[[1]]$labels(.03), "3.00%", info = "Did you make the y-axis a percentage?")
    })
})
1/1 tests passed

8. Some more tags!

Wow, based on that graph we've seen a lot of changes in what programming languages are most asked about. C# gets fewer questions than it used to, and Python has grown quite impressively.

This Stack Overflow data is incredibly versatile. We can analyze any programming language, web framework, or tool whose change over time we'd like to see. Combined with the reproducibility of R and its libraries, we have ourselves a powerful method of uncovering insights about technology.

To demonstrate its versatility, let's check out how three big mobile operating systems (Android, iOS, and Windows Phone) have compared in popularity over time. But remember: this code can be modified simply by changing the tag names!

# Get tags of interest
my_tags <- c("android", "ios", "windows-phone")

# Filter for those tags
by_tag_subset <- by_tag_year_fraction %>% filter(tag %in% my_tags)

# Plot tags over time on a line plot using color to represent tag
ggplot(by_tag_subset, aes(x = year, y = fraction, color = tag)) + geom_line()

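[Plot: line chart of the fraction of questions by year for the android, ios, and windows-phone tags, one colored line per tag]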

# One or more tests of the student's code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.
get_aesthetics <- function(p) {
    unlist(c(list(p$mapping), purrr::map(p$layers, "mapping")))
}

run_tests({
    test_that("the answer is correct", {
        expect_equal(sort(my_tags), c("android", "ios", "windows-phone"),
                    info = "Did you create a vector my_tags of just android, ios, and windows-phone?")
        
        expect_equal(sort(unique(by_tag_subset$tag)), c("android", "ios", "windows-phone"),
                   info = "Did you filter by_tag_year_fraction for only ios, android, and windows-phone?")

        expect_equal(colnames(by_tag_subset), colnames(by_tag_year_fraction),
                   info = "Did you filter by_tag_year_fraction for only the three requested tags, and save it as by_tag_subset?")

        p <- last_plot()
        expect_is(p, "ggplot", info = "Did you create a ggplot figure?")
        expect_equal(p$data, by_tag_subset, info = "Did you create your plot out of by_tag_subset?")
        
        expect_equal(length(p$layers), 1, info = "Did you create a plot with geom_line()?")
        expect_is(p$layers[[1]]$geom, "GeomLine", info = "Did you create a plot with geom_line()?")

        aesthetics <- get_aesthetics(p)
        expect_equal(rlang::quo_name(aesthetics$x), "year",
                     info = "Did you put year on the x-axis?")
        expect_equal(rlang::quo_name(aesthetics$y), "fraction",
                     info = "Did you put fraction on the y-axis?")
        expect_equal(rlang::quo_name(aesthetics$colour), "tag",
                     info = "Did you map the tag to the color?")

        # expect_equal(length(p$scales$scales), 1, info = "Did you add scale_y_continuous?")    
        # expect_equal(p$scales$scales[[1]]$labels(.03), "3.00%", info = "Did you make the y-axis a percentage?")
    })
})
1/1 tests passed
