## Stack Overflow Tag Analysis

The Stack Overflow dataset is a seminal resource for tech market researchers because it is open, comprehensive, and behavioral in nature. It was first made available for public consumption in 2010, and encompasses not just question and answer posts from the developer community but also the technology tags associated with each post. As a result, these tags act as a good market proxy for user adoption, interest, or disinterest from a behavioral perspective: the more posts carry a certain tag, the more interest there is in that technology. The time-series nature of this dataset also makes it possible to gauge the waxing and waning of a tag, and to track changing affiliations among tags. Finally, metadata about users, badge contests, and an annual developer survey provide additional context to support actionable business decisions with measurable outcomes.

Thus, for my capstone project I would like to analyze technology tags in the Stack Overflow dataset in order to better understand technology adoption curves and affiliations among technologies across different cohorts of developers. My team and I at Microsoft performed a similar analysis in 2016, where we found that OSS developers gravitated towards AWS and Google, as there were very few OSS offerings on Azure. We also realized that Microsoft had not reached out to OSS developers proactively, even with the limited offers. This 'wake up call' generated massive investment in the development and marketing of OSS technologies on the Azure cloud platform.

Five years later, Microsoft would like to know if their efforts have paid off. They have seen fantastic growth in their OSS services and technologies, but are they keeping up with the market? Have they been able to bring new developers onto their platform, or is this adoption coming from existing developers who have become 'polyglots'? And have the 'islands of technologies' that used to center around just AWS and Google begun to diffuse and include Azure? Or is Azure still an island unto itself?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

#preprocessing
from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler
from scipy import stats

# pipelines
from sklearn.pipeline import Pipeline

# Regression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# NLP transformers
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# classifiers you can use
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# model selection bits
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score, ShuffleSplit, RandomizedSearchCV
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold, KFold
from sklearn.model_selection import learning_curve, validation_curve

# evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, fbeta_score

In [2]:
posts = pd.read_csv("/Users/bens_mac/Documents/CodingNomads/SO_posts_BIG.csv")
users = pd.read_csv("/Users/bens_mac/Documents/CodingNomads/SO_users_BIG.csv")

In [3]:
posts.head()

Unnamed: 0,id,creation_date,last_activity_date,owner_user_id,post_type_id,tags,view_count
0,66138537,2021-02-10 14:11:57.947 UTC,2021-02-10 14:11:57.947 UTC,8384006.0,1,dataexplorer,2
1,66229417,2021-02-16 17:40:44.097 UTC,2021-02-16 17:40:44.097 UTC,12549160.0,1,rstudio-server,2
2,66288134,2021-02-20 04:49:09.76 UTC,2021-02-20 04:49:09.76 UTC,15246800.0,1,routes,2
3,66293452,2021-02-20 15:43:15.133 UTC,2021-02-20 15:43:15.133 UTC,7822211.0,1,angular-dynamic-components,2
4,66361333,2021-02-25 01:56:26.023 UTC,2021-02-25 01:56:26.023 UTC,2713214.0,1,amazon-eks|elastic-network-interface,2


In [4]:
users.head()

Unnamed: 0,id,display_name,age,creation_date,last_access_date,location,reputation
0,14712167,nicom,,2020-11-26 09:26:50.507 UTC,2021-02-27 13:20:10.693 UTC,,19
1,14717603,Donkey,,2020-11-27 05:56:10.18 UTC,2021-02-18 10:38:39.56 UTC,,43
2,14785218,redshorts17,,2020-12-08 07:42:10.147 UTC,2021-01-08 13:08:27.023 UTC,,89
3,14808842,PerekatovSergey,,2020-12-11 16:09:17.147 UTC,2021-02-26 14:26:46.983 UTC,Russia,26
4,14916620,leung2,,2020-12-31 03:02:17.93 UTC,2021-02-14 11:27:27.557 UTC,,33


In [5]:
posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10445772 entries, 0 to 10445771
Data columns (total 7 columns):
 #   Column              Dtype  
---  ------              -----  
 0   id                  int64  
 1   creation_date       object 
 2   last_activity_date  object 
 3   owner_user_id       float64
 4   post_type_id        int64  
 5   tags                object 
 6   view_count          int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 557.9+ MB


In [6]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14080580 entries, 0 to 14080579
Data columns (total 7 columns):
 #   Column            Dtype  
---  ------            -----  
 0   id                int64  
 1   display_name      object 
 2   age               float64
 3   creation_date     object 
 4   last_access_date  object 
 5   location          object 
 6   reputation        int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 752.0+ MB


In [7]:
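# Preview only: dropna() returns a new frame and the result is not assigned, so posts itself is unchanged
posts.dropna()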

Unnamed: 0,id,creation_date,last_activity_date,owner_user_id,post_type_id,tags,view_count
0,66138537,2021-02-10 14:11:57.947 UTC,2021-02-10 14:11:57.947 UTC,8384006.0,1,dataexplorer,2
1,66229417,2021-02-16 17:40:44.097 UTC,2021-02-16 17:40:44.097 UTC,12549160.0,1,rstudio-server,2
2,66288134,2021-02-20 04:49:09.76 UTC,2021-02-20 04:49:09.76 UTC,15246800.0,1,routes,2
3,66293452,2021-02-20 15:43:15.133 UTC,2021-02-20 15:43:15.133 UTC,7822211.0,1,angular-dynamic-components,2
4,66361333,2021-02-25 01:56:26.023 UTC,2021-02-25 01:56:26.023 UTC,2713214.0,1,amazon-eks|elastic-network-interface,2
...,...,...,...,...,...,...,...
10445766,42686301,2017-03-09 03:48:06.533 UTC,2017-03-10 11:30:01.587 UTC,2410131.0,1,mysql,95
10445767,42670128,2017-03-08 11:29:40.897 UTC,2017-03-08 20:26:48.12 UTC,1958365.0,1,c#|sql-server|asp.net-mvc|entity-framework,1452
10445769,42905797,2017-03-20 14:09:22.073 UTC,2017-03-20 17:11:44.09 UTC,7709628.0,1,php|mysql|select|pdo,80
10445770,43037016,2017-03-27 02:25:18.11 UTC,2017-03-27 16:14:33.847 UTC,7771458.0,1,php|html|mysql|xampp,724


In [8]:
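# Split the pipe-delimited tag string into a list of tags
posts['tags'] = posts['tags'].str.split("|")
# Keep only the four-digit year from each timestamp string
posts['creation_date'] = [x[:4] for x in posts['creation_date']]
posts['last_activity_date'] = [x[:4] for x in posts['last_activity_date']]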

In [9]:
users['creation_date'] = [x[:4] for x in users['creation_date']]
users['last_access_date'] = [x[:4] for x in users['last_access_date']]

In [10]:
posts['creation_date'] = posts['creation_date'].astype(str).astype(float)
posts['last_activity_date'] = posts['last_activity_date'].astype(str).astype(float)
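
The year extraction above slices each timestamp string in a Python loop and then casts to float. As a hedged alternative, the same idea can be sketched with vectorized pandas calls that also tolerate missing timestamps. The helper name `extract_year` and the toy `demo` series are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def extract_year(timestamps: pd.Series) -> pd.Series:
    """Return the four-digit year from strings such as '2021-02-10 14:11:57.947 UTC'.

    Hypothetical helper: missing or malformed values become <NA> instead of raising.
    """
    # .str[:4] slices every string at once; to_numeric turns bad values into NaN
    years = pd.to_numeric(timestamps.str[:4], errors="coerce")
    # Nullable integer dtype keeps years as whole numbers while allowing gaps
    return years.astype("Int64")

# Toy data in the same format as the creation_date column
demo = pd.Series(["2021-02-10 14:11:57.947 UTC", "2017-03-09 03:48:06.533 UTC", None])
demo_years = extract_year(demo)
print(demo_years.tolist())
```

Storing years as nullable integers rather than floats avoids values such as `2021.0` appearing in later group-by keys and plots.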

In [11]:
posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10445772 entries, 0 to 10445771
Data columns (total 7 columns):
 #   Column              Dtype  
---  ------              -----  
 0   id                  int64  
 1   creation_date       float64
 2   last_activity_date  float64
 3   owner_user_id       float64
 4   post_type_id        int64  
 5   tags                object 
 6   view_count          int64  
dtypes: float64(3), int64(3), object(1)
memory usage: 557.9+ MB


In [12]:
posts_proto = posts.sample(n=10000)

In [13]:
posts_proto.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 7480084 to 8041188
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  10000 non-null  int64  
 1   creation_date       10000 non-null  float64
 2   last_activity_date  10000 non-null  float64
 3   owner_user_id       9819 non-null   float64
 4   post_type_id        10000 non-null  int64  
 5   tags                10000 non-null  object 
 6   view_count          10000 non-null  int64  
dtypes: float64(3), int64(3), object(1)
memory usage: 625.0+ KB


In [14]:
posts_proto = posts_proto.dropna()

In [15]:
all_tags_proto = posts_proto['tags']

In [16]:
all_tags_proto[0:15]

7480084                    [mysql, sql, database]
2324655                        [c#, .net, c#-4.0]
4023297           [java, eclipse, eclipse-plugin]
3504880    [swift, realm, rx-swift, objectmapper]
4601983                 [java, lambda, predicate]
7623335                        [css, font-family]
6607400          [android, ios, api, credit-card]
1088298                 [javascript, removeclass]
4427809        [caching, hadoop, mapreduce, hdfs]
6273858           [javascript, jquery, html, css]
7064275                      [php, mysql, select]
1594825                [php, apache, mod-rewrite]
1592968              [javascript, events, iframe]
5243855              [sql-server, tsql, nullable]
9009709     [iphone, ios, xcode, android-mapview]
Name: tags, dtype: object
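
With a year column and a list of tags on every post, the waxing and waning of a tag described in the introduction can be approximated by counting tag occurrences per year. The following is a minimal sketch on toy data, assuming a frame shaped like `posts_proto`; the function name `tag_counts_by_year` and the `toy_posts` frame are hypothetical.

```python
import pandas as pd

# Toy stand-in for posts_proto: a year column and a list-of-tags column
toy_posts = pd.DataFrame({
    "creation_date": [2016.0, 2016.0, 2020.0, 2020.0, 2020.0],
    "tags": [
        ["python", "azure"],
        ["amazon-s3"],
        ["python"],
        ["azure", "postgresql"],
        ["python", "azure"],
    ],
})

def tag_counts_by_year(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a year-by-tag table of post counts (hypothetical helper)."""
    # explode() yields one row per (post, tag) pair
    exploded = frame.explode("tags")
    # Count rows per year and tag, then pivot tags into columns
    counts = exploded.groupby(["creation_date", "tags"]).size()
    return counts.unstack(fill_value=0)

yearly = tag_counts_by_year(toy_posts)
print(yearly)
```

On the full dataset, dividing each row by the number of posts in that year would give a share rather than a raw count, which is less sensitive to overall growth in site traffic.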

In [54]:
#all tags of postgres + ...
#all tags of mysql + ...
#all tags of GCP + ...

#tokenize list of tags and their combinations

postgres = []
element = "postgres"
for tag_list in all_tags_proto:
    for tag in tag_list:
        if element in tag:
            postgres.append(tag_list)
print(postgres)

[['python', 'django', 'postgresql', 'ubuntu'], ['python', 'database', 'postgresql', 'django-models', 'django-views'], ['c#', 'database', 'postgresql', 'automapper', 'ef-core-2.0'], ['sql', 'postgresql'], ['postgresql'], ['java', 'spring', 'hibernate', 'postgresql'], ['hibernate', 'postgresql', 'autocommit'], ['sql', 'postgresql'], ['sql', 'postgresql', 'insert'], ['postgresql-11'], ['ruby-on-rails', 'postgresql', 'amazon-web-services', 'deployment'], ['python', 'postgresql', 'sqlalchemy', 'flask-sqlalchemy', 'psycopg2'], ['sql', 'postgresql', 'sql-view', 'database-metadata'], ['sql', 'postgresql', 'aggregate', 'greatest-n-per-group', 'window-functions'], ['postgresql', 'elixir', 'phoenix-framework', 'ecto'], ['postgresql', 'indexing', 'date-range', 'b-tree-index', 'gist-index'], ['macos', 'postgresql', 'postgresapp'], ['macos', 'postgresql', 'postgresapp'], ['postgresql', 'amazon-web-services', 'amazon-rds', 'laravel-vapor'], ['javascript', 'jquery', 'html', 'postgresql'], ['ruby-on-ra

In [41]:
Amazon = []
element = "amazon"
for tag_list in all_tags_proto:
    for tag in tag_list:
        if element in tag:
            Amazon.append(tag_list)
len(Amazon)

125

In [39]:
Google = []
element = "google"
for tag_list in all_tags_proto:
    for tag in tag_list:
        if element in tag:
            Google.append(tag_list)
len(Google)

297

In [38]:
Android = []
element = "android"
for tag_list in all_tags_proto:
    for tag in tag_list:
        if element in tag:
            Android.append(tag_list)
len(Android)

973
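
One caveat about the loops above: a tag list is appended once for every tag that matches, so a post tagged with both `postgresql` and `postgresapp` is counted twice (the duplicated entry is visible in the postgres output). A possible variant that counts each post at most once is sketched below; `posts_mentioning` is a hypothetical helper, and counts produced with it may be lower than the figures shown above.

```python
def posts_mentioning(tag_lists, fragment):
    """Return every tag list in which at least one tag contains `fragment`.

    Hypothetical helper: any() stops at the first match, so each post
    appears at most once in the result.
    """
    return [tags for tags in tag_lists if any(fragment in tag for tag in tags)]

# Toy data mirroring the structure of all_tags_proto
sample = [
    ["macos", "postgresql", "postgresapp"],
    ["sql", "postgresql"],
    ["java", "lambda", "predicate"],
]
matches = posts_mentioning(sample, "postgres")
print(len(matches))
```

The same call could replace the four near-identical cells, for example `len(posts_mentioning(all_tags_proto, "amazon"))`.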

In [None]:
all_tags_proto.head(15)

In [None]:
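# Tags are stored as lists, which are unhashable, so explode to one tag per row before counting
top_tags = posts_proto['tags'].explode().value_counts().head(50)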

In [None]:
print(top_tags)

In [None]:
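from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Each document is already a list of tags, so use the list itself as the token stream
# instead of the default text analyzer, which expects raw strings
vectorizer = TfidfVectorizer(analyzer=lambda tags: tags)
X = vectorizer.fit_transform(all_tags_proto)

true_k = 5
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

print("Top terms per cluster:")
# Sort each centroid's weights in descending order to find its most characteristic tags
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()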
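
Clustering is one way to look for 'islands of technologies'. A simpler complementary view, sketched here under the assumption that affiliation can be approximated by how often two tags appear on the same post, is to count tag pairs directly. The helper `tag_pair_counts` and the toy data are hypothetical and not part of the original analysis.

```python
from collections import Counter
from itertools import combinations

def tag_pair_counts(tag_lists):
    """Count how often each unordered pair of tags appears on the same post."""
    pairs = Counter()
    for tags in tag_lists:
        # Sort and de-duplicate so ('a', 'b') and ('b', 'a') share one key
        unique_tags = sorted(set(tags))
        pairs.update(combinations(unique_tags, 2))
    return pairs

# Toy data mirroring the structure of all_tags_proto
toy = [
    ["python", "postgresql", "django"],
    ["sql", "postgresql"],
    ["python", "postgresql"],
]
pair_counts = tag_pair_counts(toy)
print(pair_counts.most_common(3))
```

Restricting the pairs to those that include a cloud-vendor tag would show which technologies sit closest to each platform.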