# Stack Overflow

## Introduction 

In the second part of this assignment, we will create and analyze time series of creation dates of Stack Overflow questions. This assignment is to be completed **INDIVIDUALLY** and it is due on **October 7 at 7pm**.

Let's create some time series from the data. You may choose to analyze either users or tags. To analyze users, take the top 100 users with the most question posts. For each user, your time series will be the number of questions posted by that user at some frequency. To analyze tags, take the top 100 most popular question tags. For each tag, your time series will be the number of questions with that tag at some frequency. You may choose to sample your data each week, each month, on a certain day of the week, or at certain hours in a day, depending on what trend you are hoping to find in the data. For example, if you choose to analyze tags and sample during different hours of the day, your hypothesis could be that languages used more in industry (e.g. JavaScript) will have more questions posted during work hours, whereas languages taught in academia (e.g. Python) will have more questions posted after midnight, when students are scrambling to finish their homework.

Compare the time series using one of the methods discussed in class. In a few paragraphs, write down what you were hoping to find in the data, what time series you created, and what method you chose and why. **(30 pts)**

You may find the [pandas.DataFrame.resample](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html) method helpful.
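As a rough illustration of how `resample` can turn creation dates into a time series, here is a minimal sketch. The toy dataframe is hypothetical and only stands in for the real question data; the column names `OwnerUserId` and `CreationDate` are assumed to match the CSV used in the cell below.

```python
import pandas as pd

# Hypothetical toy data standing in for the real question dataframe
toy = pd.DataFrame({
    "OwnerUserId": [1, 1, 2, 1, 2],
    "CreationDate": pd.to_datetime([
        "2015-01-03", "2015-01-20", "2015-01-25", "2015-03-02", "2015-03-15",
    ]),
})

# Select one user's questions and index them by creation date
user1 = toy[toy["OwnerUserId"] == 1].set_index("CreationDate")

# Count questions per calendar month ("MS" = month-start bins);
# months without any posts appear with a count of 0
monthly = user1["OwnerUserId"].resample("MS").count()
print(monthly.tolist())  # [2, 0, 1]
```

Note that this counts questions per calendar month along the timeline, whereas the solution below folds all years together into month-of-year counts. Either choice can be reasonable depending on the trend being tested.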

In [20]:
import pandas as pd
from pandas import Series, DataFrame
import scipy.spatial.distance
import matplotlib.pyplot as plt

# building question dataframe
df = pd.read_csv('question_dataframe.csv')

# dropping irrelevant columns
df.drop(['Id', 'Tags'], axis=1, inplace=True)

# getting the top 100 users by number of question posts
top = df['OwnerUserId'].value_counts().head(100)

# converting the series to a dataframe with one row per user:
# OwnerUserId holds the user id, QuestionCount the number of questions
top_df2 = top.rename_axis('OwnerUserId').reset_index(name='QuestionCount')

# merging the top 100 users with the original question dataframe on OwnerUserId
# to keep only their questions, along with each question's CreationDate
result = pd.merge(df, top_df2, on='OwnerUserId')

# extracting the month (1-12) from each question's CreationDate
result['Month'] = pd.DatetimeIndex(result['CreationDate']).month

# converts dataframe to dict, mapping each user id to the list of months in which that user posted
final_dict = {k: (list(v["Month"])) for k,v in result.groupby("OwnerUserId")}

# counting the number of times user asked questions in months jan to december (01-12)
# where key is the user id and value is a list of frequencies by month
for user, months in final_dict.items():
    count = []
    for i in range(1, 13):
        count.append(months.count(i))
    final_dict[user] = count

# users is a list of all the top question-posting user ids
users = list(final_dict.keys())

# lists that will contain 2 most similar/dissimilar users
similar_users = []
dissimilar_users = []

# setting the initial values that track the min and max distance; the max Minkowski distance marks the most dissimilar pair
min_minkowski = float('inf')
max_minkowski = 0
for i in range(0, len(users)):
    j = i+1
    while j < len(users):
        f_user = users[i] # getting the first user id from users list
        s_user = users[j] # getting the second user id from users list
        val_i = final_dict[f_user] # getting the set of frequencies for user 1
        val_j = final_dict[s_user] 

        # calculating the Minkowski distance (p = 1, i.e. Manhattan distance) between user 1 and user 2
        minkowski = scipy.spatial.distance.minkowski(val_i, val_j, 1)
        
        # updating the smallest and largest Minkowski distances on record
        if (minkowski < min_minkowski):
            min_minkowski = minkowski
            similar_users = [f_user, s_user]
        if (minkowski > max_minkowski):
            max_minkowski = minkowski
            dissimilar_users = [f_user, s_user]
        j += 1

# user ids
dis_user_1 = dissimilar_users[0]
dis_user_2 = dissimilar_users[1]
sim_user_1 = similar_users[0]
sim_user_2 = similar_users[1]

# plotting
plt.figure()
plt.title("Most Dissimilar Users Measured By Minkowski Distance")
plt.ylabel("Frequency of posts")
plt.xlabel("Months (0 - 11: Jan - December)")

# changing list to series for plotting
dis_user_1_set = Series(final_dict[dis_user_1])
dis_user_2_set = Series(final_dict[dis_user_2])

print('most dissimilar user 1 frequency set: {}'.format(final_dict[dis_user_1]))
print('most dissimilar user 2 frequency set: {}'.format(final_dict[dis_user_2]))

dis_user_1_set.plot(label='User {}'.format(dis_user_1))
dis_user_2_set.plot(label='User {}'.format(dis_user_2))
plt.legend()
plt.show()

plt.title("Most Similar Users Measured By Minkowski Distance")
plt.ylabel("Frequency of posts")
plt.xlabel("Months (0 - 11: Jan - December)")

# changing list to series for plotting
sim_user_1_set = Series(final_dict[sim_user_1])
sim_user_2_set = Series(final_dict[sim_user_2])

print('most similar user 1 frequency set: {}'.format(final_dict[sim_user_1]))
print('most similar user 2 frequency set: {}'.format(final_dict[sim_user_2]))

sim_user_1_set.plot(label='User {}'.format(sim_user_1))
sim_user_2_set.plot(label='User {}'.format(sim_user_2))
plt.legend()
plt.show()

most dissimilar user 1 frequency set: [37, 16, 37, 34, 22, 50, 49, 44, 49, 50, 38, 50]
most dissimilar user 2 frequency set: [7, 7, 2, 0, 49, 14, 15, 5, 24, 19, 17, 2]
most similar user 1 frequency set: [20, 15, 21, 19, 24, 9, 24, 25, 13, 18, 14, 7]
most similar user 2 frequency set: [28, 15, 30, 16, 17, 18, 20, 26, 13, 11, 12, 11]


Choose a different distance/similarity metric and repeat the same time series analysis. Compare the two different metrics you used. **(10 pts)**

In [25]:
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

# lists that will contain 2 most similar/dissimilar users
similar_users = []
dissimilar_users = []

# setting the initial values that track the min and max distance; the max distance marks the most dissimilar pair
min_dist = float('inf')
max_dist = 0
for i in range(0, len(users)):
    j = i+1
    while j < len(users):
        f_user = users[i] # getting the first user id from users list
        s_user = users[j] # getting the second user id from users list
        val_i = final_dict[f_user] # getting the set of frequencies for user 1
        val_j = final_dict[s_user] 

        # calculating the dynamic time warping distance between user 1 and user 2
        dist, path = fastdtw(val_i, val_j, dist=euclidean)
        
        # updating the smallest and largest distances on record
        if (dist < min_dist):
            min_dist = dist
            similar_users = [f_user, s_user]
        if (dist > max_dist):
            max_dist = dist
            dissimilar_users = [f_user, s_user]
        j += 1

# user ids
dis_user_1 = dissimilar_users[0]
dis_user_2 = dissimilar_users[1]
sim_user_1 = similar_users[0]
sim_user_2 = similar_users[1]

# plotting
plt.figure()
plt.title("Most Dissimilar Users Measured By Dynamic Time Warping")
plt.ylabel("Frequency of posts")
plt.xlabel("Months (0 - 11: Jan - December)")

# changing list to series for plotting
dis_user_1_set = Series(final_dict[dis_user_1])
dis_user_2_set = Series(final_dict[dis_user_2])

print('most dissimilar user 1 frequency set: {}'.format(final_dict[dis_user_1]))
print('most dissimilar user 2 frequency set: {}'.format(final_dict[dis_user_2]))

dis_user_1_set.plot(label='User {}'.format(dis_user_1))
dis_user_2_set.plot(label='User {}'.format(dis_user_2))
plt.legend()
plt.show()

plt.title("Most Similar Users Measured By Dynamic Time Warping")
plt.ylabel("Frequency of posts")
plt.xlabel("Months (0 - 11: Jan - December)")

# changing list to series for plotting
sim_user_1_set = Series(final_dict[sim_user_1])
sim_user_2_set = Series(final_dict[sim_user_2])

print('most similar user 1 frequency set: {}'.format(final_dict[sim_user_1]))
print('most similar user 2 frequency set: {}'.format(final_dict[sim_user_2]))

sim_user_1_set.plot(label='User {}'.format(sim_user_1))
sim_user_2_set.plot(label='User {}'.format(sim_user_2))
plt.legend()
plt.show()

most dissimilar user 1 frequency set: [37, 16, 37, 34, 22, 50, 49, 44, 49, 50, 38, 50]
most dissimilar user 2 frequency set: [42, 36, 39, 9, 8, 17, 4, 2, 5, 7, 2, 2]
most similar user 1 frequency set: [8, 22, 26, 26, 28, 21, 13, 16, 17, 9, 12, 10]
most similar user 2 frequency set: [11, 10, 9, 19, 20, 30, 24, 28, 16, 18, 8, 11]


The two metrics I used to measure similarity between users' monthly question counts were Minkowski distance and dynamic time warping (DTW). I chose Minkowski distance because it was one of the distance metrics included in the time series notes. I set p = 1, which gives the Manhattan distance, rather than p = 2, which gives the Euclidean distance, because I was not confident the Euclidean result would be meaningful for this data. I hoped to find that the most active users post questions consistently throughout the year, while users who are on the site less often post more sporadically. Each time series is one user's question count for each month of the year, and each metric compares two users by taking their twelve monthly counts as input.
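To make the role of p concrete, here is a minimal sketch that computes the Minkowski distance by hand for the two "most similar" frequency sets printed by the Minkowski cell above. The helper `minkowski` is a hypothetical pure-Python stand-in written for illustration; it is meant to mirror the textbook definition, not to replace `scipy.spatial.distance.minkowski`.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


# Monthly question counts of the most similar pair from the Minkowski output
sim_1 = [20, 15, 21, 19, 24, 9, 24, 25, 13, 18, 14, 7]
sim_2 = [28, 15, 30, 16, 17, 18, 20, 26, 13, 11, 12, 11]

manhattan = minkowski(sim_1, sim_2, 1)   # p = 1: sum of absolute differences
euclid = minkowski(sim_1, sim_2, 2)      # p = 2: straight-line distance

print(manhattan)  # 54.0
print(euclid)     # about 19.24 (the square root of 370)
```

With p = 1 every month's difference contributes linearly, while p = 2 squares the differences and so gives a few large monthly gaps more weight than many small ones.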

The time series for the most dissimilar users under Minkowski distance show a dramatic difference in question frequency: one user was posting less than usual, while the other showed a sporadic increase in posts that quickly died down. For the most similar users under Minkowski distance, the two users' question frequencies track each other much more closely than in the dissimilar graph.

Dynamic time warping selected more convincing frequency sets for both the most similar and the most dissimilar users. The lines for the dissimilar users diverge more than in the graph produced by the Minkowski distance.
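One reason the two metrics can pick different pairs is that dynamic time warping allows months to be aligned out of step, while Minkowski distance compares month i only with month i. The sketch below illustrates this with the classic exact dynamic-programming formulation, which is an assumption made for clarity: `fastdtw` computes an approximation of this quantity, and the helper names and toy series here are hypothetical.

```python
def dtw_distance(a, b):
    """Exact dynamic time warping distance with |x - y| as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def manhattan(a, b):
    """Minkowski distance with p = 1."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical toy series: the same burst of posts, one month apart
burst_early = [0, 5, 0, 0, 0]
burst_late = [0, 0, 5, 0, 0]

print(manhattan(burst_early, burst_late))     # 10
print(dtw_distance(burst_early, burst_late))  # 0.0
```

The Manhattan distance penalizes the one-month shift twice, whereas DTW warps the time axis so the two bursts line up and reports no difference. Which behavior is preferable depends on whether the timing of activity matters for the hypothesis being tested.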