The purpose of this notebook is to perform some exploratory analysis of the 2016 User Study survey. Wahida Chowdhury and I will be working together and collaborating on ideas for this study; this notebook, however, will be largely focused on data and visualization.

The User Study gathered nearly 5000 anonymous responses, asking for demographic information and about participants' use of the various GCTools (GCconnex, GCpedia, GCIntranet). The responses can be used to gather feedback on the strengths and weaknesses of the GCTools, and to describe the tools' user bases.

This file will document my research progress, starting from the very beginning of my analysis. It will not include textual analysis; that will be saved for another notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#I will import more sophisticated packages (e.g. scikit-learn) as they are needed

In [None]:
#First we will start by importing the data. Wahida gave me two sheets, one of which contains mostly completed responses,
#The other contains mostly incomplete responses.
#Here is the first one
data_path = r"/Users/Owner/Documents/Work_transfer/User Study 2016/"
df = pd.read_excel(data_path+"User Study 2016.xlsx")

In [None]:
df.filter(regex = "Other").describe()

This file is way too big. We're gonna have to crunch this down a bit for sure.
This survey is most useful when we can link the answers to the demographics. 
I'm going to start by taking a few key questions from the data and putting that into one dataframe

In [None]:
df3 = df.filter(regex = "P3Q").copy() # copy so that adding a column below does not raise SettingWithCopyWarning

In [None]:
df3['Participant'] = df['participant no']

In [None]:
df3columns = ['Department', 'DepartmentOther', 'CompressedWeek', 'FlexibleWork', 'Telework', 'JobSharing',
              'IncomeAveraging','NoArrangement', 'Status', 'StatusOther', 'Community', 'Tenure', 'TenureOther',
              'SMLevel', 'Language', 'Region', 'Age', 'Gender', 'Education', 'Participant']
df3.columns = df3columns

df3 = df3[['Participant', 'Department', 'DepartmentOther', 'CompressedWeek', 'FlexibleWork', 'Telework', 'JobSharing',
              'IncomeAveraging','NoArrangement', 'Status', 'StatusOther', 'Community', 'Tenure', 'TenureOther',
              'SMLevel', 'Language', 'Region', 'Age', 'Gender', 'Education']]

In [None]:
df3.describe(include  = "all")

Now we have all the demographic information of the participants neatly organized in one table, that can still be easily combined with other tables. For now I just want demographic info. We will dig a little bit deeper soon.

In [None]:
departments = df3['Department'].value_counts().reset_index()
departmentsother = df3['DepartmentOther'].value_counts().reset_index() # We can ignore departments other
departments # Group similar dept's

In [None]:
#This actually tells us nothing that we couldn't already find

gender = df3['Gender'].value_counts().reset_index()
age = df3['Age'].value_counts().reset_index()
smlevel = df3['SMLevel'].value_counts().reset_index()
education = df3['Education'].value_counts().reset_index()
community = df3['Community'].value_counts().reset_index()

In [None]:
community
community_plot = community.set_index(community.columns[0]) # first column holds the category labels (its name depends on the pandas version)


ax = community_plot.plot.bar(title = "Responses By Area of Work")
ax.set_xlabel("Area of Work")
ax.set_ylabel("Number of Respondents")
plt.show()

In [None]:
community_plot

This is an important table (probably one of the most directly important tables here). I don't know the statistics for public servants per category, but an interesting (although probably impossible) task would be to look at the proportion.

We'll use this later when we start looking at behaviour connected to employment group.
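Even without public-service-wide headcounts, the share of each area of work within the sample is easy to get. Here is a minimal sketch on made-up data (the category labels and the `toy` frame are placeholders, not survey values):

```python
import pandas as pd

# Toy stand-in for df3['Community']; the labels are illustrative only.
toy = pd.DataFrame({"Community": ["Policy", "IT", "IT", "Communications", "IT", "Policy"]})

# normalize=True returns each category's share of the non-missing responses.
community_share = (toy["Community"].value_counts(normalize=True) * 100).round(1)
print(community_share)
```

If official headcounts per category ever become available, comparing them against these shares would show which groups are over- or under-represented among respondents.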

# GCconnex

In [None]:
dfconnex = df.filter(regex = "GCc").copy() # copy so that adding a column does not raise SettingWithCopyWarning
dfconnex['Participant'] = df['participant no'] # This will be used to join the tables

#I'm gonna take this moment to complain about the amount of columns I have to rename :'( 

In [None]:
print ("I'm so afraid of having to deal with", len(dfconnex.columns)+1,"columns :'(")

In [None]:
dfconnexcol = ['Aware', 'UsageLength', 'UsageLengthOther', 'NoUseWhy', 'NoUseCollab', 'NoUsePublic', 'NoUseSupervisor', 'NoUseNoTime',
               'PplNoUse', 'NoToolsInfo', 'NoPurpose', 'Other', 'OtherResponse', 'HowOftenUse', 'WhyUseConnect',
               'WhyUsePlan', 'WhyUseCoCreate', 'WhyUseFeedback', 'WhyUseOrgShareInfo', 'WhyUseFindReUseInfo',
               'WhyUseOfficialContent', 'WhyUseFindNewPos', 'WhyUseCareerDev',
               'WhyUseChat', 'WhyUseOther', 'WhyUseReason', 'EasyUse', 'EasyInfo', 'InfoUseful', 'LoadQuickly',
               'TailoredContent', 'AdequateFeedback', 'BuildRelationships', 'EasyProfile', 'EasyNewsFeed',
               'EasyOnBoarding', 'EasyNotifications', 'EasyInformationGroup', 'EasyCollabGroup', 'EasyWriteBlog',
               'EasyReadBlog', 'EasyWriteWire', 'EasyReadWire', 'EasyPostImage', 'EasyViewImage', 'EasyCreateBM',
               'EasyCreatePolls', 'EasyWidgets', 'EasyNoteIdeas', 'EasyUseChat', 'EasyUseChatrooms', 'EasyUseSearch',
               'EasyOther', 'WhyNotEasy', 'FeaturesWant', 'Helpopen', 'HelpAgile', 'HelpCollab', 'IsSecure',
               'IsReliable', 'IsCompFunctional', 'IsAlignedGovGoals', 'IsGoodSourceInfo', 'IsGoodCentralHub', 'OtherBenefits',
               'Participant']

In [None]:
dfconnex.columns = dfconnexcol # There's a piece of missing data in each entry.

In [None]:
dfconnex = dfconnex[['Participant', 'Aware', 'UsageLength', 'UsageLengthOther', 'NoUseWhy', 'NoUseCollab', 'NoUsePublic', 'NoUseSupervisor', 'NoUseNoTime',
               'PplNoUse', 'NoToolsInfo', 'NoPurpose', 'Other', 'OtherResponse', 'HowOftenUse', 'WhyUseConnect',
               'WhyUsePlan', 'WhyUseCoCreate', 'WhyUseFeedback', 'WhyUseOrgShareInfo', 'WhyUseFindReUseInfo',
               'WhyUseOfficialContent', 'WhyUseFindNewPos', 'WhyUseCareerDev',
               'WhyUseChat', 'WhyUseOther', 'WhyUseReason', 'EasyUse', 'EasyInfo', 'InfoUseful', 'LoadQuickly',
               'TailoredContent', 'AdequateFeedback', 'BuildRelationships', 'EasyProfile', 'EasyNewsFeed',
               'EasyOnBoarding', 'EasyNotifications', 'EasyInformationGroup', 'EasyCollabGroup', 'EasyWriteBlog',
               'EasyReadBlog', 'EasyWriteWire', 'EasyReadWire', 'EasyPostImage', 'EasyViewImage', 'EasyCreateBM',
               'EasyCreatePolls', 'EasyWidgets', 'EasyNoteIdeas', 'EasyUseChat', 'EasyUseChatrooms', 'EasyUseSearch',
               'EasyOther', 'WhyNotEasy', 'FeaturesWant', 'Helpopen', 'HelpAgile', 'HelpCollab', 'IsSecure',
               'IsReliable', 'IsCompFunctional', 'IsAlignedGovGoals', 'IsGoodSourceInfo', 'IsGoodCentralHub', 'OtherBenefits'
               ]]

In [None]:
dfconnex['Aware'].value_counts()

print (round(4057/(4057+802)*100, 1), "% of participants are aware of GCconnex.") # multiply by 100 so the number matches the "%" label
print ("I wouldn't be surprised if this was because most of the individuals saw the survey via the tools.")
print ("There is some very likely bias in this answer")

In [None]:
nousegcconnex = dfconnex[dfconnex['UsageLength'] == "Do not use at all"]

In [None]:
nousegcconnex.describe(include = "all")

From the above table, several key insights come up. Many employees feel they don't know why GCconnex would be a useful tool for them. Also, many employees answer that the people they collaborate with do not use it (no "social" in the network). Many felt that there was no purpose to using GCconnex.

#### Compare using this with other social media habits
It might be worthwhile to compare how employees who do not use GCconnex (but are aware of it) use other media. They may have a bias toward not liking social media or newer technologies, or maybe they don't like GCconnex for work purposes. Comparing GCconnex to other social media outlets (especially as a work tool) may help determine whether it is the individual or the tool that doesn't work
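A row-normalized crosstab is probably the quickest way to make that comparison. This is only a sketch: the column names `UsesGCconnex` and `UsesOtherSocialMedia` are hypothetical placeholders (I have not mapped them to actual survey questions), and the values are invented.

```python
import pandas as pd

# Hypothetical columns and invented answers, for illustration only.
toy = pd.DataFrame({
    "UsesGCconnex": ["No", "No", "No", "Yes", "Yes", "No"],
    "UsesOtherSocialMedia": ["Yes", "Yes", "No", "Yes", "No", "Yes"],
})

# normalize="index" gives, within each GCconnex group, the share using other social media.
habits = pd.crosstab(toy["UsesGCconnex"], toy["UsesOtherSocialMedia"], normalize="index")
print(habits)
```

If GCconnex non-users turn out to use other social media about as much as everyone else, that would point toward the tool rather than the individual.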

In [None]:
nousegcconnex.to_csv(data_path+"nouse.csv") # Just to get a more thorough look at what's going on in the CSV

One useful thing to send to the dev team would be the user evaluations of the individual aspects of GCconnex. It would be useful and simple to generate a report-card-like summary for the team that allows them to see which features users find easy to use.

In [None]:
#Let's do the dev thing
dfdev = dfconnex.filter(regex = "Easy").copy() #Extracts all the questions with "Easy" in the column name.
#I have a feeling this might be important to the dev team

In [None]:
dfdev.columns

In [None]:
dfdev.drop('WhyNotEasy', inplace = True, axis = 1)
dfdev.drop('EasyOther', inplace = True, axis = 1)

In [None]:
valuedict = {} #Taking the value counts of each question about easiness in here
for col in dfdev:
    valuedict[col] = dfdev[col].value_counts()

In [None]:
devfb = pd.DataFrame.from_dict(valuedict, orient = "index") #Turning the dictionary we just created into a dataframe

In [None]:
devfb.loc['EasyInfo', "Don't know / Not sure / Don't use"] = 328 # Merging the "Dont Know" columns into one manually (.loc replaces the removed set_value)
devfb.loc['EasyUse', "Don't know / Not sure / Don't use"] = 257
devfb.drop("Don't know / Not sure", axis = 1, inplace = True)

In [None]:
#Now to reorder the columns because I'm pedantic like that
devfb = devfb[['Yes', 'No', "Don't know / Not sure / Don't use"]]
devfb["Don't know / Not sure / Don't use"] = devfb["Don't know / Not sure / Don't use"].astype(int)

In [None]:
devfb['Total'] = 0

for i in devfb:
    if i == 'Total':
        break
    else:
        devfb['Total'] += devfb[i]

        
#Here is the table, the meanings are still a bit obscure to anyone except myself since I rewrote the names of all the columns
#So before giving it to somebody else I'll fix the table index. But this is neat.

In [None]:
devfb

In [None]:
devfb[['Yes', 'No']].plot.bar(stacked = False)
plt.show()

In [None]:
devfbperc = devfb.apply(lambda x: x/devfb['Total']*100)
grap1 = devfbperc[['Yes', 'No', "Don't know / Not sure / Don't use"]].iloc[:10].plot.barh(stacked = True, figsize = [10,10]) # first ten rows
grap1.legend(loc = 'upper left', bbox_to_anchor=(1,1))
grap1.set_title("What do users find easy?")

grap2 = devfbperc[['Yes', 'No', "Don't know / Not sure / Don't use"]].iloc[10:].plot.barh(stacked = True, figsize = [10,10]) # all remaining rows, so none are skipped
grap2.legend(loc = 'upper left', bbox_to_anchor=(1,1))
grap2.set_title("What do users find easy (cont)?")
plt.show()

### ease of use by frequency, recency, (area of work)

### Developer Feedback Table

The above table can help us determine what the pain points are for GCconnex. A "Don't know..." response likely indicates that the respondent doesn't use that functionality of the website, whereas a "No" response is bad news. One unfortunate reality is that "No" outweighs "Yes" in the EasyInfo row (long form: Easy to find the information you need). Perhaps there are some reasons in the comments, but that's not up to me to review.

This is mirrored in the EasyInformationGroup row (long form: Easy to find information in groups). This response had more "Yes" than "No," but not by a very large margin.

Another unfortunate reality is the "EasyUse" row (long form: "Did you find it [GCconnex] easy to use?"). Most users gave a yes or no answer, but only 48% of responses indicated Yes, and 43% indicated No. 

On a more positive note, it appears many users find it easy to READ the content already posted on GCconnex. Passive activities such as View Image, Read Blog, Read Wire, and News Feed, all report a majority of affirmative responses. 
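One way to turn the table into a ranked list of pain points is to look at the share of "No" among respondents who gave a definite answer. A rough sketch with invented counts (the numbers below are not the survey results, only the table shape matches `devfb`):

```python
import pandas as pd

# Invented counts, shaped like devfb (rows = features, columns = answers).
toy_fb = pd.DataFrame(
    {
        "Yes": [480, 300, 700],
        "No": [430, 350, 100],
        "Don't know / Not sure / Don't use": [90, 350, 200],
    },
    index=["EasyUse", "EasyInfo", "EasyReadBlog"],
)

# Share of "No" among definite (Yes/No) answers, worst feature first.
definite = toy_fb["Yes"] + toy_fb["No"]
no_share = (toy_fb["No"] / definite * 100).round(1).sort_values(ascending=False)
print(no_share)
```

Leaving the "Don't know" answers out of the denominator keeps rarely used features from looking artificially good or bad.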

##### Observations

If I may abstract from the responses, evidence of a Power Law phenomenon becomes apparent again. The Power Law dynamic, as it applies to social networks, implies that the majority of a user base will not create content on a network, while a small group of users will create a disproportionately large amount of content. Judging from the above responses, many users find it simpler to read content than to post content. This is unsurprising, since reading is passive and relatively effortless while posting is active. The fact that reading content is easier than posting content suggests the Power Law relationship is still very much present.

The observation above is not meant to dismiss the troubling responses in the table. It already takes great effort to post content onto GCconnex, since one must come up with original and thoughtful responses and/or questions. It should therefore be a priority for the tools team to make posting original content onto GCconnex as simple as possible, something that can clearly be improved judging from the above responses.
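To make the Power Law idea concrete, here is a toy calculation. It assumes each user's output is proportional to 1/rank (a Zipf-like assumption picked purely for illustration, not fitted to GCconnex data) and asks how much content the top 10% of users would produce:

```python
# Illustrative only: contributions proportional to 1/rank, not GCconnex data.
n_users = 1000
posts = [1.0 / rank for rank in range(1, n_users + 1)]

# Share of all content produced by the most active tenth of users.
top_decile = sum(posts[: n_users // 10])
top_share = top_decile / sum(posts)
print(f"Top 10% of users produce {top_share:.0%} of content")
```

Under that assumption a small minority accounts for well over half of the content, which is the pattern described above.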

The troubling response to "was it easy to find the information you needed" indicates a problem.

### The "Is" questions

The survey asked many questions about respondents' beliefs about GCconnex's usefulness. I've labeled these questions as "is" questions. Let's build a similar table for them. Btw, this is part of the exploratory data analysis. To build proper models that reflect the data, we need to get to know the data fairly well.




In [None]:
#Building the is table

dfis = dfconnex.filter(regex = "Is")

In [None]:
#To get this over with really quickly, I'll just leave the values as is for now

isdict = {}
for col in dfis:
    isdict[col] = dfis[col].value_counts()
    
istable = pd.DataFrame.from_dict(isdict, orient  = "index")
istable = istable.reindex(sorted(istable.columns), axis = 1) # reindex replaces reindex_axis, which was removed from pandas

In [None]:
#Okay fine, I'll clean up the columns names
collist  = list(istable.columns.values)

collist = [c.replace("<br />", "") for c in collist]
    
istable.columns = collist

In [None]:
collist

In [None]:
istable

In [None]:
#Quick plot of the table
ax = istable.plot(kind = "bar", title = "User Responses to GCconnex 'is' Questions")
ax.legend(loc = 'upper left', bbox_to_anchor=(1,1))
plt.show()

#Also sidenote, you can plot dataframes directly from pandas, and this makes me unreasonably happy

### "Is" Table

The "is" table shows us what our sample group thinks about GCconnex as a tool. What we want to see is tall purple and yellow bars (the "Moderately Agree" and "Strongly Agree" responses). In each of the questions, users are still largely undecided as an aggregate. Most of the questions have more responses on the right side of 'undecided' rather than the left, indicating a slightly positive outlook on each of the questions (except for "Is Completely Functional").
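The "left vs. right of undecided" reading can be condensed into a single net-agreement number per question. A sketch with invented counts; only the two "Agree" labels come from the survey, and I'm assuming the disagree and undecided labels mirror them:

```python
import pandas as pd

# Invented counts; the disagree/undecided labels are assumed, not confirmed.
toy_is = pd.DataFrame(
    {
        "Strongly Disagree": [50, 120],
        "Moderately Disagree": [100, 200],
        "Undecided": [400, 380],
        "Moderately Agree": [250, 180],
        "Strongly Agree": [100, 70],
    },
    index=["IsSecure", "IsCompFunctional"],
)

agree = toy_is[["Moderately Agree", "Strongly Agree"]].sum(axis=1)
disagree = toy_is[["Moderately Disagree", "Strongly Disagree"]].sum(axis=1)

# Positive values mean more agreement than disagreement, in percentage points.
net_agreement = ((agree - disagree) / toy_is.sum(axis=1) * 100).round(1)
print(net_agreement)
```

A question with a negative score leans toward disagreement, which would flag it the same way the bar chart does.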

### "Help" Table

Continuing our analysis with other factors from the survey.


In [None]:

dfhelp = dfconnex.filter(regex = "Help")
dfhelp

helpdict = {}

for col in dfhelp:
    helpdict[col] = dfhelp[col].value_counts()

helpcounts = pd.DataFrame.from_dict(helpdict, orient = "index")

In [None]:
helpcollist = [c.replace("<br />", "") for c in list(helpcounts.columns)]
helpcounts.columns = helpcollist

In [None]:
helpcounts = helpcounts.reindex(sorted(helpcounts.columns), axis = 1) # reindex replaces reindex_axis, which was removed from pandas

In [None]:
helpcounts

In [None]:
ax2 = helpcounts.plot.bar(title = "User Responses if GCconnex Helps with...")
#Relatively High marks for collaboration!
ax2.legend(loc = 'center left', bbox_to_anchor=(1,0.5))
plt.show()

In [None]:
dfuseinfo = dfconnex[dfconnex['WhyUseFindReUseInfo'] == 1]
# Small tangent after speaking with Wahida. How many people who claim they use GCconnex to find information find search easy

In [None]:
dfuseinfo['EasyUseSearch'].value_counts() 

### Use of GCconnex vs. Ease of features
The above little digression makes me wonder if it would be easy to generate a series of crossplots (or even a crosstab) showing how people use GCconnex, and whether they find something easy or not. Variation in the ease of an operation that comes from the reason an individual uses GCconnex gives evidence of a learning curve, as well as whether the difficulty in using a feature stems from inexperience, or if it is just not simple to use.
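For a single pair of questions, `pd.crosstab` already gives the kind of table described here. A small sketch using two of the renamed columns with invented answers:

```python
import pandas as pd

# Toy responses; column names follow the renamed GCconnex columns, values are invented.
toy = pd.DataFrame({
    "WhyUseChat": [1, 1, 0, 0, 1, 0],
    "EasyUseChat": ["Yes", "Yes", "No", "Yes", "No", "No"],
})

# Rows: uses GCconnex for chat (1) or not (0); columns: share giving each ease answer.
chat_ease = pd.crosstab(toy["WhyUseChat"], toy["EasyUseChat"], normalize="index")
print(chat_ease)
```

Repeating this for every use/ease pair is where it gets unwieldy, which is why a single summary table is more attractive.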

In [None]:
#First Step is to pull the "Why Use," aspect, which has lots of other stuff in it

whyuse = dfconnex.filter(regex = "WhyUse").copy() # copy so the drop and the new column below act on an independent frame
whyuse.drop('WhyUseReason', inplace = True, axis = 1)

In [None]:
whyuse['Participant'] = dfconnex['Participant']


In [None]:
dfdev['Participant'] = dfconnex['Participant']

EaseVsWhy = pd.merge(whyuse, dfdev, on = "Participant")

#This has 34 columns, this might not work. I'll need to find a way to figure out how to do what I want to do

In [None]:
EaseVsWhy = EaseVsWhy.fillna(0)

In [None]:
 #I'm not sure if this will take me in the direction I want to go to be completely honest. 
# There's gotta be another less tedious way.

# What if I made the index all of the easy, and dropped everything that wasn't yes?
# Then it would be measuring the number of people who found it easy, and what they did. 
# I am so great.

In [None]:
EaseVsWhy.columns

In [None]:
EaseVsWhy.replace("Yes", 1, inplace = True)
EaseVsWhy.replace("No", 0, inplace = True)
EaseVsWhy.replace("Don't know / Not sure / Don't use", 0, inplace = True)
EaseVsWhy.replace("Don't know / Not sure", 0, inplace = True)

In [None]:
EaseVsWhy = EaseVsWhy.astype(int)

In [None]:
whycols = []
for col in EaseVsWhy.columns:
    if "Why" in col:
        whycols.append(col)

        


### Rubber Ducky

I'm not really getting anywhere with what I want to do, so I'm gonna use a textual rubber ducky method here. 

The EaseVsWhy dataframe consists of columns asking why people use GCconnex, and columns asking whether they find a certain aspect EASY. If the individual answered yes to a question, that column is coded 1. Otherwise it is coded 0. My ideal dataframe would have the Why questions as the index, and each of the Ease questions in the columns. In the cells, I want the number of individuals who found each task easy, broken down by the reasons why they use GCconnex.
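With 0/1 coding, that target table is a matrix product: multiplying the transposed Why block by the Ease block counts, for each pair, the respondents who have a 1 in both. This is an alternative to the loop-based approach used in this notebook, sketched on toy data:

```python
import pandas as pd

# Toy 0/1 coding: four respondents, two "Why" columns and two "Easy" columns.
toy = pd.DataFrame({
    "WhyUseChat":    [1, 1, 0, 1],
    "WhyUseConnect": [0, 1, 1, 1],
    "EasyUseChat":   [1, 0, 0, 1],
    "EasyProfile":   [1, 1, 1, 0],
})
why = toy.filter(regex="^WhyUse")
ease = toy.filter(regex="^Easy")

# Entry (w, e) counts respondents with a 1 in both column w and column e.
counts = why.T.dot(ease)
print(counts)
```

The result has the Why questions as the index and the Ease questions as columns, which is the shape described above.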

In [None]:
whylist = []
for col in whycols:
    whylist.append(EaseVsWhy[EaseVsWhy[col] == 1])
    

In [None]:
#Each element of whylist is the subset of respondents who answered yes to one "Why use" question

In [None]:
whylistsums = []

for why in whylist:
    whylistsums.append(why.sum())

In [None]:
useandease = pd.DataFrame(whylistsums).filter(regex  = "Easy")

In [None]:
useandease['WhyUse:'] = whycols

In [None]:
useandease = useandease.set_index('WhyUse:', drop = True )

### ...What?

If you're wondering what I just did above, so am I.

Each row of the dataframe covers everyone who answered "yes" to the question in the index (a reason why they use GCconnex). Going column by column, it then tallies everyone in that group who responded "yes" to whether they found a certain task easy in GCconnex.

Lastly, because people could respond that they use GCconnex for multiple things, one individual can be counted in several rows, but that's okay.

It's still lacking one thing: we don't have a proportional answer. We should normalize each row by the number of people who responded yes to that row's question.
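The normalization itself is a row-wise division. A minimal sketch with invented counts (`counts` and `yes_per_reason` are stand-ins for the real tables):

```python
import pandas as pd

# Invented counts: rows are reasons for use, columns are tasks found easy.
counts = pd.DataFrame(
    {"EasyUseChat": [30, 10], "EasyProfile": [40, 45]},
    index=["WhyUseChat", "WhyUseConnect"],
)
# Invented row totals: respondents who said yes to each reason.
yes_per_reason = pd.Series({"WhyUseChat": 50, "WhyUseConnect": 90})

# axis=0 aligns the divisor with the row index, so each row is divided by its own total.
pct = counts.div(yes_per_reason, axis=0) * 100
print(pct.round(1))
```

Each cell then reads as "of the people who use GCconnex for this reason, what percentage found this task easy".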

In [None]:
useandease

In [None]:
useandease.apply(lambda x: max(x))

In [None]:
whylistthings = EaseVsWhy[whycols].sum() # number of "yes" answers per reason for use

In [None]:
useandease['norm'] = whylistthings

In [None]:
useeasenormed = useandease.apply(lambda x: x/useandease['norm']*100)

In [None]:
useeasenormed.drop('norm', inplace = True, axis = 1)

In [None]:
useeasenormed.to_csv(data_path+'Ease of Tasks Depending on Use.csv') 
#This table gives the proportion of users who find each task easy depending on what they use GCconnex for

In [None]:
useeasenormed.describe().to_csv(data_path+"Ease of Tasks Depending on Use Summary Statistics.csv")

In [None]:
dfconnex

#### Use of tools and ease of tasks
The tables above show how users' perceived ease of tasks in GCconnex depends on how they use GCconnex. For example, a large share of users who use GCconnex for chat found chat easy to use, but among those who use GCconnex for other reasons, the reported ease of chat fell.

Again, if a user uses GCconnex for both chat and career development, and they say chat is easy, that will count as a "yes" in both the chat and the career development rows. Nothing much can be done about that.

Checking out the variation in responses in the columns is interesting. The ease of some tasks depends on the reason for use, while for other tasks there isn't much variation across reasons.
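A quick way to see which tasks vary most is the range (max minus min) of each column. A sketch on invented percentages shaped like `useeasenormed`:

```python
import pandas as pd

# Invented percentages (rows = reasons for use, columns = tasks).
toy_pct = pd.DataFrame(
    {"EasyUseChat": [80.0, 35.0, 40.0], "EasyReadBlog": [70.0, 72.0, 68.0]},
    index=["WhyUseChat", "WhyUseCareerDev", "WhyUseConnect"],
)

# A large range means the ease of that task depends heavily on the reason for use.
spread = (toy_pct.max() - toy_pct.min()).sort_values(ascending=False)
print(spread)
```

The standard deviation from `describe()` tells a similar story; the range is just easier to read at a glance.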



# Demographic Data Cleaning

I'm gonna look at demographic information to make sure it's cleaned up. This way, I can start playing with models soon. I don't want to do any models without having the controls of age, gender, time on public service etc. This will make (some) models more robust and help lend credence to the claims we will be making.

In [None]:
df3['Education'].value_counts() # raw labels (edcount is only defined in the next cell)

In [None]:
edcount = df3['Education']
edcount = edcount.replace("Bachelor's degree", "Undergrad")
edcount = edcount.replace("University certificate or diploma above the bachelor's level including a master's degree or doctorate", "Masters/PhD")
edcount = edcount.replace("Diploma or certificate from a community college, CEGEP, institute of technology, nursing school, etc., or a trades certificate or diploma", "College/CEGEP/Certificate")
edcount = edcount.replace("Secondary or high school graduation certificate, equivalent or less", "High School or Lower")
edcount = edcount.replace("University certificate or diploma below the bachelor's level", "University < Bachelors")

In [None]:
#Order the categories from lowest to highest level of education
edorder = ['High School or Lower', 'University < Bachelors', 'College/CEGEP/Certificate',
           'Undergrad', 'Masters/PhD']
edgraph = edcount.value_counts().reindex(edorder)

edgraph

In [None]:
edgraph.plot.bar()

plt.show()

In [None]:
dataframe = pd.merge(df3, dfconnex, on = 'Participant') # I kind of forgot that I already did everything necessary for that.



Here is where I start having to be careful with what I'm doing as I progress further. I have 85 columns for the GCconnex study, which is all well and good, however I have to bear in mind that almost every single column is missing an observation. I have to go through the dataframe and figure out exactly how to go about cleaning the data.

Some answers that lack responses should be filled with N/A instead. Other answers should be dropped if there is N/A. 
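The two treatments map onto `fillna` and `dropna`. In this sketch the choice of which column gets which treatment, and the values themselves, are assumptions for illustration:

```python
import pandas as pd
import numpy as np

# Toy frame with invented values; column roles are assumed for the example.
toy = pd.DataFrame({
    "Participant": [1, 2, 3],
    "UsageLengthOther": [np.nan, "Since launch", np.nan],  # optional follow-up question
    "Age": ["25-34", np.nan, "45-54"],                      # needed as a model control
})

# Optional follow-ups: a blank just means "not applicable".
toy["UsageLengthOther"] = toy["UsageLengthOther"].fillna("N/A")

# Controls required by a model: drop respondents who left them blank.
cleaned = toy.dropna(subset=["Age"])
print(cleaned)
```

Which columns fall into which group depends on the model, which is another reason to do the cleaning per model.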

In [None]:
dataframe.to_csv(data_path+'Clean_File.csv')

I am going to perform the data cleaning with respect to a certain model in a different notebook. I don't want to put too many things into one notebook. This one was good for exploring what there is in the original file, and cleaning it up so that it is comprehensible for myself. I will likely return to this notebook when I want to look at the other GCTools (GCpedia, GCIntranet). A lot of the work for the other GCTools will build off what is already in this notebook.

### Data-backed personas
##### As per Martin's request

- Data-driven stories about user types as described by the data.
- Primary vs. Secondary Type of persona
    - How use site? 
    - Proficiency of social media.
    - Content/ Information needs
    - What are they looking for?
 
 
- Open ended comments
    - Format, see if can combine with closed content questions
    - Pull out (NLTK) and read.


- Anti-persona
    - User better served by other site. (Proxy by non-usage of GCconnex?)
        - Might come out of open ended comments anyway.
    - Important
    

In [None]:
df