## Read in the GitHub Comments data

In [3]:
%matplotlib inline
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#pd.set_option('display.max_rows', 10)
#pd.set_option('display.notebook_repr_html', True)
#pd.set_option('display.max_columns', 10)



In [4]:
df = pd.read_csv('issue_comments_jupyter_copy.csv')
df['org'] = df['org'].astype('str')
df['repo'] = df['repo'].astype('str')
df['comments'] = df['comments'].astype('str')
df['user'] = df['user'].astype('str')

### Initial data exploration

#### Display the first five lines of the dataset to the screen

In [5]:
df.head()

Unnamed: 0,org,repo,number,issue_date,comment_creation_date,comments,user
0,jupyter,colaboratory,118,2015-07-15 13:09:16,2015-07-15 16:37:03,Thanks !\n,wernight
1,jupyter,colaboratory,119,2015-07-23 06:55:39,2015-07-23 06:59:17,Oops. i got it. I have to uninstall ipython3 a...,kakitone
2,jupyter,colaboratory,121,2016-01-31 15:08:08,2016-03-08 14:07:43,same issue\n,magdalenat
3,jupyter,colaboratory,121,2016-01-31 15:08:08,2016-06-23 17:13:57,FWIW a workaround is to share from Google Driv...,magdalenat
4,jupyter,colaboratory,123,2016-09-18 21:31:02,2016-09-18 21:48:15,"At some point, I'll probably hack on a Rethink...",jackiekazil


The output is a table with 7 columns: four string columns (`org`, `repo`, `comments`, `user`), two datetime columns (`issue_date`, `comment_creation_date`), and one column that should be an integer (`number`).

#### Name of each column

In [6]:
print(df.columns)

Index([u'org', u'repo', u'number', u'issue_date', u'comment_creation_date',
       u'comments', u'user'],
      dtype='object')


#### Data types for each column

In [7]:
df.dtypes

org                      object
repo                     object
number                    int64
issue_date               object
comment_creation_date    object
comments                 object
user                     object
dtype: object
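
The two date columns are read in as plain `object` (string) columns. A minimal sketch of converting them with `pd.to_datetime` follows. The small `sample` frame is a stand-in built from values in the `head()` output, and the `response_time` column is a hypothetical name that does not exist in the dataset.

```python
import pandas as pd

# Stand-in for the real data; values copied from the head() output above.
sample = pd.DataFrame({
    "issue_date": ["2015-07-15 13:09:16", "2016-01-31 15:08:08"],
    "comment_creation_date": ["2015-07-15 16:37:03", "2016-03-08 14:07:43"],
})

# Convert the string columns to real timestamps.
for col in ["issue_date", "comment_creation_date"]:
    sample[col] = pd.to_datetime(sample[col])

# With datetimes, time arithmetic works directly.
# "response_time" is a hypothetical column name used only for illustration.
sample["response_time"] = sample["comment_creation_date"] - sample["issue_date"]

print(sample.dtypes)
print(sample["response_time"])
```

The same conversion could probably be done on load by passing `parse_dates=['issue_date', 'comment_creation_date']` to `pd.read_csv`.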

#### How many variables (columns)?

In [8]:
len(df.columns)

7

#### Number of observations (rows) in the dataset

In [9]:
len(df)

17656

This does not match the total number of rows in the csv file (18,048)! Let's look around more. 
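
One possible cause worth ruling out is that a quoted CSV field may contain line breaks, so the number of physical lines in the file can exceed the number of records. The sketch below illustrates the effect on a made-up three-line string (not the real file) using the standard `csv` module.

```python
import csv
import io

# Hypothetical file contents: one header line and ONE record whose
# quoted comment contains an embedded newline.
raw = 'user,comments\nwernight,"Thanks !\nSee you"\n'

# Counting physical lines, as a text editor or `wc -l` would.
physical_lines = len(raw.splitlines())

# Counting records, as a CSV parser would.
records = list(csv.reader(io.StringIO(raw)))
data_rows = len(records) - 1  # subtract the header

print("physical lines:", physical_lines)
print("data rows:", data_rows)
```

If this is what is happening, the two counts describe different things and no rows are actually lost. This is only an assumption about the file until it is checked directly.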

In [10]:
df.tail()

Unnamed: 0,org,repo,number,issue_date,comment_creation_date,comments,user
17651,jupyter,try.jupyter.org,18,2016-10-06 16:06:13,2016-10-07 13:18:54,"weird...\n\n<img width=""985"" alt=""screen shot ...",rgbkrk
17652,jupyter,try.jupyter.org,18,2016-10-06 16:06:13,2016-10-07 13:19:57,Version 53.0.2785.116 (64-bit)\n\nLooks like a...,rgbkrk
17653,jupyter,try.jupyter.org,18,2016-10-06 16:06:13,2016-10-07 13:21:28,And I've updated to Version 53.0.2785.143 (64-...,rgbkrk
17654,jupyter,try.jupyter.org,18,2016-10-06 16:06:13,2016-10-07 13:22:50,"I've got 54.0.2840.50 (beta channel), so maybe...",rgbkrk
17655,jupyter,try.jupyter.org,18,2016-10-06 16:06:13,2016-10-07 13:32:10,![screen shot 2016-10-07 at 9 32 00 am](https:...,rgbkrk


There's nothing unusual with the end of the dataset. Let's look at a random spot like row 17356:

In [11]:
df.loc[17356,:]

org                                                                jupyter
repo                                                                 tmpnb
number                                                                 196
issue_date                                             2015-12-18 01:13:25
comment_creation_date                                  2015-12-18 03:38:47
comments                 The build will remain broken until we have a `...
user                                                          captainsafia
Name: 17356, dtype: object

Based on looking at the file in Excel, there appears to be a problem with the text of https://github.com/jupyter/tmpnb/issues/208 and https://github.com/jupyter/tmpnb/issues/235. Both of these issues contain code, and code is formatted in GitHub with a `backtick`.

#### How many missing values are in the `comments` column of the dataset?

In [12]:
print(np.count_nonzero(df["comments"].isnull()))

0


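The answer is "0". Does this mean that Pandas already stripped out all the rows with missing values? Not necessarily: the `astype('str')` calls made when loading the data can turn a missing value into the literal string `'nan'`, which `isnull()` does not count. Let's look at it another way: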

In [13]:
df.isnull().sum()

org                      0
repo                     0
number                   0
issue_date               0
comment_creation_date    0
comments                 0
user                     0
dtype: int64
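
A zero count here may be an artifact of the earlier `astype('str')` calls rather than proof that nothing is missing. The sketch below, on a made-up three-element Series, compares the missing-value count before and after the cast. How `astype('str')` treats `NaN` has varied across pandas releases, so the "after" numbers are deliberately printed rather than assumed.

```python
import numpy as np
import pandas as pd

# Made-up data with exactly one missing comment.
s = pd.Series(["Thanks !", np.nan, "same issue"])

# Count missing values BEFORE any string cast.
missing_before = int(s.isnull().sum())

# Cast to string, as the loading cell does, then count again.
as_text = s.astype("str")
missing_after = int(as_text.isnull().sum())

# If the cast produced the literal text 'nan', this catches it.
literal_nan = int((as_text == "nan").sum())

print("missing before cast:", missing_before)
print("missing after cast:", missing_after)
print("literal 'nan' strings:", literal_nan)
```

Checking `isnull()` before casting, or searching for the literal string `'nan'` afterwards, would give a more trustworthy answer for the real dataset.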

There are no summary statistics to run on this dataset because all but one column has a soon-to-be string datatype.

#### Extract the subset of rows where the comment is shorter than 5 characters

As a first step, display the `comments` column:

In [14]:
df.comments

0                                               Thanks !\n
1        Oops. i got it. I have to uninstall ipython3 a...
2                                             same issue\n
3        FWIW a workaround is to share from Google Driv...
4        At some point, I'll probably hack on a Rethink...
5        Also, if you just want storage (not the realti...
6        Was looking for the realtime. :-) \nTy for the...
7        Note that SageMathCloud also lets you collabor...
8        Thanks for bringing that up! My head was burie...
9        @rgbkrk let me know if you ever want to chat m...
10       Note that in addition to using SageMathCloud a...
11       http://nbviewer.ipython.org/gist/damianavila/d...
12                                                 Cute.\n
13       Here is the sticker that Leah did for NumFOCUS...
14       We already have Jupyter stickers. @takluyver i...
15       I am talking with Leah and Thomas about this o...
16                Yes you should. Pinged you in gitter. 
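
The cell above only displays the column. A minimal sketch of the actual filter, using a boolean mask built from `.str.len()`, is shown here on a made-up frame. The user names `user_a` and `user_b` are hypothetical, and stripping whitespace before measuring is an assumption about what "length" should mean.

```python
import pandas as pd

# Made-up stand-in for df; "user_a" and "user_b" are hypothetical users.
sample = pd.DataFrame({
    "comments": ["Thanks !\n", "ok\n", "same issue\n", "+1"],
    "user": ["wernight", "user_a", "magdalenat", "user_b"],
})

# Assumption: trailing newlines should not count towards the length.
lengths = sample["comments"].str.strip().str.len()

# Keep only the rows whose comment is shorter than 5 characters.
short = sample[lengths < 5]
print(short)
```

On the real data the equivalent would be `df[df['comments'].str.strip().str.len() < 5]`.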

In [22]:
counts_per_repo = df.groupby(['org', 'repo']).count()

In [23]:
print(counts_per_repo)

                            number  issue_date  comment_creation_date  \
org     repo                                                            
jupyter colaboratory            11          11                     11   
        design                 585         585                    585   
        docker-demo-images     306         306                    306   
        jupyter-drive           97          97                     97   
        jupyter.github.io      613         613                    613   
        jupyter_client         928         928                    928   
        jupyter_core           265         265                    265   
        nature-demo             12          12                     12   
        nbcache                  5           5                      5   
        nbconvert-examples       7           7                      7   
        nbformat               365         365                    365   
        nbgrader              1287        1287     
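
Because `count()` reports the non-null count for every column, the table repeats the same number across columns. A possible alternative is `groupby(...).size()`, which returns one count per group. The sketch below uses a made-up four-row frame.

```python
import pandas as pd

# Made-up stand-in for df.
sample = pd.DataFrame({
    "org": ["jupyter"] * 4,
    "repo": ["colaboratory", "colaboratory", "tmpnb", "design"],
    "comments": ["a", "b", "c", "d"],
})

# One number per (org, repo) pair, largest first.
per_repo = (
    sample.groupby(["org", "repo"])
    .size()
    .sort_values(ascending=False)
)
print(per_repo)
```

Note that `size()` counts all rows in a group, while `count()` skips missing values, so the two can differ when a column has gaps.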

In [21]:
counts_per_user = df.groupby(['user', 'repo']).count()
print(counts_per_user)

                                     org  number  issue_date  \
user              repo                                         
3rd3              notebook             2       2           2   
8leggedunicorn    notebook             7       7           7   
AC130USpectre     notebook             7       7           7   
AFJay             notebook            22      22          22   
AGS-Knight        notebook             9       9           9   
Achifaifa         notebook            11      11          11   
AdityaShirvalkar  notebook             2       2           2   
AgazW             notebook             2       2           2   
AidanRocke        nbviewer             2       2           2   
Akelio-zhang      notebook            10      10          10   
AkhilNairAmey     notebook             9       9           9   
AlJohri           notebook             2       2           2   
Alnilam           notebook             1       1           1   
Amasics           notebook             2

In [None]:
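#import seaborn as sns
# seaborn's function is scatterplot; reset_index() turns 'user' and 'repo' back into columns
#sns.scatterplot(x='user', y='number', data=counts_per_user.reset_index())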

### Statistics on strings

Descriptive statistics:
- Avg comment length
- Standard deviation for comment length
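
The two descriptive statistics above could be computed roughly as follows. This is a sketch on three comments copied from the output shown earlier, with length measured in characters including the trailing newline.

```python
import pandas as pd

# Three short comments taken from the output shown earlier.
comments = pd.Series(["Thanks !\n", "same issue\n", "Cute.\n"])

# Length of each comment in characters (the trailing newline counts).
lengths = comments.str.len()

mean_len = lengths.mean()
std_len = lengths.std()  # sample standard deviation (ddof=1) by default

print(lengths.tolist())
print("mean:", mean_len)
print("std:", std_len)
```

`lengths.describe()` would also report the minimum, maximum, and quartiles in one call.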

Correlations: 
- username vs. comment length
- username vs. comment count
- repo vs. comment count
- repo vs. comment length
- counts of comments over time
- times of year where specific users comment more or less
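
Several of the items above reduce to grouping by a key and aggregating. The sketch below shows per-user comment count and mean length, plus comment counts per month, on a made-up four-row frame. The comment texts are invented for illustration, and the dates are assumed to have been converted to datetimes already.

```python
import pandas as pd

# Made-up stand-in for df; the comment texts are invented.
sample = pd.DataFrame({
    "user": ["rgbkrk", "rgbkrk", "magdalenat", "magdalenat"],
    "comment_creation_date": pd.to_datetime([
        "2016-10-07 13:18:54", "2016-10-07 13:19:57",
        "2016-03-08 14:07:43", "2016-06-23 17:13:57",
    ]),
    "comments": ["weird...", "same", "same issue\n", "FWIW"],
})

# Username vs. comment count and comment length.
per_user = (
    sample.assign(length=sample["comments"].str.len())
    .groupby("user")["length"]
    .agg(["count", "mean"])
)
print(per_user)

# Counts of comments over time, bucketed by calendar month.
per_month = sample.groupby(
    sample["comment_creation_date"].dt.to_period("M")
).size()
print(per_month)
```

Swapping `"user"` for `"repo"` gives the repo-level versions, and grouping by both `"user"` and the month would be one way to look at when specific users comment more or less.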