# In this notebook we are going to analyze data from the Make School Summer Academy

## We are going to find the NPS (Net Promoter Score) of the Summer Academy.
## For that we need to check how likely someone is to recommend the Summer Academy to a friend.
#### People with a score of 0-6 are considered detractors.
#### People with a score of 7-8 are considered passives.
#### People with a score of 9-10 are considered promoters.
#### The NPS is the share of promoters minus the share of detractors, which is the same as averaging scores of -1, 0, and +1 for the three groups.
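As a quick sanity check of the definition, here is a minimal sketch in plain Python that uses the usual NPS bands (0-6 detractor, 7-8 passive, 9-10 promoter). The function name `net_promoter_score` and the ratings are made up for illustration and are not part of the survey data.

```python
def net_promoter_score(ratings):
    """Return the NPS as a fraction between -1 and 1 for 0-10 ratings."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)   # scores of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # scores of 0 through 6
    # Passives (7-8) only count toward the denominator
    return (promoters - detractors) / len(ratings)

# Made-up ratings: 3 promoters, 2 passives, 1 detractor
example_ratings = [10, 9, 9, 8, 7, 3]
example_nps = net_promoter_score(example_ratings)
print(example_nps)
```

Multiplying the result by 100 gives the score on the more common -100 to 100 scale.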

### Let's start by importing what we need

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import glob
%matplotlib inline

## We are going to work with two datasets: 2016 and 2017

### Since the one from 2017 is cleaner we will get started by importing that one

In [2]:
df_2017 = pd.read_csv('./2017/Student Feedback Surveys-Superview.csv')
df_2017.head()

Unnamed: 0,ID,Location,Track,Week,Rating (Num),Schedule Pacing
0,134,San Francisco,"Apps, Explorer",Week 1,3,Just right
1,36,Los Angeles,Apps,Week 1,4,A little too fast
2,117,San Francisco,Games,Week 1,4,Way too slow
3,253,,,Week 2,4,A little too fast
4,350,New York City,"Apps, Explorer",Week 1,4,Just right


### The column we care about is the rating so let's check if anything about it is out of the ordinary

In [16]:
df_2017['Rating (Num)'].unique()

array(['3', '4', '5', '6', '7', '8', '9', '10', '0', '1', '2', '#ERROR!'],
      dtype=object)

### Two things stand out to me:
#### 1. Some of the values in the dataset are '#ERROR!'. We should make sure to get rid of those values.
#### 2. Our numbers are stored as strings, so we should convert them to integers in order to work with them.

In [3]:
# .copy() gives us an independent DataFrame, so the assignment below
# does not trigger pandas' SettingWithCopyWarning
df_2017_curated = df_2017[df_2017['Rating (Num)'] != '#ERROR!'].copy()
df_2017_curated['Rating (Num)'] = df_2017_curated['Rating (Num)'].astype(int)
df_2017_curated['Rating (Num)'].mean()

8.415172413793103
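An alternative way to do the same cleanup, in case other non-numeric markers show up besides '#ERROR!', is to let `pd.to_numeric` coerce anything unparsable to NaN and then drop those rows. This is only a sketch on a made-up sample that mimics the column above, not the notebook's actual data.

```python
import pandas as pd

# Made-up sample that mimics the string ratings and the '#ERROR!' marker
sample = pd.DataFrame({'Rating (Num)': ['3', '10', '#ERROR!', '8']})

# errors='coerce' turns anything non-numeric into NaN instead of raising
numeric = pd.to_numeric(sample['Rating (Num)'], errors='coerce')

# Keep only the rows that parsed, on an explicit copy, then store them as ints
cleaned = sample[numeric.notna()].copy()
cleaned['Rating (Num)'] = numeric[numeric.notna()].astype(int)

print(cleaned['Rating (Num)'].tolist())
```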

### We should create a function that tells us whether someone is a promoter, a passive, or a detractor

In [4]:
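def calculate_promo_score(data):
    """Map a 0-10 rating to -1 (detractor), 0 (passive), or 1 (promoter)."""
    if data <= 6:   # 0-6: detractor
        return -1
    if data <= 8:   # 7-8: passive
        return 0
    return 1        # 9-10: promoter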
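The same mapping can also be done on a whole column at once. The sketch below uses `np.select` with the same thresholds as the function above. The name `promo_score_vectorized` and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def promo_score_vectorized(ratings):
    """Map a Series of 0-10 ratings to -1 / 0 / 1 without a Python-level loop."""
    values = ratings.to_numpy()
    conditions = [values <= 6, values <= 8]
    choices = [-1, 0]
    # np.select uses the first condition that matches; the rest are promoters
    scores = np.select(conditions, choices, default=1)
    return pd.Series(scores, index=ratings.index)

# Made-up values sitting on the edges of each band
sample_ratings = pd.Series([0, 6, 7, 8, 9, 10])
sample_scores = promo_score_vectorized(sample_ratings)
print(sample_scores.tolist())
```

For a dataset of this size `.apply` is perfectly fine; the vectorized form mostly pays off on much larger columns.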

### Now let's use that function to create a new column for the promo score

In [5]:
df_2017_curated['promoter_score'] = df_2017_curated['Rating (Num)'].apply(calculate_promo_score)


### Great! Now let's see what the NPS was for all 2017 summer academies

In [6]:
df_2017_curated['promoter_score'].sum() / len(df_2017_curated)

0.4406896551724138

### 0.44 (an NPS of 44 on the usual -100 to 100 scale) is not bad, but that's across all the Summer Academy locations. Some have surely been more successful than others, so let's create a visualization and see how the different cities did
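One way to break the score down by city is to group on the location column and take the mean of the -1/0/1 scores in each group. The rows below are made up; only the column names `Location` and `promoter_score` mirror the 2017 DataFrame, so treat this as a sketch rather than the actual result.

```python
import pandas as pd

# Made-up rows; only the column names mirror the 2017 data
sample = pd.DataFrame({
    'Location': ['San Francisco', 'San Francisco', 'Los Angeles', 'Los Angeles'],
    'promoter_score': [1, 0, 1, -1],
})

# The mean of the -1/0/1 scores within each group is that group's NPS
nps_by_location = (
    sample.groupby('Location')['promoter_score']
    .mean()
    .sort_values()
)
print(nps_by_location)
```

Rows with a missing location are dropped by `groupby` by default, and calling `nps_by_location.plot(kind='barh')` would draw the result as a horizontal bar chart.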

## Now that we have seen how the 2017 Summer Academy did, let's deal with the messier dataset from the 2016 Summer Academy

### Let's start by importing all the weekly feedback files except for Week 8, since that one is in a quite different format

In [11]:
path = r'./2016'
allFiles = glob.glob(path + "/*.csv")
# Read each weekly CSV and collect the DataFrames in a list
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
# The files do not share the same columns; sort=True keeps the sorted
# column order shown below and avoids the pandas FutureWarning
df_2016_weeks1_6 = pd.concat(list_, sort=True)
df_2016_weeks1_6.head()

Unnamed: 0.1,How well are the tutorials paced?,How well is the schedule paced?,How would you rate your overall satisfaction with the Summer Academy this week?,Timestamp,Unnamed: 0,What track are you in?
0,,3,3,8/5/2016 1:39:41,,
1,,3,4,8/5/2016 1:40:47,,
2,,3,4,8/5/2016 1:40:50,,
3,,4,4,8/5/2016 1:42:44,,
4,,4,5,8/5/2016 1:45:13,,
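Once the weekly files are stacked, it is no longer obvious which file a row came from. A small sketch of one way to keep track of that is below. The function `load_weekly_feedback` and the `source_file` column are hypothetical helpers, not something the survey export provides.

```python
import glob
import os
import pandas as pd

def load_weekly_feedback(folder):
    """Read every CSV in `folder` and record which file each row came from."""
    frames = []
    for csv_path in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        frame = pd.read_csv(csv_path)
        # 'source_file' is a hypothetical helper column, not part of the survey
        frame['source_file'] = os.path.basename(csv_path)
        frames.append(frame)
    # sort=False keeps columns in first-seen order when headers differ
    return pd.concat(frames, ignore_index=True, sort=False)
```

Calling `load_weekly_feedback('./2016')` would return one DataFrame with an extra column naming the originating file. Note that `pd.concat` raises a `ValueError` if the folder contains no CSV files.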


### It looks like this dataset could be useful, but it doesn't have what we're looking for, which is how likely someone is to recommend the Summer Academy
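Rather than reading the long survey headers by eye, the check can be done in code by searching the column names for a keyword. The helper `find_columns` is a hypothetical name, the first frame reuses two headers from the output above, and the second frame is entirely made up.

```python
import pandas as pd

def find_columns(df, keyword):
    """Return the column names that contain `keyword`, ignoring case."""
    return [col for col in df.columns if keyword.lower() in str(col).lower()]

# Empty frame with two of the weekly headers shown above
weekly = pd.DataFrame(columns=['How well is the schedule paced?', 'Timestamp'])
# Made-up frame that does contain a recommendation question
other = pd.DataFrame(columns=['Would you recommend us?', 'Timestamp'])

weekly_matches = find_columns(weekly, 'recommend')
other_matches = find_columns(other, 'recommend')
print(weekly_matches)
print(other_matches)
```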

### Maybe we'll have better luck with the Week 8 Final Feedback

In [14]:
df_2016_week8 = pd.read_csv('Week 8 Feedback (2016, incomplete) - results.csv')
df_2016_week8.head()

Unnamed: 0,#,How likely is it that you would recommend the Make School Summer Academy to a friend?,location,track,Start Date (UTC),Submit Date (UTC),Network ID
0,00b836bda84e6bdbe780af97e249e59f,10,New York,summerApps,9/7/16 1:03,9/7/16 1:04,3212b7a834
1,39dde6dc0e1e375845d756fc7e39fc5f,10,San Francisco,summerIntro,9/7/16 1:03,9/7/16 1:04,f4954355aa
2,5e56b9de91670b308cb98dd2848b8739,10,New York,summerIntro,9/7/16 1:03,9/7/16 1:05,3d69ca289b
3,641081d05785b47a0f17448625da0d49,9,Sunnyvale,summerApps (4-week),9/7/16 1:04,9/7/16 1:06,261608f95d
4,c29bdd4f5678d78b450f4494e0f53c8c,3,San Francisco,summerIntro,9/7/16 1:04,9/7/16 1:11,d6672ddf6f


### This looks promising! It has the exact data we're looking for! Let's check if there's anything weird about it

In [17]:
df_2016_week8['How likely is it that you would recommend the Make School Summer Academy to a friend?'].unique()

array([10,  9,  3,  8,  6,  7,  4,  5])

### Looks great! We have no weird data and the ratings are already integers, so we can calculate the promoter score. Let's add a new column for just that.

In [19]:
df_2016_week8['promoter_score'] = df_2016_week8['How likely is it that you would recommend the Make School Summer Academy to a friend?'].apply(calculate_promo_score)
df_2016_week8.head()

Unnamed: 0,#,How likely is it that you would recommend the Make School Summer Academy to a friend?,location,track,Start Date (UTC),Submit Date (UTC),Network ID,promoter_score
0,00b836bda84e6bdbe780af97e249e59f,10,New York,summerApps,9/7/16 1:03,9/7/16 1:04,3212b7a834,1
1,39dde6dc0e1e375845d756fc7e39fc5f,10,San Francisco,summerIntro,9/7/16 1:03,9/7/16 1:04,f4954355aa,1
2,5e56b9de91670b308cb98dd2848b8739,10,New York,summerIntro,9/7/16 1:03,9/7/16 1:05,3d69ca289b,1
3,641081d05785b47a0f17448625da0d49,9,Sunnyvale,summerApps (4-week),9/7/16 1:04,9/7/16 1:06,261608f95d,1
4,c29bdd4f5678d78b450f4494e0f53c8c,3,San Francisco,summerIntro,9/7/16 1:04,9/7/16 1:11,d6672ddf6f,-1


### Now that we can, let's find out what the NPS was for the whole of the summer academy

In [20]:
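df_2016_week8['promoter_score'].sum() / len(df_2016_week8)

### Note that the denominator has to be the number of 2016 responses. Dividing by len(df_2017_curated) instead returns 0.0248, which mixes the two years and is not the 2016 NPS. With the 2016 row count as the denominator we can compare the result against the 2017 score of 0.44, and then visualize our data to see how the individual cities did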
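Because each promoter score is -1, 0, or 1, the NPS of a dataset is simply the mean of that column, and taking the mean keeps the numerator and the denominator tied to the same rows. A minimal sketch with made-up scores (not the real 2016 or 2017 values) for comparing the two years side by side:

```python
import pandas as pd

# Made-up promoter scores, not the real survey results
scores_by_year = {
    '2016': pd.Series([1, 1, -1, 0]),
    '2017': pd.Series([1, 0, 0, 1]),
}

# .mean() always divides by the length of the Series it is called on
nps_by_year = pd.Series(
    {year: scores.mean() for year, scores in scores_by_year.items()}
)
print(nps_by_year)
```

With the real data, the dictionary values would be `df_2016_week8['promoter_score']` and `df_2017_curated['promoter_score']`.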