### Prepping Data Challenge: Controlling Complaints (week 20)

### Challenge
This week's challenge continues the focus on calculations, this time with numbers. When using measures in data, it is very easy to make mistakes if you don't check that the values are realistic, especially when entering data or forming calculations. By creating your calculations in your data preparation tool, you might save the users of your data set a lot of work and reduce the skills required to use the data.

### Challenge
Control charts are a really useful way to visualise data, but in Tableau Desktop they often require a few Table Calculations, which puts people off creating them. This week you will build the calculations needed for a control chart without using table calculations in Desktop.

### Requirements
 - Input the data file
 - Create the mean and standard deviation for each Week
 - Create the following calculations for each of 1, 2 and 3 standard deviations:
   - The Upper Control Limit (mean+(n*standard deviation))
   - The Lower Control Limit (mean-(n*standard deviation))
   - Variation (Upper Control Limit - Lower Control Limit)
 - Join the original data set back on to these results
 - Assess whether each of the complaint values for each Department, Week and Date is within or outside of the control limits
 - Output only Outliers
 - Produce a separate output worksheet (or csv) for each of 1, 2 and 3 standard deviations and remove the irrelevant fields for that output.

Each worksheet should contain 10 fields:
 - Variation
 - Outlier
 - Lower Control Limit
 - Upper Control Limit
 - Standard Deviation
 - Mean
 - Date
 - Week
 - Complaints
 - Department
 
For each of the outputs here are the number of rows they should have:
 - 1 Standard Deviation - 24 rows (25 including headers)
 - 2 Standard Deviations - 5 rows (6 including headers)
 - 3 Standard Deviations - 2 rows (3 including headers)
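
The control limit formulas in the requirements can be checked by hand before touching the real file. The sketch below uses only the standard library and a made-up list of complaint counts (the numbers and the helper name `control_limits` are assumptions for illustration, not part of the challenge data).

```python
import statistics

# Hypothetical complaint counts for a single week (not taken from the challenge file)
complaints = [10, 12, 14, 16, 18]

mean = statistics.mean(complaints)
stdev = statistics.stdev(complaints)  # sample standard deviation (divides by n - 1)


def control_limits(mean, stdev, n):
    """Return (lower, upper, variation) for n standard deviations."""
    upper = mean + n * stdev
    lower = mean - n * stdev
    return lower, upper, upper - lower


for n in (1, 2, 3):
    lower, upper, variation = control_limits(mean, stdev, n)
    print(f"n={n}: lower={lower:.2f}, upper={upper:.2f}, variation={variation:.2f}")
```

Because the limits are symmetric around the mean, Variation always works out to 2 * n * standard deviation, which is a quick sanity check on the calculated columns.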

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Input the data

df = pd.read_csv('WK20-Prep Air Complaints - Complaints per Day.csv',
              parse_dates=['Date'])

In [3]:
df.head()

Unnamed: 0,Date,Week,Complaints,Department
0,2021-04-19,16,42,Ticketing
1,2021-04-20,16,32,Ticketing
2,2021-04-21,16,51,Ticketing
3,2021-04-22,16,48,Ticketing
4,2021-04-23,16,34,Ticketing


In [4]:
#Create the mean and standard deviation for each Week
df['Mean'] = df.groupby(['Week'])['Complaints'].transform('mean')
df['Stdev'] = df.groupby(['Week'])['Complaints'].transform('std')
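
One thing worth knowing about the cell above: pandas `std` defaults to the sample standard deviation (`ddof=1`). The challenge text does not say whether sample or population standard deviation is intended, so the sketch below simply shows both side by side on a hypothetical toy frame (the data and the column names `Stdev sample` and `Stdev population` are assumptions).

```python
import pandas as pd

# Hypothetical toy data, not the challenge file
toy = pd.DataFrame({
    "Week": [1, 1, 1, 2, 2, 2],
    "Complaints": [10, 20, 30, 5, 5, 8],
})

grouped = toy.groupby("Week")["Complaints"]

# Default behaviour: sample standard deviation (ddof=1), as in the cell above
toy["Stdev sample"] = grouped.transform("std")

# Population standard deviation (ddof=0); the scalar is broadcast to every row of the group
toy["Stdev population"] = grouped.transform(lambda s: s.std(ddof=0))

print(toy)
```

If the row counts of the outputs do not match the expected ones, the choice of `ddof` is one of the first things to check.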

In [5]:
#Create the following calculations for each of 1, 2 and 3 standard deviations:

# duplicate the weekly dataframe for each number of standard deviations
df_1 = df.copy()
df_2 = df.copy()
df_3 = df.copy()

In [6]:
#The Upper Control Limit (mean+(n*standard deviation))
#The Lower Control Limit (mean-(n*standard deviation))
#Variation (Upper Control Limit - Lower Control Limit)
df_1['Upper Control Limit'] = df_1['Mean'] + (1 * df_1['Stdev'])
df_1['Lower Control Limit'] = df_1['Mean'] - (1 * df_1['Stdev'])
df_1['Variation_1'] = df_1['Upper Control Limit'] - df_1['Lower Control Limit']

In [7]:
# Control limits for 2 standard deviations
df_2['Upper Control Limit'] = df_2['Mean'] + (2 * df_2['Stdev'])
df_2['Lower Control Limit'] = df_2['Mean'] - (2 * df_2['Stdev'])
df_2['Variation_2'] = df_2['Upper Control Limit'] - df_2['Lower Control Limit']

In [8]:
# Control limits for 3 standard deviations
df_3['Upper Control Limit'] = df_3['Mean'] + (3 * df_3['Stdev'])
df_3['Lower Control Limit'] = df_3['Mean'] - (3 * df_3['Stdev'])
df_3['Variation_3'] = df_3['Upper Control Limit'] - df_3['Lower Control Limit']
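
The three cells above repeat the same arithmetic with a different multiplier. A possible alternative, sketched here on a hypothetical stand-in frame, is one helper called once per value of n. The helper name `add_control_limits` and the toy values are assumptions; the `Mean` and `Stdev` column names follow the notebook.

```python
import pandas as pd


def add_control_limits(frame, n):
    """Return a copy of frame with control limit columns for n standard deviations."""
    out = frame.copy()
    out["Upper Control Limit"] = out["Mean"] + n * out["Stdev"]
    out["Lower Control Limit"] = out["Mean"] - n * out["Stdev"]
    out["Variation"] = out["Upper Control Limit"] - out["Lower Control Limit"]
    return out


# Hypothetical stand-in for the weekly frame built earlier
weekly = pd.DataFrame({
    "Complaints": [42, 57],
    "Mean": [30.0, 30.0],
    "Stdev": [13.0, 13.0],
})

# One frame per number of standard deviations, keyed by n
limits = {n: add_control_limits(weekly, n) for n in (1, 2, 3)}
print(limits[2])
```

Keeping the frames in a dictionary keyed by n avoids the numbered column suffixes and makes the later steps easy to loop over.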

In [9]:
#Assess whether each of the complaint values for each Department, Week and Date is within or outside of the control limits
df_1['Outlier?_1'] = np.where((df_1['Complaints'] >= df_1['Lower Control Limit']) &
                              (df_1['Complaints'] <= df_1['Upper Control Limit']), 'Within', 'Outside')
df_2['Outlier?_2'] = np.where((df_2['Complaints'] >= df_2['Lower Control Limit']) &
                              (df_2['Complaints'] <= df_2['Upper Control Limit']), 'Within', 'Outside')
df_3['Outlier?_3'] = np.where((df_3['Complaints'] >= df_3['Lower Control Limit']) &
                              (df_3['Complaints'] <= df_3['Upper Control Limit']), 'Within', 'Outside')

#Output only outliers
df_1 = df_1[df_1['Outlier?_1'] == 'Outside']
df_2 = df_2[df_2['Outlier?_2'] == 'Outside']
df_3 = df_3[df_3['Outlier?_3'] == 'Outside']
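
The flag-then-filter step can also be written with `Series.between`, which is inclusive on both ends by default and so matches the `>=` / `<=` comparison used above. This is a minimal sketch on made-up values; the helper name `outliers_only` and the toy frame are assumptions.

```python
import pandas as pd


def outliers_only(frame):
    """Flag each row as Within/Outside the control limits and keep only the outliers."""
    within = frame["Complaints"].between(
        frame["Lower Control Limit"], frame["Upper Control Limit"]
    )
    out = frame.copy()
    out["Outlier"] = within.map({True: "Within", False: "Outside"})
    return out[out["Outlier"] == "Outside"]


# Hypothetical rows: a value exactly on a limit counts as Within
toy = pd.DataFrame({
    "Complaints": [42, 57, 17],
    "Lower Control Limit": [17.0, 17.0, 17.0],
    "Upper Control Limit": [43.0, 43.0, 43.0],
})

print(outliers_only(toy))
```

After filtering the real data, comparing `len(df_1)`, `len(df_2)` and `len(df_3)` with the 24, 5 and 2 rows listed in the requirements is a quick way to confirm the logic.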

In [10]:
df_1.head()

Unnamed: 0,Date,Week,Complaints,Department,Mean,Stdev,Upper Control Limit,Lower Control Limit,Variation_1,Outlier?_1
2,2021-04-21,16,51,Ticketing,29.714286,12.958174,42.67246,16.756111,25.916349,Outside
3,2021-04-22,16,48,Ticketing,29.714286,12.958174,42.67246,16.756111,25.916349,Outside
5,2021-04-24,16,57,Ticketing,29.714286,12.958174,42.67246,16.756111,25.916349,Outside
7,2021-04-26,17,14,Ticketing,37.809524,16.621128,54.430652,21.188396,33.242257,Outside
9,2021-04-28,17,57,Ticketing,37.809524,16.621128,54.430652,21.188396,33.242257,Outside


In [11]:
df_2.head()

Unnamed: 0,Date,Week,Complaints,Department,Mean,Stdev,Upper Control Limit,Lower Control Limit,Variation_2,Outlier?_2
5,2021-04-24,16,57,Ticketing,29.714286,12.958174,55.630635,3.797937,51.832698,Outside
15,2021-04-05,18,76,Ticketing,53.333333,9.44634,72.226013,34.440654,37.785359,Outside
33,2021-05-22,20,68,Ticketing,28.904762,12.234806,53.374374,4.43515,48.939224,Outside
45,2021-04-29,17,84,Onboard Experience,37.809524,16.621128,71.05178,4.567267,66.484513,Outside
60,2021-05-14,19,230,Onboard Experience,72.47619,37.661146,147.798482,-2.846101,150.644583,Outside


In [12]:
df_3.head()

Unnamed: 0,Date,Week,Complaints,Department,Mean,Stdev,Upper Control Limit,Lower Control Limit,Variation_3,Outlier?_3
33,2021-05-22,20,68,Ticketing,28.904762,12.234806,65.60918,-7.799656,73.408836,Outside
60,2021-05-14,19,230,Onboard Experience,72.47619,37.661146,185.459628,-40.507247,225.966875,Outside


In [13]:
#OUTPUT
df_1.to_csv('wk20-output1.csv', index=False)
df_2.to_csv('wk20-output2.csv', index=False)
df_3.to_csv('wk20-output3.csv', index=False)
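
The files written above keep the working column names (`Stdev`, `Variation_1`, `Outlier?_1` and so on), while the requirements list the fields as Variation, Outlier and Standard Deviation. One way to line them up, sketched here under the assumption that the listed order is also the desired column order, is a small rename-and-select helper. The helper name `tidy_output` is hypothetical; the sample row is the first `df_3` outlier shown above.

```python
import pandas as pd

# Field names and order taken from the requirements list
REQUIRED_FIELDS = [
    "Variation", "Outlier", "Lower Control Limit", "Upper Control Limit",
    "Standard Deviation", "Mean", "Date", "Week", "Complaints", "Department",
]


def tidy_output(frame, n):
    """Rename the working columns for n standard deviations and keep the 10 required fields."""
    renamed = frame.rename(columns={
        f"Variation_{n}": "Variation",
        f"Outlier?_{n}": "Outlier",
        "Stdev": "Standard Deviation",
    })
    return renamed[REQUIRED_FIELDS]


# One row copied from the df_3 output above, used as a stand-in for the full frame
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2021-05-22"]),
    "Week": [20],
    "Complaints": [68],
    "Department": ["Ticketing"],
    "Mean": [28.904762],
    "Stdev": [12.234806],
    "Upper Control Limit": [65.60918],
    "Lower Control Limit": [-7.799656],
    "Variation_3": [73.408836],
    "Outlier?_3": ["Outside"],
})

tidy = tidy_output(sample, 3)
print(tidy.columns.tolist())
```

In the notebook this could be applied just before writing, for example `tidy_output(df_1, 1).to_csv('wk20-output1.csv', index=False)`.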