# Code QA and Best Practice

Guidance on quality assurance of code for analysis and research has been developed by the Government Analysis Function and the GSS, and can be found here: 
https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html <br />
The aim of this Jupyter notebook is to demonstrate practical examples of these tips for best practice. The examples are in Python, but the principles are equally applicable to R and other languages.


## The Main Ideas

When you review code, as with other analysis, you need to check that it is doing the right thing and producing the correct output. But it is just as important that the code is clear and easy to understand: this makes it easier to spot mistakes, and to test, update or reproduce the code and analysis.

## Some Example Code

This code will load, clean and plot sunspot number data.

### A bad example
    

In [None]:
from pandas import *

df1=read_csv('Data/SN_d_tot_V2.0_18.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
df1.columns=df1.columns.str.lower()
df1=df1[['year','month','day','daily_total_sunspot_no','no_of_obs']]
df1=df1.assign(datetime=to_datetime(df1[['year','month','day']]))
df1=df1[df1['daily_total_sunspot_no'].notnull()]
df1=df1[~(df1['daily_total_sunspot_no']==-1)]

df2=read_csv('Data/SN_d_tot_V2.0_19.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
df2.columns=df2.columns.str.lower()
df2=df2[['year','month','day','daily_total_sunspot_no','no_of_obs']]
df2=df2.assign(datetime=to_datetime(df2[['year','month','day']]))
df2=df2[df2['daily_total_sunspot_no'].notnull()]
df2=df2[(df2['daily_total_sunspot_no']==-1)]

df3=read_csv('Data/SN_d_tot_V2.0_20.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
df3.columns=df3.columns.str.lower()
df3=df3[['year','month','day','daily_total_sunspot_no','no_of_obs']]
df3=df3.assign(datetime=to_datetime(df3[['year','month','day']]))
df3=df3[df3['daily_total_sunspot_no'].notnull()]
df3=df3[~(df3['daily_total_sunspot_no']==1)]

df=df1.append(df2).append(df3)

from plotly.express import *
fig=line(df,x=df['datetime'],y=df['daily_total_sunspot_no'])
fig.show()

This code is repetitive, messy and unclear. And it looks like something's gone wrong with the plot! It's difficult to pick out errors or quickly figure out what the code is trying to do. How can we improve it?

* Modular Code
* Readable Code
* Documentation
* Version Control
* Peer Review

### Modular Code

https://best-practice-and-impact.github.io/qa-of-code-guidance/modular_code.html <br />
https://best-practice-and-impact.github.io/qa-of-code-guidance/project_structure.html

Where code is re-used, it can be written as functions instead, so it is reused in a consistent way.
Each function should be self-contained: it should not depend on, or affect, variables that haven't been passed in as arguments.
It should also be simple: several small functions can be combined into larger ones.

For example:

In [None]:
def function(df,cols):
    
    df.columns=df.columns.str.lower()
    df = df.drop(df.columns[cols],axis=1)
    return df

def function2(df, cols):
    df =df.assign(datetime=to_datetime(df[cols]))
    
    return df

def function3(df,col):
    df=df[df[col].notnull()]
    df=df[~(df[col]==-1)]
    return df

def function4(df,cols1,cols2, col):
    
    df = function(df,cols1)
    df=function2(df,cols2)
    df =function3(df,col)
    
    return df
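
To illustrate what "self-contained" means in practice, here is a minimal sketch (not part of the original notebook) of a function that returns a new dataframe rather than changing the one passed in. The name `lower_column_names` and the tiny example dataframe are made up for illustration.

```python
import pandas as pd


def lower_column_names(df):
    """Return a copy of df with lower-case column names (hypothetical helper)."""
    # rename() returns a new dataframe, so the caller's dataframe is left untouched
    return df.rename(columns=str.lower)


raw = pd.DataFrame({'Year': [1818], 'Month': [1]})
cleaned = lower_column_names(raw)

print(list(raw.columns))      # column names of the original dataframe
print(list(cleaned.columns))  # column names of the returned copy
```

Because the input is left untouched, the function can be re-run or tested in isolation without worrying about what state earlier cells left behind. By contrast, assigning to `df.columns` inside a function changes the caller's dataframe as a side effect.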

In [None]:
from pandas import *

df1=read_csv('Data/SN_d_tot_V2.0_18.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
df1=function4(df1,[3, 5, 7],['year','month','day'],'daily_total_sunspot_no')

df2=read_csv('Data/SN_d_tot_V2.0_19.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
df2=function4(df2,[3, 5, 7],['year','month','day'],'daily_total_sunspot_no')

df3=read_csv('Data/SN_d_tot_V2.0_20.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
df3=function4(df3,[3, 5, 6],['year','month','day'],'daily_total_sunspot_no')

df=df1.append(df2).append(df3)

from plotly.express import *
fig=line(df,x=df['datetime'],y=df['daily_total_sunspot_no'])
fig.show()

For larger, more complex code, it is best to split the code up into different scripts (files) and modules.
A module is a script containing a group of related functions that you use in the main script.
Make sure your scripts and modules have descriptive names, ideally without spaces.

For example, I would group together the above functions into a module called 'sunspot_data_cleaning_functions.py' and call that module from the main script.
I could then write a separate module for functions to perform analysis on the cleaned data.

In [None]:
from pandas import *
from sunspot_data_cleaning_functions import *

df1=read_csv('Data/SN_d_tot_V2.0_18.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
df1=function4(df1,[3, 5, 7],['year','month','day'],'daily_total_sunspot_no')

df2=read_csv('Data/SN_d_tot_V2.0_19.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
df2=function4(df2,[3, 5, 7],['year','month','day'],'daily_total_sunspot_no')

df3=read_csv('Data/SN_d_tot_V2.0_20.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
df3=function4(df3,[3, 5, 6],['year','month','day'],'daily_total_sunspot_no')

df=df1.append(df2).append(df3)

from plotly.express import *
fig=line(df,x=df['datetime'],y=df['daily_total_sunspot_no'])
fig.show()

### Readable Code

https://best-practice-and-impact.github.io/qa-of-code-guidance/readable_code.html <br />
The above example now looks much tidier, but it could still be improved for readability.

You should give your functions and variables descriptive names, so it's easier to understand what they are doing, but keep them short enough that they don't become unwieldy.
Importing modules under an alias (rather than using `import *`) makes it clear which module each function comes from.

Most coding languages have a style guide with a recommended format, e.g. PEP 8 for Python or the tidyverse style guide for R. Packages called linters can automatically check that your code follows a particular style, and formatters go a step further and automatically fix any code that has diverged from it.
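
As a small illustration of what a style guide changes, the sketch below shows the same calculation written twice: once ignoring PEP 8 and once following it. Both function names are made up for this example. A linter such as flake8 would flag the spacing and layout of the first version, and a formatter such as black would typically fix them automatically, though choosing a better name is still up to you.

```python
# Before: valid Python, but cramped and hard to read
def Mean(x,y): return (x+y)/2


# After: the same logic, laid out following PEP 8
def mean_of_pair(first, second):
    """Return the arithmetic mean of two numbers."""
    return (first + second) / 2


# Both versions give the same answer; only the readability differs
print(Mean(2, 4) == mean_of_pair(2, 4))
```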

Avoid repeating yourself! I will wrap the above code in a for loop to be certain the function is applied consistently and avoid errors.


In [None]:
import pandas as pd
import plotly.express as px
import sunspot_data_cleaning_functions as ss_clean

data_file_path = 'Data/'
file_list = ['SN_d_tot_V2.0_18.csv', 'SN_d_tot_V2.0_19.csv', 'SN_d_tot_V2.0_20.csv']

column_names = ['Year', 'Month', 'Day', 'Fraction_Of_Year', 'Daily_Total_Sunspot_No', 'Daily_St_Dev_Sunspot_No',
                'No_Of_Obs', 'Definitive_Or_Provisional']
cols_to_drop = ['fraction_of_year', 'daily_st_dev_sunspot_no', 'definitive_or_provisional'] 
datetime_cols = ['year','month','day']
not_null_col = 'daily_total_sunspot_no'


cleaned_frames = []

for file in file_list:
    raw_data = pd.read_csv(data_file_path+file, sep = ';', names = column_names)
    clean_data = ss_clean.clean_raw_data(raw_data, cols_to_drop, datetime_cols, not_null_col)
    cleaned_frames.append(clean_data)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
all_cleaned_data = pd.concat(cleaned_frames)
    
### previous version
# DO NOT RUN
# df1=read_csv('SN_d_tot_V2.0_18.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
# df1=function4(df1,[3, 5, 7],['year','month','day'],'daily_total_sunspot_no')

# df2=read_csv('SN_d_tot_V2.0_19.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
# df2=function4(df2,[3, 5, 7],['year','month','day'],'daily_total_sunspot_no')

# df3=read_csv('SN_d_tot_V2.0_20.csv',sep=';',names=['Year','Month','Day','Fraction_Of_Year','Daily_Total_Sunspot_No','Daily_St_Dev_Sunspot_No','No_Of_Obs','Definitive_Or_Provisional'])
# df3=function4(df3,[3, 5, 6],['year','month','day'],'daily_total_sunspot_no')

# df=df1.append(df2).append(df3)


fig = px.line(all_cleaned_data, x = all_cleaned_data['datetime'], y = all_cleaned_data['daily_total_sunspot_no'])
fig.show()



In [None]:
aggregated_monthly = all_cleaned_data.groupby(['year', 'month']).agg({'daily_total_sunspot_no': 'mean', 'no_of_obs':'sum'}).reset_index()
aggregated_monthly.rename(columns = {'daily_total_sunspot_no':'monthly_mean_ssn'}, inplace = True)
datetime_cols = ['month', 'year']
monthly_ssn = ss_clean.create_datetime_column_flex(aggregated_monthly, datetime_cols)

fig = px.line(monthly_ssn, x = monthly_ssn['datetime'], y = monthly_ssn['monthly_mean_ssn'])
fig.update_layout(title='Monthly mean sunspot numbers 1818 - 2022',
                   xaxis_title='Year',
                   yaxis_title='Monthly Mean Sunspot Number')
fig.show()

### Documentation

https://best-practice-and-impact.github.io/qa-of-code-guidance/code_documentation.html <br />
https://best-practice-and-impact.github.io/qa-of-code-guidance/project_documentation.html

Comment your code! But don't overdo it. Comments describing what is happening may be useful for novice coders, but are overkill for others. Comments explaining why you've done something are usually more useful.

Try not to leave unused, commented-out code in your scripts: it adds confusion and creates the risk that it will be accidentally uncommented and run.

In [None]:
# import necessary functions
import pandas as pd
import plotly.express as px

# bespoke module of functions to clean data
import sunspot_data_cleaning_functions as ss_clean

# define variables
data_file_path = 'Data/'
file_list = ['SN_d_tot_V2.0_18.csv', 'SN_d_tot_V2.0_19.csv', 'SN_d_tot_V2.0_20.csv']

column_names = ['Year', 'Month', 'Day', 'Fraction_Of_Year', 'Daily_Total_Sunspot_No', 'Daily_St_Dev_Sunspot_No',
                'No_Of_Obs', 'Definitive_Or_Provisional']
cols_to_drop = ['fraction_of_year', 'daily_st_dev_sunspot_no', 'definitive_or_provisional'] 
# cols_to_drop = raw_data.columns[[3, 5, 7]] # use this to select by column index instead of column label
datetime_cols = ['year','month','day']
not_null_col = 'daily_total_sunspot_no'

# collect each cleaned dataframe in a list
cleaned_frames = []

# load and clean each file, then combine into a single df
# Used a for loop here to avoid repetition and errors, will also make it easier to update when more data available
for file in file_list:
    raw_data = pd.read_csv(data_file_path+file, sep = ';', names = column_names)
    clean_data = ss_clean.clean_raw_data(raw_data, cols_to_drop, datetime_cols, not_null_col)
    cleaned_frames.append(clean_data)

# combine with pd.concat (DataFrame.append was removed in pandas 2.0)
all_cleaned_data = pd.concat(cleaned_frames)

# plot the data
fig = px.line(all_cleaned_data, x = all_cleaned_data['datetime'], y = all_cleaned_data['daily_total_sunspot_no'])
fig.update_layout(title='Daily Total Sunspot Numbers 1818 - 2022',
                   xaxis_title='Year',
                   yaxis_title='Daily Total Sunspot Number')
fig.show()

In [None]:
# Noise in the data makes it hard to view trends - aggregating the data by month or year would help

import sunspot_analysis_functions as ss_analyse

# Use function to calculate monthly averages
monthly_ssn = ss_analyse.monthly_averages(all_cleaned_data)

# Plot figure
fig = px.line(monthly_ssn, x = monthly_ssn['datetime'], y = monthly_ssn['monthly_mean_ssn'])
fig.update_layout(title='Monthly mean sunspot numbers 1818 - 2022',
                   xaxis_title='Year',
                   yaxis_title='Monthly Mean Sunspot Number')
fig.show()

Use docstrings to document your functions. Give a brief description of what the function does, its inputs (including data types) and its outputs. You could also include example usage, common errors, links to related functions etc. There are standard formats you can use for different languages. Make sure you keep this documentation up to date when you make changes to your code!

In [None]:
help(ss_clean.clean_raw_data)
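
The source of the cleaning module is not reproduced in this notebook, so the block below is only a sketch of what a documented function might look like. The function name `remove_missing_values`, its `missing_flag` parameter and the NumPy-style docstring layout are assumptions made for illustration; the filtering itself mirrors the null and -1 checks used earlier.

```python
import pandas as pd


def remove_missing_values(df, col, missing_flag=-1):
    """Remove rows where a column is null or holds the missing-value flag.

    Parameters
    ----------
    df : pandas.DataFrame
        Data to filter. It is not modified.
    col : str
        Name of the column to check.
    missing_flag : int, optional
        Value used in the raw data to mark a missing observation
        (assumed here to be -1).

    Returns
    -------
    pandas.DataFrame
        A filtered copy of ``df``.
    """
    not_null = df[col].notnull()
    not_flagged = df[col] != missing_flag
    return df[not_null & not_flagged]


# Tiny made-up dataframe: one valid value, one flag, one null, one valid value
example = pd.DataFrame({'daily_total_sunspot_no': [12, -1, None, 30]})
filtered = remove_missing_values(example, 'daily_total_sunspot_no')
print(filtered)
```

Calling `help(remove_missing_values)` would then print this description, in the same way as the `help` call above.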

You should also provide project documentation for your work so other users can quickly understand what the project is aiming to do and how to get started with running it. <br/>

For example, a readme file in a git repo, which should contain a brief statement of intent, a longer description of the problem solved by the project and the approach taken, basic installation/usage instructions including example usage and links to any related projects.

A more complex project may also benefit from some basic dependency management and/or config files. 
Different coding languages have different approaches to dependency management, but it will typically involve creating a file (such as requirements.txt in Python) listing all required packages and their version numbers, so that other users don't run into problems from having the wrong packages installed.
A config file is a separate file containing your variables. Keeping these apart from the main script makes them easier to modify if the requirements change. 
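
As a rough sketch of the config idea, the block below reads variables from JSON instead of defining them in the main script. The file name `config.json` and its keys are hypothetical; the JSON text is inlined as a string here only so that the example runs on its own.

```python
import json

# Contents of a hypothetical config.json, inlined so the example is self-contained
config_text = """
{
    "data_file_path": "Data/",
    "file_list": ["SN_d_tot_V2.0_18.csv", "SN_d_tot_V2.0_19.csv", "SN_d_tot_V2.0_20.csv"],
    "not_null_col": "daily_total_sunspot_no"
}
"""

# In a real project this would be json.load() on the opened config file
config = json.loads(config_text)

# Build the full path to each data file from the config values
file_paths = [config["data_file_path"] + name for name in config["file_list"]]
print(file_paths)
```

Adding a new data file then means editing the config only, leaving the analysis code unchanged.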

### Version Control

https://best-practice-and-impact.github.io/qa-of-code-guidance/version_control.html <br />
Version control is always a good idea, even if you are working on a piece of code alone. It allows you to keep track of the changes made to code, including when, why and by whom they were made, and makes it possible to return to previous versions if needed. The standard way to do this is using Git, which is now available in the software centre on DHSC laptops. 

I'm not going to go into the details of how to use Git here (other coffee and coding sessions have done this), but here are a few key points: 
* Remember to write useful commit messages so others know what changes have been made and why. 
* Work in branches that are then merged into the main branch. The main branch should contain the most up to date working version of the code, so anyone coming to the project fresh would download and use that code.
* Use a `.gitignore` file to prevent accidental commits of credentials, sensitive information or data.

### Peer review

https://best-practice-and-impact.github.io/qa-of-code-guidance/peer_review.html <br />
This should be the baseline for code QA. Even the smallest piece of work, if it's producing output for others, isn't 'done' until someone else has given it a quick peer review. Here are some questions to ask when reviewing:
* Can I easily understand what the code does? Hopefully yes if code is modular, readable and well documented!
* Does the code fulfil its requirements? You may need to check the outputs (like normal analysis QA)
* How easy will it be to alter this code when requirements change?
* Can I reproduce these outputs?
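
One way a reviewer might check outputs is to run a function on a tiny input where the right answer is known in advance. The sketch below does this for a monthly average; `monthly_mean` is a hypothetical stand-in written for this example, not the function from the analysis module, and the input values are made up.

```python
import pandas as pd


def monthly_mean(df):
    """Hypothetical stand-in for the function under review."""
    grouped = df.groupby(['year', 'month'], as_index=False)
    return grouped['daily_total_sunspot_no'].mean()


# Small made-up input where the expected means are easy to work out by hand
small_input = pd.DataFrame({
    'year': [1818, 1818, 1818],
    'month': [1, 1, 2],
    'daily_total_sunspot_no': [10, 20, 40],
})

result = monthly_mean(small_input)

# January: (10 + 20) / 2 = 15, February: 40
expected = [15.0, 40.0]
assert result['daily_total_sunspot_no'].tolist() == expected
```

If the assertion fails, the reviewer has a small, reproducible case to hand back to the author.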

### A good example

In [None]:
# Import modules we need

# pandas for data analysis
import pandas as pd
# plotly for plotting
import plotly.express as px
# bespoke functions to clean and analyse the data
import sunspot_data_cleaning_functions as ss_clean
import sunspot_analysis_functions as ss_analyse

In [None]:
# Load the data from the csv into a dataframe

# A list of the column names
column_names = ['Year', 'Month', 'Day', 'Fraction_Of_Year', 'Daily_Total_Sunspot_No', 
                'Daily_St_Dev_Sunspot_No', 'No_Of_Obs', 'Definitive_Or_Provisional']

data_file_path = 'Data/'
file_list = ['SN_d_tot_V2.0_18.csv', 'SN_d_tot_V2.0_19.csv', 'SN_d_tot_V2.0_20.csv']

cleaned_frames = []

# Run data cleaning on each file and combine into a single df

cols_to_drop = ['fraction_of_year', 'daily_st_dev_sunspot_no', 'definitive_or_provisional'] 
datetime_cols = ['year','month','day']
not_null_col = 'daily_total_sunspot_no'

for file in file_list:
    
    # Using function read_csv from package pandas (pd)
    raw_data = pd.read_csv(data_file_path+file, sep = ';', names = column_names)

    # Clean the data
    clean_data = ss_clean.clean_raw_data(raw_data, cols_to_drop, datetime_cols, not_null_col)
    
    # Append cleaned dataframe to the list
    cleaned_frames.append(clean_data)

# Combine the list into a single dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
all_cleaned_data = pd.concat(cleaned_frames)


In [None]:
# Plot

fig = px.line(all_cleaned_data, x = all_cleaned_data['datetime'], y = all_cleaned_data['daily_total_sunspot_no'])
fig.update_layout(title='Daily Total Sunspot Numbers 1818 - 2022',
                   xaxis_title='Year',
                   yaxis_title='Daily Total Sunspot Number')

# display the figure
fig.show()

In [None]:
# Further analysis - smooth out these results to view trends better

# Aggregate by month and year
monthly_ssn = ss_analyse.aggregated_average_ssn(all_cleaned_data, ['year', 'month'], 'monthly_mean_ssn')
yearly_ssn = ss_analyse.aggregated_average_ssn(all_cleaned_data, ['year'], 'yearly_mean_ssn')

In [None]:
help(ss_analyse.aggregated_average_ssn)

In [None]:
# Plot monthly
fig = px.line(monthly_ssn, x = monthly_ssn['datetime'], y = monthly_ssn['monthly_mean_ssn'])
fig.update_layout(title='Monthly mean sunspot numbers 1818 - 2022',
                   xaxis_title='Year',
                   yaxis_title='Monthly Mean Sunspot Number')

# display the figure
fig.show()

In [None]:
# Plot yearly
fig = px.line(yearly_ssn, x = yearly_ssn['datetime'], y = yearly_ssn['yearly_mean_ssn'])
fig.update_layout(title='Yearly mean sunspot numbers 1818 - 2022',
                   xaxis_title='Year',
                   yaxis_title='Yearly Mean Sunspot Number')

# display the figure
fig.show()