
# Notes on Software Engineering Best Practices -- Udacity

### Production Code
Refers to software running on production servers that is handling live users and data. 

### Production Quality Code
Refers to code that is efficient and reliable enough for production. Production-quality code should be:
*   Clean -- readable, simple, and concise.
*   Modular -- logically broken into functions and modules.

A module is a file. Modules allow code to be reused by encapsulating it in files that can be imported into other files.

### Refactoring
Restructuring code to improve internal structure without changing external functionality. 

Why refactor?

*   Reduce workload in the long run
*   Easier to maintain code
*   Reuse more of your code
*   Become a better developer

Example of bad code and an improvement:

In [1]:
# Bad Code

s = [88, 92, 78, 94, 87] # student test scores
print(sum(s) / len(s)) # print mean of test scores

s1 = [x ** 0.5 * 10 for x in s] # curve scores with square root method and store in new list
print(sum(s1)/len(s1))

87.8
93.65398840104324


In [3]:
# Refactored code
import math
import numpy as np

test_scores = [88, 92, 78, 94, 87]
print(np.mean(test_scores))

curved_test_scores = [math.sqrt(score) * 10 for score in test_scores]
print(np.mean(curved_test_scores))

87.8
93.65398840104324


## Writing Clean Code

### Meaningful Names

*   Be descriptive and imply type. E.g. for booleans, we can start with `is_` or `has_` to make it clear it is a condition. We can also use verbs for *functions* and nouns for *variables*.
*   Be consistent but clearly differentiate -- e.g. `age_list` and `age` are easier to tell apart than `ages` and `age`.
*   Avoid abbreviations and single letters (exceptions: counters and common math variables). Keep your audience in mind when making exceptions. For instance, data scientists will know what a `df` is, whereas a different audience may need a more meaningful name such as `dataframe`.
*   Long names do not equal descriptive names -- be descriptive but with relevant information. We can give a function a descriptive name without having to include details about its implementation in the name.
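As a minimal sketch of the naming guidelines above, the two functions below do the same thing; only the names differ. All identifiers here (`calc`, `filter_passing_scores`, `passing_score`, `is_everyone_passing`) are hypothetical and chosen purely for illustration.

```python
# Less clear: single letters and an abbreviation hide the intent.
def calc(l, t):
    return [x for x in l if x >= t]


# Clearer: a verb for the function and nouns for the data.
def filter_passing_scores(test_scores, passing_score):
    return [score for score in test_scores if score >= passing_score]


test_scores = [88, 92, 78, 94, 87]
passing_scores = filter_passing_scores(test_scores, passing_score=80)

# The `is_` prefix signals that this variable holds a boolean condition.
is_everyone_passing = len(passing_scores) == len(test_scores)

print(passing_scores)       # [88, 92, 94, 87]
print(is_everyone_passing)  # False
```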

### Whitespace

Use whitespace properly:

*   Organize code with consistent indentation. The standard for Python code is **4 spaces** for each indent.
*   Separate sections with blank lines to keep code well-organized and readable.
*   Limit lines to ~79 characters, as recommended by the PEP 8 style guide.



## Writing Modular Code

*   DRY (Don't Repeat Yourself) Principle. 
Modularization allows you to reuse parts of your code. Generalize and consolidate repeated code in functions or loops.

*   Abstract out logic to improve readability. Abstracting code into a function not only makes it less repetitive but also improves readability through descriptive function names. When abstracting, it is possible to over-engineer. Use your best judgment.

*   Minimize the number of entities (functions, classes, modules). Breaking the code into unnecessary functions and modules can make it harder to read and to follow the logic.

*   Functions should do one thing only. Each function should tackle one problem. If it does multiple things, it becomes more difficult to generalize and reuse. If the function name contains an "and", as in `mean_and_std`, it is good practice to refactor it.

*   Try to use fewer than three arguments per function. This is not a hard rule, as some functions do require more parameters, but it is good practice to keep the number of arguments as small as possible.
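To illustrate the "one thing only" guideline, here is a small sketch that splits a function with "and" in its name into two single-purpose functions. The function names (`curve_and_print_mean`, `add_points`, `print_mean`) are hypothetical, and the standard-library `statistics` module is used only to keep the example self-contained.

```python
import statistics


# Does two things: curves the scores AND prints a summary.
def curve_and_print_mean(scores, points):
    curved = [score + points for score in scores]
    print(statistics.mean(curved))
    return curved


# One job each: curving can now be reused without printing, and vice versa.
def add_points(scores, points):
    return [score + points for score in scores]


def print_mean(scores):
    print(statistics.mean(scores))


test_scores = [88, 92, 78, 94, 87]
curved_5 = add_points(test_scores, 5)
print_mean(curved_5)  # 92.8
```

Each of the smaller functions takes at most two arguments and can be combined freely, for example to print the mean of the uncurved scores without any extra code.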


In [5]:
# Example of code that can be improved

import math
import numpy as np

test_scores = [88, 92, 78, 94, 87]
print(np.mean(test_scores))

curved_5 = [score + 5 for score in test_scores]
print(np.mean(curved_5))

curved_10 = [score + 10 for score in test_scores]
print(np.mean(curved_10))

curved_sqrt = [math.sqrt(score) * 10 for score in test_scores]
print(np.mean(curved_sqrt))

87.8
92.8
97.8
93.65398840104324


In [8]:
# Modular Code

import math
import numpy as np

def flat_curve(arr, n):
    return [i + n for i in arr]

def square_root_curve(arr):
    return [math.sqrt(i) * 10 for i in arr]

test_scores = [88, 92, 78, 94, 87]
curved_5 = flat_curve(test_scores, 5)
curved_10 = flat_curve(test_scores, 10)
curved_sqrt = square_root_curve(test_scores)

for score_list in test_scores, curved_5, curved_10, curved_sqrt:
    print(np.mean(score_list))

87.8
92.8
97.8
93.65398840104324


## Exercise 1: Refactor -- Wine Quality Analysis

In this exercise we will refactor code that analyzes wine quality by renaming the columns in the dataset and calculating some statistics on how some features may relate to quality ratings.

In [10]:
# Load Data

import pandas as pd
df = pd.read_csv("/content/winequality-red.csv", sep=";")
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


### Renaming columns

As we can see, the column names contain spaces. We need to replace the spaces with underscores to be able to reference the columns using *dot notation*.

In [12]:
# one possible way to approach the problem:

new_df = df.rename(columns={'fixed acidity': 'fixed_acidity',
                             'volatile acidity': 'volatile_acidity',
                             'citric acid': 'citric_acid',
                             'residual sugar': 'residual_sugar',
                             'free sulfur dioxide': 'free_sulfur_dioxide',
                             'total sulfur dioxide': 'total_sulfur_dioxide'
                            })
new_df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In the cell above, we updated the column names manually. It worked fine in this case, but doing it this way is prone to errors and typos, and it is also repetitive and a bit time-consuming. A better, automated approach would be:

In [13]:
df.columns = [label.replace(' ', '_') for label in df.columns]
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


### Analyzing Features

Now that the columns are properly renamed, we can look at different features of this dataframe and see how they relate to the quality of the wine. One simple way to do this is by comparing the mean quality rating for the top and bottom half of each feature:

In [0]:
def numeric_to_buckets(df, column_name):
    median = df[column_name].median()
    for i, val in enumerate(df[column_name]):
        if val >= median:
            df.loc[i, column_name] = 'high'
        else:
            df.loc[i, column_name] = 'low' 

In [15]:
for feature in df.columns[:-1]:
    numeric_to_buckets(df, feature)
    print(df.groupby(feature).quality.mean(), '\n')

fixed_acidity
high    5.726061
low     5.540052
Name: quality, dtype: float64 

volatile_acidity
high    5.392157
low     5.890166
Name: quality, dtype: float64 

citric_acid
high    5.822360
low     5.447103
Name: quality, dtype: float64 

residual_sugar
high    5.665880
low     5.602394
Name: quality, dtype: float64 

chlorides
high    5.507194
low     5.776471
Name: quality, dtype: float64 

free_sulfur_dioxide
high    5.595268
low     5.677136
Name: quality, dtype: float64 

total_sulfur_dioxide
high    5.522981
low     5.750630
Name: quality, dtype: float64 

density
high    5.540574
low     5.731830
Name: quality, dtype: float64 

pH
high    5.598039
low     5.675607
Name: quality, dtype: float64 

sulphates
high    5.898917
low     5.351562
Name: quality, dtype: float64 

alcohol
high    5.958904
low     5.310302
Name: quality, dtype: float64 
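The cells above overwrite each feature column row by row with `'high'`/`'low'` labels. One possible refactoring, sketched below, labels all rows at once with `np.where` and leaves the dataframe unchanged. The function name `mean_quality_by_bucket`, its `target` parameter, and the tiny `sample` frame are assumptions made for illustration; the values are not taken from the wine dataset.

```python
import numpy as np
import pandas as pd


def mean_quality_by_bucket(df, column_name, target="quality"):
    """Mean of `target` for rows at/above vs. below the column's median."""
    median = df[column_name].median()
    # np.where labels every row in one vectorized step; df is not modified.
    buckets = np.where(df[column_name] >= median, "high", "low")
    return df[target].groupby(buckets).mean()


# Made-up stand-in for the wine data (illustrative values only).
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 9.0],
    "quality": [5, 5, 6, 6, 7, 4],
})

result = mean_quality_by_bucket(sample, "alcohol")
print(result)
```

Because the function returns a new Series instead of mutating its input, it could be called for every feature in a loop such as `for feature in df.columns[:-1]` while the numeric columns stay available for further analysis.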



## Efficient Code

Refers to code that:

*   Reduces run time
*   Reduces space in memory

Example of Code Optimization:
Here's some example code that finds the common book ids in `books_published_last_two_years.txt` and `all_coding_books.txt` to obtain a list of recent coding books.

In [0]:
import time
import pandas as pd
import numpy as np

In [0]:
with open('books_published_last_two_years.txt') as f:
    recent_books = f.read().split('\n')
    
with open('all_coding_books.txt') as f:
    coding_books = f.read().split('\n')

In [18]:
start = time.time()
recent_coding_books = []

for book in recent_books:
    if book in coding_books:
        recent_coding_books.append(book)

print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 12.395647048950195 seconds


### Tip #1: Use vector operations over loops when possible

Use NumPy's `intersect1d` function to get the intersection of the `recent_books` and `coding_books` arrays.

In [19]:
start = time.time()
recent_coding_books = np.intersect1d(recent_books, coding_books)
print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 0.041463375091552734 seconds


### Tip #2: Know your data structures and which methods are faster
Use the set's `intersection` method to get the common elements in `recent_books` and `coding_books`.

In [20]:
start = time.time()
recent_coding_books = set(recent_books).intersection(coding_books)
print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 0.00864100456237793 seconds


Of the three approaches timed here, using **sets** to compute the intersection is the fastest.
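The gap comes from how membership is checked: `book in coding_books` scans a list element by element, while a set uses hashing, so each lookup takes roughly constant time on average. The sketch below reproduces the comparison on synthetic IDs, which are made up here so the example runs without the two text files. Actual timings will vary by machine.

```python
import time

# Synthetic book IDs standing in for the two text files (illustrative only).
recent_books = [str(i) for i in range(0, 6000, 2)]
coding_books = [str(i) for i in range(0, 6000, 3)]

start = time.perf_counter()
# `in` on a list compares against elements one at a time.
slow = [book for book in recent_books if book in coding_books]
list_seconds = time.perf_counter() - start

start = time.perf_counter()
# A set hashes its elements, so each membership check is O(1) on average.
fast = set(recent_books).intersection(coding_books)
set_seconds = time.perf_counter() - start

print(len(fast))  # 1000
print('List scan: {} seconds'.format(list_seconds))
print('Set intersection: {} seconds'.format(set_seconds))
```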