## Characteristics of Production Code
**CLEAN:** readable, simple, and concise. Clean code is a characteristic of production-quality code and is crucial for collaboration and maintainability in software development.

**MODULAR:** logically broken up into functions and modules. Modularity is another important characteristic of production-quality code: it makes your code more organized, efficient, and reusable.
* Don't repeat yourself (DRY)
* Abstract out logic to improve readability
* Minimize the number of entities (functions, classes, modules, etc.)
* Functions should do one thing
* Arbitrary variable names can be more effective in certain functions
* Try to use fewer than three arguments per function
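
As a rough illustration of "functions should do one thing", the sketch below keeps the calculation and the formatting in separate functions, each with fewer than three arguments. The function names are hypothetical and chosen only for this example.

```python
# Hypothetical example: the names below are illustrative, not from a real project.

def mean_score(scores):
    """Return the arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


def format_report(label, value):
    """Return a one-line report string; printing is left to the caller."""
    return f"{label}: {value:.1f}"


test_scores = [88, 92, 79, 93, 85]
print(format_report("mean", mean_score(test_scores)))  # mean: 87.4
```

Because `mean_score` only computes and `format_report` only formats, either one can be reused or tested without the other.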

**MODULE:** a file. Modules allow code to be reused by encapsulating it in files that can be imported into other files.

In [1]:
# DRY
import math
import numpy as np
test_scores = [88, 92, 79, 93, 85]

# List comprehension
curved_5 = [score + 5 for score in test_scores]
print(np.mean(curved_5))

curved_10 = [score + 10 for score in test_scores]
print(np.mean(curved_10))

curved_sqrt = [math.sqrt(score) * 10 for score in test_scores]
print(np.mean(curved_sqrt))

92.4
97.4
93.44776840374746


In [2]:
# Abstract out logic to improve readability
import math
import numpy as np

def flat_curve(arr, n):
    return [i + n for i in arr]

def square_root_curve(arr):
    return [math.sqrt(i) * 10 for i in arr]

test_scores = [88, 92, 79, 93, 85]

curved_5 = flat_curve(test_scores, 5)
curved_10 = flat_curve(test_scores, 10)
curved_sqrt = square_root_curve(test_scores)

for score_list in test_scores, curved_5, curved_10, curved_sqrt:
    print(np.mean(score_list))


87.4
92.4
97.4
93.44776840374746


## Writing Clean Code
* Nice Whitespace
    * Organize your code with consistent indentation - the standard is to use 4 spaces for each indent.
    * Separate sections with blank lines to keep your code well organized and readable.
    * Try to limit your lines to around 79 characters, which is the guideline given in the PEP 8 style guide.  
    [PEP 8 guidelines for code layout](https://www.python.org/dev/peps/pep-0008/?#code-lay-out)
    
* Meaningful Names
    * Be descriptive and imply type: E.g. for booleans, use prefix is_ or has_ to make clear it is a condition.
    * Be consistent but clearly differentiate
    * Avoid abbreviations and especially single letters
    * Long names are not necessarily descriptive names. (Don't put implementation details in the name of a function.)
    * Use verbs for function names.

In [3]:
# Be descriptive and imply type
age_list = [47, 12, 28]

for i, age in enumerate(age_list):
    if age < 18:
        is_minor = True
        age_list[i] = "minor"
        
age_list

[47, 'minor', 28]
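
One possible way to apply these naming guidelines is sketched below. The function and constant names are hypothetical and only meant to contrast a vague version with a more descriptive one.

```python
# Vague: single letters give no hint of type or purpose
def f(l):
    return [x for x in l if x < 18]


# More descriptive: a verb for the function, a plural noun for the list,
# and an is_ prefix for the boolean helper
ADULT_AGE = 18  # assumed threshold, matching the `age < 18` check above


def is_minor(age):
    return age < ADULT_AGE


def filter_minor_ages(ages):
    return [age for age in ages if is_minor(age)]


ages = [47, 12, 28]
print(filter_minor_ages(ages))  # [12]
```

Both versions return the same result; only the second one tells the reader what that result means.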

## Code refactoring

**REFACTORING:** restructuring your code to improve its internal structure, without changing its external functionality. This gives you a chance to clean and modularize your program after you've got it working.

### Refactor: Wine Quality Analysis
In this exercise, you'll refactor code that analyzes a wine quality dataset taken from the UCI Machine Learning Repository [here](https://archive.ics.uci.edu/ml/datasets/wine+quality). Each row contains data on a wine sample, including several physicochemical properties gathered from tests, as well as a quality rating evaluated by wine experts.

The code in this notebook first renames the columns of the dataset and then calculates some statistics on how some features may be related to quality ratings. Can you refactor this code to make it more clean and modular?

In [4]:
import pandas as pd
df = pd.read_csv('../data/winequality-red.csv', sep=';')
df.head(2)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |


### Renaming Columns
You want to replace the spaces in the column labels with underscores to be able to reference columns with dot notation. Here's one way you could've done it.

In [5]:
# renaming
new_df = df.rename(columns={'fixed acidity': 'fixed_acidity',
                             'volatile acidity': 'volatile_acidity',
                             'citric acid': 'citric_acid',
                             'residual sugar': 'residual_sugar',
                             'free sulfur dioxide': 'free_sulfur_dioxide',
                             'total sulfur dioxide': 'total_sulfur_dioxide'
                            })
new_df.head(2)

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |


In [6]:
# renaming using a list comprehension
df.columns = [label.replace(' ', '_') for label in df.columns]
print(df.columns)

Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
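
As an alternative sketch, pandas exposes vectorized string methods on the column index itself, so the same renaming can be written without an explicit comprehension. The small `sample_df` frame below is a made-up stand-in for the wine dataset.

```python
import pandas as pd

# Small stand-in frame; the real data comes from winequality-red.csv above
sample_df = pd.DataFrame({'fixed acidity': [7.4],
                          'citric acid': [0.0],
                          'pH': [3.51]})

# Index.str.replace applies the substitution to every label at once
sample_df.columns = sample_df.columns.str.replace(' ', '_', regex=False)
print(list(sample_df.columns))  # ['fixed_acidity', 'citric_acid', 'pH']

# Labels without spaces can now be used with dot notation
print(sample_df.fixed_acidity[0])
```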


### Analyzing Features
Now that your columns are ready, you want to see how different features of this dataset relate to the quality rating of the wine. A very simple way you could do this is by observing the mean quality rating for the top and bottom half of each feature. The exercise's starting code did this for four features and was quite repetitive. The version below is more concise: a helper function buckets one column, and a loop applies it to every feature.

In [7]:
def numeric_to_buckets(df, column_name):
    median = df[column_name].median()
    for i, val in enumerate(df[column_name]):
        if val >= median:
            df.loc[i, column_name] = 'high'
        else:
            df.loc[i, column_name] = 'low' 

In [8]:
# select all columns but the last one (quality)
for feature in df.columns[:-1]:
    numeric_to_buckets(df, feature)
    print(df.groupby(feature).quality.mean(), '\n')

fixed_acidity
high    5.726061
low     5.540052
Name: quality, dtype: float64 

volatile_acidity
high    5.392157
low     5.890166
Name: quality, dtype: float64 

citric_acid
high    5.822360
low     5.447103
Name: quality, dtype: float64 

residual_sugar
high    5.665880
low     5.602394
Name: quality, dtype: float64 

chlorides
high    5.507194
low     5.776471
Name: quality, dtype: float64 

free_sulfur_dioxide
high    5.595268
low     5.677136
Name: quality, dtype: float64 

total_sulfur_dioxide
high    5.522981
low     5.750630
Name: quality, dtype: float64 

density
high    5.540574
low     5.731830
Name: quality, dtype: float64 

pH
high    5.598039
low     5.675607
Name: quality, dtype: float64 

sulphates
high    5.898917
low     5.351562
Name: quality, dtype: float64 

alcohol
high    5.958904
low     5.310302
Name: quality, dtype: float64 
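
Note that `numeric_to_buckets` overwrites each numeric column with strings, one row at a time. A possible variation, sketched below under the assumption that the original numeric values should be preserved, computes the labels in one vectorized step and leaves the DataFrame untouched. The helper name `mean_quality_by_half` and the `toy_df` data are hypothetical.

```python
import numpy as np
import pandas as pd


def mean_quality_by_half(df, feature, target='quality'):
    """Return the mean of `target` for rows at/above vs below the median of `feature`."""
    median = df[feature].median()
    # np.where labels every row in one vectorized step; df itself is not modified
    labels = np.where(df[feature] >= median, 'high', 'low')
    bucket = pd.Series(labels, index=df.index, name=feature)
    return df[target].groupby(bucket).mean()


# Tiny made-up frame for illustration only (not the wine data)
toy_df = pd.DataFrame({'alcohol': [9.0, 10.0, 11.0, 12.0],
                       'quality': [5, 5, 6, 7]})
print(mean_quality_by_half(toy_df, 'alcohol'))
```

With this toy data the median is 10.5, so the "high" half averages 6.5 and the "low" half averages 5.0.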



### groupby example

In [9]:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max_Speed': [380., 370., 24., 26.]})
print(df)

   Animal  Max_Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0


In [10]:
df.groupby('Animal').Max_Speed.mean()

Animal
Falcon    375.0
Parrot     25.0
Name: Max_Speed, dtype: float64

### enumerate example

In [11]:
columns = ['residual_sugar','density','pH','sulphates','alcohol']
for i, col in enumerate(columns):
    print(i, col)

0 residual_sugar
1 density
2 pH
3 sulphates
4 alcohol


## Efficient Code
* Execute faster
* Take up less space in memory/storage
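
As a small illustration of the memory side, the sketch below compares a list comprehension with a generator expression using `sys.getsizeof`. The variable names and the size `n` are arbitrary choices for this example.

```python
import sys

n = 1_000_000

# A list comprehension materializes every element before it is used
squares_list = [i * i for i in range(n)]
# A generator expression produces one element at a time, on demand
squares_gen = (i * i for i in range(n))

# getsizeof reports the container object only, not the integers it refers to
print(sys.getsizeof(squares_list))
print(sys.getsizeof(squares_gen))

total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
print(total_from_list == total_from_gen)  # True
```

The generator stays small regardless of `n`, at the cost of being usable only once.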

### Optimizing Code: Common Books
Here's the code your coworker wrote to find the common book ids in `books_published_last_two_years.txt` and `all_coding_books.txt` to obtain a list of recent coding books.

In [12]:
import time
import pandas as pd
import numpy as np

In [13]:
with open('../data/books_published_last_two_years.txt') as f:
    recent_books = f.read().split('\n')
    
with open('../data/all_coding_books.txt') as f:
    coding_books = f.read().split('\n')

In [14]:
# inefficient way
start = time.time()
recent_coding_books = []

for book in recent_books:
    if book in coding_books:
        recent_coding_books.append(book)

print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 15.209768056869507 seconds


#### Tip #1: Use vector operations over loops when possible
Use numpy's `intersect1d` function to get the intersection of the `recent_books` and `coding_books` arrays.

**numpy.intersect1d()** finds the intersection of two arrays and returns the sorted, unique values that are in both input arrays.
```
import numpy as np

arr1 = np.array([1, 1, 2, 3, 4])
arr2 = np.array([2, 1, 4, 6])

result = np.intersect1d(arr1, arr2)
print(result)  # [1 2 4]
```

In [15]:
start = time.time()
recent_coding_books = np.intersect1d(recent_books, coding_books)
print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 0.03148794174194336 seconds


#### Tip #2: Know your data structures and which methods are faster

**resource:**  [What makes sets faster than lists](https://stackoverflow.com/questions/8929284/what-makes-sets-faster-than-lists/8929445#8929445)

In [16]:
start = time.time()
recent_coding_books = set(recent_books).intersection(coding_books)
print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 0.00522303581237793 seconds


#### NOTE: Looks like using sets to compute the intersection is indeed most efficient in this case!
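
The difference comes from how membership tests work: `in` on a list scans elements one by one, while `in` on a set is a hash lookup. The sketch below shows this on synthetic ids (the book files are not needed here); the function names are hypothetical and the timings will vary by machine.

```python
import timeit

# Synthetic ids for illustration only
coding_books_list = [str(i) for i in range(20_000)]
coding_books_set = set(coding_books_list)
recent_books = [str(i) for i in range(19_900, 20_100)]


def common_with_list():
    # `in` on a list compares against elements one by one: O(n) per lookup
    return [book for book in recent_books if book in coding_books_list]


def common_with_set():
    # `in` on a set is a hash lookup: O(1) on average
    return [book for book in recent_books if book in coding_books_set]


print(len(common_with_list()), len(common_with_set()))  # 100 100
print('list:', timeit.timeit(common_with_list, number=5))
print('set: ', timeit.timeit(common_with_set, number=5))
```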

### Optimizing Code: Holiday Gifts
In the last example, you learned that using vectorized operations and more efficient data structures can optimize your code. Let's use these tips for one more example.

Say your online gift store has one million users that each listed a gift on a wish list. You have the prices for each of these gifts stored in `gift_costs.txt`. For the holidays, you're going to give each customer their wish list gift for free if it is under 25 dollars. Now, you want to calculate the total cost of all gifts under 25 dollars to see how much you'd spend on free gifts. Here's one way you could've done it.

In [17]:
import time
import numpy as np

In [18]:
with open('../data/gift_costs.txt') as f:
    gift_costs = f.read().split('\n')

In [None]:
gift_costs = np.array(gift_costs).astype(int)  # convert string to int
type(gift_costs)
print(gift_costs)

In [None]:
# inefficient code
start = time.time()

total_price = 0
for cost in gift_costs:
    if cost < 25:
        total_price += cost * 1.08  # add cost after tax

print(total_price)
print('Duration: {} seconds'.format(time.time() - start))

#### Refactor Code
**Hint:** Using numpy makes it very easy to select all the elements in an array that meet a certain condition, and then perform operations on them together all at once. You can then find the sum of the resulting values.

In [None]:
# first way
start = time.time()

total_price = gift_costs[gift_costs < 25].sum() * 1.08

print(round(total_price, 2))
print('Duration: {} seconds'.format(time.time() - start))

In [None]:
# second way
start = time.time()

total_price = np.sum(gift_costs[gift_costs < 25]) * 1.08

print(round(total_price, 2))
print('Duration: {} seconds'.format(time.time() - start))

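#### NOTE: on a NumPy array, `np.sum(gift_costs)` and `gift_costs.sum()` perform the same underlying reduction, so any timing difference between them should be small; the large speedup comes from replacing the Python loop.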
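
Timing claims like this are easy to check directly. The sketch below times the loop and both vectorized versions with `timeit` on synthetic prices; the data (random whole-dollar costs) and the function names are assumptions for illustration, and the measured times will differ between machines.

```python
import timeit
import numpy as np

# Synthetic prices standing in for gift_costs.txt (assumed: whole-dollar ints)
rng = np.random.default_rng(seed=0)
gift_costs = rng.integers(1, 100, size=200_000)


def total_with_loop():
    total = 0
    for cost in gift_costs:
        if cost < 25:
            total += cost * 1.08  # add cost after tax
    return total


def total_with_method():
    return gift_costs[gift_costs < 25].sum() * 1.08


def total_with_np_sum():
    return np.sum(gift_costs[gift_costs < 25]) * 1.08


for func in (total_with_loop, total_with_method, total_with_np_sum):
    seconds = timeit.timeit(func, number=3) / 3
    print(f'{func.__name__}: {seconds:.4f} s per call')
```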

## Documentation

#### Benefits
* Clarifies complex parts of code
* Makes your code easier to navigate
* Quickly conveys how and why different components of your program are used

#### Types of documentation
* In-line Comments - line level
* Docstrings - module and function level
* Project Documentation - project level

#### Docstrings Example

In [None]:
# One line docstring
def population_density(population, land_area):
    """Calculate the population density of an area."""
    return population / land_area

In [None]:
# multi line docstring
def population_density(population, land_area):
    """Calculate the population density of an area.

    Args:
        population: int. The population of the area.
        land_area: int or float. The area of the land, in any unit. This
            function is unit-agnostic: if you pass in values in terms of
            square km or square miles, it returns a density in those units.

    Returns:
        population_density: population / land_area. The population density
            of a particular area.
    """
    return population / land_area
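
A docstring is more than a comment: Python stores it on the function object, so tools and readers can retrieve it at runtime. A minimal sketch using the one-line version from above:

```python
import inspect


def population_density(population, land_area):
    """Calculate the population density of an area."""
    return population / land_area


# The docstring is stored on the function object itself
print(population_density.__doc__)

# inspect.getdoc also cleans up indentation in multi-line docstrings
print(inspect.getdoc(population_density))

# In an interactive session, help(population_density) shows the
# signature together with the docstring
```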

#### Project Documentation
[Udacity README Course](https://classroom.udacity.com/courses/ud777)

## Version Control In Data Science

#### Scenario 1
You have to stop working on an unfinished feature to start a higher-priority one. The solution is to create a new branch.

![scenario1](../images/scenario1.png)

STEP 1: You have a local version of this repository on your laptop, and to get the latest stable version, you pull from the develop branch.

> Switch to the develop branch
>
> `git checkout develop`

> Pull latest changes in the develop branch
>
> `git pull`

STEP 2: When you start working on this demographic feature, you create a new branch for this called demographic, and start working on your code in this branch.

> Create and switch to new branch called demographic from develop branch
>
> `git checkout -b demographic`

> Work on this new feature and commit as you go
>
> `git commit -m 'added gender recommendations'`
>
> `git commit -m 'added location specific recommendations'`
>
> `...`

STEP 3: However, in the middle of your work, you need to work on another feature. So you commit your changes on this demographic branch, and switch back to the develop branch.

> Commit changes before switching
>
> `git commit -m 'refactored demographic gender and location recommendations'`


> Switch to the develop branch 
> 
> `git checkout develop`

STEP 4: From this stable develop branch, you create another branch for a new feature called friend_groups.

> Create and switch to new branch called friend_groups from develop branch
>
> `git checkout -b friend_groups`

STEP 5: After you finish your work on the friend_groups branch, you commit your changes, switch back to the develop branch, merge friend_groups into it, and push the result to the remote repository’s develop branch.

> Commit changes before switching
>
> `git commit -m 'finalized friend_groups recommendations'`

> Switch to the develop branch
>
> `git checkout develop`

> Merge friend_groups branch to develop
>
> `git merge --no-ff friend_groups`

> Push to remote repository
>
> `git push origin develop`

STEP 6: Now, you can switch back to the demographic branch to continue your progress on that feature.

> Switch to the demographic branch
>
> `git checkout demographic`

#### Scenario 2

While tweaking parameters to find the best model, save each configuration in its own commit and include the cross-validation result in the commit message.

![scenario2](../images/scenario2.png)

Step 1: You check your commit history, seeing messages about the changes you made and how well each model performed.

> View log history
>
> `git log`

Step 2: The model at this commit seemed to score the highest, so you decide to take a look.

> Checkout a commit
>
> `git checkout bc90f2cbc9dc4e802b46e7a153aa106dc9a88560`

After inspecting your code, you realize what modifications made this perform well, and use those for your model.

Step 3: Now, you’re pretty confident merging this back into the development branch, and pushing the updated recommendation engine.

> Switch to develop branch
>
> `git checkout develop`

> Merge friend_groups branch to develop
>
> `git merge --no-ff friend_groups`

> Push changes to remote repository
>
> `git push origin develop`

#### Scenario 3

Two coworkers are working on different branches.

![scenario3](../images/scenario3.png)

Step 1: Andrew commits his changes to the documentation branch, switches to the develop branch, and pulls down the latest changes from the remote on this branch, including the change I merged previously for the friend_groups feature.

> Commit changes on documentation branch
>
> `git commit -m "standardized all docstrings in process.py"`

> Switch to develop branch
>
> `git checkout develop`

> Pull latest changes on develop down
>
> `git pull`

Step 2: Then, Andrew merges his documentation branch on the develop branch on his local repository, and then pushes his changes up to update the develop branch on the remote repository.

> Merge documentation branch to develop
>
> `git merge --no-ff documentation`

> Push changes up to remote repository
>
> `git push origin develop`

Step 3: After the team reviews both of your changes, they merge the updates from the develop branch into the master branch and push master to the remote repository. These changes are now in production.

> Merge develop to master
>
> `git merge --no-ff develop`

> Push changes up to remote repository
>
> `git push origin master`

#### Great Git Resources
[A successful Git branching model](https://nvie.com/posts/a-successful-git-branching-model/)

[About merge conflicts](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/about-merge-conflicts)

### Model Versioning

Each commit is documented with the score of the model at that commit.

[How to version control your production machine learning models](https://algorithmia.com/blog/how-to-version-control-your-production-machine-learning-models)
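
One way to keep such messages consistent is to generate them from the scores rather than typing them by hand. The sketch below is only an illustration: the helper `build_commit_message`, the message format, and the scores are all hypothetical, and the resulting string would still be passed to `git commit -m` manually.

```python
from statistics import mean


def build_commit_message(description, cv_scores):
    """Return a commit message that records the mean cross-validation score."""
    return (f"{description} "
            f"(CV mean: {mean(cv_scores):.3f}, folds: {len(cv_scores)})")


# Hypothetical scores from a 5-fold cross-validation run
cv_scores = [0.81, 0.79, 0.83, 0.80, 0.82]
print(build_commit_message("tuned learning rate", cv_scores))
# tuned learning rate (CV mean: 0.810, folds: 5)
```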