# Software Engineering Practices - pt. I

## Writing clean code

- Use meaningful names 
- Be descriptive and imply type: E.g. for booleans, use prefix `is_` or `has_` to make clear it is a condition. 

In [1]:
age_list = [47, 12, 28]

for i, age in enumerate(age_list):
    if age < 18:
        is_minor = True
        age_list[i] = "minor"
        
age_list

[47, 'minor', 28]

- Be consistent but clearly differentiate
- Avoid abbreviations and especially single letters
- Long names $\neq$ descriptive names 
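
As a rough illustration of the naming tips above, the sketch below contrasts a terse version with a more descriptive one. The data and all names (`student_scores`, `passing_score`, `passing_students`) are made up for this example.

```python
# Hypothetical data: (student name, test score) pairs.
student_scores = [("Ada", 91), ("Linus", 58), ("Grace", 77)]
passing_score = 60

# Hard to read: single letters and a magic number hide the intent.
# r = [s[0] for s in student_scores if s[1] >= 60]

# Clearer: each name says what the value is, and the boolean uses an `is_` prefix.
passing_students = []
for student_name, score in student_scores:
    is_passing = score >= passing_score
    if is_passing:
        passing_students.append(student_name)

print(passing_students)  # ['Ada', 'Grace']
```

The descriptive version is longer, but every name carries information; none of them is long just for the sake of length.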

## Writing modular code

- DRY: Don't repeat yourself
- Abstract out logic to improve readability
- Minimize the number of entities (functions, classes, modules, etc.)
- Functions should do one thing
- Arbitrary variable names can be more effective in certain functions. 
- Try to use fewer than three arguments per function.
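
A minimal sketch of these ideas, assuming some made-up score lists: the repeated logic is pulled into two small functions that each do one thing and take at most two arguments. The function names `mean` and `curve_scores` are illustrative only.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def curve_scores(scores, bonus):
    """Return a new list with `bonus` added to every score."""
    return [score + bonus for score in scores]


# Hypothetical data.
math_scores = [70, 82, 91]
art_scores = [65, 88]

# Each step reuses a function instead of a copy-pasted loop per subject.
math_mean = mean(curve_scores(math_scores, bonus=5))
art_mean = mean(curve_scores(art_scores, bonus=5))

print(math_mean, art_mean)  # 86.0 81.5
```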

## Quiz: Refactoring - Wine quality

In [2]:
import pandas as pd
df = pd.read_csv("../data/winequality-red.csv", sep=";")
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


### Renaming Columns

We would like to reference the columns using dot notation (e.g. `df.fixed_acidity`). To do this, we have to replace each space `" "` in the column names with an underscore `"_"`.


In [3]:
# rewrite using list comprehension
df.columns = [label.replace(" ", "_") for label in df.columns]
df.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


### Analyzing Features

Check the relationship between each feature and the quality rating of the wine.

- Compare the mean quality rating for the top and bottom half of each feature (split at the median).

#### `pandas.DataFrame.loc`

- Access a group of rows and columns by label(s) or a boolean array. `.loc[]` is primarily label based, but may also be used with a boolean array.
- Example: `df.loc["viper"]` 
- [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)

In [4]:
# Example
import pandas as pd
example_df = pd.DataFrame([[1,2], [4, 5], [7, 8]], 
                 index=["cobra", "viper", "sidewinder"],
                 columns=["max_speed", "shield"])

print(example_df, "\n")
print(example_df.loc["cobra"], "\n")
print(example_df.loc[["cobra"]], "\n")
print(example_df.loc["cobra":"viper", "max_speed"], "\n")

            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8 

max_speed    1
shield       2
Name: cobra, dtype: int64 

       max_speed  shield
cobra          1       2 

cobra    1
viper    4
Name: max_speed, dtype: int64 
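
The example above only uses labels. Since `.loc[]` also accepts a boolean array, here is a small sketch of that usage on the same toy data; the names `snake_df`, `is_fast`, and `fast_shields` are introduced for this illustration only.

```python
import pandas as pd

snake_df = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

# Boolean mask with one True/False per row.
is_fast = snake_df["max_speed"] > 3

# Select the rows where the mask is True, and a single column by label.
fast_shields = snake_df.loc[is_fast, "shield"]
print(fast_shields, "\n")

# The same selection can be used to assign to those cells.
snake_df.loc[is_fast, "shield"] = 0
print(snake_df)
```

Here `fast_shields` holds the `shield` values for `viper` and `sidewinder` (5 and 8), and after the assignment the `shield` column reads 2, 0, 0.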



#### Before:
```python
median_alcohol = df.alcohol.median()

for i, alcohol in enumerate(df.alcohol):
    if alcohol >= median_alcohol:
        df.loc[i, "alcohol"] = "high"
    else:
        df.loc[i, "alcohol"] = "low"
```

#### `pandas.DataFrame.groupby`

- Group DataFrame or Series using a mapper or by a Series of columns. 
- Involves some combination of splitting the object, applying a function, and combining the results. 

In [5]:
# Example: groupby
example_df = pd.DataFrame({"Animal": ["Falcon", 
                              "Falcon", 
                              "Parrot", 
                              "Parrot"],
                  "Max Speed" : [380., 370., 24., 26.]})

print(example_df)
print(example_df.groupby(["Animal"]).mean())
example_df.groupby(["Max Speed"])

   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
        Max Speed
Animal           
Falcon      375.0
Parrot       25.0


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000012DC1AB5748>
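
To make the split, apply, and combine steps concrete, the sketch below performs them by hand on the same toy data and compares the result with `groupby(...).mean()`. The variable names (`bird_df`, `manual_means`, `manual_result`, `groupby_result`) are chosen for this illustration.

```python
import pandas as pd

bird_df = pd.DataFrame({
    "Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
    "Max Speed": [380.0, 370.0, 24.0, 26.0],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
# Apply: take the mean of "Max Speed" within each sub-DataFrame.
manual_means = {
    animal: group["Max Speed"].mean()
    for animal, group in bird_df.groupby("Animal")
}

# Combine: collect the per-group results into one Series.
manual_result = pd.Series(manual_means, name="Max Speed")

# The one-liner that does all three steps.
groupby_result = bird_df.groupby("Animal")["Max Speed"].mean()

print(manual_result)
print(groupby_result)
```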

#### Task

Instead of repeating this for `pH`, `residual_sugar`, and `citric_acid` can you automate this?

- Use a function to automate this task

In [6]:
def numeric_to_buckets(df, column_name): 
    median = df[column_name].median()
    
    for i, val in enumerate(df[column_name]): 
        if val >= median: 
            df.loc[i, column_name] = "high"
        else: 
            df.loc[i, column_name] = "low"

In [7]:
for feature in df.columns[:-1]:
    numeric_to_buckets(df, feature)
    print(df.groupby(feature).quality.mean(), '\n')

fixed_acidity
high    5.726061
low     5.540052
Name: quality, dtype: float64 

volatile_acidity
high    5.392157
low     5.890166
Name: quality, dtype: float64 

citric_acid
high    5.822360
low     5.447103
Name: quality, dtype: float64 

residual_sugar
high    5.665880
low     5.602394
Name: quality, dtype: float64 

chlorides
high    5.507194
low     5.776471
Name: quality, dtype: float64 

free_sulfur_dioxide
high    5.595268
low     5.677136
Name: quality, dtype: float64 

total_sulfur_dioxide
high    5.522981
low     5.750630
Name: quality, dtype: float64 

density
high    5.540574
low     5.731830
Name: quality, dtype: float64 

pH
high    5.598039
low     5.675607
Name: quality, dtype: float64 

sulphates
high    5.898917
low     5.351562
Name: quality, dtype: float64 

alcohol
high    5.958904
low     5.310302
Name: quality, dtype: float64 
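
`numeric_to_buckets` loops over the rows and overwrites each column in place, so the numeric values are lost afterwards. As a possible alternative, the sketch below labels a column in one vectorized step with `numpy.where` and leaves the original data untouched. The helper name `to_high_low` and the four-row `wine_df` are assumptions made for this example, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def to_high_low(values):
    """Return a Series labeling each value "high" (>= median) or "low"."""
    median = values.median()
    labels = np.where(values >= median, "high", "low")
    return pd.Series(labels, index=values.index, name=values.name)


# Hypothetical miniature of the wine data.
wine_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 12.0],
    "quality": [5, 5, 6, 7],
})

# Group by the labels without modifying wine_df itself.
alcohol_level = to_high_low(wine_df["alcohol"])
mean_quality = wine_df.groupby(alcohol_level)["quality"].mean()

print(mean_quality)
```

With this toy data the median alcohol value is 10.15, so the first two rows are labeled "low" and the last two "high".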



#### Efficient code

- Reducing run time
- Reducing space in memory

Resources: 

- [What makes sets faster than lists](https://stackoverflow.com/questions/8929284/what-makes-sets-faster-than-lists/8929445#8929445)

## Optimizing - Common Books

Improve on matching the books in the lists by:

- Using vector operations over loops when possible (numpy and pandas are your best friends)
- (After googling "how to find common elements in two numpy arrays"), use the `numpy.intersect1d` method ([link](https://docs.scipy.org/doc/numpy/reference/generated/numpy.intersect1d.html))
- Next, know your data structures and which methods are faster.
 - Sets are more efficient here!

In [8]:
import time
import pandas as pd
import numpy as np

#### `with` statement in Python

- Give access to a file by opening it. 
 - Using the `open()` function: `open()` returns a file object, which has methods and attributes for getting information about and manipulating the open file. 
- The `with` statement has cleaner syntax and better exception handling.
 - It simplifies exception handling by encapsulating common preparation and cleanup tasks. It ensures that the cleanup always runs, so the file is closed automatically even if an error occurs inside the block.
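
To show what the `with` statement saves you from writing, the sketch below reads the same file twice: once with an explicit `try`/`finally`, once with `with`. It creates its own temporary file so it runs anywhere; the file contents and variable names are made up for the example.

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1262771\n9011996\n2007022")
    tmp_path = tmp.name

# Manual version: close() has to be called explicitly, even if reading fails.
manual_file = open(tmp_path)
try:
    manual_ids = manual_file.read().split("\n")
finally:
    manual_file.close()

# `with` version: the file is closed automatically when the block exits.
with open(tmp_path) as book_file:
    with_ids = book_file.read().split("\n")

print(manual_ids == with_ids)  # True
print(book_file.closed)        # True

os.remove(tmp_path)
```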

In [9]:
with open("../data/books_published_last_two_years.txt") as file:
    recent_books = file.read().split("\n")
    
with open("../data/all_coding_books.txt") as file: 
    coding_books = file.read().split("\n")

In [10]:
print("Length recent_books:", len(recent_books))
print("Length coding_books:", len(coding_books))

Length recent_books: 24159
Length coding_books: 32250


In [11]:
recent_books[:5]

['1262771', '9011996', '2007022', '9389522', '8181760']

In [12]:
# first method
start = time.time()
recent_coding_books = []

for book in recent_books:
    if book in coding_books:
        recent_coding_books.append(book)
        
end = time.time()
print("Duration: {:.4f}".format(end-start))
print("\nNo. of recent_coding_books:", len(recent_coding_books))

Duration: 18.4696

No. of recent_coding_books: 96


#### Tip #1: Use vector operations over loops whenever possible

NumPy's `intersect1d` method can be used to get the intersection of the `recent_books` and `coding_books` arrays. 

- `intersect1d`: Find the intersection of two arrays. Return the sorted, unique values that are in both of the input arrays. 
```python
>>> numpy.intersect1d([1,2,3], [3,1,1])
array([1,3])
```
- [link](https://docs.scipy.org/doc/numpy/reference/generated/numpy.intersect1d.html)

In [13]:
# second method
start = time.time()
recent_coding_books = np.intersect1d(recent_books, coding_books)
end = time.time()

print("Duration: {:.4f}".format(end-start))
print("\nNo. of recent_coding_books:", len(recent_coding_books))

Duration: 0.0530

No. of recent_coding_books: 96


#### Tip #2: Know your data structures and which methods are faster

Use the set's `intersection` method to get the common elements in `recent_books` and `coding_books`. 

In [14]:
# third method
start = time.time()
recent_coding_books = set(recent_books).intersection(coding_books)
end = time.time()

print("Duration: {:.4f}".format(end-start))
print("\nNo. of recent_coding_books:", len(recent_coding_books))

Duration: 0.0090

No. of recent_coding_books: 96
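
The three methods can also be compared side by side without the book files. The sketch below uses randomly generated ID strings as a stand-in (an assumption; the sizes are much smaller than the real lists) and a hypothetical helper `timed` to measure each approach. Absolute durations depend on the machine, so none are shown here.

```python
import time

import numpy as np

# Synthetic stand-ins for the two ID lists (hypothetical data).
rng = np.random.default_rng(seed=0)
recent_ids = [str(n) for n in rng.integers(0, 100_000, size=3_000)]
coding_ids = [str(n) for n in rng.integers(0, 100_000, size=4_000)]


def timed(label, func):
    """Run func once, print how long it took, and return its result."""
    start = time.perf_counter()
    result = func()
    print("{}: {:.4f} seconds".format(label, time.perf_counter() - start))
    return result


# Method 1: every `in` test scans the whole coding_ids list.
loop_result = timed("list loop", lambda: [b for b in recent_ids if b in coding_ids])
# Method 2: vectorized intersection (sorted, unique values).
numpy_result = timed("intersect1d", lambda: np.intersect1d(recent_ids, coding_ids))
# Method 3: hash-based set intersection.
set_result = timed("set", lambda: set(recent_ids).intersection(coding_ids))

# All three agree on which IDs are shared once duplicates are removed.
print(set(loop_result) == set(numpy_result.tolist()) == set_result)
```

Unlike the other two, the list loop keeps duplicates from `recent_ids`, which is why the results are compared as sets.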


## Quiz: Optimizing - Holiday Gifts

- Using vectorized operations and more efficient data structures can optimize the code significantly. 

We'll use this for another example. 

- One million users have listed a gift on a wish list. 
- Prices: `gift_costs.txt`
- Give each customer their gift for free if it costs less than 25 dollars. 
- Calculate the total cost (after tax) of all gifts under 25 dollars.

General notes:

- Check the data type of your data
- What type of data do you want? In general, numpy arrays are nice to work with and they are fast. 

In [15]:
import time
import numpy as np

In [16]:
# load data
with open("../data/gift_costs.txt") as f:
    gift_costs = f.read().split("\n")

In [17]:
# check type
print("type:", type(gift_costs[0]))

# type wanted: int <- convert to numpy array
gift_costs = np.array(gift_costs).astype(int)

print("type:", type(gift_costs[0]))
print(gift_costs[:5])

type: <class 'str'>
type: <class 'numpy.int32'>
[ 8 84 42 65 74]


In [None]:
# first method
start = time.time()

total_price = 0

for cost in gift_costs:
    if cost < 25:
        total_price += cost*1.08 # cost after tax

end = time.time()

print(round(total_price,2))
print("Duration: {:.4f} seconds".format(end-start))

In [None]:
# second method using conditional numpy
start = time.time()

total_price = (gift_costs[gift_costs < 25]).sum() * 1.08

end = time.time()

print(round(total_price,2))
print("Duration: {:.4f} seconds".format(end-start))

In [None]:
# third method
start = time.time()

total_price = sum((gift_costs[gift_costs < 25])) * 1.08

end = time.time()

print(round(total_price,2))
print("Duration: {:.4f} seconds".format(end-start))

Note that the NumPy array method `.sum()` is considerably faster than Python's built-in `sum(...)`, which iterates over the array one element at a time.
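
A minimal way to check this on your own machine, assuming synthetic gift costs rather than `gift_costs.txt`: both calls should give the same total, while the printed durations show the difference in speed. No timings are listed here because they vary between machines.

```python
import time

import numpy as np

# Synthetic gift costs (hypothetical data, not gift_costs.txt).
rng = np.random.default_rng(seed=0)
costs = rng.integers(1, 100, size=200_000)
cheap_costs = costs[costs < 25]

start = time.perf_counter()
builtin_total = sum(cheap_costs)   # Python-level loop over NumPy scalars
builtin_duration = time.perf_counter() - start

start = time.perf_counter()
method_total = cheap_costs.sum()   # one vectorized call in compiled code
method_duration = time.perf_counter() - start

print(builtin_total == method_total)
print("sum(): {:.4f} s, .sum(): {:.4f} s".format(builtin_duration, method_duration))
```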

## Documentation

- Clarify complex parts of code 
- Making your code easier to navigate
- Quickly conveying how and why different components of your program are used.

### In-line Comments 

### Docstrings
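
As a rough sketch of what a docstring can look like, the function below documents its purpose, arguments, and return value. The function `population_density` and the `Args`/`Returns` layout are illustrative choices, not something prescribed by these notes.

```python
def population_density(population, land_area):
    """Calculate the population density of an area.

    Args:
        population (int): Number of people living in the area.
        land_area (float): Size of the area in square kilometers.

    Returns:
        float: People per square kilometer.
    """
    return population / land_area


print(population_density(1000, 4.0))  # 250.0

# The docstring is what help() shows to users of the function.
help(population_density)
```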

### Project Documentation

### Version Control 

#### Scenario #1: 

You stop working on one feature in favor of another one. What should you do with the "half-baked" code? Create a new branch in git.

#### STEP 1: You have a local version of this repo on your laptop, and to get the latest stable version, you pull from the develop branch.

> **Switch to develop branch**   
`git checkout develop`

> **Pull latest changes in the develop branch**  
`git pull`

#### STEP 2: When you start working on this demographic feature, you create a new branch for this called demographic, and start working on your code in this branch.

> Create and switch to new branch called demographic from develop branch  
`git checkout -b demographic`

> Work on this new feature and commit as you go:  
`git commit -m 'added gender recommendations'`  
`git commit -m 'added location specific recommendations'`  
...

#### STEP 3: However, in the middle of your work, you need to work on another feature. So you commit your changes on this demographic branch, and switch back to the develop branch.

> Commit changes before switching  
`git commit -m 'refactored demographic gender and location recommendations'`

> Switch to the develop branch  
`git checkout develop`

#### STEP 4: From this stable develop branch, you create another branch for a new feature called friend_groups.

> Create and switch to new branch called friend_groups from develop branch  
`git checkout -b friend_groups`

#### STEP 5: After you finish your work on the friend_groups branch, you commit your changes, switch back to the develop branch, merge friend_groups into it, and push the result to the remote repository’s develop branch.

> Commit changes before switching  
`git commit -m 'finalized friend_groups recommendations'`

> Switch to the develop branch  
`git checkout develop`

> Merge friend_groups branch to develop  
`git merge --no-ff friend_groups`

> Push to remote repository  
`git push origin develop`

#### STEP 6: Now, you can switch back to the demographic branch to continue your progress on that feature.

> Switch to the demographic branch  
`git checkout demographic`

## Model Versioning 

Resources for useful ways and tools for managing versions of models and large data. 

- [How to Version Control Your Production Machine Learning Models](https://blog.algorithmia.com/how-to-version-control-your-production-machine-learning-models/)
- [Versioning Data Science](https://shuaiw.github.io/2017/07/30/versioning-data-science.html)

In [None]:
!jupyter nbconvert --to html notebook.ipynb

In [None]:
print("Done!")