# Day 1 recap

## Load data

In [1]:
import pandas as pd

data = pd.read_csv("../data/taxi_1k.csv")

data.head()

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


## Feature engineering distances

### Manhattan distance 

[Manhattan distance](https://datagy.io/manhattan-distance-python/#:~:text=In%20a%20two%2Ddimensional%20space,%2B%20%7Cy2%20%2D%20y1%7C%20.)

In [2]:
import numpy as np

def manhattan_distance(lon1, lat1, lon2, lat2):
    # Sum of the absolute coordinate differences, expressed in degrees
    distance = np.abs(lon1 - lon2) + np.abs(lat1 - lat2)

    return distance

In [3]:
data['manhattan_distance'] = manhattan_distance(
    data.pickup_longitude,
    data.pickup_latitude,
    data.dropoff_longitude,
    data.dropoff_latitude)

data.head()

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,manhattan_distance
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1,0.011742
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1,0.107481
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2,0.019212
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1,0.029386
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1,0.027194
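
As a sanity check, the first row can be reproduced by hand. This sketch applies the same formula to plain floats; built-in `abs` stands in for `np.abs`, an assumption made only so the snippet runs without NumPy.

```python
# Coordinates of the first trip, copied from the table above
pickup_lon, pickup_lat = -73.844311, 40.721319
dropoff_lon, dropoff_lat = -73.84161, 40.712278

# Same formula as manhattan_distance, applied to scalars
d_lon = abs(pickup_lon - dropoff_lon)   # about 0.002701 degrees
d_lat = abs(pickup_lat - dropoff_lat)   # about 0.009041 degrees
first_row_distance = d_lon + d_lat

print(round(first_row_distance, 6))  # 0.011742
```

Note that the result is in degrees of longitude and latitude, not in kilometres.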


### Euclidean distance

[Euclidean distance](https://datascienceparichay.com/article/distance-between-two-points-python/)

In [4]:
def euclidean_distance(lon1, lat1, lon2, lat2):
    # Straight-line distance between the two points, expressed in degrees
    distance = ((lon1 - lon2)**2 + (lat1 - lat2)**2)**0.5

    return distance

In [5]:
data['euclidean_distance'] = euclidean_distance(
    data.pickup_longitude,
    data.pickup_latitude,
    data.dropoff_longitude,
    data.dropoff_latitude)

data.head()

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,manhattan_distance,euclidean_distance
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1,0.011742,0.009436
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1,0.107481,0.079696
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2,0.019212,0.013674
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1,0.029386,0.02534
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1,0.027194,0.01947
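
The two features are closely related: in two dimensions the Euclidean distance is never larger than the Manhattan distance, and the Manhattan distance is at most √2 times the Euclidean one. A quick check of that relationship on the five rows shown above, as a sketch using the rounded values from the table:

```python
import math

# (manhattan_distance, euclidean_distance) pairs copied from the five rows above
rows = [
    (0.011742, 0.009436),
    (0.107481, 0.079696),
    (0.019212, 0.013674),
    (0.029386, 0.02534),
    (0.027194, 0.01947),
]

for manhattan_d, euclidean_d in rows:
    # The straight line is never longer than the grid path...
    assert euclidean_d <= manhattan_d
    # ...and the grid path is at most sqrt(2) times the straight line
    # (small tolerance because the table values are rounded)
    assert manhattan_d <= math.sqrt(2) * euclidean_d + 1e-5

print("all rows satisfy euclidean <= manhattan <= sqrt(2) * euclidean")
```

Because the two columns move together this closely, they are likely to carry very similar information for a model.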


## Packaging the distance functions

### `distances.py`

Move the distance functions to a `distances.py` file in a `taxi_package` folder, so that they can be imported as `taxi_package.distances`.

```python
import numpy as np

def manhattan(lon1, lat1, lon2, lat2):

    distance = np.abs(lon1 - lon2) + np.abs(lat1 - lat2)

    return distance


def euclidean(lon1, lat1, lon2, lat2):

    distance = ((lon1 - lon2)**2 + (lat1 - lat2)**2)**0.5

    return distance

```

### Make a script `me-2-manhattan`

Create a `scripts` folder in the root directory and add a Python script named `me-2-manhattan` to it.

⚠️ A script must start with the following header so that the shell runs it with Python and the interpreter knows the file's encoding:

In [24]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

After the header, you can write any Python code.

Let's write a script that calculates the Manhattan distance between me and Manhattan, regardless of where I am.

We can use the [`geocoder`](https://pypi.org/project/geocoder/0.6.0/) package for that. 

In [32]:
import geocoder

from taxi_package.distances import manhattan

# My location using geocoder
my_loc = geocoder.ip('me')

my_lat = my_loc.latlng[0]

my_lon = my_loc.latlng[1]

# Manhattan's geolocation: 40.7831 N, 73.9712 W (west longitudes are negative)
manhattan_lat = 40.7831

manhattan_lon = -73.9712

# Calculate distance between me and manhattan using our manhattan function :D
me_to_manhattan = manhattan(my_lon, my_lat, manhattan_lon, manhattan_lat)

print(f"The Manhattan distance between me and Manhattan is {me_to_manhattan}")


The script prints the distance in degrees; the value depends on where you run it.
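
For reference, here is a minimal sketch of how the whole script could be organised. It is not the course's script: the `manhattan` function is a local stand-in for `taxi_package.distances.manhattan`, the helper name `distance_to_manhattan` is hypothetical, and a fixed location replaces the live `geocoder.ip('me')` lookup so the block runs offline. It also assumes a failed lookup yields an empty or missing `latlng`.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Manhattan's coordinates: 40.7831 N, 73.9712 W (west longitudes are negative)
MANHATTAN_LAT = 40.7831
MANHATTAN_LON = -73.9712


def manhattan(lon1, lat1, lon2, lat2):
    # Local stand-in for taxi_package.distances.manhattan (same formula)
    return abs(lon1 - lon2) + abs(lat1 - lat2)


def distance_to_manhattan(latlng):
    # `latlng` is a [latitude, longitude] pair, the order geocoder's `latlng` uses
    if not latlng:
        raise ValueError("location could not be determined")
    lat, lon = latlng
    return manhattan(lon, lat, MANHATTAN_LON, MANHATTAN_LAT)


if __name__ == "__main__":
    # Hypothetical fixed location (Paris) instead of a live IP lookup
    print(distance_to_manhattan([48.8566, 2.3522]))
```

Keeping the calculation in a function and the lookup under `if __name__ == "__main__":` makes the script easy to test without a network connection.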


### `setup.py`

Create a `setup.py` file in the root folder.

```python
from setuptools import setup
from setuptools import find_packages

# list dependencies from file
with open('requirements.txt') as f:
    content = f.readlines()
    requirements = [x.strip() for x in content]

setup(name='taxi_package', # Name of the package
      description="functions for taxi project", # Whatever you want
      packages=find_packages(), # Finds every folder that contains an __init__.py
      install_requires=requirements, # Libraries installed along with the package
      scripts=['scripts/me-2-manhattan']) # Scripts to install as terminal commands


```
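
`find_packages()` only picks up folders that contain an `__init__.py` file, a step that is easy to forget. The sketch below builds the expected project layout in a temporary directory and lists it. The temporary location and the empty file contents are illustrative assumptions, and the module is named `distances.py` to match the `taxi_package.distances` imports.

```python
import tempfile
from pathlib import Path

# Files expected in the project, relative to the project root
LAYOUT = [
    "taxi_package/__init__.py",   # empty file that marks the folder as a package
    "taxi_package/distances.py",  # module imported as taxi_package.distances
    "scripts/me-2-manhattan",
    "requirements.txt",
    "setup.py",
]

with tempfile.TemporaryDirectory() as root:
    for relative in LAYOUT:
        path = Path(root) / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    # Collect what was created, as paths relative to the root
    created = sorted(
        p.relative_to(root).as_posix() for p in Path(root).rglob("*") if p.is_file()
    )

print("\n".join(created))
```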

### `requirements.txt`

Add `numpy` and `geocoder` to the requirements file, as they are necessary for the package.
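
The list comprehension in `setup.py` keeps every line of `requirements.txt`, including blank ones. A slightly more defensive variant, offered as a sketch (the helper name `parse_requirements` is hypothetical), skips blank lines and comments:

```python
def parse_requirements(text):
    # Keep one requirement per non-empty, non-comment line
    requirements = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            requirements.append(line)
    return requirements


# Example file content with the two libraries the package needs
example = """
# runtime dependencies
numpy
geocoder

"""

print(parse_requirements(example))  # ['numpy', 'geocoder']
```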

### Install!

Installing with the `-e` (editable) flag means that changes to the code take effect without reinstalling. You only need to reinstall when you add new scripts.

```bash
pip install -e .  
```

### Makefile shortcut

You can also create an `install` target in the `Makefile` (recipe lines must be indented with a tab, not spaces):

```makefile
install:
	pip install -e .
```

## Use the package

In [17]:
data = pd.read_csv("../data/taxi_1k.csv")

data.head()

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


You can import your functions in a Python session or notebook:

In [18]:
from taxi_package.distances import euclidean # Import the Euclidean function from the package

data['euclidean_distance'] = euclidean(data.pickup_longitude,
                                      data.pickup_latitude,
                                      data.dropoff_longitude,
                                      data.dropoff_latitude)

data.head()

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,euclidean_distance
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1,0.009436
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1,0.079696
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2,0.013674
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1,0.02534
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1,0.01947


Or you can run your script in the terminal

```bash
me-2-manhattan
```

## Test the package

Create a `tests` folder in the root directory and add **one `.py` file per function you want to test.**

### `test_manhattan.py`

```python
from taxi_package.distances import manhattan

def test_manhattan():

    assert manhattan(0, 0, 0, 0) == 0
```

### `test_euclidean.py`

```python
from taxi_package.distances import euclidean

def test_euclidean():

    assert euclidean(0, 0, 0, 0) == 0
```
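
The all-zero case alone would also pass for a function that always returns 0, so a few cases with known answers make the tests more meaningful. The sketch below uses local copies of the two functions with built-in `abs` (an assumption, so the block runs without the package installed). In the `tests` folder you would import them from `taxi_package.distances` instead.

```python
import math


def manhattan(lon1, lat1, lon2, lat2):
    # Local copy for illustration, with built-in abs instead of np.abs
    return abs(lon1 - lon2) + abs(lat1 - lat2)


def euclidean(lon1, lat1, lon2, lat2):
    # Local copy for illustration
    return ((lon1 - lon2)**2 + (lat1 - lat2)**2)**0.5


def test_manhattan_known_value():
    # Moving 3 along one axis and 4 along the other gives 3 + 4 = 7
    assert manhattan(0, 0, 3, 4) == 7


def test_euclidean_known_value():
    # The 3-4-5 right triangle: sqrt(3**2 + 4**2) = 5
    assert math.isclose(euclidean(0, 0, 3, 4), 5)


def test_distances_are_symmetric():
    # Swapping pickup and dropoff must not change the distance
    assert manhattan(1, 2, -3, 5) == manhattan(-3, 5, 1, 2)
    assert math.isclose(euclidean(1, 2, -3, 5), euclidean(-3, 5, 1, 2))
```

`pytest` collects any function whose name starts with `test_`, so these run alongside the existing tests.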

### Create a `test` command in the `Makefile`

This will allow you to run all the tests in one go from the terminal.

```makefile
test:
	@coverage run -m pytest tests/*.py
	@coverage report -m --omit=$(VIRTUAL_ENV)/lib/python*
```

⚠️ Don't forget to add `pytest` and `coverage` to your `requirements.txt`

## Share your package

### On GitHub

```bash

git init # Initialize repo locally

gh repo create # Create a remote GitHub repo

git add . # Add all modifications

git commit -m "first commit" # Commit all modifications

git push origin master # Push all modifications to the GitHub repo

gh browse # Open GitHub in browser

```

Your package can now be installed by anyone with SSH access to the repository, using the SSH link provided on GitHub:

```bash
pip install git+ssh://git@github.com/Benlecoq/all_engineering
```

### On PyPI

Follow [these instructions](https://anweshadas.in/how-to-upload-a-package-in-pypi-using-twine/) to upload your package to PyPI.

## CI/CD

### Continuous Integration

From your root directory, create the following folder/file structure:

`.github/workflows/pythonpackage.yml`

In the `pythonpackage.yml`, add the following code to orchestrate your continuous integration.

```yml
name: Python package

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:

  build:

    runs-on: macos-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v1
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        pip install -r requirements.txt
    - name: Install package and test
      run: |
        make install test

    strategy:
      matrix:
        python-version: [3.8]
```


END OF RECAP Day 1


# Day 2 recap

## Visualization

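In [None]:
import matplotlib.pyplot as plt
from taxi_package.distances import manhattan

# `data` was reloaded above, so recompute the Manhattan distance column
data['manhattan_distance'] = manhattan(data.pickup_longitude, data.pickup_latitude,
                                       data.dropoff_longitude, data.dropoff_latitude)

plt.scatter(data.manhattan_distance,data.fare_amount)

In [None]:
plt.scatter(data.euclidean_distance,data.fare_amount)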

## Modelling

### Prep X and y

In [None]:
# 'Unnamed: 0' is only the row index saved in the CSV, so it is dropped too
X = data.drop(columns=['Unnamed: 0','key','fare_amount','pickup_datetime'])
X.head()

In [None]:
y = data['fare_amount']
y.head()

### Scale

In [None]:
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

X_scaled
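
`StandardScaler` subtracts each column's mean and divides by its standard deviation (the population standard deviation, dividing by n rather than n - 1). A minimal pure-Python sketch of the same transformation on a single column, using the five `fare_amount` values from the first table purely as sample numbers:

```python
import statistics

# Sample column: the five fare_amount values shown in the first table
column = [4.5, 16.9, 5.7, 7.7, 5.3]

mean = statistics.fmean(column)
std = statistics.pstdev(column)  # population standard deviation, as StandardScaler uses

# Centre on the mean, then rescale by the standard deviation
scaled = [(value - mean) / std for value in column]

# After scaling, the column has mean 0 and standard deviation 1 (up to rounding)
print(round(abs(statistics.fmean(scaled)), 6), round(statistics.pstdev(scaled), 6))
```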

### Baseline Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor().fit(X_scaled,y)