# Step 1: Scope the Project and Gather Data
Since the scope of the project will be highly dependent on the data, these two things happen simultaneously. In this step, you’ll:

- Identify and gather the data you'll be using for your project (at least two sources and more than 1 million rows). See Project Resources for ideas of what data you can use.
- Explain what end use cases you'd like to prepare the data for (e.g., analytics table, app back-end, source-of-truth database, etc.)

# Step 2: Explore and Assess the Data
- Explore the data to identify data quality issues, like missing values, duplicate data, etc.
- Document steps necessary to clean the data

# Step 3: Define the Data Model
- Map out the conceptual data model and explain why you chose that model
- List the steps necessary to pipeline the data into the chosen data model

# Step 4: Run ETL to Model the Data
- Create the data pipelines and the data model
- Include a data dictionary
- Run data quality checks to ensure the pipeline ran as expected
  - Integrity constraints on the relational database (e.g., unique key, data type, etc.)
  - Unit tests for the scripts to ensure they are doing the right thing
  - Source/count checks to ensure completeness

# Step 5: Complete Project Write Up
- What's the goal? What queries will you want to run? How would Spark or Airflow be incorporated? Why did you choose the model you chose?
- Clearly state the rationale for the choice of tools and technologies for the project.
- Document the steps of the process.
- Propose how often the data should be updated and why.
- Post your write-up and final data model in a GitHub repo.
- Include a description of how you would approach the problem differently under the following scenarios:
  - If the data was increased by 100x.
  - If the pipelines were run on a daily basis by 7am.
  - If the database needed to be accessed by 100+ people.

# 1 Scope

This project scrapes the private API of a mobile app to gather its underlying data. After some data modeling, the data can be used for analytics and visualizations. The topic of the project is gin, so the data contains some quantifiable elements, some factual elements, and a lot of personal opinions.

# Reverse-engineering a private API

The only place I was able to find the information I was looking for (data on different gin brands) was behind an iOS/Android app. The following is a high-level description of how to set up access to the private API of a mobile app. You can find further resources in the provided links.
- The simplest approach is to use an Android emulator; I used Android Studio. Because of Android's strict Certificate Authority (CA) management, setting up mitmproxy with a system certificate on an Android emulator is a bit finicky. An alternative is a rooted physical device, which makes CA management much easier.
- Download an APK version of your target app and install it on the emulated device.
- Install ADB and add the platform-tools folder to your PATH variable (see the linked guide). Check that your emulator is connected to ADB by running `adb devices`.
- Install HTTP Toolkit. Select "Android device via ADB" as your traffic source and follow the setup steps in the emulator.
- Done! You should now see HTTP requests coming in from the emulator.
- All that is left to do is find the GET request you are after, then work out the URL structure and the API key that will authenticate our requests.
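Once a request of interest shows up in the traffic, its URL can be split into a reusable template and its query parameters. The sketch below is one way to do that with the standard library. The captured URL is a placeholder on `example.com`, the function name `split_captured_url` is hypothetical, and the assumption that the last path segment is a numeric product id and that the key travels in an `api_key` parameter is taken from the request format used later in this notebook.

```python
from urllib.parse import urlsplit, parse_qs


def split_captured_url(captured_url):
    """Split a captured request URL into (url_template, api_key, query_params).

    Assumes the last path segment is a numeric product id; it is replaced
    by '{}' so the template can be reused with str.format().
    """
    parts = urlsplit(captured_url)
    query = parse_qs(parts.query)  # every value is a list of strings
    api_key = query.get("api_key", [None])[0]

    path_head, _, last_segment = parts.path.rpartition("/")
    if last_segment.isdigit():
        template_path = path_head + "/{}"
    else:
        template_path = parts.path  # no numeric id found, keep the path as-is

    template = f"{parts.scheme}://{parts.netloc}{template_path}"
    return template, api_key, query


# Placeholder URL for illustration only, not a real captured request
template, key, params = split_captured_url(
    "https://example.com/api/v2/products/42?api_key=PLACEHOLDER&lang=en"
)
```

The returned template can then be filled with successive product ids, while the key and any other parameters are appended as the query string.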


One of the first requests the app sends returns the full list of gins/tonics on the site, with a reduced number of fields.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
import numpy as np
import json


# Get requests

In [None]:
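# URL template taken from the captured traffic; filled with a product id and the API key
baseURL = 'https://ginventory.reed.be/api/v2/products/{}?api_key={}&lang=en'
api_key = '175405344b34bde70ef2970b44e8f07d'
# Identify ourselves to the API operator
headers = {
    'User-Agent': 'Test, Peter Oravecz',
    'From': 'peteroravecz9@gmail.com'
}

response_collection = []
for i in range(1, 100):  # product ids 1 through 99
    url = baseURL.format(i, api_key)
    response = requests.get(url, headers=headers)
    print(i)  # progress indicator
    response_collection.append(response.json())
    time.sleep(0.1)  # short pause between requests to avoid hammering the server

# Save the raw responses so the API does not have to be queried again
with open('data_100.json', 'w', encoding='utf-8') as f:
    json.dump(response_collection, f, ensure_ascii=False, indent=4)
print("Responses collected!")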
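The loop above assumes every request succeeds and returns JSON. A slightly more defensive variant could record which ids failed instead of stopping at the first error. The sketch below keeps the network call out of the loop so it can be tried without touching the API; `collect_products` and `fetch_one` are hypothetical names, not part of the original notebook.

```python
import time


def collect_products(product_ids, fetch_one, delay=0.1):
    """Call fetch_one(product_id) for each id and separate successes from failures.

    fetch_one is any callable that returns the parsed JSON payload,
    or None when the request failed or the body was not valid JSON.
    """
    collected, failed = [], []
    for product_id in product_ids:
        payload = fetch_one(product_id)
        if payload is None:
            failed.append(product_id)  # remember the id so it can be retried later
        else:
            collected.append(payload)
        time.sleep(delay)  # pause between requests, as in the original loop
    return collected, failed


# Stand-in fetcher for illustration: pretends that product 2 does not exist
def fake_fetch(product_id):
    return None if product_id == 2 else {"id": product_id}


products, missing = collect_products([1, 2, 3], fake_fetch, delay=0)
```

In the notebook, `fetch_one` could wrap `requests.get` with the same URL template and headers, returning `None` when the status code is not 200 or when `response.json()` raises.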

# Read in requests from file

In [98]:
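# Load the saved responses and flatten the nested JSON into columns
with open("data_100.json", 'r', encoding='utf-8') as file:
    file_json = json.load(file)
df = pd.json_normalize(file_json)

# Drop user-specific and purchase-related fields that are not needed for the analysis
df = df.drop(['direct_purchase_url', 'user_rating', 'in_wishlist', 'in_cabinet', 'purchase_links.data'], axis=1)
#df = df.drop(['perfect_tonics.data','perfect_garnishes.data', 'perfect_gins.data'], axis=1)

# Keep only the gin products (the API also returns tonics and garnishes)
df_gin = df[df.type == 'gin']
df_gin.head()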

Unnamed: 0,id,type,name,first_name,second_name,picture_url,producer,country,abv,average_rating,rating_count,description.content,description.google_translation,description.original_content,perfect_tonics.data,perfect_garnishes.data,perfect_gins.data
0,1,gin,1085 Toledo Gin,,,https://ginventory.reed.be/api/v2/products/1/p...,Licores Caro,Spain,40,6.6,74.0,"This Premium Geneva Toledana, is called 1085, ...",True,"Esta Ginebra Premium Toledana, lleva por nombr...","[{'id': 2, 'type': 'tonic', 'name': 'Macario T...","[{'id': 5, 'type': 'garnish', 'name': 'Grapefr...",
5,6,gin,119 Gin,,,https://ginventory.reed.be/api/v2/products/6/p...,Carmelitano Distilleries,Spain,40,6.9,39.0,119 Gin is a Premium London Dry Gin. It is obt...,False,,"[{'id': 7, 'type': 'tonic', 'name': 'Schweppes...","[{'id': 8, 'type': 'garnish', 'name': 'Ginger'...",
9,10,gin,12 Bridges Gin - (Discontinued),12 Bridges Gin,(Discontinued),https://ginventory.reed.be/api/v2/products/10/...,Integrity Spirits,United States,45,6.4,43.0,Aptly named for the number of bridges in our c...,False,,"[{'id': 13, 'type': 'tonic', 'name': 'Fever Tr...","[{'id': 14, 'type': 'garnish', 'name': 'Cucumb...",
15,16,gin,12/11 Gin,,,https://ginventory.reed.be/api/v2/products/16/...,Destilerías Liber,Spain,425,6.9,23.0,This gin is London Dry type and is made with a...,False,,"[{'id': 17, 'type': 'tonic', 'name': 'Fever Tr...","[{'id': 21, 'type': 'garnish', 'name': 'Cardam...",
21,22,gin,12/11 Gin Aurum Limited Edition,,,https://ginventory.reed.be/api/v2/products/22/...,Destilerías Liber,Spain,425,7.4,34.0,Gin December 11 AURUM Limited Edition is a gin...,False,,"[{'id': 23, 'type': 'tonic', 'name': '1724 Ton...","[{'id': 15, 'type': 'garnish', 'name': 'Lemon'...",
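In the preview above, both Destilerías Liber gins show an `abv` of 425, which cannot be a percentage. The cause is not confirmed by the data (a lost decimal separator is one possibility), so the safest first step is to flag such rows rather than correct them. The sketch below works on plain records and uses only the standard library; the function name `implausible_abv` and the 0 to 100 bounds are assumptions made for illustration.

```python
def implausible_abv(records, low=0.0, high=100.0):
    """Return (id, raw_abv) pairs whose abv is missing, non-numeric, or outside (low, high]."""
    flagged = []
    for record in records:
        raw = record.get("abv")
        try:
            value = float(raw)  # the API delivers abv as a string, e.g. '40'
        except (TypeError, ValueError):
            flagged.append((record.get("id"), raw))
            continue
        if not (low < value <= high):
            flagged.append((record.get("id"), raw))
    return flagged


# Ids 1 and 16 mirror rows from the preview; id 99 is a made-up record with a missing value
sample = [
    {"id": 1, "abv": "40"},
    {"id": 16, "abv": "425"},
    {"id": 99, "abv": None},
]
suspicious = implausible_abv(sample)
```

The same function can be run on the loaded JSON list (`file_json`) before normalizing, and the flagged ids documented as a cleaning step.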


In [93]:
df['perfect_tonics.data'].apply(pd.Series)

Unnamed: 0,0,1,2,3
0,"{'id': 2, 'type': 'tonic', 'name': 'Macario To...",,,
1,,,,
2,,,,
3,,,,
4,,,,
...,...,...,...,...
94,"{'id': 97, 'type': 'tonic', 'name': 'Bö Premiu...","{'id': 96, 'type': 'tonic', 'name': 'Original ...","{'id': 12, 'type': 'tonic', 'name': 'Peter Spa...",
95,,,,
96,,,,
97,,,,


In [91]:
df_nested_test = pd.json_normalize(file_json, record_path=['perfect_tonics'])
df_nested_test.head()

TypeError: {'id': 1, 'type': 'gin', 'name': '1085 Toledo Gin', 'first_name': None, 'second_name': None, 'picture_url': 'https://ginventory.reed.be/api/v2/products/1/picture?type=normal', 'producer': 'Licores Caro', 'country': 'Spain', 'abv': '40', 'direct_purchase_url': None, 'average_rating': '6.6', 'rating_count': 74, 'user_rating': None, 'in_wishlist': None, 'in_cabinet': None, 'description': {'content': 'This Premium Geneva Toledana, is called 1085, to mark the milestone of the year in which the Kingdom of Taifa Arab Toledo agrees to join the kingdom of Castile paying allegiance to their king in exchange for a bull that ensures respect for all citizens regardless of their origin, race ... 1085 Gin is made in the style of traditional London Dry Gin way. In the process of making natural ingredients, unmalted barley and juniper fruits are used. In the first phase of the maceration process it is done in copper stills barley and fruits of juniper. Subsequently, a triple is done ... As a result of careful preparation 1085 is a gin 40º bright, light and soft on the palate, very balanced and complex aromatic variety. Each of the 8 botanicals used up its characteristic and unique personality: Juniper gives a very strong base. Cardamom and angelica root are the main botanical ...', 'google_translation': True, 'original_content': 'Esta Ginebra Premium Toledana, lleva por nombre 1085, para conmemorar el hito histórico de ese año en que el Reino de Taifas Árabe de Toledo acepta unirse al reino de Castilla rindiendo vasallaje a su rey a cambio de una bula que asegura el respeto a todos los ciudadanos con independencia de su origen, raza ...\nGin 1085 está elaborada al estilo London Dry Gin de manera tradicional. En el proceso de elaboración se usan ingredientes naturales, cebada sin maltear y frutos de enebro. En la primera fase del proceso se realiza la maceración en alambiques de cobre de la cebada y los frutos de enebro. 
Posteriormente se realiza una triple ...\nComo resultado de su esmerada elaboración 1085 resulta una ginebra de 40º brillante, ligera y suave en el paladar, muy equilibrada y de compleja variedad aromática. Cada uno de los 8 botánicos utilizados componen su característica y única personalidad: El enebro le da una base muy intensa. El cardamomo y la raíz de Angélica son los botánicos principales ...'}, 'perfect_tonics': {'data': [{'id': 2, 'type': 'tonic', 'name': 'Macario Tonica', 'picture_url': 'https://ginventory.reed.be/api/v2/products/2/picture', 'average_rating': '7.5', 'rating_count': 28}]}, 'perfect_garnishes': {'data': [{'id': 5, 'type': 'garnish', 'name': 'Grapefruit'}, {'id': 4, 'type': 'garnish', 'name': 'Lime zest'}, {'id': 3, 'type': 'garnish', 'name': 'Orange zest'}]}, 'purchase_links': {'data': []}} has non list value {'data': [{'id': 2, 'type': 'tonic', 'name': 'Macario Tonica', 'picture_url': 'https://ginventory.reed.be/api/v2/products/2/picture', 'average_rating': '7.5', 'rating_count': 28}]} for path perfect_tonics. Must be list or null.
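The error message points at the cause: `perfect_tonics` is not a list but a dict that wraps the list under a `data` key, so the record path has to go one level deeper. pandas accepts a nested `record_path` such as `['perfect_tonics', 'data']`, but that would likely still fail for products that lack the key entirely. A minimal alternative, sketched below in plain Python, walks the records and tolerates missing or null fields. The function name `flatten_pairings` and the output column names are assumptions made for illustration.

```python
def flatten_pairings(products, field="perfect_tonics"):
    """Build one row per (product, paired item) from the nested '<field>.data' lists."""
    rows = []
    for product in products:
        nested = product.get(field) or {}      # field may be missing or None
        for item in nested.get("data") or []:  # 'data' may be missing, None, or empty
            rows.append({
                "gin_id": product.get("id"),
                "gin_name": product.get("name"),
                "paired_id": item.get("id"),
                "paired_name": item.get("name"),
            })
    return rows


# First record trimmed from the API response above; the second is a made-up record without pairings
sample_products = [
    {
        "id": 1,
        "name": "1085 Toledo Gin",
        "perfect_tonics": {"data": [{"id": 2, "type": "tonic", "name": "Macario Tonica"}]},
    },
    {"id": 2, "name": "Macario Tonica"},
]
pairing_rows = flatten_pairings(sample_products)
```

The resulting list of dicts can be passed to `pd.DataFrame` to get a gin-to-tonic link table, and the same function can be reused with `field="perfect_garnishes"`.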

In [73]:
df = pd.json_normalize(file_json)

df.head()

Unnamed: 0,data
0,"[{'id': 1, 'type': 'gin', 'name': '1085 Toledo..."
