# Rainy Days Got Them Down
## Module 2 Final Lab
### NYC-MHTN-DS-042219
* Tino Pietrassyk
* David Haase

### Project Description: Practice Scraping Weather & Sports Data into MongoDB
#### About This Lab
A quick note before getting started: this lab isn't like the other labs you've seen so far. It is meant to take ~8 hours to complete, so it's much longer and more challenging than the average lab. If you feel like this lab is challenging or that you might be struggling a bit, don't fret; that's by design! With everything we've learned about web scraping, APIs, and databases, the best way to test our knowledge is to build something substantial!

#### The Project
In this lab, we're going to make use of everything we've learned about APIs, databases, and Object-Oriented Programming to Extract, Transform, and Load (ETL, for short) some data from a SQL database into a MongoDB database.

You'll find a database containing information about soccer teams and the matches they've played in the file database.sqlite. For this project, our goal is to get the data we think is important from this SQL database, do some calculations and data transformation, and then store everything in a MongoDB database.
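
As a minimal sketch of the extraction step, the query below totals goals and wins per team for one season. The table and column names (`Match`, `home_team_api_id`, `away_team_api_id`, `home_team_goal`, `away_team_goal`, `season`) and the season label `'2011/2012'` are assumptions; check them against the data dictionary before pointing this at database.sqlite. To stay runnable on its own, the example uses a tiny in-memory database with made-up rows.

```python
import sqlite3


def season_totals(conn, season):
    """Return {team_id: {"goals": int, "wins": int}} for one season.

    Assumes a "Match" table with the columns named in the query below.
    """
    query = """
        SELECT team_id, SUM(goals) AS goals, SUM(won) AS wins
        FROM (
            -- each match counts once for the home side...
            SELECT home_team_api_id AS team_id,
                   home_team_goal AS goals,
                   home_team_goal > away_team_goal AS won
            FROM "Match" WHERE season = ?
            UNION ALL
            -- ...and once for the away side
            SELECT away_team_api_id,
                   away_team_goal,
                   away_team_goal > home_team_goal
            FROM "Match" WHERE season = ?
        )
        GROUP BY team_id
    """
    rows = conn.execute(query, (season, season)).fetchall()
    return {team_id: {"goals": goals, "wins": wins} for team_id, goals, wins in rows}


# Made-up rows in an in-memory database, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "Match" (season TEXT, home_team_api_id INTEGER, '
    "away_team_api_id INTEGER, home_team_goal INTEGER, away_team_goal INTEGER)"
)
conn.executemany(
    'INSERT INTO "Match" VALUES (?, ?, ?, ?, ?)',
    [
        ("2011/2012", 1, 2, 3, 1),  # team 1 wins at home
        ("2011/2012", 2, 1, 2, 2),  # draw
        ("2010/2011", 1, 2, 0, 5),  # other season, ignored
    ],
)
totals = season_totals(conn, "2011/2012")
print(totals)
```

Doing the aggregation in SQL keeps the Python side small; the same totals could also be computed in pandas after loading the table.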

#### The Goal
Start by examining the data dictionary for the SQL database we'll be working with, which comes from this Kaggle page. Familiarize yourself with the tables it contains and what each column means. We'll be using this database to get data on each soccer team, calculate some summary statistics, and then store each team's results in a MongoDB database.

Upon completion of this lab, each unique team in this dataset should have a record in the MongoDB instance containing the following information:
* The name of the team
* The total number of goals scored by the team during the 2011 season
* The total number of wins the team earned during the 2011 season
* A histogram visualization of the team's wins and losses for the 2011 season (store the visualization directly)
* The team's win percentage on days where it was raining during games in the 2011 season.
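
One way to picture the target record is the sketch below, which computes the rainy-day win percentage and assembles a dictionary ready for insertion. The field names (`goals_2011`, `wins_2011`, `rainy_win_pct_2011`, `histogram_png`) and both helper functions are hypothetical, chosen for illustration; the lab does not prescribe them.

```python
def rainy_win_percentage(games):
    """games: iterable of (won, rainy) boolean pairs, one per match.

    Returns the share of rainy-day games that were won, as a percentage,
    or None when the team played no rainy games.
    """
    rainy_results = [won for won, rainy in games if rainy]
    if not rainy_results:
        return None
    return 100 * sum(rainy_results) / len(rainy_results)


def build_team_record(name, goals, wins, games, histogram_png=None):
    """Assemble one team's document (hypothetical field names)."""
    return {
        "name": name,
        "goals_2011": goals,
        "wins_2011": wins,
        "rainy_win_pct_2011": rainy_win_percentage(games),
        # raw image bytes of the wins/losses histogram, if one was rendered
        "histogram_png": histogram_png,
    }


# Made-up results: two rainy games (one won), one dry game
games = [(True, True), (False, True), (True, False)]
record = build_team_record("Example FC", goals=5, wins=2, games=games)
print(record["rainy_win_pct_2011"])
```

Returning `None` for a team with no rainy games avoids a division by zero and keeps "no data" distinct from a 0% win rate.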

### Library Imports
* **weathergetter** is a homemade Python class that uses location and date to determine historical weather

In [1]:
import pymongo
import weathergetter as wg

## Data Loading
The first step of the pipeline with any data science related tutorial is usually the data loading component. Besides visually describing the dataset in use to your audience, also try to briefly explain (in one or two sentences) where the data came from, i.e., the source of the data. Other specifications like dimensions and attribute type are important but can be neatly explained with examples using code and tools such as pandas.

#### Team and Player Data -- local DB, cloned from Kaggle
* TBD - Tino

#### Location Data -- Google Maps APIs
* The latitude and longitude coordinates were queried through Google APIs by passing the string team_name + ' stadium'
* If no coordinates were available for that string, the latitude and longitude of Berlin were used
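
The fallback rule above can be sketched as follows. The function `stadium_coordinates`, the injected `geocode` callable, and the approximate Berlin coordinates are assumptions made for illustration; they are not the actual weathergetter code, and the fake geocoder stands in for the Google API call.

```python
# Approximate coordinates of Berlin, used when a lookup finds nothing
BERLIN = (52.52, 13.405)


def stadium_coordinates(team_name, geocode):
    """Look up '<team_name> stadium' and fall back to Berlin.

    geocode: any callable that takes a query string and returns
    (lat, lon), or None when nothing was found.
    """
    result = geocode(f"{team_name} stadium")
    return result if result is not None else BERLIN


# Stand-in for the real API call, with one made-up entry
def fake_geocode(query):
    known = {"Example FC stadium": (48.2188, 11.6247)}
    return known.get(query)


print(stadium_coordinates("Example FC", fake_geocode))
print(stadium_coordinates("Unknown Team", fake_geocode))
```

Passing the geocoder in as an argument makes the fallback easy to test without network access. Note that the Berlin default silently assigns Berlin's weather to teams that could not be located, which is worth keeping in mind when reading the rainy-day statistics.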

#### Weather Data -- Dark Sky API
* Historical weather was queried for a given latitude, longitude, and date
* If the string 'rain' appeared in the resulting 'summary' or 'icon' values, the day was considered rainy
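
The rain check described above might look like the sketch below. It assumes the weather for one day has already been parsed into a dictionary with optional 'summary' and 'icon' keys; the function name `is_rainy` and the case-insensitive comparison are our additions, not necessarily what weathergetter does.

```python
def is_rainy(daily):
    """Return True if 'rain' appears in the day's summary or icon text.

    daily: dict for one day; missing or None values are treated as empty.
    """
    summary = (daily.get("summary") or "").lower()
    icon = (daily.get("icon") or "").lower()
    return "rain" in summary or "rain" in icon


# Made-up sample responses
print(is_rainy({"summary": "Light rain in the evening.", "icon": "rain"}))
print(is_rainy({"summary": "Clear throughout the day.", "icon": "clear-day"}))
print(is_rainy({}))
```

A plain substring match is simple, but it would also flag a summary such as "No rain expected", so spot-checking a few dates is a good idea.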

In [2]:
# Example of weathergetter
# for team in teams:
#    rainy = wg.WeatherGetter().is_rain('7/1/1967', team + ' stadium', show=False)

## [Data Preprocessing]
Although it is sometimes not necessary, as some datasets already come preprocessed, I believe it is important to briefly mention what preprocessing steps the data has undergone, even if you need to do this through code examples. It should clear up any confusion that could arise during the modeling section of the tutorial. Remember, your audience wants to get a broad understanding of the data before the modeling component of the tutorial, so try to explain this part as clearly as possible with examples. Take advantage of your notebook features and other tools such as matplotlib and pandas.

## Writing to the Database
We are using our local instance of MongoDB.

In [3]:
# CONNECT -- build a connection to the local instance of MongoDB
class MongoDB:
    def __init__(self):
        self.address = 'mongodb://127.0.0.1:27017/'
        self.db_name = 'euro_football'
        self.collection_name = 'year_2011'
        
    def start(self):        
        self.client = pymongo.MongoClient(self.address)
        self.db = self.client[self.db_name]
        self.collection = self.db[self.collection_name]

#Scratch test with dummy CLASS
class Team:
    def __init__(self, name):
        self.name = name
    
    def to_json(self):
        ret_val = {}
        ret_val['name'] = self.name
        return ret_val

d = Team('David')
t = Team('Tino')
teams = [d,t]

TeamsDB = MongoDB()
TeamsDB.start()


# INSERT -- try to insert a record for each team
try:
    insertion_results = TeamsDB.collection.insert_many([team.to_json() for team in teams])
except Exception as e:
    print(e)
else:
    # SUM-CHECK -- compare results of insertion (runs only if the insert succeeded,
    # so insertion_results is always defined here)
    print(f'{len(insertion_results.inserted_ids)} teams out of {len(teams)} inserted into {TeamsDB.db_name}')

2 teams out of 2 inserted into euro_football


## [Testing Model]
One of the things I have learned over the years is that everything in data science is better understood with examples, rather than just plain code or pictures. Before you begin training your models, make sure to explain to the reader what the model expects as input and what it is expected to output. Rendering code here with nice descriptions helps to prepare the reader for what to expect while training the model, especially since the training code is usually longer than most sections of the tutorial. With libraries like PyTorch and DyNet this is fairly easy since they are dynamic computing libraries. TensorFlow also offers an eager execution command, tf.enable_eager_execution(), to evaluate operations immediately. This is what's called imperative programming, and I am glad they have it. It makes it easy to teach others about the beautiful things these tools are able to accomplish. I like to think that data science is about storytelling and discovery, and it should remain that way. Clear writing helps!

## [Training Model]
When training the models you would specify what kind of optimization, hyperparameters, and data iterating methods you are using. To be honest, the training code is usually self-explanatory. If you did your job at the beginning, explaining your dataset and testing the model, this part of the tutorial is probably the one that needs less explanation. In my experience, most data computing libraries use similar training strategies, thus the training structure has become ubiquitous in some sense. If there is still any clarification in your training that you need the reader to know, you can always explain it beforehand.

## [Evaluating Model]
And lastly, it is good practice to evaluate your models on some held-out samples of the dataset. This helps the reader get the gist of what the tutorial you just showed them contains. It also helps to re-emphasize the value the tutorial provides for the reader. This part of the tutorial is also the place to wrap up your final thoughts and share insights with your readers. Readers love insights. You can share plots, a lot of examples, and even explore the parameters of the model.

## [Final Words]
You are not writing a book, so it is not necessary to have a conclusion section. In my experience, you use the final section to summarize all your findings and the future ideas you are working on. This is also a great time to congratulate the reader for making it to the end of the tutorial; that's a huge achievement. It shows that you appreciate your readers. Then you can end the section with your favorite quote.

And that's it! Congratulations on reaching the end of this primer. You are now more than equipped to deliver excellent tutorials to the whole data science community and to a wider audience. With this short primer, you should reach thousands, and hopefully millions, of readers. Most importantly, it should help you bring value to your readers and keep expanding the human knowledge base.

## Resources & Credits
* Dark Sky: historical weather by date and location, implemented with the free version of the **Dark Sky API** (https://darksky.net/poweredby/)
* Google Maps: implemented with the free version of Geocoding from Google's Places APIs
* Project Template: written with ❤️ by dair.ai

## [Other Tips]
Try to ensure that your notebook-based tutorials have a very nice flow. If you are using a lot of functions, it would be nice to create separate Python files for them and import them here. You don't want your notebooks to be too detailed, but you also don't want them to be too flat.
Remember! You are teaching, not dictating. Ask questions, immerse the reader, and challenge them. There are various ways to do so.
More coming soon!