Each part of this tutorial is meant to help you work through one part of the data science take-home challenge that I have seen many applicants struggle with. The data set we are working with today is the citibike trip data, which can be found [here](https://www.citibikenyc.com/system-data). The data has the following variables:

- Trip Duration (seconds)
- Start Time and Date
- Stop Time and Date
- Start Station Name
- End Station Name
- Station ID
- Station Lat/Long
- Bike ID
- User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
- Gender (0=unknown; 1=male; 2=female)
- Year of Birth

# Run these first

In [None]:
import pandas as pd
import os

In [None]:
!mkdir -p data/raw
!wget -P data/raw https://s3.amazonaws.com/tripdata/201909-citibike-tripdata.csv.zip
!wget -P data/raw https://s3.amazonaws.com/tripdata/201908-citibike-tripdata.csv.zip
!wget -P data/raw https://s3.amazonaws.com/tripdata/201907-citibike-tripdata.csv.zip
!unzip -q data/raw/201909-citibike-tripdata.csv.zip -d data/raw/
!unzip -q data/raw/201908-citibike-tripdata.csv.zip -d data/raw/
!unzip -q data/raw/201907-citibike-tripdata.csv.zip -d data/raw/
# Read every extracted CSV and combine them into a single dataframe
cb_csv = sorted(x for x in os.listdir("data/raw") if x.endswith(".csv"))
df_cb = pd.concat(
    (pd.read_csv(os.path.join("data/raw", csv)) for csv in cb_csv),
    ignore_index=True,
    sort=False,
)
# Replace spaces in the column names with underscores
df_cb.columns = df_cb.columns.str.replace(" ", "_")
df_cb.shape

In [None]:
df_data = pd.read_csv("https://raw.githubusercontent.com/MichoelSnow/pydata_nyc_2019/master/downloads/citibike_data.csv", low_memory=False)
df_data.shape

In [None]:
!wget https://github.com/MichoelSnow/pydata_nyc_2019/raw/master/downloads/model_predictions.zip
!unzip model_predictions.zip    

# Part 1: Working with Trick Data

People who create these datasets will sometimes purposefully create "bad data" within the data set.  As most data in the wild has some internal discrepancies, you must carefully examine the data before you can model it.  Only once you have identified and dealt with, or decided to ignore, the problems with the data can you feel confident in analyzing it. To this end I have modified a portion of the citibike data set, which already had some problems with it, and added in a whole host of errors.

The variable `df_data`, which you created above, is the modified data set. Your task is to find all the different problems with the data, and determine how you are going to rectify them.
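As a starting point for that search, here is a minimal sketch of a few generic checks. It runs on a tiny synthetic dataframe rather than on `df_data`; the column names (`tripduration`, `birth_year`, `gender`), the values, and the one-day threshold are all assumptions made for illustration, and the checks are nowhere near an exhaustive list of what you should look for.

```python
import pandas as pd

# Tiny synthetic stand-in for df_data; column names and values are hypothetical.
df_toy = pd.DataFrame(
    {
        "tripduration": [300, -45, 620, 620, 9_000_000],
        "birth_year": [1985, 1990, 1885, 1885, None],
        "gender": [1, 2, 0, 0, 7],
    }
)


def summarize_problems(df: pd.DataFrame) -> dict:
    """Count a few common data problems; the thresholds are illustrative only."""
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "negative_durations": int((df["tripduration"] < 0).sum()),
        "durations_over_1_day": int((df["tripduration"] > 24 * 60 * 60).sum()),
        # The data description lists 0, 1 and 2 as the only gender codes
        "invalid_gender_codes": int((~df["gender"].isin([0, 1, 2])).sum()),
    }


problems = summarize_problems(df_toy)
print(problems)
```

Counting each problem before fixing it gives you numbers to report, which makes it easier to justify whether you drop, repair, or ignore the affected rows.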

# Part 2: Create an Outline

Now that you have an understanding of the data, here is your challenge.


*Using the citibike data set predict how long a ride will be given the information available when the rider starts their ride.  You should spend about 5 hours working on this challenge.  Please make sure to include clear data analysis and visualization, well-documented and interpretable code, appropriate use of statistical/machine-learning models, and explanation of the results.*

Do not start coding immediately. First, develop a plan of attack for the challenge. For each of the following sections, write a brief outline (no more than one paragraph) that states what you are going to do and how long it should take. Be as specific as is reasonable, e.g., for a figure, describe the axes and the type of plot. Keep your time estimates within the suggested time limit. I suggest planning for 80% of the time provided and keeping the remaining 20% as a buffer.
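To make the arithmetic concrete, here is a small sketch of a time budget for the 5-hour challenge. The per-section weights are purely hypothetical; choose your own based on the challenge and the role.

```python
# Hypothetical split of the planned time across the four outline sections.
TOTAL_HOURS = 5
PLANNED_FRACTION = 0.8  # plan 80% of the time, keep 20% as a buffer

section_weights = {
    "EDA": 0.25,
    "Data Relationships": 0.25,
    "Modeling": 0.30,
    "Result Summary": 0.20,
}

total_minutes = TOTAL_HOURS * 60
planned_minutes = total_minutes * PLANNED_FRACTION

# Minutes per section, rounded to whole minutes
budget = {name: round(planned_minutes * w) for name, w in section_weights.items()}
buffer_minutes = total_minutes - sum(budget.values())

for name, minutes in budget.items():
    print(f"{name}: {minutes} min")
print(f"Buffer: {buffer_minutes} min")
```

With these weights, four of the five hours are planned and one hour is left over for anything that takes longer than expected.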

While I believe any report should contain all of the following sections, you will want to vary the amount of time you spend on each depending on the challenge and the company. For example, the more modeling-focused the position, the more time you should spend on the modeling section.

## EDA (single variable)

## Data Relationships (between different variables)

## Modeling

## Result Summary

# Part 3: Exploratory Data Analysis (EDA) and Data Relationships

Based on the above outline, create at least one figure for each of the EDA and Data Relationships sections. Each figure should be presentation ready before you move on to the next one. Your goal for any figure is to make it as understandable as possible without requiring outside explanation. Now is not the time to learn a new plotting library; to get these done quickly and beautifully, use the library you are most familiar with.

Use [pandas profiling](https://pandas-profiling.github.io/pandas-profiling/docs/) to get a head start on all your analyses. Click [here](https://htmlpreview.github.io/?https://github.com/MichoelSnow/pydata_nyc_2019/blob/master/notebooks/data/citibike_data_report.html) to see the output report for the citibike data.

Your first step, before you begin to code, is to sketch out what you plan on plotting. Consider both the data and the type of plot. Are these the most informative variables, or should you engineer new ones?
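As one illustration of engineering a variable before plotting, the sketch below derives the hour of day and a weekend flag from a start timestamp. The column name `starttime` and the sample rows are assumptions; check the actual column names in your dataframe before adapting it.

```python
import pandas as pd

# Hypothetical sample rows; `starttime` is an assumed column name.
df_toy = pd.DataFrame(
    {
        "starttime": [
            "2019-09-02 08:15:00",
            "2019-09-07 14:30:00",
            "2019-09-08 23:05:00",
        ],
        "tripduration": [540, 1860, 420],
    }
)

# Parse the strings into timestamps so the .dt accessor is available
df_toy["starttime"] = pd.to_datetime(df_toy["starttime"])

# Engineered variables: hour of day (0-23) and a weekend flag
df_toy["start_hour"] = df_toy["starttime"].dt.hour
df_toy["is_weekend"] = df_toy["starttime"].dt.dayofweek >= 5  # 5 = Saturday, 6 = Sunday

print(df_toy[["start_hour", "is_weekend", "tripduration"]])
```

Both derived columns are known when the ride starts, so they are candidates for plotting against trip duration.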

As a guide, here is a table from Bertin's *Semiology of Graphics*, as seen in Jake Vanderplas's PyCon 2019 talk [How to Think about Data Visualization](https://www.youtube.com/watch?v=vTingdk_pVM). This is a quick reference for how to plot the three main data types:

- Quantitative (Q)
    - Numerical data
- Ordinal (O)
    - Ordered categorical data 
- Nominal (N)
    - Unordered categorical data

| Visual channel | Nominal | Ordinal | Quantitative |
|----------------|---------|---------|--------------|
| Position       | N       | O       | Q            |
| Size           | N       | O       | Q            |
| Color Value    | N       | O       | q            |
| Texture        | N       | o       |              |
| Color Hue      | N       |         |              |
| Angle          | N       |         |              |
| Shape          | N       |         |              |


Here are links to sample plots for the various libraries (for inspiration and reference):
- [Matplotlib](https://matplotlib.org/3.1.1/tutorials/introductory/sample_plots.html)
- [plotly](https://plot.ly/python/)
- [Seaborn](https://seaborn.pydata.org/examples/index.html)
- [plotnine](https://plotnine.readthedocs.io/en/stable/gallery.html)
- [Altair](https://altair-viz.github.io/gallery/index.html)

# Part 4: Model Explanation & Summary

A common mistake is to model your data and present your results with just a score or a confusion matrix. This result is the culmination of your work, so it should be presented with as much thought as the rest of your notebook. The cell below loads a dataframe that contains the test data along with the predicted ride duration in the column `predicted_duration`. Create a figure, or two at most, that visually demonstrates the results. Then spend a paragraph explaining where the model has succeeded and where it still needs more work. While a good model is important, good communication is essential to a well-received data challenge.

In [None]:
df_results = pd.read_feather("model_predictions.feather")
df_results.shape
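Before drawing the figure, it can help to compute the numbers it will show. The following is a minimal sketch on synthetic data, not on `df_results`: the column `predicted_duration` comes from the text above, while `tripduration` as the name of the actual-duration column, the sample values, and the bucket boundaries are assumptions.

```python
import pandas as pd

# Synthetic stand-in for df_results. `tripduration` is an assumed column name.
df_toy = pd.DataFrame(
    {
        "tripduration": [300, 600, 900, 1800, 3600],
        "predicted_duration": [330, 540, 1000, 1500, 2400],
    }
)

# Residual = predicted minus actual, in seconds
df_toy["residual"] = df_toy["predicted_duration"] - df_toy["tripduration"]
df_toy["abs_error"] = df_toy["residual"].abs()

# Overall mean absolute error
mae = df_toy["abs_error"].mean()

# Bucket rides by actual length to see where the errors concentrate
bins = [0, 600, 1800, float("inf")]
labels = ["<=10 min", "10-30 min", ">30 min"]
df_toy["length_bucket"] = pd.cut(df_toy["tripduration"], bins=bins, labels=labels)
error_by_bucket = df_toy.groupby("length_bucket", observed=True)["abs_error"].mean()

print(f"MAE: {mae:.0f} seconds")
print(error_by_bucket)
```

A per-bucket breakdown like this supports the kind of paragraph the task asks for: it shows which rides the model handles well and which it does not, rather than a single score.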

# Part 5: Musings and Addenda

## Structuring your data science challenge

In addition to the sections discussed in the outline, all challenges should start off with a summary and a readme section.

The summary section should be a short description of what you did and what you found. It should be one to three paragraphs long. Put anything you want to be sure the reviewer reads into this executive summary. Write the summary, as well as the rest of the notebook, as if it were a project you were presenting to people unfamiliar with the work you are doing. Assume they are knowledgeable about the topic but not about the specifics of your task or the underlying data. Use hyperlinks to link to the specific figures and sections of the notebook that you reference. Comment on positive findings and pertinent negatives, e.g., always check for missing values, but only mention them in the summary if there is something unusual about them. See [here](https://michoelsnow.github.io/html_files/citibike#Summary) for an example summary.

The Readme section should contain any technical information required to run your code. At a minimum, list the version of Python your code ran on, as well as the package versions. If your code uses a lot of memory, takes a long time to run, or has some other technical nuance the reader should know about, mention it here. See [here](https://michoelsnow.github.io/html_files/citibike#Readme) for an example Readme.
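One way to collect the version information for the Readme is sketched below, using only the standard library (Python 3.8 or later). The helper name `environment_report` and the package list are hypothetical; list whichever packages your notebook actually imports.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Return the Python version plus the installed version of each named package."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep going so one missing package does not hide the others
            report[name] = "not installed"
    return report


report = environment_report(["pandas", "numpy"])
for name, version in report.items():
    print(f"{name}: {version}")
```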

## Things to keep in mind

Just like any presentation, spelling and formatting matter.  To that end I would recommend using a notebook spellchecker and formatter.  Below are the two that I use:

- Black for jupyter notebooks
    - https://github.com/drillan/jupyter-black
- spellcheck for jupyter notebooks (part of [nbextensions](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/index.html))
    - https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/spellchecker/README.html
 
 
**Other musings**  
- Ask Questions
    - If something is not clear from the challenge, ask. Feel free to check with a friend or colleague first to make sure you are not missing something simple.
    - It is possible they made the instructions vague on purpose to make sure you are someone who follows up on unclear requirements
    - Be upfront about prior commitments or other time constraints 
- Explain and Justify
    - Don't randomly select algorithms to throw at the data. 
        - Why did you choose each algorithm? 
        - Why were your choices valid and reasonable? 
        - What else would you try if you had more time? 
    - You will have to answer these questions later.
- Random seeds
    - Always set your random seed for code reproducibility
- Tailor your response
    - Consider the job description and tailor your approach to match the skills needed for the job (i.e., if this is a very ML heavy job, spend more time on modeling than on EDA)
    - Consider where you are in the company's interview process, e.g., screening vs. last step
- Submit an IPython notebook and an HTML download
    - Just in case they can't get the notebook running or don't have time to start up a kernel, giving them an HTML version of your notebook ensures that what you are giving is what they are seeing. 
- DO NOT use a neural network    
    - Barring some outliers, you should never use a neural network as your first model and only use it as a second model when you have a really good reason to.
- Create a portfolio
    - Find your own datasets, build your own data science challenge projects with them, and add the results to your portfolio.
    - Forked repositories and class projects are not the best showcases and might be ignored.
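To illustrate the random-seed advice in the list above, here is a minimal sketch showing that a fixed seed makes a random sample repeatable. The seed value and the helper `sample_indices` are hypothetical; the same idea applies wherever a library accepts a seed or `random_state` argument.

```python
import random

import numpy as np

SEED = 42  # arbitrary value; what matters is that it is fixed and stated in the notebook


def sample_indices(n_rows, n_sample, seed):
    """Draw row indices without replacement using a seeded NumPy generator."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_rows, size=n_sample, replace=False)


# The same seed gives the same sample on every run
first = sample_indices(1000, 5, SEED)
second = sample_indices(1000, 5, SEED)
print((first == second).all())

# The standard library generator is seeded separately
random.seed(SEED)
shuffled_a = random.sample(range(10), k=10)
random.seed(SEED)
shuffled_b = random.sample(range(10), k=10)
print(shuffled_a == shuffled_b)
```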

**Grading Rubric** (the order of importance will depend on the grader)
- Does your code run?
- What is the ratio of text to code?
- Do you explain what significant choices you made and why you made them?
- Are there any spelling mistakes?
- Is your code pythonic (e.g., how often you use loops)?
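As a small illustration of the last rubric item, the sketch below converts trip durations from seconds to minutes twice: once with an explicit loop and once with a vectorized pandas expression. The dataframe and its values are made up for the example.

```python
import pandas as pd

# Hypothetical durations in seconds
df_toy = pd.DataFrame({"tripduration": [300, 620, 1845]})

# Loop version: works, but is verbose and slow on millions of rows
minutes_loop = []
for seconds in df_toy["tripduration"]:
    minutes_loop.append(seconds / 60)

# Vectorized version: one expression applied to the whole column
df_toy["trip_minutes"] = df_toy["tripduration"] / 60

print(df_toy)
```

Both versions produce the same values; the vectorized one is shorter and is usually what a grader means by pythonic pandas code.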



## Other

List of good newsletters/Data science articles:

- NYC data Jobs and Events by Josh Laurito
    - https://tinyletter.com/nycdatajobs
- Normcore Tech by Vicki Boykis
    - https://vicki.substack.com/
- Data Science Roundup by Tristan Handy
    - http://roundup.fishtownanalytics.com/
- The Algorithm from MIT Technology Review
    - https://forms.technologyreview.com/newsletters/
- PyCoder's Weekly     
    - https://pycoders.com/
- Natural Language Processing News by Sebastian Ruder
    - http://newsletter.ruder.io/