# Data Science in Labs!

For Labs 26

By Ryan Herr, Labs DS Manager

[calendly.com/ryan-herr](https://calendly.com/ryan-herr)

## What does DS do in Labs?

You'll be assigned to a team. Each team will have multiple data scientists and web developers.

In Labs 26, DS teams will be assigned to work on [Citrics](https://www.notion.so/lambdaschool/Citrics-Roadmap-98a5614b708745ccae4ca55960ea8e1b) or [Story Squad](https://www.notion.so/lambdaschool/Story-Squad-Roadmap-2682f21ae48b420cbb0caafa3f500b5e). Each product will have multiple independent teams.  

DS will work on these tasks:

- Data sourcing (DS) and database engineering (DS and/or BE)
- Visualizations (DS and/or FE)
- Machine Learning (DS builds. Everyone communicates & integrates) 

Data sourcing includes [data cleaning](https://twitter.com/dataeditor/status/1280278987797942272)!

Can you identify these tasks on the [Labs 26 Product Roadmaps](https://www.notion.so/lambdaschool/Labs-26-Products-f724accd0b38426a87b4cdee7efc2de1)?

We'll demonstrate some of these tasks in this Python notebook.

## About me

<img width="559" src="https://user-images.githubusercontent.com/7278219/88957150-1212b580-d264-11ea-8e6a-0984319c90f3.png">

My data science career began at State Farm, a Fortune 50 insurance company. I started learning in 2012 and officially earned the title in 2015.

Out of 20 data scientists on their inaugural team, I was the only one with just a Bachelor’s degree in an unrelated field. I didn’t have the time, money, or mobility to go back to school “traditionally.” But I didn’t give up on my career change, and now I want to help Lambda students break into data science too!

I love to teach! I’ve been a K-8 teacher, bootcamp mentor, instructional designer, and conference speaker.

I have 6 years of experience doing data wrangling, exploratory data analysis, visualization, and machine learning in Python, and 30 years coding in a variety of languages.

I’ve also interviewed and hired data science candidates. I can help students prepare, so you get great jobs!

I joined Lambda School in 2018 as a Data Science Instructor. I joined Labs in 2020 as Experiential Learning Manager, Data Science.

I live in Normal, Illinois (2 hours south of Chicago) with my spouse and 5 kids.

Feel free to connect!
* [LinkedIn](https://www.linkedin.com/in/ryan-herr-b5a8a77/)
* [Twitter](https://twitter.com/rrherr)
* [Portfolio site](https://rrherr.github.io/)
* [Banjo/uke videos](https://www.youtube.com/playlist?list=PLAwif0tmlJfUaGjOkqTl5RNJQLJH3798I)

## A sneak peek from my own performance review



To help you understand how _you're_ evaluated, I'll share how _I'm_ evaluated:




![](https://user-images.githubusercontent.com/7278219/89203129-a5f0c400-d579-11ea-9bc1-3b7bb61e983e.png)

## What does success look like in Labs? 🎨 📝

### By the end

My definition of **success** is the same for all tracks: Each student is able to **show & tell** their **work experiences** during **job interviews.**

**Your projects don’t speak for themselves.** It’s the opposite: Your projects give you something to speak about in interviews.


Labs is consistent with this **advice from a hired DS student:**

> Hey DS1, I just wanted to let you guys know, “Trust the process.” I was offered a job Monday! A job I always dreamed of having, but never really thought was possible before Lambda.
> 
> The amount of interviews I failed before getting a hit? 40+. Sometimes interviewing really sucks. Sometimes it's not easy to feel embarrassed or ashamed at failing. But tomorrow is another chance to do better.
>
> If I could give two pieces of actual advice:
>
> **1.** If you have a cat or dog, you should seriously start **practicing teaching them statistics and machine learning.** I knew many of the answers to interview questions, but failed horribly at explaining them. It turns out explaining things is a skill all of its own, and it won't be long before my cat is making his own models 😅
>
> **2. Have amazing visualizations.** No one wanted to look at my code, everyone wanted to see the end result.
>
> Visualizations should be a summarized representation of the entirety of your project. Make sure it represents all your work.
>
> I also discovered people love seeing smaller visuals to summarize the unique parts of your project. I don't know why, but people really just didn't want to talk 20 minutes about code, they want something to look at!


“Visualization” means any visual from your project:

- Architecture diagrams
- Code snippets
- Data visualizations
- Database schemas
- Flowcharts
- Trello boards
- UI screenshots
- UX videos / animated gifs (short, silent loops)
- Wireframes




<small>In interviews, you’ll rarely do product demos, and never do product “pitches”. But you’ll always talk about your projects.</small> 

<small>Visuals like the ones listed above are better than fumbling through a live demo. You’re not pitching your finished product. You’re pitching yourself, and employers want to see your problem-solving process. We’ll prepare you in Labs!</small>

View these great examples from data scientists in recent Labs cohorts: 

- [Elizabeth Ter Sahakyan, Cryptolytics](https://lizzie.codes/2020/02/11/cryptolytic/) 
- [Tobias Reaper, Trash Pandas](https://tobias.fyi/workshop/trash-panda/)

### In week 1

Throughout Labs, you'll use [our template](https://github.com/Lambda-School-Labs/labs-ds-starter) with starter code to [deploy an API](https://ds.labsscaffolding.dev/) for your machine learning model and data visualizations. We've used this template in the past three Build Weeks, and in Labs 25, with successful results. But don't fork it yet, because I'm still making updates.

In week 1, you're required to complete these DS tasks. You'll receive detailed step-by-step instructions for each one.

- TPL creates DS GitHub repo from our [starter code template](https://github.com/Lambda-School-Labs/labs-ds-starter).  
- Each DS student is able to run the starter code locally in Docker. 
- One DS student updates the starter code with an app title and description, and submits a GitHub pull request.
- One DS student deploys the updated app to AWS Elastic Beanstalk.

**Each DS team will submit a form to me by next Friday,** to verify these steps are done, and give me your DS GitHub repo and deployed app URLs.

I will be checking DS work. **I expect DS students to commit work at least daily in weeks 2–8.**

## Data visualization 📈🌍

Labs projects will use [Plotly](https://plotly.com/python/), a popular visualization library for both Python & JavaScript.

Imagine we want to visualize metrics like life expectancy, population, and GDP per capita, for different countries around the world.

In [None]:
# First, load data
# It won't really be this easy!
import plotly.express as px
dataframe = px.data.gapminder()

In [None]:
dataframe

| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 | AFG | 4 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 | AFG | 4 |
| 3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 | AFG | 4 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.157306 | ZWE | 716 |
| 1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.420786 | ZWE | 716 |
| 1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.449960 | ZWE | 716 |
| 1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.038623 | ZWE | 716 |


In [None]:
# Figure out what these column names really mean,
# then rename them for readability
dataframe = dataframe.rename(columns={
    'year': 'Year', 
    'lifeExp': 'Life Expectancy', 
    'pop': 'Population', 
    'gdpPercap': 'GDP Per Capita'
})

dataframe

| | country | continent | Year | Life Expectancy | Population | GDP Per Capita | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 | AFG | 4 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 | AFG | 4 |
| 3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 | AFG | 4 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.157306 | ZWE | 716 |
| 1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.420786 | ZWE | 716 |
| 1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.449960 | ZWE | 716 |
| 1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.038623 | ZWE | 716 |


In [None]:
# Filter data
subset = dataframe[dataframe.country == 'United States']

In [None]:
import plotly.express as px
px.line(subset, x='Year', y='Life Expectancy', title='Life Expectancy in the United States')

In [None]:
country = 'United States'
metric = 'Population'
subset = dataframe[dataframe.country == country]
px.line(subset, x='Year', y=metric, title=f'{metric} in {country}')

In [None]:
def worldviz(metric, country):
    subset = dataframe[dataframe.country == country]
    fig = px.line(subset, x='Year', y=metric, title=f'{metric} in {country}')
    fig.show()
    return fig.to_json()

worldviz('Population', 'United States')

'{"data":[{"hoverlabel":{"namelength":0},"hovertemplate":"Year=%{x}<br>Population=%{y}","legendgroup":"","line":{"color":"#636efa","dash":"solid"},"mode":"lines","name":"","showlegend":false,"type":"scatter","x":[1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007],"xaxis":"x","y":[157553000,171984000,186538000,198712000,209896000,220239000,232187835,242803533,256894189,272911760,287675526,301139947],"yaxis":"y"}],"layout":{"legend":{"tracegroupgap":0},"template":{"data":{"bar":[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5}},"type":"bar"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5}},"type":"barpolar"}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"type":"carpet"}],"choropleth":[{"
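The string returned by `fig.to_json()` is ordinary JSON, so it can be inspected with Python's standard library. Here is a minimal sketch that keeps only a few fields of the first trace from the output above. The `layout` section and the other trace fields are left out for brevity, so `figure_json` is a trimmed stand-in and not a complete Plotly figure. The variable names are only for illustration.

```python
import json

# Trimmed stand-in for the string returned by worldviz('Population', 'United States').
# Only the trace type, mode, and x/y values from the output above are kept.
figure_json = (
    '{"data":[{"mode":"lines","type":"scatter",'
    '"x":[1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007],'
    '"y":[157553000,171984000,186538000,198712000,209896000,220239000,'
    '232187835,242803533,256894189,272911760,287675526,301139947]}]}'
)

figure = json.loads(figure_json)   # parse the JSON string into a dict
trace = figure['data'][0]          # a line chart has one trace per line

# Pair each year with its population value
population_by_year = dict(zip(trace['x'], trace['y']))

print(trace['type'], trace['mode'])
print(population_by_year[2007])
```

This is roughly the structure a Web teammate will see when they receive the figure: a `data` list of traces, each holding the values to plot.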

## DS API ⚙️

In Labs, DS will use [FastAPI](https://fastapi.tiangolo.com/). It's like Flask, but faster, with automatic interactive docs. For more comparison, see [FastAPI for Flask Users](https://amitness.com/2020/06/fastapi-vs-flask/).

FastAPI's creator Sebastián Ramírez recorded a [good video](https://youtu.be/1zMQBe0l1bM), “Build a machine learning API from scratch with FastAPI.” Live coding starts at 4:55, ends at 50:20. 

We'll do a quick demo now!

In Labs, you *won't* do this part in a notebook. We're just using a notebook now for a convenient, self-contained demo.

Let's add an API endpoint for our data visualization.

In [None]:
!pip install fastapi nest-asyncio pyngrok uvicorn



In [None]:
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import plotly.express as px

dataframe = px.data.gapminder().rename(columns={
    'year': 'Year', 
    'lifeExp': 'Life Expectancy', 
    'pop': 'Population', 
    'gdpPercap': 'GDP Per Capita'
})


app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=['*'],
    allow_credentials=True,
    allow_methods=['*'],
    allow_headers=['*'],
)

@app.get('/')
async def root():
    return {'hello': 'world'}

@app.get('/worldviz')
async def worldviz(metric, country):
    """
    Visualize a metric over time for a given country.

    Returns JSON to be rendered with Plotly.js by the Web frontend.
    """
    subset = dataframe[dataframe.country == country]
    fig = px.line(subset, x='Year', y=metric, title=f'{metric} in {country}')
    return fig.to_json()

In [None]:
import nest_asyncio
from pyngrok import ngrok
import uvicorn

# Pass the port positionally: the keyword is `port` in pyngrok 4.x but `addr` in 5.x
url = ngrok.connect(8000)
print('Public URL:', url)
nest_asyncio.apply()
uvicorn.run(app, port=8000)
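Because `metric` and `country` are plain function parameters and not part of the path, FastAPI reads them from the query string. Here is a minimal sketch of how a client could build the request URL and decode the response, using only the standard library. `BASE_URL`, `worldviz_url`, and `decode_figure` are hypothetical names for illustration. The double decode is an assumption: since the endpoint returns the string from `fig.to_json()`, the response body is likely to be that string wrapped as a JSON string, so the client may need to parse it twice.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL; in the demo above it would be the ngrok public URL.
BASE_URL = 'http://localhost:8000'

def worldviz_url(metric, country, base_url=BASE_URL):
    """Build the request URL for the /worldviz endpoint."""
    query = urlencode({'metric': metric, 'country': country})
    return f'{base_url}/worldviz?{query}'

def decode_figure(response_body):
    """Parse a response body into a figure dict.

    If the first parse yields a string (JSON wrapped inside JSON),
    parse once more to get the figure itself.
    """
    figure = json.loads(response_body)
    if isinstance(figure, str):
        figure = json.loads(figure)
    return figure

print(worldviz_url('Life Expectancy', 'United States'))

# Simulated response body: a tiny figure, JSON-encoded twice
simulated_body = json.dumps(json.dumps({'data': [], 'layout': {}}))
print(decode_figure(simulated_body))
```

`urlencode` takes care of the spaces in values like `Life Expectancy`, which would otherwise produce an invalid URL.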

## Predictive model 🔮

Imagine your Labs team is building an app to help people shop for apartments. 

DS will make a predictive model to help estimate how much rent will cost.

Your architecture will probably look like this diagram:

![](https://user-images.githubusercontent.com/7278219/87967579-a4f16a00-ca84-11ea-9f90-886b3cf1a25c.png)

_Here's how I explained predictive modeling when training Labs 25 leads, who are mostly Web students. What do you think?_

How do you predict how much rent will cost in the future / for unknown apartments?

By finding patterns in how much rent cost in the past / for known apartments.

Can you find the pattern in this fake data?

In [None]:
import pandas as pd

df = pd.DataFrame([{'beds': 1, 'baths': 1, 'rent': 600}, 
                   {'beds': 2, 'baths': 1, 'rent': 1000}, 
                   {'beds': 2, 'baths': 2, 'rent': 1200}, 
                   {'beds': 3, 'baths': 1, 'rent': 1400}, 
                   {'beds': 3, 'baths': 2, 'rent': 1600}, 
                   {'beds': 3, 'baths': 3, 'rent': 1800}])

df


| | beds | baths | rent |
|---|---|---|---|
| 0 | 1 | 1 | 600 |
| 1 | 2 | 1 | 1000 |
| 2 | 2 | 2 | 1200 |
| 3 | 3 | 1 | 1400 |
| 4 | 3 | 2 | 1600 |
| 5 | 3 | 3 | 1800 |


In [None]:
# You could figure out the function yourself
def rent(beds, baths):
    return 400*beds + 200*baths
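A quick way to check a hand-written rule like this is to run it against every known apartment. Here is a minimal sketch in plain Python, with the six fake apartments copied from the table above. The names `known_apartments` and `misses` are only for illustration.

```python
# Fake data copied from the table above: (beds, baths, rent)
known_apartments = [
    (1, 1, 600),
    (2, 1, 1000),
    (2, 2, 1200),
    (3, 1, 1400),
    (3, 2, 1600),
    (3, 3, 1800),
]

def rent(beds, baths):
    return 400*beds + 200*baths

# Collect every apartment where the rule disagrees with the known rent
misses = [
    (beds, baths, actual)
    for beds, baths, actual in known_apartments
    if rent(beds, baths) != actual
]

print(len(misses), 'of', len(known_apartments), 'apartments missed')
```

With real data the rule will almost never match every row exactly, which is why we measure how far off it is instead of counting exact matches.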

In [None]:
# Or you can let the computer figure out the function for you
from sklearn.linear_model import LinearRegression

features = ['beds', 'baths']
target = 'rent'
model = LinearRegression()
model.fit(df[features], df[target])
model.coef_

array([400., 200.])

In [None]:
# And use the model to make predictions
model.predict([[3, 1]])

array([1400.])

## Machine Learning defined

_Here are the definitions I used when training Labs 26 leads, who are mostly Web students. What do you think?_

### Short

Machine Learning is the branch of AI that explores ways to get computers to improve their performance based on experience.

— [Stuart Russell](http://people.eecs.berkeley.edu/~russell/temp/q-and-a.html), Professor of Computer Science at Berkeley, co-author of _Artificial Intelligence: A Modern Approach_.

### Medium

Machine Learning: A program or system that builds (trains) a predictive model from input data. The system uses the learned model to make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model.

— [Google Machine Learning Glossary](https://developers.google.com/machine-learning/glossary/#m)

### Long

**A machine learning system is "trained" rather than explicitly programmed.** It is presented with many "examples" relevant to a task, and it finds statistical structure in these examples which eventually allows the system to come up with rules for automating the task.

**For instance,** if you wish to automate the task of tagging your vacation pictures, you could present a machine learning system with many examples of pictures already tagged by humans, and the system would learn statistical rules for associating specific pictures to specific tags.

To do [supervised] machine learning, **we need three things:**

- **Input data points.** For instance, if the task is speech recognition, these data points could be sound files of people speaking. If the task is image tagging, they could be picture files.

- **Examples of the expected output.** In a speech recognition task, these could be human-generated transcripts of our sound files. In an image task, expected outputs could be tags such as "dog", "cat", and so on.

- **A way to measure if the algorithm is doing a good job,** to measure the distance between its current output and its expected output. This is used as a feedback signal to adjust the way the algorithm works. This adjustment step is what we call "learning".

— [Francois Chollet](https://livebook.manning.com/book/deep-learning-with-python/chapter-1/), Artificial Intelligence Researcher at Google, author of _Deep Learning with Python_
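The three ingredients above can be sketched with the fake rent data from earlier: inputs (beds, baths), expected outputs (rent), and a measure of how wrong the current guess is. This sketch uses gradient descent on mean squared error, chosen here only because it makes the feedback loop visible. It is not how scikit-learn's `LinearRegression` solves the problem, and the learning rate and step count are assumptions picked to suit this tiny dataset.

```python
# Inputs and expected outputs: (beds, baths, rent) from the fake data above
rows = [(1, 1, 600), (2, 1, 1000), (2, 2, 1200),
        (3, 1, 1400), (3, 2, 1600), (3, 3, 1800)]

def mean_squared_error(w_beds, w_baths):
    """Measure the distance between current output and expected output."""
    squared_errors = [
        (w_beds * beds + w_baths * baths - rent) ** 2
        for beds, baths, rent in rows
    ]
    return sum(squared_errors) / len(rows)

def train(learning_rate=0.05, steps=5000):
    """Start from a bad guess and repeatedly adjust the weights."""
    w_beds, w_baths = 0.0, 0.0
    n = len(rows)
    for _ in range(steps):
        grad_beds = grad_baths = 0.0
        for beds, baths, rent in rows:
            error = w_beds * beds + w_baths * baths - rent
            # Gradient of the mean squared error with respect to each weight
            grad_beds += 2 * error * beds / n
            grad_baths += 2 * error * baths / n
        # The feedback signal: nudge each weight to reduce the error
        w_beds -= learning_rate * grad_beds
        w_baths -= learning_rate * grad_baths
    return w_beds, w_baths

print('Error before training:', mean_squared_error(0.0, 0.0))
w_beds, w_baths = train()
print('Learned weights:', round(w_beds, 2), round(w_baths, 2))
print('Error after training:', round(mean_squared_error(w_beds, w_baths), 6))
```

The learned weights should land close to the 400 and 200 found earlier. Nobody typed those numbers in; the loop found them by repeatedly measuring the error and adjusting.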


### Comic book

[Learning Machine Learning: An online comic from Google AI](https://cloud.google.com/products/ai/ml-comic-1)