![](images/pony_google.png)

![](images/real_pony_google.png)

## My definition of data science.
> A data scientist is someone who uses some combination of domain/business knowledge, probability theory, statistics, machine learning, deep learning (really a part of ML), and artificial intelligence to provide data-driven recommendations, data-driven products, or both for a company.
#### Of course, this is the definition of a unicorn; almost no data scientist can do all of this well.

## A Data Scientist exists to extract value out of data for a company, that's it.
> ### Questions data scientists answer.
* We have customers, but we don’t understand them. How do we understand them better?
* How do we get more people to click on thing X or Y?
* How do we move our data to Hadoop? Should we move our data to Hadoop? How much?
* How do we count the number of products we sell? How do we increase the number of products we sell?
* We have data in two different places. How do we get it into one place?

## Forecasting Uber Surge Pricing
> * I've spent some time as an Uber driver and needed surge pricing to make a profit
* How can I know where to be before the surge hits?
* Pulled a bunch of data from the Uber API
* Multivariate time-series analysis using an RNN
[Github Repo](https://github.com/matthewswogger/Surge-Forecast-with-RNN)
[Surge Forecast Blog Post](http://matthewswogger.com/en/blog/surge-pricing-forecast-with-neural-networks/)
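
Framing a multivariate time series for an RNN usually starts by cutting it into fixed-length windows, each paired with a future target value. The following is a minimal sketch of that step using NumPy only. The function name `make_windows`, the window length, the forecast horizon, and the toy values are illustrative assumptions, not the code from the linked repo.

```python
import numpy as np

def make_windows(series, window, horizon, target_col=0):
    """Slice a (time, features) array into RNN-style samples.

    Returns X with shape (samples, window, features) and y with shape
    (samples,), where y is the target column `horizon` steps after the
    end of each window.
    """
    series = np.asarray(series, dtype=float)
    n_samples = series.shape[0] - window - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this window and horizon")
    # Each sample is `window` consecutive rows of all features
    X = np.stack([series[i:i + window] for i in range(n_samples)])
    # The matching target sits `horizon` steps past the window's last row
    y = series[window + horizon - 1:, target_col]
    return X, y

# Toy data (made up): 10 time steps, 2 features
toy = np.column_stack([np.arange(10), np.arange(10) * 10])
X, y = make_windows(toy, window=3, horizon=1)
print(X.shape, y.shape)
```

The resulting `X` has the (samples, time steps, features) layout that most RNN libraries expect as input.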

## Music First Hand Matching Algorithm
> * Algorithm to find the best local band match for venue and fan events
* Similar to a content-based recommender system
* The key is to ask the right questions
* A similarity metric plus some weighting
[Music Firsthand](https://musicfirsthand.live/)
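
The similarity-plus-weighting idea above can be sketched roughly as follows. Everything here is a hypothetical illustration: the question names, the weights, the 0-5 answer scale, and the choice of weighted cosine similarity are assumptions, not the actual Music Firsthand algorithm.

```python
import math

def weighted_cosine(a, b, weights):
    """Cosine similarity where each question contributes according to its weight."""
    dot = sum(w * x * y for w, x, y in zip(weights, a, b))
    norm_a = math.sqrt(sum(w * x * x for w, x in zip(weights, a)))
    norm_b = math.sqrt(sum(w * y * y for w, y in zip(weights, b)))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero answer vector has no direction to compare
    return dot / (norm_a * norm_b)

def best_match(event, bands, weights):
    """Return (name, score) for the band whose answers are closest to the event's."""
    scores = {name: weighted_cosine(event, answers, weights)
              for name, answers in bands.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]

# Hypothetical questionnaire: [loudness, danceability, acoustic], each answered 0-5
weights = [1.0, 2.0, 0.5]   # assumed importance of each question
event = [4, 5, 1]           # what the venue or fan wants
bands = {
    "Band A": [5, 4, 0],
    "Band B": [1, 2, 5],
}
name, score = best_match(event, bands, weights)
print(name, round(score, 3))
```

Because the score depends entirely on which questions are asked and how they are weighted, getting the questions right matters more than the particular metric.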

## Windows users need to download and install Cygwin.
> #### Cygwin is a large collection of Open Source tools which provide functionality similar to a Linux distribution on Windows.
#### Cygwin allows us to install Bash, which is really what we are after.
[Link to Cygwin install](https://cygwin.com/install.html)

## Everyone downloads and installs Anaconda.
> #### Anaconda is the most popular data science ecosystem. It includes an open source package management system.
#### Packages come precompiled, so no compiling is needed.
[Link to Anaconda install](https://www.continuum.io/downloads)
```sh
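$ which python      # Python found on the PATH before reloading the shell config
$ source ~/.bashrc  # reload the config so the installer's PATH change takes effect
$ which python      # should now point at the Anaconda Python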
```
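
The same check can be made from inside Python itself. This small sketch just prints the path and version of the running interpreter. Whether that path contains `anaconda` depends on where the installer put it, which is an assumption about your setup.

```python
import sys

# Path of the interpreter that is actually running this code
interpreter_path = sys.executable
print(interpreter_path)

# Version string of that interpreter
print(sys.version)
```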

In [None]:
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

In [None]:
# The Duncan data set now lives in the "carData" R package (split out of "car")
prestige = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
prestige.head()

In [None]:
prestige.shape

In [None]:
# Replace the existing index with a plain 0..n-1 integer index
prestige = prestige.reset_index(drop=True)
prestige.head()

In [None]:
sns.pairplot(prestige, hue='type', aspect=1.5)
plt.show()

In [None]:
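# One-hot encode 'type', then drop the original column and the 'bc' dummy
# so that 'bc' acts as the baseline category in the regression
prestige = prestige.join(pd.get_dummies(prestige['type']))
prestige.drop(['type', 'bc'], axis=1, inplace=True)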
prestige.head()

In [None]:
y = prestige['prestige']
x = prestige[['income', 'education', 'prof', 'wc']].astype(float)
x = sm.add_constant(x)

In [None]:
model = sm.OLS(y, x).fit()
summary = model.summary()
print(summary)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
boston = load_boston()
print(boston.DESCR[:1222])

In [None]:
y = pd.DataFrame(boston.target, columns=['house_price'])
X = pd.DataFrame(boston.data, columns=boston.feature_names)
df = X.join(y)
df.head()

In [None]:
sns.pairplot(df, hue='house_price', aspect=1.5)
plt.show()