## Overview

* The course combines programming and theory (mathematics) to learn from data
    - Programming something is often the best way to understand it
        - Python 3.x with libraries
        - Coding algorithms, mathematical functions, etc
        - Using algorithms and functions contained in libraries
    - Math
        - Basic Calculus (assume you know)
            - Optimization
        - Statistics and Probability
        - Linear Algebra

### Statistical/Machine Learning: Understanding Data

* Three broad types
    - Supervised: labeled training data
    - Unsupervised: unlabeled data; the true labels are not known
    - Reinforcement Learning: reward for actions
* Supervised
    - Regression: Quantitative dependent variable
    - Classification: Categorical dependent variable     
* Unsupervised
    - Clustering
    - Dimension reduction

#### Cognitive Science and Artificial Intelligence are driven by Data

* Improved data collection
    - More sensors generate lots of data
* Improved data processing
    - Faster computers
* Improved data storage
    - Cheaper
    
#### What is the difference between the terms statistical learning and machine learning?

* Often no difference, but
    - Statistical learning tends to be more concerned with understanding relationships between data variables
    - Machine learning tends to be more concerned with predicting new data
    
    

  

### Real world and Model Data 

![Figure 1.](DataGen1.png)

$$\text{Figure 1. Relationship of Real-world and Model Data}$$

* Generative model: an unknown real-world (causal) process produces observations (data)
    - Heights of college students in the US
    
* Statistical models make assumptions about distribution of the data
    - Heights have a Normal Distribution (parameters are mean and standard deviation)

* Inference/Prediction: use the model of the data to determine/explain the data-generating parameters
    - Sample data: Heights of students in this class
    - Determine mean and standard deviation
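
The heights example can be sketched in a few lines. This is only an illustration under stated assumptions: the "true" mean (170 cm), the standard deviation (10 cm), and the class size (40) are made-up values, and the class is simulated rather than measured.

```python
import numpy as np

# Hypothetical "real-world" parameters in cm (assumed values, for illustration only)
true_mean, true_sd = 170.0, 10.0

rng = np.random.default_rng(1234)  # seeded for reproducibility

# Generative step: the unknown process produces observations (a simulated class of 40)
heights = rng.normal(true_mean, true_sd, size=40)

# Inference step: estimate the data-generating parameters from the sample
est_mean = heights.mean()
est_sd = heights.std(ddof=1)  # ddof=1 gives the sample standard deviation

print(f"Estimated mean: {est_mean:.1f} cm, estimated sd: {est_sd:.1f} cm")
```

The estimates will be close to, but not exactly equal to, the assumed parameters, because the class is only a sample.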
    

#### Probability and Statistics are required to manage and quantify uncertainty

* Uncertainty 
    - There can be many different explanations/parameters for the same data
    - Observed data is just a sample of the population of interest
    - Incomplete information
    - Measurement error
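
One way to quantify the uncertainty that comes from observing only a sample is the standard error of the mean. The sketch below assumes a hypothetical population of heights (mean 170 cm, sd 10 cm, both made up) and shows how the standard error tends to shrink as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Hypothetical population of heights in cm (assumed mean 170, sd 10)
population = rng.normal(170.0, 10.0, size=100_000)

std_errors = []
for n in (10, 100, 1000):
    # Draw a sample of size n from the population
    sample = rng.choice(population, size=n, replace=False)
    # Standard error of the mean: sample sd divided by sqrt(n)
    se = sample.std(ddof=1) / np.sqrt(n)
    std_errors.append(se)
    print(f"n={n:4d}  sample mean={sample.mean():.2f}  standard error={se:.2f}")
```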


### Programming Tasks for Data Analysis and Interpretation

![Figure 2.](DataFlow1.png)

$$\text{Figure 2. Data Analysis/Interpretation Tasks}$$
 
* Data preprocessing
    - Import data from files, urls, databases in a variety of formats
    - Clean and transform to structure appropriate for data processing
    - Training/test split
    - Scaling
* Data exploration
    - Data Visualization
    - Describe central tendencies and variation in data
* Data modeling
    - Build a statistical model of data generating process
    - Fit model to data
    - Validate model
    - Select best model
* Predict new data
* Produce results
    - Validate
    - Visualize
    - Produce a report or paper
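
The tasks above can be sketched end to end. This is a minimal illustration, not a prescribed workflow: it assumes scikit-learn is installed, uses the copy of the iris data bundled with scikit-learn (`load_iris`) rather than a file, and the choices of a K-nearest-neighbor classifier, `n_neighbors=5`, and a 25% test split are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Data preprocessing: import the data and split into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

# Scaling: fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Data modeling: fit a model to the training data
model = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)

# Predict new data and validate against the held-out labels
y_pred = model.predict(X_test_s)
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.2f}")
```

Fitting the scaler on the training set alone keeps information about the test set out of the model.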

### Topics Covered in Course

#### Background Material

* Probability
* Descriptive Statistics
* Linear Algebra

#### Data Preprocessing

* Data collection and manipulation
* Train/test split
* Scaling
* Encoding
* Missing Values

#### Data Visualization

* Data Plots
* Distribution Plots

#### Learning Algorithms (subject to change)

* K-nearest neighbor
* Simple/Multivariate/Polynomial Linear Regression
* Logistic Regression
* Linear Discriminant Analysis
* Support Vector Machine
* Decision Trees
* Random Forests
* Bagging/Boosting
* K-means/Hierarchical Clustering
* Principal Component Analysis
* Artificial Neural Networks
* Convolutional Neural Networks
* Recurrent Neural Networks

#### Model Selection and Validation

* Assess different models on speed, accuracy, and complexity
* Bias-Variance Tradeoff
* Cross Validation
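
As a rough sketch of comparing models with cross-validation, the example below scores two classifiers on the iris data bundled with scikit-learn. The choice of models, `cv=5`, and the hyperparameters are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate models to compare (arbitrary choices for illustration)
models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),
}

# 5-fold cross-validation: mean accuracy over the five held-out folds
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```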


### Why so many Learning Algorithms?

#### Isn't Deep Learning all we need?

* No, way overhyped

https://arxiv.org/pdf/1801.00631.pdf

https://www.axios.com/artificial-intelligence-pioneer-says-we-need-to-start-over-1513305524-f619efbd-9db0-4947-a9b2-7a4c310a28fe.html

#### "No Free Lunch Theorem" - Wolpert and Macready 

* Simplified: There is no one model that works best for every problem.

#### Occam's Razor

* Prefer simpler models over complex models


#### "All models are wrong, but some are useful" - George Box

* Some are more useful than others

### Probability and Statistics built into Programming Languages

* Pseudo-random number generators
    - Capability to reproduce random generation
* Probability distributions
* Sampling function
* Descriptive and Inferential Statistics
* Machine/Statistical Learning Algorithms
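
A short sketch of the first few items, assuming NumPy is available. It uses NumPy's `default_rng` generator interface (an alternative to `np.random.seed`); the seed value and the die-rolling example are arbitrary.

```python
import numpy as np

# Two generators created with the same seed reproduce the same "random" sequence
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
draws_a = rng_a.normal(size=5)  # draws from a standard normal distribution
draws_b = rng_b.normal(size=5)
print(np.array_equal(draws_a, draws_b))

# Sampling: 1000 rolls of a fair six-sided die
rolls = rng_a.choice(np.arange(1, 7), size=1000)

# Descriptive statistics of the sample
print(f"mean={rolls.mean():.2f}, sd={rolls.std(ddof=1):.2f}")
```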
 

## Python Programming Language

### Why Python (rather than R)?

https://spectrum.ieee.org/at-work/innovation/the-2018-top-programming-languages
 
* General purpose programming language
    - Object-Oriented (OO)
    - Interpreted/Interactive
    - Functional
* Easy to learn and use

* Readable

* Consistent
    - e.g. use of OO by many packages
    
* Free, open source with large community

* Widely used for Data Science
    - http://www.kdnuggets.com/2017/08/new-poll-python-r-other.html
    
* Scientific and Numerical Packages
    - NumPy
    - SciPy
* Machine Learning Packages
    - scikit-learn (sklearn)
    - Keras
* Data manipulation and analysis package
    - pandas
* Visualization
    - Matplotlib
    - seaborn




In [None]:
'''
Monte Carlo estimate of 1/e
Shuffle a deck of n = 52 cards labeled 0 through n-1,
and count how many times it happens that the ith card in the deck
has label i. The probability that no card matches is approximately 1/e.
'''
import numpy as np
np.random.seed(1234) # For reproducibility

N = 10000
accum_result = np.zeros(N)
deck = np.arange(52)
for j  in range(N):
    np.random.shuffle(deck)
    accum_result[j] = np.sum([deck[i] == i for i in range(52)])
p_nomatch = np.sum(accum_result == 0) / N
print(p_nomatch, 1/np.e)

In [None]:
# Plot the Probability Density Function (PDF) for a Normal Distribution
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm  # standard normal distribution by default

# Range for x between the 1st and 99th percentiles of the distribution
x = np.linspace(norm.ppf(0.01), norm.ppf(0.99), 100)
plt.plot(x, norm.pdf(x), 'r-', lw=5, alpha=0.6, label='norm pdf')
plt.legend()
plt.show()

In [None]:
# Sample from a Normal Distribution

from scipy.stats import norm
import numpy as np

np.random.seed(1234) 

rv = norm(2,.5)
y = rv.rvs(100)
print(type(rv))
y[10:20]

In [None]:
import seaborn as sns
# Histogram of the sample y with a kernel density estimate overlaid
sns.histplot(y, kde=True)

In [None]:
# Read in data

import pandas as pd

df = pd.read_csv("iris.csv")
df.head()

In [None]:
# Transform data variables to dependent and independent variables
X = df.loc[:,'sepal_length'].values
print(X.shape,X.ndim)
X = X.reshape(-1,1)
y = df.loc[:,'petal_length'].values
y = y.reshape(-1,1)
print(X.shape,X.ndim,y.shape)

In [None]:
# Linear Regression Model
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X,y)
print(f'Intercept: {lm.intercept_[0]}, Slope: {lm.coef_[0][0]}')
pred_x = 4.8
# predict expects a 2D array of shape (n_samples, n_features)
ypred = lm.predict([[pred_x]])
print(f'Predicted petal length for sepal length {pred_x} is {ypred[0][0]}')
preds = lm.predict(X)

In [None]:
# Plot the data and the best-fitting (i.e., regression) line
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(X,y,'o')
plt.plot(X,lm.intercept_ + lm.coef_[0]*X)
plt.plot(X,preds,'o',color = 'r')
plt.xlabel("Sepal Length")
plt.ylabel("Petal Length")
plt.title("Linear Regression")
plt.show()
