# Section 5: Putting it all together

----
**Teacher's notes:**

- Noun phrase and verb phrase definitions
- Need to keep examples simple enough for students to understand
- Prepositions, adjectives, and conjunctions need to be defined for the natural language understanding part of the course
----

----

### OSEMN

Data science can be broken down into the following five steps, whose initials (with the N taken from "iNterpreting") spell OSEMN:
 1. Obtaining
 2. Scrubbing
 3. Exploring
 4. Modelling
 5. Interpreting the data

----
 

#### Obtaining data
 - In general - getting the data from a few different sources
    - e.g. Web pages, SQL Databases, APIs such as Twitter or Facebook
 - OR generating your own data (such as sensor data, or dimensional data you control)
    - Special note: some simple Python libraries to obtain and scrub data from various data sources
    - NOTE: Pandas’ data frames should be used here, though we still need to figure out how to reconcile them with the modelling and exploration sections later on
    - Special note: preparing data for analysis - when and how to use a large distributed framework, and when a single Python-like installation is sufficient
    - Special note: discussion of what a frame is - BlinkDB, MLlib, Spark, Infobright, Teradata, Postgres, etc.
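
As a rough illustration of obtaining data from a SQL source into a data frame, here is a minimal sketch. The in-memory SQLite database, the `readings` table, and its columns are made up for the example; a real course exercise would point at an actual database, web page, or API.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real SQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 4.0)],
)
conn.commit()

# Pull the query result straight into a data frame.
readings = pd.read_sql_query("SELECT sensor, value FROM readings", conn)
conn.close()
print(readings)
```

Once the data is in a data frame, the scrubbing and exploring steps can work on it without caring where it came from.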

#### Scrubbing data

 - Filtering the data, extracting meaningful fields, replacing values, validating data
 - Data validation can be a tremendous task by itself
 - Different formats can be difficult to manage
 - We won’t spend much time here, but this is one of the most significant things that you’ll have to account for in the real world
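
The scrubbing operations listed above (extracting fields, replacing values, validating, filtering) might look like this in Pandas. This is only a sketch: the `raw` records, the column names, and the 0 to 120 age range are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical raw records with inconsistent titles and unusable ages.
raw = pd.DataFrame(
    {
        "title": ["Mr.", "Mister", "Ms.", "Mr."],
        "age": ["34", "n/a", "29", "-5"],
        "notes": ["x", "y", "z", "w"],
    }
)

# Extract the meaningful fields.
clean = raw[["title", "age"]].copy()

# Replace values: map equivalent spellings to one form.
clean["title"] = clean["title"].replace({"Mr.": "Mister"})

# Validate: coerce ages to numbers (unparseable values become NaN),
# then filter out missing or impossible ages.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")
clean = clean[clean["age"].between(0, 120)]
print(clean)
```

Only the rows with a plausible numeric age survive; deciding what counts as plausible is exactly the kind of validation rule that takes time in real projects.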

#### Exploring data

 - Sample - creating slices, samples, meaningful selections
 - Summarize - detailed summaries and statistics of what you see
 - See - visualize and test your hypothesis
 - **Special note:** p-values and hypothesis testing. Applying cross-validation
 - **Special note:** An alternative to p-values and null hypotheses - Confidence Intervals
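
A small sketch of the sample and summarize steps, plus a confidence interval for a mean. The data is randomly generated and the column name `height_cm` is invented for the example; the interval uses the normal approximation (1.96 standard errors), which is one common choice rather than the only one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical measurements; real data would come from the obtaining step.
data = pd.DataFrame({"height_cm": rng.normal(170, 10, size=500)})

# Sample: a random slice of the data.
sample = data.sample(n=100, random_state=0)

# Summarize: count, mean, std, min, quartiles, max.
summary = sample["height_cm"].describe()

# Confidence interval for the mean, using the normal approximation
# (1.96 standard errors on either side for roughly 95% coverage).
mean = sample["height_cm"].mean()
sem = sample["height_cm"].std(ddof=1) / np.sqrt(len(sample))
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem

print(summary)
print(f"95% CI for the mean: ({ci_low:.1f}, {ci_high:.1f})")
```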

**Example: using the command line**

**Example: using matplotlib and IPython**

**Special notes - Understanding dimensions and feature selection:**

#### Dimensions and features
  - What is a dimension? What is a feature? What are predictors? What are variables?
  - The curse of dimensionality - when and how to avoid too many dimensions
  - Dimension selection and elimination
  - Using scikit-learn to reduce dimensionality with t-SNE (t-distributed stochastic neighbour embedding)
  - Overfitting - eliminating noise and randomness
       - An example of overfitting census data, and how to reduce the overfitting using Bayes’ Theorem
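
A minimal sketch of the t-SNE step with scikit-learn, assuming synthetic data (two made-up groups of points in 10 dimensions). Note that t-SNE is mostly used to produce a 2-D or 3-D view for plotting rather than to select features for a model.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical data: two groups of 30 points in 10 dimensions.
group_a = rng.normal(0.0, 1.0, size=(30, 10))
group_b = rng.normal(5.0, 1.0, size=(30, 10))
features = np.vstack([group_a, group_b])

# Embed into 2 dimensions. perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(features)
print(embedding.shape)
```

The two columns of `embedding` can be passed directly to a scatter plot to see whether the groups separate.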

##### Using Pandas to capture and model our data
  - An introduction to data frames
  - Data Indexing in Pandas
  - Data selection and aggregation - lambda functions and coding for scale. 
  - Migration of a data frame to a large scale production environment
     - apply, applymap, groupby
  - Summarizing data frames
     - An introduction to data omission, data containment
     - Automatically summarizing data frames (count, min, mean, etc.)
     - Creating a simple persistent (and really fast) database using Pandas and Python 3
     - Using Pandas and matplotlib to see your data
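
The selection, `apply`, `groupby`, and summary topics above could be demonstrated with something like the following. The `sales` frame and its columns are invented for the example. `applymap` is left out because recent Pandas releases deprecate it in favour of `DataFrame.map`.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "units": [10, 4, 6, 8],
        "price": [2.0, 3.0, 2.5, 3.5],
    }
)

# Indexing and selection: rows where units exceed 5.
large_orders = sales.loc[sales["units"] > 5]

# apply with a lambda: compute revenue row by row.
sales["revenue"] = sales.apply(lambda row: row["units"] * row["price"], axis=1)

# groupby: aggregate revenue per region.
revenue_by_region = sales.groupby("region")["revenue"].sum()

# Automatic summary: count, mean, std, min, quartiles, max.
summary = sales.describe()
print(revenue_by_region)
```

For large frames, the vectorized form `sales["units"] * sales["price"]` is usually much faster than a row-wise `apply`, which is worth pointing out when discussing coding for scale.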

#### Modelling the data
  - Essentially a pattern matching process
  - Detect patterns
  - Explain the reasons for the patterns that you see
  - Predict new events, patterns
  - Generally a statistical model
  - This is where we apply classification models, etc.

----
  - **Special Note:** Principal Component Analysis - reducing dimensionality and clustering our data
  - **Special Note:** Cluster Analysis - using k-means clustering to gather data into meaningful segments
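
One possible way to combine the two notes above, sketched with scikit-learn: PCA reduces the number of dimensions, then k-means groups the reduced points. The data is synthetic and the choice of 2 components and 2 clusters is an assumption made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: two well-separated groups in 5 dimensions.
points = np.vstack(
    [rng.normal(0.0, 0.5, size=(20, 5)), rng.normal(4.0, 0.5, size=(20, 5))]
)

# PCA: project onto the 2 directions with the most variance.
reduced = PCA(n_components=2).fit_transform(points)

# k-means: partition the reduced points into 2 clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)
```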

----
  **Special Note: Bayes Classification**

  - Classification
  - Clustering
  - Regression
  - It is here that we encounter the basic problems that we are trying to address (see earlier notes in exploration):
      - Dimensionality
      - Overfitting
      - Overtraining

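
A minimal sketch of Bayes classification, assuming scikit-learn's Gaussian naive Bayes and synthetic two-class data. Holding out part of the data is one simple way to notice overfitting: a model that scores much better on the training set than on the held-out set has probably memorized noise.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Hypothetical two-class data with 3 features per item.
features = np.vstack(
    [rng.normal(0.0, 1.0, size=(100, 3)), rng.normal(3.0, 1.0, size=(100, 3))]
)
labels = np.array([0] * 100 + [1] * 100)

# Hold out a quarter of the data so that overfitting shows up as a gap
# between training accuracy and test accuracy.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

model = GaussianNB().fit(x_train, y_train)
train_accuracy = model.score(x_train, y_train)
test_accuracy = model.score(x_test, y_test)
print(f"train={train_accuracy:.2f} test={test_accuracy:.2f}")
```
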
  **Special Note: Clustering**

  - We will focus on one area of modelling: clustering
  - When we cluster data, we take a collection of items and partition them into smaller collections, called clusters
  - The criteria by which we cluster the data are usually some heuristic or rule
  - Clustering is based primarily on comparison, which is where the earlier data scrubbing pays off
  - We rely on the data being comparable in meaningful ways: for example, when we compare terms, there should not be different terms that mean the same thing (such as “Mr.” vs. “Mister”)
  - When we’re looking at words, we can address the problem by:
      - Normalizing the data (e.g. changing “Mr.” to “Mister”)
      - Calculating similarity, either at the syntactic level or at the semantic level, using what is often called a similarity metric
          - For example, if we understand that a “house”, a “home” and a “demesne” are the same concept, we can adjust our algorithm to treat them as similar
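
The normalization and syntactic similarity ideas above can be sketched with Python's standard library. The `EQUIVALENTS` table and both function names are hypothetical, and `difflib.SequenceMatcher` is just one possible character-level metric. Semantic similarity (knowing that “house” and “demesne” are related) would need a thesaurus or word embeddings and is not covered by this sketch.

```python
from difflib import SequenceMatcher

# Hypothetical lookup table of equivalent spellings.
EQUIVALENTS = {"mr.": "mister", "mr": "mister"}


def normalize(term: str) -> str:
    """Lower-case a term and map known variants to one canonical form."""
    lowered = term.strip().lower()
    return EQUIVALENTS.get(lowered, lowered)


def syntactic_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], computed after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(syntactic_similarity("Mr.", "Mister"))   # identical after normalization
print(syntactic_similarity("house", "home"))   # partial character overlap
```
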
#### Interpreting the results

  - Figure out what your results mean
  - Plan actions if the process is manual
  - Create an automatic feedback loop if you are sure of your data