# Concepts

- ## 3 Types of Learning
    - Supervised
        - Regression (Continuous Output)
        - Classification (Labels, names, classes, no order)
            - K-Nearest Neighbors
                - KNN can predict the response class for a future observation by calculating the "distance" to all training observations and assuming that the response class of nearby observations is likely to be similar. These predictions can be visualized using a classification map.
            - Support Vector Machine (SVM)
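                - Splits data with a linear boundary, or a non-linear one by using kernels.
                - Useful when training data is limited.
                - Suits data that is geometrical in nature: Computer Vision, etc.
                - Natively a two-class method; more than two classes requires combining several binary classifiers.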
            - Decision Tree
                - Consider Decision Forests (ensembles of decision trees) for high-dimensional data.
            - Logistic Regression
                - Used when you need a probability for each class, not just a label.
                - Its coefficients indicate which features are important.
                - Training is relatively fast.
                - Often less accurate than more flexible models.
    - Unsupervised
        - Clustering (finding groupings of similar data)
            - Clustering is intrinsically imprecise: without ground-truth labels, there is no definitive way to check the groupings.
                - If possible, crowdsource labeled data and use classification instead.
            - Clustering Algorithms
                - Centroid Based
                    - Mini-Batch K-Means
                    - Affinity Propagation
                    - Mean Shift
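                - Hierarchical: Connects the nearest data points or clusters until a set distance is reached.
                    - Agglomerative Clustering
                    - Ward
                - Neighborhood Growers: Grow clusters locally from dense regions.
                    - DBSCAN
                - Graph Based: Cluster using the connectivity of a similarity graph.
                    - Spectral Clustering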
            - k-Means Clustering
                - Method
                    - Choose k initial centroids.
                        - The result is highly dependent on these initial locations.
                    - Assign each data point to its nearest centroid.
                    - Compute the mean of each cluster and move its centroid to that mean.
                    - Repeat until the centroids barely move.
                - Use the Silhouette Coefficient to check cluster quality.
            - Clustering Ensembles
        - Compression (removing noise and reducing the number of dimensions in datasets)
    - Reinforcement
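
The k-means steps listed above can be sketched in plain Python as follows. This is a minimal illustration rather than a library implementation: the function name `kmeans`, its parameters (`initial_centroids`, `max_iter`, `tol`, `seed`), and the toy points are all assumptions made for this example. For real work, a tested implementation such as scikit-learn's `KMeans` is the usual choice.

```python
import math
import random


def kmeans(points, k, initial_centroids=None, max_iter=100, tol=1e-6, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: choose the initial centroids (random data points by default).
    if initial_centroids:
        centroids = list(initial_centroids)
    else:
        centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its cluster.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4: stop once no centroid moves more than `tol`.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


# Hypothetical toy data: two small groups in 2-D.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Because the outcome depends on where the centroids start, running the sketch with several `seed` values (or passing `initial_centroids` explicitly) is a simple way to see that sensitivity.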
    
- ## Techniques for Training
    - Validation
        - Train/Test Split Method
            - Runs about "K" times faster than K-Fold Cross Validation.
            - Simpler and therefore easier to examine the details.
        - K-Fold Cross Validation
            - Essentially the train/test split method repeated K times, with each fold used once as the test set.
            - More accurate estimate of out-of-sample accuracy.
            - Tips
                - K = 10 is commonly recommended. It has been shown empirically to give reliable estimates of out-of-sample accuracy.
                - For classification problems: Use *Stratified Sampling*.
                    - The proportions of the classes should also be reflected in each fold. (Data with an 80/20 split between two classes should also have an 80/20 split in each fold.)
                    - When using scikit-learn's `cross_val_score` with a classifier and an integer `cv`, stratified sampling is the default.
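
To show what stratified K-fold splitting does, here is a small standard-library sketch. It is not how scikit-learn implements it: the helper names `stratified_folds` and `cross_val_splits` and the toy labels are hypothetical, and the sketch skips shuffling for brevity.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Return k lists of sample indices with class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    # Deal each class's indices round-robin so every fold gets its share.
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_val_splits(labels, k):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = stratified_folds(labels, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


# Hypothetical toy labels: an 80/20 split between two classes.
labels = ["a"] * 40 + ["b"] * 10
for train, test in cross_val_splits(labels, k=5):
    print(len(train), len(test), Counter(labels[i] for i in test))
```

Each of the five test folds keeps the same 80/20 class ratio as the full dataset, which is the property the stratified sampling tip describes.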

# Languages/Tools

## Python

- ### Learning
    - [Learn Python the Hard Way](https://learnpythonthehardway.org/book/)
    - [Plotly Tutorials](https://plot.ly/python/)
    
    - Beautiful Soup 4
        - [Web Scraping with Beautiful Soup](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html)
        - [Beautiful Soup 4 Python Tutorial](http://codegists.com/code/beautiful-soup-4-python-tutorial/)

- ### Modules
    - **NumPy** stands for Numerical Python. Its most powerful feature is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities, and tools for integration with lower-level languages such as Fortran, C, and C++.

    - **SciPy** stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules, such as discrete Fourier transforms, linear algebra, optimization, and sparse matrices.

    - **Matplotlib** for plotting a vast variety of graphs, from histograms to line plots to heat maps. You can use the Pylab feature in the IPython notebook (`ipython notebook --pylab=inline`) to use these plotting features inline. If you omit the inline option, Pylab turns the IPython environment into one very similar to Matlab. (Newer Jupyter versions use the `%matplotlib inline` magic instead.) You can also use LaTeX commands to add math to your plot.

    - **Pandas** for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas is a relatively recent addition to Python and has been instrumental in boosting Python’s usage in the data science community.

    - **Scikit-learn** for machine learning. Built on NumPy, SciPy, and matplotlib, this library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.

    - **Statsmodels** for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.

    - **Seaborn** for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.

    - **Bokeh** for creating interactive plots, dashboards and data applications on modern web-browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.

    - **Blaze** for extending the capability of NumPy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.

    - **Scrapy** for web crawling. It is a very useful framework for getting specific patterns of data. It can start at a website's home URL and then dig through web pages within the website to gather information.

    - **SymPy** for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the computations as LaTeX code.

    - **Requests** for accessing the web. It works similarly to the standard Python library urllib2 but is much easier to code with. You will find subtle differences from urllib2, but for beginners Requests might be more convenient.

    - **Additional Modules**

        - **os** for operating system and file operations
        - **networkx** and **igraph** for graph-based data manipulations
        - **regular expressions** (the **re** module) for finding patterns in text data
        - **BeautifulSoup** for scraping the web. It is more limited than Scrapy, as it extracts information from just a single web page per run.

## BASH

- ### Learning

## JSON

- **J**ava**S**cript **O**bject **N**otation 
- [Basics](https://www.tutorialspoint.com/json/)
- [Learn JSON in 10 Minutes](http://beginnersbook.com/2015/04/json-tutorial/)
- Alternative to XML
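- Example:
```javascript
// Object literal whose contents follow JSON's name/value syntax
var suren = {
   "firstName" : "Shivakumar",
   "lastName" : "Surendranath",
   "age" :  "26"
};
```
```javascript
// Read the values back by property name
document.writeln("Full name is:  " + suren.firstName +
                 " " + suren.lastName);
document.writeln("His age is: " + suren.age);
```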

Notes
- JSON is a lightweight, text-based open standard designed for human-readable data interchange.
- The filename extension is *.json*.
- Data is represented in name/value pairs.
- Curly braces hold objects. Each name is followed by a colon (`:`), and the name/value pairs are separated by commas (`,`).
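
Since most of these notes are about Python, the following sketch shows how the same record could be read and written with Python's standard `json` module. The variable names (`text`, `person`, `full_name`, `round_trip`) are chosen only for this illustration.

```python
import json

# The same record as the JavaScript example, written as JSON text.
text = '{"firstName": "Shivakumar", "lastName": "Surendranath", "age": "26"}'

# Parse the JSON object into a Python dict.
person = json.loads(text)
full_name = person["firstName"] + " " + person["lastName"]
print("Full name is:", full_name)
print("His age is:", person["age"])

# Serialize back to JSON text; indent=2 is only for readability.
round_trip = json.dumps(person, indent=2)
print(round_trip)
```

Note that `"age"` stays a string here because the source data quotes it; an unquoted `26` would be parsed as a number.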

## TensorFlow

- Deep Learning framework from Google

## Hadoop

## Spark

## Obvibase (Online DataBase)

## MySQL

## GitHub

# Resources

## Books

- [Field Guide to Data Science (Online Book/Slides)](http://www.slideshare.net/BoozAllen/booz-allen-field-guide-to-data-science)
- [Pattern Recognition & Machine Learning](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf)
- Coding the Matrix: Linear Algebra through CS Applications by: Philip N. Klein
- Introduction to Data Mining by: Pang-Ning Tan & Others
- The Elements of Statistical Learning: Data Mining, Inference, & Prediction 2nd Ed. by: Trevor Hastie & Others

## Online

### Sites

- [Analytics Vidhya](https://www.analyticsvidhya.com/)
    - [A Complete Tutorial to Learn Data Science with Python from Scratch](https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/)
- [Data Dependence](http://www.datadependence.com/)
    - [An Introduction to Scientific Python – Pandas](http://www.datadependence.com/2016/05/scientific-python-pandas/)
    - [An Introduction to Scientific Python – NumPy](http://www.datadependence.com/2016/05/scientific-python-numpy/)
    - [An Introduction to Scientific Python – Matplotlib](http://www.datadependence.com/2016/04/scientific-python-matplotlib/)
- [Becoming a Data Scientist](http://www.becomingadatascientist.com/)
- [The Field Guide to Data Science](http://www.slideshare.net/BoozAllen/booz-allen-field-guide-to-data-science)
- [Leet Code](https://leetcode.com)
- [Open Source Society University](https://github.com/S-Suren/data-science)
- Long Pandas Tutorial
    - [Intro to Pandas](http://nbviewer.jupyter.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_1-Introduction-to-Pandas.ipynb)
    - [Data Wrangling with Pandas](http://nbviewer.jupyter.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_2-Data-Wrangling-with-Pandas.ipynb)
    - [Plotting and Visualization in Python](http://nbviewer.jupyter.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_3-Plotting-and-Visualization.ipynb)
- [Top 8 resources for Pandas](http://www.dataschool.io/best-python-pandas-resources/)
- [Sebastian Raschka Resources](https://sebastianraschka.com/notebooks/python-notebooks.html)

### EDX

### Udacity

- Intro to Computer Science (Build a Search Engine & a Social Network) (Georgia Tech?)

### Coursera

- Machine Learning by: Andrew Ng (Stanford University)

# Other

## Getting Data


- [Free Data](https://www.data.gov/)
    - Government resource full of varying and free data.
- [Selector Gadget](http://selectorgadget.com/)
    - Aids in web scraping
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/)
- [Census](https://.census.gov)
- [Reddit /r/datasets](https://www.reddit.com/r/datasets/)
- Quandl

## Important things to know

### Algorithms (Know the tradeoffs in applying them.)

- Linear Regression
- Logistic Regression
- K-Means
- K-Nearest Neighbors/Clustering
    - Picks up non-linear trends in data
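
As a concrete reference for the K-Nearest Neighbors entry, here is a minimal standard-library sketch of a KNN classifier. The function name `knn_predict`, the choice of Euclidean distance, the default `k=3`, and the toy data are assumptions for illustration. Because the vote uses only nearby points, the method makes no assumption that the class boundary is linear.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest points."""
    # Distance from the query to every training observation.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest observations and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two groups in 2-D.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(train_points, train_labels, (1.2, 1.4)))
print(knn_predict(train_points, train_labels, (6.8, 7.0)))
```

The tradeoff to keep in mind is that every prediction scans the whole training set, so this brute-force form slows down as the data grows.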

### Data Cleaning
- Roughly 90% of the work

### Communication
- Explaining results efficiently

### Systems
- Writing faster-running code
- Scaling to multiple machines

### Tips
- "Tour" of Data Method
    - Take all the variables in the dataset, plot them against each other in various combinations, and inspect the plots visually. Then look for interesting trends.
- Go to programming meetups
- Work on projects
- Go to hackathons
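
One possible way to decide which plots in the "Tour" of Data method to look at first is to visit every pair of variables and rank the pairs by correlation strength. The sketch below uses only the standard library. The `pearson` helper and the toy dataset are hypothetical, and Pearson correlation only captures linear trends, so it complements the visual inspection rather than replacing it.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy dataset: column name -> values.
data = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 80, 92],
    "shoe_size": [36, 38, 41, 43, 46],
    "commute_min": [30, 12, 45, 20, 33],
}

# Visit every pair of variables once and rank pairs by |correlation|.
pairs = [
    (abs(pearson(data[a], data[b])), a, b)
    for a, b in combinations(data, 2)
]
for strength, a, b in sorted(pairs, reverse=True):
    print(f"{a} vs {b}: |r| = {strength:.2f}")
```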