# Machine Learning w/Python 3.7

## Importing Python Packages / Modules

### Packages Overview
* The Natural Language Toolkit (nltk) - we'll be accessing corpora as well as functional tools from this package.
    * The Brown Corpus: A text corpus of American English, split into fifteen different categories
    * Part of Speech Taggers (pos): prebuilt functions that are designed to determine the part of speech of every word in a given sentence.
* pandas - for data processing
* matplotlib - for visualizing data (the `%matplotlib inline` magic renders plots directly inside the Jupyter notebook)
* scikit-learn - for machine learning (includes various classification, regression and clustering algorithms)

In [None]:
import nltk
from nltk.corpus import brown
from nltk import pos_tag_sents
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn

## Understanding Classification

## Accessing our Data

In [None]:
for cat in brown.categories():
    print(cat)

In [None]:
news_sent = brown.sents(categories=["news"])
romance_sent = brown.sents(categories=["romance"])

In [None]:
print(news_sent[:5])
print()
print(romance_sent[:5])

In [None]:
ndf = pd.DataFrame({'sentence': news_sent,
                    'label':'news'})
rdf = pd.DataFrame({'sentence':romance_sent, 
                    'label':'romance'})

In [None]:
# Combine the two DataFrames (news and romance) into one
df = pd.concat([ndf, rdf])
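
The two cells above rely on two pandas behaviors: a scalar such as `'news'` is repeated for every row, and `pd.concat` stacks rows while keeping each frame's original index. The sketch below illustrates both with made-up sentences; the `toy_*` names are hypothetical and stand in for the Brown corpus data. Passing `ignore_index=True` is one optional way to get a fresh 0..n-1 index, not something the workshop code requires.

```python
import pandas as pd

# Hypothetical tokenized sentences standing in for brown.sents(...)
toy_news = [["The", "jury", "met", "."], ["Taxes", "rose", "."]]
toy_romance = [["She", "smiled", "."], ["He", "left", "."]]

# The scalar label is broadcast to every row of each frame
toy_ndf = pd.DataFrame({"sentence": toy_news, "label": "news"})
toy_rdf = pd.DataFrame({"sentence": toy_romance, "label": "romance"})

# Plain concat keeps each frame's own index, so index values repeat
stacked = pd.concat([toy_ndf, toy_rdf])

# ignore_index=True renumbers the rows of the combined frame
renumbered = pd.concat([toy_ndf, toy_rdf], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(renumbered["label"].value_counts())
```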

## Extracting Features from our Data

## Supervised Machine Learning

Supervised machine learning takes place in two steps: the training phase and the testing phase. In the training phase, you use a portion of your data to train your algorithm (which, in our case, is a classification algorithm). You provide both your feature vectors and your labels to the algorithm, and the algorithm searches for patterns in the features that help associate each observation with a particular label.
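
As a rough illustration of the training phase, the sketch below fits a scikit-learn classifier on a handful of feature vectors and their labels. The numbers are invented, the two features (noun count and pronoun count per sentence) are an assumption made for the example, and `LogisticRegression` is just one classifier that could be used here.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical feature vectors: [noun count, pronoun count] per sentence
X_train = [[5, 0], [6, 1], [4, 0], [1, 3], [2, 4], [1, 5]]

# One label per feature vector, in the same order
y_train = ["news", "news", "news", "romance", "romance", "romance"]

# Training: the classifier sees the features together with the labels
clf = LogisticRegression()
clf.fit(X_train, y_train)

# After fitting, the classifier knows which labels it can predict
print(clf.classes_)
```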

In the testing phase, we take the classifier trained in the previous step, give it feature vectors representing previously unseen data, and have it predict their labels. We can then compare each predicted label to the "true" label to see whether our classifier gives us a good, generalizable way of accomplishing the task (in our case, automatically distinguishing news sentences from romance sentences).
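
The two phases can be sketched end to end with scikit-learn's `train_test_split` and `accuracy_score`. This is a minimal example on invented data: the feature values, the 25% test size, and the choice of `LogisticRegression` are all assumptions made for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical feature vectors: [noun count, pronoun count] per sentence
X = [[5, 0], [6, 1], [4, 0], [7, 1], [1, 3], [2, 4], [1, 5], [0, 4]]
y = ["news"] * 4 + ["romance"] * 4

# Hold out a quarter of the data; the classifier never sees it in training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Training phase: features and labels together
clf = LogisticRegression().fit(X_train, y_train)

# Testing phase: features only; the classifier predicts the labels
y_pred = clf.predict(X_test)

# Compare predicted labels against the held-out "true" labels
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

With so few observations the score is not meaningful in itself; the point is only that evaluation uses data kept out of training.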

## Unsupervised Machine Learning

In supervised machine learning tasks, each observation is already assigned to one of a set of classes. For example, suppose we are given a dataset in which each observation is a set of physical attributes of an object. In a supervised task, the object column acts as the labels. The algorithm then uses these existing separations in the data to develop criteria for classifying unknown observations. In an unsupervised task, by contrast, no labels are provided, and the algorithm must find groupings in the data on its own.
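
For contrast with the labeled setup described above, the sketch below hands a clustering algorithm only the physical attributes, with no object column at all. The measurements are invented, and `KMeans` with two clusters is an assumption chosen for the example rather than a method the workshop prescribes.

```python
from sklearn.cluster import KMeans

# Hypothetical physical attributes: [weight in grams, diameter in cm]
# Note that there is no label column: the algorithm only sees measurements
observations = [
    [150, 7.0], [160, 7.5], [155, 7.2],  # three large, heavy objects
    [5, 1.0], [6, 1.2], [4, 0.9],        # three small, light objects
]

# Ask for two groups; random_state fixes the initialization for repeatability
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(observations)

# Each observation gets a cluster number, not a meaningful name
print(cluster_ids)
```

The cluster numbers are arbitrary identifiers. Deciding what each group represents is left to the person interpreting the results.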

## Review

At the end of this workshop, we have covered the following skills:
* How to use skills from the NLTK workshop to build features for a classification task
* How to build a text classification system that can predict whether sentences belong to one category ("news") or another ("romance")
* How to group data and perform calculations on the aggregations
* How to prepare data for machine learning using pandas, a package for Python that helps to organize your data
* How to use the scikit-learn package for Python to perform different types of machine learning on the data
* How to evaluate the results of machine learning algorithms
* How to visualize observations, aggregations, and algorithmic results

## Resources
"Introduction to Machine Learning with Python", Andreas C. Müller and Sarah Guido. O'Reilly, 2017.

"LING 83800: Methods in Computational Linguistics II", Andrew Rosenberg. http://eniac.cs.qc.cuny.edu/andrew/methods2/, 2014.

"Introduction to Latent Dirichlet Allocation", Edward Chen, http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/, 08/22/2011

"Topic Modeling for Humanists: A Guided Tour", Scott Weingart, http://www.scottbot.net/HIAL/index.html@p=19113.html, 07/25/2012

"The LDA Buffet is Now Open", Matthew Jockers, http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/, 09/29/2011

"Introduction to Topic Modeling", Christine Doig, http://chdoig.github.io/pytexas2015-topic-modeling/#/, PyTexas, 2015

##### Acknowledgments
Shout out to the CUNY DHRI for providing much of this workshop's structure.
