# Understanding Machine Learning



# What is Machine Learning

## A broad definition
* Traditional programs solve problems by following a specific set of steps specified by the programer (an Algorythm)
* However we often find problems that can not really be solved by a traditional:
    * Because there is no strict set of rules governing the phenomenon (randomness).
    * Because we do not yet understand the rules understanding the phenomenon.


* Broadly defined, machine learning is a family of algorythms that "learn" from observing a set of data (training data) and then can be used to process other data (unseen data) that is generated by a process supposed to be similar to the one that generated the training data.

* In other words Machine learning methods extrapolate the behavior from some particular data to make predictions about data of that same kind. 




## But what do we mean by data?

* Data for machine learning needs to have two things:
    * A set of observations or measures (features)
    * Another value that we want to predict (target) based on those observations. Often called the label
    
* Some examples of data for machine learning:
    * The characteristics of a home and its neighbourhood (features) along with its sale price (target)
    * The measurements of a plant's leaves (features) along with it's species (target)
    * The demographic information of passengers of the titanic (features) along with whether they survived or not
    
* Sometimes it is not obvious what the features are but it is clear that we have data and a label, examplesinclude:
    * Pictures along with their text descriptions
    * Texts along with their genres
    * Pictures of retinas along with whether or not they should be diagnosed with diabetic retinopathy
    * Law case files along with the veredict
    


## Two key distinctions

* Classification V.S Regresion:
    * In classification we have a certain number of classes and we want to assign training instances to those 
      classes, examples include:
         * Determine if a tweet is positive or negative
         * Determine if a bank transaction is normal or abnormal
         * Distinguish whose faces are contained in a picture
         * Determine if a given word in context is a noun, verb, preposition, etc
    * In regression we are trying to determine a numerical value (a continuous variable) based on the feature data
    as in .:
      * Determine the price of a house based on it's real estate data
      * Determine the likelihood that it will rain given weather data
      * Estimate the value of a given share in the future 


* Supervised V.S Unsupervised:
    * In supervised learning we provide the labels (the value of the target property) to the algorithm during training, most of the above examples are cases of suprevised learning.
    * In unsupervised learning we hand only the features and we expect the algorythm to group them (clustering), we can then look to find common elements in the clusters.
    * A third option is semi-supervised learning in which we might have labels for some of the data, we can then cluster the data and if we trust the unsupervised part asume that the unlabeled data in the cluster is going to have the same label as the labeled ones.


# Why do we care?
* Machine learning is increasingly being used in research across fields, computational linguistics, digital humanities , computational biology, the list goes on.
* Even if you don't use machine learning in your research the word around you uses it extensively.
    * Recomendation systems are based on machine learning, youtube recomendations, add serving, etc all have some kind of machine learning behind them
    * Credit scores are increasingly based on machine learning, as are insurance 

# Load and clean the data
* Our data consists in the last 200 tweets (as of as of May 17th 2018 ) of each representative in the U.S Congress, available at: https://www.kaggle.com/kapastor/democratvsrepublicantweets, provided by Kyle Pastor
* For each tweet we have:
    * The party of the representative
    * The twitter handle of the representatve
    * The text of the tweet
* We are going to try to learn to distinguish tweets from Republicans and tweets from Democrats (supervised classification)
* We will use the text as our data and extract features from it
    

In [1]:
import pandas as pd
import numpy as np
import re
tweetData = pd.read_csv('ExtractedTweets.csv')

#A little bit of data cleaning
# replace urls with special token
urlPattern= re.compile("(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?")
tweetData["Tweet"] = tweetData["Tweet"].str.replace(urlPattern,"<url>")
# Can you think of any other data cleaning measures?

tweetData.head(10)
    

Unnamed: 0,Party,Handle,Tweet
0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P..."
1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...
2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...
5,Democrat,RepDarrenSoto,RT @EmgageActionFL: Thank you to all who came ...
6,Democrat,RepDarrenSoto,Hurricane Maria left approx $90 billion in dam...
7,Democrat,RepDarrenSoto,RT @Tharryry: I am delighted that @RepDarrenSo...
8,Democrat,RepDarrenSoto,RT @HispanicCaucus: Trump's anti-immigrant pol...
9,Democrat,RepDarrenSoto,RT @RepStephMurphy: Great joining @WeAreUnidos...


# Feature selection
* In order to feed the text to a classifier we need to extract numerical features
* The classifier doesn't know what to look at in a tweet, but we can help it by computing __features__ that __might__ help
* Features need to be numerical (categorial features as yes/no or a finite set of labels are also fair game since we can turn them into numbers)
    * Presence or absence of something often gets encoded as 0 or 1
    * A multiclass feature with n posible values can be represented as 1,2,3,...,n
    

In [2]:
# Word length
tweetData["Length"] = np.vectorize(lambda x:len(x))(tweetData["Tweet"])
# Number of words
tweetData["NumWords"] = [len(tweet.split()) for tweet in tweetData["Tweet"] ]
# Is this a retweet
tweetData["IsRT"] = np.vectorize(lambda x : 1 if re.match(r"RT",x) else 0 )(tweetData["Tweet"]) 
# Average word length
tweetData["AVG_WL"] = tweetData["NumWords"]/tweetData["Length"]
# Number of at mentions
tweetData["Ats"] = np.vectorize(lambda x: len(re.findall(r'([^\w]|^)@\w+',x)))(tweetData["Tweet"])
# Number of hashtags
tweetData["hashtags"] = np.vectorize(lambda x: len(re.findall(r'([^\w]|^)#\w+',x)))(tweetData["Tweet"])

# Special words
# chargedTerms = ["fake news", "president", "bernie", "hillary","pro-life","pro-choice","trump","impeach","alt-right",
#                "mainstream media","liberal","family","values","troops","russia","florida","potus","taxreform","<url>",
#                "americans"]

# for term in chargedTerms:
#     tweetData[term] = np.vectorize(lambda x: 1 if  re.search(term, x,re.IGNORECASE) else 0)(tweetData["Tweet"])
# tweetData

# The development workflow
* In order to use a machine learning system we must first train it on labeled data
* But how will we know if it works?
    * We hold over a portion of the data that we will use to test it
    * In fact we devide our data in three:
        * The Training set (around 70 or 80%): used to train our model
        * The Development set: As we vary our models and features we use the development set to test the model, seeing how well it generalizes from the training data 
        * The Test or validation set: We keep this unseen while we play around with the model, we will use it in the end to see how our system performs outside of dev.
    * Can Anyone tell me why it is important that we keep Train out of the loop?

In [3]:
# Split the data
# Create new dataframes for each corpus
columns = tweetData.columns
train = pd.DataFrame(columns=tweetData.columns)
dev  = pd.DataFrame(columns=tweetData.columns)
test  = pd.DataFrame(columns=tweetData.columns)
# Divide the data one representative at a time
for repHandle in tweetData["Handle"].unique():
    # For each representative send the first 140 (70%) tweets to train, 30(15%) to dev, the final 30(15%) to test
    train = pd.concat([train, tweetData[tweetData.Handle == repHandle][:140]])
    dev = pd.concat([dev, tweetData[tweetData.Handle == repHandle][140:170]])
    test = pd.concat([test, tweetData[tweetData.Handle == repHandle][170:]])

# A word on chance and metrics:
* In order to say that our classifier is actually learning some insight from the training data it needs to at least beat chance:
    * Just guessing we could be right 50%(1/ no of classes) of the time if the classes are balanced
    * If there is a class that is more represented then just guessing that class will actually beat chance  

In [9]:
tweetData["Party"].value_counts() /len(tweetData)

Republican    0.51344
Democrat      0.48656
Name: Party, dtype: float64

# Chosing a Machine learning algorithm:

* There are many tried and true methods for machine learning, most with bases in statistics or linear algebra, however it is not necesary to understand the mathematic intricacies behind them to make use of them, some of the most popular include:
    * Naive Bayes:
        * Straightforward application of bayessian statistics
        * Naive because it makes the assumption that features are independant
        * Easy to interpret
    * Generalized Linear Models: 
        * Borrow from statistical regression and try to learn an optimal linear split of the problem space.
        * Are better suited when we can expect the label to be a linear function (a sum of multiples) of the features.
        * Can be used for classification via logistic regression
        * Better with fewer features where it is likely that there will be a relatively clear cut relation
    * Support vector machines (SVM):
        * Effective in high dimensional spaces
        * Very versatile, can be customised by the choice of kernel
    * Decision tree classifiers:
        * Find optimal cutouts to make a structured decision
        * Higly interpretable by humans
        * Less likely to be successful in situations with complex relations between features
    * Neural Networks:
        * Effective across a wide array of domains
        * Can learn feature representations from unstructured data
        * Widely adopted across fields
        * Portable
        * BUT: extremely hard to interpret, and often data hungry
        
For more go to http://scikit-learn.org/stable/

In [4]:
# Train a model
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
# Prepare the data for the algorythm
featurecolumns = columns[3:]
#This contains only our features
features = train[featurecolumns]
#This contains the labels 0 for Dem, 1 for Rep
labels = np.vectorize(lambda x: 1 if x == "Republican" else 0)(train["Party"])
# We have chosen a Multinomial Naive Bayes
gnb1 = MultinomialNB()
print("Training model, please wait")
gnb1.fit(features.values, labels)
print("Model trained, testing on dev")
devfeatures = dev[featurecolumns]
devlabels = np.vectorize(lambda x: 1 if x == "Republican" else 0)(dev["Party"])
score = gnb1.score(devfeatures,devlabels)
print("The model scored {}%accuracy".format(score) )

Training model, please wait
Model trained, testing on dev
The model scored 0.5313271604938271%accuracy


## Are we not wasting a lot of information? 
## Why limit ourselves to just that small set of words?
## Bring out the big guns, the Bag of Words 

In [5]:
# Bag of words features:
# The count vectorizer is a tool that looks at a collection of texts and counts how many times each word appears
# We import it and use iit to create a word matrix
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizedCountsTrain = vectorizer.fit_transform(train["Tweet"].values)
vectorizedCountsDev = vectorizer.transform(dev["Tweet"])
vectorizedCountsTest = vectorizer.transform(test["Tweet"])



In [6]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

#gnb = GaussianNB()
gnb = MultinomialNB()

print("Training model, please wait")
gnb.fit(vectorizedCountsTrain.toarray(), labels)
print("Model trained, testing on dev")
score = gnb.score(vectorizedCountsDev.toarray(),devlabels)
print("The model scored {}%accuracy".format(score) )

Training model, please wait
Model trained, testing on dev
The model scored 0.7881944444444444%accuracy


# What to do while your model trains?

* Keep working on reading your bibliography?
* Go get a snack?
* Catch up on your favourite comic books/tv-shows?
* Any other time it's your chouce but today fill the survey!: <Survey link>


In [None]:
feature_names = vectorizer.get_feature_names()
class_labels = ["Democrat","Republican"]

top25 = np.argsort(gnb.coef_[0])[-100:]
for index in top25:
    print("{}: {}".format(feature_names[index],gnb.coef_[0][index]))
print(gnb.class_count_)
print(gnb.classes_)
print(len(gnb.coef_[0]))
print(len(vectorizer.get_feature_names()))

In [None]:
# With a support vector machine

# Train a model
from sklearn import svm
# Prepare the data
features = vectorizedCountsTrain
labels = np.vectorize(lambda x: 1 if x == "Republican" else 0)(train["Party"])

clf = svm.SVC()
print("Training model, please wait")
clf.fit(features, labels)
print("Model trained, testing on dev")
devfeatures = vectorizedCountsDev
devlabels = np.vectorize(lambda x: 1 if x == "Republican" else 0)(dev["Party"])
score = clf.score(devfeatures,devlabels)
print("The model scored {}%accuracy".format(score) )

In [None]:
# score = gnb.score(devfeatures.toarray(),devlabels)
# print("The model scored {}%accuracy".format(score) )


In [None]:
gnb1.class_count_
gnb1.