# Workshop 2 
## Pre-Processing Data for Machine Learning

Many machine learning algorithms make assumptions about your data.

It is often a very good idea to prepare your data in such a way as to best expose the structure of the problem to the machine learning algorithms that you intend to use.

In this post you will discover how to prepare your data for machine learning in Python using the scikit-learn library, which contains tools for data mining and analysis built on NumPy, SciPy, and matplotlib.

![alt text](https://www.tibco.com/blog/wp-content/uploads/2013/12/dataman.jpg "Lots of data")

## But wait!

Before we even begin pre-processing our data, we have to make sure we select the correct data to pre-process! 

They say "Data is the new oil" and, just like oil, data must be cleaned and filtered before it can be used. Dirty, poorly selected data can produce a poor model!


![alt text](http://infoquarter.com/images/DataCleansing/data-cleaning1.jpg "Data Cleaning")

# <span style="color:green">Activity 1</span>
## Feature engineering / Feature selection

Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.

#### In short, we must construct the best features fit for the task at hand!
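One simple way to make this concrete is to drop features that barely vary between samples, since a column that is nearly constant cannot help tell people apart. The sketch below is a minimal plain-Python illustration of that idea. The feature names, the sample values, and the threshold of `0.01` are all made up for the example and are not part of the workshop data.

```python
from statistics import pvariance

# Hypothetical data: each row is one person, each column is one feature.
feature_names = ["height_cm", "wears_glasses", "is_human"]
samples = [
    [170.0, 1.0, 1.0],
    [182.0, 0.0, 1.0],
    [165.0, 1.0, 1.0],
    [175.0, 0.0, 1.0],
]


def select_by_variance(rows, names, threshold=0.01):
    """Return the names of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))  # transpose the rows into columns
    return [
        name
        for name, column in zip(names, columns)
        if pvariance(column) > threshold
    ]


selected = select_by_variance(samples, feature_names)
print(selected)  # "is_human" is constant, so it is dropped
```

scikit-learn ships a comparable tool, `VarianceThreshold` in `sklearn.feature_selection`, which is probably the better choice once the data lives in a NumPy array.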

# <span style="color:green">Activity 2</span>
## Training

Training a model is a lot like how we humans learn! We observe things over and over again and begin to differentiate between them. For example, if we had never seen a dog or a cat before in our entire lives, chances are slim that we'd be able to tell them apart.

#### Let's train ourselves on whether or not we can identify someone in the room based on the features we've selected in the previous activity

# <span style="color:green">Activity 3</span>
## Data preparation

Machine learning algorithms work best with data that is as simple and consistent as possible. When there is too much variability in the data, such as features measured on very different scales, an algorithm may fail to pick up real connections and correlations.

#### It's our job to prepare the data and spoon-feed it to our machine learning algorithms. Let's begin.
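Two common preparation steps are rescaling a feature to the range 0 to 1 and standardizing it to zero mean and unit standard deviation. The sketch below shows both in plain Python so the arithmetic is visible. The `heights` list is invented for illustration, and both helpers assume the values are not all identical (otherwise they would divide by zero).

```python
from statistics import mean, pstdev

# Hypothetical raw feature: heights in centimetres.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]


def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit population std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


scaled = min_max_scale(heights)
z_scores = standardize(heights)
print(scaled)
print(z_scores)
```

In scikit-learn the same two ideas are available as `MinMaxScaler` and `StandardScaler` in `sklearn.preprocessing`.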

# <span style="color:green">Activity 4</span>
## Web Scraping

Sometimes the data we want is not as easily available as we would like. Perhaps we would like to analyze a certain site's content, such as Reddit posts, Facebook posts, or YouTube comments! If we could gather all this data, we could, for example, gauge the general sentiment toward a YouTube video.

#### Let's scrape a website's data and see if we can preprocess it to make some sense of it!
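To give a feel for what "preprocessing" scraped text can mean, here is a minimal sketch that lowercases some comments, splits them into words, drops a few filler words, and counts what is left. The comments and the tiny stop-word list are made up for the example. A real sentiment analysis would need much more than word counts.

```python
import re
from collections import Counter

# Hypothetical scraped comments; real ones would come from a page
# that you are allowed to scrape.
comments = [
    "Great video, really GREAT explanation!",
    "Not great... the audio was bad.",
]

# A tiny, made-up stop-word list, for illustration only.
STOP_WORDS = {"the", "was", "really", "a"}


def tokenize(text):
    """Lowercase the text and return its words, minus the stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


word_counts = Counter(word for c in comments for word in tokenize(c))
print(word_counts.most_common(3))
```

Note that "not" is deliberately kept out of the stop-word list: throwing it away would make "Not great" look just like "great".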

## A few things to note!

#### copyrights and permission:
- be careful and polite
- give credit
- care about media law
- don't be evil (no spam, overloading sites, etc.)

### robots.txt
- specified by web site owner
- gives instructions to web robots (aka your script)
- is located at the top-level directory of the web server

Take a look at http://google.com/robots.txt
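Python's standard library can read these rules for us through `urllib.robotparser`. The sketch below parses a made-up `robots.txt` from a string, so it runs without any network access. The rules, the `workshop-bot` user agent, and the `example.com` URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from a string so no network access is needed.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "workshop-bot" and example.com are hypothetical names.
can_read_index = parser.can_fetch("workshop-bot", "https://example.com/index.html")
can_read_private = parser.can_fetch("workshop-bot", "https://example.com/private/data.html")
print(can_read_index)    # allowed by "Allow: /"
print(can_read_private)  # blocked by "Disallow: /private/"
```

For a live site, you would instead call `parser.set_url(...)` with the address of the site's `robots.txt` and then `parser.read()` before asking `can_fetch`.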

In [8]:
from IPython.display import HTML

# Render an HTML string in the notebook to see what the markup produces
htmlString = """<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <h2> Test </h2>
    <p>Hello world!</p>
  </body>
</html>"""

htmlOutput = HTML(htmlString)
htmlOutput

In [20]:
from urllib.request import urlopen

url = 'http://www.pythonscraping.com/exercises/exercise1.html'
source = urlopen(url).read()
print(source)


b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
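The `b'...'` prefix in the output above means that `read()` returned raw bytes rather than text, which is why the line breaks show up as `\n`. A minimal sketch of turning such bytes into a readable string follows. The `raw` value is a shortened stand-in for the downloaded page, and UTF-8 is an assumption about its encoding.

```python
# A shortened stand-in for the bytes returned by urlopen(url).read().
raw = b"<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n"

# Assumption: the page is UTF-8 encoded. A more careful script could look
# at the response's Content-Type header instead of hard-coding this.
text = raw.decode("utf-8")
print(text)
```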


#### We're going to need two new packages here: Beautiful Soup and the html5lib parser it uses below. Let's open a new command prompt and type:
- `pip install beautifulsoup4 html5lib`

In [24]:
from bs4 import BeautifulSoup

# Parse the downloaded bytes into a navigable tree using the html5lib parser
bsObj = BeautifulSoup(source, "html5lib")
print(bsObj)

<html><head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>


</body></html>


In [25]:
print(bsObj.div)

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
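To see roughly what happens when we ask for a single tag, here is a small sketch that pulls the text out of the first `<div>` using only the standard library's `html.parser`. This is not how Beautiful Soup is implemented, just a stripped-down illustration of the idea. The `TagTextExtractor` class is a hypothetical helper and `page` is a shortened stand-in for the exercise page.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside the first occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # greater than 0 while inside the target tag
        self.done = False   # True once the first matching tag has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

    def text(self):
        return "".join(self.chunks).strip()


# A shortened, made-up stand-in for the downloaded exercise page.
page = (
    "<html><body><h1>An Interesting Title</h1>"
    "<div>\nLorem ipsum dolor sit amet.\n</div></body></html>"
)

extractor = TagTextExtractor("div")
extractor.feed(page)
div_text = extractor.text()
print(div_text)
```

With Beautiful Soup itself, the tree navigation shown in the cell above does this work for us, which is the main reason to prefer it for real scraping.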
