# Cleaning data

Garbage in, garbage out! A big part of the data science process involves cleaning the data before using it. Whether you clean your data or not can make the difference between a good model and a bad one, so don't take data cleaning lightly, even if it's more fun to work on your machine learning model. Here's a list of things to look for when you are cleaning data:

* Outliers: the machine logging the data may have malfunctioned for a bit and recorded a value that makes no sense compared to the rest of the data. Or a bot may keep hitting your site when you only want to take real users into account. You want to eliminate outliers like these from your dataset.

* Missing Data: it could be that data was simply not recorded at a particular time, but a missing value can also be meaningful in itself, for example as a separate class in a classification problem. You need to decide which case applies before choosing how to handle it.

* Malicious Data: There might be people trying to mess with your website's recommender system by trying to promote their item (movie on Netflix, product on Amazon, etc).

* Erroneous Data: the machine may be making some mistakes during the data collection or there might be a mistake when merging data together.

* Irrelevant Data: your dataset contains data about things you don't want to model. For example, you have data on schools across Canada, but you only want data on schools in Toronto.

* Inconsistent Data: for example, people could write addresses differently (leaving out "street" or "drive", adding the city and country, etc). If you're looking at movies, maybe the same movie has a different name in different countries.

* Formatting: date formatting can be different depending on the country.
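To make the outlier item above concrete, here is a minimal sketch of one possible filter. It assumes a plain list of numbers and uses the modified z-score (based on the median absolute deviation) with the common rule-of-thumb cutoff of 3.5. The function name `drop_outliers`, the cutoff and the sample values are all made up for illustration; the right rule depends on your data.

```python
import statistics


def drop_outliers(values, threshold=3.5):
    """Return the values whose modified z-score is within the threshold.

    Hypothetical helper: the modified z-score uses the median and the
    median absolute deviation (MAD), so a few extreme readings do not
    distort the baseline they are compared against.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values are identical; this rule cannot
        # separate outliers, so leave the data untouched.
        return list(values)
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]


# Made-up sensor readings with one obviously broken value.
readings = [10, 12, 11, 13, 12, 11, 500]
cleaned = drop_outliers(readings)
print(cleaned)
```

Using the median rather than the mean matters here: a single reading of 500 would drag the mean and standard deviation far enough to partly hide itself.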

We'll take a web access log and figure out the most-viewed pages on the website.

Let's start by setting up a regex that lets us parse an Apache access log line:

In [1]:
import re

format_pat = re.compile(
    r"(?P<host>[\d\.]+)\s"
    r"(?P<identity>\S*)\s"
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]\s"
    r'"(?P<request>.*?)"\s'
    r"(?P<status>\d+)\s"
    r"(?P<bytes>\S*)\s"
    r'"(?P<referer>.*?)"\s'
    r'"(?P<user_agent>.*?)"\s*'
)
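
To see how named groups turn a log line into fields, here is a small sketch. It uses a shortened stand-in for the pattern above (`short_pat`, which skips the identity and user fields without capturing them) and a made-up log line, so both are assumptions for illustration rather than data from the real log.

```python
import re

# Shortened, hypothetical version of the access-log pattern:
# it captures only host, time, request and status.
short_pat = re.compile(
    r"(?P<host>[\d.]+)\s"
    r"\S*\s\S*\s"                 # identity and user, matched but not captured
    r"\[(?P<time>.*?)\]\s"
    r'"(?P<request>.*?)"\s'
    r"(?P<status>\d+)"
)

# Made-up line in Apache access-log style.
sample_line = (
    '127.0.0.1 - - [29/Nov/2015:03:50:05 +0000] '
    '"GET /robots.txt HTTP/1.0" 200 55'
)

match = short_pat.match(sample_line)
# groupdict() maps each named group to the text it captured.
fields = match.groupdict() if match else {}
print(fields)
```

Checking `match` before calling `groupdict()` matters on real logs, because `match()` returns `None` for any line that does not fit the pattern.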

In [2]:
logPath = "/Users/jacquesthibodeau/Python Data Science/access_log.txt"

We're going to write a script that extracts the URL in each access, and use a dictionary to count up the number of times each one appears. Then we'll sort it and print out the top 20 pages.
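
The counting and sorting step on its own can be sketched as follows, assuming the URLs have already been extracted into a list. The helper name `top_pages` and the sample URLs are hypothetical; this only illustrates the dictionary-based tally, not the log parsing.

```python
def top_pages(urls, n=20):
    """Count how often each URL appears and return the n most frequent."""
    counts = {}
    for url in urls:
        # get() returns 0 the first time a URL is seen.
        counts[url] = counts.get(url, 0) + 1
    # Sort the (url, count) pairs by count, largest first.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Made-up URLs standing in for those pulled out of the log.
sample_urls = ["/", "/about", "/", "/blog", "/", "/about"]
for url, hits in top_pages(sample_urls, n=2):
    print(url, hits)
```

The standard library's `collections.Counter` and its `most_common()` method do the same job; the plain dictionary is kept here to match the approach described above.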