In this project, you'll work with data that could have come from a real-world web application, with fields representing information that a web server would record, such as HTTP status codes and URL paths. The web server and the reporting tool both connect to the same database, allowing information to flow from the web server into the report.
Three questions need to answer:
- What are the most popular three articles of all time?
- Who are the most popular article authors of all time?
- On which days did more than 1% of requests lead to errors?
- Install Vagrant and VirtualBox
- Download or Clone fullstack-nanodegree-vm repository.
- Unzip the downloaded file to get newsdata.sql.
- Copy the newsdata.sql file and put it into the same location as vagrant directory
- Launching the Virtual Machine in your computer's command promt:
- Change directory to vagrant directory
- Launch the Vagrant VM:
$ vagrant up
- Then Log into VM by using command:
$ vagrant ssh
- Change directory using
cd /vagrant
-
Load the data in local database using the command:
psql -d news -f newsdata.sql
-
Use
psql -d news
to connect to database.
Create Summary table that has author, title and total views of each article
SELECT author, title, count (*) as Views
FROM articles, log
WHERE log.path LIKE concat ('%', articles.slug)
GROUP BY articles.title, articles.author
ORDER BY Views DESC;
Create Popular_author table to get all the views on all of the articles of each author
SELECT Summary.author, SUM(Summary.Views) AS Total_Views
FROM Summary
GROUP BY Summary.author
ORDER BY Total_Views DESC;
Create Access_Summary_By_Date table to get number of errors and number of requests by date
SELECT CAST(log.time AS Date) AS Date_Request,
SUM((CASE WHEN log.status='404 NOT FOUND' then 1
ELSE 0 END)) AS Total_Errors,
COUNT(*) AS Total_Requests
FROM log
GROUP BY Date_Request;
Make sure you have changed the directory inside vagrant
Run python log_analysis.py