###General Instructions
In this assignment, you will need to complete the code samples where indicated to accomplish the given objectives. **Be sure to run all cells** and export this notebook as an HTML file with results included.  Upload the exported HTML file to Canvas by the assignment deadline.

####Assignment
Complete the following Python script per the instructions provided at the top of each code block. Look for the *# your code here* comment, which indicates where you need to make code modifications. Unlike last week's assignment, you may need to replace this comment with multiple lines of code to achieve the required results.

Weblogs from the smu.edu website from Oct 25 through Nov 3, 2018 are provided at /tmp/weblogs/new/.  There are 10 files for this period, one for each day, and they all employ the same format.  In the following cells, you will briefly examine one of these files:

**NOTE** We are limiting our analysis to just the one file to keep processing times on our small cluster low.

In [0]:
# notebook config
USER_NAME = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
FILE_STORE_ROOT = '/FileStore/shared_uploads/'+USER_NAME

In [0]:
from pyspark.sql.types import *
import pyspark.sql.functions as f

# just access one file
weblog_file_name = FILE_STORE_ROOT + '/weblogs/new/u_ex181025_x.log'

schema = StructType([
  StructField('line', StringType())
  ])

# read lines from log
raw_log = (
  spark
    .read
    .csv(
      weblog_file_name,
      sep='\0000', # use a char not likely found in log (\0000 is Unicode NULL)
      header=False,
      schema=schema
      )
  )

display(raw_log)

line
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2018-10-25 00:00:02
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken X-Forwarded-For
2018-10-24 23:59:30 129.119.66.37 GET /-/media/Site/News/NewsSources/EarthquakeStudy/earthquake-causes-17may2016.jpg h=338&la=en&w=350&hash=4865678DB6FEA8BB4AECD9F687D0DE225EAAC3BE 443 - Mozilla/5.0+(compatible;+SeznamBot/3.2;++http://napoveda.seznam.cz/en/seznambot-intro/) - 200 0 0 691 77.75.78.161
"2018-10-24 23:59:31 129.119.66.37 GET / - 443 - Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1 - 200 0 0 187 98.156.209.143"
"2018-10-24 23:59:31 129.119.66.37 GET /-/media/Site/Main/Logo-SMU-WCSH-Stacked-RW-2x.png - 443 - Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1 https://www.smu.edu/ 200 0 0 31 98.156.209.143"
"2018-10-24 23:59:31 129.119.66.37 GET /js/Main/jquery.min.js - 443 - Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1 https://www.smu.edu/ 200 0 0 0 98.156.209.143"
"2018-10-24 23:59:31 129.119.66.37 GET /Admission/Apply - 443 - Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_13_6)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Safari/605.1.15 - 200 0 0 109 76.187.104.226"
"2018-10-24 23:59:31 129.119.66.37 GET /-/media/Site/Main/Navigation/Research.jpg h=200&w=300&la=en&hash=1DCDB6862BE00A058A97C5B6EB4762BD52636288 443 - Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1 https://www.smu.edu/ 200 0 0 15 98.156.209.143"


Looking at the file contents, you should notice these logs use a different format than the one we explored in class.  That's because earlier in 2018, SMU made some changes to their webservers that caused them to default to what is known as the W3C Extended Format, which you can read about here: https://docs.microsoft.com/en-us/windows/desktop/http/w3c-logging.

The fields in this format (in order) are:
1. date 
2. time 
3. s-ip 
4. cs-method 
5. cs-uri-stem 
6. cs-uri-query 
7. s-port 
8. cs-username 
9. cs(User-Agent) 
10. cs(Referer) 
11. sc-status 
12. sc-substatus 
13. sc-win32-status 
14. time-taken 
15. X-Forwarded-For

The W3C Extended Format is significantly simpler than the format used by the Apache Tomcat web servers, so the records in these files are usually much easier to parse.  That said, there's still some junk in the files (such as the header lines starting with *#*) and some complexity in the row formatting that makes RegEx a powerful tool for processing the data.  In the following cell, use the provided RegEx pattern to parse the fields from each line:

In [0]:
# date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken X-Forwarded-For
# changed all '-' to '_' because of issues with the SQL query
# raw string so Python does not treat \S as a string escape sequence
regex_pattern = r'^(\S*) (\S*) (\S*) (\S*) (\S*) (\S*) (\S*) (\S*) (\S*) (\S*) (\S*) (\S*) (\S*) (\S*) (\S*)$'

parsed_log = (
  raw_log
    .filter(~f.col('line').startswith('#'))  # drop the '#' header lines
    .withColumn('date', f.regexp_extract('line', regex_pattern, 1))
    .withColumn('time', f.regexp_extract('line', regex_pattern, 2))
    .withColumn('s_ip', f.regexp_extract('line', regex_pattern, 3))
    .withColumn('cs_method', f.regexp_extract('line', regex_pattern, 4))
    .withColumn('cs_uri_stem', f.regexp_extract('line', regex_pattern, 5))
    .withColumn('cs_uri_query', f.regexp_extract('line', regex_pattern, 6))
    .withColumn('s_port', f.regexp_extract('line', regex_pattern, 7))
    .withColumn('cs_username', f.regexp_extract('line', regex_pattern, 8))
    .withColumn('csUser_Agent', f.regexp_extract('line', regex_pattern, 9))
    .withColumn('csReferer', f.regexp_extract('line', regex_pattern, 10))
    .withColumn('sc_status', f.regexp_extract('line', regex_pattern, 11))
    .withColumn('sc_substatus', f.regexp_extract('line', regex_pattern, 12))
    .withColumn('sc_win32_status', f.regexp_extract('line', regex_pattern, 13))
    .withColumn('time_taken', f.regexp_extract('line', regex_pattern, 14))
    .withColumn('X_Forwarded_For', f.regexp_extract('line', regex_pattern, 15))
    .drop('line')
  )

display(parsed_log)

date,time,s_ip,cs_method,cs_uri_stem,cs_uri_query,s_port,cs_username,csUser_Agent,csReferer,sc_status,sc_substatus,sc_win32_status,time_taken,X_Forwarded_For
2018-10-24,23:59:30,129.119.66.37,GET,/-/media/Site/News/NewsSources/EarthquakeStudy/earthquake-causes-17may2016.jpg,h=338&la=en&w=350&hash=4865678DB6FEA8BB4AECD9F687D0DE225EAAC3BE,443,-,Mozilla/5.0+(compatible;+SeznamBot/3.2;++http://napoveda.seznam.cz/en/seznambot-intro/),-,200,0,0,691,77.75.78.161
2018-10-24,23:59:31,129.119.66.37,GET,/,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",-,200,0,0,187,98.156.209.143
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Logo-SMU-WCSH-Stacked-RW-2x.png,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,31,98.156.209.143
2018-10-24,23:59:31,129.119.66.37,GET,/js/Main/jquery.min.js,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,0,98.156.209.143
2018-10-24,23:59:31,129.119.66.37,GET,/Admission/Apply,-,443,-,"Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_13_6)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Safari/605.1.15",-,200,0,0,109,76.187.104.226
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Navigation/Research.jpg,h=200&w=300&la=en&hash=1DCDB6862BE00A058A97C5B6EB4762BD52636288,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Navigation/AboutSMU.jpg,h=200&w=300&la=en&hash=C677A5E535FC3B573B438F33044F191714A4163B,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/WCSH-Logo-SMU-Stacked-RW-2x.png,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143
2018-10-24,23:59:31,129.119.66.37,GET,/js/Main/pushy.min.js,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,0,98.156.209.143
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Navigation/Admission.jpg,h=200&w=300&la=en&hash=6BD3FA4B6FE30EA5978EBAA6B3A15528724F0B35,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143
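
To check what the pattern captures without running Spark, the sketch below applies the same 15-group pattern to one sample line from the log using Python's built-in `re` module. The helper name `parse_line` and the `FIELD_NAMES` list are illustrative and not part of the assignment code. One difference to keep in mind: `regexp_extract` returns an empty string when the pattern does not match, while this sketch returns `None`.

```python
import re

# Same pattern as the cell above: 15 space-separated, non-whitespace groups.
REGEX_PATTERN = '^' + ' '.join([r'(\S*)'] * 15) + '$'

# Column names used in the notebook ('-' replaced with '_').
FIELD_NAMES = [
    'date', 'time', 's_ip', 'cs_method', 'cs_uri_stem', 'cs_uri_query',
    's_port', 'cs_username', 'csUser_Agent', 'csReferer', 'sc_status',
    'sc_substatus', 'sc_win32_status', 'time_taken', 'X_Forwarded_For',
]


def parse_line(line):
    """Return a dict of field name -> string value, or None if the line
    is a '#' header line or does not match the pattern."""
    if line.startswith('#'):
        return None
    match = re.match(REGEX_PATTERN, line)
    if match is None:
        return None
    return dict(zip(FIELD_NAMES, match.groups()))


# One record copied from the raw log output shown earlier.
sample = (
    '2018-10-24 23:59:31 129.119.66.37 GET /Admission/Apply - 443 - '
    'Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_13_6)+AppleWebKit/605.1.15+'
    '(KHTML,+like+Gecko)+Version/12.0+Safari/605.1.15 - 200 0 0 109 '
    '76.187.104.226'
)

record = parse_line(sample)
print(record['cs_uri_stem'], record['sc_status'], record['X_Forwarded_For'])
```

Every value comes back as a string at this point, which is why the later steps convert the date, time, and numeric fields explicitly.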


With your data properly parsed, you now need to convert your date and time fields into a single datetime field.  Using the *withColumn()* method, construct a new field named *datetime*.  That field should concatenate the date and time string values extracted in the last step and then convert the concatenated value to a *TimestampType()* using the *to_timestamp* function:

**NOTE** You can concatenate two strings using the *concat* pyspark function. You might find it easiest to create a column with the concatenated values and then create another column that converts that column's values to a timestamp instead of trying to tackle this in one step.  If you would like to insert a space between the date and time values, use the pyspark *lit* function to create a literal space value.

In [0]:
dt_log = (
  parsed_log
    .withColumn('datetime', f.concat('date', f.lit(' '), 'time'))
    .withColumn('datetime', f.to_timestamp('datetime'))
  )

display(dt_log)


date,time,s_ip,cs_method,cs_uri_stem,cs_uri_query,s_port,cs_username,csUser_Agent,csReferer,sc_status,sc_substatus,sc_win32_status,time_taken,X_Forwarded_For,datetime
2018-10-24,23:59:30,129.119.66.37,GET,/-/media/Site/News/NewsSources/EarthquakeStudy/earthquake-causes-17may2016.jpg,h=338&la=en&w=350&hash=4865678DB6FEA8BB4AECD9F687D0DE225EAAC3BE,443,-,Mozilla/5.0+(compatible;+SeznamBot/3.2;++http://napoveda.seznam.cz/en/seznambot-intro/),-,200,0,0,691,77.75.78.161,2018-10-24T23:59:30.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",-,200,0,0,187,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Logo-SMU-WCSH-Stacked-RW-2x.png,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,31,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/js/Main/jquery.min.js,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,0,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/Admission/Apply,-,443,-,"Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_13_6)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Safari/605.1.15",-,200,0,0,109,76.187.104.226,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Navigation/Research.jpg,h=200&w=300&la=en&hash=1DCDB6862BE00A058A97C5B6EB4762BD52636288,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Navigation/AboutSMU.jpg,h=200&w=300&la=en&hash=C677A5E535FC3B573B438F33044F191714A4163B,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/WCSH-Logo-SMU-Stacked-RW-2x.png,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/js/Main/pushy.min.js,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,0,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Navigation/Admission.jpg,h=200&w=300&la=en&hash=6BD3FA4B6FE30EA5978EBAA6B3A15528724F0B35,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143,2018-10-24T23:59:31.000+0000
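
The concatenate-then-convert idea can also be tried on a single record in plain Python. This is only a sketch of the same logic with the standard `datetime` module, not the PySpark implementation; the helper name `combine_date_time` is made up for illustration, and the format string assumes the `YYYY-MM-DD` and `HH:MM:SS` layout seen in the log output above.

```python
from datetime import datetime


def combine_date_time(date_str, time_str):
    """Join the date and time strings with a single space, then parse
    the result into a datetime object."""
    combined = date_str + ' ' + time_str  # e.g. '2018-10-24 23:59:30'
    return datetime.strptime(combined, '%Y-%m-%d %H:%M:%S')


ts = combine_date_time('2018-10-24', '23:59:30')
print(ts.isoformat())  # 2018-10-24T23:59:30
```

The literal space plays the same role as `f.lit(' ')` in the cell above: without it, the combined string would not be in a form the parser recognizes.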


We are starting the process of cleaning up our data as parsed from the lines of the weblog files.  Let's continue that work by casting the following fields to appropriate datatypes:

* sc-status - int
* sc-substatus - int
* sc-win32-status - int
* time-taken - int
* s-port - int

To perform this work, you can use the *cast* method of the column returned by the pyspark *col* function.  The syntax for this can be a bit awkward, but you will want to perform the cast using a pattern such as this:

`f.col('col_name').cast(PySparkType())`


A common trick for doing this kind of work is to use the *withColumn* method to cast a given field to the appropriate type and to give the new field the same name as the original.  When you do this, you are in effect overwriting the field with its correctly typed version:

In [0]:
log = (
  dt_log
    .withColumn('sc_status', f.col('sc_status').cast(IntegerType()))
    .withColumn('sc_substatus', f.col('sc_substatus').cast(IntegerType()))
    .withColumn('sc_win32_status', f.col('sc_win32_status').cast(IntegerType()))
    .withColumn('time_taken', f.col('time_taken').cast(IntegerType()))
    .withColumn('s_port', f.col('s_port').cast(IntegerType()))
  )

display(log)

date,time,s_ip,cs_method,cs_uri_stem,cs_uri_query,s_port,cs_username,csUser_Agent,csReferer,sc_status,sc_substatus,sc_win32_status,time_taken,X_Forwarded_For,datetime
2018-10-24,23:59:30,129.119.66.37,GET,/-/media/Site/News/NewsSources/EarthquakeStudy/earthquake-causes-17may2016.jpg,h=338&la=en&w=350&hash=4865678DB6FEA8BB4AECD9F687D0DE225EAAC3BE,443,-,Mozilla/5.0+(compatible;+SeznamBot/3.2;++http://napoveda.seznam.cz/en/seznambot-intro/),-,200,0,0,691,77.75.78.161,2018-10-24T23:59:30.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",-,200,0,0,187,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Logo-SMU-WCSH-Stacked-RW-2x.png,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,31,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/js/Main/jquery.min.js,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,0,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/Admission/Apply,-,443,-,"Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_13_6)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Safari/605.1.15",-,200,0,0,109,76.187.104.226,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Navigation/Research.jpg,h=200&w=300&la=en&hash=1DCDB6862BE00A058A97C5B6EB4762BD52636288,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Navigation/AboutSMU.jpg,h=200&w=300&la=en&hash=C677A5E535FC3B573B438F33044F191714A4163B,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/WCSH-Logo-SMU-Stacked-RW-2x.png,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/js/Main/pushy.min.js,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,0,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/-/media/Site/Main/Navigation/Admission.jpg,h=200&w=300&la=en&hash=6BD3FA4B6FE30EA5978EBAA6B3A15528724F0B35,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,15,98.156.209.143,2018-10-24T23:59:31.000+0000
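
It is worth thinking about what a cast does with values that are not numbers, such as the `-` placeholder these logs use for empty fields. With Spark's ANSI mode turned off (an assumption about the cluster configuration), casting a non-numeric string to an integer yields null rather than an error. The plain-Python sketch below approximates that behavior for simple cases; `cast_to_int` is a hypothetical helper, not a PySpark function.

```python
def cast_to_int(value):
    """Return value as an int, or None when it cannot be converted.

    This roughly mimics a lenient string-to-int cast for simple inputs;
    it is not an exact model of Spark's cast rules.
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


print(cast_to_int('443'))  # 443
print(cast_to_int('-'))    # None
```

After casting, a quick count of nulls in each converted column is a reasonable way to confirm that no unexpected values were silently dropped.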


Now that we have a DataFrame, register it as a temporary view named *logs*.

In [0]:
log.createOrReplaceTempView('logs')

spark.sql('show tables').show()

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|        |     logs|       true|
|        |    pages|       true|
+--------+---------+-----------+



In our log data set, the *cs_uri_stem* field represents the asset requested from the web server.  This will include a mix of web pages, images, and scripts. Write a query that returns all fields from *logs* where cs_uri_stem **DOES NOT** refer to one of the following file types:
* png files
* jpeg & jpg files
* gif files
* js files
* css files
* ico files
* pdf files
* ashx files

Save this as a table named *pages*.

In [0]:
query = '''
  SELECT *
  FROM logs
  WHERE NOT (
    cs_uri_stem LIKE "%.png" OR
    cs_uri_stem LIKE "%.jpg" OR
    cs_uri_stem LIKE "%.jpeg" OR
    cs_uri_stem LIKE "%.gif" OR
    cs_uri_stem LIKE "%.js" OR
    cs_uri_stem LIKE "%.css" OR
    cs_uri_stem LIKE "%.ico" OR
    cs_uri_stem LIKE "%.pdf" OR
    cs_uri_stem LIKE "%.ashx" OR
    cs_uri_stem LIKE "%.JPG" OR
    cs_uri_stem LIKE "%.PNG"
  )
'''

# execute the query, capturing the results in a temporary view named pages
spark.sql(query).createOrReplaceTempView('pages')

display(spark.table('pages'))

date,time,s_ip,cs_method,cs_uri_stem,cs_uri_query,s_port,cs_username,csUser_Agent,csReferer,sc_status,sc_substatus,sc_win32_status,time_taken,X_Forwarded_For,datetime
2018-10-24,23:59:31,129.119.66.37,GET,/,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",-,200,0,0,187,98.156.209.143,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:31,129.119.66.37,GET,/Admission/Apply,-,443,-,"Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_13_6)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Safari/605.1.15",-,200,0,0,109,76.187.104.226,2018-10-24T23:59:31.000+0000
2018-10-24,23:59:33,129.119.66.37,GET,/Error/PageNotFound,sc_lang=en&rawUrl=%2fCareer-Services%2fEmployment-Statistics&em_dt=7mgpoIhQqV,443,-,-,-,200,0,0,140,129.119.66.37,2018-10-24T23:59:33.000+0000
2018-10-24,23:59:33,129.119.66.37,GET,/Career-Services/Employment-Statistics,-,443,-,weborama-fetcher+(+http://www.weborama.com),-,404,0,0,234,52.73.43.173,2018-10-24T23:59:33.000+0000
2018-10-24,23:59:33,129.119.66.37,GET,/,-,443,-,"Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/70.0.3538.67+Safari/537.36",https://www.google.com/,200,0,0,62,107.205.49.166,2018-10-24T23:59:33.000+0000
2018-10-24,23:59:34,129.119.66.37,GET,/OIT/Services/Canvas,-,443,-,"Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_10_0)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/68.0.3440.106+Safari/537.36",https://www.google.com/,200,0,0,171,10.8.159.82,2018-10-24T23:59:34.000+0000
2018-10-24,23:59:34,129.119.66.37,GET,/DevelopmentExternalAffairs/PublicAffairs,-,80,-,Mozilla/5.0+(compatible;+SemrushBot/2~bl;++http://www.semrush.com/bot.html),-,301,0,0,109,46.229.168.140,2018-10-24T23:59:34.000+0000
2018-10-24,23:59:36,129.119.66.37,GET,/Admission/Apply,-,443,-,"Mozilla/5.0+(iPhone;+CPU+iPhone+OS+12_0+like+Mac+OS+X)+AppleWebKit/605.1.15+(KHTML,+like+Gecko)+Version/12.0+Mobile/15E148+Safari/604.1",https://www.smu.edu/,200,0,0,93,98.156.209.143,2018-10-24T23:59:36.000+0000
2018-10-24,23:59:36,129.119.66.37,GET,/OIT/Services/webmail,-,443,-,"Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_13_6)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/70.0.3538.67+Safari/537.36",https://www.google.com/,200,0,0,109,10.8.100.148,2018-10-24T23:59:36.000+0000
2018-10-24,23:59:37,129.119.66.37,GET,/OIT/Services/PasswordReset,-,443,-,"Mozilla/5.0+(Windows+NT+6.1;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/66.0.3359.170+Safari/537.36+OPR/53.0.2907.99",https://www.smu.edu/OIT/Services/PasswordReset,200,0,0,78,193.201.224.246,2018-10-24T23:59:37.000+0000
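
The query above lists `.JPG` and `.PNG` explicitly, but other capitalizations such as `.Jpg` or `.GIF` would still pass the filter. One way to cover them all is to lowercase the value before comparing; in SQL that would look like `lower(cs_uri_stem) LIKE '%.png'`. The sketch below shows the same idea in plain Python. The function name `is_page` and the test paths are illustrative only.

```python
# Extensions from the assignment's exclusion list, in lowercase.
EXCLUDED_EXTENSIONS = (
    '.png', '.jpg', '.jpeg', '.gif', '.js',
    '.css', '.ico', '.pdf', '.ashx',
)


def is_page(cs_uri_stem):
    """Return True when the requested path does not end with one of the
    excluded extensions, ignoring upper/lower case."""
    return not cs_uri_stem.lower().endswith(EXCLUDED_EXTENSIONS)


print(is_page('/Admission/Apply'))        # True
print(is_page('/js/Main/jquery.min.js'))  # False
print(is_page('/images/banner.JPG'))      # False
```

Matching on the end of the string (rather than anywhere in it) keeps paths such as `/js/overview` from being excluded by accident.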


Write a query that answers the question: what are the most frequently visited pages accessed from the default smu.edu web page?  Keep in mind that requests coming from the default smu.edu web page have a *csReferer* value of *https://www.smu.edu/*. Use the *show* method to display the first 20 results to the screen, sorted from most frequent to least frequent:

In [0]:
query = '''
  SELECT
    cs_uri_stem,
    COUNT(*) as frequency
  FROM pages
  WHERE 
    csReferer = "https://www.smu.edu/"
  GROUP BY cs_uri_stem
  ORDER BY COUNT(*) DESC
'''

# show() prints the rows itself and returns None, so it is not wrapped in display()
spark.sql(query).show(20, truncate=False)

+-----------------------------------------------------+---------+
|cs_uri_stem                                          |frequency|
+-----------------------------------------------------+---------+
|/Admission/Academics/Majors/MajorsGrid               |150      |
|/admission                                           |115      |
|/Admission/Academics/Majors                          |107      |
|/Admission/CampusLife                                |69       |
|/AboutSMU                                            |69       |
|/                                                    |67       |
|/BusinessFinance/HR/WorkingatSMU                     |57       |
|/Admission/Apply                                     |43       |
|/AboutSMU/Administration                             |36       |
|/cox                                                 |32       |
|/Graduate                                            |31       |
|/academics                                           |28       |
|/dedman  
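
The filter, group, count, and sort steps in the query above can be mirrored on a handful of rows with `collections.Counter`. The rows below are made-up examples for illustration only, not values taken from the log file.

```python
from collections import Counter

HOME_PAGE = 'https://www.smu.edu/'

# Hypothetical (cs_uri_stem, csReferer) pairs, for illustration only.
rows = [
    ('/admission', HOME_PAGE),
    ('/admission', HOME_PAGE),
    ('/cox', HOME_PAGE),
    ('/cox', 'https://www.google.com/'),
    ('/AboutSMU', '-'),
]

# WHERE csReferer = home page, then GROUP BY cs_uri_stem with COUNT(*).
counts = Counter(stem for stem, referer in rows if referer == HOME_PAGE)

# ORDER BY COUNT(*) DESC, keeping at most the first 20 results.
top = counts.most_common(20)
print(top)  # [('/admission', 2), ('/cox', 1)]
```

Only rows whose referer is the home page contribute to the counts, which is why `/cox` is counted once even though it appears twice.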

With each cell in this notebook executed and results (where applicable) displayed, save this notebook as an HTML file and upload it to Canvas to complete the assignment.