## Outbrain Click Prediction

#### This notebook gives an overview and summary of all the data provided for the Kaggle competition https://www.kaggle.com/c/outbrain-click-prediction. It goes through each of the files and presents an overall summary of the data. The analysis is done using PySpark.

In [1]:
# Importing the libraries
import os
from matplotlib import pyplot as plt
from pyspark.sql.types import *
import pyspark.sql.functions as F

### 1) File Locations

In [2]:
# Location of all the files.
# These files are located in a Google Cloud Storage bucket
clTe = "gs://jupyterbucket/outbrainData/clicks_test.csv"
clTr = "gs://jupyterbucket/outbrainData/clicks_train.csv"
docCat = "gs://jupyterbucket/outbrainData/documents_categories.csv"
docEnt = "gs://jupyterbucket/outbrainData/documents_entities.csv"
docMet = "gs://jupyterbucket/outbrainData/documents_meta.csv"
docTop = "gs://jupyterbucket/outbrainData/documents_topics.csv"
evt = "gs://jupyterbucket/outbrainData/events.csv"
pageViewsSam = "gs://jupyterbucket/outbrainData/page_views_sample.csv"
pageViews = "gs://jupyterbucket/outbrainData/page_views.csv"
proCon = "gs://jupyterbucket/outbrainData/promoted_content.csv"
samSub = "gs://jupyterbucket/outbrainData/sample_submission.csv"

# Creating a list to iterate over in case we need to perform some operation for all the csv files
allFiles = [clTe, clTr, docCat, docEnt, docMet, docTop, evt, pageViewsSam, proCon, samSub, pageViews]

### 2) Function to get an overview of each data file
* Obtain a basic understanding of the file in terms of number of columns
* Generate Schema for each of the files that will be used later on.

In [3]:
def csvOverview(fpath):
    '''
    Prints a basic overview of a csv file: its name, row count and first 5 rows.
    Assumes `spark` is the SparkSession provided by the notebook environment.
    @params
    fpath: Path to the csv file that needs to be analyzed
    @returns
    None
    '''
    # Print the file name (last component of the path)
    print(fpath.split("/")[-1])
    # Read the data as a Spark dataframe, treating \N as null
    fpathDF = spark.read.options(header='true', inferSchema='true', nullValue='\\N') \
                .csv(fpath)
    print("Dataframe has  {}  rows.".format(fpathDF.count()))
    fpathDF.show(n=5)

    return None
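
For quick checks on a small local extract of one of the files, a Spark session may not be needed. The sketch below mirrors what `csvOverview` reports using only the standard library. `csv_overview_local` and its `max_rows` and `null_value` parameters are hypothetical names, not part of the notebook, and the code is written for Python 3.

```python
import csv
import io


def csv_overview_local(lines, max_rows=5, null_value="\\N"):
    """Return (header, first max_rows rows, total row count) for CSV text lines.

    Fields equal to null_value become None, mirroring the nullValue
    option passed to spark.read in csvOverview.
    """
    reader = csv.reader(lines)
    header = next(reader, [])
    preview = []
    total = 0
    for row in reader:
        total += 1
        if len(preview) < max_rows:
            preview.append([None if field == null_value else field for field in row])
    return header, preview, total


# In-memory sample with the same columns as clicks_train.csv
sample = io.StringIO("display_id,ad_id,clicked\n1,42337,0\n1,139684,0\n1,144739,1\n")
header, preview, total = csv_overview_local(sample)
print(header, total)
for row in preview:
    print(row)
```

Unlike the Spark version, this reads the input sequentially on one machine, so it is only suitable for small samples.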

### 3) Displaying all the files and generating Schema for each

#### a) page_views.csv and page_views_sample.csv

In [4]:
csvOverview(pageViewsSam)

page_views_sample.csv
Dataframe has  9999999  rows.
+--------------+-----------+---------+--------+------------+--------------+
|          uuid|document_id|timestamp|platform|geo_location|traffic_source|
+--------------+-----------+---------+--------+------------+--------------+
|1fd5f051fba643|        120| 31905835|       1|          RS|             2|
|8557aa9004be3b|        120| 32053104|       1|       VN>44|             2|
|c351b277a358f0|        120| 54013023|       1|       KR>12|             1|
|8205775c5387f9|        120| 44196592|       1|       IN>16|             2|
|9cb0ccd8458371|        120| 65817371|       1|   US>CA>807|             2|
+--------------+-----------+---------+--------+------------+--------------+
only showing top 5 rows



In [5]:
# Defining the Schema for the above files
pageViewsSamSchema = StructType(
                    [StructField("uuid", StringType(), True),
                     StructField("document_id", StringType(), True),
                     StructField("timestamp", IntegerType(), True),
                     StructField("platform", StringType(), True),
                     StructField("geo_location", StringType(), True),
                     StructField("traffic_source", StringType(), True)])
pageViewsSchema = StructType(
                    [StructField("uuid", StringType(), True),
                     StructField("document_id", StringType(), True),
                     StructField("timestamp", IntegerType(), True),
                     StructField("platform", StringType(), True),
                     StructField("geo_location", StringType(), True),
                     StructField("traffic_source", StringType(), True)])

#### The columns in the file are as follows
* uuid: Unique user ID; each uuid represents one user
* document_id: Represents a unique web page that the user visited
* timestamp: Time in ms relative to the first timestamp in the dataset; adding 1465876799998 recovers the time in ms since Jan 1, 1970
* platform: Represents whether the document was viewed on desktop(1), mobile(2) or tablet(3)
* geo_location: Location from which the page was viewed, given as a country code optionally followed by finer-grained codes separated by `>` (e.g. US>CA>807)
* traffic_source: Specifies how the user reached this document: Internal(1), Search(2) or Social(3)
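
The encoded columns can be decoded with a few small helpers. This is a minimal sketch written for Python 3, under two assumptions taken from the competition's data description rather than from this notebook: the timestamp is an offset in ms to which 1465876799998 must be added to get epoch milliseconds, and the `geo_location` levels are country, then region, then a finer area code. The helper names `to_utc_datetime` and `split_geo` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed offset: adding it to a dataset timestamp gives epoch milliseconds.
BASE_EPOCH_MS = 1465876799998
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Code tables taken from the column descriptions above.
PLATFORMS = {"1": "desktop", "2": "mobile", "3": "tablet"}
TRAFFIC_SOURCES = {"1": "internal", "2": "search", "3": "social"}


def to_utc_datetime(relative_ms):
    """Convert a dataset-relative timestamp (ms) to an absolute UTC datetime."""
    return UNIX_EPOCH + timedelta(milliseconds=relative_ms + BASE_EPOCH_MS)


def split_geo(geo_location):
    """Split 'country>region>area' into a 3-tuple, padding missing levels with None."""
    parts = geo_location.split(">") if geo_location else []
    parts = parts[:3] + [None] * (3 - len(parts))
    return tuple(parts)


# Last row of the page_views_sample.csv preview
row = ("9cb0ccd8458371", "120", 65817371, "1", "US>CA>807", "2")
print(to_utc_datetime(row[2]).isoformat())
print(PLATFORMS[row[3]], TRAFFIC_SOURCES[row[5]], split_geo(row[4]))
```

On the full dataset the same logic would more naturally be written with Spark column expressions or a UDF; the plain-Python form is only meant to make the encoding explicit.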

#### b) clicks_train.csv and clicks_test.csv

In [6]:
csvOverview(clTr)
csvOverview(clTe)

clicks_train.csv
Dataframe has  87141731  rows.
+----------+------+-------+
|display_id| ad_id|clicked|
+----------+------+-------+
|         1| 42337|      0|
|         1|139684|      0|
|         1|144739|      1|
|         1|156824|      0|
|         1|279295|      0|
+----------+------+-------+
only showing top 5 rows

clicks_test.csv
Dataframe has  32225162  rows.
+----------+------+
|display_id| ad_id|
+----------+------+
|  16874594| 66758|
|  16874594|150083|
|  16874594|162754|
|  16874594|170392|
|  16874594|172888|
+----------+------+
only showing top 5 rows



In [7]:
clTeSchema = StructType(
                    [StructField("display_id", StringType(), True),
                     StructField("ad_id", StringType(), True)])
clTrSchema = StructType(
                    [StructField("display_id", StringType(), True),
                     StructField("ad_id", StringType(), True),
                     StructField("clicked", IntegerType(), True)])

#### The columns mentioned above are
* display_id: Represents a set of ads displayed together on a page at a given time; each display_id corresponds to a specific user, a specific page and so on. According to the competition's data description, at least one ad in each set was clicked
* ad_id: ID of a given ad
* clicked: Whether the ad was clicked (1) or not (0)

In [12]:
# Basic analysis of clicks_test.csv: number of ads shown per display_id
clTe_DF = spark.read.schema(clTeSchema).options(header='true', inferSchema='false', nullValue='\\N').csv(clTe)
clTe_DF_Gby_display = clTe_DF.groupBy('display_id').count().sort(F.col("count").desc())
clTe_DF_Gby_display.show(n=5)
print("Number of distinct display_ids: ")
clTe_DF_Gby_display.count()

+----------+-----+
|display_id|count|
+----------+-----+
|  18355686|   12|
|  19745750|   12|
|  18555755|   12|
|  18701552|   12|
|  19389874|   12|
+----------+-----+
only showing top 5 rows

Number of distinct display_ids: 


6245533

#### Now we look at the data in clicks_train.csv

In [16]:
# Basic analysis of clicks_train.csv: average click rate per ad_id
clTr_DF = spark.read.schema(clTrSchema).options(header='true', inferSchema='false', nullValue='\\N').csv(clTr)
clTr_DF_Gby_ad = clTr_DF.groupBy('ad_id').avg('clicked').sort(F.col("avg(clicked)").desc())
clTr_DF_Gby_ad.show(n=5)

+------+------------+
| ad_id|avg(clicked)|
+------+------------+
|302973|         1.0|
|447127|         1.0|
|325026|         1.0|
|310550|         1.0|
|275679|         1.0|
+------+------------+
only showing top 5 rows
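
An average of exactly 1.0 for every top ad may simply reflect ads that were shown only a handful of times; the output above does not include impression counts, so this is an assumption. The sketch below shows the idea of reporting impressions alongside the click rate and filtering out rarely shown ads. It is plain Python 3 on made-up toy rows, and `ad_click_rates` and `min_impressions` are hypothetical names.

```python
from collections import defaultdict


def ad_click_rates(rows, min_impressions=1):
    """Return {ad_id: (impressions, clicks, ctr)} for ads shown at least min_impressions times.

    rows is an iterable of (display_id, ad_id, clicked) tuples with clicked in {0, 1}.
    """
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for _display_id, ad_id, clicked in rows:
        impressions[ad_id] += 1
        clicks[ad_id] += clicked
    return {
        ad_id: (shown, clicks[ad_id], clicks[ad_id] / shown)
        for ad_id, shown in impressions.items()
        if shown >= min_impressions
    }


# Toy rows (made up): ad "A" is shown once and clicked,
# ad "B" is shown four times and clicked twice.
rows = [("1", "A", 1), ("1", "B", 0), ("2", "B", 1), ("3", "B", 0), ("4", "B", 1)]
print(ad_click_rates(rows))
print(ad_click_rates(rows, min_impressions=2))
```

In Spark, the same idea could be expressed by aggregating both `F.count` and `F.avg` in a single `groupBy('ad_id').agg(...)` and filtering on the count before sorting.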



#### c) events.csv

In [30]:
csvOverview(evt)

events.csv
Dataframe has  23120126  rows.
+----------+--------------+-----------+---------+--------+------------+
|display_id|          uuid|document_id|timestamp|platform|geo_location|
+----------+--------------+-----------+---------+--------+------------+
|         1|cb8c55702adb93|     379743|       61|       3|   US>SC>519|
|         2|79a85fa78311b9|    1794259|       81|       2|   US>CA>807|
|         3|822932ce3d8757|    1179111|      182|       2|   US>MI>505|
|         4|85281d0a49f7ac|    1777797|      234|       2|   US>WV>564|
|         5|8d0daef4bf5b56|     252458|      338|       2|       SG>00|
+----------+--------------+-----------+---------+--------+------------+
only showing top 5 rows



In [45]:
evtSchema = StructType(
                    [StructField("display_id", StringType(), True),
                     StructField("uuid", StringType(), True),
                     StructField("document_id", StringType(), True),
                     StructField("timestamp", IntegerType(), True),
                     StructField("platform", StringType(), True),
                     StructField("geo_location", StringType(), True)])

#### d) promoted_content.csv

In [32]:
csvOverview(proCon)

promoted_content.csv
Dataframe has  559583  rows.
+-----+-----------+-----------+-------------+
|ad_id|document_id|campaign_id|advertiser_id|
+-----+-----------+-----------+-------------+
|    1|       6614|          1|            7|
|    2|     471467|          2|            7|
|    3|       7692|          3|            7|
|    4|     471471|          2|            7|
|    5|     471472|          2|            7|
+-----+-----------+-----------+-------------+
only showing top 5 rows



In [46]:
proConSchema = StructType(
                    [StructField("ad_id", StringType(), True),
                     StructField("document_id", StringType(), True),
                     StructField("campaign_id", StringType(), True),
                     StructField("advertiser_id", StringType(), True)])

#### Additional column information
* campaign_id: Each ad belongs to a particular campaign and this id refers to a specific campaign
* advertiser_id: Refers to a specific advertiser

#### e) documents_meta.csv

In [34]:
csvOverview(docMet)

documents_meta.csv
Dataframe has  2999334  rows.
+-----------+---------+------------+--------------------+
|document_id|source_id|publisher_id|        publish_time|
+-----------+---------+------------+--------------------+
|    1595802|        1|         603|2016-06-05 00:00:...|
|    1524246|        1|         603|2016-05-26 11:00:...|
|    1617787|        1|         603|2016-05-27 00:00:...|
|    1615583|        1|         603|2016-06-07 00:00:...|
|    1615460|        1|         603|2016-06-20 00:00:...|
+-----------+---------+------------+--------------------+
only showing top 5 rows



In [47]:
docMetSchema = StructType(
                    [StructField("document_id", StringType(), True),
                     StructField("source_id", StringType(), True),
                     StructField("publisher_id", StringType(), True),
                     StructField("publish_time", StringType(), True)])
# publish_time could instead be read as TimestampType(), since the values include a time of day

#### Additional column information
* source_id: ID of the web source on which the document is available (something like edition.cnn.com)
* publisher_id: ID of the publisher (something like cnn.com)
* publish_time: The time at which the document was published
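
Since `publish_time` is kept as a string in the schema, it has to be parsed before any date arithmetic. The sketch below is a hedged example in Python 3: the format string is an assumption, because the preview above truncates the values, and `parse_publish_time` is a hypothetical helper. Missing or unparseable values are returned as `None` rather than raising.

```python
from datetime import datetime

# Assumed layout of the raw values; the preview truncates them after the minutes.
PUBLISH_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_publish_time(raw, null_value="\\N"):
    """Parse a publish_time string, returning None for missing or unparseable values."""
    if raw is None or raw == "" or raw == null_value:
        return None
    try:
        return datetime.strptime(raw, PUBLISH_TIME_FORMAT)
    except ValueError:
        return None


print(parse_publish_time("2016-05-26 11:00:00"))
print(parse_publish_time("\\N"))
```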

#### f) documents_categories.csv

In [36]:
csvOverview(docCat)

documents_categories.csv
Dataframe has  5481475  rows.
+-----------+-----------+----------------+
|document_id|category_id|confidence_level|
+-----------+-----------+----------------+
|    1595802|       1611|            0.92|
|    1595802|       1610|            0.07|
|    1524246|       1807|            0.92|
|    1524246|       1608|            0.07|
|    1617787|       1807|            0.92|
+-----------+-----------+----------------+
only showing top 5 rows



In [48]:
docCatSchema = StructType(
                    [StructField("document_id", StringType(), True),
                     StructField("category_id", StringType(), True),
                     StructField("confidence_level", DoubleType(), True)])

#### Information about the columns
* category_id: ID of a category assigned to the document; a document can have several categories, each with its own confidence
* confidence_level: Represents the confidence (between 0 and 1) that the given document belongs to this category
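
Because a document can appear with several categories, one simple way to get a single label per document is to keep the category with the highest confidence. This is a minimal sketch in plain Python 3, not something the notebook itself does; `top_category_per_document` is a hypothetical name.

```python
def top_category_per_document(rows):
    """Map each document_id to its (category_id, confidence_level) with the highest confidence."""
    best = {}
    for document_id, category_id, confidence in rows:
        if document_id not in best or confidence > best[document_id][1]:
            best[document_id] = (category_id, confidence)
    return best


# Rows copied from the documents_categories.csv preview above
rows = [
    ("1595802", "1611", 0.92),
    ("1595802", "1610", 0.07),
    ("1524246", "1807", 0.92),
    ("1524246", "1608", 0.07),
    ("1617787", "1807", 0.92),
]
print(top_category_per_document(rows))
```

The same reduction would apply to any file that pairs a `document_id` with a label and a `confidence_level`.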

#### g) documents_entities.csv

In [38]:
csvOverview(docEnt)

documents_entities.csv
Dataframe has  5537552  rows.
+-----------+--------------------+-----------------+
|document_id|           entity_id| confidence_level|
+-----------+--------------------+-----------------+
|    1524246|f9eec25663db4cd83...|0.672865314504701|
|    1524246|55ebcfbdaff1d6f60...|0.399113728441297|
|    1524246|839907a972930b17b...|0.392095749652966|
|    1524246|04d8f9a1ad48f126d...|0.213996376305138|
|    1617787|612a1d17685a498af...|0.386192829940441|
+-----------+--------------------+-----------------+
only showing top 5 rows



In [49]:
docEntSchema = StructType(
                    [StructField("document_id", StringType(), True),
                     StructField("entity_id", StringType(), True),
                     StructField("confidence_level", DoubleType(), True)])

In [54]:
# Number of distinct documents that have at least one entity
fpathEntDF = spark.read.schema(docEntSchema).options(header='true', inferSchema='false', nullValue='\\N').csv(docEnt)
fpathEntDF.select('document_id').distinct().count()

1791420

#### Information about the columns
* entity_id: An entity (person, organization or location) featured in a document
* confidence_level: Represents the confidence that the given entity is present in the document

#### h) documents_topics.csv

In [40]:
csvOverview(docTop)

documents_topics.csv
Dataframe has  11325960  rows.
+-----------+--------+------------------+
|document_id|topic_id|  confidence_level|
+-----------+--------+------------------+
|    1595802|     140|0.0731131601068925|
|    1595802|      16|0.0594164867373976|
|    1595802|     143|0.0454207537554526|
|    1595802|     170|0.0388674285182961|
|    1524246|     113| 0.196450402209685|
+-----------+--------+------------------+
only showing top 5 rows



In [50]:
docTopSchema = StructType(
                    [StructField("document_id", StringType(), True),
                     StructField("topic_id", StringType(), True),
                     StructField("confidence_level", DoubleType(), True)])

#### i) sample_submission.csv: this is what a submission is supposed to look like

In [51]:
csvOverview(samSub)

sample_submission.csv
Dataframe has  6245533  rows.
+----------+--------------------+
|display_id|               ad_id|
+----------+--------------------+
|  16874594|66758 150083 1627...|
|  16874595|   8846 30609 143982|
|  16874596|11430 57197 13282...|
|  16874597|137858 143981 155...|
|  16874598|67292 145937 2500...|
+----------+--------------------+
only showing top 5 rows
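
The preview shows one row per `display_id`, with the `ad_id` column holding a space-separated list of ads. Assuming, as the competition description indicates, that the ads should be ordered from most to least likely to be clicked, a submission could be assembled as sketched below. The scores are made up, and `build_submission_lines` is a hypothetical helper written for Python 3.

```python
from collections import defaultdict


def build_submission_lines(predictions):
    """Turn (display_id, ad_id, score) triples into 'display_id,ad_id ad_id ...' lines.

    Ads within each display_id are ordered from highest to lowest score.
    """
    by_display = defaultdict(list)
    for display_id, ad_id, score in predictions:
        by_display[display_id].append((score, ad_id))
    lines = ["display_id,ad_id"]
    for display_id in sorted(by_display, key=int):
        ranked = sorted(by_display[display_id], key=lambda pair: pair[0], reverse=True)
        lines.append("{},{}".format(display_id, " ".join(ad_id for _score, ad_id in ranked)))
    return lines


# Ad ids taken from the preview above; the scores are invented for illustration.
predictions = [
    ("16874595", "8846", 0.1),
    ("16874595", "30609", 0.7),
    ("16874595", "143982", 0.2),
    ("16874594", "66758", 0.5),
]
for line in build_submission_lines(predictions):
    print(line)
```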

