### Dataset description
**Source**: Google Analytics hit (pageview) level data, by date and by article title.

**Field descriptions**:
- Date - year, month, day of hit.
- Page Title - title of the article that the user viewed.
- Age - age bracket of the user that visited the site.
- Gender - gender of the user that visited the site.
- Source/Medium - the referral source of the hit, i.e. the website that the user was on before visiting whowhatwear.com.
- Pageviews - the number of hits of a URL on our site that is tracked by the Analytics tracking code.
- Unique Pageviews - represents the number of sessions during which that page was viewed one or more times.
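To make the difference between Pageviews and Unique Pageviews concrete, here is a minimal sketch on made-up hit data. The `session_id` and `page_title` column names and the values are assumptions for illustration only; the dataset in this notebook is already aggregated and contains no session identifiers.

```python
import pandas as pd

# Hypothetical raw hits: one row per pageview (not part of the real dataset)
hits = pd.DataFrame({
    "session_id": ["s1", "s1", "s2", "s2", "s3"],
    "page_title": ["A", "A", "A", "B", "B"],
})

summary = hits.groupby("page_title").agg(
    pageviews=("session_id", "size"),            # every hit counts
    unique_pageviews=("session_id", "nunique"),  # each session counts once per page
)
print(summary)
```

Page "A" is hit three times but by only two sessions, so its Unique Pageviews value is lower than its Pageviews value.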

### Step 1: Load and Clean Source-Data

In [1]:
# Import the relevant modules
import pandas as pd
from pandas import DataFrame
import numpy as np
import json

In [2]:
# Import data
customer_df = pd.read_csv('intern_assessment_ds.csv', encoding='unicode_escape')
customer_df.head(2)

Unnamed: 0,Date,Page Title,Age,Gender,Source / Medium,Pageviews,Unique Pageviews,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
0,20190606,Chrissy Teigen Wore the Pleated-Jean Trend | W...,35-44,female,m.facebook.com / referral,35541,31670,,,,,,,,
1,20190606,Chrissy Teigen Wore the Pleated-Jean Trend | W...,25-34,female,m.facebook.com / referral,29730,26236,,,,,,,,


**First Issue**: Reading the original version of the file with the default encoding raised the error "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte". To work around this, the encoding 'unicode_escape' was specified. This codec decodes the bytes as Latin-1 and interprets backslash escape sequences, so bytes such as 0xa5 no longer raise an error; note that non-ASCII characters may be rendered incorrectly if the file was saved in a different encoding. Alternatively, the error can be resolved by opening the original CSV file in Excel/Numbers and re-exporting it as a CSV with UTF-8 encoding specified.
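Another way to handle the decoding error is to try a short list of candidate encodings in order. The sketch below is not what this notebook does; the helper name `read_csv_with_fallback`, the list of encodings, and the sample bytes are all assumptions for illustration.

```python
import io
import pandas as pd

def read_csv_with_fallback(raw_bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in turn and return (DataFrame, encoding_used)."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the data, try the next one
    raise ValueError("none of the candidate encodings could decode the data")

# Made-up sample: byte 0xa5 is not a valid UTF-8 start byte
sample = b"Page Title,Pageviews\n\xa5 sign article,10\n"
df, used = read_csv_with_fallback(sample)
print(used)
```

Which fallback encoding is actually correct depends on how the file was exported, so it is worth checking a few titles with non-ASCII characters after loading.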

**Second Issue**: After import, it becomes apparent that there are several unnecessary, unnamed columns that contain only NaN values. Since these columns are not referenced in the data description, it is assumed that they can be dropped without interfering with further analysis.

In [3]:
# Drop unnecessary columns
customer_df = customer_df.drop(columns=['Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10',
                                       'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14'])
customer_df.head(2)

Unnamed: 0,Date,Page Title,Age,Gender,Source / Medium,Pageviews,Unique Pageviews
0,20190606,Chrissy Teigen Wore the Pleated-Jean Trend | W...,35-44,female,m.facebook.com / referral,35541,31670
1,20190606,Chrissy Teigen Wore the Pleated-Jean Trend | W...,25-34,female,m.facebook.com / referral,29730,26236
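As an alternative to listing each unnamed column by hand, columns that contain only NaN values can be dropped in one call. This is a minimal sketch on a small made-up frame, not the notebook's own cell; it assumes the unwanted columns are entirely empty.

```python
import numpy as np
import pandas as pd

# Made-up frame with two real columns and two all-NaN columns
df = pd.DataFrame({
    "Date": [20190606, 20190606],
    "Pageviews": [35541, 29730],
    "Unnamed: 7": [np.nan, np.nan],
    "Unnamed: 8": [np.nan, np.nan],
})

# Drop every column whose values are all missing
cleaned = df.dropna(axis="columns", how="all")
print(list(cleaned.columns))
```

Unlike an explicit `drop(columns=[...])`, this keeps working if the number of empty trailing columns changes between exports.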


In [4]:
# Assess dataset
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32755 entries, 0 to 32754
Data columns (total 7 columns):
Date                32755 non-null int64
Page Title          32755 non-null object
Age                 32755 non-null object
Gender              32755 non-null object
Source / Medium     32755 non-null object
Pageviews           32755 non-null int64
Unique Pageviews    32755 non-null int64
dtypes: int64(3), object(4)
memory usage: 1.7+ MB


*Observations*: Every column has the same number of non-null entries (32,755, matching the number of rows), so there appear to be no missing values. Null values will still be dropped as a precaution.

In [5]:
# Drop null values
customer_df = customer_df.dropna()

In [6]:
# Number of entries in dataset
print(customer_df.shape)

(32755, 7)
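To confirm whether `dropna()` actually removed anything, the nulls can be counted per column and the row counts compared before and after. The sketch below uses a small made-up frame with one missing value; for the real dataset, the unchanged shape above suggests that no rows were dropped.

```python
import pandas as pd

# Made-up frame with a single missing Age value
df = pd.DataFrame({
    "Age": ["35-44", None, "25-34"],
    "Pageviews": [10, 20, 30],
})

null_counts = df.isna().sum()      # missing values per column
rows_before = len(df)
df = df.dropna()
rows_dropped = rows_before - len(df)

print(null_counts)
print(rows_dropped)
```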


In [7]:
# Define date timeframe
print(customer_df.sort_values(by='Date', ascending=True)['Date'].head(1))
customer_df.sort_values(by='Date', ascending=True)['Date'].tail(1)

22530    20190601
Name: Date, dtype: int64


14392    20190630
Name: Date, dtype: int64
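The same date range can be read off without sorting the whole frame. A minimal sketch, assuming an integer `Date` column in `YYYYMMDD` form as above (the values here are made up):

```python
import pandas as pd

dates = pd.Series([20190606, 20190601, 20190630], name="Date")

# For YYYYMMDD integers, numeric order matches chronological order
first_date = dates.min()
last_date = dates.max()
print(first_date, last_date)
```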

*Convert Date to Datetime Object*

In [8]:
# Import necessary modules
from datetime import datetime

In [9]:
# Convert Date integers to datetime
customer_df['Date'] = customer_df['Date'].apply(lambda x: datetime.strptime(str(x), '%Y%m%d'))
customer_df.head(2)

Unnamed: 0,Date,Page Title,Age,Gender,Source / Medium,Pageviews,Unique Pageviews
0,2019-06-06,Chrissy Teigen Wore the Pleated-Jean Trend | W...,35-44,female,m.facebook.com / referral,35541,31670
1,2019-06-06,Chrissy Teigen Wore the Pleated-Jean Trend | W...,25-34,female,m.facebook.com / referral,29730,26236
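The conversion can also be done in a vectorised way with `pd.to_datetime`, which avoids calling `strptime` once per row. This is a sketch on a made-up two-row frame, assuming the same `%Y%m%d` integer format:

```python
import pandas as pd

df = pd.DataFrame({"Date": [20190606, 20190601]})

# Cast to string first so the explicit format is matched exactly
df["Date"] = pd.to_datetime(df["Date"].astype(str), format="%Y%m%d")
print(df.dtypes)
```

The resulting column has a datetime64 dtype, so it supports the `.dt` accessor and date-based filtering.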


In [10]:
# Define genders in dataset
np.unique(customer_df.Gender)

array(['female', 'male'], dtype=object)

*Observations*: The dataset covers June 1 to June 30, 2019, and the audience includes both male and female users.
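Beyond listing the distinct genders, it can help to see how the rows and pageviews split between them. The sketch below uses a made-up three-row frame (the male row is invented), so the numbers say nothing about the real dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["female", "female", "male"],
    "Pageviews": [35541, 29730, 1200],  # last value is invented for illustration
})

rows_per_gender = df["Gender"].value_counts()
pageviews_per_gender = df.groupby("Gender")["Pageviews"].sum()

print(rows_per_gender)
print(pageviews_per_gender)
```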

In [11]:
# Persist the DataFrame so other notebooks can restore it with `%store -r customer_df`
%store customer_df

Stored 'customer_df' (DataFrame)
