Task
- Refactor code
    - Fix all visualizations
    - Functionalize everything
- Fix the plots
- Start on EDA
    - Each major feature will cover distribution of the feature, readtime/scroll pct, how it relates to the topics selected, and activity distribution.
        - We will use histograms, bar plots, box plots, and scatter plots (can look at pie charts too)
        - Separate out some plots maybe
        - Look into changing activity to scatter plots
        - Next, update the multiple bar plot
        
- Model selection

Background Information on Dataset: BLAH BLAH BLAH

Importing Packages

In [34]:
import numpy as np
import pandas as pd

import plotly
import plotly.express as px
import plotly.graph_objects as go

from datetime import datetime
from plotly.subplots import make_subplots

# EDA

Load in dataset

In [35]:
#Load in various dataframes
## Articles
df_art = pd.read_parquet("Data/Small/articles.parquet")

## Behaviors
df_bev = pd.read_parquet("Data/Small/train/behaviors.parquet")

## History
df_his = pd.read_parquet("Data/Small/train/history.parquet")



Join the data sources

In [36]:
# Convert datatype of column first
df_bev['article_ids_clicked'] = df_bev['article_ids_clicked'].apply(lambda x: x[0])

In [37]:
# Join behaviors to articles
df = df_bev.join(df_art.set_index("article_id"), on="article_ids_clicked")

# Join behaviors to history
df = df.join(df_his.set_index("user_id"), on="user_id")

# Release the source dataframes from memory
df_bev = []
df_his = []
df_art = []

In [38]:
# Preprocessing
df.dropna(subset=['article_id'], inplace=True)
# df.dropna(subset =['age'], inplace = True)

# Change article IDs into int (rows with a null article_id were dropped above)
df['article_id'] = df['article_id'].astype(np.int64)


# Change genders from float to strings
def gender_(x):
    if x == 0.0:
        return 'Male'
    elif x == 1.0:
        return 'Female'
    else:
        return None


df['gender'] = df['gender'].apply(lambda x: gender_(x))

# Change age to int
# df['age'] = df['age'].apply(lambda x: np.int_(x) if np.isnan(x) == False else x)
# df['age'] = df[~df['age'].isnull()]['age'].astype(np.int32)

# Change age to str it's a range
df['age'] = df['age'].astype('Int64')
df['age'] = df['age'].astype(str)
df['age'] = df['age'].apply(
    lambda x: x if x == '<NA>' else x + ' - ' + x[0] + '9')


# Change postcodes from float codes to region names
def postcodes_(x):
    if x == 0.0:
        return 'Metropolitan'
    elif x == 1.0:
        return 'Rural District'

    elif x == 2.0:
        return 'Municipality'

    elif x == 3.0:
        return 'Provincial'

    elif x == 4.0:
        return 'Big City'

    else:
        return None


df['postcode'] = df['postcode'].apply(lambda x: postcodes_(x))
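
The two recoding functions above can also be written as lookup tables. The following is a minimal sketch, assuming the same float codes as `gender_` and `postcodes_`; the names `GENDER_LABELS`, `POSTCODE_LABELS`, and `label_codes` are hypothetical and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup tables mirroring gender_ and postcodes_ above
GENDER_LABELS = {0.0: "Male", 1.0: "Female"}
POSTCODE_LABELS = {
    0.0: "Metropolitan",
    1.0: "Rural District",
    2.0: "Municipality",
    3.0: "Provincial",
    4.0: "Big City",
}


def label_codes(series, labels):
    """Map float codes to labels; codes missing from labels become NaN."""
    return series.map(labels)


# Toy data standing in for df
toy = pd.DataFrame({"gender": [0.0, 1.0, np.nan], "postcode": [4.0, np.nan, 2.0]})
toy["gender"] = label_codes(toy["gender"], GENDER_LABELS)
toy["postcode"] = label_codes(toy["postcode"], POSTCODE_LABELS)
print(toy)
```

Unmapped values come back as NaN rather than `None`, which the `isnull()` filters used later in this notebook treat the same way.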

Visualizations

Let's calculate the unique users hourly, daily, and by day of the week. We'll use a subset of the data until the plots look right.

#### Biggest thing is user engagement: bigger user engagement -> more revenue
#### We need to maximize the number of ads users view -> this leads them to click on new articles for ads
#### So, let's not make article length too short, so that people can maximize their session lengths with a lot of articles!
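
As a rough sketch of the unique-user counts mentioned above, the following groups impressions by hour of day, calendar date, and day of week, then counts distinct `user_id` values. It assumes `impression_time` can be parsed by `pd.to_datetime`; the helper name `unique_users_by_period` and the toy frame are hypothetical and not part of this notebook.

```python
import pandas as pd


def unique_users_by_period(df_, time_col="impression_time", user_col="user_id"):
    """Count distinct users per hour of day, per date, and per day of week."""
    ts = pd.to_datetime(df_[time_col])
    users = df_[user_col]
    return {
        "hourly": users.groupby(ts.dt.hour).nunique(),
        "daily": users.groupby(ts.dt.date).nunique(),
        "day_of_week": users.groupby(ts.dt.day_name()).nunique(),
    }


# Toy data standing in for a small subset of df
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "impression_time": [
        "2023-05-01 08:10", "2023-05-01 08:50",
        "2023-05-01 09:05", "2023-05-02 08:30",
    ],
})
counts = unique_users_by_period(toy)
print(counts["hourly"])
print(counts["daily"])
print(counts["day_of_week"])
```

Running it on something like `df.head(1000)` first keeps iteration fast while the plots are still being tuned.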

# Functions

## Plot Functions

### Single & Multiple Subset Bar Plots

In [63]:
def single_subset_bar(df_, feature_, xaxis_title, yrange):
    """ 
    Displays bar plot for a feature that has a single category
    Keyword arguments:
        df_ -- dataframe object
        feature_ -- str
        xaxis_title -- str
        yrange -- list of ints: [0, 5]
    Output: 
        Plotly graph object!
    """
    # Index and Values
    indices = [xaxis_title]
    values_ = [len(df_[feature_].unique())]

    # Instantiate figure object
    fig = go.Figure()

    # Append Bar trace
    fig.add_trace(
        go.Bar(
            x=indices, y=values_,
            width=[0.3], text='<b>{}</b>'.format(values_[0]),
        )

    )
    
    # Update axis properties
    fig.update_yaxes(
        title_text='Count', range=yrange
    )

    # Update trace properties
    fig.update_traces(
        textposition='outside',
        textfont=dict(
            family='sans serif',
            size=16,
            color='#1f77b4'
        )
    )

    # Update layout of plot
    fig.update_layout(
        title='<b>Total {}</b>'.format(xaxis_title),
        uniformtext_minsize=8, uniformtext_mode='hide',
        font=dict(
            family="Courier New, monospace",
            size=16,
        )
    )

    return fig.show()

In [64]:
def multiple_subset_bar(df_, feature_, yrange):
    """ 
    Displays bar plot for a feature that has multiple categories.
    Keyword arguments:
        df_ -- dataframe object
        feature_ -- str
        yrange -- list of ints: [0, 5]
    Output: 
        Plotly graph object!
    """

    # Assign tmp_df based on feature
    if feature_ == 'age':
        tmp_df = df_[df_['age'] != '<NA>']
    else:
        tmp_df = df_[~df_[feature_].isnull()]

    # Create a category list
    categories = [d for d in tmp_df[feature_].unique()]
    categories.sort()

    # Instantiate a Figure object
    fig = go.Figure()

    # Iterate through each category and produce a barplot for that category
    for category_ in categories:
        # Record the count
        count= len(tmp_df[tmp_df[feature_] == category_])
        # Add Bar trace
        fig.add_trace(
            go.Bar(
                x= [str(category_)], y = [count],
                text = '<b>{}</b>'.format(count),
                name= str(category_)
            )
        )


    # Update axis properties
    fig.update_yaxes(
        title_text= 'Count', range = yrange
    )
    
    fig.update_xaxes(
        title_text= str(feature_)
    )

    # Update trace properties
    fig.update_traces(
        textposition='outside',
        textfont=dict(
            family='sans serif',
            size=16,
            color='#1f77b4'
        )
    )
            
    # Update layout of plot
    fig.update_layout(
        title = '<b>Distribution of {}</b>'.format(feature_),
        uniformtext_minsize=8, uniformtext_mode='hide',  
        font=dict(
            family="Courier New, monospace",
            size=16,
        )
    )

    return fig.show()

### Single & Multiple Subset Histogram, Box Plot and Bar Plot

In [65]:
def single_subset_feature_visualization(
    df_,
    feature_,
    data_title) -> 'Graph':
    """ 
    Displays multiple plots: Histogram, Box, and Bar plots based on a feature given.
    Keyword arguments:
        df_ -- dataframe object
        feature_ -- str
        data_title -- str
    Output: 
        Plotly graph object!
    """
    # Create subplots object
    fig = make_subplots(
        rows=3, cols=1, subplot_titles=("<b>Histogram</b>", "<b>Box plot</b>", "<b>Average {} for {}</b>".format(feature_, data_title))
    )

    # Instantiate a tmp df which has no null values
    tmp_df = df_[~df_[feature_].isnull()]
    values = tmp_df[feature_].values

    # Average
    average = values.mean()

    # Histogram 
    fig.add_trace(
        go.Histogram(
            x=values, name='Histogram'
        ),
        row=1, col=1
    )

    # Box Plot
    xo = [data_title for x in range(0, len(values))]
    fig.add_trace(
        go.Box(
            y=values, x=xo, name='Box plot'
        ),
        row=2, col=1
    )

    # Bar Plot
    fig.add_trace(
        go.Bar(
            x=[data_title], y=[average], width=[
                  0.3], name='Bar plot'
        ),  
        row=3, col=1
    )

    # Update xaxis properties
    fig.update_xaxes(
        title_text=str(feature_), row=1, col=1
    )

    # Update yaxis properties
    fig.update_yaxes(
        title_text='Count', row=1, col=1
    )
    fig.update_yaxes(
        title_text=str(feature_), row=2, col=1
    )
    fig.update_yaxes(
        title_text=str(feature_), range=[0, 110], row=3, col=1
    )

    # Update subplot title sizes
    fig.update_annotations(
        font_size=20,
    )

    # Update title and height
    fig.update_layout(
        title_text="<b>Distributions of {} for {}</b>".format(feature_, data_title), height=750, width=1000,
        uniformtext_minsize=8, uniformtext_mode='hide',
        font=dict(
            family="Courier New, monospace",
            size=16,
        )
    )

    return fig.show()

In [66]:
def multiple_subset_feature_visualization(
    df_,
    feature_1, feature_2) -> "Graph":
    """ 
    Displays multiple plots: Histogram, Box, and Bar plots based on multiple features given.
    Keyword arguments:
        df_ -- dataframe object
        feature_1 -- str
        feature_2 -- str
    Output: 
        Plotly graph object!
    """

    # Make subplots object
    fig = make_subplots(
        rows=3, cols=1, subplot_titles=("<b>Histogram</b>", "<b>Box plot</b>", "<b>Average {} for each {}</b>".format(feature_2, feature_1))
    )

    # Assign tmp_df based on feature
    if feature_1 == 'age':
        tmp_df = df_[df_['age'] != '<NA>']
    else:
        tmp_df = df_[~df_[feature_1].isnull()]

    # Create a category list from the feature given 
    categories = [d for d in tmp_df[feature_1].unique()]
    categories.sort()

    # Iterate through each category and produce a histogram, boxplot, and bar plots for that subset of the data
    for category_ in categories:
        subset_feature_2 = tmp_df[tmp_df[feature_1]== category_][feature_2].values
        avg = tmp_df[tmp_df[feature_1] == category_][feature_2].mean()
        # Add histogram
        fig.add_trace(
            go.Histogram(
                x=subset_feature_2,
                name=str(category_) + ' Histogram',
            ),
            row=1, col=1
        )
        # Add Boxplot
        # Need to create an array that is similar to the array used in subset_feature_2, to name the traces!
        xo = [str(category_) for x in range(0, len(subset_feature_2))]
        fig.add_trace(
            go.Box(
                y=subset_feature_2, x=xo,
                name=str(category_) + ' Box',
            ),
            row=2, col=1
        )

        # Add Bar
        fig.add_trace(
            go.Bar(
                x=[str(category_)], y=[avg],
                text='<b>{}</b>'.format(avg),
                textposition='outside',
                name=str(category_) + ' Bar',
                textfont=dict(
                    family='sans serif',
                    size=18,
                    color='#1f77b4'
                )
            ),
            row=3, col=1
        )

    # Update xaxis properties
    fig.update_xaxes(
        title_text=str(feature_2), row=1, col=1
    )
    fig.update_xaxes(
        title_text=str(feature_1), row=2, col=1
    )
    fig.update_xaxes(
        title_text=str(feature_1), row=3, col=1
    )

    # Update yaxis properties
    fig.update_yaxes(
        title_text='Count', row=1, col=1
    )
    fig.update_yaxes(
        title_text=str(feature_2), row=2, col=1
    )
    fig.update_yaxes(
        title_text=str(feature_2),
        range=[0, 125], row=3, col=1
    )

    # Update subplot title sizes
    fig.update_annotations(
        font_size=20,
    )

    # Update title and height
    fig.update_layout(
        title_text="<b>Distributions of {} for each {}</b>".format(
            feature_2, feature_1),
        height=750, width=1000,
        font=dict(
            family="Courier New, monospace",
            size=16,
        )
    )

    return fig.show()

### Bar, Box, Scatter, and Activityplots

In [67]:
def plot_bar(
    indices_, values_,
    yrange_, xaxis_title,
    yaxis_title, title_) -> "Graph":
    """ 
    Bar Plot
    Keyword arguments:
        indices_ -- list
        values_ -- list
        yrange -- list of ints: [0, 5]
        xaxis_title -- str
        yaxis_title -- str
        title_ -- str
    Output: 
        Plotly graph object!
    """

    # Instantiate figure object
    fig = go.Figure()
    
    # Iterate through each index and key pair and append a bar plot to the figure
    for idx, val in zip(indices_, values_):
        fig.add_trace(
            go.Bar(
                x= [str(idx)], y = [val],
                text = '<b>{}</b>'.format(val),
                name= str(idx)
            )
        )

        
    # Update axis properties
    fig.update_yaxes(
        title_text= yaxis_title, range = yrange_, type = 'log'
        )
    
    fig.update_xaxes(
        title_text= xaxis_title
        )

    # Update trace properties
    fig.update_traces(
        textposition='outside',
        textfont=dict(
            family='sans serif',
            size=16,
            color='#1f77b4'
            )
        )
            
    # Update layout of plot
    fig.update_layout(
        title = title_, height= 750, width = 1000,
        font=dict(
            family="Courier New, monospace",
            size=16,
            )
        )

    return fig.show()

In [68]:
def plot_box(
    indices_, values_,
    yrange_, xaxis_title,
    yaxis_title, title_)-> 'Graph':
    """ 
    Box Plot
    Keyword arguments:
        indices_ -- list
        values_ -- list
        yrange -- list of ints: [0, 5]
        xaxis_title -- str
        yaxis_title -- str
        title_ -- str
    Output: 
        Plotly graph object!
    """

    # Figure Object
    fig = go.Figure()

    # Iterate through each value and index pair and append a Boxplot trace to the Figure
    for trace_, name_ in zip(values_, indices_):
        fig.add_trace(
            go.Box(
                y = trace_, name = name_
            )
        )

    # Update axis properties
    fig.update_yaxes(
        title_text= yaxis_title, range = yrange_
        )
    
    fig.update_xaxes(
        title_text= xaxis_title
        )

            
    # Update layout of plot
    fig.update_layout(
        title = title_, height= 750, width = 1000,
        uniformtext_minsize=8, uniformtext_mode='hide',  
        font=dict(
            family="Courier New, monospace",
            size=16,
            )
        )

    return fig.show()

In [69]:
def plot_scatter(
    indices_, values_,
    yrange_, xaxis_title,
    yaxis_title, title_) -> 'Graph':
    """ 
    Scatter Plot
    Keyword arguments:
        indices_ -- list
        values_ -- list
        yrange -- list of ints: [0, 5]
        xaxis_title -- str
        yaxis_title -- str
        title_ -- str
    Output: 
        Plotly graph object!
    """

    # Figure Object
    fig = go.Figure()

    # Add line plot
    fig.add_trace(
        go.Scatter(
            x=indices_, y=values_,
            mode='lines', name='Line',
            marker=dict(
                color="rgba(135, 206, 250, 0.5)"
            )
        )
    )

    # Iterate through each index and value pair, and append a scatter plot trace
    for idx, val in zip(indices_, values_):
        # Add scatter trace
        fig.add_trace(
            go.Scatter(
                x=[str(idx)], y=[val],
                text='<b>{}</b>'.format(val),
                name=str(idx),
                marker=dict(
                    size=12,
                ),
                mode='lines+markers+text'
            )
        )

    # Update axis properties
    fig.update_yaxes(
        title_text=yaxis_title, range=yrange_
    )

    fig.update_xaxes(
        title_text=xaxis_title
    )

    # Update trace properties
    fig.update_traces(
        textposition='bottom center',
        textfont=dict(
            family='sans serif',
            size=12,
            color='#1f77b4'
        )
    )

    # Update layout of plot
    fig.update_layout(
        title=title_, height=750, width=1000,
        font=dict(
            family="Courier New, monospace",
            size=16,
        )
    )

    return fig.show()

In [70]:
def activity_scatter(
    dict_,  yrange_,
    xaxis_title, yaxis_title,
     title_) -> 'Graph':
    """ 
    Scatter Plot of Daily or Hourly Activity 
    Keyword arguments:
        dict_ -- dict object
        yrange -- list of ints: [0, 5]
        xaxis_title -- str
        yaxis_title -- str
        title_ -- str
    Output: 
        Plotly graph object!
    """
    
    fig = go.Figure()
    # Iterate through each topic in dict and add that respective trace to the scatter plot!
    for topic in dict_.keys():
        indices = [x for x in dict_[topic].keys()]
        values = [x for x in dict_[topic].values()]
        # Add traces
        fig.add_trace(
            go.Scatter(
                x=indices, y=values, name=topic,
                marker=dict(
                    size=12,
                ),
                mode='lines+markers+text'
            )
        )

    # Update axis properties
    fig.update_yaxes(
        title_text=yaxis_title, range=yrange_
    )

    fig.update_xaxes(
        title_text=xaxis_title
    )

    # Update trace properties
    fig.update_traces(
        textposition='bottom center',
        textfont=dict(
            family='sans serif',
            size=12,
            color='#1f77b4'
        )
    )

    # Update layout of plot
    fig.update_layout(
        title=title_, height=750, width=1000,
        font=dict(
            family="Courier New, monospace",
            size=16,
        )
    )

    return fig.show()

## Article Functions

In [71]:
def article_id_read_scroll(dict_, res):
    """ 
    Populates the dict if that article is present in another dict!
    Keyword arguments:
        dict_--  dict: to map articles to scroll/read
        res -- dict: to map unique articles to scroll/read
    Output: None
    """
    # Iterate through each pair of key and value
    for k, v in dict_.items():
        # Only record articles tracked in res whose value is not NaN
        if k in res and not np.isnan(v):
            # Append the value to that article's array
            res[k] = np.append(res[k], v)

## User Functions

### Helper functions 

In [72]:
def populate_dict(list_, dict_):
    """ 
    Populates the dict from list indices
    Keyword arguments:
        list_--  list
        dict_ -- dict: 
    Output: 
        None
    """
    # Iterate through each list index and append the index as a key 
    for idx in list_:
        if idx not in dict_:
            dict_[idx] = 1
        else:
            dict_[idx] += 1
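
The counting loop above is the pattern that `collections.Counter` implements. A minimal sketch of an equivalent helper follows; the name `populate_dict_counter` is hypothetical, and like `populate_dict` it updates the passed dict in place and returns nothing.

```python
from collections import Counter


def populate_dict_counter(list_, dict_):
    """Add the occurrence count of each item in list_ to dict_ in place."""
    for key, n in Counter(list_).items():
        dict_[key] = dict_.get(key, 0) + n


# Toy usage: repeated calls accumulate into the same dict
counts = {}
populate_dict_counter(["sport", "news", "sport"], counts)
populate_dict_counter(["news"], counts)
print(counts)
```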

In [73]:
def weekly_map(list_):
    """ 
    Maps the Week of Year to (Week 1, Week 2, etc)
    Keyword arguments:
        list_ -- list 
    Output: 
        res -- list
    """
    # Weeks represented from dataset
    weeks = [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
    # Consecutive labels 1..len(weeks), one per week
    placeholder = [i for i in range(1, len(weeks) + 1)]
    # Dict to map weeks to placeholder
    dict_ = {k: v for k, v in zip(weeks, placeholder)}
    # Build the result list: 'Week 1', 'Week 2', ...
    res = []
    for idx in list_:
        res.append('Week ' + str(dict_[idx]))
    return res

In [74]:
def int_dow_dict(dict_):
    """ 
    Maps the DayofWeek from int to str
    Keyword arguments:
        dict_ -- dict: 
    Output: 
        res -- list
    """
    # Str of days
    str_dow = ['Monday', 'Tuesday', 'Wednesday',
               'Thursday', 'Friday', 'Saturday', 'Sunday']
    # Int of days
    int_dow = [i for i in range(7)]
    # Map int of days to str of days
    dow_dict = {k: v for k, v in zip(int_dow, str_dow)}
    # Return the str of day given the int day
    res = {}
    for keys in dict_.keys():
        res[dow_dict[keys]] = dict_[keys]
    return res

## Topic Functions

### Helper Functions

In [75]:
def unique_subset_topics(df_):
    """ 
    Returns a list of unique topics in the dataframe
    Keyword arguments:
        df--  dataframe object
    Output: res -- list of strs
    """
    # Create our result list
    res = []
    # Iterate through each index of topic and append unique topics
    for index in df_['topics']:
        for topic_ in index:
            if topic_ not in res:
                res.append(topic_)
    return res
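
The nested loop above can also be expressed with `Series.explode`, which flattens each list of topics into one row per topic. This is a sketch, assuming `topics` holds list-like values; the function name `unique_subset_topics_explode` and the toy frame are hypothetical.

```python
import pandas as pd


def unique_subset_topics_explode(df_):
    """Return the unique topics in order of first appearance."""
    # Empty lists explode to NaN, so drop them before taking unique values
    return df_["topics"].explode().dropna().unique().tolist()


# Toy data standing in for df
toy = pd.DataFrame(
    {"topics": [["Sport", "Football"], ["News"], ["Sport", "News"], []]}
)
print(unique_subset_topics_explode(toy))
```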

In [76]:
def topics_article_id_scroll_read (dict_, res):
    """ 
    Populates the dict if that article is present in another dict!
    Keyword arguments:
        dict_--  dict: to map articles to scroll/read
        res -- dict: to map unique articles to scroll/read
    Output: res -- dict
    """
    # Iterate through each pair of key and value
    for k, v in dict_.items():
        if k in res:
            # If the key matches, append that value
            res[k] = np.append(res[k], v)
        else:
            # If the key is not present, start an empty list for that key
            res[k] = []
    return res

In [77]:
def assign_plot_col(dict_col_, list_category):
    """ 
    Assigns a col number for the subplot given a key
    Keyword arguments:
        dict_col --  dict: to map category to col_num 
        list_category -- list: 
    Output: None
    """
    # Iterate through the list of categories and assign a column number
    for num, category_ in enumerate(list_category):
        # Even
        if num % 2 == 0:
            dict_col_[category_] = 1
        # Odd
        else:
            dict_col_[category_] = 2

In [78]:
def assign_plot_row(dict_row_, list_category):
    """ 
    Assigns a row number for the subplot given a key
    Keyword arguments:
        dict_row --  dict: to map category to row_num 
        list_category -- list: 
    Output: None
    """
    # Iterate through a loop and assign the row number of a category
    # Instantiate counter and initial row number
    counter = 0
    row_num = 1
    # Iterate through the loop. Expected row numbers: [1, 1, 2, 2, 3, 3, ...]
    while counter < len(list_category):
        num_ = 0
        while (num_ < 2) and (counter < len(list_category)):
            category_ = list_category[counter]
            dict_row_[category_] = row_num
            num_ += 1
            counter += 1
        row_num += 1
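
The row and column assignment for a two-column grid can also be derived from a category's position with `divmod`. The sketch below is an alternative to the two helpers above, not a replacement already used in this notebook; `subplot_position` is a hypothetical name.

```python
import math


def subplot_position(index, n_cols=2):
    """Return the 1-based (row, col) of the index-th subplot, filling row by row."""
    row, col = divmod(index, n_cols)
    return row + 1, col + 1


categories = ["Big City", "Metropolitan", "Municipality", "Provincial", "Rural District"]
positions = {c: subplot_position(i) for i, c in enumerate(categories)}
# Same row count as the -(-len(categories) // 2) idiom
n_rows = math.ceil(len(categories) / 2)
print(positions, n_rows)
```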

In [79]:
def row_num(dict_, key) -> int:
    """
    Returns the row number of a given category
    Keyword arguments:
        dict_ -- row dict to map category to row_num
        key -- category name (a key of dict_)
    Output: int
    """
    return dict_[key]


def col_num(dict_, key) -> int:
    """
    Returns the col number of a given category
    Keyword arguments:
        dict_ -- col dict to map category to col_num
        key -- category name (a key of dict_)
    Output: int
    """
    return dict_[key]

In [80]:
def topic_feature_bar_distribution(
    df_, feature_,
    topic_list_, yrange,
    subplot_titles_, xaxis_title,
    yaxis_title, title_,
    height_, width_
) -> 'Graph':
    """ 
    Plot of topic distribution in respect to which feature of the dataframe was given.
    Keyword arguments:
        df_ -- dataframe object
        feature_ -- str 
        topic_list -- list of strs: ['Blah', 'Blah']
        yrange -- list of ints: [0, 5]
        subplot_titles -- list of strs: ['Blah', 'Blah']
        xaxis_title -- str
        yaxis_title -- str
        title_ -- str
        height_ -- int
        width_ -- int
    Output: Plotly graph object!
    """
    # Assign tmp_df based on feature
    # Age feature is a string and the Null values contain <NA>
    if feature_ == 'age':
        tmp_df = df_[df_['age'] != '<NA>']
    else:
        tmp_df = df_[~df_[feature_].isnull()]

    # List of categories sorted in order
    categories = [d for d in tmp_df[feature_].unique()]
    categories.sort()

    # Make subplots need to figure out number of columns and rows:
    # Instantiate dicts
    dict_col = {}
    dict_row = {}
    # Populate our column and row dicts
    assign_plot_col(dict_col_=dict_col, list_category=categories)
    assign_plot_row(dict_row_=dict_row, list_category=categories)
    # Number of total rows
    rows_ = -(-len(categories) // 2)

    # Make subplots object
    fig = make_subplots(
        rows=rows_, cols=2,
        subplot_titles=subplot_titles_, shared_yaxes=True,
        x_title=xaxis_title, y_title=yaxis_title,
        vertical_spacing=0.2
    )

    # Iterate through each category and assign the correct subplot!
    for idx, category_ in enumerate(categories):
        # Find the subset of the data with that category
        subset_df = tmp_df[tmp_df[feature_] == category_]
        # Create a dict object with 0 counts for all topics
        tmp_dict = {k: 0 for k in topic_list_}
        for i in subset_df.index:
            for j in range(0, len(subset_df['topics'][i])):
                # Find that index
                tmp_topic = subset_df['topics'][i][j]
                # Enumerate
                tmp_dict[tmp_topic] += 1
        # Sort the dictionary
        tmp_dict = dict(
            sorted(tmp_dict.items(), key=lambda kv: kv[1], reverse=True))
        # Create our indices and values objects to insert into our plot
        indices = [x for x in tmp_dict.keys()][0:5]
        values = [y for y in tmp_dict.values()][0:5]
        # Add our trace object
        fig.add_trace(
            go.Bar(
                x=indices, y=values,
                name=str(category_)
            ),
            row=row_num(dict_=dict_row, key=category_),
            col=col_num(dict_=dict_col, key=category_)
        )

    # Update axis properties
    # yaxes
    fig.update_yaxes(
        range=yrange, type="log",
    )
    # xaxes
    fig.update_xaxes(
        tickfont=dict(
            size=10,
        )
    )

    # Update layout of plot
    fig.update_layout(
        title=title_, height=height_,
        width=width_, font=dict(
            family="Courier New, monospace",
            size=14,
        ),
        margin=dict(
            l=100, r=50,
            t=100, b=50,
            pad=0)
    )

    return fig.show()

## Activity Functions

### Helper Functions

In [None]:
def daily_hourly_activity_feature_bar_distribution(
    df_, feature_,
    yrange, subplot_titles_,
    title_,
    height_, width_
) -> 'Graph':
    """ 
    Plot of daily/hourly distribution in respect to which feature of the dataframe was given.
    Keyword arguments:
        df_ -- dataframe object
        feature_ -- str 
        yrange -- list of ints: [0, 5]
        subplot_titles -- list of strs: ['Blah', 'Blah']
        title_ -- str
        height_ -- int
        width_ -- int
    Output: Plotly graph object!
    """
    # Assign tmp_df based on feature
    # Age feature is a string and the Null values contain <NA>
    if feature_ == 'age':
        tmp_df = df_[df_['age'] != '<NA>']
    else:
        tmp_df = df_[~df_[feature_].isnull()]

    # List of categories sorted in order
    categories = [d for d in tmp_df[feature_].unique()]
    categories.sort()

    # Make subplots object
    fig = make_subplots(
        rows=2, cols=1,
        subplot_titles=subplot_titles_,
        y_title='Count',
        vertical_spacing=0.2
    )
    # Iterate through each category and assign the correct subplot!
    for idx, category_ in enumerate(categories):
        # Find the subset of the data with that category
        subset_df = tmp_df[tmp_df[feature_] == category_]

        # Create empty dicts to hold the daily and hourly counts
        subset_daily_activity = {}
        subset_hourly_activity = {}
        for i in subset_df.index:
            # Get the date and time from that timestamp
            tmp_timestamp = subset_df['impression_time'][i]
            tmp_datetime = tmp_timestamp
            tmp_date = tmp_datetime.date()
            tmp_time = tmp_datetime.time()
            tmp_hour = tmp_time.hour

            # Daily Activity: the first occurrence counts as 1
            if tmp_date not in subset_daily_activity:
                subset_daily_activity[tmp_date] = 1
            else:
                subset_daily_activity[tmp_date] += 1

            # Convert hour into a zero-padded string, e.g. '07:00'
            if tmp_hour > 9:
                tmp_time = str(tmp_hour) + ':00'
            else:
                tmp_time = "0" + str(tmp_hour) + ':00'

            # Hourly Activity: the first occurrence counts as 1
            if tmp_time not in subset_hourly_activity:
                subset_hourly_activity[tmp_time] = 1
            else:
                subset_hourly_activity[tmp_time] += 1

        # Sort by dates
        subset_daily_activity = dict(
            sorted(subset_daily_activity.items())
        )

        # Daily Activity Plot
        indices = [x for x in subset_daily_activity.keys()]
        values = [y for y in subset_daily_activity.values()]

        fig.add_trace(
            go.Scatter(
                x=indices, y=values,
                name='Daily ' + str(category_), mode='lines+markers+text',
            ),
            row=1, col=1
        )

        # Hourly Activity
        subset_hourly_activity = dict(
            sorted(subset_hourly_activity.items())
        )

        indices = [x for x in subset_hourly_activity.keys()]
        values = [y for y in subset_hourly_activity.values()]

        fig.add_trace(
            go.Scatter(
                x=indices, y=values,
                name='Hourly ' + str(category_), mode='lines+markers+text',
            ),
            row=2, col=1
        )

    # Update axis properties
    # yaxes
    fig.update_yaxes(type='log',
                     range=yrange,
                     )
    # xaxes
    fig.update_xaxes(
        tickfont=dict(
            size=14,
        )
    )

    fig.update_xaxes(
        title_text="<b>Date</b>",
        row=1, col=1
    )

    fig.update_xaxes(
        title_text="<b>Hour</b>",
        row=2, col=1
    )

    # Update layout of plot
    fig.update_layout(
        title=title_, height=height_,
        width=width_, font=dict(
            family="Courier New, monospace",
            size=14,
        ),
        margin=dict(
            l=100, r=50,
            t=100, b=50,
            pad=0)
    )

    return fig.show()
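
The per-row counting inside the function above can also be done with vectorized pandas calls. This is a minimal sketch, assuming the timestamps parse with `pd.to_datetime`; `activity_counts` is a hypothetical helper and the timestamps are made up.

```python
import pandas as pd


def activity_counts(timestamps):
    """Return (daily, hourly) impression counts as Series sorted by index."""
    ts = pd.to_datetime(pd.Series(timestamps))
    # One count per calendar date
    daily = ts.dt.date.value_counts().sort_index()
    # One count per zero-padded hour label such as '08:00'
    hourly = ts.dt.strftime("%H:00").value_counts().sort_index()
    return daily, hourly


daily, hourly = activity_counts(
    ["2023-05-01 08:10", "2023-05-01 08:50", "2023-05-01 21:05", "2023-05-02 08:30"]
)
print(daily)
print(hourly)
```

The index and values of each Series can be passed to `go.Scatter` as `x` and `y` in place of the dict keys and values.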

# Feature Analysis

## Overall Feature Analysis

### Number of Impressions

In [81]:
# Number of Impressions
single_subset_bar(df_=df, feature_='impression_id',
                  xaxis_title='Number of Impressions', yrange=[0, 80000])

### Distribution of Read Times

In [87]:
# Distribution of Read Times
single_subset_feature_visualization(df_=df, feature_='read_time', data_title = 'all users')

### Distribution of Scroll Percentages

In [88]:
# Distribution of Scroll Percentages
single_subset_feature_visualization(df_=df, feature_='scroll_percentage', data_title= 'all users')

## Article

### Number of Articles

In [None]:
# Total Number of Articles
single_subset_bar(df_ = df, feature_ = 'article_id', xaxis_title = 'Number of Articles', yrange = [0, 2000])

### Number of articles clicked in a session

In [None]:
# How many unique articles are clicked in a session?
## Group by sessions and get the article ids
tmp_aps = df.groupby('session_id')['article_id'].apply(list)
## Create a dict to store the count of articles per session
articles_per_session = {k: 0 for k in range(1, 20)}

## Iterate through our list previously, and record the number of articles in a session to our res dict
for i in tmp_aps:
    num_articles = len(i)
    articles_per_session[num_articles] += 1

## Set as our indices / values for plot
indices = [k for k in articles_per_session.keys()]
values = [k for k in articles_per_session.values()]
## Plot
plot_bar(
    indices_=indices, values_=values,
    # plot_bar uses a log y-axis, so the range is given in log10 units
    yrange_=[0, np.log10(22000)], xaxis_title='Number of Articles',
    yaxis_title='Count', title_='<b>Number of Articles clicked in a session</b>')
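
The distribution of clicks per session can also be computed without a fixed-size dict, which avoids a `KeyError` if a session has more clicks than the preallocated keys cover. The sketch below uses a toy frame with the same `session_id` and `article_id` column names; the data is made up.

```python
import pandas as pd

# Toy data standing in for df
toy = pd.DataFrame({
    "session_id": [10, 10, 11, 12, 12, 12],
    "article_id": [1, 2, 1, 3, 4, 5],
})
# Rows (clicked articles) per session
clicks_per_session = toy.groupby("session_id")["article_id"].size()
# Number of sessions for each click count, ordered by click count
articles_per_session = clicks_per_session.value_counts().sort_index()
print(articles_per_session.to_dict())
```

Like the loop above, this counts every clicked row in a session, so repeated clicks on the same article are counted more than once.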

### Read Time and Scroll Percentages

In [None]:
# Get the average readtime and scroll percentages for all articles!

# Unique User Ids
unique_user_ids = df['user_id'].values[0:1000]
# We take the set because the scroll, article per user is joined in a list for every user id (so just take the set of it!)
unique_user_ids = set(unique_user_ids)
# Unique Article Ids
unique_article_ids = df['article_id'].unique()
unique_article_ids = unique_article_ids[~np.isnan(unique_article_ids)]
# Create dictionaries
unique_article_read = {k: [0] for k in unique_article_ids}
unique_article_read_avg = {k: [0] for k in unique_article_ids}
unique_article_scroll = {k: [0] for k in unique_article_ids}
unique_article_scroll_avg = {k: [0] for k in unique_article_ids}

# Iterate across each user id
for uid in unique_user_ids:
    # Get the subset of rows for that user id
    tmp_df = df[df['user_id'] == uid]
    # Now let's go through the history stored on each row
    for i in tmp_df.index:
        # Select the read times, articles, and scroll percentages at that index
        tmp_read = tmp_df['read_time_fixed'][i]
        tmp_article = tmp_df['article_id_fixed'][i]
        tmp_scroll = tmp_df['scroll_percentage_fixed'][i]
        # Create list objects for article, read, scroll
        read = list(tmp_read)
        scroll = list(tmp_scroll)
        articles = [np.int64(x) for x in tmp_article]
        # Populate our unique_article_read dictionary based on the results found in our previous list objects
        tmp_articles_read = dict(zip(articles, read))
        article_id_read_scroll(tmp_articles_read, unique_article_read)
        # Populate our unique_article_scroll dictionary based on the results found in our previous list objects
        tmp_articles_scroll = dict(zip(articles, scroll))
        article_id_read_scroll(tmp_articles_scroll, unique_article_scroll)

# Get the average scroll percentage and read times for each article
for k, v in unique_article_read.items():
    unique_article_read_avg[k] = np.mean(v)
for k, v in unique_article_scroll.items():
    unique_article_scroll_avg[k] = np.mean(v)
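
As a possible alternative to the nested loops, the per-article averages could be computed by exploding the history columns. This is a sketch on a hypothetical `toy_df`, assuming that `article_id_fixed`, `read_time_fixed`, and `scroll_percentage_fixed` hold equal-length lists on each row and that pandas 1.3 or newer is available (needed for multi-column `explode`).

```python
import pandas as pd

# Toy stand-in for the joined frame: one history list per user (made-up values)
toy_df = pd.DataFrame({
    'user_id': [1, 2],
    'article_id_fixed': [[10, 11], [10]],
    'read_time_fixed': [[30.0, 60.0], [50.0]],
    'scroll_percentage_fixed': [[100.0, 50.0], [80.0]],
})

history_cols = ['article_id_fixed', 'read_time_fixed', 'scroll_percentage_fixed']

# The history repeats on every row of a user, so keep one row per user
per_user = toy_df.drop_duplicates(subset='user_id')

# One row per (user, article) pair; the lists must have matching lengths
long_df = per_user[history_cols].explode(history_cols)
long_df = long_df.astype({
    'article_id_fixed': 'int64',
    'read_time_fixed': 'float64',
    'scroll_percentage_fixed': 'float64',
})

# Average read time and scroll percentage for each article
article_avg = long_df.groupby('article_id_fixed')[
    ['read_time_fixed', 'scroll_percentage_fixed']].mean()
print(article_avg)
```

`article_avg['read_time_fixed'].tolist()` would then give the values for the box plot of read times.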

#### Read Time

In [None]:
# Distribution of Average Read Times across Articles
## Indices / Values
indices = ['All Unique Articles']
values = list(unique_article_read_avg.values())
## Plot
plot_box(
    indices_=indices, values_=[values],
    yrange_=[-5, 1100], xaxis_title='Articles',
    yaxis_title='Read Time', title_='<b>Distribution of Average Read Times Across All Articles</b>')

#### Scroll Percentage

In [None]:
# Distribution of Average Scroll Percentages across Articles
## Indices / Values
indices = ['All Unique Articles']
values = list(unique_article_scroll_avg.values())
## Plot
plot_box(
    indices_=indices, values_=[values],
    yrange_=[-5, 105], xaxis_title='Articles',
    yaxis_title='Scroll Percentage', title_='<b>Distribution of Average Scroll Percentages Across All Articles</b>')

## Users

### Number of Users


In [None]:
# Total Number of Users
single_subset_bar(df_ = df, feature_ = 'user_id', xaxis_title = 'Number of Users', yrange = [0, 11000])

### Daily User growth

In [None]:
# Record the daily user growth
unique_user_ids = df['user_id'].unique()

# Create a dictionary mapping join date -> number of new users
unique_users_daily_growth_freq = {}

# Iterate through each user id and record the date the user is first seen
for uid in unique_user_ids[0:1000]:
    # Get the subset of that user id
    tmp_df = df[df['user_id'] == uid]
    # Get the first row index for that user
    first_index = tmp_df.index[0]
    # Treat the first timestamp in the user's history as the join date
    tmp_datetime = pd.DatetimeIndex(tmp_df['impression_time_fixed'][first_index])
    join_date = tmp_datetime[0].date()
    # Populate our unique_users_daily_growth_freq
    unique_users_daily_growth_freq[join_date] = (
        unique_users_daily_growth_freq.get(join_date, 0) + 1)

# Sort our dict
unique_users_daily_growth_freq = dict(sorted(unique_users_daily_growth_freq.items()))
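
A loop-free version of the join-date tally might look like the sketch below. It runs on a hypothetical `toy_df` and assumes, as the loop above does, that the first timestamp in `impression_time_fixed` marks the day a user is first seen.

```python
import pandas as pd

# Toy stand-in: the history list repeats on every row of a user (made-up values)
toy_df = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'impression_time_fixed': [
        [pd.Timestamp('2023-05-01 08:00'), pd.Timestamp('2023-05-03 09:00')],
        [pd.Timestamp('2023-05-01 08:00'), pd.Timestamp('2023-05-03 09:00')],
        [pd.Timestamp('2023-05-01 21:30')],
        [pd.Timestamp('2023-05-02 07:15')],
    ],
})

# Keep one row per user
first_rows = toy_df.drop_duplicates(subset='user_id')

# First timestamp in the history, reduced to a calendar date
join_dates = first_rows['impression_time_fixed'].apply(
    lambda ts: pd.Timestamp(ts[0]).date())

# New users per day, in chronological order
daily_growth = join_dates.value_counts().sort_index()
indices = daily_growth.index.tolist()
values = daily_growth.tolist()
print(daily_growth)
```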

In [None]:
# Daily User Growth

# Indices / Values for Plot
indices = [x for x in unique_users_daily_growth_freq.keys()]
values = [x for x in unique_users_daily_growth_freq.values()]
# Plot
plot_bar(indices_=indices, values_=values, yrange_=[
         0, 3], xaxis_title='<b>Dates</b>', yaxis_title='<b>Count</b>', title_='<b>Daily User Growth</b>')

### Read Time

In [85]:
# Read Time per User

# Group by User and Read Time
tmp_user_df = pd.DataFrame(data=df.groupby(by='user_id')[
                           'read_time'].mean(), columns=['read_time'])
# Plot
single_subset_feature_visualization(
    df_=tmp_user_df,  feature_='read_time', data_title='Unique Users')

### Scroll Percentage

In [None]:
# Scroll Percentage per User

# Group by User and Scroll Percentage
tmp_user_df = pd.DataFrame(data=df.groupby(by='user_id')[
                           'scroll_percentage'].mean(), columns=['scroll_percentage'])
# Plot
single_subset_feature_visualization(
    df_=tmp_user_df,  feature_='scroll_percentage', data_title='Unique Users')

### User Activity

In [None]:
# Record the daily, hourly, weekly, dayofweek activity across all users

# Get all unique ids in a list
unique_user_ids = df['user_id'].unique()[0:1000]

# Create dictionaries
unique_users_daily_freq = {}
unique_users_hourly_freq = {}
unique_users_dayofweek_freq = {}
unique_users_weekly_freq = {}

# Iterate through each user id
for id in unique_user_ids:
    # Get the subset of that user id
    tmp_df = df[df['user_id'] == id]

    # Now lets go through each and populate the unique dates, hours and day of the week for each user
    dates = []
    hours = []
    dayofweek = []
    week = []
    indices = np.array(tmp_df.index)

    # Iterate through each index
    for i in indices:
        # Store the date, time, dayofweek, and week number
        tmp_datetime = pd.DatetimeIndex(tmp_df['impression_time_fixed'][i])
        tmp_date = tmp_datetime.date
        tmp_time = tmp_datetime.time
        tmp_dayofweek = tmp_datetime.weekday
        tmp_week = tmp_datetime.isocalendar().week
        # Append our dates, hours, dayofweek, week number
        for j, k, l, m in zip(tmp_date, tmp_time, tmp_dayofweek, tmp_week):
            dates.append(j)
            hours.append(k)
            dayofweek.append(l)
            week.append(m)

    # Get rid of duplicate values
    unique_dates = list(set(dates))
    unique_hours = list(set(hours))
    unique_dayofweek = list(set(dayofweek))
    unique_week = list(set(week))

    # Convert to zero-padded hour strings, dropping duplicates again because
    # several distinct times can fall within the same hour
    unique_hours = sorted({f'{x.hour:02d}:00' for x in unique_hours})

    # Convert the week int to mapping from 1++
    unique_week = weekly_map(unique_week)

    # Populate dicts
    populate_dict(list_=unique_dates, dict_=unique_users_daily_freq)
    populate_dict(list_=unique_hours, dict_=unique_users_hourly_freq)
    populate_dict(list_=unique_dayofweek, dict_=unique_users_dayofweek_freq)
    populate_dict(list_=unique_week, dict_=unique_users_weekly_freq)


# Sort our dicts
unique_users_daily_freq = dict(sorted(unique_users_daily_freq.items()))
unique_users_hourly_freq = dict(sorted(unique_users_hourly_freq.items()))

# Sort by integers for day of the week and then lets change the dict from int to str
unique_users_dayofweek_freq = dict(sorted(unique_users_dayofweek_freq.items()))
unique_users_dayofweek_freq = int_dow_dict(unique_users_dayofweek_freq)

unique_users_weekly_freq = dict(sorted(unique_users_weekly_freq.items()))
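
The per-user loops above can likely be replaced by a few group-bys. The following is a sketch under assumptions: `toy_df` is a hypothetical stand-in for `df`, `impression_time_fixed` holds a list of timestamps per user, and `active_users` is an illustrative helper that is not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the joined frame (made-up values)
toy_df = pd.DataFrame({
    'user_id': [1, 2],
    'impression_time_fixed': [
        [pd.Timestamp('2023-05-01 08:10'),
         pd.Timestamp('2023-05-01 08:40'),
         pd.Timestamp('2023-05-02 09:00')],
        [pd.Timestamp('2023-05-01 08:05')],
    ],
})

# One row per (user, timestamp) event
events = (
    toy_df.drop_duplicates(subset='user_id')
    .explode('impression_time_fixed')
    .reset_index(drop=True)
)
events['ts'] = pd.to_datetime(events['impression_time_fixed'])
events['date'] = events['ts'].dt.date
events['hour'] = events['ts'].dt.strftime('%H:00')
events['dayofweek'] = events['ts'].dt.dayofweek  # Monday == 0


def active_users(column):
    """Count distinct users for each value of the given events column."""
    return events.groupby(column)['user_id'].nunique().sort_index()


daily = active_users('date')
hourly = active_users('hour')
dayofweek = active_users('dayofweek')
print(daily, hourly, dayofweek, sep='\n')
```

Counting with `nunique` means a user who is active several times within the same hour or day contributes only once to that bucket.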

#### Daily User Activity

In [None]:
# Daily User Activity

## Indices / Values for Plot
indices = [x for x in unique_users_daily_freq.keys()]
values = [x for x in unique_users_daily_freq.values()]

## Plot
plot_scatter(
    indices_=indices, values_=values,
    yrange_=[200, 900], xaxis_title='Date',
    yaxis_title='Active Users', title_='<b>Daily Active Users</b>'
)

#### Hourly User Activity

In [None]:
# Hourly User Activity

## Indices / Values for Plot
indices = [x for x in unique_users_hourly_freq.keys()]
values = [x for x in unique_users_hourly_freq.values()]

## Plot
plot_scatter(
    indices_ = indices , values_ = values,
    yrange_ = [0, 20000], xaxis_title = 'Hour',
    yaxis_title= 'Active Users', title_ = '<b>Hourly Active Users</b>'
    )

#### Weekly User Activity

In [None]:
# Weekly User Activity

## Indices / Values for Plot
indices = [x for x in unique_users_weekly_freq.keys()]
values = [x for x in unique_users_weekly_freq.values()]

## Plot
plot_bar(
    indices_ = indices, values_ = values,
    yrange_ = [0, 1100], xaxis_title = 'Week',
    yaxis_title= 'Active Users', title_ = '<b>Weekly Active Users</b>')

#### Day Of The Week User Activity

In [None]:
# Day Of The Week Activity

## Indices / Values for Plot
indices = [x for x in unique_users_dayofweek_freq.keys()]
values = [x for x in unique_users_dayofweek_freq.values()]

## Plot
plot_bar(
    indices_ = indices, values_ = values,
    yrange_ = [0, 1100], xaxis_title = 'Day',
    yaxis_title= 'Active Users', title_ = '<b>Day of the Week Activity</b>')

## Session

### Number of Sessions

In [None]:
# Total Number of Sessions
single_subset_bar(df_=df, feature_='session_id',
                  xaxis_title='Number of Sessions', yrange=[0, 40000])

### Daily Active Sessions

In [None]:
# Number of unique sessions per day

# Make a copy of the dataframe and extract the time as a str
copy_df = df.copy()
copy_df['impression_time'] = copy_df['impression_time'].apply(
    lambda x: x.strftime('%m/%d/%Y'))

# Group by the session ids with the impression time
unique_sessions_per_day = copy_df.groupby(
    by='session_id')['impression_time'].min()
tmp_dau_df = pd.DataFrame(data=unique_sessions_per_day.values,
                          index=unique_sessions_per_day.keys(), columns=['Session Dates'])

# Plot
multiple_subset_bar(df_=tmp_dau_df, feature_='Session Dates', yrange=[0, 8000])
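
One caveat with the cell above is that the minimum is taken over `'%m/%d/%Y'` strings, which only sorts chronologically within a single year. A sketch that takes the minimum on the timestamps first is shown below; `toy_df` is a hypothetical stand-in for `df`, and `impression_time` is assumed to be a datetime column.

```python
import pandas as pd

# Toy stand-in for the joined frame (made-up values)
toy_df = pd.DataFrame({
    'session_id': [1, 1, 2, 3],
    'impression_time': pd.to_datetime([
        '2023-05-01 23:50', '2023-05-02 00:05',
        '2023-05-02 10:00', '2023-05-02 11:00',
    ]),
})

# Earliest impression of each session, compared as timestamps
session_start = toy_df.groupby('session_id')['impression_time'].min()

# Number of sessions starting on each calendar day
sessions_per_day = session_start.dt.date.value_counts().sort_index()
print(sessions_per_day)
```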

### Read Time

In [None]:
# Read Time per Session
## Group by session ids and read_time 
tmp_session_df = pd.DataFrame(data=df.groupby(by='session_id')[
                              'read_time'].mean(), columns=['read_time'])
## Plot
single_subset_feature_visualization(
    df_=tmp_session_df,  feature_='read_time', data_title='Unique Sessions')

### Scroll Percentage per Session

In [None]:
# Scroll Percentage per Session
## Group by session ids and scroll percentage
tmp_session_df = pd.DataFrame(data=df.groupby(by='session_id')[
                              'scroll_percentage'].mean(), columns=['scroll_percentage'])
## Plot
single_subset_feature_visualization(
    df_=tmp_session_df,  feature_='scroll_percentage', data_title='Unique Sessions')

## Topic

### Number of Topics

In [None]:
# Number of Topics!
# Unique Topics
topic_list = unique_subset_topics(df)
# Plot
tmp_topic_df = pd.DataFrame(data=topic_list, columns=['topics'])

single_subset_bar(df_=tmp_topic_df, feature_='topics',
                  xaxis_title='Number of Topics', yrange=[0, 100])

In [None]:
# Record the frequency of topics across unique users, readtimes across topics, and scroll percentages across those topics

# Get the first 1000 unique user ids
unique_user_ids = df['user_id'].unique()[0:1000]

# Create dictionaries
unique_users_topics_freq = {}
unique_topic_scroll_freq = {}
unique_topic_read_freq = {}

# Iterate through each user id and record the topics viewed!
for uid in unique_user_ids:
    # Get the subset of that user id
    tmp_df = df[df['user_id'] == uid]
    # Collect every topic this user has viewed
    user_topics = set()
    # Now let's go through each row (one clicked article per row)
    for i in tmp_df.index:
        # Record the topics, scroll percentage and read_time for each index
        topics = list(tmp_df['topics'][i])
        tmp_scroll = tmp_df['scroll_percentage'][i]
        tmp_read = tmp_df['read_time'][i]
        user_topics.update(topics)

        # Credit the article's scroll percentage to every topic it belongs to
        # (may hint at whether a topic needs little reading or relies on visualizations)
        tmp_topic_scroll = {k: tmp_scroll for k in topics}
        unique_topic_scroll_freq = topics_article_id_scroll_read(
            tmp_topic_scroll, unique_topic_scroll_freq)

        # Credit the article's read time to every topic it belongs to
        tmp_topic_read = {k: tmp_read for k in topics}
        unique_topic_read_freq = topics_article_id_scroll_read(
            tmp_topic_read, unique_topic_read_freq)

    # Count this user once for each unique topic viewed
    populate_dict(list(user_topics), unique_users_topics_freq)


# Sort the dictionaries
sorted_topic_freq = dict(
    sorted(unique_users_topics_freq.items(), key=lambda x: x[1], reverse=True))

# Find the average read times across each topic
unique_topic_read_avg_freq = {
    k: round(np.nanmean(v), 2) for k, v in unique_topic_read_freq.items()}
sorted_unique_topic_read_avg_freq = dict(
    sorted(unique_topic_read_avg_freq.items(), key=lambda x: x[1], reverse=True))

# Sort the topics for distribution
sorted_unique_topic_read_freq = dict(sorted(unique_topic_read_freq.items()))

# Find the average scroll percentages across each topic
unique_topic_scroll_avg_freq = {
    k: round(np.nanmean(v), 2) for k, v in unique_topic_scroll_freq.items()}
sorted_unique_topic_scroll_avg_freq = dict(
    sorted(unique_topic_scroll_avg_freq.items(), key=lambda x: x[1], reverse=True))

# Sort the topics scroll pct for distribution
sorted_unique_topic_scroll_freq = dict(
    sorted(unique_topic_scroll_freq.items()))
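
The per-topic statistics could also be obtained by exploding the `topics` column so that each (article, topic) pair gets its own row. This is a sketch on a hypothetical `toy_df`, assuming `topics` holds a list of topic names per row; it does not use the notebook's `populate_dict` or `topics_article_id_scroll_read` helpers.

```python
import pandas as pd

# Toy stand-in for the joined frame (made-up values)
toy_df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'topics': [['Sport', 'Politics'], ['Sport'], ['Politics']],
    'read_time': [10.0, 30.0, 50.0],
    'scroll_percentage': [100.0, 50.0, None],
})

# One row per (clicked article, topic) pair
long_df = toy_df.explode('topics')

# Distinct users per topic, most popular first
users_per_topic = (
    long_df.groupby('topics')['user_id'].nunique().sort_values(ascending=False))

# Mean read time / scroll percentage per topic; missing values are skipped,
# which matches the np.nanmean behaviour used in the cell above
topic_avg = long_df.groupby('topics')[
    ['read_time', 'scroll_percentage']].mean().round(2)
print(users_per_topic, topic_avg, sep='\n')
```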

### Distribution of Topics across users

In [None]:
# Distribution of Topics across users!
## Indices / Values for Plot
indices = [x for x in sorted_topic_freq.keys()][0:10]
values = [x for x in sorted_topic_freq.values()][0:10]

## Plot
plot_bar(
    indices_=indices, values_=values,
    yrange_=[0, 400], xaxis_title='Topics',
    yaxis_title='Count', title_='<b>Top 10 Topics User Activity</b>')

### Read Time

In [None]:
# Bar Plot of Read Time across Topics (top 5 by average read time)
## Indices / Values for Plot
indices = list(sorted_unique_topic_read_avg_freq.keys())[0:5]
values = list(sorted_unique_topic_read_avg_freq.values())[0:5]
## Plot
plot_bar(
    indices_ = indices, values_ = values,
    yrange_ = [0, 150], xaxis_title = 'Topics',
    yaxis_title= 'Read Time', title_ = '<b>Top 5 Topics by Average Read Time</b>')

In [None]:
# Box Plot of Read Time across Topics
## Indices / Values for Plot
indices = [x for x in sorted_unique_topic_read_freq.keys()]
values = [x for x in sorted_unique_topic_read_freq.values()]
## Plot
plot_box(
    indices_ = indices, values_ = values,
    yrange_ = [0, 2000], xaxis_title = 'Topics',
    yaxis_title= 'Read Time', title_ = '<b>Distributions of Read Times Across Each Topic</b>')


### Scroll Percentage

In [None]:
# Bar Plot of Scroll Percentage across Topics
## Indices / Values for Plot
indices = [x for x in sorted_unique_topic_scroll_avg_freq.keys()]
values = [x for x in sorted_unique_topic_scroll_avg_freq.values()]
## Plot
plot_bar(
    indices_ = indices, values_ = values,
    yrange_ = [0, 104], xaxis_title = 'Topics',
    yaxis_title= 'Scroll Percentage', title_ = '<b>Average Scroll Percentage Across Each Topic</b>')

In [None]:
# Box Plot of Scroll Percentage across Topics
## Indices / Values for Plot
indices = list(sorted_unique_topic_scroll_freq.keys())
values = list(sorted_unique_topic_scroll_freq.values())
## Plot
plot_box(
    indices_ = indices, values_ = values,
    yrange_ = [0, 105], xaxis_title = 'Topics',
    yaxis_title= 'Scroll Percentage', title_ = '<b>Distributions of Scroll Percentages Across Each Topic</b>')

### Daily and Hourly Activity 

In [None]:
# Daily and Hourly Activity across each Topic

# Get all the unique topics
topic_list = unique_subset_topics(df)
unique_topics = sorted(topic_list)

# Get the list of topics for each row in a specific session
topics = df.groupby(by='session_id')['topics'].apply(list)

# Get the list of each unique timestamp for these sessions
timestamps = df.groupby(by='session_id')['impression_time'].apply(list)
unique_dates = []

# Create a list of zero-padded hours in a str format, e.g. '08:00'
unique_hours = [f'{i:02d}:00' for i in range(24)]

# Iterate through each timestamp
for i in range(len(timestamps.values)):
    # Iterate through each idx
    for j in range(len(timestamps.values[i])):
        # Assign datetime and date objects
        tmp_datetime = timestamps.values[i][j]
        tmp_date = tmp_datetime.date()
        # if date not in unique dates, append
        if tmp_date not in unique_dates:
            unique_dates.append(tmp_date)

# Sort dates
unique_dates = sorted(unique_dates)

# Instantiate dict objects mapping each topic to zeroed counts per date / hour
unique_topic_daily_activity = {
    topic: {date: 0 for date in unique_dates} for topic in unique_topics}
unique_topic_hourly_activity = {
    topic: {hour: 0 for hour in unique_hours} for topic in unique_topics}


# Iterate through each session id
for session_topics, session_times in zip(topics.values, timestamps.values):
    # Iterate through each impression of the session with its timestamp
    for row_topics, tmp_datetime in zip(session_topics, session_times):
        # Assign a date object and a zero-padded hour string
        tmp_date = tmp_datetime.date()
        tmp_time = f'{tmp_datetime.hour:02d}:00'

        # Count the impression once for every topic of the clicked article
        for tmp in row_topics:
            unique_topic_daily_activity[tmp][tmp_date] += 1
            unique_topic_hourly_activity[tmp][tmp_time] += 1
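
The topic-by-date and topic-by-hour counts can probably be built as two pivoted tables instead of nested dictionaries. The sketch below uses a hypothetical `toy_df` and assumes `topics` is a list per row and `impression_time` is a datetime column; it counts impressions per topic rather than distinct users.

```python
import pandas as pd

# Toy stand-in for the joined frame (made-up values)
toy_df = pd.DataFrame({
    'topics': [['Sport', 'Politics'], ['Sport'], ['Politics']],
    'impression_time': pd.to_datetime([
        '2023-05-01 08:10', '2023-05-01 21:00', '2023-05-02 08:30',
    ]),
})

# One row per (impression, topic) pair
long_df = toy_df.explode('topics').reset_index(drop=True)
long_df['date'] = long_df['impression_time'].dt.date
long_df['hour'] = long_df['impression_time'].dt.strftime('%H:00')

# Rows are topics, columns are dates / hours, cells are impression counts
daily_counts = long_df.groupby(['topics', 'date']).size().unstack(fill_value=0)
hourly_counts = long_df.groupby(['topics', 'hour']).size().unstack(fill_value=0)
print(daily_counts, hourly_counts, sep='\n')
```

If a nested dictionary is still needed downstream, `daily_counts.to_dict(orient='index')` yields a `{topic: {date: count}}` mapping.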

#### Daily Activity

In [None]:
# Daily Activity of Topics 
activity_scatter(
    dict_=unique_topic_daily_activity,  yrange_=[0, 2100],
    xaxis_title='Dates', yaxis_title='Active Users', title_='<b>Daily Active Users Per Topic</b>')

#### Hourly Activity

In [None]:
# Hourly Activity of Topics 
activity_scatter(
    dict_=unique_topic_hourly_activity,  yrange_=[0, 1000],
    xaxis_title='Hour', yaxis_title='Active Users', title_='<b>Hourly Active Users Per Topic</b>')

## Devices

In [None]:
# Distribution of Devices
multiple_subset_bar(df_=df, feature_='device_type', yrange=[0, 50000])

### Read Time

In [None]:
# Read Time across Devices
multiple_subset_feature_visualization(df_ =df,  feature_1 = 'device_type', feature_2 = 'read_time')

### Scroll percentage 

In [None]:
# Scroll Percentage across Devices
multiple_subset_feature_visualization(df_ =df,  feature_1 = 'device_type', feature_2 = 'scroll_percentage')

### Topic

In [None]:
# Distribution of Topics Per Device
# Unique Topics
topic_list = unique_subset_topics(df)
unique_topics = sorted(topic_list)
# Plot
topic_feature_bar_distribution(
    df_=df, feature_='device_type', yrange=[0, 4.5],
    topic_list_=unique_topics, subplot_titles_=[
        '<b>Device 1</b>', '<b>Device 2</b>', '<b>Device 3</b>'],
    xaxis_title='<b>Topics</b>', yaxis_title='<b>Count</b>',
    title_='<b>Topic Distribution Per Device</b>',
    height_=750, width_=1000
)

### Daily/Hourly Activity

In [None]:
# Daily and Hourly Activity across Devices
daily_hourly_activity_feature_bar_distribution(
    df_ = df, feature_ = 'device_type', yrange = [0, 4],
    subplot_titles_ = ['<b>Daily</b>', '<b>Hourly</b>'],
    title_ = '<b>Daily and Hourly Activity Per Device</b>',
    height_ = 750, width_ = 1000
    )

NameError: name 'daily_hourly_activity_feature_bar_distribution' is not defined

## Subscribers

### Distribution of Subscribers vs Non-Subscribers 

In [None]:
# Distribution of Subscribers vs Non-Subscribers
multiple_subset_bar(df_=df, feature_='is_subscriber', yrange=[0, 80000])

### Read time

In [None]:
# Read Times for Subscribers vs Non-Subscribers
multiple_subset_feature_visualization(
    df_=df,  feature_1='is_subscriber', feature_2='read_time')

### Scroll percentage

In [None]:
# Scroll Percentages for Subscribers vs Non-Subscribers
multiple_subset_feature_visualization(
    df_=df,  feature_1='is_subscriber', feature_2='scroll_percentage')

### Topic Distribution

In [None]:
# Distribution of Topics for Subscribers vs Non-Subscribers
## Get all the unique topics
topic_list = unique_subset_topics(df)
unique_topics = sorted(topic_list)
## Plot
topic_feature_bar_distribution(
    df_=df, feature_='is_subscriber', yrange=[0, 5],
    topic_list_=unique_topics, subplot_titles_=['<b>False</b>', '<b>True</b>'],
    xaxis_title='<b>Topics</b>', yaxis_title='<b>Count</b>',
    title_='<b>Topic Distribution of Subscribers vs Non-Subscribers</b>',
    height_=500, width_=1000
)

### Daily/Hourly Activity

In [None]:
# Daily Activity Users / Hourly Activity Users for Subscribers vs Non-Subscribers
daily_hourly_activity_feature_bar_distribution(
    df_ = df, feature_ = 'is_subscriber', yrange = [0, 4],
    subplot_titles_ = ['<b>Daily</b>', '<b>Hourly</b>'],
    title_ = '<b>Daily and Hourly Activity of Subscribers vs Non-Subscribers</b>',
    height_ = 750, width_ = 1000
    )

## Gender

### Distribution of Genders

In [None]:
# Distribution of Genders
multiple_subset_bar(df_=df, feature_='gender', yrange=[0, 5000])

### Read time 

In [None]:
# Read Time across Genders
multiple_subset_feature_visualization(
    df_=df,  feature_1='gender', feature_2='read_time')

### Scroll percentage 

In [None]:
# Scroll Percentage across Genders
multiple_subset_feature_visualization(
    df_=df,  feature_1='gender', feature_2='scroll_percentage')

### Topics

In [None]:
# Distribution of Topics across Genders
## Get all the unique topics
topic_list = unique_subset_topics(df)
unique_topics = sorted(topic_list)
## Plot
topic_feature_bar_distribution(
    df_=df, feature_='gender', yrange=[0, 5],
    topic_list_=unique_topics, subplot_titles_=['<b>Female</b>', '<b>Male</b>'],
    xaxis_title='<b>Topics</b>', yaxis_title='<b>Count</b>',
    title_='<b>Topic Distribution of Genders</b>',
    height_=500, width_=1000
)

### Daily/Hourly Activity

In [None]:
# Daily Activity Users / Hourly Activity Users across Genders
daily_hourly_activity_feature_bar_distribution(
    df_=df, feature_='gender', yrange=[0, 4],
    subplot_titles_=['<b>Daily</b>', '<b>Hourly</b>'],
    title_='<b>Daily and Hourly Activity of Genders</b>',
    height_=750, width_=1000
)

## Age

### Age Distribution

In [None]:
# Distribution of Ages
multiple_subset_bar(df_=df, feature_='age', yrange=[0, 800])

### Read Time

In [None]:
# Read Time across Ages
multiple_subset_feature_visualization(
    df_=df,  feature_1='age', feature_2='read_time')

### Scroll Percentage

In [None]:
# Scroll Percentages across Ages
multiple_subset_feature_visualization(
    df_=df,  feature_1='age', feature_2='scroll_percentage')

### Topics

In [None]:
# Distribution of Topics across Ages
## Get all the unique topics
topic_list = unique_subset_topics(df)
unique_topics = sorted(topic_list)
## Plot
topic_feature_bar_distribution(
    df_=df, feature_='age', yrange=[0, 2.5],
    topic_list_=unique_topics,
    subplot_titles_=[
        '<b>20-29</b>', '<b>30-39</b>', '<b>40-49</b>',
        '<b>50-59</b>', '<b>60-69</b>', '<b>70-79</b>',
        '<b>80-89</b>', '<b>90-99</b>'
    ],
    xaxis_title='<b>Topics</b>', yaxis_title='<b>Count</b>',
    title_='<b>Topic Distribution of Age Groups</b>',
    height_=850, width_=1000
)

### Daily/Hourly Activity

In [None]:
# Daily Activity Users / Hourly Activity Users across Age
daily_hourly_activity_feature_bar_distribution(
    df_=df, feature_='age', yrange=[0, 2.5],
    subplot_titles_=['<b>Daily</b>', '<b>Hourly</b>'],
    title_='<b>Daily and Hourly Activity of Age Groups</b>',
    height_=750, width_=1000
)

## Postcodes

### Distribution of Post Codes

In [None]:
# Distribution of Postcodes
multiple_subset_bar(df_=df, feature_='postcode', yrange=[0, 800])

### Read Time

In [None]:
# Read Time across Postcodes
multiple_subset_feature_visualization(
    df_=df,  feature_1='postcode', feature_2='read_time')

### Scroll Percentage

In [None]:
# Scroll Percentages across Postcodes
multiple_subset_feature_visualization(
    df_=df,  feature_1='postcode', feature_2='scroll_percentage')

### Topics

In [None]:
# Distribution of Topics across Postcodes
## Get all the unique topics
topic_list = unique_subset_topics(df)
unique_topics = sorted(topic_list)
## Plot
topic_feature_bar_distribution(
    df_=df, feature_='postcode', yrange=[0, 2.5],
    topic_list_=unique_topics,
    subplot_titles_=[
        '<b>Big City</b>', '<b>Metropolitan</b>', '<b>Municipality</b>',
        '<b>Provincial</b>', '<b>Rural District</b>'
    ],
    xaxis_title='<b>Topics</b>', yaxis_title='<b>Count</b>',
    title_='<b>Topic Distribution per Postcode</b>',
    height_=850, width_=1000
)

### Daily/Hourly Activity

In [None]:
# Daily Activity Users / Hourly Activity Users across Postcodes
daily_hourly_activity_feature_bar_distribution(
    df_=df, feature_='postcode', yrange=[0, 4],
    subplot_titles_=['<b>Daily</b>', '<b>Hourly</b>'],
    title_='<b>Daily and Hourly Activity per Postcode</b>',
    height_=750, width_=1000
)