# Assignment 2 - Entertainment Data (CleverClogs)


Author: Zoe Pointon

Context: User Experience

Background information: https://www.blackwoodgroup.org.uk/clevercogs/

The data analysed in this notebook is the usage data for a tablet application called 'CleverClogs'. CleverClogs is a tablet application designed to empower elderly and disabled people with technology and to incorporate it into their daily lives. User competency with the app varies, although all users are given 12 hours of training. The app is multifunctional, with sections for music, video, games, video calling, a panic alarm, a calendar and more. The app also has a web browser (some extremely vulnerable users have this switched off).


The dataset is made up of click data. Each row has seven columns: date/time stamp, user ID, user role, building, link title, link type and content info. There are around 500 users and 132537 lines of data.

We were later also given a dataset of user information. This data included details such as external ID, CleverClogs user ID, birth date, gender and condition. There are 696 users in total. The two datasets are linked via an ID, specifically the ExternalID.

The data owners are the man who developed the tablet, Collin, and a university researcher, Lynda. They would like the data to be analysed so they can include some visualisations in a proposal they are writing. They are interested in any insight that we can gain from the dataset.

My main focus is to find out how the users are using the app: specifically, finding trends in what they use the app for and how they access those things. When are they using the built-in functions? And when do they use the web browser?

In [4]:
import pandas as pd
import numpy as np
import sqlite3
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("data.csv")
print(df)

FileNotFoundError: File b'data.csv' does not exist

# Click Data Cleaning

I first ran .shape to make sure it returned the right number of columns and rows.

In [None]:
df.shape

This showed me that the data was returning more columns than it should; there are only 7 columns in this data. To make sure, I printed all the column headers.

In [None]:
print('Column headers', df.columns)

This told me that there were some extra unnamed columns I needed to remove, so I dropped all columns except the first 7.

In [None]:
df.drop(df.columns[7:], axis=1, inplace=True)
print('Column headers', df.columns)

In [None]:
df.shape
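Dropping by position assumes the stray columns all sit after the seventh one. As an alternative, here is a minimal sketch that drops them by name instead, assuming the extras carry the default "Unnamed: N" headers that pandas gives to headerless columns. The toy frame and its column names are illustrative only, not the real data.

```python
import pandas as pd

# Toy frame standing in for the click data; the column names are illustrative only.
sample = pd.DataFrame({
    "Role": ["User", "Support"],
    "LinkType": ["Internet", "Category"],
    "Unnamed: 7": [None, None],
    "Unnamed: 8": [None, "stray"],
})

# Keep every column whose header does not start with "Unnamed".
keep = ~sample.columns.str.startswith("Unnamed")
cleaned = sample.loc[:, keep]

print(list(cleaned.columns))
```

Selecting by name keeps working if the stray columns ever appear somewhere other than the end of the file.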

I then ran value counts on several of the columns to make sure all rows were returning valid data.

In [None]:
df['Role'].value_counts()

In [None]:
df['Building'].value_counts()

In [None]:
df['LinkType'].value_counts()

In [None]:
df['LinkTitle'].value_counts()

In [None]:
df['ContentInfo'].value_counts()

I could see that there were clearly 18 rows that were returning invalid data. I removed these rows.

In [None]:
df.drop(df.loc[df['Role']==' font-family: " comic="" sans="" ms"'].index, inplace=True)
print(df)

I then ran value count for the Role column again to make sure it only returned valid values.

In [None]:
df['Role'].value_counts()

ContentInfo was returning some very messy values so I went through and cleaned them up using replace.

In [None]:
df['ContentInfo'] = df['ContentInfo'].replace({'-1|329|10| Comfort Break|1|': 'Comfort Break'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|424|10| Contact Blackwood|1|': 'Contact Blackwood'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|332|5| Dropped Book|1|': 'Dropped Book'})    
df['ContentInfo'] = df['ContentInfo'].replace({'<h2>Care Standards for support services </h2><div><span style="font-weight: normal': 'Care Standards for support services'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|479|10| Ask for new content|1|': 'Ask for new content'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|209|30| Alarm|1|': 'Alarm'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|356|15| Out of Bed (2)|1|1': 'Out of Bed'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|224|10| Coffee|1|': 'Coffee'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|348|15| HELP INTO BED|1|1': 'HELP INTO BED'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|454|2|Slow to Drain|1|': 'Slow to Drain'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|331|10|  Breakfast  |1|': 'Breakfast'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|327|5| Cup of Tea |1|': 'Cup of Tea'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|295|10| Wheel Chair Support|1|': 'Wheel Chair Support'})
df['ContentInfo'] = df['ContentInfo'].replace({'-1|542|30|Blocked|1|': 'Blocked'})
df['ContentInfo'] = df['ContentInfo'].replace({'Make a Payment<div><br></div><div><br></div>': 'Make a Payment'})

        
df['ContentInfo'] = df['ContentInfo'].replace({'="" font-size:="" large': np.nan})
df['ContentInfo'] = df['ContentInfo'].replace({'66|331||||': np.nan})
df['ContentInfo'] = df['ContentInfo'].replace({'<iframe src="spiral.asp" height="600" width="700" scrolling="no" frameBorder="0"></iframe>': np.nan})
df['ContentInfo'] = df['ContentInfo'].replace({'If the water is not draining properly try these steps first<div><br></div><div>1</div><div><br></div><div><br></div><div>2</div><div><br></div><div><br></div><div>3</div>': np.nan})
df['ContentInfo'] = df['ContentInfo'].replace({'<p class="MsoNormal" style="margin: 0px': np.nan})
df['ContentInfo'] = df['ContentInfo'].replace({'<span style="font-family: &quot': np.nan})
df['ContentInfo'] = df['ContentInfo'].replace({'<div style="text-align: center': np.nan})



In [None]:
df['ContentInfo'].value_counts()
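Many of the messy ContentInfo values follow the same pipe-delimited pattern, with the readable label in the fourth field. As a rough sketch (not the method used above), the label could be pulled out with one helper instead of one replace per value. The helper name extract_label is hypothetical, and the assumption that every value starting with "-1|" has this layout would need checking against the real data.

```python
import pandas as pd

def extract_label(value):
    """Return the label from a '-1|id|n| Label|1|' style value, else the value unchanged."""
    if isinstance(value, str) and value.startswith("-1|"):
        parts = value.split("|")
        if len(parts) > 3:
            # The readable label is the fourth field; strip the padding around it.
            return parts[3].strip()
    return value

# Two of the messy values seen above plus one value that is already clean.
raw = pd.Series([
    "-1|329|10| Comfort Break|1|",
    "-1|331|10|  Breakfast  |1|",
    "My Music",
])
cleaned = raw.map(extract_label)
print(cleaned.tolist())
```

This sketch leaves suffixes such as "(2)" in place, so a value like 'Out of Bed (2)' would still need its own replace.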

I then decided to remove all rows that are about support accounts. I did this as I am not interested in how support people are using the system.

In [None]:
df.drop(df.loc[df['Role']=='Support'].index, inplace=True)
print(df)

I then ran a value count to make sure it only returned User roles.

In [None]:
df['Role'].value_counts()

In [None]:
df['LinkType'].value_counts()

# User Data Cleaning
Now I am going to clean the user-data.csv dataset.

In [None]:
ud = pd.read_csv("user-data.csv")
print(ud)

In [None]:
ud['Gender'].value_counts()

In [None]:
ud['Condition'].value_counts()

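While trying to work out the age of users I found that there was a user without a BirthDate entered. I dropped this row because it was disrupting my dataset.

In [None]:
# read_csv loads empty cells as NaN, so check for missing values as well as empty strings
ud.drop(ud.loc[ud['BirthDate'].isna() | (ud['BirthDate'] == '')].index, inplace=True)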

The data looks OK from the value counts I ran, so I am now going to work out each user's age and create a new column from this.

In [None]:
import datetime

In [None]:
ud['BirthDate'] =  pd.to_datetime(ud['BirthDate'], format='%d/%m/%Y')
print(ud)

In [None]:
for i in ud.index:
    today = datetime.datetime.now()
    dob = ud.at[i, 'BirthDate']
    ud.at[i, 'Age'] = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
print(ud)
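The same age calculation can also be done on the whole column at once. The sketch below is one way to do that, assuming the birth dates are already parsed as datetimes. The function name age_in_years and the sample dates are made up for illustration.

```python
import pandas as pd

def age_in_years(birthdates, today):
    """Whole years between each birth date and `today`."""
    # True where the birthday has already happened in today's calendar year.
    had_birthday = (birthdates.dt.month < today.month) | (
        (birthdates.dt.month == today.month) & (birthdates.dt.day <= today.day)
    )
    # Subtract one year for anyone whose birthday is still to come this year.
    return today.year - birthdates.dt.year - (~had_birthday).astype(int)

# Made-up birth dates in the same day/month/year format as the user data.
births = pd.to_datetime(
    pd.Series(["15/06/1950", "01/01/1940", "31/12/1940"]), format="%d/%m/%Y"
)
ages = age_in_years(births, pd.Timestamp("2020-06-14"))
print(ages.tolist())
```

Passing a fixed date rather than calling now() means the ages do not change each time the notebook is re-run.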

# Exploring Click Data

First I explored the first dataset, called df. As the angle of my analysis was the LinkType column, I started there.

In [None]:
print(df)

I refreshed my memory on the different values of LinkType.

In [None]:
df['LinkType'].value_counts()

I created a bar chart to see what types of things the users are clicking on. 

The results showed me that users are using the internet about the same amount as they are using the built-in tablet functions (Category). This could indicate that some built-in functions are missing or do not perform as the user would wish, so they are using the internet instead. It could also indicate that users just like browsing the internet.

In [None]:
df['LinkType'].value_counts().plot(kind='barh')
print('Figure 1: Sections')

Part of what I want to find out is how the clients are using the custom-built sections and for what. Because of this I created a bar chart to see which Category sections are the most popular. I looked at the LinkTitles again. These are the link titles for all aspects of the tablet, so I needed to narrow them down to just the 'Category' LinkTitles.

In [None]:
df['LinkTitle'].value_counts()

I cut the dataset to return only rows where LinkType equals 'Category'. I then added all the LinkTitles to a list.

In [None]:
CategoryData = []
for i in df.index:
    if df.at[i, 'LinkType'] == 'Category':
        CategoryData.append(df.at[i, 'LinkTitle'])

print(CategoryData)

I counted how many items there were in the list to get an idea of scale.

In [None]:
len(CategoryData)

I then counted how often each value appears in the list.

In [None]:
import collections

CategoryDataCounter=collections.Counter(CategoryData)

print(CategoryDataCounter.keys())
print(CategoryDataCounter.values())
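The filter-then-count steps above can also be written in a single pandas expression. This is a minimal sketch on a made-up frame; only the column names LinkType and LinkTitle come from the dataset.

```python
import pandas as pd

# Made-up click rows, for illustration only.
clicks = pd.DataFrame({
    "LinkType": ["Category", "Internet", "Category", "Category"],
    "LinkTitle": ["My Music", "BBC News", "Entertainment", "My Music"],
})

# Select the LinkTitle of every Category row, then count each distinct title.
category_counts = clicks.loc[clicks["LinkType"] == "Category", "LinkTitle"].value_counts()
print(category_counts)
```

The result is already sorted from most to least clicked, and calling .plot(kind='bar') on it would draw the chart directly.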

I then made a bar chart from these lists. The bar chart showed me the 6 most clicked-on built-in functions. These are: My Music, Entertainment, My Interests, Single Player Games, Play Games and Videos.

Entertainment is the most clicked on, by nearly double the rest. This leads me to believe that Entertainment might have subcategories within it, such as My Music or Games. However, I cannot be sure without asking the data holder.

Even though I cannot be sure what 'Entertainment' is exactly, there is a clear pattern: for the most part people are using the tablet for Music, Games and watching Videos. This is where I will analyse deeper.

In [None]:
x = list(CategoryDataCounter.keys())
y = list(CategoryDataCounter.values())

plt.bar(x, y)
# Rotate the labels after plotting so the rotation applies to the bars' tick labels
plt.xticks(rotation=90)
print('Figure 2: Built in Functions')

I decided to look into how the users are listening to music. I started by looking at what types of radio stations the users are listening to.
I did this by creating a list of all the LinkTitles that include the phrase 'Radio'.

In [None]:
RadioData = []
for i in df.index:
    if type(df.at[i, 'LinkTitle']) is str:
        if df.at[i, 'LinkTitle'].count('Radio') > 0:
            RadioData.append(df.at[i, 'LinkTitle'])

print(RadioData)

I counted how many items there were in the list to get an idea of scale.

In [None]:
len(RadioData)

In [None]:
RadioDataCounter=collections.Counter(RadioData)

print(RadioDataCounter.keys())
print(RadioDataCounter.values())

'Radio Stations' is the click before the user gets to a radio station. I can tell because the number of clicks on individual stations is about the same as the number of clicks on 'Radio Stations'. Because of this I removed 'Radio Stations' from the visualisation.

In [None]:
totalradioclicks = len(RadioData)
while 'Radio Stations' in RadioData:
    RadioData.remove('Radio Stations')

print(RadioData)

In [None]:
RadioDataCounter=collections.Counter(RadioData)

print(RadioDataCounter.keys())
print(RadioDataCounter.values())

From this visualisation we can clearly see that 'Smooth Radio' is the most popular radio station by far.

In [None]:
x = list(RadioDataCounter.keys())
y = list(RadioDataCounter.values())

plt.bar(x, y)
# Rotate the labels after plotting so the rotation applies to the bars' tick labels
plt.xticks(rotation=90)
print('Figure 3: Radio')

In [None]:
perRadioClick = (totalradioclicks/132535)*100
print(perRadioClick)

I wanted to look at what types of things the users were looking at on the internet.

In [None]:
InternetData = []
for i in df.index:
    if df.at[i, 'LinkType'] == 'Internet':
        InternetData.append(df.at[i, 'LinkTitle'])

print(InternetData)

In [None]:
len(InternetData)

In [None]:
InternetDataCounter=collections.Counter(InternetData)

print(InternetDataCounter.keys())
print(InternetDataCounter.values())

There were so many different searches that it is hard to see what is looked at the most. Because of this I read through the list. Looking at the list, there were some phrases that popped up a lot. I decided to do a more focused visualisation for these phrases.

In [None]:
x = list(InternetDataCounter.keys())
y = list(InternetDataCounter.values())

plt.bar(x, y)
# Rotate the labels after plotting so the rotation applies to the bars' tick labels
plt.xticks(rotation=90)
print('Figure 4: Internet')

Lots of the users were using the internet to look at sports. Looking through the data I can see that the phrase 'football' comes up a lot. I decided to make a visualisation to see which football-related websites users are viewing.

In [None]:
FootballData = []

for i in df.index:
    if df.at[i, 'LinkType'] == 'Internet':
        if type(df.at[i, 'LinkTitle']) is str:
            if df.at[i, 'LinkTitle'].count('Football') > 0:
                FootballData.append(df.at[i, 'LinkTitle'])

print(FootballData)

In [None]:
len(FootballData)

In [None]:
FootballDataCounter=collections.Counter(FootballData)

print(FootballDataCounter.keys())
print(FootballDataCounter.values())
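The loop above only matches titles containing 'Football' with a capital F. As a hedged sketch, a case-insensitive match that also skips missing titles could look like this. The titles are made up for illustration.

```python
import pandas as pd

# Made-up link titles, including a lower-case match and a missing value.
titles = pd.Series(["Sky Sports Football", "BBC football scores", "Smooth Radio", None])

# case=False ignores capitalisation; na=False treats missing titles as non-matches.
mask = titles.str.contains("football", case=False, na=False)
matches = titles[mask]
print(matches.tolist())
```

Using na=False removes the need for the separate type check on each title.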

The chart shows that people are looking at football news on the Sky and BBC websites. It also gives insight into which teams the users support. You can also tell that users are from Scotland or the north of England by the teams being searched for: 'Scottish Football', 'Rangers Football Club', 'Falkirk Football', 'Kilmarnock Football Club', 'Dundee United Football Club' and 'Blackpool Football Club'. This shows that although the data does not have the specific geographical location of the users, you could still find out this information in other ways. This is an important thing to be aware of in the future with data protection.

In [None]:
x = list(FootballDataCounter.keys())
y = list(FootballDataCounter.values())

plt.bar(x, y)
# Rotate the labels after plotting so the rotation applies to the bars' tick labels
plt.xticks(rotation=90)
print('Figure 5: Football')

Percentage of clicks related to football.

In [None]:
totalfootball = len(FootballData)
perFootballClick = (totalfootball/132535)*100
print(perFootballClick)
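The percentage above divides by a hard-coded row count. A small sketch of the same arithmetic that takes the denominator from the frame itself is given here, on made-up rows. Whether the share should be measured against the cleaned or the original row count is a choice this example does not settle.

```python
import pandas as pd

# Made-up click titles, for illustration only.
clicks = pd.DataFrame(
    {"LinkTitle": ["BBC Football", "Smooth Radio", "My Music", "Sky Sports Football"]}
)

is_football = clicks["LinkTitle"].str.contains("Football", na=False)
# The mean of a boolean column is the fraction of True values.
percent_football = is_football.mean() * 100
print(percent_football)
```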

# Exploring User Data
Now I am going to explore the second dataset of user data called ud.

In [None]:
print(ud)

I want to discover a bit more about the user demographics of the tablet. I started by visualising the gender of users. This allowed me to see that there are more female users than male.

In [None]:
ud['Gender'].value_counts().plot(kind='bar')
print('Figure 6: Gender')

I created a bar chart to show me clearly which conditions are the most predominant.

In [None]:
ud['Condition'].value_counts().plot(kind='barh', figsize=(15, 10))
print('Figure 7: Condition')

I worked out the mean age of users.

In [None]:
ud['Age'].mean()

I worked out the mode ages of users.

In [None]:
ud['Age'].mode()

I worked out the median age of users.

In [None]:
ud['Age'].median()

# Exploring Connected Databases

I wanted to calculate how active each user was and add this as a new column. I did this by counting the number of clicks each user has made and then adding this to each user's row in ud.

In [None]:
IDFrequency = pd.DataFrame(df.ExternalID.value_counts().reset_index().values, columns=["ExternalID", "Frequency"])
print(IDFrequency)

The frequency is telling us that there are only 177 active users. This seems very low considering that we were given data on 696 users. I think this may be because there are lots of null ExternalIDs in the data set. I am now merging the dataset with the ud set.

In [None]:
IDFrequency['ExternalID'] = IDFrequency['ExternalID'].astype(int)
# Drop users without an ExternalID before converting, so the column stays an integer type
ud.dropna(subset=['ExternalID'], inplace=True)
ud['ExternalID'] = ud['ExternalID'].astype(int)

udF = pd.merge(IDFrequency, ud[['ExternalID','Age','Gender','Condition']], on='ExternalID', how='left')

In [None]:
udF["Frequency"] = pd.to_numeric(udF["Frequency"])

print(udF)

In [None]:
udF.dropna(subset=['Age'], how='all', inplace = True)

In [None]:
print(udF)
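To see how many IDs fail to match in a merge like this, pandas can label each row with where it came from. The sketch below uses an outer merge with indicator=True on two made-up frames. The IDs and values are illustrative only.

```python
import pandas as pd

# Made-up click frequencies and user details sharing some ExternalIDs.
freq = pd.DataFrame({"ExternalID": [1, 2, 3], "Frequency": [120, 45, 7]})
users = pd.DataFrame({"ExternalID": [1, 3, 4], "Age": [71, 64, 80]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(freq, users, on="ExternalID", how="outer", indicator=True)
summary = merged["_merge"].value_counts()
print(summary)
```

Rows marked 'left_only' are clicks from IDs with no user record, and 'right_only' rows are users with no recorded clicks.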

I wanted to see if a user's age had anything to do with how frequently they use the tablet.

In [None]:
udFAge = udF[['Age', 'Frequency']]
print(udFAge)

In [None]:
udF.plot.scatter(x = 'Age', y = 'Frequency')
print('Figure 8: Age Frequency')

This chart shows no clear correlation. You could argue from this chart that people aged 50 to 70 are the most frequent users.
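Reading a scatter plot by eye can be backed up with a correlation coefficient. The sketch below computes Pearson's coefficient and a rank-based (Spearman) coefficient on made-up numbers, so the values say nothing about the real users.

```python
import pandas as pd

# Made-up ages and click counts, chosen to fall steadily but not in a straight line.
usage = pd.DataFrame({"Age": [50, 60, 70, 80], "Frequency": [900, 300, 200, 100]})

# Pearson measures how close the points are to a straight line.
pearson = usage["Age"].corr(usage["Frequency"])
# Spearman is Pearson applied to the ranks, so it measures any steady rise or fall.
spearman = usage["Age"].rank().corr(usage["Frequency"].rank())

print(pearson, spearman)
```

A value near 0 for both would support the reading that age and frequency are unrelated.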

In [None]:
udF['Condition'].value_counts().plot(kind='bar', figsize=(15, 10))
print('Figure 9: Condition Narrow')

In [None]:
# Conditions to compare, in the order they appear in Figure 10
conditions = ['Cancer', 'Dementia', 'Spinal Injury', 'Diabetes', "Huntington's",
              'Lifelong Mobility Issues', 'Arthritis', 'Asthma', 'Muscular Dystrophy',
              'Multiple Sclerosis', 'Learning Difficulties', 'Spina Bifida', 'Stroke',
              'Brain Injury', 'Epilepsy', 'Elderly Care/Support', 'Cerebral Palsy']

conditionTop = []
for condition in conditions:
    # Click counts for every user with this condition
    frequencies = udF.loc[udF['Condition'] == condition, 'Frequency'].tolist()
    print(condition + ':', frequencies)
    # Mean number of clicks per user with this condition
    average = sum(frequencies) / len(frequencies)
    print(condition + ' Average:', average)
    conditionTop.append((condition, average))

In [None]:
labels, ys = zip(*conditionTop)
xs = np.arange(len(labels))
width = 0.8
plt.figure(figsize=(20,10))
plt.bar(xs, ys, width, align='center')
# Label each bar with its condition name, rotated so the labels do not overlap
plt.xticks(xs, labels, rotation=-45, ha='left')
plt.yticks(ys)
print('Figure 10: Condition Frequency')
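The per-condition averages can also be produced with a groupby, which reports the group size next to the mean. That matters here because an average over a handful of users is easy to over-read. This is a minimal sketch on made-up numbers, not the real results.

```python
import pandas as pd

# Made-up users and click counts, for illustration only.
usage = pd.DataFrame({
    "Condition": ["Arthritis", "Arthritis", "Stroke", "Stroke", "Stroke"],
    "Frequency": [900, 700, 100, 50, 150],
})

# Number of users and mean clicks per condition, highest mean first.
summary = (
    usage.groupby("Condition")["Frequency"]
    .agg(["count", "mean"])
    .sort_values("mean", ascending=False)
)
print(summary)
```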

I worked out the mean frequency of users by gender.

In [None]:
udFemale = []
for i in udF.index:
    if udF.at[i, 'Gender'] == 'F':
        udFemale.append(udF.at[i, 'Frequency'])
print(udFemale)

In [None]:
FemaleFSum = 0
for i in udFemale:
    FemaleFSum = FemaleFSum + i
AverageFFemale = FemaleFSum/len(udFemale)
print(AverageFFemale)

In [None]:
udMale = []
for i in udF.index:
    if udF.at[i, 'Gender'] == 'M':
        udMale.append(udF.at[i, 'Frequency'])
print(udMale)  

In [None]:
MaleFSum = 0
for i in udMale:
    MaleFSum = MaleFSum + i
    
AverageFMale = MaleFSum/len(udMale)
print(AverageFMale)

In [None]:
top=[('Female',AverageFFemale),('Male',AverageFMale)]

labels, ys = zip(*top)
xs = np.arange(len(labels)) 
width = 0.8

plt.bar(xs, ys, width, align='center')

plt.xticks(xs, labels) #Replace default x-ticks with xs, then replace xs with labels
plt.yticks(ys)
print('Figure 11: Gender Frequency')

From this graph we can see that Female users are slightly more active than Male users.

# Reflect and Hypothesise

Reflection:
The data was not as clean as I initially thought. Every time I thought I had finished cleaning it, I would run into issues as soon as I tried to explore it. I do not think it is 100% clean now, but I believe I cleaned it to a sensible point. If there had been more time, I would have liked to ask the data holder more questions about the data, so I could fully understand the Categories and be 100% sure of my hypotheses. Since there was not a lot of numerical data, it was hard to create interesting diagrams like scatter graphs. However, I believe I have created the visualisations that the data holder was hoping to receive for their proposal.

I am not sure that each dataset was complete. There were many missing values in both sets, and once I had merged the datasets there were only 153 rows left. This could have been due to many factors: users not using the tablet, my dropping a lot of incomplete rows with NaN values, etc. This will have definitely skewed the results of my analysis.

Hypothesis 1: Football
Figure 5: Football
The users of this tablet are very interested in seeing Scottish Premiership football results. I can see this in figure 5. Most of the searched-for teams were Scottish. 4.3% of all clicks were football related. This is something that could be used to improve the tablet. Using a BBC News API, the developers could create a built-in dashboard or app that allows users to view football scores. I could test this hypothesis by conducting interviews with the users of the application to see if they use the tablet to look up football scores. Using more advanced algorithms I would probably find more clicks to do with football, such as differently worded searches or football-related games.

Hypothesis 2: Radio
Figure 3: Radio
The users are using the built-in Radio function. However, they may need more choice. Figure 3 clearly shows a huge trend towards Smooth Radio. I suspect this trend is there because this is the only radio station under the radio category. I think if more choice was given then users would be much more satisfied and would have more variety. Users are interested in the radio section, as about 3.7% of clicks were radio related. I would test this hypothesis by asking the data owner and looking at the tablet.

Hypothesis 3: Condition
Figure 9: Condition Narrow
Figure 10: Condition Frequency
People who can use other tech are not using the tablet frequently. I believe that frequency of use is very much dependent on the type of disability the user has. Comparing figures 9 and 10, I can see cases where a larger group of people with a certain condition use the tablet less than a smaller group who use it very frequently. For example, there are around 9 people with learning difficulties using the tablet, with an average frequency of about 100. However, there are about 5 people with Arthritis, with an average frequency of 900. I believe that this tablet is not suited to some of the people using it. The developers need to identify a specific target audience and tailor the tablet towards these people and their needs. This way the user experience will be better for the people who have no other tech alternative. I would test this hypothesis by conducting a questionnaire with the users, asking the data host and running more advanced algorithms on the dataset.