# H1 Big Data - Term Assignment 1
__Prof. Dr. Fabian Transchel, Hochschule Harz, 29.04.2021__
ftranschel@hs-harz.de

Submission is due May 30, 12 pm CEST. (Hand in via email to ftranschel@hs-harz.de or upload to StudIP.)

## Assignment and Logistics

__Grading__

This assignment counts for 40% of the overall course grade. The second assignment makes up the remaining 60%.

__Preface and data generation__

The assignment consists of a set of questions about a dataset that is created on the fly in the preface of this Jupyter notebook. The generation code may not be altered and the random seed must stay the same. In addition, the dataset is saved to disk, so if you want to use the R language instead of Python, you can import the dataset into R.

The submission must follow these guidelines: If it is written in Python, you are strongly advised to keep the notebook format. If the notebook format is not kept, the submitted code must still be executable. Each line of code and/or expression must be explained in sufficient detail.

## Dataset preparation
<div style="font-weight:bold; color:red;">DO NOT ALTER THIS SECTION</div>

The purpose of this section is to create a substantially big (albeit synthetic) dataset without the need to store it online or download it from the internet. It resides on your local machine only. To ensure that the generated dataset is correct, you must not change the random seed.

We will synthesize a large customer database with both structured and unstructured parts. You will then be asked to answer questions about this dataset.

In [2]:
import numpy as np
import pandas as pd
import time

prng = np.random.RandomState(987654321) # Do NOT change this
df_users = pd.DataFrame(columns = ["ID","Surname","Name","Age","Subscription Date"])
number = 100000
names = ["Hans","Jordi","Franz","Timothy","Agaba","Ali","Sarah","Josie","Robert","Francine","Anna","Zoe","Simon","Thomas","Andreas","Alok","Lee","Jean-Luc"]
surnames = ["Mueller","Meier","Smith","Gwahsi","Thronton","Wellington","Stephenson","Pomme","Di Lillo","Bond","Kirk","Picard","Roth","Beierlorzer"]

def normalize_age(age,lower=16.0,upper=99.0):
    if(age < lower):
        return lower
    elif(age > upper):
        return upper
    return age

start = time.time()
userid = list(range(1,number+1))
surname = prng.choice(surnames,number)
name = prng.choice(names,number)
age = prng.normal(35.,10.,number)
# Normalize age to between 16 and 99:
na = np.vectorize(normalize_age)(age) # ok, but slow
sub_date = 1588150183 + prng.normal(10000.,5000.,number)
df_users = pd.DataFrame(np.array([userid,surname,name,na,sub_date]).T.tolist())
df_users.columns = ['ID','Surname','Name','Age','Subscription_Date']
print(df_users.info())

df_users.to_csv("user_table.csv") # You may import this file to R if you wish to work with that language.

# Next, we will create a set of posts for each user. To this end, we will synthesize text from the "Lorem_ipsum.txt" file.

with open("Lorem_ipsum.txt") as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]

post_number = 1000000
post_id = list(range(1,post_number+1))
snippet_index = list(range(len(content)))
posts = prng.choice(snippet_index,post_number,replace=True)
posts_uids = prng.randint(1,number,post_number)
posts_date = 1588150183 + 5000000 + prng.normal(100000.,50000.,post_number)
posts_content = [ content[i] for i in posts]
df_posts = pd.DataFrame(np.array([post_id,posts_uids,posts_content,posts_date]).T.tolist())
df_posts.columns = ['PostID','UserID','Content','Timestamp']
df_posts.info()
df_posts.to_csv("postings_table.csv")
end = time.time()
print("Data Generation took ",(end - start)," seconds.")

## This is the end of the data generation.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   ID                 100000 non-null  object
 1   Surname            100000 non-null  object
 2   Name               100000 non-null  object
 3   Age                100000 non-null  object
 4   Subscription_Date  100000 non-null  object
dtypes: object(5)
memory usage: 3.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   PostID     1000000 non-null  object
 1   UserID     1000000 non-null  object
 2   Content    1000000 non-null  object
 3   Timestamp  1000000 non-null  object
dtypes: object(4)
memory usage: 30.5+ MB
Data Generation took  28.41692352294922  seconds.


In [3]:
df_users.head()

| | ID | Surname | Name | Age | Subscription_Date |
|---|---|---|---|---|---|
| 0 | 1 | Pomme | Thomas | 41.73935853733947 | 1588157171.1168716 |
| 1 | 2 | Smith | Alok | 54.83159935486658 | 1588167787.453688 |
| 2 | 3 | Pomme | Thomas | 30.86899706338393 | 1588154407.929673 |
| 3 | 4 | Stephenson | Zoe | 22.96813536200718 | 1588159018.037303 |
| 4 | 5 | Mueller | Robert | 33.13719431632408 | 1588163405.8847194 |


In [4]:
df_posts.head()

| | PostID | UserID | Content | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 87939 | Ut wisi enim ad minim veniam, quis nostrud exe... | 1593273255.028971 |
| 1 | 2 | 39527 | Ut wisi enim ad minim veniam, quis nostrud exe... | 1593265232.6049256 |
| 2 | 3 | 44380 | Nam liber tempor cum soluta nobis eleifend opt... | 1593277164.7842274 |
| 3 | 4 | 51170 | Ut wisi enim ad minim veniam, quis nostrud exe... | 1593262804.0625734 |
| 4 | 5 | 73516 | Consetetur sadipscing elitr, sed diam nonumy e... | 1593163351.6601837 |
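
Both `info()` outputs report every column as `object`, which suggests that even the numeric columns are held as text. The following is a minimal sketch of why that matters, using a few hand-picked ID values in plain Python rather than the real tables: strings compare character by character, so sorting or taking a maximum gives a different answer than it does for numbers.

```python
# A few IDs as they would look when stored as text (object dtype).
ids_as_text = ["1", "2", "10", "99999", "100000"]

# String comparison is lexicographic: "10" sorts before "2".
text_sorted = sorted(ids_as_text)
text_max = max(ids_as_text)

# Converting to int first restores the numeric ordering.
ids_as_int = [int(x) for x in ids_as_text]
numeric_sorted = sorted(ids_as_int)
numeric_max = max(ids_as_int)

print(text_sorted)
print(numeric_sorted)
print(text_max, numeric_max)
```

In pandas, the analogous step would be an explicit conversion, for example something like `df_users["ID"].astype(int)`, before sorting or comparing.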


# Tasks

## Task 1

* __What is the number of unique name combinations?__

* __Who is the oldest user, and who is the youngest?__

## Task 2

* __Who is the user with the most postings?__
* __Who has the fewest postings?__
* __Which user has "written" the most words?__
* __Which one has written the fewest?__

# Map-Reduce Example

To illustrate the map-reduce concept, we calculate the time elapsed since each user signed up for the service (the map step) and then aggregate the results, here by sorting them in ascending order.
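
Before working on the full table, the idea can be sketched on three users. This is only an illustration in plain Python: the timestamps and the reference time are copied from the outputs in this notebook, and the names `subscriptions`, `elapsed`, `ranked` and `longest` are chosen for this sketch.

```python
from functools import reduce

# Fixed reference time and three subscription timestamps (stored as text).
date_now = 1620281559.4622517
subscriptions = {
    "Thomas Pomme (1)": "1588157171.1168716",
    "Alok Smith (2)": "1588167787.453688",
    "Thomas Pomme (3)": "1588154407.929673",
}

# Map step: transform every record independently of all the others.
elapsed = list(map(lambda kv: (kv[0], date_now - float(kv[1])), subscriptions.items()))

# Aggregate step: combine the mapped records, here by sorting ascending.
ranked = sorted(elapsed, key=lambda item: item[1])

# A reduce in the narrower sense folds all records into a single value,
# for example the user with the longest elapsed time.
longest = reduce(lambda a, b: a if a[1] >= b[1] else b, elapsed)

for user, seconds in ranked:
    print(user, seconds)
print("Longest:", longest[0])
```

Because the map step touches each record in isolation, it could be split across several workers; only the aggregation needs to see all results together.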

In [37]:
df_users

| | ID | Surname | Name | Age | Subscription_Date |
|---|---|---|---|---|---|
| 0 | 1 | Pomme | Thomas | 41.73935853733947 | 1588157171.1168716 |
| 1 | 2 | Smith | Alok | 54.83159935486658 | 1588167787.4536881 |
| 2 | 3 | Pomme | Thomas | 30.86899706338393 | 1588154407.929673 |
| 3 | 4 | Stephenson | Zoe | 22.96813536200718 | 1588159018.037303 |
| 4 | 5 | Mueller | Robert | 33.13719431632408 | 1588163405.8847194 |
| ... | ... | ... | ... | ... | ... |
| 99995 | 99996 | Pomme | Francine | 29.48464389186777 | 1588156466.8303325 |
| 99996 | 99997 | Di Lillo | Franz | 30.617851689718172 | 1588159346.5086892 |
| 99997 | 99998 | Bond | Jordi | 31.903380091433522 | 1588167207.584251 |
| 99998 | 99999 | Kirk | Francine | 35.877065970124384 | 1588163493.1022456 |


In [38]:
date_now = float(time.time()) # Getting current time
print("Current Time is:",str(date_now))

Current Time is: 1620281559.4622517
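
The printed value is a Unix timestamp, i.e. seconds since 1970-01-01 UTC, and the `Subscription_Date` column appears to use the same unit. As a small sketch using only the standard library, such values can be turned into readable dates. The variable names are chosen for this illustration, and the second timestamp is the `Subscription_Date` of user 1.

```python
from datetime import datetime, timezone

date_now = 1620281559.4622517            # value printed above
first_subscription = 1588157171.1168716  # Subscription_Date of user 1

# Interpret both values as seconds since the Unix epoch, in UTC.
now_utc = datetime.fromtimestamp(date_now, tz=timezone.utc)
sub_utc = datetime.fromtimestamp(first_subscription, tz=timezone.utc)

print(now_utc.isoformat())
print(sub_utc.isoformat())
```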


In [40]:
ids = list(df_users.Name + " " + df_users.Surname + " (" + df_users.ID.values + ")")
ids

['Thomas Pomme (1)',
 'Alok Smith (2)',
 'Thomas Pomme (3)',
 'Zoe Stephenson (4)',
 'Robert Mueller (5)',
 'Agaba Meier (6)',
 'Alok Bond (7)',
 'Zoe Pomme (8)',
 'Zoe Bond (9)',
 'Hans Gwahsi (10)',
 'Hans Stephenson (11)',
 'Josie Meier (12)',
 'Zoe Stephenson (13)',
 'Franz Roth (14)',
 'Ali Di Lillo (15)',
 'Franz Kirk (16)',
 'Zoe Thronton (17)',
 'Alok Stephenson (18)',
 'Jean-Luc Mueller (19)',
 'Zoe Roth (20)',
 'Robert Smith (21)',
 'Ali Kirk (22)',
 'Sarah Kirk (23)',
 'Simon Di Lillo (24)',
 'Hans Picard (25)',
 'Zoe Thronton (26)',
 'Josie Kirk (27)',
 'Jean-Luc Stephenson (28)',
 'Alok Smith (29)',
 'Jean-Luc Roth (30)',
 'Josie Mueller (31)',
 'Ali Pomme (32)',
 'Alok Bond (33)',
 'Timothy Di Lillo (34)',
 'Zoe Smith (35)',
 'Jordi Kirk (36)',
 'Hans Roth (37)',
 'Zoe Stephenson (38)',
 'Thomas Meier (39)',
 'Simon Wellington (40)',
 'Jean-Luc Mueller (41)',
 'Thomas Smith (42)',
 'Josie Bond (43)',
 'Anna Smith (44)',
 'Josie Smith (45)',
 'Jordi Roth (46)',
 'Robert Be

In [45]:
(lambda x: date_now - float(x))(1588157171.1168716) # Calls an anonymous (lambda) function immediately to compute the difference between the current time and one subscription time

32124388.345380068
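
The result is a duration in seconds. As a brief sketch with the standard library, it can be expressed in days to make it easier to interpret. The value is copied from the output above.

```python
from datetime import timedelta

elapsed_seconds = 32124388.345380068  # result printed above

# timedelta splits the duration into whole days plus a remainder.
elapsed = timedelta(seconds=elapsed_seconds)
fractional_days = elapsed_seconds / 86400  # 86400 seconds per day

print(elapsed.days)
print(fractional_days)
```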

In [49]:
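sub_time = list(df_users.Subscription_Date.values) # Subscription timestamps as a plain list
map(lambda x: date_now - float(x),sub_time) # Lazily maps each subscription time to "time elapsed since subscription (compared to current time)"; nothing is computed until the map object is consumed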

<map at 0x26f84400c18>
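
The output is a map object rather than a list of numbers because `map` in Python 3 is lazy: it returns an iterator that computes values only when they are requested. A minimal sketch with two timestamps copied from the user table (the names `lazy`, `first_pass` and `second_pass` are chosen for this illustration):

```python
date_now = 1620281559.4622517
sub_time = ["1588157171.1168716", "1588167787.453688"]

# No subtraction has happened yet; this only creates the iterator.
lazy = map(lambda x: date_now - float(x), sub_time)

# Consuming the iterator triggers the computation.
first_pass = list(lazy)

# A map object can be consumed only once; a second pass yields nothing.
second_pass = list(lazy)

print(first_pass)
print(second_pass)
```

This is why the results are wrapped in `list(...)` before they are used any further.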

In [52]:
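sub_time = list(df_users.Subscription_Date.values) # Subscription timestamps as a plain list

# Map step: elapsed time per user, materialized as a NumPy array
sub_durations = np.array(list(map(lambda x: date_now - float(x),sub_time)))

# Pair each user label with its duration and display the result as a table
user_duration_dict = dict(zip(ids,sub_durations))
pd.DataFrame(user_duration_dict.items(),columns=["Users","Values"])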

| | Users | Values |
|---|---|---|
| 0 | Thomas Pomme (1) | 3.212439e+07 |
| 1 | Alok Smith (2) | 3.211377e+07 |
| 2 | Thomas Pomme (3) | 3.212715e+07 |
| 3 | Zoe Stephenson (4) | 3.212254e+07 |
| 4 | Robert Mueller (5) | 3.211815e+07 |
| ... | ... | ... |
| 99995 | Francine Pomme (99996) | 3.212509e+07 |
| 99996 | Franz Di Lillo (99997) | 3.212221e+07 |
| 99997 | Jordi Bond (99998) | 3.211435e+07 |
| 99998 | Francine Kirk (99999) | 3.211807e+07 |
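
The user labels contain the ID in parentheses, and this matters when they serve as dictionary keys: a dictionary keeps only one value per key, so two users with the same name would otherwise collapse into a single entry. A small sketch with three rows from the table above (the variable names are chosen for this illustration):

```python
names_only = ["Thomas Pomme", "Alok Smith", "Thomas Pomme"]
names_with_id = ["Thomas Pomme (1)", "Alok Smith (2)", "Thomas Pomme (3)"]
durations = [3.212439e+07, 3.211377e+07, 3.212715e+07]

# Duplicate keys: the later "Thomas Pomme" silently overwrites the earlier one.
by_name = dict(zip(names_only, durations))

# Unique keys: every user keeps their own entry.
by_name_and_id = dict(zip(names_with_id, durations))

print(len(by_name), len(by_name_and_id))
```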


In [53]:
# Aggregate step: sort the (user, duration) pairs by duration, ascending
pd.DataFrame(sorted(user_duration_dict.items(), key=lambda item: item[1]),columns=["Users","Values"])

| | Users | Values |
|---|---|---|
| 0 | Agaba Pomme (93978) | 3.210102e+07 |
| 1 | Sarah Gwahsi (1555) | 3.210164e+07 |
| 2 | Francine Kirk (69522) | 3.210188e+07 |
| 3 | Robert Mueller (50939) | 3.210274e+07 |
| 4 | Robert Stephenson (59951) | 3.210284e+07 |
| ... | ... | ... |
| 99995 | Robert Bond (44820) | 3.214113e+07 |
| 99996 | Ali Gwahsi (72341) | 3.214124e+07 |
| 99997 | Zoe Kirk (98014) | 3.214132e+07 |
| 99998 | Sarah Gwahsi (44648) | 3.214212e+07 |
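
If only the extremes are of interest, a full sort is not strictly necessary. As an alternative sketch (not the approach used in this notebook), the standard library's `heapq` module can pick the few smallest or largest entries directly. The dictionary below holds a handful of rows copied from the tables above, not the full result.

```python
import heapq

# A small sample of (user, elapsed seconds) pairs taken from the tables above.
user_duration_dict = {
    "Agaba Pomme (93978)": 3.210102e+07,
    "Sarah Gwahsi (1555)": 3.210164e+07,
    "Zoe Kirk (98014)": 3.214132e+07,
    "Sarah Gwahsi (44648)": 3.214212e+07,
    "Thomas Pomme (1)": 3.212439e+07,
}

# The two most recent subscribers (smallest elapsed time).
newest = heapq.nsmallest(2, user_duration_dict.items(), key=lambda item: item[1])

# The two longest-standing subscribers (largest elapsed time).
oldest = heapq.nlargest(2, user_duration_dict.items(), key=lambda item: item[1])

print(newest)
print(oldest)
```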
