## DSCI 551 Project

Create an emulation-based system for distributed file storage and parallel computation. <br>
1. Building an emulated distributed file system (EDFS) <br>
- EDFS should support the following commands, similar to those in HDFS:
    - mkdir: create a directory in the file system, e.g., *mkdir /user/john*
    - ls: list the contents of a given directory, e.g., *ls /user*
    - cat: display the contents of a file, e.g., *cat /user/john/hello.txt*
    - rm: remove a file from the file system, e.g., *rm /user/john/hello.txt*
    - put: upload a file to the file system, e.g., *put(cars.csv, /user/john, k = # partitions)* will upload the file cars.csv to the directory /user/john in EDFS. **Note that the file should be stored in k partitions, and the file system should remember where the partitions are stored.** You should design a method to partition the data. You may also let the user indicate the method, e.g., hashing on a certain car attribute, in the put method.
    - getPartitionLocations(file): return the locations of the partitions of the file.
    - readPartition(file, partition #): return the content of partition # of the specified file. The partitioned data will be needed in the second task for parallel processing.
- **Note that EDFS should store the metadata about the file system** (similar to that in the NameNode of HDFS, but much simplified). **Metadata include the file system structure, attributes of files, and the locations of the partitions storing the contents of files.** You can limit the type of files stored in the file system to a certain format, e.g., .csv or JSON.
<br><br>
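
The `put` command described above asks for a file to be split into k partitions, optionally by hashing on an attribute. A minimal sketch of that idea, independent of Firebase, might look like the following. The function name `hash_partition`, the sample rows, and the choice of `zlib.crc32` as the hash are assumptions made for illustration, not part of the project specification.

```python
import csv
import io
import zlib

def hash_partition(csv_text, key_column, k):
    """Split CSV rows into k partitions by hashing one column (illustrative helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    partitions = {i: [] for i in range(k)}
    for row in reader:
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        bucket = zlib.crc32(row[key_column].encode('utf-8')) % k
        partitions[bucket].append(row)
    return partitions

# Made-up sample data, only to exercise the function
sample = (
    "make,model,year\n"
    "Ford,Focus,2012\n"
    "Ford,Fiesta,2015\n"
    "Toyota,Corolla,2018\n"
    "Honda,Civic,2020\n"
)
parts = hash_partition(sample, 'make', k=3)
for number, rows in parts.items():
    print(number, [r['model'] for r in rows])
```

Rows that share the same key value always land in the same partition, which is what makes a later lookup by that attribute cheap. The partition sizes depend on the data and can be uneven.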

#### Google Firebase address : https://dsci551-project-52d43-default-rtdb.firebaseio.com/
### Statistical Capacity Indicators 
###### Statistical Capacity Indicators provides information on various aspects of the national statistical systems of developing countries, including an overall country-level statistical capacity indicator. Last Updated: 02/03/2021
#### Data from : https://databank.worldbank.org/source/statistical-capacity-indicators# 



In [3]:
import pandas as pd
import numpy as np
import datetime
import requests
import csv
import json
import os
import re
from collections import OrderedDict

firebase_url = 'https://dsci551-project-52d43-default-rtdb.firebaseio.com/'

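def seek(path):
    # Build the Firebase REST URL; append '.json' only when it is missing,
    # so that url is always defined
    url = firebase_url + path
    if not url.endswith('.json'):
        url += '.json'

    try:
        return requests.get(url)
    except requests.RequestException as err:
        # returns None on failure, as before
        print('ERROR', err)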

### MKDIR


In [None]:
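def mkdir(path):
    if seek(path).json() is None:
        url = firebase_url + path + '.json'
        # Firebase does not keep empty nodes, so a directory needs a child.
        # The {'type': 'DIRECTORY'} marker is an assumption, chosen to mirror
        # the 'type': 'FILE' field written by file_mdata.
        data = json.dumps({'type': 'DIRECTORY'})
        r = requests.put(url, data)
        print(r.url)
    else:
        print('Directory', path, 'already exists')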

# requests.put('https://dsci551-project-52d43-default-rtdb.firebaseio.com/mk.json', '{"test":1}')

In [None]:
mkdir('NameNode/root/user') #Change to user input

### LS

In [None]:
def ls(path):
    # Prepend "NameNode/root/" to the Firebase request path if it is missing
    if not re.search('NameNode/root', path):
        path = 'NameNode/root/' + path

    # Query once and reuse the parsed response
    content = seek(path).json()
    if content is not None:
        for key in content.keys():
            print(key)
    else:
        print(" ")


In [None]:
ls('data2/China') # Change to user input

### RM

In [None]:
def rm(path):
    path = path.replace('.csv','')
    if seek(path).json() is None:
        print ('Directory not found')
    else:
        url = firebase_url + path + '.json'
        d = requests.delete(url)
        if d.status_code == 200:
            print(path, 'was successfully deleted')

In [None]:
rm('NameNode/root/user')
rm('DataNode')


In [None]:
rm('datasets/root/user')

### PUT

In [6]:
# cleans column names for firebase json object key
def varname (var):
    key = re.sub(r'[^A-Za-z0-9 ]+', '', var).replace(" ", "_")
    names = key if key != "" else "invalid_key"
    return names

def mtime():
#     to revert back
    #datetime.datetime.utcfromtimestamp(int(mtime)/1000).strftime('%Y-%-m-%-d %I:%M:%S') 
    return (datetime.datetime.now().timestamp()*1000)

def filesize(file): #file size in bytes
    return  os.path.getsize(file)

def indexing(dicts):
    dt = dict()
    for k,v in dicts.items():
        i = int(k.replace('p',''))
        dt[i] = v
    return dt
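
To make the helpers above concrete, the sketch below repeats `varname` and `indexing` so that it runs on its own, and adds a hypothetical `from_mtime` that reverts the millisecond timestamp produced by `mtime()`. The sample names and rows are made up for illustration.

```python
import datetime
import re

def varname(var):
    # same cleaning rule as above: drop punctuation, then spaces -> underscores
    key = re.sub(r'[^A-Za-z0-9 ]+', '', var).replace(" ", "_")
    return key if key != "" else "invalid_key"

def indexing(dicts):
    # 'p12' -> 12, so that partition rows can be sorted numerically
    return {int(k.replace('p', '')): v for k, v in dicts.items()}

def from_mtime(ms):
    # hypothetical inverse of mtime(): milliseconds since the epoch -> local datetime
    return datetime.datetime.fromtimestamp(ms / 1000)

print(varname("Cote d'Ivoire"))   # Cote_dIvoire
print(varname("Guinea-Bissau"))   # GuineaBissau
print(varname("$$$"))             # invalid_key

rows = indexing({'p10': 'row ten', 'p2': 'row two'})
print([rows[i] for i in sorted(rows)])   # ['row two', 'row ten']

now = datetime.datetime.now()
restored = from_mtime(now.timestamp() * 1000)
print(restored)
```

Sorting on the integer index matters: sorting the raw keys as strings would place `p10` before `p2`.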

In [9]:

def record_partition(path, country, filename, url):
    try:
        npath = firebase_url + path + "/" + filename + "/partitions.json"
    #     print (npath ,":", url)
        mdata = {country : url}
        putMeta = requests.patch(npath, json.dumps(mdata))
        if putMeta.status_code == 400: print(country)
    #     print (putMeta)
    except:
        print (country)

def file_mdata(path, file, filename):
    npath = firebase_url + path + "/" + filename + ".json"
    mdata = {'ctime': mtime(),
             'name': file,
             'type': 'FILE',
             'filesize':filesize(file)}
    putMeta = requests.patch(npath, json.dumps(mdata))
    

# partition by Country (Original plan)
def put(file, path):
    filename = file.replace(".csv","")
    path = 'NameNode/' + path

 
    # Create a dictionary that organizes the data into the correct JSON format.
    # Each row is grouped under the cleaned value of its first column.
    dc = dict()
    with open(file, encoding = 'utf-8') as csvfile:
        csvReader = csv.reader(csvfile)
        
        for index, row in enumerate(csvReader):
            cname = varname(row[0])
            n = 'p' + str(index)
            if cname in dc:
                dc[cname][n] = (';'.join(row))
            else:
                dc[cname]={n:(';'.join(row))}
    
    if seek(path + '/' +filename).json() is None:
        for key, val in dc.items():
            url = firebase_url + 'DataNode/' + key + '/' + filename + '.json'
            putResponse = requests.put(url, json.dumps(val))
            if putResponse.status_code == 200:
                record_partition (path, key, filename, putResponse.url)
            else:
                print (file, 'failed to upload at partition', key)
        
        output = file + ' was successfully uploaded to ' + path
        
        file_mdata(path, file, filename)
        #add metadata information.
    else:
        output = file + " already exists in " + path
            
        
    return output
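
As a way to see what `put` writes, the sketch below builds the same two structures in memory instead of sending them to Firebase: the per-country row dictionaries that go under `DataNode/`, and the metadata entry that goes under `NameNode/`. The function `plan_put`, the `BASE` URL, and the sample CSV are hypothetical, the `ctime` and `filesize` attributes are left out because no file on disk is involved, and no network request is made.

```python
import csv
import io
import re

BASE = 'https://example.invalid/'  # placeholder, not the project's database

def varname(var):
    key = re.sub(r'[^A-Za-z0-9 ]+', '', var).replace(" ", "_")
    return key if key != "" else "invalid_key"

def plan_put(csv_text, filename, path):
    """Return (datanode, namenode) dicts mirroring what put() would upload."""
    datanode = {}
    for index, row in enumerate(csv.reader(io.StringIO(csv_text))):
        # rows are grouped by the cleaned first column and keyed by global row index
        datanode.setdefault(varname(row[0]), {})['p' + str(index)] = ';'.join(row)

    # one location per partition, as recorded by record_partition()
    partitions = {key: BASE + 'DataNode/' + key + '/' + filename + '.json'
                  for key in datanode}
    namenode = {path + '/' + filename: {'name': filename + '.csv',
                                        'type': 'FILE',
                                        'partitions': partitions}}
    return datanode, namenode

sample = (
    "Country Name,Series Name,2020\n"
    "China,Primary completion,1\n"
    "Kenya,Primary completion,0\n"
    "China,National accounts base year,1\n"
)
datanode, namenode = plan_put(sample, 'demo', 'NameNode/root/user')
print(sorted(datanode))     # ['China', 'Country_Name', 'Kenya']
print(datanode['China'])
```

Note that the header row ends up in its own `Country_Name` partition, which is why the read functions fetch that partition separately to recover the column names.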
    

In [10]:
# filename = 'Stats_Cap_Ind.csv'
# filename = 'Human_Capital_Index.csv'
filename = 'Stats_Cap_Ind_Sample.csv'
path = 'root/user'
dc = put(filename, path)


In [11]:
dc

'Stats_Cap_Ind_Sample.csv already exists in NameNode/root/user'

### getPartition

In [18]:
def getPartitionLocation(file):
    file = file.replace(".csv","")
    path = "NameNode/root/" + file + "/partitions"
    # seek() already performs the GET request, so its response is parsed directly
    pdict = seek(path).json()
    
    return pdict

In [None]:
file = "user/Stats_Cap_Ind"
getPartitionLocation(file)


### readPartition

In [None]:
def readPartition(file, partition):
    pdict = getPartitionLocation(file)
    url = pdict[partition]
    # The header row is stored in its own 'Country_Name' partition;
    # derive its URL from the file name instead of hard-coding one file
    file_name = file.replace('.csv', '').split('/')[-1]
    columns = firebase_url + 'DataNode/Country_Name/' + file_name + '.json'
    rlist = [v for k, v in requests.get(columns).json().items()]
    getRead = indexing(requests.get(url).json())
    for key in sorted(getRead):
        rlist.append(getRead[key])
    return rlist
    
#     return requests.get(url).json()

In [None]:
a = readPartition('user/Stats_Cap_Ind', 'China') # returns a list of rows
# print(a)
df = pd.DataFrame(columns = a[0].split(';'), data=[row.split(';') for row in a[1:]])
df

### CAT

In [16]:
def cat(path):
    file = path.replace('.csv','')
    pdict = getPartitionLocation(file)
    data = dict()
    for k,v in pdict.items():
#         print(v)
        getPartition =requests.get(v).json()
        for key, val in getPartition.items():
            i = int(key.replace('p',''))
            data[i]=val.replace(';',',')
            
# Option 1: sort and return in a list
    ldata = list()
    for key in sorted(data):
        ldata.append(data[key])
    return ldata

# Option2: sort and return in a dictionary
#         data[k] = requests.get(v).json()
#     return (OrderedDict(sorted(data.items())))


def sprint(dct):
    for key in sorted(dct):
        print(dct[key])
#         with open('testcsv.csv','w') as csvOut:
#             csvOut.write(dct[key])
    
# df = pd.DataFrame.from_dict(r.json())



In [40]:
file = "user/Stats_Cap_Ind_Sample"
data = cat(file)
# print(data)
# sprint(data)
# df = pd.DataFrame.from_dict(data)

In [None]:
for d in data:
    print (d)
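
The reassembly step inside `cat` can be tried without a database. The sketch below takes partition contents shaped like the ones `put` stores (rows keyed by `p<index>`) and restores the original row order. `merge_partitions` and the sample data are assumptions for illustration.

```python
def merge_partitions(partitions):
    """Merge {partition: {'p<index>': 'a;b;c'}} back into ordered CSV lines."""
    data = {}
    for rows in partitions.values():
        for key, val in rows.items():
            # the numeric suffix is the row's position in the original file
            data[int(key.replace('p', ''))] = val.replace(';', ',')
    return [data[i] for i in sorted(data)]

# Made-up partition contents
stored = {
    'Country_Name': {'p0': 'Country Name;Series Name;2020'},
    'Kenya': {'p2': 'Kenya;Primary completion;0'},
    'China': {'p1': 'China;Primary completion;1',
              'p3': 'China;National accounts base year;1'},
}
merged = merge_partitions(stored)
for line in merged:
    print(line)
```

One caveat: joining fields with `;` and later replacing it with `,` is lossy when a field itself contains a comma or a semicolon, so a stricter version would probably write the rows back out with `csv.writer`.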

### mapPartition( )

In [36]:
def mapPartition(p, file):
    file_name = file.split('/')[-1]
    columns = f'https://dsci551-project-52d43-default-rtdb.firebaseio.com/DataNode/Country_Name/{file_name}.json'
    rlist =[ v for k, v in requests.get(columns).json().items()]
    readMap = indexing(requests.get(p).json())
    for key in sorted(readMap):
        rlist.append(readMap[key])
    return rlist 
    
# function to get year columns
def is_year (c):
    return any(char.isdigit() for char in c)    

def new_col(cols):
    # Shorten year headers to their first four characters; keep other headers
    renamed = list()
    for c in cols:
        if is_year(c):
            renamed.append(c[:4])
        else:
            renamed.append(c)
    return renamed
    
def to_df(data):
    df = pd.DataFrame(columns = data[0].split(';'), data=[row.split(';') for row in data[1:]])
    columns = new_col(df.columns.values)
    df.columns = columns
    df_melted = df.melt(id_vars=columns[:4], var_name='Year', value_name='Value')
    return df_melted
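
`to_df` reshapes each partition from one column per year to one row per (series, year) pair. The plain-Python sketch below shows the same wide-to-long reshaping on a single made-up row, without pandas. `melt_row` is a hypothetical helper and the header format is assumed; `to_df` itself relies on `DataFrame.melt`.

```python
def melt_row(header, row, id_count=4):
    """Turn one wide row into long records, one per year column."""
    ids = dict(zip(header[:id_count], row[:id_count]))
    records = []
    for name, value in zip(header[id_count:], row[id_count:]):
        # keep only the first four characters of the year header, as new_col does
        records.append({**ids, 'Year': name[:4], 'Value': value})
    return records

# Assumed header layout and made-up values
header = 'Country Name;Country Code;Series Name;Series Code;2019 [YR2019];2020 [YR2020]'.split(';')
row = 'China;CHN;Primary completion;5.51.01.08.primcomp;1;0'.split(';')
records = melt_row(header, row)
for record in records:
    print(record)
```

The values stay strings here, just as the melted DataFrame holds `object` columns until `Year` and `Value` are converted to numeric types.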

In [28]:
dir

'https://dsci551-project-52d43-default-rtdb.firebaseio.com/DataNode/Zimbabwe/Stats_Cap_Ind_Sample.json'

In [56]:
def read_dataset(file: str):
    partitions = getPartitionLocation(file)

    df_list = list()
    for country_name, url in partitions.items():
        if country_name == 'Country_Name': # holds only the column names; skip it
            continue
        rows = mapPartition(url, file)  # avoid shadowing the built-ins map and dir
        df_list.append(to_df(rows))
    return pd.concat(df_list), df_list
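
Since the partitions are meant to feed parallel processing in the second task, one possible way to fan a function out over them is a thread pool. The sketch below is an assumption about how that could look, not the project's implementation: `fetch` stands in for the HTTP read of one partition, and `FAKE_STORE` replaces Firebase so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for Firebase: partition name -> rows keyed by 'p<index>' (made-up values)
FAKE_STORE = {
    'China': {'p1': 'China;CHN;1', 'p3': 'China;CHN;0'},
    'Kenya': {'p2': 'Kenya;KEN;1'},
}

def fetch(partition):
    # placeholder for requests.get(url).json() on one partition
    return FAKE_STORE[partition]

def count_rows(partition):
    # "map" step: runs independently for each partition
    return partition, len(fetch(partition))

def parallel_map(func, partitions, workers=4):
    # threads suit this workload because each task mostly waits on I/O
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, partitions))

results = parallel_map(count_rows, ['China', 'Kenya'])
total = sum(n for _, n in results)   # "reduce" step
print(results, total)
```

`pool.map` returns results in the order of its input, so the per-partition outputs can be combined without tracking which worker finished first.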

In [57]:
file = "user/Stats_Cap_Ind_Sample"
df_fin, df_list = read_dataset(file)
# df_final = pd.concat(df_list)

Afghanistan
Albania
Algeria
Angola
Antigua_and_Barbuda
Argentina
Armenia
Azerbaijan
Bangladesh
Belarus
Belize
Benin
Bhutan
Bolivia
Bosnia_and_Herzegovina
Botswana
Brazil
Bulgaria
Burkina_Faso
Burundi
Cabo_Verde
Cambodia
Cameroon
Central_African_Republic
Chad
Chile
China
Colombia
Comoros
Congo_Dem_Rep
Congo_Rep
Costa_Rica
Cote_dIvoire
Country_Name
Croatia
Djibouti
Dominica
Dominican_Republic
East_Asia__Pacific_excluding_high_income
Ecuador
Egypt_Arab_Rep
El_Salvador
Equatorial_Guinea
Eritrea
Eswatini
Ethiopia
Europe__Central_Asia_excluding_high_income
Fiji
Gabon
Gambia_The
Georgia
Ghana
Grenada
Guatemala
Guinea
GuineaBissau
Guyana
Honduras
IBRD_only
IDA__IBRD_total
IDA_total
India
Indonesia
Iran_Islamic_Rep
Iraq
Jamaica
Jordan
Kazakhstan
Kenya
Kiribati
Kosovo
Kyrgyz_Republic
Lao_PDR
Latin_America__Caribbean_excluding_high_income
Lebanon
Lesotho
Liberia
Madagascar
Malawi
Malaysia
Maldives
Mali
Marshall_Islands
Mauritania
Mauritius
Mexico
Micronesia_Fed_Sts
Middle_East__North_Africa_exclu

TypeError: cannot unpack non-iterable NoneType object

In [55]:
getPartitionLocation(file).items()

{'Afghanistan': 'https://dsci551-project-52d43-default-rtdb.firebaseio.com/DataNode/Afghanistan/Stats_Cap_Ind_Sample.json',
 'Albania': 'https://dsci551-project-52d43-default-rtdb.firebaseio.com/DataNode/Albania/Stats_Cap_Ind_Sample.json',
 'Algeria': 'https://dsci551-project-52d43-default-rtdb.firebaseio.com/DataNode/Algeria/Stats_Cap_Ind_Sample.json',
 'Angola': 'https://dsci551-project-52d43-default-rtdb.firebaseio.com/DataNode/Angola/Stats_Cap_Ind_Sample.json',
 'Antigua_and_Barbuda': 'https://dsci551-project-52d43-default-rtdb.firebaseio.com/DataNode/Antigua_and_Barbuda/Stats_Cap_Ind_Sample.json',
 'Argentina': 'https://dsci551-project-52d43-default-rtdb.firebaseio.com/DataNode/Argentina/Stats_Cap_Ind_Sample.json',
 'Armenia': 'https://dsci551-project-52d43-default-rtdb.firebaseio.com/DataNode/Armenia/Stats_Cap_Ind_Sample.json',
 'Azerbaijan': 'https://dsci551-project-52d43-default-rtdb.firebaseio.com/DataNode/Azerbaijan/Stats_Cap_Ind_Sample.json',
 'Bangladesh': 'https://dsci551-

In [53]:
df_fin.head()

|   | Country Name | Country Code | Series Name | Series Code | Year | Value |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | Primary completion | 5.51.01.08.primcomp | 2020 | 0.66667 |
| 1 | Afghanistan | AFG | National accounts base year | 2.01.01.02.nabase | 2020 | 1.0 |
| 2 | Afghanistan | AFG | Primary completion | 5.51.01.08.primcomp | 2019 | 0.0 |
| 3 | Afghanistan | AFG | National accounts base year | 2.01.01.02.nabase | 2019 | 0.0 |
| 4 | Afghanistan | AFG | Primary completion | 5.51.01.08.primcomp | 2018 | 0.0 |


In [54]:
df_list

[   Country Name Country Code                  Series Name  \
 0   Afghanistan          AFG           Primary completion   
 1   Afghanistan          AFG  National accounts base year   
 2   Afghanistan          AFG           Primary completion   
 3   Afghanistan          AFG  National accounts base year   
 4   Afghanistan          AFG           Primary completion   
 5   Afghanistan          AFG  National accounts base year   
 6   Afghanistan          AFG           Primary completion   
 7   Afghanistan          AFG  National accounts base year   
 8   Afghanistan          AFG           Primary completion   
 9   Afghanistan          AFG  National accounts base year   
 10  Afghanistan          AFG           Primary completion   
 11  Afghanistan          AFG  National accounts base year   
 12  Afghanistan          AFG           Primary completion   
 13  Afghanistan          AFG  National accounts base year   
 14  Afghanistan          AFG           Primary completion   
 15  Afg

In [42]:
df_fin.dtypes

Country Name    object
Country Code    object
Series Name     object
Series Code     object
Year            object
Value           object
dtype: object

In [13]:
df = pd.DataFrame(df_list)



In [14]:
df

Unnamed: 0,0
0,Country Name Country Code \ 0 Afghanis...
1,Country Name Country Code \ 0 Alba...
2,Country Name Country Code \ 0 Alge...
3,Country Name Country Code \ 0 Ang...
4,Country Name Country Code \ 0 ...
...,...
152,Country Name Country Code \ 0 W...
153,"Country Name Country Code \ 0 Yemen, R..."
154,Country Name Country Code \ 0 Zam...
155,Country Name Country Code \ 0 Zimba...
