## DSCI 551 Project

Create an emulation-based system for distributed file storage and parallel computation. <br>
1. Building an emulated distributed file system (EDFS) <br>
- EDFS should support the following commands, similar to those in HDFS:
    - mkdir: create a directory in the file system, e.g., *mkdir /user/john*
    - ls: list the content of a given directory, e.g., *ls /user*
    - cat: display the content of a file, e.g., *cat /user/john/hello.txt*
    - rm: remove a file from the file system, e.g., *rm /user/john/hello.txt*
    - put: upload a file to the file system, e.g., *put(cars.csv, /user/john, k = # partitions)* will upload the file cars.csv to the directory /user/john in EDFS. **But note that the file should be stored in k partitions, and the file system should remember where the partitions are stored.** You should design a method to partition the data. You may also have the user indicate the method, e.g., hashing on a certain car attribute, in the put method.
    - getPartitionLocations(file): this method will return the locations of the partitions of the file.
    - readPartition(file, partition #): this method will return the content of partition # of the specified file. The partitioned data will be needed in the second task for parallel processing.
- **Note that EDFS should store the metadata about the file system** (similar to that in the NameNode of HDFS, but much simplified). **Metadata include the file system structure, attributes of files, and locations of the partitions storing the contents of files.** You can limit the type of files stored in the file system to a certain format, e.g., .csv or JSON.
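
The put, getPartitionLocations, and readPartition commands above can be sketched in memory as follows. This is only a minimal illustration, not the project's implementation: the class `TinyEDFS`, the helper `partition_rows`, the `.part<i>` location names, and the sample car data are all hypothetical, and hashing on one column with `zlib.crc32` is just one possible partitioning method.

```python
import csv
import io
import zlib


def partition_rows(csv_text, k, key_column):
    """Split CSV text into k partitions by hashing one column (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    partitions = [[] for _ in range(k)]
    for row in reader:
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        index = zlib.crc32(row[key_column].encode("utf-8")) % k
        partitions[index].append(row)
    return partitions


class TinyEDFS:
    """In-memory stand-in for EDFS; `metadata` plays the NameNode role."""

    def __init__(self):
        self.metadata = {}  # file path -> list of partition locations
        self.blocks = {}    # partition location -> list of rows

    def put(self, name, csv_text, directory, k, key_column):
        path = f"{directory.rstrip('/')}/{name}"
        locations = []
        for i, rows in enumerate(partition_rows(csv_text, k, key_column)):
            location = f"{path}.part{i}"  # assumed naming scheme
            self.blocks[location] = rows
            locations.append(location)
        self.metadata[path] = locations
        return path

    def get_partition_locations(self, path):
        return list(self.metadata[path])

    def read_partition(self, path, partition_number):
        return self.blocks[self.metadata[path][partition_number]]


# Made-up sample data for illustration only
sample = "make,price\nford,100\nbmw,300\nford,120\naudi,250\n"
fs = TinyEDFS()
path = fs.put("cars.csv", sample, "/user/john", k=3, key_column="make")
locations = fs.get_partition_locations(path)
all_rows = [row for i in range(3) for row in fs.read_partition(path, i)]
```

Because rows are assigned by hashing the key column, all rows with the same `make` land in the same partition, which can simplify group-by style processing in the second task.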
<br><br>

#### Google Firebase address: https://dsci551-project-52d43-default-rtdb.firebaseio.com/
### Statistical Capacity Indicators
###### Statistical Capacity Indicators provides information on various aspects of national statistical systems of developing countries, including an overall country-level statistical capacity indicator. Last updated: 02/03/2021
#### Data from: https://databank.worldbank.org/source/statistical-capacity-indicators#




In [68]:
import pandas as pd
import requests
import json

data = pd.read_csv('Data_Extract_From_Statistical_Capacity_Indicators/Stats_Cap_Ind.csv').dropna()\
.rename(columns = {
    "Country Name" : "country_name", 
    "Country Code" : "country_code", 
    "Series Name" : "series_name",
    "Series Code" : "series_code",
    "2004 [YR2004]" : "2004",
    "2005 [YR2005]" : "2005",
    "2006 [YR2006]" : "2006",
    "2007 [YR2007]" : "2007",
    "2008 [YR2008]" : "2008",
    "2009 [YR2009]" : "2009",
    "2010 [YR2010]" : "2010",
    "2011 [YR2011]" : "2011",
    "2012 [YR2012]" : "2012",
    "2013 [YR2013]" : "2013",
    "2014 [YR2014]" : "2014",
    "2015 [YR2015]" : "2015",
    "2016 [YR2016]" : "2016",
    "2017 [YR2017]" : "2017",
    "2018 [YR2018]" : "2018",
    "2019 [YR2019]" : "2019",
    "2020 [YR2020]" : "2020" })

In [69]:
# build lists of country names, country codes, series names, series codes, and the year columns
cname = data.country_name.unique().tolist()
ccode = data.country_code.unique().tolist()
sname = data.series_name.tolist()
scode = data.series_code.tolist()
years = [n for n in data.columns if n.isnumeric()]
# replace characters that are invalid in Firebase keys
cname2 = [sub.replace('.', '') for sub in cname]
sname2 = [sub.replace('/', '-') for sub in sname]
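
Firebase Realtime Database keys may not contain `.`, `$`, `#`, `[`, `]`, or `/`. The cell above handles only `.` in country names and `/` in series names; a more general cleanup could look like the sketch below. The helper `sanitize_key` and the example strings are hypothetical, and unlike the cell above it uses one replacement character for every forbidden symbol.

```python
import re

# Printable characters that the Firebase Realtime Database rejects in keys
FORBIDDEN = re.compile(r"[.$#\[\]/]")


def sanitize_key(name, replacement="-"):
    """Hypothetical helper: swap every forbidden character for `replacement`."""
    return FORBIDDEN.sub(replacement, name)


# Illustrative strings only, not taken from the dataset
examples = ["Congo, Dem. Rep.", "Births/Deaths [est.]", "Cost in $"]
cleaned = [sanitize_key(e) for e in examples]
```

Passing `replacement=""` drops the characters instead, which matches how the country names are treated in the notebook.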


In [70]:
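# build a nested dict: country -> year -> series name -> value
dc = dict()
for raw_name, key in zip(cname, cname2):
    # filter on the original country name; the sanitized name is only used as the Firebase key
    t = data[data['country_name'] == raw_name]
    # take series names from this country's own rows so names and values stay aligned
    series = [s.replace('/', '-') for s in t['series_name']]
    dc[key] = {y: dict(zip(series, t[y])) for y in years}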
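
To make the target structure concrete, the sketch below builds the same country -> year -> series -> value nesting from a few plain dictionaries instead of the DataFrame. The rows, country names, and series names here are made up for illustration and are not taken from the dataset.

```python
import json

# Made-up rows standing in for the cleaned DataFrame
rows = [
    {"country_name": "Country A", "series_name": "Overall score", "2004": "60", "2005": "62"},
    {"country_name": "Country A", "series_name": "Methodology", "2004": "50", "2005": "55"},
    {"country_name": "Country B", "series_name": "Overall score", "2004": "70", "2005": "71"},
]
years = ["2004", "2005"]

tree = {}
for row in rows:
    for year in years:
        # create the country and year levels on first use, then store the value
        tree.setdefault(row["country_name"], {}).setdefault(year, {})[row["series_name"]] = row[year]

payload = json.dumps(tree)
```

With this layout, everything for one country and year sits under a single path, e.g. `Country A/2004`.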

In [71]:
dc_json = json.dumps(dc)


In [72]:
putResponse = requests.put('https://dsci551-project-52d43-default-rtdb.firebaseio.com/data.json', dc_json)
putResponse

<Response [200]>
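
Once the data is uploaded, a single node can be addressed through the REST API by appending its path and `.json` to the database URL. The sketch below only builds such a URL; `node_url` is a hypothetical helper, and `Country A` is a placeholder key rather than a name from the dataset.

```python
from urllib.parse import quote

BASE_URL = "https://dsci551-project-52d43-default-rtdb.firebaseio.com"


def node_url(*keys, base=BASE_URL):
    """Hypothetical helper: REST URL of the node at data/<keys...>."""
    # percent-encode each key so spaces and commas are safe in the URL
    path = "/".join(quote(k, safe="") for k in ("data",) + keys)
    return f"{base}/{path}.json"


url = node_url("Country A", "2004")
# With network access, requests.get(url).json() would fetch that node.
```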