***DISCLAIMER (Read this carefully)***

Before you turn this assignment in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). Do NOT add any cells to the notebook!

Do not forget to submit both the notebook AND the files in the data/ subfolder according to the CoC!
Make sure you fill in any place that says YOUR CODE HERE or YOUR ANSWER HERE , as well as your name and group below:

# Assignment 2 (Group)
When carrying out a Data-Science project, screening and selecting appropriate data sources for the tasks at hand comes at the beginning. This assignment is about accessing and characterising potential data sources in teams of three. The teams have been randomly assigned. BEWARE! In Assignment 5, you will be asked to provide answers to those questions. Make sure that combining the two datasets makes sense from an analytical perspective!

-----
## Step 0 (2 points)

Find two data sets online (from one or several sources) that would be interesting to combine and create ***data citations*** as Python dictionaries. 

The data sets should fulfill the following requirements:

* Each data set must have a different file format (either CSV, XML, or JSON), please choose
  - one CSV file (dataset1) 
  - and one JSON or XML file (dataset2)

* The two datasets should not be two variations of each other (i.e. simply the same dataset for two different regions or timeframes or from the same source just in two different formats)
* Workable data-set sizes: The selected or extracted data sets should have thousands of entries (>= 1000), but not more than (<=) 10000 entries. (By "entries" we mean rows or distinguishable key-value pairs.) If larger, use an excerpt from the original data set. Justify in detail the extraction criteria in the markdown cell below and 
  1) add the code used for the extraction in the code cell or describe how you filtered the sample  
  2) make the extracted dataset also available at a downloadable URL (for instance in a Github repository, [here](https://raw.githubusercontent.com/AxelPolleres/simple_dataset_sharing_repo/main/test.csv)'s an example)
  3) name the new `resourceURL` in the data citation.
* You may start from (but you are not limited to) the resource collections hinted at [in the Unit 2 slides](https://datascience.ai.wu.ac.at/ws21/dataprocessing1/unit2.html#slide-53).

* Important: The use of datasets from kaggle.com and other curated collections of datasets with accompanying tutorials on processing and analysis (as highlighted to you in Unit 2) is **discouraged**. You are required to use **primary data sources**: This is mainly because we want you to work on data sets that have not been processed with some analysis in mind, so that you show that you can handle (messy) data sets harvested on the brownfields of Data Science. Besides, such curated datasets have been repeatedly used in ready-made case and tutorial work, which makes it basically impossible for us to establish whether your submissions are genuine contributions of yours. There is one viable option: Work backwards from the Kaggle data set to the original data source, obtain updated data from there, and start from there.


* Please adhere to the CoC.

[Data citations](http://blogs.nature.com/scientificdata/2016/07/14/data-citations-at-scientific-data/) must contain the following details:
- creator: provider organisation / author(s) of the data set, e.g. "Zentralanstalt für Meteorologie und Geodynamik (ZAMG)"
- catalogName: Name of the data repository and/or the Open Data portal used, e.g. "Open Data Österreich"
- catalogURL: URL of the repository / portal, e.g. "https://www.data.gv.at/"
- datasetID: (specific to the data repository), e.g. "https://www.data.gv.at/katalog/dataset/zamg_meteorologischemessdatenderzamg"
- resourceURL: a URL where the CSV, XML or JSON file can be downloaded, e.g. "https://www.football-data.co.uk/new/JPN.csv"
- pubYear: Dataset publication year, i.e. since when it is published, e.g. "2012"
- lastAccessed: when have you last accessed the dataset (i.e. datetime of accessing, obtaining a copy of the data set) in ISO Format? e.g. "2021-03-08T13:55:00"

One final note: as mentioned above, if you want to use a repository for your file download (e.g. github), you are allowed to do that. The most important part is that the URL can be accessed stably for each dataset you have chosen. 
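
As a quick sanity check before submitting, the required fields listed above can be verified programmatically. The following is a minimal sketch, not part of the assignment: the helper name `check_citation` is hypothetical, and the example values are simply the ones quoted in the field descriptions above.

```python
from datetime import datetime

# Keys required by the data-citation description above
REQUIRED_KEYS = ["creator", "catalogName", "catalogURL", "datasetID",
                 "resourceURL", "pubYear", "lastAccessed"]

def check_citation(citation):
    """Return a list of problems found in a data-citation dictionary (hypothetical helper)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in citation]
    if "lastAccessed" in citation:
        try:
            # Accepts ISO timestamps such as "2021-03-08T13:55:00"
            datetime.fromisoformat(citation["lastAccessed"])
        except ValueError:
            problems.append("lastAccessed is not in ISO format")
    return problems

example = {
    "creator": "Zentralanstalt für Meteorologie und Geodynamik (ZAMG)",
    "catalogName": "Open Data Österreich",
    "catalogURL": "https://www.data.gv.at/",
    "datasetID": "https://www.data.gv.at/katalog/dataset/zamg_meteorologischemessdatenderzamg",
    "resourceURL": "https://www.football-data.co.uk/new/JPN.csv",
    "pubYear": "2012",
    "lastAccessed": "2021-03-08T13:55:00",
}
print(check_citation(example))
```

An empty list means that all keys are present and the timestamp parses as ISO format.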

Store the data citation in a dictionary for each of the datasets:

In [1]:
# YOUR CODE HERE
dataset1= {
    "creator" : "U.S. Department of Treasury" ,
    "catalogName" : "U.S. Department of Treasury Data" ,
    "catalogURL" : "https://home.treasury.gov/" ,
    "datasetID" : "https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value_month=202404" ,
    "resourceURL" : "https://raw.githubusercontent.com/davide-229/WU-DA1/main/YieldCurveUSA.xml"  ,
    "pubYear" :  "2024"  ,
    "lastAccessed" : "2024-08-04T21:10:10"  ,
}

dataset2= {
    "creator" : "Yahoo Finance" ,
    "catalogName" : "Yahoo Finance/^NDX" ,
    "catalogURL" : "https://finance.yahoo.com/" ,
    "datasetID" : "https://finance.yahoo.com/quote/%5ENDX" ,
    "resourceURL" : "https://raw.githubusercontent.com/davide-229/WU-DA1/main/%5ENDX.csv"  ,
    "pubYear" : "2024"  ,
    "lastAccessed" : "2024-08-04T22:59:00"  ,
}

In [2]:
from nose.tools import assert_equal, assert_in, assert_true
import traceback
import sys
import os

assert_equal(type(dataset1), dict)
assert_equal(type(dataset2), dict)


Use the following structure for your answer below:

Data set 1

(Describe the source and the general content of the dataset and why you chose it)

Data set 2

(Describe the source and the general content of the dataset and why you chose it)

Project ideas

(Describe in your own words, which kind of tasks could be addressed by combining the selected data sets, esp. how the two data sets fit together and what complementary information they contain; Formulate a question that could be potentially answered by combining data from both datasets; how could the data sets be combined exactly? 250 words max. BEWARE! In Assignment 5, you will be asked to provide answers to those questions. Make sure that combining the two datasets makes sense from an analytical perspective!)

Data set 1

The first dataset is an XML file containing the yield curve (interest rate curve) of the United States. The values cover the period from 2013 to 2023 and are provided on a daily basis. The source of the data is the official website of the U.S. Department of the Treasury. The data was retrieved from the website following the instructions at this link: https://home.treasury.gov/treasury-daily-interest-rate-xml-feed. We created a loop to retrieve the data year by year and uploaded the result to a GitHub repository so that it is accessible as a single file.

The data correspond to the daily interest rates on U.S. Treasury securities for at least 12 maturities ranging from 1 month to 30 years. Note that the 2-month and 4-month maturities were introduced during the analyzed period, so no values exist for them in the earlier part.

The choice of this data and of the time period follows from our research question, which is presented in the “Project ideas” paragraph.

Code used to retrieve the data yearly in the selected period:

"""

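import urllib.request

all_data = ""
# Fetch one XML feed per year (2013 to 2023) and concatenate the pieces
for i in range(2013, 2024):
    url = "https://home.treasury.gov/resource-center/data-chart-center/interest-rates/pages/xml?data=daily_treasury_yield_curve&field_tdr_date_value=" + str(i)

    with urllib.request.urlopen(url) as f:
        year_data = f.read().decode("utf-8")
        # Drop the repeated header (first 665 characters) from every year after the first
        if i > 2013 : year_data = year_data[665:]
        # Drop the trailing 7 characters (closing root tag) from every year before the last
        if i < 2023 : year_data = year_data[:len(year_data)-7]

        all_data += year_data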
"""

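The idea behind the slicing above (keep the header only once, keep the closing root tag only once) can be illustrated offline. The sketch below is an assumption-laden simplification: `merge_feeds` is a hypothetical helper, the root tag is a plain `<feed>` without attributes, and the two strings are toy stand-ins rather than real Treasury data. Instead of fixed character offsets it searches for the root tags.

```python
import xml.etree.ElementTree as ET

def merge_feeds(docs, root_open="<feed>", root_close="</feed>"):
    """Concatenate XML documents that share the same root element (hypothetical helper)."""
    parts = []
    last = len(docs) - 1
    for i, doc in enumerate(docs):
        if i > 0:
            # Drop everything up to and including the opening root tag
            doc = doc[doc.index(root_open) + len(root_open):]
        if i < last:
            # Drop the closing root tag and anything after it
            doc = doc[:doc.rindex(root_close)]
        parts.append(doc)
    return "".join(parts)

# Toy stand-ins for two yearly downloads (not real Treasury data)
year_a = "<feed><entry>a1</entry><entry>a2</entry></feed>"
year_b = "<feed><entry>b1</entry></feed>"

merged = merge_feeds([year_a, year_b])
root = ET.fromstring(merged)  # raises ParseError if the result is not well-formed
print(len(root.findall("entry")))
```

Parsing the merged string is a cheap way to confirm that the concatenation did not break well-formedness.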
Data set 2

The second dataset contains the historical performance of the Nasdaq 100 in CSV format. The Nasdaq 100 (NDX) is one of the main U.S. stock indexes and contains the 100 largest non-financial companies listed on the Nasdaq Stock Exchange. It is a weighted index in which the weight of each company is based on its market capitalization (some rules apply to the biggest companies). The data was retrieved from Yahoo Finance, where historical financial data can be accessed and downloaded for a defined period. The file was then uploaded to a GitHub repository.

The dataset covers the same period as the first one (2013-2023) and provides the daily market performance and the volume of the index (with the obvious exception of days on which the stock market is closed). We are provided with “Open”, “High”, “Low”, “Close”, “Adjusted Close” and “Volume”. The first four are the values at which trading begins, reaches its highest point, reaches its lowest point, and concludes on a given trading day; “Adjusted Close” accounts for dividends and stock splits, and “Volume” measures the level of trading activity. As is normal practice for indexes, the values are given in index points.

Project ideas

The two datasets can easily be joined on the date attribute. The result is a snapshot of the performance of the U.S. financial markets (both the bond and the stock market) in the decade from 2013 to 2023.
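
A minimal sketch of such a join, under loud assumptions: both sources have already been reduced to `{date: value}` mappings with identically formatted ISO date strings, and all numbers below are made up for illustration. The helper names `join_on_date` and `pearson` are hypothetical.

```python
def join_on_date(yields, prices):
    """Inner join of two {date: value} mappings on their shared dates."""
    shared = sorted(set(yields) & set(prices))
    return [(d, yields[d], prices[d]) for d in shared]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Made-up values, only to show the mechanics
yields = {"2013-01-02": 1.0, "2013-01-03": 2.0, "2013-01-04": 3.0, "2013-01-05": 4.0}
prices = {"2013-01-02": 40.0, "2013-01-03": 30.0, "2013-01-04": 20.0, "2013-01-07": 10.0}

joined = join_on_date(yields, prices)
r = pearson([y for _, y, _ in joined], [p for _, _, p in joined])
print(len(joined), r)
```

Dates that appear in only one source (for example days on which only one market publishes values) are dropped by the inner join.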

Our idea is to investigate the correlation between interest rates and stock market performance. We selected the Nasdaq 100 because it is one of the major U.S. stock indexes and because it excludes financial companies, for which the correlation with interest rates might be more obvious. We chose the decade from 2013 to 2023 because it already covers very different interest rate environments. We stopped in 2023 so that the last year is complete.

The questions that we want to try to answer in our project are:

* What was the development of the U.S. yield curve in the last 10 years?
* How did the NDX perform in the last 10 years?
* Is there any correlation between the index performance and the interest rate?
* Are we presented with different levels of volatility in the index with different interest rate levels?
* What is the market reaction (from the index) to a change in interest rate?


------
## Step 1 - File Access (3 points)

Write a Python function `accessData` that takes the dataset dictionary created in step 0 as an input and returns an extended dictionary including following additions:

* Write code that accesses the dataset from its `resourceURL` using the python `requests` package:
 * detects whether it's an XML, CSV or JSON file by
     * checking whether the download URL **ends** with suffix "xml", "json", "csv" 
     * checking whether the "Content-Type" HTTP header field contains information about the format, hinting on XML, JSON or CSV, i.e., check whether the substring XML, JSON or CSV appears in the "Content-Type" header in either upper- or lowercase. 
 * Detects the file size from the HTTP header (converted to KB) of each data set, clearly documenting your actions (e.g. through commented code).

The result of the code below should extend your dictionaries `dataset1` and `dataset2` with two keys named 
* `"detectedFormat"` (which has one of the following values: `"XML"`, `"JSON"`, `"CSV"`, or `"unknown"`, if nothing could be detected from checking the suffix or HTTP header, or if the information in both was inconsistent)
* and `"filesizeKB"` which contains the filesize in KB (conversion should be done according to decimal SI prefixes) from the number of bytes in the header information. If there is no respective header information, return 0.
* If the detected format is `"unknown"`, the expected filesize to be returned is also 0
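
The detection rules above can be tried out without any network access by separating them into small pure functions. This is only a sketch: the function names are hypothetical, the headers are passed as a plain dictionary, and treating "one check unknown, the other not" as consistent is one possible reading of the task.

```python
FORMATS = ("XML", "JSON", "CSV")

def format_from_suffix(url):
    """Detect the format from the end of the download URL."""
    for fmt in FORMATS:
        if url.endswith(fmt.lower()):
            return fmt
    return "unknown"

def format_from_header(content_type):
    """Detect the format from a Content-Type value, ignoring upper/lower case."""
    for fmt in FORMATS:
        if fmt.lower() in content_type.lower():
            return fmt
    return "unknown"

def reconcile(by_suffix, by_header):
    """Combine both checks; two different concrete formats count as inconsistent."""
    if by_suffix == by_header:
        return by_suffix
    if by_suffix == "unknown":
        return by_header
    if by_header == "unknown":
        return by_suffix
    return "unknown"

def size_kb(headers):
    """File size in KB (decimal SI: 1 KB = 1000 bytes), or 0 if the header is unusable."""
    try:
        return int(headers["Content-Length"]) / 1000
    except (KeyError, ValueError, TypeError):
        return 0

print(reconcile(format_from_suffix("https://example.org/data.csv"),
                format_from_header("text/plain; charset=utf-8")))
print(size_kb({"Content-Length": "2500"}))
```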


In [3]:
# YOUR CODE HERE 
import requests

def accessData(datadict):
    # YOUR CODE HERE
    #First part of the function 

    #Check type, method 1: suffix of the download URL

    url = datadict["resourceURL"]


    if url.endswith("xml"): type1 = "XML"
    elif url.endswith("csv"): type1 = "CSV"
    elif url.endswith("json"): type1 = "JSON"
    else: type1 = "unknown"

    
    #Check type, method 2: Content-Type HTTP header
    try:
        with requests.head(datadict["resourceURL"]) as response: 
            content = response.headers["Content-Type"]
        

        if "xml" in content or "XML" in content: type2 = "XML"
        elif "csv" in content or "CSV" in content: type2 = "CSV"
        elif "json" in content or "JSON" in content: type2 = "JSON"
        else:type2 = "unknown"
    
    except (KeyError):
            type2 = "unknown"

    #Compare the 2 results: both equal, one of them unknown, or two different types
    if type1 == type2: type0 = type1
    elif type1 == "unknown": type0 = type2
    elif type2 == "unknown": type0 = type1
    else: type0 = "unknown"
    
    #Second part of the function 
    if type0 == "unknown" : sizeKB = 0
    else:
        try:
            length = response.headers['content-length']
            length = int(length)
    
        except (ValueError, TypeError, KeyError):
    
            length = None

        if length is None: sizeKB = 0
        else:sizeKB = length / 1000
    
    #Adding the information to the dictionary
    
    datadict["detectedFormat"] =  type0
    datadict["filesizeKB"] = sizeKB
    

    return datadict

In [4]:
# Basic tests to see if your solution meets the foundational demands described in the task description
from nose.tools import assert_equal, assert_in, assert_true
dataset1= accessData(dataset1)
dataset2= accessData(dataset2)
assert_in(dataset1["detectedFormat"], ["XML", "JSON", "CSV", "unknown"])
assert_in(dataset2["detectedFormat"], ["XML", "JSON", "CSV", "unknown"])
assert_true(isinstance(dataset1["filesizeKB"], (int, float)))
assert_true(isinstance(dataset2["filesizeKB"], (int, float)))

In [5]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [6]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [7]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [8]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [9]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [10]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [11]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [12]:
### DO NOT DELETE OR CHANGE THIS CELL!

Please explain your findings, using the following structure for your answer below (in "other remarks" you can explain, for instance, why you think your code did not detect the correct format, if needed)

Data set 1

(format, size, other remarks)

Data set 2

(format, size, other remarks)

Data set 1

The function managed to correctly detect the format of the 2 files. However, if we look at the function more closely, we can see that “type2” is “unknown” in both cases. In fact, the output of this part of the code:

“””

    with requests.head(datadict["resourceURL"]) as response: 
        content = response.headers["Content-Type"]
print(content)

“””

is “text/plain” in both cases. We believe this is caused by the way GitHub serves the two uploaded files: even though they were uploaded with the correct extensions “CSV” and “XML”, they are delivered as plain text. This is supported by the fact that for other test datasets accessed directly from their original source, at least in our cases, type2 was also detected correctly.

We could investigate further why GitHub does not report the file format but, given the way the task is structured, it is enough that one of the 2 methods identifies the file type and that the two results are not inconsistent.

We also noticed that, in some cases, the file size from the header is slightly lower than the actual size in kB of the downloaded file. After some quick research we understood that there are several possible reasons for this (compression, encoding, network…), so we do not go into further detail here.
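
One of these reasons can be illustrated offline. When a server sends the body gzip-compressed, the `Content-Length` header counts the compressed bytes, not the size of the decoded file. Whether this is what happened for our two files was not verified; the sketch below only shows the effect on a made-up CSV snippet.

```python
import gzip

# Made-up CSV content, only to compare raw and compressed sizes
rows = ["Date,Open,High,Low,Close"]
rows += [f"2013-01-{day:02d},100.0,101.0,99.0,100.5" for day in range(1, 29)]
raw = "\n".join(rows).encode("utf-8")

# Size a server would report in Content-Length if it sent the body gzip-encoded
compressed = gzip.compress(raw)

print(len(raw) / 1000, "KB raw")
print(len(compressed) / 1000, "KB compressed")
```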

Data set 2

Same comments as for data set 1.

-----
## Step 2  (5 points) - Format Validation

Establish that the two data files obtained are well-formed according to the detected data format (CSV, JSON, or XML). That is, the syntax used is valid according to accepted syntax definitions. Are there any violations of well-formedness?


Proceed as follows (for each data file, in turn): according to the "suspected" data format from Step 1:

  1. Use an _online validator_ for CSV, XML, and JSON, respectively, to confirm whether the files you downloaded in Step 1 are well-formed for the respective file format, document your findings and modify the file as described: 

   a. **Case 1**: no well-formedness errors were detected: 
    * Generally describe at least 3 well-formedness checks that your data sets, depending on its "suspected" format (against the background knowledge of Unit 2) should fulfill;
    * Store a local copy of the file called `data_notebook-[notebook-nr.]_[name].[file extension]` in the `data/` subfolder
    * Create another local copy of your data file called `data_notebook-[notebook-nr.]_[name]-invalid.[file extension]` and introduce a selected well-formedness violation (one occurrence) therein;
    * document that the online validator you used finds the error you introduced

   b. **Case 2**: well-formedness errors occurred:
    * Document the occurrences by printing out the error message and describe the types of well-formedness violation that were reported to you.
    * Store a local copy called `data_notebook-[notebook-nr.]_[name]-invalid.[file extension]`  in the `data/` subfolder
    * Create another local copy called `data_notebook-[notebook-nr.]_[name].[file extension]`, of your data file that fixes the well-formedness violations therein manually.  
    
**Please note that the datasets in the `data/` subfolder are for documentation only. Do not access those for subsequent steps!**
    

  2. Write a Python function `parseFile(datadict, format)` that accesses the dataset from its `resourceURL`. The dataset should then be checked with the parser matching the parameter `format` as follows:
     * CSV: Returns `True`, if a consistent delimiter out of `",",";","\t"` can be detected, such that each row has the same (> 1) number of elements, otherwise False
     * JSON: Returns `True` if the file can be parsed with the `json` package, catching any parsing exceptions.
     * XML: Returns `True` if the file can be parsed with the `xmltodict` package, catching any parsing exceptions.
     * Returns `False` if any other format is supplied by the parameter.
     
In order to handle parsing exceptions and errors from the used packages, you can use [catching exceptions](https://docs.python.org/3/tutorial/errors.html), such that the program does not simply fail to check whether the file is parseable as the format specified in `format`    
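
The exception-catching pattern mentioned above can be sketched for the JSON case as follows. The function name `is_parseable_json` is hypothetical and the inputs are in-memory strings rather than downloaded files.

```python
import json

def is_parseable_json(text):
    """Return True if text parses as JSON, False otherwise (never raises on bad syntax)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

print(is_parseable_json('{"a": [1, 2]}'))
print(is_parseable_json('{"a": [1, 2}'))
```

The same structure carries over to the other formats: call the parser inside `try` and map the parser's exception type to `False`.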

Use the following structure for your answer in the cell below to document **Step 2.1**:

***Data set 1***

*(validator used, validation results, describe the modification to fix the file or to create an invalid version of it)*

***Data set 2***

*(validator used, validation results, describe the modification to fix the file or to create an invalid version of it)*


Data set 1

Starting from data set 1, we checked the correctness of the XML format (our suspected file format from Step 1) with the online validator at the following link:

https://jsonformatter.org/xml-validator

Output: “Valid XML”

no well-formedness errors were found.


3 well-formedness checks xml:

• For an XML document to be well-formed, every opening tag must have a corresponding closing tag.

• Elements must be nested correctly within each other. For example (with illustrative tag names), `<b><i>Name</i></b>` is correctly nested, while `<b><i>Name</b></i>` is not.

• Also, within a single element, each attribute name must be unique. For example, `<book title="XML Basics"/>` is valid, but `<book title="XML Basics" title="XML Basics"/>` is not.
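
These three checks can also be reproduced locally. The sketch below uses the standard library's `xml.etree.ElementTree` instead of the online validator or `xmltodict`; the helper name `is_well_formed` and the sample snippets are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on any well-formedness error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

# One valid snippet and one violation per check described above
samples = {
    "well-formed": "<feed><entry id='1'>x</entry></feed>",
    "missing closing tag": "<feed><entry id='1'>x</feed>",
    "wrong nesting": "<feed><a><b>x</a></b></feed>",
    "duplicate attribute": "<feed><entry id='1' id='2'>x</entry></feed>",
}
results = {name: is_well_formed(text) for name, text in samples.items()}
print(results)
```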


Since no error was found, we manually introduced one small error into each file.

For data set 1 (the XML file), we removed the first “</entry>” (the closing tag of the first entry) and then re-ran the same online validator:

Output:

“Invalid XML: This page contains the following errors: error on line 77740 at column 8: Opening and ending tag mismatch: entry line 10 and feed”



Data set 2

We then repeated the process with data set 2, our suspected CSV file. We used the validator at the following link:

https://toolkitbay.com/tkb/tool/csv-validator

Output: “File is Valid”

Also in this case no well-formedness errors were found.

3 well-formedness checks csv:

• The header and every record consist of one or more fields separated by commas.

• The file should maintain a consistent number of fields in every line.

• Fields can be optionally enclosed within double quotation marks.
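
The second and third check can be tried out with Python's `csv` module. In this sketch the helper `field_counts` is hypothetical and the CSV text is made up; it shows that a quoted field may contain the delimiter, and that swapping a single delimiter makes the field counts inconsistent.

```python
import csv

def field_counts(text, delimiter=","):
    """Number of fields in each line when parsed with the given delimiter."""
    return [len(row) for row in csv.reader(text.splitlines(), delimiter=delimiter)]

# Made-up data; the quoted field contains a comma but still counts as one field
valid = 'Date,Open,Note\n2013-01-02,100.5,"up, then flat"\n2013-01-03,101.0,flat'

# Replace only the first comma with a semicolon to break the consistency
broken = valid.replace(",", ";", 1)

print(field_counts(valid))
print(field_counts(broken))
```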

As no error was found here either, we manually introduced one small error.

For data set 2 (the CSV file), we replaced the first “,” (the delimiter) with “;” in the second line (the first data row), breaking the consistency of the file. We then checked the file in the online validator.

Output:

“Record #1 has error: wrong number of fields”


In [13]:
import requests
import csv
import json
import xmltodict

def parseFile(datadict, format):
    # YOUR CODE FOR STEP 2.2 HERE
    # YOUR CODE HERE

    # Trying to fetch the dataset from its URL
    resourceURL = datadict.get('resourceURL')
    if not resourceURL:
        return False  # No URL found
    
    try:
        resp = requests.get(resourceURL)
        resp.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching the dataset: {e}")
        return False

    if format == 'CSV':
        return validate_csv(resp.text)
    elif format == 'JSON':
        return validate_json(resp.text)
    elif format == 'XML':
        return validate_xml(resp.text)
    else:
        return False
     
# Helper functions to validate the format of the file   
# CSV validation
def validate_csv(resp):
    for delimiter in [',', ';', '\t']:
        reader = csv.reader(resp.splitlines(), delimiter=delimiter)
        try:
            rows = list(reader)
        except csv.Error:
            continue  # Try the next delimiter
        if rows and all(len(row) == len(rows[0]) and len(row) > 1 for row in rows):
            return True # Found a consistent delimiter
    return False
#Json validation
def validate_json(resp):
    try:
        json.loads(resp)
        return True
    except json.JSONDecodeError:
        return False
#XML validation
def validate_xml(resp):
    try:
        xmltodict.parse(resp)
        return True
    except xmltodict.expat.ExpatError:
        return False
    
  

In [14]:
from nose.tools import assert_equal, assert_in, assert_true
assert_equal([parseFile(dataset1, "XML"),
    parseFile(dataset1, "JSON"),
    parseFile(dataset1, "CSV"),
    parseFile(dataset2, "XML"),
    parseFile(dataset2, "JSON"),
    parseFile(dataset2, "CSV")].count(True), 2)

In [15]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [16]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [17]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [18]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [19]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [20]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [21]:
### DO NOT DELETE OR CHANGE THIS CELL!

-----
## Step 3 - Content analysis (5 points)

Similar to the Python function `parseFile(datadict,format)` above, now create a new Python function `describeFile(datadict)` that analyses the given file according to the respective format detected in Step 1 and returns a dictionary containing the following information:

* for CSV files: number of columns, number of rows, column number (from 0 to n) of the column which contains the longest text. You do not have to try to transform any string to integer or float, simply take the values as is from the csv file. That is, the resulting dictionary should have the following form:

    ```
    { "numberOfColumns": ...,
       "numberOfRows":  ... ,
       "longestColumn" : ... }
    ```

* for JSON files: number of different attribute names, nesting depth, length of the longest list appearing in an attribute value. That is, the resulting dictionary should have the following form:

    ```
    { "numberOfAttributes": ... ,
      "nestingDepth":  ... ,
      "longestListLength" : ... }
     ```

  Here the `longestListLength` should be set to 0 if no list appears. [Nesting depth](https://www.tutorialspoint.com/find-depth-of-a-dictionary-in-python) is defined as follows: 
   * a flat JSON object with only atomic attribute values has depth 1. 
   * a JSON attribute with another object as value (or another object as member of a list value!) increases the depth by 1
   * and so on.


* for XML files: number of different element and attribute names (i.e. the sum of both), nesting depth, maximum numeric value in the dataset. That is, the resulting dictionary should have the following form:

    ```
    { "numberOfElementsAttributes": ... ,
      "nestingDepth":  ... ,
      "maxNumericValue" : ... }
     ```

  Here the `maxNumericValue` should be set to 0 if there are no numeric values present. Nesting depth is defined as the nesting depth of elements.
  
For files that cannot be parsed with respective given format, the function should simply return an empty dictionary (`{}`).
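
The nesting-depth definition for JSON given above can be sketched as a short recursive function. The name `nesting_depth` and the sample objects are illustrative only; the input is assumed to be already parsed JSON.

```python
def nesting_depth(value):
    """Depth as defined above: every object adds one level, lists add none."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((nesting_depth(v) for v in value), default=0)
    return 0  # atomic value

flat = {"a": 1, "b": "x"}                  # only atomic values
nested = {"a": {"b": 1}}                   # object as attribute value
in_list = {"a": [1, {"b": [{"c": 2}]}]}    # objects as members of list values

print(nesting_depth(flat), nesting_depth(nested), nesting_depth(in_list))
```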

In [22]:
import codecs

def describeFile(datadict):

    #Output dictionary as global variable 
    global diz
    
    diz = {}
    
    #Take the info we need for the dictionary 
    format = datadict["detectedFormat"]
    resourceURL = datadict["resourceURL"]

    #Based on the format run the corresponding function

    if not resourceURL:
        return {} # No URL found

    try:
        resp = requests.get(resourceURL)
        resp.raise_for_status()
    except requests.exceptions.RequestException:
        return {}

    if format == 'CSV':
        return csv_analysis(resp,datadict)
    elif format == 'JSON':
        return json_analysis(resp,datadict)
    elif format == 'XML':
        return xml_analysis(resp, datadict)
    #Unknown or unsupported format
    return {}


#CSV function
def csv_analysis(resp_csv, datadict):
    
    #Parse the file
    try:
        dialect = csv.Sniffer().sniff(resp_csv.content[:5000].decode("utf-8"))
        delimiter = dialect.delimiter

        reader = csv.reader(resp_csv.text.splitlines(), delimiter=delimiter)

    except:
        return {}

    #Initialize variables and loop
    n_columns = 0
    n_rows  = 0
    
    #Length of the longest entry seen so far and the column it appears in
    longest_len = -1
    column_index_longest = 0
    for row in reader:
        n_rows += 1
        if len(row) > n_columns:
            n_columns = len(row)

        #Entries are kept as strings, so only their text length is compared
        for col_idx, element in enumerate(row):
            if len(element) > longest_len:
                longest_len = len(element)
                column_index_longest = col_idx

    diz["numberOfColumns"] = n_columns
    diz["numberOfRows"] = n_rows
    diz["longestColumn"] = column_index_longest

    return diz

#JSON

def json_analysis(resp_json,datadict):
    
    #Parse the file (data_dict is a dictionary)
    try:
        data_dict = resp_json.json()
    except:
        return {}

    #Initialize the variables
    max_level = 0
    current_level = 0
    keys = []
    max_ll = 0

    #Create two recursive functions to handle the nested objects and keep track of the variables we need
    
    #Function to handle dictionaries
    def dic(d, level):
        nonlocal  max_level, current_level, keys
        current_level = level
        if current_level > max_level:
            max_level = current_level

        for k in d.keys():
            keys.append(k)
            if isinstance(d[k], dict):
                dic(d[k], level + 1)
            elif isinstance(d[k], list):
                lista(d[k], level)

    #Function to handle lists
    def lista(x, level):
        nonlocal  max_level, max_ll
        current_ll = len(x)
        if current_ll > max_ll: max_ll = current_ll

        for el in x:
            if isinstance(el, list):
                lista(el, level)
            elif isinstance(el, dict):
                dic(el, level + 1)

    #Evaluate the function
    dic(data_dict,1)
    
    #Number of different keys
    n_att = len(list(set(keys)))


    diz["numberOfAttributes"] = n_att
    diz["nestingDepth"] = max_level
    diz["longestListLength"] = max_ll
    
    return diz


#XML
def xml_analysis(resp_xml, datadict):

    #Parse the file
    try:
        data_dict = xmltodict.parse(resp_xml.text)
    except:
        return {}

    #Initialize the variables and loop
    max_number = 0
    max_level = 0
    current_level = 0
    keys = []
   
    #Helper function to test if value is transformable into float
    def is_float(s):
        try:
            float(s)
            return True
        except:
            return False
        
    #Almost the same approach as in json, with recursive functions
    def dic(d, level):
        nonlocal max_number, max_level, current_level, keys
        current_level = level
        if current_level > max_level:
            max_level = current_level

        for k in d.keys():
            keys.append(k)
            if isinstance(d[k], dict):
                dic(d[k], level + 1)
            elif isinstance(d[k], list):
                lista(d[k],level)
            elif isinstance(d[k], str): 
                #Check if value is a number and transform it 
                t = 0
                if d[k].isdigit(): t = int(d[k])
                elif is_float(d[k]): t = float(d[k])
                
                if t > max_number:
                    max_number = t
                #Probably useless since every value is a string, but it doesn't hurt
            elif isinstance(d[k], (int,float)):
                if d[k] > max_number:
                    max_number = d[k]

    def lista(x, level):
        nonlocal max_number, max_level, current_level

        for el in x:
            if isinstance(el, list):
                lista(el,level)
            elif isinstance(el, dict):
                dic(el, level + 1)
            elif isinstance(el, str):
                #Same as in dic
                t = 0
                if el.isdigit(): t = int(el)
                elif is_float(el): t = float(el)
                
                if t > max_number:
                    max_number = t
                            
            elif isinstance(el, ( int, float)):              
                if el > max_number:
                    max_number = el
    
    #Evaluate the function
    dic(data_dict,1)
    
    #Number of different keys
    n_att = len(list(set(keys)))

    diz["numberOfElementsAttributes"] = n_att
    diz["nestingDepth"] = max_level
    diz["maxNumericValue"] = max_number

    return diz



In [23]:
from nose.tools import assert_equal, assert_in, assert_true
assert_equal(len(describeFile(dataset1)), 3)
assert_equal(len(describeFile(dataset2)), 3)

In [24]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [25]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [26]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [27]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [28]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [29]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [30]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [31]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [32]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [33]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [34]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [35]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [36]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [37]:
### DO NOT DELETE OR CHANGE THIS CELL!

In [38]:
### DO NOT DELETE OR CHANGE THIS CELL!

***Final check:***

Be sure to cross-check the results by
1) manually inspecting your chosen dataset and
2) comparing the results for plausibility against the results of your code... 

Describe your findings and use the following structure for your answer below:

*Data set 1*

(number and types of items etc., describe your findings)

*Data set 2*

(number and types of items etc., describe your findings)

Data set 1

The first data set is probably a bit more difficult to inspect. 

The nesting depth and the number of elements seem coherent with the file.

Regarding the largest number in the data set: once the file is parsed, all values are strings. However, knowing the nature of the data set and the likely intent of the task, we checked for each entry whether it represents a float or an integer and converted it accordingly. This of course implies some assumptions about the nature of the data, but we believe our decision is in line with the context of the exercise. 

The biggest numerical value corresponds to an id. 
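
For a plausibility check on a small scale, the three XML measures can be recomputed on a toy document. The sketch below is an alternative to our function, not a copy of it: it uses the standard library's `xml.etree.ElementTree` rather than `xmltodict`, the names `describe_xml` and `to_number` are hypothetical, and the sample document is made up. It reproduces the effect described above, where an id is the largest numeric value.

```python
import xml.etree.ElementTree as ET

def to_number(text):
    """Convert a string to float, or return None if it is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def describe_xml(xml_text):
    root = ET.fromstring(xml_text)
    names, numbers = set(), []

    def walk(elem, depth):
        # Attribute names get an "@" prefix so they are counted separately from elements
        names.add(elem.tag)
        names.update("@" + attr for attr in elem.attrib)
        for text in [elem.text, *elem.attrib.values()]:
            number = to_number(text)
            if number is not None:
                numbers.append(number)
        # Depth of this subtree is the deepest level reached below this element
        return max([depth] + [walk(child, depth + 1) for child in elem])

    depth = walk(root, 1)
    return {"numberOfElementsAttributes": len(names),
            "nestingDepth": depth,
            "maxNumericValue": max(numbers, default=0)}

sample = "<feed><entry id='7'><rate>4.25</rate><date>2023-01-03</date></entry></feed>"
print(describe_xml(sample))
```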


Data set 2

The second data set (CSV) was probably easier to check manually. We opened the file in Excel and found that the dimensions of the file are the same as those returned by our function: 2769 rows and 7 columns.

As with data set 1, all entries are read as strings. From inspecting the data set and knowing its content, it is obvious that it consists of a date column while all other columns hold numerical values. However, based on the task description as well as on feedback from one of the tutorial sessions, we left all values as strings and looked for the index of the column containing the longest one.
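
The same kind of cross-check can be done on a tiny in-memory CSV whose answer is obvious by eye. This is a sketch under assumptions: `describe_csv` is a hypothetical stand-in for our function, the values are made up, and the header line is counted as a row.

```python
import csv

def describe_csv(text, delimiter=","):
    rows = list(csv.reader(text.splitlines(), delimiter=delimiter))
    longest_len, longest_col = -1, 0
    for row in rows:
        for col, value in enumerate(row):
            # Values stay strings; only their text length is compared
            if len(value) > longest_len:
                longest_len, longest_col = len(value), col
    return {"numberOfColumns": max((len(r) for r in rows), default=0),
            "numberOfRows": len(rows),
            "longestColumn": longest_col}

# Made-up values; the longest text is the 11-character volume in column 2
sample = ("Date,Open,Volume\n"
          "2013-01-02,2700.5,12345678900\n"
          "2013-01-03,2710.25,987654321")
print(describe_csv(sample))
```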

