<img src="https://drive.google.com/uc?id=1v7YY_rNBU2OMaPnbUGmzaBj3PUeddxrw" alt="ITI MCIT EPITA" style="width: 750px;"/>

___

# Data Preparation & Exploration

By: **Mohamed Fouad Fakhruldeen**, mohamed.fakhruldeen@epita.fr
____

## Session 01

#### Topics: 

* Introduction to data extraction and preparation (goal and trajectory)
* Install and present the Orange software
* Load several “real-life” CSV files with encoding, formatting, and escaping problems.
* Identify the types of the variables in each file (continuous, discrete).
* Correct the errors manually, then programmatically (using Java or Python).
* Transform the files to correct CSV-related errors, then save them in a clean version.
* Discover XML
* Discover XSD
* Discover XSLT
* Use either XSLT or a programming language to convert XML to CSV
* Discover JSON formatting
* Convert JSON to CSV 
* Discover ARFF formatting 
* Load ARFF data and Schema

____

### What's Data Exploration and Preparation

-   Data exploration and preparation is the first step of the machine learning life cycle.
-   It is one of the most important steps of the life cycle.
-   In this step, we need to identify the different data sources, as data can be collected from various sources such as **files, databases, or the internet.**
-   The first part includes the following tasks:
    -   Identify various data sources
    -   Collect data
    -   Integrate the data obtained from different sources
-   By performing these tasks, we get a coherent set of data, also called a **dataset.**
-   The quantity and quality of the collected data determine the quality of the output. More data generally allows more accurate predictions, provided the data are of good quality.
-   The goal of this step is to identify and correct all data-related problems (data in the real world are “dirty”).
-   In real-world applications, collected data may have various issues, including:
    -   Missing Values
    -   Duplicate data
    -   Invalid data
    -   Noise
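
The issues listed above can be spotted with a few lines of Python. The following is only a minimal sketch using the standard `csv` module; the sample data, the column names, and the rule "age must be a whole number" are assumptions made up for illustration.

```python
import csv
import io

# Hypothetical sample: one missing value, one duplicate row, one invalid number.
raw = """name,age
Ann,34
Bob,
Ann,34
Cid,abc
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Missing values: any field that is empty after stripping white space.
missing = [r for r in rows if any(v.strip() == "" for v in r.values())]

# Duplicate data: rows whose full content has already been seen.
seen = set()
duplicates = []
for r in rows:
    key = tuple(r.values())
    if key in seen:
        duplicates.append(r)
    seen.add(key)

# Invalid data (assumed rule): a non-empty age that is not a whole number.
invalid = [r for r in rows
           if r["age"].strip() and not r["age"].strip().isdigit()]

print(len(missing), len(duplicates), len(invalid))
```

Noise is harder to detect automatically, because it usually requires knowledge of the expected range or distribution of each variable.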

#### What's Data?

Data are individual pieces of factual information recorded and used for the purpose of analysis. 
They are the raw information from which statistics are created. 

Statistics are the results of data analysis: its interpretation and presentation. These results are often referred to as 'statistical data'.

##### Where to find data sets?

- [datahub](https://datahub.io/machine-learning/iris)
- [kaggle](https://www.kaggle.com/datasets)
- [socrata](https://opendata.socrata.com/)
- [awesome public datasets](https://github.com/awesomedata/awesome-public-datasets)
- [BigQuery public datasets](https://cloud.google.com/bigquery/public-data/)
- [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [Data.gov](https://www.data.gov/)
- [Academic Torrents](https://academictorrents.com/browse.php)
- [quandl](https://www.quandl.com/search)

You can search the web for more.

#### What's a Variable?

A variable is a characteristic of a unit being observed that may assume more than one of a set of values to which a numerical measure or a category from a classification can be assigned (e.g. income, age, weight, etc., and “occupation”, “industry”, “disease”, etc.). [Source](https://stats.oecd.org/glossary/detail.asp?ID=2857)

#### Variable Dependency

**The independent variable** is the cause. Its value is independent of other variables in your study.

**The dependent variable** is the effect. Its value depends on changes in the independent variable.


#### Categorical variables 

Categorical variables describe groups or things, like “breeds of dog” or “voting preference”. 
They are also known as qualitative variables. Categorical variables can be further categorized as:
- nominal: no order, e.g. gender, race
- ordinal: ordered series, e.g. a rating system
- dichotomous/binary: exactly two categories, e.g. Yes/No

#### Quantitative variables

Quantitative variables are numeric: counts, percentages, or measurements. 
They can be categorized as:
- Continuous: measured on a ratio or interval scale
- Discrete: counts
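
A first, rough way to tell quantitative columns from categorical ones is to test whether every value parses as a number. This is a sketch under that assumption only; the helper `infer_kind` and the sample columns are hypothetical, and numeric codes (postal codes, identifiers) would be wrongly reported as quantitative.

```python
def infer_kind(values):
    """Return 'quantitative' if every value parses as a number, else 'categorical'."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return "categorical"
    return "quantitative"


# Hypothetical columns, as they would arrive from a CSV file (all strings).
columns = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["85.0", "80.0", "83.0"],
    "windy": ["FALSE", "TRUE", "FALSE"],
}

kinds = {name: infer_kind(values) for name, values in columns.items()}
print(kinds)
```

Deciding between nominal, ordinal, continuous, and discrete still needs a human who knows what the column means.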

#### Variable transformations

There are two main variable transformations:

- From a continuous to a discrete variable
- From a quantitative to a qualitative variable
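
Both transformations can be illustrated on a few temperature readings. This is a minimal sketch; the bin width of 10 and the thresholds for the labels are arbitrary assumptions chosen for the example.

```python
temperatures = [85.0, 80.0, 83.0, 70.0, 68.0]

# Continuous -> discrete: map each reading to the lower bound of its
# 10-degree bin (assumed bin width).
binned = [int(t // 10) * 10 for t in temperatures]


# Quantitative -> qualitative: replace the number by a label
# (assumed thresholds).
def temperature_label(value):
    if value < 70:
        return "cool"
    if value < 80:
        return "mild"
    return "hot"


labels = [temperature_label(t) for t in temperatures]

print(binned)
print(labels)
```

Note that both transformations lose information: the original readings cannot be recovered from the bins or the labels.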

*For more, you can search for AP Statistics material.*
____

### CSV

-   CSV stands for “Comma-Separated Values”. 
-   CSV is a simple file format used to store tabular data, such as a spreadsheet or database. 
-   There may be an optional header line appearing as the first line of the file with the same format as normal record lines. 
    -   The header contains names corresponding to the fields in the file. 
    -   Also, it should contain the same number of fields as the records in the rest of the file. 
-   Each line of the file is a data record. 
-   All records should have the same number of fields, in the same order.
-   Each record consists of one or more fields, separated by commas.
-   CSV files use the extension `.csv`.
-   If the fields of data in your CSV file contain commas, you can protect them by enclosing those data fields in double quotes ("). 
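
The rules above (one header, the same number of fields in every record, quoted fields may contain commas) can be checked with the standard `csv` module. This is a minimal sketch on an in-memory sample; the column names and values are made up for illustration.

```python
import csv
import io

# Hypothetical sample: the first record has a comma inside a quoted field,
# the second record is missing one field.
raw = 'city,note,price\n"Paris","small, central",120\nLyon,85\n'

reader = csv.reader(io.StringIO(raw, newline=""))
header = next(reader)

records = []    # well-formed records as dictionaries
bad_lines = []  # line numbers whose field count differs from the header

for record in reader:
    if len(record) != len(header):
        bad_lines.append(reader.line_num)
    else:
        records.append(dict(zip(header, record)))

print(records)
print(bad_lines)
```

The quoted comma in `"small, central"` stays inside one field, while the short record is reported by its line number instead of being silently loaded.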

In [5]:
# Extra out of course scope
# Print the header line and the first five records of the raw file
with open("Files/Session01/CSV.csv", "r") as table:
    for i, row in enumerate(table):
        if i > 5:
            break
        print(row)


outlook,temperature,humidity,windy,play

b'sunny',85.0,85.0,b'FALSE',b'no'

b'sunny',80.0,90.0,b'TRUE',b'no'

b'overcast',83.0,86.0,b'FALSE',b'yes'

b'rainy',70.0,96.0,b'FALSE',b'yes'

b'rainy',68.0,80.0,b'FALSE',b'yes'



In [12]:
# Extra out of course scope
import pandas as pd

df = pd.read_csv("Files/Session01/CSV.csv", sep=",")
# shows top 10 rows
df.head(10)
#df.sample(5)

Unnamed: 0,outlook,temperature,humidity,windy,play
0,b'sunny',85.0,85.0,b'FALSE',b'no'
1,b'sunny',80.0,90.0,b'TRUE',b'no'
2,b'overcast',83.0,86.0,b'FALSE',b'yes'
3,b'rainy',70.0,96.0,b'FALSE',b'yes'
4,b'rainy',68.0,80.0,b'FALSE',b'yes'
5,b'rainy',65.0,70.0,b'TRUE',b'no'
6,b'overcast',64.0,65.0,b'TRUE',b'yes'
7,b'sunny',72.0,95.0,b'FALSE',b'no'
8,b'sunny',69.0,70.0,b'FALSE',b'yes'
9,b'rainy',75.0,80.0,b'FALSE',b'yes'


In [10]:
# Extra out of course scope
import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


#### Escaping in CSV

By default, the escape character for CSV-formatted files is `"` (double quote): a double quote inside a quoted field is written as two double quotes. Some tools let you choose another escape character; in Greenplum Database, for example, the ESCAPE clause of COPY, CREATE EXTERNAL TABLE, or gpload declares a different one. If the selected escape character is present in your data, you can use it to escape itself.

For example, suppose you have a table with three columns and you want to load the following three fields:

- Free trip to A,B
- 5.89
- Special rate "1.79"

The formatted row in your data file looks like this:
```
"Free trip to A,B","5.89","Special rate ""1.79"""
```
      

[source](https://gpdb.docs.pivotal.io/43320/admin_guide/load/topics/g-escaping-in-csv-formatted-files.html)
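
The same row can be produced and read back with Python's standard `csv` module, which doubles embedded quotes by default. This is a small sketch; `QUOTE_ALL` is chosen here only so that every field is quoted, as in the row shown above.

```python
import csv
import io

fields = ["Free trip to A,B", "5.89", 'Special rate "1.79"']

# Write the three fields as one CSV record, quoting every field.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(fields)
line = buffer.getvalue()
print(line, end="")

# Reading the record back restores the original values.
parsed = next(csv.reader(io.StringIO(line, newline="")))
print(parsed)
```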

#### Missing Values in CSV

A missing value appears either as an empty double-quoted field, ```..., "", ...```, or as nothing at all between two delimiters, ```...,,...```.
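
With Python's `csv` module, both notations are read as an empty string, so they can be handled in one place. A minimal sketch follows; the sample data and the helper `to_none` are assumptions for illustration.

```python
import csv
import io

# Hypothetical sample: record 1 uses "" for a missing comment,
# record 2 leaves the comment and the score empty.
raw = 'id,comment,score\n1,"",7\n2,,\n'

rows = list(csv.DictReader(io.StringIO(raw, newline="")))


def to_none(value):
    """Map an empty field to None so that missing values are explicit."""
    return None if value == "" else value


cleaned = [{key: to_none(value) for key, value in row.items()} for row in rows]
print(cleaned)
```

If the distinction between `""` and an empty field matters for your data, it has to be preserved by another tool or convention, because this reader does not keep it.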

#### CSV Decimal Separator

1. Comma separated with dot as decimal marker
```
SomeText , SomeNumber1 , SomeNumber2
CDD22345 , 0.0001 , 22456.12
CDD44455 , 55.112 , 100.2
CDD12349 , 10.1E-4 , 88.2
```
2. Semicolon separated with comma as decimal marker
```
SomeText ; SomeNumber1 ; SomeNumber2
CDD22345 ; 0,0001 ; 22456,12
CDD44455 ; 55,112 ; 100,2
CDD12349 ; 10,1E-4 ; 88,2
```
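
A file of the second kind can be read by setting the delimiter to `;` and converting the decimal comma before parsing each number. This is a minimal sketch on an in-memory sample; the helper `parse_decimal_comma` is a hypothetical name.

```python
import csv
import io

raw = (
    "SomeText ; SomeNumber1 ; SomeNumber2\n"
    "CDD22345 ; 0,0001 ; 22456,12\n"
    "CDD12349 ; 10,1E-4 ; 88,2\n"
)


def parse_decimal_comma(text):
    """Convert a number written with a decimal comma into a float."""
    return float(text.strip().replace(",", "."))


reader = csv.reader(io.StringIO(raw, newline=""), delimiter=";")
header = [name.strip() for name in next(reader)]
rows = [
    (name.strip(), parse_decimal_comma(first), parse_decimal_comma(second))
    for name, first, second in reader
]
print(header)
print(rows)
```

When pandas is available, `pd.read_csv(path, sep=";", decimal=",")` performs the same conversion while loading the file.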

Remember that for non-English characters you need to open and save files with the correct encoding (typically UTF-8) so that they are displayed correctly.

#### Orange3 CSV

_______

### XML

- The Extensible Markup Language (XML) is a simple text-based format for representing structured information: documents, data, configuration, books, transactions, invoices, and much more.
- XML defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
- XML is one of the most widely used formats for sharing structured information today.
- XML is often used to exchange data over the Internet.

#### XML Markup

<img src="Files/Session01/xml.png" alt="XML" style="width: 750px;"/>


<img src="Files/Session01/xml2.png" alt="XML2" style="width: 750px;"/>


In [18]:
# Extra out of course scope

import xml.dom.minidom

# Parse the file and print the tag name of the root element
doc = xml.dom.minidom.parse("Files/Session01/XML/sample2.xml")
print(doc.documentElement.tagName)

bookstore


#### XML Syntax

- An XML document must have exactly one root element.
- All elements must be closed or marked as empty.
- Empty elements can be closed as normal, `<happiness></happiness>`, or written in the short form `<happiness/>`.
- Attribute values must always be quoted.
- Tags are case-sensitive.
- Elements must be properly nested.
- White space in element content is preserved.
- The XML prolog (the `<?xml ... ?>` declaration) is optional. When present, it must come first, and because it is not an element it has no closing tag.
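
A parser rejects any document that breaks these rules. The sketch below uses the standard `xml.etree.ElementTree` module to test a few hand-written snippets; the helper `is_well_formed` and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "good": '<note id="1"><to>Ann</to><empty/></note>',
    "unclosed": "<note><to>Ann</note>",     # <to> is never closed
    "two_roots": "<a/><b/>",                # more than one root element
    "case_mismatch": "<Note></note>",       # tags are case-sensitive
    "unquoted_attribute": "<note id=1/>",   # attribute value not quoted
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(name, ok)
```

Well-formed only means that the syntax is correct; whether the content matches an agreed structure is checked separately with a DTD or an XML Schema.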

##### Prolog

##### root element

##### element

##### child elements

##### Attribute Values

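Attribute values must be enclosed in double quotes, or in single quotes if the value itself contains a double quote.

Attributes have some limitations compared with elements:

- they can't contain multiple values
- they can't contain a tree structure
- they are not easily expandable


There are 5 predefined entity references in XML. They are needed because characters such as `<` and `&` would otherwise be read as the start of markup:

| Entity reference | Character | Name |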
| ------------ | ---- | -------------- |
| ```&lt;```   | <    | less than      |
| ```&gt;```   | \>   | greater than   |
| ```&amp;```  | &    | ampersand      |
| ```&apos;``` | '    | apostrophe     |
| ```&quot;``` | "    | quotation mark |
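
In Python, the standard library can apply these entity references for you. The sketch below uses `xml.sax.saxutils.escape`, which handles `&`, `<`, and `>` by default; the extra mapping for quotes is passed explicitly, and the sample text is made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

text = 'price < 5 & name = "A"'

# &, < and > are escaped by default; quotes need an explicit mapping.
escaped = escape(text, {'"': "&quot;", "'": "&apos;"})
print(escaped)

# A parser turns the entity references back into the original characters.
element = ET.fromstring("<value>" + escaped + "</value>")
print(element.text)

# unescape() reverses the operation without parsing a document.
restored = unescape(escaped, {"&quot;": '"', "&apos;": "'"})
```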



##### Comments in XML

```
<!-- This is a comment -->
```

Two dashes (`--`) are not allowed inside a comment.

#### XML Element Naming syntax


Element names:

- are case-sensitive.
- must start with a letter or underscore.
- cannot start with the letters xml (or XML, or Xml, etc)
- can contain letters, digits, hyphens, underscores, and periods.
- cannot contain spaces.
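
The naming rules above can be approximated with a regular expression. This is a deliberately simplified sketch: it only accepts ASCII letters, whereas the XML specification also allows many Unicode characters (and colons, which are reserved for namespaces). The function name `is_valid_name` is hypothetical.

```python
import re

# First character: letter or underscore; then letters, digits,
# hyphens, underscores, and periods (ASCII-only simplification).
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_valid_name(name):
    """Check a candidate element name against the simplified rules."""
    if name.lower().startswith("xml"):
        return False  # names starting with xml, XML, Xml, ... are reserved
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["book", "_id", "price.eur-2", "2nd", "xmlData", "first name"]:
    print(candidate, is_valid_name(candidate))
```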


##### Metadata

Metadata (data about data) should be stored as attributes, and the data itself should be stored as elements.

##### XML DOM

The XML DOM defines a standard way for accessing and manipulating XML documents. 
It presents an XML document as a tree-structure.
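
The tree structure can be explored with Python's standard `xml.dom.minidom` module. The sketch below parses a small in-memory document (the bookstore content is made up for illustration), reads a few nodes, and changes an attribute.

```python
from xml.dom import minidom

xml_text = (
    '<bookstore>'
    '<book category="web"><title>Learning XML</title></book>'
    '</bookstore>'
)

doc = minidom.parseString(xml_text)

# Navigate the tree: document -> root element -> book -> title -> text node.
root = doc.documentElement
book = root.getElementsByTagName("book")[0]
title = book.getElementsByTagName("title")[0]
title_text = title.firstChild.data
category = book.getAttribute("category")
print(root.tagName, title_text, category)

# Manipulate the tree: change an attribute, then serialize the element.
book.setAttribute("category", "reference")
print(book.toxml())
```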

#### DTD

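A DTD (Document Type Definition) defines the structure and the legal elements and attributes of an XML document. It can be declared internally, inside the XML document itself, or externally, in a separate file that the document references.

With a DTD, independent groups of people can agree on a standard DTD for interchanging data.

An application can use a DTD to verify that XML data is valid.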

#### XSD : XML Schema

- An XML Schema describes the structure of an XML document.
- The XML Schema language is also referred to as XML Schema Definition (XSD).
- An XML Schema is a language for expressing constraints about XML documents. 
- Checking a document against a Schema is known as validating against that schema.


##### Why?

- In the XML world, hundreds of standardized XML formats are in daily use.
- An XML Schema makes it easier to:
    - describe allowable document content
    - validate the correctness of data
    - define data facets (restrictions on data)
    - define data patterns (data formats)
    - convert data between different data types

**XML Schemas secure data communication**

When sending data from a sender to a receiver, it is essential that both parties have the same "expectations" about the content.


The purpose of an XML Schema is to define the legal building blocks of an XML document:
- the elements and attributes that can appear in a document
- the number of (and order of) child elements
- data types for elements and attributes
- default and fixed values for elements and attributes


or reference DTD schema

[read more](https://www.w3schools.com/xml/schema_intro.asp)
____

### XSLT

XSL (eXtensible Stylesheet Language) is a styling language for XML: XSL = Style Sheets for XML.

XSLT (XSL Transformations) is a language for transforming XML documents into other formats (such as other XML documents, CSV, or HTML documents).

Three related languages are often used together:

- XSLT is a language for transforming XML documents.
- XPath is a language for navigating in XML documents.
- XQuery is a language for querying XML documents.

#### XPath

- XPath is a major element in the XSLT standard.
- XPath can be used to navigate through elements and attributes in an XML document.
- XPath uses path expressions to select nodes or node-sets in an XML document.
- XPath cannot be used stand-alone: it is always used in the context of a host language, whether that language is XSLT, Python, or some other language. 

Example: https://www.w3schools.com/xml/xml_xpath.asp


The path expressions look very much like the paths you use with traditional computer file systems:

| Expression | Description |
|---|---|
| nodename | Selects all nodes with the name "nodename" |
| / | Selects from the root node |
| // | Selects nodes in the document from the current node that match the selection, no matter where they are |
| . | Selects the current node |
| .. | Selects the parent of the current node |
| @ | Selects attributes |


https://www.w3schools.com/xml/xpath_syntax.asp
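
Several of these expressions can be tried from Python, here acting as the host language. The sketch uses the standard `xml.etree.ElementTree` module, which supports only a subset of XPath (for example, `@` can be used inside predicates but not to select attribute nodes directly); the bookstore sample is made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<bookstore>"
    '<book category="web"><title lang="en">Learning XML</title>'
    "<price>39.95</price></book>"
    '<book category="children"><title lang="en">Harry Potter</title>'
    "<price>29.99</price></book>"
    "</bookstore>"
)

# // : every <title> below the current node, at any depth
titles = [t.text for t in root.findall(".//title")]

# @  : filter on an attribute value inside a predicate
web_titles = [t.text for t in root.findall("./book[@category='web']/title")]

# [1]: positional predicate, the first <book> child
first_title = root.find("./book[1]/title").text

# .. : the parent of each <price> element
parents = [p.get("category") for p in root.findall(".//price/..")]

print(titles, web_titles, first_title, parents)
```

For full XPath 1.0 support in Python, a third-party library such as `lxml` is the usual choice.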

#### Example: Use XSLT to convert XML to CSV

In [22]:
# Extra out of course scope
# but still good to know

from lxml import etree

# Parse the stylesheet and the source document from their files
xslt_doc = etree.parse('Files/Session01/XSLT/XSLT sample_to CSV.xslt')
dom = etree.parse('Files/Session01/XSLT/XML sample.xml')

# Compile the stylesheet and apply it to the document
transform = etree.XSLT(xslt_doc)
result = transform(dom)

# Write the transformed text to a CSV file; the file is closed automatically
with open('Files/Session01/XSLT/output01.csv', 'w') as f:
    f.write(str(result))

In [23]:
# Extra out of course scope
import pandas as pd

df1 = pd.read_csv("Files/Session01/XSLT/output01.csv", sep=",")
df1.head(10)

Unnamed: 0,Title,Artist,Country,Company,Price,Year
0,Empire Burlesque,Bob Dylan,USA,Columbia,10.9,1985
1,Hide your heart,Bonnie Tyler,UK,CBS Records,9.9,1988
2,Greatest Hits,Dolly Parton,USA,RCA,9.9,1982
3,Still got the blues,Gary Moore,UK,Virgin records,10.2,1990
4,Eros,Eros Ramazzotti,EU,BMG,9.9,1997
5,One night only,Bee Gees,UK,Polydor,10.9,1998
6,Sylvias Mother,Dr.Hook,UK,CBS,8.1,1973
7,Maggie May,Rod Stewart,UK,Pickwick,8.5,1990
8,Romanza,Andrea Bocelli,EU,Polydor,10.8,1996
9,When a man loves a woman,Percy Sledge,USA,Atlantic,8.7,1987
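
The same conversion can also be written directly in a programming language, without XSLT. The sketch below uses only the standard library; the element names (`CATALOG`, `CD`, `TITLE`, `ARTIST`, `PRICE`) are assumptions, since the structure of the course's XML file is not shown here, and the output goes to an in-memory buffer instead of a file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input with assumed element names.
xml_text = """<CATALOG>
  <CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST><PRICE>10.90</PRICE></CD>
  <CD><TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST><PRICE>9.90</PRICE></CD>
</CATALOG>"""

columns = ["TITLE", "ARTIST", "PRICE"]
root = ET.fromstring(xml_text)

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
writer.writeheader()

# One CSV record per <CD>; a missing child element becomes an empty field.
for cd in root.findall("CD"):
    writer.writerow({name: cd.findtext(name, default="") for name in columns})

csv_text = buffer.getvalue()
print(csv_text, end="")
```

Using `csv.DictWriter` rather than joining strings by hand means that values containing commas or quotes are escaped correctly.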
