# Types of data that are useful to data scientists

| Data Type | Example | Uses |
| :--------- | :------- | :---- |
| text | tweets, scripts, books | sentiment analysis, text generation, other natural language processing |
| JSON or XML | parsing APIs | gathering data, trend analysis, forecasting... |
| HTML | web scraping | gathering web-based document data, social media contacts... |
| images | computer vision | self-driving cars, medical imaging diagnostics |

## For today

- JSON: `json`
- XML: `xml`
- HTML: `xml` (`BeautifulSoup` not covered but good to be aware of)


### But first... a refresher on tabular data
**... and a small introduction to reading in `xlsx` files**

In [9]:
import pandas as pd

In [12]:
# using pandas to construct a dataframe from an xlsx file
puppies = pd.read_excel("../data/puppies.xlsx")
puppies.head()

|  | name | age | weight |
| :- | :--- | :-- | :----- |
| 0 | pippa | 6 | 8 |
| 1 | prairie | 6 | 5 |
| 2 | pippa | 24 | 12 |
| 3 | prairie | 24 | 10 |


In [16]:
# what if our file actually has multiple sheets?
puppies_desc = pd.read_excel("../data/puppies.xlsx", sheet_name="Sheet2")
puppies_weights = pd.read_excel("../data/puppies.xlsx", sheet_name="Sheet1")

In [17]:
puppies_weights

|  | name | age | weight |
| :- | :--- | :-- | :----- |
| 0 | pippa | 6 | 8 |
| 1 | prairie | 6 | 5 |
| 2 | pippa | 24 | 12 |
| 3 | prairie | 24 | 10 |


# XML and HTML
- `html`: hyper text markup language
- `xml`: extensible markup language
- hierarchical collections of elements
- generally consists of an opening tag, content and closing tag

Let's look at some HTML: [Wikipedia page for "Dogs"](https://en.wikipedia.org/wiki/Dog)
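
Because well-formed HTML follows the same opening tag, content, closing tag pattern, a small fragment can be read with the standard library's `xml.etree.ElementTree`. The sketch below uses a made-up fragment (it is not the actual Wikipedia markup), and it only works because the fragment is well-formed; real pages often are not, which is where a tool like `BeautifulSoup` helps.

```python
import xml.etree.ElementTree as et

# hypothetical, well-formed HTML fragment (not the real Wikipedia markup)
html = """
<html>
  <body>
    <h1 id="title">Dog</h1>
    <p>The <b>dog</b> is a domesticated animal.</p>
  </body>
</html>
"""

page = et.fromstring(html)

# find() takes a path of tag names relative to the element it is called on
heading = page.find("body/h1")
paragraph = page.find("body/p")

print(heading.tag, heading.attrib, heading.text)

# itertext() walks the text of an element and all of its descendants
print("".join(paragraph.itertext()))
```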


Let's look at some XML:

```xml
<dog>
    <name>Pippa</name>
    <age>10</age>
    <diet>
        <fooditem>kibbles</fooditem>
        <fooditem>pumpkin</fooditem>
    </diet>
</dog>
```

Or perhaps with attributes:

```xml
<dog name="Pippa" age="10">
    <diet>
        <fooditem name="kibbles"></fooditem>
        <fooditem name="pumpkin"></fooditem>  
    </diet>
</dog>
```
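
The two layouts carry the same information but are read differently in Python. As a minimal sketch using the standard library's `xml.etree.ElementTree` (the variable names here are illustrative only), element content comes back through `.text`, while attributes come back through `.attrib` or `.get()`:

```python
import xml.etree.ElementTree as et

dog_as_elements = """
<dog>
    <name>Pippa</name>
    <age>10</age>
    <diet>
        <fooditem>kibbles</fooditem>
        <fooditem>pumpkin</fooditem>
    </diet>
</dog>
"""

dog_as_attributes = """
<dog name="Pippa" age="10">
    <diet>
        <fooditem name="kibbles"></fooditem>
        <fooditem name="pumpkin"></fooditem>
    </diet>
</dog>
"""

first = et.fromstring(dog_as_elements)
second = et.fromstring(dog_as_attributes)

# values stored as element content are read through .text
name_from_text = first.find("name").text
foods_from_text = [item.text for item in first.findall("diet/fooditem")]

# values stored as attributes are read through .get() (or the .attrib dict)
name_from_attrib = second.get("name")
foods_from_attrib = [item.get("name") for item in second.findall("diet/fooditem")]

print(name_from_text, foods_from_text)
print(name_from_attrib, foods_from_attrib)
```

Either way the values arrive as strings, so a number such as the age would still need an explicit `int(...)` conversion.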

## Demo

- data: `olympics.xml`
- path: '/src/data/olympics.xml'
- description: characteristics of several host countries of the Summer Olympic Games

```xml
<?xml version="1.0"?>
<data>
  <country name="greece">
    <order>1</order>
    <year>1896</year>
    <nexthost name="france"></nexthost>
  </country>
  <country name="united states of america">
    <order>3</order>
    <year>1904</year>
    <previoushost name="france"></previoushost>
    <nexthost name="england"></nexthost>
  </country>
  <country name="australia">
    <order>27</order>
    <year>2000</year>
    <previoushost name="united states of america"></previoushost>
    <nexthost name="greece"></nexthost>
  </country>
</data>
```

In [19]:
# read in our data
import xml.etree.ElementTree as et
tree = et.parse("../data/olympics.xml")

In [21]:
# grab the root element of tree
root = tree.getroot()
root

<Element 'data' at 0x7f700de44f40>

We have a handle on the root element. How do we begin exploring?
- what's the root element's tag?
- does the root element have any attributes? 
- does the root element contain any children?
- can we extend this knowledge?

How do we find out more? Check the [docs](https://docs.python.org/3/library/xml.etree.elementtree.html)

In [25]:
# explore some of its features
print("the root element's tag is: ", root.tag)
print("the attributes of the root element are: ", root.attrib)
print("the number of child elements of the root is: ", len(root))

the root element's tag is:  data
the attributes of the root element are:  {}
the number of child elements of the root is:  3


In [34]:
root[2]

<Element 'country' at 0x7f700e01c810>

In [61]:
for elem in root:
    print(elem.tag)


country
country
country


How might we go about displaying the attributes of each `country` tag?

In [35]:
for idx in range(len(root)):
    tag = root[idx].tag
    attributes = root[idx].attrib
    print("tag {} || attributes {}".format(tag, attributes))

tag country || attributes {'name': 'greece'}
tag country || attributes {'name': 'united states of america'}
tag country || attributes {'name': 'australia'}
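
Since an element can be iterated over directly (as in the `for elem in root` loop earlier), the index-based loop can also be written without `range(len(...))`. A small sketch, using a trimmed inline copy of the data so that it runs without the file:

```python
import xml.etree.ElementTree as et

# trimmed inline copy of olympics.xml (child elements omitted) so the
# sketch runs without the data file
olympics = """<?xml version="1.0"?>
<data>
  <country name="greece"></country>
  <country name="united states of america"></country>
  <country name="australia"></country>
</data>
"""

root = et.fromstring(olympics)

# iterating over an element yields its direct children, no indexing needed
lines = [
    "tag {} || attributes {}".format(country.tag, country.attrib)
    for country in root
]
print("\n".join(lines))
```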


In [40]:
# grab the 0th country element
first_country = root[0]


# grab the element at index 1 of the first country (its `year`) and display its content
first_country[1].text

'1896'
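
Positional indexing works, but it depends on the order of the child elements. One way to extend this, sketched below with an inline copy of `olympics.xml`, is to look children up by tag name with `find` and collect one dictionary per country. The `records` name and the choice of fields are illustrative assumptions, not part of the original demo.

```python
import xml.etree.ElementTree as et

# inline copy of olympics.xml so the sketch runs without the data file
olympics = """<?xml version="1.0"?>
<data>
  <country name="greece">
    <order>1</order>
    <year>1896</year>
    <nexthost name="france"></nexthost>
  </country>
  <country name="united states of america">
    <order>3</order>
    <year>1904</year>
    <previoushost name="france"></previoushost>
    <nexthost name="england"></nexthost>
  </country>
  <country name="australia">
    <order>27</order>
    <year>2000</year>
    <previoushost name="united states of america"></previoushost>
    <nexthost name="greece"></nexthost>
  </country>
</data>
"""

root = et.fromstring(olympics)

records = []
for country in root.findall("country"):
    # find() returns None when the child is absent (greece has no previoushost)
    previoushost = country.find("previoushost")
    nexthost = country.find("nexthost")
    records.append({
        "name": country.get("name"),
        "order": int(country.find("order").text),  # .text is always a string
        "year": int(country.find("year").text),
        "previoushost": previoushost.get("name") if previoushost is not None else None,
        "nexthost": nexthost.get("name") if nexthost is not None else None,
    })

for record in records:
    print(record)
```

A list of dictionaries like this can be handed to `pd.DataFrame(records)` to get back to tabular data.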

# JSON

**J**ava**S**cript **O**bject **N**otation

A data-interchange format that is simple for humans and machines to read and write.

Data are stored in structures of key-value pairs.


```json
{
  "pizza": ["veggie", "pepperoni", "pineapple", "mushroom"],
  "price": [12.99, 14.00, 11.99, 16.00],
  "averageRating": [5, 4, 4.5, 4.9]
}
```

or 

```json
{
  "pizza": {
    "0":"veggie",
    "1":"pepperoni",
    "2":"pineapple",
    "3":"mushroom"
  },
  "price": {
    "0":12.99,
    "1":14.0,
    "2":11.99,
    "3":16.0
  },
  "averageRating": {
    "0":5.0,
    "1":4.0,
    "2":4.5,
    "3":4.9
  }
}
```
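
The two layouts above hold the same data. As a rough sketch with the standard library's `json` module (the variable names are illustrative), the first layout can be loaded and reshaped into the second, or pivoted into one record per pizza:

```python
import json

column_lists = json.loads("""
{
  "pizza": ["veggie", "pepperoni", "pineapple", "mushroom"],
  "price": [12.99, 14.00, 11.99, 16.00],
  "averageRating": [5, 4, 4.5, 4.9]
}
""")

# rebuild the second layout: each list becomes an object keyed by position
# (JSON object keys are always strings, hence str(i))
column_dicts = {
    key: {str(i): value for i, value in enumerate(values)}
    for key, values in column_lists.items()
}

# or pivot into one record per pizza by zipping the lists together
records = [
    dict(zip(column_lists, row))
    for row in zip(*column_lists.values())
]

print(json.dumps(column_dicts, indent=2))
print(records[0])
```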


When every key has the same number of items associated with it, parsing the data into a dataframe is rather simple.

In [41]:
url = "https://raw.githubusercontent.com/jcbain/data_2022-02-07/main/w02d03_other_data_types/src/data/pizza.json"
pd.read_json(url)

|  | pizza | price | averageRating |
| :- | :---- | :---- | :------------ |
| 0 | veggie | 12.99 | 5.0 |
| 1 | pepperoni | 14.0 | 4.0 |
| 2 | pineapple | 11.99 | 4.5 |
| 3 | mushroom | 16.0 | 4.9 |


## Nested JSON
However...

data: `content.json`

```json

{
  "articles": [
    {
      "name": "how to program in javascript",
      "author": "prairie",
      "wordCount": 1200
    },
    {
      "name": "should you rewrite your application in python?",
      "author": "pippa",
      "wordCount": 40000
    }
  ],
  "blogs": [
    {
      "title": "differences between js objects and python dictionaries",
      "postedBy": "jennifer"
    }
  ]
}
```

In [42]:
pd.read_json("../data/content.json")

ValueError: All arrays must be of the same length

In cases like this, we have to work a bit more to get our data into a format that is reasonable to work with.
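
A quick way to see where the error comes from, assuming it is caused by the unequal top-level lists, is to count the items under each key. This sketch uses an inline copy of `content.json` and only the standard library:

```python
import json

# inline copy of content.json so the sketch runs without the data file
content = json.loads("""
{
  "articles": [
    {"name": "how to program in javascript", "author": "prairie", "wordCount": 1200},
    {"name": "should you rewrite your application in python?", "author": "pippa", "wordCount": 40000}
  ],
  "blogs": [
    {"title": "differences between js objects and python dictionaries", "postedBy": "jennifer"}
  ]
}
""")

# number of records under each top-level key
lengths = {key: len(value) for key, value in content.items()}
print(lengths)

# a single dataframe needs every column to have the same number of rows
same_length = len(set(lengths.values())) == 1
print("all top-level lists the same length?", same_length)
```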

In [45]:
import json

with open("../data/content.json") as file: 
    content = json.load(file)

content


{'articles': [{'name': 'how to program in javascript',
   'author': 'prairie',
   'wordCount': 1200},
  {'name': 'should you rewrite your application in python?',
   'author': 'pippa',
   'wordCount': 40000}],
 'blogs': [{'title': 'differences between js objects and python dictionaries',
   'postedBy': 'jennifer'}]}

Use the `pd.json_normalize` function to mold your data into a dataframe

... but looks a little strange
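
One reading of "a little strange", offered here as an assumption, is what happens when `pd.json_normalize` is called without `record_path`: nested dictionaries get flattened, but lists are left alone, so each top-level list ends up packed into a single cell. A small sketch with the same data defined inline:

```python
import pandas as pd

# same structure as content.json, defined inline so the sketch is self-contained
content = {
    "articles": [
        {"name": "how to program in javascript", "author": "prairie", "wordCount": 1200},
        {"name": "should you rewrite your application in python?", "author": "pippa", "wordCount": 40000},
    ],
    "blogs": [
        {"title": "differences between js objects and python dictionaries", "postedBy": "jennifer"},
    ],
}

# without record_path, the whole dictionary becomes a single row
flat = pd.json_normalize(content)

print(flat.shape)
print(list(flat.columns))

# each cell still holds the original list of records
print(type(flat.loc[0, "articles"]), len(flat.loc[0, "articles"]))
```

Passing `record_path` tells `json_normalize` which of those lists to unpack into rows.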

In [47]:
pd.json_normalize(content, record_path="articles")

|  | name | author | wordCount |
| :- | :--- | :----- | :-------- |
| 0 | how to program in javascript | prairie | 1200 |
| 1 | should you rewrite your application in python? | pippa | 40000 |


But conceptually, "articles" and "blogs" are distinct.

In [49]:
pd.json_normalize(content, record_path="articles")

|  | name | author | wordCount |
| :- | :--- | :----- | :-------- |
| 0 | how to program in javascript | prairie | 1200 |
| 1 | should you rewrite your application in python? | pippa | 40000 |


In [48]:
pd.json_normalize(content, record_path="blogs")

|  | title | postedBy |
| :- | :---- | :------- |
| 0 | differences between js objects and python dict... | jennifer |


How about a complex but uniform structure?

In [50]:
states = [
    {
        "state": "Florida",
        "abbr": "FL",
        "info": { "governor": "Ron DeSantis", "ltGovernor": "Jeanette Nunez"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 400000},
            {"name": "Palm Beach", "population": 600000}
        ]
    },
    {
        "state": "Ohio",
        "abbr": "OH",
        "info": { "governor": "Mike DeWine"},
        "counties": [
            {"name": "Summit", "population": 50324},
            {"name": "Cuyahoga", "population": 200000}
        ]
    }
    
]

In [52]:
pd.json_normalize(states, record_path="counties")

|  | name | population |
| :- | :--- | :--------- |
| 0 | Dade | 12345 |
| 1 | Broward | 400000 |
| 2 | Palm Beach | 600000 |
| 3 | Summit | 50324 |
| 4 | Cuyahoga | 200000 |


In [54]:
pd.json_normalize(states, record_path="counties", meta=["abbr", "state"])

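|  | name | population | abbr | state |
| :- | :--- | :--------- | :--- | :---- |
| 0 | Dade | 12345 | FL | Florida |
| 1 | Broward | 400000 | FL | Florida |
| 2 | Palm Beach | 600000 | FL | Florida |
| 3 | Summit | 50324 | OH | Ohio |
| 4 | Cuyahoga | 200000 | OH | Ohio |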


In [58]:
pd.json_normalize(
    states,
    record_path="counties",
    meta=["abbr", "state", ["info", "ltGovernor"]],
    errors="ignore",
)

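|  | name | population | abbr | state | info.ltGovernor |
| :- | :--- | :--------- | :--- | :---- | :-------------- |
| 0 | Dade | 12345 | FL | Florida | Jeanette Nunez |
| 1 | Broward | 400000 | FL | Florida | Jeanette Nunez |
| 2 | Palm Beach | 600000 | FL | Florida | Jeanette Nunez |
| 3 | Summit | 50324 | OH | Ohio | NaN |
| 4 | Cuyahoga | 200000 | OH | Ohio | NaN |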
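
To make `record_path` and `meta` less magical, here is a rough pure-Python sketch of what the call above is doing conceptually. It is not how pandas implements it, and it uses `None` for the missing `ltGovernor` where pandas fills in `NaN`; the `rows` name is illustrative.

```python
states = [
    {
        "state": "Florida",
        "abbr": "FL",
        "info": {"governor": "Ron DeSantis", "ltGovernor": "Jeanette Nunez"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 400000},
            {"name": "Palm Beach", "population": 600000},
        ],
    },
    {
        "state": "Ohio",
        "abbr": "OH",
        "info": {"governor": "Mike DeWine"},
        "counties": [
            {"name": "Summit", "population": 50324},
            {"name": "Cuyahoga", "population": 200000},
        ],
    },
]

rows = []
for state in states:
    # record_path="counties": one output row per county record
    for county in state["counties"]:
        row = dict(county)
        # meta fields come from the parent and repeat on every child row
        row["abbr"] = state["abbr"]
        row["state"] = state["state"]
        # nested meta path ["info", "ltGovernor"]; .get() mimics errors="ignore"
        # by yielding None (pandas would use NaN) when the key is missing
        row["info.ltGovernor"] = state["info"].get("ltGovernor")
        rows.append(row)

for row in rows:
    print(row)
```

Without `errors="ignore"`, the missing `ltGovernor` key for Ohio would raise a `KeyError` in pandas, much as `state["info"]["ltGovernor"]` would here.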
