# CSV

In [151]:
import pandas as pd

We can print the first three lines of the CSV file to confirm that it is plain text. The crop data is [available from FAOSTAT](http://www.fao.org/faostat/en/#data/QC).

In [4]:
!head -3 datasets/fao_csv.csv

Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,Y1962F,Y1963,Y1963F,Y1964,Y1964F,Y1965,Y1965F,Y1966,Y1966F,Y1967,Y1967F,Y1968,Y1968F,Y1969,Y1969F,Y1970,Y1970F,Y1971,Y1971F,Y1972,Y1972F,Y1973,Y1973F,Y1974,Y1974F,Y1975,Y1975F,Y1976,Y1976F,Y1977,Y1977F,Y1978,Y1978F,Y1979,Y1979F,Y1980,Y1980F,Y1981,Y1981F,Y1982,Y1982F,Y1983,Y1983F,Y1984,Y1984F,Y1985,Y1985F,Y1986,Y1986F,Y1987,Y1987F,Y1988,Y1988F,Y1989,Y1989F,Y1990,Y1990F,Y1991,Y1991F,Y1992,Y1992F,Y1993,Y1993F,Y1994,Y1994F,Y1995,Y1995F,Y1996,Y1996F,Y1997,Y1997F,Y1998,Y1998F,Y1999,Y1999F,Y2000,Y2000F,Y2001,Y2001F,Y2002,Y2002F,Y2003,Y2003F,Y2004,Y2004F,Y2005,Y2005F,Y2006,Y2006F,Y2007,Y2007F,Y2008,Y2008F,Y2009,Y2009F,Y2010,Y2010F,Y2011,Y2011F,Y2012,Y2012F,Y2013,Y2013F,Y2014,Y2014F,Y2015,Y2015F,Y2016,Y2016F
"5","American Samoa","486","Bananas","5312","Area harvested","ha","500.000000","F","500.000000","F","500.000000","F","500.000000","F","405.000000","","486.000000","","550.000000","F","630.000000","F","700.000000","

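The `!head` command relies on a Unix-like shell being available to the notebook. As a rough pure-Python alternative, we can read only the first few lines with `itertools.islice`. The sketch below writes a tiny made-up CSV to a temporary file so that it runs without the FAO download; the file contents and variable names are illustrative assumptions, not part of the dataset.

```python
import itertools
import os
import tempfile

# Hypothetical miniature CSV, written to a temp file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('Area,Item,Y1961\nAmerican Samoa,Bananas,500.0\nFiji,Cassava,\nTonga,Taro,\n')
    path = tmp.name

# islice stops after three lines, so the rest of the file is never read.
with open(path, encoding='utf-8') as f:
    first_lines = [line.rstrip('\n') for line in itertools.islice(f, 3)]

os.remove(path)
print('\n'.join(first_lines))
```
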
We can also read it into a pandas `DataFrame` and manipulate it:

In [5]:
df = pd.read_csv('datasets/fao_csv.csv')

In [6]:
df.head()

,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2012,Y2012F,Y2013,Y2013F,Y2014,Y2014F,Y2015,Y2015F,Y2016,Y2016F
0,5,American Samoa,486,Bananas,5312,Area harvested,ha,500.0,F,500.0,...,370.0,F,359.0,Im,358.0,Im,358.0,Im,359.0,Im
1,5,American Samoa,486,Bananas,5419,Yield,hg/ha,12000.0,Fc,12000.0,...,23243.0,Fc,23435.0,Fc,23556.0,Fc,23623.0,Fc,23591.0,Fc
2,5,American Samoa,486,Bananas,5510,Production,tonnes,600.0,,600.0,...,860.0,F,842.0,Im,844.0,Im,846.0,Im,846.0,Im
3,5,American Samoa,414,"Beans, green",5312,Area harvested,ha,,M,,...,40.0,F,44.0,Im,44.0,Im,43.0,Im,44.0,Im
4,5,American Samoa,414,"Beans, green",5419,Yield,hg/ha,,,,...,30000.0,Fc,29807.0,Fc,29326.0,Fc,29757.0,Fc,29623.0,Fc


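Let's count the rows for each Item to see which items appear most often. An Area can have several rows for the same Item (one per Element, such as area harvested, yield, and production), so the totals are row counts rather than numbers of distinct areas: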

In [14]:
items_by_country = df.groupby('Item').Area.count().reset_index()

In [20]:
items_by_country.sort_values('Area', ascending=False).head(10)

,Item,Area
41,Fruit Primary,63
124,Vegetables Primary,60
28,Coconuts,60
125,"Vegetables, fresh nes",60
101,"Roots and Tubers,Total",58
6,Bananas,55
43,"Fruit, fresh nes",47
45,"Fruit, tropical fresh nes",40
16,Cassava,39
115,Sweet potatoes,36
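If we want the number of distinct areas reporting each item instead of the number of rows, `nunique()` can take the place of `count()`. Here is a minimal sketch on a small made-up frame; the `toy` data and the variable names are assumptions for illustration, not FAO values.

```python
import pandas as pd

# Hypothetical miniature of the FAO layout: one row per (Area, Item, Element).
toy = pd.DataFrame({
    'Area': ['American Samoa', 'American Samoa', 'American Samoa', 'Fiji', 'Fiji', 'Fiji'],
    'Item': ['Bananas', 'Bananas', 'Bananas', 'Bananas', 'Bananas', 'Cassava'],
    'Element': ['Area harvested', 'Yield', 'Production',
                'Area harvested', 'Production', 'Production'],
})

# count() tallies non-null rows; nunique() tallies distinct Area values.
rows_per_item = toy.groupby('Item').Area.count()
areas_per_item = toy.groupby('Item').Area.nunique()

print(pd.DataFrame({'rows': rows_per_item, 'areas': areas_per_item}))
```

In the toy frame, Bananas has five rows but only two distinct areas, which is the gap between the two measures.
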


# JSON

In [24]:
import json

We can load the FAO datasets metadata file, which is in JSON format:

In [50]:
# Use a context manager so the file is closed after parsing
with open('datasets/fao_json.json', encoding='ISO-8859-1') as f:
    j = json.load(f)

We can see how many datasets the FAO has:

In [56]:
len(j['Datasets']['Dataset'])

78

Let's look at the first object. The `DatasetName` field looks like the most convenient way to identify the dataset we want:

In [61]:
j['Datasets']['Dataset'][0]

{'CompressionFormat': 'zip',
 'Contact': 'Nienke Beintema and Gert-Jan Stads',
 'DatasetCode': 'AE',
 'DatasetDescription': 'ASTI collects primary time-series data on agricultural research capacity and spending levels through national survey rounds in over 80 low-and middle-income countries. Data collection is carried out by country focal points, who distribute survey forms to all agencies known to conduct agricultural research in a given country, including government, nonprofit, and higher education agencies. Private-for profit sector coverage is limited, and hence excluded from this dataset. More detailed country- and regional-level data on agricultural research capacity, investment, and outputs are available on www.asti.cgiar.org/data.',
 'DatasetName': 'ASTI R&D Indicators: ASTI-Expenditures',
 'DateUpdate': '2015-11-3',
 'Email': 'asti@cgiar.org',
 'FileLocation': 'http://fenixservices.fao.org/faostat/static/bulkdownloads/ASTI_Research_Spending_E_All_Data_(Normalized).zip',
 'File

So, we have to loop through all the dataset objects to find the one for crops:

In [62]:
for i in j['Datasets']['Dataset']:
    if i['DatasetName'] == 'Production: Crops':
        crops = i

In [63]:
crops

{'CompressionFormat': 'zip',
 'Contact': 'Mr. Salar Tayyib',
 'DatasetCode': 'QC',
 'DatasetDescription': 'Crop statistics are recorded for 173 products, covering the following categories: Crops Primary, Fibre Crops Primary, Cereals, Coarse Grain, Citrus Fruit, Fruit, Jute  Jute-like Fibres, Oilcakes Equivalent, Oil crops Primary, Pulses, Roots and Tubers, Treenuts and Vegetables and Melons. Data are expressed in terms of area harvested, production quantity and yield. The objective is to comprehensively cover production of all primary crops for all countries and regions in the world.Cereals: Area and production data on cereals relate to crops harvested for dry grain only. Cereal crops harvested for hay or harvested green for food, feed or silage or used for grazing are therefore excluded. Area data relate to harvested area. Some countries report sown or cultivated area only; however, in these countries the sown or cultivated area does not differ significantly in normal years from the a
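
The loop above keeps scanning after it has found a match. A more compact sketch of the same search uses `next()` with a generator expression, which stops at the first match and returns a default when nothing matches. The two-entry `doc` below is a hypothetical stand-in that only mimics the `Datasets` / `Dataset` nesting of the real file.

```python
import json

# Hypothetical stand-in for the metadata file, with only two fields per entry.
doc = json.loads('''
{"Datasets": {"Dataset": [
  {"DatasetCode": "AE", "DatasetName": "ASTI R&D Indicators: ASTI-Expenditures"},
  {"DatasetCode": "QC", "DatasetName": "Production: Crops"}
]}}
''')

# next() returns the first matching entry, or None if no entry matches.
crops_entry = next(
    (d for d in doc['Datasets']['Dataset'] if d['DatasetName'] == 'Production: Crops'),
    None,
)
no_entry = next(
    (d for d in doc['Datasets']['Dataset'] if d['DatasetName'] == 'No such dataset'),
    None,
)

print(crops_entry['DatasetCode'])
```
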

# XML

In [64]:
import xml.etree.ElementTree as ET

In [69]:
tree = ET.parse('datasets/fao_xml.xml')

In [70]:
root = tree.getroot()

We can check that the XML file contains the same number of data objects as the JSON file:

In [75]:
# Element.getchildren() was removed in Python 3.9; len() on an Element counts its children
len(root)

78

In [107]:
sample = root[0]

In [106]:
# Iterating over an Element yields its children directly
for elem in sample:
    print(f'{elem.tag}: {elem.text}')

DatasetCode: AF
DatasetName: ASTI R&D Indicators: ASTI-Researchers
Topic: All government, higher education, and nonprofit agencies involved in agricultural research in over 80 low- and middle-income countries. Private for-profit agencies are not included in ASTI datasets.
DatasetDescription: ASTI collects primary time-series data on agricultural research capacity and spending levels through national survey rounds in over 80 low-and middle-income countries. Data collection is carried out by country focal points, who distribute survey forms to all agencies known to conduct agricultural research in a given country, including government, nonprofit, and higher education agencies. Private-for profit sector coverage is limited, and hence excluded from this dataset. More detailed country- and regional-level data on agricultural research capacity, investment, and outputs are available on www.asti.cgiar.org/data.
Contact: Nienke Beintema and Gert-Jan Stads
Email: asti@cgiar.org
DateUpdate: 2015-

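We can find the crops dataset with an XPath expression. ElementTree supports only a limited subset of XPath, but that is enough here: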

In [149]:
crops = root.find("./Dataset[DatasetName='Production: Crops']")

In [150]:
# Iterating over an Element yields its children directly
for elem in crops:
    print(f'{elem.tag}: {elem.text}')

DatasetCode: QC
DatasetName: Production: Crops
Topic: Most crop products under agricultural activity.
DatasetDescription: Crop statistics are recorded for 173 products, covering the following categories: Crops Primary, Fibre Crops Primary, Cereals, Coarse Grain, Citrus Fruit, Fruit, Jute  Jute-like Fibres, Oilcakes Equivalent, Oil crops Primary, Pulses, Roots and Tubers, Treenuts and Vegetables and Melons. Data are expressed in terms of area harvested, production quantity and yield. The objective is to comprehensively cover production of all primary crops for all countries and regions in the world.Cereals: Area and production data on cereals relate to crops harvested for dry grain only. Cereal crops harvested for hay or harvested green for food, feed or silage or used for grazing are therefore excluded. Area data relate to harvested area. Some countries report sown or cultivated area only; however, in these countries the sown or cultivated area does not differ significantly in normal y
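
To see how the `[tag='text']` predicate behaves without the FAO file, here is a small self-contained sketch on an inline XML string. The root tag name `Datasets` and the two entries are assumptions modelled on the JSON structure; the real file's root tag is not shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the metadata file; '&' must be escaped as '&amp;' in XML.
xml_text = """
<Datasets>
  <Dataset>
    <DatasetCode>AF</DatasetCode>
    <DatasetName>ASTI R&amp;D Indicators: ASTI-Researchers</DatasetName>
  </Dataset>
  <Dataset>
    <DatasetCode>QC</DatasetCode>
    <DatasetName>Production: Crops</DatasetName>
  </Dataset>
</Datasets>
""".strip()

root_el = ET.fromstring(xml_text)

# The predicate selects the Dataset child whose DatasetName text matches exactly.
match = root_el.find("./Dataset[DatasetName='Production: Crops']")
code = match.findtext('DatasetCode')

# find() returns None when nothing matches, so check before using the result.
missing = root_el.find("./Dataset[DatasetName='No such dataset']")

print(code, missing)
```
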