# Module 4: Data Ingestion

### Topics
- Load data from a local **JSON** file
- Load data from an **XML** file
- Load data from a **CSV** file

### Load data from a JSON file
**Notes**:
- Uses the file students.json as input.
- The file is a single **JSON** object enclosed in {} braces with one key ("students"). Its value is an array of student objects enclosed in [] brackets.
- The **JSON** objects inside the array are separated by commas.

In [1]:
import json

In [2]:
# Supply the file name and open the file
with open('students.json', 'r') as f:
    students_json = json.load(f)
    
# display the variable and its contents.
students_json

{'students': [{'id': '675432',
   'fname': 'John',
   'lname': 'Smith',
   'bdate': '1999-01-02',
   'major': 'CS'},
  {'id': '522134',
   'fname': 'Mary',
   'lname': 'Johnsons',
   'bdate': '1988-11-27',
   'major': 'Music'},
  {'id': '632197',
   'fname': 'Kyle',
   'lname': 'Green',
   'bdate': '2001-07-31',
   'major': 'CS'}]}

In [3]:
# view the type of the students_json variable.
# will return a dictionary (dict)
type(students_json)

dict

In [4]:
students_json

{'students': [{'id': '675432',
   'fname': 'John',
   'lname': 'Smith',
   'bdate': '1999-01-02',
   'major': 'CS'},
  {'id': '522134',
   'fname': 'Mary',
   'lname': 'Johnsons',
   'bdate': '1988-11-27',
   'major': 'Music'},
  {'id': '632197',
   'fname': 'Kyle',
   'lname': 'Green',
   'bdate': '2001-07-31',
   'major': 'CS'}]}
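
Because this file has a root key, the list of student records sits one level below the top of the dictionary. The following is a minimal sketch of reaching it. It uses an inline string in place of students.json so that it runs on its own; the inline data is an abbreviated copy of the output above, and the names `raw`, `records`, and `first` are only illustrative.

```python
import json

# Inline stand-in for students.json (abbreviated; an assumption for illustration)
raw = (
    '{"students": ['
    '{"id": "675432", "fname": "John", "major": "CS"}, '
    '{"id": "522134", "fname": "Mary", "major": "Music"}'
    ']}'
)

data = json.loads(raw)        # json.loads parses a string; json.load reads a file object
records = data["students"]    # the list stored under the root key
first = records[0]            # the first student, a dict

print(type(records).__name__)  # list
print(first["fname"])          # John
```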

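### Load data from a JSON file without a root key
**Notes**:
- Uses the file students_no_root.json as input.
- The file has no root key, so the data loads directly as a list of objects.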

In [5]:
# Load a JSON file that has no root key.
# Using file: students_no_root.json

with open('students_no_root.json', 'r') as f:
    students_json_1 = json.load(f)
    
type(students_json_1)

list

In [6]:
# will output three JSON objects stored in a list
students_json_1

[{'id': '675432',
  'fname': 'John',
  'lname': 'Smith',
  'bdate': '1999-01-02',
  'major': 'CS'},
 {'id': '522134',
  'fname': 'Mary',
  'lname': 'Johnsons',
  'bdate': '1988-11-27',
  'major': 'Music'},
 {'id': '632197',
  'fname': 'Kyle',
  'lname': 'Green',
  'bdate': '2001-07-31',
  'major': 'CS'}]

In [7]:
# using list indexing to get the first element in the students_json_1 list
students_json_1[0]

{'id': '675432',
 'fname': 'John',
 'lname': 'Smith',
 'bdate': '1999-01-02',
 'major': 'CS'}

In [8]:
type(students_json_1[0])

dict
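
The two files differ only in whether the records are wrapped in a root key. As a rough sketch, a small helper can return the list of records from either layout. The function name `extract_records` and its `root_key` parameter are hypothetical and not part of the `json` module, and inline strings stand in for the two files.

```python
import json


def extract_records(data, root_key="students"):
    """Return the list of records from either JSON layout (hypothetical helper)."""
    if isinstance(data, dict):
        # Layout with a root key: the records are stored under that key
        return data[root_key]
    # Layout without a root key: the data is already the list of records
    return data


# Inline stand-ins for students.json and students_no_root.json (abbreviated)
with_root = json.loads('{"students": [{"id": "675432", "fname": "John"}]}')
no_root = json.loads('[{"id": "675432", "fname": "John"}]')

print(extract_records(with_root) == extract_records(no_root))  # True
```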

### Loading Data Into a Pandas DataFrame
**Notes**:
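- **orient=** - tells pandas how the JSON is laid out (for example `'records'`, `'columns'`, `'index'`, `'split'`, or `'values'`).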

In [9]:
import pandas as pd

In [10]:
# load the file into a pandas dataframe
df = pd.read_json("students_no_root.json", orient='values')

df.T

| | 0 | 1 | 2 |
|---|---|---|---|
| id | 675432 | 522134 | 632197 |
| fname | John | Mary | Kyle |
| lname | Smith | Johnsons | Green |
| bdate | 1999-01-02 | 1988-11-27 | 2001-07-31 |
| major | CS | Music | CS |


### Loading XML
**Notes**:
- An XML document is a tree structure.
- File used: students.xml
- The root element is the entry point to the XML tree.

In [11]:
from xml.etree.ElementTree import parse

In [12]:
# parse the XML data
tree = parse('students.xml')

In [13]:
# check the data type of 'tree' variable
type(tree)

xml.etree.ElementTree.ElementTree

In [14]:
# Start from the root of the XML tree
root = tree.getroot()

# with the root it is possible to traverse the tree

In [15]:
# Display the value of each element in the tree.
for elem in root:
    
    for subelem in elem:
        print(subelem.text)

675432
John
Smith
1999-01-02
CS
522134
Mary
Johnsons
988-11-27
Music
632197
Kyle
Green
2001-07-31
CS


In [16]:
# store the data in a simple list
xml_data = []

for elem in root:
    
    temp = []
    
    for subelem in elem:
        temp.append(subelem.text)
    xml_data.append(temp)
    
xml_data

[['675432', 'John', 'Smith', '1999-01-02', 'CS'],
 ['522134', 'Mary', 'Johnsons', '988-11-27', 'Music'],
 ['632197', 'Kyle', 'Green', '2001-07-31', 'CS']]
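
The nested lists above keep the values but drop the field names. One possible variation keeps each sub-element's tag as a dictionary key. The sketch below parses an inline string so that it runs on its own; the tag names (`students`, `student`, `id`, `fname`, `lname`) are assumptions, since the actual contents of students.xml are not shown here.

```python
from xml.etree.ElementTree import fromstring

# Inline stand-in for students.xml; the tag names are assumptions
xml_text = """
<students>
  <student><id>675432</id><fname>John</fname><lname>Smith</lname></student>
  <student><id>522134</id><fname>Mary</fname><lname>Johnsons</lname></student>
</students>
""".strip()

# fromstring parses a string and returns the root Element directly
root = fromstring(xml_text)

# One dict per child of the root, keyed by each sub-element's tag name
xml_records = [{sub.tag: sub.text for sub in elem} for elem in root]

print(xml_records[0]["lname"])  # Smith
```

With dictionaries, a field can be looked up by name rather than by its position in the row.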

In [17]:
# Display the third element of the first row in the list
xml_data[0][2]

'Smith'

In [18]:
dob_s = xml_data[0][3]
dob_s

'1999-01-02'

In [19]:
# split the date string into its parts: [year, month, day]
dob_l = dob_s.split('-')

dob_l

['1999', '01', '02']

In [20]:
for x in dob_l:
    print(x)

1999
01
02


In [21]:
for x in dob_l:
    print(int(x))

1999
1
2
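
Splitting on '-' and converting each part with `int()` works, but it does not check that the string is a valid date. As an alternative sketch, the standard library's `date.fromisoformat` (Python 3.7+) parses a `YYYY-MM-DD` string and raises `ValueError` for anything else, such as the `'988-11-27'` value that appears in the XML output above. The helper name `parse_dob` is hypothetical.

```python
from datetime import date


def parse_dob(text):
    """Parse a YYYY-MM-DD string; return None if it is not a valid date (hypothetical helper)."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None


dob = parse_dob("1999-01-02")
print(dob.year, dob.month, dob.day)  # 1999 1 2

# A malformed value is reported as None instead of being silently accepted
print(parse_dob("988-11-27"))        # None
```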


### Loading Data From CSV
**Notes**:
- The values in a CSV file are separated by commas.
- Files used: students_no_header.csv and students_header.csv
- If the file has **NO** header row, set header=None.

In [22]:
import pandas as pd

In [23]:
# Load data that does not have a header
df = pd.read_csv("students_no_header.csv", header=None)

In [24]:
type(df)

pandas.core.frame.DataFrame

In [25]:
df

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 675432 | John | Smith | 1999-01-02 | CS |
| 1 | 522134 | Mary | Johnsons | 1988-11-27 | Music |
| 2 | 632197 | Kyle | Green | 2001-07-31 | CS |


In [26]:
# Load data that has a header already associated with it
df1 = pd.read_csv("students_header.csv")

df1

| | ID | Fname | Lname | DOB | Major |
|---|---|---|---|---|---|
| 0 | 675432 | John | Smith | 1999-01-02 | CS |
| 1 | 522134 | Mary | Johnsons | 1988-11-27 | Music |
| 2 | 632197 | Kyle | Green | 2001-07-31 | CS |


### Read the file as a pure text file
- The file is read line by line, and each line is then split on the ',' separators.
- Uses the built-in **open()** function.

In [27]:
# open the file
fin = open("students_no_header.csv")

In [28]:
# Store every line that is in the file.
# Will be stored in a list of file lines.

students_records = []

for x in fin:
    students_records.append(x)
    
# close the file
fin.close()

In [29]:
print(students_records)

['675432, John, Smith, 1999-01-02, CS\n', '522134, Mary, Johnsons, 1988-11-27, Music\n', '632197, Kyle, Green, 2001-07-31, CS']


In [30]:
new_student_records = []

for student in students_records:
    print(student.strip().split(','))
    new_student_records.append(student.strip().split(','))

['675432', ' John', ' Smith', ' 1999-01-02', ' CS']
['522134', ' Mary', ' Johnsons', ' 1988-11-27', ' Music']
['632197', ' Kyle', ' Green', ' 2001-07-31', ' CS']


In [31]:
print(new_student_records)

[['675432', ' John', ' Smith', ' 1999-01-02', ' CS'], ['522134', ' Mary', ' Johnsons', ' 1988-11-27', ' Music'], ['632197', ' Kyle', ' Green', ' 2001-07-31', ' CS']]
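
Note that splitting on ',' leaves a leading space on every field after the first (for example `' John'`). One way to avoid this, sketched below, is the standard library's `csv.reader` with `skipinitialspace=True`, which drops whitespace that directly follows a delimiter. An inline string stands in for students_no_header.csv so that the snippet runs on its own.

```python
import csv
import io

# Inline stand-in for students_no_header.csv
text = (
    "675432, John, Smith, 1999-01-02, CS\n"
    "522134, Mary, Johnsons, 1988-11-27, Music\n"
    "632197, Kyle, Green, 2001-07-31, CS"
)

# skipinitialspace=True ignores whitespace immediately after each comma
rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))

print(rows[0])  # ['675432', 'John', 'Smith', '1999-01-02', 'CS']
```

With a real file, `io.StringIO(text)` would be replaced by a file object opened with `open(..., newline='')`, as the `csv` module documentation recommends.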


In [32]:
# open the file
fin = open("students_header.csv")

In [33]:
# If the file has a header, read the first line separately so it is kept out of the records

students_records_2 = []

header = fin.readline()

for x in fin:
    students_records_2.append(x)
    
# close the file
fin.close()

In [34]:
print(students_records_2)

['675432, John, Smith, 1999-01-02, CS\n', '522134, Mary, Johnsons, 1988-11-27, Music\n', '632197, Kyle, Green, 2001-07-31, CS\n']


In [35]:
new_student_records_2 = []

for student in students_records_2:
    print(student.strip().split(','))
    new_student_records_2.append(student.strip().split(','))

['675432', ' John', ' Smith', ' 1999-01-02', ' CS']
['522134', ' Mary', ' Johnsons', ' 1988-11-27', ' Music']
['632197', ' Kyle', ' Green', ' 2001-07-31', ' CS']


In [36]:
print(new_student_records_2)

[['675432', ' John', ' Smith', ' 1999-01-02', ' CS'], ['522134', ' Mary', ' Johnsons', ' 1988-11-27', ' Music'], ['632197', ' Kyle', ' Green', ' 2001-07-31', ' CS']]


In [37]:
print(header)

ID,Fname,Lname,DOB,Major
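
When the file has a header, the header names can also serve as keys for each row. As a sketch, `csv.DictReader` reads the first line as field names and returns one dictionary per remaining line. The inline string below stands in for students_header.csv and is abbreviated to two records.

```python
import csv
import io

# Inline stand-in for students_header.csv (abbreviated)
text = (
    "ID,Fname,Lname,DOB,Major\n"
    "675432, John, Smith, 1999-01-02, CS\n"
    "522134, Mary, Johnsons, 1988-11-27, Music\n"
)

# DictReader takes the first line as the field names;
# skipinitialspace=True drops the space after each comma
reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
header_records = list(reader)

print(reader.fieldnames)           # ['ID', 'Fname', 'Lname', 'DOB', 'Major']
print(header_records[0]["Fname"])  # John
```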

