<br>

# Week 4: Importing Data

<br>

## Importing Data into the `Python` Environment

* Data in Storage
* Data in Space


<br>

## Data in Storage: Persistent Storage

* CSV
* JSON
* Excel
* text
* Matlab

<img src="https://st2.depositphotos.com/2419757/10259/v/950/depositphotos_102598234-stock-illustration-isometric-3d-shelf-with-cartoon.jpg" width="30%" style="margin-left:auto; margin-right:auto">

<br>

### Importing Data

* Several Python libraries can import `.csv`, `.json`, and other file types
* For example, the standard-library `csv` and `json` modules

In [None]:
import csv

# read every row of the file into a list of lists
res = []
with open( '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/eyesight2.csv', 'r', newline='' ) as myFile:
    lines = csv.reader( myFile, delimiter=',' )
    for line in lines:
        res.append( line )

print( len( res ) )   # number of rows, header included
print( res )

#### Question: Does this data look ready to work with?
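
One way to see why the answer is "not quite": `csv.reader` hands back every field as a string, header included. The sketch below uses a small made-up table (not the course dataset) held in memory, so the column names and values are assumptions for illustration only.

```python
import csv
import io

# Hypothetical stand-in for a small CSV file (not the course dataset).
sample = io.StringIO("subject,acuity\nA,0.8\nB,1.2\n")

rows = list(csv.reader(sample))
header, records = rows[0], rows[1:]

# Every field comes back as a string, including the numbers.
print(header)
print(records)

# Numeric columns have to be converted by hand before doing arithmetic.
acuity = [float(record[1]) for record in records]
mean_acuity = sum(acuity) / len(acuity)
print(mean_acuity)
```

Separating the header and converting types by hand is manageable for two columns, but it gets tedious quickly for real datasets.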

<br>

### Importing `.csv` Data with `pandas`

* Using the `pandas` library is a better way to go:
    - it can read multiple types of data
    - it is more intuitive and readable
    - the resulting data structure, a `pandas` `DataFrame`, is much easier to work with
    
Here we import the same .csv file:

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/datasets/eyesight2.csv'
df = pd.read_csv( url )
df.head()

<br>

### Importing `.json` Data with `pandas`

In [None]:
url = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/datasets/tmesh.json'
df = pd.read_json( url )
df.head()

In [None]:
dft = df.transpose()
dft.head()
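
Why transpose? With its default settings, `pd.read_json` treats each top-level key of a JSON object as a column. When the file stores one record per top-level key, the records end up as columns, and `transpose()` turns them back into rows. A minimal sketch, assuming a made-up JSON string (the keys `rec1`, `x`, `y` are hypothetical and unrelated to `tmesh.json`):

```python
import io

import pandas as pd

# Hypothetical JSON with one top-level key per record.
raw = '{"rec1": {"x": 1, "y": 2}, "rec2": {"x": 3, "y": 4}, "rec3": {"x": 5, "y": 6}}'

# Top-level keys become columns, inner keys become the index.
wide = pd.read_json(io.StringIO(raw))
print(wide.shape)

# Transposing gives one row per record and one column per field.
tidy = wide.transpose()
print(tidy.shape)
print(tidy)
```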

<br>

### Importing Excel Data with `pandas`

In [None]:
url = 'https://github.com/SmilodonCub/DS4VS/raw/master/datasets/Data_Cortex_Nuclear.xls'
df = pd.read_excel( url )
df.head()

<br>

### Reading in `.txt` files with `pandas`

In [None]:
# a tab-delimited text file
url = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/datasets/CatSearch_Physiol_ML2_2018_8_30(1).txt'
df = pd.read_csv( url, sep='\t')
df.head()

In [None]:
# this txt file has a more complicated format
url = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/datasets/gratvernier/mj%20gratvernier%2049.txt'
df_sc1 = pd.read_csv( url, skiprows=111, skipfooter=50, sep='\t', engine='python')
df_sc1
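
The `skiprows` and `skipfooter` arguments are what let `read_csv` cut a table out of a file that has notes above and below it. Here is a small sketch on an in-memory text block; the contents are invented for illustration and are much shorter than the real file.

```python
import io

import pandas as pd

# Hypothetical file contents: two lines of notes, a table, one trailing line.
text = (
    "Recorded by: hypothetical rig\n"
    "Session: 1\n"
    "trial\tresponse\n"
    "1\t0.25\n"
    "2\t0.50\n"
    "3\t0.75\n"
    "End of file\n"
)

# Skip the 2 note lines at the top and the 1 trailing line at the bottom.
# skipfooter is only supported by the slower 'python' parsing engine.
table = pd.read_csv(
    io.StringIO(text), sep='\t', skiprows=2, skipfooter=1, engine='python'
)
print(table)
```

Finding the right numbers usually means opening the file in a text editor and counting lines first.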

In [None]:
# old school: pull out the same block of lines by hand
addy = '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/gratvernier/mj gratvernier 49.txt'
lines = []
# the with-block closes the file even if an error occurs
with open( addy ) as fp:
    for i, line in enumerate( fp ):
        if 111 <= i < 155:
            lines.append( line )
            print( line )
#print( lines )

<br>

### Bringing `Matlab` into `Python`

In [None]:
from scipy.io import loadmat

url = '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/example_mat2.mat'
matlab_dat = loadmat( url )
print( type( matlab_dat ) )
#matlab_dat

In [None]:
print( matlab_dat.keys() )

In [None]:
# the delayedSaccIntervals field
delayedSaccade_dat = matlab_dat[ 'delayedSaccIntervals' ]
delayedSaccade_DF = pd.DataFrame( data = delayedSaccade_dat, columns = [ 'start','end' ] )
print( delayedSaccade_DF.shape )
delayedSaccade_DF.head()
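
`loadmat` returns a plain Python dictionary: each Matlab variable becomes a key holding a NumPy array, alongside a few metadata keys whose names start with double underscores. The round trip below is a sketch that writes a made-up array to an in-memory buffer instead of a file on disk; the variable name `intervals` and its values are assumptions.

```python
import io

import numpy as np
from scipy.io import loadmat, savemat

# Hypothetical data: three (start, end) intervals.
intervals = np.array([[0.0, 1.5], [2.0, 3.5], [4.0, 5.5]])

# Write a .mat file into memory, then read it back.
buffer = io.BytesIO()
savemat(buffer, {'intervals': intervals})
buffer.seek(0)
mat = loadmat(buffer)

# Separate the Matlab variables from the metadata keys.
data_keys = [key for key in mat if not key.startswith('__')]
print(data_keys)
print(mat['intervals'].shape)
```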

<br>

## Data in Space

* Connecting to networks and performing Remote Procedure Calls (RPCs)
* Web Services and APIs
* 'Big Data'

<img src="https://i.pinimg.com/originals/a0/26/1b/a0261b885cfba5a65c675c33327acf5a.png" width="20%" style="margin-left:auto; margin-right:auto">

### Web Services and Connecting to APIs

**A**pplication **P**rogramming **I**nterface **(API)** - a set of functions/procedures that facilitate access to the data that supports a web application.  

There are many popular APIs: [New York Times](https://developer.nytimes.com/), [Facebook](https://developers.facebook.com/), [Twitter](https://developer.twitter.com/en/docs/twitter-api), [Steam](https://steamcommunity.com/dev) and [Squarespace](https://developers.squarespace.com/commerce-apis/overview), to name a few.

There is even a [Star Wars API](https://pipedream.com/apps/swapi). Let's take a look...

<img src="https://i.kym-cdn.com/photos/images/original/001/762/550/36b.jpg" width="40%" style="margin-left:auto; margin-right:auto">

In [None]:
import json
import requests

url = 'https://swapi.dev/api/starships/3/'
response = requests.get( url )
api_results = json.loads( response.content )
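
What comes back from a request like this is JSON text, and `json.loads` turns it into ordinary Python dictionaries and lists. The sketch below parses a hand-written payload so it runs without a network connection; the field values are invented and are not the API's actual response.

```python
import json

# Hypothetical payload, shaped loosely like a single API record.
payload = (
    '{"name": "Example Cruiser", "crew": "42", '
    '"films": ["https://swapi.dev/api/films/1/", "https://swapi.dev/api/films/2/"]}'
)

ship = json.loads(payload)

# JSON objects become dicts and JSON arrays become lists.
print(type(ship))
print(ship['name'])
print(len(ship['films']))
```

With a live response, `response.json()` from `requests` performs the same parsing step in one call.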

In [None]:
def sw_return_planets( film_num ):
    """
    Given a film number (the id in the API's film URL, which may differ
    from the episode number), returns the film title and a list of
    planets featured in the movie.
    Only the planets on the first page of results are checked.
    """
    # look up the film once, before looping over the planets
    stem = 'https://swapi.dev/api/films/'
    film_address = stem + str( film_num ) + '/'
    film_data = requests.get( film_address )
    film_result = json.loads( film_data.content )
    title = film_result['title']
    # fetch the planets and keep those that list this film
    response = requests.get( 'https://swapi.dev/api/planets' )
    api_results = json.loads( response.content )
    planets = []
    for planet in api_results[ 'results' ]:
        if film_address in planet['films']:
            planets.append( planet['name'] )
    return title, planets

In [None]:
title, planets = sw_return_planets( 6 )
print( "'{}' features the planets: {}".format( title, planets ) )
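
A list endpoint like `planets` typically returns its results one page at a time, with a `next` field pointing at the following page, so a single request may see only part of the data. The sketch below shows one way to follow those links. To keep it runnable offline, the pages are faked with a dictionary and the fetching step is passed in as a function; `collect_all_results`, `fetch_json`, and the fake pages are all hypothetical names and data.

```python
def collect_all_results(first_url, fetch_json):
    """Follow 'next' links until they run out and gather every result."""
    results = []
    url = first_url
    while url:
        page = fetch_json(url)            # one page, already parsed into a dict
        results.extend(page['results'])
        url = page.get('next')            # None on the last page ends the loop
    return results


# Hypothetical two-page response standing in for a live API.
fake_pages = {
    'page1': {'results': [{'name': 'Tatooine'}, {'name': 'Alderaan'}], 'next': 'page2'},
    'page2': {'results': [{'name': 'Endor'}], 'next': None},
}

all_planets = collect_all_results('page1', fake_pages.__getitem__)
names = [planet['name'] for planet in all_planets]
print(names)
```

Against a live service, the fetch function could be something like `lambda u: requests.get(u).json()`.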

<br>

#### Okay, neat. But what does that have to do with me?

I present to you [a list of Science APIs](https://www.programmableweb.com/category/science/api)  

Example: [The Ocular Tissue DataBase](https://genome.uiowa.edu/otdb/)

I have a fun plan for the last week of the course to scrape data from one of these.....

In [None]:
url = 'https://genome.uiowa.edu/otdb/api?term=MAK'
response = requests.get( url )
api_results = json.loads( response.content )

In [None]:
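import time

# keep trying until the request goes through
page = ''
while page == '':
    try:
        page = requests.get( url )
    except requests.exceptions.ConnectionError:
        # wait a little before retrying so we don't hammer the server
        print("Connection refused by the server..")
        print("Let me sleep for 5 seconds")
        print("ZZzzzz...")
        time.sleep(5)
        print("Was a nice sleep, now let me continue...")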

In [None]:
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

session.get(url)
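
The idea behind both cells above is the same: try, wait, try again, and give up after a fixed number of attempts. The sketch below spells that out in plain Python with a simulated flaky call, so it runs without a network. The helper `call_with_retries` and its doubling wait schedule are assumptions for illustration; the `Retry` class computes its own delays and this is not a reimplementation of it.

```python
import time


def call_with_retries(func, attempts=3, backoff_factor=0.5, sleep=time.sleep):
    """Call func(), retrying on ConnectionError with growing waits."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                      # out of attempts: let the error surface
            sleep(backoff_factor * 2 ** attempt)


# Simulated request that fails twice and then succeeds.
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated failure')
    return 'ok'


# Record the waits in a list instead of actually sleeping.
waits = []
result = call_with_retries(flaky, sleep=waits.append)
print(result, waits)
```

Capping the number of attempts matters: unlike an open-ended `while` loop, this cannot spin forever if the server never comes back.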

<br>

### 'Big Data'

Traditional single-machine computing solutions do not scale to very large datasets.  
[**MapReduce**](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf) was introduced by Google as a way to distribute calculations across many networked machines.  

MapReduce is the programming model that Hadoop implements, and it influenced later tools such as Spark. Hadoop and Spark are two of the more popular 'Big Data' tools that help answer the question: what do I do when my dataset is too big for my machine?
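
The MapReduce idea can be sketched on a single machine with the classic word-count example: a map step emits `(key, value)` pairs, a shuffle step groups the pairs by key, and a reduce step combines each group. This toy version only illustrates the programming model; the function names and documents are made up, and a real framework would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values for one key into a single count."""
    return key, sum(values)


# Hypothetical input: each string could live on a different machine.
documents = ["the eye sees", "the brain sees the eye"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```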

<table>
<tr>
<td> <img src="https://ih1.redbubble.net/image.917867654.0973/flat,750x1000,075,f.jpg" alt="Drawing" style="width: 200px;"/> </td>
<td> <img src="https://iconape.com/wp-content/png_logo_vector/apache-spark.png" alt="Drawing" style="width: 200px;"/> </td>
</tr>
</table>

<br>

### Data in the Cloud

Building complex distributed networks and dealing with servers is complicated.  
Alternative: move to cloud computing. You rent servers in 'the cloud', and the hardware becomes your service provider's problem.


<table>
<tr>
<td> <img src="https://whatsthebigdata.files.wordpress.com/2017/02/cloud_storage.jpg?w=640" alt="Drawing" style="width: 200px;"/> </td>
<td> <img src="https://miro.medium.com/max/600/1*W02WEmR0_JeJXfLWN2zHwQ.png" alt="Drawing" style="width: 200px;"/> </td>
<td> <img src="https://pbs.twimg.com/profile_images/1190319303041724417/1a61e4pu_400x400.jpg" alt="Drawing" style="width: 200px;"/> </td>
<td> <img src="https://download.logo.wine/logo/Microsoft_Azure/Microsoft_Azure-Logo.wine.png" alt="Drawing" style="width: 200px;"/> </td>
<td> <img src="https://www.kindpng.com/picc/m/502-5024059_ibm-dachlawinen-text-kunststoff-watson-schwarz-vorsicht-ibm.png" alt="Drawing" style="width: 200px;"/> </td>
</tr>
</table>

## That was a lot.  ....any questions?

<img src="https://content.techgig.com/photo/80071467/pros-and-cons-of-python-programming-language-that-every-learner-must-know.jpg?132269" width="100%" style="margin-left:auto; margin-right:auto">