## Advanced Dataframes

In [1]:
import pandas as pd
import numpy as np

In [2]:
np.random.seed(123)

# Create list of values for names column.

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# Randomly generate arrays of scores for each student for each subject.
# Note that all the values need to have the same length here.

math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))


In [3]:
# Construct the DataFrame using the above lists and arrays.


df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades,
                   'classroom': np.random.choice(['A', 'B'], len(students))})


In [4]:
df

Unnamed: 0,name,math,english,reading,classroom
0,Sally,62,85,80,A
1,Jane,88,79,67,B
2,Suzie,94,74,95,A
3,Billy,98,96,88,B
4,Ada,77,92,98,A
5,John,79,76,93,B
6,Thomas,82,64,81,A
7,Marie,93,63,90,A
8,Albert,92,62,87,A
9,Richard,69,80,94,A
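
The note about equal lengths in the cell above can be checked directly. The following is a minimal sketch with made-up, deliberately mismatched lists (`names` and `scores` are illustrative and not part of the notebook); pandas should refuse to build the DataFrame and raise a `ValueError`.

```python
import pandas as pd

# Illustrative data only: three names but just two scores.
names = ['Sally', 'Jane', 'Suzie']
scores = [62, 88]  # one value short on purpose

try:
    pd.DataFrame({'name': names, 'math': scores})
    error_message = None
except ValueError as err:
    # pandas rejects dictionary values of unequal length.
    error_message = str(err)

print(error_message)
```

Generating every array with `size=len(students)`, as the notebook does, avoids this problem.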


## Takeaways
- Can make a DataFrame out of a dictionary of lists (or arrays), as above
- Can also make a DataFrame out of a list of dictionaries
- Or a list of lists
- Or a 2-D array (an array of arrays)
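
As a rough sketch of the other constructors listed above, the same small table can be built from a list of dictionaries, a list of lists, or a 2-D NumPy array. The variable names here are illustrative, and the values are just the first two rows of the earlier output.

```python
import numpy as np
import pandas as pd

# List of dictionaries: each dict is one row; keys become column names.
rows = [
    {'name': 'Sally', 'math': 62},
    {'name': 'Jane', 'math': 88},
]
from_dicts = pd.DataFrame(rows)

# List of lists: each inner list is one row; column names are passed separately.
from_lists = pd.DataFrame([['Sally', 62], ['Jane', 88]], columns=['name', 'math'])

# 2-D NumPy array: each inner array is one row.
from_array = pd.DataFrame(np.array([[62, 85], [88, 79]]), columns=['math', 'english'])

print(from_dicts)
print(from_lists)
print(from_array)
```

With a list of lists or an array, the column names have to be supplied through `columns=`; otherwise pandas labels them 0, 1, 2, and so on.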

### More...
- `.csv` files
- `.json` files
- SQL queries

In [6]:
from pydataset import data


In [7]:
data()


Unnamed: 0,dataset_id,title
0,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
1,BJsales,Sales Data with Leading Indicator
2,BOD,Biochemical Oxygen Demand
3,Formaldehyde,Determination of Formaldehyde
4,HairEyeColor,Hair and Eye Color of Statistics Students
...,...,...
752,VerbAgg,Verbal Aggression item responses
753,cake,Breakage Angle of Chocolate Cakes
754,cbpp,Contagious bovine pleuropneumonia
755,grouseticks,Data on red grouse ticks from Elston et al. 2001


In [8]:
# Load the dataset and store it in the variable mpg.

from pydataset import data
mpg = data('mpg')
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [9]:
data('mpg', show_doc=True) 


mpg

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Fuel economy data from 1999 and 2008 for 38 popular models of car

### Description

This dataset contains a subset of the fuel economy data that the EPA makes
available on http://fueleconomy.gov. It contains only models which had a new
release every year between 1999 and 2008 - this was used as a proxy for the
popularity of the car.

### Usage

    data(mpg)

### Format

A data frame with 234 rows and 11 variables

### Details

  * manufacturer. 

  * model. 

  * displ. engine displacement, in litres 

  * year. 

  * cyl. number of cylinders 

  * trans. type of transmission 

  * drv. f = front-wheel drive, r = rear wheel drive, 4 = 4wd 

  * cty. city miles per gallon 

  * hwy. highway miles per gallon 

  * fl. 

  * class. 




## From CSV on web

In [11]:
url = "https://gist.githubusercontent.com/ryanorsinger/19bc7eccd6279661bd13307026628ace/raw/e4b5d6787015a4782f96cad6d1d62a8bdbac54c7/lemonade.csv"
lemonade = pd.read_csv(url)
lemonade.head()

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
0,1/1/17,Sunday,27.0,2.0,15,0.5,10
1,1/2/17,Monday,28.9,1.33,15,0.5,13
2,1/3/17,Tuesday,34.5,1.33,27,0.5,15
3,1/4/17,Wednesday,44.1,1.05,28,0.5,17
4,1/5/17,Thursday,42.4,1.0,33,0.5,18


In [14]:
# Read a CSV from a local file (here, the data/ subfolder of the working directory).

#file = 'filename.csv'  # provide a relative or absolute path; relative is best for collaboration

bp = pd.read_csv('data/bp.csv')
bp.head()


Unnamed: 0,date,systolic,diastolic,pulse
0,"December 02, 2014 at 06:45AM",120.0 mmHg,76.0 mmHg,69.0 bpm
1,"December 02, 2014 at 06:48AM",119.0 mmHg,75.0 mmHg,69.0 bpm
2,"December 03, 2014 at 10:12AM",145.0 mmHg,98.0 mmHg,105.0 bpm
3,"December 04, 2014 at 01:16AM",129.0 mmHg,93.0 mmHg,107.0 bpm
4,"December 04, 2014 at 09:42AM",154.0 mmHg,96.0 mmHg,96.0 bpm
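
Since `data/bp.csv` is not available outside this notebook, here is a self-contained sketch of the same local-file pattern. It writes the first two lemonade rows shown earlier to a temporary folder and reads them back; the file name `lemonade_sample.csv` is a made-up placeholder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the lemonade output above.
csv_text = (
    'Date,Day,Temperature,Rainfall,Flyers,Price,Sales\n'
    '1/1/17,Sunday,27.0,2.0,15,0.5,10\n'
    '1/2/17,Monday,28.9,1.33,15,0.5,13\n'
)

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file name, used only for this illustration.
    path = Path(folder) / 'lemonade_sample.csv'
    path.write_text(csv_text)
    # read_csv accepts a path object as well as a string or URL.
    sample = pd.read_csv(path)

print(sample)
```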


## JSON
- JavaScript Object Notation
- Looks very much like Python syntax for a dictionary or a list of dictionaries (JSON's `true`, `false`, and `null` correspond to Python's `True`, `False`, and `None`)
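
To illustrate how JSON text maps onto Python objects, here is a minimal sketch using a hand-written JSON string. The records are modeled on the quotes API output in this section, and the `null` values are an assumption about how the empty `name` field is encoded.

```python
import json
from io import StringIO

import pandas as pd

# Hand-written sample; the null values are assumed, not taken from the API.
raw = '''[
    {"quote": "Respect the specs", "author": "Dr. Linda F. Wilson", "name": null},
    {"quote": "Predispose yourself to practice", "author": "anonymous", "name": null}
]'''

# json.loads turns the text into a list of dictionaries (null becomes None).
records = json.loads(raw)
from_records = pd.DataFrame(records)

# pd.read_json parses the same text directly into a DataFrame.
from_text = pd.read_json(StringIO(raw))

print(from_records)
```

Both routes give one row per JSON object, with the object keys as column names.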

In [13]:
url2 = "https://aphorisms.glitch.me/api/all"
quotes = pd.read_json(url2)
quotes.head()

Unnamed: 0,quote,author,name
0,"To go fast, go alone. To go far, go together",African Proverb,
1,"In fact, the only way to manage stress is to build up your resilience and strength.",anomymous,
2,Predispose yourself to practice,anonymous,
3,Respect the specs,Dr. Linda F. Wilson,
4,What we're doing is paint along with me rather than take detailed notes on the colors ...,Zach Gulde,


## SQL & Python / Pandas
- Make 100% sure that you have a `.gitignore` file that lists `env.py`
- Might need to install pymysql: `python -m pip install pymysql`
- Make an `env.py` file with three strings stored in variables
    - `host = ip.of.sql.server`
    - `user = your_username`
    - `password = your_password`
- Import values from `env.py`
- Create connection string
- Create SQL query
    - `select * from employees`
- Use `pd.read_sql(query, url)` to have pandas run the SQL and return a df
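
The steps above can be sketched roughly as follows. Several things here are assumptions rather than part of the notes: the helper name `get_db_url`, the `mysql+pymysql://` URL format (which relies on SQLAlchemy being installed), the database name `employees`, the host `db.example.com`, and the table columns. So that the example runs without a MySQL server, an in-memory SQLite database stands in for the real connection.

```python
import sqlite3

import pandas as pd


def get_db_url(user, password, host, database):
    """Build a SQLAlchemy-style connection string (hypothetical helper)."""
    return f'mysql+pymysql://{user}:{password}@{host}/{database}'


# In practice these values would come from env.py:
#     from env import host, user, password
url = get_db_url('your_username', 'your_password', 'db.example.com', 'employees')

# Stand-in for the real server: an in-memory SQLite table with made-up columns.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE employees (emp_no INTEGER, first_name TEXT)')
conn.executemany('INSERT INTO employees VALUES (?, ?)', [(1, 'Ada'), (2, 'Alan')])

query = 'select * from employees'

# Against the real server this would be pd.read_sql(query, url).
employees = pd.read_sql(query, conn)
conn.close()

print(employees)
```

Keeping the credentials in `env.py` and out of version control means the connection string is only ever assembled at run time.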