# Advanced Dataframes

What does advanced dataframes cover?
- Part 1: create dataframes
- Part 2: manipulate dataframes
- Part 3: transform dataframes

Why do we care? 
- Now that we understand more about what a dataframe is, these tools will allow us to wrangle our dataframes to get the data we need!

## Part 1 - Creating Dataframes
- from lists
- from dictionaries
- from SQL

In [1]:
#standard imports
import pandas as pd
import numpy as np

format: `pd.DataFrame(data)`

### From lists

In [2]:
#build a list
ls = [1,34,23,-2]
ls

[1, 34, 23, -2]

In [3]:
#pd.DataFrame(data)
pd.DataFrame(ls) #each element of the list is a row

Unnamed: 0,0
0,1
1,34
2,23
3,-2


In [4]:
#build a matrix (list of lists)
matrix = [[1,2,3],[2,3,45],[435,12,333]]
matrix

[[1, 2, 3], [2, 3, 45], [435, 12, 333]]

In [5]:
#pd.DataFrame(data)
pd.DataFrame(matrix)

Unnamed: 0,0,1,2
0,1,2,3
1,2,3,45
2,435,12,333


In [6]:
#define column names
column_names = ['first','second', 'third'] #just a list
column_names

['first', 'second', 'third']

In [7]:
#define column names when creating df
pd.DataFrame(matrix, columns=column_names)

Unnamed: 0,first,second,third
0,1,2,3
1,2,3,45
2,435,12,333


In [8]:
#make the matrix an array
matrix_array = np.array(matrix)
matrix_array

array([[  1,   2,   3],
       [  2,   3,  45],
       [435,  12, 333]])

In [9]:
#pd.DataFrame(data)
pd.DataFrame(matrix_array, columns=column_names)

Unnamed: 0,first,second,third
0,1,2,3
1,2,3,45
2,435,12,333


### From dictionaries

In [10]:
#create a dictionary with lists as values (the keys become column names)
#dictionary format: {key:value}
new_dt = {'A':[1,2,3], 'B':[2,234,33]}
new_dt

{'A': [1, 2, 3], 'B': [2, 234, 33]}

In [11]:
#create df!
pd.DataFrame(new_dt)

Unnamed: 0,A,B
0,1,2
1,2,234
2,3,33


#### Make it more complex

In [12]:
import numpy as np

#setting random seed
np.random.seed(123)

In [13]:
# Create list of values for names column
students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']
# students

In [14]:
# Randomly generate arrays of scores for each student for each subject.
# Note that all the values need to have the same length here.
math_grades = np.random.randint(low=60, high=100, size=len(students))

english_grades = np.random.randint(low=60, high=100, size=len(students))

reading_grades = np.random.randint(low=60, high=100, size=len(students))

In [30]:
np.random.randint(high=100,low=60,size=len(students))

array([79, 64, 90, 67, 89, 98, 61, 72, 63, 67, 98, 84])

In [36]:
math_grades

array([62, 88, 94, 98, 77, 79, 82, 93, 92, 69, 92, 92])

In [16]:
# Randomly generate if a student is in classroom A or classroom B
classroom = np.random.choice(['A', 'B'], len(students))

In [47]:
np.random.choice(['A','B'], len(students))

array(['B', 'B', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'B'],
      dtype='<U1')

In [50]:
#combine all values into a dictionary
student_dict = {
    'student':students,
    'math':math_grades,
    'english':english_grades,
    'reading':reading_grades,
    'room':classroom
}


In [51]:
#create df! 
pd.DataFrame(student_dict)

Unnamed: 0,student,math,english,reading,room
0,Sally,62,85,80,B
1,Jane,88,79,67,A
2,Suzie,94,74,95,A
3,Billy,98,96,88,A
4,Ada,77,92,98,A
5,John,79,76,93,B
6,Thomas,82,64,81,B
7,Marie,93,63,90,A
8,Albert,92,62,87,A
9,Richard,69,80,94,A
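
The note above that all the values need to have the same length can be checked directly. This is a minimal sketch with made-up, shortened data (`names` and `scores` are hypothetical and not part of the lesson's dataset); it only shows that pandas raises a `ValueError` when the columns in a dictionary have different lengths.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened data: three names but only two scores
names = ['Sally', 'Jane', 'Suzie']
scores = np.array([62, 88])

try:
    pd.DataFrame({'student': names, 'math': scores})
    error_message = None
except ValueError as err:
    # pandas refuses to build the dataframe when column lengths differ
    error_message = str(err)

print(error_message)

# With one score per name, the same call succeeds
fixed_scores = np.array([62, 88, 94])
df = pd.DataFrame({'student': names, 'math': fixed_scores})
print(df.shape)
```

Generating every column with `size=len(students)`, as the cells above do, is one way to keep the lengths in sync.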


### From SQL

We can also create a dataframe by connecting to a SQL database!

To do this, we will use the pymysql driver package to connect our Python code to the database.

<div class="alert alert-block alert-info">
    Do this in the command line
</div>

`python -m pip install pymysql`

Now that this is installed, we will create a **CONNECTION STRING**

This is what a **CONNECTION STRING** looks like: 

`[protocol]://[user]:[password]@[host]/[database_name]`

Example:

- protocol = 'mysql+pymysql'
- user = 'codeup'
- password = 'p@assw0rd'
- host = '123.123.123.123'
- database_name = 'some_db'

In [52]:
#example
url = 'mysql+pymysql://codeup:p@assw0rd@123.123.123.123/some_db'
url

'mysql+pymysql://codeup:p@assw0rd@123.123.123.123/some_db'

- This **CONNECTION STRING** is unique to this example user
- You will each make a unique **CONNECTION STRING** with your credentials
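
One thing to watch for: the example password contains an `@`, which is also the character that separates the credentials from the host. Because the connection string is parsed as a URL, a raw `@` in the password may be misread. A minimal sketch of one common workaround, assuming the standard library's `urllib.parse.quote_plus` is an acceptable way to percent-encode the password (the `safe_password` name is only for illustration):

```python
from urllib.parse import quote_plus

user = 'codeup'
password = 'p@assw0rd'  # example password from above; note the '@'
host = '123.123.123.123'
database_name = 'some_db'

# Percent-encode the password so its '@' is not mistaken for the
# separator between the credentials and the host
safe_password = quote_plus(password)

url = f'mysql+pymysql://{user}:{safe_password}@{host}/{database_name}'
print(url)
# mysql+pymysql://codeup:p%40assw0rd@123.123.123.123/some_db
```

If your own password has no special characters, the encoded value is identical to the original, so this step is harmless.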

<div class="alert alert-block alert-warning">
    Should you type your username and password in your .py or .ipynb?
</div>

answer: 

<div class="alert alert-block alert-danger">
    NO! 
</div>

# Q: Why not?

How to avoid this? 

1. Create a separate file called `env.py`
2. Enter in your credentials
3. Put your env.py in the same folder as your working file
4. Import your env
5. Create your connection string from imported variables
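
The steps above can be sketched as follows. The values are placeholders, not real credentials, and `get_db_url` is a hypothetical helper name. The first three assignments stand in for the contents of `env.py`; in a real project they would live in that file and be brought in with `from env import host, user, password`.

```python
# What env.py might contain (placeholder values, not real credentials)
host = 'data.codeup.com'
user = 'your_username'
password = 'your_password'


# What the working file might do once those variables are imported
def get_db_url(db_name, user=user, password=password, host=host):
    """Build a connection string for db_name (hypothetical helper)."""
    return f'mysql+pymysql://{user}:{password}@{host}/{db_name}'


url = get_db_url('employees')
```

Wrapping the f-string in a function means only the database name changes from one query to the next, and the credentials never appear in the notebook itself.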

<div class="alert alert-block alert-warning">
    Should you push your env.py file to github?
</div>

answer: 
<div class="alert alert-block alert-danger">
     NO! 
</div>

# Q: Why not? 


A: Because it contains our username and password. To keep it out of the repository, we add `env.py` to our `.gitignore`.

In [53]:
#get my credentials 
import env

In [54]:
#pull in the hostname
env.host

'data.codeup.com'

In [59]:
#pull in the username
# env.user

In [60]:
#pull in the password
# env.password

<div class="alert alert-block alert-warning">
    Should you leave your username and password printed out on your .ipynb?
</div>

answer:

<div class="alert alert-block alert-danger">
    NO! 
</div>

# Q: Why not? 

A: Because then they are still visible in the saved notebook file, even though they are not typed into the code.

Let's create our **CONNECTION STRING** and save it in a variable called `url`

In [72]:
import env

In [73]:
#create connection string
url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/some_db'

In [74]:
from env import user, password, host

In [76]:
url = f'mysql+pymysql://{user}:{password}@{host}/some_db'

<div class="alert alert-block alert-warning">
    Should you print out your connection string?
</div>

answer:

<div class="alert alert-block alert-danger">
    NO! 
</div>

# Q: Why not? 

Now that we have created our **CONNECTION STRING**, we can connect to SQL

#### Let's connect to the employees database and pull the first 5 rows

format: `pd.read_sql('literal sql syntax to pull query', connection_string)`
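
To try this call pattern without any credentials, here is a minimal sketch that swaps in an in-memory SQLite database from the standard library. The `pets` table and its rows are made up for illustration, and the second argument is a `sqlite3` connection object rather than a connection string; the lesson itself uses MySQL.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a real server
# (assumption: for illustration only; the lesson uses MySQL)
conn = sqlite3.connect(':memory:')
conn.execute('create table pets (name text, age integer)')
conn.executemany('insert into pets values (?, ?)',
                 [('Rex', 3), ('Tom', 7), ('Ivy', 1)])
conn.commit()

# Same shape as above: the SQL text first, then the connection
df = pd.read_sql('select name, age from pets where age > 2 order by age', conn)
conn.close()

print(df)
```

The query result comes back as a regular dataframe, with the selected column names as the dataframe's columns.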

In [78]:
# url

Q: What do you notice about our current connection string?

In [82]:
#url
url = f'mysql+pymysql://{user}:{password}@{host}/employees'

In [94]:
#connect to sql and pull query into df
pd.read_sql('select * from employees', url)

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
0,10001,1953-09-02,Georgi,Facello,M,1986-06-26
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
2,10003,1959-12-03,Parto,Bamford,M,1986-08-28
3,10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
4,10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12
...,...,...,...,...,...,...
300019,499995,1958-09-24,Dekang,Lichtner,F,1993-01-12
300020,499996,1953-03-07,Zito,Baaz,M,1990-09-27
300021,499997,1961-08-03,Berhard,Lenart,M,1986-04-21
300022,499998,1956-09-05,Patricia,Breugel,M,1993-10-13


#### I like my SQL queries formatted with line breaks

In [87]:
pd.read_sql('''
    select *
    from employees
    limit 5
    ''', url)

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
0,10001,1953-09-02,Georgi,Facello,M,1986-06-26
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
2,10003,1959-12-03,Parto,Bamford,M,1986-08-28
3,10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
4,10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12


#### Let's make a bigger query:

- I want a dataframe with emp_no, first_name, last_name, and dept_name
- I want it only for women and their current department
- Only pull back the first 10 employees

In [92]:
#query
query = '''
select emp_no, first_name, last_name, dept_name
from employees
	join dept_emp
		using (emp_no)
	join departments
		using (dept_no)
where gender = 'F'
	and to_date > now()
limit 10
'''

In [93]:
#connect to sql and pull query into df
pd.read_sql(query, url)

Unnamed: 0,emp_no,first_name,last_name,dept_name
0,10049,Basil,Tramer,Customer Service
1,10088,Jungsoon,Syrzycki,Customer Service
2,10112,Yuichiro,Swick,Customer Service
3,10128,Babette,Lamba,Customer Service
4,10154,Abdulah,Thibadeau,Customer Service
5,10176,Brendon,Lenart,Customer Service
6,10225,Kellie,Chinen,Customer Service
7,10231,Shaowen,Desikan,Customer Service
8,10279,Barton,Jumpertz,Customer Service
9,10335,Toshimori,Bahi,Customer Service


# Don't push your passwords!

Note: 
- host = 'data.codeup.com'
- your personal username and password are saved in Google Classroom under cloud credentials