# Advanced Dataframes

What does Advanced Dataframes cover?
- Part 1: how to create dataframes
- Part 2: how to manipulate dataframes
- Part 3: how to transform dataframes

Why do we care? 
- Now that we understand more about what a dataframe is, these tools will allow us to wrangle our dataframes to get the data we need!

## Part 1 - Creating Dataframes
- from lists
- from dictionaries
- from SQL

In [1]:
#standard imports
import pandas as pd
import numpy as np

FORMAT: `pd.DataFrame(data)`

## From lists

In [2]:
#build a list
ls = [1,34,23,-2]
ls

[1, 34, 23, -2]

In [3]:
#convert to df
pd.DataFrame(ls) #the table-style output tells us it's a df, even though it has only one column

|   | 0 |
|---|---|
| 0 | 1 |
| 1 | 34 |
| 2 | 23 |
| 3 | -2 |


In [4]:
pd.Series(ls) #the plain-text output tells us it's a Series

0     1
1    34
2    23
3    -2
dtype: int64
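
The difference between the two outputs above comes down to dimensions: a Series is one-dimensional, while a DataFrame is two-dimensional even when it has a single column. A minimal sketch, reusing the same list (the names `as_df`, `as_series`, and `column` are only for illustration):

```python
import pandas as pd

ls = [1, 34, 23, -2]

as_df = pd.DataFrame(ls)    # two-dimensional: rows and columns
as_series = pd.Series(ls)   # one-dimensional: values plus an index

print(type(as_df).__name__, as_df.shape)          # DataFrame (4, 1)
print(type(as_series).__name__, as_series.shape)  # Series (4,)

# Selecting the single column of the DataFrame gives back a Series
column = as_df[0]
print(type(column).__name__)                      # Series
```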

In [5]:
#build a matrix (list of lists)
matrix = [[1,2,3],[777,3,45],[435,12,333]] 
matrix

[[1, 2, 3], [777, 3, 45], [435, 12, 333]]

In [6]:
#convert to df
pd.DataFrame(matrix) #each element in the list is a new row
#column names default to 0,1,2,...

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 777 | 3 | 45 |
| 2 | 435 | 12 | 333 |


In [7]:
#define column names
column_names = ['first','second','third'] #saved as a list
column_names

['first', 'second', 'third']

In [8]:
#define column names when creating df
pd.DataFrame(matrix, columns=column_names)

|   | first | second | third |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 777 | 3 | 45 |
| 2 | 435 | 12 | 333 |
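
Column names do not have to be set at creation time. As a small sketch (the names `df_matrix` and `renamed` are only for illustration), the same labels can be assigned afterwards through the `columns` attribute, or individual labels can be changed with `rename`:

```python
import pandas as pd

matrix = [[1, 2, 3], [777, 3, 45], [435, 12, 333]]

df_matrix = pd.DataFrame(matrix)                  # columns default to 0, 1, 2
df_matrix.columns = ['first', 'second', 'third']  # overwrite all labels at once

# rename returns a new dataframe; df_matrix itself is left unchanged
renamed = df_matrix.rename(columns={'third': 'last'})

print(list(df_matrix.columns))  # ['first', 'second', 'third']
print(list(renamed.columns))    # ['first', 'second', 'last']
```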


In [9]:
#make the matrix an array
my_array = np.array(matrix)
my_array

array([[  1,   2,   3],
       [777,   3,  45],
       [435,  12, 333]])

In [10]:
#convert to df
pd.DataFrame(my_array, columns=column_names)

|   | first | second | third |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 777 | 3 | 45 |
| 2 | 435 | 12 | 333 |
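
The list of lists and the NumPy array describe the same data, so both routes should give dataframes with identical values. A quick check, as a sketch with the illustrative names `from_list` and `from_array`:

```python
import numpy as np
import pandas as pd

matrix = [[1, 2, 3], [777, 3, 45], [435, 12, 333]]
column_names = ['first', 'second', 'third']

from_list = pd.DataFrame(matrix, columns=column_names)
from_array = pd.DataFrame(np.array(matrix), columns=column_names)

# Compare the underlying values element by element
same_values = (from_list.to_numpy() == from_array.to_numpy()).all()
print(same_values)       # True
print(from_array.shape)  # (3, 3)
```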


## From dictionaries

In [11]:
#create a dictionary with lists as values
#dictionary format: {key:value}
new_dt = {'A':[1,2,3], 'B':[777,234,33]}
new_dt

{'A': [1, 2, 3], 'B': [777, 234, 33]}

In [12]:
#create df!
pd.DataFrame(new_dt) #the keys become column names

|   | A | B |
|---|---|---|
| 0 | 1 | 777 |
| 1 | 2 | 234 |
| 2 | 3 | 33 |


#### Make it more complex

In [13]:
#set random seed
np.random.seed(123)

#this ensures people get the same results when running random functions
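
To see what the seed does, reset it and draw again: the same seed should reproduce the same numbers. Note also that `np.random.randint` excludes the `high` value, so `high=100` tops out at 99. A small sketch (the names `first_draw`, `second_draw`, and `many` are illustrative):

```python
import numpy as np

np.random.seed(123)
first_draw = np.random.randint(low=60, high=100, size=5)

np.random.seed(123)  # reset to the same seed
second_draw = np.random.randint(low=60, high=100, size=5)

print((first_draw == second_draw).all())  # True

# high is exclusive: every value falls in 60..99
many = np.random.randint(low=60, high=100, size=1000)
print(many.min() >= 60, many.max() <= 99)  # True True
```

Running this extra cell advances the random state, so call `np.random.seed(123)` again before continuing if you want to match the outputs shown in this notebook.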

In [14]:
# Create list of values for names column
students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

In [15]:
# Randomly generate arrays of scores for each student for each subject.
# Note that all the values need to have the same length here.
math_grades = np.random.randint(low=60, high=100, size=len(students))

english_grades = np.random.randint(low=60, high=100, size=len(students))

reading_grades = np.random.randint(low=60, high=100, size=len(students))
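
The same-length requirement matters because pandas cannot line up columns of different sizes when building from a dictionary of lists or arrays. A minimal sketch of what happens otherwise (the dictionary `bad_dict` and the flag `built` are made up for illustration):

```python
import pandas as pd

# 'A' has three values but 'B' has only two
bad_dict = {'A': [1, 2, 3], 'B': [777, 234]}

try:
    pd.DataFrame(bad_dict)
    built = True
except ValueError as err:
    # pandas refuses to build the dataframe and explains why
    built = False
    print('Could not build dataframe:', err)
```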

In [16]:
math_grades

array([62, 88, 94, 98, 77, 79, 82, 93, 92, 69, 92, 92])

In [17]:
np.random.randint(2,10, size=10)

array([2, 5, 4, 7, 2, 5, 4, 4, 4, 8])

In [18]:
# Randomly generate if a student is in classroom A or classroom B
classroom = np.random.choice(['A', 'B'], len(students))

In [19]:
#combine all values into a dictionary
student_dict = {
    'student': students, #column name: variable name
    'math_grade': math_grades,
    'english_grade': english_grades,
    'reading_grade': reading_grades,
    'room': classroom
}
student_dict

{'student': ['Sally',
  'Jane',
  'Suzie',
  'Billy',
  'Ada',
  'John',
  'Thomas',
  'Marie',
  'Albert',
  'Richard',
  'Isaac',
  'Alan'],
 'math_grade': array([62, 88, 94, 98, 77, 79, 82, 93, 92, 69, 92, 92]),
 'english_grade': array([85, 79, 74, 96, 92, 76, 64, 63, 62, 80, 99, 62]),
 'reading_grade': array([80, 67, 95, 88, 98, 93, 81, 90, 87, 94, 93, 72]),
 'room': array(['B', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'B'],
       dtype='<U1')}

In [20]:
#create df! 
pd.DataFrame(student_dict)

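|   | student | math_grade | english_grade | reading_grade | room |
|---|---|---|---|---|---|
| 0 | Sally | 62 | 85 | 80 | B |
| 1 | Jane | 88 | 79 | 67 | A |
| 2 | Suzie | 94 | 74 | 95 | A |
| 3 | Billy | 98 | 96 | 88 | B |
| 4 | Ada | 77 | 92 | 98 | B |
| 5 | John | 79 | 76 | 93 | B |
| 6 | Thomas | 82 | 64 | 81 | B |
| 7 | Marie | 93 | 63 | 90 | A |
| 8 | Albert | 92 | 62 | 87 | A |
| 9 | Richard | 69 | 80 | 94 | A |
| 10 | Isaac | 92 | 99 | 93 | A |
| 11 | Alan | 92 | 62 | 72 | B |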
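
After building a dataframe from a dictionary, it is worth confirming that it has the expected shape and column types. A sketch of that check, rebuilding a smaller version of the data with the same approach (the three-student list and the name `check_df` are assumptions chosen for brevity):

```python
import numpy as np
import pandas as pd

np.random.seed(123)
students = ['Sally', 'Jane', 'Suzie']

check_df = pd.DataFrame({
    'student': students,
    'math_grade': np.random.randint(low=60, high=100, size=len(students)),
    'room': np.random.choice(['A', 'B'], len(students)),
})

print(check_df.shape)          # (3, 3): one row per student, one column per key
print(list(check_df.columns))  # ['student', 'math_grade', 'room']
print(check_df.dtypes)         # grades are integers; names and rooms are text
```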


## From SQL

We can connect to SQL to create a dataframe!

### Process

0. install driver (only need to do this one time on your computer)
1. create a connection string with server credentials
2. use connection string to connect to SQL via python and pull in results

---

### 0. install driver (only need to do this one time on your computer)

To do this, we will use the `pymysql` driver package to connect our Python code to the database.

<div class="alert alert-block alert-info">
    Type this in the command line
    
    python -m pip install pymysql
</div>

### 1. create a connection string with server credentials

Now that this is installed, we will create a **CONNECTION STRING**

This is what a **CONNECTION STRING** looks like:

`[protocol]://[user]:[password]@[host]/[database_name]`

Example:

- protocol = 'mysql+pymysql'
- user = 'codeup'
- password = 'p@assw0rd'
- host = '123.123.123.123'
- database_name = 'some_db'

In [21]:
#example
url = 'mysql+pymysql://codeup:p@assw0rd@123.123.123.123/some_db'
url

'mysql+pymysql://codeup:p@assw0rd@123.123.123.123/some_db'

- This **CONNECTION STRING** is unique to this example user
- You will each make a unique **CONNECTION STRING** with your credentials
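
One caveat with the example above: the sample password `p@assw0rd` itself contains an `@`, which is also the character that separates the credentials from the host. A common way to handle this is to URL-encode the password before placing it in the connection string. The sketch below uses `urllib.parse.quote_plus` from the standard library; the variable name `safe_password` is illustrative, and the credentials are the sample values from the example, not real ones.

```python
from urllib.parse import quote_plus

# Sample values from the example above (not real credentials)
protocol = 'mysql+pymysql'
user = 'codeup'
password = 'p@assw0rd'
host = '123.123.123.123'
database_name = 'some_db'

# Encode special characters so the '@' in the password is not read as a separator
safe_password = quote_plus(password)

url = f'{protocol}://{user}:{safe_password}@{host}/{database_name}'
print(url)  # mysql+pymysql://codeup:p%40assw0rd@123.123.123.123/some_db
```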

<div class="alert alert-block alert-warning">
    Do you see any issues with typing your username and password in your .py or .ipynb?
</div>

How to avoid this? 

1. Create a separate file called `env.py`
2. Enter your credentials
3. Ensure your `env.py` is in the same folder as your working file
4. Import your `env.py`
5. Create your connection string from imported variables

#### SAMPLE ENV

    host = '123.123.123'
    user = 'bayes_1023'
    password = 'tHl5!ls#r4N0oM'
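
A common extension of this pattern, shown here as an assumption rather than a required part of `env.py`, is a small helper that builds the connection string for any database name. The function name `get_db_url` and the placeholder credentials below are hypothetical.

```python
# Sketch of what env.py could contain (placeholder values, not real credentials)
host = '123.123.123.123'
user = 'some_user'
password = 'some_password'


def get_db_url(database_name):
    """Build a connection string for the given database from the values above."""
    return f'mysql+pymysql://{user}:{password}@{host}/{database_name}'


# In a working file this would be: import env; url = env.get_db_url('employees')
url = get_db_url('employees')
```

Keeping the helper inside `env.py` means the working notebook never has to reference the password variable directly.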


<div class="alert alert-block alert-warning">
    Should you push your env.py file to GitHub?
</div>

In [22]:
#get my credentials 
import env

In [23]:
#pull in the hostname
# env.host

In [24]:
#pull in the username
# env.user

In [25]:
#pull in the password
# env.password

<div class="alert alert-block alert-warning">
    Should you leave your username and password printed out on your .ipynb?
</div>

Let's create our **CONNECTION STRING** and save it in a variable called `url`.

In [26]:
#create connection string
url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/some_db'

In [27]:
# url 
#don't leave the url variable printed out on screen

<div class="alert alert-block alert-warning">
    Should you print out your connection string?
</div>

### 2. use connection string to connect to SQL via python

Now that we have created our **CONNECTION STRING**, we can connect to SQL

#### Let's connect to the employees database and pull the first 10 rows

FORMAT: `pd.read_sql('literal sql syntax to pull query', connection_string)`

In [28]:
url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/employees'
# url

<div class="alert alert-block alert-warning">
    Do you notice any issues with my current connection string?
</div>

In [30]:
#connect to sql and pull query into df
pd.read_sql('select * from employees limit 10', url)

|   | emp_no | birth_date | first_name | last_name | gender | hire_date |
|---|---|---|---|---|---|---|
| 0 | 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 |
| 1 | 10002 | 1964-06-02 | Bezalel | Simmel | F | 1985-11-21 |
| 2 | 10003 | 1959-12-03 | Parto | Bamford | M | 1986-08-28 |
| 3 | 10004 | 1954-05-01 | Chirstian | Koblick | M | 1986-12-01 |
| 4 | 10005 | 1955-01-21 | Kyoichi | Maliniak | M | 1989-09-12 |
| 5 | 10006 | 1953-04-20 | Anneke | Preusig | F | 1989-06-02 |
| 6 | 10007 | 1957-05-23 | Tzvetan | Zielinski | F | 1989-02-10 |
| 7 | 10008 | 1958-02-19 | Saniya | Kalloufi | M | 1994-09-15 |
| 8 | 10009 | 1952-04-19 | Sumant | Peac | F | 1985-02-18 |
| 9 | 10010 | 1963-06-01 | Duangkaew | Piveteau | F | 1989-08-24 |


#### I like my SQL queries formatted with line breaks

In [31]:
#wrap the query in triple quotes so it can span multiple lines
pd.read_sql(''' 
    select *
    from employees
    limit 5
    ''', url)

|   | emp_no | birth_date | first_name | last_name | gender | hire_date |
|---|---|---|---|---|---|---|
| 0 | 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 |
| 1 | 10002 | 1964-06-02 | Bezalel | Simmel | F | 1985-11-21 |
| 2 | 10003 | 1959-12-03 | Parto | Bamford | M | 1986-08-28 |
| 3 | 10004 | 1954-05-01 | Chirstian | Koblick | M | 1986-12-01 |
| 4 | 10005 | 1955-01-21 | Kyoichi | Maliniak | M | 1989-09-12 |
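
`pd.read_sql` also accepts an open database connection instead of a connection string. To experiment with the call pattern without server credentials, the sketch below swaps in an in-memory SQLite database from the standard library. The two-row `employees` table is a tiny stand-in copied from the output above, not the real database, and the names `conn` and `sample` are illustrative.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the MySQL server
conn = sqlite3.connect(':memory:')
conn.execute('create table employees (emp_no integer, first_name text, last_name text)')
conn.executemany(
    'insert into employees values (?, ?, ?)',
    [(10001, 'Georgi', 'Facello'), (10002, 'Bezalel', 'Simmel')],
)

# Same call shape as above: query text first, then the connection
sample = pd.read_sql('''
    select *
    from employees
    order by emp_no
    limit 1
    ''', conn)

print(sample)
conn.close()
```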


#### Let's make a bigger query:

- I want a dataframe with emp_no, first_name, last_name, and dept_name
- I want it only for women and their current department
- Pull back every matching row (no `limit`), then inspect the last few with `df.tail()`

In [32]:
#query
query = '''
select emp_no, first_name, last_name, dept_name
from employees
	join dept_emp
		using (emp_no)
	join departments
		using (dept_no)
where gender = 'F'
	and to_date > now()
'''

In [33]:
#connect to sql and pull query into df
df = pd.read_sql(query, url)

In [34]:
df.tail()

|   | emp_no | first_name | last_name | dept_name |
|---|---|---|---|---|
| 96005 | 499926 | Youpyo | Perfilyeva | Sales |
| 96006 | 499960 | Gaetan | Veldwijk | Sales |
| 96007 | 499966 | Mihalis | Crabtree | Sales |
| 96008 | 499986 | Nathan | Ranta | Sales |
| 96009 | 499987 | Rimli | Dusink | Sales |
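
The filter `to_date > now()` works because, in the MySQL employees sample database, rows for a current assignment carry the placeholder end date `9999-01-01`. The sketch below mimics that on a tiny in-memory SQLite database so the join logic can be tested offline. The rows are made up, the name `current_women` is illustrative, and SQLite's `date('now')` stands in for MySQL's `now()`.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.executescript('''
    create table employees (emp_no integer, first_name text, last_name text, gender text);
    create table dept_emp (emp_no integer, dept_no text, to_date text);
    create table departments (dept_no text, dept_name text);

    insert into employees values (1, 'Ada', 'Example', 'F'), (2, 'Alan', 'Example', 'M');
    insert into dept_emp values
        (1, 'd001', '2001-01-01'),   -- past assignment
        (1, 'd007', '9999-01-01'),   -- current assignment
        (2, 'd001', '9999-01-01');
    insert into departments values ('d001', 'Marketing'), ('d007', 'Sales');
''')

query = '''
select emp_no, first_name, last_name, dept_name
from employees
    join dept_emp
        using (emp_no)
    join departments
        using (dept_no)
where gender = 'F'
    and to_date > date('now')
'''

# Only the woman's current department survives the two filters
current_women = pd.read_sql(query, conn)
print(current_women)
conn.close()
```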


<div class="alert alert-block alert-info">
    <b>Important Recap Dos</b>

I will put my credentials in an `env.py` file.
    
I will put `env.py` in my .gitignore.
 
</div>

<div class="alert alert-block alert-danger">
    <b>Important Recap Don'ts</b>

I will not push my password to GitHub.

I will not type my password directly into my working `.py` or `.ipynb` file.
    
I will not print out my password variable.
    
I will not print out my CONNECTION STRING which contains my password. 
    
I will not push my `env.py` file to GitHub.
 
</div>

Note: your personal username and password are saved in Google Classroom under MySQL Server Credentials.