# Advanced Dataframes

## Part 1 - Creating Dataframes

format: `pd.DataFrame(data)`

- from lists
- from dictionaries
- from SQL

In [1]:
#standard imports
import pandas as pd
import numpy as np

### From lists

In [2]:
#build a matrix (list of lists)
matrix = [[1,2,3], [2,3,55],[6,7,8]]
matrix

[[1, 2, 3], [2, 3, 55], [6, 7, 8]]

In [3]:
#pd.DataFrame(data)
pd.DataFrame(matrix)

Unnamed: 0,0,1,2
0,1,2,3
1,2,3,55
2,6,7,8


In [8]:
'1' in pd.DataFrame(matrix).columns

False

In [9]:
pd.DataFrame(matrix)[[0,2]]

Unnamed: 0,0,2
0,1,3
1,2,55
2,6,8


> column names default to integers

In [10]:
#define column names
column_names = ['first','second','third'] 
column_names

['first', 'second', 'third']

> the number of column names that I create must match the number of columns I actually have
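
To see why the counts have to match, here is a minimal sketch that passes only two names for the three-column `matrix` from above. The variable `mismatch_error` is just an illustrative name, and the exact exception type and message can differ between pandas versions, so the example catches both likely types rather than promising a specific one.

```python
import pandas as pd

matrix = [[1, 2, 3], [2, 3, 55], [6, 7, 8]]

# Only two names for three columns: pandas refuses to build the dataframe.
try:
    pd.DataFrame(matrix, columns=['first', 'second'])
    mismatch_error = None
except (ValueError, AssertionError) as err:
    # Recent pandas versions raise ValueError; older ones may raise AssertionError.
    mismatch_error = err

print(type(mismatch_error).__name__, mismatch_error)
```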

In [11]:
#pd.DataFrame(data)
pd.DataFrame(matrix, columns=column_names)
# alternatively, redefine the column names a couple different ways
# after initializing the df:
# df.rename(columns={0: 'first', 1: 'second', 2: 'third'})
# df.columns = column_names

Unnamed: 0,first,second,third
0,1,2,3
1,2,3,55
2,6,7,8


In [12]:
#make the matrix an array
matrix_array = np.array(matrix)
matrix_array

array([[ 1,  2,  3],
       [ 2,  3, 55],
       [ 6,  7,  8]])

In [13]:
#pd.DataFrame(data)
pd.DataFrame(matrix_array, columns=column_names)

Unnamed: 0,first,second,third
0,1,2,3
1,2,3,55
2,6,7,8


### From dictionaries

In [14]:
#create a dictionary with lists as values
#dictionary format: {key:value}
new_dt = {'A':[1,3,4], 'B':[2,355,5]}
new_dt

{'A': [1, 3, 4], 'B': [2, 355, 5]}

In [15]:
#create!
pd.DataFrame(new_dt)

Unnamed: 0,A,B
0,1,2
1,3,355
2,4,5


#### make it more complex

In [16]:
np.random.seed(123)

In [17]:
# Create list of values for names column
students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']
students

['Sally',
 'Jane',
 'Suzie',
 'Billy',
 'Ada',
 'John',
 'Thomas',
 'Marie',
 'Albert',
 'Richard',
 'Isaac',
 'Alan']

In [18]:
# Randomly generate arrays of scores for each student for each subject.
# Note that all the values need to have the same length here.
math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))
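
As a quick check on the same-length note above, this sketch builds a dictionary where one list is shorter than the other and hands it to `pd.DataFrame`. The dictionary `uneven` and the variable `length_error` are made up for the example, and the wording of the error message may vary by pandas version.

```python
import pandas as pd

# 'math' has three values but 'english' has only two.
uneven = {'math': [62, 88, 94], 'english': [85, 79]}

try:
    pd.DataFrame(uneven)
    length_error = None
except ValueError as err:
    # pandas cannot line the values up into rows, so it raises.
    length_error = err

print(length_error)
```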

In [19]:
math_grades

array([62, 88, 94, 98, 77, 79, 82, 93, 92, 69, 92, 92])

In [20]:
english_grades

array([85, 79, 74, 96, 92, 76, 64, 63, 62, 80, 99, 62])

In [21]:
reading_grades

array([80, 67, 95, 88, 98, 93, 81, 90, 87, 94, 93, 72])

In [24]:
# Randomly generate if a student is in classroom A or classroom B
classroom = np.random.choice(['A', 'B'], len(students))

In [25]:
classroom

array(['A', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'A'],
      dtype='<U1')

In [None]:
# np.random.choice(['A','B'], size=len(students))

In [22]:
#combine all values into a dictionary
student_dict = {
    'students': students,
    'math':math_grades,
    'english':english_grades,
    'reading':reading_grades,
    'room':classroom
}
student_dict

{'students': ['Sally',
  'Jane',
  'Suzie',
  'Billy',
  'Ada',
  'John',
  'Thomas',
  'Marie',
  'Albert',
  'Richard',
  'Isaac',
  'Alan'],
 'math': array([62, 88, 94, 98, 77, 79, 82, 93, 92, 69, 92, 92]),
 'english': array([85, 79, 74, 96, 92, 76, 64, 63, 62, 80, 99, 62]),
 'reading': array([80, 67, 95, 88, 98, 93, 81, 90, 87, 94, 93, 72]),
 'room': array(['A', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'A'],
       dtype='<U1')}

In [None]:
pd.DataFrame(student_dict)

Unnamed: 0,students,math,english,reading,room
0,Sally,62,85,80,A
1,Jane,88,79,67,B
2,Suzie,94,74,95,A
3,Billy,98,96,88,B
4,Ada,77,92,98,A
5,John,79,76,93,B
6,Thomas,82,64,81,A
7,Marie,93,63,90,A
8,Albert,92,62,87,A
9,Richard,69,80,94,A
10,Isaac,92,99,93,B
11,Alan,92,62,72,A


### From SQL

We can connect to SQL to create a dataframe!! yay!

To do this, we will use the pymysql driver package to connect the database to our Python code

`python -m pip install pymysql`

Now that this is installed, we will create a **CONNECTION STRING**

This is what a **CONNECTION STRING** looks like:

`[protocol]://[user]:[password]@[host]/[database_name]`

example of the **CONNECTION STRING**:

`mysql+pymysql://codeup:p@assw0rd@123.123.123.123/some_db`

In this example:

- protocol = 'mysql+pymysql'
- user = 'codeup'
- password = 'p@assw0rd'
- host = '123.123.123.123'
- database_name = 'some_db'

You will each make a unique **CONNECTION STRING** with your credentials
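
One thing to watch for: the example password contains an `@`, which is also the character that separates the credentials from the host. Depending on the library and version that parses the URL, an unescaped `@` in the password may be misread. A common precaution, sketched here with the standard library's `urllib.parse.quote_plus`, is to percent-encode the password before building the string. The variable name `safe_password` is just for illustration.

```python
from urllib.parse import quote_plus

# Values from the example above.
user = 'codeup'
password = 'p@assw0rd'
host = '123.123.123.123'
database_name = 'some_db'

# Percent-encode special characters so the only raw '@' left is the separator.
safe_password = quote_plus(password)

url = f'mysql+pymysql://{user}:{safe_password}@{host}/{database_name}'
print(url)
```

If your password has no special characters, the encoded version is identical to the original, so this step is harmless to keep.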

In [None]:
#example
url = 'mysql+pymysql://codeup:p@assw0rd@123.123.123.123/some_db'
url

<div class="alert alert-block alert-warning">
    Should you type your username and password in your .py or .ipynb? Why or why not?
</div>

answer: 

<div class="alert alert-block alert-danger">
    NO! 
</div>

How to avoid this? 

1. Create a separate file called env.py
2. Enter in your credentials
3. Put your env.py in the same folder as your working file
4. Import your env
5. Create your connection string from imported variables
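
The steps above can be sketched as a small `env.py`. Everything here is hypothetical: the placeholder values stand in for your real credentials, and `get_db_url` is a helper name invented for this example, not something a library provides.

```python
# env.py (hypothetical contents: replace the placeholders with your own credentials)
user = 'my_username'
password = 'my_secrets'
host = 'my_server'


def get_db_url(db_name, user=user, password=password, host=host):
    """Build a connection string for db_name without printing the credentials."""
    return f'mysql+pymysql://{user}:{password}@{host}/{db_name}'


# In the working file you would write `import env` and then call
# env.get_db_url('some_db'), storing the result in a variable instead of displaying it.
url = get_db_url('some_db')
```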

<div class="alert alert-block alert-warning">
    Should you push your env.py file to github?
</div>

answer: 
<div class="alert alert-block alert-danger">
     NO! 
</div>

**We will not push our username and password to github**

**We will add our env file to our gitignore**

**We will not rely on our global gitignore to ignore our env file**

**We will add env.py to our gitignore in every repository where we use our env credentials**

In [2]:
import env
import fake_env

In [3]:
#pull in the hostname
fake_env.host

'my_server'

In [4]:
# if I want to hide that, then I want to put it
# in a variable that does not
# go to my output and does not print

In [5]:
my_host = fake_env.host

In [6]:
my_fake_url = fake_env.create_url(
    fake_env.user,
    fake_env.host,
    fake_env.password,
    'fruits_db')

In [7]:
# don't output your url here,
# but if you did,
# it would look like this:
my_fake_url

'sql+pymysql://my_username:my_secrets@my_server/fruits_db'

In [None]:
#pull in the username
# env.user

In [None]:
#pull in the password
# env.password

<div class="alert alert-block alert-warning">
    Should you leave your username and password printed out on your .ipynb?
</div>

answer:

<div class="alert alert-block alert-danger">
    NO! 
</div>

Let's create our **CONNECTION STRING** and save it into a variable called url

In [8]:
#create connection string
url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/some_db'

<div class="alert alert-block alert-warning">
    Should you print out your connection string?
</div>

answer:

<div class="alert alert-block alert-danger">
    NO! 
</div>

**We will not type our username and password in our working file**

**We will not print out our username and password after importing them**

**We will not print out our url that contains our username and password**

Now that we have created our **CONNECTION STRING**, we can connect to SQL

#### Let's connect to the employees database and pull the first 5 rows

format: `pd.read_sql('SQL query as a string', connection_string)`

 - What do you notice about our current connection string?
     - It points at some_db, so we need to update it to the employees db

In [9]:
import pandas as pd

In [10]:
import env

In [11]:
#url
url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/employees'

In [12]:
query = 'select * from employees limit 5'

In [13]:
#connect to sql and pull query into df
df = pd.read_sql(query, url)
df

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
0,10001,1953-09-02,Georgi,Facello,M,1986-06-26
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
2,10003,1959-12-03,Parto,Bamford,M,1986-08-28
3,10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
4,10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12
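
If you want to practice `pd.read_sql` without any credentials, here is a minimal sketch that uses Python's built-in `sqlite3` module and an in-memory database instead of the MySQL server. The table contents are made up for the example, and `practice_df` is just an illustrative name. The call has the same shape as above: a query string first, then the connection.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database: nothing is written to disk and no password is needed.
conn = sqlite3.connect(':memory:')
conn.execute('create table employees (emp_no integer, first_name text, last_name text)')
conn.executemany(
    'insert into employees values (?, ?, ?)',
    [(1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing'), (3, 'Marie', 'Curie')],  # made-up rows
)
conn.commit()

# Same pattern as before: pd.read_sql(query, connection)
practice_df = pd.read_sql('select * from employees limit 2', conn)
conn.close()

print(practice_df)
```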


#### I like my SQL queries formatted with line breaks (also, if you need to use a single quote or double quote inside the query, triple quotes let you do that)

In [None]:
pd.read_sql('''
        select * 
        from employees
        limit 5
        ''', url
        )

#### Let's make a bigger query:

- I want a dataframe with emp_no, first_name, last_name, and dept_name
- I want it only for women and their current department
- Only pull back the first 10 employees

In [14]:
#query
query = '''
select emp_no, first_name, last_name, dept_name
from employees
    join dept_emp
        using (emp_no)
    join departments
        using (dept_no)
where gender = 'F'
limit 10
;
'''
query

"\nselect emp_no, first_name, last_name, dept_name\nfrom employees\n    join dept_emp\n        using (emp_no)\n    join departments\n        using (dept_no)\nwhere gender = 'F'\nlimit 10\n;\n"

In [15]:
#connect to sql and pull query into df
pd.read_sql(query, url)

Unnamed: 0,emp_no,first_name,last_name,dept_name
0,10002,Bezalel,Simmel,Sales
1,10006,Anneke,Preusig,Development
2,10007,Tzvetan,Zielinski,Research
3,10009,Sumant,Peac,Quality Management
4,10010,Duangkaew,Piveteau,Production
5,10010,Duangkaew,Piveteau,Quality Management
6,10011,Mary,Sluis,Customer Service
7,10017,Cristinel,Bouloucos,Marketing
8,10018,Kazuhide,Peha,Production
9,10018,Kazuhide,Peha,Development
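
Notice that emp_no 10010 and 10018 each appear twice in the output above: the query joins every department an employee has ever been in, not only the current one. A possible fix is to filter on the end date of the department assignment. The sketch below assumes, as in the standard MySQL employees sample database, that `dept_emp` has a `to_date` column and that current assignments are marked with `'9999-01-01'`. Check your own schema before relying on that. To stay runnable without credentials, it uses a tiny made-up SQLite database with the same table and column names.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(':memory:')
conn.executescript('''
create table employees (emp_no integer, first_name text, gender text);
create table departments (dept_no text, dept_name text);
create table dept_emp (emp_no integer, dept_no text, to_date text);

-- made-up rows: employee 1 moved from Sales to Research
insert into employees values (1, 'Ada', 'F'), (2, 'Alan', 'M'), (3, 'Marie', 'F');
insert into departments values ('d1', 'Sales'), ('d2', 'Research');
insert into dept_emp values
    (1, 'd1', '1999-01-01'),
    (1, 'd2', '9999-01-01'),
    (2, 'd1', '9999-01-01'),
    (3, 'd1', '9999-01-01');
''')

base_query = '''
select emp_no, first_name, dept_name
from employees
    join dept_emp
        using (emp_no)
    join departments
        using (dept_no)
where gender = 'F'
'''

# Without the date filter, employee 1 shows up once per department ever held.
all_rows = pd.read_sql(base_query + ' order by emp_no', conn)

# With the filter, only the current assignment remains (assumed sentinel date).
current_rows = pd.read_sql(
    base_query + " and to_date = '9999-01-01' order by emp_no", conn
)
conn.close()

print(all_rows)
print(current_rows)
```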


# Don't push your passwords!

Note: 
- host = 'data.codeup.com'
- your personal username and password are saved in Google Classroom under cloud credentials