In this notebook, you'll see how to connect to a Postgres database using the sqlalchemy library.

For this notebook, you'll need both the `sqlalchemy` and `psycopg2` libraries installed.

In [1]:
from sqlalchemy import create_engine

First, we need to create a connection string. The format is

 ```<dialect(+driver)>://<username>:<password>@<hostname>:<port>/<database>```

To connect to the scooters database, you can use the following connection string.

In [2]:
database_name = 'scooters'    # Fill this in with your scooter database name

connection_string = f"postgresql://postgres:postgres@localhost:5432/{database_name}"
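The f-string above works for simple credentials, but a password containing characters such as `@` or `/` has to be percent-encoded before it goes into the URL. The following is a minimal sketch of one way to assemble the string from its parts; the helper name `build_connection_string` and its parameters are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password, hostname, port, database,
                            dialect="postgresql", driver=None):
    """Assemble <dialect(+driver)>://<username>:<password>@<hostname>:<port>/<database>.

    Hypothetical helper: the username and password are percent-encoded so that
    special characters do not break the URL.
    """
    scheme = f"{dialect}+{driver}" if driver else dialect
    return (f"{scheme}://{quote_plus(username)}:{quote_plus(password)}"
            f"@{hostname}:{port}/{database}")


# Same values as the cell above (assumed local Postgres defaults).
connection_string = build_connection_string(
    "postgres", "postgres", "localhost", 5432, "scooters"
)
```

Passing `driver="psycopg2"` would produce a `postgresql+psycopg2://` URL, which names the driver explicitly.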

Now, we need to create an engine and use it to connect.

In [3]:
engine = create_engine(connection_string)

Now, we can create our query and pass it into the engine's `.execute()` method.

In [9]:
# Look at difference in run time for this:
query = '''
SELECT latitude, longitude
FROM scooters;
'''

result = engine.execute(query)

In [12]:
# Vs this:
query = '''
SELECT COUNT(latitude)
FROM scooters;
'''

result = engine.execute(query)
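The comments in the two cells above invite a run-time comparison between returning every row and returning a single aggregate. A minimal sketch of a timing helper follows; `time_call` is a hypothetical name, and the commented-out usage assumes the notebook's `engine` and `query` variables.

```python
import time


def time_call(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (return_value, elapsed_seconds)."""
    start = time.perf_counter()
    value = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return value, elapsed


# Hypothetical usage against the notebook's engine (not run here):
# result, seconds = time_call(engine.execute, query)
# print(f"query took {seconds:.2f} s")

# Stand-in call so the sketch runs without a database.
value, elapsed = time_call(sum, range(1000))
```

In a notebook, the `%%time` cell magic gives a similar measurement without any helper.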

You can then fetch the results as tuples using either `fetchone` or `fetchall`:

In [10]:
result.fetchone()

(Decimal('36.136822'), Decimal('-86.799877'))

In [13]:
result.fetchall()

[(73414043,)]
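Note that `engine.execute()` as used above belongs to the SQLAlchemy 1.x API and was removed in SQLAlchemy 2.0. The sketch below shows the connection-based pattern that works in both 1.4 and 2.0. It assumes an in-memory SQLite engine and a two-row stand-in `scooters` table so that it runs without a Postgres server; with the notebook's `engine`, only the `connect()` / `text()` part is needed.

```python
from sqlalchemy import create_engine, text

# Stand-in engine: in-memory SQLite instead of the notebook's Postgres engine.
demo_engine = create_engine("sqlite://")

with demo_engine.connect() as connection:
    # Build a tiny stand-in table (assumption: only two of the real columns).
    connection.execute(
        text("CREATE TABLE scooters (latitude REAL, longitude REAL)")
    )
    connection.execute(
        text("INSERT INTO scooters VALUES (:lat, :lon)"),
        [{"lat": 36.136822, "lon": -86.799877}, {"lat": 36.16, "lon": -86.78}],
    )

    # Raw SQL strings are wrapped in text() before being executed.
    result = connection.execute(text("SELECT COUNT(latitude) FROM scooters"))
    row_count = result.scalar()  # first column of the first row
```

The `with` block closes the connection when it exits, so results should be fetched inside it.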

Alternatively, sqlalchemy plays nicely with pandas: `pd.read_sql` runs a query through the engine and returns the result as a DataFrame.

In [14]:
import pandas as pd

In [15]:
lat = pd.read_sql(query, con = engine)
lat.head()

      count
0  73414043


For much more information about SQLAlchemy and to see a more “Pythonic” way to execute queries, see Introduction to Databases in Python: https://www.datacamp.com/courses/introduction-to-relational-databases-in-python

# EDA
As you know, it's important to gain an understanding of new datasets before diving headlong into analysis. Here are some suggestions for guiding the process of getting to know the data contained in these tables:
- Are there any null values in any columns in either table?
- What date range is represented in each of the date columns? Investigate any values that seem odd.
- Is time represented in AM/PM or 24-hour format in each of the columns that include time?
- What values are there in the sumdgroup column? Are there any that are not of interest for this project?
- What are the minimum and maximum values for all the latitude and longitude columns? Do these ranges make sense, or is there anything surprising?
- What is the range of values for trip duration and trip distance? Do these values make sense? Explore values that might seem questionable.
- Check out how the values for the company name column in the scooters table compare to those of the trips table. What do you notice?

In [33]:
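#1 EDA - Are there any null values in any columns in either table?
# COUNT(column) counts only non-null values, so a column whose count is
# lower than the others contains nulls.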

query1_eda = '''
SELECT count(pubdatetime) as pubdate, count(latitude) as lat, count(longitude) as lon, 
    count(sumdid) as id, count(sumdtype) as type, count(chargelevel) as charge, 
    count(sumdgroup) as group, count(costpermin) as cost, count(companyname) as cmpny
FROM scooters;
'''

result1 = engine.execute(query1_eda)

In [34]:
scooter_count = pd.read_sql(query1_eda, con = engine)
print(scooter_count)

    pubdate       lat       lon        id      type    charge     group   
0  73414043  73414043  73414043  73414043  73414043  73413273  73414043  \

       cost     cmpny  
0  73414043  73414043  
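In the output above, `charge` is the only count below 73414043, which implies 73414043 - 73413273 = 770 null `chargelevel` values. As a sketch of how to report nulls directly rather than infer them, the query can subtract each column count from `COUNT(*)`. The helper name `null_count_query` is hypothetical, and it interpolates identifiers straight into the SQL, so it should only be used with trusted table and column names.

```python
def null_count_query(table, columns):
    """Build a query returning the number of NULLs in each listed column.

    COUNT(*) counts all rows and COUNT(col) counts non-null values,
    so their difference is the number of NULLs in col.
    """
    select_list = ",\n    ".join(
        f"COUNT(*) - COUNT({col}) AS {col}_nulls" for col in columns
    )
    return f"SELECT {select_list}\nFROM {table};"


# Column names taken from the scooters query above.
scooter_nulls_query = null_count_query(
    "scooters", ["pubdatetime", "latitude", "longitude", "chargelevel"]
)

# Hypothetical usage with the notebook's engine (not run here):
# scooter_nulls = pd.read_sql(scooter_nulls_query, con=engine)
```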


In [35]:
query1a_eda = '''
SELECT count(pubtimestamp) as pubstmp, count(companyname) as cmpny, count(triprecordnum) as triprcd, 
    count(sumdid) as id, count(tripduration) as tripdur, count(tripdistance) as tripdis, 
    count(startdate) as stdt, count(starttime) as sttm, count(enddate) as eddt
FROM trips;
'''

result1a = engine.execute(query1a_eda)

In [36]:
trip_count = pd.read_sql(query1a_eda, con = engine)
print(trip_count)

   pubstmp   cmpny  triprcd      id  tripdur  tripdis    stdt    sttm    eddt
0   565522  565522   565522  565522   565522   565522  565522  565522  565522


In [37]:
print(result1a)

<sqlalchemy.engine.cursor.LegacyCursorResult object at 0x000001CEFF183D90>


In [38]:
#2 EDA - What date range is represented in each of the date columns? Investigate any values that seem odd.

query2_eda=  '''
SELECT min(pubdatetime) as min_sctr_dt, max(pubdatetime) as max_sctr_dt
FROM scooters;
'''

result2 = engine.execute(query2_eda)

In [39]:
date_rvw1 = pd.read_sql(query2_eda, con = engine)
print(date_rvw1)

              min_sctr_dt         max_sctr_dt
0 2019-05-01 00:01:41.247 2019-07-31 23:59:57
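The same question applies to the date columns of the `trips` table. The sketch below builds one MIN/MAX query for several columns at once; `min_max_query` is a hypothetical helper, the column names are the ones used in the earlier `trips` query, and identifiers are interpolated directly, so it is only suitable for trusted names.

```python
def min_max_query(table, columns):
    """Build a query returning MIN and MAX for each listed column."""
    parts = []
    for col in columns:
        parts.append(f"MIN({col}) AS min_{col}")
        parts.append(f"MAX({col}) AS max_{col}")
    return "SELECT " + ",\n    ".join(parts) + f"\nFROM {table};"


trips_dates_query = min_max_query(
    "trips", ["pubtimestamp", "startdate", "enddate"]
)

# Hypothetical usage with the notebook's engine (not run here):
# date_rvw2 = pd.read_sql(trips_dates_query, con=engine)
```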
