In this notebook, you'll see how to connect to a Postgres database using the sqlalchemy library.

For this notebook, you'll need both the `sqlalchemy` and `psycopg2` libraries installed.

In [25]:
from sqlalchemy import create_engine

First, we need to create a connection string. The format is

 ```<dialect(+driver)>://<username>:<password>@<hostname>:<port>/<database>```

To connect to the scooters database, you can use the following connection string.

In [26]:
database_name = 'scooters'    # Fill this in with your scooter database name

connection_string = f"postgresql://postgres:postgres@localhost:5432/{database_name}"
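The f-string above works as long as the credentials contain no characters that are special in a URL. A minimal sketch of a helper that assembles the same format and percent-encodes the username and password follows. The helper name `build_connection_string` and the sample password are hypothetical and used only for illustration.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password, hostname, port, database,
                            dialect="postgresql", driver=None):
    """Assemble <dialect(+driver)>://<username>:<password>@<hostname>:<port>/<database>."""
    scheme = f"{dialect}+{driver}" if driver else dialect
    # Percent-encode the credentials so characters such as '@' or '/'
    # are not mistaken for URL delimiters.
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"{scheme}://{user}:{pwd}@{hostname}:{port}/{database}"


# Hypothetical credentials, for illustration only.
plain = build_connection_string("postgres", "postgres", "localhost", 5432, "scooters")
escaped = build_connection_string("postgres", "p@ss/word", "localhost", 5432,
                                  "scooters", driver="psycopg2")
print(plain)
print(escaped)
```

The first call reproduces the string used in this notebook. The second shows the optional `+driver` part of the format and an encoded password.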

Now, we need to create an engine and use it to connect.

In [27]:
engine = create_engine(connection_string)

Now, we can create our query and pass it into the engine's `.execute()` method.

In [4]:
# Look at difference in run time for this:
query = '''
SELECT latitude
FROM scooters;
'''

result = engine.execute(query)

In [5]:
# Vs this:
query = '''
SELECT COUNT(latitude)
FROM scooters;
'''

result = engine.execute(query)
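The run-time difference between the two queries comes mostly from how much data has to travel back to Python: the first returns one row per record, the second a single aggregated row. A rough way to measure this is sketched below, using Python's built-in `sqlite3` module as a stand-in so it runs without a Postgres server. The table contents, the row count, and the `timed` helper are made up for the example, and timings on the real database will differ.

```python
import sqlite3
import time

# Stand-in data: an in-memory SQLite table with a hypothetical number of rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE scooters (latitude REAL)")
demo_conn.executemany(
    "INSERT INTO scooters VALUES (?)",
    ((36.0 + i * 1e-6,) for i in range(200_000)),
)


def timed(sql):
    """Run a query, fetch every row, and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = demo_conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


all_rows, t_rows = timed("SELECT latitude FROM scooters")
count_rows, t_count = timed("SELECT COUNT(latitude) FROM scooters")

print(f"row-by-row: {len(all_rows)} rows in {t_rows:.4f}s")
print(f"aggregate:  {count_rows[0][0]} counted in {t_count:.4f}s")
```

Both approaches arrive at the same count, but only the first has to materialize every row on the client.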

You can then fetch the results as tuples using either `fetchone` or `fetchall`:

In [6]:
result.fetchone()

(73414043,)

In [7]:
result.fetchall()

[]
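The empty list is expected: a result behaves like a cursor, so rows that `fetchone` has already returned are not returned again by `fetchall`. The same behavior can be reproduced with the standard-library `sqlite3` module, used here only as a stand-in with made-up rows.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE scooters (latitude REAL)")
demo_conn.executemany("INSERT INTO scooters VALUES (?)", [(36.1,), (36.2,), (None,)])

cursor = demo_conn.execute("SELECT COUNT(latitude) FROM scooters")
first = cursor.fetchone()   # consumes the only row in the result
rest = cursor.fetchall()    # nothing is left, so this is an empty list

print(first)
print(rest)
```

Note that `COUNT(latitude)` skips the `NULL` row, which is why the count is 2 rather than 3.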

Alternatively, sqlalchemy plays nicely with pandas: `read_sql` runs the query through the engine and returns the result as a DataFrame.

In [45]:
import pandas as pd

In [9]:
lat = pd.read_sql(query, con=engine)
lat.head()

Unnamed: 0,count
0,73414043
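With more than 73 million rows in the table, loading a non-aggregated query into a single DataFrame may not fit comfortably in memory. One option is the `chunksize` argument of `pd.read_sql_query`, which makes it return an iterator of DataFrames. The sketch below uses an in-memory SQLite connection with made-up rows as a stand-in. Depending on the database driver, rows may still be buffered on the client unless server-side cursors are enabled, so treat this as an illustration of the pandas side only.

```python
import sqlite3

import pandas as pd

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE scooters (latitude REAL, companyname TEXT)")
demo_conn.executemany(
    "INSERT INTO scooters VALUES (?, ?)",
    [(36.10 + i / 100, "Bird") for i in range(10)],
)

total_rows = 0
max_latitude = float("-inf")

# With chunksize set, read_sql_query yields DataFrames of at most 4 rows each.
for chunk in pd.read_sql_query("SELECT latitude FROM scooters", demo_conn, chunksize=4):
    total_rows += len(chunk)
    max_latitude = max(max_latitude, chunk["latitude"].max())

print(total_rows, max_latitude)
```

Running totals such as a row count or a maximum can be updated chunk by chunk, so the full result never has to be held at once.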


For much more information about SQLAlchemy and to see a more “Pythonic” way to execute queries, see Introduction to Databases in Python: https://www.datacamp.com/courses/introduction-to-relational-databases-in-python

In [39]:
query = '''
SELECT *
FROM scooters
LIMIT 5;
'''
result = engine.execute(query)

In [40]:
print(result)

<sqlalchemy.engine.cursor.LegacyCursorResult object at 0x000001F8958217B0>
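The `LegacyCursorResult` in this output indicates SQLAlchemy 1.4 running in its legacy mode. Calling `execute` directly on the engine was deprecated in 1.4 and removed in 2.0, so on newer versions the cells above would need a connection and a `text()` construct instead. A minimal sketch of that pattern follows, with an in-memory SQLite engine standing in for Postgres; the name `demo_engine` and the sample rows are hypothetical.

```python
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for Postgres so the sketch runs without a server.
demo_engine = create_engine("sqlite:///:memory:")

with demo_engine.connect() as conn:
    conn.execute(text("CREATE TABLE scooters (latitude REAL)"))
    # A list of dicts runs the statement once per parameter set.
    conn.execute(
        text("INSERT INTO scooters VALUES (:lat)"),
        [{"lat": 36.1}, {"lat": 36.2}, {"lat": 36.3}],
    )

    result = conn.execute(text("SELECT latitude FROM scooters LIMIT 2"))
    rows = result.fetchall()

    # scalar() returns the first column of the first row.
    n_rows = conn.execute(text("SELECT COUNT(*) FROM scooters")).scalar()

print(rows)
print(n_rows)
```

The `with` block also closes the connection when it ends, which the bare `engine.execute` calls leave to the library.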


In [49]:
query = '''
SELECT *
FROM scooters
LIMIT 10
'''

df = pd.read_sql_query(query, engine)

In [50]:
print(df)

              pubdatetime   latitude  longitude        sumdid sumdtype   
0 2019-05-01 00:01:41.247  36.136822 -86.799877  PoweredLIRL1  Powered  \
1 2019-05-01 00:01:41.247  36.191252 -86.772945  PoweredXWRWC  Powered   
2 2019-05-01 00:01:41.247  36.144752 -86.806293  PoweredMEJEH  Powered   
3 2019-05-01 00:01:41.247  36.162056 -86.774688  Powered1A7TC  Powered   
4 2019-05-01 00:01:41.247  36.150973 -86.783109  Powered2TYEF  Powered   
5 2019-05-01 00:01:41.247  36.157188 -86.769978  Powered3F3VK  Powered   
6 2019-05-01 00:01:41.247  36.154348 -86.784765  PoweredVL7YG  Powered   
7 2019-05-01 00:01:41.247  36.158930 -86.775987  Powered5LNUG  Powered   
8 2019-05-01 00:01:41.247  36.135993 -86.804226  Powered7SPQQ  Powered   
9 2019-05-01 00:01:41.247  36.148938 -86.811256  PoweredBV1DT  Powered   

   chargelevel sumdgroup  costpermin companyname  
0         93.0   scooter         0.0        Bird  
1         35.0   scooter         0.0        Bird  
2         90.0   scooter        

In [53]:

df.head(5)

Unnamed: 0,pubdatetime,latitude,longitude,sumdid,sumdtype,chargelevel,sumdgroup,costpermin,companyname
0,2019-05-01 00:01:41.247,36.136822,-86.799877,PoweredLIRL1,Powered,93.0,scooter,0.0,Bird
1,2019-05-01 00:01:41.247,36.191252,-86.772945,PoweredXWRWC,Powered,35.0,scooter,0.0,Bird
2,2019-05-01 00:01:41.247,36.144752,-86.806293,PoweredMEJEH,Powered,90.0,scooter,0.0,Bird
3,2019-05-01 00:01:41.247,36.162056,-86.774688,Powered1A7TC,Powered,88.0,scooter,0.0,Bird
4,2019-05-01 00:01:41.247,36.150973,-86.783109,Powered2TYEF,Powered,98.0,scooter,0.0,Bird


- Are there any null values in any columns in either table?
- What date range is represented in each of the date columns? Investigate any values that seem odd.
- Is time represented with am/pm or using 24 hour values in each of the columns that include time?
- What values are there in the sumdgroup column? Are there any that are not of interest for this project?
- What are the minimum and maximum values for all the latitude and longitude columns? Do these ranges make sense, or is there anything surprising?
- What is the range of values for trip duration and trip distance? Do these values make sense? Explore values that might seem questionable.
- Check out how the values for the company name column in the scooters table compare to those of the trips table. What do you notice?


- Are there any null values in any columns in either table?

In [54]:
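# Note: df holds only the 10 rows fetched with LIMIT 10,
# so this checks that sample, not the full table.
null_counts = df.isnull().sum()
null_counts[null_counts > 0].sort_values(ascending=False)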

Series([], dtype: int64)
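To answer the question for the whole table, the null check can be pushed into SQL so that only one row of counts comes back. The sketch below relies on `COUNT(*)` counting every row while `COUNT(column)` skips `NULL`s. The helper `null_count_query` is hypothetical, and SQLite with made-up rows stands in for the real database. The generated query string could equally be passed to `pd.read_sql`.

```python
import sqlite3


def null_count_query(table, columns):
    """Build a query that returns one null count per column.

    Identifiers are interpolated directly, so pass only trusted names.
    """
    # COUNT(*) counts all rows and COUNT(col) skips NULLs,
    # so the difference is the number of NULLs in col.
    parts = [f"COUNT(*) - COUNT({col}) AS {col}_nulls" for col in columns]
    return f"SELECT {', '.join(parts)} FROM {table}"


demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE scooters (chargelevel REAL, companyname TEXT)")
demo_conn.executemany(
    "INSERT INTO scooters VALUES (?, ?)",
    [(93.0, "Bird"), (None, "Bird"), (35.0, None), (None, "Bird")],
)

query_nulls = null_count_query("scooters", ["chargelevel", "companyname"])
null_counts_sql = demo_conn.execute(query_nulls).fetchone()

print(query_nulls)
print(null_counts_sql)
```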

- What date range is represented in each of the date columns? Investigate any values that seem odd.

In [56]:
print(df.pubdatetime)

0   2019-05-01 00:01:41.247
1   2019-05-01 00:01:41.247
2   2019-05-01 00:01:41.247
3   2019-05-01 00:01:41.247
4   2019-05-01 00:01:41.247
5   2019-05-01 00:01:41.247
6   2019-05-01 00:01:41.247
7   2019-05-01 00:01:41.247
8   2019-05-01 00:01:41.247
9   2019-05-01 00:01:41.247
Name: pubdatetime, dtype: datetime64[ns]


In [64]:
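# Note: single quotes make 'scooters.subdatetime' a string literal in SQL,
# so this returns that text once per row rather than the values of a column.
query = '''
SELECT 'scooters.subdatetime'
FROM scooters
'''

sub = pd.read_sql_query(query, engine)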

In [67]:
print(sub)

                      ?column?
0         scooters.subdatetime
1         scooters.subdatetime
2         scooters.subdatetime
3         scooters.subdatetime
4         scooters.subdatetime
...                        ...
73414038  scooters.subdatetime
73414039  scooters.subdatetime
73414040  scooters.subdatetime
73414041  scooters.subdatetime
73414042  scooters.subdatetime

[73414043 rows x 1 columns]
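The `?column?` header and the repeated text show that the query selected a quoted string, not a column. For the date-range question, a likely better fit is an unquoted column aggregated with `MIN` and `MAX`, which returns a single row. The sketch below assumes the intended column is `pubdatetime`, since that is the date column visible in the `df` output and no `subdatetime` column appears there. SQLite with made-up timestamps stands in for the real table.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE scooters (pubdatetime TEXT)")
# Hypothetical timestamps, for illustration only.
demo_conn.executemany(
    "INSERT INTO scooters VALUES (?)",
    [("2019-05-01 00:01:41",), ("2019-06-15 12:30:00",), ("2019-07-31 23:59:59",)],
)

# Single quotes: a string literal, returned once per row.
literal_rows = demo_conn.execute(
    "SELECT 'scooters.pubdatetime' FROM scooters"
).fetchall()

# No quotes: the column itself, reduced to its earliest and latest values.
date_range = demo_conn.execute(
    "SELECT MIN(pubdatetime) AS earliest, MAX(pubdatetime) AS latest FROM scooters"
).fetchone()

print(literal_rows)
print(date_range)
```

Against the Postgres table, the second query string can be passed to `pd.read_sql_query(query, engine)` unchanged, and it avoids pulling every row into pandas.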
