In [22]:
from shapely.geometry import Point
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import folium
from folium.plugins import MarkerCluster
from folium.plugins import FastMarkerCluster

In [23]:
from sqlalchemy import create_engine, text

In [32]:
database_name = 'scooters'

connection_string = f"postgresql://postgres:postgres@localhost:5432/{database_name}"

In [33]:
engine = create_engine(connection_string)

# trips table
### EDA 
#### row count

In [34]:
query = '''
SELECT COUNT(*)
FROM trips;
'''


In [35]:
with engine.connect() as connection:    
    trips_count = pd.read_sql(text(query), con = connection)
trips_count

Unnamed: 0,count
0,565522


# create_dt
### EDA 
Create_dt ranges from 2019-05-02 to 2019-08-02.
Timestamps start two days after the end of a month period.

#### Q. Is create_dt needed for analysis of neg/pos/0 value duration and distance patterns?
Should it be dropped from cleaned table?

In [None]:
query = '''
SELECT create_dt
FROM trips;
'''

In [None]:
#for EDA on create_dt - what does this column show related to other timestamps?
with engine.connect() as connection:    
    create_dt = pd.read_sql(text(query), con = connection)
create_dt

# CU3 - EDA
### tripduration
NOTE: Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. https://simple.wikipedia.org/wiki/24-hour_clock

#### tripduration - MAX: 512619.0, MIN: -19.358267

#### Review neg values, 0 values, nulls, relative to other time-related columns to see if there is a pattern
Relates to: Q2, Are scooter companies in compliance? 
Are we able to determine what might be staff servicing and test trips?

#### Summary of negative value EDA on tripduration
There are 8 negative tripdurations in total, all with positive trip distance. They occurred on just two dates: 2019-06-21 and 2019-07-18. We did not find any meaningful events happening in Nashville on those dates. All trips were less than a mile (4540 ft = 0.86 mile), and the majority of them were initiated late at night or just after midnight. For some of them, the start date/times and end times do not agree with the recorded trip duration, and some run backwards.

Q. Are these 8 negative values outliers due to some kind of system error, or refunds, as Dani suggested?

Q. Should our compiled table have negative values removed on trip distance?
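
One way to check whether the recorded tripduration agrees with the timestamps is to recompute the duration from the start and end columns. The sketch below uses made-up rows rather than the real query result, and it assumes the date and time columns arrive as strings and that tripduration is in minutes (as the 1440-minute cutoff used elsewhere in this notebook suggests). The column names `computed_minutes` and `ends_before_start` are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the neg_duration query.
sample = pd.DataFrame({
    "tripduration": [-2.5, 12.0],
    "startdate": ["2019-06-21", "2019-06-21"],
    "starttime": ["23:58:00", "10:00:00"],
    "enddate": ["2019-06-21", "2019-06-21"],
    "endtime": ["23:55:30", "10:12:00"],
})

# Combine the date and time columns into full timestamps.
start = pd.to_datetime(sample["startdate"] + " " + sample["starttime"])
end = pd.to_datetime(sample["enddate"] + " " + sample["endtime"])

# Duration in minutes implied by the timestamps.
sample["computed_minutes"] = (end - start).dt.total_seconds() / 60

# Flag rows where the end timestamp precedes the start timestamp.
sample["ends_before_start"] = end < start

print(sample[["tripduration", "computed_minutes", "ends_before_start"]])
```

If the database returns date and time objects instead of strings, the concatenation step would need a conversion with `astype(str)` first.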

In [None]:
#pull date/time columns for analysis with a negative tripduration
query = '''
SELECT tripduration, tripdistance, startdate, starttime, enddate, endtime, create_dt 
FROM trips
WHERE tripduration < 0;
'''

In [None]:
with engine.connect() as connection:    
    neg_duration = pd.read_sql(text(query), con = connection)

In [None]:
neg_duration

In [None]:
#count negative tripduration entries
query = '''
SELECT COUNT(tripduration)
FROM trips
WHERE tripduration < 0;
'''

In [None]:
with engine.connect() as connection:    
    count_negative_td = pd.read_sql(text(query), con = connection)
count_negative_td

#### Summary of nulls EDA on tripduration
There are no null values in tripduration.

In [None]:
#count null tripdurations
#COUNT(*) is needed here: COUNT(tripduration) skips NULLs and would always return 0
query = '''
SELECT COUNT(*)
FROM trips
WHERE tripduration IS NULL;
'''

In [None]:
with engine.connect() as connection:    
    count_nulls = pd.read_sql(text(query), con = connection)
count_nulls
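
A note on counting nulls in SQL: `COUNT(column)` ignores NULL values, so combining it with `WHERE column IS NULL` always yields 0, while `COUNT(*)` counts the matching rows. The sketch below demonstrates the difference on a throwaway in-memory SQLite table (not the project's PostgreSQL database); the behavior shown is standard SQL and applies to PostgreSQL as well.

```python
import sqlite3

# Throwaway in-memory table with one real value and two NULLs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (tripduration REAL)")
conn.executemany("INSERT INTO trips VALUES (?)", [(5.0,), (None,), (None,)])

# COUNT(column) skips NULLs, so this is always 0.
count_col = conn.execute(
    "SELECT COUNT(tripduration) FROM trips WHERE tripduration IS NULL"
).fetchone()[0]

# COUNT(*) counts the rows themselves.
count_all = conn.execute(
    "SELECT COUNT(*) FROM trips WHERE tripduration IS NULL"
).fetchone()[0]

print(count_col, count_all)  # prints: 0 2
conn.close()
```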

#### Summary of 0.00 values in tripduration
There are 4624 entries with 0.00 in tripduration.

Q. Why would there be zero values in trip duration?  Could these be due to servicing of the scooters? 

In [None]:
#count zero tripdurations
query = '''
SELECT COUNT(tripduration)
FROM trips
WHERE tripduration = 0;
'''

In [None]:
with engine.connect() as connection:    
    count_zero = pd.read_sql(text(query), con = connection)
count_zero

In [None]:
#pull zero tripdurations
query = '''
SELECT tripduration
FROM trips
WHERE tripduration = 0;
'''

In [None]:
with engine.connect() as connection:    
    pull_zero = pd.read_sql(text(query), con = connection)
pull_zero

#### Summary of positive values in tripduration
There are 560890 entries with positive values in tripduration.

In [None]:
#count tripdurations with positive numbers
query = '''
SELECT COUNT(tripduration)
FROM trips
WHERE tripduration > 0;
'''

In [None]:
with engine.connect() as connection:    
    count_positive = pd.read_sql(text(query), con = connection)
count_positive

# CU3 - EDA
### tripdistance

#### tripdistance - MAX: 3.188448e+07, MIN: -20324803.8 
Q. What does the "e+07" etc. indicate in tripdistance?   
A. tripdistance is displayed in scientific notation, so 3.188448e+07 means 3.188448 × 10^7 = 31,884,480.

RE: Scientific notation: a number is written as a coefficient times a power of 10. For example, 340 can be written as 3.4 × 10^2. In Python, calling str.format() on a number with "{:e}" formats it in scientific notation: the coefficient, followed by "e+" (or "e-") and the power of 10. For example, "{:e}".format(340) gives 3.400000e+02. https://www.geeksforgeeks.org/display-scientific-notation-as-float-in-python/

#### Summary of negative value EDA on tripdistance
There are 32 negative values in tripdistance.  These indicate failure to comply with regulations to only include trips greater than one minute.

They consistently fall within the first 5, middle 3, or last 5 days of the month, which could be related to regular staff servicing or testing.  The starttimes mostly fall early in the morning before the morning rush hour, or after the evening rush, which again may indicate staff servicing or testing times. The trip durations are all evenly rounded, which may indicate a tag for tracking staff service or testing time.

There are 4305 entries with 0.0 tripdistance and 0.0 tripduration.  We could try grouping these by startdate and starttime to look for patterns and see whether these may also be related to staff service, testing times, or system errors.

There seem to be issues sorting on scientific notation both through SQL code and with Python.
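
The cause of the sorting issue is not confirmed in these notes. One possibility worth ruling out is that tripdistance is being handled as text rather than as a number, in which case values are ordered character by character. The sketch below uses made-up values (plus the MAX and MIN noted above) to show the difference; the column name `tripdistance_num` is hypothetical.

```python
import pandas as pd

# Hypothetical values stored as text, including the MAX and MIN noted above.
df = pd.DataFrame(
    {"tripdistance": ["-5.0", "-20324803.8", "9.0", "120.5", "3.188448e+07"]}
)

# Sorting strings compares characters, not magnitudes, so "9.0" lands last.
text_sorted = df["tripdistance"].sort_values().tolist()

# Convert to float first, then sort numerically.
df["tripdistance_num"] = pd.to_numeric(df["tripdistance"], errors="coerce")
num_sorted = df["tripdistance_num"].sort_values().tolist()

print(text_sorted)
print(num_sorted)
```

Checking `pull_neg_distance.dtypes` would show whether the column came back as `float64` or `object`. If it is already `float64`, the scientific notation is only a display format and the sort order should be numeric.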

In [None]:
#count tripdistance negative values
query = '''
SELECT COUNT(tripdistance)
FROM trips
WHERE tripdistance < 0;
'''

In [None]:
with engine.connect() as connection:    
    cnt_neg_distance = pd.read_sql(text(query), con = connection)
cnt_neg_distance

In [None]:
#pull date/time columns for analysis with a negative tripdistance, sorted by startdate
query = '''
SELECT tripduration, tripdistance, startdate, starttime, enddate, endtime, create_dt 
FROM trips
WHERE tripdistance < 0
ORDER BY startdate;
'''

In [None]:
with engine.connect() as connection:    
    pull_neg_distance = pd.read_sql(text(query), con = connection)
pull_neg_distance

In [None]:
#pull date/time columns for analysis with a negative tripdistance, sorted by starttime
query = '''
SELECT tripduration, tripdistance, startdate, starttime, enddate, endtime, create_dt 
FROM trips
WHERE tripdistance < 0
ORDER BY starttime;
'''

In [None]:
with engine.connect() as connection:    
    pull_neg_distance = pd.read_sql(text(query), con = connection)
pull_neg_distance

In [None]:
#pull date/time columns for analysis with a negative tripdistance, sorted by tripduration
query = '''
SELECT tripduration, tripdistance, startdate, starttime, enddate, endtime, create_dt 
FROM trips
WHERE tripdistance < 0
ORDER BY tripduration;
'''

In [None]:
with engine.connect() as connection:    
    pull_neg_distance = pd.read_sql(text(query), con = connection)
pull_neg_distance

In [None]:
#pull date/time columns for analysis with a negative tripdistance, sorted by tripdistance in SQL
query = '''
SELECT tripduration, tripdistance, startdate, starttime, enddate, endtime, create_dt 
FROM trips
WHERE tripdistance < 0
ORDER BY tripdistance;
'''

In [None]:
with engine.connect() as connection:    
    pull_neg_distance = pd.read_sql(text(query), con = connection)
pull_neg_distance.head(10)

In [None]:
#pull date/time columns for analysis with a negative tripdistance, sorted by tripdistance in Python
query = '''
SELECT tripduration, tripdistance, startdate, starttime, enddate, endtime, create_dt 
FROM trips
WHERE tripdistance < 0;
'''

In [None]:
with engine.connect() as connection:    
    pull_neg_distance = pd.read_sql(text(query), con = connection)
pull_neg_distance.sort_values(by = 'tripdistance').head(10)

In [None]:
#pull zero tripdurations relative to zero tripdistance
query = '''
SELECT tripduration, tripdistance, startdate, starttime, enddate, endtime, create_dt 
FROM trips
WHERE tripduration = 0.0 AND tripdistance = 0;
'''

In [None]:
with engine.connect() as connection:    
    zero_td_df = pd.read_sql(text(query), con = connection)
zero_td_df.sort_values(by = ['startdate' , 'starttime'])
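
To look for patterns in the 0/0 entries, the rows can be grouped by start date and hour of day. This is a minimal sketch on made-up rows standing in for `zero_td_df`; it assumes starttime is a plain `HH:MM:SS` string, and the names `start_hour` and `n_trips` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for zero_td_df.
zero_sample = pd.DataFrame({
    "startdate": ["2019-05-01", "2019-05-01", "2019-05-01", "2019-05-02"],
    "starttime": ["05:10:00", "05:45:00", "22:30:00", "05:05:00"],
})

# Extract the hour of day from the time string.
zero_sample["start_hour"] = pd.to_datetime(
    zero_sample["starttime"], format="%H:%M:%S"
).dt.hour

# Count zero/zero entries per date and hour, largest groups first.
counts = (
    zero_sample.groupby(["startdate", "start_hour"])
    .size()
    .reset_index(name="n_trips")
    .sort_values("n_trips", ascending=False)
)

print(counts)
```

Clusters at the same hour across many dates would be consistent with scheduled servicing, although that alone would not prove it.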

# EDA

### tripduration

#### FOR Q2: remove all trips less than one minute and greater than 24 hours to create a compliant trips table
A total of 9154 entries are out of compliance, with a tripduration of less than 1 minute.

#### There are 6938 rows with trip durations longer than 24 hours  (60*24 = 1440 minutes)
Q. Do we remove all rows that go beyond 24 hours, or do we need to create a recalculated trip distance column with tripduration capped at 24 hrs?
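
The two options in the question above can be compared side by side in pandas. This sketch uses made-up durations in minutes; which option is correct depends on how the regulation is read, which these notes leave open. The column name `tripduration_capped` is hypothetical.

```python
import pandas as pd

MAX_MINUTES = 60 * 24  # 24 hours expressed in minutes

# Hypothetical durations in minutes.
trips_sample = pd.DataFrame(
    {"tripduration": [0.5, 12.0, 1440.0, 2000.0, 512619.0]}
)

# Option 1: drop rows under one minute or over 24 hours (bounds inclusive).
removed = trips_sample[trips_sample["tripduration"].between(1, MAX_MINUTES)]

# Option 2: drop rows under one minute, then cap what remains at 24 hours.
capped = trips_sample[trips_sample["tripduration"] >= 1].copy()
capped["tripduration_capped"] = capped["tripduration"].clip(upper=MAX_MINUTES)

print(len(removed), len(capped))
```

Option 2 keeps the long trips in the table but leaves their tripdistance untouched, so any per-minute metric computed afterwards would mix a capped duration with an uncapped distance.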

In [None]:
#for Q2 non-compliance tripduration less than 1 min
query = '''
SELECT tripduration, tripdistance, startdate, starttime, enddate, endtime, create_dt 
FROM trips
WHERE tripduration < 1;
'''

In [None]:
with engine.connect() as connection:    
    non_compliant_td_under = pd.read_sql(text(query), con = connection)
non_compliant_td_under

In [None]:
#for Q2 non-compliance tripduration greater than 24 hrs
query = '''
SELECT tripduration, tripdistance, startdate, starttime, enddate, endtime, create_dt 
FROM trips
WHERE tripduration > 1440.00;
'''

In [None]:
with engine.connect() as connection:    
    non_compliant_td_over = pd.read_sql(text(query), con = connection)
non_compliant_td_over

## For Q2 - Compliant Trips

### --removed 2b. trips below one minute
### --removed 2c. lengths capped at 24 hrs (if we go with removing all)

NOTE: Need to add lat, lon, geometry column from Dani's work
NOTE: Should we drop any columns from this table, like create_dt or pubtimestamp?

In [38]:
#for trips table compliance, keep trips of at least 1 minute and at most 24 hours (1440 minutes);
#this removes negative, 0 value, and less than 1 min durations
query = '''
SELECT *
FROM trips
WHERE tripduration >= 1.00
  AND tripduration <= 1440.00;
'''

In [39]:
with engine.connect() as connection:    
    trips_compliant = pd.read_sql(text(query), con = connection)
trips_compliant

Should we add triproute to these tables? It holds the GPS coordinates for the entire trip duration, collected at a minimum frequency of one per 30 sec.

A scatterplot might be helpful to see any trends in starttime and endtime for this table, but seaborn is not installed in this environment.
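
Since matplotlib is already imported in this notebook, a scatterplot does not need seaborn. The sketch below plots start time against end time for made-up rows; it assumes both columns are `HH:MM:SS` strings, and the helper `to_hours` is hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the zero/zero rows.
times = pd.DataFrame({
    "starttime": ["05:10:00", "05:45:00", "22:30:00"],
    "endtime": ["05:10:00", "05:47:00", "22:31:00"],
})

def to_hours(col):
    """Convert HH:MM:SS strings to fractional hours since midnight."""
    return pd.to_timedelta(col).dt.total_seconds() / 3600

# One point per trip: x is the start hour, y is the end hour.
fig, ax = plt.subplots()
ax.scatter(to_hours(times["starttime"]), to_hours(times["endtime"]), alpha=0.5)
ax.set_xlabel("start time (hour of day)")
ax.set_ylabel("end time (hour of day)")
ax.set_title("Start vs. end time")
```

Points far from the diagonal would mark trips whose end time differs sharply from the start time, including trips that cross midnight.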

Scooters report their locations every 5 minutes, but our start times do not indicate these 0/0 values are due to location reporting. 

The trip distance column appears to be in scientific notation.
Scientific notation writes a number as a coefficient times a power of 10. For example, 340 can be written as 3.4 × 10^2. In Python, calling str.format() on a number with "{:e}" formats it in scientific notation: the coefficient, followed by "e+" (or "e-") and the power of 10. For example, "{:e}".format(340) gives 3.400000e+02.
https://www.geeksforgeeks.org/display-scientific-notation-as-float-in-python/

To display a number written in scientific notation as a plain float:

Format the variable holding the number with "{:f}", as follows:

x = 3.234e+4
 
print("{:f}".format(x))  # f represents float
Output:

32340.000000

https://stackoverflow.com/questions/658763/how-to-suppress-scientific-notation-when-printing-float-values
https://stackoverflow.com/questions/67879685/python-decimal-decimal-producing-result-in-scientific-notation
The numpy module offers np.format_float_positional().
Chris mentioned being cautious when dealing with floats, so we might want to ask which methods call for caution when handling scientific notation, or bring it up in the meeting tomorrow.
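
A short sketch of the display options mentioned above, applied to the tripdistance maximum from this notebook. All three approaches change only how the number is shown; the stored float is unchanged. The two-decimal precision is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

x = 3.188448e+07  # the tripdistance maximum noted in this notebook

# Fixed-point formatting with two decimals.
fixed = "{:.2f}".format(x)

# NumPy's positional formatter, trimming trailing zeros and the decimal point.
positional = np.format_float_positional(x, trim="-")

# Change only how pandas displays floats in DataFrame output.
pd.set_option("display.float_format", "{:.2f}".format)

print(fixed, positional)
```

`pd.reset_option("display.float_format")` restores the default pandas display.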

In Q1, what does "scooter usage" mean?  Is it the type of scooter (i.e. powered or standard), or is it usage time or usage distance?

df1[df1<0].count()

pubdatetime, latitude, longitude - format; create point column
sumdtype - powered; standard
chargelevel - there's a 0 and NaN (same or separate?)
sumdgroup - bicycle, scooter, Scooter
costpermin - 0, 5, 6, 10, 15, 23, 30 cents
scooters.companyname - Bird, Bolt, Gotcha, Jump, Lime, Lyft, Spin
trips.companyname - Bird, Bolt Mobility, Gotcha, JUMP, Lime, Lyft, SPIN (match both name lists)
tripduration - MAX: 512619.0, MIN: -19.358267 (pull all neg numbers in relation to something else)
tripdistance - MAX: 3.188448e+07, MIN: -20324803.8 (what does neg mean?)
startdate - format
starttime - meaning of zero
enddate - format
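
The company name lists above differ between the two tables only in spelling and capitalization, and sumdgroup mixes "scooter" with "Scooter". A minimal sketch of how these could be reconciled follows; the mapping `name_map` is an assumption about which spellings should win, not a decision recorded in these notes.

```python
import pandas as pd

# Company names as they appear in the trips table (from the list above).
trip_names = pd.Series(
    ["Bird", "Bolt Mobility", "Gotcha", "JUMP", "Lime", "Lyft", "SPIN"]
)

# Hypothetical mapping onto the spellings used in the scooters table.
name_map = {"Bolt Mobility": "Bolt", "JUMP": "Jump", "SPIN": "Spin"}
normalized = trip_names.replace(name_map)

# Lowercasing collapses "scooter" and "Scooter" into one group.
groups = pd.Series(["bicycle", "scooter", "Scooter"]).str.lower()

print(normalized.tolist())
print(groups.unique().tolist())
```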