<h1>Exploratory Data Analysis</h1>

<h3>Initial Setup</h3>

In [30]:
import pandas as pd
from sqlalchemy import create_engine

database_name = 'scooters'    # Fill this in with your scooter database name
connection_string = f"postgresql://postgres:postgres@localhost:5432/{database_name}"

engine = create_engine(connection_string)

<b>Loading records into dataframes for exploration.</b>

In [31]:
# Read up to 1,000,000 rows from scooters and all rows from trips into dataframes
scooters_df = pd.read_sql_query('SELECT * FROM scooters LIMIT 1000000', con=engine)
trips_df = pd.read_sql_query('SELECT * FROM trips', con=engine)

<h3>Are there any null values in any columns in either table?</h3>

In [32]:
trips_null_query = '''
select count(*)
from trips
where not (trips is not null);
'''

trips_null_result = engine.execute(trips_null_query).fetchone()

scooters_null_query = '''
select count(*)
from scooters
where not (scooters is not null);
'''

scooters_null_result = engine.execute(scooters_null_query).fetchone()

print(f'Number of records in trips with at least one null value: {trips_null_result[0]}')
print(f'Number of records in scooters with at least one null value: {scooters_null_result[0]}')

Number of records in trips with at least one null value: 0
Number of records in scooters with at least one null value: 770
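
The SQL check above counts rows with at least one null but does not say which columns hold them. A minimal pandas sketch of a per-column breakdown follows; `sample_df` and its values are hypothetical stand-ins for `scooters_df`, used only so the example runs without a database.

```python
import pandas as pd

# Hypothetical sample standing in for scooters_df; the values are made up.
sample_df = pd.DataFrame({
    'sumdid': ['A1', 'B2', 'C3', 'D4'],
    'chargelevel': [94.0, None, 59.0, None],
    'companyname': ['Jump', 'Jump', None, 'Jump'],
})

# Nulls per column: shows where the missing values are concentrated
null_counts = sample_df.isna().sum()

# Rows with at least one null, mirroring the row-level SQL check
rows_with_null = int(sample_df.isna().any(axis=1).sum())

print(null_counts)
print('Rows with at least one null:', rows_with_null)
```

Run against `scooters_df`, the counts would cover only the rows loaded by the `LIMIT` query, so they may not match the full-table SQL result.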


<h3>What date range is represented in each of the date columns? Investigate any values that seem odd.</h3>

In [33]:
print(scooters_df.head())
print(scooters_df.info())
print(trips_df.head())
print(trips_df.info())

          pubdatetime  latitude  longitude   
0 2019-07-31 18:54:13   36.1202   -86.7534  \
1 2019-07-31 18:54:13   36.1376   -86.7998   
2 2019-07-31 18:54:13   36.1201   -86.7532   
3 2019-07-31 18:54:13   36.1199   -86.7534   
4 2019-07-31 18:54:13   36.1200   -86.7534   

                                        sumdid sumdtype  chargelevel   
0  Powered97f8a3a1-959b-5750-8cb6-67c9f7bcf261  Powered          0.0  \
1  Powered3d099838-659b-5660-9463-a029ba277d63  Powered         94.0   
2  Powered5b5bb847-647f-51af-abb6-7097b1264a64  Powered          1.0   
3  Powered93d2cde0-7b0d-5ffc-975a-434c23bec835  Powered         95.0   
4  Powered91917748-6725-57e8-8476-4495b5c5f0cc  Powered         59.0   

  sumdgroup  costpermin companyname  
0   scooter        0.06        Jump  
1   scooter        0.06        Jump  
2   scooter        0.06        Jump  
3   scooter        0.06        Jump  
4   scooter        0.06        Jump  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entr

In the `trips` table, `startdate`, `enddate`, `starttime`, and `endtime` all appear to have dtype `object` rather than a datetime type. It is not yet clear whether this will cause issues later in the analysis.

The times are in 24-hour format, which is easier to work with.
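
One way to answer the date range question for object-typed columns is to combine each date with its time and parse the result with `pd.to_datetime`. The sketch below assumes the values render as ISO-style strings like those in the `trips` output; `sample_trips`, `start_dt`, and `end_dt` are hypothetical names, and the rows are made up.

```python
import pandas as pd

# Hypothetical rows shaped like the trips date/time columns; values are made up.
sample_trips = pd.DataFrame({
    'startdate': ['2019-07-18', '2019-07-19'],
    'starttime': ['23:59:35.683333', '00:09:04.506666'],
    'enddate': ['2019-07-19', '2019-07-19'],
    'endtime': ['00:05:10.000000', '00:20:00.000000'],
})

# Join each date with its time as text, then parse into a datetime column
sample_trips['start_dt'] = pd.to_datetime(
    sample_trips['startdate'].astype(str) + ' ' + sample_trips['starttime'].astype(str)
)
sample_trips['end_dt'] = pd.to_datetime(
    sample_trips['enddate'].astype(str) + ' ' + sample_trips['endtime'].astype(str)
)

# The min and max of the parsed columns give the date range
print('Start range:', sample_trips['start_dt'].min(), 'to', sample_trips['start_dt'].max())
print('End range:', sample_trips['end_dt'].min(), 'to', sample_trips['end_dt'].max())
```

If the real columns mix formats (for example, some times with fractional seconds and some without), the parsing step may need extra handling.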

<h3>What values are there in the sumdgroup column? Are there any that are not of interest for this project?</h3>

In [34]:
print(scooters_df['sumdgroup'].unique())

['scooter' 'Scooter' 'bicycle']


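It appears we can ignore `bicycle`. Note that `scooter` and `Scooter` differ only in capitalization, so they likely refer to the same group.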
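
Because the output lists both `scooter` and `Scooter`, a filter on the raw values would miss one spelling. A small sketch of lowercasing the column before dropping `bicycle` rows follows; `sample_df` is a hypothetical stand-in for `scooters_df` with made-up rows.

```python
import pandas as pd

# Hypothetical sample standing in for scooters_df; the values are made up.
sample_df = pd.DataFrame({
    'sumdid': ['A1', 'B2', 'C3', 'D4'],
    'sumdgroup': ['scooter', 'Scooter', 'bicycle', 'scooter'],
})

# Normalize capitalization so 'Scooter' and 'scooter' are treated as one group
sample_df['sumdgroup'] = sample_df['sumdgroup'].str.lower()

# Keep only scooter records
scooters_only = sample_df[sample_df['sumdgroup'] == 'scooter']

print(scooters_only['sumdgroup'].unique())
```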

<h3>What are the minimum and maximum values for all the latitude and longitude columns? Do these ranges make sense, or is there anything surprising? What is the range of values for trip duration and trip distance? Do these values make sense? Explore values that might seem questionable.</h3>

In [37]:
# Getting the minimum and maximum values for the latitude and longitude columns
print('Latitude min:', scooters_df['latitude'].min())
print('Latitude max:', scooters_df['latitude'].max())
print('Longitude min:', scooters_df['longitude'].min())
print('Longitude max:', scooters_df['longitude'].max())

print('------------------')

# Getting the range of values for trip duration and trip distance
print('Trip Duration min:', trips_df['tripduration'].min())
print('Trip Duration max:', trips_df['tripduration'].max())
print('Trip Distance min:', trips_df['tripdistance'].min())
print('Trip Distance max:', trips_df['tripdistance'].max())

Latitude min: 0.0
Latitude max: 36.345742
Longitude min: -97.443879
Longitude max: 0.0
------------------
Trip Duration min: -19.3582666667
Trip Duration max: 512619.0
Trip Distance min: -20324803.8
Trip Distance max: 31884482.6476
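
The coordinate extremes above include 0.0 for both latitude and longitude, which sits far from the values near (36.1, -86.8) seen in `head()`. A rough sketch of flagging such points with a bounding box follows. The box limits are an assumption chosen loosely around the `head()` coordinates, not values from the data source, and `sample_df` is a hypothetical stand-in with made-up rows.

```python
import pandas as pd

# Hypothetical sample standing in for scooters_df; the rows are made up.
sample_df = pd.DataFrame({
    'latitude': [36.1202, 0.0, 36.1376, 36.345742],
    'longitude': [-86.7534, 0.0, -97.443879, -86.7998],
})

# Assumed bounding box; adjust to the actual service area
LAT_MIN, LAT_MAX = 35.0, 37.0
LON_MIN, LON_MAX = -88.0, -86.0

# True where both coordinates fall inside the assumed box
in_box = (
    sample_df['latitude'].between(LAT_MIN, LAT_MAX)
    & sample_df['longitude'].between(LON_MIN, LON_MAX)
)

# Rows outside the box are candidates for closer inspection
suspicious = sample_df[~in_box]
print(suspicious)
```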


The negative trip duration is confusing. How can someone take a trip that lasts a negative number of minutes?

In [39]:
# Getting the negative values for trip duration
negative_trip_duration_df = trips_df[trips_df['tripduration'] < 0]

# Dropping unnecessary columns
negative_trip_duration_df = negative_trip_duration_df.drop(['startlatitude', 'endlatitude', 'startlongitude', 'endlongitude', 'triproute'], axis=1)

print(negative_trip_duration_df)

                  pubtimestamp companyname triprecordnum         sumdid   
499616 2019-07-19 00:14:02.297        Lyft         LFT21  Powered853770  \
499610 2019-07-19 00:12:05.363        Lyft         LFT18  Powered863342   
499562 2019-07-19 00:01:24.063        Lyft          LFT2  Powered859498   
368012 2019-06-21 21:44:53.863        Lyft       LFT1318  Powered220544   
499586 2019-07-19 00:07:18.803        Lyft         LFT10  Powered767853   
368490 2019-06-21 22:35:30.390        Lyft       LFT1435  Powered041891   
499418 2019-07-18 23:56:13.233        Lyft        LFT864  Powered863342   
499583 2019-07-19 00:06:02.050        Lyft          LFT7  Powered895717   

        tripduration  tripdistance   startdate        starttime     enddate   
499616    -19.358267    4540.68256  2019-07-19  00:09:04.506666  2019-07-18  \
499610    -10.975100    3641.73240  2019-07-19  00:00:24.016666  2019-07-18   
499562    -10.242417      52.49344  2019-07-18  23:59:35.683333  2019-07-18   
368012  

It seems that we have a few rows of obviously bad data: every row shown is from Lyft, and in the first rows the `enddate` is earlier than the `startdate`. We can reasonably ignore these records.
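
Ignoring the negative-duration records can be sketched as a boolean filter. `sample_trips` and `clean_trips` are hypothetical names and the rows are made up; the same mask could be applied to `trips_df`.

```python
import pandas as pd

# Hypothetical sample standing in for trips_df; the values are made up.
sample_trips = pd.DataFrame({
    'triprecordnum': ['T1', 'T2', 'T3'],
    'tripduration': [12.5, -19.358267, 3.0],
    'tripdistance': [1500.0, 4540.68256, 820.0],
})

# Keep trips with a non-negative duration (NaN durations would also be dropped)
valid = sample_trips['tripduration'] >= 0
clean_trips = sample_trips[valid].copy()

# Report how many records were excluded
dropped = len(sample_trips) - len(clean_trips)
print(f'Dropped {dropped} of {len(sample_trips)} trips with negative duration')
```

Taking a `.copy()` of the filtered frame avoids pandas chained-assignment warnings if columns are added to it later.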