<h1>Exploratory Data Analysis</h1>

<h3>Initial Setup</h3>

In [1]:
import pandas as pd
from sqlalchemy import create_engine

database_name = 'scooters'    # Fill this in with your scooter database name
connection_string = f"postgresql://postgres:postgres@localhost:5432/{database_name}"

engine = create_engine(connection_string)

<b>Loading records into DataFrames for exploration purposes.</b>

In [2]:
# Limiting Scooters to 1 million rows for now
scooters_df = pd.read_sql_query('SELECT * FROM scooters limit 1000000', con=engine)
trips_df = pd.read_sql_query('SELECT * FROM trips', con=engine)
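
If the full `scooters` table is too large to load in one call, one option is to stream it in chunks rather than capping it with `limit`. The sketch below assumes nothing about the real database: it uses an in-memory SQLite table with made-up rows as a stand-in, and only illustrates the `chunksize` parameter of `pd.read_sql_query`.

```python
import sqlite3

import pandas as pd

# Stand-in for the real database: a tiny in-memory SQLite table (hypothetical data)
conn = sqlite3.connect(':memory:')
pd.DataFrame({
    'sumdid': ['a', 'b', 'c', 'd', 'e'],
    'chargelevel': [2.0, 0.0, 54.0, 91.0, 84.0],
}).to_sql('scooters', conn, index=False)

# With chunksize set, read_sql_query returns an iterator of DataFrames
# instead of a single DataFrame, so only one chunk is in memory at a time
total_rows = 0
for chunk in pd.read_sql_query('SELECT * FROM scooters', con=conn, chunksize=2):
    total_rows += len(chunk)

print(total_rows)
conn.close()
```

Against the PostgreSQL engine above, the same pattern would pass `con=engine` and a much larger `chunksize`.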

<h3>Are there any null values in any columns in either table?</h3>

In [3]:
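from sqlalchemy import text

# In PostgreSQL, `row IS NOT NULL` is true only when every column in the row is non-null,
# so negating it selects the rows that contain at least one null value.
trips_null_query = '''
select count(*)
from trips
where not (trips is not null);
'''

scooters_null_query = '''
select count(*)
from scooters
where not (scooters is not null);
'''

# engine.execute() was removed in SQLAlchemy 2.0; an explicit connection works on 1.4 and 2.x
with engine.connect() as conn:
    trips_null_result = conn.execute(text(trips_null_query)).fetchone()
    scooters_null_result = conn.execute(text(scooters_null_query)).fetchone()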

print(f'Number of records in trips with at least one null value: {trips_null_result[0]}')
print(f'Number of records in scooters with at least one null value: {scooters_null_result[0]}')

Number of records in trips with at least one null value: 0
Number of records in scooters with at least one null value: 770
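
The SQL check reports how many rows contain a null but not which columns are affected. A rough pandas equivalent is sketched below on a small made-up frame (`sample_df` and its values are hypothetical). Run against `scooters_df`, the counts may differ from the table-wide total because that DataFrame holds at most 1 million rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for scooters_df
sample_df = pd.DataFrame({
    'sumdid': ['Powered613', 'Powered654', 'Powered330'],
    'chargelevel': [2.0, np.nan, 54.0],
    'companyname': ['Gotcha', 'Gotcha', None],
})

# Number of rows that contain at least one null value
rows_with_null = int(sample_df.isna().any(axis=1).sum())

# Number of null values in each column
nulls_per_column = sample_df.isna().sum()

print(rows_with_null)
print(nulls_per_column)
```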


<h3>What date range is represented in each of the date columns? Investigate any values that seem odd.</h3>

In [4]:
print(scooters_df.head())
print(scooters_df.info())
print(trips_df.head())
print(trips_df.info())

              pubdatetime   latitude  longitude      sumdid sumdtype   
0 2019-05-02 18:16:23.720  36.121455 -86.770238  Powered613  Powered  \
1 2019-05-02 18:16:23.720  36.121291 -86.770135  Powered654  Powered   
2 2019-05-02 18:16:23.720  36.121332 -86.770235  Powered330  Powered   
3 2019-05-02 18:16:23.720  36.146667 -86.792587  Powered828  Powered   
4 2019-05-02 18:16:23.720  36.150358 -86.815588  Powered817  Powered   

   chargelevel sumdgroup  costpermin companyname  
0          2.0   Scooter         0.0      Gotcha  
1          0.0   Scooter         0.0      Gotcha  
2         54.0   Scooter         0.0      Gotcha  
3         91.0   Scooter         0.0      Gotcha  
4         84.0   Scooter         0.0      Gotcha  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 9 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   pubdatetime  1000000 non-null  datetime64[n

* In the `trips` table, `startdate`, `enddate`, `starttime`, and `endtime` all appear to have dtype `object` rather than a datetime type. I am not sure yet whether this will cause issues down the line.

* The times are in 24-hour format, which is much easier to work with.
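
To get an actual date range out of the object-typed columns, one approach is to combine each date/time pair into a single timestamp and take the minimum and maximum. This is a sketch on a made-up sample (`sample_trips` is hypothetical). It casts to `str` first, on the assumption that the database driver may return either strings or Python `date`/`time` objects.

```python
import pandas as pd

# Hypothetical sample mimicking the object-typed date and time columns in trips
sample_trips = pd.DataFrame({
    'startdate': ['2019-06-21', '2019-07-18', '2019-07-19'],
    'starttime': ['21:32:09.170000', '23:50:34.650000', '00:01:24'],
})

# Parse the date as a timestamp and the time of day as an offset, then add them.
# to_timedelta accepts times with or without fractional seconds.
start_ts = (
    pd.to_datetime(sample_trips['startdate'].astype(str))
    + pd.to_timedelta(sample_trips['starttime'].astype(str))
)

print('Start min:', start_ts.min())
print('Start max:', start_ts.max())
```

The same steps would apply to `enddate` and `endtime`.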

<h3>What values are there in the sumdgroup column? Are there any that are not of interest for this project?</h3>

In [5]:
print(scooters_df['sumdgroup'].unique())

['Scooter' 'scooter' 'bicycle']


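* It appears we can ignore `bicycle`. Note that `Scooter` and `scooter` differ only in capitalization and should be treated as the same group.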
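
A minimal sketch of that cleanup, assuming lower-casing is an acceptable way to merge the two spellings. The frame and the `Standard001` identifier are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with the three sumdgroup values seen above
sample_df = pd.DataFrame({
    'sumdid': ['Powered613', 'Powered654', 'Standard001', 'Powered330'],
    'sumdgroup': ['Scooter', 'scooter', 'bicycle', 'Scooter'],
})

# Lower-case the group so 'Scooter' and 'scooter' collapse into one value
sample_df['sumdgroup'] = sample_df['sumdgroup'].str.lower()

# Keep only scooters
scooters_only = sample_df[sample_df['sumdgroup'] == 'scooter']

print(scooters_only)
```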

<h3>What are the minimum and maximum values for all the latitude and longitude columns? Do these ranges make sense, or is there anything surprising? What is the range of values for trip duration and trip distance? Do these values make sense? Explore values that might seem questionable.</h3>

In [6]:
# Getting the minimum and maximum values for the latitude and longitude columns
print('Latitude min:', scooters_df['latitude'].min())
print('Latitude max:', scooters_df['latitude'].max())
print('Longitude min:', scooters_df['longitude'].min())
print('Longitude max:', scooters_df['longitude'].max())

print('------------------')

# Getting the range of values for trip duration and trip distance
print('Trip Duration min:', trips_df['tripduration'].min())
print('Trip Duration max:', trips_df['tripduration'].max())
print('Trip Distance min:', trips_df['tripdistance'].min())
print('Trip Distance max:', trips_df['tripdistance'].max())

Latitude min: 0.0
Latitude max: 36.278748
Longitude min: -86.923469
Longitude max: 0.0
------------------
Trip Duration min: -19.3582666667
Trip Duration max: 512619.0
Trip Distance min: -20324803.8
Trip Distance max: 31884482.6476
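
The 0.0 extremes for latitude and longitude sit far from the other values (around 36 and -86), which suggests that some positions were recorded as zero rather than as real coordinates. One way to check is to flag those rows and recompute the ranges without them. The sample below is hypothetical.

```python
import pandas as pd

# Hypothetical sample: two plausible positions and one zeroed-out position
sample_df = pd.DataFrame({
    'latitude': [36.121455, 0.0, 36.146667],
    'longitude': [-86.770238, 0.0, -86.792587],
})

# Flag rows where either coordinate is exactly zero
zero_coord = (sample_df['latitude'] == 0) | (sample_df['longitude'] == 0)
valid = sample_df[~zero_coord]

print('Rows with a zero coordinate:', int(zero_coord.sum()))
print('Latitude min (zeros excluded):', valid['latitude'].min())
print('Longitude max (zeros excluded):', valid['longitude'].max())
```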


The negative trip duration is confusing. How can someone take a trip that lasted a negative number of minutes?

In [7]:
# Getting the negative values for trip duration
negative_trip_duration_df = trips_df[trips_df['tripduration'] < 0]

# Dropping unnecessary columns for readability
negative_trip_duration_df = negative_trip_duration_df.drop(['startlatitude', 'endlatitude', 'startlongitude', 'endlongitude', 'triproute'], axis=1)

print(negative_trip_duration_df)

                  pubtimestamp companyname triprecordnum         sumdid   
377940 2019-06-21 21:44:53.863        Lyft       LFT1318  Powered220544  \
378418 2019-06-21 22:35:30.390        Lyft       LFT1435  Powered041891   
509346 2019-07-18 23:56:13.233        Lyft        LFT864  Powered863342   
509490 2019-07-19 00:01:24.063        Lyft          LFT2  Powered859498   
509511 2019-07-19 00:06:02.050        Lyft          LFT7  Powered895717   
509514 2019-07-19 00:07:18.803        Lyft         LFT10  Powered767853   
509538 2019-07-19 00:12:05.363        Lyft         LFT18  Powered863342   
509544 2019-07-19 00:14:02.297        Lyft         LFT21  Powered853770   

        tripduration  tripdistance   startdate        starttime     enddate   
377940     -8.003717    3484.25208  2019-06-21  21:32:09.170000  2019-06-21  \
378418     -1.359867    3166.01060  2019-06-21  22:23:01.316666  2019-06-21   
509346     -0.715917    2214.56700  2019-07-18  23:50:34.650000  2019-07-18   
509490  

* It seems that we have a few rows of obviously bad data, so we can reasonably ignore these records.
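
Ignoring those records can be sketched as a boolean mask. The rule used here (positive duration, non-negative distance) is an assumption rather than a settled definition of a valid trip. The sample rows are made up, with `LFT0001` and `LFT0002` as hypothetical record numbers.

```python
import pandas as pd

# Hypothetical sample: two negative durations, one negative distance, one clean trip
sample_trips = pd.DataFrame({
    'triprecordnum': ['LFT1318', 'LFT1435', 'LFT0001', 'LFT0002'],
    'tripduration': [-8.003717, -1.359867, 12.5, 30.0],
    'tripdistance': [3484.25208, 3166.01060, -20324803.8, 5280.0],
})

# Keep trips with a positive duration and a non-negative distance
is_plausible = (sample_trips['tripduration'] > 0) & (sample_trips['tripdistance'] >= 0)
clean_trips = sample_trips[is_plausible]

print(f'Dropped {int((~is_plausible).sum())} of {len(sample_trips)} trips')
```

Upper bounds for the very large maximum duration and distance could be added to the same mask once a sensible cutoff is chosen.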

<h3>Check out how the values for the company name column in the scooters table compare to those of the trips table. What do you notice?</h3>

In [11]:
from sqlalchemy import text

scooters_company_names_query = '''
select distinct companyname
from scooters;
'''

trips_company_names_query = '''
select distinct companyname
from trips;
'''

# engine.execute() was removed in SQLAlchemy 2.0; an explicit connection works on 1.4 and 2.x
with engine.connect() as conn:
    scooters_company_names_result = conn.execute(text(scooters_company_names_query)).fetchall()
    trips_company_names_result = conn.execute(text(trips_company_names_query)).fetchall()

print(scooters_company_names_result)
print(trips_company_names_result)

[('Bird',), ('Bolt',), ('Gotcha',), ('Jump',), ('Lime',), ('Lyft',), ('Spin',)]
[('Bird',), ('Bolt Mobility',), ('Gotcha',), ('JUMP',), ('Lime',), ('Lyft',), ('SPIN',)]


* The company names do not match exactly between the two tables (`Bolt` vs. `Bolt Mobility`, `Jump` vs. `JUMP`, `Spin` vs. `SPIN`), but they are similar enough to map 1-to-1 after some data scrubbing.
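
One possible form of that scrubbing is an explicit mapping from the `trips` spellings to the `scooters` spellings. The mapping below is built from the two result lists above. Choosing the `scooters` spellings as the canonical ones is an assumption.

```python
import pandas as pd

# Company names as returned from each table
scooters_names = {'Bird', 'Bolt', 'Gotcha', 'Jump', 'Lime', 'Lyft', 'Spin'}
trips_names = pd.Series(['Bird', 'Bolt Mobility', 'Gotcha', 'JUMP', 'Lime', 'Lyft', 'SPIN'])

# Map the trips spellings that differ onto the scooters spellings
name_map = {'Bolt Mobility': 'Bolt', 'JUMP': 'Jump', 'SPIN': 'Spin'}
cleaned_names = trips_names.replace(name_map)

# After the mapping, both tables use the same set of names
print(set(cleaned_names) == scooters_names)
```

On the real data, the same `replace` call would be applied to `trips_df['companyname']`.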