## Zuber

#### 1. Introduction 

This notebook works with three datasets that contain information on ride services in the Chicago area in November 2017, with some data specific to Nov. 15-16, 2017. It takes a cursory look at the data to determine the ten companies that provided the most rides and the ten most popular destination neighborhoods. These findings will be supported by data visualization and exploratory analysis.

##### 1.1 Initialization

In [24]:
# Import libraries that might be necessary

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import statsmodels.api as sm
from scipy.interpolate import UnivariateSpline
import scipy.stats as stats

# Import DataFrames
    #DataFrame with company name and total rides
comp_count_trip = 'https://raw.githubusercontent.com/DHE42/zuber/refs/heads/main/moved_project_sql_result_01.csv'
comp_count_trip_df = pd.read_csv(comp_count_trip)

    #DataFrame with both destination and corresponding ride average
dropoff_trip_avg = 'https://raw.githubusercontent.com/DHE42/zuber/refs/heads/main/moved_project_sql_result_04.csv'
dropoff_trip_avg_df = pd.read_csv(dropoff_trip_avg)

    #DataFrame with date, weather, and ride duration for Nov. 2017
loop_ohare = 'https://raw.githubusercontent.com/DHE42/zuber/refs/heads/main/moved_project_sql_result_07.csv'
loop_ohare_df = pd.read_csv(loop_ohare)


Above, I have imported necessary libraries and the three datasets I'll be working with. Below, I will review the data.

#### 2. Data Review

##### 2.1 Review of comp_count_trip_df 

In [25]:
# Review  comp_count_trip_df
    #Data Head
print("Head of comp_count_trip_df")
print()
print(comp_count_trip_df.head())
print()

    #Data Info
print("Info of comp_count_trip_df")
print()
print(comp_count_trip_df.info())
print()

    #Data Description
print("Description of comp_count_trip_df")
print()
print(comp_count_trip_df.describe())
print()

    #Null Values
print("Null values in comp_count_trip_df")
print()
print(comp_count_trip_df.isnull().sum())
print()

    #Duplicate Rows
print("Duplicates in comp_count_trip_df")
print()
print(comp_count_trip_df.duplicated().sum())
print()



Head of comp_count_trip_df

                      company_name  trips_amount
0                        Flash Cab         19558
1        Taxi Affiliation Services         11422
2                 Medallion Leasin         10367
3                       Yellow Cab          9888
4  Taxi Affiliation Service Yellow          9299

Info of comp_count_trip_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  64 non-null     object
 1   trips_amount  64 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB
None

Description of comp_count_trip_df

       trips_amount
count     64.000000
mean    2145.484375
std     3812.310186
min        2.000000
25%       20.750000
50%      178.500000
75%     2106.500000
max    19558.000000

Null values in comp_count_trip_df

company_name    0
trips_amount    0
dtype: int64

Duplicates in comp_cou

The first thing that stands out in the head of the data is the lack of standardization. To start, I will rename the column 'company_name' to 'company'. Then I will convert the object dtype values in 'company' to string dtype, put them in lowercase snake case, and strip any leading or trailing spaces. I will also rename the 'trips_amount' column to 'trip_sum' for simplicity and accuracy. The data types appear to be appropriate: one column is categorical and the other holds integer counts, which need no decimal precision since there is no such thing as a partial trip. The info() output already shows 64 non-null values in each column, and separately calling isnull() and duplicated() confirms that there are no null values or duplicate rows. With a mean of about 2,145, a standard deviation of about 3,812, and a median of 178.5, the distribution is highly variable and strongly right-skewed, so the median is likely the best representation of ride company performance during the two days of November 15-16, 2017.
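To illustrate why the median is the safer summary here, the following is a minimal sketch on made-up numbers, not the real dataset. The series `trip_counts` is a hypothetical sample with a few large operators and many small ones, which is roughly the shape that describe() suggests.

```python
import pandas as pd

# Hypothetical trip counts (an assumption for illustration, not the real data):
# three large operators and six small ones.
trip_counts = pd.Series(
    [19558, 11422, 9888, 180, 150, 40, 20, 5, 2],
    name="trips_amount",
)

mean_trips = trip_counts.mean()      # pulled upward by the few large operators
median_trips = trip_counts.median()  # reflects the typical (small) operator
skewness = trip_counts.skew()        # positive values indicate a long right tail

print(f"mean:     {mean_trips:.1f}")
print(f"median:   {median_trips:.1f}")
print(f"skewness: {skewness:.2f}")
```

When the mean sits far above the median and the skewness is positive, a handful of large values dominate the average, which is the pattern seen in comp_count_trip_df.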

##### 2.2 Review of dropoff_trip_avg_df

In [26]:
# Review dropoff_avg_trip_df

    #Data Head
print("Head of dropoff_avg_trip_df")
print(dropoff_trip_avg_df.head())
print()

    #Data Info
print("Info of dropoff_avg_trip_df")
print()
print(dropoff_trip_avg_df.info())
print()

    #Data Description
print("Description of dropoff_avg_trip_df")
print()
print(dropoff_trip_avg_df.describe())
print()

    #Null Values
print("Null values in dropoff_avg_trip_df")
print()
print(dropoff_trip_avg_df.isnull().sum())
print()

    #Duplicate Rows
print("Duplicates in dropoff_avg_trip_df")
print()
print(dropoff_trip_avg_df.duplicated().sum())
print()

Head of dropoff_avg_trip_df
  dropoff_location_name  average_trips
0                  Loop   10727.466667
1           River North    9523.666667
2         Streeterville    6664.666667
3             West Loop    5163.666667
4                O'Hare    2546.900000

Info of dropoff_avg_trip_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   dropoff_location_name  94 non-null     object 
 1   average_trips          94 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.6+ KB
None

Description of dropoff_avg_trip_df

       average_trips
count      94.000000
mean      599.953728
std      1714.591098
min         1.800000
25%        14.266667
50%        52.016667
75%       298.858333
max     10727.466667

Null values in dropoff_avg_trip_df

dropoff_location_name    0
average_trips            0
dtype: int64

Duplicates

This data is structured much like the previous DataFrame. I will begin with the same value standardization. Then I'll rename the categorical column to 'destination' and the numerical column to 'trip_average'. The dtypes are appropriate: object for the location names and float64 for the averages. Since this DataFrame holds averages, floats are warranted, but decimal places beyond the hundredths are superfluous. The info() output shows 94 non-null values in each column, and separately calling isnull() and duplicated() confirms that there are no null values or duplicate rows. As with the previous dataset, the mean (about 600) is far above the median (about 52) and the maximum is about 10,727, so there are significant outliers and high variability.
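The plan above could be sketched roughly as follows. This is one possible approach, not the notebook's final cleaning code. The helper `clean_dropoff_df` and the small `sample_dropoff_df` (built from a few rows shown in the head output) are names introduced only for this illustration.

```python
import pandas as pd

# Small sample copied from the head() output above; used only for illustration.
sample_dropoff_df = pd.DataFrame({
    "dropoff_location_name": ["Loop", "River North", "O'Hare"],
    "average_trips": [10727.466667, 9523.666667, 2546.900000],
})


def clean_dropoff_df(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, normalize destination names, round averages (hypothetical helper)."""
    out = df.rename(columns={
        "dropoff_location_name": "destination",
        "average_trips": "trip_average",
    })
    # Lowercase snake case: collapse any run of non-alphanumerics into "_".
    out["destination"] = (
        out["destination"]
        .astype("string")
        .str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
        .str.strip("_")
    )
    # Keep only two decimal places for the averages.
    out["trip_average"] = out["trip_average"].round(2)
    return out


cleaned_dropoff_df = clean_dropoff_df(sample_dropoff_df)
print(cleaned_dropoff_df)
```

Because rename() returns a new DataFrame, the original is left untouched, so the raw values remain available for comparison.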

##### 2.3 Review of loop_ohare_df

In [27]:
# Review loop_ohare_df

   #Data Head
print("Head of loop_ohare_df")
print(loop_ohare_df.head())
print()

   #Data Info
print("Info of loop_ohare_df")
print()
print(loop_ohare_df.info())
print()

   #Data Description
print("Description of loop_ohare_df")
print()
print(loop_ohare_df.describe())
print()
   #Null Values
print("Null values in loop_ohare_df")
print()
print(loop_ohare_df.isnull().sum())
print()

   #Duplicate Rows
print("Duplicates in loop_ohare_df")
print()
print(loop_ohare_df.duplicated().sum())

Head of loop_ohare_df
              start_ts weather_conditions  duration_seconds
0  2017-11-25 16:00:00               Good            2410.0
1  2017-11-25 14:00:00               Good            1920.0
2  2017-11-25 12:00:00               Good            1543.0
3  2017-11-04 10:00:00               Good            2512.0
4  2017-11-11 07:00:00               Good            1440.0

Info of loop_ohare_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   start_ts            1068 non-null   object 
 1   weather_conditions  1068 non-null   object 
 2   duration_seconds    1068 non-null   float64
dtypes: float64(1), object(2)
memory usage: 25.2+ KB
None

Description of loop_ohare_df

       duration_seconds
count       1068.000000
mean        2071.731273
std          769.461125
min            0.000000
25%         1438.250000
50%       

The loop_ohare_df has more immediate issues than simple string standardization. First, 'start_ts' will be renamed 'start_time', 'weather_conditions' will be renamed 'weather', and 'duration_seconds' will be renamed 'trip_length'. Column 'start_time' will then be converted to datetime format, 'weather' will be cleaned, and 'trip_length' will be converted to timedelta dtype, which is designed for time intervals and makes durations easy to read and compare. Calling isnull() shows that there are no null values in the columns; however, there are 197 duplicate rows in loop_ohare_df. While rides can be similar, it is highly unlikely that two are completely identical. Therefore, the duplicates can be dropped.
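A minimal sketch of these steps, under the assumption that the timestamps all follow the `YYYY-MM-DD HH:MM:SS` pattern seen in the head output. The helper `clean_loop_ohare_df` and the sample rows are hypothetical, and the third sample row is a deliberate duplicate added to show the effect of drop_duplicates().

```python
import pandas as pd

# Illustrative rows in the shape of loop_ohare_df; the third row
# intentionally duplicates the first.
sample_loop_ohare_df = pd.DataFrame({
    "start_ts": ["2017-11-25 16:00:00", "2017-11-25 14:00:00", "2017-11-25 16:00:00"],
    "weather_conditions": ["Good", "Good", "Good"],
    "duration_seconds": [2410.0, 1920.0, 2410.0],
})


def clean_loop_ohare_df(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, convert dtypes, drop duplicate rows (hypothetical helper)."""
    out = df.rename(columns={
        "start_ts": "start_time",
        "weather_conditions": "weather",
        "duration_seconds": "trip_length",
    })
    # Parse timestamps; the explicit format is an assumption based on head().
    out["start_time"] = pd.to_datetime(out["start_time"], format="%Y-%m-%d %H:%M:%S")
    # Standardize the weather labels.
    out["weather"] = out["weather"].astype("string").str.strip().str.lower()
    # Interpret the numeric durations as seconds.
    out["trip_length"] = pd.to_timedelta(out["trip_length"], unit="s")
    # Remove fully identical rows and rebuild a clean index.
    return out.drop_duplicates().reset_index(drop=True)


cleaned_loop_ohare_df = clean_loop_ohare_df(sample_loop_ohare_df)
print(cleaned_loop_ohare_df)
print(cleaned_loop_ohare_df.dtypes)
```

Dropping duplicates after the dtype conversions means rows are compared on their parsed values rather than on raw strings.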

#### 3. Data Cleaning

##### 3.1 Cleaning comp_count_trip_df

As noted in the review, comp_count_trip_df needs standardization. I will rename 'company_name' to 'company' and 'trips_amount' to 'trip_sum', convert the values in 'company' to string dtype, put them in lowercase snake case, and strip any leading or trailing spaces. The dtypes are already appropriate and there are no null values or duplicate rows, so no further changes are needed.
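One way these steps might look in code is sketched below. The helper `clean_company_df` and the sample rows are assumptions made for illustration (the extra spaces around the second name are added on purpose), not the notebook's final implementation.

```python
import pandas as pd

# Illustrative rows in the shape of comp_count_trip_df; the padding around
# the second name is added deliberately to exercise the whitespace cleanup.
sample_company_df = pd.DataFrame({
    "company_name": ["Flash Cab", " Taxi Affiliation Services ", "Yellow Cab"],
    "trips_amount": [19558, 11422, 9888],
})


def clean_company_df(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns and put company names in lowercase snake case (hypothetical helper)."""
    out = df.rename(columns={"company_name": "company", "trips_amount": "trip_sum"})
    out["company"] = (
        out["company"]
        .astype("string")                              # object -> pandas string dtype
        .str.strip()                                   # drop leading/trailing spaces
        .str.lower()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)   # non-alphanumerics -> "_"
        .str.strip("_")                                # no leading/trailing underscores
    )
    return out


cleaned_company_df = clean_company_df(sample_company_df)
print(cleaned_company_df)
```

The same call could then be applied to the full DataFrame, for example `comp_count_trip_df = clean_company_df(comp_count_trip_df)`.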

In [None]:
# Renaming Columns

##### 3.2 Cleaning dropoff_trip_avg_df

##### 3.3 Cleaning loop_ohare_df