## Zuber

#### 1. Introduction 

This notebook works with two datasets containing information on ride services in the Chicago area from Nov. 15-16, 2017. It takes a cursory look at the data to determine the ten companies that provided the most rides and the ten most popular destination neighborhoods. These findings will be supported by data visualization and exploratory analysis.

##### 1.1 Initialization

In [None]:
# Import libraries that might be necessary

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import statsmodels.api as sm
from scipy.interpolate import UnivariateSpline
import scipy.stats as stats

# Import DataFrames
    # DataFrame with company name and total rides
comp_count_trip = 'https://raw.githubusercontent.com/DHE42/zuber/refs/heads/main/moved_project_sql_result_01.csv'
comp_count_trip_df = pd.read_csv(comp_count_trip)

    #DataFrame with both destination and corresponding ride average
dropoff_avg_trip = 'https://raw.githubusercontent.com/DHE42/zuber/refs/heads/main/moved_project_sql_result_04.csv'
dropoff_avg_trip_df = pd.read_csv(dropoff_avg_trip)

    #DataFrame with date, weather, and ride duration for Nov. 2017
loop_ohare = 'https://raw.githubusercontent.com/DHE42/zuber/refs/heads/main/moved_project_sql_result_07.csv'
loop_ohare_df = pd.read_csv(loop_ohare)


Above, I have imported the necessary libraries and loaded the datasets I'll be working with. Below, I will review the data.

#### 2. Data Review

In [8]:
# Review comp_count_trip_df
print("Head of comp_count_trip_df")
print()
print(comp_count_trip_df.head())
print()

print("Info of comp_count_trip_df")
print()
print(comp_count_trip_df.info())  # info() prints its report directly and returns None, so print() also emits "None"
print()

print("Describe comp_count_trip_df")
print()
print(comp_count_trip_df.describe())
print()

print("Null values of comp_count_trip_df")
print()
print(comp_count_trip_df.isnull().sum())
print()

print("Duplicates of comp_count_trip_df")
print()
print(comp_count_trip_df.duplicated().sum())
print()



Head of comp_count_trip_df

                      company_name  trips_amount
0                        Flash Cab         19558
1        Taxi Affiliation Services         11422
2                 Medallion Leasin         10367
3                       Yellow Cab          9888
4  Taxi Affiliation Service Yellow          9299

Info of comp_count_trip_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  64 non-null     object
 1   trips_amount  64 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB
None

Describe comp_count_trip_df

       trips_amount
count     64.000000
mean    2145.484375
std     3812.310186
min        2.000000
25%       20.750000
50%      178.500000
75%     2106.500000
max    19558.000000

Null values of comp_count_trip_df

company_name    0
trips_amount    0
dtype: int64

Duplicates of comp_count_tri

The first thing that jumps out in the head of the data is the lack of standardization in company_name. I will convert its object-dtype values to lowercase snake case and apply the requisite cleaning operations, such as removing leading and trailing spaces. I will also rename the trips_amount column to trip_sum for simplicity and accuracy. The data types appear appropriate: company_name is categorical, and trips_amount is an integer count that needs no decimal precision because there is no such thing as a partial trip. Separately checking null values and duplicate rows confirms the absence of these data errors, consistent with the non-null counts first shown by info(). With a mean of about 2,145, a standard deviation of about 3,812, and a median of 178.5, the distribution is strongly right-skewed with high variability, so the median is likely the best representation of typical ride company performance during the two days of November 15-16, 2017.
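The cleaning steps described above could look roughly like the following. This is a minimal sketch rather than the notebook's actual cleaning code: `to_snake_case`, `sample_df`, and `cleaned_df` are hypothetical names, and the sample rows are the five shown in the head() output, not the full 64-row dataset.

```python
import re

import pandas as pd

# Sample rows copied from the head() output above; the full DataFrame has 64 rows.
sample_df = pd.DataFrame(
    {
        "company_name": [
            "Flash Cab",
            "Taxi Affiliation Services",
            "Medallion Leasin",
            "Yellow Cab",
            "Taxi Affiliation Service Yellow",
        ],
        "trips_amount": [19558, 11422, 10367, 9888, 9299],
    }
)


def to_snake_case(name: str) -> str:
    """Hypothetical helper: trim, lowercase, and join words with underscores."""
    name = name.strip().lower()
    # Collapse any run of non-alphanumeric characters into a single underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    # Drop underscores left at either end by leading or trailing punctuation.
    return name.strip("_")


# Rename the count column, then standardize the company names.
cleaned_df = sample_df.rename(columns={"trips_amount": "trip_sum"})
cleaned_df["company_name"] = cleaned_df["company_name"].map(to_snake_case)

print(cleaned_df)
```

Using `rename` returns a new DataFrame, so the raw data stays untouched and the cleaned result can be compared against it if a name is transformed unexpectedly.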

In [6]:
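# Review dropoff_avg_trip_df
# Note: in a notebook cell, only info() (which prints directly) and the final
# expression are displayed; head() and describe() need print() or display() to show.
dropoff_avg_trip_df.head()
dropoff_avg_trip_df.info()
dropoff_avg_trip_df.describe()
dropoff_avg_trip_df.isnull().sum()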

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   dropoff_location_name  94 non-null     object 
 1   average_trips          94 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.6+ KB


dropoff_location_name    0
average_trips            0
dtype: int64
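
The same review steps are applied to each DataFrame, so they could be wrapped in one function. The following is a minimal sketch, not part of the original notebook: `review_dataframe` and `demo_df` are hypothetical names, and the toy values are for illustration only and do not come from the datasets.

```python
import pandas as pd


def review_dataframe(df: pd.DataFrame, name: str) -> dict:
    """Hypothetical helper: print the standard review and return summary counts."""
    print(f"Head of {name}")
    print(df.head(), end="\n\n")

    print(f"Info of {name}")
    df.info()  # info() prints directly and returns None, so it is not wrapped in print()
    print()

    print(f"Describe {name}")
    print(df.describe(), end="\n\n")

    null_counts = df.isnull().sum()
    duplicate_rows = int(df.duplicated().sum())

    print(f"Null values of {name}")
    print(null_counts, end="\n\n")

    print(f"Duplicates of {name}")
    print(duplicate_rows)

    return {
        "rows": len(df),
        "nulls": int(null_counts.sum()),
        "duplicates": duplicate_rows,
    }


# Toy frame for illustration only; these values are not from the datasets.
demo_df = pd.DataFrame(
    {
        "dropoff_location_name": ["A", "B", "B"],
        "average_trips": [1.5, 2.0, 2.0],
    }
)
summary = review_dataframe(demo_df, "demo_df")
```

Returning the counts as a dictionary makes it possible to check for nulls or duplicates programmatically instead of reading them off the printed output.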