<h1 style='text-align: center; font-size: 50px;'>Does the weather impact the duration of rides?</h1>

# Introduction:

In this project, we work with data from **'Zuber'**, a new ride-sharing company that's launching in Chicago. Our mission is to find patterns in the available information and to understand passenger preferences and the impact of external factors on rides. We'll analyze data from competitors and test a hypothesis about the impact of weather on ride duration. This will allow us to improve customer experience and optimize scheduling and routing. The dataset is stored in three files (project_sql_result_01.csv, project_sql_result_04.csv, project_sql_result_07.csv). During data preprocessing and analysis, we will:

- Load and display the dataset in a standardized format.
- Verify and correct data types.
- Identify and handle missing values.
- Detect and remove duplicate entries.
- Identify the top 10 neighborhoods in terms of drop-offs.
- Create visualizations to clearly communicate insights from the data.

By following this process, we aim to produce a detailed report that provides actionable insights for business strategy.

# Step 1. Initialization:

In [None]:
# Loading all the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import math
import matplotlib.pyplot as plt
import scipy.stats as stats

# Step 2. Load data:

In [None]:
# Loading the data files into different DataFrames:
company_name_trips = pd.read_csv('/datasets/project_sql_result_01.csv')
dropoff_avg_trip = pd.read_csv('/datasets/project_sql_result_04.csv')
weather_duration = pd.read_csv('/datasets/project_sql_result_07.csv')

# Step 3. Preparing and Fixing the Data:

### Company_name_trips:

In [None]:
# Printing the DataFrame:
company_name_trips.head()

In [None]:
# Data overview:
company_name_trips.info()

In [None]:
# Checking for missing values:
company_name_trips.isna().sum()

In [None]:
# Checking for duplicates:
company_name_trips.duplicated().sum()

### Dropoff_avg_trip:

In [None]:
# Printing the DataFrame:
dropoff_avg_trip.head()

In [None]:
# Data overview:
dropoff_avg_trip.info()

In [None]:
# Converting 'average_trips' into Integer:
dropoff_avg_trip['average_trips'] = dropoff_avg_trip['average_trips'].astype(int)
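
Note that `astype(int)` drops the fractional part instead of rounding to the nearest integer. The sketch below uses made-up values (not taken from the Zuber data) to show the difference; whether rounding first is preferable for this analysis is a judgment call.

```python
import pandas as pd

# Hypothetical values for illustration only; not from the dataset.
demo = pd.Series([10727.47, 9523.67, 8.99])

truncated = demo.astype(int)        # drops the fractional part
rounded = demo.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

For large averages the two approaches differ by at most one trip, so the ranking of the busiest neighborhoods is unlikely to change.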

In [None]:
# Checking for missing values:
dropoff_avg_trip.isna().sum()

In [None]:
# Checking for duplicates:
dropoff_avg_trip.duplicated().sum()

### Weather_duration:

In [None]:
# Printing the DataFrame:
weather_duration.head()

In [None]:
# Data overview:
weather_duration.info()

In [None]:
# Checking for missing values:
weather_duration.isna().sum()

In [None]:
# Converting 'start_ts' into datetime:
weather_duration['start_ts'] = pd.to_datetime(weather_duration['start_ts'])
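
Because the hypothesis concerns Saturdays, it may be worth confirming that every parsed timestamp actually falls on a Saturday. The sketch below assumes the same `start_ts` column name and uses a small invented sample; `dt.dayofweek` counts Monday as 0, so Saturday is 5.

```python
import pandas as pd

# Hypothetical sample; the real data comes from project_sql_result_07.csv.
sample = pd.DataFrame({
    'start_ts': ['2017-11-25 16:00:00', '2017-11-04 10:00:00', '2017-11-26 09:00:00']
})
sample['start_ts'] = pd.to_datetime(sample['start_ts'])

# Monday is 0 in pandas, so Saturday is 5.
is_saturday = sample['start_ts'].dt.dayofweek == 5
non_saturdays = sample[~is_saturday]

print(f"Rows not on a Saturday: {len(non_saturdays)}")
```

Applied to `weather_duration`, a non-zero count would suggest filtering those rows out before running the test.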

# Step 4. Analyzing the data:

In [None]:
# Identifying the top 10 neighborhoods in terms of drop-offs:
top_10_neighborhoods = dropoff_avg_trip.sort_values(by='average_trips', ascending=False).reset_index(drop=True).head(10)
top_10_neighborhoods

In [None]:
# Barplot showing Top 10 Neighborhoods by Average Trips:
plt.figure(figsize=(10, 6))
sns.barplot(data=top_10_neighborhoods, x='dropoff_location_name', y='average_trips', palette='Blues_r')
plt.title('Top 10 Neighborhoods by Average Trips')
plt.xlabel('Neighborhoods')
plt.ylabel('Average Trips')
plt.xticks(rotation=45, ha='right')
plt.show()

The graph shows that **The Loop** has the highest average number of trips, making it the primary drop-off destination, followed by **River North** and **Streeterville**, possibly because of their entertainment venues. **O'Hare Airport** also ranks high, indicating strong demand for airport transfers, while trip volume declines gradually across the remaining destinations.

In [None]:
# Identifying the top 10 companies in terms of number of trips:
top_10_companies = company_name_trips.sort_values(by='trips_amount', ascending=False).reset_index(drop=True).head(10)
top_10_companies

In [None]:
# Barplot showing Top 10 Companies by Number of Trips:
plt.figure(figsize=(10, 6))
sns.barplot(data=top_10_companies, x='company_name', y='trips_amount', palette='Greens_r')
plt.title('Top 10 Companies by Number of Trips')
plt.xlabel('Companies')
plt.ylabel('Number of Trips')
plt.xticks(rotation=45, ha='right')
plt.show()

The graph shows that **Flash Cab** dominates the transportation market, handling far more trips than its competitors. Trip volume declines gradually across the remaining companies. This suggests that smaller companies could work on improving their services or pricing to gain market share. Additionally, analyzing factors such as customer satisfaction or geographic coverage could help explain why **Flash Cab** outperforms the others.

# Step 5. Testing the hypotheses:


In [None]:
# Creating a box plot of ride duration by weather condition:
plt.figure(figsize=(8,5))
sns.boxplot(x='weather_conditions', y='duration_seconds', data=weather_duration)
plt.xlabel('Weather Condition')
plt.ylabel('Duration in Seconds')
plt.title('Boxplot of Ride Duration')
plt.show()

The box plot shows that ride durations tend to be longer in **Bad Weather**, with a higher median and greater variability, likely due to slower driving speeds and reduced visibility. However, **Good Weather** shows more extreme ride durations, possibly due to increased demand and higher traffic volume.
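
The visual reading can be backed with numbers by summarizing duration per weather group. This is a minimal sketch that assumes the same column names as `weather_duration` and uses invented durations, so the printed values say nothing about the real data.

```python
import pandas as pd

# Invented durations (in seconds) for illustration only.
sample = pd.DataFrame({
    'weather_conditions': ['Good', 'Good', 'Good', 'Bad', 'Bad', 'Bad'],
    'duration_seconds': [1800, 2000, 2200, 2300, 2500, 2700],
})

# Median, spread, and group size per weather condition.
summary = (
    sample.groupby('weather_conditions')['duration_seconds']
    .agg(['median', 'std', 'count'])
)
print(summary)
```

Running the same `groupby` on `weather_duration` would give the medians and standard deviations that the box plot only shows graphically.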

### Test the hypotheses:

- Null Hypothesis (H_0): Average duration of rides from the Loop to O'Hare International Airport does not change on rainy Saturdays.

- Alternative Hypothesis (H_1): Average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.

In [None]:
# Extracting Good and Bad weather condition from the dataset:
bad_saturdays = weather_duration[weather_duration['weather_conditions'] == 'Bad']['duration_seconds']
good_saturdays = weather_duration[weather_duration['weather_conditions'] == 'Good']['duration_seconds']

# Conducting the t-test:
t_stat, p_value = stats.ttest_ind(bad_saturdays, good_saturdays, equal_var=False)

# Printing the results:
print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Interpretation:
if p_value < 0.05:
    print("Reject the null hypothesis: Average duration of rides changes on rainy Saturdays.")
else:
    print("Fail to reject the null hypothesis: No significant difference in average duration of rides on rainy Saturdays.")
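
Passing `equal_var=False` asks `ttest_ind` for Welch's t-test, which does not assume that the two groups share the same variance. As a rough illustration of what that statistic measures, the sketch below computes Welch's t and the Welch-Satterthwaite degrees of freedom with the standard library only. The helper name `welch_t` and the durations are invented for this example; the p-value is left to SciPy because it requires the t-distribution.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    # Squared standard errors, based on sample variances (n - 1 denominator).
    se2_a = statistics.variance(a) / na
    se2_b = statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Invented durations in seconds; not taken from the dataset.
bad = [2400, 2600, 2500, 2700]
good = [2000, 2100, 1900, 2000, 2050]

t_demo, df_demo = welch_t(bad, good)
print(f"t = {t_demo:.3f}, df = {df_demo:.2f}")
```

The degrees of freedom always fall between the smaller group size minus one and the combined size minus two, which is why Welch's test is the more cautious choice when group sizes or variances differ.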

Our analysis found a statistically significant difference in ride duration on rainy Saturdays (p-value ~ 0). This suggests that rain affects travel time, likely due to factors such as heavier traffic and slower driving speeds. For service optimization, we recommend adjusting driver dispatch accordingly and considering dynamic pricing during rainy conditions.