# 📘 Uber Fares Dataset Analysis – Power BI Project

**Author**: David MILINDI Shema  
**ID**: 25914  
**Submission Date**: 27/07/2025  
**Submitted to**: eric.maniraguha@auca.ac.rw  
**Repository**: [INSERT GITHUB LINK]

## 🔍 Project Overview

This project analyzes the **Uber Fares Dataset** to uncover meaningful insights on ride fares, trip distribution over time, and operational patterns using **Power BI**.

**Deliverables:**
- Cleaned dataset
- Data analysis charts
- Power BI user dashboard
- Analytical report
- GitHub ReadMe documentation

## 🗃️ Dataset Description

- **Source**: [Kaggle – Uber Fares Dataset](https://www.kaggle.com/datasets/yasserh/uber-fares-dataset)  
- **Columns**: fare_amount, pickup_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count  
- **Rows**: ~20,000

## ⚙️ Methodology
### 1. Data Cleaning in Python

```python
# Load the dataset
import pandas as pd

df = pd.read_csv(r"C:\Users\DAVID\Downloads\Documents\Uber\uber.csv")

# Preview the first rows
df.head()

print("Shape:", df.shape)
print("Columns:", df.columns)

# Check the data type of each column
print(df.dtypes)

# Check for null values
print(df.isnull().sum())

# Drop rows with nulls in the key columns (none were found)
df = df.dropna(subset=['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude'])
df.reset_index(drop=True, inplace=True)

# Convert pickup_datetime from text to datetime
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])

# Save the cleaned dataset
df.to_csv("uber_cleaned.csv", index=False)
```
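The cleaning step drops rows with missing key values and then parses `pickup_datetime`. As a minimal sketch of a stricter variant, the example below parses first with `errors="coerce"`, so unparseable timestamps become `NaT` and are dropped together with the other nulls. The `raw` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows (not from the real dataset)
raw = pd.DataFrame({
    "fare_amount": [7.5, 11.0, None, 5.0],
    "pickup_datetime": [
        "2015-05-07 19:52:06",
        "not a timestamp",      # unparseable value
        "2015-05-08 08:10:00",
        "2015-05-09 23:01:44",
    ],
})

# Unparseable timestamps become NaT instead of raising an error
raw["pickup_datetime"] = pd.to_datetime(raw["pickup_datetime"], errors="coerce")

# Drop rows where either key column is missing
clean = raw.dropna(subset=["fare_amount", "pickup_datetime"]).reset_index(drop=True)

print(f"Dropped {len(raw) - len(clean)} of {len(raw)} rows")
```

Parsing before dropping means a single `dropna` call handles both missing and malformed timestamps.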

## 📈 Exploratory Data Analysis (EDA)

```python
# Summary statistics
df.describe()
```

```python
# Plot the fare distribution
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['fare_amount'], bins=50)
plt.title("Fare Amount Distribution")
plt.show()
```

```python
# Plot fare against distance
# Note: this is the straight-line distance in coordinate degrees, not kilometres
df['distance'] = ((df['dropoff_longitude'] - df['pickup_longitude'])**2 +
                  (df['dropoff_latitude'] - df['pickup_latitude'])**2)**0.5
sns.scatterplot(x='distance', y='fare_amount', data=df)
plt.title("Fare vs Distance")
plt.show()
```
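The `distance` column in this section is a Euclidean difference of latitude and longitude, so its unit is degrees rather than kilometres. If a physical distance is wanted, one common option is the haversine formula. The sketch below is not part of the original analysis: `haversine_km`, `EARTH_RADIUS_KM`, and the sample coordinates are illustrative assumptions.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical sample using the dataset's column names
sample = pd.DataFrame({
    "pickup_latitude": [40.7580, 40.7128],
    "pickup_longitude": [-73.9855, -74.0060],
    "dropoff_latitude": [40.7484, 40.7128],
    "dropoff_longitude": [-73.9857, -74.0060],
})

sample["distance_km"] = haversine_km(
    sample["pickup_latitude"], sample["pickup_longitude"],
    sample["dropoff_latitude"], sample["dropoff_longitude"],
)
print(sample["distance_km"].round(2))
```

The function works on whole columns at once, so it can be applied to the full DataFrame without a Python loop.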

## 🧠 Feature Engineering

Features extracted from `pickup_datetime` include hour, weekday, month, and peak_hour.

```python
# Create time-based feature columns

# Ensure pickup_datetime is a datetime column
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])

df['hour'] = df['pickup_datetime'].dt.hour
df['weekday'] = df['pickup_datetime'].dt.day_name()
df['month'] = df['pickup_datetime'].dt.month

# Peak hours: 07:00-09:59 and 16:00-19:59
df['peak_hour'] = df['hour'].apply(lambda x: 'Peak' if (7 <= x <= 9) or (16 <= x <= 19) else 'Off-Peak')

# Save the final dataset with the new features
df.to_csv("uber_fares_final.csv", index=False)
```
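As an alternative to the row-by-row `apply`, the peak-hour flag can be built with vectorized comparisons, and a quick `groupby` gives a first look at fares per group. This is a sketch on made-up rows, assuming the same peak windows (hours 7 to 9 and 16 to 19, inclusive); `sample` and `is_peak` are illustrative names.

```python
import pandas as pd

# Hypothetical sample rows (not from the real dataset)
sample = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2015-05-07 08:15:00",  # Thursday morning
        "2015-05-09 13:00:00",  # Saturday midday
        "2015-05-11 17:45:00",  # Monday evening
        "2015-05-12 23:30:00",  # Tuesday late night
    ]),
    "fare_amount": [12.5, 7.0, 15.0, 9.5],
})

hour = sample["pickup_datetime"].dt.hour

# Series.between is inclusive on both ends by default
is_peak = hour.between(7, 9) | hour.between(16, 19)
sample["peak_hour"] = is_peak.map({True: "Peak", False: "Off-Peak"})
sample["weekday"] = sample["pickup_datetime"].dt.day_name()

# Average fare and ride count per group
print(sample.groupby("peak_hour")["fare_amount"].agg(["mean", "count"]))
```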

## 📊 Power BI Dashboard

The dashboard includes the following visuals:  
- Line Chart: Rides per Month
- Pie Chart: Fare by Weekday
- KPI Cards (Avg Fare, Max Fare, Total Rides)
- Map  

![Power BI Dashboard Screenshot](./images/Screenshot%202025-07-27%20212116.png)
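The dashboard figures can be cross-checked in pandas before or after loading the data into Power BI. The sketch below computes the three KPI values plus the aggregates behind the line and pie charts. The sample rows and the names `kpis`, `rides_per_month`, and `fare_by_weekday` are illustrative assumptions. The pie chart is assumed to use the sum of fares per weekday, which the source does not specify.

```python
import pandas as pd

# Hypothetical sample rows (not from the real dataset)
sample = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2015-01-05 08:00:00",
        "2015-01-20 18:30:00",
        "2015-02-03 12:00:00",
    ]),
    "fare_amount": [10.0, 20.0, 6.0],
})

# KPI cards: average fare, maximum fare, total rides
kpis = {
    "avg_fare": sample["fare_amount"].mean(),
    "max_fare": sample["fare_amount"].max(),
    "total_rides": len(sample),
}

# Line chart: number of rides per month
rides_per_month = sample.groupby(sample["pickup_datetime"].dt.month).size()

# Pie chart: total fare per weekday (assumed aggregation)
fare_by_weekday = sample.groupby(sample["pickup_datetime"].dt.day_name())["fare_amount"].sum()

print(kpis)
print(rides_per_month)
print(fare_by_weekday)
```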


## 💡 Key Insights

1. Peak hours (07:00–09:59 and 16:00–19:59) have the highest fares.
2. Weekends show increased ride volume.
3. Late-night rides cost more.
4. Outliers were removed for better trend analysis.
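The code in this README does not show how outliers were removed. One common approach is the interquartile range (IQR) rule, sketched below as an assumption rather than the method actually used. `drop_fare_outliers`, the multiplier `k=1.5`, and the sample fares are all hypothetical.

```python
import pandas as pd


def drop_fare_outliers(frame, column="fare_amount", k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    keep = frame[column].between(lower, upper)
    return frame[keep].reset_index(drop=True)


# Hypothetical fares with one negative and one extreme value
sample = pd.DataFrame(
    {"fare_amount": [-3.0, 6.5, 7.0, 8.0, 9.5, 10.0, 12.0, 499.0]}
)
cleaned = drop_fare_outliers(sample)

print(f"Removed {len(sample) - len(cleaned)} outlier rows")
```

A larger `k` keeps more of the tail, which may matter for long airport trips that are expensive but legitimate.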

## 📷 Screenshots


- Data Cleaning
- Feature Engineering
- Dashboard
- KPI Cards  

![Screenshot](./images/Screenshot%202025-07-27%20212116.png)  
![Screenshot](./images/Screenshot%202025-07-27%20121946.png)  
![Screenshot](./images/Screenshot%202025-07-27%20121128.png)  
![Screenshot](./images/Screenshot%202025-07-27%20120538.png)  
![Screenshot](./images/Screenshot%202025-07-27%20103137.png)  
![Screenshot](./images/Screenshot%202025-07-27%20103117.png)  
![Screenshot](./images/Screenshot%202025-07-25%20142347.png)  
![Screenshot](./images/Screenshot%202025-07-25%20142332.png)  
![Screenshot](./images/Screenshot%202025-07-25%20142303.png)  
![Screenshot](./images/Screenshot%202025-07-25%20142050.png)  
![Screenshot](./images/Screenshot%202025-07-25%20140858.png)  
![Screenshot](./images/Screenshot%202025-07-25%20140844.png)  
![Screenshot](./images/Screenshot%202025-07-25%20140831.png)  
![Screenshot](./images/Screenshot%202025-07-25%20140737.png)  