# Uber Data Analysis

This Jupyter notebook walks through a basic exploratory data analysis (EDA) of **Uber trip data** stored as CSV files in a local folder, which you set through `folder_path`.  
The workflow is entirely **Python-based** (using `pandas`, `numpy`, and `matplotlib`), so you can extend it freely.  
Once the data is loaded, you can add further steps such as geospatial mapping or machine-learning models.


In [1]:
# %pip installs into the environment of the running kernel
%pip install pandas numpy matplotlib







In [2]:
import os
import glob
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
# 👉 Update `folder_path` if you moved the data
folder_path = r"C:\Users\harsh gangavane\OneDrive\Documents\internship_coll[1]\internship coll\uber_data"

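# glob.escape() keeps the "[1]" in the path literal; unescaped, glob reads it as a character class
csv_files = sorted(glob.glob(os.path.join(glob.escape(folder_path), "*.csv")))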
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in {folder_path!r}. Make sure the path is correct and contains .csv files.")

print(f"Found {len(csv_files)} CSV file(s):")
for f in csv_files:
    print(Path(f).name)

# Concatenate all CSVs into a single DataFrame (change encoding if needed)
df = pd.concat((pd.read_csv(f, encoding='latin1') for f in csv_files), ignore_index=True)
print(f"\nCombined shape: {df.shape}")

FileNotFoundError: No CSV files found in 'C:\\Users\\harsh gangavane\\OneDrive\\Documents\\internship_coll[1]\\internship coll\\uber_data'. Make sure the path is correct and contains .csv files.
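
One possible cause of the error above, even when the folder exists, is the `[1]` in the path: `glob` interprets square brackets as a character class, so `internship_coll[1]` is matched against a folder literally named `internship_coll1`. The sketch below reproduces this with a temporary folder (the name `data[1]` and the file `trips.csv` are made up for illustration) and shows how `glob.escape` avoids it.

```python
import glob
import os
import tempfile

# Hypothetical folder name containing brackets, mimicking "internship_coll[1]"
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "data[1]")
    os.makedirs(folder)
    open(os.path.join(folder, "trips.csv"), "w").close()

    # Unescaped: "[1]" is read as a character class matching the single character "1"
    unescaped = glob.glob(os.path.join(folder, "*.csv"))

    # Escaped: glob.escape() makes the brackets literal
    escaped = glob.glob(os.path.join(glob.escape(folder), "*.csv"))

    n_unescaped, n_escaped = len(unescaped), len(escaped)

print(n_unescaped, n_escaped)
```

If the escaped pattern still finds nothing, the path itself is probably wrong or the folder holds no `.csv` files.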

## Quick Glance at the Data

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe(include='all').T

## Missing Values per Column

In [None]:
df.isna().sum().sort_values(ascending=False)
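
Raw counts are hard to judge without knowing the total number of rows, so it can help to show the share of missing values alongside them. A minimal sketch on a toy frame follows; the column names and values are placeholders, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the trip data; column names and values are illustrative only
toy = pd.DataFrame({
    "Pickup_datetime": ["2014-04-01 00:11:00", None, "2014-04-01 00:21:00", None],
    "Lat": [40.769, 40.7267, np.nan, 40.7588],
    "Base": ["B02512"] * 4,
})

# Missing count and percentage per column, worst columns first
missing = pd.DataFrame({
    "missing": toy.isna().sum(),
    "pct": (toy.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)
print(missing)
```

Swapping `toy` for `df` gives the same summary for the loaded data.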

## Datetime Parsing
Most Uber trip datasets contain a pickup timestamp column, named something like `Pickup_datetime` or `date/time`. Update `datetime_col` below if your column name differs.

In [None]:
# Replace 'Pickup_datetime' with your actual column name
datetime_col = 'Pickup_datetime'
if datetime_col not in df.columns:
    raise KeyError(f"{datetime_col} column not found. Please replace it with the correct column name.")

df[datetime_col] = pd.to_datetime(df[datetime_col])
df['date'] = df[datetime_col].dt.date
df['hour'] = df[datetime_col].dt.hour
df['day_of_week'] = df[datetime_col].dt.day_name()
df.head()
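
Since the timestamp column name varies between datasets, a small helper can look it up instead of hard-coding it. The sketch below is one possible approach: `find_datetime_column` is a hypothetical helper, the candidate names are the two mentioned above, and the toy values are made up. It also passes `errors="coerce"` so unparseable entries become `NaT` rather than raising.

```python
import pandas as pd

def find_datetime_column(frame, candidates=("Pickup_datetime", "date/time")):
    """Return the first column whose name matches a candidate, ignoring case."""
    lookup = {str(c).strip().lower(): c for c in frame.columns}
    for name in candidates:
        if name.lower() in lookup:
            return lookup[name.lower()]
    raise KeyError(f"None of {candidates} found in columns: {list(frame.columns)}")

# Toy frame with a differently cased column name and one bad timestamp
toy = pd.DataFrame({
    "Date/Time": ["2014-04-01 00:11:00", "not a date"],
    "Base": ["B02512", "B02512"],
})

col = find_datetime_column(toy)
# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(toy[col], errors="coerce")
print(col, int(parsed.isna().sum()))
```

Counting the `NaT` values after parsing shows how many rows would drop out of the time-based plots.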

## Trip Count per Day

In [None]:
trips_per_day = df.groupby('date').size()
trips_per_day.plot(figsize=(12,4))
plt.title('Uber Trips per Day')
plt.ylabel('Trips')
plt.xlabel('Date')
plt.tight_layout()

## Distribution by Hour of Day

In [None]:
df['hour'].value_counts().sort_index().plot(kind='bar', figsize=(12,4))
plt.title('Trips by Hour of Day')
plt.xlabel('Hour')
plt.ylabel('Trips')
plt.tight_layout()
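
Note that `value_counts()` only lists hours that actually occur, so an hour with no trips is left out of the bar chart rather than drawn as an empty bar. One way to keep all 24 positions, sketched here on toy values standing in for `df['hour']`, is to reindex the counts over `range(24)`.

```python
import pandas as pd

# Toy hours standing in for df['hour']; most hours have no trips
hours = pd.Series([0, 0, 7, 7, 7, 23], name="hour")

# Reindex over 0..23 so hours without trips appear with a count of zero
counts = hours.value_counts().reindex(range(24), fill_value=0)
print(counts[counts > 0])
```

Calling `counts.plot(kind='bar')` then draws one bar slot per hour of the day.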

## Heatmap: Day of Week vs Hour

In [None]:
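pivot = df.pivot_table(index='day_of_week', columns='hour', values=datetime_col, aggfunc='count')
# Order the weekdays and force all 24 hour columns so the tick labels line up with the data;
# day/hour combinations with no trips become 0 instead of NaN
weekdays = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
pivot = pivot.reindex(index=weekdays, columns=range(24)).fillna(0)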
plt.figure(figsize=(12,6))
plt.imshow(pivot, aspect='auto')
plt.colorbar(label='Trip Count')
plt.xticks(ticks=np.arange(0,24,1), labels=np.arange(0,24,1))
plt.yticks(ticks=np.arange(7), labels=pivot.index)
plt.title('Trips by Hour and Day of Week')
plt.xlabel('Hour of Day')
plt.ylabel('Day of Week')
plt.tight_layout()

## Save Cleaned Data (Optional)

In [None]:
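# Uncomment to save the cleaned DataFrame.
# It is written next to (not inside) folder_path so the "*.csv" load above does not re-read it on the next run.
# output_path = os.path.join(os.path.dirname(folder_path), 'uber_data_cleaned.csv')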
# df.to_csv(output_path, index=False)
# print(f'Saved cleaned data to {output_path}')

## Conclusion
You now have a foundational analysis of your Uber dataset. You can extend this notebook by adding geospatial visualizations, calculating ride distances, analyzing surge pricing patterns, or forecasting demand with machine learning models. 🚀