# **Galactic Hackathon 2025**
# **Smart Urban Mobility Optimizer: AI-Driven Solutions for Sustainable Cities**

## **Objective**
The goal of this challenge is to leverage AI and data analysis to address urban mobility challenges by identifying high-traffic congestion areas, optimizing public transport routes and schedules, predicting congestion patterns, and evaluating the environmental impact of travel. By analyzing traffic data, participants will deliver actionable insights and user-friendly solutions that help city planners and stakeholders make informed decisions to improve traffic flow, enhance public transport efficiency, and promote environmental sustainability.

## **Problem Statement**
Cities around the world are grappling with increasing traffic congestion, inefficient public transport systems, and rising environmental concerns due to urbanization and growing vehicle usage. These challenges call for innovative solutions to optimize traffic flow, improve public transport efficiency, and minimize environmental impact.

Using the provided dataset, participants will analyze traffic patterns, evaluate environmental impacts, and develop AI-powered tools to:

- Identify congestion hotspots.
- Predict traffic levels.
- Propose optimized public transport routes and schedules.

The aim is to create actionable, user-friendly insights and recommendations that city planners and stakeholders can use to enhance urban mobility and sustainability.

## **Dataset**
The dataset represents traffic and environmental metrics collected from various locations. It is designed to provide insights into urban mobility, congestion, and environmental conditions in different regions. The dataset includes detailed metrics related to traffic density, air quality, and time-related variables. These metrics can help identify patterns in congestion, environmental pollution, and variations in traffic flow during different times of the day or week.

The dataset is provided in the file `hackathon_data.csv`. Its columns are described below:

| Column name | Description | Sample Data |
|-------------|-------------|-------------|
| Location | The name of the location where the data was collected. It provides geographic context for the dataset. | 'Lajpat Nagar', 'Saket' |
| Traffic Volume | The number of vehicles passing through the area during a given time period. Helps measure the density of vehicle activity at different locations and times. | '1289', '4700' |
| Passenger Count | The total number of passengers in the location during the recorded time. This reflects traffic density in terms of human movement. | '23686', '12187' |
| Noise Level | The average noise level, in decibels (dB), at the location during the recorded period. | '65.4', '85.3' |
| Average Speed | The average speed of vehicles (in kilometers per hour) traveling through the area during the specified time. Provides insight into traffic flow and congestion levels. | '15.6', '31.3' |
| PM2.5 Level | The concentration of fine particulate matter (PM2.5) in the air, measured in micrograms per cubic meter (µg/m³). Indicates air quality and the level of air pollution harmful to respiratory health. | '85.2', '170.7' |
| AQI (Air Quality Index) | A standardized index measuring the quality of air in an area. Higher values indicate worse air quality. Helps monitor environmental health and pollution risks. | '72', '150' |
| Time of the Day | The time frame during which the data was captured, categorized as Morning, Afternoon, Evening, or Night. Identifies variations in traffic and environmental metrics based on daily time periods. | 'Morning', 'Night' |
| Day of the Week | The specific day on which the data was collected. Highlights trends based on weekly traffic and environmental patterns. | 'Sunday', 'Friday' |


## **Chatbot**

A chatbot is available to provide insights and recommendations for solving the problems you encounter. You can interact with this chatbot to gain a better understanding of how to solve any problem related to the data. Please ensure your questions focus solely on the dataset and remain on-topic. Instructions on how to use the Chatbot are also provided with the bot.

## **Tasks**

### **Task 1: Data Analysis**

**Import the dataset**

This hackathon uses the file **hackathon_data.csv**. To work with it, we must first load it, and the pandas library can do this for us. Since it is a .csv file, we can use **pd.read_csv('filename.extension')**.

Note: In a **try-except** statement, the try block lets you test a block of code for errors, and each except block lets you handle a specific error.
- **FileNotFoundError**: This block handles the FileNotFoundError exception, which is raised if the specified file 'hackathon_data.csv' does not exist in the current working directory.
- **pd.errors.ParserError**: This block handles the pd.errors.ParserError exception, which pandas raises if there is a problem parsing the CSV file (e.g., incorrect formatting, inconsistent delimiters).
- **Exception as e**: This is a general exception handler. It catches any other unexpected errors that might occur during the file reading process.

In [None]:
# import pandas library as pd
import pandas as pd

# Read the CSV file into a DataFrame.
try:
    # Path to the dataset file, including the .csv extension
    df = pd.read_csv('hackathon_data.csv')

    # dataframe.head() displays the first 5 rows of the dataset
    print(df.head())
except FileNotFoundError:
    print("Error: hackathon_data.csv not found. Please check the file path.")
except pd.errors.ParserError:
    print("Error: Could not parse hackathon_data.csv. Please check the file format.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

### ***Summary of the dataset***

The summary of the dataset provides a quick overview of its structure, content, and quality. It helps to:

- Understand the data: learn about the column names, data types, number of rows, and basic statistics.
- Identify issues: spot missing values, outliers, or inconsistencies.
- Assess distribution: understand how the data is distributed, such as skewness or clustering.

In [3]:
# .shape returns a tuple representing the dimensions of our DataFrame df. The output shows the number of rows and columns in the dataset, helping to quickly understand its size.


So, our dataset has 7000 rows and 9 columns of data.

Let us now have a look at the descriptive statistics (Assess Distribution) of the dataset. We will use the **.describe** method, which generates descriptive statistics for the DataFrame. Passing **include='all'** makes it cover **all** data types (numerical, categorical, etc.). It provides information like count, mean, min, max, and unique values, offering a summary of the dataset's distribution.

In [4]:
# Display summary statistics of the DataFrame
# The 'all' here basically signals the function to return all forms of descriptive statistics


Now, let's have a look at the **info** of the dataframe, including the data types, number of non-null values in each column, and memory usage. It helps to quickly assess the structure and completeness of the dataset.

In [5]:
# Display data types of each column


### ***Percentage of null values for each column***

Some of our columns clearly contain null values, so why don't you find the count of missing values in each column?

Hint: **sum()** function

Determine the percentage of missing values for each column in a dataset.
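As a minimal sketch of the hint above, the snippet below counts missing values per column with `isnull().sum()` and converts the counts to percentages. The small `sample` DataFrame is made-up data that only borrows column names from the dataset description; apply the same two lines to your own DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up sample data (not taken from hackathon_data.csv).
sample = pd.DataFrame({
    "Location": ["Saket", "Lajpat Nagar", None, "Saket"],
    "AQI": [72.0, np.nan, 150.0, np.nan],
    "Noise Level": [65.4, 85.3, 70.1, 68.0],
})

# isnull() marks missing cells as True; sum() counts them per column.
missing_count = sample.isnull().sum()

# Dividing by the number of rows gives the percentage per column.
missing_pct = missing_count / len(sample) * 100

print(missing_pct)
```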

### ***Correlation Matrix Heatmap***

A correlation matrix heatmap is a visual representation of the correlation coefficients between variables in a dataset. Correlation measures the strength and direction of the linear relationship between two variables, typically ranging from -1 to +1:

- +1: Perfect positive correlation (as one increases, the other increases).

- -1: Perfect negative correlation (as one increases, the other decreases).

- 0: No linear relationship.

Examine relationships between numerical variables.

In [6]:
# Create a correlation heatmap for all the numerical columns in the dataset

# Visualization libraries have already been installed, you can choose a plot of your choice from either of the libraries

# Create the heatmap with specifications (cmap = 'RdYlGn', annot = True)


### ***Boxplot of the dataset***

A boxplot (or box-and-whisker plot) is a graphical representation of the distribution of a dataset. It displays key summary statistics and highlights outliers. The box represents the interquartile range (IQR), which contains the middle 50% of the data, and the whiskers extend to the data's range (excluding outliers).

- Points outside the whiskers indicate outliers, which could be errors or extreme but valid values.
- Multiple boxplots side by side allow you to compare the distributions of different variables or categories.
- A skewed box (longer whisker or unbalanced median line) suggests non-normal data, guiding the need for transformations or adjustments.
- The size of the box (IQR) indicates variability in the data. A large box means high variability, while a small box suggests consistency.

The aim here is to detect and analyze outliers in the dataset, with a focus on numeric columns, and to visualize them using box plots.

You can use any of the following methods:
- Z-Score Method
- Interquartile Range (IQR) Method
- Modified Z-Score Method
- Percentile Method
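The IQR method from the list above can be sketched as follows. The speed values are made up for illustration, and the 1.5 multiplier is the conventional choice rather than a requirement; treat it as a tunable assumption.

```python
import pandas as pd

# Made-up speeds in km/h; 120.0 is deliberately extreme.
speeds = pd.Series(
    [15.6, 18.2, 20.1, 22.4, 25.0, 27.3, 31.3, 120.0],
    name="Average Speed",
)

# The IQR is the distance between the 25th and 75th percentiles.
q1 = speeds.quantile(0.25)
q3 = speeds.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = speeds[(speeds < lower_fence) | (speeds > upper_fence)]
print(outliers)
```

To apply this to several numeric columns, the same steps can be repeated per column inside a loop.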

In [7]:
# Write your code here


### ***Outlier Removal***

Remove extreme values to ensure robust analysis. Utilize any method of your choice for outlier removal.

Why capping outliers can be better than deleting entire rows:
- Prevents outliers from skewing statistical calculations (mean, variance) or model predictions.
- Unlike deleting rows with outliers, capping retains most of the data while reducing the impact of extremes.
- Helps machine learning algorithms (especially sensitive ones like linear regression) to better generalize by limiting the influence of outliers.


Note: You can also utilize other methods like:
1. Capping Method: Limit extreme values by capping them to a specific range, such as the 5th and 95th percentiles, to reduce their impact without removing data.
2. Removing Outliers: Eliminate rows with values that lie beyond a certain threshold.
3. Transformation Methods: Apply transformations to reduce the influence of outliers.
4. Model-Based Approaches: Use machine learning models to detect and handle outliers (Isolation Forest, One-Class SVM).
5. Descriptive Statistic Method: Replace numerical outliers with the median or mean, and categorical outliers with the mode.


In [8]:
# Write your code here

### ***Impute the null values***
Now have a look at the null values in each column.

Why we need to fill in null values (or missing data):
- Null values can distort calculations such as averages, sums, or standard deviations, leading to inaccurate insights.
- Machine learning algorithms generally cannot handle null values directly.
- Simply dropping rows or columns with null values can lead to significant data loss.
- Leaving null values can introduce bias if some algorithms or analyses exclude them automatically.


Fill missing data using suitable methods like mean, median, or predictive imputation.

- For numerical columns, fill the null values with the column's mean or median, or use predictive imputation.
- For categorical columns, fill missing values with the mode of that column.
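A minimal sketch of the two simple strategies above, using made-up data: the median for a numerical column and the mode for a categorical one. Whether mean, median, or predictive imputation is most suitable depends on your data, so treat the choice here as an assumption.

```python
import numpy as np
import pandas as pd

# Made-up sample data with one missing value per column.
sample = pd.DataFrame({
    "PM2.5 Level": [85.2, np.nan, 170.7, 90.0],
    "Time of the Day": ["Morning", "Night", None, "Morning"],
})

# Numerical column: the median is less sensitive to extremes than the mean.
pm_median = sample["PM2.5 Level"].median()
sample["PM2.5 Level"] = sample["PM2.5 Level"].fillna(pm_median)

# Categorical column: mode() returns a Series, so take its first entry.
time_mode = sample["Time of the Day"].mode()[0]
sample["Time of the Day"] = sample["Time of the Day"].fillna(time_mode)

print(sample)
```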

### ***Datatype Format***

A datatype is a classification that specifies which type of value a variable or object can hold in a programming language. It defines the nature of data and determines what operations can be performed on that data.

Some common formats include: Integer, Floating Point, String, Boolean, Date/Time, Null, List.

In [9]:
# Print data type of each column


**Column datatype changes**

- The columns in the dataset already have assigned data types.

- You need to modify these data types based on the requirements of the model.

- Some columns need to be converted to strings, while others must be converted to integers without any decimal places.

- Numerical columns can be converted to floats with a specified number of decimal places (e.g., rounding to 2 decimal places to limit precision to a reasonable level and improve readability). You can also choose to round to 3 decimal places or even 0, depending on your needs.
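The conversions described above might look like the following sketch. The data is made up, and which columns become strings, integers, or rounded floats is an assumption for illustration; choose the types your model actually needs.

```python
import pandas as pd

# Made-up sample data.
sample = pd.DataFrame({
    "Location": ["Saket", "Lajpat Nagar"],
    "Traffic Volume": [1289.0, 4700.0],
    "Average Speed": [15.6432, 31.2987],
})

# Text column stored explicitly as strings.
sample["Location"] = sample["Location"].astype(str)

# Whole-number column as integers. Note: astype(int) raises an error
# if the column still contains NaN, so impute missing values first.
sample["Traffic Volume"] = sample["Traffic Volume"].astype(int)

# Float column rounded to 2 decimal places.
sample["Average Speed"] = sample["Average Speed"].round(2)

# Print data types to verify.
print(sample.dtypes)
```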

In [10]:
# Write your code here

# Print data types to verify


In [11]:
# Have a look at the shape of the dataset before you proceed with the basic questions


### **Basic Questions Post-Preprocessing**


#### **Highlight the top 5 locations with high AQI**
Air quality is an important factor to monitor. Can you pinpoint the locations where the air quality is the poorest? Rank the top 5 locations by AQI from highest to lowest. (Output the Location and its AQI)


In [12]:
# Write your code here


#### **List all locations with their Average Speed**
Speed matters, especially when analyzing traffic patterns or travel efficiency. Can you list each location along with its average speed? This will help us understand the movement dynamics at different places. (Output the Locations and their average speed)


In [13]:
# Write your code here


#### **Which Location Experiences the Highest Noise Level During the Night?**
Noise pollution is a growing concern, particularly at night. Find the location that has the highest noise level during the nighttime. (Output the Location and its noise level in dB)

What could this indicate about the area?

In [14]:
# Write your code here


In [15]:
# What could this indicate about the area?


#### **Which Location Has the Lowest PM2.5 Level and what is its AQI?**
Cleaner air is a sign of better environmental quality. Let's identify the place with the cleanest air. Can you find the location with the lowest PM2.5 level and report its AQI? (Output the Location, its PM2.5 Level and its AQI)


In [16]:
# Write your code here


#### **What is the Average Noise Level Across All Locations?**

Noise levels can affect well-being and productivity. What is the average noise level across all locations? This question will give us insight into the general noise pollution levels across the dataset. (Output the Average Noise Level)

In [17]:
# Write your code here


#### **On Which Day of the Week Does the PM2.5 Level Reach Its Lowest Average?**
The air quality can vary depending on the day of the week, possibly due to different factors like traffic or weather. Can you find the day of the week with the lowest average PM2.5 level? This will give us insight into when the air is at its cleanest during the week. (Output the Day of the Week)

In [18]:
# Write your code here


### **Task 2: Advanced Analysis**

#### **Calculate the average traffic volume for locations with a passenger count above 30,000**

In cities, areas with over 30,000 passengers often encounter unique traffic challenges. Analyze the dataset to calculate the average traffic volume in these densely populated zones. What trends emerge, and how could they guide smarter urban mobility planning and infrastructure upgrades?

(Output: Average Traffic Volume)
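One way to express a threshold filter like this in pandas is a boolean mask. The sketch below uses made-up rows and shows only the pattern, not the answer for the real dataset.

```python
import pandas as pd

# Made-up sample data.
sample = pd.DataFrame({
    "Passenger Count": [23686, 35000, 12187, 42000],
    "Traffic Volume": [1289, 4700, 2000, 5300],
})

# Boolean mask: True for rows with more than 30,000 passengers.
busy = sample["Passenger Count"] > 30000

# Average traffic volume over the selected rows only.
avg_volume = sample.loc[busy, "Traffic Volume"].mean()

print(avg_volume)
```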


In [19]:
# Write your code here


#### **Identify the location with the highest and lowest average speed**
Identifying locations with the highest and lowest average speeds provides critical insights into urban traffic dynamics. High speeds might indicate efficient road design, while low speeds could highlight congestion hotspots. Analyzing these extremes helps uncover patterns in traffic flow, infrastructure efficiency, and potential problem areas, ultimately driving improvements in city mobility. Can you pinpoint the locations with the highest and lowest average speeds? Also mention their respective average speeds.

(Output: Location, Average Speed (Highest), Location, Average Speed (Lowest))

In [20]:
# Write your code here


#### **Environmental Correlation**

Environmental factors, such as Air Quality Index (AQI), Noise Levels, and Particulate Matter (PM2.5), often interact with one another in complex ways. These relationships can significantly impact urban living conditions and quality of life. For this task, perform a correlation analysis to explore and uncover potential patterns between AQI, Noise Levels, and PM2.5 levels across various locations.

- **0**: A correlation coefficient of 0 indicates no relationship between the two variables. This means changes in one variable do not systematically affect the other.
- **-1**: A correlation coefficient of -1 indicates a perfect negative relationship. This means as one variable increases, the other variable decreases in a perfectly consistent way.
- **+1**: A correlation coefficient of +1 indicates a perfect positive relationship. This means as one variable increases, the other variable also increases in a perfectly consistent way.

To explore these relationships, perform a correlation analysis using columns that represent key environmental aspects. For example, if the correlation between Traffic Volume and Noise Level is 0.7, it indicates that as traffic volume increases, noise levels tend to rise as well.

(Output: Matrix Table)
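A correlation matrix for a chosen set of columns can be obtained with `DataFrame.corr()`, which computes Pearson correlation by default. The sketch below uses made-up values, so the coefficients it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up sample data.
sample = pd.DataFrame({
    "AQI": [72, 95, 120, 150, 180],
    "Noise Level": [65.4, 70.2, 74.8, 80.1, 85.3],
    "PM2.5 Level": [85.2, 101.0, 130.5, 155.3, 170.7],
})

# Select the environmental columns and compute pairwise correlations.
env_cols = ["AQI", "Noise Level", "PM2.5 Level"]
corr_matrix = sample[env_cols].corr()

print(corr_matrix.round(2))
```

Each cell holds the coefficient for one pair of columns; the diagonal is always 1 because every column correlates perfectly with itself.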

In [21]:
# Write your code here


#### **Group locations by "Time of the Day" and calculate average AQI**
Air quality fluctuates throughout the day due to multiple factors such as traffic patterns, industrial emissions, and weather conditions. Group the locations according to different times of the day, such as morning, afternoon, evening, etc., and compute the average AQI for each time period.

(Output: Time of the Day and respective AQI)
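The grouping step can be sketched with `groupby` followed by `mean()`. The rows below are made up; only the pattern carries over to the real data.

```python
import pandas as pd

# Made-up sample data.
sample = pd.DataFrame({
    "Time of the Day": ["Morning", "Night", "Morning", "Evening", "Night"],
    "AQI": [100, 60, 140, 120, 80],
})

# Group rows by time period and average the AQI within each group.
avg_aqi = sample.groupby("Time of the Day")["AQI"].mean()

print(avg_aqi)
```

The same pattern works for other grouping columns, such as 'Location' or 'Day of the Week'.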

In [22]:
# Write your code here


What might be contributing to these fluctuations in air quality, and how can this information be used to guide effective environmental monitoring and mitigation strategies?

In [23]:
# Write your opinion for the above insight here...


#### **Calculate traffic density for all locations**
Traffic density is a key indicator of congestion and road efficiency. We want to understand how the traffic density changes by location. This metric helps understand how congested or fluid traffic is in various areas.

Create the traffic density metric and identify the locations with the highest and lowest average traffic densities. Consider factors that might be influencing these values, such as road capacity, time of day, or weather conditions.

(Output: Location, Average Traffic Density)
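The dataset description does not define traffic density, so the sketch below makes an explicit assumption: density is approximated as Traffic Volume divided by Average Speed, following the traffic-flow relation flow = density × speed. This is one possible definition, not the official one for this challenge, and the rows are made up.

```python
import pandas as pd

# Made-up sample data.
sample = pd.DataFrame({
    "Location": ["Saket", "Saket", "Lajpat Nagar", "Lajpat Nagar"],
    "Traffic Volume": [1200, 1800, 4700, 4300],
    "Average Speed": [30.0, 20.0, 25.0, 20.0],
})

# ASSUMPTION: density = volume / speed (not defined by the dataset).
# Rows with an Average Speed of 0 would produce infinite values and
# need separate handling.
sample["Traffic Density"] = sample["Traffic Volume"] / sample["Average Speed"]

# Average density per location, highest first.
avg_density = (
    sample.groupby("Location")["Traffic Density"]
    .mean()
    .sort_values(ascending=False)
)

print(avg_density)
print("Highest:", avg_density.idxmax())
print("Lowest:", avg_density.idxmin())
```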

In [24]:
# Write your code here


#### **Determine the most/least active locations by Traffic Volume and Passenger Count**
To evaluate the most and least active locations, examine both traffic volume and passenger count. The most active areas will exhibit high traffic and passenger counts, while the least active areas will show low counts. These extremes offer critical insights into urban mobility dynamics, indicating areas where transportation demand significantly influences daily movement patterns. Such locations are essential targets for strategic infrastructure investments, such as expanding road networks, improving public transit accessibility, or implementing traffic management solutions to enhance mobility efficiency.

(Output: Most active location by traffic volume and its traffic volume, Least active location by traffic volume and its traffic volume, Most active location by passenger count and its traffic volume, Least active location by passenger count and its traffic volume)

In [25]:
# Write your code here


### **Task 3: AI Model**

#### **Task 3.1: Data Preprocessing for Machine Learning**

**3.1.1 Encode Categorical Data**

- Why: Machine learning models typically require numerical inputs. Encoding categorical variables makes them compatible for training.
- What to Do:
  - Identify categorical columns such as 'Time of the Day', 'Day of the Week', or 'Location'.
  - Use encoding techniques like LabelEncoder or OneHotEncoder to convert these into numerical formats.
  - Save the encoders for consistent application during deployment or evaluation.
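The steps above can be sketched with scikit-learn's `LabelEncoder` on made-up data. The encoder is serialized with `pickle` in memory here; writing the bytes to a file of your choice is left as an assumption about your workflow. Note that integer codes imply an ordering between categories, so one-hot encoding may suit some models better.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample data.
sample = pd.DataFrame(
    {"Time of the Day": ["Morning", "Night", "Evening", "Morning"]}
)

# fit_transform learns the categories (sorted alphabetically) and maps
# each one to an integer code.
encoder = LabelEncoder()
sample["Time of the Day Encoded"] = encoder.fit_transform(
    sample["Time of the Day"]
)

# Serialize the fitted encoder so the same mapping can be reused later.
encoder_bytes = pickle.dumps(encoder)
restored = pickle.loads(encoder_bytes)

print(sample)
print(list(restored.classes_))
```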

In [26]:
# Write your code here


In [27]:
# Write your code here


**3.1.2 Scale Feature Columns**

- Why: Scaling numerical features ensures uniform ranges, improving model performance and convergence.
- What to Do:
  - Identify numerical columns like 'Traffic Volume', 'Passenger Count', 'Noise Level', 'Average Speed', 'PM2.5 Level', and 'AQI'.
  - Apply scaling techniques such as StandardScaler.
  - Save the scaler object for consistent preprocessing in future data applications.
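A minimal scaling sketch with `StandardScaler`, using made-up values for two of the numerical columns. After scaling, each column has mean 0 and unit variance. In a full pipeline it is common to fit the scaler on the training subset only; that refinement is omitted here for brevity.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample data.
sample = pd.DataFrame({
    "Traffic Volume": [1289.0, 4700.0, 2500.0, 3100.0],
    "Average Speed": [15.6, 31.3, 22.0, 27.5],
})

numeric_cols = ["Traffic Volume", "Average Speed"]

# fit_transform learns each column's mean and standard deviation,
# then rescales the values.
scaler = StandardScaler()
sample[numeric_cols] = scaler.fit_transform(sample[numeric_cols])

print(sample)
```

The fitted `scaler` object keeps the learned statistics, so calling `scaler.transform` on new data applies the same rescaling.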

In [28]:
# Write your code here


#### **Task 3.2: Preparing the Dataset for Model Training**

**3.2.1 Feature-Target Separation**

Separate the dataset into features and a target:

- Why: Dividing the dataset into features and targets helps the model clearly distinguish between inputs and outputs.
- What to Do:
  - Assign the target column, such as 'Traffic Volume', to the variable y.
  - Use other columns like 'Passenger Count', 'Noise Level', and 'Average Speed' as features X.

In [29]:
# Write your code here


**3.2.2 Split Data for Training and Testing**

- Why: Splitting the dataset evaluates how well the model generalizes to new, unseen data.
- What to Do:
  - Use an 80:20 ratio to split the dataset into training and testing subsets.
  - Ensure both subsets maintain data integrity and distribution.
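An 80:20 split can be sketched with scikit-learn's `train_test_split`. The ten rows are made up, and `random_state=42` is an arbitrary choice that only makes the shuffle reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample data with 10 rows.
sample = pd.DataFrame({
    "Passenger Count": list(range(10000, 20000, 1000)),
    "Average Speed": [15.6, 18.0, 20.5, 22.1, 24.0, 25.5, 27.3, 29.0, 30.2, 31.3],
    "Traffic Volume": [1289, 1500, 1800, 2100, 2400, 2900, 3300, 3800, 4200, 4700],
})

# Features (X) and target (y).
X = sample[["Passenger Count", "Average Speed"]]
y = sample["Traffic Volume"]

# test_size=0.2 keeps 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```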

In [30]:
# Write your code here


**3.2.3 Identify Target Variable Types**

- Why: The type of target variable determines the model type (regression or classification).
- What to Do:
  - Check if 'Traffic Volume' is numerical (for regression) or categorical (for classification).
  - Choose appropriate models accordingly.

In [31]:
# Write your code here


#### **Task 3.3: Training Machine Learning Models**

**3.3.1 Model Selection and Initialization**

- Why: Selecting a suitable model aligns with the dataset and problem type.
- What to Do:
  - Use regression models (e.g., Linear Regression, Random Forest Regressor) for numerical targets like 'Traffic Volume'.
  - Use classification models (e.g., Logistic Regression, Decision Trees, SVC) for categorical targets (if applicable).

**3.3.2 Train the Models**
- Why: Proper training enables models to learn patterns in the dataset.
- What to Do:
  - Train models using the training subset.
  - Monitor for stability and convergence during training.
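As a hedged illustration of the training step, the sketch below fits a `LinearRegression` model on synthetic data generated from a known linear rule, so the learned coefficients can be checked against it. The real dataset will not be this clean, and Linear Regression is only one of the candidate models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data: target = 2*x1 + 3*x2 + 5 (no noise).
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(50, 2))
y_train = 2.0 * X_train[:, 0] + 3.0 * X_train[:, 1] + 5.0

# fit() estimates the coefficients and intercept from the training data.
model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_, model.intercept_)
```

Other scikit-learn regressors, such as `RandomForestRegressor`, follow the same `fit` and `predict` pattern.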

In [32]:
# Write your code here


**3.3.3 Evaluate Model Performance**

- Why: Evaluation metrics assess the model’s accuracy and reliability on unseen data.
- What to Do:
  - For classification models, compute metrics like accuracy, precision, recall, and F1 Score.
  - For regression models, calculate metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² Score.
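To make the regression metrics concrete, the sketch below computes MAE, RMSE, and R² directly from their definitions with NumPy, using made-up true and predicted values. scikit-learn's `sklearn.metrics` module provides ready-made functions for the same quantities.

```python
import numpy as np

# Made-up true values and predictions.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

errors = y_true - y_pred

# MAE: average absolute error.
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error.
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, rmse, r2)
```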

In [33]:
# Write your code here


In [34]:
# Write your code here


### **Task 4: Report Creation**
Report creation is a critical component of any hackathon because it serves as a comprehensive summary of the team's journey, insights, and outcomes. While the technical solutions or models built during the event are essential, a well-crafted report translates complex findings into actionable and understandable information for various stakeholders.

Reports should effectively summarize the findings, methodologies, and insights gained during the hackathon tasks. The report will be a PowerPoint presentation covering the following topics. Here's a suggested structure for the report:

Table of Contents
1. Executive Summary
2. Introduction
3. Methodology and Approach
4. Key Findings
   - Data Preprocessing
   - Exploratory Data Analysis (EDA)
   - AI Task Results
5. Insights and Recommendations
6. Conclusion and Future Work
7. Appendices

Note: For a better understanding of each heading above, you can refer to the Hackathon Handbook.