# TRAFFIC ANALYSIS PROJECT
## BY: Group 5
1. Kevin Muchori
2. Benson Kamau
3. Sally Kinyanjui
4. Breden Mugambi
5. Nancy Chelangat

## Overview:  
The urban mobility and transportation sector is vital to the functioning of modern cities, enabling the efficient movement of people and goods. Within this sector, traffic management and pedestrian safety are crucial components that directly affect the quality of life in urban areas. Effective traffic pattern analysis and prediction can help mitigate congestion, enhance safety, and improve overall urban mobility.
Well-managed traffic minimizes economic losses and improves quality of life, especially for pedestrians.

## Challenges:
Urban towns whose vehicle and pedestrian populations grow every day face several problems. One is traffic congestion: higher traffic volumes bring economic losses through wasted time and fuel, as well as increased pollution. Another key challenge is pedestrian safety, since high pedestrian traffic in urban areas increases the risk of accidents. A third is data collection: gathering accurate, real-time data from various sources is difficult, which in turn makes accurate traffic and pedestrian predictions hard.

## Proposed solutions:
Measures to address some of these challenges include advocating for sustainable urban mobility policies and investing in supportive infrastructure. Machine learning models can be used to analyze and predict traffic patterns and pedestrian crossings at different times of the day. Gathering real-time data on traffic and pedestrian movement would require technology such as IoT devices.

## Conclusion:
The analysis and prediction of traffic congestion levels and pedestrian crossings are essential for enhancing urban mobility and safety. Successful implementation of these solutions can lead to reduced congestion, fewer accidents, and an overall improvement in the quality of urban life.


## Problem Statement:

Urban areas continue to face significant challenges in managing traffic congestion and ensuring pedestrian safety. The changing nature of these areas, together with the increasing volume of both vehicle and pedestrian traffic, makes it hard to predict traffic patterns effectively.

## Objective:
Our primary objective is to create one or more accurate time series models that can model, analyze, and predict traffic congestion levels and pedestrian crossings at different times of the day.

### Specific objectives:
1.  To identify key factors that influence traffic and pedestrian movement
2.  To develop predictive models for forecasting future traffic congestion and pedestrian crossing patterns.
3.  To provide recommendations for urban planners and traffic management authorities to improve traffic flow and pedestrian safety.


## Data Understanding
The data used in this study is sourced from the UC Irvine Machine Learning Repository. It has 4760 rows and 14 features.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

In [2]:
import io

# Load a CSV file and return the DataFrame, its info summary, and its first rows.
def load_and_get_info(file_path, encoding='utf-8'):
    try:
        # Load data
        df = pd.read_csv(file_path, encoding=encoding)
    except UnicodeDecodeError:
        print(f"Failed to decode {file_path} with encoding {encoding}. Trying with 'latin1' encoding.")
        return load_and_get_info(file_path, encoding='latin1')

    # Keep the first few rows of the DataFrame
    df_head = df.head()

    # df.info() prints its summary and returns None, so capture the text in a buffer
    buffer = io.StringIO()
    df.info(buf=buffer)
    df_info = buffer.getvalue()

    return df, df_info, df_head

# Count the DataFrame columns in each data type category.
def check_data_types(df):
    # Convert dtypes to their string names first so that 'object' can be relabelled reliably
    dtype_names = df.dtypes.astype(str).replace({'object': 'string'})
    data_type_counts = dtype_names.value_counts().to_dict()
    return data_type_counts

In [3]:
file_path = '/content/traffic_data1 (1).csv'
#file_path = '/Users/mac/Documents/GitHub/Traffic-analysis-project/traffic_data1 (1).csv'
df1,data_info, data_head = load_and_get_info(file_path)
print(data_info)
print("\nFirst few rows of the DataFrame:")
data_head


The dataset contains the following columns:

- oid: A unique identifier for each object record in the dataset.
- timestamp: The exact time of each record.
- date: The date portion of the timestamp column, giving the day without the time information.
- hour: The hour of the day (0-23), extracted from the timestamp column.
- x: The X-coordinate of each object in our data.
- y: The Y-coordinate of each object in our data.
- vehicle_count: The number of vehicles observed in the vicinity of each object record.
- pedestrian_count: The number of pedestrians observed in the vicinity of each object record.
- congestion_level: The traffic congestion level category at the time of each record.
- weather_condition: The weather condition at the time of each record.
- temperature: The temperature recorded at the time of each record.
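
A quick way to confirm that a loaded DataFrame matches the column descriptions above is to compare its columns against the expected names. The following is a minimal sketch on a toy DataFrame with made-up values; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration and are not part of the project code.

```python
import pandas as pd

# Column names as described in the list above
EXPECTED_COLUMNS = [
    'oid', 'timestamp', 'date', 'hour', 'x', 'y', 'vehicle_count',
    'pedestrian_count', 'congestion_level', 'weather_condition', 'temperature',
]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df, in order."""
    return [col for col in expected if col not in df.columns]

# Toy frame (made-up values) that deliberately lacks the 'temperature' column
toy = pd.DataFrame({col: [0] for col in EXPECTED_COLUMNS if col != 'temperature'})
print(missing_columns(toy))
```

An empty list would indicate that every described column is present.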


## Data Cleaning

### Step 1: Dropping the empty columns

The columns body_roll, body_pitch, body_yaw, head_roll, head_pitch, and head_yaw have no data and therefore will not be useful for our project, so we drop them.

In [None]:
columns_to_drop = ['body_roll', 'body_pitch', 'body_yaw', 'head_roll', 'head_pitch', 'head_yaw']
cleaned_df = df1.drop(columns=columns_to_drop)
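
As an alternative to listing the empty columns by hand, they can be detected programmatically. The sketch below assumes that "empty" means every value in the column is missing, and it uses a toy DataFrame with made-up values rather than the project data.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): 'body_roll' and 'head_yaw' are completely empty
toy = pd.DataFrame({
    'vehicle_count': [3, 5, 2],
    'body_roll': [np.nan, np.nan, np.nan],
    'head_yaw': [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing
empty_columns = toy.columns[toy.isna().all()].tolist()
print(empty_columns)

# Equivalent one-step drop of all columns that contain no data
toy_cleaned = toy.dropna(axis=1, how='all')
print(toy_cleaned.columns.tolist())
```

Detecting the columns this way avoids a `KeyError` if one of the hard-coded names is absent from a different extract of the data.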

### Step 2: Convert the 'date' and 'timestamp' columns to datetime format

In [None]:
cleaned_df['date'] = pd.to_datetime(cleaned_df['date'])
cleaned_df['timestamp'] = pd.to_datetime(cleaned_df['timestamp'])
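
If some timestamp strings might be malformed, `pd.to_datetime` can be asked to convert them to `NaT` instead of raising an error, which makes it easy to count the bad values. This is a minimal sketch on made-up strings; whether the project data contains any malformed timestamps is not known.

```python
import pandas as pd

# Toy data (made-up values) with one unparseable entry
toy = pd.DataFrame({
    'timestamp': ['2023-01-15 08:30:00', 'not a date', '2023-01-16 17:45:00'],
})

# errors='coerce' turns unparseable values into NaT instead of raising
toy['timestamp'] = pd.to_datetime(toy['timestamp'], errors='coerce')

# Count how many values could not be parsed
n_bad = int(toy['timestamp'].isna().sum())
print(n_bad)
```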

### Step 3: Check for and remove duplicate rows



In [None]:
cleaned_df = cleaned_df.drop_duplicates()
cleaned_df.head()
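
Before dropping duplicates, it can be useful to know how many there are. The sketch below counts fully duplicated rows on a toy DataFrame with made-up values; the variable names are illustrative only.

```python
import pandas as pd

# Toy data (made-up values): the second and third rows are identical
toy = pd.DataFrame({'oid': [1, 2, 2, 3], 'vehicle_count': [4, 7, 7, 1]})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# Drop the repeats and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(len(deduplicated))
```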

Next, we break the timestamps down into their components:

In [None]:
import pandas as pd


# Extract date components
cleaned_df['year'] = cleaned_df['timestamp'].dt.year
cleaned_df['month'] = cleaned_df['timestamp'].dt.month

# Create time-based features
cleaned_df['day_of_week'] = cleaned_df['timestamp'].dt.dayofweek
cleaned_df['hour'] = cleaned_df['timestamp'].dt.hour



By breaking the timestamps down into components, we can use our data in more useful ways, for example:
- Comparing traffic levels between weekdays and weekends
- Identifying patterns in traffic congestion at different times of the day.
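
The weekday versus weekend comparison mentioned above could be sketched as follows. This assumes the pandas convention that `dt.dayofweek` is 0 for Monday and 6 for Sunday, and it uses a toy DataFrame with made-up counts; the `day_type` column is a hypothetical helper introduced for illustration.

```python
import pandas as pd

# Toy data (made-up values) covering two weekdays and a weekend
toy = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2023-01-13 08:00',  # Friday
        '2023-01-14 08:00',  # Saturday
        '2023-01-15 08:00',  # Sunday
        '2023-01-16 08:00',  # Monday
    ]),
    'vehicle_count': [40, 10, 20, 60],
})

# Monday=0 ... Sunday=6, so values 5 and 6 are the weekend
toy['day_of_week'] = toy['timestamp'].dt.dayofweek
toy['day_type'] = (toy['day_of_week'] >= 5).map({True: 'weekend', False: 'weekday'})

# Average vehicle count for each day type
mean_by_day_type = toy.groupby('day_type')['vehicle_count'].mean()
print(mean_by_day_type)
```

Grouping by `hour` instead of `day_type` would give the corresponding time-of-day pattern.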

In [None]:
cleaned_df.head(10)

From the output, we see that we now have columns indicating the year, month, and day of the week.

Let us check that filtering by month works, using a small sample DataFrame:

In [None]:
# Sample DataFrame
data = {'timestamp': ['2023-01-15', '2023-02-10', '2023-01-25', '2023-03-05'],
        'value': [10, 20, 15, 25]}
sample_df = pd.DataFrame(data)
sample_df['timestamp'] = pd.to_datetime(sample_df['timestamp'])
sample_df['month'] = sample_df['timestamp'].dt.month

# Filter for January data
january_data = sample_df[sample_df['month'] == 1]

print(january_data)

From the output, we can see that only the rows of the sample DataFrame whose timestamps fall in January (2023-01-15 and 2023-01-25) are returned.

In [None]:
cleaned_df.info()

## Exploratory Data Analysis (EDA)

In [None]:
df = cleaned_df
df.info()

Checking the values in the different columns

Plotting: we create a function that plots a bar graph of how frequently each value occurs in a given column.

In [None]:
def frequency_plotting(df, column_name):
    """
    Plots a bar graph of the frequency distribution for a given column in a DataFrame,
    with values displayed on top of each bar.

    Args:
        df (pandas.DataFrame): The DataFrame containing the data.
        column_name (str): The name of the column to plot the frequency distribution for.
    """

    # Compute the value counts once and print them for reference
    counts = df[column_name].value_counts()
    print(counts)

    # Get values and counts
    x_values = counts.index
    y_values = counts.values

    # Create the bar graph
    plt.figure(figsize=(8, 6))  # Set a reasonable figure size
    bars = plt.bar(x_values, y_values, edgecolor='black')

    # Write each count on top of its bar
    for bar, value in zip(bars, y_values):
        plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), str(value),
                 ha='center', va='bottom')

    # Add labels and title
    plt.xlabel(column_name)
    plt.ylabel('Frequency')
    plt.title(f'Distribution of {column_name}')
    plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
    plt.tight_layout()
    plt.show()


In [None]:
frequency_plotting(df, 'weather_condition')

In [None]:
frequency_plotting(df, 'congestion_level')

In [None]:
frequency_plotting(df, 'temperature')

To do: explore how temperature relates to time.
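
One possible way to look at temperature against time of day is to average the temperature for each hour. This is only a sketch on a toy DataFrame with made-up values; it assumes the `hour` and `temperature` columns described earlier and is not a result from the project data.

```python
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame({
    'hour': [6, 6, 12, 12, 18],
    'temperature': [14.0, 16.0, 24.0, 26.0, 20.0],
})

# Mean temperature for each hour of the day
mean_temp_by_hour = toy.groupby('hour')['temperature'].mean()
print(mean_temp_by_hour)
```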

In [None]:
frequency_plotting(df, 'vehicle_count')

In [None]:
frequency_plotting(df, 'pedestrian_count')

## Modeling

In [None]:
# Necessary libraries for modeling
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import pyplot
from pandas import tseries

import statsmodels.tsa.stattools as sm
from statsmodels.tsa.stattools import adfuller
# ARIMA lives in statsmodels.tsa.arima.model; the older statsmodels.tsa.arima_model
# implementation was deprecated and removed in recent statsmodels releases
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
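
Because the objective is a time series model, the order of the observations matters when holding out test data. scikit-learn's `train_test_split` shuffles rows by default, so for forecasting a chronological split is usually preferable. The following is a minimal sketch on a made-up hourly series; `chronological_split` is a hypothetical helper, not part of the project code, and the 80/20 fraction is an assumption.

```python
import pandas as pd

def chronological_split(series, train_fraction=0.8):
    """Split a time-indexed Series into train and test parts without shuffling."""
    series = series.sort_index()
    split_point = int(len(series) * train_fraction)
    return series.iloc[:split_point], series.iloc[split_point:]

# Toy hourly series (made-up values)
index = pd.date_range('2023-01-01', periods=10, freq=pd.Timedelta(hours=1))
vehicle_counts = pd.Series(range(10), index=index)

train, test = chronological_split(vehicle_counts)
print(len(train), len(test))
```

Every observation in the test part is later than every observation in the training part, which mirrors how a forecast would be used in practice.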