
# AfterWork Data Science: Data Wrangling with Python Project

## 1. Defining the Question

### a) Specifying the Data Analysis Question

> How can the incidence of bus breakdowns in the city of New York be effectively reduced, and what specific recommendations can be made to achieve this goal?

### b) Defining the Metric for Success

How will you know that your solution will have satisfied your research question?

> Reduction in Bus Breakdown Frequency: This can be quantitatively measured by comparing the breakdown rate before and after implementing any recommended solutions.

> Increased Reliability and Punctuality: This can be assessed by monitoring whether buses adhere to their schedules more consistently and whether passengers report fewer delays due to breakdowns.

> Maintenance Costs: A successful solution should ideally lead to a reduction in maintenance expenses as a result of fewer breakdowns and the implementation of cost-effective maintenance strategies.

> Passenger Satisfaction: Gathering feedback from bus passengers through surveys or feedback mechanisms can provide insight into their satisfaction with the bus service.

> Operational Efficiency: Evaluate whether the solutions lead to increased operational efficiency, such as reduced downtime, quicker response times to breakdowns, and optimized routes.
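The first metric can be made concrete as the share of logged incidents that are breakdowns, tracked per school year. The following is a minimal sketch on a toy frame; the rows are invented for illustration, and the column names `School_Year` and `Breakdown_or_Running_Late` are assumed to match the dataset loaded later in this notebook.

```python
import pandas as pd

# Toy incident log (made-up rows); the column names mirror the dataset.
incidents = pd.DataFrame({
    "School_Year": ["2015-2016"] * 4 + ["2016-2017"] * 4,
    "Breakdown_or_Running_Late": [
        "Breakdown", "Running Late", "Running Late", "Breakdown",
        "Running Late", "Running Late", "Running Late", "Breakdown",
    ],
})

# Share of incidents per school year that are breakdowns.
is_breakdown = incidents["Breakdown_or_Running_Late"].eq("Breakdown")
breakdown_rate = is_breakdown.groupby(incidents["School_Year"]).mean()
print(breakdown_rate)
```

This is only a proxy: the dataset records incidents, not the total number of bus runs, so a true breakdown rate per run would need additional data.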


### c) Understanding the context

The dataset provided comes from the Bus Breakdown and Delay system, which
collects information from school bus vendors operating in the field in real time. Bus staff who
encounter delays during the route are instructed to radio the dispatcher at the bus vendor’s
central office. The bus vendor staff log into the Bus Breakdown and Delay system to record the
event and notify OPT (the Office of Pupil Transportation). OPT customer service agents use this system to inform parents who call
with questions regarding bus service. The Bus Breakdown and Delay system is publicly
accessible and contains real-time updates. All information in the system is entered by school bus vendor
staff.

### d) Recording the Experimental Design

> Data Reading and Understanding:
The dataset should ideally include information about the date, time, location, bus route, type of breakdown, and any other relevant variables.

> Data Cleaning:
Clean and preprocess the dataset to handle missing values, outliers, and inconsistencies. Ensure data quality for accurate analysis.

> Exploratory Data Analysis (EDA):
Conduct EDA to gain a deeper understanding of the dataset. Explore descriptive statistics, data distributions, and visualizations to identify patterns, trends, and potential factors contributing to bus breakdowns.

> Solution Recommendations:
Based on the analysis results, propose actionable recommendations to reduce bus breakdowns.

> Validate the recommendations using appropriate validation techniques, such as A/B testing if feasible. Implement some of the recommended strategies and measure their impact on breakdown incidence.

> Summarize your findings and recommendations in a comprehensive report or presentation for stakeholders, decision-makers, and relevant authorities.

### e) Data Relevance

The provided dataset is highly relevant because it directly addresses the research question, offers detailed information, and has the potential to provide valuable insights and support decision-making efforts related to reducing bus breakdowns in New York City.

## 2. Reading the Data

In [None]:
# Importing our libraries
import pandas as pd
import numpy as np

In [None]:
# Load the data below
# Dataset url = https://bit.ly/BusBreakdownDataset
df_BBD = pd.read_csv('https://bit.ly/BusBreakdownDataset')


In [None]:
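# Checking the first 5 rows of data and the column names
# (only the last expression in a cell is displayed, so the output below is the column index)
df_BBD.head()
df_BBD.columns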

Index(['School_Year', 'Busbreakdown_ID', 'Run_Type', 'Bus_No', 'Route_Number',
       'Reason', 'Schools_Serviced', 'Occurred_On', 'Created_On', 'Boro',
       'Bus_Company_Name', 'How_Long_Delayed', 'Number_Of_Students_On_The_Bus',
       'Has_Contractor_Notified_Schools', 'Has_Contractor_Notified_Parents',
       'Have_You_Alerted_OPT', 'Informed_On', 'Incident_Number',
       'Last_Updated_On', 'Breakdown_or_Running_Late', 'School_Age_or_PreK'],
      dtype='object')

In [None]:
# Checking the last 5 rows of data
df_BBD.tail()

Unnamed: 0,School_Year,Busbreakdown_ID,Run_Type,Bus_No,Route_Number,Reason,Schools_Serviced,Occurred_On,Created_On,Boro,...,How_Long_Delayed,Number_Of_Students_On_The_Bus,Has_Contractor_Notified_Schools,Has_Contractor_Notified_Parents,Have_You_Alerted_OPT,Informed_On,Incident_Number,Last_Updated_On,Breakdown_or_Running_Late,School_Age_or_PreK
281105,2016-2017,1338452,Pre-K/EI,9345,2,Heavy Traffic,C530,04/05/2017 08:00:00 AM,04/05/2017 08:10:00 AM,Bronx,...,15-20,7,Yes,Yes,No,04/05/2017 08:10:00 AM,,04/05/2017 08:10:15 AM,Running Late,Pre-K
281106,2016-2017,1341521,Pre-K/EI,0001,5,Heavy Traffic,C579,04/24/2017 07:42:00 AM,04/24/2017 07:44:00 AM,Bronx,...,20 MINS,0,Yes,Yes,No,04/24/2017 07:44:00 AM,,04/24/2017 07:44:15 AM,Running Late,Pre-K
281107,2016-2017,1353044,Special Ed PM Run,GC0112,X928,Heavy Traffic,09003,05/25/2017 04:22:00 PM,05/25/2017 04:28:00 PM,Bronx,...,20-25MINS,0,Yes,Yes,Yes,05/25/2017 04:28:00 PM,90323827.0,05/25/2017 04:34:36 PM,Running Late,School-Age
281108,2016-2017,1353045,Special Ed PM Run,5525D,Q920,Won`t Start,24457,05/25/2017 04:27:00 PM,05/25/2017 04:30:00 PM,Queens,...,,0,Yes,Yes,No,05/25/2017 04:30:00 PM,,05/25/2017 04:30:07 PM,Breakdown,School-Age
281109,2016-2017,1353046,Project Read PM Run,2530,K617,Other,21436,05/25/2017 04:36:00 PM,05/25/2017 04:37:00 PM,Brooklyn,...,45min,7,Yes,Yes,Yes,05/25/2017 04:37:00 PM,,05/25/2017 04:37:37 PM,Running Late,School-Age


In [None]:
# Sample 10 rows of data
df_BBD.sample(10)

Unnamed: 0,School_Year,Busbreakdown_ID,Run_Type,Bus_No,Route_Number,Reason,Schools_Serviced,Occurred_On,Created_On,Boro,...,How_Long_Delayed,Number_Of_Students_On_The_Bus,Has_Contractor_Notified_Schools,Has_Contractor_Notified_Parents,Have_You_Alerted_OPT,Informed_On,Incident_Number,Last_Updated_On,Breakdown_or_Running_Late,School_Age_or_PreK
71170,2016-2017,1317761,Special Ed AM Run,36258,X421,Heavy Traffic,10306,01/23/2017 07:52:00 AM,01/23/2017 07:54:00 AM,Bronx,...,30 mins,3,Yes,Yes,Yes,01/23/2017 07:54:00 AM,,01/23/2017 07:55:16 AM,Running Late,School-Age
7360,2015-2016,1234482,Pre-K/EI,514,1,Other,C069,12/07/2015 07:38:00 AM,12/07/2015 07:38:00 AM,Bronx,...,20 MINS,5,Yes,Yes,No,12/07/2015 07:38:00 AM,,12/07/2015 07:38:49 AM,Running Late,Pre-K
102513,2015-2016,1218062,Special Ed AM Run,10482,X190,Heavy Traffic,11688,09/30/2015 07:13:00 AM,09/30/2015 07:18:00 AM,Bronx,...,15mins,6,Yes,Yes,No,09/30/2015 07:18:00 AM,,09/30/2015 07:18:21 AM,Running Late,School-Age
5513,2015-2016,1236036,Special Ed AM Run,2150,M254,Heavy Traffic,04657,12/11/2015 08:30:00 AM,12/11/2015 08:32:00 AM,Manhattan,...,20 MIN,1,Yes,Yes,Yes,12/11/2015 08:32:00 AM,,12/11/2015 08:32:47 AM,Running Late,School-Age
14019,2015-2016,1248632,Special Ed AM Run,2301,R530,Weather Conditions,31435,02/05/2016 07:21:00 AM,02/05/2016 07:22:00 AM,Staten Island,...,,2,Yes,Yes,Yes,02/05/2016 07:22:00 AM,,02/05/2016 07:22:04 AM,Running Late,School-Age
15542,2015-2016,1250726,General Ed AM Run,1611,K1716,Heavy Traffic,1731617685,02/22/2016 06:05:00 AM,02/22/2016 06:12:00 AM,Brooklyn,...,30min,0,Yes,Yes,No,02/22/2016 06:12:00 AM,,02/22/2016 06:12:33 AM,Running Late,School-Age
174232,2018-2019,1456473,General Ed AM Run,2314,Q3019,Other,3012630359,09/05/2018 06:00:00 AM,09/05/2018 06:42:00 AM,Queens,...,,0,Yes,No,No,09/05/2018 06:42:00 AM,,01/01/1900 12:00:00 AM,Breakdown,School-Age
174265,2017-2018,1402099,General Ed PM Run,476,R9247,Late return from Field Trip,3107331708,01/10/2018 01:30:00 PM,01/10/2018 01:31:00 PM,Staten Island,...,16-30 Min,0,Yes,No,No,01/10/2018 01:31:00 PM,,01/01/1900 12:00:00 AM,Running Late,School-Age
254103,2017-2018,1422341,Special Ed PM Run,16301,X849,Accident,12486,03/19/2018 02:45:00 PM,03/19/2018 02:52:00 PM,Bronx,...,31-45 Min,0,Yes,Yes,No,03/19/2018 02:52:00 PM,,01/01/1900 12:00:00 AM,Running Late,School-Age
208096,2017-2018,1451273,Pre-K/EI,811,5,Heavy Traffic,C195,06/20/2018 07:49:00 AM,06/20/2018 07:51:00 AM,Bronx,...,16-30 Min,10,Yes,Yes,No,06/20/2018 07:51:00 AM,,01/01/1900 12:00:00 AM,Running Late,Pre-K


In [None]:
# Checking number of rows and columns
df_BBD.shape

(281110, 21)

In [None]:
# Checking datatypes
df_BBD.dtypes

School_Year                        object
Busbreakdown_ID                     int64
Run_Type                           object
Bus_No                             object
Route_Number                       object
Reason                             object
Schools_Serviced                   object
Occurred_On                        object
Created_On                         object
Boro                               object
Bus_Company_Name                   object
How_Long_Delayed                   object
Number_Of_Students_On_The_Bus       int64
Has_Contractor_Notified_Schools    object
Has_Contractor_Notified_Parents    object
Have_You_Alerted_OPT               object
Informed_On                        object
Incident_Number                    object
Last_Updated_On                    object
Breakdown_or_Running_Late          object
School_Age_or_PreK                 object
dtype: object

Record your observations below:

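*   Observation 1: There are 281110 rows and 21 columns.
*   Observation 2: Only `Busbreakdown_ID` and `Number_Of_Students_On_The_Bus` are stored as integers (`int64`); every other column, including the date/time fields and `How_Long_Delayed`, is stored as `object` (text).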
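Because the timestamp columns are read as text, they need to be parsed before any time-based analysis. Below is a minimal sketch using two rows in the format seen in the sample output; the derived column name `reporting_lag_min` is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Toy rows in the timestamp format seen in the sample output.
sample = pd.DataFrame({
    "Occurred_On": ["04/05/2017 08:00:00 AM", "05/25/2017 04:22:00 PM"],
    "Created_On": ["04/05/2017 08:10:00 AM", "05/25/2017 04:28:00 PM"],
})

# 12-hour clock with AM/PM marker; unparseable strings become NaT.
fmt = "%m/%d/%Y %I:%M:%S %p"
for col in ["Occurred_On", "Created_On"]:
    sample[col] = pd.to_datetime(sample[col], format=fmt, errors="coerce")

# Minutes between the incident and the moment it was logged.
sample["reporting_lag_min"] = (
    (sample["Created_On"] - sample["Occurred_On"]).dt.total_seconds() / 60
)
print(sample.dtypes)
print(sample["reporting_lag_min"].tolist())
```

Some sampled rows show `01/01/1900 12:00:00 AM` in `Last_Updated_On`, which looks like a placeholder rather than a real timestamp, so that column would probably need extra care after parsing.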



## 3. External Data Source Validation

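For external validation we use the following NYC Open Data resource:

https://data.cityofnewyork.us/resource/c6ph-pcpz.csv

A record in this dataset is marked when a school bus company reported a breakdown or delay and a school also reported an arrival after session time.

This dataset has the following columns: `Id`, `vendorName`, `runType`, `Route`, `Type`, `Delay`, `Reason`, `schools`, `dateOccurred`, `Schools reported`. Several of these (for example `runType`, `Reason` and `Delay`) appear to correspond to `Run_Type`, `Reason` and `How_Long_Delayed` in our dataset, which makes it usable for cross-checking.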


## 4. Data Preparation

### Performing Data Cleaning

In [None]:
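# Checking datatypes and missing entries of all the variables
df_BBD.dtypes
# Is there any missing value anywhere in the dataframe?
df_BBD.isna().values.any()
# Which columns contain at least one missing value?
df_BBD.isna().any()
# Number of missing values per column (this is the output shown below)
df_BBD.isna().sum()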

School_Year                             0
Busbreakdown_ID                         0
Run_Type                                3
Bus_No                                  9
Route_Number                            7
Reason                                  2
Schools_Serviced                        7
Occurred_On                             0
Created_On                              0
Boro                                13461
Bus_Company_Name                        0
How_Long_Delayed                    35608
Number_Of_Students_On_The_Bus           0
Has_Contractor_Notified_Schools         0
Has_Contractor_Notified_Parents         0
Have_You_Alerted_OPT                    0
Informed_On                             0
Incident_Number                    271627
Last_Updated_On                         0
Breakdown_or_Running_Late               0
School_Age_or_PreK                      0
dtype: int64

We observe the following from our dataset:

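*   Observation 1: Thirteen columns have no missing values; `Run_Type`, `Bus_No`, `Route_Number`, `Reason` and `Schools_Serviced` are each missing fewer than ten entries.
*   Observation 2: `Incident_Number` has the most missing values (271627, about 97% of rows), followed by `How_Long_Delayed` (35608) and `Boro` (13461).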
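Raw counts are easier to compare when expressed as a percentage of rows. A minimal sketch on a toy frame with invented values:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing entry.
toy = pd.DataFrame({
    "Boro": ["Bronx", np.nan, "Queens", "Bronx"],
    "How_Long_Delayed": ["20 MINS", np.nan, np.nan, "45min"],
    "Incident_Number": [np.nan, np.nan, np.nan, "90323827"],
})

# Fraction of missing entries per column, as a percentage, largest first.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

The same expression applied to `df_BBD` would rank the columns by how incomplete they are.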



In [None]:
# Standardizing the dataset, i.e. variable renaming
# Strip stray whitespace from the column names, then lower-case them
df_BBD.columns = df_BBD.columns.str.strip()
df_BBD.columns = df_BBD.columns.str.lower()
df_BBD.columns

# Note: the values in 'how_long_delayed' are standardized later, in the cleaning steps,
# by extracting the numeric part of each entry.





Index(['school_year', 'busbreakdown_id', 'run_type', 'bus_no', 'route_number',
       'reason', 'schools_serviced', 'occurred_on', 'created_on', 'boro',
       'bus_company_name', 'how_long_delayed', 'number_of_students_on_the_bus',
       'has_contractor_notified_schools', 'has_contractor_notified_parents',
       'have_you_alerted_opt', 'informed_on', 'incident_number',
       'last_updated_on', 'breakdown_or_running_late', 'school_age_or_prek'],
      dtype='object')

We observe the following from our dataset:

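*   Observation 1: Entries in the 'how_long_delayed' column are not uniform: they mix bare numbers, ranges and differently spelled units (for example `15-20`, `20 MINS`, `45min`, `16-30 Min`).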
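One way to see how many unit spellings are in play is to strip the digits and separators from each entry and count what remains. A minimal sketch, using values in the formats seen in the sample rows earlier in this notebook:

```python
import pandas as pd

# Values in the formats seen in the sample rows earlier in the notebook.
delays = pd.Series(["15-20", "20 MINS", "20-25MINS", "45min", "16-30 Min", None])

# Drop digits, hyphens and spaces, then lower-case what is left
# to see which unit spellings occur.
units = (
    delays.dropna()
    .str.replace(r"[\d\-\s]", "", regex=True)
    .str.lower()
)
unit_counts = units.value_counts()
print(unit_counts)
```

Entries that reduce to an empty string carry no unit at all, so the notebook's later choice to keep only the numeric part treats them as minutes by assumption.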



In [None]:
# Checking how many duplicate rows there are in the data
df_BBD.duplicated().any()
df_BBD.duplicated().sum()

0

We observe the following from our dataset:

*   Observation 1: There are no duplicated rows in the dataset



In [None]:
# Checking if any of the columns are all null
columns_with_all_null = df_BBD.columns[df_BBD.isnull().all()]
len(columns_with_all_null)
# Columns with at least one null value
columns_with_any_null = df_BBD.columns[df_BBD.isnull().any()]
columns_with_any_null
len(columns_with_any_null)

8

We observe the following from our dataset:

*   Observation 1: There is no column with all null values.
*   Observation 2: 8 columns have at least one null value.



In [None]:
# Checking if any of the rows are all null
all_null_rows = df_BBD.isnull().all(axis=1)
all_null_rows
sum(all_null_rows)
any_null_rows = df_BBD.isnull().any(axis=1)
sum(any_null_rows)
# df_BBD.shape

272493

We observe the following from our dataset:

*   Observation 1: There is no row with all null values.
*   Observation 2: 272493 rows have at least one null value, largely because `incident_number` is empty in 271627 records.
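A single sparse column can dominate the "rows with any null" count. The sketch below, on an invented toy frame, recomputes the count with that column set aside:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; incident_number is mostly empty.
toy = pd.DataFrame({
    "boro": ["Bronx", np.nan, "Queens", "Bronx"],
    "how_long_delayed": ["20 MINS", "15", np.nan, "45min"],
    "incident_number": [np.nan, np.nan, np.nan, "90323827"],
})

# Rows with at least one null, with and without the sparse column.
rows_any_null = int(toy.isna().any(axis=1).sum())
rows_any_null_core = int(
    toy.drop(columns="incident_number").isna().any(axis=1).sum()
)
print(rows_any_null, rows_any_null_core)
```

Running the second expression on `df_BBD` would show how many rows are incomplete in the fields that matter for the analysis.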



In [None]:
# Checking if the "Yes/No" fields contain only these 2 values
# for have_you_alerted_opt variable
# ---
# Hint: Use unique() function
df_BBD['have_you_alerted_opt'].unique()

array(['Yes', 'No'], dtype=object)

We observe the following from our dataset:

*   Observation 1: The 'have_you_alerted_opt' column contains only "Yes" and "No" values.



In [None]:
# Checking if the "Yes/No" fields contain only these 2 values
# for has_contractor_notified_parents variable
# ---
df_BBD['has_contractor_notified_parents'].unique()

array(['No', 'Yes'], dtype=object)

We observe the following from our dataset:

*   Observation 1: The 'has_contractor_notified_parents' column contains only "Yes" and "No" values.



In [None]:
# Checking if the "Yes/No" fields contain only these 2 values
# for has_contractor_notified_schools variable
df_BBD['has_contractor_notified_schools'].unique()


array(['Yes', 'No'], dtype=object)

We observe the following from our dataset:

*   Observation 1: The 'has_contractor_notified_schools' column contains only "Yes" and "No" values.
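The three Yes/No checks above can also be run in one pass. This is a minimal sketch on a toy frame; one value is deliberately dirty to show what a failed check looks like.

```python
import pandas as pd

# Toy frame; the last value is deliberately malformed.
toy = pd.DataFrame({
    "have_you_alerted_opt": ["Yes", "No", "No"],
    "has_contractor_notified_parents": ["No", "Yes", "Yes"],
    "has_contractor_notified_schools": ["Yes", "Yes", "yes "],
})

allowed = {"Yes", "No"}

# For each column, list the values that are neither "Yes" nor "No".
unexpected = {
    col: sorted(set(toy[col].dropna().unique()) - allowed)
    for col in toy.columns
}
print(unexpected)
```

An empty list for every column means the field is clean; anything else names the offending values directly.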



In [None]:
# Checking unique values in breakdown_or_running_late variable to ensure there are no duplicated categories
df_BBD['breakdown_or_running_late'].unique()

array(['Running Late', 'Breakdown'], dtype=object)

In [None]:
# Checking unique values in school_age_or_prek variable
df_BBD['school_age_or_prek'].unique()

array(['School-Age', 'Pre-K'], dtype=object)

In [None]:
# Checking unique values in school_year variable
df_BBD['school_year'].unique()

array(['2015-2016', '2017-2018', '2018-2019', '2016-2017', '2019-2020'],
      dtype=object)

In [None]:
# Checking unique values in reason variable
df_BBD['reason'].unique()

array(['Heavy Traffic', 'Flat Tire', 'Other', 'Won`t Start',
       'Mechanical Problem', 'Problem Run', 'Accident',
       'Late return from Field Trip', 'Delayed by School',
       'Weather Conditions', nan], dtype=object)

In [None]:
# Checking unique values in run_type variable
df_BBD['run_type'].unique()

array(['Special Ed AM Run', 'Pre-K/EI', 'General Ed AM Run',
       'General Ed Field Trip', 'Special Ed PM Run', 'General Ed PM Run',
       'Project Read PM Run', 'Special Ed Field Trip',
       'Project Read AM Run', 'Project Read Field Trip', nan],
      dtype=object)

In [None]:
# Checking unique values in boro variable
df_BBD['boro'].unique()

array(['New Jersey', 'Manhattan', 'Bronx', 'Westchester', 'Brooklyn',
       'Rockland County', 'Nassau County', nan, 'Queens', 'Staten Island',
       'Connecticut', 'All Boroughs'], dtype=object)
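The `boro` values include places outside New York City and a catch-all "All Boroughs" label. Since the research question concerns New York City, it may help to flag those records. A minimal sketch, using the values from the `unique()` output above:

```python
import numpy as np
import pandas as pd

# Values taken from the unique() output above.
boro = pd.Series(["New Jersey", "Manhattan", "Bronx", "Westchester", "Brooklyn",
                  "Rockland County", "Nassau County", np.nan, "Queens",
                  "Staten Island", "Connecticut", "All Boroughs"])

nyc_boroughs = {"Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"}

# True where the value is present but is not one of the five boroughs.
outside_nyc = boro.notna() & ~boro.isin(nyc_boroughs)
flagged = boro[outside_nyc].tolist()
print(flagged)
```

Whether to drop, keep or relabel the flagged records is a judgment call that depends on how the recommendations will be used; this notebook keeps them.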

### Overall Data Cleaning Observations
**Missing Values**

- There is a large number of missing values in the field "How_Long_Delayed", which is important to our analysis.
- There is an extremely large number of missing values in the "Incident_Number" field, but this field is not essential to our analysis and cannot be filled in without additional information.

**Error in values**

- "How_Long_Delayed" contains free-text entries with unit strings such as "MINS" or "mins" and ranges of values, which need to be reduced to a single integer value for our analysis.

**Error in Datatypes**

- "How_Long_Delayed" is stored as a string and should be converted to an integer type.

**Error in field names**
- The column name "Boro" should be renamed to "Borough".


### Next Steps: Data Cleaning Steps

**Error in values**

- Extract the first integer value (lowest delay time) in the column "How_Long_Delayed"


**Missing Values**

- Impute the missing values in the field "How_Long_Delayed" with the mean value.


**Error in Datatypes**

- Convert "How_Long_Delayed" to int datatype.



**Error in field names**

- Rename the column "Boro" to "Borough".

In [None]:
# Let's first start by creating a copy of our dataframe.
# We will use this copy as our cleaning copy.
# ---
#
df_clean = df_BBD.copy()
df_clean.head()

Unnamed: 0,school_year,busbreakdown_id,run_type,bus_no,route_number,reason,schools_serviced,occurred_on,created_on,boro,...,how_long_delayed,number_of_students_on_the_bus,has_contractor_notified_schools,has_contractor_notified_parents,have_you_alerted_opt,informed_on,incident_number,last_updated_on,breakdown_or_running_late,school_age_or_prek
0,2015-2016,1227538,Special Ed AM Run,2621,J711,Heavy Traffic,75003,11/05/2015 08:10:00 AM,11/05/2015 08:12:00 AM,New Jersey,...,,11,Yes,No,Yes,11/05/2015 08:12:00 AM,,11/05/2015 08:12:14 AM,Running Late,School-Age
1,2015-2016,1227539,Special Ed AM Run,1260,M351,Heavy Traffic,06716,11/05/2015 08:10:00 AM,11/05/2015 08:12:00 AM,Manhattan,...,20MNS,2,Yes,Yes,No,11/05/2015 08:12:00 AM,,11/05/2015 08:13:34 AM,Running Late,School-Age
2,2015-2016,1227540,Pre-K/EI,418,3,Heavy Traffic,C445,11/05/2015 08:09:00 AM,11/05/2015 08:13:00 AM,Bronx,...,15MIN,8,Yes,Yes,Yes,11/05/2015 08:13:00 AM,,11/05/2015 08:13:22 AM,Running Late,Pre-K
3,2015-2016,1227541,Special Ed AM Run,4522,M271,Heavy Traffic,02699,11/05/2015 08:12:00 AM,11/05/2015 08:14:00 AM,Manhattan,...,15 MIN,6,No,No,No,11/05/2015 08:14:00 AM,,11/05/2015 08:14:04 AM,Running Late,School-Age
4,2015-2016,1227542,Special Ed AM Run,3124,M373,Heavy Traffic,02116,11/05/2015 08:13:00 AM,11/05/2015 08:14:00 AM,Manhattan,...,,6,No,No,No,11/05/2015 08:14:00 AM,,11/05/2015 08:14:08 AM,Running Late,School-Age


In [None]:
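# Then extracting the lowest delay time in the column how_long_delayed from the string.
# The raw string avoids an invalid-escape warning for '\d'; expand=False returns a Series.
df_clean['how_long_delayed'] = df_clean['how_long_delayed'].str.extract(r'(\d+)', expand=False)
df_clean['how_long_delayed'].head()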

0    NaN
1     20
2     15
3     15
4    NaN
Name: how_long_delayed, dtype: object
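Keeping only the first number means a range such as `16-30 Min` is recorded as 16. As an alternative sketch (not the approach used in this notebook), both bounds can be captured and a midpoint computed; the column names `low`, `high` and `midpoint` are hypothetical.

```python
import pandas as pd

# Values in the formats seen in the sample rows earlier in the notebook.
delays = pd.Series(["15-20", "20 MINS", "20-25MINS", "45min", "16-30 Min", None])

# First number is the lower bound; a second number, if any, is the upper bound.
bounds = delays.str.extract(r"(\d+)\D*(\d+)?").apply(pd.to_numeric)
bounds.columns = ["low", "high"]

# Single values have no upper bound, so reuse the lower bound there.
bounds["high"] = bounds["high"].fillna(bounds["low"])
bounds["midpoint"] = bounds[["low", "high"]].mean(axis=1)
print(bounds)
```

Using the lower bound, as the notebook does, systematically understates delays reported as ranges; the midpoint is one possible compromise.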

We impute the null values in 'how_long_delayed' column with mean of the column. This will take a couple of steps...

In [None]:
# We first convert our how_long_delayed to float type to allow for imputation
# we use 'astype()'
df_clean['how_long_delayed'] = df_clean['how_long_delayed'].astype(float)
df_clean.dtypes
df_clean['how_long_delayed'].dtypes


dtype('float64')

In [None]:
# Then compute the column mean that will be used for the imputation
how_long_delayed_mean = df_clean['how_long_delayed'].mean()
how_long_delayed_mean

28.148313074995308

In [None]:
# Record which entries are null before imputation (used in the next cell)
how_long_delayed_null_values = df_clean['how_long_delayed'].isnull()
# Replace the null values with the mean using 'fillna()'
df_clean['how_long_delayed'] = df_clean['how_long_delayed'].fillna(how_long_delayed_mean)
# Then convert how_long_delayed back to integer datatype
# (astype(int) truncates, so the imputed 28.148... becomes 28)
df_clean['how_long_delayed'] = df_clean['how_long_delayed'].astype(int)
df_clean['how_long_delayed'].dtypes


dtype('int64')

In [None]:
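# Count how many entries were null before imputation, using the mask saved in the
# previous cell. This exceeds the 35608 originally missing values because entries
# without any digits also became NaN during extraction.
sum(how_long_delayed_null_values)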

36108
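Mean imputation is sensitive to extreme values, so it can be worth comparing it with the median before committing. The sketch below uses invented delays and keeps a flag for imputed rows; the median variant is shown only as an alternative, not as what this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up delays in minutes with one extreme value and two gaps.
delay = pd.Series([15.0, 20.0, 20.0, 30.0, 240.0, np.nan, np.nan])

# Keep a flag so imputed rows can be excluded from later averages.
was_missing = delay.isna()

# Mean of the observed values is 65.0 here; the median is 20.0.
mean_filled = delay.fillna(delay.mean()).astype(int)
median_filled = delay.fillna(delay.median()).astype(int)

print(mean_filled.tolist())
print(median_filled.tolist())
print(int(was_missing.sum()))
```

With a flag like `was_missing`, later per-reason averages can be computed on observed values only, which avoids pulling every group toward the overall mean.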

In [None]:
# Rename Boro column to Borough
df_clean = df_clean.rename(columns={'boro': 'borough'})
df_clean.columns
df_clean.head()




Unnamed: 0,school_year,busbreakdown_id,run_type,bus_no,route_number,reason,schools_serviced,occurred_on,created_on,borough,...,how_long_delayed,number_of_students_on_the_bus,has_contractor_notified_schools,has_contractor_notified_parents,have_you_alerted_opt,informed_on,incident_number,last_updated_on,breakdown_or_running_late,school_age_or_prek
0,2015-2016,1227538,Special Ed AM Run,2621,J711,Heavy Traffic,75003,11/05/2015 08:10:00 AM,11/05/2015 08:12:00 AM,New Jersey,...,28,11,Yes,No,Yes,11/05/2015 08:12:00 AM,,11/05/2015 08:12:14 AM,Running Late,School-Age
1,2015-2016,1227539,Special Ed AM Run,1260,M351,Heavy Traffic,06716,11/05/2015 08:10:00 AM,11/05/2015 08:12:00 AM,Manhattan,...,20,2,Yes,Yes,No,11/05/2015 08:12:00 AM,,11/05/2015 08:13:34 AM,Running Late,School-Age
2,2015-2016,1227540,Pre-K/EI,418,3,Heavy Traffic,C445,11/05/2015 08:09:00 AM,11/05/2015 08:13:00 AM,Bronx,...,15,8,Yes,Yes,Yes,11/05/2015 08:13:00 AM,,11/05/2015 08:13:22 AM,Running Late,Pre-K
3,2015-2016,1227541,Special Ed AM Run,4522,M271,Heavy Traffic,02699,11/05/2015 08:12:00 AM,11/05/2015 08:14:00 AM,Manhattan,...,15,6,No,No,No,11/05/2015 08:14:00 AM,,11/05/2015 08:14:04 AM,Running Late,School-Age
4,2015-2016,1227542,Special Ed AM Run,3124,M373,Heavy Traffic,02116,11/05/2015 08:13:00 AM,11/05/2015 08:14:00 AM,Manhattan,...,28,6,No,No,No,11/05/2015 08:14:00 AM,,11/05/2015 08:14:08 AM,Running Late,School-Age


In [None]:
# Lastly we convert the values in our categorical text columns to lower case
# for consistency and ease of reading
text_columns = ['run_type', 'reason', 'borough', 'school_age_or_prek',
                'breakdown_or_running_late', 'have_you_alerted_opt',
                'has_contractor_notified_parents', 'has_contractor_notified_schools']
df_clean[text_columns] = df_clean[text_columns].apply(lambda x: x.str.lower())

In [None]:
# Check the first 5 records of the cleaned dataset
# ---
#
df_clean.head()

Unnamed: 0,school_year,busbreakdown_id,run_type,bus_no,route_number,reason,schools_serviced,occurred_on,created_on,borough,...,how_long_delayed,number_of_students_on_the_bus,has_contractor_notified_schools,has_contractor_notified_parents,have_you_alerted_opt,informed_on,incident_number,last_updated_on,breakdown_or_running_late,school_age_or_prek
0,2015-2016,1227538,special ed am run,2621,J711,heavy traffic,75003,11/05/2015 08:10:00 AM,11/05/2015 08:12:00 AM,new jersey,...,28,11,yes,no,yes,11/05/2015 08:12:00 AM,,11/05/2015 08:12:14 AM,running late,school-age
1,2015-2016,1227539,special ed am run,1260,M351,heavy traffic,06716,11/05/2015 08:10:00 AM,11/05/2015 08:12:00 AM,manhattan,...,20,2,yes,yes,no,11/05/2015 08:12:00 AM,,11/05/2015 08:13:34 AM,running late,school-age
2,2015-2016,1227540,pre-k/ei,418,3,heavy traffic,C445,11/05/2015 08:09:00 AM,11/05/2015 08:13:00 AM,bronx,...,15,8,yes,yes,yes,11/05/2015 08:13:00 AM,,11/05/2015 08:13:22 AM,running late,pre-k
3,2015-2016,1227541,special ed am run,4522,M271,heavy traffic,02699,11/05/2015 08:12:00 AM,11/05/2015 08:14:00 AM,manhattan,...,15,6,no,no,no,11/05/2015 08:14:00 AM,,11/05/2015 08:14:04 AM,running late,school-age
4,2015-2016,1227542,special ed am run,3124,M373,heavy traffic,02116,11/05/2015 08:13:00 AM,11/05/2015 08:14:00 AM,manhattan,...,28,6,no,no,no,11/05/2015 08:14:00 AM,,11/05/2015 08:14:08 AM,running late,school-age



## 5. Solution Implementation

Here we investigate the questions that would help craft our recommendations.

### 5.a) Questions

In [None]:
# 1. Which bus companies had the most breakdowns?
# ---
# Keep only the breakdown records, then count rows per company
breakdowns = df_clean[df_clean.breakdown_or_running_late == 'breakdown'].groupby(['bus_company_name']).count()

# Sort to get the bus company with the most breakdowns
# ---
breakdowns.sort_values(by='breakdown_or_running_late', ascending=False)

bus_company_name,school_year,busbreakdown_id,run_type,bus_no,route_number,reason,schools_serviced,occurred_on,created_on,borough,how_long_delayed,number_of_students_on_the_bus,has_contractor_notified_schools,has_contractor_notified_parents,have_you_alerted_opt,informed_on,incident_number,last_updated_on,breakdown_or_running_late,school_age_or_prek
LITTLE RICHIE BUS SERVICE,5389,5389,5389,5389,5389,5389,5389,5389,5389,5384,5389,5389,5389,5389,5389,5389,3,5389,5389,5389
LOGAN BUS COMPANY INC.,2928,2928,2927,2928,2928,2928,2928,2928,2928,2761,2928,2928,2928,2928,2928,2928,2,2928,2928,2928
"RELIANT TRANS, INC. (B232",2063,2063,2063,2063,2063,2063,2063,2063,2063,1905,2063,2063,2063,2063,2063,2063,0,2063,2063,2063
LITTLE LISA BUS CO. INC.,2042,2042,2042,2042,2042,2042,2042,2042,2042,2042,2042,2042,2042,2042,2042,2042,0,2042,2042,2042
"GRANDPA`S BUS CO., INC.",1496,1496,1496,1496,1496,1496,1496,1496,1496,1496,1496,1496,1496,1496,1496,1496,3,1496,1496,1496
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SELBY TRANS CORP. (B2192),1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1
"R & C TRANSIT, INC. (B2321)",1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
L&M Bus Corp.,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1
ALL COUNTY BUS LLC (B2321),1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1
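`groupby(...).count()` repeats the same count in every column. When only the number of breakdowns per company is needed, `value_counts()` on the filtered column gives a compact, already sorted result. A minimal sketch with invented company names:

```python
import pandas as pd

# Toy records; the company names are invented for illustration.
toy = pd.DataFrame({
    "bus_company_name": ["A BUS CO.", "A BUS CO.", "B TRANSIT",
                         "A BUS CO.", "B TRANSIT"],
    "breakdown_or_running_late": ["breakdown", "running late", "breakdown",
                                  "breakdown", "running late"],
})

# Filter to breakdown records, then count rows per company (sorted descending).
breakdown_counts = (
    toy.loc[toy["breakdown_or_running_late"] == "breakdown", "bus_company_name"]
    .value_counts()
)
print(breakdown_counts.head(3))
```

Raw counts favor large operators, so a company with many buses will rank high regardless of reliability; fleet sizes, which this dataset does not contain, would be needed for a fair comparison.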


In [None]:
# 2. What were the top 3 reasons for bus delays?
# ---
# Count the records per reason
bus_delay = df_clean.groupby(['reason']).count()[['how_long_delayed']].reset_index()

# Sort to get the most frequent reason
# ---
bus_delay.sort_values(by='how_long_delayed', ascending=False)



Unnamed: 0,reason,how_long_delayed
3,heavy traffic,173221
6,other,37579
5,mechanical problem,28162
9,won`t start,12283
2,flat tire,8411
8,weather conditions,6932
4,late return from field trip,5704
7,problem run,4068
0,accident,2526
1,delayed by school,2222
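To read the table above as shares rather than raw counts, the counts can be normalized. The sketch below copies the counts from the output and picks the top three:

```python
import pandas as pd

# Counts copied from the output above.
reason_counts = pd.Series({
    "heavy traffic": 173221,
    "other": 37579,
    "mechanical problem": 28162,
    "won`t start": 12283,
    "flat tire": 8411,
    "weather conditions": 6932,
    "late return from field trip": 5704,
    "problem run": 4068,
    "accident": 2526,
    "delayed by school": 2222,
})

# Three most frequent reasons and each reason's share of all records, in percent.
top3 = reason_counts.nlargest(3)
share_pct = (reason_counts / reason_counts.sum() * 100).round(1)
print(top3)
print(share_pct.head(3))
```

On the full dataframe, `df_clean['reason'].value_counts(normalize=True)` would produce the same shares directly.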


In [None]:
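# 3. How many students were in the buses when they broke down?
# ---
# Note: this counts records per number of students across all incidents
# (breakdowns and delays); values in the thousands are likely data-entry errors.
students_in_the_bus = df_clean.groupby(['number_of_students_on_the_bus']).count()[['busbreakdown_id']].reset_index()
students_in_the_bus.sort_values(by='number_of_students_on_the_bus', ascending=False)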

Unnamed: 0,number_of_students_on_the_bus,busbreakdown_id
223,9052,1
222,9007,1
221,8547,1
220,6219,1
219,6209,1
...,...,...
4,4,12248
3,3,14837
2,2,16737
1,1,15704


In [None]:
# 4. Which were the most frequent reasons for bus breakdowns?
# ---
# Keep only the breakdown records, then count rows per reason
breakdown_reasons = df_clean[df_clean.breakdown_or_running_late == 'breakdown'].groupby(['reason']).count()

# Sort to get most frequent reasons
# ---
breakdown_reasons.sort_values(by='breakdown_or_running_late', ascending=False)

reason,school_year,busbreakdown_id,run_type,bus_no,route_number,schools_serviced,occurred_on,created_on,borough,bus_company_name,how_long_delayed,number_of_students_on_the_bus,has_contractor_notified_schools,has_contractor_notified_parents,have_you_alerted_opt,informed_on,incident_number,last_updated_on,breakdown_or_running_late,school_age_or_prek
mechanical problem,14985,14985,14985,14985,14985,14985,14985,14985,14475,14985,14985,14985,14985,14985,14985,14985,405,14985,14985,14985
won`t start,7731,7731,7731,7731,7731,7731,7731,7731,7501,7731,7731,7731,7731,7731,7731,7731,114,7731,7731,7731
flat tire,4038,4038,4037,4038,4038,4038,4038,4038,3870,4038,4038,4038,4038,4038,4038,4038,104,4038,4038,4038
other,3530,3530,3530,3530,3530,3530,3530,3530,3428,3530,3530,3530,3530,3530,3530,3530,69,3530,3530,3530
heavy traffic,419,419,419,419,419,419,419,419,392,419,419,419,419,419,419,419,11,419,419,419
accident,202,202,202,202,202,202,202,202,192,202,202,202,202,202,202,202,28,202,202,202
weather conditions,82,82,82,82,82,82,82,82,81,82,82,82,82,82,82,82,1,82,82,82
late return from field trip,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,0,35,35,35
problem run,28,28,28,28,28,28,28,28,25,28,28,28,28,28,28,28,1,28,28,28
delayed by school,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,1,7,7,7


In [None]:
# 5. What were the most frequent reasons for the bus running late?
# ---
# Get the records with running late reasons and sort to get most frequent reasons
# ---
running_late_reasons = df_clean[df_clean.breakdown_or_running_late == 'running late'].groupby(['reason']).count()
running_late_reasons.sort_values(by='breakdown_or_running_late', ascending=False)




reason,school_year,busbreakdown_id,run_type,bus_no,route_number,schools_serviced,occurred_on,created_on,borough,bus_company_name,how_long_delayed,number_of_students_on_the_bus,has_contractor_notified_schools,has_contractor_notified_parents,have_you_alerted_opt,informed_on,incident_number,last_updated_on,breakdown_or_running_late,school_age_or_prek
heavy traffic,172802,172802,172801,172798,172799,172799,172802,172802,163209,172802,172802,172802,172802,172802,172802,172802,6320,172802,172802,172802
other,34049,34049,34049,34046,34047,34047,34049,34049,33059,34049,34049,34049,34049,34049,34049,34049,773,34049,34049,34049
mechanical problem,13177,13177,13177,13177,13177,13177,13177,13177,12529,13177,13177,13177,13177,13177,13177,13177,189,13177,13177,13177
weather conditions,6850,6850,6850,6850,6850,6850,6850,6850,6509,6850,6850,6850,6850,6850,6850,6850,245,6850,6850,6850
late return from field trip,5669,5669,5668,5669,5669,5669,5669,5669,5609,5669,5669,5669,5669,5669,5669,5669,226,5669,5669,5669
won`t start,4552,4552,4552,4552,4552,4552,4552,4552,4271,4552,4552,4552,4552,4552,4552,4552,127,4552,4552,4552
flat tire,4373,4373,4373,4373,4373,4373,4373,4373,4259,4373,4373,4373,4373,4373,4373,4373,198,4373,4373,4373
problem run,4040,4040,4040,4040,4040,4040,4040,4040,3959,4040,4040,4040,4040,4040,4040,4040,81,4040,4040,4040
accident,2324,2324,2324,2323,2324,2324,2324,2324,2245,2324,2324,2324,2324,2324,2324,2324,523,2324,2324,2324
delayed by school,2215,2215,2215,2214,2213,2213,2215,2215,1992,2215,2215,2215,2215,2215,2215,2215,67,2215,2215,2215
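
The table above repeats nearly the same number in every column because `count()` tallies non-null values column by column. A more compact alternative is to count a single column with `value_counts()`. The sketch below runs on a small made-up frame: the rows in `toy` are hypothetical, and only the two column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for df_clean; only the column names
# come from the real dataset.
toy = pd.DataFrame({
    "breakdown_or_running_late": ["running late", "running late", "breakdown",
                                  "running late", "breakdown"],
    "reason": ["heavy traffic", "heavy traffic", "flat tire",
               "other", "mechanical problem"],
})

# Keep only the "running late" rows, then count each reason once.
late = toy[toy["breakdown_or_running_late"] == "running late"]
late_reason_counts = late["reason"].value_counts()  # sorted, most frequent first
print(late_reason_counts)
```

The result is a single Series of counts per reason, which is easier to read and to plot than the full column-by-column table.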


In [None]:
# 6. What was the average delay time for each reason type?
# ---
#
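# numeric_only=True restricts the mean to numeric columns and avoids the pandas warning
avg_delay = df_clean.groupby('reason').mean(numeric_only=True).reset_index()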

# Get the records with reasons and how long on average a delay took then sort
# ---
# YOUR CODE GOES BELOW
avg_delay.sort_values(by='how_long_delayed', ascending=False)



index,reason,busbreakdown_id,how_long_delayed,number_of_students_on_the_bus
4,late return from field trip,1351422.0,62.235975,4.396739
0,accident,1354223.0,34.965162,6.305622
7,problem run,1351041.0,31.136676,6.40708
5,mechanical problem,1360663.0,30.915205,1.552376
9,won`t start,1346269.0,29.644061,0.920215
2,flat tire,1355114.0,29.392343,1.720723
8,weather conditions,1333248.0,29.227351,2.378823
6,other,1343447.0,28.421006,2.737965
3,heavy traffic,1360074.0,26.22555,3.926447
1,delayed by school,1334782.0,18.907741,5.882088
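
Averaging every numeric column also averages `busbreakdown_id`, which is an identifier and has no meaningful mean. One possible refinement is to select `how_long_delayed` before aggregating and to report the record count next to the mean, so a reader can see how many records each average rests on. This is a minimal sketch on hypothetical toy values, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical toy data; the column names match the dataset, the values do not.
toy = pd.DataFrame({
    "reason": ["accident", "accident", "flat tire", "flat tire", "flat tire"],
    "how_long_delayed": [30.0, 40.0, 20.0, 25.0, 30.0],
})

# Mean delay and number of records per reason, longest average delay first.
delay_summary = (
    toy.groupby("reason")["how_long_delayed"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(delay_summary)
```

On the real data the same pattern would be applied to `df_clean` in place of `toy`.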


### 5.b) Recommendations

From the above analysis, below are our recommendations:

1. Late returns from field trips cause the longest delays, averaging about 62 on `how_long_delayed` compared with about 35 for the next reason (accident). The drivers should be advised to keep time when going on field trips.
2. Heavy traffic and mechanical problems are frequent reasons for delays. Buses should be serviced more often, and drivers should be encouraged to use routes that are less busy.
3. LITTLE RICHIE BUS SERVICE had the highest number of breakdowns. The schools should set clear performance metrics, and the company should ensure that all its vehicles are regularly maintained and inspected.
4. Mechanical problems, buses that won't start, and flat tires were the most common reasons for breakdowns. I would recommend the following:

> Regular Maintenance: Implement a strict maintenance schedule for all vehicles in the fleet. Ensure that engines, batteries, and electrical systems are routinely inspected and serviced to prevent starting issues.

> Driver Training: Provide drivers with training on avoiding tire damage. This includes avoiding potholes, curbs, and other road hazards.





## 6. Challenging your Solution

During this step, we review our solution and try approaches that could provide a better outcome. In our case, we pose the following question, which was left out of our solution because it would not have contributed much to our recommendations.

In [None]:
# Which boroughs experienced the most breakdowns?
# ---
#
# Count breakdown records per borough
breakdown_boroughs = df_clean[df_clean.breakdown_or_running_late == 'breakdown'].groupby('borough').count()

# Sort to get the boroughs with the most breakdowns
# ---
# YOUR CODE GOES BELOW
breakdown_boroughs.sort_values(by='breakdown_or_running_late', ascending=False)
# breakdown_boroughs.sort_values(by='breakdown_or_running_late', ascending=True)  # fewest first

borough,school_year,busbreakdown_id,run_type,bus_no,route_number,reason,schools_serviced,occurred_on,created_on,bus_company_name,how_long_delayed,number_of_students_on_the_bus,has_contractor_notified_schools,has_contractor_notified_parents,have_you_alerted_opt,informed_on,incident_number,last_updated_on,breakdown_or_running_late,school_age_or_prek
bronx,9287,9287,9287,9287,9287,9287,9287,9287,9287,9287,9287,9287,9287,9287,9287,9287,223,9287,9287,9287
queens,9118,9118,9117,9118,9118,9118,9118,9118,9118,9118,9118,9118,9118,9118,9118,9118,105,9118,9118,9118
brooklyn,6826,6826,6826,6826,6826,6826,6826,6826,6826,6826,6826,6826,6826,6826,6826,6826,232,6826,6826,6826
manhattan,3359,3359,3359,3359,3359,3359,3359,3359,3359,3359,3359,3359,3359,3359,3359,3359,128,3359,3359,3359
nassau county,685,685,685,685,685,685,685,685,685,685,685,685,685,685,685,685,15,685,685,685
staten island,454,454,454,454,454,454,454,454,454,454,454,454,454,454,454,454,10,454,454,454
westchester,139,139,139,139,139,139,139,139,139,139,139,139,139,139,139,139,12,139,139,139
new jersey,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,1,66,66,66
rockland county,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,2,42,42,42
all boroughs,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,0,20,20,20
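
Raw counts make it hard to judge how concentrated the breakdowns are. A possible follow-up is to express each borough as a share of all breakdowns, which `value_counts(normalize=True)` does directly. The frame below is hypothetical toy data that only reuses the dataset's column names.

```python
import pandas as pd

# Hypothetical toy data standing in for df_clean.
toy = pd.DataFrame({
    "breakdown_or_running_late": ["breakdown"] * 4 + ["running late"],
    "borough": ["bronx", "bronx", "queens", "brooklyn", "bronx"],
})

# Keep breakdown records only, then compute each borough's share of them.
breakdowns_only = toy[toy["breakdown_or_running_late"] == "breakdown"]
borough_share = breakdowns_only["borough"].value_counts(normalize=True)
print(borough_share)
```

The shares sum to 1, so they can be read as fractions of all breakdown records that have a borough recorded.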


Our observations:

1. The Bronx experienced the most breakdowns (9,287).
2. Queens followed closely with 9,118 breakdowns, and Brooklyn was third with 6,826.


How does this observation tie to our solution?

Connecticut, which does not appear among the boroughs with the most breakdowns listed above, probably has good administration and the best bus vendors.



## 7. Follow up questions

During this step, you rethink and propose other ways that you can improve your solution.

### a). Did we have the right data?

Yes

### b). Do we need other data to answer our question?

You can look into the questions you brainstormed that weren't taken into account during the analysis due to a lack of data. Were those questions too important to have been left out of your analysis?

### c). Did we have the right question?

Yes

Were there any other questions that we needed to have answered?