In [49]:
import pandas as pd 
import os
import seaborn as sns

1. Reading and combining data
    Load all 9 CSV files into a list.
    Concatenate the files into a single DataFrame, named vehicles_df.

In [50]:
folder_path= './PassengerVehicle_Stats'
dfs=[]
files= [f for f in os.listdir(folder_path) if f.endswith('.csv')]

for file in files:
    file_path = os.path.join(folder_path,file)
    df1 = pd.read_csv(file_path)
    dfs.append(df1)

    Concatenate the files into a single DataFrame, named vehicles_df.

In [51]:
# Stack the per-file frames and renumber the index from 0
vehicles_df = pd.concat(dfs, ignore_index=True)
# Keep a copy of the combined data on disk
vehicles_df.to_csv('./vehicles_df.csv', index=False)
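
The load-and-combine step above can also be sketched as one small helper. This is only a minimal sketch, assuming every CSV in the folder shares the same columns. The function name `load_csv_folder` and the `source_file` column are hypothetical and not part of the original notebook. File names are sorted because the order returned by a directory listing is not guaranteed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Read every .csv file in `folder` and stack them into one DataFrame."""
    frames = []
    # sorted() gives a stable order; directory listing order is arbitrary
    for path in sorted(Path(folder).glob('*.csv')):
        frame = pd.read_csv(path)
        frame['source_file'] = path.name  # hypothetical provenance column
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two throwaway files (made-up data)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'Record ID': ['1Taxi', '2Taxi']}).to_csv(Path(tmp) / 'a.csv', index=False)
    pd.DataFrame({'Record ID': ['3Pedicab']}).to_csv(Path(tmp) / 'b.csv', index=False)
    demo_df = load_csv_folder(tmp)

print(demo_df)
```

Keeping the source file name per row makes it possible to trace a suspicious record back to the file it came from.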



2. Initial data exploration and cleaning
    Examine the DataFrame structure, including its features and data types.
    

In [52]:
# info() prints its own report and returns None, so it is not wrapped in print()
vehicles_df.info()
print(vehicles_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66408 entries, 0 to 66407
Data columns (total 17 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Unnamed: 0                          66408 non-null  int64  
 1   Public Vehicle Number               66408 non-null  int64  
 2   Status                              66408 non-null  object 
 3   Vehicle Make                        58740 non-null  object 
 4   Vehicle Model                       58556 non-null  object 
 5   Vehicle Model Year                  58640 non-null  float64
 6   Vehicle Color                       58464 non-null  object 
 7   Vehicle Fuel Source                 66408 non-null  object 
 8   Wheelchair Accessible               66408 non-null  object 
 9   Company Name                        66408 non-null  object 
 10  Address                             59264 non-null  object 
 11  City                                59264

Remove any duplicate records.

In [53]:
vehicles_df_cleaned = vehicles_df.drop_duplicates(subset='Record ID', keep='first')

In [54]:
vehicles_df_cleaned.shape

(15667, 17)
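
Before dropping rows it can help to count what will be removed. The sketch below uses made-up rows to show the difference between rows that repeat a `Record ID` and rows that are identical in every column. Deduplicating on `Record ID` alone, as in the cell above, also discards rows that share the key but differ elsewhere (here, in `Status`).

```python
import pandas as pd

# Made-up rows; 'Record ID' mirrors the key column used in the notebook
toy = pd.DataFrame({
    'Record ID': ['12009Charter Sightseeing', '12009Charter Sightseeing', '166Pedicab'],
    'Status': ['RESERVED', 'INACTIVE', 'ACTIVE'],
})

# Rows that repeat an earlier Record ID (the first occurrence is not flagged)
repeat_mask = toy.duplicated(subset='Record ID', keep='first')
n_repeats = int(repeat_mask.sum())

# Rows identical in every column, for comparison with key-only duplicates
n_exact = int(toy.duplicated().sum())

# .copy() makes the result an independent frame that is safe to modify later
deduped = toy.drop_duplicates(subset='Record ID', keep='first').copy()
print(n_repeats, n_exact, deduped.shape)
```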

Remove null records only if it is required. Provide reasons for your decision.

In [55]:
columns_to_check = ['Address', 'City', 'State', 'ZIP Code']

# Drop rows missing any location field; reassigning avoids calling inplace=True on a filtered frame
vehicles_df_cleaned = vehicles_df_cleaned.dropna(subset=columns_to_check)
vehicles_df_cleaned.isnull().sum()


Unnamed: 0                               0
Public Vehicle Number                    0
Status                                   0
Vehicle Make                           128
Vehicle Model                          171
Vehicle Model Year                     153
Vehicle Color                          195
Vehicle Fuel Source                      0
Wheelchair Accessible                    0
Company Name                             0
Address                                  0
City                                     0
State                                    0
ZIP Code                                 0
Taxi Affiliation                      7016
Taxi Medallion License Management     7042
Record ID                                0
dtype: int64
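
The raw counts above are easier to judge as a share of all rows. The following is a minimal sketch on made-up data, not the notebook's output, that puts the count and the fraction of missing values side by side, which can support a decision to drop or to impute a column.

```python
import pandas as pd

# Made-up frame with a few gaps
toy = pd.DataFrame({
    'Vehicle Make': ['FORD', None, 'TOYOTA', 'FORD'],
    'City': ['CHICAGO', 'CHICAGO', None, 'CHICAGO'],
    'Taxi Affiliation': [None, None, None, 'X'],
})

missing_summary = pd.DataFrame({
    'missing': toy.isnull().sum(),           # number of null cells per column
    'share': toy.isnull().mean().round(2),   # fraction of rows that are null
}).sort_values('share', ascending=False)

print(missing_summary)
```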

In [56]:
# Fill missing model years with 0 as a placeholder, then store the column as integers.
# Reassigning the whole frame avoids chained inplace calls on a column.
vehicles_df_cleaned = vehicles_df_cleaned.fillna({'Vehicle Model Year': 0}).astype({'Vehicle Model Year': int})

In [57]:
# Label the remaining missing values (text columns) as "unknown"
vehicles_df_cleaned = vehicles_df_cleaned.fillna("unknown")


In [58]:
vehicles_df_cleaned['Vehicle Model Year'].describe()

count    13973.000000
mean      1990.866671
std        224.386734
min          0.000000
25%       2013.000000
50%       2016.000000
75%       2021.000000
max       2025.000000
Name: Vehicle Model Year, dtype: float64
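
The minimum of 0 and the low mean in the summary above come from the zeros used as a placeholder for missing years. As a possible alternative, not what the notebook does, pandas' nullable `Int64` dtype keeps integers while leaving missing values as `<NA>`, so they do not distort statistics. A small sketch on made-up values:

```python
import pandas as pd

# Made-up years with one gap
years = pd.Series([2014.0, None, 2008.0], name='Vehicle Model Year')

# Placeholder approach: the missing year becomes 0 and pulls the mean down
with_sentinel = years.fillna(0).astype(int)

# Alternative: nullable integer dtype keeps the gap as <NA>
nullable = years.astype('Int64')

print(with_sentinel.mean())
print(nullable.mean())  # missing values are skipped by default
```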

3. Handle outliers and missing values
    Perform outlier removal and missing value imputation only if necessary.
    State the reason for any such actions (you can state the reasons within the notebook).

In [59]:

filtered_df_cleaned = vehicles_df_cleaned[vehicles_df_cleaned['Vehicle Model Year']> 1900]
filtered_df_cleaned.shape

(13797, 17)

In [60]:
filtered_df_cleaned['Vehicle Model Year'].describe()

count    13797.000000
mean      2016.175763
std          5.510874
min       1980.000000
25%       2013.000000
50%       2016.000000
75%       2021.000000
max       2025.000000
Name: Vehicle Model Year, dtype: float64
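
The year filter above could be wrapped in a helper that also reports how many rows it removes, which makes the reason for the step easier to document. This is a sketch under assumptions: `keep_plausible_years` is a hypothetical name, and the upper bound of 2025 is an assumption taken from the maximum seen in the summary, whereas the notebook only applies the lower bound.

```python
import pandas as pd


def keep_plausible_years(df, column='Vehicle Model Year', low=1900, high=2025):
    """Return rows with low < year <= high and the number of rows removed."""
    mask = (df[column] > low) & (df[column] <= high)
    # .copy() returns an independent frame, so later column edits are safe
    return df.loc[mask].copy(), int((~mask).sum())


# Made-up years: 0 stands for a filled-in missing value, 2099 for a typo
toy = pd.DataFrame({'Vehicle Model Year': [0, 1980, 2016, 2025, 2099]})
kept, n_removed = keep_plausible_years(toy)
print(kept['Vehicle Model Year'].tolist(), n_removed)
```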

4. Adding new columns to the DataFrame
 Vehicle Type
    A string column indicating the type of the public passenger vehicle.
    Hint: Extract this information from the “Record ID” column. It is a combination of vehicle type and the public vehicle number.

In [61]:
filtered_df_cleaned['Record ID'].unique()

array(['12009Charter Sightseeing', '12248Charter Sightseeing',
       '13527Charter Sightseeing', ..., '166Pedicab', '42Pedicab',
       '117Pedicab'], dtype=object)
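
The values above follow the pattern of a vehicle number followed by the vehicle type. A minimal sketch, using three of the IDs shown above, of how a regular expression with named groups can split both parts at once (the group names `number` and `vehicle_type` are hypothetical):

```python
import pandas as pd

record_ids = pd.Series(['12009Charter Sightseeing', '166Pedicab', '42Pedicab'])

# ^(\d+) captures the leading digits, (\D+)$ the trailing non-digit text;
# named groups become the column names of the result
parts = record_ids.str.extract(r'^(?P<number>\d+)(?P<vehicle_type>\D+)$')
print(parts)
```

Anchoring the pattern at both ends means an ID that does not match the expected layout yields missing values instead of a partial match.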

In [62]:
# Work on an explicit copy so the new column is not written to a slice of vehicles_df_cleaned
filtered_df_cleaned = filtered_df_cleaned.copy()
# Record ID is "<vehicle number><vehicle type>"; capture the trailing run of non-digits
filtered_df_cleaned['Vehicle Type'] = filtered_df_cleaned['Record ID'].str.extract(r'(\D+)$')[0]

filtered_df_cleaned.head()


Unnamed: 0.1,Unnamed: 0,Public Vehicle Number,Status,Vehicle Make,Vehicle Model,Vehicle Model Year,Vehicle Color,Vehicle Fuel Source,Wheelchair Accessible,Company Name,Address,City,State,ZIP Code,Taxi Affiliation,Taxi Medallion License Management,Record ID,Vehicle Type
0,1286,12009,RESERVED,CHEVROLET,EXPRESS,2014,BLACK,Bio-Diesel,N,CHICAGO PRIVATE TOURS LLC,4567 S. OAKENWALD AVE.,CHICAGO,IL,60653.0,unknown,unknown,12009Charter Sightseeing,Charter Sightseeing
1,2095,12248,INACTIVE,MERCEDES,SPRINTER,2010,SILVER,Bio-Diesel,N,O'HARE-MIDWAY LIMOUSINE SERVICE INC # 2,4610 N. CLARK ST.,CHICAGO,IL,60640.0,unknown,unknown,12248Charter Sightseeing,Charter Sightseeing
2,7950,13527,INACTIVE,VAN HOOL,TD925,2008,RED,Bio-Diesel,N,"TRT TRANSPORTATION, INC.",4400 S. RACINE AVE.,CHICAGO,IL,60609.0,unknown,unknown,13527Charter Sightseeing,Charter Sightseeing
4,9359,13528,INACTIVE,VAN HOOL,TD925,2008,RED,Bio-Diesel,N,"TRT TRANSPORTATION, INC.",4400 S. RACINE AVE.,CHICAGO,IL,60609.0,unknown,unknown,13528Charter Sightseeing,Charter Sightseeing
5,9441,12025,INACTIVE,MERCEDES,SPRINTER,2015,BLACK,Bio-Diesel,N,O'HARE-MIDWAY LIMOUSINE SERVICE INC # 2,4610 N. CLARK ST.,CHICAGO,IL,60640.0,unknown,unknown,12025Charter Sightseeing,Charter Sightseeing


5. Column Removal
    Drop the columns “Address” and “Public Vehicle Number”.

In [64]:
# Reassign the result instead of dropping in place on a filtered frame
filtered_df_cleaned = filtered_df_cleaned.drop(columns=['Address', 'Public Vehicle Number'])


In [65]:
filtered_df_cleaned.head()

Unnamed: 0.1,Unnamed: 0,Status,Vehicle Make,Vehicle Model,Vehicle Model Year,Vehicle Color,Vehicle Fuel Source,Wheelchair Accessible,Company Name,City,State,ZIP Code,Taxi Affiliation,Taxi Medallion License Management,Record ID,Vehicle Type
0,1286,RESERVED,CHEVROLET,EXPRESS,2014,BLACK,Bio-Diesel,N,CHICAGO PRIVATE TOURS LLC,CHICAGO,IL,60653.0,unknown,unknown,12009Charter Sightseeing,Charter Sightseeing
1,2095,INACTIVE,MERCEDES,SPRINTER,2010,SILVER,Bio-Diesel,N,O'HARE-MIDWAY LIMOUSINE SERVICE INC # 2,CHICAGO,IL,60640.0,unknown,unknown,12248Charter Sightseeing,Charter Sightseeing
2,7950,INACTIVE,VAN HOOL,TD925,2008,RED,Bio-Diesel,N,"TRT TRANSPORTATION, INC.",CHICAGO,IL,60609.0,unknown,unknown,13527Charter Sightseeing,Charter Sightseeing
4,9359,INACTIVE,VAN HOOL,TD925,2008,RED,Bio-Diesel,N,"TRT TRANSPORTATION, INC.",CHICAGO,IL,60609.0,unknown,unknown,13528Charter Sightseeing,Charter Sightseeing
5,9441,INACTIVE,MERCEDES,SPRINTER,2015,BLACK,Bio-Diesel,N,O'HARE-MIDWAY LIMOUSINE SERVICE INC # 2,CHICAGO,IL,60640.0,unknown,unknown,12025Charter Sightseeing,Charter Sightseeing


In [66]:
filtered_df_cleaned.shape

(13797, 16)