# DATA PREPROCESSING AND ANALYSIS

## Task 1: Data Reading and Inspection
- First, import pandas (conventionally as *pd*).
- To read the CSV file provided in the dataset folder, call *pd.read_csv('fileName.csv')* and store the result in a variable *file*.
- Afterwards, print the contents with *print(file)*, or evaluate *file* on its own to view it as a rendered DataFrame.
- The attribute *file.shape* gives the dimensions of the DataFrame as (rows, columns).
- To find the type of the object held in *file*, use *type(file)*.

In [1]:
import pandas as pd
import numpy as np

In [2]:
file=pd.read_csv('travel-times.csv',low_memory=False)   
file

| | Date | StartTime | DayOfWeek | GoingTo | Distance | MaxSpeed | AvgSpeed | AvgMovingSpeed | FuelEconomy | TotalTime | MovingTime | Take407All | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/6/2012 | 16:37 | Friday | Home | 51.29 | 127.4 | 78.3 | 84.8 | | 39.3 | 36.3 | No | |
| 1 | 1/6/2012 | 08:20 | Friday | GSK | 51.63 | 130.3 | 81.8 | 88.9 | | 37.9 | 34.9 | No | |
| 2 | 1/4/2012 | 16:17 | Wednesday | Home | 51.27 | 127.4 | 82.0 | 85.8 | | 37.5 | 35.9 | No | |
| 3 | 1/4/2012 | 07:53 | Wednesday | GSK | 49.17 | 132.3 | 74.2 | 82.9 | | 39.8 | 35.6 | No | |
| 4 | 1/3/2012 | 18:57 | Tuesday | Home | 51.15 | 136.2 | 83.4 | 88.1 | | 36.8 | 34.8 | No | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | 7/18/2011 | 08:09 | Monday | GSK | 54.52 | 125.6 | 49.9 | 82.4 | 7.89 | 65.5 | 39.7 | No | |
| 201 | 7/14/2011 | 08:03 | Thursday | GSK | 50.90 | 123.7 | 76.2 | 95.1 | 7.89 | 40.1 | 32.1 | Yes | |
| 202 | 7/13/2011 | 17:08 | Wednesday | Home | 51.96 | 132.6 | 57.5 | 76.7 | | 54.2 | 40.6 | Yes | |
| 203 | 7/12/2011 | 17:51 | Tuesday | Home | 53.28 | 125.8 | 61.6 | 87.6 | | 51.9 | 36.5 | Yes | |


In [3]:
print("\nShape of the file is : ",file.shape)   #this will return the shape of the file i.e. the number of rows and columns


Shape of the file is :  (205, 13)


In [4]:
print("\nDatatype of data variable: ",type(file))


Datatype of data variable:  <class 'pandas.core.frame.DataFrame'>
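The inspection steps in this task can also be tried without the dataset file. The following is a minimal sketch, assuming a small in-memory sample: the three rows reuse values from the table above, but the reduced column set and the "Rain" comment are made up for illustration. It also contrasts *type(...)*, which reports the class of the whole object, with *.dtypes*, which reports the data type of each column.

```python
import io
import pandas as pd

# Hypothetical three-row sample with a few of the columns from travel-times.csv.
# The "Rain" comment is invented for illustration.
sample_csv = io.StringIO(
    "Date,StartTime,GoingTo,Distance,Comments\n"
    "1/6/2012,16:37,Home,51.29,\n"
    "1/6/2012,08:20,GSK,51.63,\n"
    "1/4/2012,16:17,Home,51.27,Rain\n"
)
sample = pd.read_csv(sample_csv)

print(type(sample))    # class of the container object (a DataFrame)
print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # data type of each individual column
print(sample.head(2))  # first two rows only
```

Note that *shape* and *dtypes* are attributes, so they take no parentheses, while *head* is a method and must be called.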


## Task 2: Data Preprocessing
- To handle missing values with pandas, we use the *dropna* method.
- First, we count the null values in each column and print the totals.
- Then, we drop every column that contains at least one null value (here *FuelEconomy* and *Comments*), which reduces the DataFrame from 13 to 11 columns.

In [5]:
null_val=file.isnull().sum()   #total number of null values in each column
print("No. of null values in each column: \n\n",null_val)

fileProcessed=file.dropna(axis=1)   #axis=1 drops columns (not rows) that contain any null value
print("\nNo. of cols in original DataFrame:",file.shape[1])
print("No. of cols in processed DataFrame:",fileProcessed.shape[1])

No. of null values in each column: 

 Date                0
StartTime           0
DayOfWeek           0
GoingTo             0
Distance            0
MaxSpeed            0
AvgSpeed            0
AvgMovingSpeed      0
FuelEconomy        17
TotalTime           0
MovingTime          0
Take407All          0
Comments          181
dtype: int64

No. of cols in original DataFrame: 13
No. of cols in processed DataFrame: 11
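Dropping whole columns is only one way to use *dropna*. The sketch below, which assumes a small made-up DataFrame rather than the real dataset (the column names are borrowed from it, some values are illustrative), compares dropping columns, dropping rows, keeping columns that are mostly filled, and dropping rows based on a single column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete column, one with a single gap, one mostly empty
df = pd.DataFrame({
    "Distance": [51.29, 51.63, 51.27, 49.17],
    "FuelEconomy": [np.nan, 7.89, 7.89, 8.1],
    "Comments": [np.nan, np.nan, np.nan, "Rain"],
})

cols_dropped = df.dropna(axis=1)                  # drop every column containing a null
rows_dropped = df.dropna(axis=0)                  # drop every row containing a null
mostly_full = df.dropna(axis=1, thresh=3)         # keep columns with at least 3 non-null values
subset_rows = df.dropna(subset=["FuelEconomy"])   # drop rows only where FuelEconomy is null

print(list(cols_dropped.columns))
print(rows_dropped.shape)
print(list(mostly_full.columns))
print(len(subset_rows))
```

Which variant is appropriate depends on the analysis: dropping a column with only a few gaps, such as *FuelEconomy* with 17 nulls out of 205 rows, discards far more data than dropping the affected rows would.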


## Task 3: Data Analysis
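- To find the mean, median and mode of the columns in the CSV file, we call the methods *file.mean()*, *file.median()* and *file.mode()* respectively and print the values. The parentheses are required: without them, *print* shows the bound method object rather than the computed statistic. Mean and median are defined only for numeric columns, so text columns such as *Date* and *GoingTo* should be excluded with *numeric_only=True*.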

In [10]:
mean1=file.mean(numeric_only=True)   #call the method; numeric_only=True skips text columns such as Date and GoingTo
print(mean1)

In [7]:
median=file.median(numeric_only=True)   #call the method; numeric_only=True skips text columns
print(median)

In [8]:
mode=file.mode()   #call the method; mode is defined for text columns as well as numeric ones
print(mode)
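To make the three statistics concrete, here is a minimal sketch on a five-row DataFrame. It assumes only three columns, with values copied from rows 0 to 4 of the table in Task 1; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Five rows taken from the head of the dataset, restricted to three columns
df = pd.DataFrame({
    "GoingTo": ["Home", "GSK", "Home", "GSK", "Home"],
    "Distance": [51.29, 51.63, 51.27, 49.17, 51.15],
    "TotalTime": [39.3, 37.9, 37.5, 39.8, 36.8],
})

# df.mean (no parentheses) is only a reference to the method;
# df.mean(...) actually computes the statistic.
means = df.mean(numeric_only=True)      # one value per numeric column
medians = df.median(numeric_only=True)  # one value per numeric column
modes = df.mode()                       # most frequent value(s) per column, text included

print(means)
print(medians)
print(modes)
```

Because *mode* returns a DataFrame rather than a Series, a column with several equally frequent values (such as *Distance* here, where every value is unique) lists all of them, and shorter columns are padded with NaN.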