# Data: hotcars.csv

 ## hotcars.csv

 HotCar report data. Each hot car report comes from a tweet that mentions a single valid four-digit Metro car number.

 - *car_number*: Metro car number
 - *color*: Line color
 - *time*: Tweet time (UTC)
 - *text*: Tweet text
 - *handle*: Twitter user's screen name
 - *user_id*: Twitter user's user_id
 - *tweet_id*: Tweet ID



In [3]:
import pandas as pd

In [7]:
data = pd.read_csv('data/hotcars.csv')

In [8]:
data.head()

Unnamed: 0,car_number,color,time,text,handle,user_id,tweet_id
0,1001,RED,2013-05-28T12:39:54+00:00,"Was just on metro hot car #1001 (red line), an...",CarChickMWB,18249348,339360382302949376
1,1188,RED,2013-05-28T12:50:53+00:00,@FixWMATA @unsuckdcmetro @wmata RL #HotCar 118...,DiavoJinx,403520304,339363145795653632
2,1068,GREEN,2013-05-28T21:06:20+00:00,"oh good, another hot car on the metro. green l...",lexilooo,16174883,339487827479908353
3,2066,ORANGE,2013-05-28T21:15:23+00:00,#HotCar 2066 on OL to New Carrollton. Air is ...,TheHornGuy,40506740,339490107864276992
4,1043,BLUE,2013-05-28T22:08:07+00:00,Car 1043 on the blue line heading to Largo is ...,jessydumpling,263763147,339503377618718720


The first thing that sticks out is the 'time' column. It is a timestamp, and I want to break it down so I can do better EDA.
I'll rename 'time' to 'time_stamp' and derive new columns 'year', 'month', 'day' and 'full_date' from it.

### Cleaning the Data
- break the timestamp down into new columns ('year', 'month', 'day', 'full_date')
- rename 'time' to 'time_stamp' and keep its original values unchanged

In [9]:
# renaming time into time_stamp
data = data.rename(columns = {'time' : 'time_stamp'})

In [10]:
# splitting the time_stamp into year, month and day (parse the column once and reuse it)
stamps = pd.DatetimeIndex(data['time_stamp'])
data['year'] = stamps.year
data['month'] = stamps.month
data['day'] = stamps.day

In [11]:
# making a new full_date column (YYYY-MM-DD) from the part of the time stamp before the 'T'
data['full_date'] = data['time_stamp'].str.split('T').str[0]

In [12]:
data.head()

Unnamed: 0,car_number,color,time_stamp,text,handle,user_id,tweet_id,year,month,day,full_date
0,1001,RED,2013-05-28T12:39:54+00:00,"Was just on metro hot car #1001 (red line), an...",CarChickMWB,18249348,339360382302949376,2013,5,28,2013-05-28
1,1188,RED,2013-05-28T12:50:53+00:00,@FixWMATA @unsuckdcmetro @wmata RL #HotCar 118...,DiavoJinx,403520304,339363145795653632,2013,5,28,2013-05-28
2,1068,GREEN,2013-05-28T21:06:20+00:00,"oh good, another hot car on the metro. green l...",lexilooo,16174883,339487827479908353,2013,5,28,2013-05-28
3,2066,ORANGE,2013-05-28T21:15:23+00:00,#HotCar 2066 on OL to New Carrollton. Air is ...,TheHornGuy,40506740,339490107864276992,2013,5,28,2013-05-28
4,1043,BLUE,2013-05-28T22:08:07+00:00,Car 1043 on the blue line heading to Largo is ...,jessydumpling,263763147,339503377618718720,2013,5,28,2013-05-28


Now we can explore the hot car tweets by month and year.
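As a minimal sketch of what a month-and-year breakdown could look like, the snippet below counts reports per (year, month) pair. It is not part of the original notebook: `sample`, `stamps` and `monthly_counts` are illustrative names, and the sample uses the five timestamps from `data.head()` plus two hypothetical ones so that more than one month shows up.

```python
import pandas as pd

# Stand-in for `data`: the five timestamps shown in data.head(), plus two
# hypothetical timestamps (marked below) so that several months appear.
sample = pd.DataFrame({
    "time_stamp": [
        "2013-05-28T12:39:54+00:00",
        "2013-05-28T12:50:53+00:00",
        "2013-05-28T21:06:20+00:00",
        "2013-05-28T21:15:23+00:00",
        "2013-05-28T22:08:07+00:00",
        "2013-06-03T13:00:00+00:00",  # hypothetical
        "2014-06-10T22:30:00+00:00",  # hypothetical
    ]
})

# Same derived columns as in the cleaning step
stamps = pd.DatetimeIndex(sample["time_stamp"])
sample["year"] = stamps.year
sample["month"] = stamps.month

# Number of reports for each (year, month) pair
monthly_counts = sample.groupby(["year", "month"]).size()
print(monthly_counts)
```

On the full dataset the same `groupby(["year", "month"]).size()` call should work on `data` directly, since it already has the 'year' and 'month' columns.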

In [14]:
data.shape

(6440, 11)

In [18]:
data.car_number.value_counts().head(10)

6004    48
3024    43
1124    39
1263    32
5171    32
6099    31
3058    30
1292    28
6081    27
5167    27
Name: car_number, dtype: int64

This gives us the 10 cars that received the most complaints.

In [19]:
data.color.value_counts()

RED       1705
ORANGE    1053
YELLOW     535
BLUE       436
NONE       400
GREEN      356
SILVER     309
Name: color, dtype: int64

The number of complaints per line.  
The Red line (1,705) and the Orange line (1,053) lead by a wide margin.
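To put the per-line counts in proportion, here is a small sketch that converts them to percentage shares. The counts are copied from the `value_counts()` output above; `line_counts` and `line_share` are illustrative names, not part of the original notebook. Note that the counts sum to 4,794 while `data.shape` reports 6,440 rows. `value_counts()` skips missing values by default, so the remaining rows presumably have no color value, and the shares below are relative to rows that do.

```python
import pandas as pd

# Counts copied from the data.color.value_counts() output above
line_counts = pd.Series(
    {
        "RED": 1705,
        "ORANGE": 1053,
        "YELLOW": 535,
        "BLUE": 436,
        "NONE": 400,
        "GREEN": 356,
        "SILVER": 309,
    },
    name="color",
)

# Percentage of reports per line, relative to reports that have a color value
line_share = (line_counts / line_counts.sum() * 100).round(1)
print(line_share)
```

With the full DataFrame loaded, `data.color.value_counts(normalize=True)` gives the same proportions as fractions instead of percentages.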

In [20]:
data.handle.value_counts().head(10)

EdChernosky        125
LowHeadways        104
UnleafTheKraken     49
Tracktwentynine     39
HakunaWMATA         38
KathrynAegis        38
rmwc1995            38
goldsmitht          37
pjroddyjr           35
lissabrennan        34
Name: handle, dtype: int64

The 10 Twitter users who tweet the most hot car complaints. 
Presumably they are very frequent Metro riders...
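One way to gauge how much these ten users dominate the reports is to compare their combined count with the total number of rows. The sketch below uses the counts from the output above and the 6,440 rows reported by `data.shape`. The names `top_handles`, `total_reports` and `top10_share` are illustrative, and the calculation assumes every row counts as one report.

```python
import pandas as pd

# Counts copied from the data.handle.value_counts().head(10) output above
top_handles = pd.Series(
    {
        "EdChernosky": 125,
        "LowHeadways": 104,
        "UnleafTheKraken": 49,
        "Tracktwentynine": 39,
        "HakunaWMATA": 38,
        "KathrynAegis": 38,
        "rmwc1995": 38,
        "goldsmitht": 37,
        "pjroddyjr": 35,
        "lissabrennan": 34,
    },
    name="handle",
)

total_reports = 6440  # number of rows reported by data.shape above

# Fraction of all reports that come from the ten most active handles
top10_share = top_handles.sum() / total_reports
print(f"Top 10 handles account for {top10_share:.1%} of reports")
```

A small share here would suggest that the reports are spread across many riders rather than driven by a handful of accounts.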