# Data Integration

### Overview

Most data do not come from a single source or data set. Before 
analysis, these data sets must be integrated, or merged.

Database management systems have well-developed functions for 
merging data within their framework. However, in many cases we need
to integrate data files that are not within a database management 
system.
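
As a minimal sketch of what integrating files outside a database can look like in pandas, the example below stacks two tables that share the same variables and then joins a lookup table onto the result. All table contents and column names here are hypothetical toy values, not part of the accident data.

```python
import pandas as pd

# Hypothetical toy tables standing in for two yearly files.
year_a = pd.DataFrame({"id": [1, 2], "damage": [100, 250]})
year_b = pd.DataFrame({"id": [3], "damage": [75]})

# Stacking rows: same variables, different observations.
stacked = pd.concat([year_a, year_b], ignore_index=True)

# Joining columns: same observations, additional variables.
railroads = pd.DataFrame({"id": [1, 2, 3], "railroad": ["AAA", "BBB", "CCC"]})
merged = stacked.merge(railroads, on="id", how="left")

print(merged)
```

Stacking with pd.concat is the operation the exercises rely on; the merge step is shown only for contrast.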

### Instructions

This notebook provides guided instructions and exercises
in data integration. Do the exercises yourself, but you may consult
with others in the class, use the text, and search for
relevant materials on the web. 

Use a combination of Markdown and code cells to 
present your answers to the questions.

In [1]:
# Start here with loading packages

# Data handling

import requests
import numpy as np
import pandas as pd
import string as st
import os





## Exercise 1

Load the data from the files for accidents in 2014-2016 into a dataframe and get its dimensions.
To do this do the following:

1. Create a function that turns a list of paths and file names into a list of dataframes
2. Integrate the dataframes in the list into one data frame
3. Are there any variables that appear in only some of the years? If so, look at their summaries.

Hint: for the first step you will need to use os.path.realpath() with pd.read_csv.
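
One way to read the hint is sketched below: build each full path with os.path.join, resolve it with os.path.realpath, and pass it to pd.read_csv. The function name load_frames and the two throwaway files are assumptions made for illustration; they are not the course data.

```python
import os
import tempfile

import pandas as pd


def load_frames(directory, filenames):
    """Read each CSV file in `filenames` from `directory` into a DataFrame."""
    frames = []
    for name in filenames:
        # realpath resolves symlinks and relative segments into an absolute path
        full_path = os.path.realpath(os.path.join(directory, name))
        frames.append(pd.read_csv(full_path))
    return frames


# Demonstration with two small throwaway files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.txt", "x,y\n1,2\n"), ("b.txt", "x,y\n3,4\n5,6\n")]:
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(text)
    demo_frames = load_frames(tmp, ["a.txt", "b.txt"])

print([frame.shape for frame in demo_frames])
```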

## Exercise 1 Answers

1. See the function getListDataFrames below
2. Used pd.concat() to stack the yearly dataframes
3. Yes. The 2016 table has one fewer column than the 2014 and 2015 tables: it lacks the AMTRAK column
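
The behavior behind answers 2 and 3 can be illustrated on toy data. When pd.concat stacks frames whose columns differ, it keeps the union of the columns and fills the gaps with NaN. The sketch below, which uses hypothetical miniature frames, also finds the columns that are missing from at least one frame.

```python
import pandas as pd

# Hypothetical stand-ins: the third frame lacks the AMTRAK column.
frames = [
    pd.DataFrame({"AMTRAK": ["K", None], "IYR": [14, 14]}),
    pd.DataFrame({"AMTRAK": ["K"], "IYR": [15]}),
    pd.DataFrame({"IYR": [16, 16]}),
]

# Rows are stacked; columns absent from a frame are filled with NaN.
combined = pd.concat(frames, ignore_index=True)

# Columns present in at least one frame but not in every frame.
all_columns = set().union(*(frame.columns for frame in frames))
shared_columns = set.intersection(*(set(frame.columns) for frame in frames))
partial_columns = sorted(all_columns - shared_columns)

print(combined.shape)
print(partial_columns)
print(combined[partial_columns].describe(include="all"))
```

Passing ignore_index=True gives the combined frame a fresh 0..n-1 index; without it, each yearly frame keeps its own row labels and the labels repeat.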

In [4]:
# Exercise 1

# 1
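def getListDataFrames(paths, filenames):
    """Read each (path, filename) pair into a dataframe and return the list."""
    csvs = []
    for path, filename in zip(paths, filenames):
        # os.path.join adds a separator only when the path lacks one
        fileToPull = os.path.join(path, filename)
        csvs.append(pd.read_csv(fileToPull, index_col=False))
    return csvs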

paths = ['/Users/mead/Fall2017/DonBrown-DS6001/InClass1/Data/', 
         '/Users/mead/Fall2017/DonBrown-DS6001/InClass1/Data/', 
         '/Users/mead/Fall2017/DonBrown-DS6001/InClass1/Data/']
files = ['RailAccidents14.txt',
        'RailAccidents15.txt',
        'RailAccidents16.txt']
listOfAccidents = getListDataFrames(paths,files)
print([df.head() for df in listOfAccidents])

# 2
dataFrameOfAccidents = pd.concat(listOfAccidents)
print(dataFrameOfAccidents.describe())

# 3
print([df.shape for df in listOfAccidents])
colnames = [df.columns for df in listOfAccidents]
print(colnames[1].equals(colnames[2])) 
print(colnames[0].equals(colnames[1]))

# The 2014 and 2015 accident reports have the same columns, but 2016 has one fewer column
#pd.isnull(dataFrameOfAccidents)

[  AMTRAK  IYR  IMO RAILROAD     INCDTNO  IYR2  IMO2  RR2 INCDTNO2  IYR3  \
0    NaN   14    8     INRD      644191   NaN   NaN  NaN      NaN    14   
1    NaN   14    8     BNSF   CA0814111   NaN   NaN  NaN      NaN    14   
2    NaN   14    8     BNSF   KS0814107   NaN   NaN  NaN      NaN    14   
3    NaN   14    8     BNSF   NE0814103   NaN   NaN  NaN      NaN    14   
4    NaN   14   11      FEC  D147111214   NaN   NaN  NaN      NaN    14   

       ...       NARR15  RCL Latitude  Longitud SIGNAL  MOPERA  ADJUNCT1  \
0      ...          NaN  0.0    39.03    -87.04      2     2.0       NaN   
1      ...          NaN  0.0    32.69   -117.15      2     5.0       NaN   
2      ...          NaN  3.0    38.04    -97.87      2     5.0         L   
3      ...          NaN  0.0    41.24    -95.91      2     5.0         L   
4      ...          NaN  0.0    30.23    -81.60      2     5.0         K   

   ADJUNCT2  ADJUNCT3        SUBDIV  
0       NaN       NaN  INDIANAPOLIS  
1       NaN    

In [5]:
dataFrameOfAccidents.shape

(7117, 146)

In [6]:
set(listOfAccidents[1]) - set(listOfAccidents[2])

{'AMTRAK'}

In [7]:
dataFrameOfAccidents[['AMTRAK']].describe()

       AMTRAK
count     290
unique      1
top         K
freq      290


## Exercise 2

1. Combine the narrative fields in the three-year dataframe into a list
2. Add a new narrative variable to the three-year dataframe
3. Show the narratives from the first five accidents

## Exercise 2 Answers

1. Already pretty much wrote this in Data Engineering 1
2. Appending a column is easy. Also removed the 'nan' strings
3. Done for both the first five entries and the five earliest entries
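
As an alternative to looping over rows, the narrative fields can be joined column-wise. The sketch below is one possible approach on a hypothetical three-field miniature of the accident table: it selects the NARR columns by name, replaces missing pieces with empty strings, and joins across each row.

```python
import pandas as pd

# Hypothetical miniature of the accident table with three narrative fields.
accidents = pd.DataFrame({
    "NARR1": ["TRAIN DERAILED ", "SWITCH MISALIGNED"],
    "NARR2": ["3 CARS.", None],
    "NARR3": [None, None],
})

# Select NARR1, NARR2, ... by name rather than by position.
narrative_columns = [
    c for c in accidents.columns if c.startswith("NARR") and c[4:].isdigit()
]

# Replace missing pieces with empty strings, then join across each row.
accidents["NARRTOT"] = (
    accidents[narrative_columns].fillna("").astype(str).agg("".join, axis=1)
)

print(accidents["NARRTOT"].tolist())
```

Selecting columns by name avoids relying on NARR1 through NARR15 sitting in consecutive positions, and filling missing values before joining means no 'nan' strings need to be filtered out afterwards.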

In [96]:
# Exercise 2

# 1 - Successfully add all the Narratives together
def combineNARR(NARR1index, dataFrme):
    """Collect the 15 narrative fields of each accident as a list of strings."""
    allAccs = []
    start = NARR1index[0]  # column position of NARR1
    for acc in range(len(dataFrme.index)):
        # Assumes NARR1..NARR15 occupy 15 consecutive columns; missing values become 'nan'
        fullNarr = [str(dataFrme.iloc[acc, start + narr]) for narr in range(15)]
        allAccs.append(fullNarr)
    return allAccs

which = lambda lst:list(np.where(lst)[0])
NARR1index = which([name == 'NARR1' for name in dataFrameOfAccidents.columns])
print(NARR1index)
# Run the function
newColumn = combineNARR(NARR1index, dataFrameOfAccidents)


[80]


In [90]:
# 2 - Get all the NARR together without any 'nan'
newNARR = [''.join([sentence for sentence in NARR if sentence != 'nan']) for NARR in newColumn]
dataFrameOfAccidents['NARRTOT'] = pd.Series(newNARR, index=dataFrameOfAccidents.index)

# 3 - Displays the first 5 listed accidents
print(dataFrameOfAccidents['NARRTOT'].head(5))
# Also going to display 'first' as in the very earliest

print("\n")

sortedAccidents = dataFrameOfAccidents.sort_values(['YEAR', 'MONTH', 'DAY', 'TIMEHR', 'TIMEMIN'], ascending=True)
print(sortedAccidents['NARRTOT'].head(5))

0    ENGINEER STARTED TO PULL TRAIN AHEAD WHEN HE W...
1    Y-SDG2321-13 DERAILED 3 ARTICULATED RAILCARS W...
2    RCO Y-HUT2422-29 DERAILED 13 CARS WHILE SHOVIN...
3    H-GALLIN1-11 DERAILED 7 CARS AND CAUSED SIGNIF...
4    WHILE DOUBLING A CUT OF CARS, CREW SHOVED OUT ...
Name: NARRTOT, dtype: object


1307    PANTOGRAPHS TORN OFF MU #1483 AND #1331 ON TRA...
1510    CWELH1-26 DERAILED 1 LOCOMOTIVE AND 4 CARS WHI...
527     YSJ55R-02 WHILE SPOTTING AMP TRACK, DERAILED 2...
2366    JOB 341-01 SWITCHING AT MCH YD. FOREMAN FAILED...
2374    JOB 341-01 SWITCHING AT MCH YD. FOREMAN FAILED...
Name: NARRTOT, dtype: object


In [95]:
newNARR[0:5]

['ENGINEER STARTED TO PULL TRAIN AHEAD WHEN HE WAS RADIOED TO STOP, CARS ON GROUND. INVESTIGATION FOUND THREE CARS ON GROUND. ALSO FOUND PREVIOUS CREW POSSIBLY LINED SWITCH WRONG WHICH CAUSED THE DERAILMENT.',
 'Y-SDG2321-13 DERAILED 3 ARTICULATED RAILCARS WHILE PULLING OUT OF YARD TRACK 9802 DUE TO TRACK WIDEGAGE. NO HAZARDOUS MATERIALS WERE RELEASED.',
 'RCO Y-HUT2422-29 DERAILED 13 CARS WHILE SHOVING YARD TRACK 107 DUE TO EXCESSIVE BUFFING OR SLACK ACTION. NO HAZARDOUS MATERIALS WERE RELEASED.',
 'H-GALLIN1-11 DERAILED 7 CARS AND CAUSED SIGNIFICANT TRACK DAMAGE WHILE SHOVING INTO YARD TRACK 153 DUE TO FAILURES TO CONTROL SHOVE MOVEMENT/RUN THROUGH SWITCH. NO HAZARDOUS MATERIALS WERE RELEASED.',
 'WHILE DOUBLING A CUT OF CARS, CREW SHOVED OUT OF A TRACK (SOUTH) TO A COUPLING.  AFTER THE COUPLINGWAS MADE, THEY PULLED NORTH WHEN FOUR CARS DERAILED AT A SWITCH.  NO PERSONAL INJURIES WERE SUSTAINED.']

## Exercise 3

Get the narratives for the following:

1. Most costly accident
2. Accident with the most fatalities
3. Accident with the most injuries
4. Accident with the most hazmat cars damaged


## Exercise 3 Answers

Get the narratives for the following:

1. Most costly accident - $66,934,217
2. Accident with the most fatalities - 8
3. Accident with the most injuries - 226
4. Accident with the most hazmat cars damaged - 49
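
Sorting the whole table works, but the row with the largest value can also be looked up directly. The sketch below is a minimal illustration on a hypothetical three-row table; the helper narrative_of_max is an assumed name, not part of the original notebook. It resets the index first because a frame built with pd.concat can carry repeated row labels, which would make a label-based lookup return several rows.

```python
import pandas as pd

# Hypothetical miniature table; the repeated index labels mimic pd.concat
# without ignore_index=True.
accidents = pd.DataFrame(
    {
        "ACCDMG": [1000.0, 50000.0, 200.0],
        "TOTINJ": [0.0, 2.0, 5.0],
        "NARRTOT": ["MINOR DERAILMENT", "COLLISION AT SWITCH", "PASSENGER TRAIN DERAILED"],
    },
    index=[0, 1, 0],
)


def narrative_of_max(frame, column):
    """Return (largest value, narrative) for the row that maximizes `column`."""
    clean = frame.reset_index(drop=True)  # unique labels 0..n-1
    label = clean[column].idxmax()        # skips NaN by default
    row = clean.loc[label]
    return row[column], row["NARRTOT"]


for column in ["ACCDMG", "TOTINJ"]:
    print(column, narrative_of_max(accidents, column))
```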

In [91]:
#1 - Most costly
sortedAccidents = dataFrameOfAccidents.sort_values(['ACCDMG'], ascending=[0])
print(str(sortedAccidents['ACCDMG'].head(1).iloc[0]))
print(str(sortedAccidents['NARRTOT'].head(1).iloc[0]))

66934217.0
K21014 WAS ON DUTY IN TAMPA AT 1930 ON NOVEMBER 15, 2016 AND DEPARTED AT 2350. K21014 WAS TRAVELINGIN A NORTHBOUND DIRECTION TOWARDS WAYCROSS. AT 0352, K21014 PASSED THE SE SPARR CP (S720.9). SHORTLYAFTER, K21014 PASSED A STOP SIGNAL AT THE NE SPARR CP (S718.6) AND IMPACTED THE 19TH CAR( TILX 47166) OF THE SOUTHBOUND N00113 LOADED COAL TRAIN. N00113 WAS GOING INTO THE SSDG ON SIGNAL INDICATION.THE SPEED OF K21014 AT IMPACT WAS 38 MPH.   CREWS HAVE BEEN TRANSPORTED TO THE MEDICAL CENTER IN STARKE FOR POST ACCIDENT TESTING. NO INJURIES HAVE BEEN REPORTED.


In [92]:
#2 - Most fatalities
sortedAccidents = dataFrameOfAccidents.sort_values(['TOTKLD'], ascending=[0])
print(str(sortedAccidents['TOTKLD'].head(1).iloc[0]))
print(str(sortedAccidents['NARRTOT'].head(1).iloc[0]))

8.0
TRAIN 188 WITH LOCOMOTIVE E/601 AND 7 CARS DERAILED AT MP 81.7 WHILE OPERATING EAST ON # 2 TRACK.  THE FIRST 4 CARS IN THE CONSIST COMPLETELY DERAILED, WITH THE FIRST 3 CARS ON THEIR SIDE AND THE ENGINE CAME TO REST A DISTANCE AWAY IN CONRAIL FRANKFORD YARD. THREE (3) CLASS B EMPLOYEES WERE DEADHEADING TO AND OR HOME FROM WORK AND ONE (1) TRAIN ATTENDANT ALSO RECEIVED AN INJURY. THE PRIMARY CAUSEOF THE INCIDENT REMAINS UNDER INVESTIGATION.  AMTRAKS EQUIPMENT DAMAGE IS  $27,140,000.00.


In [93]:
#3 - Most injuries
sortedAccidents = dataFrameOfAccidents.sort_values(['TOTINJ'], ascending=[0])
print(str(sortedAccidents['TOTINJ'].head(1).iloc[0]))
print(str(sortedAccidents['NARRTOT'].head(1).iloc[0]))

226.0
TRAIN 188 WITH LOCOMOTIVE E/601 AND 7 CARS DERAILED AT MP 81.7 WHILE OPERATING EAST ON # 2 TRACK.  THE FIRST 4 CARS IN THE CONSIST COMPLETELY DERAILED, WITH THE FIRST 3 CARS ON THEIR SIDE AND THE ENGINE CAME TO REST A DISTANCE AWAY IN CONRAIL FRANKFORD YARD. THREE (3) CLASS B EMPLOYEES WERE DEADHEADING TO AND OR HOME FROM WORK AND ONE (1) TRAIN ATTENDANT ALSO RECEIVED AN INJURY. THE PRIMARY CAUSEOF THE INCIDENT REMAINS UNDER INVESTIGATION.  AMTRAKS EQUIPMENT DAMAGE IS  $27,140,000.00.


In [94]:
#4 - Most hazmat cars
sortedAccidents = dataFrameOfAccidents.sort_values(['CARSDMG'], ascending=[0])
print(str(sortedAccidents['CARSDMG'].head(1).iloc[0]))
print(str(sortedAccidents['NARRTOT'].head(1).iloc[0]))

49.0
NS TRAIN Z4QG406 PULLING SOUTH ON FEC OWNED AND MAINTAINED TRACK WITH 3 UNITS, 80 LOADS, 1 EMPTY, 10,157 TONS DERAILED 49 CARS. 1 CAR (KLRX 30940) RELEASED 5 GALLONS OF ALCOHOL.
