## Merge multiple .csv files

This script merges multiple CSV files with similar formatting into a single CSV file.

## Libraries

In [3]:
import pandas as pd # reading, trimming, and concatenating dataframes
import glob # creating list of files in folders
import os # changing working directories

## The Data

**Attempt to read one csv file**

In [22]:
# Known issue: files from 2006 have one extra row at the top, so they need skiprows = 15 instead of 14
df_2007 = pd.read_csv("eng-hourly-01012007-01312007.csv", encoding = "cp1252", skiprows = 14)
df_2006 = pd.read_csv("eng-hourly-01012006-01312006 (1).csv", encoding = "cp1252", skiprows = 15)
print(df_2007.shape)
print(df_2006.shape)

(744, 24)
(744, 24)


*Files from 2006 are formatted differently from those of all other years: they contain one extra leading row, so an additional row must be skipped when reading them in.*
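
Rather than hard-coding 14 or 15, the header row could be located by scanning for the line that begins with the first column name. The following is a minimal sketch, assuming the header line always starts with `Date/Time` (optionally quoted); `find_header_row` and `read_hourly` are hypothetical helper names, not pandas functions.

```python
import pandas as pd

def find_header_row(lines, first_column="Date/Time"):
    """Return the 0-based index of the first line that starts with first_column."""
    for i, line in enumerate(lines):
        # Strip a leading quote or spaces before comparing
        if line.lstrip('" ').startswith(first_column):
            return i
    raise ValueError(f"no header line starting with {first_column!r}")

def read_hourly(path, encoding="cp1252"):
    """Read one hourly file, skipping however many rows precede the header."""
    with open(path, encoding=encoding) as fh:
        skip = find_header_row(fh)
    return pd.read_csv(path, encoding=encoding, skiprows=skip)
```

With helpers like these, `read_hourly("eng-hourly-01012006-01312006 (1).csv")` and `read_hourly("eng-hourly-01012007-01312007.csv")` would not need different `skiprows` values, provided the assumption about the header line holds.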

**List all columns**

In [17]:
print(df_2007.columns.values)
print(df_2006.columns.values)

['Date/Time' 'Year' 'Month' 'Day' 'Time' 'Temp (Â°C)' 'Temp Flag'
 'Dew Point Temp (Â°C)' 'Dew Point Temp Flag' 'Rel Hum (%)' 'Rel Hum Flag'
 'Wind Dir (10s deg)' 'Wind Dir Flag' 'Wind Spd (km/h)' 'Wind Spd Flag'
 'Visibility (km)' 'Visibility Flag' 'Stn Press (kPa)' 'Stn Press Flag'
 'Hmdx' 'Hmdx Flag' 'Wind Chill' 'Wind Chill Flag' 'Weather']
['Date/Time' 'Year' 'Month' 'Day' 'Time' 'Temp (°C)' 'Temp Flag'
 'Dew Point Temp (°C)' 'Dew Point Temp Flag' 'Rel Hum (%)' 'Rel Hum Flag'
 'Wind Dir (10s deg)' 'Wind Dir Flag' 'Wind Spd (km/h)' 'Wind Spd Flag'
 'Visibility (km)' 'Visibility Flag' 'Stn Press (kPa)' 'Stn Press Flag'
 'Hmdx' 'Hmdx Flag' 'Wind Chill' 'Wind Chill Flag' 'Weather']


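*The degree symbol is rendered differently for 2006 than for all other years (`°C` versus `Â°C`) when every file is read as cp1252, so the column names do not match. The temperature columns are therefore dropped rather than reconciled.*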
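
A label such as `Temp (Â°C)` is what appears when UTF-8 text is decoded as cp1252. The short sketch below reproduces that effect on a plain string; whether the non-2006 files are in fact UTF-8 encoded is an assumption that has not been checked against the files themselves.

```python
degree_label = "Temp (°C)"

# UTF-8 stores "°" as the two bytes C2 B0; cp1252 decodes them as "Â" and "°"
garbled = degree_label.encode("utf-8").decode("cp1252")
print(garbled)  # Temp (Â°C)

# The reverse round trip recovers the intended label
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # Temp (°C)
```

If that assumption holds, reading those files with `encoding = "utf-8"` might make the column names match, which would allow the temperature columns to be kept.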

**Keep only relevant columns**

In [23]:
df_2007 = df_2007.iloc[:,[1,2,3,4,11,13]]
df_2006 = df_2006.iloc[:,[1,2,3,4,11,13]]

print(df_2007.columns.values)
print(df_2006.columns.values)

['Year' 'Month' 'Day' 'Time' 'Wind Dir (10s deg)' 'Wind Spd (km/h)']
['Year' 'Month' 'Day' 'Time' 'Wind Dir (10s deg)' 'Wind Spd (km/h)']


*Files from 2006 need one extra row skipped when read. The temperature columns are dropped from all files, so only the date, time, and wind columns remain.*
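
Selecting by position with `iloc` depends on every file having the same column order. As an alternative sketch, the same six columns can be requested by name through the `usecols` parameter of `pd.read_csv`. The in-memory sample below is a made-up stand-in for a real station file, and `WIND_COLUMNS` is a name introduced here for illustration.

```python
import io
import pandas as pd

WIND_COLUMNS = ["Year", "Month", "Day", "Time",
                "Wind Dir (10s deg)", "Wind Spd (km/h)"]

# Tiny stand-in for a real file (illustrative values only)
sample = io.StringIO(
    "Date/Time,Year,Month,Day,Time,Temp,Wind Dir (10s deg),Wind Dir Flag,Wind Spd (km/h)\n"
    "2007-01-01 00:00,2007,1,1,00:00,-3.1,34,,3\n"
)

df = pd.read_csv(sample, usecols=WIND_COLUMNS)
# Reorder explicitly: usecols does not promise the order given in the list
df = df[WIND_COLUMNS]
print(df.columns.tolist())
```

Because none of the requested names contains the degree symbol, this selection is unaffected by the encoding difference between years.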

## Format Files

Skip the first 14 rows of each file (15 for 2006 files) and keep only the date, time, and wind columns.

In [24]:
# Prerequisite: move all non-2006 files into a single folder "Data_Unformated"
for f in glob.glob('C:/Users/I dunno---Andrew/FactoryFloor/Offline/HarveyLake/Data_Unformated/*.csv'): # for all csvs in folder
    df_001 = pd.read_csv(f, encoding = "cp1252", skiprows = 14) # open, skipping the first 14 rows
    df_002 = df_001.iloc[:,[1,2,3,4,11,13]] # keep only the date, time, and wind columns
    df_002.to_csv(f) # overwrite the original file in place, so run this cell only once

# Prerequisite: move 2006 files into a single folder "2006_Unformated"
for f in glob.glob('C:/Users/I dunno---Andrew/FactoryFloor/Offline/HarveyLake/2006_Unformated/*.csv'): # for all csvs in folder
    df_001 = pd.read_csv(f, encoding = "cp1252", skiprows = 15) # open, skipping the first 15 rows
    df_002 = df_001.iloc[:,[1,2,3,4,11,13]] # keep only the date, time, and wind columns
    df_002.to_csv(f) # overwrite the original file in place, so run this cell only once
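
Because the loops above write each result back to the file it was read from, running the cell a second time would skip rows from data that has already been trimmed. One way to avoid that is to write the trimmed files to a separate folder. This is a minimal sketch under that assumption; `format_folder` and its parameters are hypothetical names.

```python
from pathlib import Path
import pandas as pd

KEEP = [1, 2, 3, 4, 11, 13]  # same column positions as the loops above

def format_folder(src_dir, dst_dir, skiprows):
    """Trim every CSV in src_dir and write the result to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob("*.csv")):
        df = pd.read_csv(src, encoding="cp1252", skiprows=skiprows)
        out = dst / src.name
        # index=False leaves out the row-number column
        df.iloc[:, KEEP].to_csv(out, index=False)
        written.append(out)
    return written
```

It would be called once with `skiprows=14` for the general folder and once with `skiprows=15` for the 2006 folder. Files written this way have no index column, so they would be read back without `index_col=0`.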

## Merge multiple .csv files

Define a function that will merge all formatted files into a single csv.

In [28]:
# Prerequisite: move the formatted 2006 files into the same folder as all the others.
def concatenate(indir="C:\\Users\\I dunno---Andrew\\FactoryFloor\\Offline\\HarveyLake\\Data_Concatenate",
                outfile="C:\\Users\\I dunno---Andrew\\FactoryFloor\\Offline\\HarveyLake\\Concatenated.csv"):
    os.chdir(indir) # note: changes the working directory for the rest of the session
    fileList=glob.glob("*.csv")
    dfList=[]
    colnames=["Year","Month","Day","Time", "WindDir","WindKmH"]
    for filename in fileList:
        #print(filename) # uncomment to track progress
        df=pd.read_csv(filename,header=0,index_col=0) # first column is the index saved by to_csv
        dfList.append(df)
    concatDf=pd.concat(dfList,axis=0) # stack all files row-wise
    #print(concatDf.head(10)) # uncomment to track progress
    concatDf.columns=colnames # shorter column names for the merged file
    concatDf.to_csv(outfile)

Run the function.

In [29]:
concatenate()
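
The function above relies on `os.chdir`, so the working directory has to be restored afterwards. A variant that builds full paths instead could look like the sketch below. `concatenate_no_chdir` is a hypothetical name, and the use of `ignore_index=True` (to give the merged rows one continuous index) is a design choice not present in the original function.

```python
from pathlib import Path
import pandas as pd

COLNAMES = ["Year", "Month", "Day", "Time", "WindDir", "WindKmH"]

def concatenate_no_chdir(indir, outfile):
    """Merge every CSV in indir into outfile without changing the working directory."""
    files = sorted(Path(indir).glob("*.csv"))  # sorted for a reproducible row order
    if not files:
        raise FileNotFoundError(f"no CSV files found in {indir}")
    frames = [pd.read_csv(f, header=0, index_col=0) for f in files]
    merged = pd.concat(frames, axis=0, ignore_index=True)
    merged.columns = COLNAMES
    merged.to_csv(outfile)
    return merged
```

As in the original, `outfile` should sit outside `indir` so that a second run does not pick up the merged file as an input.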

**Change back to working directory**

In [36]:
os.chdir("C:\\Users\\I dunno---Andrew\\FactoryFloor\\Offline\\HarveyLake\\")

In [39]:
os.getcwd()

'C:\\Users\\I dunno---Andrew\\FactoryFloor\\Offline\\HarveyLake'

## Inspect Output

In [42]:
df_concate = pd.read_csv("Concatenated.csv", index_col = 0)
df_concate.head()

Unnamed: 0,Year,Month,Day,Time,WindDir,WindKmH
0,1953,1,1,00:00,34.0,3.0
1,1953,1,1,01:00,34.0,5.0
2,1953,1,1,02:00,34.0,5.0
3,1953,1,1,03:00,,0.0
4,1953,1,1,04:00,34.0,2.0
