# HW 2 - Analysis 2 - Wrangling csv files

## Preliminaries

In [147]:
# To auto-reload modules in jupyter notebook (so that changes in *.py files don't require manual reloading):
# https://stackoverflow.com/questions/5364050/reloading-submodules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Import commonly used libraries and the magic command for inline plotting.

In [150]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [152]:
%matplotlib inline

## Step 1 – Consolidation

Create a blank Excel file named BCM.xlsx.

a.	Use the openpyxl library to do this with Python. The openpyxl library is already installed in the aap conda virtual environment.

b.	You can simply hard code the filename BCM.xlsx. 

c.	When you save the blank workbook using openpyxl, it will have one sheet in it (which is totally fine).

https://medium.com/@anushkamahajan901/how-to-create-an-excel-file-using-python-a-beginners-guide-72e5d98c8e97

In [163]:
import openpyxl
import csv

Combine all of the csv files in a single Excel workbook – one csv file per sheet. The sheet names should be the name of the csv file but without the csv extension. It’s ok if the first sheet is just a blank sheet followed by all of the data sheets for the csv files.

The Excel file should be named using just the characters in the filenames before the first hyphen. For the logs I’ve given you, the file will be called BCM.xlsx.
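A minimal sketch of deriving the workbook name from the csv filenames instead of hard coding it. The helper name `workbook_name` is hypothetical, and the sketch assumes every log file shares the same prefix before the first hyphen.

```python
from pathlib import Path


def workbook_name(csv_paths):
    """Return '<prefix>.xlsx', where prefix is the text before the first hyphen.

    Hypothetical helper: raises ValueError if the files do not share one prefix.
    """
    prefixes = {Path(p).name.split("-", 1)[0] for p in csv_paths}
    if len(prefixes) != 1:
        raise ValueError(f"expected exactly one filename prefix, found {sorted(prefixes)}")
    return f"{prefixes.pop()}.xlsx"


print(workbook_name(["BCM-E-tCenter-Deep.csv", "BCM-N-tLeft-Medium.csv"]))
```

Checking for a single shared prefix is a design choice for this sketch; a mixed folder could instead be grouped into one workbook per prefix.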

Each data sheet should have the following column headers in A1:C1 – datetime, scale, temperature.

Insert the contents of each csv into a new sheet in BCM.xlsx

a.	For this I used pandas and pathlib. Just used file globbing and a loop.

b.	Read each csv into a pandas dataframe. HINT: Look at the pandas read_csv documentation to see what useful things you can accomplish during the file reading process.

c.	Inserted each dataframe into the Excel file using the appropriate dataframe method.
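As an illustration of the read_csv hint, the sketch below reads a headerless log, supplies the column names, and parses the timestamps in one call. The two sample rows are illustrative only and assume the logs use an m/d/yyyy h:mm layout with no header row.

```python
import io

import pandas as pd

# Two illustrative rows in the assumed headerless layout of the log files
sample_csv = io.StringIO("8/19/2012 12:17,C,20.0\n8/19/2012 13:17,C,20.5\n")

# names= supplies the missing header; parse_dates= converts the first
# column to real datetimes while the file is being read
sample_df = pd.read_csv(
    sample_csv,
    names=["datetime", "scale", "temperature"],
    parse_dates=["datetime"],
)

print(sample_df.dtypes)
```

Parsing the dates at read time means later sorting and min/max calculations work on real datetimes rather than on text.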

https://www.tutorialspoint.com/How-to-copy-files-to-a-new-directory-using-Python#:~:text=By%20using%20glob.,copy().

In [168]:
# Copy original files
import os
import shutil
import glob

# Specify the source directory path
source_directory = './data/logs/'

# Specify the destination directory path
destination_directory = './data/log_copies/'

# Create the destination directory if it does not exist yet
os.makedirs(destination_directory, exist_ok=True)

# Get a list of all files in the source directory
files = glob.glob(os.path.join(source_directory, '*'))

# Copy each file to the destination directory
for file in files:
    shutil.copy(file, destination_directory)

In [170]:
# create a new workbook
workbook = openpyxl.Workbook()

# Save the workbook to a file
workbook.save("./data/log_copies/BCM.xlsx")

# Print a success message
print("Excel file created successfully!")

Excel file created successfully!


https://stackoverflow.com/questions/60026948/adding-a-header-row-with-values-for-each-column-to-multiple-csv-files#:~:text=Read_csv%20has%20a%20names%20parameter,csv%20files.

In [172]:
path = './data/log_copies/'
filelist = glob.glob(path + "BCM*.csv")
df_list = []
for file in filelist:
    # glob already returns the full path, so there is no need to prepend path.
    # names= adds the header row that the raw logs lack. Run this cell only once per
    # fresh copy: on a second run the saved header would be read back as a data row.
    df2 = pd.read_csv(file, names=['datetime', 'scale', 'temperature'])
    # Save the file back out with the header and without the index
    df2.to_csv(file, index=False)
    df_list.append(df2)
frame = pd.concat(df_list)

In [174]:
print(filelist)

['./data/log_copies\\BCM-E-tCenter-Deep.csv', './data/log_copies\\BCM-E-tCenter-Medium.csv', './data/log_copies\\BCM-E-tCenter-Shallow.csv', './data/log_copies\\BCM-E-tLeft-Deep.csv', './data/log_copies\\BCM-E-tLeft-Medium.csv', './data/log_copies\\BCM-E-tRight-Deep.csv', './data/log_copies\\BCM-E-tRight-Medium.csv', './data/log_copies\\BCM-N-tCenter-Deep.csv', './data/log_copies\\BCM-N-tCenter-Medium.csv', './data/log_copies\\BCM-N-tLeft-Deep.csv', './data/log_copies\\BCM-N-tLeft-Medium.csv', './data/log_copies\\BCM-N-tRight-Deep.csv', './data/log_copies\\BCM-N-tRight-Medium.csv', './data/log_copies\\BCM-N-tRight-Shallow.csv']


In [176]:
# Check the header
df = pd.read_csv('./data/log_copies/BCM-E-tCenter-Deep.csv')
df.head()

Unnamed: 0,datetime,scale,temperature
0,8/19/2012 12:17,C,20.0
1,8/19/2012 13:17,C,20.0
2,8/19/2012 14:17,C,20.5
3,8/19/2012 15:17,C,21.0
4,8/19/2012 16:17,C,21.0


In [178]:
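from pathlib import Path

path = "./data/log_copies/"

def write_sheets(file_map: dict) -> None:
    # Write each dataframe to its own sheet, named after the csv file.
    # ExcelWriter's default write mode replaces the blank BCM.xlsx created earlier.
    with pd.ExcelWriter(Path(path) / "BCM.xlsx", engine="xlsxwriter") as writer:
        for sheet_name, df in file_map.items():
            df.to_excel(writer, sheet_name=sheet_name, index=False)

# Map each csv file's stem (the name without .csv) to its dataframe
file_mapping = {file.stem: pd.read_csv(file) for file in sorted(Path(path).glob("*.csv"))}
write_sheets(file_mapping)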

## Step 2 – Summarization

Your client now wants you to add some simple formulas to each sheet showing the minimum, maximum, and average of the temperature values. 

The labels should be in G2:G4 and the formulas in H2:H4. Notice, they want actual Excel formulas in H2:H4, not just computed values, with nice cell formatting.

min_temp

max_temp

mean_temp

min_date

max_date

In addition, compute the minimum and maximum of the datetime field in rows 6 and 7.
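One way to keep the labels and formulas together is a small mapping from cell address to content. This is only a sketch: `summary_cells` is a hypothetical helper, it assumes the header is in row 1 with temperatures in column C, and the `MIN`/`MAX` formulas on column A only give sensible results if that column holds real Excel dates rather than text.

```python
def summary_cells(last_row):
    """Map cell addresses to the labels and formulas for one data sheet.

    Hypothetical helper: last_row is the row number of the final data row,
    with the header in row 1 and temperatures in column C.
    """
    temps = f"C2:C{last_row}"
    dates = f"A2:A{last_row}"
    return {
        "G2": "min_temp", "H2": f"=MIN({temps})",
        "G3": "max_temp", "H3": f"=MAX({temps})",
        "G4": "mean_temp", "H4": f"=AVERAGE({temps})",
        "G6": "min_date", "H6": f"=MIN({dates})",
        "G7": "max_date", "H7": f"=MAX({dates})",
    }


for address, content in summary_cells(100).items():
    print(address, content)
```

With openpyxl, each entry could then be written with `sheet[address] = content`, which stores strings that start with `=` as formulas.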

https://openpyxl.readthedocs.io/en/stable/tutorial.html

In [192]:
# Sort each sheet by column A

# Load the Excel workbook
workbook = openpyxl.load_workbook('./data/log_copies/BCM.xlsx')

# Iterate through each sheet
for sheet_name in workbook.sheetnames:
    sheet = workbook[sheet_name]
    
    # Copy the data rows out as plain values before touching the sheet.
    # Sorting the cell objects themselves and writing back in place would
    # overwrite some cells before their original values had been read.
    data_rows = list(sheet.iter_rows(min_row=2, max_row=sheet.max_row, values_only=True))
    
    # Sort rows chronologically on column A; pd.to_datetime parses the text
    # timestamps so they are not compared alphabetically
    sorted_rows = sorted(data_rows, key=lambda row: pd.to_datetime(row[0]))
    
    # Rewrite sorted data back to the sheet
    for idx, row in enumerate(sorted_rows, start=2):
        for col_idx, value in enumerate(row, start=1):
            sheet.cell(row=idx, column=col_idx).value = value

# Save the changes to the workbook
workbook.save('./data/log_copies/BCM_sorted.xlsx')
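The sort order matters because min_date and max_date are later read from the first and last data rows. A small check, using illustrative timestamps in the assumed m/d/yyyy h:mm layout (not taken from the logs), shows that ordering the values as text differs from ordering them as dates:

```python
from datetime import datetime

# Illustrative values in the assumed m/d/yyyy h:mm layout
stamps = ["8/19/2012 12:17", "8/2/2012 9:17", "10/1/2012 0:17"]

# Plain string comparison versus comparison of parsed datetimes
text_order = sorted(stamps)
time_order = sorted(stamps, key=lambda s: datetime.strptime(s, "%m/%d/%Y %H:%M"))

print(text_order)
print(time_order)
```

Text ordering places the October value first because the character "1" sorts before "8", so a sheet sorted as text would report the wrong first and last dates.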


In [204]:
# Load the Excel workbook
workbook = openpyxl.load_workbook('./data/log_copies/BCM_sorted.xlsx')

# Iterate through each sheet in the workbook
for sheet_name in workbook.sheetnames:
    sheet = workbook[sheet_name]
    
    # Add labels to cells G2:G7
    sheet['G2'] = 'min_temp'
    sheet['G3'] = 'max_temp'
    sheet['G4'] = 'mean_temp'
    sheet['G6'] = 'min_date'
    sheet['G7'] = 'max_date'
    
    # Add formulas to cell H2:H7
    sheet['H2'] = '=ROUND(MIN(C:C),1)'
    sheet['H3'] = '=ROUND(MAX(C:C),1)'
    sheet['H4'] = '=ROUND(AVERAGE(C:C),1)'
    sheet['H6'] = '=A2'
    col_a_length = len(sheet['A'])
    sheet['H7'] = '=A' + str(col_a_length)
    
    # Iterate through each column in the worksheet
    for col in sheet.columns:
        max_length = 0
        
        # Iterate through each cell in the column to find the maximum length
        for cell in col:
            # Measure the text form of every non-empty cell, so numbers count too
            if cell.value is not None:
                max_length = max(max_length, len(str(cell.value)))
        
        # Autofit the column width based on the maximum length
        sheet.column_dimensions[col[0].column_letter].width = max_length + 2  # Add some extra space

# Save the changes to the workbook
workbook.save('./data/log_copies/BCM_withInfo.xlsx')
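For the "nice cell formatting" part of the request, an alternative to wrapping each formula in `ROUND` is to leave the formula unrounded and set the cell's display format. The sketch below uses a throwaway in-memory workbook; the names `demo_wb` and `demo_ws` are placeholders, and the bold label and one-decimal format are assumptions about what the client would consider nice.

```python
import openpyxl
from openpyxl.styles import Font

# Build a throwaway in-memory workbook; nothing is saved to disk
demo_wb = openpyxl.Workbook()
demo_ws = demo_wb.active

# Bold label next to an unrounded formula
demo_ws["G4"] = "mean_temp"
demo_ws["G4"].font = Font(bold=True)
demo_ws["H4"] = "=AVERAGE(C:C)"

# Display one decimal place without changing the underlying value
demo_ws["H4"].number_format = "0.0"

print(demo_ws["H4"].value, demo_ws["H4"].number_format)
```

Using `number_format` keeps full precision in the cell while Excel shows the rounded value.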