# Sectors performance preprocessor
The program preprocesses data for the main US market sectors. For each sector and each month within a given range of years, it calculates the sector's performance (%) over a given interval, expressed as a number of rows.

The program assumes that each sector's data is stored in its own CSV file in the following format: { Date (M/D/Y), Open Price }, sorted by date in descending order (newest first).
The results are saved to a CSV file with one row per sector and month, in the following format: { Sector, Year, Month, Performance (%) }
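
As a rough illustration of the input layout, the sketch below reads a small in-memory CSV with `Date` and `Open` columns, newest row first, and puts it into chronological order. The sample rows and prices are invented for illustration, and `load_rows_oldest_first` is a hypothetical helper that is not part of the program; real downloaded files may contain additional columns.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in the expected layout: Date (M/D/Y) and Open, newest first
SAMPLE_CSV = '''"Date","Open"
"03/01/2024","5,098.51"
"02/01/2024","4,861.11"
"01/01/2024","4,745.20"
'''

def load_rows_oldest_first(text):
    # Parse CSV text and return (date, open_price) tuples in chronological order
    rows = list(csv.DictReader(io.StringIO(text)))
    rows.reverse()  # the input is newest-first, so reversing gives oldest-first
    return [
        (datetime.strptime(r['Date'], '%m/%d/%Y'), float(r['Open'].replace(',', '')))
        for r in rows
    ]

rows = load_rows_oldest_first(SAMPLE_CSV)
for day, open_price in rows:
    print(day.date(), open_price)
```

Prices are quoted with thousands separators in this sample, which is why the commas are stripped before converting to `float`.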

You can download the data from investing.com. For example, the Technology sector: https://www.investing.com/indices/s-p-500-information-technology-historical-data

### 1. Imports & helper functions

In [14]:
import csv
from datetime import datetime, date

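def calculate_performance(prev_price, curr_price):
    # Return the percentage change from prev_price to curr_price.
    # Prices arrive as strings that may contain thousands separators (e.g. '4,500.00').
    if prev_price and curr_price:
        prev_price = float(prev_price.replace(',', ''))
        curr_price = float(curr_price.replace(',', ''))

        return (curr_price - prev_price) * 100 / prev_price
    # A missing price leaves the value empty in the output file
    return ''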

def parse_date(date_string):
    return datetime.strptime(date_string, '%m/%d/%Y')
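
A quick check of the helpers on hand-picked values can make their conventions concrete. The sketch below repeats the two definitions so that it runs on its own; the prices and the date are made up for illustration.

```python
from datetime import datetime

# Copies of the helpers above, repeated so this snippet runs standalone
def calculate_performance(prev_price, curr_price):
    if prev_price and curr_price:
        prev_price = float(prev_price.replace(',', ''))
        curr_price = float(curr_price.replace(',', ''))
        return (curr_price - prev_price) * 100 / prev_price
    return ''

def parse_date(date_string):
    return datetime.strptime(date_string, '%m/%d/%Y')

# Illustrative values only
gain = calculate_performance('4,500.00', '4,725.00')  # price rose by 225, i.e. 5.0 %
loss = calculate_performance('200', '150')            # price fell by 50, i.e. -25.0 %
missing = calculate_performance('', '150')            # no previous price, returns ''

print(gain, loss, repr(missing))
print(parse_date('3/1/2024'))
```

An empty string for a missing price means the corresponding value in the output file is simply left blank.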

### 2. Main function

In [15]:
# sectors_data - dictionary with the following structure: {sector_name: sector_data_csv_filename}
# output_file - filename for the output CSV
# start_year, end_year - inclusive range of years to save the data for
# period_length - the number of rows between which the performance is measured
def process_csv_files(sectors_data, output_file, start_year, end_year, period_length):
    with open(output_file, 'w', newline='') as outfile:
        writer = csv.writer(outfile)

        # one output row per (sector, year, month)
        header = ['Sector', 'Year', 'Month', 'Performance']
        writer.writerow(header)

        data = {year: {month: {sector: '' for sector in sectors_data.keys()} for month in range(1, 13)} for year in range(start_year, end_year + 1)}
                        
        for sector, filename in sectors_data.items():
            print(f"Calculating performance for {sector}")
            with open(filename, 'r', encoding='utf-8-sig') as infile:
                reader = csv.DictReader(infile)
                rows = list(reader)
                rows.reverse()
                
                for i in range(period_length, len(rows)):
                    row = rows[i]
                    current_date = parse_date(row['Date'])
                    if start_year <= current_date.year <= end_year:
                        previous_price = rows[i - period_length]['Open']
                        current_price = row['Open']
                        
                        performance = calculate_performance(previous_price, current_price)
                        data[current_date.year][current_date.month][sector] = performance

        # save data to the output file, newest year and month first
        for sector in sectors_data.keys():
            for year in range(end_year, start_year - 1, -1):
                for month in range(12, 0, -1):
                    row = [sector, year, month, data[year][month][sector]]
                    writer.writerow(row)
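
The row-offset logic of the main loop can be seen in isolation on a short list of prices. This sketch assumes one row per month, already in chronological order, and uses invented prices; `performance_by_offset` is a hypothetical helper, not part of the program above.

```python
def performance_by_offset(prices, period_length):
    # For each index i >= period_length, compare prices[i] with prices[i - period_length]
    results = {}
    for i in range(period_length, len(prices)):
        prev_price = prices[i - period_length]
        curr_price = prices[i]
        results[i] = (curr_price - prev_price) * 100 / prev_price
    return results

# Invented monthly open prices, oldest first
prices = [100.0, 110.0, 120.0, 99.0]

two_month = performance_by_offset(prices, 2)
print(two_month)
```

The first `period_length` rows have no earlier row to compare against, so they produce no value. In `process_csv_files`, if a file holds several rows for the same month, the value computed for the last of them is the one that is kept.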

### 3. Data and program execution

In [16]:
sectors_data = {
    "SP500": '../datasets/S&P 500 Historical Data.csv',
    "Industrials": '../datasets/S&P 500 Industrials Historical Data.csv',
    "Technology": '../datasets/S&P 500 Information Technology Historical Data.csv',
    "Real Estate": '../datasets/S&P 500 Real Estate Historical Data.csv',
    "Utilities": '../datasets/S&P 500 Utilities Historical Data.csv',
    "Consumer Cyclical": '../datasets/S&P 500 Consumer Discretionary Historical Data.csv',
    "Consumer Defensive": '../datasets/S&P 500 Consumer Staples Historical Data.csv',
    "Energy": '../datasets/S&P 500 Energy Historical Data.csv',
    "Financial Services": '../datasets/S&P 500 Financials Historical Data.csv',
    "Healthcare": '../datasets/S&P 500 Health Care Historical Data.csv',
    "Business Services": '../datasets/Dow Jones Business Support Services Historical Data.csv',
    "Basic Materials": '../datasets/Dow Jones Basic Materials Historical Data.csv'
}
output_filename = '../datasets/Sectors_2Months_performance.csv'
start_year = 2014
end_year = 2024
period_length = 2  # The number of rows between which the performance is to be measured
process_csv_files(sectors_data, output_filename, start_year, end_year, period_length)

Calculating performance for SP500
Calculating performance for Industrials
Calculating performance for Technology
Calculating performance for Real Estate
Calculating performance for Utilities
Calculating performance for Consumer Cyclical
Calculating performance for Consumer Defensive
Calculating performance for Energy
Calculating performance for Financial Services
Calculating performance for Healthcare
Calculating performance for Business Services
Calculating performance for Basic Materials
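
`process_csv_files` writes one row per sector, year, and month. If one column per sector is more convenient for later analysis, the file can be pivoted. The sketch below does this on a small in-memory sample with invented values; `pivot_to_wide` is a hypothetical helper and not part of the program above.

```python
import csv
import io

# Invented sample in the layout written above: Sector, Year, Month, Performance
LONG_CSV = '''Sector,Year,Month,Performance
SP500,2024,2,2.5
SP500,2024,1,
Energy,2024,2,-1.0
Energy,2024,1,0.75
'''

def pivot_to_wide(text):
    # Return (header, rows) with one row per (year, month) and one column per sector
    table = {}
    sectors = []
    for r in csv.DictReader(io.StringIO(text)):
        if r['Sector'] not in sectors:
            sectors.append(r['Sector'])
        key = (int(r['Year']), int(r['Month']))
        table.setdefault(key, {})[r['Sector']] = r['Performance']
    header = ['Year', 'Month'] + sectors
    rows = [
        [year, month] + [table[(year, month)].get(s, '') for s in sectors]
        for year, month in sorted(table, reverse=True)
    ]
    return header, rows

header, rows = pivot_to_wide(LONG_CSV)
print(header)
for row in rows:
    print(row)
```

Values are kept as strings here so that empty entries (months without a computed performance) pass through unchanged.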
