# Sales Price Analysis Data Integration
**Author:** Joseph Arcila  
**Date:** April 2024  
**Project:** AVILEN Client Data Analysis

## Project Overview
This notebook processes multiple economic indicators to analyze factors affecting sales prices. It integrates data from various sources into a single, analysis-ready dataset.

### Input Data Requirements
Required files:
1. Corporate Price Index (CSV)
   - Import price index focusing on food/agricultural products
   - Monthly time series data

2. Wage Data (CSV)
   - Nominal and real wage indices
   - Monthly observations

3. Consumer Price Index Data (Excel)
   - Complex sheet structure
   - Multiple category breakdowns

4. Interest Rate Data (CSV)
   - Japanese interest rates
   - Monthly observations

5. GDP Data (CSV)
   - Real GDP expenditure data
   - Quarterly observations

6. GDP Growth Rate Data (CSV)
   - Quarterly growth rate data

### Output Format
- Single CSV file containing all integrated indicators
- Date index in YYYYMM format
- All columns converted to numeric types (non-numeric values coerced to NaN)
- Monthly frequency (1994-2023)
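
As an illustration, the output format listed above could be checked with a small helper. This is a minimal sketch and not part of the original notebook: `check_output_format` and its arguments are hypothetical names, and it assumes the index holds YYYYMM strings.

```python
import pandas as pd

def check_output_format(df, start="199401", end="202312"):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper: assumes the index holds YYYYMM strings.
    """
    expected = pd.date_range(
        start=f"{start[:4]}-{start[4:]}-01",
        end=f"{end[:4]}-{end[4:]}-01",
        freq="MS",
    ).strftime("%Y%m")
    problems = []
    if list(df.index) != list(expected):
        problems.append("index is not the complete monthly YYYYMM range")
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    return problems

# Toy example with three months of made-up values
sample = pd.DataFrame({"cpi": [100.0, 100.2, 100.1]},
                      index=["199401", "199402", "199403"])
print(check_output_format(sample, start="199401", end="199403"))
```

Returning a list of messages rather than raising lets the caller decide whether a mismatch should stop the run.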

### Dependencies
```python
# Core data processing
import pandas as pd
import numpy as np

# File handling
import chardet  # For encoding detection

# Optional (for Google Colab)
from google.colab import drive
from google.colab import files
```

## Code Organization
1. Initial Setup and Dependencies
2. Data Loading and Encoding Detection
3. Date Standardization
4. Data Integration
5. Export Process

## 1. Initial Setup and Dependencies
Import the required libraries and mount Google Drive (Colab only).

In [None]:
# Core dependencies
import pandas as pd
import numpy as np
import chardet

# Optional - for Google Colab
from google.colab import drive
from google.colab import files

# Mount Google Drive (if using Colab)
drive.mount('/content/drive')

Mounted at /content/drive
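
Because the Colab imports are optional, the setup can be guarded so the same code also runs outside Colab. The following is a minimal sketch under stated assumptions: `IN_COLAB` and `base_dir` are hypothetical names, and the local `data` folder is a placeholder, not a path used by this project.

```python
from pathlib import Path

try:
    from google.colab import drive  # only importable inside Google Colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    # Same mount point as in the cell above
    drive.mount("/content/drive")
    base_dir = Path("/content/drive")
else:
    # Hypothetical local folder; replace with wherever the input files live
    base_dir = Path("data")

print(f"Running in Colab: {IN_COLAB}, base directory: {base_dir}")
```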


## 2. Data Loading and Encoding Detection
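Load each data source. Encodings are detected with `chardet` for the interest rate, GDP, and GDP growth files, and hard-coded for the corporate price index and wage files.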

In [None]:
# Define file paths (update these to your local paths)
企業物価指数_file_path = "/content/drive/Shareddrives/125-2日本ハム-業務委託共有/販売単価分析/NME R031.2814 Apr 26.csv"
賃金_file_path = "/content/drive/Shareddrives/125-2日本ハム-業務委託共有/販売単価分析/TimeSeriesResult.csv"
金利_file_path = "/content/drive/Shareddrives/125-2日本ハム-業務委託共有/販売単価分析/Main Time Series Data.csv"
gdp_file_path = "/content/drive/Shareddrives/125-2日本ハム-業務委託共有/販売単価分析/JPNRGDPEXP.csv"
gdp_growth_file_path = "/content/drive/Shareddrives/125-2日本ハム-業務委託共有/販売単価分析/ritu-jk2342.csv"

def detect_encoding(path):
    """Guess a file's encoding from its raw bytes using chardet."""
    with open(path, 'rb') as file:
        return chardet.detect(file.read())['encoding']

# Detect encodings for the files whose encoding is not known in advance
金利_encoding = detect_encoding(金利_file_path)
print(f"Detected encoding for 金利_file_path: {金利_encoding}")

gdp_encoding = detect_encoding(gdp_file_path)
print(f"Detected encoding for gdp_file_path: {gdp_encoding}")

gdp_growth_encoding = detect_encoding(gdp_growth_file_path)
print(f"Detected encoding for gdp_growth_file_path: {gdp_growth_encoding}")

# Read data files with appropriate encodings
企業物価指数_df = pd.read_csv(企業物価指数_file_path, encoding='shift_jis', skiprows=1)
賃金_df = pd.read_csv(賃金_file_path, encoding='utf-8-sig')
金利_df = pd.read_csv(金利_file_path, encoding=金利_encoding, skiprows=9, header=None, sep=',')
gdp_df = pd.read_csv(gdp_file_path, encoding=gdp_encoding)
gdp_growth_df = pd.read_csv(gdp_growth_file_path, encoding=gdp_growth_encoding, skiprows=5, header=0)

Detected encoding for 金利_file_path: SHIFT_JIS
Detected encoding for gdp_file_path: ascii
Detected encoding for gdp_growth_file_path: SHIFT_JIS
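
When `chardet` is unavailable or its guess looks unreliable, one alternative is to try a short list of candidate encodings in order. The sketch below illustrates that swapped-in technique and is not the method used in this notebook: `detect_encoding_by_trial` and the default candidate list are hypothetical, chosen here because the files above were detected as Shift JIS or ASCII.

```python
from pathlib import Path

def detect_encoding_by_trial(path, candidates=("utf-8-sig", "shift_jis", "cp932")):
    """Return the first candidate that decodes the whole file, or None.

    Hypothetical helper: a successful decode does not prove the encoding
    is correct, only that the bytes are valid in that encoding.
    """
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None
```

The order of the candidates matters: stricter encodings such as UTF-8 should come first, since permissive ones accept almost any byte sequence.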


## 3. Date Standardization

Each dataset uses its own date format, which must be standardized to YYYYMM.

### 3.1 Basic Date Conversions

Simple date format standardization for straightforward datasets:

In [None]:
# Convert the first column of 企業物価指数_df
企業物価指数_df['系列名称'] = pd.to_datetime(企業物価指数_df['系列名称'], format='%Y/%m').dt.strftime('%Y%m')
企業物価指数_df.rename(columns={'系列名称': '年月'}, inplace=True)
企業物価指数_df.set_index('年月', inplace=True)

# Standardize dates for the wage data: convert the first column of 賃金_df
賃金_df['時点'] = pd.to_datetime(賃金_df['時点'], format='%Y年%m月').dt.strftime('%Y%m')
賃金_df.rename(columns={'時点': '年月'}, inplace=True)
賃金_df.set_index('年月', inplace=True)

# Convert the first column of 金利_df
金利_df[0] = pd.to_datetime(金利_df[0], format='%Y/%m').dt.strftime('%Y%m')
金利_df.rename(columns={0: '年月'}, inplace=True)
金利_df.set_index('年月', inplace=True)
# Name the first remaining column "金利（日本）" (interest rate, Japan)
金利_df = 金利_df.rename(columns={金利_df.columns[0]: "金利（日本）"})


# Convert the first column of gdp_df
gdp_df['DATE'] = pd.to_datetime(gdp_df['DATE']).dt.strftime('%Y%m')
gdp_df.rename(columns={'DATE': '年月'}, inplace=True)
gdp_df.set_index('年月', inplace=True)

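# The GDP growth header spans two rows: use the first data row as the column
# name wherever it is non-null, then drop that row from the data.
new_cols = []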
for col_name in gdp_growth_df.columns:
    if pd.notna(gdp_growth_df.loc[0, col_name]):
        new_cols.append(str(gdp_growth_df.loc[0, col_name]))
    else:
        new_cols.append(col_name)

gdp_growth_df.columns = new_cols
gdp_growth_df = gdp_growth_df.drop(index=0)
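
The parse, rename, and set-index pattern repeated for the datasets above could be wrapped in one helper. This is a minimal sketch rather than the notebook's own code: `to_yyyymm_index` is a hypothetical name, and the sample values are made up.

```python
import pandas as pd

def to_yyyymm_index(df, date_col, date_format=None, index_name="年月"):
    """Parse `date_col`, format it as YYYYMM, and use it as the index.

    Hypothetical helper; returns a new DataFrame and leaves `df` unchanged.
    """
    out = df.copy()
    parsed = pd.to_datetime(out[date_col], format=date_format)
    out[date_col] = parsed.dt.strftime("%Y%m")
    return out.rename(columns={date_col: index_name}).set_index(index_name)

# Made-up sample in the wage data's date format
sample = pd.DataFrame({"時点": ["1994年01月", "1994年02月"], "value": [99.5, 99.8]})
result = to_yyyymm_index(sample, "時点", date_format="%Y年%m月")
print(result.index.tolist())  # ['199401', '199402']
```

Returning a copy instead of using `inplace=True` makes it safe to re-run a cell without the second run failing on an already renamed column.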

### 3.2 Complex Date Processing

The GDP growth and CPI data require more complex processing:

In [None]:
# Extract the first column
column_data = gdp_growth_df.iloc[:, 0]
# Remove the leading spaces and trailing periods
column_data = column_data.str.strip().str.rstrip('.')

# Split each label at '/' into a year part and a month-range part
year_month = column_data.str.split('/', expand=True)
# Keep only the first month of the range (the text before '-')
year_month[1] = year_month[1].str.split('-', expand=True)[0].str.strip()
# Combine year and month and convert to datetime; unparseable labels become NaT
dates = pd.to_datetime(year_month[0] + ' ' + year_month[1], format='%Y %m', errors='coerce')
# Format the dates as 'yyyymm'
formatted_dates = dates.dt.strftime('%Y%m')

# Assign the formatted dates back to the first column
gdp_growth_df.iloc[:, 0] = formatted_dates
# Set the first column as the index
gdp_growth_df.set_index(gdp_growth_df.columns[0], inplace=True)
# Set the index name to "年月"
gdp_growth_df.index.name = "年月"

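# Drop the last three rows so the remaining rows match the quarterly labels built below
gdp_growth_df = gdp_growth_df.iloc[:-3]

# Build quarter-end labels as integers: 199403, 199406, ..., 202312
start_year = 1994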
start_month = 3
end_year = 2023
end_month = 12

gdp_growth_index = []
year = start_year
month = start_month
while year < end_year or (year == end_year and month <= end_month):
    gdp_growth_index.append(year * 100 + month)
    month += 3
    if month > 12:
        month = month % 12
        year += 1

# Replace the parsed dates with the generated quarter-end labels
gdp_growth_df.index = gdp_growth_index

# Drop all unnamed empty columns
gdp_growth_df = gdp_growth_df.loc[:, ~gdp_growth_df.columns.str.contains('^Unnamed')]

# Convert columns to numeric type
for col in gdp_growth_df.columns:
    gdp_growth_df[col] = pd.to_numeric(gdp_growth_df[col], errors='coerce')

# Read the Consumer Price Index (CPI) data: price changes in the goods and services that consumers purchase
cpi_data = pd.read_excel("/content/drive/Shareddrives/125-2日本ハム-業務委託共有/販売単価分析/bm01-1.xlsx", skiprows=13)
cpi_data = cpi_data.iloc[:, 8:]

cpi_column_names = pd.read_excel("/content/drive/Shareddrives/125-2日本ハム-業務委託共有/販売単価分析/bm01-1.xlsx", skiprows=8, nrows=10)
cpi_column_names = cpi_column_names.iloc[:, 8:]

# Get the original column names from cpi_column_names
original_columns = cpi_column_names.columns.tolist()

# Create a list to store the new column names
new_columns = []

# Iterate over the columns starting from the fifth column
for col in original_columns[4:]:
    # Get the unique values from the header rows of the current column (up to 10 rows were read)
    unique_values = cpi_column_names[col].unique()
    # Filter out NaN values
    unique_values = [str(val) for val in unique_values if not pd.isnull(val)]
    # Join the unique values with underscores to create the new column name
    new_col_name = '_'.join(unique_values)
    # Append the new column name to the list
    new_columns.append(new_col_name)

# Combine the first four column names from cpi_data with the new column names
final_columns = cpi_data.columns[:4].tolist() + new_columns

# Assign the final_columns to the cpi_data DataFrame
cpi_data.columns = final_columns

# Extract the year and month from the first column
extracted_data = cpi_data.iloc[:, 0].str.extract(r'(\d{4})年(\d{1,2})月')

# Combine the year and month columns into the desired format
cpi_data['年月'] = extracted_data.apply(lambda x: f"{x[0]}{x[1]:0>2}", axis=1)

# Set the formatted column as the index
cpi_data.set_index('年月', inplace=True)


# Create a datetime index in the format yyyymm from 199401 to 202312
start_date = '1994-01-01'
end_date = '2023-12-31'
freq = 'MS'  # Monthly frequency, start of each month
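
The header-flattening loop used for the CPI sheet can be isolated and tried on a toy table. The sketch below mirrors that loop under stated assumptions: `flatten_header_rows` is a hypothetical name, and the labels are made-up stand-ins for the real category headers.

```python
import pandas as pd

def flatten_header_rows(header_df):
    """Join each column's unique non-null header values with underscores."""
    names = []
    for col in header_df.columns:
        parts = [str(v) for v in header_df[col].unique() if not pd.isnull(v)]
        names.append("_".join(parts))
    return names

# Toy header block: each column holds one level of the header per row
header = pd.DataFrame({
    "col_a": ["All items", None, None],
    "col_b": ["Food", "Fresh food", None],
})
print(flatten_header_rows(header))  # ['All items', 'Food_Fresh food']
```

Note that `unique()` keeps the order of first appearance, so the joined name follows the header rows from top to bottom.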

### 3.3 Date Index Generation

Create a consistent monthly date range and expand the quarterly GDP growth data onto it:

In [None]:
# Generate the datetime index
datetime_index = pd.date_range(start=start_date, end=end_date, freq=freq)

# Format the datetime index as yyyymm
yyyymm_index = datetime_index.strftime('%Y%m')

# Get the original index
original_index = gdp_growth_df.index

# Create a new index by repeating each index label 3 times
new_index = np.repeat(original_index, 3)

# Reindex the DataFrame with the new index
gdp_growth_df = gdp_growth_df.loc[new_index]

# Discard the repeated quarterly labels before assigning the monthly index
gdp_growth_df = gdp_growth_df.reset_index(drop=True)

# Assign the monthly YYYYMM index (360 rows: 120 quarters x 3 months)
gdp_growth_df.index = yyyymm_index

gdp_growth_df.index.name = '年月'
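
The quarterly-to-monthly expansion can be reproduced on a small frame. This is a minimal sketch with made-up growth values; it also derives the quarter-end labels by slicing the monthly range, which is an alternative to the `while` loop used earlier, not the notebook's own approach.

```python
import pandas as pd

# Monthly labels for 1994-01 through 2023-12
months = pd.date_range("1994-01-01", "2023-12-01", freq="MS").strftime("%Y%m")

# Every third month (March, June, September, December) is a quarter-end label
quarter_labels = months[2::3]

# Toy quarterly frame with made-up growth values for the first two quarters
quarterly = pd.DataFrame({"growth": [0.5, -0.2]}, index=quarter_labels[:2])

# Repeat each quarterly row three times, then relabel with the matching months
monthly = quarterly.loc[quarterly.index.repeat(3)].copy()
monthly.index = months[: len(monthly)]
monthly.index.name = "年月"
print(monthly)
```

Each quarterly value is copied unchanged to the three months of its quarter, so the result is a step function rather than an interpolation.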

## 4. Data Integration
Merge all datasets on the 年月 index and convert the remaining columns to numeric types.

In [None]:
# Merge all dataframes
merged_df = pd.concat([企業物価指数_df, 賃金_df, cpi_data, 金利_df, gdp_df, gdp_growth_df], axis=1)

# Sort by date index
merged_df = merged_df.sort_index(ascending=True)

# Clean up: drop the non-numeric 地域 (region) column
merged_df.drop(columns=['地域'], inplace=True)

# Convert all columns to numeric

# Identify columns of type 'object'
object_cols = merged_df.select_dtypes(include=['object']).columns

# Convert object columns to float, coercing errors to NaN
for col in object_cols:
    merged_df[col] = pd.to_numeric(merged_df[col], errors='coerce')
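
Since `pd.concat(..., axis=1)` aligns rows on the index, a quarterly series that is not expanded first leaves gaps in the monthly result. The sketch below shows this on made-up data; it assumes the quarterly dates fall on one month per quarter, and the forward fill at the end is an optional step that the notebook does not perform.

```python
import pandas as pd

# Made-up monthly and quarterly series sharing a YYYYMM string index
monthly = pd.DataFrame({"cpi": [100.0, 100.2, 100.1, 100.3]},
                       index=["199401", "199402", "199403", "199404"])
quarterly = pd.DataFrame({"gdp": [500.0, 505.0]}, index=["199401", "199404"])

# axis=1 aligns rows on the index and keeps the union of labels (outer join)
merged = pd.concat([monthly, quarterly], axis=1).sort_index()
print(merged["gdp"].isna().sum())  # 2 months have no quarterly observation

# Optional: carry each quarterly value forward to the following months
filled = merged.assign(gdp=merged["gdp"].ffill())
print(filled)
```

Alignment only works when every frame uses the same index type, so integer labels such as `199401` and string labels such as `'199401'` would end up on separate rows.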

In [None]:
# List the GDP growth column names to check the rebuilt headers
for col in gdp_growth_df.columns:
    print(col)

GDP(Expenditure Approach)
PrivateConsumption
Consumption ofHouseholds
ExcludingImputed Rent
PrivateResidentialInvestment
Private Non-Resi.Investment
Changein PrivateInventories
GovernmentConsumption
PublicInvestment
Changein PublicInventories
Net Exports
Exports
Imports
TradingGains/Losses
GDI
Net
Receipt
Payment
GNI
DomesticDemand
PrivateDemand
PublicDemand
Gross Fixed CapitalFormation
Final Sales of Domestic Product


## 5. Export Process
Save the final dataset as a CSV file and download it from Colab.

In [None]:
# Define the CSV filename
csv_filename = '販売単価分析.csv'

# Convert the DataFrame to a CSV file
merged_df.to_csv(csv_filename, index=True)

# Use the Colab file download utility to download the CSV file to your local system
files.download(csv_filename)
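
When the exported file is read back, pandas will by default parse YYYYMM labels as integers. The following minimal sketch, using made-up values and an in-memory buffer in place of the real file, shows one way to keep the 年月 index as strings on reload.

```python
import io
import pandas as pd

# Made-up frame with the same index convention as the exported dataset
df = pd.DataFrame({"cpi": [100.0, 100.2]},
                  index=pd.Index(["199401", "199402"], name="年月"))

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=True)
buffer.seek(0)

# Read the date column as text first, then promote it to the index
restored = pd.read_csv(buffer, dtype={"年月": str}).set_index("年月")
print(restored.index.tolist())  # ['199401', '199402']
```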
