<details>
    <summary><h1>Datasets</h1></summary>

    oil.csv : date (date), dcoilwtico (crude oil price, USD)

    sample_submission.csv : id (row ID), sales (predicted sales)

    holidays_events.csv : date (date), type (event type), locale (event scope), locale_name (region the event applies to), description (event name), transferred (transferred-holiday flag)

    stores.csv : store_nbr (store number), city (city of the store), state (state of the store), type (store type), cluster (cluster number)

    train.csv : id (row ID), date (date), store_nbr (store number), family (product family), sales (total sales of the product family), onpromotion (number of promoted items in the product family)
    
    test.csv : id (row ID), date (date), store_nbr (store number), family (product family), onpromotion (number of promoted items in the product family)

    transactions.csv : date (date), store_nbr (store number), transactions (number of transactions)
</details>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Library imports

In [None]:
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from xgboost import XGBRegressor
from catboost import CatBoostRegressor, Pool

# Load the datasets

In [None]:
oil = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/oil.csv", parse_dates = ["date"])
sub = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/sample_submission.csv")
holidays = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv", parse_dates = ["date"])
stores = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/stores.csv")
train = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/train.csv", parse_dates = ["date"])
test = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/test.csv", parse_dates = ["date"])
transactions = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/transactions.csv", parse_dates = ["date"])

<details>
    <summary><h1>Dataset structure check (date coverage)</h1></summary>

    oil.csv : 2013-01-01 ~ 2017-08-31 → usable (covers both train and test)

    holidays_events.csv : 2012-03-02 ~ 2017-12-26 → usable (non-event days are filled in)

    transactions.csv : 2013-01-01 ~ 2017-08-15 → not usable (does not cover test)

    train : 2013-01-01 ~ 2017-08-15
    
    test : 2017-08-16 ~ 2017-08-31
</details>
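
The coverage check above can be reproduced programmatically. The following is a minimal sketch on toy frames; the helper name `date_coverage` and the toy dates are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def date_coverage(frames):
    """Return the first and last date of every frame, one row per frame."""
    rows = {name: (df["date"].min(), df["date"].max()) for name, df in frames.items()}
    return pd.DataFrame.from_dict(rows, orient="index", columns=["start", "end"])

# Toy stand-ins for the real files (assumption: illustrative dates only).
toy_train = pd.DataFrame({"date": pd.to_datetime(["2017-08-14", "2017-08-15"])})
toy_test = pd.DataFrame({"date": pd.to_datetime(["2017-08-16", "2017-08-31"])})
toy_transactions = pd.DataFrame({"date": pd.to_datetime(["2017-08-14", "2017-08-15"])})

coverage = date_coverage(
    {"train": toy_train, "test": toy_test, "transactions": toy_transactions}
)
# A table can serve as a test-time feature only if it reaches the last test date.
covers_test = coverage["end"] >= coverage.loc["test", "end"]
print(coverage.assign(covers_test=covers_test))
```

Applied to the real frames, the same comparison would flag transactions.csv as ending before the test period.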

In [None]:
oil

In [None]:
holidays

In [None]:
stores

In [None]:
train

In [None]:
test

In [None]:
transactions

<details>
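    <summary><h1>Feature engineering (training data)</h1></summary>
    
    year : year
    
    month : month
    
    day : day of month
    
    dayofweek : day of week
    
    week : ISO week number
    
    M : year-month

    city : city of the store

    state : state of the store

    type (type_x) : store type

    cluster : cluster number

    dcoilwtico : crude oil price (USD)

    type (type_y) : event type

    transferred : transferred-holiday flag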
</details>
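
To make the date features listed above concrete, here is a small sketch on a single known date (2017-08-15, the last training day). The toy frame is an assumption for illustration; the accessors are the same pandas `.dt` calls the notebook uses.

```python
import pandas as pd

toy = pd.DataFrame({"date": pd.to_datetime(["2017-08-15"])})

toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy["day"] = toy["date"].dt.day
toy["dayofweek"] = toy["date"].dt.dayofweek         # Monday = 0, so Tuesday = 1
toy["week"] = toy["date"].dt.isocalendar().week     # ISO 8601 week number
toy["M"] = toy["date"].dt.to_period("M")            # year-month period

print(toy.iloc[0].to_dict())
```

Note that `dayofweek` starts at 0 for Monday and that `week` follows the ISO calendar, so early-January dates can belong to week 52 or 53 of the previous year.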

In [None]:
train['year'] = train['date'].dt.year
train['month'] = train['date'].dt.month
train['day'] = train['date'].dt.day
train['dayofweek'] = train['date'].dt.dayofweek
train['week'] = train['date'].dt.isocalendar().week
train['M'] = train['date'].dt.to_period('M')

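# Left joins keep every training row; the shared column name "type" becomes type_x (stores) and type_y (holidays)
train = train.merge(stores, on = "store_nbr", how = "left")
train = train.merge(oil, on = "date", how = "left")
# holidays_events.csv can list several events on one date, so this join can add rows
train = train.merge(holidays, on = "date", how = "left")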
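
A left join on `date` duplicates a training row whenever the right-hand table has more than one row for that date. The sketch below shows the effect on toy data and one possible guard; the de-duplication rule (keep the first event per date) is an assumption for illustration, not what the notebook does.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "date": pd.to_datetime(["2014-06-25", "2014-06-26"]),
    "sales": [10.0, 12.0],
})
# Two events on the same date (hypothetical event names).
toy_holidays = pd.DataFrame({
    "date": pd.to_datetime(["2014-06-25", "2014-06-25"]),
    "description": ["Event A", "Event B"],
})

merged = toy_train.merge(toy_holidays, on="date", how="left")
print(len(toy_train), "->", len(merged))  # the row count grows after the join

# Possible guard: keep one event per date and let pandas verify the key is unique.
deduped = toy_holidays.drop_duplicates(subset="date", keep="first")
safe = toy_train.merge(deduped, on="date", how="left", validate="many_to_one")
print(len(toy_train), "->", len(safe))
```

Comparing `len(train)` before and after the merge is a quick way to see whether this happens on the real data.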

In [None]:
train

<details>
    <summary><h1>Missing values</h1></summary>
    
    dcoilwtico : 955152 (31%) → forward fill → backward fill (when no earlier value exists) → 0 (0%)
    
    type_y : 2551824 (83%) → filled as a non-event day → 0 (0%)

    locale : 2551824 (83%) → filled as a non-event day → 0 (0%)

    locale_name : 2551824 (83%) → filled as a non-event day → 0 (0%)
    
    description : 2551824 (83%) → filled as a non-event day → 0 (0%)
    
    transferred : 2551824 (83%) → filled as a non-event day → 0 (0%)
</details>
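
The forward-then-backward fill used for `dcoilwtico` can be illustrated on a short series. The prices below are made-up toy values; only the `ffill().bfill()` chain mirrors the notebook.

```python
import numpy as np
import pandas as pd

# The first value is missing (nothing earlier to copy), and there is a gap in the middle.
prices = pd.Series([np.nan, 93.1, np.nan, np.nan, 92.9], name="dcoilwtico")

forward = prices.ffill()     # gaps take the most recent earlier value; the leading NaN remains
filled = forward.bfill()     # the leading NaN takes the first later value

print(pd.DataFrame({"raw": prices, "ffill": forward, "ffill_bfill": filled}))
```

The backward fill only matters for rows before the first observed price, which is why it runs second.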

In [None]:
train.isnull().sum()

In [None]:
missing_ratio = train.isnull().sum() / len(train)
missing_ratio

In [None]:
train['dcoilwtico'] = train['dcoilwtico'].ffill().bfill()
train["type_y"] = train["type_y"].fillna("NormalDay")
train["locale"] = train["locale"].fillna("None")
train["locale_name"] = train["locale_name"].fillna("None")
train["description"] = train["description"].fillna("NoEvent")
train["transferred"] = train["transferred"].astype("boolean").fillna(False)

In [None]:
train.isnull().sum()

In [None]:
train

<details>
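    <summary><h1>Feature analysis</h1></summary>

    Numeric : correlation coefficient with sales

    Categorical : statistical test (one-way ANOVA)
    
    Time-series columns (date, year, month, day, dayofweek, week, M) → treated as categorical (statistical test) → p-value < 0.05 and F-value > 1000 → significant (working hypothesis)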
</details>
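
The decision rule above (p-value < 0.05 and F-value > 1000) can be wrapped in one helper and applied to any column. This is a sketch on synthetic data; the function name `anova_by_column` and the toy frame are assumptions, while `scipy.stats.f_oneway` is the same test the notebook calls.

```python
import numpy as np
import pandas as pd
from scipy import stats

def anova_by_column(df, column, target="sales", alpha=0.05, f_threshold=1000):
    """One-way ANOVA of `target` across the levels of `column`."""
    groups = [g.to_numpy() for _, g in df.groupby(column)[target]]
    f_val, p_val = stats.f_oneway(*groups)
    significant = bool(p_val < alpha and f_val > f_threshold)
    return {"column": column, "F": f_val, "p": p_val, "significant": significant}

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "family": np.repeat(["A", "B", "C"], 200),
    "dayofweek": np.tile(np.arange(4), 150),
})
# By construction, family shifts the mean strongly and dayofweek has no effect.
level = toy["family"].map({"A": 0.0, "B": 50.0, "C": 100.0})
toy["sales"] = level + rng.normal(0, 1, len(toy))

report = pd.DataFrame([anova_by_column(toy, c) for c in ["family", "dayofweek"]])
print(report)
```

With millions of rows almost any p-value falls below 0.05, which is presumably why the rule adds an F-value cutoff; the value 1000 itself is a heuristic.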

<details>
    <summary><h3>Numeric (correlation with sales)</h3></summary>

    onpromotion : 0.427923 → keep
    
    dcoilwtico : -0.074808 → drop
</details>

In [None]:
numeric_cols = ['sales', 'onpromotion', 'dcoilwtico']
corr_matrix = train[numeric_cols].corr()
corr_matrix["sales"].sort_values(ascending = False)

<details>
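    <summary><h3>Categorical</h3></summary>

    family : significant → keep

    store_nbr : significant → keep

    date : not significant → drop

    year : significant → keep

    month : not significant → drop

    day : not significant → drop

    dayofweek : significant → keep

    week : not significant → drop

    M : not significant → drop

    city : significant → keep

    state : significant → keep

    type_x : significant → keep

    cluster : significant → keep

    type_y : not significant → drop

    locale : not significant → drop

    locale_name : not significant → drop

    description : not significant → drop

    transferred : not significant → drop

    type_x → type (renamed)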
</details>

In [None]:
groups_family = [train.loc[train['family'] == fam, 'sales'] for fam in train['family'].unique()]
f_val_family, p_val_family = stats.f_oneway(*groups_family)
print('ANOVA F-value (family):', f_val_family)
print('ANOVA p-value (family):', p_val_family)
if p_val_family < 0.05 and f_val_family > 1000:
    print("Product family == significant")
else:
    print("Product family != significant")

In [None]:
groups_store = [train.loc[train['store_nbr'] == store, 'sales'] for store in train['store_nbr'].unique()]
f_val_store, p_val_store = stats.f_oneway(*groups_store)
print('ANOVA F-value (store_nbr):', f_val_store)
print('ANOVA p-value (store_nbr):', p_val_store)
if p_val_store < 0.05 and f_val_store > 1000:
    print("Store number == significant")
else:
    print("Store number != significant")

In [None]:
groups_date = [train.loc[train['date'] == date, 'sales'] for date in train['date'].unique()]
f_val_date, p_val_date = stats.f_oneway(*groups_date)
print('ANOVA F-value (date):', f_val_date)
print('ANOVA p-value (date):', p_val_date)
if p_val_date < 0.05 and f_val_date > 1000:
    print("Date == significant")
else:
    print("Date != significant")

In [None]:
groups_year = [train.loc[train['year'] == year, 'sales'] for year in train['year'].unique()]
f_val_year, p_val_year = stats.f_oneway(*groups_year)
print('ANOVA F-value (year):', f_val_year)
print('ANOVA p-value (year):', p_val_year)
if p_val_year < 0.05 and f_val_year > 1000:
    print("Year == significant")
else:
    print("Year != significant")

In [None]:
groups_month = [train.loc[train['month'] == month, 'sales'] for month in train['month'].unique()]
f_val_month, p_val_month = stats.f_oneway(*groups_month)
print('ANOVA F-value (month):', f_val_month)
print('ANOVA p-value (month):', p_val_month)
if p_val_month < 0.05 and f_val_month > 1000:
    print("Month == significant")
else:
    print("Month != significant")

In [None]:
groups_day = [train.loc[train['day'] == day, 'sales'] for day in train['day'].unique()]
f_val_day, p_val_day = stats.f_oneway(*groups_day)
print('ANOVA F-value (day):', f_val_day)
print('ANOVA p-value (day):', p_val_day)
if p_val_day < 0.05 and f_val_day > 1000:
    print("Day == significant")
else:
    print("Day != significant")

In [None]:
groups_dayofweek = [train.loc[train['dayofweek'] == dayofweek, 'sales'] for dayofweek in train['dayofweek'].unique()]
f_val_dayofweek, p_val_dayofweek = stats.f_oneway(*groups_dayofweek)
print('ANOVA F-value (dayofweek):', f_val_dayofweek)
print('ANOVA p-value (dayofweek):', p_val_dayofweek)
if p_val_dayofweek < 0.05 and f_val_dayofweek > 1000:
    print("Day of week == significant")
else:
    print("Day of week != significant")

In [None]:
groups_week = [train.loc[train['week'] == week, 'sales'] for week in train['week'].unique()]
f_val_week, p_val_week = stats.f_oneway(*groups_week)
print('ANOVA F-value (week):', f_val_week)
print('ANOVA p-value (week):', p_val_week)
if p_val_week < 0.05 and f_val_week > 1000:
    print("Week number == significant")
else:
    print("Week number != significant")

In [None]:
groups_M = [train.loc[train['M'] == M, 'sales'] for M in train['M'].unique()]
f_val_M, p_val_M = stats.f_oneway(*groups_M)
print('ANOVA F-value (M):', f_val_M)
print('ANOVA p-value (M):', p_val_M)
if p_val_M < 0.05 and f_val_M > 1000:
    print("Year-month == significant")
else:
    print("Year-month != significant")

In [None]:
groups_city = [train.loc[train['city'] == city, 'sales'] for city in train['city'].unique()]
f_val_city, p_val_city = stats.f_oneway(*groups_city)
print('ANOVA F-value (city):', f_val_city)
print('ANOVA p-value (city):', p_val_city)
if p_val_city < 0.05 and f_val_city > 1000:
    print("Store city == significant")
else:
    print("Store city != significant")

In [None]:
groups_state = [train.loc[train['state'] == state, 'sales'] for state in train['state'].unique()]
f_val_state, p_val_state = stats.f_oneway(*groups_state)
print('ANOVA F-value (state):', f_val_state)
print('ANOVA p-value (state):', p_val_state)
if p_val_state < 0.05 and f_val_state > 1000:
    print("Store state == significant")
else:
    print("Store state != significant")

In [None]:
groups_type_x = [train.loc[train['type_x'] == type_x, 'sales'] for type_x in train['type_x'].unique()]
f_val_type_x, p_val_type_x = stats.f_oneway(*groups_type_x)
print('ANOVA F-value (type_x):', f_val_type_x)
print('ANOVA p-value (type_x):', p_val_type_x)
if p_val_type_x < 0.05 and f_val_type_x > 1000:
    print("Store type == significant")
else:
    print("Store type != significant")

In [None]:
groups_cluster = [train.loc[train['cluster'] == cluster, 'sales'] for cluster in train['cluster'].unique()]
f_val_cluster, p_val_cluster = stats.f_oneway(*groups_cluster)
print('ANOVA F-value (cluster):', f_val_cluster)
print('ANOVA p-value (cluster):', p_val_cluster)
if p_val_cluster < 0.05 and f_val_cluster > 1000:
    print("Cluster number == significant")
else:
    print("Cluster number != significant")

In [None]:
groups_type_y = [train.loc[train['type_y'] == type_y, 'sales'] for type_y in train['type_y'].unique()]
f_val_type_y, p_val_type_y = stats.f_oneway(*groups_type_y)
print('ANOVA F-value (type_y):', f_val_type_y)
print('ANOVA p-value (type_y):', p_val_type_y)
if p_val_type_y < 0.05 and f_val_type_y > 1000:
    print("Event type == significant")
else:
    print("Event type != significant")

In [None]:
groups_locale = [train.loc[train['locale'] == locale, 'sales'] for locale in train['locale'].unique()]
f_val_locale, p_val_locale = stats.f_oneway(*groups_locale)
print('ANOVA F-value (locale):', f_val_locale)
print('ANOVA p-value (locale):', p_val_locale)
if p_val_locale < 0.05 and f_val_locale > 1000:
    print("Event scope == significant")
else:
    print("Event scope != significant")

In [None]:
groups_locale_name = [train.loc[train['locale_name'] == locale_name, 'sales'] for locale_name in train['locale_name'].unique()]
f_val_locale_name, p_val_locale_name = stats.f_oneway(*groups_locale_name)
print('ANOVA F-value (locale_name):', f_val_locale_name)
print('ANOVA p-value (locale_name):', p_val_locale_name)
if p_val_locale_name < 0.05 and f_val_locale_name > 1000:
    print("Event region == significant")
else:
    print("Event region != significant")

In [None]:
groups_description = [train.loc[train['description'] == description, 'sales'] for description in train['description'].unique()]
f_val_description, p_val_description = stats.f_oneway(*groups_description)
print('ANOVA F-value (description):', f_val_description)
print('ANOVA p-value (description):', p_val_description)
if p_val_description < 0.05 and f_val_description > 1000:
    print("Event name == significant")
else:
    print("Event name != significant")

In [None]:
groups_transferred = [train.loc[train['transferred'] == transferred, 'sales'] for transferred in train['transferred'].unique()]
f_val_transferred, p_val_transferred = stats.f_oneway(*groups_transferred)
print('ANOVA F-value (transferred):', f_val_transferred)
print('ANOVA p-value (transferred):', p_val_transferred)
if p_val_transferred < 0.05 and f_val_transferred > 1000:
    print("Transferred-holiday flag == significant")
else:
    print("Transferred-holiday flag != significant")

In [None]:
train = train.drop(columns = ['id', 'dcoilwtico', 'date', 'month', 'day', 'week', 'M', 'type_y', 'locale', 'locale_name', 'description', 'transferred']) # id is not needed as a feature
train = train.rename(columns={"type_x": "type"})

In [None]:
train

<details>
    <summary><h1>Feature engineering (test data)</h1></summary>
    
    year : year
    
    dayofweek : day of week

    city : city of the store

    state : state of the store

    type : store type

    cluster : cluster number
</details>

In [None]:
test['year'] = test['date'].dt.year
test['dayofweek'] = test['date'].dt.dayofweek

test = test.merge(stores, on = "store_nbr", how = "left")
test = test.drop(columns = ['id', 'date'])

In [None]:
test

In [None]:
test.isnull().sum()

In [None]:
missing_ratio = test.isnull().sum() / len(test)
missing_ratio

<details>
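    <summary><h1>Preprocessing</h1></summary>
    
    Categorical → one-hot encoding (LR), declared as category dtype (LGB)

    Numeric → standardization (LR)

    Explanatory variables → every column except sales

    Target variable → sales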
</details>
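
Two details matter when the same preprocessing is applied to train and test: the one-hot columns must line up, and scaling statistics should come from the training data only. The sketch below shows both on toy frames using plain pandas; the toy values are assumptions, and the manual standardization stands in for `StandardScaler`, which also uses the population standard deviation.

```python
import pandas as pd

toy_train = pd.DataFrame({"year": [2016, 2017, 2017], "onpromotion": [0, 2, 4]})
toy_test = pd.DataFrame({"year": [2017, 2017], "onpromotion": [2, 6]})

# Training design matrix: the first level (2016) is dropped as the baseline.
X_train = pd.get_dummies(toy_train, columns=["year"], drop_first=True)

# Test holds a single year, so drop_first would delete its only dummy column.
# Encode every level instead, then keep exactly the training columns.
X_new = pd.get_dummies(toy_test, columns=["year"])
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

# Standardize with training statistics only (ddof=0 matches StandardScaler).
mean = toy_train["onpromotion"].mean()
std = toy_train["onpromotion"].std(ddof=0)
X_train["onpromotion"] = (toy_train["onpromotion"] - mean) / std
X_new["onpromotion"] = (toy_test["onpromotion"] - mean) / std

print(X_train)
print(X_new)
```

Fitting a separate scaler on the test set would map the same raw `onpromotion` value to different inputs at train and test time.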

In [None]:
def train_preprocessing(preprocessing_type):
    if preprocessing_type == "LR":
        training = pd.get_dummies(train, columns = ["store_nbr", "family", "year", "dayofweek", "city", "state", "type", "cluster"], drop_first = True)
        # Standardize the numeric columns; note that this also standardizes the target (sales)
        train_standard = StandardScaler()
        training[['sales', 'onpromotion']] = train_standard.fit_transform(training[['sales', 'onpromotion']])
    elif preprocessing_type == "LGB":
        training = train.copy()
        cat_cols = ["store_nbr", "family", "year", "dayofweek", "city", "state", "type", "cluster"]
        for col in cat_cols:
            training[col] = training[col].astype("category")
    elif preprocessing_type == "XGB":
        training = pd.get_dummies(train, columns = ["store_nbr", "family", "year", "dayofweek", "city", "state", "type", "cluster"], drop_first = True)
    elif preprocessing_type == "CAT":
        training = train.copy()
        cat_cols = ["store_nbr", "family", "year", "dayofweek", "city", "state", "type", "cluster"]
        for col in cat_cols:
            training[col] = training[col].astype("category")
    else:
        training = train.copy()
    return training

In [None]:
def test_preprocessing(preprocessing_type):
    if preprocessing_type == "LR":
        # Encode every level, then align to the training columns (X): test holds a single year,
        # so drop_first here would delete its only year dummy
        testing = pd.get_dummies(test, columns = ["store_nbr", "family", "year", "dayofweek", "city", "state", "type", "cluster"])
        testing = testing.reindex(columns = X.columns, fill_value = 0)
        # Scale onpromotion with statistics fitted on the training data, not on test
        onpromotion_scaler = StandardScaler().fit(train[['onpromotion']])
        testing[['onpromotion']] = onpromotion_scaler.transform(test[['onpromotion']])
    elif preprocessing_type == "LGB":
        testing = test.copy()
        cat_cols = ["store_nbr", "family", "year", "dayofweek", "city", "state", "type", "cluster"]
        for col in cat_cols:
            testing[col] = testing[col].astype("category")
    elif preprocessing_type == "XGB":
        # Same alignment as LR: encode every level, then keep exactly the training columns
        testing = pd.get_dummies(test, columns = ["store_nbr", "family", "year", "dayofweek", "city", "state", "type", "cluster"])
        testing = testing.reindex(columns = X.columns, fill_value = 0)
    elif preprocessing_type == "CAT":
        testing = test.copy()
        cat_cols = ["store_nbr", "family", "year", "dayofweek", "city", "state", "type", "cluster"]
        for col in cat_cols:
            testing[col] = testing[col].astype("category")
    else:
        testing = test.copy()
    return testing

In [None]:
# TYPE = "LR"
TYPE = "LGB"
# TYPE = "XGB"
# TYPE = "CAT"

In [None]:
training = train_preprocessing(TYPE)

In [None]:
X = training.drop(columns = ['sales']) # explanatory variables (everything except sales)
y = training['sales'] # target variable

In [None]:
X_test = test_preprocessing(TYPE)

<details>
    <summary><h1>Training</h1></summary>

    LinearRegression : linear regression

    lightgbm : gradient-boosted decision trees
</details>

In [None]:
def model_training(training_type):
    if training_type == "LR":
        model = LinearRegression()
        model.fit(X, y)
    elif training_type == "LGB":
        X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)
        train_dataset = lgb.Dataset(X_train, label = y_train)
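        # reference makes the validation set reuse the feature binning of the training set
        valid_dataset = lgb.Dataset(X_valid, label = y_valid, reference = train_dataset)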
        params = {
            'objective': 'regression',
            'metric': 'rmse',
            'verbosity': -1,
            'boosting_type': 'gbdt',
            'random_state': 42
        }
        model = lgb.train(
            params,
            train_dataset,
            valid_sets = [train_dataset, valid_dataset],
            num_boost_round = 1000,
        )
    elif training_type == "XGB":
        model = XGBRegressor(
            objective = 'reg:squarederror',
            n_estimators = 1000,
            learning_rate = 0.1,
            max_depth = 6,
            random_state = 42
        )
        model.fit(X, y)
    elif training_type == "CAT":
        cat_cols = ["store_nbr", "family", "year", "dayofweek", "city", "state", "type", "cluster"]
        X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)
        model = CatBoostRegressor(
            iterations = 1000,
            learning_rate = 0.1,
            depth = 6,
            loss_function = 'RMSE',
            eval_metric = 'RMSE',
            random_seed = 42,
            verbose = 100,
            early_stopping_rounds = 50
        )
        model.fit(
            X_train, y_train,
            cat_features = cat_cols,
            eval_set = (X_valid, y_valid)
        )
    else:
        model = None
    return model

In [None]:
model = model_training(TYPE)

<details>
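    <summary><h1>Prediction</h1></summary>

    MSE : mean squared error

    MAE : mean absolute error

    R2 score : coefficient of determination

    LinearRegression → MSE (train) : 0.4114683938074436 (standardized), MAE (train) : 0.2751832526539287 (standardized), R2 score (train) : 0.5885316061925563

    lightgbm → MSE (train) : 101061.59176976989 (not standardized), MAE (train) : 75.7370016875466 (not standardized), R2 score (train) : 0.9175735227668013

    MSE and MAE are not comparable between the two models because the targets are on different scales
    
    R2 score is comparable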
</details>
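
Why R2 survives standardization while MSE does not can be checked numerically. When the target and the predictions are shifted and scaled by the same constants, the residual and total sums of squares shrink by the same factor, so their ratio is unchanged; the MSE, by contrast, is divided by the target variance and equals 1 − R2 on the standardized scale. The numbers below are toy values chosen for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y = np.array([0.0, 100.0, 250.0, 400.0])
pred = np.array([20.0, 90.0, 230.0, 410.0])

# Apply the same standardization to the target and to the predictions.
mu, sigma = y.mean(), y.std()
y_std, pred_std = (y - mu) / sigma, (pred - mu) / sigma

print("MSE raw / standardized:", mse(y, pred), mse(y_std, pred_std))
print("R2  raw / standardized:", r2(y, pred), r2(y_std, pred_std))
```

The LinearRegression figures above follow the same identity: the standardized MSE (0.4115) and the R2 score (0.5885) sum to 1.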

In [None]:
train_predicted = model.predict(X)
mse_train = mean_squared_error(y, train_predicted)
mae_train = mean_absolute_error(y, train_predicted)
r2_train = r2_score(y, train_predicted)
print("MSE (train) :", mse_train)
print("MAE (train) :", mae_train)
print("R2 score (train) :", r2_train)

In [None]:
test_predicted = model.predict(X_test)

In [None]:
test_predicted

# Submission file

In [None]:
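# Clip negative predictions to 0 (sales cannot be negative) and keep fractional values instead of truncating to int
sub['sales'] = np.clip(test_predicted, 0, None)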
sub.to_csv('submission.csv', index = False)