### Description
This challenge serves as the final project for the "How to win a data science competition" Coursera course.

In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company. 

We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.

In [35]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [36]:
# Dates in this file are written day first (DD.MM.YYYY), so parse them accordingly
train = pd.read_csv("sales_train.csv", parse_dates=["date"], index_col="date", dayfirst=True)
train

date,date_block_num,shop_id,item_id,item_price,item_cnt_day
02.01.2013,0,59,22154,999.00,1.0
03.01.2013,0,25,2552,899.00,1.0
05.01.2013,0,25,2552,899.00,-1.0
06.01.2013,0,25,2554,1709.05,1.0
15.01.2013,0,25,2555,1099.00,1.0
...,...,...,...,...,...
10.10.2015,33,25,7409,299.00,1.0
09.10.2015,33,25,7460,299.00,1.0
14.10.2015,33,25,7459,349.00,1.0
22.10.2015,33,25,7440,299.00,1.0
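
The rows above show two things worth handling early: the dates are written day first (`DD.MM.YYYY`), and one row has a negative `item_cnt_day`. The following is a minimal sketch on a hand-copied sample of those rows, not on the full file; treating negative counts as returns is an assumption, since the output alone does not say what they mean.

```python
import pandas as pd

# Five rows copied by hand from the output above (illustrative sample only)
sample = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", "05.01.2013", "06.01.2013", "15.01.2013"],
    "item_id": [22154, 2552, 2552, 2554, 2555],
    "item_cnt_day": [1.0, 1.0, -1.0, 1.0, 1.0],
})

# An explicit format avoids any day/month ambiguity
sample["date"] = pd.to_datetime(sample["date"], format="%d.%m.%Y")

# Rows with a negative daily count (assumed here to be returns)
negative_rows = sample[sample["item_cnt_day"] < 0]

print(sample["date"].dt.month.unique())
print(len(negative_rows), "row(s) with a negative count")
```

With the explicit format every sample date lands in January, which would not be the case if `02.01.2013` were read as February 1st.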


In [37]:
items = pd.read_csv("items.csv")
items.head()

,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


In [38]:
item_categories = pd.read_csv("item_categories.csv")
item_categories.head()

,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


In [39]:
shops = pd.read_csv("shops.csv")
shops.head()

,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


In [40]:
test = pd.read_csv("test.csv")
test.head()

,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


In [41]:
sample_submission = pd.read_csv("sample_submission.csv")
sample_submission.head()

,ID,item_cnt_month
0,0,0.5
1,1,0.5
2,2,0.5
3,3,0.5
4,4,0.5
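
The sample submission suggests the expected layout: one row per test `ID` with a single `item_cnt_month` value. Below is a rough sketch of a sanity check built on that assumption. `check_submission` is a hypothetical helper written for this note, and the small frames are made up rather than loaded from the competition files.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, test: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the layout looks fine."""
    problems = []
    if list(submission.columns) != ["ID", "item_cnt_month"]:
        problems.append("columns must be exactly ['ID', 'item_cnt_month']")
        return problems
    if submission["ID"].duplicated().any():
        problems.append("duplicate ID values")
    if set(submission["ID"]) != set(test["ID"]):
        problems.append("IDs do not match the test set")
    if submission["item_cnt_month"].isna().any():
        problems.append("missing predictions")
    return problems

# Toy frames for illustration only
toy_test = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [5, 5, 5], "item_id": [5037, 5320, 5233]})
good = pd.DataFrame({"ID": [0, 1, 2], "item_cnt_month": [0.5, 0.5, 0.5]})
bad = pd.DataFrame({"ID": [0, 0, 1], "item_cnt_month": [0.5, None, 0.5]})

print(check_submission(good, toy_test))
print(check_submission(bad, toy_test))
```

Running a check like this just before `to_csv` catches duplicated or missing IDs, which are easy to introduce when merging predictions onto the test set.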


In [42]:
# Build an example time-series dataset (synthetic random noise, not the competition data)
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
dates = pd.date_range(start='2013-01-02', end='2015-10-03', freq='D')
data = np.random.randn(len(dates))
train = pd.DataFrame({'Date': dates, 'item_cnt_day': data})

# Set up time-series cross-validation
n_splits = 5  # Number of splits
tscv = TimeSeriesSplit(n_splits=n_splits)

# Create the model
model = LinearRegression()

# Loop over the folds, fitting and evaluating the model on each one
fold = 1
test_pred = []
for train_index, test_index in tscv.split(train):
    train_data, test_data = train.iloc[train_index], train.iloc[test_index]

    # The only feature is the row position (the integer index of the frame)
    X_train, X_test = np.array(train_data.index).reshape(-1, 1), np.array(test_data.index).reshape(-1, 1)
    y_train, y_test = train_data['item_cnt_day'], test_data['item_cnt_day']

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Caution: the model was fit on row positions, so predictions made from
    # item_id values are not meaningful sales forecasts
    y_pred1 = model.predict(test['item_id'].values.reshape(-1, 1))

    test_pred.append(y_pred1)

    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)  # RMSE calculation
    print(f"Fold {fold}, Root Mean Squared Error: {rmse}")

    fold += 1


Fold 1, Root Mean Squared Error: 0.9703093954456925
Fold 2, Root Mean Squared Error: 1.05261277186597
Fold 3, Root Mean Squared Error: 1.074270597425146
Fold 4, Root Mean Squared Error: 1.0007813040862545
Fold 5, Root Mean Squared Error: 1.0001820185444406
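
To make the splitting scheme behind these folds concrete, here is a minimal sketch on a toy array of 12 time steps. The array and `n_splits=3` are illustrative choices, not the settings used above. Each fold trains on an expanding prefix and tests on the block that immediately follows it, so no later observation is used for training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

toy = np.arange(12)  # 12 consecutive time steps (illustrative)
toy_tscv = TimeSeriesSplit(n_splits=3)

# Collect the index lists for each fold so they are easy to inspect
folds = [(tr.tolist(), te.tolist()) for tr, te in toy_tscv.split(toy)]

for i, (tr, te) in enumerate(folds, start=1):
    print(f"Fold {i}: train={tr} test={te}")
```

In every fold the largest training index is smaller than the smallest test index, which is the property that distinguishes this splitter from a shuffled K-fold.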


In [43]:
import pandas as pd

# Load the real training data (this replaces the synthetic frame from the previous cell)
train = pd.read_csv("sales_train.csv")

# Sum 'item_cnt_day' per month, shop and item to build the 'item_cnt_month' column
monthly_sales = train.groupby(['date_block_num', 'shop_id', 'item_id'])['item_cnt_day'].sum().reset_index()
monthly_sales.rename(columns={'item_cnt_day': 'item_cnt_month'}, inplace=True)

# Take a look at the first rows of the resulting data frame
print(monthly_sales.head())

# Build the submission file
test = pd.read_csv("test.csv")

# monthly_sales has one row per month for each (shop, item) pair, so merging the
# whole table would repeat IDs. Assumption: use the most recent month as a naive
# forecast for the next month.
last_month = monthly_sales[monthly_sales['date_block_num'] == monthly_sales['date_block_num'].max()]

# Attach 'item_cnt_month' to the test set and keep only the submission columns
submission = test.merge(last_month[['shop_id', 'item_id', 'item_cnt_month']], on=['shop_id', 'item_id'], how='left')[['ID', 'item_cnt_month']]

# Fill missing values (NaN) with zero
submission['item_cnt_month'] = submission['item_cnt_month'].fillna(0)

# Save the submission as a CSV file
submission.to_csv("submission.csv", index=False)


   date_block_num  shop_id  item_id  item_cnt_month
0               0        0       32             6.0
1               0        0       33             3.0
2               0        0       35             1.0
3               0        0       43             1.0
4               0        0       51             2.0
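
Note that `monthly_sales` holds one row per month for each (shop, item) pair, so a merge of the whole table onto `test` would return several rows per `ID`. The sketch below uses small made-up frames (the `toy_` names and values are assumptions, not competition data) to show the effect and one possible remedy: keeping only the most recent month as a naive forecast.

```python
import pandas as pd

# Toy monthly totals: item 100 was sold in two different months
toy_monthly = pd.DataFrame({
    "date_block_num": [32, 33, 33],
    "shop_id": [5, 5, 5],
    "item_id": [100, 100, 200],
    "item_cnt_month": [2.0, 3.0, 1.0],
})
toy_test = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [5, 5, 5], "item_id": [100, 200, 300]})

# Merging the whole table repeats ID 0, once per month in which item 100 was sold
naive = toy_test.merge(toy_monthly, on=["shop_id", "item_id"], how="left")

# Keep only the latest month so each (shop, item) pair matches at most one row
latest = toy_monthly[toy_monthly["date_block_num"] == toy_monthly["date_block_num"].max()]
toy_submission = toy_test.merge(
    latest[["shop_id", "item_id", "item_cnt_month"]],
    on=["shop_id", "item_id"],
    how="left",
)[["ID", "item_cnt_month"]]
toy_submission["item_cnt_month"] = toy_submission["item_cnt_month"].fillna(0)

print(len(naive), "rows after the naive merge")
print(toy_submission)
```

Other choices, such as averaging the last few months, would also give one row per `ID`; the latest-month rule is used here only because it is the simplest.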
