In [1]:
# IMPORTS

import pandas as pd
import os

<h1 style='text-align: center'>Data Loading</h1>

## Reading The Data
To begin, I’ll load the two raw datasets, `train.csv` and `macro.csv`, into pandas DataFrames.
With the data in memory, I can explore its structure and contents, which will guide the data cleaning and preprocessing steps that follow.

In [2]:
train = pd.read_csv("data/raw_data/train.csv")
macro = pd.read_csv("data/raw_data/macro.csv")

Now, I’ll perform an initial inspection of both the training and macro datasets to understand their structure, dimensions, and variable types.

In [3]:
train.head()

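| | id | timestamp | full_sq | life_sq | floor | max_floor | material | build_year | num_room | kitch_sq | ... | cafe_count_5000_price_2500 | cafe_count_5000_price_4000 | cafe_count_5000_price_high | big_church_count_5000 | church_count_5000 | mosque_count_5000 | leisure_count_5000 | sport_count_5000 | market_count_5000 | price_doc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-08-20 | 43 | 27.0 | 4.0 | | | | | | ... | 9 | 4 | 0 | 13 | 22 | 1 | 0 | 52 | 4 | 5850000 |
| 1 | 2 | 2011-08-23 | 34 | 19.0 | 3.0 | | | | | | ... | 15 | 3 | 0 | 15 | 29 | 1 | 10 | 66 | 14 | 6000000 |
| 2 | 3 | 2011-08-27 | 43 | 29.0 | 2.0 | | | | | | ... | 10 | 3 | 0 | 11 | 27 | 0 | 4 | 67 | 10 | 5700000 |
| 3 | 4 | 2011-09-01 | 89 | 50.0 | 9.0 | | | | | | ... | 11 | 2 | 1 | 4 | 4 | 0 | 0 | 26 | 3 | 13100000 |
| 4 | 5 | 2011-09-05 | 77 | 77.0 | 4.0 | | | | | | ... | 319 | 108 | 17 | 135 | 236 | 2 | 91 | 195 | 14 | 16331452 |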


In [4]:
dimensions = {
    'Training dataset shape': train.shape,
    'Macro dataset shape': macro.shape
}
dimensions

{'Training dataset shape': (30471, 292), 'Macro dataset shape': (2484, 100)}

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30471 entries, 0 to 30470
Columns: 292 entries, id to price_doc
dtypes: float64(119), int64(157), object(16)
memory usage: 67.9+ MB


In [6]:
macro.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2484 entries, 0 to 2483
Data columns (total 100 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   timestamp                                   2484 non-null   object 
 1   oil_urals                                   2484 non-null   float64
 2   gdp_quart                                   2394 non-null   float64
 3   gdp_quart_growth                            2394 non-null   float64
 4   cpi                                         2453 non-null   float64
 5   ppi                                         2453 non-null   float64
 6   gdp_deflator                                2119 non-null   float64
 7   balance_trade                               2453 non-null   float64
 8   balance_trade_growth                        2394 non-null   float64
 9   usdrub                                      2481 non-null   float64
 10  eurrub     

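The initial inspection shows that the training dataset has 30,471 rows and 292 columns, while the macro dataset is much smaller, with 2,484 rows and 100 columns. The training data is predominantly numerical (119 `float64` and 157 `int64` columns); its 16 `object` columns include the `timestamp` strings, and the categorical ones among them may require encoding. In the macro columns listed above, everything except `timestamp` is `float64`, and several columns contain missing values (for example, `gdp_deflator` has 2,119 non-null entries out of 2,484).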
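Because `info()` output for a wide DataFrame is long and easy to misread, a compact summary of dtype counts and the most incomplete columns can help. The following is a minimal sketch on a toy DataFrame; the helper name `summarize`, its `top` parameter, and the toy values are my own assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame, top: int = 5) -> dict:
    """Return dtype counts and the columns with the most missing values.

    Hypothetical helper for illustration only.
    """
    # Count how many columns share each dtype (e.g. float64, int64).
    dtype_counts = df.dtypes.astype(str).value_counts().to_dict()
    # Count missing values per column and keep only columns that have any.
    missing = df.isna().sum()
    missing = missing[missing > 0].sort_values(ascending=False).head(top)
    return {"dtype_counts": dtype_counts, "most_missing": missing.to_dict()}


# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "timestamp": ["2011-08-20", "2011-08-23", "2011-08-27"],
    "full_sq": [43, 34, 43],
    "life_sq": [27.0, np.nan, 29.0],
    "build_year": [np.nan, np.nan, 1995.0],
})

summary = summarize(toy)
print(summary)
```

On the real data, the same call would be `summarize(train)` or `summarize(macro)`, assuming both DataFrames are already loaded.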

## Data Formatting
I’ll now prepare the datasets for merging. First, I’ll convert the `timestamp` column in both datasets to a datetime format so that the join keys have a consistent type. Then I’ll merge the training and macro datasets on `timestamp`, which enriches each training row with the macroeconomic features for its date. This step aligns the two sources and gives the modeling phase a single, consistent table to work from.

In [7]:
train['timestamp'] = pd.to_datetime(train['timestamp'])
macro['timestamp'] = pd.to_datetime(macro['timestamp'])

In [8]:
train['timestamp']

0       2011-08-20
1       2011-08-23
2       2011-08-27
3       2011-09-01
4       2011-09-05
           ...    
30466   2015-06-30
30467   2015-06-30
30468   2015-06-30
30469   2015-06-30
30470   2015-06-30
Name: timestamp, Length: 30471, dtype: datetime64[ns]
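The output above confirms the `datetime64[ns]` dtype. By default `pd.to_datetime` raises an error on a value it cannot parse, so a clean run already implies every string was valid. If the raw file might contain malformed dates, one option is `errors="coerce"`, which turns them into `NaT` so they can be counted. The sketch below shows this on a toy Series; the malformed entry is deliberately made up.

```python
import pandas as pd

# Toy stand-in for a raw timestamp column; the last entry is deliberately malformed.
raw = pd.Series(["2011-08-20", "2011-08-23", "not a date"], name="timestamp")

# errors="coerce" converts unparseable strings to NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Count the values that failed to parse and report the valid date range.
n_bad = int(parsed.isna().sum())
print(f"unparseable timestamps: {n_bad}")
print(f"date range: {parsed.min().date()} to {parsed.max().date()}")
```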

In [9]:
train_macro_combined = pd.merge(train, macro, on="timestamp", how="left")

In [10]:
train_macro_combined.shape

(30471, 391)

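After merging, I checked the shape of the DataFrame. The result is (30471, 391): the row count matches the original 30,471 training rows, and the column count rose from 292 to 391, which equals 292 + 100 - 1 because the shared `timestamp` key appears only once. The unchanged row count shows that the left join neither dropped nor duplicated training rows, and the column count shows that all 99 non-key macro features were added. Note that the shape alone does not show whether every training date found a matching macro row; unmatched dates would appear as missing values in the macro columns.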
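A stricter check than the shape is available through two `pd.merge` parameters: `validate="many_to_one"` raises if the right-hand keys are not unique, and `indicator=True` adds a `_merge` column that marks rows with no match. The sketch below applies both to toy DataFrames; the names `train_toy` and `macro_toy` and all values are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Toy training data: two rows share a date, one date is absent from macro_toy.
train_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "timestamp": pd.to_datetime(["2011-08-20", "2011-08-20", "2011-08-23"]),
    "price_doc": [5850000, 6000000, 5700000],
})

# Toy macro data: one row per date (values are invented).
macro_toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2011-08-20", "2011-08-21"]),
    "usdrub": [29.0, 29.1],
})

merged = pd.merge(
    train_toy,
    macro_toy,
    on="timestamp",
    how="left",
    validate="many_to_one",  # raises MergeError if macro_toy has duplicate timestamps
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Rows flagged "left_only" found no macro record for their date.
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged.shape, "unmatched rows:", unmatched)
```

In the toy data, the third training row has no macro record for its date, so its `usdrub` value is missing even though the merged row count is unchanged.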

In [12]:
# Save the merged dataset to a CSV file

processed_data_dir = "data/processed_data"
os.makedirs(processed_data_dir, exist_ok=True)

file_path = os.path.join(processed_data_dir, "train_macro_combined.csv")
train_macro_combined.to_csv(file_path, index=False)

print(f"Dataset saved to {file_path}")

Dataset saved to data/processed_data\train_macro_combined.csv
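One thing to keep in mind when reloading this file: CSV stores datetimes as plain text, so the `timestamp` column comes back as strings unless it is parsed again. The sketch below demonstrates this round trip on a toy DataFrame written to a temporary directory; the toy values and the temporary path are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the merged dataset (values made up for illustration).
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2011-08-20", "2011-08-23"]),
    "price_doc": [5850000, 6000000],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_macro_combined.csv")
    df.to_csv(path, index=False)

    # Without parse_dates, the timestamp column is read back as text.
    plain = pd.read_csv(path)
    # With parse_dates, the column is converted back to datetimes on load.
    reparsed = pd.read_csv(path, parse_dates=["timestamp"])

print(plain["timestamp"].dtype)
print(reparsed["timestamp"].dtype)
```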
