**Hey, Guys! In this notebook, I will do some time series EDA. This dataset can provide us with a wealth of insights. Let's explore it together.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
from pandas.plotting import lag_plot
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from scipy.signal import periodogram

**In this dataset, two days of data require corrections, so let's handle that first. Because there are many features available, I will focus on the most important ones: 'aircraft', 'helicopter', 'tank', 'APC', and 'field artillery'.**

In [None]:
df = pd.read_csv('/kaggle/input/2022-ukraine-russian-war/russia_losses_equipment.csv')
correction = pd.read_csv('/kaggle/input/2022-ukraine-russian-war/russia_losses_equipment_correction.csv')
personnel = pd.read_csv('/kaggle/input/2022-ukraine-russian-war/russia_losses_personnel.csv')

In [None]:
merge_df = df.merge(correction, on='date', how='left', suffixes=('','_correction'))
for col in df.columns:
    if col in correction.columns and col != 'date' and col != 'day':
        merge_df[col] = merge_df[col] + merge_df[col + '_correction'].fillna(0)
col_to_drop = [col for col in merge_df.columns if '_correction' in col]
df = merge_df.drop(columns=col_to_drop).set_index('date')
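# Parse both indexes as datetimes so the two frames align on merge and plot on a date axis.
personnel.set_index('date', inplace=True)
personnel.index = pd.to_datetime(personnel.index)
df.index = pd.to_datetime(df.index)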
equipments = ['aircraft', 'helicopter', 'tank', 'APC', 'field artillery']
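
To make the correction step easier to follow, here is a minimal sketch of the same left-merge-and-add logic on a tiny made-up table. The numbers and the names `losses` and `fixes` are hypothetical and are not taken from the real files; the sketch also assumes, as the cell above does, that the correction file holds values to be added to the original ones.

```python
import pandas as pd

# Hypothetical toy data; the real files have many more rows and columns.
losses = pd.DataFrame({
    "date": ["2022-03-01", "2022-03-02", "2022-03-03"],
    "day": [6, 7, 8],
    "tank": [198, 211, 217],
})
fixes = pd.DataFrame({"date": ["2022-03-02"], "day": [7], "tank": [-3]})

merged = losses.merge(fixes, on="date", how="left", suffixes=("", "_correction"))
for col in losses.columns:
    if col in fixes.columns and col not in ("date", "day"):
        # Dates without a correction get NaN from the left join; treat that as 0.
        merged[col] = merged[col] + merged[col + "_correction"].fillna(0)

# Drop the helper columns and use a datetime index.
merged = merged.drop(columns=[c for c in merged.columns if c.endswith("_correction")])
merged["date"] = pd.to_datetime(merged["date"])
merged = merged.set_index("date")
print(merged["tank"])
```

Only the row with a matching date changes; every other row passes through untouched because its correction is filled with 0.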

## First, let's create trend plots for equipment and personnel losses.
**The losses for aircraft and helicopters increased sharply in the initial weeks but then leveled off. Losses for tanks and armored personnel carriers (APCs) showed a continuous upward trend.  
Field artillery losses also increased in the early stages but appeared to decrease somewhat in the middle.**

In [None]:
fig, ax = plt.subplots(2,1,figsize=(15,6), dpi=150)
df[['aircraft','helicopter','MRL']].plot(ax=ax[0])
df[['tank','APC','field artillery']].plot(ax=ax[1])
fig.suptitle('Equipment Loss Trend Over Time')
plt.tight_layout()

**The personnel losses increased sharply in the early stages but then increased at a more moderate pace. The number of prisoners of war (POW) increased at certain times, especially during the middle period.**

In [None]:
fig, ax = plt.subplots(2,1,figsize=(15,6), dpi=150)
personnel['personnel'].plot(ax=ax[0], label='personnel')
ax[0].legend()
personnel['POW'].plot(ax=ax[1], label='POW')
ax[1].legend()
fig.suptitle('Personnel Loss and POW Trend Over Time')
plt.tight_layout()

## In addition to the basic trends over time, we can explore the following:

1. **Cumulative Losses**: Display the cumulative losses for each type of equipment or personnel over time.

2. **Loss Distributions**: Employ box plots or violin plots to portray the distribution of losses across various equipment types or personnel.

3. **Loss Disparities**: Calculate the differences in losses between consecutive days or weeks to see how fast losses are accumulating.

4. **Moving Averages**: Compute 7-day or 30-day moving averages to smooth the data and make the trends easier to read.

5. **Correlations**: Examine the correlations between losses among different equipment types or the correlations between equipment and personnel losses.


**1. Cumulative Losses**

**Tanks, APCs, and field artillery show a continuous increase in cumulative losses throughout the period. From the graph, it's evident that both personnel losses and the number of prisoners of war (POWs) continue to rise throughout the entire period.**

In [None]:
df_cum = df[equipments].cumsum()
per_sum = personnel[['personnel','POW']].cumsum()

fig, ax = plt.subplots(4,1, figsize=(15,10), dpi=150)
df_cum[['aircraft','helicopter']].plot(ax=ax[0])
df_cum[['tank','APC','field artillery']].plot(ax=ax[1])
per_sum['personnel'].plot(ax=ax[2], label='personnel')
ax[2].legend()
per_sum['POW'].plot(ax=ax[3], label='POW')
ax[3].legend()
fig.suptitle('Cumulative Loss Over Time')
plt.tight_layout()

**2. Loss Distributions:**  

**Aircraft and helicopters have more concentrated loss distributions, while other equipment like tanks and APCs have more scattered loss distributions. Field artillery also exhibits some extreme values in terms of losses.**

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15,4), dpi=200)
# Pass the frame via data= so the call works the same across seaborn versions.
sns.boxplot(data=df[['aircraft','helicopter']], ax=ax[0])
sns.boxplot(data=df[['tank', 'APC', 'field artillery']], ax=ax[1])
fig.suptitle('Distribution of Equipment Losses');

**3. Loss Disparities:**

On certain dates, the disparities in equipment losses are notably significant, which might indicate substantial conflicts or events occurring on those specific days.

**As for the disparities between personnel losses and POWs:**

The day-to-day differences in personnel losses are remarkably prominent on certain days, suggesting a potential association with specific events. Similarly, the day-to-day change in the number of POWs is substantial on particular days.

In [None]:
df_diff = df[equipments].diff().dropna()
per_diff = personnel[['personnel','POW']].diff().dropna()

fig, ax = plt.subplots(4,1, figsize=(15,10), dpi=200)
df_diff[['aircraft','helicopter']].plot(ax=ax[0])
ax[0].set_title('Day-to-Day Differences in Equipment Loss')
df_diff[['tank','APC','field artillery']].plot(ax=ax[1])
per_diff['personnel'].plot(ax=ax[2], label='personnel')
ax[2].set_title('Day-to-Day Differences in Personnel Loss and POW')
ax[2].legend()
per_diff['POW'].plot(ax=ax[3], label='POW')
ax[3].legend()
plt.tight_layout()
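
A small sketch of what `diff()` does may help here. It assumes the columns are reported as running totals, which is worth checking against the dataset description before reading too much into any of the plots. The numbers below are made up for illustration.

```python
import pandas as pd

# Hypothetical running totals for one equipment type.
dates = pd.date_range("2022-03-01", periods=5, freq="D")
total = pd.Series([10, 14, 14, 21, 25], index=dates, name="tank")

daily = total.diff().dropna()             # per-day increments
largest_day = daily.idxmax()              # date with the biggest jump
rebuilt = daily.cumsum() + total.iloc[0]  # cumsum undoes diff, up to the first value

print(daily)
print("Largest single-day increase:", largest_day.date())
```

Under that assumption, `idxmax()` on the differenced series is a quick way to find the dates that stand out in the plots above.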

**4. Moving Averages**

We can observe smoother trend lines, especially for helicopter and aircraft losses, which aids us in gaining a better understanding of the overall trends.

The moving average of personnel losses rises faster during certain periods, on top of an overall upward trend. Similarly, the moving average of POWs grows during specific periods.

In [None]:
df_rolling = df[equipments].rolling(window=7).mean().dropna()
per_rolling = personnel[['personnel','POW']].rolling(window=7).mean().dropna()

fig, ax = plt.subplots(4,1, figsize=(15,10), dpi=200)
df_rolling[['aircraft','helicopter']].plot(ax=ax[0])
ax[0].set_title('7-Day Rolling Average of Equipment Loss')
df_rolling[['tank','APC','field artillery']].plot(ax=ax[1])
per_rolling['personnel'].plot(ax=ax[2], label='personnel')
ax[2].set_title('7-Day Rolling Average of Personnel Loss and POW')
ax[2].legend()
per_rolling['POW'].plot(ax=ax[3], label='POW')
ax[3].legend()
plt.tight_layout()

**5. Correlations**

Personnel losses and losses in most equipment are positively correlated. This implies that when the losses of a particular type of equipment increase, personnel losses are also likely to increase.

Notably, personnel losses correlate most strongly with APC and tank losses, at 0.85 and 0.75 respectively.

In [None]:
merged_df = df.merge(personnel, left_index=True, right_index=True)
merged_df_cor = merged_df[equipments + ['personnel']].corr()

plt.figure(figsize=(10, 6), dpi=200)
sns.heatmap(merged_df_cor, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation between Equipment Losses and Personnel Loss', fontsize=16)
plt.tight_layout()
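
One caveat when reading this heatmap: two series that both trend upward will show a high correlation even if their day-to-day changes are unrelated. A common check is to compare the correlation of the levels with the correlation of the first differences. The sketch below uses synthetic data (independent Poisson increments, chosen only for illustration), not the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
# Two hypothetical series whose daily increments are drawn independently.
a = pd.Series(rng.poisson(5, n)).cumsum()
b = pd.Series(rng.poisson(50, n)).cumsum()

corr_levels = a.corr(b)                 # dominated by the shared upward trend
corr_changes = a.diff().corr(b.diff())  # correlation of the daily changes

print(f"levels:  {corr_levels:.3f}")
print(f"changes: {corr_changes:.3f}")
```

If the same comparison on `merged_df` still shows a clear correlation after differencing, that is stronger evidence that the losses actually move together.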

## We can still do some more advanced analysis.

1. **Stacked Area Chart**: This can be used to display the proportion of various equipment losses in the total losses.

2. **Cumulative Loss Percentage Chart**: This can show how much of the total losses are attributed to various equipment or personnel losses as time progresses.

3. **Scatter Plots**: To explore the relationship between different equipment losses and personnel losses, we can use scatter plots.

4. **Seasonal Decomposition**: If there's seasonal variation in the data, seasonal decomposition can be employed to visualize the trends, seasonality, and residual components.

5. **Lag Analysis:** Lag plots are useful for checking the autocorrelation in a time series.

6. **Autocorrelation and Partial Autocorrelation Plots:** Both of these are tools for analyzing autocorrelation in time series data and are essential components of models like ARIMA.

7. **Heatmap Analysis:** Heatmaps are valuable for identifying patterns at specific times, such as certain times of the day or days of the week.

8. **Spectral Analysis:** It's used to identify periodic components in the data.

9. **Anomaly Detection:** Methods based on moving averages and standard deviations are applicable to time series data as they consider temporal dependencies.


1. **Stacked Area Chart**:

From the charts, it's evident that tank and armored personnel carrier (APC) losses constitute a significant portion of the total losses.

In [None]:
palette = sns.color_palette("viridis", 6)
plt.figure(figsize=(15, 8), dpi=250)
df[equipments + ['MRL']].plot.area(ax=plt.gca(), stacked=True, color=palette)
plt.title('Stacked Area Chart of Equipment Losses', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Cumulative Count', fontsize=14)
plt.xticks(rotation=45)
plt.legend(title="Equipment Type")
plt.tight_layout()

2. **Cumulative Loss Percentage Chart**:

By construction, every curve ends at 100% of its own total, so the informative part is the shape: a curve that rises steeply early on indicates an equipment type whose losses were concentrated in the first weeks.

In [None]:
# Multiply by 100 so the values match the 'Cumulative Percentage' axis label.
df_percentage = df[equipments].cumsum() / df[equipments].sum() * 100

fig, ax = plt.subplots(figsize=(18,8), dpi=200)
df_percentage[equipments].plot(ax=ax)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Cumulative Percentage', fontsize=14)
# plt.xticks(rotation=45)
plt.legend(title="Equipment Type")
plt.title('Cumulative Percentage of Equipment Losses Over Time');

**3. Scatter Plots:**

It can be observed that there is a certain positive correlation between tank losses and personnel losses. Similarly, there is also a positive correlation between losses of armored personnel carriers (APC) and personnel losses.

In [None]:
fig, ax = plt.subplots(2,1, figsize=(18,6), dpi=200)
sns.scatterplot(data=merged_df, x='tank', y='personnel', ax=ax[0], color='blue', alpha=0.5)
ax[0].set_title('Scatter Plot of Tank Loss vs Personnel Loss', fontsize=14)
ax[0].set_xlabel('Tank Loss', fontsize=12)
ax[0].set_ylabel('Personnel Loss', fontsize=12)
sns.scatterplot(data=merged_df, x='APC', y='personnel', ax=ax[1], color='red', alpha=0.5)
ax[1].set_title('Scatter Plot of APC Loss vs Personnel Loss', fontsize=14)
ax[1].set_xlabel('APC Loss', fontsize=12)
ax[1].set_ylabel('Personnel Loss', fontsize=12)
plt.tight_layout()

4. **Seasonal Decomposition**:

**Trend:** Tank losses show an upward trend over time.  
**Seasonality:** Because the decomposition uses a period of 7, the seasonal component is a repeating weekly pattern; it may reflect specific days of the week or how events are reported.  
**Residuals:** It shows the variability that remains after considering the trend and seasonality.

In [None]:
decomposition = seasonal_decompose(df['tank'], model='additive', period=7)

fig, ax = plt.subplots(4,1, figsize=(15,8), sharex=True)
components = [
    ('Original', decomposition.observed),
    ('Trend', decomposition.trend),
    ('Seasonal', decomposition.seasonal),
    ('Residual', decomposition.resid),
]
# Draw each component on its own axis with a consistent label and grid.
for axis, (title, series) in zip(ax, components):
    axis.plot(series)
    axis.set_title(title)
    axis.set_ylabel('Tank Loss')
    axis.grid(True, color='gray', linestyle='--', linewidth=0.5)
ax[3].set_xlabel('Date')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])


5. **Lag Analysis:**  

The lag plot illustrates the relationship between tank losses at time 't' and tank losses at time 't+1'. Points that cluster along an upward diagonal indicate positive autocorrelation at lag 1, meaning that the value on one day is a good predictor of the value on the next.

In [None]:
fig, ax = plt.subplots(figsize=(10,4), dpi=150)
lag_plot(df['tank'], lag=1, ax=ax, alpha=0.2)
plt.title('Lag Plot of Tank Loss Data')
plt.xlabel('Tank Loss at time t')
plt.ylabel('Tank Loss at time t+1')
plt.tight_layout()
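
The lag plot can be summarized with a single number: the correlation between each value and the next one. The sketch below, on a short made-up series, computes it by hand and compares it with pandas' `Series.autocorr`.

```python
import numpy as np
import pandas as pd

# Hypothetical short series standing in for the tank column.
series = pd.Series([3, 5, 4, 8, 9, 7, 12, 14, 13, 17], dtype=float)

# Pair each value with its successor, exactly what the lag plot draws.
x_t = series.iloc[:-1].to_numpy()
x_next = series.iloc[1:].to_numpy()
r_manual = np.corrcoef(x_t, x_next)[0, 1]

r_pandas = series.autocorr(lag=1)
print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

The two numbers agree, so `df['tank'].autocorr(lag=1)` is a convenient way to quantify what the scatter of points suggests.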


6. **Autocorrelation and Partial Autocorrelation Plots:**

ACF (Auto-Correlation Function): In the initial lags, the autocorrelation decreases relatively quickly, indicating that the correlation between values of the time series decreases as the lag increases.

PACF (Partial Auto-Correlation Function): There is a significant peak at the initial lags, such as lag 1, which may suggest the presence of some autoregressive pattern.

In [None]:
fig, ax = plt.subplots(2,1, figsize=(15, 10))

# plot ACF
plot_acf(df['tank'].dropna(), ax=ax[0], lags=40)
ax[0].set_title('Autocorrelation Plot (ACF) for Tank Loss', fontsize=14)

# plot PACF
plot_pacf(df['tank'].dropna(), ax=ax[1], lags=40)
ax[1].set_title('Partial Autocorrelation Plot (PACF) for Tank Loss', fontsize=14)


7. **Heatmap analysis:**

Using a heatmap to analyze tank losses on each day of the week is an excellent approach to identify patterns specific to certain days of the week. It appears that certain weeks and weekdays experience higher losses, which could be attributed to specific events during those times.   
Interestingly, weekends (Saturday and Sunday) seem to have lower losses.


In [None]:
df['weekday'] = df.index.day_name()
# Key each column by ISO year and week so the same week number in different years is not summed together.
iso = df.index.isocalendar()
df['week'] = iso['year'].astype(str) + '-W' + iso['week'].astype(str).str.zfill(2)
heatmap_data = df.pivot_table(values='tank', index='weekday', columns='week', aggfunc='sum')
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
heatmap_data = heatmap_data.reindex(day_order)

plt.figure(figsize=(18, 6))
sns.heatmap(heatmap_data, cmap='YlGnBu', cbar_kws={'label': 'Tank Loss'})
plt.title('Heatmap of Tank Loss by Weekday and Week', fontsize=16)
plt.xlabel('ISO Year-Week', fontsize=14)
plt.ylabel('Weekday', fontsize=14)
plt.tight_layout()
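
To back up a claim like "weekends have lower losses" with numbers rather than colors, one option is to average the values by weekday. This sketch uses made-up daily counts (10 on weekdays, 4 on weekends) purely to show the mechanics; the name `daily` is hypothetical.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2022-03-07", periods=28, freq="D")
# Hypothetical daily counts: 10 on weekdays, 4 on weekends.
daily = pd.Series(np.where(idx.dayofweek < 5, 10, 4), index=idx)

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
# Group by weekday name, average, then put the days back in calendar order.
by_weekday = daily.groupby(daily.index.day_name()).mean().reindex(day_order)
print(by_weekday)
```

Applying the same `groupby` to the real series gives one mean per weekday, which is easier to compare than reading cell shades off the heatmap.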

8. **Spectral Analysis:**
 
This is the spectrum plot of tank loss data. Spectrum plots can help us identify significant periodic components in the data. 

From the graph, it's evident that the primary periodic component appears in the low-frequency region, which may be associated with long-term trends or low-frequency cyclic patterns.

In [None]:
frequencies, spectrum = periodogram(df['tank'].dropna(), scaling='spectrum')
plt.figure(figsize=(15, 7))
plt.plot(frequencies, spectrum)
plt.title('Spectrum of Tank Loss Data', fontsize=16)
plt.xlabel('Frequency', fontsize=14)
plt.ylabel('Spectrum Magnitude', fontsize=14)
plt.tight_layout()
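
A strong upward trend puts most of the power in the lowest frequencies and can hide shorter cycles. One possible variation is to let `periodogram` remove a linear trend first and then convert the peak frequency into a period in days. The sketch below uses a synthetic series (linear trend plus a 7-day sine wave, an assumption made only for illustration).

```python
import numpy as np
from scipy.signal import periodogram

# Hypothetical daily series: linear trend plus a 7-day cycle.
t = np.arange(280)
y = 0.3 * t + 2.0 * np.sin(2 * np.pi * t / 7)

# detrend='linear' removes a straight-line trend before the spectrum is estimated.
freqs, power = periodogram(y, detrend="linear", scaling="spectrum")

# Skip the zero-frequency bin, then turn the peak frequency into a period.
peak = freqs[1:][np.argmax(power[1:])]
period_days = 1 / peak
print(f"Dominant period: {period_days:.2f} days")
```

On the real tank series the result will be noisier, but a peak near a frequency of 1/7 would support the weekly pattern suggested by the seasonal decomposition.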


9. **Anomaly Detection:**

Finally, I will employ statistical methods to detect anomalies in tank loss data. Common approaches involve setting a threshold (e.g., 2 standard deviations) for anomaly detection based on moving averages and standard deviations.

In [None]:
window_size = 7
rolling_mean = df['tank'].rolling(window=window_size).mean()
rolling_std = df['tank'].rolling(window=window_size).std()

threshold = 2
# Compute the band once and reuse it for both the anomaly mask and the shaded area.
upper = rolling_mean + threshold * rolling_std
lower = rolling_mean - threshold * rolling_std
anomaly = (df['tank'] > upper) | (df['tank'] < lower)
plt.figure(figsize=(15, 7))
plt.plot(df.index, df['tank'], label='Tank Loss', color='blue')
plt.scatter(df.index[anomaly], df.loc[anomaly, 'tank'], color='red', label='Anomaly')
plt.fill_between(df.index, upper, lower, color='yellow', alpha=0.2)
plt.title('Anomaly Detection in Tank Loss Data', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Tank Loss', fontsize=14)
plt.legend()
plt.tight_layout()
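
One thing to keep in mind with this approach: when the rolling window includes the current day, a spike inflates its own mean and standard deviation. With 7 points, a value can be at most about 2.27 sample standard deviations from the window mean, so a threshold of 2 leaves little room. A possible variation, sketched below on synthetic counts with one injected spike, is to shift the rolling statistics by one day so each band is built from the previous 7 days only.

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts with one injected spike.
rng = np.random.default_rng(42)
idx = pd.date_range("2022-03-01", periods=60, freq="D")
daily = pd.Series(rng.poisson(10, 60).astype(float), index=idx)
daily.iloc[40] = 60.0

window, threshold = 7, 2
# shift(1) means each day's band uses the previous 7 days, not the day itself.
mean_prev = daily.rolling(window).mean().shift(1)
std_prev = daily.rolling(window).std().shift(1)

z = (daily - mean_prev) / std_prev
flags = z.abs() > threshold
print(daily[flags])
```

The first 7 days have no complete history, so they are never flagged; the injected spike on day 41 stands out clearly because it no longer contributes to its own band.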