---
title: "DS202W - ✏️ W10 Summative"
author: <47083>
output: html
self-contained: true
jupyter: python3
engine: jupyter
python: "/opt/anaconda3/bin/python"
editor:
  render-on-save: true
  preview: true
---


# Part 1 - Pandas and Lets Plot (Exploratory Data Analysis)

## Overview of steps for part 1


1. Importing the data/prepare for exploratory data analysis
2. Explore data and derive insights
3. Answer the following questions:
  a. What are the years with the top 10 highest number of total wars? How many of those years (and which ones) overlap with the years with the top 10 nominal money issues?
  b.  Similarly, what are the years with the top 10 highest number of disasters? How many of those years (and which ones) overlap with the years with the top 10 nominal money issues?
  c.  Create a single plot that shows the evolution over time of CPI, total wars, disasters and nominal money issues (be mindful of variable scaling!). What does this plot tell you?

---


### Preliminary Exploratory Data Analysis
1. Importing the necessary libraries  


In [None]:
import pandas as pd
from lets_plot import *
LetsPlot.setup_html()
import matplotlib.pyplot as plt

2.  Explore data and derive insights


In [None]:
#Load the data into a data frame called yuan
yuan = pd.read_stata("Data/yuan_inflation_data.dta")

#data exploration
print("Dataset information:")
print(yuan.info())

# first 10 rows, preceded by a blank line so the output is easier to read
print("\nFirst few rows:")
print(yuan.head(10))


# .describe() to get summary statistics
print("\nBasic statistics:")
print(yuan.describe())

---

3. Answer the following questions:

  ### 3a. What are the years with the top 10 highest number of total wars? How many of those years (and which ones) overlap with the years with the top 10 nominal money issues?

In [None]:
# create a new df with the year and totalwar 
top_wars = yuan[['year', 'totalwar']]
# sort the dataframe by the totalwar column in descending order and select the top 10
top_wars = top_wars.sort_values(by='totalwar', ascending=False).head(10)


#same thing for nominal money issues
top10_nominal_money = yuan[['year', 'nominal']]
top10_nominal_money = top10_nominal_money.sort_values(by='nominal', ascending=False).head(10)


#### I am transforming each column of years from its original form into a Python set, in order to find the intersection of the two sets.

Top 10 years with highest number of total wars:

| year | totalwar |
|------|----------|
| 1352 | 40 |
| 1275 | 30 |
| 1355 | 26 |
| 1328 | 17 |
| 1325 | 12 |
| 1353 | 11 |
| 1327 | 11 |
| 1331 | 11 |
| 1354 | 10 |
| 1323 | 10 |
---

Top 10 years with highest nominal money issues:

| year | nominal |
|------|---------|
| 1355 | 49500000 |
| 1310 | 36259200 |
| 1354 | 34500000 |
| 1353 | 19500000 |
| 1352 | 19500000 |
| 1312 | 11211680 |
| 1311 | 10900000 |
| 1313 | 10200000 |
| 1314 | 10100000 |
| 1302 | 10000000 |

---

In [None]:
overlapping_years = set(top_wars['year']).intersection(set(top10_nominal_money['year']))
print(overlapping_years)

#set creates a data structure  of unique items
# intersection finds the common items between the two sets

###### There are **4 years** that overlap.
#### The years that overlap are **1352, 1353, 1354, 1355**.
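
One detail worth checking in a top-10 comparison like this is ties at the cutoff: sorting and calling `.head(10)` keeps an arbitrary subset of tied rows. The sketch below uses a small made-up table (not the Yuan data; the column names `wars` and `money` are hypothetical) to show how `nlargest(..., keep="all")` retains every tied row before the set intersection is taken.

```python
import pandas as pd

# Hypothetical toy data, NOT the Yuan dataset
toy = pd.DataFrame({
    "year": [1300, 1301, 1302, 1303, 1304, 1305],
    "wars": [5, 9, 7, 7, 1, 7],
    "money": [10, 50, 20, 40, 30, 5],
})

# Top 3 by wars; keep="all" retains every row tied at the cutoff value
top_wars_toy = toy.nlargest(3, "wars", keep="all")
# Top 3 by money (no ties in this column)
top_money_toy = toy.nlargest(3, "money")

# Years that appear in both rankings
overlap = sorted(set(top_wars_toy["year"]) & set(top_money_toy["year"]))
print(top_wars_toy)
print(overlap)
```

Here three years are tied at 7 wars, so the "top 3" actually contains four rows. Whether to keep all ties or cut at exactly ten rows is a judgment call that depends on the question being asked.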



### 3b.  Similarly, what are the years with the top 10 highest number of disasters? How many of those years (and which ones) overlap with the years with the top 10 nominal money issues?

In [None]:
#we essentially need to repeat the same process as above, but with the disaster column
top10_disasters= yuan[['year', 'disaster']]
top10_disasters = top10_disasters.sort_values(by='disaster', ascending=False).head(10)


print("Top 10 years with highest number of disasters:")
print(top10_disasters)

In [None]:
#utilize set function again to find the intersection of the two sets (disasters and nominal money issues)
years_overlap = set(top10_disasters['year']).intersection(set(top10_nominal_money['year']))
print("\nOverlapping years between disasters and nominal money issues:")
print(years_overlap)

3b. Because the intersection is empty, **no years overlap!**




### 3c. Create a single plot that shows the evolution over time of CPI, total wars, disasters and nominal money issues (be mindful of variable scaling!). What does this plot tell you?


In [None]:
# we need to normalize the data to make it comparable

# use min-max scaling: subtract the minimum from each value and divide by the range (maximum minus minimum), so every variable lies between 0 and 1
yuan['CPI_normalized'] = (yuan['cpi'] - yuan['cpi'].min()) / (yuan['cpi'].max() - yuan['cpi'].min())

yuan['wars_normalized'] = (yuan['totalwar'] - yuan['totalwar'].min()) / (yuan['totalwar'].max() - yuan['totalwar'].min())


yuan['disasters_normalized'] = (yuan['disaster'] - yuan['disaster'].min()) / (yuan['disaster'].max() - yuan['disaster'].min())


yuan['nominal_normalized'] = (yuan['nominal'] - yuan['nominal'].min()) / (yuan['nominal'].max() - yuan['nominal'].min())
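
The four near-identical lines above all apply the same formula, so they could be wrapped in a small helper. This is only a sketch on made-up numbers; the function name `min_max_scale` and the toy values are assumptions for illustration, and the zero-range guard is an extra that the original cell does not have.

```python
import pandas as pd

def min_max_scale(series: pd.Series) -> pd.Series:
    """Rescale a numeric series to the [0, 1] range."""
    value_range = series.max() - series.min()
    if value_range == 0:
        # A constant column has no spread; return zeros instead of dividing by zero
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / value_range

# Hypothetical toy values on very different scales
toy = pd.DataFrame({
    "cpi": [100.0, 150.0, 200.0],
    "nominal": [1e6, 5e6, 9e6],
})

# DataFrame.apply calls the helper once per column
scaled = toy.apply(min_max_scale)
print(scaled)
```

After scaling, both columns run from 0 to 1 even though the raw values differ by several orders of magnitude.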


---

#### Note before plotting: I believe that the boxplot is the best way to visualize the distribution of the data as it shows the median, quartiles, and outliers, which help for comparison of multiple variables. However I will use more than 1 plot in order to make sure the data is being visualized thoroughly.


In [None]:
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

# this time we standardize with StandardScaler, which converts each column to z-scores (mean 0, standard deviation 1)
scaler = StandardScaler()

#new df with scaled data
scaled_data = scaler.fit_transform(yuan[['cpi', 'totalwar', 'disaster', 'nominal']])

#new columns for clarity
yuan_scaled = pd.DataFrame(scaled_data, columns=['CPI_normalized', 'wars_normalized', 'disasters_normalized', 'nominal_normalized'])

plt.figure(figsize=(12, 8))
plt.plot(yuan['year'], yuan_scaled['CPI_normalized'], label='CPI', color='blue', linewidth=2)
plt.plot(yuan['year'], yuan_scaled['wars_normalized'], label='Total Wars', color='red', linewidth=2)
plt.plot(yuan['year'], yuan_scaled['disasters_normalized'], label='Disasters', color='green', linewidth=2)
plt.plot(yuan['year'], yuan_scaled['nominal_normalized'], label='Nominal Money Issues', color='orange', linewidth=2)
plt.title('CPI, Total Wars, Disasters, and Nominal Money Issues Over Time in the Yuan Dynasty')
plt.xlabel('Year')
plt.ylabel('Z-score')
plt.legend(loc='upper left')
plt.grid(True, which='both', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
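
To make clear what the z-scores on the y-axis mean, the sketch below reproduces `StandardScaler` by hand on a tiny made-up column. It assumes only that the scaler subtracts the column mean and divides by the population standard deviation (`ddof=0`), which is its documented default behaviour.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column, NOT the Yuan dataset
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# z-scores from scikit-learn
z_sklearn = StandardScaler().fit_transform(toy)[:, 0]

# z-scores by hand: (value - mean) / population standard deviation
z_manual = ((toy["x"] - toy["x"].mean()) / toy["x"].std(ddof=0)).to_numpy()

print(np.allclose(z_sklearn, z_manual))
```

A z-score of 3.8 therefore means a value 3.8 standard deviations above that variable's own mean, which is why series on very different raw scales can share one axis.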

In [None]:
# Boxplot of the standardized (z-score) data
plt.figure(figsize=(10, 6))
boxplot_data = [yuan_scaled['CPI_normalized'], yuan_scaled['wars_normalized'], 
                yuan_scaled['disasters_normalized'], yuan_scaled['nominal_normalized']]
plt.boxplot(boxplot_data)
# Label the boxes through xticks, which works across matplotlib versions
plt.xticks([1, 2, 3, 4], ['CPI', 'Wars', 'Disasters', 'Nominal'])
plt.title('Distribution Comparison of CPI, Total Wars, Disasters, and Nominal Money Issues - Yuan Dynasty (1260-1355)')
plt.ylabel('Z-score')
plt.grid(axis='y', alpha=0.3)
plt.show()

## Analysis of the data
We can break down the data into 2 parts:
  1. Each variable individually.
  2. The data as a whole.


## Breakdown of each variable we analyzed

#### - CPI
The first plot shows the standardized values of all the variables over time. CPI is the most stable series until about 1340, after which it spikes and ends with a z-score of around 3.8. This makes sense, as the paper states that the population started to decline after that point: "Population figures during the mid-Yuan period (1290−333) remained unchanged but started to recover after 1330 and peaked in 1341. After that, the population started to decline." (Guan, Palma, and Wu 2024).


#### - Total Wars
Turning to the other variables, total wars is very volatile, with sharp spikes, particularly around 1275 (a z-score of 4.2). It ends with a z-score of 3.5 in 1355, after its peak a few years earlier. This suggests that wars were infrequent for most of the period, but when they did occur they were devastating.

#### - Nominal Money Issues
Despite its peak at the end of the period, nominal money issues is by far the most stable series and the lowest of all the variables. It spikes in 1310, which can be explained by the zhidachao, the third paper money, issued by the third Yuan emperor, Külüg, in 1310. It had a significantly higher value than the previous currencies:

- 1 guan of zhidachao was equivalent to 5 guan of zhiyuanchao
- 1 guan of zhidachao was equivalent to 25 guan of zhongtongchao

However, the zhidachao was only in circulation for one year: Külüg died suddenly in 1311, and his successor Ayurbarwada Khan abandoned the zhidachao and restored the previous two paper currencies (zhongtongchao and zhiyuanchao). The issuance of zhidachao represented a massive spike in the nominal money supply, with annual issuance reaching 36 million ding in 1310, compared to just 5 million ding when zhiyuanchao was being issued. This explains why the spike in 1310 is so high.

#### - Disasters
Disasters are arguably the most volatile of all the variables based on the z-score plot. There are a few peaks, notably in the 1290s and the 1330s. There were years of tranquillity prior to the 1290s; as the paper states, "After the initial years of tranquillity, from 1285, the empire began to suffer from various natural disasters" (Frequency of natural disasters. Source: Chen et al., A chronicle of natural and man-made disasters in China, pp. 1068−220.).

### Data as a whole
As mentioned, each of these variables has a different scale, range, and historical context, which makes them difficult to compare. However, we can still infer that CPI is the variable most affected by the others: inflation is never driven by a single factor but is the cumulative result of many. The factors we measured (total wars, disasters, and nominal money issues) all have an effect on the CPI. With that in mind, the CPI was still relatively stable for most of the period, and from a simple plot it is hard to tell the true effect of the other variables, or whether there was a significant effect at all. A more extensive analysis could use a correlation matrix to examine the relationships between the variables, or another model.
 📊 😅...
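
As a rough sketch of the correlation-matrix idea, the snippet below builds a small synthetic table (random numbers, not the Yuan data; the column names are hypothetical) in which one column is constructed to track another, and then calls `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only
rng = np.random.default_rng(0)
n = 50
money = rng.normal(size=n)
toy = pd.DataFrame({
    "money": money,
    # prices are built to follow money closely, plus a little noise
    "prices": 2 * money + rng.normal(scale=0.1, size=n),
    # an unrelated column
    "noise": rng.normal(size=n),
})

# Pairwise Pearson correlations between all columns
corr = toy.corr()
print(corr.round(2))
```

Applied to the real data, the same call on the `cpi`, `totalwar`, `disaster` and `nominal` columns would give a first numeric view of which variables move together. Correlation alone would not establish which one drives the other.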
 
---


# Part 2: Create regression models (45 marks)


## Overview of steps for part 2
1. Create a baseline linear regression model:
  a. Create the training and test sets:
  b. linear regression model that predicts the CPI
  c. Residuals plot
  d. Model evaluation/performance metrics

2. Come up with my own feature selection, feature engineering, or model selection strategy and try to get better model performance than the baseline.
  a. Model selection choice and justification
  b. Code
  c. Explanation of choices
  d. Model performance/evaluation
  e. Comparison to baseline model


### 1a.

In [None]:
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

#1a. Create the training and test sets
#split data based on year
yuan_train = yuan[yuan['year'] < 1327].copy()
yuan_test = yuan[yuan['year'] >= 1327].copy()

#target and features
features = ['totalwar', 'disaster', 'nominal']
target = 'cpi'

# Prepare training data
X_train = yuan_train[features]
y_train = yuan_train[target]

#1b. Linear regression model that predicts the CPI for training data

# Add a constant term for the intercept
X_train_const = sm.add_constant(X_train)

# Utilize an OLS model using the training data
ols_model = sm.OLS(y_train, X_train_const)
results = ols_model.fit()

# Print the OLS summary output for training data
print("OLS Model for Training Data:")
print(results.summary())

#1c. Residuals plot for training data
fitted_values = results.fittedvalues
residuals = y_train - fitted_values
r2 = results.rsquared
rmse = np.sqrt(mean_squared_error(y_train, fitted_values))

print("\nPerformance on Training Data:")
print("R-squared:", np.round(r2, 2))
print("Root Mean Squared Error (RMSE):", np.round(rmse, 2))

# Plot the residuals versus the fitted values for training data
plt.figure(figsize=(10, 6))
plt.scatter(fitted_values, residuals, alpha=0.7, color='blue')
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted CPI Values (Training Data)')
plt.ylabel('Residuals')
plt.title('Residuals Plot for Training Data')
plt.grid(True)
plt.show()

# Now the test data: same steps as above, i.e. a separate OLS model is fitted on the test period
# Prepare test data
X_test = yuan_test[features]
y_test = yuan_test[target]

# Add constant term for the intercept
X_test_const = sm.add_constant(X_test)

# Fit the OLS model on the test data
ols_test_model = sm.OLS(y_test, X_test_const)
results_test = ols_test_model.fit()

# Print the OLS summary output for test data
print("\nOLS Model for Test Data:")
print(results_test.summary())

# In-sample predictions from the model that was fitted on the test data
y_test_pred = results_test.predict(X_test_const)

# Calculate residuals for the test data
test_residuals = y_test - y_test_pred

# Calculate performance metrics on the test data
test_r2 = r2_score(y_test, y_test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("\nPerformance on Test Data:")
print("Test R-squared:", np.round(test_r2, 2))
print("Test Root Mean Squared Error (RMSE):", np.round(test_rmse, 2))

# Plot the residuals versus the fitted values for test data
plt.figure(figsize=(10, 6))
plt.scatter(y_test_pred, test_residuals, alpha=0.7, color='green')
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted CPI Values (Test Data)')
plt.ylabel('Residuals')
plt.title('Residuals Plot for Test Data')
plt.grid(True)
plt.show()

#### 1d. Model evaluation/performance metrics

 ##### Model Comparison: Training data:
 Overall, the model is not horrible. With an R-squared of 0.48, the fit is moderate: it accounts for almost 50% of the variation in CPI, which is not bad for three variables. This makes sense, as CPI is not determined only by the variables we included (total wars, disasters, and nominal money issues) but also by other factors that are not in the model. The p-value of the F-statistic is less than 0.00001, which means the variables we included are jointly statistically significant.

The p-values for disaster and nominal are 0.000, which is significant. However, the p-value for total wars is 0.259, which is not significant. This makes some sense, as the wars seen in the last plot from part 1 were not as frequent as the other inherent issues, such as nominal money issues and disasters, which followed a similar volatility. Additionally, if wars were less frequent, they may not have had as much of an immediate or lasting impact on inflation as disasters or nominal money issues did. CPI, however, is a special variable. This is evident when exploring the last plot from part 1, where CPI is relatively stable **until 1340**, when there is a massive spike. The CPI spike in 1340 might be tied to an event not captured well by the model (such as a specific, one-time shock), which could explain why totalwar isn't significant and why other variables might be. Let's test this hypothesis by creating my own model.



##### Model Comparison: Test data:
The performance on test data is as follows:
  Test R-squared: 0.71
  Test Root Mean Squared Error (RMSE): 14.47

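This shows us that the model fitted on the test period fits its data better than the model fitted on the training period: its R-squared is higher (0.71 versus 0.48) and its RMSE is lower. Note that, because this second model was fitted on the test data itself rather than evaluated with the training coefficients, these are in-sample figures and do not measure how well the training model predicts unseen years.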

Analysis: Both models show that totalwar has little influence on CPI. The nominal variable is highly significant in both cases, which makes sense because CPI would likely be highly influenced by monetary changes. Disaster is significant in training but borderline significant in test data, which could suggest a change in the impact of disasters between the training and test periods.


## 2: My own model : 

#### Features: 
I want to add more features to the model. However, some variables are similar in nature, such as rebellion and total war. Others, like year, are not useful, because year is not a variable that can be controlled or manipulated; it is simply a context variable.
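
One way to check whether two candidate features are "similar in nature" before adding both is to look for highly correlated pairs. The sketch below uses synthetic data in which a made-up `rebellion` column is deliberately built as a near copy of `totalwar`. The 0.9 threshold is an arbitrary assumption, not a rule from the course or the paper.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only
rng = np.random.default_rng(7)
n = 100
wars = rng.poisson(3, size=n).astype(float)
toy = pd.DataFrame({
    "totalwar": wars,
    "rebellion": wars + rng.normal(scale=0.2, size=n),  # nearly a copy of totalwar
    "disaster": rng.poisson(5, size=n).astype(float),   # unrelated
})

# Absolute pairwise correlations
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs above the (assumed) 0.9 threshold are candidates for dropping one feature
redundant_pairs = upper.stack().loc[lambda s: s > 0.9]
print(redundant_pairs)
```

If a pair shows up here, keeping only one of the two features is a reasonable starting point, since including both tends to make the individual coefficients unstable.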


In [None]:
# we need to create a new df with the new features
