# Appendix: Project template

We outline a structured approach for presenting research findings. The framework is divided into several key segments:

1. Introduction
1. Dataset overview
1. Analytics and learning strategies
1. Empirical results: baseline and robustness
1. Conclusion

The opening segment encompasses four essential elements:

- Contextual Background: What is the larger setting of the study? What makes this area of inquiry compelling? What are the existing gaps or limitations within the current body of research? What are some unanswered yet noteworthy questions?

- Project Contributions: What are the specific advancements made by this study, such as in data acquisition, algorithmic development, parameter adjustments, etc.?

- Summary of the main empirical results: What is the main statistical statement? Is it significant (e.g., statistically or economically)?

- Literature and Resource Citations: What are the related academic papers? Which GitHub repositories, expert blogs, or software packages are used in this project?

In the dataset overview, one should consider:

- The origin and composition of data utilized in the study. If the dataset is original, then provide the source code to ensure reproducibility.

- The chronological accuracy of the data points, verifying that the dates reflect the actual availability of information.

- A detailed analysis of descriptive statistics, with an emphasis on discussing the importance of the chosen graphs or metrics.
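
The point about chronological accuracy can be illustrated with a minimal sketch. It assumes a hypothetical monthly indicator whose value for a given month is only published some weeks later; the column names, dates, and values are illustrative and not taken from the project data. `pd.merge_asof` attaches to each event the most recent value that had already been released.

```python
import pandas as pd

# Hypothetical indicator: each value refers to a month but is published later.
indicator = pd.DataFrame({
    "reference_month": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "release_date": pd.to_datetime(["2024-02-13", "2024-03-12", "2024-04-10"]),
    "value": [3.1, 3.2, 3.5],
})

# Hypothetical dates at which a model needs the latest known value.
events = pd.DataFrame({
    "event_date": pd.to_datetime(["2024-02-01", "2024-03-20", "2024-04-30"]),
})

# For each event, keep the last row whose release_date is <= event_date,
# so no value is used before it was actually available.
aligned = pd.merge_asof(
    events.sort_values("event_date"),
    indicator.sort_values("release_date"),
    left_on="event_date",
    right_on="release_date",
    direction="backward",
)
print(aligned[["event_date", "reference_month", "value"]])
```

The first event falls before any release, so its value is missing rather than silently filled with data from the future.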

The analytics and learning strategies section should cover:

- A detailed explanation of the foundational algorithm.

- A description of the data partitioning strategy for training, validation and test.

- An overview of the parameter selection and optimization process.
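
For time-indexed data, one common partitioning strategy is a chronological split with fixed cutoff dates. The sketch below is only an illustration under that assumption; the function name `chronological_split`, the cutoff dates, and the toy data are hypothetical.

```python
import pandas as pd

def chronological_split(df, date_col, train_end, val_end):
    """Split rows by date: train <= train_end < validation <= val_end < test."""
    ordered = df.sort_values(date_col)
    train = ordered[ordered[date_col] <= train_end]
    val = ordered[(ordered[date_col] > train_end) & (ordered[date_col] <= val_end)]
    test = ordered[ordered[date_col] > val_end]
    return train, val, test

# Toy monthly data from January to October 2020.
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10, freq="MS"),
    "y": range(10),
})
train, val, test = chronological_split(toy, "date", "2020-06-30", "2020-08-31")
print(len(train), len(val), len(test))
```

Splitting by date rather than at random keeps every validation and test observation strictly later than the training data.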

To effectively convey the empirical findings, separate the baseline results from the additional robustness tests. Within the primary empirical outcomes portion, include:

- Key statistical evaluations (for instance, when presenting a backtest, provide a PnL graph alongside the Sharpe ratio).

- Insights into what primarily influences the results, such as specific characteristics or assets that significantly impact performance.
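
As an illustration of the backtest statistics mentioned above, here is a minimal sketch of an annualized Sharpe ratio and a cumulative PnL series. It assumes daily returns, 252 trading days per year, and a zero risk-free rate; the simulated returns are placeholders, not project results.

```python
import numpy as np

def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of periodic returns (risk-free rate assumed to be zero)."""
    returns = np.asarray(returns, dtype=float)
    std = returns.std(ddof=1)  # sample standard deviation
    if std == 0:
        return float("nan")    # undefined when returns do not vary
    return float(np.sqrt(periods_per_year) * returns.mean() / std)

# Simulated daily returns, for illustration only.
rng = np.random.default_rng(0)
daily_returns = rng.normal(0.0005, 0.01, size=252)

cumulative_pnl = np.cumsum(daily_returns)  # series to plot as the PnL graph
print(f"Sharpe ratio: {annualized_sharpe(daily_returns):.2f}")
```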

The robustness section should detail:

- Evaluation of the stability of the principal finding against variations in hyperparameters or algorithmic modifications.
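
One simple way to organize such a stability check is to re-run the evaluation over a grid of hyperparameters and look at the spread of the metric. The sketch below assumes a hypothetical `toy_metric` standing in for the real pipeline; the helper `sensitivity_table` and the parameter names are illustrative.

```python
from itertools import product

import numpy as np

def sensitivity_table(evaluate, grid):
    """Evaluate a metric for every combination of values in a hyperparameter grid."""
    names = list(grid)
    rows = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        rows.append({**params, "metric": evaluate(**params)})
    return rows

def toy_metric(window, threshold):
    # Hypothetical stand-in for "re-run the whole pipeline and return the headline metric".
    return 1.0 / window + threshold

rows = sensitivity_table(toy_metric, {"window": [3, 6, 12], "threshold": [0.0, 0.1]})
metrics = np.array([row["metric"] for row in rows])
print(f"min={metrics.min():.3f}, max={metrics.max():.3f}, spread={metrics.max() - metrics.min():.3f}")
```

A small spread relative to the baseline value suggests the main finding does not hinge on one particular configuration.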

Finally, the conclusive synthesis should recapitulate the primary findings, consider external elements that may influence the results, and hint at potential directions for further investigative work.

# Intro

# Environment and global variables

In [None]:
# install packages
!pip install mistralai transformers

In [2]:
#import packages
from mistralai import Mistral
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import time
from tqdm import tqdm 

In [9]:
# whether to use Mistral prompting or load the pre-computed datasets online
prompt_mistral = input("Do you want to use mistral prompting? (y/n): ").strip().lower() == 'y'  # or set to True
    
if prompt_mistral:
    api_key_file = input("Enter the path to the API key file: ")
    with open(api_key_file, 'r') as file:
        api_key = file.read().strip()
        
else:
    api_key = None
    print("Using datasets on Yassine's webpage")

# Datasets

In [4]:
# load FOMC communications
if os.path.exists('./communications.csv'):
    communications = pd.read_csv('./communications.csv')
    print('Dataset loaded.')
else:
    print('Downloading dataset...')
    !curl -L -o ./fomc-meeting-statements-and-minutes.zip https://www.kaggle.com/api/v1/datasets/download/vladtasca/fomc-meeting-statements-and-minutes
    !unzip ./fomc-meeting-statements-and-minutes.zip -d ./fomc-meeting-statements-and-minutes
    !mv ./fomc-meeting-statements-and-minutes/communications.csv ./communications.csv
    !rm ./fomc-meeting-statements-and-minutes.zip
    !rm -r ./fomc-meeting-statements-and-minutes
    
    communications = pd.read_csv('./communications.csv')
    


Dataset loaded.


In [5]:
#Load Georgia Tech's dataset
splits = {'train': 'train.csv', 'test': 'test.csv'}
df_train = pd.read_csv("hf://datasets/gtfintechlab/fomc_communication/" + splits["train"])
df_test = pd.read_csv("hf://datasets/gtfintechlab/fomc_communication/" + splits["test"])

georgia_tech_df = pd.concat([df_train, df_test], axis=0)

In [6]:
#Load interest rate Data
# Define the URL for the CSV download
csv_url = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1320&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=INTDSRUSM193N&scale=left&cosd=1950-01-01&coed=2021-08-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-12-18&revision_date=2024-12-18&nd=1950-01-01"

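# Interest Rates, Discount Rate for United States

# Download the CSV file into a DataFrame, parsing dates so they can be compared with timestamps later
Int_R = pd.read_csv(csv_url, parse_dates=['observation_date'])
# keep observations from 2000 onwards (copy so the assignments below do not act on a slice)
Int_R = Int_R[Int_R['observation_date'] >= '2000-01-01'].copy()
# month-over-month change, in levels and in percent
Int_R['evolution'] = Int_R['INTDSRUSM193N'].diff(1)
Int_R['evolution_pct'] = Int_R['INTDSRUSM193N'].pct_change(1) * 100
Int_R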

| | observation_date | INTDSRUSM193N | evolution | evolution_pct |
|---|---|---|---|---|
| 600 | 2000-01-01 | 5.00 | NaN | NaN |
| 601 | 2000-02-01 | 5.24 | 0.24 | 4.800000 |
| 602 | 2000-03-01 | 5.34 | 0.10 | 1.908397 |
| 603 | 2000-04-01 | 5.50 | 0.16 | 2.996255 |
| 604 | 2000-05-01 | 5.71 | 0.21 | 3.818182 |
| ... | ... | ... | ... | ... |
| 855 | 2021-04-01 | 0.25 | 0.00 | 0.000000 |
| 856 | 2021-05-01 | 0.25 | 0.00 | 0.000000 |
| 857 | 2021-06-01 | 0.25 | 0.00 | 0.000000 |
| 858 | 2021-07-01 | 0.25 | 0.00 | 0.000000 |


In [7]:
#consumer price index
# Define the URL for the CSV download
csv_url = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1320&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=CPIAUCSL&scale=left&cosd=1947-01-01&coed=2024-11-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=3&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-12-18&revision_date=2024-12-18&nd=1947-01-01"

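# Consumer Price Index for All Urban Consumers: All Items in U.S. City Average

# Download the CSV file into a DataFrame, parsing dates so they can be compared with timestamps later
CPI = pd.read_csv(csv_url, parse_dates=['observation_date'])
# keep observations from 2000 onwards (copy so the assignments below do not act on a slice)
CPI = CPI[CPI['observation_date'] >= '2000-01-01'].copy()
# month-over-month change, in levels and in percent
CPI['evolution'] = CPI['CPIAUCSL'].diff(1)
CPI['evolution_pct'] = CPI['CPIAUCSL'].pct_change(1) * 100
CPI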


| | observation_date | CPIAUCSL | evolution | evolution_pct |
|---|---|---|---|---|
| 636 | 2000-01-01 | 169.300 | NaN | NaN |
| 637 | 2000-02-01 | 170.000 | 0.700 | 0.413467 |
| 638 | 2000-03-01 | 171.000 | 1.000 | 0.588235 |
| 639 | 2000-04-01 | 170.900 | -0.100 | -0.058480 |
| 640 | 2000-05-01 | 171.200 | 0.300 | 0.175541 |
| ... | ... | ... | ... | ... |
| 930 | 2024-07-01 | 313.534 | 0.485 | 0.154928 |
| 931 | 2024-08-01 | 314.121 | 0.587 | 0.187221 |
| 932 | 2024-09-01 | 314.686 | 0.565 | 0.179867 |
| 933 | 2024-10-01 | 315.454 | 0.768 | 0.244053 |


In [8]:
#Unemployment rate
# Define the URL for the CSV download
csv_url = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1320&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&cosd=1948-01-01&coed=2024-11-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=3&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-12-18&revision_date=2024-12-18&nd=1948-01-01"

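# Unemployment Rate

# Download the CSV file into a DataFrame, parsing dates so they can be compared with timestamps later
UR = pd.read_csv(csv_url, parse_dates=['observation_date'])
# keep observations from 2000 onwards (copy so the assignments below do not act on a slice)
UR = UR[UR['observation_date'] >= '2000-01-01'].copy()
# month-over-month change, in levels and in percent
UR['evolution'] = UR['UNRATE'].diff(1)
UR['evolution_pct'] = UR['UNRATE'].pct_change(1) * 100
UR.head()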

| | observation_date | UNRATE | evolution | evolution_pct |
|---|---|---|---|---|
| 624 | 2000-01-01 | 4.0 | NaN | NaN |
| 625 | 2000-02-01 | 4.1 | 0.1 | 2.5 |
| 626 | 2000-03-01 | 4.0 | -0.1 | -2.439024 |
| 627 | 2000-04-01 | 3.8 | -0.2 | -5.0 |
| 628 | 2000-05-01 | 4.0 | 0.2 | 5.263158 |


# Machine Learning models

In [None]:
def get_variable_name(var):
    """Return the name of the first global variable bound to `var` (used to pick the matching checkpoint)."""
    return [name for name in globals() if globals()[name] is var][0]


def analyze_monetary_policy(df,api_key,prompt_mistral=prompt_mistral,use_checkpoints=True):
    df_name = get_variable_name(df)

    # Without API access, load the pre-computed Mistral answers instead of prompting.
    if not prompt_mistral:
        if df_name == 'communications':
            checkpoint = pd.read_csv('https://MachtaYassine.github.io/datasets/communications-mistral-prompted.csv',index_col=0)
        elif df_name == 'georgia_tech_df':
            checkpoint = pd.read_csv('https://MachtaYassine.github.io/datasets/georgia-tech-mistral-prompted.csv',index_col=0)
        else:
            raise ValueError('The dataset is not recognized. Please set prompt_mistral to True.')
        return checkpoint

    model = "mistral-large-latest"
    client = Mistral(api_key=api_key)
    checkpoint_path = f'{df_name}_checkpoint.csv'
    if use_checkpoints and os.path.exists(checkpoint_path):
        checkpoint = pd.read_csv(checkpoint_path)
        print('Checkpoint loaded.')
    else:
        checkpoint = pd.DataFrame(columns=['text'])

    # Resume after the last text that already has an answer in the checkpoint.
    for text in tqdm(df['Text'].values[len(checkpoint):]):
        chat_response = client.chat.complete(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": f"Act as a financial analyst. What is the monetary policy hawkishness of this text? \
    Please choose an answer from hawkish, dovish, neutral or unknown and provide a probability and a short explanation. \
        answer in this structure (no other text) : \n \
        label: hawkish, \n probability: 90%, \n explanation: The text contains a lot of positive words and is likely to be hawkish. \n \
    Text: {text}",
                },
            ]
        )

        response_message = chat_response.choices[0].message.content

        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
        checkpoint = pd.concat([checkpoint, pd.DataFrame({'text': [response_message]})], ignore_index=True)
        time.sleep(2)  # simple rate limiting between API calls
        checkpoint.to_csv(checkpoint_path,index=False)

    return checkpoint


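import re

def extract_label(text):
    """Return the label of a response, preferring the explicit `label:` field."""
    match = re.search(r'label\s*:\s*(hawkish|dovish|neutral|unknown)', text, flags=re.IGNORECASE)
    if match:
        return match.group(1).lower()
    # Fallback: first known label mentioned anywhere in the response.
    lowered = text.lower()
    for label in ("hawkish", "dovish", "neutral"):
        if label in lowered:
            return label
    return "unknown"

def extract_explanation(text):
    """Return the text after `explanation:` (case-insensitive), or an empty string if it is absent."""
    parts = re.split(r'explanation\s*:\s*', text, maxsplit=1, flags=re.IGNORECASE)
    return parts[1].strip() if len(parts) > 1 else ''

def process_checkpoint(checkpoint,source_df):
    # Rows are matched on the index, so both frames are expected to share the same index.
    source_df['label'] = checkpoint['text'].apply(extract_label)
    source_df['explanation'] = checkpoint['text'].apply(extract_explanation)
    return source_df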

def check_important_shift_dates(df,steps):
    cond=(np.abs(df['label2'].diff(1))>0)
    for step in range(2,steps+1):
        cond=cond & (np.abs(df['label2'].diff(step))>0)
    return cond


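def plot_monetary_policy(df,label):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Ensure Date2 is datetime and sorted
    df['Date2'] = pd.to_datetime(df['Release Date'])
    df = df.sort_values(by='Date2', ascending=False)

    # Encode labels (case-insensitive): hawkish -> 1, neutral -> 0, anything else -> -1
    df['label2'] = df[label].apply(lambda x: 1 if str(x).lower() == 'hawkish' else 0 if str(x).lower() == 'neutral' else -1)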

    # Identify important shifts
    df['shift_date'] = check_important_shift_dates(df, 5)

    # Initialize subplots
    fig, axes = plt.subplots(3, 1, figsize=(20, 15), sharex=True)

    # Plot each statement type separately
    for ax, (statement_type, group_df) in zip(axes[:2], df.groupby('Type')):
        ax.plot(group_df['Date2'], group_df['label2'], marker='o', label=statement_type)
        ax.set_title(f'{statement_type} Statements')
        ax.set_xlabel('Date')
        ax.set_ylabel('Label')
        ax.legend(title="Statement Type")

    # Plot both together in the last subplot
    for statement_type, group_df in df.groupby('Type'):
        axes[2].plot(group_df['Date2'], group_df['label2'], marker='o', label=statement_type)

    # Customize final plot
    axes[2].set_title('Combined Monetary Policy Hawkishness')
    axes[2].set_xlabel('Date')
    axes[2].set_ylabel('Label')
    axes[2].legend(title="Statement Type")

    # Customize x-ticks across all plots
    xticks_labels = [df['Date2'].iloc[i] if i == 0 or i == len(df) - 1 or df['shift_date'].iloc[i] else '' 
                    for i in range(len(df))]
    plt.xticks(df['Date2'], labels=xticks_labels, rotation=90)

    # Show the plots
    plt.tight_layout()
    plt.show()
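
The prompt above asks for a `label`, a `probability`, and an `explanation`, each on its own line. A minimal sketch of turning one such response into structured fields is given here; `parse_response` is a hypothetical helper, not part of the notebook, and it assumes the model follows the requested layout with a single-line explanation.

```python
import re

def parse_response(text):
    """Parse 'label: ..., probability: ...%, explanation: ...' into a dict."""
    fields = {"label": "unknown", "probability": None, "explanation": ""}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if not sep or key not in fields:
            continue  # skip lines that are not one of the three expected fields
        value = value.strip().rstrip(",").strip()
        if key == "label":
            fields["label"] = value.lower()
        elif key == "probability":
            number = re.search(r"\d+(?:\.\d+)?", value)
            fields["probability"] = float(number.group()) / 100 if number else None
        else:
            fields["explanation"] = value
    return fields

example = (
    "label: dovish, \n probability: 75%, \n "
    "explanation: The committee signals rate cuts and is not hawkish."
)
parsed = parse_response(example)
print(parsed)
```

Reading the label from its own field avoids misclassifying a response whose explanation merely mentions another label.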


In [None]:
com_test = communications.iloc[:10]
checkpoint_test = analyze_monetary_policy(com_test,api_key,prompt_mistral=prompt_mistral,use_checkpoints=False)

In [None]:
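import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig, pipeline

tokenizer = AutoTokenizer.from_pretrained("gtfintechlab/FOMC-RoBERTa", do_lower_case=True, do_basic_tokenize=True)
model = AutoModelForSequenceClassification.from_pretrained("gtfintechlab/FOMC-RoBERTa", num_labels=3)
config = AutoConfig.from_pretrained("gtfintechlab/FOMC-RoBERTa")

# Use the first GPU when available, otherwise fall back to CPU (device=-1)
device = 0 if torch.cuda.is_available() else -1
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer, config=config, device=device, framework="pt")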

In [None]:
results = classifier(communications['Text'].tolist(),
                      batch_size=128, truncation="only_first")
communications['RoBERTa_label_raw']=pd.DataFrame(results)['label'].map({'LABEL_0':'dovish','LABEL_1':'hawkish','LABEL_2':'Neutral'})
communications['RoBERTa_probability_raw']=pd.DataFrame(results)['score']

In [None]:
results_explanation = classifier(communications['Mistral_explanation'].tolist(),
                                    batch_size=128, truncation="only_first")
communications['RoBERTa_label_explanation']=pd.DataFrame(results_explanation)['label'].map({'LABEL_0':'dovish','LABEL_1':'hawkish','LABEL_2':'Neutral'})
communications['RoBERTa_probability_explanation']=pd.DataFrame(results_explanation)['score']

In [None]:
# compute accuracy (agreement rate) between the labelers
from sklearn.metrics import accuracy_score

label_pairs = {
    'LLM vs RoBERTa raw': ('label', 'RoBERTa_label_raw'),
    'LLM vs RoBERTa explanation': ('label', 'RoBERTa_label_explanation'),
    'RoBERTa raw vs RoBERTa explanation': ('RoBERTa_label_raw', 'RoBERTa_label_explanation'),
}

def pair_accuracy(data, col_a, col_b):
    # Compare case-insensitively: the RoBERTa mapping uses 'Neutral', extract_label returns 'neutral'
    return accuracy_score(data[col_a].str.lower(), data[col_b].str.lower())

for name, (col_a, col_b) in label_pairs.items():
    print(f'Accuracy of {name}: {pair_accuracy(communications, col_a, col_b)}')

# accuracy by type (Minute / Statement)
print('\n'+'---'*20+'\n')
for t in ['Minute','Statement']:
    subset = communications[communications['Type']==t]
    for name, (col_a, col_b) in label_pairs.items():
        print(f'Accuracy of {name} for {t}: {pair_accuracy(subset, col_a, col_b)}')
    print('\n'+'---'*20+'\n')

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score

def label_accuracy(data, col_a, col_b):
    # Case-insensitive comparison ('Neutral' vs 'neutral')
    return accuracy_score(data[col_a].str.lower(), data[col_b].str.lower())

comparisons = {
    'LLM vs RoBERTa Raw': ('label', 'RoBERTa_label_raw'),
    'LLM vs RoBERTa Explanation': ('label', 'RoBERTa_label_explanation'),
    'RoBERTa Raw vs Explanation': ('RoBERTa_label_raw', 'RoBERTa_label_explanation')
}

# Compute overall accuracies (stored separately so the per-type loop cannot overwrite them)
categories = list(comparisons)
overall_accuracies = [label_accuracy(communications, a, b) for a, b in comparisons.values()]

# Accuracy by type
types = ['Minute', 'Statement']
accuracies_by_type = {
    name: [label_accuracy(communications[communications['Type'] == t], a, b) for t in types]
    for name, (a, b) in comparisons.items()
}

# Plotting
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

# Overall accuracies

ax[0].bar(categories, overall_accuracies, color=['blue', 'green', 'orange'])
ax[0].set_title('Overall Accuracy')
ax[0].set_ylabel('Accuracy')
ax[0].set_ylim(0, 1)
ax[0].grid(axis='y', linestyle='--', alpha=0.7)

# Accuracy by type
width = 0.2
x = np.arange(len(types))

for i, (key, values) in enumerate(accuracies_by_type.items()):
    ax[1].bar(x + i * width, values, width, label=key)

ax[1].set_title('Accuracy by Type')
ax[1].set_ylabel('Accuracy')
ax[1].set_xlabel('Communication Type')
ax[1].set_xticks(x + width)
ax[1].set_xticklabels(types)
ax[1].legend()
ax[1].set_ylim(0, 1)
ax[1].grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()


In [None]:
import copy

df = copy.deepcopy(communications)
# Ensure Date2 is datetime and sorted
df['Date2'] = pd.to_datetime(df['Release Date'])
df = df.sort_values(by='Date2', ascending=False)

# Encode labels (case-insensitive): hawkish -> 1, neutral -> 0, anything else -> -1
df['label2'] = df['RoBERTa_label_explanation'].str.lower().map({'hawkish': 1, 'neutral': 0}).fillna(-1).astype(int)

# Identify important shifts
df['shift_date'] = check_important_shift_dates(df, 5)

# Yearly tick positions (January 1st of each year covered by the communications).
# jan_first_dates was not defined in the original cell; this definition is an assumption.
jan_first_dates = pd.Series(pd.date_range(df['Date2'].min(), df['Date2'].max(), freq='YS'))

# Macro series drawn on the secondary y-axis, one per column
macro_series = [('Interest Rate', Int_R), ('Unemployment Rate', UR), ('CPI', CPI)]

def add_macro_axis(ax, name, variable):
    """Overlay the monthly change of a macro series on a secondary y-axis."""
    ax2 = ax.twinx()  # second y-axis sharing the same x-axis
    ax2.plot(pd.to_datetime(variable['observation_date']), variable['evolution'], color='tab:red', label=name)
    ax2.set_ylabel(f'{name} (monthly change)', color='tab:red')
    ax2.set_xticks(jan_first_dates)
    ax2.set_xticklabels(jan_first_dates.dt.strftime('%Y'), rotation=45, ha='right')
    ax2.legend(loc='upper right')

# Initialize subplots
fig, axes = plt.subplots(3, 3, figsize=(40, 15), sharex=True)

# Rows 0 and 1: one row per statement type
for i, (statement_type, group_df) in enumerate(df.groupby('Type')):
    for j, (name, variable) in enumerate(macro_series):
        axes[i,j].plot(group_df['Date2'], group_df['label2'], marker='o', label=statement_type)
        axes[i,j].set_title(f'Type : {statement_type}')
        axes[i,j].set_xlabel('Date')
        axes[i,j].set_ylabel('Label')
        axes[i,j].legend(title="Statement Type")
        add_macro_axis(axes[i,j], name, variable)

# Row 2: both statement types together
for j, (name, variable) in enumerate(macro_series):
    for statement_type, group_df in df.groupby('Type'):
        axes[2,j].plot(group_df['Date2'], group_df['label2'], marker='o', label=statement_type)
    axes[2,j].set_title('Combined Monetary Policy Hawkishness')
    axes[2,j].set_xlabel('Date')
    axes[2,j].set_ylabel('Label')
    axes[2,j].legend(title="Statement Type")
    add_macro_axis(axes[2,j], name, variable)

# Show the plots
plt.tight_layout()
plt.show()

In [None]:
def check_months_apart(dates, x, months, before):
    """Check that the selected observations span roughly the whole window around `x`."""
    if len(dates) < months:
        return False
    # dates.iloc[-1] is the observation furthest from x; a month is approximated by 30 days
    gap = (x - dates.iloc[-1]) if before else (dates.iloc[-1] - x)
    return gap.days >= (months - 1) * 30

def calculate_evolution(df, date_col, value_col, x, months, before=True):
    """Sum `value_col` over the `months` observations just before (or after) date `x`."""
    # Make sure the dates are datetimes so they can be compared with x
    df = df.assign(**{date_col: pd.to_datetime(df[date_col])})
    filtered_df = df[df[date_col] <= x] if before else df[df[date_col] >= x]
    sorted_df = filtered_df.sort_values(by=date_col, ascending=not before).head(months)
    if check_months_apart(sorted_df[date_col], x, months, before):
        return round(sorted_df[value_col].sum(), 2)
    return np.nan

# Note: the "after" columns use data published after each communication,
# so they describe what followed rather than what was known at release time.
for months in [3, 6, 12]:
    for before in [True, False]:
        print(f"Months: {months} - Before: {before}")
        df[f"i_r_{months}_{'before' if before else 'after'}"] = df['Date2'].apply(
            lambda x: calculate_evolution(Int_R, 'observation_date', 'evolution', x, months, before)
        )
        df[f"cpi_{months}_{'before' if before else 'after'}"] = df['Date2'].apply(
            lambda x: calculate_evolution(CPI, 'observation_date', 'evolution', x, months, before)
        )
        df[f"u_r_{months}_{'before' if before else 'after'}"] = df['Date2'].apply(
            lambda x: calculate_evolution(UR, 'observation_date', 'evolution', x, months, before)
        )
# number of NaN values in each column that contains NaNs
df.isnull().sum()[df.isnull().sum()>0]

In [None]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import accuracy_score
df_dropped_na=df.dropna()
def evaluate_kmeans(df,clustering_algorithms):
    # Drop missing values
    df_dropped_na = df.dropna()

    # Define feature combinations
    feature_sets = {
        'features_3_before': ['i_r_3_before', 'cpi_3_before', 'u_r_3_before'],
        'features_3_after': ['i_r_3_after', 'cpi_3_after', 'u_r_3_after'],
        'features_3_all': ['i_r_3_before', 'cpi_3_before', 'u_r_3_before', 'i_r_3_after', 'cpi_3_after', 'u_r_3_after'],
        'features_6_before': ['i_r_6_before', 'cpi_6_before', 'u_r_6_before'],
        'features_6_after': ['i_r_6_after', 'cpi_6_after', 'u_r_6_after'],
        'features_6_all': ['i_r_6_before', 'cpi_6_before', 'u_r_6_before', 'i_r_6_after', 'cpi_6_after', 'u_r_6_after'],
        'features_12_before': ['i_r_12_before', 'cpi_12_before', 'u_r_12_before'],
        'features_12_after': ['i_r_12_after', 'cpi_12_after', 'u_r_12_after'],
        'features_12_all': ['i_r_12_before', 'cpi_12_before', 'u_r_12_before', 'i_r_12_after', 'cpi_12_after', 'u_r_12_after'],
        'features_after': ['i_r_3_after', 'cpi_3_after', 'u_r_3_after','i_r_6_after', 'cpi_6_after', 'u_r_6_after','i_r_12_after', 'cpi_12_after', 'u_r_12_after'],
        'features_before': ['i_r_3_before', 'cpi_3_before', 'u_r_3_before','i_r_6_before', 'cpi_6_before', 'u_r_6_before','i_r_12_before', 'cpi_12_before', 'u_r_12_before'],
        'features_all':['i_r_3_before', 'cpi_3_before', 'u_r_3_before','i_r_6_before', 'cpi_6_before', 'u_r_6_before','i_r_12_before', 'cpi_12_before', 'u_r_12_before','i_r_3_after', 'cpi_3_after', 'u_r_3_after','i_r_6_after', 'cpi_6_after', 'u_r_6_after','i_r_12_after', 'cpi_12_after', 'u_r_12_after']
    }

    results = {}

    for name, features in feature_sets.items():
        # Extract features
        X = df_dropped_na[features]
        # Compute accuracies with label switching
        true_labels = df_dropped_na['label2'].values
        

        for algo_name, algorithm in clustering_algorithms.items():
            # Fit clustering algorithm
            model = algorithm.fit(X)
            labels = model.labels_ if hasattr(model, 'labels_') else model.predict(X)

            # Cluster ids are arbitrary, so accuracy is computed under three id-to-class mappings
            # (the other permutations of the three classes are not tried here).
            # acc1: 0 -> -1, 1 -> 0, any other id -> 1
            acc1 = accuracy_score(true_labels, [-1 if x == 0 else 0 if x == 1 else 1 for x in labels])
            # acc2: raw cluster ids compared directly with the -1/0/1 encoding
            acc2 = accuracy_score(true_labels, labels)
            # acc3: 0 -> 1, 1 -> 0, any other id -> -1
            acc3 = accuracy_score(true_labels, [1 if x == 0 else 0 if x == 1 else -1 for x in labels])

            best_accuracy = max(acc1, acc2, acc3)
            results[f'{name}_{algo_name}'] = {
                'acc_original': acc2,
                'acc_switch_1': acc1,
                'acc_switch_2': acc3,
                'best_accuracy': best_accuracy
            }

    # Sort results by best accuracy
    sorted_results = dict(sorted(results.items(), key=lambda x: x[1]['best_accuracy'], reverse=True))

    # The first key is the best-scoring "<feature set>_<algorithm>" combination
    best_features = next(iter(sorted_results))
    best_accuracy = sorted_results[best_features]['best_accuracy']
    return sorted_results,best_accuracy,best_features

# Example usage
best_results,best_accuracy,best_features = evaluate_kmeans(df, {
    'kmeans': KMeans(n_clusters=3, random_state=0),
    'agg': AgglomerativeClustering(n_clusters=3),
    'dbscan': DBSCAN(eps=0.5, min_samples=5)
})
print(f"Best accuracy: {best_accuracy}")
print(f"Best features: {best_features}")
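
The label-switching step above compares cluster ids with the `-1/0/1` encoding under a few fixed mappings. A more exhaustive variant, sketched below under the assumption that the clusters are numbered `0, 1, 2`, tries every assignment of clusters to classes and keeps the best one. `best_permutation_accuracy` is a hypothetical helper, not part of the notebook, and ids outside `0..2` (for example DBSCAN noise) are counted as errors.

```python
from itertools import permutations

def best_permutation_accuracy(true_labels, cluster_ids, classes=(-1, 0, 1)):
    """Return the best accuracy over all cluster-id -> class mappings, and that mapping."""
    best_acc, best_mapping = -1.0, None
    for perm in permutations(classes):
        mapping = dict(enumerate(perm))  # cluster id 0, 1, 2 -> class
        hits = sum(mapping.get(int(c)) == int(t) for c, t in zip(cluster_ids, true_labels))
        acc = hits / len(true_labels)
        if acc > best_acc:
            best_acc, best_mapping = acc, mapping
    return best_acc, best_mapping

# Toy example: the clusters match the classes perfectly, up to renaming.
toy_true = [-1, -1, 0, 0, 1, 1]
toy_clusters = [2, 2, 0, 0, 1, 1]
toy_acc, toy_mapping = best_permutation_accuracy(toy_true, toy_clusters)
print(toy_acc, toy_mapping)
```

With three classes there are only six permutations, so the exhaustive search is cheap.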


In [None]:
# The best-scoring feature sets use the 3-month windows (before and after), so drop the 6- and 12-month columns
df.drop(['i_r_6_before', 'cpi_6_before', 'u_r_6_before','i_r_6_after', 'cpi_6_after', 'u_r_6_after','i_r_12_before', 'cpi_12_before', 'u_r_12_before','i_r_12_after', 'cpi_12_after', 'u_r_12_after'], axis=1, inplace=True)
    

In [None]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import accuracy_score
df_dropped_na=df.dropna()
def evaluate_kmeans(df,clustering_algorithms):
    # Drop missing values
    df_dropped_na = df.dropna()

    # Define feature combinations
    feature_sets = {
        'features_3_before': ['i_r_3_before', 'cpi_3_before', 'u_r_3_before'],
        'features_3_after': ['i_r_3_after', 'cpi_3_after', 'u_r_3_after'],
        'features_3_all': ['i_r_3_before', 'cpi_3_before', 'u_r_3_before', 'i_r_3_after', 'cpi_3_after', 'u_r_3_after'],
    }

    results = {}

    for name, features in feature_sets.items():
        # Extract features
        X = df_dropped_na[features]
        # Compute accuracies with label switching
        true_labels = df_dropped_na['label2'].values
        

        for algo_name, algorithm in clustering_algorithms.items():
            # Fit clustering algorithm
            model = algorithm.fit(X)
            labels = model.labels_ if hasattr(model, 'labels_') else model.predict(X)

            # Cluster ids are arbitrary, so accuracy is computed under three id-to-class mappings
            # (the other permutations of the three classes are not tried here).
            # acc1: 0 -> -1, 1 -> 0, any other id -> 1
            acc1 = accuracy_score(true_labels, [-1 if x == 0 else 0 if x == 1 else 1 for x in labels])
            # acc2: raw cluster ids compared directly with the -1/0/1 encoding
            acc2 = accuracy_score(true_labels, labels)
            # acc3: 0 -> 1, 1 -> 0, any other id -> -1
            acc3 = accuracy_score(true_labels, [1 if x == 0 else 0 if x == 1 else -1 for x in labels])

            best_accuracy = max(acc1, acc2, acc3)
            results[f'{name}_{algo_name}'] = {
                'acc_original': acc2,
                'acc_switch_1': acc1,
                'acc_switch_2': acc3,
                'best_accuracy': best_accuracy
            }

    # Sort results by best accuracy
    sorted_results = dict(sorted(results.items(), key=lambda x: x[1]['best_accuracy'], reverse=True))

    # The first key is the best-scoring "<feature set>_<algorithm>" combination
    best_features = next(iter(sorted_results))
    best_accuracy = sorted_results[best_features]['best_accuracy']
    return sorted_results,best_accuracy,best_features

# Example usage
best_results,best_accuracy,best_features = evaluate_kmeans(df, {
    'kmeans': KMeans(n_clusters=3, random_state=0),
    'agg': AgglomerativeClustering(n_clusters=3),
    'dbscan': DBSCAN(eps=0.5, min_samples=5)
})
print(f"Best accuracy: {best_accuracy}")
print(f"Best features: {best_features}")
