# Hackathon - Predicting mutual fund rating

## Problem Statement :
GreatStone Rating is a star-based ranking system. These ratings are based on the performance of a mutual fund, with adjustments for risks and costs, as compared to other funds in the same category. The rating ranges from 0 to 5.

## Goal: 
The goal of this hackathon is to predict GreatStone’s rating of a mutual fund. To help investors decide which mutual fund to pick for an investment, the task is to build a model that predicts the rating of a mutual fund from the various attributes that define it.

## Dataset Information :
This dataset comprises information on 25,000 mutual funds in the United States. Various attributes of each mutual fund are described, and GreatStone, a top mutual fund rating agency, uses these attributes to decide the rating of the fund. The attributes are provided in the following CSV files:
bond_ratings, fund_allocations, fund_config, fund_ratios, fund_specs, other_specs, return_3year, return_5year, return_10year.

## Files Description:
- bond_ratings consists of 12 columns which provide information on the bond rating percentage allocation of the mutual funds
- fund_allocations consists of 12 columns which provide information on the sector-wise percentage allocation of the mutual funds
- fund_config consists of 4 columns which hold the metadata of the mutual funds
- fund_ratios consists of 8 columns which provide information on various fundamental ratios that describe the mutual funds
- fund_specs contains 9 columns which give information about the specifications of the mutual funds
- other_specs contains 43 columns which give information on other aspects of the mutual funds
- return_3year contains 17 columns which give information about 3 year return and ratios
- return_5year contains 17 columns which give information about 5 year return and ratios
- return_10year contains 17 columns which give information about 10 year return and ratios

sample_submission contains the fund ids for which you need to provide ratings in the submission file. Please maintain the order of the fund ids as shown in this file. The tag column is a unique identifier and is the same as the id (i.e. tag = id).

## Train and Test Data
The train and test data are provided together in the CSV files described above. You need to segregate the training and test data based on where the GreatStone ratings are provided. Go through the files carefully to understand how you can segregate the two sets. Please maintain the ordering of the test data. You can use the sample submission file to get the IDs of the test data.
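
A minimal sketch of that segregation, assuming the label column is called `greatstone_rating` (the name used in the code further down). The frame and the submission order here are made up for illustration and are not taken from the real files:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged data; all values are invented.
df = pd.DataFrame({
    "tag": [101, 102, 103, 104],
    "greatstone_rating": [3.0, np.nan, 5.0, np.nan],
})

# Rows with a rating form the training set; rows without one are the test set.
train = df[df["greatstone_rating"].notnull()].copy()
test = df[df["greatstone_rating"].isnull()].copy()

# Reorder the test rows to follow the submission file (hypothetical order).
submission_order = [104, 102]
test = test.set_index("tag").loc[submission_order].reset_index()

print(train["tag"].tolist())
print(test["tag"].tolist())
```

Reindexing by the submission ids is one way to keep the required ordering even if the test rows get shuffled during later processing.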

## Evaluation Metric
Mean Precision Value: the mean of the precision over all classes, (P1 + P2 + ... + P6) / 6, where P1 is the precision of class 1, P2 is the precision of class 2, and so on.
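
As a rough illustration, and assuming this metric corresponds to what scikit-learn calls macro-averaged precision, it could be computed as below. The labels are invented for the example:

```python
from sklearn.metrics import precision_score

# Invented labels for illustration; ratings run from 0 to 5.
y_true = [0, 1, 2, 3, 4, 5, 3, 2]
y_pred = [0, 1, 2, 3, 4, 5, 2, 2]

# average=None returns one precision value per class.
per_class = precision_score(y_true, y_pred, average=None)

# average="macro" takes the unweighted mean of the per-class precisions.
mean_precision = precision_score(y_true, y_pred, average="macro")

print(per_class)
print(mean_precision)
```

Here class 2 is predicted three times but is correct only twice, so its precision is 2/3 while every other class has precision 1.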

## Input  & Output files
Available along with this GitHub folder

## Design Approach 

This section details the approach taken to solve the problem of predicting the rating of mutual funds. 

### Design decision 1:  Split the dataset
By the nature of the mutual fund industry, the handling of funds varies with the type of fund and the risk factors involved. Debt-based funds carry less risk, while equity-based funds carry higher market risk, so the ratings depend on the fund type or category. The first design decision is therefore to split the entire dataset by ‘Investment Class’ [Value-based investment, Growth and Blend]. When the investment class is empty (null), a separate class called ‘Unknown’ is created.
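
A small sketch of this split on a made-up frame. The `investment_class` column name and the class labels follow the code later in this document; the `subsets` dictionary is only an illustrative convenience:

```python
import pandas as pd

# Toy data; values are invented.
df = pd.DataFrame({
    "tag": [1, 2, 3, 4, 5],
    "investment_class": ["Value", "Growth", None, "Blend", "Value"],
})

# Missing investment classes become their own 'Unknown' class.
df["investment_class"] = df["investment_class"].fillna("Unknown")

# One independent DataFrame per class; .copy() keeps later edits on a
# subset from raising chained-assignment warnings.
subsets = {name: group.copy() for name, group in df.groupby("investment_class")}

print(sorted(subsets))
print(len(subsets["Value"]))
```
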
### Design decision 2 : Imputations
Missing data is imputed within each data subset, after the dataset has been split by investment class. Because the items in each class are treated separately, it is apt to use mean-based imputation within an investment class. 
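
One way to express class-wise mean imputation is a grouped transform, sketched here on invented data. The `fund_ratio` column is hypothetical, and the notebook below loops over the subsets instead of using `groupby`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "investment_class": ["Value", "Value", "Value", "Growth", "Growth"],
    "fund_ratio": [1.0, np.nan, 3.0, 10.0, np.nan],  # hypothetical feature
})

# Mean of the feature within each investment class, aligned row by row.
class_mean = df.groupby("investment_class")["fund_ratio"].transform("mean")

# Replace each missing value with the mean of its own class.
df["fund_ratio"] = df["fund_ratio"].fillna(class_mean)

print(df["fund_ratio"].tolist())
```

The missing Value entry is filled with 2.0 (the Value mean) and the missing Growth entry with 10.0, rather than both receiving one global mean.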

### Design decision 3 : Different algorithms for different investment classes
Since the dataset is separated by investment class, each subset is split into train and test. For each data subset, the train data is split again into X_train, y_train, X_test and y_test (the equivalent of train and validation data). Multiple algorithms are run on this train and validation data. For each model, randomized search cross-validation (RandomizedSearchCV) is used to find the hyperparameters that give the best results. The final model for a subset is the tuned model with the best results. The final result is therefore a combination of models, a different model for each data subset, whose predictions are merged to form the final result. 
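
The selection loop for a single subset might look roughly like the sketch below. It runs on synthetic data, the parameter grids are arbitrary, and `ExtraTreesClassifier` is only a placeholder second candidate (the notebook itself compares a random forest with XGBoost):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for one investment-class subset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=108)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=108)

# Candidate models with small, illustrative search spaces.
candidates = {
    "random_forest": (RandomForestClassifier(random_state=108),
                      {"n_estimators": [50, 100], "max_depth": [5, 10, None]}),
    "extra_trees": (ExtraTreesClassifier(random_state=108),
                    {"n_estimators": [50, 100], "min_samples_leaf": [1, 2, 4]}),
}

scores = {}
for name, (estimator, grid) in candidates.items():
    # Tune each candidate, then score the tuned model on the validation split.
    search = RandomizedSearchCV(estimator, param_distributions=grid, n_iter=3,
                                cv=3, scoring="precision_macro", random_state=108)
    search.fit(X_train, y_train)
    preds = search.best_estimator_.predict(X_val)
    scores[name] = precision_score(y_val, preds, average="macro")

# Keep the candidate with the best validation score for this subset.
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

Repeating this per subset and concatenating the per-subset predictions gives the merged result described above.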


In [1]:
# Import necessary modules
import numpy as np
import pandas as pd

#from sklearn.preprocessing import Imputer
from sklearn.preprocessing import OneHotEncoder
from sklearn import preprocessing

from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
#from sklearn import cross_validation, metrics   # Additional scikit-learn functions
#from sklearn.grid_search import GridSearchCV   # Performing grid search

# Below two lines are for randomized search with cross-validation
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold

# from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix



In [2]:
# Load the data frames from csv files
df_other_specs = pd.read_csv("other_specs.csv")
df_fund_specs = pd.read_csv("fund_specs.csv")
df_fund_ratios = pd.read_csv("fund_ratios.csv")
df_submission_file = pd.read_csv("sample_submission.csv")

df_return_3Y = pd.read_csv("return_3year.csv")
df_return_5Y = pd.read_csv("return_5year.csv")
df_return_10Y = pd.read_csv("return_10year.csv")

df_fund_config = pd.read_csv("fund_config.csv")

#df_fund_allocations = pd.read_csv("fund_allocations.csv")
#df_fund_allocations.rename(columns={'id': 'tag'},inplace=True)
## Above commented out as this reduces precision

df_bond_ratings = pd.read_csv("bond_ratings.csv")


## Others to be considered later are bond allocations for blend & value funds


In [3]:
# Create a df with common linkage between fund_id & tag
df_linkage = df_fund_ratios.filter(['fund_id', 'tag'],axis=1)

In [4]:
## Create features that may be useful in each df
#1. Fund_Specs: None to be created
df_fund_specs.drop(columns=['currency','total_assets', 'inception_date'], inplace=True)

df_other_specs.drop(columns=['greatstone_rating'], inplace=True)

## DF3 : Fund Ratios : 
df_fund_ratios.drop(columns=['mmc','fund_id','pb_ratio','ps_ratio','pc_ratio','pe_ratio'],inplace=True)

### NEW FEATURE CREATION. Create new feature(s) for the ratio of each fund value to the corresponding category value.

#DF4 : 3 year return : Retain = fund_return_3years, 3_years_return_category
df_return_3Y['3yrs_treynor_ratio_fund'] = pd.to_numeric(df_return_3Y['3yrs_treynor_ratio_fund'], downcast="float",errors='coerce')

df_return_3Y['3y_treynor_ratio'] = df_return_3Y['3yrs_treynor_ratio_fund']/df_return_3Y['3yrs_treynor_ratio_category']
df_return_3Y['3y_treynor_ratio'] = df_return_3Y['3y_treynor_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_3Y['3y_alpha_ratio'] = df_return_3Y['3_years_alpha_fund']/df_return_3Y['3_years_alpha_category']
df_return_3Y['3y_alpha_ratio'] = df_return_3Y['3y_alpha_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_3Y['3y_sharpe_ratio'] = df_return_3Y['3yrs_sharpe_ratio_fund']/df_return_3Y['3yrs_sharpe_ratio_category']
df_return_3Y['3y_sharpe_ratio'] = df_return_3Y['3y_sharpe_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_3Y['3y_rma_ratio'] = df_return_3Y['3_years_return_mean_annual_fund']/df_return_3Y['3_years_return_mean_annual_category']
df_return_3Y['3y_rma_ratio'] = df_return_3Y['3y_rma_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_3Y['3y_beta_ratio'] = df_return_3Y['fund_beta_3years']/df_return_3Y['category_beta_3years']
df_return_3Y['3y_beta_ratio'] = df_return_3Y['3y_beta_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_3Y['3y_r2_ratio'] = df_return_3Y['3years_fund_r_squared']/df_return_3Y['3years_category_r_squared']
df_return_3Y['3y_r2_ratio'] = df_return_3Y['3y_r2_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_3Y['3y_std_ratio'] = df_return_3Y['3years_fund_std']/df_return_3Y['3years_category_std']
df_return_3Y['3y_std_ratio'] = df_return_3Y['3y_std_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_3Y['3y_return_ratio'] = df_return_3Y['fund_return_3years']/df_return_3Y['3_years_return_category']
df_return_3Y['3y_return_ratio'] = df_return_3Y['3y_return_ratio'].replace((np.inf, -np.inf), (0, 0))

# Drop columns that are no longer required, as the new features are created from these
df_return_3Y.drop(columns=[
    '3yrs_treynor_ratio_fund','3yrs_treynor_ratio_category',
    '3_years_alpha_fund','3_years_alpha_category',
    '3yrs_sharpe_ratio_fund','3yrs_sharpe_ratio_category',
    '3_years_return_mean_annual_fund','3_years_return_mean_annual_category',
    'fund_beta_3years','category_beta_3years',
    '3years_fund_r_squared','3years_category_r_squared',
    '3years_fund_std','3years_category_std',
    'fund_return_3years','3_years_return_category'],inplace=True)

df_return_5Y['5yrs_treynor_ratio_fund'] = pd.to_numeric(df_return_5Y['5yrs_treynor_ratio_fund'], downcast="float",errors='coerce')

df_return_5Y['5y_treynor_ratio'] = df_return_5Y['5yrs_treynor_ratio_fund']/df_return_5Y['5yrs_treynor_ratio_category']
df_return_5Y['5y_treynor_ratio'] = df_return_5Y['5y_treynor_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_5Y['5y_alpha_ratio'] = df_return_5Y['5_years_alpha_fund']/df_return_5Y['5_years_alpha_category']
df_return_5Y['5y_alpha_ratio'] = df_return_5Y['5y_alpha_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_5Y['5y_sharpe_ratio'] = df_return_5Y['5yrs_sharpe_ratio_fund']/df_return_5Y['5yrs_sharpe_ratio_category']
df_return_5Y['5y_sharpe_ratio'] = df_return_5Y['5y_sharpe_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_5Y['5y_rma_ratio'] = df_return_5Y['5_years_return_mean_annual_fund']/df_return_5Y['5_years_return_mean_annual_category']
df_return_5Y['5y_rma_ratio'] = df_return_5Y['5y_rma_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_5Y['5y_beta_ratio'] = df_return_5Y['5_years_beta_fund']/df_return_5Y['5_years_beta_category']
df_return_5Y['5y_beta_ratio'] = df_return_5Y['5y_beta_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_5Y['5y_r2_ratio'] = df_return_5Y['5years_fund_r_squared']/df_return_5Y['category_r_squared_5years']
df_return_5Y['5y_r2_ratio'] = df_return_5Y['5y_r2_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_5Y['5y_std_ratio'] = df_return_5Y['5years_fund_std']/df_return_5Y['5years_category_std']
df_return_5Y['5y_std_ratio'] = df_return_5Y['5y_std_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_5Y['5y_return_ratio'] = df_return_5Y['5_years_return_fund']/df_return_5Y['5_years_return_category']
df_return_5Y['5y_return_ratio'] = df_return_5Y['5y_return_ratio'].replace((np.inf, -np.inf), (0, 0))

df_return_5Y.drop(columns=[
    '5yrs_treynor_ratio_fund','5yrs_treynor_ratio_category',
    '5_years_alpha_fund','5_years_alpha_category',
    '5yrs_sharpe_ratio_fund','5yrs_sharpe_ratio_category',
    '5_years_return_mean_annual_fund','5_years_return_mean_annual_category',
    '5_years_beta_fund','5_years_beta_category',
    '5years_fund_r_squared','category_r_squared_5years',
    '5years_fund_std','5years_category_std',
    '5_years_return_fund','5_years_return_category'],inplace=True)

#DF6 : 10 year return : 
df_return_10Y['10yrs_treynor_ratio_fund'] = pd.to_numeric(df_return_10Y['10yrs_treynor_ratio_fund'], downcast="float",errors='coerce')

df_return_10Y['10y_treynor_ratio'] = df_return_10Y['10yrs_treynor_ratio_fund']/df_return_10Y['10yrs_treynor_ratio_category']
df_return_10Y['10y_treynor_ratio'] = df_return_10Y['10y_treynor_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_10Y['10y_alpha_ratio'] = df_return_10Y['10_years_alpha_fund']/df_return_10Y['10_years_alpha_category']
df_return_10Y['10y_alpha_ratio'] = df_return_10Y['10y_alpha_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_10Y['10y_sharpe_ratio'] = df_return_10Y['10yrs_sharpe_ratio_fund']/df_return_10Y['10yrs_sharpe_ratio_category']
df_return_10Y['10y_sharpe_ratio'] = df_return_10Y['10y_sharpe_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_10Y['10y_rma_ratio'] = df_return_10Y['10_years_return_mean_annual_fund']/df_return_10Y['10_years_return_mean_annual_category']
df_return_10Y['10y_rma_ratio'] = df_return_10Y['10y_rma_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_10Y['10y_beta_ratio'] = df_return_10Y['10_years_beta_fund']/df_return_10Y['10_years_beta_category']
df_return_10Y['10y_beta_ratio'] = df_return_10Y['10y_beta_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_10Y['10y_r2_ratio'] = df_return_10Y['10years_fund_r_squared']/df_return_10Y['10years_category_r_squared']
df_return_10Y['10y_r2_ratio'] = df_return_10Y['10y_r2_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_10Y['10y_std_ratio'] = df_return_10Y['10years_fund_std']/df_return_10Y['10years_category_std']
df_return_10Y['10y_std_ratio'] = df_return_10Y['10y_std_ratio'].replace((np.inf, -np.inf), (0, 0))
df_return_10Y['10y_return_ratio'] = df_return_10Y['10_years_return_fund']/df_return_10Y['10_years_return_category']
df_return_10Y['10y_return_ratio'] = df_return_10Y['10y_return_ratio'].replace((np.inf, -np.inf), (0, 0))

df_return_10Y.drop(columns=[
    '10yrs_treynor_ratio_fund','10yrs_treynor_ratio_category',
    '10_years_alpha_fund','10_years_alpha_category',
    '10yrs_sharpe_ratio_fund','10yrs_sharpe_ratio_category',
    '10_years_return_mean_annual_fund','10_years_return_mean_annual_category',
    '10_years_beta_fund','10_years_beta_category',
    '10years_fund_r_squared','10years_category_r_squared',
    '10years_fund_std','10years_category_std',
    '10_years_return_fund','10_years_return_category'],inplace=True)

## FEATURE ENGINEERING : create a differentially weighted value for each bond rating.
df_bond_ratings['aaa_rating'] = df_bond_ratings['aaa_rating'] * 4
df_bond_ratings['aa_rating'] = df_bond_ratings['aa_rating'] * 3
df_bond_ratings['a_rating'] = df_bond_ratings['a_rating'] * 2
df_bond_ratings['bbb_rating'] = df_bond_ratings['bbb_rating'] * 1
df_bond_ratings['bb_rating'] = df_bond_ratings['bb_rating'] * 0.5
df_bond_ratings['b_rating'] = df_bond_ratings['b_rating'] * -1
df_bond_ratings['below_b_rating'] = df_bond_ratings['below_b_rating'] * -3
df_bond_ratings['others_rating'] = df_bond_ratings['others_rating'] * -3


df_bond_ratings.drop(columns=['us_govt_bond_rating'],inplace=True)


df_fund_config.drop(columns=['parent_company','fund_name'],inplace=True)
label_encoder = preprocessing.LabelEncoder()
df_fund_config['category']= label_encoder.fit_transform(df_fund_config['category'])
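
The fund-to-category ratio pattern repeated in this cell could be folded into one helper. The sketch below uses a hypothetical function name, `add_ratio_feature`, and toy data; it is not part of the original notebook:

```python
import numpy as np
import pandas as pd

def add_ratio_feature(df, new_col, fund_col, category_col):
    """Add fund/category ratio, mapping +/-inf (division by zero) to 0.

    A 0/0 division yields NaN and is left as NaN, as in the cell above.
    """
    ratio = df[fund_col] / df[category_col]
    df[new_col] = ratio.replace([np.inf, -np.inf], 0)
    return df

# Toy data; the column names mirror the 3-year return file used above.
toy = pd.DataFrame({
    "3_years_alpha_fund": [1.0, 2.0, -3.0],
    "3_years_alpha_category": [2.0, 0.0, 0.0],
})
toy = add_ratio_feature(toy, "3y_alpha_ratio",
                        "3_years_alpha_fund", "3_years_alpha_category")
print(toy["3y_alpha_ratio"].tolist())
```

Calling the helper once per (fund, category) column pair would replace each two-line block above with a single call.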


In [5]:
exclude_ny = ['tag','fund_id']
for i in df_return_3Y.columns:
    if i not in exclude_ny:
        df_return_3Y[i] = pd.to_numeric(df_return_3Y[i], downcast="float",errors='coerce')
for i in df_return_5Y.columns:
    if i not in exclude_ny:
        df_return_5Y[i] = pd.to_numeric(df_return_5Y[i], downcast="float",errors='coerce')

for i in df_return_10Y.columns:
    if i not in exclude_ny:
        df_return_10Y[i] = pd.to_numeric(df_return_10Y[i], downcast="float",errors='coerce')
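
For reference, this is how `pd.to_numeric` with `errors='coerce'` behaves on values that cannot be parsed. The input series is invented:

```python
import pandas as pd

# Invented raw values, as they might appear in a column read as object dtype.
raw = pd.Series(["1.25", "n/a", "3", None])

# Unparseable entries become NaN instead of raising an error.
clean = pd.to_numeric(raw, downcast="float", errors="coerce")

print(clean.tolist())
print(clean.dtype)
```

The NaN values created here are what the later imputation steps have to fill.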

In [6]:
## Create master data
# DF1 : Linkage. Drop no columns
# DF2 : Fund_Specs 
df_dataset = pd.merge(df_linkage,df_fund_specs,on='tag')

#DF3 : other_specs 
df_dataset = pd.merge(df_dataset,df_other_specs,on='tag')

# DF4 : Fund ratios : 
df_dataset = pd.merge(df_dataset,df_fund_ratios,on='tag')

#DF5 : 3 year return :
#df_return_3Y = df_return_3Y.filter(['tag','fund_return_3years', '3_years_return_category'],axis=1)
df_dataset = pd.merge(df_dataset,df_return_3Y,on='tag')

#DF6 : 5 Year return
df_dataset = pd.merge(df_dataset,df_return_5Y,on='tag')

#DF7 : 10 Year return
df_dataset = pd.merge(df_dataset,df_return_10Y,on='fund_id')

# DF8 : Category from fund_config
df_dataset = pd.merge(df_dataset,df_fund_config,on='fund_id')

# DF9 : Fund Allocations
#df_dataset = pd.merge(df_dataset,df_fund_allocations,on='tag')

# DF10 :  Bond Ratings
df_dataset = pd.merge(df_dataset,df_bond_ratings,on='tag')


In [7]:
# Columns stored as object are converted to float
df_dataset['pc_ratio'] = pd.to_numeric(df_dataset['pc_ratio'], downcast="float",errors='coerce')
df_dataset['pb_ratio'] = pd.to_numeric(df_dataset['pb_ratio'], downcast="float",errors='coerce')
df_dataset['ps_ratio'] = pd.to_numeric(df_dataset['ps_ratio'], downcast="float",errors='coerce')
df_dataset['pe_ratio'] = pd.to_numeric(df_dataset['pe_ratio'], downcast="float",errors='coerce')
df_dataset['mmc'] = pd.to_numeric(df_dataset['mmc'], downcast="float",errors='coerce')

In [8]:
# Data cleaning

In [9]:
#Impute blanks in investment_class with 'Unknown'
df_dataset['investment_class'] = df_dataset['investment_class'].fillna('Unknown')

In [10]:
#Impute blanks in fund_size with 'Unknown'
df_dataset['fund_size'] = df_dataset['fund_size'].fillna('Unknown')

In [11]:
# One hot encode fund size
fund_size = pd.get_dummies(df_dataset['fund_size'],drop_first=True)
df_dataset.drop(['fund_size'],axis=1,inplace=True)
df_dataset = pd.concat([df_dataset,fund_size],axis=1)

In [12]:
# Impute missing yield with zero
df_dataset['yield'] = df_dataset['yield'].fillna(0)

In [13]:
exclude = ['fund_id','tag','investment_class','greatstone_rating'] #,'fund_size']
for i in df_dataset.columns:
    if i not in exclude :
        df_dataset[i] = round(pd.to_numeric(df_dataset[i]),3)
        df_dataset[i] = df_dataset[i].fillna(0)

In [14]:
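# .copy() gives each subset its own data, so the per-subset edits below
# do not trigger pandas' SettingWithCopyWarning
df_Value = df_dataset[df_dataset['investment_class']=='Value'].copy()
df_Blend = df_dataset[df_dataset['investment_class']=='Blend'].copy()
df_Growth = df_dataset[df_dataset['investment_class']=='Growth'].copy()
df_Unknown = df_dataset[df_dataset['investment_class']=='Unknown'].copy()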

In [15]:
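# For each data subset, impute with the mean of the columns within the data subset (Design decision #2)
# Note: cell 13 already filled missing numeric values with 0, so only values still missing at this point are affected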

data_subsets = [df_Value,df_Blend,df_Growth,df_Unknown]
for df in data_subsets:
    exclude = ['fund_id','tag','investment_class','yield','greatstone_rating'] #,'fund_size']
    for i in df.columns:
        if i not in exclude :            
            df[i] = pd.to_numeric(df[i])
            df[i] = df[i].fillna(df[i].mean())




In [16]:
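# Create a correlation table (CSV) to be inspected outside the notebook. Conditional formatting can be used to find correlations.
# A pictorial view such as a seaborn pairplot is avoided here, as it would be unreadable with this many columns

# Restrict to numeric columns; recent pandas versions raise on non-numeric columns such as 'investment_class'
df_corr = df_dataset.select_dtypes(include='number').corr()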
df_corr.to_csv("correlations.csv")

#### Below code is for the data subset where Investment class is VALUE.

In [17]:
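## Full set of code for investment_class = VALUE
# .copy() makes the train/test frames independent, so the in-place edits below avoid SettingWithCopyWarning
df_Value_train = df_Value[df_Value.greatstone_rating.notnull()].copy()
df_Value_test = df_Value[df_Value.greatstone_rating.isnull()].copy()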

df_linkage = df_dataset.filter(['fund_id', 'tag'],axis=1)
df_linkage_Value = df_Value.filter(['fund_id', 'tag'],axis=1)
df_Value_train_keys = df_Value_train.filter(['tag'],axis=1)
df_Value_test_keys = df_Value_test.filter(['tag'],axis=1)

keys_train = list(df_Value_train_keys.columns.values)
i1 = df_linkage_Value.set_index(keys_train).index
i2 = df_Value_train_keys.set_index(keys_train).index
df_Value_linkage_train = df_linkage_Value[i1.isin(i2)]
df_Value_linkage_test = df_linkage_Value[~i1.isin(i2)]

# FEATURE ENGINEERING : Create new feature based on years up & years down. 
df_Value_train['yrs_Up_Down'] =df_Value_train['years_up']-df_Value_train['years_down']
df_Value_train.drop(columns=['years_up','years_down'],inplace=True)
df_Value_test['yrs_Up_Down'] =df_Value_test['years_up']-df_Value_test['years_down']
df_Value_test.drop(columns=['years_up','years_down'],inplace=True)

df_Value_train_X = df_Value_train.drop(labels='greatstone_rating',axis=1)
df_Value_train_y = df_Value_train['greatstone_rating']

## Below code creates a train/validation split out of Value_Train
VX_train, VX_test, Vy_train, Vy_test = train_test_split(df_Value_train_X,df_Value_train_y,test_size=0.2, random_state=108)

# Drop fund_id & tag columns in both test & train 
df_Value_train_X.drop(columns=['fund_id','tag'],inplace=True)
df_Value_test_y = df_Value_test['greatstone_rating']
df_Value_test.drop(columns=['fund_id','tag','greatstone_rating'],inplace=True)


# OneHot encode investment class in both train_X & test
inv_class = pd.get_dummies(df_Value_train_X['investment_class'],drop_first=True)
df_Value_train_X.drop(['investment_class'],axis=1,inplace=True)
df_Value_train_X = pd.concat([df_Value_train_X,inv_class],axis=1)

inv_class = pd.get_dummies(df_Value_test['investment_class'],drop_first=True)
df_Value_test.drop(['investment_class'],axis=1,inplace=True)
df_Value_test = pd.concat([df_Value_test,inv_class],axis=1)
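
One thing worth noting about the one-hot step above: inside a single subset, `investment_class` holds only one value, so `get_dummies(..., drop_first=True)` produces no columns and the step amounts to dropping the column. A small check on invented data:

```python
import pandas as pd

# Inside one subset, investment_class holds a single value.
subset = pd.DataFrame({
    "investment_class": ["Value", "Value", "Value"],
    "yield": [1.2, 0.8, 2.1],
})

# With one category and drop_first=True, no dummy columns remain.
dummies = pd.get_dummies(subset["investment_class"], drop_first=True)
print(dummies.shape)

# Concatenating the empty dummy frame leaves only the other features.
encoded = pd.concat([subset.drop(columns=["investment_class"]), dummies], axis=1)
print(encoded.columns.tolist())
```
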



In [18]:
## Temp Model for Value : (RANDOM FOREST)
VX_train.drop(columns=['fund_id','tag','investment_class'],inplace=True)
VX_test.drop(columns=['fund_id','tag','investment_class'],inplace=True)

#### Note : The hyperparameters in each of the models below are the result of RandomizedSearchCV after multiple iterations. The RandomizedSearchCV code appears further below, commented out, as it does not need to be run every time.

#### MODEL 1 : RandomForest Classifier 

In [19]:
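# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in newer scikit-learn releases
rf_model = RandomForestClassifier(n_estimators=1000, max_depth=50, min_samples_split=5, min_samples_leaf=2, max_features = 'sqrt', bootstrap = False, random_state= 108)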
rf_model.fit(VX_train,Vy_train)
predictions = rf_model.predict(VX_test)

accuracy_score = metrics.accuracy_score(Vy_test, predictions)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(Vy_test, predictions))

metrics.confusion_matrix(Vy_test, predictions)

              precision    recall  f1-score   support

         0.0       0.97      0.97      0.97        39
         1.0       0.88      0.64      0.74        83
         2.0       0.81      0.79      0.80       264
         3.0       0.78      0.86      0.82       364
         4.0       0.81      0.81      0.81       241
         5.0       0.77      0.68      0.72        59

    accuracy                           0.81      1050
   macro avg       0.84      0.79      0.81      1050
weighted avg       0.81      0.81      0.81      1050



array([[ 38,   1,   0,   0,   0,   0],
       [  0,  53,  28,   2,   0,   0],
       [  0,   6, 209,  48,   1,   0],
       [  1,   0,  21, 314,  27,   1],
       [  0,   0,   0,  35, 195,  11],
       [  0,   0,   0,   2,  17,  40]], dtype=int64)
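
The per-class precision in the report, and the mean precision used as the evaluation metric, can be recovered directly from the confusion matrix above. A short sketch with NumPy:

```python
import numpy as np

# Confusion matrix reported above (rows: true rating, columns: predicted rating).
cm = np.array([
    [38, 1, 0, 0, 0, 0],
    [0, 53, 28, 2, 0, 0],
    [0, 6, 209, 48, 1, 0],
    [1, 0, 21, 314, 27, 1],
    [0, 0, 0, 35, 195, 11],
    [0, 0, 0, 2, 17, 40],
])

# Precision of a class = correct predictions / all predictions of that class,
# i.e. the diagonal entry divided by its column sum.
per_class_precision = np.diag(cm) / cm.sum(axis=0)
mean_precision = per_class_precision.mean()

print(np.round(per_class_precision, 2))
print(round(float(mean_precision), 2))
```

The rounded values agree with the precision column and the macro average (0.84) in the classification report.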

#### Below is the RandomizedSearchCV code for the RandomForest algorithm. It is used only during hyperparameter tuning. Uncomment it whenever needed. The same code can be reused for all data subsets by passing the right train & test values.

In [20]:
# from sklearn.model_selection import RandomizedSearchCV

# # Number of trees in random forest
# n_estimators = [int(x) for x in np.linspace(start = 500, stop = 5000, num = 10)]
# # Number of features to consider at every split
# max_features = ['sqrt','log2']
# # Maximum number of levels in tree
# max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
# max_depth.append(None)
# # Minimum number of samples required to split a node
# min_samples_split = [2, 5, 10]
# # Minimum number of samples required at each leaf node
# min_samples_leaf = [1, 2, 4]
# # Method of selecting samples for training each tree
# bootstrap = [True, False]
# # Create the random grid
# random_grid = {'n_estimators': n_estimators,
#                'max_features': max_features,
#                'max_depth': max_depth,
#                'min_samples_split': min_samples_split,
#                'min_samples_leaf': min_samples_leaf,
#                'bootstrap': bootstrap}
# print(random_grid)


# rf_model = RandomForestClassifier()

# rf_random = RandomizedSearchCV(estimator = rf_model, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=108, n_jobs = -1)

# rf_random.fit(VX_train,Vy_train)

# print(rf_random.best_params_)

In [21]:
#rf_random.best_estimator_

In [22]:
#rf_random.best_score_

#### MODEL 2 : XGBOOST 

In [23]:
# XG Boost
D_train = xgb.DMatrix(VX_train, label=Vy_train)
D_test = xgb.DMatrix(VX_test, label=Vy_test)

param = {
    'eta': 0.01, 
    'max_depth': 9,
    'gamma' : 0.5,
    'min_child_weight' :10 ,
    'subsample' : 1.0,
    'random_state': 108,
    'objective': 'multi:softprob',  
    'num_class': 6} 

steps = 3000  # The number of training iterations

xgb_model = xgb.train(param, D_train, steps)
predictions = xgb_model.predict(D_test)
best_preds = np.asarray([np.argmax(line) for line in predictions])


accuracy_score = metrics.accuracy_score(Vy_test, best_preds)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(Vy_test, best_preds))

metrics.confusion_matrix(Vy_test, best_preds)



              precision    recall  f1-score   support

         0.0       0.97      1.00      0.99        39
         1.0       0.84      0.65      0.73        83
         2.0       0.78      0.78      0.78       264
         3.0       0.80      0.84      0.82       364
         4.0       0.82      0.84      0.83       241
         5.0       0.75      0.71      0.73        59

    accuracy                           0.81      1050
   macro avg       0.83      0.80      0.81      1050
weighted avg       0.81      0.81      0.81      1050



array([[ 39,   0,   0,   0,   0,   0],
       [  0,  54,  28,   1,   0,   0],
       [  0,  10, 206,  47,   1,   0],
       [  1,   0,  30, 304,  28,   1],
       [  0,   0,   0,  25, 203,  13],
       [  0,   0,   0,   2,  15,  42]], dtype=int64)
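
With the `multi:softprob` objective, the booster returns one probability per class for each row, and the cell above converts these to labels with an argmax. The sketch below shows that step on a hypothetical probability array:

```python
import numpy as np

# Hypothetical softprob output: one row per fund, one column per rating class.
probs = np.array([
    [0.05, 0.10, 0.60, 0.15, 0.05, 0.05],
    [0.70, 0.10, 0.05, 0.05, 0.05, 0.05],
    [0.02, 0.03, 0.05, 0.10, 0.30, 0.50],
])

# Row-wise argmax picks the most probable class; this is the vectorised
# equivalent of the list comprehension used in the cell above.
best_preds = probs.argmax(axis=1)
loop_preds = np.asarray([np.argmax(line) for line in probs])

print(best_preds.tolist())
```
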

#### Below is the RandomizedSearchCV code for the XGBoost algorithm. It is used only during hyperparameter tuning. Uncomment it whenever needed. The same code can be reused for all data subsets by passing the right train & test values.

In [24]:
# STOP
# ##### DO A RANDOM SEARCH CV

# # A parameter grid for XGBoost
# params = {
#         'min_child_weight': [2,3,4,5],
#         'gamma': [ 1.5, 1.75,2,2.25],
#         'subsample': [0.8,1.0],
#         'colsample_bytree': [0.5,0.6,0.7, 0.8],
#         'max_depth': [7,8,9,10]
#         }

# # Named xgb_clf so that the imported xgb module is not overwritten
# xgb_clf = XGBClassifier(learning_rate=0.01, n_estimators=3000, objective='multi:softmax',
#                     silent=True, nthread=1)

# folds = 4
# param_comb = 5

# skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 108)

# # Pass the train data of the subset being tuned (Value shown here)
# random_search = RandomizedSearchCV(xgb_clf, param_distributions=params, n_iter=param_comb, scoring='accuracy',
#                                    n_jobs=4, cv=skf.split(VX_train,Vy_train), verbose=3, random_state=108 )

# random_search.fit(VX_train,Vy_train)

# print(random_search.best_params_)

In [25]:
#random_search.best_estimator_

In [26]:
#random_search.best_score_

#### MODEL 3 : XGBOOST Classifier 
Use same hyper parameters of XGBoost 

In [27]:
xgb1=XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.4, gamma=0.5, 
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.01, max_delta_step=0, max_depth=9,
              min_child_weight=10, monotone_constraints=None,
              n_estimators=3000, n_jobs=1, nthread=1, num_parallel_tree=1,
              objective='multi:softmax', random_state=108, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, silent=True, subsample=1.0,
              tree_method=None, validate_parameters=False, verbosity=None)
xgb1.fit(VX_train,Vy_train)
predictions = xgb1.predict(VX_test)
#best_preds = np.asarray([np.argmax(line) for line in predictions])


accuracy_score = metrics.accuracy_score(Vy_test, predictions)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(Vy_test, predictions))

metrics.confusion_matrix(Vy_test, predictions)

              precision    recall  f1-score   support

         0.0       0.97      1.00      0.99        39
         1.0       0.85      0.64      0.73        83
         2.0       0.79      0.78      0.79       264
         3.0       0.80      0.85      0.83       364
         4.0       0.83      0.86      0.84       241
         5.0       0.81      0.71      0.76        59

    accuracy                           0.82      1050
   macro avg       0.84      0.81      0.82      1050
weighted avg       0.82      0.82      0.81      1050



array([[ 39,   0,   0,   0,   0,   0],
       [  0,  53,  29,   1,   0,   0],
       [  0,   9, 207,  47,   1,   0],
       [  1,   0,  26, 309,  27,   1],
       [  0,   0,   0,  25, 207,   9],
       [  0,   0,   0,   2,  15,  42]], dtype=int64)

#### Below code is for the data subset where Investment class is BLEND
Repeat the same set of actions as for VALUE. 

In [28]:
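## Full set of code for Blend
# .copy() makes the train/test frames independent, so the in-place edits below avoid SettingWithCopyWarning
df_Blend_train = df_Blend[df_Blend.greatstone_rating.notnull()].copy()
df_Blend_test = df_Blend[df_Blend.greatstone_rating.isnull()].copy()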
print(df_Blend_train.shape)
print(df_Blend_test.shape)

## 'New Code'
df_linkage = df_dataset.filter(['fund_id', 'tag'],axis=1)

df_linkage_Blend = df_Blend.filter(['fund_id', 'tag'],axis=1)

df_Blend_train_keys = df_Blend_train.filter(['tag'],axis=1)
df_Blend_test_keys = df_Blend_test.filter(['tag'],axis=1)

keys_train = list(df_Blend_train_keys.columns.values)
i1 = df_linkage_Blend.set_index(keys_train).index
i2 = df_Blend_train_keys.set_index(keys_train).index
df_Blend_linkage_train = df_linkage_Blend[i1.isin(i2)]
df_Blend_linkage_test = df_linkage_Blend[~i1.isin(i2)]


df_Blend_train.drop(columns=['bb_rating','below_b_rating','others_rating','maturity_bond','b_rating','a_rating','aaa_rating','aa_rating','bbb_rating','duration_bond'],inplace=True)
df_Blend_test.drop(columns=['bb_rating','below_b_rating','others_rating','maturity_bond','b_rating','a_rating','aaa_rating','aa_rating','bbb_rating','duration_bond'],inplace=True)

df_Blend_train_X = df_Blend_train.drop(labels='greatstone_rating',axis=1)
df_Blend_train_y = df_Blend_train['greatstone_rating']

## Below code for creatign train-test out of  Blend_Train
BX_train, BX_test, By_train, By_test = train_test_split(df_Blend_train_X,df_Blend_train_y,test_size=0.2, random_state=108)

# Drop fund_id & tag columns in both test & train
df_Blend_train_X.drop(columns=['fund_id','tag'],inplace=True)
df_Blend_test_y = df_Blend_test['greatstone_rating']
df_Blend_test.drop(columns=['fund_id','tag','greatstone_rating'],inplace=True)

# OneHot encode investment class in both train_X & test
inv_class = pd.get_dummies(df_Blend_train_X['investment_class'],drop_first=True)
df_Blend_train_X.drop(['investment_class'],axis=1,inplace=True)
df_Blend_train_X = pd.concat([df_Blend_train_X,inv_class],axis=1)

inv_class = pd.get_dummies(df_Blend_test['investment_class'],drop_first=True)
df_Blend_test.drop(['investment_class'],axis=1,inplace=True)
df_Blend_test = pd.concat([df_Blend_test,inv_class],axis=1)

(8217, 86)
(2081, 86)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
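
The warning above appears because `drop(..., inplace=True)` is called on a frame that was obtained by filtering another frame. One common way to avoid it, shown here as a sketch on made-up data rather than as the notebook's code, is to take an explicit `.copy()` of the filtered rows before modifying them.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_Blend
df = pd.DataFrame({
    "fund_id": ["a", "b", "c"],
    "bb_rating": [0.1, 0.2, 0.3],
    "greatstone_rating": [2.0, np.nan, 4.0],
})

# Explicit copy: the labelled subset is now an independent frame
train = df[df["greatstone_rating"].notnull()].copy()

# In-place edits on the copy leave the source frame untouched
train.drop(columns=["bb_rating"], inplace=True)

print(list(train.columns))
print(list(df.columns))
```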


In [29]:
## Temp Model for Blend : (RANDOM FOREST)

## BX_train, BX_test, By_train, By_test
BX_train.drop(columns=['fund_id','tag','investment_class'],inplace=True)
BX_test.drop(columns=['fund_id','tag','investment_class'],inplace=True)

In [30]:

rf_model = RandomForestClassifier(n_estimators=1500, max_depth=80, min_samples_split=5, min_samples_leaf=2, max_features = 'sqrt', bootstrap = False, random_state= 108)
rf_model.fit(BX_train,By_train)
predictions1 = rf_model.predict(BX_test)

accuracy_score = metrics.accuracy_score(By_test, predictions1)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(By_test, predictions1))

metrics.confusion_matrix(By_test, predictions1)

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99       143
         1.0       0.89      0.62      0.74       120
         2.0       0.77      0.77      0.77       337
         3.0       0.82      0.88      0.85       588
         4.0       0.83      0.82      0.83       359
         5.0       0.89      0.77      0.83        97

    accuracy                           0.83      1644
   macro avg       0.86      0.81      0.83      1644
weighted avg       0.83      0.83      0.83      1644



array([[143,   0,   0,   0,   0,   0],
       [  0,  75,  44,   1,   0,   0],
       [  3,   9, 260,  63,   2,   0],
       [  0,   0,  32, 517,  38,   1],
       [  0,   0,   2,  53, 296,   8],
       [  0,   0,   0,   0,  22,  75]], dtype=int64)
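
The figures in the classification report can be recovered from the confusion matrix, where rows are the true ratings and columns are the predicted ratings. As a worked check on the Blend random forest matrix above:

```python
import numpy as np

# Confusion matrix copied from the Blend random forest output above
cm = np.array([
    [143,   0,   0,   0,   0,   0],
    [  0,  75,  44,   1,   0,   0],
    [  3,   9, 260,  63,   2,   0],
    [  0,   0,  32, 517,  38,   1],
    [  0,   0,   2,  53, 296,   8],
    [  0,   0,   0,   0,  22,  75],
])

correct = np.diag(cm)                 # correctly classified funds per class
recall = correct / cm.sum(axis=1)     # divide by row totals (true counts)
precision = correct / cm.sum(axis=0)  # divide by column totals (predicted counts)
accuracy = correct.sum() / cm.sum()

print(recall)
print(precision)
print(accuracy)
```

For rating 1.0, for example, recall is 75/120 = 0.625 and precision is 75/84 ≈ 0.89, and overall accuracy is 1366/1644 ≈ 0.83, in line with the report. Most of the misses for rating 1.0 are funds predicted as 2.0.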

In [31]:
# Use XGBOOST
D_train = xgb.DMatrix(BX_train, label=By_train)
D_test = xgb.DMatrix(BX_test, label=By_test)

param = {
    'eta': 0.01,
    'gamma': 1.75,
    'min_child_weight': 9,
    'max_depth': 5,  
    'subsample': 1.0,
    'random_state': 108,
    'objective': 'multi:softprob',  
    'num_class': 6} 


steps = 3000  # The number of training iterations

xgb_model = xgb.train(param, D_train, steps)
predictions = xgb_model.predict(D_test)
best_preds = np.asarray([np.argmax(line) for line in predictions])


accuracy_score = metrics.accuracy_score(By_test, best_preds)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(By_test, best_preds))

metrics.confusion_matrix(By_test, best_preds)


              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98       143
         1.0       0.90      0.67      0.77       120
         2.0       0.75      0.78      0.76       337
         3.0       0.83      0.83      0.83       588
         4.0       0.80      0.83      0.82       359
         5.0       0.84      0.78      0.81        97

    accuracy                           0.82      1644
   macro avg       0.84      0.81      0.83      1644
weighted avg       0.82      0.82      0.82      1644



array([[142,   0,   1,   0,   0,   0],
       [  0,  80,  40,   0,   0,   0],
       [  6,   9, 263,  57,   2,   0],
       [  0,   0,  47, 488,  51,   2],
       [  0,   0,   2,  45, 299,  13],
       [  0,   0,   0,   0,  21,  76]], dtype=int64)
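
With the `multi:softprob` objective, `predict` returns one probability per class for each row, so the predicted rating is the position of the largest probability in that row. A small sketch with made-up probabilities (not model output) shows that the list comprehension used above gives the same result as a vectorised `argmax` along axis 1:

```python
import numpy as np

# Made-up class probabilities for 3 funds and 6 rating classes (0 to 5)
probs = np.array([
    [0.05, 0.10, 0.60, 0.15, 0.05, 0.05],
    [0.01, 0.02, 0.07, 0.20, 0.60, 0.10],
    [0.70, 0.10, 0.10, 0.05, 0.03, 0.02],
])

# Row-by-row version, as written in the cell above
loop_preds = np.asarray([np.argmax(row) for row in probs])

# Vectorised version: index of the maximum in each row
vector_preds = probs.argmax(axis=1)

print(loop_preds)
print(vector_preds)
```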

In [32]:
xgb1=XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7, gamma=1.75,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.01, max_delta_step=0, max_depth=5,
              min_child_weight=9, monotone_constraints=None,
              n_estimators=3000, n_jobs=1, nthread=1, num_parallel_tree=1,
              objective='multi:softmax', random_state=108, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, silent=True, subsample=1.0,
              tree_method=None, validate_parameters=False, verbosity=None)
xgb1.fit(BX_train,By_train)
predictions = xgb1.predict(BX_test)

accuracy_score = metrics.accuracy_score(By_test, predictions)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(By_test, predictions))

metrics.confusion_matrix(By_test, predictions)

              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98       143
         1.0       0.91      0.72      0.80       120
         2.0       0.76      0.77      0.77       337
         3.0       0.82      0.85      0.83       588
         4.0       0.82      0.82      0.82       359
         5.0       0.83      0.79      0.81        97

    accuracy                           0.82      1644
   macro avg       0.85      0.82      0.83      1644
weighted avg       0.83      0.82      0.82      1644



array([[142,   0,   1,   0,   0,   0],
       [  0,  86,  34,   0,   0,   0],
       [  6,   9, 261,  58,   3,   0],
       [  0,   0,  46, 497,  43,   2],
       [  0,   0,   1,  51, 293,  14],
       [  0,   0,   0,   0,  20,  77]], dtype=int64)

####  Below code is for data subset where Investment class is GROWTH
Repeat the same set of actions as in VALUE. 

In [33]:
## Full set of code for Growth
df_Growth_train = df_Growth[df_Growth.greatstone_rating.notnull()]
df_Growth_test = df_Growth[df_Growth.greatstone_rating.isnull()]
print(df_Growth_train.shape)
print(df_Growth_test.shape)

## 'New Code'
df_linkage = df_dataset.filter(['fund_id', 'tag'],axis=1)

df_linkage_Growth = df_Growth.filter(['fund_id', 'tag'],axis=1)

df_Growth_train_keys = df_Growth_train.filter(['tag'],axis=1)
df_Growth_test_keys = df_Growth_test.filter(['tag'],axis=1)

keys_train = list(df_Growth_train_keys.columns.values)
i1 = df_linkage_Growth.set_index(keys_train).index
i2 = df_Growth_train_keys.set_index(keys_train).index
df_Growth_linkage_train = df_linkage_Growth[i1.isin(i2)]
df_Growth_linkage_test = df_linkage_Growth[~i1.isin(i2)]

df_Growth_train.drop(columns=['bb_rating','below_b_rating','others_rating','maturity_bond','b_rating','a_rating','aaa_rating','aa_rating','bbb_rating','duration_bond'],inplace=True)
df_Growth_test.drop(columns=['bb_rating','below_b_rating','others_rating','maturity_bond','b_rating','a_rating','aaa_rating','aa_rating','bbb_rating','duration_bond'],inplace=True)

df_Growth_train_X = df_Growth_train.drop(labels='greatstone_rating',axis=1)
df_Growth_train_y = df_Growth_train['greatstone_rating']

## Below code for creating train-test out of Growth_Train
GX_train, GX_test, Gy_train, Gy_test = train_test_split(df_Growth_train_X,df_Growth_train_y,test_size=0.2, random_state=108)


# Drop fund_id & tag columns in both test & train
df_Growth_train_X.drop(columns=['fund_id','tag'],inplace=True)
df_Growth_test_y = df_Growth_test['greatstone_rating']
df_Growth_test.drop(columns=['fund_id','tag','greatstone_rating'],inplace=True)

# OneHot encode investment class in both train_X & test
inv_class = pd.get_dummies(df_Growth_train_X['investment_class'],drop_first=True)
df_Growth_train_X.drop(['investment_class'],axis=1,inplace=True)
df_Growth_train_X = pd.concat([df_Growth_train_X,inv_class],axis=1)

inv_class = pd.get_dummies(df_Growth_test['investment_class'],drop_first=True)
df_Growth_test.drop(['investment_class'],axis=1,inplace=True)
df_Growth_test = pd.concat([df_Growth_test,inv_class],axis=1)

(5328, 86)
(1343, 86)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [34]:
## Temp Model for Growth : (RANDOM FOREST)

## GX_train, GX_test, Gy_train, Gy_test
GX_train.drop(columns=['fund_id','tag','investment_class'],inplace=True)
GX_test.drop(columns=['fund_id','tag','investment_class'],inplace=True)

In [35]:

rf_model = RandomForestClassifier(n_estimators=500, max_depth=90, min_samples_split=10, min_samples_leaf=1, max_features = 'sqrt', bootstrap = False, random_state= 108)  # 'sqrt' is what 'auto' meant for classifiers; 'auto' is rejected by newer scikit-learn
rf_model.fit(GX_train,Gy_train)
predictions2 = rf_model.predict(GX_test)

accuracy_score = metrics.accuracy_score(Gy_test, predictions2)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(Gy_test, predictions2))

metrics.confusion_matrix(Gy_test, predictions2)

              precision    recall  f1-score   support

         0.0       1.00      0.95      0.97        61
         1.0       0.81      0.72      0.76        54
         2.0       0.83      0.84      0.83       197
         3.0       0.87      0.87      0.87       358
         4.0       0.80      0.84      0.82       274
         5.0       0.84      0.80      0.82       122

    accuracy                           0.84      1066
   macro avg       0.86      0.84      0.85      1066
weighted avg       0.85      0.84      0.84      1066



array([[ 58,   0,   0,   3,   0,   0],
       [  0,  39,  14,   1,   0,   0],
       [  0,   8, 165,  22,   2,   0],
       [  0,   0,  19, 310,  29,   0],
       [  0,   1,   1,  22, 231,  19],
       [  0,   0,   0,   0,  25,  97]], dtype=int64)

In [36]:
# Use XGBOOST
D_train = xgb.DMatrix(GX_train, label=Gy_train)
D_test = xgb.DMatrix(GX_test, label=Gy_test)

param = {
    'eta': 0.01, 
    'max_depth': 6,
    'gamma': 1.5,
    'min_child_weight':5,
    'subsample': 0.8,
    'random_state' : 108,
    'objective': 'multi:softprob',  
    'num_class': 6} 


steps = 3000  # The number of training iterations

xgb_model = xgb.train(param, D_train, steps)
predictions = xgb_model.predict(D_test)
best_preds = np.asarray([np.argmax(line) for line in predictions])


accuracy_score = metrics.accuracy_score(Gy_test, best_preds)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(Gy_test, best_preds))

metrics.confusion_matrix(Gy_test, best_preds)


              precision    recall  f1-score   support

         0.0       0.98      0.95      0.97        61
         1.0       0.81      0.72      0.76        54
         2.0       0.84      0.83      0.83       197
         3.0       0.86      0.86      0.86       358
         4.0       0.80      0.85      0.83       274
         5.0       0.86      0.81      0.84       122

    accuracy                           0.84      1066
   macro avg       0.86      0.84      0.85      1066
weighted avg       0.85      0.84      0.84      1066



array([[ 58,   0,   0,   2,   1,   0],
       [  0,  39,  14,   1,   0,   0],
       [  0,   6, 163,  26,   1,   1],
       [  0,   2,  16, 307,  33,   0],
       [  1,   1,   1,  22, 234,  15],
       [  0,   0,   0,   1,  22,  99]], dtype=int64)

In [37]:
xgb1=XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1.0, gamma=1.5,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.01, max_delta_step=0, max_depth=6,
              min_child_weight=5, monotone_constraints=None,
              n_estimators=3000, n_jobs=1, nthread=1, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, silent=True, subsample=0.8,
              tree_method=None, validate_parameters=False, verbosity=None)

xgb1.fit(GX_train,Gy_train)
predictions = xgb1.predict(GX_test)
#best_preds = np.asarray([np.argmax(line) for line in predictions])


accuracy_score = metrics.accuracy_score(Gy_test, predictions)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(Gy_test, predictions))

metrics.confusion_matrix(Gy_test, predictions)

              precision    recall  f1-score   support

         0.0       0.98      0.95      0.97        61
         1.0       0.81      0.72      0.76        54
         2.0       0.82      0.83      0.83       197
         3.0       0.86      0.85      0.85       358
         4.0       0.80      0.86      0.83       274
         5.0       0.87      0.80      0.83       122

    accuracy                           0.84      1066
   macro avg       0.86      0.83      0.84      1066
weighted avg       0.84      0.84      0.84      1066



array([[ 58,   0,   0,   2,   1,   0],
       [  0,  39,  15,   0,   0,   0],
       [  0,   6, 163,  26,   1,   1],
       [  0,   2,  19, 303,  34,   0],
       [  1,   1,   1,  22, 236,  13],
       [  0,   0,   0,   1,  24,  97]], dtype=int64)

####  Below code is for data subset where Investment class is UNKNOWN
Repeat the same set of actions as in VALUE. 

In [38]:
## Full set of code for Unknown
df_Unknown_train = df_Unknown[df_Unknown.greatstone_rating.notnull()]
df_Unknown_test = df_Unknown[df_Unknown.greatstone_rating.isnull()]
print(df_Unknown_train.shape)
print(df_Unknown_test.shape)

## 'New Code'
df_linkage = df_dataset.filter(['fund_id', 'tag'],axis=1)

df_linkage_Unknown = df_Unknown.filter(['fund_id', 'tag'],axis=1)

df_Unknown_train_keys = df_Unknown_train.filter(['tag'],axis=1)
df_Unknown_test_keys = df_Unknown_test.filter(['tag'],axis=1)

keys_train = list(df_Unknown_train_keys.columns.values)
i1 = df_linkage_Unknown.set_index(keys_train).index
i2 = df_Unknown_train_keys.set_index(keys_train).index
df_Unknown_linkage_train = df_linkage_Unknown[i1.isin(i2)]
df_Unknown_linkage_test = df_linkage_Unknown[~i1.isin(i2)]

df_Unknown_train.drop(columns=['bb_rating','below_b_rating','others_rating','maturity_bond','b_rating','a_rating','aaa_rating','aa_rating','bbb_rating','duration_bond'],inplace=True)
df_Unknown_test.drop(columns=['bb_rating','below_b_rating','others_rating','maturity_bond','b_rating','a_rating','aaa_rating','aa_rating','bbb_rating','duration_bond'],inplace=True)

## Introduced in V1.1 (after scoring V1): drop the columns listed below
df_Unknown_train.drop(columns=['Medium','Small','Unknown','pc_ratio'],inplace=True)
df_Unknown_test.drop(columns=['Medium','Small','Unknown','pc_ratio'],inplace=True)


df_Unknown_train_X = df_Unknown_train.drop(labels='greatstone_rating',axis=1)
df_Unknown_train_y = df_Unknown_train['greatstone_rating']

## Below code for creating train-test out of Unknown_Train
UX_train, UX_test, Uy_train, Uy_test = train_test_split(df_Unknown_train_X,df_Unknown_train_y,test_size=0.2, random_state=108)

# Drop fund_id & tag columns in both test & train
df_Unknown_train_X.drop(columns=['fund_id','tag'],inplace=True)
df_Unknown_test_y = df_Unknown_test['greatstone_rating']
df_Unknown_test.drop(columns=['fund_id','tag','greatstone_rating'],inplace=True)

# OneHot encode investment class in both train_X & test
inv_class = pd.get_dummies(df_Unknown_train_X['investment_class'],drop_first=True)
df_Unknown_train_X.drop(['investment_class'],axis=1,inplace=True)
df_Unknown_train_X = pd.concat([df_Unknown_train_X,inv_class],axis=1)

inv_class = pd.get_dummies(df_Unknown_test['investment_class'],drop_first=True)
df_Unknown_test.drop(['investment_class'],axis=1,inplace=True)
df_Unknown_test = pd.concat([df_Unknown_test,inv_class],axis=1)


(1205, 86)
(275, 86)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [39]:
## Temp Model for Unknown : (RANDOM FOREST)

## UX_train, UX_test, Uy_train, Uy_test
UX_train.drop(columns=['fund_id','tag','investment_class'],inplace=True)
UX_test.drop(columns=['fund_id','tag','investment_class'],inplace=True)

In [40]:

rf_model = RandomForestClassifier(n_estimators=5000, max_depth=100, min_samples_split=10, min_samples_leaf=1, max_features = 'sqrt', bootstrap = False, random_state= 108)
rf_model.fit(UX_train,Uy_train)
predictions3 = rf_model.predict(UX_test)

accuracy_score = metrics.accuracy_score(Uy_test, predictions3)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(Uy_test, predictions3))

metrics.confusion_matrix(Uy_test, predictions3)

              precision    recall  f1-score   support

         0.0       0.98      0.98      0.98        45
         1.0       0.84      0.72      0.78        29
         2.0       0.69      0.73      0.71        48
         3.0       0.67      0.69      0.68        52
         4.0       0.65      0.75      0.70        40
         5.0       0.90      0.67      0.77        27

    accuracy                           0.76       241
   macro avg       0.79      0.76      0.77       241
weighted avg       0.77      0.76      0.77       241



array([[44,  0,  0,  1,  0,  0],
       [ 0, 21,  7,  1,  0,  0],
       [ 0,  4, 35,  9,  0,  0],
       [ 1,  0,  7, 36,  8,  0],
       [ 0,  0,  2,  6, 30,  2],
       [ 0,  0,  0,  1,  8, 18]], dtype=int64)

In [41]:
# Use XGBOOST
D_train = xgb.DMatrix(UX_train, label=Uy_train)
D_test = xgb.DMatrix(UX_test, label=Uy_test)

param = {
    'eta': 0.01, 
    'max_depth': 9,
    'subsample': 1.0,
    'gamma': 1.5,
    'min_child_weight': 5,
    'random_state': 108,
    'objective': 'multi:softprob',  
    'num_class': 6} 

steps = 3000  # The number of training iterations

xgb_model = xgb.train(param, D_train, steps)
predictions = xgb_model.predict(D_test)
best_preds = np.asarray([np.argmax(line) for line in predictions])


accuracy_score = metrics.accuracy_score(Uy_test, best_preds)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(Uy_test, best_preds))

metrics.confusion_matrix(Uy_test, best_preds)


              precision    recall  f1-score   support

         0.0       0.94      0.98      0.96        45
         1.0       0.76      0.66      0.70        29
         2.0       0.62      0.60      0.61        48
         3.0       0.58      0.60      0.59        52
         4.0       0.71      0.72      0.72        40
         5.0       0.79      0.81      0.80        27

    accuracy                           0.72       241
   macro avg       0.73      0.73      0.73       241
weighted avg       0.72      0.72      0.72       241



array([[44,  0,  0,  1,  0,  0],
       [ 0, 19,  8,  1,  0,  1],
       [ 1,  6, 29, 12,  0,  0],
       [ 1,  0, 10, 31,  8,  2],
       [ 1,  0,  0,  7, 29,  3],
       [ 0,  0,  0,  1,  4, 22]], dtype=int64)

In [42]:
xgb1=XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7, gamma=1.5,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.01, max_delta_step=0, max_depth=9,
              min_child_weight=5, monotone_constraints=None,
              n_estimators=3000, n_jobs=1, nthread=1, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, silent=True, subsample=1.0,
              tree_method=None, validate_parameters=False, verbosity=None)

xgb1.fit(UX_train,Uy_train)
predictions = xgb1.predict(UX_test)
#best_preds = np.asarray([np.argmax(line) for line in predictions])


accuracy_score = metrics.accuracy_score(Uy_test, predictions)
##print("Accuracy score is " + str(accuracy_score))

print(classification_report(Uy_test, predictions))

metrics.confusion_matrix(Uy_test, predictions)

              precision    recall  f1-score   support

         0.0       0.98      0.98      0.98        45
         1.0       0.75      0.62      0.68        29
         2.0       0.64      0.71      0.67        48
         3.0       0.60      0.62      0.61        52
         4.0       0.69      0.68      0.68        40
         5.0       0.78      0.78      0.78        27

    accuracy                           0.73       241
   macro avg       0.74      0.73      0.73       241
weighted avg       0.73      0.73      0.73       241



array([[44,  0,  0,  1,  0,  0],
       [ 0, 18,  9,  1,  0,  1],
       [ 0,  5, 34,  9,  0,  0],
       [ 1,  0, 10, 32,  7,  2],
       [ 0,  1,  0,  9, 27,  3],
       [ 0,  0,  0,  1,  5, 21]], dtype=int64)

####  Below code is the final run:
For each data subset, the chosen model is refit on the full initial training data (the train and validation splits used earlier) and then used to predict on the test data. The predictions from the 4 data subsets are then merged and sorted to match the order of the sample submission file, and the final submission file is created.

#### Note that the choice of algorithm (RF / XGBoost / XGB Classifier) depends on the accuracy during the initial training pipeline.
However, code for all the algorithms is included in case it is needed, so the final output can be based on whichever is chosen: just uncomment and run.
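
The selection rule described above can be sketched as picking, for each subset, the model with the highest validation accuracy. The accuracies below are the ones reported in the Blend and Unknown validation runs earlier in this notebook; the dictionary layout and the `pick_best` helper are illustrative assumptions.

```python
# Validation accuracy per subset and model, taken from the reports above
val_accuracy = {
    "Blend":   {"RandomForest": 0.83, "xgb.train": 0.82, "XGBClassifier": 0.82},
    "Unknown": {"RandomForest": 0.76, "xgb.train": 0.72, "XGBClassifier": 0.73},
}


def pick_best(scores):
    """Return the model name with the highest score (the first one wins ties)."""
    return max(scores, key=scores.get)


chosen = {subset: pick_best(scores) for subset, scores in val_accuracy.items()}
print(chosen)
```

Both subsets resolve to the random forest, which is the model the final run below uses for them.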

#### Investment class VALUE

In [43]:
# DON'T RUN
# ## MODEL 3 : RANDOM FOREST

# rf_model = RandomForestClassifier(n_estimators=1000, max_depth=50, min_samples_split=5, min_samples_leaf=2, max_features = 'sqrt', bootstrap = False, random_state= 108)
# rf_model.fit(df_Value_train_X,df_Value_train_y)
# predictions = rf_model.predict(df_Value_test)

# Value_predicted_ratings = pd.DataFrame(predictions)

# df_Value_linkage_test.reset_index(drop=True, inplace=True)
# Value_predicted_ratings.reset_index(drop=True, inplace=True)
# df_Value_final_ratings = pd.concat([df_Value_linkage_test,Value_predicted_ratings],axis=1)
# df_Value_final_ratings.drop(columns=['tag'],inplace=True)
# print(df_Value_final_ratings.shape)
# df_Value_final_ratings.to_csv("df_Value_final_ratings.csv")

In [44]:
## MODEL 3 : XGB --> RUN THIS
xgb1=XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.4, gamma=0.5, 
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.01, max_delta_step=0, max_depth=9,
              min_child_weight=10, monotone_constraints=None,
              n_estimators=3000, n_jobs=1, nthread=1, num_parallel_tree=1,
              objective='multi:softmax', random_state=108, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, silent=True, subsample=1.0,
              tree_method=None, validate_parameters=False, verbosity=None)
xgb1.fit(df_Value_train_X,df_Value_train_y)
predictions = xgb1.predict(df_Value_test)


Value_predicted_ratings = pd.DataFrame(predictions)


df_Value_linkage_test.reset_index(drop=True, inplace=True)
Value_predicted_ratings.reset_index(drop=True, inplace=True)
df_Value_final_ratings = pd.concat([df_Value_linkage_test,Value_predicted_ratings],axis=1)
df_Value_final_ratings.drop(columns=['tag'],inplace=True)
print(df_Value_final_ratings.shape)

(1301, 2)


#### Investment class BLEND

In [45]:
## RUN THIS
rf_model1 = RandomForestClassifier(n_estimators=1500, max_depth=80, min_samples_split=5, min_samples_leaf=2, max_features = 'sqrt', bootstrap = False, random_state= 108)
rf_model1.fit(df_Blend_train_X,df_Blend_train_y)
predictions1 = rf_model1.predict(df_Blend_test)

Blend_predicted_ratings = pd.DataFrame(predictions1)

df_Blend_linkage_test.reset_index(drop=True, inplace=True)
Blend_predicted_ratings.reset_index(drop=True, inplace=True)
df_Blend_final_ratings = pd.concat([df_Blend_linkage_test,Blend_predicted_ratings],axis=1)
df_Blend_final_ratings.drop(columns=['tag'],inplace=True)
print(df_Blend_final_ratings.shape)

(2081, 2)


In [46]:
# D_train = xgb.DMatrix(df_Blend_train_X, label=df_Blend_train_y)
# D_test = xgb.DMatrix(df_Blend_test, label=df_Blend_test_y) 

# param = {
#     'eta': 0.3, 
#     'max_depth': 3,  
#     'objective': 'multi:softprob',  
#     'num_class': 6} 

# steps = 1000  # The number of training iterations

# xgb_model = xgb.train(param, D_train, steps)
# predictions = xgb_model.predict(D_test)
# best_preds = np.asarray([np.argmax(line) for line in predictions])

# Blend_predicted_ratings = pd.DataFrame(best_preds)

# df_Blend_linkage_test.reset_index(drop=True, inplace=True)
# Blend_predicted_ratings.reset_index(drop=True, inplace=True)
# df_Blend_final_ratings = pd.concat([df_Blend_linkage_test,Blend_predicted_ratings],axis=1)
# df_Blend_final_ratings.drop(columns=['tag'],inplace=True)
# print(df_Blend_final_ratings.shape)

#### Investment class GROWTH

In [47]:
rf_model2 = RandomForestClassifier(n_estimators=500, max_depth=90, min_samples_split=10, min_samples_leaf=1, max_features = 'sqrt', bootstrap = False, random_state= 108)  # 'sqrt' is what 'auto' meant for classifiers; 'auto' is rejected by newer scikit-learn
rf_model2.fit(df_Growth_train_X,df_Growth_train_y)
predictions2 = rf_model2.predict(df_Growth_test)

Growth_predicted_ratings = pd.DataFrame(predictions2)

df_Growth_linkage_test.reset_index(drop=True, inplace=True)
Growth_predicted_ratings.reset_index(drop=True, inplace=True)
df_Growth_final_ratings = pd.concat([df_Growth_linkage_test,Growth_predicted_ratings],axis=1)
df_Growth_final_ratings.drop(columns=['tag'],inplace=True)
print(df_Growth_final_ratings.shape)

(1343, 2)


In [48]:
# D_train = xgb.DMatrix(df_Growth_train_X, label=df_Growth_train_y)
# D_test = xgb.DMatrix(df_Growth_test, label=df_Growth_test_y) 

# param = {
#     'eta': 0.3, 
#     'max_depth': 3,  
#     'objective': 'multi:softprob',  
#     'num_class': 6} 

# steps = 1000  # The number of training iterations

# xgb_model = xgb.train(param, D_train, steps)
# predictions = xgb_model.predict(D_test)
# best_preds = np.asarray([np.argmax(line) for line in predictions])

# Growth_predicted_ratings = pd.DataFrame(best_preds)


# df_Growth_linkage_test.reset_index(drop=True, inplace=True)
# Growth_predicted_ratings.reset_index(drop=True, inplace=True)
# df_Growth_final_ratings = pd.concat([df_Growth_linkage_test,Growth_predicted_ratings],axis=1)
# df_Growth_final_ratings.drop(columns=['tag'],inplace=True)
# print(df_Growth_final_ratings.shape)

#### Investment class UNKNOWN

In [49]:
rf_model3 = RandomForestClassifier(n_estimators=5000, max_depth=100, min_samples_split=10, min_samples_leaf=1, max_features = 'sqrt', bootstrap = False, random_state= 108)
rf_model3.fit(df_Unknown_train_X,df_Unknown_train_y)
predictions3 = rf_model3.predict(df_Unknown_test)

Unknown_predicted_ratings = pd.DataFrame(predictions3)

df_Unknown_linkage_test.reset_index(drop=True, inplace=True)
Unknown_predicted_ratings.reset_index(drop=True, inplace=True)
df_Unknown_final_ratings = pd.concat([df_Unknown_linkage_test,Unknown_predicted_ratings],axis=1)
df_Unknown_final_ratings.drop(columns=['tag'],inplace=True)
print(df_Unknown_final_ratings.shape)

(275, 2)


In [50]:
# D_train = xgb.DMatrix(df_Unknown_train_X, label=df_Unknown_train_y)
# D_test = xgb.DMatrix(df_Unknown_test, label=df_Unknown_test_y) 

# param = {
#     'eta': 0.3, 
#     'max_depth': 3,  
#     'objective': 'multi:softprob',  
#     'num_class': 6} 

# steps = 500  # The number of training iterations

# xgb_model = xgb.train(param, D_train, steps)
# predictions = xgb_model.predict(D_test)
# best_preds = np.asarray([np.argmax(line) for line in predictions])

# Unknown_predicted_ratings = pd.DataFrame(best_preds)

# df_Unknown_linkage_test.reset_index(drop=True, inplace=True)
# Unknown_predicted_ratings.reset_index(drop=True, inplace=True)
# df_Unknown_final_ratings = pd.concat([df_Unknown_linkage_test,Unknown_predicted_ratings],axis=1)
# df_Unknown_final_ratings.drop(columns=['tag'],inplace=True)
# print(df_Unknown_final_ratings.shape)

In [51]:

# xgb1=XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
#               colsample_bynode=1, colsample_bytree=0.7, gamma=1.5,
#               importance_type='gain', interaction_constraints=None,
#               learning_rate=0.01, max_delta_step=0, max_depth=9,
#               min_child_weight=5, monotone_constraints=None,
#               n_estimators=3000, n_jobs=1, nthread=1, num_parallel_tree=1,
#               objective='multi:softprob', random_state=0, reg_alpha=0,
#               reg_lambda=1, scale_pos_weight=None, silent=True, subsample=1.0,
#               tree_method=None, validate_parameters=False, verbosity=None)
# xgb1.fit(df_Unknown_train_X,df_Unknown_train_y)
# predictions = xgb1.predict(df_Unknown_test)


# Unknown_predicted_ratings = pd.DataFrame(predictions)

# df_Unknown_linkage_test.reset_index(drop=True, inplace=True)
# Unknown_predicted_ratings.reset_index(drop=True, inplace=True)
# df_Unknown_final_ratings = pd.concat([df_Unknown_linkage_test,Unknown_predicted_ratings],axis=1)
# df_Unknown_final_ratings.drop(columns=['tag'],inplace=True)
# print(df_Unknown_final_ratings.shape)

Merge predictions from all data subsets to create the final set and order it based on the submission file

In [52]:
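# rename() returns a new frame, and the prediction column label is the integer 0 (not the string "0"),
# so the result has to be assigned back and the key has to be 0
df_Value_final_ratings = df_Value_final_ratings.rename(columns={0: "greatstone_rating"})
df_Growth_final_ratings = df_Growth_final_ratings.rename(columns={0: "greatstone_rating"})
df_Blend_final_ratings = df_Blend_final_ratings.rename(columns={0: "greatstone_rating"})
df_Unknown_final_ratings = df_Unknown_final_ratings.rename(columns={0: "greatstone_rating"})
df_Unknown_final_ratings


Unnamed: 0,fund_id,greatstone_rating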
0,468b76c0-a276-45b9-ad76-1a9d6c336ed6,3.0
1,cd9b1f78-e450-489f-b6e5-ece691d5c021,1.0
2,7723006e-720e-4c97-825f-1752cd112736,0.0
3,37c33326-b1fb-4208-a7df-08f7ead749da,2.0
4,d1052147-38d7-4981-8b8c-b3a0dcda8aed,2.0
...,...,...
270,668a58e7-5bee-41fd-9e23-963aa2f08f01,3.0
271,96c3122c-eea0-4d39-b4ed-c1647b03f834,0.0
272,80001be5-fde4-4f20-a8c3-ec295aed5960,3.0
273,4d555bb4-ada5-4a1c-8a01-0827477dd7e5,0.0


In [53]:
frames = [df_Value_final_ratings,df_Blend_final_ratings,df_Growth_final_ratings,df_Unknown_final_ratings]
final_result = pd.concat(frames)

In [54]:
final_result.shape

(5000, 2)

In [55]:
# Reorder linkage_test to match the submission file & Create the final file
final_result = final_result.set_index('fund_id')
final_result = final_result.reindex(index=df_submission_file['fund_id'])
final_result = final_result.reset_index()
final_result.to_csv("FinalSubmissionFile.csv")

In [56]:
## After the file is created, open the CSV file, delete the first column (the index) and make sure the last column is named 'greatstone_rating'
## before submission

## END OF PROGRAM
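
The manual clean-up described in the last cell could also be done in code. The sketch below uses tiny made-up frames in place of the per-class results and the submission file, names the prediction column before concatenating, and writes the CSV without the index column. It writes to an in-memory buffer only so that the example has no side effects; a file path would work the same way.

```python
import io

import pandas as pd

# Made-up stand-ins for two per-class result frames; as in the notebook,
# the prediction column carries the integer label 0
value_part = pd.DataFrame({"fund_id": ["f3", "f1"], 0: [4.0, 2.0]})
blend_part = pd.DataFrame({"fund_id": ["f2"], 0: [0.0]})

# Made-up stand-in for df_submission_file, which fixes the row order
submission = pd.DataFrame({"fund_id": ["f1", "f2", "f3"]})

# Name the prediction column, stack the parts, then follow the submission order
parts = [p.rename(columns={0: "greatstone_rating"}) for p in (value_part, blend_part)]
result = (
    pd.concat(parts, ignore_index=True)
    .set_index("fund_id")
    .reindex(submission["fund_id"])
    .reset_index()
)

buffer = io.StringIO()
result.to_csv(buffer, index=False)  # index=False avoids the extra first column
print(buffer.getvalue())
```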

### Additional : Code for using LGBM algo

Note: LightGBM needs to be installed first, for example with `pip install lightgbm`.

In [None]:
# import lightgbm as lgb

In [None]:
# ## For initial training phase
# D_train = lgb.Dataset(VX_train, label=Vy_train)
# D_test = lgb.Dataset(VX_test, label=Vy_test)

# params = {}
# params['learning_rate'] = 0.02
# params['boosting_type'] = 'gbdt'
# params['objective'] = 'multiclass'
# params['metric'] = 'multi_logloss'
# params['sub_feature'] = 0.5
# params['num_leaves'] = 10
# params['min_data'] = 50
# params['max_depth'] = 9
# #params['feature_fraction']=0.8
# params['bagging_fraction']=0.6
# params['num_boost_round']=6000
# params['random_state'] = 108
# params['num_class']= 6

# steps =6000

# clf = lgb.train(params, D_train, steps)

# predictions=clf.predict(VX_test)

# best_preds = np.asarray([np.argmax(line) for line in predictions])


# accuracy_score = metrics.accuracy_score(Vy_test, best_preds)
# ##print("Accuracy score is " + str(accuracy_score))

# print(classification_report(Vy_test, best_preds))

# metrics.confusion_matrix(Vy_test, best_preds)

In [None]:
# ## For the final run
# D_train = lgb.Dataset(df_Value_train_X, label=df_Value_train_y)
# D_test = lgb.Dataset(df_Value_test, label=df_Value_test_y)

# params = {}
# params['learning_rate'] = 0.02
# params['boosting_type'] = 'gbdt'
# params['objective'] = 'multiclass'
# params['metric'] = 'multi_logloss'
# params['sub_feature'] = 0.5
# params['num_leaves'] = 10
# params['min_data'] = 50
# params['max_depth'] = 5
# #params['feature_fraction']=0.8
# params['bagging_fraction']=0.6
# params['num_boost_round']=6000
# params['random_state'] = 108
# params['num_class']= 6

# steps =6000


# clf = lgb.train(params, D_train, steps)

# predictions=clf.predict(df_Value_test)

# best_preds = np.asarray([np.argmax(line) for line in predictions])

# Value_predicted_ratings = pd.DataFrame(best_preds)

# df_Value_linkage_test.reset_index(drop=True, inplace=True)
# Value_predicted_ratings.reset_index(drop=True, inplace=True)
# df_Value_final_ratings = pd.concat([df_Value_linkage_test,Value_predicted_ratings],axis=1)
# df_Value_final_ratings.drop(columns=['tag'],inplace=True)
# print(df_Value_final_ratings.shape)