## A McKinsey Data Science Hackathon

## Statement of Problem

“Beeep…Beeeep….Beeeep… Hooomans*, are you there?...”

This very strange transmission is coming from your narrowband radio signal receiver, pointed towards one of the farthest away galaxies. It’s early morning, you are sitting in your radio observatory high in the mountains.

For the last 10 years you’ve been a Chief Data Scientist in one of the best astrophysics research teams in the world. You are enjoying a quiet time with a cup of coffee and reviewing the data reports from last night, when this strange sound arrived. You almost spill your coffee in surprise. “Am I dreaming?” is your first thought as you move closer towards the speaker and listen…

“Beep…Beeeep….Beeeep… To all Hooomans who can hear us – we need your help”

You lean closer and grab a notebook and a pencil – you don’t really trust computers when it comes to such important tasks as taking notes from a radio transmission. You start recording everything that the strange voice from light years away is saying.

“… We need serious Data Science help and we know you Hooomans are the best at it…. We are an intergalactic species which have almost achieved singularity and the highest possible levels of development. We travel fast through space and explore other galaxies”

“The only essence that we consume is energy, measured in DSML units…Our populace is widespread and we live across many different star clusters and galaxies. What we need now is to optimize our well-being across all those galaxies… We have a lot of data but our computers and methods are too weak – we urgently need your data science knowledge to help us”

“Only two steps prevent us from achieving singularity

· To understand what makes us better off.

Our elders used the composite index to measure our well-being performance, but this knowledge has disappeared in the sands of time.

Use our data and train your model to predict this index with the highest possible level of certainty.

· To achieve the highest possible level of well-being through optimized allocation of additional energy

We have discovered the star of an unusually high energy of 50000 zillion DSML.

We have agreed between ourselves that 

· no one galaxy will consume more than 100 zillion DSML 

and 

· at least 10% of the total energy will be consumed by galaxies in need with existence expectancy index below 0.7.

Think of our galaxies as your “countries” (or how you call them??) and our population as citizens. We have similar healthcare and wellbeing characteristic as you, Hooomans”

“We are sending all the data to you right now. Let the data be with you, Hoomans… … …”

Transmission suddenly ends. You put your notebook and pencil away and start thinking. You really want to help this species optimize their well-being. You open up Python and upload the dataset from the narrowband radio signal receiver. It will be another great day at the observatory today.

————

* The intergalactic species probably meant to say “humans” here, but we will never know for sure.

Description of Data Received
The solutions are evaluated on two criteria: predicted future Index values and allocated energy from a newly discovered star.

1. Index predictions are evaluated using RMSE metric

2. Energy allocation is also evaluated using RMSE metric and has a set of known factors that need to be taken into account.

Every galaxy has a certain limited potential for improvement in the index described by the following function:

Potential for increase in the Index = -np.log(Index+0.01)+3

Likely index increase dependent on potential for improvement and on extra energy availability is described by the following function:

Likely increase in the Index = extra energy * Potential for increase in the Index **2 / 1000

There are also several constraints:

In total, 50000 zillion DSML are available for allocation, and no galaxy at any point in time should be allocated more than 100 zillion DSML or less than 0 zillion DSML. Galaxies with an existence expectancy index below 0.7 should be allocated at least 10% of the total energy available in the foreseeable future.

3. Leaderboard is based on a combined scaled metric:

80% prediction task RMSE + 20% optimization task RMSE * lambda where lambda is a normalizing factor

4. Leaderboard is 80% public and 20% private

5. The submission should be in the following format:

Variables and their descriptions:

- Index: unique index from the test dataset, in ascending order
- pred: prediction for the index of interest
- opt_pred: optimal energy allocation
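
The two formulas given under criterion 2 can be evaluated directly. The sketch below is only a worked example on made-up index values; the helper names `potential_for_increase` and `likely_index_increase` are illustrative and do not come from the challenge materials.

```python
import numpy as np

def potential_for_increase(index):
    """Potential for increase in the Index = -ln(Index + 0.01) + 3."""
    return -np.log(np.asarray(index, dtype=float) + 0.01) + 3

def likely_index_increase(extra_energy, index):
    """Likely increase in the Index = extra energy * potential**2 / 1000."""
    return extra_energy * potential_for_increase(index) ** 2 / 1000

example_index = np.array([0.05, 0.10, 0.50])      # made-up index values
print(potential_for_increase(example_index))      # lower index -> higher potential
print(likely_index_increase(100, example_index))  # 100 zillion DSML for each galaxy
```

Because the likely increase is proportional to the extra energy, the total increase is a linear function of the allocation once the index predictions are fixed.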


## 1. Let's import pandas and numpy.

In [2]:
import numpy as np
import pandas as pd

## 2. Loading the datasets.

In [3]:
data_train = pd.read_csv('./Documents/prohack_dataset/train.csv') #train data
data_test = pd.read_csv('./Documents/prohack_dataset/test.csv') #test data
data_labels = data_train['y'] #selecting 'y' column from the train data to use as label
#galaxy_names = data_train.pop('galaxy')



In [4]:
data_train.head()

Unnamed: 0,galactic year,galaxy,existence expectancy index,existence expectancy at birth,Gross income per capita,Income Index,Expected years of education (galactic years),Mean years of education (galactic years),Intergalactic Development Index (IDI),Education Index,...,"Intergalactic Development Index (IDI), female","Intergalactic Development Index (IDI), male",Gender Development Index (GDI),"Intergalactic Development Index (IDI), female, Rank","Intergalactic Development Index (IDI), male, Rank",Adjusted net savings,"Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total",Private galaxy capital flows (% of GGP),Gender Inequality Index (GII),y
0,990025,Large Magellanic Cloud (LMC),0.628657,63.1252,27109.23431,0.646039,8.240543,,,,...,,,,,,,,,,0.05259
1,990025,Camelopardalis B,0.818082,81.004994,30166.793958,0.852246,10.671823,4.74247,0.833624,0.467873,...,,,,,,19.177926,,22.785018,,0.059868
2,990025,Virgo I,0.659443,59.570534,8441.707353,0.499762,8.840316,5.583973,0.46911,0.363837,...,,,,,,21.151265,6.53402,,,0.050449
3,990025,UGC 8651 (DDO 181),0.555862,52.333293,,,,,,,...,,,,,,,5.912194,,,0.049394
4,990025,Tucana Dwarf,0.991196,81.802464,81033.956906,1.131163,13.800672,13.188907,0.910341,0.918353,...,,,,,,,5.611753,,,0.154247


In [5]:
data_test.head()

Unnamed: 0,galactic year,galaxy,existence expectancy index,existence expectancy at birth,Gross income per capita,Income Index,Expected years of education (galactic years),Mean years of education (galactic years),Intergalactic Development Index (IDI),Education Index,...,Current health expenditure (% of GGP),"Intergalactic Development Index (IDI), female","Intergalactic Development Index (IDI), male",Gender Development Index (GDI),"Intergalactic Development Index (IDI), female, Rank","Intergalactic Development Index (IDI), male, Rank",Adjusted net savings,"Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total",Private galaxy capital flows (% of GGP),Gender Inequality Index (GII)
0,1007012,KK98 77,0.456086,51.562543,12236.576447,0.593325,10.414164,10.699072,0.547114,0.556267,...,,,,,,,,,,
1,1007012,Reticulum III,0.529835,57.228262,3431.883825,0.675407,7.239485,5.311122,0.497688,0.409969,...,,,,,,,,,,
2,1008016,Reticulum III,0.560976,59.379539,27562.914252,0.594624,11.77489,5.937797,0.544744,0.486167,...,,,,,,,,,,
3,1007012,Segue 1,0.56591,59.95239,20352.232905,0.8377,11.613621,10.067882,0.691641,0.523441,...,,,,,,,,,,
4,1013042,Virgo I,0.588274,55.42832,23959.704016,0.520579,10.392416,6.374637,0.530676,0.580418,...,7.357729,0.583373,0.600445,0.856158,206.674424,224.104054,,7.687626,,


## 3. Let's check the shape of the training and testing data, and view the column names and data types.

In [6]:
print(data_train.shape)
print(data_test.shape)

(3865, 80)
(890, 79)


In [7]:
dict(data_train.dtypes)

{'galactic year': dtype('int64'),
 'galaxy': dtype('O'),
 'existence expectancy index': dtype('float64'),
 'existence expectancy at birth': dtype('float64'),
 'Gross income per capita': dtype('float64'),
 'Income Index': dtype('float64'),
 'Expected years of education (galactic years)': dtype('float64'),
 'Mean years of education (galactic years)': dtype('float64'),
 'Intergalactic Development Index (IDI)': dtype('float64'),
 'Education Index': dtype('float64'),
 'Intergalactic Development Index (IDI), Rank': dtype('float64'),
 'Population using at least basic drinking-water services (%)': dtype('float64'),
 'Population using at least basic sanitation services (%)': dtype('float64'),
 'Gross capital formation (% of GGP)': dtype('float64'),
 'Population, total (millions)': dtype('float64'),
 'Population, urban (%)': dtype('float64'),
 'Mortality rate, under-five (per 1,000 live births)': dtype('float64'),
 'Mortality rate, infant (per 1,000 live births)': dtype('float64'),
 'Old age dep

## 4. Let's see which columns don't contain null values. With no gaps to impute, they are candidates for key features.

In [8]:
print(list(data_train.columns[data_train.isna().sum() == 0])) #prints columns with no null values
dict(data_train.isna().sum()) # a look at total null values in each column

['galactic year', 'galaxy', 'y']


{'galactic year': 0,
 'galaxy': 0,
 'existence expectancy index': 1,
 'existence expectancy at birth': 1,
 'Gross income per capita': 28,
 'Income Index': 28,
 'Expected years of education (galactic years)': 133,
 'Mean years of education (galactic years)': 363,
 'Intergalactic Development Index (IDI)': 391,
 'Education Index': 391,
 'Intergalactic Development Index (IDI), Rank': 433,
 'Population using at least basic drinking-water services (%)': 1844,
 'Population using at least basic sanitation services (%)': 1850,
 'Gross capital formation (% of GGP)': 2363,
 'Population, total (millions)': 2594,
 'Population, urban (%)': 2594,
 'Mortality rate, under-five (per 1,000 live births)': 2594,
 'Mortality rate, infant (per 1,000 live births)': 2606,
 'Old age dependency ratio (old age (65 and older) per 100 creatures (ages 15-64))': 2601,
 'Population, ages 15–64 (millions)': 2601,
 'Population, ages 65 and older (millions)': 2601,
 'Life expectancy at birth, male (galactic years)': 2601
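
Raw null counts are easier to compare as a share of the row count. A minimal sketch on a made-up frame (the column names, values and the 50% threshold are illustrative assumptions, not taken from the competition data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "galaxy": ["A", "B", "C", "D"],
    "income": [1.0, np.nan, 3.0, 4.0],
    "savings": [np.nan, np.nan, np.nan, 2.0],
})

# isna().mean() gives the share of missing values per column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# keep columns where at most half of the values are missing (arbitrary threshold)
usable_cols = missing_share[missing_share <= 0.5].index.tolist()
print(usable_cols)
```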

## 5. Let's check which columns have a distinct value for every observation. They are probably highly important features.

In [9]:
[i for i in data_train.columns if len(data_train[i].unique()) == data_train.shape[0]]

['existence expectancy index', 'existence expectancy at birth', 'y']

## 6. Let's check the number of unique galaxies and galactic years in the training and testing data.

In [10]:
print(len(data_train['galaxy'].unique()))
print(len(data_train['galactic year'].unique()))
print(len(data_test['galaxy'].unique()))
print(len(data_test['galactic year'].unique()))

181
26
172
10


## 7. Let's check whether, for every galaxy in the training data, each galactic year appears in only one observation.

#### First, let's define a function that will group a dataset by a column_key and return a dictionary of column_key/dataframe pair.

In [11]:
def get_grouped_df(data, key):
    groups = data.groupby(key)
    dict_grp = {k: v for k, v in groups}
    return dict_grp

In [12]:
group1 = get_grouped_df(data_train, 'galaxy')
group1['Virgo I'].head()
#let's have a view of what each galaxy's mini dataframe looks like

Unnamed: 0,galactic year,galaxy,existence expectancy index,existence expectancy at birth,Gross income per capita,Income Index,Expected years of education (galactic years),Mean years of education (galactic years),Intergalactic Development Index (IDI),Education Index,...,"Intergalactic Development Index (IDI), female","Intergalactic Development Index (IDI), male",Gender Development Index (GDI),"Intergalactic Development Index (IDI), female, Rank","Intergalactic Development Index (IDI), male, Rank",Adjusted net savings,"Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total",Private galaxy capital flows (% of GGP),Gender Inequality Index (GII),y
2,990025,Virgo I,0.659443,59.570534,8441.707353,0.499762,8.840316,5.583973,0.46911,0.363837,...,,,,,,21.151265,6.53402,,,0.050449
309,991020,Virgo I,0.555338,58.867426,24763.531222,0.463406,7.536744,2.763276,0.502295,0.342069,...,,,,,,,,,,0.049411
364,992016,Virgo I,0.554174,60.551216,11439.589593,0.52421,7.485369,5.545963,0.430654,0.39063,...,,,,,,,,,,0.049193
635,993012,Virgo I,0.612921,62.918772,24467.082523,0.481154,7.357494,5.868132,0.489407,0.34828,...,,,,,,,,,,0.048858
897,994009,Virgo I,0.630974,50.942282,19221.268239,0.506944,7.312539,6.109797,0.412421,0.37729,...,,,,,,,,,,0.048516


In [13]:
years_per_galaxy  = [len(group1[i]['galactic year']) 
                     for i in group1.keys()] #stores the number of galactic years per galaxy as a list

unique_years_per_galaxy  = [len(group1[i]['galactic year'].unique()) 
                            for i in group1.keys()] 
                            #stores the number of unique galactic years per galaxy as a list

In [14]:
unique_years_per_galaxy == years_per_galaxy #checks for equality

True

## 8. Similar to step 7 above, let's check whether, for every galactic year in the training data, each galaxy appears in only one observation.

In [15]:
group2 = get_grouped_df(data_train, 'galactic year')

galaxies_per_year  = [len(group2[i]['galaxy']) for i in group2.keys()]
unique_galaxies_per_year  = [len(group2[i]['galaxy'].unique()) for i in group2.keys()]


In [16]:
galaxies_per_year == unique_galaxies_per_year

True
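
Steps 7 and 8 both amount to checking that no (galaxy, galactic year) pair is repeated. As a hedged alternative, that can be tested in one call with `DataFrame.duplicated`. The sketch uses a small made-up frame rather than `data_train`:

```python
import pandas as pd

toy = pd.DataFrame({
    "galaxy": ["Virgo I", "Virgo I", "Segue 1", "Segue 1"],
    "galactic year": [990025, 991020, 990025, 991020],
})

# True if any (galaxy, galactic year) pair occurs more than once
has_duplicates = bool(toy.duplicated(subset=["galaxy", "galactic year"]).any())
print(has_duplicates)

# per-galaxy comparison of row count against the number of distinct years
per_galaxy = toy.groupby("galaxy")["galactic year"].agg(["size", "nunique"])
all_unique = bool((per_galaxy["size"] == per_galaxy["nunique"]).all())
print(all_unique)
```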

## 9. Let's repeat steps 7 and 8, but this time with the testing data.

In [17]:
# similar to step 7
group3 =  get_grouped_df(data_test, 'galaxy')

years_per_galaxy_test  = [len(group3[i]['galactic year']) for i in group3.keys()]
unique_years_per_galaxy_test  = [len(group3[i]['galactic year'].unique()) for i in group3.keys()]

unique_years_per_galaxy_test == years_per_galaxy_test

True

In [18]:
# similar to step 8
group4 = get_grouped_df(data_test, 'galactic year')

galaxies_per_year_test  = [len(group4[i]['galaxy']) for i in group4.keys()]
unique_galaxies_per_year_test  = [len(group4[i]['galaxy'].unique()) for i in group4.keys()]

galaxies_per_year_test == unique_galaxies_per_year_test

True

## 10. Let's check the relationship between sampling of galactic years in the training and the testing data.

In [19]:
unique_gal_years_train = set(data_train['galactic year']) #unique galactic years in the training data
unique_gal_years_test = set(data_test['galactic year']) #unique galactic years in the testing data

print(sorted(list(unique_gal_years_train)), len(unique_gal_years_train))
print()
print(sorted(list(unique_gal_years_test)), len(unique_gal_years_test))

[990025, 991020, 992016, 993012, 994009, 995006, 996004, 997002, 998001, 999000, 1000000, 1001000, 1002001, 1003002, 1004004, 1005006, 1006009, 1007012, 1008016, 1009020, 1010025, 1011030, 1012036, 1013042, 1014049, 1015056] 26

[1007012, 1008016, 1009020, 1010025, 1011030, 1012036, 1013042, 1014049, 1015056, 1016064] 10


In [20]:
common_gal_years = unique_gal_years_train & unique_gal_years_test
print(common_gal_years, len(common_gal_years))
#checks for common galactic years between training and testing data.

{1014049, 1007012, 1012036, 1010025, 1008016, 1015056, 1013042, 1011030, 1009020} 9


#### It seems 9 of the 10 galactic years in the testing data are sampled in the training data.

#### Let's check the elapsed galactic year between two successive galactic years in the training data.

In [21]:
sorted_years_train = sorted(unique_gal_years_train)
print([later - earlier for earlier, later in zip(sorted_years_train, sorted_years_train[1:])])

[995, 996, 996, 997, 997, 998, 998, 999, 999, 1000, 1000, 1001, 1001, 1002, 1002, 1003, 1003, 1004, 1004, 1005, 1005, 1006, 1006, 1007, 1007]


#### The first gap between successive galactic years in the training data is 995. After that, each gap value occurs twice, so the gap grows by one after every two samplings.
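
One way to read this pattern, offered as an observation rather than something stated in the problem: the sampled values alternate between perfect squares and products of consecutive integers, starting from 995 × 995. The sketch regenerates the 26 training years from that rule; `galactic_year` is a hypothetical helper.

```python
def galactic_year(k, start=995):
    """k-th sampled year: product of two integers that alternately advance by one."""
    return (start + k // 2) * (start + (k + 1) // 2)

years = [galactic_year(k) for k in range(26)]
gaps = [later - earlier for earlier, later in zip(years, years[1:])]

print(years[:4])  # [990025, 991020, 992016, 993012]
print(gaps[:5])   # [995, 996, 996, 997, 997]
```

Under the same rule, `galactic_year(26)` gives 1016064, the one galactic year that appears only in the testing data.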

## 11. Let's import useful libraries.

In [22]:
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn import tree
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error as mse
from scipy.optimize import minimize, Bounds, LinearConstraint
import math
import warnings
warnings.filterwarnings('ignore')

In [23]:
X_train, X_test, y_train, y_test = train_test_split(data_train, data_labels, test_size=0.2, random_state=0)

In [24]:
ran_f = ExtraTreesRegressor(bootstrap=True)

In [25]:
#grid = GridSearchCV(ran_f, {'n_estimators': range(10, 110, 10)}, cv=5, n_jobs=2, verbose=1)

In [26]:
model_pipe = Pipeline([('imp', SimpleImputer(strategy='median')), ('scaler', StandardScaler()), 
                       #('pca', PCA(n_components=10)), 
                       ('ran_f', ran_f)])

## 12. Let's build a custom model for predictions

#### This custom model will:
- split the training data into groups.
- split the testing data into groups.
- for each testing data group, fit the model on the corresponding training data group and then use it to predict the testing data group.
- for each testing data group, add the predicted labels as a new column in the group (minidf).
- join all testing groups together to form a new df.
- rearrange the predicted labels to correctly align with their corresponding observations in the original testing data.
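
The steps above can be sketched on made-up data as follows. This is a simplified illustration, not the `GalaxyModel` class itself: it uses `LinearRegression` as a stand-in estimator and hypothetical column names `group`, `x` and `y`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 10.0, 20.0, 30.0],
})
test = pd.DataFrame({"group": ["b", "a"], "x": [4.0, 4.0]})

# one training frame per group
train_groups = {key: frame for key, frame in train.groupby("group")}
predictions = pd.Series(index=test.index, dtype=float)

for key, test_part in test.groupby("group"):
    train_part = train_groups[key]
    model = LinearRegression().fit(train_part[["x"]], train_part["y"])
    # assigning by index label keeps predictions aligned with the test rows
    predictions.loc[test_part.index] = model.predict(test_part[["x"]])

print(predictions)
```

Writing predictions back by index label avoids a final merge step, at the cost of requiring a unique index on the test frame.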

In [27]:
correlation = abs(data_train.copy().drop('y', axis=1).corrwith(data_labels))
columns = correlation.nlargest(20).index

In [28]:
class GalaxyModel(BaseEstimator, RegressorMixin):
    # this model is general purpose for grouping by and predicting any column, using any column as observation key, 
    #and for using any model.
    
    def __init__(self, key, label, ind, model): 
        self.key = key # key is the name of the column you want to groupby
        self.label = label # label is the name of the column we want to predict
        self.ind = ind # ind is the unique column in each mini df after grouping by key
        self.model = model # model is the model you want to use
        self.X_pred = []
        
    def fit(self, X_train, y_train=None):
# step1. split the training data into groups
        
        X = X_train.copy()
        self.train_mini_df = get_grouped_df(X, self.key) 
        
        return self
    
    def predict(self, X_test):
# step2. split the testing data into groups
        
        X = X_test.copy()
        self.test_mini_df = get_grouped_df(X, self.key)
        
# step3:
# for each testing data group, fit the model on the corresponding training data group and then use it to predict the testing data group

        for key in self.test_mini_df:
            # galactic year 1016064 appears only in the testing data, so fall back to the
            # training group for 1015056 (only relevant when grouping by galactic year)
            train_key = 1015056 if key == 1016064 else key

            X_train = self.train_mini_df[train_key].copy().drop(self.key, axis=1)
            #drops the key column from a copy of the training minidf

            y_label = np.array(X_train.pop(self.label)).reshape(-1, 1) # pops out label data as an array with 1 column

            self.model.fit(X_train, y_label) #fitting model on training data

            mini_X_test = self.test_mini_df[key].copy().drop(self.key, axis=1)
            #drops the key column from a copy of the testing minidf

            y_pred = self.model.predict(mini_X_test) #predicting testing data

#step4. for each testing data group, add the predicted labels as a new column in the group (minidf)
            self.test_mini_df[key][self.label] = y_pred
            
# step5. join all testing groups together to form a new df
        pred_df = pd.concat([df for _, df in self.test_mini_df.items()])
        
# step6. creates a df consisting of three columns namely; galactic year, galaxy and predicted index
        pred_df = pred_df[['galactic year', 'galaxy', 'y']]
        
        
# step7. rearrange the predicted labels to correctly align with their corresponding observations in the original testing data
        self.X_pred = X_test.copy().merge(pred_df, on=['galactic year', 'galaxy'], how='inner')
        
        return self.X_pred['y']
            
            

In [29]:
model = GalaxyModel('galaxy', 'y', 'galactic year', model_pipe)

In [30]:
model.fit(X_train)

GalaxyModel(ind='galactic year', key='galaxy', label='y',
            model=Pipeline(memory=None,
                           steps=[('imp',
                                   SimpleImputer(add_indicator=False, copy=True,
                                                 fill_value=None,
                                                 missing_values=nan,
                                                 strategy='median',
                                                 verbose=0)),
                                  ('scaler',
                                   StandardScaler(copy=True, with_mean=True,
                                                  with_std=True)),
                                  ('ran_f',
                                   ExtraTreesRegressor(bootstrap=True,
                                                       ccp_alpha=0.0,
                                                       criterion='mse',
                                                       max_depth=Non

In [31]:
y_pred = model.predict(X_test.drop('y', axis=1))

In [32]:
y_pred

0      0.045102
1      0.115325
2      0.048688
3      0.047861
4      0.165029
         ...   
768    0.087669
769    0.061233
770    0.065016
771    0.053271
772    0.045011
Name: y, Length: 773, dtype: float64

In [33]:
np.sqrt(mse(y_test, y_pred)) # RMSE; same value as squared=False, which newer scikit-learn releases no longer accept

0.01979915046088271

In [34]:
model.fit(data_train.copy(), data_labels)

GalaxyModel(ind='galactic year', key='galaxy', label='y',
            model=Pipeline(memory=None,
                           steps=[('imp',
                                   SimpleImputer(add_indicator=False, copy=True,
                                                 fill_value=None,
                                                 missing_values=nan,
                                                 strategy='median',
                                                 verbose=0)),
                                  ('scaler',
                                   StandardScaler(copy=True, with_mean=True,
                                                  with_std=True)),
                                  ('ran_f',
                                   ExtraTreesRegressor(bootstrap=True,
                                                       ccp_alpha=0.0,
                                                       criterion='mse',
                                                       max_depth=Non

In [35]:
y_pred = model.predict(data_test.copy())

## 13. Optimization Problem.

#### To minimize a function in SciPy, use the scipy.optimize.minimize function.

#### Check out the scipy.optimize.minimize function [here](https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html)

In [49]:
pot_for_inc = -np.log(y_pred + 0.01) + 3 # calculates potential for improvement in index

In [50]:
def likely_increase(X, p=pot_for_inc):
    return -sum(X * p**2 /1000) # calculates total likely increase in index. 
    #the minus sign is because we want to maximize the function since minimizing -a is same as maximizing a.

In [51]:
s_imp = SimpleImputer(strategy='median')

In [52]:
# drops the galaxy column since imputation can't work with strings
data_test_copy = data_test.copy().drop('galaxy', axis=1)
cols = data_test_copy.columns # stores the column names, excluding galaxy
data_test_copy = s_imp.fit_transform(data_test_copy) # imputes missing values with the column median
data_test_copy = pd.DataFrame(data_test_copy, columns=cols) # converts back to a dataframe

# filters rows with existence expectancy index < 0.7
low_eei = data_test_copy[data_test_copy['existence expectancy index'] < 0.7]  


In [53]:
data_test_copy2 = data_test_copy.sort_values(by='existence expectancy index') # sorts df by existence expectancy index

# checks whether the top rows of the original testing data set are exactly the rows with existence expectancy index < 0.7
sum(data_test_copy.loc[:len(low_eei) - 1, 'existence expectancy index'] == low_eei['existence expectancy index'])

66

In [54]:
boundary = Bounds([0]*890, [100]*890) # creates boundary constraint of 0 <= x <= 100
constraint1 = LinearConstraint([1]*890, [50000], [50000]) # creates constraint of sum(x) = 50,000

# creates constraint of sum(extra energy of galaxies with existence expectancy index < 0.7) >= 10% of 50,000
constraint2 = LinearConstraint([1]*66 + [0]*824, [5000], [np.inf])
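
The hard-coded counts 890 and 66 rely on the low-expectancy rows coming first in the testing data. As a hedged alternative, the constraint vectors can be built from a boolean mask so that row order does not matter. The `eei` values below are made up and the total is scaled down for illustration.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint

eei = np.array([0.55, 0.92, 0.68, 0.81, 0.74])  # made-up existence expectancy index
n = len(eei)
total_energy = 300.0                             # scaled-down stand-in for 50000

low_mask = (eei < 0.7).astype(float)             # 1.0 where the galaxy is "in need"

# 0 <= x <= 100 for every galaxy
bounds = Bounds(np.zeros(n), np.full(n, 100.0))
# sum(x) == total_energy
sum_constraint = LinearConstraint(np.ones((1, n)), total_energy, total_energy)
# sum(x over low-expectancy galaxies) >= 10% of total_energy
need_constraint = LinearConstraint(low_mask.reshape(1, -1), 0.1 * total_energy, np.inf)

print(low_mask)
```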

In [55]:
X0 = np.array([100]*66 + [52]*800 + [75]*24) # initial guess for the extra energy allocation (sums to 50,000)

In [56]:
res = minimize(likely_increase, X0, constraints=[constraint1, constraint2],
               bounds=boundary, options={'verbose': 1, 'disp': True}) # optimizes
res.x

Iteration limit exceeded    (Exit mode 9)
            Current function value: -1755.8245248461646
            Iterations: 101
            Function evaluations: 90092
            Gradient evaluations: 101


array([1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 9.58551901e+01, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       9.92713440e+01, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
       1.00000000e+02, 1.00000000e+02, 1.00000000e+02, 1.00000000e+02,
      

In [57]:
opt_pred = res.x # stores optimal extra energy
-res.fun # checks maximum sum of likely increase in index

1755.8245248461646

In [58]:
sum(opt_pred)

49999.99999999999
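
Because the objective is linear in the allocation and every constraint is linear, the problem is a linear program, and the general-purpose solver above stopped at its iteration limit. As an alternative sketch (not the approach used in this notebook), `scipy.optimize.linprog` with the HiGHS solver can handle it. The data below are made up and scaled down, the variable names are illustrative, and `method="highs"` assumes a reasonably recent SciPy.

```python
import numpy as np
from scipy.optimize import linprog

n = 20
pred_index = np.linspace(0.03, 0.3, n)   # made-up predicted index values
eei = np.linspace(1.0, 0.4, n)           # made-up existence expectancy index
total_energy = 1000.0                    # scaled-down stand-in for 50000

potential = -np.log(pred_index + 0.01) + 3
gain_per_unit = potential ** 2 / 1000    # likely increase per unit of extra energy

low_mask = (eei < 0.7).astype(float)

res = linprog(
    c=-gain_per_unit,                    # linprog minimizes, so negate to maximize
    A_ub=-low_mask.reshape(1, -1),       # -sum(x_low) <= -10% of the total
    b_ub=[-0.1 * total_energy],
    A_eq=np.ones((1, n)),                # sum(x) == total
    b_eq=[total_energy],
    bounds=[(0, 100)] * n,               # 0 <= x <= 100 per galaxy
    method="highs",
)

low_share = res.x[low_mask.astype(bool)].sum()
print(res.status, res.x.sum(), low_share, -res.fun)
```

In this toy setup the galaxies in need have the smallest gain per unit, so the 10% requirement is binding: the solver gives them exactly the minimum share and spends the rest where the gain is highest.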

In [59]:
solution1 = pd.DataFrame({'Index': data_test.index, 'pred': y_pred, 'opt_pred': opt_pred}) # stores solution as df
solution1

Unnamed: 0,Index,pred,opt_pred
0,0,0.043579,100.000000
1,1,0.048941,100.000000
2,2,0.042988,100.000000
3,3,0.039534,100.000000
4,4,0.025611,100.000000
...,...,...,...
885,885,0.047134,96.636239
886,886,0.058690,78.018996
887,887,0.076699,55.368478
888,888,0.067322,66.407650


In [60]:
solution1.to_csv('./Documents/prohack_dataset/solution1.csv', index=False) # saves the solution as a csv file for submission; index=False avoids an extra unnamed column
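
Before submitting, it can help to confirm that an allocation respects the stated constraints. A minimal sketch of such a check; `check_allocation` is a hypothetical helper and the arrays are made up:

```python
import numpy as np

def check_allocation(alloc, eei, total=50000.0, cap=100.0, tol=1e-6):
    """Return one boolean per constraint from the problem statement."""
    alloc = np.asarray(alloc, dtype=float)
    eei = np.asarray(eei, dtype=float)
    return {
        "within_bounds": bool(np.all((alloc >= -tol) & (alloc <= cap + tol))),
        "uses_total": bool(abs(alloc.sum() - total) <= tol),
        "low_eei_share": bool(alloc[eei < 0.7].sum() >= 0.1 * total - tol),
    }

alloc = np.array([100.0, 100.0, 50.0, 50.0])   # made-up allocation
eei = np.array([0.5, 0.9, 0.65, 0.8])          # made-up existence expectancy index
report = check_allocation(alloc, eei, total=300.0)
print(report)
```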