## Prohack Compettion Ended 21st June
-  Author Stephen Kamau
-  Email stiveckamash@gmail.com


- Link To Organisation -  https://prohack.org/page/challenge

- Link To Dataset  : [https://prohack.org/media/prohack_dataset_avOqBYc.zip]  (Download)

### Challenge Description 
“Beeep…Beeeep….Beeeep… Hooomans*, are you there?...”

This very strange transmission is coming from your narrowband radio signal receiver, pointed towards one of the farthest away galaxies. It’s early morning, you are sitting in your radio observatory high in the mountains.

For the last 10 years you’ve been a Chief Data Scientist in one of the best astrophysics research teams in the world. You are enjoying a quiet time with a cup of coffee and reviewing the data reports from last night, when this strange sound arrived. You almost spill your coffee in surprise. “Am I dreaming?” is your first thought as you move closer towards the speaker and listen…

“Beep…Beeeep….Beeeep… To all Hooomans who can hear us – we need your help”

You lean closer and grab a notebook and a pencil – you don’t really trust computers when it comes to such important tasks as taking notes from a radio transmission. You start recording everything that the strange voice from light years away is saying.

“… We need serious Data Science help and we know you Hooomans are the best at it…. We are an intergalactic species which have almost achieved singularity and the highest possible levels of development. We travel fast through space and explore other galaxies”

“The only essence that we consume is energy, measured in DSML units…Our populace is widespread and we live across many different star clusters and galaxies. What we need now is to optimize our well-being across all those galaxies… We have a lot of data but our сomputers and methods are too weak – we urgently need your data science knowledge to help us”

“Only two steps prevent us from achieving singularity

· To understand what makes us better off.

Our elders used the composite index to measure our well-being performance, but this knowledge has disappeared in the sands of time.

Use our data and train your model to predict this index with the highest possible level of certainty.

· To achieve the highest possible level of well-being through optimized allocation of additional energy

We have discovered the star of an unusually high energy of 50000 zillion DSML.

We have agreed between ourselves that 

· no one galaxy will consume more than 100 zillion DSML 

and 

· at least 10% of the total energy will be consumed by galaxies in need with existence expectancy index below 0,7.

Think of our galaxies as your “countries” (or how you call them??) and our population as citizens. We have similar healthcare and wellbeing characteristic as you, Hooomans”

“We are sending all the data to you right now. Let the data be with you, Hoomans… … …”

Transmission suddenly ends. You put your notebook and pencil away and start thinking. You really want to help this species optimize their well-being. You open up Python and upload the dataset from the narrowband radio signal receiver. It will be another great day at the observatory today.

————

* probably intergalactic species meant to say “humans” here but we will never know for sure

Description Data Recieved
The solutions are evaluated on two criteria: predicted future Index values and allocated energy from a newly discovered star

1) Index predictions are evaluated using RMSE metric

2) Energy allocation is also evaluated using RMSE metric and has a set of known factors that need to be taken into account.

Every galaxy has a certain limited potential for improvement in the index described by the following function:

Potential for increase in the Index = -np.log(Index+0.01)+3

Likely index increase dependent on potential for improvement and on extra energy availability is described by the following function:

Likely increase in the Index = extra energy * Potential for increase in the Index **2 / 1000

There are also several constraints:

in total there are 50000 zillion DSML available for allocation and no galaxy at a point in time should be allocated more than 100 zillion DSML or less than 0 zillion DSML. Galaxies with low existence expectancy index below 0.7 should be allocated at least 10% of the total energy available in the foreseeable future

3) Leaderboard is based on a combined scaled metric:

80% prediction task RMSE + 20% optimization task RMSE * lambda where lambda is a normalizing factor

4) Leaderboard is 80% public and 20% private

5) The submission should be in the following format:

 ### Variable     Description      
 - Index   Unique index from the test dataset in the ascending order
 - pred    Prediction for the index on interest
 - opt_pred Optimal energy allocation





In [33]:
#libs to be used
import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler #for scalling the data
from sklearn.impute import SimpleImputer  # for filling nans
from sklearn.model_selection import GridSearchCV #param tuning
import warnings 
warnings.filterwarnings('ignore')

### Here are the models that i used and compared perfomance
### Choosing the best model from them

In [34]:
#import models for the training and predictions
import sys
sys.path.append("/usr/local/lib/python3.7/dist-packages")
from lightgbm import LGBMRegressor
# !pip install lightgbm
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
#10 models in total

In [35]:
#lets now load the datasetrs
train = pd.read_csv("train.csv")
test  =pd.read_csv("test.csv")
sub = pd.read_csv("sample_submit.csv")

In [36]:
# check unique values in galaxy
print(f"Train has {train['galaxy'].nunique()} galaxies")
print(f"Test has {test['galaxy'].nunique()} galaxies")

Train has 181 galaxies
Test has 172 galaxies


In [37]:
# check unique values in galactic year
print(f"Train has {train['galactic year'].nunique()} years period")
print(f"Test has {test['galactic year'].nunique()} years peorion")

Train has 26 years period
Test has 10 years peorion


In [38]:
#check on average instance contribution for each galaxy
print(f'Average for Train is {train.shape[0]/train["galaxy"].nunique()}')
print(f'Average for Test is {test.shape[0]/test["galaxy"].nunique()}')

Average for Train is 21.353591160220994
Average for Test is 5.174418604651163


In [39]:
#checking the shape  of the dfs
train.shape  , test.shape

((3865, 80), (890, 79))

In [40]:
#check the values of nulls in train

train.isna().sum() 

galactic year                                                                   0
galaxy                                                                          0
existence expectancy index                                                      1
existence expectancy at birth                                                   1
Gross income per capita                                                        28
                                                                             ... 
Adjusted net savings                                                         2953
Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total    2924
Private galaxy capital flows (% of GGP)                                      2991
Gender Inequality Index (GII)                                                3021
y                                                                               0
Length: 80, dtype: int64

In [41]:
#check the values of nulls in test

test.isna().sum() 

galactic year                                                                  0
galaxy                                                                         0
existence expectancy index                                                     5
existence expectancy at birth                                                  5
Gross income per capita                                                        5
                                                                            ... 
Intergalactic Development Index (IDI), male, Rank                            341
Adjusted net savings                                                         371
Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total    408
Private galaxy capital flows (% of GGP)                                      354
Gender Inequality Index (GII)                                                361
Length: 79, dtype: int64

In [42]:
# checkk those  with nans certain threshhold
display(len([x for x in train.isna().sum()/train.shape[0]  if x > 0.7]))
[x for x in train.isna().sum()/train.shape[0]  if x > 0.7]

31

[0.7040103492884864,
 0.7094437257438551,
 0.7071151358344114,
 0.705045278137128,
 0.7055627425614489,
 0.7130659767141009,
 0.7133247089262613,
 0.7270375161707633,
 0.7270375161707633,
 0.733764553686934,
 0.7208279430789133,
 0.7182406209573092,
 0.71875808538163,
 0.7221216041397154,
 0.7340232858990945,
 0.7280724450194049,
 0.7425614489003881,
 0.7314359637774903,
 0.7681759379042691,
 0.7694695989650712,
 0.774385510996119,
 0.775679172056921,
 0.7630012936610608,
 0.7632600258732212,
 0.7635187580853816,
 0.7689521345407503,
 0.7692108667529107,
 0.7640362225097025,
 0.7565329883570504,
 0.7738680465717982,
 0.7816300129366106]

In [43]:
# checkk those with nans certain threshhold
[x for x in test.isna().sum()/test.shape[0]  if x > 0.50]

[]

In [44]:
#lets try see the dataset descriptions

train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
galactic year,3865.0,1.000709e+06,6945.463143,990025.000000,995006.000000,1000000.000000,1.006009e+06,1.015056e+06
existence expectancy index,3864.0,8.724787e-01,0.162367,0.227890,0.763027,0.907359,9.927599e-01,1.246908e+00
existence expectancy at birth,3864.0,7.679811e+01,10.461654,34.244062,69.961449,78.995101,8.455897e+01,1.002101e+02
Gross income per capita,3837.0,3.163324e+04,18736.378445,-126.906522,20169.118912,26600.768195,3.689863e+04,1.510727e+05
Income Index,3837.0,8.251535e-01,0.194055,0.292001,0.677131,0.827300,9.702946e-01,1.361883e+00
...,...,...,...,...,...,...,...,...
Adjusted net savings,912.0,2.125292e+01,14.258986,-76.741414,15.001028,22.182571,2.913474e+01,6.190364e+01
"Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total",941.0,6.443023e+00,4.804873,-1.192011,4.113472,5.309497,6.814577e+00,3.653846e+01
Private galaxy capital flows (% of GGP),874.0,2.226147e+01,34.342797,-735.186886,17.227899,24.472557,3.174829e+01,9.594124e+01
Gender Inequality Index (GII),844.0,6.007333e-01,0.205785,0.089092,0.430332,0.624640,7.674039e-01,1.098439e+00


In [45]:
test.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
galactic year,890.0,1.011498e+06,2881.504700,1.007012e+06,1.009020e+06,1.011030e+06,1.014049e+06,1.016064e+06
existence expectancy index,885.0,9.238810e-01,0.134856,4.560855e-01,8.341182e-01,9.446832e-01,1.022712e+00,1.250508e+00
existence expectancy at birth,885.0,7.993837e+01,9.051945,5.156254e+01,7.386305e+01,8.147132e+01,8.642698e+01,1.004277e+02
Gross income per capita,885.0,3.365551e+04,18940.572797,7.340728e+02,2.127653e+04,2.826448e+04,4.113071e+04,1.349891e+05
Income Index,885.0,8.568686e-01,0.186299,3.576460e-01,7.102359e-01,8.688192e-01,9.976503e-01,1.307948e+00
...,...,...,...,...,...,...,...,...
"Intergalactic Development Index (IDI), male, Rank",549.0,1.286353e+02,48.684267,1.852690e+01,9.009828e+01,1.304912e+02,1.681045e+02,2.373822e+02
Adjusted net savings,519.0,2.064039e+01,15.365777,-6.291814e+01,1.439443e+01,2.159987e+01,2.881190e+01,6.294491e+01
"Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total",482.0,6.036133e+00,4.157453,-4.556524e-01,4.075216e+00,5.185122e+00,6.601704e+00,3.148719e+01
Private galaxy capital flows (% of GGP),536.0,2.327426e+01,13.787944,-7.425490e+01,1.694301e+01,2.371402e+01,3.074608e+01,6.675176e+01


In [46]:
# display(train.info())
# display(test.info())

## Findings

- 1 . The Train dataset has 3865 and test has 890 intsances.
- 2 . The datasets has a total of 79 feilds.
- 3 . Train Dataset has very many Nulls with many columns (about 31 columns ) with nulls amounting to 70% .
- 4 . Test has  no any nulls that is above 50% missing but has got many missings.
- 5 . Span of period seems to be around 26 years as its for the test ..we move with it since its more than test.
- 6 . The Data has 181 unique galaxies ~(highest based on train)!.
    - This seems to represent countries available in the world.
    - Each galaxy makes a single counttyr.
    - Each galaxy has galactic years which represent the real  year.
    
- 7 . Based On contribution for Each galaxy in the Data.
    - Each Galaxy in Train Has about 22 Instances.
    - Foer Each intest is about 6 instances.
    
- 8 . The Data has got Soo Many **Outliers**.
     - This is based on observation i saw using the max and 0.75 percentile 
     - For those datas with outliers there is a great variation in thus values
     - The Difference btween max and 75% is very large Indicating that the data has outlies

In [47]:
#check dtypes for the data where its not numerical
train.dtypes[train.dtypes == object]

galaxy    object
dtype: object

In [48]:
# its only galaxy .. the rest are in numerical form
#changing it to categories
# There are similarity in galaxy in both train and test
#gonna join the data first

#we will then use codes to encode
df = pd.concat([train , test ], sort = False)
df["galaxy"] = df["galaxy"].astype('category')
df["galaxy"] = df["galaxy"].cat.codes

In [49]:
#check the total galaxies
df.galaxy.nunique()
# This has been made well since we had 181 uniques max

181

In [50]:
#check the galactic year
df['galactic year'].nunique()

27

### Checking how many galaxies are there and galactic years available

- There are **181** distinct galaxies on the training set and **172** on the test set.

- There are **27** Galactic years ...This is the period from which data was recorded

- **Not all Galaxies on Training dataset are found on test**

- **Also Realized that Galaxy 126 has only 1 instance in train and lacks any in test** 

## conclusion

 - Each distinct galaxy may represent a country in real life. 
 - Every sample for a galaxy may represent the national statistics of the country at a financial (galactic year). 

## Will Try Handling Outliers with RobustScaler
 - This Class scales the data based on INterquatile Range(IQR)
 - First Quartile Q1 is the 25% quantile 
 - 3rd Quatile Q3 is 75% Quantile
 - Then IQR is Calculated as Q3 -Q2)/2

In [51]:
from sklearn.preprocessing import RobustScaler

In [52]:

# #we will use 20 and 80 quantile
# rscaler = RobustScaler(quantile_range =(20 , 80))
# df.drop('y' , axis = 1 , inplace = True)
# cols = df.columns
# # rscale the data
# r_df = rscaler.fit_transform(df)

# #return the df with its columns back since here its now array

# df = pd.DataFrame(r_df , columns = cols)


In [53]:
df.head()

Unnamed: 0,galactic year,galaxy,existence expectancy index,existence expectancy at birth,Gross income per capita,Income Index,Expected years of education (galactic years),Mean years of education (galactic years),Intergalactic Development Index (IDI),Education Index,...,"Intergalactic Development Index (IDI), female","Intergalactic Development Index (IDI), male",Gender Development Index (GDI),"Intergalactic Development Index (IDI), female, Rank","Intergalactic Development Index (IDI), male, Rank",Adjusted net savings,"Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total",Private galaxy capital flows (% of GGP),Gender Inequality Index (GII),y
0,990025,96,0.628657,63.1252,27109.23431,0.646039,8.240543,,,,...,,,,,,,,,,0.05259
1,990025,33,0.818082,81.004994,30166.793958,0.852246,10.671823,4.74247,0.833624,0.467873,...,,,,,,19.177926,,22.785018,,0.059868
2,990025,178,0.659443,59.570534,8441.707353,0.499762,8.840316,5.583973,0.46911,0.363837,...,,,,,,21.151265,6.53402,,,0.050449
3,990025,163,0.555862,52.333293,,,,,,,...,,,,,,,5.912194,,,0.049394
4,990025,155,0.991196,81.802464,81033.956906,1.131163,13.800672,13.188907,0.910341,0.918353,...,,,,,,,5.611753,,,0.154247


##  Modelling and Prediction

## Discussin

- We will not use 8 Galaxies from the  as they are not in the test data

- According to our theory, every galaxy represent a country and columns are its statistics fields at a the financial .

- Some Countries have values with Missings . We will use imputer to imputer mean with them for each coutry specific.

####

- I used batching similar galaxies galaxies as they give good performance of the model. 



- I will also try both clustering galaxies in different clusters. Then try making checking the model score.


In [54]:
# get the label columns
label = train.y

In [55]:
#get a list of all galaxies
galaxies = df['galaxy'].unique().tolist()

In [56]:
#lets now get back our train and test datas
train = df[:train.shape[0]]
test = df[train.shape[0]:]
# train.drop('y' , axis =1 , inplace = True)

In [57]:
y = label
train.shape , y.shape , test.shape

((3865, 80), (3865,), (890, 80))

In [58]:
# train.head()

### Using GridseachCv to tune params..

In [59]:
# #tunigin hypaparamters
# from sklearn.model_selection import GridSearchCV
# parameters = {
#     'max_depth': range (2, 10, 1),
#     'n_estimators': range(60, 220, 40),
#     'learning_rate': [0.1, 0.01, 0.05]
# }
# estimator = XGBRegressor(
#     nthread=4,
#     seed=42
# )
# grid_search = GridSearchCV(
#     estimator=estimator,
#     param_grid=parameters,
#     scoring = 'neg_mean_squared_error',
#     n_jobs = 10,
#     cv = 10,
#     verbose=True
# )

# grid_search.fit(train , y)


In [60]:
# param = grid_search.best_estimator_

### Took loong Time as hell so i commented it

## Model Validation

In [61]:
#a ftn to rerturn rmse scores of the models
def sq(lis):
  new_lis = []
  lis = np.array(lis)
  for i in lis:
    i = np.sqrt(i*-1)
    new_lis.append(i)
  return new_lis


#cross validate
from catboost import CatBoostRegressor
# ftn for cross validation
def val_data(data,y): 
    kfolds = KFold(n_splits = 10 , random_state = 0)
    
    imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(data)
    train_data=imp.transform(data)

    scaler = StandardScaler().fit(data)
    data = scaler.transform(data)
    model_val = XGBRegressor(objective ='reg:squarederror')
#     model_val = LGBMRegressor(n_estimators=500)
#     model_val  =RandomForestRegressor()
#     model_val = LinearRegression()

#     model_val = CatBoostRegressor(n_estimators=500 )

    val = cross_val_score(model_val ,data  , y , cv=kfolds, scoring = 'neg_mean_squared_error')
    return sq(val)


## Using Different Batch sizes

In [62]:
## WE will start by using different batchs to test
# bach with 60 galaxies seems to perfom somehow better
a = [60,120,180]
x = 0
batch_rmse = []
for a_a in a:
    x_train = train[(train["galaxy"]< a_a) & (train["galaxy"] >=0)]
    z = x_train.y
    x_train = x_train.drop('y' , axis =1)

    batch_rmse.append(np.mean(val_data(x_train,z)))
    x = x+60

In [67]:
print(f"RMSE  With Batch  of 60 is  {np.mean(batch_rmse)}")
#perfomes worsst 
# tried with 10 ,20,30 ,45 and 60 and all are worse

RMSE  With Batch  of 60 is  0.0235755356592737


## Using Each Country's Parameters


In [68]:
rmse = []
train['y'] = y
#galaxy number 126 seems to have one row and no test
galaxies.remove(126)
for gal in galaxies:
    x_train = train[train["galaxy"]== gal]
    z = x_train.y
    x_train = x_train.drop('y' , axis =1)

    rmse.append(np.mean(val_data(x_train,z)))

In [69]:
print(f"RMSE USing Specific Countires is {np.mean(rmse)}")

RMSE USing Specific Countires is 0.0032975207715731734


## Using Whole Dataset as similar


In [70]:
print(f"RmSes Scores Are    {val_data(train ,y)} ")

RmSes Scores Are    [0.0010872669778771284, 0.0010742893654101734, 0.0007969243177310009, 0.0007099688421680841, 0.0009574431868233445, 0.0009427767884473497, 0.0010076573639320142, 0.0008712707715674118, 0.0019461254573156451, 0.019227065780258356] 


In [71]:
# train using whole data
print(f"RMSE For whole data is    {np.mean(val_data(train , y))}")
#this perfomed best

RMSE For whole data is    0.002862078885153051


### USing My Model
  - From the models i test i find xgboost works fine and very foirst
  - I used whole dataset and also separeted galaxywise since it had some good prediction

In [72]:
# for x in galaxies:
#     if len(train[train['galaxy']==x])< 5:
#         print(x)


# found 126 had 1 data


In [73]:
# test[test['galaxy']==126]

#126 has no test data

In [75]:

res = pd.DataFrame()
res["testcase"] = test['galaxy']
res["pred"] = 0 #this is the prediction df using each galaxy
res['place'] = test.index
res['whole_pre'] = 0  # pred using whole dataset

In [77]:
#actual training preparation
# train = train.drop('y' , axis =1)
def process(x_train , x_test):
    imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(x_train)
    X_train=imp.transform(x_train)
    X_test=imp.transform(x_test)

    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    

In [87]:
#insert labels for separation purposes
train['y'] = y
#to check galaxies in train and not in test
is_empty = []
test.drop('y' , axis = 1 , inplace = True)
for x in galaxies:
    if len(test[test['galaxy']==x]) == 0:
        is_empty.append(x)
#         print(x)

#remove all that are mot in test
for each in is_empty:
    galaxies.remove(each)
    
    
#do trauning for each galaxy
for gal in galaxies:
    #get dataset_specific galaxy df for testand train
    z_train = train[train["galaxy"]== gal]
    z_test = test[test["galaxy"]== gal]
    
    #get label for that df
#     print(z_train.shape)
#     print(z_test.shape)
    z = z_train.y
    z_train = z_train.drop('y' , axis =1)
    
    #filling nans using imputer
    imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(z_train)
    z_train=imp.transform(z_train)
    z_test=imp.transform(z_test)
    #actual modelling
    model = XGBRegressor(objective ='reg:squarederror')
    model.fit(z_train , z)
    #prediction
    pred = model.predict(z_test)
    #get indexes for the galaxy trained 
    x = test[test["galaxy"]== gal].index
    x.tolist()#chenge the indexes to list
    #enumerate the indexes  in x and compare them with that of res df and add predictions to those indexes
    for i , n in enumerate(x):
    #     print(f"{i}  {n}")
        res.iloc[n , 1]= res[res['place'] == n]['pred'] = pred[i]
    #     print(pred)

In [93]:
#doing for whole predictions

wh_model = XGBRegressor(objective ='reg:squarederror')
train = train.drop('y' , axis =1)
train.shape , test.shape

((3865, 79), (890, 79))

In [94]:

process(train ,test)
wh_model.fit(train , y)
# get preds#
predictions = wh_model.predict(test)

In [97]:
# insert the predictions to the res df
res['whole_pre'] = predictions

In [98]:
res.head()

Unnamed: 0,testcase,pred,place,whole_pre
0,84,0.042009,0,0.041185
1,142,0.039123,1,0.039401
2,142,0.039123,2,0.041665
3,147,0.038863,3,0.043491
4,178,0.030014,4,0.022697


## Optimization for opt_pred

- When Giving 100 to all with p2 values was maximizing the energy allocation to country with low existence expentancy index .. But when i tried It perfomed averagely
- Distributing the energie extensively say to all countries perfomed the worse.
    - So Idecided to shrink the enegy to those countries with low existence expentancy index(high p2)
    - My best Score occured when i shared the energy to 555 instancies differently as shown below
    
 

In [99]:
res['opt_pred'] =0#set optimization col to 0
#get existence expectancy index col from the test
res['ex_ind'] = test["existence expectancy index"].fillna(test["existence expectancy index"].quantile(0.15))
res.head()

Unnamed: 0,testcase,pred,place,whole_pre,opt_pred,ex_ind
0,84,0.042009,0,0.041185,0,0.456086
1,142,0.039123,1,0.039401,0,0.529835
2,142,0.039123,2,0.041665,0,0.560976
3,147,0.038863,3,0.043491,0,0.56591
4,178,0.030014,4,0.022697,0,0.588274


In [100]:
#some calculations to check conditions
index = res['pred']
pot_inc = -np.log(index+0.01)+3
p2= pot_inc**2

likely_inc = index *p2/1000
# 

In [None]:
# res.loc[likely_inc.nlargest(400).index]['ex_ind']<0.7

In [101]:
# res[res["ex_ind"] > 0.9]['ex_ind'].nsmallest(32)

In [102]:
res.loc[p2.nlargest(342).index, 'opt_pred']=100
res=res.sort_values('pred')
res.iloc[342:400].opt_pred = 90
res.iloc[400:553].opt_pred = 69
res.iloc[553:554].opt_pred = 23
res=res.sort_index()

In [103]:
#enery increase
en_increase = (res['opt_pred']*p2)/1000

In [105]:
print(f"Sum:  {sum(en_increase)}\n With <0.7 :; { res.loc[res.ex_ind < 0.7, 'opt_pred'].sum()}\n total Allocated energy { res['opt_pred'].sum()}")

Sum:  1844.9451196820314
 With <0.7 :; 6528
 total Allocated energy 50000


In [117]:

#columns to submit
res['index'] = res['place']
cols_each = ['index' , 'pred' , 'opt_pred']

In [119]:
sub_each = res[cols_each]
# res.head()

In [120]:
sub_each.head()

Unnamed: 0,index,pred,opt_pred
0,0,0.042009,100
1,1,0.039123,100
2,2,0.039123,100
3,3,0.038863,100
4,4,0.030014,100


In [None]:
sub_each.to_csv('submit.csv' , index = False)

In [124]:
#submission for prediction with whole dataset
cols_whole = ['index' , 'whole_pre' , 'opt_pred']

In [132]:
alll = res[cols_whole]

In [130]:
alll.to_csv('AllDs.csv' ,index = False )

In [131]:
alll.tail()

Unnamed: 0,index,whole_pre,opt_pred
885,885,0.090191,100
886,886,0.085315,100
887,887,0.090191,0
888,888,0.090191,69
889,889,0.081445,100
