## TiGa - ProHack Notebook

## Approach
1. Define the target for prediction - Y (index)
2. Import Libraries
3. Import Data

### Step 1: Import Libraries

In [1]:
# for working with dataframes import pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pylab as pl
%matplotlib inline

from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from scipy.optimize import minimize

import warnings 
warnings.filterwarnings('ignore')

### Step 2: Import data

In [2]:
# train data
train_df = pd.read_csv('train.csv')

train_df.head()

Unnamed: 0,galactic year,galaxy,existence expectancy index,existence expectancy at birth,Gross income per capita,Income Index,Expected years of education (galactic years),Mean years of education (galactic years),Intergalactic Development Index (IDI),Education Index,...,"Intergalactic Development Index (IDI), female","Intergalactic Development Index (IDI), male",Gender Development Index (GDI),"Intergalactic Development Index (IDI), female, Rank","Intergalactic Development Index (IDI), male, Rank",Adjusted net savings,"Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total",Private galaxy capital flows (% of GGP),Gender Inequality Index (GII),y
0,990025,Large Magellanic Cloud (LMC),0.628657,63.1252,27109.23431,0.646039,8.240543,,,,...,,,,,,,,,,0.05259
1,990025,Camelopardalis B,0.818082,81.004994,30166.79396,0.852246,10.671823,4.74247,0.833624,0.467873,...,,,,,,19.177926,,22.785018,,0.059868
2,990025,Virgo I,0.659443,59.570534,8441.707353,0.499762,8.840316,5.583973,0.46911,0.363837,...,,,,,,21.151265,6.53402,,,0.050449
3,990025,UGC 8651 (DDO 181),0.555862,52.333293,,,,,,,...,,,,,,,5.912194,,,0.049394
4,990025,Tucana Dwarf,0.991196,81.802464,81033.95691,1.131163,13.800672,13.188907,0.910341,0.918353,...,,,,,,,5.611753,,,0.154247


### Step 3: Analyze & Pre-processing Data

In [3]:
train_df.dtypes

galactic year                                                                  int64
galaxy                                                                        object
existence expectancy index                                                   float64
existence expectancy at birth                                                float64
Gross income per capita                                                      float64
                                                                              ...   
Adjusted net savings                                                         float64
Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total    float64
Private galaxy capital flows (% of GGP)                                      float64
Gender Inequality Index (GII)                                                float64
y                                                                            float64
Length: 80, dtype: object

In [4]:
train_df.describe(include="all")

Unnamed: 0,galactic year,galaxy,existence expectancy index,existence expectancy at birth,Gross income per capita,Income Index,Expected years of education (galactic years),Mean years of education (galactic years),Intergalactic Development Index (IDI),Education Index,...,"Intergalactic Development Index (IDI), female","Intergalactic Development Index (IDI), male",Gender Development Index (GDI),"Intergalactic Development Index (IDI), female, Rank","Intergalactic Development Index (IDI), male, Rank",Adjusted net savings,"Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total",Private galaxy capital flows (% of GGP),Gender Inequality Index (GII),y
count,3865.0,3865,3864.0,3864.0,3837.0,3837.0,3732.0,3502.0,3474.0,3474.0,...,916.0,915.0,914.0,893.0,892.0,912.0,941.0,874.0,844.0,3865.0
unique,,181,,,,,,,,,...,,,,,,,,,,
top,,Andromeda XII,,,,,,,,,...,,,,,,,,,,
freq,,26,,,,,,,,,...,,,,,,,,,,
mean,1000709.0,,0.872479,76.798111,31633.240872,0.825154,14.723296,10.283959,0.804246,0.7459,...,0.823561,0.844209,1.008465,121.754797,120.873428,21.252922,6.443023,22.261474,0.600733,0.082773
std,6945.463,,0.162367,10.461654,18736.378446,0.194055,3.612546,3.319948,0.176242,0.199795,...,0.18578,0.159041,0.087299,46.269362,46.795666,14.258986,4.804873,34.342797,0.205785,0.063415
min,990025.0,,0.22789,34.244062,-126.906522,0.292001,3.799663,1.928166,0.273684,0.189874,...,0.305733,0.369519,0.465177,23.224603,16.215151,-76.741414,-1.192011,-735.186887,0.089092,0.013036
25%,995006.0,,0.763027,69.961449,20169.11891,0.677131,12.592467,7.654169,0.671862,0.597746,...,0.690707,0.731264,0.9658,84.090816,82.23255,15.001028,4.113472,17.227899,0.430332,0.047889
50%,1000000.0,,0.907359,78.995101,26600.76819,0.8273,14.942913,10.385465,0.824758,0.761255,...,0.83541,0.862773,1.029947,120.069916,121.057923,22.182571,5.309497,24.472557,0.62464,0.05782
75%,1006009.0,,0.99276,84.558971,36898.63175,0.970295,17.123797,12.884752,0.939043,0.893505,...,0.970365,0.961369,1.068481,158.579644,157.815625,29.134738,6.814577,31.748295,0.767404,0.087389


#### Evaluate for missing data

`isnull()` returns a dataframe of the same shape as the input in which every cell is a boolean: `True` if the corresponding value is missing, `False` otherwise.

In [5]:
missing_data = train_df.isnull()
missing_data.head(5)

Unnamed: 0,galactic year,galaxy,existence expectancy index,existence expectancy at birth,Gross income per capita,Income Index,Expected years of education (galactic years),Mean years of education (galactic years),Intergalactic Development Index (IDI),Education Index,...,"Intergalactic Development Index (IDI), female","Intergalactic Development Index (IDI), male",Gender Development Index (GDI),"Intergalactic Development Index (IDI), female, Rank","Intergalactic Development Index (IDI), male, Rank",Adjusted net savings,"Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total",Private galaxy capital flows (% of GGP),Gender Inequality Index (GII),y
0,False,False,False,False,False,False,False,True,True,True,...,True,True,True,True,True,True,True,True,True,False
1,False,False,False,False,False,False,False,False,False,False,...,True,True,True,True,True,False,True,False,True,False
2,False,False,False,False,False,False,False,False,False,False,...,True,True,True,True,True,False,False,True,True,False
3,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,False,True,True,False
4,False,False,False,False,False,False,False,False,False,False,...,True,True,True,True,True,True,False,True,True,False


<h4>Count missing values in each column</h4>
<p>
Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset. In the body of the for loop, the method ".value_counts()" counts how often "True" and "False" occur in the column.
</p>

In [6]:
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")

galactic year
False    3865
Name: galactic year, dtype: int64

galaxy
False    3865
Name: galaxy, dtype: int64

existence expectancy index
False    3864
True        1
Name: existence expectancy index, dtype: int64

existence expectancy at birth
False    3864
True        1
Name: existence expectancy at birth, dtype: int64

Gross income per capita
False    3837
True       28
Name: Gross income per capita, dtype: int64

Income Index
False    3837
True       28
Name: Income Index, dtype: int64

Expected years of education (galactic years)
False    3732
True      133
Name: Expected years of education (galactic years), dtype: int64

Mean years of education (galactic years)
False    3502
True      363
Name: Mean years of education (galactic years), dtype: int64

Intergalactic Development Index (IDI)
False    3474
True      391
Name: Intergalactic Development Index (IDI), dtype: int64

Education Index
False    3474
True      391
Name: Education Index, dtype: int64

Intergalactic Development In
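The same per-column counts can also be obtained without a loop. The sketch below is only an illustration: `toy_df` is a hypothetical stand-in with made-up values (only the column names come from the notebook), and `isnull().sum()` adds up the `True` flags in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "galaxy": ["A", "B", "C", "D"],
    "Income Index": [0.64, np.nan, 0.49, np.nan],
    "Education Index": [np.nan, 0.46, 0.36, 0.91],
})

# isnull() marks missing cells with True; sum() counts the True values per column.
missing_counts = toy_df.isnull().sum()

# mean() of the boolean frame gives the share of missing entries per column.
missing_share = toy_df.isnull().mean()

print(missing_counts)
print(missing_share)
```

The share of missing entries is convenient when deciding whether a column is empty enough to drop.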

<h3 id="deal_missing_values">Deal with missing data</h3>
<b>How to deal with missing data?</b>

<ol>
    <li>drop data<br>
        a. drop the whole row<br>
        b. drop the whole column
    </li>
    <li>replace data<br>
        a. replace it by mean<br>
        b. replace it by frequency<br>
        c. replace it based on other functions
    </li>
</ol>

<i> !!! Whole columns should be dropped only if most entries in the column are empty. !!! </i>  
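The options listed above can be sketched on a small example. Everything here is an assumption made for illustration: `toy_df` and its values are hypothetical, and the 0.5 cut-off for "most entries are empty" is not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy_df = pd.DataFrame({
    "galaxy": ["A", "A", "B", np.nan],
    "Income Index": [0.6, np.nan, 0.8, 1.0],
    "Gender Inequality Index (GII)": [np.nan, np.nan, np.nan, 0.5],
})

# 1b. Drop columns in which more than half of the entries are missing
#     (the 0.5 cut-off is an assumption, not a value from the notebook).
max_missing_share = 0.5
keep = toy_df.columns[toy_df.isnull().mean() <= max_missing_share]
reduced = toy_df[keep].copy()

# 2a. Replace missing numeric values by the column mean.
reduced["Income Index"] = reduced["Income Index"].fillna(reduced["Income Index"].mean())

# 2b. Replace missing categorical values by the most frequent value.
reduced["galaxy"] = reduced["galaxy"].fillna(reduced["galaxy"].mode()[0])

# 1a. Alternatively, drop every row that contains at least one missing value.
complete_rows = toy_df.dropna(axis=0)

print(reduced)
print(len(complete_rows))
```

On this toy frame, dropping incomplete rows removes everything, which shows why row-wise dropping is costly on sparse data.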

#### Dropping data
Drop data columns based on their relevance to the well-being index. The selection mostly follows the approach the OECD uses for its index.

<b> Replace by mean</b>
<ul>
    <li>existence expectancy index</li>
    <li>existence expectancy at birth</li>
    <li>Gross income per capita</li>
    <li>Income Index</li>
</ul>

In [7]:
# calculate the average
avg_exist_exp_i = train_df["existence expectancy index"].astype("float").mean(axis=0)
print("Average of existence expectancy index:", avg_exist_exp_i)

Average of existence expectancy index: 0.8724787211405255


In [8]:
avg_exist_exp_at_b = train_df["existence expectancy at birth"].astype("float").mean(axis=0)
print("Average of existence expectancy at birth:", avg_exist_exp_at_b)

Average of existence expectancy at birth: 76.79811068583831


In [9]:
#Gross income per capita

avg_gross_inc_pc = train_df["Gross income per capita"].astype("float").mean(axis=0)
print("Average of Gross income per capita:", avg_gross_inc_pc)

Average of Gross income per capita: 31633.240872115606


In [10]:
#Income Index

avg_income_i = train_df["Income Index"].astype("float").mean(axis=0)
print("Average of Income Index:", avg_income_i)

Average of Income Index: 0.8251535359564769


#### Replacing data by mean

In [11]:
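# assign the result back instead of calling replace(..., inplace=True) on a column selection
train_df["existence expectancy index"] = train_df["existence expectancy index"].fillna(avg_exist_exp_i)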

In [12]:
train_df["existence expectancy at birth"] = train_df["existence expectancy at birth"].fillna(avg_exist_exp_at_b)

In [13]:
# Gross income per capita
train_df["Gross income per capita"] = train_df["Gross income per capita"].fillna(avg_gross_inc_pc)

In [14]:
# Income Index
train_df["Income Index"] = train_df["Income Index"].fillna(avg_income_i)

In [15]:
# check that no missing values remain in the column
missing_data = train_df.isnull()

print(missing_data["existence expectancy index"].value_counts())

False    3865
Name: existence expectancy index, dtype: int64
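The four per-column cells above follow the same pattern, so they could be condensed into one step. A minimal sketch, assuming a hypothetical frame `toy_df` that reuses two of the column names with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing two of the notebook's column names.
toy_df = pd.DataFrame({
    "existence expectancy index": [0.6, np.nan, 0.9],
    "Income Index": [np.nan, 0.5, 0.7],
})

mean_fill_columns = ["existence expectancy index", "Income Index"]

# mean() skips NaN by default, so each mean uses only the observed values.
column_means = toy_df[mean_fill_columns].mean()

# Passing a Series to fillna fills each column with its own mean.
toy_df[mean_fill_columns] = toy_df[mean_fill_columns].fillna(column_means)

print(toy_df)
```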


#### Drop rows

In [16]:
train1=train_df.drop(train_df.index[363:902])

In [17]:
train2=train1.drop(train1.index[543:1263])

In [18]:
train3=train2.drop(train2.index[722:1441])

In [19]:
train4=train3.drop(train3.index[902:1351])

In [20]:
# shape of the dataframe after dropping rows

train4.shape

(1438, 80)

In [21]:
train_df.shape

(3865, 80)
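The four `drop` calls are chained, so each positional slice refers to the frame produced by the previous step rather than to `train_df`. The sketch below reproduces only the row-count arithmetic. It assumes a hypothetical frame `toy_df` with 3865 rows and the default integer index, which is what `pd.read_csv` produces when no index column is given.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with as many rows as train_df and a default integer index.
toy_df = pd.DataFrame({"row_id": np.arange(3865)})

# Each slice is positional within the frame it is taken from, so the second
# slice counts positions in step1, not in toy_df, and so on.
step1 = toy_df.drop(toy_df.index[363:902])    # removes 539 rows
step2 = step1.drop(step1.index[543:1263])     # removes 720 rows
step3 = step2.drop(step2.index[722:1441])     # removes 719 rows
step4 = step3.drop(step3.index[902:1351])     # removes 449 rows

# 3865 - 539 - 720 - 719 - 449 = 1438, matching the shape reported above.
print(len(step4))
```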

In [22]:
missing_data1 = train4.isnull()

for column in missing_data1.columns:
    print(column)
    print(missing_data1[column].value_counts())
    print("")

galactic year
False    1438
Name: galactic year, dtype: int64

galaxy
False    1438
Name: galaxy, dtype: int64

existence expectancy index
False    1438
Name: existence expectancy index, dtype: int64

existence expectancy at birth
False    1438
Name: existence expectancy at birth, dtype: int64

Gross income per capita
False    1438
Name: Gross income per capita, dtype: int64

Income Index
False    1438
Name: Income Index, dtype: int64

Expected years of education (galactic years)
False    1396
True       42
Name: Expected years of education (galactic years), dtype: int64

Mean years of education (galactic years)
False    1328
True      110
Name: Mean years of education (galactic years), dtype: int64

Intergalactic Development Index (IDI)
False    1319
True      119
Name: Intergalactic Development Index (IDI), dtype: int64

Education Index
False    1319
True      119
Name: Education Index, dtype: int64

Intergalactic Development Index (IDI), Rank
False    1304
True      134
Name: Interg

### Counting galaxies

In [23]:
train_gal=set(train4["galaxy"])
s=0
for x in train_gal:
    s=s+len(train4.loc[train4['galaxy'] == x])
print("Total distinct galaxies: {}".format(len(train_gal)))
print("Average samples per galaxy: {}".format(s/len(train_gal)))

Total distinct galaxies: 181
Average samples per galaxy: 7.94475138121547


### Methods for Cross-validating Training Data

#### Approach by Ömer Gözüaçık

I trained a model for every distinct galaxy in the training set (180), except the 126th galaxy, as it has only one sample.

For each galaxy, I used the x features with the highest absolute correlation with y (the target variable). x was found by trying different values: [20, 25, 30, 40, 50, 60, 70].

Missing values are filled with the galaxy specific 'mean' of the data. (Median can be used alternatively.)

The training and test sets are kept separate for both imputation and standardization.

Standard Scaler is used to standardize data.

Gradient Boosted Regression is used as a model.
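The feature selection step can be illustrated in isolation. This is a sketch on synthetic data: the column names `strong_pos`, `strong_neg`, and `noise` are hypothetical and chosen so that the expected selection is obvious.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic target and three hypothetical features.
y = pd.Series(rng.normal(size=n), name="y")
features = pd.DataFrame({
    "strong_pos": y + 0.1 * rng.normal(size=n),   # strongly positively correlated
    "strong_neg": -y + 0.2 * rng.normal(size=n),  # strongly negatively correlated
    "noise": rng.normal(size=n),                  # unrelated to y
})

top_k = 2

# The absolute value keeps strongly negative correlations in the ranking.
abs_corr = features.corrwith(y).abs()
selected = abs_corr.nlargest(top_k).index.tolist()

print(selected)
```

Taking the absolute value matters here: without it, `strong_neg` would rank last even though it carries almost as much information as `strong_pos`.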

In [24]:
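def cross_validation_loop(data, cor):
    """Return the mean 4-fold CV score (negated RMSE) for one galaxy."""
    labels = data['y']
    # drop the identifier and the target so only candidate features remain
    data = data.drop(['galaxy', 'y'], axis=1)

    # keep the `cor` features with the largest absolute correlation with y
    correlation = abs(data.corrwith(labels))
    columns = correlation.nlargest(cor).index
    data = data[columns]

    # fill missing values with the per-column mean of this galaxy's rows
    imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(data)
    data = imp.transform(data)

    # standardize features to zero mean and unit variance
    scaler = StandardScaler().fit(data)
    data = scaler.transform(data)

    estimator = GradientBoostingRegressor(n_estimators=300)

    # one negated RMSE value per fold (higher is better)
    cv_results = cross_validate(estimator, data, labels, cv=4, scoring='neg_root_mean_squared_error')

    error = np.mean(cv_results['test_score'])

    return error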
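In `cross_validation_loop` the imputer and the scaler are fitted on all rows of a galaxy before `cross_validate` splits them, so the held-out folds contribute to the fitted means and variances. If fold-wise separation is wanted, one option is to wrap the preprocessing and the model in a scikit-learn `Pipeline`, which refits every step on the training part of each fold. The sketch below uses synthetic data and a smaller `n_estimators` than the notebook to keep it quick. It is an alternative sketch, not the approach used in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the rows of one galaxy.
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=40)
X[rng.rand(40, 5) < 0.1] = np.nan  # knock out roughly 10% of the cells

# Imputer and scaler are refitted on the training rows of every fold.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    GradientBoostingRegressor(n_estimators=50, random_state=0),
)

# "neg_mean_squared_error" returns negated MSE; convert to RMSE per fold.
neg_mse = cross_val_score(model, X, y, cv=4, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)

print(rmse_per_fold.mean())
```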

#### Code for cross-validating a model for every galaxy.

I return the mean of the cross-validation scores, disregarding the differences in sample size between galaxies.
The lone galaxy that occurs only once is removed first.

In [25]:
train_gal=set(train4["galaxy"])
train_gal.remove("NGC 5253")
def loop_train(cor):
    errors=[]
    for gal in train_gal:
        index = train4.index[train4['galaxy'] == gal]
        data = train4.loc[index]
        errors.append(cross_validation_loop(data,cor))
    return np.mean(errors)
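`loop_train` gives every galaxy the same weight. If galaxies with more samples should count more, a sample-weighted mean is one alternative. The numbers below are hypothetical and serve only to show how the two averages differ.

```python
import numpy as np

# Hypothetical per-galaxy RMSE values and sample counts, for illustration only.
errors = np.array([0.010, 0.004, 0.020])
n_samples = np.array([4, 12, 8])

# Plain mean: every galaxy counts equally, as in loop_train.
unweighted = errors.mean()

# Weighted mean: each galaxy counts in proportion to its number of samples.
weighted = np.average(errors, weights=n_samples)

print(unweighted, weighted)
```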

#### Checking which correlation cut-off gives the best value
The model performs best when the cut-off (the number of top-correlated features kept) is 20, with an RMSE of 0.0063.

In [26]:
cor=[20,25,30,40,50,60,70,80]
errors=[]
for x in cor: 
    errors.append(loop_train(x))

ValueError: 'neg_root_mean_squared_error' is not a valid scoring value. Use sorted(sklearn.metrics.SCORERS.keys()) to get valid options.
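As far as I know, the scorer name `'neg_root_mean_squared_error'` was added in scikit-learn 0.22, so older installations raise this error. Upgrading scikit-learn is one fix. Another is to build an equivalent scorer with `make_scorer`, as in the sketch below. The data is synthetic and the helper `rmse` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_validate


def rmse(y_true, y_pred):
    """Root mean squared error (hypothetical helper for this sketch)."""
    return np.sqrt(mean_squared_error(y_true, y_pred))


# greater_is_better=False makes scikit-learn negate the value, matching the
# sign convention of the built-in "neg_*" scorers.
neg_rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Synthetic regression data, for illustration only.
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=40)

cv_results = cross_validate(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    X, y, cv=4, scoring=neg_rmse_scorer,
)

# Flip the sign back to report a positive RMSE.
mean_rmse = -np.mean(cv_results["test_score"])
print(mean_rmse)
```

Passing `neg_rmse_scorer` as `scoring` in `cross_validation_loop` would keep the rest of the code unchanged, since the returned values have the same sign as the built-in scorer.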