The trend of anonymized data for online competitions is increasing day by day as companies want their data to be secure and thus maintaining the privacy of their customers. Santander has released an anonymized dataset for predicting the value of transactions for each potential customer.

So in this notebook I will be focusing on gathering insights from the unknown data.

# Importing modules and getting a glimpse of the data

In [13]:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
#from plotly.offline import init_notebook_mode, iplot
#init_notebook_mode(connected=True)
#import plotly.graph_objs as go

In [6]:
train_data = pd.read_csv('S_train.csv')
test_data = pd.read_csv('S_test.csv')

In [7]:
train_data.head()

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
1,000fbd867,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
2,0027d6b71,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
3,0028cbf45,2000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4,002a68644,14400000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0


In [8]:
test_data.head()

Unnamed: 0,ID,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,20aa07010,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000137c73,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,00021489f,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0004d7953,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,00056a333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,00056d8eb,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
print("The shape of the training set is:",train_data.shape)

In [None]:
print("The shape of the test set is:", test_data.shape)

- It is quiet interesting to see that the number of features in the train dataset is greater than the number of data points i.e. **the curse of dimensionality**.
- The test set is 10 times bigger than the train set in shape.
- Thus, feature extraction is very important and will substantially improve the score of the model.

In [None]:
feature_cols = [c for c in train_data.columns if c not in ["ID", "target"]]
flat_values = train_data[feature_cols].values.flatten()

labels = 'Zero_values','Non-Zer_values'
values = [sum(flat_values!=0), sum(flat_values==0)]
colors = ['rgba(55, 12, 93, .7)','rgba(125, 42, 123, .1)']

Plot = go.Pie(labels=labels, values=values,marker=dict(colors=colors,line=dict(color='#fff', width= 3)))
layout = go.Layout(title='Types of Loans', height=380)
fig = go.Figure(data=[Plot], layout=layout)
iplot(fig)

In [None]:
train_data.info()

The memory usage of the data is approx 170MB and the datatypes for features are distributed as:
- **float64** - 1845
-   **int64** - 3147
-  **object** - 1

In [None]:
test_data.info()

The memory usage for test data is 1.8GB and the datatypes for features are:

- **float64** - 4991
-  **object** - 1

In [None]:
train_data.describe()

In [None]:
def missing_data(data): #calculates missing values in each column
    total = data.isnull().sum().reset_index()
    total.columns  = ['Feature_Name','Missing_value']
    total_val = total[total['Missing_value']>0]
    total_val = total.sort_values(by ='Missing_value')
    return total_val

In [None]:
missing_data(train_data).head()

In [None]:
missing_data(test_data).head()

There are no missing values in the train and test dataset. This is reasonably good as it is nearly impossible to fill missing data with certain values. 

As the data is sparse, It is required for us to drop the features having constant value throughout the dataset as they will just increase the dimensionality of the dataset hampering the prediction of the target value in the test set. 

A feature is constant if the number of unique elements in it is equal to 1 i.e. nunique =1.


In [None]:
train_data = train_data.loc[:,train_data.apply(pd.Series.nunique) != 1]
train_data.shape

This reduces the number of features to 4737 i.e. 256 features are constant and thus removed from the dataset.

# Feature selection using Truncated SVD

To avoid the curse of dimensionality, apart from PCA we can apply linear dimensionality reduction by the means of truncated singular value decomposition.This estimator does not center the data.

In practice, TruncatedSVD is very useful for highly sparse datasets which cannot be centered without making the memory usage explode. 

In [15]:
ss = StandardScaler()
feature_cols = [c for c in train_data.columns if c not in ["ID", "target"]]
svd = TruncatedSVD(n_components=500, n_iter=100, random_state=42)

In [None]:
X_scaled = ss.fit_transform(train_data[feature_cols].values)
X_test_scaled =ss.fit_transform(test_data[test_data.columns[1:]].values)
SVD_result = svd.fit(X_scaled)


In [None]:
#print(SVD_result.explained_variance_ratio_)

In [12]:
cumm_perc = np.sum(SVD_result.explained_variance_ratio_)
print("Cumulative explained variation for 400 components:"+"{:.2%}".format(cumm_perc))

Cumulative explained variation for 400 components:89.84%


From the above results it is evident that first 400 components results for approx 92% of the variance in the train dataset.  

In [117]:
X_mod = SVD_result.fit_transform(X_scaled)
X_mod_test = SVD_result.fit_transform(X_test)

In [119]:
y = train_data['target']

In [142]:
from sklearn.linear_model import ElasticNet

In [143]:
regr = ElasticNet(alpha = 100,max_iter =1000,random_state=42)

In [144]:
regr.fit(X_mod,y)



ElasticNet(alpha=100, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=42, selection='cyclic', tol=0.0001, warm_start=False)

In [None]:
y_pred = regr.predict(X_mod_test)

In [152]:
y_pred

array([ 4629998.17377604,  5966751.15523577, 10968965.91128213, ...,
        4741371.34711267, -5118340.26960717,  4097904.25828624])

In [150]:
print(regr.score(X_test,y_test))

-1.6497104913714709


In [151]:
print(regr.score(X_train,y_train))

0.33647007675781776


In [154]:
def rmsle(y,pred):
    return np.sqrt(np.mean(np.power(np.log1p(y)-np.log1p(pred), 2)))

In [155]:
print(rmsle(y,))

  


ValueError: operands could not be broadcast together with shapes (4459,) (1338,) 