Anonymized datasets are increasingly common in online competitions, because companies want to keep their data secure and protect their customers' privacy. Santander has released an anonymized dataset for predicting the value of transactions for each potential customer.

In this notebook I focus on gathering insights from this anonymized data.

# Importing modules and getting a glimpse of the data

In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

init_notebook_mode(connected=True)  # render plotly figures inline
scaler = StandardScaler()

print(os.listdir("../input"))  # list the competition files

In [None]:
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
print("The shape of the training set is:",train_data.shape)

In [None]:
print("The shape of the test set is:", test_data.shape)

- It is quite interesting that the train dataset has more features than data points, which exposes any model to **the curse of dimensionality**.
- The test set has roughly ten times as many rows as the train set.
- Thus, feature extraction is very important and is likely to improve the score of the model substantially.

In [None]:
feature_cols = [c for c in train_data.columns if c not in ["ID", "target"]]
flat_values = train_data[feature_cols].values.flatten()

# Count zeros with NumPy; the built-in sum() would loop over every cell in Python.
n_zero = int(np.count_nonzero(flat_values == 0))
n_nonzero = int(flat_values.size - n_zero)

labels = ['Zero_values', 'Non-Zero_values']
values = [n_zero, n_nonzero]
colors = ['rgba(55, 12, 93, .7)', 'rgba(125, 42, 123, .1)']

pie_trace = go.Pie(labels=labels, values=values,
                   marker=dict(colors=colors, line=dict(color='#fff', width=3)))
layout = go.Layout(title='Value distribution', height=380)
fig = go.Figure(data=[pie_trace], layout=layout)
iplot(fig)

In [None]:
train_data.info()

The memory usage of the train data is approximately 170 MB, and the datatypes of its columns are distributed as follows:

- **float64** - 1845
- **int64** - 3147
- **object** - 1
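
The same breakdown can be computed directly instead of being read off the `info()` footer. The sketch below uses a small, made-up frame (`toy` and its columns are hypothetical stand-ins for `train_data`), so the numbers it prints are not those of the competition data.

```python
import pandas as pd

# Hypothetical stand-in for train_data: one ID column, a target and two features.
toy = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "target": [1.5, 2.0, 3.5],
    "f0": [0, 1, 0],
    "f1": [0.0, 0.0, 2.5],
})

# Number of columns per dtype (dtype names converted to plain strings).
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Total memory in MB; deep=True also counts the strings stored in the ID column.
memory_mb = toy.memory_usage(deep=True).sum() / 1024 ** 2
print("{:.4f} MB".format(memory_mb))
```

Replacing `toy` with `train_data` or `test_data` should give the counts listed in this notebook.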

In [None]:
test_data.info()

The memory usage of the test data is approximately 1.8 GB, and the datatypes of its columns are distributed as follows:

- **float64** - 4991
- **object** - 1

In [None]:
train_data.describe()

In [None]:
def missing_data(data):
    """Return the columns that contain missing values, sorted by their count."""
    total = data.isnull().sum().reset_index()
    total.columns = ['Feature_Name', 'Missing_value']
    # Keep only columns with at least one missing value, then sort that subset.
    total_val = total[total['Missing_value'] > 0]
    total_val = total_val.sort_values(by='Missing_value')
    return total_val

In [None]:
missing_data(train_data).head()

In [None]:
missing_data(test_data).head()


There are no missing values in the train and test datasets. This is convenient, because with anonymized features it is hard to impute missing values with any confidence.

As the data is sparse, we should drop the features that hold a constant value throughout the train dataset: they carry no information and only increase the dimensionality, which hampers the prediction of the target value in the test set.

A feature is constant if it has exactly one unique value, i.e. `nunique == 1`.
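
A minimal sketch of this step on made-up data follows. The frames `toy_train` and `toy_test` and their columns are hypothetical. The sketch assumes that constant columns are identified on the train set only and that the same columns are then dropped from the test set, so both keep identical features.

```python
import pandas as pd

# Hypothetical miniature train/test frames.
toy_train = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "target": [10.0, 20.0, 30.0, 40.0],
    "f0": [0, 0, 0, 0],            # constant in train
    "f1": [0, 5, 0, 7],
    "f2": [1.5, 1.5, 1.5, 1.5],    # constant in train
})
toy_test = pd.DataFrame({
    "ID": ["e", "f"],
    "f0": [0, 3],
    "f1": [1, 0],
    "f2": [1.5, 0.0],
})

# Count unique values per feature column and collect the constant ones.
feature_cols = [c for c in toy_train.columns if c not in ("ID", "target")]
n_unique = toy_train[feature_cols].nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)

# Drop the same columns from both frames.
toy_train = toy_train.drop(columns=constant_cols)
toy_test = toy_test.drop(columns=constant_cols)
```

Calling `DataFrame.nunique()` directly avoids the slower `apply(pd.Series.nunique)` pattern.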


In [None]:
#train_data = train_data.loc[:,train_data.apply(pd.Series.nunique) != 1]
#train_data.shape


# Feature extraction using Truncated SVD

To mitigate the curse of dimensionality, we can use truncated singular value decomposition (SVD) as an alternative to PCA. It is also a linear dimensionality reduction technique, but unlike PCA this estimator does not center the data.

In practice, TruncatedSVD is very useful for highly sparse datasets, which cannot be centered without making the memory usage explode.
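
The following sketch shows the idea on a small random sparse matrix. The data, the shapes and `n_components=10` are illustrative assumptions, not the settings used for the competition data. The decomposition is fitted on the train matrix only and then reused to project the test matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)

# Hypothetical sparse data: roughly 5% of the entries are non-zero.
X_toy = sparse.csr_matrix(rng.rand(200, 50) * (rng.rand(200, 50) < 0.05))
X_toy_test = sparse.csr_matrix(rng.rand(80, 50) * (rng.rand(80, 50) < 0.05))

svd_toy = TruncatedSVD(n_components=10, random_state=0)
X_toy_reduced = svd_toy.fit_transform(X_toy)        # fit on the train matrix only
X_toy_test_reduced = svd_toy.transform(X_toy_test)  # reuse the fitted components

print(X_toy_reduced.shape, X_toy_test_reduced.shape)
print("Explained variance kept: {:.2%}".format(svd_toy.explained_variance_ratio_.sum()))
```

Because the input stays in sparse format and is never centered, memory usage remains proportional to the number of non-zero entries.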

In [None]:
X = train_data.drop(['ID','target'],axis=1)
y_train = train_data["target"]
X_test = test_data.drop('ID', axis = 1)

In [None]:
#svd = TruncatedSVD(n_components=1300, random_state=0)
#SVD_result = svd.fit(X)

In [None]:
#cumm_perc = np.sum(SVD_result.explained_variance_ratio_)
#print("Cumulative explained variation for 1300 components:"+"{:.2%}".format(cumm_perc))

From the above results, the first 1300 components account for about 99.2% of the variance in the train dataset.
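
One way to pick the number of components is to look at the cumulative explained variance and take the smallest count that reaches a chosen threshold. The sketch below does this on made-up data; the matrix, the 90% threshold and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)

# Hypothetical data: 100 samples, 30 mostly-zero features.
X_toy = rng.rand(100, 30) * (rng.rand(100, 30) < 0.1)

# Fit with almost all possible components so the cumulative curve is nearly complete.
svd_toy = TruncatedSVD(n_components=29, random_state=0)
svd_toy.fit(X_toy)

# Running total of the explained variance ratio (non-decreasing).
cumulative = np.cumsum(svd_toy.explained_variance_ratio_)

threshold = 0.90  # assumed target share of variance
if cumulative[-1] >= threshold:
    # Index of the first entry that reaches the threshold, converted to a count.
    n_needed = int(np.searchsorted(cumulative, threshold) + 1)
else:
    # The threshold is not reachable with the fitted components.
    n_needed = len(cumulative)

print("Components needed for {:.0%} of the variance: {}".format(threshold, n_needed))
```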

In [None]:
#X_mod = SVD_result.transform(X)
#X_mod_test = SVD_result.transform(X_test)  # reuse the components fitted on the train data

#y = np.log1p(y_train.values)

#X_train = scaler.fit_transform(X)
#X_test = scaler.transform(X_test)  # scale with the statistics of the train data



# Outlier detection using Isolation Forest 

Outlier detection is an important part of regression analysis. If outliers are not removed, they can hamper the performance of the model that we fit to the data for continuous value prediction. I have therefore used a method that is well suited to high-dimensional datasets: the Isolation Forest algorithm, an ensemble method that returns an anomaly score for each sample in the dataset.

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

`predict` returns +1 for a sample that is an inlier and -1 for a sample that is an outlier.
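
The relationship between the labels and the anomaly scores can be sketched on synthetic data. Everything below (the Gaussian clusters, their sizes and the variable names) is a made-up illustration and not the competition data. In scikit-learn, `predict` labels a sample -1 exactly when its `decision_function` score is negative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Hypothetical data: 200 ordinary points plus 5 points placed far away.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
far_points = rng.normal(loc=8.0, scale=1.0, size=(5, 5))
X_toy = np.vstack([inliers, far_points])

forest = IsolationForest(max_samples=100, random_state=0)
forest.fit(X_toy)

labels = forest.predict(X_toy)            # +1 for inliers, -1 for outliers
scores = forest.decision_function(X_toy)  # lower (negative) means more anomalous

# Keep only the rows labelled as inliers.
X_toy_clean = X_toy[labels == 1]
print("Kept", X_toy_clean.shape[0], "of", X_toy.shape[0], "rows")
```

Whether flagged rows should actually be dropped before fitting a regression model is a modelling decision; the sketch only shows how the mask is built.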

In [None]:
clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X)

In [None]:
y_pred = clf.predict(X)
y_pred_df = pd.DataFrame(data=y_pred,columns = ['Values'])
y_pred_df['Values'].value_counts()

In [None]:
anomaly_score = clf.decision_function(X)
anomaly_score

In [None]:
y_test_pred = clf.predict(X_test)
y_test_pred_df = pd.DataFrame(data=y_test_pred,columns = ['Out_Values'])
y_test_pred_df['Out_Values'].value_counts()

In [None]:
anomaly_score = clf.decision_function(X_test)
anomaly_score

In [None]:
#sub = pd.read_csv('../input/sample_submission.csv')
#sub["target"] = 
#print(sub.head())
#sub.to_csv('sub_xgb.csv', index=False)
