## Spam detection task
The data contains 100 features extracted from a corpus of emails. Some of the emails are spam and some are normal. The task is to build a spam detector.
- train.csv - 600 emails x 100 features, used to train the model(s)
- train_labels.csv - labels for the 600 training emails (1 = spam, 0 = normal)
- test.csv - 4000 emails x 100 features; these are the emails in which spam must be detected

Predictions can be continuous numbers or 0/1 labels. No header is necessary. Submissions are judged on area under the ROC curve. 
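
Since submissions are scored on area under the ROC curve, it helps to see what that number measures. The sketch below uses made-up labels and scores (not taken from the competition data) to show that ROC AUC is the fraction of (spam, normal) pairs in which the spam email gets the higher score. The helper `pairwise_auc` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, made up for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])


def pairwise_auc(y_true, scores):
    """Fraction of (spam, normal) pairs ranked correctly; ties count as 0.5."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every spam score minus every normal score
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


auc_manual = pairwise_auc(y_true, scores)
auc_sklearn = roc_auc_score(y_true, scores)
print(auc_manual, auc_sklearn)
```

Only the ordering of the scores matters for this metric, which is why both continuous scores and 0/1 labels are accepted.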

In [3]:
# Import the libraries used in this notebook
import numpy as np
import pandas as pd
import xgboost as xgb

from sklearn.model_selection import GridSearchCV
from sklearn import linear_model, model_selection, ensemble

In [4]:
# Read the training features, the test features, and the training labels
data = pd.read_csv('../input/just-the-basics-the-after-party/train.csv')
dataT = pd.read_csv('../input/just-the-basics-the-after-party/test.csv')
y = pd.read_csv('../input/just-the-basics-the-after-party/train_labels.csv')
data.head()

Unnamed: 0,0.097094,1.1133,45.038,0.88184,0.087009,1.041,1.5486,3.498,1.8578,0.0096729,...,0.076209,3.6654,0.061607,0.0031605,0.036038,0.0845,2.4517,3.3373,0.065201,0.091158
0,0.050086,0.11158,94.08,1.765,0.089417,4.8047,0.26742,,0.56473,0.035123,...,0.054712,4.1687,0.075432,0.010869,0.063972,0.079892,1.9795,3.5064,0.072132,0.09195
1,0.088447,2.3634,5.058,0.14436,0.064547,2.444,4.2545,0.36506,1.8609,0.009759,...,0.017203,4.5613,0.046505,,0.084066,0.064829,3.3087,2.9969,0.064328,0.036793
2,0.77254,0.59469,,0.97515,0.015987,0.52884,1.4884,3.961,4.8063,0.048617,...,0.022891,0.12832,0.065028,0.036862,0.01001,0.020709,2.5237,2.1711,0.080865,0.081553
3,0.38241,4.8109,1955.1,0.4605,0.024453,2.0298,3.7403,4.2281,2.4292,0.15683,...,0.032051,4.3701,1.0011,0.06575,0.043547,0.62943,4.6262,3.1947,,0.18718
4,0.081316,4.8415,4.0507,2.4832,0.05899,2.3794,1.6127,2.0422,1.6571,0.039377,...,0.018918,2.6804,0.076524,0.082756,0.041953,0.018092,3.3041,0.1922,0.0326,0.050172
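
The column names in the preview above look like data values, and `data.info()` further down reports 599 rows rather than the 600 described in the task. This suggests that `pd.read_csv` consumed the first row of each header-less file as column names. The sketch below, on a tiny made-up CSV, shows the difference that `header=None` makes; whether the notebook's results change with this option has not been checked here.

```python
import io

import pandas as pd

# Tiny stand-in for a header-less CSV (values are illustrative only).
raw = "0.1,1.5\n0.2,\n0.3,2.5\n"

# Default behaviour: the first row is used up as column names.
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data and names the columns 0, 1, ...
with_no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_default.shape, with_no_header.shape)
print(list(with_no_header.columns))
```

With `header=None` the columns are already numbered from 0, so no manual renaming would be needed.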


In [5]:
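# The dataset has no headers, so name the columns 0..99 so they can be referenced later.
# Note: assigning [colums] (a list inside a list) creates a one-level MultiIndex,
# which is why the columns appear as (0,), (1,), ... in the output below.
colums = list(range(0, 100))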
data.columns = [colums]
dataT.columns = [colums]
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 100 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   (0,)    584 non-null    float64
 1   (1,)    589 non-null    float64
 2   (2,)    579 non-null    float64
 3   (3,)    579 non-null    float64
 4   (4,)    574 non-null    float64
 5   (5,)    583 non-null    float64
 6   (6,)    586 non-null    float64
 7   (7,)    582 non-null    float64
 8   (8,)    580 non-null    float64
 9   (9,)    580 non-null    float64
 10  (10,)   586 non-null    float64
 11  (11,)   583 non-null    float64
 12  (12,)   585 non-null    float64
 13  (13,)   576 non-null    float64
 14  (14,)   572 non-null    float64
 15  (15,)   583 non-null    float64
 16  (16,)   577 non-null    float64
 17  (17,)   587 non-null    float64
 18  (18,)   582 non-null    float64
 19  (19,)   578 non-null    float64
 20  (20,)   582 non-null    float64
 21  (21,)   578 non-null    float64
 22  (

In [6]:
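# Fill the missing values in each column with that column's median.
# Assigning the result back avoids relying on chained inplace=True,
# which newer pandas versions warn about or ignore.
for i in colums:
    data[i,] = data[i,].fillna(data[i,].median())

for i in colums:
    dataT[i,] = dataT[i,].fillna(dataT[i,].median())
data.info()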

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 100 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   (0,)    599 non-null    float64
 1   (1,)    599 non-null    float64
 2   (2,)    599 non-null    float64
 3   (3,)    599 non-null    float64
 4   (4,)    599 non-null    float64
 5   (5,)    599 non-null    float64
 6   (6,)    599 non-null    float64
 7   (7,)    599 non-null    float64
 8   (8,)    599 non-null    float64
 9   (9,)    599 non-null    float64
 10  (10,)   599 non-null    float64
 11  (11,)   599 non-null    float64
 12  (12,)   599 non-null    float64
 13  (13,)   599 non-null    float64
 14  (14,)   599 non-null    float64
 15  (15,)   599 non-null    float64
 16  (16,)   599 non-null    float64
 17  (17,)   599 non-null    float64
 18  (18,)   599 non-null    float64
 19  (19,)   599 non-null    float64
 20  (20,)   599 non-null    float64
 21  (21,)   599 non-null    float64
 22  (
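
The cell above fills the test set with its own medians. A common alternative is to learn the medians on the training data only and reuse them for the test data, so both sets are transformed identically. A minimal sketch with scikit-learn's `SimpleImputer`, using made-up frames `train` and `test` (hypothetical names, not the notebook's variables):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with gaps, for illustration only.
train = pd.DataFrame({"a": [1.0, np.nan, 3.0, 5.0], "b": [10.0, 20.0, np.nan, 40.0]})
test = pd.DataFrame({"a": [np.nan, 100.0], "b": [np.nan, 200.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # learns the per-column medians from train
test_filled = imputer.transform(test)        # reuses the train medians on test

print(imputer.statistics_)  # the medians learned from the training frame
print(test_filled)
```

Placing the imputer inside a `Pipeline` would also keep the medians from leaking across cross-validation folds, at the cost of a slightly longer setup.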

In [7]:
# Flatten y into a 1-D array, the shape scikit-learn expects for labels

y_train = np.ravel(y)
print(y.shape,type(y), y_train.shape, type(y_train))

# Missing values are now filled. Outliers are not removed (this choice still needs justification).
X_train = data
X_test = dataT

(599, 1) <class 'pandas.core.frame.DataFrame'> (599,) <class 'numpy.ndarray'>


## Modeling
Hyperparameters are tuned with GridSearchCV, using area under the ROC curve (`'roc_auc'`) as the scoring metric.

### LogisticRegression

In [8]:
# Use the Lasso ('l1') penalty and tune the inverse regularization strength 'C'
param_grid = {'C': [0.01, 0.05, 0.1, 0.5, 1, 5, 10]}

estimator = linear_model.LogisticRegression(solver='liblinear', penalty = 'l1', random_state = 1)
optimizerL = GridSearchCV(estimator, param_grid, scoring = 'roc_auc',cv = 3)                    
optimizerL.fit(X_train, y_train)

print('score_train_opt', optimizerL.best_score_)
print('param_opt', optimizerL.best_params_)

score_train_opt 0.9313434494237476
param_opt {'C': 0.5}


### RidgeClassifier

In [9]:
param_grid = {'alpha': [0.01, 0.05, 0.1, 0.5, 1, 2, 5]}

estimator = linear_model.RidgeClassifier( random_state = 1)
optimizerR = GridSearchCV(estimator, param_grid,  scoring = 'roc_auc',cv = 3)                    
optimizerR.fit(X_train, y_train)

print('score_train_opt', optimizerR.best_score_)
print('param_opt', optimizerR.best_params_)

score_train_opt 0.9032021454258864
param_opt {'alpha': 5}


### RandomForestClassifier
We start with a loose stopping criterion and then restrict tree growth (pruning) to remove branches that contribute to overfitting. Pruning trades training accuracy for generalizability: training scores may drop, but the gap between training and validation scores should shrink as well, which is what we want. (Details: https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680)
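
The notebook limits tree size through `max_depth`. For comparison, scikit-learn also offers cost-complexity pruning through the `ccp_alpha` parameter. The sketch below is an illustration on synthetic data, not the author's method; the value `ccp_alpha=0.01` and the helper `total_nodes` are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the real features are not used here.
X, y = make_classification(n_samples=300, n_features=20, random_state=1)


def total_nodes(forest):
    """Total number of nodes over all trees in a fitted forest."""
    return sum(est.tree_.node_count for est in forest.estimators_)


unpruned = RandomForestClassifier(n_estimators=20, random_state=1).fit(X, y)
pruned = RandomForestClassifier(n_estimators=20, ccp_alpha=0.01, random_state=1).fit(X, y)

print("nodes without pruning:", total_nodes(unpruned))
print("nodes with ccp_alpha=0.01:", total_nodes(pruned))
```

A larger `ccp_alpha` removes more branches; like `max_depth`, it would need to be tuned by cross-validation.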

In [10]:
rf_class = ensemble.RandomForestClassifier(random_state = 1)
depths = list(range(1, 11))
# param_name and param_range are passed as keywords (required by recent scikit-learn versions)
train_scores, test_scores = model_selection.validation_curve(rf_class, X_train, y_train, param_name='max_depth', param_range=depths, cv=3, scoring='roc_auc')
print('max_depth=', depths)
print(train_scores.mean(axis = 1))
print(test_scores.mean(axis = 1))

max_depth= [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[0.9556166  0.96918854 0.98051369 0.99123133 0.99765125 0.9995884
 0.99995711 1.         1.         1.        ]
[0.94072953 0.94530523 0.94503201 0.94783792 0.94701525 0.94778374
 0.94746255 0.95092504 0.94865626 0.94697768]


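The gap between train and validation scores is roughly constant (about 0.04 to 0.05) for max_depth from 4 to 10.
The validation ROC AUC is nearly flat over that range (0.947 to 0.951, with the maximum at max_depth=8), so we take the shallowest of these settings, max_depth=4 (0.9478).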
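
To make the comparison explicit, the mean scores printed above can be turned into a per-depth gap. The sketch below copies the two printed arrays by hand; the variable names are introduced here for illustration.

```python
import numpy as np

depths = np.arange(1, 11)

# Mean scores copied from the validation_curve output above.
train_auc = np.array([0.9556166, 0.96918854, 0.98051369, 0.99123133, 0.99765125,
                      0.9995884, 0.99995711, 1.0, 1.0, 1.0])
val_auc = np.array([0.94072953, 0.94530523, 0.94503201, 0.94783792, 0.94701525,
                    0.94778374, 0.94746255, 0.95092504, 0.94865626, 0.94697768])

gap = train_auc - val_auc  # how much better the forest does on data it was fit on

for d, v, g in zip(depths, val_auc, gap):
    print(f"max_depth={d:2d}  validation AUC={v:.4f}  gap={g:.4f}")

best_depth = int(depths[np.argmax(val_auc)])
print("highest validation AUC at max_depth =", best_depth)
```

The differences between depths 4 and 10 are small compared with what three folds of about 200 emails each can resolve, so preferring the shallower forest is a reasonable choice.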

In [11]:
param_grid = {'n_estimators': list(range(20, 100, 5)), 'min_weight_fraction_leaf': [0.001,  0.005, 0.01, 0.05, 0.1, 0.5] } 

estimator = ensemble.RandomForestClassifier(max_depth=4, random_state = 1)
optimizerRF = GridSearchCV(estimator, param_grid, scoring = 'roc_auc',cv = 3)                    
optimizerRF.fit(X_train, y_train)

print('score_train_opt', optimizerRF.best_score_)
print('param_opt', optimizerRF.best_params_)

score_train_opt 0.9513659591772741
param_opt {'min_weight_fraction_leaf': 0.001, 'n_estimators': 20}


### Extreme Gradient Boosting

In [12]:
param_grid = {'max_depth': list(range(1, 7)), 'learning_rate': [0.01, 0.05, 0.1, 0.5, 1, 1.5], 'n_estimators': list(range(10, 100, 5)) }
estimator = xgb.XGBClassifier( random_state = 1, min_child_weight=3)
optimizer = GridSearchCV(estimator, param_grid, scoring = 'roc_auc',cv = 3)                    
optimizer.fit(X_train, y_train)

print('score_train_opt', optimizer.best_score_)
print('param_opt', optimizer.best_params_)

score_train_opt 0.9482604671053828
param_opt {'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 60}
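
The grid above contains 6 x 6 x 18 = 648 combinations, each fitted on 3 folds. When that becomes too slow, a randomized search over the same grid is one option. The sketch below is an assumption-laden illustration: it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `xgb.XGBClassifier`, and `n_iter=8` is an arbitrary budget.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the real features are not used here.
X, y = make_classification(n_samples=200, n_features=20, random_state=1)

# Same value lists as the grid in the cell above.
param_distributions = {
    'max_depth': list(range(1, 7)),
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1, 1.5],
    'n_estimators': list(range(10, 100, 5)),
}

# Stand-in estimator (not XGBoost); only 8 random combinations are tried.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_distributions,
    n_iter=8,
    scoring='roc_auc',
    cv=3,
    random_state=1,
)
search.fit(X, y)

print('best cross-validated ROC AUC', search.best_score_)
print('best parameters', search.best_params_)
```

A randomized search can miss the best combination, so it is better suited to narrowing the ranges before a finer grid search.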


In [13]:
param_grid = {'n_estimators': list(range(10, 100, 5)), 'min_child_weight': list(range(1, 10)) }
estimator = xgb.XGBClassifier( max_depth = 3, random_state = 1, learning_rate=0.1)
optimizer = GridSearchCV(estimator, param_grid, scoring = 'roc_auc',cv = 3)                    
optimizer.fit(X_train, y_train)

print('score_train_opt', optimizer.best_score_)
print('param_opt', optimizer.best_params_) 

score_train_opt 0.9458848341935475
param_opt {'min_child_weight': 2, 'n_estimators': 45}


The RandomForestClassifier has the highest cross-validated ROC AUC (0.951), so we use it for the final predictions.


In [14]:
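# Write the answers, one prediction per line and without a header

ans=optimizerRF.predict(X_test)

# str(ans) would abbreviate a long array with "...", so np.savetxt is used instead
np.savetxt("/kaggle/working/answers.csv", ans, fmt="%d")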
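
The cell above submits hard 0/1 labels from `predict`. Because the task accepts continuous predictions and scores them by ROC AUC, submitting the predicted spam probability is worth considering, since it preserves the ranking between emails. The sketch below illustrates this on synthetic data with a small forest; the output path is a temporary file created for the example, not the Kaggle path.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 rows to fit on, 100 rows to predict.
X, y = make_classification(n_samples=300, n_features=20, random_state=1)
X_fit, y_fit, X_new = X[:200], y[:200], X[200:]

model = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=1)
model.fit(X_fit, y_fit)

# Column 1 of predict_proba holds the probability of class 1 (spam).
proba = model.predict_proba(X_new)[:, 1]

# Write one value per line, no header (temporary path used for this example).
out_path = os.path.join(tempfile.mkdtemp(), "answers.csv")
np.savetxt(out_path, proba, fmt="%.6f")

with open(out_path) as fh:
    n_lines = sum(1 for _ in fh)
print(n_lines, "predictions written")
```

A fitted `GridSearchCV` object exposes `predict_proba` as well when the underlying estimator supports it, so the same call should work on `optimizerRF`.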
