Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter. 

-  Number of Instances: 4601 (1813 Spam = 39.4%)
-  Number of Attributes: 58 (57 continuous, 1 nominal class label)

 -  Attribute Information:

    -  The last column of 'spambase.data' denotes whether the e-mail was 
    considered spam (1) or not (0)
    
    - 48 attributes are continuous real [0,100] numbers of type `word freq WORD` i.e. percentage of words in the e-mail that match WORD

    - 6 attributes are continuous real [0,100] numbers of type `char freq CHAR` i.e. percentage of characters in the e-mail that match CHAR


    - 1 attribute is continuous real [1,...] numbers of type `capital run length average` i.e.
average length of uninterrupted sequences of capital letters

    - 1 attribute is continuous integer [1,...] numbers of type
`capital run length longest` i.e. length of longest uninterrupted sequence of capital letters

    - 1 attribute is continuous integer [1,...] numbers of type `capital run length total` i.e.
sum of length of uninterrupted sequences of capital letters in the email

    - 1 attribute is nominal {0,1} class  of type spam i.e  denotes whether the e-mail was considered spam (1) or not (0),  

- Missing Attribute Values: None

- Class Distribution:
	Spam	  1813  (39.4%)
	Non-Spam  2788  (60.6%)




In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report,confusion_matrix
import warnings
warnings.filterwarnings("ignore")


### Load the data  stored in `path` using `.read_csv()` api.

In [None]:
df = pd.read_csv('email_data.csv',header=None)
df.head()


### Get an overview of your data by using info() and describe() functions of pandas.



In [None]:
df.info()
df.describe()

### Split the data into train and test set and fit the base logistic regression model on train set.

In [6]:
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
lr = LogisticRegression(random_state=101)
lr.fit(X_train,y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=101, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

### Find out the accuracy , print out the Classification report and Confusion Matrix.

In [7]:
print("Accuracy on test data:", lr.score(X_test,y_test))
y_pred = lr.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))
print("=="*20)
print("Classification Report: \n",classification_report(y_test,y_pred))


Accuracy on test data: 0.9319333816075308
Confusion Matrix: 
 [[770  34]
 [ 60 517]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.96      0.94       804
           1       0.94      0.90      0.92       577

   micro avg       0.93      0.93      0.93      1381
   macro avg       0.93      0.93      0.93      1381
weighted avg       0.93      0.93      0.93      1381



###  Copy dataset df into df1 variable and apply correlation on df1

### As we have learned  one of the assumptions of Logistic Regression model is that the independent features should not be correlated to each other(i.e Multicollinearity), So we have to find the features that have a correlation higher that 0.75 and remove the same so that the assumption for logistic regression model is satisfied. 

In [8]:
df1 = df.copy()

# Remove Correlated features above 0.75 and then apply logistic model
corr_matrix = df1.drop(57,1).corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.75)]
print("Columns to be dropped: ")
print(to_drop)
df1.drop(to_drop,axis=1,inplace=True)


Columns to be dropped: 
[33, 39]


### Split the  new subset of the  data acquired by feature selection into train and test set and fit the logistic regression model on train set.

In [9]:
X = df1.iloc[:,:-1]
y = df1.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state = 42)
lr = LogisticRegression(random_state=101)
lr.fit(X_train,y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=101, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

###  Find out the accuracy , print out the Classification report and Confusion Matrix.

In [10]:
print("Accuracy on test data:", lr.score(X_test,y_test))
y_pred = lr.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))
print("=="*20)
print("Classification Report: \n",classification_report(y_test,y_pred))


Accuracy on test data: 0.9304851556842868
Confusion Matrix: 
 [[768  36]
 [ 60 517]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.96      0.94       804
           1       0.93      0.90      0.92       577

   micro avg       0.93      0.93      0.93      1381
   macro avg       0.93      0.93      0.93      1381
weighted avg       0.93      0.93      0.93      1381



###  After keeping highly correlated features, there is not much change in the score. Lets apply another feature selection technique(Chi Squared test) to see whether we can increase our score. Find the optimum number of features using Chi Square and fit the logistic model on train data.



In [11]:
nof_list=[20,25,30,35,40,50,55]
high_score=0
nof=0

for n in nof_list:
    test = SelectKBest(score_func=chi2 , k= n )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state = 42)
    X_train = test.fit_transform(X_train,y_train)
    X_test = test.transform(X_test)
    
    model = LogisticRegression(random_state=101)
    model.fit(X_train,y_train)
    print("For no of features=",n,", score=", model.score(X_test,y_test))
    if model.score(X_test,y_test)>high_score:
        high_score=model.score(X_test,y_test)
        nof=n 
print("High Score is:",high_score, "with features=",nof)


For no of features= 20 , score= 0.9036929761042722
For no of features= 25 , score= 0.9152787834902245
For no of features= 30 , score= 0.9131064446053584
For no of features= 35 , score= 0.9196234612599565
For no of features= 40 , score= 0.9254163649529327
For no of features= 50 , score= 0.9283128167994207
For no of features= 55 , score= 0.9304851556842868
High Score is: 0.9304851556842868 with features= 55


###  Find out the accuracy , print out the Confusion Matrix.



In [12]:
y_pred = lr.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))


Confusion Matrix: 
 [[768  36]
 [ 60 517]]


### Using chi squared test there is no change in the score and the optimum features that we got is 55. Now lets see if we can increase our score using another feature selection technique called Anova.Find the optimum number of features using Anova and fit the logistic model on train data.

### Find out the accuracy , print out the Confusion Matrix.



In [15]:
nof_list=[20,25,30,35,40,50,55]
high_score=0
nof=0

for n in nof_list:
    test = SelectKBest(score_func=f_classif , k= n )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
    X_train = test.fit_transform(X_train,y_train)
    X_test = test.transform(X_test)
    model = LogisticRegression()
    model.fit(X_train,y_train)
    print("For no of features=",n,", score=", model.score(X_test,y_test))

    if model.score(X_test,y_test)>high_score:
        high_score=model.score(X_test,y_test)
        nof=n 
print("High Score is:",high_score, "with features=",nof)

# Calculate accuracy , print out the Confusion Matrix 
y_pred = lr.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))



For no of features= 20 , score= 0.889210716871832
For no of features= 25 , score= 0.9029688631426502
For no of features= 30 , score= 0.9145546705286025
For no of features= 35 , score= 0.9203475742215785
For no of features= 40 , score= 0.9225199131064447
For no of features= 50 , score= 0.9261404779145547
For no of features= 55 , score= 0.9304851556842868
High Score is: 0.9304851556842868 with features= 55
Confusion Matrix: 
 [[768  36]
 [ 60 517]]


###  Unfortunately Anova also couldn't give us a better score . Let's finally attempt PCA on train data and find if it helps in  giving a better model by reducing the features.

### Find out the accuracy , print out the Confusion Matrix.   



In [14]:
# Apply PCA and fit the logistic model on train data use df dataset
nof_list=[20,25,30,35,40,50,55]
high_score=0
nof=0

for n in nof_list:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state = 42)
    pca = PCA(n_components=n)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)
    logistic = LogisticRegression(solver = 'lbfgs')
    logistic.fit(X_train, y_train)
    print("For no of features=",n,", score=", logistic.score(X_test,y_test))
    
    if logistic.score(X_test,y_test)>high_score:
        high_score=logistic.score(X_test,y_test)
        nof=n 
print("High Score is:",high_score, "with features=",nof)

# Calculate accuracy , print out the Confusion Matrix 
y_pred = lr.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))



For no of features= 20 , score= 0.9000724112961622
For no of features= 25 , score= 0.9000724112961622
For no of features= 30 , score= 0.9044170890658942
For no of features= 35 , score= 0.9196234612599565
For no of features= 40 , score= 0.9196234612599565
For no of features= 50 , score= 0.9232440260680667
For no of features= 55 , score= 0.9167270094134685
High Score is: 0.9232440260680667 with features= 50
Confusion Matrix: 
 [[124 680]
 [239 338]]


###  You can also compare your predicted values and observed values by printing out values of logistic.predict(X_test[ ]) and  y_test[ ].values

In [16]:
# Compare observed value and Predicted value
print("Prediction for 10 observation:    ",logistic.predict(X_test[0:10]))
print("Actual values for 10 observation: ",y_test[0:10].values)


Prediction for 10 observation:     [0 0 0 0 0 0 0 0 0 0]
Actual values for 10 observation:  [0 0 0 1 0 1 0 0 0 0]
