<a id='BacktoTop'></a>
# Mini-Project: SVM&LR Classification
## MSDS 7331: Data Mining
## Dr. Drew
## Group 3: Shanqing Gu, Manisha Pednekar, Dmitry Andreev and Jonathan Knowles

*Perform predictive analysis (classification) on a data set: model the dataset using the methods we have discussed in class, logistic regression and support vector machines, and draw conclusions from the analysis.*


## Table of Contents
1. [Model Creation](#ModelCreation)
2. [Model Advantages](#ModelAdvantages)
3. [Interpret Feature Importance](#InterpretFeatureImportance)
4. [Insights From Support Vectors](#InsightsFromSupportVectors)

<a id='ModelCreation'></a>
### 01. Model Creation
[Back to top](#BacktoTop)

#### Section Objective:
Create a logistic regression model and a support vector machine model for the classification task involved with your dataset. Assess how well each model performs (use 80/20 training/testing split for your data). Adjust parameters of the models to make them more accurate.

####  1.1 Split training and testing datasets

In [None]:
# Import general libraries which will be used for this Lab_01 project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Read csv file
df = pd.read_csv("OnlineNewsPopularity.csv") 

# Exclude url and timedelta columns, read from n_tokens_title
df = df.loc[:, ' n_tokens_title':]

dfCopy = df.copy() # dfCopy.info() # use df.tail() to read from the bottom

In [None]:
# Strip the empty spaces in variable names

df.columns = df.columns.str.replace(' ', '')

In [None]:
# Combine and make 'channel' with 6 data_channel variables

Lifestyle_df=df[df['data_channel_is_lifestyle']==1].copy()
Lifestyle_df['Channel']= 'Lifestyle' 

Entertainment_df=df[df['data_channel_is_entertainment']==1].copy()
Entertainment_df['Channel']= 'Entertainment'

Bus_df=df[df['data_channel_is_bus']==1].copy()
Bus_df['Channel']= 'Bus'

Socmed_df=df[df['data_channel_is_socmed']==1].copy()
Socmed_df['Channel']= 'Socmedia'

Tech_df=df[df['data_channel_is_tech']==1].copy()
Tech_df['Channel']=  'Tech'

World_df=df[df['data_channel_is_world']==1].copy()
World_df['Channel']='World'

df=pd.concat([Lifestyle_df,Entertainment_df,Bus_df,Socmed_df,Tech_df,World_df],axis=0)

sum(df['Channel'].value_counts()) # Check if the sample size is the same as original 33,510

In [None]:
# Combine and make 'Weekday' with 7 weekday variables

Monday_df=df[df['weekday_is_monday']==1].copy()
Monday_df['Weekday']= 'Monday'

Tuesday_df=df[df['weekday_is_tuesday']==1].copy()
Tuesday_df['Weekday']= 'Tuesday'

Wednesday_df=df[df['weekday_is_wednesday']==1].copy()
Wednesday_df['Weekday']='Wednesday'

Thursday_df=df[df['weekday_is_thursday']==1].copy()
Thursday_df['Weekday']='Thursday'

Friday_df=df[df['weekday_is_friday']==1].copy()
Friday_df['Weekday']='Friday'

Saturday_df=df[df['weekday_is_saturday']==1].copy()
Saturday_df['Weekday']='Saturday'

Sunday_df=df[df['weekday_is_sunday']==1].copy()
Sunday_df['Weekday']='Sunday'

df=pd.concat([Monday_df,Tuesday_df,Wednesday_df,Thursday_df,Friday_df,Saturday_df,Sunday_df],axis=0)

sum(df['Weekday'].value_counts()) # Check if the sample size is the same as original 33,510
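
The two cells above build `Channel` and `Weekday` by filtering one indicator column at a time and concatenating the pieces. As a more compact alternative, the sketch below collapses a group of 0/1 indicator columns into a single label column with `idxmax`. The helper name `collapse_indicators` and the toy frame are hypothetical and only illustrate the idea. Rows with no indicator set are dropped, which mirrors what the filter-and-concatenate approach does implicitly.

```python
import pandas as pd

def collapse_indicators(frame, prefix):
    """Collapse 0/1 indicator columns that share `prefix` into one label Series.

    Rows where exactly one indicator is set get that indicator's name
    (prefix removed, capitalized); all other rows get NaN.
    """
    cols = [c for c in frame.columns if c.startswith(prefix)]
    flags = frame[cols]
    # idxmax returns the name of the first column holding the row maximum
    labels = flags.idxmax(axis=1).str[len(prefix):].str.capitalize()
    # keep a label only when exactly one indicator is set in the row
    return labels.where(flags.sum(axis=1) == 1)

# toy data (hypothetical): the third row has no weekday flag set
toy = pd.DataFrame({
    'weekday_is_monday':  [1, 0, 0],
    'weekday_is_tuesday': [0, 1, 0],
    'shares':             [100, 200, 300],
})
toy['Weekday'] = collapse_indicators(toy, 'weekday_is_')
toy = toy.dropna(subset=['Weekday'])  # drop rows without a label
print(toy[['shares', 'Weekday']])
```

Unlike the concatenation approach, this keeps the original row order, so no re-sorting is needed afterwards.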

In [None]:
# Check column locations and prepare to drop
df.columns[[11, 12, 13, 14, 15, 16, 29, 30, 31,32, 33, 34, 35, 36 ]]

In [None]:
# Remove previous channel and weekly columns as mentioned above
df.drop(df.columns[[11, 12, 13, 14, 15, 16, 29, 30, 31,32, 33, 34, 35, 36 ]], axis=1, inplace=True)

In [None]:
df.describe().transpose()

In [None]:
# Outliers are handled by log transformation rather than by dropping rows, because the sample is large (about 33.5k rows)
# Find which variables need a log transformation

df_T = df.describe().T

df_T["log"] = (df_T["max"] > df_T["50%"] * 10) & (df_T["max"] > 1) # flag when max > 10 * median and max > 1
df_T["log+2"] = df_T["log"] & (df_T["min"] < 0) # need to add 2 before taking the log when min < 0

df_T["scale"] = "" # make new variable 'scale' in df_T

df_T.loc[df_T["log"],"scale"] = "log" # show 'log'
df_T.loc[df_T["log+2"],"scale"] = "log+2" # show 'log+2'

df_T[["mean", "min", "50%", "max", "scale"]] # show mean, min, 50%, max, scale

In [None]:
# Log transform 18 variables

df['log_n_tokens_content'] = np.log(df['n_tokens_content'] + 0.1) # Add 0.1 to prevent infinity, same as below
df['log_n_unique_tokens'] = np.log(df['n_unique_tokens'] + 0.1) 
df['log_n_non_stop_words'] = np.log(df['n_non_stop_words'] + 0.1)
df['log_n_non_stop_unique_tokens'] = np.log(df['n_non_stop_unique_tokens'] + 0.1)

df['log_num_hrefs'] = np.log(df['num_hrefs'] + 0.1)
df['log_num_self_hrefs'] = np.log(df['num_self_hrefs'] + 0.1)
df['log_num_imgs'] = np.log(df['num_imgs'] + 0.1)
df['log_num_videos'] = np.log(df['num_videos'] + 0.1)

df['log_kw_min_min'] = np.log(df['kw_min_min'] + 2) # Add 2 for 'log+2' variables (min < 0) to keep the argument positive, same as below
df['log_kw_max_min'] = np.log(df['kw_max_min'] + 0.1)
df['log_kw_avg_min'] = np.log(df['kw_avg_min'] + 2)

df['log_kw_min_max'] = np.log(df['kw_min_max'] + 0.1)

df['log_kw_max_avg'] = np.log(df['kw_max_avg'] + 0.1)
df['log_kw_avg_avg'] = np.log(df['kw_avg_avg'] + 0.1)

df['log_self_reference_min_shares'] = np.log(df['self_reference_min_shares'] + 0.1)
df['log_self_reference_max_shares'] = np.log(df['self_reference_max_shares'] + 0.1)
df['log_self_reference_avg_sharess'] = np.log(df['self_reference_avg_sharess'] + 0.1)

df['log_shares'] = np.log(df['shares'] + 0.1)
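
The transforms above all follow the same pattern: add a small offset, take the log, and later drop the untransformed column by position. A minimal sketch of the same idea as a loop is shown below. The helper name `add_log_columns`, the offsets dictionary, and the toy frame are assumptions for illustration and are not part of the original notebook. Dropping by column name avoids relying on positional indices that shift whenever columns are added.

```python
import numpy as np
import pandas as pd

def add_log_columns(frame, offsets):
    """Return a copy where each column in `offsets` is replaced by log(col + offset)."""
    out = frame.copy()
    for col, offset in offsets.items():
        out['log_' + col] = np.log(out[col] + offset)
    # drop the untransformed columns by name instead of by position
    return out.drop(columns=list(offsets))

# toy data (hypothetical): one non-negative column, one with a minimum of -1
toy = pd.DataFrame({'num_imgs': [0, 1, 10], 'kw_min_min': [-1, 4, 217]})
toy_log = add_log_columns(toy, {'num_imgs': 0.1, 'kw_min_min': 2})
print(toy_log)
```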

In [None]:
# find locations for corresponding untransformed columns

df.columns[[1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 18, 19, 20, 21, 22, 44]]

In [None]:
# Drop the above columns

df.drop(df.columns[[1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 18, 19, 20, 21, 22, 44]], axis=1, inplace=True)

In [None]:
# Cut log_shares into 2 groups (0, 1)

df_cut = df['log_shares_cut'] = pd.qcut(df['log_shares'], 2, labels = ('0', '1'))

In [None]:
# Drop "Channel", "Weekday", and "log_shares" 
df.drop(df.columns[[27, 28, 46]], axis=1, inplace=True) 

In [None]:
from sklearn.model_selection import ShuffleSplit

# separate the features (X) from the class label (y) as follows:
if 'log_shares_cut' in df:
    y = df['log_shares_cut'].values # get the labels we want
    del df['log_shares_cut'] # get rid of the class label
    X = df.values # use everything else to predict!

    ## X and y are now numpy matrices, by calling 'values' on the pandas data frames we
    #    have converted them into simple matrices to use with scikit learn
    
    
# to use the cross validation object in scikit learn, we need to grab an instance
#    of the object and set it up. This object will be able to split our data into 
#    training and testing splits
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
                         test_size  = 0.2)
                         
print(cv_object)

#### 1.2 Logistic Regression

In [None]:
# run logistic regression and vary some parameters
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt

# first we create a reusable logistic regression object
#   here we can setup the object with different learning parameters and constants
lr_clf = LogisticRegression(penalty='l2', C=1.0, class_weight=None) # get object

# now we can use the cv_object that we setup before to iterate through the 
#    different training and testing sets. Each time we will reuse the logistic regression 
#    object, but it gets trained on different data each time we use it.

iter_num=0
# the indices are the rows used for training and testing in each iteration
for train_indices, test_indices in cv_object.split(X,y): 
    # I will create new variables here so that it is more obvious what 
    # the code is doing (you can compact this syntax and avoid duplicating memory,
    # but it makes this code less readable)
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # train the reusable logistic regression model on the training data
    lr_clf.fit(X_train,y_train)  # train object
    y_hat = lr_clf.predict(X_test) # get test set predictions

    # now let's get the accuracy and confusion matrix for this iteration of training/testing
    acc = mt.accuracy_score(y_test,y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("confusion matrix\n",conf)
    iter_num+=1
    
# Also note that every time you run the above code
#   it randomly creates a new training and testing set, 
#   so accuracy will be different each time
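
As the comment above notes, the splits change on every run because `ShuffleSplit` was created without a seed. If reproducible results are wanted, one option is to pass `random_state`. The sketch below uses a toy array (an assumption, not the project data) to show that a seeded splitter returns the same 80/20 partitions on repeated calls.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.arange(40).reshape(20, 2)  # 20 toy rows, 2 features

# fixing random_state makes every call to split() return the same partitions
cv_fixed = ShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

first_run = [test for _, test in cv_fixed.split(X_demo)]
second_run = [test for _, test in cv_fixed.split(X_demo)]

for a, b in zip(first_run, second_run):
    # each test fold holds 20% of the rows and matches across runs
    print(len(a), np.array_equal(a, b))
```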

In [None]:
# this does the exact same thing as the above block of code, but with shorter syntax

for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
    lr_clf.fit(X[train_indices],y[train_indices])  # train object
    y_hat = lr_clf.predict(X[test_indices]) # get test set predictions

    # print the accuracy and confusion matrix 
    print("====Iteration",iter_num," ====")
    print("accuracy", mt.accuracy_score(y[test_indices],y_hat)) 
    print("confusion matrix\n",mt.confusion_matrix(y[test_indices],y_hat))

In [None]:
# and here is an even shorter way of getting the accuracies for each training and test set
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(lr_clf, X, y=y, cv=cv_object) # this also can help with parallelism
print(accuracies)


In [None]:
# Explore the effect of the cost parameter C interactively
from ipywidgets import widgets as wd

def lr_explor(cost):
    lr_clf = LogisticRegression(penalty='l2', C=cost, class_weight=None) # get object
    accuracies = cross_val_score(lr_clf,X,y=y,cv=cv_object) # this also can help with parallelism
    print(accuracies)

# interact_manual adds a button, so the model is only refit on request
wd.interact_manual(lr_explor,cost=(0.001,5.0,0.05))
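
The widget above varies `C` by hand. A non-interactive way to tune the same parameter is a grid search over a scaling-plus-model pipeline, sketched below. The synthetic data from `make_classification` and the particular grid of `C` values are assumptions for illustration; on the project data, `X` and `y` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the real feature matrix and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# scaling lives inside the pipeline, so it is refit on each training fold only
pipe = Pipeline([('scale', StandardScaler()),
                 ('logit_model', LogisticRegression(C=1.0))])

cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
c_grid = [0.001, 0.05, 1.0, 5.0]  # hypothetical candidate values

search = GridSearchCV(pipe,
                      param_grid={'logit_model__C': c_grid},
                      cv=cv,
                      scoring='accuracy')
search.fit(X_demo, y_demo)

print('best C:', search.best_params_['logit_model__C'])
print('mean CV accuracy:', search.best_score_)
```

The `logit_model__C` key follows the pipeline convention of step name, double underscore, parameter name.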

#### 1.3 Interpreting weights

In [None]:
# interpret the weights

# iterate over the coefficients
weights = lr_clf.coef_.T # take transpose to make a column vector
variable_names = df.columns
for coef, name in zip(weights,variable_names):
    print(name, 'has weight of', coef[0])
    
# does this look correct?

In [None]:
from sklearn.preprocessing import StandardScaler

# we want to normalize the features based upon the mean and standard deviation of each column. 
# However, we do not want to accidentally use the testing data to find out the mean and std (this would be snooping)
# to make things easier, let's start by just using whatever was last stored in the variables:
##    X_train , y_train , X_test, y_test (they were set in a for loop above)

# scale attributes by the training set
scl_obj = StandardScaler()
scl_obj.fit(X_train) # find scalings for each column that make this zero mean and unit std
# the line of code above only looks at training data to get mean and std and we can use it 
# to transform new feature data

X_train_scaled = scl_obj.transform(X_train) # apply to training
X_test_scaled = scl_obj.transform(X_test) # apply those means and std to the test set (without snooping at the test set values)

# train the model just as before
lr_clf = LogisticRegression(penalty='l2', C=0.05) # get object, the 'C' value is less (can you guess why??)
lr_clf.fit(X_train_scaled,y_train)  # train object

### Comparison of the sparsity (percentage of zero coefficients) of solutions when L1 and L2 penalty are used for 
### different values of C. We can see that large values of C give more freedom to the model. Conversely, 
### smaller values of C constrain the model more.
### http://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_l1_l2_sparsity.html

y_hat = lr_clf.predict(X_test_scaled) # get test set predictions

acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print('accuracy:', acc )
print(conf )

# sort the attributes by weight and print them out
zip_vars = zip(lr_clf.coef_[0],df.columns) # pair each scalar weight with its column name
zip_vars = sorted(zip_vars)
for coef, name in zip_vars:
    print(name, 'has weight of', coef) # now print them out

In [None]:
from sklearn.preprocessing import StandardScaler
# we want to normalize the features based upon the mean and standard deviation of each column. 
# However, we do not want to accidentally use the testing data to find out the mean and std (this would be snooping)

from sklearn.pipeline import Pipeline
# you can apply the StandardScaler function inside of the cross-validation loop 
#  but this requires the use of PipeLines in scikit. 
#  A pipeline can apply feature pre-processing and data fitting in one compact notation
#  Here is an example!

std_scl = StandardScaler()
lr_clf = LogisticRegression(penalty='l2', C=0.05) 

# create the pipeline
piped_object = Pipeline([('scale', std_scl),  # do this
                         ('logit_model', lr_clf)]) # and then do this

weights = []
# run the pipeline cross validated
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
    piped_object.fit(X[train_indices],y[train_indices])  # train object
    # it is a little odd getting trained objects from a  pipeline:
    weights.append(piped_object.named_steps['logit_model'].coef_[0])
    

weights = np.array(weights)

In [None]:
import plotly
plotly.offline.init_notebook_mode() # run at the start of every notebook

error_y=dict(
            type='data',
            array=np.std(weights,axis=0),
            visible=True
        )

graph1 = {'x': df.columns,
          'y': np.mean(weights,axis=0),
    'error_y':error_y,
       'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights, with error bars'}

plotly.offline.iplot(fig)

In [None]:
Xnew = df[['log_kw_max_avg', 'log_kw_avg_avg','LDA_01', 'LDA_03' ]].values

weights = []
# run the pipeline cross validated
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(Xnew,y)):
    piped_object.fit(Xnew[train_indices],y[train_indices])  # train object
    weights.append(piped_object.named_steps['logit_model'].coef_[0])
    
weights = np.array(weights)

error_y=dict(
            type='data',
            array=np.std(weights,axis=0),
            visible=True
        )

graph1 = {'x': ['log_kw_max_avg', 'log_kw_avg_avg','LDA_01', 'LDA_03' ],
          'y': np.mean(weights,axis=0),
    'error_y':error_y,
       'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights, with error bars'}

plotly.offline.iplot(fig)

#### 1.4 Support Vector Machines

In [None]:
# okay, so run through the cross validation loop and set the training and testing variable for one single iteration
for train_indices, test_indices in cv_object.split(X,y): 
    # I will create new variables here so that it is more obvious what 
    # the code is doing (you can compact this syntax and avoid duplicating memory,
    # but it makes this code less readable)
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
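# refit the scaler on this training split so that no test rows leak into the scaling statistics
scl_obj.fit(X_train)
X_train_scaled = scl_obj.transform(X_train) # apply to training
X_test_scaled = scl_obj.transform(X_test) # apply the training means and stds to the test set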

In [None]:
# lets investigate SVMs on the data and play with the parameters and kernels
from sklearn.svm import SVC

# train the model just as before
svm_clf = SVC(C=0.5, kernel='rbf', degree=3, gamma='auto') # get object
svm_clf.fit(X_train_scaled, y_train)  # train object

y_hat = svm_clf.predict(X_test_scaled) # get test set predictions

acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print('accuracy:', acc )
print(conf)

In [None]:
# look at the support vectors
print(svm_clf.support_vectors_.shape)
print(svm_clf.support_.shape)
print(svm_clf.n_support_ )


In [None]:
# Now let's do some different analysis with the SVM and look at the instances that were chosen as support vectors

# now let's look at the support vectors and see if they are indicative of anything
# grab the rows that were selected as support vectors (these are usually instances that are hard to classify)

# make a dataframe of the training data
df_tested_on = df.iloc[train_indices] # saved from above, the indices chosen for training
# now get the support vectors from the trained model (copy so a column can be added safely)
df_support = df_tested_on.iloc[svm_clf.support_,:].copy()

# support_ indexes rows of the training set, so the labels must come from y_train (not y)
df_support['log_shares_cut'] = y_train[svm_clf.support_] # add back in the 'log_shares_cut' column to the pandas dataframe
df['log_shares_cut'] = y # also add it back in for the original data
df_support.info()

In [None]:
# now let's see the statistics of these attributes
from pandas.plotting import boxplot # the plotting helpers moved from pandas.tools.plotting to pandas.plotting

# group the original data and the support vectors
df_grouped_support = df_support.groupby(['log_shares_cut'])
df_grouped = df.groupby(['log_shares_cut'])

# plot KDE of different variables
vars_to_plot = ['log_n_tokens_content', 'log_n_unique_tokens',
                'log_n_non_stop_words', 'log_n_non_stop_unique_tokens']

# groups are plotted in label order: '0' (lower half of log_shares), then '1' (upper half)
class_labels = ['0 (below median shares)', '1 (above median shares)']

for v in vars_to_plot:
    plt.figure(figsize=(16,6))
    # plot support vector stats
    plt.subplot(1,2,1)
    ax = df_grouped_support[v].plot.kde() 
    plt.legend(class_labels)
    plt.title(v+' (Instances chosen as Support Vectors)')
    
    # plot original distributions
    plt.subplot(1,2,2)
    ax = df_grouped[v].plot.kde() 
    plt.legend(class_labels)
    plt.title(v+' (Original)')

<a id='ModelAdvantages'></a>
### 02. Model Advantages
[Back to top](#BacktoTop)

#### Section Objective:
Discuss the advantages of each model for each classification task. Does one type
of model offer superior performance over another in terms of prediction accuracy? In terms of
training time or efficiency? Explain in detail.
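
One way to ground this comparison is to time the fit of each model on the same split and record test accuracy alongside it. The sketch below does this on synthetic data from `make_classification`, which is an assumption standing in for the project data. The hyperparameters copy the ones used earlier in this notebook. Timings depend on the machine and the sample size, so they only indicate relative cost.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the real data, split 80/20
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

candidates = {
    'logistic regression': LogisticRegression(C=0.05),
    'rbf SVM': SVC(C=0.5, kernel='rbf', gamma='auto'),
}

results = {}
for name, model in candidates.items():
    pipe = Pipeline([('scale', StandardScaler()), ('model', model)])
    start = time.perf_counter()
    pipe.fit(X_tr, y_tr)                      # time the training step only
    fit_seconds = time.perf_counter() - start
    results[name] = {'fit_seconds': fit_seconds,
                     'accuracy': pipe.score(X_te, y_te)}
    print(name, results[name])
```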

<a id='InterpretFeatureImportance'></a>
### 03. Interpret Feature Importance
[Back to top](#BacktoTop)

#### Section Objective:
Use the weights from logistic regression to interpret the importance of different features for the classification task. Explain your interpretation in detail. Why do you think some variables are more important?
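
Because the features are standardized before fitting, the magnitude of each coefficient can be compared across features. A minimal sketch of ranking features by absolute weight follows. The synthetic data and the generic names `feature_0` to `feature_5` are assumptions for illustration; on the project data the index would be `df.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in: 6 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=0)
feature_names = ['feature_%d' % i for i in range(X_demo.shape[1])]

pipe = Pipeline([('scale', StandardScaler()),
                 ('logit_model', LogisticRegression(C=0.05))])
pipe.fit(X_demo, y_demo)

# one weight per feature for a binary problem
coefs = pd.Series(pipe.named_steps['logit_model'].coef_[0], index=feature_names)

# order by absolute magnitude but keep the sign for interpretation
ranking = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranking)
```

A positive weight pushes the prediction toward class `1` and a negative weight toward class `0`, so the sign is worth keeping when reading the ranking.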

<a id='InsightsFromSupportVectors'></a>
### 04. Insights From Support Vectors
[Back to top](#BacktoTop)

#### Section Objective:
Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain.
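
A simple starting point is to count how many training rows end up as support vectors and how they divide between the classes. The sketch below uses synthetic data as a stand-in, which is an assumption. Note that `support_` holds row positions within the data passed to `fit`, so labels must be looked up in the training labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

svm = SVC(C=0.5, kernel='rbf', gamma='auto').fit(X_scaled, y_demo)

# support_ indexes rows of the array given to fit()
sv_labels = y_demo[svm.support_]
sv_fraction = len(svm.support_) / len(y_demo)

# n_support_ lists the support vector count per class, in the order of classes_
per_class = {int(c): int(n) for c, n in zip(svm.classes_, svm.n_support_)}

print('fraction of rows used as support vectors:', sv_fraction)
print('support vectors per class:', per_class)
```

A high fraction suggests that many instances sit close to the decision boundary, which would be consistent with classes that overlap heavily in feature space.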
