### Explanation of class

I've created a single class named "predictive_models" that groups the methods listed below.
It is flexible: the user loads the data once, the class stores the training and testing datasets in internal variables, and they do not need to be passed again on every call.

Here is the list of methods with explanations:

<ol>
    <li><b>load_data:</b> Asks the user for the training and testing file paths and stores the loaded data internally</li>
    <li><b>transform_tfidf:</b> Normalizes the training and testing datasets using sklearn's TfidfTransformer</li>
    <li><b>convert_to_binary:</b> Converts ratings into binary labels (1 if the rating is greater than 5, otherwise 0)</li>
    <li><b>logistic_regression:</b> Performs Logistic Regression, with an option for cross validation</li>
    <li><b>linear_svc:</b> Performs Linear SVC, with an option for cross validation</li>
    <li><b>decision_tree:</b> Performs Decision Tree, with an option for cross validation</li>
    <li><b>random_forest:</b> Performs Random Forest, with an option for cross validation</li>
    <li><b>prediction_save_csv:</b> Fits the given classifier object, predicts the testing dataset, and saves the predictions to a CSV file</li>
</ol>

## Please note that when you call the load_data() function, you need to answer its input prompt before running any other cell. A Jupyter kernel runs one cell at a time, so it will wait for the input and any other cell you try to run will be stuck behind it. If that happens, you will need to shut down the notebook before running it again.
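One way to sidestep the blocking prompt is to pass the paths as arguments, as the commented-out `load_data` signature in the class hints. The sketch below is only an illustration: `validate_feat_path` is a hypothetical helper, not part of the class, and it checks the file extension only, not whether the file exists.

```python
from pathlib import Path


def validate_feat_path(path_text):
    """Return a Path if path_text names a .feat file, otherwise None."""
    cleaned = path_text.strip()
    if not cleaned:
        return None
    candidate = Path(cleaned)
    # Only the extension is checked here; existence is left to the loader.
    if candidate.suffix != ".feat":
        return None
    return candidate


print(validate_feat_path("./aclImdb/train/labeledBow.feat"))
print(validate_feat_path("notes.txt"))
```

Because nothing here calls `input()`, a cell using this kind of check returns immediately and cannot leave the kernel waiting.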

In [24]:
#Importing required libraries and modules
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_svmlight_file
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn import preprocessing
import numpy as np
import pandas as pd
import seaborn as sns
from mlxtend.plotting import plot_decision_regions
from time import time 
import re


class predictive_models:
    
    def __init__(self):
        self.training_data = ""
        self.testing_data = ""
        self.training_target = ""
        self.testing_target = ""
        
    #def load_data(self, training_data_path, testing_data_path):
    def load_data(self):
        training_data_path = input("Path for training dataset: ")

        # Accept only non-empty paths that end with the .feat extension
        if not training_data_path.strip().endswith(".feat"):
            print("Please enter valid path for training dataset")
        else:
            testing_data_path = input("Path for testing dataset: ")

            # Accept only non-empty paths that end with the .feat extension
            if not testing_data_path.strip().endswith(".feat"):
                print("Please enter valid path for testing dataset")
            else:
                self.training_data, self.training_target = load_svmlight_file(training_data_path)

                # The number of features must be the same in the training and testing datasets.
                # Here the training dataset has 89527 features whereas the testing dataset has only 89523,
                # so n_features is taken from the training matrix when loading the testing dataset.
                self.testing_data, self.testing_target = load_svmlight_file(testing_data_path, n_features=self.training_data.shape[1])
            
        
    #
    # Preprocessing and normalization
    #

    def transform_tfidf(self):
        """
        Transform the stored training and testing data into TF-IDF for normalization.

        Takes no arguments and returns nothing: self.training_data and
        self.testing_data are replaced by their TF-IDF versions.
        """
        # The datasets are empty strings until load_data() has replaced them with matrices
        if isinstance(self.training_data, str) or isinstance(self.testing_data, str):
            print("Please call load_data() method first and give path for training and testing dataset")
            return
            
        tfidf_transform = TfidfTransformer()

        # fit_transform is used on the training dataset so that TF is computed for every single review
        # and the IDF weights are learned once, from the training dataset only.
        # The testing dataset is only transformed: TF comes from each testing review, but
        # the IDF weights are the ones learned from the TRAINING DATASET.

        self.training_data = tfidf_transform.fit_transform(self.training_data)
        self.testing_data = tfidf_transform.transform(self.testing_data) # Only transform as it will use IDF of training dataset
        
    #
    # Convert target into binary: 1 if > 5, else 0 (preprocessing)
    #
    def convert_to_binary(self, target_data):
        if isinstance(self.training_data, str) or isinstance(self.testing_data, str):
            print("Please call load_data() method first and give path for training and testing dataset")
            return
        
        target = []
        for i in range(len(target_data)):
            if target_data[i] > 5:
                target.append(1) # Positive review
            else:
                target.append(0) # Negative review
        return target

    def convert_target_labels(self):
        if isinstance(self.training_data, str) or isinstance(self.testing_data, str):
            print("Please call load_data() method first and give path for training and testing dataset")
            return
        
        # Note: the same target variables are overwritten and now hold binary values
        self.training_target = self.convert_to_binary(self.training_target)
        self.testing_target = self.convert_to_binary(self.testing_target)
        
    def logistic_regression(self, cros_valid = False):
        if isinstance(self.training_data, str) or isinstance(self.testing_data, str):
            print("Please call load_data() method first and give path for training and testing dataset")
            return
        
        start = time()
        log_reg = LogisticRegression(max_iter = 1000) # Initializing

        if cros_valid == False:
            print("Logistic Regression without cross validation")
            
            log_reg.fit(self.training_data, self.training_target) # Training

            # Check with testing dataset
            print("Accuracy with testing dataset is",log_reg.score(self.testing_data, self.testing_target) * 100,"%")

        else:
            print("\n Logistic Regression with 10-fold")
            # Cross validation
            log_reg_score = cross_val_score(log_reg, self.training_data, self.training_target, cv = 10)
            print(log_reg_score)
            print("Mean",log_reg_score.mean()," ","Min:",log_reg_score.min()," ","Max: ",log_reg_score.max())


        end = time()
        print("Logistic regression ran in ",str(round((end-start), 2)),"seconds")
        return
    
    def linear_svc(self, cros_valid = False):
        if isinstance(self.training_data, str) or isinstance(self.testing_data, str):
            print("Please call load_data() method first and give path for training and testing dataset")
            return
        
        start = time()
        if cros_valid == False:
            print("Linear SVC")
            l_svc = LinearSVC() # Initializing
            l_svc.fit(self.training_data, self.training_target) # Training

            # Check with testing dataset
            print("Accuracy with testing dataset is",l_svc.score(self.testing_data, self.testing_target) * 100,"%")
        else:
            print("\nFinding Best C for Linear SVC with 3-fold")

            # Only 8 values of C (2^-2 to 2^5) are tried here, as this takes a lot of memory and time to run
            # This demonstrates how to search for the best C value
            c_list = 2**np.array(range(-2, 6), dtype='float')
            cv_scores = []
            for c in c_list:
                l_svc= LinearSVC(C=c, dual=False)
                score = cross_val_score(l_svc, self.training_data, self.training_target, cv=3)
                cv_scores.append(score.mean()*100)
            # Pick the best mean score once, after all C values have been evaluated
            bestscore, bestC = max([(val, c_list[idx]) for (idx, val) in enumerate(cv_scores)])
            print('Best CV accuracy =', round(bestscore,2), '% achieved at C =', bestC)

            print("\nUsing C =", bestC, "to train on the data again and test on the testing data")

            # Retrain on the whole training set using the best C value obtained from cross validation
            l_svc = LinearSVC(C=bestC)
            l_svc.fit(self.training_data, self.training_target)
            accu = l_svc.score(self.testing_data, self.testing_target)*100
            print('Test accuracy =', accu, 'achieved at C =', bestC)

        end = time()
        print("Linear SVC ran in ",str(round((end-start), 2)),"seconds")
        return
    
    def decision_tree(self, cros_valid = False):
        if isinstance(self.training_data, str) or isinstance(self.testing_data, str):
            print("Please call load_data() method first and give path for training and testing dataset")
            return
        
        start = time()
        if cros_valid == False:
            print('Training decision tree with depth 1')
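            # 'sqrt' is what 'auto' meant for this classifier; newer scikit-learn releases no longer accept 'auto'
            dec_tree = DecisionTreeClassifier(max_features='sqrt', max_depth=1)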
            dec_tree.fit(self.training_data, self.training_target)
            print('Accuracy = %.2f%%'  % (dec_tree.score(self.testing_data, self.testing_target)*100))
        else:
            print("\nFinding max depth for decision tree")
            parameters = {'max_depth':range(3,20)}
            clf = GridSearchCV(DecisionTreeClassifier(), parameters, n_jobs=4)
            clf.fit(self.training_data, self.training_target)
            tree_model = clf.best_estimator_
            print('Accuracy = %.2f%%' % (clf.best_score_ * 100)," with ",clf.best_params_)

            print("\nUsing ",clf.best_params_," to train the data again and test on testing data")
            dec_tree = DecisionTreeClassifier(max_depth = clf.best_params_['max_depth'])
            dec_tree.fit(self.training_data, self.training_target)
            print('Accuracy = %.2f%%'  % (dec_tree.score(self.testing_data, self.testing_target)*100))


        end = time()
        print("Decision Tree ran in ",str(round((end-start), 2)),"seconds")
        return
    
    def random_forest(self, cros_valid = False):
        if isinstance(self.training_data, str) or isinstance(self.testing_data, str):
            print("Please call load_data() method first and give path for training and testing dataset")
            return
        
        start = time()

        if cros_valid == False:
            print("Training Random Forest with 100 trees with depth 16")
            clf_forest = RandomForestClassifier(n_estimators = 100, min_samples_leaf=5, max_depth=16)
            clf_forest.fit(self.training_data, self.training_target)
            print('Accuracy = %.2f%%' % (clf_forest.score(self.testing_data, self.testing_target)*100))
        else:
            print("\nFinding best parameters for Random Forest")
            param_grid = { 
                'n_estimators': [100, 300, 500, 700, 1200],
                'max_depth': [5, 8, 15, 25, 30],
                'min_samples_split' : [2, 5, 10, 15, 100],
                'min_samples_leaf': [1, 2, 5, 10]
            }

            cv_rfc = GridSearchCV(estimator= RandomForestClassifier(), param_grid=param_grid, cv= 3)
            cv_rfc.fit(self.training_data, self.training_target)
            print('Accuracy = %.2f%%' % (cv_rfc.best_score_ * 100)," with ",cv_rfc.best_params_)
            
            print("\nUsing ",cv_rfc.best_params_," to train the data again and test on testing data")
            clf_forest = RandomForestClassifier(n_estimators = cv_rfc.best_params_['n_estimators'], min_samples_leaf=cv_rfc.best_params_['min_samples_leaf'], max_depth=cv_rfc.best_params_['max_depth'],min_samples_split=cv_rfc.best_params_['min_samples_split'])
            clf_forest.fit(self.training_data, self.training_target)
            print('Accuracy = %.2f%%' % (clf_forest.score(self.testing_data, self.testing_target)*100))

        end = time()
        print("Random Forest ran in ",str(round((end-start), 2)),"seconds")
        return
    
    # Fit the given classifier, predict the testing dataset, and save the predictions to a CSV file
    def prediction_save_csv(self, clf_object,file_name):
        if isinstance(self.training_data, str) or isinstance(self.testing_data, str):
            print("Please call load_data() method first and give path for training and testing dataset")
            return
        
        clf_object.fit(self.training_data, self.training_target)
        pred_data = clf_object.predict(self.testing_data)
        print((clf_object.score(self.testing_data, self.testing_target)*100))
        prediction = []
        for i in range(len(pred_data)):
            if pred_data[i] == 0:
                prediction.append(str(i) + ", negative") 
            else:
                prediction.append(str(i) + ", positive")
                
        print("Creating and writing CSV")
        # The with block closes the file even if writing fails
        with open(file_name, 'w') as csv_file:
            csv_file.write("\n".join(prediction))
        
        print("File creation completed")
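The label conversion in `convert_to_binary` can also be expressed as a single vectorized comparison. A minimal sketch, assuming the ratings are held in a NumPy array (the values below are made up for illustration):

```python
import numpy as np

# Made-up ratings on the same scale as the review targets
ratings = np.array([1.0, 4.0, 7.0, 10.0, 5.0])

# True where the rating is above 5, then cast to 0/1 integers
labels = (ratings > 5).astype(int)

print(labels.tolist())  # [0, 0, 1, 1, 0]
```

The threshold matches the class: a rating of exactly 5 is treated as negative.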

In [25]:
# Loading and preparing the data
pred_models = predictive_models()

#training_data_path = "./aclImdb/train/labeledBow.feat"
#testing_data_path = "./aclImdb/test/labeledBow.feat"

# Loading the data
#pred_models.load_data(training_data_path,testing_data_path)
pred_models.load_data() # Asks the user for the absolute path of the training and testing datasets and validates the file type

Path for training dataset
Please enter valid path for training dataset
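The feature-count mismatch that `load_data` works around can be reproduced on a small scale. The sketch below writes two tiny svmlight files to a temporary directory and loads the second one with the width of the first. The file names and contents are made up for illustration only.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file

# Made-up data: the training file uses feature 4, the testing file never does.
train_text = "1 1:1 4:2\n0 2:3\n"
test_text = "1 1:2\n0 3:1\n"

with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.feat")
    test_path = os.path.join(tmp, "test.feat")
    with open(train_path, "w") as handle:
        handle.write(train_text)
    with open(test_path, "w") as handle:
        handle.write(test_text)

    X_train, y_train = load_svmlight_file(train_path)
    # Reuse the training width so both matrices have the same number of columns.
    X_test, y_test = load_svmlight_file(test_path, n_features=X_train.shape[1])

print(X_train.shape, X_test.shape)
```

Without `n_features`, the testing matrix would be inferred from its own largest feature index and could end up narrower than the training matrix, which a fitted model would then reject.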


In [26]:
# Preprocessing and normalisation
pred_models.transform_tfidf()

# Convert target data to binary based on >5 value
pred_models.convert_target_labels()

Please call load_data() method first and give path for training and testing dataset
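The fit-on-training, transform-on-testing pattern used by `transform_tfidf` can be seen on a toy count matrix. This is a sketch with made-up counts, not the review data:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up term counts: 3 training documents and 1 testing document, 3 terms
train_counts = csr_matrix(np.array([[3, 0, 1], [2, 0, 0], [0, 1, 1]]))
test_counts = csr_matrix(np.array([[1, 1, 0]]))

tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_counts)  # learns IDF from the training rows
test_tfidf = tfidf.transform(test_counts)        # reuses the IDF learned above

print(tfidf.idf_)
print(test_tfidf.toarray())
```

Calling `fit_transform` on the testing data instead would compute a second set of IDF weights, so the two matrices would no longer be on the same scale.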


In [15]:
# Logistic Regression
pred_models.logistic_regression()

#10-fold
pred_models.logistic_regression(cros_valid= True)

Accuracy with testing dataset is 88.31599999999999 %
Logistic regression ran in  1.63 seconds
Cross validation with 10-fold
[0.8676 0.8556 0.8692 0.8716 0.8492 0.8796 0.8756 0.8724 0.8792 0.8616]
Mean 0.8681599999999999   Min: 0.8492   Max:  0.8796
Logistic regression ran in  14.03 seconds
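For reference, the 10-fold step can be tried in isolation on synthetic data. The dataset below comes from `make_classification` and is only a stand-in for the review features, so its scores say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

model = LogisticRegression(max_iter=1000)
# cv=10 splits the data into 10 folds and returns one accuracy per fold
scores = cross_val_score(model, X, y, cv=10)

print("Mean:", scores.mean(), "Min:", scores.min(), "Max:", scores.max())
```

Note that cross validation here uses only the training data; the testing dataset plays no part in the fold scores.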


In [42]:
# Linear SVC
pred_models.linear_svc()

# Finding best C with 3-fold
pred_models.linear_svc(cros_valid= True)

Linear SVC
Accuracy with testing dataset is 87.896 %
Linear SVC ran in  0.47 seconds

Finding Best C for Linear SVC with 3-fold
Best CV accuracy = 86.32 % achieved at C = 0.25

Checking with test dataset with C = 0.25
Test accuracy = 88.628 achieved at C = 0.25
Linear SVC ran in  29.0 seconds
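The search over C can be sketched on synthetic data as follows. It uses the same grid of powers of two and 3-fold cross validation, but picks the winner with `np.argmax`; the dataset is a made-up stand-in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

c_values = 2.0 ** np.arange(-2, 6)  # 0.25, 0.5, 1, ..., 32

# Mean 3-fold accuracy for each candidate C
mean_scores = [
    cross_val_score(LinearSVC(C=c, dual=False), X, y, cv=3).mean()
    for c in c_values
]

best_index = int(np.argmax(mean_scores))
best_c = c_values[best_index]
print("Best C:", best_c, "mean CV accuracy:", mean_scores[best_index])
```

One design difference: `np.argmax` returns the first maximum, so ties go to the smallest C, whereas taking `max` over `(score, C)` tuples resolves ties in favour of the largest C.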


In [49]:
# Decision Tree
pred_models.decision_tree()

# Checking max depth
pred_models.decision_tree(cros_valid=True)

Training decision tree with depth 1
Accuracy = 50.44%
Decision Tree ran in  0.15 seconds

Finding max depth for decision tree
Accuracy = 72.60%  with  {'max_depth': 16}

Using  {'max_depth': 16}  to train the data again and test on testing data
Accuracy = 72.68%
Decision Tree ran in  171.82 seconds
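`GridSearchCV` refits the best configuration on the full training set by default (`refit=True`), so `best_estimator_` can be scored on held-out data directly instead of training a new tree. A sketch on synthetic data, with the split and dataset made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": range(3, 20)},
    cv=5,
)
search.fit(X_train, y_train)

# best_estimator_ is already fitted on all of X_train with the best depth
best_tree = search.best_estimator_
test_accuracy = best_tree.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```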


In [5]:
# Random Forest
pred_models.random_forest()

pred_models.random_forest(cros_valid=True)

Training Random Forest with 100 tress with depth 16
Accuracy = 82.83%
Random Forest ran in  3.35 seconds

Finding best parameters for Random Forest
Accuracy = 83.88% with {'max_depth': 15, 'min_samples_leaf': 5, 'min_samples_split': 2,'n_estimators': 700}

Using  {'max_depth': 15, 'min_samples_leaf': 5, 'min_samples_split': 2,'n_estimators': 700}  to train the data again and test on testing data
Accuracy = 83.56%
Random Forest ran in  4231.20 seconds
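The long run time of the grid search follows from the size of the grid. A quick count, using the same `param_grid` as `random_forest` and `cv=3`:

```python
from math import prod

param_grid = {
    "n_estimators": [100, 300, 500, 700, 1200],
    "max_depth": [5, 8, 15, 25, 30],
    "min_samples_split": [2, 5, 10, 15, 100],
    "min_samples_leaf": [1, 2, 5, 10],
}

# Every combination of values is tried once per fold
combinations = prod(len(values) for values in param_grid.values())
cv_folds = 3
total_fits = combinations * cv_folds

print(combinations, total_fits)  # 500 1500
```

That is 1500 forest fits, several of them with 1200 trees, plus one final refit of the best combination.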


In [3]:
# Store predicted data into CSV file using Random Forest
clf_random_forest = RandomForestClassifier(max_depth = 15, n_estimators = 700, min_samples_split = 2, min_samples_leaf = 5)

pred_models.prediction_save_csv(clf_random_forest, "prediction_random_forst.csv")

83.52000000000001
Creating and writing CSV
File creation completed


In [5]:
# Store predicted data into CSV file using Logistic Regression
clf_lreg = LogisticRegression(max_iter = 1000)

pred_models.prediction_save_csv(clf_lreg, "prediction_logistic_regression.csv")

88.31599999999999
Creating and writing CSV
File creation completed
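The CSV step could also be written with the standard `csv` module, which handles quoting and separators. This is a sketch only: `write_predictions` is a hypothetical helper, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def write_predictions(predictions, handle):
    """Write one 'row index, label' line per prediction to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    for row_index, label in enumerate(predictions):
        # 0 means a negative review; anything else is written as positive
        writer.writerow([row_index, "negative" if label == 0 else "positive"])


buffer = io.StringIO()
write_predictions([1, 0, 1], buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function can be called inside `with open(file_name, "w", newline="") as handle:`.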
