# Plagiarism Detection (Decision Tree Model)

In this project, I build a plagiarism detector that examines an answer text file and performs binary classification, labeling the file as either plagiarized or not depending on how similar it is to a provided Wikipedia source text.

This project was done as part of an ML Engineer certification program.

This project involves a few discrete steps:

* Clean and pre-process the data.
* Define features for comparing the similarity of an answer text and a source text, and extract similarity features.
* Select "good" features, by analyzing the correlations between different features.
* Create train/test subsamples.
* Train a decision tree classifier.

## Read in the Data

The cell below downloads the necessary project data and extracts the files into the folder `data/`.

This data is a slightly modified version of a dataset created by Paul Clough (Information Studies) and Mark Stevenson (Computer Science) at the University of Sheffield. You can read all about the data collection and corpus at [their university webpage](https://ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html).

> **Citation for data**: Clough, P. and Stevenson, M. Developing A Corpus of Plagiarised Short Answers, Language Resources and Evaluation: Special Issue on Plagiarism and Authorship Analysis, In Press.

In [1]:
!wget https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c4147f9_data/data.zip
!unzip data

--2021-06-20 15:13:52--  https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c4147f9_data/data.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.32.126
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.32.126|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 113826 (111K) [application/zip]
Saving to: ‘data.zip.5’


2021-06-20 15:13:52 (664 KB/s) - ‘data.zip.5’ saved [113826/113826]

Archive:  data.zip
replace data/.DS_Store? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [2]:
import pandas as pd
import numpy as np

This plagiarism dataset is made up of multiple text files; the characteristics of each file are summarized in a `.csv` file named `file_information.csv`, which we can read in using `pandas`.

In [3]:
csv_file = 'data/file_information.csv'
plagiarism_df = pd.read_csv(csv_file)

# print out the first few rows of data info
plagiarism_df.head()

Unnamed: 0,File,Task,Category
0,g0pA_taska.txt,a,non
1,g0pA_taskb.txt,b,cut
2,g0pA_taskc.txt,c,light
3,g0pA_taskd.txt,d,heavy
4,g0pA_taske.txt,e,non


## Types of Plagiarism

Each text file is associated with one **Task** (task A-E) and one **Category** of plagiarism, which you can see in the above DataFrame.

###  Tasks, A-E

Each text file contains an answer to one short question; these questions are labeled as tasks A-E. For example, Task A asks the question: "What is inheritance in object oriented programming?"

### Categories of plagiarism 

Each text file has an associated plagiarism label/category:

**1. Plagiarized categories: `cut`, `light`, and `heavy`.**
* These categories represent different levels of plagiarized answer texts. `cut` answers copy directly from a source text, `light` answers are based on the source text but include some light rephrasing, and `heavy` answers are based on the source text, but *heavily* rephrased (and will likely be the most challenging kind of plagiarism to detect).
     
**2. Non-plagiarized category: `non`.** 
* `non` indicates that an answer is not plagiarized; the Wikipedia source text is not used to create this answer.
    
**3. Special, source text category: `orig`.**
* This is a specific category for the original, Wikipedia source text. We will use these files only for comparison purposes.

---
## Pre-Process the Data

In the next few cells, we create a new DataFrame holding the information we need about all of the files in the `data/` directory. This prepares the data for feature extraction and for training a binary plagiarism classifier.

### Convert categorical to numerical data

The `Category` column in the data contains string (categorical) values; to prepare these for feature extraction, we convert them into numerical values. Additionally, our goal is to create a binary classifier, so we need a binary class label that indicates whether an answer text is plagiarized (1) or not (0). The function `numerical_dataframe` reads in a `file_information.csv` file by name and returns a *new* DataFrame with a numerical `Category` column and a new `Class` column that labels each answer as plagiarized or not.

The function returns a new DataFrame with the following properties:

* 4 columns: `File`, `Task`, `Category`, `Class`. The `File` and `Task` columns can remain unchanged from the original `.csv` file.
* Convert all `Category` labels to numerical labels according to the following rules (a higher value indicates a higher degree of plagiarism):
    * 0 = `non`
    * 1 = `heavy`
    * 2 = `light`
    * 3 = `cut`
    * -1 = `orig`, this is a special value that indicates an original file.
* For the new `Class` column:
    * Any answer text that is not plagiarized (`non`) should have the class label `0`. 
    * Any plagiarized answer texts should have the class label `1`. 
    * And any `orig` texts will have a special label `-1`. 

### Output example

After running the function, the result will be a DataFrame with rows that look like the following:
```
              File Task  Category  Class
0   g0pA_taska.txt    a         0      0
1   g0pA_taskb.txt    b         3      1
2   g0pA_taskc.txt    c         2      1
3   g0pA_taskd.txt    d         1      1
4   g0pA_taske.txt    e         0      0
...
99  orig_taske.txt    e        -1     -1
```

In [4]:
def numerical_dataframe(csv_file='data/file_information.csv'):
    '''Reads in a csv file which is assumed to have `File`, `Category` and `Task` columns.
       This function does two things: 
       1) converts `Category` column values to numerical values 
       2) Adds a new, numerical `Class` label column.
       The `Class` column will label plagiarized answers as 1 and non-plagiarized as 0.
       Source texts have a special label, -1.
       :param csv_file: Path to the file_information.csv file
       :return: A dataframe with numerical categories and a new `Class` label column'''
    
    df = pd.read_csv(csv_file)

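    # map each category string to its numerical code
    df = df.replace({'Category': {'non': 0, 'heavy': 1, 'light': 2, 'cut': 3, 'orig': -1}})
    
    # start from the numerical category...
    df['Class'] = df['Category']
    
    # ...then collapse every plagiarized level (1, 2, 3) to 1; 0 and -1 stay as they are
    df.loc[df['Class']>0, 'Class']=1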
    
    return df

In [5]:
# create new `transformed_df`
transformed_df = numerical_dataframe(csv_file ='data/file_information.csv')

# check that all categories of plagiarism have a class label = 1
transformed_df.head(10)

Unnamed: 0,File,Task,Category,Class
0,g0pA_taska.txt,a,0,0
1,g0pA_taskb.txt,b,3,1
2,g0pA_taskc.txt,c,2,1
3,g0pA_taskd.txt,d,1,1
4,g0pA_taske.txt,e,0,0
5,g0pB_taska.txt,a,0,0
6,g0pB_taskb.txt,b,0,0
7,g0pB_taskc.txt,c,3,1
8,g0pB_taskd.txt,d,2,1
9,g0pB_taske.txt,e,1,1


## Text Processing & Splitting Data

The provided helper code adds two additional columns to `transformed_df` from above:

1. A `Text` column; this holds all the lowercase text for a `File`, with extraneous punctuation removed.
2. A `Datatype` column; this is a string value `train`, `test`, or `orig` that labels a data point as part of our train or test set, or as an original source text.

The details of how these additional columns are created can be found in the `helpers.py` file in the project directory. 
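
The helper's implementation is not reproduced here. As a rough sketch of what this kind of cleaning can look like, assuming punctuation is simply replaced by spaces, the snippet below lowercases a string and blanks out every non-alphanumeric character. The function name `clean_text` and the regular expression are illustrative assumptions, not the code in `helpers.py`.

```python
import re

def clean_text(raw_text):
    """Lowercase the text and replace each non-alphanumeric character with a space.

    Hypothetical stand-in for the cleaning step; the real helper may differ.
    """
    lowered = raw_text.lower()
    # anything that is not a lowercase letter or a digit becomes a single space
    return re.sub(r'[^a-z0-9]', ' ', lowered)

sample = "This PageRank denotes a site's importance (e.g. relevance)."
print(clean_text(sample))
```

Replacing punctuation with spaces rather than deleting it would explain fragments such as `site s` and `e g` in the processed sample shown further down, though that is an inference from the output rather than a statement about the helper.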

Run the cells below to create `complete_df`.

In [6]:
"""
using a helper function from helpers.py file
"""
import helpers 

# create a text column 
text_df = helpers.create_text_column(transformed_df)
text_df.head()

Unnamed: 0,File,Task,Category,Class,Text
0,g0pA_taska.txt,a,0,0,inheritance is a basic concept of object orien...
1,g0pA_taskb.txt,b,3,1,pagerank is a link analysis algorithm used by ...
2,g0pA_taskc.txt,c,2,1,the vector space model also called term vector...
3,g0pA_taskd.txt,d,1,1,bayes theorem was names after rev thomas bayes...
4,g0pA_taske.txt,e,0,0,dynamic programming is an algorithm design tec...


In [7]:
# after running the cell above
# check out the processed text for a single file, by row index
row_idx = 1

print('Sample processed text:\n\n', text_df.iloc[row_idx]['Text'])

Sample processed text:

 pagerank is a link analysis algorithm used by the google internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents such as the world wide web with the purpose of measuring its relative importance within the set google assigns a numeric weighting from 0 10 for each webpage on the internet this pagerank denotes a site s importance in the eyes of google  the pagerank is derived from a theoretical probability value on a logarithmic scale like the richter scale the pagerank of a particular page is roughly based upon the quantity of inbound links as well as the pagerank of the pages providing the links the algorithm may be applied to any collection of entities with reciprocal quotations and references the numerical weight that it assigns to any given element e is also called the pagerank of e and denoted by pr e  it is known that other factors e g relevance of search words on the page and actual visits to the page rep

## Split data into training and test sets

The next cell will add a `Datatype` column to a given DataFrame to indicate if the record is: 
* `train` - Training data, for model training.
* `test` - Testing data, for model evaluation.
* `orig` - The task's original answer from Wikipedia.

### Stratified sampling

The given code uses a helper function which you can view in the `helpers.py` file in the main project directory. This implements [stratified random sampling](https://en.wikipedia.org/wiki/Stratified_sampling) to randomly split data by task & plagiarism amount. Stratified sampling ensures that we get training and test data that is fairly evenly distributed across task & plagiarism combinations. Approximately 26% of the data is held out for testing and 74% of the data is used for training.

The function **train_test_dataframe** takes in a DataFrame that it assumes has `Task` and `Category` columns, and returns a modified frame that indicates which `Datatype` (train, test, or orig) each file falls into. The sampling changes slightly depending on the *random_seed* that is passed in. Because the sample size is small, stratified random sampling gives more stable results for a binary plagiarism classifier; stability here means smaller *variance* in the classifier's accuracy across random seeds.
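
To make the idea concrete, here is a minimal sketch of stratified splitting by (task, category) pairs. It is not the code from `helpers.py`: the function name `stratified_split`, the toy file names, and the rounding rule for the per-stratum test count are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(records, test_fraction=0.26, random_seed=42):
    """Assign 'train' or 'test' to each file, sampling within each (task, category) stratum.

    records: list of (file, task, category) tuples (hypothetical toy format).
    Returns a dict mapping file name -> 'train' or 'test'.
    """
    rng = random.Random(random_seed)

    # group file names by stratum
    strata = defaultdict(list)
    for file_name, task, category in records:
        strata[(task, category)].append(file_name)

    datatype = {}
    for key in sorted(strata):
        files = strata[key]
        rng.shuffle(files)
        # hold out roughly test_fraction of every stratum
        n_test = round(len(files) * test_fraction)
        for position, file_name in enumerate(files):
            datatype[file_name] = 'test' if position < n_test else 'train'
    return datatype

# toy data: 2 tasks x 2 categories x 4 files per combination
records = [(f'g{i}c{category}_task{task}.txt', task, category)
           for task in 'ab' for category in (0, 3) for i in range(4)]

assignment = stratified_split(records)
print(Counter(assignment.values()))
```

Because the held-out count is computed per stratum, every task and category combination contributes to both sets, which is the property the paragraph above relies on.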

In [8]:
random_seed = 42

# create new df with Datatype (train, test, orig) column
# pass in `text_df` from above to create a complete dataframe, with all the information you need
complete_df = helpers.train_test_dataframe(text_df, random_seed=random_seed)

# check results
complete_df.head(10)

Unnamed: 0,File,Task,Category,Class,Text,Datatype
0,g0pA_taska.txt,a,0,0,inheritance is a basic concept of object orien...,train
1,g0pA_taskb.txt,b,3,1,pagerank is a link analysis algorithm used by ...,test
2,g0pA_taskc.txt,c,2,1,the vector space model also called term vector...,train
3,g0pA_taskd.txt,d,1,1,bayes theorem was names after rev thomas bayes...,train
4,g0pA_taske.txt,e,0,0,dynamic programming is an algorithm design tec...,test
5,g0pB_taska.txt,a,0,0,inheritance is a basic concept in object orien...,test
6,g0pB_taskb.txt,b,0,0,pagerank pr refers to both the concept and the...,train
7,g0pB_taskc.txt,c,3,1,vector space model is an algebraic model for r...,test
8,g0pB_taskd.txt,d,2,1,bayes theorem relates the conditional and marg...,train
9,g0pB_taske.txt,e,1,1,dynamic programming is a method for solving ma...,test


# Determining Plagiarism

The next task is to extract similarity features that will be useful for plagiarism classification.

The `complete_df` should always include the columns: `['File', 'Task', 'Category', 'Class', 'Text', 'Datatype']`.



# Similarity Features 


## Feature Engineering


### Containment

First, we create **containment features**. 

> Containment is defined as the **intersection** of the n-gram word count of the Wikipedia Source Text (S) with the n-gram word count of the Student Answer Text (A), *divided* by the n-gram word count of the Student Answer Text.

$$ \frac{\sum{count(\text{ngram}_{A}) \cap count(\text{ngram}_{S})}}{\sum{count(\text{ngram}_{A})}} $$

If the two texts have no n-grams in common, the containment will be 0, but if _all_ of the answer's n-grams also appear in the source, the containment will be 1. Intuitively, having longer n-grams in common might be an indication of cut-and-paste plagiarism.
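
As a small worked example of the formula, the sketch below computes containment for two short sentences using only the standard library. The helper names `ngram_counts` and `containment` are illustrative assumptions; the notebook's own `calculate_containment` uses `CountVectorizer`, whose default tokenizer ignores single-character tokens, so its values can differ from this whitespace-based version.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the word n-grams in a whitespace-tokenized text."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def containment(answer_text, source_text, n):
    """Shared n-gram count divided by the answer's n-gram count."""
    answer = ngram_counts(answer_text, n)
    source = ngram_counts(source_text, n)
    total = sum(answer.values())
    if total == 0:
        # the answer is shorter than n words, so it has no n-grams to compare
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram
    shared = sum((answer & source).values())
    return shared / total

answer = "this is an answer text"
source = "this is a source text"

print(containment(answer, source, 1))  # 3 shared unigrams out of 5 -> 0.6
print(containment(answer, source, 2))  # 1 shared bigram out of 4 -> 0.25
```

The drop from 0.6 to 0.25 when moving from unigrams to bigrams illustrates why higher-order containment is a stricter signal of copied passages.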

### Containment calculation

The containment between an answer text and its source text is calculated according to the following equation.

$$ \frac{\sum{count(\text{ngram}_{A}) \cap count(\text{ngram}_{S})}}{\sum{count(\text{ngram}_{A})}} $$


In [9]:
from sklearn.feature_extraction.text import CountVectorizer

def calculate_containment(df, n, answer_filename):
    '''This function calculates the containment between a given answer text and its associated source text.
       This function creates a count of ngrams (of a size, n) for each text file in our data.
       Then calculates the containment by finding the ngram count for a given answer text, 
       and its associated source text, and calculating the normalized intersection of those counts.
       :param df: A dataframe with columns,
           'File', 'Task', 'Category', 'Class', 'Text', and 'Datatype'
       :param n: An integer that defines the ngram size
       :param answer_filename: A filename for an answer text in the df, ex. 'g0pB_taskd.txt'
       :return: A single containment value that represents the similarity
           between an answer text and its source text.
    '''
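    ## calculate ngrams

    # look up the pre-processed answer text by filename
    a_text = df.loc[df['File'] == answer_filename, 'Text'].iloc[0]

    # the source file shares the task suffix, e.g. 'g0pB_taskd.txt' -> 'orig_taskd.txt'
    # (assumes a four-character author prefix, as in this dataset)
    orig_filename = 'orig' + answer_filename[4:]
    s_text = df.loc[df['File'] == orig_filename, 'Text'].iloc[0]

    # instantiate an ngram counter
    counts = CountVectorizer(analyzer='word', ngram_range=(n, n))

    # row 0 holds the answer's ngram counts, row 1 the source's
    ngram_array = counts.fit_transform([a_text, s_text]).toarray()

    # intersection = element-wise minimum of the two count rows,
    # normalized by the total number of ngrams in the answer
    return np.amin(ngram_array, axis=0).sum() / ngram_array[0].sum()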


---
## Longest Common Subsequence

Containment is a good way to find overlap in word usage between two documents; it may help identify cases of cut-and-paste as well as paraphrased levels of plagiarism. Since plagiarism is a fairly complex task with varying levels, it's often useful to include other measures of similarity. The paper also discusses a feature called **longest common subsequence**.

> The longest common subsequence is the longest string of words (or letters) that are *the same* between the Wikipedia Source Text (S) and the Student Answer Text (A). This value is also normalized by dividing by the total number of words (or letters) in the Student Answer Text.

A Longest Common Subsequence (LCS) problem may look as follows:
* Given two texts: text A (answer text) of length m, and text S (original source text) of length n. Our goal is to produce their longest common subsequence of words: the longest sequence of words that appear left-to-right in both texts (though the words don't have to be in continuous order).
* Consider:
    * A = "i think pagerank is a link analysis algorithm used by google that uses a system of weights attached to each element of a hyperlinked set of documents"
    * S = "pagerank is a link analysis algorithm used by the google internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents"

* In this case, we can see that the start of each sentence is fairly similar, with overlap in the sequence of words "pagerank is a link analysis algorithm used by" before diverging slightly. Then we **continue moving left-to-right along both texts** until we see the next common sequence; in this case it is only one word, "google". Next we find "that" and "a", and finally the same ending, "to each element of a hyperlinked set of documents".
* Below is a visual of how these sequences were found, sequentially, in each text.

<img src='notebook_ims/common_subseq_words.png' width=40% />

* Now, those words appear in left-to-right order in each document, sequentially, and even though there are some words in between, we count this as the longest common subsequence between the two texts. 
* If I count up each word that I found in common I get the value 20. **So, LCS has length 20**. 
* Next, to normalize this value, divide by the total length of the student answer; in this example that length is 27 words. **So, the function `lcs_norm_word` should return the value `20/27`, or about `0.7407`.**

In this way, LCS is a great indicator of cut-and-paste plagiarism or if someone has referenced the same source text multiple times in an answer.
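
The worked example can be checked with a short, self-contained sketch. It uses a variant of the usual dynamic-programming recurrence that keeps only two rows of the table at a time; the helper name `lcs_length` is an illustrative assumption and is separate from the notebook's `lcs_norm_word`.

```python
def lcs_length(answer_words, source_words):
    """Length of the longest common subsequence of two word lists."""
    # previous[j] = LCS length of the answer words seen so far vs. the first j source words
    previous = [0] * (len(source_words) + 1)
    for a_word in answer_words:
        current = [0]
        for j, s_word in enumerate(source_words, start=1):
            if a_word == s_word:
                # extend the subsequence that ended just before both words
                current.append(previous[j - 1] + 1)
            else:
                # otherwise carry over the better of skipping one word on either side
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

A = ("i think pagerank is a link analysis algorithm used by google that uses a "
     "system of weights attached to each element of a hyperlinked set of documents")
S = ("pagerank is a link analysis algorithm used by the google internet search "
     "engine that assigns a numerical weighting to each element of a hyperlinked "
     "set of documents")

answer_words, source_words = A.split(), S.split()
lcs = lcs_length(answer_words, source_words)
print(lcs, len(answer_words), round(lcs / len(answer_words), 4))
```

Keeping two rows instead of the full table reduces memory from O(m·n) to O(n) while producing the same length, which is all the normalized feature needs.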

In [10]:
def lcs_norm_word(answer_text, source_text):
    '''Computes the longest common subsequence of words in two texts; returns a normalized value.
       :param answer_text: The pre-processed text for an answer text
       :param source_text: The pre-processed text for an answer's associated source text
       :return: A normalized LCS value'''
    
    answer_text = answer_text.split()
    source_text = source_text.split()

    m = len(answer_text)    
    n = len(source_text)
    
    #initialize with nones
    L = [[None]*(n+1) for i in range(m+1)]

    for i in range(m+1):
        for j in range(n+1):
            if i == 0 or j == 0 :
                L[i][j] = 0
            elif answer_text[i-1] == source_text[j-1]:
                L[i][j] = L[i-1][j-1]+1
            else:
                L[i][j] = max(L[i-1][j], L[i][j-1])
  
    return L[m][n]/m


---
# Create All Features

### Creating multiple containment features

In [11]:
def create_containment_features(df, n, column_name=None):
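    '''
    Returns a list of containment values, calculated for a given n,
    with one entry per row of df (100 entries for complete_df).
    Original source texts (Category == -1) get the placeholder value -1.
    '''
    containment_values = []
    
    # note: column_name is not used in the returned list; the caller names the column
    if column_name is None:
        column_name = 'c_'+str(n) # c_1, c_2, .. c_n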
    
    # iterates through dataframe rows
    for i in df.index:
        file = df.loc[i, 'File']
        # Computes features using calculate_containment function
        if df.loc[i,'Category'] > -1:
            c = calculate_containment(df, n, file)
            containment_values.append(c)
        # Sets value to -1 for original tasks 
        else:
            containment_values.append(-1)
    
    print(str(n)+'-gram containment features created!')
    return containment_values


### Creating LCS features

In [12]:
def create_lcs_features(df, column_name='lcs_word'):
    '''Returns a list of normalized LCS values, one per row of df (-1 for original source texts).'''
    lcs_values = []
    
    # iterate through files in dataframe
    for i in df.index:
        if df.loc[i,'Category'] > -1:
            answer_text = df.loc[i, 'Text'] 
            task = df.loc[i, 'Task']
            orig_rows = df[(df['Class'] == -1)]
            orig_row = orig_rows[(orig_rows['Task'] == task)]
            source_text = orig_row['Text'].values[0]

            # calculate lcs
            lcs = lcs_norm_word(answer_text, source_text)
            lcs_values.append(lcs)
        # Sets to -1 for original tasks 
        else:
            lcs_values.append(-1)

    print('LCS features created!')
    return lcs_values
    

### Create a features DataFrame by selecting an `ngram_range`

In [13]:
# Define an ngram range
ngram_range = range(1,17)

features_list = []

# Create features in a features_df
all_features = np.zeros((len(ngram_range)+1, len(complete_df)))

# Calculate features for containment for ngrams in range
i=0
for n in ngram_range:
    column_name = 'c_'+str(n)
    features_list.append(column_name)
    # create containment features
    all_features[i]=np.squeeze(create_containment_features(complete_df, n))
    i+=1

# Calculate features for LCS_Norm Words 
features_list.append('lcs_word')
all_features[i]= np.squeeze(create_lcs_features(complete_df))

# create a features dataframe
features_df = pd.DataFrame(np.transpose(all_features), columns=features_list)

# Print all features/columns
print()
print('Features: ', features_list)
print()

1-gram containment features created!
2-gram containment features created!
3-gram containment features created!
4-gram containment features created!
5-gram containment features created!
6-gram containment features created!
7-gram containment features created!
8-gram containment features created!
9-gram containment features created!
10-gram containment features created!
11-gram containment features created!
12-gram containment features created!
13-gram containment features created!
14-gram containment features created!
15-gram containment features created!
16-gram containment features created!
LCS features created!

Features:  ['c_1', 'c_2', 'c_3', 'c_4', 'c_5', 'c_6', 'c_7', 'c_8', 'c_9', 'c_10', 'c_11', 'c_12', 'c_13', 'c_14', 'c_15', 'c_16', 'lcs_word']



In [14]:
# print some results 
features_df.head(10)

Unnamed: 0,c_1,c_2,c_3,c_4,c_5,c_6,c_7,c_8,c_9,c_10,c_11,c_12,c_13,c_14,c_15,c_16,lcs_word
0,0.398148,0.07907,0.009346,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.191781
1,1.0,0.984694,0.964103,0.943299,0.92228,0.901042,0.879581,0.857895,0.835979,0.81383,0.791444,0.768817,0.745946,0.722826,0.699454,0.675824,0.820755
2,0.869369,0.719457,0.613636,0.515982,0.449541,0.382488,0.319444,0.265116,0.219626,0.197183,0.174528,0.151659,0.133333,0.114833,0.096154,0.082126,0.846491
3,0.593583,0.268817,0.156757,0.108696,0.081967,0.06044,0.044199,0.027778,0.011173,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.316062
4,0.544503,0.115789,0.031746,0.005319,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.242574
5,0.329502,0.053846,0.007722,0.003876,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.161172
6,0.590308,0.150442,0.035556,0.004464,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.301653
7,0.765306,0.709898,0.664384,0.62543,0.589655,0.553633,0.520833,0.487805,0.454545,0.424561,0.394366,0.378092,0.361702,0.348754,0.335714,0.322581,0.621711
8,0.759777,0.505618,0.39548,0.306818,0.245714,0.195402,0.150289,0.110465,0.070175,0.035294,0.017751,0.005952,0.0,0.0,0.0,0.0,0.484305
9,0.884444,0.526786,0.340807,0.247748,0.180995,0.15,0.118721,0.091743,0.064516,0.041667,0.023256,0.009346,0.004695,0.0,0.0,0.0,0.597458


## Correlated Features

In [15]:
# Create correlation matrix for just Features to determine different models to test
corr_matrix = features_df.corr().abs().round(2)

# display shows all of a dataframe
display(corr_matrix)

Unnamed: 0,c_1,c_2,c_3,c_4,c_5,c_6,c_7,c_8,c_9,c_10,c_11,c_12,c_13,c_14,c_15,c_16,lcs_word
c_1,1.0,0.94,0.9,0.89,0.88,0.87,0.87,0.87,0.86,0.86,0.86,0.86,0.86,0.86,0.86,0.86,0.97
c_2,0.94,1.0,0.99,0.98,0.97,0.96,0.95,0.94,0.94,0.93,0.92,0.92,0.91,0.91,0.91,0.9,0.98
c_3,0.9,0.99,1.0,1.0,0.99,0.98,0.98,0.97,0.96,0.95,0.95,0.94,0.94,0.93,0.93,0.92,0.97
c_4,0.89,0.98,1.0,1.0,1.0,0.99,0.99,0.98,0.98,0.97,0.97,0.96,0.96,0.95,0.95,0.94,0.95
c_5,0.88,0.97,0.99,1.0,1.0,1.0,1.0,0.99,0.99,0.98,0.98,0.97,0.97,0.97,0.96,0.96,0.95
c_6,0.87,0.96,0.98,0.99,1.0,1.0,1.0,1.0,0.99,0.99,0.99,0.98,0.98,0.98,0.97,0.97,0.94
c_7,0.87,0.95,0.98,0.99,1.0,1.0,1.0,1.0,1.0,1.0,0.99,0.99,0.99,0.98,0.98,0.98,0.93
c_8,0.87,0.94,0.97,0.98,0.99,1.0,1.0,1.0,1.0,1.0,1.0,0.99,0.99,0.99,0.99,0.98,0.92
c_9,0.86,0.94,0.96,0.98,0.99,0.99,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.99,0.99,0.99,0.91
c_10,0.86,0.93,0.95,0.97,0.98,0.99,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.99,0.99,0.91


## Create selected train/test data

In [16]:
def train_test_data(complete_df, features_df, selected_features):
    '''Gets selected training and test features from given dataframes, and 
       returns tuples for training and test features and their corresponding class labels.
       :param complete_df: A dataframe with all of our processed text data, datatypes, and labels
       :param features_df: A dataframe of all computed, similarity features
       :param selected_features: An array of selected features that correspond to certain columns in `features_df`
       :return: training and test features and labels: (train_x, train_y), (test_x, test_y)'''
    
    df = pd.concat([complete_df, features_df], axis=1)
    
    # get the training features
    #.values so that arrays are created--the function outputs arrays
    train_x = df.loc[df['Datatype']=='train', selected_features].values
    # And training class labels (0 or 1)
    train_y = df.loc[df['Datatype']=='train','Class'].values
    
    # get the test features and labels
    test_x = df.loc[df['Datatype']=='test', selected_features].values
    test_y = df.loc[df['Datatype']=='test', 'Class'].values
    
    return (train_x, train_y), (test_x, test_y)
    

## Select "good" features

In [17]:
corr_matrix

Unnamed: 0,c_1,c_2,c_3,c_4,c_5,c_6,c_7,c_8,c_9,c_10,c_11,c_12,c_13,c_14,c_15,c_16,lcs_word
c_1,1.0,0.94,0.9,0.89,0.88,0.87,0.87,0.87,0.86,0.86,0.86,0.86,0.86,0.86,0.86,0.86,0.97
c_2,0.94,1.0,0.99,0.98,0.97,0.96,0.95,0.94,0.94,0.93,0.92,0.92,0.91,0.91,0.91,0.9,0.98
c_3,0.9,0.99,1.0,1.0,0.99,0.98,0.98,0.97,0.96,0.95,0.95,0.94,0.94,0.93,0.93,0.92,0.97
c_4,0.89,0.98,1.0,1.0,1.0,0.99,0.99,0.98,0.98,0.97,0.97,0.96,0.96,0.95,0.95,0.94,0.95
c_5,0.88,0.97,0.99,1.0,1.0,1.0,1.0,0.99,0.99,0.98,0.98,0.97,0.97,0.97,0.96,0.96,0.95
c_6,0.87,0.96,0.98,0.99,1.0,1.0,1.0,1.0,0.99,0.99,0.99,0.98,0.98,0.98,0.97,0.97,0.94
c_7,0.87,0.95,0.98,0.99,1.0,1.0,1.0,1.0,1.0,1.0,0.99,0.99,0.99,0.98,0.98,0.98,0.93
c_8,0.87,0.94,0.97,0.98,0.99,1.0,1.0,1.0,1.0,1.0,1.0,0.99,0.99,0.99,0.99,0.98,0.92
c_9,0.86,0.94,0.96,0.98,0.99,0.99,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.99,0.99,0.99,0.91
c_10,0.86,0.93,0.95,0.97,0.98,0.99,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.99,0.99,0.91
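
All of the pairwise correlations displayed above are 0.86 or higher, so any selection involves a trade-off. One common heuristic, which is not necessarily how the features in the next cell were chosen, is to walk through the features in order and keep one only if its absolute correlation with every feature already kept stays below a threshold. The sketch below illustrates that idea on made-up columns; the names `pearson`, `select_uncorrelated`, `f_a`, `f_b`, `f_c` and the threshold of 0.9 are all assumptions for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def select_uncorrelated(features, threshold=0.9):
    """Greedily keep features whose |correlation| with all kept features is below threshold."""
    selected = []
    for name, values in features.items():
        if all(abs(pearson(values, features[kept])) < threshold for kept in selected):
            selected.append(name)
    return selected

# made-up feature columns: f_b is an exact multiple of f_a, f_c is only weakly related
features = {
    'f_a': [0.1, 0.4, 0.5, 0.9, 1.0],
    'f_b': [0.2, 0.8, 1.0, 1.8, 2.0],
    'f_c': [0.9, 0.1, 0.8, 0.2, 0.5],
}

print(select_uncorrelated(features))
```

The result depends on the order in which features are visited and on the threshold, so it is best treated as a starting point for experimentation rather than a fixed rule.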


In [18]:
selected_features = ['c_1', 'c_16', 'lcs_word']

(train_x, train_y), (test_x, test_y) = train_test_data(complete_df, features_df, selected_features)

print('Training size: ', len(train_x))
print('Test size: ', len(test_x))
print()
print('Training df sample: \n', train_x[:10])

Training size:  70
Test size:  25

Training df sample: 
 [[0.39814815 0.         0.19178082]
 [0.86936937 0.0821256  0.84649123]
 [0.59358289 0.         0.31606218]
 [0.59030837 0.         0.30165289]
 [0.75977654 0.         0.48430493]
 [0.97945205 0.36641221 0.9       ]
 [0.95138889 0.13953488 0.89403974]
 [0.97647059 0.2        0.82320442]
 [0.81176471 0.03870968 0.45977011]
 [0.81395349 0.57746479 0.78888889]]


# Training a Decision Tree Model

For this classification problem, I use a decision tree classifier.

In [19]:
from sklearn.tree import DecisionTreeClassifier

In [20]:
model = DecisionTreeClassifier(max_leaf_nodes=2)
model.fit(train_x,train_y)

DecisionTreeClassifier(max_leaf_nodes=2)
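
Setting `max_leaf_nodes=2` limits the tree to a single split, so the model is effectively one threshold on one feature (often called a decision stump). The sketch below illustrates the idea on the `lcs_word` values and `Class` labels of the first five training rows shown earlier. It is a simplification: it considers one feature and minimizes training errors, whereas scikit-learn picks the split by an impurity criterion (Gini by default) across all selected features. The function name `fit_stump` is an assumption, and the threshold it finds on five rows is not the fitted model's threshold.

```python
def fit_stump(values, labels):
    """Find the threshold on one feature that minimizes training errors.

    Predicts class 1 when the value is above the threshold, class 0 otherwise.
    Returns (threshold, number_of_training_errors).
    """
    best_threshold, best_errors = None, len(labels) + 1
    ordered = sorted(set(values))
    # candidate thresholds are the midpoints between consecutive distinct values
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        errors = sum((value > threshold) != bool(label)
                     for value, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

# lcs_word and Class for the first five training rows (rows 0, 2, 3, 6, 8 above)
lcs_values = [0.191781, 0.846491, 0.316062, 0.301653, 0.484305]
classes = [0, 1, 1, 0, 1]

threshold, errors = fit_stump(lcs_values, classes)
print(f'threshold={threshold:.4f}, training errors={errors}')
```

On these five rows a single cut between the two middle values separates the classes, which gives some intuition for why such a small tree can work on this dataset.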

# Evaluate the Model

In [21]:
test_y_preds = model.predict(test_x)

In [22]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
score=accuracy_score(test_y,test_y_preds)
print(f'Accuracy: {round(score*100,2)}%')
 
print(f"Classification Report : \n\n{classification_report(test_y, test_y_preds)}")

Accuracy: 92.0%
Classification Report : 

              precision    recall  f1-score   support

           0       1.00      0.80      0.89        10
           1       0.88      1.00      0.94        15

    accuracy                           0.92        25
   macro avg       0.94      0.90      0.91        25
weighted avg       0.93      0.92      0.92        25
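
The figures in the report can be reproduced by hand. The confusion-matrix counts below are not printed by the notebook; they are inferred from the support and recall values above (10 non-plagiarized answers with recall 0.80, 15 plagiarized answers with recall 1.00), so treat them as an assumption.

```python
# Counts inferred from the classification report (assumed, not printed by the notebook)
tn, fp = 8, 2    # class 0: 8 correctly left unflagged, 2 wrongly flagged as plagiarized
fn, tp = 0, 15   # class 1: no plagiarized answers missed, 15 caught

accuracy = (tp + tn) / (tn + fp + fn + tp)

precision_1 = tp / (tp + fp)   # of the answers flagged, how many were plagiarized
recall_1 = tp / (tp + fn)      # of the plagiarized answers, how many were flagged
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f'accuracy    = {accuracy:.2f}')
print(f'class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1(precision_0, recall_0):.2f}')
print(f'class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1(precision_1, recall_1):.2f}')
```

Read this way, the model's errors on the test set are all false positives: two original answers flagged as plagiarized and no plagiarized answers missed.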

