<a href="https://colab.research.google.com/github/JainAnki/ADSMI-Notebooks/blob/main/Copy_of_M3_MP6_NB_Software_Bugs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applied Data Science and Machine Intelligence
## A program by IIT Madras and TalentSprint
### Mini Project 5 : Software Use Case - Issue Detection

## Learning Objectives

At the end of the mini project, you will be able to:

* Understand the dataset
* Perform extensive EDA and visualizations
* Handcraft the raw data into a form suitable for an ML problem
* Predict (classify) the defect category of an issue based on its text


Perform exhaustive EDA and engineer the features to build a model on training data that predicts (classifies) the defect category of an issue from a test dataset.


## Information

### Issue Classification

A company that uses an online issue tracking system often runs into trouble allocating its people and digital resources efficiently. This is because the person raising a ticket sometimes files it under the wrong tag or category, and redirecting the issue to the right person takes extra time. Linking each issue to the appropriate tag is therefore essential for resolving it on time. This task falls to the area of Machine Learning called Natural Language Processing (NLP). In this Mini-Project we will use the fundamental building blocks of NLP to classify issues into the appropriate categories based on the text body of the issue/ticket being raised.

### About the Dataset

This Mini-Project uses a dataset from [GitHub](https://github.com/roundcube/roundcubemail/issues). It contains issues of the Roundcube mail application, with a software defect label for each issue.

**Python Packages used:**  

* [`Google.colab`](https://colab.research.google.com/notebooks/io.ipynb) for linking the notebook to your Google Drive
* [`Pandas`](https://pandas.pydata.org/docs/reference/index.html) for data frames and easy reading of CSV files  
* [`Numpy`](https://numpy.org/doc/stable/reference/index.html#reference) for array and matrix mathematics functions  
* [`sklearn`](https://scikit-learn.org/stable/user_guide.html) for pre-processing data, building ML models, and performance metrics
* [`seaborn`](https://seaborn.pydata.org/) and [`matplotlib`](https://matplotlib.org/) for plotting
* [`regex`](https://docs.python.org/3/library/re.html) and [`nltk`](https://www.nltk.org/) for text processing


## Importing the packages

In [None]:
### The required libraries and packages ###
import pandas as pd
import numpy as np
from google.colab import drive
import seaborn as sns
import regex as re

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords 

from sklearn.decomposition import PCA
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import *

import string
import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Importing the Data

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path = 'drive/MyDrive/Colab Notebooks/Week 6 (Software Bug Classification)/'

df_raw = pd.read_csv(path+'issues_data.csv')
print(df_raw.shape)
df_raw.head(2)

(525, 5)


Unnamed: 0,Defect-ID in Roundcube Github issues repository,Issue Title,Issue Body,Defect Type Family using IEEE,Defect Type Family using ODC
0,#4528,Wrong alert when uploading attachment over size,_Reported by @alecpl on 17 Apr 2014 15:33 UTC ...,ieee_logicData,control_flow
1,#4529,Recovery lost draft message ?,_Reported by L1Ntu on 17 Apr 2014 19:41 UTC as...,ieee_logicData,control_flow


In [None]:
df = df_raw.copy()
df.head(2)

Unnamed: 0,Defect-ID in Roundcube Github issues repository,Issue Title,Issue Body,Defect Type Family using IEEE,Defect Type Family using ODC
0,#4528,Wrong alert when uploading attachment over size,_Reported by @alecpl on 17 Apr 2014 15:33 UTC ...,ieee_logicData,control_flow
1,#4529,Recovery lost draft message ?,_Reported by L1Ntu on 17 Apr 2014 19:41 UTC as...,ieee_logicData,control_flow


## Graded Exercises (10 points)

### Exercise 1 (1 point): Basic EDA

- Check the shape of the data
- Check the nulls present in each field
- Check the unique number of entries per field
- Drop the features that are either redundant or that do not help in modelling


**Hint** : Use the `pandas` module

In [None]:
# Check the shape of the data
# YOUR CODE HERE
df.shape

(525, 5)

In [None]:
# Check the nulls present in each field
# YOUR CODE HERE
df.isnull().sum()

Defect-ID in Roundcube Github issues repository    0
Issue Title                                        0
Issue Body                                         0
Defect Type Family using IEEE                      0
Defect Type Family using ODC                       0
dtype: int64

In [None]:
# Check the unique number of entries per field
# YOUR CODE HERE
for col in df.columns:
  print(col,df[col].nunique(dropna = False))

Defect-ID in Roundcube Github issues repository 525
Issue Title 524
Issue Body 525
Defect Type Family using IEEE 6
Defect Type Family using ODC 3


In [None]:
# Check the statistics of the data for each column
# YOUR CODE HERE
cat_cols=df.select_dtypes(include=object).columns.tolist()
cat_df=pd.DataFrame(df[cat_cols].melt(var_name='column', value_name='value')
                    .value_counts()).rename(columns={0: 'count'}).sort_values(by=['column', 'count'])
display(df.select_dtypes(include=object).describe())
display(cat_df)

Unnamed: 0,Defect-ID in Roundcube Github issues repository,Issue Title,Issue Body,Defect Type Family using IEEE,Defect Type Family using ODC
count,525,525,525,525,525
unique,525,524,525,6,3
top,#4528,Accessibility issues,_Reported by @alecpl on 17 Apr 2014 15:33 UTC ...,ieee_logicData,control_flow
freq,1,2,1,347,394


Unnamed: 0_level_0,Unnamed: 1_level_0,count
column,value,Unnamed: 2_level_1
Defect Type Family using IEEE,ieee_description,4
Defect Type Family using IEEE,ieee_standards,6
Defect Type Family using IEEE,ieee_syntax,18
Defect Type Family using IEEE,ieee_otherBuildConfigInstall,48
Defect Type Family using IEEE,ieee_interface,102
...,...,...
Issue Title,PDO::quote() bugs on postgres/sqlite/mssql,1
Issue Title,Option to add a new contact should be inactive for read-only addressbooks,1
Issue Title,Optimize framed HTML responses,1
Issue Title,your own mailadress is removed from cc when you save a draft,1


In [None]:
# Remove the unwanted columns
# YOUR CODE HERE
df = df.drop(["Defect-ID in Roundcube Github issues repository"], axis=1)

### Exercise 2 (3 points): Text Pre-Processing for Feature Columns

For each row of the data, write **functions** that perform the following steps separately for the feature columns `Issue Title` and `Issue Body`



In [None]:
#stopwords = nltk.corpus.stopwords.words('english')
#print(stopwords)

##### **Function - 1: (To be performed for `Issue Title` ONLY)**

Hint: The following steps will be present in both the functions

* Convert all text to lower case
* Replace punctuation, numbers, symbols and emojis with spaces. 

  This is done to ensure only the text is retained.
* Strip the excess spaces
* Remove the stop words using the English stop words of the nltk library
* Strip the excess spaces
* Remove words shorter than 3 letters (example: a, i, n, it, js, ab etc.)
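
The steps listed above can be sketched roughly as follows. This is only an illustration, not the graded solution: the function name `clean_title_sketch` and the tiny `DEMO_STOPWORDS` set are assumptions made so the example runs without downloading anything, and in the notebook the nltk English stop-word list would take its place.

```python
import re

# Hypothetical mini stop-word list so the sketch runs without downloads;
# in the notebook, set(nltk.corpus.stopwords.words('english')) would be used instead.
DEMO_STOPWORDS = {"a", "an", "the", "is", "in", "on", "when", "over",
                  "to", "of", "and", "not", "for"}

def clean_title_sketch(text, stop_words=DEMO_STOPWORDS, min_len=3):
    """Apply the Function-1 steps to one string."""
    text = text.lower()                                 # lower case
    text = re.sub(r"[^a-z]+", " ", text)                # non-letters become spaces
    words = text.split()                                # split() also strips excess spaces
    words = [w for w in words if w not in stop_words]   # drop stop words
    words = [w for w in words if len(w) >= min_len]     # drop words shorter than min_len
    return " ".join(words)

print(clean_title_sketch("Wrong alert when uploading attachment over size"))
```

Note that this sketch is stricter than a simple `strip` of punctuation: tokens such as version numbers are removed entirely, which is what the step list asks for.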

In [None]:
def tp_title(text: str) -> str:
    def is_valid_word(word: str) -> bool:
        valid_patterns = {'->', '=>', '?:',}
        return word in valid_patterns or (word.isascii())

    def clean(word: str) -> str:
        strip_chars = string.punctuation + ' '
        return word.strip(strip_chars)

    STOPWORDS = nltk.corpus.stopwords.words('english')

    stop_words_eng = set(STOPWORDS)
    return ' '.join(clean(word) for word in text.lower().split() if is_valid_word(word) and word not in stop_words_eng)

In [None]:
df['Issue Title'] = df['Issue Title'].apply(tp_title)

In [None]:
df['Issue Title']

0                  wrong alert uploading attachment size
1                           recovery lost draft message 
2      switching html text initially composing messag...
3                            problem raw message headers
4                  followup-to always blank sending mail
                             ...                        
520          can't import contact csv thunderbird 17.0.5
521                       alter message-id draft sending
522         two possible minor bugs rcube_mime::wordwrap
523      could load message server error upgrading 0.9.2
524                pdo::quote bugs postgres/sqlite/mssql
Name: Issue Title, Length: 525, dtype: object

##### **Function - 2 (To be performed for `Issue Body` ONLY)**

Hint: Either copy the steps of Function-1 or call the function, and add the steps below

* Split the text into lines
* Remove the first and last lines, such as

  -`_Reported by L1Ntu on 17 Apr 2014 19:41 UTC as Trac ticket #1489818_`

  -`_Migrated-From: http://trac.roundcube.net/ticket/1489818_`

  by replacing them with a single space `" "` or an empty string `""`
* Remove lines containing URLs

  (**Hint**: they contain `http`, `www.`, `.com`, `.net` etc.) by replacing them with an empty string
* Join the lines back together with a space `" "`
* Strip the excess spaces

**Hint:** For the tasks below, copy the lines of code from `Function-1` above
* Convert all text to lower case
* Replace punctuation, numbers, symbols and emojis with spaces. 

  This is done to ensure only the text is retained.
* Strip the excess spaces
* Remove the stop words using the English stop words of the nltk library
* Strip the excess spaces
* Remove words shorter than 3 letters (example: a, i, n, it, js, ab etc.)
* Strip the excess spaces
* If the entry contains only a space `" "`, replace it with an empty string `""`
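
As a rough illustration of the line-level steps (splitting, dropping the metadata lines, dropping URL lines, joining back), here is a minimal sketch. The function name `strip_body_lines_sketch`, the prefix-based metadata check, and the sample text are assumptions for the example; the word-level cleaning from Function-1 would be applied to its result afterwards.

```python
import re

URL_HINTS = ("http", "www.", ".com", ".net")  # substrings named in the hint above

def strip_body_lines_sketch(body):
    """Drop the Trac metadata lines and any line containing a URL."""
    lines = [ln.strip() for ln in body.replace("\r", "").split("\n")]
    kept = []
    for ln in lines:
        # Assumption: metadata lines are recognised by their prefixes
        if ln.startswith("_Reported by") or ln.startswith("_Migrated-From:"):
            continue
        if any(hint in ln.lower() for hint in URL_HINTS):
            continue                              # line contains a URL
        kept.append(ln)
    text = " ".join(kept)                         # join lines with a space
    return re.sub(r"\s+", " ", text).strip()      # collapse and strip excess spaces

# Hypothetical issue body in the same layout as the dataset entries
sample = (
    "_Reported by L1Ntu on 17 Apr 2014 19:41 UTC as Trac ticket #1489818_\r\n"
    "\r\n"
    "After updating I click compose.\r\n"
    "See http://example.org/screenshot for details\r\n"
    "\r\n"
    "_Migrated-From: http://trac.roundcube.net/ticket/1489818_"
)
print(strip_body_lines_sketch(sample))
```

A body that consists only of metadata and URL lines comes back as `""`, which matches the last step in the list.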

#### Using the functions: 

Apply the two functions above to their respective columns to achieve the desired tasks.

In [None]:
def tp_body(txt: str) -> str:
    def not_metadata(txt: str) -> bool:
        return not (txt.startswith('_') and txt.endswith('_'))

    def clean(txt: str) -> str:
        def not_url(txt: str) -> bool:
            urls = {"www", "http", ".net", ".com"}
            return not any(url in txt for url in urls)
        return ' '.join(word for word in tp_title(txt).split() if not_url(word))

    lines = txt.replace('\r', '').split('\n')
    return ' '.join(clean(line) for line in lines if not_metadata(line)).strip()


In [None]:
df['Issue Body'] = df['Issue Body'].apply(tp_body)
df['Issue Body']

0      try upload file big see error message alert ap...
1      updating 0.9.4 1.0.0 click compose popup    re...
2      composing new message default format html swit...
3      hello  want toggle raw message headers roundcu...
4      composing mail choose add followup-to address ...
                             ...                        
520    trying import contact csv format nothing happe...
521    roundcube generate new message-id necessary rf...
522    hi there  far tell may two glitches rcube_mime...
523    i'm issue viewing messages html attachments se...
524    makes roundcube caching working serialized str...
Name: Issue Body, Length: 525, dtype: object

In [None]:
# Drop all the rows that have no content in them
# YOUR CODE HERE
df.isnull().sum()

Issue Title                      0
Issue Body                       0
Defect Type Family using IEEE    0
Defect Type Family using ODC     0
dtype: int64
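
`isnull()` reports zero here even if some entries were cleaned down to nothing, because an empty string is not a null value. One possible way to drop such rows is sketched below on a toy frame; the values and the names `toy`, `has_content` and `toy_clean` are made up for the illustration.

```python
import pandas as pd

# Toy frame (hypothetical values): two bodies were cleaned down to nothing
toy = pd.DataFrame({
    "Issue Title": ["wrong alert", "draft lost", "raw headers"],
    "Issue Body": ["upload file big", "", "   "],
})

text_cols = ["Issue Title", "Issue Body"]
# Keep a row only if every text column still has non-whitespace content
has_content = toy[text_cols].apply(lambda col: col.str.strip().ne("")).all(axis=1)
toy_clean = toy[has_content].reset_index(drop=True)
print(toy_clean.shape)
```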

In [None]:
df

Unnamed: 0,Issue Title,Issue Body,Defect Type Family using IEEE,Defect Type Family using ODC
0,wrong alert uploading attachment size,try upload file big see error message alert ap...,ieee_logicData,control_flow
1,recovery lost draft message,updating 0.9.4 1.0.0 click compose popup re...,ieee_logicData,control_flow
2,switching html text initially composing messag...,composing new message default format html swit...,ieee_logicData,control_flow
3,problem raw message headers,hello want toggle raw message headers roundcu...,ieee_logicData,control_flow
4,followup-to always blank sending mail,composing mail choose add followup-to address ...,ieee_logicData,control_flow
...,...,...,...,...
520,can't import contact csv thunderbird 17.0.5,trying import contact csv format nothing happe...,ieee_interface,structural
521,alter message-id draft sending,roundcube generate new message-id necessary rf...,ieee_logicData,control_flow
522,two possible minor bugs rcube_mime::wordwrap,hi there far tell may two glitches rcube_mime...,ieee_logicData,control_flow
523,could load message server error upgrading 0.9.2,i'm issue viewing messages html attachments se...,ieee_logicData,control_flow


### Exercise 3a (1 point): Feature Engineering Approach-1

* Combine the title and body strings by a space
*  the words
* Use `CountVectorizer` to tokenize the text and transform it into features
* Reduce the features using PCA

In [None]:
corpus_body = df["Issue Title"]+" "+df["Issue Body"]
# YOUR CODE HERE
vectorizer = CountVectorizer()
  
vectorizer.fit(corpus_body)
  
# Printing the identified Unique words along with their indices
print("Vocabulary: ", vectorizer.vocabulary_)
  
# Encode the Document
vector = vectorizer.transform(corpus_body)
  
# Summarizing the Encoded Texts
print("Encoded Document is:")
print(vector.toarray())

pca = PCA(n_components = 3)
pca.fit(vector.toarray())
data_pca = pca.transform(vector.toarray())
data_pca = pd.DataFrame(data_pca,columns=['PC1','PC2','PC3'])
data_pca.head()

Encoded Document is:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


Unnamed: 0,PC1,PC2,PC3
0,-1.043585,-0.241191,-0.74235
1,-1.040775,-0.249932,-0.770995
2,-1.023397,-0.261139,-0.589711
3,1.485315,3.456178,0.707023
4,-0.54416,-0.378202,-0.935648


### Exercise 4a (1 point) : Data Preparation

* Check for the data value counts to see the data imbalance
  - Merge the smaller classes to a bigger class so that the number of classes is between 3 and 4

* Perform Label Encoding for the Target variable classes

* Create a New DataFrame
  - Merge the dataframe with PCA filtered variables and 

    the Target variable-1 `"Defect Type Family using IEEE"` and  

    the Target variable -2 `"Defect Type Family using ODC"`

* Split the above data into Training and Testing Datasets
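
The preparation steps above might look like the following sketch. It runs on a toy frame rather than the Roundcube data: the class counts, the `min_count` threshold and the `"other"` bucket name are assumptions, and mapping the rare classes onto an existing larger class would be an equally valid reading of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
# Toy stand-in for the PCA features and one target column (hypothetical counts)
labels = (["ieee_logicData"] * 30 + ["ieee_interface"] * 12
          + ["ieee_syntax"] * 5 + ["ieee_standards"] * 3)
toy = pd.DataFrame(rng.normal(size=(len(labels), 3)), columns=["PC1", "PC2", "PC3"])
toy["target"] = labels

print(toy["target"].value_counts())            # 1. inspect the imbalance

# 2. Merge every class with fewer than min_count rows into one bucket
min_count = 10                                 # assumed threshold
counts = toy["target"].value_counts()
rare = counts[counts < min_count].index
toy["target"] = toy["target"].where(~toy["target"].isin(rare), "other")

# 3. Label-encode the merged target
encoder = LabelEncoder()
toy["target_enc"] = encoder.fit_transform(toy["target"])

# 4. A stratified split keeps class proportions similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    toy[["PC1", "PC2", "PC3"]], toy["target_enc"],
    test_size=0.2, stratify=toy["target_enc"], random_state=42,
)
print(X_train.shape, X_test.shape)
```

`encoder.classes_` keeps the mapping from integers back to class names, which is handy when reading the confusion matrix later.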




In [None]:
# Check the class Distribution of the Target Variables
# YOUR CODE HERE

cols = ('Issue Title', 'Issue Body')
label = 'Issues'
X =pd.DataFrame({label: df[cols[0]].str.cat(df[cols[1]], sep=' ')})
y = pd.Series(df.filter(like='IEEE').to_numpy().ravel())

In [None]:
# Replace the minority classes into a class with larger count
# YOUR CODE HERE

In [None]:
# Check the class Distribution of the Target Variables AGAIN
# YOUR CODE HERE

In [None]:
# Label Encode the Target Variable-1
# YOUR CODE HERE

In [None]:
# Label Encode the Target Variable-2
# YOUR CODE HERE

In [None]:
# MERGE the features and target in a single DataFrame
# YOUR CODE HERE

In [None]:
# Check for nulls if any, and fill the values with a new class (integer)
# YOUR CODE HERE

In [None]:
# Split the Data into training and testing, for Target Variable-1 
# YOUR CODE HERE

In [None]:
# Split the Data into training and testing, for Target Variable-2
# YOUR CODE HERE

### Exercise 5a (1 point) : Classification

* Classification-Target 1 (`"Defect Type Family using IEEE"`)
	
    - Perform classification using any ONE of your favorite Sklearn's classifier

    - Explain with metrics

* Classification-Target 2 (`"Defect Type Family using ODC"`)
	
    - Perform classification using any ONE of your favorite Sklearn's classifier

    - Explain with metrics

**Tip**: Train the model and predict in separate cells. It will save time debugging.
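
A minimal sketch of the fit, predict and evaluate cycle is shown below. It uses synthetic data from `make_classification` instead of the project features, and `LogisticRegression` is just one possible choice of classifier, not a recommendation from the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the PCA features (not the Roundcube data)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# class_weight="balanced" is one common way to offset class imbalance
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class
print(cm)
print(classification_report(y_test, y_pred))
```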

In [None]:
# Import the classifier Functions
# YOUR CODE HERE

In [None]:
# MODEL 1

# Call and fit the model on Training Data
# Predict on the test data
# YOUR CODE HERE

In [None]:
# Print confusion_matrix and classification_report
# YOUR CODE HERE

In [None]:
# MODEL 2

# Call and fit the model on Training Data
# Predict on the test data
# YOUR CODE HERE

In [None]:
# Print confusion_matrix and classification_report
# YOUR CODE HERE

#### YOUR FINDINGS / Reasoning for which model is better and why (Qualitatively and Quantitatively)

Explain why one model behaves better than the other(s) in terms of Accuracy, Precision, Recall and F1-Score
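
To make the comparison concrete, it can help to see how each metric is derived from a confusion matrix. The matrix below is hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class
cm = np.array([[50,  5,  5],
               [10, 20, 10],
               [ 5,  5, 10]])

tp = np.diag(cm)                     # correct predictions per class
precision = tp / cm.sum(axis=0)      # TP / everything predicted as that class
recall = tp / cm.sum(axis=1)         # TP / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

print("accuracy:", round(accuracy, 3))
for k in range(len(cm)):
    print(f"class {k}: precision={precision[k]:.3f} "
          f"recall={recall[k]:.3f} f1={f1[k]:.3f}")
```

With imbalanced classes, accuracy is dominated by the majority class, so the per-class precision, recall and F1 values usually say more about which model is better.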

### Exercise 3b (1 point): Feature Engineering Approach-2

* Combine the title and body strings by a space
*  the words
* Use `TfidfVectorizer` to tokenize the text and transform it into features
* Reduce the features using PCA

In [None]:

corpus_body = list(df["Issue Title"]+" "+df["Issue Body"])
vectorizer_body = TfidfVectorizer()
Xbody           = vectorizer_body.fit_transform(corpus_body)
X_body          = Xbody.toarray()

features_body   = vectorizer_body.get_feature_names_out()
print(X_body.shape)


n_pc = 30
pca = PCA(n_components=n_pc, svd_solver='full')
X_body_new = pca.fit_transform(X_body)

print(f"explained_variance_ratio_ = \n{pca.explained_variance_ratio_}")
print(f"\nsingular_values_ = \n{pca.singular_values_}")

# Scale each row (sample) to unit norm; Normalizer is already imported above
X_body_new = Normalizer().fit_transform(X_body_new)

# Column names pc_1 ... pc_n for the principal components
cols_pc = [f"pc_{nth_pc+1}" for nth_pc in range(n_pc)]

df_body_counts = pd.DataFrame(data=X_body_new, columns=cols_pc)
# Safeguard: replace any NaN values with 0
df_body_counts = df_body_counts.fillna(0)
df_body_counts

### Exercise 4b (1 point) : Data Preparation

* Check for the data value counts to see the data imbalance
  - Merge the smaller classes to a bigger class so that the number of classes is between 3 and 4

* Perform Label Encoding for the Target variable classes

* Create a New DataFrame
  - Merge the dataframe with PCA filtered variables and 

    the Target variable-1 `"Defect Type Family using IEEE"` and  

    the Target variable -2 `"Defect Type Family using ODC"`

* Split the above data into Training and Testing Datasets




In [None]:
# Check the class Distribution of the Target Variables
# YOUR CODE HERE

In [None]:
# Replace the minority classes into a class with larger count
# YOUR CODE HERE

In [None]:
# Check the class Distribution of the Target Variables AGAIN
# YOUR CODE HERE

In [None]:
# Label Encode the Target Variable-1
# YOUR CODE HERE

In [None]:
# Label Encode the Target Variable-2
# YOUR CODE HERE

In [None]:
# MERGE the features and target in a single DataFrame
# YOUR CODE HERE

In [None]:
# Check for nulls if any, and fill the values with a new class (integer)
# YOUR CODE HERE

In [None]:
# Split the Data into training and testing, for Target Variable-1 
# YOUR CODE HERE

In [None]:
# Split the Data into training and testing, for Target Variable-2
# YOUR CODE HERE

### Exercise 5b (1 point) : Classification

* Classification-Target 1 (`"Defect Type Family using IEEE"`)
	
    - Perform classification using any ONE of your favorite Sklearn's classifier

    - Explain with metrics

* Classification-Target 2 (`"Defect Type Family using ODC"`)
	
    - Perform classification using any ONE of your favorite Sklearn's classifier

    - Explain with metrics

**Tip**: Train the model and predict in separate cells. It will save time debugging.

In [None]:
# Import the classifier Functions
# YOUR CODE HERE

In [None]:
# MODEL 1

# Call and fit the model on Training Data
# Predict on the test data
# YOUR CODE HERE

In [None]:
# Print confusion_matrix and classification_report
# YOUR CODE HERE

In [None]:
# MODEL 2

# Call and fit the model on Training Data
# Predict on the test data
# YOUR CODE HERE

In [None]:
# Print confusion_matrix and classification_report
# YOUR CODE HERE

## Additional Ungraded Exercise for Practice:

From the Data Perspective:

- Try taking ONLY the Issue Title as the feature set
- Try taking ONLY the Issue Body as the feature set
- Try various data scaling techniques

From the ML Model Perspective:
- Try out other ML models
- Try GridSearch
- Try Cross-Validation techniques
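
Grid search and cross-validation can be combined in a single step, as in the sketch below. The data is synthetic, and the `SVC` estimator, the parameter grid and the `f1_macro` scoring choice are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the project features
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=1)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}   # assumed grid
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Every parameter combination is scored with 5-fold stratified cross-validation
search = GridSearchCV(SVC(), param_grid, cv=cv, scoring="f1_macro")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Macro-averaged F1 weights every class equally, which tends to be a more informative target than accuracy when the classes are imbalanced.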

In [None]:
# Python code to demonstrate creating a DataFrame from a dict of lists

import pandas as pd

# initialise data of lists.
data = {'Name': ['Mohe', 'Karnal', 'Yrik', 'jack'],
        'Age': [30, 21, 29, 28]}

# Create DataFrame
df = pd.DataFrame( data )

# Print the output.
df


Unnamed: 0,Name,Age
0,Mohe,30
1,Karnal,21
2,Yrik,29
3,jack,28


In [None]:
# import module
import seaborn

seaborn.set(style = 'whitegrid')

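# plot Age for each Name; recent seaborn versions expect keyword arguments
# (the choice of x="Name", y="Age" is an assumption about the intended plot)
seaborn.barplot(data=df, x="Name", y="Age")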

In [None]:
import pandas as pd
def f1(x):
  return x

df1 = pd.DataFrame({'c1' : [-1,2,8,4], 'c2':[2,3,6,4]}, index = ['r1', 'r2', 'r3', 'r4'])
df2 = df1
df = pd.concat([df1,df2], ignore_index = True)

#df ['c2'] = df ['c1']. apply ( lambda x : f1 (x))
#df ['c2'] = [ f1 (x) for x in df ['c1']]
#df ['c2'] = match ( f1 , df ['c2'])

df ['c2'] = f1 ( df ['c1'])
df

Unnamed: 0,c1,c2
0,-1,-1
1,2,2
2,8,8
3,4,4
4,-1,-1
5,2,2
6,8,8
7,4,4


In [None]:
mysentence = "My name is Ankita Jain"
#mysentence =mysentence.split ()
#mysentence = list (mysentence)
import nltk
nltk.download('punkt')
'''
from nltk . tokenize import word_tokenize
mysentence = word_tokenize ( mysentence )
mysentence
'''
from nltk.tokenize import sent_tokenize 
mysentence =sent_tokenize ( mysentence )
mysentence

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['My name is Ankita Jain']

In [None]:
import regex as re
my_sentence = " fill your sentence "

my_sentence = re.sub ('[^aA -zZ]+ ', 'gh', my_sentence )
my_sentence

' fill your sentence '

In [None]:
df.iloc['r3':'r4']

TypeError: ignored

In [None]:
df.iloc[-2:]

Unnamed: 0,A,B
r3,8,6
r4,4,4
