<a href="https://colab.research.google.com/github/JainAnki/ADSMI-Notebooks/blob/main/Software_Bugs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applied Data Science and Machine Intelligence
## A program by IIT Madras and TalentSprint
### Mini Project 5 : Software Use Case - Issue Detection

## Learning Objectives

At the end of the mini project, you will be able to -

* Get an understanding of the dataset.
* Perform Extensive EDA and Visualizations
* Handcraft the raw data into a form suitable for an ML problem
* Predict (classify) the defect category of an issue from its title and body text


Perform exhaustive EDA and engineer features to build a model on training data that predicts (classifies) the defect category of an issue from a test dataset.


## Information

### Issue Classification

A company that uses an online issue tracking system often struggles to allocate its human and digital resources efficiently. One cause is that the people raising tickets sometimes file them under the wrong tag or category, and redirecting an issue to the right person takes extra time. Linking each issue to the appropriate tag up front therefore helps resolve it on time. This task falls to the area of Machine Learning called Natural Language Processing (NLP). In this Mini-Project we will use the fundamental building blocks of NLP to classify issues into appropriate categories based on the text body of the issue/ticket being raised.

### About the Dataset

This Mini-Project uses a dataset drawn from the [GitHub](https://github.com/roundcube/roundcubemail/issues) issue tracker of the Roundcube mail application. It contains the issues, along with the software defect type labelled for each issue.

**Python Packages used:**  

* [`Google.colab`](https://colab.research.google.com/notebooks/io.ipynb) for linking the notebook to your Google Drive
* [`Pandas`](https://pandas.pydata.org/docs/reference/index.html) for data frames and easy reading of csv files  
* [`Numpy`](https://numpy.org/doc/stable/reference/index.html#reference) for array and matrix mathematics functions  
* [`sklearn`](https://scikit-learn.org/stable/user_guide.html) for pre-processing the data, building ML models, and computing performance metrics
* [`seaborn`](https://seaborn.pydata.org/) and [`matplotlib`](https://matplotlib.org/) for plotting
* [`regex`](https://docs.python.org/3/library/re.html) and [`nltk`](https://www.nltk.org/) for text preprocessing


## Importing the packages

In [None]:
### The required libraries and packages
import pandas as pd
import numpy as np
from google.colab import drive
import seaborn as sns
import regex as re

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.decomposition import PCA
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Importing the Data

In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
path = 'drive/MyDrive/Colab Notebooks/M3_MP6_SoftwareBugs/'
# path = 'drive/MyDrive/<YOUR FOLDER NAME AS IT APPEARS ON GOOGLE DRIVE>'

df_raw = pd.read_csv(path+'issues_data.csv')
print(df_raw.shape)
df_raw.head(2)

(525, 5)


| | Defect-ID in Roundcube Github issues repository | Issue Title | Issue Body | Defect Type Family using IEEE | Defect Type Family using ODC |
|---|---|---|---|---|---|
| 0 | #4528 | Wrong alert when uploading attachment over size | _Reported by @alecpl on 17 Apr 2014 15:33 UTC ... | ieee_logicData | control_flow |
| 1 | #4529 | Recovery lost draft message ? | _Reported by L1Ntu on 17 Apr 2014 19:41 UTC as... | ieee_logicData | control_flow |


In [None]:
df = df_raw.copy()
df.head(2)

| | Defect-ID in Roundcube Github issues repository | Issue Title | Issue Body | Defect Type Family using IEEE | Defect Type Family using ODC |
|---|---|---|---|---|---|
| 0 | #4528 | Wrong alert when uploading attachment over size | _Reported by @alecpl on 17 Apr 2014 15:33 UTC ... | ieee_logicData | control_flow |
| 1 | #4529 | Recovery lost draft message ? | _Reported by L1Ntu on 17 Apr 2014 19:41 UTC as... | ieee_logicData | control_flow |


In [None]:
df.iloc[:2, :10]

| | Defect-ID in Roundcube Github issues repository | Issue Title | Issue Body | Defect Type Family using IEEE | Defect Type Family using ODC |
|---|---|---|---|---|---|
| 0 | #4528 | Wrong alert when uploading attachment over size | _Reported by @alecpl on 17 Apr 2014 15:33 UTC ... | ieee_logicData | control_flow |
| 1 | #4529 | Recovery lost draft message ? | _Reported by L1Ntu on 17 Apr 2014 19:41 UTC as... | ieee_logicData | control_flow |


## Graded Exercises (10 points)

**Exercises 1 to 4** (7 points)

These deal with the data: its basic analysis, its visualization, and the preparation of the **FEATURES** only.

**Exercises 5 & 6** (3 points)

Exercises 5 and 6 deal with the classification model and its performance metrics.

As you can see, this Mini-Project is centered on the data rather than the algorithms.

### Exercise 1 (1 point): Basic EDA

- Check the shape of the data
- Check the nulls present in each field
- Check the unique number of entries per field
- Drop the features that are either redundant or that do not help in modelling


**Hint** : Use the `pandas` module

In [None]:
# Check the shape of the data
df.shape

(525, 5)

In [None]:
# Check the nulls present in each field
df.isnull().sum()

Defect-ID in Roundcube Github issues repository    0
Issue Title                                        0
Issue Body                                         0
Defect Type Family using IEEE                      0
Defect Type Family using ODC                       0
dtype: int64

In [None]:
# Check the unique number of entries per field
df.nunique()

Defect-ID in Roundcube Github issues repository    525
Issue Title                                        524
Issue Body                                         525
Defect Type Family using IEEE                        6
Defect Type Family using ODC                         3
dtype: int64
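
The counts above show 524 unique titles across 525 rows, so one title occurs twice. A minimal sketch of how such repeats could be listed with `DataFrame.duplicated`; the small frame below is made-up illustration data, not the Roundcube dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the issue table
toy = pd.DataFrame({
    "Issue Title": ["Login fails", "Crash on send", "Login fails"],
    "Issue Body": ["body a", "body b", "body c"],
})

# keep=False marks every member of a duplicated group, not only the later copies
repeated = toy[toy.duplicated(subset="Issue Title", keep=False)]
print(repeated)
```

Whether a repeated title should be dropped is a judgment call: the bodies may still differ, as they do in this dataset where all 525 bodies are unique.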

In [None]:
# Check the data type of each column
df.dtypes

Defect-ID in Roundcube Github issues repository    object
Issue Title                                        object
Issue Body                                         object
Defect Type Family using IEEE                      object
Defect Type Family using ODC                       object
dtype: object

In [None]:
# Remove the unwanted columns
df = df.drop(columns="Defect-ID in Roundcube Github issues repository")
print(df.shape)
df.head()

(525, 4)


| | Issue Title | Issue Body | Defect Type Family using IEEE | Defect Type Family using ODC |
|---|---|---|---|---|
| 0 | Wrong alert when uploading attachment over size | _Reported by @alecpl on 17 Apr 2014 15:33 UTC ... | ieee_logicData | control_flow |
| 1 | Recovery lost draft message ? | _Reported by L1Ntu on 17 Apr 2014 19:41 UTC as... | ieee_logicData | control_flow |
| 2 | Switching from html to text when initially com... | _Reported by arodier on 18 Apr 2014 16:09 UTC ... | ieee_logicData | control_flow |
| 3 | problem with raw message headers | _Reported by Thunderstick on 19 Apr 2014 10:06... | ieee_logicData | control_flow |
| 4 | Followup-To always blank when sending mail | _Reported by brendan on 23 Apr 2014 17:05 UTC ... | ieee_logicData | control_flow |


In [None]:
df["Defect Type Family using IEEE"].value_counts().rename_axis("Defect Type Family using IEEE").reset_index(name="counts")

| | Defect Type Family using IEEE | counts |
|---|---|---|
| 0 | ieee_logicData | 347 |
| 1 | ieee_interface | 102 |
| 2 | ieee_otherBuildConfigInstall | 48 |
| 3 | ieee_syntax | 18 |
| 4 | ieee_standards | 6 |
| 5 | ieee_description | 4 |


In [None]:
df["Defect Type Family using ODC"].value_counts().rename_axis("Defect Type Family using ODC").reset_index(name="counts")

| | Defect Type Family using ODC | counts |
|---|---|---|
| 0 | control_flow | 394 |
| 1 | structural | 76 |
| 2 | non_functional | 55 |


### Exercise 2 (3 points): Text Pre-Processing for Feature Columns

For each row of the data, write **functions** that perform the following steps separately for the feature columns `Issue Title` and `Issue Body`



In [None]:
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

##### **Function - 1: (To be performed for `Issue Title` ONLY)**

Hint: The following steps will be present in both functions

* Convert all text to lower case
* Replace punctuation, numbers, symbols and emojis with a single space.

  This is done to ensure only the alphabetic text is retained.
* Strip the excess spaces
* Remove the stop words using the English stop-word list of the nltk library
* Strip the excess spaces
* Remove words shorter than 3 letters (for example: a, i, n, it, js, ab)

In [None]:
def tp_title(paragraph):

  # Step - Convert to lower case
  paragraph = paragraph.lower()

  # Step - Keep only alphabetical characters
  paragraph = re.sub('[^a-z]+', ' ', paragraph)

  # Step - Strip the excess spaces
  paragraph = paragraph.strip()

  # Step - Remove words smaller than 3 letters
  list_words = paragraph.split()
  words_min_size = [word for word in list_words if len(word)>2]

  # Step - Join Back the words
  paragraph = " ".join(words_min_size)

  # Step - Remove the English Stop words
  stop_words_eng = nltk.corpus.stopwords.words('english')
  # stop_words_eng = set(stopwords.words('english'))
  word_tokens = word_tokenize(paragraph)
  filtered_para = []
  for w in word_tokens:
      if w not in stop_words_eng:
          filtered_para.append(w)  

  # Step - Join Back the words
  paragraph = " ".join(filtered_para)
  return paragraph
  


paragraph = df.iloc[1, 0]
print(paragraph)
print("="*30)
tp_title(paragraph)

Recovery lost draft message ?


'recovery lost draft message'

##### **Function - 2 (To be performed for `Issue Body` ONLY)**

Hint: Either copy and paste the steps of Function-1, or call that function, and add the steps below

* Split the text into lines
* Remove the first and last lines, such as

  -`_Reported by L1Ntu on 17 Apr 2014 19:41 UTC as Trac ticket #1489818_`

  -`_Migrated-From: http://trac.roundcube.net/ticket/1489818_`

  by replacing them with a space `" "` or the empty string `""`
* Remove lines containing URLs

  (**Hint**: such lines contain `http`, `www.`, `.com`, `.net` etc.) by replacing them with the empty string
* Join the lines back together with a space `" "`
* Strip the excess spaces

**Hint:** For the tasks below, copy and paste the lines of code from `Function-1` above
* Convert all text to lower case
* Replace punctuation, numbers, symbols and emojis with a single space.

  This is done to ensure only the alphabetic text is retained.
* Strip the excess spaces
* Remove the stop words using the English stop-word list of the nltk library
* Strip the excess spaces
* Remove words shorter than 3 letters (for example: a, i, n, it, js, ab)
* Strip the excess spaces
* If the entry contains only a space `" "`, replace it with the empty string `""`

#### Using the functions: 

Apply the two functions above to their respective columns to carry out the desired tasks.

In [None]:
def tp_body(paragraph):

  # Step - Splitlines
  list_sentences = paragraph.splitlines()

  # Step - Remove the first and last lines (the "Reported by" / "Migrated-From" lines)
  list_sentences[0] = ""
  list_sentences[-1] = ""

  # Step - Remove the lines that contain any URLs
  url_terms = ["www", "http", ".net", ".com"]
  for nth, sentence in enumerate(list_sentences):
    for url_term in url_terms:
      if url_term in sentence:
        list_sentences[nth] = ''

  # Step - Join Back the lines
  paragraph = " ".join(list_sentences)

  # Step - Strip the excess spaces
  paragraph = paragraph.strip()

  # Step - Convert to lower case
  paragraph = paragraph.lower()

  # Step - Keep only alphabetical characters
  paragraph = re.sub('[^a-z]+', ' ', paragraph)

  # Step - Strip the excess spaces
  paragraph = paragraph.strip()

  # Step - Remove words smaller than 3 letters
  list_words = paragraph.split()
  words_min_size = [word for word in list_words if len(word)>2]

  # Step - Join Back the words
  paragraph = " ".join(words_min_size)

  # Step - Collapse repeated letters (note: this also shortens normal double letters, e.g. "message" -> "mesage")
  paragraph = re.sub(r'([a-z])\1+', r'\1', paragraph)

  # Step - Remove the English Stop words
  stop_words_eng = nltk.corpus.stopwords.words('english')
  # stop_words_eng = set(stopwords.words('english'))
  word_tokens = word_tokenize(paragraph)
  filtered_para = []
  for w in word_tokens:
      if w not in stop_words_eng:
          filtered_para.append(w)  

  # Step - Join Back the words
  paragraph = " ".join(filtered_para)

  if paragraph==" ": paragraph = ""

  return paragraph


paragraph = df.iloc[1, 1]
print(paragraph)
print("="*30)
tp_body(paragraph)  

_Reported by L1Ntu on 17 Apr 2014 19:41 UTC as Trac ticket #1489818_

After updating from 0.9.4 to 1.0.0 when i click to compose i have popup :

---

Recovering message
Founded early composed but not sended message.

Theme : ...
Saved : ...

Do u want to recover it ?

---

After any option (recover/delete/ignore) after some period of time it appears again...

_Migrated-From: http://trac.roundcube.net/ticket/1489818_



'updating click compose popup recovering mesage founded early composed sended mesage theme saved want recover option recover delete ignore period time apears'

In [None]:
df.columns

Index(['Issue Title', 'Issue Body', 'Defect Type Family using IEEE',
       'Defect Type Family using ODC'],
      dtype='object')

In [None]:
df['Issue Body'] = df['Issue Body'].apply(tp_body)
df['Issue Title'] = df['Issue Title'].apply(tp_title)
df.head()

| | Issue Title | Issue Body | Defect Type Family using IEEE | Defect Type Family using ODC |
|---|---|---|---|---|
| 0 | wrong alert uploading attachment size | try upload file big se eror mesage alert ap up... | ieee_logicData | control_flow |
| 1 | recovery lost draft message | updating click compose popup recovering mesage... | ieee_logicData | control_flow |
| 2 | switching html text initially composing messag... | composing new mesage default format html switc... | ieee_logicData | control_flow |
| 3 | problem raw message headers | helo want togle raw mesage headers roundcube g... | ieee_logicData | control_flow |
| 4 | followup always blank sending mail | composing mail chose ad folowup adres resultin... | ieee_logicData | control_flow |


In [None]:
# Drop all the rows that have no content in them
n_title = df["Issue Title"].apply(lambda x:len(x)).values
df = df[n_title>0]

n_body = df["Issue Body"].apply(lambda x:len(x)).values
df = df[n_body>0]

df.shape

(465, 4)

### Exercise 3a (1 point): Feature Engineering Approach-1

* Combine the title and body strings by a space
*  the words
* Use `CountVectorizer` to tokenize and transform the text to features
* Reduce the features using PCA

In [None]:
corpus_body = list(df["Issue Title"]+" "+df["Issue Body"])
vectorizer_body = CountVectorizer()
Xbody           = vectorizer_body.fit_transform(corpus_body)
X_body          = Xbody.toarray()

features_body   = vectorizer_body.get_feature_names_out()
print(X_body.shape)


n_pc = 30
pca = PCA(n_components=n_pc, svd_solver='full')
X_body_new = pca.fit_transform(X_body)

print(f"explained_variance_ratio_ = \n{pca.explained_variance_ratio_}")
print(f"\nsingular_values_ = \n{pca.singular_values_}")

from sklearn.preprocessing import Normalizer
X_body_new = Normalizer().fit_transform(X_body_new)


cols_pc = []
for nth_pc in range(n_pc):
  cols_pc.append(f"pc_{nth_pc+1}")


df_body_counts  = pd.DataFrame(data=X_body_new, columns=cols_pc)
df_body_counts = df_body_counts.fillna(0)
df_body_counts

(465, 4133)
explained_variance_ratio_ = 
[0.2306453  0.06420583 0.04483837 0.03680194 0.02874212 0.02293313
 0.02081733 0.01892165 0.01617982 0.01390135 0.01341697 0.01267601
 0.0112773  0.01083197 0.01015332 0.00964032 0.00929494 0.00916292
 0.00831448 0.00801334 0.00784482 0.00738311 0.00715036 0.00707935
 0.00687222 0.00662933 0.00617808 0.00592656 0.00562128 0.00534015]

singular_values_ = 
[157.05002161  82.86152457  69.2453447   62.73372388  55.44023942
  49.52190798  47.18220505  44.98267274  41.59608915  38.55619573
  37.87850688  36.81772129  34.72708367  34.03450817  32.95108628
  32.10787165  31.52745543  31.30276872  29.81831794  29.27335232
  28.96390364  28.09864141  27.65219871  27.51455445  27.10903695
  26.62565829  25.70351755  25.17486831  24.51789963  23.89694554]


Unnamed: 0,pc_1,pc_2,pc_3,pc_4,pc_5,pc_6,pc_7,pc_8,pc_9,pc_10,...,pc_21,pc_22,pc_23,pc_24,pc_25,pc_26,pc_27,pc_28,pc_29,pc_30
0,-0.411312,-0.259630,-0.190537,-0.637869,0.120840,0.073194,0.048938,0.163435,-0.014750,0.089522,...,-0.079765,-0.124144,-0.018569,0.027867,0.101140,0.186077,0.290931,-0.158089,-0.122780,-0.029276
1,-0.348630,-0.217357,-0.170838,-0.597383,0.141433,0.108930,-0.037247,0.160761,0.099224,-0.046063,...,-0.247479,0.103427,0.183929,0.115840,0.144195,0.159648,0.041140,-0.045212,-0.174958,-0.023902
2,-0.308380,-0.189357,-0.154638,-0.454194,-0.006973,-0.001592,-0.145060,0.224000,0.196753,-0.174934,...,0.067901,-0.125170,0.197107,-0.018553,-0.003418,0.074125,-0.362943,-0.197501,0.026030,0.153037
3,-0.172041,0.108199,-0.065618,0.132007,0.294291,0.146843,-0.041479,-0.010489,0.150686,0.222461,...,-0.067191,0.117002,-0.233620,0.165647,0.458310,0.233780,-0.087882,-0.052882,-0.231959,-0.039937
4,-0.389038,-0.246249,-0.186070,-0.573014,0.069310,0.183606,0.062364,0.155677,-0.010392,0.035981,...,-0.054315,0.213451,-0.026702,-0.260194,-0.101652,-0.104232,-0.210556,-0.010576,-0.047670,-0.340486
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,-0.369225,-0.227878,-0.171010,-0.585165,0.118674,0.111273,0.053218,0.079988,-0.010194,0.139891,...,-0.371650,-0.244700,-0.064379,0.128532,-0.052724,0.173728,-0.076481,0.019285,-0.010339,0.005221
461,-0.054890,-0.012636,-0.036062,0.078210,0.113789,-0.003580,-0.003494,0.101278,0.115364,-0.315379,...,-0.065268,0.184322,0.235844,-0.023969,0.192931,0.208856,0.118216,0.129982,-0.207769,0.087386
462,-0.072543,-0.190902,-0.132171,-0.217455,-0.218047,0.011336,-0.061939,-0.037649,0.310451,-0.210116,...,-0.005241,-0.042949,-0.136323,-0.087306,-0.253354,0.323936,-0.294303,-0.089426,-0.245807,0.134397
463,-0.307321,-0.170565,-0.145197,-0.195464,-0.056510,0.174977,0.136618,0.037791,0.162467,0.153549,...,-0.026661,-0.035982,-0.258386,0.029397,0.366738,0.124428,0.175930,-0.317982,0.100957,-0.001967
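
The printed `explained_variance_ratio_` values can be turned into a cumulative curve to judge how much of the variance the 30 components keep. A minimal sketch, assuming only NumPy; the ratios are copied from the output above, and the 0.5 target is an arbitrary illustration rather than a recommended threshold.

```python
import numpy as np

# Explained-variance ratios copied from the CountVectorizer + PCA output above
ratios = np.array([
    0.2306453, 0.06420583, 0.04483837, 0.03680194, 0.02874212, 0.02293313,
    0.02081733, 0.01892165, 0.01617982, 0.01390135, 0.01341697, 0.01267601,
    0.0112773, 0.01083197, 0.01015332, 0.00964032, 0.00929494, 0.00916292,
    0.00831448, 0.00801334, 0.00784482, 0.00738311, 0.00715036, 0.00707935,
    0.00687222, 0.00662933, 0.00617808, 0.00592656, 0.00562128, 0.00534015,
])

# Running total of the variance share kept by the first k components
cumulative = np.cumsum(ratios)
total_retained = cumulative[-1]

# Smallest number of components whose cumulative share reaches the target
target = 0.5  # illustrative value only
n_needed = int(np.searchsorted(cumulative, target) + 1)

print(f"variance retained by all 30 components: {total_retained:.3f}")
print(f"components needed to reach {target:.0%}: {n_needed}")
```

In the notebook itself the same curve is available directly as `np.cumsum(pca.explained_variance_ratio_)`.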


### Exercise 4a (1 point) : Data Preparation

* Check the value counts of the data to see the class imbalance
  - Merge the smaller classes into one larger class so that the number of classes is 3 or 4

* Perform Label Encoding for the Target variable classes

* Create a new DataFrame
  - Merge the dataframe of PCA-reduced variables with

    the Target variable-1 `"Defect Type Family using IEEE"` and  

    the Target variable-2 `"Defect Type Family using ODC"`

* Split the above data into Training and Testing Datasets




In [None]:
# Check the class Distribution of the Target Variables
df["Defect Type Family using IEEE"].value_counts()

ieee_logicData                  304
ieee_interface                   92
ieee_otherBuildConfigInstall     43
ieee_syntax                      17
ieee_standards                    5
ieee_description                  4
Name: Defect Type Family using IEEE, dtype: int64

In [None]:
# Merge the minority classes into a single "ieee_others" class
condition = df["Defect Type Family using IEEE"].isin(["ieee_syntax", "ieee_standards", "ieee_description"])
df.loc[condition, "Defect Type Family using IEEE"] = "ieee_others"
df["Defect Type Family using IEEE"].value_counts()

ieee_logicData                  304
ieee_interface                   92
ieee_otherBuildConfigInstall     43
ieee_others                      26
Name: Defect Type Family using IEEE, dtype: int64

In [None]:
df["Defect Type Family using ODC"].value_counts()

control_flow      348
structural         67
non_functional     50
Name: Defect Type Family using ODC, dtype: int64
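
With classes this imbalanced, a useful reference point is the accuracy of always predicting the most frequent class. A minimal sketch in plain Python, using the class counts printed above; `majority_baseline` is a hypothetical helper name, and the figures describe the full cleaned dataset rather than any particular train/test split.

```python
# Class counts copied from the value_counts() outputs above (465 rows after cleaning)
ieee_counts = {
    "ieee_logicData": 304,
    "ieee_interface": 92,
    "ieee_otherBuildConfigInstall": 43,
    "ieee_others": 26,
}
odc_counts = {"control_flow": 348, "structural": 67, "non_functional": 50}


def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


print(f"IEEE majority-class accuracy: {majority_baseline(ieee_counts):.3f}")
print(f"ODC majority-class accuracy:  {majority_baseline(odc_counts):.3f}")
```

A classifier whose accuracy falls below this baseline is doing worse than a constant guess, which is why per-class precision and recall are worth reading alongside accuracy.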

In [None]:
le_ieee = LabelEncoder()

df["Defect Type Family using IEEE"] = le_ieee.fit_transform(df["Defect Type Family using IEEE"])

In [None]:
le_odc = LabelEncoder()
df["Defect Type Family using ODC"] = le_odc.fit_transform(df["Defect Type Family using ODC"])
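
After encoding, the class names are replaced by integers, so it helps to keep the mapping at hand when reading confusion matrices later. A minimal sketch, assuming scikit-learn's `LabelEncoder` and a short made-up label list that reuses the ODC class names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical short label list reusing the ODC class names
labels = ["control_flow", "structural", "non_functional", "control_flow"]

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ is sorted, and each class is encoded as its position in that array
mapping = {str(name): code for code, name in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original names from the integer codes
print(le.inverse_transform(codes))
```

In the notebook, the same lookup can be read from `le_ieee.classes_` and `le_odc.classes_`.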

In [None]:
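df_ml_data = df_body_counts.copy()
# NOTE: pandas aligns these assignments on index labels, not on row position.
# df still carries its original row labels (with gaps from the dropped rows),
# while df_body_counts has a fresh 0..464 index, so unmatched labels become NaN.
df_ml_data["Defect Type Family using IEEE"] = df["Defect Type Family using IEEE"].astype(int)
df_ml_data["Defect Type Family using ODC"] = df["Defect Type Family using ODC"].astype(int)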
df_ml_data.head(2)

Unnamed: 0,pc_1,pc_2,pc_3,pc_4,pc_5,pc_6,pc_7,pc_8,pc_9,pc_10,...,pc_23,pc_24,pc_25,pc_26,pc_27,pc_28,pc_29,pc_30,Defect Type Family using IEEE,Defect Type Family using ODC
0,-0.411312,-0.25963,-0.190537,-0.637869,0.12084,0.073194,0.048938,0.163435,-0.01475,0.089522,...,-0.018569,0.027867,0.10114,0.186077,0.290931,-0.158089,-0.12278,-0.029276,1.0,0.0
1,-0.34863,-0.217357,-0.170838,-0.597383,0.141433,0.10893,-0.037247,0.160761,0.099224,-0.046063,...,0.183929,0.11584,0.144195,0.159648,0.04114,-0.045212,-0.174958,-0.023902,1.0,0.0


In [None]:
df_ml_data.shape

(465, 32)

In [None]:
print(df_ml_data["Defect Type Family using IEEE"].isnull().sum())
df_ml_data["Defect Type Family using IEEE"].value_counts()

58


1.0    270
0.0     80
2.0     37
3.0     20
Name: Defect Type Family using IEEE, dtype: int64

In [None]:
df_ml_data["Defect Type Family using IEEE"] = df_ml_data["Defect Type Family using IEEE"].fillna(4.0)
df_ml_data["Defect Type Family using IEEE"].value_counts()

1.0    270
0.0     80
4.0     58
2.0     37
3.0     20
Name: Defect Type Family using IEEE, dtype: int64

In [None]:
print(df_ml_data["Defect Type Family using ODC"].isnull().sum())
df_ml_data["Defect Type Family using ODC"].value_counts()

58


0.0    311
2.0     54
1.0     42
Name: Defect Type Family using ODC, dtype: int64

In [None]:
df_ml_data["Defect Type Family using ODC"] = df_ml_data["Defect Type Family using ODC"].fillna(3.0)
df_ml_data["Defect Type Family using ODC"].value_counts()

0.0    311
3.0     58
2.0     54
1.0     42
Name: Defect Type Family using ODC, dtype: int64
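
The 58 missing targets above are consistent with pandas aligning a column assignment on index labels: the label frame keeps its original row labels after rows were dropped, while the feature frame was built with a fresh index. A minimal sketch of that behaviour on made-up data (all names below are hypothetical), together with a positional assignment via `to_numpy()` as one possible alternative.

```python
import pandas as pd

# Hypothetical labels; dropping row 1 leaves index labels 0, 2, 3
labels = pd.DataFrame({"target": ["a", "b", "c", "d"]})
labels = labels[labels["target"] != "b"]

# Features built from the 3 surviving rows get a fresh 0..2 index
features = pd.DataFrame({"pc_1": [0.1, 0.2, 0.3]})

# Assigning a Series aligns on index labels: label 1 is missing, label 3 is ignored
aligned = features.copy()
aligned["target"] = labels["target"]

# Assigning a NumPy array goes by position, so row i keeps its own label
positional = features.copy()
positional["target"] = labels["target"].to_numpy()

print(aligned)
print(positional)
```

Note that in the aligned version the rows that do receive a value can still be paired with the wrong features (the third feature row gets "c" instead of "d"), so filling the NaNs with an extra class does not repair the pairing. Calling `reset_index(drop=True)` on the label frame right after filtering would be another way to keep both frames in step.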

In [None]:
df_ml_data.isnull().sum()

pc_1                             0
pc_2                             0
pc_3                             0
pc_4                             0
pc_5                             0
pc_6                             0
pc_7                             0
pc_8                             0
pc_9                             0
pc_10                            0
pc_11                            0
pc_12                            0
pc_13                            0
pc_14                            0
pc_15                            0
pc_16                            0
pc_17                            0
pc_18                            0
pc_19                            0
pc_20                            0
pc_21                            0
pc_22                            0
pc_23                            0
pc_24                            0
pc_25                            0
pc_26                            0
pc_27                            0
pc_28                            0
pc_29               

In [None]:
# Split the Data into training and testing

X  = df_ml_data.iloc[:,:-2]
y1 = df_ml_data.iloc[:,-2]
y2 = df_ml_data.iloc[:,-1]

# A single call splits X and both targets with the same row assignment
X_train, X_test, y_train1, y_test1, y_train2, y_test2 = train_test_split(
    X, y1, y2, test_size=0.3, random_state=123)
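
Because the classes are imbalanced, a plain random split can leave a rare class under-represented in the test set. A minimal sketch of a stratified split, assuming scikit-learn's `train_test_split` with its `stratify` argument; the labels and placeholder features below are made up, and the notebook's own split above does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 / 15 / 5 samples across three classes
y = np.array([0] * 80 + [1] * 15 + [2] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Class shares in each part stay close to the overall 0.80 / 0.15 / 0.05
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_te) / len(y_te))
```

Stratifying on one target does not balance the other, so with two targets a separate split per target (or a combined stratification key) would be needed.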


### Exercise 5a (1 point) : Classification

* Classification-Target 1 (`"Defect Type Family using IEEE"`)

    - Perform classification using any ONE of your favorite scikit-learn classifiers

    - Explain with metrics

* Classification-Target 2 (`"Defect Type Family using ODC"`)

    - Perform classification using any ONE of your favorite scikit-learn classifiers

    - Explain with metrics

**Tip**: Train the model and predict in separate cells. It will save time when debugging.

In [None]:
# THE FOLLOWING SOLUTIONS ARE JUST PLACEHOLDERS for exercises
# The teams may have to modify the parameters, perform grid search,
# cross-validations and further preprocessing to improve the metrics

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
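
The cross-validation mentioned in the comments above could be sketched as a scikit-learn pipeline, so that the vectorizer and the dimensionality reduction are refit inside each fold. This is an illustrative sketch, not the notebook's method: it swaps PCA for `TruncatedSVD` (which accepts the sparse TF-IDF matrix directly), and the toy corpus, labels, and parameter values are all made up.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned toy corpus with two made-up classes
texts = [
    "login page error password", "password reset mail fails",
    "login timeout session lost", "session cookie login error",
    "password field rejects login", "session expired login page",
    "attachment upload size limit", "upload large attachment fails",
    "attachment preview broken image", "image attachment upload error",
    "drag drop attachment upload", "attachment download corrupt file",
]
labels = [0] * 6 + [1] * 6

# Each step is fit on the training part of a fold only
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")
print(scores.mean())
```

Macro-averaged F1 is used here because it weights every class equally, which is one reasonable choice when classes are imbalanced.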

In [None]:
# MODEL 1

model_1 = LogisticRegression(multi_class='ovr', class_weight='balanced')
model_1.fit(X_train, y_train1)
y_pred1 = model_1.predict(X_test)
model_1.score(X_test, y_test1)

0.2571428571428571

In [None]:
print(f"\n Confusion Matrix = \n{confusion_matrix(y_test1, y_pred1)}")
print(f"\n Classification Report = \n{classification_report(y_test1, y_pred1)}")


 Confusion Matrix = 
[[ 7  6  6  7  2]
 [11 18 21 17 15]
 [ 2  1  3  2  2]
 [ 1  1  0  2  1]
 [ 1  4  0  4  6]]

 Classification Report = 
              precision    recall  f1-score   support

         0.0       0.32      0.25      0.28        28
         1.0       0.60      0.22      0.32        82
         2.0       0.10      0.30      0.15        10
         3.0       0.06      0.40      0.11         5
         4.0       0.23      0.40      0.29        15

    accuracy                           0.26       140
   macro avg       0.26      0.31      0.23       140
weighted avg       0.45      0.26      0.29       140



In [None]:
# MODEL 2

model_2 = LogisticRegression(multi_class='ovr', class_weight='balanced')
model_2.fit(X_train, y_train2)
y_pred2 = model_2.predict(X_test)
model_2.score(X_test, y_test2)

0.3

In [None]:
print(f"\n Confusion Matrix = \n{confusion_matrix(y_test2, y_pred2)}")
print(f"\n Classification Report = \n{classification_report(y_test2, y_pred2)}")


 Confusion Matrix = 
[[28 22 23 22]
 [ 5  2  2  2]
 [ 5  6  6  2]
 [ 3  3  3  6]]

 Classification Report = 
              precision    recall  f1-score   support

         0.0       0.68      0.29      0.41        95
         1.0       0.06      0.18      0.09        11
         2.0       0.18      0.32      0.23        19
         3.0       0.19      0.40      0.26        15

    accuracy                           0.30       140
   macro avg       0.28      0.30      0.25       140
weighted avg       0.51      0.30      0.34       140



In [None]:
model_3 = SVC(gamma='auto', class_weight='balanced').fit(X_train, y_train1)
y_pred3 = model_3.predict(X_test)
model_3.score(X_test, y_test1)

0.5214285714285715

In [None]:
print(f"\n Confusion Matrix = \n{confusion_matrix(y_test1, y_pred3)}")
print(f"\n Classification Report = \n{classification_report(y_test1, y_pred3)}")


 Confusion Matrix = 
[[ 0 25  2  1  0]
 [ 0 72  4  6  0]
 [ 0  9  0  1  0]
 [ 0  4  0  1  0]
 [ 0 13  2  0  0]]

 Classification Report = 
              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00        28
         1.0       0.59      0.88      0.70        82
         2.0       0.00      0.00      0.00        10
         3.0       0.11      0.20      0.14         5
         4.0       0.00      0.00      0.00        15

    accuracy                           0.52       140
   macro avg       0.14      0.22      0.17       140
weighted avg       0.35      0.52      0.42       140



In [None]:
from sklearn.svm import SVC
model_4 = SVC(gamma='auto', class_weight='balanced').fit(X_train, y_train2)
y_pred4 = model_4.predict(X_test)
model_4.score(X_test, y_test2)

0.12857142857142856

In [None]:
print(f"\n Confusion Matrix = \n{confusion_matrix(y_test2, y_pred4)}")
print(f"\n Classification Report = \n{classification_report(y_test2, y_pred4)}")


 Confusion Matrix = 
[[ 0 22 73  0]
 [ 0  0 11  0]
 [ 0  1 18  0]
 [ 0  3 12  0]]

 Classification Report = 
              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00        95
         1.0       0.00      0.00      0.00        11
         2.0       0.16      0.95      0.27        19
         3.0       0.00      0.00      0.00        15

    accuracy                           0.13       140
   macro avg       0.04      0.24      0.07       140
weighted avg       0.02      0.13      0.04       140



#### YOUR FINDINGS / Reasoning for which model is better and why (Qualitatively and Quantitatively)

Explain why one model behaves better than the other(s) in terms of Accuracy, Precision, Recall and F1-Score
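
To support that explanation, the per-class metrics can be recomputed by hand from a confusion matrix. A minimal sketch, assuming only NumPy and using the MODEL 1 confusion matrix printed earlier (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix copied from the MODEL 1 output above
# (rows = true class, columns = predicted class)
cm = np.array([
    [7, 6, 6, 7, 2],
    [11, 18, 21, 17, 15],
    [2, 1, 3, 2, 2],
    [1, 1, 0, 2, 1],
    [1, 4, 0, 4, 6],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for k in range(len(cm)):
    print(f"class {k}: precision={precision[k]:.2f} "
          f"recall={recall[k]:.2f} f1={f1[k]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading the matrix this way shows, for example, that class 1 has the highest precision but low recall, because most of its 82 true samples are spread across the other predicted classes.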

### Exercise 3b (1 point): Feature Engineering Approach-2

* Combine the title and body strings by a space
*  the words
* Use `TfidfVectorizer` to tokenize and transform the text to features
* Reduce the features using PCA

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus_body = list(df["Issue Title"]+" "+df["Issue Body"])
vectorizer_body = TfidfVectorizer()
Xbody           = vectorizer_body.fit_transform(corpus_body)
X_body          = Xbody.toarray()

features_body   = vectorizer_body.get_feature_names_out()
print(X_body.shape)


n_pc = 30
pca = PCA(n_components=n_pc, svd_solver='full')
X_body_new = pca.fit_transform(X_body)

print(f"explained_variance_ratio_ = \n{pca.explained_variance_ratio_}")
print(f"\nsingular_values_ = \n{pca.singular_values_}")

from sklearn.preprocessing import Normalizer
X_body_new = Normalizer().fit_transform(X_body_new)


cols_pc = []
for nth_pc in range(n_pc):
  cols_pc.append(f"pc_{nth_pc+1}")


df_body_counts  = pd.DataFrame(data=X_body_new, columns=cols_pc)
df_body_counts = df_body_counts.fillna(0)
df_body_counts

(465, 4133)
explained_variance_ratio_ = 
[0.0130334  0.01196143 0.00951836 0.00872583 0.00830007 0.00775776
 0.00742875 0.00726365 0.00701375 0.00687013 0.00671995 0.00666135
 0.00632475 0.00616669 0.00604518 0.00590551 0.00579446 0.00562445
 0.00551639 0.00545467 0.00536913 0.00529191 0.00525443 0.00510412
 0.00503857 0.00495113 0.00492934 0.0048732  0.00477464 0.00467463]

singular_values_ = 
[2.43069922 2.32859509 2.07722483 1.98886691 1.93973885 1.87529926
 1.83510286 1.81459537 1.78310803 1.76475709 1.74536109 1.7377348
 1.69326177 1.67197034 1.65541573 1.63617952 1.62072308 1.5967709
 1.58135756 1.57248493 1.56010712 1.54884701 1.54335282 1.5211172
 1.51131939 1.49814778 1.4948465  1.48631051 1.47120337 1.45571331]


Unnamed: 0,pc_1,pc_2,pc_3,pc_4,pc_5,pc_6,pc_7,pc_8,pc_9,pc_10,...,pc_21,pc_22,pc_23,pc_24,pc_25,pc_26,pc_27,pc_28,pc_29,pc_30
0,0.096789,-0.060375,0.107373,-0.222810,-0.055013,0.238529,0.126703,0.109747,-0.131101,0.059851,...,0.270315,-0.260290,-0.041776,-0.129686,-0.001571,0.091215,-0.050850,0.162865,-0.148286,-0.254202
1,0.369157,0.182510,-0.229715,-0.202387,-0.390429,-0.031405,-0.108973,-0.022522,0.203291,0.135411,...,-0.037542,-0.213049,0.020028,0.141376,0.162689,0.139088,0.091706,0.067486,-0.094169,0.282118
2,0.603761,-0.305098,0.034900,0.350755,-0.008049,-0.278914,-0.082977,0.241088,0.101457,-0.000780,...,0.080438,-0.041172,0.133590,0.222375,-0.198280,-0.107597,-0.015442,-0.198744,-0.007843,-0.008028
3,-0.124248,0.005525,-0.301813,0.049319,-0.121604,0.203210,-0.278912,-0.244047,0.034393,0.060727,...,-0.041599,-0.067779,-0.156308,0.176308,0.198437,0.053154,-0.332295,-0.001679,0.137726,0.217780
4,0.254514,0.081285,0.172855,-0.101837,-0.154586,-0.024241,0.077717,-0.409406,0.310648,-0.229847,...,0.160450,-0.395991,-0.378886,0.198066,0.113220,0.048099,-0.054681,-0.080073,0.015958,0.046424
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,-0.050118,0.063850,0.429611,-0.086493,-0.335821,0.116245,0.079734,0.245682,0.067913,-0.037557,...,-0.287112,-0.099317,-0.089627,-0.067500,-0.021834,-0.152264,0.127815,-0.129473,-0.018512,-0.148145
461,0.428768,0.122591,-0.300277,0.108071,-0.426952,0.228727,-0.268393,-0.210379,0.144979,-0.018626,...,-0.163118,-0.159776,0.099575,0.106052,-0.028877,0.089145,-0.077975,0.118165,-0.021073,0.050156
462,0.013493,-0.166492,-0.038531,0.211835,-0.058766,-0.017182,-0.059564,0.110542,0.032717,0.090225,...,0.111917,-0.341692,-0.282466,-0.117753,-0.457593,0.191758,-0.065522,0.261677,-0.140995,0.391544
463,0.013362,-0.046017,-0.105295,0.252323,-0.107559,0.296066,-0.074695,-0.171594,0.025673,0.231882,...,-0.233988,-0.145223,-0.029078,0.025952,0.118640,-0.174242,-0.115903,-0.222314,-0.020206,-0.117398
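
The 30 explained-variance ratios printed above sum to roughly 0.198, so the retained components carry about a fifth of the variance in the count matrix. As a minimal sketch (on synthetic data, not the notebook's `X_body`), the number of components could instead be chosen from a cumulative-variance target. The names `X_demo` and `target` and the 0.80 threshold are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the dense count matrix (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.poisson(lam=0.3, size=(200, 50)).astype(float)

# Fit with all components so the full variance spectrum is available.
pca_full = PCA(svd_solver="full").fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
target = 0.80  # hypothetical threshold
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(f"{n_needed} components explain {cumulative[n_needed - 1]:.2%} of the variance")
```

Plotting `cumulative` against the component index is a common way to check whether a fixed choice such as `n_pc = 30` keeps enough information.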


### Exercise 4b (1 point) : Data Preparation

* Check the value counts of the target columns to assess class imbalance
  - Merge the smaller classes into a larger class so that the number of classes is 3 or 4

* Perform Label Encoding for the Target variable classes

* Create a New DataFrame
  - Merge the dataframe of PCA-reduced variables with

    Target variable 1, `"Defect Type Family using IEEE"`, and

    Target variable 2, `"Defect Type Family using ODC"`

* Split the above data into Training and Testing Datasets
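
The first two bullets can be sketched on toy data as follows. The class names below are hypothetical (the real label values are not shown in this section), and keeping the two most frequent classes is only one possible merging rule.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the real column values are not shown in this section.
labels = pd.Series(
    ["logic"] * 6 + ["interface"] * 3 + ["data"] + ["syntax"] + ["standards"]
)

# Keep the two most frequent classes and pool everything else into "other".
counts = labels.value_counts()
keep = counts.index[:2]
merged = labels.where(labels.isin(keep), "other")

# Encode the remaining class names as integers 0..n_classes-1.
encoder = LabelEncoder()
encoded = encoder.fit_transform(merged)

print(merged.value_counts())
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
```

Keeping the fitted `encoder` around allows `encoder.inverse_transform` to map predictions back to class names later.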




In [None]:
df_ml_data = df_body_counts.copy()
df_ml_data["Defect Type Family using IEEE"] = df["Defect Type Family using IEEE"].astype(int)
df_ml_data["Defect Type Family using ODC"] = df["Defect Type Family using ODC"].astype(int)
df_ml_data.head(2)

Unnamed: 0,pc_1,pc_2,pc_3,pc_4,pc_5,pc_6,pc_7,pc_8,pc_9,pc_10,...,pc_23,pc_24,pc_25,pc_26,pc_27,pc_28,pc_29,pc_30,Defect Type Family using IEEE,Defect Type Family using ODC
0,0.096789,-0.060375,0.107373,-0.22281,-0.055013,0.238529,0.126703,0.109747,-0.131101,0.059851,...,-0.041776,-0.129686,-0.001571,0.091215,-0.05085,0.162865,-0.148286,-0.254202,1.0,0.0
1,0.369157,0.18251,-0.229715,-0.202387,-0.390429,-0.031405,-0.108973,-0.022522,0.203291,0.135411,...,0.020028,0.141376,0.162689,0.139088,0.091706,0.067486,-0.094169,0.282118,1.0,0.0
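
Assigning a Series to a DataFrame column aligns on index labels, not on row position. If `df` had rows dropped earlier without its index being reset (an assumption, since this section does not show how `df` was built), the rows missing from it would appear as NaN in `df_ml_data` and the integer labels would be upcast to float, which is consistent with the `1.0` and `0.0` values shown above. A toy sketch of that behaviour, with hypothetical frames:

```python
import pandas as pd

# Hypothetical frames: `features` has a fresh 0..3 index, `labels` lost row 2.
features = pd.DataFrame({"pc_1": [0.1, 0.2, 0.3, 0.4]})
labels = pd.Series([1, 0, 2], index=[0, 1, 3], name="target")

# Assignment aligns on index labels, so row 2 receives NaN and the dtype becomes float.
by_label = features.copy()
by_label["target"] = labels

# Selecting only the rows that have a label keeps features and targets paired.
same_rows = features.loc[labels.index].reset_index(drop=True)
same_rows["target"] = labels.to_numpy()

print(by_label)
print(same_rows)
```

Comparing `df.index` with `df_ml_data.index` before the assignment is a quick way to confirm or rule out this cause.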


In [None]:
df_ml_data.shape

(465, 32)

In [None]:
print(df_ml_data["Defect Type Family using IEEE"].isnull().sum())
df_ml_data["Defect Type Family using IEEE"].value_counts()

58


1.0    270
0.0     80
2.0     37
3.0     20
Name: Defect Type Family using IEEE, dtype: int64

In [None]:
df_ml_data["Defect Type Family using IEEE"] = df_ml_data["Defect Type Family using IEEE"].fillna(4.0)
df_ml_data["Defect Type Family using IEEE"].value_counts()

1.0    270
0.0     80
4.0     58
2.0     37
3.0     20
Name: Defect Type Family using IEEE, dtype: int64

In [None]:
print(df_ml_data["Defect Type Family using ODC"].isnull().sum())
df_ml_data["Defect Type Family using ODC"].value_counts()

58


0.0    311
2.0     54
1.0     42
Name: Defect Type Family using ODC, dtype: int64

In [None]:
df_ml_data["Defect Type Family using ODC"] = df_ml_data["Defect Type Family using ODC"].fillna(3.0)
df_ml_data["Defect Type Family using ODC"].value_counts()

0.0    311
3.0     58
2.0     54
1.0     42
Name: Defect Type Family using ODC, dtype: int64

In [None]:
df_ml_data.isnull().sum()

pc_1                             0
pc_2                             0
pc_3                             0
pc_4                             0
pc_5                             0
pc_6                             0
pc_7                             0
pc_8                             0
pc_9                             0
pc_10                            0
pc_11                            0
pc_12                            0
pc_13                            0
pc_14                            0
pc_15                            0
pc_16                            0
pc_17                            0
pc_18                            0
pc_19                            0
pc_20                            0
pc_21                            0
pc_22                            0
pc_23                            0
pc_24                            0
pc_25                            0
pc_26                            0
pc_27                            0
pc_28                            0
pc_29               

In [None]:
# Split the Data into training and testing

X  = df_ml_data.iloc[:,:-2]
y1 = df_ml_data.iloc[:,-2]
y2 = df_ml_data.iloc[:,-1]

# One call splits all three arrays with the same shuffle, so X_train/X_test match both targets
X_train, X_test, y_train1, y_test1, y_train2, y_test2 = train_test_split(
    X, y1, y2, test_size=0.3, random_state=123)
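
The split above is purely random, so small classes can end up unevenly represented in the test set. A minimal sketch of a stratified split on synthetic data follows; `X_demo` and `y_demo` are hypothetical. Note that `stratify` takes a single label array, so with two targets one of them would have to be chosen.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0, 20 of class 1.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(100, 5))
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the class proportions the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123, stratify=y_demo
)

print(np.bincount(y_tr), np.bincount(y_te))
```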


### Exercise 5b (1 point) : Classification

* Classification-Target 1 (`"Defect Type Family using IEEE"`)
	
    - Perform classification using any ONE of your favorite Sklearn's classifier

    - Explain with metrics

* Classification-Target 2 (`"Defect Type Family using ODC"`)
	
    - Perform classification using any ONE of your favorite Sklearn's classifier

    - Explain with metrics

**Tip**: Train the model and predict in separate cells. It will save time when debugging.
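
Because one class dominates both targets, plain accuracy can look acceptable for a model that never predicts the smaller classes. As a hedged sketch on hypothetical labels, a majority-class baseline can be scored with accuracy, balanced accuracy, and macro F1 to give the classifier metrics a reference point.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical labels with one dominant class.
y_true = np.array([0] * 70 + [1] * 20 + [2] * 10)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_base)
bal_acc = balanced_accuracy_score(y_true, y_base)
macro_f1 = f1_score(y_true, y_base, average="macro", zero_division=0)

print("accuracy          :", acc)
print("balanced accuracy :", bal_acc)
print("macro F1          :", macro_f1)
```

A model is only useful here if it beats this baseline on the class-balanced metrics, not just on accuracy.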

In [None]:
# MODEL 1

model_1 = LogisticRegression(multi_class='ovr', class_weight='balanced')
model_1.fit(X_train, y_train1)
y_pred1 = model_1.predict(X_test)
model_1.score(X_test, y_test1)

0.22857142857142856

In [None]:
print(f"\n Confusion Matrix = \n{confusion_matrix(y_test1, y_pred1)}")
print(f"\n Classification Report = \n{classification_report(y_test1, y_pred1)}")


 Confusion Matrix = 
[[ 4  6 11  3  4]
 [10 23 20 13 16]
 [ 2  4  1  1  2]
 [ 1  2  1  1  0]
 [ 6  3  1  2  3]]

 Classification Report = 
              precision    recall  f1-score   support

         0.0       0.17      0.14      0.16        28
         1.0       0.61      0.28      0.38        82
         2.0       0.03      0.10      0.05        10
         3.0       0.05      0.20      0.08         5
         4.0       0.12      0.20      0.15        15

    accuracy                           0.23       140
   macro avg       0.20      0.18      0.16       140
weighted avg       0.41      0.23      0.28       140



In [None]:
# MODEL 2

model_2 = LogisticRegression(multi_class='ovr', class_weight='balanced')
model_2.fit(X_train, y_train2)
y_pred2 = model_2.predict(X_test)
model_2.score(X_test, y_test2)

0.32857142857142857

In [None]:
print(f"\n Confusion Matrix = \n{confusion_matrix(y_test2, y_pred2)}")
print(f"\n Classification Report = \n{classification_report(y_test2, y_pred2)}")


 Confusion Matrix = 
[[35 26 16 18]
 [ 4  2  3  2]
 [ 4  5  5  5]
 [ 6  2  3  4]]

 Classification Report = 
              precision    recall  f1-score   support

         0.0       0.71      0.37      0.49        95
         1.0       0.06      0.18      0.09        11
         2.0       0.19      0.26      0.22        19
         3.0       0.14      0.27      0.18        15

    accuracy                           0.33       140
   macro avg       0.27      0.27      0.24       140
weighted avg       0.53      0.33      0.39       140



In [None]:
# MODEL 3: SVC on Target 1 (IEEE)
from sklearn.svm import SVC

model_3 = SVC(gamma='auto', class_weight='balanced').fit(X_train, y_train1)
y_pred3 = model_3.predict(X_test)
model_3.score(X_test, y_test1)

0.5857142857142857

In [None]:
print(f"\n Confusion Matrix = \n{confusion_matrix(y_test1, y_pred3)}")
print(f"\n Classification Report = \n{classification_report(y_test1, y_pred3)}")


 Confusion Matrix = 
[[ 0 28  0  0  0]
 [ 0 82  0  0  0]
 [ 0 10  0  0  0]
 [ 0  5  0  0  0]
 [ 0 15  0  0  0]]

 Classification Report = 
              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00        28
         1.0       0.59      1.00      0.74        82
         2.0       0.00      0.00      0.00        10
         3.0       0.00      0.00      0.00         5
         4.0       0.00      0.00      0.00        15

    accuracy                           0.59       140
   macro avg       0.12      0.20      0.15       140
weighted avg       0.34      0.59      0.43       140



In [None]:
from sklearn.svm import SVC
model_4 = SVC(gamma='auto', class_weight='balanced').fit(X_train, y_train2)
y_pred4 = model_4.predict(X_test)
model_4.score(X_test, y_test2)

0.14285714285714285

In [None]:
print(f"\n Confusion Matrix = \n{confusion_matrix(y_test2, y_pred4)}")
print(f"\n Classification Report = \n{classification_report(y_test2, y_pred4)}")


 Confusion Matrix = 
[[ 0  9 85  1]
 [ 0  1 10  0]
 [ 0  1 18  0]
 [ 0  1 13  1]]

 Classification Report = 
              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00        95
         1.0       0.08      0.09      0.09        11
         2.0       0.14      0.95      0.25        19
         3.0       0.50      0.07      0.12        15

    accuracy                           0.14       140
   macro avg       0.18      0.28      0.11       140
weighted avg       0.08      0.14      0.05       140
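
The SVC fitted on the IEEE target predicts class 1.0 for all 140 test rows, and the one fitted on the ODC target assigns most rows to class 2.0. One thing that might be worth trying (an assumption, not something the notebook does) is tuning `C` and `gamma` with cross-validation scored by macro F1. The sketch below uses synthetic data and a hypothetical parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the 30 PCA features (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=8, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

# Hypothetical grid; sensible ranges depend on the real data.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(
    SVC(class_weight="balanced"),
    param_grid,
    scoring="f1_macro",  # every class contributes equally to the score
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The search should be run on the training split only, with the test split kept for the final confusion matrix and classification report.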

