# Applied Data Science and Machine Intelligence
## A program by IIT Madras and TalentSprint
### Mini Project 5 : Software Use Case - Issue Detection

## Learning Objectives

At the end of the mini project, you will be able to:

* Understand the dataset
* Perform extensive EDA and visualizations
* Handcraft the raw data into a form suitable for an ML problem
* Predict (classify) the bugs


## Information

### Issue Classification

A company that uses an online issue-tracking system often struggles to allocate its human and digital resources efficiently. This is because the people raising tickets sometimes file them under the wrong tag or category. Redirecting an issue to the right person then takes extra time, so linking each issue to the appropriate tag up front is essential for resolving it on time. This task can be handled by Natural Language Processing (NLP), an area of Machine Learning. In this Mini-Project we will use the fundamental building blocks of NLP to classify issues into appropriate categories based on the text body of the issue/ticket being raised.

### About the Dataset

This Mini-Project uses a dataset from [GitHub](https://github.com/roundcube/roundcubemail/issues). It contains the issues of the Roundcube mail application, with a software defect label attached to each issue.

**Python Packages used:**  

* [`Google.colab`](https://colab.research.google.com/notebooks/io.ipynb) for linking the notebook to your Google-drive
* [`Pandas`](https://pandas.pydata.org/docs/reference/index.html) for data frames and easy reading of CSV files
* [`Numpy`](https://numpy.org/doc/stable/reference/index.html#reference) for array and matrix mathematics functions
* [`sklearn`](https://scikit-learn.org/stable/user_guide.html) for pre-processing data, building ML models, and computing performance metrics
* [`seaborn`](https://seaborn.pydata.org/) and [`matplotlib`](https://matplotlib.org/) for plotting
* [`regex`](https://docs.python.org/3/library/re.html) and [`nltk`](https://www.nltk.org/) for text preprocessing


## Importing the packages

In [None]:
### The required libraries and packages ###
import pandas as pd
import numpy as np
from google.colab import drive
import seaborn as sns
import regex as re

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.decomposition import PCA
from sklearn.preprocessing import Normalizer, StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

import warnings
warnings.filterwarnings("ignore")

## Importing the Data

In [None]:
drive.mount('/content/drive')

In [None]:
# Keep the trailing slash so that path + file name forms a valid path
path = 'drive/MyDrive/<YOUR FOLDER NAME AS IT APPEARS ON GOOGLE DRIVE>/'

df_raw = pd.read_csv(path+'issues_data.csv')
print(df_raw.shape)
df_raw.head(2)

In [None]:
df = df_raw.copy()
df.head(2)

## Graded Exercises (10 points)

### Exercise 1 (1 point): Basic EDA

- Check the shape of the data
- Check the nulls present in each field
- Check the unique number of entries per field
- Drop the features that are either redundant or that do not help in modelling


**Hint** : Use the `pandas` module
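
The checks listed above can be sketched on a small stand-in frame. This is only an illustration: the frame `toy` and its column names are hypothetical and do not come from `issues_data.csv`, and which columns count as redundant in the real data is a judgment you need to make yourself.

```python
import pandas as pd

# Toy stand-in for the issues data; the column names are hypothetical.
toy = pd.DataFrame({
    "Issue id": [1, 2, 3, 4],
    "Issue Title": ["Login fails", "Crash on send", None, "Login fails"],
    "Project": ["roundcube"] * 4,   # constant column, carries no signal
})

print(toy.shape)                     # (rows, columns)
print(toy.isnull().sum())            # number of nulls per column
print(toy.nunique())                 # distinct non-null values per column
print(toy.describe(include="all"))   # summary statistics for every column

# Drop an identifier column and any column with at most one unique value
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]
toy_clean = toy.drop(columns=["Issue id"] + constant_cols)
print(toy_clean.columns.tolist())
```

Identifier-like columns (unique per row) and constant columns are typical candidates for removal, since neither helps a classifier separate the classes.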

In [None]:
# Check the shape of the data
# YOUR CODE HERE

In [None]:
# Check the nulls present in each field
# YOUR CODE HERE

In [None]:
# Check the unique number of entries per field
# YOUR CODE HERE

In [None]:
# Check the statistics of the data for each column
# YOUR CODE HERE

In [None]:
# Remove the unwanted columns
# YOUR CODE HERE

### Exercise 2 (3 points): Text Pre-Processing for Feature Columns

For each row of the data, write **functions** that perform the following steps separately for the feature columns `Issue Title` and `Issue Body`



In [None]:
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

##### **Function - 1: (To be performed for `Issue Title` ONLY)**

Hint: The following steps will be present in both functions

* Convert all text to lower case
* Remove punctuation, numbers, symbols and emojis, replacing them with empty spaces.

  This is done to ensure only the text is retained.
* Strip the excess spaces
* Remove the stop words using the English stop words of the nltk library
* Strip the excess spaces
* Remove words shorter than 3 letters (example: a, i, n, it, js, ab etc.)
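
As a rough illustration of the steps above, here is a minimal sketch using only Python's built-in `re` module. The names `toy_stop_words` and `clean_text_sketch` are hypothetical, and the tiny stop-word set is an assumption made to keep the example self-contained; in the notebook, the `stopwords` list loaded from nltk would take its place.

```python
import re

# Small stand-in stop-word set (an assumption for this sketch);
# the nltk English stop-word list would be used in the notebook.
toy_stop_words = {"the", "and", "when", "with", "not", "does"}

def clean_text_sketch(text):
    text = text.lower()                       # lower case
    text = re.sub(r"[^a-z]+", " ", text)      # anything that is not a letter becomes a space
    words = text.split()                      # split() also discards excess spaces
    words = [w for w in words if w not in toy_stop_words]   # drop stop words
    words = [w for w in words if len(w) >= 3]                # drop words shorter than 3 letters
    return " ".join(words)

print(clean_text_sketch("Error 500: the JS editor does NOT load when logged-in!! :)"))
# prints: error editor load logged
```

Splitting on whitespace and re-joining with a single space takes care of the "strip the excess spaces" steps without a separate call.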

In [None]:
def tp_title(text_data):

  # Step - Convert to lower case
  # YOUR CODE HERE

  # Step - Keep only alphabetical characters
  # YOUR CODE HERE

  # Step - Strip the excess spaces
  # YOUR CODE HERE

  # Step - Remove words smaller than 3 letters
  # YOUR CODE HERE

  # Step - Join Back the words
  # YOUR CODE HERE

  # Step - Remove the English Stop words
  # YOUR CODE HERE

  # stop_words_eng = set(nltk.corpus.stopwords.words('english'))
  # YOUR CODE HERE  

  # Step - Join Back the words
  # YOUR CODE HERE

  return text_data

##### **Function - 2 (To be performed for `Issue Body` ONLY)**

Hint: Either copy-paste the steps of Function-1, or call that function and add the steps below

* Split lines
* Remove first and last lines such as

  -`_Reported by L1Ntu on 17 Apr 2014 19:41 UTC as Trac ticket #1489818_`

  -`_Migrated-From: http://trac.roundcube.net/ticket/1489818_`

  by replacing them with a single space `" "` or an empty string `""`
* Remove lines containing URLs

  (**Hint**: such lines contain `http`, `www.`, `.com`, `.net` etc.) by replacing them with a single space or an empty string
* Join back the lines with a space `" "`
* Strip the excess spaces

**Hint:** For the tasks below, copy-paste the lines of code from `Function-1` above
* Convert all text to lower case
* Remove punctuation, numbers, symbols and emojis, replacing them with empty spaces.

  This is done to ensure only the text is retained.
* Strip the excess spaces
* Remove the stop words using the English stop words of the nltk library
* Strip the excess spaces
* Remove words shorter than 3 letters (example: a, i, n, it, js, ab etc.)
* Strip the excess spaces
* If the entry contains only a space `" "`, replace it with the empty string `""`
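
The line-level steps that are specific to `Issue Body` might look like the sketch below. It is one possible approach, not the expected solution: `strip_body_lines_sketch` and `url_terms_sketch` are hypothetical names, the metadata lines are matched by their prefix rather than by position (an assumption about how consistently they are formatted), and the `example.org` URL in the sample is made up.

```python
url_terms_sketch = ["www", "http", ".net", ".com"]

def strip_body_lines_sketch(body):
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        # Drop the "_Reported by ..._" and "_Migrated-From: ..._" metadata lines
        if stripped.startswith("_Reported by") or stripped.startswith("_Migrated-From"):
            continue
        # Drop any line that contains a URL fragment
        if any(term in stripped for term in url_terms_sketch):
            continue
        kept.append(stripped)
    # Join the lines with a space, then collapse repeated whitespace
    return " ".join(" ".join(kept).split())

sample = (
    "_Reported by L1Ntu on 17 Apr 2014 19:41 UTC as Trac ticket #1489818_\n"
    "\n"
    "Attachments are lost after saving a draft.\n"
    "See http://example.org/details for a screenshot.\n"
    "\n"
    "_Migrated-From: http://trac.roundcube.net/ticket/1489818_"
)
print(strip_body_lines_sketch(sample))
# prints: Attachments are lost after saving a draft.
```

The output of such a helper can then be passed through the same word-level cleaning used for the title.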

#### Using the functions:

Apply the above two functions to their respective columns to achieve the desired tasks.

In [None]:
def tp_body(text_data):

  # Step - Splitlines
  # YOUR CODE HERE

  # Step - Remove the first and last lines
  # YOUR CODE HERE

  # Step - Remove the lines that contain any URLs
  url_terms = ["www", "http", ".net", ".com"]
  # YOUR CODE HERE

  # Step - Join Back the lines
  # YOUR CODE HERE

  # Step - Strip the excess spaces
  # YOUR CODE HERE

  # Step - Convert to lower case
  # YOUR CODE HERE

  # Step - Keep only alphabetical characters
  # YOUR CODE HERE

  # Step - Strip the excess spaces
  # YOUR CODE HERE

  # Step - Remove words smaller than 3 letters
  # YOUR CODE HERE

  # Step - Join Back the words
  # YOUR CODE HERE

  # Step - Remove recurring Characters
  # YOUR CODE HERE

  # Step - Remove the English Stop words
  stop_words_eng = nltk.corpus.stopwords.words('english')
  # stop_words_eng = set(nltk.corpus.stopwords.words('english'))
  # YOUR CODE HERE  

  # Step - Join Back the words
  # YOUR CODE HERE

  return text_data  

In [None]:
# Apply the above functions here
df['Issue Body'] = df['Issue Body'].apply(lambda x: tp_body(x))
df['Issue Title'] = df['Issue Title'].apply(lambda x: tp_title(x))
df.head()

In [None]:
# Drop all the rows that have no content in them
# YOUR CODE HERE

### Exercise 3a (1 point): Feature Engineering Approach-1

* Combine the title and body strings by a space
*  the words
* Use `CountVectorizer` to tokenize and transform the text to features
* Reduce the features using PCA

In [None]:
corpus_body = list(df["Issue Title"]+" "+df["Issue Body"])
# YOUR CODE HERE

### Exercise 4a (1 point) : Data Preparation

* Check the value counts of the target variables to see the data imbalance
  - Merge the smaller classes into a bigger class so that the number of classes is between 3 and 4

* Perform Label Encoding for the Target variable classes

* Create a new DataFrame
  - Merge the DataFrame of PCA-reduced features with

    the Target variable-1 `"Defect Type Family using IEEE"` and

    the Target variable-2 `"Defect Type Family using ODC"`

* Split the above data into Training and Testing Datasets
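
The preparation steps above could be sketched as follows on made-up data. Everything here is hypothetical: the class names, the `"Other"` catch-all label, the threshold of 3 samples, and the random `pc_*` features standing in for the PCA output. Merging rare classes into a catch-all is one option; folding them into an existing large class is another.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced target column
toy_target = pd.Series(
    ["Logic"] * 8 + ["Interface"] * 6 + ["Data"] * 6 + ["Build"] * 2 + ["Docs"] * 2,
    name="toy_defect_type",
)
counts = toy_target.value_counts()
print(counts)

# Merge every class with fewer than 3 samples into a single catch-all class
rare = counts[counts < 3].index
merged = toy_target.where(~toy_target.isin(rare), "Other")
print(merged.value_counts())

# Label-encode the merged classes as integers
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(merged)

# Random numbers standing in for the PCA-reduced features
rng = np.random.default_rng(0)
toy_features = pd.DataFrame(rng.normal(size=(len(merged), 3)),
                            columns=["pc_1", "pc_2", "pc_3"])
toy_features["target"] = y_encoded

# Stratified split keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    toy_features.drop(columns="target"), toy_features["target"],
    test_size=0.25, random_state=42, stratify=toy_features["target"],
)
print(X_train.shape, X_test.shape)
```

Stratifying on the target is optional, but with imbalanced classes it helps ensure every class appears in the test set.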




In [None]:
# Check the class Distribution of the Target Variables
# YOUR CODE HERE

In [None]:
# Replace the minority classes into a class with larger count
# YOUR CODE HERE

In [None]:
# Check the class Distribution of the Target Variables AGAIN
# YOUR CODE HERE

In [None]:
# Label Encode the Target Variable-1
# YOUR CODE HERE

In [None]:
# Label Encode the Target Variable-2
# YOUR CODE HERE

In [None]:
# MERGE the features and target in a single DataFrame
# YOUR CODE HERE

In [None]:
# Check for nulls if any, and fill the values with a new class (integer)
# YOUR CODE HERE

In [None]:
# Split the Data into training and testing, for Target Variable-1 
# YOUR CODE HERE

In [None]:
# Split the Data into training and testing, for Target Variable-2
# YOUR CODE HERE

### Exercise 5a (1 point) : Classification

* Classification-Target 1 (`"Defect Type Family using IEEE"`)

    - Perform classification using any ONE of your favorite sklearn classifiers

    - Explain with metrics

* Classification-Target 2 (`"Defect Type Family using ODC"`)

    - Perform classification using any ONE of your favorite sklearn classifiers

    - Explain with metrics

**Tip**: Train the model and predict in separate cells. It will save time when debugging.
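
The fit, predict and evaluate cycle can be sketched on synthetic data as below. `LogisticRegression` is used purely as an example of an sklearn classifier, not as a recommendation for this dataset, and the data from `make_classification` is a stand-in for the PCA features and encoded targets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Synthetic 3-class problem standing in for the real features and target
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2,
                                      random_state=0, stratify=y_toy)

# Train on the training split only
clf = LogisticRegression(max_iter=1000)
clf.fit(Xtr, ytr)

# Predict on the held-out test split
y_pred = clf.predict(Xte)

# Rows of the confusion matrix are true classes, columns are predicted classes
cm = confusion_matrix(yte, y_pred)
print(cm)
print(classification_report(yte, y_pred))
```

With imbalanced classes, the per-class precision, recall and F1 values in the report are usually more informative than the overall accuracy.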

In [None]:
# Import the classifier Functions
# YOUR CODE HERE

In [None]:
# MODEL 1

# Call and fit the model on Training Data
# Predict on the test data
# YOUR CODE HERE

In [None]:
# Print confusion_matrix and classification_report
# YOUR CODE HERE

In [None]:
# MODEL 2

# Call and fit the model on Training Data
# Predict on the test data
# YOUR CODE HERE

In [None]:
# Print confusion_matrix and classification_report
# YOUR CODE HERE

#### YOUR FINDINGS / Reasoning for which model is better and why (Qualitatively and Quantitatively)

Explain why one model behaves better than the other(s) in terms of Accuracy, Precision, Recall and F1-Score

### Exercise 3b (1 point): Feature Engineering Approach-2

* Combine the title and body strings by a space
*  the words
* Use `TfidfVectorizer` to tokenize and transform the text to features
* Reduce the features using PCA

In [None]:

corpus_body = list(df["Issue Title"]+" "+df["Issue Body"])
vectorizer_body = TfidfVectorizer()
Xbody           = vectorizer_body.fit_transform(corpus_body)
X_body          = Xbody.toarray()

features_body   = vectorizer_body.get_feature_names_out()
print(X_body.shape)


n_pc = 30
pca = PCA(n_components=n_pc, svd_solver='full')
X_body_new = pca.fit_transform(X_body)

print(f"explained_variance_ratio_ = \n{pca.explained_variance_ratio_}")
print(f"\nsingular_values_ = \n{pca.singular_values_}")

from sklearn.preprocessing import Normalizer
X_body_new = Normalizer().fit_transform(X_body_new)


# Column names pc_1 ... pc_n for the principal components
cols_pc = [f"pc_{nth_pc+1}" for nth_pc in range(n_pc)]


df_body_counts  = pd.DataFrame(data=X_body_new, columns=cols_pc)
# Replace any NaN with 0 (np.NaN was removed in NumPy 2.0, so fillna is used instead)
df_body_counts = df_body_counts.fillna(0)
df_body_counts

### Exercise 4b (1 point) : Data Preparation

* Check the value counts of the target variables to see the data imbalance
  - Merge the smaller classes into a bigger class so that the number of classes is between 3 and 4

* Perform Label Encoding for the Target variable classes

* Create a new DataFrame
  - Merge the DataFrame of PCA-reduced features with

    the Target variable-1 `"Defect Type Family using IEEE"` and

    the Target variable-2 `"Defect Type Family using ODC"`

* Split the above data into Training and Testing Datasets




In [None]:
# Check the class Distribution of the Target Variables
# YOUR CODE HERE

In [None]:
# Replace the minority classes into a class with larger count
# YOUR CODE HERE

In [None]:
# Check the class Distribution of the Target Variables AGAIN
# YOUR CODE HERE

In [None]:
# Label Encode the Target Variable-1
# YOUR CODE HERE

In [None]:
# Label Encode the Target Variable-2
# YOUR CODE HERE

In [None]:
# MERGE the features and target in a single DataFrame
# YOUR CODE HERE

In [None]:
# Check for nulls if any, and fill the values with a new class (integer)
# YOUR CODE HERE

In [None]:
# Split the Data into training and testing, for Target Variable-1 
# YOUR CODE HERE

In [None]:
# Split the Data into training and testing, for Target Variable-2
# YOUR CODE HERE

### Exercise 5b (1 point) : Classification

* Classification-Target 1 (`"Defect Type Family using IEEE"`)

    - Perform classification using any ONE of your favorite sklearn classifiers

    - Explain with metrics

* Classification-Target 2 (`"Defect Type Family using ODC"`)

    - Perform classification using any ONE of your favorite sklearn classifiers

    - Explain with metrics

**Tip**: Train the model and predict in separate cells. It will save time when debugging.

In [None]:
# Import the classifier Functions
# YOUR CODE HERE

In [None]:
# MODEL 1

# Call and fit the model on Training Data
# Predict on the test data
# YOUR CODE HERE

In [None]:
# Print confusion_matrix and classification_report
# YOUR CODE HERE

In [None]:
# MODEL 2

# Call and fit the model on Training Data
# Predict on the test data
# YOUR CODE HERE

In [None]:
# Print confusion_matrix and classification_report
# YOUR CODE HERE

## Additional Ungraded Exercise for Practice:

From the Data Perspective:

- Try taking ONLY the Issue Title as the feature set
- Try taking ONLY the Issue Body as the feature set
- Try various data scaling techniques

From the ML Model Perspective:
- Try out other ML models
- Try GridSearch
- Try Cross-Validation techniques
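
For the model-side suggestions, a grid search with cross-validation might be set up as in the sketch below. The choice of `SVC`, the parameter grid, the 5 folds and the `f1_macro` scoring are all assumptions made for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the engineered features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=4,
                                     n_classes=3, random_state=1)

# Scaling sits inside the pipeline so it is re-fitted on each training fold
pipe = Pipeline([("scale", StandardScaler()), ("model", SVC())])
param_grid = {"model__C": [0.1, 1, 10], "model__kernel": ["linear", "rbf"]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1_macro")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the scaler inside the `Pipeline` avoids leaking statistics from the validation folds into training, which would otherwise make the cross-validated scores look better than they are.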