# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

## Problem Statement

The problem is to identify the subcategory and classify the question based on the group it belongs to.



## Learning Objectives

At the end of the experiment, you will be able to understand:

*   Beautiful Soup
*   Use NLTK package
*   Text Representation
*   Classification

In [None]:
#@title  Mini Hackathon Walkthrough
from IPython.display import HTML

HTML("""<video width="320" height="240" controls>
  <source src="https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/aiml_batch_15/preview_videos/Mini_Hackathon_Aptitude_Classification.mp4" type="video/mp4">
</video>
""")

## Dataset
Being able to classify the questions will be difficult in natural language processing. The dataset is taken from the TalentSprint aptitude questions which contains more than 20K questions.

## Description
This dataset has the following columns:
1. **Category:** Gives the high-level categorization of the question
2. **Sub-Category:** Determines the type of questions
3. **Article:** Gives the article name of the question
4. **Questions:** Questions are listed
5. **Answers:** Contains answers


### Grading = 20 Marks

In [1]:
#@title Download the datasets
from IPython import get_ipython

ipython = get_ipython()

def setup(): 
    ipython.magic("sx wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Aptitude_Classification_data.csv")
    ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/Mentors_Test_Data.csv")
    from IPython.display import HTML, display
    print("Setup completed successfully")
    return

setup()

Setup completed successfully


In [18]:
# Import Python Libraries
from bs4 import BeautifulSoup
import nltk
import re
import string
import warnings
import numpy as np
import pandas as pd
from collections import Counter
from nltk.stem import WordNetLemmatizer 
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
nltk.download('wordnet')
warnings.filterwarnings('ignore')
nltk.download('punkt')
nltk.download("stopwords")
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
!ls

Aptitude_Classification_data.csv  Mentors_Test_Data.csv  sample_data


##   **Stage 1**:  Dataset Preparation

### 1 Mark -> Load the data set and prepare the data based on group allocation. 

Each group should consider their respective sub-categories as mentioned below:

> Team A = Groups 1, 4, 7, 10, 13, 16;   &nbsp; &nbsp;   Sub-Category = Misspell words, Algebra, Percentages, Mathematical operations, Probability

> Team B = Groups 2, 5, 8, 11, 14, 17; &nbsp; &nbsp;   Sub-Category = Finding Errors, Ratio and Proportion, Logarithms, Time and Distance, Simple and Compound Interest

> Team C = Groups 3, 6, 9, 12, 15, 18;  &nbsp; &nbsp;  Sub-Category =  Synonyms and Antonyms, Time and Work, Permutations and Combinations, LCM and HCF, Profit and Loss


**Hint:** &nbsp; To access Sub-Categories from given Data, refer [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html)

In [6]:
# YOUR CODE HERE TO LOAD THE APTITUDE CLASSIFICATION DATASET & EXTRACT THE DATA BASED ON YOUR SUB-CATEGORIES
# Team C: 
data = pd.read_csv("/content/Aptitude_Classification_data.csv")
data.shape



(4631, 5)

In [9]:
data.head()

Unnamed: 0,Category,Sub-Category,Article,Questions,Answers
0,Verbal,Misspell words,chapter 1,Which of the following is correct?\n\n\n\n\n,2
1,Quantitative,Time and Distance,Time and Distance - Model 05,Rohan leaves point A and reaches point B in 6 ...,2
2,Verbal,Finding Errors,44054,Read the sentence to find out whether there is...,2
3,Quantitative,Time and Work,tech mahindra_5th August,4 men can check exam papers in 8 days working ...,2
4,Quantitative,Permutations and Combinations,AX10DEPT01,"From 13 persons, how many ways of selection of...",2


In [14]:
# Boolean Indexing in Pandas :  filter values of a column based on conditions from another set of columns from a Pandas Dataframe
TeamC = ['Synonyms and Antonyms', 'Time and Work', 'Permutations and Combinations', 'LCM and HCF', 'Profit and Loss']
DataC = data.loc[data['Sub-Category'].isin(TeamC)]


 

In [21]:
DataC.head()

Unnamed: 0,Category,Sub-Category,Article,Questions,Answers
3,Quantitative,Time and Work,tech mahindra_5th August,4 men can check exam papers in 8 days working ...,2
4,Quantitative,Permutations and Combinations,AX10DEPT01,"From 13 persons, how many ways of selection of...",2
6,Quantitative,Time and Work,2015,"<span style=\""font-size: small;\""><span lang=\...",4
7,Quantitative,Time and Work,2015,"<span style=\""font-size: small;\""><span lang=\...",3
8,Quantitative,Profit and Loss,PLT10,The ratio between the sale price and the cost ...,1


In [16]:
DataC.shape

(1504, 5)

## **Stage 2:** Data Pre-Processing

### 3 Marks -> Clean and Transform the data into a specified format

*   Remove the rows of the Questions column which contains blank / NaN.


*   Few set of questions have HTML tags within the question.
  - You can use Beautiful Soup library to convert HTML into text (Refer **"Dealing with HTML"** section from this [link](https://www.nltk.org/book/ch03.html).)


*  Consider Question column as feature and Sub-category as target variable. Convert Sub-category into numerical.

*  Drop the unwanted columns


  **Hint:** Use Label Encoder for obtaining a numeric representation, refer to the [link](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). 

In [19]:
# Intializing nltk requirements for pre-processing
lemmatizer = WordNetLemmatizer()
stoplist = set(stopwords.words('english')) 

In [20]:
# Tokenize the sentence and get vocab words
def Tokenize(AllQuestions):
  pre_processed_words = []
  for each in AllQuestions:
    words = word_tokenize(each)
    words = [lemmatizer.lemmatize(w) for w in words]
    pre_processed_words.extend(words)

  pre_processed_words = set(pre_processed_words)

  pre_processed_words = [word for word in pre_processed_words if word not in stoplist]
  return pre_processed_words

In [None]:
# YOUR CODE HERE for BeatifulSoup
DataC['Questions'] = [BeautifulSoup(text).get_text() for text in DataC['Questions'] ]
DataC.Questions

3       4 men can check exam papers in 8 days working ...
4       From 13 persons, how many ways of selection of...
6       Suppose q is the number of workers employed by...
7       There is a group of persons each of them can c...
8       The ratio between the sale price and the cost ...
                              ...                        
4617    In how many ways can a committee of 5 members ...
4618    A Fruit seller buys some oranges at the rate o...
4626    Construction of a road was entrusted to a civi...
4628    Choose the word or phrase which is the best sy...
4629                        Give the antonym of CENSURE\n
Name: Questions, Length: 1504, dtype: object

In [None]:
DataC.columns

Index(['Category', 'Sub-Category', 'Article', 'Questions', 'Answers'], dtype='object')

In [None]:
d1 = DataC.drop(['Category','Article','Answers'], axis=1)
d1.to_csv('DataC.csv')

In [None]:
DataC1 = pd.read_csv('DataC.csv')

In [None]:
X = DataC1.drop("Sub-Category",1)   #Features 
y = DataC1["Sub-Category"]          #Target Variable

X.shape, y.shape

((1504, 2), (1504,))

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder() # DO NOT CHANGE THIS LINE as we will be using for the Test evaluation.

# YOUR CODE HERE for Fit label encoder and return encoded labels
le.fit(DataC1['Sub-Category'])
print(list(le.classes_))
DataC1['Sub-Category'] = le.fit_transform(DataC1['Sub-Category'])

['LCM and HCF', 'Permutations and Combinations', 'Profit and Loss', 'Synonyms and Antonyms', 'Time and Work']


In [None]:
DataC1['Sub-Category'] 

0       4
1       1
2       4
3       4
4       2
       ..
1499    1
1500    2
1501    4
1502    3
1503    3
Name: Sub-Category, Length: 1504, dtype: int64

## **Stage 3:** Text representation using Bag of Words (BOW)

### 3 Marks -> a) Get valid words from all questions & add them to a list.


Treat each question as a separate document and get the list of words using the following:
1.   Split the sentence into words

2.   Remove Stop words. Use NLTK packages for getting the Stop words.

3.   Replace proper names with "name" 
  - Example: "Rahul" -> "name"
       
4.   Remove the single white space character (\n, \t, \f, \r), refer [link](https://developers.google.com/edu/python/regular-expressions)

5.   Ignore words whose length is less than 3 (Eg: 'is', 'a').

6.   Remove punctuation and non-alphabetic words

7.   Convert the text to lowercase

8.   Use the Porter Stemmer to normalize the words


Refer [link](https://www.nltk.org/book/ch03.html) for extracting the words.

Refer [link](https://medium.com/free-code-camp/an-introduction-to-bag-of-words-and-how-to-code-it-in-python-for-nlp-282e87a9da04) for more information.

In [None]:
from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 
  
print("plays :", lemmatizer.lemmatize("plays")) 
print("corpora :", lemmatizer.lemmatize("corpora"))

plays : play
corpora : corpus


In [None]:
def extract_words(question):
    # YOUR CODE HERE
    # Hint: Extract words for each question using the above 8 instructions.
    # remove single white space 
    #tokens = re.search(r'\n \t \f \r', question)
    
    
    tokens = re.sub(r"\s+", "", question)
    tokens = word_tokenize(question)

    # Remove stop words
    tokens = [word for word in tokens if word.isalpha()]
    # Filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # Filter out short tokens
    tokens = [word for word in tokens if len(word) > 4]
    #print(tokens)
    # Remove proper name
    tokensTags = nltk.tag.pos_tag(tokens)
    tokens = [word for word,tag in tokensTags if tag != 'NNP' and tag != 'NNPS']
    tokens = ['name' if  tag == 'NNP' else word for word,tag in tokensTags]
    tokens1 = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(tokens1).lower() for w in tokens]
    #convert text to lower case
    tokens = [w.lower() for w in tokens]
    # Normalize the word using porter stemmer
    porter = nltk.PorterStemmer()
    tokens = [porter.stem(t) for t in tokens]
    #tokens = [lemmatizer.lemmatize(l) for l in tokens]


       
    return tokens
    

In [None]:
# Use the function to extract words for all questions
def tokenize(questions): # The method iterates all the sentences and adds the extracted word into an array.
  valid_words = []
  for question in questions.itertuples(index=True, name='Pandas'):
    #print (getattr(row, "Questions"))
    words = extract_words(getattr(question, "Questions"))
    
    for word in words:
      if word not in valid_words:
        valid_words.append(word)
    #valid_words.extend(words)
    #valid_words.append(words)
    #valid_words = sorted(list(set(valid_words)))
    #print(len(valid_words))  
  return valid_words
 

In [None]:

vocab = tokenize(DataC1)
print(len(vocab))
word_vector_size = len(vocab)
#print(valid_words)

1286


In [None]:
len(vocab)

1286

### 4 Marks -> b) Generate vectors that can be used by the machine learning algorithm


1.   The length of the vector for each question will be the length of the valid words. Initialize each vector with all Zeros

2.   Compare each valid word with the words in question and generate the vectors based on the counter frequency of the word in that question.



In [None]:
def generate_vectors(question):
    # YOUR CODE HERE
    # Hint: Initialize each vector with all zeros. 
  
    bow_representation = np.zeros(len(vocab))


    # Extracting words for each question and count the words
    words = extract_words(question)
    word_dict = Counter(words)
    
    for word in word_dict.keys():
      bow_representation[vocab.index(word)] +=word_dict[word]
    print(bow_representation)
    return bow_representation
    # YOUR CODE HERE 
    # Hint: If the word is in valid words then generate the vectors based on the counter frequency of the word in that question.

In [None]:
# Use the above function for collecting the vectors of all questions into a list.
# YOUR CODE HERE
vectors_of_all_questions = []
for question in DataC1.Questions:
  vectors_of_all_questions.append(generate_vectors(question))
vectors_of_all_questions = np.array(vectors_of_all_questions)


## **Stage 4:** Classification

### 3 Marks -> Perform a Classification 

1.   Identify the features and labels

2.   Use train_test_split for splitting the train and test data

3.   Fit your model on the train set using fit() and perform prediction on the test set using predict()

4. Get the accuracy of the model

## Expected Accuracy above 90%


In [None]:
features_X = vectors_of_all_questions   #Features 
labels_y = DataC1['Sub-Category'] #Target Variable
labels_y = labels_y.to_numpy().reshape(-1,1)
features_X.shape, labels_y.shape

((1504, 1286), (1504, 1))

In [None]:
from sklearn.model_selection import train_test_split
# YOUR CODE HERE
X_train,X_test,y_train,y_test=train_test_split(features_X,labels_y,test_size=0.33,random_state=42)

print(type(y_train))
#neigh = KNeighborsClassifier(n_neighbors=5)
neigh = DecisionTreeClassifier(criterion="entropy", max_depth=50)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
score = accuracy_score(y_pred, y_test)
score



<class 'numpy.ndarray'>


0.9134808853118712

In [None]:
X_pred= neigh.predict(X_train)
score = accuracy_score(X_pred, y_train)
score

0.9980139026812314

## **Stage 5:** Evaluation (This is for Mentors)

### 6 Marks -> Evaluate with the given test data 

1.  Loading the Test data

2.  Converting the Test data into vectors

3.  Pass through the model and verify the accuracy

## Expected Accuracy above 90%


In [None]:
# YOUR CODE HERE for selecting the trained classifier model, eg: MODEL = decision_tree
MODEL = neigh

Test_data = pd.read_csv("Mentors_Test_Data.csv")
Test_data = Test_data[Test_data['Sub-Category'].isin(le.classes_)]
labels = le.transform(Test_data['Sub-Category'])
Test_questions= Test_data['Questions']

Test_BOW=[]
for TQ in Test_questions: 
  Test_vectors = generate_vectors(TQ) 
  Test_BOW.append(Test_vectors)

predict = MODEL.predict(Test_BOW) 
accuracy_score(labels, predict)

[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]


ValueError: ignored