# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

### Learning Objectives:

At the end of the experiment, you will be able to:

* generate word vectors using a pretrained Word2Vec model

In [None]:
#@title Experiment Walkthrough Video
from IPython.display import HTML

HTML("""<video width="520" height="440" controls>
  <source src="https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Walkthrough/Aptitude_classification.mp4">
</video>
""")

## Dataset

Classifying questions by type is a challenging natural language processing task. The dataset consists of aptitude questions from TalentSprint.

## Description
This dataset has the following columns:
1. **Category:** Gives the high-level categorization of the question
2. **Sub-Category:** Determines the type of questions
3. **Article:** Gives the article name of the question
4. **Questions:** Questions are listed
5. **Answers:** Contains answers



The dataset used in this experiment is partially pre-processed: HTML tags (using BeautifulSoup) and punctuation have been removed.
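
The kind of cleaning described above might look roughly like the following sketch. It is an assumption about the processing involved, not the script that produced the dataset: it uses only the standard library (a regular expression in place of BeautifulSoup), and the helper name `clean_question` and the sample string are hypothetical.

```python
import html
import re
import string

# Hypothetical helper: strips HTML tags and punctuation from one question.
# The dataset itself was cleaned with BeautifulSoup; this regex-based
# version is only a simplified stand-in.
_TAG_RE = re.compile(r"<[^>]+>")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_question(raw):
    text = _TAG_RE.sub(" ", raw)          # drop tags such as <p> or <b>
    text = html.unescape(text)            # decode entities such as &amp;
    text = text.translate(_PUNCT_TABLE)   # remove ASCII punctuation
    return " ".join(text.split())         # collapse repeated whitespace

sample = "<p>Find the <b>misspelt</b> word: accomodate, necessary.</p>"
cleaned = clean_question(sample)
print(cleaned)
```

A regular expression is enough for a toy string like this one; for real, possibly malformed HTML a proper parser such as BeautifulSoup is the safer choice.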


In [None]:
! wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Cleaned_Aptitude_Classification.csv
! wget https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/experiment_related_data/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar
! unrar e /content/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar 
    

### Importing required packages





In [None]:
import nltk
import gensim
import pandas as pd
from nltk.stem import WordNetLemmatizer 
from nltk import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download("stopwords")
nltk.download('wordnet')

### Data loading and preparation

Load the aptitude classification dataset, which contains aptitude questions from various sub-categories.


In [None]:
data = pd.read_csv("/content/Cleaned_Aptitude_Classification.csv")
data.shape

In [None]:
data.head()

The loaded data contains 15 sub-categories. This experiment uses two of them: Misspell words and Finding Errors.

In [None]:
# Extract the questions of the two chosen sub-categories
category1_Que = data[data['Sub-Category'] == 'Misspell words']['Questions'].values
category2_Que = data[data['Sub-Category'] == 'Finding Errors']['Questions'].values

In [None]:
# Print a sample question from the first chosen sub-category
category1_Que[0]

### Pre-processing and tokenization

Pre-process the text and tokenize it to obtain the vocabulary of each chosen sub-category.

In [None]:
# Initializing nltk requirements for pre-processing
lemmatizer = WordNetLemmatizer()
stoplist = set(stopwords.words('english')) 

In [None]:
# Tokenize the questions and return the vocabulary (unique words) without stop words
def Tokenize(AllQuestions):
  pre_processed_words = []
  for each in AllQuestions:
    # Split the question into tokens and lemmatize each token
    words = word_tokenize(each)
    words = [lemmatizer.lemmatize(w) for w in words]
    pre_processed_words.extend(words)

  # Keep each word only once
  pre_processed_words = set(pre_processed_words)

  # Drop English stop words
  pre_processed_words = [word for word in pre_processed_words if word not in stoplist]
  return pre_processed_words

In [None]:
# Calling the above Tokenize function to get vocab words of both sub-categories
category1_words = Tokenize(category1_Que)
category2_words = Tokenize(category2_Que)

# Combining the words of two sub-categories
all_words = category1_words + category2_words
print("Number of valid words after pre-processing:", len(all_words))

### Load the word2vec model

Load the pretrained model with Gensim.

  * Gensim is an open-source Python library for natural language processing. It is developed and maintained by the Czech natural language processing researcher Radim Řehůřek and his company RaRe Technologies.

  * Use Gensim to load a word2vec model pretrained on Google News, covering approximately 3 million words and phrases. Each vector has 300 dimensions.

  * Load the Google News .bin file with **limit = 500000** so that only the first 500,000 word vectors are read, which reduces memory use. Set **binary = True** because the file is stored in binary word2vec format; with **binary = False** (the default) the file is read as plain text.
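
To make the file layout and the effect of `limit` concrete, here is a minimal sketch of a reader for the plain-text word2vec format. It is not Gensim's implementation: `read_text_vectors` is a hypothetical helper and the three-word file is made up. The plain-text format starts with a header line giving the vocabulary size and vector size, followed by one word and its values per line.

```python
import io

# Hypothetical, simplified reader for the plain-text word2vec format.
# It only illustrates the file layout and what a `limit` argument does.
def read_text_vectors(handle, limit=None):
    vocab_size, vector_size = map(int, handle.readline().split())
    count = vocab_size if limit is None else min(limit, vocab_size)
    vectors = {}
    for _ in range(count):
        parts = handle.readline().rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != vector_size:
            raise ValueError(f"unexpected vector length for {word!r}")
        vectors[word] = values
    return vectors

# Toy file: header "<vocab size> <vector size>", then one word per line.
toy_file = io.StringIO(
    "3 2\n"
    "the 0.1 0.2\n"
    "tree 0.3 0.4\n"
    "aptitude 0.5 0.6\n"
)
subset = read_text_vectors(toy_file, limit=2)
print(sorted(subset))
```

With `limit=2` only the first two entries are kept, so a word that appears later in the file (here `aptitude`) is not available afterwards. The same applies to the real model: words outside the first 500,000 entries cannot be looked up.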


In [None]:
# Load the 300-dimensional vectors directly from the file. The file is in binary (.bin) format, so pass binary=True (the default is False)
model = gensim.models.KeyedVectors.load_word2vec_format('AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.bin', binary=True, limit=500000)

In [None]:
# The pretrained model represents each word as a 300-dimensional vector
print("Dimension of the vector for 'tree': ", len(model['tree']))

### Generate vectors for each word

Words that appear in both sub-categories would have the same vector representation but different labels, which may reduce classification accuracy. Such words are therefore ignored.
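
The filtering idea can be sketched with set operations on toy data. The two word lists below are made up for illustration and stand in for the real vocabularies of the two sub-categories.

```python
# Toy vocabularies standing in for the words of the two sub-categories.
toy_category1 = ["spelling", "correct", "word", "choose"]
toy_category2 = ["error", "sentence", "word", "choose"]

# Words that occur in both vocabularies would get conflicting labels.
shared = set(toy_category1) & set(toy_category2)

# Keep only the words unique to one sub-category, labelled 0 or 1.
labelled = [(w, 0) for w in toy_category1 if w not in shared]
labelled += [(w, 1) for w in toy_category2 if w not in shared]
print(labelled)
```

Note that a boolean condition must be negated with `not`. Python's bitwise `~` is not a logical negation: `~True` is `-2` and `~False` is `-1`, and both are truthy.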

In [None]:
# Get the vector representation of all the extracted words of the two sub-categories
vectors, labels = [], []
for word in all_words:
  try:
    # Ignore the words that appear in both sub-categories
    if not (word in category1_words and word in category2_words):
      vectors.append(model[word])
      if word in category1_words:
        labels.append(0)
      else:
        labels.append(1)
  except KeyError:
    # Skip words that are missing from the pretrained model's vocabulary
    pass
print("Number of words:", len(labels))
print("Number of dimensions in each vector:", len(vectors[0]))

### Split the Data into train and test

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(vectors, labels, test_size=0.2, random_state=42)

### Fit the model and calculate the accuracy

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

print("Accuracy of the model:",clf.score(X_test, y_test))
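
For a scikit-learn classifier, `score` returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label. A minimal sketch of that calculation, using made-up labels rather than results from this experiment:

```python
# Hypothetical true and predicted labels for five test words.
toy_y_true = [0, 1, 1, 0, 1]
toy_y_pred = [0, 1, 0, 0, 1]

# Accuracy = number of matching labels / number of samples.
correct = sum(t == p for t, p in zip(toy_y_true, toy_y_pred))
toy_accuracy = correct / len(toy_y_true)
print("Accuracy:", toy_accuracy)
```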

### Ungraded Exercise: 

Choose any other two sub-categories and generate their vector representations using word2vec.

In [None]:
# YOUR CODE HERE