<a href="https://colab.research.google.com/github/Dilavarj7/Automated-Question-Answering-System--Team1---Capstone-Project/blob/main/Q_A_System_building_Team1_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**
We address the challenge by applying deep learning algorithms to textual data.
The approach is Extractive Question Answering, in which an answer is
extracted from a text given a question.

### 1.2.1 Topic Modelling
This is a theme extraction task on a collection of data-science-specific documents, which can be done
via Latent Dirichlet Allocation (LDA). The topic model should identify the important themes of a
document and list the top-N constituent words of each theme/topic.
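
To make "top-N constituent words" concrete, here is a minimal sketch of how they can be read off a fitted topic model. It assumes the model exposes a topic-by-word weight matrix (in scikit-learn's `LatentDirichletAllocation` this is the `components_` attribute); the vocabulary, the weights, and the helper name `top_n_words` are hypothetical and only illustrate the ranking step.

```python
# Hypothetical vocabulary and topic-word weights (rows = topics, columns = words).
# With scikit-learn these would come from the vectorizer's vocabulary and from
# LatentDirichletAllocation.components_ after fitting.
vocabulary = ["model", "data", "prior", "python", "function", "posterior"]
topic_word_weights = [
    [0.9, 4.0, 6.5, 0.2, 0.3, 7.1],  # a Bayesian-statistics-like topic
    [1.5, 2.0, 0.1, 8.2, 5.4, 0.2],  # a programming-like topic
]


def top_n_words(weights, vocab, n):
    """Return the n highest-weighted words for each topic, best first."""
    topics = []
    for row in weights:
        # Rank word indices by descending weight within this topic
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        topics.append([vocab[j] for j in ranked[:n]])
    return topics


top_words = top_n_words(topic_word_weights, vocabulary, n=3)
for topic_id, words in enumerate(top_words):
    print(f"Topic {topic_id}: {', '.join(words)}")
```

A word can rank highly in several topics (here `data` appears in both), which is expected for LDA since topics are distributions over the whole vocabulary.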
### 1.2.2 Extractive Question Answering
Extractive Question Answering is the task of extracting an answer from a text given a question. Here, the
text is the group of documents with the highest concentration of the topic
closest to the question asked.
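
One way to read "the documents with the highest concentration of the topic closest to the question" is as a ranking problem: represent the question and every document as a topic distribution, then keep the documents most similar to the question. The sketch below assumes cosine similarity as the closeness measure; the topic distributions and the helper names `cosine_similarity` and `retrieve_documents` are hypothetical, not part of the project code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_documents(question_topics, doc_topics, k):
    """Return the k document names whose topic mix is closest to the question's."""
    ranked = sorted(
        doc_topics,
        key=lambda name: cosine_similarity(question_topics, doc_topics[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical topic distributions over three topics
doc_topics = {
    "thinkbayes.txt": [0.80, 0.15, 0.05],
    "thinkpython.txt": [0.05, 0.90, 0.05],
    "speech_recog.txt": [0.10, 0.10, 0.80],
}
question_topics = [0.70, 0.20, 0.10]

context_docs = retrieve_documents(question_topics, doc_topics, k=2)
print(context_docs)
```

The retrieved documents would then be passed as the context from which the extractive question answering model selects an answer span.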


## **1.2.2.1 Head-start References**
❖ https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
<br>❖ https://pyldavis.readthedocs.io/en/latest/readme.html
<br>❖ https://huggingface.co/transformers/usage.html#extractive-question-answering
### NOTE - The solution should not be limited to the above references; students are encouraged to read relevant research papers.
### 1.3 Scope of project
<br>A. The topic model should be able to identify/extract important topics.
<br>B. The topic model will be built on the corpus of Data Science documents.
<br>C. The topic model should yield the most relevant and stable topics, as measured by the perplexity score.
<br>D. Once the relevant documents have been retrieved, the extractive question answering model will generate the answer to the question.
<br>E. The entire dual-model pipeline will be deployed on AWS/GCP/Azure.
<br>F. The dual-model pipeline must be accessible via a web application (Streamlit) for demo purposes.
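
Since the scope measures topic quality through the perplexity score, a small worked example may help. Perplexity is the exponential of the negative average log-likelihood per token, `exp(-(1/N) * sum(log p(w_i)))`, so lower values mean the model finds held-out text less surprising. The sketch below uses hypothetical per-token probabilities; in practice scikit-learn's `LatentDirichletAllocation` provides a `perplexity(X)` method that computes an approximate value for a fitted model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-(average log-likelihood per token))."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_log_likelihood = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-avg_log_likelihood)


# Hypothetical log-probabilities assigned to 8 held-out tokens by two models
model_a = [math.log(0.25)] * 8  # every token gets probability 0.25
model_b = [math.log(0.10)] * 8  # every token gets probability 0.10

print(perplexity(model_a))
print(perplexity(model_b))
```

Model A is preferred here because its perplexity is lower. Comparing perplexity across several candidate topic counts on held-out documents is one common way to choose the number of topics.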


# **Part-1 - Making DataFrame in CSV Form**

## **Importing Important Libraries**

In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
lst = ['thinkpython.txt','thinkbayes.txt','Applied Data Science.txt','The Data Science Design Manual.txt',
       'Ethics and Data Science.txt','Foundation_of_datascience.txt','GangBoard.txt','gaussian_process_ml.txt',
       'neuralnetworksanddeeplearning.txt','SuttonBartoIPRLBook2ndEd.txt','speech_recog.txt','ml interview.txt']
lst

['thinkpython.txt',
 'thinkbayes.txt',
 'Applied Data Science.txt',
 'The Data Science Design Manual.txt',
 'Ethics and Data Science.txt',
 'Foundation_of_datascience.txt',
 'GangBoard.txt',
 'gaussian_process_ml.txt',
 'neuralnetworksanddeeplearning.txt',
 'SuttonBartoIPRLBook2ndEd.txt',
 'speech_recog.txt',
 'ml interview.txt']

In [None]:
location = '/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 8/Q A System Building/raw_data/'
frames = []
for filename in lst:
  # Read one document and keep its lines
  with open(location + filename, "r") as f:
    lines = f.readlines()
  # Concatenate the lines into one string, adding a space after each line
  text = "".join(line + " " for line in lines)
  # Store each document as a single-row DataFrame
  frames.append(pd.DataFrame([text], columns=['Document_Text']))

In [None]:
# Stack the single-row frames and renumber the rows from 0
df = pd.concat(frames, ignore_index=True)
df

| | Document_Text |
|---|---|
| 0 | Think Python\n How to Think Like a Computer Sc... |
| 1 | Think Bayes\n Bayesian Statistics Made Simple\... |
| 2 | Applied Data Science\n \n Ian Langmore\n \n Da... |
| 3 | TEXTS IN COMPUTER SCIENCE\n \n THE\n \n Data S... |
| 4 | Ethics and\n Data Science\n \n Mike Loukides,\... |
| 5 | Foundations of Data Science∗\n Avrim Blum, Joh... |
| 6 | Document\n "41 Essential Machine\n Learning I... |
| 7 | C. E. Rasmussen & C. K. I. Williams, Gaussian ... |
| 8 | Neural Networks and Deep Learning\n Michael Ni... |
| 9 | i\n \n... |


### **Removing \n from our Data**

In [None]:
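def text_process1(msg):
    # Remove newline characters
    msg = msg.replace("\n", '')
    # Remove double spaces; they are deleted outright rather than collapsed
    # to one space, so words separated only by blank lines end up joined
    msg = msg.replace("  ", '')
    return msg

df['Document_Text'] = df['Document_Text'].apply(text_process1)
df.head(5)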

| | Document_Text |
|---|---|
| 0 | Think Python How to Think Like a Computer Scie... |
| 1 | Think Bayes Bayesian Statistics Made SimpleVer... |
| 2 | Applied Data ScienceIan LangmoreDaniel Krasner... |
| 3 | TEXTS IN COMPUTER SCIENCETHEData Science Desig... |
| 4 | Ethics and Data ScienceMike Loukides, Hilary M... |
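
The output above shows fused words such as `ScienceIan LangmoreDaniel`: title lines separated by blank lines lose the whitespace between them. A possible alternative, sketched here under the assumption that a single space between words is the desired result, is to collapse every whitespace run into one space with a regular expression. The helper name `normalize_whitespace` and the sample string are illustrative only.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Sample shaped like the raw text: lines joined as "line\n" followed by a space
raw = "Applied Data Science\n \n Ian Langmore\n \n Daniel Krasner\n "
cleaned = normalize_whitespace(raw)
print(cleaned)
```

Switching to this approach would change the `Text_length` values and downstream tokens, so the outputs shown in this notebook correspond to the original cleaning function, not to this variant.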


### **Length Calculation**

In [None]:
df['Text_length'] = df['Document_Text'].apply(len)
df.head(5)

| | Document_Text | Text_length |
|---|---|---|
| 0 | Think Python How to Think Like a Computer Scie... | 447922 |
| 1 | Think Bayes Bayesian Statistics Made SimpleVer... | 292998 |
| 2 | Applied Data ScienceIan LangmoreDaniel Krasner... | 200163 |
| 3 | TEXTS IN COMPUTER SCIENCETHEData Science Desig... | 1061631 |
| 4 | Ethics and Data ScienceMike Loukides, Hilary M... | 67522 |


### **Exporting DataFrame as CSV**

In [None]:
df.to_csv('/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 8/Q A System Building/con_csv/ML_document.csv',index=False)

# **Part-2 - Importing the CSV and Starting Work**

In [None]:
dataset = pd.read_csv('/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 8/Q A System Building/con_csv/ML_document.csv')
dataset.head()

| | Document_Text | Text_length |
|---|---|---|
| 0 | Think Python How to Think Like a Computer Scie... | 447922 |
| 1 | Think Bayes Bayesian Statistics Made SimpleVer... | 292998 |
| 2 | Applied Data ScienceIan LangmoreDaniel Krasner... | 200163 |
| 3 | TEXTS IN COMPUTER SCIENCETHEData Science Desig... | 1061631 |
| 4 | Ethics and Data ScienceMike Loukides, Hilary M... | 67522 |


In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Document_Text  12 non-null     object
 1   Text_length    12 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 320.0+ bytes


## **Function That Removes Punctuation and Stopwords from Our Data**

In [None]:
# Imports for preprocessing data
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
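def text_process(msg):
    # Load the stopword list once per call instead of once per word
    stop_words = set(stopwords.words('english'))
    # Drop every punctuation character
    nopunc = ''.join(char for char in msg if char not in string.punctuation)
    # Keep only the words that are not English stopwords (case-insensitive)
    return ' '.join(word for word in nopunc.split() if word.lower() not in stop_words)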

In [None]:
dataset['Final_text'] = dataset['Document_Text'].apply(text_process)

In [None]:
dataset['Final_text'][0]



In [None]:
dataset.to_csv('/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 8/Q A System Building/con_csv/ML_document_text_pre.csv',index=False)

# **Removing Unwanted Characters**

In [None]:
dataset = pd.read_csv('/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 8/Q A System Building/con_csv/ML_document_text_pre.csv')
dataset.head()

| | Document_Text | Text_length | Final_text |
|---|---|---|---|
| 0 | Think Python How to Think Like a Computer Scie... | 447922 | Think Python Think Like Computer ScientistVers... |
| 1 | Think Bayes Bayesian Statistics Made SimpleVer... | 292998 | Think Bayes Bayesian Statistics Made SimpleVer... |
| 2 | Applied Data ScienceIan LangmoreDaniel Krasner... | 200163 | Applied Data ScienceIan LangmoreDaniel Krasner... |
| 3 | TEXTS IN COMPUTER SCIENCETHEData Science Desig... | 1061631 | TEXTS COMPUTER SCIENCETHEData Science Design M... |
| 4 | Ethics and Data ScienceMike Loukides, Hilary M... | 67522 | Ethics Data ScienceMike Loukides Hilary Mason ... |


In [None]:
# Sequences to strip: curly apostrophe, curly quotes, spaced dash, bullet
lst = ['’', '“','”',' — ','•']
def text_process1(msg):
    # Remove each unwanted sequence in turn
    for unwanted in lst:
        msg = msg.replace(unwanted, '')
    return msg

# Apply the cleanup to the preprocessed text column
dataset['Final_text'] = dataset['Final_text'].apply(text_process1)
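
The cell above removes each unwanted sequence with a separate `str.replace` pass over documents that are hundreds of thousands of characters long. As an optional alternative, the same sequences can be removed in a single pass with one compiled regular expression. This is only a sketch: the names `UNWANTED`, `UNWANTED_PATTERN`, and `strip_unwanted` are hypothetical, and the sample string is made up.

```python
import re

# Same sequences as in the notebook cell
UNWANTED = ['’', '“', '”', ' — ', '•']
# re.escape guards against entries that contain regex metacharacters
UNWANTED_PATTERN = re.compile("|".join(re.escape(s) for s in UNWANTED))


def strip_unwanted(text):
    """Delete every occurrence of the unwanted sequences in one pass."""
    return UNWANTED_PATTERN.sub("", text)


cleaned = strip_unwanted("“Bayes’ rule” • prior")
print(repr(cleaned))
```

As with the original function, the sequences are deleted rather than replaced by a space, so text on either side of a removed sequence is joined directly.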

## **Lemmatizer**

In [None]:
import nltk
nltk.download('wordnet') 

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
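def word_lemmatizer(text):
    # WordNetLemmatizer.lemmatize expects a single word; given a whole
    # document it returns the text unchanged, so lemmatize word by word
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

dataset['Final_text'] = dataset['Final_text'].apply(word_lemmatizer)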

In [None]:
dataset['Final_text'][0]



### **Word Tokenization**

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
def token_text(text):
  # Split the text into word and punctuation tokens with NLTK's tokenizer
  return nltk.word_tokenize(text)

dataset['Final_text'] = dataset['Final_text'].apply(token_text)

In [None]:
dataset.head()

| | Document_Text | Text_length | Final_text |
|---|---|---|---|
| 0 | Think Python How to Think Like a Computer Scie... | 447922 | [Think, Python, Think, Like, Computer, Scienti... |
| 1 | Think Bayes Bayesian Statistics Made SimpleVer... | 292998 | [Think, Bayes, Bayesian, Statistics, Made, Sim... |
| 2 | Applied Data ScienceIan LangmoreDaniel Krasner... | 200163 | [Applied, Data, ScienceIan, LangmoreDaniel, Kr... |
| 3 | TEXTS IN COMPUTER SCIENCETHEData Science Desig... | 1061631 | [TEXTS, COMPUTER, SCIENCETHEData, Science, Des... |
| 4 | Ethics and Data ScienceMike Loukides, Hilary M... | 67522 | [Ethics, Data, ScienceMike, Loukides, Hilary, ... |
