<a href="https://colab.research.google.com/github/NadiaHolmlund/M6_Group_Assignments/blob/main/Group_Assignment_1/NHN_Copy_of_Group_Assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task

Develop a Proof-of-Concept version of an application that queries a database to provide an output to the user.

This can be for example:
- Selecting observations from database, performing prediction with a (beforehand fitted) SML model.
- Perform a UML procedure on observations queried from a database.
- Perform a semantic/similarity search for a user input, retrieve most similar docs from a database.

The data used should be non-trivial (e.g. enough observations, maybe multiple tables, different types of data…).

The solution has to be self-contained. This can be done:
 - Within a Colab notebook, for example using Gradio. (Hint: An option is to save the database on GitHub, and then load it in the Colab.)
 - As a Streamlit app (figure out how to make it self-contained).
 - … (sky is the limit.)

Possible databases:
- SQL DB (e.g. SQLite)
- NoSQL DB
 - Document (eg. tinyDB)
 - Vector (Eg. Faiss, Chroma)

# Solution

In the following, we have created a SQLite database containing the 2,000 most cited documents on Scopus within the topic of Natural Language Processing.

Subsequently, a summarization pipeline from HuggingFace has been applied to generate very brief summaries of the document abstracts, so that users can quickly get an overview of the contents of each document.

In this part, we will learn:
> 1. How to load a CSV file into a SQLite database
> 2. How to run the four main SQL commands
> 3. How to use this database for machine learning projects

We will be using Python and the SQLite module to perform this task. Finally, we will build a small Gradio application in Google Colab on top of the database.



In [1]:
# Read the CSV file into a Pandas DataFrame
import pandas as pd

df_csv = pd.read_csv('https://raw.githubusercontent.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_1/Scopus_NLP.csv')

In [2]:
df_csv.head()

Unnamed: 0,Authors,Author(s) ID,Title,Year,Source title,Volume,Issue,Art. No.,Page start,Page end,...,ISBN,CODEN,PubMed ID,Language of Original Document,Abbreviated Source Title,Document Type,Publication Stage,Open Access,Source,EID
0,"Pennington J., Socher R., Manning C.D.",22953926600;24766896100;35280197500;,GloVe: Global vectors for word representation,2014,EMNLP 2014 - 2014 Conference on Empirical Meth...,,,,1532,1543,...,9781937284961.0,,,English,EMNLP - Conf. Empir. Methods Nat. Lang. Proces...,Conference Paper,Final,,Scopus,2-s2.0-84961289992
1,"Devlin J., Chang M.-W., Lee K., Toutanova K.",54879967400;25925685700;56349980800;6506107920;,BERT: Pre-training of deep bidirectional trans...,2019,NAACL HLT 2019 - 2019 Conference of the North ...,1.0,,,4171,4186,...,9781950737130.0,,,English,NAACL HLT - Conf. N. Am. Chapter Assoc. Comput...,Conference Paper,Final,,Scopus,2-s2.0-85083815650
2,"Cho K., Van Merriënboer B., Gulcehre C., Bahda...",55722769200;57188495900;56006846900;5718843470...,Learning phrase representations using RNN enco...,2014,EMNLP 2014 - 2014 Conference on Empirical Meth...,,,,1724,1734,...,9781937284961.0,,,English,EMNLP - Conf. Empir. Methods Nat. Lang. Proces...,Conference Paper,Final,"All Open Access, Green",Scopus,2-s2.0-84961291190
3,"Pang B., Lee L., Vaithyanathan S.",8644537200;7404389769;6603253116;,Thumbs up? Sentiment Classification using Mach...,2002,Proceedings of the 2002 Conference on Empirica...,,,,79,86,...,,,,English,Proc. Conf. Empir. Methods Nat. Lang. Process....,Conference Paper,Final,,Scopus,2-s2.0-85141803251
4,"Collobert R., Weston J., Bottou L., Karlen M.,...",14064641400;8865128200;6701721644;25651854400;...,Natural language processing (almost) from scratch,2011,Journal of Machine Learning Research,12.0,,,2493,2537,...,,,,English,J. Mach. Learn. Res.,Article,Final,,Scopus,2-s2.0-80053558787


In [3]:
df_csv.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)

In [4]:
# Replace special characters in the Title column
df_csv['Title'] = df_csv['Title'].replace(r'[:()?\-]', '', regex=True)
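
To see what this replacement does to individual titles, here is a minimal sketch using Python's `re` module on three of the titles shown in the table above. The helper name `clean_title` is hypothetical and only used for illustration. Note that removing hyphens also joins words, so "Pre-training" becomes "Pretraining".

```python
import re

# Character class: colon, parentheses, question mark, hyphen
SPECIAL_CHARS = re.compile(r'[:()?\-]')

def clean_title(title):
    """Remove the special characters from a single title (hypothetical helper)."""
    return SPECIAL_CHARS.sub('', title)

titles = [
    'GloVe: Global vectors for word representation',
    'Thumbs up? Sentiment Classification using Machine Learning Techniques',
    'Natural language processing (almost) from scratch',
]
cleaned = [clean_title(t) for t in titles]
for t in cleaned:
    print(t)
```

Cleaning the titles up front keeps them simple to match later, when a title is used as a lookup key in a SQL query.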

In [5]:
df_csv.head()

Unnamed: 0,Authors,Author(s)_ID,Title,Year,Source_title,Volume,Issue,Art._No.,Page_start,Page_end,...,ISBN,CODEN,PubMed_ID,Language_of_Original_Document,Abbreviated_Source_Title,Document_Type,Publication_Stage,Open_Access,Source,EID
0,"Pennington J., Socher R., Manning C.D.",22953926600;24766896100;35280197500;,GloVe Global vectors for word representation,2014,EMNLP 2014 - 2014 Conference on Empirical Meth...,,,,1532,1543,...,9781937284961.0,,,English,EMNLP - Conf. Empir. Methods Nat. Lang. Proces...,Conference Paper,Final,,Scopus,2-s2.0-84961289992
1,"Devlin J., Chang M.-W., Lee K., Toutanova K.",54879967400;25925685700;56349980800;6506107920;,BERT Pretraining of deep bidirectional transfo...,2019,NAACL HLT 2019 - 2019 Conference of the North ...,1.0,,,4171,4186,...,9781950737130.0,,,English,NAACL HLT - Conf. N. Am. Chapter Assoc. Comput...,Conference Paper,Final,,Scopus,2-s2.0-85083815650
2,"Cho K., Van Merriënboer B., Gulcehre C., Bahda...",55722769200;57188495900;56006846900;5718843470...,Learning phrase representations using RNN enco...,2014,EMNLP 2014 - 2014 Conference on Empirical Meth...,,,,1724,1734,...,9781937284961.0,,,English,EMNLP - Conf. Empir. Methods Nat. Lang. Proces...,Conference Paper,Final,"All Open Access, Green",Scopus,2-s2.0-84961291190
3,"Pang B., Lee L., Vaithyanathan S.",8644537200;7404389769;6603253116;,Thumbs up Sentiment Classification using Machi...,2002,Proceedings of the 2002 Conference on Empirica...,,,,79,86,...,,,,English,Proc. Conf. Empir. Methods Nat. Lang. Process....,Conference Paper,Final,,Scopus,2-s2.0-85141803251
4,"Collobert R., Weston J., Bottou L., Karlen M.,...",14064641400;8865128200;6701721644;25651854400;...,Natural language processing almost from scratch,2011,Journal of Machine Learning Research,12.0,,,2493,2537,...,,,,English,J. Mach. Learn. Res.,Article,Final,,Scopus,2-s2.0-80053558787


### Step 1: Creating a SQLite database
In a new cell, create a new SQLite database and table to store the CSV data:




In [6]:
# Importing the necessary libraries
import sqlite3
import pandas as pd

# Create a connection to the database
conn = sqlite3.connect('example.db')

# Add an empty column that will hold the generated summaries
df_csv['summary'] = ''

### Step 2: Loading the CSV file into the SQLite table
In a new cell, load the CSV file into the SQLite table:



In [7]:
# Load the DataFrame into the SQLite table
df_csv.to_sql('data', conn, if_exists='append', index=False)

2000
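
For comparison, the same loading step can be sketched with the standard library only. This is not the pandas approach used in this notebook but a plain `csv` + `sqlite3` alternative, assuming a tiny inline CSV with two illustrative columns (`Title`, `Year`) and an in-memory database. The function name `load_csv` is hypothetical. The table is dropped and recreated on each call, so running the load twice does not duplicate rows.

```python
import csv
import io
import sqlite3

# Tiny inline CSV standing in for Scopus_NLP.csv (illustrative rows only)
csv_text = """Title,Year
GloVe Global vectors for word representation,2014
Natural language processing almost from scratch,2011
"""

conn_demo = sqlite3.connect(':memory:')  # in-memory database, nothing is written to disk

def load_csv(conn, text):
    """(Re)create the 'data' table and fill it from CSV text; returns the row count."""
    rows = [(r['Title'], int(r['Year'])) for r in csv.DictReader(io.StringIO(text))]
    conn.execute('DROP TABLE IF EXISTS data')  # start fresh so re-running does not duplicate rows
    conn.execute("CREATE TABLE data (Title TEXT, Year INTEGER, summary TEXT DEFAULT '')")
    conn.executemany('INSERT INTO data (Title, Year) VALUES (?, ?)', rows)
    conn.commit()
    return conn.execute('SELECT COUNT(*) FROM data').fetchone()[0]

first = load_csv(conn_demo, csv_text)
second = load_csv(conn_demo, csv_text)  # loading again still leaves two rows
print(first, second)
```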

### Step 3: Running SQL commands
The four main SQL commands are listed below; this notebook uses SELECT here and UPDATE later on:
> - SELECT
> - INSERT
> - UPDATE
> - DELETE
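
The four commands can be sketched on a small throwaway table. This is a minimal example using an in-memory database; the table name `papers` and its rows are made up for illustration and are not part of the Scopus data.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE papers (Title TEXT, Year INTEGER)')

# INSERT: add rows (placeholders keep the values separate from the SQL text)
db.executemany('INSERT INTO papers (Title, Year) VALUES (?, ?)',
               [('Paper A', 2014), ('Paper B', 2019), ('Paper C', 2002)])

# UPDATE: change a value in the rows matching a condition
db.execute('UPDATE papers SET Year = ? WHERE Title = ?', (2015, 'Paper A'))

# DELETE: remove the rows matching a condition
db.execute('DELETE FROM papers WHERE Year < ?', (2010,))
db.commit()

# SELECT: read the remaining rows
remaining = db.execute('SELECT Title, Year FROM papers ORDER BY Year').fetchall()
print(remaining)
```

Using `?` placeholders instead of string formatting lets SQLite handle quoting, which matters once values such as titles contain apostrophes.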



In [8]:
# Select the first five records from the 'data' table
select_query = "SELECT * FROM data LIMIT 5;"
cursor = conn.execute(select_query)
rows = cursor.fetchall()

# Print the records
for row in rows:
    print(row)

('Pennington J., Socher R., Manning C.D.', '22953926600;24766896100;35280197500;', 'GloVe Global vectors for word representation', 2014, 'EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference', None, None, None, '1532', '1543', None, 19507, '10.3115/v1/d14-1162', 'https://www.scopus.com/inward/record.uri?eid=2-s2.0-84961289992&doi=10.3115%2fv1%2fd14-1162&partnerID=40&md5=53f2b22fdb7676d7ea744a3676c76cc8', 'Computer Science Department, Stanford University, Stanford, CA  94305, United States', 'Pennington, J., Computer Science Department, Stanford University, Stanford, CA  94305, United States; Socher, R., Computer Science Department, Stanford University, Stanford, CA  94305, United States; Manning, C.D., Computer Science Department, Stanford University, Stanford, CA  94305, United States', 'Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularitie

# Hands-on project: using SQLite in ML

Here we use the transformers library in Python to summarize the abstracts stored in our SQLite database.

In [9]:
!pip install transformers --q


### Step 1: Load the summarization model

In [10]:
from transformers import pipeline

# Load the pre-trained text-summarization model
summarizer = pipeline('summarization', model="t5-base", tokenizer="t5-base", framework="tf")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


### Step 2: Extract the abstracts
Next, we extract the abstracts from our SQLite database so that they can be summarized with the summarizer pipeline.

In [11]:
# Fetch the first 10 abstracts together with their rowid,
# so that each summary can be written back to the correct row
abstracts = conn.execute('SELECT rowid, Abstract FROM data LIMIT 10').fetchall()

Once we have extracted the abstracts, we iterate over them with a for loop, summarize each one with the summarizer pipeline, and update the summary column of the data table.

In [12]:
# Iterate over the abstracts and update the summary for each one
for rowid, abstract in abstracts:
    # Summarize the abstract using the pre-trained summarizer
    summary = summarizer(abstract, max_length=30, min_length=0, do_sample=False)[0]['summary_text']

    # Update the 'summary' column in the 'data' table for the current row
    conn.execute('UPDATE data SET summary = ? WHERE rowid = ?', (summary, rowid))

# Commit the changes to the database
conn.commit()
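
Writing each summary back by the row's own `rowid`, rather than by its position in the loop, stays correct even when rowids are not contiguous. The sketch below illustrates this on an in-memory table. The function `fake_summarizer` is a hypothetical stand-in for the T5 pipeline (it simply keeps the first sentence) so that the example runs without downloading a model.

```python
import sqlite3

def fake_summarizer(text):
    """Stand-in for the T5 pipeline: keeps only the first sentence (illustrative only)."""
    return text.split('.')[0].strip()

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE data (Abstract TEXT, summary TEXT DEFAULT '')")
db.executemany('INSERT INTO data (Abstract) VALUES (?)',
               [('First paper. More text.',),
                ('Second paper. More text.',),
                ('Third paper. More text.',)])
db.execute('DELETE FROM data WHERE rowid = 1')  # remaining rowids are now 2 and 3

# Fetch the rowid together with the text, then write each summary back to that exact row
rows = db.execute('SELECT rowid, Abstract FROM data').fetchall()
for rowid, abstract in rows:
    db.execute('UPDATE data SET summary = ? WHERE rowid = ?',
               (fake_summarizer(abstract), rowid))
db.commit()

result = db.execute('SELECT rowid, summary FROM data ORDER BY rowid').fetchall()
print(result)
```

Had the loop used `enumerate` and `i+1` as the key here, the first summary would have targeted a row that no longer exists and the second would have landed on the wrong row.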



In [13]:
pd.set_option('max_colwidth', 1000)
pd.describe_option('max_colwidth')

display.max_colwidth : int or None
    The maximum width in characters of a column in the repr of
    a pandas data structure. When the column overflows, a "..."
    placeholder is embedded in the output. A 'None' value means unlimited.
    [default: 50] [currently: 1000]


In [14]:
# Define the SQL query
query = 'SELECT * FROM data LIMIT 10'

# Execute the query and convert the result to a DataFrame
df_q = pd.read_sql_query(query, conn)

In [15]:
df_q['Abstract'][0]

'Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition. © 2014 Association for Co

In [16]:
df_q['summary'][0]

'a new global logbilinear regression model is developed . it combines the advantages of global matrix factorization and local context window methods'
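
The same kind of lookup by column name is also possible without pandas. As a minimal sketch, assuming an in-memory table with made-up contents, setting `row_factory` to `sqlite3.Row` makes each fetched row addressable by column name:

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.row_factory = sqlite3.Row  # rows now support access by column name
db.execute('CREATE TABLE data (Title TEXT, Abstract TEXT, summary TEXT)')
db.execute('INSERT INTO data VALUES (?, ?, ?)',
           ('Paper A', 'A long abstract about paper A.', 'short summary'))

row = db.execute('SELECT * FROM data LIMIT 1').fetchone()
pair = (row['Abstract'], row['summary'])
print(row.keys())
print(pair)
```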

# Gradio

In [17]:
!pip install gradio --q
!pip install transformers --q


In [18]:
import sqlite3
import gradio as gr
from transformers import pipeline

## Generate a summary of titles available in our Scopus Database

## TEST

In [19]:
# Define the dropdown options
c = conn.cursor()
c.execute("SELECT DISTINCT Title FROM data")
dropdown_options = [row[0] for row in c.fetchall()]

In [20]:
dropdown_options

['GloVe Global vectors for word representation',
 'BERT Pretraining of deep bidirectional transformers for language understanding',
 'Learning phrase representations using RNN encoderdecoder for statistical machine translation',
 'Thumbs up Sentiment Classification using Machine Learning Techniques',
 'Natural language processing almost from scratch',
 'The stanford CoreNLP natural language processing toolkit',
 'FCM The fuzzy cmeans clustering algorithm',
 'Recursive deep models for semantic compositionality over a sentiment treebank',
 'Moses Open source toolkit for statistical machine translation',
 'A unified architecture for natural language processing Deep neural networks with multitask learning',
 'Survey over image thresholding techniques and quantitative performance evaluation',
 'Show and tell A neural image caption generator',
 'SMILES, a Chemical Language and Information System 1 Introduction to Methodology and Encoding Rules',
 'Longterm recurrent convolutional networks fo

In [21]:
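def summary(selected_option):
    # Open a connection per call: Gradio may run this function in a worker thread,
    # and a sqlite3 connection can by default only be used in the thread that created it
    db = sqlite3.connect('example.db')
    try:
        row = db.execute("SELECT Abstract FROM data WHERE Title = ?", (selected_option,)).fetchone()
    finally:
        db.close()
    result = summarizer(row[0], max_length=30, min_length=0, do_sample=False)
    return result[0]['summary_text']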

In [22]:
dropdown_options[0]

'GloVe Global vectors for word representation'

In [23]:
summary(dropdown_options[0])

'a new global logbilinear regression model is developed . it combines the advantages of global matrix factorization and local context window methods'

In [45]:
# Create the Gradio interface
demo = gr.Interface(
    fn=summary,
    inputs=gr.Dropdown(choices=dropdown_options, type="value", label="Select a Title from Scopus"),
    outputs=gr.Textbox(label="Generated Summary"),
)

# Launch the interface
demo.launch()




Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.





In [40]:
import gradio as gr
from transformers import pipeline
import sqlite3

summarizer = pipeline('summarization', model="t5-base", tokenizer="t5-base", framework="tf")

# Connect to the SQLite database
conn = sqlite3.connect('example.db')

# Define the dropdown options
c = conn.cursor()
c.execute("SELECT DISTINCT Title FROM data")
dropdown_options = [row[0] for row in c.fetchall()]

def summary(selected_option):
    # Open a connection per call: Gradio may run this function in a worker thread,
    # and a sqlite3 connection can by default only be used in the thread that created it
    db = sqlite3.connect('example.db')
    try:
        row = db.execute("SELECT Abstract FROM data WHERE Title = ?", (selected_option,)).fetchone()
    finally:
        db.close()
    result = summarizer(row[0], max_length=30, min_length=0, do_sample=False)
    return result[0]['summary_text']

# Create the Gradio interface
demo = gr.Interface(
    fn=summary,
    inputs=gr.Dropdown(choices=dropdown_options, label="Select a Title from Scopus"),
    outputs=gr.Textbox(label="Generated Summary"),
)

# Launch the interface
demo.launch()


## Generate a summary of your own abstract (not available in our Scopus database)

In [25]:
import gradio as gr
from transformers import pipeline

summarizer = pipeline('summarization', model="t5-base", tokenizer="t5-base", framework="tf")

def summary(abstract):
    summary = summarizer(abstract, max_length=30, min_length=0, do_sample=False)
    return summary[0]['summary_text']

examples = [
    ["Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition. © 2014 Association for Computational Linguistics."],
    ["We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement). © 2019 Association for Computational Linguistics"],
    ["In this paper, we propose a novel neural network model called RNN Encoder- Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixedlength vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases. © 2014 Association for Computational Linguistics."],
]

demo = gr.Interface(
    fn=summary,
    inputs=gr.Textbox(lines=5, label="Input Abstract"),
    outputs=gr.Textbox(label="Generated Summary"),
    examples=examples
)

demo.launch()


### Step 4: Clean up

In a new cell, close the database connection:

In [None]:
conn.close()
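
As an alternative to calling `close()` by hand, the connection can be wrapped in `contextlib.closing`, which closes it even if an error occurs inside the block. A minimal sketch on an in-memory database with a made-up table `t`:

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(':memory:')) as db:
    db.execute('CREATE TABLE t (x INTEGER)')
    db.execute('INSERT INTO t VALUES (1)')
    total = db.execute('SELECT COUNT(*) FROM t').fetchone()[0]

# After the block the connection is closed; further use raises ProgrammingError
try:
    db.execute('SELECT 1')
    closed = False
except sqlite3.ProgrammingError:
    closed = True
print(total, closed)
```

Note that using a `sqlite3` connection directly in a `with` statement only commits or rolls back the transaction; it does not close the connection, which is why `closing` is used here.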